qemu-devel.nongnu.org archive mirror
* [PATCH V3 00/16] Live update: cpr-transfer
@ 2024-11-01 13:47 Steve Sistare
  2024-11-01 13:47 ` [PATCH V3 01/16] machine: anon-alloc option Steve Sistare
                   ` (15 more replies)
  0 siblings, 16 replies; 86+ messages in thread
From: Steve Sistare @ 2024-11-01 13:47 UTC
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

What?

This patch series adds the live migration cpr-transfer mode, which
allows the user to transfer a guest to a new QEMU instance on the same
host with minimal guest pause time, by preserving guest RAM in place,
albeit with new virtual addresses in new QEMU, and by preserving device
file descriptors.

The new user-visible interfaces are:
  * cpr-transfer (MigMode migration parameter)
  * cpr-uri (migration parameter)
  * cpr-uri (command-line argument)

The user sets the mode parameter before invoking the migrate command.
In this mode, the user starts new QEMU on the same host as old QEMU, with
the same arguments as old QEMU, plus the -incoming and the -cpr-uri options.
The user issues the migrate command to old QEMU, which stops the VM, saves
state to the migration channels, and enters the postmigrate state.  Execution
resumes in new QEMU.

Memory-backend objects must have the share=on attribute.  The
memory-backend-epc and memory-backend-ram types are not supported.  The VM
must be started with the '-machine anon-alloc=memfd' option, which allows
anonymous memory to be transferred in place to the new process.

This mode requires a second migration channel, specified by the cpr-uri
migration property on the outgoing side, and by the cpr-uri QEMU command-line
option on the incoming side.  The channel must be of a type that supports
SCM_RIGHTS, such as a unix domain socket.

Why?

This mode has less impact on the guest than any other method of updating
in place.  The pause time is much lower, because devices need not be torn
down and recreated, DMA does not need to be drained and quiesced, and minimal
state is copied to new QEMU.  Further, there are no constraints on the guest.
By contrast, cpr-reboot mode requires the guest to support S3 suspend-to-ram,
and suspending plus resuming vfio devices adds multiple seconds to the
guest pause time.

These benefits all derive from the core design principle of this mode,
which is preserving open descriptors.  This approach is very general and
can be used to support a wide variety of devices that do not have hardware
support for live migration, including but not limited to: vfio, chardev,
vhost, vdpa, and iommufd.  Some devices need new kernel software interfaces
to allow a descriptor to be used in a process that did not originally open it.

How?

All memory that is mapped by the guest is preserved in place.  Indeed,
it must be, because it may be the target of DMA requests, which are not
quiesced during cpr-transfer.  All such memory must be mmap'able in new QEMU.
This is easy for named memory-backend objects, as long as they are mapped
shared, because they are visible in the file system in both old and new QEMU.
Anonymous memory must be allocated using memfd_create rather than MAP_ANON,
so the memfd's can be sent to new QEMU.  Pages that were locked in memory
for DMA in old QEMU remain locked in new QEMU, because the descriptor of
the device that locked them remains open.
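
As a minimal sketch of why memfd_create makes this possible (not code from
this series), the following maps one memfd twice within a single process.
The two mappings stand in for old and new QEMU: they have different virtual
addresses but share the same pages.  The function name memfd_shared_demo and
the label "ram-demo" are made up for illustration.

```c
#define _GNU_SOURCE             /* for memfd_create(); keep before includes */
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one memfd twice, as old and new QEMU would in separate processes.
 * Returns 0 if data written through the first mapping is visible through
 * the second mapping at a different address, -1 on any failure.
 */
static int memfd_shared_demo(void)
{
    const size_t len = 4096;
    const char msg[] = "guest data";
    char *old_map, *new_map;
    int fd, ret = -1;

    fd = memfd_create("ram-demo", MFD_CLOEXEC);   /* name is only a label */
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, len) < 0) {                 /* give the memfd a size */
        close(fd);
        return -1;
    }

    /* MAP_SHARED makes stores land in the memfd's pages, not a private copy */
    old_map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (old_map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memcpy(old_map, msg, sizeof(msg));

    /* A second mapping gets a new virtual address but the same pages */
    new_map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (new_map != MAP_FAILED) {
        if (new_map != old_map && memcmp(new_map, msg, sizeof(msg)) == 0) {
            ret = 0;
        }
        munmap(new_map, len);
    }

    munmap(old_map, len);
    close(fd);
    return ret;
}
```

Memory obtained with MAP_ANON has no descriptor that could be handed to
another process, which is why this mode needs the memfd path.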

cpr-transfer preserves descriptors by sending them to new QEMU via the
cpr-uri, which must support SCM_RIGHTS, and by sending the unique name
and value of each descriptor to new QEMU via CPR state.
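
For readers unfamiliar with SCM_RIGHTS, here is a small standalone sketch
of passing one descriptor over a unix socket with sendmsg and recvmsg.  It
is not the QEMUFile code added by this series; send_fd, recv_fd and
scm_rights_demo are illustrative names, and a socketpair stands in for the
cpr-uri channel.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send fd as SCM_RIGHTS ancillary data, with one byte of normal payload */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor; the kernel installs a new fd in this process */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1 || (msg.msg_flags & MSG_CTRUNC)) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/* Pass a pipe's read end through a socketpair and read via the copy */
static int scm_rights_demo(void)
{
    int sv[2] = { -1, -1 }, p[2] = { -1, -1 };
    int received = -1, ret = -1;
    char c = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0 || pipe(p) < 0) {
        goto out;
    }
    if (write(p[1], "x", 1) != 1 || send_fd(sv[0], p[0]) < 0) {
        goto out;
    }
    received = recv_fd(sv[1]);
    if (received >= 0 && read(received, &c, 1) == 1 && c == 'x') {
        ret = 0;
    }
out:
    if (received >= 0) close(received);
    if (p[0] >= 0) close(p[0]);
    if (p[1] >= 0) close(p[1]);
    if (sv[0] >= 0) close(sv[0]);
    if (sv[1] >= 0) close(sv[1]);
    return ret;
}
```

The received descriptor usually has a different numeric value than the one
sent, so the receiver needs something besides the number to identify it;
in this series that is the name and id recorded in CPR state.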

For device descriptors, new QEMU reuses the descriptor when creating the
device, rather than opening it again.  For memfd descriptors, new QEMU
mmap's the preserved memfd when a ramblock is created.

CPR state cannot be sent over the normal migration channel, because devices
and backends are created prior to reading the channel, so this mode sends
CPR state over a second migration channel, specified by cpr-uri.  New QEMU
reads the second channel prior to creating devices or backends.
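
The CPR stream begins with a magic number and a version, written as
big-endian 32-bit values ahead of the vmstate payload (see patch 02).  The
sketch below shows that framing on a plain byte buffer.  The two constants
are taken from the series; put_be32, get_be32, cpr_header_write and
cpr_header_check are hypothetical helpers, not QEMU functions.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define QEMU_CPR_FILE_MAGIC     0x51435052      /* ASCII "QCPR" */
#define QEMU_CPR_FILE_VERSION   0x00000001

/* Store a 32-bit value most significant byte first */
static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = v >> 24;
    p[1] = v >> 16;
    p[2] = v >> 8;
    p[3] = v;
}

/* Load a 32-bit value stored most significant byte first */
static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Write the 8-byte header into buf; returns the number of bytes written */
static size_t cpr_header_write(uint8_t *buf)
{
    put_be32(buf, QEMU_CPR_FILE_MAGIC);
    put_be32(buf + 4, QEMU_CPR_FILE_VERSION);
    return 8;
}

/*
 * Validate a header: -EINVAL for a short buffer or bad magic,
 * -ENOTSUP for an unknown version, 0 if the payload may be parsed.
 */
static int cpr_header_check(const uint8_t *buf, size_t len)
{
    if (len < 8 || get_be32(buf) != QEMU_CPR_FILE_MAGIC) {
        return -EINVAL;
    }
    if (get_be32(buf + 4) != QEMU_CPR_FILE_VERSION) {
        return -ENOTSUP;
    }
    return 0;
}
```

Checking the header first lets new QEMU reject a stream from something
other than a cpr-transfer peer before it creates any devices or backends.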

Example:

In this example, we simply restart the same version of QEMU, but in
a real scenario one would use a new QEMU binary path in terminal 2.

  Terminal 1: start old QEMU
  # qemu-kvm -monitor stdio \
      -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on \
      -m 4G -machine anon-alloc=memfd ...

  Terminal 2: start new QEMU
  # qemu-kvm ... -incoming unix:vm.sock -cpr-uri unix:cpr.sock

  Terminal 1:
  QEMU 9.1.50 monitor - type 'help' for more information
  (qemu) info status
  VM status: running
  (qemu) migrate_set_parameter mode cpr-transfer
  (qemu) migrate_set_parameter cpr-uri unix:cpr.sock
  (qemu) migrate -d unix:vm.sock
  (qemu) info status
  VM status: paused (postmigrate)

  Terminal 2:
  QEMU 9.1.50 monitor - type 'help' for more information
  (qemu) info status
  VM status: running

This patch series implements a minimal version of cpr-transfer.  Additional
series are ready to be posted to deliver the complete vision described
above, including
  * vfio
  * chardev
  * vhost and tap
  * blockers
  * cpr-exec mode
  * iommufd

Changes in V2:
  * cpr-transfer is the first new mode proposed, and cpr-exec is deferred
  * anon-alloc does not apply to memory-backend-object
  * replaced hack with proper synchronization between source and target
  * defined QEMU_CPR_FILE_MAGIC
  * addressed misc review comments

Changes in V3:
  * added cpr-transfer to migration-test
  * documented cpr-transfer in CPR.rst
  * fixed size_t trace format for 32-bit build
  * dropped explicit fd value in VMSTATE_FD
  * deferred cpr_walk_fd() and cpr_resave_fd() to later series
  * dropped "migration: save cpr mode".
    deleted mode from cpr state, and used cpr_uri to infer transfer mode.
  * dropped "migration: stop vm earlier for cpr"
  * fixed an unreported bug for cpr-transfer and migrate cancel
  * documented cpr-transfer restrictions in qapi
  * added trace for cpr_state_save and cpr_state_load
  * added ftruncate to "preserve ram blocks"

The first 4 patches below are foundational and are needed for both cpr-transfer
mode and the proposed cpr-exec mode.  The next 7 patches are specific to
cpr-transfer and implement the mechanisms for sharing state across a socket
using SCM_RIGHTS.  The last 5 patches supply tests and documentation.

Steve Sistare (16):
  machine: anon-alloc option
  migration: cpr-state
  physmem: preserve ram blocks for cpr
  hostmem-memfd: preserve for cpr
  migration: SCM_RIGHTS for QEMUFile
  migration: VMSTATE_FD
  migration: cpr-transfer save and load
  migration: cpr-uri parameter
  migration: cpr-uri option
  migration: split qmp_migrate
  migration: cpr-transfer mode
  tests/migration-test: memory_backend
  tests/qtest: defer connection
  tests/migration-test: defer connection
  migration-test: cpr-transfer
  migration: cpr-transfer documentation

 backends/hostmem-memfd.c       |  12 ++-
 docs/devel/migration/CPR.rst   | 144 +++++++++++++++++++++++++-
 hw/core/machine.c              |  19 ++++
 include/hw/boards.h            |   1 +
 include/migration/cpr.h        |  29 ++++++
 include/migration/vmstate.h    |   9 ++
 migration/cpr-transfer.c       |  81 +++++++++++++++
 migration/cpr.c                | 223 +++++++++++++++++++++++++++++++++++++++++
 migration/meson.build          |   2 +
 migration/migration-hmp-cmds.c |  10 ++
 migration/migration.c          | 107 +++++++++++++++++++-
 migration/migration.h          |   2 +
 migration/options.c            |  40 +++++++-
 migration/options.h            |   1 +
 migration/qemu-file.c          |  83 ++++++++++++++-
 migration/qemu-file.h          |   2 +
 migration/ram.c                |   2 +
 migration/trace-events         |   9 ++
 migration/vmstate-types.c      |  24 +++++
 qapi/machine.json              |  14 +++
 qapi/migration.json            |  53 +++++++++-
 qemu-options.hx                |  19 ++++
 stubs/vmstate.c                |   7 ++
 system/physmem.c               |  63 ++++++++++++
 system/trace-events            |   3 +
 system/vl.c                    |  10 ++
 tests/qtest/libqtest.c         |  69 ++++++++-----
 tests/qtest/libqtest.h         |  19 +++-
 tests/qtest/migration-test.c   | 107 ++++++++++++++++++--
 29 files changed, 1115 insertions(+), 49 deletions(-)
 create mode 100644 include/migration/cpr.h
 create mode 100644 migration/cpr-transfer.c
 create mode 100644 migration/cpr.c

-- 
1.8.3.1




* [PATCH V3 01/16] machine: anon-alloc option
  2024-11-01 13:47 [PATCH V3 00/16] Live update: cpr-transfer Steve Sistare
@ 2024-11-01 13:47 ` Steve Sistare
  2024-11-01 14:06   ` Peter Xu
  2024-11-04 10:39   ` David Hildenbrand
  2024-11-01 13:47 ` [PATCH V3 02/16] migration: cpr-state Steve Sistare
                   ` (14 subsequent siblings)
  15 siblings, 2 replies; 86+ messages in thread
From: Steve Sistare @ 2024-11-01 13:47 UTC
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
on the value of the anon-alloc machine property.  This option applies to
memory allocated as a side effect of creating various devices. It does
not apply to memory-backend-objects, whether explicitly specified on
the command line, or implicitly created by the -m command line option.

The memfd option is intended to support new migration modes, in which the
memory region can be transferred in place to a new QEMU process, by sending
the memfd file descriptor to the process.  Memory contents are preserved,
and if the mode also transfers device descriptors, then pages that are
locked in memory for DMA remain locked.  This behavior is a prerequisite
for supporting vfio, vdpa, and iommufd devices with the new modes.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/core/machine.c   | 19 +++++++++++++++++++
 include/hw/boards.h |  1 +
 qapi/machine.json   | 14 ++++++++++++++
 qemu-options.hx     | 11 +++++++++++
 system/physmem.c    | 35 +++++++++++++++++++++++++++++++++++
 system/trace-events |  3 +++
 6 files changed, 83 insertions(+)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index adaba17..a89a32b 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -460,6 +460,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
     ms->mem_merge = value;
 }
 
+static int machine_get_anon_alloc(Object *obj, Error **errp)
+{
+    MachineState *ms = MACHINE(obj);
+
+    return ms->anon_alloc;
+}
+
+static void machine_set_anon_alloc(Object *obj, int value, Error **errp)
+{
+    MachineState *ms = MACHINE(obj);
+
+    ms->anon_alloc = value;
+}
+
 static bool machine_get_usb(Object *obj, Error **errp)
 {
     MachineState *ms = MACHINE(obj);
@@ -1078,6 +1092,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
     object_class_property_set_description(oc, "mem-merge",
         "Enable/disable memory merge support");
 
+    object_class_property_add_enum(oc, "anon-alloc", "AnonAllocOption",
+                                   &AnonAllocOption_lookup,
+                                   machine_get_anon_alloc,
+                                   machine_set_anon_alloc);
+
     object_class_property_add_bool(oc, "usb",
         machine_get_usb, machine_set_usb);
     object_class_property_set_description(oc, "usb",
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 5966069..5a87647 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -393,6 +393,7 @@ struct MachineState {
     bool enable_graphics;
     ConfidentialGuestSupport *cgs;
     HostMemoryBackend *memdev;
+    AnonAllocOption anon_alloc;
     /*
      * convenience alias to ram_memdev_id backend memory region
      * or to numa container memory region
diff --git a/qapi/machine.json b/qapi/machine.json
index 3cc055b..f634c40 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -1898,3 +1898,17 @@
 { 'command': 'x-query-interrupt-controllers',
   'returns': 'HumanReadableText',
   'features': [ 'unstable' ]}
+
+##
+# @AnonAllocOption:
+#
+# An enumeration of the options for allocating anonymous guest memory.
+#
+# @mmap: allocate using mmap MAP_ANON
+#
+# @memfd: allocate using memfd_create
+#
+# Since: 9.2
+##
+{ 'enum': 'AnonAllocOption',
+  'data': [ 'mmap', 'memfd' ] }
diff --git a/qemu-options.hx b/qemu-options.hx
index dacc979..fdd6bf2 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -38,6 +38,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
     "                nvdimm=on|off controls NVDIMM support (default=off)\n"
     "                memory-encryption=@var{} memory encryption object to use (default=none)\n"
     "                hmat=on|off controls ACPI HMAT support (default=off)\n"
+    "                anon-alloc=mmap|memfd allocate anonymous guest RAM using mmap MAP_ANON or memfd_create (default: mmap)\n"
     "                memory-backend='backend-id' specifies explicitly provided backend for main RAM (default=none)\n"
     "                cxl-fmw.0.targets.0=firsttarget,cxl-fmw.0.targets.1=secondtarget,cxl-fmw.0.size=size[,cxl-fmw.0.interleave-granularity=granularity]\n",
     QEMU_ARCH_ALL)
@@ -101,6 +102,16 @@ SRST
         Enables or disables ACPI Heterogeneous Memory Attribute Table
         (HMAT) support. The default is off.
 
+    ``anon-alloc=mmap|memfd``
+        Allocate anonymous guest RAM using mmap MAP_ANON (the default)
+        or memfd_create.  This option applies to memory allocated as a
+        side effect of creating various devices. It does not apply to
+        memory-backend-objects, whether explicitly specified on the
+        command line, or implicitly created by the -m command line
+        option.
+
+        Some migration modes require anon-alloc=memfd.
+
     ``memory-backend='id'``
         An alternative to legacy ``-mem-path`` and ``mem-prealloc`` options.
         Allows to use a memory backend as main RAM.
diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a..174f7e0 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -47,6 +47,7 @@
 #include "qemu/qemu-print.h"
 #include "qemu/log.h"
 #include "qemu/memalign.h"
+#include "qemu/memfd.h"
 #include "exec/memory.h"
 #include "exec/ioport.h"
 #include "sysemu/dma.h"
@@ -69,6 +70,8 @@
 
 #include "qemu/pmem.h"
 
+#include "qapi/qapi-types-migration.h"
+#include "migration/options.h"
 #include "migration/vmstate.h"
 
 #include "qemu/range.h"
@@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
                 qemu_mutex_unlock_ramlist();
                 return;
             }
+
+        } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD &&
+                   !object_dynamic_cast(new_block->mr->parent_obj.parent,
+                                        TYPE_MEMORY_BACKEND)) {
+            size_t max_length = new_block->max_length;
+            MemoryRegion *mr = new_block->mr;
+            const char *name = memory_region_name(mr);
+
+            new_block->mr->align = QEMU_VMALLOC_ALIGN;
+            new_block->flags |= RAM_SHARED;
+
+            if (new_block->fd == -1) {
+                new_block->fd = qemu_memfd_create(name, max_length + mr->align,
+                                                  0, 0, 0, errp);
+            }
+
+            if (new_block->fd >= 0) {
+                int mfd = new_block->fd;
+                qemu_set_cloexec(mfd);
+                new_block->host = file_ram_alloc(new_block, max_length, mfd,
+                                                 false, 0, errp);
+            }
+            if (!new_block->host) {
+                qemu_mutex_unlock_ramlist();
+                return;
+            }
+            memory_try_enable_merging(new_block->host, new_block->max_length);
+            free_on_error = true;
+
         } else {
             new_block->host = qemu_anon_ram_alloc(new_block->max_length,
                                                   &new_block->mr->align,
@@ -1932,6 +1964,9 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
         ram_block_notify_add(new_block->host, new_block->used_length,
                              new_block->max_length);
     }
+    trace_ram_block_add(memory_region_name(new_block->mr), new_block->flags,
+                        new_block->fd, new_block->used_length,
+                        new_block->max_length);
     return;
 
 out_free:
diff --git a/system/trace-events b/system/trace-events
index 074d001..267daca 100644
--- a/system/trace-events
+++ b/system/trace-events
@@ -47,3 +47,6 @@ dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"P
 
 # cpu-throttle.c
 cpu_throttle_set(int new_throttle_pct)  "set guest CPU throttled by %d%%"
+
+# physmem.c
+ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %zu, maxlen %zu"
-- 
1.8.3.1




* [PATCH V3 02/16] migration: cpr-state
  2024-11-01 13:47 [PATCH V3 00/16] Live update: cpr-transfer Steve Sistare
  2024-11-01 13:47 ` [PATCH V3 01/16] machine: anon-alloc option Steve Sistare
@ 2024-11-01 13:47 ` Steve Sistare
  2024-11-13 20:36   ` Peter Xu
  2024-11-01 13:47 ` [PATCH V3 03/16] physmem: preserve ram blocks for cpr Steve Sistare
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 86+ messages in thread
From: Steve Sistare @ 2024-11-01 13:47 UTC
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

CPR must save state that is needed after QEMU is restarted, when devices
are realized.  Thus the extra state cannot be saved in the migration stream,
as objects must already exist before that stream can be loaded.  Instead,
define auxiliary state structures and vmstate descriptions, not associated
with any registered object, and serialize the aux state to a cpr-specific
stream in cpr_state_save.  Deserialize in cpr_state_load after QEMU
restarts, before devices are realized.

Provide accessors for clients to register file descriptors for saving.
The mechanism for passing the fd's to the new process will be specific
to each migration mode, and added in subsequent patches.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
---
 include/migration/cpr.h |  23 ++++++
 migration/cpr.c         | 192 ++++++++++++++++++++++++++++++++++++++++++++++++
 migration/meson.build   |   1 +
 migration/migration.c   |   1 +
 migration/trace-events  |   5 ++
 5 files changed, 222 insertions(+)
 create mode 100644 include/migration/cpr.h
 create mode 100644 migration/cpr.c

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
new file mode 100644
index 0000000..6e4781c
--- /dev/null
+++ b/include/migration/cpr.h
@@ -0,0 +1,23 @@
+/*
+ * Copyright (c) 2021, 2024 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef MIGRATION_CPR_H
+#define MIGRATION_CPR_H
+
+#define QEMU_CPR_FILE_MAGIC     0x51435052
+#define QEMU_CPR_FILE_VERSION   0x00000001
+
+void cpr_save_fd(const char *name, int id, int fd);
+void cpr_delete_fd(const char *name, int id);
+int cpr_find_fd(const char *name, int id);
+
+int cpr_state_save(Error **errp);
+int cpr_state_load(Error **errp);
+void cpr_state_close(void);
+struct QIOChannel *cpr_state_ioc(void);
+
+#endif
diff --git a/migration/cpr.c b/migration/cpr.c
new file mode 100644
index 0000000..be1dc92
--- /dev/null
+++ b/migration/cpr.c
@@ -0,0 +1,192 @@
+/*
+ * Copyright (c) 2021-2024 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "migration/cpr.h"
+#include "migration/misc.h"
+#include "migration/options.h"
+#include "migration/qemu-file.h"
+#include "migration/savevm.h"
+#include "migration/vmstate.h"
+#include "sysemu/runstate.h"
+#include "trace.h"
+
+/*************************************************************************/
+/* cpr state container for all information to be saved. */
+
+typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
+
+typedef struct CprState {
+    CprFdList fds;
+} CprState;
+
+static CprState cpr_state;
+
+/****************************************************************************/
+
+typedef struct CprFd {
+    char *name;
+    unsigned int namelen;
+    int id;
+    int fd;
+    QLIST_ENTRY(CprFd) next;
+} CprFd;
+
+static const VMStateDescription vmstate_cpr_fd = {
+    .name = "cpr fd",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT32(namelen, CprFd),
+        VMSTATE_VBUFFER_ALLOC_UINT32(name, CprFd, 0, NULL, namelen),
+        VMSTATE_INT32(id, CprFd),
+        VMSTATE_INT32(fd, CprFd),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+void cpr_save_fd(const char *name, int id, int fd)
+{
+    CprFd *elem = g_new0(CprFd, 1);
+
+    trace_cpr_save_fd(name, id, fd);
+    elem->name = g_strdup(name);
+    elem->namelen = strlen(name) + 1;
+    elem->id = id;
+    elem->fd = fd;
+    QLIST_INSERT_HEAD(&cpr_state.fds, elem, next);
+}
+
+static CprFd *find_fd(CprFdList *head, const char *name, int id)
+{
+    CprFd *elem;
+
+    QLIST_FOREACH(elem, head, next) {
+        if (!strcmp(elem->name, name) && elem->id == id) {
+            return elem;
+        }
+    }
+    return NULL;
+}
+
+void cpr_delete_fd(const char *name, int id)
+{
+    CprFd *elem = find_fd(&cpr_state.fds, name, id);
+
+    if (elem) {
+        QLIST_REMOVE(elem, next);
+        g_free(elem->name);
+        g_free(elem);
+    }
+
+    trace_cpr_delete_fd(name, id);
+}
+
+int cpr_find_fd(const char *name, int id)
+{
+    CprFd *elem = find_fd(&cpr_state.fds, name, id);
+    int fd = elem ? elem->fd : -1;
+
+    trace_cpr_find_fd(name, id, fd);
+    return fd;
+}
+/*************************************************************************/
+#define CPR_STATE "CprState"
+
+static const VMStateDescription vmstate_cpr_state = {
+    .name = CPR_STATE,
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, next),
+        VMSTATE_END_OF_LIST()
+    }
+};
+/*************************************************************************/
+
+static QEMUFile *cpr_state_file;
+
+QIOChannel *cpr_state_ioc(void)
+{
+    return qemu_file_get_ioc(cpr_state_file);
+}
+
+int cpr_state_save(Error **errp)
+{
+    int ret;
+    QEMUFile *f;
+
+    /* set f based on mode in a later patch in this series */
+    return 0;
+
+    qemu_put_be32(f, QEMU_CPR_FILE_MAGIC);
+    qemu_put_be32(f, QEMU_CPR_FILE_VERSION);
+
+    ret = vmstate_save_state(f, &vmstate_cpr_state, &cpr_state, 0);
+    if (ret) {
+        error_setg(errp, "vmstate_save_state error %d", ret);
+        qemu_fclose(f);
+        return ret;
+    }
+
+    /*
+     * Close the socket only partially so we can later detect when the other
+     * end closes by getting a HUP event.
+     */
+    qemu_fflush(f);
+    qio_channel_shutdown(qemu_file_get_ioc(f), QIO_CHANNEL_SHUTDOWN_WRITE,
+                         NULL);
+    cpr_state_file = f;
+    return 0;
+}
+
+int cpr_state_load(Error **errp)
+{
+    int ret;
+    uint32_t v;
+    QEMUFile *f;
+
+    /* set f based on other parameters in a later patch in this series */
+    return 0;
+
+    v = qemu_get_be32(f);
+    if (v != QEMU_CPR_FILE_MAGIC) {
+        error_setg(errp, "Not a migration stream (bad magic %x)", v);
+        qemu_fclose(f);
+        return -EINVAL;
+    }
+    v = qemu_get_be32(f);
+    if (v != QEMU_CPR_FILE_VERSION) {
+        error_setg(errp, "Unsupported migration stream version %d", v);
+        qemu_fclose(f);
+        return -ENOTSUP;
+    }
+
+    ret = vmstate_load_state(f, &vmstate_cpr_state, &cpr_state, 1);
+    if (ret) {
+        error_setg(errp, "vmstate_load_state error %d", ret);
+        qemu_fclose(f);
+        return ret;
+    }
+
+    /*
+     * Let the caller decide when to close the socket (and generate a HUP event
+     * for the sending side).
+     */
+    cpr_state_file = f;
+
+    return ret;
+}
+
+void cpr_state_close(void)
+{
+    if (cpr_state_file) {
+        qemu_fclose(cpr_state_file);
+        cpr_state_file = NULL;
+    }
+}
diff --git a/migration/meson.build b/migration/meson.build
index 66d3de8..e5f4211 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -13,6 +13,7 @@ system_ss.add(files(
   'block-dirty-bitmap.c',
   'channel.c',
   'channel-block.c',
+  'cpr.c',
   'dirtyrate.c',
   'exec.c',
   'fd.c',
diff --git a/migration/migration.c b/migration/migration.c
index 021faee..6dc7c09 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -27,6 +27,7 @@
 #include "sysemu/cpu-throttle.h"
 #include "rdma.h"
 #include "ram.h"
+#include "migration/cpr.h"
 #include "migration/global_state.h"
 #include "migration/misc.h"
 #include "migration.h"
diff --git a/migration/trace-events b/migration/trace-events
index c65902f..5356fb5 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -341,6 +341,11 @@ colo_receive_message(const char *msg) "Receive '%s' message"
 # colo-failover.c
 colo_failover_set_state(const char *new_state) "new state %s"
 
+# cpr.c
+cpr_save_fd(const char *name, int id, int fd) "%s, id %d, fd %d"
+cpr_delete_fd(const char *name, int id) "%s, id %d"
+cpr_find_fd(const char *name, int id, int fd) "%s, id %d returns %d"
+
 # block-dirty-bitmap.c
 send_bitmap_header_enter(void) ""
 send_bitmap_bits(uint32_t flags, uint64_t start_sector, uint32_t nr_sectors, uint64_t data_size) "flags: 0x%x, start_sector: %" PRIu64 ", nr_sectors: %" PRIu32 ", data_size: %" PRIu64
-- 
1.8.3.1




* [PATCH V3 03/16] physmem: preserve ram blocks for cpr
  2024-11-01 13:47 [PATCH V3 00/16] Live update: cpr-transfer Steve Sistare
  2024-11-01 13:47 ` [PATCH V3 01/16] machine: anon-alloc option Steve Sistare
  2024-11-01 13:47 ` [PATCH V3 02/16] migration: cpr-state Steve Sistare
@ 2024-11-01 13:47 ` Steve Sistare
  2024-11-01 13:47 ` [PATCH V3 04/16] hostmem-memfd: preserve " Steve Sistare
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 86+ messages in thread
From: Steve Sistare @ 2024-11-01 13:47 UTC
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Save the memfd for anonymous ramblocks in CPR state, along with a name
that uniquely identifies it.  The block's idstr is not yet set, so it
cannot be used for this purpose.  Find the saved memfd in new QEMU when
creating a block.  QEMU hard-codes the length of some internally-created
blocks, so to guard against that length changing, use lseek to get the
actual length of an incoming memfd, and ftruncate to grow it if it is
shorter than the block's max_length.
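
The length check can be sketched in isolation as follows.  This is an
illustration rather than the code in the diff below: ensure_fd_length and
its demo are hypothetical names, and an unlinked temporary file stands in
for an incoming memfd, since lseek and ftruncate behave the same way on
both.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Make sure fd is at least 'want' bytes long.  Returns the resulting
 * length (never less than 'want'), or -1 if lseek or ftruncate fails.
 * A descriptor that is already long enough is left untouched.
 */
static off_t ensure_fd_length(int fd, off_t want)
{
    off_t cur = lseek(fd, 0, SEEK_END);     /* actual length of the file */

    if (cur < 0) {
        return -1;
    }
    if (cur >= want) {
        return cur;
    }
    if (ftruncate(fd, want) < 0) {          /* grow; new bytes read as 0 */
        return -1;
    }
    return want;
}

/* Exercise both paths: grow an empty file, then ask for less */
static int ensure_fd_length_demo(void)
{
    char path[] = "/tmp/cpr-len-demo-XXXXXX";
    int fd = mkstemp(path);
    int ok;

    if (fd < 0) {
        return -1;
    }
    unlink(path);                           /* file lives only via the fd */

    ok = ensure_fd_length(fd, 8192) == 8192 &&
         ensure_fd_length(fd, 4096) == 8192;
    close(fd);
    return ok ? 0 : -1;
}
```

Growing but never shrinking means a block whose hard-coded length got
larger in new QEMU can still be mapped in full, while existing contents
are kept.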

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 system/physmem.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/system/physmem.c b/system/physmem.c
index 174f7e0..cd468eb 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -72,6 +72,7 @@
 
 #include "qapi/qapi-types-migration.h"
 #include "migration/options.h"
+#include "migration/cpr.h"
 #include "migration/vmstate.h"
 
 #include "qemu/range.h"
@@ -1663,6 +1664,19 @@ void qemu_ram_unset_idstr(RAMBlock *block)
     }
 }
 
+static char *cpr_name(RAMBlock *block)
+{
+    MemoryRegion *mr = block->mr;
+    const char *mr_name = memory_region_name(mr);
+    g_autofree char *id = mr->dev ? qdev_get_dev_path(mr->dev) : NULL;
+
+    if (id) {
+        return g_strdup_printf("%s/%s", id, mr_name);
+    } else {
+        return g_strdup(mr_name);
+    }
+}
+
 size_t qemu_ram_pagesize(RAMBlock *rb)
 {
     return rb->page_size;
@@ -1858,14 +1872,23 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
                                         TYPE_MEMORY_BACKEND)) {
             size_t max_length = new_block->max_length;
             MemoryRegion *mr = new_block->mr;
-            const char *name = memory_region_name(mr);
+            g_autofree char *name = cpr_name(new_block);
 
             new_block->mr->align = QEMU_VMALLOC_ALIGN;
             new_block->flags |= RAM_SHARED;
+            new_block->fd = cpr_find_fd(name, 0);
 
             if (new_block->fd == -1) {
                 new_block->fd = qemu_memfd_create(name, max_length + mr->align,
                                                   0, 0, 0, errp);
+                cpr_save_fd(name, 0, new_block->fd);
+            } else if (lseek(new_block->fd, 0, SEEK_END) < max_length &&
+                       ftruncate(new_block->fd, max_length)) {
+                error_setg_errno(errp, errno,
+                                 "cannot grow ram block %s fd %d to %zu bytes",
+                                 name, new_block->fd, max_length);
+                qemu_mutex_unlock_ramlist();
+                return;
             }
 
             if (new_block->fd >= 0) {
@@ -1875,6 +1898,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
                                                  false, 0, errp);
             }
             if (!new_block->host) {
+                cpr_delete_fd(name, 0);
                 qemu_mutex_unlock_ramlist();
                 return;
             }
@@ -2182,6 +2206,8 @@ static void reclaim_ramblock(RAMBlock *block)
 
 void qemu_ram_free(RAMBlock *block)
 {
+    g_autofree char *name = NULL;
+
     if (!block) {
         return;
     }
@@ -2192,6 +2218,8 @@ void qemu_ram_free(RAMBlock *block)
     }
 
     qemu_mutex_lock_ramlist();
+    name = cpr_name(block);
+    cpr_delete_fd(name, 0);
     QLIST_REMOVE_RCU(block, next);
     ram_list.mru_block = NULL;
     /* Write list before version */
-- 
1.8.3.1




* [PATCH V3 04/16] hostmem-memfd: preserve for cpr
  2024-11-01 13:47 [PATCH V3 00/16] Live update: cpr-transfer Steve Sistare
                   ` (2 preceding siblings ...)
  2024-11-01 13:47 ` [PATCH V3 03/16] physmem: preserve ram blocks for cpr Steve Sistare
@ 2024-11-01 13:47 ` Steve Sistare
  2024-11-01 13:47 ` [PATCH V3 05/16] migration: SCM_RIGHTS for QEMUFile Steve Sistare
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 86+ messages in thread
From: Steve Sistare @ 2024-11-01 13:47 UTC
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Preserve memory-backend-memfd memory objects during cpr-transfer.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Peter Xu <peterx@redhat.com>
---
 backends/hostmem-memfd.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
index 6a3c89a..2740222 100644
--- a/backends/hostmem-memfd.c
+++ b/backends/hostmem-memfd.c
@@ -17,6 +17,7 @@
 #include "qemu/module.h"
 #include "qapi/error.h"
 #include "qom/object.h"
+#include "migration/cpr.h"
 
 #define TYPE_MEMORY_BACKEND_MEMFD "memory-backend-memfd"
 
@@ -35,15 +36,19 @@ static bool
 memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 {
     HostMemoryBackendMemfd *m = MEMORY_BACKEND_MEMFD(backend);
-    g_autofree char *name = NULL;
+    g_autofree char *name = host_memory_backend_get_name(backend);
+    int fd = cpr_find_fd(name, 0);
     uint32_t ram_flags;
-    int fd;
 
     if (!backend->size) {
         error_setg(errp, "can't create backend with size 0");
         return false;
     }
 
+    if (fd >= 0) {
+        goto have_fd;
+    }
+
     fd = qemu_memfd_create(TYPE_MEMORY_BACKEND_MEMFD, backend->size,
                            m->hugetlb, m->hugetlbsize, m->seal ?
                            F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL : 0,
@@ -51,9 +56,10 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     if (fd == -1) {
         return false;
     }
+    cpr_save_fd(name, 0, fd);
 
+have_fd:
     backend->aligned = true;
-    name = host_memory_backend_get_name(backend);
     ram_flags = backend->share ? RAM_SHARED : 0;
     ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
     ram_flags |= backend->guest_memfd ? RAM_GUEST_MEMFD : 0;
-- 
1.8.3.1




* [PATCH V3 05/16] migration: SCM_RIGHTS for QEMUFile
  2024-11-01 13:47 [PATCH V3 00/16] Live update: cpr-transfer Steve Sistare
                   ` (3 preceding siblings ...)
  2024-11-01 13:47 ` [PATCH V3 04/16] hostmem-memfd: preserve " Steve Sistare
@ 2024-11-01 13:47 ` Steve Sistare
  2024-11-13 20:54   ` Peter Xu
  2024-11-01 13:47 ` [PATCH V3 06/16] migration: VMSTATE_FD Steve Sistare
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 86+ messages in thread
From: Steve Sistare @ 2024-11-01 13:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Define functions to put/get file descriptors to/from a QEMUFile, for qio
channels that support SCM_RIGHTS.  Maintain ordering such that
  put(A), put(fd), put(B)
followed by
  get(A), get(fd), get(B)
always succeeds.  Other get orderings may succeed but are not guaranteed.
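
The ordering guarantee can be illustrated outside QEMU with plain
sendmsg/recvmsg on a unix socketpair.  This is a minimal sketch, not the
QEMUFile code: send_fd, recv_fd and demo_ordering are hypothetical
helper names, and the single dummy byte mirrors the one that
qemu_file_put_fd sends along with the descriptor.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one dummy byte with fd attached as SCM_RIGHTS ancillary data. */
static int send_fd(int sock, int fd)
{
    char dummy = ' ';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        struct cmsghdr hdr;     /* forces correct alignment */
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive the dummy byte; return the fd that rode with it, or -1. */
static int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    int fd = -1;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET &&
        cmsg->cmsg_type == SCM_RIGHTS &&
        cmsg->cmsg_len == CMSG_LEN(sizeof(int))) {
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    }
    return fd;
}

/* put(A), put(fd), put(B) then get(A), get(fd), get(B).  0 on success. */
static int demo_ordering(void)
{
    int sv[2], p[2], fd = -1, ret = -1;
    char a = 'A', b = 'B', c = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }
    if (pipe(p) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    if (write(sv[0], &a, 1) == 1 && send_fd(sv[0], p[1]) == 0 &&
        write(sv[0], &b, 1) == 1 &&
        read(sv[1], &c, 1) == 1 && c == 'A' &&
        (fd = recv_fd(sv[1])) >= 0 &&
        read(sv[1], &c, 1) == 1 && c == 'B' &&
        /* The received fd is a new descriptor for the same pipe. */
        write(fd, "x", 1) == 1 && read(p[0], &c, 1) == 1 && c == 'x') {
        ret = 0;
    }

    if (fd >= 0) {
        close(fd);
    }
    close(p[0]);
    close(p[1]);
    close(sv[0]);
    close(sv[1]);
    return ret;
}
```

The receiver reads exactly the bytes that precede the descriptor before
calling recvmsg with a control buffer.  On Linux, a plain read() that
consumes the byte carrying the descriptor would discard the descriptor,
which is one reason other get orderings are not guaranteed.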

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/qemu-file.c  | 83 +++++++++++++++++++++++++++++++++++++++++++++++---
 migration/qemu-file.h  |  2 ++
 migration/trace-events |  2 ++
 3 files changed, 83 insertions(+), 4 deletions(-)

diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index b6d2f58..7f951ab 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -37,6 +37,11 @@
 #define IO_BUF_SIZE 32768
 #define MAX_IOV_SIZE MIN_CONST(IOV_MAX, 64)
 
+typedef struct FdEntry {
+    QTAILQ_ENTRY(FdEntry) entry;
+    int fd;
+} FdEntry;
+
 struct QEMUFile {
     QIOChannel *ioc;
     bool is_writable;
@@ -51,6 +56,9 @@ struct QEMUFile {
 
     int last_error;
     Error *last_error_obj;
+
+    bool fd_pass;
+    QTAILQ_HEAD(, FdEntry) fds;
 };
 
 /*
@@ -109,6 +117,8 @@ static QEMUFile *qemu_file_new_impl(QIOChannel *ioc, bool is_writable)
     object_ref(ioc);
     f->ioc = ioc;
     f->is_writable = is_writable;
+    f->fd_pass = qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_FD_PASS);
+    QTAILQ_INIT(&f->fds);
 
     return f;
 }
@@ -310,6 +320,10 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f)
     int len;
     int pending;
     Error *local_error = NULL;
+    g_autofree int *fds = NULL;
+    size_t nfd = 0;
+    int **pfds = f->fd_pass ? &fds : NULL;
+    size_t *pnfd = f->fd_pass ? &nfd : NULL;
 
     assert(!qemu_file_is_writable(f));
 
@@ -325,10 +339,9 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f)
     }
 
     do {
-        len = qio_channel_read(f->ioc,
-                               (char *)f->buf + pending,
-                               IO_BUF_SIZE - pending,
-                               &local_error);
+        struct iovec iov = { f->buf + pending, IO_BUF_SIZE - pending };
+        len = qio_channel_readv_full(f->ioc, &iov, 1, pfds, pnfd, 0,
+                                     &local_error);
         if (len == QIO_CHANNEL_ERR_BLOCK) {
             if (qemu_in_coroutine()) {
                 qio_channel_yield(f->ioc, G_IO_IN);
@@ -348,9 +361,65 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f)
         qemu_file_set_error_obj(f, len, local_error);
     }
 
+    for (int i = 0; i < nfd; i++) {
+        FdEntry *fde = g_new0(FdEntry, 1);
+        fde->fd = fds[i];
+        QTAILQ_INSERT_TAIL(&f->fds, fde, entry);
+    }
+
     return len;
 }
 
+int qemu_file_put_fd(QEMUFile *f, int fd)
+{
+    int ret = 0;
+    QIOChannel *ioc = qemu_file_get_ioc(f);
+    Error *err = NULL;
+    struct iovec iov = { (void *)" ", 1 };
+
+    /*
+     * Send a dummy byte so qemu_fill_buffer on the receiving side does not
+     * fail with a len=0 error.  Flush first to maintain ordering wrt other
+     * data.
+     */
+
+    qemu_fflush(f);
+    if (qio_channel_writev_full(ioc, &iov, 1, &fd, 1, 0, &err) < 1) {
+        error_report_err(error_copy(err));
+        qemu_file_set_error_obj(f, -EIO, err);
+        ret = -1;
+    }
+    trace_qemu_file_put_fd(f->ioc->name, fd, ret);
+    return ret;
+}
+
+int qemu_file_get_fd(QEMUFile *f)
+{
+    int fd = -1;
+    FdEntry *fde;
+
+    if (!f->fd_pass) {
+        Error *err = NULL;
+        error_setg(&err, "%s does not support fd passing", f->ioc->name);
+        error_report_err(error_copy(err));
+        qemu_file_set_error_obj(f, -EIO, err);
+        goto out;
+    }
+
+    /* Force the dummy byte and its fd passenger to appear. */
+    qemu_peek_byte(f, 0);
+
+    fde = QTAILQ_FIRST(&f->fds);
+    if (fde) {
+        qemu_get_byte(f);       /* Drop the dummy byte */
+        fd = fde->fd;
+        QTAILQ_REMOVE(&f->fds, fde, entry);
+    }
+out:
+    trace_qemu_file_get_fd(f->ioc->name, fd);
+    return fd;
+}
+
 /** Closes the file
  *
  * Returns negative error value if any error happened on previous operations or
@@ -361,11 +430,17 @@ static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f)
  */
 int qemu_fclose(QEMUFile *f)
 {
+    FdEntry *fde, *next;
     int ret = qemu_fflush(f);
     int ret2 = qio_channel_close(f->ioc, NULL);
     if (ret >= 0) {
         ret = ret2;
     }
+    QTAILQ_FOREACH_SAFE(fde, &f->fds, entry, next) {
+        warn_report("qemu_fclose: received fd %d was never claimed", fde->fd);
+        close(fde->fd);
+        g_free(fde);
+    }
     g_clear_pointer(&f->ioc, object_unref);
     error_free(f->last_error_obj);
     g_free(f);
diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index 11c2120..3e47a20 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -79,5 +79,7 @@ size_t qemu_get_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
                           off_t pos);
 
 QIOChannel *qemu_file_get_ioc(QEMUFile *file);
+int qemu_file_put_fd(QEMUFile *f, int fd);
+int qemu_file_get_fd(QEMUFile *f);
 
 #endif
diff --git a/migration/trace-events b/migration/trace-events
index 5356fb5..345506b 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -88,6 +88,8 @@ put_qlist_end(const char *field_name, const char *vmsd_name) "%s(%s)"
 
 # qemu-file.c
 qemu_file_fclose(void) ""
+qemu_file_put_fd(const char *name, int fd, int ret) "ioc %s, fd %d -> status %d"
+qemu_file_get_fd(const char *name, int fd) "ioc %s -> fd %d"
 
 # ram.c
 get_queued_page(const char *block_name, uint64_t tmp_offset, unsigned long page_abs) "%s/0x%" PRIx64 " page_abs=0x%lx"
-- 
1.8.3.1




* [PATCH V3 06/16] migration: VMSTATE_FD
  2024-11-01 13:47 [PATCH V3 00/16] Live update: cpr-transfer Steve Sistare
                   ` (4 preceding siblings ...)
  2024-11-01 13:47 ` [PATCH V3 05/16] migration: SCM_RIGHTS for QEMUFile Steve Sistare
@ 2024-11-01 13:47 ` Steve Sistare
  2024-11-13 20:55   ` Peter Xu
  2024-11-01 13:47 ` [PATCH V3 07/16] migration: cpr-transfer save and load Steve Sistare
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 86+ messages in thread
From: Steve Sistare @ 2024-11-01 13:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Define VMSTATE_FD for declaring a file descriptor field in a
VMStateDescription.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/vmstate.h |  9 +++++++++
 migration/vmstate-types.c   | 23 +++++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index f313f2f..a1dfab4 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -230,6 +230,7 @@ extern const VMStateInfo vmstate_info_uint8;
 extern const VMStateInfo vmstate_info_uint16;
 extern const VMStateInfo vmstate_info_uint32;
 extern const VMStateInfo vmstate_info_uint64;
+extern const VMStateInfo vmstate_info_fd;
 
 /** Put this in the stream when migrating a null pointer.*/
 #define VMS_NULLPTR_MARKER (0x30U) /* '0' */
@@ -902,6 +903,9 @@ extern const VMStateInfo vmstate_info_qlist;
 #define VMSTATE_UINT64_V(_f, _s, _v)                                  \
     VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint64, uint64_t)
 
+#define VMSTATE_FD_V(_f, _s, _v)                                  \
+    VMSTATE_SINGLE(_f, _s, _v, vmstate_info_fd, int32_t)
+
 #ifdef CONFIG_LINUX
 
 #define VMSTATE_U8_V(_f, _s, _v)                                   \
@@ -936,6 +940,9 @@ extern const VMStateInfo vmstate_info_qlist;
 #define VMSTATE_UINT64(_f, _s)                                        \
     VMSTATE_UINT64_V(_f, _s, 0)
 
+#define VMSTATE_FD(_f, _s)                                            \
+    VMSTATE_FD_V(_f, _s, 0)
+
 #ifdef CONFIG_LINUX
 
 #define VMSTATE_U8(_f, _s)                                         \
@@ -1009,6 +1016,8 @@ extern const VMStateInfo vmstate_info_qlist;
 #define VMSTATE_UINT64_TEST(_f, _s, _t)                                  \
     VMSTATE_SINGLE_TEST(_f, _s, _t, 0, vmstate_info_uint64, uint64_t)
 
+#define VMSTATE_FD_TEST(_f, _s, _t)                                            \
+    VMSTATE_SINGLE_TEST(_f, _s, _t, 0, vmstate_info_fd, int32_t)
 
 #define VMSTATE_TIMER_PTR_TEST(_f, _s, _test)                             \
     VMSTATE_POINTER_TEST(_f, _s, _test, vmstate_info_timer, QEMUTimer *)
diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
index e83bfcc..f31deb3 100644
--- a/migration/vmstate-types.c
+++ b/migration/vmstate-types.c
@@ -314,6 +314,29 @@ const VMStateInfo vmstate_info_uint64 = {
     .put  = put_uint64,
 };
 
+/* File descriptor communicated via SCM_RIGHTS */
+
+static int get_fd(QEMUFile *f, void *pv, size_t size,
+                  const VMStateField *field)
+{
+    int32_t *v = pv;
+    *v = qemu_file_get_fd(f);
+    return 0;
+}
+
+static int put_fd(QEMUFile *f, void *pv, size_t size,
+                  const VMStateField *field, JSONWriter *vmdesc)
+{
+    int32_t *v = pv;
+    return qemu_file_put_fd(f, *v);
+}
+
+const VMStateInfo vmstate_info_fd = {
+    .name = "fd",
+    .get  = get_fd,
+    .put  = put_fd,
+};
+
 static int get_nullptr(QEMUFile *f, void *pv, size_t size,
                        const VMStateField *field)
 
-- 
1.8.3.1




* [PATCH V3 07/16] migration: cpr-transfer save and load
  2024-11-01 13:47 [PATCH V3 00/16] Live update: cpr-transfer Steve Sistare
                   ` (5 preceding siblings ...)
  2024-11-01 13:47 ` [PATCH V3 06/16] migration: VMSTATE_FD Steve Sistare
@ 2024-11-01 13:47 ` Steve Sistare
  2024-11-01 13:47 ` [PATCH V3 08/16] migration: cpr-uri parameter Steve Sistare
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 86+ messages in thread
From: Steve Sistare @ 2024-11-01 13:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Add functions to create a QEMUFile based on a unix URI, for saving or
loading, for use by cpr-transfer mode to preserve CPR state.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 include/migration/cpr.h  |  3 ++
 migration/cpr-transfer.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++++
 migration/meson.build    |  1 +
 3 files changed, 85 insertions(+)
 create mode 100644 migration/cpr-transfer.c

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 6e4781c..ae318da 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -20,4 +20,7 @@ int cpr_state_load(Error **errp);
 void cpr_state_close(void);
 struct QIOChannel *cpr_state_ioc(void);
 
+QEMUFile *cpr_transfer_output(const char *uri, Error **errp);
+QEMUFile *cpr_transfer_input(const char *uri, Error **errp);
+
 #endif
diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c
new file mode 100644
index 0000000..fb9ecd8
--- /dev/null
+++ b/migration/cpr-transfer.c
@@ -0,0 +1,81 @@
+/*
+ * Copyright (c) 2022, 2024 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "io/channel-file.h"
+#include "io/channel-socket.h"
+#include "io/net-listener.h"
+#include "migration/cpr.h"
+#include "migration/migration.h"
+#include "migration/savevm.h"
+#include "migration/qemu-file.h"
+#include "migration/vmstate.h"
+
+QEMUFile *cpr_transfer_output(const char *uri, Error **errp)
+{
+    g_autoptr(MigrationChannel) channel = NULL;
+    QIOChannel *ioc;
+
+    if (!migrate_uri_parse(uri, &channel, errp)) {
+        return NULL;
+    }
+
+    if (channel->addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET &&
+        channel->addr->u.socket.type == SOCKET_ADDRESS_TYPE_UNIX) {
+
+        QIOChannelSocket *sioc = qio_channel_socket_new();
+        SocketAddress *saddr = &channel->addr->u.socket;
+
+        if (qio_channel_socket_connect_sync(sioc, saddr, errp)) {
+            object_unref(OBJECT(sioc));
+            return NULL;
+        }
+        ioc = QIO_CHANNEL(sioc);
+
+    } else {
+        error_setg(errp, "bad cpr-uri %s; must be unix:", uri);
+        return NULL;
+    }
+
+    qio_channel_set_name(ioc, "cpr-out");
+    return qemu_file_new_output(ioc);
+}
+
+QEMUFile *cpr_transfer_input(const char *uri, Error **errp)
+{
+    g_autoptr(MigrationChannel) channel = NULL;
+    QIOChannel *ioc;
+
+    if (!migrate_uri_parse(uri, &channel, errp)) {
+        return NULL;
+    }
+
+    if (channel->addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET &&
+        channel->addr->u.socket.type == SOCKET_ADDRESS_TYPE_UNIX) {
+
+        QIOChannelSocket *sioc;
+        SocketAddress *saddr = &channel->addr->u.socket;
+        QIONetListener *listener = qio_net_listener_new();
+
+        qio_net_listener_set_name(listener, "cpr-socket-listener");
+        if (qio_net_listener_open_sync(listener, saddr, 1, errp) < 0) {
+            object_unref(OBJECT(listener));
+            return NULL;
+        }
+
+        sioc = qio_net_listener_wait_client(listener);
+        ioc = QIO_CHANNEL(sioc);
+
+    } else {
+        error_setg(errp, "bad cpr-uri %s; must be unix:", uri);
+        return NULL;
+    }
+
+    qio_channel_set_name(ioc, "cpr-in");
+    return qemu_file_new_input(ioc);
+}
diff --git a/migration/meson.build b/migration/meson.build
index e5f4211..684ba98 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -14,6 +14,7 @@ system_ss.add(files(
   'channel.c',
   'channel-block.c',
   'cpr.c',
+  'cpr-transfer.c',
   'dirtyrate.c',
   'exec.c',
   'fd.c',
-- 
1.8.3.1




* [PATCH V3 08/16] migration: cpr-uri parameter
  2024-11-01 13:47 [PATCH V3 00/16] Live update: cpr-transfer Steve Sistare
                   ` (6 preceding siblings ...)
  2024-11-01 13:47 ` [PATCH V3 07/16] migration: cpr-transfer save and load Steve Sistare
@ 2024-11-01 13:47 ` Steve Sistare
  2024-11-01 13:47 ` [PATCH V3 09/16] migration: cpr-uri option Steve Sistare
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 86+ messages in thread
From: Steve Sistare @ 2024-11-01 13:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Define the cpr-uri migration parameter to specify the URI to which
CPR vmstate is saved for cpr-transfer mode.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Peter Xu <peterx@redhat.com>
---
 migration/migration-hmp-cmds.c | 10 ++++++++++
 migration/options.c            | 28 ++++++++++++++++++++++++++++
 migration/options.h            |  1 +
 qapi/migration.json            | 18 +++++++++++++++---
 4 files changed, 54 insertions(+), 3 deletions(-)

diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 20d1a6e..79d8c66 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -358,6 +358,11 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
                                MIGRATION_PARAMETER_DIRECT_IO),
                            params->direct_io ? "on" : "off");
         }
+
+        assert(params->cpr_uri);
+        monitor_printf(mon, "%s: '%s'\n",
+            MigrationParameter_str(MIGRATION_PARAMETER_CPR_URI),
+            params->cpr_uri);
     }
 
     qapi_free_MigrationParameters(params);
@@ -639,6 +644,11 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
         p->has_direct_io = true;
         visit_type_bool(v, param, &p->direct_io, &err);
         break;
+    case MIGRATION_PARAMETER_CPR_URI:
+        p->cpr_uri = g_new0(StrOrNull, 1);
+        p->cpr_uri->type = QTYPE_QSTRING;
+        visit_type_str(v, param, &p->cpr_uri->u.s, &err);
+        break;
     default:
         g_assert_not_reached();
     }
diff --git a/migration/options.c b/migration/options.c
index ad8d698..82de1d8 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -172,6 +172,8 @@ Property migration_properties[] = {
     DEFINE_PROP_ZERO_PAGE_DETECTION("zero-page-detection", MigrationState,
                        parameters.zero_page_detection,
                        ZERO_PAGE_DETECTION_MULTIFD),
+    DEFINE_PROP_STRING("cpr-uri", MigrationState,
+                       parameters.cpr_uri),
 
     /* Migration capabilities */
     DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
@@ -837,6 +839,13 @@ ZeroPageDetection migrate_zero_page_detection(void)
     return s->parameters.zero_page_detection;
 }
 
+const char *migrate_cpr_uri(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    return s->parameters.cpr_uri;
+}
+
 /* parameters helpers */
 
 AnnounceParameters *migrate_announce_params(void)
@@ -922,6 +931,7 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
     params->zero_page_detection = s->parameters.zero_page_detection;
     params->has_direct_io = true;
     params->direct_io = s->parameters.direct_io;
+    params->cpr_uri = g_strdup(s->parameters.cpr_uri);
 
     return params;
 }
@@ -956,6 +966,7 @@ void migrate_params_init(MigrationParameters *params)
     params->has_mode = true;
     params->has_zero_page_detection = true;
     params->has_direct_io = true;
+    params->cpr_uri = g_strdup("");
 }
 
 /*
@@ -1255,6 +1266,11 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
     if (params->has_direct_io) {
         dest->direct_io = params->direct_io;
     }
+
+    if (params->cpr_uri) {
+        assert(params->cpr_uri->type == QTYPE_QSTRING);
+        dest->cpr_uri = params->cpr_uri->u.s;
+    }
 }
 
 static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
@@ -1387,6 +1403,12 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
     if (params->has_direct_io) {
         s->parameters.direct_io = params->direct_io;
     }
+
+    if (params->cpr_uri) {
+        g_free(s->parameters.cpr_uri);
+        assert(params->cpr_uri->type == QTYPE_QSTRING);
+        s->parameters.cpr_uri = g_strdup(params->cpr_uri->u.s);
+    }
 }
 
 void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
@@ -1413,6 +1435,12 @@ void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
         params->tls_authz->u.s = strdup("");
     }
 
+    if (params->cpr_uri && params->cpr_uri->type == QTYPE_QNULL) {
+        qobject_unref(params->cpr_uri->u.n);
+        params->cpr_uri->type = QTYPE_QSTRING;
+        params->cpr_uri->u.s = strdup("");
+    }
+
     migrate_params_test_apply(params, &tmp);
 
     if (!migrate_params_check(&tmp, errp)) {
diff --git a/migration/options.h b/migration/options.h
index 79084ee..2b4c082 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -84,6 +84,7 @@ const char *migrate_tls_creds(void);
 const char *migrate_tls_hostname(void);
 uint64_t migrate_xbzrle_cache_size(void);
 ZeroPageDetection migrate_zero_page_detection(void);
+const char *migrate_cpr_uri(void);
 
 /* parameters helpers */
 
diff --git a/qapi/migration.json b/qapi/migration.json
index 3af6aa1..5bf3e49 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -844,6 +844,9 @@
 #     only has effect if the @mapped-ram capability is enabled.
 #     (Since 9.1)
 #
+# @cpr-uri: URI for an additional migration channel needed by
+#     @cpr-transfer mode. (Since 9.2)
+#
 # Features:
 #
 # @unstable: Members @x-checkpoint-delay and
@@ -870,7 +873,8 @@
            'vcpu-dirty-limit',
            'mode',
            'zero-page-detection',
-           'direct-io'] }
+           'direct-io',
+           'cpr-uri'] }
 
 ##
 # @MigrateSetParameters:
@@ -1025,6 +1029,9 @@
 #     only has effect if the @mapped-ram capability is enabled.
 #     (Since 9.1)
 #
+# @cpr-uri: URI for an additional migration channel needed by
+#     @cpr-transfer mode. (Since 9.2)
+#
 # Features:
 #
 # @unstable: Members @x-checkpoint-delay and
@@ -1066,7 +1073,8 @@
             '*vcpu-dirty-limit': 'uint64',
             '*mode': 'MigMode',
             '*zero-page-detection': 'ZeroPageDetection',
-            '*direct-io': 'bool' } }
+            '*direct-io': 'bool',
+            '*cpr-uri': 'StrOrNull' } }
 
 ##
 # @migrate-set-parameters:
@@ -1235,6 +1243,9 @@
 #     only has effect if the @mapped-ram capability is enabled.
 #     (Since 9.1)
 #
+# @cpr-uri: URI for an additional migration channel needed by
+#     @cpr-transfer mode. (Since 9.2)
+#
 # Features:
 #
 # @unstable: Members @x-checkpoint-delay and
@@ -1273,7 +1284,8 @@
             '*vcpu-dirty-limit': 'uint64',
             '*mode': 'MigMode',
             '*zero-page-detection': 'ZeroPageDetection',
-            '*direct-io': 'bool' } }
+            '*direct-io': 'bool',
+            '*cpr-uri': 'str' } }
 
 ##
 # @query-migrate-parameters:
-- 
1.8.3.1




* [PATCH V3 09/16] migration: cpr-uri option
  2024-11-01 13:47 [PATCH V3 00/16] Live update: cpr-transfer Steve Sistare
                   ` (7 preceding siblings ...)
  2024-11-01 13:47 ` [PATCH V3 08/16] migration: cpr-uri parameter Steve Sistare
@ 2024-11-01 13:47 ` Steve Sistare
  2024-11-01 13:47 ` [PATCH V3 10/16] migration: split qmp_migrate Steve Sistare
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 86+ messages in thread
From: Steve Sistare @ 2024-11-01 13:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Define the cpr-uri QEMU command-line option to specify the URI from
which CPR vmstate is loaded for cpr-transfer mode.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Peter Xu <peterx@redhat.com>
---
 include/migration/cpr.h |  3 +++
 migration/cpr.c         | 12 ++++++++++++
 qemu-options.hx         |  8 ++++++++
 system/vl.c             |  4 ++++
 4 files changed, 27 insertions(+)

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index ae318da..c9c291f 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -15,6 +15,9 @@ void cpr_save_fd(const char *name, int id, int fd);
 void cpr_delete_fd(const char *name, int id);
 int cpr_find_fd(const char *name, int id);
 
+void cpr_set_cpr_uri(const char *uri);
+const char *cpr_get_cpr_uri(void);
+
 int cpr_state_save(Error **errp);
 int cpr_state_load(Error **errp);
 void cpr_state_close(void);
diff --git a/migration/cpr.c b/migration/cpr.c
index be1dc92..b72d1f4 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -116,6 +116,18 @@ QIOChannel *cpr_state_ioc(void)
     return qemu_file_get_ioc(cpr_state_file);
 }
 
+static char *cpr_uri;
+
+void cpr_set_cpr_uri(const char *uri)
+{
+    cpr_uri = g_strdup(uri);
+}
+
+const char *cpr_get_cpr_uri(void)
+{
+    return cpr_uri;
+}
+
 int cpr_state_save(Error **errp)
 {
     int ret;
diff --git a/qemu-options.hx b/qemu-options.hx
index fdd6bf2..89bbc9f 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -4922,6 +4922,14 @@ SRST
 
 ERST
 
+DEF("cpr-uri", HAS_ARG, QEMU_OPTION_cpr_uri, \
+    "-cpr-uri unix:socketpath\n",
+    QEMU_ARCH_ALL)
+SRST
+``-cpr-uri unix:socketpath``
+    URI for incoming CPR state, for the cpr-transfer migration mode.
+ERST
+
 DEF("incoming", HAS_ARG, QEMU_OPTION_incoming, \
     "-incoming tcp:[host]:port[,to=maxport][,ipv4=on|off][,ipv6=on|off]\n" \
     "-incoming rdma:host:port[,ipv4=on|off][,ipv6=on|off]\n" \
diff --git a/system/vl.c b/system/vl.c
index df26264..5d08fade 100644
--- a/system/vl.c
+++ b/system/vl.c
@@ -77,6 +77,7 @@
 #include "hw/block/block.h"
 #include "hw/i386/x86.h"
 #include "hw/i386/pc.h"
+#include "migration/cpr.h"
 #include "migration/misc.h"
 #include "migration/snapshot.h"
 #include "sysemu/tpm.h"
@@ -3479,6 +3480,9 @@ void qemu_init(int argc, char **argv)
                     exit(1);
                 }
                 break;
+            case QEMU_OPTION_cpr_uri:
+                cpr_set_cpr_uri(optarg);
+                break;
             case QEMU_OPTION_incoming:
                 if (!incoming) {
                     runstate_set(RUN_STATE_INMIGRATE);
-- 
1.8.3.1




* [PATCH V3 10/16] migration: split qmp_migrate
  2024-11-01 13:47 [PATCH V3 00/16] Live update: cpr-transfer Steve Sistare
                   ` (8 preceding siblings ...)
  2024-11-01 13:47 ` [PATCH V3 09/16] migration: cpr-uri option Steve Sistare
@ 2024-11-01 13:47 ` Steve Sistare
  2024-11-13 21:11   ` Peter Xu
  2024-11-01 13:47 ` [PATCH V3 11/16] migration: cpr-transfer mode Steve Sistare
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 86+ messages in thread
From: Steve Sistare @ 2024-11-01 13:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Split qmp_migrate into start and finish functions.  Finish will be
called asynchronously in a subsequent patch, but for now, call it
immediately.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/migration.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index 6dc7c09..86b3f39 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1521,6 +1521,7 @@ static void migrate_fd_error(MigrationState *s, const Error *error)
 static void migrate_fd_cancel(MigrationState *s)
 {
     int old_state ;
+    bool setup = (s->state == MIGRATION_STATUS_SETUP);
 
     trace_migrate_fd_cancel();
 
@@ -1565,6 +1566,15 @@ static void migrate_fd_cancel(MigrationState *s)
             s->block_inactive = false;
         }
     }
+
+    /*
+     * If qmp_migrate_finish has not been called, then there is no path that
+     * will complete the cancellation.  Do it now.
+     */
+    if (setup && !s->to_dst_file) {
+        migrate_set_state(&s->state, s->state, MIGRATION_STATUS_CANCELLED);
+        vm_resume(s->vm_old_state);
+    }
 }
 
 void migration_add_notifier_mode(NotifierWithReturn *notify,
@@ -2072,6 +2082,9 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
     return true;
 }
 
+static void qmp_migrate_finish(MigrationAddress *addr, bool resume_requested,
+                               Error **errp);
+
 void qmp_migrate(const char *uri, bool has_channels,
                  MigrationChannelList *channels, bool has_detach, bool detach,
                  bool has_resume, bool resume, Error **errp)
@@ -2118,6 +2131,20 @@ void qmp_migrate(const char *uri, bool has_channels,
         return;
     }
 
+    qmp_migrate_finish(addr, resume_requested, errp);
+
+    if (local_err) {
+        migrate_fd_error(s, local_err);
+        error_propagate(errp, local_err);
+    }
+}
+
+static void qmp_migrate_finish(MigrationAddress *addr, bool resume_requested,
+                               Error **errp)
+{
+    MigrationState *s = migrate_get_current();
+    Error *local_err = NULL;
+
     if (!resume_requested) {
         if (!yank_register_instance(MIGRATION_YANK_INSTANCE, errp)) {
             return;
-- 
1.8.3.1




* [PATCH V3 11/16] migration: cpr-transfer mode
  2024-11-01 13:47 [PATCH V3 00/16] Live update: cpr-transfer Steve Sistare
                   ` (9 preceding siblings ...)
  2024-11-01 13:47 ` [PATCH V3 10/16] migration: split qmp_migrate Steve Sistare
@ 2024-11-01 13:47 ` Steve Sistare
  2024-11-13 21:58   ` Peter Xu
  2024-11-01 13:47 ` [PATCH V3 12/16] tests/migration-test: memory_backend Steve Sistare
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 86+ messages in thread
From: Steve Sistare @ 2024-11-01 13:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Add the cpr-transfer migration mode.  Usage:
  qemu-system-$arch -machine anon-alloc=memfd ...

  Start new QEMU with "-incoming <uri-1> -cpr-uri <uri-2>"

  Issue commands to old QEMU:
  migrate_set_parameter mode cpr-transfer
  migrate_set_parameter cpr-uri <uri-2>
  migrate -d <uri-1>

The migrate command stops the VM, saves CPR state to uri-2, saves
normal migration state to uri-1, and old QEMU enters the postmigrate
state.  The user starts new QEMU on the same host as old QEMU, with the
same arguments as old QEMU, plus the -incoming and -cpr-uri options.
Guest RAM is preserved in place, albeit with new virtual addresses in
new QEMU.

This mode requires a second migration channel, specified by the
cpr-uri migration parameter on the outgoing side, and by the cpr-uri
QEMU command-line option on the incoming side.  The channel must be
of a type that supports SCM_RIGHTS, such as a unix socket.

Memory-backend objects must have the share=on attribute;
memory-backend-epc is not supported.  The VM must be started with
the '-machine anon-alloc=memfd' option, which allows anonymous
memory to be transferred in place to the new process.  The memfds
are kept open by sending the descriptors to new QEMU via the
cpr-uri, which must support SCM_RIGHTS, and they are mmap'd
in new QEMU.
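
The "preserved in place, albeit with new virtual addresses" property
comes from mapping the same memfd again.  The sketch below shows the
idea within a single process: two MAP_SHARED mappings of one memfd sit
at different addresses but are backed by the same pages.  demo_remap is
a hypothetical name; in cpr-transfer the second mapping would be made by
new QEMU after it receives the descriptor.  The sketch assumes Linux
with glibc 2.27 or later for memfd_create.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one memfd twice and check that data written through the first
 * mapping is visible through the second.  Returns 0 on success.
 */
static int demo_remap(void)
{
    size_t size = (size_t)sysconf(_SC_PAGESIZE);
    int fd = memfd_create("guest-ram", 0);
    char *old_map, *new_map;
    int ok;

    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }

    /* "Old QEMU" view of the memory. */
    old_map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (old_map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    strcpy(old_map, "guest data");

    /* "New QEMU" view: same fd, new virtual address, no copy. */
    new_map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (new_map == MAP_FAILED) {
        munmap(old_map, size);
        close(fd);
        return -1;
    }

    ok = new_map != old_map && strcmp(new_map, "guest data") == 0;

    munmap(new_map, size);
    munmap(old_map, size);
    close(fd);
    return ok ? 0 : -1;
}
```

No data is copied between the two views, which is why the guest pause
time does not depend on the size of guest RAM in this sketch.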

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/cpr.c           | 29 ++++++++++++++---
 migration/migration.c     | 81 +++++++++++++++++++++++++++++++++++++++++++++--
 migration/migration.h     |  2 ++
 migration/options.c       | 12 +++++--
 migration/ram.c           |  2 ++
 migration/trace-events    |  2 ++
 migration/vmstate-types.c |  1 +
 qapi/migration.json       | 35 +++++++++++++++++++-
 stubs/vmstate.c           |  7 ++++
 system/vl.c               |  6 ++++
 10 files changed, 167 insertions(+), 10 deletions(-)

diff --git a/migration/cpr.c b/migration/cpr.c
index b72d1f4..3f3ef43 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -45,7 +45,7 @@ static const VMStateDescription vmstate_cpr_fd = {
         VMSTATE_UINT32(namelen, CprFd),
         VMSTATE_VBUFFER_ALLOC_UINT32(name, CprFd, 0, NULL, namelen),
         VMSTATE_INT32(id, CprFd),
-        VMSTATE_INT32(fd, CprFd),
+        VMSTATE_FD(fd, CprFd),
         VMSTATE_END_OF_LIST()
     }
 };
@@ -132,9 +132,18 @@ int cpr_state_save(Error **errp)
 {
     int ret;
     QEMUFile *f;
+    MigMode mode = migrate_mode();
 
-    /* set f based on mode in a later patch in this series */
-    return 0;
+    if (mode == MIG_MODE_CPR_TRANSFER) {
+        f = cpr_transfer_output(migrate_cpr_uri(), errp);
+    } else {
+        return 0;
+    }
+    if (!f) {
+        return -1;
+    }
+
+    trace_cpr_state_save(MigMode_str(mode), migrate_cpr_uri());
 
     qemu_put_be32(f, QEMU_CPR_FILE_MAGIC);
     qemu_put_be32(f, QEMU_CPR_FILE_VERSION);
@@ -162,9 +171,19 @@ int cpr_state_load(Error **errp)
     int ret;
     uint32_t v;
     QEMUFile *f;
+    MigMode mode = 0;
 
-    /* set f based on other parameters in a later patch in this series */
-    return 0;
+    if (cpr_uri) {
+        mode = MIG_MODE_CPR_TRANSFER;
+        f = cpr_transfer_input(cpr_uri, errp);
+    } else {
+        return 0;
+    }
+    if (!f) {
+        return -1;
+    }
+
+    trace_cpr_state_load(MigMode_str(mode), cpr_uri);
 
     v = qemu_get_be32(f);
     if (v != QEMU_CPR_FILE_MAGIC) {
diff --git a/migration/migration.c b/migration/migration.c
index 86b3f39..5a53d01 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -76,6 +76,7 @@
 static NotifierWithReturnList migration_state_notifiers[] = {
     NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_NORMAL),
     NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_REBOOT),
+    NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_TRANSFER),
 };
 
 /* Messages sent on the return path from destination to source */
@@ -109,6 +110,7 @@ static int migration_maybe_pause(MigrationState *s,
 static void migrate_fd_cancel(MigrationState *s);
 static bool close_return_path_on_source(MigrationState *s);
 static void migration_completion_end(MigrationState *s);
+static void migrate_hup_delete(MigrationState *s);
 
 static void migration_downtime_start(MigrationState *s)
 {
@@ -204,6 +206,12 @@ migration_channels_and_transport_compatible(MigrationAddress *addr,
         return false;
     }
 
+    if (migrate_mode() == MIG_MODE_CPR_TRANSFER &&
+        addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
+        error_setg(errp, "Migration requires streamable transport (eg unix)");
+        return false;
+    }
+
     return true;
 }
 
@@ -316,6 +324,7 @@ void migration_cancel(const Error *error)
         qmp_cancel_vcpu_dirty_limit(false, -1, NULL);
     }
     migrate_fd_cancel(current_migration);
+    migrate_hup_delete(current_migration);
 }
 
 void migration_shutdown(void)
@@ -416,6 +425,7 @@ void migration_incoming_state_destroy(void)
         mis->postcopy_qemufile_dst = NULL;
     }
 
+    cpr_set_cpr_uri(NULL);
     yank_unregister_instance(MIGRATION_YANK_INSTANCE);
 }
 
@@ -717,6 +727,9 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
     } else {
         error_setg(errp, "unknown migration protocol: %s", uri);
     }
+
+    /* Close cpr socket to tell source that we are listening */
+    cpr_state_close();
 }
 
 static void process_incoming_migration_bh(void *opaque)
@@ -1413,6 +1426,8 @@ static void migrate_fd_cleanup(MigrationState *s)
     s->vmdesc = NULL;
 
     qemu_savevm_state_cleanup();
+    cpr_state_close();
+    migrate_hup_delete(s);
 
     close_return_path_on_source(s);
 
@@ -1573,6 +1588,8 @@ static void migrate_fd_cancel(MigrationState *s)
      */
     if (setup && !s->to_dst_file) {
         migrate_set_state(&s->state, s->state, MIGRATION_STATUS_CANCELLED);
+        cpr_state_close();
+        migrate_hup_delete(s);
         vm_resume(s->vm_old_state);
     }
 }
@@ -1707,7 +1724,9 @@ bool migration_thread_is_self(void)
 
 bool migrate_mode_is_cpr(MigrationState *s)
 {
-    return s->parameters.mode == MIG_MODE_CPR_REBOOT;
+    MigMode mode = s->parameters.mode;
+    return mode == MIG_MODE_CPR_REBOOT ||
+           mode == MIG_MODE_CPR_TRANSFER;
 }
 
 int migrate_init(MigrationState *s, Error **errp)
@@ -2042,6 +2061,12 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
         return false;
     }
 
+    if (migrate_mode() == MIG_MODE_CPR_TRANSFER &&
+        (!s->parameters.cpr_uri || !s->parameters.cpr_uri[0])) {
+        error_setg(errp, "cpr-transfer mode requires setting cpr-uri");
+        return false;
+    }
+
     if (migration_is_blocked(errp)) {
         return false;
     }
@@ -2085,6 +2110,37 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
 static void qmp_migrate_finish(MigrationAddress *addr, bool resume_requested,
                                Error **errp);
 
+static void migrate_hup_add(MigrationState *s, QIOChannel *ioc, GSourceFunc cb,
+                            void *opaque)
+{
+    s->hup_source = qio_channel_create_watch(ioc, G_IO_HUP);
+    g_source_set_callback(s->hup_source, cb, opaque, NULL);
+    g_source_attach(s->hup_source, NULL);
+}
+
+static void migrate_hup_delete(MigrationState *s)
+{
+    if (s->hup_source) {
+        g_source_destroy(s->hup_source);
+        g_source_unref(s->hup_source);
+        s->hup_source = NULL;
+    }
+}
+
+static gboolean qmp_migrate_finish_cb(QIOChannel *channel,
+                                      GIOCondition cond,
+                                      void *opaque)
+{
+    MigrationAddress *addr = opaque;
+
+    qmp_migrate_finish(addr, false, NULL);
+
+    cpr_state_close();
+    migrate_hup_delete(migrate_get_current());
+    qapi_free_MigrationAddress(addr);
+    return G_SOURCE_REMOVE;
+}
+
 void qmp_migrate(const char *uri, bool has_channels,
                  MigrationChannelList *channels, bool has_detach, bool detach,
                  bool has_resume, bool resume, Error **errp)
@@ -2131,8 +2187,29 @@ void qmp_migrate(const char *uri, bool has_channels,
         return;
     }
 
-    qmp_migrate_finish(addr, resume_requested, errp);
+    if (cpr_state_save(&local_err)) {
+        goto out;
+    }
 
+    /*
+     * For cpr-transfer, the target may not be listening yet on the migration
+     * channel, because first it must finish cpr_state_load.  The target tells
+     * us it is listening by closing the cpr-state socket.  Wait for that HUP
+     * event before connecting in qmp_migrate_finish.
+     *
+     * The HUP could occur because the target fails while reading CPR state,
+     * in which case the target will not listen for the incoming migration
+     * connection, so qmp_migrate_finish will fail to connect, and then recover.
+     */
+    if (s->parameters.mode == MIG_MODE_CPR_TRANSFER) {
+        migrate_hup_add(s, cpr_state_ioc(), (GSourceFunc)qmp_migrate_finish_cb,
+                        QAPI_CLONE(MigrationAddress, addr));
+
+    } else {
+        qmp_migrate_finish(addr, resume_requested, errp);
+    }
+
+out:
     if (local_err) {
         migrate_fd_error(s, local_err);
         error_propagate(errp, local_err);
diff --git a/migration/migration.h b/migration/migration.h
index 38aa140..74c167b 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -457,6 +457,8 @@ struct MigrationState {
     bool switchover_acked;
     /* Is this a rdma migration */
     bool rdma_migration;
+
+    GSource *hup_source;
 };
 
 void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
diff --git a/migration/options.c b/migration/options.c
index 82de1d8..3733bc9 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -22,6 +22,7 @@
 #include "qapi/qmp/qnull.h"
 #include "sysemu/runstate.h"
 #include "migration/colo.h"
+#include "migration/cpr.h"
 #include "migration/misc.h"
 #include "migration.h"
 #include "migration-stats.h"
@@ -747,9 +748,16 @@ uint64_t migrate_max_postcopy_bandwidth(void)
 
 MigMode migrate_mode(void)
 {
-    MigrationState *s = migrate_get_current();
-    MigMode mode = s->parameters.mode;
+    MigMode mode;
 
+    /*
+     * cpr_uri is only set during the early cpr-transfer loading stage,
+     * after which it is cleared.
+     */
+    if (cpr_get_cpr_uri()) {
+        return MIG_MODE_CPR_TRANSFER;
+    }
+    mode = migrate_get_current()->parameters.mode;
     assert(mode >= 0 && mode < MIG_MODE__MAX);
     return mode;
 }
diff --git a/migration/ram.c b/migration/ram.c
index 326ce7e..bafe41b 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -216,7 +216,9 @@ static bool postcopy_preempt_active(void)
 
 bool migrate_ram_is_ignored(RAMBlock *block)
 {
+    MigMode mode = migrate_mode();
     return !qemu_ram_is_migratable(block) ||
+           mode == MIG_MODE_CPR_TRANSFER ||
            (migrate_ignore_shared() && qemu_ram_is_shared(block)
                                     && qemu_ram_is_named_file(block));
 }
diff --git a/migration/trace-events b/migration/trace-events
index 345506b..455dec5 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -347,6 +347,8 @@ colo_failover_set_state(const char *new_state) "new state %s"
 cpr_save_fd(const char *name, int id, int fd) "%s, id %d, fd %d"
 cpr_delete_fd(const char *name, int id) "%s, id %d"
 cpr_find_fd(const char *name, int id, int fd) "%s, id %d returns %d"
+cpr_state_save(const char *mode, const char *uri) "%s to %s"
+cpr_state_load(const char *mode, const char *uri) "%s from %s"
 
 # block-dirty-bitmap.c
 send_bitmap_header_enter(void) ""
diff --git a/migration/vmstate-types.c b/migration/vmstate-types.c
index f31deb3..2210f0c 100644
--- a/migration/vmstate-types.c
+++ b/migration/vmstate-types.c
@@ -15,6 +15,7 @@
 #include "qemu-file.h"
 #include "migration.h"
 #include "migration/vmstate.h"
+#include "migration/client-options.h"
 #include "qemu/error-report.h"
 #include "qemu/queue.h"
 #include "trace.h"
diff --git a/qapi/migration.json b/qapi/migration.json
index 5bf3e49..3328d1b 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -614,9 +614,42 @@
 #     or COLO.
 #
 #     (since 8.2)
+#
+# @cpr-transfer: This mode allows the user to transfer a guest to a
+#     new QEMU instance on the same host with minimal guest pause
+#     time, by preserving guest RAM in place, albeit with new virtual
+#     addresses in new QEMU.
+#
+#     The user starts new QEMU on the same host as old QEMU, with the
+#     same arguments as old QEMU, plus the -incoming option.  The
+#     user issues the migrate command to old QEMU, which stops the VM,
+#     saves state to the migration channels, and enters the
+#     postmigrate state.  Execution resumes in new QEMU.
+#
+#     This mode requires a second migration channel, specified by the
+#     cpr-uri migration property on the outgoing side, and by the
+#     cpr-uri QEMU command-line option on the incoming side.  The
+#     channel must be a type, such as unix socket, that supports
+#     SCM_RIGHTS.
+#
+#     Memory-backend objects must have the share=on attribute, but
+#     memory-backend-epc and memory-backend-ram are not supported.
+#     The VM must be started with the '-machine anon-alloc=memfd'
+#     option.
+#
+#     The incoming migration channel cannot be a file type, and for
+#     the tcp type, the port cannot be 0 (meaning dynamically choose
+#     a port).
+#
+#     When using -incoming defer, you must issue the migrate command
+#     to old QEMU before issuing any monitor commands to new QEMU.
+#     However, new QEMU does not open and read the migration stream
+#     until you issue the migrate incoming command.
+#
+#     (since 9.2)
 ##
 { 'enum': 'MigMode',
-  'data': [ 'normal', 'cpr-reboot' ] }
+  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
 
 ##
 # @ZeroPageDetection:
diff --git a/stubs/vmstate.c b/stubs/vmstate.c
index 8513d92..c190762 100644
--- a/stubs/vmstate.c
+++ b/stubs/vmstate.c
@@ -1,5 +1,7 @@
 #include "qemu/osdep.h"
 #include "migration/vmstate.h"
+#include "qapi/qapi-types-migration.h"
+#include "migration/client-options.h"
 
 int vmstate_register_with_alias_id(VMStateIf *obj,
                                    uint32_t instance_id,
@@ -21,3 +23,8 @@ bool vmstate_check_only_migratable(const VMStateDescription *vmsd)
 {
     return true;
 }
+
+MigMode migrate_mode(void)
+{
+    return MIG_MODE_NORMAL;
+}
diff --git a/system/vl.c b/system/vl.c
index 5d08fade..9bd0e33 100644
--- a/system/vl.c
+++ b/system/vl.c
@@ -3714,6 +3714,12 @@ void qemu_init(int argc, char **argv)
 
     qemu_create_machine(machine_opts_dict);
 
+    /*
+     * Load incoming CPR state before any devices are created, because it
+     * contains file descriptors that are needed in device initialization code.
+     */
+    cpr_state_load(&error_fatal);
+
     suspend_mux_open();
 
     qemu_disable_default_devices();
-- 
1.8.3.1




* [PATCH V3 12/16] tests/migration-test: memory_backend
  2024-11-01 13:47 [PATCH V3 00/16] Live update: cpr-transfer Steve Sistare
                   ` (10 preceding siblings ...)
  2024-11-01 13:47 ` [PATCH V3 11/16] migration: cpr-transfer mode Steve Sistare
@ 2024-11-01 13:47 ` Steve Sistare
  2024-11-13 22:19   ` Fabiano Rosas
  2024-11-01 13:47 ` [PATCH V3 13/16] tests/qtest: defer connection Steve Sistare
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 86+ messages in thread
From: Steve Sistare @ 2024-11-01 13:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Allow each migration test to define its own memory backend, replacing
the standard "-m <size>" specification.
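
As an illustration of the substitution, the sketch below expands a
per-test format string containing one %s, and falls back to the "-m"
form when none is given.  It uses snprintf rather than the
g_strdup_printf call in the patch, and format_memory_args is a
hypothetical helper name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Expand a memory backend format string that contains one %s for the
 * size.  A NULL format selects the default "-m <size>" form.
 * Returns 0 on success, -1 if the result does not fit in out.
 */
int format_memory_args(char *out, size_t outlen,
                       const char *backend_fmt, const char *size)
{
    const char *fmt = backend_fmt ? backend_fmt : "-m %s";
    int n = snprintf(out, outlen, fmt, size);

    return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
}
```

A test would pass something like
"-object memory-backend-memfd,id=pc.ram,size=%s" as the format, and the
size chosen by test_migrate_start is plugged in for both source and
target command lines.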

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 tests/qtest/migration-test.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 95e45b5..a008316 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -609,6 +609,11 @@ typedef struct {
     const char *opts_target;
     /* suspend the src before migrating to dest. */
     bool suspend_me;
+    /*
+     * Format string for the main memory backend, containing one %s where the
+     * size is plugged in.  If omitted, "-m %s" is used.
+     */
+    const char *memory_backend;
 } MigrateStart;
 
 /*
@@ -727,6 +732,7 @@ static int test_migrate_start(QTestState **from, QTestState **to,
     const char *memory_size;
     const char *machine_alias, *machine_opts = "";
     g_autofree char *machine = NULL;
+    g_autofree char *memory_backend = NULL;
 
     if (args->use_shmem) {
         if (!g_file_test("/dev/shm", G_FILE_TEST_IS_DIR)) {
@@ -802,6 +808,12 @@ static int test_migrate_start(QTestState **from, QTestState **to,
             memory_size, shmem_path);
     }
 
+    if (args->memory_backend) {
+        memory_backend = g_strdup_printf(args->memory_backend, memory_size);
+    } else {
+        memory_backend = g_strdup_printf("-m %s ", memory_size);
+    }
+
     if (args->use_dirty_ring) {
         kvm_opts = ",dirty-ring-size=4096";
     }
@@ -820,12 +832,12 @@ static int test_migrate_start(QTestState **from, QTestState **to,
     cmd_source = g_strdup_printf("-accel kvm%s -accel tcg "
                                  "-machine %s,%s "
                                  "-name source,debug-threads=on "
-                                 "-m %s "
+                                 "%s "
                                  "-serial file:%s/src_serial "
                                  "%s %s %s %s %s",
                                  kvm_opts ? kvm_opts : "",
                                  machine, machine_opts,
-                                 memory_size, tmpfs,
+                                 memory_backend, tmpfs,
                                  arch_opts ? arch_opts : "",
                                  arch_source ? arch_source : "",
                                  shmem_opts ? shmem_opts : "",
@@ -841,13 +853,13 @@ static int test_migrate_start(QTestState **from, QTestState **to,
     cmd_target = g_strdup_printf("-accel kvm%s -accel tcg "
                                  "-machine %s,%s "
                                  "-name target,debug-threads=on "
-                                 "-m %s "
+                                 "%s "
                                  "-serial file:%s/dest_serial "
                                  "-incoming %s "
                                  "%s %s %s %s %s",
                                  kvm_opts ? kvm_opts : "",
                                  machine, machine_opts,
-                                 memory_size, tmpfs, uri,
+                                 memory_backend, tmpfs, uri,
                                  arch_opts ? arch_opts : "",
                                  arch_target ? arch_target : "",
                                  shmem_opts ? shmem_opts : "",
-- 
1.8.3.1




* [PATCH V3 13/16] tests/qtest: defer connection
  2024-11-01 13:47 [PATCH V3 00/16] Live update: cpr-transfer Steve Sistare
                   ` (11 preceding siblings ...)
  2024-11-01 13:47 ` [PATCH V3 12/16] tests/migration-test: memory_backend Steve Sistare
@ 2024-11-01 13:47 ` Steve Sistare
  2024-11-13 22:36   ` Fabiano Rosas
  2024-11-13 22:53   ` Peter Xu
  2024-11-01 13:47 ` [PATCH V3 14/16] tests/migration-test: " Steve Sistare
                   ` (2 subsequent siblings)
  15 siblings, 2 replies; 86+ messages in thread
From: Steve Sistare @ 2024-11-01 13:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Add an option to defer connecting to the monitor and qtest sockets when
calling qtest_init_with_env.  The client makes the connection later by
calling qtest_connect_deferred and qtest_qmp_handshake.
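
The split works because a listening socket queues incoming connections
until accept() is called.  The sketch below shows that pattern in
isolation with plain POSIX sockets; DeferredConn, deferred_init,
deferred_connect and demo_deferred_connect are hypothetical names, not
libqtest API, and the demo plays both peers in one process.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical stand-in for the socket fields of QTestState. */
typedef struct {
    int listen_sock;            /* created at init time */
    int fd;                     /* accepted connection, -1 until then */
    struct sockaddr_un addr;    /* remembered so the path can be unlinked */
} DeferredConn;

/* Bind and listen on path, but do not accept yet. */
int deferred_init(DeferredConn *c, const char *path)
{
    c->fd = -1;
    c->addr = (struct sockaddr_un){ .sun_family = AF_UNIX };
    snprintf(c->addr.sun_path, sizeof(c->addr.sun_path), "%s", path);

    c->listen_sock = socket(AF_UNIX, SOCK_STREAM, 0);
    if (c->listen_sock < 0) {
        return -1;
    }
    if (bind(c->listen_sock, (struct sockaddr *)&c->addr,
             sizeof(c->addr)) < 0 ||
        listen(c->listen_sock, 1) < 0) {
        close(c->listen_sock);
        return -1;
    }
    return 0;
}

/* Complete the connection later: accept, then remove the socket file. */
int deferred_connect(DeferredConn *c)
{
    c->fd = accept(c->listen_sock, NULL, NULL);
    unlink(c->addr.sun_path);
    return c->fd >= 0 ? 0 : -1;
}

/* The peer connects first; the accept happens only when we ask for it. */
bool demo_deferred_connect(void)
{
    char dir[] = "/tmp/deferXXXXXX";
    char path[64];
    DeferredConn c;
    int peer;
    bool ok;

    if (!mkdtemp(dir)) {
        return false;
    }
    snprintf(path, sizeof(path), "%s/qtest.sock", dir);
    if (deferred_init(&c, path) < 0) {
        rmdir(dir);
        return false;
    }
    peer = socket(AF_UNIX, SOCK_STREAM, 0);
    ok = peer >= 0 &&
         connect(peer, (struct sockaddr *)&c.addr, sizeof(c.addr)) == 0 &&
         c.fd == -1 &&                  /* not accepted yet */
         deferred_connect(&c) == 0;     /* accepted on demand */

    if (peer >= 0) {
        close(peer);
    }
    if (c.fd >= 0) {
        close(c.fd);
    }
    close(c.listen_sock);
    unlink(path);
    rmdir(dir);
    return ok;
}
```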

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 tests/qtest/libqtest.c       | 69 +++++++++++++++++++++++++++++---------------
 tests/qtest/libqtest.h       | 19 +++++++++++-
 tests/qtest/migration-test.c |  4 +--
 3 files changed, 65 insertions(+), 27 deletions(-)

diff --git a/tests/qtest/libqtest.c b/tests/qtest/libqtest.c
index 9d07de1..95408fb 100644
--- a/tests/qtest/libqtest.c
+++ b/tests/qtest/libqtest.c
@@ -75,6 +75,8 @@ struct QTestState
 {
     int fd;
     int qmp_fd;
+    int sock;
+    int qmpsock;
     pid_t qemu_pid;  /* our child QEMU process */
     int wstatus;
 #ifdef _WIN32
@@ -443,7 +445,8 @@ static QTestState *G_GNUC_PRINTF(2, 3) qtest_spawn_qemu(const char *qemu_bin,
 }
 
 static QTestState *qtest_init_internal(const char *qemu_bin,
-                                       const char *extra_args)
+                                       const char *extra_args,
+                                       bool defer_connect)
 {
     QTestState *s;
     int sock, qmpsock, i;
@@ -485,22 +488,17 @@ static QTestState *qtest_init_internal(const char *qemu_bin,
     qtest_client_set_rx_handler(s, qtest_client_socket_recv_line);
     qtest_client_set_tx_handler(s, qtest_client_socket_send);
 
-    s->fd = socket_accept(sock);
-    if (s->fd >= 0) {
-        s->qmp_fd = socket_accept(qmpsock);
-    }
-    unlink(socket_path);
-    unlink(qmp_socket_path);
-    g_free(socket_path);
-    g_free(qmp_socket_path);
-
-    g_assert(s->fd >= 0 && s->qmp_fd >= 0);
-
     s->rx = g_string_new("");
     for (i = 0; i < MAX_IRQ; i++) {
         s->irq_level[i] = false;
     }
 
+    s->sock = sock;
+    s->qmpsock = qmpsock;
+    if (!defer_connect) {
+        qtest_connect_deferred(s);
+    }
+
     /*
      * Stopping QEMU for debugging is not supported on Windows.
      *
@@ -515,34 +513,57 @@ static QTestState *qtest_init_internal(const char *qemu_bin,
     }
 #endif
 
+    return s;
+}
+
+void qtest_connect_deferred(QTestState *s)
+{
+    g_autofree gchar *socket_path = NULL;
+    g_autofree gchar *qmp_socket_path = NULL;
+
+    socket_path = g_strdup_printf("%s/qtest-%d.sock",
+                                  g_get_tmp_dir(), getpid());
+    qmp_socket_path = g_strdup_printf("%s/qtest-%d.qmp",
+                                      g_get_tmp_dir(), getpid());
+
+    s->fd = socket_accept(s->sock);
+    if (s->fd >= 0) {
+        s->qmp_fd = socket_accept(s->qmpsock);
+    }
+    unlink(socket_path);
+    unlink(qmp_socket_path);
+    g_assert(s->fd >= 0 && s->qmp_fd >= 0);
     /* ask endianness of the target */
-
     s->big_endian = qtest_query_target_endianness(s);
-
-   return s;
 }
 
 QTestState *qtest_init_without_qmp_handshake(const char *extra_args)
 {
-    return qtest_init_internal(qtest_qemu_binary(NULL), extra_args);
+    return qtest_init_internal(qtest_qemu_binary(NULL), extra_args, false);
 }
 
-QTestState *qtest_init_with_env(const char *var, const char *extra_args)
+void qtest_qmp_handshake(QTestState *s)
 {
-    QTestState *s = qtest_init_internal(qtest_qemu_binary(var), extra_args);
-    QDict *greeting;
-
     /* Read the QMP greeting and then do the handshake */
-    greeting = qtest_qmp_receive(s);
+    QDict *greeting = qtest_qmp_receive(s);
     qobject_unref(greeting);
     qobject_unref(qtest_qmp(s, "{ 'execute': 'qmp_capabilities' }"));
+}
 
+QTestState *qtest_init_with_env(const char *var, const char *extra_args,
+                                bool defer_connect)
+{
+    QTestState *s = qtest_init_internal(qtest_qemu_binary(var), extra_args,
+                                        defer_connect);
+    if (!defer_connect) {
+        qtest_qmp_handshake(s);
+    }
     return s;
 }
 
 QTestState *qtest_init(const char *extra_args)
 {
-    return qtest_init_with_env(NULL, extra_args);
+    return qtest_init_with_env(NULL, extra_args, false);
 }
 
 QTestState *qtest_vinitf(const char *fmt, va_list ap)
@@ -1523,7 +1544,7 @@ static struct MachInfo *qtest_get_machines(const char *var)
 
     silence_spawn_log = !g_test_verbose();
 
-    qts = qtest_init_with_env(qemu_var, "-machine none");
+    qts = qtest_init_with_env(qemu_var, "-machine none", false);
     response = qtest_qmp(qts, "{ 'execute': 'query-machines' }");
     g_assert(response);
     list = qdict_get_qlist(response, "return");
@@ -1578,7 +1599,7 @@ static struct CpuModel *qtest_get_cpu_models(void)
 
     silence_spawn_log = !g_test_verbose();
 
-    qts = qtest_init_with_env(NULL, "-machine none");
+    qts = qtest_init_with_env(NULL, "-machine none", false);
     response = qtest_qmp(qts, "{ 'execute': 'query-cpu-definitions' }");
     g_assert(response);
     list = qdict_get_qlist(response, "return");
diff --git a/tests/qtest/libqtest.h b/tests/qtest/libqtest.h
index beb96b1..db76f2c 100644
--- a/tests/qtest/libqtest.h
+++ b/tests/qtest/libqtest.h
@@ -60,13 +60,15 @@ QTestState *qtest_init(const char *extra_args);
  * @var: Environment variable from where to take the QEMU binary
  * @extra_args: Other arguments to pass to QEMU.  CAUTION: these
  * arguments are subject to word splitting and shell evaluation.
+ * @defer_connect: do not connect to qemu monitor and qtest socket.
  *
  * Like qtest_init(), but use a different environment variable for the
  * QEMU binary.
  *
  * Returns: #QTestState instance.
  */
-QTestState *qtest_init_with_env(const char *var, const char *extra_args);
+QTestState *qtest_init_with_env(const char *var, const char *extra_args,
+                                bool defer_connect);
 
 /**
  * qtest_init_without_qmp_handshake:
@@ -78,6 +80,21 @@ QTestState *qtest_init_with_env(const char *var, const char *extra_args);
 QTestState *qtest_init_without_qmp_handshake(const char *extra_args);
 
 /**
+ * qtest_connect_deferred:
+ * @s: #QTestState instance to connect
+ * Connect to qemu monitor and qtest socket, after deferring them in
+ * qtest_init_with_env.  Does not handshake with the monitor.
+ */
+void qtest_connect_deferred(QTestState *s);
+
+/**
+ * qtest_qmp_handshake:
+ * @s: #QTestState instance to operate on.
+ * Perform handshake after connecting to qemu monitor.
+ */
+void qtest_qmp_handshake(QTestState *s);
+
+/**
  * qtest_init_with_serial:
  * @extra_args: other arguments to pass to QEMU.  CAUTION: these
  * arguments are subject to word splitting and shell evaluation.
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index a008316..d359b10 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -844,7 +844,7 @@ static int test_migrate_start(QTestState **from, QTestState **to,
                                  args->opts_source ? args->opts_source : "",
                                  ignore_stderr);
     if (!args->only_target) {
-        *from = qtest_init_with_env(QEMU_ENV_SRC, cmd_source);
+        *from = qtest_init_with_env(QEMU_ENV_SRC, cmd_source, false);
         qtest_qmp_set_event_callback(*from,
                                      migrate_watch_for_events,
                                      &src_state);
@@ -865,7 +865,7 @@ static int test_migrate_start(QTestState **from, QTestState **to,
                                  shmem_opts ? shmem_opts : "",
                                  args->opts_target ? args->opts_target : "",
                                  ignore_stderr);
-    *to = qtest_init_with_env(QEMU_ENV_DST, cmd_target);
+    *to = qtest_init_with_env(QEMU_ENV_DST, cmd_target, false);
     qtest_qmp_set_event_callback(*to,
                                  migrate_watch_for_events,
                                  &dst_state);
-- 
1.8.3.1




* [PATCH V3 14/16] tests/migration-test: defer connection
  2024-11-01 13:47 [PATCH V3 00/16] Live update: cpr-transfer Steve Sistare
                   ` (12 preceding siblings ...)
  2024-11-01 13:47 ` [PATCH V3 13/16] tests/qtest: defer connection Steve Sistare
@ 2024-11-01 13:47 ` Steve Sistare
  2024-11-14 12:46   ` Fabiano Rosas
  2024-11-01 13:47 ` [PATCH V3 15/16] migration-test: cpr-transfer Steve Sistare
  2024-11-01 13:47 ` [PATCH V3 16/16] migration: cpr-transfer documentation Steve Sistare
  15 siblings, 1 reply; 86+ messages in thread
From: Steve Sistare @ 2024-11-01 13:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Add an option to defer connection to the target monitor, needed by the
cpr-transfer test.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 tests/qtest/migration-test.c | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index d359b10..dfeb8d2 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -614,6 +614,9 @@ typedef struct {
      * size is plugged in.  If omitted, "-m %s" is used.
      */
     const char *memory_backend;
+
+    /* Do not connect to target monitor and qtest sockets in qtest_init */
+    bool defer_target_connect;
 } MigrateStart;
 
 /*
@@ -733,6 +736,7 @@ static int test_migrate_start(QTestState **from, QTestState **to,
     const char *machine_alias, *machine_opts = "";
     g_autofree char *machine = NULL;
     g_autofree char *memory_backend = NULL;
+    const char *events;
 
     if (args->use_shmem) {
         if (!g_file_test("/dev/shm", G_FILE_TEST_IS_DIR)) {
@@ -850,22 +854,31 @@ static int test_migrate_start(QTestState **from, QTestState **to,
                                      &src_state);
     }
 
+    /*
+     * If the monitor connection is deferred, enable events on the command line
+     * so none are missed.  This is for testing only, do not set migration
+     * options like this in general.
+     */
+    events = args->defer_target_connect ? "-global migration.x-events=on" : "";
+
     cmd_target = g_strdup_printf("-accel kvm%s -accel tcg "
                                  "-machine %s,%s "
                                  "-name target,debug-threads=on "
                                  "%s "
                                  "-serial file:%s/dest_serial "
                                  "-incoming %s "
-                                 "%s %s %s %s %s",
+                                 "%s %s %s %s %s %s",
                                  kvm_opts ? kvm_opts : "",
                                  machine, machine_opts,
                                  memory_backend, tmpfs, uri,
+                                 events,
                                  arch_opts ? arch_opts : "",
                                  arch_target ? arch_target : "",
                                  shmem_opts ? shmem_opts : "",
                                  args->opts_target ? args->opts_target : "",
                                  ignore_stderr);
-    *to = qtest_init_with_env(QEMU_ENV_DST, cmd_target, false);
+    *to = qtest_init_with_env(QEMU_ENV_DST, cmd_target,
+                              args->defer_target_connect);
     qtest_qmp_set_event_callback(*to,
                                  migrate_watch_for_events,
                                  &dst_state);
@@ -883,7 +896,9 @@ static int test_migrate_start(QTestState **from, QTestState **to,
      * to mimic as closer as that.
      */
     migrate_set_capability(*from, "events", true);
-    migrate_set_capability(*to, "events", true);
+    if (!args->defer_target_connect) {
+        migrate_set_capability(*to, "events", true);
+    }
 
     return 0;
 }
@@ -1751,6 +1766,11 @@ static void test_precopy_common(MigrateCommon *args)
 
     migrate_qmp(from, to, args->connect_uri, args->connect_channels, "{}");
 
+    if (args->start.defer_target_connect) {
+        qtest_connect_deferred(to);
+        qtest_qmp_handshake(to);
+    }
+
     if (args->result != MIG_TEST_SUCCEED) {
         bool allow_active = args->result == MIG_TEST_FAIL;
         wait_for_migration_fail(from, allow_active);
-- 
1.8.3.1




* [PATCH V3 15/16] migration-test: cpr-transfer
  2024-11-01 13:47 [PATCH V3 00/16] Live update: cpr-transfer Steve Sistare
                   ` (13 preceding siblings ...)
  2024-11-01 13:47 ` [PATCH V3 14/16] tests/migration-test: " Steve Sistare
@ 2024-11-01 13:47 ` Steve Sistare
  2024-11-01 13:47 ` [PATCH V3 16/16] migration: cpr-transfer documentation Steve Sistare
  15 siblings, 0 replies; 86+ messages in thread
From: Steve Sistare @ 2024-11-01 13:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Add a migration test for cpr-transfer mode.  Defer the connection to the
target monitor, else the test hangs because in cpr-transfer mode QEMU does
not listen for monitor connections until we send the migrate command to
source QEMU.

To test -incoming defer, send a migrate incoming command to the target,
after sending the migrate command to the source, as required by
cpr-transfer mode.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 tests/qtest/migration-test.c | 59 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index dfeb8d2..e364e9b 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -1769,6 +1769,9 @@ static void test_precopy_common(MigrateCommon *args)
     if (args->start.defer_target_connect) {
         qtest_connect_deferred(to);
         qtest_qmp_handshake(to);
+        if (!strcmp(args->listen_uri, "defer")) {
+            migrate_incoming_qmp(to, args->connect_uri, "{}");
+        }
     }
 
     if (args->result != MIG_TEST_SUCCEED) {
@@ -2413,6 +2416,58 @@ static void test_multifd_file_mapped_ram_fdset_dio(void)
 }
 #endif /* !_WIN32 */
 
+static char *get_cpr_uri(void)
+{
+    return g_strdup_printf("unix:%s/cpr.sock", tmpfs);
+}
+
+static void *test_mode_transfer_start(QTestState *from, QTestState *to)
+{
+    g_autofree char *cpr_uri = get_cpr_uri();
+
+    migrate_set_parameter_str(from, "mode", "cpr-transfer");
+    migrate_set_parameter_str(from, "cpr-uri", cpr_uri);
+    return NULL;
+}
+
+/*
+ * cpr-transfer mode cannot use the target monitor prior to starting the
+ * migration, and cannot connect synchronously to the monitor, so defer
+ * the target connection.
+ */
+static void test_mode_transfer_common(bool incoming_defer)
+{
+    g_autofree char *cpr_uri = get_cpr_uri();
+    g_autofree char *uri = g_strdup_printf("unix:%s/migsocket", tmpfs);
+
+    const char *opts = "-machine anon-alloc=memfd -nodefaults";
+    g_autofree char *opts_target = g_strdup_printf("-cpr-uri %s %s",
+                                                   cpr_uri, opts);
+
+    MigrateCommon args = {
+        .start.opts_source = opts,
+        .start.opts_target = opts_target,
+        .start.defer_target_connect = true,
+        .start.memory_backend = "-object memory-backend-memfd,id=pc.ram,size=%s"
+                                " -machine memory-backend=pc.ram",
+        .listen_uri = incoming_defer ? "defer" : uri,
+        .connect_uri = uri,
+        .start_hook = test_mode_transfer_start,
+    };
+
+    test_precopy_common(&args);
+}
+
+static void test_mode_transfer(void)
+{
+    test_mode_transfer_common(false);
+}
+
+static void test_mode_transfer_defer(void)
+{
+    test_mode_transfer_common(true);
+}
+
 static void test_precopy_tcp_plain(void)
 {
     MigrateCommon args = {
@@ -3866,6 +3921,10 @@ int main(int argc, char **argv)
         migration_test_add("/migration/mode/reboot", test_mode_reboot);
     }
 
+    migration_test_add("/migration/mode/transfer", test_mode_transfer);
+    migration_test_add("/migration/mode/transfer/defer",
+                       test_mode_transfer_defer);
+
     migration_test_add("/migration/precopy/file/mapped-ram",
                        test_precopy_file_mapped_ram);
     migration_test_add("/migration/precopy/file/mapped-ram/live",
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH V3 16/16] migration: cpr-transfer documentation
  2024-11-01 13:47 [PATCH V3 00/16] Live update: cpr-transfer Steve Sistare
                   ` (14 preceding siblings ...)
  2024-11-01 13:47 ` [PATCH V3 15/16] migration-test: cpr-transfer Steve Sistare
@ 2024-11-01 13:47 ` Steve Sistare
  2024-11-13 22:02   ` Peter Xu
  15 siblings, 1 reply; 86+ messages in thread
From: Steve Sistare @ 2024-11-01 13:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, Steve Sistare

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 docs/devel/migration/CPR.rst | 144 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 142 insertions(+), 2 deletions(-)

diff --git a/docs/devel/migration/CPR.rst b/docs/devel/migration/CPR.rst
index 63c3647..732d5a6 100644
--- a/docs/devel/migration/CPR.rst
+++ b/docs/devel/migration/CPR.rst
@@ -5,7 +5,7 @@ CPR is the umbrella name for a set of migration modes in which the
 VM is migrated to a new QEMU instance on the same host.  It is
 intended for use when the goal is to update host software components
 that run the VM, such as QEMU or even the host kernel.  At this time,
-cpr-reboot is the only available mode.
+the cpr-reboot and cpr-transfer modes are available.
 
 Because QEMU is restarted on the same host, with access to the same
 local devices, CPR is allowed in certain cases where normal migration
@@ -53,7 +53,7 @@ RAM is copied to the migration URI.
 Outgoing:
   * Set the migration mode parameter to ``cpr-reboot``.
   * Set the ``x-ignore-shared`` capability if desired.
-  * Issue the ``migrate`` command.  It is recommended the the URI be a
+  * Issue the ``migrate`` command.  It is recommended the URI be a
     ``file`` type, but one can use other types such as ``exec``,
     provided the command captures all the data from the outgoing side,
     and provides all the data to the incoming side.
@@ -145,3 +145,143 @@ Caveats
 
 cpr-reboot mode may not be used with postcopy, background-snapshot,
 or COLO.
+
+cpr-transfer mode
+-----------------
+
+This mode allows the user to transfer a guest to a new QEMU instance
+on the same host with minimal guest pause time, by preserving guest
+RAM in place, albeit with new virtual addresses in new QEMU.
+
+The user starts new QEMU on the same host as old QEMU, with the
+same arguments as old QEMU, plus the ``-incoming`` option.  The user
+issues the migrate command to old QEMU, which stops the VM, saves
+state to the migration channels, and enters the postmigrate state.
+Execution resumes in new QEMU.
+
+This mode requires a second migration channel, specified by the
+``cpr-uri`` migration property on the outgoing side, and by the
+``cpr-uri`` QEMU command-line option on the incoming side.  The
+channel must be a type, such as unix socket, that supports SCM_RIGHTS.
+
+Usage
+^^^^^
+
+Memory backend objects must have the ``share=on`` attribute.
+
+The VM must be started with the ``-machine anon-alloc=memfd``
+option.  This causes implicit RAM blocks (those not described by
+a memory-backend object) to be allocated by mmap'ing a memfd.
+Examples include VGA and ROM.
+
+Outgoing:
+  * Set the migration mode parameter to ``cpr-transfer``.
+  * Set the ``cpr-uri`` parameter.  It must be a ``unix`` type.
+  * Issue the ``migrate`` command.
+
+Incoming:
+  * Start new QEMU with the ``-incoming`` and ``-cpr-uri`` options.
+  * If the VM was running when the outgoing ``migrate`` command was
+    issued, then QEMU automatically resumes VM execution.
+
+Caveats
+^^^^^^^
+
+cpr-transfer mode may not be used with postcopy, background-snapshot,
+or COLO.
+
+memory-backend-epc and memory-backend-ram are not supported.
+
+The incoming migration channel cannot be a file type.
+
+If the incoming migration channel is a tcp type, then the port cannot
+be 0 (meaning dynamically choose a port).
+
+When using ``-incoming defer``, you must issue the migrate command to
+old QEMU before issuing any monitor commands to new QEMU, because new
+QEMU blocks waiting to read from cpr-uri before starting its monitor,
+and old QEMU does not write to cpr-uri until the migrate command is
+issued.  However, new QEMU does not open and read the migration stream
+until you issue the migrate incoming command.
+
+Example 1: incoming channel
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In these examples, we simply restart the same version of QEMU, but
+in a real scenario one would start new QEMU on the incoming side.
+Note that new QEMU does not print the monitor prompt until old QEMU
+has issued the migrate command.
+
+::
+
+  Outgoing:                             Incoming:
+
+  # qemu-kvm -monitor stdio
+  -object memory-backend-file,id=ram0,size=4G,
+  mem-path=/dev/shm/ram0,share=on -m 4G
+  -machine anon-alloc=memfd
+  ...
+                                        # qemu-kvm -incoming tcp:0:44444
+                                        -cpr-uri unix:cpr.sock
+                                        ...
+  QEMU 9.2.50 monitor
+  (qemu) info status
+  VM status: running
+  (qemu) migrate_set_parameter mode cpr-transfer
+  (qemu) migrate_set_parameter cpr-uri unix:cpr.sock
+  (qemu) migrate -d tcp:0:44444
+
+                                        QEMU 9.2.50 monitor
+                                        (qemu) info status
+                                        VM status: running
+
+  (qemu) info status
+  VM status: paused (postmigrate)
+
+
+Example 2: incoming defer
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This example uses ``-incoming defer`` to hot plug a device before
+accepting the migration stream.  Again note you must issue the
+migrate command to old QEMU before you can issue any monitor
+commands to new QEMU.
+
+
+::
+
+  Outgoing:                             Incoming:
+
+  # qemu-kvm -monitor stdio
+  -object memory-backend-file,id=ram0,size=4G,
+  mem-path=/dev/shm/ram0,share=on -m 4G
+  -machine anon-alloc=memfd
+  ...
+                                        # qemu-kvm -incoming defer
+                                        -cpr-uri unix:cpr.sock
+                                        ...
+  QEMU 9.2.50 monitor
+  (qemu) device_add pcie-root-port
+  (qemu) migrate_set_parameter mode cpr-transfer
+  (qemu) migrate_set_parameter cpr-uri unix:cpr.sock
+  (qemu) migrate -d tcp:0:44444
+
+                                        QEMU 9.2.50 monitor
+                                        (qemu) info status
+                                        VM status: paused (inmigrate)
+                                        (qemu) device_add pcie-root-port
+                                        (qemu) migrate_incoming tcp:0:44444
+                                        (qemu) info status
+                                        VM status: running
+
+  (qemu) info status
+  VM status: paused (postmigrate)
+
+Futures
+^^^^^^^
+
+cpr-transfer mode is based on a capability to transfer open file
+descriptors from old to new QEMU.  In the future, descriptors for
+vfio, iommufd, vhost, and char devices could be transferred,
+preserving those devices and their kernel state without interruption,
+even if they do not explicitly support live migration.
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-01 13:47 ` [PATCH V3 01/16] machine: anon-alloc option Steve Sistare
@ 2024-11-01 14:06   ` Peter Xu
  2024-11-04 10:39   ` David Hildenbrand
  1 sibling, 0 replies; 86+ messages in thread
From: Peter Xu @ 2024-11-01 14:06 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Fri, Nov 01, 2024 at 06:47:40AM -0700, Steve Sistare wrote:
> @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>                  qemu_mutex_unlock_ramlist();
>                  return;
>              }
> +
> +        } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD &&
> +                   !object_dynamic_cast(new_block->mr->parent_obj.parent,
> +                                        TYPE_MEMORY_BACKEND)) {

Steve,

I think I'll postpone reading the whole series for a few days in favor of
other things, as I think this will miss 9.2 anyway (but we can see whether
we can still get it in early 10.0).

That said, I want to mention early that a concern was raised about
using block->mr->parent_obj.parent here to detect mem backends, which can
be error prone.  Please have a look at the discussion in v2 for that.

https://lore.kernel.org/qemu-devel/Zv7C7MeVP2X8bEJU@x1n/

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-01 13:47 ` [PATCH V3 01/16] machine: anon-alloc option Steve Sistare
  2024-11-01 14:06   ` Peter Xu
@ 2024-11-04 10:39   ` David Hildenbrand
  2024-11-04 10:45     ` David Hildenbrand
  2024-11-04 17:38     ` Steven Sistare
  1 sibling, 2 replies; 86+ messages in thread
From: David Hildenbrand @ 2024-11-04 10:39 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster

On 01.11.24 14:47, Steve Sistare wrote:
> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> on the value of the anon-alloc machine property.  This option applies to
> memory allocated as a side effect of creating various devices. It does
> not apply to memory-backend-objects, whether explicitly specified on
> the command line, or implicitly created by the -m command line option.
> 
> The memfd option is intended to support new migration modes, in which the
> memory region can be transferred in place to a new QEMU process, by sending
> the memfd file descriptor to the process.  Memory contents are preserved,
> and if the mode also transfers device descriptors, then pages that are
> locked in memory for DMA remain locked.  This behavior is a pre-requisite
> for supporting vfio, vdpa, and iommufd devices with the new modes.

A more portable, non-Linux specific variant of this will be using shm,
similar to backends/hostmem-shm.c.

Likely we should be using that instead of memfd, or try hiding the
details. See below.

[...]

> @@ -69,6 +70,8 @@
>   
>   #include "qemu/pmem.h"
>   
> +#include "qapi/qapi-types-migration.h"
> +#include "migration/options.h"
>   #include "migration/vmstate.h"
>   
>   #include "qemu/range.h"
> @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>                   qemu_mutex_unlock_ramlist();
>                   return;
>               }
> +
> +        } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD &&
> +                   !object_dynamic_cast(new_block->mr->parent_obj.parent,
> +                                        TYPE_MEMORY_BACKEND)) {

This looks a bit hackish, and I don't think ram_block_add() is the right
place where this should be. It should likely happen in the caller.

We already do have two ways of allocating "shared anonymous memory":

(1) memory-backend-ram,share=on
(2) memory-backend-shm

(2) gives us an fd as it uses shm_open(), (1) doesn't give us an fd as it
uses MAP_ANON|MAP_SHARED. (1) is really only a corner case use case [1].

[there is also Linux specific memfd, which gives us more flexibility with
hugetlb etc, but for the purpose here shm should likely be sufficient?]
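
To illustrate the difference between (1) and (2), here is a minimal
standalone sketch. It is not QEMU code, and the helper names
demo_shm_fd() and demo_shared_fd_remap() are made up for the example. It
creates a POSIX shm object with shm_open(), unlinks the name, and then maps
the fd twice, showing that any holder of the fd sees the same pages at a
different virtual address. MAP_ANON|MAP_SHARED has no fd to hand over.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a POSIX shm object of 'len' bytes and unlink its name right away,
 * so that only the returned fd refers to it.  Returns the fd, or -1.
 * The name and the 0600 mode are illustrative choices for this sketch.
 */
static int demo_shm_fd(size_t len)
{
    char name[64];
    int fd;

    assert(len > 0);
    snprintf(name, sizeof(name), "/demo-shm-%ld", (long)getpid());
    fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0) {
        return -1;
    }
    shm_unlink(name);
    if (ftruncate(fd, len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * Map the same fd twice, write through the first mapping and read through
 * the second.  Returns 0 if the second mapping sees the data at a
 * different address, -1 otherwise.
 */
static int demo_shared_fd_remap(void)
{
    const size_t len = 4096;
    const char msg[] = "guest ram";
    int fd = demo_shm_fd(len);
    char *a, *b;
    int ret = -1;

    if (fd < 0) {
        return -1;
    }
    a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a != MAP_FAILED && b != MAP_FAILED) {
        memcpy(a, msg, sizeof(msg));
        ret = (a != b && memcmp(b, msg, sizeof(msg)) == 0) ? 0 : -1;
    }
    if (a != MAP_FAILED) {
        munmap(a, len);
    }
    if (b != MAP_FAILED) {
        munmap(b, len);
    }
    close(fd);
    return ret;
}
```

Calling demo_shared_fd_remap() from main() should return 0 on a system with
POSIX shm support; older glibc versions may need -lrt to link shm_open().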

So why not make (1) behave like (2) and move that handling into
qemu_ram_alloc_internal(), from where we can easily enable it using a
new RAM_SHARED flag? So as a first step, something like:

 From 4b7b760c6e54cf05addca6728edc19adbec1588a Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Mon, 4 Nov 2024 11:29:22 +0100
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  backends/hostmem-shm.c | 56 ++++----------------------------
  system/physmem.c       | 73 ++++++++++++++++++++++++++++++++++++++++--
  2 files changed, 76 insertions(+), 53 deletions(-)

diff --git a/backends/hostmem-shm.c b/backends/hostmem-shm.c
index 374edc3db8..0f33b35e9c 100644
--- a/backends/hostmem-shm.c
+++ b/backends/hostmem-shm.c
@@ -25,11 +25,8 @@ struct HostMemoryBackendShm {
  static bool
  shm_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
  {
-    g_autoptr(GString) shm_name = g_string_new(NULL);
-    g_autofree char *backend_name = NULL;
+    g_autofree char *name = NULL;
      uint32_t ram_flags;
-    int fd, oflag;
-    mode_t mode;
  
      if (!backend->size) {
          error_setg(errp, "can't create shm backend with size 0");
@@ -41,54 +38,13 @@ shm_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
          return false;
      }
  
-    /*
-     * Let's use `mode = 0` because we don't want other processes to open our
-     * memory unless we share the file descriptor with them.
-     */
-    mode = 0;
-    oflag = O_RDWR | O_CREAT | O_EXCL;
-    backend_name = host_memory_backend_get_name(backend);
-
-    /*
-     * Some operating systems allow creating anonymous POSIX shared memory
-     * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
-     * defined by POSIX, so let's create a unique name.
-     *
-     * From Linux's shm_open(3) man-page:
-     *   For  portable  use,  a shared  memory  object should be identified
-     *   by a name of the form /somename;"
-     */
-    g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(),
-                    backend_name);
-
-    fd = shm_open(shm_name->str, oflag, mode);
-    if (fd < 0) {
-        error_setg_errno(errp, errno,
-                         "failed to create POSIX shared memory");
-        return false;
-    }
-
-    /*
-     * We have the file descriptor, so we no longer need to expose the
-     * POSIX shared memory object. However it will remain allocated as long as
-     * there are file descriptors pointing to it.
-     */
-    shm_unlink(shm_name->str);
-
-    if (ftruncate(fd, backend->size) == -1) {
-        error_setg_errno(errp, errno,
-                         "failed to resize POSIX shared memory to %" PRIu64,
-                         backend->size);
-        close(fd);
-        return false;
-    }
-
+    /* Let's do the same as memory-backend-ram,share=on would do. */
+    name = host_memory_backend_get_name(backend);
      ram_flags = RAM_SHARED;
      ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
-
-    return memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend),
-                                              backend_name, backend->size,
-                                              ram_flags, fd, 0, errp);
+    return memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend),
+                                                  name, backend->size,
+                                                  ram_flags, errp);
  }
  
  static void
diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a384..4d331b3828 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2057,6 +2057,59 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
  }
  #endif
  
+static int qemu_shm_alloc(size_t size, Error **errp)
+{
+    g_autoptr(GString) shm_name = g_string_new(NULL);
+    int fd, oflag, cur_sequence;
+    static int sequence;
+    mode_t mode;
+
+    cur_sequence = qatomic_fetch_inc(&sequence);
+
+    /*
+     * Let's use `mode = 0` because we don't want other processes to open our
+     * memory unless we share the file descriptor with them.
+     */
+    mode = 0;
+    oflag = O_RDWR | O_CREAT | O_EXCL;
+
+    /*
+     * Some operating systems allow creating anonymous POSIX shared memory
+     * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
+     * defined by POSIX, so let's create a unique name.
+     *
+     * From Linux's shm_open(3) man-page:
+     *   For  portable  use,  a shared  memory  object should be identified
+     *   by a name of the form /somename;"
+     */
+    g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%d", getpid(),
+                    cur_sequence);
+
+    fd = shm_open(shm_name->str, oflag, mode);
+    if (fd < 0) {
+        error_setg_errno(errp, errno,
+                         "failed to create POSIX shared memory");
+        return -1;
+    }
+
+    /*
+     * We have the file descriptor, so we no longer need to expose the
+     * POSIX shared memory object. However it will remain allocated as long as
+     * there are file descriptors pointing to it.
+     */
+    shm_unlink(shm_name->str);
+
+    if (ftruncate(fd, size) == -1) {
+        error_setg_errno(errp, errno,
+                         "failed to resize POSIX shared memory to %zu",
+                         size);
+        close(fd);
+        return -1;
+    }
+
+    return fd;
+}
+
  static
  RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
                                    void (*resized)(const char*,
@@ -2084,12 +2137,26 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
      new_block->used_length = size;
      new_block->max_length = max_size;
      assert(max_size >= size);
-    new_block->fd = -1;
+
      new_block->guest_memfd = -1;
      new_block->page_size = qemu_real_host_page_size();
-    new_block->host = host;
      new_block->flags = ram_flags;
-    ram_block_add(new_block, &local_err);
+    new_block->host = host;
+
+    if ((ram_flags & RAM_PREALLOC) || !(ram_flags & RAM_SHARED)) {
+        new_block->fd = -1;
+    } else {
+        /*
+         * We want anonymous shared memory, similar to MAP_SHARED|MAP_ANON; but
+         * some users want the fd. So let's allocate shm explicitly, which will
+         * give us the fd.
+         */
+        assert(!host);
+        new_block->fd = qemu_shm_alloc(new_block->max_length, &local_err);
+    }
+    if (!local_err) {
+        ram_block_add(new_block, &local_err);
+    }
      if (local_err) {
          g_free(new_block);
          error_propagate(errp, local_err);
-- 
2.47.0



Then, you only need a machine option to say "anon-shared", to make all
anonymous memory sharable between processes. All it would do is set
the RAM_SHARED flag in qemu_ram_alloc_internal() when reasonable
(!(ram_flags & RAM_PREALLOC)).

To handle "memory-backend-ram,share=off", can we find a way to bail out if
memory-backend-ram,share=off was used while the machine option "anon-shared"
would be active? Or just document that the "anon-shared" will win?

An alternative might be a RAM_FORCE_PRIVATE flag, set by the memory backend.


With the above change, we could drop the "bool share" flag from
qemu_anon_ram_alloc(), as it would be unused.

[1] https://patchwork.kernel.org/project/qemu-devel/patch/20180201205511.19198-2-marcel@redhat.com/


-- 
Cheers,

David / dhildenb



^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-04 10:39   ` David Hildenbrand
@ 2024-11-04 10:45     ` David Hildenbrand
  2024-11-04 17:38     ` Steven Sistare
  1 sibling, 0 replies; 86+ messages in thread
From: David Hildenbrand @ 2024-11-04 10:45 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster

On 04.11.24 11:39, David Hildenbrand wrote:
> On 01.11.24 14:47, Steve Sistare wrote:
>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>> on the value of the anon-alloc machine property.  This option applies to
>> memory allocated as a side effect of creating various devices. It does
>> not apply to memory-backend-objects, whether explicitly specified on
>> the command line, or implicitly created by the -m command line option.
>>
>> The memfd option is intended to support new migration modes, in which the
>> memory region can be transferred in place to a new QEMU process, by sending
>> the memfd file descriptor to the process.  Memory contents are preserved,
>> and if the mode also transfers device descriptors, then pages that are
>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>> for supporting vfio, vdpa, and iommufd devices with the new modes.
> 
> A more portable, non-Linux specific variant of this will be using shm,
> similar to backends/hostmem-shm.c.
> 
> Likely we should be using that instead of memfd, or try hiding the
> details. See below.
> 
> [...]
> 
>> @@ -69,6 +70,8 @@
>>    
>>    #include "qemu/pmem.h"
>>    
>> +#include "qapi/qapi-types-migration.h"
>> +#include "migration/options.h"
>>    #include "migration/vmstate.h"
>>    
>>    #include "qemu/range.h"
>> @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>                    qemu_mutex_unlock_ramlist();
>>                    return;
>>                }
>> +
>> +        } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD &&
>> +                   !object_dynamic_cast(new_block->mr->parent_obj.parent,
>> +                                        TYPE_MEMORY_BACKEND)) {
> 
> This looks a bit hackish, and I don't think ram_block_add() is the right
> place where this should be. It should likely happen in the caller.
> 
> We already do have two ways of allocating "shared anonymous memory":
> 
> (1) memory-backend-ram,share=on
> (2) memory-backend-shm
> 
> (2) gives us an fd as it uses shm_open(), (1) doesn't give us an fd as it
> uses MAP_ANON|MAP_SHARED. (1) is really only a corner case use case [1].
> 
> [there is also Linux specific memfd, which gives us more flexibility with
> hugetlb etc, but for the purpose here shm should likely be sufficient?]
> 
> So why not make (1) behave like (2) and move that handling into
> qemu_ram_alloc_internal(), from where we can easily enable it using a
> new RAM_SHARED flag? So as a first step, something like:
> 
>   From 4b7b760c6e54cf05addca6728edc19adbec1588a Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Mon, 4 Nov 2024 11:29:22 +0100
> Subject: [PATCH] tmp
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>    backends/hostmem-shm.c | 56 ++++----------------------------
>    system/physmem.c       | 73 ++++++++++++++++++++++++++++++++++++++++--
>    2 files changed, 76 insertions(+), 53 deletions(-)
> 
> diff --git a/backends/hostmem-shm.c b/backends/hostmem-shm.c
> index 374edc3db8..0f33b35e9c 100644
> --- a/backends/hostmem-shm.c
> +++ b/backends/hostmem-shm.c
> @@ -25,11 +25,8 @@ struct HostMemoryBackendShm {
>    static bool
>    shm_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>    {
> -    g_autoptr(GString) shm_name = g_string_new(NULL);
> -    g_autofree char *backend_name = NULL;
> +    g_autofree char *name = NULL;
>        uint32_t ram_flags;
> -    int fd, oflag;
> -    mode_t mode;
>    
>        if (!backend->size) {
>            error_setg(errp, "can't create shm backend with size 0");
> @@ -41,54 +38,13 @@ shm_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>            return false;
>        }
>    
> -    /*
> -     * Let's use `mode = 0` because we don't want other processes to open our
> -     * memory unless we share the file descriptor with them.
> -     */
> -    mode = 0;
> -    oflag = O_RDWR | O_CREAT | O_EXCL;
> -    backend_name = host_memory_backend_get_name(backend);
> -
> -    /*
> -     * Some operating systems allow creating anonymous POSIX shared memory
> -     * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
> -     * defined by POSIX, so let's create a unique name.
> -     *
> -     * From Linux's shm_open(3) man-page:
> -     *   For  portable  use,  a shared  memory  object should be identified
> -     *   by a name of the form /somename;"
> -     */
> -    g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(),
> -                    backend_name);
> -
> -    fd = shm_open(shm_name->str, oflag, mode);
> -    if (fd < 0) {
> -        error_setg_errno(errp, errno,
> -                         "failed to create POSIX shared memory");
> -        return false;
> -    }
> -
> -    /*
> -     * We have the file descriptor, so we no longer need to expose the
> -     * POSIX shared memory object. However it will remain allocated as long as
> -     * there are file descriptors pointing to it.
> -     */
> -    shm_unlink(shm_name->str);
> -
> -    if (ftruncate(fd, backend->size) == -1) {
> -        error_setg_errno(errp, errno,
> -                         "failed to resize POSIX shared memory to %" PRIu64,
> -                         backend->size);
> -        close(fd);
> -        return false;
> -    }
> -
> +    /* Let's do the same as memory-backend-ram,share=on would do. */
> +    name = host_memory_backend_get_name(backend);
>        ram_flags = RAM_SHARED;
>        ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
> -
> -    return memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend),
> -                                              backend_name, backend->size,
> -                                              ram_flags, fd, 0, errp);
> +    return memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend),
> +                                                  name, backend->size,
> +                                                  ram_flags, errp);
>    }
>    
>    static void
> diff --git a/system/physmem.c b/system/physmem.c
> index dc1db3a384..4d331b3828 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -2057,6 +2057,59 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
>    }
>    #endif
>    
> +static int qemu_shm_alloc(size_t size, Error **errp)
> +{
> +    g_autoptr(GString) shm_name = g_string_new(NULL);
> +    int fd, oflag, cur_sequence;
> +    static int sequence;
> +    mode_t mode;
> +
> +    cur_sequence = qatomic_fetch_inc(&sequence);
> +
> +    /*
> +     * Let's use `mode = 0` because we don't want other processes to open our
> +     * memory unless we share the file descriptor with them.
> +     */
> +    mode = 0;
> +    oflag = O_RDWR | O_CREAT | O_EXCL;
> +
> +    /*
> +     * Some operating systems allow creating anonymous POSIX shared memory
> +     * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
> +     * defined by POSIX, so let's create a unique name.
> +     *
> +     * From Linux's shm_open(3) man-page:
> +     *   For  portable  use,  a shared  memory  object should be identified
> +     *   by a name of the form /somename;"
> +     */
> +    g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%d", getpid(),
> +                    cur_sequence);
> +
> +    fd = shm_open(shm_name->str, oflag, mode);
> +    if (fd < 0) {
> +        error_setg_errno(errp, errno,
> +                         "failed to create POSIX shared memory");
> +        return -1;
> +    }
> +
> +    /*
> +     * We have the file descriptor, so we no longer need to expose the
> +     * POSIX shared memory object. However it will remain allocated as long as
> +     * there are file descriptors pointing to it.
> +     */
> +    shm_unlink(shm_name->str);
> +
> +    if (ftruncate(fd, size) == -1) {
> +        error_setg_errno(errp, errno,
> +                         "failed to resize POSIX shared memory to %zu",
> +                         size);
> +        close(fd);
> +        return -1;
> +    }
> +
> +    return fd;
> +}
> +
>    static
>    RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
>                                      void (*resized)(const char*,
> @@ -2084,12 +2137,26 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
>        new_block->used_length = size;
>        new_block->max_length = max_size;
>        assert(max_size >= size);
> -    new_block->fd = -1;
> +
>        new_block->guest_memfd = -1;
>        new_block->page_size = qemu_real_host_page_size();
> -    new_block->host = host;
>        new_block->flags = ram_flags;
> -    ram_block_add(new_block, &local_err);
> +    new_block->host = host;
> +
> +    if ((ram_flags & RAM_PREALLOC) || !(ram_flags & RAM_SHARED)) {
> +        new_block->fd = -1;
> +    } else {
> +        /*
> +         * We want anonymous shared memory, similar to MAP_SHARED|MAP_ANON; but
> +         * some users want the fd. So let's allocate shm explicitly, which will
> +         * give us the fd.
> +         */
> +        assert(!host);
> +        new_block->fd = qemu_shm_alloc(new_block->max_length, &local_err);

Note: completely untested.

Likely a file_ram_alloc() call is missing here.

-- 
Cheers,

David / dhildenb




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-04 10:39   ` David Hildenbrand
  2024-11-04 10:45     ` David Hildenbrand
@ 2024-11-04 17:38     ` Steven Sistare
  2024-11-04 19:51       ` David Hildenbrand
  1 sibling, 1 reply; 86+ messages in thread
From: Steven Sistare @ 2024-11-04 17:38 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Peter Xu, Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster

On 11/4/2024 5:39 AM, David Hildenbrand wrote:
> On 01.11.24 14:47, Steve Sistare wrote:
>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>> on the value of the anon-alloc machine property.  This option applies to
>> memory allocated as a side effect of creating various devices. It does
>> not apply to memory-backend-objects, whether explicitly specified on
>> the command line, or implicitly created by the -m command line option.
>>
>> The memfd option is intended to support new migration modes, in which the
>> memory region can be transferred in place to a new QEMU process, by sending
>> the memfd file descriptor to the process.  Memory contents are preserved,
>> and if the mode also transfers device descriptors, then pages that are
>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>> for supporting vfio, vdpa, and iommufd devices with the new modes.
> 
> A more portable, non-Linux specific variant of this will be using shm,
> similar to backends/hostmem-shm.c.
> 
> Likely we should be using that instead of memfd, or try hiding the
> details. See below.

For this series I would prefer to use memfd and hide the details.  It's a
concise (and well tested) solution albeit linux only.  The code you supply
for posix shm would be a good follow on patch to support other unices.

We could drop
   -machine anon-alloc=mmap|memfd
and define
   -machine anon-shared

as you suggest at the end.

> [...]
> 
>> @@ -69,6 +70,8 @@
>>   #include "qemu/pmem.h"
>> +#include "qapi/qapi-types-migration.h"
>> +#include "migration/options.h"
>>   #include "migration/vmstate.h"
>>   #include "qemu/range.h"
>> @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>                   qemu_mutex_unlock_ramlist();
>>                   return;
>>               }
>> +
>> +        } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD &&
>> +                   !object_dynamic_cast(new_block->mr->parent_obj.parent,
>> +                                        TYPE_MEMORY_BACKEND)) {
> 
> This looks a bit hackish,

OK. I can revert parts of the previous version which passed in RAM_SHARED from
various call sites to request anonymous shared memory:
   https://lore.kernel.org/qemu-devel/1714406135-451286-18-git-send-email-steven.sistare@oracle.com
See the various sites that do
     uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
Does that look OK to you?

> and I don't think ram_block_add() is the right
> place where this should be. It should likely happen in the caller.

I agree, but I received no feedback when I proposed to refactor allocation
vs ram_block_add, so I dropped them to simplify the live update review.
These refactor but do not change functionality.  Are you OK with something
like this?  Is this overkill?

https://lore.kernel.org/qemu-devel/1714406135-451286-1-git-send-email-steven.sistare@oracle.com/
   physmem: ram_block_create
   physmem: hoist guest_memfd creation
   physmem: hoist host memory allocation

> We already do have two ways of allocating "shared anonymous memory":
> 
> (1) memory-backend-ram,share=on
> (2) memory-backend-shm
> 
> (2) gives us an fd as it uses shm_open(), (1) doesn't give us an fd as it
> uses MAP_ANON|MAP_SHARED. (1) is really only a corner case use case [1].
> 
> [there is also Linux specific memfd, which gives us more flexibility with
> hugetlb etc, but for the purpose here shm should likely be sufficient?]
> 
> So why not make (1) behave like (2) and move that handling into
> qemu_ram_alloc_internal(), from where we can easily enable it using a
> new RAM_SHARED flag? So as a first step, something like:

I prefer that, and an earlier version did so, but only if anon-alloc==memfd.

To be clear, do you propose that memory-backend-ram,shared=on unconditionally
mmap fd-based shared memory, independently of the setting of anon-alloc?
And drop the MAP_ANON|MAP_SHARED possibility?

Or, do you propose that for memory-backend-ram,shared=on:
   if anon-shared
     mmap fd
   else
      MAP_ANON|MAP_SHARED

The former is simpler from a user documentation point of view, but either
works for me.  I could stop listing memory-backend-ram  as an exception in
the docs, which currently state:
   #     Memory-backend objects must have the share=on attribute, but
   #     memory-backend-epc and memory-backend-ram are not supported.

[...]
>
> Then, you only need a machine option to say "anon-shared", to make all
> anonymous memory sharable between processes. All it would do is setting
> the RAM_SHARED flag in qemu_ram_alloc_internal() when reasonable
> (!(ram_flags & RAM_PREALLOC)).
> 
> To handle "memory-backend-ram,share=off", can we find a way to bail out if
> memory-backend-ram,share=off was used while the machine option "anon-shared"
> would be active? 

In later patches I install migration blockers for various conditions, including
when a ram block does not support CPR.

> Or just document that the "anon-shared" will win?

IMO a blocker is sufficient.

I think you are also suggesting that an unadorned "memory-backend-ram"
specification (with implicit shared=off), plus anon-shared, should cause
shared anon to be allocated:
   "you only need a machine option to say "anon-shared", to make all anonymous
    memory sharable"

I did that previously, and Peter objected, saying the explicit anon-shared
should not override the implicit shared=off.

But perhaps I misinterpret someone.

- Steve

> Alternatives might be a RAM_FORCE_PRIVATE flag, set by the memory backend.
> 
> 
> With the above change, we could drop the "bool share" flag from
> qemu_anon_ram_alloc(), as it would be unused.
> 
> [1] https://patchwork.kernel.org/project/qemu-devel/patch/20180201205511.19198-2-marcel@redhat.com/







* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-04 17:38     ` Steven Sistare
@ 2024-11-04 19:51       ` David Hildenbrand
  2024-11-04 20:14         ` Peter Xu
  2024-11-04 20:15         ` David Hildenbrand
  0 siblings, 2 replies; 86+ messages in thread
From: David Hildenbrand @ 2024-11-04 19:51 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel
  Cc: Peter Xu, Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster

On 04.11.24 18:38, Steven Sistare wrote:
> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>> On 01.11.24 14:47, Steve Sistare wrote:
>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>> on the value of the anon-alloc machine property.  This option applies to
>>> memory allocated as a side effect of creating various devices. It does
>>> not apply to memory-backend-objects, whether explicitly specified on
>>> the command line, or implicitly created by the -m command line option.
>>>
>>> The memfd option is intended to support new migration modes, in which the
>>> memory region can be transferred in place to a new QEMU process, by sending
>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>> and if the mode also transfers device descriptors, then pages that are
>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>
>> A more portable, non-Linux specific variant of this will be using shm,
>> similar to backends/hostmem-shm.c.
>>
>> Likely we should be using that instead of memfd, or try hiding the
>> details. See below.
> 
> For this series I would prefer to use memfd and hide the details.  It's a
> concise (and well tested) solution albeit linux only.  The code you supply
> for posix shm would be a good follow on patch to support other unices.

Unless there is reason to use memfd we should start with the more 
generic POSIX variant that is available even on systems without memfd. 
Factoring stuff out as I drafted does look quite compelling.

I can help with the rework, and send it out separately, so you can focus 
on the "machine toggle" as part of this series.

Of course, if we find out we need the memfd internally instead under 
Linux for whatever reason later, we can use that instead.

But IIUC, the main selling point for memfd are additional features 
(hugetlb, memory sealing) that you aren't even using.

> 
> We could drop
>     -machine anon-alloc=mmap|memfd

Right, the memfd here might be an unnecessary detail. Especially, 
because all things here are mmap'ed ... so I don't quite like this 
interface :)


> and define
>     -machine anon-shared
> 
> as you suggest at the end.

Likely we should remove the "anon" part from the interface as well ... 
hmm ...

We want to instruct QEMU: "all guest RAM that is not explicitly 
specified should be sharable with another process".

"internal-ram-share=true"

"default-ram-share=true"

Maybe we can come up with something even better. But getting rid of the 
"anon" would be great. I think I prefer the latter (below).

> 
>> [...]
>>
>>> @@ -69,6 +70,8 @@
>>>    #include "qemu/pmem.h"
>>> +#include "qapi/qapi-types-migration.h"
>>> +#include "migration/options.h"
>>>    #include "migration/vmstate.h"
>>>    #include "qemu/range.h"
>>> @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>>                    qemu_mutex_unlock_ramlist();
>>>                    return;
>>>                }
>>> +
>>> +        } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD &&
>>> +                   !object_dynamic_cast(new_block->mr->parent_obj.parent,
>>> +                                        TYPE_MEMORY_BACKEND)) {
>>
>> This looks a bit hackish,
> 
> OK. I can revert parts of the previous version which passed in RAM_SHARED from
> various call sites to request anonymous shared memory:
>     https://lore.kernel.org/qemu-devel/1714406135-451286-18-git-send-email-steven.sistare@oracle.com
> See the various sites that do
>       uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0;
> Does that look OK to you?

That's one option, or we just handle it in qemu_ram_alloc_internal() as 
I drafted below.

Or we simply have another interface to allocate this "default RAM that 
does not come from a memory backend and is subject to the global 
toggle", and hide that detail (conditionally setting RAM_SHARED) in there.

> 
>> and I don't think ram_block_add() is the right
>> place where this should be. It should likely happen in the caller.
> 
> I agree, but I received no feedback when I proposed to refactor allocation
> vs ram_block_add, so I dropped them to simplify the live update review.
> These refactor but do not change functionality.  Are you OK with something
> like this?  Is this overkill?
> 

Probably overkill :)

> https://lore.kernel.org/qemu-devel/1714406135-451286-1-git-send-email-steven.sistare@oracle.com/
>     physmem: ram_block_create
>     physmem: hoist guest_memfd creation
>     physmem: hoist host memory allocation
> 
>> We already do have two ways of allocating "shared anonymous memory":
>>
>> (1) memory-backend-ram,share=on
>> (2) memory-backend-shm
>>
>> (2) gives us an fd as it uses shm_open(), (1) doesn't give us an fd as it
>> uses MAP_ANON|MAP_SHARED. (1) is really only a corner case use case [1].
>>
>> [there is also Linux specific memfd, which gives us more flexibility with
>> hugetlb etc, but for the purpose here shm should likely be sufficient?]
>>
>> So why not make (1) behave like (2) and move that handling into
>> qemu_ram_alloc_internal(), from where we can easily enable it using a
>> new RAM_SHARED flag? So as a first step, something like:
> 
> I prefer that, and an earlier version did so, but only if anon-alloc==memfd.
> 
> To be clear, do you propose that memory-backend-ram,shared=on unconditionally
> mmap fd-based shared memory, independently of the setting of anon-alloc?
> And drop the MAP_ANON|MAP_SHARED possibility?

Yes, as done in my draft patch. MAP_ANON|MAP_SHARED was primarily a hack 
to make this RDMA thingy fly that could not deal with anonymous memory, 
and we didn't have memory-backend-memfd at the time:

memory-backend-ram,share=on was added via 
06329ccecfa022494fdba288b3ab5bcb8dff4159 before
memory-backend-memfd was added via dbb9e0f40d7d561dcffcf7e41ac9f6a5ec90e5b5

Both ended up in the same QEMU release.

So likely memory-backend-ram,share=on could just have used 
memory-backend-memfd if it would have been available earlier, at least 
on Linux ...


But, it looks like the use case for memory-backend-ram,share=on no 
longer even exists, because

commit 1dfd42c4264bbf47415a9e73f0d6b4e6a7cd7393
Author: Philippe Mathieu-Daudé <philmd@linaro.org>
Date:   Thu Mar 28 12:53:00 2024 +0100

     hw/rdma: Remove deprecated pvrdma device and rdmacm-mux helper

Removed that mremap() from the code base.

So we can change how memory-backend-ram,share=on is implemented 
internally, as long as it keeps on working in a similar way.

If "memory-backend-ram,share=on" will be the same as specifying 
"default-ram-share=on", that would actually be quite nice ... no need to 
bring in memfds at all as long as we only want some memory with an fd to 
share with other processes.

> 
> Or, do you propose that for memory-backend-ram,shared=on:
>     if anon-shared
>       mmap fd
>     else
>        MAP_ANON|MAP_SHARED


My suggestion would be to unconditionally use shm (which is available 
even on kernels without memfd support; if required, use memfd first and 
fall back to shmem) as in the patch I drafted.

> 
> The former is simpler from a user documentation point of view, but either
> works for me.  I could stop listing memory-backend-ram  as an exception in
> the docs, which currently state:
>     #     Memory-backend objects must have the share=on attribute, but
>     #     memory-backend-epc and memory-backend-ram are not supported.

Likely that was never updated to document the memory-backend-ram use case.

> 
> [...]
>>
>> Then, you only need a machine option to say "anon-shared", to make all
>> anonymous memory sharable between processes. All it would do is setting
>> the RAM_SHARED flag in qemu_ram_alloc_internal() when reasonable
>> (!(ram_flags & RAM_PREALLOC)).
>>
>> To handle "memory-backend-ram,share=off", can we find a way to bail out if
>> memory-backend-ram,share=off was used while the machine option "anon-shared"
>> would be active?
> 
> In later patches I install migration blockers for various conditions, including
> when a ram block does not support CPR.

Good!

> 
>> Or just document that the "anon-shared" will win?
> 
> IMO a blocker is sufficient.
> 
> I think you are also suggesting that an unadorned "memory-backend-ram"
> specification (with implicit shared=off), plus anon-shared, should cause
> shared anon to be allocated:
>     "you only need a machine option to say "anon-shared", to make all anonymous
>      memory sharable"
> 
> I did that previously, and Peter objected, saying the explicit anon-shared
> should not override the implicit shared=off.

Yes, it's better if we can detect that somehow. There should be easy 
ways to make that work, so I wouldn't worry about that.

-- 
Cheers,

David / dhildenb




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-04 19:51       ` David Hildenbrand
@ 2024-11-04 20:14         ` Peter Xu
  2024-11-04 20:17           ` David Hildenbrand
  2024-11-04 20:15         ` David Hildenbrand
  1 sibling, 1 reply; 86+ messages in thread
From: Peter Xu @ 2024-11-04 20:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Steven Sistare, qemu-devel, Fabiano Rosas, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Mon, Nov 04, 2024 at 08:51:56PM +0100, David Hildenbrand wrote:
> > I did that previously, and Peter objected, saying the explicit anon-shared
> > should not override the implicit shared=off.
> 
> Yes, it's better if we can detect that somehow. There should be easy ways to
> make that work, so I wouldn't worry about that.

I still think whenever the caller is capable of passing RAM_SHARED flag
into ram_block_add(), we should always respect what's passed in from the
caller, no matter whether it's a shared / private request.

A major issue with that idea is when !RAM_SHARED, we don't easily know
whether it's because the caller explicitly chose share=off, or if it's
simply the type of ramblock that we don't care about (e.g. ROMs).

So besides what I used to suggest on monitoring the four call sites that
can involve those, another simpler (and now I see it even cleaner..) way
could be that we explicitly introduce RAM_PRIVATE.  It means whenever we
have things like below in the callers:

    int ram_flags = shared ? RAM_SHARED : 0;

We start to switch to:

    int ram_flags = shared ? RAM_SHARED : RAM_PRIVATE;

Then in ram_block_add():

    if (!(ram_flags & (RAM_SHARED | RAM_PRIVATE))) {
        // these are the target ramblocks for cpr's whatever new machine
        // flag..
    }
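
A self-contained sketch of the three-state idea above, under stated
assumptions: the flag values and the helper names backend_ram_flags() and
resolve_ram_flags() are hypothetical and exist only for illustration; they are
not QEMU's definitions.  The point is that a caller with an explicit share
property always sets exactly one of the two flags, so "neither flag set"
identifies the ramblocks that a machine-wide default may affect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only; not the values used in QEMU. */
#define SKETCH_RAM_SHARED   (1u << 0)
#define SKETCH_RAM_PRIVATE  (1u << 1)

/* Caller side: a backend with an explicit share property sets one flag. */
static uint32_t backend_ram_flags(bool share)
{
    return share ? SKETCH_RAM_SHARED : SKETCH_RAM_PRIVATE;
}

/*
 * Allocator side: apply the machine-wide default only when the caller
 * expressed no preference at all.  An explicit request, shared or private,
 * is returned unchanged.
 */
static uint32_t resolve_ram_flags(uint32_t ram_flags, bool default_shared)
{
    const uint32_t both = SKETCH_RAM_SHARED | SKETCH_RAM_PRIVATE;

    /* Asking for shared and private at the same time is a caller bug. */
    assert((ram_flags & both) != both);

    if (!(ram_flags & both) && default_shared) {
        ram_flags |= SKETCH_RAM_SHARED;
    }
    return ram_flags;
}
```

With this shape, an explicit share=off request is never overridden by the
machine option, while blocks allocated without either flag follow the default.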

-- 
Peter Xu




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-04 19:51       ` David Hildenbrand
  2024-11-04 20:14         ` Peter Xu
@ 2024-11-04 20:15         ` David Hildenbrand
  2024-11-04 20:56           ` Steven Sistare
  1 sibling, 1 reply; 86+ messages in thread
From: David Hildenbrand @ 2024-11-04 20:15 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel
  Cc: Peter Xu, Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster

On 04.11.24 20:51, David Hildenbrand wrote:
> On 04.11.24 18:38, Steven Sistare wrote:
>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>> on the value of the anon-alloc machine property.  This option applies to
>>>> memory allocated as a side effect of creating various devices. It does
>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>> the command line, or implicitly created by the -m command line option.
>>>>
>>>> The memfd option is intended to support new migration modes, in which the
>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>> and if the mode also transfers device descriptors, then pages that are
>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>
>>> A more portable, non-Linux specific variant of this will be using shm,
>>> similar to backends/hostmem-shm.c.
>>>
>>> Likely we should be using that instead of memfd, or try hiding the
>>> details. See below.
>>
>> For this series I would prefer to use memfd and hide the details.  It's a
>> concise (and well tested) solution albeit linux only.  The code you supply
>> for posix shm would be a good follow on patch to support other unices.
> 
> Unless there is reason to use memfd we should start with the more
> generic POSIX variant that is available even on systems without memfd.
> Factoring stuff out as I drafted does look quite compelling.
> 
> I can help with the rework, and send it out separately, so you can focus
> on the "machine toggle" as part of this series.
> 
> Of course, if we find out we need the memfd internally instead under
> Linux for whatever reason later, we can use that instead.
> 
> But IIUC, the main selling point for memfd are additional features
> (hugetlb, memory sealing) that you aren't even using.

FWIW, I'm looking into some details, and one difference is that 
shm_open() under Linux (glibc) seems to go to /dev/shm and 
memfd/SYSV go to the internal tmpfs mount. There is not a big 
difference, but there can be some difference (e.g., sizing of the 
/dev/shm mount).

Regarding memory-backend-ram,share=on, I assume we can use memfd if 
available, but then fall back to shm_open().

I'm hoping we can find a way where it just all is rather intuitive, like

"default-ram-share=on": behave for internal RAM just like 
"memory-backend-ram,share=on"

"memory-backend-ram,share=on": use whatever mechanism we have to give us 
"anonymous" memory that can be shared using an fd with another process.


Thoughts?

-- 
Cheers,

David / dhildenb




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-04 20:14         ` Peter Xu
@ 2024-11-04 20:17           ` David Hildenbrand
  2024-11-04 20:41             ` Peter Xu
  0 siblings, 1 reply; 86+ messages in thread
From: David Hildenbrand @ 2024-11-04 20:17 UTC (permalink / raw)
  To: Peter Xu
  Cc: Steven Sistare, qemu-devel, Fabiano Rosas, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 04.11.24 21:14, Peter Xu wrote:
> On Mon, Nov 04, 2024 at 08:51:56PM +0100, David Hildenbrand wrote:
>>> I did that previously, and Peter objected, saying the explicit anon-shared
>>> should not override the implicit shared=off.
>>
>> Yes, it's better if we can detect that somehow. There should be easy ways to
>> make that work, so I wouldn't worry about that.
> 
> I still think whenever the caller is capable of passing RAM_SHARED flag
> into ram_block_add(), we should always respect what's passed in from the
> caller, no matter it's a shared / private request.
> 
> A major issue with that idea is when !RAM_SHARED, we don't easily know
> whether it's because the caller explicitly chose share=off, or if it's
> simply the type of ramblock that we don't care (e.g. ROMs).

Agreed. But note that I think ram_block_add() is one level too deep to 
handle that.

> 
> So besides what I used to suggest on monitoring the four call sites that
> can involve those, another simpler (and now I see it even cleaner..) way
> could be that we explicitly introduce RAM_PRIVATE.  It means whenever we
> have things like below in the callers:

Yeah, I called it RAM_FORCE_PRIVATE, but it's the same idea. And simply 
calling it RAM_PRIVATE makes sense.


-- 
Cheers,

David / dhildenb




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-04 20:17           ` David Hildenbrand
@ 2024-11-04 20:41             ` Peter Xu
  0 siblings, 0 replies; 86+ messages in thread
From: Peter Xu @ 2024-11-04 20:41 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Steven Sistare, qemu-devel, Fabiano Rosas, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Mon, Nov 04, 2024 at 09:17:30PM +0100, David Hildenbrand wrote:
> On 04.11.24 21:14, Peter Xu wrote:
> > On Mon, Nov 04, 2024 at 08:51:56PM +0100, David Hildenbrand wrote:
> > > > I did that previously, and Peter objected, saying the explicit anon-shared
> > > > should not override the implicit shared=off.
> > > 
> > > Yes, it's better if we can detect that somehow. There should be easy ways to
> > > make that work, so I wouldn't worry about that.
> > 
> > I still think whenever the caller is capable of passing RAM_SHARED flag
> > into ram_block_add(), we should always respect what's passed in from the
> > caller, no matter it's a shared / private request.
> > 
> > A major issue with that idea is when !RAM_SHARED, we don't easily know
> > whether it's because the caller explicitly chose share=off, or if it's
> > simply the type of ramblock that we don't care (e.g. ROMs).
> 
> Agreed. But note that I think ram_block_add() is one level too deep to handle
> that.

True.. qemu_ram_alloc_internal() is probably the best place to do the trick.

Thanks,

-- 
Peter Xu




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-04 20:15         ` David Hildenbrand
@ 2024-11-04 20:56           ` Steven Sistare
  2024-11-04 21:36             ` David Hildenbrand
  0 siblings, 1 reply; 86+ messages in thread
From: Steven Sistare @ 2024-11-04 20:56 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Peter Xu, Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster

On 11/4/2024 3:15 PM, David Hildenbrand wrote:
> On 04.11.24 20:51, David Hildenbrand wrote:
>> On 04.11.24 18:38, Steven Sistare wrote:
>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>> memory allocated as a side effect of creating various devices. It does
>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>> the command line, or implicitly created by the -m command line option.
>>>>>
>>>>> The memfd option is intended to support new migration modes, in which the
>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>
>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>> similar to backends/hostmem-shm.c.
>>>>
>>>> Likely we should be using that instead of memfd, or try hiding the
>>>> details. See below.
>>>
>>> For this series I would prefer to use memfd and hide the details.  It's a
>>> concise (and well tested) solution albeit linux only.  The code you supply
>>> for posix shm would be a good follow on patch to support other unices.
>>
>> Unless there is reason to use memfd we should start with the more
>> generic POSIX variant that is available even on systems without memfd.
>> Factoring stuff out as I drafted does look quite compelling.
>>
>> I can help with the rework, and send it out separately, so you can focus
>> on the "machine toggle" as part of this series.
>>
>> Of course, if we find out we need the memfd internally instead under
>> Linux for whatever reason later, we can use that instead.
>>
>> But IIUC, the main selling point for memfd are additional features
>> (hugetlb, memory sealing) that you aren't even using.
> 
> FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).

Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
To do so using shm_open requires configuration on the mount.  One step harder to use.

This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
if memory-backend-ram has hogged all the memory.

> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().

Yes, and if that is a good idea, then the same should be done for internal RAM
-- memfd if available and fallback to shm_open.
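
A rough sketch of "memfd if available, otherwise shm_open" follows.  It is an
illustration only: shared_mem_fd() is a hypothetical name, the object names
are made up, and the real code would report errors through Error **errp rather
than returning -1.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for memfd_create() with glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not a QEMU function.  Return a descriptor for 'size'
 * bytes of shareable memory that has no name in any filesystem, or -1 on
 * error.  Try memfd_create() first on Linux, then fall back to an unlinked
 * POSIX shm object.
 */
static int shared_mem_fd(size_t size)
{
    char name[64];
    int fd = -1;

    assert(size > 0);

#ifdef __linux__
    /* MFD_CLOEXEC matches shm_open(), which sets FD_CLOEXEC on its fd. */
    fd = memfd_create("sketch-ram", MFD_CLOEXEC);
#endif
    if (fd < 0) {
        /* The name only needs to be unique until shm_unlink() below. */
        snprintf(name, sizeof(name), "/sketch-%ld-fallback", (long)getpid());
        fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd < 0) {
            return -1;
        }
        shm_unlink(name);
    }

    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The caller would mmap() the returned descriptor with MAP_SHARED; the same
helper could serve both memory-backend-ram,share=on and internal RAM.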

> I'm hoping we can find a way where it just all is rather intuitive, like
> 
> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
> 
> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
> 
> Thoughts?

Agreed, though I thought I had already landed at the intuitive specification in my patch.
The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
of options and words to describe them.

- Steve





* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-04 20:56           ` Steven Sistare
@ 2024-11-04 21:36             ` David Hildenbrand
  2024-11-06 20:12               ` Steven Sistare
  0 siblings, 1 reply; 86+ messages in thread
From: David Hildenbrand @ 2024-11-04 21:36 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel
  Cc: Peter Xu, Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster

On 04.11.24 21:56, Steven Sistare wrote:
> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>> On 04.11.24 20:51, David Hildenbrand wrote:
>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>
>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>
>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>> similar to backends/hostmem-shm.c.
>>>>>
>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>> details. See below.
>>>>
>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>> for posix shm would be a good follow on patch to support other unices.
>>>
>>> Unless there is reason to use memfd we should start with the more
>>> generic POSIX variant that is available even on systems without memfd.
>>> Factoring stuff out as I drafted does look quite compelling.
>>>
>>> I can help with the rework, and send it out separately, so you can focus
>>> on the "machine toggle" as part of this series.
>>>
>>> Of course, if we find out we need the memfd internally instead under
>>> Linux for whatever reason later, we can use that instead.
>>>
>>> But IIUC, the main selling point for memfd are additional features
>>> (hugetlb, memory sealing) that you aren't even using.
>>
>> FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
> 
> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
> To do so using shm_open requires configuration on the mount.  One step harder to use.

Yes.

> 
> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
> if memory-backend-ram has hogged all the memory.
> 
>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
> 
> Yes, and if that is a good idea, then the same should be done for internal RAM
> -- memfd if available and fallback to shm_open.

Yes.

> 
>> I'm hoping we can find a way where it just all is rather intuitive, like
>>
>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>
>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>
>> Thoughts?
> 
> Agreed, though I thought I had already landed at the intuitive specification in my patch.
> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
> of options and words to describe them.

Well, yes, plus making it all a bit more consistent, so that the "machine
option" behaves just like "memory-backend-ram,share=on".
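
As a rough illustration of the "memfd if available, fall back to
shm_open()" idea discussed above, the following C sketch tries
memfd_create() first and falls back to POSIX shared memory.  The helper
name anon_shared_fd() and the object names are hypothetical, chosen for
this example only; this is not the qemu_memfd_create()/qemu_shm_alloc()
code from the series.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not QEMU code): return an fd that refers to
 * "anonymous" memory of the given size, which another process could map
 * if it received the fd.  Returns -1 on failure.
 */
static int anon_shared_fd(size_t size)
{
    int fd = -1;

    assert(size > 0);

#ifdef __linux__
    /* Preferred on Linux: backed by the kernel-internal tmpfs mount
     * rather than by /dev/shm. */
    fd = memfd_create("aux-ram", MFD_CLOEXEC);
#endif
    if (fd < 0) {
        /* Portable fallback: a POSIX shm object, unlinked right away so
         * that only the fd keeps it alive. */
        char name[64];

        snprintf(name, sizeof(name), "/aux-ram-%ld", (long)getpid());
        fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd < 0) {
            return -1;
        }
        shm_unlink(name);
    }
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Either way the caller only sees an fd, which is what keeps the details
hidden from the rest of the code.  On older glibc versions shm_open() may
need -lrt at link time.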

-- 
Cheers,

David / dhildenb




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-04 21:36             ` David Hildenbrand
@ 2024-11-06 20:12               ` Steven Sistare
  2024-11-06 20:41                 ` Peter Xu
  2024-11-07 13:23                 ` David Hildenbrand
  0 siblings, 2 replies; 86+ messages in thread
From: Steven Sistare @ 2024-11-06 20:12 UTC (permalink / raw)
  To: David Hildenbrand, Peter Xu
  Cc: Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel



On 11/4/2024 4:36 PM, David Hildenbrand wrote:
> On 04.11.24 21:56, Steven Sistare wrote:
>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>
>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>
>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>> similar to backends/hostmem-shm.c.
>>>>>>
>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>> details. See below.
>>>>>
>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>
>>>> Unless there is reason to use memfd we should start with the more
>>>> generic POSIX variant that is available even on systems without memfd.
>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>
>>>> I can help with the rework, and send it out separately, so you can focus
>>>> on the "machine toggle" as part of this series.
>>>>
>>>> Of course, if we find out we need the memfd internally instead under
>>>> Linux for whatever reason later, we can use that instead.
>>>>
>>>> But IIUC, the main selling point for memfd are additional features
>>>> (hugetlb, memory sealing) that you aren't even using.
>>>
>>> FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>
>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>> To do so using shm_open requires configuration on the mount.  One step harder to use.
> 
> Yes.
> 
>>
>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>> if memory-backend-ram has hogged all the memory.
>>
>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>
>> Yes, and if that is a good idea, then the same should be done for internal RAM
>> -- memfd if available and fallback to shm_open.
> 
> Yes.
> 
>>
>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>
>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>
>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>
>>> Thoughts?
>>
>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>> of options and words to describe them.
> 
> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".

Hi David and Peter,

I have implemented and tested the following, for both qemu_memfd_create
and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
for simplicity.

Any comments before I submit a complete patch?

----
qemu-options.hx:
     ``aux-ram-share=on|off``
         Allocate auxiliary guest RAM as an anonymous file that is
         shareable with an external process.  This option applies to
         memory allocated as a side effect of creating various devices.
         It does not apply to memory-backend-objects, whether explicitly
         specified on the command line, or implicitly created by the -m
         command line option.

         Some migration modes require aux-ram-share=on.

qapi/migration.json:
     @cpr-transfer:
          ...
          Memory-backend objects must have the share=on attribute, but
          memory-backend-epc is not supported.  The VM must be started
          with the '-machine aux-ram-share=on' option.

Define RAM_PRIVATE

Define qemu_shm_alloc(), from David's tmp patch

ram_backend_memory_alloc()
     ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
     memory_region_init_ram_flags_nomigrate(ram_flags)

qemu_ram_alloc_internal()
     ...
     if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
         new_block->flags |= RAM_SHARED;

     if (!host && (new_block->flags & RAM_SHARED)) {
         qemu_ram_alloc_shared(new_block);
     } else {
         new_block->fd = -1;
         new_block->host = host;
     }
     ram_block_add(new_block);

qemu_ram_alloc_shared()
     if qemu_memfd_check()
         new_block->fd = qemu_memfd_create()
     else
         new_block->fd = qemu_shm_alloc()
     new_block->host = file_ram_alloc(new_block->fd)
----
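
For readers following along, the flag decision in the pseudo-code above
can be written as a small standalone C sketch.  The RAM_* bit values and
the two helper names below are made up for illustration (the real
definitions live in QEMU's headers, and RAM_PRIVATE is only proposed
here); the logic mirrors the pseudo-code, not the final patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit values only, not QEMU's real definitions. */
#define RAM_PREALLOC (1u << 0)
#define RAM_SHARED   (1u << 1)
#define RAM_PRIVATE  (1u << 2)

/*
 * Decide the final flags of a new block.  A block with no caller-provided
 * host memory and no explicit private request becomes shared when the
 * machine-wide aux_ram_share setting is on.
 */
static uint32_t aux_block_flags(const void *host, uint32_t ram_flags,
                                bool aux_ram_share)
{
    /* A backend asks for exactly one of shared or private. */
    assert(!((ram_flags & RAM_SHARED) && (ram_flags & RAM_PRIVATE)));

    if (!host && !(ram_flags & RAM_PRIVATE) && aux_ram_share) {
        ram_flags |= RAM_SHARED;
    }
    return ram_flags;
}

/* True if the block must be allocated from a shareable fd. */
static bool needs_shared_alloc(const void *host, uint32_t flags)
{
    return !host && (flags & RAM_SHARED);
}
```

With these definitions, a memory backend created with share=off keeps
RAM_PRIVATE and is never promoted, while an auxiliary block with no flags
is promoted only when aux_ram_share is set.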

- Steve




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-06 20:12               ` Steven Sistare
@ 2024-11-06 20:41                 ` Peter Xu
  2024-11-06 20:59                   ` Steven Sistare
  2024-11-07 13:23                 ` David Hildenbrand
  1 sibling, 1 reply; 86+ messages in thread
From: Peter Xu @ 2024-11-06 20:41 UTC (permalink / raw)
  To: Steven Sistare
  Cc: David Hildenbrand, Fabiano Rosas, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, qemu-devel

On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote:
> 
> 
> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
> > On 04.11.24 21:56, Steven Sistare wrote:
> > > On 11/4/2024 3:15 PM, David Hildenbrand wrote:
> > > > On 04.11.24 20:51, David Hildenbrand wrote:
> > > > > On 04.11.24 18:38, Steven Sistare wrote:
> > > > > > On 11/4/2024 5:39 AM, David Hildenbrand wrote:
> > > > > > > On 01.11.24 14:47, Steve Sistare wrote:
> > > > > > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> > > > > > > > on the value of the anon-alloc machine property.  This option applies to
> > > > > > > > memory allocated as a side effect of creating various devices. It does
> > > > > > > > not apply to memory-backend-objects, whether explicitly specified on
> > > > > > > > the command line, or implicitly created by the -m command line option.
> > > > > > > > 
> > > > > > > > The memfd option is intended to support new migration modes, in which the
> > > > > > > > memory region can be transferred in place to a new QEMU process, by sending
> > > > > > > > the memfd file descriptor to the process.  Memory contents are preserved,
> > > > > > > > and if the mode also transfers device descriptors, then pages that are
> > > > > > > > locked in memory for DMA remain locked.  This behavior is a pre-requisite
> > > > > > > > for supporting vfio, vdpa, and iommufd devices with the new modes.
> > > > > > > 
> > > > > > > A more portable, non-Linux specific variant of this will be using shm,
> > > > > > > similar to backends/hostmem-shm.c.
> > > > > > > 
> > > > > > > Likely we should be using that instead of memfd, or try hiding the
> > > > > > > details. See below.
> > > > > > 
> > > > > > For this series I would prefer to use memfd and hide the details.  It's a
> > > > > > concise (and well tested) solution albeit linux only.  The code you supply
> > > > > > for posix shm would be a good follow on patch to support other unices.
> > > > > 
> > > > > Unless there is reason to use memfd we should start with the more
> > > > > generic POSIX variant that is available even on systems without memfd.
> > > > > Factoring stuff out as I drafted does look quite compelling.
> > > > > 
> > > > > I can help with the rework, and send it out separately, so you can focus
> > > > > on the "machine toggle" as part of this series.
> > > > > 
> > > > > Of course, if we find out we need the memfd internally instead under
> > > > > Linux for whatever reason later, we can use that instead.
> > > > > 
> > > > > But IIUC, the main selling point for memfd are additional features
> > > > > (hugetlb, memory sealing) that you aren't even using.
> > > > 
> > > > FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
> > > 
> > > Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
> > > To do so using shm_open requires configuration on the mount.  One step harder to use.
> > 
> > Yes.
> > 
> > > 
> > > This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
> > > if memory-backend-ram has hogged all the memory.
> > > 
> > > > Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
> > > 
> > > Yes, and if that is a good idea, then the same should be done for internal RAM
> > > -- memfd if available and fallback to shm_open.
> > 
> > Yes.
> > 
> > > 
> > > > I'm hoping we can find a way where it just all is rather intuitive, like
> > > > 
> > > > "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
> > > > 
> > > > "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
> > > > 
> > > > Thoughts?
> > > 
> > > Agreed, though I thought I had already landed at the intuitive specification in my patch.
> > > The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
> > > controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
> > > of options and words to describe them.
> > 
> > Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
> 
> Hi David and Peter,
> 
> I have implemented and tested the following, for both qemu_memfd_create
> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
> for simplicity.

I'm ok with either shm or memfd, as this feature only applies to Linux
anyway.  I'll leave that part to you and David to decide.

> 
> Any comments before I submit a complete patch?
> 
> ----
> qemu-options.hx:
>     ``aux-ram-share=on|off``
>         Allocate auxiliary guest RAM as an anonymous file that is
>         shareable with an external process.  This option applies to
>         memory allocated as a side effect of creating various devices.
>         It does not apply to memory-backend-objects, whether explicitly
>         specified on the command line, or implicitly created by the -m
>         command line option.
> 
>         Some migration modes require aux-ram-share=on.
> 
> qapi/migration.json:
>     @cpr-transfer:
>          ...
>          Memory-backend objects must have the share=on attribute, but
>          memory-backend-epc is not supported.  The VM must be started
>          with the '-machine aux-ram-share=on' option.
> 
> Define RAM_PRIVATE
> 
> Define qemu_shm_alloc(), from David's tmp patch
> 
> ram_backend_memory_alloc()
>     ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>     memory_region_init_ram_flags_nomigrate(ram_flags)

Looks all good until here.

> 
> qemu_ram_alloc_internal()
>     ...
>     if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)

Nitpick: this could rely on flags only, rather than testing "!host"; AFAICT
a non-NULL host is equivalent to RAM_PREALLOC.  Meanwhile, I slightly prefer
that we don't touch anything if SHARED|PRIVATE is already set.  All combined,
it could be:

    if (!(ram_flags & (RAM_PREALLOC | RAM_PRIVATE | RAM_SHARED))) {
        // ramblock to be allocated, with no share/private request, aka,
        // aux memory chunk...
    }

>         new_block->flags |= RAM_SHARED;
> 
>     if (!host && (new_block->flags & RAM_SHARED)) {
>         qemu_ram_alloc_shared(new_block);

I'm not sure whether this needs its own helper.  Should we fall back to
ram_block_add() below, just as for any RAM_SHARED block?

IIUC, we could start to create RAM_SHARED memory in qemu_anon_ram_alloc() and
always cache the fd (even though we did not do that before)?
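
A sketch of the flags-only check suggested above, for comparison with the
"!host" test.  The bit values and helper names are hypothetical, and the
sketch assumes (as stated above, AFAICT) that caller-provided host memory
always comes with RAM_PREALLOC set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only, not QEMU's real definitions. */
#define RAM_PREALLOC (1u << 0)
#define RAM_SHARED   (1u << 1)
#define RAM_PRIVATE  (1u << 2)

/*
 * An "aux" block is one that still has to be allocated and carries no
 * explicit shared/private request.
 */
static bool is_aux_block(uint32_t ram_flags)
{
    return !(ram_flags & (RAM_PREALLOC | RAM_PRIVATE | RAM_SHARED));
}

/* Promote aux blocks to shared only when the machine option is on. */
static uint32_t apply_aux_ram_share(uint32_t ram_flags, bool aux_ram_share)
{
    /* A block cannot be requested as both shared and private. */
    assert(!((ram_flags & RAM_SHARED) && (ram_flags & RAM_PRIVATE)));

    if (is_aux_block(ram_flags) && aux_ram_share) {
        ram_flags |= RAM_SHARED;
    }
    return ram_flags;
}
```

Blocks that already carry RAM_SHARED or RAM_PRIVATE pass through
untouched, which is the "don't touch anything" behavior described above.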

>     } else {
>         new_block->fd = -1;
>         new_block->host = host;
>     }
>     ram_block_add(new_block);
> 
> qemu_ram_alloc_shared()
>     if qemu_memfd_check()
>         new_block->fd = qemu_memfd_create()
>     else
>         new_block->fd = qemu_shm_alloc()
>     new_block->host = file_ram_alloc(new_block->fd)
> ----
> 
> - Steve
> 

-- 
Peter Xu




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-06 20:41                 ` Peter Xu
@ 2024-11-06 20:59                   ` Steven Sistare
  2024-11-06 21:21                     ` Peter Xu
  2024-11-07 13:05                     ` David Hildenbrand
  0 siblings, 2 replies; 86+ messages in thread
From: Steven Sistare @ 2024-11-06 20:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, Fabiano Rosas, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, qemu-devel

On 11/6/2024 3:41 PM, Peter Xu wrote:
> On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote:
>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>
>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>
>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>
>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>> details. See below.
>>>>>>>
>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>
>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>
>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>> on the "machine toggle" as part of this series.
>>>>>>
>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>
>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>
>>>>> FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>
>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>
>>> Yes.
>>>
>>>>
>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>> if memory-backend-ram has hogged all the memory.
>>>>
>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>
>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>> -- memfd if available and fallback to shm_open.
>>>
>>> Yes.
>>>
>>>>
>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>
>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>
>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>
>>>>> Thoughts?
>>>>
>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>> of options and words to describe them.
>>>
>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>
>> Hi David and Peter,
>>
>> I have implemented and tested the following, for both qemu_memfd_create
>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>> for simplicity.
> 
> I'm ok with either shm or memfd, as this feature only applies to Linux
> anyway.  I'll leave that part to you and David to decide.
> 
>>
>> Any comments before I submit a complete patch?
>>
>> ----
>> qemu-options.hx:
>>      ``aux-ram-share=on|off``
>>          Allocate auxiliary guest RAM as an anonymous file that is
>>          shareable with an external process.  This option applies to
>>          memory allocated as a side effect of creating various devices.
>>          It does not apply to memory-backend-objects, whether explicitly
>>          specified on the command line, or implicitly created by the -m
>>          command line option.
>>
>>          Some migration modes require aux-ram-share=on.
>>
>> qapi/migration.json:
>>      @cpr-transfer:
>>           ...
>>           Memory-backend objects must have the share=on attribute, but
>>           memory-backend-epc is not supported.  The VM must be started
>>           with the '-machine aux-ram-share=on' option.
>>
>> Define RAM_PRIVATE
>>
>> Define qemu_shm_alloc(), from David's tmp patch
>>
>> ram_backend_memory_alloc()
>>      ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>      memory_region_init_ram_flags_nomigrate(ram_flags)
> 
> Looks all good until here.
> 
>>
>> qemu_ram_alloc_internal()
>>      ...
>>      if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
> 
> Nitpick: could rely on flags-only, rather than testing "!host", AFAICT
> that's equal to RAM_PREALLOC.  

IMO testing host is clearer and more future proof, regardless of how flags
are currently used.  If the caller passes host, then we should not allocate
memory here, full stop.

> Meanwhile I slightly prefer we don't touch
> anything if SHARED|PRIVATE is set.  

OK, if SHARED is already set I will not set it again.

> All combined, it could be:
> 
>      if (!(ram_flags & (RAM_PREALLOC | RAM_PRIVATE | RAM_SHARED))) {
>          // ramblock to be allocated, with no share/private request, aka,
>          // aux memory chunk...
>      }
> 
>>          new_block->flags |= RAM_SHARED;
>>
>>      if (!host && (new_block->flags & RAM_SHARED)) {
>>          qemu_ram_alloc_shared(new_block);
> 
> I'm not sure whether this needs its own helper.  

Reserve judgement until you see the full patch.  The helper is a
non-trivial subroutine and IMO it improves readability.  Also the
cpr find/save hooks are confined to the subroutine.

> Should we fallback to
> ram_block_add() below, just like a RAM_SHARED?

I thought we all discussed and agreed that the allocation should be performed
above ram_block_add.  David's suggested patch does it here also.

- Steve

> IIUC, we could start to create RAM_SHARED in qemu_anon_ram_alloc() and
> always cache the fd (even if we don't do that before)?
> 
>>      } else {
>>          new_block->fd = -1;
>>          new_block->host = host;
>>      }
>>      ram_block_add(new_block);
>>
>> qemu_ram_alloc_shared()
>>      if qemu_memfd_check()
>>          new_block->fd = qemu_memfd_create()
>>      else
>>          new_block->fd = qemu_shm_alloc()
>>      new_block->host = file_ram_alloc(new_block->fd)
>> ----
>>
>> - Steve
>>
> 




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-06 20:59                   ` Steven Sistare
@ 2024-11-06 21:21                     ` Peter Xu
  2024-11-07 14:03                       ` Steven Sistare
  2024-11-07 13:05                     ` David Hildenbrand
  1 sibling, 1 reply; 86+ messages in thread
From: Peter Xu @ 2024-11-06 21:21 UTC (permalink / raw)
  To: Steven Sistare
  Cc: David Hildenbrand, Fabiano Rosas, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, qemu-devel

On Wed, Nov 06, 2024 at 03:59:23PM -0500, Steven Sistare wrote:
> On 11/6/2024 3:41 PM, Peter Xu wrote:
> > On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote:
> > > On 11/4/2024 4:36 PM, David Hildenbrand wrote:
> > > > On 04.11.24 21:56, Steven Sistare wrote:
> > > > > On 11/4/2024 3:15 PM, David Hildenbrand wrote:
> > > > > > On 04.11.24 20:51, David Hildenbrand wrote:
> > > > > > > On 04.11.24 18:38, Steven Sistare wrote:
> > > > > > > > On 11/4/2024 5:39 AM, David Hildenbrand wrote:
> > > > > > > > > On 01.11.24 14:47, Steve Sistare wrote:
> > > > > > > > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> > > > > > > > > > on the value of the anon-alloc machine property.  This option applies to
> > > > > > > > > > memory allocated as a side effect of creating various devices. It does
> > > > > > > > > > not apply to memory-backend-objects, whether explicitly specified on
> > > > > > > > > > the command line, or implicitly created by the -m command line option.
> > > > > > > > > > 
> > > > > > > > > > The memfd option is intended to support new migration modes, in which the
> > > > > > > > > > memory region can be transferred in place to a new QEMU process, by sending
> > > > > > > > > > the memfd file descriptor to the process.  Memory contents are preserved,
> > > > > > > > > > and if the mode also transfers device descriptors, then pages that are
> > > > > > > > > > locked in memory for DMA remain locked.  This behavior is a pre-requisite
> > > > > > > > > > for supporting vfio, vdpa, and iommufd devices with the new modes.
> > > > > > > > > 
> > > > > > > > > A more portable, non-Linux specific variant of this will be using shm,
> > > > > > > > > similar to backends/hostmem-shm.c.
> > > > > > > > > 
> > > > > > > > > Likely we should be using that instead of memfd, or try hiding the
> > > > > > > > > details. See below.
> > > > > > > > 
> > > > > > > > For this series I would prefer to use memfd and hide the details.  It's a
> > > > > > > > concise (and well tested) solution albeit linux only.  The code you supply
> > > > > > > > for posix shm would be a good follow on patch to support other unices.
> > > > > > > 
> > > > > > > Unless there is reason to use memfd we should start with the more
> > > > > > > generic POSIX variant that is available even on systems without memfd.
> > > > > > > Factoring stuff out as I drafted does look quite compelling.
> > > > > > > 
> > > > > > > I can help with the rework, and send it out separately, so you can focus
> > > > > > > on the "machine toggle" as part of this series.
> > > > > > > 
> > > > > > > Of course, if we find out we need the memfd internally instead under
> > > > > > > Linux for whatever reason later, we can use that instead.
> > > > > > > 
> > > > > > > But IIUC, the main selling point for memfd are additional features
> > > > > > > (hugetlb, memory sealing) that you aren't even using.
> > > > > > 
> > > > > > FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
> > > > > 
> > > > > Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
> > > > > To do so using shm_open requires configuration on the mount.  One step harder to use.
> > > > 
> > > > Yes.
> > > > 
> > > > > 
> > > > > This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
> > > > > if memory-backend-ram has hogged all the memory.
> > > > > 
> > > > > > Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
> > > > > 
> > > > > Yes, and if that is a good idea, then the same should be done for internal RAM
> > > > > -- memfd if available and fallback to shm_open.
> > > > 
> > > > Yes.
> > > > 
> > > > > 
> > > > > > I'm hoping we can find a way where it just all is rather intuitive, like
> > > > > > 
> > > > > > "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
> > > > > > 
> > > > > > "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
> > > > > > 
> > > > > > Thoughts?
> > > > > 
> > > > > Agreed, though I thought I had already landed at the intuitive specification in my patch.
> > > > > The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
> > > > > controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
> > > > > of options and words to describe them.
> > > > 
> > > > Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
> > > 
> > > Hi David and Peter,
> > > 
> > > I have implemented and tested the following, for both qemu_memfd_create
> > > and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
> > > for simplicity.
> > 
> > I'm ok with either shm or memfd, as this feature only applies to Linux
> > anyway.  I'll leave that part to you and David to decide.
> > 
> > > 
> > > Any comments before I submit a complete patch?
> > > 
> > > ----
> > > qemu-options.hx:
> > >      ``aux-ram-share=on|off``
> > >          Allocate auxiliary guest RAM as an anonymous file that is
> > >          shareable with an external process.  This option applies to
> > >          memory allocated as a side effect of creating various devices.
> > >          It does not apply to memory-backend-objects, whether explicitly
> > >          specified on the command line, or implicitly created by the -m
> > >          command line option.
> > > 
> > >          Some migration modes require aux-ram-share=on.
> > > 
> > > qapi/migration.json:
> > >      @cpr-transfer:
> > >           ...
> > >           Memory-backend objects must have the share=on attribute, but
> > >           memory-backend-epc is not supported.  The VM must be started
> > >           with the '-machine aux-ram-share=on' option.
> > > 
> > > Define RAM_PRIVATE
> > > 
> > > Define qemu_shm_alloc(), from David's tmp patch
> > > 
> > > ram_backend_memory_alloc()
> > >      ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
> > >      memory_region_init_ram_flags_nomigrate(ram_flags)
> > 
> > Looks all good until here.
> > 
> > > 
> > > qemu_ram_alloc_internal()
> > >      ...
> > >      if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
> > 
> > Nitpick: could rely on flags-only, rather than testing "!host", AFAICT
> > that's equal to RAM_PREALLOC.
> 
> IMO testing host is clearer and more future proof, regardless of how flags
> are currently used.  If the caller passes host, then we should not allocate
> memory here, full stop.
> 
> > Meanwhile I slightly prefer we don't touch
> > anything if SHARED|PRIVATE is set.
> 
> OK, if SHARED is already set I will not set it again.
> 
> > All combined, it could be:
> > 
> >      if (!(ram_flags & (RAM_PREALLOC | RAM_PRIVATE | RAM_SHARED))) {
> >          // ramblock to be allocated, with no share/private request, aka,
> >          // aux memory chunk...
> >      }
> > 
> > >          new_block->flags |= RAM_SHARED;
> > > 
> > >      if (!host && (new_block->flags & RAM_SHARED)) {
> > >          qemu_ram_alloc_shared(new_block);
> > 
> > I'm not sure whether this needs its own helper.
> 
> Reserve judgement until you see the full patch.  The helper is a
> non-trivial subroutine and IMO it improves readability.  Also the
> cpr find/save hooks are confined to the subroutine.

I thought we could use the same code path to process an "aux ramblock" and all
other kinds of RAM_SHARED ramblocks.  IIUC cpr find/save should apply there
too, but maybe I missed something.

> 
> > Should we fallback to
> > ram_block_add() below, just like a RAM_SHARED?
> 
> I thought we all discussed and agreed that the allocation should be performed
> above ram_block_add.  David's suggested patch does it here also.

I have not closely followed all the discussions that happened, so I could
have missed something indeed.

One thing I want to double check is cpr will still make things like below
work, right?

  -object memory-backend-ram,share=on [1]

IIUC with the old code this won't create an fd, so to make cpr work (which is
also what I was trying to say in the previous email) we could silently
start to create memfds for these, which means we first need to teach
qemu_anon_ram_alloc() to create a memfd for RAM_SHARED and cache these fds
(which should hopefully keep the same behavior as before).

Then for aux ramblocks like ROMs, as long as RAM_SHARED is set properly in
qemu_ram_alloc_internal() (but only when aux-ram-share=on, for sure),
the rest of the code path in ram_block_add() should be the same as when the
user specified share=on in [1].

Anyway, if both of you agreed on it, I am happy to wait and read the whole
patch.

Side note: I'll still need a few days for other things, but I'll get back to
reading this whole series before next week. BTW, this series does not depend
on the precreate phase now, am I right?

> 
> - Steve
> 
> > IIUC, we could start to create RAM_SHARED in qemu_anon_ram_alloc() and
> > always cache the fd (even if we don't do that before)?
> > 
> > >      } else {
> > >          new_block->fd = -1;
> > >          new_block->host = host;
> > >      }
> > >      ram_block_add(new_block);
> > > 
> > > qemu_ram_alloc_shared()
> > >      if qemu_memfd_check()
> > >          new_block->fd = qemu_memfd_create()
> > >      else
> > >          new_block->fd = qemu_shm_alloc()
> > >      new_block->host = file_ram_alloc(new_block->fd)
> > > ----
> > > 
> > > - Steve
> > > 
> > 
> 

-- 
Peter Xu




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-06 20:59                   ` Steven Sistare
  2024-11-06 21:21                     ` Peter Xu
@ 2024-11-07 13:05                     ` David Hildenbrand
  2024-11-07 14:04                       ` Steven Sistare
  1 sibling, 1 reply; 86+ messages in thread
From: David Hildenbrand @ 2024-11-07 13:05 UTC (permalink / raw)
  To: Steven Sistare, Peter Xu
  Cc: Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel

On 06.11.24 21:59, Steven Sistare wrote:
> On 11/6/2024 3:41 PM, Peter Xu wrote:
>> On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote:
>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>
>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>
>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>
>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>> details. See below.
>>>>>>>>
>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>
>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>
>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>
>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>
>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>
>>>>>> FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>
>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>
>>>> Yes.
>>>>
>>>>>
>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>> if memory-backend-ram has hogged all the memory.
>>>>>
>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>
>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>> -- memfd if available and fallback to shm_open.
>>>>
>>>> Yes.
>>>>
>>>>>
>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>
>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>
>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>> of options and words to describe them.
>>>>
>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>
>>> Hi David and Peter,
>>>
>>> I have implemented and tested the following, for both qemu_memfd_create
>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>> for simplicity.
>>
>> I'm ok with either shm or memfd, as this feature only applies to Linux
>> anyway.  I'll leave that part to you and David to decide.
>>
>>>
>>> Any comments before I submit a complete patch?
>>>
>>> ----
>>> qemu-options.hx:
>>>       ``aux-ram-share=on|off``
>>>           Allocate auxiliary guest RAM as an anonymous file that is
>>>           shareable with an external process.  This option applies to
>>>           memory allocated as a side effect of creating various devices.
>>>           It does not apply to memory-backend-objects, whether explicitly
>>>           specified on the command line, or implicitly created by the -m
>>>           command line option.
>>>
>>>           Some migration modes require aux-ram-share=on.
>>>
>>> qapi/migration.json:
>>>       @cpr-transfer:
>>>            ...
>>>            Memory-backend objects must have the share=on attribute, but
>>>            memory-backend-epc is not supported.  The VM must be started
>>>            with the '-machine aux-ram-share=on' option.
>>>
>>> Define RAM_PRIVATE
>>>
>>> Define qemu_shm_alloc(), from David's tmp patch
>>>
>>> ram_backend_memory_alloc()
>>>       ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>       memory_region_init_ram_flags_nomigrate(ram_flags)
>>
>> Looks all good until here.
>>
>>>
>>> qemu_ram_alloc_internal()
>>>       ...
>>>       if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>
>> Nitpick: could rely on flags-only, rather than testing "!host", AFAICT
>> that's equal to RAM_PREALLOC.
> 
> IMO testing host is clearer and more future proof, regardless of how flags
> are currently used.  If the caller passes host, then we should not allocate
> memory here, full stop.
> 
>> Meanwhile I slightly prefer we don't touch
>> anything if SHARED|PRIVATE is set.
> 
> OK, if SHARED is already set I will not set it again.

We only have to make sure that stuff like qemu_ram_is_shared() will 
continue working as expected.

What I think we should do:

We should probably assert that nobody passes in SHARED|PRIVATE. And we 
can use PRIVATE only as a parameter to the function, but never actually 
set it on the ramblock.

If someone passes in PRIVATE, we don't include it in block->flags.
(RAM_SHARED remains cleared)

If someone passes in SHARED, we do set it in block->flags.
If someone passes PRIVATE|SHARED, we assert.

If someone passes in nothing: we set block->flags to SHARED with 
aux_ram_share=on. Otherwise, we do nothing (RAM_SHARED remains cleared)


If that's also what you had in mind, great.
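
The flag rules above can be written down as a small helper. This is only a sketch for discussion: the function name resolve_block_flags() and the bit values are made up for illustration and are not QEMU's actual definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values for illustration; not QEMU's real definitions. */
#define RAM_SHARED  (1u << 1)
#define RAM_PRIVATE (1u << 13)

/*
 * Compute the flags stored on the ramblock from the flags the caller
 * passed in.  RAM_PRIVATE is a request only and is never stored.
 */
static uint32_t resolve_block_flags(uint32_t ram_flags, bool aux_ram_share)
{
    const uint32_t both = RAM_SHARED | RAM_PRIVATE;

    /* Asking for shared and private at the same time is a caller bug. */
    assert((ram_flags & both) != both);

    if (ram_flags & RAM_PRIVATE) {
        /* Explicit private request: drop the bit, RAM_SHARED stays clear. */
        return ram_flags & ~RAM_PRIVATE;
    }
    if (ram_flags & RAM_SHARED) {
        /* Explicit shared request: keep it as is. */
        return ram_flags;
    }
    /* No request either way: follow the machine option. */
    return aux_ram_share ? (ram_flags | RAM_SHARED) : ram_flags;
}
```

With this shape, qemu_ram_is_shared() style checks only ever need to look at RAM_SHARED on the block.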

-- 
Cheers,

David / dhildenb




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-06 20:12               ` Steven Sistare
  2024-11-06 20:41                 ` Peter Xu
@ 2024-11-07 13:23                 ` David Hildenbrand
  2024-11-07 16:02                   ` Steven Sistare
  1 sibling, 1 reply; 86+ messages in thread
From: David Hildenbrand @ 2024-11-07 13:23 UTC (permalink / raw)
  To: Steven Sistare, Peter Xu
  Cc: Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel

On 06.11.24 21:12, Steven Sistare wrote:
> 
> 
> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>> On 04.11.24 21:56, Steven Sistare wrote:
>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>
>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>
>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>
>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>> details. See below.
>>>>>>
>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>
>>>>> Unless there is reason to use memfd we should start with the more
>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>
>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>> on the "machine toggle" as part of this series.
>>>>>
>>>>> Of course, if we find out we need the memfd internally instead under
>>>>> Linux for whatever reason later, we can use that instead.
>>>>>
>>>>> But IIUC, the main selling point for memfd are additional features
>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>
>>>> FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>
>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>
>> Yes.
>>
>>>
>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>> if memory-backend-ram has hogged all the memory.
>>>
>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>
>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>> -- memfd if available and fallback to shm_open.
>>
>> Yes.
>>
>>>
>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>
>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>
>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>
>>>> Thoughts?
>>>
>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>> of options and words to describe them.
>>
>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
> 
> Hi David and Peter,
> 
> I have implemented and tested the following, for both qemu_memfd_create
> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
> for simplicity.
> 
> Any comments before I submit a complete patch?
> 
> ----
> qemu-options.hx:
>       ``aux-ram-share=on|off``
>           Allocate auxiliary guest RAM as an anonymous file that is
>           shareable with an external process.  This option applies to
>           memory allocated as a side effect of creating various devices.
>           It does not apply to memory-backend-objects, whether explicitly
>           specified on the command line, or implicitly created by the -m
>           command line option.
> 
>           Some migration modes require aux-ram-share=on.
> 
> qapi/migration.json:
>       @cpr-transfer:
>            ...
>            Memory-backend objects must have the share=on attribute, but
>            memory-backend-epc is not supported.  The VM must be started
>            with the '-machine aux-ram-share=on' option.
> 
> Define RAM_PRIVATE
> 
> Define qemu_shm_alloc(), from David's tmp patch
> 
> ram_backend_memory_alloc()
>       ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>       memory_region_init_ram_flags_nomigrate(ram_flags)
> 
> qemu_ram_alloc_internal()
>       ...
>       if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>           new_block->flags |= RAM_SHARED;
> 
>       if (!host && (new_block->flags & RAM_SHARED)) {
>           qemu_ram_alloc_shared(new_block);
>       } else {
>           new_block->fd = -1;
>           new_block->host = host;
>       }
>       ram_block_add(new_block);
> 
> qemu_ram_alloc_shared()
>       if qemu_memfd_check()
>           new_block->fd = qemu_memfd_create()
>       else
>           new_block->fd = qemu_shm_alloc()

Yes, that way "memory-backend-ram,share=on" will just mean "give me the 
best shared memory for RAM to be shared with other processes, I don't 
care about the details", and it will work on Linux kernels even before 
we had memfds.

memory-backend-ram should be available on all architectures, and under 
Windows. qemu_anon_ram_alloc() under Linux just does nothing special, 
not even bail out.

MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I
can share only with subprocesses", but then, *there are no subprocesses
for QEMU*. I recall there was a trick to obtain the fd under Linux for
these regions using /proc/self/fd/, but it's very Linux specific ...

So nobody would *actually* use that shared memory and it was only a hack 
for RDMA. Now we can do better.
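
To illustrate the difference, here is a minimal Linux-only sketch, assuming memfd_create() is available (glibc 2.27 or later). The function name fd_backed_sharing_demo() is made up for illustration. Memory backed by an fd can be mapped a second time through the fd alone, which is what another process would do after receiving the fd, whereas MAP_SHARED|MAP_ANON memory has no handle that could be passed along.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Write through one mapping of a memfd and read the data back through a
 * second, independent mapping of the same fd.  Returns 0 on success.
 */
static int fd_backed_sharing_demo(void)
{
    const size_t size = 4096;
    int ret = -1;
    int fd = memfd_create("demo-ram", MFD_CLOEXEC);

    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }

    /* Two mappings of the same fd stand in for two processes. */
    char *a = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *b = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    if (a != MAP_FAILED && b != MAP_FAILED) {
        strcpy(a, "guest data");
        ret = strcmp(b, "guest data") == 0 ? 0 : -1;
    }
    if (a != MAP_FAILED) {
        munmap(a, size);
    }
    if (b != MAP_FAILED) {
        munmap(b, size);
    }
    close(fd);
    return ret;
}
```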


We'll have to decide if we simply fall back to qemu_anon_ram_alloc() if
no shared memory can be created (unavailable), like we do on Windows.

So maybe something like

qemu_ram_alloc_shared()
	fd = -1;

	if (qemu_memfd_available()) {
		fd = qemu_memfd_create();
		if (fd < 0)
			... error
	} else if (qemu_shm_available()) {
		fd = qemu_shm_alloc();
		if (fd < 0)
			... error
	} else {
		/*
		 * Old behavior: try fd-less shared memory. We might
		 * just end up with non-shared memory on Windows, but
		 * nobody can make use of this shared memory either way
		 * ... should we just use non-shared memory? Or should
		 * we simply bail out? But then, if there is no shared
		 * memory nobody could possibly use it.
		 */
		qemu_anon_ram_alloc(share=true)
	}
	}
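
For reference, the memfd-then-shm order can be tried outside of QEMU with plain libc calls. This is a standalone sketch, not the proposed QEMU code: shared_ram_fd() and the object names are hypothetical, error reporting is reduced to returning -1, and on glibc older than 2.34 shm_open() needs -lrt at link time.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Return an fd for 'size' bytes of anonymous shareable memory, or -1.
 * Prefer memfd_create() on Linux and fall back to POSIX shm otherwise.
 */
static int shared_ram_fd(size_t size)
{
    int fd = -1;

#ifdef __linux__
    fd = memfd_create("shared-ram", MFD_CLOEXEC);
#endif
    if (fd < 0) {
        char name[64];

        /* POSIX shm objects are named; unlink right away to make it anonymous. */
        snprintf(name, sizeof(name), "/shared-ram-%ld", (long)getpid());
        fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd < 0) {
            return -1;
        }
        shm_unlink(name);
    }
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Note that the shm path is limited by the size of the /dev/shm mount on Linux, as discussed earlier in the thread, while the memfd path is not.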
-- 
Cheers,

David / dhildenb




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-06 21:21                     ` Peter Xu
@ 2024-11-07 14:03                       ` Steven Sistare
  0 siblings, 0 replies; 86+ messages in thread
From: Steven Sistare @ 2024-11-07 14:03 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, Fabiano Rosas, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, qemu-devel

On 11/6/2024 4:21 PM, Peter Xu wrote:
> On Wed, Nov 06, 2024 at 03:59:23PM -0500, Steven Sistare wrote:
>> On 11/6/2024 3:41 PM, Peter Xu wrote:
>>> On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote:
>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>
>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>
>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>
>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>> details. See below.
>>>>>>>>>
>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>
>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>
>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>
>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>
>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>
>>>>>>> FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>
>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>
>>>>> Yes.
>>>>>
>>>>>>
>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>
>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>
>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>> -- memfd if available and fallback to shm_open.
>>>>>
>>>>> Yes.
>>>>>
>>>>>>
>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>
>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>
>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>
>>>>>>> Thoughts?
>>>>>>
>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>> of options and words to describe them.
>>>>>
>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>
>>>> Hi David and Peter,
>>>>
>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>> for simplicity.
>>>
>>> I'm ok with either shm or memfd, as this feature only applies to Linux
>>> anyway.  I'll leave that part to you and David to decide.
>>>
>>>>
>>>> Any comments before I submit a complete patch?
>>>>
>>>> ----
>>>> qemu-options.hx:
>>>>       ``aux-ram-share=on|off``
>>>>           Allocate auxiliary guest RAM as an anonymous file that is
>>>>           shareable with an external process.  This option applies to
>>>>           memory allocated as a side effect of creating various devices.
>>>>           It does not apply to memory-backend-objects, whether explicitly
>>>>           specified on the command line, or implicitly created by the -m
>>>>           command line option.
>>>>
>>>>           Some migration modes require aux-ram-share=on.
>>>>
>>>> qapi/migration.json:
>>>>       @cpr-transfer:
>>>>            ...
>>>>            Memory-backend objects must have the share=on attribute, but
>>>>            memory-backend-epc is not supported.  The VM must be started
>>>>            with the '-machine aux-ram-share=on' option.
>>>>
>>>> Define RAM_PRIVATE
>>>>
>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>
>>>> ram_backend_memory_alloc()
>>>>       ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>       memory_region_init_ram_flags_nomigrate(ram_flags)
>>>
>>> Looks all good until here.
>>>
>>>>
>>>> qemu_ram_alloc_internal()
>>>>       ...
>>>>       if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>
>>> Nitpick: could rely on flags-only, rather than testing "!host", AFAICT
>>> that's equal to RAM_PREALLOC.
>>
>> IMO testing host is clearer and more future proof, regardless of how flags
>> are currently used.  If the caller passes host, then we should not allocate
>> memory here, full stop.
>>
>>> Meanwhile I slightly prefer we don't touch
>>> anything if SHARED|PRIVATE is set.
>>
>> OK, if SHARED is already set I will not set it again.
>>
>>> All combined, it could be:
>>>
>>>       if (!(ram_flags & (RAM_PREALLOC | RAM_PRIVATE | RAM_SHARED))) {
>>>           // ramblock to be allocated, with no share/private request, aka,
>>>           // aux memory chunk...
>>>       }
>>>
>>>>           new_block->flags |= RAM_SHARED;
>>>>
>>>>       if (!host && (new_block->flags & RAM_SHARED)) {
>>>>           qemu_ram_alloc_shared(new_block);
>>>
>>> I'm not sure whether this needs its own helper.
>>
>> Reserve judgement until you see the full patch.  The helper is a
>> non-trivial subroutine and IMO it improves readability.  Also the
>> cpr find/save hooks are confined to the subroutine.
> 
> I thought we can use the same code path to process "aux ramblock" and all
> kinds of other RAM_SHARED ramblocks.  IIUC cpr find/save should apply there
> too, but maybe I missed something.

Yes.  qemu_ram_alloc_shared() allocates and handles CPR for aux ramblock and
memory-backend-ram,share=on.

The key change that David suggested, that allows this unification, is to
push the fd creation down from ram_backend_memory_alloc to qemu_ram_alloc_shared.
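
Roughly, the find/save pattern inside the helper looks like the sketch below. All names here (demo_find_fd, demo_save_fd, demo_create_fd, demo_ram_fd) are hypothetical stand-ins and not the functions in the series, the table lives in one process instead of being carried over the cpr channel, and the "fd" is a fake number so the example stays self-contained.

```c
#include <assert.h>
#include <string.h>

#define MAX_SAVED 16

struct saved_fd {
    char name[64];
    int fd;
};

static struct saved_fd saved[MAX_SAVED];
static int nsaved;
static int next_fake_fd = 100;

/* Stand-in for memfd/shm creation; returns a fake descriptor number. */
static int demo_create_fd(void)
{
    return next_fake_fd++;
}

/* Look up a previously saved fd by ramblock name; -1 if none. */
static int demo_find_fd(const char *name)
{
    for (int i = 0; i < nsaved; i++) {
        if (strcmp(saved[i].name, name) == 0) {
            return saved[i].fd;
        }
    }
    return -1;
}

/* Remember the fd under the ramblock name; -1 if the table is full. */
static int demo_save_fd(const char *name, int fd)
{
    if (nsaved == MAX_SAVED) {
        return -1;
    }
    strncpy(saved[nsaved].name, name, sizeof(saved[nsaved].name) - 1);
    saved[nsaved].name[sizeof(saved[nsaved].name) - 1] = '\0';
    saved[nsaved].fd = fd;
    nsaved++;
    return 0;
}

/* Reuse a preserved fd if one exists, else create one and record it. */
static int demo_ram_fd(const char *name)
{
    int fd = demo_find_fd(name);

    if (fd < 0) {
        fd = demo_create_fd();
        if (fd >= 0 && demo_save_fd(name, fd) < 0) {
            return -1;
        }
    }
    return fd;
}
```

Because both aux ramblocks and memory-backend-ram,share=on go through the same helper, the lookup and the save happen in exactly one place.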

>>> Should we fallback to
>>> ram_block_add() below, just like a RAM_SHARED?
>>
>> I thought we all discussed and agreed that the allocation should be performed
>> above ram_block_add.  David's suggested patch does it here also.
> 
> I have not closely followed all the discussions that happened, so I could
> have missed something indeed.
> 
> One thing I want to double check is cpr will still make things like below
> work, right?
> 
>    -object memory-backend-ram,share=on [1]

Yes, this new patch makes that work for CPR.
V3 did not support CPR for memory-backend-ram.

> IIUC with the old code this won't create an fd, so to make cpr work (which is
> also what I was trying to say in the previous email) we could silently
> start to create memfds for these, which means we first need to teach
> qemu_anon_ram_alloc() to create a memfd for RAM_SHARED and cache these fds
> (which should hopefully keep the same behavior as before).

Now the fd is created and cached in qemu_ram_alloc_internal -> qemu_ram_alloc_shared.

> Then for aux ramblocks like ROMs, as long as RAM_SHARED is set properly in
> qemu_ram_alloc_internal() (but only when aux-ram-share=on, for sure),
> the rest of the code path in ram_block_add() should be the same as when the
> user specified share=on in [1].
> 
> Anyway, if both of you agreed on it, I am happy to wait and read the whole
> patch.
> 
> Side note: I'll still need a few days for other things, but I'll get back to
> reading this whole series before next week. BTW, this series does not depend
> on the precreate phase now, am I right?

Correct, this series does not depend on precreate.

- Steve

>>> IIUC, we could start to create RAM_SHARED in qemu_anon_ram_alloc() and
>>> always cache the fd (even if we don't do that before)?
>>>
>>>>       } else {
>>>>           new_block->fd = -1;
>>>>           new_block->host = host;
>>>>       }
>>>>       ram_block_add(new_block);
>>>>
>>>> qemu_ram_alloc_shared()
>>>>       if qemu_memfd_check()
>>>>           new_block->fd = qemu_memfd_create()
>>>>       else
>>>>           new_block->fd = qemu_shm_alloc()
>>>>       new_block->host = file_ram_alloc(new_block->fd)
>>>> ----
>>>>
>>>> - Steve
>>>>
>>>
>>
> 




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-07 13:05                     ` David Hildenbrand
@ 2024-11-07 14:04                       ` Steven Sistare
  2024-11-07 16:19                         ` David Hildenbrand
  2024-11-07 16:32                         ` Peter Xu
  0 siblings, 2 replies; 86+ messages in thread
From: Steven Sistare @ 2024-11-07 14:04 UTC (permalink / raw)
  To: David Hildenbrand, Peter Xu
  Cc: Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel

On 11/7/2024 8:05 AM, David Hildenbrand wrote:
> On 06.11.24 21:59, Steven Sistare wrote:
>> On 11/6/2024 3:41 PM, Peter Xu wrote:
>>> On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote:
>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>
>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>
>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>
>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>> details. See below.
>>>>>>>>>
>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>
>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>
>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>
>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>
>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>
>>>>>>> FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>
>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>
>>>>> Yes.
>>>>>
>>>>>>
>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>
>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>
>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>> -- memfd if available and fallback to shm_open.
>>>>>
>>>>> Yes.
>>>>>
>>>>>>
>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>
>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>
>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>
>>>>>>> Thoughts?
>>>>>>
>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>> of options and words to describe them.
>>>>>
>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>
>>>> Hi David and Peter,
>>>>
>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>> for simplicity.
>>>
>>> I'm ok with either shm or memfd, as this feature only applies to Linux
>>> anyway.  I'll leave that part to you and David to decide.
>>>
>>>>
>>>> Any comments before I submit a complete patch?
>>>>
>>>> ----
>>>> qemu-options.hx:
>>>>       ``aux-ram-share=on|off``
>>>>           Allocate auxiliary guest RAM as an anonymous file that is
>>>>           shareable with an external process.  This option applies to
>>>>           memory allocated as a side effect of creating various devices.
>>>>           It does not apply to memory-backend-objects, whether explicitly
>>>>           specified on the command line, or implicitly created by the -m
>>>>           command line option.
>>>>
>>>>           Some migration modes require aux-ram-share=on.
>>>>
>>>> qapi/migration.json:
>>>>       @cpr-transfer:
>>>>            ...
>>>>            Memory-backend objects must have the share=on attribute, but
>>>>            memory-backend-epc is not supported.  The VM must be started
>>>>            with the '-machine aux-ram-share=on' option.
>>>>
>>>> Define RAM_PRIVATE
>>>>
>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>
>>>> ram_backend_memory_alloc()
>>>>       ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>       memory_region_init_ram_flags_nomigrate(ram_flags)
>>>
>>> Looks all good until here.
>>>
>>>>
>>>> qemu_ram_alloc_internal()
>>>>       ...
>>>>       if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>
>>> Nitpick: could rely on flags-only, rather than testing "!host", AFAICT
>>> that's equal to RAM_PREALLOC.
>>
>> IMO testing host is clearer and more future proof, regardless of how flags
>> are currently used.  If the caller passes host, then we should not allocate
>> memory here, full stop.
>>
>>> Meanwhile I slightly prefer we don't touch
>>> anything if SHARED|PRIVATE is set.
>>
>> OK, if SHARED is already set I will not set it again.
> 
> We only have to make sure that stuff like qemu_ram_is_shared() will continue working as expected.
> 
> What I think we should do:
> 
> We should probably assert that nobody passes in SHARED|PRIVATE. And we can use PRIVATE only as a parameter to the function, but never actually set it on the ramblock.
> 
> If someone passes in PRIVATE, we don't include it in block->flags. (RAM_SHARED remains cleared)
> 
> If someone passes in SHARED, we do set it in block->flags.
> If someone passes PRIVATE|SHARED, we assert.
> 
> If someone passes in nothing: we set block->flags to SHARED with aux_ram_share=on. Otherwise, we do nothing (RAM_SHARED remains cleared)
> 
> If that's also what you had in mind, great.

Yes, my patch does that, but it also sets RAM_PRIVATE on the ramblock.
I will undo the latter.
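
To make sure we mean the same thing, here is a rough, standalone sketch of
those flag rules.  The helper name resolve_share_flags() is made up for
illustration, the bit values are placeholders rather than the real
definitions, and the !host condition from my pseudo-code is left out to keep
it short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values, for illustration only. */
#define RAM_SHARED   (1u << 1)
#define RAM_PRIVATE  (1u << 13)

/*
 * Hypothetical helper: given the flags passed by the caller and the
 * aux-ram-share machine setting, return the flags to store on the ramblock.
 */
static uint32_t resolve_share_flags(uint32_t ram_flags, bool aux_ram_share)
{
    /* Passing SHARED and PRIVATE together is a caller bug. */
    assert((ram_flags & (RAM_SHARED | RAM_PRIVATE)) !=
           (RAM_SHARED | RAM_PRIVATE));

    if (ram_flags & RAM_PRIVATE) {
        /* PRIVATE is only a request; it is never recorded on the block. */
        return ram_flags & ~RAM_PRIVATE;
    }
    if (ram_flags & RAM_SHARED) {
        /* Explicit SHARED is kept as is. */
        return ram_flags;
    }
    /* No preference expressed: the machine option decides. */
    return aux_ram_share ? (ram_flags | RAM_SHARED) : ram_flags;
}
```

With this shape, qemu_ram_is_shared() only ever has to look at RAM_SHARED on
the block, which is the property you wanted to preserve.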

Do you plan to submit the part of your "tmp" patch that refactors
shm_backend_memory_alloc and defines qemu_shm_alloc?  If you want,
I could include it in my series, with your Signed-off-by.

Do you have any comments on my proposed name aux-ram-share, or my proposed text
for qemu-options.hx and migration.json?  Speaking now would prevent more version
churn later.

- Steve





* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-07 13:23                 ` David Hildenbrand
@ 2024-11-07 16:02                   ` Steven Sistare
  2024-11-07 16:26                     ` David Hildenbrand
  0 siblings, 1 reply; 86+ messages in thread
From: Steven Sistare @ 2024-11-07 16:02 UTC (permalink / raw)
  To: David Hildenbrand, Peter Xu
  Cc: Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel

On 11/7/2024 8:23 AM, David Hildenbrand wrote:
> On 06.11.24 21:12, Steven Sistare wrote:
>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>
>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>
>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>
>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>> details. See below.
>>>>>>>
>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>
>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>
>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>> on the "machine toggle" as part of this series.
>>>>>>
>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>
>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>
>>>>> FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>
>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>
>>> Yes.
>>>
>>>>
>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>> if memory-backend-ram has hogged all the memory.
>>>>
>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>
>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>> -- memfd if available and fallback to shm_open.
>>>
>>> Yes.
>>>
>>>>
>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>
>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>
>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>
>>>>> Thoughts?
>>>>
>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>> of options and words to describe them.
>>>
>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>
>> Hi David and Peter,
>>
>> I have implemented and tested the following, for both qemu_memfd_create
>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>> for simplicity.
>>
>> Any comments before I submit a complete patch?
>>
>> ----
>> qemu-options.hx:
>>       ``aux-ram-share=on|off``
>>           Allocate auxiliary guest RAM as an anonymous file that is
>>           shareable with an external process.  This option applies to
>>           memory allocated as a side effect of creating various devices.
>>           It does not apply to memory-backend-objects, whether explicitly
>>           specified on the command line, or implicitly created by the -m
>>           command line option.
>>
>>           Some migration modes require aux-ram-share=on.
>>
>> qapi/migration.json:
>>       @cpr-transfer:
>>            ...
>>            Memory-backend objects must have the share=on attribute, but
>>            memory-backend-epc is not supported.  The VM must be started
>>            with the '-machine aux-ram-share=on' option.
>>
>> Define RAM_PRIVATE
>>
>> Define qemu_shm_alloc(), from David's tmp patch
>>
>> ram_backend_memory_alloc()
>>       ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>       memory_region_init_ram_flags_nomigrate(ram_flags)
>>
>> qemu_ram_alloc_internal()
>>       ...
>>       if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>           new_block->flags |= RAM_SHARED;
>>
>>       if (!host && (new_block->flags & RAM_SHARED)) {
>>           qemu_ram_alloc_shared(new_block);
>>       } else {
>>           new_block->fd = -1;
>>           new_block->host = host;
>>       }
>>       ram_block_add(new_block);
>>
>> qemu_ram_alloc_shared()
>>       if qemu_memfd_check()
>>           new_block->fd = qemu_memfd_create()
>>       else
>>           new_block->fd = qemu_shm_alloc()
> 
> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.
> 
> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.
> 
> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are no subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux-specific ...
> 
> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.
> 
> 
> We'll have to decide whether we simply fall back to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.
> 
> So maybe something like
> 
> qemu_ram_alloc_shared()
>      fd = -1;
> 
>      if (qemu_memfd_available()) {
>          fd = qemu_memfd_create();
>          if (fd < 0)
>              ... error
>      } else if (qemu_shm_available()) {
>          fd = qemu_shm_alloc();
>          if (fd < 0)
>              ... error
>      } else {
>          /*
>           * Old behavior: try fd-less shared memory. We might
>           * just end up with non-shared memory on Windows, but
>           * nobody can make use of this shared memory either way
>           * ... should we just use non-shared memory? Or should
>           * we simply bail out? But then, if there is no shared
>           * memory nobody could possibly use it.
>           */
>          qemu_anon_ram_alloc(share=true)
>      }

Good catch.  We need that fallback for backwards compatibility.  Even though
memory-backend-ram,share=on has had no use case since the demise of rdma, users
may specify it on Windows for no particular reason; it works today, and should
continue to work after this series.  CPR would be blocked in that case.

More generally, for backwards compatibility when share=on is specified for no
particular reason, should we also fall back if qemu_shm_alloc fails?  If /dev/shm
is mounted with default options and more than half of RAM is requested, it will
fail, whereas current QEMU succeeds using MAP_SHARED|MAP_ANON.

- Steve






* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-07 14:04                       ` Steven Sistare
@ 2024-11-07 16:19                         ` David Hildenbrand
  2024-11-07 18:13                           ` Steven Sistare
  2024-11-07 16:32                         ` Peter Xu
  1 sibling, 1 reply; 86+ messages in thread
From: David Hildenbrand @ 2024-11-07 16:19 UTC (permalink / raw)
  To: Steven Sistare, Peter Xu
  Cc: Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel

On 07.11.24 15:04, Steven Sistare wrote:
> On 11/7/2024 8:05 AM, David Hildenbrand wrote:
>> On 06.11.24 21:59, Steven Sistare wrote:
>>> On 11/6/2024 3:41 PM, Peter Xu wrote:
>>>> On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote:
>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>>
>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>>
>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>>
>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>>> details. See below.
>>>>>>>>>>
>>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>>
>>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>>
>>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>>
>>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>>
>>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>>
>>>>>>>> FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>>
>>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>
>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>>
>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>>
>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>>> -- memfd if available and fallback to shm_open.
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>
>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>>
>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>>
>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>>
>>>>>>>> Thoughts?
>>>>>>>
>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>>> of options and words to describe them.
>>>>>>
>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>>
>>>>> Hi David and Peter,
>>>>>
>>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>>> for simplicity.
>>>>
>>>> I'm ok with either shm or memfd, as this feature only applies to Linux
>>>> anyway.  I'll leave that part to you and David to decide.
>>>>
>>>>>
>>>>> Any comments before I submit a complete patch?
>>>>>
>>>>> ----
>>>>> qemu-options.hx:
>>>>>        ``aux-ram-share=on|off``
>>>>>            Allocate auxiliary guest RAM as an anonymous file that is
>>>>>            shareable with an external process.  This option applies to
>>>>>            memory allocated as a side effect of creating various devices.
>>>>>            It does not apply to memory-backend-objects, whether explicitly
>>>>>            specified on the command line, or implicitly created by the -m
>>>>>            command line option.
>>>>>
>>>>>            Some migration modes require aux-ram-share=on.
>>>>>
>>>>> qapi/migration.json:
>>>>>        @cpr-transfer:
>>>>>             ...
>>>>>             Memory-backend objects must have the share=on attribute, but
>>>>>             memory-backend-epc is not supported.  The VM must be started
>>>>>             with the '-machine aux-ram-share=on' option.
>>>>>
>>>>> Define RAM_PRIVATE
>>>>>
>>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>>
>>>>> ram_backend_memory_alloc()
>>>>>        ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>>        memory_region_init_ram_flags_nomigrate(ram_flags)
>>>>
>>>> Looks all good until here.
>>>>
>>>>>
>>>>> qemu_ram_alloc_internal()
>>>>>        ...
>>>>>        if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>>
>>>> Nitpick: could rely on flags-only, rather than testing "!host", AFAICT
>>>> that's equal to RAM_PREALLOC.
>>>
>>> IMO testing host is clearer and more future proof, regardless of how flags
>>> are currently used.  If the caller passes host, then we should not allocate
>>> memory here, full stop.
>>>
>>>> Meanwhile I slightly prefer we don't touch
>>>> anything if SHARED|PRIVATE is set.
>>>
>>> OK, if SHARED is already set I will not set it again.
>>
>> We only have to make sure that stuff like qemu_ram_is_shared() will continue working as expected.
>>
>> What I think we should do:
>>
>> We should probably assert that nobody passes in SHARED|PRIVATE. And we can use PRIVATE only as a parameter to the function, but never actually set it on the ramblock.
>>
>> If someone passes in PRIVATE, we don't include it in block->flags. (RAM_SHARED remains cleared)
>>
>> If someone passes in SHARED, we do set it in block->flags.
>> If someone passes PRIVATE|SHARED, we assert.
>>
>> If someone passes in nothing: we set block->flags to SHARED with aux_ram_share=on. Otherwise, we do nothing (RAM_SHARED remains cleared)
>>
>> If that's also what you had in mind, great.
> 
> Yes, my patch does that, but it also sets RAM_PRIVATE on the ramblock.
> I will undo the latter.
> 
> Do you plan to submit the part of your "tmp" patch that refactors
> shm_backend_memory_alloc and defines qemu_shm_alloc?  If you want,
> I could include it in my series, with your Signed-off-by.

I think my patch went a bit too far, and it would not work on win32 :)

We should probably start with this:


From 124920aeda2756faa104bfa6e934c7c20b1fbbe9 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Mon, 4 Nov 2024 11:29:22 +0100
Subject: [PATCH] backends/hostmem-shm: factor out allocation of "anonymous
  shared memory with an fd"

Let's factor it out so we can reuse it.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  backends/hostmem-shm.c | 45 ++++-------------------------------
  include/qemu/osdep.h   |  1 +
  system/physmem.c       |  2 +-
  util/oslib-posix.c     | 53 ++++++++++++++++++++++++++++++++++++++++++
  util/oslib-win32.c     |  6 +++++
  5 files changed, 65 insertions(+), 42 deletions(-)

diff --git a/backends/hostmem-shm.c b/backends/hostmem-shm.c
index 374edc3db8..837b9f1dd4 100644
--- a/backends/hostmem-shm.c
+++ b/backends/hostmem-shm.c
@@ -25,11 +25,9 @@ struct HostMemoryBackendShm {
  static bool
  shm_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
  {
-    g_autoptr(GString) shm_name = g_string_new(NULL);
      g_autofree char *backend_name = NULL;
      uint32_t ram_flags;
-    int fd, oflag;
-    mode_t mode;
+    int fd;
  
      if (!backend->size) {
          error_setg(errp, "can't create shm backend with size 0");
@@ -41,48 +39,13 @@ shm_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
          return false;
      }
  
-    /*
-     * Let's use `mode = 0` because we don't want other processes to open our
-     * memory unless we share the file descriptor with them.
-     */
-    mode = 0;
-    oflag = O_RDWR | O_CREAT | O_EXCL;
-    backend_name = host_memory_backend_get_name(backend);
-
-    /*
-     * Some operating systems allow creating anonymous POSIX shared memory
-     * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
-     * defined by POSIX, so let's create a unique name.
-     *
-     * From Linux's shm_open(3) man-page:
-     *   For  portable  use,  a shared  memory  object should be identified
-     *   by a name of the form /somename;"
-     */
-    g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(),
-                    backend_name);
-
-    fd = shm_open(shm_name->str, oflag, mode);
+    fd = qemu_shm_alloc(backend->size, errp);
      if (fd < 0) {
-        error_setg_errno(errp, errno,
-                         "failed to create POSIX shared memory");
-        return false;
-    }
-
-    /*
-     * We have the file descriptor, so we no longer need to expose the
-     * POSIX shared memory object. However it will remain allocated as long as
-     * there are file descriptors pointing to it.
-     */
-    shm_unlink(shm_name->str);
-
-    if (ftruncate(fd, backend->size) == -1) {
-        error_setg_errno(errp, errno,
-                         "failed to resize POSIX shared memory to %" PRIu64,
-                         backend->size);
-        close(fd);
          return false;
      }
  
+    /* Let's do the same as memory-backend-ram,share=on would do. */
+    backend_name = host_memory_backend_get_name(backend);
      ram_flags = RAM_SHARED;
      ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
  
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index fe7c3c5f67..4a24f11174 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -505,6 +505,7 @@ int qemu_daemon(int nochdir, int noclose);
  void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared,
                            bool noreserve);
  void qemu_anon_ram_free(void *ptr, size_t size);
+int qemu_shm_alloc(size_t size, Error **errp);
  
  #ifdef _WIN32
  #define HAVE_CHARDEV_SERIAL 1
diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a384..1b477fec44 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2089,7 +2089,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
      new_block->page_size = qemu_real_host_page_size();
      new_block->host = host;
      new_block->flags = ram_flags;
-    ram_block_add(new_block, &local_err);
+
      if (local_err) {
          g_free(new_block);
          error_propagate(errp, local_err);
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 11b35e48fb..bc5c28b162 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -931,3 +931,56 @@ void qemu_close_all_open_fd(const int *skip, unsigned int nskip)
          qemu_close_all_open_fd_fallback(skip, nskip, open_max);
      }
  }
+
+int qemu_shm_alloc(size_t size, Error **errp)
+{
+    g_autoptr(GString) shm_name = g_string_new(NULL);
+    int fd, oflag, cur_sequence;
+    static int sequence;
+    mode_t mode;
+
+    cur_sequence = qatomic_fetch_inc(&sequence);
+
+    /*
+     * Let's use `mode = 0` because we don't want other processes to open our
+     * memory unless we share the file descriptor with them.
+     */
+    mode = 0;
+    oflag = O_RDWR | O_CREAT | O_EXCL;
+
+    /*
+     * Some operating systems allow creating anonymous POSIX shared memory
+     * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
+     * defined by POSIX, so let's create a unique name.
+     *
+     * From Linux's shm_open(3) man-page:
+     *   For  portable  use,  a shared  memory  object should be identified
+     *   by a name of the form /somename;"
+     */
+    g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%d", getpid(),
+                    cur_sequence);
+
+    fd = shm_open(shm_name->str, oflag, mode);
+    if (fd < 0) {
+        error_setg_errno(errp, errno,
+                         "failed to create POSIX shared memory");
+        return -1;
+    }
+
+    /*
+     * We have the file descriptor, so we no longer need to expose the
+     * POSIX shared memory object. However it will remain allocated as long as
+     * there are file descriptors pointing to it.
+     */
+    shm_unlink(shm_name->str);
+
+    if (ftruncate(fd, size) == -1) {
+        error_setg_errno(errp, errno,
+                         "failed to resize POSIX shared memory to %zu",
+                         size);
+        close(fd);
+        return -1;
+    }
+
+    return fd;
+}
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index b623830d62..f79a190b78 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -877,3 +877,9 @@ void qemu_win32_map_free(void *ptr, HANDLE h, Error **errp)
      }
      CloseHandle(h);
  }
+
+int qemu_shm_alloc(size_t size, Error **errp)
+{
+    error_setg(errp, "Shared memory is not supported.");
+    return -1;
+}
-- 
2.47.0


So we can reuse it for the !host && RAM_SHARED case.
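
For anyone who wants to play with the technique outside of the tree, here is
a minimal glib-free sketch of the same idea.  The name anon_shm_fd() and the
"/sketch-..." object name are made up for illustration, the sequence counter
is not thread-safe, and on older glibc you may need to link with -lrt.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: return an fd for "anonymous" POSIX shared memory of
 * the given size, or -1 on failure (errno is set by the failing call).
 */
static int anon_shm_fd(size_t size)
{
    static int sequence;    /* not thread-safe; fine for a sketch */
    char name[64];
    int fd;

    assert(size > 0);

    /* POSIX has no anonymous objects, so build a unique "/name". */
    snprintf(name, sizeof(name), "/sketch-%ld-shm-%d",
             (long)getpid(), sequence++);

    /* mode 0: nobody else can open the object by name. */
    fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0);
    if (fd < 0) {
        return -1;
    }

    /* The fd keeps the object alive; the name is no longer needed. */
    shm_unlink(name);

    if (ftruncate(fd, (off_t)size) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The returned fd can then be mapped with MAP_SHARED and, for CPR, passed to
another process.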


> 
> Do you have any comments on my proposed name aux-ram-share, or my proposed text

aux-ram-share works for me; I prefer "aux" over the "default" I had in mind.

> for qemu-options.hx and migration.json?  Speaking now would prevent more version
> churn later.

Both sound good to me after a quick scan.


-- 
Cheers,

David / dhildenb




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-07 16:02                   ` Steven Sistare
@ 2024-11-07 16:26                     ` David Hildenbrand
  2024-11-07 16:40                       ` Steven Sistare
  0 siblings, 1 reply; 86+ messages in thread
From: David Hildenbrand @ 2024-11-07 16:26 UTC (permalink / raw)
  To: Steven Sistare, Peter Xu
  Cc: Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel

On 07.11.24 17:02, Steven Sistare wrote:
> On 11/7/2024 8:23 AM, David Hildenbrand wrote:
>> On 06.11.24 21:12, Steven Sistare wrote:
>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>
>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>
>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>
>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>> details. See below.
>>>>>>>>
>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>
>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>
>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>
>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>
>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>
>>>>>> FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>
>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>
>>>> Yes.
>>>>
>>>>>
>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>> if memory-backend-ram has hogged all the memory.
>>>>>
>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>
>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>> -- memfd if available and fallback to shm_open.
>>>>
>>>> Yes.
>>>>
>>>>>
>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>
>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>
>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>> of options and words to describe them.
>>>>
>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>
>>> Hi David and Peter,
>>>
>>> I have implemented and tested the following, for both qemu_memfd_create
>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>> for simplicity.
>>>
>>> Any comments before I submit a complete patch?
>>>
>>> ----
>>> qemu-options.hx:
>>>        ``aux-ram-share=on|off``
>>>            Allocate auxiliary guest RAM as an anonymous file that is
>>>            shareable with an external process.  This option applies to
>>>            memory allocated as a side effect of creating various devices.
>>>            It does not apply to memory-backend-objects, whether explicitly
>>>            specified on the command line, or implicitly created by the -m
>>>            command line option.
>>>
>>>            Some migration modes require aux-ram-share=on.
>>>
>>> qapi/migration.json:
>>>        @cpr-transfer:
>>>             ...
>>>             Memory-backend objects must have the share=on attribute, but
>>>             memory-backend-epc is not supported.  The VM must be started
>>>             with the '-machine aux-ram-share=on' option.
>>>
>>> Define RAM_PRIVATE
>>>
>>> Define qemu_shm_alloc(), from David's tmp patch
>>>
>>> ram_backend_memory_alloc()
>>>        ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>        memory_region_init_ram_flags_nomigrate(ram_flags)
>>>
>>> qemu_ram_alloc_internal()
>>>        ...
>>>        if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>            new_block->flags |= RAM_SHARED;
>>>
>>>        if (!host && (new_block->flags & RAM_SHARED)) {
>>>            qemu_ram_alloc_shared(new_block);
>>>        } else {
>>>            new_block->fd = -1;
>>>            new_block->host = host;
>>>        }
>>>        ram_block_add(new_block);
>>>
>>> qemu_ram_alloc_shared()
>>>        if qemu_memfd_check()
>>>            new_block->fd = qemu_memfd_create()
>>>        else
>>>            new_block->fd = qemu_shm_alloc()
>>
>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.
>>
>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.
>>
>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ...
>>
>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.
>>
>>
>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.
>>
>> So maybe something like
>>
>> qemu_ram_alloc_shared()
>>       fd = -1;
>>
>>       if (qemu_memfd_available()) {
>>           fd = qemu_memfd_create();
>>           if (fd < 0)
>>               ... error
>>       } else if (qemu_shm_available()) {
>>           fd = qemu_shm_alloc();
>>           if (fd < 0)
>>               ... error
>>       } else {
>>           /*
>>            * Old behavior: try fd-less shared memory. We might
>>            * just end up with non-shared memory on Windows, but
>>            * nobody can make use of this shared memory either way
>>            * ... should we just use non-shared memory? Or should
>>            * we simply bail out? But then, if there is no shared
>>            * memory nobody could possibly use it.
>>            */
>>           qemu_anon_ram_alloc(share=true)
>>       }
> 
> Good catch.  We need that fallback for backwards compatibility.  Even with
> no use case for memory-backend-ram,share=on since the demise of rdma, users
> may specify it on windows, for no particular reason, but it works, and should
> continue to work after this series.  CPR would be blocked.

Yes, we should keep Windows working in the weird way it is working right 
now.

> More generally for backwards compatibility for share=on for no particular reason,
> should we fallback if qemu_shm_alloc fails?  If /dev/shm is mounted with default
> options and more than half of ram is requested, it will fail, whereas current qemu
> succeeds using MAP_SHARED|MAP_ANON.

Only on Linux without memfd, of course. Maybe we should just warn when
qemu_shm_alloc() fails (and comment that we continue for compat reasons
only) and fall back to the stupid qemu_anon_ram_alloc(share=true). We
could implement a fallback to shmget() but ... let's not go down that path.

But we should not fall back to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if
memfd is available and allocating the memfd failed. Failing to
allocate a memfd might highlight a bigger problem.
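
A minimal sketch of the fallback policy described above, for illustration
only. HostCaps, AllocResult and alloc_shared_policy() are hypothetical names
that stand in for the real availability probes and allocators; nothing here
is taken from the actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical outcome of the shared-allocation policy; not a QEMU type. */
typedef enum {
    ALLOC_MEMFD,        /* fd-backed memory from memfd */
    ALLOC_SHM,          /* fd-backed memory from POSIX shm */
    ALLOC_ANON_SHARED,  /* fd-less MAP_SHARED|MAP_ANON, kept for compat only */
    ALLOC_ERROR         /* hard failure, no fallback */
} AllocResult;

/* Hypothetical host description; stands in for the real probes and calls. */
typedef struct {
    bool memfd_available;
    bool memfd_create_ok;
    bool shm_available;
    bool shm_alloc_ok;
} HostCaps;

static AllocResult alloc_shared_policy(const HostCaps *caps)
{
    assert(caps != NULL);

    if (caps->memfd_available) {
        /* A memfd failure may point at a bigger problem: do not fall back. */
        return caps->memfd_create_ok ? ALLOC_MEMFD : ALLOC_ERROR;
    }
    if (caps->shm_available) {
        if (caps->shm_alloc_ok) {
            return ALLOC_SHM;
        }
        /* Warn and continue, for compatibility reasons only. */
        fprintf(stderr, "warning: shm allocation failed, "
                        "falling back to fd-less shared memory\n");
    }
    /* Old behavior: fd-less shared memory that no other process can map. */
    return ALLOC_ANON_SHARED;
}
```

The asymmetry is deliberate in this sketch: only the shm path degrades to the
fd-less mapping, because that is the case where current QEMU succeeds today.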

-- 
Cheers,

David / dhildenb




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-07 14:04                       ` Steven Sistare
  2024-11-07 16:19                         ` David Hildenbrand
@ 2024-11-07 16:32                         ` Peter Xu
  2024-11-07 16:38                           ` David Hildenbrand
  1 sibling, 1 reply; 86+ messages in thread
From: Peter Xu @ 2024-11-07 16:32 UTC (permalink / raw)
  To: Steven Sistare
  Cc: David Hildenbrand, Fabiano Rosas, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, qemu-devel

On Thu, Nov 07, 2024 at 09:04:02AM -0500, Steven Sistare wrote:
> On 11/7/2024 8:05 AM, David Hildenbrand wrote:
> > On 06.11.24 21:59, Steven Sistare wrote:
> > > On 11/6/2024 3:41 PM, Peter Xu wrote:
> > > > On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote:
> > > > > On 11/4/2024 4:36 PM, David Hildenbrand wrote:
> > > > > > On 04.11.24 21:56, Steven Sistare wrote:
> > > > > > > On 11/4/2024 3:15 PM, David Hildenbrand wrote:
> > > > > > > > On 04.11.24 20:51, David Hildenbrand wrote:
> > > > > > > > > On 04.11.24 18:38, Steven Sistare wrote:
> > > > > > > > > > On 11/4/2024 5:39 AM, David Hildenbrand wrote:
> > > > > > > > > > > On 01.11.24 14:47, Steve Sistare wrote:
> > > > > > > > > > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> > > > > > > > > > > > on the value of the anon-alloc machine property.  This option applies to
> > > > > > > > > > > > memory allocated as a side effect of creating various devices. It does
> > > > > > > > > > > > not apply to memory-backend-objects, whether explicitly specified on
> > > > > > > > > > > > the command line, or implicitly created by the -m command line option.
> > > > > > > > > > > > 
> > > > > > > > > > > > The memfd option is intended to support new migration modes, in which the
> > > > > > > > > > > > memory region can be transferred in place to a new QEMU process, by sending
> > > > > > > > > > > > the memfd file descriptor to the process.  Memory contents are preserved,
> > > > > > > > > > > > and if the mode also transfers device descriptors, then pages that are
> > > > > > > > > > > > locked in memory for DMA remain locked.  This behavior is a pre-requisite
> > > > > > > > > > > > for supporting vfio, vdpa, and iommufd devices with the new modes.
> > > > > > > > > > > 
> > > > > > > > > > > A more portable, non-Linux specific variant of this will be using shm,
> > > > > > > > > > > similar to backends/hostmem-shm.c.
> > > > > > > > > > > 
> > > > > > > > > > > Likely we should be using that instead of memfd, or try hiding the
> > > > > > > > > > > details. See below.
> > > > > > > > > > 
> > > > > > > > > > For this series I would prefer to use memfd and hide the details.  It's a
> > > > > > > > > > concise (and well tested) solution albeit linux only.  The code you supply
> > > > > > > > > > for posix shm would be a good follow on patch to support other unices.
> > > > > > > > > 
> > > > > > > > > Unless there is reason to use memfd we should start with the more
> > > > > > > > > generic POSIX variant that is available even on systems without memfd.
> > > > > > > > > Factoring stuff out as I drafted does look quite compelling.
> > > > > > > > > 
> > > > > > > > > I can help with the rework, and send it out separately, so you can focus
> > > > > > > > > on the "machine toggle" as part of this series.
> > > > > > > > > 
> > > > > > > > > Of course, if we find out we need the memfd internally instead under
> > > > > > > > > Linux for whatever reason later, we can use that instead.
> > > > > > > > > 
> > > > > > > > > But IIUC, the main selling point for memfd are additional features
> > > > > > > > > (hugetlb, memory sealing) that you aren't even using.
> > > > > > > > 
> > > > > > > > > FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
> > > > > > > 
> > > > > > > Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
> > > > > > > To do so using shm_open requires configuration on the mount.  One step harder to use.
> > > > > > 
> > > > > > Yes.
> > > > > > 
> > > > > > > 
> > > > > > > This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
> > > > > > > if memory-backend-ram has hogged all the memory.
> > > > > > > 
> > > > > > > > Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
> > > > > > > 
> > > > > > > Yes, and if that is a good idea, then the same should be done for internal RAM
> > > > > > > -- memfd if available and fallback to shm_open.
> > > > > > 
> > > > > > Yes.
> > > > > > 
> > > > > > > 
> > > > > > > > I'm hoping we can find a way where it just all is rather intuitive, like
> > > > > > > > 
> > > > > > > > "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
> > > > > > > > 
> > > > > > > > "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
> > > > > > > > 
> > > > > > > > Thoughts?
> > > > > > > 
> > > > > > > Agreed, though I thought I had already landed at the intuitive specification in my patch.
> > > > > > > The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
> > > > > > > controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
> > > > > > > of options and words to describe them.
> > > > > > 
> > > > > > Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
> > > > > 
> > > > > Hi David and Peter,
> > > > > 
> > > > > I have implemented and tested the following, for both qemu_memfd_create
> > > > > and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
> > > > > for simplicity.
> > > > 
> > > > I'm ok with either shm or memfd, as this feature only applies to Linux
> > > > anyway.  I'll leave that part to you and David to decide.
> > > > 
> > > > > 
> > > > > Any comments before I submit a complete patch?
> > > > > 
> > > > > ----
> > > > > qemu-options.hx:
> > > > >       ``aux-ram-share=on|off``
> > > > >           Allocate auxiliary guest RAM as an anonymous file that is
> > > > >           shareable with an external process.  This option applies to
> > > > >           memory allocated as a side effect of creating various devices.
> > > > >           It does not apply to memory-backend-objects, whether explicitly
> > > > >           specified on the command line, or implicitly created by the -m
> > > > >           command line option.
> > > > > 
> > > > >           Some migration modes require aux-ram-share=on.
> > > > > 
> > > > > qapi/migration.json:
> > > > >       @cpr-transfer:
> > > > >            ...
> > > > >            Memory-backend objects must have the share=on attribute, but
> > > > >            memory-backend-epc is not supported.  The VM must be started
> > > > >            with the '-machine aux-ram-share=on' option.
> > > > > 
> > > > > Define RAM_PRIVATE
> > > > > 
> > > > > Define qemu_shm_alloc(), from David's tmp patch
> > > > > 
> > > > > ram_backend_memory_alloc()
> > > > >       ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
> > > > >       memory_region_init_ram_flags_nomigrate(ram_flags)
> > > > 
> > > > Looks all good until here.
> > > > 
> > > > > 
> > > > > qemu_ram_alloc_internal()
> > > > >       ...
> > > > >       if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
> > > > 
> > > > Nitpick: could rely on flags-only, rather than testing "!host", AFAICT
> > > > that's equal to RAM_PREALLOC.
> > > 
> > > IMO testing host is clearer and more future proof, regardless of how flags
> > > are currently used.  If the caller passes host, then we should not allocate
> > > memory here, full stop.
> > > 
> > > > Meanwhile I slightly prefer we don't touch
> > > > anything if SHARED|PRIVATE is set.
> > > 
> > > OK, if SHARED is already set I will not set it again.
> > 
> > We only have to make sure that stuff like qemu_ram_is_shared() will continue working as expected.
> > 
> > What I think we should do:
> > 
> > We should probably assert that nobody passes in SHARED|PRIVATE. And we can use PRIVATE only as a parameter to the function, but never actually set it on the ramblock.
> > 
> > If someone passes in PRIVATE, we don't include it in block->flags. (RAM_SHARED remains cleared)
> > 
> > If someone passes in SHARED, we do set it in block->flags.
> > If someone passes PRIVATE|SHARED, we assert.
> > 
> > If someone passes in nothing: we set block->flags to SHARED with aux_ram_share=on. Otherwise, we do nothing (RAM_SHARED remains cleared)
> > 
> > If that's also what you had in mind, great.
> 
> Yes, my patch does that, but it also sets RAM_PRIVATE on the ramblock.
> I will undo the latter.

David: why do we need to drop PRIVATE in ramblock flags?  I thought it was
pretty harmless.  I suppose things like qemu_ram_is_shared() will even keep
working as before?

It looks ok to remove it too, but it adds logic that doesn't seem
necessary to me, so I'm just double-checking whether I missed something.

> 
> Do you plan to submit the part of your "tmp" patch that refactors
> shm_backend_memory_alloc and defines qemu_shm_alloc?  If you want,
> I could include it in my series, with your Signed-off-by.
> 
> Do you have any comments on my proposed name aux-ram-share, or my proposed text
> for qemu-options.hx and migration.json?  Speaking now would prevent more version
> churn later.
> 
> - Steve
> 
> 

-- 
Peter Xu




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-07 16:32                         ` Peter Xu
@ 2024-11-07 16:38                           ` David Hildenbrand
  2024-11-07 17:48                             ` Peter Xu
  0 siblings, 1 reply; 86+ messages in thread
From: David Hildenbrand @ 2024-11-07 16:38 UTC (permalink / raw)
  To: Peter Xu, Steven Sistare
  Cc: Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel

On 07.11.24 17:32, Peter Xu wrote:
> On Thu, Nov 07, 2024 at 09:04:02AM -0500, Steven Sistare wrote:
>> On 11/7/2024 8:05 AM, David Hildenbrand wrote:
>>> On 06.11.24 21:59, Steven Sistare wrote:
>>>> On 11/6/2024 3:41 PM, Peter Xu wrote:
>>>>> On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote:
>>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>>>
>>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>>>
>>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>>>> details. See below.
>>>>>>>>>>>
>>>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>>>
>>>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>>>
>>>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>>>
>>>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>>>
>>>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>>>
>>>>>>>>> FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>>>
>>>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>>
>>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>>>
>>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>>>
>>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>>>> -- memfd if available and fallback to shm_open.
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>>
>>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>>>
>>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>>>
>>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>>>
>>>>>>>>> Thoughts?
>>>>>>>>
>>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>>>> of options and words to describe them.
>>>>>>>
>>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>>>
>>>>>> Hi David and Peter,
>>>>>>
>>>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>>>> for simplicity.
>>>>>
>>>>> I'm ok with either shm or memfd, as this feature only applies to Linux
>>>>> anyway.  I'll leave that part to you and David to decide.
>>>>>
>>>>>>
>>>>>> Any comments before I submit a complete patch?
>>>>>>
>>>>>> ----
>>>>>> qemu-options.hx:
>>>>>>        ``aux-ram-share=on|off``
>>>>>>            Allocate auxiliary guest RAM as an anonymous file that is
>>>>>>            shareable with an external process.  This option applies to
>>>>>>            memory allocated as a side effect of creating various devices.
>>>>>>            It does not apply to memory-backend-objects, whether explicitly
>>>>>>            specified on the command line, or implicitly created by the -m
>>>>>>            command line option.
>>>>>>
>>>>>>            Some migration modes require aux-ram-share=on.
>>>>>>
>>>>>> qapi/migration.json:
>>>>>>        @cpr-transfer:
>>>>>>             ...
>>>>>>             Memory-backend objects must have the share=on attribute, but
>>>>>>             memory-backend-epc is not supported.  The VM must be started
>>>>>>             with the '-machine aux-ram-share=on' option.
>>>>>>
>>>>>> Define RAM_PRIVATE
>>>>>>
>>>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>>>
>>>>>> ram_backend_memory_alloc()
>>>>>>        ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>>>        memory_region_init_ram_flags_nomigrate(ram_flags)
>>>>>
>>>>> Looks all good until here.
>>>>>
>>>>>>
>>>>>> qemu_ram_alloc_internal()
>>>>>>        ...
>>>>>>        if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>>>
>>>>> Nitpick: could rely on flags-only, rather than testing "!host", AFAICT
>>>>> that's equal to RAM_PREALLOC.
>>>>
>>>> IMO testing host is clearer and more future proof, regardless of how flags
>>>> are currently used.  If the caller passes host, then we should not allocate
>>>> memory here, full stop.
>>>>
>>>>> Meanwhile I slightly prefer we don't touch
>>>>> anything if SHARED|PRIVATE is set.
>>>>
>>>> OK, if SHARED is already set I will not set it again.
>>>
>>> We only have to make sure that stuff like qemu_ram_is_shared() will continue working as expected.
>>>
>>> What I think we should do:
>>>
>>> We should probably assert that nobody passes in SHARED|PRIVATE. And we can use PRIVATE only as a parameter to the function, but never actually set it on the ramblock.
>>>
>>> If someone passes in PRIVATE, we don't include it in block->flags. (RAM_SHARED remains cleared)
>>>
>>> If someone passes in SHARED, we do set it in block->flags.
>>> If someone passes PRIVATE|SHARED, we assert.
>>>
>>> If someone passes in nothing: we set block->flags to SHARED with aux_ram_share=on. Otherwise, we do nothing (RAM_SHARED remains cleared)
>>>
>>> If that's also what you had in mind, great.
>>
>> Yes, my patch does that, but it also sets RAM_PRIVATE on the ramblock.
>> I will undo the latter.
> 
> David: why do we need to drop PRIVATE in ramblock flags?  I thought it was
> pretty harmless.  I suppose things like qemu_ram_is_shared() will even keep
> working as before?
> 
> It looks ok to remove it too, but it adds logics that doesn't seem
> necessary to me, so just to double check if I missed something..

A finished ramblock is only boolean "shared" vs. "not shared/private". A 
single flag (RAM_SHARED) can express that clearly.

Consequently there is less to get wrong when using RAM_PRIVATE only as a 
flag to the creation function (and documenting that!).

To make RAM_PRIVATE consistent we might have to tweak all other RAMBlock
creation functions to set RAM_PRIVATE in the !RAM_SHARED case, and I
don't think that is worth the trouble.
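
As a rough sketch of what "RAM_PRIVATE only as a flag to the creation
function" could look like: resolve_block_flags() is a hypothetical helper,
the bit values are illustrative rather than the real definitions, and the
"!host" check from the pseudo-code is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only; the real definitions may differ. */
#define RAM_SHARED  (1u << 0)
#define RAM_PRIVATE (1u << 1)

/*
 * Hypothetical helper: map the flags passed to a creation function to the
 * flags stored on the finished block.  RAM_PRIVATE is consumed here and is
 * never stored, so the block stays a plain "shared" vs. "not shared".
 */
static uint32_t resolve_block_flags(uint32_t ram_flags, bool aux_ram_share)
{
    /* A caller must not request both. */
    assert((ram_flags & (RAM_SHARED | RAM_PRIVATE)) !=
           (RAM_SHARED | RAM_PRIVATE));

    if (ram_flags & RAM_PRIVATE) {
        /* Explicitly private: drop the request bit, RAM_SHARED stays clear. */
        return ram_flags & ~RAM_PRIVATE;
    }
    if (!(ram_flags & RAM_SHARED) && aux_ram_share) {
        /* Nothing requested: the machine option decides. */
        return ram_flags | RAM_SHARED;
    }
    /* Either RAM_SHARED was requested, or the default stays private. */
    return ram_flags;
}
```

With this shape, existing checks such as qemu_ram_is_shared() only ever need
to look at RAM_SHARED, and the other creation functions need no changes.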

-- 
Cheers,

David / dhildenb




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-07 16:26                     ` David Hildenbrand
@ 2024-11-07 16:40                       ` Steven Sistare
  2024-11-08 11:31                         ` David Hildenbrand
  0 siblings, 1 reply; 86+ messages in thread
From: Steven Sistare @ 2024-11-07 16:40 UTC (permalink / raw)
  To: David Hildenbrand, Peter Xu
  Cc: Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel

On 11/7/2024 11:26 AM, David Hildenbrand wrote:
> On 07.11.24 17:02, Steven Sistare wrote:
>> On 11/7/2024 8:23 AM, David Hildenbrand wrote:
>>> On 06.11.24 21:12, Steven Sistare wrote:
>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>
>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>
>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>
>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>> details. See below.
>>>>>>>>>
>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>
>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>
>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>
>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>
>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>
>>>>>>> FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>
>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>
>>>>> Yes.
>>>>>
>>>>>>
>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>
>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>
>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>> -- memfd if available and fallback to shm_open.
>>>>>
>>>>> Yes.
>>>>>
>>>>>>
>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>
>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>
>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>
>>>>>>> Thoughts?
>>>>>>
>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>> of options and words to describe them.
>>>>>
>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>
>>>> Hi David and Peter,
>>>>
>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>> for simplicity.
>>>>
>>>> Any comments before I submit a complete patch?
>>>>
>>>> ----
>>>> qemu-options.hx:
>>>>        ``aux-ram-share=on|off``
>>>>            Allocate auxiliary guest RAM as an anonymous file that is
>>>>            shareable with an external process.  This option applies to
>>>>            memory allocated as a side effect of creating various devices.
>>>>            It does not apply to memory-backend-objects, whether explicitly
>>>>            specified on the command line, or implicitly created by the -m
>>>>            command line option.
>>>>
>>>>            Some migration modes require aux-ram-share=on.
>>>>
>>>> qapi/migration.json:
>>>>        @cpr-transfer:
>>>>             ...
>>>>             Memory-backend objects must have the share=on attribute, but
>>>>             memory-backend-epc is not supported.  The VM must be started
>>>>             with the '-machine aux-ram-share=on' option.
>>>>
>>>> Define RAM_PRIVATE
>>>>
>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>
>>>> ram_backend_memory_alloc()
>>>>        ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>        memory_region_init_ram_flags_nomigrate(ram_flags)
>>>>
>>>> qemu_ram_alloc_internal()
>>>>        ...
>>>>        if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>>            new_block->flags |= RAM_SHARED;
>>>>
>>>>        if (!host && (new_block->flags & RAM_SHARED)) {
>>>>            qemu_ram_alloc_shared(new_block);
>>>>        } else {
>>>>            new_block->fd = -1;
>>>>            new_block->host = host;
>>>>        }
>>>>        ram_block_add(new_block);
>>>>
>>>> qemu_ram_alloc_shared()
>>>>        if qemu_memfd_check()
>>>>            new_block->fd = qemu_memfd_create()
>>>>        else
>>>>            new_block->fd = qemu_shm_alloc()
>>>
>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.
>>>
>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.
>>>
>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ...
>>>
>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.
>>>
>>>
>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.
>>>
>>> So maybe something like
>>>
>>> qemu_ram_alloc_shared()
>>>       fd = -1;
>>>
>>>       if (qemu_memfd_available()) {
>>>           fd = qemu_memfd_create();
>>>           if (fd < 0)
>>>               ... error
>>>       } else if (qemu_shm_available()) {
>>>           fd = qemu_shm_alloc();
>>>           if (fd < 0)
>>>               ... error
>>>       } else {
>>>           /*
>>>            * Old behavior: try fd-less shared memory. We might
>>>            * just end up with non-shared memory on Windows, but
>>>            * nobody can make use of this shared memory either way
>>>            * ... should we just use non-shared memory? Or should
>>>            * we simply bail out? But then, if there is no shared
>>>            * memory nobody could possibly use it.
>>>            */
>>>           qemu_anon_ram_alloc(share=true)
>>>       }
>>
>> Good catch.  We need that fallback for backwards compatibility.  Even with
>> no use case for memory-backend-ram,share=on since the demise of rdma, users
>> may specify it on windows, for no particular reason, but it works, and should
>> continue to work after this series.  CPR would be blocked.
> 
> Yes, we should keep Windows working in the weird way it is working right now.
> 
>> More generally for backwards compatibility for share=on for no particular reason,
>> should we fallback if qemu_shm_alloc fails?  If /dev/shm is mounted with default
>> options and more than half of ram is requested, it will fail, whereas current qemu
>> succeeds using MAP_SHARED|MAP_ANON.
> 
> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path.
> 
> But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem.

Agreed on all.
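
The fallback policy agreed above can be sketched as a pure decision
function.  This is only a minimal sketch, not the patch: every sketch_ and
SKETCH_ name is hypothetical, and the booleans stand in for the results of
the availability checks and allocation attempts discussed in the quoted text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes, named only for this sketch. */
enum sketch_result {
    SKETCH_USE_MEMFD,        /* fd-backed memory from memfd */
    SKETCH_USE_SHM,          /* fd-backed memory from POSIX shm */
    SKETCH_USE_ANON_SHARED,  /* fd-less MAP_SHARED|MAP_ANON, compat only */
    SKETCH_FAIL              /* report an error to the caller */
};

/*
 * memfd_ok and shm_ok model whether the corresponding allocation would
 * succeed; each is only consulted if that mechanism is available.
 */
static enum sketch_result sketch_pick_shared_alloc(bool memfd_available,
                                                   bool memfd_ok,
                                                   bool shm_available,
                                                   bool shm_ok)
{
    if (memfd_available) {
        /* A memfd failure may indicate a bigger problem: do not fall back. */
        return memfd_ok ? SKETCH_USE_MEMFD : SKETCH_FAIL;
    }
    if (shm_available && shm_ok) {
        return SKETCH_USE_SHM;
    }
    /*
     * No shm at all (e.g. Windows), or shm failed (e.g. /dev/shm too small):
     * warn and keep the old fd-less behavior for compatibility.
     */
    return SKETCH_USE_ANON_SHARED;
}
```

In this sketch a failed memfd allocation yields SKETCH_FAIL rather than a
silent downgrade, while a failed shm allocation only falls back to the old
behavior.  CPR would be blocked in the SKETCH_USE_ANON_SHARED case, since
there is no fd to pass.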

One more opinion from you please, if you will.

RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be
set in
   ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal

None of the other backends reach qemu_ram_alloc_internal.

To be future proof, do you prefer I also set RAM_PRIVATE in the other backends,
everywhere RAM_SHARED may be set, e.g.:
     file_backend_memory_alloc()
           ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;

Or omit RAM_PRIVATE in those cases where it will not be checked, to minimize
exposure of this new flag?

- Steve



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-07 16:38                           ` David Hildenbrand
@ 2024-11-07 17:48                             ` Peter Xu
  0 siblings, 0 replies; 86+ messages in thread
From: Peter Xu @ 2024-11-07 17:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Steven Sistare, Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel

On Thu, Nov 07, 2024 at 05:38:26PM +0100, David Hildenbrand wrote:
> > David: why do we need to drop PRIVATE in ramblock flags?  I thought it was
> > pretty harmless.  I suppose things like qemu_ram_is_shared() will even keep
> > working as before?
> > 
> > It looks ok to remove it too, but it adds logic that doesn't seem
> > necessary to me, so just to double check if I missed something..
> 
> A finished ramblock is only boolean "shared" vs. "not shared/private". A
> single flag (RAM_SHARED) can express that clearly.
> 
> Consequently there is less to get wrong when using RAM_PRIVATE only as a
> flag to the creation function (and documenting that!).
> 
> To make RAM_PRIVATE consistent we might have to tweak all other RAMBlock
> creation functions to set RAM_PRIVATE in the !RAM_SHARED case, and I don't
> think that is worth the trouble.

Yeah, I actually prefer PRIVATE to be applied everywhere, and to assert that
one of SHARED|PRIVATE is set in all ramblocks.

But no strong opinion; if both of you prefer it applied only optionally, I
have already lost the vote.  But yes, in that case please document PRIVATE
extremely carefully, as it can be very tricky now.

PS: when I think about how to document that, I really, really wish we
simply applied it to all..
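
For illustration, the "apply it everywhere and assert" variant could look
roughly like the check below.  This is a minimal sketch that reads "either"
as "exactly one"; the SKETCH_ names and bit values are hypothetical and are
not the real ramblock flag definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define SKETCH_RAM_SHARED  (1u << 1)
#define SKETCH_RAM_PRIVATE (1u << 13)

/*
 * True if exactly one of SHARED or PRIVATE is set, which is the
 * invariant every ramblock would have to satisfy in this variant.
 */
static bool sketch_ram_flags_valid(uint32_t ram_flags)
{
    uint32_t mode = ram_flags & (SKETCH_RAM_SHARED | SKETCH_RAM_PRIVATE);

    return mode == SKETCH_RAM_SHARED || mode == SKETCH_RAM_PRIVATE;
}
```

Every creation path would then have to pass one of the two flags, and a
block with neither (or both) set would trip the assertion.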

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-07 16:19                         ` David Hildenbrand
@ 2024-11-07 18:13                           ` Steven Sistare
  0 siblings, 0 replies; 86+ messages in thread
From: Steven Sistare @ 2024-11-07 18:13 UTC (permalink / raw)
  To: David Hildenbrand, Peter Xu
  Cc: Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel

On 11/7/2024 11:19 AM, David Hildenbrand wrote:
> On 07.11.24 15:04, Steven Sistare wrote:
>> On 11/7/2024 8:05 AM, David Hildenbrand wrote:
[...]
>> Do you plan to submit the part of your "tmp" patch that refactors
>> shm_backend_memory_alloc and defines qemu_shm_alloc?  If you want,
>> I could include it in my series, with your Signed-off-by.
> 
> My patch went a bit too far I think. And would not work on win32 :)
> 
> We should probably start with this:
> 
> From 124920aeda2756faa104bfa6e934c7c20b1fbbe9 Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Mon, 4 Nov 2024 11:29:22 +0100
> Subject: [PATCH] backends/hostmem-shm: factor out allocation of "anonymous
>  shared memory with an fd"
> 
> Let's factor it out so we can reuse it.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  backends/hostmem-shm.c | 45 ++++-------------------------------
>  include/qemu/osdep.h   |  1 +
>  system/physmem.c       |  2 +-
>  util/oslib-posix.c     | 53 ++++++++++++++++++++++++++++++++++++++++++
>  util/oslib-win32.c     |  6 +++++
>  5 files changed, 65 insertions(+), 42 deletions(-) 
[...]

Thanks, I see what you fixed.

FYI I deleted this remnant from "tmp":

 > @@ -2089,7 +2089,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
 >       new_block->page_size = qemu_real_host_page_size();
 >       new_block->host = host;
 >       new_block->flags = ram_flags;
 > -    ram_block_add(new_block, &local_err);
 > +

and I added this so all programs that link with libqemuutil also link with
rt, which defines shm_open:

diff --git a/meson.build b/meson.build
index e324c49..aa54535 100644
--- a/meson.build
+++ b/meson.build
@@ -3604,9 +3604,13 @@ libqemuutil = static_library('qemuutil',
                               build_by_default: false,
                               sources: util_ss.sources() + stub_ss.sources() + genh,
                               dependencies: [util_ss.dependencies(), libm, threads, glib, socket, malloc])
+qemuutil_deps = [event_loop_base]
+if host_os != 'windows'
+  qemuutil_deps += [rt]
+endif
  qemuutil = declare_dependency(link_with: libqemuutil,
                                sources: genh + version_res,
-                              dependencies: [event_loop_base])
+                              dependencies: qemuutil_deps)

- Steve



^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-07 16:40                       ` Steven Sistare
@ 2024-11-08 11:31                         ` David Hildenbrand
  2024-11-08 13:43                           ` Peter Xu
  2024-11-08 13:56                           ` Steven Sistare
  0 siblings, 2 replies; 86+ messages in thread
From: David Hildenbrand @ 2024-11-08 11:31 UTC (permalink / raw)
  To: Steven Sistare, Peter Xu
  Cc: Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel

On 07.11.24 17:40, Steven Sistare wrote:
> On 11/7/2024 11:26 AM, David Hildenbrand wrote:
>> On 07.11.24 17:02, Steven Sistare wrote:
>>> On 11/7/2024 8:23 AM, David Hildenbrand wrote:
>>>> On 06.11.24 21:12, Steven Sistare wrote:
>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>>
>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>>
>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>>
>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>>> details. See below.
>>>>>>>>>>
>>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>>
>>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>>
>>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>>
>>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>>
>>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>>
>>>>>>>> FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>>
>>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>
>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>>
>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>>
>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>>> -- memfd if available and fallback to shm_open.
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>
>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>>
>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>>
>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>>
>>>>>>>> Thoughts?
>>>>>>>
>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>>> of options and words to describe them.
>>>>>>
>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>>
>>>>> Hi David and Peter,
>>>>>
>>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>>> for simplicity.
>>>>>
>>>>> Any comments before I submit a complete patch?
>>>>>
>>>>> ----
>>>>> qemu-options.hx:
>>>>>         ``aux-ram-share=on|off``
>>>>>             Allocate auxiliary guest RAM as an anonymous file that is
>>>>>             shareable with an external process.  This option applies to
>>>>>             memory allocated as a side effect of creating various devices.
>>>>>             It does not apply to memory-backend-objects, whether explicitly
>>>>>             specified on the command line, or implicitly created by the -m
>>>>>             command line option.
>>>>>
>>>>>             Some migration modes require aux-ram-share=on.
>>>>>
>>>>> qapi/migration.json:
>>>>>         @cpr-transfer:
>>>>>              ...
>>>>>              Memory-backend objects must have the share=on attribute, but
>>>>>              memory-backend-epc is not supported.  The VM must be started
>>>>>              with the '-machine aux-ram-share=on' option.
>>>>>
>>>>> Define RAM_PRIVATE
>>>>>
>>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>>
>>>>> ram_backend_memory_alloc()
>>>>>         ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>>         memory_region_init_ram_flags_nomigrate(ram_flags)
>>>>>
>>>>> qemu_ram_alloc_internal()
>>>>>         ...
>>>>>         if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>>>             new_block->flags |= RAM_SHARED;
>>>>>
>>>>>         if (!host && (new_block->flags & RAM_SHARED)) {
>>>>>             qemu_ram_alloc_shared(new_block);
>>>>>         } else {
>>>>>             new_block->fd = -1;
>>>>>             new_block->host = host;
>>>>>         }
>>>>>         ram_block_add(new_block);
>>>>>
>>>>> qemu_ram_alloc_shared()
>>>>>         if qemu_memfd_check()
>>>>>             new_block->fd = qemu_memfd_create()
>>>>>         else
>>>>>             new_block->fd = qemu_shm_alloc()
>>>>
>>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.
>>>>
>>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.
>>>>
>>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ...
>>>>
>>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.
>>>>
>>>>
>>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.
>>>>
>>>> So maybe something like
>>>>
>>>> qemu_ram_alloc_shared()
>>>>        fd = -1;
>>>>
>>>>        if (qemu_memfd_available()) {
>>>>            fd = qemu_memfd_create();
>>>>            if (fd < 0)
>>>>                ... error
>>>>        } else if (qemu_shm_available()) {
>>>>            fd = qemu_shm_alloc();
>>>>            if (fd < 0)
>>>>                ... error
>>>>        } else {
>>>>            /*
>>>>             * Old behavior: try fd-less shared memory. We might
>>>>             * just end up with non-shared memory on Windows, but
>>>>             * nobody can make use of this shared memory either way
>>>>             * ... should we just use non-shared memory? Or should
>>>>             * we simply bail out? But then, if there is no shared
>>>>             * memory nobody could possibly use it.
>>>>             */
>>>>            qemu_anon_ram_alloc(share=true)
>>>>        }
>>>
>>> Good catch.  We need that fallback for backwards compatibility.  Even with
>>> no use case for memory-backend-ram,share=on since the demise of rdma, users
>>> may specify it on windows, for no particular reason, but it works, and should
>>> continue to work after this series.  CPR would be blocked.
>>
>> Yes, we should keep Windows working in the weird way it is working right now.
>>
>>> More generally for backwards compatibility for share=on for no particular reason,
>>> should we fallback if qemu_shm_alloc fails?  If /dev/shm is mounted with default
>>> options and more than half of ram is requested, it will fail, whereas current qemu
>>> succeeds using MAP_SHARED|MAP_ANON.
>>
>> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path.
>>
>> But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem.
> 
> Agreed on all.
> 
> One more opinion from you please, if you will.
> 
> RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be
> set in
>     ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal
> 
> None of the other backends reach qemu_ram_alloc_internal.
> 
> To be future proof, do you prefer I also set RAM_PRIVATE in the other backends,
> everywhere RAM_SHARED may be set, e.g.:

Hm, I think then we should really set RAM_PRIVATE everywhere we'd want it
and have so far relied on !RAM_SHARED doing the right thing.

Alternatively, we make our life easier and do something like

/*
  * This flag is only used while creating+allocating RAM, and
  * prevents RAM_SHARED getting set for anonymous RAM automatically in
  * some configurations.
  *
  * By default, not setting RAM_SHARED on anonymous RAM implies
  * "private anonymous RAM"; however, in some configurations we want to
  * have most of this RAM automatically be "sharable anonymous RAM",
  * except for some cases that really want "private anonymous RAM".
  *
  * This anonymous RAM *must* be private. This flag only applies to
  * "anonymous" RAM, not fd/file-backed/preallocated one.
  */
#define RAM_FORCE_ANON_PRIVATE	(1 << 13)


But maybe an even better alternative: now that we have the
"aux-ram-share" parameter, could we use

/*
  * Auxiliary RAM that was created automatically internally, instead of
  * explicitly like using memory-backend-ram or some other device on the
  * QEMU cmdline.
  */
#define RAM_AUX	(1 << 13)


So it will be quite clear that "aux-ram-share" only applies to RAM_AUX 
RAMBlocks.

That actually looks quite compelling to me :)
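
As a rough illustration of the RAM_AUX idea, the sharing decision might then
reduce to something like the following.  This is a minimal sketch under
assumptions: the sketch_/SKETCH_ names and bit values are hypothetical, and
the two booleans stand in for "a host pointer was supplied" and the value of
the aux-ram-share machine option.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define SKETCH_RAM_SHARED (1u << 1)
#define SKETCH_RAM_AUX    (1u << 13)

/*
 * Return the flags a new block would end up with.  Only auxiliary RAM
 * that QEMU allocates itself (no caller-supplied host pointer) is
 * promoted to shared, and only when aux-ram-share is on.
 */
static uint32_t sketch_resolve_flags(uint32_t ram_flags, bool has_host_ptr,
                                     bool aux_ram_share)
{
    if (!has_host_ptr && (ram_flags & SKETCH_RAM_AUX) && aux_ram_share) {
        ram_flags |= SKETCH_RAM_SHARED;
    }
    return ram_flags;
}
```

With this shape, memory-backend-ram would simply never pass the AUX flag, so
its share=on|off setting is left untouched and no separate "private" flag is
needed.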

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-08 11:31                         ` David Hildenbrand
@ 2024-11-08 13:43                           ` Peter Xu
  2024-11-08 14:14                             ` Steven Sistare
  2024-11-08 14:18                             ` David Hildenbrand
  2024-11-08 13:56                           ` Steven Sistare
  1 sibling, 2 replies; 86+ messages in thread
From: Peter Xu @ 2024-11-08 13:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Steven Sistare, Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel

On Fri, Nov 08, 2024 at 12:31:45PM +0100, David Hildenbrand wrote:
> On 07.11.24 17:40, Steven Sistare wrote:
> > On 11/7/2024 11:26 AM, David Hildenbrand wrote:
> > > On 07.11.24 17:02, Steven Sistare wrote:
> > > > On 11/7/2024 8:23 AM, David Hildenbrand wrote:
> > > > > On 06.11.24 21:12, Steven Sistare wrote:
> > > > > > On 11/4/2024 4:36 PM, David Hildenbrand wrote:
> > > > > > > On 04.11.24 21:56, Steven Sistare wrote:
> > > > > > > > On 11/4/2024 3:15 PM, David Hildenbrand wrote:
> > > > > > > > > On 04.11.24 20:51, David Hildenbrand wrote:
> > > > > > > > > > On 04.11.24 18:38, Steven Sistare wrote:
> > > > > > > > > > > On 11/4/2024 5:39 AM, David Hildenbrand wrote:
> > > > > > > > > > > > On 01.11.24 14:47, Steve Sistare wrote:
> > > > > > > > > > > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> > > > > > > > > > > > > on the value of the anon-alloc machine property.  This option applies to
> > > > > > > > > > > > > memory allocated as a side effect of creating various devices. It does
> > > > > > > > > > > > > not apply to memory-backend-objects, whether explicitly specified on
> > > > > > > > > > > > > the command line, or implicitly created by the -m command line option.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The memfd option is intended to support new migration modes, in which the
> > > > > > > > > > > > > memory region can be transferred in place to a new QEMU process, by sending
> > > > > > > > > > > > > the memfd file descriptor to the process.  Memory contents are preserved,
> > > > > > > > > > > > > and if the mode also transfers device descriptors, then pages that are
> > > > > > > > > > > > > locked in memory for DMA remain locked.  This behavior is a pre-requisite
> > > > > > > > > > > > > for supporting vfio, vdpa, and iommufd devices with the new modes.
> > > > > > > > > > > > 
> > > > > > > > > > > > A more portable, non-Linux specific variant of this will be using shm,
> > > > > > > > > > > > similar to backends/hostmem-shm.c.
> > > > > > > > > > > > 
> > > > > > > > > > > > Likely we should be using that instead of memfd, or try hiding the
> > > > > > > > > > > > details. See below.
> > > > > > > > > > > 
> > > > > > > > > > > For this series I would prefer to use memfd and hide the details.  It's a
> > > > > > > > > > > concise (and well tested) solution albeit linux only.  The code you supply
> > > > > > > > > > > for posix shm would be a good follow on patch to support other unices.
> > > > > > > > > > 
> > > > > > > > > > Unless there is reason to use memfd we should start with the more
> > > > > > > > > > generic POSIX variant that is available even on systems without memfd.
> > > > > > > > > > Factoring stuff out as I drafted does look quite compelling.
> > > > > > > > > > 
> > > > > > > > > > I can help with the rework, and send it out separately, so you can focus
> > > > > > > > > > on the "machine toggle" as part of this series.
> > > > > > > > > > 
> > > > > > > > > > Of course, if we find out we need the memfd internally instead under
> > > > > > > > > > Linux for whatever reason later, we can use that instead.
> > > > > > > > > > 
> > > > > > > > > > But IIUC, the main selling point for memfd are additional features
> > > > > > > > > > (hugetlb, memory sealing) that you aren't even using.
> > > > > > > > > 
> > > > > > > > > FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
> > > > > > > > 
> > > > > > > > Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
> > > > > > > > To do so using shm_open requires configuration on the mount.  One step harder to use.
> > > > > > > 
> > > > > > > Yes.
> > > > > > > 
> > > > > > > > 
> > > > > > > > This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
> > > > > > > > if memory-backend-ram has hogged all the memory.
> > > > > > > > 
> > > > > > > > > Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
> > > > > > > > 
> > > > > > > > Yes, and if that is a good idea, then the same should be done for internal RAM
> > > > > > > > -- memfd if available and fallback to shm_open.
> > > > > > > 
> > > > > > > Yes.
> > > > > > > 
> > > > > > > > 
> > > > > > > > > I'm hoping we can find a way where it just all is rather intuitive, like
> > > > > > > > > 
> > > > > > > > > "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
> > > > > > > > > 
> > > > > > > > > "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
> > > > > > > > > 
> > > > > > > > > Thoughts?
> > > > > > > > 
> > > > > > > > Agreed, though I thought I had already landed at the intuitive specification in my patch.
> > > > > > > > The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
> > > > > > > > controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
> > > > > > > > of options and words to describe them.
> > > > > > > 
> > > > > > > Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
> > > > > > 
> > > > > > Hi David and Peter,
> > > > > > 
> > > > > > I have implemented and tested the following, for both qemu_memfd_create
> > > > > > and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
> > > > > > for simplicity.
> > > > > > 
> > > > > > Any comments before I submit a complete patch?
> > > > > > 
> > > > > > ----
> > > > > > qemu-options.hx:
> > > > > >         ``aux-ram-share=on|off``
> > > > > >             Allocate auxiliary guest RAM as an anonymous file that is
> > > > > >             shareable with an external process.  This option applies to
> > > > > >             memory allocated as a side effect of creating various devices.
> > > > > >             It does not apply to memory-backend-objects, whether explicitly
> > > > > >             specified on the command line, or implicitly created by the -m
> > > > > >             command line option.
> > > > > > 
> > > > > >             Some migration modes require aux-ram-share=on.
> > > > > > 
> > > > > > qapi/migration.json:
> > > > > >         @cpr-transfer:
> > > > > >              ...
> > > > > >              Memory-backend objects must have the share=on attribute, but
> > > > > >              memory-backend-epc is not supported.  The VM must be started
> > > > > >              with the '-machine aux-ram-share=on' option.
> > > > > > 
> > > > > > Define RAM_PRIVATE
> > > > > > 
> > > > > > Define qemu_shm_alloc(), from David's tmp patch
> > > > > > 
> > > > > > ram_backend_memory_alloc()
> > > > > >         ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
> > > > > >         memory_region_init_ram_flags_nomigrate(ram_flags)
> > > > > > 
> > > > > > qemu_ram_alloc_internal()
> > > > > >         ...
> > > > > >         if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
> > > > > >             new_block->flags |= RAM_SHARED;
> > > > > > 
> > > > > >         if (!host && (new_block->flags & RAM_SHARED)) {
> > > > > >             qemu_ram_alloc_shared(new_block);
> > > > > > >         } else {
> > > > > >             new_block->fd = -1;
> > > > > >             new_block->host = host;
> > > > > >         }
> > > > > >         ram_block_add(new_block);
> > > > > > 
> > > > > > qemu_ram_alloc_shared()
> > > > > >         if qemu_memfd_check()
> > > > > >             new_block->fd = qemu_memfd_create()
> > > > > >         else
> > > > > >             new_block->fd = qemu_shm_alloc()
> > > > > 
> > > > > Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.
> > > > > 
> > > > > memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.
> > > > > 
> > > > > MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ...
> > > > > 
> > > > > So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.
> > > > > 
> > > > > 
> > > > > We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.
> > > > > 
> > > > > So maybe something like
> > > > > 
> > > > > qemu_ram_alloc_shared()
> > > > >        fd = -1;
> > > > > 
> > > > > >        if (qemu_memfd_available()) {
> > > > > >            fd = qemu_memfd_create();
> > > > > >            if (fd < 0)
> > > > > >                ... error
> > > > > >        } else if (qemu_shm_available()) {
> > > > > >            fd = qemu_shm_alloc();
> > > > > >            if (fd < 0)
> > > > > >                ... error
> > > > > >        } else {
> > > > > >            /*
> > > > > >             * Old behavior: try fd-less shared memory. We might
> > > > > >             * just end up with non-shared memory on Windows, but
> > > > > >             * nobody can make use of this shared memory either way
> > > > > >             * ... should we just use non-shared memory? Or should
> > > > > >             * we simply bail out? But then, if there is no shared
> > > > > >             * memory nobody could possibly use it.
> > > > > >             */
> > > > > >            qemu_anon_ram_alloc(share=true)
> > > > > >        }
> > > > 
> > > > Good catch.  We need that fallback for backwards compatibility.  Even with
> > > > no use case for memory-backend-ram,share=on since the demise of rdma, users
> > > > may specify it on windows, for no particular reason, but it works, and should
> > > > continue to work after this series.  CPR would be blocked.
> > > 
> > > Yes, we should keep Windows working in the weird way it is working right now.
> > > 
> > > > More generally for backwards compatibility for share=on for no particular reason,
> > > > should we fallback if qemu_shm_alloc fails?  If /dev/shm is mounted with default
> > > > options and more than half of ram is requested, it will fail, whereas current qemu
> > > > succeeds using MAP_SHARED|MAP_ANON.
> > > 
> > > Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path.
> > > 
> > > But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem.
> > 
> > Agreed on all.
> > 
> > One more opinion from you please, if you will.
> > 
> > RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be
> > set in
> >     ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal
> > 
> > None of the other backends reach qemu_ram_alloc_internal.
> > 
> > To be future proof, do you prefer I also set MAP_PRIVATE in the other backends,
> > everywhere MAP_SHARED may be set, eg:
> 
> Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want
> it and relied on !RAM_SHARED doing the right thing.
> 
> Alternatively, we make our life easier and do something like
> 
> /*
>  * This flag is only used while creating+allocating RAM, and
>  * prevents RAM_SHARED getting set for anonymous RAM automatically in
>  * some configurations.
>  *
>  * By default, not setting RAM_SHARED on anonymous RAM implies
>  * "private anonymous RAM"; however, in some configuration we want to
>  * have most of this RAM automatically be "sharable anonymous RAM",
>  * except for some cases that really want "private anonymous RAM".
>  *
>  * This anonymous RAM *must* be private. This flag only applies to
>  * "anonymous" RAM, not fd/file-backed/preallocated one.
>  */
> RAM_FORCE_ANON_PRIVATE	(1 << 13)
> 
> 
> BUT maybe an even better alternative now that we have the "aux-ram-share"
> parameter, could we use
> 
> /*
>  * Auxiliary RAM that was created automatically internally, instead of
>  * explicitly like using memory-backend-ram or some other device on the
>  * QEMU cmdline.
>  */
> RAM_AUX	(1 << 13)
> 
> 
> So it will be quite clear that "aux-ram-share" only applies to RAM_AUX
> RAMBlocks.
> 
> That actually looks quite compelling to me :)

Could anyone remind me why we can't simply set PRIVATE|SHARED all over the
place?

IMHO RAM_AUX is too hard for any new callers to know how to set.  It's much
easier when we already have SHARED, adding PRIVATE could be mostly natural,
then we can already avoid AUX due to checking !SHARED & !PRIVATE.

Basically, SHARED|PRIVATE then must come from a user request (QMP or
cmdline), otherwise the caller should always set none of them, implying
aux.

It still looks the best to me.
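
To spell out the rule, here is a minimal sketch.  It assumes a RAM_PRIVATE
bit that does not exist yet, uses placeholder bit values rather than the real
definitions, and resolve_share_flags() is a made-up helper name, not an
existing QEMU function:

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for illustration only, not QEMU's definitions. */
#define RAM_SHARED  (1u << 1)
#define RAM_PRIVATE (1u << 13)   /* hypothetical new flag */

/*
 * Hypothetical helper: decide the final flags for a new RAMBlock.
 * An explicit SHARED or PRIVATE (from QMP or the cmdline) is honored
 * as-is.  Neither bit set means auxiliary RAM, which becomes shared
 * only when the machine's aux-ram-share setting is on.
 */
static uint32_t resolve_share_flags(uint32_t ram_flags, bool aux_ram_share)
{
    if (ram_flags & (RAM_SHARED | RAM_PRIVATE)) {
        return ram_flags;                 /* explicit user request */
    }
    return aux_ram_share ? (ram_flags | RAM_SHARED) : ram_flags;
}
```

With this shape, new callers that pass no sharing flag get the aux behavior
without having to know about any extra bit.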

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-08 11:31                         ` David Hildenbrand
  2024-11-08 13:43                           ` Peter Xu
@ 2024-11-08 13:56                           ` Steven Sistare
  2024-11-08 14:20                             ` David Hildenbrand
  1 sibling, 1 reply; 86+ messages in thread
From: Steven Sistare @ 2024-11-08 13:56 UTC (permalink / raw)
  To: David Hildenbrand, Peter Xu
  Cc: Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel

On 11/8/2024 6:31 AM, David Hildenbrand wrote:
> On 07.11.24 17:40, Steven Sistare wrote:
>> On 11/7/2024 11:26 AM, David Hildenbrand wrote:
>>> On 07.11.24 17:02, Steven Sistare wrote:
>>>> On 11/7/2024 8:23 AM, David Hildenbrand wrote:
>>>>> On 06.11.24 21:12, Steven Sistare wrote:
>>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>>>
>>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>>>
>>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>>>> details. See below.
>>>>>>>>>>>
>>>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>>>
>>>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>>>
>>>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>>>
>>>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>>>
>>>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>>>
>>>>>>>>> FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>>>
>>>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>>
>>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>>>
>>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>>>
>>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>>>> -- memfd if available and fallback to shm_open.
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>>
>>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>>>
>>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>>>
>>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>>>
>>>>>>>>> Thoughts?
>>>>>>>>
>>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>>>> of options and words to describe them.
>>>>>>>
>>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>>>
>>>>>> Hi David and Peter,
>>>>>>
>>>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>>>> for simplicity.
>>>>>>
>>>>>> Any comments before I submit a complete patch?
>>>>>>
>>>>>> ----
>>>>>> qemu-options.hx:
>>>>>>         ``aux-ram-share=on|off``
>>>>>>             Allocate auxiliary guest RAM as an anonymous file that is
>>>>>>             shareable with an external process.  This option applies to
>>>>>>             memory allocated as a side effect of creating various devices.
>>>>>>             It does not apply to memory-backend-objects, whether explicitly
>>>>>>             specified on the command line, or implicitly created by the -m
>>>>>>             command line option.
>>>>>>
>>>>>>             Some migration modes require aux-ram-share=on.
>>>>>>
>>>>>> qapi/migration.json:
>>>>>>         @cpr-transfer:
>>>>>>              ...
>>>>>>              Memory-backend objects must have the share=on attribute, but
>>>>>>              memory-backend-epc is not supported.  The VM must be started
>>>>>>              with the '-machine aux-ram-share=on' option.
>>>>>>
>>>>>> Define RAM_PRIVATE
>>>>>>
>>>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>>>
>>>>>> ram_backend_memory_alloc()
>>>>>>         ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>>>         memory_region_init_ram_flags_nomigrate(ram_flags)
>>>>>>
>>>>>> qemu_ram_alloc_internal()
>>>>>>         ...
>>>>>>         if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>>>>             new_block->flags |= RAM_SHARED;
>>>>>>
>>>>>>         if (!host && (new_block->flags & RAM_SHARED)) {
>>>>>>             qemu_ram_alloc_shared(new_block);
>>>>>>         } else {
>>>>>>             new_block->fd = -1;
>>>>>>             new_block->host = host;
>>>>>>         }
>>>>>>         ram_block_add(new_block);
>>>>>>
>>>>>> qemu_ram_alloc_shared()
>>>>>>         if qemu_memfd_check()
>>>>>>             new_block->fd = qemu_memfd_create()
>>>>>>         else
>>>>>>             new_block->fd = qemu_shm_alloc()
>>>>>
>>>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.
>>>>>
>>>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.
>>>>>
>>>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ...
>>>>>
>>>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.
>>>>>
>>>>>
>>>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.
>>>>>
>>>>> So maybe something like
>>>>>
>>>>> qemu_ram_alloc_shared()
>>>>>        fd = -1;
>>>>>
>>>>>        if (qemu_memfd_available()) {
>>>>>            fd = qemu_memfd_create();
>>>>>            if (fd < 0)
>>>>>                ... error
>>>>>        } else if (qemu_shm_available()) {
>>>>>            fd = qemu_shm_alloc();
>>>>>            if (fd < 0)
>>>>>                ... error
>>>>>        } else {
>>>>>            /*
>>>>>             * Old behavior: try fd-less shared memory. We might
>>>>>             * just end up with non-shared memory on Windows, but
>>>>>             * nobody can make use of this shared memory either way
>>>>>             * ... should we just use non-shared memory? Or should
>>>>>             * we simply bail out? But then, if there is no shared
>>>>>             * memory nobody could possibly use it.
>>>>>             */
>>>>>            qemu_anon_ram_alloc(share=true)
>>>>>        }
>>>>
>>>> Good catch.  We need that fallback for backwards compatibility.  Even with
>>>> no use case for memory-backend-ram,share=on since the demise of rdma, users
>>>> may specify it on windows, for no particular reason, but it works, and should
>>>> continue to work after this series.  CPR would be blocked.
>>>
>>> Yes, we should keep Windows working in the weird way it is working right now.
>>>
>>>> More generally for backwards compatibility for share=on for no particular reason,
>>>> should we fallback if qemu_shm_alloc fails?  If /dev/shm is mounted with default
>>>> options and more than half of ram is requested, it will fail, whereas current qemu
>>>> succeeds using MAP_SHARED|MAP_ANON.
>>>
>>> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path.
>>>
>>> But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem.
>>
>> Agreed on all.
>>
>> One more opinion from you please, if you will.
>>
>> RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be
>> set in
>>     ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal
>>
>> None of the other backends reach qemu_ram_alloc_internal.
>>
>> To be future proof, do you prefer I also set MAP_PRIVATE in the other backends,
>> everywhere MAP_SHARED may be set, eg:
> 
> Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want it and relied on !RAM_SHARED doing the right thing.
> 
> Alternatively, we make our life easier and do something like
> 
> /*
>   * This flag is only used while creating+allocating RAM, and
>   * prevents RAM_SHARED getting set for anonymous RAM automatically in
>   * some configurations.
>   *
>   * By default, not setting RAM_SHARED on anonymous RAM implies
>   * "private anonymous RAM"; however, in some configuration we want to
>   * have most of this RAM automatically be "sharable anonymous RAM",
>   * except for some cases that really want "private anonymous RAM".
>   *
>   * This anonymous RAM *must* be private. This flag only applies to
>   * "anonymous" RAM, not fd/file-backed/preallocated one.
>   */
> RAM_FORCE_ANON_PRIVATE    (1 << 13)
> 
> 
> BUT maybe an even better alternative now that we have the "aux-ram-share" parameter, could we use
> 
> /*
>   * Auxiliary RAM that was created automatically internally, instead of
>   * explicitly like using memory-backend-ram or some other device on the
>   * QEMU cmdline.
>   */
> RAM_AUX    (1 << 13)
> 
> 
> So it will be quite clear that "aux-ram-share" only applies to RAM_AUX RAMBlocks.
> 
> That actually looks quite compelling to me :)

Agreed, RAM_AUX is a clear solution.  I would set it in these functions:
   qemu_ram_alloc_resizeable
   memory_region_init_ram_nomigrate
   memory_region_init_rom_nomigrate
   memory_region_init_rom_device_nomigrate

and test it with aux_ram_share in qemu_ram_alloc_internal.
   if RAM_AUX && aux_ram_share
     flags |= RAM_SHARED

However, we could just set RAM_SHARED at those same call sites:
   flags = current_machine->aux_ram_share ?  RAM_SHARED : 0;
which is what I did in
   [PATCH V2 01/11] machine: alloc-anon option
and test RAM_SHARED in qemu_ram_alloc_internal.
No need for RAM_PRIVATE.

RAM_AUX is nice because it declares intent more specifically.
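
To make the comparison concrete, here is a minimal sketch of both options.
The bit values are placeholders, and both helper names are invented for
illustration; neither exists in QEMU:

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for illustration only, not QEMU's definitions. */
#define RAM_SHARED (1u << 1)
#define RAM_AUX    (1u << 13)   /* the bit proposed above */

/*
 * Option A: the four call sites only tag the block as RAM_AUX, and the
 * allocator (standing in for qemu_ram_alloc_internal) applies the policy.
 */
static uint32_t flags_via_ram_aux(uint32_t ram_flags, bool aux_ram_share)
{
    if ((ram_flags & RAM_AUX) && aux_ram_share) {
        ram_flags |= RAM_SHARED;
    }
    return ram_flags;
}

/*
 * Option B: each call site applies the policy itself and passes
 * RAM_SHARED directly; the allocator only tests RAM_SHARED.
 */
static uint32_t flags_at_call_site(bool aux_ram_share)
{
    return aux_ram_share ? RAM_SHARED : 0;
}
```

Both end up with RAM_SHARED set for auxiliary RAM when aux-ram-share is on.
Option A also leaves a marker on the block saying why.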

Your preference?

- Steve



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-08 13:43                           ` Peter Xu
@ 2024-11-08 14:14                             ` Steven Sistare
  2024-11-08 14:32                               ` Peter Xu
  2024-11-08 14:18                             ` David Hildenbrand
  1 sibling, 1 reply; 86+ messages in thread
From: Steven Sistare @ 2024-11-08 14:14 UTC (permalink / raw)
  To: Peter Xu, David Hildenbrand
  Cc: Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel

On 11/8/2024 8:43 AM, Peter Xu wrote:
> On Fri, Nov 08, 2024 at 12:31:45PM +0100, David Hildenbrand wrote:
>> On 07.11.24 17:40, Steven Sistare wrote:
>>> On 11/7/2024 11:26 AM, David Hildenbrand wrote:
>>>> On 07.11.24 17:02, Steven Sistare wrote:
>>>>> On 11/7/2024 8:23 AM, David Hildenbrand wrote:
>>>>>> On 06.11.24 21:12, Steven Sistare wrote:
>>>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>>>>> details. See below.
>>>>>>>>>>>>
>>>>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>>>>
>>>>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>>>>
>>>>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>>>>
>>>>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>>>>
>>>>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>>>>
>>>>>>>>>> FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>>>>
>>>>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>>>>
>>>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>>>>
>>>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>>>>> -- memfd if available and fallback to shm_open.
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>>>>
>>>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>>>>
>>>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>>>>
>>>>>>>>>> Thoughts?
>>>>>>>>>
>>>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>>>>> of options and words to describe them.
>>>>>>>>
>>>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>>>>
>>>>>>> Hi David and Peter,
>>>>>>>
>>>>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>>>>> for simplicity.
>>>>>>>
>>>>>>> Any comments before I submit a complete patch?
>>>>>>>
>>>>>>> ----
>>>>>>> qemu-options.hx:
>>>>>>>          ``aux-ram-share=on|off``
>>>>>>>              Allocate auxiliary guest RAM as an anonymous file that is
>>>>>>>              shareable with an external process.  This option applies to
>>>>>>>              memory allocated as a side effect of creating various devices.
>>>>>>>              It does not apply to memory-backend-objects, whether explicitly
>>>>>>>              specified on the command line, or implicitly created by the -m
>>>>>>>              command line option.
>>>>>>>
>>>>>>>              Some migration modes require aux-ram-share=on.
>>>>>>>
>>>>>>> qapi/migration.json:
>>>>>>>          @cpr-transfer:
>>>>>>>               ...
>>>>>>>               Memory-backend objects must have the share=on attribute, but
>>>>>>>               memory-backend-epc is not supported.  The VM must be started
>>>>>>>               with the '-machine aux-ram-share=on' option.
>>>>>>>
>>>>>>> Define RAM_PRIVATE
>>>>>>>
>>>>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>>>>
>>>>>>> ram_backend_memory_alloc()
>>>>>>>          ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>>>>          memory_region_init_ram_flags_nomigrate(ram_flags)
>>>>>>>
>>>>>>> qemu_ram_alloc_internal()
>>>>>>>          ...
>>>>>>>          if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>>>>>              new_block->flags |= RAM_SHARED;
>>>>>>>
>>>>>>>          if (!host && (new_block->flags & RAM_SHARED)) {
>>>>>>>              qemu_ram_alloc_shared(new_block);
>>>>>>>          } else {
>>>>>>>              new_block->fd = -1;
>>>>>>>              new_block->host = host;
>>>>>>>          }
>>>>>>>          ram_block_add(new_block);
>>>>>>>
>>>>>>> qemu_ram_alloc_shared()
>>>>>>>          if qemu_memfd_check()
>>>>>>>              new_block->fd = qemu_memfd_create()
>>>>>>>          else
>>>>>>>              new_block->fd = qemu_shm_alloc()
>>>>>>
>>>>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.
>>>>>>
>>>>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.
>>>>>>
>>>>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ...
>>>>>>
>>>>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.
>>>>>>
>>>>>>
>>>>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.
>>>>>>
>>>>>> So maybe something like
>>>>>>
>>>>>> qemu_ram_alloc_shared()
>>>>>>         fd = -1;
>>>>>>
>>>>>>         if (qemu_memfd_available()) {
>>>>>>             fd = qemu_memfd_create();
>>>>>>             if (fd < 0)
>>>>>>                 ... error
>>>>>>         } else if (qemu_shm_available()) {
>>>>>>             fd = qemu_shm_alloc();
>>>>>>             if (fd < 0)
>>>>>>                 ... error
>>>>>>         } else {
>>>>>>             /*
>>>>>>              * Old behavior: try fd-less shared memory. We might
>>>>>>              * just end up with non-shared memory on Windows, but
>>>>>>              * nobody can make use of this shared memory either way
>>>>>>              * ... should we just use non-shared memory? Or should
>>>>>>              * we simply bail out? But then, if there is no shared
>>>>>>              * memory nobody could possibly use it.
>>>>>>              */
>>>>>>             qemu_anon_ram_alloc(share=true)
>>>>>>         }
>>>>>
>>>>> Good catch.  We need that fallback for backwards compatibility.  Even with
>>>>> no use case for memory-backend-ram,share=on since the demise of rdma, users
>>>>> may specify it on windows, for no particular reason, but it works, and should
>>>>> continue to work after this series.  CPR would be blocked.
>>>>
>>>> Yes, we should keep Windows working in the weird way it is working right now.
>>>>
>>>>> More generally for backwards compatibility for share=on for no particular reason,
>>>>> should we fallback if qemu_shm_alloc fails?  If /dev/shm is mounted with default
>>>>> options and more than half of ram is requested, it will fail, whereas current qemu
>>>>> succeeds using MAP_SHARED|MAP_ANON.
>>>>
>>>> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path.
>>>>
>>>> But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem.
>>>
>>> Agreed on all.
>>>
>>> One more opinion from you please, if you will.
>>>
>>> RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be
>>> set in
>>>      ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal
>>>
>>> None of the other backends reach qemu_ram_alloc_internal.
>>>
>>> To be future proof, do you prefer I also set MAP_PRIVATE in the other backends,
>>> everywhere MAP_SHARED may be set, eg:
>>
>> Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want
>> it and relied on !RAM_SHARED doing the right thing.
>>
>> Alternatively, we make our life easier and do something like
>>
>> /*
>>   * This flag is only used while creating+allocating RAM, and
>>   * prevents RAM_SHARED getting set for anonymous RAM automatically in
>>   * some configurations.
>>   *
>>   * By default, not setting RAM_SHARED on anonymous RAM implies
>>   * "private anonymous RAM"; however, in some configuration we want to
>>   * have most of this RAM automatically be "sharable anonymous RAM",
>>   * except for some cases that really want "private anonymous RAM".
>>   *
>>   * This anonymous RAM *must* be private. This flag only applies to
>>   * "anonymous" RAM, not fd/file-backed/preallocated one.
>>   */
>> RAM_FORCE_ANON_PRIVATE	(1 << 13)
>>
>>
>> BUT maybe an even better alternative now that we have the "aux-ram-share"
>> parameter, could we use
>>
>> /*
>>   * Auxiliary RAM that was created automatically internally, instead of
>>   * explicitly like using memory-backend-ram or some other device on the
>>   * QEMU cmdline.
>>   */
>> RAM_AUX	(1 << 13)
>>
>>
>> So it will be quite clear that "aux-ram-share" only applies to RAM_AUX
>> RAMBlocks.
>>
>> That actually looks quite compelling to me :)
> 
> Could anyone remind me why we can't simply set PRIVATE|SHARED all over the
> place?
> 
> IMHO RAM_AUX is too hard for any new callers to know how to set.  It's much
> easier when we already have SHARED, adding PRIVATE could be mostly natural,
> then we can already avoid AUX due to checking !SHARED & !PRIVATE.
> 
> Basically, SHARED|PRIVATE then must come from a user request (QMP or
> cmdline), otherwise the caller should always set none of them, implying
> aux.
> 
> It still looks the best to me.

Our emails crossed. We could set PRIVATE|SHARED all over the place.  Nothing
wrong with that solution. I have no preference, other than finishing.

- Steve




^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-08 13:43                           ` Peter Xu
  2024-11-08 14:14                             ` Steven Sistare
@ 2024-11-08 14:18                             ` David Hildenbrand
  2024-11-08 15:01                               ` Peter Xu
  1 sibling, 1 reply; 86+ messages in thread
From: David Hildenbrand @ 2024-11-08 14:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: Steven Sistare, Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel

On 08.11.24 14:43, Peter Xu wrote:
> On Fri, Nov 08, 2024 at 12:31:45PM +0100, David Hildenbrand wrote:
>> On 07.11.24 17:40, Steven Sistare wrote:
>>> On 11/7/2024 11:26 AM, David Hildenbrand wrote:
>>>> On 07.11.24 17:02, Steven Sistare wrote:
>>>>> On 11/7/2024 8:23 AM, David Hildenbrand wrote:
>>>>>> On 06.11.24 21:12, Steven Sistare wrote:
>>>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>>>>> details. See below.
>>>>>>>>>>>>
>>>>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>>>>
>>>>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>>>>
>>>>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>>>>
>>>>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>>>>
>>>>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>>>>
>>>>>>>>>> FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>>>>
>>>>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>>>>
>>>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>>>>
>>>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>>>>> -- memfd if available and fallback to shm_open.
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>>>>
>>>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>>>>
>>>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>>>>
>>>>>>>>>> Thoughts?
>>>>>>>>>
>>>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>>>>> of options and words to describe them.
>>>>>>>>
>>>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>>>>
>>>>>>> Hi David and Peter,
>>>>>>>
>>>>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>>>>> for simplicity.
>>>>>>>
>>>>>>> Any comments before I submit a complete patch?
>>>>>>>
>>>>>>> ----
>>>>>>> qemu-options.hx:
>>>>>>>          ``aux-ram-share=on|off``
>>>>>>>              Allocate auxiliary guest RAM as an anonymous file that is
>>>>>>>              shareable with an external process.  This option applies to
>>>>>>>              memory allocated as a side effect of creating various devices.
>>>>>>>              It does not apply to memory-backend-objects, whether explicitly
>>>>>>>              specified on the command line, or implicitly created by the -m
>>>>>>>              command line option.
>>>>>>>
>>>>>>>              Some migration modes require aux-ram-share=on.
>>>>>>>
>>>>>>> qapi/migration.json:
>>>>>>>          @cpr-transfer:
>>>>>>>               ...
>>>>>>>               Memory-backend objects must have the share=on attribute, but
>>>>>>>               memory-backend-epc is not supported.  The VM must be started
>>>>>>>               with the '-machine aux-ram-share=on' option.
>>>>>>>
>>>>>>> Define RAM_PRIVATE
>>>>>>>
>>>>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>>>>
>>>>>>> ram_backend_memory_alloc()
>>>>>>>          ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>>>>          memory_region_init_ram_flags_nomigrate(ram_flags)
>>>>>>>
>>>>>>> qemu_ram_alloc_internal()
>>>>>>>          ...
>>>>>>>          if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>>>>>              new_block->flags |= RAM_SHARED;
>>>>>>>
>>>>>>>          if (!host && (new_block->flags & RAM_SHARED)) {
>>>>>>>              qemu_ram_alloc_shared(new_block);
>>>>>>>          } else {
>>>>>>>              new_block->fd = -1;
>>>>>>>              new_block->host = host;
>>>>>>>          }
>>>>>>>          ram_block_add(new_block);
>>>>>>>
>>>>>>> qemu_ram_alloc_shared()
>>>>>>>          if qemu_memfd_check()
>>>>>>>              new_block->fd = qemu_memfd_create()
>>>>>>>          else
>>>>>>>              new_block->fd = qemu_shm_alloc()
>>>>>>
>>>>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.
>>>>>>
>>>>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.
>>>>>>
>>>>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ...
>>>>>>
>>>>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.
>>>>>>
>>>>>>
>>>>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.
>>>>>>
>>>>>> So maybe something like
>>>>>>
>>>>>> qemu_ram_alloc_shared()
>>>>>>         fd = -1;
>>>>>>
>>>>>>         if (qemu_memfd_available()) {
>>>>>>             fd = qemu_memfd_create();
>>>>>>             if (fd < 0)
>>>>>>                 ... error
>>>>>>         } else if (qemu_shm_available()) {
>>>>>>             fd = qemu_shm_alloc();
>>>>>>             if (fd < 0)
>>>>>>                 ... error
>>>>>>         } else {
>>>>>>             /*
>>>>>>              * Old behavior: try fd-less shared memory. We might
>>>>>>              * just end up with non-shared memory on Windows, but
>>>>>>              * nobody can make use of this shared memory either way
>>>>>>              * ... should we just use non-shared memory? Or should
>>>>>>              * we simply bail out? But then, if there is no shared
>>>>>>              * memory nobody could possibly use it.
>>>>>>              */
>>>>>>             qemu_anon_ram_alloc(share=true)
>>>>>>         }
>>>>>
>>>>> Good catch.  We need that fallback for backwards compatibility.  Even with
>>>>> no use case for memory-backend-ram,share=on since the demise of rdma, users
>>>>> may specify it on windows, for no particular reason, but it works, and should
>>>>> continue to work after this series.  CPR would be blocked.
>>>>
>>>> Yes, we should keep Windows working in the weird way it is working right now.
>>>>
>>>>> More generally for backwards compatibility for share=on for no particular reason,
>>>>> should we fallback if qemu_shm_alloc fails?  If /dev/shm is mounted with default
>>>>> options and more than half of ram is requested, it will fail, whereas current qemu
>>>>> succeeds using MAP_SHARED|MAP_ANON.
>>>>
>>>> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path.
>>>>
>>>> But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem.
>>>
>>> Agreed on all.
>>>
>>> One more opinion from you please, if you will.
>>>
>>> RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be
>>> set in
>>>      ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal
>>>
>>> None of the other backends reach qemu_ram_alloc_internal.
>>>
>>> To be future proof, do you prefer I also set RAM_PRIVATE in the other backends,
>>> everywhere RAM_SHARED may be set, eg:
>>
>> Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want
>> it and relied on !RAM_SHARED doing the right thing.
>>
>> Alternatively, we make our life easier and do something like
>>
>> /*
>>   * This flag is only used while creating+allocating RAM, and
>>   * prevents RAM_SHARED getting set for anonymous RAM automatically in
>>   * some configurations.
>>   *
>>   * By default, not setting RAM_SHARED on anonymous RAM implies
>>   * "private anonymous RAM"; however, in some configuration we want to
>>   * have most of this RAM automatically be "sharable anonymous RAM",
>>   * except for some cases that really want "private anonymous RAM".
>>   *
>>   * This anonymous RAM *must* be private. This flag only applies to
>>   * "anonymous" RAM, not fd/file-backed/preallocated one.
>>   */
>> RAM_FORCE_ANON_PRIVATE	(1 << 13)
>>
>>
>> BUT maybe an even better alternative now that we have the "aux-ram-share"
>> parameter, could we use
>>
>> /*
>>   * Auxiliary RAM that was created automatically internally, instead of
>>   * explicitly like using memory-backend-ram or some other device on the
>>   * QEMU cmdline.
>>   */
>> RAM_AUX	(1 << 13)
>>
>>
>> So it will be quite clear that "aux-ram-share" only applies to RAM_AUX
>> RAMBlocks.
>>
>> That actually looks quite compelling to me :)
> 
> Could anyone remind me why we can't simply set PRIVATE|SHARED all over the
> place?
> 
> IMHO RAM_AUX is too hard for any new callers to know how to set.  It's much
> easier when we already have SHARED, adding PRIVATE could be mostly natural,
> then we can already avoid AUX due to checking !SHARED & !PRIVATE.

How is it clearer if you have to know whether you have to set 
RAM_PRIVATE or not for some RAM? Because you *wouldn't* set it "all over 
the place".

No strong opinion, but RAM_AUX aligns much better with what we actually 
want to achieve: making aux RAM shared. Which implies, detecting aux RAM ...
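
To make the RAM_AUX variant concrete, here is a minimal sketch of the check discussed above. It is an illustration under assumptions, not the patch: the SKETCH_ prefixed names and the helper sketch_resolve_aux_flags() are hypothetical, the SHARED bit value is a placeholder, and only the (1 << 13) value for the aux flag comes from the proposal quoted above. The aux_ram_share argument stands in for the machine option.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the flags under discussion. */
#define SKETCH_RAM_SHARED (1u << 1)   /* placeholder bit value */
#define SKETCH_RAM_AUX    (1u << 13)  /* value proposed in this thread */

/*
 * Resolve the flags for a new block: only blocks explicitly tagged as
 * auxiliary are promoted to shared, and only when the machine option
 * (modelled here by aux_ram_share) is on.  Anything else passes through
 * unchanged, so memory backends keep whatever the user asked for.
 */
static uint32_t sketch_resolve_aux_flags(uint32_t ram_flags, bool aux_ram_share)
{
    if ((ram_flags & SKETCH_RAM_AUX) && aux_ram_share) {
        ram_flags |= SKETCH_RAM_SHARED;
    }
    return ram_flags;
}
```

With this shape, a caller that does not tag its block as aux is never affected by the machine option, which is the "detecting aux RAM" part.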

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-08 13:56                           ` Steven Sistare
@ 2024-11-08 14:20                             ` David Hildenbrand
  2024-11-08 14:37                               ` Steven Sistare
  0 siblings, 1 reply; 86+ messages in thread
From: David Hildenbrand @ 2024-11-08 14:20 UTC (permalink / raw)
  To: Steven Sistare, Peter Xu
  Cc: Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel

On 08.11.24 14:56, Steven Sistare wrote:
> On 11/8/2024 6:31 AM, David Hildenbrand wrote:
>> On 07.11.24 17:40, Steven Sistare wrote:
>>> On 11/7/2024 11:26 AM, David Hildenbrand wrote:
>>>> On 07.11.24 17:02, Steven Sistare wrote:
>>>>> On 11/7/2024 8:23 AM, David Hildenbrand wrote:
>>>>>> On 06.11.24 21:12, Steven Sistare wrote:
>>>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>>>>> details. See below.
>>>>>>>>>>>>
>>>>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>>>>
>>>>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>>>>
>>>>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>>>>
>>>>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>>>>
>>>>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>>>>
>>>>>>>>>> FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>>>>
>>>>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>>>>
>>>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>>>>
>>>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>>>>> -- memfd if available and fallback to shm_open.
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>>>>
>>>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>>>>
>>>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>>>>
>>>>>>>>>> Thoughts?
>>>>>>>>>
>>>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>>>>> of options and words to describe them.
>>>>>>>>
>>>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>>>>
>>>>>>> Hi David and Peter,
>>>>>>>
>>>>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>>>>> for simplicity.
>>>>>>>
>>>>>>> Any comments before I submit a complete patch?
>>>>>>>
>>>>>>> ----
>>>>>>> qemu-options.hx:
>>>>>>>          ``aux-ram-share=on|off``
>>>>>>>              Allocate auxiliary guest RAM as an anonymous file that is
>>>>>>>              shareable with an external process.  This option applies to
>>>>>>>              memory allocated as a side effect of creating various devices.
>>>>>>>              It does not apply to memory-backend-objects, whether explicitly
>>>>>>>              specified on the command line, or implicitly created by the -m
>>>>>>>              command line option.
>>>>>>>
>>>>>>>              Some migration modes require aux-ram-share=on.
>>>>>>>
>>>>>>> qapi/migration.json:
>>>>>>>          @cpr-transfer:
>>>>>>>               ...
>>>>>>>               Memory-backend objects must have the share=on attribute, but
>>>>>>>               memory-backend-epc is not supported.  The VM must be started
>>>>>>>               with the '-machine aux-ram-share=on' option.
>>>>>>>
>>>>>>> Define RAM_PRIVATE
>>>>>>>
>>>>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>>>>
>>>>>>> ram_backend_memory_alloc()
>>>>>>>          ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>>>>          memory_region_init_ram_flags_nomigrate(ram_flags)
>>>>>>>
>>>>>>> qemu_ram_alloc_internal()
>>>>>>>          ...
>>>>>>>          if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>>>>>              new_block->flags |= RAM_SHARED;
>>>>>>>
>>>>>>>          if (!host && (new_block->flags & RAM_SHARED)) {
>>>>>>>              qemu_ram_alloc_shared(new_block);
>>>>>>>          } else {
>>>>>>>              new_block->fd = -1;
>>>>>>>              new_block->host = host;
>>>>>>>          }
>>>>>>>          ram_block_add(new_block);
>>>>>>>
>>>>>>> qemu_ram_alloc_shared()
>>>>>>>          if qemu_memfd_check()
>>>>>>>              new_block->fd = qemu_memfd_create()
>>>>>>>          else
>>>>>>>              new_block->fd = qemu_shm_alloc()
>>>>>>
>>>>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.
>>>>>>
>>>>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.
>>>>>>
>>>>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ...
>>>>>>
>>>>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.
>>>>>>
>>>>>>
>>>>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.
>>>>>>
>>>>>> So maybe something like
>>>>>>
>>>>>> qemu_ram_alloc_shared()
>>>>>>         fd = -1;
>>>>>>
>>>>>>         if (qemu_memfd_available()) {
>>>>>>             fd = qemu_memfd_create();
>>>>>>             if (fd < 0)
>>>>>>                 ... error
>>>>>>         } else if (qemu_shm_available()) {
>>>>>>             fd = qemu_shm_alloc();
>>>>>>             if (fd < 0)
>>>>>>                 ... error
>>>>>>         } else {
>>>>>>             /*
>>>>>>              * Old behavior: try fd-less shared memory. We might
>>>>>>              * just end up with non-shared memory on Windows, but
>>>>>>              * nobody can make use of this shared memory either way
>>>>>>              * ... should we just use non-shared memory? Or should
>>>>>>              * we simply bail out? But then, if there is no shared
>>>>>>              * memory nobody could possibly use it.
>>>>>>              */
>>>>>>             qemu_anon_ram_alloc(share=true)
>>>>>>         }
>>>>>
>>>>> Good catch.  We need that fallback for backwards compatibility.  Even with
>>>>> no use case for memory-backend-ram,share=on since the demise of rdma, users
>>>>> may specify it on windows, for no particular reason, but it works, and should
>>>>> continue to work after this series.  CPR would be blocked.
>>>>
>>>> Yes, we should keep Windows working in the weird way it is working right now.
>>>>
>>>>> More generally for backwards compatibility for share=on for no particular reason,
>>>>> should we fallback if qemu_shm_alloc fails?  If /dev/shm is mounted with default
>>>>> options and more than half of ram is requested, it will fail, whereas current qemu
>>>>> succeeds using MAP_SHARED|MAP_ANON.
>>>>
>>>> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path.
>>>>
>>>> But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem.
>>>
>>> Agreed on all.
>>>
>>> One more opinion from you please, if you will.
>>>
>>> RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be
>>> set in
>>>      ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal
>>>
>>> None of the other backends reach qemu_ram_alloc_internal.
>>>
>>> To be future proof, do you prefer I also set RAM_PRIVATE in the other backends,
>>> everywhere RAM_SHARED may be set, eg:
>>
>> Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want it and relied on !RAM_SHARED doing the right thing.
>>
>> Alternatively, we make our life easier and do something like
>>
>> /*
>>    * This flag is only used while creating+allocating RAM, and
>>    * prevents RAM_SHARED getting set for anonymous RAM automatically in
>>    * some configurations.
>>    *
>>    * By default, not setting RAM_SHARED on anonymous RAM implies
>>    * "private anonymous RAM"; however, in some configuration we want to
>>    * have most of this RAM automatically be "sharable anonymous RAM",
>>    * except for some cases that really want "private anonymous RAM".
>>    *
>>    * This anonymous RAM *must* be private. This flag only applies to
>>    * "anonymous" RAM, not fd/file-backed/preallocated one.
>>    */
>> RAM_FORCE_ANON_PRIVATE    (1 << 13)
>>
>>
>> BUT maybe an even better alternative now that we have the "aux-ram-share" parameter, could we use
>>
>> /*
>>    * Auxiliary RAM that was created automatically internally, instead of
>>    * explicitly like using memory-backend-ram or some other device on the
>>    * QEMU cmdline.
>>    */
>> RAM_AUX    (1 << 13)
>>
>>
>> So it will be quite clear that "aux-ram-share" only applies to RAM_AUX RAMBlocks.
>>
>> That actually looks quite compelling to me :)
> 
> Agreed, RAM_AUX is a clear solution.  I would set it in these functions:
>     qemu_ram_alloc_resizeable
>     memory_region_init_ram_nomigrate
>     memory_region_init_rom_nomigrate
>     memory_region_init_rom_device_nomigrate
> 
> and test it with aux_ram_share in qemu_ram_alloc_internal.
>     if RAM_AUX && aux_ram_share
>       flags |= RAM_SHARED
> 
> However, we could just set RAM_SHARED at those same call sites:
>     flags = current_machine->aux_ram_share ? RAM_SHARED : 0;
> which is what I did in
>     [PATCH V2 01/11] machine: alloc-anon option
> and test RAM_SHARED in qemu_ram_alloc_internal.
> No need for RAM_PRIVATE.
> 
> RAM_AUX is nice because it declares intent more specifically.
> 
> Your preference?

My preference is either using RAM_AUX to flag aux RAM, or the inverse,
RAM_NON_AUX to flag non-aux RAM, such as RAM from memory backends and
likely ivshmem.c.

Peter still seems to prefer RAM_PRIVATE. So I guess it's up to you to 
decide ;)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-08 14:14                             ` Steven Sistare
@ 2024-11-08 14:32                               ` Peter Xu
  0 siblings, 0 replies; 86+ messages in thread
From: Peter Xu @ 2024-11-08 14:32 UTC (permalink / raw)
  To: Steven Sistare
  Cc: David Hildenbrand, Fabiano Rosas, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster, qemu-devel

On Fri, Nov 08, 2024 at 09:14:02AM -0500, Steven Sistare wrote:
> > Could anyone remind me why we can't simply set PRIVATE|SHARED all over the
> > place?
> > 
> > IMHO RAM_AUX is too hard for any new callers to know how to set.  It's much
> > easier when we already have SHARED, adding PRIVATE could be mostly natural,
> > then we can already avoid AUX due to checking !SHARED & !PRIVATE.
> > 
> > Basically, SHARED|PRIVATE then must come from a user request (QMP or
> > cmdline), otherwise the caller should always set none of them, implying
> > aux.
> > 
> > It still looks the best to me.
> 
> Our emails crossed. We could set PRIVATE|SHARED all over the place.  Nothing
> wrong with that solution. I have no preference, other than finishing.

The current AUX is exactly what I was picturing when I was replying to v2,
so the four paths you listed here:

https://lore.kernel.org/qemu-devel/44b15731-0ee8-4e24-b4f5-0614bca594cb@oracle.com/

        Agreed, RAM_AUX is a clear solution.  I would set it in these
        functions:

                qemu_ram_alloc_resizeable
                memory_region_init_ram_nomigrate
                memory_region_init_rom_nomigrate
                memory_region_init_rom_device_nomigrate


That is what I listed previously:

https://lore.kernel.org/qemu-devel/Zv7C7MeVP2X8bEJU@x1n/

        I think that means below paths [1-4] are only relevant:

        qemu_ram_alloc
                memory_region_init_rom_device_nomigrate [1]
                memory_region_init_ram_flags_nomigrate
                        memory_region_init_ram_nomigrate    [2]
                        memory_region_init_rom_nomigrate    [3]
        qemu_ram_alloc_resizeable                   [4]

Except that if we don't want to risk using VM_SHARED unconditionally on
linux for aux, then we need to have a new flag, aka RAM_AUX or similar.

So I believe the list is at least correct.

I changed my mind because I noticed it will be non-trivial to know whether
one should set RAM_AUX when a new user needs to create some new ramblocks.

In that case, we need to define RAM_AUX properly.  One sane way to define
it is: "if the user didn't specify share or private property, please use
AUX", but then it'll overlap with RAM_SHARED / RAM_PRIVATE already, hence
redundant.
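
For comparison, the SHARED/PRIVATE alternative can be sketched as follows, with "aux" implied by the absence of both flags rather than by a flag of its own. This is only an illustration under assumptions: the SKETCH_ prefixed names, both helper functions, and the bit values are hypothetical, and aux_ram_share stands in for the machine option.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; bit values are placeholders. */
#define SKETCH_RAM_SHARED  (1u << 1)
#define SKETCH_RAM_PRIVATE (1u << 13)

/* A block is "aux" when the caller expressed neither share nor private. */
static bool sketch_is_aux(uint32_t ram_flags)
{
    return !(ram_flags & (SKETCH_RAM_SHARED | SKETCH_RAM_PRIVATE));
}

/*
 * SHARED or PRIVATE come from a user request and are left alone.
 * With neither set, the machine option (modelled by aux_ram_share)
 * decides whether the block becomes shared.
 */
static uint32_t sketch_resolve_flags(uint32_t ram_flags, bool aux_ram_share)
{
    const uint32_t both = SKETCH_RAM_SHARED | SKETCH_RAM_PRIVATE;

    /* Asking for shared and private at once is treated as a caller bug. */
    assert((ram_flags & both) != both);

    if (sketch_is_aux(ram_flags) && aux_ram_share) {
        ram_flags |= SKETCH_RAM_SHARED;
    }
    return ram_flags;
}
```

Here a new caller only has to pass along what the user asked for, if anything; no third flag needs to be understood.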

Yes, let's wait and see David's comment.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-08 14:20                             ` David Hildenbrand
@ 2024-11-08 14:37                               ` Steven Sistare
  2024-11-08 14:54                                 ` David Hildenbrand
  0 siblings, 1 reply; 86+ messages in thread
From: Steven Sistare @ 2024-11-08 14:37 UTC (permalink / raw)
  To: David Hildenbrand, Peter Xu
  Cc: Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel

On 11/8/2024 9:20 AM, David Hildenbrand wrote:
> On 08.11.24 14:56, Steven Sistare wrote:
>> On 11/8/2024 6:31 AM, David Hildenbrand wrote:
>>> On 07.11.24 17:40, Steven Sistare wrote:
>>>> On 11/7/2024 11:26 AM, David Hildenbrand wrote:
>>>>> On 07.11.24 17:02, Steven Sistare wrote:
>>>>>> On 11/7/2024 8:23 AM, David Hildenbrand wrote:
>>>>>>> On 06.11.24 21:12, Steven Sistare wrote:
>>>>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>>>>>> details. See below.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>>>>>
>>>>>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>>>>>
>>>>>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>>>>>
>>>>>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>>>>>
>>>>>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>>>>>
>>>>>>>>>>> FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>>>>>
>>>>>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>>>>>
>>>>>>>>> Yes.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>>>>>
>>>>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>>>>>
>>>>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>>>>>> -- memfd if available and fallback to shm_open.
>>>>>>>>>
>>>>>>>>> Yes.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>>>>>
>>>>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>>>>>
>>>>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>>>>>
>>>>>>>>>>> Thoughts?
>>>>>>>>>>
>>>>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>>>>>> of options and words to describe them.
>>>>>>>>>
>>>>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>>>>>
>>>>>>>> Hi David and Peter,
>>>>>>>>
>>>>>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>>>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>>>>>> for simplicity.
>>>>>>>>
>>>>>>>> Any comments before I submit a complete patch?
>>>>>>>>
>>>>>>>> ----
>>>>>>>> qemu-options.hx:
>>>>>>>>          ``aux-ram-share=on|off``
>>>>>>>>              Allocate auxiliary guest RAM as an anonymous file that is
>>>>>>>>              shareable with an external process.  This option applies to
>>>>>>>>              memory allocated as a side effect of creating various devices.
>>>>>>>>              It does not apply to memory-backend-objects, whether explicitly
>>>>>>>>              specified on the command line, or implicitly created by the -m
>>>>>>>>              command line option.
>>>>>>>>
>>>>>>>>              Some migration modes require aux-ram-share=on.
>>>>>>>>
>>>>>>>> qapi/migration.json:
>>>>>>>>          @cpr-transfer:
>>>>>>>>               ...
>>>>>>>>               Memory-backend objects must have the share=on attribute, but
>>>>>>>>               memory-backend-epc is not supported.  The VM must be started
>>>>>>>>               with the '-machine aux-ram-share=on' option.
>>>>>>>>
>>>>>>>> Define RAM_PRIVATE
>>>>>>>>
>>>>>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>>>>>
>>>>>>>> ram_backend_memory_alloc()
>>>>>>>>          ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>>>>>          memory_region_init_ram_flags_nomigrate(ram_flags)
>>>>>>>>
>>>>>>>> qemu_ram_alloc_internal()
>>>>>>>>          ...
>>>>>>>>          if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>>>>>>              new_block->flags |= RAM_SHARED;
>>>>>>>>
>>>>>>>>          if (!host && (new_block->flags & RAM_SHARED)) {
>>>>>>>>              qemu_ram_alloc_shared(new_block);
>>>>>>>>          } else {
>>>>>>>>              new_block->fd = -1;
>>>>>>>>              new_block->host = host;
>>>>>>>>          }
>>>>>>>>          ram_block_add(new_block);
>>>>>>>>
>>>>>>>> qemu_ram_alloc_shared()
>>>>>>>>          if qemu_memfd_check()
>>>>>>>>              new_block->fd = qemu_memfd_create()
>>>>>>>>          else
>>>>>>>>              new_block->fd = qemu_shm_alloc()
>>>>>>>
>>>>>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.
>>>>>>>
>>>>>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.
>>>>>>>
>>>>>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ...
>>>>>>>
>>>>>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.
>>>>>>>
>>>>>>>
>>>>>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.
>>>>>>>
>>>>>>> So maybe something like
>>>>>>>
>>>>>>> qemu_ram_alloc_shared()
>>>>>>>         fd = -1;
>>>>>>>
>>>>>>>         if (qemu_memfd_available()) {
>>>>>>>             fd = qemu_memfd_create();
>>>>>>>             if (fd < 0)
>>>>>>>                 ... error
>>>>>>>         } else if (qemu_shm_available()) {
>>>>>>>             fd = qemu_shm_alloc();
>>>>>>>             if (fd < 0)
>>>>>>>                 ... error
>>>>>>>         } else {
>>>>>>>             /*
>>>>>>>              * Old behavior: try fd-less shared memory. We might
>>>>>>>              * just end up with non-shared memory on Windows, but
>>>>>>>              * nobody can make use of this shared memory either way
>>>>>>>              * ... should we just use non-shared memory? Or should
>>>>>>>              * we simply bail out? But then, if there is no shared
>>>>>>>              * memory nobody could possibly use it.
>>>>>>>              */
>>>>>>>             qemu_anon_ram_alloc(share=true)
>>>>>>>         }
>>>>>>
>>>>>> Good catch.  We need that fallback for backwards compatibility.  Even with
>>>>>> no use case for memory-backend-ram,share=on since the demise of rdma, users
>>>>>> may specify it on windows, for no particular reason, but it works, and should
>>>>>> continue to work after this series.  CPR would be blocked.
>>>>>
>>>>> Yes, we should keep Windows working in the weird way it is working right now.
>>>>>
>>>>>> More generally for backwards compatibility for share=on for no particular reason,
>>>>>> should we fallback if qemu_shm_alloc fails?  If /dev/shm is mounted with default
>>>>>> options and more than half of ram is requested, it will fail, whereas current qemu
>>>>>> succeeds using MAP_SHARED|MAP_ANON.
>>>>>
>>>>> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path.
>>>>>
>>>>> But we should not fall back to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem.
>>>>
>>>> Agreed on all.
>>>>
>>>> One more opinion from you please, if you will.
>>>>
>>>> RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be
>>>> set in
>>>>      ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal
>>>>
>>>> None of the other backends reach qemu_ram_alloc_internal.
>>>>
>>>> To be future proof, do you prefer I also set RAM_PRIVATE in the other backends,
>>>> everywhere RAM_SHARED may be set, eg:
>>>
>>> Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want it and relied on !RAM_SHARED doing the right thing.
>>>
>>> Alternatively, we make our life easier and do something like
>>>
>>> /*
>>>    * This flag is only used while creating+allocating RAM, and
>>>    * prevents RAM_SHARED getting set for anonymous RAM automatically in
>>>    * some configurations.
>>>    *
>>>    * By default, not setting RAM_SHARED on anonymous RAM implies
>>>    * "private anonymous RAM"; however, in some configuration we want to
>>>    * have most of this RAM automatically be "sharable anonymous RAM",
>>>    * except for some cases that really want "private anonymous RAM".
>>>    *
>>>    * This anonymous RAM *must* be private. This flag only applies to
>>>    * "anonymous" RAM, not fd/file-backed/preallocated one.
>>>    */
>>> RAM_FORCE_ANON_PRIVATE    (1 << 13)
>>>
>>>
>>> BUT maybe an even better alternative now that we have the "aux-ram-share" parameter, could we use
>>>
>>> /*
>>>    * Auxiliary RAM that was created automatically internally, instead of
>>>    * explicitly like using memory-backend-ram or some other device on the
>>>    * QEMU cmdline.
>>>    */
>>> RAM_AUX    (1 << 13)
>>>
>>>
>>> So it will be quite clear that "aux-ram-share" only applies to RAM_AUX RAMBlocks.
>>>
>>> That actually looks quite compelling to me :)
>>
>> Agreed, RAM_AUX is a clear solution.  I would set it in these functions:
>>     qemu_ram_alloc_resizeable
>>     memory_region_init_ram_nomigrate
>>     memory_region_init_rom_nomigrate
>>     memory_region_init_rom_device_nomigrate
>>
>> and test it with aux_ram_share in qemu_ram_alloc_internal.
>>     if RAM_AUX && aux_ram_share
>>       flags |= RAM_SHARED
>>
>> However, we could just set RAM_SHARED at those same call sites:
>>     flags = current_machine->aux_ram_share ? RAM_SHARED : 0;
>> which is what I did in
>>     [PATCH V2 01/11] machine: alloc-anon option
>> and test RAM_SHARED in qemu_ram_alloc_internal.
>> No need for RAM_PRIVATE.
>>
>> RAM_AUX is nice because it declares intent more specifically.
>>
>> Your preference?
> 
> My preference is either using RAM_AUX to flag AUX RAM, or the inverse, RAM_NON_AUX to flag non-aux RAM, such as from memory backends and likely ivshmem.c
> 
> Peter still seems to prefer RAM_PRIVATE. So I guess it's up to you to decide ;)

I like the inverse flag RAM_NON_AUX, better name TBD.
The call sites are well defined.
That is what my V3 hack was testing (modulo ivshmem).
    object_dynamic_cast(new_block->mr->parent_obj.parent, TYPE_MEMORY_BACKEND)
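
As a rough sketch only, the inverse-flag test could look like the code below.
RAM_NON_AUX is a placeholder name, the bit values are made up for the example,
effective_ram_flags() is a hypothetical helper, and a plain bool stands in for
current_machine->aux_ram_share.  None of this is taken from a posted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit values only; they are not QEMU's real RAM_* values. */
#define RAM_SHARED   (1u << 1)
#define RAM_NON_AUX  (1u << 13)  /* placeholder name, "better name TBD" */

static_assert((RAM_SHARED & RAM_NON_AUX) == 0, "flag bits must not overlap");

/*
 * Compute the flags a new block ends up with.  The machine option only
 * promotes blocks that QEMU allocates itself (host == NULL) and that the
 * caller did not mark as non-aux.
 */
static uint32_t effective_ram_flags(uint32_t ram_flags, const void *host,
                                    bool aux_ram_share)
{
    if (!host && !(ram_flags & RAM_NON_AUX) && aux_ram_share) {
        ram_flags |= RAM_SHARED;
    }
    return ram_flags;
}
```

With this shape, callers that set RAM_NON_AUX keep exactly the flags they
passed in, so aux-ram-share can never change them behind their back.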

- Steve




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-08 14:37                               ` Steven Sistare
@ 2024-11-08 14:54                                 ` David Hildenbrand
  2024-11-08 15:07                                   ` Peter Xu
  2024-11-08 15:15                                   ` David Hildenbrand
  0 siblings, 2 replies; 86+ messages in thread
From: David Hildenbrand @ 2024-11-08 14:54 UTC (permalink / raw)
  To: Steven Sistare, Peter Xu
  Cc: Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel, Thomas Huth

On 08.11.24 15:37, Steven Sistare wrote:
> On 11/8/2024 9:20 AM, David Hildenbrand wrote:
>> On 08.11.24 14:56, Steven Sistare wrote:
>>> On 11/8/2024 6:31 AM, David Hildenbrand wrote:
>>>> On 07.11.24 17:40, Steven Sistare wrote:
>>>>> On 11/7/2024 11:26 AM, David Hildenbrand wrote:
>>>>>> On 07.11.24 17:02, Steven Sistare wrote:
>>>>>>> On 11/7/2024 8:23 AM, David Hildenbrand wrote:
>>>>>>>> On 06.11.24 21:12, Steven Sistare wrote:
>>>>>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote:
>>>>>>>>>> On 04.11.24 21:56, Steven Sistare wrote:
>>>>>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote:
>>>>>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote:
>>>>>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote:
>>>>>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote:
>>>>>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote:
>>>>>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>>>>>>>>> on the value of the anon-alloc machine property.  This option applies to
>>>>>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does
>>>>>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on
>>>>>>>>>>>>>>>> the command line, or implicitly created by the -m command line option.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the
>>>>>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending
>>>>>>>>>>>>>>>> the memfd file descriptor to the process.  Memory contents are preserved,
>>>>>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are
>>>>>>>>>>>>>>>> locked in memory for DMA remain locked.  This behavior is a pre-requisite
>>>>>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm,
>>>>>>>>>>>>>>> similar to backends/hostmem-shm.c.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the
>>>>>>>>>>>>>>> details. See below.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For this series I would prefer to use memfd and hide the details.  It's a
>>>>>>>>>>>>>> concise (and well tested) solution albeit linux only.  The code you supply
>>>>>>>>>>>>>> for posix shm would be a good follow on patch to support other unices.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Unless there is reason to use memfd we should start with the more
>>>>>>>>>>>>> generic POSIX variant that is available even on systems without memfd.
>>>>>>>>>>>>> Factoring stuff out as I drafted does look quite compelling.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I can help with the rework, and send it out separately, so you can focus
>>>>>>>>>>>>> on the "machine toggle" as part of this series.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Of course, if we find out we need the memfd internally instead under
>>>>>>>>>>>>> Linux for whatever reason later, we can use that instead.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But IIUC, the main selling point for memfd are additional features
>>>>>>>>>>>>> (hugetlb, memory sealing) that you aren't even using.
>>>>>>>>>>>>
>>>>>>>>>>>> FWIW, I'm looking into some details, and one difference is that shm_open() under Linux (glibc) seems to go to /dev/shm and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount).
>>>>>>>>>>>
>>>>>>>>>>> Sizing is a non-trivial difference.  One can by default allocate all memory using memfd_create.
>>>>>>>>>>> To do so using shm_open requires configuration on the mount.  One step harder to use.
>>>>>>>>>>
>>>>>>>>>> Yes.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM
>>>>>>>>>>> if memory-backend-ram has hogged all the memory.
>>>>>>>>>>>
>>>>>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open().
>>>>>>>>>>>
>>>>>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM
>>>>>>>>>>> -- memfd if available and fallback to shm_open.
>>>>>>>>>>
>>>>>>>>>> Yes.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like
>>>>>>>>>>>>
>>>>>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on"
>>>>>>>>>>>>
>>>>>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process.
>>>>>>>>>>>>
>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>
>>>>>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch.
>>>>>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc
>>>>>>>>>>> controls everything else.  Now we're just riffing on the details: memfd vs shm_open, spelling
>>>>>>>>>>> of options and words to describe them.
>>>>>>>>>>
>>>>>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
>>>>>>>>>
>>>>>>>>> Hi David and Peter,
>>>>>>>>>
>>>>>>>>> I have implemented and tested the following, for both qemu_memfd_create
>>>>>>>>> and qemu_shm_alloc.  This is pseudo-code, with error conditions omitted
>>>>>>>>> for simplicity.
>>>>>>>>>
>>>>>>>>> Any comments before I submit a complete patch?
>>>>>>>>>
>>>>>>>>> ----
>>>>>>>>> qemu-options.hx:
>>>>>>>>>           ``aux-ram-share=on|off``
>>>>>>>>>               Allocate auxiliary guest RAM as an anonymous file that is
>>>>>>>>>               shareable with an external process.  This option applies to
>>>>>>>>>               memory allocated as a side effect of creating various devices.
>>>>>>>>>               It does not apply to memory-backend-objects, whether explicitly
>>>>>>>>>               specified on the command line, or implicitly created by the -m
>>>>>>>>>               command line option.
>>>>>>>>>
>>>>>>>>>               Some migration modes require aux-ram-share=on.
>>>>>>>>>
>>>>>>>>> qapi/migration.json:
>>>>>>>>>           @cpr-transfer:
>>>>>>>>>                ...
>>>>>>>>>                Memory-backend objects must have the share=on attribute, but
>>>>>>>>>                memory-backend-epc is not supported.  The VM must be started
>>>>>>>>>                with the '-machine aux-ram-share=on' option.
>>>>>>>>>
>>>>>>>>> Define RAM_PRIVATE
>>>>>>>>>
>>>>>>>>> Define qemu_shm_alloc(), from David's tmp patch
>>>>>>>>>
>>>>>>>>> ram_backend_memory_alloc()
>>>>>>>>>           ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE;
>>>>>>>>>           memory_region_init_ram_flags_nomigrate(ram_flags)
>>>>>>>>>
>>>>>>>>> qemu_ram_alloc_internal()
>>>>>>>>>           ...
>>>>>>>>>           if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share)
>>>>>>>>>               new_block->flags |= RAM_SHARED;
>>>>>>>>>
>>>>>>>>>           if (!host && (new_block->flags & RAM_SHARED)) {
>>>>>>>>>               qemu_ram_alloc_shared(new_block);
>>>>>>>>>           } else {
>>>>>>>>>               new_block->fd = -1;
>>>>>>>>>               new_block->host = host;
>>>>>>>>>           }
>>>>>>>>>           ram_block_add(new_block);
>>>>>>>>>
>>>>>>>>> qemu_ram_alloc_shared()
>>>>>>>>>           if qemu_memfd_check()
>>>>>>>>>               new_block->fd = qemu_memfd_create()
>>>>>>>>>           else
>>>>>>>>>               new_block->fd = qemu_shm_alloc()
>>>>>>>>
>>>>>>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds.
>>>>>>>>
>>>>>>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out.
>>>>>>>>
>>>>>>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ...
>>>>>>>>
>>>>>>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better.
>>>>>>>>
>>>>>>>>
>>>>>>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows.
>>>>>>>>
>>>>>>>> So maybe something like
>>>>>>>>
>>>>>>>> qemu_ram_alloc_shared()
>>>>>>>>          fd = -1;
>>>>>>>>
>>>>>>>>          if (qemu_memfd_available()) {
>>>>>>>>              fd = qemu_memfd_create();
>>>>>>>>              if (fd < 0)
>>>>>>>>                  ... error
>>>>>>>>          } else if (qemu_shm_available()) {
>>>>>>>>              fd = qemu_shm_alloc();
>>>>>>>>              if (fd < 0)
>>>>>>>>                  ... error
>>>>>>>>          } else {
>>>>>>>>              /*
>>>>>>>>               * Old behavior: try fd-less shared memory. We might
>>>>>>>>               * just end up with non-shared memory on Windows, but
>>>>>>>>               * nobody can make use of this shared memory either way
>>>>>>>>               * ... should we just use non-shared memory? Or should
>>>>>>>>               * we simply bail out? But then, if there is no shared
>>>>>>>>               * memory nobody could possibly use it.
>>>>>>>>               */
>>>>>>>>              qemu_anon_ram_alloc(share=true)
>>>>>>>>          }
>>>>>>>
>>>>>>> Good catch.  We need that fallback for backwards compatibility.  Even with
>>>>>>> no use case for memory-backend-ram,share=on since the demise of rdma, users
>>>>>>> may specify it on windows, for no particular reason, but it works, and should
>>>>>>> continue to work after this series.  CPR would be blocked.
>>>>>>
>>>>>> Yes, we should keep Windows working in the weird way it is working right now.
>>>>>>
>>>>>>> More generally for backwards compatibility for share=on for no particular reason,
>>>>>>> should we fallback if qemu_shm_alloc fails?  If /dev/shm is mounted with default
>>>>>>> options and more than half of ram is requested, it will fail, whereas current qemu
>>>>>>> succeeds using MAP_SHARED|MAP_ANON.
>>>>>>
>>>>>> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path.
>>>>>>
>>>>>> But we should not fall back to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem.
>>>>>
>>>>> Agreed on all.
>>>>>
>>>>> One more opinion from you please, if you will.
>>>>>
>>>>> RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be
>>>>> set in
>>>>>       ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal
>>>>>
>>>>> None of the other backends reach qemu_ram_alloc_internal.
>>>>>
>>>>> To be future proof, do you prefer I also set RAM_PRIVATE in the other backends,
>>>>> everywhere RAM_SHARED may be set, eg:
>>>>
>>>> Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want it and relied on !RAM_SHARED doing the right thing.
>>>>
>>>> Alternatively, we make our life easier and do something like
>>>>
>>>> /*
>>>>     * This flag is only used while creating+allocating RAM, and
>>>>     * prevents RAM_SHARED getting set for anonymous RAM automatically in
>>>>     * some configurations.
>>>>     *
>>>>     * By default, not setting RAM_SHARED on anonymous RAM implies
>>>>     * "private anonymous RAM"; however, in some configuration we want to
>>>>     * have most of this RAM automatically be "sharable anonymous RAM",
>>>>     * except for some cases that really want "private anonymous RAM".
>>>>     *
>>>>     * This anonymous RAM *must* be private. This flag only applies to
>>>>     * "anonymous" RAM, not fd/file-backed/preallocated one.
>>>>     */
>>>> RAM_FORCE_ANON_PRIVATE    (1 << 13)
>>>>
>>>>
>>>> BUT maybe an even better alternative now that we have the "aux-ram-share" parameter, could we use
>>>>
>>>> /*
>>>>     * Auxiliary RAM that was created automatically internally, instead of
>>>>     * explicitly like using memory-backend-ram or some other device on the
>>>>     * QEMU cmdline.
>>>>     */
>>>> RAM_AUX    (1 << 13)
>>>>
>>>>
>>>> So it will be quite clear that "aux-ram-share" only applies to RAM_AUX RAMBlocks.
>>>>
>>>> That actually looks quite compelling to me :)
>>>
>>> Agreed, RAM_AUX is a clear solution.  I would set it in these functions:
>>>      qemu_ram_alloc_resizeable
>>>      memory_region_init_ram_nomigrate
>>>      memory_region_init_rom_nomigrate
>>>      memory_region_init_rom_device_nomigrate
>>>
>>> and test it with aux_ram_share in qemu_ram_alloc_internal.
>>>      if RAM_AUX && aux_ram_share
>>>        flags |= RAM_SHARED
>>>
>>> However, we could just set RAM_SHARED at those same call sites:
>>>      flags = current_machine->aux_ram_share ? RAM_SHARED : 0;
>>> which is what I did in
>>>      [PATCH V2 01/11] machine: alloc-anon option
>>> and test RAM_SHARED in qemu_ram_alloc_internal.
>>> No need for RAM_PRIVATE.
>>>
>>> RAM_AUX is nice because it declares intent more specifically.
>>>
>>> Your preference?
>>
>> My preference is either using RAM_AUX to flag AUX RAM, or the inverse, RAM_NON_AUX to flag non-aux RAM, such as from memory backends and likely ivshmem.c
>>
>> Peter still seems to prefer RAM_PRIVATE. So I guess it's up to you to decide ;)
> 
> I like the inverse flag RAM_NON_AUX, better name TBD.
> The call sites are well defined.
> That is what my V3 hack was testing (modulo ivshmem).
>      object_dynamic_cast(new_block->mr->parent_obj.parent, TYPE_MEMORY_BACKEND)

Likely AUX is everything that is "neither explicitly specified by the user nor
very special RAM"

So I think hw/misc/ivshmem.c would also not count as "aux", and similarly
hw/remote/memory.c; both use memory_region_init_ram_from_fd(share=on).

memory_region_init_ram_ptr/memory_region_init_ram_device_ptr are similarly
special: we cannot possibly turn them SHARED. But that's also what your code
already handled.

So maybe, really everything is AUX ram, except
* Using memory_region_init_ram_from_fd()/
   memory_region_init_ram_from_file() users.
* Using memory_region_init_ram_ptr / memory_region_init_ram_device_ptr
* Created via memory backends
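
To make that classification concrete, here is a minimal sketch.  The RamOrigin
enum and ram_origin_is_aux() are hypothetical names invented for this example;
only the memory_region_init_*() functions named in the comments exist in QEMU.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for how a RAM region was created. */
typedef enum {
    RAM_ORIGIN_INTERNAL,        /* allocated by QEMU as a device side effect */
    RAM_ORIGIN_FROM_FD,         /* memory_region_init_ram_from_fd() */
    RAM_ORIGIN_FROM_FILE,       /* memory_region_init_ram_from_file() */
    RAM_ORIGIN_PTR,             /* memory_region_init_ram_ptr() */
    RAM_ORIGIN_DEVICE_PTR,      /* memory_region_init_ram_device_ptr() */
    RAM_ORIGIN_MEMORY_BACKEND,  /* created via a memory backend */
} RamOrigin;

/* Everything counts as aux RAM except the three excluded groups. */
static bool ram_origin_is_aux(RamOrigin origin)
{
    switch (origin) {
    case RAM_ORIGIN_FROM_FD:
    case RAM_ORIGIN_FROM_FILE:
    case RAM_ORIGIN_PTR:
    case RAM_ORIGIN_DEVICE_PTR:
    case RAM_ORIGIN_MEMORY_BACKEND:
        return false;
    case RAM_ORIGIN_INTERNAL:
        return true;
    }
    assert(!"unhandled RamOrigin");
    return false;
}
```

Under this reading, ivshmem and hw/remote/memory.c fall under
RAM_ORIGIN_FROM_FD and are therefore not aux.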


Note that hw/m68k/next-cube.c is one odd RAM_SHARED user. I don't know why
it uses RAM_SHARED to get anonymous shared RAM. Likely a mistake when that
code was introduced.

CCing Thomas.

commit 956a78118bfc7fa512b03cbe8a77b9384c6d89f4
Author: Thomas Huth <huth@tuxfamily.org>
Date:   Sat Jun 30 08:45:25 2018 +0200

     m68k: Add NeXTcube machine
     
     It is still quite incomplete (no SCSI, no floppy emulation, no network,
     etc.), but the firmware already shows up the debug monitor prompt in the
     framebuffer display, so at least the very basics are already working.
     
     This code has been taken from Bryce Lanham's GSoC 2011 NeXT branch at
     
      https://github.com/blanham/qemu-NeXT/blob/next-cube/hw/next-cube.c
     
     and altered quite a bit to fit the latest interface and coding conventions
     of the current QEMU.


-- 
Cheers,

David / dhildenb




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-08 14:18                             ` David Hildenbrand
@ 2024-11-08 15:01                               ` Peter Xu
  0 siblings, 0 replies; 86+ messages in thread
From: Peter Xu @ 2024-11-08 15:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Steven Sistare, Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel

On Fri, Nov 08, 2024 at 03:18:07PM +0100, David Hildenbrand wrote:
> > 
> > Could anyone remind me why we can't simply set PRIVATE|SHARED all over the
> > place?
> > 
> > IMHO RAM_AUX is too hard for any new callers to know how to set. It's much
> > easier when we already have SHARED, adding PRIVATE could be mostly natural,
> > then we can already avoid AUX due to checking !SHARED & !PRIVATE.
> 
> How is it clearer if you have to know whether you have to set RAM_PRIVATE or
> not for some RAM? Because you *wouldn't* set it "all over the place".

I think I answered that in previous reply, but exactly after the line where
it got cut-off.. :)

https://lore.kernel.org/qemu-devel/Zy4VkScMEpYayGtM@x1n/

        Basically, SHARED|PRIVATE then must come from an user request (QMP
        or cmdline), otherwise the caller should always set none of them,
        implying aux.

But I confess that's not accurate.. some callers decide that mem must be
SHARED based on the type of object, etc.  A better version could be:

        RAM_SHARED|RAM_PRIVATE describes the share-able attribute of the
        RAM block.

        They can never be set together.  When one is set, the memory
        attribute must satisfy the request.  When none is set, QEMU will
        decide how to request the memory.

        The flag should only be set if the caller has explicit requirement
        on such memory property.  For example, it can come from either a
        request from user (share=on/off), or the memory must be shared /
        private due to its own attribute (shm objects, like ivshmem, shm
        backend, or remotely shared mem, etc.).

        Otherwise, callers should leave both flags unset.

Maybe we should document this directly in the flag definitions, so that it
is clearer than before which flags to set, and to make explicit that callers
should not set SHARED | PRIVATE at will (so as to make cpr available as much
as possible).
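
A minimal sketch of those semantics, for illustration only: the bit values are
made up, ram_share_flags_valid() and ram_block_is_shared() are hypothetical
helpers, and "QEMU will decide" is modelled as a single bool (for example the
aux-ram-share machine option), which is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only; RAM_PRIVATE is the flag proposed above. */
#define RAM_SHARED   (1u << 1)
#define RAM_PRIVATE  (1u << 13)

/* The two flags are mutually exclusive. */
static bool ram_share_flags_valid(uint32_t flags)
{
    return (flags & (RAM_SHARED | RAM_PRIVATE)) != (RAM_SHARED | RAM_PRIVATE);
}

/*
 * Decide whether a block ends up shared.  An explicit flag always wins;
 * with neither flag set, QEMU decides, which this sketch models as one
 * bool policy passed in by the caller.
 */
static bool ram_block_is_shared(uint32_t flags, bool qemu_default_shared)
{
    assert(ram_share_flags_valid(flags));
    if (flags & RAM_SHARED) {
        return true;
    }
    if (flags & RAM_PRIVATE) {
        return false;
    }
    return qemu_default_shared;
}
```

The "neither flag set" branch is the only place where a machine-wide default
can take effect, which is what makes a separate AUX flag unnecessary in this
scheme.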

Thanks,

-- 
Peter Xu




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-08 14:54                                 ` David Hildenbrand
@ 2024-11-08 15:07                                   ` Peter Xu
  2024-11-08 15:09                                     ` David Hildenbrand
  2024-11-08 15:15                                   ` David Hildenbrand
  1 sibling, 1 reply; 86+ messages in thread
From: Peter Xu @ 2024-11-08 15:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Steven Sistare, Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel, Thomas Huth

On Fri, Nov 08, 2024 at 03:54:13PM +0100, David Hildenbrand wrote:
> Likely AUX is everything that is "neither explicitly specified by the user nor
> very special RAM"
> 
> So I think hw/misc/ivshmem.c would also not count as "aux", and similarly
> hw/remote/memory.c; both use memory_region_init_ram_from_fd(share=on).
> 
> memory_region_init_ram_ptr/memory_region_init_ram_device_ptr are similarly
> special: we cannot possibly turn them SHARED. But that's also what your code
> already handled.
> 
> So maybe, really everything is AUX ram, except
> * Using memory_region_init_ram_from_fd()/
>   memory_region_init_ram_from_file() users.
> * Using memory_region_init_ram_ptr / memory_region_init_ram_device_ptr
> * Created via memory backends
> 
> 
> Note that hw/m68k/next-cube.c is one odd RAM_SHARED user. I don't know why
> it uses RAM_SHARED to get anonymous shared RAM. Likely a mistake when that
> code was introduced.
> 
> CCing Thomas.
> 
> commit 956a78118bfc7fa512b03cbe8a77b9384c6d89f4
> Author: Thomas Huth <huth@tuxfamily.org>
> Date:   Sat Jun 30 08:45:25 2018 +0200
> 
>     m68k: Add NeXTcube machine
>     It is still quite incomplete (no SCSI, no floppy emulation, no network,
>     etc.), but the firmware already shows up the debug monitor prompt in the
>     framebuffer display, so at least the very basics are already working.
>     This code has been taken from Bryce Lanham's GSoC 2011 NeXT branch at
>      https://github.com/blanham/qemu-NeXT/blob/next-cube/hw/next-cube.c
>     and altered quite a bit to fit the latest interface and coding conventions
>     of the current QEMU.

This might also imply that it is already not crystal clear when to use our
current RAM_SHARED, not to mention RAM_AUX if it were to be introduced..
Please see my other email, which tries to define RAM_SHARED properly.

IIUC once we properly define RAM_SHARED, we don't need AUX, and
everything will be crystal clear.

Thanks,

-- 
Peter Xu




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-08 15:07                                   ` Peter Xu
@ 2024-11-08 15:09                                     ` David Hildenbrand
  0 siblings, 0 replies; 86+ messages in thread
From: David Hildenbrand @ 2024-11-08 15:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: Steven Sistare, Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel, Thomas Huth

On 08.11.24 16:07, Peter Xu wrote:
> On Fri, Nov 08, 2024 at 03:54:13PM +0100, David Hildenbrand wrote:
>> Likely AUX is everything that is "neither explicitly specified by the user nor
>> very special RAM"
>>
>> So I think hw/misc/ivshmem.c would also not count as "aux", and similarly
>> hw/remote/memory.c; both use memory_region_init_ram_from_fd(share=on).
>>
>> memory_region_init_ram_ptr/memory_region_init_ram_device_ptr are similarly
>> special: we cannot possibly turn them SHARED. But that's also what your code
>> already handled.
>>
>> So maybe, really everything is AUX ram, except
>> * Using memory_region_init_ram_from_fd()/
>>    memory_region_init_ram_from_file() users.
>> * Using memory_region_init_ram_ptr / memory_region_init_ram_device_ptr
>> * Created via memory backends
>>
>>
>> Note that hw/m68k/next-cube.c is one odd RAM_SHARED user. I don't know why
>> it uses RAM_SHARED to get anonymous shared RAM. Likely a mistake when that
>> code was introduced.
>>
>> CCing Thomas.
>>
>> commit 956a78118bfc7fa512b03cbe8a77b9384c6d89f4
>> Author: Thomas Huth <huth@tuxfamily.org>
>> Date:   Sat Jun 30 08:45:25 2018 +0200
>>
>>      m68k: Add NeXTcube machine
>>      It is still quite incomplete (no SCSI, no floppy emulation, no network,
>>      etc.), but the firmware already shows up the debug monitor prompt in the
>>      framebuffer display, so at least the very basics are already working.
>>      This code has been taken from Bryce Lanham's GSoC 2011 NeXT branch at
>>       https://github.com/blanham/qemu-NeXT/blob/next-cube/hw/next-cube.c
>>      and altered quite a bit to fit the latest interface and coding conventions
>>      of the current QEMU.
> 
> This might also imply that our current RAM_SHARED is already not crystal
> clear on when to use, not to mention RAM_AUX if to be introduced..

Likely not. When the code was introduced we used magic boolean 
parameters and likely "true" was set by accident.

There are not that many RAM_SHARED users at all ...

Anyhow ....

  Please
> see my other email, trying to define RAM_SHARED properly.
> 
> IIUC after we can properly define RAM_SHARED, then we don't need AUX, and
> everything will be crystal clear.

I think I still prefer RAM_NO_AUX, but I'll leave it to you and Steven
to figure out, it's been way too many emails at this point :)

-- 
Cheers,

David / dhildenb




* Re: [PATCH V3 01/16] machine: anon-alloc option
  2024-11-08 14:54                                 ` David Hildenbrand
  2024-11-08 15:07                                   ` Peter Xu
@ 2024-11-08 15:15                                   ` David Hildenbrand
  1 sibling, 0 replies; 86+ messages in thread
From: David Hildenbrand @ 2024-11-08 15:15 UTC (permalink / raw)
  To: Steven Sistare, Peter Xu
  Cc: Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, qemu-devel, Thomas Huth

  
> CCing Thomas.
> 
> commit 956a78118bfc7fa512b03cbe8a77b9384c6d89f4
> Author: Thomas Huth <huth@tuxfamily.org>
> Date:   Sat Jun 30 08:45:25 2018 +0200
> 
>       m68k: Add NeXTcube machine
>       
>       It is still quite incomplete (no SCSI, no floppy emulation, no network,
>       etc.), but the firmware already shows up the debug monitor prompt in the
>       framebuffer display, so at least the very basics are already working.
>       
>       This code has been taken from Bryce Lanham's GSoC 2011 NeXT branch at
>       
>        https://github.com/blanham/qemu-NeXT/blob/next-cube/hw/next-cube.c
>       
>       and altered quite a bit to fit the latest interface and coding conventions
>       of the current QEMU.

Staring at that link, the code was

     /* MMIO */
     cpu_register_physical_memory((uint32_t)0x2000000,0xD0000,
         cpu_register_io_memory(mmio_read, mmio_write, (void *)env,DEVICE_NATIVE_ENDIAN));
     
     /* BMAP */ //acts as a catch-all for now
     cpu_register_physical_memory((uint32_t)0x2100000,0x3A7FF,
         cpu_register_io_memory(scr_read, scr_write, (void *)env,DEVICE_NATIVE_ENDIAN));

Which we converted to

     /* MMIO */
     memory_region_init_io(mmiomem, NULL, &mmio_ops, machine, "next.mmio",
                           0xD0000);
     memory_region_add_subregion(sysmem, 0x02000000, mmiomem);

     /* BMAP memory */
     memory_region_init_ram_shared_nomigrate(bmapm1, NULL, "next.bmapmem", 64,
                                             true, &error_fatal);
     memory_region_add_subregion(sysmem, 0x020c0000, bmapm1);


So likely the "true" was added by mistake.


-- 
Cheers,

David / dhildenb




* Re: [PATCH V3 02/16] migration: cpr-state
  2024-11-01 13:47 ` [PATCH V3 02/16] migration: cpr-state Steve Sistare
@ 2024-11-13 20:36   ` Peter Xu
  0 siblings, 0 replies; 86+ messages in thread
From: Peter Xu @ 2024-11-13 20:36 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Fri, Nov 01, 2024 at 06:47:41AM -0700, Steve Sistare wrote:
> CPR must save state that is needed after QEMU is restarted, when devices
> are realized.  Thus the extra state cannot be saved in the migration stream,
> as objects must already exist before that stream can be loaded.  Instead,
> define auxiliary state structures and vmstate descriptions, not associated
> with any registered object, and serialize the aux state to a cpr-specific
> stream in cpr_state_save.  Deserialize in cpr_state_load after QEMU
> restarts, before devices are realized.
> 
> Provide accessors for clients to register file descriptors for saving.
> The mechanism for passing the fd's to the new process will be specific
> to each migration mode, and added in subsequent patches.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Reviewed-by: Fabiano Rosas <farosas@suse.de>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu




* Re: [PATCH V3 05/16] migration: SCM_RIGHTS for QEMUFile
  2024-11-01 13:47 ` [PATCH V3 05/16] migration: SCM_RIGHTS for QEMUFile Steve Sistare
@ 2024-11-13 20:54   ` Peter Xu
  2024-11-14 18:34     ` Steven Sistare
  0 siblings, 1 reply; 86+ messages in thread
From: Peter Xu @ 2024-11-13 20:54 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Fri, Nov 01, 2024 at 06:47:44AM -0700, Steve Sistare wrote:
> Define functions to put/get file descriptors to/from a QEMUFile, for qio
> channels that support SCM_RIGHTS.  Maintain ordering such that
>   put(A), put(fd), put(B)
> followed by
>   get(A), get(fd), get(B)
> always succeeds.  Other get orderings may succeed but are not guaranteed.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

>  struct QEMUFile {
>      QIOChannel *ioc;
>      bool is_writable;
> @@ -51,6 +56,9 @@ struct QEMUFile {
>  
>      int last_error;
>      Error *last_error_obj;
> +
> +    bool fd_pass;

One nitpick: I'd rename this to allow_fd_pass, or any name that clearly
shows it's a capability.

> +    QTAILQ_HEAD(, FdEntry) fds;
>  };
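
For readers less familiar with SCM_RIGHTS, here is a minimal standalone
sketch of passing a descriptor over a Unix socket next to ordinary data.
It is not the QEMUFile code from this patch: send_fd(), recv_fd() and
demo_roundtrip() are hypothetical names.  One data byte travels with each
descriptor, which is one simple way to tie the descriptor to a position in
the byte stream.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send one data byte with fd attached as SCM_RIGHTS ancillary data. */
int send_fd(int sock, int fd)
{
    char byte = 'F';
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(sock >= 0 && fd >= 0);
    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive the data byte; return the attached descriptor, or -1 if none. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET &&
        cmsg->cmsg_type == SCM_RIGHTS &&
        cmsg->cmsg_len == CMSG_LEN(sizeof(int))) {
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    }
    return fd;
}

/* Pass the write end of a pipe across a socketpair and use the copy. */
int demo_roundtrip(void)
{
    int sv[2], p[2], got;
    char c = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0 || pipe(p) < 0) {
        return -1;
    }
    if (send_fd(sv[0], p[1]) < 0 || (got = recv_fd(sv[1])) < 0) {
        return -1;
    }
    /* Writing through the received descriptor reaches the same pipe. */
    if (write(got, "x", 1) != 1 || read(p[0], &c, 1) != 1) {
        return -1;
    }
    close(got);
    close(p[0]);
    close(p[1]);
    close(sv[0]);
    close(sv[1]);
    return c == 'x' ? 0 : -1;
}
```

The received descriptor is a new number in the receiver that refers to the
same open file, which is why this only works on channel types such as unix
sockets.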

-- 
Peter Xu




* Re: [PATCH V3 06/16] migration: VMSTATE_FD
  2024-11-01 13:47 ` [PATCH V3 06/16] migration: VMSTATE_FD Steve Sistare
@ 2024-11-13 20:55   ` Peter Xu
  0 siblings, 0 replies; 86+ messages in thread
From: Peter Xu @ 2024-11-13 20:55 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Fri, Nov 01, 2024 at 06:47:45AM -0700, Steve Sistare wrote:
> Define VMSTATE_FD for declaring a file descriptor field in a
> VMStateDescription.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu




* Re: [PATCH V3 10/16] migration: split qmp_migrate
  2024-11-01 13:47 ` [PATCH V3 10/16] migration: split qmp_migrate Steve Sistare
@ 2024-11-13 21:11   ` Peter Xu
  2024-11-14 18:33     ` Steven Sistare
  0 siblings, 1 reply; 86+ messages in thread
From: Peter Xu @ 2024-11-13 21:11 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Fri, Nov 01, 2024 at 06:47:49AM -0700, Steve Sistare wrote:
> Split qmp_migrate into start and finish functions.  Finish will be
> called asynchronously in a subsequent patch, but for now, call it
> immediately.  No functional change.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  migration/migration.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 6dc7c09..86b3f39 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1521,6 +1521,7 @@ static void migrate_fd_error(MigrationState *s, const Error *error)
>  static void migrate_fd_cancel(MigrationState *s)
>  {
>      int old_state ;
> +    bool setup = (s->state == MIGRATION_STATUS_SETUP);
>  
>      trace_migrate_fd_cancel();
>  
> @@ -1565,6 +1566,15 @@ static void migrate_fd_cancel(MigrationState *s)
>              s->block_inactive = false;
>          }
>      }
> +
> +    /*
> +     * If qmp_migrate_finish has not been called, then there is no path that
> +     * will complete the cancellation.  Do it now.
> +     */
> +    if (setup && !s->to_dst_file) {
> +        migrate_set_state(&s->state, s->state, MIGRATION_STATUS_CANCELLED);
> +        vm_resume(s->vm_old_state);
> +    }

Hmm.. this doesn't look like the right place for this change, as this
patch should logically bring no functional change if it only splits an
existing function into a new helper.

Meanwhile, this change doesn't explain where the vm_resume() comes from..
I'm really not sure whether this is correct at all: consider someone who
does QMP "stop", then migrate, then quickly cancels it.  I suppose it may
accidentally resume the VM, which it shouldn't.

Not to mention checking "setup" early and then unconditionally modifying the
state here no matter what it is (can it be something like FAILED by now, then
overwritten by CANCELLED?).  But I'd confess that's not a problem of this
patch; rather, the migration state machine is currently still racy.. afaiu.

>  }
>  
>  void migration_add_notifier_mode(NotifierWithReturn *notify,
> @@ -2072,6 +2082,9 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
>      return true;
>  }
>  
> +static void qmp_migrate_finish(MigrationAddress *addr, bool resume_requested,
> +                               Error **errp);
> +
>  void qmp_migrate(const char *uri, bool has_channels,
>                   MigrationChannelList *channels, bool has_detach, bool detach,
>                   bool has_resume, bool resume, Error **errp)
> @@ -2118,6 +2131,20 @@ void qmp_migrate(const char *uri, bool has_channels,
>          return;
>      }
>  
> +    qmp_migrate_finish(addr, resume_requested, errp);
> +
> +    if (local_err) {
> +        migrate_fd_error(s, local_err);
> +        error_propagate(errp, local_err);
> +    }

I don't see how local_err could ever be set by this point.. maybe you meant
*errp, but then maybe we should drop local_err and use ERRP_GUARD().

> +}
> +
> +static void qmp_migrate_finish(MigrationAddress *addr, bool resume_requested,
> +                               Error **errp)
> +{
> +    MigrationState *s = migrate_get_current();
> +    Error *local_err = NULL;
> +
>      if (!resume_requested) {
>          if (!yank_register_instance(MIGRATION_YANK_INSTANCE, errp)) {
>              return;
> -- 
> 1.8.3.1
> 

-- 
Peter Xu




* Re: [PATCH V3 11/16] migration: cpr-transfer mode
  2024-11-01 13:47 ` [PATCH V3 11/16] migration: cpr-transfer mode Steve Sistare
@ 2024-11-13 21:58   ` Peter Xu
  2024-11-14 18:36     ` Steven Sistare
  0 siblings, 1 reply; 86+ messages in thread
From: Peter Xu @ 2024-11-13 21:58 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Fri, Nov 01, 2024 at 06:47:50AM -0700, Steve Sistare wrote:
> Add the cpr-transfer migration mode.  Usage:
>   qemu-system-$arch -machine anon-alloc=memfd ...
> 
>   start new QEMU with "-incoming <uri-1> -cpr-uri <uri-2>"
> 
>   Issue commands to old QEMU:
>   migrate_set_parameter mode cpr-transfer
>   migrate_set_parameter cpr-uri <uri-2>
>   migrate -d <uri-1>

The QMP "migrate" command already accepts a list of MigrationChannel
entries, so cpr can be the 2nd supported channel besides "main".

I apologize for only noticing this now.. I wish the incoming side could
already do the same (it also takes 'MigrationChannel') if monitor init
could be moved earlier, and if precreate had worked out.  If not, we should
still consider doing that on source, because cpr-uri isn't usable on dest
anyway.. so they need to be treated separately even now.

Then, once we make the monitor code run earlier in the future, we could
introduce that on the incoming side too, obsoleting -cpr-uri there.

-- 
Peter Xu




* Re: [PATCH V3 16/16] migration: cpr-transfer documentation
  2024-11-01 13:47 ` [PATCH V3 16/16] migration: cpr-transfer documentation Steve Sistare
@ 2024-11-13 22:02   ` Peter Xu
  2024-11-14 18:31     ` Steven Sistare
  0 siblings, 1 reply; 86+ messages in thread
From: Peter Xu @ 2024-11-13 22:02 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Fri, Nov 01, 2024 at 06:47:55AM -0700, Steve Sistare wrote:
> +Caveats
> +^^^^^^^
> +
> +cpr-transfer mode may not be used with postcopy, background-snapshot,
> +or COLO.
> +
> +memory-backend-epc and memory-backend-ram are not supported.

Just to double check: now the plan is to allow memory-backend-ram,share=on
to work too, right?  If so, here needs a touchup.

-- 
Peter Xu




* Re: [PATCH V3 12/16] tests/migration-test: memory_backend
  2024-11-01 13:47 ` [PATCH V3 12/16] tests/migration-test: memory_backend Steve Sistare
@ 2024-11-13 22:19   ` Fabiano Rosas
  0 siblings, 0 replies; 86+ messages in thread
From: Fabiano Rosas @ 2024-11-13 22:19 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> Allow each migration test to define its own memory backend, replacing
> the standard "-m <size>" specification.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>

> ---
>  tests/qtest/migration-test.c | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index 95e45b5..a008316 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -609,6 +609,11 @@ typedef struct {
>      const char *opts_target;
>      /* suspend the src before migrating to dest. */
>      bool suspend_me;
> +    /*
> +     * Format string for the main memory backend, containing one %s where the
> +     * size is plugged in.  If omitted, "-m %s" is used.
> +     */
> +    const char *memory_backend;
>  } MigrateStart;
>  
>  /*
> @@ -727,6 +732,7 @@ static int test_migrate_start(QTestState **from, QTestState **to,
>      const char *memory_size;
>      const char *machine_alias, *machine_opts = "";
>      g_autofree char *machine = NULL;
> +    g_autofree char *memory_backend = NULL;
>  
>      if (args->use_shmem) {
>          if (!g_file_test("/dev/shm", G_FILE_TEST_IS_DIR)) {
> @@ -802,6 +808,12 @@ static int test_migrate_start(QTestState **from, QTestState **to,
>              memory_size, shmem_path);
>      }
>  
> +    if (args->memory_backend) {
> +        memory_backend = g_strdup_printf(args->memory_backend, memory_size);
> +    } else {
> +        memory_backend = g_strdup_printf("-m %s ", memory_size);

Unnecessary space at the end of the string.

> +    }
> +
>      if (args->use_dirty_ring) {
>          kvm_opts = ",dirty-ring-size=4096";
>      }
> @@ -820,12 +832,12 @@ static int test_migrate_start(QTestState **from, QTestState **to,
>      cmd_source = g_strdup_printf("-accel kvm%s -accel tcg "
>                                   "-machine %s,%s "
>                                   "-name source,debug-threads=on "
> -                                 "-m %s "
> +                                 "%s "
>                                   "-serial file:%s/src_serial "
>                                   "%s %s %s %s %s",
>                                   kvm_opts ? kvm_opts : "",
>                                   machine, machine_opts,
> -                                 memory_size, tmpfs,
> +                                 memory_backend, tmpfs,
>                                   arch_opts ? arch_opts : "",
>                                   arch_source ? arch_source : "",
>                                   shmem_opts ? shmem_opts : "",
> @@ -841,13 +853,13 @@ static int test_migrate_start(QTestState **from, QTestState **to,
>      cmd_target = g_strdup_printf("-accel kvm%s -accel tcg "
>                                   "-machine %s,%s "
>                                   "-name target,debug-threads=on "
> -                                 "-m %s "
> +                                 "%s "
>                                   "-serial file:%s/dest_serial "
>                                   "-incoming %s "
>                                   "%s %s %s %s %s",
>                                   kvm_opts ? kvm_opts : "",
>                                   machine, machine_opts,
> -                                 memory_size, tmpfs, uri,
> +                                 memory_backend, tmpfs, uri,
>                                   arch_opts ? arch_opts : "",
>                                   arch_target ? arch_target : "",
>                                   shmem_opts ? shmem_opts : "",
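
The effect of the format-string field and of the trailing space noted
above can be seen in a small standalone sketch.  It uses snprintf() instead
of g_strdup_printf(); the demo_* helpers and the shortened command line are
hypothetical, and the memfd option string is only a placeholder for
whatever a test would pass.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Expand the memory option: backend_fmt holds one %s for the size,
 * and "-m %s" is used when no backend format is given.
 */
void demo_memory_args(char *out, size_t len,
                      const char *backend_fmt, const char *size)
{
    snprintf(out, len, backend_fmt ? backend_fmt : "-m %s", size);
}

/* Splice the memory option into a (much shortened) command line. */
void demo_cmdline(char *out, size_t len,
                  const char *backend_fmt, const char *size)
{
    char mem[128];

    demo_memory_args(mem, sizeof(mem), backend_fmt, size);
    snprintf(out, len, "-name source %s -serial file:src_serial", mem);
}

/* Usage: default, custom backend, and the doubled-space case. */
void demo_check(void)
{
    char buf[256];

    demo_cmdline(buf, sizeof(buf), NULL, "150M");
    assert(strcmp(buf,
                  "-name source -m 150M -serial file:src_serial") == 0);

    demo_cmdline(buf, sizeof(buf),
                 "-object memory-backend-memfd,id=ram0,size=%s", "150M");
    assert(strstr(buf, "size=150M -serial") != NULL);

    /* A default ending in a space yields two spaces after the size. */
    demo_cmdline(buf, sizeof(buf), "-m %s ", "150M");
    assert(strstr(buf, "150M  -serial") != NULL);
}
```

The doubled space is harmless to argument splitting, so the remark is
cosmetic.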



* Re: [PATCH V3 13/16] tests/qtest: defer connection
  2024-11-01 13:47 ` [PATCH V3 13/16] tests/qtest: defer connection Steve Sistare
@ 2024-11-13 22:36   ` Fabiano Rosas
  2024-11-14 18:45     ` Steven Sistare
  2024-11-13 22:53   ` Peter Xu
  1 sibling, 1 reply; 86+ messages in thread
From: Fabiano Rosas @ 2024-11-13 22:36 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> Add an option to defer connecting to the monitor and qtest
> sockets when calling qtest_init_with_env.  The client makes the connection
> later by calling qtest_connect_deferred and qtest_qmp_handshake.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  tests/qtest/libqtest.c       | 69 +++++++++++++++++++++++++++++---------------
>  tests/qtest/libqtest.h       | 19 +++++++++++-
>  tests/qtest/migration-test.c |  4 +--
>  3 files changed, 65 insertions(+), 27 deletions(-)
>
> diff --git a/tests/qtest/libqtest.c b/tests/qtest/libqtest.c
> index 9d07de1..95408fb 100644
> --- a/tests/qtest/libqtest.c
> +++ b/tests/qtest/libqtest.c
> @@ -75,6 +75,8 @@ struct QTestState
>  {
>      int fd;
>      int qmp_fd;
> +    int sock;
> +    int qmpsock;
>      pid_t qemu_pid;  /* our child QEMU process */
>      int wstatus;
>  #ifdef _WIN32
> @@ -443,7 +445,8 @@ static QTestState *G_GNUC_PRINTF(2, 3) qtest_spawn_qemu(const char *qemu_bin,
>  }
>  
>  static QTestState *qtest_init_internal(const char *qemu_bin,
> -                                       const char *extra_args)
> +                                       const char *extra_args,
> +                                       bool defer_connect)
>  {
>      QTestState *s;
>      int sock, qmpsock, i;
> @@ -485,22 +488,17 @@ static QTestState *qtest_init_internal(const char *qemu_bin,
>      qtest_client_set_rx_handler(s, qtest_client_socket_recv_line);
>      qtest_client_set_tx_handler(s, qtest_client_socket_send);
>  
> -    s->fd = socket_accept(sock);
> -    if (s->fd >= 0) {
> -        s->qmp_fd = socket_accept(qmpsock);
> -    }
> -    unlink(socket_path);
> -    unlink(qmp_socket_path);
> -    g_free(socket_path);
> -    g_free(qmp_socket_path);
> -
> -    g_assert(s->fd >= 0 && s->qmp_fd >= 0);
> -
>      s->rx = g_string_new("");
>      for (i = 0; i < MAX_IRQ; i++) {
>          s->irq_level[i] = false;
>      }
>  
> +    s->sock = sock;
> +    s->qmpsock = qmpsock;
> +    if (!defer_connect) {
> +        qtest_connect_deferred(s);
> +    }

It might be cleaner to just leave qtest_connect_deferred() to the
callers and not plumb defer_connect through.

> +
>      /*
>       * Stopping QEMU for debugging is not supported on Windows.
>       *
> @@ -515,34 +513,57 @@ static QTestState *qtest_init_internal(const char *qemu_bin,
>      }
>  #endif
>  
> +   return s;
> +}
> +
> +void qtest_connect_deferred(QTestState *s)
> +{
> +    g_autofree gchar *socket_path = NULL;
> +    g_autofree gchar *qmp_socket_path = NULL;
> +
> +    socket_path = g_strdup_printf("%s/qtest-%d.sock",
> +                                  g_get_tmp_dir(), getpid());
> +    qmp_socket_path = g_strdup_printf("%s/qtest-%d.qmp",
> +                                      g_get_tmp_dir(), getpid());
> +
> +    s->fd = socket_accept(s->sock);
> +    if (s->fd >= 0) {
> +        s->qmp_fd = socket_accept(s->qmpsock);
> +    }
> +    unlink(socket_path);
> +    unlink(qmp_socket_path);
> +    g_assert(s->fd >= 0 && s->qmp_fd >= 0);
>      /* ask endianness of the target */
> -
>      s->big_endian = qtest_query_target_endianness(s);
> -
> -   return s;
>  }
>  
>  QTestState *qtest_init_without_qmp_handshake(const char *extra_args)
>  {
> -    return qtest_init_internal(qtest_qemu_binary(NULL), extra_args);
> +    return qtest_init_internal(qtest_qemu_binary(NULL), extra_args, false);
>  }
>  
> -QTestState *qtest_init_with_env(const char *var, const char *extra_args)
> +void qtest_qmp_handshake(QTestState *s)
>  {
> -    QTestState *s = qtest_init_internal(qtest_qemu_binary(var), extra_args);
> -    QDict *greeting;
> -
>      /* Read the QMP greeting and then do the handshake */
> -    greeting = qtest_qmp_receive(s);
> +    QDict *greeting = qtest_qmp_receive(s);
>      qobject_unref(greeting);
>      qobject_unref(qtest_qmp(s, "{ 'execute': 'qmp_capabilities' }"));
> +}
>  
> +QTestState *qtest_init_with_env(const char *var, const char *extra_args,
> +                                bool defer_connect)
> +{
> +    QTestState *s = qtest_init_internal(qtest_qemu_binary(var), extra_args,
> +                                        defer_connect);
> +    if (!defer_connect) {
> +        qtest_qmp_handshake(s);
> +    }
>      return s;
>  }
>  
>  QTestState *qtest_init(const char *extra_args)
>  {
> -    return qtest_init_with_env(NULL, extra_args);
> +    return qtest_init_with_env(NULL, extra_args, false);
>  }
>  
>  QTestState *qtest_vinitf(const char *fmt, va_list ap)
> @@ -1523,7 +1544,7 @@ static struct MachInfo *qtest_get_machines(const char *var)
>  
>      silence_spawn_log = !g_test_verbose();
>  
> -    qts = qtest_init_with_env(qemu_var, "-machine none");
> +    qts = qtest_init_with_env(qemu_var, "-machine none", false);
>      response = qtest_qmp(qts, "{ 'execute': 'query-machines' }");
>      g_assert(response);
>      list = qdict_get_qlist(response, "return");
> @@ -1578,7 +1599,7 @@ static struct CpuModel *qtest_get_cpu_models(void)
>  
>      silence_spawn_log = !g_test_verbose();
>  
> -    qts = qtest_init_with_env(NULL, "-machine none");
> +    qts = qtest_init_with_env(NULL, "-machine none", false);
>      response = qtest_qmp(qts, "{ 'execute': 'query-cpu-definitions' }");
>      g_assert(response);
>      list = qdict_get_qlist(response, "return");
> diff --git a/tests/qtest/libqtest.h b/tests/qtest/libqtest.h
> index beb96b1..db76f2c 100644
> --- a/tests/qtest/libqtest.h
> +++ b/tests/qtest/libqtest.h
> @@ -60,13 +60,15 @@ QTestState *qtest_init(const char *extra_args);
>   * @var: Environment variable from where to take the QEMU binary
>   * @extra_args: Other arguments to pass to QEMU.  CAUTION: these
>   * arguments are subject to word splitting and shell evaluation.
> + * @defer_connect: do not connect to qemu monitor and qtest socket.
>   *
>   * Like qtest_init(), but use a different environment variable for the
>   * QEMU binary.
>   *
>   * Returns: #QTestState instance.
>   */
> -QTestState *qtest_init_with_env(const char *var, const char *extra_args);
> +QTestState *qtest_init_with_env(const char *var, const char *extra_args,
> +                                bool defer_connect);
>  
>  /**
>   * qtest_init_without_qmp_handshake:
> @@ -78,6 +80,21 @@ QTestState *qtest_init_with_env(const char *var, const char *extra_args);
>  QTestState *qtest_init_without_qmp_handshake(const char *extra_args);
>  
>  /**
> + * qtest_connect_deferred:
> + * @s: #QTestState instance to connect
> + * Connect to qemu monitor and qtest socket, after deferring them in
> + * qtest_init_with_env.  Does not handshake with the monitor.
> + */
> +void qtest_connect_deferred(QTestState *s);
> +
> +/**
> + * qtest_qmp_handshake:
> + * @s: #QTestState instance to operate on.
> + * Perform handshake after connecting to qemu monitor.
> + */
> +void qtest_qmp_handshake(QTestState *s);
> +
> +/**
>   * qtest_init_with_serial:
>   * @extra_args: other arguments to pass to QEMU.  CAUTION: these
>   * arguments are subject to word splitting and shell evaluation.
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index a008316..d359b10 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -844,7 +844,7 @@ static int test_migrate_start(QTestState **from, QTestState **to,
>                                   args->opts_source ? args->opts_source : "",
>                                   ignore_stderr);
>      if (!args->only_target) {
> -        *from = qtest_init_with_env(QEMU_ENV_SRC, cmd_source);
> +        *from = qtest_init_with_env(QEMU_ENV_SRC, cmd_source, false);
>          qtest_qmp_set_event_callback(*from,
>                                       migrate_watch_for_events,
>                                       &src_state);
> @@ -865,7 +865,7 @@ static int test_migrate_start(QTestState **from, QTestState **to,
>                                   shmem_opts ? shmem_opts : "",
>                                   args->opts_target ? args->opts_target : "",
>                                   ignore_stderr);
> -    *to = qtest_init_with_env(QEMU_ENV_DST, cmd_target);
> +    *to = qtest_init_with_env(QEMU_ENV_DST, cmd_target, false);
>      qtest_qmp_set_event_callback(*to,
>                                   migrate_watch_for_events,
>                                   &dst_state);



* Re: [PATCH V3 13/16] tests/qtest: defer connection
  2024-11-01 13:47 ` [PATCH V3 13/16] tests/qtest: defer connection Steve Sistare
  2024-11-13 22:36   ` Fabiano Rosas
@ 2024-11-13 22:53   ` Peter Xu
  2024-11-14 18:31     ` Steven Sistare
  1 sibling, 1 reply; 86+ messages in thread
From: Peter Xu @ 2024-11-13 22:53 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Fri, Nov 01, 2024 at 06:47:52AM -0700, Steve Sistare wrote:
> +void qtest_connect_deferred(QTestState *s)
> +{
> +    g_autofree gchar *socket_path = NULL;
> +    g_autofree gchar *qmp_socket_path = NULL;
> +
> +    socket_path = g_strdup_printf("%s/qtest-%d.sock",
> +                                  g_get_tmp_dir(), getpid());
> +    qmp_socket_path = g_strdup_printf("%s/qtest-%d.qmp",
> +                                      g_get_tmp_dir(), getpid());
> +
> +    s->fd = socket_accept(s->sock);
> +    if (s->fd >= 0) {
> +        s->qmp_fd = socket_accept(s->qmpsock);
> +    }
> +    unlink(socket_path);
> +    unlink(qmp_socket_path);

Why need to unlink again here if both sock/qmpsock are cached?  I assume we
could remove these lines together with above g_strdup_printf()s.

Otherwise two paths are leaked anyway (and we may also want to have some
macros to represent the paths used in two places).

Maybe we could also clear sock/qmpsock after use, then check at the
entrance to skip qtest_connect_deferred() if already connected.

> +    g_assert(s->fd >= 0 && s->qmp_fd >= 0);
>      /* ask endianness of the target */
> -
>      s->big_endian = qtest_query_target_endianness(s);
> -
> -   return s;
>  }
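
The "clear after use, skip if already connected" idea above could look
roughly like the sketch below.  It is a standalone model, not libqtest
code: DemoState only mirrors the four descriptor fields, and demo_accept()
is a hypothetical stand-in for socket_accept() that fabricates a descriptor
number instead of calling accept(2).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the QTestState fields discussed above. */
typedef struct {
    int fd, qmp_fd;     /* connected descriptors, -1 until accepted */
    int sock, qmpsock;  /* listening descriptors, -1 once consumed */
} DemoState;

/* Stand-in for socket_accept(): returns a fake connected descriptor. */
int demo_accept(int listen_fd)
{
    return listen_fd + 100;
}

/*
 * Connect once.  Returns true if the connection was made by this call,
 * false if the listening descriptors were already consumed.
 */
bool demo_connect_deferred(DemoState *s)
{
    if (s->sock < 0 || s->qmpsock < 0) {
        return false;           /* already connected, nothing to do */
    }

    s->fd = demo_accept(s->sock);
    if (s->fd >= 0) {
        s->qmp_fd = demo_accept(s->qmpsock);
    }

    /* Clear the cached listeners so a second call becomes a no-op. */
    s->sock = -1;
    s->qmpsock = -1;

    assert(s->fd >= 0 && s->qmp_fd >= 0);
    return true;
}
```

With this shape, a caller that always invokes the connect function after
init does not need to know whether the connection was deferred.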

-- 
Peter Xu




* Re: [PATCH V3 14/16] tests/migration-test: defer connection
  2024-11-01 13:47 ` [PATCH V3 14/16] tests/migration-test: " Steve Sistare
@ 2024-11-14 12:46   ` Fabiano Rosas
  0 siblings, 0 replies; 86+ messages in thread
From: Fabiano Rosas @ 2024-11-14 12:46 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster, Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> Add an option to defer connection to the target monitor, needed by the
> cpr-transfer test.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH V3 16/16] migration: cpr-transfer documentation
  2024-11-13 22:02   ` Peter Xu
@ 2024-11-14 18:31     ` Steven Sistare
  0 siblings, 0 replies; 86+ messages in thread
From: Steven Sistare @ 2024-11-14 18:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 11/13/2024 5:02 PM, Peter Xu wrote:
> On Fri, Nov 01, 2024 at 06:47:55AM -0700, Steve Sistare wrote:
>> +Caveats
>> +^^^^^^^
>> +
>> +cpr-transfer mode may not be used with postcopy, background-snapshot,
>> +or COLO.
>> +
>> +memory-backend-epc and memory-backend-ram are not supported.
> 
> Just to double check: now the plan is to allow memory-backend-ram,share=on
> to work too, right?  If so, here needs a touchup.

Yes.  I will edit here, and in qapi/migration.json.

- Steve



* Re: [PATCH V3 13/16] tests/qtest: defer connection
  2024-11-13 22:53   ` Peter Xu
@ 2024-11-14 18:31     ` Steven Sistare
  0 siblings, 0 replies; 86+ messages in thread
From: Steven Sistare @ 2024-11-14 18:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 11/13/2024 5:53 PM, Peter Xu wrote:
> On Fri, Nov 01, 2024 at 06:47:52AM -0700, Steve Sistare wrote:
>> +void qtest_connect_deferred(QTestState *s)
>> +{
>> +    g_autofree gchar *socket_path = NULL;
>> +    g_autofree gchar *qmp_socket_path = NULL;
>> +
>> +    socket_path = g_strdup_printf("%s/qtest-%d.sock",
>> +                                  g_get_tmp_dir(), getpid());
>> +    qmp_socket_path = g_strdup_printf("%s/qtest-%d.qmp",
>> +                                      g_get_tmp_dir(), getpid());
>> +
>> +    s->fd = socket_accept(s->sock);
>> +    if (s->fd >= 0) {
>> +        s->qmp_fd = socket_accept(s->qmpsock);
>> +    }
>> +    unlink(socket_path);
>> +    unlink(qmp_socket_path);
> 
> Why need to unlink again here if both sock/qmpsock are cached?  I assume we
> could remove these lines together with above g_strdup_printf()s.
> 
> Otherwise two paths are leaked anyway (and we may also want to have some
> macros to represent the paths used in two places).

The original code in qtest_init_internal unlinked before creating the socket, as
a precaution, and after accepting. I assume the latter for cleanliness.  I
carried that forward.

I'll define a helper function to eliminate the format string duplication, and
I'll fix the pre-existing leak.

static char *qtest_socket_path(const char *suffix)
{
     return g_strdup_printf("%s/qtest-%d.%s", g_get_tmp_dir(), getpid(), suffix);
}

qtest_init_internal()
     g_autofree gchar *socket_path = qtest_socket_path("sock");
     g_autofree gchar *qmp_socket_path = qtest_socket_path("qmp");
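
For readers without GLib at hand, the same idea can be sketched in plain C.
This is only an illustration: socket_path_sketch() and its tmpdir argument
are hypothetical stand-ins for qtest_socket_path() and g_get_tmp_dir(), and
the caller frees the result by hand instead of relying on g_autofree.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for qtest_socket_path(): builds
 * "<tmpdir>/qtest-<pid>.<suffix>" in a malloc'ed buffer.
 * Returns NULL on allocation failure; the caller frees the result.
 */
char *socket_path_sketch(const char *tmpdir, const char *suffix)
{
    /* First pass measures the string, second pass writes it. */
    int len = snprintf(NULL, 0, "%s/qtest-%d.%s",
                       tmpdir, (int)getpid(), suffix);
    char *path;

    if (len < 0) {
        return NULL;
    }
    path = malloc((size_t)len + 1);
    if (path) {
        snprintf(path, (size_t)len + 1, "%s/qtest-%d.%s",
                 tmpdir, (int)getpid(), suffix);
    }
    return path;
}
```

Keeping the format string in one function means the path used at socket
creation and the path passed to unlink() cannot drift apart.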

> Maybe we could also clear sock/qmpsock too after use, then check at the
> entrance to skip qtest_connect_deferred() if already connected.

Will do.
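
A rough sketch of that guard, under assumptions: struct conn_sketch is a
hypothetical miniature of QTestState, accept_fn stands in for
socket_accept(), and fake_accept() exists only so the example can be
exercised without real sockets.

```c
#include <stdbool.h>

/* Hypothetical miniature of QTestState: only the fields relevant here. */
struct conn_sketch {
    int sock;     /* listening qtest socket, -1 once consumed */
    int qmpsock;  /* listening QMP socket, -1 once consumed */
    int fd;       /* accepted qtest connection, -1 until connected */
    int qmp_fd;   /* accepted QMP connection, -1 until connected */
};

/* Stand-in for socket_accept(); the signature is an assumption. */
typedef int (*accept_fn)(int listen_fd);

/* Toy accept used only to exercise the sketch. */
int fake_accept(int listen_fd)
{
    return listen_fd + 100;
}

/* Returns true if it accepted, false if the state was already connected. */
bool connect_deferred_sketch(struct conn_sketch *s, accept_fn do_accept)
{
    if (s->sock < 0 && s->qmpsock < 0) {
        return false;               /* listening sockets already used */
    }
    s->fd = do_accept(s->sock);
    if (s->fd >= 0) {
        s->qmp_fd = do_accept(s->qmpsock);
    }
    /* Clear the cached listening sockets so a second call is a no-op. */
    s->sock = -1;
    s->qmpsock = -1;
    return true;
}
```

With the listening sockets cleared after use, calling the function twice is
harmless, which is the property the review comment asks for.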

- Steve

>> +    g_assert(s->fd >= 0 && s->qmp_fd >= 0);
>>       /* ask endianness of the target */
>> -
>>       s->big_endian = qtest_query_target_endianness(s);
>> -
>> -   return s;
>>   }
> 




* Re: [PATCH V3 10/16] migration: split qmp_migrate
  2024-11-13 21:11   ` Peter Xu
@ 2024-11-14 18:33     ` Steven Sistare
  0 siblings, 0 replies; 86+ messages in thread
From: Steven Sistare @ 2024-11-14 18:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 11/13/2024 4:11 PM, Peter Xu wrote:
> On Fri, Nov 01, 2024 at 06:47:49AM -0700, Steve Sistare wrote:
>> Split qmp_migrate into start and finish functions.  Finish will be
>> called asynchronously in a subsequent patch, but for now, call it
>> immediately.  No functional change.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   migration/migration.c | 27 +++++++++++++++++++++++++++
>>   1 file changed, 27 insertions(+)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 6dc7c09..86b3f39 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1521,6 +1521,7 @@ static void migrate_fd_error(MigrationState *s, const Error *error)
>>   static void migrate_fd_cancel(MigrationState *s)
>>   {
>>       int old_state ;
>> +    bool setup = (s->state == MIGRATION_STATUS_SETUP);
>>   
>>       trace_migrate_fd_cancel();
>>   
>> @@ -1565,6 +1566,15 @@ static void migrate_fd_cancel(MigrationState *s)
>>               s->block_inactive = false;
>>           }
>>       }
>> +
>> +    /*
>> +     * If qmp_migrate_finish has not been called, then there is no path that
>> +     * will complete the cancellation.  Do it now.
>> +     */
>> +    if (setup && !s->to_dst_file) {
>> +        migrate_set_state(&s->state, s->state, MIGRATION_STATUS_CANCELLED);
>> +        vm_resume(s->vm_old_state);
>> +    }
> 
> Hmm.. this doesn't look like the right place to put this change.. as this
> patch logically should bring no functional change if it's only about a new
> helper split an existing function.

I can move it to "cpr-transfer mode".
The remainder of this patch becomes quite small.
I should probably just squash it all into "cpr-transfer mode".

> Meanwhile, this change also doesn't yet tell where does a vm_resume() came
> from.. 

The vm_resume is needed because of the patch "stop vm earlier for cpr".
qmp_migrate calls migration_stop_vm, which sets vm_old_state.
Hence vm_resume restores that vm_old_state.

However, I moved "stop vm earlier for cpr" to a later patch series, so I
must also move the vm_resume call to that patch.

> I'm really not sure whether this is correct at all, consider someone
> does QMP "stop", migrate then quickly cancel it.  I suppose it may
> accidentally resume the VM which it shouldn't.

vm_resume only resumes execution if vm_old_state is a running state.  As
long as vm_old_state is captured at the start of qmp_migrate, as happens
in "stop vm earlier for cpr", then vm_resume does the right thing.
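
The "stop, migrate, cancel" case can be walked through with a toy model.
Everything below is hypothetical (the enum, the struct, and both function
names); it only mirrors the roles of migration_stop_vm and vm_resume as
described above, not their actual implementation.

```c
#include <stdbool.h>

/* Hypothetical, reduced run states; QEMU's RunState has many more. */
enum run_state_sketch { RS_RUNNING, RS_PAUSED, RS_MIGRATING };

struct vm_sketch {
    enum run_state_sketch state;      /* current state */
    enum run_state_sketch old_state;  /* captured when migration starts */
};

/* Role of migration_stop_vm: remember the prior state, then stop. */
void stop_for_migration_sketch(struct vm_sketch *vm)
{
    vm->old_state = vm->state;
    vm->state = RS_MIGRATING;
}

/* Role of vm_resume: restore the prior state; true if the guest runs. */
bool resume_sketch(struct vm_sketch *vm)
{
    vm->state = vm->old_state;
    return vm->state == RS_RUNNING;
}
```

A guest that was paused by QMP "stop" before the migrate command has
old_state == RS_PAUSED, so cancelling restores the paused state rather than
starting the guest.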

> Not to mention checking "setup" early, and unconditionally modify the state
> here no matter what it is (can it be things like FAILED now, then
> overwritten by a CANCELLED)?  

OK, I will only overwrite a cancelling state:
   migrate_set_state(&s->state, MIGRATION_STATUS_CANCELLING, MIGRATION_STATUS_CANCELLED);
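
The point of naming the expected old state is that the transition becomes a
compare-and-set. A minimal sketch with C11 atomics, assuming
migrate_set_state only transitions when the current state equals its
old-state argument; the enum and set_state_if_sketch() are hypothetical
names used for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical subset of the migration status values. */
enum mig_status_sketch { MS_SETUP, MS_CANCELLING, MS_CANCELLED, MS_FAILED };

/*
 * Move *state to 'next' only if it currently equals 'expected'.
 * Returns true if the transition happened.
 */
bool set_state_if_sketch(_Atomic int *state, int expected, int next)
{
    return atomic_compare_exchange_strong(state, &expected, next);
}
```

If another path has already moved the state to MS_FAILED, the call leaves it
alone, so a FAILED status is not overwritten by CANCELLED.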

> But I'd confess that's not the problem of
> this patch, but that migration state machine is currently still racy.. afaiu.
> 
>>   }
>>   
>>   void migration_add_notifier_mode(NotifierWithReturn *notify,
>> @@ -2072,6 +2082,9 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
>>       return true;
>>   }
>>   
>> +static void qmp_migrate_finish(MigrationAddress *addr, bool resume_requested,
>> +                               Error **errp);
>> +
>>   void qmp_migrate(const char *uri, bool has_channels,
>>                    MigrationChannelList *channels, bool has_detach, bool detach,
>>                    bool has_resume, bool resume, Error **errp)
>> @@ -2118,6 +2131,20 @@ void qmp_migrate(const char *uri, bool has_channels,
>>           return;
>>       }
>>   
>> +    qmp_migrate_finish(addr, resume_requested, errp);
>> +
>> +    if (local_err) {
>> +        migrate_fd_error(s, local_err);
>> +        error_propagate(errp, local_err);
>> +    }
> 
> I don't see when local_err will be set at all until here.. maybe you meant
> *errp, but then maybe we should drop local_err and use ERRP_GUARD().

In this patch local_err is always NULL.  It may be non-NULL in patch
"cpr-transfer mode".

I'll just squash "split qmp_migrate".  It was intended to siphon some
complexity away from "cpr-transfer mode", but raises more questions
than it answers.

- Steve

>> +}
>> +
>> +static void qmp_migrate_finish(MigrationAddress *addr, bool resume_requested,
>> +                               Error **errp)
>> +{
>> +    MigrationState *s = migrate_get_current();
>> +    Error *local_err = NULL;
>> +
>>       if (!resume_requested) {
>>           if (!yank_register_instance(MIGRATION_YANK_INSTANCE, errp)) {
>>               return;
>> -- 
>> 1.8.3.1
>>
> 




* Re: [PATCH V3 05/16] migration: SCM_RIGHTS for QEMUFile
  2024-11-13 20:54   ` Peter Xu
@ 2024-11-14 18:34     ` Steven Sistare
  0 siblings, 0 replies; 86+ messages in thread
From: Steven Sistare @ 2024-11-14 18:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 11/13/2024 3:54 PM, Peter Xu wrote:
> On Fri, Nov 01, 2024 at 06:47:44AM -0700, Steve Sistare wrote:
>> Define functions to put/get file descriptors to/from a QEMUFile, for qio
>> channels that support SCM_RIGHTS.  Maintain ordering such that
>>    put(A), put(fd), put(B)
>> followed by
>>    get(A), get(fd), get(B)
>> always succeeds.  Other get orderings may succeed but are not guaranteed.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> 
>>   struct QEMUFile {
>>       QIOChannel *ioc;
>>       bool is_writable;
>> @@ -51,6 +56,9 @@ struct QEMUFile {
>>   
>>       int last_error;
>>       Error *last_error_obj;
>> +
>> +    bool fd_pass;
> 
> One nitpick: I'd rename this to allow_fd_pass, or any name clearly shows
> that it's a capability.

Will do - steve

> 
>> +    QTAILQ_HEAD(, FdEntry) fds;
>>   };
> 




* Re: [PATCH V3 11/16] migration: cpr-transfer mode
  2024-11-13 21:58   ` Peter Xu
@ 2024-11-14 18:36     ` Steven Sistare
  2024-11-14 19:04       ` Peter Xu
  0 siblings, 1 reply; 86+ messages in thread
From: Steven Sistare @ 2024-11-14 18:36 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 11/13/2024 4:58 PM, Peter Xu wrote:
> On Fri, Nov 01, 2024 at 06:47:50AM -0700, Steve Sistare wrote:
>> Add the cpr-transfer migration mode.  Usage:
>>    qemu-system-$arch -machine anon-alloc=memfd ...
>>
>>    start new QEMU with "-incoming <uri-1> -cpr-uri <uri-2>"
>>
>>    Issue commands to old QEMU:
>>    migrate_set_parameter mode cpr-transfer
>>    migrate_set_parameter cpr-uri <uri-2>
>>    migrate -d <uri-1>
> 
> QMP command "migrate" already allows taking MigrationChannel lists, cpr can
> be the 2nd supported channel besides "main".
> 
> I apologize on only noticing this until now.. I wished the incoming side
> can do the same already (which also takes 'MigrationChannel') if monitors
> init can be moved earlier, and if precreate worked out.  If not, we should
> still consider doing that on source, because cpr-uri isn't usable on dest
> anyway.. so they need to be treated separately even now.
> 
> Then after we make the monitor code run earlier in the future we could
> introduce that to incoming side too, obsoleting -cpr-uri there.

I have already been shot down on precreate and monitors init, so we are
left with specifying a "cpr" channel on the outgoing side, and -cpr-uri
on the incoming side.  That will confuse users, will require more implementation
and specification work than you perhaps realize to explain this to users,
and only gets us halfway to your desired end point of specifying everything
using channels.  I don't like that plan!

If we ever get the ability to open the monitor early, then we can implement
a complete and clean solution using channels and declare the other options
obsolete.

- Steve



* Re: [PATCH V3 13/16] tests/qtest: defer connection
  2024-11-13 22:36   ` Fabiano Rosas
@ 2024-11-14 18:45     ` Steven Sistare
  0 siblings, 0 replies; 86+ messages in thread
From: Steven Sistare @ 2024-11-14 18:45 UTC (permalink / raw)
  To: Fabiano Rosas, qemu-devel
  Cc: Peter Xu, David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
	Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
	Markus Armbruster

On 11/13/2024 5:36 PM, Fabiano Rosas wrote:
> Steve Sistare <steven.sistare@oracle.com> writes:
> 
>> Add an option to defer connecting to the monitor and qtest
>> sockets when calling qtest_init_with_env.  The client makes the connection
>> later by calling qtest_connect_deferred and qtest_qmp_handshake.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   tests/qtest/libqtest.c       | 69 +++++++++++++++++++++++++++++---------------
>>   tests/qtest/libqtest.h       | 19 +++++++++++-
>>   tests/qtest/migration-test.c |  4 +--
>>   3 files changed, 65 insertions(+), 27 deletions(-)
>>
>> diff --git a/tests/qtest/libqtest.c b/tests/qtest/libqtest.c
>> index 9d07de1..95408fb 100644
>> --- a/tests/qtest/libqtest.c
>> +++ b/tests/qtest/libqtest.c
>> @@ -75,6 +75,8 @@ struct QTestState
>>   {
>>       int fd;
>>       int qmp_fd;
>> +    int sock;
>> +    int qmpsock;
>>       pid_t qemu_pid;  /* our child QEMU process */
>>       int wstatus;
>>   #ifdef _WIN32
>> @@ -443,7 +445,8 @@ static QTestState *G_GNUC_PRINTF(2, 3) qtest_spawn_qemu(const char *qemu_bin,
>>   }
>>   
>>   static QTestState *qtest_init_internal(const char *qemu_bin,
>> -                                       const char *extra_args)
>> +                                       const char *extra_args,
>> +                                       bool defer_connect)
>>   {
>>       QTestState *s;
>>       int sock, qmpsock, i;
>> @@ -485,22 +488,17 @@ static QTestState *qtest_init_internal(const char *qemu_bin,
>>       qtest_client_set_rx_handler(s, qtest_client_socket_recv_line);
>>       qtest_client_set_tx_handler(s, qtest_client_socket_send);
>>   
>> -    s->fd = socket_accept(sock);
>> -    if (s->fd >= 0) {
>> -        s->qmp_fd = socket_accept(qmpsock);
>> -    }
>> -    unlink(socket_path);
>> -    unlink(qmp_socket_path);
>> -    g_free(socket_path);
>> -    g_free(qmp_socket_path);
>> -
>> -    g_assert(s->fd >= 0 && s->qmp_fd >= 0);
>> -
>>       s->rx = g_string_new("");
>>       for (i = 0; i < MAX_IRQ; i++) {
>>           s->irq_level[i] = false;
>>       }
>>   
>> +    s->sock = sock;
>> +    s->qmpsock = qmpsock;
>> +    if (!defer_connect) {
>> +        qtest_connect_deferred(s);
>> +    }
> 
> It might be cleaner to just leave qtest_connect_deferred() to the
> callers and not plumb defer_connect through.

I considered that, but IMO we should not force all callers to complete
a deferred connection when only one caller needs it.

- Steve

>> +
>>       /*
>>        * Stopping QEMU for debugging is not supported on Windows.
>>        *
>> @@ -515,34 +513,57 @@ static QTestState *qtest_init_internal(const char *qemu_bin,
>>       }
>>   #endif
>>   
>> +   return s;
>> +}
>> +
>> +void qtest_connect_deferred(QTestState *s)
>> +{
>> +    g_autofree gchar *socket_path = NULL;
>> +    g_autofree gchar *qmp_socket_path = NULL;
>> +
>> +    socket_path = g_strdup_printf("%s/qtest-%d.sock",
>> +                                  g_get_tmp_dir(), getpid());
>> +    qmp_socket_path = g_strdup_printf("%s/qtest-%d.qmp",
>> +                                      g_get_tmp_dir(), getpid());
>> +
>> +    s->fd = socket_accept(s->sock);
>> +    if (s->fd >= 0) {
>> +        s->qmp_fd = socket_accept(s->qmpsock);
>> +    }
>> +    unlink(socket_path);
>> +    unlink(qmp_socket_path);
>> +    g_assert(s->fd >= 0 && s->qmp_fd >= 0);
>>       /* ask endianness of the target */
>> -
>>       s->big_endian = qtest_query_target_endianness(s);
>> -
>> -   return s;
>>   }
>>   
>>   QTestState *qtest_init_without_qmp_handshake(const char *extra_args)
>>   {
>> -    return qtest_init_internal(qtest_qemu_binary(NULL), extra_args);
>> +    return qtest_init_internal(qtest_qemu_binary(NULL), extra_args, false);
>>   }
>>   
>> -QTestState *qtest_init_with_env(const char *var, const char *extra_args)
>> +void qtest_qmp_handshake(QTestState *s)
>>   {
>> -    QTestState *s = qtest_init_internal(qtest_qemu_binary(var), extra_args);
>> -    QDict *greeting;
>> -
>>       /* Read the QMP greeting and then do the handshake */
>> -    greeting = qtest_qmp_receive(s);
>> +    QDict *greeting = qtest_qmp_receive(s);
>>       qobject_unref(greeting);
>>       qobject_unref(qtest_qmp(s, "{ 'execute': 'qmp_capabilities' }"));
>> +}
>>   
>> +QTestState *qtest_init_with_env(const char *var, const char *extra_args,
>> +                                bool defer_connect)
>> +{
>> +    QTestState *s = qtest_init_internal(qtest_qemu_binary(var), extra_args,
>> +                                        defer_connect);
>> +    if (!defer_connect) {
>> +        qtest_qmp_handshake(s);
>> +    }
>>       return s;
>>   }
>>   
>>   QTestState *qtest_init(const char *extra_args)
>>   {
>> -    return qtest_init_with_env(NULL, extra_args);
>> +    return qtest_init_with_env(NULL, extra_args, false);
>>   }
>>   
>>   QTestState *qtest_vinitf(const char *fmt, va_list ap)
>> @@ -1523,7 +1544,7 @@ static struct MachInfo *qtest_get_machines(const char *var)
>>   
>>       silence_spawn_log = !g_test_verbose();
>>   
>> -    qts = qtest_init_with_env(qemu_var, "-machine none");
>> +    qts = qtest_init_with_env(qemu_var, "-machine none", false);
>>       response = qtest_qmp(qts, "{ 'execute': 'query-machines' }");
>>       g_assert(response);
>>       list = qdict_get_qlist(response, "return");
>> @@ -1578,7 +1599,7 @@ static struct CpuModel *qtest_get_cpu_models(void)
>>   
>>       silence_spawn_log = !g_test_verbose();
>>   
>> -    qts = qtest_init_with_env(NULL, "-machine none");
>> +    qts = qtest_init_with_env(NULL, "-machine none", false);
>>       response = qtest_qmp(qts, "{ 'execute': 'query-cpu-definitions' }");
>>       g_assert(response);
>>       list = qdict_get_qlist(response, "return");
>> diff --git a/tests/qtest/libqtest.h b/tests/qtest/libqtest.h
>> index beb96b1..db76f2c 100644
>> --- a/tests/qtest/libqtest.h
>> +++ b/tests/qtest/libqtest.h
>> @@ -60,13 +60,15 @@ QTestState *qtest_init(const char *extra_args);
>>    * @var: Environment variable from where to take the QEMU binary
>>    * @extra_args: Other arguments to pass to QEMU.  CAUTION: these
>>    * arguments are subject to word splitting and shell evaluation.
>> + * @defer_connect: do not connect to qemu monitor and qtest socket.
>>    *
>>    * Like qtest_init(), but use a different environment variable for the
>>    * QEMU binary.
>>    *
>>    * Returns: #QTestState instance.
>>    */
>> -QTestState *qtest_init_with_env(const char *var, const char *extra_args);
>> +QTestState *qtest_init_with_env(const char *var, const char *extra_args,
>> +                                bool defer_connect);
>>   
>>   /**
>>    * qtest_init_without_qmp_handshake:
>> @@ -78,6 +80,21 @@ QTestState *qtest_init_with_env(const char *var, const char *extra_args);
>>   QTestState *qtest_init_without_qmp_handshake(const char *extra_args);
>>   
>>   /**
>> + * qtest_connect_deferred:
>> + * @s: #QTestState instance to connect
>> + * Connect to qemu monitor and qtest socket, after deferring them in
>> + * qtest_init_with_env.  Does not handshake with the monitor.
>> + */
>> +void qtest_connect_deferred(QTestState *s);
>> +
>> +/**
>> + * qtest_qmp_handshake:
>> + * @s: #QTestState instance to operate on.
>> + * Perform handshake after connecting to qemu monitor.
>> + */
>> +void qtest_qmp_handshake(QTestState *s);
>> +
>> +/**
>>    * qtest_init_with_serial:
>>    * @extra_args: other arguments to pass to QEMU.  CAUTION: these
>>    * arguments are subject to word splitting and shell evaluation.
>> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
>> index a008316..d359b10 100644
>> --- a/tests/qtest/migration-test.c
>> +++ b/tests/qtest/migration-test.c
>> @@ -844,7 +844,7 @@ static int test_migrate_start(QTestState **from, QTestState **to,
>>                                    args->opts_source ? args->opts_source : "",
>>                                    ignore_stderr);
>>       if (!args->only_target) {
>> -        *from = qtest_init_with_env(QEMU_ENV_SRC, cmd_source);
>> +        *from = qtest_init_with_env(QEMU_ENV_SRC, cmd_source, false);
>>           qtest_qmp_set_event_callback(*from,
>>                                        migrate_watch_for_events,
>>                                        &src_state);
>> @@ -865,7 +865,7 @@ static int test_migrate_start(QTestState **from, QTestState **to,
>>                                    shmem_opts ? shmem_opts : "",
>>                                    args->opts_target ? args->opts_target : "",
>>                                    ignore_stderr);
>> -    *to = qtest_init_with_env(QEMU_ENV_DST, cmd_target);
>> +    *to = qtest_init_with_env(QEMU_ENV_DST, cmd_target, false);
>>       qtest_qmp_set_event_callback(*to,
>>                                    migrate_watch_for_events,
>>                                    &dst_state);




* Re: [PATCH V3 11/16] migration: cpr-transfer mode
  2024-11-14 18:36     ` Steven Sistare
@ 2024-11-14 19:04       ` Peter Xu
  2024-11-19 19:50         ` Steven Sistare
  0 siblings, 1 reply; 86+ messages in thread
From: Peter Xu @ 2024-11-14 19:04 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Thu, Nov 14, 2024 at 01:36:00PM -0500, Steven Sistare wrote:
> On 11/13/2024 4:58 PM, Peter Xu wrote:
> > On Fri, Nov 01, 2024 at 06:47:50AM -0700, Steve Sistare wrote:
> > > Add the cpr-transfer migration mode.  Usage:
> > >    qemu-system-$arch -machine anon-alloc=memfd ...
> > > 
> > >    start new QEMU with "-incoming <uri-1> -cpr-uri <uri-2>"
> > > 
> > >    Issue commands to old QEMU:
> > >    migrate_set_parameter mode cpr-transfer
> > >    migrate_set_parameter cpr-uri <uri-2>
> > >    migrate -d <uri-1>
> > 
> > QMP command "migrate" already allows taking MigrationChannel lists, cpr can
> > be the 2nd supported channel besides "main".
> > 
> > I apologize on only noticing this until now.. I wished the incoming side
> > can do the same already (which also takes 'MigrationChannel') if monitors
> > init can be moved earlier, and if precreate worked out.  If not, we should
> > still consider doing that on source, because cpr-uri isn't usable on dest
> > anyway.. so they need to be treated separately even now.
> > 
> > Then after we make the monitor code run earlier in the future we could
> > introduce that to incoming side too, obsoleting -cpr-uri there.
> 
> I have already been shot down on precreate and monitors init, so we are
> left with specifying a "cpr" channel on the outgoing side, and -cpr-uri
> on the incoming side.  That will confuse users, will require more implementation
> and specification work than you perhaps realize to explain this to users,

What is the specification work?  Can you elaborate?

> and only gets us halfway to your desired end point of specifying everything
> using channels.  I don't like that plan!
> 
> If we ever get the ability to open the monitor early, then we can implement
> a complete and clean solution using channels and declare the other options
> obsolete.

The sender side doesn't need to wait for destination side to be ready?
Dest side isn't a reason to me on how we should make sender side work if
they're totally separate anyway.  Dest requires -cpr-uri because we don't
yet have a choice.

Is the only concern about code changes?  I expect this change to be far
less controversial than many others in this series, even if I
confess that it may still contain some diff. They should hopefully be
straightforward, unlike many of the changes elsewhere in the series.

If you prefer not to write that patch, I am OK, and I can write one patch
on top of your series after it lands if that is OK for you. I still want to
have this there when 10.0 is released, if I didn't misunderstand anything, so
I'll be able to remove cpr-uri directly in that patch too.

-- 
Peter Xu




* Re: [PATCH V3 11/16] migration: cpr-transfer mode
  2024-11-14 19:04       ` Peter Xu
@ 2024-11-19 19:50         ` Steven Sistare
  2024-11-19 20:16           ` Peter Xu
  0 siblings, 1 reply; 86+ messages in thread
From: Steven Sistare @ 2024-11-19 19:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 11/14/2024 2:04 PM, Peter Xu wrote:
> On Thu, Nov 14, 2024 at 01:36:00PM -0500, Steven Sistare wrote:
>> On 11/13/2024 4:58 PM, Peter Xu wrote:
>>> On Fri, Nov 01, 2024 at 06:47:50AM -0700, Steve Sistare wrote:
>>>> Add the cpr-transfer migration mode.  Usage:
>>>>     qemu-system-$arch -machine anon-alloc=memfd ...
>>>>
>>>>     start new QEMU with "-incoming <uri-1> -cpr-uri <uri-2>"
>>>>
>>>>     Issue commands to old QEMU:
>>>>     migrate_set_parameter mode cpr-transfer
>>>>     migrate_set_parameter cpr-uri <uri-2>
>>>>     migrate -d <uri-1>
>>>
>>> QMP command "migrate" already allows taking MigrationChannel lists, cpr can
>>> be the 2nd supported channel besides "main".
>>>
>>> I apologize on only noticing this until now.. I wished the incoming side
>>> can do the same already (which also takes 'MigrationChannel') if monitors
>>> init can be moved earlier, and if precreate worked out.  If not, we should
>>> still consider doing that on source, because cpr-uri isn't usable on dest
>>> anyway.. so they need to be treated separately even now.
>>>
>>> Then after we make the monitor code run earlier in the future we could
>>> introduce that to incoming side too, obsoleting -cpr-uri there.
>>
>> I have already been shot down on precreate and monitors init, so we are
>> left with specifying a "cpr" channel on the outgoing side, and -cpr-uri
>> on the incoming side.  That will confuse users, will require more implementation
>> and specification work than you perhaps realize to explain this to users,
> 
> What is the specification work?  Can you elaborate?
> 
>> and only gets us halfway to your desired end point of specifying everything
>> using channels.  I don't like that plan!
>>
>> If we ever get the ability to open the monitor early, then we can implement
>> a complete and clean solution using channels and declare the other options
>> obsolete.
> 
> The sender side doesn't need to wait for destination side to be ready?
> Dest side isn't a reason to me on how we should make sender side work if
> they're totally separate anyway.  Dest requires -cpr-uri because we don't
> yet have a choice.
> 
> Is the only concern about code changes?  I expect this change to be far
> less controversial than many others in this series, even if I
> confess that it may still contain some diff. They should hopefully be
> straightforward, unlike many of the changes elsewhere in the series.
> 
> If you prefer not to write that patch, I am OK, and I can write one patch
> on top of your series after it lands if that is OK for you. I still want to
> have this there when 10.0 is released, if I didn't misunderstand anything, so
> I'll be able to remove cpr-uri directly in that patch too.

I made the changes:
   * implementation
   * documentation in CPR.rst and QAPI
   * convert sample code in CPR.rst, commit messages, and cover letter to QMP,
     because a channel cannot be specified using HMP.
   * migration tests

New CPR.rst:

-------------------
   ...
   This mode requires a second migration channel named "cpr" in the
   channel arguments on the outgoing side.  The channel must be a type,
   such as unix socket, that supports SCM_RIGHTS.  However, the cpr
   channel cannot be added to the list of channels for a migrate-incoming
   command, because it must be read before new QEMU opens a monitor.
   Instead, the user passes the equivalent URI for the channel as part of
   the ``cpr-uri`` command-line argument to new QEMU.
   ...

   Outgoing:                             Incoming:

   # qemu-kvm -qmp stdio
   -object memory-backend-file,id=ram0,size=4G,
   mem-path=/dev/shm/ram0,share=on -m 4G
   -machine aux-ram-share=on
   ...
                                         # qemu-kvm -monitor stdio
                                         -incoming tcp:0:44444
                                         -cpr-uri unix:cpr.sock
                                         ...
   {"execute":"qmp_capabilities"}

   {"execute": "query-status"}
   {"return": {"status": "running",
               "running": true}}

   {"execute":"migrate-set-parameters",
    "arguments":{"mode":"cpr-transfer"}}

   {"execute": "migrate", "arguments": { "channels": [
     {"channel-type": "main",
      "addr": { "transport": "socket", "type": "inet",
                "host": "0", "port": "44444" }},
     {"channel-type": "cpr",
      "addr": { "transport": "socket", "type": "unix",
                "path": "cpr.sock" }}]}}

                                         QEMU 9.2.50 monitor
                                         (qemu) info status
                                         VM status: running

   {"execute": "query-status"}
   {"return": {"status": "postmigrate",
               "running": false}}
-------------------
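
The SCM_RIGHTS requirement in the text above is why the cpr channel must be
a unix socket: descriptors travel as ancillary data on sendmsg/recvmsg. The
following is a generic POSIX sketch, not QEMU code; send_fd_sketch() and
recv_fd_sketch() are hypothetical names.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a unix socket; returns 0 on success. */
int send_fd_sketch(int sock, int fd)
{
    char byte = 'F';   /* one data byte accompanies the ancillary data */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;      /* forces correct alignment */
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor; returns the new fd, or -1 on failure. */
int recv_fd_sketch(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET &&
        cmsg->cmsg_type == SCM_RIGHTS &&
        cmsg->cmsg_len == CMSG_LEN(sizeof(int))) {
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    }
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same open file
as the sender's descriptor. A TCP channel has no equivalent mechanism, which
is why it can carry the main migration stream but not the cpr state.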

- Steve




* Re: [PATCH V3 11/16] migration: cpr-transfer mode
  2024-11-19 19:50         ` Steven Sistare
@ 2024-11-19 20:16           ` Peter Xu
  2024-11-19 20:32             ` Steven Sistare
  0 siblings, 1 reply; 86+ messages in thread
From: Peter Xu @ 2024-11-19 20:16 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Tue, Nov 19, 2024 at 02:50:40PM -0500, Steven Sistare wrote:
> On 11/14/2024 2:04 PM, Peter Xu wrote:
> > On Thu, Nov 14, 2024 at 01:36:00PM -0500, Steven Sistare wrote:
> > > On 11/13/2024 4:58 PM, Peter Xu wrote:
> > > > On Fri, Nov 01, 2024 at 06:47:50AM -0700, Steve Sistare wrote:
> > > > > Add the cpr-transfer migration mode.  Usage:
> > > > >     qemu-system-$arch -machine anon-alloc=memfd ...
> > > > > 
> > > > >     start new QEMU with "-incoming <uri-1> -cpr-uri <uri-2>"
> > > > > 
> > > > >     Issue commands to old QEMU:
> > > > >     migrate_set_parameter mode cpr-transfer
> > > > >     migrate_set_parameter cpr-uri <uri-2>
> > > > >     migrate -d <uri-1>
> > > > 
> > > > QMP command "migrate" already allows taking MigrationChannel lists, cpr can
> > > > be the 2nd supported channel besides "main".
> > > > 
> > > > I apologize on only noticing this until now.. I wished the incoming side
> > > > can do the same already (which also takes 'MigrationChannel') if monitors
> > > > init can be moved earlier, and if precreate worked out.  If not, we should
> > > > still consider doing that on source, because cpr-uri isn't usable on dest
> > > > anyway.. so they need to be treated separately even now.
> > > > 
> > > > Then after we make the monitor code run earlier in the future we could
> > > > introduce that to incoming side too, obsoleting -cpr-uri there.
> > > 
> > > I have already been shot down on precreate and monitors init, so we are
> > > left with specifying a "cpr" channel on the outgoing side, and -cpr-uri
> > > on the incoming side.  That will confuse users, will require more implementation
> > > and specification work than you perhaps realize to explain this to users,
> > 
> > What is the specification work?  Can you elaborate?
> > 
> > > and only gets us halfway to your desired end point of specifying everything
> > > using channels.  I don't like that plan!
> > > 
> > > If we ever get the ability to open the monitor early, then we can implement
> > > a complete and clean solution using channels and declare the other options
> > > obsolete.
> > 
> > The sender side doesn't need to wait for destination side to be ready?
> > Dest side isn't a reason to me on how we should make sender side work if
> > they're totally separate anyway.  Dest requires -cpr-uri because we don't
> > yet have a choice.
> > 
> > Is the only concern about code changes?  I expect this change to be far
> > less controversial than many others in this series, even if I
> > confess that it may still contain some diff. They should hopefully be
> > straightforward, unlike many of the changes elsewhere in the series.
> > 
> > If you prefer not to write that patch, I am OK, and I can write one patch
> > on top of your series after it lands if that is OK for you. I still want to
> > have this there when 10.0 is released, if I didn't misunderstand anything, so
> > I'll be able to remove cpr-uri directly in that patch too.
> 
> I made the changes:
>   * implementation
>   * documentation in CPR.rst and QAPI
>   * convert sample code in CPR.rst, commit messages, and cover letter to QMP,
>     because a channel cannot be specified using HMP.

Yeah we can leave HMP as of now; it can easily be added on top with
existing helpers like migrate_uri_parse().

>   * migration tests
> 
> New CPR.rst:
> 
> -------------------
>   ...
>   This mode requires a second migration channel named "cpr" in the
>   channel arguments on the outgoing side.  The channel must be a type,
>   such as unix socket, that supports SCM_RIGHTS.  However, the cpr
>   channel cannot be added to the list of channels for a migrate-incoming
>   command, because it must be read before new QEMU opens a monitor.
>   Instead, the user passes the equivalent URI for the channel as part of
>   the ``cpr-uri`` command-line argument to new QEMU.
>   ...
> 
>   Outgoing:                             Incoming:
> 
>   # qemu-kvm -qmp stdio
>   -object memory-backend-file,id=ram0,size=4G,
>   mem-path=/dev/shm/ram0,share=on -m 4G
>   -machine aux-ram-share=on
>   ...
>                                         # qemu-kvm -monitor stdio
>                                         -incoming tcp:0:44444
>                                         -cpr-uri unix:cpr.sock
>                                         ...
>   {"execute":"qmp_capabilities"}
> 
>   {"execute": "query-status"}
>   {"return": {"status": "running",
>               "running": true}}
> 
>   {"execute":"migrate-set-parameters",
>    "arguments":{"mode":"cpr-transfer"}}
> 
>   {"execute": "migrate", "arguments": { "channels": [
>     {"channel-type": "main",
>      "addr": { "transport": "socket", "type": "inet",
>                "host": "0", "port": "44444" }},
>     {"channel-type": "cpr",
>      "addr": { "transport": "socket", "type": "unix",
>                "path": "cpr.sock" }}]}}
> 
>                                         QEMU 9.2.50 monitor
>                                         (qemu) info status
>                                         VM status: running
> 
>   {"execute": "query-status"}
>   {"return": {"status": "postmigrate",
>               "running": false}}
> -------------------

Thank you, Steve!

-- 
Peter Xu




* Re: [PATCH V3 11/16] migration: cpr-transfer mode
  2024-11-19 20:16           ` Peter Xu
@ 2024-11-19 20:32             ` Steven Sistare
  2024-11-19 20:51               ` Peter Xu
  2024-11-20  9:38               ` Daniel P. Berrangé
  0 siblings, 2 replies; 86+ messages in thread
From: Steven Sistare @ 2024-11-19 20:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 11/19/2024 3:16 PM, Peter Xu wrote:
> On Tue, Nov 19, 2024 at 02:50:40PM -0500, Steven Sistare wrote:
>> On 11/14/2024 2:04 PM, Peter Xu wrote:
>>> On Thu, Nov 14, 2024 at 01:36:00PM -0500, Steven Sistare wrote:
>>>> On 11/13/2024 4:58 PM, Peter Xu wrote:
>>>>> On Fri, Nov 01, 2024 at 06:47:50AM -0700, Steve Sistare wrote:
>>>>>> Add the cpr-transfer migration mode.  Usage:
>>>>>>      qemu-system-$arch -machine anon-alloc=memfd ...
>>>>>>
>>>>>>      start new QEMU with "-incoming <uri-1> -cpr-uri <uri-2>"
>>>>>>
>>>>>>      Issue commands to old QEMU:
>>>>>>      migrate_set_parameter mode cpr-transfer
>>>>>>      migrate_set_parameter cpr-uri <uri-2>
>>>>>>      migrate -d <uri-1>
>>>>>
>>>>> QMP command "migrate" already allows taking MigrationChannel lists, cpr can
>>>>> be the 2nd supported channel besides "main".
>>>>>
>>>>> I apologize on only noticing this until now.. I wished the incoming side
>>>>> can do the same already (which also takes 'MigrationChannel') if monitors
>>>>> init can be moved earlier, and if precreate worked out.  If not, we should
>>>>> still consider doing that on source, because cpr-uri isn't usable on dest
>>>>> anyway.. so they need to be treated separately even now.
>>>>>
>>>>> Then after we make the monitor code run earlier in the future we could
>>>>> introduce that to incoming side too, obsoleting -cpr-uri there.
>>>>
>>>> I have already been shot down on precreate and monitors init, so we are
>>>> left with specifying a "cpr" channel on the outgoing side, and -cpr-uri
>>>> on the incoming side.  That will confuse users, will require more implementation
>>>> and specification work than you perhaps realize to explain this to users,
>>>
>>> What is the specification work?  Can you elaborate?
>>>
>>>> and only gets us halfway to your desired end point of specifying everything
>>>> using channels.  I don't like that plan!
>>>>
>>>> If we ever get the ability to open the monitor early, then we can implement
>>>> a complete and clean solution using channels and declare the other options
>>>> obsolete.
>>>
>>> The sender side doesn't need to wait for destination side to be ready?
>>> Dest side isn't a reason to me on how we should make sender side work if
>>> they're totally separate anyway.  Dest requires -cpr-uri because we don't
>>> yet have a choice.
>>>
>>> Is the only concern about code changes?  I'm expecting this change is far
>>> less controversial comparing to many others in this series, even if I
>>> confess that may still contain some diff. They should hopefully be
>>> straightforward, unlike many of the changes elsewhere in the series.
>>>
>>> If you prefer not writting that patch, I am OK, and I can write one patch
>>> on top of your series after it lands if that is OK for you. I still want to
>>> have this there when release 10.0 if I didn't misunderstood anything, so
>>> I'll be able to remove cpr-uri directly in that patch too.
>>
>> I made the changes:
>>    * implementation
>>    * documentation in CPR.rst and QAPI
>>    * convert sample code in CPR.rst, commit messages, and cover letter to QMP,
>>      because a channel cannot be specified using HMP.
> 
> Yeah we can leave HMP as of now; it can easily be added on top with
> existing helpers like migrate_uri_parse().

This begs the question, should we allow channels to be specified in hmp migrate
commands and for -incoming, in a very simple way?  Like with a prefix naming
the channel.  And eliminate the -cpr-uri argument. Examples:

(qemu) migrate -d main:tcp:0:44444,cpr:unix:cpr.sock

qemu -incoming main:tcp:0:44444,cpr:unix:cpr.sock
qemu -incoming main:defer,cpr:unix:cpr.sock

- Steve

>>    * migration tests
>>
>> New CPR.rst:
>>
>> -------------------
>>    ...
>>    This mode requires a second migration channel named "cpr" in the
>>    channel arguments on the outgoing side.  The channel must be a type,
>>    such as unix socket, that supports SCM_RIGHTS.  However, the cpr
>>    channel cannot be added to the list of channels for a migrate-incoming
>>    command, because it must be read before new QEMU opens a monitor.
>>    Instead, the user passes the equivalent URI for the channel as part of
>>    the ``cpr-uri`` command-line argument to new QEMU.
>>    ...
>>
>>    Outgoing:                             Incoming:
>>
>>    # qemu-kvm -qmp stdio
>>    -object memory-backend-file,id=ram0,size=4G,
>>    mem-path=/dev/shm/ram0,share=on -m 4G
>>    -machine aux-ram-share=on
>>    ...
>>                                          # qemu-kvm -monitor stdio
>>                                          -incoming tcp:0:44444
>>                                          -cpr-uri unix:cpr.sock
>>                                          ...
>>    {"execute":"qmp_capabilities"}
>>
>>    {"execute": "query-status"}
>>    {"return": {"status": "running",
>>                "running": true}}
>>
>>    {"execute":"migrate-set-parameters",
>>     "arguments":{"mode":"cpr-transfer"}}
>>
>>    {"execute": "migrate", "arguments": { "channels": [
>>      {"channel-type": "main",
>>       "addr": { "transport": "socket", "type": "inet",
>>                 "host": "0", "port": "44444" }},
>>      {"channel-type": "cpr",
>>       "addr": { "transport": "socket", "type": "unix",
>>                 "path": "cpr.sock" }}]}}
>>
>>                                          QEMU 9.2.50 monitor
>>                                          (qemu) info status
>>                                          VM status: running
>>
>>    {"execute": "query-status"}
>>    {"return": {"status": "postmigrate",
>>                "running": false}}
>> -------------------
> 
> Thank you, Steve!
> 




* Re: [PATCH V3 11/16] migration: cpr-transfer mode
  2024-11-19 20:32             ` Steven Sistare
@ 2024-11-19 20:51               ` Peter Xu
  2024-11-19 21:03                 ` Steven Sistare
  2024-11-20  9:38               ` Daniel P. Berrangé
  1 sibling, 1 reply; 86+ messages in thread
From: Peter Xu @ 2024-11-19 20:51 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Tue, Nov 19, 2024 at 03:32:55PM -0500, Steven Sistare wrote:
> This begs the question, should we allow channels to be specified in hmp migrate
> commands and for -incoming, in a very simple way?  Like with a prefix naming
> the channel.  And eliminate the -cpr-uri argument. Examples:
> 
> (qemu) migrate -d main:tcp:0:44444,cpr:unix:cpr.sock
> 
> qemu -incoming main:tcp:0:44444,cpr:unix:cpr.sock
> qemu -incoming main:defer,cpr:unix:cpr.sock

IMHO keeping the old syntax working would still be nice to not break
scripts.  I was thinking we could simply add one more parameter for taking
cpr uri, like:

    {
        .name       = "migrate",
        .args_type  = "detach:-d,resume:-r,uri:s,cpr:s?",
        .params     = "[-d] [-r] uri [cpr_uri]",
        .help       = "migrate to URI (using -d to not wait for completion)"
		      "\n\t\t\t -r to resume a paused postcopy migration"
		      "\n\t\t\t Setup cpr_uri to migrate with cpr-transfer",
        .cmd        = hmp_migrate,
    },

-- 
Peter Xu




* Re: [PATCH V3 11/16] migration: cpr-transfer mode
  2024-11-19 20:51               ` Peter Xu
@ 2024-11-19 21:03                 ` Steven Sistare
  2024-11-19 21:29                   ` Peter Xu
  0 siblings, 1 reply; 86+ messages in thread
From: Steven Sistare @ 2024-11-19 21:03 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 11/19/2024 3:51 PM, Peter Xu wrote:
> On Tue, Nov 19, 2024 at 03:32:55PM -0500, Steven Sistare wrote:
>> This begs the question, should we allow channels to be specified in hmp migrate
>> commands and for -incoming, in a very simple way?  Like with a prefix naming
>> the channel.  And eliminate the -cpr-uri argument. Examples:
>>
>> (qemu) migrate -d main:tcp:0:44444,cpr:unix:cpr.sock
>>
>> qemu -incoming main:tcp:0:44444,cpr:unix:cpr.sock
>> qemu -incoming main:defer,cpr:unix:cpr.sock
> 
> IMHO keeping the old syntax working would still be nice to not break
> scripts.  

The channel tag would be optional, so backwards compatible.  It's unambiguous
as long as the channel names are not also protocol names.

> I was thinking we could simply add one more parameter for taking
> cpr uri, like:
> 
>      {
>          .name       = "migrate",
>          .args_type  = "detach:-d,resume:-r,uri:s,cpr:s?",
>          .params     = "[-d] [-r] uri [cpr_uri]",
>          .help       = "migrate to URI (using -d to not wait for completion)"
> 		      "\n\t\t\t -r to resume a paused postcopy migration",
> 		      "\n\t\t\t Setup cpr_uri to migrate with cpr-transfer",
>          .cmd        = hmp_migrate,
>      },

That's fine.

I do like the incoming syntax, though, instead of -cpr-uri.  What do you think?

- Steve



* Re: [PATCH V3 11/16] migration: cpr-transfer mode
  2024-11-19 21:03                 ` Steven Sistare
@ 2024-11-19 21:29                   ` Peter Xu
  2024-11-19 21:41                     ` Steven Sistare
  0 siblings, 1 reply; 86+ messages in thread
From: Peter Xu @ 2024-11-19 21:29 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Tue, Nov 19, 2024 at 04:03:08PM -0500, Steven Sistare wrote:
> On 11/19/2024 3:51 PM, Peter Xu wrote:
> > On Tue, Nov 19, 2024 at 03:32:55PM -0500, Steven Sistare wrote:
> > > This begs the question, should we allow channels to be specified in hmp migrate
> > > commands and for -incoming, in a very simple way?  Like with a prefix naming
> > > the channel.  And eliminate the -cpr-uri argument. Examples:
> > > 
> > > (qemu) migrate -d main:tcp:0:44444,cpr:unix:cpr.sock
> > > 
> > > qemu -incoming main:tcp:0:44444,cpr:unix:cpr.sock
> > > qemu -incoming main:defer,cpr:unix:cpr.sock
> > 
> > IMHO keeping the old syntax working would still be nice to not break
> > scripts.
> 
> The channel tag would be optional, so backwards compatible.  Its unambiguous
> as long as the channel names are not also protocol names.

Ah that's ok then.  Or maybe use "="?

  "main=XXX,cpr=XXX"

Then if no "=" it's the old?

> 
> > I was thinking we could simply add one more parameter for taking
> > cpr uri, like:
> > 
> >      {
> >          .name       = "migrate",
> >          .args_type  = "detach:-d,resume:-r,uri:s,cpr:s?",
> >          .params     = "[-d] [-r] uri [cpr_uri]",
> >          .help       = "migrate to URI (using -d to not wait for completion)"
> > 		      "\n\t\t\t -r to resume a paused postcopy migration",
> > 		      "\n\t\t\t Setup cpr_uri to migrate with cpr-transfer",
> >          .cmd        = hmp_migrate,
> >      },
> 
> That's fine.
> 
> I do like the incoming syntax, though, instead of -cpr-uri.  What do you think?

That'll definitely be lovely if possible, though would any monitor be alive
at all before taking a cpr stream, with this series alone?  I thought you
dropped the precreate, then QEMU isn't able to run the monitor loop until
cpr-uri is loaded.

-- 
Peter Xu




* Re: [PATCH V3 11/16] migration: cpr-transfer mode
  2024-11-19 21:29                   ` Peter Xu
@ 2024-11-19 21:41                     ` Steven Sistare
  2024-11-19 21:48                       ` Peter Xu
  0 siblings, 1 reply; 86+ messages in thread
From: Steven Sistare @ 2024-11-19 21:41 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 11/19/2024 4:29 PM, Peter Xu wrote:
> On Tue, Nov 19, 2024 at 04:03:08PM -0500, Steven Sistare wrote:
>> On 11/19/2024 3:51 PM, Peter Xu wrote:
>>> On Tue, Nov 19, 2024 at 03:32:55PM -0500, Steven Sistare wrote:
>>>> This begs the question, should we allow channels to be specified in hmp migrate
>>>> commands and for -incoming, in a very simple way?  Like with a prefix naming
>>>> the channel.  And eliminate the -cpr-uri argument. Examples:
>>>>
>>>> (qemu) migrate -d main:tcp:0:44444,cpr:unix:cpr.sock
>>>>
>>>> qemu -incoming main:tcp:0:44444,cpr:unix:cpr.sock
>>>> qemu -incoming main:defer,cpr:unix:cpr.sock
>>>
>>> IMHO keeping the old syntax working would still be nice to not break
>>> scripts.
>>
>> The channel tag would be optional, so backwards compatible.  Its unambiguous
>> as long as the channel names are not also protocol names.
> 
> Ah that's ok then.  Or maybe use "="?
> 
>    "main=XXX,cpr=XXX"
> 
> Then if no "=" it's the old?

Sure, that works.
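
As an illustration only (not code from this series), the tagged form
discussed above could be parsed roughly as follows.  The function name
parse_incoming is hypothetical, and the sketch assumes that the URIs
themselves contain neither "," nor "=", which a real implementation
would have to handle.

```python
def parse_incoming(arg):
    """Split a hypothetical tagged -incoming value into {channel: uri}.

    A value with no "=" is treated as the legacy single-URI form and is
    mapped to the "main" channel.
    """
    if "=" not in arg:
        return {"main": arg}

    channels = {}
    for item in arg.split(","):
        name, sep, uri = item.partition("=")
        if not sep or not name or not uri:
            raise ValueError(f"malformed channel spec: {item!r}")
        if name in channels:
            raise ValueError(f"duplicate channel: {name!r}")
        channels[name] = uri
    return channels


print(parse_incoming("tcp:0:44444"))
print(parse_incoming("main=defer,cpr=unix:cpr.sock"))
```

Keying the legacy test on the absence of "=" keeps existing untagged
command lines working unchanged, which is the compatibility concern
raised earlier in the thread.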

>>> I was thinking we could simply add one more parameter for taking
>>> cpr uri, like:
>>>
>>>       {
>>>           .name       = "migrate",
>>>           .args_type  = "detach:-d,resume:-r,uri:s,cpr:s?",
>>>           .params     = "[-d] [-r] uri [cpr_uri]",
>>>           .help       = "migrate to URI (using -d to not wait for completion)"
>>> 		      "\n\t\t\t -r to resume a paused postcopy migration",
>>> 		      "\n\t\t\t Setup cpr_uri to migrate with cpr-transfer",
>>>           .cmd        = hmp_migrate,
>>>       },
>>
>> That's fine.
>>
>> I do like the incoming syntax, though, instead of -cpr-uri.  What do you think?
> 
> That'll definitely be lovely if possible, though would any monitor be alive
> at all before taking a cpr stream, with this series alone?  I thought you
> dropped the precreate, then QEMU isn't able to run the monitor loop until
> cpr-uri is loaded.

No monitor or precreate changes.  I would parse -incoming, extract and use the cpr
channel early, and use the main channel later as usual.  It's just a different way of
specifying cpr-uri.  I like it because the specification language is more consistent,
referring to a "cpr channel" both on the outgoing and incoming side:

   This mode requires a second migration channel named "cpr", included in
   the channel arguments of the migrate command on the outgoing side, and
   in the QEMU -incoming parameter on the incoming side.  The channel must
   be a type, such as unix socket, that supports SCM_RIGHTS.
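
For readers unfamiliar with the SCM_RIGHTS requirement: it is the
ancillary-data mechanism by which one process hands an open file
descriptor to another over a unix socket.  The sketch below is a
generic demonstration, not QEMU code; it assumes Python 3.9 or later on
a Unix host, and the function name is made up for the example.

```python
import os
import socket


def pass_fd_over_unix_socket():
    """Send one end of a pipe across a unix socket pair via SCM_RIGHTS."""
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    try:
        # send_fds() attaches the descriptor as SCM_RIGHTS ancillary data.
        socket.send_fds(sender, [b"fd"], [write_end])
        msg, fds, _flags, _addr = socket.recv_fds(receiver, 16, 1)
        # The received descriptor is a new number that refers to the same
        # open pipe, so data written to it can be read from read_end.
        os.write(fds[0], b"hello")
        data = os.read(read_end, 5)
        os.close(fds[0])
        return msg, data
    finally:
        os.close(read_end)
        os.close(write_end)
        sender.close()
        receiver.close()


print(pass_fd_over_unix_socket())
```

A TCP socket has no equivalent of this ancillary message, which is why
the cpr channel is restricted to transports such as unix sockets.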

- Steve







* Re: [PATCH V3 11/16] migration: cpr-transfer mode
  2024-11-19 21:41                     ` Steven Sistare
@ 2024-11-19 21:48                       ` Peter Xu
  2024-11-19 21:51                         ` Steven Sistare
  0 siblings, 1 reply; 86+ messages in thread
From: Peter Xu @ 2024-11-19 21:48 UTC (permalink / raw)
  To: Steven Sistare
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On Tue, Nov 19, 2024 at 04:41:07PM -0500, Steven Sistare wrote:
> On 11/19/2024 4:29 PM, Peter Xu wrote:
> > On Tue, Nov 19, 2024 at 04:03:08PM -0500, Steven Sistare wrote:
> > > On 11/19/2024 3:51 PM, Peter Xu wrote:
> > > > On Tue, Nov 19, 2024 at 03:32:55PM -0500, Steven Sistare wrote:
> > > > > This begs the question, should we allow channels to be specified in hmp migrate
> > > > > commands and for -incoming, in a very simple way?  Like with a prefix naming
> > > > > the channel.  And eliminate the -cpr-uri argument. Examples:
> > > > > 
> > > > > (qemu) migrate -d main:tcp:0:44444,cpr:unix:cpr.sock
> > > > > 
> > > > > qemu -incoming main:tcp:0:44444,cpr:unix:cpr.sock
> > > > > qemu -incoming main:defer,cpr:unix:cpr.sock
> > > > 
> > > > IMHO keeping the old syntax working would still be nice to not break
> > > > scripts.
> > > 
> > > The channel tag would be optional, so backwards compatible.  Its unambiguous
> > > as long as the channel names are not also protocol names.
> > 
> > Ah that's ok then.  Or maybe use "="?
> > 
> >    "main=XXX,cpr=XXX"
> > 
> > Then if no "=" it's the old?
> 
> Sure, that works.
> 
> > > > I was thinking we could simply add one more parameter for taking
> > > > cpr uri, like:
> > > > 
> > > >       {
> > > >           .name       = "migrate",
> > > >           .args_type  = "detach:-d,resume:-r,uri:s,cpr:s?",
> > > >           .params     = "[-d] [-r] uri [cpr_uri]",
> > > >           .help       = "migrate to URI (using -d to not wait for completion)"
> > > > 		      "\n\t\t\t -r to resume a paused postcopy migration",
> > > > 		      "\n\t\t\t Setup cpr_uri to migrate with cpr-transfer",
> > > >           .cmd        = hmp_migrate,
> > > >       },
> > > 
> > > That's fine.
> > > 
> > > I do like the incoming syntax, though, instead of -cpr-uri.  What do you think?
> > 
> > That'll definitely be lovely if possible, though would any monitor be alive
> > at all before taking a cpr stream, with this series alone?  I thought you
> > dropped the precreate, then QEMU isn't able to run the monitor loop until
> > cpr-uri is loaded.
> 
> No monitor or precreate changes.  I would parse -incoming, extract and use the cpr
> channel early, and use the main channel later as usual.  It's just a different way of
> specifying cpr-uri.  I like it because the specification language is more consistent,
> referring to a "cpr channel" both on the outgoing and incoming side:
> 
>   This mode requires a second migration channel named "cpr", included in
>   the channel arguments of the migrate command on the outgoing side, and
>   in the QEMU -incoming parameter on the incoming side.  The channel must
>   be a type, such as unix socket, that supports SCM_RIGHTS.

Ah, that's ok at least to me.  I hope defer could still work (for Libvirt),
though.  Probably something like main=defer,cpr=XXX.

-- 
Peter Xu




* Re: [PATCH V3 11/16] migration: cpr-transfer mode
  2024-11-19 21:48                       ` Peter Xu
@ 2024-11-19 21:51                         ` Steven Sistare
  0 siblings, 0 replies; 86+ messages in thread
From: Steven Sistare @ 2024-11-19 21:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
	Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
	Daniel P. Berrange, Markus Armbruster

On 11/19/2024 4:48 PM, Peter Xu wrote:
> On Tue, Nov 19, 2024 at 04:41:07PM -0500, Steven Sistare wrote:
>> On 11/19/2024 4:29 PM, Peter Xu wrote:
>>> On Tue, Nov 19, 2024 at 04:03:08PM -0500, Steven Sistare wrote:
>>>> On 11/19/2024 3:51 PM, Peter Xu wrote:
>>>>> On Tue, Nov 19, 2024 at 03:32:55PM -0500, Steven Sistare wrote:
>>>>>> This begs the question, should we allow channels to be specified in hmp migrate
>>>>>> commands and for -incoming, in a very simple way?  Like with a prefix naming
>>>>>> the channel.  And eliminate the -cpr-uri argument. Examples:
>>>>>>
>>>>>> (qemu) migrate -d main:tcp:0:44444,cpr:unix:cpr.sock
>>>>>>
>>>>>> qemu -incoming main:tcp:0:44444,cpr:unix:cpr.sock
>>>>>> qemu -incoming main:defer,cpr:unix:cpr.sock
>>>>>
>>>>> IMHO keeping the old syntax working would still be nice to not break
>>>>> scripts.
>>>>
>>>> The channel tag would be optional, so backwards compatible.  Its unambiguous
>>>> as long as the channel names are not also protocol names.
>>>
>>> Ah that's ok then.  Or maybe use "="?
>>>
>>>     "main=XXX,cpr=XXX"
>>>
>>> Then if no "=" it's the old?
>>
>> Sure, that works.
>>
>>>>> I was thinking we could simply add one more parameter for taking
>>>>> cpr uri, like:
>>>>>
>>>>>        {
>>>>>            .name       = "migrate",
>>>>>            .args_type  = "detach:-d,resume:-r,uri:s,cpr:s?",
>>>>>            .params     = "[-d] [-r] uri [cpr_uri]",
>>>>>            .help       = "migrate to URI (using -d to not wait for completion)"
>>>>> 		      "\n\t\t\t -r to resume a paused postcopy migration",
>>>>> 		      "\n\t\t\t Setup cpr_uri to migrate with cpr-transfer",
>>>>>            .cmd        = hmp_migrate,
>>>>>        },
>>>>
>>>> That's fine.
>>>>
>>>> I do like the incoming syntax, though, instead of -cpr-uri.  What do you think?
>>>
>>> That'll definitely be lovely if possible, though would any monitor be alive
>>> at all before taking a cpr stream, with this series alone?  I thought you
>>> dropped the precreate, then QEMU isn't able to run the monitor loop until
>>> cpr-uri is loaded.
>>
>> No monitor or precreate changes.  I would parse -incoming, extract and use the cpr
>> channel early, and use the main channel later as usual.  It's just a different way of
>> specifying cpr-uri.  I like it because the specification language is more consistent,
>> referring to a "cpr channel" both on the outgoing and incoming side:
>>
>>    This mode requires a second migration channel named "cpr", included in
>>    the channel arguments of the migrate command on the outgoing side, and
>>    in the QEMU -incoming parameter on the incoming side.  The channel must
>>    be a type, such as unix socket, that supports SCM_RIGHTS.
> 
> Ah, that's ok at least to me.  I hope defer could still work (for Libvirt),
> though.  Probably something like main=defer,cpr=XXX.

Exactly.  defer is in my examples above.

- Steve





* Re: [PATCH V3 11/16] migration: cpr-transfer mode
  2024-11-19 20:32             ` Steven Sistare
  2024-11-19 20:51               ` Peter Xu
@ 2024-11-20  9:38               ` Daniel P. Berrangé
  2024-11-20 16:12                 ` Steven Sistare
  1 sibling, 1 reply; 86+ messages in thread
From: Daniel P. Berrangé @ 2024-11-20  9:38 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Peter Xu, qemu-devel, Fabiano Rosas, David Hildenbrand,
	Marcel Apfelbaum, Eduardo Habkost, Philippe Mathieu-Daude,
	Paolo Bonzini, Markus Armbruster

On Tue, Nov 19, 2024 at 03:32:55PM -0500, Steven Sistare wrote:
> On 11/19/2024 3:16 PM, Peter Xu wrote:
> > On Tue, Nov 19, 2024 at 02:50:40PM -0500, Steven Sistare wrote:
> > > On 11/14/2024 2:04 PM, Peter Xu wrote:
> > > > On Thu, Nov 14, 2024 at 01:36:00PM -0500, Steven Sistare wrote:
> > > > > On 11/13/2024 4:58 PM, Peter Xu wrote:
> > > > > > On Fri, Nov 01, 2024 at 06:47:50AM -0700, Steve Sistare wrote:
> > > > > > > Add the cpr-transfer migration mode.  Usage:
> > > > > > >      qemu-system-$arch -machine anon-alloc=memfd ...
> > > > > > > 
> > > > > > >      start new QEMU with "-incoming <uri-1> -cpr-uri <uri-2>"
> > > > > > > 
> > > > > > >      Issue commands to old QEMU:
> > > > > > >      migrate_set_parameter mode cpr-transfer
> > > > > > >      migrate_set_parameter cpr-uri <uri-2>
> > > > > > >      migrate -d <uri-1>
> > > > > > 
> > > > > > QMP command "migrate" already allows taking MigrationChannel lists, cpr can
> > > > > > be the 2nd supported channel besides "main".
> > > > > > 
> > > > > > I apologize on only noticing this until now.. I wished the incoming side
> > > > > > can do the same already (which also takes 'MigrationChannel') if monitors
> > > > > > init can be moved earlier, and if precreate worked out.  If not, we should
> > > > > > still consider doing that on source, because cpr-uri isn't usable on dest
> > > > > > anyway.. so they need to be treated separately even now.
> > > > > > 
> > > > > > Then after we make the monitor code run earlier in the future we could
> > > > > > introduce that to incoming side too, obsoleting -cpr-uri there.
> > > > > 
> > > > > I have already been shot down on precreate and monitors init, so we are
> > > > > left with specifying a "cpr" channel on the outgoing side, and -cpr-uri
> > > > > on the incoming side.  That will confuse users, will require more implementation
> > > > > and specification work than you perhaps realize to explain this to users,
> > > > 
> > > > What is the specification work?  Can you elaborate?
> > > > 
> > > > > and only gets us halfway to your desired end point of specifying everything
> > > > > using channels.  I don't like that plan!
> > > > > 
> > > > > If we ever get the ability to open the monitor early, then we can implement
> > > > > a complete and clean solution using channels and declare the other options
> > > > > obsolete.
> > > > 
> > > > The sender side doesn't need to wait for destination side to be ready?
> > > > Dest side isn't a reason to me on how we should make sender side work if
> > > > they're totally separate anyway.  Dest requires -cpr-uri because we don't
> > > > yet have a choice.
> > > > 
> > > > Is the only concern about code changes?  I'm expecting this change is far
> > > > less controversial comparing to many others in this series, even if I
> > > > confess that may still contain some diff. They should hopefully be
> > > > straightforward, unlike many of the changes elsewhere in the series.
> > > > 
> > > > If you prefer not writting that patch, I am OK, and I can write one patch
> > > > on top of your series after it lands if that is OK for you. I still want to
> > > > have this there when release 10.0 if I didn't misunderstood anything, so
> > > > I'll be able to remove cpr-uri directly in that patch too.
> > > 
> > > I made the changes:
> > >    * implementation
> > >    * documentation in CPR.rst and QAPI
> > >    * convert sample code in CPR.rst, commit messages, and cover letter to QMP,
> > >      because a channel cannot be specified using HMP.
> > 
> > Yeah we can leave HMP as of now; it can easily be added on top with
> > existing helpers like migrate_uri_parse().
> 
> This begs the question, should we allow channels to be specified in hmp migrate
> commands and for -incoming, in a very simple way?  Like with a prefix naming
> the channel.  And eliminate the -cpr-uri argument. Examples:
> 
> (qemu) migrate -d main:tcp:0:44444,cpr:unix:cpr.sock
> 
> qemu -incoming main:tcp:0:44444,cpr:unix:cpr.sock
> qemu -incoming main:defer,cpr:unix:cpr.sock

As a general rule, if you ever find yourself asking "should we add more
magic parsing logic" to the command line argv, the answer should always
be 'no'.

Any command line args where we need to have more expressive formatting
are getting converted to accept JSON syntax, backed by QAPI modelling.
We were anticipating that '-incoming' should ideally end up deprecated
except for the plain "defer" option, on the expectation that any non-
trivial use of migration needs HMP/QMP regardless. If there's a valid
use case for something other than 'defer', then we need to QAPI-ify
-incoming with JSON syntax IMHO.

Yes, there's still the question of HMP, but personally I'm fine with
leaving feature gaps in HMP and expecting people to use QMP. HMP shares
all the same flaws as our old approach to the CLI, of needing to invent
arbitrary magic syntaxes which has proved to be an undesirable path to
take in general. I see HMP as being there for the 80% common / simple
cases, and if you need to go beyond that, then QMP is there for you.
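
To make the QMP route concrete, here is a minimal sketch of a client
building the "migrate" command with both channels.  The JSON layout is
copied from the proposed CPR.rst example quoted earlier in this thread;
the helper name build_migrate_command and its parameters are
illustrative assumptions, and sending the command over a QMP socket is
left out.

```python
import json


def build_migrate_command(main_host, main_port, cpr_path):
    """Return a QMP 'migrate' command with a main and a cpr channel."""
    return {
        "execute": "migrate",
        "arguments": {"channels": [
            # Main channel: an inet socket, as in the quoted example.
            {"channel-type": "main",
             "addr": {"transport": "socket", "type": "inet",
                      "host": main_host, "port": str(main_port)}},
            # cpr channel: a unix socket, so that it supports SCM_RIGHTS.
            {"channel-type": "cpr",
             "addr": {"transport": "socket", "type": "unix",
                      "path": cpr_path}},
        ]},
    }


print(json.dumps(build_migrate_command("0", 44444, "cpr.sock")))
```

Because the channel list is structured data, no new string syntax is
needed on the outgoing side, which is the point being made above.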


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH V3 11/16] migration: cpr-transfer mode
  2024-11-20  9:38               ` Daniel P. Berrangé
@ 2024-11-20 16:12                 ` Steven Sistare
  2024-11-20 16:26                   ` Daniel P. Berrangé
  0 siblings, 1 reply; 86+ messages in thread
From: Steven Sistare @ 2024-11-20 16:12 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Peter Xu, qemu-devel, Fabiano Rosas, David Hildenbrand,
	Marcel Apfelbaum, Eduardo Habkost, Philippe Mathieu-Daude,
	Paolo Bonzini, Markus Armbruster

On 11/20/2024 4:38 AM, Daniel P. Berrangé wrote:
> On Tue, Nov 19, 2024 at 03:32:55PM -0500, Steven Sistare wrote:
>> On 11/19/2024 3:16 PM, Peter Xu wrote:
>>> On Tue, Nov 19, 2024 at 02:50:40PM -0500, Steven Sistare wrote:
>>>> On 11/14/2024 2:04 PM, Peter Xu wrote:
>>>>> On Thu, Nov 14, 2024 at 01:36:00PM -0500, Steven Sistare wrote:
>>>>>> On 11/13/2024 4:58 PM, Peter Xu wrote:
>>>>>>> On Fri, Nov 01, 2024 at 06:47:50AM -0700, Steve Sistare wrote:
>>>>>>>> Add the cpr-transfer migration mode.  Usage:
>>>>>>>>       qemu-system-$arch -machine anon-alloc=memfd ...
>>>>>>>>
>>>>>>>>       start new QEMU with "-incoming <uri-1> -cpr-uri <uri-2>"
>>>>>>>>
>>>>>>>>       Issue commands to old QEMU:
>>>>>>>>       migrate_set_parameter mode cpr-transfer
>>>>>>>>       migrate_set_parameter cpr-uri <uri-2>
>>>>>>>>       migrate -d <uri-1>
>>>>>>>
>>>>>>> QMP command "migrate" already allows taking MigrationChannel lists, cpr can
>>>>>>> be the 2nd supported channel besides "main".
>>>>>>>
>>>>>>> I apologize for only noticing this now.  I wish the incoming side
>>>>>>> could already do the same (it also takes 'MigrationChannel') if monitor
>>>>>>> init could be moved earlier, and if precreate worked out.  If not, we should
>>>>>>> still consider doing that on source, because cpr-uri isn't usable on dest
>>>>>>> anyway, so they need to be treated separately even now.
>>>>>>>
>>>>>>> Then after we make the monitor code run earlier in the future we could
>>>>>>> introduce that to incoming side too, obsoleting -cpr-uri there.
>>>>>>
>>>>>> I have already been shot down on precreate and monitors init, so we are
>>>>>> left with specifying a "cpr" channel on the outgoing side, and -cpr-uri
>>>>>> on the incoming side.  That will confuse users, will require more implementation
>>>>>> and specification work than you perhaps realize to explain this to users,
>>>>>
>>>>> What is the specification work?  Can you elaborate?
>>>>>
>>>>>> and only gets us halfway to your desired end point of specifying everything
>>>>>> using channels.  I don't like that plan!
>>>>>>
>>>>>> If we ever get the ability to open the monitor early, then we can implement
>>>>>> a complete and clean solution using channels and declare the other options
>>>>>> obsolete.
>>>>>
>>>>> The sender side doesn't need to wait for destination side to be ready?
>>>>> Dest side isn't a reason to me on how we should make sender side work if
>>>>> they're totally separate anyway.  Dest requires -cpr-uri because we don't
>>>>> yet have a choice.
>>>>>
>>>>> Is the only concern about code changes?  I'm expecting this change is far
>>>>> less controversial compared to many others in this series, even if I
>>>>> confess that may still contain some diff. They should hopefully be
>>>>> straightforward, unlike many of the changes elsewhere in the series.
>>>>>
>>>>> If you prefer not writing that patch, I am OK, and I can write one patch
>>>>> on top of your series after it lands if that is OK for you. I still want to
>>>>> have this there for release 10.0, if I haven't misunderstood anything, so
>>>>> I'll be able to remove cpr-uri directly in that patch too.
>>>>
>>>> I made the changes:
>>>>     * implementation
>>>>     * documentation in CPR.rst and QAPI
>>>>     * convert sample code in CPR.rst, commit messages, and cover letter to QMP,
>>>>       because a channel cannot be specified using HMP.
>>>
>>> Yeah we can leave HMP as of now; it can easily be added on top with
>>> existing helpers like migrate_uri_parse().
>>
>> This begs the question, should we allow channels to be specified in hmp migrate
>> commands and for -incoming, in a very simple way?  Like with a prefix naming
>> the channel.  And eliminate the -cpr-uri argument. Examples:
>>
>> (qemu) migrate -d main:tcp:0:44444,cpr:unix:cpr.sock
>>
>> qemu -incoming main:tcp:0:44444,cpr:unix:cpr.sock
>> qemu -incoming main:defer,cpr:unix:cpr.sock
> 
> As a general rule, if you ever find yourself asking "should we add more
> magic parsing logic" to the command line argv, the answer should always
> be 'no'.
> 
> Any command line args where we need to have more expressive formatting
> are getting converted to accept JSON syntax, backed by QAPI modelling.
> We were anticipating that '-incoming' should ideally end up deprecated
> except for the plain "defer" option, on the expectation that any
> non-trivial use of migration needs HMP/QMP regardless. If there's a valid
> use case for something other than 'defer', then we need to QAPI-ify
> -incoming with JSON syntax IMHO.

Hi Daniel, thank you for the guidance.

CPR needs to open and read its channel before the monitor is available,
so the cpr uri must be passed on the command line in some form.  Is that
sufficient reason to violate your general rule?

If not, would you support the -cpr-uri command-line option?

If not, that leaves us with QAPI-ifying -incoming, which is messy, because
MigrationChannel has a nested type structure.  We would need to define
a flattened list of properties and duplicate much of the existing specification.
Unless, it could take a JSON object as its value, with all the {}:" syntax,
and be parsed with visit_type_MigrationChannel.  But I do not see any
precedent for that in other command-line arguments.

Of these, I still think "qemu -incoming main:tcp:0:44444,cpr:unix:cpr.sock"
is the least worst option.  We could further simplify it by allowing the
option multiple times, and only recognizing the additional "cpr" prefix.

   qemu -incoming tcp:0:44444 -incoming cpr:unix:cpr.sock
   qemu -incoming defer -incoming cpr:unix:cpr.sock
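
A minimal sketch of the repeated-option idea, in Python purely for
illustration.  The helper name split_incoming and the returned dictionary
are hypothetical, not QEMU code; the only rule taken from the examples
above is that a "cpr:" prefix names the cpr channel and anything else is
the main channel.

```python
def split_incoming(values):
    """Map each repeated -incoming value to a channel name.

    Hypothetical helper: only the "cpr:" prefix is recognized, and any
    other value is treated as the main channel.
    """
    channels = {}
    for value in values:
        if value.startswith("cpr:"):
            name, uri = "cpr", value[len("cpr:"):]
        else:
            name, uri = "main", value
        if name in channels:
            # Reject a second value for the same channel rather than guess.
            raise ValueError(f"channel {name!r} specified more than once")
        channels[name] = uri
    return channels


print(split_incoming(["tcp:0:44444", "cpr:unix:cpr.sock"]))
print(split_incoming(["defer", "cpr:unix:cpr.sock"]))
```

The existing URI forms pass through untouched, so the only new syntax in
this variant is the single "cpr:" prefix.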

Your further comments, please.  I need a way forward that you and other
maintainers will support.

> Yes, there's still the question of HMP, but personally I'm fine with
> leaving feature gaps in HMP and expecting people to use QMP. HMP shares
> all the same flaws as our old approach to the CLI, of needing to invent
> arbitrary magic syntaxes which has proved to be an undesirable path to
> take in general. I see HMP as being there for the 80% common / simple
> cases, and if you need to go beyond that, then QMP is there for you.

Fine with me.

- Steve



* Re: [PATCH V3 11/16] migration: cpr-transfer mode
  2024-11-20 16:12                 ` Steven Sistare
@ 2024-11-20 16:26                   ` Daniel P. Berrangé
  0 siblings, 0 replies; 86+ messages in thread
From: Daniel P. Berrangé @ 2024-11-20 16:26 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Peter Xu, qemu-devel, Fabiano Rosas, David Hildenbrand,
	Marcel Apfelbaum, Eduardo Habkost, Philippe Mathieu-Daude,
	Paolo Bonzini, Markus Armbruster

On Wed, Nov 20, 2024 at 11:12:51AM -0500, Steven Sistare wrote:
> On 11/20/2024 4:38 AM, Daniel P. Berrangé wrote:
> > On Tue, Nov 19, 2024 at 03:32:55PM -0500, Steven Sistare wrote:
> > > 
> > > This begs the question, should we allow channels to be specified in hmp migrate
> > > commands and for -incoming, in a very simple way?  Like with a prefix naming
> > > the channel.  And eliminate the -cpr-uri argument. Examples:
> > > 
> > > (qemu) migrate -d main:tcp:0:44444,cpr:unix:cpr.sock
> > > 
> > > qemu -incoming main:tcp:0:44444,cpr:unix:cpr.sock
> > > qemu -incoming main:defer,cpr:unix:cpr.sock
> > 
> > As a general rule, if you ever find yourself asking "should we add more
> > magic parsing logic" to the command line argv, the answer should always
> > be 'no'.
> > 
> > Any command line args where we need to have more expressive formatting
> > are getting converted to accept JSON syntax, backed by QAPI modelling.
> > We were anticipating that '-incoming' should ideally end up deprecated
> > except for the plain "defer" option, on the expectation that any
> > non-trivial use of migration needs HMP/QMP regardless. If there's a valid
> > use case for something other than 'defer', then we need to QAPI-ify
> > -incoming with JSON syntax IMHO.
> 
> Hi Daniel, thank you for the guidance.
> 
> CPR needs to open and read its channel before the monitor is available,
> so the cpr uri must be passed on the command line in some form.  Is that
> sufficient reason to violate your general rule?

Not really. IMHO it is still viable to define a CLI arg using JSON and
QAPI, even if there's no need to use it from QMP.

> If not, would you support the -cpr-uri command-line option?
> 
> If not, that leaves us with QAPI-ifying -incoming, which is messy, because
> MigrationChannel has a nested type structure.  We would need to define
> a flattened list of properties and duplicate much of the existing specification.
> Unless, it could take a JSON object as its value, with all the {}:" syntax,
> and be parsed with visit_type_MigrationChannel.  But I do not see any
> precedent for that in other command-line arguments.

Using JSON syntax exclusively is exactly what I'm suggesting. While some
command line args have invented ways to express nested types, we don't
really want to be in that business anymore. Anything complex should be
JSON syntax on the command line. We support this with -object, -device,
-audiodev, -netdev, -blockdev already, and eventually expect everything
to support JSON syntax.

You can see this in practice in libvirt, where we'll prefer JSON syntax
for any args that support it:

  https://gitlab.com/libvirt/libvirt/-/blob/master/tests/qemuxmlconfdata/x86_64-q35-graphics.x86_64-latest.args

The approach to retrofitting to an existing cli arg is pretty crude but
effective in QEMU. Just look if the first character is '{' and if so,
switch to QAPI based parsing instead of legacy parsing.
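
As a rough illustration of that first-character check, here is a sketch in
Python rather than QEMU's C.  The function name parse_incoming and the
"uri" key on the legacy path are hypothetical; json.loads stands in for
the QAPI visitor, and the "cpr" channel type is the one proposed in this
thread, so treat it as an assumption.

```python
import json


def parse_incoming(optarg):
    """Dispatch on the first character of the option value."""
    if optarg.startswith("{"):
        # JSON path: QEMU would hand this to a QAPI visitor for validation;
        # json.loads stands in for that step here.
        return json.loads(optarg)
    # Legacy path: keep the plain URI string and treat it as the main channel.
    return {"channel-type": "main", "uri": optarg}


legacy = parse_incoming("tcp:0:44444")
modern = parse_incoming(
    '{"channel-type": "cpr",'
    ' "addr": {"transport": "socket", "type": "unix", "path": "cpr.sock"}}'
)
print(legacy)
print(modern)
```

Because no legacy -incoming value starts with '{', the existing syntax
keeps working unchanged while the JSON form can carry the nested address
structure directly.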

> Of these, I still think "qemu -incoming main:tcp:0:44444,cpr:unix:cpr.sock"
> is the least worst option.  We could further simplify it by allowing the
> option multiple times, and only recognizing the additional "cpr" prefix.
> 
>   qemu -incoming tcp:0:44444 -incoming cpr:unix:cpr.sock
>   qemu -incoming defer -incoming cpr:unix:cpr.sock
> 
> Your further comments, please.  I need a way forward that you and other
> maintainers will support.

In terms of where we wire up CPR, -incoming or -cpr-uri is fairly
arbitrary and I'm not seeing (easy) better answers.

The (hard) better answer would potentially be to leverage '-object'
to create the migration state object but that would be a massive
pile of work, that is unreasonable to ask you to experiment with.

> 
> > Yes, there's still the question of HMP, but personally I'm fine with
> > leaving feature gaps in HMP and expecting people to use QMP. HMP shares
> > all the same flaws as our old approach to the CLI, of needing to invent
> > arbitrary magic syntaxes which has proved to be an undesirable path to
> > take in general. I see HMP as being there for the 80% common / simple
> > cases, and if you need to go beyond that, then QMP is there for you.
> 
> Fine with me.
> 
> - Steve
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




Thread overview: 86+ messages
2024-11-01 13:47 [PATCH V3 00/16] Live update: cpr-transfer Steve Sistare
2024-11-01 13:47 ` [PATCH V3 01/16] machine: anon-alloc option Steve Sistare
2024-11-01 14:06   ` Peter Xu
2024-11-04 10:39   ` David Hildenbrand
2024-11-04 10:45     ` David Hildenbrand
2024-11-04 17:38     ` Steven Sistare
2024-11-04 19:51       ` David Hildenbrand
2024-11-04 20:14         ` Peter Xu
2024-11-04 20:17           ` David Hildenbrand
2024-11-04 20:41             ` Peter Xu
2024-11-04 20:15         ` David Hildenbrand
2024-11-04 20:56           ` Steven Sistare
2024-11-04 21:36             ` David Hildenbrand
2024-11-06 20:12               ` Steven Sistare
2024-11-06 20:41                 ` Peter Xu
2024-11-06 20:59                   ` Steven Sistare
2024-11-06 21:21                     ` Peter Xu
2024-11-07 14:03                       ` Steven Sistare
2024-11-07 13:05                     ` David Hildenbrand
2024-11-07 14:04                       ` Steven Sistare
2024-11-07 16:19                         ` David Hildenbrand
2024-11-07 18:13                           ` Steven Sistare
2024-11-07 16:32                         ` Peter Xu
2024-11-07 16:38                           ` David Hildenbrand
2024-11-07 17:48                             ` Peter Xu
2024-11-07 13:23                 ` David Hildenbrand
2024-11-07 16:02                   ` Steven Sistare
2024-11-07 16:26                     ` David Hildenbrand
2024-11-07 16:40                       ` Steven Sistare
2024-11-08 11:31                         ` David Hildenbrand
2024-11-08 13:43                           ` Peter Xu
2024-11-08 14:14                             ` Steven Sistare
2024-11-08 14:32                               ` Peter Xu
2024-11-08 14:18                             ` David Hildenbrand
2024-11-08 15:01                               ` Peter Xu
2024-11-08 13:56                           ` Steven Sistare
2024-11-08 14:20                             ` David Hildenbrand
2024-11-08 14:37                               ` Steven Sistare
2024-11-08 14:54                                 ` David Hildenbrand
2024-11-08 15:07                                   ` Peter Xu
2024-11-08 15:09                                     ` David Hildenbrand
2024-11-08 15:15                                   ` David Hildenbrand
2024-11-01 13:47 ` [PATCH V3 02/16] migration: cpr-state Steve Sistare
2024-11-13 20:36   ` Peter Xu
2024-11-01 13:47 ` [PATCH V3 03/16] physmem: preserve ram blocks for cpr Steve Sistare
2024-11-01 13:47 ` [PATCH V3 04/16] hostmem-memfd: preserve " Steve Sistare
2024-11-01 13:47 ` [PATCH V3 05/16] migration: SCM_RIGHTS for QEMUFile Steve Sistare
2024-11-13 20:54   ` Peter Xu
2024-11-14 18:34     ` Steven Sistare
2024-11-01 13:47 ` [PATCH V3 06/16] migration: VMSTATE_FD Steve Sistare
2024-11-13 20:55   ` Peter Xu
2024-11-01 13:47 ` [PATCH V3 07/16] migration: cpr-transfer save and load Steve Sistare
2024-11-01 13:47 ` [PATCH V3 08/16] migration: cpr-uri parameter Steve Sistare
2024-11-01 13:47 ` [PATCH V3 09/16] migration: cpr-uri option Steve Sistare
2024-11-01 13:47 ` [PATCH V3 10/16] migration: split qmp_migrate Steve Sistare
2024-11-13 21:11   ` Peter Xu
2024-11-14 18:33     ` Steven Sistare
2024-11-01 13:47 ` [PATCH V3 11/16] migration: cpr-transfer mode Steve Sistare
2024-11-13 21:58   ` Peter Xu
2024-11-14 18:36     ` Steven Sistare
2024-11-14 19:04       ` Peter Xu
2024-11-19 19:50         ` Steven Sistare
2024-11-19 20:16           ` Peter Xu
2024-11-19 20:32             ` Steven Sistare
2024-11-19 20:51               ` Peter Xu
2024-11-19 21:03                 ` Steven Sistare
2024-11-19 21:29                   ` Peter Xu
2024-11-19 21:41                     ` Steven Sistare
2024-11-19 21:48                       ` Peter Xu
2024-11-19 21:51                         ` Steven Sistare
2024-11-20  9:38               ` Daniel P. Berrangé
2024-11-20 16:12                 ` Steven Sistare
2024-11-20 16:26                   ` Daniel P. Berrangé
2024-11-01 13:47 ` [PATCH V3 12/16] tests/migration-test: memory_backend Steve Sistare
2024-11-13 22:19   ` Fabiano Rosas
2024-11-01 13:47 ` [PATCH V3 13/16] tests/qtest: defer connection Steve Sistare
2024-11-13 22:36   ` Fabiano Rosas
2024-11-14 18:45     ` Steven Sistare
2024-11-13 22:53   ` Peter Xu
2024-11-14 18:31     ` Steven Sistare
2024-11-01 13:47 ` [PATCH V3 14/16] tests/migration-test: " Steve Sistare
2024-11-14 12:46   ` Fabiano Rosas
2024-11-01 13:47 ` [PATCH V3 15/16] migration-test: cpr-transfer Steve Sistare
2024-11-01 13:47 ` [PATCH V3 16/16] migration: cpr-transfer documentation Steve Sistare
2024-11-13 22:02   ` Peter Xu
2024-11-14 18:31     ` Steven Sistare
