* [PATCH V2 00/11] Live update: cpr-exec
@ 2024-06-30 19:40 Steve Sistare
2024-06-30 19:40 ` [PATCH V2 01/11] machine: alloc-anon option Steve Sistare
` (11 more replies)
0 siblings, 12 replies; 77+ messages in thread
From: Steve Sistare @ 2024-06-30 19:40 UTC (permalink / raw)
To: qemu-devel
Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster, Steve Sistare
What?
This patch series adds the live migration cpr-exec mode, which allows
the user to update QEMU with minimal guest pause time, by preserving
guest RAM in place, albeit with new virtual addresses in new QEMU, and
by preserving device file descriptors.
The new user-visible interfaces are:
* cpr-exec (MigMode migration parameter)
* cpr-exec-command (migration parameter)
* anon-alloc (command-line option for -machine)
The user sets the mode parameter before invoking the migrate command.
In this mode, the user issues the migrate command to old QEMU, which
stops the VM and saves state to the migration channels. Old QEMU then
exec's new QEMU, replacing the original process while retaining its PID.
The user specifies the command to exec new QEMU in the migration parameter
cpr-exec-command. The command must pass all old QEMU arguments to new
QEMU, plus the -incoming option. Execution resumes in new QEMU.
Memory-backend objects must have the share=on attribute, but
memory-backend-epc is not supported. The VM must be started
with the '-machine anon-alloc=memfd' option, which allows anonymous
memory to be transferred in place to the new process.
Why?
This mode has less impact on the guest than any other method of updating
in place. The pause time is much lower, because devices need not be torn
down and recreated, DMA does not need to be drained and quiesced, and minimal
state is copied to new QEMU. Further, there are no constraints on the guest.
By contrast, cpr-reboot mode requires the guest to support S3 suspend-to-ram,
and suspending plus resuming vfio devices adds multiple seconds to the
guest pause time. Lastly, there is no loss of connectivity to the guest,
because chardev descriptors remain open and connected.
These benefits all derive from the core design principle of this mode,
which is preserving open descriptors. This approach is very general and
can be used to support a wide variety of devices that do not have hardware
support for live migration, including but not limited to: vfio, chardev,
vhost, vdpa, and iommufd. Some devices need new kernel software interfaces
to allow a descriptor to be used in a process that did not originally open it.
In a containerized QEMU environment, cpr-exec reuses an existing QEMU
container and its assigned resources. By contrast, consider a design in
which a new container is created on the same host as the target of the
CPR operation. Resources must be reserved for the new container, while
the old container still reserves resources until the operation completes.
Avoiding overcommitment requires extra work in the management layer.
This is one reason why a cloud provider may prefer cpr-exec. A second reason
is that the container may include agents with their own connections to the
outside world, and such connections remain intact if the container is reused.
How?
All memory that is mapped by the guest is preserved in place. Indeed,
it must be, because it may be the target of DMA requests, which are not
quiesced during cpr-exec. All such memory must be mmap'able in new QEMU.
This is easy for named memory-backend objects, as long as they are mapped
shared, because they are visible in the file system in both old and new QEMU.
Anonymous memory must be allocated using memfd_create rather than MAP_ANON,
so the memfds can be sent to new QEMU. Pages that were locked in memory
for DMA in old QEMU remain locked in new QEMU, because the descriptor of
the device that locked them remains open.
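The property this relies on can be shown in a few lines. The sketch below is illustrative, not QEMU code: a memfd refers to one set of pages no matter where it is mapped, so a process that receives the fd (here, new QEMU after exec) sees the same contents at whatever virtual address it chooses.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/* Map a memfd shared; every mapping of the same fd aliases the same
 * pages, which is why guest RAM survives the switch to new QEMU. */
static void *map_guest_ram(int mfd, size_t size)
{
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, mfd, 0);
}
```

A write through one mapping is immediately visible through any other mapping of the same memfd, in this or any other process holding the fd.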
cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,
and by sending the unique name and value of each descriptor to new QEMU
via CPR state.
For device descriptors, new QEMU reuses the descriptor when creating the
device, rather than opening it again. The same holds for chardevs. For
memfd descriptors, new QEMU mmap's the preserved memfd when a ramblock
is created.
CPR state cannot be sent over the normal migration channel, because devices
and backends are created prior to reading the channel, so this mode sends
CPR state over a second migration channel that is not visible to the user.
New QEMU reads the second channel prior to creating devices or backends.
The exec itself is trivial. After writing to the migration channels, the
migration code calls a new main-loop hook to perform the exec.
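In the simplest terms, the hook boils down to an execvp of the argv assembled from cpr-exec-command (sketch below, not the series' actual hook). execvp only returns on failure, in which case old QEMU still holds all of its state and can report the error:

```c
#include <unistd.h>

/* Replace this process with new QEMU, retaining the PID and any
 * descriptors whose CLOEXEC flag was cleared. Illustrative only. */
static int exec_new_qemu(char *const argv[])
{
    execvp(argv[0], argv);
    return -1; /* reached only if the exec failed */
}
```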
Example:
In this example, we simply restart the same version of QEMU, but in
a real scenario one would use a new QEMU binary path in cpr-exec-command.
# qemu-kvm -monitor stdio -object
memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on
-m 4G -machine anon-alloc=memfd ...
QEMU 9.1.50 monitor - type 'help' for more information
(qemu) info status
VM status: running
(qemu) migrate_set_parameter mode cpr-exec
(qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state
(qemu) migrate -d file:vm.state
(qemu) QEMU 9.1.50 monitor - type 'help' for more information
(qemu) info status
VM status: running
This patch series implements a minimal version of cpr-exec. Additional
series are ready to be posted to deliver the complete vision described
above, including
* vfio
* chardev
* vhost
* blockers
* hostmem-memfd
* migration-test cases
Works in progress include:
* vdpa
* iommufd
* cpr-transfer mode
Changes since V1:
* Dropped precreate and factory patches. Added CPR state instead.
* Dropped patches that refactor ramblock allocation
* Dropped vmstate_info_void patch (peter)
* Dropped patch "seccomp: cpr-exec blocker" (Daniel)
* Redefined memfd-alloc option as anon-alloc
* No longer preserve ramblock fields, except for fd (peter)
* Added fd preservation functions in CPR state
* Hoisted cpr code out of migrate_fd_cleanup (fabiano)
* Revised migration.json docs (markus)
* Fixed qtest failures (fabiano)
* Renamed SAVEVM_FOREACH macros (fabiano)
* Renamed cpr-exec-args as cpr-exec-command (markus)
The first 6 patches below are foundational and are needed for both cpr-exec
mode and cpr-transfer mode. The last 5 patches are specific to cpr-exec
and implement the mechanisms for sharing state across exec.
Steve Sistare (11):
machine: alloc-anon option
migration: cpr-state
migration: save cpr mode
migration: stop vm earlier for cpr
physmem: preserve ram blocks for cpr
migration: fix mismatched GPAs during cpr
oslib: qemu_clear_cloexec
vl: helper to request exec
migration: cpr-exec-command parameter
migration: cpr-exec save and load
migration: cpr-exec mode
hmp-commands.hx | 2 +-
hw/core/machine.c | 24 +++++
include/exec/memory.h | 12 +++
include/hw/boards.h | 1 +
include/migration/cpr.h | 35 ++++++
include/qemu/osdep.h | 9 ++
include/sysemu/runstate.h | 3 +
migration/cpr-exec.c | 180 +++++++++++++++++++++++++++++++
migration/cpr.c | 238 +++++++++++++++++++++++++++++++++++++++++
migration/meson.build | 2 +
migration/migration-hmp-cmds.c | 25 +++++
migration/migration.c | 43 ++++++--
migration/options.c | 23 +++-
migration/ram.c | 17 +--
migration/trace-events | 5 +
qapi/machine.json | 14 +++
qapi/migration.json | 45 +++++++-
qemu-options.hx | 13 +++
system/memory.c | 22 +++-
system/physmem.c | 61 ++++++++++-
system/runstate.c | 29 +++++
system/trace-events | 3 +
system/vl.c | 3 +
util/oslib-posix.c | 9 ++
util/oslib-win32.c | 4 +
25 files changed, 792 insertions(+), 30 deletions(-)
create mode 100644 include/migration/cpr.h
create mode 100644 migration/cpr-exec.c
create mode 100644 migration/cpr.c
--
1.8.3.1
* [PATCH V2 01/11] machine: alloc-anon option
2024-06-30 19:40 [PATCH V2 00/11] Live update: cpr-exec Steve Sistare
@ 2024-06-30 19:40 ` Steve Sistare
2024-07-15 17:52 ` Fabiano Rosas
2024-07-16 9:19 ` Igor Mammedov
2024-06-30 19:40 ` [PATCH V2 02/11] migration: cpr-state Steve Sistare
` (10 subsequent siblings)
11 siblings, 2 replies; 77+ messages in thread
From: Steve Sistare @ 2024-06-30 19:40 UTC (permalink / raw)
To: qemu-devel
Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster, Steve Sistare
Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
on the value of the anon-alloc machine property. This affects
memory-backend-ram objects, guest RAM created with the global -m option
but without an associated memory-backend object and without the -mem-path
option, and various memory regions such as ROMs that are allocated when
devices are created. This option does not affect memory-backend-file,
memory-backend-memfd, or memory-backend-epc objects.
The memfd option is intended to support new migration modes, in which the
memory region can be transferred in place to a new QEMU process, by sending
the memfd file descriptor to the process. Memory contents are preserved,
and if the mode also transfers device descriptors, then pages that are
locked in memory for DMA remain locked. This behavior is a prerequisite
for supporting vfio, vdpa, and iommufd devices with the new modes.
To access the same memory in the old and new QEMU processes, the memory
must be mapped shared. Therefore, the implementation always sets
RAM_SHARED if anon-alloc=memfd, except for memory-backend-ram, where the
user must explicitly specify the share option. In lieu of defining a new
RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
as the condition for calling memfd_create.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/core/machine.c | 24 ++++++++++++++++++++++++
include/hw/boards.h | 1 +
qapi/machine.json | 14 ++++++++++++++
qemu-options.hx | 13 +++++++++++++
system/memory.c | 12 +++++++++---
system/physmem.c | 38 +++++++++++++++++++++++++++++++++++++-
system/trace-events | 3 +++
7 files changed, 101 insertions(+), 4 deletions(-)
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 655d75c..7ca2ad0 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -454,6 +454,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
ms->mem_merge = value;
}
+static int machine_get_anon_alloc(Object *obj, Error **errp)
+{
+ MachineState *ms = MACHINE(obj);
+
+ return ms->anon_alloc;
+}
+
+static void machine_set_anon_alloc(Object *obj, int value, Error **errp)
+{
+ MachineState *ms = MACHINE(obj);
+
+ ms->anon_alloc = value;
+}
+
static bool machine_get_usb(Object *obj, Error **errp)
{
MachineState *ms = MACHINE(obj);
@@ -1066,6 +1080,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
object_class_property_set_description(oc, "mem-merge",
"Enable/disable memory merge support");
+ object_class_property_add_enum(oc, "anon-alloc", "AnonAllocOption",
+ &AnonAllocOption_lookup,
+ machine_get_anon_alloc,
+ machine_set_anon_alloc);
+
object_class_property_add_bool(oc, "usb",
machine_get_usb, machine_set_usb);
object_class_property_set_description(oc, "usb",
@@ -1416,6 +1435,11 @@ static bool create_default_memdev(MachineState *ms, const char *path, Error **er
if (!object_property_set_int(obj, "size", ms->ram_size, errp)) {
goto out;
}
+ if (!object_property_set_bool(obj, "share",
+ ms->anon_alloc == ANON_ALLOC_OPTION_MEMFD,
+ errp)) {
+ goto out;
+ }
object_property_add_child(object_get_objects_root(), mc->default_ram_id,
obj);
/* Ensure backend's memory region name is equal to mc->default_ram_id */
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 73ad319..77f16ad 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -383,6 +383,7 @@ struct MachineState {
bool enable_graphics;
ConfidentialGuestSupport *cgs;
HostMemoryBackend *memdev;
+ AnonAllocOption anon_alloc;
/*
* convenience alias to ram_memdev_id backend memory region
* or to numa container memory region
diff --git a/qapi/machine.json b/qapi/machine.json
index 2fd3e9c..9173953 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -1881,3 +1881,17 @@
{ 'command': 'x-query-interrupt-controllers',
'returns': 'HumanReadableText',
'features': [ 'unstable' ]}
+
+##
+# @AnonAllocOption:
+#
+# An enumeration of the options for allocating anonymous guest memory.
+#
+# @mmap: allocate using mmap MAP_ANON
+#
+# @memfd: allocate using memfd_create
+#
+# Since: 9.1
+##
+{ 'enum': 'AnonAllocOption',
+ 'data': [ 'mmap', 'memfd' ] }
diff --git a/qemu-options.hx b/qemu-options.hx
index 8ca7f34..595b693 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -38,6 +38,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
" nvdimm=on|off controls NVDIMM support (default=off)\n"
" memory-encryption=@var{} memory encryption object to use (default=none)\n"
" hmat=on|off controls ACPI HMAT support (default=off)\n"
+ " anon-alloc=mmap|memfd allocate anonymous guest RAM using mmap MAP_ANON or memfd_create (default: mmap)\n"
" memory-backend='backend-id' specifies explicitly provided backend for main RAM (default=none)\n"
" cxl-fmw.0.targets.0=firsttarget,cxl-fmw.0.targets.1=secondtarget,cxl-fmw.0.size=size[,cxl-fmw.0.interleave-granularity=granularity]\n",
QEMU_ARCH_ALL)
@@ -101,6 +102,18 @@ SRST
Enables or disables ACPI Heterogeneous Memory Attribute Table
(HMAT) support. The default is off.
+ ``anon-alloc=mmap|memfd``
+ Allocate anonymous guest RAM using mmap MAP_ANON (the default)
+ or memfd_create. This affects memory-backend-ram objects,
+ RAM created with the global -m option but without an
+ associated memory-backend object and without the -mem-path
+ option, and various memory regions such as ROMs that are
+ allocated when devices are created. This option does not
+ affect memory-backend-file, memory-backend-memfd, or
+ memory-backend-epc objects.
+
+ Some migration modes require anon-alloc=memfd.
+
``memory-backend='id'``
An alternative to legacy ``-mem-path`` and ``mem-prealloc`` options.
Allows to use a memory backend as main RAM.
diff --git a/system/memory.c b/system/memory.c
index 2d69521..28a837d 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -1552,8 +1552,10 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr,
uint64_t size,
Error **errp)
{
+ uint32_t flags = (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD) ?
+ RAM_SHARED : 0;
return memory_region_init_ram_flags_nomigrate(mr, owner, name,
- size, 0, errp);
+ size, flags, errp);
}
bool memory_region_init_ram_flags_nomigrate(MemoryRegion *mr,
@@ -1713,8 +1715,10 @@ bool memory_region_init_rom_nomigrate(MemoryRegion *mr,
uint64_t size,
Error **errp)
{
+ uint32_t flags = (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD) ?
+ RAM_SHARED : 0;
if (!memory_region_init_ram_flags_nomigrate(mr, owner, name,
- size, 0, errp)) {
+ size, flags, errp)) {
return false;
}
mr->readonly = true;
@@ -1731,6 +1735,8 @@ bool memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
Error **errp)
{
Error *err = NULL;
+ uint32_t flags = (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD) ?
+ RAM_SHARED : 0;
assert(ops);
memory_region_init(mr, owner, name, size);
mr->ops = ops;
@@ -1738,7 +1744,7 @@ bool memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
mr->terminates = true;
mr->rom_device = true;
mr->destructor = memory_region_destructor_ram;
- mr->ram_block = qemu_ram_alloc(size, 0, mr, &err);
+ mr->ram_block = qemu_ram_alloc(size, flags, mr, &err);
if (err) {
mr->size = int128_zero();
object_unparent(OBJECT(mr));
diff --git a/system/physmem.c b/system/physmem.c
index 33d09f7..efe95ff 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -47,6 +47,7 @@
#include "qemu/qemu-print.h"
#include "qemu/log.h"
#include "qemu/memalign.h"
+#include "qemu/memfd.h"
#include "exec/memory.h"
#include "exec/ioport.h"
#include "sysemu/dma.h"
@@ -54,6 +55,7 @@
#include "sysemu/hw_accel.h"
#include "sysemu/xen-mapcache.h"
#include "trace/trace-root.h"
+#include "trace.h"
#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
#include <linux/falloc.h>
@@ -69,6 +71,8 @@
#include "qemu/pmem.h"
+#include "qapi/qapi-types-migration.h"
+#include "migration/options.h"
#include "migration/vmstate.h"
#include "qemu/range.h"
@@ -1828,6 +1832,32 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
qemu_mutex_unlock_ramlist();
return;
}
+
+ } else if (new_block->flags & RAM_SHARED) {
+ size_t max_length = new_block->max_length;
+ MemoryRegion *mr = new_block->mr;
+ const char *name = memory_region_name(mr);
+
+ new_block->mr->align = QEMU_VMALLOC_ALIGN;
+
+ if (new_block->fd == -1) {
+ new_block->fd = qemu_memfd_create(name, max_length + mr->align,
+ 0, 0, 0, errp);
+ }
+
+ if (new_block->fd >= 0) {
+ int mfd = new_block->fd;
+ qemu_set_cloexec(mfd);
+ new_block->host = file_ram_alloc(new_block, max_length, mfd,
+ false, 0, errp);
+ }
+ if (!new_block->host) {
+ qemu_mutex_unlock_ramlist();
+ return;
+ }
+ memory_try_enable_merging(new_block->host, new_block->max_length);
+ free_on_error = true;
+
} else {
new_block->host = qemu_anon_ram_alloc(new_block->max_length,
&new_block->mr->align,
@@ -1911,6 +1941,9 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
ram_block_notify_add(new_block->host, new_block->used_length,
new_block->max_length);
}
+ trace_ram_block_add(memory_region_name(new_block->mr), new_block->flags,
+ new_block->fd, new_block->used_length,
+ new_block->max_length);
return;
out_free:
@@ -2097,8 +2130,11 @@ RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t maxsz,
void *host),
MemoryRegion *mr, Error **errp)
{
+ uint32_t flags = (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD) ?
+ RAM_SHARED : 0;
+ flags |= RAM_RESIZEABLE;
return qemu_ram_alloc_internal(size, maxsz, resized, NULL,
- RAM_RESIZEABLE, mr, errp);
+ flags, mr, errp);
}
static void reclaim_ramblock(RAMBlock *block)
diff --git a/system/trace-events b/system/trace-events
index 69c9044..f8ebf42 100644
--- a/system/trace-events
+++ b/system/trace-events
@@ -38,3 +38,6 @@ dirtylimit_state_finalize(void)
dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t time_us) "CPU[%d] throttle percent: %" PRIu64 ", throttle adjust time %"PRIi64 " us"
dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set dirty page rate limit %"PRIu64
dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"PRIi64 " us"
+
+# physmem.c
+ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu"
--
1.8.3.1
* [PATCH V2 02/11] migration: cpr-state
2024-06-30 19:40 [PATCH V2 00/11] Live update: cpr-exec Steve Sistare
2024-06-30 19:40 ` [PATCH V2 01/11] machine: alloc-anon option Steve Sistare
@ 2024-06-30 19:40 ` Steve Sistare
2024-07-17 18:39 ` Fabiano Rosas
2024-07-19 15:03 ` Peter Xu
2024-06-30 19:40 ` [PATCH V2 03/11] migration: save cpr mode Steve Sistare
` (9 subsequent siblings)
11 siblings, 2 replies; 77+ messages in thread
From: Steve Sistare @ 2024-06-30 19:40 UTC (permalink / raw)
To: qemu-devel
Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster, Steve Sistare
CPR must save state that is needed after QEMU is restarted, when devices
are realized. Thus the extra state cannot be saved in the migration stream,
as objects must already exist before that stream can be loaded. Instead,
define auxiliary state structures and vmstate descriptions, not associated
with any registered object, and serialize the aux state to a cpr-specific
stream in cpr_state_save. Deserialize in cpr_state_load after QEMU
restarts, before devices are realized.
Provide accessors for clients to register file descriptors for saving.
The mechanism for passing the fds to the new process will be specific
to each migration mode, and added in subsequent patches.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
include/migration/cpr.h | 21 ++++++
migration/cpr.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++++
migration/meson.build | 1 +
migration/migration.c | 6 ++
migration/trace-events | 5 ++
system/vl.c | 3 +
6 files changed, 224 insertions(+)
create mode 100644 include/migration/cpr.h
create mode 100644 migration/cpr.c
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
new file mode 100644
index 0000000..8e7e705
--- /dev/null
+++ b/include/migration/cpr.h
@@ -0,0 +1,21 @@
+/*
+ * Copyright (c) 2021, 2024 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef MIGRATION_CPR_H
+#define MIGRATION_CPR_H
+
+typedef int (*cpr_walk_fd_cb)(int fd);
+void cpr_save_fd(const char *name, int id, int fd);
+void cpr_delete_fd(const char *name, int id);
+int cpr_find_fd(const char *name, int id);
+int cpr_walk_fd(cpr_walk_fd_cb cb);
+void cpr_resave_fd(const char *name, int id, int fd);
+
+int cpr_state_save(Error **errp);
+int cpr_state_load(Error **errp);
+
+#endif
diff --git a/migration/cpr.c b/migration/cpr.c
new file mode 100644
index 0000000..313e74e
--- /dev/null
+++ b/migration/cpr.c
@@ -0,0 +1,188 @@
+/*
+ * Copyright (c) 2021-2024 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "migration/cpr.h"
+#include "migration/misc.h"
+#include "migration/qemu-file.h"
+#include "migration/savevm.h"
+#include "migration/vmstate.h"
+#include "sysemu/runstate.h"
+#include "trace.h"
+
+/*************************************************************************/
+/* cpr state container for all information to be saved. */
+
+typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
+
+typedef struct CprState {
+ CprFdList fds;
+} CprState;
+
+static CprState cpr_state;
+
+/****************************************************************************/
+
+typedef struct CprFd {
+ char *name;
+ unsigned int namelen;
+ int id;
+ int fd;
+ QLIST_ENTRY(CprFd) next;
+} CprFd;
+
+static const VMStateDescription vmstate_cpr_fd = {
+ .name = "cpr fd",
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .fields = (VMStateField[]) {
+ VMSTATE_UINT32(namelen, CprFd),
+ VMSTATE_VBUFFER_ALLOC_UINT32(name, CprFd, 0, NULL, namelen),
+ VMSTATE_INT32(id, CprFd),
+ VMSTATE_INT32(fd, CprFd),
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+void cpr_save_fd(const char *name, int id, int fd)
+{
+ CprFd *elem = g_new0(CprFd, 1);
+
+ trace_cpr_save_fd(name, id, fd);
+ elem->name = g_strdup(name);
+ elem->namelen = strlen(name) + 1;
+ elem->id = id;
+ elem->fd = fd;
+ QLIST_INSERT_HEAD(&cpr_state.fds, elem, next);
+}
+
+static CprFd *find_fd(CprFdList *head, const char *name, int id)
+{
+ CprFd *elem;
+
+ QLIST_FOREACH(elem, head, next) {
+ if (!strcmp(elem->name, name) && elem->id == id) {
+ return elem;
+ }
+ }
+ return NULL;
+}
+
+void cpr_delete_fd(const char *name, int id)
+{
+ CprFd *elem = find_fd(&cpr_state.fds, name, id);
+
+ if (elem) {
+ QLIST_REMOVE(elem, next);
+ g_free(elem->name);
+ g_free(elem);
+ }
+
+ trace_cpr_delete_fd(name, id);
+}
+
+int cpr_find_fd(const char *name, int id)
+{
+ CprFd *elem = find_fd(&cpr_state.fds, name, id);
+ int fd = elem ? elem->fd : -1;
+
+ trace_cpr_find_fd(name, id, fd);
+ return fd;
+}
+
+int cpr_walk_fd(cpr_walk_fd_cb cb)
+{
+ CprFd *elem;
+
+ QLIST_FOREACH(elem, &cpr_state.fds, next) {
+ if (elem->fd >= 0 && cb(elem->fd)) {
+ return 1;
+ }
+ }
+ return 0;
+}
+
+void cpr_resave_fd(const char *name, int id, int fd)
+{
+ CprFd *elem = find_fd(&cpr_state.fds, name, id);
+ int old_fd = elem ? elem->fd : -1;
+
+ if (old_fd < 0) {
+ cpr_save_fd(name, id, fd);
+ } else if (old_fd != fd) {
+ error_setg(&error_fatal,
+ "internal error: cpr fd '%s' id %d value %d "
+ "already saved with a different value %d",
+ name, id, fd, old_fd);
+ }
+}
+/*************************************************************************/
+#define CPR_STATE "CprState"
+
+static const VMStateDescription vmstate_cpr_state = {
+ .name = CPR_STATE,
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .fields = (VMStateField[]) {
+ VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, next),
+ VMSTATE_END_OF_LIST()
+ }
+};
+/*************************************************************************/
+
+int cpr_state_save(Error **errp)
+{
+ int ret;
+ QEMUFile *f;
+
+ /* set f based on mode in a later patch in this series */
+ return 0;
+
+ qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
+ qemu_put_be32(f, QEMU_VM_FILE_VERSION);
+
+ ret = vmstate_save_state(f, &vmstate_cpr_state, &cpr_state, 0);
+ if (ret) {
+ error_setg(errp, "vmstate_save_state error %d", ret);
+ }
+
+ qemu_fclose(f);
+ return ret;
+}
+
+int cpr_state_load(Error **errp)
+{
+ int ret;
+ uint32_t v;
+ QEMUFile *f;
+
+ /* set f based on mode in a later patch in this series */
+ return 0;
+
+ v = qemu_get_be32(f);
+ if (v != QEMU_VM_FILE_MAGIC) {
+ error_setg(errp, "Not a migration stream (bad magic %x)", v);
+ qemu_fclose(f);
+ return -EINVAL;
+ }
+ v = qemu_get_be32(f);
+ if (v != QEMU_VM_FILE_VERSION) {
+ error_setg(errp, "Unsupported migration stream version %d", v);
+ qemu_fclose(f);
+ return -ENOTSUP;
+ }
+
+ ret = vmstate_load_state(f, &vmstate_cpr_state, &cpr_state, 1);
+ if (ret) {
+ error_setg(errp, "vmstate_load_state error %d", ret);
+ }
+
+ qemu_fclose(f);
+ return ret;
+}
+
diff --git a/migration/meson.build b/migration/meson.build
index 5ce2acb4..87feb4c 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -13,6 +13,7 @@ system_ss.add(files(
'block-dirty-bitmap.c',
'channel.c',
'channel-block.c',
+ 'cpr.c',
'dirtyrate.c',
'exec.c',
'fd.c',
diff --git a/migration/migration.c b/migration/migration.c
index 3dea06d..e394ad7 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -27,6 +27,7 @@
#include "sysemu/cpu-throttle.h"
#include "rdma.h"
#include "ram.h"
+#include "migration/cpr.h"
#include "migration/global_state.h"
#include "migration/misc.h"
#include "migration.h"
@@ -2118,6 +2119,10 @@ void qmp_migrate(const char *uri, bool has_channels,
}
}
+ if (cpr_state_save(&local_err)) {
+ goto out;
+ }
+
if (addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET) {
SocketAddress *saddr = &addr->u.socket;
if (saddr->type == SOCKET_ADDRESS_TYPE_INET ||
@@ -2142,6 +2147,7 @@ void qmp_migrate(const char *uri, bool has_channels,
MIGRATION_STATUS_FAILED);
}
+out:
if (local_err) {
if (!resume_requested) {
yank_unregister_instance(MIGRATION_YANK_INSTANCE);
diff --git a/migration/trace-events b/migration/trace-events
index 0b7c332..173f2c0 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -340,6 +340,11 @@ colo_receive_message(const char *msg) "Receive '%s' message"
# colo-failover.c
colo_failover_set_state(const char *new_state) "new state %s"
+# cpr.c
+cpr_save_fd(const char *name, int id, int fd) "%s, id %d, fd %d"
+cpr_delete_fd(const char *name, int id) "%s, id %d"
+cpr_find_fd(const char *name, int id, int fd) "%s, id %d returns %d"
+
# block-dirty-bitmap.c
send_bitmap_header_enter(void) ""
send_bitmap_bits(uint32_t flags, uint64_t start_sector, uint32_t nr_sectors, uint64_t data_size) "flags: 0x%x, start_sector: %" PRIu64 ", nr_sectors: %" PRIu32 ", data_size: %" PRIu64
diff --git a/system/vl.c b/system/vl.c
index 03951be..6521ee3 100644
--- a/system/vl.c
+++ b/system/vl.c
@@ -77,6 +77,7 @@
#include "hw/block/block.h"
#include "hw/i386/x86.h"
#include "hw/i386/pc.h"
+#include "migration/cpr.h"
#include "migration/misc.h"
#include "migration/snapshot.h"
#include "sysemu/tpm.h"
@@ -3713,6 +3714,8 @@ void qemu_init(int argc, char **argv)
qemu_create_machine(machine_opts_dict);
+ cpr_state_load(&error_fatal);
+
suspend_mux_open();
qemu_disable_default_devices();
--
1.8.3.1
* [PATCH V2 03/11] migration: save cpr mode
2024-06-30 19:40 [PATCH V2 00/11] Live update: cpr-exec Steve Sistare
2024-06-30 19:40 ` [PATCH V2 01/11] machine: alloc-anon option Steve Sistare
2024-06-30 19:40 ` [PATCH V2 02/11] migration: cpr-state Steve Sistare
@ 2024-06-30 19:40 ` Steve Sistare
2024-07-17 18:39 ` Fabiano Rosas
2024-06-30 19:40 ` [PATCH V2 04/11] migration: stop vm earlier for cpr Steve Sistare
` (8 subsequent siblings)
11 siblings, 1 reply; 77+ messages in thread
From: Steve Sistare @ 2024-06-30 19:40 UTC (permalink / raw)
To: qemu-devel
Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster, Steve Sistare
Save the mode in CPR state, so the user does not need to explicitly specify
it for the target. Modify migrate_mode() so it returns the incoming mode on
the target.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
include/migration/cpr.h | 7 +++++++
migration/cpr.c | 23 ++++++++++++++++++++++-
migration/migration.c | 1 +
migration/options.c | 9 +++++++--
4 files changed, 37 insertions(+), 3 deletions(-)
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 8e7e705..42b4019 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -8,6 +8,13 @@
#ifndef MIGRATION_CPR_H
#define MIGRATION_CPR_H
+#include "qapi/qapi-types-migration.h"
+
+#define MIG_MODE_NONE MIG_MODE__MAX
+
+MigMode cpr_get_incoming_mode(void);
+void cpr_set_incoming_mode(MigMode mode);
+
typedef int (*cpr_walk_fd_cb)(int fd);
void cpr_save_fd(const char *name, int id, int fd);
void cpr_delete_fd(const char *name, int id);
diff --git a/migration/cpr.c b/migration/cpr.c
index 313e74e..1c296c6 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -21,10 +21,23 @@
typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
typedef struct CprState {
+ MigMode mode;
CprFdList fds;
} CprState;
-static CprState cpr_state;
+static CprState cpr_state = {
+ .mode = MIG_MODE_NONE,
+};
+
+MigMode cpr_get_incoming_mode(void)
+{
+ return cpr_state.mode;
+}
+
+void cpr_set_incoming_mode(MigMode mode)
+{
+ cpr_state.mode = mode;
+}
/****************************************************************************/
@@ -124,11 +137,19 @@ void cpr_resave_fd(const char *name, int id, int fd)
/*************************************************************************/
#define CPR_STATE "CprState"
+static int cpr_state_presave(void *opaque)
+{
+ cpr_state.mode = migrate_mode();
+ return 0;
+}
+
static const VMStateDescription vmstate_cpr_state = {
.name = CPR_STATE,
.version_id = 1,
.minimum_version_id = 1,
+ .pre_save = cpr_state_presave,
.fields = (VMStateField[]) {
+ VMSTATE_UINT32(mode, CprState),
VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, next),
VMSTATE_END_OF_LIST()
}
diff --git a/migration/migration.c b/migration/migration.c
index e394ad7..0f47765 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -411,6 +411,7 @@ void migration_incoming_state_destroy(void)
mis->postcopy_qemufile_dst = NULL;
}
+ cpr_set_incoming_mode(MIG_MODE_NONE);
yank_unregister_instance(MIGRATION_YANK_INSTANCE);
}
diff --git a/migration/options.c b/migration/options.c
index 645f550..305397a 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -22,6 +22,7 @@
#include "qapi/qmp/qnull.h"
#include "sysemu/runstate.h"
#include "migration/colo.h"
+#include "migration/cpr.h"
#include "migration/misc.h"
#include "migration.h"
#include "migration-stats.h"
@@ -758,8 +759,12 @@ uint64_t migrate_max_postcopy_bandwidth(void)
MigMode migrate_mode(void)
{
- MigrationState *s = migrate_get_current();
- MigMode mode = s->parameters.mode;
+ MigMode mode = cpr_get_incoming_mode();
+
+ if (mode == MIG_MODE_NONE) {
+ MigrationState *s = migrate_get_current();
+ mode = s->parameters.mode;
+ }
assert(mode >= 0 && mode < MIG_MODE__MAX);
return mode;
--
1.8.3.1
* [PATCH V2 04/11] migration: stop vm earlier for cpr
2024-06-30 19:40 [PATCH V2 00/11] Live update: cpr-exec Steve Sistare
` (2 preceding siblings ...)
2024-06-30 19:40 ` [PATCH V2 03/11] migration: save cpr mode Steve Sistare
@ 2024-06-30 19:40 ` Steve Sistare
2024-07-17 18:59 ` Fabiano Rosas
2024-06-30 19:40 ` [PATCH V2 05/11] physmem: preserve ram blocks " Steve Sistare
` (7 subsequent siblings)
11 siblings, 1 reply; 77+ messages in thread
From: Steve Sistare @ 2024-06-30 19:40 UTC (permalink / raw)
To: qemu-devel
Cc: Peter Xu, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster, Steve Sistare
Stop the vm earlier for cpr, to guarantee consistent device state when
CPR state is saved.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
migration/migration.c | 22 +++++++++++++---------
1 file changed, 13 insertions(+), 9 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 0f47765..8a8e927 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2077,6 +2077,7 @@ void qmp_migrate(const char *uri, bool has_channels,
MigrationState *s = migrate_get_current();
g_autoptr(MigrationChannel) channel = NULL;
MigrationAddress *addr = NULL;
+ bool stopped = false;
/*
* Having preliminary checks for uri and channel
@@ -2120,6 +2121,15 @@ void qmp_migrate(const char *uri, bool has_channels,
}
}
+ if (migrate_mode_is_cpr(s)) {
+ int ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
+ if (ret < 0) {
+ error_setg(&local_err, "migration_stop_vm failed, error %d", -ret);
+ goto out;
+ }
+ stopped = true;
+ }
+
if (cpr_state_save(&local_err)) {
goto out;
}
@@ -2155,6 +2165,9 @@ out:
}
migrate_fd_error(s, local_err);
error_propagate(errp, local_err);
+ if (stopped && runstate_is_live(s->vm_old_state)) {
+ vm_start();
+ }
return;
}
}
@@ -3738,7 +3751,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
Error *local_err = NULL;
uint64_t rate_limit;
bool resume = (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP);
- int ret;
/*
* If there's a previous error, free it and prepare for another one.
@@ -3810,14 +3822,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
return;
}
- if (migrate_mode_is_cpr(s)) {
- ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
- if (ret < 0) {
- error_setg(&local_err, "migration_stop_vm failed, error %d", -ret);
- goto fail;
- }
- }
-
if (migrate_background_snapshot()) {
qemu_thread_create(&s->thread, "mig/snapshot",
bg_migration_thread, s, QEMU_THREAD_JOINABLE);
--
1.8.3.1
* [PATCH V2 05/11] physmem: preserve ram blocks for cpr
From: Steve Sistare @ 2024-06-30 19:40 UTC (permalink / raw)
To: qemu-devel
Save the memfd for each anonymous ramblock in CPR state, along with a name
that uniquely identifies it. The block's idstr is not yet set, so it
cannot be used for this purpose. Find the saved memfd in new QEMU when
creating a block. QEMU hard-codes the length of some internally-created
blocks, so to guard against that length changing, use lseek to get the
actual length of an incoming memfd.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
system/physmem.c | 25 ++++++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)
diff --git a/system/physmem.c b/system/physmem.c
index efe95ff..e37352e 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -73,6 +73,7 @@
#include "qapi/qapi-types-migration.h"
#include "migration/options.h"
+#include "migration/cpr.h"
#include "migration/vmstate.h"
#include "qemu/range.h"
@@ -1641,6 +1642,19 @@ void qemu_ram_unset_idstr(RAMBlock *block)
}
}
+static char *cpr_name(RAMBlock *block)
+{
+ MemoryRegion *mr = block->mr;
+ const char *mr_name = memory_region_name(mr);
+ g_autofree char *id = mr->dev ? qdev_get_dev_path(mr->dev) : NULL;
+
+ if (id) {
+ return g_strdup_printf("%s/%s", id, mr_name);
+ } else {
+ return g_strdup(mr_name);
+ }
+}
+
size_t qemu_ram_pagesize(RAMBlock *rb)
{
return rb->page_size;
@@ -1836,13 +1850,17 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
} else if (new_block->flags & RAM_SHARED) {
size_t max_length = new_block->max_length;
MemoryRegion *mr = new_block->mr;
- const char *name = memory_region_name(mr);
+ g_autofree char *name = cpr_name(new_block);
new_block->mr->align = QEMU_VMALLOC_ALIGN;
+ new_block->fd = cpr_find_fd(name, 0);
if (new_block->fd == -1) {
new_block->fd = qemu_memfd_create(name, max_length + mr->align,
0, 0, 0, errp);
+ cpr_save_fd(name, 0, new_block->fd);
+ } else {
+ new_block->max_length = lseek(new_block->fd, 0, SEEK_END);
}
if (new_block->fd >= 0) {
@@ -1852,6 +1870,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
false, 0, errp);
}
if (!new_block->host) {
+ cpr_delete_fd(name, 0);
qemu_mutex_unlock_ramlist();
return;
}
@@ -2162,6 +2181,8 @@ static void reclaim_ramblock(RAMBlock *block)
void qemu_ram_free(RAMBlock *block)
{
+ g_autofree char *name = NULL;
+
if (!block) {
return;
}
@@ -2172,6 +2193,8 @@ void qemu_ram_free(RAMBlock *block)
}
qemu_mutex_lock_ramlist();
+ name = cpr_name(block);
+ cpr_delete_fd(name, 0);
QLIST_REMOVE_RCU(block, next);
ram_list.mru_block = NULL;
/* Write list before version */
--
1.8.3.1
* [PATCH V2 06/11] migration: fix mismatched GPAs during cpr
From: Steve Sistare @ 2024-06-30 19:40 UTC (permalink / raw)
To: qemu-devel
For new cpr modes, migrate_ram_is_ignored will always be true, because the
memory is preserved in place rather than copied. However, for an ignored
block, parse_ramblock currently requires that the received address of the
block must match the address of the statically initialized region on the
target. This fails for a PCI rom block, because the memory region address
is set when the guest writes to a BAR on the source, which does not occur
on the target, causing a "Mismatched GPAs" error during cpr migration.
To fix, set the target's address to the source's address whenever the
target region does not yet have an address, instead of failing.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
---
include/exec/memory.h | 12 ++++++++++++
migration/ram.c | 15 +++++++++------
system/memory.c | 10 ++++++++--
3 files changed, 29 insertions(+), 8 deletions(-)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index c26ede3..227169e 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -811,6 +811,7 @@ struct MemoryRegion {
bool ram_device;
bool enabled;
bool warning_printed; /* For reservations */
+ bool has_addr;
uint8_t vga_logging_count;
MemoryRegion *alias;
hwaddr alias_offset;
@@ -2408,6 +2409,17 @@ void memory_region_set_enabled(MemoryRegion *mr, bool enabled);
void memory_region_set_address(MemoryRegion *mr, hwaddr addr);
/*
+ * memory_region_set_address_only: set the address of a region.
+ *
+ * Same as memory_region_set_address, but without causing transaction side
+ * effects.
+ *
+ * @mr: the region to be updated
+ * @addr: new address, relative to container region
+ */
+void memory_region_set_address_only(MemoryRegion *mr, hwaddr addr);
+
+/*
* memory_region_set_size: dynamically update the size of a region.
*
* Dynamically updates the size of a region.
diff --git a/migration/ram.c b/migration/ram.c
index edec1a2..eaf3151 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -4059,12 +4059,15 @@ static int parse_ramblock(QEMUFile *f, RAMBlock *block, ram_addr_t length)
}
if (migrate_ignore_shared()) {
hwaddr addr = qemu_get_be64(f);
- if (migrate_ram_is_ignored(block) &&
- block->mr->addr != addr) {
- error_report("Mismatched GPAs for block %s "
- "%" PRId64 "!= %" PRId64, block->idstr,
- (uint64_t)addr, (uint64_t)block->mr->addr);
- return -EINVAL;
+ if (migrate_ram_is_ignored(block)) {
+ if (!block->mr->has_addr) {
+ memory_region_set_address_only(block->mr, addr);
+ } else if (block->mr->addr != addr) {
+ error_report("Mismatched GPAs for block %s "
+ "%" PRId64 "!= %" PRId64, block->idstr,
+ (uint64_t)addr, (uint64_t)block->mr->addr);
+ return -EINVAL;
+ }
}
}
ret = rdma_block_notification_handle(f, block->idstr);
diff --git a/system/memory.c b/system/memory.c
index 28a837d..b7548bf 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2655,7 +2655,7 @@ static void memory_region_add_subregion_common(MemoryRegion *mr,
for (alias = subregion->alias; alias; alias = alias->alias) {
alias->mapped_via_alias++;
}
- subregion->addr = offset;
+ memory_region_set_address_only(subregion, offset);
memory_region_update_container_subregions(subregion);
}
@@ -2735,10 +2735,16 @@ static void memory_region_readd_subregion(MemoryRegion *mr)
}
}
+void memory_region_set_address_only(MemoryRegion *mr, hwaddr addr)
+{
+ mr->addr = addr;
+ mr->has_addr = true;
+}
+
void memory_region_set_address(MemoryRegion *mr, hwaddr addr)
{
if (addr != mr->addr) {
- mr->addr = addr;
+ memory_region_set_address_only(mr, addr);
memory_region_readd_subregion(mr);
}
}
--
1.8.3.1
* [PATCH V2 07/11] oslib: qemu_clear_cloexec
From: Steve Sistare @ 2024-06-30 19:40 UTC (permalink / raw)
To: qemu-devel
Define qemu_clear_cloexec, analogous to qemu_set_cloexec.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
---
include/qemu/osdep.h | 9 +++++++++
util/oslib-posix.c | 9 +++++++++
util/oslib-win32.c | 4 ++++
3 files changed, 22 insertions(+)
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 191916f..6ebf192 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -662,6 +662,15 @@ ssize_t qemu_write_full(int fd, const void *buf, size_t count)
void qemu_set_cloexec(int fd);
+/*
+ * Clear FD_CLOEXEC for a descriptor.
+ *
+ * The caller must guarantee that no other fork+exec's occur before the
+ * exec that is intended to inherit this descriptor, eg by suspending CPUs
+ * and blocking monitor commands.
+ */
+void qemu_clear_cloexec(int fd);
+
/* Return a dynamically allocated directory path that is appropriate for storing
* local state.
*
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index e764416..614c3e5 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -272,6 +272,15 @@ int qemu_socketpair(int domain, int type, int protocol, int sv[2])
return ret;
}
+void qemu_clear_cloexec(int fd)
+{
+ int f;
+ f = fcntl(fd, F_GETFD);
+ assert(f != -1);
+ f = fcntl(fd, F_SETFD, f & ~FD_CLOEXEC);
+ assert(f != -1);
+}
+
char *
qemu_get_local_state_dir(void)
{
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index b623830..c3e969a 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -222,6 +222,10 @@ void qemu_set_cloexec(int fd)
{
}
+void qemu_clear_cloexec(int fd)
+{
+}
+
int qemu_get_thread_id(void)
{
return GetCurrentThreadId();
--
1.8.3.1
* [PATCH V2 08/11] vl: helper to request exec
From: Steve Sistare @ 2024-06-30 19:40 UTC (permalink / raw)
To: qemu-devel
Add a qemu_system_exec_request() hook that causes the main loop to exit and
exec a command using the specified arguments. This will be used during CPR
to exec a new version of QEMU.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
include/sysemu/runstate.h | 3 +++
system/runstate.c | 29 +++++++++++++++++++++++++++++
2 files changed, 32 insertions(+)
diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
index 0117d24..cb669cf 100644
--- a/include/sysemu/runstate.h
+++ b/include/sysemu/runstate.h
@@ -80,6 +80,8 @@ typedef enum WakeupReason {
QEMU_WAKEUP_REASON_OTHER,
} WakeupReason;
+typedef void (*qemu_exec_func)(char **exec_argv);
+
void qemu_system_reset_request(ShutdownCause reason);
void qemu_system_suspend_request(void);
void qemu_register_suspend_notifier(Notifier *notifier);
@@ -91,6 +93,7 @@ void qemu_register_wakeup_support(void);
void qemu_system_shutdown_request_with_code(ShutdownCause reason,
int exit_code);
void qemu_system_shutdown_request(ShutdownCause reason);
+void qemu_system_exec_request(qemu_exec_func func, const strList *args);
void qemu_system_powerdown_request(void);
void qemu_register_powerdown_notifier(Notifier *notifier);
void qemu_register_shutdown_notifier(Notifier *notifier);
diff --git a/system/runstate.c b/system/runstate.c
index ec32e27..afc56e4 100644
--- a/system/runstate.c
+++ b/system/runstate.c
@@ -40,6 +40,7 @@
#include "qapi/error.h"
#include "qapi/qapi-commands-run-state.h"
#include "qapi/qapi-events-run-state.h"
+#include "qapi/type-helpers.h"
#include "qemu/accel.h"
#include "qemu/error-report.h"
#include "qemu/job.h"
@@ -400,6 +401,8 @@ static NotifierList wakeup_notifiers =
static NotifierList shutdown_notifiers =
NOTIFIER_LIST_INITIALIZER(shutdown_notifiers);
static uint32_t wakeup_reason_mask = ~(1 << QEMU_WAKEUP_REASON_NONE);
+qemu_exec_func exec_func;
+static char **exec_argv;
ShutdownCause qemu_shutdown_requested_get(void)
{
@@ -416,6 +419,11 @@ static int qemu_shutdown_requested(void)
return qatomic_xchg(&shutdown_requested, SHUTDOWN_CAUSE_NONE);
}
+static int qemu_exec_requested(void)
+{
+ return exec_argv != NULL;
+}
+
static void qemu_kill_report(void)
{
if (!qtest_driver() && shutdown_signal) {
@@ -693,6 +701,23 @@ void qemu_system_shutdown_request(ShutdownCause reason)
qemu_notify_event();
}
+static void qemu_system_exec(void)
+{
+ exec_func(exec_argv);
+
+ /* exec failed */
+ g_strfreev(exec_argv);
+ exec_argv = NULL;
+ exec_func = NULL;
+}
+
+void qemu_system_exec_request(qemu_exec_func func, const strList *args)
+{
+ exec_func = func;
+ exec_argv = strv_from_str_list(args);
+ qemu_notify_event();
+}
+
static void qemu_system_powerdown(void)
{
qapi_event_send_powerdown();
@@ -739,6 +764,10 @@ static bool main_loop_should_exit(int *status)
if (qemu_suspend_requested()) {
qemu_system_suspend();
}
+ if (qemu_exec_requested()) {
+ qemu_system_exec();
+ return false;
+ }
request = qemu_shutdown_requested();
if (request) {
qemu_kill_report();
--
1.8.3.1
* [PATCH V2 09/11] migration: cpr-exec-command parameter
From: Steve Sistare @ 2024-06-30 19:40 UTC (permalink / raw)
To: qemu-devel
Create the cpr-exec-command migration parameter, defined as a list of
strings. It will be used by the cpr-exec migration mode in a subsequent
patch; its qapi documentation contains forward references to cpr-exec
mode.
No functional change, except that cpr-exec-command is shown by the
'info migrate' command.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hmp-commands.hx | 2 +-
migration/migration-hmp-cmds.c | 25 +++++++++++++++++++++++++
migration/options.c | 14 ++++++++++++++
qapi/migration.json | 21 ++++++++++++++++++---
4 files changed, 58 insertions(+), 4 deletions(-)
diff --git a/hmp-commands.hx b/hmp-commands.hx
index 06746f0..0e04eac 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1009,7 +1009,7 @@ ERST
{
.name = "migrate_set_parameter",
- .args_type = "parameter:s,value:s",
+ .args_type = "parameter:s,value:S",
.params = "parameter value",
.help = "Set the parameter for migration",
.cmd = hmp_migrate_set_parameter,
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 7d608d2..16a4b00 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -229,6 +229,18 @@ void hmp_info_migrate_capabilities(Monitor *mon, const QDict *qdict)
qapi_free_MigrationCapabilityStatusList(caps);
}
+static void monitor_print_cpr_exec_command(Monitor *mon, strList *args)
+{
+ monitor_printf(mon, "%s:",
+ MigrationParameter_str(MIGRATION_PARAMETER_CPR_EXEC_COMMAND));
+
+ while (args) {
+ monitor_printf(mon, " %s", args->value);
+ args = args->next;
+ }
+ monitor_printf(mon, "\n");
+}
+
void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
{
MigrationParameters *params;
@@ -358,6 +370,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
MIGRATION_PARAMETER_DIRECT_IO),
params->direct_io ? "on" : "off");
}
+
+ assert(params->has_cpr_exec_command);
+ monitor_print_cpr_exec_command(mon, params->cpr_exec_command);
}
qapi_free_MigrationParameters(params);
@@ -635,6 +650,16 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
p->has_direct_io = true;
visit_type_bool(v, param, &p->direct_io, &err);
break;
+ case MIGRATION_PARAMETER_CPR_EXEC_COMMAND: {
+ g_autofree char **strv = g_strsplit(valuestr ?: "", " ", -1);
+ strList **tail = &p->cpr_exec_command;
+
+ for (int i = 0; strv[i]; i++) {
+ QAPI_LIST_APPEND(tail, strv[i]);
+ }
+ p->has_cpr_exec_command = true;
+ break;
+ }
default:
assert(0);
}
diff --git a/migration/options.c b/migration/options.c
index 305397a..b8d5f72 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -931,6 +931,9 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
params->zero_page_detection = s->parameters.zero_page_detection;
params->has_direct_io = true;
params->direct_io = s->parameters.direct_io;
+ params->has_cpr_exec_command = true;
+ params->cpr_exec_command = QAPI_CLONE(strList,
+ s->parameters.cpr_exec_command);
return params;
}
@@ -964,6 +967,7 @@ void migrate_params_init(MigrationParameters *params)
params->has_mode = true;
params->has_zero_page_detection = true;
params->has_direct_io = true;
+ params->has_cpr_exec_command = true;
}
/*
@@ -1252,6 +1256,10 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
if (params->has_direct_io) {
dest->direct_io = params->direct_io;
}
+
+ if (params->has_cpr_exec_command) {
+ dest->cpr_exec_command = params->cpr_exec_command;
+ }
}
static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
@@ -1381,6 +1389,12 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
if (params->has_direct_io) {
s->parameters.direct_io = params->direct_io;
}
+
+ if (params->has_cpr_exec_command) {
+ qapi_free_strList(s->parameters.cpr_exec_command);
+ s->parameters.cpr_exec_command =
+ QAPI_CLONE(strList, params->cpr_exec_command);
+ }
}
void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
diff --git a/qapi/migration.json b/qapi/migration.json
index 0f24206..20092d2 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -829,6 +829,10 @@
# only has effect if the @mapped-ram capability is enabled.
# (Since 9.1)
#
+# @cpr-exec-command: Command to start the new QEMU process when @mode
+# is @cpr-exec. The first list element is the program's filename,
+# the remainder its arguments. (Since 9.1)
+#
# Features:
#
# @unstable: Members @x-checkpoint-delay and
@@ -854,7 +858,8 @@
'vcpu-dirty-limit',
'mode',
'zero-page-detection',
- 'direct-io'] }
+ 'direct-io',
+ 'cpr-exec-command'] }
##
# @MigrateSetParameters:
@@ -1004,6 +1009,10 @@
# only has effect if the @mapped-ram capability is enabled.
# (Since 9.1)
#
+# @cpr-exec-command: Command to start the new QEMU process when @mode
+# is @cpr-exec. The first list element is the program's filename,
+# the remainder its arguments. (Since 9.1)
+#
# Features:
#
# @unstable: Members @x-checkpoint-delay and
@@ -1044,7 +1053,8 @@
'*vcpu-dirty-limit': 'uint64',
'*mode': 'MigMode',
'*zero-page-detection': 'ZeroPageDetection',
- '*direct-io': 'bool' } }
+ '*direct-io': 'bool',
+ '*cpr-exec-command': [ 'str' ]} }
##
# @migrate-set-parameters:
@@ -1208,6 +1218,10 @@
# only has effect if the @mapped-ram capability is enabled.
# (Since 9.1)
#
+# @cpr-exec-command: Command to start the new QEMU process when @mode
+# is @cpr-exec. The first list element is the program's filename,
+# the remainder its arguments. (Since 9.1)
+#
# Features:
#
# @unstable: Members @x-checkpoint-delay and
@@ -1245,7 +1259,8 @@
'*vcpu-dirty-limit': 'uint64',
'*mode': 'MigMode',
'*zero-page-detection': 'ZeroPageDetection',
- '*direct-io': 'bool' } }
+ '*direct-io': 'bool',
+ '*cpr-exec-command': [ 'str' ]} }
##
# @query-migrate-parameters:
--
1.8.3.1
* [PATCH V2 10/11] migration: cpr-exec save and load
From: Steve Sistare @ 2024-06-30 19:40 UTC (permalink / raw)
To: qemu-devel
To preserve CPR state across exec, create a QEMUFile based on a memfd, and
keep the memfd open across exec. Save the value of the memfd in an
environment variable so post-exec QEMU can find it.
These new functions are called in a subsequent patch.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
include/migration/cpr.h | 5 +++
migration/cpr-exec.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++
migration/meson.build | 1 +
3 files changed, 101 insertions(+)
create mode 100644 migration/cpr-exec.c
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 42b4019..76d6ccb 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -25,4 +25,9 @@ void cpr_resave_fd(const char *name, int id, int fd);
int cpr_state_save(Error **errp);
int cpr_state_load(Error **errp);
+QEMUFile *cpr_exec_output(Error **errp);
+QEMUFile *cpr_exec_input(Error **errp);
+void cpr_exec_persist_state(QEMUFile *f);
+bool cpr_exec_has_state(void);
+void cpr_exec_unpersist_state(void);
#endif
diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
new file mode 100644
index 0000000..5c40457
--- /dev/null
+++ b/migration/cpr-exec.c
@@ -0,0 +1,95 @@
+/*
+ * Copyright (c) 2021-2024 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/cutils.h"
+#include "qemu/memfd.h"
+#include "qapi/error.h"
+#include "io/channel-file.h"
+#include "io/channel-socket.h"
+#include "migration/cpr.h"
+#include "migration/qemu-file.h"
+#include "migration/misc.h"
+#include "migration/vmstate.h"
+#include "sysemu/runstate.h"
+
+#define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
+
+static QEMUFile *qemu_file_new_fd_input(int fd, const char *name)
+{
+ g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
+ QIOChannel *ioc = QIO_CHANNEL(fioc);
+ qio_channel_set_name(ioc, name);
+ return qemu_file_new_input(ioc);
+}
+
+static QEMUFile *qemu_file_new_fd_output(int fd, const char *name)
+{
+ g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
+ QIOChannel *ioc = QIO_CHANNEL(fioc);
+ qio_channel_set_name(ioc, name);
+ return qemu_file_new_output(ioc);
+}
+
+void cpr_exec_persist_state(QEMUFile *f)
+{
+ QIOChannelFile *fioc = QIO_CHANNEL_FILE(qemu_file_get_ioc(f));
+ int mfd = dup(fioc->fd);
+ char val[16];
+
+ /* Remember mfd in environment for post-exec load */
+ qemu_clear_cloexec(mfd);
+ snprintf(val, sizeof(val), "%d", mfd);
+ g_setenv(CPR_EXEC_STATE_NAME, val, 1);
+}
+
+static int cpr_exec_find_state(void)
+{
+ const char *val = g_getenv(CPR_EXEC_STATE_NAME);
+ int mfd;
+
+ assert(val);
+ g_unsetenv(CPR_EXEC_STATE_NAME);
+ assert(!qemu_strtoi(val, NULL, 10, &mfd));
+ return mfd;
+}
+
+bool cpr_exec_has_state(void)
+{
+ return g_getenv(CPR_EXEC_STATE_NAME) != NULL;
+}
+
+void cpr_exec_unpersist_state(void)
+{
+ int mfd;
+ const char *val = g_getenv(CPR_EXEC_STATE_NAME);
+
+ g_unsetenv(CPR_EXEC_STATE_NAME);
+ assert(val);
+ assert(!qemu_strtoi(val, NULL, 10, &mfd));
+ close(mfd);
+}
+
+QEMUFile *cpr_exec_output(Error **errp)
+{
+ int mfd = memfd_create(CPR_EXEC_STATE_NAME, 0);
+
+ if (mfd < 0) {
+ error_setg_errno(errp, errno, "memfd_create failed");
+ return NULL;
+ }
+
+ return qemu_file_new_fd_output(mfd, CPR_EXEC_STATE_NAME);
+}
+
+QEMUFile *cpr_exec_input(Error **errp)
+{
+ int mfd = cpr_exec_find_state();
+
+ lseek(mfd, 0, SEEK_SET);
+ return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
+}
diff --git a/migration/meson.build b/migration/meson.build
index 87feb4c..dd1d315 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -14,6 +14,7 @@ system_ss.add(files(
'channel.c',
'channel-block.c',
'cpr.c',
+ 'cpr-exec.c',
'dirtyrate.c',
'exec.c',
'fd.c',
--
1.8.3.1
* [PATCH V2 11/11] migration: cpr-exec mode
From: Steve Sistare @ 2024-06-30 19:40 UTC (permalink / raw)
To: qemu-devel
Add the cpr-exec migration mode. Usage:
qemu-system-$arch -machine anon-alloc=memfd ...
migrate_set_parameter mode cpr-exec
migrate_set_parameter cpr-exec-command \
<arg1> <arg2> ... -incoming <uri-1>
migrate -d <uri-1>
The migrate command stops the VM, saves state to uri-1,
directly exec's a new version of QEMU on the same host,
replacing the original process while retaining its PID, and
loads state from uri-1. Guest RAM is preserved in place,
albeit with new virtual addresses.
The new QEMU process is started by exec'ing the command
specified by the @cpr-exec-command parameter. The first word of
the command is the binary, and the remaining words are its
arguments. The command may be a direct invocation of new QEMU,
or may be a non-QEMU command that exec's the new QEMU binary.
This mode creates a second migration channel that is not visible
to the user. At the start of migration, old QEMU saves CPR state
to the second channel, and at the end of migration, it tells the
main loop to call cpr_exec. New QEMU loads CPR state early, before
objects are created.
Because old QEMU terminates when new QEMU starts, one cannot
stream data between the two, so uri-1 must be a type,
such as a file, that accepts all data before old QEMU exits.
Otherwise, old QEMU may quietly block writing to the channel.
Memory-backend objects must have the share=on attribute, but
memory-backend-epc is not supported. The VM must be started with
the '-machine anon-alloc=memfd' option, which allows anonymous
memory to be transferred in place to the new process. The memfds
are kept open across exec by clearing the close-on-exec flag, their
values are saved in CPR state, and they are mmap'd in new QEMU.
Note that the anon-alloc option is not related to memory-backend-memfd.
Later patches add support for memory-backend-memfd.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
include/migration/cpr.h | 2 ++
migration/cpr-exec.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++
migration/cpr.c | 37 ++++++++++++++++++---
migration/migration.c | 14 ++++++--
migration/ram.c | 2 ++
qapi/migration.json | 24 +++++++++++++-
6 files changed, 157 insertions(+), 7 deletions(-)
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 76d6ccb..c6c60f8 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -30,4 +30,6 @@ QEMUFile *cpr_exec_input(Error **errp);
void cpr_exec_persist_state(QEMUFile *f);
bool cpr_exec_has_state(void);
void cpr_exec_unpersist_state(void);
+void cpr_mig_init(void);
+void cpr_unpreserve_fds(void);
#endif
diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
index 5c40457..fd65435 100644
--- a/migration/cpr-exec.c
+++ b/migration/cpr-exec.c
@@ -11,8 +11,11 @@
#include "qapi/error.h"
#include "io/channel-file.h"
#include "io/channel-socket.h"
+#include "block/block-global-state.h"
+#include "qemu/main-loop.h"
#include "migration/cpr.h"
#include "migration/qemu-file.h"
+#include "migration/migration.h"
#include "migration/misc.h"
#include "migration/vmstate.h"
#include "sysemu/runstate.h"
@@ -93,3 +96,85 @@ QEMUFile *cpr_exec_input(Error **errp)
lseek(mfd, 0, SEEK_SET);
return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
}
+
+static int preserve_fd(int fd)
+{
+ qemu_clear_cloexec(fd);
+ return 0;
+}
+
+static int unpreserve_fd(int fd)
+{
+ qemu_set_cloexec(fd);
+ return 0;
+}
+
+static void cpr_preserve_fds(void)
+{
+ cpr_walk_fd(preserve_fd);
+}
+
+void cpr_unpreserve_fds(void)
+{
+ cpr_walk_fd(unpreserve_fd);
+}
+
+static void cpr_exec(char **argv)
+{
+ MigrationState *s = migrate_get_current();
+ Error *err = NULL;
+
+ /*
+ * Clear the close-on-exec flag for all preserved fd's. We cannot do so
+ * earlier because they should not persist across miscellaneous fork and
+ * exec calls that are performed during normal operation.
+ */
+ cpr_preserve_fds();
+
+ execvp(argv[0], argv);
+
+ cpr_unpreserve_fds();
+
+ error_setg_errno(&err, errno, "execvp %s failed", argv[0]);
+ error_report_err(error_copy(err));
+ migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
+ migrate_set_error(s, err);
+
+ migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);
+
+ if (s->block_inactive) {
+ Error *local_err = NULL;
+ bdrv_activate_all(&local_err);
+ if (local_err) {
+ error_report_err(local_err);
+ return;
+ } else {
+ s->block_inactive = false;
+ }
+ }
+
+ if (runstate_is_live(s->vm_old_state)) {
+ vm_start();
+ }
+}
+
+static int cpr_exec_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
+ Error **errp)
+{
+ MigrationState *s = migrate_get_current();
+
+ if (e->type == MIG_EVENT_PRECOPY_DONE) {
+ assert(s->state == MIGRATION_STATUS_COMPLETED);
+ qemu_system_exec_request(cpr_exec, s->parameters.cpr_exec_command);
+ } else if (e->type == MIG_EVENT_PRECOPY_FAILED) {
+ cpr_exec_unpersist_state();
+ }
+ return 0;
+}
+
+void cpr_mig_init(void)
+{
+ static NotifierWithReturn notifier;
+ migration_add_notifier_mode(&notifier, cpr_exec_notifier,
+ MIG_MODE_CPR_EXEC);
+}
diff --git a/migration/cpr.c b/migration/cpr.c
index 1c296c6..f756c15 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -9,6 +9,7 @@
#include "qapi/error.h"
#include "migration/cpr.h"
#include "migration/misc.h"
+#include "migration/options.h"
#include "migration/qemu-file.h"
#include "migration/savevm.h"
#include "migration/vmstate.h"
@@ -160,9 +161,16 @@ int cpr_state_save(Error **errp)
{
int ret;
QEMUFile *f;
+ MigMode mode = migrate_mode();
- /* set f based on mode in a later patch in this series */
- return 0;
+ if (mode == MIG_MODE_CPR_EXEC) {
+ f = cpr_exec_output(errp);
+ } else {
+ return 0;
+ }
+ if (!f) {
+ return -1;
+ }
qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
qemu_put_be32(f, QEMU_VM_FILE_VERSION);
@@ -170,8 +178,14 @@ int cpr_state_save(Error **errp)
ret = vmstate_save_state(f, &vmstate_cpr_state, &cpr_state, 0);
if (ret) {
error_setg(errp, "vmstate_save_state error %d", ret);
+ goto out;
}
+ if (mode == MIG_MODE_CPR_EXEC) {
+ cpr_exec_persist_state(f);
+ }
+
+out:
qemu_fclose(f);
return ret;
}
@@ -182,8 +196,18 @@ int cpr_state_load(Error **errp)
uint32_t v;
QEMUFile *f;
- /* set f based on mode in a later patch in this series */
- return 0;
+ /*
+ * Mode will be loaded in CPR state, so cannot use it to decide which
+ * form of state to load.
+ */
+ if (cpr_exec_has_state()) {
+ f = cpr_exec_input(errp);
+ } else {
+ return 0;
+ }
+ if (!f) {
+ return -1;
+ }
v = qemu_get_be32(f);
if (v != QEMU_VM_FILE_MAGIC) {
@@ -203,6 +227,11 @@ int cpr_state_load(Error **errp)
error_setg(errp, "vmstate_load_state error %d", ret);
}
+ if (migrate_mode() == MIG_MODE_CPR_EXEC) {
+ /* Set cloexec to prevent fd leaks from fork until the next cpr-exec */
+ cpr_unpreserve_fds();
+ }
+
qemu_fclose(f);
return ret;
}
diff --git a/migration/migration.c b/migration/migration.c
index 8a8e927..a4a020e 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -73,9 +73,10 @@
#define INMIGRATE_DEFAULT_EXIT_ON_ERROR true
-static NotifierWithReturnList migration_state_notifiers[] = {
+static NotifierWithReturnList migration_state_notifiers[MIG_MODE__MAX] = {
NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_NORMAL),
NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_REBOOT),
+ NOTIFIER_ELEM_INIT(migration_state_notifiers, MIG_MODE_CPR_EXEC),
};
/* Messages sent on the return path from destination to source */
@@ -264,6 +265,7 @@ void migration_object_init(void)
ram_mig_init();
dirty_bitmap_mig_init();
+ cpr_mig_init();
}
typedef struct {
@@ -1693,7 +1695,9 @@ bool migration_thread_is_self(void)
bool migrate_mode_is_cpr(MigrationState *s)
{
- return s->parameters.mode == MIG_MODE_CPR_REBOOT;
+ MigMode mode = s->parameters.mode;
+ return mode == MIG_MODE_CPR_REBOOT ||
+ mode == MIG_MODE_CPR_EXEC;
}
int migrate_init(MigrationState *s, Error **errp)
@@ -2028,6 +2032,12 @@ static bool migrate_prepare(MigrationState *s, bool resume, Error **errp)
return false;
}
+ if (migrate_mode() == MIG_MODE_CPR_EXEC &&
+ !s->parameters.has_cpr_exec_command) {
+ error_setg(errp, "cpr-exec mode requires setting cpr-exec-command");
+ return false;
+ }
+
if (migration_is_blocked(errp)) {
return false;
}
diff --git a/migration/ram.c b/migration/ram.c
index eaf3151..45b8f00 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -216,7 +216,9 @@ static bool postcopy_preempt_active(void)
bool migrate_ram_is_ignored(RAMBlock *block)
{
+ MigMode mode = migrate_mode();
return !qemu_ram_is_migratable(block) ||
+ mode == MIG_MODE_CPR_EXEC ||
(migrate_ignore_shared() && qemu_ram_is_shared(block)
&& qemu_ram_is_named_file(block));
}
diff --git a/qapi/migration.json b/qapi/migration.json
index 20092d2..4e626df 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -604,9 +604,31 @@
# or COLO.
#
# (since 8.2)
+#
+# @cpr-exec: The migrate command stops the VM, saves state to the
+# migration channel, directly exec's a new version of QEMU on the
+# same host, replacing the original process while retaining its
+# PID, and loads state from the channel. Guest RAM is preserved
+# in place.
+#
+# Old QEMU starts new QEMU by exec'ing the command specified by
+# the @cpr-exec-command parameter. The command may be a direct
+# invocation of new QEMU, or may be a non-QEMU command that exec's
+# the new QEMU binary.
+#
+# Because old QEMU terminates when new QEMU starts, one cannot
+# stream data between the two, so the channel must be a type,
+# such as a file, that accepts all data before old QEMU exits.
+# Otherwise, old QEMU may quietly block writing to the channel.
+#
+# Memory-backend objects must have the share=on attribute, but
+# memory-backend-epc is not supported. The VM must be started
+# with the '-machine anon-alloc=memfd' option.
+#
+# (since 9.1)
##
{ 'enum': 'MigMode',
- 'data': [ 'normal', 'cpr-reboot' ] }
+ 'data': [ 'normal', 'cpr-reboot', 'cpr-exec' ] }
##
# @ZeroPageDetection:
--
1.8.3.1
^ permalink raw reply related [flat|nested] 77+ messages in thread
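For concreteness, the user-visible flow described by the QAPI text above might look like this from a QMP client. This is only a sketch: it assumes cpr-exec-command takes the argument vector as a list of strings as in this series, the original QEMU arguments (elided here) must be repeated verbatim, and /tmp/vm.cpr is a made-up path.

```json
{ "execute": "migrate-set-parameters",
  "arguments": {
    "mode": "cpr-exec",
    "cpr-exec-command": [ "qemu-system-x86_64", "...original arguments...",
                          "-incoming", "file:/tmp/vm.cpr" ] } }

{ "execute": "migrate",
  "arguments": { "uri": "file:/tmp/vm.cpr" } }
```

A file channel satisfies the constraint above that all data must be accepted before old QEMU exits.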
* Re: [PATCH V2 01/11] machine: alloc-anon option
2024-06-30 19:40 ` [PATCH V2 01/11] machine: alloc-anon option Steve Sistare
@ 2024-07-15 17:52 ` Fabiano Rosas
2024-07-16 9:19 ` Igor Mammedov
1 sibling, 0 replies; 77+ messages in thread
From: Fabiano Rosas @ 2024-07-15 17:52 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Peter Xu, David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
Markus Armbruster, Steve Sistare
Steve Sistare <steven.sistare@oracle.com> writes:
> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> on the value of the anon-alloc machine property. This affects
> memory-backend-ram objects, guest RAM created with the global -m option
> but without an associated memory-backend object and without the -mem-path
> option, and various memory regions such as ROMs that are allocated when
> devices are created. This option does not affect memory-backend-file,
> memory-backend-memfd, or memory-backend-epc objects.
>
> The memfd option is intended to support new migration modes, in which the
> memory region can be transferred in place to a new QEMU process, by sending
> the memfd file descriptor to the process. Memory contents are preserved,
> and if the mode also transfers device descriptors, then pages that are
> locked in memory for DMA remain locked. This behavior is a prerequisite
> for supporting vfio, vdpa, and iommufd devices with the new modes.
>
> To access the same memory in the old and new QEMU processes, the memory
> must be mapped shared. Therefore, the implementation always sets
> RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
> user must explicitly specify the share option. In lieu of defining a new
> RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
> as the condition for calling memfd_create.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
The commit message uses alloc-anon and anon-alloc inconsistently.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
* Re: [PATCH V2 01/11] machine: alloc-anon option
2024-06-30 19:40 ` [PATCH V2 01/11] machine: alloc-anon option Steve Sistare
2024-07-15 17:52 ` Fabiano Rosas
@ 2024-07-16 9:19 ` Igor Mammedov
2024-07-17 19:24 ` Peter Xu
2024-07-20 20:28 ` Steven Sistare
1 sibling, 2 replies; 77+ messages in thread
From: Igor Mammedov @ 2024-07-16 9:19 UTC (permalink / raw)
To: Steve Sistare
Cc: qemu-devel, Peter Xu, Fabiano Rosas, David Hildenbrand,
Marcel Apfelbaum, Eduardo Habkost, Philippe Mathieu-Daude,
Paolo Bonzini, Daniel P. Berrange, Markus Armbruster
On Sun, 30 Jun 2024 12:40:24 -0700
Steve Sistare <steven.sistare@oracle.com> wrote:
> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> on the value of the anon-alloc machine property. This affects
> memory-backend-ram objects, guest RAM created with the global -m option
> but without an associated memory-backend object and without the -mem-path
> option
Nowadays, all machines have been converted to use a memory backend for VM RAM,
so the -m option implicitly creates a memory-backend object,
which will be either MEMORY_BACKEND_FILE if -mem-path is present
or MEMORY_BACKEND_RAM otherwise.
> To access the same memory in the old and new QEMU processes, the memory
> must be mapped shared. Therefore, the implementation always sets
> RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
> user must explicitly specify the share option. In lieu of defining a new
So the statement at the top that memory-backend-ram is affected is not
really valid?
> RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
> as the condition for calling memfd_create.
In general I do dislike adding yet another option that will affect
guest RAM allocation (memory-backends should be sufficient).
However I do see that you need memfd for device memory (vram, roms, ...).
Can we just use memfd/shared unconditionally for those and
avoid introducing a new confusing option?
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> hw/core/machine.c | 24 ++++++++++++++++++++++++
> include/hw/boards.h | 1 +
> qapi/machine.json | 14 ++++++++++++++
> qemu-options.hx | 13 +++++++++++++
> system/memory.c | 12 +++++++++---
> system/physmem.c | 38 +++++++++++++++++++++++++++++++++++++-
> system/trace-events | 3 +++
> 7 files changed, 101 insertions(+), 4 deletions(-)
>
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 655d75c..7ca2ad0 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -454,6 +454,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
> ms->mem_merge = value;
> }
>
> +static int machine_get_anon_alloc(Object *obj, Error **errp)
> +{
> + MachineState *ms = MACHINE(obj);
> +
> + return ms->anon_alloc;
> +}
> +
> +static void machine_set_anon_alloc(Object *obj, int value, Error **errp)
> +{
> + MachineState *ms = MACHINE(obj);
> +
> + ms->anon_alloc = value;
> +}
> +
> static bool machine_get_usb(Object *obj, Error **errp)
> {
> MachineState *ms = MACHINE(obj);
> @@ -1066,6 +1080,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
> object_class_property_set_description(oc, "mem-merge",
> "Enable/disable memory merge support");
>
> + object_class_property_add_enum(oc, "anon-alloc", "AnonAllocOption",
> + &AnonAllocOption_lookup,
> + machine_get_anon_alloc,
> + machine_set_anon_alloc);
> +
> object_class_property_add_bool(oc, "usb",
> machine_get_usb, machine_set_usb);
> object_class_property_set_description(oc, "usb",
> @@ -1416,6 +1435,11 @@ static bool create_default_memdev(MachineState *ms, const char *path, Error **er
> if (!object_property_set_int(obj, "size", ms->ram_size, errp)) {
> goto out;
> }
> + if (!object_property_set_bool(obj, "share",
> + ms->anon_alloc == ANON_ALLOC_OPTION_MEMFD,
> + errp)) {
> + goto out;
> + }
> object_property_add_child(object_get_objects_root(), mc->default_ram_id,
> obj);
> /* Ensure backend's memory region name is equal to mc->default_ram_id */
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index 73ad319..77f16ad 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -383,6 +383,7 @@ struct MachineState {
> bool enable_graphics;
> ConfidentialGuestSupport *cgs;
> HostMemoryBackend *memdev;
> + AnonAllocOption anon_alloc;
> /*
> * convenience alias to ram_memdev_id backend memory region
> * or to numa container memory region
> diff --git a/qapi/machine.json b/qapi/machine.json
> index 2fd3e9c..9173953 100644
> --- a/qapi/machine.json
> +++ b/qapi/machine.json
> @@ -1881,3 +1881,17 @@
> { 'command': 'x-query-interrupt-controllers',
> 'returns': 'HumanReadableText',
> 'features': [ 'unstable' ]}
> +
> +##
> +# @AnonAllocOption:
> +#
> +# An enumeration of the options for allocating anonymous guest memory.
> +#
> +# @mmap: allocate using mmap MAP_ANON
> +#
> +# @memfd: allocate using memfd_create
> +#
> +# Since: 9.1
> +##
> +{ 'enum': 'AnonAllocOption',
> + 'data': [ 'mmap', 'memfd' ] }
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 8ca7f34..595b693 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -38,6 +38,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
> " nvdimm=on|off controls NVDIMM support (default=off)\n"
> " memory-encryption=@var{} memory encryption object to use (default=none)\n"
> " hmat=on|off controls ACPI HMAT support (default=off)\n"
> + " anon-alloc=mmap|memfd allocate anonymous guest RAM using mmap MAP_ANON or memfd_create (default: mmap)\n"
> " memory-backend='backend-id' specifies explicitly provided backend for main RAM (default=none)\n"
> " cxl-fmw.0.targets.0=firsttarget,cxl-fmw.0.targets.1=secondtarget,cxl-fmw.0.size=size[,cxl-fmw.0.interleave-granularity=granularity]\n",
> QEMU_ARCH_ALL)
> @@ -101,6 +102,18 @@ SRST
> Enables or disables ACPI Heterogeneous Memory Attribute Table
> (HMAT) support. The default is off.
>
> + ``anon-alloc=mmap|memfd``
> + Allocate anonymous guest RAM using mmap MAP_ANON (the default)
> + or memfd_create. This affects memory-backend-ram objects,
> + RAM created with the global -m option but without an
> + associated memory-backend object and without the -mem-path
> + option, and various memory regions such as ROMs that are
> + allocated when devices are created. This option does not
> + affect memory-backend-file, memory-backend-memfd, or
> + memory-backend-epc objects.
> +
> + Some migration modes require anon-alloc=memfd.
> +
> ``memory-backend='id'``
> An alternative to legacy ``-mem-path`` and ``mem-prealloc`` options.
> Allows to use a memory backend as main RAM.
> diff --git a/system/memory.c b/system/memory.c
> index 2d69521..28a837d 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -1552,8 +1552,10 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr,
> uint64_t size,
> Error **errp)
> {
> + uint32_t flags = (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD) ?
> + RAM_SHARED : 0;
> return memory_region_init_ram_flags_nomigrate(mr, owner, name,
> - size, 0, errp);
> + size, flags, errp);
> }
>
> bool memory_region_init_ram_flags_nomigrate(MemoryRegion *mr,
> @@ -1713,8 +1715,10 @@ bool memory_region_init_rom_nomigrate(MemoryRegion *mr,
> uint64_t size,
> Error **errp)
> {
> + uint32_t flags = (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD) ?
> + RAM_SHARED : 0;
> if (!memory_region_init_ram_flags_nomigrate(mr, owner, name,
> - size, 0, errp)) {
> + size, flags, errp)) {
> return false;
> }
> mr->readonly = true;
> @@ -1731,6 +1735,8 @@ bool memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
> Error **errp)
> {
> Error *err = NULL;
> + uint32_t flags = (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD) ?
> + RAM_SHARED : 0;
> assert(ops);
> memory_region_init(mr, owner, name, size);
> mr->ops = ops;
> @@ -1738,7 +1744,7 @@ bool memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
> mr->terminates = true;
> mr->rom_device = true;
> mr->destructor = memory_region_destructor_ram;
> - mr->ram_block = qemu_ram_alloc(size, 0, mr, &err);
> + mr->ram_block = qemu_ram_alloc(size, flags, mr, &err);
> if (err) {
> mr->size = int128_zero();
> object_unparent(OBJECT(mr));
> diff --git a/system/physmem.c b/system/physmem.c
> index 33d09f7..efe95ff 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -47,6 +47,7 @@
> #include "qemu/qemu-print.h"
> #include "qemu/log.h"
> #include "qemu/memalign.h"
> +#include "qemu/memfd.h"
> #include "exec/memory.h"
> #include "exec/ioport.h"
> #include "sysemu/dma.h"
> @@ -54,6 +55,7 @@
> #include "sysemu/hw_accel.h"
> #include "sysemu/xen-mapcache.h"
> #include "trace/trace-root.h"
> +#include "trace.h"
>
> #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> #include <linux/falloc.h>
> @@ -69,6 +71,8 @@
>
> #include "qemu/pmem.h"
>
> +#include "qapi/qapi-types-migration.h"
> +#include "migration/options.h"
> #include "migration/vmstate.h"
>
> #include "qemu/range.h"
> @@ -1828,6 +1832,32 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
> qemu_mutex_unlock_ramlist();
> return;
> }
> +
> + } else if (new_block->flags & RAM_SHARED) {
> + size_t max_length = new_block->max_length;
> + MemoryRegion *mr = new_block->mr;
> + const char *name = memory_region_name(mr);
> +
> + new_block->mr->align = QEMU_VMALLOC_ALIGN;
> +
> + if (new_block->fd == -1) {
> + new_block->fd = qemu_memfd_create(name, max_length + mr->align,
> + 0, 0, 0, errp);
> + }
> +
> + if (new_block->fd >= 0) {
> + int mfd = new_block->fd;
> + qemu_set_cloexec(mfd);
> + new_block->host = file_ram_alloc(new_block, max_length, mfd,
> + false, 0, errp);
> + }
> + if (!new_block->host) {
> + qemu_mutex_unlock_ramlist();
> + return;
> + }
> + memory_try_enable_merging(new_block->host, new_block->max_length);
> + free_on_error = true;
> +
> } else {
> new_block->host = qemu_anon_ram_alloc(new_block->max_length,
> &new_block->mr->align,
> @@ -1911,6 +1941,9 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
> ram_block_notify_add(new_block->host, new_block->used_length,
> new_block->max_length);
> }
> + trace_ram_block_add(memory_region_name(new_block->mr), new_block->flags,
> + new_block->fd, new_block->used_length,
> + new_block->max_length);
> return;
>
> out_free:
> @@ -2097,8 +2130,11 @@ RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t maxsz,
> void *host),
> MemoryRegion *mr, Error **errp)
> {
> + uint32_t flags = (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD) ?
> + RAM_SHARED : 0;
> + flags |= RAM_RESIZEABLE;
> return qemu_ram_alloc_internal(size, maxsz, resized, NULL,
> - RAM_RESIZEABLE, mr, errp);
> + flags, mr, errp);
> }
>
> static void reclaim_ramblock(RAMBlock *block)
> diff --git a/system/trace-events b/system/trace-events
> index 69c9044..f8ebf42 100644
> --- a/system/trace-events
> +++ b/system/trace-events
> @@ -38,3 +38,6 @@ dirtylimit_state_finalize(void)
> dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t time_us) "CPU[%d] throttle percent: %" PRIu64 ", throttle adjust time %"PRIi64 " us"
> dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set dirty page rate limit %"PRIu64
> dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"PRIi64 " us"
> +
> +#physmem.c
> +ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu"
* Re: [PATCH V2 02/11] migration: cpr-state
2024-06-30 19:40 ` [PATCH V2 02/11] migration: cpr-state Steve Sistare
@ 2024-07-17 18:39 ` Fabiano Rosas
2024-07-19 15:03 ` Peter Xu
1 sibling, 0 replies; 77+ messages in thread
From: Fabiano Rosas @ 2024-07-17 18:39 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Peter Xu, David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
Markus Armbruster, Steve Sistare
Steve Sistare <steven.sistare@oracle.com> writes:
> CPR must save state that is needed after QEMU is restarted, when devices
> are realized. Thus the extra state cannot be saved in the migration stream,
> as objects must already exist before that stream can be loaded. Instead,
> define auxiliary state structures and vmstate descriptions, not associated
> with any registered object, and serialize the aux state to a cpr-specific
> stream in cpr_state_save. Deserialize in cpr_state_load after QEMU
> restarts, before devices are realized.
>
> Provide accessors for clients to register file descriptors for saving.
> The mechanism for passing the fd's to the new process will be specific
> to each migration mode, and added in subsequent patches.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
* Re: [PATCH V2 03/11] migration: save cpr mode
2024-06-30 19:40 ` [PATCH V2 03/11] migration: save cpr mode Steve Sistare
@ 2024-07-17 18:39 ` Fabiano Rosas
2024-07-18 15:47 ` Steven Sistare
0 siblings, 1 reply; 77+ messages in thread
From: Fabiano Rosas @ 2024-07-17 18:39 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Peter Xu, David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
Markus Armbruster, Steve Sistare
Steve Sistare <steven.sistare@oracle.com> writes:
> Save the mode in CPR state, so the user does not need to explicitly specify
> it for the target. Modify migrate_mode() so it returns the incoming mode on
> the target.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> include/migration/cpr.h | 7 +++++++
> migration/cpr.c | 23 ++++++++++++++++++++++-
> migration/migration.c | 1 +
> migration/options.c | 9 +++++++--
> 4 files changed, 37 insertions(+), 3 deletions(-)
>
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> index 8e7e705..42b4019 100644
> --- a/include/migration/cpr.h
> +++ b/include/migration/cpr.h
> @@ -8,6 +8,13 @@
> #ifndef MIGRATION_CPR_H
> #define MIGRATION_CPR_H
>
> +#include "qapi/qapi-types-migration.h"
> +
> +#define MIG_MODE_NONE MIG_MODE__MAX
What happens when a QEMU that knows about a new mode migrates into a
QEMU that doesn't know that mode, i.e. sees it as MIG_MODE__MAX?
I'd just use -1.
> +
> +MigMode cpr_get_incoming_mode(void);
> +void cpr_set_incoming_mode(MigMode mode);
> +
> typedef int (*cpr_walk_fd_cb)(int fd);
> void cpr_save_fd(const char *name, int id, int fd);
> void cpr_delete_fd(const char *name, int id);
> diff --git a/migration/cpr.c b/migration/cpr.c
> index 313e74e..1c296c6 100644
> --- a/migration/cpr.c
> +++ b/migration/cpr.c
> @@ -21,10 +21,23 @@
> typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
>
> typedef struct CprState {
> + MigMode mode;
> CprFdList fds;
> } CprState;
>
> -static CprState cpr_state;
> +static CprState cpr_state = {
> + .mode = MIG_MODE_NONE,
> +};
> +
> +MigMode cpr_get_incoming_mode(void)
> +{
> + return cpr_state.mode;
> +}
> +
> +void cpr_set_incoming_mode(MigMode mode)
> +{
> + cpr_state.mode = mode;
> +}
>
> /****************************************************************************/
>
> @@ -124,11 +137,19 @@ void cpr_resave_fd(const char *name, int id, int fd)
> /*************************************************************************/
> #define CPR_STATE "CprState"
>
> +static int cpr_state_presave(void *opaque)
> +{
> + cpr_state.mode = migrate_mode();
> + return 0;
> +}
> +
> static const VMStateDescription vmstate_cpr_state = {
> .name = CPR_STATE,
> .version_id = 1,
> .minimum_version_id = 1,
> + .pre_save = cpr_state_presave,
> .fields = (VMStateField[]) {
> + VMSTATE_UINT32(mode, CprState),
> VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, next),
> VMSTATE_END_OF_LIST()
> }
> diff --git a/migration/migration.c b/migration/migration.c
> index e394ad7..0f47765 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -411,6 +411,7 @@ void migration_incoming_state_destroy(void)
> mis->postcopy_qemufile_dst = NULL;
> }
>
> + cpr_set_incoming_mode(MIG_MODE_NONE);
> yank_unregister_instance(MIGRATION_YANK_INSTANCE);
> }
>
> diff --git a/migration/options.c b/migration/options.c
> index 645f550..305397a 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -22,6 +22,7 @@
> #include "qapi/qmp/qnull.h"
> #include "sysemu/runstate.h"
> #include "migration/colo.h"
> +#include "migration/cpr.h"
> #include "migration/misc.h"
> #include "migration.h"
> #include "migration-stats.h"
> @@ -758,8 +759,12 @@ uint64_t migrate_max_postcopy_bandwidth(void)
>
> MigMode migrate_mode(void)
> {
> - MigrationState *s = migrate_get_current();
> - MigMode mode = s->parameters.mode;
> + MigMode mode = cpr_get_incoming_mode();
> +
> + if (mode == MIG_MODE_NONE) {
> + MigrationState *s = migrate_get_current();
> + mode = s->parameters.mode;
> + }
>
> assert(mode >= 0 && mode < MIG_MODE__MAX);
> return mode;
* Re: [PATCH V2 04/11] migration: stop vm earlier for cpr
2024-06-30 19:40 ` [PATCH V2 04/11] migration: stop vm earlier for cpr Steve Sistare
@ 2024-07-17 18:59 ` Fabiano Rosas
2024-07-20 20:00 ` Steven Sistare
0 siblings, 1 reply; 77+ messages in thread
From: Fabiano Rosas @ 2024-07-17 18:59 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Peter Xu, David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
Markus Armbruster, Steve Sistare
Steve Sistare <steven.sistare@oracle.com> writes:
> Stop the vm earlier for cpr, to guarantee consistent device state when
> CPR state is saved.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> migration/migration.c | 22 +++++++++++++---------
> 1 file changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 0f47765..8a8e927 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2077,6 +2077,7 @@ void qmp_migrate(const char *uri, bool has_channels,
> MigrationState *s = migrate_get_current();
> g_autoptr(MigrationChannel) channel = NULL;
> MigrationAddress *addr = NULL;
> + bool stopped = false;
>
> /*
> * Having preliminary checks for uri and channel
> @@ -2120,6 +2121,15 @@ void qmp_migrate(const char *uri, bool has_channels,
> }
> }
>
> + if (migrate_mode_is_cpr(s)) {
> + int ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
> + if (ret < 0) {
> + error_setg(&local_err, "migration_stop_vm failed, error %d", -ret);
> + goto out;
> + }
> + stopped = true;
> + }
> +
> if (cpr_state_save(&local_err)) {
> goto out;
> }
> @@ -2155,6 +2165,9 @@ out:
> }
> migrate_fd_error(s, local_err);
> error_propagate(errp, local_err);
> + if (stopped && runstate_is_live(s->vm_old_state)) {
> + vm_start();
> + }
What about non-live states? Shouldn't this be:
if (stopped) {
vm_resume();
}
> return;
> }
> }
> @@ -3738,7 +3751,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
> Error *local_err = NULL;
> uint64_t rate_limit;
> bool resume = (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP);
> - int ret;
>
> /*
> * If there's a previous error, free it and prepare for another one.
> @@ -3810,14 +3822,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
> return;
> }
>
> - if (migrate_mode_is_cpr(s)) {
> - ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
> - if (ret < 0) {
> - error_setg(&local_err, "migration_stop_vm failed, error %d", -ret);
> - goto fail;
> - }
> - }
> -
> if (migrate_background_snapshot()) {
> qemu_thread_create(&s->thread, "mig/snapshot",
> bg_migration_thread, s, QEMU_THREAD_JOINABLE);
* Re: [PATCH V2 01/11] machine: alloc-anon option
2024-07-16 9:19 ` Igor Mammedov
@ 2024-07-17 19:24 ` Peter Xu
2024-07-18 15:43 ` Steven Sistare
2024-07-20 20:35 ` Steven Sistare
2024-07-20 20:28 ` Steven Sistare
1 sibling, 2 replies; 77+ messages in thread
From: Peter Xu @ 2024-07-17 19:24 UTC (permalink / raw)
To: Igor Mammedov
Cc: Steve Sistare, qemu-devel, Fabiano Rosas, David Hildenbrand,
Marcel Apfelbaum, Eduardo Habkost, Philippe Mathieu-Daude,
Paolo Bonzini, Daniel P. Berrange, Markus Armbruster
On Tue, Jul 16, 2024 at 11:19:55AM +0200, Igor Mammedov wrote:
> On Sun, 30 Jun 2024 12:40:24 -0700
> Steve Sistare <steven.sistare@oracle.com> wrote:
>
> > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> > on the value of the anon-alloc machine property. This affects
> > memory-backend-ram objects, guest RAM created with the global -m option
> > but without an associated memory-backend object and without the -mem-path
> > option
> nowadays, all machines were converted to use memory backend for VM RAM.
> so -m option implicitly creates memory-backend object,
> which will be either MEMORY_BACKEND_FILE if -mem-path present
> or MEMORY_BACKEND_RAM otherwise.
>
>
> > To access the same memory in the old and new QEMU processes, the memory
> > must be mapped shared. Therefore, the implementation always sets
>
> > RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
> > user must explicitly specify the share option. In lieu of defining a new
> so statement at the top that memory-backend-ram is affected is not
> really valid?
>
> > RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
> > as the condition for calling memfd_create.
>
> In general I do dislike adding yet another option that will affect
> guest RAM allocation (memory-backends should be sufficient).
I shared the same concern when reviewing the previous version, and I keep
having so.
>
> However I do see that you need memfd for device memory (vram, roms, ...).
> Can we just use memfd/shared unconditionally for those and
> avoid introducing a new confusing option?
ROMs should be fine IIUC, as they shouldn't be large, and they can be
migrated normally (because they're not DMA target from VFIO assigned
devices). IOW, per my understanding what must be shared via memfd is
writable memories that can be DMAed from a VFIO device.
I raised the question of whether / why vram can be a DMA target, but I
didn't get a response. So I would like to restate this comment: I think we
should figure out what is missing when we switch all backends to use
-object, rather than adding this flag so readily. If it is added, we should
be crystal clear about which RAM regions the flag applies to.
PS to Steve: and I think I left tons of other comments in the previous
version outside this patch too, but I don't think they were fully discussed
before this series was sent. I can re-read the series again, but I don't
think it'll work out if we keep skipping discussions...
Thanks,
--
Peter Xu
* Re: [PATCH V2 01/11] machine: alloc-anon option
2024-07-17 19:24 ` Peter Xu
@ 2024-07-18 15:43 ` Steven Sistare
2024-07-18 16:22 ` Peter Xu
2024-07-20 20:35 ` Steven Sistare
1 sibling, 1 reply; 77+ messages in thread
From: Steven Sistare @ 2024-07-18 15:43 UTC (permalink / raw)
To: Peter Xu, Igor Mammedov
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster
On 7/17/2024 3:24 PM, Peter Xu wrote:
[...]
>
> PS to Steve: and I think I left tons of other comments in the previous
> version outside this patch too, but I don't think they were fully discussed
> before this series was sent. I can re-read the series again, but I don't
> think it'll work out if we keep skipping discussions...
Hi Peter, let me address this part first, because I don't want you to think
that I ignored your questions and concerns. This V2 series tries to address
them. The change log was intended to be my response, rather than responding
to each open question individually, but let me try again here with more detail.
I apologize if I don't summarize your concerns correctly or completely.
issue: discomfort with exec. why is it needed?
response: exec is just a transport mechanism to send fd's to new qemu.
I refactored to separate core patches from exec-specific patches, submitted
cpr-transfer patches to illustrate a non-exec method, and provided reasons
why one vs the other would be desirable in the commit messages and cover
letter.
issue: why do we need to preserve the ramblock fields and make them available
prior to object creation?
response: we don't need to preserve all of them, and we only need the fd
prior to object creation, so I deleted the precreate, factory, and named
object patches, and added CprState to preserve fd's. used_length arrives in
the normal migration stream. max_length is recovered from the memfd using
lseek.
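For the record, the lseek recovery amounts to this sketch (memfd_length is
an illustrative name, not a function from the series):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch only: recover the size of a preserved memfd via lseek, the way
 * max_length is recovered above.  Works because a memfd's file size is
 * its allocation size, set earlier by ftruncate(). */
static off_t memfd_length(int fd)
{
    off_t len = lseek(fd, 0, SEEK_END);
    lseek(fd, 0, SEEK_SET);             /* restore the file offset */
    return len;
}
```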
issue: the series is too large, with too much change.
response: in addition to the deletions mentioned above, I simplified the
functionality and tossed out style patches and nice-to-haves, so we can
focus on core functionality. V2 is much smaller.
issue: memfd_create option is oddly expressed and hard to understand.
response: I redefined the option, deleted all the stylistic ramblock patches
to lay its workings bare, and explicitly documented its effect on all types
of memory in the commit messages and qapi documentation.
issue: no need to preserve blocks like ROM for DMA (with memfd_create).
Blocks that must be preserved should be surfaced to the user as objects.
response: I disagree, and will continue that conversation in this email thread.
issue: how will vfio be handled?
response: I submitted the vfio patches (non-trivial, because first I had to
rework them without using precreate vmstate).
issue: how will fd's be preserved for chardevs?
response: via cpr_save_fd, CprState, and cpr_load_fd at device creation time,
in each device's creation function, just like vfio. Those primitives are
defined in this V2 series.
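To illustrate the pattern, here is a self-contained sketch with simplified
stand-ins for those primitives (the real versions keep a vmstate-serialized
list; chr_get_fd is a hypothetical device-creation helper, not code from
the series):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Simplified stand-ins for cpr_save_fd()/cpr_find_fd(): a flat table,
 * just enough to show the reuse-or-open pattern. */
#define CPR_MAX 16
static struct { char name[32]; int id, fd; } cpr_tbl[CPR_MAX];
static int cpr_n;

static void cpr_save_fd(const char *name, int id, int fd)
{
    snprintf(cpr_tbl[cpr_n].name, sizeof cpr_tbl[cpr_n].name, "%s", name);
    cpr_tbl[cpr_n].id = id;
    cpr_tbl[cpr_n].fd = fd;
    cpr_n++;
}

static int cpr_find_fd(const char *name, int id)
{
    for (int i = 0; i < cpr_n; i++) {
        if (!strcmp(cpr_tbl[i].name, name) && cpr_tbl[i].id == id) {
            return cpr_tbl[i].fd;
        }
    }
    return -1;
}

/* Device creation path: reuse a preserved fd if present, else open anew
 * and register it so the next CPR can find it. */
static int chr_get_fd(const char *label, const char *path)
{
    int fd = cpr_find_fd(label, 0);
    if (fd < 0) {
        fd = open(path, O_RDWR);
        cpr_save_fd(label, 0, fd);
    }
    return fd;
}
```

On first boot the table is empty, so the device opens its file normally;
after a CPR restart the preserved fd is found and reused, which is why the
guest-visible connection never drops.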
- Steve
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 03/11] migration: save cpr mode
2024-07-17 18:39 ` Fabiano Rosas
@ 2024-07-18 15:47 ` Steven Sistare
0 siblings, 0 replies; 77+ messages in thread
From: Steven Sistare @ 2024-07-18 15:47 UTC (permalink / raw)
To: Fabiano Rosas, qemu-devel
Cc: Peter Xu, David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
Markus Armbruster
On 7/17/2024 2:39 PM, Fabiano Rosas wrote:
> Steve Sistare <steven.sistare@oracle.com> writes:
>
>> Save the mode in CPR state, so the user does not need to explicitly specify
>> it for the target. Modify migrate_mode() so it returns the incoming mode on
>> the target.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> include/migration/cpr.h | 7 +++++++
>> migration/cpr.c | 23 ++++++++++++++++++++++-
>> migration/migration.c | 1 +
>> migration/options.c | 9 +++++++--
>> 4 files changed, 37 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>> index 8e7e705..42b4019 100644
>> --- a/include/migration/cpr.h
>> +++ b/include/migration/cpr.h
>> @@ -8,6 +8,13 @@
>> #ifndef MIGRATION_CPR_H
>> #define MIGRATION_CPR_H
>>
>> +#include "qapi/qapi-types-migration.h"
>> +
>> +#define MIG_MODE_NONE MIG_MODE__MAX
>
> What happens when a QEMU that knows about a new mode migrates into a
> QEMU that doesn't know that mode, i.e. sees it as MIG_MODE__MAX?
>
> I'd just use -1.
Good idea, thanks - steve
>> +
>> +MigMode cpr_get_incoming_mode(void);
>> +void cpr_set_incoming_mode(MigMode mode);
>> +
>> typedef int (*cpr_walk_fd_cb)(int fd);
>> void cpr_save_fd(const char *name, int id, int fd);
>> void cpr_delete_fd(const char *name, int id);
>> diff --git a/migration/cpr.c b/migration/cpr.c
>> index 313e74e..1c296c6 100644
>> --- a/migration/cpr.c
>> +++ b/migration/cpr.c
>> @@ -21,10 +21,23 @@
>> typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
>>
>> typedef struct CprState {
>> + MigMode mode;
>> CprFdList fds;
>> } CprState;
>>
>> -static CprState cpr_state;
>> +static CprState cpr_state = {
>> + .mode = MIG_MODE_NONE,
>> +};
>> +
>> +MigMode cpr_get_incoming_mode(void)
>> +{
>> + return cpr_state.mode;
>> +}
>> +
>> +void cpr_set_incoming_mode(MigMode mode)
>> +{
>> + cpr_state.mode = mode;
>> +}
>>
>> /****************************************************************************/
>>
>> @@ -124,11 +137,19 @@ void cpr_resave_fd(const char *name, int id, int fd)
>> /*************************************************************************/
>> #define CPR_STATE "CprState"
>>
>> +static int cpr_state_presave(void *opaque)
>> +{
>> + cpr_state.mode = migrate_mode();
>> + return 0;
>> +}
>> +
>> static const VMStateDescription vmstate_cpr_state = {
>> .name = CPR_STATE,
>> .version_id = 1,
>> .minimum_version_id = 1,
>> + .pre_save = cpr_state_presave,
>> .fields = (VMStateField[]) {
>> + VMSTATE_UINT32(mode, CprState),
>> VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, next),
>> VMSTATE_END_OF_LIST()
>> }
>> diff --git a/migration/migration.c b/migration/migration.c
>> index e394ad7..0f47765 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -411,6 +411,7 @@ void migration_incoming_state_destroy(void)
>> mis->postcopy_qemufile_dst = NULL;
>> }
>>
>> + cpr_set_incoming_mode(MIG_MODE_NONE);
>> yank_unregister_instance(MIGRATION_YANK_INSTANCE);
>> }
>>
>> diff --git a/migration/options.c b/migration/options.c
>> index 645f550..305397a 100644
>> --- a/migration/options.c
>> +++ b/migration/options.c
>> @@ -22,6 +22,7 @@
>> #include "qapi/qmp/qnull.h"
>> #include "sysemu/runstate.h"
>> #include "migration/colo.h"
>> +#include "migration/cpr.h"
>> #include "migration/misc.h"
>> #include "migration.h"
>> #include "migration-stats.h"
>> @@ -758,8 +759,12 @@ uint64_t migrate_max_postcopy_bandwidth(void)
>>
>> MigMode migrate_mode(void)
>> {
>> - MigrationState *s = migrate_get_current();
>> - MigMode mode = s->parameters.mode;
>> + MigMode mode = cpr_get_incoming_mode();
>> +
>> + if (mode == MIG_MODE_NONE) {
>> + MigrationState *s = migrate_get_current();
>> + mode = s->parameters.mode;
>> + }
>>
>> assert(mode >= 0 && mode < MIG_MODE__MAX);
>> return mode;
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-06-30 19:40 [PATCH V2 00/11] Live update: cpr-exec Steve Sistare
` (10 preceding siblings ...)
2024-06-30 19:40 ` [PATCH V2 11/11] migration: cpr-exec mode Steve Sistare
@ 2024-07-18 15:56 ` Peter Xu
2024-07-20 21:26 ` Steven Sistare
` (2 more replies)
11 siblings, 3 replies; 77+ messages in thread
From: Peter Xu @ 2024-07-18 15:56 UTC (permalink / raw)
To: Steve Sistare
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster
Steve,
On Sun, Jun 30, 2024 at 12:40:23PM -0700, Steve Sistare wrote:
> What?
Thanks for trying out with the cpr-transfer series. I saw that that series
missed most of the cc list here, so I'm attaching the link here:
https://lore.kernel.org/r/1719776648-435073-1-git-send-email-steven.sistare@oracle.com
I think most of my previous questions for the exec() solution still stand;
I'll try to summarize them all in this reply as much as I can.
>
> This patch series adds the live migration cpr-exec mode, which allows
> the user to update QEMU with minimal guest pause time, by preserving
> guest RAM in place, albeit with new virtual addresses in new QEMU, and
> by preserving device file descriptors.
>
> The new user-visible interfaces are:
> * cpr-exec (MigMode migration parameter)
> * cpr-exec-command (migration parameter)
I really, really hope we can avoid this..
It's super cumbersome to pass a qemu cmdline in a qemu migration
parameter.. if we can do this via generic live migration mechanisms, I hope
we stick with the cleaner approach.
> * anon-alloc (command-line option for -machine)
Igor questioned this, and I second his opinion.. We can leave the
discussion there for this one.
>
> The user sets the mode parameter before invoking the migrate command.
> In this mode, the user issues the migrate command to old QEMU, which
> stops the VM and saves state to the migration channels. Old QEMU then
> exec's new QEMU, replacing the original process while retaining its PID.
> The user specifies the command to exec new QEMU in the migration parameter
> cpr-exec-command. The command must pass all old QEMU arguments to new
> QEMU, plus the -incoming option. Execution resumes in new QEMU.
>
> Memory-backend objects must have the share=on attribute, but
> memory-backend-epc is not supported. The VM must be started
> with the '-machine anon-alloc=memfd' option, which allows anonymous
> memory to be transferred in place to the new process.
>
> Why?
>
> This mode has less impact on the guest than any other method of updating
> in place.
So I wonder whether there's a comparison between exec() and the transfer
mode that you recently proposed.
I'm asking because exec() (besides all the other things that I dislike
about it in this approach..) should simply be slower, logically, due to the
serialized operations of (1) tearing down the old mm, (2) reloading the new
ELF, then (3) running through the QEMU init process.
With a generic migration solution, the dest QEMU can start running (2+3)
concurrently without even needing to run (1).
In this whole process, I suspect (2) could be relatively fast; (3) I don't
know, maybe it could be slow but I never measured it; Paolo may have a good
idea, as I know he used to work on qboot.
For (1), it may be fast in your test cases, but it will not always be
fast. Consider a guest with TBs of shared mem: even if the memory is
completely shared between the src/dst QEMUs, the pgtable won't be! If
those TBs are mapped at PAGE_SIZE granularity, tearing down the src QEMU
pgtable alone can take time, and that will be accounted in step (1) and
thus in the exec() request.
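As a back-of-envelope illustration (my own numbers, assuming x86-64 with
8-byte PTEs and ignoring upper-level tables, which add well under 1%):

```c
#include <assert.h>
#include <stdint.h>

/* Rough estimate of last-level page-table bytes needed to map `mem`
 * bytes at `page` granularity, assuming 8-byte PTEs as on x86-64.
 * This is memory the exiting src QEMU must walk and free in step (1). */
static uint64_t pte_bytes(uint64_t mem, uint64_t page)
{
    return (mem / page) * sizeof(uint64_t);
}
```

So 1 TiB of guest RAM mapped at 4 KiB granularity needs 2^28 PTEs, i.e.
about 2 GiB of page tables to tear down before exec() can proceed.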
All this fuss is avoided if you use a generic live migration model like
the cpr-transfer you proposed. That's also cleaner.
> The pause time is much lower, because devices need not be torn
> down and recreated, DMA does not need to be drained and quiesced, and minimal
> state is copied to new QEMU. Further, there are no constraints on the guest.
> By contrast, cpr-reboot mode requires the guest to support S3 suspend-to-ram,
> and suspending plus resuming vfio devices adds multiple seconds to the
> guest pause time. Lastly, there is no loss of connectivity to the guest,
> because chardev descriptors remain open and connected.
Again, I raised the question of why this would matter: after all, the mgmt
app will need to cope with reconnections anyway, since it must support
generic live migration, in which case reconnection is a must.
So far, doing the mgmt reconnects on the ports doesn't sound like a
performance-critical path. So this might be an optimization that most mgmt
apps may not care much about?
>
> These benefits all derive from the core design principle of this mode,
> which is preserving open descriptors. This approach is very general and
> can be used to support a wide variety of devices that do not have hardware
> support for live migration, including but not limited to: vfio, chardev,
> vhost, vdpa, and iommufd. Some devices need new kernel software interfaces
> to allow a descriptor to be used in a process that did not originally open it.
Yes, I still think this is a great idea. It just can also be built on top
of something other than exec().
>
> In a containerized QEMU environment, cpr-exec reuses an existing QEMU
> container and its assigned resources. By contrast, consider a design in
> which a new container is created on the same host as the target of the
> CPR operation. Resources must be reserved for the new container, while
> the old container still reserves resources until the operation completes.
Note that if we need to share RAM anyway, the resource consumption should
be minimal, as memory is IMHO the major concern in the container world
(besides CPU, but CPU isn't a concern in this scenario), and here the
shared guest mem shouldn't be accounted to the dest container. So it seems
to me it comes down to the metadata QEMU/KVM needs to do the hypervisor
work, and that should be relatively small.
In that case I don't yet see it as a huge improvement, if the dest
container is cheap to initiate.
> Avoiding over commitment requires extra work in the management layer.
So it would be nice to know what needs to be overcommitted here. I confess
I don't know much about containerized VMs, so maybe the page cache can be a
problem even when shared. But I hope we can spell that out. Logically,
IIUC, memcg shouldn't account that page cache if it is preallocated,
because memcg accounting is done at folio allocation, i.e. where the page
cache would first miss (so not this case..).
> This is one reason why a cloud provider may prefer cpr-exec. A second reason
> is that the container may include agents with their own connections to the
> outside world, and such connections remain intact if the container is reused.
>
> How?
>
> All memory that is mapped by the guest is preserved in place. Indeed,
> it must be, because it may be the target of DMA requests, which are not
> quiesced during cpr-exec. All such memory must be mmap'able in new QEMU.
> This is easy for named memory-backend objects, as long as they are mapped
> shared, because they are visible in the file system in both old and new QEMU.
> Anonymous memory must be allocated using memfd_create rather than MAP_ANON,
> so the memfd's can be sent to new QEMU. Pages that were locked in memory
> for DMA in old QEMU remain locked in new QEMU, because the descriptor of
> the device that locked them remains open.
>
> cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,
> and by sending the unique name and value of each descriptor to new QEMU
> via CPR state.
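For illustration, the CLOEXEC handling described above amounts to
something like this sketch (persist_fd is an illustrative name, not a
function from the series):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Sketch: clear FD_CLOEXEC so the descriptor survives exec().  The fd
 * number itself is what old QEMU records in CPR state for new QEMU. */
static int persist_fd(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}
```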
>
> For device descriptors, new QEMU reuses the descriptor when creating the
> device, rather than opening it again. The same holds for chardevs. For
> memfd descriptors, new QEMU mmap's the preserved memfd when a ramblock
> is created.
>
> CPR state cannot be sent over the normal migration channel, because devices
> and backends are created prior to reading the channel, so this mode sends
> CPR state over a second migration channel that is not visible to the user.
> New QEMU reads the second channel prior to creating devices or backends.
Oh, maybe this is the reason why cpr-transfer needs a separate URI..
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 01/11] machine: alloc-anon option
2024-07-18 15:43 ` Steven Sistare
@ 2024-07-18 16:22 ` Peter Xu
0 siblings, 0 replies; 77+ messages in thread
From: Peter Xu @ 2024-07-18 16:22 UTC (permalink / raw)
To: Steven Sistare
Cc: Igor Mammedov, qemu-devel, Fabiano Rosas, David Hildenbrand,
Marcel Apfelbaum, Eduardo Habkost, Philippe Mathieu-Daude,
Paolo Bonzini, Daniel P. Berrange, Markus Armbruster
On Thu, Jul 18, 2024 at 11:43:54AM -0400, Steven Sistare wrote:
> On 7/17/2024 3:24 PM, Peter Xu wrote:
> [...]
> >
> > PS to Steve: I think I left tons of other comments on the previous version
> > outside this patch too, but I don't think they were fully discussed before
> > this series was sent. I can re-read the series again, but I don't think it'll
> > work out if we keep skipping discussions..
>
> Hi Peter, let me address this part first, because I don't want you to think
> that I ignored your questions and concerns. This V2 series tries to address
> them. The change log was intended to be my response, rather than responding
> to each open question individually, but let me try again here with more detail.
> I apologize if I don't summarize your concerns correctly or completely.
>
> issue: discomfort with exec. why is it needed?
> response: exec is just a transport mechanism to send fd's to new qemu.
> I refactored to separate core patches from exec-specific patches, submitted
> cpr-transfer patches to illustrate a non-exec method, and provided reasons
> why one vs the other would be desirable in the commit messages and cover
> letter.
>
> issue: why do we need to preserve the ramblock fields and make them available
> prior to object creation?
> response: we don't need to preserve all of them, and we only need the fd
> prior to object creation, so I deleted the precreate, factory, and named
> object patches, and added CprState to preserve fd's. used_length arrives
> in the normal migration stream. max_length is recovered from the memfd
> using lseek.
>
> issue: the series is too large, with too much change.
> response: in addition to the deletions mentioned above, I simplified the
> functionality and tossed out style patches and nice-to-haves, so we can
> focus on core functionality. V2 is much smaller.
>
> issue: memfd_create option is oddly expressed and hard to understand.
> response: I redefined the option, deleted all the stylistic ramblock patches
> to lay its workings bare, and explicitly documented its effect on all types
> of memory in the commit messages and qapi documentation.
>
> issue: no need to preserve blocks like ROM for DMA (with memfd_create).
> Blocks that must be preserved should be surfaced to the user as objects.
> response: I disagree, and will continue that conversation in this email thread.
>
> issue: how will vfio be handled?
> response: I submitted the vfio patches (non-trivial, because first I had to
> rework them without using precreate vmstate).
>
> issue: how will fd's be preserved for chardevs?
> response: via cpr_save_fd, CprState, and cpr_load_fd at device creation time,
> in each device's creation function, just like vfio. Those primitives are
> defined in this V2 series.
Thanks for the answers. I think I'll need to read more into the patches in
the next few days; it looks like I'll get more answers from there.
I just sent an email, probably while you were drafting this one.. it may
have some questions that are not covered here.
I think a major issue with exec() is the (1-3) steps that I mentioned
there, which need to run sequentially; IIUC all these steps can be
completely avoided in cpr-transfer, and that may matter a lot for huge
VMs. But maybe I missed something.
Please have a look there.
--
Peter Xu
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 02/11] migration: cpr-state
2024-06-30 19:40 ` [PATCH V2 02/11] migration: cpr-state Steve Sistare
2024-07-17 18:39 ` Fabiano Rosas
@ 2024-07-19 15:03 ` Peter Xu
2024-07-20 19:53 ` Steven Sistare
1 sibling, 1 reply; 77+ messages in thread
From: Peter Xu @ 2024-07-19 15:03 UTC (permalink / raw)
To: Steve Sistare
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster
On Sun, Jun 30, 2024 at 12:40:25PM -0700, Steve Sistare wrote:
> CPR must save state that is needed after QEMU is restarted, when devices
> are realized. Thus the extra state cannot be saved in the migration stream,
> as objects must already exist before that stream can be loaded. Instead,
> define auxiliary state structures and vmstate descriptions, not associated
> with any registered object, and serialize the aux state to a cpr-specific
> stream in cpr_state_save. Deserialize in cpr_state_load after QEMU
> restarts, before devices are realized.
>
> Provide accessors for clients to register file descriptors for saving.
> The mechanism for passing the fd's to the new process will be specific
> to each migration mode, and added in subsequent patches.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> include/migration/cpr.h | 21 ++++++
> migration/cpr.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++++
> migration/meson.build | 1 +
> migration/migration.c | 6 ++
> migration/trace-events | 5 ++
> system/vl.c | 3 +
> 6 files changed, 224 insertions(+)
> create mode 100644 include/migration/cpr.h
> create mode 100644 migration/cpr.c
>
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> new file mode 100644
> index 0000000..8e7e705
> --- /dev/null
> +++ b/include/migration/cpr.h
> @@ -0,0 +1,21 @@
> +/*
> + * Copyright (c) 2021, 2024 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#ifndef MIGRATION_CPR_H
> +#define MIGRATION_CPR_H
> +
> +typedef int (*cpr_walk_fd_cb)(int fd);
> +void cpr_save_fd(const char *name, int id, int fd);
> +void cpr_delete_fd(const char *name, int id);
> +int cpr_find_fd(const char *name, int id);
> +int cpr_walk_fd(cpr_walk_fd_cb cb);
> +void cpr_resave_fd(const char *name, int id, int fd);
> +
> +int cpr_state_save(Error **errp);
> +int cpr_state_load(Error **errp);
> +
> +#endif
> diff --git a/migration/cpr.c b/migration/cpr.c
> new file mode 100644
> index 0000000..313e74e
> --- /dev/null
> +++ b/migration/cpr.c
> @@ -0,0 +1,188 @@
> +/*
> + * Copyright (c) 2021-2024 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "migration/cpr.h"
> +#include "migration/misc.h"
> +#include "migration/qemu-file.h"
> +#include "migration/savevm.h"
> +#include "migration/vmstate.h"
> +#include "sysemu/runstate.h"
> +#include "trace.h"
> +
> +/*************************************************************************/
> +/* cpr state container for all information to be saved. */
> +
> +typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
> +
> +typedef struct CprState {
> + CprFdList fds;
> +} CprState;
> +
> +static CprState cpr_state;
> +
> +/****************************************************************************/
> +
> +typedef struct CprFd {
> + char *name;
> + unsigned int namelen;
> + int id;
> + int fd;
[1]
> + QLIST_ENTRY(CprFd) next;
> +} CprFd;
> +
> +static const VMStateDescription vmstate_cpr_fd = {
> + .name = "cpr fd",
> + .version_id = 1,
> + .minimum_version_id = 1,
> + .fields = (VMStateField[]) {
> + VMSTATE_UINT32(namelen, CprFd),
> + VMSTATE_VBUFFER_ALLOC_UINT32(name, CprFd, 0, NULL, namelen),
> + VMSTATE_INT32(id, CprFd),
> + VMSTATE_INT32(fd, CprFd),
> + VMSTATE_END_OF_LIST()
> + }
> +};
> +
> +void cpr_save_fd(const char *name, int id, int fd)
> +{
> + CprFd *elem = g_new0(CprFd, 1);
> +
> + trace_cpr_save_fd(name, id, fd);
> + elem->name = g_strdup(name);
> + elem->namelen = strlen(name) + 1;
> + elem->id = id;
> + elem->fd = fd;
> + QLIST_INSERT_HEAD(&cpr_state.fds, elem, next);
> +}
> +
> +static CprFd *find_fd(CprFdList *head, const char *name, int id)
> +{
> + CprFd *elem;
> +
> + QLIST_FOREACH(elem, head, next) {
> + if (!strcmp(elem->name, name) && elem->id == id) {
> + return elem;
> + }
> + }
> + return NULL;
> +}
> +
> +void cpr_delete_fd(const char *name, int id)
> +{
> + CprFd *elem = find_fd(&cpr_state.fds, name, id);
> +
> + if (elem) {
> + QLIST_REMOVE(elem, next);
> + g_free(elem->name);
> + g_free(elem);
> + }
> +
> + trace_cpr_delete_fd(name, id);
> +}
> +
> +int cpr_find_fd(const char *name, int id)
> +{
> + CprFd *elem = find_fd(&cpr_state.fds, name, id);
> + int fd = elem ? elem->fd : -1;
> +
> + trace_cpr_find_fd(name, id, fd);
> + return fd;
> +}
> +
> +int cpr_walk_fd(cpr_walk_fd_cb cb)
> +{
> + CprFd *elem;
> +
> + QLIST_FOREACH(elem, &cpr_state.fds, next) {
> + if (elem->fd >= 0 && cb(elem->fd)) {
> + return 1;
> + }
> + }
> + return 0;
> +}
> +
> +void cpr_resave_fd(const char *name, int id, int fd)
> +{
> + CprFd *elem = find_fd(&cpr_state.fds, name, id);
> + int old_fd = elem ? elem->fd : -1;
> +
> + if (old_fd < 0) {
> + cpr_save_fd(name, id, fd);
I don't think I yet understand when old_fd<0 would happen, as this series
doesn't appear to use this function at all. From that POV, it might be
nice to add a comment above [1] for the "fd" field.
Meanwhile, do we need to remove the old_fd<0 element here, or is it
intended to keep both it and the new CprFd?
> + } else if (old_fd != fd) {
> + error_setg(&error_fatal,
> + "internal error: cpr fd '%s' id %d value %d "
> + "already saved with a different value %d",
> + name, id, fd, old_fd);
> + }
> +}
> +/*************************************************************************/
> +#define CPR_STATE "CprState"
> +
> +static const VMStateDescription vmstate_cpr_state = {
> + .name = CPR_STATE,
> + .version_id = 1,
> + .minimum_version_id = 1,
> + .fields = (VMStateField[]) {
> + VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, next),
> + VMSTATE_END_OF_LIST()
> + }
> +};
> +/*************************************************************************/
> +
> +int cpr_state_save(Error **errp)
> +{
> + int ret;
> + QEMUFile *f;
> +
> + /* set f based on mode in a later patch in this series */
> + return 0;
> +
> + qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
> + qemu_put_be32(f, QEMU_VM_FILE_VERSION);
Having magic/version makes sense to me, though I'd suggest we use new
CPR-specific magic/version values, so that if we see a binary dump we know
what it is, and we don't mix up a CPR image with a migration stream image.
> +
> + ret = vmstate_save_state(f, &vmstate_cpr_state, &cpr_state, 0);
s/0/NULL/
> + if (ret) {
> + error_setg(errp, "vmstate_save_state error %d", ret);
> + }
Can consider using vmstate_save_state_with_err().
> +
> + qemu_fclose(f);
> + return ret;
> +}
> +
> +int cpr_state_load(Error **errp)
> +{
> + int ret;
> + uint32_t v;
> + QEMUFile *f;
> +
> + /* set f based on mode in a later patch in this series */
> + return 0;
> +
> + v = qemu_get_be32(f);
> + if (v != QEMU_VM_FILE_MAGIC) {
> + error_setg(errp, "Not a migration stream (bad magic %x)", v);
> + qemu_fclose(f);
> + return -EINVAL;
> + }
> + v = qemu_get_be32(f);
> + if (v != QEMU_VM_FILE_VERSION) {
> + error_setg(errp, "Unsupported migration stream version %d", v);
> + qemu_fclose(f);
> + return -ENOTSUP;
> + }
> +
> + ret = vmstate_load_state(f, &vmstate_cpr_state, &cpr_state, 1);
> + if (ret) {
> + error_setg(errp, "vmstate_load_state error %d", ret);
> + }
Similarly, can use vmstate_save_state_with_err().
> +
> + qemu_fclose(f);
> + return ret;
> +}
> +
> diff --git a/migration/meson.build b/migration/meson.build
> index 5ce2acb4..87feb4c 100644
> --- a/migration/meson.build
> +++ b/migration/meson.build
> @@ -13,6 +13,7 @@ system_ss.add(files(
> 'block-dirty-bitmap.c',
> 'channel.c',
> 'channel-block.c',
> + 'cpr.c',
> 'dirtyrate.c',
> 'exec.c',
> 'fd.c',
> diff --git a/migration/migration.c b/migration/migration.c
> index 3dea06d..e394ad7 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -27,6 +27,7 @@
> #include "sysemu/cpu-throttle.h"
> #include "rdma.h"
> #include "ram.h"
> +#include "migration/cpr.h"
> #include "migration/global_state.h"
> #include "migration/misc.h"
> #include "migration.h"
> @@ -2118,6 +2119,10 @@ void qmp_migrate(const char *uri, bool has_channels,
> }
> }
>
> + if (cpr_state_save(&local_err)) {
> + goto out;
> + }
> +
> if (addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET) {
> SocketAddress *saddr = &addr->u.socket;
> if (saddr->type == SOCKET_ADDRESS_TYPE_INET ||
> @@ -2142,6 +2147,7 @@ void qmp_migrate(const char *uri, bool has_channels,
> MIGRATION_STATUS_FAILED);
> }
>
> +out:
> if (local_err) {
> if (!resume_requested) {
> yank_unregister_instance(MIGRATION_YANK_INSTANCE);
> diff --git a/migration/trace-events b/migration/trace-events
> index 0b7c332..173f2c0 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -340,6 +340,11 @@ colo_receive_message(const char *msg) "Receive '%s' message"
> # colo-failover.c
> colo_failover_set_state(const char *new_state) "new state %s"
>
> +# cpr.c
> +cpr_save_fd(const char *name, int id, int fd) "%s, id %d, fd %d"
> +cpr_delete_fd(const char *name, int id) "%s, id %d"
> +cpr_find_fd(const char *name, int id, int fd) "%s, id %d returns %d"
> +
> # block-dirty-bitmap.c
> send_bitmap_header_enter(void) ""
> send_bitmap_bits(uint32_t flags, uint64_t start_sector, uint32_t nr_sectors, uint64_t data_size) "flags: 0x%x, start_sector: %" PRIu64 ", nr_sectors: %" PRIu32 ", data_size: %" PRIu64
> diff --git a/system/vl.c b/system/vl.c
> index 03951be..6521ee3 100644
> --- a/system/vl.c
> +++ b/system/vl.c
> @@ -77,6 +77,7 @@
> #include "hw/block/block.h"
> #include "hw/i386/x86.h"
> #include "hw/i386/pc.h"
> +#include "migration/cpr.h"
> #include "migration/misc.h"
> #include "migration/snapshot.h"
> #include "sysemu/tpm.h"
> @@ -3713,6 +3714,8 @@ void qemu_init(int argc, char **argv)
>
> qemu_create_machine(machine_opts_dict);
>
> + cpr_state_load(&error_fatal);
Might be good to add a rich comment here explaining the decision on why
loading here; I think most of the tricks lie here. E.g., it needs to be
before XXX and it needs to be after YYY.
Thanks,
> +
> suspend_mux_open();
>
> qemu_disable_default_devices();
> --
> 1.8.3.1
>
--
Peter Xu
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 06/11] migration: fix mismatched GPAs during cpr
2024-06-30 19:40 ` [PATCH V2 06/11] migration: fix mismatched GPAs during cpr Steve Sistare
@ 2024-07-19 16:28 ` Peter Xu
2024-07-20 21:28 ` Steven Sistare
2024-08-07 21:04 ` Steven Sistare
0 siblings, 2 replies; 77+ messages in thread
From: Peter Xu @ 2024-07-19 16:28 UTC (permalink / raw)
To: Steve Sistare
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster
On Sun, Jun 30, 2024 at 12:40:29PM -0700, Steve Sistare wrote:
> For new cpr modes, ramblock_is_ignored will always be true, because the
> memory is preserved in place rather than copied. However, for an ignored
> block, parse_ramblock currently requires that the received address of the
> block must match the address of the statically initialized region on the
> target. This fails for a PCI rom block, because the memory region address
> is set when the guest writes to a BAR on the source, which does not occur
> on the target, causing a "Mismatched GPAs" error during cpr migration.
Is this a fix that is needed both with and without cpr mode?
It looks to me like mr->addr (for these ROMs) should only be set by PCI
config region updates, as you mentioned. But then I couldn't figure out
when they're updated on the dest in live migration: the ramblock info is
sent at the beginning of migration, before the PCI config space has been
migrated; I thought the real mr->addr should come from there.
I also don't yet understand why the mr->addr check needs to be done only
for ignore-shared. Some explanation around this area would be greatly
helpful..
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 02/11] migration: cpr-state
2024-07-19 15:03 ` Peter Xu
@ 2024-07-20 19:53 ` Steven Sistare
0 siblings, 0 replies; 77+ messages in thread
From: Steven Sistare @ 2024-07-20 19:53 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster
On 7/19/2024 11:03 AM, Peter Xu wrote:
> On Sun, Jun 30, 2024 at 12:40:25PM -0700, Steve Sistare wrote:
>> CPR must save state that is needed after QEMU is restarted, when devices
>> are realized. Thus the extra state cannot be saved in the migration stream,
>> as objects must already exist before that stream can be loaded. Instead,
>> define auxiliary state structures and vmstate descriptions, not associated
>> with any registered object, and serialize the aux state to a cpr-specific
>> stream in cpr_state_save. Deserialize in cpr_state_load after QEMU
>> restarts, before devices are realized.
>>
>> Provide accessors for clients to register file descriptors for saving.
>> The mechanism for passing the fd's to the new process will be specific
>> to each migration mode, and added in subsequent patches.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> include/migration/cpr.h | 21 ++++++
>> migration/cpr.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++++
>> migration/meson.build | 1 +
>> migration/migration.c | 6 ++
>> migration/trace-events | 5 ++
>> system/vl.c | 3 +
>> 6 files changed, 224 insertions(+)
>> create mode 100644 include/migration/cpr.h
>> create mode 100644 migration/cpr.c
>>
>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>> new file mode 100644
>> index 0000000..8e7e705
>> --- /dev/null
>> +++ b/include/migration/cpr.h
>> @@ -0,0 +1,21 @@
>> +/*
>> + * Copyright (c) 2021, 2024 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#ifndef MIGRATION_CPR_H
>> +#define MIGRATION_CPR_H
>> +
>> +typedef int (*cpr_walk_fd_cb)(int fd);
>> +void cpr_save_fd(const char *name, int id, int fd);
>> +void cpr_delete_fd(const char *name, int id);
>> +int cpr_find_fd(const char *name, int id);
>> +int cpr_walk_fd(cpr_walk_fd_cb cb);
>> +void cpr_resave_fd(const char *name, int id, int fd);
>> +
>> +int cpr_state_save(Error **errp);
>> +int cpr_state_load(Error **errp);
>> +
>> +#endif
>> diff --git a/migration/cpr.c b/migration/cpr.c
>> new file mode 100644
>> index 0000000..313e74e
>> --- /dev/null
>> +++ b/migration/cpr.c
>> @@ -0,0 +1,188 @@
>> +/*
>> + * Copyright (c) 2021-2024 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qapi/error.h"
>> +#include "migration/cpr.h"
>> +#include "migration/misc.h"
>> +#include "migration/qemu-file.h"
>> +#include "migration/savevm.h"
>> +#include "migration/vmstate.h"
>> +#include "sysemu/runstate.h"
>> +#include "trace.h"
>> +
>> +/*************************************************************************/
>> +/* cpr state container for all information to be saved. */
>> +
>> +typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
>> +
>> +typedef struct CprState {
>> + CprFdList fds;
>> +} CprState;
>> +
>> +static CprState cpr_state;
>> +
>> +/****************************************************************************/
>> +
>> +typedef struct CprFd {
>> + char *name;
>> + unsigned int namelen;
>> + int id;
>> + int fd;
>
> [1]
>
>> + QLIST_ENTRY(CprFd) next;
>> +} CprFd;
>> +
>> +static const VMStateDescription vmstate_cpr_fd = {
>> + .name = "cpr fd",
>> + .version_id = 1,
>> + .minimum_version_id = 1,
>> + .fields = (VMStateField[]) {
>> + VMSTATE_UINT32(namelen, CprFd),
>> + VMSTATE_VBUFFER_ALLOC_UINT32(name, CprFd, 0, NULL, namelen),
>> + VMSTATE_INT32(id, CprFd),
>> + VMSTATE_INT32(fd, CprFd),
>> + VMSTATE_END_OF_LIST()
>> + }
>> +};
>> +
>> +void cpr_save_fd(const char *name, int id, int fd)
>> +{
>> + CprFd *elem = g_new0(CprFd, 1);
>> +
>> + trace_cpr_save_fd(name, id, fd);
>> + elem->name = g_strdup(name);
>> + elem->namelen = strlen(name) + 1;
>> + elem->id = id;
>> + elem->fd = fd;
>> + QLIST_INSERT_HEAD(&cpr_state.fds, elem, next);
>> +}
>> +
>> +static CprFd *find_fd(CprFdList *head, const char *name, int id)
>> +{
>> + CprFd *elem;
>> +
>> + QLIST_FOREACH(elem, head, next) {
>> + if (!strcmp(elem->name, name) && elem->id == id) {
>> + return elem;
>> + }
>> + }
>> + return NULL;
>> +}
>> +
>> +void cpr_delete_fd(const char *name, int id)
>> +{
>> + CprFd *elem = find_fd(&cpr_state.fds, name, id);
>> +
>> + if (elem) {
>> + QLIST_REMOVE(elem, next);
>> + g_free(elem->name);
>> + g_free(elem);
>> + }
>> +
>> + trace_cpr_delete_fd(name, id);
>> +}
>> +
>> +int cpr_find_fd(const char *name, int id)
>> +{
>> + CprFd *elem = find_fd(&cpr_state.fds, name, id);
>> + int fd = elem ? elem->fd : -1;
>> +
>> + trace_cpr_find_fd(name, id, fd);
>> + return fd;
>> +}
>> +
>> +int cpr_walk_fd(cpr_walk_fd_cb cb)
>> +{
>> + CprFd *elem;
>> +
>> + QLIST_FOREACH(elem, &cpr_state.fds, next) {
>> + if (elem->fd >= 0 && cb(elem->fd)) {
>> + return 1;
>> + }
>> + }
>> + return 0;
>> +}
>> +
>> +void cpr_resave_fd(const char *name, int id, int fd)
>> +{
>> + CprFd *elem = find_fd(&cpr_state.fds, name, id);
>> + int old_fd = elem ? elem->fd : -1;
>> +
>> + if (old_fd < 0) {
>> + cpr_save_fd(name, id, fd);
>
> I don't think I understand yet when old_fd<0 would happen, as this series
> doesn't appear to use this function at all. From that POV, it might be nice
> to add a comment above [1] for the "fd" field.
cpr_resave_fd can simplify client logic. It allows the same name, id, fd triplet to
be (re)saved without creating a duplicate entry. The vfio series uses it. Yes,
I should add some brief API docs in the header file.
> Meanwhile, do we need to remove the old_fd<0 element here, or is it
> intended to keep that and the new CprFD?
old_fd < 0 is not an entry, it means no entry was found.
>> + } else if (old_fd != fd) {
>> + error_setg(&error_fatal,
>> + "internal error: cpr fd '%s' id %d value %d "
>> + "already saved with a different value %d",
>> + name, id, fd, old_fd);
>> + }
>> +}
>> +/*************************************************************************/
>> +#define CPR_STATE "CprState"
>> +
>> +static const VMStateDescription vmstate_cpr_state = {
>> + .name = CPR_STATE,
>> + .version_id = 1,
>> + .minimum_version_id = 1,
>> + .fields = (VMStateField[]) {
>> + VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, next),
>> + VMSTATE_END_OF_LIST()
>> + }
>> +};
>> +/*************************************************************************/
>> +
>> +int cpr_state_save(Error **errp)
>> +{
>> + int ret;
>> + QEMUFile *f;
>> +
>> + /* set f based on mode in a later patch in this series */
>> + return 0;
>> +
>> + qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
>> + qemu_put_be32(f, QEMU_VM_FILE_VERSION);
>
> Having magic/version makes sense to me, though I'd suggest we use CPR new
> magic/versions, so that if we see an binary dump we know what it is, and we
> don't mixup a CPR image against a migration stream image.
Will do.
>> +
>> + ret = vmstate_save_state(f, &vmstate_cpr_state, &cpr_state, 0);
>
> s/0/NULL/
Will do.
>> + if (ret) {
>> + error_setg(errp, "vmstate_save_state error %d", ret);
>> + }
>
> Can consider using vmstate_save_state_with_err().
Cool, will do.
>> +
>> + qemu_fclose(f);
>> + return ret;
>> +}
>> +
>> +int cpr_state_load(Error **errp)
>> +{
>> + int ret;
>> + uint32_t v;
>> + QEMUFile *f;
>> +
>> + /* set f based on mode in a later patch in this series */
>> + return 0;
>> +
>> + v = qemu_get_be32(f);
>> + if (v != QEMU_VM_FILE_MAGIC) {
>> + error_setg(errp, "Not a migration stream (bad magic %x)", v);
>> + qemu_fclose(f);
>> + return -EINVAL;
>> + }
>> + v = qemu_get_be32(f);
>> + if (v != QEMU_VM_FILE_VERSION) {
>> + error_setg(errp, "Unsupported migration stream version %d", v);
>> + qemu_fclose(f);
>> + return -ENOTSUP;
>> + }
>> +
>> + ret = vmstate_load_state(f, &vmstate_cpr_state, &cpr_state, 1);
>> + if (ret) {
>> + error_setg(errp, "vmstate_load_state error %d", ret);
>> + }
>
> Similarly, can use vmstate_load_state_with_err().
Hmm, vmstate_load_state_with_err does not exist.
>> +
>> + qemu_fclose(f);
>> + return ret;
>> +}
>> +
>> diff --git a/migration/meson.build b/migration/meson.build
>> index 5ce2acb4..87feb4c 100644
>> --- a/migration/meson.build
>> +++ b/migration/meson.build
>> @@ -13,6 +13,7 @@ system_ss.add(files(
>> 'block-dirty-bitmap.c',
>> 'channel.c',
>> 'channel-block.c',
>> + 'cpr.c',
>> 'dirtyrate.c',
>> 'exec.c',
>> 'fd.c',
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 3dea06d..e394ad7 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -27,6 +27,7 @@
>> #include "sysemu/cpu-throttle.h"
>> #include "rdma.h"
>> #include "ram.h"
>> +#include "migration/cpr.h"
>> #include "migration/global_state.h"
>> #include "migration/misc.h"
>> #include "migration.h"
>> @@ -2118,6 +2119,10 @@ void qmp_migrate(const char *uri, bool has_channels,
>> }
>> }
>>
>> + if (cpr_state_save(&local_err)) {
>> + goto out;
>> + }
>> +
>> if (addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET) {
>> SocketAddress *saddr = &addr->u.socket;
>> if (saddr->type == SOCKET_ADDRESS_TYPE_INET ||
>> @@ -2142,6 +2147,7 @@ void qmp_migrate(const char *uri, bool has_channels,
>> MIGRATION_STATUS_FAILED);
>> }
>>
>> +out:
>> if (local_err) {
>> if (!resume_requested) {
>> yank_unregister_instance(MIGRATION_YANK_INSTANCE);
>> diff --git a/migration/trace-events b/migration/trace-events
>> index 0b7c332..173f2c0 100644
>> --- a/migration/trace-events
>> +++ b/migration/trace-events
>> @@ -340,6 +340,11 @@ colo_receive_message(const char *msg) "Receive '%s' message"
>> # colo-failover.c
>> colo_failover_set_state(const char *new_state) "new state %s"
>>
>> +# cpr.c
>> +cpr_save_fd(const char *name, int id, int fd) "%s, id %d, fd %d"
>> +cpr_delete_fd(const char *name, int id) "%s, id %d"
>> +cpr_find_fd(const char *name, int id, int fd) "%s, id %d returns %d"
>> +
>> # block-dirty-bitmap.c
>> send_bitmap_header_enter(void) ""
>> send_bitmap_bits(uint32_t flags, uint64_t start_sector, uint32_t nr_sectors, uint64_t data_size) "flags: 0x%x, start_sector: %" PRIu64 ", nr_sectors: %" PRIu32 ", data_size: %" PRIu64
>> diff --git a/system/vl.c b/system/vl.c
>> index 03951be..6521ee3 100644
>> --- a/system/vl.c
>> +++ b/system/vl.c
>> @@ -77,6 +77,7 @@
>> #include "hw/block/block.h"
>> #include "hw/i386/x86.h"
>> #include "hw/i386/pc.h"
>> +#include "migration/cpr.h"
>> #include "migration/misc.h"
>> #include "migration/snapshot.h"
>> #include "sysemu/tpm.h"
>> @@ -3713,6 +3714,8 @@ void qemu_init(int argc, char **argv)
>>
>> qemu_create_machine(machine_opts_dict);
>>
>> + cpr_state_load(&error_fatal);
>
> Might be good to add a rich comment here explaining the decision on why
> loading here; I think most of the tricks lie here. E.g., it needs to be
> before XXX and it needs to be after YYY.
Will do.
- Steve
* Re: [PATCH V2 04/11] migration: stop vm earlier for cpr
2024-07-17 18:59 ` Fabiano Rosas
@ 2024-07-20 20:00 ` Steven Sistare
2024-07-22 13:42 ` Fabiano Rosas
0 siblings, 1 reply; 77+ messages in thread
From: Steven Sistare @ 2024-07-20 20:00 UTC (permalink / raw)
To: Fabiano Rosas, qemu-devel
Cc: Peter Xu, David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
Markus Armbruster
On 7/17/2024 2:59 PM, Fabiano Rosas wrote:
> Steve Sistare <steven.sistare@oracle.com> writes:
>
>> Stop the vm earlier for cpr, to guarantee consistent device state when
>> CPR state is saved.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> migration/migration.c | 22 +++++++++++++---------
>> 1 file changed, 13 insertions(+), 9 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 0f47765..8a8e927 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -2077,6 +2077,7 @@ void qmp_migrate(const char *uri, bool has_channels,
>> MigrationState *s = migrate_get_current();
>> g_autoptr(MigrationChannel) channel = NULL;
>> MigrationAddress *addr = NULL;
>> + bool stopped = false;
>>
>> /*
>> * Having preliminary checks for uri and channel
>> @@ -2120,6 +2121,15 @@ void qmp_migrate(const char *uri, bool has_channels,
>> }
>> }
>>
>> + if (migrate_mode_is_cpr(s)) {
>> + int ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
>> + if (ret < 0) {
>> + error_setg(&local_err, "migration_stop_vm failed, error %d", -ret);
>> + goto out;
>> + }
>> + stopped = true;
>> + }
>> +
>> if (cpr_state_save(&local_err)) {
>> goto out;
>> }
>> @@ -2155,6 +2165,9 @@ out:
>> }
>> migrate_fd_error(s, local_err);
>> error_propagate(errp, local_err);
>> + if (stopped && runstate_is_live(s->vm_old_state)) {
>> + vm_start();
>> + }
>
> What about non-live states? Shouldn't this be:
>
> if (stopped) {
> vm_resume();
> }
Not quite. vm_old_state may be a stopped state, so we don't want to resume.
However, I should probably restore the old stopped state here. I'll try some more
error recovery scenarios.
- Steve
>
>> return;
>> }
>> }
>> @@ -3738,7 +3751,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>> Error *local_err = NULL;
>> uint64_t rate_limit;
>> bool resume = (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP);
>> - int ret;
>>
>> /*
>> * If there's a previous error, free it and prepare for another one.
>> @@ -3810,14 +3822,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>> return;
>> }
>>
>> - if (migrate_mode_is_cpr(s)) {
>> - ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
>> - if (ret < 0) {
>> - error_setg(&local_err, "migration_stop_vm failed, error %d", -ret);
>> - goto fail;
>> - }
>> - }
>> -
>> if (migrate_background_snapshot()) {
>> qemu_thread_create(&s->thread, "mig/snapshot",
>> bg_migration_thread, s, QEMU_THREAD_JOINABLE);
* Re: [PATCH V2 01/11] machine: alloc-anon option
2024-07-16 9:19 ` Igor Mammedov
2024-07-17 19:24 ` Peter Xu
@ 2024-07-20 20:28 ` Steven Sistare
2024-07-22 9:10 ` David Hildenbrand
2024-07-29 12:29 ` Igor Mammedov
1 sibling, 2 replies; 77+ messages in thread
From: Steven Sistare @ 2024-07-20 20:28 UTC (permalink / raw)
To: Igor Mammedov
Cc: qemu-devel, Peter Xu, Fabiano Rosas, David Hildenbrand,
Marcel Apfelbaum, Eduardo Habkost, Philippe Mathieu-Daude,
Paolo Bonzini, Daniel P. Berrange, Markus Armbruster
On 7/16/2024 5:19 AM, Igor Mammedov wrote:
> On Sun, 30 Jun 2024 12:40:24 -0700
> Steve Sistare <steven.sistare@oracle.com> wrote:
>
>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>> on the value of the anon-alloc machine property. This affects
>> memory-backend-ram objects, guest RAM created with the global -m option
>> but without an associated memory-backend object and without the -mem-path
>> option
> nowadays, all machines have been converted to use a memory backend for VM RAM.
> so -m option implicitly creates memory-backend object,
> which will be either MEMORY_BACKEND_FILE if -mem-path present
> or MEMORY_BACKEND_RAM otherwise.
Yes. I dropped an important adjective, "implicit".
"guest RAM created with the global -m option but without an explicit associated
memory-backend object and without the -mem-path option"
>> To access the same memory in the old and new QEMU processes, the memory
>> must be mapped shared. Therefore, the implementation always sets
>
>> RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
>> user must explicitly specify the share option. In lieu of defining a new
> so the statement at the top that memory-backend-ram is affected is not
> really valid?
memory-backend-ram is affected by alloc-anon. But in addition, the user must
explicitly add the "share" option. I don't implicitly set share in this case,
because I would be overriding the user's specification of the memory object's property,
which would be private if omitted.
>> RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
>> as the condition for calling memfd_create.
>
> In general I do dislike adding yet another option that will affect
> guest RAM allocation (memory-backends should be sufficient).
>
> However I do see that you need memfd for device memory (vram, roms, ...).
> Can we just use memfd/shared unconditionally for those and
> avoid introducing a new confusing option?
The Linux kernel has different tunables for backing memfd's with huge pages, so we
could hurt performance if we unconditionally change to memfd. The user should have
a choice for any segment that is large enough for huge pages to improve performance,
which potentially is any memory-backend object. The non-memory-backend regions are
small, and it would be OK to use memfd unconditionally for them.
- Steve
* Re: [PATCH V2 01/11] machine: alloc-anon option
2024-07-17 19:24 ` Peter Xu
2024-07-18 15:43 ` Steven Sistare
@ 2024-07-20 20:35 ` Steven Sistare
2024-08-04 16:20 ` Peter Xu
1 sibling, 1 reply; 77+ messages in thread
From: Steven Sistare @ 2024-07-20 20:35 UTC (permalink / raw)
To: Peter Xu, Igor Mammedov, Alex Williamson, Cedric Le Goater
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster
On 7/17/2024 3:24 PM, Peter Xu wrote:
> On Tue, Jul 16, 2024 at 11:19:55AM +0200, Igor Mammedov wrote:
>> On Sun, 30 Jun 2024 12:40:24 -0700
>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>
>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>> on the value of the anon-alloc machine property. This affects
>>> memory-backend-ram objects, guest RAM created with the global -m option
>>> but without an associated memory-backend object and without the -mem-path
>>> option
>> nowadays, all machines have been converted to use a memory backend for VM RAM.
>> so -m option implicitly creates memory-backend object,
>> which will be either MEMORY_BACKEND_FILE if -mem-path present
>> or MEMORY_BACKEND_RAM otherwise.
>>
>>
>>> To access the same memory in the old and new QEMU processes, the memory
>>> must be mapped shared. Therefore, the implementation always sets
>>
>>> RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
>>> user must explicitly specify the share option. In lieu of defining a new
>> so the statement at the top that memory-backend-ram is affected is not
>> really valid?
>>
>>> RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
>>> as the condition for calling memfd_create.
>>
>> In general I do dislike adding yet another option that will affect
>> guest RAM allocation (memory-backends should be sufficient).
>
> I shared the same concern when reviewing the previous version, and I keep
> having so.
>
>>
>> However I do see that you need memfd for device memory (vram, roms, ...).
>> Can we just use memfd/shared unconditionally for those and
>> avoid introducing a new confusing option?
>
> ROMs should be fine IIUC, as they shouldn't be large, and they can be
> migrated normally (because they're not DMA target from VFIO assigned
> devices). IOW, per my understanding what must be shared via memfd is
> writable memories that can be DMAed from a VFIO device.
>
> I raised such question on whether / why vram can be a DMA target, but I
> didn't get a response. So I would like to redo this comment: I think we
> should figure out what is missing when we switch all backends to use
> -object, rather than adding this flag easily. When added, we should be
> crystal clear on which RAM region will be applicable by this flag.
All RAM regions that are mapped by the guest are registered for vfio DMA by
a memory listener and could potentially be DMA'd, either read or written.
That is defined by the architecture. We are not allowed to make value
judgements and decide to not support the architecture for some segments
such as ROM.
Alex Williamson, any comment here?
- Steve
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-07-18 15:56 ` [PATCH V2 00/11] Live update: cpr-exec Peter Xu
@ 2024-07-20 21:26 ` Steven Sistare
2024-08-04 16:10 ` Peter Xu
2024-07-22 8:59 ` [PATCH V2 00/11] Live update: cpr-exec David Hildenbrand
2024-08-05 10:01 ` Daniel P. Berrangé
2 siblings, 1 reply; 77+ messages in thread
From: Steven Sistare @ 2024-07-20 21:26 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster
On 7/18/2024 11:56 AM, Peter Xu wrote:
> Steve,
>
> On Sun, Jun 30, 2024 at 12:40:23PM -0700, Steve Sistare wrote:
>> What?
>
> Thanks for trying out with the cpr-transfer series. I saw that that series
> missed most of the cc list here, so I'm attaching the link here:
>
> https://lore.kernel.org/r/1719776648-435073-1-git-send-email-steven.sistare@oracle.com
>
> I think most of my previous questions for exec() solution still are there,
> I'll try to summarize them all in this reply as much as I can.
>
>>
>> This patch series adds the live migration cpr-exec mode, which allows
>> the user to update QEMU with minimal guest pause time, by preserving
>> guest RAM in place, albeit with new virtual addresses in new QEMU, and
>> by preserving device file descriptors.
>>
>> The new user-visible interfaces are:
>> * cpr-exec (MigMode migration parameter)
>> * cpr-exec-command (migration parameter)
>
> I really, really hope we can avoid this..
>
> It's super cumbersome to pass in a qemu cmdline in a qemu migration
> parameter.. if we can do that with generic live migration ways, I hope we
> stick with the clean approach.
This is no different from live migration, which requires a management agent to
launch target QEMU with all the arguments used to start source QEMU. Now that
same agent will send the arguments via cpr-exec-command.
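For example, the agent's QMP traffic might look something like this (illustrative
only; the exact parameter names, value shapes, and channel URI are defined by this
series' QAPI schema, and the command list and paths here are invented for the
example):

```json
{ "execute": "migrate-set-parameters",
  "arguments": { "mode": "cpr-exec",
                 "cpr-exec-command": [ "qemu-system-x86_64",
                                       "... same arguments as old QEMU ...",
                                       "-incoming", "file:/var/run/cpr.mig" ] } }
{ "execute": "migrate",
  "arguments": { "uri": "file:/var/run/cpr.mig" } }
```

The agent already knows the full old-QEMU command line, so constructing the list is
a matter of appending the -incoming option.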
>> * anon-alloc (command-line option for -machine)
>
> Igor questioned this, and I second his opinion.. We can leave the
> discussion there for this one.
Continued on the other thread.
>> The user sets the mode parameter before invoking the migrate command.
>> In this mode, the user issues the migrate command to old QEMU, which
>> stops the VM and saves state to the migration channels. Old QEMU then
>> exec's new QEMU, replacing the original process while retaining its PID.
>> The user specifies the command to exec new QEMU in the migration parameter
>> cpr-exec-command. The command must pass all old QEMU arguments to new
>> QEMU, plus the -incoming option. Execution resumes in new QEMU.
>>
>> Memory-backend objects must have the share=on attribute, but
>> memory-backend-epc is not supported. The VM must be started
>> with the '-machine anon-alloc=memfd' option, which allows anonymous
>> memory to be transferred in place to the new process.
>>
>> Why?
>>
>> This mode has less impact on the guest than any other method of updating
>> in place.
>
> So I wonder whether there's comparison between exec() and transfer mode
> that you recently proposed.
Not yet, but I will measure it.
> I'm asking because exec() (besides all the rest of things that I dislike on
> it in this approach..) should be simply slower, logically, due to the
> serialized operation to (1) tearing down the old mm, (2) reload the new
> ELF, then (3) runs through the QEMU init process.
>
> If with a generic migration solution, the dest QEMU can start running (2+3)
> concurrently without even need to run (1).
>
> In this whole process, I doubt (2) could be relatively fast, (3) I donno,
> maybe it could be slow but I never measured; Paolo may have good idea as I
> know he used to work on qboot.
We'll see, but in any case these take < 100 msec, which is a wonderfully short
pause time unless your customer is doing high speed stock trading. If cpr-transfer
is faster still, that's gravy, but cpr-exec is still great.
> For (1), I also doubt in your test cases it's fast, but it may not always
> be fast. Consider the guest has a huge TBs of shared mem, even if the
> memory will be completely shared between src/dst QEMUs, the pgtable won't!
> It means if the TBs are mapped in PAGE_SIZE tearing down the src QEMU
> pgtable alone can even take time, and that will be accounted in step (1)
> and further in exec() request.
Yes, there is an O(n) effect here, but it is a fast O(n) when the memory is
backed by huge pages. In UEK, we make it faster still by unmapping in parallel
with multiple threads. I don't have the data handy but can share after running
some experiments. Regardless, this time is negligible for small and medium
size guests, which form the majority of instances in a cloud.
> All this fuss will be avoided if you use a generic live migration model
> like cpr-transfer you proposed. That's also cleaner.
>
>> The pause time is much lower, because devices need not be torn
>> down and recreated, DMA does not need to be drained and quiesced, and minimal
>> state is copied to new QEMU. Further, there are no constraints on the guest.
>> By contrast, cpr-reboot mode requires the guest to support S3 suspend-to-ram,
>> and suspending plus resuming vfio devices adds multiple seconds to the
>> guest pause time. Lastly, there is no loss of connectivity to the guest,
>> because chardev descriptors remain open and connected.
>
> Again, I raised the question on why this would matter, as after all mgmt
> app will need to coop with reconnections due to the fact they'll need to
> support a generic live migration, in which case reconnection is a must.
>
> So far it doesn't sound like a performance critical path, for example, to
> do the mgmt reconnects on the ports. So this might be an optimization that
> most mgmt apps may not care much?
Perhaps. I view the chardev preservation as nice to have, but not essential.
It does not appear in this series, other than in docs. It's easy to implement
given the CPR foundation. I suggest we continue this discussion when I post
the chardev series, so we can focus on the core functionality.
>> These benefits all derive from the core design principle of this mode,
>> which is preserving open descriptors. This approach is very general and
>> can be used to support a wide variety of devices that do not have hardware
>> support for live migration, including but not limited to: vfio, chardev,
>> vhost, vdpa, and iommufd. Some devices need new kernel software interfaces
>> to allow a descriptor to be used in a process that did not originally open it.
>
> Yes, I still think this is a great idea. It just can also be built on top
> of something else than exec().
>
>>
>> In a containerized QEMU environment, cpr-exec reuses an existing QEMU
>> container and its assigned resources. By contrast, consider a design in
>> which a new container is created on the same host as the target of the
>> CPR operation. Resources must be reserved for the new container, while
>> the old container still reserves resources until the operation completes.
>
> Note that if we need to share RAM anyway, the resources consumption should
> be minimal, as mem should IMHO be the major concern (except CPU, but CPU
> isn't a concern in this scenario) in container world and here the shared
> guest mem shouldn't be accounted to the dest container. So IMHO it's about
> the metadata QEMU/KVM needs to do the hypervisor work, it seems to me, and
> that should be relatively small.
>
> In that case I don't yet see it a huge improvement, if the dest container
> is cheap to initiate.
It's about reserving memory and CPUs, and transferring those reservations from
the old instance to the new, and fiddling with the OS mechanisms that enforce
reservations and limits. The devil is in the details, and with the exec model,
the management agent can ignore all of that.
You don't see it as a huge improvement because you don't need to write the
management code. I do!
Both modes are valid and useful - exec in container, or launch a new container.
I have volunteered to implement the cpr-transfer mode for the latter, a mode
I do not use. Please don't reward me by dropping the mode I care about :)
Both modes can co-exist. The presence of the cpr-exec specific code in qemu
will not hinder future live migration development.
>> Avoiding over commitment requires extra work in the management layer.
>
> So it would be nice to know what needs to be overcommitted here. I confess
> I don't know much on containerized VMs, so maybe the page cache can be a
> problem even if shared. But I hope we can spell that out. Logically IIUC
> memcg shouldn't account those page cache if preallocated, because memcg
> accounting should be done at folio allocations, at least, where the page
> cache should miss first (so not this case..).
>
>> This is one reason why a cloud provider may prefer cpr-exec. A second reason
>> is that the container may include agents with their own connections to the
>> outside world, and such connections remain intact if the container is reused.
>>
>> How?
chardev preservation. The qemu socket chardevs to these agents are preserved,
and the agent connections to the outside world do not change, so no one sees
any interruption of traffic.
>> All memory that is mapped by the guest is preserved in place. Indeed,
>> it must be, because it may be the target of DMA requests, which are not
>> quiesced during cpr-exec. All such memory must be mmap'able in new QEMU.
>> This is easy for named memory-backend objects, as long as they are mapped
>> shared, because they are visible in the file system in both old and new QEMU.
>> Anonymous memory must be allocated using memfd_create rather than MAP_ANON,
>> so the memfd's can be sent to new QEMU. Pages that were locked in memory
>> for DMA in old QEMU remain locked in new QEMU, because the descriptor of
>> the device that locked them remains open.
>>
>> cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,
>> and by sending the unique name and value of each descriptor to new QEMU
>> via CPR state.
>>
>> For device descriptors, new QEMU reuses the descriptor when creating the
>> device, rather than opening it again. The same holds for chardevs. For
>> memfd descriptors, new QEMU mmap's the preserved memfd when a ramblock
>> is created.
>>
>> CPR state cannot be sent over the normal migration channel, because devices
>> and backends are created prior to reading the channel, so this mode sends
>> CPR state over a second migration channel that is not visible to the user.
>> New QEMU reads the second channel prior to creating devices or backends.
>
> Oh, maybe this is the reason that cpr-transfer will need a separate uri..
Indeed.
- Steve
* Re: [PATCH V2 06/11] migration: fix mismatched GPAs during cpr
2024-07-19 16:28 ` Peter Xu
@ 2024-07-20 21:28 ` Steven Sistare
2024-08-07 21:04 ` Steven Sistare
1 sibling, 0 replies; 77+ messages in thread
From: Steven Sistare @ 2024-07-20 21:28 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster
On 7/19/2024 12:28 PM, Peter Xu wrote:
> On Sun, Jun 30, 2024 at 12:40:29PM -0700, Steve Sistare wrote:
>> For new cpr modes, ramblock_is_ignored will always be true, because the
>> memory is preserved in place rather than copied. However, for an ignored
>> block, parse_ramblock currently requires that the received address of the
>> block must match the address of the statically initialized region on the
>> target. This fails for a PCI rom block, because the memory region address
>> is set when the guest writes to a BAR on the source, which does not occur
>> on the target, causing a "Mismatched GPAs" error during cpr migration.
>
> Is this a common fix with/without cpr mode?
It does not occur during normal migration.
> It looks to me mr->addr (for these ROMs) should only be set in PCI config
> region updates as you mentioned. But then I didn't figure out when they're
> updated on dest in live migration: the ramblock info was sent at the
> beginning of migration, so it doesn't even have PCI config space migrated;
> I thought the real mr->addr should be in there.
>
> I also failed to understand yet on why the mr->addr check needs to be done
> by ignore-shared only. Some explanation would be greatly helpful around
> this area..
I will continue this thread later and explain more fully.
- Steve
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-07-18 15:56 ` [PATCH V2 00/11] Live update: cpr-exec Peter Xu
2024-07-20 21:26 ` Steven Sistare
@ 2024-07-22 8:59 ` David Hildenbrand
2024-08-04 15:43 ` Peter Xu
2024-08-05 10:01 ` Daniel P. Berrangé
2 siblings, 1 reply; 77+ messages in thread
From: David Hildenbrand @ 2024-07-22 8:59 UTC (permalink / raw)
To: Peter Xu, Steve Sistare
Cc: qemu-devel, Fabiano Rosas, Marcel Apfelbaum, Eduardo Habkost,
Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
Markus Armbruster
On 18.07.24 17:56, Peter Xu wrote:
> Steve,
>
> On Sun, Jun 30, 2024 at 12:40:23PM -0700, Steve Sistare wrote:
>> What?
>
> Thanks for trying out with the cpr-transfer series. I saw that that series
> missed most of the cc list here, so I'm attaching the link here:
>
> https://lore.kernel.org/r/1719776648-435073-1-git-send-email-steven.sistare@oracle.com
>
> I think most of my previous questions for exec() solution still are there,
> I'll try to summarize them all in this reply as much as I can.
>
>>
>> This patch series adds the live migration cpr-exec mode, which allows
>> the user to update QEMU with minimal guest pause time, by preserving
>> guest RAM in place, albeit with new virtual addresses in new QEMU, and
>> by preserving device file descriptors.
>>
>> The new user-visible interfaces are:
>> * cpr-exec (MigMode migration parameter)
>> * cpr-exec-command (migration parameter)
>
> I really, really hope we can avoid this..
>
> It's super cumbersome to pass in a qemu cmdline in a qemu migration
> parameter.. if we can do that with generic live migration ways, I hope we
> stick with the clean approach.
>
>> * anon-alloc (command-line option for -machine)
>
> Igor questioned this, and I second his opinion.. We can leave the
> discussion there for this one.
>
>>
>> The user sets the mode parameter before invoking the migrate command.
>> In this mode, the user issues the migrate command to old QEMU, which
>> stops the VM and saves state to the migration channels. Old QEMU then
>> exec's new QEMU, replacing the original process while retaining its PID.
>> The user specifies the command to exec new QEMU in the migration parameter
>> cpr-exec-command. The command must pass all old QEMU arguments to new
>> QEMU, plus the -incoming option. Execution resumes in new QEMU.
>>
>> Memory-backend objects must have the share=on attribute, but
>> memory-backend-epc is not supported. The VM must be started
>> with the '-machine anon-alloc=memfd' option, which allows anonymous
>> memory to be transferred in place to the new process.
>>
>> Why?
>>
>> This mode has less impact on the guest than any other method of updating
>> in place.
>
> So I wonder whether there's comparison between exec() and transfer mode
> that you recently proposed.
>
> I'm asking because exec() (besides all the rest of things that I dislike on
> it in this approach..) should be simply slower, logically, due to the
> serialized operation to (1) tearing down the old mm, (2) reload the new
> ELF, then (3) runs through the QEMU init process.
>
> If with a generic migration solution, the dest QEMU can start running (2+3)
> concurrently without even need to run (1).
I'll note (not sure if already discussed) that with the "async-teardown"
option we have a way to move the MM teardown to a separate process, such
that it will happen asynchronously.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 01/11] machine: alloc-anon option
2024-07-20 20:28 ` Steven Sistare
@ 2024-07-22 9:10 ` David Hildenbrand
2024-07-29 12:29 ` Igor Mammedov
1 sibling, 0 replies; 77+ messages in thread
From: David Hildenbrand @ 2024-07-22 9:10 UTC (permalink / raw)
To: Steven Sistare, Igor Mammedov
Cc: qemu-devel, Peter Xu, Fabiano Rosas, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster
On 20.07.24 22:28, Steven Sistare wrote:
> On 7/16/2024 5:19 AM, Igor Mammedov wrote:
>> On Sun, 30 Jun 2024 12:40:24 -0700
>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>
>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>> on the value of the anon-alloc machine property. This affects
>>> memory-backend-ram objects, guest RAM created with the global -m option
>>> but without an associated memory-backend object and without the -mem-path
>>> option
>> nowadays, all machines were converted to use memory backend for VM RAM.
>> so -m option implicitly creates memory-backend object,
>> which will be either MEMORY_BACKEND_FILE if -mem-path present
>> or MEMORY_BACKEND_RAM otherwise.
>
> Yes. I dropped an important adjective, "implicit".
>
> "guest RAM created with the global -m option but without an explicit associated
> memory-backend object and without the -mem-path option"
>
>>> To access the same memory in the old and new QEMU processes, the memory
>>> must be mapped shared. Therefore, the implementation always sets
>>
>>> RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
>>> user must explicitly specify the share option. In lieu of defining a new
>> so statement at the top that memory-backend-ram is affected is not
>> really valid?
>
> memory-backend-ram is affected by alloc-anon. But in addition, the user must
> explicitly add the "share" option. I don't implicitly set share in this case,
> because I would be overriding the user's specification of the memory object's property,
> which would be private if omitted.
Note that memory-backend-memfd uses "shared=on" as default, as using
"shared=off" is something that shouldn't have ever been allowed. It can
(and will) result in a double memory consumption.
One reason I also don't quite like this approach :/
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 04/11] migration: stop vm earlier for cpr
2024-07-20 20:00 ` Steven Sistare
@ 2024-07-22 13:42 ` Fabiano Rosas
2024-08-06 20:52 ` Steven Sistare
0 siblings, 1 reply; 77+ messages in thread
From: Fabiano Rosas @ 2024-07-22 13:42 UTC (permalink / raw)
To: Steven Sistare, qemu-devel
Cc: Peter Xu, David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
Markus Armbruster
Steven Sistare <steven.sistare@oracle.com> writes:
> On 7/17/2024 2:59 PM, Fabiano Rosas wrote:
>> Steve Sistare <steven.sistare@oracle.com> writes:
>>
>>> Stop the vm earlier for cpr, to guarantee consistent device state when
>>> CPR state is saved.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>> migration/migration.c | 22 +++++++++++++---------
>>> 1 file changed, 13 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 0f47765..8a8e927 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -2077,6 +2077,7 @@ void qmp_migrate(const char *uri, bool has_channels,
>>> MigrationState *s = migrate_get_current();
>>> g_autoptr(MigrationChannel) channel = NULL;
>>> MigrationAddress *addr = NULL;
>>> + bool stopped = false;
>>>
>>> /*
>>> * Having preliminary checks for uri and channel
>>> @@ -2120,6 +2121,15 @@ void qmp_migrate(const char *uri, bool has_channels,
>>> }
>>> }
>>>
>>> + if (migrate_mode_is_cpr(s)) {
>>> + int ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
>>> + if (ret < 0) {
>>> + error_setg(&local_err, "migration_stop_vm failed, error %d", -ret);
>>> + goto out;
>>> + }
>>> + stopped = true;
>>> + }
>>> +
>>> if (cpr_state_save(&local_err)) {
>>> goto out;
>>> }
>>> @@ -2155,6 +2165,9 @@ out:
>>> }
>>> migrate_fd_error(s, local_err);
>>> error_propagate(errp, local_err);
>>> + if (stopped && runstate_is_live(s->vm_old_state)) {
>>> + vm_start();
>>> + }
>>
>> What about non-live states? Shouldn't this be:
>>
>> if (stopped) {
>> vm_resume();
>> }
>
> Not quite. vm_old_state may be a stopped state, so we don't want to resume.
> However, I should probably restore the old stopped state here. I'll try some more
> error recovery scenarios.
AIUI vm_resume() does the right thing already:
void vm_resume(RunState state)
{
if (runstate_is_live(state)) {
vm_start();
} else {
runstate_set(state);
}
}
>
> - Steve
>
>>
>>> return;
>>> }
>>> }
>>> @@ -3738,7 +3751,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>>> Error *local_err = NULL;
>>> uint64_t rate_limit;
>>> bool resume = (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP);
>>> - int ret;
>>>
>>> /*
>>> * If there's a previous error, free it and prepare for another one.
>>> @@ -3810,14 +3822,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>>> return;
>>> }
>>>
>>> - if (migrate_mode_is_cpr(s)) {
>>> - ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
>>> - if (ret < 0) {
>>> - error_setg(&local_err, "migration_stop_vm failed, error %d", -ret);
>>> - goto fail;
>>> - }
>>> - }
>>> -
>>> if (migrate_background_snapshot()) {
>>> qemu_thread_create(&s->thread, "mig/snapshot",
>>> bg_migration_thread, s, QEMU_THREAD_JOINABLE);
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 01/11] machine: alloc-anon option
2024-07-20 20:28 ` Steven Sistare
2024-07-22 9:10 ` David Hildenbrand
@ 2024-07-29 12:29 ` Igor Mammedov
2024-08-08 18:32 ` Steven Sistare
1 sibling, 1 reply; 77+ messages in thread
From: Igor Mammedov @ 2024-07-29 12:29 UTC (permalink / raw)
To: Steven Sistare
Cc: qemu-devel, Peter Xu, Fabiano Rosas, David Hildenbrand,
Marcel Apfelbaum, Eduardo Habkost, Philippe Mathieu-Daude,
Paolo Bonzini, Daniel P. Berrange, Markus Armbruster
On Sat, 20 Jul 2024 16:28:25 -0400
Steven Sistare <steven.sistare@oracle.com> wrote:
> On 7/16/2024 5:19 AM, Igor Mammedov wrote:
> > On Sun, 30 Jun 2024 12:40:24 -0700
> > Steve Sistare <steven.sistare@oracle.com> wrote:
> >
> >> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> >> on the value of the anon-alloc machine property. This affects
> >> memory-backend-ram objects, guest RAM created with the global -m option
> >> but without an associated memory-backend object and without the -mem-path
> >> option
> > nowadays, all machines were converted to use memory backend for VM RAM.
> > so -m option implicitly creates memory-backend object,
> > which will be either MEMORY_BACKEND_FILE if -mem-path present
> > or MEMORY_BACKEND_RAM otherwise.
>
> Yes. I dropped an important adjective, "implicit".
>
> "guest RAM created with the global -m option but without an explicit associated
> memory-backend object and without the -mem-path option"
>
> >> To access the same memory in the old and new QEMU processes, the memory
> >> must be mapped shared. Therefore, the implementation always sets
> >
> >> RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
> >> user must explicitly specify the share option. In lieu of defining a new
> > so statement at the top that memory-backend-ram is affected is not
> > really valid?
>
> memory-backend-ram is affected by alloc-anon. But in addition, the user must
> explicitly add the "share" option. I don't implicitly set share in this case,
> because I would be overriding the user's specification of the memory object's property,
> which would be private if omitted.
instead of touching implicit RAM (-m), it would be better to error out
and ask user to provide properly configured memory-backend explicitly.
>
> >> RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
> >> as the condition for calling memfd_create.
> >
> > In general I do dislike adding yet another option that will affect
> > guest RAM allocation (memory-backends should be sufficient).
> >
> > However I do see that you need memfd for device memory (vram, roms, ...).
> > Can we just use memfd/shared unconditionally for those and
> > avoid introducing a new confusing option?
>
> The Linux kernel has different tunables for backing memfd's with huge pages, so we
> could hurt performance if we unconditionally change to memfd. The user should have
> a choice for any segment that is large enough for huge pages to improve performance,
> which potentially is any memory-backend object. The non-memory-backend objects are
> small, and it would be OK to use memfd unconditionally for them.
>
> - Steve
>
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-07-22 8:59 ` [PATCH V2 00/11] Live update: cpr-exec David Hildenbrand
@ 2024-08-04 15:43 ` Peter Xu
2024-08-05 9:52 ` David Hildenbrand
0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2024-08-04 15:43 UTC (permalink / raw)
To: David Hildenbrand
Cc: Steve Sistare, qemu-devel, Fabiano Rosas, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster
On Mon, Jul 22, 2024 at 10:59:47AM +0200, David Hildenbrand wrote:
> > So I wonder whether there's comparison between exec() and transfer mode
> > that you recently proposed.
> >
> > I'm asking because exec() (besides all the rest of things that I dislike on
> > it in this approach..) should be simply slower, logically, due to the
> > serialized operation to (1) tearing down the old mm, (2) reload the new
> > ELF, then (3) runs through the QEMU init process.
> >
> > If with a generic migration solution, the dest QEMU can start running (2+3)
> > concurrently without even need to run (1).
>
> I'll note (not sure if already discussed) that with the "async-teardown"
> option we have a way to move the MM teardown to a separate process, such
> that it will happen asynchronously.
I just had a look; it may not trivially work, as it relies on the QEMU
process quitting first..
async_teardown_fn():
if (the_ppid == getppid()) {
pause();
}
While if we stick with exec(), then PID shouldn't change, so the teardown
process can hold the mm and pause until the VM is destroyed..
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-07-20 21:26 ` Steven Sistare
@ 2024-08-04 16:10 ` Peter Xu
2024-08-07 19:47 ` Steven Sistare
0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2024-08-04 16:10 UTC (permalink / raw)
To: Steven Sistare
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster
On Sat, Jul 20, 2024 at 05:26:07PM -0400, Steven Sistare wrote:
> On 7/18/2024 11:56 AM, Peter Xu wrote:
> > Steve,
> >
> > On Sun, Jun 30, 2024 at 12:40:23PM -0700, Steve Sistare wrote:
> > > What?
> >
> > Thanks for trying out with the cpr-transfer series. I saw that that series
> > missed most of the cc list here, so I'm attaching the link here:
> >
> > https://lore.kernel.org/r/1719776648-435073-1-git-send-email-steven.sistare@oracle.com
> >
> > I think most of my previous questions for exec() solution still are there,
> > I'll try to summarize them all in this reply as much as I can.
> >
> > >
> > > This patch series adds the live migration cpr-exec mode, which allows
> > > the user to update QEMU with minimal guest pause time, by preserving
> > > guest RAM in place, albeit with new virtual addresses in new QEMU, and
> > > by preserving device file descriptors.
> > >
> > > The new user-visible interfaces are:
> > > * cpr-exec (MigMode migration parameter)
> > > * cpr-exec-command (migration parameter)
> >
> > I really, really hope we can avoid this..
> >
> > It's super cumbersome to pass in a qemu cmdline in a qemu migration
> > parameter.. if we can do that with generic live migration ways, I hope we
> > stick with the clean approach.
>
> This is no different than live migration, requiring a management agent to
> launch target qemu with all the arguments use to start source QEMU. Now that
> same agent will send the arguments via cpr-exec-command.
It's still a bit different.
There we append "-incoming defer" only, which makes sense because we're
instructing a QEMU to take an incoming stream to load. Now we append the
complete qemu cmdline within the QEMU itself, that was booted with exactly
the same cmdline.. :-( I would at least start to ask why we need to pass
the same thing twice..
Not saying that this is no-go, but really looks unpretty to me from this
part.. especially if a cleaner solution seems possible.
>
> > > * anon-alloc (command-line option for -machine)
> >
> > Igor questioned this, and I second his opinion.. We can leave the
> > discussion there for this one.
>
> Continued on the other thread.
>
> > > The user sets the mode parameter before invoking the migrate command.
> > > In this mode, the user issues the migrate command to old QEMU, which
> > > stops the VM and saves state to the migration channels. Old QEMU then
> > > exec's new QEMU, replacing the original process while retaining its PID.
> > > The user specifies the command to exec new QEMU in the migration parameter
> > > cpr-exec-command. The command must pass all old QEMU arguments to new
> > > QEMU, plus the -incoming option. Execution resumes in new QEMU.
> > >
> > > Memory-backend objects must have the share=on attribute, but
> > > memory-backend-epc is not supported. The VM must be started
> > > with the '-machine anon-alloc=memfd' option, which allows anonymous
> > > memory to be transferred in place to the new process.
> > >
> > > Why?
> > >
> > > This mode has less impact on the guest than any other method of updating
> > > in place.
> >
> > So I wonder whether there's comparison between exec() and transfer mode
> > that you recently proposed.
>
> Not yet, but I will measure it.
Thanks.
>
> > I'm asking because exec() (besides all the rest of things that I dislike on
> > it in this approach..) should be simply slower, logically, due to the
> > serialized operation to (1) tearing down the old mm, (2) reload the new
> > ELF, then (3) runs through the QEMU init process.
> >
> > If with a generic migration solution, the dest QEMU can start running (2+3)
> > concurrently without even need to run (1).
> >
> > In this whole process, I doubt (2) could be relatively fast, (3) I donno,
> > maybe it could be slow but I never measured; Paolo may have good idea as I
> > know he used to work on qboot.
>
> We'll see, but in any case these take < 100 msec, which is a wonderfully short
I doubt whether it keeps <100ms when the VM is large. Note that I think we
should cover the case where the user does 4k mapping for a large guest.
So I agree that 4k mapping over e.g. 1T without hugetlb may not be the
ideal case, but I suspect there are indeed serious users running QEMU like
that, and if we have an almost exactly parallel solution that does cover
this case, it is definitely preferable from this POV, simply because
there's nothing to lose there..
> pause time unless your customer is doing high speed stock trading. If cpr-transfer
> is faster still, that's gravy, but cpr-exec is still great.
>
> > For (1), I also doubt in your test cases it's fast, but it may not always
> > be fast. Consider the guest has a huge TBs of shared mem, even if the
> > memory will be completely shared between src/dst QEMUs, the pgtable won't!
> > It means if the TBs are mapped in PAGE_SIZE tearing down the src QEMU
> > pgtable alone can even take time, and that will be accounted in step (1)
> > and further in exec() request.
>
> Yes, there is an O(n) effect here, but it is a fast O(n) when the memory is
> backed by huge pages. In UEK, we make it faster still by unmapping in parallel
> with multiple threads. I don't have the data handy but can share after running
> some experiments. Regardless, this time is negligible for small and medium
> size guests, which form the majority of instances in a cloud.
Possible. It's just that it sounds like a good idea to avoid having the
downtime account for any pgtable teardown of the old mm, regardless of how
much time it takes. I suspect some use cases can take a fair amount of
time.
So I think this is "one point less" for exec() solution, while the issue
can be big or small on its own. What matters is IMHO where exec() is
superior so that we'd like to pay for this. I'll try to stop saying "let's
try to avoid using exec() as it sounds risky", but we still need to compare
with solid pros and cons.
>
> > All these fuss will be avoided if you use a generic live migration model
> > like cpr-transfer you proposed. That's also cleaner.
> >
> > > The pause time is much lower, because devices need not be torn
> > > down and recreated, DMA does not need to be drained and quiesced, and minimal
> > > state is copied to new QEMU. Further, there are no constraints on the guest.
> > > By contrast, cpr-reboot mode requires the guest to support S3 suspend-to-ram,
> > > and suspending plus resuming vfio devices adds multiple seconds to the
> > > guest pause time. Lastly, there is no loss of connectivity to the guest,
> > > because chardev descriptors remain open and connected.
> >
> > Again, I raised the question on why this would matter, as after all mgmt
> > app will need to coop with reconnections due to the fact they'll need to
> > support a generic live migration, in which case reconnection is a must.
> >
> > So far it doesn't sound like a performance critical path, for example, to
> > do the mgmt reconnects on the ports. So this might be an optimization that
> > most mgmt apps may not care much?
>
> Perhaps. I view the chardev preservation as nice to have, but not essential.
> It does not appear in this series, other than in docs. It's easy to implement
> given the CPR foundation. I suggest we continue this discussion when I post
> the chardev series, so we can focus on the core functionality.
It's just that it can affect our decision on choosing the way to go.
For example, do we have someone from Libvirt or any mgmt layer can help
justify this point?
As I said, I thought most facilities for reconnection should be ready, but
I could miss important facts in mgmt layers..
>
> > > These benefits all derive from the core design principle of this mode,
> > > which is preserving open descriptors. This approach is very general and
> > > can be used to support a wide variety of devices that do not have hardware
> > > support for live migration, including but not limited to: vfio, chardev,
> > > vhost, vdpa, and iommufd. Some devices need new kernel software interfaces
> > > to allow a descriptor to be used in a process that did not originally open it.
> >
> > Yes, I still think this is a great idea. It just can also be built on top
> > of something else than exec().
> >
> > >
> > > In a containerized QEMU environment, cpr-exec reuses an existing QEMU
> > > container and its assigned resources. By contrast, consider a design in
> > > which a new container is created on the same host as the target of the
> > > CPR operation. Resources must be reserved for the new container, while
> > > the old container still reserves resources until the operation completes.
> >
> > Note that if we need to share RAM anyway, the resources consumption should
> > be minimal, as mem should IMHO be the major concern (except CPU, but CPU
> > isn't a concern in this scenario) in container world and here the shared
> > guest mem shouldn't be accounted to the dest container. So IMHO it's about
> > the metadata QEMU/KVM needs to do the hypervisor work, it seems to me, and
> > that should be relatively small.
> >
> > In that case I don't yet see it a huge improvement, if the dest container
> > is cheap to initiate.
>
> It's about reserving memory and CPUs, and transferring those reservations from
> the old instance to the new, and fiddling with the OS mechanisms that enforce
> reservations and limits. The devil is in the details, and with the exec model,
> the management agent can ignore all of that.
>
> You don't see it as a huge improvement because you don't need to write the
> management code. I do!
Heh, possibly true.
Could I ask what management code you're working on? Why doesn't that
management code already need to work out these reconnection problems
(as with pre-CPR ways of doing live upgrade)?
>
> Both modes are valid and useful - exec in container, or launch a new container.
> I have volunteered to implement the cpr-transfer mode for the latter, a mode
> I do not use. Please don't reward me by dropping the mode I care about :)
> Both modes can co-exist. The presence of the cpr-exec specific code in qemu
> will not hinder future live migration development.
I'm trying to remove some of my "prejudices" on exec() :). Hopefully that
proved more or less that I simply wanted to be fair on making a design
decision. I don't think I have a strong opinion, but it looks to me not
ideal to merge two solutions if both modes share the use case.
Or if you think both modes should service different purpose, we might
consider both, but that needs to be justified - IOW, we shouldn't merge
anything that will never be used.
Thanks,
>
> > > Avoiding over commitment requires extra work in the management layer.
> >
> > So it would be nice to know what needs to be overcommitted here. I confess
> > I don't know much on containerized VMs, so maybe the page cache can be a
> > problem even if shared. But I hope we can spell that out. Logically IIUC
> > memcg shouldn't account those page cache if preallocated, because memcg
> > accounting should be done at folio allocations, at least, where the page
> > cache should miss first (so not this case..).
> >
> > > This is one reason why a cloud provider may prefer cpr-exec. A second reason
> > > is that the container may include agents with their own connections to the
> > > outside world, and such connections remain intact if the container is reused.
> > >
> > > How?
>
> chardev preservation. The qemu socket chardevs to these agents are preserved,
> and the agent connections to the outside world do not change, so no one sees
> any interruption of traffic.
>
> > > All memory that is mapped by the guest is preserved in place. Indeed,
> > > it must be, because it may be the target of DMA requests, which are not
> > > quiesced during cpr-exec. All such memory must be mmap'able in new QEMU.
> > > This is easy for named memory-backend objects, as long as they are mapped
> > > shared, because they are visible in the file system in both old and new QEMU.
> > > Anonymous memory must be allocated using memfd_create rather than MAP_ANON,
> > > so the memfd's can be sent to new QEMU. Pages that were locked in memory
> > > for DMA in old QEMU remain locked in new QEMU, because the descriptor of
> > > the device that locked them remains open.
> > >
> > > cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,
> > > and by sending the unique name and value of each descriptor to new QEMU
> > > via CPR state.
> > >
> > > For device descriptors, new QEMU reuses the descriptor when creating the
> > > device, rather than opening it again. The same holds for chardevs. For
> > > memfd descriptors, new QEMU mmap's the preserved memfd when a ramblock
> > > is created.
> > >
> > > CPR state cannot be sent over the normal migration channel, because devices
> > > and backends are created prior to reading the channel, so this mode sends
> > > CPR state over a second migration channel that is not visible to the user.
> > > New QEMU reads the second channel prior to creating devices or backends.
> >
> > Oh, maybe this is the reason that cpr-transfer will need a separate uri..
>
> Indeed.
>
> - Steve
>
--
Peter Xu
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 01/11] machine: alloc-anon option
2024-07-20 20:35 ` Steven Sistare
@ 2024-08-04 16:20 ` Peter Xu
0 siblings, 0 replies; 77+ messages in thread
From: Peter Xu @ 2024-08-04 16:20 UTC (permalink / raw)
To: Steven Sistare
Cc: Igor Mammedov, Alex Williamson, Cedric Le Goater, qemu-devel,
Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster
On Sat, Jul 20, 2024 at 04:35:58PM -0400, Steven Sistare wrote:
> On 7/17/2024 3:24 PM, Peter Xu wrote:
> > On Tue, Jul 16, 2024 at 11:19:55AM +0200, Igor Mammedov wrote:
> > > On Sun, 30 Jun 2024 12:40:24 -0700
> > > Steve Sistare <steven.sistare@oracle.com> wrote:
> > >
> > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> > > > on the value of the anon-alloc machine property. This affects
> > > > memory-backend-ram objects, guest RAM created with the global -m option
> > > > but without an associated memory-backend object and without the -mem-path
> > > > option
> > > nowadays, all machines were converted to use memory backend for VM RAM.
> > > so -m option implicitly creates memory-backend object,
> > > which will be either MEMORY_BACKEND_FILE if -mem-path present
> > > or MEMORY_BACKEND_RAM otherwise.
> > >
> > >
> > > > To access the same memory in the old and new QEMU processes, the memory
> > > > must be mapped shared. Therefore, the implementation always sets
> > >
> > > > RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
> > > > user must explicitly specify the share option. In lieu of defining a new
> > > so statement at the top that memory-backend-ram is affected is not
> > > really valid?
> > >
> > > > RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
> > > > as the condition for calling memfd_create.
> > >
> > > In general I do dislike adding yet another option that will affect
> > > guest RAM allocation (memory-backends should be sufficient).
> >
> > I shared the same concern when reviewing the previous version, and I keep
> > having so.
> >
> > >
> > > However I do see that you need memfd for device memory (vram, roms, ...).
> > > Can we just use memfd/shared unconditionally for those and
> > > avoid introducing a new confusing option?
> >
> > ROMs should be fine IIUC, as they shouldn't be large, and they can be
> > migrated normally (because they're not DMA target from VFIO assigned
> > devices). IOW, per my understanding what must be shared via memfd is
> > writable memories that can be DMAed from a VFIO device.
> >
> > I raised such question on whether / why vram can be a DMA target, but I
> > didn't get a response. So I would like to redo this comment: I think we
> > should figure out what is missing when we switch all backends to use
> > -object, rather than adding this flag easily. When added, we should be
> > crystal clear on which RAM region will be applicable by this flag.
>
> All RAM regions that are mapped by the guest are registered for vfio DMA by
> a memory listener and could potentially be DMA'd, either read or written.
> That is defined by the architecture. We are not allowed to make value
> judgements and decide to not support the architecture for some segments
> such as ROM.
You're right. However, the problem is that we have a pretty good grasp of
the major DMA target here (guest mem); what I feel is that some work is
missing in this area, so we're not sure what this new parameter applies to.
It's not the case where we know "OK we have a million use case of RAM, and
we're 100% sure we want to make them all fd-based, and we introduce this
flag simply because adding this to each 1-million will take years and
thousands LOC changes".
The new parameter is a cheap way to paper over the question being raised
here, but it adds not only ambiguity (when it conflicts with -object) but
also a compatibility burden for all RAMs, where we have no idea what may be
implied underneath for whatever QEMU cmdline is specified.
IMHO that, OTOH, suggests some further study is needed to justify this
parameter on its own. For example, if it's only the vRAM that is missing,
we may at least want a parameter nailing down the vRAM behavior rather than
affecting everything, so as to at least avoid collision with -object
parameters.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-08-04 15:43 ` Peter Xu
@ 2024-08-05 9:52 ` David Hildenbrand
2024-08-05 10:06 ` David Hildenbrand
0 siblings, 1 reply; 77+ messages in thread
From: David Hildenbrand @ 2024-08-05 9:52 UTC (permalink / raw)
To: Peter Xu
Cc: Steve Sistare, qemu-devel, Fabiano Rosas, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster
On 04.08.24 17:43, Peter Xu wrote:
> On Mon, Jul 22, 2024 at 10:59:47AM +0200, David Hildenbrand wrote:
>>> So I wonder whether there's comparison between exec() and transfer mode
>>> that you recently proposed.
>>>
>>> I'm asking because exec() (besides all the rest of things that I dislike on
>>> it in this approach..) should be simply slower, logically, due to the
>>> serialized operation to (1) tearing down the old mm, (2) reload the new
>>> ELF, then (3) runs through the QEMU init process.
>>>
>>> If with a generic migration solution, the dest QEMU can start running (2+3)
>>> concurrently without even need to run (1).
>>
>> I'll note (not sure if already discussed) that with the "async-teardown"
>> option we have a way to move the MM teardown to a separate process, such
>> that it will happen asynchronously.
>
> I just had a look, maybe it won't trivially work, as it relies on QEMU
> process to quit first..
>
> async_teardown_fn():
>     if (the_ppid == getppid()) {
>         pause();
>     }
>
> While if we stick with exec(), then PID shouldn't change, so the teardown
> process can hold the mm and pause until the VM is destroyed..
Right, the mechanism would have to be extended to realize that exec()
happened. Notifying the child before exec() would be undesired, so it
would have to happen after exec() from the changed parent.
Sounds doable, but certainly doesn't come for free!
--
Cheers,
David / dhildenb
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-07-18 15:56 ` [PATCH V2 00/11] Live update: cpr-exec Peter Xu
2024-07-20 21:26 ` Steven Sistare
2024-07-22 8:59 ` [PATCH V2 00/11] Live update: cpr-exec David Hildenbrand
@ 2024-08-05 10:01 ` Daniel P. Berrangé
2024-08-06 20:56 ` Steven Sistare
2 siblings, 1 reply; 77+ messages in thread
From: Daniel P. Berrangé @ 2024-08-05 10:01 UTC (permalink / raw)
To: Peter Xu
Cc: Steve Sistare, qemu-devel, Fabiano Rosas, David Hildenbrand,
Marcel Apfelbaum, Eduardo Habkost, Philippe Mathieu-Daude,
Paolo Bonzini, Markus Armbruster
On Thu, Jul 18, 2024 at 11:56:33AM -0400, Peter Xu wrote:
> Steve,
>
> On Sun, Jun 30, 2024 at 12:40:23PM -0700, Steve Sistare wrote:
> > What?
>
> Thanks for trying out with the cpr-transfer series. I saw that that series
> missed most of the cc list here, so I'm attaching the link here:
>
> https://lore.kernel.org/r/1719776648-435073-1-git-send-email-steven.sistare@oracle.com
>
> I think most of my previous questions for exec() solution still are there,
> I'll try to summarize them all in this reply as much as I can.
>
> >
> > This patch series adds the live migration cpr-exec mode, which allows
> > the user to update QEMU with minimal guest pause time, by preserving
> > guest RAM in place, albeit with new virtual addresses in new QEMU, and
> > by preserving device file descriptors.
> >
> > The new user-visible interfaces are:
> > * cpr-exec (MigMode migration parameter)
> > * cpr-exec-command (migration parameter)
>
> I really, really hope we can avoid this..
>
> It's super cumbersome to pass in a qemu cmdline in a qemu migration
> parameter.. if we can do that with generic live migration ways, I hope we
> stick with the clean approach.
A further issue I have is that it presumes the QEMU configuration is
fully captured by the command line. We have a long term design goal
in QEMU to get away from specifying configuration on the command
line, and move entirely to configuring QEMU via a series of QMP
commands.
This proposed command is introducing the concept of command line argv
as a formal part of the QEMU API and IMHO that is undesirable. Even
today we have backend configuration steps only done via QMP, and I'm
wondering how it would fit in with how mgmt apps currently do live
migration.
The flipside, however, is that localhost migration via 2 separate QEMU
processes has issues where both QEMUs want to open the very same
file, and only one of them can ever have it open.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-08-05 9:52 ` David Hildenbrand
@ 2024-08-05 10:06 ` David Hildenbrand
0 siblings, 0 replies; 77+ messages in thread
From: David Hildenbrand @ 2024-08-05 10:06 UTC (permalink / raw)
To: Peter Xu
Cc: Steve Sistare, qemu-devel, Fabiano Rosas, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster
On 05.08.24 11:52, David Hildenbrand wrote:
> On 04.08.24 17:43, Peter Xu wrote:
>> On Mon, Jul 22, 2024 at 10:59:47AM +0200, David Hildenbrand wrote:
>>>> So I wonder whether there's comparison between exec() and transfer mode
>>>> that you recently proposed.
>>>>
>>>> I'm asking because exec() (besides all the rest of things that I dislike on
>>>> it in this approach..) should be simply slower, logically, due to the
>>>> serialized operation to (1) tearing down the old mm, (2) reload the new
>>>> ELF, then (3) runs through the QEMU init process.
>>>>
>>>> If with a generic migration solution, the dest QEMU can start running (2+3)
>>>> concurrently without even need to run (1).
>>>
>>> I'll note (not sure if already discussed) that with the "async-teardown"
>>> option we have a way to move the MM teardown to a separate process, such
>>> that it will happen asynchronously.
>>
>> I just had a look, maybe it won't trivially work, as it relies on QEMU
>> process to quit first..
>>
>> async_teardown_fn():
>>     if (the_ppid == getppid()) {
>>         pause();
>>     }
>>
>> While if we stick with exec(), then PID shouldn't change, so the teardown
>> process can hold the mm and pause until the VM is destroyed..
>
> Right, the mechanism would have to be extended to realize that exec()
> happened. Notifying the child before exec() would be undesired, so it
> would have to happen after exec() from the changed parent.
>
> Sounds doable, but certainly doesn't come for free!
I did not look deeply into this, but possibly using a pipe between both
processes created with O_CLOEXEC might do. Anyhow, something to look
into if really required :)
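The pipe idea can be sketched as follows. This is illustrative only, not
QEMU code; the helper names are made up. The teardown helper inherits the
read end of a pipe whose write end carries O_CLOEXEC in the main process,
so a successful exec() (or exit) closes the write end in the kernel and
wakes the helper with EOF:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

/* Blocks until the main process exec()s or exits; returns 1 on EOF.
 * The write end is O_CLOEXEC in the parent, so exec() closes it. */
int wait_for_parent_exec(int readfd)
{
    char c;
    ssize_t n = read(readfd, &c, 1);
    return n == 0;   /* EOF => parent exec'ed or died */
}

/* Demo of the notification mechanism in a single program. */
int demo_cloexec_notification(void)
{
    int fds[2];
    if (pipe2(fds, O_CLOEXEC) < 0) {
        return 0;
    }
    pid_t pid = fork();
    if (pid == 0) {
        close(fds[1]);   /* teardown helper keeps only the read end */
        _exit(wait_for_parent_exec(fds[0]) ? 0 : 1);
    }
    close(fds[0]);
    close(fds[1]);       /* stand-in for exec(): O_CLOEXEC would close it */
    int status;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

With this, the helper no longer needs the getppid() check at all: it simply
holds the mm until the pipe reports EOF, regardless of whether the parent
exited or exec'ed in place.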
--
Cheers,
David / dhildenb
* Re: [PATCH V2 04/11] migration: stop vm earlier for cpr
2024-07-22 13:42 ` Fabiano Rosas
@ 2024-08-06 20:52 ` Steven Sistare
0 siblings, 0 replies; 77+ messages in thread
From: Steven Sistare @ 2024-08-06 20:52 UTC (permalink / raw)
To: Fabiano Rosas, qemu-devel
Cc: Peter Xu, David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
Philippe Mathieu-Daude, Paolo Bonzini, Daniel P. Berrange,
Markus Armbruster
On 7/22/2024 9:42 AM, Fabiano Rosas wrote:
> Steven Sistare <steven.sistare@oracle.com> writes:
>
>> On 7/17/2024 2:59 PM, Fabiano Rosas wrote:
>>> Steve Sistare <steven.sistare@oracle.com> writes:
>>>
>>>> Stop the vm earlier for cpr, to guarantee consistent device state when
>>>> CPR state is saved.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>> migration/migration.c | 22 +++++++++++++---------
>>>> 1 file changed, 13 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>> index 0f47765..8a8e927 100644
>>>> --- a/migration/migration.c
>>>> +++ b/migration/migration.c
>>>> @@ -2077,6 +2077,7 @@ void qmp_migrate(const char *uri, bool has_channels,
>>>>      MigrationState *s = migrate_get_current();
>>>>      g_autoptr(MigrationChannel) channel = NULL;
>>>>      MigrationAddress *addr = NULL;
>>>> +    bool stopped = false;
>>>>
>>>>      /*
>>>>       * Having preliminary checks for uri and channel
>>>> @@ -2120,6 +2121,15 @@ void qmp_migrate(const char *uri, bool has_channels,
>>>>          }
>>>>      }
>>>>
>>>> +    if (migrate_mode_is_cpr(s)) {
>>>> +        int ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
>>>> +        if (ret < 0) {
>>>> +            error_setg(&local_err, "migration_stop_vm failed, error %d", -ret);
>>>> +            goto out;
>>>> +        }
>>>> +        stopped = true;
>>>> +    }
>>>> +
>>>>      if (cpr_state_save(&local_err)) {
>>>>          goto out;
>>>>      }
>>>> @@ -2155,6 +2165,9 @@ out:
>>>>      }
>>>>      migrate_fd_error(s, local_err);
>>>>      error_propagate(errp, local_err);
>>>> +    if (stopped && runstate_is_live(s->vm_old_state)) {
>>>> +        vm_start();
>>>> +    }
>>>
>>> What about non-live states? Shouldn't this be:
>>>
>>> if (stopped) {
>>>     vm_resume();
>>> }
>>
>> Not quite. vm_old_state may be a stopped state, so we don't want to resume.
>> However, I should probably restore the old stopped state here. I'll try some more
>> error recovery scenarios.
>
> AIUI vm_resume() does the right thing already:
>
> void vm_resume(RunState state)
> {
>     if (runstate_is_live(state)) {
>         vm_start();
>     } else {
>         runstate_set(state);
>     }
> }
Yes, thanks, I do need to set vm_old_state if not live. It should be:

out:
    ...
    if (stopped) {
        vm_resume(s->vm_old_state);
    }
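Fabiano's point can be checked with a tiny stand-alone model of the
run-state logic. The names mirror QEMU's, but this is a simplified sketch,
not the real implementation (in particular, the stop state is reduced to
"paused"):

```c
#include <stdbool.h>

typedef enum { RUN_STATE_RUNNING, RUN_STATE_PAUSED } RunState;

static RunState current_state = RUN_STATE_RUNNING;

static bool runstate_is_live(RunState state)
{
    /* in QEMU, "live" also covers suspended states; elided here */
    return state == RUN_STATE_RUNNING;
}

static void vm_start(void)            { current_state = RUN_STATE_RUNNING; }
static void runstate_set(RunState s)  { current_state = s; }

/* mirrors the vm_resume() quoted earlier in the thread */
static void vm_resume(RunState state)
{
    if (runstate_is_live(state)) {
        vm_start();
    } else {
        runstate_set(state);
    }
}

/* model of the qmp_migrate() error path: stop for cpr, fail, then
 * restore whatever state the guest was in before the command */
static RunState migrate_fail_and_restore(RunState vm_old_state)
{
    current_state = RUN_STATE_PAUSED;   /* migration_stop_vm() */
    vm_resume(vm_old_state);            /* if (stopped) vm_resume(...) */
    return current_state;
}
```

Calling vm_resume() with the saved vm_old_state restores a running guest to
running and a stopped guest to its stopped state, which is exactly the
behavior the error path needs.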
- Steve
>>>>          return;
>>>>      }
>>>>  }
>>>> @@ -3738,7 +3751,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>>>>      Error *local_err = NULL;
>>>>      uint64_t rate_limit;
>>>>      bool resume = (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP);
>>>> -    int ret;
>>>>
>>>>      /*
>>>>       * If there's a previous error, free it and prepare for another one.
>>>> @@ -3810,14 +3822,6 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
>>>>          return;
>>>>      }
>>>>
>>>> -    if (migrate_mode_is_cpr(s)) {
>>>> -        ret = migration_stop_vm(s, RUN_STATE_FINISH_MIGRATE);
>>>> -        if (ret < 0) {
>>>> -            error_setg(&local_err, "migration_stop_vm failed, error %d", -ret);
>>>> -            goto fail;
>>>> -        }
>>>> -    }
>>>> -
>>>>      if (migrate_background_snapshot()) {
>>>>          qemu_thread_create(&s->thread, "mig/snapshot",
>>>>                             bg_migration_thread, s, QEMU_THREAD_JOINABLE);
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-08-05 10:01 ` Daniel P. Berrangé
@ 2024-08-06 20:56 ` Steven Sistare
2024-08-13 19:46 ` Peter Xu
0 siblings, 1 reply; 77+ messages in thread
From: Steven Sistare @ 2024-08-06 20:56 UTC (permalink / raw)
To: Daniel P. Berrangé, Peter Xu
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Markus Armbruster
On 8/5/2024 6:01 AM, Daniel P. Berrangé wrote:
> On Thu, Jul 18, 2024 at 11:56:33AM -0400, Peter Xu wrote:
>> Steve,
>>
>> On Sun, Jun 30, 2024 at 12:40:23PM -0700, Steve Sistare wrote:
>>> What?
>>
>> Thanks for trying out with the cpr-transfer series. I saw that that series
>> missed most of the cc list here, so I'm attaching the link here:
>>
>> https://lore.kernel.org/r/1719776648-435073-1-git-send-email-steven.sistare@oracle.com
>>
>> I think most of my previous questions for exec() solution still are there,
>> I'll try to summarize them all in this reply as much as I can.
>>
>>>
>>> This patch series adds the live migration cpr-exec mode, which allows
>>> the user to update QEMU with minimal guest pause time, by preserving
>>> guest RAM in place, albeit with new virtual addresses in new QEMU, and
>>> by preserving device file descriptors.
>>>
>>> The new user-visible interfaces are:
>>> * cpr-exec (MigMode migration parameter)
>>> * cpr-exec-command (migration parameter)
>>
>> I really, really hope we can avoid this..
>>
>> It's super cumbersome to pass in a qemu cmdline in a qemu migration
>> parameter.. if we can do that with generic live migration ways, I hope we
>> stick with the clean approach.
>
> A further issue I have is that it presumes the QEMU configuration is
> fully captured by the command line. We have a long term design goal
> in QEMU to get away from specifying configuration on the command
> line, and move entirely to configuring QEMU via a series of QMP
> commands.
>
> This proposed command is introducing the concept of command line argv
> as a formal part of the QEMU API and IMHO that is undesirable.
Actually cpr-exec-command does not presume anything; it is an arbitrary
command with arbitrary arguments. If in the future QEMU takes no command-line
arguments, then mgmt will pass a simple launcher command as cpr-exec-command,
and the launcher will issue QMP commands. Or the launcher will send a message
to another mgmt agent to do so. It is very flexible. Regardless, the API
definition of cpr-exec-command will not change.
As another example, in our cloud environment, when the mgmt agent starts QEMU,
it saves the QEMU args in a file. My cpr-exec-command is just "/bin/qemu-exec"
with a few simple arguments. That command reads QEMU args from the file and
exec's new QEMU.
> Even
> today we have backend configuration steps only done via QMP, and I'm
> wondering how it would fit in with how mgmt apps currently do live
> migration.
Sure, and that still works. For live migration, mgmt starts new QEMU with
its static arguments plus -S plus -incoming, then mgmt detects QEMU has reached
the prelaunch state, then it issues QMP commands. For live update, cpr-exec-command
has the static arguments plus -S, then mgmt detects QEMU has reached the prelaunch
state, then it issues QMP commands.
> The flipside, however, is that localhost migration via 2 separate QEMU
> processes has issues where both QEMUs want to open the very same
> file, and only one of them can ever have it open.
Indeed, and "files" includes unix domain sockets. Network ports also conflict.
cpr-exec avoids such problems; that is one of the advantages of the method,
which I forgot to promote.
- Steve
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-08-04 16:10 ` Peter Xu
@ 2024-08-07 19:47 ` Steven Sistare
2024-08-13 20:12 ` Peter Xu
0 siblings, 1 reply; 77+ messages in thread
From: Steven Sistare @ 2024-08-07 19:47 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster
On 8/4/2024 12:10 PM, Peter Xu wrote:
> On Sat, Jul 20, 2024 at 05:26:07PM -0400, Steven Sistare wrote:
>> On 7/18/2024 11:56 AM, Peter Xu wrote:
>>> Steve,
>>>
>>> On Sun, Jun 30, 2024 at 12:40:23PM -0700, Steve Sistare wrote:
>>>> What?
>>>
>>> Thanks for trying out with the cpr-transfer series. I saw that that series
>>> missed most of the cc list here, so I'm attaching the link here:
>>>
>>> https://lore.kernel.org/r/1719776648-435073-1-git-send-email-steven.sistare@oracle.com
>>>
>>> I think most of my previous questions for exec() solution still are there,
>>> I'll try to summarize them all in this reply as much as I can.
>>>
>>>>
>>>> This patch series adds the live migration cpr-exec mode, which allows
>>>> the user to update QEMU with minimal guest pause time, by preserving
>>>> guest RAM in place, albeit with new virtual addresses in new QEMU, and
>>>> by preserving device file descriptors.
>>>>
>>>> The new user-visible interfaces are:
>>>> * cpr-exec (MigMode migration parameter)
>>>> * cpr-exec-command (migration parameter)
>>>
>>> I really, really hope we can avoid this..
>>>
>>> It's super cumbersome to pass in a qemu cmdline in a qemu migration
>>> parameter.. if we can do that with generic live migration ways, I hope we
>>> stick with the clean approach.
>>
>> This is no different than live migration, requiring a management agent to
>> launch target qemu with all the arguments use to start source QEMU. Now that
>> same agent will send the arguments via cpr-exec-command.
>
> It's still a bit different.
>
> There we append "-incoming defer" only, which makes sense because we're
> instructing a QEMU to take an incoming stream to load. Now we append the
> complete qemu cmdline within the QEMU itself, that was booted with exactly
> the same cmdline.. :-( I would at least start to ask why we need to pass
> the same thing twice..
Sometimes one must modify the command line arguments passed to new QEMU.
This interface allows for that possibility.
In an earlier patch series, I proposed a cpr-exec command that took no arguments,
and reused the old QEMU argv, which was remembered in main. A reviewer pointed out
how inflexible that was. See my response to Daniel yesterday for more on the value
of this flexibility.
This is not a burden for the mgmt agent. It already knows the arguments because
it can launch new qemu with the arguments for live migration. Passing the arguments
to cpr-exec-command is trivial.
> Not saying that this is no-go, but really looks unpretty to me from this
> part.. especially if a cleaner solution seems possible.
>
>>
>>>> * anon-alloc (command-line option for -machine)
>>>
>>> Igor questioned this, and I second his opinion.. We can leave the
>>> discussion there for this one.
>>
>> Continued on the other thread.
>>
>>>> The user sets the mode parameter before invoking the migrate command.
>>>> In this mode, the user issues the migrate command to old QEMU, which
>>>> stops the VM and saves state to the migration channels. Old QEMU then
>>>> exec's new QEMU, replacing the original process while retaining its PID.
>>>> The user specifies the command to exec new QEMU in the migration parameter
>>>> cpr-exec-command. The command must pass all old QEMU arguments to new
>>>> QEMU, plus the -incoming option. Execution resumes in new QEMU.
>>>>
>>>> Memory-backend objects must have the share=on attribute, but
>>>> memory-backend-epc is not supported. The VM must be started
>>>> with the '-machine anon-alloc=memfd' option, which allows anonymous
>>>> memory to be transferred in place to the new process.
>>>>
>>>> Why?
>>>>
>>>> This mode has less impact on the guest than any other method of updating
>>>> in place.
>>>
>>> So I wonder whether there's comparison between exec() and transfer mode
>>> that you recently proposed.
>>
>> Not yet, but I will measure it.
>
> Thanks.
>
>>
>>> I'm asking because exec() (besides all the rest of things that I dislike on
>>> it in this approach..) should be simply slower, logically, due to the
>>> serialized operation to (1) tearing down the old mm, (2) reload the new
>>> ELF, then (3) runs through the QEMU init process.
>>>
>>> If with a generic migration solution, the dest QEMU can start running (2+3)
>>> concurrently without even need to run (1).
>>>
>>> In this whole process, I doubt (2) could be relatively fast, (3) I donno,
>>> maybe it could be slow but I never measured; Paolo may have good idea as I
>>> know he used to work on qboot.
>>
>> We'll see, but in any case these take < 100 msec, which is a wonderfully short
>
> I doubt whether it keeps <100ms when the VM is large. Note that I think we
> should cover the case where the user does 4k mapping for a large guest.
>
> So I agree that 4k mapping over e.g. 1T without hugetlb may not be the
> ideal case, but the question is I suspect there're indeed serious users
> using QEMU like that, and if we have most exactly a parallel solution that
> does cover this case, it is definitely preferrable to consider the other
> from this POV, simply because there's nothing to lose there..
>
>> pause time unless your customer is doing high speed stock trading. If cpr-transfer
>> is faster still, that's gravy, but cpr-exec is still great.
>>
>>> For (1), I also doubt in your test cases it's fast, but it may not always
>>> be fast. Consider the guest has a huge TBs of shared mem, even if the
>>> memory will be completely shared between src/dst QEMUs, the pgtable won't!
>>> It means if the TBs are mapped in PAGE_SIZE tearing down the src QEMU
>>> pgtable alone can even take time, and that will be accounted in step (1)
>>> and further in exec() request.
>>
>> Yes, there is an O(n) effect here, but it is a fast O(n) when the memory is
>> backed by huge pages. In UEK, we make it faster still by unmapping in parallel
>> with multiple threads. I don't have the data handy but can share after running
>> some experiments. Regardless, this time is negligible for small and medium
>> size guests, which form the majority of instances in a cloud.
>
> Possible. It's just that it sounds like a good idea to avoid having the
> downtime taking any pgtable tearing down into account here for the old mm,
> irrelevant of how much time it'll take. It's just that I suspect some use
> case can take fair amount of time.
Here is the guest pause time, measured as the interval from the start
of the migrate command to the new QEMU guest reaching the running state.
The average over 10 runs is shown, in msecs.
Huge pages are enabled.
Guest memory is memfd.
The kernel is 6.9.0 (not UEK, so no parallel unmap)
The system is old and slow: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
          cpr-exec   cpr-transfer
  256M       180          148
  16G        190          150
  128G       250          159
  1T         300 ?        159 ?    // extrapolated
At these scales, the difference between exec and transfer is not significant.
A provider would choose one vs the other based on ease of implementation in
their mgmt agent and container environment.
For small pages and large memory, cpr-exec can take multiple seconds, and
the UEK parallel unmap reduces that further. But, that is the exception,
not the rule. Providers strive to back huge memories with huge pages. It
makes no sense to use such a valuable resource in the crappiest way possible
(i.e. with small pages).
> So I think this is "one point less" for exec() solution, while the issue
> can be big or small on its own. What matters is IMHO where exec() is
> superior so that we'd like to pay for this. I'll try to stop saying "let's
> try to avoid using exec() as it sounds risky", but we still need to compare
> with solid pros and cons.
>
>>
>>> All these fuss will be avoided if you use a generic live migration model
>>> like cpr-transfer you proposed. That's also cleaner.
>>>
>>>> The pause time is much lower, because devices need not be torn
>>>> down and recreated, DMA does not need to be drained and quiesced, and minimal
>>>> state is copied to new QEMU. Further, there are no constraints on the guest.
>>>> By contrast, cpr-reboot mode requires the guest to support S3 suspend-to-ram,
>>>> and suspending plus resuming vfio devices adds multiple seconds to the
>>>> guest pause time. Lastly, there is no loss of connectivity to the guest,
>>>> because chardev descriptors remain open and connected.
>>>
>>> Again, I raised the question on why this would matter, as after all mgmt
>>> app will need to coop with reconnections due to the fact they'll need to
>>> support a generic live migration, in which case reconnection is a must.
>>>
>>> So far it doesn't sound like a performance critical path, for example, to
>>> do the mgmt reconnects on the ports. So this might be an optimization that
>>> most mgmt apps may not care much?
>>
>> Perhaps. I view the chardev preservation as nice to have, but not essential.
>> It does not appear in this series, other than in docs. It's easy to implement
>> given the CPR foundation. I suggest we continue this discussion when I post
>> the chardev series, so we can focus on the core functionality.
>
> It's just that it can affect our decision on choosing the way to go.
>
> For example, do we have someone from Libvirt or any mgmt layer can help
> justify this point?
>
> As I said, I thought most facilities for reconnection should be ready, but
> I could miss important facts in mgmt layers..
I will more deeply study reconnects in the mgmt layer, run some experiments to
see if it is seamless for the end user, and get back to you, but it will take
some time.
>>>> These benefits all derive from the core design principle of this mode,
>>>> which is preserving open descriptors. This approach is very general and
>>>> can be used to support a wide variety of devices that do not have hardware
>>>> support for live migration, including but not limited to: vfio, chardev,
>>>> vhost, vdpa, and iommufd. Some devices need new kernel software interfaces
>>>> to allow a descriptor to be used in a process that did not originally open it.
>>>
>>> Yes, I still think this is a great idea. It just can also be built on top
>>> of something else than exec().
>>>
>>>>
>>>> In a containerized QEMU environment, cpr-exec reuses an existing QEMU
>>>> container and its assigned resources. By contrast, consider a design in
>>>> which a new container is created on the same host as the target of the
>>>> CPR operation. Resources must be reserved for the new container, while
>>>> the old container still reserves resources until the operation completes.
>>>
>>> Note that if we need to share RAM anyway, the resources consumption should
>>> be minimal, as mem should IMHO be the major concern (except CPU, but CPU
>>> isn't a concern in this scenario) in container world and here the shared
>>> guest mem shouldn't be accounted to the dest container. So IMHO it's about
>>> the metadata QEMU/KVM needs to do the hypervisor work, it seems to me, and
>>> that should be relatively small.
>>>
>>> In that case I don't yet see it a huge improvement, if the dest container
>>> is cheap to initiate.
>>
>> It's about reserving memory and CPUs, and transferring those reservations from
>> the old instance to the new, and fiddling with the OS mechanisms that enforce
>> reservations and limits. The devil is in the details, and with the exec model,
>> the management agent can ignore all of that.
>>
>> You don't see it as a huge improvement because you don't need to write the
>> management code. I do!
>
> Heh, possibly true.
>
> Could I ask what management code you're working on? Why that management
> code doesn't need to already work out these problems with reconnections
> (like pre-CPR ways of live upgrade)?
OCI - Oracle Cloud Infrastructure.
Mgmt needs to manage reconnections for live migration, and perhaps I could
leverage that code for live update, but happily I did not need to. Regardless,
reconnection is the lesser issue. The bigger issue is resource management and
the container environment. But I cannot justify that statement in detail without
actually trying to implement cpr-transfer in OCI.
>> Both modes are valid and useful - exec in container, or launch a new container.
>> I have volunteered to implement the cpr-transfer mode for the latter, a mode
>> I do not use. Please don't reward me by dropping the mode I care about :)
>> Both modes can co-exist. The presence of the cpr-exec specific code in qemu
>> will not hinder future live migration development.
>
> I'm trying to remove some of my "prejudices" on exec() :). Hopefully that
> proved more or less that I simply wanted to be fair on making a design
> decision. I don't think I have a strong opinion, but it looks to me not
> ideal to merge two solutions if both modes share the use case.
>
> Or if you think both modes should service different purpose, we might
> consider both, but that needs to be justified - IOW, we shouldn't merge
> anything that will never be used.
The use case is the same for both modes, but they are simply different
transport methods for moving descriptors from old QEMU to new. The developer
of the mgmt agent should be allowed to choose.
- Steve
>>>> Avoiding over commitment requires extra work in the management layer.
>>>
>>> So it would be nice to know what needs to be overcommitted here. I confess
>>> I don't know much on containerized VMs, so maybe the page cache can be a
>>> problem even if shared. But I hope we can spell that out. Logically IIUC
>>> memcg shouldn't account those page cache if preallocated, because memcg
>>> accounting should be done at folio allocations, at least, where the page
>>> cache should miss first (so not this case..).
>>>
>>>> This is one reason why a cloud provider may prefer cpr-exec. A second reason
>>>> is that the container may include agents with their own connections to the
>>>> outside world, and such connections remain intact if the container is reused.
>>>>
>>>> How?
>>
>> chardev preservation. The qemu socket chardevs to these agents are preserved,
>> and the agent connections to the outside world do not change, so no one sees
>> any interruption of traffic.
>>
>>>> All memory that is mapped by the guest is preserved in place. Indeed,
>>>> it must be, because it may be the target of DMA requests, which are not
>>>> quiesced during cpr-exec. All such memory must be mmap'able in new QEMU.
>>>> This is easy for named memory-backend objects, as long as they are mapped
>>>> shared, because they are visible in the file system in both old and new QEMU.
>>>> Anonymous memory must be allocated using memfd_create rather than MAP_ANON,
>>>> so the memfd's can be sent to new QEMU. Pages that were locked in memory
>>>> for DMA in old QEMU remain locked in new QEMU, because the descriptor of
>>>> the device that locked them remains open.
>>>>
>>>> cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,
>>>> and by sending the unique name and value of each descriptor to new QEMU
>>>> via CPR state.
>>>>
>>>> For device descriptors, new QEMU reuses the descriptor when creating the
>>>> device, rather than opening it again. The same holds for chardevs. For
>>>> memfd descriptors, new QEMU mmap's the preserved memfd when a ramblock
>>>> is created.
>>>>
>>>> CPR state cannot be sent over the normal migration channel, because devices
>>>> and backends are created prior to reading the channel, so this mode sends
>>>> CPR state over a second migration channel that is not visible to the user.
>>>> New QEMU reads the second channel prior to creating devices or backends.
>>>
>>> Oh, maybe this is the reason that cpr-transfer will need a separate uri..
>>
>> Indeed.
>>
>> - Steve
>>
>
* Re: [PATCH V2 06/11] migration: fix mismatched GPAs during cpr
2024-07-19 16:28 ` Peter Xu
2024-07-20 21:28 ` Steven Sistare
@ 2024-08-07 21:04 ` Steven Sistare
2024-08-13 20:43 ` Peter Xu
1 sibling, 1 reply; 77+ messages in thread
From: Steven Sistare @ 2024-08-07 21:04 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster
On 7/19/2024 12:28 PM, Peter Xu wrote:
> On Sun, Jun 30, 2024 at 12:40:29PM -0700, Steve Sistare wrote:
>> For new cpr modes, ramblock_is_ignored will always be true, because the
>> memory is preserved in place rather than copied. However, for an ignored
>> block, parse_ramblock currently requires that the received address of the
>> block must match the address of the statically initialized region on the
>> target. This fails for a PCI rom block, because the memory region address
>> is set when the guest writes to a BAR on the source, which does not occur
>> on the target, causing a "Mismatched GPAs" error during cpr migration.
>
> Is this a common fix with/without cpr mode?
>
> It looks to me mr->addr (for these ROMs) should only be set in PCI config
> region updates as you mentioned. But then I didn't figure out when they're
> updated on dest in live migration: the ramblock info was sent at the
> beginning of migration, so it doesn't even have PCI config space migrated;
> I thought the real mr->addr should be in there.
>
> I also failed to understand yet on why the mr->addr check needs to be done
> by ignore-shared only. Some explanation would be greatly helpful around
> this area..
The error_report does not bite for normal migration because migrate_ram_is_ignored()
is false for the problematic blocks, so the block->mr->addr check is not
performed. However, mr->addr is never fixed up in this case, which is a
latent bug; this patch fixes that with the "has_addr" check.
For cpr-exec, migrate_ram_is_ignored() is true for all blocks,
because we do not copy the contents over the migration stream, we preserve the
memory in place. So we fall into the block->mr->addr sanity check and fail
with the original code.
I will add this to the commit message.
- Steve
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 01/11] machine: alloc-anon option
2024-07-29 12:29 ` Igor Mammedov
@ 2024-08-08 18:32 ` Steven Sistare
2024-08-12 18:37 ` Steven Sistare
0 siblings, 1 reply; 77+ messages in thread
From: Steven Sistare @ 2024-08-08 18:32 UTC (permalink / raw)
To: Igor Mammedov, Peter Xu, Daniel P. Berrange
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Markus Armbruster
On 7/29/2024 8:29 AM, Igor Mammedov wrote:
> On Sat, 20 Jul 2024 16:28:25 -0400
> Steven Sistare <steven.sistare@oracle.com> wrote:
>
>> On 7/16/2024 5:19 AM, Igor Mammedov wrote:
>>> On Sun, 30 Jun 2024 12:40:24 -0700
>>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>>
>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>> on the value of the anon-alloc machine property. This affects
>>>> memory-backend-ram objects, guest RAM created with the global -m option
>>>> but without an associated memory-backend object and without the -mem-path
>>>> option
>>> nowadays, all machines were converted to use memory backend for VM RAM.
>>> so -m option implicitly creates memory-backend object,
>>> which will be either MEMORY_BACKEND_FILE if -mem-path present
>>> or MEMORY_BACKEND_RAM otherwise.
>>
>> Yes. I dropped an important adjective, "implicit".
>>
>> "guest RAM created with the global -m option but without an explicit associated
>> memory-backend object and without the -mem-path option"
>>
>>>> To access the same memory in the old and new QEMU processes, the memory
>>>> must be mapped shared. Therefore, the implementation always sets
>>>
>>>> RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
>>>> user must explicitly specify the share option. In lieu of defining a new
>>> so statement at the top that memory-backend-ram is affected is not
>>> really valid?
>>
>> memory-backend-ram is affected by alloc-anon. But in addition, the user must
>> explicitly add the "share" option. I don't implicitly set share in this case,
>> because I would be overriding the user's specification of the memory object's property,
>> which would be private if omitted.
>
> instead of touching implicit RAM (-m), it would be better to error out
> and ask user to provide properly configured memory-backend explicitly.
>
>>
>>>> RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
>>>> as the condition for calling memfd_create.
>>>
>>> In general I do dislike adding yet another option that will affect
>>> guest RAM allocation (memory-backends should be sufficient).
>>>
>>> However I do see that you need memfd for device memory (vram, roms, ...).
>>> Can we just use memfd/shared unconditionally for those and
>>> avoid introducing a new confusing option?
>>
>> The Linux kernel has different tunables for backing memfd's with huge pages, so we
>> could hurt performance if we unconditionally change to memfd. The user should have
>> a choice for any segment that is large enough for huge pages to improve performance,
>> which potentially is any memory-backend-object. The non memory-backend objects are
>> small, and it would be OK to use memfd unconditionally for them.
Thanks everyone for your feedback. The common theme is that you dislike that the
new option modifies the allocation of memory-backend-objects. OK, accepted. I propose
to remove that interaction, and document in the QAPI which backends work for CPR.
Specifically, memory-backend-memfd or memory-backend-file object is required,
with share=on (which is the default for memory-backend-memfd). CPR will be blocked
otherwise. The legacy -m option without an explicit memory-backend-object will not
support CPR.
Non memory-backend-objects (ramblocks not described on the qemu command line) will always
be allocated using memfd_create (on Linux only). The alloc-anon option is deleted.
The logic in ram_block_add becomes:
if (!new_block->host) {
    if (xen_enabled()) {
        ...
    } else if (!object_dynamic_cast(new_block->mr->parent_obj.parent,
                                    TYPE_MEMORY_BACKEND)) {
        qemu_memfd_create()
    } else {
        qemu_anon_ram_alloc()
    }
}
Is that acceptable to everyone? Igor, Peter, Daniel?
- Steve
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 01/11] machine: alloc-anon option
2024-08-08 18:32 ` Steven Sistare
@ 2024-08-12 18:37 ` Steven Sistare
2024-08-13 15:35 ` Peter Xu
0 siblings, 1 reply; 77+ messages in thread
From: Steven Sistare @ 2024-08-12 18:37 UTC (permalink / raw)
To: Igor Mammedov, Peter Xu, Daniel P. Berrange
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Markus Armbruster
On 8/8/2024 2:32 PM, Steven Sistare wrote:
> On 7/29/2024 8:29 AM, Igor Mammedov wrote:
>> On Sat, 20 Jul 2024 16:28:25 -0400
>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>
>>> On 7/16/2024 5:19 AM, Igor Mammedov wrote:
>>>> On Sun, 30 Jun 2024 12:40:24 -0700
>>>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>> on the value of the anon-alloc machine property. This affects
>>>>> memory-backend-ram objects, guest RAM created with the global -m option
>>>>> but without an associated memory-backend object and without the -mem-path
>>>>> option
>>>> nowadays, all machines were converted to use memory backend for VM RAM.
>>>> so -m option implicitly creates memory-backend object,
>>>> which will be either MEMORY_BACKEND_FILE if -mem-path present
>>>> or MEMORY_BACKEND_RAM otherwise.
>>>
>>> Yes. I dropped an important adjective, "implicit".
>>>
>>> "guest RAM created with the global -m option but without an explicit associated
>>> memory-backend object and without the -mem-path option"
>>>
>>>>> To access the same memory in the old and new QEMU processes, the memory
>>>>> must be mapped shared. Therefore, the implementation always sets
>>>>> RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
>>>>> user must explicitly specify the share option. In lieu of defining a new
>>>> so statement at the top that memory-backend-ram is affected is not
>>>> really valid?
>>>
>>> memory-backend-ram is affected by alloc-anon. But in addition, the user must
>>> explicitly add the "share" option. I don't implicitly set share in this case,
>>> because I would be overriding the user's specification of the memory object's property,
>>> which would be private if omitted.
>>
>> instead of touching implicit RAM (-m), it would be better to error out
>> and ask user to provide properly configured memory-backend explicitly.
>>
>>>
>>>>> RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
>>>>> as the condition for calling memfd_create.
>>>>
>>>> In general I do dislike adding yet another option that will affect
>>>> guest RAM allocation (memory-backends should be sufficient).
>>>>
>>>> However I do see that you need memfd for device memory (vram, roms, ...).
>>>> Can we just use memfd/shared unconditionally for those and
>>>> avoid introducing a new confusing option?
>>>
>>> The Linux kernel has different tunables for backing memfd's with huge pages, so we
>>> could hurt performance if we unconditionally change to memfd. The user should have
>>> a choice for any segment that is large enough for huge pages to improve performance,
>>> which potentially is any memory-backend-object. The non memory-backend objects are
>>> small, and it would be OK to use memfd unconditionally for them.
>
> Thanks everyone for your feedback. The common theme is that you dislike that the
> new option modifies the allocation of memory-backend-objects. OK, accepted. I propose
> to remove that interaction, and document in the QAPI which backends work for CPR.
> Specifically, memory-backend-memfd or memory-backend-file object is required,
> with share=on (which is the default for memory-backend-memfd). CPR will be blocked
> otherwise. The legacy -m option without an explicit memory-backend-object will not
> support CPR.
>
> Non memory-backend-objects (ramblocks not described on the qemu command line) will always
> be allocated using memfd_create (on Linux only). The alloc-anon option is deleted.
> The logic in ram_block_add becomes:
>
> if (!new_block->host) {
>     if (xen_enabled()) {
>         ...
>     } else if (!object_dynamic_cast(new_block->mr->parent_obj.parent,
>                                     TYPE_MEMORY_BACKEND)) {
>         qemu_memfd_create()
>     } else {
>         qemu_anon_ram_alloc()
>     }
> }
>
> Is that acceptable to everyone? Igor, Peter, Daniel?
In a simple test here are the NON-memory-backend-object ramblocks which
are allocated with memfd_create in my new proposal:
memfd_create system.flash0 3653632 @ 0x7fffe1000000 2 rw
memfd_create system.flash1 540672 @ 0x7fffe0c00000 2 rw
memfd_create pc.rom 131072 @ 0x7fffe0800000 2 rw
memfd_create vga.vram 16777216 @ 0x7fffcac00000 2 rw
memfd_create vga.rom 65536 @ 0x7fffe0400000 2 rw
memfd_create /rom@etc/acpi/tables 2097152 @ 0x7fffca400000 6 rw
memfd_create /rom@etc/table-loader 65536 @ 0x7fffca000000 6 rw
memfd_create /rom@etc/acpi/rsdp 4096 @ 0x7fffc9c00000 6 rw
Of those, only a subset are mapped for DMA, per the existing QEMU logic,
no changes from me:
dma_map: pc.rom 131072 @ 0x7fffe0800000 ro
dma_map: vga.vram 16777216 @ 0x7fffcac00000 rw
dma_map: vga.rom 65536 @ 0x7fffe0400000 ro
dma_map: 0000:3a:10.0 BAR 0 mmaps[0] 16384 @ 0x7ffff7fef000 rw
dma_map: 0000:3a:10.0 BAR 3 mmaps[0] 12288 @ 0x7ffff7fec000 rw
system.flash0 is excluded by the vfio listener because it is a rom_device.
The rom@etc blocks are excluded because their MemoryRegions are not added to
any container region, so the flatmem traversal of the AS used by the listener
does not see them.
The BARs should not be mapped IMO, and I propose excluding them in the
iommufd series:
https://lore.kernel.org/qemu-devel/1721502937-87102-3-git-send-email-steven.sistare@oracle.com/
Note that the old-QEMU contents of all ramblocks must be preserved, just like
in live migration. Live migration copies the contents in the stream. Live update
preserves the contents in place by preserving the memfd. Thus memfd serves
two purposes: preserving old contents, and preserving DMA mapped pinned pages.
- Steve
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 01/11] machine: alloc-anon option
2024-08-12 18:37 ` Steven Sistare
@ 2024-08-13 15:35 ` Peter Xu
2024-08-13 17:00 ` Alex Williamson
2024-08-13 17:34 ` Steven Sistare
0 siblings, 2 replies; 77+ messages in thread
From: Peter Xu @ 2024-08-13 15:35 UTC (permalink / raw)
To: Steven Sistare
Cc: Igor Mammedov, Daniel P. Berrange, qemu-devel, Fabiano Rosas,
David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster
On Mon, Aug 12, 2024 at 02:37:59PM -0400, Steven Sistare wrote:
> On 8/8/2024 2:32 PM, Steven Sistare wrote:
> > On 7/29/2024 8:29 AM, Igor Mammedov wrote:
> > > On Sat, 20 Jul 2024 16:28:25 -0400
> > > Steven Sistare <steven.sistare@oracle.com> wrote:
> > >
> > > > On 7/16/2024 5:19 AM, Igor Mammedov wrote:
> > > > > On Sun, 30 Jun 2024 12:40:24 -0700
> > > > > Steve Sistare <steven.sistare@oracle.com> wrote:
> > > > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> > > > > > on the value of the anon-alloc machine property. This affects
> > > > > > memory-backend-ram objects, guest RAM created with the global -m option
> > > > > > but without an associated memory-backend object and without the -mem-path
> > > > > > option
> > > > > nowadays, all machines were converted to use memory backend for VM RAM.
> > > > > so -m option implicitly creates memory-backend object,
> > > > > which will be either MEMORY_BACKEND_FILE if -mem-path present
> > > > > or MEMORY_BACKEND_RAM otherwise.
> > > >
> > > > Yes. I dropped an important adjective, "implicit".
> > > >
> > > > "guest RAM created with the global -m option but without an explicit associated
> > > > memory-backend object and without the -mem-path option"
> > > >
> > > > > > To access the same memory in the old and new QEMU processes, the memory
> > > > > > must be mapped shared. Therefore, the implementation always sets
> > > > > > RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
> > > > > > user must explicitly specify the share option. In lieu of defining a new
> > > > > so statement at the top that memory-backend-ram is affected is not
> > > > > really valid?
> > > >
> > > > memory-backend-ram is affected by alloc-anon. But in addition, the user must
> > > > explicitly add the "share" option. I don't implicitly set share in this case,
> > > > because I would be overriding the user's specification of the memory object's property,
> > > > which would be private if omitted.
> > >
> > > instead of touching implicit RAM (-m), it would be better to error out
> > > and ask user to provide properly configured memory-backend explicitly.
> > >
> > > >
> > > > > > RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
> > > > > > as the condition for calling memfd_create.
> > > > >
> > > > > In general I do dislike adding yet another option that will affect
> > > > > guest RAM allocation (memory-backends should be sufficient).
> > > > >
> > > > > However I do see that you need memfd for device memory (vram, roms, ...).
> > > > > Can we just use memfd/shared unconditionally for those and
> > > > > avoid introducing a new confusing option?
> > > >
> > > > The Linux kernel has different tunables for backing memfd's with huge pages, so we
> > > > could hurt performance if we unconditionally change to memfd. The user should have
> > > > a choice for any segment that is large enough for huge pages to improve performance,
> > > > which potentially is any memory-backend-object. The non memory-backend objects are
> > > > small, and it would be OK to use memfd unconditionally for them.
> >
> > Thanks everyone for your feedback. The common theme is that you dislike that the
> > new option modifies the allocation of memory-backend-objects. OK, accepted. I propose
> > to remove that interaction, and document in the QAPI which backends work for CPR.
> > Specifically, memory-backend-memfd or memory-backend-file object is required,
> > with share=on (which is the default for memory-backend-memfd). CPR will be blocked
> > otherwise. The legacy -m option without an explicit memory-backend-object will not
> > support CPR.
> >
> > Non memory-backend-objects (ramblocks not described on the qemu command line) will always
> > be allocated using memfd_create (on Linux only). The alloc-anon option is deleted.
> > The logic in ram_block_add becomes:
> >
> > if (!new_block->host) {
> >     if (xen_enabled()) {
> >         ...
> >     } else if (!object_dynamic_cast(new_block->mr->parent_obj.parent,
> >                                     TYPE_MEMORY_BACKEND)) {
> >         qemu_memfd_create()
> >     } else {
> >         qemu_anon_ram_alloc()
> >     }
> > }
> >
> > Is that acceptable to everyone? Igor, Peter, Daniel?
Sorry for a late reply.
I think this may not work as David pointed out? Where AFAIU it will switch
many old anon use cases to use memfd, aka, shmem, and it might be
problematic when share=off: we have double memory consumption issue with
shmem with private mapping.
I assume that includes things like "-m", "memory-backend-ram", and maybe
more. IIUC memory consumption of the VM will double with them.
>
> In a simple test here are the NON-memory-backend-object ramblocks which
> are allocated with memfd_create in my new proposal:
>
> memfd_create system.flash0 3653632 @ 0x7fffe1000000 2 rw
> memfd_create system.flash1 540672 @ 0x7fffe0c00000 2 rw
> memfd_create pc.rom 131072 @ 0x7fffe0800000 2 rw
> memfd_create vga.vram 16777216 @ 0x7fffcac00000 2 rw
> memfd_create vga.rom 65536 @ 0x7fffe0400000 2 rw
> memfd_create /rom@etc/acpi/tables 2097152 @ 0x7fffca400000 6 rw
> memfd_create /rom@etc/table-loader 65536 @ 0x7fffca000000 6 rw
> memfd_create /rom@etc/acpi/rsdp 4096 @ 0x7fffc9c00000 6 rw
>
> Of those, only a subset are mapped for DMA, per the existing QEMU logic,
> no changes from me:
>
> dma_map: pc.rom 131072 @ 0x7fffe0800000 ro
> dma_map: vga.vram 16777216 @ 0x7fffcac00000 rw
> dma_map: vga.rom 65536 @ 0x7fffe0400000 ro
I wonder whether there's any case that the "rom"s can be DMA target at
all.. I understand it's logically possible to be READ from as ROMs, but I
am curious what happens if we don't map them at all when they're ROMs, or
whether there's any device that can (in real life) DMA from device ROMs,
and for what use.
> dma_map: 0000:3a:10.0 BAR 0 mmaps[0] 16384 @ 0x7ffff7fef000 rw
> dma_map: 0000:3a:10.0 BAR 3 mmaps[0] 12288 @ 0x7ffff7fec000 rw
>
> system.flash0 is excluded by the vfio listener because it is a rom_device.
> The rom@etc blocks are excluded because their MemoryRegions are not added to
> any container region, so the flatmem traversal of the AS used by the listener
> does not see them.
>
> The BARs should not be mapped IMO, and I propose excluding them in the
> iommufd series:
> https://lore.kernel.org/qemu-devel/1721502937-87102-3-git-send-email-steven.sistare@oracle.com/
Looks like this is clear now that they should be there.
>
> Note that the old-QEMU contents of all ramblocks must be preserved, just like
> in live migration. Live migration copies the contents in the stream. Live update
> preserves the contents in place by preserving the memfd. Thus memfd serves
> two purposes: preserving old contents, and preserving DMA mapped pinned pages.
IMHO the 1st purpose is a fake one. IOW:
- Preserving content will be important on large RAM/ROM regions. When
it's small, it shouldn't matter a huge deal, IMHO, because this is
about "how fast we can migrate / live upgrade". IOW, this is not a
functional requirement.
- DMA mapped pinned pages: instead this is a hard requirement that we
must make sure these pages are fd-based, because only a fd-based
mapping can persist the pages (via page cache).
IMHO we shouldn't mangle them, and we should start with sticking with the
2nd goal here. To be explicit, if we can find a good replacement for
-alloc-anon, IMHO we could still migrate the ramblocks that only fall into the
1st purpose category, e.g. device ROMs, hopefully even if they're pinned,
they should never be DMAed to/from.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 01/11] machine: alloc-anon option
2024-08-13 15:35 ` Peter Xu
@ 2024-08-13 17:00 ` Alex Williamson
2024-08-13 18:45 ` Peter Xu
2024-08-13 18:46 ` Steven Sistare
2024-08-13 17:34 ` Steven Sistare
1 sibling, 2 replies; 77+ messages in thread
From: Alex Williamson @ 2024-08-13 17:00 UTC (permalink / raw)
To: Peter Xu
Cc: Steven Sistare, Igor Mammedov, Daniel P. Berrange, qemu-devel,
Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Markus Armbruster
On Tue, 13 Aug 2024 11:35:15 -0400
Peter Xu <peterx@redhat.com> wrote:
> On Mon, Aug 12, 2024 at 02:37:59PM -0400, Steven Sistare wrote:
> > On 8/8/2024 2:32 PM, Steven Sistare wrote:
> > > On 7/29/2024 8:29 AM, Igor Mammedov wrote:
> > > > On Sat, 20 Jul 2024 16:28:25 -0400
> > > > Steven Sistare <steven.sistare@oracle.com> wrote:
> > > >
> > > > > On 7/16/2024 5:19 AM, Igor Mammedov wrote:
> > > > > > On Sun, 30 Jun 2024 12:40:24 -0700
> > > > > > Steve Sistare <steven.sistare@oracle.com> wrote:
> > > > > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> > > > > > > on the value of the anon-alloc machine property. This affects
> > > > > > > memory-backend-ram objects, guest RAM created with the global -m option
> > > > > > > but without an associated memory-backend object and without the -mem-path
> > > > > > > option
> > > > > > nowadays, all machines were converted to use memory backend for VM RAM.
> > > > > > so -m option implicitly creates memory-backend object,
> > > > > > which will be either MEMORY_BACKEND_FILE if -mem-path present
> > > > > > or MEMORY_BACKEND_RAM otherwise.
> > > > >
> > > > > Yes. I dropped an important adjective, "implicit".
> > > > >
> > > > > "guest RAM created with the global -m option but without an explicit associated
> > > > > memory-backend object and without the -mem-path option"
> > > > >
> > > > > > > To access the same memory in the old and new QEMU processes, the memory
> > > > > > > must be mapped shared. Therefore, the implementation always sets
> > > > > > > RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
> > > > > > > user must explicitly specify the share option. In lieu of defining a new
> > > > > > so statement at the top that memory-backend-ram is affected is not
> > > > > > really valid?
> > > > >
> > > > > memory-backend-ram is affected by alloc-anon. But in addition, the user must
> > > > > explicitly add the "share" option. I don't implicitly set share in this case,
> > > > > because I would be overriding the user's specification of the memory object's property,
> > > > > which would be private if omitted.
> > > >
> > > > instead of touching implicit RAM (-m), it would be better to error out
> > > > and ask user to provide properly configured memory-backend explicitly.
> > > >
> > > > >
> > > > > > > RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
> > > > > > > as the condition for calling memfd_create.
> > > > > >
> > > > > > In general I do dislike adding yet another option that will affect
> > > > > > guest RAM allocation (memory-backends should be sufficient).
> > > > > >
> > > > > > However I do see that you need memfd for device memory (vram, roms, ...).
> > > > > > Can we just use memfd/shared unconditionally for those and
> > > > > > avoid introducing a new confusing option?
> > > > >
> > > > > The Linux kernel has different tunables for backing memfd's with huge pages, so we
> > > > > could hurt performance if we unconditionally change to memfd. The user should have
> > > > > a choice for any segment that is large enough for huge pages to improve performance,
> > > > > which potentially is any memory-backend-object. The non memory-backend objects are
> > > > > small, and it would be OK to use memfd unconditionally for them.
> > >
> > > Thanks everyone for your feedback. The common theme is that you dislike that the
> > > new option modifies the allocation of memory-backend-objects. OK, accepted. I propose
> > > to remove that interaction, and document in the QAPI which backends work for CPR.
> > > Specifically, memory-backend-memfd or memory-backend-file object is required,
> > > with share=on (which is the default for memory-backend-memfd). CPR will be blocked
> > > otherwise. The legacy -m option without an explicit memory-backend-object will not
> > > support CPR.
> > >
> > > Non memory-backend-objects (ramblocks not described on the qemu command line) will always
> > > be allocated using memfd_create (on Linux only). The alloc-anon option is deleted.
> > > The logic in ram_block_add becomes:
> > >
> > > if (!new_block->host) {
> > >     if (xen_enabled()) {
> > >         ...
> > >     } else if (!object_dynamic_cast(new_block->mr->parent_obj.parent,
> > >                                     TYPE_MEMORY_BACKEND)) {
> > >         qemu_memfd_create()
> > >     } else {
> > >         qemu_anon_ram_alloc()
> > >     }
> > > }
> > >
> > > Is that acceptable to everyone? Igor, Peter, Daniel?
>
> Sorry for a late reply.
>
> I think this may not work as David pointed out? Where AFAIU it will switch
> many old anon use cases to use memfd, aka, shmem, and it might be
> problematic when share=off: we have double memory consumption issue with
> shmem with private mapping.
>
> I assume that includes things like "-m", "memory-backend-ram", and maybe
> more. IIUC memory consumption of the VM will double with them.
>
> >
> > In a simple test here are the NON-memory-backend-object ramblocks which
> > are allocated with memfd_create in my new proposal:
> >
> > memfd_create system.flash0 3653632 @ 0x7fffe1000000 2 rw
> > memfd_create system.flash1 540672 @ 0x7fffe0c00000 2 rw
> > memfd_create pc.rom 131072 @ 0x7fffe0800000 2 rw
> > memfd_create vga.vram 16777216 @ 0x7fffcac00000 2 rw
> > memfd_create vga.rom 65536 @ 0x7fffe0400000 2 rw
> > memfd_create /rom@etc/acpi/tables 2097152 @ 0x7fffca400000 6 rw
> > memfd_create /rom@etc/table-loader 65536 @ 0x7fffca000000 6 rw
> > memfd_create /rom@etc/acpi/rsdp 4096 @ 0x7fffc9c00000 6 rw
> >
> > Of those, only a subset are mapped for DMA, per the existing QEMU logic,
> > no changes from me:
> >
> > dma_map: pc.rom 131072 @ 0x7fffe0800000 ro
> > dma_map: vga.vram 16777216 @ 0x7fffcac00000 rw
> > dma_map: vga.rom 65536 @ 0x7fffe0400000 ro
>
> I wonder whether there's any case that the "rom"s can be DMA target at
> all.. I understand it's logically possible to be READ from as ROMs, but I
> am curious what happens if we don't map them at all when they're ROMs, or
> whether there's any device that can (in real life) DMA from device ROMs,
> and for what use.
>
> > dma_map: 0000:3a:10.0 BAR 0 mmaps[0] 16384 @ 0x7ffff7fef000 rw
> > dma_map: 0000:3a:10.0 BAR 3 mmaps[0] 12288 @ 0x7ffff7fec000 rw
> >
> > system.flash0 is excluded by the vfio listener because it is a rom_device.
> > The rom@etc blocks are excluded because their MemoryRegions are not added to
> > any container region, so the flatmem traversal of the AS used by the listener
> > does not see them.
> >
> > The BARs should not be mapped IMO, and I propose excluding them in the
> > iommufd series:
> > https://lore.kernel.org/qemu-devel/1721502937-87102-3-git-send-email-steven.sistare@oracle.com/
>
> Looks like this is clear now that they should be there.
>
> >
> > Note that the old-QEMU contents of all ramblocks must be preserved, just like
> > in live migration. Live migration copies the contents in the stream. Live update
> > preserves the contents in place by preserving the memfd. Thus memfd serves
> > two purposes: preserving old contents, and preserving DMA mapped pinned pages.
>
> IMHO the 1st purpose is a fake one. IOW:
>
> - Preserving content will be important on large RAM/ROM regions. When
> it's small, it shouldn't matter a huge deal, IMHO, because this is
> about "how fast we can migrate / live upgrade". IOW, this is not a
> functional requirement.
Regardless of the size of a ROM region, how would it ever be faster to
migrate ROMs rather than reload them from stable media on the target?
Furthermore, what mechanism other than migrating the ROM do we have to
guarantee the contents of the ROM are identical?
I have a hard time accepting that ROMs are only migrated for
performance and there isn't some aspect of migrating them to ensure the
contents remain identical, and by that token CPR would also need to
preserve the contents to provide the same guarantee. Thanks,
Alex
> - DMA mapped pinned pages: instead this is a hard requirement that we
> must make sure these pages are fd-based, because only a fd-based
> mapping can persist the pages (via page cache).
>
> IMHO we shouldn't mangle them, and we should start with sticking with the
> 2nd goal here. To be explicit, if we can find a good replacement for
> -alloc-anon, IMHO we could still migrate the ramblocks that only fall into the
> 1st purpose category, e.g. device ROMs, hopefully even if they're pinned,
> they should never be DMAed to/from.
>
> Thanks,
>
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 01/11] machine: alloc-anon option
2024-08-13 15:35 ` Peter Xu
2024-08-13 17:00 ` Alex Williamson
@ 2024-08-13 17:34 ` Steven Sistare
2024-08-13 19:02 ` Peter Xu
1 sibling, 1 reply; 77+ messages in thread
From: Steven Sistare @ 2024-08-13 17:34 UTC (permalink / raw)
To: Peter Xu
Cc: Igor Mammedov, Daniel P. Berrange, qemu-devel, Fabiano Rosas,
David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster
On 8/13/2024 11:35 AM, Peter Xu wrote:
> On Mon, Aug 12, 2024 at 02:37:59PM -0400, Steven Sistare wrote:
>> On 8/8/2024 2:32 PM, Steven Sistare wrote:
>>> On 7/29/2024 8:29 AM, Igor Mammedov wrote:
>>>> On Sat, 20 Jul 2024 16:28:25 -0400
>>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>>
>>>>> On 7/16/2024 5:19 AM, Igor Mammedov wrote:
>>>>>> On Sun, 30 Jun 2024 12:40:24 -0700
>>>>>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>> on the value of the anon-alloc machine property. This affects
>>>>>>> memory-backend-ram objects, guest RAM created with the global -m option
>>>>>>> but without an associated memory-backend object and without the -mem-path
>>>>>>> option
>>>>>> nowadays, all machines were converted to use memory backend for VM RAM.
>>>>>> so -m option implicitly creates memory-backend object,
>>>>>> which will be either MEMORY_BACKEND_FILE if -mem-path present
>>>>>> or MEMORY_BACKEND_RAM otherwise.
>>>>>
>>>>> Yes. I dropped an important adjective, "implicit".
>>>>>
>>>>> "guest RAM created with the global -m option but without an explicit associated
>>>>> memory-backend object and without the -mem-path option"
>>>>>
>>>>>>> To access the same memory in the old and new QEMU processes, the memory
>>>>>>> must be mapped shared. Therefore, the implementation always sets
>>>>>>> RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
>>>>>>> user must explicitly specify the share option. In lieu of defining a new
>>>>>> so statement at the top that memory-backend-ram is affected is not
>>>>>> really valid?
>>>>>
>>>>> memory-backend-ram is affected by alloc-anon. But in addition, the user must
>>>>> explicitly add the "share" option. I don't implicitly set share in this case,
>>>>> because I would be overriding the user's specification of the memory object's property,
>>>>> which would be private if omitted.
>>>>
>>>> instead of touching implicit RAM (-m), it would be better to error out
>>>> and ask user to provide properly configured memory-backend explicitly.
>>>>
>>>>>
>>>>>>> RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
>>>>>>> as the condition for calling memfd_create.
>>>>>>
>>>>>> In general I do dislike adding yet another option that will affect
>>>>>> guest RAM allocation (memory-backends should be sufficient).
>>>>>>
>>>>>> However I do see that you need memfd for device memory (vram, roms, ...).
>>>>>> Can we just use memfd/shared unconditionally for those and
>>>>>> avoid introducing a new confusing option?
>>>>>
>>>>> The Linux kernel has different tunables for backing memfd's with huge pages, so we
>>>>> could hurt performance if we unconditionally change to memfd. The user should have
>>>>> a choice for any segment that is large enough for huge pages to improve performance,
>>>>> which potentially is any memory-backend-object. The non memory-backend objects are
>>>>> small, and it would be OK to use memfd unconditionally for them.
>>>
>>> Thanks everyone for your feedback. The common theme is that you dislike that the
>>> new option modifies the allocation of memory-backend-objects. OK, accepted. I propose
>>> to remove that interaction, and document in the QAPI which backends work for CPR.
>>> Specifically, memory-backend-memfd or memory-backend-file object is required,
>>> with share=on (which is the default for memory-backend-memfd). CPR will be blocked
>>> otherwise. The legacy -m option without an explicit memory-backend-object will not
>>> support CPR.
>>>
>>> Non memory-backend-objects (ramblocks not described on the qemu command line) will always
>>> be allocated using memfd_create (on Linux only). The alloc-anon option is deleted.
>>> The logic in ram_block_add becomes:
>>>
>>> if (!new_block->host) {
>>>     if (xen_enabled()) {
>>>         ...
>>>     } else if (!object_dynamic_cast(new_block->mr->parent_obj.parent,
>>>                                     TYPE_MEMORY_BACKEND)) {
>>>         qemu_memfd_create()
>>>     } else {
>>>         qemu_anon_ram_alloc()
>>>     }
>>> }
>>>
>>> Is that acceptable to everyone? Igor, Peter, Daniel?
>
> Sorry for a late reply.
>
> I think this may not work as David pointed out? Where AFAIU it will switch
> many old anon use cases to use memfd, aka, shmem, and it might be
> problematic when share=off: we have double memory consumption issue with
> shmem with private mapping.
>
> I assume that includes things like "-m", "memory-backend-ram", and maybe
> more. IIUC memory consumption of the VM will double with them.
The new proposal only affects anon allocations that are not described on
the command line, and their memfd will be shared. There is no
command line option which would set share=off for these blocks.
"-m" and "memory-backend-ram" are not affected.
They will not work with CPR.
I will respond to your other comments separately.
- Steve
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 01/11] machine: alloc-anon option
2024-08-13 17:00 ` Alex Williamson
@ 2024-08-13 18:45 ` Peter Xu
2024-08-13 18:56 ` Steven Sistare
2024-08-13 18:46 ` Steven Sistare
1 sibling, 1 reply; 77+ messages in thread
From: Peter Xu @ 2024-08-13 18:45 UTC (permalink / raw)
To: Alex Williamson
Cc: Steven Sistare, Igor Mammedov, Daniel P. Berrange, qemu-devel,
Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Markus Armbruster
On Tue, Aug 13, 2024 at 11:00:37AM -0600, Alex Williamson wrote:
> > > Note that the old-QEMU contents of all ramblocks must be preserved, just like
> > > in live migration. Live migration copies the contents in the stream. Live update
> > > preserves the contents in place by preserving the memfd. Thus memfd serves
> > > two purposes: preserving old contents, and preserving DMA mapped pinned pages.
> >
> > IMHO the 1st purpose is a fake one. IOW:
> >
> > - Preserving content will be important on large RAM/ROM regions. When
> > it's small, it shouldn't matter a huge deal, IMHO, because this is
> > about "how fast we can migrate / live upgrade'. IOW, this is not a
> > functional requirement.
>
> Regardless of the size of a ROM region, how would it ever be faster to
> migrate ROMs rather than reload them from stable media on the target?
> Furthermore, what mechanism other than migrating the ROM do we have to
> guarantee the contents of the ROM are identical?
IIRC we need to migrate ROMs in some form because they can be different on
src/dst, e.g., ROM files can be upgraded after QEMU upgrades. Here either
putting them onto the migration stream, or making that fd-based, should work.
Frankly, I don't know the full details of why they can't be different.
My current understanding was that if one device boots with one version of
ROM/firmware, then it's possible the device keep interacting with the ROM
region in some way (in the form of referring addresses in this specific
version of ROM?), so that it may stop working if the ROM content changed.
IOW, if my understanding is correct, new ROM files won't get used after
migration automatically, but it requires one system reset. When a system
reset is triggered after the VM migrated to the destination host, it'll reload device
ROMs with the files specified (which will start to point to the upgraded
version of ROM files), and IIUC that's where the devices will bootstrap with
the new ROM / BIOS files.
>
> I have a hard time accepting that ROMs are only migrated for
> performance
AFAICT, it's never about performance or making it faster when putting ROM
data on the wire. Even in this context, where Steve wanted to use fds backing
the ROMs, putting that on the wire is still slower than sharing fds.
Here my previous comment / point was that this should be a small region, so
it shouldn't matter a huge deal for ROMs to migrate either through the wire
or via the fd page cache. I wanted to remove one more dependency, namely that we
may even need the new -alloc-anon parameter, as it doesn't sound required
for ROM migrations.
> and there isn't some aspect of migrating them to ensure the
> contents remain identical, and by that token CPR would also need to
> preserve the contents to provide the same guarantee. Thanks,
Thanks,
--
Peter Xu
* Re: [PATCH V2 01/11] machine: alloc-anon option
2024-08-13 17:00 ` Alex Williamson
2024-08-13 18:45 ` Peter Xu
@ 2024-08-13 18:46 ` Steven Sistare
2024-08-13 18:49 ` Steven Sistare
1 sibling, 1 reply; 77+ messages in thread
From: Steven Sistare @ 2024-08-13 18:46 UTC (permalink / raw)
To: Alex Williamson, Peter Xu
Cc: Igor Mammedov, Daniel P. Berrange, qemu-devel, Fabiano Rosas,
David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster
On 8/13/2024 1:00 PM, Alex Williamson wrote:
> On Tue, 13 Aug 2024 11:35:15 -0400
> Peter Xu <peterx@redhat.com> wrote:
>
>> On Mon, Aug 12, 2024 at 02:37:59PM -0400, Steven Sistare wrote:
>>> On 8/8/2024 2:32 PM, Steven Sistare wrote:
>>>> On 7/29/2024 8:29 AM, Igor Mammedov wrote:
>>>>> On Sat, 20 Jul 2024 16:28:25 -0400
>>>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>>>
>>>>>> On 7/16/2024 5:19 AM, Igor Mammedov wrote:
>>>>>>> On Sun, 30 Jun 2024 12:40:24 -0700
>>>>>>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>> on the value of the anon-alloc machine property. This affects
>>>>>>>> memory-backend-ram objects, guest RAM created with the global -m option
>>>>>>>> but without an associated memory-backend object and without the -mem-path
>>>>>>>> option
>>>>>>> nowadays, all machines were converted to use memory backend for VM RAM.
>>>>>>> so -m option implicitly creates memory-backend object,
>>>>>>> which will be either MEMORY_BACKEND_FILE if -mem-path present
>>>>>>> or MEMORY_BACKEND_RAM otherwise.
>>>>>>
>>>>>> Yes. I dropped an important adjective, "implicit".
>>>>>>
>>>>>> "guest RAM created with the global -m option but without an explicit associated
>>>>>> memory-backend object and without the -mem-path option"
>>>>>>
>>>>>>>> To access the same memory in the old and new QEMU processes, the memory
>>>>>>>> must be mapped shared. Therefore, the implementation always sets
>>>>>>>> RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
>>>>>>>> user must explicitly specify the share option. In lieu of defining a new
>>>>>>> so statement at the top that memory-backend-ram is affected is not
>>>>>>> really valid?
>>>>>>
>>>>>> memory-backend-ram is affected by alloc-anon. But in addition, the user must
>>>>>> explicitly add the "share" option. I don't implicitly set share in this case,
>>>>>> because I would be overriding the user's specification of the memory object's property,
>>>>>> which would be private if omitted.
>>>>>
>>>>> instead of touching implicit RAM (-m), it would be better to error out
>>>>> and ask user to provide properly configured memory-backend explicitly.
>>>>>
>>>>>>
>>>>>>>> RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
>>>>>>>> as the condition for calling memfd_create.
>>>>>>>
>>>>>>> In general I do dislike adding yet another option that will affect
>>>>>>> guest RAM allocation (memory-backends should be sufficient).
>>>>>>>
>>>>>>> However I do see that you need memfd for device memory (vram, roms, ...).
>>>>>>> Can we just use memfd/shared unconditionally for those and
>>>>>>> avoid introducing a new confusing option?
>>>>>>
>>>>>> The Linux kernel has different tunables for backing memfd's with huge pages, so we
>>>>>> could hurt performance if we unconditionally change to memfd. The user should have
>>>>>> a choice for any segment that is large enough for huge pages to improve performance,
>>>>>> which potentially is any memory-backend-object. The non memory-backend objects are
>>>>>> small, and it would be OK to use memfd unconditionally for them.
>>>>
>>>> Thanks everyone for your feedback. The common theme is that you dislike that the
>>>> new option modifies the allocation of memory-backend-objects. OK, accepted. I propose
>>>> to remove that interaction, and document in the QAPI which backends work for CPR.
>>>> Specifically, memory-backend-memfd or memory-backend-file object is required,
>>>> with share=on (which is the default for memory-backend-memfd). CPR will be blocked
>>>> otherwise. The legacy -m option without an explicit memory-backend-object will not
>>>> support CPR.
>>>>
>>>> Non memory-backend-objects (ramblocks not described on the qemu command line) will always
>>>> be allocated using memfd_create (on Linux only). The alloc-anon option is deleted.
>>>> The logic in ram_block_add becomes:
>>>>
>>>> if (!new_block->host) {
>>>>     if (xen_enabled()) {
>>>>         ...
>>>>     } else if (!object_dynamic_cast(new_block->mr->parent_obj.parent,
>>>>                                     TYPE_MEMORY_BACKEND)) {
>>>>         qemu_memfd_create()
>>>>     } else {
>>>>         qemu_anon_ram_alloc()
>>>>     }
>>>> }
>>>>
>>>> Is that acceptable to everyone? Igor, Peter, Daniel?
>>
>> Sorry for a late reply.
>>
>> I think this may not work as David pointed out? Where AFAIU it will switch
>> many old anon use cases to use memfd, aka, shmem, and it might be
>> problematic when share=off: we have double memory consumption issue with
>> shmem with private mapping.
>>
>> I assume that includes things like "-m", "memory-backend-ram", and maybe
>> more. IIUC memory consumption of the VM will double with them.
>>
>>>
>>> In a simple test here are the NON-memory-backend-object ramblocks which
>>> are allocated with memfd_create in my new proposal:
>>>
>>> memfd_create system.flash0 3653632 @ 0x7fffe1000000 2 rw
>>> memfd_create system.flash1 540672 @ 0x7fffe0c00000 2 rw
>>> memfd_create pc.rom 131072 @ 0x7fffe0800000 2 rw
>>> memfd_create vga.vram 16777216 @ 0x7fffcac00000 2 rw
>>> memfd_create vga.rom 65536 @ 0x7fffe0400000 2 rw
>>> memfd_create /rom@etc/acpi/tables 2097152 @ 0x7fffca400000 6 rw
>>> memfd_create /rom@etc/table-loader 65536 @ 0x7fffca000000 6 rw
>>> memfd_create /rom@etc/acpi/rsdp 4096 @ 0x7fffc9c00000 6 rw
>>>
>>> Of those, only a subset are mapped for DMA, per the existing QEMU logic,
>>> no changes from me:
>>>
>>> dma_map: pc.rom 131072 @ 0x7fffe0800000 ro
>>> dma_map: vga.vram 16777216 @ 0x7fffcac00000 rw
>>> dma_map: vga.rom 65536 @ 0x7fffe0400000 ro
>>
>> I wonder whether there's any case that the "rom"s can be DMA target at
>> all.. I understand it's logically possible to be READ from as ROMs, but I
>> am curious what happens if we don't map them at all when they're ROMs, or
>> whether there's any device that can (in real life) DMA from device ROMs,
>> and for what use.
>>
>>> dma_map: 0000:3a:10.0 BAR 0 mmaps[0] 16384 @ 0x7ffff7fef000 rw
>>> dma_map: 0000:3a:10.0 BAR 3 mmaps[0] 12288 @ 0x7ffff7fec000 rw
>>>
>>> system.flash0 is excluded by the vfio listener because it is a rom_device.
>>> The rom@etc blocks are excluded because their MemoryRegions are not added to
>>> any container region, so the flatmem traversal of the AS used by the listener
>>> does not see them.
>>>
>>> The BARs should not be mapped IMO, and I propose excluding them in the
>>> iommufd series:
>>> https://lore.kernel.org/qemu-devel/1721502937-87102-3-git-send-email-steven.sistare@oracle.com/
>>
>> Looks like this is clear now that they should be there.
>>
>>>
>>> Note that the old-QEMU contents of all ramblocks must be preserved, just like
>>> in live migration. Live migration copies the contents in the stream. Live update
>>> preserves the contents in place by preserving the memfd. Thus memfd serves
>>> two purposes: preserving old contents, and preserving DMA mapped pinned pages.
>>
>> IMHO the 1st purpose is a fake one. IOW:
>>
>> - Preserving content will be important on large RAM/ROM regions. When
>> it's small, it shouldn't matter a huge deal, IMHO, because this is
>> about "how fast we can migrate / live upgrade'. IOW, this is not a
>> functional requirement.
>
> Regardless of the size of a ROM region, how would it ever be faster to
> migrate ROMs rather than reload them from stable media on the target?
> Furthermore, what mechanism other than migrating the ROM do we have to
> guarantee the contents of the ROM are identical?
>
> I have a hard time accepting that ROMs are only migrated for
> performance and there isn't some aspect of migrating them to ensure the
> contents remain identical, and by that token CPR would also need to
> preserve the contents to provide the same guarantee. Thanks,
I agree. Any ramblock may change if the contents are read from a file in
the QEMU distribution, or if the contents are composed by QEMU code. Live
migration guards against this by sending the old ramblock contents in the
migration stream.
- Steve
>> - DMA mapped pinned pages: instead this is a hard requirement that we
>> must make sure these pages are fd-based, because only a fd-based
>> mapping can persist the pages (via page cache).
>>
>> IMHO we shouldn't mangle them, and we should start with sticking with the
>> 2nd goal here. To be explicit, if we can find a good replacement for
>> -alloc-anon, IMHO we could still migrate the ramblocks that only fall into the
>> 1st purpose category, e.g. device ROMs, hopefully even if they're pinned,
>> they should never be DMAed to/from.
>>
>> Thanks,
>>
>
* Re: [PATCH V2 01/11] machine: alloc-anon option
2024-08-13 18:46 ` Steven Sistare
@ 2024-08-13 18:49 ` Steven Sistare
0 siblings, 0 replies; 77+ messages in thread
From: Steven Sistare @ 2024-08-13 18:49 UTC (permalink / raw)
To: Alex Williamson, Peter Xu
Cc: Igor Mammedov, Daniel P. Berrange, qemu-devel, Fabiano Rosas,
David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster
On 8/13/2024 2:46 PM, Steven Sistare wrote:
> On 8/13/2024 1:00 PM, Alex Williamson wrote:
>> On Tue, 13 Aug 2024 11:35:15 -0400
>> Peter Xu <peterx@redhat.com> wrote:
>>
>>> On Mon, Aug 12, 2024 at 02:37:59PM -0400, Steven Sistare wrote:
>>>> On 8/8/2024 2:32 PM, Steven Sistare wrote:
>>>>> On 7/29/2024 8:29 AM, Igor Mammedov wrote:
>>>>>> On Sat, 20 Jul 2024 16:28:25 -0400
>>>>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>>>>> On 7/16/2024 5:19 AM, Igor Mammedov wrote:
>>>>>>>> On Sun, 30 Jun 2024 12:40:24 -0700
>>>>>>>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>> on the value of the anon-alloc machine property. This affects
>>>>>>>>> memory-backend-ram objects, guest RAM created with the global -m option
>>>>>>>>> but without an associated memory-backend object and without the -mem-path
>>>>>>>>> option
>>>>>>>> nowadays, all machines were converted to use memory backend for VM RAM.
>>>>>>>> so -m option implicitly creates memory-backend object,
>>>>>>>> which will be either MEMORY_BACKEND_FILE if -mem-path present
>>>>>>>> or MEMORY_BACKEND_RAM otherwise.
>>>>>>>
>>>>>>> Yes. I dropped an important adjective, "implicit".
>>>>>>>
>>>>>>> "guest RAM created with the global -m option but without an explicit associated
>>>>>>> memory-backend object and without the -mem-path option"
>>>>>>>>> To access the same memory in the old and new QEMU processes, the memory
>>>>>>>>> must be mapped shared. Therefore, the implementation always sets
>>>>>>>>> RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
>>>>>>>>> user must explicitly specify the share option. In lieu of defining a new
>>>>>>>> so statement at the top that memory-backend-ram is affected is not
>>>>>>>> really valid?
>>>>>>>
>>>>>>> memory-backend-ram is affected by alloc-anon. But in addition, the user must
>>>>>>> explicitly add the "share" option. I don't implicitly set share in this case,
>>>>>>> because I would be overriding the user's specification of the memory object's property,
>>>>>>> which would be private if omitted.
>>>>>>
>>>>>> instead of touching implicit RAM (-m), it would be better to error out
>>>>>> and ask user to provide properly configured memory-backend explicitly.
>>>>>>>>> RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
>>>>>>>>> as the condition for calling memfd_create.
>>>>>>>>
>>>>>>>> In general I do dislike adding yet another option that will affect
>>>>>>>> guest RAM allocation (memory-backends should be sufficient).
>>>>>>>>
>>>>>>>> However I do see that you need memfd for device memory (vram, roms, ...).
>>>>>>>> Can we just use memfd/shared unconditionally for those and
>>>>>>>> avoid introducing a new confusing option?
>>>>>>>
>>>>>>> The Linux kernel has different tunables for backing memfd's with huge pages, so we
>>>>>>> could hurt performance if we unconditionally change to memfd. The user should have
>>>>>>> a choice for any segment that is large enough for huge pages to improve performance,
>>>>>>> which potentially is any memory-backend-object. The non memory-backend objects are
>>>>>>> small, and it would be OK to use memfd unconditionally for them.
>>>>>
>>>>> Thanks everyone for your feedback. The common theme is that you dislike that the
>>>>> new option modifies the allocation of memory-backend-objects. OK, accepted. I propose
>>>>> to remove that interaction, and document in the QAPI which backends work for CPR.
>>>>> Specifically, memory-backend-memfd or memory-backend-file object is required,
>>>>> with share=on (which is the default for memory-backend-memfd). CPR will be blocked
>>>>> otherwise. The legacy -m option without an explicit memory-backend-object will not
>>>>> support CPR.
>>>>>
>>>>> Non memory-backend-objects (ramblocks not described on the qemu command line) will always
>>>>> be allocated using memfd_create (on Linux only). The alloc-anon option is deleted.
>>>>> The logic in ram_block_add becomes:
>>>>>
>>>>> if (!new_block->host) {
>>>>>     if (xen_enabled()) {
>>>>>         ...
>>>>>     } else if (!object_dynamic_cast(new_block->mr->parent_obj.parent,
>>>>>                                     TYPE_MEMORY_BACKEND)) {
>>>>>         qemu_memfd_create()
>>>>>     } else {
>>>>>         qemu_anon_ram_alloc()
>>>>>     }
>>>>> }
>>>>>
>>>>> Is that acceptable to everyone? Igor, Peter, Daniel?
>>>
>>> Sorry for a late reply.
>>>
>>> I think this may not work as David pointed out? Where AFAIU it will switch
>>> many old anon use cases to use memfd, aka, shmem, and it might be
>>> problematic when share=off: we have double memory consumption issue with
>>> shmem with private mapping.
>>>
>>> I assume that includes things like "-m", "memory-backend-ram", and maybe
>>> more. IIUC memory consumption of the VM will double with them.
>>>
>>>>
>>>> In a simple test here are the NON-memory-backend-object ramblocks which
>>>> are allocated with memfd_create in my new proposal:
>>>>
>>>> memfd_create system.flash0 3653632 @ 0x7fffe1000000 2 rw
>>>> memfd_create system.flash1 540672 @ 0x7fffe0c00000 2 rw
>>>> memfd_create pc.rom 131072 @ 0x7fffe0800000 2 rw
>>>> memfd_create vga.vram 16777216 @ 0x7fffcac00000 2 rw
>>>> memfd_create vga.rom 65536 @ 0x7fffe0400000 2 rw
>>>> memfd_create /rom@etc/acpi/tables 2097152 @ 0x7fffca400000 6 rw
>>>> memfd_create /rom@etc/table-loader 65536 @ 0x7fffca000000 6 rw
>>>> memfd_create /rom@etc/acpi/rsdp 4096 @ 0x7fffc9c00000 6 rw
>>>>
>>>> Of those, only a subset are mapped for DMA, per the existing QEMU logic,
>>>> no changes from me:
>>>>
>>>> dma_map: pc.rom 131072 @ 0x7fffe0800000 ro
>>>> dma_map: vga.vram 16777216 @ 0x7fffcac00000 rw
>>>> dma_map: vga.rom 65536 @ 0x7fffe0400000 ro
>>>
>>> I wonder whether there's any case that the "rom"s can be DMA target at
>>> all.. I understand it's logically possible to be READ from as ROMs, but I
>>> am curious what happens if we don't map them at all when they're ROMs, or
>>> whether there's any device that can (in real life) DMA from device ROMs,
>>> and for what use.
>>>
>>>> dma_map: 0000:3a:10.0 BAR 0 mmaps[0] 16384 @ 0x7ffff7fef000 rw
>>>> dma_map: 0000:3a:10.0 BAR 3 mmaps[0] 12288 @ 0x7ffff7fec000 rw
>>>>
>>>> system.flash0 is excluded by the vfio listener because it is a rom_device.
>>>> The rom@etc blocks are excluded because their MemoryRegions are not added to
>>>> any container region, so the flatmem traversal of the AS used by the listener
>>>> does not see them.
>>>>
>>>> The BARs should not be mapped IMO, and I propose excluding them in the
>>>> iommufd series:
>>>> https://lore.kernel.org/qemu-devel/1721502937-87102-3-git-send-email-steven.sistare@oracle.com/
>>>
>>> Looks like this is clear now that they should be there.
>>>
>>>>
>>>> Note that the old-QEMU contents of all ramblocks must be preserved, just like
>>>> in live migration. Live migration copies the contents in the stream. Live update
>>>> preserves the contents in place by preserving the memfd. Thus memfd serves
>>>> two purposes: preserving old contents, and preserving DMA mapped pinned pages.
>>>
>>> IMHO the 1st purpose is a fake one. IOW:
>>>
>>> - Preserving content will be important on large RAM/ROM regions. When
>>> it's small, it shouldn't matter a huge deal, IMHO, because this is
>>> about "how fast we can migrate / live upgrade'. IOW, this is not a
>>> functional requirement.
>>
>> Regardless of the size of a ROM region, how would it ever be faster to
>> migrate ROMs rather than reload them from stable media on the target?
>> Furthermore, what mechanism other than migrating the ROM do we have to
>> guarantee the contents of the ROM are identical?
>>
>> I have a hard time accepting that ROMs are only migrated for
>> performance and there isn't some aspect of migrating them to ensure the
>> contents remain identical, and by that token CPR would also need to
>> preserve the contents to provide the same guarantee. Thanks,
>
> I agree. Any ramblock may change if the contents are read from a file in
> the QEMU distribution, or if the contents are composed by QEMU code. Live
> migration guards against this by sending the old ramblock contents in the
> migration stream.
Our emails just crossed. I will repost this to your new reply to keep a single
conversation thread.
- Steve
>
>>> - DMA mapped pinned pages: instead this is a hard requirement that we
>>> must make sure these pages are fd-based, because only a fd-based
>>> mapping can persist the pages (via page cache).
>>>
>>> IMHO we shouldn't mangle them, and we should start with sticking with the
>>> 2nd goal here. To be explicit, if we can find a good replacement for
>>> -alloc-anon, IMHO we could still migrate the ramblocks that only fall into the
>>> 1st purpose category, e.g. device ROMs, hopefully even if they're pinned,
>>> they should never be DMAed to/from.
>>>
>>> Thanks,
>>>
>>
* Re: [PATCH V2 01/11] machine: alloc-anon option
2024-08-13 18:45 ` Peter Xu
@ 2024-08-13 18:56 ` Steven Sistare
0 siblings, 0 replies; 77+ messages in thread
From: Steven Sistare @ 2024-08-13 18:56 UTC (permalink / raw)
To: Peter Xu, Alex Williamson
Cc: Igor Mammedov, Daniel P. Berrange, qemu-devel, Fabiano Rosas,
David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster
On 8/13/2024 2:45 PM, Peter Xu wrote:
> On Tue, Aug 13, 2024 at 11:00:37AM -0600, Alex Williamson wrote:
>>>> Note that the old-QEMU contents of all ramblocks must be preserved, just like
>>>> in live migration. Live migration copies the contents in the stream. Live update
>>>> preserves the contents in place by preserving the memfd. Thus memfd serves
>>>> two purposes: preserving old contents, and preserving DMA mapped pinned pages.
>>>
>>> IMHO the 1st purpose is a fake one. IOW:
>>>
>>> - Preserving content will be important on large RAM/ROM regions. When
>>> it's small, it shouldn't matter a huge deal, IMHO, because this is
>>> about "how fast we can migrate / live upgrade'. IOW, this is not a
>>> functional requirement.
>>
>> Regardless of the size of a ROM region, how would it ever be faster to
>> migrate ROMs rather than reload them from stable media on the target?
>> Furthermore, what mechanism other than migrating the ROM do we have to
>> guarantee the contents of the ROM are identical?
>
> IIRC we need to migrate ROMs in some form because they can be different on
> src/dst, e.g., ROM files can be upgraded after QEMU upgrades. Here either
> putting them onto the migration stream, or making that fd-based, should work.
Agreed.
> Frankly, I don't know the full details of why they can't be different.
Any ramblock may change if the contents are read from a file in
the QEMU distribution, or if the contents are composed by QEMU code.
> My current understanding was that if one device boots with one version of
> ROM/firmware, then it's possible the device keeps interacting with the ROM
> region in some way (in the form of referring addresses in this specific
> version of ROM?), so that it may stop working if the ROM content changed.
Yes, the guest may continue to read data or execute code from the block.
> IOW, if my understanding is correct, new ROM files won't get used after
> migration automatically, but it requires one system reset. When a system
> reset is triggered after the VM migrated to the destination host, it'll reload device
> ROMs with the files specified (which will start to point to the upgraded
> version of ROM files), and IIUC that's where the devices will bootstrap with
> the new ROM / BIOS files.
Agreed.
>> I have a hard time accepting that ROMs are only migrated for
>> performance
>
> AFAICT, it's never about performance or making it faster when putting ROM
> data on the wire. Even in this context, where Steve wanted to use fds backing
> the ROMs, putting that on the wire is still slower than sharing fds.
>
> Here my previous comment / point was that this should be a small region, so
> it shouldn't matter a huge deal for ROMs to migrate either through the wire
> or via the fd page cache. I wanted to remove one more dependency, namely that we
> may even need the new -alloc-anon parameter, as it doesn't sound required
> for ROM migrations.
Agreed that performance is not the issue. I use memfd for correctness.
Also note that my new proposal deletes the alloc-anon parameter. I allocate
non-command-line ramblocks unconditionally using memfd_create.
- Steve
>> and there isn't some aspect of migrating them to ensure the
>> contents remain identical, and by that token CPR would also need to
>> preserve the contents to provide the same guarantee. Thanks,
>
> Thanks,
>
* Re: [PATCH V2 01/11] machine: alloc-anon option
2024-08-13 17:34 ` Steven Sistare
@ 2024-08-13 19:02 ` Peter Xu
0 siblings, 0 replies; 77+ messages in thread
From: Peter Xu @ 2024-08-13 19:02 UTC (permalink / raw)
To: Steven Sistare
Cc: Igor Mammedov, Daniel P. Berrange, qemu-devel, Fabiano Rosas,
David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster
On Tue, Aug 13, 2024 at 01:34:42PM -0400, Steven Sistare wrote:
> > > > Non memory-backend-objects (ramblocks not described on the qemu command line) will always
> > > > be allocated using memfd_create (on Linux only). The alloc-anon option is deleted.
> > > > The logic in ram_block_add becomes:
> > > >
> > > > if (!new_block->host) {
> > > >     if (xen_enabled()) {
> > > >         ...
> > > >     } else if (!object_dynamic_cast(new_block->mr->parent_obj.parent,
> > > >                                     TYPE_MEMORY_BACKEND)) {
> > > >         qemu_memfd_create()
> > > >     } else {
> > > >         qemu_anon_ram_alloc()
> > > >     }
> > > > }
> > > >
> > > > Is that acceptable to everyone? Igor, Peter, Daniel?
> >
> > Sorry for a late reply.
> >
> > I think this may not work as David pointed out? Where AFAIU it will switch
> > many old anon use cases to use memfd, aka, shmem, and it might be
> > problematic when share=off: we have double memory consumption issue with
> > shmem with private mapping.
> >
> > I assume that includes things like "-m", "memory-backend-ram", and maybe
> > more. IIUC memory consumption of the VM will double with them.
>
> The new proposal only affects anon allocations that are not described on
> the command line, and their memfd will be shared. There is no
> command line option which would set share=off for these blocks.
>
> "-m" and "memory-backend-ram" are not affected.
> They will not work with CPR.
Hmm, yeah, memory-backend-ram should be TYPE_MEMORY_BACKEND for sure.. and I
just noticed "-m" looks to be the same.
Though this change still gives me the feeling that we don't yet know who's
the target of this change at all, and what purpose it serves.
I'll see how others see this. For me, at least in this case I think it'll
be nice if the "else if" is not used unless cpr is enabled in the first
place. But that's still a bit hacky to me.
Thanks,
--
Peter Xu
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-08-06 20:56 ` Steven Sistare
@ 2024-08-13 19:46 ` Peter Xu
2024-08-15 20:55 ` Steven Sistare
0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2024-08-13 19:46 UTC (permalink / raw)
To: Steven Sistare
Cc: Daniel P. Berrangé, qemu-devel, Fabiano Rosas,
David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster
On Tue, Aug 06, 2024 at 04:56:18PM -0400, Steven Sistare wrote:
> > The flipside, however, is that localhost migration via 2 separate QEMU
> > processes has issues where both QEMUs want to be opening the very same
> > file, and only 1 of them can ever have them open.
I thought we used to have a similar issue with block devices, but I assume
it's been solved for years (whoever owns the device will take a proper file lock,
IIRC, and QEMU migration should properly serialize the time window for who's
going to take the file lock).
Maybe this is about something else?
>
> Indeed, and "files" includes unix domain sockets. Network ports also conflict.
> cpr-exec avoids such problems, and is one of the advantages of the method that
> I forgot to promote.
I was thinking that's fine, as the host ports should be the backend of the
VM ports anyway, so they don't need to be identical on both sides?
IOW, my understanding is it's the guest IP/ports/... which should still be
stable across migrations, where the host ports can be different as long as
the host ports can forward guest port messages correctly?
Thanks,
--
Peter Xu
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-08-07 19:47 ` Steven Sistare
@ 2024-08-13 20:12 ` Peter Xu
2024-08-20 16:28 ` [PATCH V2 00/11] Live update: cpr-exec (reconnections) Steven Sistare
0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2024-08-13 20:12 UTC (permalink / raw)
To: Steven Sistare
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster
On Wed, Aug 07, 2024 at 03:47:47PM -0400, Steven Sistare wrote:
> On 8/4/2024 12:10 PM, Peter Xu wrote:
> > On Sat, Jul 20, 2024 at 05:26:07PM -0400, Steven Sistare wrote:
> > > On 7/18/2024 11:56 AM, Peter Xu wrote:
> > > > Steve,
> > > >
> > > > On Sun, Jun 30, 2024 at 12:40:23PM -0700, Steve Sistare wrote:
> > > > > What?
> > > >
> > > > Thanks for trying out with the cpr-transfer series. I saw that that series
> > > > missed most of the cc list here, so I'm attaching the link here:
> > > >
> > > > https://lore.kernel.org/r/1719776648-435073-1-git-send-email-steven.sistare@oracle.com
> > > >
> > > > I think most of my previous questions for exec() solution still are there,
> > > > I'll try to summarize them all in this reply as much as I can.
> > > >
> > > > >
> > > > > This patch series adds the live migration cpr-exec mode, which allows
> > > > > the user to update QEMU with minimal guest pause time, by preserving
> > > > > guest RAM in place, albeit with new virtual addresses in new QEMU, and
> > > > > by preserving device file descriptors.
> > > > >
> > > > > The new user-visible interfaces are:
> > > > > * cpr-exec (MigMode migration parameter)
> > > > > * cpr-exec-command (migration parameter)
> > > >
> > > > I really, really hope we can avoid this..
> > > >
> > > > It's super cumbersome to pass in a qemu cmdline in a qemu migration
> > > > parameter.. if we can do that with generic live migration ways, I hope we
> > > > stick with the clean approach.
> > >
> > > This is no different than live migration, requiring a management agent to
> > > launch target qemu with all the arguments used to start source QEMU. Now that
> > > same agent will send the arguments via cpr-exec-command.
> >
> > It's still a bit different.
> >
> > There we append "-incoming defer" only, which makes sense because we're
> > instructing a QEMU to take an incoming stream to load. Now we append the
> > complete qemu cmdline within the QEMU itself, that was booted with exactly
> > the same cmdline.. :-( I would at least start to ask why we need to pass
> > the same thing twice..
>
> Sometimes one must modify the command line arguments passed to new QEMU.
> This interface allows for that possibility.
>
> In an earlier patch series, I proposed a cpr-exec command that took no arguments,
> and reused the old QEMU argv, which was remembered in main. A reviewer pointed out
> how inflexible that was. See my response to Daniel yesterday for more on the value
> of this flexibility.
>
> This is not a burden for the mgmt agent. It already knows the arguments because
> it can launch new qemu with the arguments for live migration. Passing the arguments
> to cpr-exec-command is trivial.
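For concreteness, the proposed interface might be driven over QMP roughly as follows. This is a sketch based on the parameter names in this series; the command-line arguments and the migration URI shown are illustrative, not taken from the patches:

```json
{ "execute": "migrate-set-parameters",
  "arguments": {
    "mode": "cpr-exec",
    "cpr-exec-command": [ "qemu-system-x86_64",
                          "-machine", "anon-alloc=memfd",
                          "-m", "16G",
                          "-incoming", "file:/var/run/vm.cpr" ] } }

{ "execute": "migrate",
  "arguments": { "uri": "file:/var/run/vm.cpr" } }
```

In practice cpr-exec-command must carry the full original argument list plus -incoming, which is exactly what the mgmt agent already knows from launching the source QEMU.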
Right, trivial as-is. To me it's not a major blocker yet so far, but it's
still about being hackish, and I have this unpleasant feeling that we're
digging holes for our future.
>
> > Not saying that this is no-go, but really looks unpretty to me from this
> > part.. especially if a cleaner solution seems possible.
> >
> > >
> > > > > * anon-alloc (command-line option for -machine)
> > > >
> > > > Igor questioned this, and I second his opinion.. We can leave the
> > > > discussion there for this one.
> > >
> > > Continued on the other thread.
> > >
> > > > > The user sets the mode parameter before invoking the migrate command.
> > > > > In this mode, the user issues the migrate command to old QEMU, which
> > > > > stops the VM and saves state to the migration channels. Old QEMU then
> > > > > exec's new QEMU, replacing the original process while retaining its PID.
> > > > > The user specifies the command to exec new QEMU in the migration parameter
> > > > > cpr-exec-command. The command must pass all old QEMU arguments to new
> > > > > QEMU, plus the -incoming option. Execution resumes in new QEMU.
> > > > >
> > > > > Memory-backend objects must have the share=on attribute, but
> > > > > memory-backend-epc is not supported. The VM must be started
> > > > > with the '-machine anon-alloc=memfd' option, which allows anonymous
> > > > > memory to be transferred in place to the new process.
> > > > >
> > > > > Why?
> > > > >
> > > > > This mode has less impact on the guest than any other method of updating
> > > > > in place.
> > > >
> > > > So I wonder whether there's comparison between exec() and transfer mode
> > > > that you recently proposed.
> > >
> > > Not yet, but I will measure it.
> >
> > Thanks.
> >
> > >
> > > > I'm asking because exec() (besides all the rest of things that I dislike on
> > > > it in this approach..) should be simply slower, logically, due to the
> > > > serialized operation to (1) tearing down the old mm, (2) reload the new
> > > > ELF, then (3) runs through the QEMU init process.
> > > >
> > > > If with a generic migration solution, the dest QEMU can start running (2+3)
> > > > concurrently without even need to run (1).
> > > >
> > > > In this whole process, I doubt (2) could be relatively fast, (3) I donno,
> > > > maybe it could be slow but I never measured; Paolo may have good idea as I
> > > > know he used to work on qboot.
> > >
> > > We'll see, but in any case these take < 100 msec, which is a wonderfully short
> >
> > I doubt whether it keeps <100ms when the VM is large. Note that I think we
> > should cover the case where the user does 4k mapping for a large guest.
> >
> > So I agree that 4k mapping over e.g. 1T without hugetlb may not be the
> > ideal case, but the question is I suspect there are indeed serious users
> > using QEMU like that, and if we have an almost exactly parallel solution that
> > does cover this case, it is definitely preferable to consider the other
> > from this POV, simply because there's nothing to lose there..
> >
> > > pause time unless your customer is doing high speed stock trading. If cpr-transfer
> > > is faster still, that's gravy, but cpr-exec is still great.
> > >
> > > > For (1), I also doubt in your test cases it's fast, but it may not always
> > > > be fast. Consider the guest has a huge TBs of shared mem, even if the
> > > > memory will be completely shared between src/dst QEMUs, the pgtable won't!
> > > > It means if the TBs are mapped in PAGE_SIZE tearing down the src QEMU
> > > > pgtable alone can even take time, and that will be accounted in step (1)
> > > > and further in exec() request.
> > >
> > > Yes, there is an O(n) effect here, but it is a fast O(n) when the memory is
> > > backed by huge pages. In UEK, we make it faster still by unmapping in parallel
> > > with multiple threads. I don't have the data handy but can share after running
> > > some experiments. Regardless, this time is negligible for small and medium
> > > size guests, which form the majority of instances in a cloud.
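To put the O(n) factor in perspective, here is a quick back-of-the-envelope count of the last-level page-table entries the kernel must tear down for 1 TiB of guest memory, by backing page size (illustrative arithmetic only):

```python
TIB = 1 << 40  # 1 TiB of guest memory

# Last-level page-table entries torn down on exec(), by page size.
ptes_4k = TIB // (4 << 10)   # 4 KiB base pages
ptes_2m = TIB // (2 << 20)   # 2 MiB huge pages
ptes_1g = TIB // (1 << 30)   # 1 GiB huge pages

print(ptes_4k, ptes_2m, ptes_1g)  # 268435456 524288 1024
```

Huge pages shrink the teardown work by a factor of 512 (2 MiB) or 262,144 (1 GiB), which is why the unmap cost is negligible for hugetlb-backed guests and only becomes significant with 4 KiB mappings over large memory.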
> >
> > Possible. It's just that it sounds like a good idea to avoid having the
> > downtime account for any pgtable teardown of the old mm, regardless of
> > how much time it takes. It's just that I suspect some use
> > cases can take a fair amount of time.
>
> Here is the guest pause time, measured as the interval from the start
> of the migrate command to the new QEMU guest reaching the running state.
> The average over 10 runs is shown, in msecs.
> Huge pages are enabled.
> Guest memory is memfd.
> The kernel is 6.9.0 (not UEK, so no parallel unmap)
> The system is old and slow: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
>
>           cpr-exec   cpr-transfer
>   256M       180         148
>   16G        190         150
>   128G       250         159
>   1T         300 ?       159 ?      // extrapolated
>
> At these scales, the difference between exec and transfer is not significant.
> A provider would choose one vs the other based on ease of implementation in
> their mgmt agent and container environment.
Thanks. Are these the maximum huge page size the memory size allows (either
2M or 1G), or is this only about 2M?
>
> For small pages and large memory, cpr-exec can take multiple seconds, and
> the UEK parallel unmap reduces that further. But, that is the exception,
> not the rule. Providers strive to back huge memories with huge pages. It
> makes no sense to use such a valuable resource in the crappiest way possible
> (ie with small pages).
I can't say nobody uses small pages when memory is large. The thing is
hugetlb loses features compared to small pages; or rather, nearly all the
memory features Linux provides. So I won't be surprised if someone tells
me there are TB-level VMs using small pages on purpose for any of
those features (swap, KSM, etc.). I have a vague memory of customers
using such a setup in the past, though I don't remember the reasons.
I do wish for a design that will perform well even there, so it works for all
cases we can think of so far. Not to mention cpr-transfer seems to
outperform it everywhere too. Unless the "management layer benefits" are so
strong.. it seems to me we have a clear choice. I understand it may affect
your plan, but let's be fair, or.. is it not the case?
>
> > So I think this is "one point less" for exec() solution, while the issue
> > can be big or small on its own. What matters is IMHO where exec() is
> > superior so that we'd like to pay for this. I'll try to stop saying "let's
> > try to avoid using exec() as it sounds risky", but we still need to compare
> > with solid pros and cons.
> >
> > >
> > > > All these fuss will be avoided if you use a generic live migration model
> > > > like cpr-transfer you proposed. That's also cleaner.
> > > >
> > > > > The pause time is much lower, because devices need not be torn
> > > > > down and recreated, DMA does not need to be drained and quiesced, and minimal
> > > > > state is copied to new QEMU. Further, there are no constraints on the guest.
> > > > > By contrast, cpr-reboot mode requires the guest to support S3 suspend-to-ram,
> > > > > and suspending plus resuming vfio devices adds multiple seconds to the
> > > > > guest pause time. Lastly, there is no loss of connectivity to the guest,
> > > > > because chardev descriptors remain open and connected.
> > > >
> > > > Again, I raised the question on why this would matter, as after all mgmt
> > > > app will need to coop with reconnections due to the fact they'll need to
> > > > support a generic live migration, in which case reconnection is a must.
> > > >
> > > > So far it doesn't sound like a performance critical path, for example, to
> > > > do the mgmt reconnects on the ports. So this might be an optimization that
> > > > most mgmt apps may not care much?
> > >
> > > Perhaps. I view the chardev preservation as nice to have, but not essential.
> > > It does not appear in this series, other than in docs. It's easy to implement
> > > given the CPR foundation. I suggest we continue this discussion when I post
> > > the chardev series, so we can focus on the core functionality.
> >
> > It's just that it can affect our decision on choosing the way to go.
> >
> > For example, do we have someone from Libvirt or any mgmt layer can help
> > justify this point?
> >
> > As I said, I thought most facilities for reconnection should be ready, but
> > I could miss important facts in mgmt layers..
>
> I will more deeply study reconnects in the mgmt layer, run some experiments to
> see if it is seamless for the end user, and get back to you, but it will take
> some time.
> > > > > These benefits all derive from the core design principle of this mode,
> > > > > which is preserving open descriptors. This approach is very general and
> > > > > can be used to support a wide variety of devices that do not have hardware
> > > > > support for live migration, including but not limited to: vfio, chardev,
> > > > > vhost, vdpa, and iommufd. Some devices need new kernel software interfaces
> > > > > to allow a descriptor to be used in a process that did not originally open it.
> > > >
> > > > Yes, I still think this is a great idea. It just can also be built on top
> > > > of something else than exec().
> > > >
> > > > >
> > > > > In a containerized QEMU environment, cpr-exec reuses an existing QEMU
> > > > > container and its assigned resources. By contrast, consider a design in
> > > > > which a new container is created on the same host as the target of the
> > > > > CPR operation. Resources must be reserved for the new container, while
> > > > > the old container still reserves resources until the operation completes.
> > > >
> > > > Note that if we need to share RAM anyway, the resources consumption should
> > > > be minimal, as mem should IMHO be the major concern (except CPU, but CPU
> > > > isn't a concern in this scenario) in container world and here the shared
> > > > guest mem shouldn't be accounted to the dest container. So IMHO it's about
> > > > the metadata QEMU/KVM needs to do the hypervisor work, it seems to me, and
> > > > that should be relatively small.
> > > >
> > > > In that case I don't yet see it a huge improvement, if the dest container
> > > > is cheap to initiate.
> > >
> > > It's about reserving memory and CPUs, and transferring those reservations from
> > > the old instance to the new, and fiddling with the OS mechanisms that enforce
> > > reservations and limits. The devil is in the details, and with the exec model,
> > > the management agent can ignore all of that.
> > >
> > > You don't see it as a huge improvement because you don't need to write the
> > > management code. I do!
> >
> > Heh, possibly true.
> >
> > Could I ask what management code you're working on? Why that management
> > code doesn't need to already work out these problems with reconnections
> > (like pre-CPR ways of live upgrade)?
>
> OCI - Oracle Cloud Infrastructure.
> Mgmt needs to manage reconnections for live migration, and perhaps I could
> leverage that code for live update, but happily I did not need to. Regardless,
> reconnection is the lesser issue. The bigger issue is resource management and
> the container environment. But I cannot justify that statement in detail without
> actually trying to implement cpr-transfer in OCI.
I see. Is OCI open source somewhere?
If it's close-sourced, maybe it'll be helpful to see how the exec() design
could benefit other open source mgmt applications.
>
> > > Both modes are valid and useful - exec in container, or launch a new container.
> > > I have volunteered to implement the cpr-transfer mode for the latter, a mode
> > > I do not use. Please don't reward me by dropping the mode I care about :)
> > > Both modes can co-exist. The presence of the cpr-exec specific code in qemu
> > > will not hinder future live migration development.
> >
> > I'm trying to remove some of my "prejudices" on exec() :). Hopefully that
> > proved more or less that I simply wanted to be fair on making a design
> > decision. I don't think I have a strong opinion, but it looks to me not
> > ideal to merge two solutions if both modes share the use case.
> >
> > Or if you think both modes should service different purpose, we might
> > consider both, but that needs to be justified - IOW, we shouldn't merge
> > anything that will never be used.
>
> The use case is the same for both modes, but they are simply different
> transport methods for moving descriptors from old QEMU to new. The developer
> of the mgmt agent should be allowed to choose.
It's beyond my capability to review the mgmt impact on this one. This is
all based on the idea that I think most mgmt apps support reconnections
pretty well. If that's the case, I'd definitely go for the transfer mode.
I'm not sure whether there's anyone from mgmt layer would like to share
some opinion; Dan could be the most suitable in the loop already.
Thanks,
--
Peter Xu
* Re: [PATCH V2 06/11] migration: fix mismatched GPAs during cpr
2024-08-07 21:04 ` Steven Sistare
@ 2024-08-13 20:43 ` Peter Xu
2024-08-15 20:54 ` Steven Sistare
0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2024-08-13 20:43 UTC (permalink / raw)
To: Steven Sistare
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster
On Wed, Aug 07, 2024 at 05:04:26PM -0400, Steven Sistare wrote:
> On 7/19/2024 12:28 PM, Peter Xu wrote:
> > On Sun, Jun 30, 2024 at 12:40:29PM -0700, Steve Sistare wrote:
> > > For new cpr modes, ramblock_is_ignored will always be true, because the
> > > memory is preserved in place rather than copied. However, for an ignored
> > > block, parse_ramblock currently requires that the received address of the
> > > block must match the address of the statically initialized region on the
> > > target. This fails for a PCI rom block, because the memory region address
> > > is set when the guest writes to a BAR on the source, which does not occur
> > > on the target, causing a "Mismatched GPAs" error during cpr migration.
> >
> > Is this a common fix with/without cpr mode?
> >
> > It looks to me mr->addr (for these ROMs) should only be set in PCI config
> > region updates as you mentioned. But then I didn't figure out when they're
> > updated on dest in live migration: the ramblock info was sent at the
> > beginning of migration, so it doesn't even have PCI config space migrated;
> > I thought the real mr->addr should be in there.
> >
> > I also failed to understand yet on why the mr->addr check needs to be done
> > by ignore-shared only. Some explanation would be greatly helpful around
> > this area..
>
> The error_report does not bite for normal migration because migrate_ram_is_ignored()
> is false for the problematic blocks, so the block->mr->addr check is not
> performed. However, mr->addr is never fixed up in this case, which is a
> quiet potential bug, and this patch fixes that with the "has_addr" check.
>
> For cpr-exec, migrate_ram_is_ignored() is true for all blocks,
> because we do not copy the contents over the migration stream, we preserve the
> memory in place. So we fall into the block->mr->addr sanity check and fail
> with the original code.
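To make the control flow under discussion easier to follow, the relevant logic in parse_ramblock() can be paraphrased like this (a pseudocode sketch, not verbatim QEMU source):

```c
/* Paraphrased sketch of parse_ramblock() handling for one block. */
if (migrate_ram_is_ignored(block)) {
    /* Contents are preserved in place (cpr) or shared (x-ignore-shared),
     * so nothing is copied; only sanity checks run.  This comparison is
     * what fails for the PCI ROM block, whose mr->addr was set by a
     * guest BAR write on the source only. */
    if (addr != block->mr->addr) {
        error_report("Mismatched GPAs for block %s", block->idstr);
        return -EINVAL;
    }
} else {
    /* Normal migration: contents are received from the stream and the
     * address check is skipped, so a stale mr->addr goes unnoticed
     * until device post_load() fixes it up from PCI config space. */
}
```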
OK I get your point now. However this doesn't look right, instead I start
to question why we need to send mr->addr at all..
As I said previously, AFAIU mr->addr should only be updated when there's
some PCI config space updates so that it moves the MR around in the address
space based on how guest drivers / BIOS (?) set things up. Now after these
days not looking, and just started to look at this again, I think the only
sane place to do this update is during a post_load().
And if we start to check some of the memory_region_set_address() users,
that's exactly what happened..
- ich9_pm_iospace_update(), update addr for ICH9LPCPMRegs.io, where
ich9_pm_post_load() also invokes it.
- pm_io_space_update(), updates PIIX4PMState.io, where
vmstate_acpi_post_load() also invokes it.
I stopped here just looking at the initial two users, it looks all sane to
me that it only got updated there, because the update requires pci config
space being migrated first.
IOW, I don't think having mismatched mr->addr is wrong at this stage.
Instead, I don't see why we should send mr->addr at all in this case during
as early as SETUP, and I don't see anything justifies the mr->addr needs to
be verified in parse_ramblock() since ignore-shared introduced by Yury in
commit fbd162e629aaf8 in 2019.
We can't drop mr->addr now when it's on-wire, but I think we should drop
the error report and addr check, instead of this patch.
Thanks,
--
Peter Xu
* Re: [PATCH V2 06/11] migration: fix mismatched GPAs during cpr
2024-08-13 20:43 ` Peter Xu
@ 2024-08-15 20:54 ` Steven Sistare
2024-08-16 14:43 ` Peter Xu
0 siblings, 1 reply; 77+ messages in thread
From: Steven Sistare @ 2024-08-15 20:54 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster
On 8/13/2024 4:43 PM, Peter Xu wrote:
> On Wed, Aug 07, 2024 at 05:04:26PM -0400, Steven Sistare wrote:
>> On 7/19/2024 12:28 PM, Peter Xu wrote:
>>> On Sun, Jun 30, 2024 at 12:40:29PM -0700, Steve Sistare wrote:
>>>> For new cpr modes, ramblock_is_ignored will always be true, because the
>>>> memory is preserved in place rather than copied. However, for an ignored
>>>> block, parse_ramblock currently requires that the received address of the
>>>> block must match the address of the statically initialized region on the
>>>> target. This fails for a PCI rom block, because the memory region address
>>>> is set when the guest writes to a BAR on the source, which does not occur
>>>> on the target, causing a "Mismatched GPAs" error during cpr migration.
>>>
>>> Is this a common fix with/without cpr mode?
>>>
>>> It looks to me mr->addr (for these ROMs) should only be set in PCI config
>>> region updates as you mentioned. But then I didn't figure out when they're
>>> updated on dest in live migration: the ramblock info was sent at the
>>> beginning of migration, so it doesn't even have PCI config space migrated;
>>> I thought the real mr->addr should be in there.
>>>
>>> I also failed to understand yet on why the mr->addr check needs to be done
>>> by ignore-shared only. Some explanation would be greatly helpful around
>>> this area..
>>
>> The error_report does not bite for normal migration because migrate_ram_is_ignored()
>> is false for the problematic blocks, so the block->mr->addr check is not
>> performed. However, mr->addr is never fixed up in this case, which is a
>> quiet potential bug, and this patch fixes that with the "has_addr" check.
>>
>> For cpr-exec, migrate_ram_is_ignored() is true for all blocks,
>> because we do not copy the contents over the migration stream, we preserve the
>> memory in place. So we fall into the block->mr->addr sanity check and fail
>> with the original code.
>
> OK I get your point now. However this doesn't look right, instead I start
> to question why we need to send mr->addr at all..
>
> As I said previously, AFAIU mr->addr should only be updated when there's
> some PCI config space updates so that it moves the MR around in the address
> space based on how guest drivers / BIOS (?) set things up. Now after these
> days not looking, and just started to look at this again, I think the only
> sane place to do this update is during a post_load().
>
> And if we start to check some of the memory_region_set_address() users,
> that's exactly what happened..
>
> - ich9_pm_iospace_update(), update addr for ICH9LPCPMRegs.io, where
> ich9_pm_post_load() also invokes it.
>
> - pm_io_space_update(), updates PIIX4PMState.io, where
> vmstate_acpi_post_load() also invokes it.
>
> I stopped here just looking at the initial two users, it looks all sane to
> me that it only got updated there, because the update requires pci config
> space being migrated first.
>
> IOW, I don't think having mismatched mr->addr is wrong at this stage.
> Instead, I don't see why we should send mr->addr at all in this case during
> as early as SETUP, and I don't see anything justifies the mr->addr needs to
> be verified in parse_ramblock() since ignore-shared introduced by Yury in
> commit fbd162e629aaf8 in 2019.
>
> We can't drop mr->addr now when it's on-wire, but I think we should drop
> the error report and addr check, instead of this patch.
As it turns out, my test case triggers this bug because it sets x-ignore-shared,
but x-ignore-shared is not needed for cpr-exec, because migrate_ram_is_ignored
is true for all blocks when mode==cpr-exec. So, the best fix for the GPAs bug
for me is to stop setting x-ignore-shared. I will drop this patch.
I agree that post_load is the right place to restore mr->addr, and I don't
understand why commit fbd162e629aaf8 added the error report, but I am going
to leave it as is.
Thanks for reviewing this.
- Steve
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-08-13 19:46 ` Peter Xu
@ 2024-08-15 20:55 ` Steven Sistare
2024-08-16 15:06 ` Peter Xu
0 siblings, 1 reply; 77+ messages in thread
From: Steven Sistare @ 2024-08-15 20:55 UTC (permalink / raw)
To: Peter Xu
Cc: Daniel P. Berrangé, qemu-devel, Fabiano Rosas,
David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster
On 8/13/2024 3:46 PM, Peter Xu wrote:
> On Tue, Aug 06, 2024 at 04:56:18PM -0400, Steven Sistare wrote:
>>> The flipside, however, is that localhost migration via 2 separate QEMU
>>> processes has issues where both QEMUs want to be opening the very same
>>> file, and only 1 of them can ever have them open.
>
> I thought we used to have similar issue with block devices, but I assume
> it's solved for years (and whoever owns it will take proper file lock,
> IIRC, and QEMU migration should properly serialize the time window on who's
> going to take the file lock).
>
> Maybe this is about something else?
I don't have an example where this fails.
I can cause "Failed to get "write" lock" errors if two qemu instances open
the same block device, but the error is suppressed if you add the -incoming
argument, due to this code:
blk_attach_dev()
    if (runstate_check(RUN_STATE_INMIGRATE))
        blk->disable_perm = true;
>> Indeed, and "files" includes unix domain sockets.
More on this -- the second qemu to bind a unix domain socket for listening
wins, and the first qemu loses it (because second qemu unlinks and recreates
the socket path before binding on the assumption that it is stale).
One must use a different name for the socket for second qemu, and clients
that wish to connect must be aware of the new port.
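The unlink-and-rebind behavior described above is easy to reproduce outside QEMU; the following standalone sketch (Python purely for illustration) shows a second listener silently taking over a socket path:

```python
import os
import socket
import tempfile

path = os.path.join(tempfile.mkdtemp(), "qmp.sock")

# First "QEMU": bind and listen on the socket path.
s1 = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s1.bind(path)
s1.listen(1)

# A naive second bind on the same path is refused (EADDRINUSE)...
s2 = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    s2.bind(path)
    second_bind_ok = True
except OSError:
    second_bind_ok = False

# ...but unlinking first -- what a listener does on the assumption the
# path is stale -- lets the second process take over the path, leaving
# the first listener bound to an orphaned inode.
os.unlink(path)
s2.bind(path)
s2.listen(1)

print(second_bind_ok)  # False: the plain re-bind was refused
```

Clients that connect to `path` afterwards reach s2, while s1 never sees another connection, which matches the "second qemu wins" behavior described above.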
>> Network ports also conflict.
>> cpr-exec avoids such problems, and is one of the advantages of the method that
>> I forgot to promote.
>
> I was thinking that's fine, as the host ports should be the backend of the
> VM ports only anyway so they don't need to be identical on both sides?
>
> IOW, my understanding is it's the guest IP/ports/... which should still be
> stable across migrations, where the host ports can be different as long as
> the host ports can forward guest port messages correctly?
Yes, one must use a different host port number for the second qemu, and clients
that wish to connect must be aware of the new port.
That is my point -- cpr-transfer requires fiddling with such things.
cpr-exec does not.
- Steve
* Re: [PATCH V2 06/11] migration: fix mismatched GPAs during cpr
2024-08-15 20:54 ` Steven Sistare
@ 2024-08-16 14:43 ` Peter Xu
2024-08-16 17:10 ` Steven Sistare
0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2024-08-16 14:43 UTC (permalink / raw)
To: Steven Sistare
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster
On Thu, Aug 15, 2024 at 04:54:58PM -0400, Steven Sistare wrote:
> On 8/13/2024 4:43 PM, Peter Xu wrote:
> > On Wed, Aug 07, 2024 at 05:04:26PM -0400, Steven Sistare wrote:
> > > On 7/19/2024 12:28 PM, Peter Xu wrote:
> > > > On Sun, Jun 30, 2024 at 12:40:29PM -0700, Steve Sistare wrote:
> > > > > For new cpr modes, ramblock_is_ignored will always be true, because the
> > > > > memory is preserved in place rather than copied. However, for an ignored
> > > > > block, parse_ramblock currently requires that the received address of the
> > > > > block must match the address of the statically initialized region on the
> > > > > target. This fails for a PCI rom block, because the memory region address
> > > > > is set when the guest writes to a BAR on the source, which does not occur
> > > > > on the target, causing a "Mismatched GPAs" error during cpr migration.
> > > >
> > > > Is this a common fix with/without cpr mode?
> > > >
> > > > It looks to me mr->addr (for these ROMs) should only be set in PCI config
> > > > region updates as you mentioned. But then I didn't figure out when they're
> > > > updated on dest in live migration: the ramblock info was sent at the
> > > > beginning of migration, so it doesn't even have PCI config space migrated;
> > > > I thought the real mr->addr should be in there.
> > > >
> > > > I also failed to understand yet on why the mr->addr check needs to be done
> > > > by ignore-shared only. Some explanation would be greatly helpful around
> > > > this area..
> > >
> > > The error_report does not bite for normal migration because migrate_ram_is_ignored()
> > > is false for the problematic blocks, so the block->mr->addr check is not
> > > performed. However, mr->addr is never fixed up in this case, which is a
> > > quiet potential bug, and this patch fixes that with the "has_addr" check.
> > >
> > > For cpr-exec, migrate_ram_is_ignored() is true for all blocks,
> > > because we do not copy the contents over the migration stream, we preserve the
> > > memory in place. So we fall into the block->mr->addr sanity check and fail
> > > with the original code.
> >
> > OK I get your point now. However this doesn't look right, instead I start
> > to question why we need to send mr->addr at all..
> >
> > As I said previously, AFAIU mr->addr should only be updated when there's
> > some PCI config space updates so that it moves the MR around in the address
> > space based on how guest drivers / BIOS (?) set things up. Now after these
> > days not looking, and just started to look at this again, I think the only
> > sane place to do this update is during a post_load().
> >
> > And if we start to check some of the memory_region_set_address() users,
> > that's exactly what happened..
> >
> > - ich9_pm_iospace_update(), update addr for ICH9LPCPMRegs.io, where
> > ich9_pm_post_load() also invokes it.
> >
> > - pm_io_space_update(), updates PIIX4PMState.io, where
> > vmstate_acpi_post_load() also invokes it.
> >
> > I stopped here just looking at the initial two users, it looks all sane to
> > me that it only got updated there, because the update requires pci config
> > space being migrated first.
> >
> > IOW, I don't think having mismatched mr->addr is wrong at this stage.
> > Instead, I don't see why we should send mr->addr at all in this case during
> > as early as SETUP, and I don't see anything justifies the mr->addr needs to
> > be verified in parse_ramblock() since ignore-shared introduced by Yury in
> > commit fbd162e629aaf8 in 2019.
> >
> > We can't drop mr->addr now when it's on-wire, but I think we should drop
> > the error report and addr check, instead of this patch.
>
> As it turns out, my test case triggers this bug because it sets x-ignore-shared,
> but x-ignore-shared is not needed for cpr-exec, because migrate_ram_is_ignored
> is true for all blocks when mode==cpr-exec. So, the best fix for the GPAs bug
> for me is to stop setting x-ignore-shared. I will drop this patch.
>
> I agree that post_load is the right place to restore mr->addr, and I don't
> understand why commit fbd162e629aaf8 added the error report, but I am going
> to leave it as is.
Ah, I didn't notice that cpr special cased migrate_ram_is_ignored()..
Shall we stick with the old check, but always require cpr to rely on
ignore-shared?
Then we replace this patch with removing the error_report, probably
together with not caring about whatever is received at all.. would that be
cleaner?
Thanks,
--
Peter Xu
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-08-15 20:55 ` Steven Sistare
@ 2024-08-16 15:06 ` Peter Xu
2024-08-16 15:16 ` Daniel P. Berrangé
0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2024-08-16 15:06 UTC (permalink / raw)
To: Steven Sistare
Cc: Daniel P. Berrangé, qemu-devel, Fabiano Rosas,
David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster
On Thu, Aug 15, 2024 at 04:55:20PM -0400, Steven Sistare wrote:
> On 8/13/2024 3:46 PM, Peter Xu wrote:
> > On Tue, Aug 06, 2024 at 04:56:18PM -0400, Steven Sistare wrote:
> > > > The flipside, however, is that localhost migration via 2 separate QEMU
> > > > processes has issues where both QEMUs want to be opening the very same
> > > > file, and only 1 of them can ever have them open.
> >
> > I thought we used to have similar issue with block devices, but I assume
> > it's solved for years (and whoever owns it will take proper file lock,
> > IIRC, and QEMU migration should properly serialize the time window on who's
> > going to take the file lock).
> >
> > Maybe this is about something else?
>
> I don't have an example where this fails.
>
> I can cause "Failed to get "write" lock" errors if two qemu instances open
> the same block device, but the error is suppressed if you add the -incoming
> argument, due to this code:
>
> blk_attach_dev()
>     if (runstate_check(RUN_STATE_INMIGRATE))
>         blk->disable_perm = true;
Yep, this one is pretty much expected.
>
> > > Indeed, and "files" includes unix domain sockets.
>
> More on this -- the second qemu to bind a unix domain socket for listening
> wins, and the first qemu loses it (because second qemu unlinks and recreates
> the socket path before binding on the assumption that it is stale).
>
> One must use a different name for the socket for second qemu, and clients
> that wish to connect must be aware of the new port.
>
> > > Network ports also conflict.
> > > cpr-exec avoids such problems, and is one of the advantages of the method that
> > > I forgot to promote.
> >
> > I was thinking that's fine, as the host ports should be the backend of the
> > VM ports only anyway so they don't need to be identical on both sides?
> >
> > IOW, my understanding is it's the guest IP/ports/... which should still be
> > stable across migrations, where the host ports can be different as long as
> > the host ports can forward guest port messages correctly?
>
> Yes, one must use a different host port number for the second qemu, and clients
> that wish to connect must be aware of the new port.
>
> That is my point -- cpr-transfer requires fiddling with such things.
> cpr-exec does not.
Right, and my understanding is that all these facilities are already there, so
no new code should be needed for reconnect issues to support cpr-transfer
in Libvirt or similar management layers that support migrations.
I suppose that's also why I'm slightly confused about how cpr-exec can
benefit mgmt layers, at least for these open projects. It might matter
for Oracle's mgmt layers, but again I'm curious why Oracle's stack doesn't
handle these issues already, because if it supports normal live migration, I'd
expect it to already handle changed host ports etc..
--
Peter Xu
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-08-16 15:06 ` Peter Xu
@ 2024-08-16 15:16 ` Daniel P. Berrangé
2024-08-16 15:19 ` Steven Sistare
2024-08-16 15:34 ` Peter Xu
0 siblings, 2 replies; 77+ messages in thread
From: Daniel P. Berrangé @ 2024-08-16 15:16 UTC (permalink / raw)
To: Peter Xu
Cc: Steven Sistare, qemu-devel, Fabiano Rosas, David Hildenbrand,
Marcel Apfelbaum, Eduardo Habkost, Philippe Mathieu-Daude,
Paolo Bonzini, Markus Armbruster
On Fri, Aug 16, 2024 at 11:06:10AM -0400, Peter Xu wrote:
> On Thu, Aug 15, 2024 at 04:55:20PM -0400, Steven Sistare wrote:
> > On 8/13/2024 3:46 PM, Peter Xu wrote:
> > > On Tue, Aug 06, 2024 at 04:56:18PM -0400, Steven Sistare wrote:
> > > > > The flipside, however, is that localhost migration via 2 separate QEMU
> > > > > processes has issues where both QEMUs want to be opening the very same
> > > > > file, and only 1 of them can ever have them open.
> > >
> > > I thought we used to have similar issue with block devices, but I assume
> > > it's solved for years (and whoever owns it will take proper file lock,
> > > IIRC, and QEMU migration should properly serialize the time window on who's
> > > going to take the file lock).
> > >
> > > Maybe this is about something else?
> >
> > I don't have an example where this fails.
> >
> > I can cause "Failed to get "write" lock" errors if two qemu instances open
> > the same block device, but the error is suppressed if you add the -incoming
> > argument, due to this code:
> >
> > blk_attach_dev()
> > if (runstate_check(RUN_STATE_INMIGRATE))
> > blk->disable_perm = true;
>
> Yep, this one is pretty much expected.
>
> >
> > > > Indeed, and "files" includes unix domain sockets.
> >
> > More on this -- the second qemu to bind a unix domain socket for listening
> > wins, and the first qemu loses it (because second qemu unlinks and recreates
> > the socket path before binding on the assumption that it is stale).
> >
> > One must use a different name for the socket for second qemu, and clients
> > that wish to connect must be aware of the new port.
> >
> > > > Network ports also conflict.
> > > > cpr-exec avoids such problems, and is one of the advantages of the method that
> > > > I forgot to promote.
> > >
> > > I was thinking that's fine, as the host ports should be the backend of the
> > > VM ports only anyway so they don't need to be identical on both sides?
> > >
> > > IOW, my understanding is it's the guest IP/ports/... which should still be
> > > stable across migrations, where the host ports can be different as long as
> > > the host ports can forward guest port messages correctly?
> >
> > Yes, one must use a different host port number for the second qemu, and clients
> > that wish to connect must be aware of the new port.
> >
> > That is my point -- cpr-transfer requires fiddling with such things.
> > cpr-exec does not.
>
> Right, and my understanding is all these facilities are already there, so
> no new code should be needed on reconnect issues if to support cpr-transfer
> in Libvirt or similar management layers that supports migrations.
Note Libvirt explicitly blocks localhost migration today because
solving all these clashing resource problems is a huge can of worms
and it can't be made invisible to the user of libvirt in any practical
way.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-08-16 15:16 ` Daniel P. Berrangé
@ 2024-08-16 15:19 ` Steven Sistare
2024-08-16 15:34 ` Peter Xu
1 sibling, 0 replies; 77+ messages in thread
From: Steven Sistare @ 2024-08-16 15:19 UTC (permalink / raw)
To: Daniel P. Berrangé, Peter Xu
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Markus Armbruster
On 8/16/2024 11:16 AM, Daniel P. Berrangé wrote:
> On Fri, Aug 16, 2024 at 11:06:10AM -0400, Peter Xu wrote:
>> On Thu, Aug 15, 2024 at 04:55:20PM -0400, Steven Sistare wrote:
>>> On 8/13/2024 3:46 PM, Peter Xu wrote:
>>>> On Tue, Aug 06, 2024 at 04:56:18PM -0400, Steven Sistare wrote:
>>>>>> The flipside, however, is that localhost migration via 2 separate QEMU
>>>>>> processes has issues where both QEMUs want to be opening the very same
>>>>>> file, and only 1 of them can ever have them open.
>>>>
>>>> I thought we used to have similar issue with block devices, but I assume
>>>> it's solved for years (and whoever owns it will take proper file lock,
>>>> IIRC, and QEMU migration should properly serialize the time window on who's
>>>> going to take the file lock).
>>>>
>>>> Maybe this is about something else?
>>>
>>> I don't have an example where this fails.
>>>
>>> I can cause "Failed to get "write" lock" errors if two qemu instances open
>>> the same block device, but the error is suppressed if you add the -incoming
>>> argument, due to this code:
>>>
>>> blk_attach_dev()
>>> if (runstate_check(RUN_STATE_INMIGRATE))
>>> blk->disable_perm = true;
>>
>> Yep, this one is pretty much expected.
>>
>>>
>>>>> Indeed, and "files" includes unix domain sockets.
>>>
>>> More on this -- the second qemu to bind a unix domain socket for listening
>>> wins, and the first qemu loses it (because second qemu unlinks and recreates
>>> the socket path before binding on the assumption that it is stale).
>>>
>>> One must use a different name for the socket for second qemu, and clients
>>> that wish to connect must be aware of the new port.
>>>
>>>>> Network ports also conflict.
>>>>> cpr-exec avoids such problems, and is one of the advantages of the method that
>>>>> I forgot to promote.
>>>>
>>>> I was thinking that's fine, as the host ports should be the backend of the
>>>> VM ports only anyway so they don't need to be identical on both sides?
>>>>
>>>> IOW, my understanding is it's the guest IP/ports/... which should still be
>>>> stable across migrations, where the host ports can be different as long as
>>>> the host ports can forward guest port messages correctly?
>>>
>>> Yes, one must use a different host port number for the second qemu, and clients
>>> that wish to connect must be aware of the new port.
>>>
>>> That is my point -- cpr-transfer requires fiddling with such things.
>>> cpr-exec does not.
>>
>> Right, and my understanding is all these facilities are already there, so
>> no new code should be needed on reconnect issues if to support cpr-transfer
>> in Libvirt or similar management layers that supports migrations.
>
> Note Libvirt explicitly blocks localhost migration today because
> solving all these clashing resource problems is a huge can of worms
> and it can't be made invisible to the user of libvirt in any practical
> way.
Thank you! This is what I suspected but could not prove due to my lack of
experience with libvirt.
- Steve
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-08-16 15:16 ` Daniel P. Berrangé
2024-08-16 15:19 ` Steven Sistare
@ 2024-08-16 15:34 ` Peter Xu
2024-08-16 16:00 ` Daniel P. Berrangé
1 sibling, 1 reply; 77+ messages in thread
From: Peter Xu @ 2024-08-16 15:34 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: Steven Sistare, qemu-devel, Fabiano Rosas, David Hildenbrand,
Marcel Apfelbaum, Eduardo Habkost, Philippe Mathieu-Daude,
Paolo Bonzini, Markus Armbruster
On Fri, Aug 16, 2024 at 04:16:50PM +0100, Daniel P. Berrangé wrote:
> On Fri, Aug 16, 2024 at 11:06:10AM -0400, Peter Xu wrote:
> > On Thu, Aug 15, 2024 at 04:55:20PM -0400, Steven Sistare wrote:
> > > On 8/13/2024 3:46 PM, Peter Xu wrote:
> > > > On Tue, Aug 06, 2024 at 04:56:18PM -0400, Steven Sistare wrote:
> > > > > > The flipside, however, is that localhost migration via 2 separate QEMU
> > > > > > processes has issues where both QEMUs want to be opening the very same
> > > > > > file, and only 1 of them can ever have them open.
> > > >
> > > > I thought we used to have similar issue with block devices, but I assume
> > > > it's solved for years (and whoever owns it will take proper file lock,
> > > > IIRC, and QEMU migration should properly serialize the time window on who's
> > > > going to take the file lock).
> > > >
> > > > Maybe this is about something else?
> > >
> > > I don't have an example where this fails.
> > >
> > > I can cause "Failed to get "write" lock" errors if two qemu instances open
> > > the same block device, but the error is suppressed if you add the -incoming
> > > argument, due to this code:
> > >
> > > blk_attach_dev()
> > > if (runstate_check(RUN_STATE_INMIGRATE))
> > > blk->disable_perm = true;
> >
> > Yep, this one is pretty much expected.
> >
> > >
> > > > > Indeed, and "files" includes unix domain sockets.
> > >
> > > More on this -- the second qemu to bind a unix domain socket for listening
> > > wins, and the first qemu loses it (because second qemu unlinks and recreates
> > > the socket path before binding on the assumption that it is stale).
> > >
> > > One must use a different name for the socket for second qemu, and clients
> > > that wish to connect must be aware of the new port.
> > >
> > > > > Network ports also conflict.
> > > > > cpr-exec avoids such problems, and is one of the advantages of the method that
> > > > > I forgot to promote.
> > > >
> > > > I was thinking that's fine, as the host ports should be the backend of the
> > > > VM ports only anyway so they don't need to be identical on both sides?
> > > >
> > > > IOW, my understanding is it's the guest IP/ports/... which should still be
> > > > stable across migrations, where the host ports can be different as long as
> > > > the host ports can forward guest port messages correctly?
> > >
> > > Yes, one must use a different host port number for the second qemu, and clients
> > > that wish to connect must be aware of the new port.
> > >
> > > That is my point -- cpr-transfer requires fiddling with such things.
> > > cpr-exec does not.
> >
> > Right, and my understanding is all these facilities are already there, so
> > no new code should be needed on reconnect issues if to support cpr-transfer
> > in Libvirt or similar management layers that supports migrations.
>
> Note Libvirt explicitly blocks localhost migration today because
> solving all these clashing resource problems is a huge can of worms
> and it can't be made invisible to the user of libvirt in any practical
> way.
Ahhh, OK. I'm pretty surprised by this, as I thought at least kubevirt
supported local migration somehow on top of libvirt.
Does it mean that cpr-transfer is a no-go, at least for Libvirt to consume
(as cpr-* is only for local host migrations so far)? Even with all the
remaining issues we're discussing for cpr-exec, is that the only way to
go for Libvirt, then?
--
Peter Xu
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-08-16 15:34 ` Peter Xu
@ 2024-08-16 16:00 ` Daniel P. Berrangé
2024-08-16 16:17 ` Peter Xu
0 siblings, 1 reply; 77+ messages in thread
From: Daniel P. Berrangé @ 2024-08-16 16:00 UTC (permalink / raw)
To: Peter Xu
Cc: Steven Sistare, qemu-devel, Fabiano Rosas, David Hildenbrand,
Marcel Apfelbaum, Eduardo Habkost, Philippe Mathieu-Daude,
Paolo Bonzini, Markus Armbruster
On Fri, Aug 16, 2024 at 11:34:10AM -0400, Peter Xu wrote:
> On Fri, Aug 16, 2024 at 04:16:50PM +0100, Daniel P. Berrangé wrote:
> > On Fri, Aug 16, 2024 at 11:06:10AM -0400, Peter Xu wrote:
> > > On Thu, Aug 15, 2024 at 04:55:20PM -0400, Steven Sistare wrote:
> > > > On 8/13/2024 3:46 PM, Peter Xu wrote:
> > > > > On Tue, Aug 06, 2024 at 04:56:18PM -0400, Steven Sistare wrote:
> > > > > > > The flipside, however, is that localhost migration via 2 separate QEMU
> > > > > > > processes has issues where both QEMUs want to be opening the very same
> > > > > > > file, and only 1 of them can ever have them open.
> > > > >
> > > > > I thought we used to have similar issue with block devices, but I assume
> > > > > it's solved for years (and whoever owns it will take proper file lock,
> > > > > IIRC, and QEMU migration should properly serialize the time window on who's
> > > > > going to take the file lock).
> > > > >
> > > > > Maybe this is about something else?
> > > >
> > > > I don't have an example where this fails.
> > > >
> > > > I can cause "Failed to get "write" lock" errors if two qemu instances open
> > > > the same block device, but the error is suppressed if you add the -incoming
> > > > argument, due to this code:
> > > >
> > > > blk_attach_dev()
> > > > if (runstate_check(RUN_STATE_INMIGRATE))
> > > > blk->disable_perm = true;
> > >
> > > Yep, this one is pretty much expected.
> > >
> > > >
> > > > > > Indeed, and "files" includes unix domain sockets.
> > > >
> > > > More on this -- the second qemu to bind a unix domain socket for listening
> > > > wins, and the first qemu loses it (because second qemu unlinks and recreates
> > > > the socket path before binding on the assumption that it is stale).
> > > >
> > > > One must use a different name for the socket for second qemu, and clients
> > > > that wish to connect must be aware of the new port.
> > > >
> > > > > > Network ports also conflict.
> > > > > > cpr-exec avoids such problems, and is one of the advantages of the method that
> > > > > > I forgot to promote.
> > > > >
> > > > > I was thinking that's fine, as the host ports should be the backend of the
> > > > > VM ports only anyway so they don't need to be identical on both sides?
> > > > >
> > > > > IOW, my understanding is it's the guest IP/ports/... which should still be
> > > > > stable across migrations, where the host ports can be different as long as
> > > > > the host ports can forward guest port messages correctly?
> > > >
> > > > Yes, one must use a different host port number for the second qemu, and clients
> > > > that wish to connect must be aware of the new port.
> > > >
> > > > That is my point -- cpr-transfer requires fiddling with such things.
> > > > cpr-exec does not.
> > >
> > > Right, and my understanding is all these facilities are already there, so
> > > no new code should be needed on reconnect issues if to support cpr-transfer
> > > in Libvirt or similar management layers that supports migrations.
> >
> > Note Libvirt explicitly blocks localhost migration today because
> > solving all these clashing resource problems is a huge can of worms
> > and it can't be made invisible to the user of libvirt in any practical
> > way.
>
> Ahhh, OK. I'm pretty surprised by this, as I thought at least kubevirt
> supported local migration somehow on top of libvirt.
Since kubevirt runs inside a container, "localhost" migration
is effectively migrating between 2 completely separate OS installs
(containers), that happen to be on the same physical host. IOW, it
is a cross-host migration from Libvirt & QEMU's POV, and there are
no clashing resources to worry about.
> Does it mean that cpr-transfer is a no-go in this case at least for Libvirt
> to consume it (as cpr-* is only for local host migrations so far)? Even if
> all the rest issues we're discussing with cpr-exec, is that the only way to
> go for Libvirt, then?
cpr-exec is certainly appealing from the POV of avoiding the clashing
resources problem in libvirt.
It has its own issues though, because libvirt runs all QEMU processes with
seccomp filters that block 'execve', as we consider QEMU to be untrustworthy
and thus don't want to allow it to exec anything!
I don't know which is the lesser evil from libvirt's POV.
Personally I see security controls as an overriding requirement for
everything.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-08-16 16:00 ` Daniel P. Berrangé
@ 2024-08-16 16:17 ` Peter Xu
2024-08-16 16:28 ` Daniel P. Berrangé
2024-08-16 17:09 ` Steven Sistare
0 siblings, 2 replies; 77+ messages in thread
From: Peter Xu @ 2024-08-16 16:17 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: Steven Sistare, qemu-devel, Fabiano Rosas, David Hildenbrand,
Marcel Apfelbaum, Eduardo Habkost, Philippe Mathieu-Daude,
Paolo Bonzini, Markus Armbruster
On Fri, Aug 16, 2024 at 05:00:32PM +0100, Daniel P. Berrangé wrote:
> On Fri, Aug 16, 2024 at 11:34:10AM -0400, Peter Xu wrote:
> > On Fri, Aug 16, 2024 at 04:16:50PM +0100, Daniel P. Berrangé wrote:
> > > On Fri, Aug 16, 2024 at 11:06:10AM -0400, Peter Xu wrote:
> > > > On Thu, Aug 15, 2024 at 04:55:20PM -0400, Steven Sistare wrote:
> > > > > On 8/13/2024 3:46 PM, Peter Xu wrote:
> > > > > > On Tue, Aug 06, 2024 at 04:56:18PM -0400, Steven Sistare wrote:
> > > > > > > > The flipside, however, is that localhost migration via 2 separate QEMU
> > > > > > > > processes has issues where both QEMUs want to be opening the very same
> > > > > > > > file, and only 1 of them can ever have them open.
> > > > > >
> > > > > > I thought we used to have similar issue with block devices, but I assume
> > > > > > it's solved for years (and whoever owns it will take proper file lock,
> > > > > > IIRC, and QEMU migration should properly serialize the time window on who's
> > > > > > going to take the file lock).
> > > > > >
> > > > > > Maybe this is about something else?
> > > > >
> > > > > I don't have an example where this fails.
> > > > >
> > > > > I can cause "Failed to get "write" lock" errors if two qemu instances open
> > > > > the same block device, but the error is suppressed if you add the -incoming
> > > > > argument, due to this code:
> > > > >
> > > > > blk_attach_dev()
> > > > > if (runstate_check(RUN_STATE_INMIGRATE))
> > > > > blk->disable_perm = true;
> > > >
> > > > Yep, this one is pretty much expected.
> > > >
> > > > >
> > > > > > > Indeed, and "files" includes unix domain sockets.
> > > > >
> > > > > More on this -- the second qemu to bind a unix domain socket for listening
> > > > > wins, and the first qemu loses it (because second qemu unlinks and recreates
> > > > > the socket path before binding on the assumption that it is stale).
> > > > >
> > > > > One must use a different name for the socket for second qemu, and clients
> > > > > that wish to connect must be aware of the new port.
> > > > >
> > > > > > > Network ports also conflict.
> > > > > > > cpr-exec avoids such problems, and is one of the advantages of the method that
> > > > > > > I forgot to promote.
> > > > > >
> > > > > > I was thinking that's fine, as the host ports should be the backend of the
> > > > > > VM ports only anyway so they don't need to be identical on both sides?
> > > > > >
> > > > > > IOW, my understanding is it's the guest IP/ports/... which should still be
> > > > > > stable across migrations, where the host ports can be different as long as
> > > > > > the host ports can forward guest port messages correctly?
> > > > >
> > > > > Yes, one must use a different host port number for the second qemu, and clients
> > > > > that wish to connect must be aware of the new port.
> > > > >
> > > > > That is my point -- cpr-transfer requires fiddling with such things.
> > > > > cpr-exec does not.
> > > >
> > > > Right, and my understanding is all these facilities are already there, so
> > > > no new code should be needed on reconnect issues if to support cpr-transfer
> > > > in Libvirt or similar management layers that supports migrations.
> > >
> > > Note Libvirt explicitly blocks localhost migration today because
> > > solving all these clashing resource problems is a huge can of worms
> > > and it can't be made invisible to the user of libvirt in any practical
> > > way.
> >
> > Ahhh, OK. I'm pretty surprised by this, as I thought at least kubevirt
> > supported local migration somehow on top of libvirt.
>
> Since kubevirt runs inside a container, "localhost" migration
> is effectively migrating between 2 completely separate OS installs
> (containers), that happen to be on the same physical host. IOW, it
> is a cross-host migration from Libvirt & QEMU's POV, and there are
> no clashing resources to worry about.
OK, makes sense.
Then do you think it's possible to support cpr-transfer in that scenario
from Libvirt POV?
>
> > Does it mean that cpr-transfer is a no-go in this case at least for Libvirt
> > to consume it (as cpr-* is only for local host migrations so far)? Even if
> > all the rest issues we're discussing with cpr-exec, is that the only way to
> > go for Libvirt, then?
>
> cpr-exec is certainly appealing from the POV of avoiding the clashing
> resources problem in libvirt.
>
> It has own issues though, because libvirt runs all QEMU processes with
> seccomp filters that block 'execve', as we consider QEMU to be untrustworthy
> and thus don't want to allow it to exec anything !
>
> I don't know which is the lesser evil from libvirt's POV.
>
> Personally I see security controls as an overriding requirement for
> everything.
One thing I am aware of is that cpr-exec is not the only feature that might
start to use exec() in QEMU. TDX fundamentally will need to create another key
VM to deliver the keys, and the plan seems to be to use exec() too. However
in that case, per my understanding, the exec() is optional - the key VM can
also be created by Libvirt.
IOW, it looks like we can still stick with execve() being blocked so far,
except for cpr-exec.
Hmm, this makes the decision harder to make. We need to figure out how this
feature can be consumed by at least the open source virt stack.. So far it
looks like the only option (if we give seccomp high priority) is to use
cpr-transfer, but only in a container.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-08-16 16:17 ` Peter Xu
@ 2024-08-16 16:28 ` Daniel P. Berrangé
2024-08-16 17:09 ` Steven Sistare
1 sibling, 0 replies; 77+ messages in thread
From: Daniel P. Berrangé @ 2024-08-16 16:28 UTC (permalink / raw)
To: Peter Xu
Cc: Steven Sistare, qemu-devel, Fabiano Rosas, David Hildenbrand,
Marcel Apfelbaum, Eduardo Habkost, Philippe Mathieu-Daude,
Paolo Bonzini, Markus Armbruster
On Fri, Aug 16, 2024 at 12:17:30PM -0400, Peter Xu wrote:
> On Fri, Aug 16, 2024 at 05:00:32PM +0100, Daniel P. Berrangé wrote:
> > On Fri, Aug 16, 2024 at 11:34:10AM -0400, Peter Xu wrote:
> > > On Fri, Aug 16, 2024 at 04:16:50PM +0100, Daniel P. Berrangé wrote:
> > > > On Fri, Aug 16, 2024 at 11:06:10AM -0400, Peter Xu wrote:
> > > > > On Thu, Aug 15, 2024 at 04:55:20PM -0400, Steven Sistare wrote:
> > > > > > On 8/13/2024 3:46 PM, Peter Xu wrote:
> > > > > > > On Tue, Aug 06, 2024 at 04:56:18PM -0400, Steven Sistare wrote:
> > > > > > > > > The flipside, however, is that localhost migration via 2 separate QEMU
> > > > > > > > > processes has issues where both QEMUs want to be opening the very same
> > > > > > > > > file, and only 1 of them can ever have them open.
> > > > > > >
> > > > > > > I thought we used to have similar issue with block devices, but I assume
> > > > > > > it's solved for years (and whoever owns it will take proper file lock,
> > > > > > > IIRC, and QEMU migration should properly serialize the time window on who's
> > > > > > > going to take the file lock).
> > > > > > >
> > > > > > > Maybe this is about something else?
> > > > > >
> > > > > > I don't have an example where this fails.
> > > > > >
> > > > > > I can cause "Failed to get "write" lock" errors if two qemu instances open
> > > > > > the same block device, but the error is suppressed if you add the -incoming
> > > > > > argument, due to this code:
> > > > > >
> > > > > > blk_attach_dev()
> > > > > > if (runstate_check(RUN_STATE_INMIGRATE))
> > > > > > blk->disable_perm = true;
> > > > >
> > > > > Yep, this one is pretty much expected.
> > > > >
> > > > > >
> > > > > > > > Indeed, and "files" includes unix domain sockets.
> > > > > >
> > > > > > More on this -- the second qemu to bind a unix domain socket for listening
> > > > > > wins, and the first qemu loses it (because second qemu unlinks and recreates
> > > > > > the socket path before binding on the assumption that it is stale).
> > > > > >
> > > > > > One must use a different name for the socket for second qemu, and clients
> > > > > > that wish to connect must be aware of the new port.
> > > > > >
> > > > > > > > Network ports also conflict.
> > > > > > > > cpr-exec avoids such problems, and is one of the advantages of the method that
> > > > > > > > I forgot to promote.
> > > > > > >
> > > > > > > I was thinking that's fine, as the host ports should be the backend of the
> > > > > > > VM ports only anyway so they don't need to be identical on both sides?
> > > > > > >
> > > > > > > IOW, my understanding is it's the guest IP/ports/... which should still be
> > > > > > > stable across migrations, where the host ports can be different as long as
> > > > > > > the host ports can forward guest port messages correctly?
> > > > > >
> > > > > > Yes, one must use a different host port number for the second qemu, and clients
> > > > > > that wish to connect must be aware of the new port.
> > > > > >
> > > > > > That is my point -- cpr-transfer requires fiddling with such things.
> > > > > > cpr-exec does not.
> > > > >
> > > > > Right, and my understanding is all these facilities are already there, so
> > > > > no new code should be needed on reconnect issues if to support cpr-transfer
> > > > > in Libvirt or similar management layers that supports migrations.
> > > >
> > > > Note Libvirt explicitly blocks localhost migration today because
> > > > solving all these clashing resource problems is a huge can of worms
> > > > and it can't be made invisible to the user of libvirt in any practical
> > > > way.
> > >
> > > Ahhh, OK. I'm pretty surprised by this, as I thought at least kubevirt
> > > supported local migration somehow on top of libvirt.
> >
> > Since kubevirt runs inside a container, "localhost" migration
> > is effectively migrating between 2 completely separate OS installs
> > (containers), that happen to be on the same physical host. IOW, it
> > is a cross-host migration from Libvirt & QEMU's POV, and there are
> > no clashing resources to worry about.
>
> OK, makes sense.
>
> Then do you think it's possible to support cpr-transfer in that scenario
> from Libvirt POV?
>
> >
> > > Does it mean that cpr-transfer is a no-go in this case at least for Libvirt
> > > to consume it (as cpr-* is only for local host migrations so far)? Even if
> > > all the rest issues we're discussing with cpr-exec, is that the only way to
> > > go for Libvirt, then?
> >
> > cpr-exec is certainly appealing from the POV of avoiding the clashing
> > resources problem in libvirt.
> >
> > It has own issues though, because libvirt runs all QEMU processes with
> > seccomp filters that block 'execve', as we consider QEMU to be untrustworthy
> > and thus don't want to allow it to exec anything !
> >
> > I don't know which is the lesser evil from libvirt's POV.
> >
> > Personally I see security controls as an overriding requirement for
> > everything.
>
> One thing I am aware of is cpr-exec is not the only one who might start to
> use exec() in QEMU. TDX fundamentally will need to create another key VM to
> deliver the keys and the plan seems to be using exec() too. However in
> that case per my understanding the exec() is optional - the key VM can also
> be created by Libvirt.
Since nothing is merged, I'd consider whatever might have been presented
wrt TDX migration to all be open for potential re-design. With SNP there
is the SVSM paravisor which runs inside the guest to provide services,
and IIUC it's intended to address migration service needs.
There's a push to support SVSM with TDX too, in order to enable vTPM
support. With that it might make sense to explore whether SVSM can
service migration for TDX too, instead of having a separate parallel
VM on the host. IMHO it's highly desirable to have a common architecture
for CVM migration from QEMU's POV, and from a libvirt POV I'd like to
avoid having extra host VMs too.
> IOW, it looks like we can still stick with execve() being blocked yet so
> far except cpr-exec().
>
> Hmm, this makes the decision harder to make. We need to figure out a way
> on knowing how to consume this feature for at least open source virt
> stack.. So far it looks like it's only possible (if we take seccomp high
> priority) we use cpr-transfer but only in a container.
Or we have cpr-transfer, but libvirt has to play more games to solve the
clashing resources problem, by making much more use of FD passing,
and/or by changing path conventions, or a mix of both.
What might make this viable is that IIUC, CPR only permits a subset
of backends to be used, so libvirt doesn't have to solve clashing
resources for /everything/, just parts that are supported by CPR.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-08-16 16:17 ` Peter Xu
2024-08-16 16:28 ` Daniel P. Berrangé
@ 2024-08-16 17:09 ` Steven Sistare
2024-08-21 18:34 ` Peter Xu
2024-09-05 9:30 ` Daniel P. Berrangé
1 sibling, 2 replies; 77+ messages in thread
From: Steven Sistare @ 2024-08-16 17:09 UTC (permalink / raw)
To: Peter Xu, Daniel P. Berrangé
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Markus Armbruster
On 8/16/2024 12:17 PM, Peter Xu wrote:
> On Fri, Aug 16, 2024 at 05:00:32PM +0100, Daniel P. Berrangé wrote:
>> On Fri, Aug 16, 2024 at 11:34:10AM -0400, Peter Xu wrote:
>>> On Fri, Aug 16, 2024 at 04:16:50PM +0100, Daniel P. Berrangé wrote:
>>>> On Fri, Aug 16, 2024 at 11:06:10AM -0400, Peter Xu wrote:
>>>>> On Thu, Aug 15, 2024 at 04:55:20PM -0400, Steven Sistare wrote:
>>>>>> On 8/13/2024 3:46 PM, Peter Xu wrote:
>>>>>>> On Tue, Aug 06, 2024 at 04:56:18PM -0400, Steven Sistare wrote:
>>>>>>>>> The flipside, however, is that localhost migration via 2 separate QEMU
>>>>>>>>> processes has issues where both QEMUs want to be opening the very same
>>>>>>>>> file, and only 1 of them can ever have them open.
>>>>>>>
>>>>>>> I thought we used to have similar issue with block devices, but I assume
>>>>>>> it's solved for years (and whoever owns it will take proper file lock,
>>>>>>> IIRC, and QEMU migration should properly serialize the time window on who's
>>>>>>> going to take the file lock).
>>>>>>>
>>>>>>> Maybe this is about something else?
>>>>>>
>>>>>> I don't have an example where this fails.
>>>>>>
>>>>>> I can cause "Failed to get "write" lock" errors if two qemu instances open
>>>>>> the same block device, but the error is suppressed if you add the -incoming
>>>>>> argument, due to this code:
>>>>>>
>>>>>> blk_attach_dev()
>>>>>> if (runstate_check(RUN_STATE_INMIGRATE))
>>>>>> blk->disable_perm = true;
>>>>>
>>>>> Yep, this one is pretty much expected.
>>>>>
>>>>>>
>>>>>>>> Indeed, and "files" includes unix domain sockets.
>>>>>>
>>>>>> More on this -- the second qemu to bind a unix domain socket for listening
>>>>>> wins, and the first qemu loses it (because second qemu unlinks and recreates
>>>>>> the socket path before binding on the assumption that it is stale).
>>>>>>
>>>>>> One must use a different name for the socket for second qemu, and clients
>>>>>> that wish to connect must be aware of the new port.
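The unix-socket behaviour Steve describes is easy to reproduce; this sketch (my own,
not from the series) mimics a second listener assuming the existing path is stale:

```python
import os
import socket
import tempfile

path = os.path.join(tempfile.mkdtemp(), "monitor.sock")

# First "QEMU" binds and listens on the socket path.
s1 = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s1.bind(path)
s1.listen(1)

# Second "QEMU" assumes the existing path is stale: it unlinks the name
# and rebinds.  s1 keeps its open fd, but the filesystem name now refers
# to s2's socket, so s1 silently stops receiving new connections.
os.unlink(path)
s2 = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s2.bind(path)
s2.listen(1)

# A client connecting by path now reaches only the second process.
client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(path)
conn, _ = s2.accept()   # the second binder wins
```

Note there is no error on either side: the first listener simply goes dark,
which is why a different socket name (and client awareness of it) is required.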
>>>>>>
>>>>>>>> Network ports also conflict.
>>>>>>>> cpr-exec avoids such problems, and is one of the advantages of the method that
>>>>>>>> I forgot to promote.
>>>>>>>
>>>>>>> I was thinking that's fine, as the host ports should be the backend of the
>>>>>>> VM ports only anyway so they don't need to be identical on both sides?
>>>>>>>
>>>>>>> IOW, my understanding is it's the guest IP/ports/... which should still be
>>>>>>> stable across migrations, where the host ports can be different as long as
>>>>>>> the host ports can forward guest port messages correctly?
>>>>>>
>>>>>> Yes, one must use a different host port number for the second qemu, and clients
>>>>>> that wish to connect must be aware of the new port.
>>>>>>
>>>>>> That is my point -- cpr-transfer requires fiddling with such things.
>>>>>> cpr-exec does not.
>>>>>
>>>>> Right, and my understanding is all these facilities are already there, so
>>>>> no new code should be needed on reconnect issues if to support cpr-transfer
>>>>> in Libvirt or similar management layers that supports migrations.
>>>>
>>>> Note Libvirt explicitly blocks localhost migration today because
>>>> solving all these clashing resource problems is a huge can of worms
>>>> and it can't be made invisible to the user of libvirt in any practical
>>>> way.
>>>
>>> Ahhh, OK. I'm pretty surprised by this, as I thought at least kubevirt
>>> supported local migration somehow on top of libvirt.
>>
>> Since kubevirt runs inside a container, "localhost" migration
>> is effectively migrating between 2 completely separate OS installs
>> (containers), that happen to be on the same physical host. IOW, it
>> is a cross-host migration from Libvirt & QEMU's POV, and there are
>> no clashing resources to worry about.
>
> OK, makes sense.
>
> Then do you think it's possible to support cpr-transfer in that scenario
> from Libvirt POV?
>
>>
>>> Does it mean that cpr-transfer is a no-go in this case at least for Libvirt
>>> to consume it (as cpr-* is only for local host migrations so far)? Even if
>>> all the rest issues we're discussing with cpr-exec, is that the only way to
>>> go for Libvirt, then?
>>
>> cpr-exec is certainly appealing from the POV of avoiding the clashing
>> resources problem in libvirt.
>>
>> It has own issues though, because libvirt runs all QEMU processes with
>> seccomp filters that block 'execve', as we consider QEMU to be untrustworthy
>> and thus don't want to allow it to exec anything !
>>
>> I don't know which is the lesser evil from libvirt's POV.
>>
>> Personally I see security controls as an overriding requirement for
>> everything.
>
> One thing I am aware of is cpr-exec is not the only one who might start to
> use exec() in QEMU. TDX fundamentally will need to create another key VM to
> deliver the keys and the plan seems to be using exec() too. However in
> that case per my understanding the exec() is optional - the key VM can also
> be created by Libvirt.
>
> IOW, it looks like we can still stick with execve() being blocked yet so
> far except cpr-exec().
>
> Hmm, this makes the decision harder to make. We need to figure out a way
> on knowing how to consume this feature for at least open source virt
> stack.. So far it looks like it's only possible (if we take seccomp high
> priority) we use cpr-transfer but only in a container.
libvirt starts qemu with the -sandbox spawn=deny option, which blocks fork, exec,
and namespace-change operations. I have a patch in my workspace, to be submitted
later, called "seccomp: fine-grained control of fork, exec, and namespace", that allows
libvirt to block fork and namespace changes but allow exec.
- Steve
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 06/11] migration: fix mismatched GPAs during cpr
2024-08-16 14:43 ` Peter Xu
@ 2024-08-16 17:10 ` Steven Sistare
2024-08-21 16:57 ` Peter Xu
0 siblings, 1 reply; 77+ messages in thread
From: Steven Sistare @ 2024-08-16 17:10 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster
On 8/16/2024 10:43 AM, Peter Xu wrote:
> On Thu, Aug 15, 2024 at 04:54:58PM -0400, Steven Sistare wrote:
>> On 8/13/2024 4:43 PM, Peter Xu wrote:
>>> On Wed, Aug 07, 2024 at 05:04:26PM -0400, Steven Sistare wrote:
>>>> On 7/19/2024 12:28 PM, Peter Xu wrote:
>>>>> On Sun, Jun 30, 2024 at 12:40:29PM -0700, Steve Sistare wrote:
>>>>>> For new cpr modes, ramblock_is_ignored will always be true, because the
>>>>>> memory is preserved in place rather than copied. However, for an ignored
>>>>>> block, parse_ramblock currently requires that the received address of the
>>>>>> block must match the address of the statically initialized region on the
>>>>>> target. This fails for a PCI rom block, because the memory region address
>>>>>> is set when the guest writes to a BAR on the source, which does not occur
>>>>>> on the target, causing a "Mismatched GPAs" error during cpr migration.
>>>>>
>>>>> Is this a common fix with/without cpr mode?
>>>>>
>>>>> It looks to me mr->addr (for these ROMs) should only be set in PCI config
>>>>> region updates as you mentioned. But then I didn't figure out when they're
>>>>> updated on dest in live migration: the ramblock info was sent at the
>>>>> beginning of migration, so it doesn't even have PCI config space migrated;
>>>>> I thought the real mr->addr should be in there.
>>>>>
>>>>> I also failed to understand yet on why the mr->addr check needs to be done
>>>>> by ignore-shared only. Some explanation would be greatly helpful around
>>>>> this area..
>>>>
>>>> The error_report does not bite for normal migration because migrate_ram_is_ignored()
>>>> is false for the problematic blocks, so the block->mr->addr check is not
>>>> performed. However, mr->addr is never fixed up in this case, which is a
>>>> quiet potential bug, and this patch fixes that with the "has_addr" check.
>>>>
>>>> For cpr-exec, migrate_ram_is_ignored() is true for all blocks,
>>>> because we do not copy the contents over the migration stream, we preserve the
>>>> memory in place. So we fall into the block->mr->addr sanity check and fail
>>>> with the original code.
>>>
>>> OK I get your point now. However this doesn't look right, instead I start
>>> to question why we need to send mr->addr at all..
>>>
>>> As I said previously, AFAIU mr->addr should only be updated when there's
>>> some PCI config space updates so that it moves the MR around in the address
>>> space based on how guest drivers / BIOS (?) set things up. Now after these
>>> days not looking, and just started to look at this again, I think the only
>>> sane place to do this update is during a post_load().
>>>
>>> And if we start to check some of the memory_region_set_address() users,
>>> that's exactly what happened..
>>>
>>> - ich9_pm_iospace_update(), update addr for ICH9LPCPMRegs.io, where
>>> ich9_pm_post_load() also invokes it.
>>>
>>> - pm_io_space_update(), updates PIIX4PMState.io, where
>>> vmstate_acpi_post_load() also invokes it.
>>>
>>> I stopped here just looking at the initial two users, it looks all sane to
>>> me that it only got updated there, because the update requires pci config
>>> space being migrated first.
>>>
>>> IOW, I don't think having mismatched mr->addr is wrong at this stage.
>>> Instead, I don't see why we should send mr->addr at all in this case during
>>> as early as SETUP, and I don't see anything justifies the mr->addr needs to
>>> be verified in parse_ramblock() since ignore-shared introduced by Yury in
>>> commit fbd162e629aaf8 in 2019.
>>>
>>> We can't drop mr->addr now when it's on-wire, but I think we should drop
>>> the error report and addr check, instead of this patch.
>>
>> As it turns out, my test case triggers this bug because it sets x-ignore-shared,
>> but x-ignore-shared is not needed for cpr-exec, because migrate_ram_is_ignored
>> is true for all blocks when mode==cpr-exec. So, the best fix for the GPAs bug
>> for me is to stop setting x-ignore-shared. I will drop this patch.
>>
>> I agree that post_load is the right place to restore mr->addr, and I don't
>> understand why commit fbd162e629aaf8 added the error report, but I am going
>> to leave it as is.
>
> Ah, I didn't notice that cpr special cased migrate_ram_is_ignored()..
>
> Shall we stick with the old check, but always require cpr to rely on
> ignore-shared?
>
> Then we replace this patch with removing the error_report, probably
> together with not caring about whatever is received at all.. would that be
> cleaner?
migrate_ram_is_ignored() is called in many places and must return true for
cpr-exec/cpr-transfer, independently of migrate_ignore_shared. That logic
must remain as is.
The cleanest change is no change, just dropping this patch. I was just confused
when I set x-ignore-shared for the test.
However, if an unsuspecting user sets x-ignore-shared, it will trigger this error,
so perhaps I should delete the error_report.
- Steve
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 00/11] Live update: cpr-exec (reconnections)
2024-08-13 20:12 ` Peter Xu
@ 2024-08-20 16:28 ` Steven Sistare
0 siblings, 0 replies; 77+ messages in thread
From: Steven Sistare @ 2024-08-20 16:28 UTC (permalink / raw)
To: Peter Xu, Daniel P. Berrange
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Markus Armbruster
On 8/13/2024 4:12 PM, Peter Xu wrote:
> On Wed, Aug 07, 2024 at 03:47:47PM -0400, Steven Sistare wrote:
>> On 8/4/2024 12:10 PM, Peter Xu wrote:
>>> On Sat, Jul 20, 2024 at 05:26:07PM -0400, Steven Sistare wrote:
>>>> On 7/18/2024 11:56 AM, Peter Xu wrote:
[...]
>>>>>> Lastly, there is no loss of connectivity to the guest,
>>>>>> because chardev descriptors remain open and connected.
>>>>>
>>>>> Again, I raised the question on why this would matter, as after all mgmt
>>>>> app will need to coop with reconnections due to the fact they'll need to
>>>>> support a generic live migration, in which case reconnection is a must.
>>>>>
>>>>> So far it doesn't sound like a performance critical path, for example, to
>>>>> do the mgmt reconnects on the ports. So this might be an optimization that
>>>>> most mgmt apps may not care much?
>>>>
>>>> Perhaps. I view the chardev preservation as nice to have, but not essential.
>>>> It does not appear in this series, other than in docs. It's easy to implement
>>>> given the CPR foundation. I suggest we continue this discussion when I post
>>>> the chardev series, so we can focus on the core functionality.
>>>
>>> It's just that it can affect our decision on choosing the way to go.
>>>
>>> For example, do we have someone from Libvirt or any mgmt layer can help
>>> justify this point?
>>>
>>> As I said, I thought most facilities for reconnection should be ready, but
>>> I could miss important facts in mgmt layers..
>>
>> I will more deeply study reconnects in the mgmt layer, run some experiments to
>> see if it is seamless for the end user, and get back to you, but it will take
>> some time.
See below.
[...]
>>> Could I ask what management code you're working on? Why that management
>>> code doesn't need to already work out these problems with reconnections
>>> (like pre-CPR ways of live upgrade)?
>>
>> OCI - Oracle Cloud Infrastructure.
>> Mgmt needs to manage reconnections for live migration, and perhaps I could
>> leverage that code for live update, but happily I did not need to. Regardless,
>> reconnection is the lesser issue. The bigger issue is resource management and
>> the container environment. But I cannot justify that statement in detail without
>> actually trying to implement cpr-transfer in OCI.
[...]
>> The use case is the same for both modes, but they are simply different
>> transport methods for moving descriptors from old QEMU to new. The developer
>> of the mgmt agent should be allowed to choose.
>
> It's out of my capability to review the mgmt impact on this one. This is
> all based on the idea that I think most mgmt apps supports reconnections
> pretty well. If that's the case, I'd definitely go for the transfer mode.
Closing the loop here on reconnections --
The managers I studied do not reconnect QEMU chardevs such as the guest console
after live migration. In all cases, the old console goes dark and the user must
manually reconnect to the console on the target.
OCI does not auto-reconnect. libvirt does not; one must reconnect through libvirtd
on the target. kubevirt does not AFAICT; one must reconnect on the target using
virtctl console.
Thus chardev preservation does offer an improved user experience in this regard.
chardevs can be preserved using either cpr-exec or cpr-transfer. But if QEMU
runs in a containerized environment that has agents proxying connections between
QEMU chardevs and the outside world, then only cpr-exec (which preserves the existing
container) preserves connections end-to-end. OCI has such agents. I believe kubevirt
does also.
- Steve
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 06/11] migration: fix mismatched GPAs during cpr
2024-08-16 17:10 ` Steven Sistare
@ 2024-08-21 16:57 ` Peter Xu
0 siblings, 0 replies; 77+ messages in thread
From: Peter Xu @ 2024-08-21 16:57 UTC (permalink / raw)
To: Steven Sistare
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Daniel P. Berrange, Markus Armbruster
On Fri, Aug 16, 2024 at 01:10:02PM -0400, Steven Sistare wrote:
> On 8/16/2024 10:43 AM, Peter Xu wrote:
> > On Thu, Aug 15, 2024 at 04:54:58PM -0400, Steven Sistare wrote:
> > > On 8/13/2024 4:43 PM, Peter Xu wrote:
> > > > On Wed, Aug 07, 2024 at 05:04:26PM -0400, Steven Sistare wrote:
> > > > > On 7/19/2024 12:28 PM, Peter Xu wrote:
> > > > > > On Sun, Jun 30, 2024 at 12:40:29PM -0700, Steve Sistare wrote:
> > > > > > > For new cpr modes, ramblock_is_ignored will always be true, because the
> > > > > > > memory is preserved in place rather than copied. However, for an ignored
> > > > > > > block, parse_ramblock currently requires that the received address of the
> > > > > > > block must match the address of the statically initialized region on the
> > > > > > > target. This fails for a PCI rom block, because the memory region address
> > > > > > > is set when the guest writes to a BAR on the source, which does not occur
> > > > > > > on the target, causing a "Mismatched GPAs" error during cpr migration.
> > > > > >
> > > > > > Is this a common fix with/without cpr mode?
> > > > > >
> > > > > > It looks to me mr->addr (for these ROMs) should only be set in PCI config
> > > > > > region updates as you mentioned. But then I didn't figure out when they're
> > > > > > updated on dest in live migration: the ramblock info was sent at the
> > > > > > beginning of migration, so it doesn't even have PCI config space migrated;
> > > > > > I thought the real mr->addr should be in there.
> > > > > >
> > > > > > I also failed to understand yet on why the mr->addr check needs to be done
> > > > > > by ignore-shared only. Some explanation would be greatly helpful around
> > > > > > this area..
> > > > >
> > > > > The error_report does not bite for normal migration because migrate_ram_is_ignored()
> > > > > is false for the problematic blocks, so the block->mr->addr check is not
> > > > > performed. However, mr->addr is never fixed up in this case, which is a
> > > > > quiet potential bug, and this patch fixes that with the "has_addr" check.
> > > > >
> > > > > For cpr-exec, migrate_ram_is_ignored() is true for all blocks,
> > > > > because we do not copy the contents over the migration stream, we preserve the
> > > > > memory in place. So we fall into the block->mr->addr sanity check and fail
> > > > > with the original code.
> > > >
> > > > OK I get your point now. However this doesn't look right, instead I start
> > > > to question why we need to send mr->addr at all..
> > > >
> > > > As I said previously, AFAIU mr->addr should only be updated when there's
> > > > some PCI config space updates so that it moves the MR around in the address
> > > > space based on how guest drivers / BIOS (?) set things up. Now after these
> > > > days not looking, and just started to look at this again, I think the only
> > > > sane place to do this update is during a post_load().
> > > >
> > > > And if we start to check some of the memory_region_set_address() users,
> > > > that's exactly what happened..
> > > >
> > > > - ich9_pm_iospace_update(), update addr for ICH9LPCPMRegs.io, where
> > > > ich9_pm_post_load() also invokes it.
> > > >
> > > > - pm_io_space_update(), updates PIIX4PMState.io, where
> > > > vmstate_acpi_post_load() also invokes it.
> > > >
> > > > I stopped here just looking at the initial two users, it looks all sane to
> > > > me that it only got updated there, because the update requires pci config
> > > > space being migrated first.
> > > >
> > > > IOW, I don't think having mismatched mr->addr is wrong at this stage.
> > > > Instead, I don't see why we should send mr->addr at all in this case during
> > > > as early as SETUP, and I don't see anything justifies the mr->addr needs to
> > > > be verified in parse_ramblock() since ignore-shared introduced by Yury in
> > > > commit fbd162e629aaf8 in 2019.
> > > >
> > > > We can't drop mr->addr now when it's on-wire, but I think we should drop
> > > > the error report and addr check, instead of this patch.
> > >
> > > As it turns out, my test case triggers this bug because it sets x-ignore-shared,
> > > but x-ignore-shared is not needed for cpr-exec, because migrate_ram_is_ignored
> > > is true for all blocks when mode==cpr-exec. So, the best fix for the GPAs bug
> > > for me is to stop setting x-ignore-shared. I will drop this patch.
> > >
> > > I agree that post_load is the right place to restore mr->addr, and I don't
> > > understand why commit fbd162e629aaf8 added the error report, but I am going
> > > to leave it as is.
> >
> > Ah, I didn't notice that cpr special cased migrate_ram_is_ignored()..
> >
> > Shall we stick with the old check, but always require cpr to rely on
> > ignore-shared?
> >
> > Then we replace this patch with removing the error_report, probably
> > together with not caring about whatever is received at all.. would that be
> > cleaner?
>
> migrate_ram_is_ignored() is called in many places and must return true for
> cpr-exec/cpr-transfer, independently of migrate_ignore_shared. That logic
> must remain as is.
Is this because cpr can fail some ramblock in qemu_ram_is_named_file()?
It's not obvious in this case; maybe some re-structuring would be nice. Would
something like this look nicer and easier to understand?
===8<===
diff --git a/migration/ram.c b/migration/ram.c
index 1e1e05e859..ace635b167 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -214,14 +214,29 @@ static bool postcopy_preempt_active(void)
return migrate_postcopy_preempt() && migration_in_postcopy();
}
-bool migrate_ram_is_ignored(RAMBlock *block)
+/* Whether the destination QEMU can share the access on this ramblock? */
+bool migrate_ram_is_shared(RAMBlock *block)
{
MigMode mode = migrate_mode();
+
+ /* Private ram is never share-able */
+ if (!qemu_ram_is_shared(block)) {
+ return false;
+ }
+
+ /* Named file ram is always assumed to be share-able */
+ if (qemu_ram_is_named_file(block)) {
+ return true;
+ }
+
+ /* It's a private fd, only cpr mode can share it (by sharing fd) */
+ return (mode == MIG_MODE_CPR_EXEC) || (mode == MIG_MODE_CPR_TRANSFER);
+}
+
+bool migrate_ram_is_ignored(RAMBlock *block)
+{
return !qemu_ram_is_migratable(block) ||
- mode == MIG_MODE_CPR_EXEC ||
- mode == MIG_MODE_CPR_TRANSFER ||
- (migrate_ignore_shared() && qemu_ram_is_shared(block)
- && qemu_ram_is_named_file(block));
+ (migrate_ignore_shared() && migrate_ram_is_shared(block));
}
===8<===
Please feel free to squash this to your patch in whatever way if it looks
reasonable to you.
>
> The cleanest change is no change, just dropping this patch. I was just confused
> when I set x-ignore-shared for the test.
>
> However, if an unsuspecting user sets x-ignore-shared, it will trigger this error,
> so perhaps I should delete the error_report.
Yes, feel free to send that as a separate patch if you want; since we've
dug this far it would be nice to fix it even if it's not relevant now.
Thanks,
--
Peter Xu
^ permalink raw reply related [flat|nested] 77+ messages in thread
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-08-16 17:09 ` Steven Sistare
@ 2024-08-21 18:34 ` Peter Xu
2024-09-04 20:58 ` Steven Sistare
2024-09-05 9:30 ` Daniel P. Berrangé
1 sibling, 1 reply; 77+ messages in thread
From: Peter Xu @ 2024-08-21 18:34 UTC (permalink / raw)
To: Steven Sistare, Daniel P. Berrangé
Cc: Daniel P. Berrangé, qemu-devel, Fabiano Rosas,
David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster
On Fri, Aug 16, 2024 at 01:09:23PM -0400, Steven Sistare wrote:
> On 8/16/2024 12:17 PM, Peter Xu wrote:
> > On Fri, Aug 16, 2024 at 05:00:32PM +0100, Daniel P. Berrangé wrote:
> > > On Fri, Aug 16, 2024 at 11:34:10AM -0400, Peter Xu wrote:
> > > > On Fri, Aug 16, 2024 at 04:16:50PM +0100, Daniel P. Berrangé wrote:
> > > > > On Fri, Aug 16, 2024 at 11:06:10AM -0400, Peter Xu wrote:
> > > > > > On Thu, Aug 15, 2024 at 04:55:20PM -0400, Steven Sistare wrote:
> > > > > > > On 8/13/2024 3:46 PM, Peter Xu wrote:
> > > > > > > > On Tue, Aug 06, 2024 at 04:56:18PM -0400, Steven Sistare wrote:
> > > > > > > > > > The flipside, however, is that localhost migration via 2 separate QEMU
> > > > > > > > > > processes has issues where both QEMUs want to be opening the very same
> > > > > > > > > > file, and only 1 of them can ever have them open.
> > > > > > > >
> > > > > > > > I thought we used to have similar issue with block devices, but I assume
> > > > > > > > it's solved for years (and whoever owns it will take proper file lock,
> > > > > > > > IIRC, and QEMU migration should properly serialize the time window on who's
> > > > > > > > going to take the file lock).
> > > > > > > >
> > > > > > > > Maybe this is about something else?
> > > > > > >
> > > > > > > I don't have an example where this fails.
> > > > > > >
> > > > > > > I can cause "Failed to get "write" lock" errors if two qemu instances open
> > > > > > > the same block device, but the error is suppressed if you add the -incoming
> > > > > > > argument, due to this code:
> > > > > > >
> > > > > > > blk_attach_dev()
> > > > > > > if (runstate_check(RUN_STATE_INMIGRATE))
> > > > > > > blk->disable_perm = true;
> > > > > >
> > > > > > Yep, this one is pretty much expected.
> > > > > >
> > > > > > >
> > > > > > > > > Indeed, and "files" includes unix domain sockets.
> > > > > > >
> > > > > > > More on this -- the second qemu to bind a unix domain socket for listening
> > > > > > > wins, and the first qemu loses it (because second qemu unlinks and recreates
> > > > > > > the socket path before binding on the assumption that it is stale).
> > > > > > >
> > > > > > > One must use a different name for the socket for second qemu, and clients
> > > > > > > that wish to connect must be aware of the new port.
> > > > > > >
> > > > > > > > > Network ports also conflict.
> > > > > > > > > cpr-exec avoids such problems, and is one of the advantages of the method that
> > > > > > > > > I forgot to promote.
> > > > > > > >
> > > > > > > > I was thinking that's fine, as the host ports should be the backend of the
> > > > > > > > VM ports only anyway so they don't need to be identical on both sides?
> > > > > > > >
> > > > > > > > IOW, my understanding is it's the guest IP/ports/... which should still be
> > > > > > > > stable across migrations, where the host ports can be different as long as
> > > > > > > > the host ports can forward guest port messages correctly?
> > > > > > >
> > > > > > > Yes, one must use a different host port number for the second qemu, and clients
> > > > > > > that wish to connect must be aware of the new port.
> > > > > > >
> > > > > > > That is my point -- cpr-transfer requires fiddling with such things.
> > > > > > > cpr-exec does not.
> > > > > >
> > > > > > Right, and my understanding is all these facilities are already there, so
> > > > > > no new code should be needed on reconnect issues if to support cpr-transfer
> > > > > > in Libvirt or similar management layers that supports migrations.
> > > > >
> > > > > Note Libvirt explicitly blocks localhost migration today because
> > > > > solving all these clashing resource problems is a huge can of worms
> > > > > and it can't be made invisible to the user of libvirt in any practical
> > > > > way.
> > > >
> > > > Ahhh, OK. I'm pretty surprised by this, as I thought at least kubevirt
> > > > supported local migration somehow on top of libvirt.
> > >
> > > Since kubevirt runs inside a container, "localhost" migration
> > > is effectively migrating between 2 completely separate OS installs
> > > (containers), that happen to be on the same physical host. IOW, it
> > > is a cross-host migration from Libvirt & QEMU's POV, and there are
> > > no clashing resources to worry about.
> >
> > OK, makes sense.
> >
> > Then do you think it's possible to support cpr-transfer in that scenario
> > from Libvirt POV?
> >
> > >
> > > > Does it mean that cpr-transfer is a no-go in this case at least for Libvirt
> > > > to consume it (as cpr-* is only for local host migrations so far)? Even if
> > > > all the rest issues we're discussing with cpr-exec, is that the only way to
> > > > go for Libvirt, then?
> > >
> > > cpr-exec is certainly appealing from the POV of avoiding the clashing
> > > resources problem in libvirt.
> > >
> > > It has own issues though, because libvirt runs all QEMU processes with
> > > seccomp filters that block 'execve', as we consider QEMU to be untrustworthy
> > > and thus don't want to allow it to exec anything !
> > >
> > > I don't know which is the lesser evil from libvirt's POV.
> > >
> > > Personally I see security controls as an overriding requirement for
> > > everything.
> >
> > One thing I am aware of is cpr-exec is not the only one who might start to
> > use exec() in QEMU. TDX fundamentally will need to create another key VM to
> > deliver the keys and the plan seems to be using exec() too. However in
> > that case per my understanding the exec() is optional - the key VM can also
> > be created by Libvirt.
> >
> > IOW, it looks like we can still stick with execve() being blocked yet so
> > far except cpr-exec().
> >
> > Hmm, this makes the decision harder to make. We need to figure out a way
> > on knowing how to consume this feature for at least open source virt
> > stack.. So far it looks like it's only possible (if we take seccomp high
> > priority) we use cpr-transfer but only in a container.
>
> libvirt starts qemu with the -sandbox spawn=deny option, which blocks fork, exec,
> and namespace-change operations. I have a patch in my workspace, to be submitted
> later, called "seccomp: fine-grained control of fork, exec, and namespace", that allows
> libvirt to block fork and namespace changes but allow exec.
The question is whether that would be accepted, and it also gives me the
feeling that even if it's accepted, it might limit the use cases that cpr
can apply to.
What I read so far from Dan is that cpr-transfer seems to be also preferred
from Libvirt POV:
https://lore.kernel.org/r/Zr9-IvoRkGjre4CI@redhat.com
Did I read it right?
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-08-21 18:34 ` Peter Xu
@ 2024-09-04 20:58 ` Steven Sistare
2024-09-04 22:23 ` Peter Xu
2024-09-05 9:43 ` Daniel P. Berrangé
0 siblings, 2 replies; 77+ messages in thread
From: Steven Sistare @ 2024-09-04 20:58 UTC (permalink / raw)
To: Peter Xu, Daniel P. Berrangé
Cc: qemu-devel, Fabiano Rosas, David Hildenbrand, Marcel Apfelbaum,
Eduardo Habkost, Philippe Mathieu-Daude, Paolo Bonzini,
Markus Armbruster
On 8/21/2024 2:34 PM, Peter Xu wrote:
> On Fri, Aug 16, 2024 at 01:09:23PM -0400, Steven Sistare wrote:
>> On 8/16/2024 12:17 PM, Peter Xu wrote:
>>> On Fri, Aug 16, 2024 at 05:00:32PM +0100, Daniel P. Berrangé wrote:
>>>> On Fri, Aug 16, 2024 at 11:34:10AM -0400, Peter Xu wrote:
>>>>> On Fri, Aug 16, 2024 at 04:16:50PM +0100, Daniel P. Berrangé wrote:
>>>>>> On Fri, Aug 16, 2024 at 11:06:10AM -0400, Peter Xu wrote:
>>>>>>> On Thu, Aug 15, 2024 at 04:55:20PM -0400, Steven Sistare wrote:
>>>>>>>> On 8/13/2024 3:46 PM, Peter Xu wrote:
>>>>>>>>> On Tue, Aug 06, 2024 at 04:56:18PM -0400, Steven Sistare wrote:
>>>>>>>>>>> The flipside, however, is that localhost migration via 2 separate QEMU
>>>>>>>>>>> processes has issues where both QEMUs want to be opening the very same
>>>>>>>>>>> file, and only 1 of them can ever have them open.
>>>>>>>>>
>>>>>>>>> I thought we used to have similar issue with block devices, but I assume
>>>>>>>>> it's solved for years (and whoever owns it will take proper file lock,
>>>>>>>>> IIRC, and QEMU migration should properly serialize the time window on who's
>>>>>>>>> going to take the file lock).
>>>>>>>>>
>>>>>>>>> Maybe this is about something else?
>>>>>>>>
>>>>>>>> I don't have an example where this fails.
>>>>>>>>
>>>>>>>> I can cause "Failed to get "write" lock" errors if two qemu instances open
>>>>>>>> the same block device, but the error is suppressed if you add the -incoming
>>>>>>>> argument, due to this code:
>>>>>>>>
>>>>>>>> blk_attach_dev()
>>>>>>>> if (runstate_check(RUN_STATE_INMIGRATE))
>>>>>>>> blk->disable_perm = true;
>>>>>>>
>>>>>>> Yep, this one is pretty much expected.
>>>>>>>
>>>>>>>>
>>>>>>>>>> Indeed, and "files" includes unix domain sockets.
>>>>>>>>
>>>>>>>> More on this -- the second qemu to bind a unix domain socket for listening
>>>>>>>> wins, and the first qemu loses it (because second qemu unlinks and recreates
>>>>>>>> the socket path before binding on the assumption that it is stale).
>>>>>>>>
>>>>>>>> One must use a different name for the socket for second qemu, and clients
>>>>>>>> that wish to connect must be aware of the new port.
>>>>>>>>
>>>>>>>>>> Network ports also conflict.
>>>>>>>>>> cpr-exec avoids such problems, and is one of the advantages of the method that
>>>>>>>>>> I forgot to promote.
>>>>>>>>>
>>>>>>>>> I was thinking that's fine, as the host ports should be the backend of the
>>>>>>>>> VM ports only anyway so they don't need to be identical on both sides?
>>>>>>>>>
>>>>>>>>> IOW, my understanding is it's the guest IP/ports/... which should still be
>>>>>>>>> stable across migrations, where the host ports can be different as long as
>>>>>>>>> the host ports can forward guest port messages correctly?
>>>>>>>>
>>>>>>>> Yes, one must use a different host port number for the second qemu, and clients
>>>>>>>> that wish to connect must be aware of the new port.
>>>>>>>>
>>>>>>>> That is my point -- cpr-transfer requires fiddling with such things.
>>>>>>>> cpr-exec does not.
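The property cpr-exec leans on here is that execve(2) replaces the program while preserving the PID and open file descriptors, so there is no second process to clash over sockets or ports. A generic sketch (not QEMU code; /bin/sh stands in for the new QEMU image):

```python
import subprocess
import sys

# A child process stands in for old QEMU: it prints its PID, then
# exec()s a new image (here /bin/sh for brevity), which prints its own
# PID via $$.  exec replaces the program but keeps the PID, and with it
# the process's open file descriptors.
old_qemu = (
    "import os; print(os.getpid(), flush=True); "
    "os.execv('/bin/sh', ['sh', '-c', 'echo $$'])"
)
out = subprocess.run(
    [sys.executable, "-c", old_qemu], capture_output=True, text=True
).stdout.split()
pid_before_exec, pid_after_exec = out
```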
>>>>>>>
>>>>>>> Right, and my understanding is all these facilities are already there, so
>>>>>>> no new code should be needed on reconnect issues if to support cpr-transfer
>>>>>>> in Libvirt or similar management layers that supports migrations.
>>>>>>
>>>>>> Note Libvirt explicitly blocks localhost migration today because
>>>>>> solving all these clashing resource problems is a huge can of worms
>>>>>> and it can't be made invisible to the user of libvirt in any practical
>>>>>> way.
>>>>>
>>>>> Ahhh, OK. I'm pretty surprised by this, as I thought at least kubevirt
>>>>> supported local migration somehow on top of libvirt.
>>>>
>>>> Since kubevirt runs inside a container, "localhost" migration
>>>> is effectively migrating between 2 completely separate OS installs
>>>> (containers), that happen to be on the same physical host. IOW, it
>>>> is a cross-host migration from Libvirt & QEMU's POV, and there are
>>>> no clashing resources to worry about.
>>>
>>> OK, makes sense.
>>>
>>> Then do you think it's possible to support cpr-transfer in that scenario
>>> from Libvirt POV?
>>>
>>>>
>>>>> Does it mean that cpr-transfer is a no-go in this case at least for Libvirt
>>>>> to consume it (as cpr-* is only for local host migrations so far)? Even if
>>>>> all the rest issues we're discussing with cpr-exec, is that the only way to
>>>>> go for Libvirt, then?
>>>>
>>>> cpr-exec is certainly appealing from the POV of avoiding the clashing
>>>> resources problem in libvirt.
>>>>
>>>> It has own issues though, because libvirt runs all QEMU processes with
>>>> seccomp filters that block 'execve', as we consider QEMU to be untrustworthy
>>>> and thus don't want to allow it to exec anything !
>>>>
>>>> I don't know which is the lesser evil from libvirt's POV.
>>>>
>>>> Personally I see security controls as an overriding requirement for
>>>> everything.
>>>
>>> One thing I am aware of is cpr-exec is not the only one who might start to
>>> use exec() in QEMU. TDX fundamentally will need to create another key VM to
>>> deliver the keys and the plan seems to be using exec() too. However in
>>> that case per my understanding the exec() is optional - the key VM can also
>>> be created by Libvirt.
>>>
>>> IOW, it looks like we can still stick with execve() being blocked yet so
>>> far except cpr-exec().
>>>
>>> Hmm, this makes the decision harder to make. We need to figure out a way
>>> on knowing how to consume this feature for at least open source virt
>>> stack.. So far it looks like it's only possible (if we take seccomp high
>>> priority) we use cpr-transfer but only in a container.
>>
>> libvirt starts qemu with the -sandbox spawn=deny option which blocks fork, exec,
>> and change namespace operations. I have a patch in my workspace to be submitted
>> later called "seccomp: fine-grained control of fork, exec, and namespace" that allows
>> libvirt to block fork and namespace but allow exec.
>
> The question is whether that would be accepted, and it also gives me the
> feeling that even if it's accepted, it might limit the use cases that cpr
> can apply to.
This is more acceptable for libvirt running in a container (such as under kubevirt)
with a limited set of binaries in /bin that could be exec'd. In that case allowing
exec is more reasonable.
> What I read so far from Dan is that cpr-transfer seems to be also preferred
> from Libvirt POV:
>
> https://lore.kernel.org/r/Zr9-IvoRkGjre4CI@redhat.com
>
> Did I read it right?
I read that as: cpr-transfer is a viable option for libvirt. I don't hear him
excluding the possibility of cpr-exec.
I agree that "Dan the libvirt expert prefers cpr-transfer" is a good reason to
provide cpr-transfer. Which I will do.
So does "Steve the OCI expert prefers cpr-exec" carry equal weight, for also
providing cpr-exec?
We are at an impasse on this series. To make forward progress, I am willing to
reorder the patches, and re-submit cpr-transfer as the first mode, so we can
review and pull that. I will submit cpr-exec as a follow on and we can resume
our arguments then.
- Steve
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-09-04 20:58 ` Steven Sistare
@ 2024-09-04 22:23 ` Peter Xu
2024-09-05 9:49 ` Daniel P. Berrangé
2024-09-05 9:43 ` Daniel P. Berrangé
1 sibling, 1 reply; 77+ messages in thread
From: Peter Xu @ 2024-09-04 22:23 UTC (permalink / raw)
To: Steven Sistare
Cc: Daniel P. Berrangé, qemu-devel, Fabiano Rosas,
David Hildenbrand, Marcel Apfelbaum, Eduardo Habkost,
Philippe Mathieu-Daude, Paolo Bonzini, Markus Armbruster
On Wed, Sep 04, 2024 at 04:58:14PM -0400, Steven Sistare wrote:
> On 8/21/2024 2:34 PM, Peter Xu wrote:
> > On Fri, Aug 16, 2024 at 01:09:23PM -0400, Steven Sistare wrote:
> > > On 8/16/2024 12:17 PM, Peter Xu wrote:
> > > > On Fri, Aug 16, 2024 at 05:00:32PM +0100, Daniel P. Berrangé wrote:
> > > > > On Fri, Aug 16, 2024 at 11:34:10AM -0400, Peter Xu wrote:
> > > > > > On Fri, Aug 16, 2024 at 04:16:50PM +0100, Daniel P. Berrangé wrote:
> > > > > > > On Fri, Aug 16, 2024 at 11:06:10AM -0400, Peter Xu wrote:
> > > > > > > > On Thu, Aug 15, 2024 at 04:55:20PM -0400, Steven Sistare wrote:
> > > > > > > > > On 8/13/2024 3:46 PM, Peter Xu wrote:
> > > > > > > > > > On Tue, Aug 06, 2024 at 04:56:18PM -0400, Steven Sistare wrote:
> > > > > > > > > > > > The flipside, however, is that localhost migration via 2 separate QEMU
> > > > > > > > > > > > processes has issues where both QEMUs want to be opening the very same
> > > > > > > > > > > > file, and only 1 of them can ever have them open.
> > > > > > > > > >
> > > > > > > > > > I thought we used to have similar issue with block devices, but I assume
> > > > > > > > > > it's solved for years (and whoever owns it will take proper file lock,
> > > > > > > > > > IIRC, and QEMU migration should properly serialize the time window on who's
> > > > > > > > > > going to take the file lock).
> > > > > > > > > >
> > > > > > > > > > Maybe this is about something else?
> > > > > > > > >
> > > > > > > > > I don't have an example where this fails.
> > > > > > > > >
> > > > > > > > > I can cause "Failed to get "write" lock" errors if two qemu instances open
> > > > > > > > > the same block device, but the error is suppressed if you add the -incoming
> > > > > > > > > argument, due to this code:
> > > > > > > > >
> > > > > > > > > blk_attach_dev()
> > > > > > > > >     if (runstate_check(RUN_STATE_INMIGRATE))
> > > > > > > > >         blk->disable_perm = true;
> > > > > > > >
> > > > > > > > Yep, this one is pretty much expected.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > > Indeed, and "files" includes unix domain sockets.
> > > > > > > > >
> > > > > > > > > More on this -- the second qemu to bind a unix domain socket for listening
> > > > > > > > > wins, and the first qemu loses it (because second qemu unlinks and recreates
> > > > > > > > > the socket path before binding on the assumption that it is stale).
> > > > > > > > >
> > > > > > > > > One must use a different name for the socket for second qemu, and clients
> > > > > > > > > that wish to connect must be aware of the new port.
> > > > > > > > >
> > > > > > > > > > > Network ports also conflict.
> > > > > > > > > > > cpr-exec avoids such problems, and is one of the advantages of the method that
> > > > > > > > > > > I forgot to promote.
> > > > > > > > > >
> > > > > > > > > > I was thinking that's fine, as the host ports should be the backend of the
> > > > > > > > > > VM ports only anyway so they don't need to be identical on both sides?
> > > > > > > > > >
> > > > > > > > > > IOW, my understanding is it's the guest IP/ports/... which should still be
> > > > > > > > > > stable across migrations, where the host ports can be different as long as
> > > > > > > > > > the host ports can forward guest port messages correctly?
> > > > > > > > >
> > > > > > > > > Yes, one must use a different host port number for the second qemu, and clients
> > > > > > > > > that wish to connect must be aware of the new port.
> > > > > > > > >
> > > > > > > > > That is my point -- cpr-transfer requires fiddling with such things.
> > > > > > > > > cpr-exec does not.
> > > > > > > >
> > > > > > > > Right, and my understanding is all these facilities are already there, so
> > > > > > > > no new code should be needed on reconnect issues if to support cpr-transfer
> > > > > > > > in Libvirt or similar management layers that supports migrations.
> > > > > > >
> > > > > > > Note Libvirt explicitly blocks localhost migration today because
> > > > > > > solving all these clashing resource problems is a huge can of worms
> > > > > > > and it can't be made invisible to the user of libvirt in any practical
> > > > > > > way.
> > > > > >
> > > > > > Ahhh, OK. I'm pretty surprised by this, as I thought at least kubevirt
> > > > > > supported local migration somehow on top of libvirt.
> > > > >
> > > > > Since kubevirt runs inside a container, "localhost" migration
> > > > > is effectively migrating between 2 completely separate OS installs
> > > > > (containers), that happen to be on the same physical host. IOW, it
> > > > > is a cross-host migration from Libvirt & QEMU's POV, and there are
> > > > > no clashing resources to worry about.
> > > >
> > > > OK, makes sense.
> > > >
> > > > Then do you think it's possible to support cpr-transfer in that scenario
> > > > from Libvirt POV?
> > > >
> > > > >
> > > > > > Does it mean that cpr-transfer is a no-go in this case at least for Libvirt
> > > > > > to consume it (as cpr-* is only for local host migrations so far)? Even if
> > > > > > all the rest issues we're discussing with cpr-exec, is that the only way to
> > > > > > go for Libvirt, then?
> > > > >
> > > > > cpr-exec is certainly appealing from the POV of avoiding the clashing
> > > > > resources problem in libvirt.
> > > > >
> > > > > It has own issues though, because libvirt runs all QEMU processes with
> > > > > seccomp filters that block 'execve', as we consider QEMU to be untrustworthy
> > > > > and thus don't want to allow it to exec anything !
> > > > >
> > > > > I don't know which is the lesser evil from libvirt's POV.
> > > > >
> > > > > Personally I see security controls as an overriding requirement for
> > > > > everything.
> > > >
> > > > One thing I am aware of is cpr-exec is not the only one who might start to
> > > > use exec() in QEMU. TDX fundamentally will need to create another key VM to
> > > > deliver the keys and the plan seems to be using exec() too. However in
> > > > that case per my understanding the exec() is optional - the key VM can also
> > > > be created by Libvirt.
> > > >
> > > > IOW, it looks like we can still stick with execve() being blocked yet so
> > > > far except cpr-exec().
> > > >
> > > > Hmm, this makes the decision harder to make. We need to figure out a way
> > > > on knowing how to consume this feature for at least open source virt
> > > > stack.. So far it looks like it's only possible (if we take seccomp high
> > > > priority) we use cpr-transfer but only in a container.
> > >
> > > libvirt starts qemu with the -sandbox spawn=deny option which blocks fork, exec,
> > > and change namespace operations. I have a patch in my workspace to be submitted
> > > later called "seccomp: fine-grained control of fork, exec, and namespace" that allows
> > > libvirt to block fork and namespace but allow exec.
> >
> > The question is whether that would be accepted, and it also gives me the
> > feeling that even if it's accepted, it might limit the use cases that cpr
> > can apply to.
>
> This is more acceptable for libvirt running in a container (such as under kubevirt)
> with a limited set of binaries in /bin that could be exec'd. In that case allowing
> exec is more reasonable.
>
> > What I read so far from Dan is that cpr-transfer seems to be also preferred
> > from Libvirt POV:
> >
> > https://lore.kernel.org/r/Zr9-IvoRkGjre4CI@redhat.com
> >
> > Did I read it right?
>
> I read that as: cpr-transfer is a viable option for libvirt. I don't hear him
> excluding the possibility of cpr-exec.
I would prefer not to have two solutions: if they solve the same problem,
one of them may end up abandoned at some point, unless they suit different
needs. But I don't feel strongly, especially if cpr-exec is a small
addition once cpr-transfer is there.
>
> I agree that "Dan the libvirt expert prefers cpr-transfer" is a good reason to
> provide cpr-transfer. Which I will do.
>
> So does "Steve the OCI expert prefers cpr-exec" carry equal weight, for also
> providing cpr-exec?
As an open source project, Libvirt using it means the feature can be
actively exercised and tested. When e.g. a new feature someday replaces CPR,
we will know when we can obsolete the old CPR, whether -exec or -transfer.
Closed-source projects can be great in themselves, but they naturally carry
less weight in open source communities IMHO, because they are not accessible
to anyone in the community. E.g., we never know when a closed-source project
abandons a feature, so QEMU could carry that feature forever without knowing
who's using it.
To me it's the same as Linux not maintaining a kABI for out-of-tree drivers.
It's just that here the open source virt stack is a huge project and QEMU
plays its role within it.
>
> We are at an impasse on this series. To make forward progress, I am willing to
> reorder the patches, and re-submit cpr-transfer as the first mode, so we can
> review and pull that. I will submit cpr-exec as a follow on and we can resume
> our arguments then.
Yes, that would better show how small a change cpr-exec needs on top of
cpr-transfer. But I'd still wait for comments from Dan or others in case
they chime in, just to avoid sinking your time into rebases.
Thanks,
--
Peter Xu
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-08-16 17:09 ` Steven Sistare
2024-08-21 18:34 ` Peter Xu
@ 2024-09-05 9:30 ` Daniel P. Berrangé
1 sibling, 0 replies; 77+ messages in thread
From: Daniel P. Berrangé @ 2024-09-05 9:30 UTC (permalink / raw)
To: Steven Sistare
Cc: Peter Xu, qemu-devel, Fabiano Rosas, David Hildenbrand,
Marcel Apfelbaum, Eduardo Habkost, Philippe Mathieu-Daude,
Paolo Bonzini, Markus Armbruster
On Fri, Aug 16, 2024 at 01:09:23PM -0400, Steven Sistare wrote:
> On 8/16/2024 12:17 PM, Peter Xu wrote:
> > On Fri, Aug 16, 2024 at 05:00:32PM +0100, Daniel P. Berrangé wrote:
> > > On Fri, Aug 16, 2024 at 11:34:10AM -0400, Peter Xu wrote:
> > > > On Fri, Aug 16, 2024 at 04:16:50PM +0100, Daniel P. Berrangé wrote:
> > > > > On Fri, Aug 16, 2024 at 11:06:10AM -0400, Peter Xu wrote:
> > > > > > On Thu, Aug 15, 2024 at 04:55:20PM -0400, Steven Sistare wrote:
> > > > > > > On 8/13/2024 3:46 PM, Peter Xu wrote:
> > > > > > > > On Tue, Aug 06, 2024 at 04:56:18PM -0400, Steven Sistare wrote:
> > > > > > > > > > The flipside, however, is that localhost migration via 2 separate QEMU
> > > > > > > > > > processes has issues where both QEMUs want to be opening the very same
> > > > > > > > > > file, and only 1 of them can ever have them open.
> > > > > > > >
> > > > > > > > I thought we used to have similar issue with block devices, but I assume
> > > > > > > > it's solved for years (and whoever owns it will take proper file lock,
> > > > > > > > IIRC, and QEMU migration should properly serialize the time window on who's
> > > > > > > > going to take the file lock).
> > > > > > > >
> > > > > > > > Maybe this is about something else?
> > > > > > >
> > > > > > > I don't have an example where this fails.
> > > > > > >
> > > > > > > I can cause "Failed to get "write" lock" errors if two qemu instances open
> > > > > > > the same block device, but the error is suppressed if you add the -incoming
> > > > > > > argument, due to this code:
> > > > > > >
> > > > > > > blk_attach_dev()
> > > > > > >     if (runstate_check(RUN_STATE_INMIGRATE))
> > > > > > >         blk->disable_perm = true;
> > > > > >
> > > > > > Yep, this one is pretty much expected.
> > > > > >
> > > > > > >
> > > > > > > > > Indeed, and "files" includes unix domain sockets.
> > > > > > >
> > > > > > > More on this -- the second qemu to bind a unix domain socket for listening
> > > > > > > wins, and the first qemu loses it (because second qemu unlinks and recreates
> > > > > > > the socket path before binding on the assumption that it is stale).
> > > > > > >
> > > > > > > One must use a different name for the socket for second qemu, and clients
> > > > > > > that wish to connect must be aware of the new port.
> > > > > > >
> > > > > > > > > Network ports also conflict.
> > > > > > > > > cpr-exec avoids such problems, and is one of the advantages of the method that
> > > > > > > > > I forgot to promote.
> > > > > > > >
> > > > > > > > I was thinking that's fine, as the host ports should be the backend of the
> > > > > > > > VM ports only anyway so they don't need to be identical on both sides?
> > > > > > > >
> > > > > > > > IOW, my understanding is it's the guest IP/ports/... which should still be
> > > > > > > > stable across migrations, where the host ports can be different as long as
> > > > > > > > the host ports can forward guest port messages correctly?
> > > > > > >
> > > > > > > Yes, one must use a different host port number for the second qemu, and clients
> > > > > > > that wish to connect must be aware of the new port.
> > > > > > >
> > > > > > > That is my point -- cpr-transfer requires fiddling with such things.
> > > > > > > cpr-exec does not.
> > > > > >
> > > > > > Right, and my understanding is all these facilities are already there, so
> > > > > > no new code should be needed on reconnect issues if to support cpr-transfer
> > > > > > in Libvirt or similar management layers that supports migrations.
> > > > >
> > > > > Note Libvirt explicitly blocks localhost migration today because
> > > > > solving all these clashing resource problems is a huge can of worms
> > > > > and it can't be made invisible to the user of libvirt in any practical
> > > > > way.
> > > >
> > > > Ahhh, OK. I'm pretty surprised by this, as I thought at least kubevirt
> > > > supported local migration somehow on top of libvirt.
> > >
> > > Since kubevirt runs inside a container, "localhost" migration
> > > is effectively migrating between 2 completely separate OS installs
> > > (containers), that happen to be on the same physical host. IOW, it
> > > is a cross-host migration from Libvirt & QEMU's POV, and there are
> > > no clashing resources to worry about.
> >
> > OK, makes sense.
> >
> > Then do you think it's possible to support cpr-transfer in that scenario
> > from Libvirt POV?
> >
> > >
> > > > Does it mean that cpr-transfer is a no-go in this case at least for Libvirt
> > > > to consume it (as cpr-* is only for local host migrations so far)? Even if
> > > > all the rest issues we're discussing with cpr-exec, is that the only way to
> > > > go for Libvirt, then?
> > >
> > > cpr-exec is certainly appealing from the POV of avoiding the clashing
> > > resources problem in libvirt.
> > >
> > > It has own issues though, because libvirt runs all QEMU processes with
> > > seccomp filters that block 'execve', as we consider QEMU to be untrustworthy
> > > and thus don't want to allow it to exec anything !
> > >
> > > I don't know which is the lesser evil from libvirt's POV.
> > >
> > > Personally I see security controls as an overriding requirement for
> > > everything.
> >
> > One thing I am aware of is cpr-exec is not the only one who might start to
> > use exec() in QEMU. TDX fundamentally will need to create another key VM to
> > deliver the keys and the plan seems to be using exec() too. However in
> > that case per my understanding the exec() is optional - the key VM can also
> > be created by Libvirt.
> >
> > IOW, it looks like we can still stick with execve() being blocked yet so
> > far except cpr-exec().
> >
> > Hmm, this makes the decision harder to make. We need to figure out a way
> > on knowing how to consume this feature for at least open source virt
> > stack.. So far it looks like it's only possible (if we take seccomp high
> > priority) we use cpr-transfer but only in a container.
>
> libvirt starts qemu with the -sandbox spawn=deny option which blocks fork, exec,
> and change namespace operations. I have a patch in my workspace to be submitted
> later called "seccomp: fine-grained control of fork, exec, and namespace" that allows
> libvirt to block fork and namespace but allow exec.
IMHO this significantly undermines the protection offered. fork() without
execve() is relatively benign from a security POV, amounting mostly to a
slightly greater resource consumption issue than spawning threads, which is
always allowed. Blocking execve() is the key security benefit, as it is a way
to pick up new privileges (through setuid), bring new binary code into
memory (via the new ELF images loaded), or pick up a new MAC policy through
transition rules, etc.
IOW, if you're going to allow 'exec', there's little point in blocking fork
IMHO, and as such this doesn't sound very appealing as something to add to
libvirt.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-09-04 20:58 ` Steven Sistare
2024-09-04 22:23 ` Peter Xu
@ 2024-09-05 9:43 ` Daniel P. Berrangé
1 sibling, 0 replies; 77+ messages in thread
From: Daniel P. Berrangé @ 2024-09-05 9:43 UTC (permalink / raw)
To: Steven Sistare
Cc: Peter Xu, qemu-devel, Fabiano Rosas, David Hildenbrand,
Marcel Apfelbaum, Eduardo Habkost, Philippe Mathieu-Daude,
Paolo Bonzini, Markus Armbruster
On Wed, Sep 04, 2024 at 04:58:14PM -0400, Steven Sistare wrote:
> On 8/21/2024 2:34 PM, Peter Xu wrote:
> > On Fri, Aug 16, 2024 at 01:09:23PM -0400, Steven Sistare wrote:
> > >
> > > libvirt starts qemu with the -sandbox spawn=deny option which blocks fork, exec,
> > > and change namespace operations. I have a patch in my workspace to be submitted
> > > later called "seccomp: fine-grained control of fork, exec, and namespace" that allows
> > > libvirt to block fork and namespace but allow exec.
> >
> > The question is whether that would be accepted, and it also gives me the
> > feeling that even if it's accepted, it might limit the use cases that cpr
> > can apply to.
>
> This is more acceptable for libvirt running in a container (such as under kubevirt)
> with a limited set of binaries in /bin that could be exec'd. In that case allowing
> exec is more reasonable.
Running inside a container does protect the host to a significant
degree. I'd say it is still important, however, to protect the
control plane (libvirt's daemons & kubevirt's agent) from the QEMU
process being managed, and in that case it still looks pretty
compelling to deny exec.
> > What I read so far from Dan is that cpr-transfer seems to be also preferred
> > from Libvirt POV:
> >
> > https://lore.kernel.org/r/Zr9-IvoRkGjre4CI@redhat.com
> >
> > Did I read it right?
>
> I read that as: cpr-transfer is a viable option for libvirt. I don't hear him
> excluding the possibility of cpr-exec.
>
> I agree that "Dan the libvirt expert prefers cpr-transfer" is a good reason to
> provide cpr-transfer. Which I will do.
Both approaches have significant challenges for integration, but my general
preference is towards a solution that doesn't require undermining our security
protections.
When starting a VM we have no knowledge of whether a user may want to use
CPR at a later date. We're not going to disable the seccomp sandbox by
default, so that means cpr-exec would not be viable in a default VM
deployment.
Admins could choose to modify /etc/libvirt/qemu.conf to turn off seccomp,
but I'm very much not in favour of introducing a feature that requires
them to do this. It would be a first in libvirt, as everything else we
support is possible to use with seccomp enabled. The seccomp opt-out is
essentially just there as an emergency escape hatch, not as something we
want used in production.
> We are at an impasse on this series. To make forward progress, I am willing to
> reorder the patches, and re-submit cpr-transfer as the first mode, so we can
> review and pull that. I will submit cpr-exec as a follow on and we can resume
> our arguments then.
Considering the end result, are there CPR usage scenarios that are possible
with cpr-exec, that can't be achieved with cpr-transfer ?
Supporting two ways of doing the same thing increases the maintenance burden
for QEMU maintainers, as well as for downstream testing engineers who have to
validate this functionality. So unless there's a compelling need to support
both cpr-transfer and cpr-exec, it'd be nice to standardize on just one of
them.
cpr-transfer does look like it's probably more viable, even with its own
challenges wrt resources being opened twice.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: [PATCH V2 00/11] Live update: cpr-exec
2024-09-04 22:23 ` Peter Xu
@ 2024-09-05 9:49 ` Daniel P. Berrangé
0 siblings, 0 replies; 77+ messages in thread
From: Daniel P. Berrangé @ 2024-09-05 9:49 UTC (permalink / raw)
To: Peter Xu
Cc: Steven Sistare, qemu-devel, Fabiano Rosas, David Hildenbrand,
Marcel Apfelbaum, Eduardo Habkost, Philippe Mathieu-Daude,
Paolo Bonzini, Markus Armbruster
On Wed, Sep 04, 2024 at 06:23:50PM -0400, Peter Xu wrote:
> On Wed, Sep 04, 2024 at 04:58:14PM -0400, Steven Sistare wrote:
> > On 8/21/2024 2:34 PM, Peter Xu wrote:
> > > On Fri, Aug 16, 2024 at 01:09:23PM -0400, Steven Sistare wrote:
> > > > On 8/16/2024 12:17 PM, Peter Xu wrote:
> > > What I read so far from Dan is that cpr-transfer seems to be also preferred
> > > from Libvirt POV:
> > >
> > > https://lore.kernel.org/r/Zr9-IvoRkGjre4CI@redhat.com
> > >
> > > Did I read it right?
> >
> > I read that as: cpr-transfer is a viable option for libvirt. I don't hear him
> > excluding the possibility of cpr-exec.
>
> I preferred not having two solution because if they work the same problem
> out, then it potentially means one of them might be leftover at some point,
> unless they suite different needs. But I don't feel strongly, especially
> if cpr-exec is light if cpr-transfer is there.
>
> >
> > I agree that "Dan the libvirt expert prefers cpr-transfer" is a good reason to
> > provide cpr-transfer. Which I will do.
> >
> > So does "Steve the OCI expert prefers cpr-exec" carry equal weight, for also
> > providing cpr-exec?
>
> As an open source project, Libvirt using it means the feature can be
> actively used and tested. When e.g. there's a new feature replacing CPR we
> know when we can obsolete the old CPR, no matter -exec or -transfer.
>
> Close sourced projects can also be great itself but naturally are less
> important in open source communities IMHO due to not accessible to anyone
> in the community. E.g., we never know when an close sourced project
> abandoned a feature, then QEMU can carry over that feature forever without
> knowing who's using it.
In terms of closed source projects, effectively they don't exist from a
QEMU maintainer's POV. Our deprecation & removal policy is designed so
that we don't need to think about who is using stuff.
When QEMU deprecates something, any users (whether open source or closed
source) have 2 releases in which to notice this, and make a request that
we cancel the deprecation, or change their code.
Libvirt is special in the sense that we'll CC libvirt mailing list on
changes to the deprecated.rst file, and we'll often not propose
deprecations in the first place if we know libvirt is using it, since
we can ask libvirt quite easily & libvirt people pay attention to QEMU.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
end of thread, other threads:[~2024-09-05 9:49 UTC | newest]
Thread overview: 77+ messages
2024-06-30 19:40 [PATCH V2 00/11] Live update: cpr-exec Steve Sistare
2024-06-30 19:40 ` [PATCH V2 01/11] machine: alloc-anon option Steve Sistare
2024-07-15 17:52 ` Fabiano Rosas
2024-07-16 9:19 ` Igor Mammedov
2024-07-17 19:24 ` Peter Xu
2024-07-18 15:43 ` Steven Sistare
2024-07-18 16:22 ` Peter Xu
2024-07-20 20:35 ` Steven Sistare
2024-08-04 16:20 ` Peter Xu
2024-07-20 20:28 ` Steven Sistare
2024-07-22 9:10 ` David Hildenbrand
2024-07-29 12:29 ` Igor Mammedov
2024-08-08 18:32 ` Steven Sistare
2024-08-12 18:37 ` Steven Sistare
2024-08-13 15:35 ` Peter Xu
2024-08-13 17:00 ` Alex Williamson
2024-08-13 18:45 ` Peter Xu
2024-08-13 18:56 ` Steven Sistare
2024-08-13 18:46 ` Steven Sistare
2024-08-13 18:49 ` Steven Sistare
2024-08-13 17:34 ` Steven Sistare
2024-08-13 19:02 ` Peter Xu
2024-06-30 19:40 ` [PATCH V2 02/11] migration: cpr-state Steve Sistare
2024-07-17 18:39 ` Fabiano Rosas
2024-07-19 15:03 ` Peter Xu
2024-07-20 19:53 ` Steven Sistare
2024-06-30 19:40 ` [PATCH V2 03/11] migration: save cpr mode Steve Sistare
2024-07-17 18:39 ` Fabiano Rosas
2024-07-18 15:47 ` Steven Sistare
2024-06-30 19:40 ` [PATCH V2 04/11] migration: stop vm earlier for cpr Steve Sistare
2024-07-17 18:59 ` Fabiano Rosas
2024-07-20 20:00 ` Steven Sistare
2024-07-22 13:42 ` Fabiano Rosas
2024-08-06 20:52 ` Steven Sistare
2024-06-30 19:40 ` [PATCH V2 05/11] physmem: preserve ram blocks " Steve Sistare
2024-06-30 19:40 ` [PATCH V2 06/11] migration: fix mismatched GPAs during cpr Steve Sistare
2024-07-19 16:28 ` Peter Xu
2024-07-20 21:28 ` Steven Sistare
2024-08-07 21:04 ` Steven Sistare
2024-08-13 20:43 ` Peter Xu
2024-08-15 20:54 ` Steven Sistare
2024-08-16 14:43 ` Peter Xu
2024-08-16 17:10 ` Steven Sistare
2024-08-21 16:57 ` Peter Xu
2024-06-30 19:40 ` [PATCH V2 07/11] oslib: qemu_clear_cloexec Steve Sistare
2024-06-30 19:40 ` [PATCH V2 08/11] vl: helper to request exec Steve Sistare
2024-06-30 19:40 ` [PATCH V2 09/11] migration: cpr-exec-command parameter Steve Sistare
2024-06-30 19:40 ` [PATCH V2 10/11] migration: cpr-exec save and load Steve Sistare
2024-06-30 19:40 ` [PATCH V2 11/11] migration: cpr-exec mode Steve Sistare
2024-07-18 15:56 ` [PATCH V2 00/11] Live update: cpr-exec Peter Xu
2024-07-20 21:26 ` Steven Sistare
2024-08-04 16:10 ` Peter Xu
2024-08-07 19:47 ` Steven Sistare
2024-08-13 20:12 ` Peter Xu
2024-08-20 16:28 ` [PATCH V2 00/11] Live update: cpr-exec (reconnections) Steven Sistare
2024-07-22 8:59 ` [PATCH V2 00/11] Live update: cpr-exec David Hildenbrand
2024-08-04 15:43 ` Peter Xu
2024-08-05 9:52 ` David Hildenbrand
2024-08-05 10:06 ` David Hildenbrand
2024-08-05 10:01 ` Daniel P. Berrangé
2024-08-06 20:56 ` Steven Sistare
2024-08-13 19:46 ` Peter Xu
2024-08-15 20:55 ` Steven Sistare
2024-08-16 15:06 ` Peter Xu
2024-08-16 15:16 ` Daniel P. Berrangé
2024-08-16 15:19 ` Steven Sistare
2024-08-16 15:34 ` Peter Xu
2024-08-16 16:00 ` Daniel P. Berrangé
2024-08-16 16:17 ` Peter Xu
2024-08-16 16:28 ` Daniel P. Berrangé
2024-08-16 17:09 ` Steven Sistare
2024-08-21 18:34 ` Peter Xu
2024-09-04 20:58 ` Steven Sistare
2024-09-04 22:23 ` Peter Xu
2024-09-05 9:49 ` Daniel P. Berrangé
2024-09-05 9:43 ` Daniel P. Berrangé
2024-09-05 9:30 ` Daniel P. Berrangé