* [PATCH V5 01/38] migration: cpr helpers
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-10 15:39 ` [PATCH V5 02/38] migration: lower handler priority Steve Sistare
` (37 subsequent siblings)
38 siblings, 0 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Add the cpr_incoming_needed, cpr_open_fd, and cpr_resave_fd helpers,
for use when adding cpr support for vfio and iommufd.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
include/migration/cpr.h | 5 +++++
migration/cpr.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+)
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 7561fc7..07858e9 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -18,6 +18,9 @@
void cpr_save_fd(const char *name, int id, int fd);
void cpr_delete_fd(const char *name, int id);
int cpr_find_fd(const char *name, int id);
+void cpr_resave_fd(const char *name, int id, int fd);
+int cpr_open_fd(const char *path, int flags, const char *name, int id,
+ Error **errp);
MigMode cpr_get_incoming_mode(void);
void cpr_set_incoming_mode(MigMode mode);
@@ -28,6 +31,8 @@ int cpr_state_load(MigrationChannel *channel, Error **errp);
void cpr_state_close(void);
struct QIOChannel *cpr_state_ioc(void);
+bool cpr_incoming_needed(void *opaque);
+
QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
diff --git a/migration/cpr.c b/migration/cpr.c
index 42c4656..a50a57e 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -95,6 +95,36 @@ int cpr_find_fd(const char *name, int id)
trace_cpr_find_fd(name, id, fd);
return fd;
}
+
+void cpr_resave_fd(const char *name, int id, int fd)
+{
+ CprFd *elem = find_fd(&cpr_state.fds, name, id);
+ int old_fd = elem ? elem->fd : -1;
+
+ if (old_fd < 0) {
+ cpr_save_fd(name, id, fd);
+ } else if (old_fd != fd) {
+ error_setg(&error_fatal,
+ "internal error: cpr fd '%s' id %d value %d "
+ "already saved with a different value %d",
+ name, id, fd, old_fd);
+ }
+}
+
+int cpr_open_fd(const char *path, int flags, const char *name, int id,
+ Error **errp)
+{
+ int fd = cpr_find_fd(name, id);
+
+ if (fd < 0) {
+ fd = qemu_open(path, flags, errp);
+ if (fd >= 0) {
+ cpr_save_fd(name, id, fd);
+ }
+ }
+ return fd;
+}
+
/*************************************************************************/
#define CPR_STATE "CprState"
@@ -228,3 +258,9 @@ void cpr_state_close(void)
cpr_state_file = NULL;
}
}
+
+bool cpr_incoming_needed(void *opaque)
+{
+ MigMode mode = migrate_mode();
+ return mode == MIG_MODE_CPR_TRANSFER;
+}
--
1.8.3.1
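
Editorial sketch (not part of the patch): the find-or-open behaviour of
cpr_open_fd can be modelled outside QEMU as below. This is a minimal
illustration under stated assumptions. The sketch_* names, the fixed-size
table, and the use of plain open() instead of qemu_open() are hypothetical
stand-ins, not QEMU's implementation.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_FDS 32

/* Hypothetical stand-in for one saved descriptor; not QEMU's CprFd. */
typedef struct {
    char name[64];
    int id;
    int fd;
} SketchFd;

static SketchFd sketch_fds[SKETCH_MAX_FDS];
static int sketch_nfds;

/* Return the fd saved under (name, id), or -1 if nothing is recorded. */
static int sketch_find_fd(const char *name, int id)
{
    for (int i = 0; i < sketch_nfds; i++) {
        if (sketch_fds[i].id == id && strcmp(sketch_fds[i].name, name) == 0) {
            return sketch_fds[i].fd;
        }
    }
    return -1;
}

/* Record fd under (name, id) unconditionally. */
static void sketch_save_fd(const char *name, int id, int fd)
{
    assert(sketch_nfds < SKETCH_MAX_FDS);
    SketchFd *e = &sketch_fds[sketch_nfds++];
    snprintf(e->name, sizeof(e->name), "%s", name);
    e->id = id;
    e->fd = fd;
}

/* Reuse a saved descriptor if one exists; otherwise open and remember it. */
static int sketch_open_fd(const char *path, int flags,
                          const char *name, int id)
{
    int fd = sketch_find_fd(name, id);

    if (fd < 0) {
        fd = open(path, flags);
        if (fd >= 0) {
            sketch_save_fd(name, id, fd);
        }
    }
    return fd;
}
```

In this model a second sketch_open_fd() call with the same (name, id) returns
the descriptor recorded by the first call rather than opening the path again,
which is the property the vfio code relies on after a restart.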
* [PATCH V5 02/38] migration: lower handler priority
From: Steve Sistare @ 2025-06-10 15:39 UTC
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Define a vmstate priority that is lower than the default, so its handlers
run after all default priority handlers. Since 0 is no longer the default
priority, translate an uninitialized priority of 0 to MIG_PRI_DEFAULT.
CPR for vfio will use this to install handlers for containers that run
after handlers for the devices that they contain.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
include/migration/vmstate.h | 6 +++++-
migration/savevm.c | 4 ++--
2 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index a1dfab4..1ff7bd9 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -155,7 +155,11 @@ enum VMStateFlags {
};
typedef enum {
- MIG_PRI_DEFAULT = 0,
+ MIG_PRI_UNINITIALIZED = 0, /* An uninitialized priority field maps to */
+ /* MIG_PRI_DEFAULT in save_state_priority */
+
+ MIG_PRI_LOW, /* Must happen after default */
+ MIG_PRI_DEFAULT,
MIG_PRI_IOMMU, /* Must happen before PCI devices */
MIG_PRI_PCI_BUS, /* Must happen before IOMMU */
MIG_PRI_VIRTIO_MEM, /* Must happen before IOMMU */
diff --git a/migration/savevm.c b/migration/savevm.c
index 52105dd..bb04a45 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -266,7 +266,7 @@ typedef struct SaveState {
static SaveState savevm_state = {
.handlers = QTAILQ_HEAD_INITIALIZER(savevm_state.handlers),
- .handler_pri_head = { [MIG_PRI_DEFAULT ... MIG_PRI_MAX] = NULL },
+ .handler_pri_head = { [0 ... MIG_PRI_MAX] = NULL },
.global_section_id = 0,
};
@@ -737,7 +737,7 @@ static int calculate_compat_instance_id(const char *idstr)
static inline MigrationPriority save_state_priority(SaveStateEntry *se)
{
- if (se->vmsd) {
+ if (se->vmsd && se->vmsd->priority) {
return se->vmsd->priority;
}
return MIG_PRI_DEFAULT;
--
1.8.3.1
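
Editorial sketch (not part of the patch): the translation of an unset
priority to the default, and the resulting handler order, can be modelled
as below. This is an illustration only. The Sketch* types, the reduced enum,
and the use of qsort() are assumptions made for the example; QEMU keeps its
handlers in a sorted list rather than sorting an array.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Reduced, hypothetical mirror of the enum layout after this patch. */
typedef enum {
    SKETCH_PRI_UNINITIALIZED = 0,  /* field never set by the device */
    SKETCH_PRI_LOW,                /* runs after default */
    SKETCH_PRI_DEFAULT,
    SKETCH_PRI_IOMMU,              /* runs before default */
} SketchPriority;

typedef struct {
    const char *name;
    SketchPriority priority;       /* 0 when the author left it unset */
} SketchHandler;

/* An unset (zero) priority is treated as the default priority. */
static SketchPriority sketch_effective_priority(const SketchHandler *h)
{
    return h->priority ? h->priority : SKETCH_PRI_DEFAULT;
}

/* qsort comparator: higher effective priority sorts first. */
static int sketch_by_priority_desc(const void *a, const void *b)
{
    int pa = sketch_effective_priority(a);
    int pb = sketch_effective_priority(b);

    return pb - pa;
}

/* Order handlers so that higher priorities run before lower ones. */
static void sketch_sort_handlers(SketchHandler *hs, size_t n)
{
    qsort(hs, n, sizeof(hs[0]), sketch_by_priority_desc);
}
```

With one handler at each of IOMMU, unset, and LOW priority, the unset handler
is ordered as if it were DEFAULT, so it lands after IOMMU and before LOW.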
* [PATCH V5 03/38] vfio/container: register container for cpr
From: Steve Sistare @ 2025-06-10 15:39 UTC
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Register a legacy container for cpr-transfer, replacing the generic CPR
register call with a more specific legacy container register call. Add a
blocker if the kernel does not support VFIO_UPDATE_VADDR or VFIO_UNMAP_ALL.
This is mostly boilerplate. The fields to be saved and restored are added
in subsequent patches.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
include/hw/vfio/vfio-container.h | 2 ++
include/hw/vfio/vfio-cpr.h | 15 +++++++++
hw/vfio/container.c | 7 ++---
hw/vfio/cpr-legacy.c | 68 ++++++++++++++++++++++++++++++++++++++++
hw/vfio/cpr.c | 5 ++-
hw/vfio/meson.build | 1 +
6 files changed, 91 insertions(+), 7 deletions(-)
create mode 100644 hw/vfio/cpr-legacy.c
diff --git a/include/hw/vfio/vfio-container.h b/include/hw/vfio/vfio-container.h
index afc498d..21e5807 100644
--- a/include/hw/vfio/vfio-container.h
+++ b/include/hw/vfio/vfio-container.h
@@ -10,6 +10,7 @@
#define HW_VFIO_CONTAINER_H
#include "hw/vfio/vfio-container-base.h"
+#include "hw/vfio/vfio-cpr.h"
typedef struct VFIOContainer VFIOContainer;
typedef struct VFIODevice VFIODevice;
@@ -29,6 +30,7 @@ typedef struct VFIOContainer {
int fd; /* /dev/vfio/vfio, empowered by the attached groups */
unsigned iommu_type;
QLIST_HEAD(, VFIOGroup) group_list;
+ VFIOContainerCPR cpr;
} VFIOContainer;
OBJECT_DECLARE_SIMPLE_TYPE(VFIOContainer, VFIO_IOMMU_LEGACY);
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 750ea5b..d4e0bd5 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -9,8 +9,23 @@
#ifndef HW_VFIO_VFIO_CPR_H
#define HW_VFIO_VFIO_CPR_H
+#include "migration/misc.h"
+
+struct VFIOContainer;
struct VFIOContainerBase;
+typedef struct VFIOContainerCPR {
+ Error *blocker;
+} VFIOContainerCPR;
+
+
+bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
+ Error **errp);
+void vfio_legacy_cpr_unregister_container(struct VFIOContainer *container);
+
+int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
+ Error **errp);
+
bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
Error **errp);
void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 0f948d0..93cdf80 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -33,7 +33,6 @@
#include "qapi/error.h"
#include "pci.h"
#include "hw/vfio/vfio-container.h"
-#include "hw/vfio/vfio-cpr.h"
#include "vfio-helpers.h"
#include "vfio-listener.h"
@@ -643,7 +642,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
new_container = true;
bcontainer = &container->bcontainer;
- if (!vfio_cpr_register_container(bcontainer, errp)) {
+ if (!vfio_legacy_cpr_register_container(container, errp)) {
goto fail;
}
@@ -679,7 +678,7 @@ fail:
vioc->release(bcontainer);
}
if (new_container) {
- vfio_cpr_unregister_container(bcontainer);
+ vfio_legacy_cpr_unregister_container(container);
object_unref(container);
}
if (fd >= 0) {
@@ -720,7 +719,7 @@ static void vfio_container_disconnect(VFIOGroup *group)
VFIOAddressSpace *space = bcontainer->space;
trace_vfio_container_disconnect(container->fd);
- vfio_cpr_unregister_container(bcontainer);
+ vfio_legacy_cpr_unregister_container(container);
close(container->fd);
object_unref(container);
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
new file mode 100644
index 0000000..dd7ac84
--- /dev/null
+++ b/hw/vfio/cpr-legacy.c
@@ -0,0 +1,68 @@
+/*
+ * Copyright (c) 2021-2025 Oracle and/or its affiliates.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+#include "qemu/osdep.h"
+#include "hw/vfio/vfio-container.h"
+#include "migration/blocker.h"
+#include "migration/cpr.h"
+#include "migration/migration.h"
+#include "migration/vmstate.h"
+#include "qapi/error.h"
+
+static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
+{
+ if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
+ error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR");
+ return false;
+
+ } else if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
+ error_setg(errp, "VFIO container does not support VFIO_UNMAP_ALL");
+ return false;
+
+ } else {
+ return true;
+ }
+}
+
+static const VMStateDescription vfio_container_vmstate = {
+ .name = "vfio-container",
+ .version_id = 0,
+ .minimum_version_id = 0,
+ .needed = cpr_incoming_needed,
+ .fields = (VMStateField[]) {
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
+{
+ VFIOContainerBase *bcontainer = &container->bcontainer;
+ Error **cpr_blocker = &container->cpr.blocker;
+
+ migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
+ vfio_cpr_reboot_notifier,
+ MIG_MODE_CPR_REBOOT);
+
+ if (!vfio_cpr_supported(container, cpr_blocker)) {
+ return migrate_add_blocker_modes(cpr_blocker, errp,
+ MIG_MODE_CPR_TRANSFER, -1) == 0;
+ }
+
+ vmstate_register(NULL, -1, &vfio_container_vmstate, container);
+
+ return true;
+}
+
+void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
+{
+ VFIOContainerBase *bcontainer = &container->bcontainer;
+
+ migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
+ migrate_del_blocker(&container->cpr.blocker);
+ vmstate_unregister(NULL, &vfio_container_vmstate, container);
+}
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index 0210e76..0e59612 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -7,13 +7,12 @@
#include "qemu/osdep.h"
#include "hw/vfio/vfio-device.h"
-#include "migration/misc.h"
#include "hw/vfio/vfio-cpr.h"
#include "qapi/error.h"
#include "system/runstate.h"
-static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
- MigrationEvent *e, Error **errp)
+int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
+ MigrationEvent *e, Error **errp)
{
if (e->type == MIG_EVENT_PRECOPY_SETUP &&
!runstate_check(RUN_STATE_SUSPENDED) && !vm_get_suspended()) {
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index bccb050..73d29f9 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -21,6 +21,7 @@ system_ss.add(when: 'CONFIG_VFIO_XGMAC', if_true: files('calxeda-xgmac.c'))
system_ss.add(when: 'CONFIG_VFIO_AMD_XGBE', if_true: files('amd-xgbe.c'))
system_ss.add(when: 'CONFIG_VFIO', if_true: files(
'cpr.c',
+ 'cpr-legacy.c',
'device.c',
'migration.c',
'migration-multifd.c',
--
1.8.3.1
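
Editorial sketch (not part of the patch): the register-or-block decision in
vfio_legacy_cpr_register_container can be modelled as below. This is a
simplified illustration. The Sketch* names, the extension ids, and the probe
callback that stands in for ioctl(fd, VFIO_CHECK_EXTENSION, ...) are all
hypothetical. Unlike the real function, the sketch cannot fail while adding
the blocker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical extension ids for this sketch; not the kernel's values. */
enum { SKETCH_EXT_UPDATE_VADDR = 1, SKETCH_EXT_UNMAP_ALL = 2 };

/* Stand-in for the VFIO_CHECK_EXTENSION ioctl: nonzero if supported. */
typedef int (*SketchProbe)(int ext);

typedef struct {
    const char *blocker;       /* why cpr-transfer is blocked, or NULL */
    bool vmstate_registered;   /* true once container state would be saved */
} SketchContainer;

/* Probe stub: every extension is available. */
static int sketch_probe_all(int ext)
{
    (void)ext;
    return 1;
}

/* Probe stub: everything except the unmap-all extension is available. */
static int sketch_probe_no_unmap_all(int ext)
{
    return ext != SKETCH_EXT_UNMAP_ALL;
}

/*
 * A missing extension does not fail container setup. It only records a
 * blocker for cpr-transfer and skips the vmstate registration.
 */
static bool sketch_register_container(SketchContainer *c, SketchProbe probe)
{
    c->blocker = NULL;
    c->vmstate_registered = false;

    if (!probe(SKETCH_EXT_UPDATE_VADDR)) {
        c->blocker = "container does not support VFIO_UPDATE_VADDR";
        return true;
    }
    if (!probe(SKETCH_EXT_UNMAP_ALL)) {
        c->blocker = "container does not support VFIO_UNMAP_ALL";
        return true;
    }
    c->vmstate_registered = true;
    return true;
}
```

The point of the sketch is that an older kernel still gets a working
container; only the cpr-transfer mode is refused.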
* [PATCH V5 04/38] vfio/container: preserve descriptors
From: Steve Sistare @ 2025-06-10 15:39 UTC
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
At vfio creation time, save the value of vfio container, group, and device
descriptors in CPR state. On QEMU restart, vfio_realize() finds and uses
the saved descriptors.
During reuse, device and iommu state is already configured, so operations
in vfio_realize() that would modify the configuration, such as vfio ioctls,
are skipped. The result is that vfio_realize() constructs QEMU data
structures that reflect the current state of the device.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
include/hw/vfio/vfio-cpr.h | 6 +++++
hw/vfio/container.c | 67 +++++++++++++++++++++++++++++++++++-----------
hw/vfio/cpr-legacy.c | 42 +++++++++++++++++++++++++++++
3 files changed, 100 insertions(+), 15 deletions(-)
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index d4e0bd5..5a2e5f6 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -13,6 +13,7 @@
struct VFIOContainer;
struct VFIOContainerBase;
+struct VFIOGroup;
typedef struct VFIOContainerCPR {
Error *blocker;
@@ -30,4 +31,9 @@ bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
Error **errp);
void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
+int vfio_cpr_group_get_device_fd(int d, const char *name);
+
+bool vfio_cpr_container_match(struct VFIOContainer *container,
+ struct VFIOGroup *group, int fd);
+
#endif /* HW_VFIO_VFIO_CPR_H */
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 93cdf80..5caae4c 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -31,6 +31,8 @@
#include "system/reset.h"
#include "trace.h"
#include "qapi/error.h"
+#include "migration/cpr.h"
+#include "migration/blocker.h"
#include "pci.h"
#include "hw/vfio/vfio-container.h"
#include "vfio-helpers.h"
@@ -425,7 +427,12 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
return NULL;
}
- if (!vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
+ /*
+ * During CPR, just set the container type and skip the ioctls, as the
+ * container and group are already configured in the kernel.
+ */
+ if (!cpr_is_incoming() &&
+ !vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
return NULL;
}
@@ -592,6 +599,11 @@ static bool vfio_container_group_add(VFIOContainer *container, VFIOGroup *group,
group->container = container;
QLIST_INSERT_HEAD(&container->group_list, group, container_next);
vfio_group_add_kvm_device(group);
+ /*
+ * Remember the container fd for each group, so we can attach to the same
+ * container after CPR.
+ */
+ cpr_resave_fd("vfio_container_for_group", group->groupid, container->fd);
return true;
}
@@ -601,6 +613,7 @@ static void vfio_container_group_del(VFIOContainer *container, VFIOGroup *group)
group->container = NULL;
vfio_group_del_kvm_device(group);
vfio_ram_block_discard_disable(container, false);
+ cpr_delete_fd("vfio_container_for_group", group->groupid);
}
static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
@@ -615,17 +628,34 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
bool group_was_added = false;
space = vfio_address_space_get(as);
+ fd = cpr_find_fd("vfio_container_for_group", group->groupid);
- QLIST_FOREACH(bcontainer, &space->containers, next) {
- container = container_of(bcontainer, VFIOContainer, bcontainer);
- if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
- return vfio_container_group_add(container, group, errp);
+ if (!cpr_is_incoming()) {
+ QLIST_FOREACH(bcontainer, &space->containers, next) {
+ container = container_of(bcontainer, VFIOContainer, bcontainer);
+ if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
+ return vfio_container_group_add(container, group, errp);
+ }
}
- }
- fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
- if (fd < 0) {
- goto fail;
+ fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
+ if (fd < 0) {
+ goto fail;
+ }
+ } else {
+ /*
+ * For incoming CPR, the group is already attached in the kernel.
+ * If a container with matching fd is found, then update the
+ * userland group list and return. If not, then after the loop,
+ * create the container struct and group list.
+ */
+ QLIST_FOREACH(bcontainer, &space->containers, next) {
+ container = container_of(bcontainer, VFIOContainer, bcontainer);
+
+ if (vfio_cpr_container_match(container, group, fd)) {
+ return vfio_container_group_add(container, group, errp);
+ }
+ }
}
ret = ioctl(fd, VFIO_GET_API_VERSION);
@@ -697,6 +727,7 @@ static void vfio_container_disconnect(VFIOGroup *group)
QLIST_REMOVE(group, container_next);
group->container = NULL;
+ cpr_delete_fd("vfio_container_for_group", group->groupid);
/*
* Explicitly release the listener first before unset container,
@@ -750,7 +781,7 @@ static VFIOGroup *vfio_group_get(int groupid, AddressSpace *as, Error **errp)
group = g_malloc0(sizeof(*group));
snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
- group->fd = qemu_open(path, O_RDWR, errp);
+ group->fd = cpr_open_fd(path, O_RDWR, "vfio_group", groupid, errp);
if (group->fd < 0) {
goto free_group_exit;
}
@@ -782,6 +813,7 @@ static VFIOGroup *vfio_group_get(int groupid, AddressSpace *as, Error **errp)
return group;
close_fd_exit:
+ cpr_delete_fd("vfio_group", groupid);
close(group->fd);
free_group_exit:
@@ -803,6 +835,7 @@ static void vfio_group_put(VFIOGroup *group)
vfio_container_disconnect(group);
QLIST_REMOVE(group, next);
trace_vfio_group_put(group->fd);
+ cpr_delete_fd("vfio_group", group->groupid);
close(group->fd);
g_free(group);
}
@@ -813,7 +846,7 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
g_autofree struct vfio_device_info *info = NULL;
int fd;
- fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
+ fd = vfio_cpr_group_get_device_fd(group->fd, name);
if (fd < 0) {
error_setg_errno(errp, errno, "error getting device from group %d",
group->groupid);
@@ -826,8 +859,7 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
info = vfio_get_device_info(fd);
if (!info) {
error_setg_errno(errp, errno, "error getting device info");
- close(fd);
- return false;
+ goto fail;
}
/*
@@ -841,8 +873,7 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
if (!QLIST_EMPTY(&group->device_list)) {
error_setg(errp, "Inconsistent setting of support for discarding "
"RAM (e.g., balloon) within group");
- close(fd);
- return false;
+ goto fail;
}
if (!group->ram_block_discard_allowed) {
@@ -860,6 +891,11 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
trace_vfio_device_get(name, info->flags, info->num_regions, info->num_irqs);
return true;
+
+fail:
+ close(fd);
+ cpr_delete_fd(name, 0);
+ return false;
}
static void vfio_device_put(VFIODevice *vbasedev)
@@ -870,6 +906,7 @@ static void vfio_device_put(VFIODevice *vbasedev)
QLIST_REMOVE(vbasedev, next);
vbasedev->group = NULL;
trace_vfio_device_put(vbasedev->fd);
+ cpr_delete_fd(vbasedev->name, 0);
close(vbasedev->fd);
}
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
index dd7ac84..ac4a9ab 100644
--- a/hw/vfio/cpr-legacy.c
+++ b/hw/vfio/cpr-legacy.c
@@ -8,6 +8,7 @@
#include <linux/vfio.h>
#include "qemu/osdep.h"
#include "hw/vfio/vfio-container.h"
+#include "hw/vfio/vfio-device.h"
#include "migration/blocker.h"
#include "migration/cpr.h"
#include "migration/migration.h"
@@ -66,3 +67,44 @@ void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
migrate_del_blocker(&container->cpr.blocker);
vmstate_unregister(NULL, &vfio_container_vmstate, container);
}
+
+int vfio_cpr_group_get_device_fd(int d, const char *name)
+{
+ const int id = 0;
+ int fd = cpr_find_fd(name, id);
+
+ if (fd < 0) {
+ fd = ioctl(d, VFIO_GROUP_GET_DEVICE_FD, name);
+ if (fd >= 0) {
+ cpr_save_fd(name, id, fd);
+ }
+ }
+ return fd;
+}
+
+static bool same_device(int fd1, int fd2)
+{
+ struct stat st1, st2;
+
+ return !fstat(fd1, &st1) && !fstat(fd2, &st2) && st1.st_dev == st2.st_dev;
+}
+
+bool vfio_cpr_container_match(VFIOContainer *container, VFIOGroup *group,
+ int fd)
+{
+ if (container->fd == fd) {
+ return true;
+ }
+ if (!same_device(container->fd, fd)) {
+ return false;
+ }
+ /*
+ * Same device, different fd. This occurs when the container fd is
+ * cpr_save'd multiple times, once for each groupid, so SCM_RIGHTS
+ * produces duplicates. De-dup it.
+ */
+ cpr_delete_fd("vfio_container_for_group", group->groupid);
+ close(fd);
+ cpr_save_fd("vfio_container_for_group", group->groupid, container->fd);
+ return true;
+}
--
1.8.3.1
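
Editorial sketch (not part of the patch): the de-duplication idea in
vfio_cpr_container_match can be tried in isolation as below. This is an
illustration under assumptions. The sketch_* names are hypothetical, dup()
stands in for a duplicate descriptor received over SCM_RIGHTS, and the
comparison also checks st_ino, which is stricter than the st_dev-only test
in the patch.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* True if both descriptors refer to the same file (device and inode). */
static bool sketch_same_file(int fd1, int fd2)
{
    struct stat st1, st2;

    return fstat(fd1, &st1) == 0 && fstat(fd2, &st2) == 0 &&
           st1.st_dev == st2.st_dev && st1.st_ino == st2.st_ino;
}

/*
 * Decide whether incoming_fd matches a descriptor we already track.
 * Returns known_fd on a match, closing incoming_fd if it was a duplicate
 * with a different number. Returns -1 when the files differ.
 */
static int sketch_dedup_fd(int known_fd, int incoming_fd)
{
    if (known_fd == incoming_fd) {
        return known_fd;
    }
    if (!sketch_same_file(known_fd, incoming_fd)) {
        return -1;
    }
    /* Same file, different number: drop the duplicate. */
    close(incoming_fd);
    return known_fd;
}
```

After a successful de-dup, the caller would record known_fd under the
group's key, as the patch does with cpr_save_fd().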
* RE: [PATCH V5 04/38] vfio/container: preserve descriptors
From: Duan, Zhenzhong @ 2025-06-23 9:07 UTC
To: Steve Sistare, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V5 04/38] vfio/container: preserve descriptors
>
>[...]
>
>@@ -592,6 +599,11 @@ static bool vfio_container_group_add(VFIOContainer
>*container, VFIOGroup *group,
> group->container = container;
> QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> vfio_group_add_kvm_device(group);
>+ /*
>+ * Remember the container fd for each group, so we can attach to the same
>+ * container after CPR.
>+ */
>+ cpr_resave_fd("vfio_container_for_group", group->groupid, container->fd);
I know this is already merged. Just out of curiosity, it looks like cpr_save_fd would be enough?
Zhenzhong
* Re: [PATCH V5 04/38] vfio/container: preserve descriptors
From: Steven Sistare @ 2025-07-01 14:25 UTC
To: Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 6/23/2025 5:07 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steve Sistare <steven.sistare@oracle.com>
>> Subject: [PATCH V5 04/38] vfio/container: preserve descriptors
>>
>> [...]
>>
>> @@ -592,6 +599,11 @@ static bool vfio_container_group_add(VFIOContainer
>> *container, VFIOGroup *group,
>> group->container = container;
>> QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>> vfio_group_add_kvm_device(group);
>> + /*
>> + * Remember the container fd for each group, so we can attach to the same
>> + * container after CPR.
>> + */
>> + cpr_resave_fd("vfio_container_for_group", group->groupid, container->fd);
>
> I know this is already merged. Just out of curiosity, it looks like cpr_save_fd would be enough?
vfio_container_group_add is called from multiple places. In some, we know that the fd
is being saved for the first time, in others we do not know. resave avoids creating
a duplicate entry.
- Steve
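
Editorial sketch (not part of the thread): the difference between a plain
save and a resave can be modelled as below. This is a minimal illustration.
The sketch_* names and the array-backed table are hypothetical, and a
conflicting value is reported with a -1 return here, whereas the patch
treats it as a fatal internal error.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_SAVED 16

/* Hypothetical stand-in for a saved (name, id) -> fd entry. */
typedef struct {
    char name[48];
    int id;
    int fd;
} SketchEntry;

static SketchEntry sketch_table[SKETCH_MAX_SAVED];
static int sketch_count;

/* Return the entry saved under (name, id), or NULL if there is none. */
static SketchEntry *sketch_lookup(const char *name, int id)
{
    for (int i = 0; i < sketch_count; i++) {
        if (sketch_table[i].id == id &&
            strcmp(sketch_table[i].name, name) == 0) {
            return &sketch_table[i];
        }
    }
    return NULL;
}

/* Plain save: always appends, so calling it twice leaves two entries. */
static void sketch_save(const char *name, int id, int fd)
{
    assert(sketch_count < SKETCH_MAX_SAVED);
    SketchEntry *e = &sketch_table[sketch_count++];
    snprintf(e->name, sizeof(e->name), "%s", name);
    e->id = id;
    e->fd = fd;
}

/*
 * Resave: append only if (name, id) is not yet recorded. Returns 0 if the
 * value is now recorded (newly or already), -1 if a different value was
 * recorded earlier.
 */
static int sketch_resave(const char *name, int id, int fd)
{
    SketchEntry *e = sketch_lookup(name, id);

    if (!e) {
        sketch_save(name, id, fd);
        return 0;
    }
    return e->fd == fd ? 0 : -1;
}
```

Calling sketch_resave() twice with the same arguments leaves a single entry,
which is the property that makes it safe on code paths where the caller
cannot tell whether the fd was saved before.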
* RE: [PATCH V5 04/38] vfio/container: preserve descriptors
From: Duan, Zhenzhong @ 2025-07-02 14:23 UTC
To: Steven Sistare, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
>-----Original Message-----
>From: Steven Sistare <steven.sistare@oracle.com>
>Subject: Re: [PATCH V5 04/38] vfio/container: preserve descriptors
>
>On 6/23/2025 5:07 AM, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Steve Sistare <steven.sistare@oracle.com>
>>> Subject: [PATCH V5 04/38] vfio/container: preserve descriptors
>>>
>>> At vfio creation time, save the value of vfio container, group, and device
>>> descriptors in CPR state. On qemu restart, vfio_realize() finds and uses
>>> the saved descriptors.
>>>
>>> During reuse, device and iommu state is already configured, so operations
>>> in vfio_realize that would modify the configuration, such as vfio ioctl's,
>>> are skipped. The result is that vfio_realize constructs qemu data
>>> structures that reflect the current state of the device.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> Reviewed-by: Cédric Le Goater <clg@redhat.com>
>>> Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>> include/hw/vfio/vfio-cpr.h | 6 +++++
>>> hw/vfio/container.c | 67
>+++++++++++++++++++++++++++++++++++----------
>>> -
>>> hw/vfio/cpr-legacy.c | 42 +++++++++++++++++++++++++++++
>>> 3 files changed, 100 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>>> index d4e0bd5..5a2e5f6 100644
>>> --- a/include/hw/vfio/vfio-cpr.h
>>> +++ b/include/hw/vfio/vfio-cpr.h
>>> @@ -13,6 +13,7 @@
>>>
>>> struct VFIOContainer;
>>> struct VFIOContainerBase;
>>> +struct VFIOGroup;
>>>
>>> typedef struct VFIOContainerCPR {
>>> Error *blocker;
>>> @@ -30,4 +31,9 @@ bool vfio_cpr_register_container(struct
>VFIOContainerBase
>>> *bcontainer,
>>> Error **errp);
>>> void vfio_cpr_unregister_container(struct VFIOContainerBase
>*bcontainer);
>>>
>>> +int vfio_cpr_group_get_device_fd(int d, const char *name);
>>> +
>>> +bool vfio_cpr_container_match(struct VFIOContainer *container,
>>> + struct VFIOGroup *group, int fd);
>>> +
>>> #endif /* HW_VFIO_VFIO_CPR_H */
>>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>>> index 93cdf80..5caae4c 100644
>>> --- a/hw/vfio/container.c
>>> +++ b/hw/vfio/container.c
>>> @@ -31,6 +31,8 @@
>>> #include "system/reset.h"
>>> #include "trace.h"
>>> #include "qapi/error.h"
>>> +#include "migration/cpr.h"
>>> +#include "migration/blocker.h"
>>> #include "pci.h"
>>> #include "hw/vfio/vfio-container.h"
>>> #include "vfio-helpers.h"
>>> @@ -425,7 +427,12 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
>>> return NULL;
>>> }
>>>
>>> - if (!vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
>>> + /*
>>> + * During CPR, just set the container type and skip the ioctls, as the
>>> + * container and group are already configured in the kernel.
>>> + */
>>> + if (!cpr_is_incoming() &&
>>> + !vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
>>> return NULL;
>>> }
>>>
>>> @@ -592,6 +599,11 @@ static bool vfio_container_group_add(VFIOContainer *container, VFIOGroup *group,
>>> group->container = container;
>>> QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>>> vfio_group_add_kvm_device(group);
>>> + /*
>>> + * Remember the container fd for each group, so we can attach to the same
>>> + * container after CPR.
>>> + */
>>> + cpr_resave_fd("vfio_container_for_group", group->groupid, container->fd);
>>
>> I know this is already merged. Just out of curiosity, it looks like
>> cpr_save_fd is enough?
>
>vfio_container_group_add is called from multiple places. In some, we know
>that the fd is being saved for the first time, in others we do not know.
>resave avoids creating a duplicate entry.
IIUC, vfio_container_group_add is called only once for each group. But it's
harmless to call cpr_resave_fd() here. I'm fine with leaving it as is.
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 101+ messages in thread
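The difference between saving and re-saving discussed in the exchange above can be illustrated with a toy table keyed by (name, id). This is only a sketch under stated assumptions, not QEMU's implementation: toy_save_fd, toy_resave_fd, toy_find_fd and the fixed-size array are hypothetical names, and the behavior on a conflicting fd value (an assertion here) is an assumption that this thread does not confirm.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical toy model of a CPR fd table; all names are illustrative. */
#define TOY_MAX_FDS 16

struct toy_fd_entry {
    const char *name;
    int id;
    int fd;
};

static struct toy_fd_entry toy_fds[TOY_MAX_FDS];
static int toy_nfds;

/* Return the saved fd for (name, id), or -1 if there is no entry. */
static int toy_find_fd(const char *name, int id)
{
    for (int i = 0; i < toy_nfds; i++) {
        if (toy_fds[i].id == id && strcmp(toy_fds[i].name, name) == 0) {
            return toy_fds[i].fd;
        }
    }
    return -1;
}

/* Unconditionally append an entry; calling it twice creates a duplicate. */
static void toy_save_fd(const char *name, int id, int fd)
{
    assert(toy_nfds < TOY_MAX_FDS);
    toy_fds[toy_nfds++] = (struct toy_fd_entry){ name, id, fd };
}

/* Append only if (name, id) is not yet present, so repeat calls are safe. */
static void toy_resave_fd(const char *name, int id, int fd)
{
    int old_fd = toy_find_fd(name, id);

    if (old_fd < 0) {
        toy_save_fd(name, id, fd);
    } else {
        /* Assumption: a repeat call must carry the same fd value. */
        assert(old_fd == fd);
    }
}
```

With this model, a caller that cannot tell whether the fd was saved earlier can always call toy_resave_fd, which is the property Steve describes for vfio_container_group_add.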
* [PATCH V5 05/38] vfio/container: discard old DMA vaddr
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (3 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 04/38] vfio/container: preserve descriptors Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-10 15:39 ` [PATCH V5 06/38] vfio/container: restore " Steve Sistare
` (33 subsequent siblings)
38 siblings, 0 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
In the container pre_save handler, discard the virtual addresses in DMA
mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest RAM will be
remapped at a different VA in new QEMU. DMA to already-mapped
pages continues.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
hw/vfio/cpr-legacy.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
index ac4a9ab..ef106d0 100644
--- a/hw/vfio/cpr-legacy.c
+++ b/hw/vfio/cpr-legacy.c
@@ -15,6 +15,22 @@
#include "migration/vmstate.h"
#include "qapi/error.h"
+static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
+{
+ struct vfio_iommu_type1_dma_unmap unmap = {
+ .argsz = sizeof(unmap),
+ .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
+ .iova = 0,
+ .size = 0,
+ };
+ if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+ error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
+ return false;
+ }
+ return true;
+}
+
+
static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
{
if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
@@ -30,10 +46,23 @@ static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
}
}
+static int vfio_container_pre_save(void *opaque)
+{
+ VFIOContainer *container = opaque;
+ Error *local_err = NULL;
+
+ if (!vfio_dma_unmap_vaddr_all(container, &local_err)) {
+ error_report_err(local_err);
+ return -1;
+ }
+ return 0;
+}
+
static const VMStateDescription vfio_container_vmstate = {
.name = "vfio-container",
.version_id = 0,
.minimum_version_id = 0,
+ .pre_save = vfio_container_pre_save,
.needed = cpr_incoming_needed,
.fields = (VMStateField[]) {
VMSTATE_END_OF_LIST()
--
1.8.3.1
^ permalink raw reply related [flat|nested] 101+ messages in thread
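The effect described in the commit message above (the virtual address is discarded while the mapping itself stays in place) can be pictured with a small user-space model. This is a sketch only: struct toy_mapping and toy_unmap_vaddr_all are hypothetical names that stand in for state the kernel keeps, and nothing here calls the real VFIO ioctl.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one DMA mapping as remembered by the kernel. */
struct toy_mapping {
    uint64_t iova;      /* device-visible address, unchanged by the unmap */
    uint64_t size;      /* length of the mapping, unchanged by the unmap */
    uint64_t vaddr;     /* process virtual address backing the mapping */
    bool vaddr_valid;   /* false after the vaddr has been discarded */
};

/*
 * Model of an unmap with both the VADDR and ALL flags: every mapping keeps
 * its iova and size, and only the vaddr is marked invalid.
 * Returns the number of mappings whose vaddr was invalidated.
 */
static size_t toy_unmap_vaddr_all(struct toy_mapping *maps, size_t n)
{
    size_t invalidated = 0;

    for (size_t i = 0; i < n; i++) {
        if (maps[i].vaddr_valid) {
            maps[i].vaddr_valid = false;
            invalidated++;
        }
    }
    return invalidated;
}
```

In this model the iova ranges survive the call untouched, which corresponds to the statement that DMA to already-mapped pages continues.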
* [PATCH V5 06/38] vfio/container: restore DMA vaddr
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (4 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 05/38] vfio/container: discard old DMA vaddr Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-10 15:39 ` [PATCH V5 07/38] vfio/container: mdev cpr blocker Steve Sistare
` (32 subsequent siblings)
38 siblings, 0 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
In new QEMU, do not register the memory listener at device creation time.
Register it later, in the container post_load handler, after all vmstate
that may affect regions and mapping boundaries has been loaded. The
post_load registration will cause the listener to invoke its callback on
each flat section, and the calls will match the mappings remembered by the
kernel.
The listener calls a special dma_map handler that passes the new VA of each
section to the kernel using VFIO_DMA_MAP_FLAG_VADDR. Restore the normal
handler at the end.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
include/hw/vfio/vfio-cpr.h | 3 +++
hw/vfio/container.c | 15 ++++++++++--
hw/vfio/cpr-legacy.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 73 insertions(+), 2 deletions(-)
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 5a2e5f6..0462447 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -17,6 +17,9 @@ struct VFIOGroup;
typedef struct VFIOContainerCPR {
Error *blocker;
+ int (*saved_dma_map)(const struct VFIOContainerBase *bcontainer,
+ hwaddr iova, ram_addr_t size,
+ void *vaddr, bool readonly, MemoryRegion *mr);
} VFIOContainerCPR;
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 5caae4c..936ce37 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -136,6 +136,8 @@ static int vfio_legacy_dma_unmap_one(const VFIOContainerBase *bcontainer,
int ret;
Error *local_err = NULL;
+ g_assert(!cpr_is_incoming());
+
if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer)) {
if (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
bcontainer->dirty_pages_supported) {
@@ -690,8 +692,17 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
}
group_was_added = true;
- if (!vfio_listener_register(bcontainer, errp)) {
- goto fail;
+ /*
+ * If CPR, register the listener later, after all state that may
+ * affect regions and mapping boundaries has been cpr load'ed. Later,
+ * the listener will invoke its callback on each flat section and call
+ * dma_map to supply the new vaddr, and the calls will match the mappings
+ * remembered by the kernel.
+ */
+ if (!cpr_is_incoming()) {
+ if (!vfio_listener_register(bcontainer, errp)) {
+ goto fail;
+ }
}
bcontainer->initialized = true;
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
index ef106d0..2fd8348 100644
--- a/hw/vfio/cpr-legacy.c
+++ b/hw/vfio/cpr-legacy.c
@@ -9,11 +9,13 @@
#include "qemu/osdep.h"
#include "hw/vfio/vfio-container.h"
#include "hw/vfio/vfio-device.h"
+#include "hw/vfio/vfio-listener.h"
#include "migration/blocker.h"
#include "migration/cpr.h"
#include "migration/migration.h"
#include "migration/vmstate.h"
#include "qapi/error.h"
+#include "qemu/error-report.h"
static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
{
@@ -30,6 +32,32 @@ static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
return true;
}
+/*
+ * Set the new @vaddr for any mappings registered during cpr load.
+ * The incoming state is cleared thereafter.
+ */
+static int vfio_legacy_cpr_dma_map(const VFIOContainerBase *bcontainer,
+ hwaddr iova, ram_addr_t size, void *vaddr,
+ bool readonly, MemoryRegion *mr)
+{
+ const VFIOContainer *container = container_of(bcontainer, VFIOContainer,
+ bcontainer);
+ struct vfio_iommu_type1_dma_map map = {
+ .argsz = sizeof(map),
+ .flags = VFIO_DMA_MAP_FLAG_VADDR,
+ .vaddr = (__u64)(uintptr_t)vaddr,
+ .iova = iova,
+ .size = size,
+ };
+
+ g_assert(cpr_is_incoming());
+
+ if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
+ return -errno;
+ }
+
+ return 0;
+}
static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
{
@@ -58,11 +86,34 @@ static int vfio_container_pre_save(void *opaque)
return 0;
}
+static int vfio_container_post_load(void *opaque, int version_id)
+{
+ VFIOContainer *container = opaque;
+ VFIOContainerBase *bcontainer = &container->bcontainer;
+ VFIOGroup *group;
+ Error *local_err = NULL;
+
+ if (!vfio_listener_register(bcontainer, &local_err)) {
+ error_report_err(local_err);
+ return -1;
+ }
+
+ QLIST_FOREACH(group, &container->group_list, container_next) {
+ VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
+
+ /* Restore original dma_map function */
+ vioc->dma_map = container->cpr.saved_dma_map;
+ }
+ return 0;
+}
+
static const VMStateDescription vfio_container_vmstate = {
.name = "vfio-container",
.version_id = 0,
.minimum_version_id = 0,
+ .priority = MIG_PRI_LOW, /* Must happen after devices and groups */
.pre_save = vfio_container_pre_save,
+ .post_load = vfio_container_post_load,
.needed = cpr_incoming_needed,
.fields = (VMStateField[]) {
VMSTATE_END_OF_LIST()
@@ -85,6 +136,12 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
vmstate_register(NULL, -1, &vfio_container_vmstate, container);
+ /* During incoming CPR, divert calls to dma_map. */
+ if (cpr_is_incoming()) {
+ VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
+ container->cpr.saved_dma_map = vioc->dma_map;
+ vioc->dma_map = vfio_legacy_cpr_dma_map;
+ }
return true;
}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 101+ messages in thread
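The save, divert, and restore sequence that the patch above applies to the dma_map handler can be sketched in isolation. This is a minimal model, assuming simplified types: toy_iommu_class, toy_container, and the two handlers are hypothetical stand-ins for VFIOIOMMUClass, VFIOContainer, and the real map functions, and the call counters exist only to make the behavior observable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified handler signature. */
typedef int (*toy_dma_map_fn)(uint64_t iova, uint64_t size, void *vaddr);

struct toy_iommu_class {
    toy_dma_map_fn dma_map;         /* handler shared by users of the class */
};

struct toy_container {
    struct toy_iommu_class *vioc;
    toy_dma_map_fn saved_dma_map;   /* original handler, kept for restore */
};

static int toy_normal_calls;
static int toy_cpr_calls;

/* Stand-in for the normal handler that creates a new mapping. */
static int toy_normal_dma_map(uint64_t iova, uint64_t size, void *vaddr)
{
    (void)iova; (void)size; (void)vaddr;
    toy_normal_calls++;
    return 0;
}

/* Stand-in for the CPR handler that only supplies a new vaddr. */
static int toy_cpr_dma_map(uint64_t iova, uint64_t size, void *vaddr)
{
    (void)iova; (void)size; (void)vaddr;
    toy_cpr_calls++;
    return 0;
}

/* At registration during incoming CPR: remember the handler, then divert. */
static void toy_divert_dma_map(struct toy_container *c)
{
    c->saved_dma_map = c->vioc->dma_map;
    c->vioc->dma_map = toy_cpr_dma_map;
}

/* In post_load, after the listener has replayed all sections: restore. */
static void toy_restore_dma_map(struct toy_container *c)
{
    c->vioc->dma_map = c->saved_dma_map;
}
```

One consequence visible in the sketch, as in the patch, is that the handler lives in the class rather than in the container, so anything that calls dma_map through the class between the divert and the restore reaches the CPR handler.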
* [PATCH V5 07/38] vfio/container: mdev cpr blocker
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (5 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 06/38] vfio/container: restore " Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-10 15:39 ` [PATCH V5 08/38] vfio/container: recover from unmap-all-vaddr failure Steve Sistare
` (31 subsequent siblings)
38 siblings, 0 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
During CPR, after VFIO_DMA_UNMAP_FLAG_VADDR, the vaddr is temporarily
invalid, so mediated devices cannot be supported. Add a blocker for them.
This restriction will not apply to iommufd containers when CPR is added
for them in a future patch.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
include/hw/vfio/vfio-cpr.h | 3 +++
include/hw/vfio/vfio-device.h | 2 ++
hw/vfio/container.c | 8 ++++++++
3 files changed, 13 insertions(+)
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 0462447..b83dd42 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -22,6 +22,9 @@ typedef struct VFIOContainerCPR {
void *vaddr, bool readonly, MemoryRegion *mr);
} VFIOContainerCPR;
+typedef struct VFIODeviceCPR {
+ Error *mdev_blocker;
+} VFIODeviceCPR;
bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
Error **errp);
diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
index 8bcb3c1..4e4d0b6 100644
--- a/include/hw/vfio/vfio-device.h
+++ b/include/hw/vfio/vfio-device.h
@@ -28,6 +28,7 @@
#endif
#include "system/system.h"
#include "hw/vfio/vfio-container-base.h"
+#include "hw/vfio/vfio-cpr.h"
#include "system/host_iommu_device.h"
#include "system/iommufd.h"
@@ -84,6 +85,7 @@ typedef struct VFIODevice {
VFIOIOASHwpt *hwpt;
QLIST_ENTRY(VFIODevice) hwpt_next;
struct vfio_region_info **reginfo;
+ VFIODeviceCPR cpr;
} VFIODevice;
struct VFIODeviceOps {
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 936ce37..3e8d645 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -987,6 +987,13 @@ static bool vfio_legacy_attach_device(const char *name, VFIODevice *vbasedev,
goto device_put_exit;
}
+ if (vbasedev->mdev) {
+ error_setg(&vbasedev->cpr.mdev_blocker,
+ "CPR does not support vfio mdev %s", vbasedev->name);
+ migrate_add_blocker_modes(&vbasedev->cpr.mdev_blocker, &error_fatal,
+ MIG_MODE_CPR_TRANSFER, -1);
+ }
+
return true;
device_put_exit:
@@ -1004,6 +1011,7 @@ static void vfio_legacy_detach_device(VFIODevice *vbasedev)
vfio_device_unprepare(vbasedev);
+ migrate_del_blocker(&vbasedev->cpr.mdev_blocker);
object_unref(vbasedev->hiod);
vfio_device_put(vbasedev);
vfio_group_put(group);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 101+ messages in thread
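The blocker call in the patch above passes a list of migration modes terminated by -1. How such a list can be consumed is sketched below; this is an illustration only, assuming a bitmask representation that is not taken from QEMU, with toy_modes_to_mask, toy_is_blocked, and the toy_mig_mode enum all being hypothetical names.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical mode numbering; the real MigMode values are not shown here. */
enum toy_mig_mode {
    TOY_MODE_NORMAL,
    TOY_MODE_CPR_REBOOT,
    TOY_MODE_CPR_TRANSFER,
};

/* Fold a -1 terminated list of modes into a bitmask, one bit per mode. */
static unsigned toy_modes_to_mask(int mode, ...)
{
    unsigned mask = 0;
    va_list ap;

    va_start(ap, mode);
    while (mode != -1) {
        mask |= 1u << mode;
        mode = va_arg(ap, int);
    }
    va_end(ap);
    return mask;
}

/* A blocker applies to a mode only if that mode's bit is set. */
static int toy_is_blocked(unsigned mask, int mode)
{
    return (mask >> mode) & 1u;
}
```

Under this model, a blocker registered for the CPR transfer mode alone leaves the other modes unaffected, which matches the intent of restricting mdev devices only for cpr-transfer.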
* [PATCH V5 08/38] vfio/container: recover from unmap-all-vaddr failure
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (6 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 07/38] vfio/container: mdev cpr blocker Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-08-13 12:54 ` Cédric Le Goater
2025-06-10 15:39 ` [PATCH V5 09/38] pci: export msix_is_pending Steve Sistare
` (30 subsequent siblings)
38 siblings, 1 reply; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
If there are multiple containers and unmap-all fails for some container, we
need to remap vaddr for the other containers for which unmap-all succeeded.
Recover by walking all address ranges of all containers to restore the vaddr
for each. Do so by invoking the vfio listener callback, and passing a new
"remap" flag that tells it to restore a mapping without re-allocating new
userland data structures.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
include/hw/vfio/vfio-container-base.h | 3 ++
include/hw/vfio/vfio-cpr.h | 10 ++++
hw/vfio/cpr-legacy.c | 91 +++++++++++++++++++++++++++++++++++
hw/vfio/listener.c | 19 +++++++-
4 files changed, 122 insertions(+), 1 deletion(-)
diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index 9d37f86..f023265 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -256,4 +256,7 @@ struct VFIOIOMMUClass {
VFIORamDiscardListener *vfio_find_ram_discard_listener(
VFIOContainerBase *bcontainer, MemoryRegionSection *section);
+void vfio_container_region_add(VFIOContainerBase *bcontainer,
+ MemoryRegionSection *section, bool cpr_remap);
+
#endif /* HW_VFIO_VFIO_CONTAINER_BASE_H */
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index b83dd42..56ede04 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -10,6 +10,7 @@
#define HW_VFIO_VFIO_CPR_H
#include "migration/misc.h"
+#include "system/memory.h"
struct VFIOContainer;
struct VFIOContainerBase;
@@ -17,6 +18,9 @@ struct VFIOGroup;
typedef struct VFIOContainerCPR {
Error *blocker;
+ bool vaddr_unmapped;
+ NotifierWithReturn transfer_notifier;
+ MemoryListener remap_listener;
int (*saved_dma_map)(const struct VFIOContainerBase *bcontainer,
hwaddr iova, ram_addr_t size,
void *vaddr, bool readonly, MemoryRegion *mr);
@@ -42,4 +46,10 @@ int vfio_cpr_group_get_device_fd(int d, const char *name);
bool vfio_cpr_container_match(struct VFIOContainer *container,
struct VFIOGroup *group, int fd);
+void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
+ MemoryRegionSection *section);
+
+bool vfio_cpr_ram_discard_register_listener(
+ struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
+
#endif /* HW_VFIO_VFIO_CPR_H */
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
index 2fd8348..a84c324 100644
--- a/hw/vfio/cpr-legacy.c
+++ b/hw/vfio/cpr-legacy.c
@@ -29,6 +29,7 @@ static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
return false;
}
+ container->cpr.vaddr_unmapped = true;
return true;
}
@@ -59,6 +60,14 @@ static int vfio_legacy_cpr_dma_map(const VFIOContainerBase *bcontainer,
return 0;
}
+static void vfio_region_remap(MemoryListener *listener,
+ MemoryRegionSection *section)
+{
+ VFIOContainer *container = container_of(listener, VFIOContainer,
+ cpr.remap_listener);
+ vfio_container_region_add(&container->bcontainer, section, true);
+}
+
static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
{
if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
@@ -120,6 +129,40 @@ static const VMStateDescription vfio_container_vmstate = {
}
};
+static int vfio_cpr_fail_notifier(NotifierWithReturn *notifier,
+ MigrationEvent *e, Error **errp)
+{
+ VFIOContainer *container =
+ container_of(notifier, VFIOContainer, cpr.transfer_notifier);
+ VFIOContainerBase *bcontainer = &container->bcontainer;
+
+ if (e->type != MIG_EVENT_PRECOPY_FAILED) {
+ return 0;
+ }
+
+ if (container->cpr.vaddr_unmapped) {
+ /*
+ * Force a call to vfio_region_remap for each mapped section by
+ * temporarily registering a listener, and temporarily diverting
+ * dma_map to vfio_legacy_cpr_dma_map. The latter restores vaddr.
+ */
+
+ VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
+ vioc->dma_map = vfio_legacy_cpr_dma_map;
+
+ container->cpr.remap_listener = (MemoryListener) {
+ .name = "vfio cpr recover",
+ .region_add = vfio_region_remap
+ };
+ memory_listener_register(&container->cpr.remap_listener,
+ bcontainer->space->as);
+ memory_listener_unregister(&container->cpr.remap_listener);
+ container->cpr.vaddr_unmapped = false;
+ vioc->dma_map = container->cpr.saved_dma_map;
+ }
+ return 0;
+}
+
bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
{
VFIOContainerBase *bcontainer = &container->bcontainer;
@@ -142,6 +185,10 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
container->cpr.saved_dma_map = vioc->dma_map;
vioc->dma_map = vfio_legacy_cpr_dma_map;
}
+
+ migration_add_notifier_mode(&container->cpr.transfer_notifier,
+ vfio_cpr_fail_notifier,
+ MIG_MODE_CPR_TRANSFER);
return true;
}
@@ -152,6 +199,50 @@ void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
migrate_del_blocker(&container->cpr.blocker);
vmstate_unregister(NULL, &vfio_container_vmstate, container);
+ migration_remove_notifier(&container->cpr.transfer_notifier);
+}
+
+/*
+ * In old QEMU, VFIO_DMA_UNMAP_FLAG_VADDR may fail on some mapping after
+ * succeeding for others, so the latter have lost their vaddr. Call this
+ * to restore vaddr for a section with a giommu.
+ *
+ * The giommu already exists. Find it and replay it, which calls
+ * vfio_legacy_cpr_dma_map further down the stack.
+ */
+void vfio_cpr_giommu_remap(VFIOContainerBase *bcontainer,
+ MemoryRegionSection *section)
+{
+ VFIOGuestIOMMU *giommu = NULL;
+ hwaddr as_offset = section->offset_within_address_space;
+ hwaddr iommu_offset = as_offset - section->offset_within_region;
+
+ QLIST_FOREACH(giommu, &bcontainer->giommu_list, giommu_next) {
+ if (giommu->iommu_mr == IOMMU_MEMORY_REGION(section->mr) &&
+ giommu->iommu_offset == iommu_offset) {
+ break;
+ }
+ }
+ g_assert(giommu);
+ memory_region_iommu_replay(giommu->iommu_mr, &giommu->n);
+}
+
+/*
+ * In old QEMU, VFIO_DMA_UNMAP_FLAG_VADDR may fail on some mapping after
+ * succeeding for others, so the latter have lost their vaddr. Call this
+ * to restore vaddr for a section with a RamDiscardManager.
+ *
+ * The ram discard listener already exists. Call its populate function
+ * directly, which calls vfio_legacy_cpr_dma_map.
+ */
+bool vfio_cpr_ram_discard_register_listener(VFIOContainerBase *bcontainer,
+ MemoryRegionSection *section)
+{
+ VFIORamDiscardListener *vrdl =
+ vfio_find_ram_discard_listener(bcontainer, section);
+
+ g_assert(vrdl);
+ return vrdl->listener.notify_populate(&vrdl->listener, section) == 0;
}
int vfio_cpr_group_get_device_fd(int d, const char *name)
diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index 203ed03..2e57986 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -481,6 +481,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
{
VFIOContainerBase *bcontainer = container_of(listener, VFIOContainerBase,
listener);
+ vfio_container_region_add(bcontainer, section, false);
+}
+
+void vfio_container_region_add(VFIOContainerBase *bcontainer,
+ MemoryRegionSection *section,
+ bool cpr_remap)
+{
hwaddr iova, end;
Int128 llend, llsize;
void *vaddr;
@@ -516,6 +523,11 @@ static void vfio_listener_region_add(MemoryListener *listener,
int iommu_idx;
trace_vfio_listener_region_add_iommu(section->mr->name, iova, end);
+
+ if (cpr_remap) {
+ vfio_cpr_giommu_remap(bcontainer, section);
+ }
+
/*
* FIXME: For VFIO iommu types which have KVM acceleration to
* avoid bouncing all map/unmaps through qemu this way, this
@@ -558,7 +570,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
* about changes.
*/
if (memory_region_has_ram_discard_manager(section->mr)) {
- vfio_ram_discard_register_listener(bcontainer, section);
+ if (!cpr_remap) {
+ vfio_ram_discard_register_listener(bcontainer, section);
+ } else if (!vfio_cpr_ram_discard_register_listener(bcontainer,
+ section)) {
+ goto fail;
+ }
return;
}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 101+ messages in thread
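The recovery scenario in the patch above (unmap-all succeeds for some containers and then fails for another) can be modeled with a per-container flag. The sketch below is hypothetical throughout: the toy_ names, the fail_unmap test knob, and the assumption that saving stops at the first failing container are illustrative and are not taken from QEMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_container {
    bool fail_unmap;      /* test knob: make unmap-all fail for this one */
    bool vaddr_valid;     /* whether mappings currently have a usable vaddr */
    bool vaddr_unmapped;  /* mirrors the flag added by the patch */
};

/* Model of the pre_save step for one container. */
static bool toy_unmap_vaddr_all(struct toy_container *c)
{
    if (c->fail_unmap) {
        return false;
    }
    c->vaddr_valid = false;
    c->vaddr_unmapped = true;
    return true;
}

/* Assumption: saving stops at the first container that fails. */
static bool toy_pre_save_all(struct toy_container *cs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!toy_unmap_vaddr_all(&cs[i])) {
            return false;
        }
    }
    return true;
}

/*
 * Model of the failure notifier: restore vaddr only for containers whose
 * unmap-all succeeded earlier. Returns how many containers were remapped.
 */
static size_t toy_fail_notifier_all(struct toy_container *cs, size_t n)
{
    size_t remapped = 0;

    for (size_t i = 0; i < n; i++) {
        if (cs[i].vaddr_unmapped) {
            cs[i].vaddr_valid = true;
            cs[i].vaddr_unmapped = false;
            remapped++;
        }
    }
    return remapped;
}
```

The flag is what keeps recovery idempotent in this model: a container that never lost its vaddr is skipped, and a second notification finds nothing left to remap.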
* Re: [PATCH V5 08/38] vfio/container: recover from unmap-all-vaddr failure
2025-06-10 15:39 ` [PATCH V5 08/38] vfio/container: recover from unmap-all-vaddr failure Steve Sistare
@ 2025-08-13 12:54 ` Cédric Le Goater
2025-08-13 14:18 ` Steven Sistare
0 siblings, 1 reply; 101+ messages in thread
From: Cédric Le Goater @ 2025-08-13 12:54 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
Steve,
On 6/10/25 17:39, Steve Sistare wrote:
> If there are multiple containers and unmap-all fails for some container, we
> need to remap vaddr for the other containers for which unmap-all succeeded.
> Recover by walking all address ranges of all containers to restore the vaddr
> for each. Do so by invoking the vfio listener callback, and passing a new
> "remap" flag that tells it to restore a mapping without re-allocating new
> userland data structures.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Reviewed-by: Cédric Le Goater <clg@redhat.com>
> ---
> include/hw/vfio/vfio-container-base.h | 3 ++
> include/hw/vfio/vfio-cpr.h | 10 ++++
> hw/vfio/cpr-legacy.c | 91 +++++++++++++++++++++++++++++++++++
> hw/vfio/listener.c | 19 +++++++-
> 4 files changed, 122 insertions(+), 1 deletion(-)
>
> diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
> index 9d37f86..f023265 100644
> --- a/include/hw/vfio/vfio-container-base.h
> +++ b/include/hw/vfio/vfio-container-base.h
> @@ -256,4 +256,7 @@ struct VFIOIOMMUClass {
> VFIORamDiscardListener *vfio_find_ram_discard_listener(
> VFIOContainerBase *bcontainer, MemoryRegionSection *section);
>
> +void vfio_container_region_add(VFIOContainerBase *bcontainer,
> + MemoryRegionSection *section, bool cpr_remap);
> +
> #endif /* HW_VFIO_VFIO_CONTAINER_BASE_H */
> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
> index b83dd42..56ede04 100644
> --- a/include/hw/vfio/vfio-cpr.h
> +++ b/include/hw/vfio/vfio-cpr.h
> @@ -10,6 +10,7 @@
> #define HW_VFIO_VFIO_CPR_H
>
> #include "migration/misc.h"
> +#include "system/memory.h"
>
> struct VFIOContainer;
> struct VFIOContainerBase;
> @@ -17,6 +18,9 @@ struct VFIOGroup;
>
> typedef struct VFIOContainerCPR {
> Error *blocker;
> + bool vaddr_unmapped;
> + NotifierWithReturn transfer_notifier;
> + MemoryListener remap_listener;
> int (*saved_dma_map)(const struct VFIOContainerBase *bcontainer,
> hwaddr iova, ram_addr_t size,
> void *vaddr, bool readonly, MemoryRegion *mr);
> @@ -42,4 +46,10 @@ int vfio_cpr_group_get_device_fd(int d, const char *name);
> bool vfio_cpr_container_match(struct VFIOContainer *container,
> struct VFIOGroup *group, int fd);
>
> +void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
> + MemoryRegionSection *section);
> +
> +bool vfio_cpr_ram_discard_register_listener(
> + struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
> +
> #endif /* HW_VFIO_VFIO_CPR_H */
> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
> index 2fd8348..a84c324 100644
> --- a/hw/vfio/cpr-legacy.c
> +++ b/hw/vfio/cpr-legacy.c
> @@ -29,6 +29,7 @@ static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
> error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
> return false;
> }
> + container->cpr.vaddr_unmapped = true;
> return true;
> }
>
> @@ -59,6 +60,14 @@ static int vfio_legacy_cpr_dma_map(const VFIOContainerBase *bcontainer,
> return 0;
> }
>
> +static void vfio_region_remap(MemoryListener *listener,
> + MemoryRegionSection *section)
> +{
> + VFIOContainer *container = container_of(listener, VFIOContainer,
> + cpr.remap_listener);
> + vfio_container_region_add(&container->bcontainer, section, true);
> +}
> +
> static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
> {
> if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
> @@ -120,6 +129,40 @@ static const VMStateDescription vfio_container_vmstate = {
> }
> };
>
> +static int vfio_cpr_fail_notifier(NotifierWithReturn *notifier,
> + MigrationEvent *e, Error **errp)
> +{
> + VFIOContainer *container =
> + container_of(notifier, VFIOContainer, cpr.transfer_notifier);
> + VFIOContainerBase *bcontainer = &container->bcontainer;
> +
> + if (e->type != MIG_EVENT_PRECOPY_FAILED) {
> + return 0;
> + }
> +
> + if (container->cpr.vaddr_unmapped) {
> + /*
> + * Force a call to vfio_region_remap for each mapped section by
> + * temporarily registering a listener, and temporarily diverting
> + * dma_map to vfio_legacy_cpr_dma_map. The latter restores vaddr.
> + */
> +
> + VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
> + vioc->dma_map = vfio_legacy_cpr_dma_map;
> +
> + container->cpr.remap_listener = (MemoryListener) {
> + .name = "vfio cpr recover",
> + .region_add = vfio_region_remap
> + };
> + memory_listener_register(&container->cpr.remap_listener,
> + bcontainer->space->as);
> + memory_listener_unregister(&container->cpr.remap_listener);
> + container->cpr.vaddr_unmapped = false;
> + vioc->dma_map = container->cpr.saved_dma_map;
> + }
> + return 0;
> +}
> +
> bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
> {
> VFIOContainerBase *bcontainer = &container->bcontainer;
> @@ -142,6 +185,10 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
> container->cpr.saved_dma_map = vioc->dma_map;
> vioc->dma_map = vfio_legacy_cpr_dma_map;
> }
> +
> + migration_add_notifier_mode(&container->cpr.transfer_notifier,
> + vfio_cpr_fail_notifier,
> + MIG_MODE_CPR_TRANSFER);
> return true;
> }
>
> @@ -152,6 +199,50 @@ void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
> migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
> migrate_del_blocker(&container->cpr.blocker);
> vmstate_unregister(NULL, &vfio_container_vmstate, container);
> + migration_remove_notifier(&container->cpr.transfer_notifier);
> +}
> +
> +/*
> + * In old QEMU, VFIO_DMA_UNMAP_FLAG_VADDR may fail on some mapping after
> + * succeeding for others, so the latter have lost their vaddr. Call this
> + * to restore vaddr for a section with a giommu.
> + *
> + * The giommu already exists. Find it and replay it, which calls
> + * vfio_legacy_cpr_dma_map further down the stack.
> + */
> +void vfio_cpr_giommu_remap(VFIOContainerBase *bcontainer,
> + MemoryRegionSection *section)
> +{
> + VFIOGuestIOMMU *giommu = NULL;
> + hwaddr as_offset = section->offset_within_address_space;
> + hwaddr iommu_offset = as_offset - section->offset_within_region;
> +
> + QLIST_FOREACH(giommu, &bcontainer->giommu_list, giommu_next) {
> + if (giommu->iommu_mr == IOMMU_MEMORY_REGION(section->mr) &&
> + giommu->iommu_offset == iommu_offset) {
> + break;
> + }
> + }
> + g_assert(giommu);
> + memory_region_iommu_replay(giommu->iommu_mr, &giommu->n);
> +}
> +
> +/*
> + * In old QEMU, VFIO_DMA_UNMAP_FLAG_VADDR may fail on some mapping after
> + * succeeding for others, so the latter have lost their vaddr. Call this
> + * to restore vaddr for a section with a RamDiscardManager.
> + *
> + * The ram discard listener already exists. Call its populate function
> + * directly, which calls vfio_legacy_cpr_dma_map.
> + */
> +bool vfio_cpr_ram_discard_register_listener(VFIOContainerBase *bcontainer,
> + MemoryRegionSection *section)
> +{
> + VFIORamDiscardListener *vrdl =
> + vfio_find_ram_discard_listener(bcontainer, section);
> +
> + g_assert(vrdl);
> + return vrdl->listener.notify_populate(&vrdl->listener, section) == 0;
> }
>
> int vfio_cpr_group_get_device_fd(int d, const char *name)
> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
> index 203ed03..2e57986 100644
> --- a/hw/vfio/listener.c
> +++ b/hw/vfio/listener.c
> @@ -481,6 +481,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
> {
> VFIOContainerBase *bcontainer = container_of(listener, VFIOContainerBase,
> listener);
> + vfio_container_region_add(bcontainer, section, false);
> +}
> +
> +void vfio_container_region_add(VFIOContainerBase *bcontainer,
> + MemoryRegionSection *section,
> + bool cpr_remap)
> +{
> hwaddr iova, end;
> Int128 llend, llsize;
> void *vaddr;
> @@ -516,6 +523,11 @@ static void vfio_listener_region_add(MemoryListener *listener,
> int iommu_idx;
>
> trace_vfio_listener_region_add_iommu(section->mr->name, iova, end);
> +
> + if (cpr_remap) {
> + vfio_cpr_giommu_remap(bcontainer, section);
> + }
> +
> /*
> * FIXME: For VFIO iommu types which have KVM acceleration to
> * avoid bouncing all map/unmaps through qemu this way, this
> @@ -558,7 +570,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
> * about changes.
> */
> if (memory_region_has_ram_discard_manager(section->mr)) {
> - vfio_ram_discard_register_listener(bcontainer, section);
> + if (!cpr_remap) {
> + vfio_ram_discard_register_listener(bcontainer, section);
> + } else if (!vfio_cpr_ram_discard_register_listener(bcontainer,
> + section)) {
> + goto fail;
vfio_cpr_ram_discard_register_listener() can fail without setting
an 'Error *' variable. I don't think this will generate a QEMU crash
(we are in the !bcontainer->initialized case) but it would be better
addressed if &err was set.
Thanks,
C.
^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH V5 08/38] vfio/container: recover from unmap-all-vaddr failure
2025-08-13 12:54 ` Cédric Le Goater
@ 2025-08-13 14:18 ` Steven Sistare
0 siblings, 0 replies; 101+ messages in thread
From: Steven Sistare @ 2025-08-13 14:18 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 8/13/2025 8:54 AM, Cédric Le Goater wrote:
> Steve,
>
> On 6/10/25 17:39, Steve Sistare wrote:
>> If there are multiple containers and unmap-all fails for some container, we
>> need to remap vaddr for the other containers for which unmap-all succeeded.
>> Recover by walking all address ranges of all containers to restore the vaddr
>> for each. Do so by invoking the vfio listener callback, and passing a new
>> "remap" flag that tells it to restore a mapping without re-allocating new
>> userland data structures.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> Reviewed-by: Cédric Le Goater <clg@redhat.com>
>> ---
[...]
>> @@ -558,7 +570,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
>> * about changes.
>> */
>> if (memory_region_has_ram_discard_manager(section->mr)) {
>> - vfio_ram_discard_register_listener(bcontainer, section);
>> + if (!cpr_remap) {
>> + vfio_ram_discard_register_listener(bcontainer, section);
>> + } else if (!vfio_cpr_ram_discard_register_listener(bcontainer,
>> + section)) {
>> + goto fail;
>
> vfio_cpr_ram_discard_register_listener() can fail without setting
> an 'Error *' variable. I don't think this will crash QEMU (we are
> in the !bcontainer->initialized case), but it would be better to
> set &err on failure.
Thanks, I just posted a fix - steve
* [PATCH V5 09/38] pci: export msix_is_pending
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (7 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 08/38] vfio/container: recover from unmap-all-vaddr failure Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-10 15:39 ` [PATCH V5 10/38] pci: skip reset during cpr Steve Sistare
` (29 subsequent siblings)
38 siblings, 0 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Export msix_is_pending for use by cpr. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
include/hw/pci/msix.h | 1 +
hw/pci/msix.c | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
index 0e6f257..11ef945 100644
--- a/include/hw/pci/msix.h
+++ b/include/hw/pci/msix.h
@@ -32,6 +32,7 @@ int msix_present(PCIDevice *dev);
bool msix_is_masked(PCIDevice *dev, unsigned vector);
void msix_set_pending(PCIDevice *dev, unsigned vector);
void msix_clr_pending(PCIDevice *dev, int vector);
+int msix_is_pending(PCIDevice *dev, unsigned vector);
void msix_vector_use(PCIDevice *dev, unsigned vector);
void msix_vector_unuse(PCIDevice *dev, unsigned vector);
diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index 66f27b9..8c7f670 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -72,7 +72,7 @@ static uint8_t *msix_pending_byte(PCIDevice *dev, int vector)
return dev->msix_pba + vector / 8;
}
-static int msix_is_pending(PCIDevice *dev, int vector)
+int msix_is_pending(PCIDevice *dev, unsigned int vector)
{
return *msix_pending_byte(dev, vector) & msix_pending_mask(vector);
}
--
1.8.3.1
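As background for the function exported above, the lookup it performs is a bit test in a packed pending-bit array: vector / 8 selects the byte, as in msix_pending_byte in the diff, and a per-bit mask selects the bit. The sketch below assumes the mask is 1 << (vector % 8), which the patch does not show. DemoPba and the demo_* names are hypothetical and only illustrate the indexing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_VECTORS 64

/* Hypothetical packed pending-bit array: one bit per vector. */
typedef struct DemoPba {
    uint8_t bits[DEMO_MAX_VECTORS / 8];
} DemoPba;

/* Mark a vector as pending by setting its bit. */
void demo_set_pending(DemoPba *pba, unsigned vector)
{
    assert(vector < DEMO_MAX_VECTORS);
    pba->bits[vector / 8] |= (uint8_t)(1u << (vector % 8));
}

/* Test a vector's bit: byte index is vector / 8, bit index is vector % 8. */
bool demo_is_pending(const DemoPba *pba, unsigned vector)
{
    assert(vector < DEMO_MAX_VECTORS);
    return (pba->bits[vector / 8] & (1u << (vector % 8))) != 0;
}
```

Taking the vector as unsigned, as the exported prototype does, avoids a negative index reaching the division.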
* [PATCH V5 10/38] pci: skip reset during cpr
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (8 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 09/38] pci: export msix_is_pending Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-10 15:39 ` [PATCH V5 11/38] vfio-pci: " Steve Sistare
` (28 subsequent siblings)
38 siblings, 0 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Add a skip_reset_on_cpr flag to PCIDevice, and set it for vfio-pci so
that the device is not reset during CPR.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
include/hw/pci/pci_device.h | 3 +++
hw/pci/pci.c | 5 +++++
hw/vfio/pci.c | 7 +++++++
3 files changed, 15 insertions(+)
diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index eee0338..0509430 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -182,6 +182,9 @@ struct PCIDevice {
uint32_t max_bounce_buffer_size;
char *sriov_pf;
+
+ /* CPR */
+ bool skip_reset_on_cpr;
};
static inline int pci_intx(PCIDevice *pci_dev)
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 9b4bf48..0f6b9b3 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -32,6 +32,7 @@
#include "hw/pci/pci_host.h"
#include "hw/qdev-properties.h"
#include "hw/qdev-properties-system.h"
+#include "migration/cpr.h"
#include "migration/qemu-file-types.h"
#include "migration/vmstate.h"
#include "net/net.h"
@@ -537,6 +538,10 @@ static void pci_reset_regions(PCIDevice *dev)
static void pci_do_device_reset(PCIDevice *dev)
{
+ if (dev->skip_reset_on_cpr && cpr_is_incoming()) {
+ return;
+ }
+
pci_device_deassert_intx(dev);
assert(dev->irq_state == 0);
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index b1250d8..819170d 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3408,6 +3408,13 @@ static void vfio_instance_init(Object *obj)
/* QEMU_PCI_CAP_EXPRESS initialization does not depend on QEMU command
* line, therefore, no need to wait to realize like other devices */
pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
+
+ /*
+ * A device that is resuming for cpr is already configured, so do not
+ * reset it during qemu_system_reset prior to cpr load, else interrupts
+ * may be lost.
+ */
+ pci_dev->skip_reset_on_cpr = true;
}
static void vfio_pci_base_dev_class_init(ObjectClass *klass, const void *data)
--
1.8.3.1
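The guard added to pci_do_device_reset can be pictured with a small stand-alone model: a per-device opt-in flag combined with a global "incoming CPR" condition short-circuits the reset. DemoDevice and demo_do_reset are hypothetical names, and the cpr_incoming argument stands in for the cpr_is_incoming() call in the patch. This is a sketch of the control flow only, not of the real reset work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model with just enough state to observe a reset. */
typedef struct DemoDevice {
    bool skip_reset_on_cpr;   /* opt-in, set once at instance init */
    int irq_state;            /* cleared by a real reset */
    unsigned reset_count;     /* how many resets actually ran */
} DemoDevice;

/*
 * Returns true if the reset ran, false if it was skipped because the
 * device opted out and an incoming CPR is in progress.
 */
bool demo_do_reset(DemoDevice *dev, bool cpr_incoming)
{
    assert(dev != NULL);
    if (dev->skip_reset_on_cpr && cpr_incoming) {
        return false;         /* preserve the already-configured state */
    }
    dev->irq_state = 0;
    dev->reset_count++;
    return true;
}
```

Keeping the flag per device means devices that do not set it are still reset as before during an incoming CPR.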
* [PATCH V5 11/38] vfio-pci: skip reset during cpr
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (9 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 10/38] pci: skip reset during cpr Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-10 15:39 ` [PATCH V5 12/38] vfio/pci: vfio_pci_vector_init Steve Sistare
` (27 subsequent siblings)
38 siblings, 0 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Do not reset a vfio-pci device during CPR, and do not complain if the
kernel's PCI config space changes for non-emulated bits between the
vmstate save and load, which can happen due to ongoing interrupt activity.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
include/hw/vfio/vfio-cpr.h | 2 ++
hw/vfio/cpr.c | 31 +++++++++++++++++++++++++++++++
hw/vfio/pci.c | 7 +++++++
3 files changed, 40 insertions(+)
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 56ede04..8bf85b9 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -52,4 +52,6 @@ void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
bool vfio_cpr_ram_discard_register_listener(
struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
+extern const VMStateDescription vfio_cpr_pci_vmstate;
+
#endif /* HW_VFIO_VFIO_CPR_H */
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index 0e59612..fdbb58e 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -8,6 +8,8 @@
#include "qemu/osdep.h"
#include "hw/vfio/vfio-device.h"
#include "hw/vfio/vfio-cpr.h"
+#include "hw/vfio/pci.h"
+#include "migration/cpr.h"
#include "qapi/error.h"
#include "system/runstate.h"
@@ -37,3 +39,32 @@ void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
{
migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
}
+
+/*
+ * The kernel may change non-emulated config bits. Exclude them from the
+ * changed-bits check in get_pci_config_device.
+ */
+static int vfio_cpr_pci_pre_load(void *opaque)
+{
+ VFIOPCIDevice *vdev = opaque;
+ PCIDevice *pdev = &vdev->pdev;
+ int size = MIN(pci_config_size(pdev), vdev->config_size);
+ int i;
+
+ for (i = 0; i < size; i++) {
+ pdev->cmask[i] &= vdev->emulated_config_bits[i];
+ }
+
+ return 0;
+}
+
+const VMStateDescription vfio_cpr_pci_vmstate = {
+ .name = "vfio-cpr-pci",
+ .version_id = 0,
+ .minimum_version_id = 0,
+ .pre_load = vfio_cpr_pci_pre_load,
+ .needed = cpr_incoming_needed,
+ .fields = (VMStateField[]) {
+ VMSTATE_END_OF_LIST()
+ }
+};
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 819170d..67edf80 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -30,6 +30,7 @@
#include "hw/qdev-properties.h"
#include "hw/qdev-properties-system.h"
#include "migration/vmstate.h"
+#include "migration/cpr.h"
#include "qobject/qdict.h"
#include "qemu/error-report.h"
#include "qemu/main-loop.h"
@@ -3351,6 +3352,11 @@ static void vfio_pci_reset(DeviceState *dev)
{
VFIOPCIDevice *vdev = VFIO_PCI_BASE(dev);
+ /* Do not reset the device during qemu_system_reset prior to cpr load */
+ if (cpr_is_incoming()) {
+ return;
+ }
+
trace_vfio_pci_reset(vdev->vbasedev.name);
vfio_pci_pre_reset(vdev);
@@ -3527,6 +3533,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, const void *data)
#ifdef CONFIG_IOMMUFD
object_class_property_add_str(klass, "fd", NULL, vfio_pci_set_fd);
#endif
+ dc->vmsd = &vfio_cpr_pci_vmstate;
dc->desc = "VFIO-based PCI device assignment";
pdc->realize = vfio_pci_realize;
--
1.8.3.1
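The effect of the pre_load hook above can be illustrated in isolation: ANDing the compare mask with the emulated-bits mask means that only emulated bits take part in the saved-versus-current comparison. The sketch below is simplified and hypothetical. demo_first_mismatch models the changed-bits check as a plain XOR under cmask and ignores any other masks the real check in get_pci_config_device may apply.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Narrow the compare mask so only emulated bits remain checked. */
void demo_narrow_cmask(uint8_t *cmask, const uint8_t *emulated_bits,
                       size_t size)
{
    assert(cmask != NULL && emulated_bits != NULL);
    for (size_t i = 0; i < size; i++) {
        cmask[i] &= emulated_bits[i];
    }
}

/*
 * Simplified changed-bits check: return the offset of the first byte
 * whose checked bits differ between saved and current, or -1 if none.
 */
int demo_first_mismatch(const uint8_t *saved, const uint8_t *current,
                        const uint8_t *cmask, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        if ((saved[i] ^ current[i]) & cmask[i]) {
            return (int)i;
        }
    }
    return -1;
}
```

In this model, a byte that the kernel changed between save and load stops triggering a mismatch once its emulated-bits mask is zero.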
* [PATCH V5 12/38] vfio/pci: vfio_pci_vector_init
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (10 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 11/38] vfio-pci: " Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-10 15:39 ` [PATCH V5 13/38] vfio/pci: vfio_notifier_init Steve Sistare
` (26 subsequent siblings)
38 siblings, 0 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Extract a subroutine vfio_pci_vector_init. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
hw/vfio/pci.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 67edf80..03630bb 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -531,6 +531,22 @@ static void set_irq_signalling(VFIODevice *vbasedev, VFIOMSIVector *vector,
}
}
+static void vfio_pci_vector_init(VFIOPCIDevice *vdev, int nr)
+{
+ VFIOMSIVector *vector = &vdev->msi_vectors[nr];
+ PCIDevice *pdev = &vdev->pdev;
+
+ vector->vdev = vdev;
+ vector->virq = -1;
+ if (event_notifier_init(&vector->interrupt, 0)) {
+ error_report("vfio: Error: event_notifier_init failed");
+ }
+ vector->use = true;
+ if (vdev->interrupt == VFIO_INT_MSIX) {
+ msix_vector_use(pdev, nr);
+ }
+}
+
static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
MSIMessage *msg, IOHandler *handler)
{
@@ -544,13 +560,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
vector = &vdev->msi_vectors[nr];
if (!vector->use) {
- vector->vdev = vdev;
- vector->virq = -1;
- if (event_notifier_init(&vector->interrupt, 0)) {
- error_report("vfio: Error: event_notifier_init failed");
- }
- vector->use = true;
- msix_vector_use(pdev, nr);
+ vfio_pci_vector_init(vdev, nr);
}
qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
--
1.8.3.1
* [PATCH V5 13/38] vfio/pci: vfio_notifier_init
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (11 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 12/38] vfio/pci: vfio_pci_vector_init Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-10 15:39 ` [PATCH V5 14/38] vfio/pci: pass vector to virq functions Steve Sistare
` (25 subsequent siblings)
38 siblings, 0 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Move event_notifier_init calls to a helper vfio_notifier_init.
This version is trivial, but it will be expanded to support CPR
in subsequent patches. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
hw/vfio/pci.c | 40 +++++++++++++++++++++++++---------------
1 file changed, 25 insertions(+), 15 deletions(-)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 03630bb..4fac4d1 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -57,6 +57,16 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
+static bool vfio_notifier_init(EventNotifier *e, const char *name, Error **errp)
+{
+ int ret = event_notifier_init(e, 0);
+
+ if (ret) {
+ error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
+ }
+ return !ret;
+}
+
/*
* Disabling BAR mmaping can be slow, but toggling it around INTx can
* also be a huge overhead. We try to get the best of both worlds by
@@ -137,8 +147,7 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
pci_irq_deassert(&vdev->pdev);
/* Get an eventfd for resample/unmask */
- if (event_notifier_init(&vdev->intx.unmask, 0)) {
- error_setg(errp, "event_notifier_init failed eoi");
+ if (!vfio_notifier_init(&vdev->intx.unmask, "intx-unmask", errp)) {
goto fail;
}
@@ -269,7 +278,6 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
Error *err = NULL;
int32_t fd;
- int ret;
if (!pin) {
@@ -292,9 +300,7 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
}
#endif
- ret = event_notifier_init(&vdev->intx.interrupt, 0);
- if (ret) {
- error_setg_errno(errp, -ret, "event_notifier_init failed");
+ if (!vfio_notifier_init(&vdev->intx.interrupt, "intx-interrupt", errp)) {
return false;
}
fd = event_notifier_get_fd(&vdev->intx.interrupt);
@@ -474,11 +480,13 @@ static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector)
{
+ const char *name = "kvm_interrupt";
+
if (vector->virq < 0) {
return;
}
- if (event_notifier_init(&vector->kvm_interrupt, 0)) {
+ if (!vfio_notifier_init(&vector->kvm_interrupt, name, NULL)) {
goto fail_notifier;
}
@@ -535,11 +543,12 @@ static void vfio_pci_vector_init(VFIOPCIDevice *vdev, int nr)
{
VFIOMSIVector *vector = &vdev->msi_vectors[nr];
PCIDevice *pdev = &vdev->pdev;
+ Error *err = NULL;
vector->vdev = vdev;
vector->virq = -1;
- if (event_notifier_init(&vector->interrupt, 0)) {
- error_report("vfio: Error: event_notifier_init failed");
+ if (!vfio_notifier_init(&vector->interrupt, "interrupt", &err)) {
+ error_report_err(err);
}
vector->use = true;
if (vdev->interrupt == VFIO_INT_MSIX) {
@@ -755,13 +764,14 @@ retry:
for (i = 0; i < vdev->nr_vectors; i++) {
VFIOMSIVector *vector = &vdev->msi_vectors[i];
+ Error *err = NULL;
vector->vdev = vdev;
vector->virq = -1;
vector->use = true;
- if (event_notifier_init(&vector->interrupt, 0)) {
- error_report("vfio: Error: event_notifier_init failed");
+ if (!vfio_notifier_init(&vector->interrupt, "interrupt", &err)) {
+ error_report_err(err);
}
qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
@@ -2925,8 +2935,8 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
return;
}
- if (event_notifier_init(&vdev->err_notifier, 0)) {
- error_report("vfio: Unable to init event notifier for error detection");
+ if (!vfio_notifier_init(&vdev->err_notifier, "err_notifier", &err)) {
+ error_report_err(err);
vdev->pci_aer = false;
return;
}
@@ -2992,8 +3002,8 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
return;
}
- if (event_notifier_init(&vdev->req_notifier, 0)) {
- error_report("vfio: Unable to init event notifier for device request");
+ if (!vfio_notifier_init(&vdev->req_notifier, "req_notifier", &err)) {
+ error_report_err(err);
return;
}
--
1.8.3.1
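The helper introduced above follows a common wrapper shape: call a low-level init that returns 0 or a negative errno, format one uniform message that names the notifier, and return a bool. A stand-alone approximation is sketched below. DemoNotifier, demo_event_init and the forced_ret parameter are hypothetical and exist only so that the failure path can be exercised without real eventfds. The message buffer replaces QEMU's Error object.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical notifier: just a descriptor slot. */
typedef struct DemoNotifier {
    int fd;
} DemoNotifier;

/* Hypothetical low-level init: returns 0, or the forced negative errno. */
static int demo_event_init(DemoNotifier *n, int forced_ret)
{
    if (forced_ret) {
        n->fd = -1;
        return forced_ret;
    }
    n->fd = 3;  /* placeholder value, not a real descriptor */
    return 0;
}

/*
 * Wrapper in the style of the helper above: one place that turns an
 * errno-style return into a bool plus a message naming the notifier.
 */
bool demo_notifier_init(DemoNotifier *n, const char *name, int forced_ret,
                        char *errbuf, size_t errlen)
{
    int ret;

    assert(n != NULL && name != NULL && errbuf != NULL);
    ret = demo_event_init(n, forced_ret);
    if (ret) {
        snprintf(errbuf, errlen, "notifier init %s failed: %s",
                 name, strerror(-ret));
    }
    return !ret;
}
```

Routing every call site through one helper is what lets later patches in the series change how notifiers are created without touching each caller again.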
* [PATCH V5 14/38] vfio/pci: pass vector to virq functions
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (12 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 13/38] vfio/pci: vfio_notifier_init Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-10 15:39 ` [PATCH V5 15/38] vfio/pci: vfio_notifier_init cpr parameters Steve Sistare
` (24 subsequent siblings)
38 siblings, 0 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Pass the vector number to vfio_connect_kvm_msi_virq and
vfio_remove_kvm_msi_virq, so it can be passed to their subroutines in
a subsequent patch. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
hw/vfio/pci.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 4fac4d1..e40402a 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -478,7 +478,7 @@ static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
vector_n, &vdev->pdev);
}
-static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector)
+static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector, int nr)
{
const char *name = "kvm_interrupt";
@@ -504,7 +504,8 @@ fail_notifier:
vector->virq = -1;
}
-static void vfio_remove_kvm_msi_virq(VFIOMSIVector *vector)
+static void vfio_remove_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
+ int nr)
{
kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
vector->virq);
@@ -581,7 +582,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
*/
if (vector->virq >= 0) {
if (!msg) {
- vfio_remove_kvm_msi_virq(vector);
+ vfio_remove_kvm_msi_virq(vdev, vector, nr);
} else {
vfio_update_kvm_msi_virq(vector, *msg, pdev);
}
@@ -593,7 +594,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
vfio_add_kvm_msi_virq(vdev, vector, nr, true);
kvm_irqchip_commit_route_changes(&vfio_route_change);
- vfio_connect_kvm_msi_virq(vector);
+ vfio_connect_kvm_msi_virq(vector, nr);
}
}
}
@@ -687,7 +688,7 @@ static void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
kvm_irqchip_commit_route_changes(&vfio_route_change);
for (i = 0; i < vdev->nr_vectors; i++) {
- vfio_connect_kvm_msi_virq(&vdev->msi_vectors[i]);
+ vfio_connect_kvm_msi_virq(&vdev->msi_vectors[i], i);
}
}
@@ -827,7 +828,7 @@ static void vfio_msi_disable_common(VFIOPCIDevice *vdev)
VFIOMSIVector *vector = &vdev->msi_vectors[i];
if (vdev->msi_vectors[i].use) {
if (vector->virq >= 0) {
- vfio_remove_kvm_msi_virq(vector);
+ vfio_remove_kvm_msi_virq(vdev, vector, i);
}
qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
NULL, NULL, NULL);
--
1.8.3.1
* [PATCH V5 15/38] vfio/pci: vfio_notifier_init cpr parameters
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (13 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 14/38] vfio/pci: pass vector to virq functions Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-10 15:39 ` [PATCH V5 16/38] vfio/pci: vfio_notifier_cleanup Steve Sistare
` (23 subsequent siblings)
38 siblings, 0 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Pass vdev and nr to vfio_notifier_init, for use by CPR in a subsequent
patch. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
hw/vfio/pci.c | 31 +++++++++++++++++++------------
1 file changed, 19 insertions(+), 12 deletions(-)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e40402a..ebc1a4b 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -57,7 +57,8 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
-static bool vfio_notifier_init(EventNotifier *e, const char *name, Error **errp)
+static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
+ const char *name, int nr, Error **errp)
{
int ret = event_notifier_init(e, 0);
@@ -147,7 +148,7 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
pci_irq_deassert(&vdev->pdev);
/* Get an eventfd for resample/unmask */
- if (!vfio_notifier_init(&vdev->intx.unmask, "intx-unmask", errp)) {
+ if (!vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0, errp)) {
goto fail;
}
@@ -300,7 +301,8 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
}
#endif
- if (!vfio_notifier_init(&vdev->intx.interrupt, "intx-interrupt", errp)) {
+ if (!vfio_notifier_init(vdev, &vdev->intx.interrupt, "intx-interrupt", 0,
+ errp)) {
return false;
}
fd = event_notifier_get_fd(&vdev->intx.interrupt);
@@ -486,7 +488,8 @@ static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector, int nr)
return;
}
- if (!vfio_notifier_init(&vector->kvm_interrupt, name, NULL)) {
+ if (!vfio_notifier_init(vector->vdev, &vector->kvm_interrupt, name, nr,
+ NULL)) {
goto fail_notifier;
}
@@ -544,12 +547,13 @@ static void vfio_pci_vector_init(VFIOPCIDevice *vdev, int nr)
{
VFIOMSIVector *vector = &vdev->msi_vectors[nr];
PCIDevice *pdev = &vdev->pdev;
- Error *err = NULL;
+ Error *local_err = NULL;
vector->vdev = vdev;
vector->virq = -1;
- if (!vfio_notifier_init(&vector->interrupt, "interrupt", &err)) {
- error_report_err(err);
+ if (!vfio_notifier_init(vdev, &vector->interrupt, "interrupt", nr,
+ &local_err)) {
+ error_report_err(local_err);
}
vector->use = true;
if (vdev->interrupt == VFIO_INT_MSIX) {
@@ -765,14 +769,15 @@ retry:
for (i = 0; i < vdev->nr_vectors; i++) {
VFIOMSIVector *vector = &vdev->msi_vectors[i];
- Error *err = NULL;
+ Error *local_err = NULL;
vector->vdev = vdev;
vector->virq = -1;
vector->use = true;
- if (!vfio_notifier_init(&vector->interrupt, "interrupt", &err)) {
- error_report_err(err);
+ if (!vfio_notifier_init(vdev, &vector->interrupt, "interrupt", i,
+ &local_err)) {
+ error_report_err(local_err);
}
qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
@@ -2936,7 +2941,8 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
return;
}
- if (!vfio_notifier_init(&vdev->err_notifier, "err_notifier", &err)) {
+ if (!vfio_notifier_init(vdev, &vdev->err_notifier, "err_notifier", 0,
+ &err)) {
error_report_err(err);
vdev->pci_aer = false;
return;
@@ -3003,7 +3009,8 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
return;
}
- if (!vfio_notifier_init(&vdev->req_notifier, "req_notifier", &err)) {
+ if (!vfio_notifier_init(vdev, &vdev->req_notifier, "req_notifier", 0,
+ &err)) {
error_report_err(err);
return;
}
--
1.8.3.1
* [PATCH V5 16/38] vfio/pci: vfio_notifier_cleanup
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (14 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 15/38] vfio/pci: vfio_notifier_init cpr parameters Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-10 15:39 ` [PATCH V5 17/38] vfio/pci: export MSI functions Steve Sistare
` (22 subsequent siblings)
38 siblings, 0 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Move event_notifier_cleanup calls to a helper vfio_notifier_cleanup.
This version is trivial, and does not yet use the vdev and nr parameters.
No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
hw/vfio/pci.c | 28 +++++++++++++++++-----------
1 file changed, 17 insertions(+), 11 deletions(-)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index ebc1a4b..508701c 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -68,6 +68,12 @@ static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
return !ret;
}
+static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
+ const char *name, int nr)
+{
+ event_notifier_cleanup(e);
+}
+
/*
* Disabling BAR mmaping can be slow, but toggling it around INTx can
* also be a huge overhead. We try to get the best of both worlds by
@@ -180,7 +186,7 @@ fail_vfio:
kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vdev->intx.interrupt,
vdev->intx.route.irq);
fail_irqfd:
- event_notifier_cleanup(&vdev->intx.unmask);
+ vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
fail:
qemu_set_fd_handler(irq_fd, vfio_intx_interrupt, NULL, vdev);
vfio_device_irq_unmask(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
@@ -212,7 +218,7 @@ static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
}
/* We only need to close the eventfd for VFIO to cleanup the kernel side */
- event_notifier_cleanup(&vdev->intx.unmask);
+ vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
/* QEMU starts listening for interrupt events. */
qemu_set_fd_handler(event_notifier_get_fd(&vdev->intx.interrupt),
@@ -311,7 +317,7 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
qemu_set_fd_handler(fd, NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->intx.interrupt);
+ vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
return false;
}
@@ -338,7 +344,7 @@ static void vfio_intx_disable(VFIOPCIDevice *vdev)
fd = event_notifier_get_fd(&vdev->intx.interrupt);
qemu_set_fd_handler(fd, NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->intx.interrupt);
+ vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
vdev->interrupt = VFIO_INT_NONE;
@@ -501,7 +507,7 @@ static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector, int nr)
return;
fail_kvm:
- event_notifier_cleanup(&vector->kvm_interrupt);
+ vfio_notifier_cleanup(vector->vdev, &vector->kvm_interrupt, name, nr);
fail_notifier:
kvm_irqchip_release_virq(kvm_state, vector->virq);
vector->virq = -1;
@@ -514,7 +520,7 @@ static void vfio_remove_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
vector->virq);
kvm_irqchip_release_virq(kvm_state, vector->virq);
vector->virq = -1;
- event_notifier_cleanup(&vector->kvm_interrupt);
+ vfio_notifier_cleanup(vdev, &vector->kvm_interrupt, "kvm_interrupt", nr);
}
static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
@@ -837,7 +843,7 @@ static void vfio_msi_disable_common(VFIOPCIDevice *vdev)
}
qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
NULL, NULL, NULL);
- event_notifier_cleanup(&vector->interrupt);
+ vfio_notifier_cleanup(vdev, &vector->interrupt, "interrupt", i);
}
}
@@ -2955,7 +2961,7 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
qemu_set_fd_handler(fd, NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->err_notifier);
+ vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
vdev->pci_aer = false;
}
}
@@ -2974,7 +2980,7 @@ static void vfio_unregister_err_notifier(VFIOPCIDevice *vdev)
}
qemu_set_fd_handler(event_notifier_get_fd(&vdev->err_notifier),
NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->err_notifier);
+ vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
}
static void vfio_req_notifier_handler(void *opaque)
@@ -3022,7 +3028,7 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
qemu_set_fd_handler(fd, NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->req_notifier);
+ vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
} else {
vdev->req_enabled = true;
}
@@ -3042,7 +3048,7 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
}
qemu_set_fd_handler(event_notifier_get_fd(&vdev->req_notifier),
NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->req_notifier);
+ vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
vdev->req_enabled = false;
}
--
1.8.3.1
* [PATCH V5 17/38] vfio/pci: export MSI functions
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (15 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 16/38] vfio/pci: vfio_notifier_cleanup Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-10 15:39 ` [PATCH V5 18/38] vfio-pci: preserve MSI Steve Sistare
` (21 subsequent siblings)
38 siblings, 0 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Export various MSI functions, renamed with a vfio_pci prefix, for use by
CPR in subsequent patches. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
hw/vfio/pci.h | 8 ++++++++
hw/vfio/pci.c | 29 +++++++++++++++++------------
2 files changed, 25 insertions(+), 12 deletions(-)
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 5ce0fb9..6e4840d 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -210,6 +210,14 @@ static inline bool vfio_is_vga(VFIOPCIDevice *vdev)
return class == PCI_CLASS_DISPLAY_VGA;
}
+/* MSI/MSI-X/INTx */
+void vfio_pci_vector_init(VFIOPCIDevice *vdev, int nr);
+void vfio_pci_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
+ int vector_n, bool msix);
+void vfio_pci_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
+void vfio_pci_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
+bool vfio_pci_intx_enable(VFIOPCIDevice *vdev, Error **errp);
+
uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
void vfio_pci_write_config(PCIDevice *pdev,
uint32_t addr, uint32_t val, int len);
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 508701c..4cda6dc 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -351,6 +351,11 @@ static void vfio_intx_disable(VFIOPCIDevice *vdev)
trace_vfio_intx_disable(vdev->vbasedev.name);
}
+bool vfio_pci_intx_enable(VFIOPCIDevice *vdev, Error **errp)
+{
+ return vfio_intx_enable(vdev, errp);
+}
+
/*
* MSI/X
*/
@@ -475,8 +480,8 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
return ret;
}
-static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
- int vector_n, bool msix)
+void vfio_pci_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
+ int vector_n, bool msix)
{
if ((msix && vdev->no_kvm_msix) || (!msix && vdev->no_kvm_msi)) {
return;
@@ -549,7 +554,7 @@ static void set_irq_signalling(VFIODevice *vbasedev, VFIOMSIVector *vector,
}
}
-static void vfio_pci_vector_init(VFIOPCIDevice *vdev, int nr)
+void vfio_pci_vector_init(VFIOPCIDevice *vdev, int nr)
{
VFIOMSIVector *vector = &vdev->msi_vectors[nr];
PCIDevice *pdev = &vdev->pdev;
@@ -599,10 +604,10 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
} else {
if (msg) {
if (vdev->defer_kvm_irq_routing) {
- vfio_add_kvm_msi_virq(vdev, vector, nr, true);
+ vfio_pci_add_kvm_msi_virq(vdev, vector, nr, true);
} else {
vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
- vfio_add_kvm_msi_virq(vdev, vector, nr, true);
+ vfio_pci_add_kvm_msi_virq(vdev, vector, nr, true);
kvm_irqchip_commit_route_changes(&vfio_route_change);
vfio_connect_kvm_msi_virq(vector, nr);
}
@@ -681,14 +686,14 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
}
}
-static void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
+void vfio_pci_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
{
assert(!vdev->defer_kvm_irq_routing);
vdev->defer_kvm_irq_routing = true;
vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
}
-static void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
+void vfio_pci_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
{
int i;
@@ -718,14 +723,14 @@ static void vfio_msix_enable(VFIOPCIDevice *vdev)
* routes once rather than per vector provides a substantial
* performance improvement.
*/
- vfio_prepare_kvm_msi_virq_batch(vdev);
+ vfio_pci_prepare_kvm_msi_virq_batch(vdev);
if (msix_set_vector_notifiers(&vdev->pdev, vfio_msix_vector_use,
vfio_msix_vector_release, NULL)) {
error_report("vfio: msix_set_vector_notifiers failed");
}
- vfio_commit_kvm_msi_virq_batch(vdev);
+ vfio_pci_commit_kvm_msi_virq_batch(vdev);
if (vdev->nr_vectors) {
ret = vfio_enable_vectors(vdev, true);
@@ -769,7 +774,7 @@ retry:
* Deferring to commit the KVM routes once rather than per vector
* provides a substantial performance improvement.
*/
- vfio_prepare_kvm_msi_virq_batch(vdev);
+ vfio_pci_prepare_kvm_msi_virq_batch(vdev);
vdev->msi_vectors = g_new0(VFIOMSIVector, vdev->nr_vectors);
@@ -793,10 +798,10 @@ retry:
* Attempt to enable route through KVM irqchip,
* default to userspace handling if unavailable.
*/
- vfio_add_kvm_msi_virq(vdev, vector, i, false);
+ vfio_pci_add_kvm_msi_virq(vdev, vector, i, false);
}
- vfio_commit_kvm_msi_virq_batch(vdev);
+ vfio_pci_commit_kvm_msi_virq_batch(vdev);
/* Set interrupt type prior to possible interrupts */
vdev->interrupt = VFIO_INT_MSI;
--
1.8.3.1
^ permalink raw reply related [flat|nested] 101+ messages in thread
* [PATCH V5 18/38] vfio-pci: preserve MSI
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (16 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 17/38] vfio/pci: export MSI functions Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-07-01 16:12 ` Steven Sistare
2025-07-02 15:35 ` Cédric Le Goater
2025-06-10 15:39 ` [PATCH V5 19/38] vfio-pci: preserve INTx Steve Sistare
` (20 subsequent siblings)
38 siblings, 2 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Save the MSI message area as part of vfio-pci vmstate, and preserve the
interrupt and notifier eventfds. migrate_incoming loads the MSI data,
then the vfio-pci post_load handler finds the eventfds in CPR state,
rebuilds vector data structures, and attaches the interrupts to the new
KVM instance.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/pci.h | 2 +
include/hw/vfio/vfio-cpr.h | 8 ++++
hw/vfio/cpr.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++
hw/vfio/pci.c | 54 ++++++++++++++++++++++++--
4 files changed, 158 insertions(+), 3 deletions(-)
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 6e4840d..4d1203c 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -217,6 +217,8 @@ void vfio_pci_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
void vfio_pci_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
void vfio_pci_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
bool vfio_pci_intx_enable(VFIOPCIDevice *vdev, Error **errp);
+void vfio_pci_msix_set_notifiers(VFIOPCIDevice *vdev);
+void vfio_pci_msi_set_handler(VFIOPCIDevice *vdev, int nr);
uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
void vfio_pci_write_config(PCIDevice *pdev,
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 8bf85b9..25e74ee 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -15,6 +15,7 @@
struct VFIOContainer;
struct VFIOContainerBase;
struct VFIOGroup;
+struct VFIOPCIDevice;
typedef struct VFIOContainerCPR {
Error *blocker;
@@ -52,6 +53,13 @@ void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
bool vfio_cpr_ram_discard_register_listener(
struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
+void vfio_cpr_save_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
+ int nr, int fd);
+int vfio_cpr_load_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
+ int nr);
+void vfio_cpr_delete_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
+ int nr);
+
extern const VMStateDescription vfio_cpr_pci_vmstate;
#endif /* HW_VFIO_VFIO_CPR_H */
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index fdbb58e..e467373 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -9,6 +9,8 @@
#include "hw/vfio/vfio-device.h"
#include "hw/vfio/vfio-cpr.h"
#include "hw/vfio/pci.h"
+#include "hw/pci/msix.h"
+#include "hw/pci/msi.h"
#include "migration/cpr.h"
#include "qapi/error.h"
#include "system/runstate.h"
@@ -40,6 +42,69 @@ void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
}
+#define STRDUP_VECTOR_FD_NAME(vdev, name) \
+ g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
+
+void vfio_cpr_save_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr,
+ int fd)
+{
+ g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
+ cpr_save_fd(fdname, nr, fd);
+}
+
+int vfio_cpr_load_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+ g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
+ return cpr_find_fd(fdname, nr);
+}
+
+void vfio_cpr_delete_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+ g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
+ cpr_delete_fd(fdname, nr);
+}
+
+static void vfio_cpr_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors,
+ bool msix)
+{
+ int i, fd;
+ bool pending = false;
+ PCIDevice *pdev = &vdev->pdev;
+
+ vdev->nr_vectors = nr_vectors;
+ vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
+ vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
+
+ vfio_pci_prepare_kvm_msi_virq_batch(vdev);
+
+ for (i = 0; i < nr_vectors; i++) {
+ VFIOMSIVector *vector = &vdev->msi_vectors[i];
+
+ fd = vfio_cpr_load_vector_fd(vdev, "interrupt", i);
+ if (fd >= 0) {
+ vfio_pci_vector_init(vdev, i);
+ vfio_pci_msi_set_handler(vdev, i);
+ }
+
+ if (vfio_cpr_load_vector_fd(vdev, "kvm_interrupt", i) >= 0) {
+ vfio_pci_add_kvm_msi_virq(vdev, vector, i, msix);
+ } else {
+ vdev->msi_vectors[i].virq = -1;
+ }
+
+ if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
+ set_bit(i, vdev->msix->pending);
+ pending = true;
+ }
+ }
+
+ vfio_pci_commit_kvm_msi_virq_batch(vdev);
+
+ if (msix) {
+ memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
+ }
+}
+
/*
* The kernel may change non-emulated config bits. Exclude them from the
* changed-bits check in get_pci_config_device.
@@ -58,13 +123,45 @@ static int vfio_cpr_pci_pre_load(void *opaque)
return 0;
}
+static int vfio_cpr_pci_post_load(void *opaque, int version_id)
+{
+ VFIOPCIDevice *vdev = opaque;
+ PCIDevice *pdev = &vdev->pdev;
+ int nr_vectors;
+
+ if (msix_enabled(pdev)) {
+ vfio_pci_msix_set_notifiers(vdev);
+ nr_vectors = vdev->msix->entries;
+ vfio_cpr_claim_vectors(vdev, nr_vectors, true);
+
+ } else if (msi_enabled(pdev)) {
+ nr_vectors = msi_nr_vectors_allocated(pdev);
+ vfio_cpr_claim_vectors(vdev, nr_vectors, false);
+
+ } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
+ g_assert_not_reached(); /* completed in a subsequent patch */
+ }
+
+ return 0;
+}
+
+static bool pci_msix_present(void *opaque, int version_id)
+{
+ PCIDevice *pdev = opaque;
+
+ return msix_present(pdev);
+}
+
const VMStateDescription vfio_cpr_pci_vmstate = {
.name = "vfio-cpr-pci",
.version_id = 0,
.minimum_version_id = 0,
.pre_load = vfio_cpr_pci_pre_load,
+ .post_load = vfio_cpr_pci_post_load,
.needed = cpr_incoming_needed,
.fields = (VMStateField[]) {
+ VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
+ VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, pci_msix_present),
VMSTATE_END_OF_LIST()
}
};
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 4cda6dc..b3dbb84 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -29,6 +29,7 @@
#include "hw/pci/pci_bridge.h"
#include "hw/qdev-properties.h"
#include "hw/qdev-properties-system.h"
+#include "hw/vfio/vfio-cpr.h"
#include "migration/vmstate.h"
#include "migration/cpr.h"
#include "qobject/qdict.h"
@@ -57,13 +58,25 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
+/* Create new or reuse existing eventfd */
static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
const char *name, int nr, Error **errp)
{
- int ret = event_notifier_init(e, 0);
+ int fd = vfio_cpr_load_vector_fd(vdev, name, nr);
+ int ret = 0;
- if (ret) {
- error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
+ if (fd >= 0) {
+ event_notifier_init_fd(e, fd);
+ } else {
+ ret = event_notifier_init(e, 0);
+ if (ret) {
+ error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
+ } else {
+ fd = event_notifier_get_fd(e);
+ if (fd >= 0) {
+ vfio_cpr_save_vector_fd(vdev, name, nr, fd);
+ }
+ }
}
return !ret;
}
@@ -71,6 +84,7 @@ static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
const char *name, int nr)
{
+ vfio_cpr_delete_vector_fd(vdev, name, nr);
event_notifier_cleanup(e);
}
@@ -394,6 +408,14 @@ static void vfio_msi_interrupt(void *opaque)
notify(&vdev->pdev, nr);
}
+void vfio_pci_msi_set_handler(VFIOPCIDevice *vdev, int nr)
+{
+ VFIOMSIVector *vector = &vdev->msi_vectors[nr];
+ int fd = event_notifier_get_fd(&vector->interrupt);
+
+ qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
+}
+
/*
* Get MSI-X enabled, but no vector enabled, by setting vector 0 with an invalid
* fd to kernel.
@@ -580,6 +602,15 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
int ret;
bool resizing = !!(vdev->nr_vectors < nr + 1);
+ /*
+ * Ignore the callback from msix_set_vector_notifiers during resume.
+ * The necessary subset of these actions is called from
+ * vfio_cpr_claim_vectors during post load.
+ */
+ if (cpr_is_incoming()) {
+ return 0;
+ }
+
trace_vfio_msix_vector_do_use(vdev->vbasedev.name, nr);
vector = &vdev->msi_vectors[nr];
@@ -686,6 +717,12 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
}
}
+void vfio_pci_msix_set_notifiers(VFIOPCIDevice *vdev)
+{
+ msix_set_vector_notifiers(&vdev->pdev, vfio_msix_vector_use,
+ vfio_msix_vector_release, NULL);
+}
+
void vfio_pci_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
{
assert(!vdev->defer_kvm_irq_routing);
@@ -2962,6 +2999,11 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
fd = event_notifier_get_fd(&vdev->err_notifier);
qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
+ /* Do not alter irq_signaling during vfio_realize for cpr */
+ if (cpr_is_incoming()) {
+ return;
+ }
+
if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_ERR_IRQ_INDEX, 0,
VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
@@ -3029,6 +3071,12 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
fd = event_notifier_get_fd(&vdev->req_notifier);
qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
+ /* Do not alter irq_signaling during vfio_realize for cpr */
+ if (cpr_is_incoming()) {
+ vdev->req_enabled = true;
+ return;
+ }
+
if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_REQ_IRQ_INDEX, 0,
VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 101+ messages in thread
* Re: [PATCH V5 18/38] vfio-pci: preserve MSI
2025-06-10 15:39 ` [PATCH V5 18/38] vfio-pci: preserve MSI Steve Sistare
@ 2025-07-01 16:12 ` Steven Sistare
2025-07-02 7:17 ` Cédric Le Goater
2025-07-02 15:35 ` Cédric Le Goater
1 sibling, 1 reply; 101+ messages in thread
From: Steven Sistare @ 2025-07-01 16:12 UTC (permalink / raw)
To: Cedric Le Goater
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas,
qemu-devel
Hi Cedric, what do we need to do to get this patch in, and the patch "preserve INTx"?
Just review, or are there conflicts to resolve?
- Steve
^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH V5 18/38] vfio-pci: preserve MSI
2025-07-01 16:12 ` Steven Sistare
@ 2025-07-02 7:17 ` Cédric Le Goater
2025-07-02 12:03 ` Steven Sistare
0 siblings, 1 reply; 101+ messages in thread
From: Cédric Le Goater @ 2025-07-02 7:17 UTC (permalink / raw)
To: Steven Sistare
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas,
qemu-devel
Hello Steve,
On 7/1/25 18:12, Steven Sistare wrote:
> Hi Cedric, what do we need to do to get this patch in, and the patch "preserve INTx"?
> Just review, or are there conflicts to resolve?
I haven't looked at it yet, but I will before the end of the week.
I plan to send the last VFIO PR for QEMU 10.1 on Friday, and I am
on PTO next week.
Thanks,
C.
* Re: [PATCH V5 18/38] vfio-pci: preserve MSI
2025-07-02 7:17 ` Cédric Le Goater
@ 2025-07-02 12:03 ` Steven Sistare
0 siblings, 0 replies; 101+ messages in thread
From: Steven Sistare @ 2025-07-02 12:03 UTC (permalink / raw)
To: Cédric Le Goater, Zhenzhong Duan
Cc: Alex Williamson, Yi Liu, Eric Auger, Michael S. Tsirkin,
Marcel Apfelbaum, Peter Xu, Fabiano Rosas, qemu-devel
On 7/2/2025 3:17 AM, Cédric Le Goater wrote:
> Hello Steve,
>
> On 7/1/25 18:12, Steven Sistare wrote:
>> Hi Cedric, what do we need to do to get this patch in, and the patch "preserve INTx"?
>> Just review, or are there conflicts to resolve?
>
> I haven't looked at it yet. I will before the end of the week.
>
> I should send the last VFIO PR for QEMU 10.1 on Friday. On PTO
> next week.

Hi Zhenzhong,

With Cedric out next week, we have very little time to finish the iommufd series.
I can post V6 today if you are satisfied with my most recent comments, and if
you review patch 29 "vfio/iommufd: register container for cpr".

- Steve

>> On 6/10/2025 11:39 AM, Steve Sistare wrote:
>>> Save the MSI message area as part of vfio-pci vmstate, and preserve the
>>> interrupt and notifier eventfd's. migrate_incoming loads the MSI data,
>>> then the vfio-pci post_load handler finds the eventfds in CPR state,
>>> rebuilds vector data structures, and attaches the interrupts to the new
>>> KVM instance.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>> hw/vfio/pci.h | 2 +
>>> include/hw/vfio/vfio-cpr.h | 8 ++++
>>> hw/vfio/cpr.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++
>>> hw/vfio/pci.c | 54 ++++++++++++++++++++++++--
>>> 4 files changed, 158 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
>>> index 6e4840d..4d1203c 100644
>>> --- a/hw/vfio/pci.h
>>> +++ b/hw/vfio/pci.h
>>> @@ -217,6 +217,8 @@ void vfio_pci_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
>>> void vfio_pci_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
>>> void vfio_pci_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
>>> bool vfio_pci_intx_enable(VFIOPCIDevice *vdev, Error **errp);
>>> +void vfio_pci_msix_set_notifiers(VFIOPCIDevice *vdev);
>>> +void vfio_pci_msi_set_handler(VFIOPCIDevice *vdev, int nr);
>>> uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
>>> void vfio_pci_write_config(PCIDevice *pdev,
>>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>>> index 8bf85b9..25e74ee 100644
>>> --- a/include/hw/vfio/vfio-cpr.h
>>> +++ b/include/hw/vfio/vfio-cpr.h
>>> @@ -15,6 +15,7 @@
>>> struct VFIOContainer;
>>> struct VFIOContainerBase;
>>> struct VFIOGroup;
>>> +struct VFIOPCIDevice;
>>> typedef struct VFIOContainerCPR {
>>> Error *blocker;
>>> @@ -52,6 +53,13 @@ void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
>>> bool vfio_cpr_ram_discard_register_listener(
>>> struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
>>> +void vfio_cpr_save_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
>>> + int nr, int fd);
>>> +int vfio_cpr_load_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
>>> + int nr);
>>> +void vfio_cpr_delete_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
>>> + int nr);
>>> +
>>> extern const VMStateDescription vfio_cpr_pci_vmstate;
>>> #endif /* HW_VFIO_VFIO_CPR_H */
>>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>>> index fdbb58e..e467373 100644
>>> --- a/hw/vfio/cpr.c
>>> +++ b/hw/vfio/cpr.c
>>> @@ -9,6 +9,8 @@
>>> #include "hw/vfio/vfio-device.h"
>>> #include "hw/vfio/vfio-cpr.h"
>>> #include "hw/vfio/pci.h"
>>> +#include "hw/pci/msix.h"
>>> +#include "hw/pci/msi.h"
>>> #include "migration/cpr.h"
>>> #include "qapi/error.h"
>>> #include "system/runstate.h"
>>> @@ -40,6 +42,69 @@ void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
>>> migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
>>> }
>>> +#define STRDUP_VECTOR_FD_NAME(vdev, name) \
>>> + g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
>>> +
>>> +void vfio_cpr_save_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr,
>>> + int fd)
>>> +{
>>> + g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
>>> + cpr_save_fd(fdname, nr, fd);
>>> +}
>>> +
>>> +int vfio_cpr_load_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr)
>>> +{
>>> + g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
>>> + return cpr_find_fd(fdname, nr);
>>> +}
>>> +
>>> +void vfio_cpr_delete_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr)
>>> +{
>>> + g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
>>> + cpr_delete_fd(fdname, nr);
>>> +}
>>> +
>>> +static void vfio_cpr_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors,
>>> + bool msix)
>>> +{
>>> + int i, fd;
>>> + bool pending = false;
>>> + PCIDevice *pdev = &vdev->pdev;
>>> +
>>> + vdev->nr_vectors = nr_vectors;
>>> + vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
>>> + vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
>>> +
>>> + vfio_pci_prepare_kvm_msi_virq_batch(vdev);
>>> +
>>> + for (i = 0; i < nr_vectors; i++) {
>>> + VFIOMSIVector *vector = &vdev->msi_vectors[i];
>>> +
>>> + fd = vfio_cpr_load_vector_fd(vdev, "interrupt", i);
>>> + if (fd >= 0) {
>>> + vfio_pci_vector_init(vdev, i);
>>> + vfio_pci_msi_set_handler(vdev, i);
>>> + }
>>> +
>>> + if (vfio_cpr_load_vector_fd(vdev, "kvm_interrupt", i) >= 0) {
>>> + vfio_pci_add_kvm_msi_virq(vdev, vector, i, msix);
>>> + } else {
>>> + vdev->msi_vectors[i].virq = -1;
>>> + }
>>> +
>>> + if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
>>> + set_bit(i, vdev->msix->pending);
>>> + pending = true;
>>> + }
>>> + }
>>> +
>>> + vfio_pci_commit_kvm_msi_virq_batch(vdev);
>>> +
>>> + if (msix) {
>>> + memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
>>> + }
>>> +}
>>> +
>>> /*
>>> * The kernel may change non-emulated config bits. Exclude them from the
>>> * changed-bits check in get_pci_config_device.
>>> @@ -58,13 +123,45 @@ static int vfio_cpr_pci_pre_load(void *opaque)
>>> return 0;
>>> }
>>> +static int vfio_cpr_pci_post_load(void *opaque, int version_id)
>>> +{
>>> + VFIOPCIDevice *vdev = opaque;
>>> + PCIDevice *pdev = &vdev->pdev;
>>> + int nr_vectors;
>>> +
>>> + if (msix_enabled(pdev)) {
>>> + vfio_pci_msix_set_notifiers(vdev);
>>> + nr_vectors = vdev->msix->entries;
>>> + vfio_cpr_claim_vectors(vdev, nr_vectors, true);
>>> +
>>> + } else if (msi_enabled(pdev)) {
>>> + nr_vectors = msi_nr_vectors_allocated(pdev);
>>> + vfio_cpr_claim_vectors(vdev, nr_vectors, false);
>>> +
>>> + } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
>>> + g_assert_not_reached(); /* completed in a subsequent patch */
>>> + }
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static bool pci_msix_present(void *opaque, int version_id)
>>> +{
>>> + PCIDevice *pdev = opaque;
>>> +
>>> + return msix_present(pdev);
>>> +}
>>> +
>>> const VMStateDescription vfio_cpr_pci_vmstate = {
>>> .name = "vfio-cpr-pci",
>>> .version_id = 0,
>>> .minimum_version_id = 0,
>>> .pre_load = vfio_cpr_pci_pre_load,
>>> + .post_load = vfio_cpr_pci_post_load,
>>> .needed = cpr_incoming_needed,
>>> .fields = (VMStateField[]) {
>>> + VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
>>> + VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, pci_msix_present),
>>> VMSTATE_END_OF_LIST()
>>> }
>>> };
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index 4cda6dc..b3dbb84 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -29,6 +29,7 @@
>>> #include "hw/pci/pci_bridge.h"
>>> #include "hw/qdev-properties.h"
>>> #include "hw/qdev-properties-system.h"
>>> +#include "hw/vfio/vfio-cpr.h"
>>> #include "migration/vmstate.h"
>>> #include "migration/cpr.h"
>>> #include "qobject/qdict.h"
>>> @@ -57,13 +58,25 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
>>> static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
>>> static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
>>> +/* Create new or reuse existing eventfd */
>>> static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
>>> const char *name, int nr, Error **errp)
>>> {
>>> - int ret = event_notifier_init(e, 0);
>>> + int fd = vfio_cpr_load_vector_fd(vdev, name, nr);
>>> + int ret = 0;
>>> - if (ret) {
>>> - error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
>>> + if (fd >= 0) {
>>> + event_notifier_init_fd(e, fd);
>>> + } else {
>>> + ret = event_notifier_init(e, 0);
>>> + if (ret) {
>>> + error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
>>> + } else {
>>> + fd = event_notifier_get_fd(e);
>>> + if (fd >= 0) {
>>> + vfio_cpr_save_vector_fd(vdev, name, nr, fd);
>>> + }
>>> + }
>>> }
>>> return !ret;
>>> }
>>> @@ -71,6 +84,7 @@ static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
>>> static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
>>> const char *name, int nr)
>>> {
>>> + vfio_cpr_delete_vector_fd(vdev, name, nr);
>>> event_notifier_cleanup(e);
>>> }
>>> @@ -394,6 +408,14 @@ static void vfio_msi_interrupt(void *opaque)
>>> notify(&vdev->pdev, nr);
>>> }
>>> +void vfio_pci_msi_set_handler(VFIOPCIDevice *vdev, int nr)
>>> +{
>>> + VFIOMSIVector *vector = &vdev->msi_vectors[nr];
>>> + int fd = event_notifier_get_fd(&vector->interrupt);
>>> +
>>> + qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
>>> +}
>>> +
>>> /*
>>> * Get MSI-X enabled, but no vector enabled, by setting vector 0 with an invalid
>>> * fd to kernel.
>>> @@ -580,6 +602,15 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>>> int ret;
>>> bool resizing = !!(vdev->nr_vectors < nr + 1);
>>> + /*
>>> + * Ignore the callback from msix_set_vector_notifiers during resume.
>>> + * The necessary subset of these actions is called from
>>> + * vfio_cpr_claim_vectors during post load.
>>> + */
>>> + if (cpr_is_incoming()) {
>>> + return 0;
>>> + }
>>> +
>>> trace_vfio_msix_vector_do_use(vdev->vbasedev.name, nr);
>>> vector = &vdev->msi_vectors[nr];
>>> @@ -686,6 +717,12 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
>>> }
>>> }
>>> +void vfio_pci_msix_set_notifiers(VFIOPCIDevice *vdev)
>>> +{
>>> + msix_set_vector_notifiers(&vdev->pdev, vfio_msix_vector_use,
>>> + vfio_msix_vector_release, NULL);
>>> +}
>>> +
>>> void vfio_pci_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
>>> {
>>> assert(!vdev->defer_kvm_irq_routing);
>>> @@ -2962,6 +2999,11 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
>>> fd = event_notifier_get_fd(&vdev->err_notifier);
>>> qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
>>> + /* Do not alter irq_signaling during vfio_realize for cpr */
>>> + if (cpr_is_incoming()) {
>>> + return;
>>> + }
>>> +
>>> if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_ERR_IRQ_INDEX, 0,
>>> VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
>>> error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
>>> @@ -3029,6 +3071,12 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
>>> fd = event_notifier_get_fd(&vdev->req_notifier);
>>> qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
>>> + /* Do not alter irq_signaling during vfio_realize for cpr */
>>> + if (cpr_is_incoming()) {
>>> + vdev->req_enabled = true;
>>> + return;
>>> + }
>>> +
>>> if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_REQ_IRQ_INDEX, 0,
>>> VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
>>> error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
>>
>
* Re: [PATCH V5 18/38] vfio-pci: preserve MSI
2025-06-10 15:39 ` [PATCH V5 18/38] vfio-pci: preserve MSI Steve Sistare
2025-07-01 16:12 ` Steven Sistare
@ 2025-07-02 15:35 ` Cédric Le Goater
2025-07-02 16:40 ` Steven Sistare
1 sibling, 1 reply; 101+ messages in thread
From: Cédric Le Goater @ 2025-07-02 15:35 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 6/10/25 17:39, Steve Sistare wrote:
> Save the MSI message area as part of vfio-pci vmstate, and preserve the
> interrupt and notifier eventfd's. migrate_incoming loads the MSI data,
> then the vfio-pci post_load handler finds the eventfds in CPR state,
> rebuilds vector data structures, and attaches the interrupts to the new
> KVM instance.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> hw/vfio/pci.h | 2 +
> include/hw/vfio/vfio-cpr.h | 8 ++++
> hw/vfio/cpr.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++
> hw/vfio/pci.c | 54 ++++++++++++++++++++++++--
> 4 files changed, 158 insertions(+), 3 deletions(-)
>
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 6e4840d..4d1203c 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -217,6 +217,8 @@ void vfio_pci_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
> void vfio_pci_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
> void vfio_pci_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
> bool vfio_pci_intx_enable(VFIOPCIDevice *vdev, Error **errp);
> +void vfio_pci_msix_set_notifiers(VFIOPCIDevice *vdev);
> +void vfio_pci_msi_set_handler(VFIOPCIDevice *vdev, int nr);
>
> uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
> void vfio_pci_write_config(PCIDevice *pdev,
> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
> index 8bf85b9..25e74ee 100644
> --- a/include/hw/vfio/vfio-cpr.h
> +++ b/include/hw/vfio/vfio-cpr.h
> @@ -15,6 +15,7 @@
> struct VFIOContainer;
> struct VFIOContainerBase;
> struct VFIOGroup;
> +struct VFIOPCIDevice;
>
> typedef struct VFIOContainerCPR {
> Error *blocker;
> @@ -52,6 +53,13 @@ void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
> bool vfio_cpr_ram_discard_register_listener(
> struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
>
> +void vfio_cpr_save_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
> + int nr, int fd);
> +int vfio_cpr_load_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
> + int nr);
> +void vfio_cpr_delete_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
> + int nr);
> +
> extern const VMStateDescription vfio_cpr_pci_vmstate;
>
> #endif /* HW_VFIO_VFIO_CPR_H */
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> index fdbb58e..e467373 100644
> --- a/hw/vfio/cpr.c
> +++ b/hw/vfio/cpr.c
> @@ -9,6 +9,8 @@
> #include "hw/vfio/vfio-device.h"
> #include "hw/vfio/vfio-cpr.h"
> #include "hw/vfio/pci.h"
> +#include "hw/pci/msix.h"
> +#include "hw/pci/msi.h"
> #include "migration/cpr.h"
> #include "qapi/error.h"
> #include "system/runstate.h"
> @@ -40,6 +42,69 @@ void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
> migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
> }
>
> +#define STRDUP_VECTOR_FD_NAME(vdev, name) \
> + g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
> +
> +void vfio_cpr_save_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr,
> + int fd)
> +{
> + g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
> + cpr_save_fd(fdname, nr, fd);
> +}
> +
> +int vfio_cpr_load_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr)
> +{
> + g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
> + return cpr_find_fd(fdname, nr);
> +}
> +
> +void vfio_cpr_delete_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr)
> +{
> + g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
> + cpr_delete_fd(fdname, nr);
> +}
> +
> +static void vfio_cpr_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors,
> + bool msix)
> +{
> + int i, fd;
> + bool pending = false;
> + PCIDevice *pdev = &vdev->pdev;
> +
> + vdev->nr_vectors = nr_vectors;
> + vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
> + vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
> +
> + vfio_pci_prepare_kvm_msi_virq_batch(vdev);
> +
> + for (i = 0; i < nr_vectors; i++) {
> + VFIOMSIVector *vector = &vdev->msi_vectors[i];
> +
> + fd = vfio_cpr_load_vector_fd(vdev, "interrupt", i);
> + if (fd >= 0) {
> + vfio_pci_vector_init(vdev, i);
> + vfio_pci_msi_set_handler(vdev, i);
> + }
> +
> + if (vfio_cpr_load_vector_fd(vdev, "kvm_interrupt", i) >= 0) {
> + vfio_pci_add_kvm_msi_virq(vdev, vector, i, msix);
> + } else {
> + vdev->msi_vectors[i].virq = -1;
> + }
> +
> + if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
> + set_bit(i, vdev->msix->pending);
> + pending = true;
> + }
> + }
> +
> + vfio_pci_commit_kvm_msi_virq_batch(vdev);
> +
> + if (msix) {
> + memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
> + }
> +}
> +
> /*
> * The kernel may change non-emulated config bits. Exclude them from the
> * changed-bits check in get_pci_config_device.
> @@ -58,13 +123,45 @@ static int vfio_cpr_pci_pre_load(void *opaque)
> return 0;
> }
>
> +static int vfio_cpr_pci_post_load(void *opaque, int version_id)
> +{
> + VFIOPCIDevice *vdev = opaque;
> + PCIDevice *pdev = &vdev->pdev;
> + int nr_vectors;
> +
> + if (msix_enabled(pdev)) {
> + vfio_pci_msix_set_notifiers(vdev);
> + nr_vectors = vdev->msix->entries;
> + vfio_cpr_claim_vectors(vdev, nr_vectors, true);
> +
> + } else if (msi_enabled(pdev)) {
> + nr_vectors = msi_nr_vectors_allocated(pdev);
> + vfio_cpr_claim_vectors(vdev, nr_vectors, false);
> +
> + } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
> + g_assert_not_reached(); /* completed in a subsequent patch */
> + }
> +
> + return 0;
> +}
> +
> +static bool pci_msix_present(void *opaque, int version_id)
> +{
> + PCIDevice *pdev = opaque;
> +
> + return msix_present(pdev);
> +}
> +
> const VMStateDescription vfio_cpr_pci_vmstate = {
> .name = "vfio-cpr-pci",
> .version_id = 0,
> .minimum_version_id = 0,
> .pre_load = vfio_cpr_pci_pre_load,
> + .post_load = vfio_cpr_pci_post_load,
> .needed = cpr_incoming_needed,
> .fields = (VMStateField[]) {
> + VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
> + VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, pci_msix_present),
> VMSTATE_END_OF_LIST()
> }
> };
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 4cda6dc..b3dbb84 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -29,6 +29,7 @@
> #include "hw/pci/pci_bridge.h"
> #include "hw/qdev-properties.h"
> #include "hw/qdev-properties-system.h"
> +#include "hw/vfio/vfio-cpr.h"
> #include "migration/vmstate.h"
> #include "migration/cpr.h"
> #include "qobject/qdict.h"
> @@ -57,13 +58,25 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
> static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
> static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
>
> +/* Create new or reuse existing eventfd */
> static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
> const char *name, int nr, Error **errp)
> {
> - int ret = event_notifier_init(e, 0);
> + int fd = vfio_cpr_load_vector_fd(vdev, name, nr);
Since this is a "complex" initialization, I would prefer it to
be done ...
> + int ret = 0;
>
> - if (ret) {
> - error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
... here :
fd = vfio_cpr_load_vector_fd(vdev, name, nr);
> + if (fd >= 0) {
> + event_notifier_init_fd(e, fd);
> + } else {
> + ret = event_notifier_init(e, 0);
> + if (ret) {
> + error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
> + } else {
> + fd = event_notifier_get_fd(e);
> + if (fd >= 0) {
> + vfio_cpr_save_vector_fd(vdev, name, nr, fd);
> + }
> + }
Instead of preserving the ending return, could you please rework the
if statements to return asap. I think it would clarify the routine.
> }
> return !ret;
> }
> @@ -71,6 +84,7 @@ static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
> static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
> const char *name, int nr)
> {
> + vfio_cpr_delete_vector_fd(vdev, name, nr);
> event_notifier_cleanup(e);
> }
>
> @@ -394,6 +408,14 @@ static void vfio_msi_interrupt(void *opaque)
> notify(&vdev->pdev, nr);
> }
>
> +void vfio_pci_msi_set_handler(VFIOPCIDevice *vdev, int nr)
> +{
> + VFIOMSIVector *vector = &vdev->msi_vectors[nr];
> + int fd = event_notifier_get_fd(&vector->interrupt);
> +
> + qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
> +}
> +
> /*
> * Get MSI-X enabled, but no vector enabled, by setting vector 0 with an invalid
> * fd to kernel.
> @@ -580,6 +602,15 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
> int ret;
> bool resizing = !!(vdev->nr_vectors < nr + 1);
>
> + /*
> + * Ignore the callback from msix_set_vector_notifiers during resume.
> + * The necessary subset of these actions is called from
> + * vfio_cpr_claim_vectors during post load.
> + */
> + if (cpr_is_incoming()) {
> + return 0;
> + }
> +
This test could be moved into vfio_msix_vector_use().
The rest looks fine.
Thanks,
C.
> trace_vfio_msix_vector_do_use(vdev->vbasedev.name, nr);
>
> vector = &vdev->msi_vectors[nr];
> @@ -686,6 +717,12 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
> }
> }
>
> +void vfio_pci_msix_set_notifiers(VFIOPCIDevice *vdev)
> +{
> + msix_set_vector_notifiers(&vdev->pdev, vfio_msix_vector_use,
> + vfio_msix_vector_release, NULL);
> +}
> +
> void vfio_pci_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
> {
> assert(!vdev->defer_kvm_irq_routing);
> @@ -2962,6 +2999,11 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
> fd = event_notifier_get_fd(&vdev->err_notifier);
> qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
>
> + /* Do not alter irq_signaling during vfio_realize for cpr */
> + if (cpr_is_incoming()) {
> + return;
> + }
> +
> if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_ERR_IRQ_INDEX, 0,
> VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
> error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
> @@ -3029,6 +3071,12 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
> fd = event_notifier_get_fd(&vdev->req_notifier);
> qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
>
> + /* Do not alter irq_signaling during vfio_realize for cpr */
> + if (cpr_is_incoming()) {
> + vdev->req_enabled = true;
> + return;
> + }
> +
> if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_REQ_IRQ_INDEX, 0,
> VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
> error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
* Re: [PATCH V5 18/38] vfio-pci: preserve MSI
2025-07-02 15:35 ` Cédric Le Goater
@ 2025-07-02 16:40 ` Steven Sistare
0 siblings, 0 replies; 101+ messages in thread
From: Steven Sistare @ 2025-07-02 16:40 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 7/2/2025 11:35 AM, Cédric Le Goater wrote:
> On 6/10/25 17:39, Steve Sistare wrote:
>> Save the MSI message area as part of vfio-pci vmstate, and preserve the
>> interrupt and notifier eventfd's. migrate_incoming loads the MSI data,
>> then the vfio-pci post_load handler finds the eventfds in CPR state,
>> rebuilds vector data structures, and attaches the interrupts to the new
>> KVM instance.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/pci.h | 2 +
>> include/hw/vfio/vfio-cpr.h | 8 ++++
>> hw/vfio/cpr.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++
>> hw/vfio/pci.c | 54 ++++++++++++++++++++++++--
>> 4 files changed, 158 insertions(+), 3 deletions(-)
>>
>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
>> index 6e4840d..4d1203c 100644
>> --- a/hw/vfio/pci.h
>> +++ b/hw/vfio/pci.h
>> @@ -217,6 +217,8 @@ void vfio_pci_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
>> void vfio_pci_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
>> void vfio_pci_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
>> bool vfio_pci_intx_enable(VFIOPCIDevice *vdev, Error **errp);
>> +void vfio_pci_msix_set_notifiers(VFIOPCIDevice *vdev);
>> +void vfio_pci_msi_set_handler(VFIOPCIDevice *vdev, int nr);
>> uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
>> void vfio_pci_write_config(PCIDevice *pdev,
>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>> index 8bf85b9..25e74ee 100644
>> --- a/include/hw/vfio/vfio-cpr.h
>> +++ b/include/hw/vfio/vfio-cpr.h
>> @@ -15,6 +15,7 @@
>> struct VFIOContainer;
>> struct VFIOContainerBase;
>> struct VFIOGroup;
>> +struct VFIOPCIDevice;
>> typedef struct VFIOContainerCPR {
>> Error *blocker;
>> @@ -52,6 +53,13 @@ void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
>> bool vfio_cpr_ram_discard_register_listener(
>> struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
>> +void vfio_cpr_save_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
>> + int nr, int fd);
>> +int vfio_cpr_load_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
>> + int nr);
>> +void vfio_cpr_delete_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
>> + int nr);
>> +
>> extern const VMStateDescription vfio_cpr_pci_vmstate;
>> #endif /* HW_VFIO_VFIO_CPR_H */
>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>> index fdbb58e..e467373 100644
>> --- a/hw/vfio/cpr.c
>> +++ b/hw/vfio/cpr.c
>> @@ -9,6 +9,8 @@
>> #include "hw/vfio/vfio-device.h"
>> #include "hw/vfio/vfio-cpr.h"
>> #include "hw/vfio/pci.h"
>> +#include "hw/pci/msix.h"
>> +#include "hw/pci/msi.h"
>> #include "migration/cpr.h"
>> #include "qapi/error.h"
>> #include "system/runstate.h"
>> @@ -40,6 +42,69 @@ void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
>> migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
>> }
>> +#define STRDUP_VECTOR_FD_NAME(vdev, name) \
>> + g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
>> +
>> +void vfio_cpr_save_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr,
>> + int fd)
>> +{
>> + g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
>> + cpr_save_fd(fdname, nr, fd);
>> +}
>> +
>> +int vfio_cpr_load_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr)
>> +{
>> + g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
>> + return cpr_find_fd(fdname, nr);
>> +}
>> +
>> +void vfio_cpr_delete_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr)
>> +{
>> + g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
>> + cpr_delete_fd(fdname, nr);
>> +}
>> +
>> +static void vfio_cpr_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors,
>> + bool msix)
>> +{
>> + int i, fd;
>> + bool pending = false;
>> + PCIDevice *pdev = &vdev->pdev;
>> +
>> + vdev->nr_vectors = nr_vectors;
>> + vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
>> + vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
>> +
>> + vfio_pci_prepare_kvm_msi_virq_batch(vdev);
>> +
>> + for (i = 0; i < nr_vectors; i++) {
>> + VFIOMSIVector *vector = &vdev->msi_vectors[i];
>> +
>> + fd = vfio_cpr_load_vector_fd(vdev, "interrupt", i);
>> + if (fd >= 0) {
>> + vfio_pci_vector_init(vdev, i);
>> + vfio_pci_msi_set_handler(vdev, i);
>> + }
>> +
>> + if (vfio_cpr_load_vector_fd(vdev, "kvm_interrupt", i) >= 0) {
>> + vfio_pci_add_kvm_msi_virq(vdev, vector, i, msix);
>> + } else {
>> + vdev->msi_vectors[i].virq = -1;
>> + }
>> +
>> + if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
>> + set_bit(i, vdev->msix->pending);
>> + pending = true;
>> + }
>> + }
>> +
>> + vfio_pci_commit_kvm_msi_virq_batch(vdev);
>> +
>> + if (msix) {
>> + memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
>> + }
>> +}
>> +
>> /*
>> * The kernel may change non-emulated config bits. Exclude them from the
>> * changed-bits check in get_pci_config_device.
>> @@ -58,13 +123,45 @@ static int vfio_cpr_pci_pre_load(void *opaque)
>> return 0;
>> }
>> +static int vfio_cpr_pci_post_load(void *opaque, int version_id)
>> +{
>> + VFIOPCIDevice *vdev = opaque;
>> + PCIDevice *pdev = &vdev->pdev;
>> + int nr_vectors;
>> +
>> + if (msix_enabled(pdev)) {
>> + vfio_pci_msix_set_notifiers(vdev);
>> + nr_vectors = vdev->msix->entries;
>> + vfio_cpr_claim_vectors(vdev, nr_vectors, true);
>> +
>> + } else if (msi_enabled(pdev)) {
>> + nr_vectors = msi_nr_vectors_allocated(pdev);
>> + vfio_cpr_claim_vectors(vdev, nr_vectors, false);
>> +
>> + } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
>> + g_assert_not_reached(); /* completed in a subsequent patch */
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static bool pci_msix_present(void *opaque, int version_id)
>> +{
>> + PCIDevice *pdev = opaque;
>> +
>> + return msix_present(pdev);
>> +}
>> +
>> const VMStateDescription vfio_cpr_pci_vmstate = {
>> .name = "vfio-cpr-pci",
>> .version_id = 0,
>> .minimum_version_id = 0,
>> .pre_load = vfio_cpr_pci_pre_load,
>> + .post_load = vfio_cpr_pci_post_load,
>> .needed = cpr_incoming_needed,
>> .fields = (VMStateField[]) {
>> + VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
>> + VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, pci_msix_present),
>> VMSTATE_END_OF_LIST()
>> }
>> };
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 4cda6dc..b3dbb84 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -29,6 +29,7 @@
>> #include "hw/pci/pci_bridge.h"
>> #include "hw/qdev-properties.h"
>> #include "hw/qdev-properties-system.h"
>> +#include "hw/vfio/vfio-cpr.h"
>> #include "migration/vmstate.h"
>> #include "migration/cpr.h"
>> #include "qobject/qdict.h"
>> @@ -57,13 +58,25 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
>> static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
>> static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
>> +/* Create new or reuse existing eventfd */
>> static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
>> const char *name, int nr, Error **errp)
>> {
>> - int ret = event_notifier_init(e, 0);
>> + int fd = vfio_cpr_load_vector_fd(vdev, name, nr);
>
> Since this is a "complex" initialization, I would prefer it to
> be done ...
>
>> + int ret = 0;
>> - if (ret) {
>> - error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
>
> ... here :
>
> fd = vfio_cpr_load_vector_fd(vdev, name, nr);
OK
>> + if (fd >= 0) {
>> + event_notifier_init_fd(e, fd);
>> + } else {
>> + ret = event_notifier_init(e, 0);
>> + if (ret) {
>> + error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
>> + } else {
>> + fd = event_notifier_get_fd(e);
>> + if (fd >= 0) {
>> + vfio_cpr_save_vector_fd(vdev, name, nr, fd);
>> + }
>> + }
>
> Instead of preserving the ending return, could you please rework the
> if statements to return asap. I think it would clarify the routine.
Sure. I can also simplify because event_notifier_get_fd is always valid
if event_notifier_init succeeds. The result:
static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
                               const char *name, int nr, Error **errp)
{
    int fd, ret;

    fd = vfio_cpr_load_vector_fd(vdev, name, nr);
    if (fd >= 0) {
        event_notifier_init_fd(e, fd);
        return true;
    }

    ret = event_notifier_init(e, 0);
    if (ret) {
        error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
        return false;
    }

    fd = event_notifier_get_fd(e);
    vfio_cpr_save_vector_fd(vdev, name, nr, fd);
    return true;
}
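
For readers following the thread, the reuse-or-create flow above can be tried outside QEMU. The following is only a minimal sketch, assuming Linux eventfd(2); the demo_* names and the fixed-size table are hypothetical stand-ins for the cpr fd list and are not QEMU APIs.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical stand-in for the cpr fd list: one slot per (name, nr). */
#define DEMO_MAX_FDS 8

struct demo_saved_fd {
    char name[32];
    int nr;
    int fd;
    bool used;
};

static struct demo_saved_fd demo_fds[DEMO_MAX_FDS];

/* Return the fd saved under (name, nr), or -1 if none was saved. */
static int demo_find_fd(const char *name, int nr)
{
    for (int i = 0; i < DEMO_MAX_FDS; i++) {
        if (demo_fds[i].used && demo_fds[i].nr == nr &&
            !strcmp(demo_fds[i].name, name)) {
            return demo_fds[i].fd;
        }
    }
    return -1;
}

/* Record fd under (name, nr); return false if the table is full. */
static bool demo_save_fd(const char *name, int nr, int fd)
{
    for (int i = 0; i < DEMO_MAX_FDS; i++) {
        if (!demo_fds[i].used) {
            snprintf(demo_fds[i].name, sizeof(demo_fds[i].name), "%s", name);
            demo_fds[i].nr = nr;
            demo_fds[i].fd = fd;
            demo_fds[i].used = true;
            return true;
        }
    }
    return false;
}

/*
 * Reuse a preserved eventfd if one exists, else create and record a new
 * one. Each outcome returns as soon as it is known.
 */
static bool demo_notifier_init(int *out_fd, const char *name, int nr)
{
    int fd = demo_find_fd(name, nr);

    if (fd >= 0) {
        *out_fd = fd;
        return true;
    }

    fd = eventfd(0, EFD_NONBLOCK);
    if (fd < 0) {
        return false;
    }

    if (!demo_save_fd(name, nr, fd)) {
        close(fd);
        return false;
    }

    *out_fd = fd;
    return true;
}
```

Calling demo_notifier_init twice with the same (name, nr) yields the same descriptor, which is the property the incoming side relies on after exec.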
>> }
>> return !ret;
>> }
>> @@ -71,6 +84,7 @@ static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
>> static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
>> const char *name, int nr)
>> {
>> + vfio_cpr_delete_vector_fd(vdev, name, nr);
>> event_notifier_cleanup(e);
>> }
>> @@ -394,6 +408,14 @@ static void vfio_msi_interrupt(void *opaque)
>> notify(&vdev->pdev, nr);
>> }
>> +void vfio_pci_msi_set_handler(VFIOPCIDevice *vdev, int nr)
>> +{
>> + VFIOMSIVector *vector = &vdev->msi_vectors[nr];
>> + int fd = event_notifier_get_fd(&vector->interrupt);
>> +
>> + qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
>> +}
>> +
>> /*
>> * Get MSI-X enabled, but no vector enabled, by setting vector 0 with an invalid
>> * fd to kernel.
>> @@ -580,6 +602,15 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>> int ret;
>> bool resizing = !!(vdev->nr_vectors < nr + 1);
>> + /*
>> + * Ignore the callback from msix_set_vector_notifiers during resume.
>> + * The necessary subset of these actions is called from
>> + * vfio_cpr_claim_vectors during post load.
>> + */
>> + if (cpr_is_incoming()) {
>> + return 0;
>> + }
>> +
>
> This test could be moved in vfio_msix_vector_use().
OK.
(there used to be multiple sites calling vfio_msix_vector_do_use).
- Steve
> The rest looks fine.
>
> Thanks,
>
> C.
>
>> trace_vfio_msix_vector_do_use(vdev->vbasedev.name, nr);
>> vector = &vdev->msi_vectors[nr];
>> @@ -686,6 +717,12 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
>> }
>> }
>> +void vfio_pci_msix_set_notifiers(VFIOPCIDevice *vdev)
>> +{
>> + msix_set_vector_notifiers(&vdev->pdev, vfio_msix_vector_use,
>> + vfio_msix_vector_release, NULL);
>> +}
>> +
>> void vfio_pci_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
>> {
>> assert(!vdev->defer_kvm_irq_routing);
>> @@ -2962,6 +2999,11 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
>> fd = event_notifier_get_fd(&vdev->err_notifier);
>> qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
>> + /* Do not alter irq_signaling during vfio_realize for cpr */
>> + if (cpr_is_incoming()) {
>> + return;
>> + }
>> +
>> if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_ERR_IRQ_INDEX, 0,
>> VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
>> error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
>> @@ -3029,6 +3071,12 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
>> fd = event_notifier_get_fd(&vdev->req_notifier);
>> qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
>> + /* Do not alter irq_signaling during vfio_realize for cpr */
>> + if (cpr_is_incoming()) {
>> + vdev->req_enabled = true;
>> + return;
>> + }
>> +
>> if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_REQ_IRQ_INDEX, 0,
>> VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
>> error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
>
* [PATCH V5 19/38] vfio-pci: preserve INTx
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (17 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 18/38] vfio-pci: preserve MSI Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-07-02 15:23 ` Cédric Le Goater
2025-06-10 15:39 ` [PATCH V5 20/38] migration: close kvm after cpr Steve Sistare
` (19 subsequent siblings)
38 siblings, 1 reply; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Preserve vfio INTx state across cpr-transfer. Preserve VFIOINTx fields as
follows:
pin : Recover this from the vfio config in kernel space
interrupt : Preserve its eventfd descriptor across exec.
unmask : Ditto
route.irq : This could perhaps be recovered in vfio_pci_post_load by
calling pci_device_route_intx_to_irq(pin), whose implementation reads
config space for a bridge device such as ich9. However, there is no
guarantee that the bridge vmstate is read before vfio vmstate. Rather
than fiddling with MigrationPriority for vmstate handlers, explicitly
save route.irq in vfio vmstate.
pending : save in vfio vmstate.
mmap_timeout, mmap_timer : Re-initialize
kvm_accel : Re-initialize
In vfio_realize, defer calling vfio_intx_enable until the vmstate
is available, in vfio_pci_post_load. Modify vfio_intx_enable and
vfio_intx_enable_kvm to skip vfio initialization, but still perform
kvm initialization.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/cpr.c | 27 ++++++++++++++++++++++++++-
hw/vfio/pci.c | 32 ++++++++++++++++++++++++++++----
2 files changed, 54 insertions(+), 5 deletions(-)
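
As an illustration of the deferral described in the commit message, here is a minimal sketch, not the patch itself. It assumes a toy device model: struct demo_intx, demo_incoming, and the demo_* functions are hypothetical names that only mirror the roles of VFIOINTx, cpr_is_incoming(), vfio_realize, and the post_load handler.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the VFIOINTx fields discussed above. */
struct demo_intx {
    bool pending;       /* carried in vmstate */
    int32_t route_irq;  /* carried in vmstate */
    uint8_t pin;        /* recovered from config space */
    bool kvm_accel;     /* re-initialized on enable */
    bool enabled;
};

/* Stand-in for cpr_is_incoming(). */
static bool demo_incoming;

/* Enable INTx; only a fresh start resets the interrupt state. */
static void demo_intx_enable(struct demo_intx *intx, uint8_t config_pin)
{
    if (!demo_incoming) {
        intx->pending = false;
    }
    intx->pin = config_pin - 1;     /* Pin A (1) -> irq[0] */
    intx->kvm_accel = true;
    intx->enabled = true;
}

/* Realize: enable immediately, unless state will arrive via vmstate. */
static void demo_realize(struct demo_intx *intx, uint8_t config_pin)
{
    if (demo_incoming) {
        return;
    }
    demo_intx_enable(intx, config_pin);
}

/* Post load: pending and route_irq are loaded by now, so enable. */
static void demo_post_load(struct demo_intx *intx, uint8_t config_pin)
{
    if (config_pin) {
        demo_intx_enable(intx, config_pin);
    }
}
```

In the incoming case, demo_realize leaves the device untouched and demo_post_load enables it with the loaded pending and route_irq values intact.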
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index e467373..f5555ca 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -139,7 +139,11 @@ static int vfio_cpr_pci_post_load(void *opaque, int version_id)
vfio_cpr_claim_vectors(vdev, nr_vectors, false);
} else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
- g_assert_not_reached(); /* completed in a subsequent patch */
+ Error *local_err = NULL;
+ if (!vfio_pci_intx_enable(vdev, &local_err)) {
+ error_report_err(local_err);
+ return -1;
+ }
}
return 0;
@@ -152,6 +156,26 @@ static bool pci_msix_present(void *opaque, int version_id)
return msix_present(pdev);
}
+static const VMStateDescription vfio_intx_vmstate = {
+ .name = "vfio-cpr-intx",
+ .version_id = 0,
+ .minimum_version_id = 0,
+ .fields = (VMStateField[]) {
+ VMSTATE_BOOL(pending, VFIOINTx),
+ VMSTATE_UINT32(route.mode, VFIOINTx),
+ VMSTATE_INT32(route.irq, VFIOINTx),
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+#define VMSTATE_VFIO_INTX(_field, _state) { \
+ .name = (stringify(_field)), \
+ .size = sizeof(VFIOINTx), \
+ .vmsd = &vfio_intx_vmstate, \
+ .flags = VMS_STRUCT, \
+ .offset = vmstate_offset_value(_state, _field, VFIOINTx), \
+}
+
const VMStateDescription vfio_cpr_pci_vmstate = {
.name = "vfio-cpr-pci",
.version_id = 0,
@@ -162,6 +186,7 @@ const VMStateDescription vfio_cpr_pci_vmstate = {
.fields = (VMStateField[]) {
VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, pci_msix_present),
+ VMSTATE_VFIO_INTX(intx, VFIOPCIDevice),
VMSTATE_END_OF_LIST()
}
};
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index b3dbb84..b52c488 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -161,12 +161,17 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
return true;
}
+ if (cpr_is_incoming()) {
+ goto skip_state;
+ }
+
/* Get to a known interrupt state */
qemu_set_fd_handler(irq_fd, NULL, NULL, vdev);
vfio_device_irq_mask(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
vdev->intx.pending = false;
pci_irq_deassert(&vdev->pdev);
+skip_state:
/* Get an eventfd for resample/unmask */
if (!vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0, errp)) {
goto fail;
@@ -180,6 +185,10 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
goto fail_irqfd;
}
+ if (cpr_is_incoming()) {
+ goto skip_irq;
+ }
+
if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
VFIO_IRQ_SET_ACTION_UNMASK,
event_notifier_get_fd(&vdev->intx.unmask),
@@ -190,6 +199,7 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
/* Let'em rip */
vfio_device_irq_unmask(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+skip_irq:
vdev->intx.kvm_accel = true;
trace_vfio_intx_enable_kvm(vdev->vbasedev.name);
@@ -305,7 +315,13 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
return true;
}
- vfio_disable_interrupts(vdev);
+ /*
+ * Do not alter interrupt state during vfio_realize and cpr load.
+ * The incoming state is cleared thereafter.
+ */
+ if (!cpr_is_incoming()) {
+ vfio_disable_interrupts(vdev);
+ }
vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
pci_config_set_interrupt_pin(vdev->pdev.config, pin);
@@ -328,8 +344,10 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
fd = event_notifier_get_fd(&vdev->intx.interrupt);
qemu_set_fd_handler(fd, vfio_intx_interrupt, NULL, vdev);
- if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
- VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
+ if (!cpr_is_incoming() &&
+ !vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX,
+ 0, VFIO_IRQ_SET_ACTION_TRIGGER, fd,
+ errp)) {
qemu_set_fd_handler(fd, NULL, NULL, vdev);
vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
return false;
@@ -3234,7 +3252,13 @@ static bool vfio_interrupt_setup(VFIOPCIDevice *vdev, Error **errp)
vfio_intx_routing_notifier);
vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
- if (!vfio_intx_enable(vdev, errp)) {
+
+ /*
+ * During CPR, do not call vfio_intx_enable at this time. Instead,
+ * call it from vfio_pci_post_load after the intx routing data has
+ * been loaded from vmstate.
+ */
+ if (!cpr_is_incoming() && !vfio_intx_enable(vdev, errp)) {
timer_free(vdev->intx.mmap_timer);
pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
kvm_irqchip_remove_change_notifier(&vdev->irqchip_change_notifier);
--
1.8.3.1
* Re: [PATCH V5 19/38] vfio-pci: preserve INTx
2025-06-10 15:39 ` [PATCH V5 19/38] vfio-pci: preserve INTx Steve Sistare
@ 2025-07-02 15:23 ` Cédric Le Goater
2025-07-02 17:54 ` Steven Sistare
0 siblings, 1 reply; 101+ messages in thread
From: Cédric Le Goater @ 2025-07-02 15:23 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 6/10/25 17:39, Steve Sistare wrote:
> Preserve vfio INTx state across cpr-transfer. Preserve VFIOINTx fields as
> follows:
> pin : Recover this from the vfio config in kernel space
> interrupt : Preserve its eventfd descriptor across exec.
> unmask : Ditto
> route.irq : This could perhaps be recovered in vfio_pci_post_load by
> calling pci_device_route_intx_to_irq(pin), whose implementation reads
> config space for a bridge device such as ich9. However, there is no
> guarantee that the bridge vmstate is read before vfio vmstate. Rather
> than fiddling with MigrationPriority for vmstate handlers, explicitly
> save route.irq in vfio vmstate.
> pending : save in vfio vmstate.
> mmap_timeout, mmap_timer : Re-initialize
> bool kvm_accel : Re-initialize
>
> In vfio_realize, defer calling vfio_intx_enable until the vmstate
> is available, in vfio_pci_post_load. Modify vfio_intx_enable and
> vfio_intx_kvm_enable to skip vfio initialization, but still perform
> kvm initialization.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> hw/vfio/cpr.c | 27 ++++++++++++++++++++++++++-
> hw/vfio/pci.c | 32 ++++++++++++++++++++++++++++----
> 2 files changed, 54 insertions(+), 5 deletions(-)
>
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> index e467373..f5555ca 100644
> --- a/hw/vfio/cpr.c
> +++ b/hw/vfio/cpr.c
> @@ -139,7 +139,11 @@ static int vfio_cpr_pci_post_load(void *opaque, int version_id)
> vfio_cpr_claim_vectors(vdev, nr_vectors, false);
>
> } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
> - g_assert_not_reached(); /* completed in a subsequent patch */
> + Error *local_err = NULL;
> + if (!vfio_pci_intx_enable(vdev, &local_err)) {
> + error_report_err(local_err);
> + return -1;
> + }
> }
>
> return 0;
> @@ -152,6 +156,26 @@ static bool pci_msix_present(void *opaque, int version_id)
> return msix_present(pdev);
> }
>
> +static const VMStateDescription vfio_intx_vmstate = {
> + .name = "vfio-cpr-intx",
> + .version_id = 0,
> + .minimum_version_id = 0,
> + .fields = (VMStateField[]) {
> + VMSTATE_BOOL(pending, VFIOINTx),
> + VMSTATE_UINT32(route.mode, VFIOINTx),
> + VMSTATE_INT32(route.irq, VFIOINTx),
> + VMSTATE_END_OF_LIST()
> + }
> +};
> +
> +#define VMSTATE_VFIO_INTX(_field, _state) { \
> + .name = (stringify(_field)), \
> + .size = sizeof(VFIOINTx), \
> + .vmsd = &vfio_intx_vmstate, \
> + .flags = VMS_STRUCT, \
> + .offset = vmstate_offset_value(_state, _field, VFIOINTx), \
> +}
> +
> const VMStateDescription vfio_cpr_pci_vmstate = {
> .name = "vfio-cpr-pci",
> .version_id = 0,
> @@ -162,6 +186,7 @@ const VMStateDescription vfio_cpr_pci_vmstate = {
> .fields = (VMStateField[]) {
> VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
> VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, pci_msix_present),
> + VMSTATE_VFIO_INTX(intx, VFIOPCIDevice),
> VMSTATE_END_OF_LIST()
> }
> };
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index b3dbb84..b52c488 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -161,12 +161,17 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
> return true;
> }
>
> + if (cpr_is_incoming()) {
> + goto skip_state;
> + }
> +
> /* Get to a known interrupt state */
> qemu_set_fd_handler(irq_fd, NULL, NULL, vdev);
> vfio_device_irq_mask(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
> vdev->intx.pending = false;
> pci_irq_deassert(&vdev->pdev);
>
> +skip_state:
> /* Get an eventfd for resample/unmask */
> if (!vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0, errp)) {
> goto fail;
> @@ -180,6 +185,10 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
> goto fail_irqfd;
> }
>
> + if (cpr_is_incoming()) {
> + goto skip_irq;
> + }
> +
> if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
> VFIO_IRQ_SET_ACTION_UNMASK,
> event_notifier_get_fd(&vdev->intx.unmask),
> @@ -190,6 +199,7 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
> /* Let'em rip */
> vfio_device_irq_unmask(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
>
> +skip_irq:
> vdev->intx.kvm_accel = true;
Looking closer at the code, I think it would be clearer to introduce a
vfio_cpr_intx_enable_kvm() routine and duplicate some of the code
of vfio_intx_enable_kvm().
> trace_vfio_intx_enable_kvm(vdev->vbasedev.name);
> @@ -305,7 +315,13 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
> return true;
> }
>
> - vfio_disable_interrupts(vdev);
> + /*
> + * Do not alter interrupt state during vfio_realize and cpr load.
> + * The incoming state is cleared thereafter.
> + */
> + if (!cpr_is_incoming()) {
> + vfio_disable_interrupts(vdev);
> + }
>
> vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
> pci_config_set_interrupt_pin(vdev->pdev.config, pin);
> @@ -328,8 +344,10 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
> fd = event_notifier_get_fd(&vdev->intx.interrupt);
> qemu_set_fd_handler(fd, vfio_intx_interrupt, NULL, vdev);
>
> - if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
> - VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
> + if (!cpr_is_incoming() &&
> + !vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX,
> + 0, VFIO_IRQ_SET_ACTION_TRIGGER, fd,
> + errp)) {
> qemu_set_fd_handler(fd, NULL, NULL, vdev);
> vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
> return false;
> @@ -3234,7 +3252,13 @@ static bool vfio_interrupt_setup(VFIOPCIDevice *vdev, Error **errp)
> vfio_intx_routing_notifier);
> vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
> kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
> - if (!vfio_intx_enable(vdev, errp)) {
> +
> + /*
> + * During CPR, do not call vfio_intx_enable at this time. Instead,
> + * call it from vfio_pci_post_load after the intx routing data has
> + * been loaded from vmstate.
> + */
> + if (!cpr_is_incoming() && !vfio_intx_enable(vdev, errp)) {
> timer_free(vdev->intx.mmap_timer);
> pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
> kvm_irqchip_remove_change_notifier(&vdev->irqchip_change_notifier);
The changes in vfio_intx_enable() seem ok.
Thanks,
C.
* Re: [PATCH V5 19/38] vfio-pci: preserve INTx
2025-07-02 15:23 ` Cédric Le Goater
@ 2025-07-02 17:54 ` Steven Sistare
0 siblings, 0 replies; 101+ messages in thread
From: Steven Sistare @ 2025-07-02 17:54 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 7/2/2025 11:23 AM, Cédric Le Goater wrote:
> On 6/10/25 17:39, Steve Sistare wrote:
>> Preserve vfio INTx state across cpr-transfer. Preserve VFIOINTx fields as
>> follows:
>> pin : Recover this from the vfio config in kernel space
>> interrupt : Preserve its eventfd descriptor across exec.
>> unmask : Ditto
>> route.irq : This could perhaps be recovered in vfio_pci_post_load by
>> calling pci_device_route_intx_to_irq(pin), whose implementation reads
>> config space for a bridge device such as ich9. However, there is no
>> guarantee that the bridge vmstate is read before vfio vmstate. Rather
>> than fiddling with MigrationPriority for vmstate handlers, explicitly
>> save route.irq in vfio vmstate.
>> pending : save in vfio vmstate.
>> mmap_timeout, mmap_timer : Re-initialize
>> bool kvm_accel : Re-initialize
>>
>> In vfio_realize, defer calling vfio_intx_enable until the vmstate
>> is available, in vfio_pci_post_load. Modify vfio_intx_enable and
>> vfio_intx_kvm_enable to skip vfio initialization, but still perform
>> kvm initialization.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/cpr.c | 27 ++++++++++++++++++++++++++-
>> hw/vfio/pci.c | 32 ++++++++++++++++++++++++++++----
>> 2 files changed, 54 insertions(+), 5 deletions(-)
>>
>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>> index e467373..f5555ca 100644
>> --- a/hw/vfio/cpr.c
>> +++ b/hw/vfio/cpr.c
>> @@ -139,7 +139,11 @@ static int vfio_cpr_pci_post_load(void *opaque, int version_id)
>> vfio_cpr_claim_vectors(vdev, nr_vectors, false);
>> } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
>> - g_assert_not_reached(); /* completed in a subsequent patch */
>> + Error *local_err = NULL;
>> + if (!vfio_pci_intx_enable(vdev, &local_err)) {
>> + error_report_err(local_err);
>> + return -1;
>> + }
>> }
>> return 0;
>> @@ -152,6 +156,26 @@ static bool pci_msix_present(void *opaque, int version_id)
>> return msix_present(pdev);
>> }
>> +static const VMStateDescription vfio_intx_vmstate = {
>> + .name = "vfio-cpr-intx",
>> + .version_id = 0,
>> + .minimum_version_id = 0,
>> + .fields = (VMStateField[]) {
>> + VMSTATE_BOOL(pending, VFIOINTx),
>> + VMSTATE_UINT32(route.mode, VFIOINTx),
>> + VMSTATE_INT32(route.irq, VFIOINTx),
>> + VMSTATE_END_OF_LIST()
>> + }
>> +};
>> +
>> +#define VMSTATE_VFIO_INTX(_field, _state) { \
>> + .name = (stringify(_field)), \
>> + .size = sizeof(VFIOINTx), \
>> + .vmsd = &vfio_intx_vmstate, \
>> + .flags = VMS_STRUCT, \
>> + .offset = vmstate_offset_value(_state, _field, VFIOINTx), \
>> +}
>> +
>> const VMStateDescription vfio_cpr_pci_vmstate = {
>> .name = "vfio-cpr-pci",
>> .version_id = 0,
>> @@ -162,6 +186,7 @@ const VMStateDescription vfio_cpr_pci_vmstate = {
>> .fields = (VMStateField[]) {
>> VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
>> VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, pci_msix_present),
>> + VMSTATE_VFIO_INTX(intx, VFIOPCIDevice),
>> VMSTATE_END_OF_LIST()
>> }
>> };
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index b3dbb84..b52c488 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -161,12 +161,17 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
>> return true;
>> }
>> + if (cpr_is_incoming()) {
>> + goto skip_state;
>> + }
>> +
>> /* Get to a known interrupt state */
>> qemu_set_fd_handler(irq_fd, NULL, NULL, vdev);
>> vfio_device_irq_mask(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
>> vdev->intx.pending = false;
>> pci_irq_deassert(&vdev->pdev);
>> +skip_state:
>> /* Get an eventfd for resample/unmask */
>> if (!vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0, errp)) {
>> goto fail;
>> @@ -180,6 +185,10 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
>> goto fail_irqfd;
>> }
>> + if (cpr_is_incoming()) {
>> + goto skip_irq;
>> + }
>> +
>> if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
>> VFIO_IRQ_SET_ACTION_UNMASK,
>> event_notifier_get_fd(&vdev->intx.unmask),
>> @@ -190,6 +199,7 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
>> /* Let'em rip */
>> vfio_device_irq_unmask(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
>> +skip_irq:
>> vdev->intx.kvm_accel = true;
>
> Looking closer at the code, I think it would be clearer to introduce a
> vfio_cpr_intx_enable_kvm() routine and duplicate some of the code
> of vfio_intx_enable_kvm().
OK:
static bool vfio_cpr_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
{
#ifdef CONFIG_KVM
    if (vdev->no_kvm_intx || !kvm_irqfds_enabled() ||
        vdev->intx.route.mode != PCI_INTX_ENABLED ||
        !kvm_resamplefds_enabled()) {
        return true;
    }

    if (!vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0, errp)) {
        return false;
    }

    if (kvm_irqchip_add_irqfd_notifier_gsi(kvm_state,
                                           &vdev->intx.interrupt,
                                           &vdev->intx.unmask,
                                           vdev->intx.route.irq)) {
        error_setg_errno(errp, errno, "failed to setup resample irqfd");
        vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
        return false;
    }

    vdev->intx.kvm_accel = true;
    trace_vfio_intx_enable_kvm(vdev->vbasedev.name);
    return true;
#else
    return true;
#endif
}
And the new code at the vfio_intx_enable call site becomes:
    if (cpr_is_incoming()) {
        if (!vfio_cpr_intx_enable_kvm(vdev, &err)) {
            warn_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
        }
        goto skip_signaling;
    }

    vfio_device_irq_set_signaling ...
    vfio_intx_enable_kvm ...

skip_signaling:
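
The shape of this split can be shown in isolation. The sketch below is an assumption-laden toy, not QEMU code: struct demo_dev and the demo_* functions are hypothetical, and the counters only record which class of step ran. The point is that the incoming path performs the kvm-side attach alone, while the normal path also touches device-side signaling.

```c
#include <stdbool.h>

/* Hypothetical device record; the counters note which steps ran. */
struct demo_dev {
    bool incoming;           /* stand-in for cpr_is_incoming() */
    int device_setup_calls;  /* device-side signaling changes */
    int kvm_setup_calls;     /* kvm-side irqfd attach */
    bool kvm_accel;
};

/* kvm-side work needed on both the normal and the incoming path. */
static bool demo_kvm_attach(struct demo_dev *d)
{
    d->kvm_setup_calls++;
    d->kvm_accel = true;
    return true;
}

/* Normal path: device-side mask/unmask signaling, then attach to kvm. */
static bool demo_intx_enable_kvm(struct demo_dev *d)
{
    d->device_setup_calls++;
    return demo_kvm_attach(d);
}

/* Incoming path: device state is already live, so attach to kvm only. */
static bool demo_cpr_intx_enable_kvm(struct demo_dev *d)
{
    return demo_kvm_attach(d);
}

/* Caller picks one path up front instead of jumping over blocks. */
static bool demo_intx_enable(struct demo_dev *d)
{
    if (d->incoming) {
        return demo_cpr_intx_enable_kvm(d);
    }
    d->device_setup_calls++;    /* trigger signaling */
    return demo_intx_enable_kvm(d);
}
```

Keeping the incoming path in its own function costs a little duplication but removes the two goto labels from the common routine.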
- Steve
>> trace_vfio_intx_enable_kvm(vdev->vbasedev.name);
>> @@ -305,7 +315,13 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>> return true;
>> }
>> - vfio_disable_interrupts(vdev);
>> + /*
>> + * Do not alter interrupt state during vfio_realize and cpr load.
>> + * The incoming state is cleared thereafter.
>> + */
>> + if (!cpr_is_incoming()) {
>> + vfio_disable_interrupts(vdev);
>> + }
>> vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
>> pci_config_set_interrupt_pin(vdev->pdev.config, pin);
>> @@ -328,8 +344,10 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>> fd = event_notifier_get_fd(&vdev->intx.interrupt);
>> qemu_set_fd_handler(fd, vfio_intx_interrupt, NULL, vdev);
>> - if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
>> - VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
>> + if (!cpr_is_incoming() &&
>> + !vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX,
>> + 0, VFIO_IRQ_SET_ACTION_TRIGGER, fd,
>> + errp)) {
>> qemu_set_fd_handler(fd, NULL, NULL, vdev);
>> vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
>> return false;
>> @@ -3234,7 +3252,13 @@ static bool vfio_interrupt_setup(VFIOPCIDevice *vdev, Error **errp)
>> vfio_intx_routing_notifier);
>> vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
>> kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
>> - if (!vfio_intx_enable(vdev, errp)) {
>> +
>> + /*
>> + * During CPR, do not call vfio_intx_enable at this time. Instead,
>> + * call it from vfio_pci_post_load after the intx routing data has
>> + * been loaded from vmstate.
>> + */
>> + if (!cpr_is_incoming() && !vfio_intx_enable(vdev, errp)) {
>> timer_free(vdev->intx.mmap_timer);
>> pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
>> kvm_irqchip_remove_change_notifier(&vdev->irqchip_change_notifier);
>
> The changes in vfio_intx_enable() seem ok.
>
> Thanks,
>
> C.
>
>
* [PATCH V5 20/38] migration: close kvm after cpr
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (18 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 19/38] vfio-pci: preserve INTx Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-07-01 15:25 ` Steven Sistare
2025-07-01 17:49 ` Fabiano Rosas
2025-06-10 15:39 ` [PATCH V5 21/38] migration: cpr_get_fd_param helper Steve Sistare
` (18 subsequent siblings)
38 siblings, 2 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
cpr-transfer breaks vfio network connectivity to and from the guest, and
the host system log shows:
irq bypass consumer (token 00000000a03c32e5) registration fails: -16
which is EBUSY. This occurs because KVM descriptors are still open in
the old QEMU process. Close them.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
include/hw/vfio/vfio-device.h | 2 ++
include/migration/cpr.h | 2 ++
include/system/kvm.h | 1 +
accel/kvm/kvm-all.c | 32 ++++++++++++++++++++++++++++++++
accel/stubs/kvm-stub.c | 5 +++++
hw/vfio/helpers.c | 10 ++++++++++
hw/vfio/vfio-stubs.c | 13 +++++++++++++
migration/cpr-transfer.c | 18 ++++++++++++++++++
migration/cpr.c | 8 ++++++++
migration/migration.c | 1 +
hw/vfio/meson.build | 2 ++
11 files changed, 94 insertions(+)
create mode 100644 hw/vfio/vfio-stubs.c
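
The close-and-invalidate idiom used throughout this patch can be exercised on its own. This is a minimal sketch under stated assumptions: struct demo_vm and the demo_* helpers are hypothetical, and /dev/null stands in for the real KVM descriptors so that the example runs without /dev/kvm.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

#define DEMO_NR_VCPUS 2

/* Hypothetical stand-ins for the system, VM, and per-vCPU descriptors. */
struct demo_vm {
    int sys_fd;
    int vm_fd;
    int vcpu_fd[DEMO_NR_VCPUS];
};

/* Close *fd if it is open, then mark it invalid so repeat calls do nothing. */
static void demo_close_fd(int *fd)
{
    if (*fd != -1) {
        close(*fd);
        *fd = -1;
    }
}

/* Open placeholder descriptors; /dev/null stands in for /dev/kvm here. */
static bool demo_vm_open(struct demo_vm *vm)
{
    bool ok;

    vm->sys_fd = open("/dev/null", O_RDONLY);
    vm->vm_fd = open("/dev/null", O_RDONLY);
    ok = vm->sys_fd != -1 && vm->vm_fd != -1;

    for (int i = 0; i < DEMO_NR_VCPUS; i++) {
        vm->vcpu_fd[i] = open("/dev/null", O_RDONLY);
        ok = ok && vm->vcpu_fd[i] != -1;
    }
    return ok;
}

/* Close vCPU descriptors first, then the VM, then the system descriptor. */
static void demo_vm_close(struct demo_vm *vm)
{
    if (vm->sys_fd == -1) {
        return;
    }
    for (int i = 0; i < DEMO_NR_VCPUS; i++) {
        demo_close_fd(&vm->vcpu_fd[i]);
    }
    demo_close_fd(&vm->vm_fd);
    demo_close_fd(&vm->sys_fd);
}
```

Resetting each descriptor to -1 makes demo_vm_close safe to call more than once, which matters if more than one notifier path can reach it.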
diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
index 4e4d0b6..6eb6f21 100644
--- a/include/hw/vfio/vfio-device.h
+++ b/include/hw/vfio/vfio-device.h
@@ -231,4 +231,6 @@ void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
void vfio_device_init(VFIODevice *vbasedev, int type, VFIODeviceOps *ops,
DeviceState *dev, bool ram_discard);
int vfio_device_get_aw_bits(VFIODevice *vdev);
+
+void vfio_kvm_device_close(void);
#endif /* HW_VFIO_VFIO_COMMON_H */
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 07858e9..d09b657 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -32,7 +32,9 @@ void cpr_state_close(void);
struct QIOChannel *cpr_state_ioc(void);
bool cpr_incoming_needed(void *opaque);
+void cpr_kvm_close(void);
+void cpr_transfer_init(void);
QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
diff --git a/include/system/kvm.h b/include/system/kvm.h
index 7cc60d2..4896a3c 100644
--- a/include/system/kvm.h
+++ b/include/system/kvm.h
@@ -195,6 +195,7 @@ bool kvm_has_sync_mmu(void);
int kvm_has_vcpu_events(void);
int kvm_max_nested_state_length(void);
int kvm_has_gsi_routing(void);
+void kvm_close(void);
/**
* kvm_arm_supports_user_irq
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index a317783..3d3a557 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -515,16 +515,23 @@ static int do_kvm_destroy_vcpu(CPUState *cpu)
goto err;
}
+ /* If I am the CPU that created coalesced_mmio_ring, then discard it */
+ if (s->coalesced_mmio_ring == (void *)cpu->kvm_run + PAGE_SIZE) {
+ s->coalesced_mmio_ring = NULL;
+ }
+
ret = munmap(cpu->kvm_run, mmap_size);
if (ret < 0) {
goto err;
}
+ cpu->kvm_run = NULL;
if (cpu->kvm_dirty_gfns) {
ret = munmap(cpu->kvm_dirty_gfns, s->kvm_dirty_ring_bytes);
if (ret < 0) {
goto err;
}
+ cpu->kvm_dirty_gfns = NULL;
}
kvm_park_vcpu(cpu);
@@ -608,6 +615,31 @@ err:
return ret;
}
+void kvm_close(void)
+{
+ CPUState *cpu;
+
+ if (!kvm_state || kvm_state->fd == -1) {
+ return;
+ }
+
+ CPU_FOREACH(cpu) {
+ cpu_remove_sync(cpu);
+ close(cpu->kvm_fd);
+ cpu->kvm_fd = -1;
+ close(cpu->kvm_vcpu_stats_fd);
+ cpu->kvm_vcpu_stats_fd = -1;
+ }
+
+ if (kvm_state && kvm_state->fd != -1) {
+ close(kvm_state->vmfd);
+ kvm_state->vmfd = -1;
+ close(kvm_state->fd);
+ kvm_state->fd = -1;
+ }
+ kvm_state = NULL;
+}
+
/*
* dirty pages logging control
*/
diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
index ecfd763..97dacb3 100644
--- a/accel/stubs/kvm-stub.c
+++ b/accel/stubs/kvm-stub.c
@@ -134,3 +134,8 @@ int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp)
{
return -ENOSYS;
}
+
+void kvm_close(void)
+{
+ return;
+}
diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
index d0dbab1..af1db2f 100644
--- a/hw/vfio/helpers.c
+++ b/hw/vfio/helpers.c
@@ -117,6 +117,16 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
int vfio_kvm_device_fd = -1;
#endif
+void vfio_kvm_device_close(void)
+{
+#ifdef CONFIG_KVM
+ if (vfio_kvm_device_fd != -1) {
+ close(vfio_kvm_device_fd);
+ vfio_kvm_device_fd = -1;
+ }
+#endif
+}
+
int vfio_kvm_device_add_fd(int fd, Error **errp)
{
#ifdef CONFIG_KVM
diff --git a/hw/vfio/vfio-stubs.c b/hw/vfio/vfio-stubs.c
new file mode 100644
index 0000000..a4c8b56
--- /dev/null
+++ b/hw/vfio/vfio-stubs.c
@@ -0,0 +1,13 @@
+/*
+ * Copyright (c) 2025 Oracle and/or its affiliates.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+#include "hw/vfio/vfio-device.h"
+
+void vfio_kvm_device_close(void)
+{
+ return;
+}
diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c
index e1f1403..396558f 100644
--- a/migration/cpr-transfer.c
+++ b/migration/cpr-transfer.c
@@ -17,6 +17,24 @@
#include "migration/vmstate.h"
#include "trace.h"
+static int cpr_transfer_notifier(NotifierWithReturn *notifier,
+ MigrationEvent *e,
+ Error **errp)
+{
+ if (e->type == MIG_EVENT_PRECOPY_DONE) {
+ cpr_kvm_close();
+ }
+ return 0;
+}
+
+void cpr_transfer_init(void)
+{
+ static NotifierWithReturn notifier;
+
+ migration_add_notifier_mode(&notifier, cpr_transfer_notifier,
+ MIG_MODE_CPR_TRANSFER);
+}
+
QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp)
{
MigrationAddress *addr = channel->addr;
diff --git a/migration/cpr.c b/migration/cpr.c
index a50a57e..49fb0a5 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -7,12 +7,14 @@
#include "qemu/osdep.h"
#include "qapi/error.h"
+#include "hw/vfio/vfio-device.h"
#include "migration/cpr.h"
#include "migration/misc.h"
#include "migration/options.h"
#include "migration/qemu-file.h"
#include "migration/savevm.h"
#include "migration/vmstate.h"
+#include "system/kvm.h"
#include "system/runstate.h"
#include "trace.h"
@@ -264,3 +266,9 @@ bool cpr_incoming_needed(void *opaque)
MigMode mode = migrate_mode();
return mode == MIG_MODE_CPR_TRANSFER;
}
+
+void cpr_kvm_close(void)
+{
+ kvm_close();
+ vfio_kvm_device_close();
+}
diff --git a/migration/migration.c b/migration/migration.c
index 4098870..8f23cff 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -337,6 +337,7 @@ void migration_object_init(void)
ram_mig_init();
dirty_bitmap_mig_init();
+ cpr_transfer_init();
/* Initialize cpu throttle timers */
cpu_throttle_init();
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index 73d29f9..98134a7 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -17,6 +17,8 @@ vfio_ss.add(when: 'CONFIG_VFIO_IGD', if_true: files('igd.c'))
specific_ss.add_all(when: 'CONFIG_VFIO', if_true: vfio_ss)
+system_ss.add(when: 'CONFIG_VFIO', if_false: files('vfio-stubs.c'))
+
system_ss.add(when: 'CONFIG_VFIO_XGMAC', if_true: files('calxeda-xgmac.c'))
system_ss.add(when: 'CONFIG_VFIO_AMD_XGBE', if_true: files('amd-xgbe.c'))
system_ss.add(when: 'CONFIG_VFIO', if_true: files(
--
1.8.3.1
* Re: [PATCH V5 20/38] migration: close kvm after cpr
2025-06-10 15:39 ` [PATCH V5 20/38] migration: close kvm after cpr Steve Sistare
@ 2025-07-01 15:25 ` Steven Sistare
2025-07-02 16:02 ` Peter Xu
2025-07-01 17:49 ` Fabiano Rosas
1 sibling, 1 reply; 101+ messages in thread
From: Steven Sistare @ 2025-07-01 15:25 UTC (permalink / raw)
To: qemu-devel, Peter Xu, Fabiano Rosas, Paolo Bonzini
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum
Hi Paolo, Peter, Fabiano,
This patch needs review. CPR for vfio is broken without it.
Soft feature freeze July 15.
- Steve
On 6/10/2025 11:39 AM, Steve Sistare wrote:
> cpr-transfer breaks vfio network connectivity to and from the guest, and
> the host system log shows:
> irq bypass consumer (token 00000000a03c32e5) registration fails: -16
> which is EBUSY. This occurs because KVM descriptors are still open in
> the old QEMU process. Close them.
>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> include/hw/vfio/vfio-device.h | 2 ++
> include/migration/cpr.h | 2 ++
> include/system/kvm.h | 1 +
> accel/kvm/kvm-all.c | 32 ++++++++++++++++++++++++++++++++
> accel/stubs/kvm-stub.c | 5 +++++
> hw/vfio/helpers.c | 10 ++++++++++
> hw/vfio/vfio-stubs.c | 13 +++++++++++++
> migration/cpr-transfer.c | 18 ++++++++++++++++++
> migration/cpr.c | 8 ++++++++
> migration/migration.c | 1 +
> hw/vfio/meson.build | 2 ++
> 11 files changed, 94 insertions(+)
> create mode 100644 hw/vfio/vfio-stubs.c
>
> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
> index 4e4d0b6..6eb6f21 100644
> --- a/include/hw/vfio/vfio-device.h
> +++ b/include/hw/vfio/vfio-device.h
> @@ -231,4 +231,6 @@ void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
> void vfio_device_init(VFIODevice *vbasedev, int type, VFIODeviceOps *ops,
> DeviceState *dev, bool ram_discard);
> int vfio_device_get_aw_bits(VFIODevice *vdev);
> +
> +void vfio_kvm_device_close(void);
> #endif /* HW_VFIO_VFIO_COMMON_H */
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> index 07858e9..d09b657 100644
> --- a/include/migration/cpr.h
> +++ b/include/migration/cpr.h
> @@ -32,7 +32,9 @@ void cpr_state_close(void);
> struct QIOChannel *cpr_state_ioc(void);
>
> bool cpr_incoming_needed(void *opaque);
> +void cpr_kvm_close(void);
>
> +void cpr_transfer_init(void);
> QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
> QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>
> diff --git a/include/system/kvm.h b/include/system/kvm.h
> index 7cc60d2..4896a3c 100644
> --- a/include/system/kvm.h
> +++ b/include/system/kvm.h
> @@ -195,6 +195,7 @@ bool kvm_has_sync_mmu(void);
> int kvm_has_vcpu_events(void);
> int kvm_max_nested_state_length(void);
> int kvm_has_gsi_routing(void);
> +void kvm_close(void);
>
> /**
> * kvm_arm_supports_user_irq
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index a317783..3d3a557 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -515,16 +515,23 @@ static int do_kvm_destroy_vcpu(CPUState *cpu)
> goto err;
> }
>
> + /* If I am the CPU that created coalesced_mmio_ring, then discard it */
> + if (s->coalesced_mmio_ring == (void *)cpu->kvm_run + PAGE_SIZE) {
> + s->coalesced_mmio_ring = NULL;
> + }
> +
> ret = munmap(cpu->kvm_run, mmap_size);
> if (ret < 0) {
> goto err;
> }
> + cpu->kvm_run = NULL;
>
> if (cpu->kvm_dirty_gfns) {
> ret = munmap(cpu->kvm_dirty_gfns, s->kvm_dirty_ring_bytes);
> if (ret < 0) {
> goto err;
> }
> + cpu->kvm_dirty_gfns = NULL;
> }
>
> kvm_park_vcpu(cpu);
> @@ -608,6 +615,31 @@ err:
> return ret;
> }
>
> +void kvm_close(void)
> +{
> + CPUState *cpu;
> +
> + if (!kvm_state || kvm_state->fd == -1) {
> + return;
> + }
> +
> + CPU_FOREACH(cpu) {
> + cpu_remove_sync(cpu);
> + close(cpu->kvm_fd);
> + cpu->kvm_fd = -1;
> + close(cpu->kvm_vcpu_stats_fd);
> + cpu->kvm_vcpu_stats_fd = -1;
> + }
> +
> + if (kvm_state && kvm_state->fd != -1) {
> + close(kvm_state->vmfd);
> + kvm_state->vmfd = -1;
> + close(kvm_state->fd);
> + kvm_state->fd = -1;
> + }
> + kvm_state = NULL;
> +}
> +
> /*
> * dirty pages logging control
> */
> diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
> index ecfd763..97dacb3 100644
> --- a/accel/stubs/kvm-stub.c
> +++ b/accel/stubs/kvm-stub.c
> @@ -134,3 +134,8 @@ int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp)
> {
> return -ENOSYS;
> }
> +
> +void kvm_close(void)
> +{
> + return;
> +}
> diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
> index d0dbab1..af1db2f 100644
> --- a/hw/vfio/helpers.c
> +++ b/hw/vfio/helpers.c
> @@ -117,6 +117,16 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
> int vfio_kvm_device_fd = -1;
> #endif
>
> +void vfio_kvm_device_close(void)
> +{
> +#ifdef CONFIG_KVM
> + if (vfio_kvm_device_fd != -1) {
> + close(vfio_kvm_device_fd);
> + vfio_kvm_device_fd = -1;
> + }
> +#endif
> +}
> +
> int vfio_kvm_device_add_fd(int fd, Error **errp)
> {
> #ifdef CONFIG_KVM
> diff --git a/hw/vfio/vfio-stubs.c b/hw/vfio/vfio-stubs.c
> new file mode 100644
> index 0000000..a4c8b56
> --- /dev/null
> +++ b/hw/vfio/vfio-stubs.c
> @@ -0,0 +1,13 @@
> +/*
> + * Copyright (c) 2025 Oracle and/or its affiliates.
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#include "qemu/osdep.h"
> +#include "hw/vfio/vfio-device.h"
> +
> +void vfio_kvm_device_close(void)
> +{
> + return;
> +}
> diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c
> index e1f1403..396558f 100644
> --- a/migration/cpr-transfer.c
> +++ b/migration/cpr-transfer.c
> @@ -17,6 +17,24 @@
> #include "migration/vmstate.h"
> #include "trace.h"
>
> +static int cpr_transfer_notifier(NotifierWithReturn *notifier,
> + MigrationEvent *e,
> + Error **errp)
> +{
> + if (e->type == MIG_EVENT_PRECOPY_DONE) {
> + cpr_kvm_close();
> + }
> + return 0;
> +}
> +
> +void cpr_transfer_init(void)
> +{
> + static NotifierWithReturn notifier;
> +
> + migration_add_notifier_mode(&notifier, cpr_transfer_notifier,
> + MIG_MODE_CPR_TRANSFER);
> +}
> +
> QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp)
> {
> MigrationAddress *addr = channel->addr;
> diff --git a/migration/cpr.c b/migration/cpr.c
> index a50a57e..49fb0a5 100644
> --- a/migration/cpr.c
> +++ b/migration/cpr.c
> @@ -7,12 +7,14 @@
>
> #include "qemu/osdep.h"
> #include "qapi/error.h"
> +#include "hw/vfio/vfio-device.h"
> #include "migration/cpr.h"
> #include "migration/misc.h"
> #include "migration/options.h"
> #include "migration/qemu-file.h"
> #include "migration/savevm.h"
> #include "migration/vmstate.h"
> +#include "system/kvm.h"
> #include "system/runstate.h"
> #include "trace.h"
>
> @@ -264,3 +266,9 @@ bool cpr_incoming_needed(void *opaque)
> MigMode mode = migrate_mode();
> return mode == MIG_MODE_CPR_TRANSFER;
> }
> +
> +void cpr_kvm_close(void)
> +{
> + kvm_close();
> + vfio_kvm_device_close();
> +}
> diff --git a/migration/migration.c b/migration/migration.c
> index 4098870..8f23cff 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -337,6 +337,7 @@ void migration_object_init(void)
>
> ram_mig_init();
> dirty_bitmap_mig_init();
> + cpr_transfer_init();
>
> /* Initialize cpu throttle timers */
> cpu_throttle_init();
> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index 73d29f9..98134a7 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -17,6 +17,8 @@ vfio_ss.add(when: 'CONFIG_VFIO_IGD', if_true: files('igd.c'))
>
> specific_ss.add_all(when: 'CONFIG_VFIO', if_true: vfio_ss)
>
> +system_ss.add(when: 'CONFIG_VFIO', if_false: files('vfio-stubs.c'))
> +
> system_ss.add(when: 'CONFIG_VFIO_XGMAC', if_true: files('calxeda-xgmac.c'))
> system_ss.add(when: 'CONFIG_VFIO_AMD_XGBE', if_true: files('amd-xgbe.c'))
> system_ss.add(when: 'CONFIG_VFIO', if_true: files(
* Re: [PATCH V5 20/38] migration: close kvm after cpr
2025-07-01 15:25 ` Steven Sistare
@ 2025-07-02 16:02 ` Peter Xu
2025-07-02 19:41 ` Steven Sistare
0 siblings, 1 reply; 101+ messages in thread
From: Peter Xu @ 2025-07-02 16:02 UTC (permalink / raw)
To: Steven Sistare
Cc: qemu-devel, Fabiano Rosas, Paolo Bonzini, Alex Williamson,
Cedric Le Goater, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum
On Tue, Jul 01, 2025 at 11:25:23AM -0400, Steven Sistare wrote:
> Hi Paolo, Peter, Fabiano,
>
> This patch needs review. CPR for vfio is broken without it.
> Soft feature freeze July 15.
Sorry to not have tried looking at this more even if this is marked
"migration".. obviously I still almost see it as a KVM change..
Questions inline below:
>
> - Steve
>
> On 6/10/2025 11:39 AM, Steve Sistare wrote:
> > cpr-transfer breaks vfio network connectivity to and from the guest, and
> > the host system log shows:
> > irq bypass consumer (token 00000000a03c32e5) registration fails: -16
> > which is EBUSY. This occurs because KVM descriptors are still open in
> > the old QEMU process. Close them.
> >
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > ---
> > include/hw/vfio/vfio-device.h | 2 ++
> > include/migration/cpr.h | 2 ++
> > include/system/kvm.h | 1 +
> > accel/kvm/kvm-all.c | 32 ++++++++++++++++++++++++++++++++
> > accel/stubs/kvm-stub.c | 5 +++++
> > hw/vfio/helpers.c | 10 ++++++++++
> > hw/vfio/vfio-stubs.c | 13 +++++++++++++
> > migration/cpr-transfer.c | 18 ++++++++++++++++++
> > migration/cpr.c | 8 ++++++++
> > migration/migration.c | 1 +
> > hw/vfio/meson.build | 2 ++
> > 11 files changed, 94 insertions(+)
> > create mode 100644 hw/vfio/vfio-stubs.c
> >
> > diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
> > index 4e4d0b6..6eb6f21 100644
> > --- a/include/hw/vfio/vfio-device.h
> > +++ b/include/hw/vfio/vfio-device.h
> > @@ -231,4 +231,6 @@ void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
> > void vfio_device_init(VFIODevice *vbasedev, int type, VFIODeviceOps *ops,
> > DeviceState *dev, bool ram_discard);
> > int vfio_device_get_aw_bits(VFIODevice *vdev);
> > +
> > +void vfio_kvm_device_close(void);
> > #endif /* HW_VFIO_VFIO_COMMON_H */
> > diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> > index 07858e9..d09b657 100644
> > --- a/include/migration/cpr.h
> > +++ b/include/migration/cpr.h
> > @@ -32,7 +32,9 @@ void cpr_state_close(void);
> > struct QIOChannel *cpr_state_ioc(void);
> > bool cpr_incoming_needed(void *opaque);
> > +void cpr_kvm_close(void);
> > +void cpr_transfer_init(void);
> > QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
> > QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
> > diff --git a/include/system/kvm.h b/include/system/kvm.h
> > index 7cc60d2..4896a3c 100644
> > --- a/include/system/kvm.h
> > +++ b/include/system/kvm.h
> > @@ -195,6 +195,7 @@ bool kvm_has_sync_mmu(void);
> > int kvm_has_vcpu_events(void);
> > int kvm_max_nested_state_length(void);
> > int kvm_has_gsi_routing(void);
> > +void kvm_close(void);
> > /**
> > * kvm_arm_supports_user_irq
> > diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> > index a317783..3d3a557 100644
> > --- a/accel/kvm/kvm-all.c
> > +++ b/accel/kvm/kvm-all.c
> > @@ -515,16 +515,23 @@ static int do_kvm_destroy_vcpu(CPUState *cpu)
> > goto err;
> > }
> > + /* If I am the CPU that created coalesced_mmio_ring, then discard it */
> > + if (s->coalesced_mmio_ring == (void *)cpu->kvm_run + PAGE_SIZE) {
> > + s->coalesced_mmio_ring = NULL;
> > + }
> > +
> > ret = munmap(cpu->kvm_run, mmap_size);
> > if (ret < 0) {
> > goto err;
> > }
> > + cpu->kvm_run = NULL;
> > if (cpu->kvm_dirty_gfns) {
> > ret = munmap(cpu->kvm_dirty_gfns, s->kvm_dirty_ring_bytes);
> > if (ret < 0) {
> > goto err;
> > }
> > + cpu->kvm_dirty_gfns = NULL;
> > }
> > kvm_park_vcpu(cpu);
> > @@ -608,6 +615,31 @@ err:
> > return ret;
> > }
> > +void kvm_close(void)
> > +{
> > + CPUState *cpu;
> > +
> > + if (!kvm_state || kvm_state->fd == -1) {
> > + return;
> > + }
> > +
> > + CPU_FOREACH(cpu) {
> > + cpu_remove_sync(cpu);
> > + close(cpu->kvm_fd);
> > + cpu->kvm_fd = -1;
> > + close(cpu->kvm_vcpu_stats_fd);
> > + cpu->kvm_vcpu_stats_fd = -1;
> > + }
> > +
> > + if (kvm_state && kvm_state->fd != -1) {
> > + close(kvm_state->vmfd);
> > + kvm_state->vmfd = -1;
> > + close(kvm_state->fd);
> > + kvm_state->fd = -1;
> > + }
> > + kvm_state = NULL;
> > +}
> > +
> > /*
> > * dirty pages logging control
> > */
> > diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
> > index ecfd763..97dacb3 100644
> > --- a/accel/stubs/kvm-stub.c
> > +++ b/accel/stubs/kvm-stub.c
> > @@ -134,3 +134,8 @@ int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp)
> > {
> > return -ENOSYS;
> > }
> > +
> > +void kvm_close(void)
> > +{
> > + return;
> > +}
> > diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
> > index d0dbab1..af1db2f 100644
> > --- a/hw/vfio/helpers.c
> > +++ b/hw/vfio/helpers.c
> > @@ -117,6 +117,16 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
> > int vfio_kvm_device_fd = -1;
> > #endif
> > +void vfio_kvm_device_close(void)
> > +{
> > +#ifdef CONFIG_KVM
> > + if (vfio_kvm_device_fd != -1) {
> > + close(vfio_kvm_device_fd);
> > + vfio_kvm_device_fd = -1;
> > + }
> > +#endif
> > +}
> > +
> > int vfio_kvm_device_add_fd(int fd, Error **errp)
> > {
> > #ifdef CONFIG_KVM
> > diff --git a/hw/vfio/vfio-stubs.c b/hw/vfio/vfio-stubs.c
> > new file mode 100644
> > index 0000000..a4c8b56
> > --- /dev/null
> > +++ b/hw/vfio/vfio-stubs.c
> > @@ -0,0 +1,13 @@
> > +/*
> > + * Copyright (c) 2025 Oracle and/or its affiliates.
> > + *
> > + * SPDX-License-Identifier: GPL-2.0-or-later
> > + */
> > +
> > +#include "qemu/osdep.h"
> > +#include "hw/vfio/vfio-device.h"
> > +
> > +void vfio_kvm_device_close(void)
> > +{
> > + return;
> > +}
Do we really need this stub, and the "include VFIO" headers in CPR as
below? I thought it would be doable the other way round, that VFIO or KVM
can register migration notifiers for CPR mode only. After all, the
registration (migration_add_notifier_mode) is in misc.h so it should be
available all across QEMU.
Besides that, a high level question: what this patch does is close early
the relevant kvm/vfio fds that are used to attach to irq consumers /
providers. In the meantime, AFAICT, CPR as a whole feature, when used
with VFIO, works only if VFIO can do whatever it wants
(DMA, irq injections) during the whole process of CPR live upgrade,
assuming that all the states are persisted in the fds. Then, if here we
need to (a) unregister on src QEMU and (b) re-attach on dest QEMU, what
happens if the irqs are generated exactly between (a) and (b)? Could they
get lost?
> > diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c
> > index e1f1403..396558f 100644
> > --- a/migration/cpr-transfer.c
> > +++ b/migration/cpr-transfer.c
> > @@ -17,6 +17,24 @@
> > #include "migration/vmstate.h"
> > #include "trace.h"
> > +static int cpr_transfer_notifier(NotifierWithReturn *notifier,
> > + MigrationEvent *e,
> > + Error **errp)
> > +{
> > + if (e->type == MIG_EVENT_PRECOPY_DONE) {
> > + cpr_kvm_close();
> > + }
> > + return 0;
> > +}
> > +
> > +void cpr_transfer_init(void)
> > +{
> > + static NotifierWithReturn notifier;
> > +
> > + migration_add_notifier_mode(&notifier, cpr_transfer_notifier,
> > + MIG_MODE_CPR_TRANSFER);
> > +}
> > +
> > QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp)
> > {
> > MigrationAddress *addr = channel->addr;
> > diff --git a/migration/cpr.c b/migration/cpr.c
> > index a50a57e..49fb0a5 100644
> > --- a/migration/cpr.c
> > +++ b/migration/cpr.c
> > @@ -7,12 +7,14 @@
> > #include "qemu/osdep.h"
> > #include "qapi/error.h"
> > +#include "hw/vfio/vfio-device.h"
[1]
> > #include "migration/cpr.h"
> > #include "migration/misc.h"
> > #include "migration/options.h"
> > #include "migration/qemu-file.h"
> > #include "migration/savevm.h"
> > #include "migration/vmstate.h"
> > +#include "system/kvm.h"
> > #include "system/runstate.h"
> > #include "trace.h"
> > @@ -264,3 +266,9 @@ bool cpr_incoming_needed(void *opaque)
> > MigMode mode = migrate_mode();
> > return mode == MIG_MODE_CPR_TRANSFER;
> > }
> > +
> > +void cpr_kvm_close(void)
> > +{
> > + kvm_close();
> > + vfio_kvm_device_close();
> > +}
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 4098870..8f23cff 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -337,6 +337,7 @@ void migration_object_init(void)
> > ram_mig_init();
> > dirty_bitmap_mig_init();
> > + cpr_transfer_init();
> > /* Initialize cpu throttle timers */
> > cpu_throttle_init();
> > diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> > index 73d29f9..98134a7 100644
> > --- a/hw/vfio/meson.build
> > +++ b/hw/vfio/meson.build
> > @@ -17,6 +17,8 @@ vfio_ss.add(when: 'CONFIG_VFIO_IGD', if_true: files('igd.c'))
> > specific_ss.add_all(when: 'CONFIG_VFIO', if_true: vfio_ss)
> > +system_ss.add(when: 'CONFIG_VFIO', if_false: files('vfio-stubs.c'))
> > +
> > system_ss.add(when: 'CONFIG_VFIO_XGMAC', if_true: files('calxeda-xgmac.c'))
> > system_ss.add(when: 'CONFIG_VFIO_AMD_XGBE', if_true: files('amd-xgbe.c'))
> > system_ss.add(when: 'CONFIG_VFIO', if_true: files(
--
Peter Xu
* Re: [PATCH V5 20/38] migration: close kvm after cpr
2025-07-02 16:02 ` Peter Xu
@ 2025-07-02 19:41 ` Steven Sistare
2025-07-03 19:45 ` Peter Xu
0 siblings, 1 reply; 101+ messages in thread
From: Steven Sistare @ 2025-07-02 19:41 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Fabiano Rosas, Paolo Bonzini, Alex Williamson,
Cedric Le Goater, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum
On 7/2/2025 12:02 PM, Peter Xu wrote:
> On Tue, Jul 01, 2025 at 11:25:23AM -0400, Steven Sistare wrote:
>> Hi Paolo, Peter, Fabiano,
>>
>> This patch needs review. CPR for vfio is broken without it.
>> Soft feature freeze July 15.
>
> Sorry to not have tried looking at this more even if this is marked
> "migration".. obviously I still almost see it as a KVM change..
>
> Questions inline below:
>
>>
>> - Steve
>>
>> On 6/10/2025 11:39 AM, Steve Sistare wrote:
>>> cpr-transfer breaks vfio network connectivity to and from the guest, and
>>> the host system log shows:
>>> irq bypass consumer (token 00000000a03c32e5) registration fails: -16
>>> which is EBUSY. This occurs because KVM descriptors are still open in
>>> the old QEMU process. Close them.
>>>
>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>> include/hw/vfio/vfio-device.h | 2 ++
>>> include/migration/cpr.h | 2 ++
>>> include/system/kvm.h | 1 +
>>> accel/kvm/kvm-all.c | 32 ++++++++++++++++++++++++++++++++
>>> accel/stubs/kvm-stub.c | 5 +++++
>>> hw/vfio/helpers.c | 10 ++++++++++
>>> hw/vfio/vfio-stubs.c | 13 +++++++++++++
>>> migration/cpr-transfer.c | 18 ++++++++++++++++++
>>> migration/cpr.c | 8 ++++++++
>>> migration/migration.c | 1 +
>>> hw/vfio/meson.build | 2 ++
>>> 11 files changed, 94 insertions(+)
>>> create mode 100644 hw/vfio/vfio-stubs.c
>>>
>>> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
>>> index 4e4d0b6..6eb6f21 100644
>>> --- a/include/hw/vfio/vfio-device.h
>>> +++ b/include/hw/vfio/vfio-device.h
>>> @@ -231,4 +231,6 @@ void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
>>> void vfio_device_init(VFIODevice *vbasedev, int type, VFIODeviceOps *ops,
>>> DeviceState *dev, bool ram_discard);
>>> int vfio_device_get_aw_bits(VFIODevice *vdev);
>>> +
>>> +void vfio_kvm_device_close(void);
>>> #endif /* HW_VFIO_VFIO_COMMON_H */
>>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>>> index 07858e9..d09b657 100644
>>> --- a/include/migration/cpr.h
>>> +++ b/include/migration/cpr.h
>>> @@ -32,7 +32,9 @@ void cpr_state_close(void);
>>> struct QIOChannel *cpr_state_ioc(void);
>>> bool cpr_incoming_needed(void *opaque);
>>> +void cpr_kvm_close(void);
>>> +void cpr_transfer_init(void);
>>> QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>>> QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>>> diff --git a/include/system/kvm.h b/include/system/kvm.h
>>> index 7cc60d2..4896a3c 100644
>>> --- a/include/system/kvm.h
>>> +++ b/include/system/kvm.h
>>> @@ -195,6 +195,7 @@ bool kvm_has_sync_mmu(void);
>>> int kvm_has_vcpu_events(void);
>>> int kvm_max_nested_state_length(void);
>>> int kvm_has_gsi_routing(void);
>>> +void kvm_close(void);
>>> /**
>>> * kvm_arm_supports_user_irq
>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>>> index a317783..3d3a557 100644
>>> --- a/accel/kvm/kvm-all.c
>>> +++ b/accel/kvm/kvm-all.c
>>> @@ -515,16 +515,23 @@ static int do_kvm_destroy_vcpu(CPUState *cpu)
>>> goto err;
>>> }
>>> + /* If I am the CPU that created coalesced_mmio_ring, then discard it */
>>> + if (s->coalesced_mmio_ring == (void *)cpu->kvm_run + PAGE_SIZE) {
>>> + s->coalesced_mmio_ring = NULL;
>>> + }
>>> +
>>> ret = munmap(cpu->kvm_run, mmap_size);
>>> if (ret < 0) {
>>> goto err;
>>> }
>>> + cpu->kvm_run = NULL;
>>> if (cpu->kvm_dirty_gfns) {
>>> ret = munmap(cpu->kvm_dirty_gfns, s->kvm_dirty_ring_bytes);
>>> if (ret < 0) {
>>> goto err;
>>> }
>>> + cpu->kvm_dirty_gfns = NULL;
>>> }
>>> kvm_park_vcpu(cpu);
>>> @@ -608,6 +615,31 @@ err:
>>> return ret;
>>> }
>>> +void kvm_close(void)
>>> +{
>>> + CPUState *cpu;
>>> +
>>> + if (!kvm_state || kvm_state->fd == -1) {
>>> + return;
>>> + }
>>> +
>>> + CPU_FOREACH(cpu) {
>>> + cpu_remove_sync(cpu);
>>> + close(cpu->kvm_fd);
>>> + cpu->kvm_fd = -1;
>>> + close(cpu->kvm_vcpu_stats_fd);
>>> + cpu->kvm_vcpu_stats_fd = -1;
>>> + }
>>> +
>>> + if (kvm_state && kvm_state->fd != -1) {
>>> + close(kvm_state->vmfd);
>>> + kvm_state->vmfd = -1;
>>> + close(kvm_state->fd);
>>> + kvm_state->fd = -1;
>>> + }
>>> + kvm_state = NULL;
>>> +}
>>> +
>>> /*
>>> * dirty pages logging control
>>> */
>>> diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
>>> index ecfd763..97dacb3 100644
>>> --- a/accel/stubs/kvm-stub.c
>>> +++ b/accel/stubs/kvm-stub.c
>>> @@ -134,3 +134,8 @@ int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp)
>>> {
>>> return -ENOSYS;
>>> }
>>> +
>>> +void kvm_close(void)
>>> +{
>>> + return;
>>> +}
>>> diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
>>> index d0dbab1..af1db2f 100644
>>> --- a/hw/vfio/helpers.c
>>> +++ b/hw/vfio/helpers.c
>>> @@ -117,6 +117,16 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
>>> int vfio_kvm_device_fd = -1;
>>> #endif
>>> +void vfio_kvm_device_close(void)
>>> +{
>>> +#ifdef CONFIG_KVM
>>> + if (vfio_kvm_device_fd != -1) {
>>> + close(vfio_kvm_device_fd);
>>> + vfio_kvm_device_fd = -1;
>>> + }
>>> +#endif
>>> +}
>>> +
>>> int vfio_kvm_device_add_fd(int fd, Error **errp)
>>> {
>>> #ifdef CONFIG_KVM
>>> diff --git a/hw/vfio/vfio-stubs.c b/hw/vfio/vfio-stubs.c
>>> new file mode 100644
>>> index 0000000..a4c8b56
>>> --- /dev/null
>>> +++ b/hw/vfio/vfio-stubs.c
>>> @@ -0,0 +1,13 @@
>>> +/*
>>> + * Copyright (c) 2025 Oracle and/or its affiliates.
>>> + *
>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>> + */
>>> +
>>> +#include "qemu/osdep.h"
>>> +#include "hw/vfio/vfio-device.h"
>>> +
>>> +void vfio_kvm_device_close(void)
>>> +{
>>> + return;
>>> +}
>
> Do we really need this stub, and the "include VFIO" headers in CPR as
> below? I thought it would be doable the other way round, that VFIO or KVM
> can register migration notifiers for CPR mode only. After all, the
> registration (migration_add_notifier_mode) is in misc.h so it should be
> available all across QEMU.
OK. I have reworked the code to register the notifier from vfio and
eliminate the stub.
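A minimal, self-contained sketch of the pattern suggested here, in which the subsystem registers its own mode-filtered notifier so that the core needs neither the subsystem's header nor a stub. This is not the reworked patch and not QEMU's API: `Mode`, `Event`, `add_notifier_mode`, `notify_all`, and the `device_*` names are all hypothetical stand-ins for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the migration mode and event types. */
typedef enum { MODE_NORMAL, MODE_CPR_TRANSFER } Mode;
typedef enum { EVENT_SETUP, EVENT_PRECOPY_DONE } Event;
typedef void (*NotifyFn)(Event e);

/* ---- "core" side: knows nothing about who registers ---- */
#define MAX_NOTIFIERS 8
static struct { NotifyFn fn; Mode mode; } notifiers[MAX_NOTIFIERS];
static size_t n_notifiers;

/* Register a callback that fires only when the given mode is active. */
void add_notifier_mode(NotifyFn fn, Mode mode)
{
    assert(n_notifiers < MAX_NOTIFIERS);
    notifiers[n_notifiers].fn = fn;
    notifiers[n_notifiers].mode = mode;
    n_notifiers++;
}

/* Deliver an event to every notifier registered for the current mode. */
void notify_all(Mode current, Event e)
{
    for (size_t i = 0; i < n_notifiers; i++) {
        if (notifiers[i].mode == current) {
            notifiers[i].fn(e);
        }
    }
}

/* ---- "subsystem" side: owns its descriptor and its cleanup ---- */
static int device_fd = 42;          /* pretend this is an open descriptor */

static void device_cpr_notifier(Event e)
{
    if (e == EVENT_PRECOPY_DONE) {
        device_fd = -1;             /* stands in for close(device_fd) */
    }
}

/* Called from the subsystem's own init path, not from the core. */
void device_init(void)
{
    add_notifier_mode(device_cpr_notifier, MODE_CPR_TRANSFER);
}

int device_fd_state(void)
{
    return device_fd;
}
```

With this shape the dependency points from the subsystem to the core only: the core dispatches events, and each subsystem decides what to close.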
> Besides that, a high level question: what this patch does is close early
> the relevant kvm/vfio fds that are used to attach to irq consumers /
> providers. In the meantime, AFAICT, CPR as a whole feature, when used
> with VFIO, works only if VFIO can do whatever it wants
> (DMA, irq injections) during the whole process of CPR live upgrade,
> assuming that all the states are persisted in the fds. Then, if here we
> need to (a) unregister on src QEMU and (b) re-attach on dest QEMU, what
> happens if the irqs are generated exactly between (a) and (b)? Could they
> get lost?
The irq producer is not closed, but it is detached from the kvm consumer.
Its eventfd is preserved in new QEMU, and interrupts that arrive during
transition are pended there.
- Steve
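A minimal sketch of the pending behaviour described above, assuming a Linux host. It is not QEMU code: `pending_seen_by_successor` is a hypothetical helper, and `fork()` inheritance merely stands in for whatever mechanism hands the preserved descriptor to new QEMU. It shows that signals written to an eventfd while nobody is consuming it accumulate in the counter and are seen by a later reader.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Signal an eventfd `nsignals` times while nothing is consuming it, then
 * let a forked child (standing in for the successor process that inherits
 * the descriptor) read the counter.
 *
 * Returns the count the child saw, or -1 on error or if nothing was pending.
 */
int pending_seen_by_successor(int nsignals)
{
    /* The count travels back in an exit status, so keep it small. */
    assert(nsignals >= 0 && nsignals < 255);

    int efd = eventfd(0, EFD_NONBLOCK);
    if (efd < 0) {
        return -1;
    }

    /* "Interrupts" arrive while no consumer is attached. */
    uint64_t one = 1;
    for (int i = 0; i < nsignals; i++) {
        if (write(efd, &one, sizeof(one)) != sizeof(one)) {
            close(efd);
            return -1;
        }
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(efd);
        return -1;
    }
    if (pid == 0) {
        /* Successor: one read returns the accumulated count and resets it. */
        uint64_t count = 0;
        if (read(efd, &count, sizeof(count)) != sizeof(count)) {
            _exit(255);             /* nothing pending (EAGAIN) or error */
        }
        _exit((int)count);
    }

    close(efd);
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) {
        return -1;
    }
    int seen = WEXITSTATUS(status);
    return seen == 255 ? -1 : seen;
}
```

Note that the eventfd counter sums signals rather than queueing them individually, so the successor learns that events are pending and how many were written, which is enough to know an interrupt must be delivered.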
>>> diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c
>>> index e1f1403..396558f 100644
>>> --- a/migration/cpr-transfer.c
>>> +++ b/migration/cpr-transfer.c
>>> @@ -17,6 +17,24 @@
>>> #include "migration/vmstate.h"
>>> #include "trace.h"
>>> +static int cpr_transfer_notifier(NotifierWithReturn *notifier,
>>> + MigrationEvent *e,
>>> + Error **errp)
>>> +{
>>> + if (e->type == MIG_EVENT_PRECOPY_DONE) {
>>> + cpr_kvm_close();
>>> + }
>>> + return 0;
>>> +}
>>> +
>>> +void cpr_transfer_init(void)
>>> +{
>>> + static NotifierWithReturn notifier;
>>> +
>>> + migration_add_notifier_mode(&notifier, cpr_transfer_notifier,
>>> + MIG_MODE_CPR_TRANSFER);
>>> +}
>>> +
>>> QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp)
>>> {
>>> MigrationAddress *addr = channel->addr;
>>> diff --git a/migration/cpr.c b/migration/cpr.c
>>> index a50a57e..49fb0a5 100644
>>> --- a/migration/cpr.c
>>> +++ b/migration/cpr.c
>>> @@ -7,12 +7,14 @@
>>> #include "qemu/osdep.h"
>>> #include "qapi/error.h"
>>> +#include "hw/vfio/vfio-device.h"
>
> [1]
>
>>> #include "migration/cpr.h"
>>> #include "migration/misc.h"
>>> #include "migration/options.h"
>>> #include "migration/qemu-file.h"
>>> #include "migration/savevm.h"
>>> #include "migration/vmstate.h"
>>> +#include "system/kvm.h"
>>> #include "system/runstate.h"
>>> #include "trace.h"
>>> @@ -264,3 +266,9 @@ bool cpr_incoming_needed(void *opaque)
>>> MigMode mode = migrate_mode();
>>> return mode == MIG_MODE_CPR_TRANSFER;
>>> }
>>> +
>>> +void cpr_kvm_close(void)
>>> +{
>>> + kvm_close();
>>> + vfio_kvm_device_close();
>>> +}
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 4098870..8f23cff 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -337,6 +337,7 @@ void migration_object_init(void)
>>> ram_mig_init();
>>> dirty_bitmap_mig_init();
>>> + cpr_transfer_init();
>>> /* Initialize cpu throttle timers */
>>> cpu_throttle_init();
>>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>>> index 73d29f9..98134a7 100644
>>> --- a/hw/vfio/meson.build
>>> +++ b/hw/vfio/meson.build
>>> @@ -17,6 +17,8 @@ vfio_ss.add(when: 'CONFIG_VFIO_IGD', if_true: files('igd.c'))
>>> specific_ss.add_all(when: 'CONFIG_VFIO', if_true: vfio_ss)
>>> +system_ss.add(when: 'CONFIG_VFIO', if_false: files('vfio-stubs.c'))
>>> +
>>> system_ss.add(when: 'CONFIG_VFIO_XGMAC', if_true: files('calxeda-xgmac.c'))
>>> system_ss.add(when: 'CONFIG_VFIO_AMD_XGBE', if_true: files('amd-xgbe.c'))
>>> system_ss.add(when: 'CONFIG_VFIO', if_true: files(
>
* Re: [PATCH V5 20/38] migration: close kvm after cpr
2025-07-02 19:41 ` Steven Sistare
@ 2025-07-03 19:45 ` Peter Xu
2025-07-03 21:21 ` Cédric Le Goater
0 siblings, 1 reply; 101+ messages in thread
From: Peter Xu @ 2025-07-03 19:45 UTC (permalink / raw)
To: Steven Sistare
Cc: qemu-devel, Fabiano Rosas, Paolo Bonzini, Alex Williamson,
Cedric Le Goater, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum
On Wed, Jul 02, 2025 at 03:41:08PM -0400, Steven Sistare wrote:
> The irq producer is not closed, but it is detached from the kvm consumer.
> Its eventfd is preserved in new QEMU, and interrupts that arrive during
> transition are pended there.
Ah I see, looks reasonable.
So can I understand that the core point here is that the irq consumer /
provider updates are atomic, and meanwhile the fallback paths are always
ready, so before / after the update the irq won't get lost?
E.g. in the Posted-Interrupt context on Intel, the IRTE will be updated
atomically for these VFIO irqs, so that it keeps using either the fast
path (provided by the irqbypass mechanism) or the slow path (eventfd_signal),
so it's free of any kind of race that an irq could trigger?
I saw that there's already a new version and Cedric queued it. If possible,
it would be nice to add some explanation to the commit message, either when
reposting or when merging, of why irqs won't get lost.
Thanks,
--
Peter Xu
* Re: [PATCH V5 20/38] migration: close kvm after cpr
2025-07-03 19:45 ` Peter Xu
@ 2025-07-03 21:21 ` Cédric Le Goater
2025-07-03 21:58 ` Peter Xu
0 siblings, 1 reply; 101+ messages in thread
From: Cédric Le Goater @ 2025-07-03 21:21 UTC (permalink / raw)
To: Peter Xu, Steven Sistare
Cc: qemu-devel, Fabiano Rosas, Paolo Bonzini, Alex Williamson, Yi Liu,
Eric Auger, Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum
On 7/3/25 21:45, Peter Xu wrote:
> On Wed, Jul 02, 2025 at 03:41:08PM -0400, Steven Sistare wrote:
>> The irq producer is not closed, but it is detached from the kvm consumer.
>> Its eventfd is preserved in new QEMU, and interrupts that arrive during
>> transition are pended there.
>
> Ah I see, looks reasonable.
>
> So can I understand that the core point here is that the irq consumer /
> provider updates are atomic, and meanwhile the fallback paths are always
> ready, so before / after the update the irq won't get lost?
>
> E.g. in the Posted-Interrupt context on Intel, the IRTE will be updated
> atomically for these VFIO irqs, so that it keeps using either the fast
> path (provided by the irqbypass mechanism) or the slow path (eventfd_signal),
> so it's free of any kind of race that an irq could trigger?
>
> I saw that there's already a new version and Cedric queued it. If possible,
> it would be nice to add some explanation to the commit message, either when
> reposting or when merging, of why irqs won't get lost.
yes.
Steve, just resend the patch. I will update the vfio queue.
Or we can address that with a follow up patch before QEMU 10.1
is released.
Thanks,
C.
* Re: [PATCH V5 20/38] migration: close kvm after cpr
2025-07-03 21:21 ` Cédric Le Goater
@ 2025-07-03 21:58 ` Peter Xu
2025-07-07 13:13 ` Steven Sistare
0 siblings, 1 reply; 101+ messages in thread
From: Peter Xu @ 2025-07-03 21:58 UTC (permalink / raw)
To: Cédric Le Goater, Steve Sistare
Cc: Steven Sistare, qemu-devel, Fabiano Rosas, Paolo Bonzini,
Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum
On Thu, Jul 03, 2025 at 11:21:38PM +0200, Cédric Le Goater wrote:
> On 7/3/25 21:45, Peter Xu wrote:
> > On Wed, Jul 02, 2025 at 03:41:08PM -0400, Steven Sistare wrote:
> > > The irq producer is not closed, but it is detached from the kvm consumer.
> > > Its eventfd is preserved in new QEMU, and interrupts that arrive during
> > > transition are pended there.
> >
> > Ah I see, looks reasonable.
> >
> > So can I understand that the core point here is that the irq consumer /
> > provider updates are atomic, and meanwhile the fallback paths are always
> > ready, so before / after the update the irq won't get lost?
> >
> > E.g. in the Posted-Interrupt context on Intel, the IRTE will be updated
> > atomically for these VFIO irqs, so that it keeps using either the fast
> > path (provided by the irqbypass mechanism) or the slow path (eventfd_signal),
> > so it's free of any kind of race that an irq could trigger?
> >
> > I saw that there's already a new version and Cedric queued it. If possible,
> > it would be nice to add some explanation to the commit message, either when
> > reposting or when merging, of why irqs won't get lost.
> yes.
>
> Steve, just resend the patch. I will update the vfio queue.
> Or we can address that with a follow up patch before QEMU 10.1
> is released.
I've just noticed that I was probably wrong to say the slow path is always
present. We've closed kvm, so the slow path is likely gone.
So I think I misunderstood, and Steve likely meant that the irq stays pending
in the eventfd, which holds as long as the irq eventfds are preserved and
passed over (I didn't check the patchset, but I'm assuming this is the
case).
Then I found that, indeed, when the irqfd is re-established on the dest QEMU,
we have this tricky code:
kvm_irqfd_assign():
/*
* Check if there was an event already pending on the eventfd
* before we registered, and trigger it as if we didn't miss it.
*/
events = vfs_poll(fd_file(f), &irqfd->pt);
if (events & EPOLLIN)
schedule_work(&irqfd->inject);
I've no idea whether it was intended for this, as the code has been there
since 2009, but this chunk of code may be the core of why the irq won't get
lost for CPR. In any case, it's a pretty tricky spot when proving that cpr
works, and it looks like an important piece of info.
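For illustration only, the property relied on here can be reproduced from
userspace. The sketch below assumes plain Linux eventfd semantics and is not
the KVM code path; the sketch_* names are made up for this example. It signals
an eventfd while nothing is watching it, then checks with a zero-timeout poll
that a consumer attaching later still sees the event.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Signal the eventfd once, as a device interrupt would. Returns 0 or -1. */
static int sketch_signal(int efd)
{
    uint64_t one = 1;

    assert(efd >= 0);
    return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Zero-timeout poll, loosely analogous to the vfs_poll() check quoted above.
 * Returns 1 if an event is pending, 0 if not, -1 on error.
 */
static int sketch_pending(int efd)
{
    struct pollfd pfd = { .fd = efd, .events = POLLIN };
    int n = poll(&pfd, 1, 0);

    if (n < 0) {
        return -1;
    }
    return (n == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/*
 * Signal first, attach the "consumer" afterwards.
 * Returns 1 if the late consumer still sees the event, 0 if not, -1 on error.
 */
static int sketch_late_consumer_sees_event(void)
{
    int efd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    int before, after;

    if (efd < 0) {
        return -1;
    }
    before = sketch_pending(efd);   /* nothing signalled yet */
    if (sketch_signal(efd) < 0) {
        close(efd);
        return -1;
    }
    after = sketch_pending(efd);    /* the counter is non-zero now */
    close(efd);
    return before == 0 && after == 1;
}
```

The eventfd counter stays non-zero until someone reads it, so the event is
held by the descriptor itself rather than by whoever happened to be polling
when it fired.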
Personally I'm ok doing it on top of what's queued. Maybe such an explanation
of how it works should be put directly into docs/../cpr.rst?
Thanks,
--
Peter Xu
* Re: [PATCH V5 20/38] migration: close kvm after cpr
2025-07-03 21:58 ` Peter Xu
@ 2025-07-07 13:13 ` Steven Sistare
0 siblings, 0 replies; 101+ messages in thread
From: Steven Sistare @ 2025-07-07 13:13 UTC (permalink / raw)
To: Peter Xu, Cédric Le Goater
Cc: qemu-devel, Fabiano Rosas, Paolo Bonzini, Alex Williamson, Yi Liu,
Eric Auger, Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum
On 7/3/2025 5:58 PM, Peter Xu wrote:
> On Thu, Jul 03, 2025 at 11:21:38PM +0200, Cédric Le Goater wrote:
>> On 7/3/25 21:45, Peter Xu wrote:
>>> On Wed, Jul 02, 2025 at 03:41:08PM -0400, Steven Sistare wrote:
>>>> The irq producer is not closed, but it is detached from the kvm consumer.
>>>> Its eventfd is preserved in new QEMU, and interrupts that arrive during
>>>> transition are pended there.
>>>
>>> Ah I see, looks reasonable.
>>>
>>> So is my understanding right that the core point is that the irq consumer /
>>> provider updates are atomic, and that a fallback path is always ready, so
>>> the irq won't get lost before or after the update?
>>>
>>> E.g. in Intel's Posted-Interrupt context, the IRTE will be updated
>>> atomically for these VFIO irqs, so that it either keeps using the fast
>>> path (provided by the irqbypass mechanism) or uses the slow path
>>> (eventfd_signal), and is therefore free of any race that could lose an irq?
>>>
>>> I saw that there's already a new version and Cedric queued it. If possible,
>>> please add some explanation to the commit message, either when reposting or
>>> when merging, on why the irq won't get lost.
>> yes.
>>
>> Steve, just resend the patch. I will update the vfio queue.
>> Or we can address that with a follow up patch before QEMU 10.1
>> is released.
>
> I've just noticed that I was probably wrong to say the slow path is always
> present. We've closed kvm, so the slow path is likely gone.
>
> So I think I misunderstood, and Steve likely meant that the irq stays pending
> in the eventfd, which holds as long as the irq eventfds are preserved and
> passed over (I didn't check the patchset, but I'm assuming this is the
> case).
>
> Then I found that, indeed, when the irqfd is re-established on the dest QEMU,
> we have this tricky code:
>
> kvm_irqfd_assign():
>
> /*
> * Check if there was an event already pending on the eventfd
> * before we registered, and trigger it as if we didn't miss it.
> */
> events = vfs_poll(fd_file(f), &irqfd->pt);
>
> if (events & EPOLLIN)
> schedule_work(&irqfd->inject);
>
> I've no idea whether it was intended for this, as the code has been there
> since 2009, but this chunk of code may be the core of why the irq won't get
> lost for CPR. In any case, it's a pretty tricky spot when proving that cpr
> works, and it looks like an important piece of info.
Yes, this is the mechanism I rely on to preserve an interrupt pended to
the vfio eventfd.
- Steve
> Personally I'm ok doing it on top of what's queued. Maybe such an explanation
> of how it works should be put directly into docs/../cpr.rst?
>
> Thanks,
>
* Re: [PATCH V5 20/38] migration: close kvm after cpr
2025-06-10 15:39 ` [PATCH V5 20/38] migration: close kvm after cpr Steve Sistare
2025-07-01 15:25 ` Steven Sistare
@ 2025-07-01 17:49 ` Fabiano Rosas
1 sibling, 0 replies; 101+ messages in thread
From: Fabiano Rosas @ 2025-07-01 17:49 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Steve Sistare
Steve Sistare <steven.sistare@oracle.com> writes:
> cpr-transfer breaks vfio network connectivity to and from the guest, and
> the host system log shows:
> irq bypass consumer (token 00000000a03c32e5) registration fails: -16
> which is EBUSY. This occurs because KVM descriptors are still open in
> the old QEMU process. Close them.
>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
* [PATCH V5 21/38] migration: cpr_get_fd_param helper
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (19 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 20/38] migration: close kvm after cpr Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-10 15:39 ` [PATCH V5 22/38] backends/iommufd: iommufd_backend_map_file_dma Steve Sistare
` (17 subsequent siblings)
38 siblings, 0 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Add the helper function cpr_get_fd_param, to use when preserving
a file descriptor that is opened externally and passed to QEMU.
cpr_get_fd_param returns a descriptor number either from a QEMU
command-line parameter, from a getfd command, or from CPR state.
When a descriptor is passed to new QEMU via SCM_RIGHTS, its number
changes. Hence, during CPR, the command-line parameter is ignored
in new QEMU, and overridden by the value found in CPR state.
Similarly, if the descriptor was originally specified by a getfd
command in old QEMU, the fd number is not known outside of QEMU,
and it changes when sent to new QEMU via SCM_RIGHTS. Hence the
user cannot send getfd to new QEMU, but when the user sends a
hotplug command that references the fd, cpr_get_fd_param finds
its value in CPR state.
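As a standalone illustration of why the number cannot be trusted across the
transfer, the sketch below passes a pipe descriptor over a socketpair with
SCM_RIGHTS and compares the numbers. It is not QEMU code; the sketch_* helpers
and union fd_cmsg are hypothetical names for this example, and both ends live
in one process only to keep it short.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_cmsg {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send @fd over the AF_UNIX socket @sock. Returns 0 on success, else -1. */
static int sketch_send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(sock >= 0 && fd >= 0);
    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from @sock. Returns the new fd, else -1. */
static int sketch_recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/*
 * Returns 1 if the received descriptor has a different number than the one
 * sent yet refers to the same pipe, 0 if not, -1 on setup error.
 */
static int sketch_fd_number_changes(void)
{
    int sv[2], p[2], got, ok = 0;
    char c = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }
    if (pipe(p) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (sketch_send_fd(sv[0], p[1]) == 0) {
        got = sketch_recv_fd(sv[1]);
        if (got >= 0) {
            /* Write through the received fd, read from the original pipe. */
            ok = got != p[1] && write(got, "x", 1) == 1 &&
                 read(p[0], &c, 1) == 1 && c == 'x';
            close(got);
        }
    }
    close(p[0]);
    close(p[1]);
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

Within one process the numbers always differ, because the original is still
open. Across two processes the receiver gets whatever number is free in its
own table, so a match would be a coincidence, which is why the value has to
be looked up by name in CPR state.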
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
---
include/migration/cpr.h | 2 ++
migration/cpr.c | 37 +++++++++++++++++++++++++++++++++++++
2 files changed, 39 insertions(+)
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index d09b657..7fd8065 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -33,6 +33,8 @@ struct QIOChannel *cpr_state_ioc(void);
bool cpr_incoming_needed(void *opaque);
void cpr_kvm_close(void);
+int cpr_get_fd_param(const char *name, const char *fdname, int index,
+ Error **errp);
void cpr_transfer_init(void);
QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
diff --git a/migration/cpr.c b/migration/cpr.c
index 49fb0a5..4574608 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -14,6 +14,7 @@
#include "migration/qemu-file.h"
#include "migration/savevm.h"
#include "migration/vmstate.h"
+#include "monitor/monitor.h"
#include "system/kvm.h"
#include "system/runstate.h"
#include "trace.h"
@@ -272,3 +273,39 @@ void cpr_kvm_close(void)
kvm_close();
vfio_kvm_device_close();
}
+
+/*
+ * cpr_get_fd_param: find a descriptor and return its value.
+ *
+ * @name: CPR name for the descriptor
+ * @fdname: An integer-valued string, or a name passed to a getfd command
+ * @index: CPR index of the descriptor
+ * @errp: returned error message
+ *
+ * If CPR is not being performed, then use @fdname to find the fd.
+ * If CPR is being performed, then ignore @fdname, and look for @name
+ * and @index in CPR state.
+ *
+ * On success returns the fd value, else returns -1.
+ */
+int cpr_get_fd_param(const char *name, const char *fdname, int index,
+ Error **errp)
+{
+ ERRP_GUARD();
+ int fd;
+
+ if (cpr_is_incoming()) {
+ fd = cpr_find_fd(name, index);
+ if (fd < 0) {
+ error_setg(errp, "cannot find saved value for fd %s", fdname);
+ }
+ } else {
+ fd = monitor_fd_param(monitor_cur(), fdname, errp);
+ if (fd >= 0) {
+ cpr_save_fd(name, index, fd);
+ } else {
+ error_prepend(errp, "Could not parse object fd %s:", fdname);
+ }
+ }
+ return fd;
+}
--
1.8.3.1
* [PATCH V5 22/38] backends/iommufd: iommufd_backend_map_file_dma
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (20 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 21/38] migration: cpr_get_fd_param helper Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-10 15:39 ` [PATCH V5 23/38] backends/iommufd: change process ioctl Steve Sistare
` (16 subsequent siblings)
38 siblings, 0 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Define iommufd_backend_map_file_dma to implement IOMMU_IOAS_MAP_FILE.
This will be called as a substitute for iommufd_backend_map_dma, so
the error conditions for BARs are copied as-is from that function.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
include/system/iommufd.h | 3 +++
backends/iommufd.c | 34 ++++++++++++++++++++++++++++++++++
backends/trace-events | 1 +
3 files changed, 38 insertions(+)
diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index 283861b..2d24d93 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -43,6 +43,9 @@ void iommufd_backend_disconnect(IOMMUFDBackend *be);
bool iommufd_backend_alloc_ioas(IOMMUFDBackend *be, uint32_t *ioas_id,
Error **errp);
void iommufd_backend_free_id(IOMMUFDBackend *be, uint32_t id);
+int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
+ hwaddr iova, ram_addr_t size, int fd,
+ unsigned long start, bool readonly);
int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas_id, hwaddr iova,
ram_addr_t size, void *vaddr, bool readonly);
int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
diff --git a/backends/iommufd.c b/backends/iommufd.c
index c2c47ab..3a2ecc7 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -172,6 +172,40 @@ int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas_id, hwaddr iova,
return ret;
}
+int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
+ hwaddr iova, ram_addr_t size,
+ int mfd, unsigned long start, bool readonly)
+{
+ int ret, fd = be->fd;
+ struct iommu_ioas_map_file map = {
+ .size = sizeof(map),
+ .flags = IOMMU_IOAS_MAP_READABLE |
+ IOMMU_IOAS_MAP_FIXED_IOVA,
+ .ioas_id = ioas_id,
+ .fd = mfd,
+ .start = start,
+ .iova = iova,
+ .length = size,
+ };
+
+ if (!readonly) {
+ map.flags |= IOMMU_IOAS_MAP_WRITEABLE;
+ }
+
+ ret = ioctl(fd, IOMMU_IOAS_MAP_FILE, &map);
+ trace_iommufd_backend_map_file_dma(fd, ioas_id, iova, size, mfd, start,
+ readonly, ret);
+ if (ret) {
+ ret = -errno;
+
+ /* TODO: Not support mapping hardware PCI BAR region for now. */
+ if (errno == EFAULT) {
+ warn_report("IOMMU_IOAS_MAP_FILE failed: %m, PCI BAR?");
+ }
+ }
+ return ret;
+}
+
int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
hwaddr iova, ram_addr_t size)
{
diff --git a/backends/trace-events b/backends/trace-events
index 7278214..e5f3e70 100644
--- a/backends/trace-events
+++ b/backends/trace-events
@@ -11,6 +11,7 @@ iommufd_backend_connect(int fd, bool owned, uint32_t users) "fd=%d owned=%d user
iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
iommufd_backend_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, void *vaddr, bool readonly, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" addr=%p readonly=%d (%d)"
+iommufd_backend_map_file_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int fd, unsigned long start, bool readonly, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" fd=%d start=%ld readonly=%d (%d)"
iommufd_backend_unmap_dma_non_exist(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " Unmap nonexistent mapping: iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d"
--
1.8.3.1
* [PATCH V5 23/38] backends/iommufd: change process ioctl
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (21 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 22/38] backends/iommufd: iommufd_backend_map_file_dma Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-11 12:38 ` Cédric Le Goater
2025-06-23 8:20 ` Duan, Zhenzhong
2025-06-10 15:39 ` [PATCH V5 24/38] physmem: qemu_ram_get_fd_offset Steve Sistare
` (15 subsequent siblings)
38 siblings, 2 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Define iommufd_change_process_capable and iommufd_change_process, which
issue the IOMMU_IOAS_CHANGE_PROCESS ioctl.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
include/system/iommufd.h | 3 +++
backends/iommufd.c | 24 ++++++++++++++++++++++++
backends/trace-events | 1 +
3 files changed, 28 insertions(+)
diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index 2d24d93..db5f2c7 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -69,6 +69,9 @@ bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t id,
uint32_t *entry_num, void *data,
Error **errp);
+bool iommufd_change_process_capable(IOMMUFDBackend *be);
+bool iommufd_change_process(IOMMUFDBackend *be, Error **errp);
+
#define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
OBJECT_DECLARE_TYPE(HostIOMMUDeviceIOMMUFD, HostIOMMUDeviceIOMMUFDClass,
HOST_IOMMU_DEVICE_IOMMUFD)
diff --git a/backends/iommufd.c b/backends/iommufd.c
index 3a2ecc7..87f81a0 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -73,6 +73,30 @@ static void iommufd_backend_class_init(ObjectClass *oc, const void *data)
object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
}
+bool iommufd_change_process_capable(IOMMUFDBackend *be)
+{
+ struct iommu_ioas_change_process args = {.size = sizeof(args)};
+
+ /*
+ * Call IOMMU_IOAS_CHANGE_PROCESS to verify it is a recognized ioctl.
+ * This is a no-op if the process has not changed since DMA was mapped.
+ */
+ return !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
+}
+
+bool iommufd_change_process(IOMMUFDBackend *be, Error **errp)
+{
+ struct iommu_ioas_change_process args = {.size = sizeof(args)};
+ bool ret = !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
+
+ if (!ret) {
+ error_setg_errno(errp, errno, "IOMMU_IOAS_CHANGE_PROCESS fd %d failed",
+ be->fd);
+ }
+ trace_iommufd_change_process(be->fd, ret);
+ return ret;
+}
+
bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
{
int fd;
diff --git a/backends/trace-events b/backends/trace-events
index e5f3e70..56132d3 100644
--- a/backends/trace-events
+++ b/backends/trace-events
@@ -7,6 +7,7 @@ dbus_vmstate_loading(const char *id) "id: %s"
dbus_vmstate_saving(const char *id) "id: %s"
# iommufd.c
+iommufd_change_process(int fd, bool ret) "fd=%d (%d)"
iommufd_backend_connect(int fd, bool owned, uint32_t users) "fd=%d owned=%d users=%d"
iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
--
1.8.3.1
* Re: [PATCH V5 23/38] backends/iommufd: change process ioctl
2025-06-10 15:39 ` [PATCH V5 23/38] backends/iommufd: change process ioctl Steve Sistare
@ 2025-06-11 12:38 ` Cédric Le Goater
2025-06-23 8:20 ` Duan, Zhenzhong
1 sibling, 0 replies; 101+ messages in thread
From: Cédric Le Goater @ 2025-06-11 12:38 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 6/10/25 17:39, Steve Sistare wrote:
> Define the change process ioctl
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Thanks,
C.
> ---
> include/system/iommufd.h | 3 +++
> backends/iommufd.c | 24 ++++++++++++++++++++++++
> backends/trace-events | 1 +
> 3 files changed, 28 insertions(+)
>
> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
> index 2d24d93..db5f2c7 100644
> --- a/include/system/iommufd.h
> +++ b/include/system/iommufd.h
> @@ -69,6 +69,9 @@ bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t id,
> uint32_t *entry_num, void *data,
> Error **errp);
>
> +bool iommufd_change_process_capable(IOMMUFDBackend *be);
> +bool iommufd_change_process(IOMMUFDBackend *be, Error **errp);
> +
> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
> OBJECT_DECLARE_TYPE(HostIOMMUDeviceIOMMUFD, HostIOMMUDeviceIOMMUFDClass,
> HOST_IOMMU_DEVICE_IOMMUFD)
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> index 3a2ecc7..87f81a0 100644
> --- a/backends/iommufd.c
> +++ b/backends/iommufd.c
> @@ -73,6 +73,30 @@ static void iommufd_backend_class_init(ObjectClass *oc, const void *data)
> object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
> }
>
> +bool iommufd_change_process_capable(IOMMUFDBackend *be)
> +{
> + struct iommu_ioas_change_process args = {.size = sizeof(args)};
> +
> + /*
> + * Call IOMMU_IOAS_CHANGE_PROCESS to verify it is a recognized ioctl.
> + * This is a no-op if the process has not changed since DMA was mapped.
> + */
> + return !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
> +}
> +
> +bool iommufd_change_process(IOMMUFDBackend *be, Error **errp)
> +{
> + struct iommu_ioas_change_process args = {.size = sizeof(args)};
> + bool ret = !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
> +
> + if (!ret) {
> + error_setg_errno(errp, errno, "IOMMU_IOAS_CHANGE_PROCESS fd %d failed",
> + be->fd);
> + }
> + trace_iommufd_change_process(be->fd, ret);
> + return ret;
> +}
> +
> bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
> {
> int fd;
> diff --git a/backends/trace-events b/backends/trace-events
> index e5f3e70..56132d3 100644
> --- a/backends/trace-events
> +++ b/backends/trace-events
> @@ -7,6 +7,7 @@ dbus_vmstate_loading(const char *id) "id: %s"
> dbus_vmstate_saving(const char *id) "id: %s"
>
> # iommufd.c
> +iommufd_change_process(int fd, bool ret) "fd=%d (%d)"
> iommufd_backend_connect(int fd, bool owned, uint32_t users) "fd=%d owned=%d users=%d"
> iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
> iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
* RE: [PATCH V5 23/38] backends/iommufd: change process ioctl
2025-06-10 15:39 ` [PATCH V5 23/38] backends/iommufd: change process ioctl Steve Sistare
2025-06-11 12:38 ` Cédric Le Goater
@ 2025-06-23 8:20 ` Duan, Zhenzhong
1 sibling, 0 replies; 101+ messages in thread
From: Duan, Zhenzhong @ 2025-06-23 8:20 UTC (permalink / raw)
To: Steve Sistare, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V5 23/38] backends/iommufd: change process ioctl
>
>Define the change process ioctl
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Thanks
Zhenzhong
* [PATCH V5 24/38] physmem: qemu_ram_get_fd_offset
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (22 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 23/38] backends/iommufd: change process ioctl Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-10 15:39 ` [PATCH V5 25/38] vfio/iommufd: use IOMMU_IOAS_MAP_FILE Steve Sistare
` (14 subsequent siblings)
38 siblings, 0 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Define qemu_ram_get_fd_offset, so CPR can map a memory region using
IOMMU_IOAS_MAP_FILE in a subsequent patch.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
include/exec/cpu-common.h | 1 +
system/physmem.c | 5 +++++
2 files changed, 6 insertions(+)
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index a684855..9b658a3 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -85,6 +85,7 @@ void qemu_ram_unset_idstr(RAMBlock *block);
const char *qemu_ram_get_idstr(RAMBlock *rb);
void *qemu_ram_get_host_addr(RAMBlock *rb);
ram_addr_t qemu_ram_get_offset(RAMBlock *rb);
+ram_addr_t qemu_ram_get_fd_offset(RAMBlock *rb);
ram_addr_t qemu_ram_get_used_length(RAMBlock *rb);
ram_addr_t qemu_ram_get_max_length(RAMBlock *rb);
bool qemu_ram_is_shared(RAMBlock *rb);
diff --git a/system/physmem.c b/system/physmem.c
index a8a9ca3..18684a4 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1593,6 +1593,11 @@ ram_addr_t qemu_ram_get_offset(RAMBlock *rb)
return rb->offset;
}
+ram_addr_t qemu_ram_get_fd_offset(RAMBlock *rb)
+{
+ return rb->fd_offset;
+}
+
ram_addr_t qemu_ram_get_used_length(RAMBlock *rb)
{
return rb->used_length;
--
1.8.3.1
* [PATCH V5 25/38] vfio/iommufd: use IOMMU_IOAS_MAP_FILE
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (23 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 24/38] physmem: qemu_ram_get_fd_offset Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-10 15:39 ` [PATCH V5 26/38] vfio/iommufd: invariant device name Steve Sistare
` (13 subsequent siblings)
38 siblings, 0 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Use IOMMU_IOAS_MAP_FILE when the mapped region is backed by a file.
Such a mapping can be preserved without modification during CPR,
because it depends on the file's address space, which does not change,
rather than on the process's address space, which does change.
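A rough userspace sketch of that distinction, not taken from QEMU: the same
file range is mapped twice, standing in for old and new QEMU, and the two
mappings land at different virtual addresses while the file offset computed
with the (vaddr - host base) + fd_offset arithmetic stays the same. struct
sketch_block and the sketch_* functions are hypothetical stand-ins, and a
tmpfile() plays the role of the RAM backing file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical stand-in for the RAMBlock fields that the patch consults. */
struct sketch_block {
    char *host;         /* where this process mapped the file */
    size_t fd_offset;   /* file offset that host[0] corresponds to */
};

/* (vaddr - host base) + the block's offset within its backing file. */
static size_t sketch_file_offset(const struct sketch_block *b,
                                 const void *vaddr)
{
    assert((const char *)vaddr >= b->host);
    return (size_t)((const char *)vaddr - b->host) + b->fd_offset;
}

/*
 * Returns 1 if two mappings of one file range have different addresses but
 * identical file offsets and contents, 0 if not, -1 on setup error.
 */
static int sketch_offsets_are_invariant(void)
{
    long page = sysconf(_SC_PAGESIZE);
    FILE *fp = tmpfile();
    struct sketch_block a, b;
    int fd, ok = -1;

    if (!fp) {
        return -1;
    }
    fd = fileno(fp);
    if (page <= 0 || ftruncate(fd, 4 * page) < 0) {
        fclose(fp);
        return -1;
    }

    /* Both "processes" map the same two pages, starting one page in. */
    a.fd_offset = b.fd_offset = (size_t)page;
    a.host = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE, MAP_SHARED,
                  fd, page);
    b.host = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE, MAP_SHARED,
                  fd, page);

    if (a.host != MAP_FAILED && b.host != MAP_FAILED) {
        a.host[100] = 'x';
        ok = a.host != b.host &&
             sketch_file_offset(&a, a.host + 100) ==
             sketch_file_offset(&b, b.host + 100) &&
             sketch_file_offset(&a, a.host + 100) == (size_t)page + 100 &&
             b.host[100] == 'x';
    }
    if (a.host != MAP_FAILED) {
        munmap(a.host, 2 * page);
    }
    if (b.host != MAP_FAILED) {
        munmap(b.host, 2 * page);
    }
    fclose(fp);
    return ok;
}
```

A mapping keyed on (fd, offset) therefore names the same bytes in either
process, whereas one keyed on a virtual address would have to be redone
after the transfer.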
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
include/hw/vfio/vfio-container-base.h | 15 +++++++++++++++
hw/vfio/container-base.c | 9 +++++++++
hw/vfio/iommufd.c | 13 +++++++++++++
3 files changed, 37 insertions(+)
diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index f023265..a49ef69 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -167,6 +167,21 @@ struct VFIOIOMMUClass {
hwaddr iova, ram_addr_t size,
void *vaddr, bool readonly, MemoryRegion *mr);
/**
+ * @dma_map_file
+ *
+ * Map a file range for the container.
+ *
+ * @bcontainer: #VFIOContainerBase to use for map
+ * @iova: start address to map
+ * @size: size of the range to map
+ * @fd: descriptor of the file to map
+ * @start: starting file offset of the range to map
+ * @readonly: map read only if true
+ */
+ int (*dma_map_file)(const VFIOContainerBase *bcontainer,
+ hwaddr iova, ram_addr_t size,
+ int fd, unsigned long start, bool readonly);
+ /**
* @dma_unmap
*
* Unmap an address range from the container.
diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
index d834bd4..5630497 100644
--- a/hw/vfio/container-base.c
+++ b/hw/vfio/container-base.c
@@ -78,7 +78,16 @@ int vfio_container_dma_map(VFIOContainerBase *bcontainer,
void *vaddr, bool readonly, MemoryRegion *mr)
{
VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
+ RAMBlock *rb = mr->ram_block;
+ int mfd = rb ? qemu_ram_get_fd(rb) : -1;
+ if (mfd >= 0 && vioc->dma_map_file) {
+ unsigned long start = vaddr - qemu_ram_get_host_addr(rb);
+ unsigned long offset = qemu_ram_get_fd_offset(rb);
+
+ return vioc->dma_map_file(bcontainer, iova, size, mfd, start + offset,
+ readonly);
+ }
g_assert(vioc->dma_map);
return vioc->dma_map(bcontainer, iova, size, vaddr, readonly, mr);
}
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index d3efef7..962a1e2 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -45,6 +45,18 @@ static int iommufd_cdev_map(const VFIOContainerBase *bcontainer, hwaddr iova,
iova, size, vaddr, readonly);
}
+static int iommufd_cdev_map_file(const VFIOContainerBase *bcontainer,
+ hwaddr iova, ram_addr_t size,
+ int fd, unsigned long start, bool readonly)
+{
+ const VFIOIOMMUFDContainer *container =
+ container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
+
+ return iommufd_backend_map_file_dma(container->be,
+ container->ioas_id,
+ iova, size, fd, start, readonly);
+}
+
static int iommufd_cdev_unmap(const VFIOContainerBase *bcontainer,
hwaddr iova, ram_addr_t size,
IOMMUTLBEntry *iotlb, bool unmap_all)
@@ -807,6 +819,7 @@ static void vfio_iommu_iommufd_class_init(ObjectClass *klass, const void *data)
VFIOIOMMUClass *vioc = VFIO_IOMMU_CLASS(klass);
vioc->dma_map = iommufd_cdev_map;
+ vioc->dma_map_file = iommufd_cdev_map_file;
vioc->dma_unmap = iommufd_cdev_unmap;
vioc->attach_device = iommufd_cdev_attach;
vioc->detach_device = iommufd_cdev_detach;
--
1.8.3.1
* [PATCH V5 26/38] vfio/iommufd: invariant device name
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (24 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 25/38] vfio/iommufd: use IOMMU_IOAS_MAP_FILE Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-23 8:25 ` Duan, Zhenzhong
2025-06-10 15:39 ` [PATCH V5 27/38] vfio/iommufd: add vfio_device_free_name Steve Sistare
` (12 subsequent siblings)
38 siblings, 1 reply; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
cpr-transfer will use the device name as a key to find the value
of the device descriptor in new QEMU. However, if the descriptor
number is specified by a command-line fd parameter, then
vfio_device_get_name creates a name that includes the fd number.
This causes a chicken-and-egg problem: new QEMU must know the fd
number to construct a name to find the fd number.
To fix, create an invariant name based on the id command-line parameter,
if id is defined. The user will need to provide such an id to use CPR.
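The naming rule can be pictured with a small standalone sketch. It assumes
only what the diff below shows (id if present, else "VFIO_FD%d"); the
sketch_* helpers are hypothetical and the fd numbers are arbitrary.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper mirroring the naming rule in this patch: prefer the
 * user-supplied id, else fall back to a name derived from the fd number.
 */
static void sketch_device_name(char *buf, size_t len, const char *id, int fd)
{
    assert(buf && len > 0);
    if (id) {
        snprintf(buf, len, "%s", id);
    } else {
        snprintf(buf, len, "VFIO_FD%d", fd);
    }
}

/*
 * Returns 1 if old and new QEMU, holding different fd numbers for the same
 * device, would derive the same lookup key, else 0.
 */
static int sketch_names_match(const char *id, int old_fd, int new_fd)
{
    char old_name[64], new_name[64];

    sketch_device_name(old_name, sizeof(old_name), id, old_fd);
    sketch_device_name(new_name, sizeof(new_name), id, new_fd);
    return strcmp(old_name, new_name) == 0;
}
```

With an id, sketch_names_match("vfio0", 17, 23) yields the same key on both
sides. Without one, the keys embed 17 and 23 respectively and cannot match,
which is the chicken-and-egg problem described above.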
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
hw/vfio/device.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/hw/vfio/device.c b/hw/vfio/device.c
index 9fba2c7..71fa9f4 100644
--- a/hw/vfio/device.c
+++ b/hw/vfio/device.c
@@ -300,12 +300,17 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
error_setg(errp, "Use FD passing only with iommufd backend");
return false;
}
- /*
- * Give a name with fd so any function printing out vbasedev->name
- * will not break.
- */
if (!vbasedev->name) {
- vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
+
+ if (vbasedev->dev->id) {
+ vbasedev->name = g_strdup(vbasedev->dev->id);
+ return true;
+ } else {
+ /*
+ * Assign a name so any function printing it will not break.
+ */
+ vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
+ }
}
}
--
1.8.3.1
* RE: [PATCH V5 26/38] vfio/iommufd: invariant device name
2025-06-10 15:39 ` [PATCH V5 26/38] vfio/iommufd: invariant device name Steve Sistare
@ 2025-06-23 8:25 ` Duan, Zhenzhong
0 siblings, 0 replies; 101+ messages in thread
From: Duan, Zhenzhong @ 2025-06-23 8:25 UTC (permalink / raw)
To: Steve Sistare, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V5 26/38] vfio/iommufd: invariant device name
>
>cpr-transfer will use the device name as a key to find the value
>of the device descriptor in new QEMU. However, if the descriptor
>number is specified by a command-line fd parameter, then
>vfio_device_get_name creates a name that includes the fd number.
>This causes a chicken-and-egg problem: new QEMU must know the fd
>number to construct a name to find the fd number.
>
>To fix, create an invariant name based on the id command-line parameter,
>if id is defined. The user will need to provide such an id to use CPR.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Thanks
Zhenzhong
* [PATCH V5 27/38] vfio/iommufd: add vfio_device_free_name
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (25 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 26/38] vfio/iommufd: invariant device name Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-11 12:38 ` Cédric Le Goater
` (2 more replies)
2025-06-10 15:39 ` [PATCH V5 28/38] vfio/iommufd: device name blocker Steve Sistare
` (11 subsequent siblings)
38 siblings, 3 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Define vfio_device_free_name to free the name created by
vfio_device_get_name. A subsequent patch will do more there.
No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
include/hw/vfio/vfio-device.h | 1 +
hw/vfio/ap.c | 2 +-
hw/vfio/ccw.c | 2 +-
hw/vfio/device.c | 5 +++++
hw/vfio/pci.c | 2 +-
hw/vfio/platform.c | 2 +-
6 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
index 6eb6f21..321b442 100644
--- a/include/hw/vfio/vfio-device.h
+++ b/include/hw/vfio/vfio-device.h
@@ -227,6 +227,7 @@ int vfio_device_get_irq_info(VFIODevice *vbasedev, int index,
/* Returns 0 on success, or a negative errno. */
bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp);
+void vfio_device_free_name(VFIODevice *vbasedev);
void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
void vfio_device_init(VFIODevice *vbasedev, int type, VFIODeviceOps *ops,
DeviceState *dev, bool ram_discard);
diff --git a/hw/vfio/ap.c b/hw/vfio/ap.c
index 785c0a0..013bd59 100644
--- a/hw/vfio/ap.c
+++ b/hw/vfio/ap.c
@@ -180,7 +180,7 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp)
error:
error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name);
- g_free(vbasedev->name);
+ vfio_device_free_name(vbasedev);
}
static void vfio_ap_unrealize(DeviceState *dev)
diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
index cea9d6e..903b8b0 100644
--- a/hw/vfio/ccw.c
+++ b/hw/vfio/ccw.c
@@ -619,7 +619,7 @@ out_io_notifier_err:
out_region_err:
vfio_device_detach(vbasedev);
out_attach_dev_err:
- g_free(vbasedev->name);
+ vfio_device_free_name(vbasedev);
out_unrealize:
if (cdc->unrealize) {
cdc->unrealize(cdev);
diff --git a/hw/vfio/device.c b/hw/vfio/device.c
index 71fa9f4..a3603f5 100644
--- a/hw/vfio/device.c
+++ b/hw/vfio/device.c
@@ -317,6 +317,11 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
return true;
}
+void vfio_device_free_name(VFIODevice *vbasedev)
+{
+ g_clear_pointer(&vbasedev->name, g_free);
+}
+
void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
{
ERRP_GUARD();
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index b52c488..b4136432 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2966,7 +2966,7 @@ static void vfio_pci_put_device(VFIOPCIDevice *vdev)
vfio_device_detach(&vdev->vbasedev);
- g_free(vdev->vbasedev.name);
+ vfio_device_free_name(&vdev->vbasedev);
g_free(vdev->msix);
}
diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
index 9a21f2e..5c1795a 100644
--- a/hw/vfio/platform.c
+++ b/hw/vfio/platform.c
@@ -530,7 +530,7 @@ static bool vfio_base_device_init(VFIODevice *vbasedev, Error **errp)
{
/* @fd takes precedence over @sysfsdev which takes precedence over @host */
if (vbasedev->fd < 0 && vbasedev->sysfsdev) {
- g_free(vbasedev->name);
+ vfio_device_free_name(vbasedev);
vbasedev->name = g_path_get_basename(vbasedev->sysfsdev);
} else if (vbasedev->fd < 0) {
if (!vbasedev->name || strchr(vbasedev->name, '/')) {
--
1.8.3.1
^ permalink raw reply related [flat|nested] 101+ messages in thread
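The helper in this patch frees the name with g_clear_pointer, which also resets the pointer to NULL. A minimal standalone sketch of that pattern follows, assuming plain libc in place of GLib. DemoDevice and the demo_* functions are hypothetical stand-ins, not QEMU code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for VFIODevice; only the name field matters here. */
typedef struct DemoDevice {
    char *name;
} DemoDevice;

/* Free the name and reset the pointer, like g_clear_pointer(&p, g_free). */
static void demo_device_free_name(DemoDevice *dev)
{
    free(dev->name);    /* free(NULL) is a no-op, so a repeat call is safe */
    dev->name = NULL;
}

/* Returns 1 if the name is NULL after two consecutive frees. */
static int demo_free_name_twice(void)
{
    DemoDevice dev = { .name = NULL };

    dev.name = malloc(8);
    assert(dev.name != NULL);
    strcpy(dev.name, "0000:01");    /* 7 characters plus the terminator */

    demo_device_free_name(&dev);    /* e.g. a realize error path */
    demo_device_free_name(&dev);    /* e.g. a later unrealize path */
    return dev.name == NULL;
}
```

Because the pointer is cleared, a second call from another teardown path does not double free, which is one reason to route every free of the name through a single helper.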
* Re: [PATCH V5 27/38] vfio/iommufd: add vfio_device_free_name
2025-06-10 15:39 ` [PATCH V5 27/38] vfio/iommufd: add vfio_device_free_name Steve Sistare
@ 2025-06-11 12:38 ` Cédric Le Goater
2025-06-23 8:27 ` Duan, Zhenzhong
2025-06-23 13:50 ` Eric Farman
2 siblings, 0 replies; 101+ messages in thread
From: Cédric Le Goater @ 2025-06-11 12:38 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 6/10/25 17:39, Steve Sistare wrote:
> Define vfio_device_free_name to free the name created by
> vfio_device_get_name. A subsequent patch will do more there.
> No functional change.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Thanks,
C.
> ---
> include/hw/vfio/vfio-device.h | 1 +
> hw/vfio/ap.c | 2 +-
> hw/vfio/ccw.c | 2 +-
> hw/vfio/device.c | 5 +++++
> hw/vfio/pci.c | 2 +-
> hw/vfio/platform.c | 2 +-
> 6 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
> index 6eb6f21..321b442 100644
> --- a/include/hw/vfio/vfio-device.h
> +++ b/include/hw/vfio/vfio-device.h
> @@ -227,6 +227,7 @@ int vfio_device_get_irq_info(VFIODevice *vbasedev, int index,
>
> /* Returns 0 on success, or a negative errno. */
> bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp);
> +void vfio_device_free_name(VFIODevice *vbasedev);
> void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
> void vfio_device_init(VFIODevice *vbasedev, int type, VFIODeviceOps *ops,
> DeviceState *dev, bool ram_discard);
> diff --git a/hw/vfio/ap.c b/hw/vfio/ap.c
> index 785c0a0..013bd59 100644
> --- a/hw/vfio/ap.c
> +++ b/hw/vfio/ap.c
> @@ -180,7 +180,7 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp)
>
> error:
> error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name);
> - g_free(vbasedev->name);
> + vfio_device_free_name(vbasedev);
> }
>
> static void vfio_ap_unrealize(DeviceState *dev)
> diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
> index cea9d6e..903b8b0 100644
> --- a/hw/vfio/ccw.c
> +++ b/hw/vfio/ccw.c
> @@ -619,7 +619,7 @@ out_io_notifier_err:
> out_region_err:
> vfio_device_detach(vbasedev);
> out_attach_dev_err:
> - g_free(vbasedev->name);
> + vfio_device_free_name(vbasedev);
> out_unrealize:
> if (cdc->unrealize) {
> cdc->unrealize(cdev);
> diff --git a/hw/vfio/device.c b/hw/vfio/device.c
> index 71fa9f4..a3603f5 100644
> --- a/hw/vfio/device.c
> +++ b/hw/vfio/device.c
> @@ -317,6 +317,11 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
> return true;
> }
>
> +void vfio_device_free_name(VFIODevice *vbasedev)
> +{
> + g_clear_pointer(&vbasedev->name, g_free);
> +}
> +
> void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
> {
> ERRP_GUARD();
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index b52c488..b4136432 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2966,7 +2966,7 @@ static void vfio_pci_put_device(VFIOPCIDevice *vdev)
>
> vfio_device_detach(&vdev->vbasedev);
>
> - g_free(vdev->vbasedev.name);
> + vfio_device_free_name(&vdev->vbasedev);
> g_free(vdev->msix);
> }
>
> diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
> index 9a21f2e..5c1795a 100644
> --- a/hw/vfio/platform.c
> +++ b/hw/vfio/platform.c
> @@ -530,7 +530,7 @@ static bool vfio_base_device_init(VFIODevice *vbasedev, Error **errp)
> {
> /* @fd takes precedence over @sysfsdev which takes precedence over @host */
> if (vbasedev->fd < 0 && vbasedev->sysfsdev) {
> - g_free(vbasedev->name);
> + vfio_device_free_name(vbasedev);
> vbasedev->name = g_path_get_basename(vbasedev->sysfsdev);
> } else if (vbasedev->fd < 0) {
> if (!vbasedev->name || strchr(vbasedev->name, '/')) {
^ permalink raw reply [flat|nested] 101+ messages in thread
* RE: [PATCH V5 27/38] vfio/iommufd: add vfio_device_free_name
2025-06-10 15:39 ` [PATCH V5 27/38] vfio/iommufd: add vfio_device_free_name Steve Sistare
2025-06-11 12:38 ` Cédric Le Goater
@ 2025-06-23 8:27 ` Duan, Zhenzhong
2025-06-23 13:50 ` Eric Farman
2 siblings, 0 replies; 101+ messages in thread
From: Duan, Zhenzhong @ 2025-06-23 8:27 UTC (permalink / raw)
To: Steve Sistare, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V5 27/38] vfio/iommufd: add vfio_device_free_name
>
>Define vfio_device_free_name to free the name created by
>vfio_device_get_name. A subsequent patch will do more there.
>No functional change.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH V5 27/38] vfio/iommufd: add vfio_device_free_name
2025-06-10 15:39 ` [PATCH V5 27/38] vfio/iommufd: add vfio_device_free_name Steve Sistare
2025-06-11 12:38 ` Cédric Le Goater
2025-06-23 8:27 ` Duan, Zhenzhong
@ 2025-06-23 13:50 ` Eric Farman
2025-07-01 14:26 ` Steven Sistare
2 siblings, 1 reply; 101+ messages in thread
From: Eric Farman @ 2025-06-23 13:50 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas
On Tue, 2025-06-10 at 08:39 -0700, Steve Sistare wrote:
> Define vfio_device_free_name to free the name created by
> vfio_device_get_name. A subsequent patch will do more there.
> No functional change.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> include/hw/vfio/vfio-device.h | 1 +
> hw/vfio/ap.c | 2 +-
> hw/vfio/ccw.c | 2 +-
> hw/vfio/device.c | 5 +++++
> hw/vfio/pci.c | 2 +-
> hw/vfio/platform.c | 2 +-
> 6 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
> index 6eb6f21..321b442 100644
> --- a/include/hw/vfio/vfio-device.h
> +++ b/include/hw/vfio/vfio-device.h
> @@ -227,6 +227,7 @@ int vfio_device_get_irq_info(VFIODevice *vbasedev, int index,
>
> /* Returns 0 on success, or a negative errno. */
> bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp);
> +void vfio_device_free_name(VFIODevice *vbasedev);
> void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
> void vfio_device_init(VFIODevice *vbasedev, int type, VFIODeviceOps *ops,
> DeviceState *dev, bool ram_discard);
> diff --git a/hw/vfio/ap.c b/hw/vfio/ap.c
> index 785c0a0..013bd59 100644
> --- a/hw/vfio/ap.c
> +++ b/hw/vfio/ap.c
> @@ -180,7 +180,7 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp)
>
> error:
> error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name);
> - g_free(vbasedev->name);
> + vfio_device_free_name(vbasedev);
> }
>
> static void vfio_ap_unrealize(DeviceState *dev)
^^^
I suspect you want to convert the g_free call of the VFIODevice name here as well.
> diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
> index cea9d6e..903b8b0 100644
> --- a/hw/vfio/ccw.c
> +++ b/hw/vfio/ccw.c
> @@ -619,7 +619,7 @@ out_io_notifier_err:
> out_region_err:
> vfio_device_detach(vbasedev);
> out_attach_dev_err:
> - g_free(vbasedev->name);
> + vfio_device_free_name(vbasedev);
> out_unrealize:
> if (cdc->unrealize) {
> cdc->unrealize(cdev);
Similarly, the matching g_free call in vfio_ccw_unrealize should be converted.
Thanks,
Eric
^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH V5 27/38] vfio/iommufd: add vfio_device_free_name
2025-06-23 13:50 ` Eric Farman
@ 2025-07-01 14:26 ` Steven Sistare
0 siblings, 0 replies; 101+ messages in thread
From: Steven Sistare @ 2025-07-01 14:26 UTC (permalink / raw)
To: Eric Farman, qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas
On 6/23/2025 9:50 AM, Eric Farman wrote:
> On Tue, 2025-06-10 at 08:39 -0700, Steve Sistare wrote:
>> Define vfio_device_free_name to free the name created by
>> vfio_device_get_name. A subsequent patch will do more there.
>> No functional change.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> include/hw/vfio/vfio-device.h | 1 +
>> hw/vfio/ap.c | 2 +-
>> hw/vfio/ccw.c | 2 +-
>> hw/vfio/device.c | 5 +++++
>> hw/vfio/pci.c | 2 +-
>> hw/vfio/platform.c | 2 +-
>> 6 files changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
>> index 6eb6f21..321b442 100644
>> --- a/include/hw/vfio/vfio-device.h
>> +++ b/include/hw/vfio/vfio-device.h
>> @@ -227,6 +227,7 @@ int vfio_device_get_irq_info(VFIODevice *vbasedev, int index,
>>
>> /* Returns 0 on success, or a negative errno. */
>> bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp);
>> +void vfio_device_free_name(VFIODevice *vbasedev);
>> void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
>> void vfio_device_init(VFIODevice *vbasedev, int type, VFIODeviceOps *ops,
>> DeviceState *dev, bool ram_discard);
>> diff --git a/hw/vfio/ap.c b/hw/vfio/ap.c
>> index 785c0a0..013bd59 100644
>> --- a/hw/vfio/ap.c
>> +++ b/hw/vfio/ap.c
>> @@ -180,7 +180,7 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp)
>>
>> error:
>> error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name);
>> - g_free(vbasedev->name);
>> + vfio_device_free_name(vbasedev);
>> }
>>
>> static void vfio_ap_unrealize(DeviceState *dev)
>
> ^^^
> I suspect you want to convert the g_free call of the VFIODevice name here as well.
>
>> diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
>> index cea9d6e..903b8b0 100644
>> --- a/hw/vfio/ccw.c
>> +++ b/hw/vfio/ccw.c
>> @@ -619,7 +619,7 @@ out_io_notifier_err:
>> out_region_err:
>> vfio_device_detach(vbasedev);
>> out_attach_dev_err:
>> - g_free(vbasedev->name);
>> + vfio_device_free_name(vbasedev);
>> out_unrealize:
>> if (cdc->unrealize) {
>> cdc->unrealize(cdev);
>
> Similarly, the matching g_free call in vfio_ccw_unrealize
Yes, thank you. I will do that in the next version.
- Steve
^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH V5 28/38] vfio/iommufd: device name blocker
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (26 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 27/38] vfio/iommufd: add vfio_device_free_name Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-23 10:29 ` Duan, Zhenzhong
2025-06-10 15:39 ` [PATCH V5 29/38] vfio/iommufd: register container for cpr Steve Sistare
` (10 subsequent siblings)
38 siblings, 1 reply; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
If an invariant device name cannot be created, block CPR.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
include/hw/vfio/vfio-cpr.h | 1 +
hw/vfio/device.c | 11 +++++++++++
2 files changed, 12 insertions(+)
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 25e74ee..170a116 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -29,6 +29,7 @@ typedef struct VFIOContainerCPR {
typedef struct VFIODeviceCPR {
Error *mdev_blocker;
+ Error *id_blocker;
} VFIODeviceCPR;
bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
diff --git a/hw/vfio/device.c b/hw/vfio/device.c
index a3603f5..8c3835b 100644
--- a/hw/vfio/device.c
+++ b/hw/vfio/device.c
@@ -28,6 +28,8 @@
#include "qapi/error.h"
#include "qemu/error-report.h"
#include "qemu/units.h"
+#include "migration/cpr.h"
+#include "migration/blocker.h"
#include "monitor/monitor.h"
#include "vfio-helpers.h"
@@ -308,8 +310,16 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
} else {
/*
* Assign a name so any function printing it will not break.
+ * The fd number changes across processes, so this cannot be
+ * used as an invariant name for CPR.
*/
vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
+ error_setg(&vbasedev->cpr.id_blocker,
+ "vfio device with fd=%d needs an id property",
+ vbasedev->fd);
+ return migrate_add_blocker_modes(&vbasedev->cpr.id_blocker,
+ errp, MIG_MODE_CPR_TRANSFER,
+ -1) == 0;
}
}
}
@@ -320,6 +330,7 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
void vfio_device_free_name(VFIODevice *vbasedev)
{
g_clear_pointer(&vbasedev->name, g_free);
+ migrate_del_blocker(&vbasedev->cpr.id_blocker);
}
void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
--
1.8.3.1
^ permalink raw reply related [flat|nested] 101+ messages in thread
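The naming rule and the blocker lifetime in this patch can be sketched in standalone C. This is only an illustration under stated assumptions: the invariant name is taken to be the id string itself, and a counter stands in for the migration blocker list. DemoDevice, demo_get_name, demo_free_name and demo_blocker_count are hypothetical names, not QEMU APIs.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for VFIODevice plus its CPR blocker. */
typedef struct DemoDevice {
    const char *id;     /* user-supplied id property, or NULL */
    int fd;             /* device fd; its number differs in the new process */
    char *name;
    char *id_blocker;   /* non-NULL while CPR is blocked for this device */
} DemoDevice;

static int demo_blocker_count;   /* stands in for the global blocker list */

static char *demo_dup(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);

    assert(copy != NULL);
    memcpy(copy, s, len);
    return copy;
}

static void demo_get_name(DemoDevice *dev)
{
    char buf[64];

    if (dev->id) {
        /* Assumption: the id is stable across processes, so use it as is. */
        dev->name = demo_dup(dev->id);
        return;
    }
    /* The fd number is not invariant: name the device, but block CPR. */
    snprintf(buf, sizeof(buf), "VFIO_FD%d", dev->fd);
    dev->name = demo_dup(buf);
    snprintf(buf, sizeof(buf),
             "vfio device with fd=%d needs an id property", dev->fd);
    dev->id_blocker = demo_dup(buf);
    demo_blocker_count++;
}

static void demo_free_name(DemoDevice *dev)
{
    free(dev->name);
    dev->name = NULL;
    if (dev->id_blocker) {          /* drop the blocker along with the name */
        free(dev->id_blocker);
        dev->id_blocker = NULL;
        demo_blocker_count--;
    }
}
```

Tying the blocker removal to the free helper keeps the two lifetimes together: whichever path releases the name also releases the blocker.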
* RE: [PATCH V5 28/38] vfio/iommufd: device name blocker
2025-06-10 15:39 ` [PATCH V5 28/38] vfio/iommufd: device name blocker Steve Sistare
@ 2025-06-23 10:29 ` Duan, Zhenzhong
0 siblings, 0 replies; 101+ messages in thread
From: Duan, Zhenzhong @ 2025-06-23 10:29 UTC (permalink / raw)
To: Steve Sistare, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V5 28/38] vfio/iommufd: device name blocker
>
>If an invariant device name cannot be created, block CPR.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH V5 29/38] vfio/iommufd: register container for cpr
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (27 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 28/38] vfio/iommufd: device name blocker Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-07-01 14:25 ` Steven Sistare
2025-07-02 14:17 ` Duan, Zhenzhong
2025-06-10 15:39 ` [PATCH V5 30/38] migration: vfio cpr state hook Steve Sistare
` (9 subsequent siblings)
38 siblings, 2 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Register a vfio iommufd container and device for CPR, replacing the generic
CPR register call with a more specific iommufd register call. Add a
blocker if the kernel does not support IOMMU_IOAS_CHANGE_PROCESS.
This is mostly boilerplate. The fields to be saved and restored are added
in subsequent patches.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
include/hw/vfio/vfio-cpr.h | 12 +++++++
include/system/iommufd.h | 1 +
backends/iommufd.c | 10 ++++++
hw/vfio/cpr-iommufd.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++
hw/vfio/iommufd.c | 6 ++--
hw/vfio/meson.build | 1 +
6 files changed, 112 insertions(+), 2 deletions(-)
create mode 100644 hw/vfio/cpr-iommufd.c
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 170a116..b9b77ae 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -15,7 +15,10 @@
struct VFIOContainer;
struct VFIOContainerBase;
struct VFIOGroup;
+struct VFIODevice;
struct VFIOPCIDevice;
+struct VFIOIOMMUFDContainer;
+struct IOMMUFDBackend;
typedef struct VFIOContainerCPR {
Error *blocker;
@@ -43,6 +46,15 @@ bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
Error **errp);
void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
+bool vfio_iommufd_cpr_register_container(struct VFIOIOMMUFDContainer *container,
+ Error **errp);
+void vfio_iommufd_cpr_unregister_container(
+ struct VFIOIOMMUFDContainer *container);
+bool vfio_iommufd_cpr_register_iommufd(struct IOMMUFDBackend *be, Error **errp);
+void vfio_iommufd_cpr_unregister_iommufd(struct IOMMUFDBackend *be);
+void vfio_iommufd_cpr_register_device(struct VFIODevice *vbasedev);
+void vfio_iommufd_cpr_unregister_device(struct VFIODevice *vbasedev);
+
int vfio_cpr_group_get_device_fd(int d, const char *name);
bool vfio_cpr_container_match(struct VFIOContainer *container,
diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index db5f2c7..c9c72ff 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -32,6 +32,7 @@ struct IOMMUFDBackend {
/*< protected >*/
int fd; /* /dev/iommu file descriptor */
bool owned; /* is the /dev/iommu opened internally */
+ Error *cpr_blocker;/* set if be does not support CPR */
uint32_t users;
/*< public >*/
diff --git a/backends/iommufd.c b/backends/iommufd.c
index 87f81a0..c554ce5 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -108,6 +108,13 @@ bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
}
be->fd = fd;
}
+ if (!be->users && !vfio_iommufd_cpr_register_iommufd(be, errp)) {
+ if (be->owned) {
+ close(be->fd);
+ be->fd = -1;
+ }
+ return false;
+ }
be->users++;
trace_iommufd_backend_connect(be->fd, be->owned, be->users);
@@ -125,6 +132,9 @@ void iommufd_backend_disconnect(IOMMUFDBackend *be)
be->fd = -1;
}
out:
+ if (!be->users) {
+ vfio_iommufd_cpr_unregister_iommufd(be);
+ }
trace_iommufd_backend_disconnect(be->fd, be->users);
}
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
new file mode 100644
index 0000000..60bd7e8
--- /dev/null
+++ b/hw/vfio/cpr-iommufd.c
@@ -0,0 +1,84 @@
+/*
+ * Copyright (c) 2024-2025 Oracle and/or its affiliates.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "hw/vfio/vfio-cpr.h"
+#include "migration/blocker.h"
+#include "migration/cpr.h"
+#include "migration/migration.h"
+#include "migration/vmstate.h"
+#include "system/iommufd.h"
+#include "vfio-iommufd.h"
+
+static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
+{
+ if (!iommufd_change_process_capable(be)) {
+ if (errp) {
+ error_setg(errp, "vfio iommufd backend does not support "
+ "IOMMU_IOAS_CHANGE_PROCESS");
+ }
+ return false;
+ }
+ return true;
+}
+
+static const VMStateDescription iommufd_cpr_vmstate = {
+ .name = "iommufd",
+ .version_id = 0,
+ .minimum_version_id = 0,
+ .needed = cpr_incoming_needed,
+ .fields = (VMStateField[]) {
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+bool vfio_iommufd_cpr_register_iommufd(IOMMUFDBackend *be, Error **errp)
+{
+ Error **cpr_blocker = &be->cpr_blocker;
+
+ if (!vfio_cpr_supported(be, cpr_blocker)) {
+ return migrate_add_blocker_modes(cpr_blocker, errp,
+ MIG_MODE_CPR_TRANSFER, -1) == 0;
+ }
+
+ vmstate_register(NULL, -1, &iommufd_cpr_vmstate, be);
+
+ return true;
+}
+
+void vfio_iommufd_cpr_unregister_iommufd(IOMMUFDBackend *be)
+{
+ vmstate_unregister(NULL, &iommufd_cpr_vmstate, be);
+ migrate_del_blocker(&be->cpr_blocker);
+}
+
+bool vfio_iommufd_cpr_register_container(VFIOIOMMUFDContainer *container,
+ Error **errp)
+{
+ VFIOContainerBase *bcontainer = &container->bcontainer;
+
+ migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
+ vfio_cpr_reboot_notifier,
+ MIG_MODE_CPR_REBOOT);
+
+ return true;
+}
+
+void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
+{
+ VFIOContainerBase *bcontainer = &container->bcontainer;
+
+ migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
+}
+
+void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
+{
+}
+
+void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
+{
+}
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 962a1e2..ff291be 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -446,7 +446,7 @@ static void iommufd_cdev_container_destroy(VFIOIOMMUFDContainer *container)
if (!QLIST_EMPTY(&bcontainer->device_list)) {
return;
}
- vfio_cpr_unregister_container(bcontainer);
+ vfio_iommufd_cpr_unregister_container(container);
vfio_listener_unregister(bcontainer);
iommufd_backend_free_id(container->be, container->ioas_id);
object_unref(container);
@@ -592,7 +592,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
goto err_listener_register;
}
- if (!vfio_cpr_register_container(bcontainer, errp)) {
+ if (!vfio_iommufd_cpr_register_container(container, errp)) {
goto err_listener_register;
}
@@ -623,6 +623,7 @@ found_container:
}
vfio_device_prepare(vbasedev, bcontainer, &dev_info);
+ vfio_iommufd_cpr_register_device(vbasedev);
trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
vbasedev->num_regions, vbasedev->flags);
@@ -660,6 +661,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
iommufd_cdev_container_destroy(container);
vfio_address_space_put(space);
+ vfio_iommufd_cpr_unregister_device(vbasedev);
iommufd_cdev_unbind_and_disconnect(vbasedev);
close(vbasedev->fd);
}
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index 98134a7..56373e3 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -31,6 +31,7 @@ system_ss.add(when: 'CONFIG_VFIO', if_true: files(
))
system_ss.add(when: ['CONFIG_VFIO', 'CONFIG_IOMMUFD'], if_true: files(
'iommufd.c',
+ 'cpr-iommufd.c',
))
system_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
'display.c',
--
1.8.3.1
^ permalink raw reply related [flat|nested] 101+ messages in thread
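The connect and disconnect hunks in this patch register the backend for CPR when the first user connects and unregister it when the last user disconnects. A minimal standalone sketch of that ordering follows. DemoBackend and the demo_* functions are hypothetical, and the register step here always succeeds, which the real call need not do.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for IOMMUFDBackend. */
typedef struct DemoBackend {
    unsigned users;       /* number of connected users */
    bool registered;      /* true while CPR state is registered */
    int register_calls;   /* how often registration actually ran */
} DemoBackend;

static bool demo_register(DemoBackend *be)
{
    be->registered = true;
    be->register_calls++;
    return true;          /* assumption: registration cannot fail here */
}

static void demo_unregister(DemoBackend *be)
{
    be->registered = false;
}

static bool demo_connect(DemoBackend *be)
{
    /* Register only for the first user; fail before counting the user. */
    if (!be->users && !demo_register(be)) {
        return false;
    }
    be->users++;
    return true;
}

static void demo_disconnect(DemoBackend *be)
{
    if (be->users) {
        be->users--;
    }
    /* Unregister once no users remain. */
    if (!be->users) {
        demo_unregister(be);
    }
}
```

Checking the user count before incrementing it means a failed registration leaves the count untouched, so the caller has nothing to undo.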
* Re: [PATCH V5 29/38] vfio/iommufd: register container for cpr
2025-06-10 15:39 ` [PATCH V5 29/38] vfio/iommufd: register container for cpr Steve Sistare
@ 2025-07-01 14:25 ` Steven Sistare
2025-07-02 14:17 ` Duan, Zhenzhong
1 sibling, 0 replies; 101+ messages in thread
From: Steven Sistare @ 2025-07-01 14:25 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas
Hi Zhenzhong, this is the only iommufd patch you have not reviewed yet - steve
On 6/10/2025 11:39 AM, Steve Sistare wrote:
> Register a vfio iommufd container and device for CPR, replacing the generic
> CPR register call with a more specific iommufd register call. Add a
> blocker if the kernel does not support IOMMU_IOAS_CHANGE_PROCESS.
>
> This is mostly boilerplate. The fields to be saved and restored are added
> in subsequent patches.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> include/hw/vfio/vfio-cpr.h | 12 +++++++
> include/system/iommufd.h | 1 +
> backends/iommufd.c | 10 ++++++
> hw/vfio/cpr-iommufd.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++
> hw/vfio/iommufd.c | 6 ++--
> hw/vfio/meson.build | 1 +
> 6 files changed, 112 insertions(+), 2 deletions(-)
> create mode 100644 hw/vfio/cpr-iommufd.c
>
> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
> index 170a116..b9b77ae 100644
> --- a/include/hw/vfio/vfio-cpr.h
> +++ b/include/hw/vfio/vfio-cpr.h
> @@ -15,7 +15,10 @@
> struct VFIOContainer;
> struct VFIOContainerBase;
> struct VFIOGroup;
> +struct VFIODevice;
> struct VFIOPCIDevice;
> +struct VFIOIOMMUFDContainer;
> +struct IOMMUFDBackend;
>
> typedef struct VFIOContainerCPR {
> Error *blocker;
> @@ -43,6 +46,15 @@ bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
> Error **errp);
> void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>
> +bool vfio_iommufd_cpr_register_container(struct VFIOIOMMUFDContainer *container,
> + Error **errp);
> +void vfio_iommufd_cpr_unregister_container(
> + struct VFIOIOMMUFDContainer *container);
> +bool vfio_iommufd_cpr_register_iommufd(struct IOMMUFDBackend *be, Error **errp);
> +void vfio_iommufd_cpr_unregister_iommufd(struct IOMMUFDBackend *be);
> +void vfio_iommufd_cpr_register_device(struct VFIODevice *vbasedev);
> +void vfio_iommufd_cpr_unregister_device(struct VFIODevice *vbasedev);
> +
> int vfio_cpr_group_get_device_fd(int d, const char *name);
>
> bool vfio_cpr_container_match(struct VFIOContainer *container,
> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
> index db5f2c7..c9c72ff 100644
> --- a/include/system/iommufd.h
> +++ b/include/system/iommufd.h
> @@ -32,6 +32,7 @@ struct IOMMUFDBackend {
> /*< protected >*/
> int fd; /* /dev/iommu file descriptor */
> bool owned; /* is the /dev/iommu opened internally */
> + Error *cpr_blocker;/* set if be does not support CPR */
> uint32_t users;
>
> /*< public >*/
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> index 87f81a0..c554ce5 100644
> --- a/backends/iommufd.c
> +++ b/backends/iommufd.c
> @@ -108,6 +108,13 @@ bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
> }
> be->fd = fd;
> }
> + if (!be->users && !vfio_iommufd_cpr_register_iommufd(be, errp)) {
> + if (be->owned) {
> + close(be->fd);
> + be->fd = -1;
> + }
> + return false;
> + }
> be->users++;
>
> trace_iommufd_backend_connect(be->fd, be->owned, be->users);
> @@ -125,6 +132,9 @@ void iommufd_backend_disconnect(IOMMUFDBackend *be)
> be->fd = -1;
> }
> out:
> + if (!be->users) {
> + vfio_iommufd_cpr_unregister_iommufd(be);
> + }
> trace_iommufd_backend_disconnect(be->fd, be->users);
> }
>
> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
> new file mode 100644
> index 0000000..60bd7e8
> --- /dev/null
> +++ b/hw/vfio/cpr-iommufd.c
> @@ -0,0 +1,84 @@
> +/*
> + * Copyright (c) 2024-2025 Oracle and/or its affiliates.
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "hw/vfio/vfio-cpr.h"
> +#include "migration/blocker.h"
> +#include "migration/cpr.h"
> +#include "migration/migration.h"
> +#include "migration/vmstate.h"
> +#include "system/iommufd.h"
> +#include "vfio-iommufd.h"
> +
> +static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
> +{
> + if (!iommufd_change_process_capable(be)) {
> + if (errp) {
> + error_setg(errp, "vfio iommufd backend does not support "
> + "IOMMU_IOAS_CHANGE_PROCESS");
> + }
> + return false;
> + }
> + return true;
> +}
> +
> +static const VMStateDescription iommufd_cpr_vmstate = {
> + .name = "iommufd",
> + .version_id = 0,
> + .minimum_version_id = 0,
> + .needed = cpr_incoming_needed,
> + .fields = (VMStateField[]) {
> + VMSTATE_END_OF_LIST()
> + }
> +};
> +
> +bool vfio_iommufd_cpr_register_iommufd(IOMMUFDBackend *be, Error **errp)
> +{
> + Error **cpr_blocker = &be->cpr_blocker;
> +
> + if (!vfio_cpr_supported(be, cpr_blocker)) {
> + return migrate_add_blocker_modes(cpr_blocker, errp,
> + MIG_MODE_CPR_TRANSFER, -1) == 0;
> + }
> +
> + vmstate_register(NULL, -1, &iommufd_cpr_vmstate, be);
> +
> + return true;
> +}
> +
> +void vfio_iommufd_cpr_unregister_iommufd(IOMMUFDBackend *be)
> +{
> + vmstate_unregister(NULL, &iommufd_cpr_vmstate, be);
> + migrate_del_blocker(&be->cpr_blocker);
> +}
> +
> +bool vfio_iommufd_cpr_register_container(VFIOIOMMUFDContainer *container,
> + Error **errp)
> +{
> + VFIOContainerBase *bcontainer = &container->bcontainer;
> +
> + migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
> + vfio_cpr_reboot_notifier,
> + MIG_MODE_CPR_REBOOT);
> +
> + return true;
> +}
> +
> +void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
> +{
> + VFIOContainerBase *bcontainer = &container->bcontainer;
> +
> + migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
> +}
> +
> +void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
> +{
> +}
> +
> +void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
> +{
> +}
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 962a1e2..ff291be 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -446,7 +446,7 @@ static void iommufd_cdev_container_destroy(VFIOIOMMUFDContainer *container)
> if (!QLIST_EMPTY(&bcontainer->device_list)) {
> return;
> }
> - vfio_cpr_unregister_container(bcontainer);
> + vfio_iommufd_cpr_unregister_container(container);
> vfio_listener_unregister(bcontainer);
> iommufd_backend_free_id(container->be, container->ioas_id);
> object_unref(container);
> @@ -592,7 +592,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
> goto err_listener_register;
> }
>
> - if (!vfio_cpr_register_container(bcontainer, errp)) {
> + if (!vfio_iommufd_cpr_register_container(container, errp)) {
> goto err_listener_register;
> }
>
> @@ -623,6 +623,7 @@ found_container:
> }
>
> vfio_device_prepare(vbasedev, bcontainer, &dev_info);
> + vfio_iommufd_cpr_register_device(vbasedev);
>
> trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
> vbasedev->num_regions, vbasedev->flags);
> @@ -660,6 +661,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
> iommufd_cdev_container_destroy(container);
> vfio_address_space_put(space);
>
> + vfio_iommufd_cpr_unregister_device(vbasedev);
> iommufd_cdev_unbind_and_disconnect(vbasedev);
> close(vbasedev->fd);
> }
> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index 98134a7..56373e3 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -31,6 +31,7 @@ system_ss.add(when: 'CONFIG_VFIO', if_true: files(
> ))
> system_ss.add(when: ['CONFIG_VFIO', 'CONFIG_IOMMUFD'], if_true: files(
> 'iommufd.c',
> + 'cpr-iommufd.c',
> ))
> system_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
> 'display.c',
^ permalink raw reply [flat|nested] 101+ messages in thread
* RE: [PATCH V5 29/38] vfio/iommufd: register container for cpr
2025-06-10 15:39 ` [PATCH V5 29/38] vfio/iommufd: register container for cpr Steve Sistare
2025-07-01 14:25 ` Steven Sistare
@ 2025-07-02 14:17 ` Duan, Zhenzhong
2025-07-02 14:52 ` Steven Sistare
1 sibling, 1 reply; 101+ messages in thread
From: Duan, Zhenzhong @ 2025-07-02 14:17 UTC (permalink / raw)
To: Steve Sistare, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V5 29/38] vfio/iommufd: register container for cpr
>
>Register a vfio iommufd container and device for CPR, replacing the generic
>CPR register call with a more specific iommufd register call. Add a
>blocker if the kernel does not support IOMMU_IOAS_CHANGE_PROCESS.
>
>This is mostly boilerplate. The fields to be saved and restored are added
>in subsequent patches.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>---
> include/hw/vfio/vfio-cpr.h | 12 +++++++
> include/system/iommufd.h | 1 +
> backends/iommufd.c | 10 ++++++
> hw/vfio/cpr-iommufd.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++
> hw/vfio/iommufd.c | 6 ++--
> hw/vfio/meson.build | 1 +
> 6 files changed, 112 insertions(+), 2 deletions(-)
> create mode 100644 hw/vfio/cpr-iommufd.c
>
>diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>index 170a116..b9b77ae 100644
>--- a/include/hw/vfio/vfio-cpr.h
>+++ b/include/hw/vfio/vfio-cpr.h
>@@ -15,7 +15,10 @@
> struct VFIOContainer;
> struct VFIOContainerBase;
> struct VFIOGroup;
>+struct VFIODevice;
> struct VFIOPCIDevice;
>+struct VFIOIOMMUFDContainer;
>+struct IOMMUFDBackend;
>
> typedef struct VFIOContainerCPR {
> Error *blocker;
>@@ -43,6 +46,15 @@ bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
> Error **errp);
> void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>
>+bool vfio_iommufd_cpr_register_container(struct VFIOIOMMUFDContainer *container,
>+ Error **errp);
>+void vfio_iommufd_cpr_unregister_container(
>+ struct VFIOIOMMUFDContainer *container);
>+bool vfio_iommufd_cpr_register_iommufd(struct IOMMUFDBackend *be, Error **errp);
>+void vfio_iommufd_cpr_unregister_iommufd(struct IOMMUFDBackend *be);
>+void vfio_iommufd_cpr_register_device(struct VFIODevice *vbasedev);
>+void vfio_iommufd_cpr_unregister_device(struct VFIODevice *vbasedev);
>+
> int vfio_cpr_group_get_device_fd(int d, const char *name);
>
> bool vfio_cpr_container_match(struct VFIOContainer *container,
>diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>index db5f2c7..c9c72ff 100644
>--- a/include/system/iommufd.h
>+++ b/include/system/iommufd.h
>@@ -32,6 +32,7 @@ struct IOMMUFDBackend {
> /*< protected >*/
> int fd; /* /dev/iommu file descriptor */
> bool owned; /* is the /dev/iommu opened internally */
>+ Error *cpr_blocker;/* set if be does not support CPR */
> uint32_t users;
>
> /*< public >*/
>diff --git a/backends/iommufd.c b/backends/iommufd.c
>index 87f81a0..c554ce5 100644
>--- a/backends/iommufd.c
>+++ b/backends/iommufd.c
>@@ -108,6 +108,13 @@ bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
> }
> be->fd = fd;
> }
>+ if (!be->users && !vfio_iommufd_cpr_register_iommufd(be, errp)) {
>+ if (be->owned) {
>+ close(be->fd);
>+ be->fd = -1;
>+ }
>+ return false;
>+ }
> be->users++;
>
> trace_iommufd_backend_connect(be->fd, be->owned, be->users);
>@@ -125,6 +132,9 @@ void iommufd_backend_disconnect(IOMMUFDBackend *be)
> be->fd = -1;
> }
> out:
>+ if (!be->users) {
>+ vfio_iommufd_cpr_unregister_iommufd(be);
>+ }
> trace_iommufd_backend_disconnect(be->fd, be->users);
> }
>
>diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>new file mode 100644
>index 0000000..60bd7e8
>--- /dev/null
>+++ b/hw/vfio/cpr-iommufd.c
>@@ -0,0 +1,84 @@
>+/*
>+ * Copyright (c) 2024-2025 Oracle and/or its affiliates.
>+ *
>+ * SPDX-License-Identifier: GPL-2.0-or-later
>+ */
>+
>+#include "qemu/osdep.h"
>+#include "qapi/error.h"
>+#include "hw/vfio/vfio-cpr.h"
>+#include "migration/blocker.h"
>+#include "migration/cpr.h"
>+#include "migration/migration.h"
>+#include "migration/vmstate.h"
>+#include "system/iommufd.h"
>+#include "vfio-iommufd.h"
>+
>+static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
>+{
>+ if (!iommufd_change_process_capable(be)) {
>+ if (errp) {
>+ error_setg(errp, "vfio iommufd backend does not support "
>+ "IOMMU_IOAS_CHANGE_PROCESS");
>+ }
>+ return false;
>+ }
>+ return true;
>+}
>+
>+static const VMStateDescription iommufd_cpr_vmstate = {
>+ .name = "iommufd",
>+ .version_id = 0,
>+ .minimum_version_id = 0,
>+ .needed = cpr_incoming_needed,
>+ .fields = (VMStateField[]) {
>+ VMSTATE_END_OF_LIST()
>+ }
>+};
>+
>+bool vfio_iommufd_cpr_register_iommufd(IOMMUFDBackend *be, Error **errp)
>+{
>+ Error **cpr_blocker = &be->cpr_blocker;
>+
>+ if (!vfio_cpr_supported(be, cpr_blocker)) {
>+ return migrate_add_blocker_modes(cpr_blocker, errp,
>+ MIG_MODE_CPR_TRANSFER, -1) == 0;
>+ }
I suspect that the blocker is never installed, because vfio_iommufd_cpr_register_iommufd() is called for the first VFIO device, before the memory listener is installed, so iommufd_change_process_capable() will always return true.
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH V5 29/38] vfio/iommufd: register container for cpr
2025-07-02 14:17 ` Duan, Zhenzhong
@ 2025-07-02 14:52 ` Steven Sistare
0 siblings, 0 replies; 101+ messages in thread
From: Steven Sistare @ 2025-07-02 14:52 UTC (permalink / raw)
To: Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 7/2/2025 10:17 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steve Sistare <steven.sistare@oracle.com>
>> Subject: [PATCH V5 29/38] vfio/iommufd: register container for cpr
>>
>> Register a vfio iommufd container and device for CPR, replacing the generic
>> CPR register call with a more specific iommufd register call. Add a
>> blocker if the kernel does not support IOMMU_IOAS_CHANGE_PROCESS.
>>
>> This is mostly boilerplate. The fields to be saved and restored are added
>> in subsequent patches.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> include/hw/vfio/vfio-cpr.h | 12 +++++++
>> include/system/iommufd.h | 1 +
>> backends/iommufd.c | 10 ++++++
>> hw/vfio/cpr-iommufd.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++
>> hw/vfio/iommufd.c | 6 ++--
>> hw/vfio/meson.build | 1 +
>> 6 files changed, 112 insertions(+), 2 deletions(-)
>> create mode 100644 hw/vfio/cpr-iommufd.c
>>
>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>> index 170a116..b9b77ae 100644
>> --- a/include/hw/vfio/vfio-cpr.h
>> +++ b/include/hw/vfio/vfio-cpr.h
>> @@ -15,7 +15,10 @@
>> struct VFIOContainer;
>> struct VFIOContainerBase;
>> struct VFIOGroup;
>> +struct VFIODevice;
>> struct VFIOPCIDevice;
>> +struct VFIOIOMMUFDContainer;
>> +struct IOMMUFDBackend;
>>
>> typedef struct VFIOContainerCPR {
>> Error *blocker;
>> @@ -43,6 +46,15 @@ bool vfio_cpr_register_container(struct
>> VFIOContainerBase *bcontainer,
>> Error **errp);
>> void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>>
>> +bool vfio_iommufd_cpr_register_container(struct VFIOIOMMUFDContainer
>> *container,
>> + Error **errp);
>> +void vfio_iommufd_cpr_unregister_container(
>> + struct VFIOIOMMUFDContainer *container);
>> +bool vfio_iommufd_cpr_register_iommufd(struct IOMMUFDBackend *be,
>> Error **errp);
>> +void vfio_iommufd_cpr_unregister_iommufd(struct IOMMUFDBackend *be);
>> +void vfio_iommufd_cpr_register_device(struct VFIODevice *vbasedev);
>> +void vfio_iommufd_cpr_unregister_device(struct VFIODevice *vbasedev);
>> +
>> int vfio_cpr_group_get_device_fd(int d, const char *name);
>>
>> bool vfio_cpr_container_match(struct VFIOContainer *container,
>> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>> index db5f2c7..c9c72ff 100644
>> --- a/include/system/iommufd.h
>> +++ b/include/system/iommufd.h
>> @@ -32,6 +32,7 @@ struct IOMMUFDBackend {
>> /*< protected >*/
>> int fd; /* /dev/iommu file descriptor */
>> bool owned; /* is the /dev/iommu opened internally */
>> + Error *cpr_blocker;/* set if be does not support CPR */
>> uint32_t users;
>>
>> /*< public >*/
>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>> index 87f81a0..c554ce5 100644
>> --- a/backends/iommufd.c
>> +++ b/backends/iommufd.c
>> @@ -108,6 +108,13 @@ bool
>> iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
>> }
>> be->fd = fd;
>> }
>> + if (!be->users && !vfio_iommufd_cpr_register_iommufd(be, errp)) {
>> + if (be->owned) {
>> + close(be->fd);
>> + be->fd = -1;
>> + }
>> + return false;
>> + }
>> be->users++;
>>
>> trace_iommufd_backend_connect(be->fd, be->owned, be->users);
>> @@ -125,6 +132,9 @@ void
>> iommufd_backend_disconnect(IOMMUFDBackend *be)
>> be->fd = -1;
>> }
>> out:
>> + if (!be->users) {
>> + vfio_iommufd_cpr_unregister_iommufd(be);
>> + }
>> trace_iommufd_backend_disconnect(be->fd, be->users);
>> }
>>
>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>> new file mode 100644
>> index 0000000..60bd7e8
>> --- /dev/null
>> +++ b/hw/vfio/cpr-iommufd.c
>> @@ -0,0 +1,84 @@
>> +/*
>> + * Copyright (c) 2024-2025 Oracle and/or its affiliates.
>> + *
>> + * SPDX-License-Identifier: GPL-2.0-or-later
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qapi/error.h"
>> +#include "hw/vfio/vfio-cpr.h"
>> +#include "migration/blocker.h"
>> +#include "migration/cpr.h"
>> +#include "migration/migration.h"
>> +#include "migration/vmstate.h"
>> +#include "system/iommufd.h"
>> +#include "vfio-iommufd.h"
>> +
>> +static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
>> +{
>> + if (!iommufd_change_process_capable(be)) {
>> + if (errp) {
>> + error_setg(errp, "vfio iommufd backend does not support "
>> + "IOMMU_IOAS_CHANGE_PROCESS");
>> + }
>> + return false;
>> + }
>> + return true;
>> +}
>> +
>> +static const VMStateDescription iommufd_cpr_vmstate = {
>> + .name = "iommufd",
>> + .version_id = 0,
>> + .minimum_version_id = 0,
>> + .needed = cpr_incoming_needed,
>> + .fields = (VMStateField[]) {
>> + VMSTATE_END_OF_LIST()
>> + }
>> +};
>> +
>> +bool vfio_iommufd_cpr_register_iommufd(IOMMUFDBackend *be, Error
>> **errp)
>> +{
>> + Error **cpr_blocker = &be->cpr_blocker;
>> +
>> + if (!vfio_cpr_supported(be, cpr_blocker)) {
>> + return migrate_add_blocker_modes(cpr_blocker, errp,
>> + MIG_MODE_CPR_TRANSFER, -1) == 0;
>> + }
>
> I suspect that the blocker is never installed, because vfio_iommufd_cpr_register_iommufd() is called for the first VFIO device, before the memory listener is installed, so iommufd_change_process_capable() will always return true.
The blocker is installed if the kernel does not support the change process ioctl,
which was added last year. iommufd_change_process_capable() will return false.
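As an illustration only, the probe-then-block pattern described here can be
sketched in standalone C. This is not QEMU code: struct backend,
backend_register() and migrate_allowed() are hypothetical stand-ins, and the
kernel probe is reduced to a boolean field. The sketch shows only that a failed
probe blocks one mode while registration still succeeds; it does not model
when the probe runs relative to the memory listener.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names are QEMU APIs. */
enum mode { MODE_NORMAL, MODE_CPR_TRANSFER, MODE_COUNT };

struct backend {
    bool kernel_has_change_process;  /* assumed result of probing the kernel */
    const char *cpr_blocker;         /* non-NULL once CPR has been blocked */
};

/* One blocker slot per mode; NULL means the mode is allowed. */
static const char *blockers[MODE_COUNT];

/* Probe at registration and install a blocker only when the probe fails. */
static bool backend_register(struct backend *be)
{
    assert(be != NULL);
    if (!be->kernel_has_change_process) {
        be->cpr_blocker = "backend does not support IOMMU_IOAS_CHANGE_PROCESS";
        blockers[MODE_CPR_TRANSFER] = be->cpr_blocker;
    }
    return true;  /* registration succeeds; only the one mode is blocked */
}

static bool migrate_allowed(enum mode m)
{
    return blockers[m] == NULL;
}
```

Registering a backend whose probe succeeds leaves every mode allowed, while a
backend on a kernel without the ioctl blocks only MODE_CPR_TRANSFER.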
- Steve
^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH V5 30/38] migration: vfio cpr state hook
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (28 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 29/38] vfio/iommufd: register container for cpr Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-24 11:24 ` Duan, Zhenzhong
2025-06-10 15:39 ` [PATCH V5 31/38] vfio/iommufd: cpr state Steve Sistare
` (8 subsequent siblings)
38 siblings, 1 reply; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Define a list of vfio devices in CPR state, in a subsection so that
older QEMU can be live updated to this version. However, new QEMU
will not be live updateable to old QEMU. This is acceptable because
CPR is not yet commonly used, and updates to older versions are unusual.
The contents of each device object will be defined by the vfio subsystem
in a subsequent patch.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
include/hw/vfio/vfio-cpr.h | 1 +
include/migration/cpr.h | 12 ++++++++++++
hw/vfio/cpr-iommufd.c | 2 ++
hw/vfio/iommufd-stubs.c | 18 ++++++++++++++++++
migration/cpr.c | 14 +++++---------
hw/vfio/meson.build | 1 +
6 files changed, 39 insertions(+), 9 deletions(-)
create mode 100644 hw/vfio/iommufd-stubs.c
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index b9b77ae..619af07 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -74,5 +74,6 @@ void vfio_cpr_delete_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
int nr);
extern const VMStateDescription vfio_cpr_pci_vmstate;
+extern const VMStateDescription vmstate_cpr_vfio_devices;
#endif /* HW_VFIO_VFIO_CPR_H */
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 7fd8065..8fd8bfe 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -9,11 +9,23 @@
#define MIGRATION_CPR_H
#include "qapi/qapi-types-migration.h"
+#include "qemu/queue.h"
#define MIG_MODE_NONE -1
#define QEMU_CPR_FILE_MAGIC 0x51435052
#define QEMU_CPR_FILE_VERSION 0x00000001
+#define CPR_STATE "CprState"
+
+typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
+typedef QLIST_HEAD(CprVFIODeviceList, CprVFIODevice) CprVFIODeviceList;
+
+typedef struct CprState {
+ CprFdList fds;
+ CprVFIODeviceList vfio_devices;
+} CprState;
+
+extern CprState cpr_state;
void cpr_save_fd(const char *name, int id, int fd);
void cpr_delete_fd(const char *name, int id);
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 60bd7e8..3e78265 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -14,6 +14,8 @@
#include "system/iommufd.h"
#include "vfio-iommufd.h"
+const VMStateDescription vmstate_cpr_vfio_devices; /* TBD in a later patch */
+
static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
{
if (!iommufd_change_process_capable(be)) {
diff --git a/hw/vfio/iommufd-stubs.c b/hw/vfio/iommufd-stubs.c
new file mode 100644
index 0000000..0be5276
--- /dev/null
+++ b/hw/vfio/iommufd-stubs.c
@@ -0,0 +1,18 @@
+/*
+ * Copyright (c) 2025 Oracle and/or its affiliates.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+#include "migration/cpr.h"
+#include "migration/vmstate.h"
+
+const VMStateDescription vmstate_cpr_vfio_devices = {
+ .name = CPR_STATE "/vfio devices",
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .fields = (const VMStateField[]){
+ VMSTATE_END_OF_LIST()
+ }
+};
diff --git a/migration/cpr.c b/migration/cpr.c
index 4574608..47898ab 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -22,13 +22,7 @@
/*************************************************************************/
/* cpr state container for all information to be saved. */
-typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
-
-typedef struct CprState {
- CprFdList fds;
-} CprState;
-
-static CprState cpr_state;
+CprState cpr_state;
/****************************************************************************/
@@ -129,8 +123,6 @@ int cpr_open_fd(const char *path, int flags, const char *name, int id,
}
/*************************************************************************/
-#define CPR_STATE "CprState"
-
static const VMStateDescription vmstate_cpr_state = {
.name = CPR_STATE,
.version_id = 1,
@@ -138,6 +130,10 @@ static const VMStateDescription vmstate_cpr_state = {
.fields = (VMStateField[]) {
VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, next),
VMSTATE_END_OF_LIST()
+ },
+ .subsections = (const VMStateDescription * const []) {
+ &vmstate_cpr_vfio_devices,
+ NULL
}
};
/*************************************************************************/
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index 56373e3..b9420cf 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -33,6 +33,7 @@ system_ss.add(when: ['CONFIG_VFIO', 'CONFIG_IOMMUFD'], if_true: files(
'iommufd.c',
'cpr-iommufd.c',
))
+system_ss.add(when: 'CONFIG_IOMMUFD', if_false: files('iommufd-stubs.c'))
system_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
'display.c',
))
--
1.8.3.1
^ permalink raw reply related [flat|nested] 101+ messages in thread
* RE: [PATCH V5 30/38] migration: vfio cpr state hook
2025-06-10 15:39 ` [PATCH V5 30/38] migration: vfio cpr state hook Steve Sistare
@ 2025-06-24 11:24 ` Duan, Zhenzhong
2025-07-01 14:26 ` Steven Sistare
0 siblings, 1 reply; 101+ messages in thread
From: Duan, Zhenzhong @ 2025-06-24 11:24 UTC (permalink / raw)
To: Steve Sistare, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V5 30/38] migration: vfio cpr state hook
>
>Define a list of vfio devices in CPR state, in a subsection so that
>older QEMU can be live updated to this version. However, new QEMU
>will not be live updateable to old QEMU. This is acceptable because
>CPR is not yet commonly used, and updates to older versions are unusual.
I'm not familiar with migration. May I ask how a subsection helps block migration
from new to old QEMU?
>
>The contents of each device object will be defined by the vfio subsystem
>in a subsequent patch.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>---
> include/hw/vfio/vfio-cpr.h | 1 +
> include/migration/cpr.h | 12 ++++++++++++
> hw/vfio/cpr-iommufd.c | 2 ++
> hw/vfio/iommufd-stubs.c | 18 ++++++++++++++++++
> migration/cpr.c | 14 +++++---------
> hw/vfio/meson.build | 1 +
> 6 files changed, 39 insertions(+), 9 deletions(-)
> create mode 100644 hw/vfio/iommufd-stubs.c
>
>diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>index b9b77ae..619af07 100644
>--- a/include/hw/vfio/vfio-cpr.h
>+++ b/include/hw/vfio/vfio-cpr.h
>@@ -74,5 +74,6 @@ void vfio_cpr_delete_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
> int nr);
>
> extern const VMStateDescription vfio_cpr_pci_vmstate;
>+extern const VMStateDescription vmstate_cpr_vfio_devices;
>
> #endif /* HW_VFIO_VFIO_CPR_H */
>diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>index 7fd8065..8fd8bfe 100644
>--- a/include/migration/cpr.h
>+++ b/include/migration/cpr.h
>@@ -9,11 +9,23 @@
> #define MIGRATION_CPR_H
>
> #include "qapi/qapi-types-migration.h"
>+#include "qemu/queue.h"
>
> #define MIG_MODE_NONE -1
>
> #define QEMU_CPR_FILE_MAGIC 0x51435052
> #define QEMU_CPR_FILE_VERSION 0x00000001
>+#define CPR_STATE "CprState"
>+
>+typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
>+typedef QLIST_HEAD(CprVFIODeviceList, CprVFIODevice) CprVFIODeviceList;
>+
>+typedef struct CprState {
>+ CprFdList fds;
>+ CprVFIODeviceList vfio_devices;
>+} CprState;
>+
>+extern CprState cpr_state;
>
> void cpr_save_fd(const char *name, int id, int fd);
> void cpr_delete_fd(const char *name, int id);
>diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>index 60bd7e8..3e78265 100644
>--- a/hw/vfio/cpr-iommufd.c
>+++ b/hw/vfio/cpr-iommufd.c
>@@ -14,6 +14,8 @@
> #include "system/iommufd.h"
> #include "vfio-iommufd.h"
>
>+const VMStateDescription vmstate_cpr_vfio_devices; /* TBD in a later patch */
>+
> static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
> {
> if (!iommufd_change_process_capable(be)) {
>diff --git a/hw/vfio/iommufd-stubs.c b/hw/vfio/iommufd-stubs.c
>new file mode 100644
>index 0000000..0be5276
>--- /dev/null
>+++ b/hw/vfio/iommufd-stubs.c
>@@ -0,0 +1,18 @@
>+/*
>+ * Copyright (c) 2025 Oracle and/or its affiliates.
>+ *
>+ * SPDX-License-Identifier: GPL-2.0-or-later
>+ */
>+
>+#include "qemu/osdep.h"
>+#include "migration/cpr.h"
>+#include "migration/vmstate.h"
>+
>+const VMStateDescription vmstate_cpr_vfio_devices = {
>+ .name = CPR_STATE "/vfio devices",
>+ .version_id = 1,
>+ .minimum_version_id = 1,
Is there any difference if version_id=minimum_version_id=0?
Thanks
Zhenzhong
>+ .fields = (const VMStateField[]){
>+ VMSTATE_END_OF_LIST()
>+ }
>+};
>diff --git a/migration/cpr.c b/migration/cpr.c
>index 4574608..47898ab 100644
>--- a/migration/cpr.c
>+++ b/migration/cpr.c
>@@ -22,13 +22,7 @@
>
> /*************************************************************************/
> /* cpr state container for all information to be saved. */
>
>-typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
>-
>-typedef struct CprState {
>- CprFdList fds;
>-} CprState;
>-
>-static CprState cpr_state;
>+CprState cpr_state;
>
>
> /****************************************************************************/
>
>@@ -129,8 +123,6 @@ int cpr_open_fd(const char *path, int flags, const char *name, int id,
> }
>
>
> /*************************************************************************/
>-#define CPR_STATE "CprState"
>-
> static const VMStateDescription vmstate_cpr_state = {
> .name = CPR_STATE,
> .version_id = 1,
>@@ -138,6 +130,10 @@ static const VMStateDescription vmstate_cpr_state = {
> .fields = (VMStateField[]) {
> VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, next),
> VMSTATE_END_OF_LIST()
>+ },
>+ .subsections = (const VMStateDescription * const []) {
>+ &vmstate_cpr_vfio_devices,
>+ NULL
> }
> };
>
> /*************************************************************************/
>diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>index 56373e3..b9420cf 100644
>--- a/hw/vfio/meson.build
>+++ b/hw/vfio/meson.build
>@@ -33,6 +33,7 @@ system_ss.add(when: ['CONFIG_VFIO', 'CONFIG_IOMMUFD'], if_true: files(
> 'iommufd.c',
> 'cpr-iommufd.c',
> ))
>+system_ss.add(when: 'CONFIG_IOMMUFD', if_false: files('iommufd-stubs.c'))
> system_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
> 'display.c',
> ))
>--
>1.8.3.1
^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH V5 30/38] migration: vfio cpr state hook
2025-06-24 11:24 ` Duan, Zhenzhong
@ 2025-07-01 14:26 ` Steven Sistare
2025-07-02 13:39 ` Duan, Zhenzhong
0 siblings, 1 reply; 101+ messages in thread
From: Steven Sistare @ 2025-07-01 14:26 UTC (permalink / raw)
To: Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 6/24/2025 7:24 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steve Sistare <steven.sistare@oracle.com>
>> Subject: [PATCH V5 30/38] migration: vfio cpr state hook
>>
>> Define a list of vfio devices in CPR state, in a subsection so that
>> older QEMU can be live updated to this version. However, new QEMU
>> will not be live updateable to old QEMU. This is acceptable because
>> CPR is not yet commonly used, and updates to older versions are unusual.
>
> I'm not familiar with migration. May I ask how a subsection helps block migration
> from new to old QEMU?
Migrating new to old will fail with an error message saying the subsection is
not recognized.
>> The contents of each device object will be defined by the vfio subsystem
>> in a subsequent patch.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> include/hw/vfio/vfio-cpr.h | 1 +
>> include/migration/cpr.h | 12 ++++++++++++
>> hw/vfio/cpr-iommufd.c | 2 ++
>> hw/vfio/iommufd-stubs.c | 18 ++++++++++++++++++
>> migration/cpr.c | 14 +++++---------
>> hw/vfio/meson.build | 1 +
>> 6 files changed, 39 insertions(+), 9 deletions(-)
>> create mode 100644 hw/vfio/iommufd-stubs.c
>>
>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>> index b9b77ae..619af07 100644
>> --- a/include/hw/vfio/vfio-cpr.h
>> +++ b/include/hw/vfio/vfio-cpr.h
>> @@ -74,5 +74,6 @@ void vfio_cpr_delete_vector_fd(struct VFIOPCIDevice *vdev,
>> const char *name,
>> int nr);
>>
>> extern const VMStateDescription vfio_cpr_pci_vmstate;
>> +extern const VMStateDescription vmstate_cpr_vfio_devices;
>>
>> #endif /* HW_VFIO_VFIO_CPR_H */
>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>> index 7fd8065..8fd8bfe 100644
>> --- a/include/migration/cpr.h
>> +++ b/include/migration/cpr.h
>> @@ -9,11 +9,23 @@
>> #define MIGRATION_CPR_H
>>
>> #include "qapi/qapi-types-migration.h"
>> +#include "qemu/queue.h"
>>
>> #define MIG_MODE_NONE -1
>>
>> #define QEMU_CPR_FILE_MAGIC 0x51435052
>> #define QEMU_CPR_FILE_VERSION 0x00000001
>> +#define CPR_STATE "CprState"
>> +
>> +typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
>> +typedef QLIST_HEAD(CprVFIODeviceList, CprVFIODevice) CprVFIODeviceList;
>> +
>> +typedef struct CprState {
>> + CprFdList fds;
>> + CprVFIODeviceList vfio_devices;
>> +} CprState;
>> +
>> +extern CprState cpr_state;
>>
>> void cpr_save_fd(const char *name, int id, int fd);
>> void cpr_delete_fd(const char *name, int id);
>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>> index 60bd7e8..3e78265 100644
>> --- a/hw/vfio/cpr-iommufd.c
>> +++ b/hw/vfio/cpr-iommufd.c
>> @@ -14,6 +14,8 @@
>> #include "system/iommufd.h"
>> #include "vfio-iommufd.h"
>>
>> +const VMStateDescription vmstate_cpr_vfio_devices; /* TBD in a later patch */
>> +
>> static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
>> {
>> if (!iommufd_change_process_capable(be)) {
>> diff --git a/hw/vfio/iommufd-stubs.c b/hw/vfio/iommufd-stubs.c
>> new file mode 100644
>> index 0000000..0be5276
>> --- /dev/null
>> +++ b/hw/vfio/iommufd-stubs.c
>> @@ -0,0 +1,18 @@
>> +/*
>> + * Copyright (c) 2025 Oracle and/or its affiliates.
>> + *
>> + * SPDX-License-Identifier: GPL-2.0-or-later
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "migration/cpr.h"
>> +#include "migration/vmstate.h"
>> +
>> +const VMStateDescription vmstate_cpr_vfio_devices = {
>> + .name = CPR_STATE "/vfio devices",
>> + .version_id = 1,
>> + .minimum_version_id = 1,
>
> Is there any difference if version_id=minimum_version_id=0?
No. Some developers add a new VMStateDescription starting at 0,
and some starting at 1.
- Steve
> Thanks
> Zhenzhong
>
>> + .fields = (const VMStateField[]){
>> + VMSTATE_END_OF_LIST()
>> + }
>> +};
>> diff --git a/migration/cpr.c b/migration/cpr.c
>> index 4574608..47898ab 100644
>> --- a/migration/cpr.c
>> +++ b/migration/cpr.c
>> @@ -22,13 +22,7 @@
>>
>> /*****************************************************************
>> ********/
>> /* cpr state container for all information to be saved. */
>>
>> -typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
>> -
>> -typedef struct CprState {
>> - CprFdList fds;
>> -} CprState;
>> -
>> -static CprState cpr_state;
>> +CprState cpr_state;
>>
>>
>> /*****************************************************************
>> ***********/
>>
>> @@ -129,8 +123,6 @@ int cpr_open_fd(const char *path, int flags, const char
>> *name, int id,
>> }
>>
>>
>> /*****************************************************************
>> ********/
>> -#define CPR_STATE "CprState"
>> -
>> static const VMStateDescription vmstate_cpr_state = {
>> .name = CPR_STATE,
>> .version_id = 1,
>> @@ -138,6 +130,10 @@ static const VMStateDescription vmstate_cpr_state = {
>> .fields = (VMStateField[]) {
>> VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, next),
>> VMSTATE_END_OF_LIST()
>> + },
>> + .subsections = (const VMStateDescription * const []) {
>> + &vmstate_cpr_vfio_devices,
>> + NULL
>> }
>> };
>>
>> /*****************************************************************
>> ********/
>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>> index 56373e3..b9420cf 100644
>> --- a/hw/vfio/meson.build
>> +++ b/hw/vfio/meson.build
>> @@ -33,6 +33,7 @@ system_ss.add(when: ['CONFIG_VFIO',
>> 'CONFIG_IOMMUFD'], if_true: files(
>> 'iommufd.c',
>> 'cpr-iommufd.c',
>> ))
>> +system_ss.add(when: 'CONFIG_IOMMUFD', if_false: files('iommufd-stubs.c'))
>> system_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
>> 'display.c',
>> ))
>> --
>> 1.8.3.1
>
^ permalink raw reply [flat|nested] 101+ messages in thread
* RE: [PATCH V5 30/38] migration: vfio cpr state hook
2025-07-01 14:26 ` Steven Sistare
@ 2025-07-02 13:39 ` Duan, Zhenzhong
2025-07-02 15:07 ` Steven Sistare
0 siblings, 1 reply; 101+ messages in thread
From: Duan, Zhenzhong @ 2025-07-02 13:39 UTC (permalink / raw)
To: Steven Sistare, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
>-----Original Message-----
>From: Steven Sistare <steven.sistare@oracle.com>
>Subject: Re: [PATCH V5 30/38] migration: vfio cpr state hook
>
>On 6/24/2025 7:24 AM, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Steve Sistare <steven.sistare@oracle.com>
>>> Subject: [PATCH V5 30/38] migration: vfio cpr state hook
>>>
>>> Define a list of vfio devices in CPR state, in a subsection so that
>>> older QEMU can be live updated to this version. However, new QEMU
>>> will not be live updateable to old QEMU. This is acceptable because
>>> CPR is not yet commonly used, and updates to older versions are unusual.
>>
>> I'm not familiar with migration. May I ask how a subsection helps block migration
>> from new to old QEMU?
>
>Migrating new to old will fail with an error message saying the subsection is
>not recognized.
Do you mean migrating from an old QEMU that supports legacy container live update to a new QEMU that supports iommufd live update?
>
>>> The contents of each device object will be defined by the vfio subsystem
>>> in a subsequent patch.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>> include/hw/vfio/vfio-cpr.h | 1 +
>>> include/migration/cpr.h | 12 ++++++++++++
>>> hw/vfio/cpr-iommufd.c | 2 ++
>>> hw/vfio/iommufd-stubs.c | 18 ++++++++++++++++++
>>> migration/cpr.c | 14 +++++---------
>>> hw/vfio/meson.build | 1 +
>>> 6 files changed, 39 insertions(+), 9 deletions(-)
>>> create mode 100644 hw/vfio/iommufd-stubs.c
>>>
>>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>>> index b9b77ae..619af07 100644
>>> --- a/include/hw/vfio/vfio-cpr.h
>>> +++ b/include/hw/vfio/vfio-cpr.h
>>> @@ -74,5 +74,6 @@ void vfio_cpr_delete_vector_fd(struct
>VFIOPCIDevice *vdev,
>>> const char *name,
>>> int nr);
>>>
>>> extern const VMStateDescription vfio_cpr_pci_vmstate;
>>> +extern const VMStateDescription vmstate_cpr_vfio_devices;
>>>
>>> #endif /* HW_VFIO_VFIO_CPR_H */
>>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>>> index 7fd8065..8fd8bfe 100644
>>> --- a/include/migration/cpr.h
>>> +++ b/include/migration/cpr.h
>>> @@ -9,11 +9,23 @@
>>> #define MIGRATION_CPR_H
>>>
>>> #include "qapi/qapi-types-migration.h"
>>> +#include "qemu/queue.h"
>>>
>>> #define MIG_MODE_NONE -1
>>>
>>> #define QEMU_CPR_FILE_MAGIC 0x51435052
>>> #define QEMU_CPR_FILE_VERSION 0x00000001
>>> +#define CPR_STATE "CprState"
>>> +
>>> +typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
>>> +typedef QLIST_HEAD(CprVFIODeviceList, CprVFIODevice)
>CprVFIODeviceList;
>>> +
>>> +typedef struct CprState {
>>> + CprFdList fds;
>>> + CprVFIODeviceList vfio_devices;
>>> +} CprState;
>>> +
>>> +extern CprState cpr_state;
>>>
>>> void cpr_save_fd(const char *name, int id, int fd);
>>> void cpr_delete_fd(const char *name, int id);
>>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>>> index 60bd7e8..3e78265 100644
>>> --- a/hw/vfio/cpr-iommufd.c
>>> +++ b/hw/vfio/cpr-iommufd.c
>>> @@ -14,6 +14,8 @@
>>> #include "system/iommufd.h"
>>> #include "vfio-iommufd.h"
>>>
>>> +const VMStateDescription vmstate_cpr_vfio_devices; /* TBD in a later
>patch */
>>> +
>>> static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
>>> {
>>> if (!iommufd_change_process_capable(be)) {
>>> diff --git a/hw/vfio/iommufd-stubs.c b/hw/vfio/iommufd-stubs.c
>>> new file mode 100644
>>> index 0000000..0be5276
>>> --- /dev/null
>>> +++ b/hw/vfio/iommufd-stubs.c
>>> @@ -0,0 +1,18 @@
>>> +/*
>>> + * Copyright (c) 2025 Oracle and/or its affiliates.
>>> + *
>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>> + */
>>> +
>>> +#include "qemu/osdep.h"
>>> +#include "migration/cpr.h"
>>> +#include "migration/vmstate.h"
>>> +
>>> +const VMStateDescription vmstate_cpr_vfio_devices = {
>>> + .name = CPR_STATE "/vfio devices",
>>> + .version_id = 1,
>>> + .minimum_version_id = 1,
>>
>> Is there any difference if version_id=minimum_version_id=0?
>
>No. Some developers add a new VMStateDescription starting at 0,
>and some starting at 1.
OK.
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH V5 30/38] migration: vfio cpr state hook
2025-07-02 13:39 ` Duan, Zhenzhong
@ 2025-07-02 15:07 ` Steven Sistare
0 siblings, 0 replies; 101+ messages in thread
From: Steven Sistare @ 2025-07-02 15:07 UTC (permalink / raw)
To: Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 7/2/2025 9:39 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steven Sistare <steven.sistare@oracle.com>
>> Subject: Re: [PATCH V5 30/38] migration: vfio cpr state hook
>>
>> On 6/24/2025 7:24 AM, Duan, Zhenzhong wrote:
>>>> -----Original Message-----
>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>> Subject: [PATCH V5 30/38] migration: vfio cpr state hook
>>>>
>>>> Define a list of vfio devices in CPR state, in a subsection so that
>>>> older QEMU can be live updated to this version. However, new QEMU
>>>> will not be live updateable to old QEMU. This is acceptable because
>>>> CPR is not yet commonly used, and updates to older versions are unusual.
>>>
>>> I'm not familiar with migration. May I ask how a subsection helps block
>>> migration from new to old QEMU?
>>
>> Migrating new to old will fail with an error message saying the subsection is
>> not recognized.
>
> You mean an old QEMU that supports legacy container live update, migrating to a new QEMU that supports iommufd live update?
No, more basic.
Even with no vfio devices in the VM, qemu 10.1 will send an empty
vmstate_cpr_vfio_devices subsection, which qemu 10.0 does not recognize.
Thus live update from qemu 10.1 to qemu 10.0 will fail.
Live update from qemu 10.0 to qemu 10.1 will work.
- Steve
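
The one-way compatibility described above can be modeled with a short
sketch. This is not QEMU code: Receiver, receiver_knows and stream_loads
are hypothetical names, and the only assumption carried over from the
thread is that a load fails when the stream names a subsection the
receiver does not recognize.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a receiver: the subsection names it understands. */
typedef struct Receiver {
    const char *const *known;
    size_t nknown;
} Receiver;

/* True if the receiver has a description registered under this name. */
static bool receiver_knows(const Receiver *r, const char *name)
{
    assert(r != NULL && name != NULL);
    for (size_t i = 0; i < r->nknown; i++) {
        if (strcmp(r->known[i], name) == 0) {
            return true;
        }
    }
    return false;
}

/* Loading fails as soon as the stream names an unrecognized subsection. */
static bool stream_loads(const Receiver *r,
                         const char *const *sent, size_t nsent)
{
    for (size_t i = 0; i < nsent; i++) {
        if (!receiver_knows(r, sent[i])) {
            return false;
        }
    }
    return true;
}
```

In this model an old receiver with no known subsections rejects a stream
that carries "CprState/vfio devices", while a new receiver accepts both
that stream and an old stream that carries no subsections at all.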
>>>> The contents of each device object will be defined by the vfio subsystem
>>>> in a subsequent patch.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>> include/hw/vfio/vfio-cpr.h | 1 +
>>>> include/migration/cpr.h | 12 ++++++++++++
>>>> hw/vfio/cpr-iommufd.c | 2 ++
>>>> hw/vfio/iommufd-stubs.c | 18 ++++++++++++++++++
>>>> migration/cpr.c | 14 +++++---------
>>>> hw/vfio/meson.build | 1 +
>>>> 6 files changed, 39 insertions(+), 9 deletions(-)
>>>> create mode 100644 hw/vfio/iommufd-stubs.c
>>>>
>>>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>>>> index b9b77ae..619af07 100644
>>>> --- a/include/hw/vfio/vfio-cpr.h
>>>> +++ b/include/hw/vfio/vfio-cpr.h
>>>> @@ -74,5 +74,6 @@ void vfio_cpr_delete_vector_fd(struct VFIOPCIDevice *vdev,
>>>> const char *name,
>>>> int nr);
>>>>
>>>> extern const VMStateDescription vfio_cpr_pci_vmstate;
>>>> +extern const VMStateDescription vmstate_cpr_vfio_devices;
>>>>
>>>> #endif /* HW_VFIO_VFIO_CPR_H */
>>>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>>>> index 7fd8065..8fd8bfe 100644
>>>> --- a/include/migration/cpr.h
>>>> +++ b/include/migration/cpr.h
>>>> @@ -9,11 +9,23 @@
>>>> #define MIGRATION_CPR_H
>>>>
>>>> #include "qapi/qapi-types-migration.h"
>>>> +#include "qemu/queue.h"
>>>>
>>>> #define MIG_MODE_NONE -1
>>>>
>>>> #define QEMU_CPR_FILE_MAGIC 0x51435052
>>>> #define QEMU_CPR_FILE_VERSION 0x00000001
>>>> +#define CPR_STATE "CprState"
>>>> +
>>>> +typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
>>>> +typedef QLIST_HEAD(CprVFIODeviceList, CprVFIODevice) CprVFIODeviceList;
>>>> +
>>>> +typedef struct CprState {
>>>> + CprFdList fds;
>>>> + CprVFIODeviceList vfio_devices;
>>>> +} CprState;
>>>> +
>>>> +extern CprState cpr_state;
>>>>
>>>> void cpr_save_fd(const char *name, int id, int fd);
>>>> void cpr_delete_fd(const char *name, int id);
>>>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>>>> index 60bd7e8..3e78265 100644
>>>> --- a/hw/vfio/cpr-iommufd.c
>>>> +++ b/hw/vfio/cpr-iommufd.c
>>>> @@ -14,6 +14,8 @@
>>>> #include "system/iommufd.h"
>>>> #include "vfio-iommufd.h"
>>>>
>>>> +const VMStateDescription vmstate_cpr_vfio_devices; /* TBD in a later patch */
>>>> +
>>>> static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
>>>> {
>>>> if (!iommufd_change_process_capable(be)) {
>>>> diff --git a/hw/vfio/iommufd-stubs.c b/hw/vfio/iommufd-stubs.c
>>>> new file mode 100644
>>>> index 0000000..0be5276
>>>> --- /dev/null
>>>> +++ b/hw/vfio/iommufd-stubs.c
>>>> @@ -0,0 +1,18 @@
>>>> +/*
>>>> + * Copyright (c) 2025 Oracle and/or its affiliates.
>>>> + *
>>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>>> + */
>>>> +
>>>> +#include "qemu/osdep.h"
>>>> +#include "migration/cpr.h"
>>>> +#include "migration/vmstate.h"
>>>> +
>>>> +const VMStateDescription vmstate_cpr_vfio_devices = {
>>>> + .name = CPR_STATE "/vfio devices",
>>>> + .version_id = 1,
>>>> + .minimum_version_id = 1,
>>>
>>> Is there a difference if version_id=minimum_version_id=0?
>>
>> No. Some developers add a new VMStateDescription starting at 0,
>> and some starting at 1.
>
> OK.
>
> Thanks
> Zhenzhong
^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH V5 31/38] vfio/iommufd: cpr state
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (29 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 30/38] migration: vfio cpr state hook Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-23 10:45 ` Duan, Zhenzhong
2025-06-10 15:39 ` [PATCH V5 32/38] vfio/iommufd: preserve descriptors Steve Sistare
` (7 subsequent siblings)
38 siblings, 1 reply; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
VFIO iommufd devices will need access to ioas_id, devid, and hwpt_id in
new QEMU at realize time, so add them to CPR state. Define CprVFIODevice
as the object which holds the state and is serialized to the vmstate file.
Define accessors to copy state between VFIODevice and CprVFIODevice.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
include/hw/vfio/vfio-cpr.h | 3 ++
hw/vfio/cpr-iommufd.c | 96 +++++++++++++++++++++++++++++++++++++++++++++-
hw/vfio/iommufd.c | 2 +
3 files changed, 100 insertions(+), 1 deletion(-)
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 619af07..f88e4ba 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -33,6 +33,8 @@ typedef struct VFIOContainerCPR {
typedef struct VFIODeviceCPR {
Error *mdev_blocker;
Error *id_blocker;
+ uint32_t hwpt_id;
+ uint32_t ioas_id;
} VFIODeviceCPR;
bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
@@ -54,6 +56,7 @@ bool vfio_iommufd_cpr_register_iommufd(struct IOMMUFDBackend *be, Error **errp);
void vfio_iommufd_cpr_unregister_iommufd(struct IOMMUFDBackend *be);
void vfio_iommufd_cpr_register_device(struct VFIODevice *vbasedev);
void vfio_iommufd_cpr_unregister_device(struct VFIODevice *vbasedev);
+void vfio_cpr_load_device(struct VFIODevice *vbasedev);
int vfio_cpr_group_get_device_fd(int d, const char *name);
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 3e78265..2eca8a6 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -7,6 +7,7 @@
#include "qemu/osdep.h"
#include "qapi/error.h"
#include "hw/vfio/vfio-cpr.h"
+#include "hw/vfio/vfio-device.h"
#include "migration/blocker.h"
#include "migration/cpr.h"
#include "migration/migration.h"
@@ -14,7 +15,88 @@
#include "system/iommufd.h"
#include "vfio-iommufd.h"
-const VMStateDescription vmstate_cpr_vfio_devices; /* TBD in a later patch */
+typedef struct CprVFIODevice {
+ char *name;
+ unsigned int namelen;
+ uint32_t ioas_id;
+ int devid;
+ uint32_t hwpt_id;
+ QLIST_ENTRY(CprVFIODevice) next;
+} CprVFIODevice;
+
+static const VMStateDescription vmstate_cpr_vfio_device = {
+ .name = "cpr vfio device",
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .fields = (VMStateField[]) {
+ VMSTATE_UINT32(namelen, CprVFIODevice),
+ VMSTATE_VBUFFER_ALLOC_UINT32(name, CprVFIODevice, 0, NULL, namelen),
+ VMSTATE_INT32(devid, CprVFIODevice),
+ VMSTATE_UINT32(ioas_id, CprVFIODevice),
+ VMSTATE_UINT32(hwpt_id, CprVFIODevice),
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+const VMStateDescription vmstate_cpr_vfio_devices = {
+ .name = CPR_STATE "/vfio devices",
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .fields = (const VMStateField[]){
+ VMSTATE_QLIST_V(vfio_devices, CprState, 1, vmstate_cpr_vfio_device,
+ CprVFIODevice, next),
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+static void vfio_cpr_save_device(VFIODevice *vbasedev)
+{
+ CprVFIODevice *elem = g_new0(CprVFIODevice, 1);
+
+ elem->name = g_strdup(vbasedev->name);
+ elem->namelen = strlen(vbasedev->name) + 1;
+ elem->ioas_id = vbasedev->cpr.ioas_id;
+ elem->devid = vbasedev->devid;
+ elem->hwpt_id = vbasedev->cpr.hwpt_id;
+ QLIST_INSERT_HEAD(&cpr_state.vfio_devices, elem, next);
+}
+
+static CprVFIODevice *find_device(const char *name)
+{
+ CprVFIODeviceList *head = &cpr_state.vfio_devices;
+ CprVFIODevice *elem;
+
+ QLIST_FOREACH(elem, head, next) {
+ if (!strcmp(elem->name, name)) {
+ return elem;
+ }
+ }
+ return NULL;
+}
+
+static void vfio_cpr_delete_device(const char *name)
+{
+ CprVFIODevice *elem = find_device(name);
+
+ if (elem) {
+ QLIST_REMOVE(elem, next);
+ g_free(elem->name);
+ g_free(elem);
+ }
+}
+
+static bool vfio_cpr_find_device(VFIODevice *vbasedev)
+{
+ CprVFIODevice *elem = find_device(vbasedev->name);
+
+ if (elem) {
+ vbasedev->cpr.ioas_id = elem->ioas_id;
+ vbasedev->devid = elem->devid;
+ vbasedev->cpr.hwpt_id = elem->hwpt_id;
+ return true;
+ }
+ return false;
+}
static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
{
@@ -79,8 +161,20 @@ void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
{
+ if (!cpr_is_incoming()) {
+ vfio_cpr_save_device(vbasedev);
+ }
}
void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
{
+ vfio_cpr_delete_device(vbasedev->name);
+}
+
+void vfio_cpr_load_device(VFIODevice *vbasedev)
+{
+ if (cpr_is_incoming()) {
+ bool ret = vfio_cpr_find_device(vbasedev);
+ g_assert(ret);
+ }
}
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index ff291be..f0d57ea 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -515,6 +515,8 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
const VFIOIOMMUClass *iommufd_vioc =
VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_IOMMUFD));
+ vfio_cpr_load_device(vbasedev);
+
if (vbasedev->fd < 0) {
devfd = iommufd_cdev_getfd(vbasedev->sysfsdev, errp);
if (devfd < 0) {
--
1.8.3.1
^ permalink raw reply related [flat|nested] 101+ messages in thread
* RE: [PATCH V5 31/38] vfio/iommufd: cpr state
2025-06-10 15:39 ` [PATCH V5 31/38] vfio/iommufd: cpr state Steve Sistare
@ 2025-06-23 10:45 ` Duan, Zhenzhong
2025-07-01 14:26 ` Steven Sistare
0 siblings, 1 reply; 101+ messages in thread
From: Duan, Zhenzhong @ 2025-06-23 10:45 UTC (permalink / raw)
To: Steve Sistare, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V5 31/38] vfio/iommufd: cpr state
>
>VFIO iommufd devices will need access to ioas_id, devid, and hwpt_id in
>new QEMU at realize time, so add them to CPR state. Define CprVFIODevice
>as the object which holds the state and is serialized to the vmstate file.
>Define accessors to copy state between VFIODevice and CprVFIODevice.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>---
> include/hw/vfio/vfio-cpr.h | 3 ++
> hw/vfio/cpr-iommufd.c | 96 +++++++++++++++++++++++++++++++++++++++++++++-
> hw/vfio/iommufd.c | 2 +
> 3 files changed, 100 insertions(+), 1 deletion(-)
>
>diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>index 619af07..f88e4ba 100644
>--- a/include/hw/vfio/vfio-cpr.h
>+++ b/include/hw/vfio/vfio-cpr.h
>@@ -33,6 +33,8 @@ typedef struct VFIOContainerCPR {
> typedef struct VFIODeviceCPR {
> Error *mdev_blocker;
> Error *id_blocker;
>+ uint32_t hwpt_id;
>+ uint32_t ioas_id;
> } VFIODeviceCPR;
>
> bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>@@ -54,6 +56,7 @@ bool vfio_iommufd_cpr_register_iommufd(struct IOMMUFDBackend *be, Error **errp);
> void vfio_iommufd_cpr_unregister_iommufd(struct IOMMUFDBackend *be);
> void vfio_iommufd_cpr_register_device(struct VFIODevice *vbasedev);
> void vfio_iommufd_cpr_unregister_device(struct VFIODevice *vbasedev);
>+void vfio_cpr_load_device(struct VFIODevice *vbasedev);
>
> int vfio_cpr_group_get_device_fd(int d, const char *name);
>
>diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>index 3e78265..2eca8a6 100644
>--- a/hw/vfio/cpr-iommufd.c
>+++ b/hw/vfio/cpr-iommufd.c
>@@ -7,6 +7,7 @@
> #include "qemu/osdep.h"
> #include "qapi/error.h"
> #include "hw/vfio/vfio-cpr.h"
>+#include "hw/vfio/vfio-device.h"
> #include "migration/blocker.h"
> #include "migration/cpr.h"
> #include "migration/migration.h"
>@@ -14,7 +15,88 @@
> #include "system/iommufd.h"
> #include "vfio-iommufd.h"
>
>-const VMStateDescription vmstate_cpr_vfio_devices; /* TBD in a later patch */
>+typedef struct CprVFIODevice {
>+ char *name;
>+ unsigned int namelen;
>+ uint32_t ioas_id;
>+ int devid;
>+ uint32_t hwpt_id;
>+ QLIST_ENTRY(CprVFIODevice) next;
>+} CprVFIODevice;
>+
>+static const VMStateDescription vmstate_cpr_vfio_device = {
>+ .name = "cpr vfio device",
>+ .version_id = 1,
>+ .minimum_version_id = 1,
>+ .fields = (VMStateField[]) {
>+ VMSTATE_UINT32(namelen, CprVFIODevice),
>+ VMSTATE_VBUFFER_ALLOC_UINT32(name, CprVFIODevice, 0, NULL, namelen),
>+ VMSTATE_INT32(devid, CprVFIODevice),
>+ VMSTATE_UINT32(ioas_id, CprVFIODevice),
>+ VMSTATE_UINT32(hwpt_id, CprVFIODevice),
>+ VMSTATE_END_OF_LIST()
>+ }
>+};
>+
>+const VMStateDescription vmstate_cpr_vfio_devices = {
>+ .name = CPR_STATE "/vfio devices",
>+ .version_id = 1,
>+ .minimum_version_id = 1,
>+ .fields = (const VMStateField[]){
>+ VMSTATE_QLIST_V(vfio_devices, CprState, 1, vmstate_cpr_vfio_device,
>+ CprVFIODevice, next),
>+ VMSTATE_END_OF_LIST()
>+ }
>+};
>+
>+static void vfio_cpr_save_device(VFIODevice *vbasedev)
>+{
>+ CprVFIODevice *elem = g_new0(CprVFIODevice, 1);
>+
>+ elem->name = g_strdup(vbasedev->name);
>+ elem->namelen = strlen(vbasedev->name) + 1;
>+ elem->ioas_id = vbasedev->cpr.ioas_id;
>+ elem->devid = vbasedev->devid;
>+ elem->hwpt_id = vbasedev->cpr.hwpt_id;
>+ QLIST_INSERT_HEAD(&cpr_state.vfio_devices, elem, next);
>+}
>+
>+static CprVFIODevice *find_device(const char *name)
>+{
>+ CprVFIODeviceList *head = &cpr_state.vfio_devices;
>+ CprVFIODevice *elem;
>+
>+ QLIST_FOREACH(elem, head, next) {
>+ if (!strcmp(elem->name, name)) {
>+ return elem;
>+ }
>+ }
>+ return NULL;
>+}
>+
>+static void vfio_cpr_delete_device(const char *name)
>+{
>+ CprVFIODevice *elem = find_device(name);
>+
>+ if (elem) {
>+ QLIST_REMOVE(elem, next);
>+ g_free(elem->name);
>+ g_free(elem);
>+ }
>+}
>+
>+static bool vfio_cpr_find_device(VFIODevice *vbasedev)
Better to rename as vfio_cpr_load_device
>+{
>+ CprVFIODevice *elem = find_device(vbasedev->name);
>+
>+ if (elem) {
>+ vbasedev->cpr.ioas_id = elem->ioas_id;
>+ vbasedev->devid = elem->devid;
>+ vbasedev->cpr.hwpt_id = elem->hwpt_id;
>+ return true;
>+ }
>+ return false;
>+}
>
> static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
> {
>@@ -79,8 +161,20 @@ void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
>
> void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
> {
>+ if (!cpr_is_incoming()) {
>+ vfio_cpr_save_device(vbasedev);
>+ }
> }
>
> void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
> {
>+ vfio_cpr_delete_device(vbasedev->name);
>+}
>+
>+void vfio_cpr_load_device(VFIODevice *vbasedev)
>+{
>+ if (cpr_is_incoming()) {
>+ bool ret = vfio_cpr_find_device(vbasedev);
>+ g_assert(ret);
>+ }
> }
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index ff291be..f0d57ea 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -515,6 +515,8 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
> const VFIOIOMMUClass *iommufd_vioc =
> VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_IOMMUFD));
>
>+ vfio_cpr_load_device(vbasedev);
This can be open coded.
Thanks
Zhenzhong
>+
> if (vbasedev->fd < 0) {
> devfd = iommufd_cdev_getfd(vbasedev->sysfsdev, errp);
> if (devfd < 0) {
>--
>1.8.3.1
^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [PATCH V5 31/38] vfio/iommufd: cpr state
2025-06-23 10:45 ` Duan, Zhenzhong
@ 2025-07-01 14:26 ` Steven Sistare
2025-07-02 13:44 ` Duan, Zhenzhong
0 siblings, 1 reply; 101+ messages in thread
From: Steven Sistare @ 2025-07-01 14:26 UTC (permalink / raw)
To: Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 6/23/2025 6:45 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steve Sistare <steven.sistare@oracle.com>
>> Subject: [PATCH V5 31/38] vfio/iommufd: cpr state
>>
>> VFIO iommufd devices will need access to ioas_id, devid, and hwpt_id in
>> new QEMU at realize time, so add them to CPR state. Define CprVFIODevice
>> as the object which holds the state and is serialized to the vmstate file.
>> Define accessors to copy state between VFIODevice and CprVFIODevice.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> include/hw/vfio/vfio-cpr.h | 3 ++
>> hw/vfio/cpr-iommufd.c | 96 +++++++++++++++++++++++++++++++++++++++++++++-
>> hw/vfio/iommufd.c | 2 +
>> 3 files changed, 100 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>> index 619af07..f88e4ba 100644
>> --- a/include/hw/vfio/vfio-cpr.h
>> +++ b/include/hw/vfio/vfio-cpr.h
>> @@ -33,6 +33,8 @@ typedef struct VFIOContainerCPR {
>> typedef struct VFIODeviceCPR {
>> Error *mdev_blocker;
>> Error *id_blocker;
>> + uint32_t hwpt_id;
>> + uint32_t ioas_id;
>> } VFIODeviceCPR;
>>
>> bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>> @@ -54,6 +56,7 @@ bool vfio_iommufd_cpr_register_iommufd(struct IOMMUFDBackend *be, Error **errp);
>> void vfio_iommufd_cpr_unregister_iommufd(struct IOMMUFDBackend *be);
>> void vfio_iommufd_cpr_register_device(struct VFIODevice *vbasedev);
>> void vfio_iommufd_cpr_unregister_device(struct VFIODevice *vbasedev);
>> +void vfio_cpr_load_device(struct VFIODevice *vbasedev);
>>
>> int vfio_cpr_group_get_device_fd(int d, const char *name);
>>
>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>> index 3e78265..2eca8a6 100644
>> --- a/hw/vfio/cpr-iommufd.c
>> +++ b/hw/vfio/cpr-iommufd.c
>> @@ -7,6 +7,7 @@
>> #include "qemu/osdep.h"
>> #include "qapi/error.h"
>> #include "hw/vfio/vfio-cpr.h"
>> +#include "hw/vfio/vfio-device.h"
>> #include "migration/blocker.h"
>> #include "migration/cpr.h"
>> #include "migration/migration.h"
>> @@ -14,7 +15,88 @@
>> #include "system/iommufd.h"
>> #include "vfio-iommufd.h"
>>
>> -const VMStateDescription vmstate_cpr_vfio_devices; /* TBD in a later patch */
>> +typedef struct CprVFIODevice {
>> + char *name;
>> + unsigned int namelen;
>> + uint32_t ioas_id;
>> + int devid;
>> + uint32_t hwpt_id;
>> + QLIST_ENTRY(CprVFIODevice) next;
>> +} CprVFIODevice;
>> +
>> +static const VMStateDescription vmstate_cpr_vfio_device = {
>> + .name = "cpr vfio device",
>> + .version_id = 1,
>> + .minimum_version_id = 1,
>> + .fields = (VMStateField[]) {
>> + VMSTATE_UINT32(namelen, CprVFIODevice),
>> + VMSTATE_VBUFFER_ALLOC_UINT32(name, CprVFIODevice, 0, NULL, namelen),
>> + VMSTATE_INT32(devid, CprVFIODevice),
>> + VMSTATE_UINT32(ioas_id, CprVFIODevice),
>> + VMSTATE_UINT32(hwpt_id, CprVFIODevice),
>> + VMSTATE_END_OF_LIST()
>> + }
>> +};
>> +
>> +const VMStateDescription vmstate_cpr_vfio_devices = {
>> + .name = CPR_STATE "/vfio devices",
>> + .version_id = 1,
>> + .minimum_version_id = 1,
>> + .fields = (const VMStateField[]){
>> + VMSTATE_QLIST_V(vfio_devices, CprState, 1, vmstate_cpr_vfio_device,
>> + CprVFIODevice, next),
>> + VMSTATE_END_OF_LIST()
>> + }
>> +};
>> +
>> +static void vfio_cpr_save_device(VFIODevice *vbasedev)
>> +{
>> + CprVFIODevice *elem = g_new0(CprVFIODevice, 1);
>> +
>> + elem->name = g_strdup(vbasedev->name);
>> + elem->namelen = strlen(vbasedev->name) + 1;
>> + elem->ioas_id = vbasedev->cpr.ioas_id;
>> + elem->devid = vbasedev->devid;
>> + elem->hwpt_id = vbasedev->cpr.hwpt_id;
>> + QLIST_INSERT_HEAD(&cpr_state.vfio_devices, elem, next);
>> +}
>> +
>> +static CprVFIODevice *find_device(const char *name)
>> +{
>> + CprVFIODeviceList *head = &cpr_state.vfio_devices;
>> + CprVFIODevice *elem;
>> +
>> + QLIST_FOREACH(elem, head, next) {
>> + if (!strcmp(elem->name, name)) {
>> + return elem;
>> + }
>> + }
>> + return NULL;
>> +}
>> +
>> +static void vfio_cpr_delete_device(const char *name)
>> +{
>> + CprVFIODevice *elem = find_device(name);
>> +
>> + if (elem) {
>> + QLIST_REMOVE(elem, next);
>> + g_free(elem->name);
>> + g_free(elem);
>> + }
>> +}
>> +
>> +static bool vfio_cpr_find_device(VFIODevice *vbasedev)
>
> Better to rename as vfio_cpr_load_device
This is already called by a function named vfio_cpr_load_device.
The usage is the same as cpr_find_fd, so "find" is a consistent name.
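
To make the naming split concrete, here is a minimal sketch of the
pattern, assuming simplified stand-ins for the real types. Dev, SavedDev,
saved_save, saved_find, saved_delete and dev_load are hypothetical names,
and a plain singly linked list replaces QLIST. The "find" helper looks up
an entry by name and copies its ids back; the "load" wrapper only decides
whether a lookup is expected and asserts that it succeeded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the live device object. */
typedef struct Dev {
    const char *name;
    uint32_t ioas_id;
    int devid;
    uint32_t hwpt_id;
} Dev;

/* Hypothetical stand-in for the saved per-device record. */
typedef struct SavedDev {
    char *name;
    uint32_t ioas_id;
    int devid;
    uint32_t hwpt_id;
    struct SavedDev *next;
} SavedDev;

static SavedDev *saved_head;

/* Copy the ids out of the device into a new record at the list head. */
static void saved_save(const Dev *d)
{
    size_t len = strlen(d->name) + 1;
    SavedDev *e = calloc(1, sizeof(*e));

    if (!e || !(e->name = malloc(len))) {
        abort();
    }
    memcpy(e->name, d->name, len);
    e->ioas_id = d->ioas_id;
    e->devid = d->devid;
    e->hwpt_id = d->hwpt_id;
    e->next = saved_head;
    saved_head = e;
}

/* "find": look up by name and, if present, copy the ids back. */
static bool saved_find(Dev *d)
{
    for (SavedDev *e = saved_head; e; e = e->next) {
        if (!strcmp(e->name, d->name)) {
            d->ioas_id = e->ioas_id;
            d->devid = e->devid;
            d->hwpt_id = e->hwpt_id;
            return true;
        }
    }
    return false;
}

/* Unlink and free the record with this name, if any. */
static void saved_delete(const char *name)
{
    for (SavedDev **p = &saved_head; *p; p = &(*p)->next) {
        if (!strcmp((*p)->name, name)) {
            SavedDev *e = *p;
            *p = e->next;
            free(e->name);
            free(e);
            return;
        }
    }
}

/* "load": on the incoming side a saved record must exist. */
static void dev_load(Dev *d, bool incoming)
{
    if (incoming) {
        bool found = saved_find(d);
        assert(found);
        (void)found;
    }
}
```

Keeping the wrapper separate leaves room for it to take on more
incoming-side work without changing the lookup helper.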
>> +{
>> + CprVFIODevice *elem = find_device(vbasedev->name);
>> +
>> + if (elem) {
>> + vbasedev->cpr.ioas_id = elem->ioas_id;
>> + vbasedev->devid = elem->devid;
>> + vbasedev->cpr.hwpt_id = elem->hwpt_id;
>> + return true;
>> + }
>> + return false;
>> +}
>>
>> static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
>> {
>> @@ -79,8 +161,20 @@ void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
>>
>> void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
>> {
>> + if (!cpr_is_incoming()) {
>> + vfio_cpr_save_device(vbasedev);
>> + }
>> }
>>
>> void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
>> {
>> + vfio_cpr_delete_device(vbasedev->name);
>> +}
>> +
>> +void vfio_cpr_load_device(VFIODevice *vbasedev)
>> +{
>> + if (cpr_is_incoming()) {
>> + bool ret = vfio_cpr_find_device(vbasedev);
>> + g_assert(ret);
>> + }
>> }
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index ff291be..f0d57ea 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -515,6 +515,8 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>> const VFIOIOMMUClass *iommufd_vioc =
>> VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_IOMMUFD));
>>
>> + vfio_cpr_load_device(vbasedev);
>
> This can be open coded.
vfio_cpr_load_device grows in patch "preserve descriptors", so I would
rather keep it closed.
- Steve
>> +
>> if (vbasedev->fd < 0) {
>> devfd = iommufd_cdev_getfd(vbasedev->sysfsdev, errp);
>> if (devfd < 0) {
>> --
>> 1.8.3.1
>
^ permalink raw reply [flat|nested] 101+ messages in thread
* RE: [PATCH V5 31/38] vfio/iommufd: cpr state
2025-07-01 14:26 ` Steven Sistare
@ 2025-07-02 13:44 ` Duan, Zhenzhong
0 siblings, 0 replies; 101+ messages in thread
From: Duan, Zhenzhong @ 2025-07-02 13:44 UTC (permalink / raw)
To: Steven Sistare, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
>-----Original Message-----
>From: Steven Sistare <steven.sistare@oracle.com>
>Subject: Re: [PATCH V5 31/38] vfio/iommufd: cpr state
>
>On 6/23/2025 6:45 AM, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Steve Sistare <steven.sistare@oracle.com>
>>> Subject: [PATCH V5 31/38] vfio/iommufd: cpr state
>>>
>>> VFIO iommufd devices will need access to ioas_id, devid, and hwpt_id in
>>> new QEMU at realize time, so add them to CPR state. Define CprVFIODevice
>>> as the object which holds the state and is serialized to the vmstate file.
>>> Define accessors to copy state between VFIODevice and CprVFIODevice.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>> include/hw/vfio/vfio-cpr.h | 3 ++
>>> hw/vfio/cpr-iommufd.c | 96 +++++++++++++++++++++++++++++++++++++++++++++-
>>> hw/vfio/iommufd.c | 2 +
>>> 3 files changed, 100 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>>> index 619af07..f88e4ba 100644
>>> --- a/include/hw/vfio/vfio-cpr.h
>>> +++ b/include/hw/vfio/vfio-cpr.h
>>> @@ -33,6 +33,8 @@ typedef struct VFIOContainerCPR {
>>> typedef struct VFIODeviceCPR {
>>> Error *mdev_blocker;
>>> Error *id_blocker;
>>> + uint32_t hwpt_id;
>>> + uint32_t ioas_id;
>>> } VFIODeviceCPR;
>>>
>>> bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>>> @@ -54,6 +56,7 @@ bool vfio_iommufd_cpr_register_iommufd(struct IOMMUFDBackend *be, Error **errp);
>>> void vfio_iommufd_cpr_unregister_iommufd(struct IOMMUFDBackend *be);
>>> void vfio_iommufd_cpr_register_device(struct VFIODevice *vbasedev);
>>> void vfio_iommufd_cpr_unregister_device(struct VFIODevice *vbasedev);
>>> +void vfio_cpr_load_device(struct VFIODevice *vbasedev);
>>>
>>> int vfio_cpr_group_get_device_fd(int d, const char *name);
>>>
>>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>>> index 3e78265..2eca8a6 100644
>>> --- a/hw/vfio/cpr-iommufd.c
>>> +++ b/hw/vfio/cpr-iommufd.c
>>> @@ -7,6 +7,7 @@
>>> #include "qemu/osdep.h"
>>> #include "qapi/error.h"
>>> #include "hw/vfio/vfio-cpr.h"
>>> +#include "hw/vfio/vfio-device.h"
>>> #include "migration/blocker.h"
>>> #include "migration/cpr.h"
>>> #include "migration/migration.h"
>>> @@ -14,7 +15,88 @@
>>> #include "system/iommufd.h"
>>> #include "vfio-iommufd.h"
>>>
>>> -const VMStateDescription vmstate_cpr_vfio_devices; /* TBD in a later patch */
>>> +typedef struct CprVFIODevice {
>>> + char *name;
>>> + unsigned int namelen;
>>> + uint32_t ioas_id;
>>> + int devid;
>>> + uint32_t hwpt_id;
>>> + QLIST_ENTRY(CprVFIODevice) next;
>>> +} CprVFIODevice;
>>> +
>>> +static const VMStateDescription vmstate_cpr_vfio_device = {
>>> + .name = "cpr vfio device",
>>> + .version_id = 1,
>>> + .minimum_version_id = 1,
>>> + .fields = (VMStateField[]) {
>>> + VMSTATE_UINT32(namelen, CprVFIODevice),
>>> + VMSTATE_VBUFFER_ALLOC_UINT32(name, CprVFIODevice, 0, NULL, namelen),
>>> + VMSTATE_INT32(devid, CprVFIODevice),
>>> + VMSTATE_UINT32(ioas_id, CprVFIODevice),
>>> + VMSTATE_UINT32(hwpt_id, CprVFIODevice),
>>> + VMSTATE_END_OF_LIST()
>>> + }
>>> +};
>>> +
>>> +const VMStateDescription vmstate_cpr_vfio_devices = {
>>> + .name = CPR_STATE "/vfio devices",
>>> + .version_id = 1,
>>> + .minimum_version_id = 1,
>>> + .fields = (const VMStateField[]){
>>> + VMSTATE_QLIST_V(vfio_devices, CprState, 1, vmstate_cpr_vfio_device,
>>> + CprVFIODevice, next),
>>> + VMSTATE_END_OF_LIST()
>>> + }
>>> +};
>>> +
>>> +static void vfio_cpr_save_device(VFIODevice *vbasedev)
>>> +{
>>> + CprVFIODevice *elem = g_new0(CprVFIODevice, 1);
>>> +
>>> + elem->name = g_strdup(vbasedev->name);
>>> + elem->namelen = strlen(vbasedev->name) + 1;
>>> + elem->ioas_id = vbasedev->cpr.ioas_id;
>>> + elem->devid = vbasedev->devid;
>>> + elem->hwpt_id = vbasedev->cpr.hwpt_id;
>>> + QLIST_INSERT_HEAD(&cpr_state.vfio_devices, elem, next);
>>> +}
>>> +
>>> +static CprVFIODevice *find_device(const char *name)
>>> +{
>>> + CprVFIODeviceList *head = &cpr_state.vfio_devices;
>>> + CprVFIODevice *elem;
>>> +
>>> + QLIST_FOREACH(elem, head, next) {
>>> + if (!strcmp(elem->name, name)) {
>>> + return elem;
>>> + }
>>> + }
>>> + return NULL;
>>> +}
>>> +
>>> +static void vfio_cpr_delete_device(const char *name)
>>> +{
>>> + CprVFIODevice *elem = find_device(name);
>>> +
>>> + if (elem) {
>>> + QLIST_REMOVE(elem, next);
>>> + g_free(elem->name);
>>> + g_free(elem);
>>> + }
>>> +}
>>> +
>>> +static bool vfio_cpr_find_device(VFIODevice *vbasedev)
>>
>> Better to rename as vfio_cpr_load_device
>
>This is already called by a function named vfio_cpr_load_device.
>The usage is the same as cpr_find_fd, so "find" is a consistent name.
I thought this does both find and load. This is trivial; I'm OK with the current change.
>
>>> +{
>>> + CprVFIODevice *elem = find_device(vbasedev->name);
>>> +
>>> + if (elem) {
>>> + vbasedev->cpr.ioas_id = elem->ioas_id;
>>> + vbasedev->devid = elem->devid;
>>> + vbasedev->cpr.hwpt_id = elem->hwpt_id;
>>> + return true;
>>> + }
>>> + return false;
>>> +}
>>>
>>> static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
>>> {
>>> @@ -79,8 +161,20 @@ void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
>>>
>>> void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
>>> {
>>> + if (!cpr_is_incoming()) {
>>> + vfio_cpr_save_device(vbasedev);
>>> + }
>>> }
>>>
>>> void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
>>> {
>>> + vfio_cpr_delete_device(vbasedev->name);
>>> +}
>>> +
>>> +void vfio_cpr_load_device(VFIODevice *vbasedev)
>>> +{
>>> + if (cpr_is_incoming()) {
>>> + bool ret = vfio_cpr_find_device(vbasedev);
>>> + g_assert(ret);
>>> + }
>>> }
>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>> index ff291be..f0d57ea 100644
>>> --- a/hw/vfio/iommufd.c
>>> +++ b/hw/vfio/iommufd.c
>>> @@ -515,6 +515,8 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>>> const VFIOIOMMUClass *iommufd_vioc =
>>> VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_IOMMUFD));
>>>
>>> + vfio_cpr_load_device(vbasedev);
>>
>> This can be open coded.
>
>vfio_cpr_load_device grows in patch "preserve descriptors", so I would
>rather keep it closed.
>
>- Steve
>
>>> +
>>> if (vbasedev->fd < 0) {
>>> devfd = iommufd_cdev_getfd(vbasedev->sysfsdev, errp);
>>> if (devfd < 0) {
>>> --
>>> 1.8.3.1
>>
OK.
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 101+ messages in thread
* [PATCH V5 32/38] vfio/iommufd: preserve descriptors
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (30 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 31/38] vfio/iommufd: cpr state Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-25 11:40 ` Duan, Zhenzhong
2025-06-10 15:39 ` [PATCH V5 33/38] vfio/iommufd: reconstruct device Steve Sistare
` (6 subsequent siblings)
38 siblings, 1 reply; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Save the iommu and vfio device fd in CPR state when it is created.
After CPR, the fd number is found in CPR state and reused.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
backends/iommufd.c | 25 ++++++++++++++++++++++++-
hw/vfio/cpr-iommufd.c | 10 ++++++++++
hw/vfio/device.c | 9 +--------
3 files changed, 35 insertions(+), 9 deletions(-)
diff --git a/backends/iommufd.c b/backends/iommufd.c
index c554ce5..e02f06e 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -16,12 +16,18 @@
#include "qemu/module.h"
#include "qom/object_interfaces.h"
#include "qemu/error-report.h"
+#include "migration/cpr.h"
#include "monitor/monitor.h"
#include "trace.h"
#include "hw/vfio/vfio-device.h"
#include <sys/ioctl.h>
#include <linux/iommufd.h>
+static const char *iommufd_fd_name(IOMMUFDBackend *be)
+{
+ return object_get_canonical_path_component(OBJECT(be));
+}
+
static void iommufd_backend_init(Object *obj)
{
IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
@@ -64,11 +70,27 @@ static bool iommufd_backend_can_be_deleted(UserCreatable *uc)
return !be->users;
}
+static void iommufd_backend_complete(UserCreatable *uc, Error **errp)
+{
+ IOMMUFDBackend *be = IOMMUFD_BACKEND(uc);
+ const char *name = iommufd_fd_name(be);
+
+ if (!be->owned) {
+ /* fd came from the command line. Fetch updated value from cpr state. */
+ if (cpr_is_incoming()) {
+ be->fd = cpr_find_fd(name, 0);
+ } else {
+ cpr_save_fd(name, 0, be->fd);
+ }
+ }
+}
+
static void iommufd_backend_class_init(ObjectClass *oc, const void *data)
{
UserCreatableClass *ucc = USER_CREATABLE_CLASS(oc);
ucc->can_be_deleted = iommufd_backend_can_be_deleted;
+ ucc->complete = iommufd_backend_complete;
object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
}
@@ -102,7 +124,7 @@ bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
int fd;
if (be->owned && !be->users) {
- fd = qemu_open("/dev/iommu", O_RDWR, errp);
+ fd = cpr_open_fd("/dev/iommu", O_RDWR, iommufd_fd_name(be), 0, errp);
if (fd < 0) {
return false;
}
@@ -134,6 +156,7 @@ void iommufd_backend_disconnect(IOMMUFDBackend *be)
out:
if (!be->users) {
vfio_iommufd_cpr_unregister_iommufd(be);
+ cpr_delete_fd(iommufd_fd_name(be), 0);
}
trace_iommufd_backend_disconnect(be->fd, be->users);
}
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 2eca8a6..152a661 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -162,17 +162,27 @@ void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
{
if (!cpr_is_incoming()) {
+ /*
+ * Beware fd may have already been saved by vfio_device_set_fd,
+ * so call resave to avoid a duplicate entry.
+ */
+ cpr_resave_fd(vbasedev->name, 0, vbasedev->fd);
vfio_cpr_save_device(vbasedev);
}
}
void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
{
+ cpr_delete_fd(vbasedev->name, 0);
vfio_cpr_delete_device(vbasedev->name);
}
void vfio_cpr_load_device(VFIODevice *vbasedev)
{
+ if (vbasedev->fd < 0) {
+ vbasedev->fd = cpr_find_fd(vbasedev->name, 0);
+ }
+
if (cpr_is_incoming()) {
bool ret = vfio_cpr_find_device(vbasedev);
g_assert(ret);
diff --git a/hw/vfio/device.c b/hw/vfio/device.c
index 8c3835b..6bcc65c 100644
--- a/hw/vfio/device.c
+++ b/hw/vfio/device.c
@@ -335,14 +335,7 @@ void vfio_device_free_name(VFIODevice *vbasedev)
void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
{
- ERRP_GUARD();
- int fd = monitor_fd_param(monitor_cur(), str, errp);
-
- if (fd < 0) {
- error_prepend(errp, "Could not parse remote object fd %s:", str);
- return;
- }
- vbasedev->fd = fd;
+ vbasedev->fd = cpr_get_fd_param(vbasedev->dev->id, str, 0, errp);
}
static VFIODeviceIOOps vfio_device_io_ops_ioctl;
--
1.8.3.1
* RE: [PATCH V5 32/38] vfio/iommufd: preserve descriptors
2025-06-10 15:39 ` [PATCH V5 32/38] vfio/iommufd: preserve descriptors Steve Sistare
@ 2025-06-25 11:40 ` Duan, Zhenzhong
2025-07-01 14:26 ` Steven Sistare
0 siblings, 1 reply; 101+ messages in thread
From: Duan, Zhenzhong @ 2025-06-25 11:40 UTC (permalink / raw)
To: Steve Sistare, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V5 32/38] vfio/iommufd: preserve descriptors
>
>Save the iommu and vfio device fd in CPR state when it is created.
>After CPR, the fd number is found in CPR state and reused.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>---
> backends/iommufd.c | 25 ++++++++++++++++++++++++-
> hw/vfio/cpr-iommufd.c | 10 ++++++++++
> hw/vfio/device.c | 9 +--------
> 3 files changed, 35 insertions(+), 9 deletions(-)
>
>diff --git a/backends/iommufd.c b/backends/iommufd.c
>index c554ce5..e02f06e 100644
>--- a/backends/iommufd.c
>+++ b/backends/iommufd.c
>@@ -16,12 +16,18 @@
> #include "qemu/module.h"
> #include "qom/object_interfaces.h"
> #include "qemu/error-report.h"
>+#include "migration/cpr.h"
> #include "monitor/monitor.h"
> #include "trace.h"
> #include "hw/vfio/vfio-device.h"
> #include <sys/ioctl.h>
> #include <linux/iommufd.h>
>
>+static const char *iommufd_fd_name(IOMMUFDBackend *be)
>+{
>+ return object_get_canonical_path_component(OBJECT(be));
>+}
>+
> static void iommufd_backend_init(Object *obj)
> {
> IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
>@@ -64,11 +70,27 @@ static bool
>iommufd_backend_can_be_deleted(UserCreatable *uc)
> return !be->users;
> }
>
>+static void iommufd_backend_complete(UserCreatable *uc, Error **errp)
>+{
>+ IOMMUFDBackend *be = IOMMUFD_BACKEND(uc);
>+ const char *name = iommufd_fd_name(be);
>+
>+ if (!be->owned) {
>+ /* fd came from the command line. Fetch updated value from cpr state. */
>+ if (cpr_is_incoming()) {
>+ be->fd = cpr_find_fd(name, 0);
>+ } else {
>+ cpr_save_fd(name, 0, be->fd);
>+ }
Maybe this can be handled in iommufd_backend_set_fd() instead of introducing
complete callback? Can we call cpr_get_fd_param()?
>+ }
>+}
>+
> static void iommufd_backend_class_init(ObjectClass *oc, const void *data)
> {
> UserCreatableClass *ucc = USER_CREATABLE_CLASS(oc);
>
> ucc->can_be_deleted = iommufd_backend_can_be_deleted;
>+ ucc->complete = iommufd_backend_complete;
>
> object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
> }
>@@ -102,7 +124,7 @@ bool iommufd_backend_connect(IOMMUFDBackend *be,
>Error **errp)
> int fd;
>
> if (be->owned && !be->users) {
>- fd = qemu_open("/dev/iommu", O_RDWR, errp);
>+ fd = cpr_open_fd("/dev/iommu", O_RDWR, iommufd_fd_name(be), 0, errp);
> if (fd < 0) {
> return false;
> }
>@@ -134,6 +156,7 @@ void iommufd_backend_disconnect(IOMMUFDBackend
>*be)
> out:
> if (!be->users) {
> vfio_iommufd_cpr_unregister_iommufd(be);
>+ cpr_delete_fd(iommufd_fd_name(be), 0);
I think we shouldn't call this if not owned.
> }
> trace_iommufd_backend_disconnect(be->fd, be->users);
> }
>diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>index 2eca8a6..152a661 100644
>--- a/hw/vfio/cpr-iommufd.c
>+++ b/hw/vfio/cpr-iommufd.c
>@@ -162,17 +162,27 @@ void
>vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
> void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
> {
> if (!cpr_is_incoming()) {
>+ /*
>+ * Beware fd may have already been saved by vfio_device_set_fd,
>+ * so call resave to avoid a duplicate entry.
>+ */
>+ cpr_resave_fd(vbasedev->name, 0, vbasedev->fd);
> vfio_cpr_save_device(vbasedev);
> }
> }
>
> void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
> {
>+ cpr_delete_fd(vbasedev->name, 0);
> vfio_cpr_delete_device(vbasedev->name);
> }
>
> void vfio_cpr_load_device(VFIODevice *vbasedev)
> {
>+ if (vbasedev->fd < 0) {
>+ vbasedev->fd = cpr_find_fd(vbasedev->name, 0);
Maybe call this after checking cpr_is_incoming()?
>+ }
>+
> if (cpr_is_incoming()) {
> bool ret = vfio_cpr_find_device(vbasedev);
> g_assert(ret);
>diff --git a/hw/vfio/device.c b/hw/vfio/device.c
>index 8c3835b..6bcc65c 100644
>--- a/hw/vfio/device.c
>+++ b/hw/vfio/device.c
>@@ -335,14 +335,7 @@ void vfio_device_free_name(VFIODevice *vbasedev)
>
> void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
> {
>- ERRP_GUARD();
>- int fd = monitor_fd_param(monitor_cur(), str, errp);
>-
>- if (fd < 0) {
>- error_prepend(errp, "Could not parse remote object fd %s:", str);
>- return;
>- }
>- vbasedev->fd = fd;
>+ vbasedev->fd = cpr_get_fd_param(vbasedev->dev->id, str, 0, errp);
> }
>
> static VFIODeviceIOOps vfio_device_io_ops_ioctl;
>--
>1.8.3.1
* Re: [PATCH V5 32/38] vfio/iommufd: preserve descriptors
2025-06-25 11:40 ` Duan, Zhenzhong
@ 2025-07-01 14:26 ` Steven Sistare
2025-07-02 14:08 ` Duan, Zhenzhong
0 siblings, 1 reply; 101+ messages in thread
From: Steven Sistare @ 2025-07-01 14:26 UTC (permalink / raw)
To: Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 6/25/2025 7:40 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steve Sistare <steven.sistare@oracle.com>
>> Subject: [PATCH V5 32/38] vfio/iommufd: preserve descriptors
>>
>> Save the iommu and vfio device fd in CPR state when it is created.
>> After CPR, the fd number is found in CPR state and reused.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> backends/iommufd.c | 25 ++++++++++++++++++++++++-
>> hw/vfio/cpr-iommufd.c | 10 ++++++++++
>> hw/vfio/device.c | 9 +--------
>> 3 files changed, 35 insertions(+), 9 deletions(-)
>>
>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>> index c554ce5..e02f06e 100644
>> --- a/backends/iommufd.c
>> +++ b/backends/iommufd.c
>> @@ -16,12 +16,18 @@
>> #include "qemu/module.h"
>> #include "qom/object_interfaces.h"
>> #include "qemu/error-report.h"
>> +#include "migration/cpr.h"
>> #include "monitor/monitor.h"
>> #include "trace.h"
>> #include "hw/vfio/vfio-device.h"
>> #include <sys/ioctl.h>
>> #include <linux/iommufd.h>
>>
>> +static const char *iommufd_fd_name(IOMMUFDBackend *be)
>> +{
>> + return object_get_canonical_path_component(OBJECT(be));
>> +}
>> +
>> static void iommufd_backend_init(Object *obj)
>> {
>> IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
>> @@ -64,11 +70,27 @@ static bool
>> iommufd_backend_can_be_deleted(UserCreatable *uc)
>> return !be->users;
>> }
>>
>> +static void iommufd_backend_complete(UserCreatable *uc, Error **errp)
>> +{
>> + IOMMUFDBackend *be = IOMMUFD_BACKEND(uc);
>> + const char *name = iommufd_fd_name(be);
>> +
>> + if (!be->owned) {
>> + /* fd came from the command line. Fetch updated value from cpr state. */
>> + if (cpr_is_incoming()) {
>> + be->fd = cpr_find_fd(name, 0);
>> + } else {
>> + cpr_save_fd(name, 0, be->fd);
>> + }
>
> Maybe this can be handled in iommufd_backend_set_fd() instead of introducing
> complete callback?
Afraid not. iommufd_fd_name -> object_get_canonical_path_component needs the
parent, which is not set yet in iommufd_backend_set_fd. The complete callback
solved that problem nicely.
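To illustrate the ordering problem, here is a small stand-alone model. It is
only a sketch: ToyObject, toy_set_fd, toy_add_to_parent and toy_complete are
made-up names, not QOM APIs, and the model assumes that the path component
cannot be derived until the object has a parent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of an object whose name depends on its parent. */
typedef struct ToyObject {
    const char *parent_name;    /* NULL until the object is added to a parent */
    const char *child_name;     /* name the object will have under the parent */
    int fd;
    const char *registered_as;  /* name used to record the fd, once known */
} ToyObject;

/* Assumed behaviour: no parent means no path component. */
static const char *toy_path_component(const ToyObject *obj)
{
    return obj->parent_name ? obj->child_name : NULL;
}

/* Property setter: runs before the object is parented, so it only stores. */
static void toy_set_fd(ToyObject *obj, int fd)
{
    obj->fd = fd;
}

static void toy_add_to_parent(ToyObject *obj, const char *parent)
{
    obj->parent_name = parent;
}

/* Completion hook: runs after parenting, so the name is available. */
static void toy_complete(ToyObject *obj)
{
    const char *name = toy_path_component(obj);

    assert(name != NULL);
    obj->registered_as = name;
}
```

In this model the setter has nothing to key the fd on, while the completion
hook does, which is the reason given above for using the complete callback.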
> Can we call cpr_get_fd_param()?
No. That one expects to get the name from monitor_fd_param.
>> + }
>> +}
>> +
>> static void iommufd_backend_class_init(ObjectClass *oc, const void *data)
>> {
>> UserCreatableClass *ucc = USER_CREATABLE_CLASS(oc);
>>
>> ucc->can_be_deleted = iommufd_backend_can_be_deleted;
>> + ucc->complete = iommufd_backend_complete;
>>
>> object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
>> }
>> @@ -102,7 +124,7 @@ bool iommufd_backend_connect(IOMMUFDBackend *be,
>> Error **errp)
>> int fd;
>>
>> if (be->owned && !be->users) {
>> - fd = qemu_open("/dev/iommu", O_RDWR, errp);
>> + fd = cpr_open_fd("/dev/iommu", O_RDWR, iommufd_fd_name(be), 0, errp);
>> if (fd < 0) {
>> return false;
>> }
>> @@ -134,6 +156,7 @@ void iommufd_backend_disconnect(IOMMUFDBackend
>> *be)
>> out:
>> if (!be->users) {
>> vfio_iommufd_cpr_unregister_iommufd(be);
>> + cpr_delete_fd(iommufd_fd_name(be), 0);
>
> I think we shouldn't call this if not owned.
I agree, thanks, and a mismerge during rebase put the out label in the wrong place.
It should be:
void iommufd_backend_disconnect(IOMMUFDBackend *be)
{
    if (!be->users) {
        goto out;
    }
    be->users--;
    if (!be->users) {
        vfio_iommufd_cpr_unregister_iommufd(be);
        if (be->owned) {
            cpr_delete_fd(iommufd_fd_name(be), 0);
            close(be->fd);
            be->fd = -1;
        }
    }
out:
    trace_iommufd_backend_disconnect(be->fd, be->users);
}
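
For anyone checking the control flow, the following stand-alone model mirrors
the corrected function. It is a sketch under stated assumptions: ToyBackend,
toy_save_fd and toy_delete_fd are hypothetical stand-ins (a one-slot table
replaces the CPR fd list), and the unregister and close() calls are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical one-slot stand-in for the CPR fd list. */
static char saved_name[32];
static int saved_fd = -1;

static void toy_save_fd(const char *name, int fd)
{
    strncpy(saved_name, name, sizeof(saved_name) - 1);
    saved_name[sizeof(saved_name) - 1] = '\0';
    saved_fd = fd;
}

static void toy_delete_fd(const char *name)
{
    if (saved_fd >= 0 && strcmp(saved_name, name) == 0) {
        saved_fd = -1;
    }
}

typedef struct ToyBackend {
    const char *name;
    int fd;
    int users;
    bool owned;   /* true if the backend opened the fd itself */
} ToyBackend;

/* Same branch structure as the corrected iommufd_backend_disconnect. */
static void toy_backend_disconnect(ToyBackend *be)
{
    assert(be != NULL);
    if (!be->users) {
        return;                 /* nothing connected: leave all state alone */
    }
    be->users--;
    if (!be->users && be->owned) {
        toy_delete_fd(be->name);
        be->fd = -1;            /* the real code also closes the fd here */
    }
}
```

The saved entry is dropped only on the last disconnect of an owned backend.
An fd that was passed in from outside stays recorded.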
>> }
>> trace_iommufd_backend_disconnect(be->fd, be->users);
>> }
>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>> index 2eca8a6..152a661 100644
>> --- a/hw/vfio/cpr-iommufd.c
>> +++ b/hw/vfio/cpr-iommufd.c
>> @@ -162,17 +162,27 @@ void
>> vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
>> void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
>> {
>> if (!cpr_is_incoming()) {
>> + /*
>> + * Beware fd may have already been saved by vfio_device_set_fd,
>> + * so call resave to avoid a duplicate entry.
>> + */
>> + cpr_resave_fd(vbasedev->name, 0, vbasedev->fd);
>> vfio_cpr_save_device(vbasedev);
>> }
>> }
>>
>> void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
>> {
>> + cpr_delete_fd(vbasedev->name, 0);
>> vfio_cpr_delete_device(vbasedev->name);
>> }
>>
>> void vfio_cpr_load_device(VFIODevice *vbasedev)
>> {
>> + if (vbasedev->fd < 0) {
>> + vbasedev->fd = cpr_find_fd(vbasedev->name, 0);
>
> Maybe call this after checking cpr_is_incoming()?
That is not necessary, because cpr_find_fd returns -1 if !cpr_is_incoming(),
but I'll change it so the intent becomes clearer:
void vfio_cpr_load_device(VFIODevice *vbasedev)
{
    if (cpr_is_incoming()) {
        bool ret = vfio_cpr_find_device(vbasedev);
        g_assert(ret);

        if (vbasedev->fd < 0) {
            vbasedev->fd = cpr_find_fd(vbasedev->name, 0);
        }
    }
}
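
A stand-alone model of that guard, for illustration only. The names
(ToyDevice, toy_incoming, toy_device_saved, toy_saved_fd) are hypothetical
and stand in for the device, cpr_is_incoming(), the saved device record and
the saved fd. None of them are QEMU symbols.

```c
#include <assert.h>
#include <stdbool.h>

static bool toy_incoming;       /* stands in for cpr_is_incoming() */
static bool toy_device_saved;   /* stands in for a saved device record */
static int toy_saved_fd = -1;   /* stands in for the fd kept in CPR state */

/* Returns the saved fd, or -1 if none was saved. */
static int toy_find_fd(void)
{
    return toy_saved_fd;
}

typedef struct ToyDevice {
    int fd;   /* -1 until the device has an fd */
} ToyDevice;

static void toy_load_device(ToyDevice *dev)
{
    if (toy_incoming) {
        /* On the incoming side the device record must exist. */
        assert(toy_device_saved);

        /* An fd supplied explicitly wins; otherwise reuse the saved one. */
        if (dev->fd < 0) {
            dev->fd = toy_find_fd();
        }
    }
}
```

With the lookup inside the incoming branch, a normal start never consults
saved state, even if a stale value happens to be present.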
- Steve
>> + }
>> +
>> if (cpr_is_incoming()) {
>> bool ret = vfio_cpr_find_device(vbasedev);
>> g_assert(ret);
>> diff --git a/hw/vfio/device.c b/hw/vfio/device.c
>> index 8c3835b..6bcc65c 100644
>> --- a/hw/vfio/device.c
>> +++ b/hw/vfio/device.c
>> @@ -335,14 +335,7 @@ void vfio_device_free_name(VFIODevice *vbasedev)
>>
>> void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
>> {
>> - ERRP_GUARD();
>> - int fd = monitor_fd_param(monitor_cur(), str, errp);
>> -
>> - if (fd < 0) {
>> - error_prepend(errp, "Could not parse remote object fd %s:", str);
>> - return;
>> - }
>> - vbasedev->fd = fd;
>> + vbasedev->fd = cpr_get_fd_param(vbasedev->dev->id, str, 0, errp);
>> }
>>
>> static VFIODeviceIOOps vfio_device_io_ops_ioctl;
>> --
>> 1.8.3.1
>
* RE: [PATCH V5 32/38] vfio/iommufd: preserve descriptors
2025-07-01 14:26 ` Steven Sistare
@ 2025-07-02 14:08 ` Duan, Zhenzhong
0 siblings, 0 replies; 101+ messages in thread
From: Duan, Zhenzhong @ 2025-07-02 14:08 UTC (permalink / raw)
To: Steven Sistare, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
>-----Original Message-----
>From: Steven Sistare <steven.sistare@oracle.com>
>Subject: Re: [PATCH V5 32/38] vfio/iommufd: preserve descriptors
>
>On 6/25/2025 7:40 AM, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Steve Sistare <steven.sistare@oracle.com>
>>> Subject: [PATCH V5 32/38] vfio/iommufd: preserve descriptors
>>>
>>> Save the iommu and vfio device fd in CPR state when it is created.
>>> After CPR, the fd number is found in CPR state and reused.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>> backends/iommufd.c | 25 ++++++++++++++++++++++++-
>>> hw/vfio/cpr-iommufd.c | 10 ++++++++++
>>> hw/vfio/device.c | 9 +--------
>>> 3 files changed, 35 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>> index c554ce5..e02f06e 100644
>>> --- a/backends/iommufd.c
>>> +++ b/backends/iommufd.c
>>> @@ -16,12 +16,18 @@
>>> #include "qemu/module.h"
>>> #include "qom/object_interfaces.h"
>>> #include "qemu/error-report.h"
>>> +#include "migration/cpr.h"
>>> #include "monitor/monitor.h"
>>> #include "trace.h"
>>> #include "hw/vfio/vfio-device.h"
>>> #include <sys/ioctl.h>
>>> #include <linux/iommufd.h>
>>>
>>> +static const char *iommufd_fd_name(IOMMUFDBackend *be)
>>> +{
>>> + return object_get_canonical_path_component(OBJECT(be));
>>> +}
>>> +
>>> static void iommufd_backend_init(Object *obj)
>>> {
>>> IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
>>> @@ -64,11 +70,27 @@ static bool
>>> iommufd_backend_can_be_deleted(UserCreatable *uc)
>>> return !be->users;
>>> }
>>>
>>> +static void iommufd_backend_complete(UserCreatable *uc, Error **errp)
>>> +{
>>> + IOMMUFDBackend *be = IOMMUFD_BACKEND(uc);
>>> + const char *name = iommufd_fd_name(be);
>>> +
>>> + if (!be->owned) {
>>> + /* fd came from the command line. Fetch updated value from
>cpr state. */
>>> + if (cpr_is_incoming()) {
>>> + be->fd = cpr_find_fd(name, 0);
>>> + } else {
>>> + cpr_save_fd(name, 0, be->fd);
>>> + }
>>
>> Maybe this can be handled in iommufd_backend_set_fd() instead of
>introducing
>> complete callback?
>
>Afraid not. iommufd_fd_name -> object_get_canonical_path_component
>needs the
>parent, which is not set yet in iommufd_backend_set_fd. The complete
>callback
>solved that problem nicely.
>
>> Can we call cpr_get_fd_param()?
>
>No. That one expects to get the name from monitor_fd_param.
OK
>
>>> + }
>>> +}
>>> +
>>> static void iommufd_backend_class_init(ObjectClass *oc, const void *data)
>>> {
>>> UserCreatableClass *ucc = USER_CREATABLE_CLASS(oc);
>>>
>>> ucc->can_be_deleted = iommufd_backend_can_be_deleted;
>>> + ucc->complete = iommufd_backend_complete;
>>>
>>> object_class_property_add_str(oc, "fd", NULL,
>iommufd_backend_set_fd);
>>> }
>>> @@ -102,7 +124,7 @@ bool
>iommufd_backend_connect(IOMMUFDBackend *be,
>>> Error **errp)
>>> int fd;
>>>
>>> if (be->owned && !be->users) {
>>> - fd = qemu_open("/dev/iommu", O_RDWR, errp);
>>> + fd = cpr_open_fd("/dev/iommu", O_RDWR,
>iommufd_fd_name(be), 0, errp);
>>> if (fd < 0) {
>>> return false;
>>> }
>>> @@ -134,6 +156,7 @@ void
>iommufd_backend_disconnect(IOMMUFDBackend
>>> *be)
>>> out:
>>> if (!be->users) {
>>> vfio_iommufd_cpr_unregister_iommufd(be);
>>> + cpr_delete_fd(iommufd_fd_name(be), 0);
>>
>> I think we shouldn't call this if not owned.
>
>I agree, thanks, and a mismerge during rebase put the out label in the wrong
>place.
>It should be:
>
>void iommufd_backend_disconnect(IOMMUFDBackend *be)
>{
>    if (!be->users) {
>        goto out;
>    }
>    be->users--;
>    if (!be->users) {
>        vfio_iommufd_cpr_unregister_iommufd(be);
>        if (be->owned) {
>            cpr_delete_fd(iommufd_fd_name(be), 0);
>            close(be->fd);
>            be->fd = -1;
>        }
>    }
>out:
>    trace_iommufd_backend_disconnect(be->fd, be->users);
>}
Looks good.
>
>>> }
>>> trace_iommufd_backend_disconnect(be->fd, be->users);
>>> }
>>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>>> index 2eca8a6..152a661 100644
>>> --- a/hw/vfio/cpr-iommufd.c
>>> +++ b/hw/vfio/cpr-iommufd.c
>>> @@ -162,17 +162,27 @@ void
>>> vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer
>*container)
>>> void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
>>> {
>>> if (!cpr_is_incoming()) {
>>> + /*
>>> + * Beware fd may have already been saved by
>vfio_device_set_fd,
>>> + * so call resave to avoid a duplicate entry.
>>> + */
>>> + cpr_resave_fd(vbasedev->name, 0, vbasedev->fd);
>>> vfio_cpr_save_device(vbasedev);
>>> }
>>> }
>>>
>>> void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
>>> {
>>> + cpr_delete_fd(vbasedev->name, 0);
>>> vfio_cpr_delete_device(vbasedev->name);
>>> }
>>>
>>> void vfio_cpr_load_device(VFIODevice *vbasedev)
>>> {
>>> + if (vbasedev->fd < 0) {
>>> + vbasedev->fd = cpr_find_fd(vbasedev->name, 0);
>>
>> Maybe call this after checking cpr_is_incoming()?
>
>That is not necessary, because cpr_find_fd returns -1 if !cpr_is_incoming(),
>but I'll change it so the intent becomes clearer:
>
>void vfio_cpr_load_device(VFIODevice *vbasedev)
>{
>    if (cpr_is_incoming()) {
>        bool ret = vfio_cpr_find_device(vbasedev);
>        g_assert(ret);
>
>        if (vbasedev->fd < 0) {
>            vbasedev->fd = cpr_find_fd(vbasedev->name, 0);
>        }
>    }
>}
Looks good, with above changes,
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Thanks
Zhenzhong
* [PATCH V5 33/38] vfio/iommufd: reconstruct device
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (31 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 32/38] vfio/iommufd: preserve descriptors Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-25 11:40 ` Duan, Zhenzhong
2025-06-10 15:39 ` [PATCH V5 34/38] vfio/iommufd: reconstruct hwpt Steve Sistare
` (5 subsequent siblings)
38 siblings, 1 reply; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Reconstruct userland device state after CPR. During vfio_realize, skip all
ioctls that configure the device, as it was already configured in old QEMU.
Skip bind, and use the devid from CPR state.
Skip allocation of, and attachment to, ioas_id. Recover ioas_id from CPR
state, and use it to find a matching container, if any, before creating a
new one.
This reconstruction is not complete. hwpt_id is handled in a subsequent
patch.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/iommufd.c | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index f0d57ea..a650517 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -25,6 +25,7 @@
#include "system/reset.h"
#include "qemu/cutils.h"
#include "qemu/chardev_open.h"
+#include "migration/cpr.h"
#include "pci.h"
#include "vfio-iommufd.h"
#include "vfio-helpers.h"
@@ -121,6 +122,10 @@ static bool iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
goto err_kvm_device_add;
}
+ if (cpr_is_incoming()) {
+ goto skip_bind;
+ }
+
/* Bind device to iommufd */
bind.iommufd = iommufd->fd;
if (ioctl(vbasedev->fd, VFIO_DEVICE_BIND_IOMMUFD, &bind)) {
@@ -132,6 +137,8 @@ static bool iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
vbasedev->devid = bind.out_devid;
trace_iommufd_cdev_connect_and_bind(bind.iommufd, vbasedev->name,
vbasedev->fd, vbasedev->devid);
+
+skip_bind:
return true;
err_bind:
iommufd_cdev_kvm_device_del(vbasedev);
@@ -421,7 +428,9 @@ static bool iommufd_cdev_attach_container(VFIODevice *vbasedev,
return iommufd_cdev_autodomains_get(vbasedev, container, errp);
}
- return !iommufd_cdev_attach_ioas_hwpt(vbasedev, container->ioas_id, errp);
+ /* If CPR, we are already attached to ioas_id. */
+ return cpr_is_incoming() ||
+ !iommufd_cdev_attach_ioas_hwpt(vbasedev, container->ioas_id, errp);
}
static void iommufd_cdev_detach_container(VFIODevice *vbasedev,
@@ -510,6 +519,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
VFIOAddressSpace *space;
struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
int ret, devfd;
+ bool res;
uint32_t ioas_id;
Error *err = NULL;
const VFIOIOMMUClass *iommufd_vioc =
@@ -540,7 +550,16 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
vbasedev->iommufd != container->be) {
continue;
}
- if (!iommufd_cdev_attach_container(vbasedev, container, &err)) {
+
+ if (!cpr_is_incoming()) {
+ res = iommufd_cdev_attach_container(vbasedev, container, &err);
+ } else if (vbasedev->cpr.ioas_id == container->ioas_id) {
+ res = true;
+ } else {
+ continue;
+ }
+
+ if (!res) {
const char *msg = error_get_pretty(err);
trace_iommufd_cdev_fail_attach_existing_container(msg);
@@ -557,6 +576,11 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
}
}
+ if (cpr_is_incoming()) {
+ ioas_id = vbasedev->cpr.ioas_id;
+ goto skip_ioas_alloc;
+ }
+
/* Need to allocate a new dedicated container */
if (!iommufd_backend_alloc_ioas(vbasedev->iommufd, &ioas_id, errp)) {
goto err_alloc_ioas;
@@ -564,10 +588,12 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
trace_iommufd_cdev_alloc_ioas(vbasedev->iommufd->fd, ioas_id);
+skip_ioas_alloc:
container = VFIO_IOMMU_IOMMUFD(object_new(TYPE_VFIO_IOMMU_IOMMUFD));
container->be = vbasedev->iommufd;
container->ioas_id = ioas_id;
QLIST_INIT(&container->hwpt_list);
+ vbasedev->cpr.ioas_id = ioas_id;
bcontainer = &container->bcontainer;
vfio_address_space_insert(space, bcontainer);
--
1.8.3.1
* RE: [PATCH V5 33/38] vfio/iommufd: reconstruct device
2025-06-10 15:39 ` [PATCH V5 33/38] vfio/iommufd: reconstruct device Steve Sistare
@ 2025-06-25 11:40 ` Duan, Zhenzhong
2025-07-01 14:26 ` Steven Sistare
0 siblings, 1 reply; 101+ messages in thread
From: Duan, Zhenzhong @ 2025-06-25 11:40 UTC (permalink / raw)
To: Steve Sistare, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V5 33/38] vfio/iommufd: reconstruct device
>
>Reconstruct userland device state after CPR. During vfio_realize, skip all
>ioctls that configure the device, as it was already configured in old QEMU.
>
>Skip bind, and use the devid from CPR state.
>
>Skip allocation of, and attachment to, ioas_id. Recover ioas_id from CPR
>state, and use it to find a matching container, if any, before creating a
>new one.
>
>This reconstruction is not complete. hwpt_id is handled in a subsequent
>patch.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>---
> hw/vfio/iommufd.c | 30 ++++++++++++++++++++++++++++--
> 1 file changed, 28 insertions(+), 2 deletions(-)
>
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index f0d57ea..a650517 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -25,6 +25,7 @@
> #include "system/reset.h"
> #include "qemu/cutils.h"
> #include "qemu/chardev_open.h"
>+#include "migration/cpr.h"
> #include "pci.h"
> #include "vfio-iommufd.h"
> #include "vfio-helpers.h"
>@@ -121,6 +122,10 @@ static bool
>iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
> goto err_kvm_device_add;
> }
>
>+ if (cpr_is_incoming()) {
>+ goto skip_bind;
>+ }
>+
> /* Bind device to iommufd */
> bind.iommufd = iommufd->fd;
> if (ioctl(vbasedev->fd, VFIO_DEVICE_BIND_IOMMUFD, &bind)) {
>@@ -132,6 +137,8 @@ static bool
>iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
> vbasedev->devid = bind.out_devid;
> trace_iommufd_cdev_connect_and_bind(bind.iommufd, vbasedev->name,
> vbasedev->fd, vbasedev->devid);
>+
>+skip_bind:
I'm not sure whether we should emit the above trace for CPR.
> return true;
> err_bind:
> iommufd_cdev_kvm_device_del(vbasedev);
>@@ -421,7 +428,9 @@ static bool iommufd_cdev_attach_container(VFIODevice
>*vbasedev,
> return iommufd_cdev_autodomains_get(vbasedev, container, errp);
> }
>
>- return !iommufd_cdev_attach_ioas_hwpt(vbasedev, container->ioas_id, errp);
>+ /* If CPR, we are already attached to ioas_id. */
>+ return cpr_is_incoming() ||
>+ !iommufd_cdev_attach_ioas_hwpt(vbasedev, container->ioas_id, errp);
> }
>
> static void iommufd_cdev_detach_container(VFIODevice *vbasedev,
>@@ -510,6 +519,7 @@ static bool iommufd_cdev_attach(const char *name,
>VFIODevice *vbasedev,
> VFIOAddressSpace *space;
> struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
> int ret, devfd;
>+ bool res;
> uint32_t ioas_id;
> Error *err = NULL;
> const VFIOIOMMUClass *iommufd_vioc =
>@@ -540,7 +550,16 @@ static bool iommufd_cdev_attach(const char *name,
>VFIODevice *vbasedev,
> vbasedev->iommufd != container->be) {
> continue;
> }
>- if (!iommufd_cdev_attach_container(vbasedev, container, &err)) {
>+
>+ if (!cpr_is_incoming()) {
>+ res = iommufd_cdev_attach_container(vbasedev, container, &err);
>+ } else if (vbasedev->cpr.ioas_id == container->ioas_id) {
>+ res = true;
>+ } else {
>+ continue;
>+ }
>+
>+ if (!res) {
> const char *msg = error_get_pretty(err);
>
> trace_iommufd_cdev_fail_attach_existing_container(msg);
>@@ -557,6 +576,11 @@ static bool iommufd_cdev_attach(const char *name,
>VFIODevice *vbasedev,
> }
> }
>
>+ if (cpr_is_incoming()) {
>+ ioas_id = vbasedev->cpr.ioas_id;
>+ goto skip_ioas_alloc;
>+ }
>+
> /* Need to allocate a new dedicated container */
> if (!iommufd_backend_alloc_ioas(vbasedev->iommufd, &ioas_id, errp)) {
> goto err_alloc_ioas;
>@@ -564,10 +588,12 @@ static bool iommufd_cdev_attach(const char *name,
>VFIODevice *vbasedev,
>
> trace_iommufd_cdev_alloc_ioas(vbasedev->iommufd->fd, ioas_id);
>
>+skip_ioas_alloc:
Same here, others look good.
> container =
>VFIO_IOMMU_IOMMUFD(object_new(TYPE_VFIO_IOMMU_IOMMUFD));
> container->be = vbasedev->iommufd;
> container->ioas_id = ioas_id;
> QLIST_INIT(&container->hwpt_list);
>+ vbasedev->cpr.ioas_id = ioas_id;
>
> bcontainer = &container->bcontainer;
> vfio_address_space_insert(space, bcontainer);
>--
>1.8.3.1
* Re: [PATCH V5 33/38] vfio/iommufd: reconstruct device
2025-06-25 11:40 ` Duan, Zhenzhong
@ 2025-07-01 14:26 ` Steven Sistare
2025-07-02 14:14 ` Duan, Zhenzhong
0 siblings, 1 reply; 101+ messages in thread
From: Steven Sistare @ 2025-07-01 14:26 UTC (permalink / raw)
To: Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 6/25/2025 7:40 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steve Sistare <steven.sistare@oracle.com>
>> Subject: [PATCH V5 33/38] vfio/iommufd: reconstruct device
>>
>> Reconstruct userland device state after CPR. During vfio_realize, skip all
>> ioctls that configure the device, as it was already configured in old QEMU.
>>
>> Skip bind, and use the devid from CPR state.
>>
>> Skip allocation of, and attachment to, ioas_id. Recover ioas_id from CPR
>> state, and use it to find a matching container, if any, before creating a
>> new one.
>>
>> This reconstruction is not complete. hwpt_id is handled in a subsequent
>> patch.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/iommufd.c | 30 ++++++++++++++++++++++++++++--
>> 1 file changed, 28 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index f0d57ea..a650517 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -25,6 +25,7 @@
>> #include "system/reset.h"
>> #include "qemu/cutils.h"
>> #include "qemu/chardev_open.h"
>> +#include "migration/cpr.h"
>> #include "pci.h"
>> #include "vfio-iommufd.h"
>> #include "vfio-helpers.h"
>> @@ -121,6 +122,10 @@ static bool
>> iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
>> goto err_kvm_device_add;
>> }
>>
>> + if (cpr_is_incoming()) {
>> + goto skip_bind;
>> + }
>> +
>> /* Bind device to iommufd */
>> bind.iommufd = iommufd->fd;
>> if (ioctl(vbasedev->fd, VFIO_DEVICE_BIND_IOMMUFD, &bind)) {
>> @@ -132,6 +137,8 @@ static bool
>> iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
>> vbasedev->devid = bind.out_devid;
>> trace_iommufd_cdev_connect_and_bind(bind.iommufd, vbasedev->name,
>> vbasedev->fd, vbasedev->devid);
>> +
>> +skip_bind:
>
> I'm not sure if we should take above trace for CPR..
My thinking is: on cpr, we do not connect or bind, so we should not log it.
iommufd_backend_connect() is called but it just reuses a cpr fd, and we
can observe the latter with cpr traces.
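
To make "just reuses a cpr fd" concrete, here is a stand-alone sketch of a
find-or-open helper. It assumes cpr_open_fd behaves as "look up (name, id)
first, open and save only on a miss", which this thread implies but does not
show. Every toy_* name is hypothetical, and toy_fake_open replaces the real
open call.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_FDS 8

typedef struct ToyFdEntry {
    char name[32];
    int id;
    int fd;
    int used;
} ToyFdEntry;

static ToyFdEntry toy_fds[TOY_MAX_FDS];
static int toy_open_calls;   /* counts how often we really "opened" */

/* Returns the fd saved under (name, id), or -1 if there is none. */
static int toy_find_fd(const char *name, int id)
{
    for (int i = 0; i < TOY_MAX_FDS; i++) {
        if (toy_fds[i].used && toy_fds[i].id == id &&
            strcmp(toy_fds[i].name, name) == 0) {
            return toy_fds[i].fd;
        }
    }
    return -1;
}

static void toy_save_fd(const char *name, int id, int fd)
{
    for (int i = 0; i < TOY_MAX_FDS; i++) {
        if (!toy_fds[i].used) {
            strncpy(toy_fds[i].name, name, sizeof(toy_fds[i].name) - 1);
            toy_fds[i].name[sizeof(toy_fds[i].name) - 1] = '\0';
            toy_fds[i].id = id;
            toy_fds[i].fd = fd;
            toy_fds[i].used = 1;
            return;
        }
    }
    assert(!"toy fd table full");
}

/* Pretend to open a file: hands out 100, 101, ... */
static int toy_fake_open(void)
{
    return 100 + toy_open_calls++;
}

/* Reuse a saved fd if present, else open a new one and record it. */
static int toy_open_fd(const char *name, int id)
{
    int fd = toy_find_fd(name, id);

    if (fd < 0) {
        fd = toy_fake_open();
        toy_save_fd(name, id, fd);
    }
    return fd;
}
```

In this model a second toy_open_fd() call with the same name and id returns
the recorded fd without opening anything, which is why no connect or bind
event needs to be logged on that path.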
>> return true;
>> err_bind:
>> iommufd_cdev_kvm_device_del(vbasedev);
>> @@ -421,7 +428,9 @@ static bool iommufd_cdev_attach_container(VFIODevice
>> *vbasedev,
>> return iommufd_cdev_autodomains_get(vbasedev, container, errp);
>> }
>>
>> - return !iommufd_cdev_attach_ioas_hwpt(vbasedev, container->ioas_id, errp);
>> + /* If CPR, we are already attached to ioas_id. */
>> + return cpr_is_incoming() ||
>> + !iommufd_cdev_attach_ioas_hwpt(vbasedev, container->ioas_id, errp);
>> }
>>
>> static void iommufd_cdev_detach_container(VFIODevice *vbasedev,
>> @@ -510,6 +519,7 @@ static bool iommufd_cdev_attach(const char *name,
>> VFIODevice *vbasedev,
>> VFIOAddressSpace *space;
>> struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>> int ret, devfd;
>> + bool res;
>> uint32_t ioas_id;
>> Error *err = NULL;
>> const VFIOIOMMUClass *iommufd_vioc =
>> @@ -540,7 +550,16 @@ static bool iommufd_cdev_attach(const char *name,
>> VFIODevice *vbasedev,
>> vbasedev->iommufd != container->be) {
>> continue;
>> }
>> - if (!iommufd_cdev_attach_container(vbasedev, container, &err)) {
>> +
>> + if (!cpr_is_incoming()) {
>> + res = iommufd_cdev_attach_container(vbasedev, container, &err);
>> + } else if (vbasedev->cpr.ioas_id == container->ioas_id) {
>> + res = true;
>> + } else {
>> + continue;
>> + }
>> +
>> + if (!res) {
>> const char *msg = error_get_pretty(err);
>>
>> trace_iommufd_cdev_fail_attach_existing_container(msg);
>> @@ -557,6 +576,11 @@ static bool iommufd_cdev_attach(const char *name,
>> VFIODevice *vbasedev,
>> }
>> }
>>
>> + if (cpr_is_incoming()) {
>> + ioas_id = vbasedev->cpr.ioas_id;
>> + goto skip_ioas_alloc;
>> + }
>> +
>> /* Need to allocate a new dedicated container */
>> if (!iommufd_backend_alloc_ioas(vbasedev->iommufd, &ioas_id, errp)) {
>> goto err_alloc_ioas;
>> @@ -564,10 +588,12 @@ static bool iommufd_cdev_attach(const char *name,
>> VFIODevice *vbasedev,
>>
>> trace_iommufd_cdev_alloc_ioas(vbasedev->iommufd->fd, ioas_id);
>>
>> +skip_ioas_alloc:
>
> Same here, others look good.
During cpr, we do not allocate a new ioas; we use the one from cpr state.
I think it would be confusing to print a trace that suggests we allocated
a new ioas.
Perhaps I should add a trace in vfio_cpr_find_device:
  trace_vfio_cpr_find_device(elem->ioas_id, elem->dev_id, elem->hwpt_id)
- Steve
>> container =
>> VFIO_IOMMU_IOMMUFD(object_new(TYPE_VFIO_IOMMU_IOMMUFD));
>> container->be = vbasedev->iommufd;
>> container->ioas_id = ioas_id;
>> QLIST_INIT(&container->hwpt_list);
>> + vbasedev->cpr.ioas_id = ioas_id;
>>
>> bcontainer = &container->bcontainer;
>> vfio_address_space_insert(space, bcontainer);
>> --
>> 1.8.3.1
>
* RE: [PATCH V5 33/38] vfio/iommufd: reconstruct device
2025-07-01 14:26 ` Steven Sistare
@ 2025-07-02 14:14 ` Duan, Zhenzhong
0 siblings, 0 replies; 101+ messages in thread
From: Duan, Zhenzhong @ 2025-07-02 14:14 UTC (permalink / raw)
To: Steven Sistare, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
>-----Original Message-----
>From: Steven Sistare <steven.sistare@oracle.com>
>Subject: Re: [PATCH V5 33/38] vfio/iommufd: reconstruct device
>
>On 6/25/2025 7:40 AM, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Steve Sistare <steven.sistare@oracle.com>
>>> Subject: [PATCH V5 33/38] vfio/iommufd: reconstruct device
>>>
>>> Reconstruct userland device state after CPR. During vfio_realize, skip all
>>> ioctls that configure the device, as it was already configured in old QEMU.
>>>
>>> Skip bind, and use the devid from CPR state.
>>>
>>> Skip allocation of, and attachment to, ioas_id. Recover ioas_id from CPR
>>> state, and use it to find a matching container, if any, before creating a
>>> new one.
>>>
>>> This reconstruction is not complete. hwpt_id is handled in a subsequent
>>> patch.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>> hw/vfio/iommufd.c | 30 ++++++++++++++++++++++++++++--
>>> 1 file changed, 28 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>> index f0d57ea..a650517 100644
>>> --- a/hw/vfio/iommufd.c
>>> +++ b/hw/vfio/iommufd.c
>>> @@ -25,6 +25,7 @@
>>> #include "system/reset.h"
>>> #include "qemu/cutils.h"
>>> #include "qemu/chardev_open.h"
>>> +#include "migration/cpr.h"
>>> #include "pci.h"
>>> #include "vfio-iommufd.h"
>>> #include "vfio-helpers.h"
>>> @@ -121,6 +122,10 @@ static bool iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
>>> goto err_kvm_device_add;
>>> }
>>>
>>> + if (cpr_is_incoming()) {
>>> + goto skip_bind;
>>> + }
>>> +
>>> /* Bind device to iommufd */
>>> bind.iommufd = iommufd->fd;
>>> if (ioctl(vbasedev->fd, VFIO_DEVICE_BIND_IOMMUFD, &bind)) {
>>> @@ -132,6 +137,8 @@ static bool iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
>>> vbasedev->devid = bind.out_devid;
>>> trace_iommufd_cdev_connect_and_bind(bind.iommufd, vbasedev->name,
>>> vbasedev->fd, vbasedev->devid);
>>> +
>>> +skip_bind:
>>
>> I'm not sure whether we should emit the above trace for CPR.
>
>My thinking is: on cpr, we do not connect or bind, so we should not log it.
>iommufd_backend_connect() is called but it just reuses a cpr fd, and we
>can observe the latter with cpr traces.
OK.
>
>>> return true;
>>> err_bind:
>>> iommufd_cdev_kvm_device_del(vbasedev);
>>> @@ -421,7 +428,9 @@ static bool iommufd_cdev_attach_container(VFIODevice *vbasedev,
>>> return iommufd_cdev_autodomains_get(vbasedev, container, errp);
>>> }
>>>
>>> - return !iommufd_cdev_attach_ioas_hwpt(vbasedev, container->ioas_id, errp);
>>> + /* If CPR, we are already attached to ioas_id. */
>>> + return cpr_is_incoming() ||
>>> + !iommufd_cdev_attach_ioas_hwpt(vbasedev, container->ioas_id, errp);
>>> }
>>>
>>> static void iommufd_cdev_detach_container(VFIODevice *vbasedev,
>>> @@ -510,6 +519,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>>> VFIOAddressSpace *space;
>>> struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>>> int ret, devfd;
>>> + bool res;
>>> uint32_t ioas_id;
>>> Error *err = NULL;
>>> const VFIOIOMMUClass *iommufd_vioc =
>>> @@ -540,7 +550,16 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>>> vbasedev->iommufd != container->be) {
>>> continue;
>>> }
>>> - if (!iommufd_cdev_attach_container(vbasedev, container, &err)) {
>>> +
>>> + if (!cpr_is_incoming()) {
>>> + res = iommufd_cdev_attach_container(vbasedev, container, &err);
>>> + } else if (vbasedev->cpr.ioas_id == container->ioas_id) {
>>> + res = true;
>>> + } else {
>>> + continue;
>>> + }
>>> +
>>> + if (!res) {
>>> const char *msg = error_get_pretty(err);
>>>
>>> trace_iommufd_cdev_fail_attach_existing_container(msg);
>>> @@ -557,6 +576,11 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>>> }
>>> }
>>>
>>> + if (cpr_is_incoming()) {
>>> + ioas_id = vbasedev->cpr.ioas_id;
>>> + goto skip_ioas_alloc;
>>> + }
>>> +
>>> /* Need to allocate a new dedicated container */
>>> if (!iommufd_backend_alloc_ioas(vbasedev->iommufd, &ioas_id, errp)) {
>>> goto err_alloc_ioas;
>>> @@ -564,10 +588,12 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>>>
>>> trace_iommufd_cdev_alloc_ioas(vbasedev->iommufd->fd, ioas_id);
>>>
>>> +skip_ioas_alloc:
>>
>> Same here, others look good.
>
>During cpr, we do not allocate a new ioas; we use the one from cpr state.
>I think it would be confusing to print a trace that suggests we allocated
>a new ioas.
>
>Perhaps I should add a trace in vfio_cpr_find_device:
>
> trace_vfio_cpr_find_device(elem->ioas_id, elem->dev_id, elem->hwpt_id)
Yes, that will be better.
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Thanks
Zhenzhong
* [PATCH V5 34/38] vfio/iommufd: reconstruct hwpt
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (32 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 33/38] vfio/iommufd: reconstruct device Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-25 11:40 ` Duan, Zhenzhong
2025-06-10 15:39 ` [PATCH V5 35/38] vfio/iommufd: change process Steve Sistare
` (4 subsequent siblings)
38 siblings, 1 reply; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Skip allocation of, and attachment to, hwpt_id. Recover it from CPR state.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/iommufd.c | 30 ++++++++++++++++++++++--------
1 file changed, 22 insertions(+), 8 deletions(-)
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index a650517..48c590b 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -332,7 +332,14 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
/* Try to find a domain */
QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
- ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
+ if (!cpr_is_incoming()) {
+ ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
+ } else if (vbasedev->cpr.hwpt_id == hwpt->hwpt_id) {
+ ret = 0;
+ } else {
+ continue;
+ }
+
if (ret) {
/* -EINVAL means the domain is incompatible with the device. */
if (ret == -EINVAL) {
@@ -349,6 +356,7 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
return false;
} else {
vbasedev->hwpt = hwpt;
+ vbasedev->cpr.hwpt_id = hwpt->hwpt_id;
QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
return true;
@@ -371,6 +379,11 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
}
+ if (cpr_is_incoming()) {
+ hwpt_id = vbasedev->cpr.hwpt_id;
+ goto skip_alloc;
+ }
+
if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
container->ioas_id, flags,
IOMMU_HWPT_DATA_NONE, 0, NULL,
@@ -378,19 +391,20 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
return false;
}
+ ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt_id, errp);
+ if (ret) {
+ iommufd_backend_free_id(container->be, hwpt_id);
+ return false;
+ }
+
+skip_alloc:
hwpt = g_malloc0(sizeof(*hwpt));
hwpt->hwpt_id = hwpt_id;
hwpt->hwpt_flags = flags;
QLIST_INIT(&hwpt->device_list);
- ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
- if (ret) {
- iommufd_backend_free_id(container->be, hwpt->hwpt_id);
- g_free(hwpt);
- return false;
- }
-
vbasedev->hwpt = hwpt;
+ vbasedev->cpr.hwpt_id = hwpt->hwpt_id;
vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
--
1.8.3.1
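The search loop changed by this patch can be read as a three-way decision per existing hwpt: attach on a normal start, reuse on incoming CPR when the saved id matches, and skip otherwise. The following is a minimal standalone sketch of that decision, not the QEMU code. ToyHwpt, toy_find_hwpt, and the "compatible" flag are hypothetical names introduced only for illustration, and the error handling of the real function (which distinguishes -EINVAL from other failures) is deliberately simplified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of container->hwpt_list. */
typedef struct {
    uint32_t hwpt_id;
    bool compatible;    /* toy flag: would a real attach succeed? */
} ToyHwpt;

/*
 * Return the index of the hwpt the device ends up using, or -1 if no
 * existing entry can be used.  *attach_calls counts how many times the
 * simulated attach operation would have been issued.
 */
static int toy_find_hwpt(const ToyHwpt *list, size_t n, bool incoming,
                         uint32_t saved_hwpt_id, int *attach_calls)
{
    assert(list != NULL || n == 0);
    assert(attach_calls != NULL);

    for (size_t i = 0; i < n; i++) {
        if (!incoming) {
            /* Normal start: try to attach to each existing domain. */
            (*attach_calls)++;
            if (list[i].compatible) {
                return (int)i;
            }
        } else if (list[i].hwpt_id == saved_hwpt_id) {
            /* Incoming CPR: old QEMU already attached, so just reuse. */
            return (int)i;
        }
        /* Incompatible, or not the saved id: keep looking. */
    }
    return -1;
}
```

In this sketch an incoming CPR lookup never issues an attach, and a saved id that matches no existing entry falls through to the caller, which corresponds to the skip_alloc path in the patch where a new tracking structure is built around the saved hwpt_id.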
* RE: [PATCH V5 34/38] vfio/iommufd: reconstruct hwpt
2025-06-10 15:39 ` [PATCH V5 34/38] vfio/iommufd: reconstruct hwpt Steve Sistare
@ 2025-06-25 11:40 ` Duan, Zhenzhong
0 siblings, 0 replies; 101+ messages in thread
From: Duan, Zhenzhong @ 2025-06-25 11:40 UTC (permalink / raw)
To: Steve Sistare, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V5 34/38] vfio/iommufd: reconstruct hwpt
>
>Skip allocation of, and attachment to, hwpt_id. Recover it from CPR state.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Thanks
Zhenzhong
* [PATCH V5 35/38] vfio/iommufd: change process
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (33 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 34/38] vfio/iommufd: reconstruct hwpt Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-25 11:40 ` Duan, Zhenzhong
2025-06-10 15:39 ` [PATCH V5 36/38] iommufd: preserve DMA mappings Steve Sistare
` (3 subsequent siblings)
38 siblings, 1 reply; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Finish CPR by changing the owning process of the iommufd device in
post load.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/cpr-iommufd.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 152a661..a9e3f68 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -110,10 +110,40 @@ static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
return true;
}
+static int iommufd_cpr_pre_save(void *opaque)
+{
+ IOMMUFDBackend *be = opaque;
+ Error *local_err = NULL;
+
+ /*
+ * The process has not changed yet, but proactively call the ioctl,
+ * and it will fail if any DMA mappings are not supported.
+ */
+ if (!iommufd_change_process(be, &local_err)) {
+ error_report_err(local_err);
+ return -1;
+ }
+ return 0;
+}
+
+static int iommufd_cpr_post_load(void *opaque, int version_id)
+{
+ IOMMUFDBackend *be = opaque;
+ Error *local_err = NULL;
+
+ if (!iommufd_change_process(be, &local_err)) {
+ error_report_err(local_err);
+ return -1;
+ }
+ return 0;
+}
+
static const VMStateDescription iommufd_cpr_vmstate = {
.name = "iommufd",
.version_id = 0,
.minimum_version_id = 0,
+ .pre_save = iommufd_cpr_pre_save,
+ .post_load = iommufd_cpr_post_load,
.needed = cpr_incoming_needed,
.fields = (VMStateField[]) {
VMSTATE_END_OF_LIST()
--
1.8.3.1
* RE: [PATCH V5 35/38] vfio/iommufd: change process
2025-06-10 15:39 ` [PATCH V5 35/38] vfio/iommufd: change process Steve Sistare
@ 2025-06-25 11:40 ` Duan, Zhenzhong
2025-07-01 14:26 ` Steven Sistare
0 siblings, 1 reply; 101+ messages in thread
From: Duan, Zhenzhong @ 2025-06-25 11:40 UTC (permalink / raw)
To: Steve Sistare, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V5 35/38] vfio/iommufd: change process
>
>Finish CPR by changing the owning process of the iommufd device in
>post load.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>---
> hw/vfio/cpr-iommufd.c | 30 ++++++++++++++++++++++++++++++
> 1 file changed, 30 insertions(+)
>
>diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>index 152a661..a9e3f68 100644
>--- a/hw/vfio/cpr-iommufd.c
>+++ b/hw/vfio/cpr-iommufd.c
>@@ -110,10 +110,40 @@ static bool vfio_cpr_supported(IOMMUFDBackend *be,
>Error **errp)
> return true;
> }
>
>+static int iommufd_cpr_pre_save(void *opaque)
>+{
>+ IOMMUFDBackend *be = opaque;
>+ Error *local_err = NULL;
>+
>+ /*
>+ * The process has not changed yet, but proactively call the ioctl,
>+ * and it will fail if any DMA mappings are not supported.
>+ */
>+ if (!iommufd_change_process(be, &local_err)) {
I'm confused about when to call iommufd_change_process_capable and when to call iommufd_change_process; could you clarify?
>+ error_report_err(local_err);
>+ return -1;
>+ }
>+ return 0;
>+}
>+
>+static int iommufd_cpr_post_load(void *opaque, int version_id)
>+{
>+ IOMMUFDBackend *be = opaque;
>+ Error *local_err = NULL;
>+
>+ if (!iommufd_change_process(be, &local_err)) {
>+ error_report_err(local_err);
>+ return -1;
>+ }
>+ return 0;
>+}
>+
> static const VMStateDescription iommufd_cpr_vmstate = {
> .name = "iommufd",
> .version_id = 0,
> .minimum_version_id = 0,
>+ .pre_save = iommufd_cpr_pre_save,
>+ .post_load = iommufd_cpr_post_load,
Do we need LOW priority?
> .needed = cpr_incoming_needed,
> .fields = (VMStateField[]) {
> VMSTATE_END_OF_LIST()
>--
>1.8.3.1
* Re: [PATCH V5 35/38] vfio/iommufd: change process
2025-06-25 11:40 ` Duan, Zhenzhong
@ 2025-07-01 14:26 ` Steven Sistare
2025-07-02 13:46 ` Duan, Zhenzhong
0 siblings, 1 reply; 101+ messages in thread
From: Steven Sistare @ 2025-07-01 14:26 UTC (permalink / raw)
To: Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 6/25/2025 7:40 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steve Sistare <steven.sistare@oracle.com>
>> Subject: [PATCH V5 35/38] vfio/iommufd: change process
>>
>> Finish CPR by changing the owning process of the iommufd device in
>> post load.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/cpr-iommufd.c | 30 ++++++++++++++++++++++++++++++
>> 1 file changed, 30 insertions(+)
>>
>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>> index 152a661..a9e3f68 100644
>> --- a/hw/vfio/cpr-iommufd.c
>> +++ b/hw/vfio/cpr-iommufd.c
>> @@ -110,10 +110,40 @@ static bool vfio_cpr_supported(IOMMUFDBackend *be,
>> Error **errp)
>> return true;
>> }
>>
>> +static int iommufd_cpr_pre_save(void *opaque)
>> +{
>> + IOMMUFDBackend *be = opaque;
>> + Error *local_err = NULL;
>> +
>> + /*
>> + * The process has not changed yet, but proactively call the ioctl,
>> + * and it will fail if any DMA mappings are not supported.
>> + */
>> + if (!iommufd_change_process(be, &local_err)) {
>
> I'm confused about when to call iommufd_change_process_capable and when to call iommufd_change_process; could you clarify?
Strictly speaking, we do not need iommufd_change_process_capable, because we always
try iommufd_change_process and recover on failure. But, iommufd_change_process_capable
allows us to install a migration blocker, and fail with a blocker error, which is considered
more user friendly for migration.
>> + error_report_err(local_err);
>> + return -1;
>> + }
>> + return 0;
>> +}
>> +
>> +static int iommufd_cpr_post_load(void *opaque, int version_id)
>> +{
>> + IOMMUFDBackend *be = opaque;
>> + Error *local_err = NULL;
>> +
>> + if (!iommufd_change_process(be, &local_err)) {
>> + error_report_err(local_err);
>> + return -1;
>> + }
>> + return 0;
>> +}
>> +
>> static const VMStateDescription iommufd_cpr_vmstate = {
>> .name = "iommufd",
>> .version_id = 0,
>> .minimum_version_id = 0,
>> + .pre_save = iommufd_cpr_pre_save,
>> + .post_load = iommufd_cpr_post_load,
>
> Do we need LOW priority?
No. iommufd_cpr_post_load only calls iommufd_change_process, which acts upon
mappings that are already known to the kernel, independently of vmstate.
- Steve
>> .needed = cpr_incoming_needed,
>> .fields = (VMStateField[]) {
>> VMSTATE_END_OF_LIST()
>> --
>> 1.8.3.1
>
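The distinction drawn in the reply above (an early capability check that installs a migration blocker, versus always attempting the change and recovering on failure) can be sketched with a toy model. Everything here is an assumption made for illustration: ToyBackend, the per-mapping "supported" flag, and the toy_* functions are hypothetical and do not reflect the real iommufd_change_process() or iommufd_change_process_capable() implementations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_MAPPINGS 8

typedef enum { TOY_OK, TOY_BLOCKED, TOY_PRE_SAVE_FAILED } ToyResult;

/* Hypothetical model of a backend and its DMA mappings. */
typedef struct {
    bool supported[TOY_MAX_MAPPINGS]; /* can mapping i change process? */
    size_t nr_mappings;
    int owner;                        /* toy stand-in for owning process */
    bool blocker;                     /* toy stand-in for a blocker */
} ToyBackend;

/* Check only: report whether every mapping could change process. */
static bool toy_change_process_capable(const ToyBackend *be)
{
    assert(be->nr_mappings <= TOY_MAX_MAPPINGS);
    for (size_t i = 0; i < be->nr_mappings; i++) {
        if (!be->supported[i]) {
            return false;
        }
    }
    return true;
}

/* Attempt the change; on failure nothing is modified. */
static bool toy_change_process(ToyBackend *be, int new_owner)
{
    if (!toy_change_process_capable(be)) {
        return false;
    }
    be->owner = new_owner;
    return true;
}

/* Registration time: install a blocker if the check already fails. */
static void toy_register(ToyBackend *be)
{
    be->blocker = !toy_change_process_capable(be);
}

/* Migration start: the blocker fails early, the attempt is the backstop. */
static ToyResult toy_start_migration(ToyBackend *be, int new_owner)
{
    if (be->blocker) {
        return TOY_BLOCKED;
    }
    if (!toy_change_process(be, new_owner)) {
        return TOY_PRE_SAVE_FAILED;
    }
    return TOY_OK;
}
```

In this model both paths refuse the same migrations. The difference is only where the failure surfaces: as a blocker error before anything starts, or as a pre-save failure that the caller must recover from.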
* RE: [PATCH V5 35/38] vfio/iommufd: change process
2025-07-01 14:26 ` Steven Sistare
@ 2025-07-02 13:46 ` Duan, Zhenzhong
2025-07-02 20:57 ` Steven Sistare
0 siblings, 1 reply; 101+ messages in thread
From: Duan, Zhenzhong @ 2025-07-02 13:46 UTC (permalink / raw)
To: Steven Sistare, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
>-----Original Message-----
>From: Steven Sistare <steven.sistare@oracle.com>
>Subject: Re: [PATCH V5 35/38] vfio/iommufd: change process
>
>On 6/25/2025 7:40 AM, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Steve Sistare <steven.sistare@oracle.com>
>>> Subject: [PATCH V5 35/38] vfio/iommufd: change process
>>>
>>> Finish CPR by changing the owning process of the iommufd device in
>>> post load.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>> hw/vfio/cpr-iommufd.c | 30 ++++++++++++++++++++++++++++++
>>> 1 file changed, 30 insertions(+)
>>>
>>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>>> index 152a661..a9e3f68 100644
>>> --- a/hw/vfio/cpr-iommufd.c
>>> +++ b/hw/vfio/cpr-iommufd.c
>>> @@ -110,10 +110,40 @@ static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
>>> return true;
>>> }
>>>
>>> +static int iommufd_cpr_pre_save(void *opaque)
>>> +{
>>> + IOMMUFDBackend *be = opaque;
>>> + Error *local_err = NULL;
>>> +
>>> + /*
>>> + * The process has not changed yet, but proactively call the ioctl,
>>> + * and it will fail if any DMA mappings are not supported.
>>> + */
>>> + if (!iommufd_change_process(be, &local_err)) {
>>
>> I'm confused about when to call iommufd_change_process_capable and when to
>> call iommufd_change_process; could you clarify?
>
>Strictly speaking, we do not need iommufd_change_process_capable, because we always
>try iommufd_change_process and recover on failure. But, iommufd_change_process_capable
>allows us to install a migration blocker, and fail with a blocker error, which is considered
>more user friendly for migration.
Though they have the same effect, iommufd_change_process_capable still looks better than iommufd_change_process in pre_save(), because here we want to check rather than actually change anything. Other than that, this patch looks good to me.
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Unrelated to this patch: after further thinking, I suspect that the blocker is never installed, because vfio_iommufd_cpr_register_iommufd() is called for the first VFIO device, before the memory listener is installed, so iommufd_change_process_capable() will always return true.
>
>>> + error_report_err(local_err);
>>> + return -1;
>>> + }
>>> + return 0;
>>> +}
>>> +
>>> +static int iommufd_cpr_post_load(void *opaque, int version_id)
>>> +{
>>> + IOMMUFDBackend *be = opaque;
>>> + Error *local_err = NULL;
>>> +
>>> + if (!iommufd_change_process(be, &local_err)) {
>>> + error_report_err(local_err);
>>> + return -1;
>>> + }
>>> + return 0;
>>> +}
>>> +
>>> static const VMStateDescription iommufd_cpr_vmstate = {
>>> .name = "iommufd",
>>> .version_id = 0,
>>> .minimum_version_id = 0,
>>> + .pre_save = iommufd_cpr_pre_save,
>>> + .post_load = iommufd_cpr_post_load,
>>
>> Do we need LOW priority?
>
>No. iommufd_cpr_post_load only calls iommufd_change_process, which acts upon
>mappings that are already known to the kernel, independently of vmstate.
OK.
* Re: [PATCH V5 35/38] vfio/iommufd: change process
2025-07-02 13:46 ` Duan, Zhenzhong
@ 2025-07-02 20:57 ` Steven Sistare
0 siblings, 0 replies; 101+ messages in thread
From: Steven Sistare @ 2025-07-02 20:57 UTC (permalink / raw)
To: Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 7/2/2025 9:46 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steven Sistare <steven.sistare@oracle.com>
>> Subject: Re: [PATCH V5 35/38] vfio/iommufd: change process
>>
>> On 6/25/2025 7:40 AM, Duan, Zhenzhong wrote:
>>>> -----Original Message-----
>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>> Subject: [PATCH V5 35/38] vfio/iommufd: change process
>>>>
>>>> Finish CPR by changing the owning process of the iommufd device in
>>>> post load.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>> hw/vfio/cpr-iommufd.c | 30 ++++++++++++++++++++++++++++++
>>>> 1 file changed, 30 insertions(+)
>>>>
>>>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>>>> index 152a661..a9e3f68 100644
>>>> --- a/hw/vfio/cpr-iommufd.c
>>>> +++ b/hw/vfio/cpr-iommufd.c
>>>> @@ -110,10 +110,40 @@ static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
>>>> return true;
>>>> }
>>>>
>>>> +static int iommufd_cpr_pre_save(void *opaque)
>>>> +{
>>>> + IOMMUFDBackend *be = opaque;
>>>> + Error *local_err = NULL;
>>>> +
>>>> + /*
>>>> + * The process has not changed yet, but proactively call the ioctl,
>>>> + * and it will fail if any DMA mappings are not supported.
>>>> + */
>>>> + if (!iommufd_change_process(be, &local_err)) {
>>>
>>> I'm confused about when to call iommufd_change_process_capable and when to
>>> call iommufd_change_process; could you clarify?
>>
>> Strictly speaking, we do not need iommufd_change_process_capable, because we always
>> try iommufd_change_process and recover on failure. But, iommufd_change_process_capable
>> allows us to install a migration blocker, and fail with a blocker error, which is considered
>> more user friendly for migration.
>
> Though they have the same effect, iommufd_change_process_capable still looks better than iommufd_change_process in pre_save(), because here we want to check rather than actually change anything.
Yes, I will make that change.
- Steve
> Other than that, this patch looks good to me.
>
> Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>
> Unrelated to this patch: after further thinking, I suspect that the blocker is never installed, because vfio_iommufd_cpr_register_iommufd() is called for the first VFIO device, before the memory listener is installed, so iommufd_change_process_capable() will always return true.
>
>>
>>>> + error_report_err(local_err);
>>>> + return -1;
>>>> + }
>>>> + return 0;
>>>> +}
>>>> +
>>>> +static int iommufd_cpr_post_load(void *opaque, int version_id)
>>>> +{
>>>> + IOMMUFDBackend *be = opaque;
>>>> + Error *local_err = NULL;
>>>> +
>>>> + if (!iommufd_change_process(be, &local_err)) {
>>>> + error_report_err(local_err);
>>>> + return -1;
>>>> + }
>>>> + return 0;
>>>> +}
>>>> +
>>>> static const VMStateDescription iommufd_cpr_vmstate = {
>>>> .name = "iommufd",
>>>> .version_id = 0,
>>>> .minimum_version_id = 0,
>>>> + .pre_save = iommufd_cpr_pre_save,
>>>> + .post_load = iommufd_cpr_post_load,
>>>
>>> Do we need LOW priority?
>>
>> No. iommufd_cpr_post_load only calls iommufd_change_process, which acts upon
>> mappings that are already known to the kernel, independently of vmstate.
>
> OK.
>
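The ordering concern quoted above (the capability check runs when the first device registers, before the memory listener has created any mappings) can be illustrated with a small model. This is only a sketch of the suspected sequence, under the assumption that the check passes trivially when there is nothing to inspect. ToyIoas and the toy_* helpers are hypothetical and are not QEMU or kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX 8

/* Hypothetical model of an address space and its blocker state. */
typedef struct {
    bool supported[TOY_MAX];  /* toy flag per mapping */
    size_t nr;                /* number of mappings created so far */
    bool blocker;             /* decided once, at registration time */
} ToyIoas;

/* With zero mappings the loop body never runs, so this returns true. */
static bool toy_capable(const ToyIoas *s)
{
    for (size_t i = 0; i < s->nr; i++) {
        if (!s->supported[i]) {
            return false;
        }
    }
    return true;
}

/* Step 1 in the suspected order: register while no mappings exist. */
static void toy_register_first_device(ToyIoas *s)
{
    s->blocker = !toy_capable(s);
}

/* Step 2: the listener adds mappings only after registration. */
static void toy_listener_map(ToyIoas *s, bool supported)
{
    assert(s->nr < TOY_MAX);
    s->supported[s->nr++] = supported;
}
```

If the real sequence matches this model, a mapping that cannot change process is added after the blocker decision was made, so the blocker stays clear and only the later attempt in pre_save would catch the problem.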
* [PATCH V5 36/38] iommufd: preserve DMA mappings
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (34 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 35/38] vfio/iommufd: change process Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-25 11:40 ` Duan, Zhenzhong
2025-06-10 15:39 ` [PATCH V5 37/38] vfio/container: delete old cpr register Steve Sistare
` (2 subsequent siblings)
38 siblings, 1 reply; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
During cpr-transfer load in new QEMU, the vfio_memory_listener causes
spurious calls to map and unmap DMA regions, as devices are created and
the address space is built. This memory was already mapped by the
device in old QEMU, so suppress the map and unmap callbacks during incoming
CPR.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
backends/iommufd.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/backends/iommufd.c b/backends/iommufd.c
index e02f06e..6a5566c 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -245,6 +245,10 @@ int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
.length = size,
};
+ if (cpr_is_incoming()) {
+ return 0;
+ }
+
if (!readonly) {
map.flags |= IOMMU_IOAS_MAP_WRITEABLE;
}
@@ -274,6 +278,10 @@ int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
.length = size,
};
+ if (cpr_is_incoming()) {
+ return 0;
+ }
+
ret = ioctl(fd, IOMMU_IOAS_UNMAP, &unmap);
/*
* IOMMUFD takes mapping as some kind of object, unmapping
--
1.8.3.1
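The effect of the two early returns in this patch can be sketched in isolation: while CPR is incoming, map and unmap report success without touching the kernel, so the mappings inherited from old QEMU are left exactly as they were. The model below is a hedged illustration only. ToyIommufd, its counters, and the toy_* functions are hypothetical, and the "incoming" field stands in for cpr_is_incoming().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the backend and the kernel's view of it. */
typedef struct {
    bool incoming;    /* toy stand-in for cpr_is_incoming() */
    int ioctl_calls;  /* simulated kernel calls issued so far */
    int nr_mapped;    /* mappings the simulated kernel holds */
} ToyIommufd;

static int toy_map_dma(ToyIommufd *be)
{
    assert(be != NULL);
    if (be->incoming) {
        return 0;     /* already mapped by old QEMU: do nothing */
    }
    be->ioctl_calls++;
    be->nr_mapped++;
    return 0;
}

static int toy_unmap_dma(ToyIommufd *be)
{
    assert(be != NULL);
    if (be->incoming) {
        return 0;     /* keep the inherited mapping in place */
    }
    be->ioctl_calls++;
    if (be->nr_mapped == 0) {
        return -1;    /* nothing to unmap in this toy model */
    }
    be->nr_mapped--;
    return 0;
}
```

Once the incoming phase ends, the same calls reach the simulated kernel again, which mirrors how the listener callbacks behave normally after the load completes.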
* RE: [PATCH V5 36/38] iommufd: preserve DMA mappings
2025-06-10 15:39 ` [PATCH V5 36/38] iommufd: preserve DMA mappings Steve Sistare
@ 2025-06-25 11:40 ` Duan, Zhenzhong
0 siblings, 0 replies; 101+ messages in thread
From: Duan, Zhenzhong @ 2025-06-25 11:40 UTC (permalink / raw)
To: Steve Sistare, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V5 36/38] iommufd: preserve DMA mappings
>
>During cpr-transfer load in new QEMU, the vfio_memory_listener causes
>spurious calls to map and unmap DMA regions, as devices are created and
>the address space is built. This memory was already mapped by the
>device in old QEMU, so suppress the map and unmap callbacks during incoming
>CPR.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Thanks
Zhenzhong
* [PATCH V5 37/38] vfio/container: delete old cpr register
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (35 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 36/38] iommufd: preserve DMA mappings Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-06-25 11:40 ` Duan, Zhenzhong
2025-06-10 15:39 ` [PATCH V5 38/38] vfio: doc changes for cpr Steve Sistare
2025-06-10 17:18 ` [PATCH V5 00/38] Live update: vfio and iommufd Cédric Le Goater
38 siblings, 1 reply; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
vfio_cpr_[un]register_container are no longer used, since they were
subsumed by container type-specific registration. Delete them.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
include/hw/vfio/vfio-cpr.h | 4 ----
hw/vfio/cpr.c | 13 -------------
2 files changed, 17 deletions(-)
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index f88e4ba..5b6c960 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -44,10 +44,6 @@ void vfio_legacy_cpr_unregister_container(struct VFIOContainer *container);
int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
Error **errp);
-bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
- Error **errp);
-void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
-
bool vfio_iommufd_cpr_register_container(struct VFIOIOMMUFDContainer *container,
Error **errp);
void vfio_iommufd_cpr_unregister_container(
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index f5555ca..c97e467 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -29,19 +29,6 @@ int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
return 0;
}
-bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp)
-{
- migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
- vfio_cpr_reboot_notifier,
- MIG_MODE_CPR_REBOOT);
- return true;
-}
-
-void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
-{
- migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
-}
-
#define STRDUP_VECTOR_FD_NAME(vdev, name) \
g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
--
1.8.3.1
* RE: [PATCH V5 37/38] vfio/container: delete old cpr register
2025-06-10 15:39 ` [PATCH V5 37/38] vfio/container: delete old cpr register Steve Sistare
@ 2025-06-25 11:40 ` Duan, Zhenzhong
0 siblings, 0 replies; 101+ messages in thread
From: Duan, Zhenzhong @ 2025-06-25 11:40 UTC (permalink / raw)
To: Steve Sistare, qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V5 37/38] vfio/container: delete old cpr register
>
>vfio_cpr_[un]register_container are no longer used, since they were
>subsumed by container type-specific registration. Delete them.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Thanks
Zhenzhong
* [PATCH V5 38/38] vfio: doc changes for cpr
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (36 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 37/38] vfio/container: delete old cpr register Steve Sistare
@ 2025-06-10 15:39 ` Steve Sistare
2025-07-02 14:03 ` Steven Sistare
` (2 more replies)
2025-06-10 17:18 ` [PATCH V5 00/38] Live update: vfio and iommufd Cédric Le Goater
38 siblings, 3 replies; 101+ messages in thread
From: Steve Sistare @ 2025-06-10 15:39 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Update documentation to say that cpr-transfer supports vfio and iommufd.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
docs/devel/migration/CPR.rst | 5 ++---
qapi/migration.json | 6 ++++--
2 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/docs/devel/migration/CPR.rst b/docs/devel/migration/CPR.rst
index 7897873..0a0fd4f 100644
--- a/docs/devel/migration/CPR.rst
+++ b/docs/devel/migration/CPR.rst
@@ -152,8 +152,7 @@ cpr-transfer mode
This mode allows the user to transfer a guest to a new QEMU instance
on the same host with minimal guest pause time, by preserving guest
RAM in place, albeit with new virtual addresses in new QEMU. Devices
-and their pinned memory pages will also be preserved in a future QEMU
-release.
+and their pinned memory pages are also preserved for VFIO and IOMMUFD.
The user starts new QEMU on the same host as old QEMU, with command-
line arguments to create the same machine, plus the ``-incoming``
@@ -322,6 +321,6 @@ Futures
cpr-transfer mode is based on a capability to transfer open file
descriptors from old to new QEMU. In the future, descriptors for
-vfio, iommufd, vhost, and char devices could be transferred,
+vhost, and char devices could be transferred,
preserving those devices and their kernel state without interruption,
even if they do not explicitly support live migration.
diff --git a/qapi/migration.json b/qapi/migration.json
index 4963f6c..e8a7d3b 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -620,8 +620,10 @@
#
# @cpr-transfer: This mode allows the user to transfer a guest to a
# new QEMU instance on the same host with minimal guest pause
-# time by preserving guest RAM in place. Devices and their pinned
-# pages will also be preserved in a future QEMU release.
+# time by preserving guest RAM in place.
+#
+# Devices and their pinned pages are also preserved for VFIO and
+# IOMMUFD. (since 10.1)
#
# The user starts new QEMU on the same host as old QEMU, with
# command-line arguments to create the same machine, plus the
--
1.8.3.1
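
For readers unfamiliar with the mechanism the "Futures" hunk above refers to ("a capability to transfer open file descriptors from old to new QEMU"), the generic Unix way to hand an open descriptor to another process is an SCM_RIGHTS control message on an AF_UNIX socket. The sketch below only illustrates that general technique; it is not code from this series, and the helper names send_fd and recv_fd are hypothetical, not QEMU functions.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctl {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/*
 * Hypothetical helper: send one open descriptor over a connected
 * AF_UNIX socket. Returns 0 on success, -1 on failure.
 */
static int send_fd(int sock, int fd)
{
    char byte = 'F';    /* one data byte accompanies the control message */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fd >= 0);
    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/*
 * Hypothetical helper: receive one descriptor sent by send_fd().
 * Returns the new descriptor number in this process, or -1 on failure.
 */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same open file description as the sender's, so kernel state attached to it stays alive even after the sender closes its copy. That property is what makes it plausible to keep a device's kernel state across a handover between two processes.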
* Re: [PATCH V5 38/38] vfio: doc changes for cpr
2025-06-10 15:39 ` [PATCH V5 38/38] vfio: doc changes for cpr Steve Sistare
@ 2025-07-02 14:03 ` Steven Sistare
2025-07-02 14:49 ` Cédric Le Goater
2025-07-02 17:52 ` Fabiano Rosas
2 siblings, 0 replies; 101+ messages in thread
From: Steven Sistare @ 2025-07-02 14:03 UTC (permalink / raw)
To: qemu-devel, Peter Xu, Fabiano Rosas
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum
Peter or Fabiano, this patch needs review. It's trivial.
- Steve
On 6/10/2025 11:39 AM, Steve Sistare wrote:
> Update documentation to say that cpr-transfer supports vfio and iommufd.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> docs/devel/migration/CPR.rst | 5 ++---
> qapi/migration.json | 6 ++++--
> 2 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/docs/devel/migration/CPR.rst b/docs/devel/migration/CPR.rst
> index 7897873..0a0fd4f 100644
> --- a/docs/devel/migration/CPR.rst
> +++ b/docs/devel/migration/CPR.rst
> @@ -152,8 +152,7 @@ cpr-transfer mode
> This mode allows the user to transfer a guest to a new QEMU instance
> on the same host with minimal guest pause time, by preserving guest
> RAM in place, albeit with new virtual addresses in new QEMU. Devices
> -and their pinned memory pages will also be preserved in a future QEMU
> -release.
> +and their pinned memory pages are also preserved for VFIO and IOMMUFD.
>
> The user starts new QEMU on the same host as old QEMU, with command-
> line arguments to create the same machine, plus the ``-incoming``
> @@ -322,6 +321,6 @@ Futures
>
> cpr-transfer mode is based on a capability to transfer open file
> descriptors from old to new QEMU. In the future, descriptors for
> -vfio, iommufd, vhost, and char devices could be transferred,
> +vhost, and char devices could be transferred,
> preserving those devices and their kernel state without interruption,
> even if they do not explicitly support live migration.
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 4963f6c..e8a7d3b 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -620,8 +620,10 @@
> #
> # @cpr-transfer: This mode allows the user to transfer a guest to a
> # new QEMU instance on the same host with minimal guest pause
> -# time by preserving guest RAM in place. Devices and their pinned
> -# pages will also be preserved in a future QEMU release.
> +# time by preserving guest RAM in place.
> +#
> +# Devices and their pinned pages are also preserved for VFIO and
> +# IOMMUFD. (since 10.1)
> #
> # The user starts new QEMU on the same host as old QEMU, with
> # command-line arguments to create the same machine, plus the
* Re: [PATCH V5 38/38] vfio: doc changes for cpr
2025-06-10 15:39 ` [PATCH V5 38/38] vfio: doc changes for cpr Steve Sistare
2025-07-02 14:03 ` Steven Sistare
@ 2025-07-02 14:49 ` Cédric Le Goater
2025-07-02 17:52 ` Fabiano Rosas
2 siblings, 0 replies; 101+ messages in thread
From: Cédric Le Goater @ 2025-07-02 14:49 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 6/10/25 17:39, Steve Sistare wrote:
> Update documentation to say that cpr-transfer supports vfio and iommufd.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Thanks,
C.
> ---
> docs/devel/migration/CPR.rst | 5 ++---
> qapi/migration.json | 6 ++++--
> 2 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/docs/devel/migration/CPR.rst b/docs/devel/migration/CPR.rst
> index 7897873..0a0fd4f 100644
> --- a/docs/devel/migration/CPR.rst
> +++ b/docs/devel/migration/CPR.rst
> @@ -152,8 +152,7 @@ cpr-transfer mode
> This mode allows the user to transfer a guest to a new QEMU instance
> on the same host with minimal guest pause time, by preserving guest
> RAM in place, albeit with new virtual addresses in new QEMU. Devices
> -and their pinned memory pages will also be preserved in a future QEMU
> -release.
> +and their pinned memory pages are also preserved for VFIO and IOMMUFD.
>
> The user starts new QEMU on the same host as old QEMU, with command-
> line arguments to create the same machine, plus the ``-incoming``
> @@ -322,6 +321,6 @@ Futures
>
> cpr-transfer mode is based on a capability to transfer open file
> descriptors from old to new QEMU. In the future, descriptors for
> -vfio, iommufd, vhost, and char devices could be transferred,
> +vhost, and char devices could be transferred,
> preserving those devices and their kernel state without interruption,
> even if they do not explicitly support live migration.
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 4963f6c..e8a7d3b 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -620,8 +620,10 @@
> #
> # @cpr-transfer: This mode allows the user to transfer a guest to a
> # new QEMU instance on the same host with minimal guest pause
> -# time by preserving guest RAM in place. Devices and their pinned
> -# pages will also be preserved in a future QEMU release.
> +# time by preserving guest RAM in place.
> +#
> +# Devices and their pinned pages are also preserved for VFIO and
> +# IOMMUFD. (since 10.1)
> #
> # The user starts new QEMU on the same host as old QEMU, with
> # command-line arguments to create the same machine, plus the
* Re: [PATCH V5 38/38] vfio: doc changes for cpr
2025-06-10 15:39 ` [PATCH V5 38/38] vfio: doc changes for cpr Steve Sistare
2025-07-02 14:03 ` Steven Sistare
2025-07-02 14:49 ` Cédric Le Goater
@ 2025-07-02 17:52 ` Fabiano Rosas
2 siblings, 0 replies; 101+ messages in thread
From: Fabiano Rosas @ 2025-07-02 17:52 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Steve Sistare
Steve Sistare <steven.sistare@oracle.com> writes:
> Update documentation to say that cpr-transfer supports vfio and iommufd.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> docs/devel/migration/CPR.rst | 5 ++---
> qapi/migration.json | 6 ++++--
> 2 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/docs/devel/migration/CPR.rst b/docs/devel/migration/CPR.rst
> index 7897873..0a0fd4f 100644
> --- a/docs/devel/migration/CPR.rst
> +++ b/docs/devel/migration/CPR.rst
> @@ -152,8 +152,7 @@ cpr-transfer mode
> This mode allows the user to transfer a guest to a new QEMU instance
> on the same host with minimal guest pause time, by preserving guest
> RAM in place, albeit with new virtual addresses in new QEMU. Devices
> -and their pinned memory pages will also be preserved in a future QEMU
> -release.
> +and their pinned memory pages are also preserved for VFIO and IOMMUFD.
>
> The user starts new QEMU on the same host as old QEMU, with command-
> line arguments to create the same machine, plus the ``-incoming``
> @@ -322,6 +321,6 @@ Futures
>
> cpr-transfer mode is based on a capability to transfer open file
> descriptors from old to new QEMU. In the future, descriptors for
> -vfio, iommufd, vhost, and char devices could be transferred,
> +vhost, and char devices could be transferred,
> preserving those devices and their kernel state without interruption,
> even if they do not explicitly support live migration.
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 4963f6c..e8a7d3b 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -620,8 +620,10 @@
> #
> # @cpr-transfer: This mode allows the user to transfer a guest to a
> # new QEMU instance on the same host with minimal guest pause
> -# time by preserving guest RAM in place. Devices and their pinned
> -# pages will also be preserved in a future QEMU release.
> +# time by preserving guest RAM in place.
> +#
> +# Devices and their pinned pages are also preserved for VFIO and
> +# IOMMUFD. (since 10.1)
> #
> # The user starts new QEMU on the same host as old QEMU, with
> # command-line arguments to create the same machine, plus the
Reviewed-by: Fabiano Rosas <farosas@suse.de>
* Re: [PATCH V5 00/38] Live update: vfio and iommufd
2025-06-10 15:39 [PATCH V5 00/38] Live update: vfio and iommufd Steve Sistare
` (37 preceding siblings ...)
2025-06-10 15:39 ` [PATCH V5 38/38] vfio: doc changes for cpr Steve Sistare
@ 2025-06-10 17:18 ` Cédric Le Goater
2025-06-10 17:39 ` Cédric Le Goater
38 siblings, 1 reply; 101+ messages in thread
From: Cédric Le Goater @ 2025-06-10 17:18 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 6/10/25 17:39, Steve Sistare wrote:
> Support vfio and iommufd devices with the cpr-transfer live migration mode.
> Devices that do not support live migration can still support cpr-transfer,
> allowing live update to a new version of QEMU on the same host, with no loss
> of guest connectivity.
>
> No user-visible interfaces are added.
>
> For legacy containers:
>
> Pass vfio device descriptors to new QEMU. In new QEMU, during vfio_realize,
> skip the ioctls that configure the device, because it is already configured.
>
> Use VFIO_DMA_UNMAP_FLAG_VADDR to abandon the old VAs for DMA mapped
> regions, and use VFIO_DMA_MAP_FLAG_VADDR to register the new VA in new
> QEMU and update the locked memory accounting. The physical pages remain
> pinned, because the descriptor of the device that locked them remains open,
> so DMA to those pages continues without interruption. Mediated devices are
> not supported, however, because they require the VA to always be valid, and
> there is a brief window where no VA is registered.
>
> Save the MSI message area as part of vfio-pci vmstate, and pass the interrupt
> and notifier eventfds to new QEMU. New QEMU loads the MSI data, then the
> vfio-pci post_load handler finds the eventfds in CPR state, rebuilds vector
> data structures, and attaches the interrupts to the new KVM instance. This
> logic also applies to iommufd containers.
>
> For iommufd containers:
>
> Use IOMMU_IOAS_MAP_FILE to register memory regions for DMA when they are
> backed by a file (including a memfd), so DMA mappings do not depend on VA,
> which can differ after live update. This allows mediated devices to be
> supported.
>
> Pass the iommufd and vfio device descriptors from old to new QEMU. In new
> QEMU, during vfio_realize, skip the ioctls that configure the device, because
> it is already configured.
>
> In new QEMU, call ioctl(IOMMU_IOAS_CHANGE_PROCESS) to update mm ownership and
> locked memory accounting.
>
> Patches 3 to 8 are specific to legacy containers.
> Patches 21 to 36 are specific to iommufd containers.
> The remainder apply to both.
>
> Changes from previous versions:
> * V1 of this series contains minor changes from the "Live update: vfio" and
> "Live update: iommufd" series, mainly bug fixes and refactored patches.
>
> Changes in V2:
> * refactored various vfio code snippets into new cpr helpers
> * refactored vfio struct members into cpr-specific structures
> * refactored various small changes into their own patches
> * split complex patches. Notably:
> - split "refactor for cpr" into 5 patches
> - split "reconstruct device" into 4 patches
> * refactored vfio_connect_container using helpers and made its
> error recovery more robust.
> * moved vfio pci msi/vector/intx cpr functions to cpr.c
> * renamed "reused" to cpr_reused and cpr.reused
> * squashed vfio_cpr_[un]register_container to their call sites
> * simplified iommu_type setting after cpr
> * added cpr_open_fd and cpr_is_incoming helpers
> * removed changes from vfio_legacy_dma_map, and instead temporarily
> override dma_map and dma_unmap ops.
> * deleted error_report and returned Error to callers where possible.
> * simplified the memory_get_xlat_addr interface
> * fixed flags passed to iommufd_backend_alloc_hwpt
> * defined MIG_PRI_UNINITIALIZED
> * added maintainers
>
> Changes in V3:
> * removed cleanup patches that were already pulled
> * rebased to latest master
>
> Changes in V4:
> * added SPDX-License-Identifier
> * patch "vfio/container: preserve descriptors"
> - rewrote search loop in vfio_container_connect
> - do not return pfd from vfio_cpr_container_match
> - add helper for VFIO_GROUP_GET_DEVICE_FD
> * deleted patch "export vfio_legacy_dma_map"
> * patch "vfio/container: restore DMA vaddr"
> - deleted redundant error_report from vfio_legacy_cpr_dma_map
> - save old dma_map function
> * patch "vfio-pci: skip reset during cpr"
> - use cpr_is_incoming instead of cpr_reused
> * renamed err -> local_err in all new code
> * patch "export MSI functions"
> - renamed with vfio_pci prefix, and defined wrappers for low level
> routines instead of exporting them.
> * patch "close kvm after cpr"
> - fixed build error for !CONFIG_KVM
> * added the cpr_resave_fd helper
> * dropped patch "pass ramblock to vfio_container_dma_map", relying on
> "pass MemoryRegion" from the vfio-user series instead.
> * deleted "reused" variables, replaced with cpr_is_incoming()
> * renamed cpr_needed_for_reuse -> cpr_incoming_needed
> * rewrote patch "pci: skip reset during cpr"
> * rebased to latest master
>
> for iommufd:
> * deleted redundant error_report from iommufd_backend_map_file_dma
> * added interface doc for dma_map_file
> * check return value of cpr_open_fd
> * deleted "export iommufd_cdev_get_info_iova_range"
> * deleted "reconstruct device"
> * deleted "reconstruct hw_caps"
> * deleted "define hwpt constructors"
> * separated cpr registration for iommufd be and vfio container
> * correctly attach to multiple containers per iommufd using ioas_id
> * simplified "reconstruct hwpt" by matching against hwpt_id.
> * added patch "add vfio_device_free_name"
>
> Changes in V5:
> * dropped: vfio/pci: vfio_pci_put_device on failure
> * added: "vfio: doc changes for cpr"
> * deleted unnecessary include of vfio-cpr.h
> * fixed compilation for !CONFIG_VFIO and !CONFIG_IOMMUFD
> * misc minor changes
> * Added RB's, rebased to master
>
> Steve Sistare (38):
> migration: cpr helpers
> migration: lower handler priority
> vfio/container: register container for cpr
> vfio/container: preserve descriptors
> vfio/container: discard old DMA vaddr
> vfio/container: restore DMA vaddr
> vfio/container: mdev cpr blocker
> vfio/container: recover from unmap-all-vaddr failure
> pci: export msix_is_pending
> pci: skip reset during cpr
> vfio-pci: skip reset during cpr
> vfio/pci: vfio_pci_vector_init
> vfio/pci: vfio_notifier_init
> vfio/pci: pass vector to virq functions
> vfio/pci: vfio_notifier_init cpr parameters
> vfio/pci: vfio_notifier_cleanup
> vfio/pci: export MSI functions
> vfio-pci: preserve MSI
> vfio-pci: preserve INTx
> migration: close kvm after cpr
> migration: cpr_get_fd_param helper
> backends/iommufd: iommufd_backend_map_file_dma
> backends/iommufd: change process ioctl
> physmem: qemu_ram_get_fd_offset
> vfio/iommufd: use IOMMU_IOAS_MAP_FILE
> vfio/iommufd: invariant device name
> vfio/iommufd: add vfio_device_free_name
> vfio/iommufd: device name blocker
> vfio/iommufd: register container for cpr
> migration: vfio cpr state hook
> vfio/iommufd: cpr state
> vfio/iommufd: preserve descriptors
> vfio/iommufd: reconstruct device
> vfio/iommufd: reconstruct hwpt
> vfio/iommufd: change process
> iommufd: preserve DMA mappings
> vfio/container: delete old cpr register
> vfio: doc changes for cpr
>
> docs/devel/migration/CPR.rst | 5 +-
> qapi/migration.json | 6 +-
> hw/vfio/pci.h | 10 ++
> include/exec/cpu-common.h | 1 +
> include/hw/pci/msix.h | 1 +
> include/hw/pci/pci_device.h | 3 +
> include/hw/vfio/vfio-container-base.h | 18 +++
> include/hw/vfio/vfio-container.h | 2 +
> include/hw/vfio/vfio-cpr.h | 66 +++++++-
> include/hw/vfio/vfio-device.h | 5 +
> include/migration/cpr.h | 21 +++
> include/migration/vmstate.h | 6 +-
> include/system/iommufd.h | 7 +
> include/system/kvm.h | 1 +
> accel/kvm/kvm-all.c | 32 ++++
> accel/stubs/kvm-stub.c | 5 +
> backends/iommufd.c | 101 +++++++++++-
> hw/pci/msix.c | 2 +-
> hw/pci/pci.c | 5 +
> hw/vfio/ap.c | 2 +-
> hw/vfio/ccw.c | 2 +-
> hw/vfio/container-base.c | 9 ++
> hw/vfio/container.c | 97 +++++++++---
> hw/vfio/cpr-iommufd.c | 220 ++++++++++++++++++++++++++
> hw/vfio/cpr-legacy.c | 287 ++++++++++++++++++++++++++++++++++
> hw/vfio/cpr.c | 159 +++++++++++++++++--
> hw/vfio/device.c | 40 +++--
> hw/vfio/helpers.c | 10 ++
> hw/vfio/iommufd-stubs.c | 18 +++
> hw/vfio/iommufd.c | 81 ++++++++--
> hw/vfio/listener.c | 19 ++-
> hw/vfio/pci.c | 231 ++++++++++++++++++++-------
> hw/vfio/platform.c | 2 +-
> hw/vfio/vfio-stubs.c | 13 ++
> migration/cpr-transfer.c | 18 +++
> migration/cpr.c | 95 +++++++++--
> migration/migration.c | 1 +
> migration/savevm.c | 4 +-
> system/physmem.c | 5 +
> backends/trace-events | 2 +
> hw/vfio/meson.build | 5 +
> 41 files changed, 1482 insertions(+), 135 deletions(-)
> create mode 100644 hw/vfio/cpr-iommufd.c
> create mode 100644 hw/vfio/cpr-legacy.c
> create mode 100644 hw/vfio/iommufd-stubs.c
> create mode 100644 hw/vfio/vfio-stubs.c
>
> base-commit: bc98ffdc7577e55ab8373c579c28fe24d600c40f
Steve,
For the next vfio PR, I plan to take patches 1-17 when patch 10 is
updated. The rest is for later in this cycle.
Thanks,
C.
* Re: [PATCH V5 00/38] Live update: vfio and iommufd
2025-06-10 17:18 ` [PATCH V5 00/38] Live update: vfio and iommufd Cédric Le Goater
@ 2025-06-10 17:39 ` Cédric Le Goater
2025-06-11 14:25 ` Cédric Le Goater
0 siblings, 1 reply; 101+ messages in thread
From: Cédric Le Goater @ 2025-06-10 17:39 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
> Steve,
>
> For the next vfio PR, I plan to take patches 1-17 when patch 10 is
> updated. The rest is for later in this cycle
Applied 1-17 to vfio-next. Waiting for an Ack from Michael.
Thanks,
C.
* Re: [PATCH V5 00/38] Live update: vfio and iommufd
2025-06-10 17:39 ` Cédric Le Goater
@ 2025-06-11 14:25 ` Cédric Le Goater
2025-06-11 14:39 ` Steven Sistare
2025-06-11 14:49 ` Peter Xu
0 siblings, 2 replies; 101+ messages in thread
From: Cédric Le Goater @ 2025-06-11 14:25 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 6/10/25 19:39, Cédric Le Goater wrote:
>> Steve,
>>
>> For the next vfio PR, I plan to take patches 1-17 when patch 10 is
>> updated. The rest is for later in this cycle
>
> Applied 1-17 to vfio-next. Waiting for an Ack from Michael.
I am planning to send a PR with this first part to get more visibility.
There is a slight risk of merging useless changes since CPR is not
fully reviewed. My optimistic nature tells me it should reach QEMU 10.1
and we have time to adjust.
Please feel free to intervene if you prefer the series to be fully
approved/reviewed before merging.
Peter, Fabiano,
The first 2 patches are migration patches. Do you agree if I take them
through the VFIO queue?
Thanks,
C.
* Re: [PATCH V5 00/38] Live update: vfio and iommufd
2025-06-11 14:25 ` Cédric Le Goater
@ 2025-06-11 14:39 ` Steven Sistare
2025-06-12 7:23 ` Cédric Le Goater
2025-06-11 14:49 ` Peter Xu
1 sibling, 1 reply; 101+ messages in thread
From: Steven Sistare @ 2025-06-11 14:39 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 6/11/2025 10:25 AM, Cédric Le Goater wrote:
> On 6/10/25 19:39, Cédric Le Goater wrote:
>>> Steve,
>>>
>>> For the next vfio PR, I plan to take patches 1-17 when patch 10 is
>>> updated. The rest is for later in this cycle
>>
>> Applied 1-17 to vfio-next. Waiting for an Ack from Michael.
>
> I am planning to send a PR with this first part to get more visibility.
> There is a slight risk of merging useless changes since CPR is not
> fully reviewed. My optimistic nature tells me it should reach QEMU 10.1
> and we have time to adjust.
>
> Please feel free to intervene if you prefer the series to be fully
> approved/reviewed before merging.
A partial merge is fine with me.
- Steve
>
> Peter, Fabiano,
>
> The first 2 patches are migration patches. Do you agree if I take them
> through the VFIO queue?
>
> Thanks,
>
> C.
>
>
* Re: [PATCH V5 00/38] Live update: vfio and iommufd
2025-06-11 14:39 ` Steven Sistare
@ 2025-06-12 7:23 ` Cédric Le Goater
2025-06-19 12:03 ` Cédric Le Goater
0 siblings, 1 reply; 101+ messages in thread
From: Cédric Le Goater @ 2025-06-12 7:23 UTC (permalink / raw)
To: Steven Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 6/11/25 16:39, Steven Sistare wrote:
> On 6/11/2025 10:25 AM, Cédric Le Goater wrote:
>> On 6/10/25 19:39, Cédric Le Goater wrote:
>>>> Steve,
>>>>
>>>> For the next vfio PR, I plan to take patches 1-17 when patch 10 is
>>>> updated. The rest is for later in this cycle
>>>
>>> Applied 1-17 to vfio-next. Waiting for an Ack from Michael.
>>
>> I am planning to send a PR with this first part to get more visibility.
>> There is a slight risk of merging useless changes since CPR is not
>> fully reviewed. My optimistic nature tells me it should reach QEMU 10.1
>> and we have time to adjust.
>>
>> Please feel free to intervene if you prefer the series to be fully
>> approved/reviewed before merging.
>
> A partial merge is fine with me.
Now merged. I am waiting for some feedback on the IOMMUFD part.
Thanks,
C.
* Re: [PATCH V5 00/38] Live update: vfio and iommufd
2025-06-12 7:23 ` Cédric Le Goater
@ 2025-06-19 12:03 ` Cédric Le Goater
2025-06-20 5:46 ` Duan, Zhenzhong
0 siblings, 1 reply; 101+ messages in thread
From: Cédric Le Goater @ 2025-06-19 12:03 UTC (permalink / raw)
To: Steven Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
Zhenzhong, Eric,
On 6/12/25 09:23, Cédric Le Goater wrote:
> On 6/11/25 16:39, Steven Sistare wrote:
>> On 6/11/2025 10:25 AM, Cédric Le Goater wrote:
>>> On 6/10/25 19:39, Cédric Le Goater wrote:
>>>>> Steve,
>>>>>
>>>>> For the next vfio PR, I plan to take patches 1-17 when patch 10 is
>>>>> updated. The rest is for later in this cycle
>>>>
>>>> Applied 1-17 to vfio-next. Waiting for an Ack from Michael.
>>>
>>> I am planning to send a PR with this first part to get more visibility.
>>> There is a slight risk of merging useless changes since CPR is not
>>> fully reviewed. My optimistic nature tells me it should reach QEMU 10.1
>>> and we have time to adjust.
>>>
>>> Please feel free to intervene if you prefer the series to be fully
>>> approved/reviewed before merging.
>>
>> A partial merge is fine with me.
>
>
> Now merged. I am waiting for some feedback on the IOMMUFD part.
Would you have some time in June to take a look?
Thanks,
C.
* RE: [PATCH V5 00/38] Live update: vfio and iommufd
2025-06-19 12:03 ` Cédric Le Goater
@ 2025-06-20 5:46 ` Duan, Zhenzhong
0 siblings, 0 replies; 101+ messages in thread
From: Duan, Zhenzhong @ 2025-06-20 5:46 UTC (permalink / raw)
To: Cédric Le Goater, Steven Sistare, qemu-devel@nongnu.org
Cc: Alex Williamson, Liu, Yi L, Eric Auger, Michael S. Tsirkin,
Marcel Apfelbaum, Peter Xu, Fabiano Rosas
Hi Cédric,
>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH V5 00/38] Live update: vfio and iommufd
>
>Zhenzhong, Eric,
>
>On 6/12/25 09:23, Cédric Le Goater wrote:
>> On 6/11/25 16:39, Steven Sistare wrote:
>>> On 6/11/2025 10:25 AM, Cédric Le Goater wrote:
>>>> On 6/10/25 19:39, Cédric Le Goater wrote:
>>>>>> Steve,
>>>>>>
>>>>>> For the next vfio PR, I plan to take patches 1-17 when patch 10 is
>>>>>> updated. The rest is for later in this cycle
>>>>>
>>>>> Applied 1-17 to vfio-next. Waiting for an Ack from Michael.
>>>>
>>>> I am planning to send a PR with this first part to get more visibility.
>>>> There is a slight risk of merging useless changes since CPR is not
>>>> fully reviewed. My optimistic nature tells me it should reach QEMU 10.1
>>>> and we have time to adjust.
>>>>
>>>> Please feel free to intervene if you prefer the series to be fully
>>>> approved/reviewed before merging.
>>>
>>> A partial merge is fine with me.
>>
>>
>> Now merged. I am waiting for some feedback on the IOMMUFD part.
>
>Would you have some time in June to take a look?
Sorry, I'm short of bandwidth this week; I'll review it next week.
Thanks
Zhenzhong
* Re: [PATCH V5 00/38] Live update: vfio and iommufd
2025-06-11 14:25 ` Cédric Le Goater
2025-06-11 14:39 ` Steven Sistare
@ 2025-06-11 14:49 ` Peter Xu
1 sibling, 0 replies; 101+ messages in thread
From: Peter Xu @ 2025-06-11 14:49 UTC (permalink / raw)
To: Cédric Le Goater
Cc: Steve Sistare, qemu-devel, Alex Williamson, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum,
Fabiano Rosas
On Wed, Jun 11, 2025 at 04:25:13PM +0200, Cédric Le Goater wrote:
> Peter, Fabiano,
>
> The first 2 patches are migration patches. Do you agree if I take them
> through the VFIO queue?
Yep, please go ahead.
--
Peter Xu