* [PATCH V1 00/26] Live update: vfio and iommufd
@ 2025-01-29 14:42 Steve Sistare
2025-01-29 14:42 ` [PATCH V1 01/26] migration: cpr helpers Steve Sistare
` (25 more replies)
0 siblings, 26 replies; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:42 UTC
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Support vfio and iommufd devices with the cpr-transfer live migration mode.
Devices that do not support live migration can still support cpr-transfer,
allowing live update to a new version of QEMU on the same host, with no loss
of guest connectivity.
No user-visible interfaces are added.
For legacy containers:
Pass vfio device descriptors to new QEMU. In new QEMU, during vfio_realize,
skip the ioctls that configure the device, because it is already configured.
Use VFIO_DMA_UNMAP_FLAG_VADDR to abandon the old VAs for DMA-mapped
regions, and use VFIO_DMA_MAP_FLAG_VADDR to register the new VA in new
QEMU and update the locked memory accounting. The physical pages remain
pinned, because the descriptor of the device that locked them remains open,
so DMA to those pages continues without interruption. Mediated devices are
not supported, however, because they require the VA to always be valid, and
there is a brief window where no VA is registered.
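The VA handoff described above can be pictured with a small toy model. This is only a sketch: it does not call the kernel, the ToyMapping type and toy_* functions are hypothetical, and the exact-match rule on iova and size is an assumption of the model. It shows that pages stay pinned while the VA is briefly invalid, which is the window that rules out mediated devices.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for one kernel DMA mapping; not the real vfio structure. */
typedef struct {
    uint64_t iova;
    uint64_t size;
    uint64_t vaddr;
    bool vaddr_valid;   /* false between the unmap-vaddr and map-vaddr steps */
    bool pinned;        /* pages stay pinned for the whole handoff */
} ToyMapping;

/* Old QEMU: models VFIO_DMA_UNMAP_FLAG_VADDR applied to every mapping. */
static void toy_unmap_vaddr_all(ToyMapping *maps, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        maps[i].vaddr_valid = false;    /* iova and pinning are untouched */
    }
}

/*
 * New QEMU: models VFIO_DMA_MAP_FLAG_VADDR.  In this model the iova and
 * size must match an existing mapping exactly; only the vaddr is replaced.
 */
static bool toy_map_vaddr(ToyMapping *maps, size_t n,
                          uint64_t iova, uint64_t size, uint64_t new_vaddr)
{
    for (size_t i = 0; i < n; i++) {
        if (maps[i].iova == iova && maps[i].size == size) {
            maps[i].vaddr = new_vaddr;
            maps[i].vaddr_valid = true;
            return true;
        }
    }
    return false;   /* no matching mapping: the model rejects the call */
}

/* A mediated device would need this to hold at all times. */
static bool toy_all_vaddr_valid(const ToyMapping *maps, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!maps[i].vaddr_valid) {
            return false;
        }
    }
    return true;
}
```

Calling toy_unmap_vaddr_all() and then toy_map_vaddr() once per mapping reproduces the sequence: between the two steps toy_all_vaddr_valid() is false while every mapping is still pinned.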
Save the MSI message area as part of vfio-pci vmstate, and pass the interrupt
and notifier eventfds to new QEMU. New QEMU loads the MSI data, then the
vfio-pci post_load handler finds the eventfds in CPR state, rebuilds vector
data structures, and attaches the interrupts to the new KVM instance. This
logic also applies to iommufd containers.
For iommufd containers:
Use IOMMU_IOAS_MAP_FILE to register memory regions for DMA when they are
backed by a file (including a memfd), so DMA mappings do not depend on VA,
which can differ after live update. This allows mediated devices to be
supported.
Pass the iommufd and vfio device descriptors from old to new QEMU. In new
QEMU, during vfio_realize, skip the ioctls that configure the device, because
it is already configured.
In new QEMU, call ioctl(IOMMU_IOAS_CHANGE_PROCESS) to update mm ownership and
locked memory accounting.
Patches 3 to 7 are specific to legacy containers.
Patches 15 to 26 are specific to iommufd containers.
Patches 1, 2 and 8 to 14 apply to both.
Changes from previous versions:
* This series contains minor changes from the "Live update: vfio" and
"Live update: iommufd" series, mainly bug fixes and refactored patches.
Steve Sistare (26):
migration: cpr helpers
migration: lower handler priority
vfio: vfio_find_ram_discard_listener
vfio/container: register container for cpr
vfio/container: preserve descriptors
vfio/container: preserve DMA mappings
vfio/container: recover from unmap-all-vaddr failure
pci: skip reset during cpr
pci: export msix_is_pending
vfio-pci: refactor for cpr
vfio-pci: skip reset during cpr
vfio-pci: preserve MSI
vfio-pci: preserve INTx
migration: close kvm after cpr
migration: cpr_get_fd_param helper
vfio: return mr from vfio_get_xlat_addr
vfio: pass ramblock to vfio_container_dma_map
vfio/iommufd: define iommufd_cdev_make_hwpt
vfio/iommufd: use IOMMU_IOAS_MAP_FILE
vfio/iommufd: export iommufd_cdev_get_info_iova_range
iommufd: change process ioctl
vfio/iommufd: invariant device name
vfio/iommufd: register container for cpr
vfio/iommufd: preserve descriptors
vfio/iommufd: reconstruct device
iommufd: preserve DMA mappings
accel/kvm/kvm-all.c | 20 +++
backends/iommufd.c | 88 +++++++++-
backends/trace-events | 2 +
hw/pci/msix.c | 2 +-
hw/pci/pci.c | 13 ++
hw/vfio/common.c | 108 +++++++++---
hw/vfio/container-base.c | 12 +-
hw/vfio/container.c | 155 ++++++++++++++---
hw/vfio/cpr-iommufd.c | 161 ++++++++++++++++++
hw/vfio/cpr-legacy.c | 161 ++++++++++++++++++
hw/vfio/helpers.c | 28 ++--
hw/vfio/iommufd.c | 156 ++++++++++++-----
hw/vfio/meson.build | 4 +-
hw/vfio/pci.c | 307 +++++++++++++++++++++++++++++-----
hw/vfio/trace-events | 1 +
hw/virtio/vhost-vdpa.c | 2 +-
include/exec/cpu-common.h | 1 +
include/exec/memory.h | 5 +-
include/hw/pci/msix.h | 1 +
include/hw/vfio/vfio-common.h | 25 +++
include/hw/vfio/vfio-container-base.h | 6 +-
include/migration/cpr.h | 7 +
include/migration/vmstate.h | 3 +-
include/system/iommufd.h | 6 +
include/system/kvm.h | 1 +
migration/cpr-transfer.c | 18 ++
migration/cpr.c | 70 ++++++++
migration/migration.c | 1 +
migration/savevm.c | 4 +-
system/memory.c | 8 +-
system/physmem.c | 5 +
31 files changed, 1230 insertions(+), 151 deletions(-)
create mode 100644 hw/vfio/cpr-iommufd.c
create mode 100644 hw/vfio/cpr-legacy.c
--
1.8.3.1
* [PATCH V1 01/26] migration: cpr helpers
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
@ 2025-01-29 14:42 ` Steve Sistare
2025-01-29 14:42 ` [PATCH V1 02/26] migration: lower handler priority Steve Sistare
` (24 subsequent siblings)
25 siblings, 0 replies; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:42 UTC
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Add the cpr_needed_for_reuse and cpr_resave_fd helpers, for use when
adding cpr support for vfio and iommufd.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
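A standalone sketch of the save/find/resave semantics, for readers who want to try the behavior outside QEMU. The ToyCprFd table and toy_* functions are hypothetical stand-ins, not the QEMU implementation; in particular, the toy resave returns false on a conflicting value, where the patch raises a fatal error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAX_FDS 16

/* One saved descriptor, keyed by (name, id). */
typedef struct {
    char name[64];
    int id;
    int fd;
    bool used;
} ToyCprFd;

static ToyCprFd toy_fds[TOY_MAX_FDS];

static ToyCprFd *toy_find(const char *name, int id)
{
    for (int i = 0; i < TOY_MAX_FDS; i++) {
        if (toy_fds[i].used && toy_fds[i].id == id &&
            strcmp(toy_fds[i].name, name) == 0) {
            return &toy_fds[i];
        }
    }
    return NULL;
}

/* Returns the saved fd, or -1 when (name, id) was never saved. */
static int toy_find_fd(const char *name, int id)
{
    ToyCprFd *elem = toy_find(name, id);
    return elem ? elem->fd : -1;
}

static void toy_save_fd(const char *name, int id, int fd)
{
    for (int i = 0; i < TOY_MAX_FDS; i++) {
        if (!toy_fds[i].used) {
            snprintf(toy_fds[i].name, sizeof(toy_fds[i].name), "%s", name);
            toy_fds[i].id = id;
            toy_fds[i].fd = fd;
            toy_fds[i].used = true;
            return;
        }
    }
    abort();    /* table full; the toy model has no growth path */
}

/*
 * Idempotent save: add the entry if it is missing, accept a repeat of the
 * same value, and report a conflict if a different value is already saved.
 */
static bool toy_resave_fd(const char *name, int id, int fd)
{
    int old_fd = toy_find_fd(name, id);

    if (old_fd < 0) {
        toy_save_fd(name, id, fd);
        return true;
    }
    return old_fd == fd;
}
```

The point of the resave form is that a realize path can call it unconditionally: on a fresh start it records the new descriptor, and on a reused start it is a no-op because the inherited value is already present.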
include/migration/cpr.h | 3 +++
migration/cpr.c | 21 +++++++++++++++++++++
2 files changed, 24 insertions(+)
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 3a6deb7..faeb0cc 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -18,6 +18,7 @@
void cpr_save_fd(const char *name, int id, int fd);
void cpr_delete_fd(const char *name, int id);
int cpr_find_fd(const char *name, int id);
+void cpr_resave_fd(const char *name, int id, int fd);
MigMode cpr_get_incoming_mode(void);
void cpr_set_incoming_mode(MigMode mode);
@@ -27,6 +28,8 @@ int cpr_state_load(MigrationChannel *channel, Error **errp);
void cpr_state_close(void);
struct QIOChannel *cpr_state_ioc(void);
+bool cpr_needed_for_reuse(void *opaque);
+
QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
diff --git a/migration/cpr.c b/migration/cpr.c
index 584b0b9..e3f27e9 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -95,6 +95,21 @@ int cpr_find_fd(const char *name, int id)
trace_cpr_find_fd(name, id, fd);
return fd;
}
+
+void cpr_resave_fd(const char *name, int id, int fd)
+{
+ CprFd *elem = find_fd(&cpr_state.fds, name, id);
+ int old_fd = elem ? elem->fd : -1;
+
+ if (old_fd < 0) {
+ cpr_save_fd(name, id, fd);
+ } else if (old_fd != fd) {
+ error_setg(&error_fatal,
+ "internal error: cpr fd '%s' id %d value %d "
+ "already saved with a different value %d",
+ name, id, fd, old_fd);
+ }
+}
/*************************************************************************/
#define CPR_STATE "CprState"
@@ -222,3 +237,9 @@ void cpr_state_close(void)
cpr_state_file = NULL;
}
}
+
+bool cpr_needed_for_reuse(void *opaque)
+{
+ MigMode mode = migrate_mode();
+ return mode == MIG_MODE_CPR_TRANSFER;
+}
--
1.8.3.1
* [PATCH V1 02/26] migration: lower handler priority
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
2025-01-29 14:42 ` [PATCH V1 01/26] migration: cpr helpers Steve Sistare
@ 2025-01-29 14:42 ` Steve Sistare
2025-02-03 16:21 ` Fabiano Rosas
2025-02-03 16:58 ` Peter Xu
2025-01-29 14:42 ` [PATCH V1 03/26] vfio: vfio_find_ram_discard_listener Steve Sistare
` (23 subsequent siblings)
25 siblings, 2 replies; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:42 UTC
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Define a vmstate priority that is lower than the default, so its handlers
run after all default priority handlers. Since 0 is no longer the default
priority, translate an uninitialized priority of 0 to MIG_PRI_DEFAULT.
CPR for vfio will use this to install handlers for containers that run
after handlers for the devices that they contain.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
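A small sketch of the ordering rule, assuming handlers run in descending priority order. The ToyPriority enum and toy_* names are hypothetical, and qsort is used only for brevity (QEMU keeps its handler list ordered at insertion time rather than sorting it). An unset priority of 0 is translated to the default, so it still sorts above the new low level.

```c
#include <assert.h>
#include <stdlib.h>

/* Mirrors the shape of the patched enum; only three levels are modelled. */
typedef enum {
    TOY_PRI_UNSET = 0,      /* what a zero-initialized descriptor carries */
    TOY_PRI_LOW = 1,        /* runs after default */
    TOY_PRI_DEFAULT,
    TOY_PRI_IOMMU,
} ToyPriority;

typedef struct {
    const char *name;
    ToyPriority priority;
} ToyHandler;

/* Translate an uninitialized priority of 0 to the default level. */
static ToyPriority toy_effective_priority(const ToyHandler *h)
{
    return h->priority ? h->priority : TOY_PRI_DEFAULT;
}

/* qsort comparator: higher effective priority sorts first. */
static int toy_cmp(const void *a, const void *b)
{
    int pa = toy_effective_priority(a);
    int pb = toy_effective_priority(b);

    return pb - pa;
}

/* Note: qsort is not stable, so ties may come out in any order. */
static void toy_sort_handlers(ToyHandler *h, size_t n)
{
    qsort(h, n, sizeof(*h), toy_cmp);
}
```

With a container handler at the low level and a device handler left at 0, sorting places the device first and the container last, which is the ordering the commit message asks for.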
include/migration/vmstate.h | 3 ++-
migration/savevm.c | 4 ++--
2 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index a1dfab4..3055a46 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -155,7 +155,8 @@ enum VMStateFlags {
};
typedef enum {
- MIG_PRI_DEFAULT = 0,
+ MIG_PRI_LOW = 1, /* Must happen after default */
+ MIG_PRI_DEFAULT,
MIG_PRI_IOMMU, /* Must happen before PCI devices */
MIG_PRI_PCI_BUS, /* Must happen before IOMMU */
MIG_PRI_VIRTIO_MEM, /* Must happen before IOMMU */
diff --git a/migration/savevm.c b/migration/savevm.c
index 264bc06..5dd2dc4 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -232,7 +232,7 @@ typedef struct SaveState {
static SaveState savevm_state = {
.handlers = QTAILQ_HEAD_INITIALIZER(savevm_state.handlers),
- .handler_pri_head = { [MIG_PRI_DEFAULT ... MIG_PRI_MAX] = NULL },
+ .handler_pri_head = { [0 ... MIG_PRI_MAX] = NULL },
.global_section_id = 0,
};
@@ -704,7 +704,7 @@ static int calculate_compat_instance_id(const char *idstr)
static inline MigrationPriority save_state_priority(SaveStateEntry *se)
{
- if (se->vmsd) {
+ if (se->vmsd && se->vmsd->priority) {
return se->vmsd->priority;
}
return MIG_PRI_DEFAULT;
--
1.8.3.1
* [PATCH V1 03/26] vfio: vfio_find_ram_discard_listener
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
2025-01-29 14:42 ` [PATCH V1 01/26] migration: cpr helpers Steve Sistare
2025-01-29 14:42 ` [PATCH V1 02/26] migration: lower handler priority Steve Sistare
@ 2025-01-29 14:42 ` Steve Sistare
2025-02-03 16:57 ` Cédric Le Goater
2025-01-29 14:43 ` [PATCH V1 04/26] vfio/container: register container for cpr Steve Sistare
` (22 subsequent siblings)
25 siblings, 1 reply; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:42 UTC
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Define vfio_find_ram_discard_listener as a subroutine so additional calls to
it may be added in a subsequent patch.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/common.c | 35 ++++++++++++++++++++++-------------
1 file changed, 22 insertions(+), 13 deletions(-)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index f7499a9..7370332 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -555,6 +555,26 @@ static bool vfio_get_section_iova_range(VFIOContainerBase *bcontainer,
return true;
}
+static VFIORamDiscardListener *vfio_find_ram_discard_listener(
+ VFIOContainerBase *bcontainer, MemoryRegionSection *section)
+{
+ VFIORamDiscardListener *vrdl = NULL;
+
+ QLIST_FOREACH(vrdl, &bcontainer->vrdl_list, next) {
+ if (vrdl->mr == section->mr &&
+ vrdl->offset_within_address_space ==
+ section->offset_within_address_space) {
+ break;
+ }
+ }
+
+ if (!vrdl) {
+ hw_error("vfio: Trying to sync missing RAM discard listener");
+ /* does not return */
+ }
+ return vrdl;
+}
+
static void vfio_listener_region_add(MemoryListener *listener,
MemoryRegionSection *section)
{
@@ -1266,19 +1286,8 @@ vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainerBase *bcontainer,
MemoryRegionSection *section)
{
RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
- VFIORamDiscardListener *vrdl = NULL;
-
- QLIST_FOREACH(vrdl, &bcontainer->vrdl_list, next) {
- if (vrdl->mr == section->mr &&
- vrdl->offset_within_address_space ==
- section->offset_within_address_space) {
- break;
- }
- }
-
- if (!vrdl) {
- hw_error("vfio: Trying to sync missing RAM discard listener");
- }
+ VFIORamDiscardListener *vrdl =
+ vfio_find_ram_discard_listener(bcontainer, section);
/*
* We only want/can synchronize the bitmap for actually mapped parts -
--
1.8.3.1
* [PATCH V1 04/26] vfio/container: register container for cpr
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (2 preceding siblings ...)
2025-01-29 14:42 ` [PATCH V1 03/26] vfio: vfio_find_ram_discard_listener Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
2025-02-03 17:01 ` Cédric Le Goater
2025-01-29 14:43 ` [PATCH V1 05/26] vfio/container: preserve descriptors Steve Sistare
` (21 subsequent siblings)
25 siblings, 1 reply; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Register a legacy container for cpr-transfer. Add a blocker if the kernel
does not support VFIO_UPDATE_VADDR or VFIO_UNMAP_ALL.
This is mostly boilerplate. The fields to be saved and restored are added
in subsequent patches.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/container.c | 6 ++--
hw/vfio/cpr-legacy.c | 68 +++++++++++++++++++++++++++++++++++++++++++
hw/vfio/meson.build | 3 +-
include/hw/vfio/vfio-common.h | 3 ++
4 files changed, 76 insertions(+), 4 deletions(-)
create mode 100644 hw/vfio/cpr-legacy.c
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 4ebb526..a90ce6c 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -618,7 +618,7 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
}
bcontainer = &container->bcontainer;
- if (!vfio_cpr_register_container(bcontainer, errp)) {
+ if (!vfio_legacy_cpr_register_container(container, errp)) {
goto free_container_exit;
}
@@ -666,7 +666,7 @@ enable_discards_exit:
vfio_ram_block_discard_disable(container, false);
unregister_container_exit:
- vfio_cpr_unregister_container(bcontainer);
+ vfio_legacy_cpr_unregister_container(container);
free_container_exit:
object_unref(container);
@@ -710,7 +710,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
VFIOAddressSpace *space = bcontainer->space;
trace_vfio_disconnect_container(container->fd);
- vfio_cpr_unregister_container(bcontainer);
+ vfio_legacy_cpr_unregister_container(container);
close(container->fd);
object_unref(container);
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
new file mode 100644
index 0000000..d3bbc05
--- /dev/null
+++ b/hw/vfio/cpr-legacy.c
@@ -0,0 +1,68 @@
+/*
+ * Copyright (c) 2021-2025 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#include "hw/vfio/vfio-common.h"
+#include "migration/blocker.h"
+#include "migration/cpr.h"
+#include "migration/migration.h"
+#include "migration/vmstate.h"
+#include "qapi/error.h"
+
+static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
+{
+ if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
+ error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR");
+ return false;
+
+ } else if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
+ error_setg(errp, "VFIO container does not support VFIO_UNMAP_ALL");
+ return false;
+
+ } else {
+ return true;
+ }
+}
+
+static const VMStateDescription vfio_container_vmstate = {
+ .name = "vfio-container",
+ .version_id = 0,
+ .minimum_version_id = 0,
+ .needed = cpr_needed_for_reuse,
+ .fields = (VMStateField[]) {
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
+{
+ VFIOContainerBase *bcontainer = &container->bcontainer;
+ Error **cpr_blocker = &container->cpr_blocker;
+
+ if (!vfio_cpr_register_container(bcontainer, errp)) {
+ return false;
+ }
+
+ if (!vfio_cpr_supported(container, cpr_blocker)) {
+ return migrate_add_blocker_modes(cpr_blocker, errp,
+ MIG_MODE_CPR_TRANSFER, -1) == 0;
+ }
+
+ vmstate_register(NULL, -1, &vfio_container_vmstate, container);
+
+ return true;
+}
+
+void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
+{
+ VFIOContainerBase *bcontainer = &container->bcontainer;
+
+ vfio_cpr_unregister_container(bcontainer);
+ migrate_del_blocker(&container->cpr_blocker);
+ vmstate_unregister(NULL, &vfio_container_vmstate, container);
+}
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index bba776f..5487815 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -5,13 +5,14 @@ vfio_ss.add(files(
'container-base.c',
'container.c',
'migration.c',
- 'cpr.c',
))
vfio_ss.add(when: 'CONFIG_PSERIES', if_true: files('spapr.c'))
vfio_ss.add(when: 'CONFIG_IOMMUFD', if_true: files(
'iommufd.c',
))
vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
+ 'cpr.c',
+ 'cpr-legacy.c',
'display.c',
'pci-quirks.c',
'pci.c',
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 0c60be5..53e554f 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -84,6 +84,7 @@ typedef struct VFIOContainer {
VFIOContainerBase bcontainer;
int fd; /* /dev/vfio/vfio, empowered by the attached groups */
unsigned iommu_type;
+ Error *cpr_blocker;
QLIST_HEAD(, VFIOGroup) group_list;
} VFIOContainer;
@@ -258,6 +259,8 @@ int vfio_kvm_device_del_fd(int fd, Error **errp);
bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp);
void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer);
+bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp);
+void vfio_legacy_cpr_unregister_container(VFIOContainer *container);
extern const MemoryRegionOps vfio_region_ops;
typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
--
1.8.3.1
* [PATCH V1 05/26] vfio/container: preserve descriptors
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (3 preceding siblings ...)
2025-01-29 14:43 ` [PATCH V1 04/26] vfio/container: register container for cpr Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
2025-02-03 17:48 ` Cédric Le Goater
2025-01-29 14:43 ` [PATCH V1 06/26] vfio/container: preserve DMA mappings Steve Sistare
` (20 subsequent siblings)
25 siblings, 1 reply; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
At vfio creation time, save the value of vfio container, group, and device
descriptors in CPR state. On qemu restart, vfio_realize() finds and uses
the saved descriptors, and remembers the reused status for subsequent
patches. The reused status is cleared when vmstate load finishes.
During reuse, device and iommu state is already configured, so operations
in vfio_realize that would modify the configuration, such as vfio ioctl's,
are skipped. The result is that vfio_realize constructs qemu data
structures that reflect the current state of the device.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
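The find-or-open pattern used for each descriptor can be sketched on its own. Everything here is hypothetical: the callback types stand in for the lookup and open steps, the stub functions return fixed values, and toy_realize only counts the configuration steps it would have run. Note that a found descriptor may legitimately be 0, so the reuse test is fd >= 0.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical callbacks standing in for the lookup and open steps. */
typedef int (*ToyFindFn)(const char *name, int id);
typedef int (*ToyOpenFn)(const char *name, int id);

typedef struct {
    int fd;
    bool reused;    /* true: kernel state already configured, skip ioctls */
} ToyDescriptor;

/* Prefer a descriptor inherited from old QEMU; otherwise open a new one. */
static ToyDescriptor toy_get_descriptor(const char *name, int id,
                                        ToyFindFn find, ToyOpenFn open_fn)
{
    ToyDescriptor d;

    d.fd = find(name, id);
    d.reused = (d.fd >= 0);     /* fd 0 is a valid descriptor */
    if (!d.reused) {
        d.fd = open_fn(name, id);
    }
    return d;
}

/* Returns the number of configuration steps run, or -1 on a bad fd. */
static int toy_realize(const ToyDescriptor *d)
{
    int configured = 0;

    if (d->fd < 0) {
        return -1;
    }
    if (!d->reused) {
        configured++;   /* stands in for an ioctl that configures the device */
    }
    return configured;
}

/* Example stubs: nothing saved, fd 0 saved, and a fresh open yielding 5. */
static int toy_find_none(const char *name, int id)
{
    (void)name; (void)id;
    return -1;
}

static int toy_find_saved(const char *name, int id)
{
    (void)name; (void)id;
    return 0;
}

static int toy_open_fresh(const char *name, int id)
{
    (void)name; (void)id;
    return 5;
}
```

On a fresh start the descriptor is opened and the configuration step runs once; on a reused start the inherited descriptor is returned and realize only rebuilds local state.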
hw/vfio/container.c | 105 ++++++++++++++++++++++++++++++++++--------
hw/vfio/cpr-legacy.c | 17 +++++++
include/hw/vfio/vfio-common.h | 2 +
3 files changed, 105 insertions(+), 19 deletions(-)
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index a90ce6c..81d0ccc 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -31,6 +31,7 @@
#include "system/reset.h"
#include "trace.h"
#include "qapi/error.h"
+#include "migration/cpr.h"
#include "pci.h"
VFIOGroupList vfio_group_list =
@@ -415,12 +416,28 @@ static bool vfio_set_iommu(int container_fd, int group_fd,
}
static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
- Error **errp)
+ bool reused, Error **errp)
{
int iommu_type;
const char *vioc_name;
VFIOContainer *container;
+ /*
+ * If container is reused, just set its type and skip the ioctls, as the
+ * container and group are already configured in the kernel.
+ * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
+ */
+ if (reused) {
+ if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU)) {
+ iommu_type = VFIO_TYPE1v2_IOMMU;
+ goto skip_iommu;
+ } else {
+ error_setg(errp, "container was reused but VFIO_TYPE1v2_IOMMU "
+ "is not supported");
+ return NULL;
+ }
+ }
+
iommu_type = vfio_get_iommu_type(fd, errp);
if (iommu_type < 0) {
return NULL;
@@ -430,10 +447,12 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
return NULL;
}
+skip_iommu:
vioc_name = vfio_get_iommu_class_name(iommu_type);
container = VFIO_IOMMU_LEGACY(object_new(vioc_name));
container->fd = fd;
+ container->reused = reused;
container->iommu_type = iommu_type;
return container;
}
@@ -543,10 +562,13 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
VFIOContainer *container;
VFIOContainerBase *bcontainer;
int ret, fd;
+ bool reused;
VFIOAddressSpace *space;
VFIOIOMMUClass *vioc;
space = vfio_get_address_space(as);
+ fd = cpr_find_fd("vfio_container_for_group", group->groupid);
+ reused = (fd >= 0);
/*
* VFIO is currently incompatible with discarding of RAM insofar as the
@@ -579,28 +601,52 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
* details once we know which type of IOMMU we are using.
*/
+ /*
+ * If the container is reused, then the group is already attached in the
+ * kernel. If a container with matching fd is found, then update the
+ * userland group list and return. If not, then after the loop, create
+ * the container struct and group list.
+ */
+
QLIST_FOREACH(bcontainer, &space->containers, next) {
container = container_of(bcontainer, VFIOContainer, bcontainer);
- if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
- ret = vfio_ram_block_discard_disable(container, true);
- if (ret) {
- error_setg_errno(errp, -ret,
- "Cannot set discarding of RAM broken");
- if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
- &container->fd)) {
- error_report("vfio: error disconnecting group %d from"
- " container", group->groupid);
- }
- return false;
+
+ if (reused) {
+ if (container->fd != fd) {
+ continue;
}
- group->container = container;
- QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+ } else if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
+ continue;
+ }
+
+ /* Container is a match for the group */
+ ret = vfio_ram_block_discard_disable(container, true);
+ if (ret) {
+ error_setg_errno(errp, -ret,
+ "Cannot set discarding of RAM broken");
+ if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
+ &container->fd)) {
+ error_report("vfio: error disconnecting group %d from"
+ " container", group->groupid);
+
+ }
+ goto delete_fd_exit;
+ }
+ group->container = container;
+ QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+ if (!reused) {
vfio_kvm_device_add_group(group);
- return true;
+ cpr_save_fd("vfio_container_for_group", group->groupid,
+ container->fd);
}
+ return true;
+ }
+
+ /* No matching container found, create one */
+ if (!reused) {
+ fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
}
- fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
if (fd < 0) {
goto put_space_exit;
}
@@ -612,11 +658,12 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
goto close_fd_exit;
}
- container = vfio_create_container(fd, group, errp);
+ container = vfio_create_container(fd, group, reused, errp);
if (!container) {
goto close_fd_exit;
}
bcontainer = &container->bcontainer;
+ container->reused = reused;
if (!vfio_legacy_cpr_register_container(container, errp)) {
goto free_container_exit;
@@ -652,6 +699,7 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
}
bcontainer->initialized = true;
+ cpr_resave_fd("vfio_container_for_group", group->groupid, fd);
return true;
listener_release_exit:
@@ -677,6 +725,8 @@ close_fd_exit:
put_space_exit:
vfio_put_address_space(space);
+delete_fd_exit:
+ cpr_delete_fd("vfio_container_for_group", group->groupid);
return false;
}
@@ -688,6 +738,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
QLIST_REMOVE(group, container_next);
group->container = NULL;
+ cpr_delete_fd("vfio_container_for_group", group->groupid);
/*
* Explicitly release the listener first before unset container,
@@ -741,7 +792,12 @@ static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
group = g_malloc0(sizeof(*group));
snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
- group->fd = qemu_open(path, O_RDWR, errp);
+
+ group->fd = cpr_find_fd("vfio_group", groupid);
+ if (group->fd < 0) {
+ group->fd = qemu_open(path, O_RDWR, errp);
+ }
+
if (group->fd < 0) {
goto free_group_exit;
}
@@ -769,6 +825,7 @@ static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
}
QLIST_INSERT_HEAD(&vfio_group_list, group, next);
+ cpr_resave_fd("vfio_group", groupid, group->fd);
return group;
@@ -794,6 +851,7 @@ static void vfio_put_group(VFIOGroup *group)
vfio_disconnect_container(group);
QLIST_REMOVE(group, next);
trace_vfio_put_group(group->fd);
+ cpr_delete_fd("vfio_group", group->groupid);
close(group->fd);
g_free(group);
}
@@ -803,8 +861,14 @@ static bool vfio_get_device(VFIOGroup *group, const char *name,
{
g_autofree struct vfio_device_info *info = NULL;
int fd;
+ bool reused;
+
+ fd = cpr_find_fd(name, 0);
+ reused = (fd >= 0);
+ if (!reused) {
+ fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
+ }
- fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
if (fd < 0) {
error_setg_errno(errp, errno, "error getting device from group %d",
group->groupid);
@@ -849,6 +913,8 @@ static bool vfio_get_device(VFIOGroup *group, const char *name,
vbasedev->num_irqs = info->num_irqs;
vbasedev->num_regions = info->num_regions;
vbasedev->flags = info->flags;
+ vbasedev->reused = reused;
+ cpr_resave_fd(name, 0, fd);
trace_vfio_get_device(name, info->flags, info->num_regions, info->num_irqs);
@@ -865,6 +931,7 @@ static void vfio_put_base_device(VFIODevice *vbasedev)
QLIST_REMOVE(vbasedev, next);
vbasedev->group = NULL;
trace_vfio_put_base_device(vbasedev->fd);
+ cpr_delete_fd(vbasedev->name, 0);
close(vbasedev->fd);
}
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
index d3bbc05..ce6f14e 100644
--- a/hw/vfio/cpr-legacy.c
+++ b/hw/vfio/cpr-legacy.c
@@ -29,10 +29,27 @@ static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
}
}
+static int vfio_container_post_load(void *opaque, int version_id)
+{
+ VFIOContainer *container = opaque;
+ VFIOGroup *group;
+ VFIODevice *vbasedev;
+
+ container->reused = false;
+
+ QLIST_FOREACH(group, &container->group_list, container_next) {
+ QLIST_FOREACH(vbasedev, &group->device_list, next) {
+ vbasedev->reused = false;
+ }
+ }
+ return 0;
+}
+
static const VMStateDescription vfio_container_vmstate = {
.name = "vfio-container",
.version_id = 0,
.minimum_version_id = 0,
+ .post_load = vfio_container_post_load,
.needed = cpr_needed_for_reuse,
.fields = (VMStateField[]) {
VMSTATE_END_OF_LIST()
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 53e554f..a435a90 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -85,6 +85,7 @@ typedef struct VFIOContainer {
int fd; /* /dev/vfio/vfio, empowered by the attached groups */
unsigned iommu_type;
Error *cpr_blocker;
+ bool reused;
QLIST_HEAD(, VFIOGroup) group_list;
} VFIOContainer;
@@ -135,6 +136,7 @@ typedef struct VFIODevice {
bool ram_block_discard_allowed;
OnOffAuto enable_migration;
bool migration_events;
+ bool reused;
VFIODeviceOps *ops;
unsigned int num_irqs;
unsigned int num_regions;
--
1.8.3.1
* [PATCH V1 06/26] vfio/container: preserve DMA mappings
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (4 preceding siblings ...)
2025-01-29 14:43 ` [PATCH V1 05/26] vfio/container: preserve descriptors Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
2025-02-03 18:25 ` Cédric Le Goater
2025-01-29 14:43 ` [PATCH V1 07/26] vfio/container: recover from unmap-all-vaddr failure Steve Sistare
` (19 subsequent siblings)
25 siblings, 1 reply; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Preserve DMA mappings during cpr-transfer.
In the container pre_save handler, suspend the use of virtual addresses
in DMA mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest RAM will
be remapped at a different VA after exec. DMA to already-mapped pages
continues.
Because the vaddr is temporarily invalid, mediated devices cannot be
supported, so add a blocker for them. This restriction will not apply
to iommufd containers when CPR is added for them in a future patch.
In new QEMU, do not register the memory listener at device creation time.
Register it later, in the container post_load handler, after all vmstate
that may affect regions and mapping boundaries has been loaded. The
post_load registration will cause the listener to invoke its callback on
each flat section, and the calls will match the mappings remembered by the
kernel. Modify vfio_dma_map (which is called by the listener) to pass the
new VA to the kernel using VFIO_DMA_MAP_FLAG_VADDR.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
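A rough model of the deferred listener registration and the flag choice in the map path. This is a sketch under stated assumptions: the TOY_MAP_* constants are illustrative stand-ins for the VFIO_DMA_MAP_FLAG_* values, ToyContainer and the toy_* functions are hypothetical, and the replay of sections by the listener is reduced to a single recorded map call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the VFIO_DMA_MAP_FLAG_* values. */
enum {
    TOY_MAP_READ  = 1u << 0,
    TOY_MAP_WRITE = 1u << 1,
    TOY_MAP_VADDR = 1u << 2,
};

/* While reused, a map call only supplies the new vaddr. */
static uint32_t toy_dma_map_flags(bool reused, bool readonly)
{
    if (reused) {
        return TOY_MAP_VADDR;
    }
    return readonly ? TOY_MAP_READ : (TOY_MAP_READ | TOY_MAP_WRITE);
}

typedef struct {
    bool reused;
    bool listener_registered;
    uint32_t last_flags;    /* flags used by the most recent map call */
} ToyContainer;

/* Registering the listener replays the sections through the map path. */
static void toy_register_listener(ToyContainer *c)
{
    c->listener_registered = true;
    c->last_flags = toy_dma_map_flags(c->reused, false);
}

/* Device creation: register immediately unless the container is reused. */
static void toy_connect(ToyContainer *c, bool reused)
{
    c->reused = reused;
    c->listener_registered = false;
    c->last_flags = 0;
    if (!reused) {
        toy_register_listener(c);
    }
}

/* After vmstate load: register while reused is set, then clear it. */
static void toy_post_load(ToyContainer *c)
{
    assert(c->reused);
    toy_register_listener(c);
    c->reused = false;
}
```

The ordering inside toy_post_load matters in this model: the listener is registered while reused is still set, so the replayed map calls carry only the vaddr flag, and any later map call uses the ordinary read/write flags.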
hw/vfio/container.c | 44 +++++++++++++++++++++++++++++++++++++++----
hw/vfio/cpr-legacy.c | 32 +++++++++++++++++++++++++++++++
include/hw/vfio/vfio-common.h | 3 +++
3 files changed, 75 insertions(+), 4 deletions(-)
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 81d0ccc..2b5125e 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -32,6 +32,7 @@
#include "trace.h"
#include "qapi/error.h"
#include "migration/cpr.h"
+#include "migration/blocker.h"
#include "pci.h"
VFIOGroupList vfio_group_list =
@@ -132,6 +133,8 @@ static int vfio_legacy_dma_unmap(const VFIOContainerBase *bcontainer,
int ret;
Error *local_err = NULL;
+ assert(!container->reused);
+
if (iotlb && vfio_devices_all_dirty_tracking_started(bcontainer)) {
if (!vfio_devices_all_device_dirty_tracking(bcontainer) &&
bcontainer->dirty_pages_supported) {
@@ -183,12 +186,24 @@ static int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
bcontainer);
struct vfio_iommu_type1_dma_map map = {
.argsz = sizeof(map),
- .flags = VFIO_DMA_MAP_FLAG_READ,
.vaddr = (__u64)(uintptr_t)vaddr,
.iova = iova,
.size = size,
};
+ /*
+ * Set the new vaddr for any mappings registered during cpr load.
+ * Reused is cleared thereafter.
+ */
+ if (container->reused) {
+ map.flags = VFIO_DMA_MAP_FLAG_VADDR;
+ if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
+ goto fail;
+ }
+ return 0;
+ }
+
+ map.flags = VFIO_DMA_MAP_FLAG_READ;
if (!readonly) {
map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
}
@@ -205,7 +220,11 @@ static int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
return 0;
}
- error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
+fail:
+ error_report("vfio_dma_map %s (iova 0x%" HWADDR_PRIx ", size 0x"
+ RAM_ADDR_FMT ", va %p): %s", (container->reused ? "VADDR" : ""),
+ iova, size, vaddr, strerror(errno));
+
return -errno;
}
@@ -689,8 +708,17 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
group->container = container;
QLIST_INSERT_HEAD(&container->group_list, group, container_next);
- bcontainer->listener = vfio_memory_listener;
- memory_listener_register(&bcontainer->listener, bcontainer->space->as);
+ /*
+ * If reused, register the listener later, after all state that may
+ * affect regions and mapping boundaries has been cpr load'ed. Later,
+ * the listener will invoke its callback on each flat section and call
+ * vfio_dma_map to supply the new vaddr, and the calls will match the
+ * mappings remembered by the kernel.
+ */
+ if (!reused) {
+ bcontainer->listener = vfio_memory_listener;
+ memory_listener_register(&bcontainer->listener, bcontainer->space->as);
+ }
if (bcontainer->error) {
error_propagate_prepend(errp, bcontainer->error,
@@ -1002,6 +1030,13 @@ static bool vfio_legacy_attach_device(const char *name, VFIODevice *vbasedev,
return false;
}
+ if (vbasedev->mdev) {
+ error_setg(&vbasedev->cpr_mdev_blocker,
+ "CPR does not support vfio mdev %s", vbasedev->name);
+ migrate_add_blocker_modes(&vbasedev->cpr_mdev_blocker, &error_fatal,
+ MIG_MODE_CPR_TRANSFER, -1);
+ }
+
bcontainer = &group->container->bcontainer;
vbasedev->bcontainer = bcontainer;
QLIST_INSERT_HEAD(&bcontainer->device_list, vbasedev, container_next);
@@ -1018,6 +1053,7 @@ static void vfio_legacy_detach_device(VFIODevice *vbasedev)
QLIST_REMOVE(vbasedev, container_next);
vbasedev->bcontainer = NULL;
trace_vfio_detach_device(vbasedev->name, group->groupid);
+ migrate_del_blocker(&vbasedev->cpr_mdev_blocker);
vfio_put_base_device(vbasedev);
vfio_put_group(group);
}
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
index ce6f14e..f3a31d1 100644
--- a/hw/vfio/cpr-legacy.c
+++ b/hw/vfio/cpr-legacy.c
@@ -14,6 +14,21 @@
#include "migration/vmstate.h"
#include "qapi/error.h"
+static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
+{
+ struct vfio_iommu_type1_dma_unmap unmap = {
+ .argsz = sizeof(unmap),
+ .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
+ .iova = 0,
+ .size = 0,
+ };
+ if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+ error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
+ return false;
+ }
+ return true;
+}
+
static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
{
if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
@@ -29,12 +44,27 @@ static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
}
}
+static int vfio_container_pre_save(void *opaque)
+{
+ VFIOContainer *container = opaque;
+ Error *err = NULL;
+
+ if (!vfio_dma_unmap_vaddr_all(container, &err)) {
+ error_report_err(err);
+ return -1;
+ }
+ return 0;
+}
+
static int vfio_container_post_load(void *opaque, int version_id)
{
VFIOContainer *container = opaque;
+ VFIOContainerBase *bcontainer = &container->bcontainer;
VFIOGroup *group;
VFIODevice *vbasedev;
+ bcontainer->listener = vfio_memory_listener;
+ memory_listener_register(&bcontainer->listener, bcontainer->space->as);
container->reused = false;
QLIST_FOREACH(group, &container->group_list, container_next) {
@@ -49,6 +79,8 @@ static const VMStateDescription vfio_container_vmstate = {
.name = "vfio-container",
.version_id = 0,
.minimum_version_id = 0,
+ .priority = MIG_PRI_LOW, /* Must happen after devices and groups */
+ .pre_save = vfio_container_pre_save,
.post_load = vfio_container_post_load,
.needed = cpr_needed_for_reuse,
.fields = (VMStateField[]) {
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index a435a90..1e974e0 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -143,6 +143,7 @@ typedef struct VFIODevice {
unsigned int flags;
VFIOMigration *migration;
Error *migration_blocker;
+ Error *cpr_mdev_blocker;
OnOffAuto pre_copy_dirty_page_tracking;
OnOffAuto device_dirty_page_tracking;
bool dirty_pages_supported;
@@ -310,6 +311,8 @@ int vfio_devices_query_dirty_bitmap(const VFIOContainerBase *bcontainer,
int vfio_get_dirty_bitmap(const VFIOContainerBase *bcontainer, uint64_t iova,
uint64_t size, ram_addr_t ram_addr, Error **errp);
+void vfio_listener_register(VFIOContainerBase *bcontainer);
+
/* Returns 0 on success, or a negative errno. */
bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp);
void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
--
1.8.3.1
* [PATCH V1 07/26] vfio/container: recover from unmap-all-vaddr failure
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (5 preceding siblings ...)
2025-01-29 14:43 ` [PATCH V1 06/26] vfio/container: preserve DMA mappings Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
2025-02-04 14:10 ` Cédric Le Goater
2025-01-29 14:43 ` [PATCH V1 08/26] pci: skip reset during cpr Steve Sistare
` (18 subsequent siblings)
25 siblings, 1 reply; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
If there are multiple containers and unmap-all fails for some container, we
need to remap vaddr for the other containers for which unmap-all succeeded.
Recover by walking all address ranges of all containers to restore the vaddr
for each. Do so by invoking the vfio listener callback, and passing a new
"remap" flag that tells it to restore a mapping without re-allocating new
userland data structures.
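
The recovery flow described above can be sketched in isolation as follows. This is a toy model, not QEMU code: every Toy* type and toy_* function is a hypothetical stand-in, and the assumption that registering a listener replays region_add for each existing section is modelled by a plain loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not QEMU's types. */
typedef struct ToySection {
    unsigned long iova;
    unsigned long size;
    void *vaddr;
} ToySection;

typedef struct ToyContainer {
    bool vaddr_unmapped;   /* unmap-all-vaddr succeeded for this container */
    bool reused;           /* tells the map path to restore the vaddr only */
    int remapped;          /* number of sections whose vaddr was restored */
} ToyContainer;

typedef void (*toy_region_add_fn)(ToyContainer *c, const ToySection *s);

/* Model: registering a listener invokes region_add on every section. */
static void toy_listener_register(ToyContainer *c, toy_region_add_fn fn,
                                  const ToySection *sections, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        fn(c, &sections[i]);
    }
}

/* Remap callback: no new data structures, only the vaddr is restored. */
static void toy_region_remap(ToyContainer *c, const ToySection *s)
{
    (void)s;
    if (c->reused) {
        c->remapped++;   /* real code would issue the map-vaddr ioctl here */
    }
}

/* Restore vaddr for each container whose unmap-all had succeeded. */
static int toy_recover(ToyContainer *containers, size_t nc,
                       const ToySection *sections, size_t ns)
{
    int recovered = 0;

    for (size_t i = 0; i < nc; i++) {
        ToyContainer *c = &containers[i];

        if (!c->vaddr_unmapped) {
            continue;    /* unmap-all never succeeded here, nothing to undo */
        }
        c->reused = true;
        toy_listener_register(c, toy_region_remap, sections, ns);
        c->reused = false;
        c->vaddr_unmapped = false;
        recovered++;
    }
    return recovered;
}
```

Only containers flagged as unmapped are walked, so a container whose unmap-all failed is left untouched.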
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/common.c | 47 ++++++++++++++++++++++++++++++++++++++++++-
hw/vfio/cpr-legacy.c | 44 ++++++++++++++++++++++++++++++++++++++++
include/hw/vfio/vfio-common.h | 6 +++++-
3 files changed, 95 insertions(+), 2 deletions(-)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 7370332..c8ee71a 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -580,6 +580,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
{
VFIOContainerBase *bcontainer = container_of(listener, VFIOContainerBase,
listener);
+ vfio_container_region_add(bcontainer, section, false);
+}
+
+void vfio_container_region_add(VFIOContainerBase *bcontainer,
+ MemoryRegionSection *section,
+ bool remap)
+{
hwaddr iova, end;
Int128 llend, llsize;
void *vaddr;
@@ -614,6 +621,30 @@ static void vfio_listener_region_add(MemoryListener *listener,
int iommu_idx;
trace_vfio_listener_region_add_iommu(section->mr->name, iova, end);
+
+ /*
+ * If remap, then VFIO_DMA_UNMAP_FLAG_VADDR has been called, and we
+ * want to remap the vaddr. vfio_container_region_add was already
+ * called in the past, so the giommu already exists. Find it and
+ * replay it, which calls vfio_dma_map further down the stack.
+ */
+
+ if (remap) {
+ hwaddr as_offset = section->offset_within_address_space;
+ hwaddr iommu_offset = as_offset - section->offset_within_region;
+
+ QLIST_FOREACH(giommu, &bcontainer->giommu_list, giommu_next) {
+ if (giommu->iommu_mr == iommu_mr &&
+ giommu->iommu_offset == iommu_offset) {
+ memory_region_iommu_replay(giommu->iommu_mr, &giommu->n);
+ return;
+ }
+ }
+ error_report("Container cannot find iommu region %s offset %"
+ HWADDR_PRIx, memory_region_name(section->mr), iommu_offset);
+ goto fail;
+ }
+
/*
* FIXME: For VFIO iommu types which have KVM acceleration to
* avoid bouncing all map/unmaps through qemu this way, this
@@ -656,7 +687,21 @@ static void vfio_listener_region_add(MemoryListener *listener,
* about changes.
*/
if (memory_region_has_ram_discard_manager(section->mr)) {
- vfio_register_ram_discard_listener(bcontainer, section);
+ /*
+ * If remap, then VFIO_DMA_UNMAP_FLAG_VADDR has been called, and we
+ * want to remap the vaddr. vfio_container_region_add was already
+ * called in the past, so the ram discard listener already exists.
+ * Call its populate function directly, which calls vfio_dma_map.
+ */
+ if (remap) {
+ VFIORamDiscardListener *vrdl =
+ vfio_find_ram_discard_listener(bcontainer, section);
+ if (vrdl->listener.notify_populate(&vrdl->listener, section)) {
+ error_report("listener.notify_populate failed");
+ }
+ } else {
+ vfio_register_ram_discard_listener(bcontainer, section);
+ }
return;
}
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
index f3a31d1..3139de1 100644
--- a/hw/vfio/cpr-legacy.c
+++ b/hw/vfio/cpr-legacy.c
@@ -26,9 +26,18 @@ static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
return false;
}
+ container->vaddr_unmapped = true;
return true;
}
+static void vfio_region_remap(MemoryListener *listener,
+ MemoryRegionSection *section)
+{
+ VFIOContainer *container = container_of(listener, VFIOContainer,
+ remap_listener);
+ vfio_container_region_add(&container->bcontainer, section, true);
+}
+
static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
{
if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
@@ -88,6 +97,37 @@ static const VMStateDescription vfio_container_vmstate = {
}
};
+static int vfio_cpr_fail_notifier(NotifierWithReturn *notifier,
+ MigrationEvent *e, Error **errp)
+{
+ VFIOContainer *container =
+ container_of(notifier, VFIOContainer, cpr_transfer_notifier);
+ VFIOContainerBase *bcontainer = &container->bcontainer;
+
+ if (e->type != MIG_EVENT_PRECOPY_FAILED) {
+ return 0;
+ }
+
+ if (container->vaddr_unmapped) {
+ /*
+ * Force a call to vfio_region_remap for each mapped section by
+ * temporarily registering a listener, which calls vfio_dma_map
+ * further down the stack. Set reused so vfio_dma_map restores vaddr.
+ */
+ container->reused = true;
+ container->remap_listener = (MemoryListener) {
+ .name = "vfio recover",
+ .region_add = vfio_region_remap
+ };
+ memory_listener_register(&container->remap_listener,
+ bcontainer->space->as);
+ memory_listener_unregister(&container->remap_listener);
+ container->reused = false;
+ container->vaddr_unmapped = false;
+ }
+ return 0;
+}
+
bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
{
VFIOContainerBase *bcontainer = &container->bcontainer;
@@ -104,6 +144,9 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
vmstate_register(NULL, -1, &vfio_container_vmstate, container);
+ migration_add_notifier_mode(&container->cpr_transfer_notifier,
+ vfio_cpr_fail_notifier,
+ MIG_MODE_CPR_TRANSFER);
return true;
}
@@ -114,4 +157,5 @@ void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
vfio_cpr_unregister_container(bcontainer);
migrate_del_blocker(&container->cpr_blocker);
vmstate_unregister(NULL, &vfio_container_vmstate, container);
+ migration_remove_notifier(&container->cpr_transfer_notifier);
}
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 1e974e0..8a4a658 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -86,6 +86,9 @@ typedef struct VFIOContainer {
unsigned iommu_type;
Error *cpr_blocker;
bool reused;
+ bool vaddr_unmapped;
+ NotifierWithReturn cpr_transfer_notifier;
+ MemoryListener remap_listener;
QLIST_HEAD(, VFIOGroup) group_list;
} VFIOContainer;
@@ -311,7 +314,8 @@ int vfio_devices_query_dirty_bitmap(const VFIOContainerBase *bcontainer,
int vfio_get_dirty_bitmap(const VFIOContainerBase *bcontainer, uint64_t iova,
uint64_t size, ram_addr_t ram_addr, Error **errp);
-void vfio_listener_register(VFIOContainerBase *bcontainer);
+void vfio_container_region_add(VFIOContainerBase *bcontainer,
+ MemoryRegionSection *section, bool remap);
/* Returns 0 on success, or a negative errno. */
bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp);
--
1.8.3.1
* [PATCH V1 08/26] pci: skip reset during cpr
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (6 preceding siblings ...)
2025-01-29 14:43 ` [PATCH V1 07/26] vfio/container: recover from unmap-all-vaddr failure Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
2025-02-04 14:14 ` Cédric Le Goater
2025-01-29 14:43 ` [PATCH V1 09/26] pci: export msix_is_pending Steve Sistare
` (17 subsequent siblings)
25 siblings, 1 reply; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
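Do not reset a PCI device during CPR. The device is already configured, and
resetting it prior to cpr load may lose interrupts for vfio-pci devices.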
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/pci/pci.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 2afa423..16b4f71 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -32,6 +32,7 @@
#include "hw/pci/pci_host.h"
#include "hw/qdev-properties.h"
#include "hw/qdev-properties-system.h"
+#include "migration/misc.h"
#include "migration/qemu-file-types.h"
#include "migration/vmstate.h"
#include "net/net.h"
@@ -459,6 +460,18 @@ static void pci_reset_regions(PCIDevice *dev)
static void pci_do_device_reset(PCIDevice *dev)
{
+ /*
+ * A PCI device that is resuming for cpr is already configured, so do
+ * not reset it here when we are called from qemu_system_reset prior to
+ * cpr load, else interrupts may be lost for vfio-pci devices. It is
+ * safe to skip this reset for all PCI devices, because cpr load will set
+ * all fields that would have been set here.
+ */
+ MigMode mode = migrate_mode();
+ if (mode == MIG_MODE_CPR_TRANSFER) {
+ return;
+ }
+
pci_device_deassert_intx(dev);
assert(dev->irq_state == 0);
--
1.8.3.1
* [PATCH V1 09/26] pci: export msix_is_pending
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (7 preceding siblings ...)
2025-01-29 14:43 ` [PATCH V1 08/26] pci: skip reset during cpr Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
2025-01-29 14:43 ` [PATCH V1 10/26] vfio-pci: refactor for cpr Steve Sistare
` (16 subsequent siblings)
25 siblings, 0 replies; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Export msix_is_pending for use by cpr. No functional change.
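
For readers unfamiliar with the pending-bit lookup being exported, here is a minimal standalone sketch. It assumes one pending bit per vector, packed eight per byte with bit index vector % 8; the toy_* names are hypothetical and operate on a bare byte array rather than a PCIDevice.

```c
#include <assert.h>
#include <stdint.h>

/* Byte of the pending-bit array that holds the bit for this vector. */
static uint8_t *toy_pending_byte(uint8_t *pba, unsigned vector)
{
    return pba + vector / 8;
}

/* Mask selecting this vector's bit within its byte (assumed layout). */
static uint8_t toy_pending_mask(unsigned vector)
{
    return (uint8_t)(1u << (vector % 8));
}

/* Nonzero if the vector's pending bit is set. */
static int toy_is_pending(uint8_t *pba, unsigned vector)
{
    return *toy_pending_byte(pba, vector) & toy_pending_mask(vector);
}

static void toy_set_pending(uint8_t *pba, unsigned vector)
{
    *toy_pending_byte(pba, vector) |= toy_pending_mask(vector);
}
```

Like the function in the patch, toy_is_pending returns the masked byte rather than a normalized 0 or 1, so callers should test it for nonzero.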
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
hw/pci/msix.c | 2 +-
include/hw/pci/msix.h | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index 57ec708..c7b40cd 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -71,7 +71,7 @@ static uint8_t *msix_pending_byte(PCIDevice *dev, int vector)
return dev->msix_pba + vector / 8;
}
-static int msix_is_pending(PCIDevice *dev, int vector)
+int msix_is_pending(PCIDevice *dev, unsigned int vector)
{
return *msix_pending_byte(dev, vector) & msix_pending_mask(vector);
}
diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
index 0e6f257..11ef945 100644
--- a/include/hw/pci/msix.h
+++ b/include/hw/pci/msix.h
@@ -32,6 +32,7 @@ int msix_present(PCIDevice *dev);
bool msix_is_masked(PCIDevice *dev, unsigned vector);
void msix_set_pending(PCIDevice *dev, unsigned vector);
void msix_clr_pending(PCIDevice *dev, int vector);
+int msix_is_pending(PCIDevice *dev, unsigned vector);
void msix_vector_use(PCIDevice *dev, unsigned vector);
void msix_vector_unuse(PCIDevice *dev, unsigned vector);
--
1.8.3.1
* [PATCH V1 10/26] vfio-pci: refactor for cpr
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (8 preceding siblings ...)
2025-01-29 14:43 ` [PATCH V1 09/26] pci: export msix_is_pending Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
2025-02-04 14:39 ` Cédric Le Goater
2025-01-29 14:43 ` [PATCH V1 11/26] vfio-pci: skip reset during cpr Steve Sistare
` (15 subsequent siblings)
25 siblings, 1 reply; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Refactor vector use into a helper vfio_vector_init.
Add vfio_notifier_init and vfio_notifier_cleanup for named notifiers,
and pass additional arguments to vfio_connect_kvm_msi_virq and
vfio_remove_kvm_msi_virq.
These changes are for use by CPR in a subsequent patch. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/pci.c | 106 +++++++++++++++++++++++++++++++++++++---------------------
1 file changed, 68 insertions(+), 38 deletions(-)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index ab17a98..24ebd69 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -54,6 +54,32 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
+/* Create new or reuse existing eventfd */
+static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
+ const char *name, int nr)
+{
+ int fd = -1; /* placeholder until a subsequent patch */
+ int ret = 0;
+
+ if (fd >= 0) {
+ event_notifier_init_fd(e, fd);
+ } else {
+ ret = event_notifier_init(e, 0);
+ if (ret) {
+ Error *err = NULL;
+ error_setg_errno(&err, -ret, "vfio_notifier_init %s failed", name);
+ error_report_err(err);
+ }
+ }
+ return ret;
+}
+
+static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
+ const char *name, int nr)
+{
+ event_notifier_cleanup(e);
+}
+
/*
* Disabling BAR mmaping can be slow, but toggling it around INTx can
* also be a huge overhead. We try to get the best of both worlds by
@@ -134,8 +160,8 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
pci_irq_deassert(&vdev->pdev);
/* Get an eventfd for resample/unmask */
- if (event_notifier_init(&vdev->intx.unmask, 0)) {
- error_setg(errp, "event_notifier_init failed eoi");
+ if (vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0)) {
+ error_setg(errp, "vfio_notifier_init intx-unmask failed");
goto fail;
}
@@ -167,7 +193,7 @@ fail_vfio:
kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vdev->intx.interrupt,
vdev->intx.route.irq);
fail_irqfd:
- event_notifier_cleanup(&vdev->intx.unmask);
+ vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
fail:
qemu_set_fd_handler(irq_fd, vfio_intx_interrupt, NULL, vdev);
vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
@@ -199,7 +225,7 @@ static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
}
/* We only need to close the eventfd for VFIO to cleanup the kernel side */
- event_notifier_cleanup(&vdev->intx.unmask);
+ vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
/* QEMU starts listening for interrupt events. */
qemu_set_fd_handler(event_notifier_get_fd(&vdev->intx.interrupt),
@@ -266,7 +292,6 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
Error *err = NULL;
int32_t fd;
- int ret;
if (!pin) {
@@ -289,9 +314,7 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
}
#endif
- ret = event_notifier_init(&vdev->intx.interrupt, 0);
- if (ret) {
- error_setg_errno(errp, -ret, "event_notifier_init failed");
+ if (vfio_notifier_init(vdev, &vdev->intx.interrupt, "intx-interrupt", 0)) {
return false;
}
fd = event_notifier_get_fd(&vdev->intx.interrupt);
@@ -300,7 +323,7 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
qemu_set_fd_handler(fd, NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->intx.interrupt);
+ vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
return false;
}
@@ -327,7 +350,7 @@ static void vfio_intx_disable(VFIOPCIDevice *vdev)
fd = event_notifier_get_fd(&vdev->intx.interrupt);
qemu_set_fd_handler(fd, NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->intx.interrupt);
+ vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
vdev->interrupt = VFIO_INT_NONE;
@@ -471,13 +494,15 @@ static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
vector_n, &vdev->pdev);
}
-static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector)
+static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector, int nr)
{
+ const char *name = "kvm_interrupt";
+
if (vector->virq < 0) {
return;
}
- if (event_notifier_init(&vector->kvm_interrupt, 0)) {
+ if (vfio_notifier_init(vector->vdev, &vector->kvm_interrupt, name, nr)) {
goto fail_notifier;
}
@@ -489,19 +514,20 @@ static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector)
return;
fail_kvm:
- event_notifier_cleanup(&vector->kvm_interrupt);
+ vfio_notifier_cleanup(vector->vdev, &vector->kvm_interrupt, name, nr);
fail_notifier:
kvm_irqchip_release_virq(kvm_state, vector->virq);
vector->virq = -1;
}
-static void vfio_remove_kvm_msi_virq(VFIOMSIVector *vector)
+static void vfio_remove_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
+ int nr)
{
kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
vector->virq);
kvm_irqchip_release_virq(kvm_state, vector->virq);
vector->virq = -1;
- event_notifier_cleanup(&vector->kvm_interrupt);
+ vfio_notifier_cleanup(vdev, &vector->kvm_interrupt, "kvm_interrupt", nr);
}
static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
@@ -511,6 +537,20 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
kvm_irqchip_commit_routes(kvm_state);
}
+static void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
+{
+ VFIOMSIVector *vector = &vdev->msi_vectors[nr];
+ PCIDevice *pdev = &vdev->pdev;
+
+ vector->vdev = vdev;
+ vector->virq = -1;
+ vfio_notifier_init(vdev, &vector->interrupt, "interrupt", nr);
+ vector->use = true;
+ if (vdev->interrupt == VFIO_INT_MSIX) {
+ msix_vector_use(pdev, nr);
+ }
+}
+
static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
MSIMessage *msg, IOHandler *handler)
{
@@ -524,13 +564,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
vector = &vdev->msi_vectors[nr];
if (!vector->use) {
- vector->vdev = vdev;
- vector->virq = -1;
- if (event_notifier_init(&vector->interrupt, 0)) {
- error_report("vfio: Error: event_notifier_init failed");
- }
- vector->use = true;
- msix_vector_use(pdev, nr);
+ vfio_vector_init(vdev, nr);
}
qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
@@ -542,7 +576,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
*/
if (vector->virq >= 0) {
if (!msg) {
- vfio_remove_kvm_msi_virq(vector);
+ vfio_remove_kvm_msi_virq(vdev, vector, nr);
} else {
vfio_update_kvm_msi_virq(vector, *msg, pdev);
}
@@ -554,7 +588,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
vfio_add_kvm_msi_virq(vdev, vector, nr, true);
kvm_irqchip_commit_route_changes(&vfio_route_change);
- vfio_connect_kvm_msi_virq(vector);
+ vfio_connect_kvm_msi_virq(vector, nr);
}
}
}
@@ -661,7 +695,7 @@ static void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
kvm_irqchip_commit_route_changes(&vfio_route_change);
for (i = 0; i < vdev->nr_vectors; i++) {
- vfio_connect_kvm_msi_virq(&vdev->msi_vectors[i]);
+ vfio_connect_kvm_msi_virq(&vdev->msi_vectors[i], i);
}
}
@@ -741,9 +775,7 @@ retry:
vector->virq = -1;
vector->use = true;
- if (event_notifier_init(&vector->interrupt, 0)) {
- error_report("vfio: Error: event_notifier_init failed");
- }
+ vfio_notifier_init(vdev, &vector->interrupt, "interrupt", i);
qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
vfio_msi_interrupt, NULL, vector);
@@ -797,11 +829,11 @@ static void vfio_msi_disable_common(VFIOPCIDevice *vdev)
VFIOMSIVector *vector = &vdev->msi_vectors[i];
if (vdev->msi_vectors[i].use) {
if (vector->virq >= 0) {
- vfio_remove_kvm_msi_virq(vector);
+ vfio_remove_kvm_msi_virq(vdev, vector, i);
}
qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
NULL, NULL, NULL);
- event_notifier_cleanup(&vector->interrupt);
+ vfio_notifier_cleanup(vdev, &vector->interrupt, "interrupt", i);
}
}
@@ -2854,8 +2886,7 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
return;
}
- if (event_notifier_init(&vdev->err_notifier, 0)) {
- error_report("vfio: Unable to init event notifier for error detection");
+ if (vfio_notifier_init(vdev, &vdev->err_notifier, "err_notifier", 0)) {
vdev->pci_aer = false;
return;
}
@@ -2867,7 +2898,7 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
qemu_set_fd_handler(fd, NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->err_notifier);
+ vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
vdev->pci_aer = false;
}
}
@@ -2886,7 +2917,7 @@ static void vfio_unregister_err_notifier(VFIOPCIDevice *vdev)
}
qemu_set_fd_handler(event_notifier_get_fd(&vdev->err_notifier),
NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->err_notifier);
+ vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
}
static void vfio_req_notifier_handler(void *opaque)
@@ -2920,8 +2951,7 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
return;
}
- if (event_notifier_init(&vdev->req_notifier, 0)) {
- error_report("vfio: Unable to init event notifier for device request");
+ if (vfio_notifier_init(vdev, &vdev->req_notifier, "req_notifier", 0)) {
return;
}
@@ -2932,7 +2962,7 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
qemu_set_fd_handler(fd, NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->req_notifier);
+ vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
} else {
vdev->req_enabled = true;
}
@@ -2952,7 +2982,7 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
}
qemu_set_fd_handler(event_notifier_get_fd(&vdev->req_notifier),
NULL, NULL, vdev);
- event_notifier_cleanup(&vdev->req_notifier);
+ vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
vdev->req_enabled = false;
}
--
1.8.3.1
* [PATCH V1 11/26] vfio-pci: skip reset during cpr
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (9 preceding siblings ...)
2025-01-29 14:43 ` [PATCH V1 10/26] vfio-pci: refactor for cpr Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
2025-02-04 14:56 ` Cédric Le Goater
2025-01-29 14:43 ` [PATCH V1 12/26] vfio-pci: preserve MSI Steve Sistare
` (14 subsequent siblings)
25 siblings, 1 reply; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Do not reset a vfio-pci device during CPR, and do not complain if the
kernel's PCI config space changes for non-emulated bits between the
vmstate save and load, which can happen due to ongoing interrupt activity.
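
The effect of narrowing the check mask can be illustrated with a small standalone sketch. It is a simplification under stated assumptions: the toy_* functions are hypothetical, and the real changed-bits check in get_pci_config_device may apply further masks that are omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Restrict the compare mask to bits that are emulated in userspace. */
static void toy_pre_load(uint8_t *cmask, const uint8_t *emulated, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        cmask[i] &= emulated[i];
    }
}

/*
 * Compare saved and current config space under cmask.
 * Returns the index of the first mismatching byte, or -1 if none.
 */
static int toy_config_mismatch(const uint8_t *saved, const uint8_t *current,
                               const uint8_t *cmask, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        if ((saved[i] ^ current[i]) & cmask[i]) {
            return (int)i;
        }
    }
    return -1;
}
```

A bit that the kernel owns and flips between save and load is reported as a mismatch under the full mask, but is ignored once the mask has been narrowed to the emulated bits.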
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/pci.c | 37 +++++++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 24ebd69..fa77c36 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -29,6 +29,8 @@
#include "hw/pci/pci_bridge.h"
#include "hw/qdev-properties.h"
#include "hw/qdev-properties-system.h"
+#include "migration/misc.h"
+#include "migration/cpr.h"
#include "migration/vmstate.h"
#include "qapi/qmp/qdict.h"
#include "qemu/error-report.h"
@@ -3324,6 +3326,11 @@ static void vfio_pci_reset(DeviceState *dev)
{
VFIOPCIDevice *vdev = VFIO_PCI(dev);
+ /* Do not reset the device during qemu_system_reset prior to cpr load */
+ if (vdev->vbasedev.reused) {
+ return;
+ }
+
trace_vfio_pci_reset(vdev->vbasedev.name);
vfio_pci_pre_reset(vdev);
@@ -3447,6 +3454,35 @@ static void vfio_pci_set_fd(Object *obj, const char *str, Error **errp)
}
#endif
+/*
+ * The kernel may change non-emulated config bits. Exclude them from the
+ * changed-bits check in get_pci_config_device.
+ */
+static int vfio_pci_pre_load(void *opaque)
+{
+ VFIOPCIDevice *vdev = opaque;
+ PCIDevice *pdev = &vdev->pdev;
+ int size = MIN(pci_config_size(pdev), vdev->config_size);
+ int i;
+
+ for (i = 0; i < size; i++) {
+ pdev->cmask[i] &= vdev->emulated_config_bits[i];
+ }
+
+ return 0;
+}
+
+static const VMStateDescription vfio_pci_vmstate = {
+ .name = "vfio-pci",
+ .version_id = 0,
+ .minimum_version_id = 0,
+ .pre_load = vfio_pci_pre_load,
+ .needed = cpr_needed_for_reuse,
+ .fields = (VMStateField[]) {
+ VMSTATE_END_OF_LIST()
+ }
+};
+
static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
{
DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3457,6 +3493,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
#ifdef CONFIG_IOMMUFD
object_class_property_add_str(klass, "fd", NULL, vfio_pci_set_fd);
#endif
+ dc->vmsd = &vfio_pci_vmstate;
dc->desc = "VFIO-based PCI device assignment";
set_bit(DEVICE_CATEGORY_MISC, dc->categories);
pdc->realize = vfio_realize;
--
1.8.3.1
* [PATCH V1 12/26] vfio-pci: preserve MSI
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (10 preceding siblings ...)
2025-01-29 14:43 ` [PATCH V1 11/26] vfio-pci: skip reset during cpr Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
2025-02-05 16:48 ` Cédric Le Goater
2025-01-29 14:43 ` [PATCH V1 13/26] vfio-pci: preserve INTx Steve Sistare
` (13 subsequent siblings)
25 siblings, 1 reply; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Save the MSI message area as part of vfio-pci vmstate, and preserve the
interrupt and notifier eventfds. migrate_incoming loads the MSI data,
then the vfio-pci post_load handler finds the eventfds in CPR state,
rebuilds vector data structures, and attaches the interrupts to the new
KVM instance.
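
The create-or-reuse pattern for named descriptors can be sketched as a toy registry keyed by (name, id). Everything here is a hypothetical stand-in: the table, the toy_* functions, and the integer counter that takes the place of creating a real eventfd. The only behaviour borrowed from the patch is that a lookup miss yields a negative value and triggers creation plus save.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX_FDS 16

typedef struct ToyFdEntry {
    char name[64];
    int id;
    int fd;
    int used;
} ToyFdEntry;

static ToyFdEntry toy_fds[TOY_MAX_FDS];
static int toy_next_fd = 100;   /* stands in for allocating a new eventfd */

static ToyFdEntry *toy_lookup(const char *name, int id)
{
    for (int i = 0; i < TOY_MAX_FDS; i++) {
        if (toy_fds[i].used && toy_fds[i].id == id &&
            strcmp(toy_fds[i].name, name) == 0) {
            return &toy_fds[i];
        }
    }
    return NULL;
}

/* Returns the saved fd, or -1 if none is recorded for (name, id). */
static int toy_find_fd(const char *name, int id)
{
    ToyFdEntry *e = toy_lookup(name, id);
    return e ? e->fd : -1;
}

/* Record fd under (name, id), overwriting any previous value. */
static void toy_save_fd(const char *name, int id, int fd)
{
    ToyFdEntry *e = toy_lookup(name, id);

    for (int i = 0; !e && i < TOY_MAX_FDS; i++) {
        if (!toy_fds[i].used) {
            e = &toy_fds[i];
        }
    }
    if (e) {
        snprintf(e->name, sizeof(e->name), "%s", name);
        e->id = id;
        e->fd = fd;
        e->used = 1;
    }
}

static void toy_delete_fd(const char *name, int id)
{
    ToyFdEntry *e = toy_lookup(name, id);
    if (e) {
        e->used = 0;
    }
}

/* Reuse a preserved fd if one exists, else create and record a new one. */
static int toy_notifier_init(const char *name, int id, int *reused)
{
    int fd = toy_find_fd(name, id);

    *reused = fd >= 0;
    if (fd < 0) {
        fd = toy_next_fd++;
        toy_save_fd(name, id, fd);
    }
    return fd;
}
```

In this model the second initialization of the same (name, id) pair returns the original descriptor, which is what lets the new process keep receiving interrupts signalled on it.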
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/pci.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 116 insertions(+), 1 deletion(-)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index fa77c36..df6e298 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -56,11 +56,37 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
+#define EVENT_FD_NAME(vdev, name) \
+ g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
+
+static void save_event_fd(VFIOPCIDevice *vdev, const char *name, int nr,
+ EventNotifier *ev)
+{
+ int fd = event_notifier_get_fd(ev);
+
+ if (fd >= 0) {
+ g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+ cpr_resave_fd(fdname, nr, fd);
+ }
+}
+
+static int load_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+ g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+ return cpr_find_fd(fdname, nr);
+}
+
+static void delete_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+ g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+ cpr_delete_fd(fdname, nr);
+}
+
/* Create new or reuse existing eventfd */
static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
const char *name, int nr)
{
- int fd = -1; /* placeholder until a subsequent patch */
+ int fd = load_event_fd(vdev, name, nr);
int ret = 0;
if (fd >= 0) {
@@ -71,6 +97,8 @@ static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
Error *err = NULL;
error_setg_errno(&err, -ret, "vfio_notifier_init %s failed", name);
error_report_err(err);
+ } else {
+ save_event_fd(vdev, name, nr, e);
}
}
return ret;
@@ -79,6 +107,7 @@ static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
const char *name, int nr)
{
+ delete_event_fd(vdev, name, nr);
event_notifier_cleanup(e);
}
@@ -561,6 +590,15 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
int ret;
bool resizing = !!(vdev->nr_vectors < nr + 1);
+ /*
+ * Ignore the callback from msix_set_vector_notifiers during resume.
+ * The necessary subset of these actions is called from vfio_claim_vectors
+ * during post load.
+ */
+ if (vdev->vbasedev.reused) {
+ return 0;
+ }
+
trace_vfio_msix_vector_do_use(vdev->vbasedev.name, nr);
vector = &vdev->msi_vectors[nr];
@@ -2896,6 +2934,11 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
fd = event_notifier_get_fd(&vdev->err_notifier);
qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
+ /* Do not alter irq_signaling during vfio_realize for cpr */
+ if (vdev->vbasedev.reused) {
+ return;
+ }
+
if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_ERR_IRQ_INDEX, 0,
VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
@@ -2960,6 +3003,12 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
fd = event_notifier_get_fd(&vdev->req_notifier);
qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
+ /* Do not alter irq_signaling during vfio_realize for cpr */
+ if (vdev->vbasedev.reused) {
+ vdev->req_enabled = true;
+ return;
+ }
+
if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_REQ_IRQ_INDEX, 0,
VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
@@ -3454,6 +3503,46 @@ static void vfio_pci_set_fd(Object *obj, const char *str, Error **errp)
}
#endif
+static void vfio_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors, bool msix)
+{
+ int i, fd;
+ bool pending = false;
+ PCIDevice *pdev = &vdev->pdev;
+
+ vdev->nr_vectors = nr_vectors;
+ vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
+ vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
+
+ vfio_prepare_kvm_msi_virq_batch(vdev);
+
+ for (i = 0; i < nr_vectors; i++) {
+ VFIOMSIVector *vector = &vdev->msi_vectors[i];
+
+ fd = load_event_fd(vdev, "interrupt", i);
+ if (fd >= 0) {
+ vfio_vector_init(vdev, i);
+ qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
+ }
+
+ if (load_event_fd(vdev, "kvm_interrupt", i) >= 0) {
+ vfio_add_kvm_msi_virq(vdev, vector, i, msix);
+ } else {
+ vdev->msi_vectors[i].virq = -1;
+ }
+
+ if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
+ set_bit(i, vdev->msix->pending);
+ pending = true;
+ }
+ }
+
+ vfio_commit_kvm_msi_virq_batch(vdev);
+
+ if (msix) {
+ memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
+ }
+}
+
/*
* The kernel may change non-emulated config bits. Exclude them from the
* changed-bits check in get_pci_config_device.
@@ -3472,13 +3561,39 @@ static int vfio_pci_pre_load(void *opaque)
return 0;
}
+static int vfio_pci_post_load(void *opaque, int version_id)
+{
+ VFIOPCIDevice *vdev = opaque;
+ PCIDevice *pdev = &vdev->pdev;
+ int nr_vectors;
+
+ if (msix_enabled(pdev)) {
+ msix_set_vector_notifiers(pdev, vfio_msix_vector_use,
+ vfio_msix_vector_release, NULL);
+ nr_vectors = vdev->msix->entries;
+ vfio_claim_vectors(vdev, nr_vectors, true);
+
+ } else if (msi_enabled(pdev)) {
+ nr_vectors = msi_nr_vectors_allocated(pdev);
+ vfio_claim_vectors(vdev, nr_vectors, false);
+
+ } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
+ g_assert_not_reached(); /* completed in a subsequent patch */
+ }
+
+ return 0;
+}
+
static const VMStateDescription vfio_pci_vmstate = {
.name = "vfio-pci",
.version_id = 0,
.minimum_version_id = 0,
.pre_load = vfio_pci_pre_load,
+ .post_load = vfio_pci_post_load,
.needed = cpr_needed_for_reuse,
.fields = (VMStateField[]) {
+ VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
+ VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
VMSTATE_END_OF_LIST()
}
};
--
1.8.3.1
* [PATCH V1 13/26] vfio-pci: preserve INTx
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (11 preceding siblings ...)
2025-01-29 14:43 ` [PATCH V1 12/26] vfio-pci: preserve MSI Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
2025-02-05 17:13 ` Cédric Le Goater
2025-01-29 14:43 ` [PATCH V1 14/26] migration: close kvm after cpr Steve Sistare
` (12 subsequent siblings)
25 siblings, 1 reply; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Preserve vfio INTx state across cpr-transfer. Preserve VFIOINTx fields as
follows:
pin : Recover this from the vfio config in kernel space
interrupt : Preserve its eventfd descriptor across exec.
unmask : Ditto
route.irq : This could perhaps be recovered in vfio_pci_post_load by
calling pci_device_route_intx_to_irq(pin), whose implementation reads
config space for a bridge device such as ich9. However, there is no
guarantee that the bridge vmstate is read before vfio vmstate. Rather
than fiddling with MigrationPriority for vmstate handlers, explicitly
save route.irq in vfio vmstate.
pending : Save in vfio vmstate.
mmap_timeout, mmap_timer : Re-initialize
kvm_accel : Re-initialize
In vfio_realize, defer calling vfio_intx_enable until the vmstate
is available, in vfio_pci_post_load. Modify vfio_intx_enable and
vfio_intx_enable_kvm to skip vfio initialization, but still perform
kvm initialization.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
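Reader's note, not part of the patch: the commit message above splits the INTx fields into those saved in vmstate and those re-initialized on load. A minimal sketch of that split, assuming a table of (offset, size) descriptors, follows. The struct intx_state, field_desc, intx_save and intx_load names are hypothetical, and the raw host-endian memcpy is a simplification; QEMU's vmstate code uses its own wire format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for VFIOINTx; only the fields needed for the illustration. */
struct intx_state {
    bool pending;          /* saved */
    uint32_t route_mode;   /* saved */
    int32_t route_irq;     /* saved */
    int scratch;           /* not saved; the loader leaves it alone */
};

/* Describes one saved field by its position and width in the struct. */
struct field_desc {
    size_t offset;
    size_t size;
};

static const struct field_desc intx_fields[] = {
    { offsetof(struct intx_state, pending),    sizeof(bool) },
    { offsetof(struct intx_state, route_mode), sizeof(uint32_t) },
    { offsetof(struct intx_state, route_irq),  sizeof(int32_t) },
};

#define INTX_NFIELDS (sizeof(intx_fields) / sizeof(intx_fields[0]))

/* Copy the described fields into buf; return the number of bytes written. */
size_t intx_save(const struct intx_state *s, unsigned char *buf)
{
    size_t pos = 0;

    for (size_t i = 0; i < INTX_NFIELDS; i++) {
        memcpy(buf + pos, (const unsigned char *)s + intx_fields[i].offset,
               intx_fields[i].size);
        pos += intx_fields[i].size;
    }
    return pos;
}

/* Copy the described fields out of buf; return the number of bytes read. */
size_t intx_load(struct intx_state *s, const unsigned char *buf)
{
    size_t pos = 0;

    for (size_t i = 0; i < INTX_NFIELDS; i++) {
        memcpy((unsigned char *)s + intx_fields[i].offset, buf + pos,
               intx_fields[i].size);
        pos += intx_fields[i].size;
    }
    return pos;
}
```

Fields that are absent from the descriptor table, such as scratch here, keep whatever value the loading side already set up, which is the role the commit message assigns to the re-initialized fields.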
hw/vfio/pci.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 47 insertions(+), 4 deletions(-)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index df6e298..c50dbef 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -184,12 +184,17 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
return true;
}
+ if (vdev->vbasedev.reused) {
+ goto skip_state;
+ }
+
/* Get to a known interrupt state */
qemu_set_fd_handler(irq_fd, NULL, NULL, vdev);
vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
vdev->intx.pending = false;
pci_irq_deassert(&vdev->pdev);
+skip_state:
/* Get an eventfd for resample/unmask */
if (vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0)) {
error_setg(errp, "vfio_notifier_init intx-unmask failed");
@@ -204,6 +209,10 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
goto fail_irqfd;
}
+ if (vdev->vbasedev.reused) {
+ goto skip_irq;
+ }
+
if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
VFIO_IRQ_SET_ACTION_UNMASK,
event_notifier_get_fd(&vdev->intx.unmask),
@@ -214,6 +223,7 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
/* Let'em rip */
vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
+skip_irq:
vdev->intx.kvm_accel = true;
trace_vfio_intx_enable_kvm(vdev->vbasedev.name);
@@ -329,7 +339,13 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
return true;
}
- vfio_disable_interrupts(vdev);
+ /*
+ * Do not alter interrupt state during vfio_realize and cpr load. The
+ * reused flag is cleared thereafter.
+ */
+ if (!vdev->vbasedev.reused) {
+ vfio_disable_interrupts(vdev);
+ }
vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
pci_config_set_interrupt_pin(vdev->pdev.config, pin);
@@ -351,7 +367,8 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
fd = event_notifier_get_fd(&vdev->intx.interrupt);
qemu_set_fd_handler(fd, vfio_intx_interrupt, NULL, vdev);
- if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
+ if (!vdev->vbasedev.reused &&
+ !vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
qemu_set_fd_handler(fd, NULL, NULL, vdev);
vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
@@ -3256,7 +3273,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
vfio_intx_routing_notifier);
vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
- if (!vfio_intx_enable(vdev, errp)) {
+ /* Wait until cpr load reads intx routing data to enable */
+ if (!vdev->vbasedev.reused && !vfio_intx_enable(vdev, errp)) {
goto out_deregister;
}
}
@@ -3578,12 +3596,36 @@ static int vfio_pci_post_load(void *opaque, int version_id)
vfio_claim_vectors(vdev, nr_vectors, false);
} else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
- g_assert_not_reached(); /* completed in a subsequent patch */
+ Error *err = NULL;
+ if (!vfio_intx_enable(vdev, &err)) {
+ error_report_err(err);
+ return -1;
+ }
}
return 0;
}
+static const VMStateDescription vfio_intx_vmstate = {
+ .name = "vfio-intx",
+ .version_id = 0,
+ .minimum_version_id = 0,
+ .fields = (VMStateField[]) {
+ VMSTATE_BOOL(pending, VFIOINTx),
+ VMSTATE_UINT32(route.mode, VFIOINTx),
+ VMSTATE_INT32(route.irq, VFIOINTx),
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+#define VMSTATE_VFIO_INTX(_field, _state) { \
+ .name = (stringify(_field)), \
+ .size = sizeof(VFIOINTx), \
+ .vmsd = &vfio_intx_vmstate, \
+ .flags = VMS_STRUCT, \
+ .offset = vmstate_offset_value(_state, _field, VFIOINTx), \
+}
+
static const VMStateDescription vfio_pci_vmstate = {
.name = "vfio-pci",
.version_id = 0,
@@ -3594,6 +3636,7 @@ static const VMStateDescription vfio_pci_vmstate = {
.fields = (VMStateField[]) {
VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
+ VMSTATE_VFIO_INTX(intx, VFIOPCIDevice),
VMSTATE_END_OF_LIST()
}
};
--
1.8.3.1
* [PATCH V1 14/26] migration: close kvm after cpr
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (12 preceding siblings ...)
2025-01-29 14:43 ` [PATCH V1 13/26] vfio-pci: preserve INTx Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
2025-01-29 14:43 ` [PATCH V1 15/26] migration: cpr_get_fd_param helper Steve Sistare
` (11 subsequent siblings)
25 siblings, 0 replies; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
cpr-transfer breaks vfio network connectivity to and from the guest, and
the host system log shows:
irq bypass consumer (token 00000000a03c32e5) registration fails: -16
which is EBUSY. This occurs because KVM descriptors are still open in
the old QEMU process. Close them in old QEMU when precopy completes.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
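Reader's note, not part of the patch: the close routines in the diff below follow a close-then-invalidate pattern, so a later call sees -1 and does nothing. A minimal sketch of that pattern on plain POSIX descriptors follows. The helper names close_and_invalidate and fd_is_open are hypothetical and do not appear in QEMU.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Close *fd if it is open, then mark it invalid so a repeat call is a no-op. */
void close_and_invalidate(int *fd)
{
    if (*fd != -1) {
        close(*fd);
        *fd = -1;
    }
}

/* Return nonzero if fd names an open descriptor in this process. */
int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}
```

Resetting the variable to -1 matters as much as the close itself: a stale number could otherwise be closed a second time after the kernel has reused it for an unrelated file.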
accel/kvm/kvm-all.c | 20 ++++++++++++++++++++
hw/vfio/common.c | 8 ++++++++
include/hw/vfio/vfio-common.h | 1 +
include/migration/cpr.h | 2 ++
include/system/kvm.h | 1 +
migration/cpr-transfer.c | 18 ++++++++++++++++++
migration/cpr.c | 8 ++++++++
migration/migration.c | 1 +
8 files changed, 59 insertions(+)
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index c65b790..f4e341f 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -595,6 +595,26 @@ err:
return ret;
}
+void kvm_close(void)
+{
+ CPUState *cpu;
+
+ CPU_FOREACH(cpu) {
+ cpu_remove_sync(cpu);
+ close(cpu->kvm_fd);
+ cpu->kvm_fd = -1;
+ close(cpu->kvm_vcpu_stats_fd);
+ cpu->kvm_vcpu_stats_fd = -1;
+ }
+
+ if (kvm_state && kvm_state->fd != -1) {
+ close(kvm_state->vmfd);
+ kvm_state->vmfd = -1;
+ close(kvm_state->fd);
+ kvm_state->fd = -1;
+ }
+}
+
/*
* dirty pages logging control
*/
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c8ee71a..db0498e 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1511,6 +1511,14 @@ int vfio_kvm_device_del_fd(int fd, Error **errp)
return 0;
}
+void vfio_kvm_device_close(void)
+{
+ if (vfio_kvm_device_fd != -1) {
+ close(vfio_kvm_device_fd);
+ vfio_kvm_device_fd = -1;
+ }
+}
+
VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
{
VFIOAddressSpace *space;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 8a4a658..5a89aca 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -262,6 +262,7 @@ void vfio_detach_device(VFIODevice *vbasedev);
int vfio_kvm_device_add_fd(int fd, Error **errp);
int vfio_kvm_device_del_fd(int fd, Error **errp);
+void vfio_kvm_device_close(void);
bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp);
void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer);
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index faeb0cc..e8f4aba 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -29,7 +29,9 @@ void cpr_state_close(void);
struct QIOChannel *cpr_state_ioc(void);
bool cpr_needed_for_reuse(void *opaque);
+void cpr_kvm_close(void);
+void cpr_transfer_init(void);
QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
diff --git a/include/system/kvm.h b/include/system/kvm.h
index ab17c09..ad5c55e 100644
--- a/include/system/kvm.h
+++ b/include/system/kvm.h
@@ -194,6 +194,7 @@ bool kvm_has_sync_mmu(void);
int kvm_has_vcpu_events(void);
int kvm_max_nested_state_length(void);
int kvm_has_gsi_routing(void);
+void kvm_close(void);
/**
* kvm_arm_supports_user_irq
diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c
index e1f1403..396558f 100644
--- a/migration/cpr-transfer.c
+++ b/migration/cpr-transfer.c
@@ -17,6 +17,24 @@
#include "migration/vmstate.h"
#include "trace.h"
+static int cpr_transfer_notifier(NotifierWithReturn *notifier,
+ MigrationEvent *e,
+ Error **errp)
+{
+ if (e->type == MIG_EVENT_PRECOPY_DONE) {
+ cpr_kvm_close();
+ }
+ return 0;
+}
+
+void cpr_transfer_init(void)
+{
+ static NotifierWithReturn notifier;
+
+ migration_add_notifier_mode(&notifier, cpr_transfer_notifier,
+ MIG_MODE_CPR_TRANSFER);
+}
+
QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp)
{
MigrationAddress *addr = channel->addr;
diff --git a/migration/cpr.c b/migration/cpr.c
index e3f27e9..86eb484 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -7,12 +7,14 @@
#include "qemu/osdep.h"
#include "qapi/error.h"
+#include "hw/vfio/vfio-common.h"
#include "migration/cpr.h"
#include "migration/misc.h"
#include "migration/options.h"
#include "migration/qemu-file.h"
#include "migration/savevm.h"
#include "migration/vmstate.h"
+#include "system/kvm.h"
#include "system/runstate.h"
#include "trace.h"
@@ -243,3 +245,9 @@ bool cpr_needed_for_reuse(void *opaque)
MigMode mode = migrate_mode();
return mode == MIG_MODE_CPR_TRANSFER;
}
+
+void cpr_kvm_close(void)
+{
+ kvm_close();
+ vfio_kvm_device_close();
+}
diff --git a/migration/migration.c b/migration/migration.c
index 88b0991..7a69782 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -288,6 +288,7 @@ void migration_object_init(void)
ram_mig_init();
dirty_bitmap_mig_init();
+ cpr_transfer_init();
/* Initialize cpu throttle timers */
cpu_throttle_init();
--
1.8.3.1
* [PATCH V1 15/26] migration: cpr_get_fd_param helper
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (13 preceding siblings ...)
2025-01-29 14:43 ` [PATCH V1 14/26] migration: close kvm after cpr Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
2025-01-29 14:43 ` [PATCH V1 16/26] vfio: return mr from vfio_get_xlat_addr Steve Sistare
` (10 subsequent siblings)
25 siblings, 0 replies; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Add the helper function cpr_get_fd_param, to use when preserving
a file descriptor that is opened externally and passed to QEMU.
cpr_get_fd_param returns a descriptor number either from a QEMU
command-line parameter, from a getfd command, or from CPR state.
When a descriptor is passed to new QEMU via SCM_RIGHTS, its number
changes. Hence, during CPR, the command-line parameter is ignored
in new QEMU, and overridden by the value found in CPR state.
Similarly, if the descriptor was originally specified by a getfd
command in old QEMU, the fd number is not known outside of QEMU,
and it changes when sent to new QEMU via SCM_RIGHTS. Hence the
user cannot send getfd to new QEMU, but when the user sends a
hotplug command that references the fd, cpr_get_fd_param finds
its value in CPR state.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
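Reader's note, not part of the patch: the selection rule described above can be sketched as a small function that either looks up the saved descriptor or parses the user-supplied string. This is a simplified sketch under stated assumptions: struct saved, find_saved, parse_fd and get_fd_param are hypothetical names, the table stands in for CPR state, and only integer-valued strings are handled (the real helper also accepts getfd names through monitor_fd_param and saves the fd for a later update).

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for one entry of CPR state. */
struct saved {
    const char *name;
    int index;
    int fd;
};

/* Return the fd saved under (name, index), or -1 if there is none. */
int find_saved(const struct saved *tbl, size_t n, const char *name, int index)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].index == index && strcmp(tbl[i].name, name) == 0) {
            return tbl[i].fd;
        }
    }
    return -1;
}

/* Parse a non-negative decimal fd number; return -1 on any malformed input. */
int parse_fd(const char *s)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(s, &end, 10);
    if (errno || end == s || *end != '\0' || v < 0 || v > INT_MAX) {
        return -1;
    }
    return (int)v;
}

/*
 * During an incoming CPR, ignore fdname and use the saved value.
 * Otherwise take the value from fdname.
 */
int get_fd_param(bool cpr_incoming, const struct saved *tbl, size_t n,
                 const char *name, const char *fdname, int index,
                 bool *reused)
{
    *reused = cpr_incoming;
    if (cpr_incoming) {
        return find_saved(tbl, n, name, index);
    }
    return parse_fd(fdname);
}
```

The reused flag lets the caller tell a descriptor inherited from old QEMU from one that was freshly supplied, which is how callers can decide whether to skip device setup.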
include/migration/cpr.h | 2 ++
migration/cpr.c | 41 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 43 insertions(+)
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index e8f4aba..28414f7 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -30,6 +30,8 @@ struct QIOChannel *cpr_state_ioc(void);
bool cpr_needed_for_reuse(void *opaque);
void cpr_kvm_close(void);
+int cpr_get_fd_param(const char *name, const char *fdname, int index,
+ bool *reused, Error **errp);
void cpr_transfer_init(void);
QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
diff --git a/migration/cpr.c b/migration/cpr.c
index 86eb484..a1084e9 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -14,6 +14,7 @@
#include "migration/qemu-file.h"
#include "migration/savevm.h"
#include "migration/vmstate.h"
+#include "monitor/monitor.h"
#include "system/kvm.h"
#include "system/runstate.h"
#include "trace.h"
@@ -251,3 +252,43 @@ void cpr_kvm_close(void)
kvm_close();
vfio_kvm_device_close();
}
+
+/*
+ * cpr_get_fd_param: find a descriptor and return its value.
+ *
+ * @name: CPR name for the descriptor
+ * @fdname: An integer-valued string, or a name passed to a getfd command
+ * @index: CPR index of the descriptor
+ * @reused: returns true if the fd is found in CPR state, else false.
+ * @errp: returned error message
+ *
+ * If CPR is not being performed, then use @fdname to find the fd.
+ * If CPR is being performed, then ignore @fdname, and look for @name
+ * and @index in CPR state.
+ *
+ * On success returns the fd value, else returns -1.
+ */
+int cpr_get_fd_param(const char *name, const char *fdname, int index,
+ bool *reused, Error **errp)
+{
+ ERRP_GUARD();
+ MigMode mode = cpr_get_incoming_mode();
+ int fd;
+
+ if (mode == MIG_MODE_CPR_TRANSFER) {
+ fd = cpr_find_fd(name, index);
+ if (fd < 0) {
+ error_setg(errp, "cannot find saved value for fd %s", fdname);
+ }
+ *reused = true;
+ } else {
+ fd = monitor_fd_param(monitor_cur(), fdname, errp);
+ if (fd >= 0) {
+ cpr_save_fd(name, index, fd);
+ } else {
+ error_prepend(errp, "Could not parse object fd %s:", fdname);
+ }
+ *reused = false;
+ }
+ return fd;
+}
--
1.8.3.1
* [PATCH V1 16/26] vfio: return mr from vfio_get_xlat_addr
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (14 preceding siblings ...)
2025-01-29 14:43 ` [PATCH V1 15/26] migration: cpr_get_fd_param helper Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
2025-02-04 15:47 ` Cédric Le Goater
2025-01-29 14:43 ` [PATCH V1 17/26] vfio: pass ramblock to vfio_container_dma_map Steve Sistare
` (9 subsequent siblings)
25 siblings, 1 reply; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Return the memory region that the translated address is found in, for
use in a subsequent patch. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
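Reader's note, not part of the patch: the new mr_p argument in the diff below is an optional out-parameter, so existing callers can pass NULL and are unaffected. A minimal sketch of that convention follows, assuming a toy region table. The names struct region and region_lookup are hypothetical and unrelated to QEMU's MemoryRegion API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy address range; illustrative only. */
struct region {
    unsigned long base;
    unsigned long size;
    bool readonly;
};

/*
 * Find the region containing addr. Each out-parameter is optional:
 * pass NULL for any result the caller does not need.
 */
bool region_lookup(const struct region *regions, size_t n, unsigned long addr,
                   unsigned long *offset, bool *read_only,
                   const struct region **region_p)
{
    for (size_t i = 0; i < n; i++) {
        const struct region *r = &regions[i];

        if (addr >= r->base && addr - r->base < r->size) {
            if (offset) {
                *offset = addr - r->base;
            }
            if (read_only) {
                *read_only = r->readonly;
            }
            if (region_p) {
                *region_p = r;
            }
            return true;
        }
    }
    return false;
}
```

On failure the sketch leaves every out-parameter untouched, so callers should only read them after a true return.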
hw/vfio/common.c | 9 ++++++---
hw/virtio/vhost-vdpa.c | 2 +-
include/exec/memory.h | 5 ++++-
system/memory.c | 8 +++++++-
4 files changed, 18 insertions(+), 6 deletions(-)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index db0498e..4bbc29f 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -248,12 +248,13 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
/* Called with rcu_read_lock held. */
static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
ram_addr_t *ram_addr, bool *read_only,
+ MemoryRegion **mr_p,
Error **errp)
{
bool ret, mr_has_discard_manager;
ret = memory_get_xlat_addr(iotlb, vaddr, ram_addr, read_only,
- &mr_has_discard_manager, errp);
+ &mr_has_discard_manager, mr_p, errp);
if (ret && mr_has_discard_manager) {
/*
* Malicious VMs might trigger discarding of IOMMU-mapped memory. The
@@ -300,7 +301,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
bool read_only;
- if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, &local_err)) {
+ if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, NULL,
+ &local_err)) {
error_report_err(local_err);
goto out;
}
@@ -1279,7 +1281,8 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
}
rcu_read_lock();
- if (!vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL, &local_err)) {
+ if (!vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL, NULL,
+ &local_err)) {
error_report_err(local_err);
goto out_unlock;
}
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 3cdaa12..a1866bb 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -228,7 +228,7 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
bool read_only;
- if (!memory_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, NULL,
+ if (!memory_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, NULL, NULL,
&local_err)) {
error_report_err(local_err);
return;
diff --git a/include/exec/memory.h b/include/exec/memory.h
index ea5d33a..a2f1229 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -747,13 +747,16 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
* @read_only: indicates if writes are allowed
* @mr_has_discard_manager: indicates memory is controlled by a
* RamDiscardManager
+ * @mr_p: return the MemoryRegion containing the @iotlb translated addr
* @errp: pointer to Error*, to store an error if it happens.
*
* Return: true on success, else false setting @errp with error.
*/
bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
ram_addr_t *ram_addr, bool *read_only,
- bool *mr_has_discard_manager, Error **errp);
+ bool *mr_has_discard_manager,
+ MemoryRegion **mr_p,
+ Error **errp);
typedef struct CoalescedMemoryRange CoalescedMemoryRange;
typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
diff --git a/system/memory.c b/system/memory.c
index 4c82979..4ec2b8f 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2185,7 +2185,9 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
/* Called with rcu_read_lock held. */
bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
ram_addr_t *ram_addr, bool *read_only,
- bool *mr_has_discard_manager, Error **errp)
+ bool *mr_has_discard_manager,
+ MemoryRegion **mr_p,
+ Error **errp)
{
MemoryRegion *mr;
hwaddr xlat;
@@ -2250,6 +2252,10 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
*read_only = !writable || mr->readonly;
}
+ if (mr_p) {
+ *mr_p = mr;
+ }
+
return true;
}
--
1.8.3.1
* [PATCH V1 17/26] vfio: pass ramblock to vfio_container_dma_map
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (15 preceding siblings ...)
2025-01-29 14:43 ` [PATCH V1 16/26] vfio: return mr from vfio_get_xlat_addr Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
2025-01-29 14:43 ` [PATCH V1 18/26] vfio/iommufd: define iommufd_cdev_make_hwpt Steve Sistare
` (8 subsequent siblings)
25 siblings, 0 replies; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Pass ramblock to vfio_container_dma_map for use in a subsequent patch.
No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/common.c | 11 +++++++----
hw/vfio/container-base.c | 3 ++-
include/hw/vfio/vfio-container-base.h | 3 ++-
3 files changed, 11 insertions(+), 6 deletions(-)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 4bbc29f..aceb0cf 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -282,6 +282,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
VFIOContainerBase *bcontainer = giommu->bcontainer;
hwaddr iova = iotlb->iova + giommu->iommu_offset;
+ MemoryRegion *mr;
void *vaddr;
int ret;
Error *local_err = NULL;
@@ -301,7 +302,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
bool read_only;
- if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, NULL,
+ if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, &mr,
&local_err)) {
error_report_err(local_err);
goto out;
@@ -315,7 +316,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
*/
ret = vfio_container_dma_map(bcontainer, iova,
iotlb->addr_mask + 1, vaddr,
- read_only);
+ read_only, mr->ram_block);
if (ret) {
error_report("vfio_container_dma_map(%p, 0x%"HWADDR_PRIx", "
"0x%"HWADDR_PRIx", %p) = %d (%s)",
@@ -380,7 +381,8 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
vaddr = memory_region_get_ram_ptr(section->mr) + start;
ret = vfio_container_dma_map(bcontainer, iova, next - start,
- vaddr, section->readonly);
+ vaddr, section->readonly,
+ section->mr->ram_block);
if (ret) {
/* Rollback */
vfio_ram_discard_notify_discard(rdl, section);
@@ -729,7 +731,8 @@ void vfio_container_region_add(VFIOContainerBase *bcontainer,
}
ret = vfio_container_dma_map(bcontainer, iova, int128_get64(llsize),
- vaddr, section->readonly);
+ vaddr, section->readonly,
+ section->mr->ram_block);
if (ret) {
error_setg(&err, "vfio_container_dma_map(%p, 0x%"HWADDR_PRIx", "
"0x%"HWADDR_PRIx", %p) = %d (%s)",
diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
index 749a3fd..302cd4c 100644
--- a/hw/vfio/container-base.c
+++ b/hw/vfio/container-base.c
@@ -17,7 +17,8 @@
int vfio_container_dma_map(VFIOContainerBase *bcontainer,
hwaddr iova, ram_addr_t size,
- void *vaddr, bool readonly)
+ void *vaddr, bool readonly,
+ RAMBlock *rb)
{
VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index 4cff994..d82e256 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -73,7 +73,8 @@ typedef struct VFIORamDiscardListener {
int vfio_container_dma_map(VFIOContainerBase *bcontainer,
hwaddr iova, ram_addr_t size,
- void *vaddr, bool readonly);
+ void *vaddr, bool readonly,
+ RAMBlock *rb);
int vfio_container_dma_unmap(VFIOContainerBase *bcontainer,
hwaddr iova, ram_addr_t size,
IOMMUTLBEntry *iotlb);
--
1.8.3.1
* [PATCH V1 18/26] vfio/iommufd: define iommufd_cdev_make_hwpt
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (16 preceding siblings ...)
2025-01-29 14:43 ` [PATCH V1 17/26] vfio: pass ramblock to vfio_container_dma_map Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
2025-02-04 16:22 ` Cédric Le Goater
2025-01-29 14:43 ` [PATCH V1 19/26] vfio/iommufd: use IOMMU_IOAS_MAP_FILE Steve Sistare
` (7 subsequent siblings)
25 siblings, 1 reply; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Refactor and define iommufd_cdev_make_hwpt, to be called by CPR in a
later patch. No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/iommufd.c | 69 +++++++++++++++++++++++++++++++++----------------------
1 file changed, 41 insertions(+), 28 deletions(-)
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 3490a8f..42ba63f 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -275,6 +275,41 @@ static bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
return true;
}
+static void iommufd_cdev_set_hwpt(VFIODevice *vbasedev, VFIOIOASHwpt *hwpt)
+{
+ vbasedev->hwpt = hwpt;
+ vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
+ QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
+}
+
+static VFIOIOASHwpt *iommufd_cdev_make_hwpt(VFIODevice *vbasedev,
+ VFIOIOMMUFDContainer *container,
+ uint32_t hwpt_id)
+{
+ VFIOIOASHwpt *hwpt = g_malloc0(sizeof(*hwpt));
+ uint32_t flags = 0;
+
+ /*
+ * This is quite early and VFIO Migration state isn't yet fully
+ * initialized, thus rely only on IOMMU hardware capabilities as to
+ * whether IOMMU dirty tracking is going to be requested. Later
+ * vfio_migration_realize() may decide to use VF dirty tracking
+ * instead.
+ */
+ if (vbasedev->hiod->caps.hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
+ flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
+ }
+
+ hwpt->hwpt_id = hwpt_id;
+ hwpt->hwpt_flags = flags;
+ QLIST_INIT(&hwpt->device_list);
+
+ QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
+ container->bcontainer.dirty_pages_supported |=
+ iommufd_hwpt_dirty_tracking(hwpt);
+ return hwpt;
+}
+
static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
VFIOIOMMUFDContainer *container,
Error **errp)
@@ -304,24 +339,11 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
return false;
} else {
- vbasedev->hwpt = hwpt;
- QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
- vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
+ iommufd_cdev_set_hwpt(vbasedev, hwpt);
return true;
}
}
- /*
- * This is quite early and VFIO Migration state isn't yet fully
- * initialized, thus rely only on IOMMU hardware capabilities as to
- * whether IOMMU dirty tracking is going to be requested. Later
- * vfio_migration_realize() may decide to use VF dirty tracking
- * instead.
- */
- if (vbasedev->hiod->caps.hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
- flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
- }
-
if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
container->ioas_id, flags,
IOMMU_HWPT_DATA_NONE, 0, NULL,
@@ -329,24 +351,15 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
return false;
}
- hwpt = g_malloc0(sizeof(*hwpt));
- hwpt->hwpt_id = hwpt_id;
- hwpt->hwpt_flags = flags;
- QLIST_INIT(&hwpt->device_list);
-
- ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
+ ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt_id, errp);
if (ret) {
- iommufd_backend_free_id(container->be, hwpt->hwpt_id);
- g_free(hwpt);
+ iommufd_backend_free_id(container->be, hwpt_id);
return false;
}
- vbasedev->hwpt = hwpt;
- vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
- QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
- QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
- container->bcontainer.dirty_pages_supported |=
- vbasedev->iommu_dirty_tracking;
+ hwpt = iommufd_cdev_make_hwpt(vbasedev, container, hwpt_id);
+ iommufd_cdev_set_hwpt(vbasedev, hwpt);
+
if (container->bcontainer.dirty_pages_supported &&
!vbasedev->iommu_dirty_tracking) {
warn_report("IOMMU instance for device %s doesn't support dirty tracking",
--
1.8.3.1
* [PATCH V1 19/26] vfio/iommufd: use IOMMU_IOAS_MAP_FILE
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (17 preceding siblings ...)
2025-01-29 14:43 ` [PATCH V1 18/26] vfio/iommufd: define iommufd_cdev_make_hwpt Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
2025-02-05 17:23 ` Cédric Le Goater
2025-01-29 14:43 ` [PATCH V1 20/26] vfio/iommufd: export iommufd_cdev_get_info_iova_range Steve Sistare
` (6 subsequent siblings)
25 siblings, 1 reply; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Use IOMMU_IOAS_MAP_FILE when the mapped region is backed by a file.
Such a mapping can be preserved without modification during CPR,
because it depends on the file's address space, which does not change,
rather than on the process's address space, which does change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
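A minimal sketch of the offset arithmetic this patch relies on, for
illustration only (it is not the QEMU code). The helper name
file_offset_for_vaddr and its scalar parameters are assumptions; they stand
in for the values the patch obtains from qemu_ram_get_host_addr() and
qemu_ram_get_fd_offset(). The point is that the same byte of guest RAM
resolves to the same file offset no matter where the block happens to be
mapped in the process.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of QEMU: translate a process virtual
 * address that lies inside a file-backed RAM block into an offset within
 * the backing file.
 *
 *   block_host      - VA at which the block is mapped in this process
 *   block_fd_offset - offset of the block's first byte within the file
 *   vaddr           - VA of the region to be mapped for DMA
 */
static uint64_t file_offset_for_vaddr(uint64_t block_host,
                                      uint64_t block_fd_offset,
                                      uint64_t vaddr)
{
    assert(vaddr >= block_host);    /* region must lie inside the block */
    return (vaddr - block_host) + block_fd_offset;
}
```

Calling the helper with two different block_host values but the same
distance into the block yields the same result, which is why a mapping
keyed on the file offset does not need to change when the VA does.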
backends/iommufd.c | 36 +++++++++++++++++++++++++++++++++++
backends/trace-events | 1 +
hw/vfio/container-base.c | 9 +++++++++
hw/vfio/iommufd.c | 13 +++++++++++++
include/exec/cpu-common.h | 1 +
include/hw/vfio/vfio-container-base.h | 3 +++
include/system/iommufd.h | 3 +++
system/physmem.c | 5 +++++
8 files changed, 71 insertions(+)
diff --git a/backends/iommufd.c b/backends/iommufd.c
index 7b4fc8e..6d29221 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -174,6 +174,42 @@ int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas_id, hwaddr iova,
return ret;
}
+int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
+ hwaddr iova, ram_addr_t size,
+ int mfd, unsigned long start, bool readonly)
+{
+ int ret, fd = be->fd;
+ struct iommu_ioas_map_file map = {
+ .size = sizeof(map),
+ .flags = IOMMU_IOAS_MAP_READABLE |
+ IOMMU_IOAS_MAP_FIXED_IOVA,
+ .ioas_id = ioas_id,
+ .fd = mfd,
+ .start = start,
+ .iova = iova,
+ .length = size,
+ };
+
+ if (!readonly) {
+ map.flags |= IOMMU_IOAS_MAP_WRITEABLE;
+ }
+
+ ret = ioctl(fd, IOMMU_IOAS_MAP_FILE, &map);
+ trace_iommufd_backend_map_file_dma(fd, ioas_id, iova, size, mfd, start,
+ readonly, ret);
+ if (ret) {
+ ret = -errno;
+
+ /* TODO: Mapping a hardware PCI BAR region is not supported for now. */
+ if (errno == EFAULT) {
+ warn_report("IOMMU_IOAS_MAP_FILE failed: %m, PCI BAR?");
+ } else {
+ error_report("IOMMU_IOAS_MAP_FILE failed: %m");
+ }
+ }
+ return ret;
+}
+
int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
hwaddr iova, ram_addr_t size)
{
diff --git a/backends/trace-events b/backends/trace-events
index 40811a3..f478e18 100644
--- a/backends/trace-events
+++ b/backends/trace-events
@@ -11,6 +11,7 @@ iommufd_backend_connect(int fd, bool owned, uint32_t users) "fd=%d owned=%d user
iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
iommufd_backend_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, void *vaddr, bool readonly, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" addr=%p readonly=%d (%d)"
+iommufd_backend_map_file_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int fd, unsigned long start, bool readonly, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" fd=%d start=%lu readonly=%d (%d)"
iommufd_backend_unmap_dma_non_exist(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " Unmap nonexistent mapping: iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d"
diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
index 302cd4c..fbaf04a 100644
--- a/hw/vfio/container-base.c
+++ b/hw/vfio/container-base.c
@@ -21,7 +21,16 @@ int vfio_container_dma_map(VFIOContainerBase *bcontainer,
RAMBlock *rb)
{
VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
+ int mfd = rb ? qemu_ram_get_fd(rb) : -1;
+ if (mfd >= 0 && vioc->dma_map_file) {
+ unsigned long start = vaddr - qemu_ram_get_host_addr(rb);
+ unsigned long offset = qemu_ram_get_fd_offset(rb);
+
+ /* Propagate the result rather than unconditionally returning 0. */
+ return vioc->dma_map_file(bcontainer, iova, size, mfd, start + offset,
+ readonly);
+ }
g_assert(vioc->dma_map);
return vioc->dma_map(bcontainer, iova, size, vaddr, readonly);
}
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 42ba63f..a3e7edb 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -38,6 +38,18 @@ static int iommufd_cdev_map(const VFIOContainerBase *bcontainer, hwaddr iova,
iova, size, vaddr, readonly);
}
+static int iommufd_cdev_map_file(const VFIOContainerBase *bcontainer,
+ hwaddr iova, ram_addr_t size,
+ int fd, unsigned long start, bool readonly)
+{
+ const VFIOIOMMUFDContainer *container =
+ container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
+
+ return iommufd_backend_map_file_dma(container->be,
+ container->ioas_id,
+ iova, size, fd, start, readonly);
+}
+
static int iommufd_cdev_unmap(const VFIOContainerBase *bcontainer,
hwaddr iova, ram_addr_t size,
IOMMUTLBEntry *iotlb)
@@ -806,6 +818,7 @@ static void vfio_iommu_iommufd_class_init(ObjectClass *klass, void *data)
vioc->hiod_typename = TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO;
vioc->dma_map = iommufd_cdev_map;
+ vioc->dma_map_file = iommufd_cdev_map_file;
vioc->dma_unmap = iommufd_cdev_unmap;
vioc->attach_device = iommufd_cdev_attach;
vioc->detach_device = iommufd_cdev_detach;
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index b1d76d6..0cab252 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -95,6 +95,7 @@ void qemu_ram_unset_idstr(RAMBlock *block);
const char *qemu_ram_get_idstr(RAMBlock *rb);
void *qemu_ram_get_host_addr(RAMBlock *rb);
ram_addr_t qemu_ram_get_offset(RAMBlock *rb);
+ram_addr_t qemu_ram_get_fd_offset(RAMBlock *rb);
ram_addr_t qemu_ram_get_used_length(RAMBlock *rb);
ram_addr_t qemu_ram_get_max_length(RAMBlock *rb);
bool qemu_ram_is_shared(RAMBlock *rb);
diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index d82e256..4daa5f8 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -115,6 +115,9 @@ struct VFIOIOMMUClass {
int (*dma_map)(const VFIOContainerBase *bcontainer,
hwaddr iova, ram_addr_t size,
void *vaddr, bool readonly);
+ int (*dma_map_file)(const VFIOContainerBase *bcontainer,
+ hwaddr iova, ram_addr_t size,
+ int fd, unsigned long start, bool readonly);
int (*dma_unmap)(const VFIOContainerBase *bcontainer,
hwaddr iova, ram_addr_t size,
IOMMUTLBEntry *iotlb);
diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index cbab75b..ac700b8 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -43,6 +43,9 @@ void iommufd_backend_disconnect(IOMMUFDBackend *be);
bool iommufd_backend_alloc_ioas(IOMMUFDBackend *be, uint32_t *ioas_id,
Error **errp);
void iommufd_backend_free_id(IOMMUFDBackend *be, uint32_t id);
+int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
+ hwaddr iova, ram_addr_t size, int fd,
+ unsigned long start, bool readonly);
int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas_id, hwaddr iova,
ram_addr_t size, void *vaddr, bool readonly);
int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
diff --git a/system/physmem.c b/system/physmem.c
index 0bcfc6c..c41a80b 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1569,6 +1569,11 @@ ram_addr_t qemu_ram_get_offset(RAMBlock *rb)
return rb->offset;
}
+ram_addr_t qemu_ram_get_fd_offset(RAMBlock *rb)
+{
+ return rb->fd_offset;
+}
+
ram_addr_t qemu_ram_get_used_length(RAMBlock *rb)
{
return rb->used_length;
--
1.8.3.1
* [PATCH V1 20/26] vfio/iommufd: export iommufd_cdev_get_info_iova_range
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (18 preceding siblings ...)
2025-01-29 14:43 ` [PATCH V1 19/26] vfio/iommufd: use IOMMU_IOAS_MAP_FILE Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
2025-02-05 17:33 ` Cédric Le Goater
2025-01-29 14:43 ` [PATCH V1 21/26] iommufd: change process ioctl Steve Sistare
` (5 subsequent siblings)
25 siblings, 1 reply; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Export iommufd_cdev_get_info_iova_range for use by CPR.
No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/iommufd.c | 4 ++--
include/hw/vfio/vfio-common.h | 2 ++
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index a3e7edb..2f888e5 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -442,8 +442,8 @@ static int iommufd_cdev_ram_block_discard_disable(bool state)
return ram_block_uncoordinated_discard_disable(state);
}
-static bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
- uint32_t ioas_id, Error **errp)
+bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
+ uint32_t ioas_id, Error **errp)
{
VFIOContainerBase *bcontainer = &container->bcontainer;
g_autofree struct iommu_ioas_iova_ranges *info = NULL;
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 5a89aca..ca10abc 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -268,6 +268,8 @@ bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp);
void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer);
bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp);
void vfio_legacy_cpr_unregister_container(VFIOContainer *container);
+bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
+ uint32_t ioas_id, Error **errp);
extern const MemoryRegionOps vfio_region_ops;
typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
--
1.8.3.1
* [PATCH V1 21/26] iommufd: change process ioctl
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (19 preceding siblings ...)
2025-01-29 14:43 ` [PATCH V1 20/26] vfio/iommufd: export iommufd_cdev_get_info_iova_range Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
2025-02-05 17:34 ` Cédric Le Goater
2025-01-29 14:43 ` [PATCH V1 22/26] vfio/iommufd: invariant device name Steve Sistare
` (4 subsequent siblings)
25 siblings, 1 reply; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
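Define wrappers for the IOMMU_IOAS_CHANGE_PROCESS ioctl:
iommufd_change_process_capable tests whether the kernel accepts it, and
iommufd_change_process issues it and reports any failure.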
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
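The two wrappers in this patch follow a common shape: one call folds the
ioctl result into 0 or a negative errno, and a probe reports only whether
the request was accepted. A generic sketch of that shape follows; the names
ioctl_neg_errno and ioctl_capable are illustrative assumptions, not QEMU or
kernel APIs, and the request number is left to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/ioctl.h>

/*
 * Issue an ioctl and fold the result into 0 or a negative errno value.
 * Like the patch, this treats any nonzero return as failure.
 */
static int ioctl_neg_errno(int fd, unsigned long request, void *arg)
{
    if (ioctl(fd, request, arg)) {
        return -errno;
    }
    return 0;
}

/* Probe form: report only whether the request was accepted. */
static bool ioctl_capable(int fd, unsigned long request, void *arg)
{
    return ioctl_neg_errno(fd, request, arg) == 0;
}
```

With an invalid descriptor the first helper returns -EBADF and the probe
returns false, so a caller can add a blocker instead of failing later.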
backends/iommufd.c | 20 ++++++++++++++++++++
backends/trace-events | 1 +
include/system/iommufd.h | 2 ++
3 files changed, 23 insertions(+)
diff --git a/backends/iommufd.c b/backends/iommufd.c
index 6d29221..be5f6a3 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -73,6 +73,26 @@ static void iommufd_backend_class_init(ObjectClass *oc, void *data)
object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
}
+bool iommufd_change_process_capable(IOMMUFDBackend *be)
+{
+ struct iommu_ioas_change_process args = {.size = sizeof(args)};
+
+ return !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
+}
+
+int iommufd_change_process(IOMMUFDBackend *be)
+{
+ struct iommu_ioas_change_process args = {.size = sizeof(args)};
+ int ret = ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
+
+ if (ret) {
+ ret = -errno;
+ error_report("IOMMU_IOAS_CHANGE_PROCESS fd %d failed: %m", be->fd);
+ }
+ trace_iommufd_change_process(be->fd, ret);
+ return ret;
+}
+
bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
{
int fd;
diff --git a/backends/trace-events b/backends/trace-events
index f478e18..9b33dc3 100644
--- a/backends/trace-events
+++ b/backends/trace-events
@@ -7,6 +7,7 @@ dbus_vmstate_loading(const char *id) "id: %s"
dbus_vmstate_saving(const char *id) "id: %s"
# iommufd.c
+iommufd_change_process(int fd, int ret) "fd=%d (%d)"
iommufd_backend_connect(int fd, bool owned, uint32_t users) "fd=%d owned=%d users=%d"
iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index ac700b8..4e9c037 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -64,6 +64,8 @@ bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t hwpt_id,
uint64_t iova, ram_addr_t size,
uint64_t page_size, uint64_t *data,
Error **errp);
+bool iommufd_change_process_capable(IOMMUFDBackend *be);
+int iommufd_change_process(IOMMUFDBackend *be);
#define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
#endif
--
1.8.3.1
* [PATCH V1 22/26] vfio/iommufd: invariant device name
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (20 preceding siblings ...)
2025-01-29 14:43 ` [PATCH V1 21/26] iommufd: change process ioctl Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
2025-02-05 17:42 ` Cédric Le Goater
2025-01-29 14:43 ` [PATCH V1 23/26] vfio/iommufd: register container for cpr Steve Sistare
` (3 subsequent siblings)
25 siblings, 1 reply; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
cpr-transfer will use the device name as a key to find the value
of the device descriptor in new QEMU. However, if the descriptor
number is specified by a command-line fd parameter, then
vfio_device_get_name creates a name that includes the fd number.
This causes a chicken-and-egg problem: new QEMU must know the fd
number to construct a name to find the fd number.
To fix, create an invariant name based on the id command-line
parameter. If id is not defined, add a CPR blocker.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
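A small sketch of the naming rule described above, assuming hypothetical
helpers name_is_invariant and device_name (neither exists in QEMU). It
shows why an id-based name can serve as a lookup key in new QEMU while an
fd-based name cannot: the former is the same for any fd number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* True when the name does not depend on the fd number. */
static bool name_is_invariant(const char *id)
{
    return id != NULL;
}

/*
 * Build the device name. Returns a pointer to a static buffer that is
 * overwritten by the next call, which is enough for a sketch.
 */
static const char *device_name(const char *id, int fd)
{
    static char name[64];

    if (name_is_invariant(id)) {
        /* Same name regardless of which fd number the device got. */
        snprintf(name, sizeof(name), "%s", id);
    } else {
        /* Fallback name embeds the fd, so it cannot be a stable key. */
        snprintf(name, sizeof(name), "VFIO_FD%d", fd);
    }
    return name;
}
```

In the patch, the fallback case is the one that adds a CPR blocker.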
hw/vfio/helpers.c | 18 +++++++++++++++---
hw/vfio/iommufd.c | 2 ++
include/hw/vfio/vfio-common.h | 1 +
3 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
index 913796f..bd94b86 100644
--- a/hw/vfio/helpers.c
+++ b/hw/vfio/helpers.c
@@ -25,6 +25,8 @@
#include "hw/vfio/vfio-common.h"
#include "hw/hw.h"
#include "trace.h"
+#include "migration/blocker.h"
+#include "migration/cpr.h"
#include "qapi/error.h"
#include "qemu/error-report.h"
#include "qemu/units.h"
@@ -636,6 +638,7 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
{
ERRP_GUARD();
struct stat st;
+ bool ret = true;
if (vbasedev->fd < 0) {
if (stat(vbasedev->sysfsdev, &st) < 0) {
@@ -653,15 +656,24 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
return false;
}
/*
- * Give a name with fd so any function printing out vbasedev->name
+ * Give a name so any function printing out vbasedev->name
* will not break.
*/
if (!vbasedev->name) {
- vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
+ if (vbasedev->dev->id) {
+ vbasedev->name = g_strdup(vbasedev->dev->id);
+ } else {
+ vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
+ error_setg(&vbasedev->cpr_id_blocker,
+ "vfio device with fd=%d needs an id property",
+ vbasedev->fd);
+ ret = migrate_add_blocker_modes(&vbasedev->cpr_id_blocker, errp,
+ MIG_MODE_CPR_TRANSFER, -1) == 0;
+ }
}
}
- return true;
+ return ret;
}
void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 2f888e5..8308715 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -24,6 +24,7 @@
#include "system/reset.h"
#include "qemu/cutils.h"
#include "qemu/chardev_open.h"
+#include "migration/blocker.h"
#include "pci.h"
#include "exec/ram_addr.h"
@@ -657,6 +658,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
iommufd_cdev_container_destroy(container);
vfio_put_address_space(space);
+ migrate_del_blocker(&vbasedev->cpr_id_blocker);
iommufd_cdev_unbind_and_disconnect(vbasedev);
close(vbasedev->fd);
}
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index ca10abc..37e7c26 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -147,6 +147,7 @@ typedef struct VFIODevice {
VFIOMigration *migration;
Error *migration_blocker;
Error *cpr_mdev_blocker;
+ Error *cpr_id_blocker;
OnOffAuto pre_copy_dirty_page_tracking;
OnOffAuto device_dirty_page_tracking;
bool dirty_pages_supported;
--
1.8.3.1
* [PATCH V1 23/26] vfio/iommufd: register container for cpr
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (21 preceding siblings ...)
2025-01-29 14:43 ` [PATCH V1 22/26] vfio/iommufd: invariant device name Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
2025-02-05 17:45 ` Cédric Le Goater
2025-01-29 14:43 ` [PATCH V1 24/26] vfio/iommufd: preserve descriptors Steve Sistare
` (2 subsequent siblings)
25 siblings, 1 reply; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Register a vfio iommufd container for CPR. Add a blocker if the kernel does
not support IOMMU_IOAS_CHANGE_PROCESS.
This is mostly boilerplate. The fields to be saved and restored are added
in subsequent patches.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/cpr-iommufd.c | 96 +++++++++++++++++++++++++++++++++++++++++++
hw/vfio/iommufd.c | 12 +++---
hw/vfio/meson.build | 1 +
include/hw/vfio/vfio-common.h | 6 +++
4 files changed, 110 insertions(+), 5 deletions(-)
create mode 100644 hw/vfio/cpr-iommufd.c
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
new file mode 100644
index 0000000..4eb358a
--- /dev/null
+++ b/hw/vfio/cpr-iommufd.c
@@ -0,0 +1,96 @@
+/*
+ * Copyright (c) 2024-2025 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "hw/vfio/vfio-common.h"
+#include "migration/blocker.h"
+#include "migration/cpr.h"
+#include "migration/migration.h"
+#include "migration/vmstate.h"
+#include "system/iommufd.h"
+
+static bool vfio_cpr_supported(VFIOIOMMUFDContainer *container, Error **errp)
+{
+ if (!iommufd_change_process_capable(container->be)) {
+ error_setg(errp,
+ "VFIO container does not support IOMMU_IOAS_CHANGE_PROCESS");
+ return false;
+ }
+ return true;
+}
+
+static const VMStateDescription vfio_container_vmstate = {
+ .name = "vfio-iommufd-container",
+ .version_id = 0,
+ .minimum_version_id = 0,
+ .needed = cpr_needed_for_reuse,
+ .fields = (VMStateField[]) {
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+static const VMStateDescription iommufd_cpr_vmstate = {
+ .name = "iommufd",
+ .version_id = 0,
+ .minimum_version_id = 0,
+ .needed = cpr_needed_for_reuse,
+ .fields = (VMStateField[]) {
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+bool vfio_iommufd_cpr_register_container(VFIOIOMMUFDContainer *container,
+ Error **errp)
+{
+ VFIOContainerBase *bcontainer = &container->bcontainer;
+ Error **cpr_blocker = &container->cpr_blocker;
+
+ if (!vfio_cpr_register_container(bcontainer, errp)) {
+ return false;
+ }
+
+ if (!vfio_cpr_supported(container, cpr_blocker)) {
+ return migrate_add_blocker_modes(cpr_blocker, errp,
+ MIG_MODE_CPR_TRANSFER, -1) == 0;
+ }
+
+ vmstate_register(NULL, -1, &vfio_container_vmstate, container);
+ vmstate_register(NULL, -1, &iommufd_cpr_vmstate, container->be);
+
+ return true;
+}
+
+void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
+{
+ VFIOContainerBase *bcontainer = &container->bcontainer;
+
+ vmstate_unregister(NULL, &iommufd_cpr_vmstate, container->be);
+ vmstate_unregister(NULL, &vfio_container_vmstate, container);
+ migrate_del_blocker(&container->cpr_blocker);
+ vfio_cpr_unregister_container(bcontainer);
+}
+
+static const VMStateDescription vfio_device_vmstate = {
+ .name = "vfio-iommufd-device",
+ .version_id = 0,
+ .minimum_version_id = 0,
+ .needed = cpr_needed_for_reuse,
+ .fields = (VMStateField[]) {
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
+{
+ vmstate_register(NULL, -1, &vfio_device_vmstate, vbasedev);
+}
+
+void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
+{
+ vmstate_unregister(NULL, &vfio_device_vmstate, vbasedev);
+}
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 8308715..ae78e00 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -592,6 +592,10 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
bcontainer->initialized = true;
+ if (!vfio_iommufd_cpr_register_container(container, errp)) {
+ goto err_listener_register;
+ }
+
found_container:
ret = ioctl(devfd, VFIO_DEVICE_GET_INFO, &dev_info);
if (ret) {
@@ -599,10 +603,6 @@ found_container:
goto err_listener_register;
}
- if (!vfio_cpr_register_container(bcontainer, errp)) {
- goto err_listener_register;
- }
-
/*
* TODO: examine RAM_BLOCK_DISCARD stuff, should we do group level
* for discarding incompatibility check as well?
@@ -619,6 +619,7 @@ found_container:
vbasedev->bcontainer = bcontainer;
QLIST_INSERT_HEAD(&bcontainer->device_list, vbasedev, container_next);
QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
+ vfio_iommufd_cpr_register_device(vbasedev);
trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
vbasedev->num_regions, vbasedev->flags);
@@ -653,12 +654,13 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
iommufd_cdev_ram_block_discard_disable(false);
}
- vfio_cpr_unregister_container(bcontainer);
+ vfio_iommufd_cpr_unregister_container(container);
iommufd_cdev_detach_container(vbasedev, container);
iommufd_cdev_container_destroy(container);
vfio_put_address_space(space);
migrate_del_blocker(&vbasedev->cpr_id_blocker);
+ vfio_iommufd_cpr_unregister_device(vbasedev);
iommufd_cdev_unbind_and_disconnect(vbasedev);
close(vbasedev->fd);
}
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index 5487815..998adb5 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -13,6 +13,7 @@ vfio_ss.add(when: 'CONFIG_IOMMUFD', if_true: files(
vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
'cpr.c',
'cpr-legacy.c',
+ 'cpr-iommufd.c',
'display.c',
'pci-quirks.c',
'pci.c',
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 37e7c26..add44d4 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -113,6 +113,7 @@ typedef struct VFIOIOASHwpt {
typedef struct VFIOIOMMUFDContainer {
VFIOContainerBase bcontainer;
IOMMUFDBackend *be;
+ Error *cpr_blocker;
uint32_t ioas_id;
QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
} VFIOIOMMUFDContainer;
@@ -271,6 +272,11 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp);
void vfio_legacy_cpr_unregister_container(VFIOContainer *container);
bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
uint32_t ioas_id, Error **errp);
+bool vfio_iommufd_cpr_register_container(VFIOIOMMUFDContainer *container,
+ Error **errp);
+void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container);
+void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev);
+void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev);
extern const MemoryRegionOps vfio_region_ops;
typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
--
1.8.3.1
* [PATCH V1 24/26] vfio/iommufd: preserve descriptors
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (22 preceding siblings ...)
2025-01-29 14:43 ` [PATCH V1 23/26] vfio/iommufd: register container for cpr Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
2025-01-29 14:43 ` [PATCH V1 25/26] vfio/iommufd: reconstruct device Steve Sistare
2025-01-29 14:43 ` [PATCH V1 26/26] iommufd: preserve DMA mappings Steve Sistare
25 siblings, 0 replies; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Save the iommu and vfio device fds in CPR state when they are created.
After CPR, the fd number is found in CPR state and reused. Remember
the reused status for subsequent patches. The reused status is cleared
when vmstate load finishes.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
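The find-or-open flow used for both descriptors can be sketched with a toy
name-keyed table. Everything prefixed sketch_ is a hypothetical stand-in
written for this note; it does not reproduce the cpr_find_fd,
cpr_resave_fd, or cpr_delete_fd implementations, and the fresh_fd
parameter simply models the descriptor that a real open would return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_FDS 16

struct sketch_fd_entry {
    char name[64];
    int id;
    int fd;
    bool used;
};

static struct sketch_fd_entry sketch_fds[SKETCH_MAX_FDS];

/* Return the live entry for (name, id), or NULL if there is none. */
static struct sketch_fd_entry *sketch_lookup(const char *name, int id)
{
    for (int i = 0; i < SKETCH_MAX_FDS; i++) {
        struct sketch_fd_entry *e = &sketch_fds[i];
        if (e->used && e->id == id && strcmp(e->name, name) == 0) {
            return e;
        }
    }
    return NULL;
}

/* Return the saved fd for (name, id), or -1 if it was never saved. */
static int sketch_find_fd(const char *name, int id)
{
    struct sketch_fd_entry *e = sketch_lookup(name, id);
    return e ? e->fd : -1;
}

/* Forget the saved fd, if any. */
static void sketch_delete_fd(const char *name, int id)
{
    struct sketch_fd_entry *e = sketch_lookup(name, id);
    if (e) {
        e->used = false;
    }
}

/* Insert or overwrite an entry; returns false if the table is full. */
static bool sketch_resave_fd(const char *name, int id, int fd)
{
    struct sketch_fd_entry *e = sketch_lookup(name, id);

    for (int i = 0; !e && i < SKETCH_MAX_FDS; i++) {
        if (!sketch_fds[i].used) {
            e = &sketch_fds[i];
        }
    }
    if (!e) {
        return false;
    }
    snprintf(e->name, sizeof(e->name), "%s", name);
    e->id = id;
    e->fd = fd;
    e->used = true;
    return true;
}

struct sketch_open_result {
    int fd;
    bool reused;
};

/* Reuse a saved fd when present; otherwise take fresh_fd and save it. */
static struct sketch_open_result sketch_get_or_open(const char *name,
                                                    int fresh_fd)
{
    struct sketch_open_result r;

    r.fd = sketch_find_fd(name, 0);
    r.reused = (r.fd >= 0);
    if (!r.reused) {
        r.fd = fresh_fd;
    }
    if (r.fd >= 0) {
        sketch_resave_fd(name, 0, r.fd);
    }
    return r;
}
```

The first call for a name behaves like old QEMU (open, then save); a later
call for the same name behaves like new QEMU (found, so reused is true and
no open is attempted).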
backends/iommufd.c | 24 ++++++++++++++++++++----
hw/vfio/cpr-iommufd.c | 15 +++++++++++++++
hw/vfio/helpers.c | 10 ++--------
hw/vfio/iommufd.c | 10 +++++++++-
include/system/iommufd.h | 1 +
5 files changed, 47 insertions(+), 13 deletions(-)
diff --git a/backends/iommufd.c b/backends/iommufd.c
index be5f6a3..e9452e4 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -16,12 +16,18 @@
#include "qemu/module.h"
#include "qom/object_interfaces.h"
#include "qemu/error-report.h"
+#include "migration/cpr.h"
#include "monitor/monitor.h"
#include "trace.h"
#include "hw/vfio/vfio-common.h"
#include <sys/ioctl.h>
#include <linux/iommufd.h>
+static const char *iommufd_fd_name(IOMMUFDBackend *be)
+{
+ return object_get_canonical_path_component(OBJECT(be));
+}
+
static void iommufd_backend_init(Object *obj)
{
IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
@@ -46,10 +52,10 @@ static void iommufd_backend_set_fd(Object *obj, const char *str, Error **errp)
ERRP_GUARD();
IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
int fd = -1;
+ const char *name = iommufd_fd_name(be);
- fd = monitor_fd_param(monitor_cur(), str, errp);
+ fd = cpr_get_fd_param(name, str, 0, &be->reused, errp);
if (fd == -1) {
- error_prepend(errp, "Could not parse remote object fd %s:", str);
return;
}
be->fd = fd;
@@ -98,10 +104,16 @@ bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
int fd;
if (be->owned && !be->users) {
- fd = qemu_open("/dev/iommu", O_RDWR, errp);
+ const char *name = iommufd_fd_name(be);
+ fd = cpr_find_fd(name, 0);
+ be->reused = (fd >= 0);
+ if (!be->reused) {
+ fd = qemu_open("/dev/iommu", O_RDWR, errp);
+ }
if (fd < 0) {
return false;
}
+ cpr_resave_fd(name, 0, fd);
be->fd = fd;
}
be->users++;
@@ -112,6 +124,9 @@ bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
void iommufd_backend_disconnect(IOMMUFDBackend *be)
{
+ int fd = be->fd;
+ const char *name = iommufd_fd_name(be);
+
if (!be->users) {
goto out;
}
@@ -121,7 +136,8 @@ void iommufd_backend_disconnect(IOMMUFDBackend *be)
be->fd = -1;
}
out:
- trace_iommufd_backend_disconnect(be->fd, be->users);
+ cpr_delete_fd(name, 0);
+ trace_iommufd_backend_disconnect(fd, be->users);
}
bool iommufd_backend_alloc_ioas(IOMMUFDBackend *be, uint32_t *ioas_id,
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 4eb358a..053ff8c 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -24,10 +24,25 @@ static bool vfio_cpr_supported(VFIOIOMMUFDContainer *container, Error **errp)
return true;
}
+static int vfio_container_post_load(void *opaque, int version_id)
+{
+ VFIOIOMMUFDContainer *container = opaque;
+ VFIOContainerBase *bcontainer = &container->bcontainer;
+ VFIODevice *vbasedev;
+
+ QLIST_FOREACH(vbasedev, &bcontainer->device_list, container_next) {
+ vbasedev->reused = false;
+ }
+ container->be->reused = false;
+
+ return 0;
+}
+
static const VMStateDescription vfio_container_vmstate = {
.name = "vfio-iommufd-container",
.version_id = 0,
.minimum_version_id = 0,
+ .post_load = vfio_container_post_load,
.needed = cpr_needed_for_reuse,
.fields = (VMStateField[]) {
VMSTATE_END_OF_LIST()
diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
index bd94b86..7f69707 100644
--- a/hw/vfio/helpers.c
+++ b/hw/vfio/helpers.c
@@ -678,14 +678,8 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
{
- ERRP_GUARD();
- int fd = monitor_fd_param(monitor_cur(), str, errp);
-
- if (fd < 0) {
- error_prepend(errp, "Could not parse remote object fd %s:", str);
- return;
- }
- vbasedev->fd = fd;
+ vbasedev->fd = cpr_get_fd_param(vbasedev->dev->id, str, 0,
+ &vbasedev->reused, errp);
}
void vfio_device_init(VFIODevice *vbasedev, int type, VFIODeviceOps *ops,
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index ae78e00..abd17b6 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -25,6 +25,7 @@
#include "qemu/cutils.h"
#include "qemu/chardev_open.h"
#include "migration/blocker.h"
+#include "migration/cpr.h"
#include "pci.h"
#include "exec/ram_addr.h"
@@ -499,13 +500,18 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_IOMMUFD));
if (vbasedev->fd < 0) {
- devfd = iommufd_cdev_getfd(vbasedev->sysfsdev, errp);
+ devfd = cpr_find_fd(vbasedev->name, 0);
+ vbasedev->reused = (devfd >= 0);
+ if (!vbasedev->reused) {
+ devfd = iommufd_cdev_getfd(vbasedev->sysfsdev, errp);
+ }
if (devfd < 0) {
return false;
}
vbasedev->fd = devfd;
} else {
devfd = vbasedev->fd;
+ /* reused was set in vfio_device_set_fd */
}
if (!iommufd_cdev_connect_and_bind(vbasedev, errp)) {
@@ -620,6 +626,7 @@ found_container:
QLIST_INSERT_HEAD(&bcontainer->device_list, vbasedev, container_next);
QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
vfio_iommufd_cpr_register_device(vbasedev);
+ cpr_resave_fd(vbasedev->name, 0, devfd);
trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
vbasedev->num_regions, vbasedev->flags);
@@ -661,6 +668,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
migrate_del_blocker(&vbasedev->cpr_id_blocker);
vfio_iommufd_cpr_unregister_device(vbasedev);
+ cpr_delete_fd(vbasedev->name, 0);
iommufd_cdev_unbind_and_disconnect(vbasedev);
close(vbasedev->fd);
}
diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index 4e9c037..6618248 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -32,6 +32,7 @@ struct IOMMUFDBackend {
/*< protected >*/
int fd; /* /dev/iommu file descriptor */
bool owned; /* is the /dev/iommu opened internally */
+ bool reused; /* fd is reused after CPR */
uint32_t users;
/*< public >*/
--
1.8.3.1
* [PATCH V1 25/26] vfio/iommufd: reconstruct device
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (23 preceding siblings ...)
2025-01-29 14:43 ` [PATCH V1 24/26] vfio/iommufd: preserve descriptors Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
2025-01-29 14:43 ` [PATCH V1 26/26] iommufd: preserve DMA mappings Steve Sistare
25 siblings, 0 replies; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
Reconstruct userland device state after CPR. During vfio_realize, skip
all ioctls that configure the device, as it was already configured in old
QEMU.
Save the ioas_id in vmstate, and skip its allocation in vfio_realize.
Because we skip ioctls, it is not needed at realize time. However, we do
need the range info, so defer the call to iommufd_cdev_get_info_iova_range
to a post_load handler, at which time the ioas_id is known.
Save the devid in vmstate. It is used in one place during realize, to
fetch hw_caps device info, at vfio_device_hiod_realize ->
hiod_iommufd_vfio_realize -> iommufd_backend_get_device_info. The hw_caps
is not needed until post load time (see the next paragraph), so defer
the call to vfio_device_hiod_realize to post load, at which time the
devid is known.
Save the hwpt_id in vmstate. In realize, skip its allocation from
iommufd_cdev_attach -> iommufd_cdev_attach_container ->
iommufd_cdev_autodomains_get. Rebuild userland structures to hold
hwpt_id by calling iommufd_cdev_rebuild_hwpt at post load time.
This depends on hw_caps as described above.
Lastly, change the owning process of the iommufd device in post load.
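The hwpt rebuild described above amounts to a find-or-create lookup keyed by
hwpt_id. A minimal sketch of that pattern follows, assuming simplified
stand-in types: Hwpt, Device, Container and rebuild_hwpt are hypothetical
names for illustration, not the QEMU structures. No kernel call is made,
because the hwpt already exists in the kernel after CPR.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the userland bookkeeping; not the QEMU types. */
typedef struct Hwpt {
    uint32_t hwpt_id;
    int num_devices;        /* devices attached to this hwpt record */
    struct Hwpt *next;
} Hwpt;

typedef struct Device {
    uint32_t hwpt_id;       /* restored from vmstate */
    Hwpt *hwpt;             /* rebuilt at post load time */
} Device;

typedef struct Container {
    Hwpt *hwpt_list;
} Container;

/*
 * Attach dev to the record whose id matches dev->hwpt_id, creating the
 * record if no earlier device has rebuilt it yet.
 */
static Hwpt *rebuild_hwpt(Container *c, Device *dev)
{
    Hwpt *h;

    for (h = c->hwpt_list; h; h = h->next) {
        if (h->hwpt_id == dev->hwpt_id) {
            break;
        }
    }
    if (!h) {
        h = calloc(1, sizeof(*h));
        assert(h);
        h->hwpt_id = dev->hwpt_id;
        h->next = c->hwpt_list;
        c->hwpt_list = h;
    }
    h->num_devices++;
    dev->hwpt = h;
    return h;
}
```

Two devices that shared an hwpt in old QEMU end up pointing at the same
record again, regardless of the order in which their post load handlers run.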
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
hw/vfio/cpr-iommufd.c | 49 +++++++++++++++++++++++++++++++++++++++++++
hw/vfio/iommufd.c | 46 +++++++++++++++++++++++++++++++++++++---
hw/vfio/trace-events | 1 +
include/hw/vfio/vfio-common.h | 3 +++
4 files changed, 96 insertions(+), 3 deletions(-)
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 053ff8c..711e5cf 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -14,6 +14,9 @@
#include "migration/vmstate.h"
#include "system/iommufd.h"
+#define IOMMUFD_CONTAINER(base) \
+ container_of(base, VFIOIOMMUFDContainer, bcontainer)
+
static bool vfio_cpr_supported(VFIOIOMMUFDContainer *container, Error **errp)
{
if (!iommufd_change_process_capable(container->be)) {
@@ -29,6 +32,13 @@ static int vfio_container_post_load(void *opaque, int version_id)
VFIOIOMMUFDContainer *container = opaque;
VFIOContainerBase *bcontainer = &container->bcontainer;
VFIODevice *vbasedev;
+ Error *err = NULL;
+ uint32_t ioas_id = container->ioas_id;
+
+ if (!iommufd_cdev_get_info_iova_range(container, ioas_id, &err)) {
+ error_report_err(err);
+ return -1;
+ }
QLIST_FOREACH(vbasedev, &bcontainer->device_list, container_next) {
vbasedev->reused = false;
@@ -38,21 +48,41 @@ static int vfio_container_post_load(void *opaque, int version_id)
return 0;
}
+static int vfio_container_pre_save(void *opaque)
+{
+ VFIOIOMMUFDContainer *container = opaque;
+
+ /*
+ * The process has not changed yet, but proactively call the ioctl,
+ * and it will fail if any DMA mappings are not supported.
+ */
+ return iommufd_change_process(container->be);
+}
+
static const VMStateDescription vfio_container_vmstate = {
.name = "vfio-iommufd-container",
.version_id = 0,
.minimum_version_id = 0,
+ .pre_save = vfio_container_pre_save,
.post_load = vfio_container_post_load,
.needed = cpr_needed_for_reuse,
.fields = (VMStateField[]) {
+ VMSTATE_UINT32(ioas_id, VFIOIOMMUFDContainer),
VMSTATE_END_OF_LIST()
}
};
+static int iommufd_cpr_post_load(void *opaque, int version_id)
+{
+ IOMMUFDBackend *be = opaque;
+ return iommufd_change_process(be);
+}
+
static const VMStateDescription iommufd_cpr_vmstate = {
.name = "iommufd",
.version_id = 0,
.minimum_version_id = 0,
+ .post_load = iommufd_cpr_post_load,
.needed = cpr_needed_for_reuse,
.fields = (VMStateField[]) {
VMSTATE_END_OF_LIST()
@@ -90,12 +120,31 @@ void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
vfio_cpr_unregister_container(bcontainer);
}
+static int vfio_device_post_load(void *opaque, int version_id)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOIOMMUFDContainer *container = IOMMUFD_CONTAINER(vbasedev->bcontainer);
+ Error *err = NULL;
+
+ if (!vfio_device_hiod_realize(vbasedev, &err)) {
+ error_report_err(err);
+ return -1;
+ }
+ if (!vbasedev->mdev) {
+ iommufd_cdev_rebuild_hwpt(vbasedev, container);
+ }
+ return 0;
+}
+
static const VMStateDescription vfio_device_vmstate = {
.name = "vfio-iommufd-device",
.version_id = 0,
.minimum_version_id = 0,
+ .post_load = vfio_device_post_load,
.needed = cpr_needed_for_reuse,
.fields = (VMStateField[]) {
+ VMSTATE_INT32(devid, VFIODevice),
+ VMSTATE_UINT32(hwpt_id, VFIODevice),
VMSTATE_END_OF_LIST()
}
};
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index abd17b6..a007b6c 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -99,6 +99,11 @@ static bool iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
goto err_kvm_device_add;
}
+ if (vbasedev->reused) {
+ /* Already bound, and devid was set in iommufd_cdev_attach */
+ goto skip_bind;
+ }
+
/* Bind device to iommufd */
bind.iommufd = iommufd->fd;
if (ioctl(vbasedev->fd, VFIO_DEVICE_BIND_IOMMUFD, &bind)) {
@@ -110,6 +115,8 @@ static bool iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
vbasedev->devid = bind.out_devid;
trace_iommufd_cdev_connect_and_bind(bind.iommufd, vbasedev->name,
vbasedev->fd, vbasedev->devid);
+
+skip_bind:
return true;
err_bind:
iommufd_cdev_kvm_device_del(vbasedev);
@@ -292,6 +299,7 @@ static bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
static void iommufd_cdev_set_hwpt(VFIODevice *vbasedev, VFIOIOASHwpt *hwpt)
{
vbasedev->hwpt = hwpt;
+ vbasedev->hwpt_id = hwpt->hwpt_id;
vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
}
@@ -324,6 +332,24 @@ static VFIOIOASHwpt *iommufd_cdev_make_hwpt(VFIODevice *vbasedev,
return hwpt;
}
+void iommufd_cdev_rebuild_hwpt(VFIODevice *vbasedev,
+ VFIOIOMMUFDContainer *container)
+{
+ VFIOIOASHwpt *hwpt;
+ int hwpt_id = vbasedev->hwpt_id;
+
+ trace_iommufd_cdev_rebuild_hwpt(container->be->fd, hwpt_id);
+
+ QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
+ if (hwpt->hwpt_id == hwpt_id) {
+ iommufd_cdev_set_hwpt(vbasedev, hwpt);
+ return;
+ }
+ }
+ hwpt = iommufd_cdev_make_hwpt(vbasedev, container, hwpt_id);
+ iommufd_cdev_set_hwpt(vbasedev, hwpt);
+}
+
static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
VFIOIOMMUFDContainer *container,
Error **errp)
@@ -527,7 +553,8 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
* FD to be connected and having a devid to be able to successfully call
* iommufd_backend_get_device_info().
*/
- if (!vfio_device_hiod_realize(vbasedev, errp)) {
+ if (!vbasedev->reused &&
+ !vfio_device_hiod_realize(vbasedev, errp)) {
goto err_alloc_ioas;
}
@@ -538,7 +565,8 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
vbasedev->iommufd != container->be) {
continue;
}
- if (!iommufd_cdev_attach_container(vbasedev, container, &err)) {
+ if (!vbasedev->reused &&
+ !iommufd_cdev_attach_container(vbasedev, container, &err)) {
const char *msg = error_get_pretty(err);
trace_iommufd_cdev_fail_attach_existing_container(msg);
@@ -555,6 +583,11 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
}
}
+ if (vbasedev->reused) {
+ ioas_id = -1; /* ioas_id will be received from vmstate */
+ goto skip_ioas_alloc;
+ }
+
/* Need to allocate a new dedicated container */
if (!iommufd_backend_alloc_ioas(vbasedev->iommufd, &ioas_id, errp)) {
goto err_alloc_ioas;
@@ -562,6 +595,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
trace_iommufd_cdev_alloc_ioas(vbasedev->iommufd->fd, ioas_id);
+skip_ioas_alloc:
container = VFIO_IOMMU_IOMMUFD(object_new(TYPE_VFIO_IOMMU_IOMMUFD));
container->be = vbasedev->iommufd;
container->ioas_id = ioas_id;
@@ -570,7 +604,8 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
bcontainer = &container->bcontainer;
vfio_address_space_insert(space, bcontainer);
- if (!iommufd_cdev_attach_container(vbasedev, container, errp)) {
+ if (!vbasedev->reused &&
+ !iommufd_cdev_attach_container(vbasedev, container, errp)) {
goto err_attach_container;
}
@@ -579,6 +614,10 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
goto err_discard_disable;
}
+ if (vbasedev->reused) {
+ goto skip_info;
+ }
+
if (!iommufd_cdev_get_info_iova_range(container, ioas_id, &err)) {
error_append_hint(&err,
"Fallback to default 64bit IOVA range and 4K page size\n");
@@ -587,6 +626,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
bcontainer->pgsizes = qemu_real_host_page_size();
}
+skip_info:
bcontainer->listener = vfio_memory_listener;
memory_listener_register(&bcontainer->listener, bcontainer->space->as);
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index cab1cf1..25ff04c 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -176,6 +176,7 @@ iommufd_cdev_connect_and_bind(int iommufd, const char *name, int devfd, int devi
iommufd_cdev_getfd(const char *dev, int devfd) " %s (fd=%d)"
iommufd_cdev_attach_ioas_hwpt(int iommufd, const char *name, int devfd, int id) " [iommufd=%d] Successfully attached device %s (%d) to id=%d"
iommufd_cdev_detach_ioas_hwpt(int iommufd, const char *name) " [iommufd=%d] Successfully detached %s"
+iommufd_cdev_rebuild_hwpt(int iommufd, int hwpt_id) " [iommufd=%d] hwpt %d"
iommufd_cdev_fail_attach_existing_container(const char *msg) " %s"
iommufd_cdev_alloc_ioas(int iommufd, int ioas_id) " [iommufd=%d] new IOMMUFD container with ioasid=%d"
iommufd_cdev_device_info(char *name, int devfd, int num_irqs, int num_regions, int flags) " %s (%d) num_irqs=%d num_regions=%d flags=%d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index add44d4..a359ea9 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -157,6 +157,7 @@ typedef struct VFIODevice {
HostIOMMUDevice *hiod;
int devid;
IOMMUFDBackend *iommufd;
+ uint32_t hwpt_id;
VFIOIOASHwpt *hwpt;
QLIST_ENTRY(VFIODevice) hwpt_next;
} VFIODevice;
@@ -270,6 +271,8 @@ bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp);
void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer);
bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp);
void vfio_legacy_cpr_unregister_container(VFIOContainer *container);
+void iommufd_cdev_rebuild_hwpt(VFIODevice *vbasedev,
+ VFIOIOMMUFDContainer *container);
bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
uint32_t ioas_id, Error **errp);
bool vfio_iommufd_cpr_register_container(VFIOIOMMUFDContainer *container,
--
1.8.3.1
* [PATCH V1 26/26] iommufd: preserve DMA mappings
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
` (24 preceding siblings ...)
2025-01-29 14:43 ` [PATCH V1 25/26] vfio/iommufd: reconstruct device Steve Sistare
@ 2025-01-29 14:43 ` Steve Sistare
25 siblings, 0 replies; 64+ messages in thread
From: Steve Sistare @ 2025-01-29 14:43 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Fabiano Rosas, Steve Sistare
During cpr-transfer load in new QEMU, the vfio_memory_listener causes
spurious calls to map and unmap DMA regions, as devices are created and
the address space is built. This memory was already mapped by the
device in old QEMU, so suppress the map and unmap callbacks during CPR,
i.e., while the reused flag is set. Clear the reused flag in the post_load
handler.
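As a rough illustration of the suppression, here is a minimal sketch that
assumes a simplified backend. Backend, backend_map, backend_unmap and
backend_post_load are hypothetical names, and a counter stands in for the
real ioctl calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the backend; a counter replaces the ioctls. */
typedef struct Backend {
    bool reused;        /* set while CPR state is being loaded */
    int kernel_calls;   /* map/unmap requests that reached the "kernel" */
} Backend;

static int backend_map(Backend *be, uint64_t iova, uint64_t size)
{
    (void)iova;
    (void)size;
    if (be->reused) {
        return 0;       /* mapping was preserved from old QEMU */
    }
    be->kernel_calls++; /* real code would issue the map ioctl here */
    return 0;
}

static int backend_unmap(Backend *be, uint64_t iova, uint64_t size)
{
    (void)iova;
    (void)size;
    if (be->reused) {
        return 0;       /* do not tear down a preserved mapping */
    }
    be->kernel_calls++; /* real code would issue the unmap ioctl here */
    return 0;
}

/* Once all state is loaded, later map and unmap requests are real again. */
static int backend_post_load(Backend *be)
{
    be->reused = false;
    return 0;
}
```

The flag must be cleared only after every listener-driven callback from
address space construction has run, which is why the ordering of the
post_load handler matters.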
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
backends/iommufd.c | 8 ++++++++
hw/vfio/cpr-iommufd.c | 1 +
2 files changed, 9 insertions(+)
diff --git a/backends/iommufd.c b/backends/iommufd.c
index e9452e4..cc61432 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -226,6 +226,10 @@ int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
.length = size,
};
+ if (be->reused) {
+ return 0;
+ }
+
if (!readonly) {
map.flags |= IOMMU_IOAS_MAP_WRITEABLE;
}
@@ -257,6 +261,10 @@ int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
.length = size,
};
+ if (be->reused) {
+ return 0;
+ }
+
ret = ioctl(fd, IOMMU_IOAS_UNMAP, &unmap);
/*
* IOMMUFD takes mapping as some kind of object, unmapping
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 711e5cf..5b93b24 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -63,6 +63,7 @@ static const VMStateDescription vfio_container_vmstate = {
.name = "vfio-iommufd-container",
.version_id = 0,
.minimum_version_id = 0,
+ .priority = MIG_PRI_LOW, /* Must happen after devices and groups */
.pre_save = vfio_container_pre_save,
.post_load = vfio_container_post_load,
.needed = cpr_needed_for_reuse,
--
1.8.3.1
* Re: [PATCH V1 02/26] migration: lower handler priority
2025-01-29 14:42 ` [PATCH V1 02/26] migration: lower handler priority Steve Sistare
@ 2025-02-03 16:21 ` Fabiano Rosas
2025-02-03 16:58 ` Peter Xu
1 sibling, 0 replies; 64+ messages in thread
From: Fabiano Rosas @ 2025-02-03 16:21 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
Steve Sistare
Steve Sistare <steven.sistare@oracle.com> writes:
> Define a vmstate priority that is lower than the default, so its handlers
> run after all default priority handlers. Since 0 is no longer the default
> priority, translate an uninitialized priority of 0 to MIG_PRI_DEFAULT.
>
> CPR for vfio will use this to install handlers for containers that run
> after handlers for the devices that they contain.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
* Re: [PATCH V1 03/26] vfio: vfio_find_ram_discard_listener
2025-01-29 14:42 ` [PATCH V1 03/26] vfio: vfio_find_ram_discard_listener Steve Sistare
@ 2025-02-03 16:57 ` Cédric Le Goater
0 siblings, 0 replies; 64+ messages in thread
From: Cédric Le Goater @ 2025-02-03 16:57 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 1/29/25 15:42, Steve Sistare wrote:
> Define vfio_find_ram_discard_listener as a subroutine so additional calls to
> it may be added in a subsequent patch.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Thanks,
C.
> ---
> hw/vfio/common.c | 35 ++++++++++++++++++++++-------------
> 1 file changed, 22 insertions(+), 13 deletions(-)
>
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index f7499a9..7370332 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -555,6 +555,26 @@ static bool vfio_get_section_iova_range(VFIOContainerBase *bcontainer,
> return true;
> }
>
> +static VFIORamDiscardListener *vfio_find_ram_discard_listener(
> + VFIOContainerBase *bcontainer, MemoryRegionSection *section)
> +{
> + VFIORamDiscardListener *vrdl = NULL;
> +
> + QLIST_FOREACH(vrdl, &bcontainer->vrdl_list, next) {
> + if (vrdl->mr == section->mr &&
> + vrdl->offset_within_address_space ==
> + section->offset_within_address_space) {
> + break;
> + }
> + }
> +
> + if (!vrdl) {
> + hw_error("vfio: Trying to sync missing RAM discard listener");
> + /* does not return */
> + }
> + return vrdl;
> +}
> +
> static void vfio_listener_region_add(MemoryListener *listener,
> MemoryRegionSection *section)
> {
> @@ -1266,19 +1286,8 @@ vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainerBase *bcontainer,
> MemoryRegionSection *section)
> {
> RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
> - VFIORamDiscardListener *vrdl = NULL;
> -
> - QLIST_FOREACH(vrdl, &bcontainer->vrdl_list, next) {
> - if (vrdl->mr == section->mr &&
> - vrdl->offset_within_address_space ==
> - section->offset_within_address_space) {
> - break;
> - }
> - }
> -
> - if (!vrdl) {
> - hw_error("vfio: Trying to sync missing RAM discard listener");
> - }
> + VFIORamDiscardListener *vrdl =
> + vfio_find_ram_discard_listener(bcontainer, section);
>
> /*
> * We only want/can synchronize the bitmap for actually mapped parts -
* Re: [PATCH V1 02/26] migration: lower handler priority
2025-01-29 14:42 ` [PATCH V1 02/26] migration: lower handler priority Steve Sistare
2025-02-03 16:21 ` Fabiano Rosas
@ 2025-02-03 16:58 ` Peter Xu
2025-02-06 13:39 ` Steven Sistare
1 sibling, 1 reply; 64+ messages in thread
From: Peter Xu @ 2025-02-03 16:58 UTC (permalink / raw)
To: Steve Sistare
Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum,
Fabiano Rosas
On Wed, Jan 29, 2025 at 06:42:58AM -0800, Steve Sistare wrote:
> Define a vmstate priority that is lower than the default, so its handlers
> run after all default priority handlers. Since 0 is no longer the default
> priority, translate an uninitialized priority of 0 to MIG_PRI_DEFAULT.
>
> CPR for vfio will use this to install handlers for containers that run
> after handlers for the devices that they contain.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> include/migration/vmstate.h | 3 ++-
> migration/savevm.c | 4 ++--
> 2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> index a1dfab4..3055a46 100644
> --- a/include/migration/vmstate.h
> +++ b/include/migration/vmstate.h
> @@ -155,7 +155,8 @@ enum VMStateFlags {
> };
>
> typedef enum {
> - MIG_PRI_DEFAULT = 0,
Shall we still keep a definition for 0? Or at least add a comment linking to
save_state_priority(); it might be helpful for whoever jumps to this enum
definition when reading and gets confused about how a default value is non-zero.
Or define it as something like:
MIG_PRI_UNINITIALIZED = 0, /* Most devices don't set a priority, it will
* be routed to MIG_PRI_DEFAULT */
> + MIG_PRI_LOW = 1, /* Must happen after default */
> + MIG_PRI_DEFAULT,
> MIG_PRI_IOMMU, /* Must happen before PCI devices */
> MIG_PRI_PCI_BUS, /* Must happen before IOMMU */
> MIG_PRI_VIRTIO_MEM, /* Must happen before IOMMU */
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 264bc06..5dd2dc4 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -232,7 +232,7 @@ typedef struct SaveState {
>
> static SaveState savevm_state = {
> .handlers = QTAILQ_HEAD_INITIALIZER(savevm_state.handlers),
> - .handler_pri_head = { [MIG_PRI_DEFAULT ... MIG_PRI_MAX] = NULL },
> + .handler_pri_head = { [0 ... MIG_PRI_MAX] = NULL },
> .global_section_id = 0,
> };
>
> @@ -704,7 +704,7 @@ static int calculate_compat_instance_id(const char *idstr)
>
> static inline MigrationPriority save_state_priority(SaveStateEntry *se)
> {
> - if (se->vmsd) {
> + if (se->vmsd && se->vmsd->priority) {
> return se->vmsd->priority;
> }
> return MIG_PRI_DEFAULT;
> --
> 1.8.3.1
>
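To make the suggestion concrete, a minimal sketch of an explicit zero value
that is routed to the default could look like the following. The PRI_* names,
Desc and effective_priority are hypothetical and only mirror the shape of the
enum and of save_state_priority(); they are not the QEMU definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical priorities; a larger value means the handler runs earlier. */
typedef enum {
    PRI_UNINITIALIZED = 0,  /* field left unset; routed to PRI_DEFAULT */
    PRI_LOW,                /* must happen after default */
    PRI_DEFAULT,
    PRI_IOMMU,              /* must happen before default */
    PRI_MAX,
} Priority;

/* Hypothetical stand-in for a state description with a priority field. */
typedef struct Desc {
    const char *name;
    Priority priority;
} Desc;

/* Translate an unset priority of 0 into the default priority. */
static Priority effective_priority(const Desc *d)
{
    if (d && d->priority != PRI_UNINITIALIZED) {
        return d->priority;
    }
    return PRI_DEFAULT;
}
```

With a named zero value, a reader who lands on the enum sees directly why the
default is non-zero, without having to find the translation function first.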
--
Peter Xu
* Re: [PATCH V1 04/26] vfio/container: register container for cpr
2025-01-29 14:43 ` [PATCH V1 04/26] vfio/container: register container for cpr Steve Sistare
@ 2025-02-03 17:01 ` Cédric Le Goater
2025-02-03 22:26 ` Steven Sistare
0 siblings, 1 reply; 64+ messages in thread
From: Cédric Le Goater @ 2025-02-03 17:01 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 1/29/25 15:43, Steve Sistare wrote:
> Register a legacy container for cpr-transfer. Add a blocker if the kernel
> does not support VFIO_UPDATE_VADDR or VFIO_UNMAP_ALL.
>
> This is mostly boilerplate. The fields to be saved and restored are added
> in subsequent patches.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> hw/vfio/container.c | 6 ++--
> hw/vfio/cpr-legacy.c | 68 +++++++++++++++++++++++++++++++++++++++++++
> hw/vfio/meson.build | 3 +-
> include/hw/vfio/vfio-common.h | 3 ++
> 4 files changed, 76 insertions(+), 4 deletions(-)
> create mode 100644 hw/vfio/cpr-legacy.c
>
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index 4ebb526..a90ce6c 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -618,7 +618,7 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> }
> bcontainer = &container->bcontainer;
>
> - if (!vfio_cpr_register_container(bcontainer, errp)) {
> + if (!vfio_legacy_cpr_register_container(container, errp)) {
> goto free_container_exit;
> }
>
> @@ -666,7 +666,7 @@ enable_discards_exit:
> vfio_ram_block_discard_disable(container, false);
>
> unregister_container_exit:
> - vfio_cpr_unregister_container(bcontainer);
> + vfio_legacy_cpr_unregister_container(container);
>
> free_container_exit:
> object_unref(container);
> @@ -710,7 +710,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
> VFIOAddressSpace *space = bcontainer->space;
>
> trace_vfio_disconnect_container(container->fd);
> - vfio_cpr_unregister_container(bcontainer);
> + vfio_legacy_cpr_unregister_container(container);
> close(container->fd);
> object_unref(container);
>
> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
> new file mode 100644
> index 0000000..d3bbc05
> --- /dev/null
> +++ b/hw/vfio/cpr-legacy.c
> @@ -0,0 +1,68 @@
> +/*
> + * Copyright (c) 2021-2025 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include <sys/ioctl.h>
> +#include "qemu/osdep.h"
> +#include "hw/vfio/vfio-common.h"
> +#include "migration/blocker.h"
> +#include "migration/cpr.h"
> +#include "migration/migration.h"
> +#include "migration/vmstate.h"
> +#include "qapi/error.h"
> +
> +static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
> +{
> + if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
> + error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR");
> + return false;
> +
> + } else if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
> + error_setg(errp, "VFIO container does not support VFIO_UNMAP_ALL");
> + return false;
> +
> + } else {
> + return true;
> + }
> +}
> +
> +static const VMStateDescription vfio_container_vmstate = {
> + .name = "vfio-container",
> + .version_id = 0,
> + .minimum_version_id = 0,
> + .needed = cpr_needed_for_reuse,
> + .fields = (VMStateField[]) {
> + VMSTATE_END_OF_LIST()
> + }
> +};
> +
> +bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
> +{
> + VFIOContainerBase *bcontainer = &container->bcontainer;
> + Error **cpr_blocker = &container->cpr_blocker;
> +
> + if (!vfio_cpr_register_container(bcontainer, errp)) {
> + return false;
> + }
> +
> + if (!vfio_cpr_supported(container, cpr_blocker)) {
> + return migrate_add_blocker_modes(cpr_blocker, errp,
> + MIG_MODE_CPR_TRANSFER, -1) == 0;
> + }
> +
> + vmstate_register(NULL, -1, &vfio_container_vmstate, container);
> +
> + return true;
> +}
> +
> +void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
> +{
> + VFIOContainerBase *bcontainer = &container->bcontainer;
> +
> + vfio_cpr_unregister_container(bcontainer);
> + migrate_del_blocker(&container->cpr_blocker);
> + vmstate_unregister(NULL, &vfio_container_vmstate, container);
> +}
> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index bba776f..5487815 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -5,13 +5,14 @@ vfio_ss.add(files(
> 'container-base.c',
> 'container.c',
> 'migration.c',
> - 'cpr.c',
> ))
> vfio_ss.add(when: 'CONFIG_PSERIES', if_true: files('spapr.c'))
> vfio_ss.add(when: 'CONFIG_IOMMUFD', if_true: files(
> 'iommufd.c',
> ))
> vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
> + 'cpr.c',
> + 'cpr-legacy.c',
> 'display.c',
> 'pci-quirks.c',
> 'pci.c',
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 0c60be5..53e554f 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -84,6 +84,7 @@ typedef struct VFIOContainer {
> VFIOContainerBase bcontainer;
> int fd; /* /dev/vfio/vfio, empowered by the attached groups */
> unsigned iommu_type;
> + Error *cpr_blocker;
> QLIST_HEAD(, VFIOGroup) group_list;
> } VFIOContainer;
>
> @@ -258,6 +259,8 @@ int vfio_kvm_device_del_fd(int fd, Error **errp);
>
> bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp);
> void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer);
I think we should now rename the above routines to reflect what they do :
add/remove a notifier.
> +bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp);
> +void vfio_legacy_cpr_unregister_container(VFIOContainer *container);
Thanks,
C.
>
> extern const MemoryRegionOps vfio_region_ops;
> typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
* Re: [PATCH V1 05/26] vfio/container: preserve descriptors
2025-01-29 14:43 ` [PATCH V1 05/26] vfio/container: preserve descriptors Steve Sistare
@ 2025-02-03 17:48 ` Cédric Le Goater
2025-02-03 22:26 ` Steven Sistare
0 siblings, 1 reply; 64+ messages in thread
From: Cédric Le Goater @ 2025-02-03 17:48 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 1/29/25 15:43, Steve Sistare wrote:
> At vfio creation time, save the value of vfio container, group, and device
> descriptors in CPR state. On qemu restart, vfio_realize() finds and uses
> the saved descriptors, and remembers the reused status for subsequent
> patches. The reused status is cleared when vmstate load finishes.
>
> During reuse, device and iommu state is already configured, so operations
> in vfio_realize that would modify the configuration, such as vfio ioctl's,
> are skipped. The result is that vfio_realize constructs qemu data
> structures that reflect the current state of the device.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> hw/vfio/container.c | 105 ++++++++++++++++++++++++++++++++++--------
> hw/vfio/cpr-legacy.c | 17 +++++++
> include/hw/vfio/vfio-common.h | 2 +
> 3 files changed, 105 insertions(+), 19 deletions(-)
>
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index a90ce6c..81d0ccc 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -31,6 +31,7 @@
> #include "system/reset.h"
> #include "trace.h"
> #include "qapi/error.h"
> +#include "migration/cpr.h"
> #include "pci.h"
>
> VFIOGroupList vfio_group_list =
> @@ -415,12 +416,28 @@ static bool vfio_set_iommu(int container_fd, int group_fd,
> }
>
> static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
> - Error **errp)
> + bool reused, Error **errp)
Please rename 'reused' to 'cpr_reused'. We should know what this parameter
is for and I don't see any other use than CPR.
> {
> int iommu_type;
> const char *vioc_name;
> VFIOContainer *container;
>
> + /*
> + * If container is reused, just set its type and skip the ioctls, as the
> + * container and group are already configured in the kernel.
> + * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
> + */
> + if (reused) {
> + if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU)) {
> + iommu_type = VFIO_TYPE1v2_IOMMU;
> + goto skip_iommu;
> + } else {
> + error_setg(errp, "container was reused but VFIO_TYPE1v2_IOMMU "
> + "is not supported");
> + return NULL;
> + }
> + }
> +
Can we use 'iommu_type' below instead and avoid the VFIO_CHECK_EXTENSION
ioctl? Then set the iommu unless CPR reused is set.
> iommu_type = vfio_get_iommu_type(fd, errp);
> if (iommu_type < 0) {
> return NULL;
> @@ -430,10 +447,12 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
> return NULL;
> }
>
> +skip_iommu:
I think we can avoid this 'skip_iommu' label with some minor refactoring.
> vioc_name = vfio_get_iommu_class_name(iommu_type);
>
> container = VFIO_IOMMU_LEGACY(object_new(vioc_name));
> container->fd = fd;
> + container->reused = reused;
> container->iommu_type = iommu_type;
> return container;
> }
> @@ -543,10 +562,13 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> VFIOContainer *container;
> VFIOContainerBase *bcontainer;
> int ret, fd;
> + bool reused;
cpr_reused.
> VFIOAddressSpace *space;
> VFIOIOMMUClass *vioc;
>
> space = vfio_get_address_space(as);
> + fd = cpr_find_fd("vfio_container_for_group", group->groupid);
> + reused = (fd > 0);
hmm, so we are deducing from the existence of a CprFd state element
that we are doing a live update of the VM. This seems to me to be a
somewhat quick heuristic.
Isn't there a global helper ? Isn't the VM aware that it's being
restarted after a live update ? I am not familiar with the CPR
sequence.
> /*
> * VFIO is currently incompatible with discarding of RAM insofar as the
> @@ -579,28 +601,52 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> * details once we know which type of IOMMU we are using.
> */
>
> + /*
> + * If the container is reused, then the group is already attached in the
> + * kernel. If a container with matching fd is found, then update the
> + * userland group list and return. If not, then after the loop, create
> + * the container struct and group list.
> + */
> +
> QLIST_FOREACH(bcontainer, &space->containers, next) {
> container = container_of(bcontainer, VFIOContainer, bcontainer);
> - if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> - ret = vfio_ram_block_discard_disable(container, true);
> - if (ret) {
> - error_setg_errno(errp, -ret,
> - "Cannot set discarding of RAM broken");
> - if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
> - &container->fd)) {
> - error_report("vfio: error disconnecting group %d from"
> - " container", group->groupid);
> - }
> - return false;
> +
> + if (reused) {
> + if (container->fd != fd) {
> + continue;
> }
> - group->container = container;
> - QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> + } else if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> + continue;
> + }
> +
> + /* Container is a match for the group */
> + ret = vfio_ram_block_discard_disable(container, true);
> + if (ret) {
> + error_setg_errno(errp, -ret,
> + "Cannot set discarding of RAM broken");
> + if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
> + &container->fd)) {
> + error_report("vfio: error disconnecting group %d from"
> + " container", group->groupid);
> +
> + }
> + goto delete_fd_exit;
> + }
> + group->container = container;
> + QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> + if (!reused) {
> vfio_kvm_device_add_group(group);
> - return true;
> + cpr_save_fd("vfio_container_for_group", group->groupid,
> + container->fd);
> }
> + return true;
> + }
The above changes are difficult to understand and I really don't like
these 'if (reused)' code sequences scattered all over the place. It
would make reading and long term maintenance easier if we could
introduce helpers to hide the "CPR reuse" aspect of the machine
initialization phase.
> + /* No matching container found, create one */
> + if (!reused) {
> + fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
> }
> - fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
> if (fd < 0) {
> goto put_space_exit;
> }
> @@ -612,11 +658,12 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> goto close_fd_exit;
> }
>
> - container = vfio_create_container(fd, group, errp);
> + container = vfio_create_container(fd, group, reused, errp);
> if (!container) {
> goto close_fd_exit;
> }
> bcontainer = &container->bcontainer;
> + container->reused = reused;
that's done already in vfio_create_container()
>
> if (!vfio_legacy_cpr_register_container(container, errp)) {
> goto free_container_exit;
> @@ -652,6 +699,7 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> }
>
> bcontainer->initialized = true;
> + cpr_resave_fd("vfio_container_for_group", group->groupid, fd);
Can't we have a helper routine to open/reuse/resave the fd? Same
comment for vfio_get_device() and vfio_get_group().
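Something along these lines, as a minimal sketch only. The fd table,
find_fd, save_fd, open_or_reuse_fd and fake_open are hypothetical stand-ins
that model the find/save behaviour in memory; they are not the cpr_* API.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_SAVED 16

/* Hypothetical in-memory table of preserved descriptors. */
typedef struct SavedFd {
    const char *name;
    int id;
    int fd;
    bool used;
} SavedFd;

static SavedFd saved[MAX_SAVED];

/* Return the preserved fd for (name, id), or -1 if there is none. */
static int find_fd(const char *name, int id)
{
    int i;

    for (i = 0; i < MAX_SAVED; i++) {
        if (saved[i].used && saved[i].id == id &&
            !strcmp(saved[i].name, name)) {
            return saved[i].fd;
        }
    }
    return -1;
}

/* Record fd for (name, id) in the first free slot. */
static void save_fd(const char *name, int id, int fd)
{
    int i;

    for (i = 0; i < MAX_SAVED; i++) {
        if (!saved[i].used) {
            saved[i] = (SavedFd){ .name = name, .id = id,
                                  .fd = fd, .used = true };
            return;
        }
    }
    assert(!"fd table full");
}

typedef int (*OpenFn)(void *opaque);

/* Reuse a preserved fd if one exists, else open a new one and record it. */
static int open_or_reuse_fd(const char *name, int id, OpenFn open_fn,
                            void *opaque, bool *reused)
{
    int fd = find_fd(name, id);

    *reused = (fd >= 0);
    if (!*reused) {
        fd = open_fn(opaque);
        if (fd >= 0) {
            save_fd(name, id, fd);
        }
    }
    return fd;
}

/* Stub opener for demonstration: hands out increasing fd numbers. */
static int fake_open(void *opaque)
{
    int *next = opaque;

    return (*next)++;
}
```

Callers would then only look at the returned 'reused' flag instead of
repeating the find/open/resave sequence. Note that this sketch saves the fd
as soon as it is opened, so a caller that fails later would still need to
delete the entry on its error path.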
>
> return true;
> listener_release_exit:
> @@ -677,6 +725,8 @@ close_fd_exit:
> put_space_exit:
> vfio_put_address_space(space);
>
> +delete_fd_exit:
> + cpr_delete_fd("vfio_container_for_group", group->groupid);
Another exit label. That's the 7th in vfio_connect_container() ...
This is becoming too complex, we need to refactor first.
Thanks,
C.
> return false;
> }
>
> @@ -688,6 +738,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>
> QLIST_REMOVE(group, container_next);
> group->container = NULL;
> + cpr_delete_fd("vfio_container_for_group", group->groupid);
>
> /*
> * Explicitly release the listener first before unset container,
> @@ -741,7 +792,12 @@ static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
> group = g_malloc0(sizeof(*group));
>
> snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
> - group->fd = qemu_open(path, O_RDWR, errp);
> +
> + group->fd = cpr_find_fd("vfio_group", groupid);
> + if (group->fd < 0) {
> + group->fd = qemu_open(path, O_RDWR, errp);
> + }
> +
> if (group->fd < 0) {
> goto free_group_exit;
> }
> @@ -769,6 +825,7 @@ static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
> }
>
> QLIST_INSERT_HEAD(&vfio_group_list, group, next);
> + cpr_resave_fd("vfio_group", groupid, group->fd);
>
> return group;
>
> @@ -794,6 +851,7 @@ static void vfio_put_group(VFIOGroup *group)
> vfio_disconnect_container(group);
> QLIST_REMOVE(group, next);
> trace_vfio_put_group(group->fd);
> + cpr_delete_fd("vfio_group", group->groupid);
> close(group->fd);
> g_free(group);
> }
> @@ -803,8 +861,14 @@ static bool vfio_get_device(VFIOGroup *group, const char *name,
> {
> g_autofree struct vfio_device_info *info = NULL;
> int fd;
> + bool reused;
> +
> + fd = cpr_find_fd(name, 0);
> + reused = (fd >= 0);
> + if (!reused) {
> + fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
> + }
>
> - fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
> if (fd < 0) {
> error_setg_errno(errp, errno, "error getting device from group %d",
> group->groupid);
> @@ -849,6 +913,8 @@ static bool vfio_get_device(VFIOGroup *group, const char *name,
> vbasedev->num_irqs = info->num_irqs;
> vbasedev->num_regions = info->num_regions;
> vbasedev->flags = info->flags;
> + vbasedev->reused = reused;
> + cpr_resave_fd(name, 0, fd);
>
> trace_vfio_get_device(name, info->flags, info->num_regions, info->num_irqs);
>
> @@ -865,6 +931,7 @@ static void vfio_put_base_device(VFIODevice *vbasedev)
> QLIST_REMOVE(vbasedev, next);
> vbasedev->group = NULL;
> trace_vfio_put_base_device(vbasedev->fd);
> + cpr_delete_fd(vbasedev->name, 0);
> close(vbasedev->fd);
> }
>
> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
> index d3bbc05..ce6f14e 100644
> --- a/hw/vfio/cpr-legacy.c
> +++ b/hw/vfio/cpr-legacy.c
> @@ -29,10 +29,27 @@ static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
> }
> }
>
> +static int vfio_container_post_load(void *opaque, int version_id)
> +{
> + VFIOContainer *container = opaque;
> + VFIOGroup *group;
> + VFIODevice *vbasedev;
> +
> + container->reused = false;
> +
> + QLIST_FOREACH(group, &container->group_list, container_next) {
> + QLIST_FOREACH(vbasedev, &group->device_list, next) {
> + vbasedev->reused = false;
> + }
> + }
> + return 0;
> +}
> +
> static const VMStateDescription vfio_container_vmstate = {
> .name = "vfio-container",
> .version_id = 0,
> .minimum_version_id = 0,
> + .post_load = vfio_container_post_load,
> .needed = cpr_needed_for_reuse,
> .fields = (VMStateField[]) {
> VMSTATE_END_OF_LIST()
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 53e554f..a435a90 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -85,6 +85,7 @@ typedef struct VFIOContainer {
> int fd; /* /dev/vfio/vfio, empowered by the attached groups */
> unsigned iommu_type;
> Error *cpr_blocker;
> + bool reused;
> QLIST_HEAD(, VFIOGroup) group_list;
> } VFIOContainer;
>
> @@ -135,6 +136,7 @@ typedef struct VFIODevice {
> bool ram_block_discard_allowed;
> OnOffAuto enable_migration;
> bool migration_events;
> + bool reused;
> VFIODeviceOps *ops;
> unsigned int num_irqs;
> unsigned int num_regions;
* Re: [PATCH V1 06/26] vfio/container: preserve DMA mappings
2025-01-29 14:43 ` [PATCH V1 06/26] vfio/container: preserve DMA mappings Steve Sistare
@ 2025-02-03 18:25 ` Cédric Le Goater
2025-02-03 22:27 ` Steven Sistare
0 siblings, 1 reply; 64+ messages in thread
From: Cédric Le Goater @ 2025-02-03 18:25 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 1/29/25 15:43, Steve Sistare wrote:
> Preserve DMA mappings during cpr-transfer.
>
> In the container pre_save handler, suspend the use of virtual addresses
> in DMA mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest RAM will
> be remapped at a different VA after exec. DMA to already-mapped pages
> continues.
>
> Because the vaddr is temporarily invalid, mediated devices cannot be
> supported, so add a blocker for them. This restriction will not apply
> to iommufd containers when CPR is added for them in a future patch.
>
> In new QEMU, do not register the memory listener at device creation time.
> Register it later, in the container post_load handler, after all vmstate
> that may affect regions and mapping boundaries has been loaded. The
> post_load registration will cause the listener to invoke its callback on
> each flat section, and the calls will match the mappings remembered by the
> kernel. Modify vfio_dma_map (which is called by the listener) to pass the
> new VA to the kernel using VFIO_DMA_MAP_FLAG_VADDR.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> hw/vfio/container.c | 44 +++++++++++++++++++++++++++++++++++++++----
> hw/vfio/cpr-legacy.c | 32 +++++++++++++++++++++++++++++++
> include/hw/vfio/vfio-common.h | 3 +++
> 3 files changed, 75 insertions(+), 4 deletions(-)
>
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index 81d0ccc..2b5125e 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -32,6 +32,7 @@
> #include "trace.h"
> #include "qapi/error.h"
> #include "migration/cpr.h"
> +#include "migration/blocker.h"
> #include "pci.h"
>
> VFIOGroupList vfio_group_list =
> @@ -132,6 +133,8 @@ static int vfio_legacy_dma_unmap(const VFIOContainerBase *bcontainer,
> int ret;
> Error *local_err = NULL;
>
> + assert(!container->reused);
> +
> if (iotlb && vfio_devices_all_dirty_tracking_started(bcontainer)) {
> if (!vfio_devices_all_device_dirty_tracking(bcontainer) &&
> bcontainer->dirty_pages_supported) {
> @@ -183,12 +186,24 @@ static int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
> bcontainer);
> struct vfio_iommu_type1_dma_map map = {
> .argsz = sizeof(map),
> - .flags = VFIO_DMA_MAP_FLAG_READ,
> .vaddr = (__u64)(uintptr_t)vaddr,
> .iova = iova,
> .size = size,
> };
>
> + /*
> + * Set the new vaddr for any mappings registered during cpr load.
> + * Reused is cleared thereafter.
> + */
> + if (container->reused) {
> + map.flags = VFIO_DMA_MAP_FLAG_VADDR;
> + if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
> + goto fail;
> + }
> + return 0;
> + }
This is a bit ugly.
When we reach vfio_attach_device(), could we detect that CPR is in
progress and temporarily replace the 'VFIOIOMMUClass *' with a set of
CPR-specific handlers?
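To make the suggestion concrete, here is a rough, self-contained sketch of
swapping in CPR-specific handlers. It is not code from this series: every
sketch_* name and flag value is hypothetical, and the real VFIOIOMMUClass has
many more callbacks than the single dma_map shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel map flags; the values are illustrative only. */
enum {
    SKETCH_MAP_READ  = 1 << 0,
    SKETCH_MAP_WRITE = 1 << 1,
    SKETCH_MAP_VADDR = 1 << 2,
};

struct sketch_container;

/* Hypothetical, much-reduced analogue of VFIOIOMMUClass. */
struct sketch_iommu_ops {
    int (*dma_map)(struct sketch_container *c, uint64_t iova,
                   uint64_t size, void *vaddr, bool readonly);
};

struct sketch_container {
    const struct sketch_iommu_ops *ops; /* handlers currently in effect */
    unsigned last_flags;                /* flags the "ioctl" would have seen */
};

/* Normal path: create a mapping with read (and possibly write) access. */
static int sketch_dma_map(struct sketch_container *c, uint64_t iova,
                          uint64_t size, void *vaddr, bool readonly)
{
    (void)iova; (void)size; (void)vaddr;
    c->last_flags = SKETCH_MAP_READ | (readonly ? 0 : SKETCH_MAP_WRITE);
    return 0;
}

/* CPR path: only supply the new vaddr for a mapping the kernel remembers. */
static int sketch_cpr_dma_map(struct sketch_container *c, uint64_t iova,
                              uint64_t size, void *vaddr, bool readonly)
{
    (void)iova; (void)size; (void)vaddr; (void)readonly;
    c->last_flags = SKETCH_MAP_VADDR;
    return 0;
}

static const struct sketch_iommu_ops sketch_ops = {
    .dma_map = sketch_dma_map,
};

static const struct sketch_iommu_ops sketch_cpr_ops = {
    .dma_map = sketch_cpr_dma_map,
};

/* Would be called at attach time when a live update is detected. */
void sketch_cpr_begin(struct sketch_container *c)
{
    assert(c != NULL);
    c->ops = &sketch_cpr_ops;
}

/* Would be called once vmstate load has finished (e.g. from post_load). */
void sketch_cpr_end(struct sketch_container *c)
{
    assert(c != NULL);
    c->ops = &sketch_ops;
}
```

With this shape the map routine itself would need no 'if (reused)' branch;
the choice is made once, when the handler table is swapped.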
> +
> + map.flags = VFIO_DMA_MAP_FLAG_READ;
> if (!readonly) {
> map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
> }
> @@ -205,7 +220,11 @@ static int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
> return 0;
> }
>
> - error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
> +fail:
> + error_report("vfio_dma_map %s (iova %lu, size %ld, va %p): %s",
> + (container->reused ? "VADDR" : ""), iova, size, vaddr,
> + strerror(errno));
> +
FYI, I am currently trying to remove this error report.
> return -errno;
> }
>
> @@ -689,8 +708,17 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> group->container = container;
> QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>
> - bcontainer->listener = vfio_memory_listener;
> - memory_listener_register(&bcontainer->listener, bcontainer->space->as);
> + /*
> + * If reused, register the listener later, after all state that may
> + * affect regions and mapping boundaries has been cpr load'ed. Later,
> + * the listener will invoke its callback on each flat section and call
> + * vfio_dma_map to supply the new vaddr, and the calls will match the
> + * mappings remembered by the kernel.
> + */
> + if (!reused) {
> + bcontainer->listener = vfio_memory_listener;
> + memory_listener_register(&bcontainer->listener, bcontainer->space->as);
> + }
Oh! This is an important change. Please move it into its own patch.
> if (bcontainer->error) {
> error_propagate_prepend(errp, bcontainer->error,
> @@ -1002,6 +1030,13 @@ static bool vfio_legacy_attach_device(const char *name, VFIODevice *vbasedev,
> return false;
> }
>
> + if (vbasedev->mdev) {
> + error_setg(&vbasedev->cpr_mdev_blocker,
> + "CPR does not support vfio mdev %s", vbasedev->name);
> + migrate_add_blocker_modes(&vbasedev->cpr_mdev_blocker, &error_fatal,
> + MIG_MODE_CPR_TRANSFER, -1);
> + }
same here, the cpr blocker for mdev devices should be in its own patch.
> bcontainer = &group->container->bcontainer;
> vbasedev->bcontainer = bcontainer;
> QLIST_INSERT_HEAD(&bcontainer->device_list, vbasedev, container_next);
> @@ -1018,6 +1053,7 @@ static void vfio_legacy_detach_device(VFIODevice *vbasedev)
> QLIST_REMOVE(vbasedev, container_next);
> vbasedev->bcontainer = NULL;
> trace_vfio_detach_device(vbasedev->name, group->groupid);
> + migrate_del_blocker(&vbasedev->cpr_mdev_blocker);
> vfio_put_base_device(vbasedev);
> vfio_put_group(group);
> }
> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
> index ce6f14e..f3a31d1 100644
> --- a/hw/vfio/cpr-legacy.c
> +++ b/hw/vfio/cpr-legacy.c
> @@ -14,6 +14,21 @@
> #include "migration/vmstate.h"
> #include "qapi/error.h"
>
> +static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
> +{
> + struct vfio_iommu_type1_dma_unmap unmap = {
> + .argsz = sizeof(unmap),
> + .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
> + .iova = 0,
> + .size = 0,
> + };
> + if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
> + error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
> + return false;
> + }
> + return true;
> +}
> +
> static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
> {
> if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
> @@ -29,12 +44,27 @@ static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
> }
> }
>
> +static int vfio_container_pre_save(void *opaque)
> +{
> + VFIOContainer *container = opaque;
> + Error *err = NULL;
> +
> + if (!vfio_dma_unmap_vaddr_all(container, &err)) {
> + error_report_err(err);
We should modify vmstate_save_state_v() to call .pre_save() handlers
with an Error ** parameter.
Thanks,
C.
> + return -1;
> + }
> + return 0;
> +}
> +
> static int vfio_container_post_load(void *opaque, int version_id)
> {
> VFIOContainer *container = opaque;
> + VFIOContainerBase *bcontainer = &container->bcontainer;
> VFIOGroup *group;
> VFIODevice *vbasedev;
>
> + bcontainer->listener = vfio_memory_listener;
> + memory_listener_register(&bcontainer->listener, bcontainer->space->as);
> container->reused = false;
>
> QLIST_FOREACH(group, &container->group_list, container_next) {
> @@ -49,6 +79,8 @@ static const VMStateDescription vfio_container_vmstate = {
> .name = "vfio-container",
> .version_id = 0,
> .minimum_version_id = 0,
> + .priority = MIG_PRI_LOW, /* Must happen after devices and groups */
> + .pre_save = vfio_container_pre_save,
> .post_load = vfio_container_post_load,
> .needed = cpr_needed_for_reuse,
> .fields = (VMStateField[]) {
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index a435a90..1e974e0 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -143,6 +143,7 @@ typedef struct VFIODevice {
> unsigned int flags;
> VFIOMigration *migration;
> Error *migration_blocker;
> + Error *cpr_mdev_blocker;
> OnOffAuto pre_copy_dirty_page_tracking;
> OnOffAuto device_dirty_page_tracking;
> bool dirty_pages_supported;
> @@ -310,6 +311,8 @@ int vfio_devices_query_dirty_bitmap(const VFIOContainerBase *bcontainer,
> int vfio_get_dirty_bitmap(const VFIOContainerBase *bcontainer, uint64_t iova,
> uint64_t size, ram_addr_t ram_addr, Error **errp);
>
> +void vfio_listener_register(VFIOContainerBase *bcontainer);
> +
> /* Returns 0 on success, or a negative errno. */
> bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp);
> void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
* Re: [PATCH V1 05/26] vfio/container: preserve descriptors
2025-02-03 17:48 ` Cédric Le Goater
@ 2025-02-03 22:26 ` Steven Sistare
0 siblings, 0 replies; 64+ messages in thread
From: Steven Sistare @ 2025-02-03 22:26 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 2/3/2025 12:48 PM, Cédric Le Goater wrote:
> On 1/29/25 15:43, Steve Sistare wrote:
>> At vfio creation time, save the value of vfio container, group, and device
>> descriptors in CPR state. On qemu restart, vfio_realize() finds and uses
>> the saved descriptors, and remembers the reused status for subsequent
>> patches. The reused status is cleared when vmstate load finishes.
>>
>> During reuse, device and iommu state is already configured, so operations
>> in vfio_realize that would modify the configuration, such as vfio ioctl's,
>> are skipped. The result is that vfio_realize constructs qemu data
>> structures that reflect the current state of the device.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/container.c | 105 ++++++++++++++++++++++++++++++++++--------
>> hw/vfio/cpr-legacy.c | 17 +++++++
>> include/hw/vfio/vfio-common.h | 2 +
>> 3 files changed, 105 insertions(+), 19 deletions(-)
>>
>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>> index a90ce6c..81d0ccc 100644
>> --- a/hw/vfio/container.c
>> +++ b/hw/vfio/container.c
>> @@ -31,6 +31,7 @@
>> #include "system/reset.h"
>> #include "trace.h"
>> #include "qapi/error.h"
>> +#include "migration/cpr.h"
>> #include "pci.h"
>> VFIOGroupList vfio_group_list =
>> @@ -415,12 +416,28 @@ static bool vfio_set_iommu(int container_fd, int group_fd,
>> }
>> static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
>> - Error **errp)
>> + bool reused, Error **errp)
>
> Please rename 'reused' to 'cpr_reused'. We should know what this parameter
> is for and I don't see any other use than CPR.
Hi Cedric, glad to virtually meet you, and thanks for reviewing this.
There is no other notion of "reused" in qemu -- CPR is the first to introduce
it. Thus "reused" is unambiguous: it always refers to CPR. IMO shorter names
without underscores make the code more readable, as long as they are unambiguous.
Also, the "reused" identifier already appears in the initial series for
cpr-transfer, and to switch now to a different identifier leaves us with two
names for the same functionality. Right now I can cscope "reused" and find
everything.
For those reasons, I prefer reused, but if you feel strongly, I will rename it.
>> {
>> int iommu_type;
>> const char *vioc_name;
>> VFIOContainer *container;
>> + /*
>> + * If container is reused, just set its type and skip the ioctls, as the
>> + * container and group are already configured in the kernel.
>> + * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
>> + */
>> + if (reused) {
>> + if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU)) {
>> + iommu_type = VFIO_TYPE1v2_IOMMU;
>> + goto skip_iommu;
>> + } else {
>> + error_setg(errp, "container was reused but VFIO_TYPE1v2_IOMMU "
>> + "is not supported");
>> + return NULL;
>> + }
>> + }
>> +
>
> Can we use 'iommu_type' below instead and avoid VFIO_CHECK_EXTENSION
> ioctl ? and then set the iommu unless CPR reused is set.
Sure, I'll make that change.
>> iommu_type = vfio_get_iommu_type(fd, errp);
>> if (iommu_type < 0) {
>> return NULL;
>> @@ -430,10 +447,12 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
>> return NULL;
>> }
>> +skip_iommu:
>
> I think we can avoid this 'skip_iommu' label with some minor refactoring.
>
>> vioc_name = vfio_get_iommu_class_name(iommu_type);
>> container = VFIO_IOMMU_LEGACY(object_new(vioc_name));
>> container->fd = fd;
>> + container->reused = reused;
>> container->iommu_type = iommu_type;
>> return container;
>> }
>> @@ -543,10 +562,13 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>> VFIOContainer *container;
>> VFIOContainerBase *bcontainer;
>> int ret, fd;
>> + bool reused;
>
> cpr_reused.
>
>> VFIOAddressSpace *space;
>> VFIOIOMMUClass *vioc;
>> space = vfio_get_address_space(as);
>> + fd = cpr_find_fd("vfio_container_for_group", group->groupid);
>> + reused = (fd > 0);
>
>
> hmm, so we are deducing from the existence of a CprFd state element
> that we are doing a live update of the VM. This seems to me to be a
> somewhat quick heuristic.
>
> Isn't there a global helper ? Isn't the VM aware that it's being
> restarted after a live update ? I am not familiar with the CPR
> sequence.
There is a global mode that can be checked, but we would still need to
fetch the fd. Checking the fd alone yields tighter code. It also seems
perfectly logical to me when reading the code. Can't find the cpr fd?
Then we are not doing cpr. BTW, it is not a heuristic. The cpr fd exists
at creation time iff we are doing cpr.
>> /*
>> * VFIO is currently incompatible with discarding of RAM insofar as the
>> @@ -579,28 +601,52 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>> * details once we know which type of IOMMU we are using.
>> */
>> + /*
>> + * If the container is reused, then the group is already attached in the
>> + * kernel. If a container with matching fd is found, then update the
>> + * userland group list and return. If not, then after the loop, create
>> + * the container struct and group list.
>> + */
>> +
>> QLIST_FOREACH(bcontainer, &space->containers, next) {
>> container = container_of(bcontainer, VFIOContainer, bcontainer);
>> - if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>> - ret = vfio_ram_block_discard_disable(container, true);
>> - if (ret) {
>> - error_setg_errno(errp, -ret,
>> - "Cannot set discarding of RAM broken");
>> - if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
>> - &container->fd)) {
>> - error_report("vfio: error disconnecting group %d from"
>> - " container", group->groupid);
>> - }
>> - return false;
>> +
>> + if (reused) {
>> + if (container->fd != fd) {
>> + continue;
>> }
>> - group->container = container;
>> - QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>> + } else if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>> + continue;
>> + }
>> +
>> + /* Container is a match for the group */
>> + ret = vfio_ram_block_discard_disable(container, true);
>> + if (ret) {
>> + error_setg_errno(errp, -ret,
>> + "Cannot set discarding of RAM broken");
>> + if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
>> + &container->fd)) {
>> + error_report("vfio: error disconnecting group %d from"
>> + " container", group->groupid);
>> +
>> + }
>> + goto delete_fd_exit;
>> + }
>> + group->container = container;
>> + QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>> + if (!reused) {
>> vfio_kvm_device_add_group(group);
>> - return true;
>> + cpr_save_fd("vfio_container_for_group", group->groupid,
>> + container->fd);
>> }
>> + return true;
>> + }
>
> The above changes are difficult to understand
Agreed, the above diffs are indeed hard to grok. Please apply the changes
and review the resulting code and let me know if it still needs helpers.
I could move all of the code after "Container is a match for the group" to
a helper, or just the code after "group->container = container", but IMO
neither choice helps one understand the slightly tricky logic in the loop.
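For readers following only the diff, the matching logic of the patched loop
can be reduced to the following stand-alone sketch. It is an illustration, not
the patched function: the sketch_* names are hypothetical, the list is a plain
singly linked list instead of QLIST, and sketch_try_attach stands in for the
VFIO_GROUP_SET_CONTAINER ioctl (here it arbitrarily succeeds for even fds).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_container {
    int fd;
    struct sketch_container *next;
};

/*
 * Stand-in for ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container_fd).
 * For the sake of the example, attaching succeeds only for even fds.
 */
static bool sketch_try_attach(int group_fd, int container_fd)
{
    (void)group_fd;
    return container_fd % 2 == 0;
}

/*
 * Return the container that matches the group, or NULL if the caller
 * must create one. cpr_fd is the descriptor found in CPR state, or a
 * negative value when no live update is in progress.
 */
struct sketch_container *sketch_find_container(struct sketch_container *head,
                                               int group_fd, int cpr_fd)
{
    bool reused = (cpr_fd >= 0);

    assert(group_fd >= 0);

    for (struct sketch_container *c = head; c; c = c->next) {
        if (reused) {
            /* Group is already attached in the kernel: match by fd only. */
            if (c->fd != cpr_fd) {
                continue;
            }
        } else if (!sketch_try_attach(group_fd, c->fd)) {
            /* Normal start: this container does not accept the group. */
            continue;
        }
        return c; /* Container is a match for the group. */
    }
    return NULL;
}
```

The point of the sketch is that the reused case never issues the attach call;
it only compares descriptors, and everything after the match is common.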
> and I really don't like
> these 'if (reused)' code sequences scattered all over the place. It
> would make reading and long term maintenance easier if we could
> introduce helpers to hide the "CPR reuse" aspect of the machine
> initialization phase.
I'll look into refactoring and helpers, but I'm not convinced the resulting
code will be more readable, because there are many separate steps that must
be performed in order, and the lines to be skipped for cpr are interleaved
throughout.
Again, I hope you get a chance to read the patched code, and not just the
diffs. Reading a patched function from top to bottom, it is easy to see
what is skipped for cpr.
>> + /* No matching container found, create one */
>> + if (!reused) {
>> + fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
>> }
>> - fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
>> if (fd < 0) {
>> goto put_space_exit;
>> }
>> @@ -612,11 +658,12 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>> goto close_fd_exit;
>> }
>> - container = vfio_create_container(fd, group, errp);
>> + container = vfio_create_container(fd, group, reused, errp);
>> if (!container) {
>> goto close_fd_exit;
>> }
>> bcontainer = &container->bcontainer;
>> + container->reused = reused;
>
> that's done already in vfio_create_container()
Thanks, I will delete the redundant assignment.
>> if (!vfio_legacy_cpr_register_container(container, errp)) {
>> goto free_container_exit;
>> @@ -652,6 +699,7 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>> }
>> bcontainer->initialized = true;
>> + cpr_resave_fd("vfio_container_for_group", group->groupid, fd);
>
> Can't we have a helper routine to open/reuse/resave the fd? Same
> comment for vfio_get_device() and vfio_get_group().
Yes, for some cases where the descriptor is opened using qemu_open, I could
define a helper. It would work well for vfio_get_group, which was:
group->fd = cpr_find_fd("vfio_group", groupid);
if (group->fd < 0) {
group->fd = qemu_open(path, O_RDWR, errp);
}
...
cpr_resave_fd("vfio_group", groupid, group->fd);
and now becomes:
group->fd = cpr_open_or_find_fd(path, O_RDWR, "vfio_group", groupid, errp);
but now we need an additional call to delete the fd on failure, so the helper
provides only a modest improvement in lines of code:
free_group_exit:
cpr_delete_fd("vfio_group", group->groupid);
Also, the helper cannot be used for vfio_get_device, because it creates the
descriptor via VFIO_GROUP_GET_DEVICE_FD.
And it cannot be used for vfio_connect_container, because the reused fd must be
known early, during the search of containers, before qemu_open("/dev/vfio/vfio")
is called.
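As a rough illustration of what such a helper could look like, here is a
self-contained sketch. It makes several assumptions: a small in-memory table
stands in for CPR state, plain open(2) stands in for qemu_open, and all
sketch_* names and the 'reused' out-parameter are hypothetical rather than
part of this series.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SKETCH_MAX_FDS 16

/* One saved descriptor, keyed by (name, id) like the cpr_*_fd calls. */
struct sketch_fd_entry {
    char name[64];
    int id;
    int fd;
    bool used;
};

static struct sketch_fd_entry sketch_fds[SKETCH_MAX_FDS];

static struct sketch_fd_entry *sketch_lookup(const char *name, int id)
{
    for (int i = 0; i < SKETCH_MAX_FDS; i++) {
        struct sketch_fd_entry *e = &sketch_fds[i];
        if (e->used && e->id == id && strcmp(e->name, name) == 0) {
            return e;
        }
    }
    return NULL;
}

/* Return the saved fd, or -1 if none was saved under (name, id). */
int sketch_find_fd(const char *name, int id)
{
    struct sketch_fd_entry *e = sketch_lookup(name, id);
    return e ? e->fd : -1;
}

/* Safe to call even if nothing was saved under (name, id). */
void sketch_delete_fd(const char *name, int id)
{
    struct sketch_fd_entry *e = sketch_lookup(name, id);
    if (e) {
        e->used = false;
    }
}

static bool sketch_save_fd(const char *name, int id, int fd)
{
    for (int i = 0; i < SKETCH_MAX_FDS; i++) {
        struct sketch_fd_entry *e = &sketch_fds[i];
        if (!e->used) {
            snprintf(e->name, sizeof(e->name), "%s", name);
            e->id = id;
            e->fd = fd;
            e->used = true;
            return true;
        }
    }
    return false; /* table full */
}

/*
 * Reuse the saved descriptor if there is one; otherwise open the path
 * and save the new descriptor. Returns the fd, or -1 on failure.
 */
int sketch_open_or_find_fd(const char *path, int flags,
                           const char *name, int id, bool *reused)
{
    int fd;

    assert(path && name && reused);

    fd = sketch_find_fd(name, id);
    *reused = (fd >= 0);
    if (*reused) {
        return fd;
    }

    fd = open(path, flags);
    if (fd >= 0 && !sketch_save_fd(name, id, fd)) {
        close(fd);
        fd = -1;
    }
    return fd;
}
```

Because the helper saves the descriptor as soon as it is opened, any later
failure in the caller has to call the delete routine, which is the extra
cleanup step mentioned above.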
>> return true;
>> listener_release_exit:
>> @@ -677,6 +725,8 @@ close_fd_exit:
>> put_space_exit:
>> vfio_put_address_space(space);
>> +delete_fd_exit:
>> + cpr_delete_fd("vfio_container_for_group", group->groupid);
>
> Another exit label. That's the 7th in vfio_connect_container() ...
> This is becoming too complex, we need to refactor first.
I don't see any obvious subroutine candidates that would reduce the
goto count.
But, if we set and clear variables appropriately, we can check them while
unwinding, and rely on some cleanup functions being safe to call even when
not needed, and delete all intermediate labels:
fail:
if (group_was_added) { // new local variable
QLIST_REMOVE(group, container_next);
vfio_kvm_device_del_group(group);
}
memory_listener_unregister(&bcontainer->listener); // safe
if (vioc && vioc->release) {
vioc->release(bcontainer);
}
if (discard_disabled) { // new local variable
vfio_ram_block_discard_disable(container, false);
}
vfio_legacy_cpr_unregister_container(container); // safe
if (container) {
object_unref(container);
}
if (fd >= 0) {
close(fd);
}
if (space) {
vfio_put_address_space(space);
}
cpr_delete_fd("vfio_container_for_group", group->groupid); // safe
return false;
Sound good?
- Steve
>> return false;
>> }
>> @@ -688,6 +738,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>> QLIST_REMOVE(group, container_next);
>> group->container = NULL;
>> + cpr_delete_fd("vfio_container_for_group", group->groupid);
>> /*
>> * Explicitly release the listener first before unset container,
>> @@ -741,7 +792,12 @@ static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>> group = g_malloc0(sizeof(*group));
>> snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
>> - group->fd = qemu_open(path, O_RDWR, errp);
>> +
>> + group->fd = cpr_find_fd("vfio_group", groupid);
>> + if (group->fd < 0) {
>> + group->fd = qemu_open(path, O_RDWR, errp);
>> + }
>> +
>> if (group->fd < 0) {
>> goto free_group_exit;
>> }
>> @@ -769,6 +825,7 @@ static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>> }
>> QLIST_INSERT_HEAD(&vfio_group_list, group, next);
>> + cpr_resave_fd("vfio_group", groupid, group->fd);
>> return group;
>> @@ -794,6 +851,7 @@ static void vfio_put_group(VFIOGroup *group)
>> vfio_disconnect_container(group);
>> QLIST_REMOVE(group, next);
>> trace_vfio_put_group(group->fd);
>> + cpr_delete_fd("vfio_group", group->groupid);
>> close(group->fd);
>> g_free(group);
>> }
>> @@ -803,8 +861,14 @@ static bool vfio_get_device(VFIOGroup *group, const char *name,
>> {
>> g_autofree struct vfio_device_info *info = NULL;
>> int fd;
>> + bool reused;
>> +
>> + fd = cpr_find_fd(name, 0);
>> + reused = (fd >= 0);
>> + if (!reused) {
>> + fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>> + }
>> - fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>> if (fd < 0) {
>> error_setg_errno(errp, errno, "error getting device from group %d",
>> group->groupid);
>> @@ -849,6 +913,8 @@ static bool vfio_get_device(VFIOGroup *group, const char *name,
>> vbasedev->num_irqs = info->num_irqs;
>> vbasedev->num_regions = info->num_regions;
>> vbasedev->flags = info->flags;
>> + vbasedev->reused = reused;
>> + cpr_resave_fd(name, 0, fd);
>> trace_vfio_get_device(name, info->flags, info->num_regions, info->num_irqs);
>> @@ -865,6 +931,7 @@ static void vfio_put_base_device(VFIODevice *vbasedev)
>> QLIST_REMOVE(vbasedev, next);
>> vbasedev->group = NULL;
>> trace_vfio_put_base_device(vbasedev->fd);
>> + cpr_delete_fd(vbasedev->name, 0);
>> close(vbasedev->fd);
>> }
>> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
>> index d3bbc05..ce6f14e 100644
>> --- a/hw/vfio/cpr-legacy.c
>> +++ b/hw/vfio/cpr-legacy.c
>> @@ -29,10 +29,27 @@ static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>> }
>> }
>> +static int vfio_container_post_load(void *opaque, int version_id)
>> +{
>> + VFIOContainer *container = opaque;
>> + VFIOGroup *group;
>> + VFIODevice *vbasedev;
>> +
>> + container->reused = false;
>> +
>> + QLIST_FOREACH(group, &container->group_list, container_next) {
>> + QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> + vbasedev->reused = false;
>> + }
>> + }
>> + return 0;
>> +}
>> +
>> static const VMStateDescription vfio_container_vmstate = {
>> .name = "vfio-container",
>> .version_id = 0,
>> .minimum_version_id = 0,
>> + .post_load = vfio_container_post_load,
>> .needed = cpr_needed_for_reuse,
>> .fields = (VMStateField[]) {
>> VMSTATE_END_OF_LIST()
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 53e554f..a435a90 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -85,6 +85,7 @@ typedef struct VFIOContainer {
>> int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>> unsigned iommu_type;
>> Error *cpr_blocker;
>> + bool reused;
>> QLIST_HEAD(, VFIOGroup) group_list;
>> } VFIOContainer;
>> @@ -135,6 +136,7 @@ typedef struct VFIODevice {
>> bool ram_block_discard_allowed;
>> OnOffAuto enable_migration;
>> bool migration_events;
>> + bool reused;
>> VFIODeviceOps *ops;
>> unsigned int num_irqs;
>> unsigned int num_regions;
>
* Re: [PATCH V1 04/26] vfio/container: register container for cpr
2025-02-03 17:01 ` Cédric Le Goater
@ 2025-02-03 22:26 ` Steven Sistare
0 siblings, 0 replies; 64+ messages in thread
From: Steven Sistare @ 2025-02-03 22:26 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 2/3/2025 12:01 PM, Cédric Le Goater wrote:
> On 1/29/25 15:43, Steve Sistare wrote:
>> Register a legacy container for cpr-transfer. Add a blocker if the kernel
>> does not support VFIO_UPDATE_VADDR or VFIO_UNMAP_ALL.
>>
>> This is mostly boilerplate. The fields to be saved and restored are added
>> in subsequent patches.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/container.c | 6 ++--
>> hw/vfio/cpr-legacy.c | 68 +++++++++++++++++++++++++++++++++++++++++++
>> hw/vfio/meson.build | 3 +-
>> include/hw/vfio/vfio-common.h | 3 ++
>> 4 files changed, 76 insertions(+), 4 deletions(-)
>> create mode 100644 hw/vfio/cpr-legacy.c
>>
>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>> index 4ebb526..a90ce6c 100644
>> --- a/hw/vfio/container.c
>> +++ b/hw/vfio/container.c
>> @@ -618,7 +618,7 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>> }
>> bcontainer = &container->bcontainer;
>> - if (!vfio_cpr_register_container(bcontainer, errp)) {
>> + if (!vfio_legacy_cpr_register_container(container, errp)) {
>> goto free_container_exit;
>> }
>> @@ -666,7 +666,7 @@ enable_discards_exit:
>> vfio_ram_block_discard_disable(container, false);
>> unregister_container_exit:
>> - vfio_cpr_unregister_container(bcontainer);
>> + vfio_legacy_cpr_unregister_container(container);
>> free_container_exit:
>> object_unref(container);
>> @@ -710,7 +710,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>> VFIOAddressSpace *space = bcontainer->space;
>> trace_vfio_disconnect_container(container->fd);
>> - vfio_cpr_unregister_container(bcontainer);
>> + vfio_legacy_cpr_unregister_container(container);
>> close(container->fd);
>> object_unref(container);
>> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
>> new file mode 100644
>> index 0000000..d3bbc05
>> --- /dev/null
>> +++ b/hw/vfio/cpr-legacy.c
>> @@ -0,0 +1,68 @@
>> +/*
>> + * Copyright (c) 2021-2025 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include <sys/ioctl.h>
>> +#include "qemu/osdep.h"
>> +#include "hw/vfio/vfio-common.h"
>> +#include "migration/blocker.h"
>> +#include "migration/cpr.h"
>> +#include "migration/migration.h"
>> +#include "migration/vmstate.h"
>> +#include "qapi/error.h"
>> +
>> +static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>> +{
>> + if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
>> + error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR");
>> + return false;
>> +
>> + } else if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
>> + error_setg(errp, "VFIO container does not support VFIO_UNMAP_ALL");
>> + return false;
>> +
>> + } else {
>> + return true;
>> + }
>> +}
>> +
>> +static const VMStateDescription vfio_container_vmstate = {
>> + .name = "vfio-container",
>> + .version_id = 0,
>> + .minimum_version_id = 0,
>> + .needed = cpr_needed_for_reuse,
>> + .fields = (VMStateField[]) {
>> + VMSTATE_END_OF_LIST()
>> + }
>> +};
>> +
>> +bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
>> +{
>> + VFIOContainerBase *bcontainer = &container->bcontainer;
>> + Error **cpr_blocker = &container->cpr_blocker;
>> +
>> + if (!vfio_cpr_register_container(bcontainer, errp)) {
>> + return false;
>> + }
>> +
>> + if (!vfio_cpr_supported(container, cpr_blocker)) {
>> + return migrate_add_blocker_modes(cpr_blocker, errp,
>> + MIG_MODE_CPR_TRANSFER, -1) == 0;
>> + }
>> +
>> + vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>> +
>> + return true;
>> +}
>> +
>> +void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
>> +{
>> + VFIOContainerBase *bcontainer = &container->bcontainer;
>> +
>> + vfio_cpr_unregister_container(bcontainer);
>> + migrate_del_blocker(&container->cpr_blocker);
>> + vmstate_unregister(NULL, &vfio_container_vmstate, container);
>> +}
>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>> index bba776f..5487815 100644
>> --- a/hw/vfio/meson.build
>> +++ b/hw/vfio/meson.build
>> @@ -5,13 +5,14 @@ vfio_ss.add(files(
>> 'container-base.c',
>> 'container.c',
>> 'migration.c',
>> - 'cpr.c',
>> ))
>> vfio_ss.add(when: 'CONFIG_PSERIES', if_true: files('spapr.c'))
>> vfio_ss.add(when: 'CONFIG_IOMMUFD', if_true: files(
>> 'iommufd.c',
>> ))
>> vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
>> + 'cpr.c',
>> + 'cpr-legacy.c',
>> 'display.c',
>> 'pci-quirks.c',
>> 'pci.c',
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 0c60be5..53e554f 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -84,6 +84,7 @@ typedef struct VFIOContainer {
>> VFIOContainerBase bcontainer;
>> int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>> unsigned iommu_type;
>> + Error *cpr_blocker;
>> QLIST_HEAD(, VFIOGroup) group_list;
>> } VFIOContainer;
>> @@ -258,6 +259,8 @@ int vfio_kvm_device_del_fd(int fd, Error **errp);
>> bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp);
>> void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer);
>
> I think we should now rename the above routines to reflect what they do :
> add/remove a notifier.
Agreed, they do little. Before the container types split, I thought this function
would be extended to support cpr-transfer, but now the container-specific functions
do that.
I'll just squash vfio_cpr_register_container and vfio_cpr_unregister_container into
their call sites.
- Steve
>> +bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp);
>> +void vfio_legacy_cpr_unregister_container(VFIOContainer *container);
>
> Thanks,
>
> C.
>
>> extern const MemoryRegionOps vfio_region_ops;
>> typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
>
* Re: [PATCH V1 06/26] vfio/container: preserve DMA mappings
2025-02-03 18:25 ` Cédric Le Goater
@ 2025-02-03 22:27 ` Steven Sistare
0 siblings, 0 replies; 64+ messages in thread
From: Steven Sistare @ 2025-02-03 22:27 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 2/3/2025 1:25 PM, Cédric Le Goater wrote:
> On 1/29/25 15:43, Steve Sistare wrote:
>> Preserve DMA mappings during cpr-transfer.
>>
>> In the container pre_save handler, suspend the use of virtual addresses
>> in DMA mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest RAM will
>> be remapped at a different VA after exec. DMA to already-mapped pages
>> continues.
>>
>> Because the vaddr is temporarily invalid, mediated devices cannot be
>> supported, so add a blocker for them. This restriction will not apply
>> to iommufd containers when CPR is added for them in a future patch.
>>
>> In new QEMU, do not register the memory listener at device creation time.
>> Register it later, in the container post_load handler, after all vmstate
>> that may affect regions and mapping boundaries has been loaded. The
>> post_load registration will cause the listener to invoke its callback on
>> each flat section, and the calls will match the mappings remembered by the
>> kernel. Modify vfio_dma_map (which is called by the listener) to pass the
>> new VA to the kernel using VFIO_DMA_MAP_FLAG_VADDR.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/container.c | 44 +++++++++++++++++++++++++++++++++++++++----
>> hw/vfio/cpr-legacy.c | 32 +++++++++++++++++++++++++++++++
>> include/hw/vfio/vfio-common.h | 3 +++
>> 3 files changed, 75 insertions(+), 4 deletions(-)
>>
>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>> index 81d0ccc..2b5125e 100644
>> --- a/hw/vfio/container.c
>> +++ b/hw/vfio/container.c
>> @@ -32,6 +32,7 @@
>> #include "trace.h"
>> #include "qapi/error.h"
>> #include "migration/cpr.h"
>> +#include "migration/blocker.h"
>> #include "pci.h"
>> VFIOGroupList vfio_group_list =
>> @@ -132,6 +133,8 @@ static int vfio_legacy_dma_unmap(const VFIOContainerBase *bcontainer,
>> int ret;
>> Error *local_err = NULL;
>> + assert(!container->reused);
>> +
>> if (iotlb && vfio_devices_all_dirty_tracking_started(bcontainer)) {
>> if (!vfio_devices_all_device_dirty_tracking(bcontainer) &&
>> bcontainer->dirty_pages_supported) {
>> @@ -183,12 +186,24 @@ static int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
>> bcontainer);
>> struct vfio_iommu_type1_dma_map map = {
>> .argsz = sizeof(map),
>> - .flags = VFIO_DMA_MAP_FLAG_READ,
>> .vaddr = (__u64)(uintptr_t)vaddr,
>> .iova = iova,
>> .size = size,
>> };
>> + /*
>> + * Set the new vaddr for any mappings registered during cpr load.
>> + * Reused is cleared thereafter.
>> + */
>> + if (container->reused) {
>> + map.flags = VFIO_DMA_MAP_FLAG_VADDR;
>> + if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
>> + goto fail;
>> + }
>> + return 0;
>> + }
>
> This is a bit ugly.
>
> When reaching routine vfio_attach_device(), could we detect that CPR is
> in progress and replace the 'VFIOIOMMUClass *' temporarily with a set of
> CPR specific handlers ?
Good idea, I'll try it. I wrote this code years ago before the dma
map and unmap functions were defined in an ops vector.
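A rough sketch of the shape this could take, not the eventual patch. All names here (DemoOps, DemoContainer, demo_cpr_begin, demo_cpr_end) are hypothetical stand-ins and not QEMU identifiers; the idea is only that the container points at an ops table, and a CPR-specific table is swapped in while load is in progress and swapped back afterwards, so the normal map path needs no "reused" test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an ops vector such as VFIOIOMMUClass. */
typedef struct DemoOps {
    int (*dma_map)(void *container, unsigned long iova, size_t size);
} DemoOps;

typedef struct DemoContainer {
    const DemoOps *ops;       /* handlers currently in effect */
    const DemoOps *saved_ops; /* normal handlers, kept while CPR is active */
    int last_path;            /* 0 = none, 1 = normal map, 2 = vaddr update */
} DemoContainer;

static int demo_map_normal(void *opaque, unsigned long iova, size_t size)
{
    DemoContainer *c = opaque;
    (void)iova;
    (void)size;
    c->last_path = 1;   /* real code would map with READ/WRITE flags */
    return 0;
}

static int demo_map_cpr(void *opaque, unsigned long iova, size_t size)
{
    DemoContainer *c = opaque;
    (void)iova;
    (void)size;
    c->last_path = 2;   /* real code would only supply the new vaddr */
    return 0;
}

static const DemoOps demo_normal_ops = { .dma_map = demo_map_normal };
static const DemoOps demo_cpr_ops = { .dma_map = demo_map_cpr };

/* Swap in the CPR handlers, remembering the normal ones. */
static void demo_cpr_begin(DemoContainer *c)
{
    c->saved_ops = c->ops;
    c->ops = &demo_cpr_ops;
}

/* Restore the normal handlers once CPR load has finished. */
static void demo_cpr_end(DemoContainer *c)
{
    assert(c->saved_ops != NULL);
    c->ops = c->saved_ops;
    c->saved_ops = NULL;
}
```

With this arrangement the callers keep invoking ops->dma_map() unchanged, and only the begin/end points need to know that CPR is in progress.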
>> +
>> + map.flags = VFIO_DMA_MAP_FLAG_READ;
>> if (!readonly) {
>> map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
>> }
>> @@ -205,7 +220,11 @@ static int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
>> return 0;
>> }
>> - error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
>> +fail:
>> + error_report("vfio_dma_map %s (iova %lu, size %ld, va %p): %s",
>> + (container->reused ? "VADDR" : ""), iova, size, vaddr,
>> + strerror(errno));
>> +
>
>
> FYI, I am currently trying to remove this error report.
>
>
>> return -errno;
>> }
>> @@ -689,8 +708,17 @@ static bool vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>> group->container = container;
>> QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>> - bcontainer->listener = vfio_memory_listener;
>> - memory_listener_register(&bcontainer->listener, bcontainer->space->as);
>> + /*
>> + * If reused, register the listener later, after all state that may
>> + * affect regions and mapping boundaries has been cpr load'ed. Later,
>> + * the listener will invoke its callback on each flat section and call
>> + * vfio_dma_map to supply the new vaddr, and the calls will match the
>> + * mappings remembered by the kernel.
>> + */
>> + if (!reused) {
>> + bcontainer->listener = vfio_memory_listener;
>> + memory_listener_register(&bcontainer->listener, bcontainer->space->as);
>> + }
>
> Oh! This is an important change. Please move it into its own patch.
OK.
>> if (bcontainer->error) {
>> error_propagate_prepend(errp, bcontainer->error,
>> @@ -1002,6 +1030,13 @@ static bool vfio_legacy_attach_device(const char *name, VFIODevice *vbasedev,
>> return false;
>> }
>> + if (vbasedev->mdev) {
>> + error_setg(&vbasedev->cpr_mdev_blocker,
>> + "CPR does not support vfio mdev %s", vbasedev->name);
>> + migrate_add_blocker_modes(&vbasedev->cpr_mdev_blocker, &error_fatal,
>> + MIG_MODE_CPR_TRANSFER, -1);
>> + }
>
> same here, the cpr blocker for mdev devices should be in its own patch.
OK. It was a separate patch in my workspace then I squashed it :)
>> bcontainer = &group->container->bcontainer;
>> vbasedev->bcontainer = bcontainer;
>> QLIST_INSERT_HEAD(&bcontainer->device_list, vbasedev, container_next);
>> @@ -1018,6 +1053,7 @@ static void vfio_legacy_detach_device(VFIODevice *vbasedev)
>> QLIST_REMOVE(vbasedev, container_next);
>> vbasedev->bcontainer = NULL;
>> trace_vfio_detach_device(vbasedev->name, group->groupid);
>> + migrate_del_blocker(&vbasedev->cpr_mdev_blocker);
>> vfio_put_base_device(vbasedev);
>> vfio_put_group(group);
>> }
>> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
>> index ce6f14e..f3a31d1 100644
>> --- a/hw/vfio/cpr-legacy.c
>> +++ b/hw/vfio/cpr-legacy.c
>> @@ -14,6 +14,21 @@
>> #include "migration/vmstate.h"
>> #include "qapi/error.h"
>> +static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>> +{
>> + struct vfio_iommu_type1_dma_unmap unmap = {
>> + .argsz = sizeof(unmap),
>> + .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
>> + .iova = 0,
>> + .size = 0,
>> + };
>> + if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
>> + error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
>> + return false;
>> + }
>> + return true;
>> +}
>> +
>> static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>> {
>> if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
>> @@ -29,12 +44,27 @@ static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>> }
>> }
>> +static int vfio_container_pre_save(void *opaque)
>> +{
>> + VFIOContainer *container = opaque;
>> + Error *err = NULL;
>> +
>> + if (!vfio_dma_unmap_vaddr_all(container, &err)) {
>> + error_report_err(err);
>
> We should modify vmstate_save_state_v() to call .pre_save() handlers
> with an Error ** parameter.
Hmm, that changes the signature of every pre_save handler. That does not
belong in this series, IMO. It would be a separate RFE for migration.
- Steve
>> + return -1;
>> + }
>> + return 0;
>> +}
>> +
>> static int vfio_container_post_load(void *opaque, int version_id)
>> {
>> VFIOContainer *container = opaque;
>> + VFIOContainerBase *bcontainer = &container->bcontainer;
>> VFIOGroup *group;
>> VFIODevice *vbasedev;
>> + bcontainer->listener = vfio_memory_listener;
>> + memory_listener_register(&bcontainer->listener, bcontainer->space->as);
>> container->reused = false;
>> QLIST_FOREACH(group, &container->group_list, container_next) {
>> @@ -49,6 +79,8 @@ static const VMStateDescription vfio_container_vmstate = {
>> .name = "vfio-container",
>> .version_id = 0,
>> .minimum_version_id = 0,
>> + .priority = MIG_PRI_LOW, /* Must happen after devices and groups */
>> + .pre_save = vfio_container_pre_save,
>> .post_load = vfio_container_post_load,
>> .needed = cpr_needed_for_reuse,
>> .fields = (VMStateField[]) {
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index a435a90..1e974e0 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -143,6 +143,7 @@ typedef struct VFIODevice {
>> unsigned int flags;
>> VFIOMigration *migration;
>> Error *migration_blocker;
>> + Error *cpr_mdev_blocker;
>> OnOffAuto pre_copy_dirty_page_tracking;
>> OnOffAuto device_dirty_page_tracking;
>> bool dirty_pages_supported;
>> @@ -310,6 +311,8 @@ int vfio_devices_query_dirty_bitmap(const VFIOContainerBase *bcontainer,
>> int vfio_get_dirty_bitmap(const VFIOContainerBase *bcontainer, uint64_t iova,
>> uint64_t size, ram_addr_t ram_addr, Error **errp);
>> +void vfio_listener_register(VFIOContainerBase *bcontainer);
>> +
>> /* Returns 0 on success, or a negative errno. */
>> bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp);
>> void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
>
* Re: [PATCH V1 07/26] vfio/container: recover from unmap-all-vaddr failure
2025-01-29 14:43 ` [PATCH V1 07/26] vfio/container: recover from unmap-all-vaddr failure Steve Sistare
@ 2025-02-04 14:10 ` Cédric Le Goater
2025-02-04 16:13 ` Steven Sistare
0 siblings, 1 reply; 64+ messages in thread
From: Cédric Le Goater @ 2025-02-04 14:10 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 1/29/25 15:43, Steve Sistare wrote:
> If there are multiple containers and unmap-all fails for some container, we
> need to remap vaddr for the other containers for which unmap-all succeeded.
> Recover by walking all address ranges of all containers to restore the vaddr
> for each. Do so by invoking the vfio listener callback, and passing a new
> "remap" flag that tells it to restore a mapping without re-allocating new
> userland data structures.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> hw/vfio/common.c | 47 ++++++++++++++++++++++++++++++++++++++++++-
> hw/vfio/cpr-legacy.c | 44 ++++++++++++++++++++++++++++++++++++++++
> include/hw/vfio/vfio-common.h | 6 +++++-
> 3 files changed, 95 insertions(+), 2 deletions(-)
>
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 7370332..c8ee71a 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -580,6 +580,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
> {
> VFIOContainerBase *bcontainer = container_of(listener, VFIOContainerBase,
> listener);
> + vfio_container_region_add(bcontainer, section, false);
> +}
> +
> +void vfio_container_region_add(VFIOContainerBase *bcontainer,
> + MemoryRegionSection *section,
> + bool remap)
> +{
vfio_container_region_add() is already complex enough. Please consider
doing an initial refactoring before adding a new code path. It would be
welcome !
> hwaddr iova, end;
> Int128 llend, llsize;
> void *vaddr;
> @@ -614,6 +621,30 @@ static void vfio_listener_region_add(MemoryListener *listener,
> int iommu_idx;
>
> trace_vfio_listener_region_add_iommu(section->mr->name, iova, end);
> +
> + /*
> + * If remap, then VFIO_DMA_UNMAP_FLAG_VADDR has been called, and we
> + * want to remap the vaddr. vfio_container_region_add was already
> + * called in the past, so the giommu already exists. Find it and
> + * replay it, which calls vfio_dma_map further down the stack.
> + */
> +
> + if (remap) {
> + hwaddr as_offset = section->offset_within_address_space;
> + hwaddr iommu_offset = as_offset - section->offset_within_region;
> +
> + QLIST_FOREACH(giommu, &bcontainer->giommu_list, giommu_next) {
> + if (giommu->iommu_mr == iommu_mr &&
> + giommu->iommu_offset == iommu_offset) {
> + memory_region_iommu_replay(giommu->iommu_mr, &giommu->n);
> + return;
> + }
> + }
> + error_report("Container cannot find iommu region %s offset %lx",
> + memory_region_name(section->mr), iommu_offset);
error_report() calls are not welcome here. We need to find a way to propagate
this error.
> + goto fail;
> + }
Please introduce a vfio_cpr helper for the section above and move it
under the hw/vfio/cpr* files.
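To illustrate, such a helper might have roughly the following shape. This is a self-contained sketch with hypothetical names (DemoGuestIOMMU, demo_cpr_giommu_remap) and a plain linked list instead of QLIST; the replay counter stands in for the call to memory_region_iommu_replay(). It returns bool so that the caller decides how to report a missing entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a guest IOMMU list entry. */
typedef struct DemoGuestIOMMU {
    const void *iommu_mr;        /* identity of the IOMMU memory region */
    uint64_t iommu_offset;
    int replays;                 /* counts replays, for illustration only */
    struct DemoGuestIOMMU *next;
} DemoGuestIOMMU;

/*
 * Find the entry that was created when the section was first added and
 * replay it. Returns false if no entry matches.
 */
static bool demo_cpr_giommu_remap(DemoGuestIOMMU *head, const void *iommu_mr,
                                  uint64_t offset_within_address_space,
                                  uint64_t offset_within_region)
{
    uint64_t iommu_offset = offset_within_address_space - offset_within_region;

    for (DemoGuestIOMMU *g = head; g; g = g->next) {
        if (g->iommu_mr == iommu_mr && g->iommu_offset == iommu_offset) {
            g->replays++;    /* the real helper would replay the notifier */
            return true;
        }
    }
    return false;
}
```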
> /*
> * FIXME: For VFIO iommu types which have KVM acceleration to
> * avoid bouncing all map/unmaps through qemu this way, this
> @@ -656,7 +687,21 @@ static void vfio_listener_region_add(MemoryListener *listener,
> * about changes.
> */
> if (memory_region_has_ram_discard_manager(section->mr)) {
> - vfio_register_ram_discard_listener(bcontainer, section);
> + /*
> + * If remap, then VFIO_DMA_UNMAP_FLAG_VADDR has been called, and we
> + * want to remap the vaddr. vfio_container_region_add was already
> + * called in the past, so the ram discard listener already exists.
> + * Call its populate function directly, which calls vfio_dma_map.
> + */
> + if (remap) {
> + VFIORamDiscardListener *vrdl =
> + vfio_find_ram_discard_listener(bcontainer, section);
> + if (vrdl->listener.notify_populate(&vrdl->listener, section)) {
> + error_report("listener.notify_populate failed");
> + }
> + } else {
> + vfio_register_ram_discard_listener(bcontainer, section);
> + }
Same comment here.
> return;
> }
>
> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
> index f3a31d1..3139de1 100644
> --- a/hw/vfio/cpr-legacy.c
> +++ b/hw/vfio/cpr-legacy.c
> @@ -26,9 +26,18 @@ static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
> error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
> return false;
> }
> + container->vaddr_unmapped = true;
> return true;
> }
>
> +static void vfio_region_remap(MemoryListener *listener,
> + MemoryRegionSection *section)
> +{
> + VFIOContainer *container = container_of(listener, VFIOContainer,
> + remap_listener);
> + vfio_container_region_add(&container->bcontainer, section, true);
> +}
> +
> static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
> {
> if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
> @@ -88,6 +97,37 @@ static const VMStateDescription vfio_container_vmstate = {
> }
> };
>
> +static int vfio_cpr_fail_notifier(NotifierWithReturn *notifier,
> + MigrationEvent *e, Error **errp)
> +{
> + VFIOContainer *container =
> + container_of(notifier, VFIOContainer, cpr_transfer_notifier);
> + VFIOContainerBase *bcontainer = &container->bcontainer;
> +
> + if (e->type != MIG_EVENT_PRECOPY_FAILED) {
> + return 0;
> + }
> +
> + if (container->vaddr_unmapped) {
> + /*
> + * Force a call to vfio_region_remap for each mapped section by
> + * temporarily registering a listener, which calls vfio_dma_map
> + * further down the stack. Set reused so vfio_dma_map restores vaddr.
> + */
> + container->reused = true;
> + container->remap_listener = (MemoryListener) {
> + .name = "vfio recover",
> + .region_add = vfio_region_remap
> + };
> + memory_listener_register(&container->remap_listener,
> + bcontainer->space->as);
> + memory_listener_unregister(&container->remap_listener);
> + container->reused = false;
> + container->vaddr_unmapped = false;
> + }
> + return 0;
> +}
> +
> bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
> {
> VFIOContainerBase *bcontainer = &container->bcontainer;
> @@ -104,6 +144,9 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
>
> vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>
> + migration_add_notifier_mode(&container->cpr_transfer_notifier,
> + vfio_cpr_fail_notifier,
> + MIG_MODE_CPR_TRANSFER);
> return true;
> }
>
> @@ -114,4 +157,5 @@ void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
> vfio_cpr_unregister_container(bcontainer);
> migrate_del_blocker(&container->cpr_blocker);
> vmstate_unregister(NULL, &vfio_container_vmstate, container);
> + migration_remove_notifier(&container->cpr_transfer_notifier);
> }
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 1e974e0..8a4a658 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -86,6 +86,9 @@ typedef struct VFIOContainer {
> unsigned iommu_type;
> Error *cpr_blocker;
> bool reused;
> + bool vaddr_unmapped;
> + NotifierWithReturn cpr_transfer_notifier;
> + MemoryListener remap_listener;
There are 5 attributes related to CPR, please add a CPR struct to hold
them all.
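Something along these lines, as a sketch only. The type and member names (DemoContainerCPR and friends) are hypothetical, and the stand-in types exist just so the fragment compiles alone; the five members mirror cpr_blocker, reused, vaddr_unmapped, cpr_transfer_notifier and remap_listener.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins so the sketch compiles alone; QEMU has its own definitions. */
typedef struct DemoError DemoError;
typedef struct DemoNotifier { int unused; } DemoNotifier;
typedef struct DemoListener { const char *name; } DemoListener;

/* Hypothetical grouping of the CPR-related container fields. */
typedef struct DemoContainerCPR {
    DemoError *blocker;
    bool reused;
    bool vaddr_unmapped;
    DemoNotifier transfer_notifier;
    DemoListener remap_listener;
} DemoContainerCPR;

typedef struct DemoContainer {
    int fd;
    unsigned iommu_type;
    DemoContainerCPR cpr;    /* replaces five separate members */
} DemoContainer;

/* Example accessor: recovery is needed only if vaddrs were unmapped. */
static bool demo_needs_vaddr_recovery(const DemoContainer *c)
{
    return c->cpr.vaddr_unmapped;
}
```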
Thanks,
C.
> QLIST_HEAD(, VFIOGroup) group_list;
> } VFIOContainer;
>
> @@ -311,7 +314,8 @@ int vfio_devices_query_dirty_bitmap(const VFIOContainerBase *bcontainer,
> int vfio_get_dirty_bitmap(const VFIOContainerBase *bcontainer, uint64_t iova,
> uint64_t size, ram_addr_t ram_addr, Error **errp);
>
> -void vfio_listener_register(VFIOContainerBase *bcontainer);
> +void vfio_container_region_add(VFIOContainerBase *bcontainer,
> + MemoryRegionSection *section, bool remap);
>
> /* Returns 0 on success, or a negative errno. */
> bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp);
* Re: [PATCH V1 08/26] pci: skip reset during cpr
2025-01-29 14:43 ` [PATCH V1 08/26] pci: skip reset during cpr Steve Sistare
@ 2025-02-04 14:14 ` Cédric Le Goater
2025-02-04 16:13 ` Steven Sistare
0 siblings, 1 reply; 64+ messages in thread
From: Cédric Le Goater @ 2025-02-04 14:14 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 1/29/25 15:43, Steve Sistare wrote:
> Do not reset a vfio-pci device during CPR.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> hw/pci/pci.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 2afa423..16b4f71 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -32,6 +32,7 @@
> #include "hw/pci/pci_host.h"
> #include "hw/qdev-properties.h"
> #include "hw/qdev-properties-system.h"
> +#include "migration/misc.h"
> #include "migration/qemu-file-types.h"
> #include "migration/vmstate.h"
> #include "net/net.h"
> @@ -459,6 +460,18 @@ static void pci_reset_regions(PCIDevice *dev)
>
> static void pci_do_device_reset(PCIDevice *dev)
> {
> + /*
> + * A PCI device that is resuming for cpr is already configured, so do
> + * not reset it here when we are called from qemu_system_reset prior to
> + * cpr load, else interrupts may be lost for vfio-pci devices. It is
> + * safe to skip this reset for all PCI devices, because cpr load will set
> + * all fields that would have been set here.
> + */
> + MigMode mode = migrate_mode();
> + if (mode == MIG_MODE_CPR_TRANSFER) {
> + return;
> + }
Please use cpr_needed_for_reuse(). Or an appropriate helper.
I would put the test in pci_device_reset() and avoid calling
pci_do_device_reset().
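Roughly the following shape, as a sketch with hypothetical names. demo_cpr_reuse_active stands in for whatever predicate is chosen (cpr_needed_for_reuse() or another helper), and the demo_* functions only model the caller/callee relationship, not the real reset sequence.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct DemoPCIDevice {
    int reset_count;
} DemoPCIDevice;

/* Stand-in for the "resuming for cpr" predicate; an assumption. */
static bool demo_cpr_reuse_active;

/* Models the inner routine, which stays free of any CPR knowledge. */
static void demo_do_device_reset(DemoPCIDevice *dev)
{
    dev->reset_count++;
}

/* Models the outer routine: the CPR test lives here. */
static void demo_device_reset(DemoPCIDevice *dev)
{
    if (demo_cpr_reuse_active) {
        return;     /* device is already configured, skip the reset */
    }
    demo_do_device_reset(dev);
}
```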
Thanks,
C.
> pci_device_deassert_intx(dev);
> assert(dev->irq_state == 0);
>
* Re: [PATCH V1 10/26] vfio-pci: refactor for cpr
2025-01-29 14:43 ` [PATCH V1 10/26] vfio-pci: refactor for cpr Steve Sistare
@ 2025-02-04 14:39 ` Cédric Le Goater
2025-02-04 16:14 ` Steven Sistare
0 siblings, 1 reply; 64+ messages in thread
From: Cédric Le Goater @ 2025-02-04 14:39 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 1/29/25 15:43, Steve Sistare wrote:
> Refactor vector use into a helper vfio_vector_init.
> Add vfio_notifier_init and vfio_notifier_cleanup for named notifiers,
> and pass additional arguments to vfio_remove_kvm_msi_virq.
>
> All for use by CPR in a subsequent patch. No functional change.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> hw/vfio/pci.c | 106 +++++++++++++++++++++++++++++++++++++---------------------
> 1 file changed, 68 insertions(+), 38 deletions(-)
>
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index ab17a98..24ebd69 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -54,6 +54,32 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
> static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
> static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
>
> +/* Create new or reuse existing eventfd */
> +static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
> + const char *name, int nr)
> +{
> + int fd = -1; /* placeholder until a subsequent patch */
> + int ret = 0;
> +
> + if (fd >= 0) {
> + event_notifier_init_fd(e, fd);
Could you please first introduce the vfio_notifier_init() routine,
which can be merged quickly, and then, in a subsequent patch, modify
vfio_notifier_init() for CPR support?
> + } else {
> + ret = event_notifier_init(e, 0);
> + if (ret) {
> + Error *err = NULL;
> + error_setg_errno(&err, -ret, "vfio_notifier_init %s failed", name);
I don't think "name" is useful if the caller calls error_prepend() to
extend the error message.
> + error_report_err(err);
Nope. We should propagate the error with an 'Error **' parameter and return
bool.
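As a sketch of that convention, using a simplified stand-in for Error (DemoError and both demo_* functions are hypothetical, and backend_ret models the 0 or -errno result of creating the eventfd): the helper only records the failure, and the caller that knows the context adds it, which is why the helper does not need a name argument.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for QEMU's Error object; hypothetical and much simpler. */
typedef struct DemoError {
    int code;               /* positive errno value, 0 if unset */
    const char *context;    /* filled in by the caller that knows it */
} DemoError;

/* Returns true on success; on failure fills *errp and returns false. */
static bool demo_notifier_init(int backend_ret, DemoError *errp)
{
    if (backend_ret) {
        if (errp) {
            errp->code = -backend_ret;
        }
        return false;
    }
    return true;
}

/* The caller adds context, in the spirit of error_prepend(). */
static bool demo_intx_enable(int backend_ret, DemoError *errp)
{
    if (!demo_notifier_init(backend_ret, errp)) {
        if (errp) {
            errp->context = "intx-interrupt";
        }
        return false;
    }
    return true;
}
```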
> + }
> + }
> + return ret;
> +}
> +
> +static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
> + const char *name, int nr)
That's a lot of unused parameters, which should be introduced when required.
> +{
> + event_notifier_cleanup(e);
> +}
> +
> /*
> * Disabling BAR mmaping can be slow, but toggling it around INTx can
> * also be a huge overhead. We try to get the best of both worlds by
> @@ -134,8 +160,8 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
> pci_irq_deassert(&vdev->pdev);
>
> /* Get an eventfd for resample/unmask */
> - if (event_notifier_init(&vdev->intx.unmask, 0)) {
> - error_setg(errp, "event_notifier_init failed eoi");
> + if (vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0)) {
> + error_setg(errp, "vfio_notifier_init intx-unmask failed");
> goto fail;
> }
>
> @@ -167,7 +193,7 @@ fail_vfio:
> kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vdev->intx.interrupt,
> vdev->intx.route.irq);
> fail_irqfd:
> - event_notifier_cleanup(&vdev->intx.unmask);
> + vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
> fail:
> qemu_set_fd_handler(irq_fd, vfio_intx_interrupt, NULL, vdev);
> vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
> @@ -199,7 +225,7 @@ static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
> }
>
> /* We only need to close the eventfd for VFIO to cleanup the kernel side */
> - event_notifier_cleanup(&vdev->intx.unmask);
> + vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
>
> /* QEMU starts listening for interrupt events. */
> qemu_set_fd_handler(event_notifier_get_fd(&vdev->intx.interrupt),
> @@ -266,7 +292,6 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
> uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
> Error *err = NULL;
> int32_t fd;
> - int ret;
>
>
> if (!pin) {
> @@ -289,9 +314,7 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
> }
> #endif
>
> - ret = event_notifier_init(&vdev->intx.interrupt, 0);
> - if (ret) {
> - error_setg_errno(errp, -ret, "event_notifier_init failed");
> + if (vfio_notifier_init(vdev, &vdev->intx.interrupt, "intx-interrupt", 0)) {
> return false;
> }
> fd = event_notifier_get_fd(&vdev->intx.interrupt);
> @@ -300,7 +323,7 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
> if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
> VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
> qemu_set_fd_handler(fd, NULL, NULL, vdev);
> - event_notifier_cleanup(&vdev->intx.interrupt);
> + vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
> return false;
> }
>
> @@ -327,7 +350,7 @@ static void vfio_intx_disable(VFIOPCIDevice *vdev)
>
> fd = event_notifier_get_fd(&vdev->intx.interrupt);
> qemu_set_fd_handler(fd, NULL, NULL, vdev);
> - event_notifier_cleanup(&vdev->intx.interrupt);
> + vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
>
> vdev->interrupt = VFIO_INT_NONE;
>
> @@ -471,13 +494,15 @@ static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
> vector_n, &vdev->pdev);
> }
>
> -static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector)
> +static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector, int nr)
This change belongs to another patch.
> {
> + const char *name = "kvm_interrupt";
> +
> if (vector->virq < 0) {
> return;
> }
>
> - if (event_notifier_init(&vector->kvm_interrupt, 0)) {
> + if (vfio_notifier_init(vector->vdev, &vector->kvm_interrupt, name, nr)) {
> goto fail_notifier;
> }
>
> @@ -489,19 +514,20 @@ static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector)
> return;
>
> fail_kvm:
> - event_notifier_cleanup(&vector->kvm_interrupt);
> + vfio_notifier_cleanup(vector->vdev, &vector->kvm_interrupt, name, nr);
> fail_notifier:
> kvm_irqchip_release_virq(kvm_state, vector->virq);
> vector->virq = -1;
> }
>
> -static void vfio_remove_kvm_msi_virq(VFIOMSIVector *vector)
> +static void vfio_remove_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
> + int nr)
> {
> kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
> vector->virq);
> kvm_irqchip_release_virq(kvm_state, vector->virq);
> vector->virq = -1;
> - event_notifier_cleanup(&vector->kvm_interrupt);
> + vfio_notifier_cleanup(vdev, &vector->kvm_interrupt, "kvm_interrupt", nr);
> }
>
> static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
> @@ -511,6 +537,20 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
> kvm_irqchip_commit_routes(kvm_state);
> }
>
> +static void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
> +{
> + VFIOMSIVector *vector = &vdev->msi_vectors[nr];
> + PCIDevice *pdev = &vdev->pdev;
> +
> + vector->vdev = vdev;
> + vector->virq = -1;
> + vfio_notifier_init(vdev, &vector->interrupt, "interrupt", nr);
> + vector->use = true;
> + if (vdev->interrupt == VFIO_INT_MSIX) {
> + msix_vector_use(pdev, nr);
> + }
> +}
This change belongs to another patch.
Thanks,
C.
> static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
> MSIMessage *msg, IOHandler *handler)
> {
> @@ -524,13 +564,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
> vector = &vdev->msi_vectors[nr];
>
> if (!vector->use) {
> - vector->vdev = vdev;
> - vector->virq = -1;
> - if (event_notifier_init(&vector->interrupt, 0)) {
> - error_report("vfio: Error: event_notifier_init failed");
> - }
> - vector->use = true;
> - msix_vector_use(pdev, nr);
> + vfio_vector_init(vdev, nr);
> }
>
> qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
> @@ -542,7 +576,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
> */
> if (vector->virq >= 0) {
> if (!msg) {
> - vfio_remove_kvm_msi_virq(vector);
> + vfio_remove_kvm_msi_virq(vdev, vector, nr);
> } else {
> vfio_update_kvm_msi_virq(vector, *msg, pdev);
> }
> @@ -554,7 +588,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
> vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
> vfio_add_kvm_msi_virq(vdev, vector, nr, true);
> kvm_irqchip_commit_route_changes(&vfio_route_change);
> - vfio_connect_kvm_msi_virq(vector);
> + vfio_connect_kvm_msi_virq(vector, nr);
> }
> }
> }
> @@ -661,7 +695,7 @@ static void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
> kvm_irqchip_commit_route_changes(&vfio_route_change);
>
> for (i = 0; i < vdev->nr_vectors; i++) {
> - vfio_connect_kvm_msi_virq(&vdev->msi_vectors[i]);
> + vfio_connect_kvm_msi_virq(&vdev->msi_vectors[i], i);
> }
> }
>
> @@ -741,9 +775,7 @@ retry:
> vector->virq = -1;
> vector->use = true;
>
> - if (event_notifier_init(&vector->interrupt, 0)) {
> - error_report("vfio: Error: event_notifier_init failed");
> - }
> + vfio_notifier_init(vdev, &vector->interrupt, "interrupt", i);
>
> qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
> vfio_msi_interrupt, NULL, vector);
> @@ -797,11 +829,11 @@ static void vfio_msi_disable_common(VFIOPCIDevice *vdev)
> VFIOMSIVector *vector = &vdev->msi_vectors[i];
> if (vdev->msi_vectors[i].use) {
> if (vector->virq >= 0) {
> - vfio_remove_kvm_msi_virq(vector);
> + vfio_remove_kvm_msi_virq(vdev, vector, i);
> }
> qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
> NULL, NULL, NULL);
> - event_notifier_cleanup(&vector->interrupt);
> + vfio_notifier_cleanup(vdev, &vector->interrupt, "interrupt", i);
> }
> }
>
> @@ -2854,8 +2886,7 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
> return;
> }
>
> - if (event_notifier_init(&vdev->err_notifier, 0)) {
> - error_report("vfio: Unable to init event notifier for error detection");
> + if (vfio_notifier_init(vdev, &vdev->err_notifier, "err_notifier", 0)) {
> vdev->pci_aer = false;
> return;
> }
> @@ -2867,7 +2898,7 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
> VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
> error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
> qemu_set_fd_handler(fd, NULL, NULL, vdev);
> - event_notifier_cleanup(&vdev->err_notifier);
> + vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
> vdev->pci_aer = false;
> }
> }
> @@ -2886,7 +2917,7 @@ static void vfio_unregister_err_notifier(VFIOPCIDevice *vdev)
> }
> qemu_set_fd_handler(event_notifier_get_fd(&vdev->err_notifier),
> NULL, NULL, vdev);
> - event_notifier_cleanup(&vdev->err_notifier);
> + vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
> }
>
> static void vfio_req_notifier_handler(void *opaque)
> @@ -2920,8 +2951,7 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
> return;
> }
>
> - if (event_notifier_init(&vdev->req_notifier, 0)) {
> - error_report("vfio: Unable to init event notifier for device request");
> + if (vfio_notifier_init(vdev, &vdev->req_notifier, "req_notifier", 0)) {
> return;
> }
>
> @@ -2932,7 +2962,7 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
> VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
> error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
> qemu_set_fd_handler(fd, NULL, NULL, vdev);
> - event_notifier_cleanup(&vdev->req_notifier);
> + vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
> } else {
> vdev->req_enabled = true;
> }
> @@ -2952,7 +2982,7 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
> }
> qemu_set_fd_handler(event_notifier_get_fd(&vdev->req_notifier),
> NULL, NULL, vdev);
> - event_notifier_cleanup(&vdev->req_notifier);
> + vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
>
> vdev->req_enabled = false;
> }
* Re: [PATCH V1 11/26] vfio-pci: skip reset during cpr
2025-01-29 14:43 ` [PATCH V1 11/26] vfio-pci: skip reset during cpr Steve Sistare
@ 2025-02-04 14:56 ` Cédric Le Goater
2025-02-04 16:15 ` Steven Sistare
0 siblings, 1 reply; 64+ messages in thread
From: Cédric Le Goater @ 2025-02-04 14:56 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 1/29/25 15:43, Steve Sistare wrote:
> Do not reset a vfio-pci device during CPR, and do not complain if the
> kernel's PCI config space changes for non-emulated bits between the
> vmstate save and load, which can happen due to ongoing interrupt activity.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> hw/vfio/pci.c | 37 +++++++++++++++++++++++++++++++++++++
> 1 file changed, 37 insertions(+)
>
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 24ebd69..fa77c36 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -29,6 +29,8 @@
> #include "hw/pci/pci_bridge.h"
> #include "hw/qdev-properties.h"
> #include "hw/qdev-properties-system.h"
> +#include "migration/misc.h"
> +#include "migration/cpr.h"
> #include "migration/vmstate.h"
> #include "qapi/qmp/qdict.h"
> #include "qemu/error-report.h"
> @@ -3324,6 +3326,11 @@ static void vfio_pci_reset(DeviceState *dev)
> {
> VFIOPCIDevice *vdev = VFIO_PCI(dev);
>
> + /* Do not reset the device during qemu_system_reset prior to cpr load */
> + if (vdev->vbasedev.reused) {
> + return;
> + }
> +
Sometimes we use:
MigMode mode = migrate_mode();
if (mode == MIG_MODE_CPR_TRANSFER) {
return;
}
Why is this different? This is confusing.
Thanks,
C.
> trace_vfio_pci_reset(vdev->vbasedev.name);
>
> vfio_pci_pre_reset(vdev);
> @@ -3447,6 +3454,35 @@ static void vfio_pci_set_fd(Object *obj, const char *str, Error **errp)
> }
> #endif
>
> +/*
> + * The kernel may change non-emulated config bits. Exclude them from the
> + * changed-bits check in get_pci_config_device.
> + */
> +static int vfio_pci_pre_load(void *opaque)
> +{
> + VFIOPCIDevice *vdev = opaque;
> + PCIDevice *pdev = &vdev->pdev;
> + int size = MIN(pci_config_size(pdev), vdev->config_size);
> + int i;
> +
> + for (i = 0; i < size; i++) {
> + pdev->cmask[i] &= vdev->emulated_config_bits[i];
> + }
> +
> + return 0;
> +}
> +
> +static const VMStateDescription vfio_pci_vmstate = {
> + .name = "vfio-pci",
> + .version_id = 0,
> + .minimum_version_id = 0,
> + .pre_load = vfio_pci_pre_load,
> + .needed = cpr_needed_for_reuse,
> + .fields = (VMStateField[]) {
> + VMSTATE_END_OF_LIST()
> + }
> +};
> +
> static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
> {
> DeviceClass *dc = DEVICE_CLASS(klass);
> @@ -3457,6 +3493,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
> #ifdef CONFIG_IOMMUFD
> object_class_property_add_str(klass, "fd", NULL, vfio_pci_set_fd);
> #endif
> + dc->vmsd = &vfio_pci_vmstate;
> dc->desc = "VFIO-based PCI device assignment";
> set_bit(DEVICE_CATEGORY_MISC, dc->categories);
> pdc->realize = vfio_realize;
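
To illustrate the pre_load masking in the quoted patch, here is a minimal standalone sketch. It is not QEMU code. It assumes the load-time check flags a byte when the saved and current values differ in any bit selected by the compare mask, and it ignores the write masks that the real check may also consult. The names `exclude_non_emulated` and `config_changed` are hypothetical and model the idea with plain arrays.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: clear compare-mask bits the device does not emulate. */
static void exclude_non_emulated(uint8_t *cmask, const uint8_t *emulated,
                                 size_t size)
{
    for (size_t i = 0; i < size; i++) {
        cmask[i] &= emulated[i];
    }
}

/* Simplified changed-bits check: true if any compared bit differs. */
static bool config_changed(const uint8_t *saved, const uint8_t *current,
                           const uint8_t *cmask, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        if ((saved[i] ^ current[i]) & cmask[i]) {
            return true;
        }
    }
    return false;
}
```

With the mask narrowed this way, a bit that the kernel changed outside the emulated set no longer counts as a mismatch, while a change to an emulated bit is still detected.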
* Re: [PATCH V1 16/26] vfio: return mr from vfio_get_xlat_addr
2025-01-29 14:43 ` [PATCH V1 16/26] vfio: return mr from vfio_get_xlat_addr Steve Sistare
@ 2025-02-04 15:47 ` Cédric Le Goater
2025-02-04 17:42 ` Steven Sistare
0 siblings, 1 reply; 64+ messages in thread
From: Cédric Le Goater @ 2025-02-04 15:47 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas,
John Levon
+ John (for vfio-user)
On 1/29/25 15:43, Steve Sistare wrote:
> Return the memory region that the translated address is found in, for
> use in a subsequent patch. No functional change.
Keeping a reference to this memory region could be risky. What is it for?
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> hw/vfio/common.c | 9 ++++++---
> hw/virtio/vhost-vdpa.c | 2 +-
> include/exec/memory.h | 5 ++++-
> system/memory.c | 8 +++++++-
> 4 files changed, 18 insertions(+), 6 deletions(-)
>
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index db0498e..4bbc29f 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -248,12 +248,13 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> /* Called with rcu_read_lock held. */
> static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> ram_addr_t *ram_addr, bool *read_only,
> + MemoryRegion **mr_p,
> Error **errp)
> {
> bool ret, mr_has_discard_manager;
>
> ret = memory_get_xlat_addr(iotlb, vaddr, ram_addr, read_only,
> - &mr_has_discard_manager, errp);
> + &mr_has_discard_manager, mr_p, errp);
> if (ret && mr_has_discard_manager) {
> /*
> * Malicious VMs might trigger discarding of IOMMU-mapped memory. The
> @@ -300,7 +301,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> bool read_only;
>
> - if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, &local_err)) {
> + if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, NULL,
> + &local_err)) {
> error_report_err(local_err);
> goto out;
> }
> @@ -1279,7 +1281,8 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> }
>
> rcu_read_lock();
> - if (!vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL, &local_err)) {
> + if (!vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL, NULL,
> + &local_err)) {
> error_report_err(local_err);
> goto out_unlock;
> }
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 3cdaa12..a1866bb 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -228,7 +228,7 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
> if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> bool read_only;
>
> - if (!memory_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, NULL,
> + if (!memory_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, NULL, NULL,
> &local_err)) {
> error_report_err(local_err);
> return;
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index ea5d33a..a2f1229 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -747,13 +747,16 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
> * @read_only: indicates if writes are allowed
> * @mr_has_discard_manager: indicates memory is controlled by a
> * RamDiscardManager
> + * @mr_p: return the MemoryRegion containing the @iotlb translated addr
> * @errp: pointer to Error*, to store an error if it happens.
> *
> * Return: true on success, else false setting @errp with error.
> */
> bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> ram_addr_t *ram_addr, bool *read_only,
> - bool *mr_has_discard_manager, Error **errp);
> + bool *mr_has_discard_manager,
> + MemoryRegion **mr_p,
There is a risk that the life cycle of the returned MemoryRegion
doesn't match VFIO expectations.
Also, it seems that memory_get_xlat_addr() has reached a point
where the callers need refactoring. 'mr_p' would be the 5th out
parameter and 3 of these already depend on the MemoryRegion
returned by flatview_translate().
Thanks,
C.
> + Error **errp);
>
> typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
> diff --git a/system/memory.c b/system/memory.c
> index 4c82979..4ec2b8f 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -2185,7 +2185,9 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
> /* Called with rcu_read_lock held. */
> bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> ram_addr_t *ram_addr, bool *read_only,
> - bool *mr_has_discard_manager, Error **errp)
> + bool *mr_has_discard_manager,
> + MemoryRegion **mr_p,
> + Error **errp)
> {
> MemoryRegion *mr;
> hwaddr xlat;
> @@ -2250,6 +2252,10 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> *read_only = !writable || mr->readonly;
> }
>
> + if (mr_p) {
> + *mr_p = mr;
> + }
> +
> return true;
> }
>
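
The refactoring concern raised in this review can be pictured with a small standalone sketch, which is not the QEMU API. It assumes a hypothetical `XlatResult` struct that bundles the values currently returned through separate out parameters, so that exposing the memory region would mean one more field rather than a new argument at every call site. `Region`, `XlatResult`, and `xlat_lookup` are illustrative names only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a memory region; not QEMU's MemoryRegion. */
typedef struct Region {
    const char *name;
    uint8_t *host_base;
    uint64_t ram_base;
    bool readonly;
    bool has_discard_manager;
} Region;

/* Hypothetical result struct replacing several out parameters. */
typedef struct XlatResult {
    void *vaddr;
    uint64_t ram_addr;
    bool read_only;
    bool has_discard_manager;
    const Region *region;
} XlatResult;

/* Fill every field from the region in one place. Returns false on failure. */
static bool xlat_lookup(const Region *r, uint64_t offset, bool writable,
                        XlatResult *out)
{
    if (!r || !out) {
        return false;
    }
    out->vaddr = r->host_base + offset;
    out->ram_addr = r->ram_base + offset;
    out->read_only = !writable || r->readonly;
    out->has_discard_manager = r->has_discard_manager;
    out->region = r;
    return true;
}
```

The sketch does not address the lifetime concern: a caller that keeps `region` beyond the lookup would still need to hold its own reference.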
* Re: [PATCH V1 07/26] vfio/container: recover from unmap-all-vaddr failure
2025-02-04 14:10 ` Cédric Le Goater
@ 2025-02-04 16:13 ` Steven Sistare
0 siblings, 0 replies; 64+ messages in thread
From: Steven Sistare @ 2025-02-04 16:13 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 2/4/2025 9:10 AM, Cédric Le Goater wrote:
> On 1/29/25 15:43, Steve Sistare wrote:
>> If there are multiple containers and unmap-all fails for some container, we
>> need to remap vaddr for the other containers for which unmap-all succeeded.
>> Recover by walking all address ranges of all containers to restore the vaddr
>> for each. Do so by invoking the vfio listener callback, and passing a new
>> "remap" flag that tells it to restore a mapping without re-allocating new
>> userland data structures.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/common.c | 47 ++++++++++++++++++++++++++++++++++++++++++-
>> hw/vfio/cpr-legacy.c | 44 ++++++++++++++++++++++++++++++++++++++++
>> include/hw/vfio/vfio-common.h | 6 +++++-
>> 3 files changed, 95 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 7370332..c8ee71a 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -580,6 +580,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
>> {
>> VFIOContainerBase *bcontainer = container_of(listener, VFIOContainerBase,
>> listener);
>> + vfio_container_region_add(bcontainer, section, false);
>> +}
>> +
>> +void vfio_container_region_add(VFIOContainerBase *bcontainer,
>> + MemoryRegionSection *section,
>> + bool remap)
>> +{
>
> vfio_container_region_add() is already complex enough. Please consider
> doing an initial refactoring before adding a new code path. It would be
> welcome !
I'll take a look after factoring out the cpr code into helpers as you
request below.
>> hwaddr iova, end;
>> Int128 llend, llsize;
>> void *vaddr;
>> @@ -614,6 +621,30 @@ static void vfio_listener_region_add(MemoryListener *listener,
>> int iommu_idx;
>> trace_vfio_listener_region_add_iommu(section->mr->name, iova, end);
>> +
>> + /*
>> + * If remap, then VFIO_DMA_UNMAP_FLAG_VADDR has been called, and we
>> + * want to remap the vaddr. vfio_container_region_add was already
>> + * called in the past, so the giommu already exists. Find it and
>> + * replay it, which calls vfio_dma_map further down the stack.
>> + */
>> +
>> + if (remap) {
>> + hwaddr as_offset = section->offset_within_address_space;
>> + hwaddr iommu_offset = as_offset - section->offset_within_region;
>> +
>> + QLIST_FOREACH(giommu, &bcontainer->giommu_list, giommu_next) {
>> + if (giommu->iommu_mr == iommu_mr &&
>> + giommu->iommu_offset == iommu_offset) {
>> + memory_region_iommu_replay(giommu->iommu_mr, &giommu->n);
>> + return;
>> + }
>> + }
>> + error_report("Container cannot find iommu region %s offset %lx",
>> + memory_region_name(section->mr), iommu_offset);
>
> error_report() calls are not welcome. We need to find a way to propagate
> this error.
I follow the existing practice in this function, which already reports other
errors. This is called in the context of a memory listener region_add method,
so returning an error affects the signature of all listeners and should be a
separate RFE.
>> + goto fail;
>> + }
>
> Please introduce a vfio_cpr helper for the section above and move it
> under the hw/vfio/cpr* files.
OK.
>> /*
>> * FIXME: For VFIO iommu types which have KVM acceleration to
>> * avoid bouncing all map/unmaps through qemu this way, this
>> @@ -656,7 +687,21 @@ static void vfio_listener_region_add(MemoryListener *listener,
>> * about changes.
>> */
>> if (memory_region_has_ram_discard_manager(section->mr)) {
>> - vfio_register_ram_discard_listener(bcontainer, section);
>> + /*
>> + * If remap, then VFIO_DMA_UNMAP_FLAG_VADDR has been called, and we
>> + * want to remap the vaddr. vfio_container_region_add was already
>> + * called in the past, so the ram discard listener already exists.
>> + * Call its populate function directly, which calls vfio_dma_map.
>> + */
>> + if (remap) {
>> + VFIORamDiscardListener *vrdl =
>> + vfio_find_ram_discard_listener(bcontainer, section);
>> + if (vrdl->listener.notify_populate(&vrdl->listener, section)) {
>> + error_report("listener.notify_populate failed");
>> + }
>> + } else {
>> + vfio_register_ram_discard_listener(bcontainer, section);
>> + }
>
> idem.
OK.
>> return;
>> }
>> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
>> index f3a31d1..3139de1 100644
>> --- a/hw/vfio/cpr-legacy.c
>> +++ b/hw/vfio/cpr-legacy.c
>> @@ -26,9 +26,18 @@ static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>> error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
>> return false;
>> }
>> + container->vaddr_unmapped = true;
>> return true;
>> }
>> +static void vfio_region_remap(MemoryListener *listener,
>> + MemoryRegionSection *section)
>> +{
>> + VFIOContainer *container = container_of(listener, VFIOContainer,
>> + remap_listener);
>> + vfio_container_region_add(&container->bcontainer, section, true);
>> +}
>> +
>> static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>> {
>> if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
>> @@ -88,6 +97,37 @@ static const VMStateDescription vfio_container_vmstate = {
>> }
>> };
>> +static int vfio_cpr_fail_notifier(NotifierWithReturn *notifier,
>> + MigrationEvent *e, Error **errp)
>> +{
>> + VFIOContainer *container =
>> + container_of(notifier, VFIOContainer, cpr_transfer_notifier);
>> + VFIOContainerBase *bcontainer = &container->bcontainer;
>> +
>> + if (e->type != MIG_EVENT_PRECOPY_FAILED) {
>> + return 0;
>> + }
>> +
>> + if (container->vaddr_unmapped) {
>> + /*
>> + * Force a call to vfio_region_remap for each mapped section by
>> + * temporarily registering a listener, which calls vfio_dma_map
>> + * further down the stack. Set reused so vfio_dma_map restores vaddr.
>> + */
>> + container->reused = true;
>> + container->remap_listener = (MemoryListener) {
>> + .name = "vfio recover",
>> + .region_add = vfio_region_remap
>> + };
>> + memory_listener_register(&container->remap_listener,
>> + bcontainer->space->as);
>> + memory_listener_unregister(&container->remap_listener);
>> + container->reused = false;
>> + container->vaddr_unmapped = false;
>> + }
>> + return 0;
>> +}
>> +
>> bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
>> {
>> VFIOContainerBase *bcontainer = &container->bcontainer;
>> @@ -104,6 +144,9 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
>> vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>> + migration_add_notifier_mode(&container->cpr_transfer_notifier,
>> + vfio_cpr_fail_notifier,
>> + MIG_MODE_CPR_TRANSFER);
>> return true;
>> }
>> @@ -114,4 +157,5 @@ void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
>> vfio_cpr_unregister_container(bcontainer);
>> migrate_del_blocker(&container->cpr_blocker);
>> vmstate_unregister(NULL, &vfio_container_vmstate, container);
>> + migration_remove_notifier(&container->cpr_transfer_notifier);
>> }
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 1e974e0..8a4a658 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -86,6 +86,9 @@ typedef struct VFIOContainer {
>> unsigned iommu_type;
>> Error *cpr_blocker;
>> bool reused;
>> + bool vaddr_unmapped;
>> + NotifierWithReturn cpr_transfer_notifier;
>> + MemoryListener remap_listener;
>
> There are 5 attributes related to CPR, please add a CPR struct to hold
> them all.
OK.
- Steve
>> QLIST_HEAD(, VFIOGroup) group_list;
>> } VFIOContainer;
>> @@ -311,7 +314,8 @@ int vfio_devices_query_dirty_bitmap(const VFIOContainerBase *bcontainer,
>> int vfio_get_dirty_bitmap(const VFIOContainerBase *bcontainer, uint64_t iova,
>> uint64_t size, ram_addr_t ram_addr, Error **errp);
>> -void vfio_listener_register(VFIOContainerBase *bcontainer);
>> +void vfio_container_region_add(VFIOContainerBase *bcontainer,
>> + MemoryRegionSection *section, bool remap);
>> /* Returns 0 on success, or a negative errno. */
>> bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp);
>
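
The recovery scheme described in the quoted commit message can be modelled in a few lines. This is a standalone sketch under simplifying assumptions, not the patch's implementation: `Container`, `unmap_all`, and `recover` are hypothetical, and a counter stands in for the vaddr remap that the real code performs through the listener callback.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a container; only the fields the sketch needs. */
typedef struct Container {
    bool unmap_fails;     /* simulate a failing unmap-all */
    bool vaddr_unmapped;  /* set once unmap-all succeeded */
    int remap_count;      /* how many times vaddr was restored */
} Container;

/* Try to unmap every container; stop at the first failure. */
static bool unmap_all(Container *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (c[i].unmap_fails) {
            return false;
        }
        c[i].vaddr_unmapped = true;
    }
    return true;
}

/* On failure, restore vaddr only for containers that were unmapped. */
static void recover(Container *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (c[i].vaddr_unmapped) {
            c[i].remap_count++;
            c[i].vaddr_unmapped = false;
        }
    }
}
```

The per-container flag is what keeps recovery from touching containers whose unmap never happened, which mirrors the role of `vaddr_unmapped` in the patch.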
* Re: [PATCH V1 08/26] pci: skip reset during cpr
2025-02-04 14:14 ` Cédric Le Goater
@ 2025-02-04 16:13 ` Steven Sistare
0 siblings, 0 replies; 64+ messages in thread
From: Steven Sistare @ 2025-02-04 16:13 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 2/4/2025 9:14 AM, Cédric Le Goater wrote:
> On 1/29/25 15:43, Steve Sistare wrote:
>> Do not reset a vfio-pci device during CPR.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/pci/pci.c | 13 +++++++++++++
>> 1 file changed, 13 insertions(+)
>>
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index 2afa423..16b4f71 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -32,6 +32,7 @@
>> #include "hw/pci/pci_host.h"
>> #include "hw/qdev-properties.h"
>> #include "hw/qdev-properties-system.h"
>> +#include "migration/misc.h"
>> #include "migration/qemu-file-types.h"
>> #include "migration/vmstate.h"
>> #include "net/net.h"
>> @@ -459,6 +460,18 @@ static void pci_reset_regions(PCIDevice *dev)
>> static void pci_do_device_reset(PCIDevice *dev)
>> {
>> + /*
>> + * A PCI device that is resuming for cpr is already configured, so do
>> + * not reset it here when we are called from qemu_system_reset prior to
>> + * cpr load, else interrupts may be lost for vfio-pci devices. It is
>> + * safe to skip this reset for all PCI devices, because cpr load will set
>> + * all fields that would have been set here.
>> + */
>> + MigMode mode = migrate_mode();
>> + if (mode == MIG_MODE_CPR_TRANSFER) {
>> + return;
>> + }
>
> Please use cpr_needed_for_reuse(). Or an appropriate helper.
OK.
> I would put the test under pci_device_reset() and avoid calling
> pci_do_device_reset().
pci_do_device_reset is also called from pcibus_reset_hold.
- Steve
>> pci_device_deassert_intx(dev);
>> assert(dev->irq_state == 0);
>
* Re: [PATCH V1 10/26] vfio-pci: refactor for cpr
2025-02-04 14:39 ` Cédric Le Goater
@ 2025-02-04 16:14 ` Steven Sistare
0 siblings, 0 replies; 64+ messages in thread
From: Steven Sistare @ 2025-02-04 16:14 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 2/4/2025 9:39 AM, Cédric Le Goater wrote:
> On 1/29/25 15:43, Steve Sistare wrote:
>> Refactor vector use into a helper vfio_vector_init.
>> Add vfio_notifier_init and vfio_notifier_cleanup for named notifiers,
>> and pass additional arguments to vfio_remove_kvm_msi_virq.
>>
>> All for use by CPR in a subsequent patch. No functional change.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/pci.c | 106 +++++++++++++++++++++++++++++++++++++---------------------
>> 1 file changed, 68 insertions(+), 38 deletions(-)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index ab17a98..24ebd69 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -54,6 +54,32 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
>> static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
>> static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
>> +/* Create new or reuse existing eventfd */
>> +static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
>> + const char *name, int nr)
>> +{
>> + int fd = -1; /* placeholder until a subsequent patch */
>> + int ret = 0;
>> +
>> + if (fd >= 0) {
>> + event_notifier_init_fd(e, fd);
>
> Could you please first introduce the vfio_notifier_init() routine,
> which can be merged quickly, and then, in a subsequent patch, modify
> vfio_notifier_init() for CPR support.
OK.
>> + } else {
>> + ret = event_notifier_init(e, 0);
>> + if (ret) {
>> + Error *err = NULL;
>> + error_setg_errno(&err, -ret, "vfio_notifier_init %s failed", name);
>
> I don't think "name" is useful if the caller calls error_prepend() to
> extend the error message.
I don't follow. The new code is strictly more informative than the old.
Some of the call sites before this patch printed generic messages such as
"event_notifier_init failed". The new code identifies the notifier that
failed.
>> + error_report_err(err);
>
> Nope. We should propagate the error with 'Error **' parameter and return
> bool.
OK. Some call sites will simply report that error, though. And some ignore
errors. I intend to be bug-for-bug compatible with the old code, and not
introduce new behaviors.
>> + }
>> + }
>> + return ret;
>> +}
>> +
>> +static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
>> + const char *name, int nr)
>
> That's a lot of unused parameters, which should be introduced when required.
OK.
>> +{
>> + event_notifier_cleanup(e);
>> +}
>> +
>> /*
>> * Disabling BAR mmaping can be slow, but toggling it around INTx can
>> * also be a huge overhead. We try to get the best of both worlds by
>> @@ -134,8 +160,8 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
>> pci_irq_deassert(&vdev->pdev);
>> /* Get an eventfd for resample/unmask */
>> - if (event_notifier_init(&vdev->intx.unmask, 0)) {
>> - error_setg(errp, "event_notifier_init failed eoi");
>> + if (vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0)) {
>> + error_setg(errp, "vfio_notifier_init intx-unmask failed");
>> goto fail;
>> }
>> @@ -167,7 +193,7 @@ fail_vfio:
>> kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vdev->intx.interrupt,
>> vdev->intx.route.irq);
>> fail_irqfd:
>> - event_notifier_cleanup(&vdev->intx.unmask);
>> + vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
>> fail:
>> qemu_set_fd_handler(irq_fd, vfio_intx_interrupt, NULL, vdev);
>> vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
>> @@ -199,7 +225,7 @@ static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
>> }
>> /* We only need to close the eventfd for VFIO to cleanup the kernel side */
>> - event_notifier_cleanup(&vdev->intx.unmask);
>> + vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
>> /* QEMU starts listening for interrupt events. */
>> qemu_set_fd_handler(event_notifier_get_fd(&vdev->intx.interrupt),
>> @@ -266,7 +292,6 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>> uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
>> Error *err = NULL;
>> int32_t fd;
>> - int ret;
>> if (!pin) {
>> @@ -289,9 +314,7 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>> }
>> #endif
>> - ret = event_notifier_init(&vdev->intx.interrupt, 0);
>> - if (ret) {
>> - error_setg_errno(errp, -ret, "event_notifier_init failed");
>> + if (vfio_notifier_init(vdev, &vdev->intx.interrupt, "intx-interrupt", 0)) {
>> return false;
>> }
>> fd = event_notifier_get_fd(&vdev->intx.interrupt);
>> @@ -300,7 +323,7 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>> if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
>> VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
>> qemu_set_fd_handler(fd, NULL, NULL, vdev);
>> - event_notifier_cleanup(&vdev->intx.interrupt);
>> + vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
>> return false;
>> }
>> @@ -327,7 +350,7 @@ static void vfio_intx_disable(VFIOPCIDevice *vdev)
>> fd = event_notifier_get_fd(&vdev->intx.interrupt);
>> qemu_set_fd_handler(fd, NULL, NULL, vdev);
>> - event_notifier_cleanup(&vdev->intx.interrupt);
>> + vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
>> vdev->interrupt = VFIO_INT_NONE;
>> @@ -471,13 +494,15 @@ static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
>> vector_n, &vdev->pdev);
>> }
>> -static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector)
>> +static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector, int nr)
>
> This change belongs to another patch.
OK. It will be tiny, though -- add nr in the declaration and at 2 call sites.
It also runs counter to your advice above, which is to not add a parameter
until it is needed.
>> {
>> + const char *name = "kvm_interrupt";
>> +
>> if (vector->virq < 0) {
>> return;
>> }
>> - if (event_notifier_init(&vector->kvm_interrupt, 0)) {
>> + if (vfio_notifier_init(vector->vdev, &vector->kvm_interrupt, name, nr)) {
>> goto fail_notifier;
>> }
>> @@ -489,19 +514,20 @@ static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector)
>> return;
>> fail_kvm:
>> - event_notifier_cleanup(&vector->kvm_interrupt);
>> + vfio_notifier_cleanup(vector->vdev, &vector->kvm_interrupt, name, nr);
>> fail_notifier:
>> kvm_irqchip_release_virq(kvm_state, vector->virq);
>> vector->virq = -1;
>> }
>> -static void vfio_remove_kvm_msi_virq(VFIOMSIVector *vector)
>> +static void vfio_remove_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
>> + int nr)
>> {
>> kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
>> vector->virq);
>> kvm_irqchip_release_virq(kvm_state, vector->virq);
>> vector->virq = -1;
>> - event_notifier_cleanup(&vector->kvm_interrupt);
>> + vfio_notifier_cleanup(vdev, &vector->kvm_interrupt, "kvm_interrupt", nr);
>> }
>> static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
>> @@ -511,6 +537,20 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
>> kvm_irqchip_commit_routes(kvm_state);
>> }
>> +static void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
>> +{
>> + VFIOMSIVector *vector = &vdev->msi_vectors[nr];
>> + PCIDevice *pdev = &vdev->pdev;
>> +
>> + vector->vdev = vdev;
>> + vector->virq = -1;
>> + vfio_notifier_init(vdev, &vector->interrupt, "interrupt", nr);
>> + vector->use = true;
>> + if (vdev->interrupt == VFIO_INT_MSIX) {
>> + msix_vector_use(pdev, nr);
>> + }
>> +}
>
> This change belongs to another patch.
OK.
- Steve
>> static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>> MSIMessage *msg, IOHandler *handler)
>> {
>> @@ -524,13 +564,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>> vector = &vdev->msi_vectors[nr];
>> if (!vector->use) {
>> - vector->vdev = vdev;
>> - vector->virq = -1;
>> - if (event_notifier_init(&vector->interrupt, 0)) {
>> - error_report("vfio: Error: event_notifier_init failed");
>> - }
>> - vector->use = true;
>> - msix_vector_use(pdev, nr);
>> + vfio_vector_init(vdev, nr);
>> }
>> qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
>> @@ -542,7 +576,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>> */
>> if (vector->virq >= 0) {
>> if (!msg) {
>> - vfio_remove_kvm_msi_virq(vector);
>> + vfio_remove_kvm_msi_virq(vdev, vector, nr);
>> } else {
>> vfio_update_kvm_msi_virq(vector, *msg, pdev);
>> }
>> @@ -554,7 +588,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>> vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
>> vfio_add_kvm_msi_virq(vdev, vector, nr, true);
>> kvm_irqchip_commit_route_changes(&vfio_route_change);
>> - vfio_connect_kvm_msi_virq(vector);
>> + vfio_connect_kvm_msi_virq(vector, nr);
>> }
>> }
>> }
>> @@ -661,7 +695,7 @@ static void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
>> kvm_irqchip_commit_route_changes(&vfio_route_change);
>> for (i = 0; i < vdev->nr_vectors; i++) {
>> - vfio_connect_kvm_msi_virq(&vdev->msi_vectors[i]);
>> + vfio_connect_kvm_msi_virq(&vdev->msi_vectors[i], i);
>> }
>> }
>> @@ -741,9 +775,7 @@ retry:
>> vector->virq = -1;
>> vector->use = true;
>> - if (event_notifier_init(&vector->interrupt, 0)) {
>> - error_report("vfio: Error: event_notifier_init failed");
>> - }
>> + vfio_notifier_init(vdev, &vector->interrupt, "interrupt", i);
>> qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
>> vfio_msi_interrupt, NULL, vector);
>> @@ -797,11 +829,11 @@ static void vfio_msi_disable_common(VFIOPCIDevice *vdev)
>> VFIOMSIVector *vector = &vdev->msi_vectors[i];
>> if (vdev->msi_vectors[i].use) {
>> if (vector->virq >= 0) {
>> - vfio_remove_kvm_msi_virq(vector);
>> + vfio_remove_kvm_msi_virq(vdev, vector, i);
>> }
>> qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
>> NULL, NULL, NULL);
>> - event_notifier_cleanup(&vector->interrupt);
>> + vfio_notifier_cleanup(vdev, &vector->interrupt, "interrupt", i);
>> }
>> }
>> @@ -2854,8 +2886,7 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
>> return;
>> }
>> - if (event_notifier_init(&vdev->err_notifier, 0)) {
>> - error_report("vfio: Unable to init event notifier for error detection");
>> + if (vfio_notifier_init(vdev, &vdev->err_notifier, "err_notifier", 0)) {
>> vdev->pci_aer = false;
>> return;
>> }
>> @@ -2867,7 +2898,7 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
>> VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
>> error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
>> qemu_set_fd_handler(fd, NULL, NULL, vdev);
>> - event_notifier_cleanup(&vdev->err_notifier);
>> + vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
>> vdev->pci_aer = false;
>> }
>> }
>> @@ -2886,7 +2917,7 @@ static void vfio_unregister_err_notifier(VFIOPCIDevice *vdev)
>> }
>> qemu_set_fd_handler(event_notifier_get_fd(&vdev->err_notifier),
>> NULL, NULL, vdev);
>> - event_notifier_cleanup(&vdev->err_notifier);
>> + vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
>> }
>> static void vfio_req_notifier_handler(void *opaque)
>> @@ -2920,8 +2951,7 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
>> return;
>> }
>> - if (event_notifier_init(&vdev->req_notifier, 0)) {
>> - error_report("vfio: Unable to init event notifier for device request");
>> + if (vfio_notifier_init(vdev, &vdev->req_notifier, "req_notifier", 0)) {
>> return;
>> }
>> @@ -2932,7 +2962,7 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
>> VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
>> error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
>> qemu_set_fd_handler(fd, NULL, NULL, vdev);
>> - event_notifier_cleanup(&vdev->req_notifier);
>> + vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
>> } else {
>> vdev->req_enabled = true;
>> }
>> @@ -2952,7 +2982,7 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>> }
>> qemu_set_fd_handler(event_notifier_get_fd(&vdev->req_notifier),
>> NULL, NULL, vdev);
>> - event_notifier_cleanup(&vdev->req_notifier);
>> + vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
>> vdev->req_enabled = false;
>> }
>
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH V1 11/26] vfio-pci: skip reset during cpr
2025-02-04 14:56 ` Cédric Le Goater
@ 2025-02-04 16:15 ` Steven Sistare
0 siblings, 0 replies; 64+ messages in thread
From: Steven Sistare @ 2025-02-04 16:15 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 2/4/2025 9:56 AM, Cédric Le Goater wrote:
> On 1/29/25 15:43, Steve Sistare wrote:
>> Do not reset a vfio-pci device during CPR, and do not complain if the
>> kernel's PCI config space changes for non-emulated bits between the
>> vmstate save and load, which can happen due to ongoing interrupt activity.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/pci.c | 37 +++++++++++++++++++++++++++++++++++++
>> 1 file changed, 37 insertions(+)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 24ebd69..fa77c36 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -29,6 +29,8 @@
>> #include "hw/pci/pci_bridge.h"
>> #include "hw/qdev-properties.h"
>> #include "hw/qdev-properties-system.h"
>> +#include "migration/misc.h"
>> +#include "migration/cpr.h"
>> #include "migration/vmstate.h"
>> #include "qapi/qmp/qdict.h"
>> #include "qemu/error-report.h"
>> @@ -3324,6 +3326,11 @@ static void vfio_pci_reset(DeviceState *dev)
>> {
>> VFIOPCIDevice *vdev = VFIO_PCI(dev);
>> + /* Do not reset the device during qemu_system_reset prior to cpr load */
>> + if (vdev->vbasedev.reused) {
>> + return;
>> + }
>> +
>
> Sometimes we use:
>
> MigMode mode = migrate_mode();
> if (mode == MIG_MODE_CPR_TRANSFER) {
> return;
> }
>
> Why is this different? This is confusing.
I try to use local state -- objects' knowledge about themselves -- whenever possible.
Sometimes that is less desirable. For example, in pci_do_device_reset I test mode, rather
than add a reused field to the generic PCIDevice, because the pci code would not use the
reused field anywhere else.
- Steve
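
To make the two styles concrete, here is a minimal sketch, not code from the series. All Sketch* types and sketch_* functions are hypothetical stand-ins; only the shape of the two tests (a per-object reused flag versus a check of the migration mode) follows the discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not QEMU's real types. */
typedef enum { SKETCH_MODE_NORMAL, SKETCH_MODE_CPR_TRANSFER } SketchMigMode;

typedef struct {
    bool reused;      /* local state: device was inherited from old QEMU */
    int reset_count;
} SketchVfioDevice;

typedef struct {
    int reset_count;  /* generic device: carries no reused field */
} SketchPciDevice;

/* Local-state test, in the style of the vfio_pci_reset() hunk. */
static void sketch_vfio_reset(SketchVfioDevice *dev)
{
    assert(dev != NULL);
    if (dev->reused) {
        return;       /* already configured, leave it alone */
    }
    dev->reset_count++;
}

/* Mode test, in the style described for pci_do_device_reset. */
static void sketch_pci_reset(SketchPciDevice *dev, SketchMigMode mode)
{
    assert(dev != NULL);
    if (mode == SKETCH_MODE_CPR_TRANSFER) {
        return;
    }
    dev->reset_count++;
}
```

The first form keeps the decision with the object that owns the state; the second avoids adding a field to a generic type that would otherwise never read it.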
>> trace_vfio_pci_reset(vdev->vbasedev.name);
>> vfio_pci_pre_reset(vdev);
>> @@ -3447,6 +3454,35 @@ static void vfio_pci_set_fd(Object *obj, const char *str, Error **errp)
>> }
>> #endif
>> +/*
>> + * The kernel may change non-emulated config bits. Exclude them from the
>> + * changed-bits check in get_pci_config_device.
>> + */
>> +static int vfio_pci_pre_load(void *opaque)
>> +{
>> + VFIOPCIDevice *vdev = opaque;
>> + PCIDevice *pdev = &vdev->pdev;
>> + int size = MIN(pci_config_size(pdev), vdev->config_size);
>> + int i;
>> +
>> + for (i = 0; i < size; i++) {
>> + pdev->cmask[i] &= vdev->emulated_config_bits[i];
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static const VMStateDescription vfio_pci_vmstate = {
>> + .name = "vfio-pci",
>> + .version_id = 0,
>> + .minimum_version_id = 0,
>> + .pre_load = vfio_pci_pre_load,
>> + .needed = cpr_needed_for_reuse,
>> + .fields = (VMStateField[]) {
>> + VMSTATE_END_OF_LIST()
>> + }
>> +};
>> +
>> static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>> {
>> DeviceClass *dc = DEVICE_CLASS(klass);
>> @@ -3457,6 +3493,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>> #ifdef CONFIG_IOMMUFD
>> object_class_property_add_str(klass, "fd", NULL, vfio_pci_set_fd);
>> #endif
>> + dc->vmsd = &vfio_pci_vmstate;
>> dc->desc = "VFIO-based PCI device assignment";
>> set_bit(DEVICE_CATEGORY_MISC, dc->categories);
>> pdc->realize = vfio_realize;
>
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH V1 18/26] vfio/iommufd: define iommufd_cdev_make_hwpt
2025-01-29 14:43 ` [PATCH V1 18/26] vfio/iommufd: define iommufd_cdev_make_hwpt Steve Sistare
@ 2025-02-04 16:22 ` Cédric Le Goater
2025-02-04 17:42 ` Steven Sistare
0 siblings, 1 reply; 64+ messages in thread
From: Cédric Le Goater @ 2025-02-04 16:22 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 1/29/25 15:43, Steve Sistare wrote:
> Refactor and define iommufd_cdev_make_hwpt, to be called by CPR in
> a later patch. No functional change.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> hw/vfio/iommufd.c | 69 +++++++++++++++++++++++++++++++++----------------------
> 1 file changed, 41 insertions(+), 28 deletions(-)
>
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 3490a8f..42ba63f 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -275,6 +275,41 @@ static bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
> return true;
> }
>
> +static void iommufd_cdev_set_hwpt(VFIODevice *vbasedev, VFIOIOASHwpt *hwpt)
> +{
> + vbasedev->hwpt = hwpt;
> + vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
> + QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
> +}
> +
> +static VFIOIOASHwpt *iommufd_cdev_make_hwpt(VFIODevice *vbasedev,
> + VFIOIOMMUFDContainer *container,
> + uint32_t hwpt_id)
> +{
> + VFIOIOASHwpt *hwpt = g_malloc0(sizeof(*hwpt));
> + uint32_t flags = 0;
> +
> + /*
> + * This is quite early and VFIO Migration state isn't yet fully
> + * initialized, thus rely only on IOMMU hardware capabilities as to
> + * whether IOMMU dirty tracking is going to be requested. Later
> + * vfio_migration_realize() may decide to use VF dirty tracking
> + * instead.
> + */
> + if (vbasedev->hiod->caps.hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
> + flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
> + }
> +
> + hwpt->hwpt_id = hwpt_id;
> + hwpt->hwpt_flags = flags;
> + QLIST_INIT(&hwpt->device_list);
> +
> + QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
> + container->bcontainer.dirty_pages_supported |=
> + vbasedev->iommu_dirty_tracking;
> + return hwpt;
> +}
> +
> static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
> VFIOIOMMUFDContainer *container,
> Error **errp)
> @@ -304,24 +339,11 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>
> return false;
> } else {
> - vbasedev->hwpt = hwpt;
> - QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
> - vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
> + iommufd_cdev_set_hwpt(vbasedev, hwpt);
> return true;
> }
> }
>
> - /*
> - * This is quite early and VFIO Migration state isn't yet fully
> - * initialized, thus rely only on IOMMU hardware capabilities as to
> - * whether IOMMU dirty tracking is going to be requested. Later
> - * vfio_migration_realize() may decide to use VF dirty tracking
> - * instead.
> - */
> - if (vbasedev->hiod->caps.hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
> - flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
> - }
>
AFAICT, iommufd_backend_alloc_hwpt() below needs the flag value.
Thanks,
C.
> if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
> container->ioas_id, flags,
> IOMMU_HWPT_DATA_NONE, 0, NULL,
> @@ -329,24 +351,15 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
> return false;
> }
>
> - hwpt = g_malloc0(sizeof(*hwpt));
> - hwpt->hwpt_id = hwpt_id;
> - hwpt->hwpt_flags = flags;
> - QLIST_INIT(&hwpt->device_list);
> -
> - ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
> + ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt_id, errp);
> if (ret) {
> - iommufd_backend_free_id(container->be, hwpt->hwpt_id);
> - g_free(hwpt);
> + iommufd_backend_free_id(container->be, hwpt_id);
> return false;
> }
>
> - vbasedev->hwpt = hwpt;
> - vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
> - QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
> - QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
> - container->bcontainer.dirty_pages_supported |=
> - vbasedev->iommu_dirty_tracking;
> + hwpt = iommufd_cdev_make_hwpt(vbasedev, container, hwpt_id);
> + iommufd_cdev_set_hwpt(vbasedev, hwpt);
> +
> if (container->bcontainer.dirty_pages_supported &&
> !vbasedev->iommu_dirty_tracking) {
> warn_report("IOMMU instance for device %s doesn't support dirty tracking",
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH V1 16/26] vfio: return mr from vfio_get_xlat_addr
2025-02-04 15:47 ` Cédric Le Goater
@ 2025-02-04 17:42 ` Steven Sistare
2025-02-16 23:19 ` John Levon
0 siblings, 1 reply; 64+ messages in thread
From: Steven Sistare @ 2025-02-04 17:42 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas,
John Levon
On 2/4/2025 10:47 AM, Cédric Le Goater wrote:
> + John (for vfio-user)
>
> On 1/29/25 15:43, Steve Sistare wrote:
>> Return the memory region that the translated address is found in, for
>> use in a subsequent patch. No functional change.
>
> Keeping a reference on this memory region could be risky. What is it for?
The returned mr is briefly used here in later patches:
vfio_iommu_map_notify()
vfio_get_xlat_addr(&mr)
vfio_container_dma_map(mr->ram_block) ******
if ram_block is right
vioc->dma_map_file()
else
vioc->dma_map()
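
The call flow above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the later patches: the Sketch* types are hypothetical, and "ram_block is right" is assumed here to mean "the block has a backing fd and the container implements dma_map_file".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for RAMBlock and the container ops. */
typedef struct {
    int fd;              /* backing file descriptor, or -1 if anonymous */
    uint64_t fd_offset;  /* offset of the block within the file */
} SketchRamBlock;

typedef struct {
    int (*dma_map)(uint64_t iova, uint64_t size, void *vaddr);
    int (*dma_map_file)(uint64_t iova, uint64_t size, int fd, uint64_t start);
} SketchContainerOps;

/* Trivial backends that return a tag so the chosen path is visible. */
static int sketch_map_vaddr(uint64_t iova, uint64_t size, void *vaddr)
{
    (void)iova; (void)size; (void)vaddr;
    return 1;
}

static int sketch_map_file(uint64_t iova, uint64_t size, int fd, uint64_t start)
{
    (void)iova; (void)size; (void)fd; (void)start;
    return 2;
}

/* Prefer the file-based mapping when possible, else fall back to vaddr. */
static int sketch_container_dma_map(const SketchContainerOps *ops,
                                    const SketchRamBlock *rb,
                                    uint64_t iova, uint64_t size,
                                    void *vaddr, uint64_t offset_in_block)
{
    assert(ops != NULL && ops->dma_map != NULL);

    if (rb != NULL && rb->fd >= 0 && ops->dma_map_file != NULL) {
        return ops->dma_map_file(iova, size, rb->fd,
                                 rb->fd_offset + offset_in_block);
    }
    return ops->dma_map(iova, size, vaddr);
}
```

In this shape the mr is only consulted while choosing the path and is not stored anywhere, which matches the "briefly used" description.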
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/common.c | 9 ++++++---
>> hw/virtio/vhost-vdpa.c | 2 +-
>> include/exec/memory.h | 5 ++++-
>> system/memory.c | 8 +++++++-
>> 4 files changed, 18 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index db0498e..4bbc29f 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -248,12 +248,13 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>> /* Called with rcu_read_lock held. */
>> static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>> ram_addr_t *ram_addr, bool *read_only,
>> + MemoryRegion **mr_p,
>> Error **errp)
>> {
>> bool ret, mr_has_discard_manager;
>> ret = memory_get_xlat_addr(iotlb, vaddr, ram_addr, read_only,
>> - &mr_has_discard_manager, errp);
>> + &mr_has_discard_manager, mr_p, errp);
>> if (ret && mr_has_discard_manager) {
>> /*
>> * Malicious VMs might trigger discarding of IOMMU-mapped memory. The
>> @@ -300,7 +301,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>> if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>> bool read_only;
>> - if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, &local_err)) {
>> + if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, NULL,
>> + &local_err)) {
>> error_report_err(local_err);
>> goto out;
>> }
>> @@ -1279,7 +1281,8 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>> }
>> rcu_read_lock();
>> - if (!vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL, &local_err)) {
>> + if (!vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL, NULL,
>> + &local_err)) {
>> error_report_err(local_err);
>> goto out_unlock;
>> }
>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>> index 3cdaa12..a1866bb 100644
>> --- a/hw/virtio/vhost-vdpa.c
>> +++ b/hw/virtio/vhost-vdpa.c
>> @@ -228,7 +228,7 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>> if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>> bool read_only;
>> - if (!memory_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, NULL,
>> + if (!memory_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, NULL, NULL,
>> &local_err)) {
>> error_report_err(local_err);
>> return;
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index ea5d33a..a2f1229 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -747,13 +747,16 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
>> * @read_only: indicates if writes are allowed
>> * @mr_has_discard_manager: indicates memory is controlled by a
>> * RamDiscardManager
>> + * @mr_p: return the MemoryRegion containing the @iotlb translated addr
>> * @errp: pointer to Error*, to store an error if it happens.
>> *
>> * Return: true on success, else false setting @errp with error.
>> */
>> bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>> ram_addr_t *ram_addr, bool *read_only,
>> - bool *mr_has_discard_manager, Error **errp);
>> + bool *mr_has_discard_manager,
>> + MemoryRegion **mr_p,
>
> There is a risk that the life cycle of the returned MemoryRegion
> doesn't match VFIO expectations.
>
> Also, it seems that memory_get_xlat_addr() has reached a point
> where the callers need refactoring. 'mr_p' would be the 5th out
> parameter and 3 of these already depend on the MemoryRegion
> returned by flatview_translate().
If we return mr plus xlat, then the caller could trivially derive
vaddr, ram_addr, and read_only.
- Steve
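
A rough sketch of that idea, with hypothetical types (SketchRegion, SketchXlat) standing in for MemoryRegion and the current out parameters. It assumes the region exposes a host base pointer, a base ram address, and a readonly flag, and that the caller knows whether the IOTLB entry permits writes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the parts of a RAM MemoryRegion used here. */
typedef struct {
    uint8_t *ram_ptr;   /* host virtual address of the start of the region */
    uint64_t ram_addr;  /* ram address of the start of the region */
    bool readonly;
} SketchRegion;

/* The three values that are separate out parameters today. */
typedef struct {
    void *vaddr;
    uint64_t ram_addr;
    bool read_only;
} SketchXlat;

/* Derive everything from the region plus the translated offset. */
static SketchXlat sketch_derive(const SketchRegion *mr, uint64_t xlat,
                                bool writable)
{
    SketchXlat out;

    assert(mr != NULL && mr->ram_ptr != NULL);
    out.vaddr = mr->ram_ptr + xlat;
    out.ram_addr = mr->ram_addr + xlat;
    out.read_only = !writable || mr->readonly;
    return out;
}
```

With a helper like this, the translation routine would only need to hand back the region and the offset.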
>> + Error **errp);
>> typedef struct CoalescedMemoryRange CoalescedMemoryRange;
>> typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
>> diff --git a/system/memory.c b/system/memory.c
>> index 4c82979..4ec2b8f 100644
>> --- a/system/memory.c
>> +++ b/system/memory.c
>> @@ -2185,7 +2185,9 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
>> /* Called with rcu_read_lock held. */
>> bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>> ram_addr_t *ram_addr, bool *read_only,
>> - bool *mr_has_discard_manager, Error **errp)
>> + bool *mr_has_discard_manager,
>> + MemoryRegion **mr_p,
>> + Error **errp)
>> {
>> MemoryRegion *mr;
>> hwaddr xlat;
>> @@ -2250,6 +2252,10 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>> *read_only = !writable || mr->readonly;
>> }
>> + if (mr_p) {
>> + *mr_p = mr;
>> + }
>> +
>> return true;
>> }
>
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH V1 18/26] vfio/iommufd: define iommufd_cdev_make_hwpt
2025-02-04 16:22 ` Cédric Le Goater
@ 2025-02-04 17:42 ` Steven Sistare
0 siblings, 0 replies; 64+ messages in thread
From: Steven Sistare @ 2025-02-04 17:42 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 2/4/2025 11:22 AM, Cédric Le Goater wrote:
> On 1/29/25 15:43, Steve Sistare wrote:
>> Refactor and define iommufd_cdev_make_hwpt, to be called by CPR in
>> a later patch. No functional change.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/iommufd.c | 69 +++++++++++++++++++++++++++++++++----------------------
>> 1 file changed, 41 insertions(+), 28 deletions(-)
>>
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index 3490a8f..42ba63f 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -275,6 +275,41 @@ static bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>> return true;
>> }
>> +static void iommufd_cdev_set_hwpt(VFIODevice *vbasedev, VFIOIOASHwpt *hwpt)
>> +{
>> + vbasedev->hwpt = hwpt;
>> + vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
>> + QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>> +}
>> +
>> +static VFIOIOASHwpt *iommufd_cdev_make_hwpt(VFIODevice *vbasedev,
>> + VFIOIOMMUFDContainer *container,
>> + uint32_t hwpt_id)
>> +{
>> + VFIOIOASHwpt *hwpt = g_malloc0(sizeof(*hwpt));
>> + uint32_t flags = 0;
>> +
>> + /*
>> + * This is quite early and VFIO Migration state isn't yet fully
>> + * initialized, thus rely only on IOMMU hardware capabilities as to
>> + * whether IOMMU dirty tracking is going to be requested. Later
>> + * vfio_migration_realize() may decide to use VF dirty tracking
>> + * instead.
>> + */
>> + if (vbasedev->hiod->caps.hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
>> + flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>> + }
>> +
>> + hwpt->hwpt_id = hwpt_id;
>> + hwpt->hwpt_flags = flags;
>> + QLIST_INIT(&hwpt->device_list);
>> +
>> + QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
>> + container->bcontainer.dirty_pages_supported |=
>> + vbasedev->iommu_dirty_tracking;
>> + return hwpt;
>> +}
>> +
>> static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>> VFIOIOMMUFDContainer *container,
>> Error **errp)
>> @@ -304,24 +339,11 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>> return false;
>> } else {
>> - vbasedev->hwpt = hwpt;
>> - QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>> - vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
>> + iommufd_cdev_set_hwpt(vbasedev, hwpt);
>> return true;
>> }
>> }
>> - /*
>> - * This is quite early and VFIO Migration state isn't yet fully
>> - * initialized, thus rely only on IOMMU hardware capabilities as to
>> - * whether IOMMU dirty tracking is going to be requested. Later
>> - * vfio_migration_realize() may decide to use VF dirty tracking
>> - * instead.
>> - */
>> - if (vbasedev->hiod->caps.hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
>> - flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>> - }
>>
>
> AFAICT, iommufd_backend_alloc_hwpt() below needs the flag value.
Good catch, that's a bug, will fix.
- Steve
>> if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
>> container->ioas_id, flags,
>> IOMMU_HWPT_DATA_NONE, 0, NULL,
>> @@ -329,24 +351,15 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>> return false;
>> }
>> - hwpt = g_malloc0(sizeof(*hwpt));
>> - hwpt->hwpt_id = hwpt_id;
>> - hwpt->hwpt_flags = flags;
>> - QLIST_INIT(&hwpt->device_list);
>> -
>> - ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
>> + ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt_id, errp);
>> if (ret) {
>> - iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>> - g_free(hwpt);
>> + iommufd_backend_free_id(container->be, hwpt_id);
>> return false;
>> }
>> - vbasedev->hwpt = hwpt;
>> - vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
>> - QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>> - QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
>> - container->bcontainer.dirty_pages_supported |=
>> - vbasedev->iommu_dirty_tracking;
>> + hwpt = iommufd_cdev_make_hwpt(vbasedev, container, hwpt_id);
>> + iommufd_cdev_set_hwpt(vbasedev, hwpt);
>> +
>> if (container->bcontainer.dirty_pages_supported &&
>> !vbasedev->iommu_dirty_tracking) {
>> warn_report("IOMMU instance for device %s doesn't support dirty tracking",
>
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH V1 12/26] vfio-pci: preserve MSI
2025-01-29 14:43 ` [PATCH V1 12/26] vfio-pci: preserve MSI Steve Sistare
@ 2025-02-05 16:48 ` Cédric Le Goater
2025-02-06 14:41 ` Steven Sistare
0 siblings, 1 reply; 64+ messages in thread
From: Cédric Le Goater @ 2025-02-05 16:48 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 1/29/25 15:43, Steve Sistare wrote:
> Save the MSI message area as part of vfio-pci vmstate, and preserve the
> interrupt and notifier eventfd's. migrate_incoming loads the MSI data,
> then the vfio-pci post_load handler finds the eventfds in CPR state,
> rebuilds vector data structures, and attaches the interrupts to the new
> KVM instance.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> hw/vfio/pci.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 116 insertions(+), 1 deletion(-)
>
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index fa77c36..df6e298 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -56,11 +56,37 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
> static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
> static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
>
> +#define EVENT_FD_NAME(vdev, name) \
> + g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
Hmm, this helper could lead to memory leaks if it is not used as done below.
Being explicit would be safer.
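
As one illustration of a more explicit form (an assumption about what "explicit" could mean here, not a request for this exact code): a helper that formats the name into a caller-provided buffer, so no allocation is hidden behind a macro. sketch_event_fd_name is a hypothetical name and the sketch uses plain libc rather than glib.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Format "<dev>_<name>" into buf. Returns 0 on success, -1 if the result
 * would not fit. The caller owns buf, so there is nothing to free.
 */
static int sketch_event_fd_name(char *buf, size_t len,
                                const char *dev, const char *name)
{
    size_t need;

    assert(buf != NULL && dev != NULL && name != NULL);

    /* device name + '_' + notifier name + terminating NUL */
    need = strlen(dev) + 1 + strlen(name) + 1;
    if (need > len) {
        return -1;
    }
    snprintf(buf, len, "%s_%s", dev, name);
    return 0;
}
```

The g_autofree pattern in the patch is also leak-free; the concern is only that the macro does not force callers to use it that way.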
> +static void save_event_fd(VFIOPCIDevice *vdev, const char *name, int nr,
> + EventNotifier *ev)
> +{
> + int fd = event_notifier_get_fd(ev);
> +
> + if (fd >= 0) {
> + g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
> + cpr_resave_fd(fdname, nr, fd);
> + }
> +}
> +
> +static int load_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
> +{
> + g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
> + return cpr_find_fd(fdname, nr);
> +}
> +
> +static void delete_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
> +{
> + g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
> + cpr_delete_fd(fdname, nr);
> +}
> +
Please move these helpers to a cpr file. They are not strictly VFIO
related either, so they could be moved outside of hw/vfio.
> /* Create new or reuse existing eventfd */
> static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
> const char *name, int nr)
> {
> - int fd = -1; /* placeholder until a subsequent patch */
> + int fd = load_event_fd(vdev, name, nr);
> int ret = 0;
>
> if (fd >= 0) {
> @@ -71,6 +97,8 @@ static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
> Error *err = NULL;
> error_setg_errno(&err, -ret, "vfio_notifier_init %s failed", name);
> error_report_err(err);
> + } else {
> + save_event_fd(vdev, name, nr, e);
I'd rather move the CPR-related fd handling that is done in
vfio_notifier_init() into a cpr routine which vfio_notifier_init()
would call. This comment applies to the whole series. Anything
related to CPR should be handled explicitly:
if (cpr_in_progress) {
cpr_do_cpr_related_stuff()
}
It will ease reading and long term maintenance.
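
A slightly fuller sketch of that pattern, with hypothetical names throughout (SketchCprState, sketch_cpr_find_fd, sketch_create_fd); it is not tied to the real cpr_* API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical CPR bookkeeping: whether a load is running and a saved fd. */
typedef struct {
    bool cpr_in_progress;
    int saved_fd;        /* -1 if nothing was saved for this notifier */
} SketchCprState;

/* CPR-only routine, which would live in a cpr file. */
static int sketch_cpr_find_fd(const SketchCprState *cpr)
{
    return cpr->saved_fd;
}

/* Stand-in for creating a fresh eventfd; returns a fixed fake fd. */
static int sketch_create_fd(void)
{
    return 100;
}

/* Generic init path: the CPR special case is visible at the call site. */
static int sketch_notifier_init(const SketchCprState *cpr)
{
    assert(cpr != NULL);

    if (cpr->cpr_in_progress) {
        int fd = sketch_cpr_find_fd(cpr);
        if (fd >= 0) {
            return fd;   /* reuse the descriptor inherited from old QEMU */
        }
    }
    return sketch_create_fd();
}
```

The generic code then reads the same as before outside of CPR, and everything CPR-specific sits behind one explicit test.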
> }
> }
> return ret;
> @@ -79,6 +107,7 @@ static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
> static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
> const char *name, int nr)
> {
> + delete_event_fd(vdev, name, nr);
> event_notifier_cleanup(e);
> }
>
> @@ -561,6 +590,15 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
> int ret;
> bool resizing = !!(vdev->nr_vectors < nr + 1);
>
> + /*
> + * Ignore the callback from msix_set_vector_notifiers during resume.
> + * The necessary subset of these actions is called from vfio_claim_vectors
> + * during post load.
> + */
> + if (vdev->vbasedev.reused) {
> + return 0;
> + }
Again, I would prefer some explicit CPR test. Same below.
> trace_vfio_msix_vector_do_use(vdev->vbasedev.name, nr);
>
> vector = &vdev->msi_vectors[nr];
> @@ -2896,6 +2934,11 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
> fd = event_notifier_get_fd(&vdev->err_notifier);
> qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
>
> + /* Do not alter irq_signaling during vfio_realize for cpr */
> + if (vdev->vbasedev.reused) {
> + return;
> + }
> +
> if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_ERR_IRQ_INDEX, 0,
> VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
> error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
> @@ -2960,6 +3003,12 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
> fd = event_notifier_get_fd(&vdev->req_notifier);
> qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
>
> + /* Do not alter irq_signaling during vfio_realize for cpr */
> + if (vdev->vbasedev.reused) {
> + vdev->req_enabled = true;
> + return;
> + }
> +
> if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_REQ_IRQ_INDEX, 0,
> VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
> error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
> @@ -3454,6 +3503,46 @@ static void vfio_pci_set_fd(Object *obj, const char *str, Error **errp)
> }
> #endif
>
> +static void vfio_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors, bool msix)
> +{
> + int i, fd;
> + bool pending = false;
> + PCIDevice *pdev = &vdev->pdev;
> +
> + vdev->nr_vectors = nr_vectors;
> + vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
> + vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
> +
> + vfio_prepare_kvm_msi_virq_batch(vdev);
> +
> + for (i = 0; i < nr_vectors; i++) {
> + VFIOMSIVector *vector = &vdev->msi_vectors[i];
> +
> + fd = load_event_fd(vdev, "interrupt", i);
> + if (fd >= 0) {
> + vfio_vector_init(vdev, i);
> + qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
> + }
> +
> + if (load_event_fd(vdev, "kvm_interrupt", i) >= 0) {
> + vfio_add_kvm_msi_virq(vdev, vector, i, msix);
> + } else {
> + vdev->msi_vectors[i].virq = -1;
> + }
> +
> + if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
> + set_bit(i, vdev->msix->pending);
> + pending = true;
> + }
> + }
> +
> + vfio_commit_kvm_msi_virq_batch(vdev);
> +
> + if (msix) {
> + memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
> + }
> +}
Please move this to a cpr file. We can have a vfio-pci lib/common file
for external users. It will take more work to get the interface right
but it will benefit other proposals. I think vfio-user has more or less
the same needs.
> +
> /*
> * The kernel may change non-emulated config bits. Exclude them from the
> * changed-bits check in get_pci_config_device.
> @@ -3472,13 +3561,39 @@ static int vfio_pci_pre_load(void *opaque)
> return 0;
> }
>
> +static int vfio_pci_post_load(void *opaque, int version_id)
> +{
> + VFIOPCIDevice *vdev = opaque;
> + PCIDevice *pdev = &vdev->pdev;
> + int nr_vectors;
> +
> + if (msix_enabled(pdev)) {
> + msix_set_vector_notifiers(pdev, vfio_msix_vector_use,
> + vfio_msix_vector_release, NULL);
> + nr_vectors = vdev->msix->entries;
> + vfio_claim_vectors(vdev, nr_vectors, true);
> +
> + } else if (msi_enabled(pdev)) {
> + nr_vectors = msi_nr_vectors_allocated(pdev);
> + vfio_claim_vectors(vdev, nr_vectors, false);
> +
> + } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
> + g_assert_not_reached(); /* completed in a subsequent patch */
> + }
> +
> + return 0;
> +}
> +
> static const VMStateDescription vfio_pci_vmstate = {
> .name = "vfio-pci",
> .version_id = 0,
> .minimum_version_id = 0,
> .pre_load = vfio_pci_pre_load,
> + .post_load = vfio_pci_post_load,
> .needed = cpr_needed_for_reuse,
> .fields = (VMStateField[]) {
> + VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
> + VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
> VMSTATE_END_OF_LIST()
> }
> };
I think you can move vfio_pci_vmstate out of hw/vfio/pci.c too. Only
cpr needs it.
Thanks,
C.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH V1 13/26] vfio-pci: preserve INTx
2025-01-29 14:43 ` [PATCH V1 13/26] vfio-pci: preserve INTx Steve Sistare
@ 2025-02-05 17:13 ` Cédric Le Goater
2025-02-06 14:43 ` Steven Sistare
0 siblings, 1 reply; 64+ messages in thread
From: Cédric Le Goater @ 2025-02-05 17:13 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 1/29/25 15:43, Steve Sistare wrote:
> Preserve vfio INTx state across cpr-transfer. Preserve VFIOINTx fields as
> follows:
> pin : Recover this from the vfio config in kernel space
> interrupt : Preserve its eventfd descriptor across exec.
> unmask : Ditto
> route.irq : This could perhaps be recovered in vfio_pci_post_load by
> calling pci_device_route_intx_to_irq(pin), whose implementation reads
> config space for a bridge device such as ich9. However, there is no
> guarantee that the bridge vmstate is read before vfio vmstate. Rather
> than fiddling with MigrationPriority for vmstate handlers, explicitly
> save route.irq in vfio vmstate.
> pending : save in vfio vmstate.
> mmap_timeout, mmap_timer : Re-initialize
> bool kvm_accel : Re-initialize
>
> In vfio_realize, defer calling vfio_intx_enable until the vmstate
> is available, in vfio_pci_post_load. Modify vfio_intx_enable and
> vfio_intx_kvm_enable to skip vfio initialization, but still perform
> kvm initialization.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> hw/vfio/pci.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 47 insertions(+), 4 deletions(-)
>
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index df6e298..c50dbef 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -184,12 +184,17 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
> return true;
> }
>
> + if (vdev->vbasedev.reused) {
1 x vdev->vbasedev.reused
> + goto skip_state;
> + }
> +
> /* Get to a known interrupt state */
> qemu_set_fd_handler(irq_fd, NULL, NULL, vdev);
> vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
> vdev->intx.pending = false;
> pci_irq_deassert(&vdev->pdev);
>
> +skip_state:
hmm, this skip_state label and ...
> /* Get an eventfd for resample/unmask */
> if (vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0)) {
> error_setg(errp, "vfio_notifier_init intx-unmask failed");
> @@ -204,6 +209,10 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
> goto fail_irqfd;
> }
>
> + if (vdev->vbasedev.reused) {
2 x vdev->vbasedev.reused
> + goto skip_irq;
> + }
> +
> if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
> VFIO_IRQ_SET_ACTION_UNMASK,
> event_notifier_get_fd(&vdev->intx.unmask),
> @@ -214,6 +223,7 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
> /* Let'em rip */
> vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
>
> +skip_irq:
... this skip_irq label are one "very quick" way to get things done :)
> vdev->intx.kvm_accel = true;
>
> trace_vfio_intx_enable_kvm(vdev->vbasedev.name);
> @@ -329,7 +339,13 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
> return true;
> }
>
> - vfio_disable_interrupts(vdev);
> + /*
> + * Do not alter interrupt state during vfio_realize and cpr load. The
> + * reused flag is cleared thereafter.
> + */
> + if (!vdev->vbasedev.reused) {
3 x vdev->vbasedev.reused
> + vfio_disable_interrupts(vdev);
> + }
>
> vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
> pci_config_set_interrupt_pin(vdev->pdev.config, pin);
> @@ -351,7 +367,8 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
> fd = event_notifier_get_fd(&vdev->intx.interrupt);
> qemu_set_fd_handler(fd, vfio_intx_interrupt, NULL, vdev);
>
> - if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
> + if (!vdev->vbasedev.reused &&
4 x vdev->vbasedev.reused
> + !vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
> VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
> qemu_set_fd_handler(fd, NULL, NULL, vdev);
> vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
> @@ -3256,7 +3273,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
> vfio_intx_routing_notifier);
> vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
> kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
> - if (!vfio_intx_enable(vdev, errp)) {
> + /* Wait until cpr load reads intx routing data to enable */
> + if (!vdev->vbasedev.reused && !vfio_intx_enable(vdev, errp)) {
5 x vdev->vbasedev.reused
This patch already adds a test on vdev->vbasedev.reused at the top of
vfio_intx_enable(). This one seems redundant.
Please duplicate the whole vfio_intx_enable() routine and move it
under a cpr file.
> goto out_deregister;
> }
> }
> @@ -3578,12 +3596,36 @@ static int vfio_pci_post_load(void *opaque, int version_id)
> vfio_claim_vectors(vdev, nr_vectors, false);
>
> } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
> - g_assert_not_reached(); /* completed in a subsequent patch */
> + Error *err = NULL;
> + if (!vfio_intx_enable(vdev, &err)) {
> + error_report_err(err);
> + return -1;
> + }
> }
>
> return 0;
> }
>
> +static const VMStateDescription vfio_intx_vmstate = {
> + .name = "vfio-intx",
> + .version_id = 0,
> + .minimum_version_id = 0,
> + .fields = (VMStateField[]) {
> + VMSTATE_BOOL(pending, VFIOINTx),
> + VMSTATE_UINT32(route.mode, VFIOINTx),
> + VMSTATE_INT32(route.irq, VFIOINTx),
> + VMSTATE_END_OF_LIST()
> + }
> +};
> +
> +#define VMSTATE_VFIO_INTX(_field, _state) { \
> + .name = (stringify(_field)), \
> + .size = sizeof(VFIOINTx), \
> + .vmsd = &vfio_intx_vmstate, \
> + .flags = VMS_STRUCT, \
> + .offset = vmstate_offset_value(_state, _field, VFIOINTx), \
> +}
> +
Please move these to a cpr file.
Thanks,
C.
> static const VMStateDescription vfio_pci_vmstate = {
> .name = "vfio-pci",
> .version_id = 0,
> @@ -3594,6 +3636,7 @@ static const VMStateDescription vfio_pci_vmstate = {
> .fields = (VMStateField[]) {
> VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
> VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
> + VMSTATE_VFIO_INTX(intx, VFIOPCIDevice),
> VMSTATE_END_OF_LIST()
> }
> };
* Re: [PATCH V1 19/26] vfio/iommufd: use IOMMU_IOAS_MAP_FILE
2025-01-29 14:43 ` [PATCH V1 19/26] vfio/iommufd: use IOMMU_IOAS_MAP_FILE Steve Sistare
@ 2025-02-05 17:23 ` Cédric Le Goater
2025-02-05 22:01 ` Steven Sistare
0 siblings, 1 reply; 64+ messages in thread
From: Cédric Le Goater @ 2025-02-05 17:23 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 1/29/25 15:43, Steve Sistare wrote:
> Use IOMMU_IOAS_MAP_FILE when the mapped region is backed by a file.
> Such a mapping can be preserved without modification during CPR,
> because it depends on the file's address space, which does not change,
> rather than on the process's address space, which does change.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> backends/iommufd.c | 36 +++++++++++++++++++++++++++++++++++
> backends/trace-events | 1 +
> hw/vfio/container-base.c | 9 +++++++++
> hw/vfio/iommufd.c | 13 +++++++++++++
> include/exec/cpu-common.h | 1 +
> include/hw/vfio/vfio-container-base.h | 3 +++
> include/system/iommufd.h | 3 +++
> system/physmem.c | 5 +++++
> 8 files changed, 71 insertions(+)
>
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> index 7b4fc8e..6d29221 100644
> --- a/backends/iommufd.c
> +++ b/backends/iommufd.c
> @@ -174,6 +174,42 @@ int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas_id, hwaddr iova,
> return ret;
> }
>
> +int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
> + hwaddr iova, ram_addr_t size,
> + int mfd, unsigned long start, bool readonly)
Please introduce this new routine in its own patch.
> +{
> + int ret, fd = be->fd;
> + struct iommu_ioas_map_file map = {
> + .size = sizeof(map),
> + .flags = IOMMU_IOAS_MAP_READABLE |
> + IOMMU_IOAS_MAP_FIXED_IOVA,
> + .ioas_id = ioas_id,
> + .fd = mfd,
> + .start = start,
> + .iova = iova,
> + .length = size,
> + };
> +
> + if (!readonly) {
> + map.flags |= IOMMU_IOAS_MAP_WRITEABLE;
> + }
> +
> + ret = ioctl(fd, IOMMU_IOAS_MAP_FILE, &map);
> + trace_iommufd_backend_map_file_dma(fd, ioas_id, iova, size, mfd, start,
> + readonly, ret);
> + if (ret) {
> + ret = -errno;
> +
> + /* TODO: Not support mapping hardware PCI BAR region for now. */
> + if (errno == EFAULT) {
> + warn_report("IOMMU_IOAS_MAP_FILE failed: %m, PCI BAR?");
I am not sure this warning can occur when the PCI BARs are mmapped
in a VM with incompatible address spaces. My attempts produced EINVAL.
Let's keep it for now until it is clarified.
> + } else {
> + error_report("IOMMU_IOAS_MAP_FILE failed: %m");
Please remove this error report. It is redundant with the callers, which
will report the same error.
> + }
> + }
> + return ret;
> +}
> +
> int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
> hwaddr iova, ram_addr_t size)
> {
> diff --git a/backends/trace-events b/backends/trace-events
> index 40811a3..f478e18 100644
> --- a/backends/trace-events
> +++ b/backends/trace-events
> @@ -11,6 +11,7 @@ iommufd_backend_connect(int fd, bool owned, uint32_t users) "fd=%d owned=%d user
> iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
> iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
> iommufd_backend_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, void *vaddr, bool readonly, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" addr=%p readonly=%d (%d)"
> +iommufd_backend_map_file_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int fd, unsigned long start, bool readonly, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" fd=%d start=%ld readonly=%d (%d)"
> iommufd_backend_unmap_dma_non_exist(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " Unmap nonexistent mapping: iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
> iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
> iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d"
> diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
> index 302cd4c..fbaf04a 100644
> --- a/hw/vfio/container-base.c
> +++ b/hw/vfio/container-base.c
> @@ -21,7 +21,16 @@ int vfio_container_dma_map(VFIOContainerBase *bcontainer,
> RAMBlock *rb)
> {
> VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
> + int mfd = rb ? qemu_ram_get_fd(rb) : -1;
>
> + if (mfd >= 0 && vioc->dma_map_file) {
> + unsigned long start = vaddr - qemu_ram_get_host_addr(rb);
> + unsigned long offset = qemu_ram_get_fd_offset(rb);
> +
> + vioc->dma_map_file(bcontainer, iova, size, mfd, start + offset,
> + readonly);
> + return 0;
This is CPR related. Please add a dma_map_file helper and move the
code above to a cpr file.
> + }
> g_assert(vioc->dma_map);
> return vioc->dma_map(bcontainer, iova, size, vaddr, readonly);
> }
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 42ba63f..a3e7edb 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -38,6 +38,18 @@ static int iommufd_cdev_map(const VFIOContainerBase *bcontainer, hwaddr iova,
> iova, size, vaddr, readonly);
> }
>
> +static int iommufd_cdev_map_file(const VFIOContainerBase *bcontainer,
> + hwaddr iova, ram_addr_t size,
> + int fd, unsigned long start, bool readonly)
> +{
> + const VFIOIOMMUFDContainer *container =
> + container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
> +
> + return iommufd_backend_map_file_dma(container->be,
> + container->ioas_id,
> + iova, size, fd, start, readonly);
> +}
> +
> static int iommufd_cdev_unmap(const VFIOContainerBase *bcontainer,
> hwaddr iova, ram_addr_t size,
> IOMMUTLBEntry *iotlb)
> @@ -806,6 +818,7 @@ static void vfio_iommu_iommufd_class_init(ObjectClass *klass, void *data)
> vioc->hiod_typename = TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO;
>
> vioc->dma_map = iommufd_cdev_map;
> + vioc->dma_map_file = iommufd_cdev_map_file;
> vioc->dma_unmap = iommufd_cdev_unmap;
> vioc->attach_device = iommufd_cdev_attach;
> vioc->detach_device = iommufd_cdev_detach;
> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> index b1d76d6..0cab252 100644
> --- a/include/exec/cpu-common.h
> +++ b/include/exec/cpu-common.h
> @@ -95,6 +95,7 @@ void qemu_ram_unset_idstr(RAMBlock *block);
> const char *qemu_ram_get_idstr(RAMBlock *rb);
> void *qemu_ram_get_host_addr(RAMBlock *rb);
> ram_addr_t qemu_ram_get_offset(RAMBlock *rb);
> +ram_addr_t qemu_ram_get_fd_offset(RAMBlock *rb);
> ram_addr_t qemu_ram_get_used_length(RAMBlock *rb);
> ram_addr_t qemu_ram_get_max_length(RAMBlock *rb);
> bool qemu_ram_is_shared(RAMBlock *rb);
> diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
> index d82e256..4daa5f8 100644
> --- a/include/hw/vfio/vfio-container-base.h
> +++ b/include/hw/vfio/vfio-container-base.h
> @@ -115,6 +115,9 @@ struct VFIOIOMMUClass {
> int (*dma_map)(const VFIOContainerBase *bcontainer,
> hwaddr iova, ram_addr_t size,
> void *vaddr, bool readonly);
> + int (*dma_map_file)(const VFIOContainerBase *bcontainer,
> + hwaddr iova, ram_addr_t size,
> + int fd, unsigned long start, bool readonly);
> int (*dma_unmap)(const VFIOContainerBase *bcontainer,
> hwaddr iova, ram_addr_t size,
> IOMMUTLBEntry *iotlb);
> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
> index cbab75b..ac700b8 100644
> --- a/include/system/iommufd.h
> +++ b/include/system/iommufd.h
> @@ -43,6 +43,9 @@ void iommufd_backend_disconnect(IOMMUFDBackend *be);
> bool iommufd_backend_alloc_ioas(IOMMUFDBackend *be, uint32_t *ioas_id,
> Error **errp);
> void iommufd_backend_free_id(IOMMUFDBackend *be, uint32_t id);
> +int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
> + hwaddr iova, ram_addr_t size, int fd,
> + unsigned long start, bool readonly);
> int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas_id, hwaddr iova,
> ram_addr_t size, void *vaddr, bool readonly);
> int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
> diff --git a/system/physmem.c b/system/physmem.c
> index 0bcfc6c..c41a80b 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -1569,6 +1569,11 @@ ram_addr_t qemu_ram_get_offset(RAMBlock *rb)
> return rb->offset;
> }
>
> +ram_addr_t qemu_ram_get_fd_offset(RAMBlock *rb)
> +{
> + return rb->fd_offset;
> +}
This should go in its own patch.
> ram_addr_t qemu_ram_get_used_length(RAMBlock *rb)
> {
> return rb->used_length;
Thanks,
C.
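For readers following the vfio_container_dma_map() hunk above: the offset
handed to the dma_map_file callback is the distance of vaddr from the start
of the RAMBlock's host mapping, plus the block's offset within its backing
file. Below is a minimal sketch of that arithmetic in standalone C. Everything
named Demo* or demo_* is a hypothetical stand-in, not a QEMU name. Unlike the
hunk as posted, which returns 0 regardless of what the callback returns, the
sketch hands the callback's result back to the caller; whether the series
should do the same is a question for the thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the RAMBlock fields the hunk reads. */
typedef struct {
    uint8_t *host;       /* host VA where the block is mapped */
    uint64_t fd_offset;  /* offset of the block within its backing file */
    int fd;              /* backing file descriptor, or -1 if none */
} DemoRamBlock;

/* Hypothetical callback type shaped like the dma_map_file hook. */
typedef int (*demo_map_file_fn)(uint64_t iova, uint64_t size, int fd,
                                uint64_t start, bool readonly);

/* Returned when no file mapping is possible and the VA path must be used. */
#define DEMO_NO_FILE_MAPPING 1

/* File offset of vaddr: offset into the block plus the block's file offset. */
static uint64_t demo_file_offset(const DemoRamBlock *rb, const void *vaddr)
{
    uint64_t start = (uint64_t)((const uint8_t *)vaddr - rb->host);

    return start + rb->fd_offset;
}

/* Map by file when possible, propagating the callback's return value. */
static int demo_dma_map_file(const DemoRamBlock *rb, demo_map_file_fn map_file,
                             uint64_t iova, uint64_t size, const void *vaddr,
                             bool readonly)
{
    if (!rb || rb->fd < 0 || !map_file) {
        return DEMO_NO_FILE_MAPPING;
    }
    return map_file(iova, size, rb->fd, demo_file_offset(rb, vaddr), readonly);
}

/* Fake backend for illustration: records the file offset it was given. */
static uint64_t demo_last_start;

static int demo_record_map(uint64_t iova, uint64_t size, int fd,
                           uint64_t start, bool readonly)
{
    (void)iova;
    (void)size;
    (void)fd;
    (void)readonly;
    demo_last_start = start;
    return 0;
}
```

With a block whose fd_offset is 0x1000, mapping an address 0x20 bytes into
the block passes a file offset of 0x1020 to the callback.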
* Re: [PATCH V1 20/26] vfio/iommufd: export iommufd_cdev_get_info_iova_range
2025-01-29 14:43 ` [PATCH V1 20/26] vfio/iommufd: export iommufd_cdev_get_info_iova_range Steve Sistare
@ 2025-02-05 17:33 ` Cédric Le Goater
2025-02-05 22:01 ` Steven Sistare
0 siblings, 1 reply; 64+ messages in thread
From: Cédric Le Goater @ 2025-02-05 17:33 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 1/29/25 15:43, Steve Sistare wrote:
> Export iommufd_cdev_get_info_iova_range for use by CPR.
Why does CPR need access to the IOVA ranges?
Thanks,
C.
> No functional change.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> hw/vfio/iommufd.c | 4 ++--
> include/hw/vfio/vfio-common.h | 2 ++
> 2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index a3e7edb..2f888e5 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -442,8 +442,8 @@ static int iommufd_cdev_ram_block_discard_disable(bool state)
> return ram_block_uncoordinated_discard_disable(state);
> }
>
> -static bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
> - uint32_t ioas_id, Error **errp)
> +bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
> + uint32_t ioas_id, Error **errp)
> {
> VFIOContainerBase *bcontainer = &container->bcontainer;
> g_autofree struct iommu_ioas_iova_ranges *info = NULL;
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 5a89aca..ca10abc 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -268,6 +268,8 @@ bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp);
> void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer);
> bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp);
> void vfio_legacy_cpr_unregister_container(VFIOContainer *container);
> +bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
> + uint32_t ioas_id, Error **errp);
>
> extern const MemoryRegionOps vfio_region_ops;
> typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
* Re: [PATCH V1 21/26] iommufd: change process ioctl
2025-01-29 14:43 ` [PATCH V1 21/26] iommufd: change process ioctl Steve Sistare
@ 2025-02-05 17:34 ` Cédric Le Goater
2025-02-05 22:02 ` Steven Sistare
0 siblings, 1 reply; 64+ messages in thread
From: Cédric Le Goater @ 2025-02-05 17:34 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 1/29/25 15:43, Steve Sistare wrote:
> Define the change process ioctl
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> backends/iommufd.c | 20 ++++++++++++++++++++
> backends/trace-events | 1 +
> include/system/iommufd.h | 2 ++
> 3 files changed, 23 insertions(+)
>
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> index 6d29221..be5f6a3 100644
> --- a/backends/iommufd.c
> +++ b/backends/iommufd.c
> @@ -73,6 +73,26 @@ static void iommufd_backend_class_init(ObjectClass *oc, void *data)
> object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
> }
>
> +bool iommufd_change_process_capable(IOMMUFDBackend *be)
> +{
> + struct iommu_ioas_change_process args = {.size = sizeof(args)};
> +
> + return !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
> +}
> +
> +int iommufd_change_process(IOMMUFDBackend *be)
> +{
> + struct iommu_ioas_change_process args = {.size = sizeof(args)};
> + int ret = ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
> +
> + if (ret) {
> + ret = -errno;
> + error_report("IOMMU_IOAS_CHANGE_PROCESS fd %d failed: %m", be->fd);
Please add an 'Error **errp' parameter.
Thanks,
C.
> + }
> + trace_iommufd_change_process(be->fd, ret);
> + return ret;
> +}
> +
> bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
> {
> int fd;
> diff --git a/backends/trace-events b/backends/trace-events
> index f478e18..9b33dc3 100644
> --- a/backends/trace-events
> +++ b/backends/trace-events
> @@ -7,6 +7,7 @@ dbus_vmstate_loading(const char *id) "id: %s"
> dbus_vmstate_saving(const char *id) "id: %s"
>
> # iommufd.c
> +iommufd_change_process(int fd, int ret) "fd=%d (%d)"
> iommufd_backend_connect(int fd, bool owned, uint32_t users) "fd=%d owned=%d users=%d"
> iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
> iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
> index ac700b8..4e9c037 100644
> --- a/include/system/iommufd.h
> +++ b/include/system/iommufd.h
> @@ -64,6 +64,8 @@ bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t hwpt_id,
> uint64_t iova, ram_addr_t size,
> uint64_t page_size, uint64_t *data,
> Error **errp);
> +bool iommufd_change_process_capable(IOMMUFDBackend *be);
> +int iommufd_change_process(IOMMUFDBackend *be);
>
> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
> #endif
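The review comment above asks for iommufd_change_process() to report failure
through an 'Error **errp' parameter rather than calling error_report()
itself. A rough sketch of that shape in standalone C follows. It assumes a
hypothetical DemoError type and demo_* helpers in place of QEMU's Error API,
and a function pointer in place of the ioctl; inside QEMU the error would
presumably be filled in with error_setg_errno().

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for QEMU's Error object; not the real API. */
typedef struct {
    int os_errno;
    char msg[128];
} DemoError;

/* Fill *err when the caller supplied one; a NULL err means "don't care". */
static void demo_error_set_errno(DemoError *err, int os_errno, const char *what)
{
    if (err) {
        err->os_errno = os_errno;
        snprintf(err->msg, sizeof(err->msg), "%s: %s", what,
                 strerror(os_errno));
    }
}

/* Hypothetical operation type standing in for the ioctl call. */
typedef int (*demo_op_fn)(void);

/* Returns true on success; on failure, describes the error via err. */
static bool demo_change_process(demo_op_fn op, DemoError *err)
{
    if (op() != 0) {
        /* Read errno right after the failing call, before anything else. */
        demo_error_set_errno(err, errno,
                             "IOMMU_IOAS_CHANGE_PROCESS failed");
        return false;
    }
    return true;
}

/* Fake operations for illustration. */
static int demo_op_fail(void)
{
    errno = ENOTTY;
    return -1;
}

static int demo_op_ok(void)
{
    return 0;
}
```

The caller then decides whether to report, propagate, or ignore the error,
which is the usual motivation for the errp convention.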
* Re: [PATCH V1 22/26] vfio/iommufd: invariant device name
2025-01-29 14:43 ` [PATCH V1 22/26] vfio/iommufd: invariant device name Steve Sistare
@ 2025-02-05 17:42 ` Cédric Le Goater
2025-02-05 22:02 ` Steven Sistare
0 siblings, 1 reply; 64+ messages in thread
From: Cédric Le Goater @ 2025-02-05 17:42 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 1/29/25 15:43, Steve Sistare wrote:
> cpr-transfer will use the device name as a key to find the value
> of the device descriptor in new QEMU. However, if the descriptor
> number is specified by a command-line fd parameter, then
> vfio_device_get_name creates a name that includes the fd number.
> This causes a chicken-and-egg problem: new QEMU must know the fd
> number to construct a name to find the fd number.
>
> To fix, create an invariant name based on the id command-line
> parameter. If id is not defined, add a CPR blocker.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> hw/vfio/helpers.c | 18 +++++++++++++++---
> hw/vfio/iommufd.c | 2 ++
> include/hw/vfio/vfio-common.h | 1 +
> 3 files changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
> index 913796f..bd94b86 100644
> --- a/hw/vfio/helpers.c
> +++ b/hw/vfio/helpers.c
> @@ -25,6 +25,8 @@
> #include "hw/vfio/vfio-common.h"
> #include "hw/hw.h"
> #include "trace.h"
> +#include "migration/blocker.h"
> +#include "migration/cpr.h"
> #include "qapi/error.h"
> #include "qemu/error-report.h"
> #include "qemu/units.h"
> @@ -636,6 +638,7 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
> {
> ERRP_GUARD();
> struct stat st;
> + bool ret = true;
>
> if (vbasedev->fd < 0) {
> if (stat(vbasedev->sysfsdev, &st) < 0) {
> @@ -653,15 +656,24 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
> return false;
> }
> /*
> - * Give a name with fd so any function printing out vbasedev->name
> + * Give a name so any function printing out vbasedev->name
> * will not break.
> */
> if (!vbasedev->name) {
> - vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
> + if (vbasedev->dev->id) {
> + vbasedev->name = g_strdup(vbasedev->dev->id);
> + } else {
> + vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
> + error_setg(&vbasedev->cpr_id_blocker,
> + "vfio device with fd=%d needs an id property",
> + vbasedev->fd);
> + ret = migrate_add_blocker_modes(&vbasedev->cpr_id_blocker, errp,
> + MIG_MODE_CPR_TRANSFER, -1) == 0;
Please move this into a cpr helper.
> + }
> }
> }
>
> - return true;
> + return ret;
> }
>
> void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 2f888e5..8308715 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -24,6 +24,7 @@
> #include "system/reset.h"
> #include "qemu/cutils.h"
> #include "qemu/chardev_open.h"
> +#include "migration/blocker.h"
> #include "pci.h"
> #include "exec/ram_addr.h"
>
> @@ -657,6 +658,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
> iommufd_cdev_container_destroy(container);
> vfio_put_address_space(space);
>
> + migrate_del_blocker(&vbasedev->cpr_id_blocker);
> iommufd_cdev_unbind_and_disconnect(vbasedev);
> close(vbasedev->fd);
> }
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index ca10abc..37e7c26 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -147,6 +147,7 @@ typedef struct VFIODevice {
> VFIOMigration *migration;
> Error *migration_blocker;
> Error *cpr_mdev_blocker;
> + Error *cpr_id_blocker;
A struct VFIODeviceCPR would be welcome.
Thanks,
C.
> OnOffAuto pre_copy_dirty_page_tracking;
> OnOffAuto device_dirty_page_tracking;
> bool dirty_pages_supported;
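The naming rule described in the commit message of this patch can be
sketched in standalone C as follows. demo_device_name() is a hypothetical
helper, not the QEMU function: it prefers the id, which is the same in old and
new QEMU, and falls back to the fd-based name otherwise. The return value
tells the caller whether the name is invariant, which is the case where the
patch would not need to add a CPR blocker.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Write the device name into buf.
 * Returns true when the name comes from the id and so does not depend on
 * the fd number; returns false when the fd-based fallback was used.
 */
static bool demo_device_name(const char *id, int fd, char *buf, size_t len)
{
    if (id) {
        snprintf(buf, len, "%s", id);
        return true;
    }
    /* The fd number may differ in new QEMU, so this name is not a stable key. */
    snprintf(buf, len, "VFIO_FD%d", fd);
    return false;
}
```

For example, an id of "hostdev0" yields the name "hostdev0" whatever the fd
is, while a device without an id on fd 7 is named "VFIO_FD7".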
* Re: [PATCH V1 23/26] vfio/iommufd: register container for cpr
2025-01-29 14:43 ` [PATCH V1 23/26] vfio/iommufd: register container for cpr Steve Sistare
@ 2025-02-05 17:45 ` Cédric Le Goater
2025-02-05 22:03 ` Steven Sistare
0 siblings, 1 reply; 64+ messages in thread
From: Cédric Le Goater @ 2025-02-05 17:45 UTC (permalink / raw)
To: Steve Sistare, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 1/29/25 15:43, Steve Sistare wrote:
> Register a vfio iommufd container for CPR. Add a blocker if the kernel does
> not support IOMMU_IOAS_CHANGE_PROCESS.
>
> This is mostly boilerplate. The fields to be saved and restored are added
> in subsequent patches.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
> hw/vfio/cpr-iommufd.c | 96 +++++++++++++++++++++++++++++++++++++++++++
> hw/vfio/iommufd.c | 12 +++---
> hw/vfio/meson.build | 1 +
> include/hw/vfio/vfio-common.h | 6 +++
> 4 files changed, 110 insertions(+), 5 deletions(-)
> create mode 100644 hw/vfio/cpr-iommufd.c
>
> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
> new file mode 100644
> index 0000000..4eb358a
> --- /dev/null
> +++ b/hw/vfio/cpr-iommufd.c
> @@ -0,0 +1,96 @@
> +/*
> + * Copyright (c) 2024-2025 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "hw/vfio/vfio-common.h"
> +#include "migration/blocker.h"
> +#include "migration/cpr.h"
> +#include "migration/migration.h"
> +#include "migration/vmstate.h"
> +#include "system/iommufd.h"
> +
> +static bool vfio_cpr_supported(VFIOIOMMUFDContainer *container, Error **errp)
> +{
> + if (!iommufd_change_process_capable(container->be)) {
> + error_setg(errp,
> + "VFIO container does not support IOMMU_IOAS_CHANGE_PROCESS");
> + return false;
> + }
> + return true;
> +}
> +
> +static const VMStateDescription vfio_container_vmstate = {
> + .name = "vfio-iommufd-container",
> + .version_id = 0,
> + .minimum_version_id = 0,
> + .needed = cpr_needed_for_reuse,
> + .fields = (VMStateField[]) {
> + VMSTATE_END_OF_LIST()
> + }
> +};
> +
> +static const VMStateDescription iommufd_cpr_vmstate = {
> + .name = "iommufd",
> + .version_id = 0,
> + .minimum_version_id = 0,
> + .needed = cpr_needed_for_reuse,
> + .fields = (VMStateField[]) {
> + VMSTATE_END_OF_LIST()
> + }
> +};
> +
> +bool vfio_iommufd_cpr_register_container(VFIOIOMMUFDContainer *container,
> + Error **errp)
> +{
> + VFIOContainerBase *bcontainer = &container->bcontainer;
> + Error **cpr_blocker = &container->cpr_blocker;
> +
> + if (!vfio_cpr_register_container(bcontainer, errp)) {
> + return false;
> + }
> +
> + if (!vfio_cpr_supported(container, cpr_blocker)) {
> + return migrate_add_blocker_modes(cpr_blocker, errp,
> + MIG_MODE_CPR_TRANSFER, -1) == 0;
> + }
> +
> + vmstate_register(NULL, -1, &vfio_container_vmstate, container);
> + vmstate_register(NULL, -1, &iommufd_cpr_vmstate, container->be);
> +
> + return true;
> +}
> +
> +void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
> +{
> + VFIOContainerBase *bcontainer = &container->bcontainer;
> +
> + vmstate_unregister(NULL, &iommufd_cpr_vmstate, container->be);
> + vmstate_unregister(NULL, &vfio_container_vmstate, container);
> + migrate_del_blocker(&container->cpr_blocker);
> + vfio_cpr_unregister_container(bcontainer);
> +}
> +
> +static const VMStateDescription vfio_device_vmstate = {
> + .name = "vfio-iommufd-device",
> + .version_id = 0,
> + .minimum_version_id = 0,
> + .needed = cpr_needed_for_reuse,
> + .fields = (VMStateField[]) {
> + VMSTATE_END_OF_LIST()
> + }
> +};
> +
> +void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
> +{
> + vmstate_register(NULL, -1, &vfio_device_vmstate, vbasedev);
> +}
> +
> +void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
> +{
> + vmstate_unregister(NULL, &vfio_device_vmstate, vbasedev);
> +}
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 8308715..ae78e00 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -592,6 +592,10 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>
> bcontainer->initialized = true;
>
> + if (!vfio_iommufd_cpr_register_container(container, errp)) {
> + goto err_listener_register;
> + }
> +
Why this change?
Thanks,
C.
> found_container:
> ret = ioctl(devfd, VFIO_DEVICE_GET_INFO, &dev_info);
> if (ret) {
> @@ -599,10 +603,6 @@ found_container:
> goto err_listener_register;
> }
>
> - if (!vfio_cpr_register_container(bcontainer, errp)) {
> - goto err_listener_register;
> - }
> -
> /*
> * TODO: examine RAM_BLOCK_DISCARD stuff, should we do group level
> * for discarding incompatibility check as well?
> @@ -619,6 +619,7 @@ found_container:
> vbasedev->bcontainer = bcontainer;
> QLIST_INSERT_HEAD(&bcontainer->device_list, vbasedev, container_next);
> QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
> + vfio_iommufd_cpr_register_device(vbasedev);
>
> trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
> vbasedev->num_regions, vbasedev->flags);
> @@ -653,12 +654,13 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
> iommufd_cdev_ram_block_discard_disable(false);
> }
>
> - vfio_cpr_unregister_container(bcontainer);
> + vfio_iommufd_cpr_unregister_container(container);
> iommufd_cdev_detach_container(vbasedev, container);
> iommufd_cdev_container_destroy(container);
> vfio_put_address_space(space);
>
> migrate_del_blocker(&vbasedev->cpr_id_blocker);
> + vfio_iommufd_cpr_unregister_device(vbasedev);
> iommufd_cdev_unbind_and_disconnect(vbasedev);
> close(vbasedev->fd);
> }
> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index 5487815..998adb5 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -13,6 +13,7 @@ vfio_ss.add(when: 'CONFIG_IOMMUFD', if_true: files(
> vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
> 'cpr.c',
> 'cpr-legacy.c',
> + 'cpr-iommufd.c',
> 'display.c',
> 'pci-quirks.c',
> 'pci.c',
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 37e7c26..add44d4 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -113,6 +113,7 @@ typedef struct VFIOIOASHwpt {
> typedef struct VFIOIOMMUFDContainer {
> VFIOContainerBase bcontainer;
> IOMMUFDBackend *be;
> + Error *cpr_blocker;
> uint32_t ioas_id;
> QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
> } VFIOIOMMUFDContainer;
> @@ -271,6 +272,11 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp);
> void vfio_legacy_cpr_unregister_container(VFIOContainer *container);
> bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
> uint32_t ioas_id, Error **errp);
> +bool vfio_iommufd_cpr_register_container(VFIOIOMMUFDContainer *container,
> + Error **errp);
> +void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container);
> +void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev);
> +void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev);
>
> extern const MemoryRegionOps vfio_region_ops;
> typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
* Re: [PATCH V1 19/26] vfio/iommufd: use IOMMU_IOAS_MAP_FILE
2025-02-05 17:23 ` Cédric Le Goater
@ 2025-02-05 22:01 ` Steven Sistare
0 siblings, 0 replies; 64+ messages in thread
From: Steven Sistare @ 2025-02-05 22:01 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 2/5/2025 12:23 PM, Cédric Le Goater wrote:
> On 1/29/25 15:43, Steve Sistare wrote:
>> Use IOMMU_IOAS_MAP_FILE when the mapped region is backed by a file.
>> Such a mapping can be preserved without modification during CPR,
>> because it depends on the file's address space, which does not change,
>> rather than on the process's address space, which does change.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> backends/iommufd.c | 36 +++++++++++++++++++++++++++++++++++
>> backends/trace-events | 1 +
>> hw/vfio/container-base.c | 9 +++++++++
>> hw/vfio/iommufd.c | 13 +++++++++++++
>> include/exec/cpu-common.h | 1 +
>> include/hw/vfio/vfio-container-base.h | 3 +++
>> include/system/iommufd.h | 3 +++
>> system/physmem.c | 5 +++++
>> 8 files changed, 71 insertions(+)
>>
>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>> index 7b4fc8e..6d29221 100644
>> --- a/backends/iommufd.c
>> +++ b/backends/iommufd.c
>> @@ -174,6 +174,42 @@ int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas_id, hwaddr iova,
>> return ret;
>> }
>> +int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>> + hwaddr iova, ram_addr_t size,
>> + int mfd, unsigned long start, bool readonly)
>
> Please introduce this new routine in its own patch.
OK.
>> +{
>> + int ret, fd = be->fd;
>> + struct iommu_ioas_map_file map = {
>> + .size = sizeof(map),
>> + .flags = IOMMU_IOAS_MAP_READABLE |
>> + IOMMU_IOAS_MAP_FIXED_IOVA,
>> + .ioas_id = ioas_id,
>> + .fd = mfd,
>> + .start = start,
>> + .iova = iova,
>> + .length = size,
>> + };
>> +
>> + if (!readonly) {
>> + map.flags |= IOMMU_IOAS_MAP_WRITEABLE;
>> + }
>> +
>> + ret = ioctl(fd, IOMMU_IOAS_MAP_FILE, &map);
>> + trace_iommufd_backend_map_file_dma(fd, ioas_id, iova, size, mfd, start,
>> + readonly, ret);
>> + if (ret) {
>> + ret = -errno;
>> +
>> + /* TODO: Not support mapping hardware PCI BAR region for now. */
>> + if (errno == EFAULT) {
>> + warn_report("IOMMU_IOAS_MAP_FILE failed: %m, PCI BAR?");
>
> I am not sure this warning can occur when the PCI BARs are mmapped
> in a VM with incompatible address spaces. My attempts produced EINVAL.
> Let's keep it for now until it is clarified.
>
>
>> + } else {
>> + error_report("IOMMU_IOAS_MAP_FILE failed: %m");
>
> Please remove this error report. It is redundant with the callers, which
> will report the same error.
These warnings and errors are copied as-is from iommufd_backend_map_dma.
I aim to be bug-for-bug compatible until the issues with mapping BARs
are resolved.
>> + }
>> + }
>> + return ret;
>> +}
>> +
>> int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>> hwaddr iova, ram_addr_t size)
>> {
>> diff --git a/backends/trace-events b/backends/trace-events
>> index 40811a3..f478e18 100644
>> --- a/backends/trace-events
>> +++ b/backends/trace-events
>> @@ -11,6 +11,7 @@ iommufd_backend_connect(int fd, bool owned, uint32_t users) "fd=%d owned=%d user
>> iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
>> iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
>> iommufd_backend_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, void *vaddr, bool readonly, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" addr=%p readonly=%d (%d)"
>> +iommufd_backend_map_file_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int fd, unsigned long start, bool readonly, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" fd=%d start=%ld readonly=%d (%d)"
>> iommufd_backend_unmap_dma_non_exist(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " Unmap nonexistent mapping: iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
>> iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
>> iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d"
>> diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
>> index 302cd4c..fbaf04a 100644
>> --- a/hw/vfio/container-base.c
>> +++ b/hw/vfio/container-base.c
>> @@ -21,7 +21,16 @@ int vfio_container_dma_map(VFIOContainerBase *bcontainer,
>> RAMBlock *rb)
>> {
>> VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
>> + int mfd = rb ? qemu_ram_get_fd(rb) : -1;
>> + if (mfd >= 0 && vioc->dma_map_file) {
>> + unsigned long start = vaddr - qemu_ram_get_host_addr(rb);
>> + unsigned long offset = qemu_ram_get_fd_offset(rb);
>> +
>> + vioc->dma_map_file(bcontainer, iova, size, mfd, start + offset,
>> + readonly);
>> + return 0;
>
> This is CPR related. Please add a dma_map_file helper and move the
> code above to a cpr file.
This is not specific to CPR. It has value independent of CPR: the kernel
represents the mappings as file mappings with folios rather than struct
pages. It would be proposed even if CPR did not exist.
>> + }
>> g_assert(vioc->dma_map);
>> return vioc->dma_map(bcontainer, iova, size, vaddr, readonly);
>> }
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index 42ba63f..a3e7edb 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -38,6 +38,18 @@ static int iommufd_cdev_map(const VFIOContainerBase *bcontainer, hwaddr iova,
>> iova, size, vaddr, readonly);
>> }
>> +static int iommufd_cdev_map_file(const VFIOContainerBase *bcontainer,
>> + hwaddr iova, ram_addr_t size,
>> + int fd, unsigned long start, bool readonly)
>> +{
>> + const VFIOIOMMUFDContainer *container =
>> + container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
>> +
>> + return iommufd_backend_map_file_dma(container->be,
>> + container->ioas_id,
>> + iova, size, fd, start, readonly);
>> +}
>> +
>> static int iommufd_cdev_unmap(const VFIOContainerBase *bcontainer,
>> hwaddr iova, ram_addr_t size,
>> IOMMUTLBEntry *iotlb)
>> @@ -806,6 +818,7 @@ static void vfio_iommu_iommufd_class_init(ObjectClass *klass, void *data)
>> vioc->hiod_typename = TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO;
>> vioc->dma_map = iommufd_cdev_map;
>> + vioc->dma_map_file = iommufd_cdev_map_file;
>> vioc->dma_unmap = iommufd_cdev_unmap;
>> vioc->attach_device = iommufd_cdev_attach;
>> vioc->detach_device = iommufd_cdev_detach;
>> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
>> index b1d76d6..0cab252 100644
>> --- a/include/exec/cpu-common.h
>> +++ b/include/exec/cpu-common.h
>> @@ -95,6 +95,7 @@ void qemu_ram_unset_idstr(RAMBlock *block);
>> const char *qemu_ram_get_idstr(RAMBlock *rb);
>> void *qemu_ram_get_host_addr(RAMBlock *rb);
>> ram_addr_t qemu_ram_get_offset(RAMBlock *rb);
>> +ram_addr_t qemu_ram_get_fd_offset(RAMBlock *rb);
>> ram_addr_t qemu_ram_get_used_length(RAMBlock *rb);
>> ram_addr_t qemu_ram_get_max_length(RAMBlock *rb);
>> bool qemu_ram_is_shared(RAMBlock *rb);
>> diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
>> index d82e256..4daa5f8 100644
>> --- a/include/hw/vfio/vfio-container-base.h
>> +++ b/include/hw/vfio/vfio-container-base.h
>> @@ -115,6 +115,9 @@ struct VFIOIOMMUClass {
>> int (*dma_map)(const VFIOContainerBase *bcontainer,
>> hwaddr iova, ram_addr_t size,
>> void *vaddr, bool readonly);
>> + int (*dma_map_file)(const VFIOContainerBase *bcontainer,
>> + hwaddr iova, ram_addr_t size,
>> + int fd, unsigned long start, bool readonly);
>> int (*dma_unmap)(const VFIOContainerBase *bcontainer,
>> hwaddr iova, ram_addr_t size,
>> IOMMUTLBEntry *iotlb);
>> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>> index cbab75b..ac700b8 100644
>> --- a/include/system/iommufd.h
>> +++ b/include/system/iommufd.h
>> @@ -43,6 +43,9 @@ void iommufd_backend_disconnect(IOMMUFDBackend *be);
>> bool iommufd_backend_alloc_ioas(IOMMUFDBackend *be, uint32_t *ioas_id,
>> Error **errp);
>> void iommufd_backend_free_id(IOMMUFDBackend *be, uint32_t id);
>> +int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>> + hwaddr iova, ram_addr_t size, int fd,
>> + unsigned long start, bool readonly);
>> int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas_id, hwaddr iova,
>> ram_addr_t size, void *vaddr, bool readonly);
>> int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>> diff --git a/system/physmem.c b/system/physmem.c
>> index 0bcfc6c..c41a80b 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -1569,6 +1569,11 @@ ram_addr_t qemu_ram_get_offset(RAMBlock *rb)
>> return rb->offset;
>> }
>> +ram_addr_t qemu_ram_get_fd_offset(RAMBlock *rb)
>> +{
>> + return rb->fd_offset;
>> +}
>
> Should go in its own patch.
OK.
- Steve
>> ram_addr_t qemu_ram_get_used_length(RAMBlock *rb)
>> {
>> return rb->used_length;
>
>
> Thanks,
>
> C.
* Re: [PATCH V1 20/26] vfio/iommufd: export iommufd_cdev_get_info_iova_range
2025-02-05 17:33 ` Cédric Le Goater
@ 2025-02-05 22:01 ` Steven Sistare
0 siblings, 0 replies; 64+ messages in thread
From: Steven Sistare @ 2025-02-05 22:01 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 2/5/2025 12:33 PM, Cédric Le Goater wrote:
> On 1/29/25 15:43, Steve Sistare wrote:
>> Export iommufd_cdev_get_info_iova_range for use by CPR.
>
> why does CPR need access to the IOVA ranges ?
This is explained in the commit message of a subsequent patch,
"vfio/iommufd: reconstruct device"
- Steve
>> No functional change.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/iommufd.c | 4 ++--
>> include/hw/vfio/vfio-common.h | 2 ++
>> 2 files changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index a3e7edb..2f888e5 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -442,8 +442,8 @@ static int iommufd_cdev_ram_block_discard_disable(bool state)
>> return ram_block_uncoordinated_discard_disable(state);
>> }
>> -static bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
>> - uint32_t ioas_id, Error **errp)
>> +bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
>> + uint32_t ioas_id, Error **errp)
>> {
>> VFIOContainerBase *bcontainer = &container->bcontainer;
>> g_autofree struct iommu_ioas_iova_ranges *info = NULL;
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 5a89aca..ca10abc 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -268,6 +268,8 @@ bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp);
>> void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer);
>> bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp);
>> void vfio_legacy_cpr_unregister_container(VFIOContainer *container);
>> +bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
>> + uint32_t ioas_id, Error **errp);
>> extern const MemoryRegionOps vfio_region_ops;
>> typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
>
* Re: [PATCH V1 21/26] iommufd: change process ioctl
2025-02-05 17:34 ` Cédric Le Goater
@ 2025-02-05 22:02 ` Steven Sistare
0 siblings, 0 replies; 64+ messages in thread
From: Steven Sistare @ 2025-02-05 22:02 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 2/5/2025 12:34 PM, Cédric Le Goater wrote:
> On 1/29/25 15:43, Steve Sistare wrote:
>> Define the change process ioctl
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> backends/iommufd.c | 20 ++++++++++++++++++++
>> backends/trace-events | 1 +
>> include/system/iommufd.h | 2 ++
>> 3 files changed, 23 insertions(+)
>>
>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>> index 6d29221..be5f6a3 100644
>> --- a/backends/iommufd.c
>> +++ b/backends/iommufd.c
>> @@ -73,6 +73,26 @@ static void iommufd_backend_class_init(ObjectClass *oc, void *data)
>> object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
>> }
>> +bool iommufd_change_process_capable(IOMMUFDBackend *be)
>> +{
>> + struct iommu_ioas_change_process args = {.size = sizeof(args)};
>> +
>> + return !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>> +}
>> +
>> +int iommufd_change_process(IOMMUFDBackend *be)
>> +{
>> + struct iommu_ioas_change_process args = {.size = sizeof(args)};
>> + int ret = ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>> +
>> + if (ret) {
>> + ret = -errno;
>> + error_report("IOMMU_IOAS_CHANGE_PROCESS fd %d failed: %m", be->fd);
>
> please add an 'Error **errp' parameter.
OK - steve
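
A rough sketch of the shape such a change could take, under loud assumptions:
SketchError is a hypothetical, much simplified stand-in for QEMU's Error
object, and the ioctl is replaced by a function pointer so the sketch is
self-contained. The two sketch_op_* functions are test doubles only.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an error object; not QEMU's Error API. */
typedef struct SketchError {
    char msg[128];
} SketchError;

/* Hypothetical operation type standing in for the ioctl call. */
typedef int (*sketch_op)(int fd);

/* Test doubles: one succeeds, one fails the way an old kernel might. */
static int sketch_op_ok(int fd)
{
    (void)fd;
    return 0;
}

static int sketch_op_fail(int fd)
{
    (void)fd;
    errno = ENOTTY;
    return -1;
}

/*
 * Run the operation. On failure, capture errno before anything can clobber
 * it, describe the failure in *err if the caller supplied one, and return
 * the negated errno. On success, return 0 and leave *err untouched.
 */
static int sketch_change_process(int fd, sketch_op op, SketchError *err)
{
    assert(op);
    if (op(fd) != 0) {
        int saved = errno;

        if (err) {
            snprintf(err->msg, sizeof(err->msg), "change process failed: %s",
                     strerror(saved));
        }
        return -saved;
    }
    return 0;
}
```

The caller then decides whether to report the message or propagate it.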
>> + }
>> + trace_iommufd_change_process(be->fd, ret);
>> + return ret;
>> +}
>> +
>> bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
>> {
>> int fd;
>> diff --git a/backends/trace-events b/backends/trace-events
>> index f478e18..9b33dc3 100644
>> --- a/backends/trace-events
>> +++ b/backends/trace-events
>> @@ -7,6 +7,7 @@ dbus_vmstate_loading(const char *id) "id: %s"
>> dbus_vmstate_saving(const char *id) "id: %s"
>> # iommufd.c
>> +iommufd_change_process(int fd, int ret) "fd=%d (%d)"
>> iommufd_backend_connect(int fd, bool owned, uint32_t users) "fd=%d owned=%d users=%d"
>> iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
>> iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
>> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>> index ac700b8..4e9c037 100644
>> --- a/include/system/iommufd.h
>> +++ b/include/system/iommufd.h
>> @@ -64,6 +64,8 @@ bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t hwpt_id,
>> uint64_t iova, ram_addr_t size,
>> uint64_t page_size, uint64_t *data,
>> Error **errp);
>> +bool iommufd_change_process_capable(IOMMUFDBackend *be);
>> +int iommufd_change_process(IOMMUFDBackend *be);
>> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
>> #endif
>
* Re: [PATCH V1 22/26] vfio/iommufd: invariant device name
2025-02-05 17:42 ` Cédric Le Goater
@ 2025-02-05 22:02 ` Steven Sistare
0 siblings, 0 replies; 64+ messages in thread
From: Steven Sistare @ 2025-02-05 22:02 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 2/5/2025 12:42 PM, Cédric Le Goater wrote:
> On 1/29/25 15:43, Steve Sistare wrote:
>> cpr-transfer will use the device name as a key to find the value
>> of the device descriptor in new QEMU. However, if the descriptor
>> number is specified by a command-line fd parameter, then
>> vfio_device_get_name creates a name that includes the fd number.
>> This causes a chicken-and-egg problem: new QEMU must know the fd
>> number to construct a name to find the fd number.
>>
>> To fix, create an invariant name based on the id command-line
>> parameter. If id is not defined, add a CPR blocker.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/helpers.c | 18 +++++++++++++++---
>> hw/vfio/iommufd.c | 2 ++
>> include/hw/vfio/vfio-common.h | 1 +
>> 3 files changed, 18 insertions(+), 3 deletions(-)
>>
>> diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
>> index 913796f..bd94b86 100644
>> --- a/hw/vfio/helpers.c
>> +++ b/hw/vfio/helpers.c
>> @@ -25,6 +25,8 @@
>> #include "hw/vfio/vfio-common.h"
>> #include "hw/hw.h"
>> #include "trace.h"
>> +#include "migration/blocker.h"
>> +#include "migration/cpr.h"
>> #include "qapi/error.h"
>> #include "qemu/error-report.h"
>> #include "qemu/units.h"
>> @@ -636,6 +638,7 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
>> {
>> ERRP_GUARD();
>> struct stat st;
>> + bool ret = true;
>> if (vbasedev->fd < 0) {
>> if (stat(vbasedev->sysfsdev, &st) < 0) {
>> @@ -653,15 +656,24 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
>> return false;
>> }
>> /*
>> - * Give a name with fd so any function printing out vbasedev->name
>> + * Give a name so any function printing out vbasedev->name
>> * will not break.
>> */
>> if (!vbasedev->name) {
>> - vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
>> + if (vbasedev->dev->id) {
>> + vbasedev->name = g_strdup(vbasedev->dev->id);
>> + } else {
>> + vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
>> + error_setg(&vbasedev->cpr_id_blocker,
>> + "vfio device with fd=%d needs an id property",
>> + vbasedev->fd);
>> + ret = migrate_add_blocker_modes(&vbasedev->cpr_id_blocker, errp,
>> + MIG_MODE_CPR_TRANSFER, -1) == 0;
>
> cpr helper please.
OK.
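
For reference, the naming rule that such a helper would wrap can be sketched
as below. This is an illustration only: sketch_device_name and its "stable"
out-parameter are hypothetical names, and the real code would add a migration
blocker rather than return a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Build a device name into buf. Prefer the user-supplied id, which is the
 * same in old and new QEMU. Otherwise fall back to an fd-based name, which
 * is not, and report that through *stable so the caller can block
 * cpr-transfer for this device.
 */
static void sketch_device_name(const char *id, int fd,
                               char *buf, size_t len, bool *stable)
{
    assert(buf && len > 0 && stable);
    if (id) {
        snprintf(buf, len, "%s", id);
        *stable = true;
    } else {
        snprintf(buf, len, "VFIO_FD%d", fd);
        *stable = false;
    }
}
```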
>> + }
>> }
>> }
>> - return true;
>> + return ret;
>> }
>> void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index 2f888e5..8308715 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -24,6 +24,7 @@
>> #include "system/reset.h"
>> #include "qemu/cutils.h"
>> #include "qemu/chardev_open.h"
>> +#include "migration/blocker.h"
>> #include "pci.h"
>> #include "exec/ram_addr.h"
>> @@ -657,6 +658,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
>> iommufd_cdev_container_destroy(container);
>> vfio_put_address_space(space);
>> + migrate_del_blocker(&vbasedev->cpr_id_blocker);
>> iommufd_cdev_unbind_and_disconnect(vbasedev);
>> close(vbasedev->fd);
>> }
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index ca10abc..37e7c26 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -147,6 +147,7 @@ typedef struct VFIODevice {
>> VFIOMigration *migration;
>> Error *migration_blocker;
>> Error *cpr_mdev_blocker;
>> + Error *cpr_id_blocker;
>
> a struct VFIODeviceCPR would be welcome.
OK.
- Steve
>> OnOffAuto pre_copy_dirty_page_tracking;
>> OnOffAuto device_dirty_page_tracking;
>> bool dirty_pages_supported;
>
* Re: [PATCH V1 23/26] vfio/iommufd: register container for cpr
2025-02-05 17:45 ` Cédric Le Goater
@ 2025-02-05 22:03 ` Steven Sistare
0 siblings, 0 replies; 64+ messages in thread
From: Steven Sistare @ 2025-02-05 22:03 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 2/5/2025 12:45 PM, Cédric Le Goater wrote:
> On 1/29/25 15:43, Steve Sistare wrote:
>> Register a vfio iommufd container for CPR. Add a blocker if the kernel does
>> not support IOMMU_IOAS_CHANGE_PROCESS.
>>
>> This is mostly boilerplate. The fields to be saved and restored are added
>> in subsequent patches.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/cpr-iommufd.c | 96 +++++++++++++++++++++++++++++++++++++++++++
>> hw/vfio/iommufd.c | 12 +++---
>> hw/vfio/meson.build | 1 +
>> include/hw/vfio/vfio-common.h | 6 +++
>> 4 files changed, 110 insertions(+), 5 deletions(-)
>> create mode 100644 hw/vfio/cpr-iommufd.c
>>
>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>> new file mode 100644
>> index 0000000..4eb358a
>> --- /dev/null
>> +++ b/hw/vfio/cpr-iommufd.c
>> @@ -0,0 +1,96 @@
>> +/*
>> + * Copyright (c) 2024-2025 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qapi/error.h"
>> +#include "hw/vfio/vfio-common.h"
>> +#include "migration/blocker.h"
>> +#include "migration/cpr.h"
>> +#include "migration/migration.h"
>> +#include "migration/vmstate.h"
>> +#include "system/iommufd.h"
>> +
>> +static bool vfio_cpr_supported(VFIOIOMMUFDContainer *container, Error **errp)
>> +{
>> + if (!iommufd_change_process_capable(container->be)) {
>> + error_setg(errp,
>> + "VFIO container does not support IOMMU_IOAS_CHANGE_PROCESS");
>> + return false;
>> + }
>> + return true;
>> +}
>> +
>> +static const VMStateDescription vfio_container_vmstate = {
>> + .name = "vfio-iommufd-container",
>> + .version_id = 0,
>> + .minimum_version_id = 0,
>> + .needed = cpr_needed_for_reuse,
>> + .fields = (VMStateField[]) {
>> + VMSTATE_END_OF_LIST()
>> + }
>> +};
>> +
>> +static const VMStateDescription iommufd_cpr_vmstate = {
>> + .name = "iommufd",
>> + .version_id = 0,
>> + .minimum_version_id = 0,
>> + .needed = cpr_needed_for_reuse,
>> + .fields = (VMStateField[]) {
>> + VMSTATE_END_OF_LIST()
>> + }
>> +};
>> +
>> +bool vfio_iommufd_cpr_register_container(VFIOIOMMUFDContainer *container,
>> + Error **errp)
>> +{
>> + VFIOContainerBase *bcontainer = &container->bcontainer;
>> + Error **cpr_blocker = &container->cpr_blocker;
>> +
>> + if (!vfio_cpr_register_container(bcontainer, errp)) {
>> + return false;
>> + }
>> +
>> + if (!vfio_cpr_supported(container, cpr_blocker)) {
>> + return migrate_add_blocker_modes(cpr_blocker, errp,
>> + MIG_MODE_CPR_TRANSFER, -1) == 0;
>> + }
>> +
>> + vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>> + vmstate_register(NULL, -1, &iommufd_cpr_vmstate, container->be);
>> +
>> + return true;
>> +}
>> +
>> +void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
>> +{
>> + VFIOContainerBase *bcontainer = &container->bcontainer;
>> +
>> + vmstate_unregister(NULL, &iommufd_cpr_vmstate, container->be);
>> + vmstate_unregister(NULL, &vfio_container_vmstate, container);
>> + migrate_del_blocker(&container->cpr_blocker);
>> + vfio_cpr_unregister_container(bcontainer);
>> +}
>> +
>> +static const VMStateDescription vfio_device_vmstate = {
>> + .name = "vfio-iommufd-device",
>> + .version_id = 0,
>> + .minimum_version_id = 0,
>> + .needed = cpr_needed_for_reuse,
>> + .fields = (VMStateField[]) {
>> + VMSTATE_END_OF_LIST()
>> + }
>> +};
>> +
>> +void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
>> +{
>> + vmstate_register(NULL, -1, &vfio_device_vmstate, vbasedev);
>> +}
>> +
>> +void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
>> +{
>> + vmstate_unregister(NULL, &vfio_device_vmstate, vbasedev);
>> +}
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index 8308715..ae78e00 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -592,6 +592,10 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>> bcontainer->initialized = true;
>> + if (!vfio_iommufd_cpr_register_container(container, errp)) {
>> + goto err_listener_register;
>> + }
>> +
>
> why this change ?
vfio_iommufd_cpr_register_container() registers empty vmstate handlers in this
patch, but additional fields are added to the vmstate in subsequent patches.
- Steve
>> found_container:
>> ret = ioctl(devfd, VFIO_DEVICE_GET_INFO, &dev_info);
>> if (ret) {
>> @@ -599,10 +603,6 @@ found_container:
>> goto err_listener_register;
>> }
>> - if (!vfio_cpr_register_container(bcontainer, errp)) {
>> - goto err_listener_register;
>> - }
>> -
>> /*
>> * TODO: examine RAM_BLOCK_DISCARD stuff, should we do group level
>> * for discarding incompatibility check as well?
>> @@ -619,6 +619,7 @@ found_container:
>> vbasedev->bcontainer = bcontainer;
>> QLIST_INSERT_HEAD(&bcontainer->device_list, vbasedev, container_next);
>> QLIST_INSERT_HEAD(&vfio_device_list, vbasedev, global_next);
>> + vfio_iommufd_cpr_register_device(vbasedev);
>> trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
>> vbasedev->num_regions, vbasedev->flags);
>> @@ -653,12 +654,13 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
>> iommufd_cdev_ram_block_discard_disable(false);
>> }
>> - vfio_cpr_unregister_container(bcontainer);
>> + vfio_iommufd_cpr_unregister_container(container);
>> iommufd_cdev_detach_container(vbasedev, container);
>> iommufd_cdev_container_destroy(container);
>> vfio_put_address_space(space);
>> migrate_del_blocker(&vbasedev->cpr_id_blocker);
>> + vfio_iommufd_cpr_unregister_device(vbasedev);
>> iommufd_cdev_unbind_and_disconnect(vbasedev);
>> close(vbasedev->fd);
>> }
>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>> index 5487815..998adb5 100644
>> --- a/hw/vfio/meson.build
>> +++ b/hw/vfio/meson.build
>> @@ -13,6 +13,7 @@ vfio_ss.add(when: 'CONFIG_IOMMUFD', if_true: files(
>> vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
>> 'cpr.c',
>> 'cpr-legacy.c',
>> + 'cpr-iommufd.c',
>> 'display.c',
>> 'pci-quirks.c',
>> 'pci.c',
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 37e7c26..add44d4 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -113,6 +113,7 @@ typedef struct VFIOIOASHwpt {
>> typedef struct VFIOIOMMUFDContainer {
>> VFIOContainerBase bcontainer;
>> IOMMUFDBackend *be;
>> + Error *cpr_blocker;
>> uint32_t ioas_id;
>> QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>> } VFIOIOMMUFDContainer;
>> @@ -271,6 +272,11 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp);
>> void vfio_legacy_cpr_unregister_container(VFIOContainer *container);
>> bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
>> uint32_t ioas_id, Error **errp);
>> +bool vfio_iommufd_cpr_register_container(VFIOIOMMUFDContainer *container,
>> + Error **errp);
>> +void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container);
>> +void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev);
>> +void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev);
>> extern const MemoryRegionOps vfio_region_ops;
>> typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
>
* Re: [PATCH V1 02/26] migration: lower handler priority
2025-02-03 16:58 ` Peter Xu
@ 2025-02-06 13:39 ` Steven Sistare
0 siblings, 0 replies; 64+ messages in thread
From: Steven Sistare @ 2025-02-06 13:39 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum,
Fabiano Rosas
On 2/3/2025 11:58 AM, Peter Xu wrote:
> On Wed, Jan 29, 2025 at 06:42:58AM -0800, Steve Sistare wrote:
>> Define a vmstate priority that is lower than the default, so its handlers
>> run after all default priority handlers. Since 0 is no longer the default
>> priority, translate an uninitialized priority of 0 to MIG_PRI_DEFAULT.
>>
>> CPR for vfio will use this to install handlers for containers that run
>> after handlers for the devices that they contain.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> include/migration/vmstate.h | 3 ++-
>> migration/savevm.c | 4 ++--
>> 2 files changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
>> index a1dfab4..3055a46 100644
>> --- a/include/migration/vmstate.h
>> +++ b/include/migration/vmstate.h
>> @@ -155,7 +155,8 @@ enum VMStateFlags {
>> };
>>
>> typedef enum {
>> - MIG_PRI_DEFAULT = 0,
>
> Shall we still keep a definition for 0? Or at least add a comment link to
> save_state_priority() - it might be helpful for whoever jumps to this enum
> definition when reading and gets confused about how a default value is non-zero.
>
> Or define it as something like:
>
> MIG_PRI_UNINITIALIZED = 0, /* Most devices don't set a priority, it will
> * be routed to MIG_PRI_DEFAULT */
Sure, I'll add MIG_PRI_UNINITIALIZED.
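
Roughly along these lines, as a sketch only. The Sketch* names are
hypothetical and the enum is abbreviated; the point is just that zero becomes
a named "not set" value that is translated to the default when priorities are
compared.

```c
#include <assert.h>

typedef enum {
    SKETCH_PRI_UNINITIALIZED = 0, /* vmsd did not set a priority */
    SKETCH_PRI_LOW,               /* runs after default-priority handlers */
    SKETCH_PRI_DEFAULT,
    SKETCH_PRI_IOMMU,             /* runs before default-priority handlers */
    SKETCH_PRI_MAX,
} SketchPriority;

/* Map the zero-initialized value to the default; pass others through. */
static SketchPriority sketch_effective_priority(SketchPriority declared)
{
    assert(declared < SKETCH_PRI_MAX);
    return declared == SKETCH_PRI_UNINITIALIZED ? SKETCH_PRI_DEFAULT
                                                : declared;
}
```

With this layout a descriptor that never mentions priority still sorts above
SKETCH_PRI_LOW, which is what lets the low-priority handlers run last.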
- Steve
>> + MIG_PRI_LOW = 1, /* Must happen after default */
>> + MIG_PRI_DEFAULT,
>> MIG_PRI_IOMMU, /* Must happen before PCI devices */
>> MIG_PRI_PCI_BUS, /* Must happen before IOMMU */
>> MIG_PRI_VIRTIO_MEM, /* Must happen before IOMMU */
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index 264bc06..5dd2dc4 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -232,7 +232,7 @@ typedef struct SaveState {
>>
>> static SaveState savevm_state = {
>> .handlers = QTAILQ_HEAD_INITIALIZER(savevm_state.handlers),
>> - .handler_pri_head = { [MIG_PRI_DEFAULT ... MIG_PRI_MAX] = NULL },
>> + .handler_pri_head = { [0 ... MIG_PRI_MAX] = NULL },
>> .global_section_id = 0,
>> };
>>
>> @@ -704,7 +704,7 @@ static int calculate_compat_instance_id(const char *idstr)
>>
>> static inline MigrationPriority save_state_priority(SaveStateEntry *se)
>> {
>> - if (se->vmsd) {
>> + if (se->vmsd && se->vmsd->priority) {
>> return se->vmsd->priority;
>> }
>> return MIG_PRI_DEFAULT;
>> --
>> 1.8.3.1
>>
>
* Re: [PATCH V1 12/26] vfio-pci: preserve MSI
2025-02-05 16:48 ` Cédric Le Goater
@ 2025-02-06 14:41 ` Steven Sistare
0 siblings, 0 replies; 64+ messages in thread
From: Steven Sistare @ 2025-02-06 14:41 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 2/5/2025 11:48 AM, Cédric Le Goater wrote:
> On 1/29/25 15:43, Steve Sistare wrote:
>> Save the MSI message area as part of vfio-pci vmstate, and preserve the
>> interrupt and notifier eventfd's. migrate_incoming loads the MSI data,
>> then the vfio-pci post_load handler finds the eventfds in CPR state,
>> rebuilds vector data structures, and attaches the interrupts to the new
>> KVM instance.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/pci.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 116 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index fa77c36..df6e298 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -56,11 +56,37 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
>> static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
>> static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
>> +#define EVENT_FD_NAME(vdev, name) \
>> + g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
>
> hmm, this helper could lead to memory leaks if not used as done below.
> Being explicit would be safer.
How about renaming it ALLOC_EVENT_FD_NAME?
If not, I will use g_strdup_printf at the call sites but define a symbol
for the format string.
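
As a sketch of the second option, using plain libc instead of glib so that it
stands alone (the SKETCH_/sketch_ names are hypothetical): the format lives in
one symbol, and the allocating function says in its name and comment that the
caller owns the result.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Single definition of the "<device>_<event>" naming scheme. */
#define SKETCH_EVENT_FD_NAME_FMT "%s_%s"

/*
 * Return a newly allocated "<dev>_<name>" string, or NULL on failure.
 * The caller owns the result and must free() it.
 */
static char *sketch_alloc_event_fd_name(const char *dev, const char *name)
{
    int n;
    char *s;

    assert(dev && name);
    n = snprintf(NULL, 0, SKETCH_EVENT_FD_NAME_FMT, dev, name);
    if (n < 0) {
        return NULL;
    }
    s = malloc((size_t)n + 1);
    if (s) {
        snprintf(s, (size_t)n + 1, SKETCH_EVENT_FD_NAME_FMT, dev, name);
    }
    return s;
}
```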
>> +static void save_event_fd(VFIOPCIDevice *vdev, const char *name, int nr,
>> + EventNotifier *ev)
>> +{
>> + int fd = event_notifier_get_fd(ev);
>> +
>> + if (fd >= 0) {
>> + g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
>> + cpr_resave_fd(fdname, nr, fd);
>> + }
>> +}
>> +
>> +static int load_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
>> +{
>> + g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
>> + return cpr_find_fd(fdname, nr);
>> +}
>> +
>> +static void delete_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
>> +{
>> + g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
>> + cpr_delete_fd(fdname, nr);
>> +}
>> +
>
> please move these helpers to a cpr file. They are not strictly VFIO
> related too. So they could be moved outside of hw/vfio.
OK. Moving to migration/cpr.c.
>> /* Create new or reuse existing eventfd */
>> static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
>> const char *name, int nr)
>> {
>> - int fd = -1; /* placeholder until a subsequent patch */
>> + int fd = load_event_fd(vdev, name, nr);
>
>
>> int ret = 0;
>> if (fd >= 0) {
>> @@ -71,6 +97,8 @@ static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
>> Error *err = NULL;
>> error_setg_errno(&err, -ret, "vfio_notifier_init %s failed", name);
>> error_report_err(err);
>> + } else {
>> + save_event_fd(vdev, name, nr, e);
>
> I'd rather move the CPR related fd handling which is done in
> vfio_notifier_init() in a cpr routine which vfio_notifier_init()
> would call. This comment applies to all the series. Anything
> related to CPR should be handled explicitly:
>
> if (cpr_in_progress) {
> cpr_do_cpr_related_stuff()
> }
>
> It will ease reading and long term maintenance.
That design pattern does not apply to this call site. The event fd must
be saved unconditionally, in case a cpr operation is performed later.
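
To make the lifecycle concrete, here is a toy model of the bookkeeping. It is
a sketch under assumptions, not the cpr state code: a fixed-size table keyed
by (name, id), with hypothetical sketch_* functions in the roles that
cpr_resave_fd, cpr_find_fd and cpr_delete_fd play in the patch.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_MAX_FDS 16

struct sketch_fd_entry {
    char name[64];
    int id;
    int fd;
    int used;
};

static struct sketch_fd_entry sketch_fds[SKETCH_MAX_FDS];

static struct sketch_fd_entry *sketch_lookup(const char *name, int id)
{
    for (int i = 0; i < SKETCH_MAX_FDS; i++) {
        struct sketch_fd_entry *e = &sketch_fds[i];

        if (e->used && e->id == id && strcmp(e->name, name) == 0) {
            return e;
        }
    }
    return NULL;
}

/* Record fd under (name, id), replacing any earlier value. 0 or -1. */
static int sketch_save_fd(const char *name, int id, int fd)
{
    struct sketch_fd_entry *e = sketch_lookup(name, id);

    assert(fd >= 0);
    if (!e) {
        if (strlen(name) >= sizeof(sketch_fds[0].name)) {
            return -1;
        }
        for (int i = 0; i < SKETCH_MAX_FDS && !e; i++) {
            if (!sketch_fds[i].used) {
                e = &sketch_fds[i];
            }
        }
        if (!e) {
            return -1;  /* table full */
        }
        strcpy(e->name, name);
        e->id = id;
        e->used = 1;
    }
    e->fd = fd;
    return 0;
}

/* Return the fd recorded under (name, id), or -1 if there is none. */
static int sketch_find_fd(const char *name, int id)
{
    struct sketch_fd_entry *e = sketch_lookup(name, id);

    return e ? e->fd : -1;
}

/* Forget (name, id); a no-op if it was never recorded. */
static void sketch_delete_fd(const char *name, int id)
{
    struct sketch_fd_entry *e = sketch_lookup(name, id);

    if (e) {
        e->used = 0;
    }
}
```

In this model the save happens when the notifier is created, long before
anyone knows whether an update will be requested, so a later lookup can only
succeed if every creation path recorded its descriptor.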
>> }
>> }
>> return ret;
>> @@ -79,6 +107,7 @@ static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
>> static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
>> const char *name, int nr)
>> {
>> + delete_event_fd(vdev, name, nr);
>> event_notifier_cleanup(e);
>> }
>> @@ -561,6 +590,15 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>> int ret;
>> bool resizing = !!(vdev->nr_vectors < nr + 1);
>> + /*
>> + * Ignore the callback from msix_set_vector_notifiers during resume.
>> + * The necessary subset of these actions is called from vfio_claim_vectors
>> + * during post load.
>> + */
>> + if (vdev->vbasedev.reused) {
>> + return 0;
>> + }
>
> again, I would prefer some explicit CPR test. Same below.
Reused is an explicit cpr test, true iff an incoming cpr operation is in
progress. I prefer the short name, but if it would help you to see cpr
leap off the page, I'll rename it cpr_reused.
>> trace_vfio_msix_vector_do_use(vdev->vbasedev.name, nr);
>> vector = &vdev->msi_vectors[nr];
>> @@ -2896,6 +2934,11 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
>> fd = event_notifier_get_fd(&vdev->err_notifier);
>> qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
>> + /* Do not alter irq_signaling during vfio_realize for cpr */
>> + if (vdev->vbasedev.reused) {
>> + return;
>> + }
>> +
>> if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_ERR_IRQ_INDEX, 0,
>> VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
>> error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
>> @@ -2960,6 +3003,12 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
>> fd = event_notifier_get_fd(&vdev->req_notifier);
>> qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
>> + /* Do not alter irq_signaling during vfio_realize for cpr */
>> + if (vdev->vbasedev.reused) {
>> + vdev->req_enabled = true;
>> + return;
>> + }
>> +
>> if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_REQ_IRQ_INDEX, 0,
>> VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
>> error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
>> @@ -3454,6 +3503,46 @@ static void vfio_pci_set_fd(Object *obj, const char *str, Error **errp)
>> }
>> #endif
>> +static void vfio_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors, bool msix)
>> +{
>> + int i, fd;
>> + bool pending = false;
>> + PCIDevice *pdev = &vdev->pdev;
>> +
>> + vdev->nr_vectors = nr_vectors;
>> + vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
>> + vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
>> +
>> + vfio_prepare_kvm_msi_virq_batch(vdev);
>> +
>> + for (i = 0; i < nr_vectors; i++) {
>> + VFIOMSIVector *vector = &vdev->msi_vectors[i];
>> +
>> + fd = load_event_fd(vdev, "interrupt", i);
>> + if (fd >= 0) {
>> + vfio_vector_init(vdev, i);
>> + qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
>> + }
>> +
>> + if (load_event_fd(vdev, "kvm_interrupt", i) >= 0) {
>> + vfio_add_kvm_msi_virq(vdev, vector, i, msix);
>> + } else {
>> + vdev->msi_vectors[i].virq = -1;
>> + }
>> +
>> + if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
>> + set_bit(i, vdev->msix->pending);
>> + pending = true;
>> + }
>> + }
>> +
>> + vfio_commit_kvm_msi_virq_batch(vdev);
>> +
>> + if (msix) {
>> + memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
>> + }
>> +}
>
> move to a cpr file please. We can have a vfio-pci lib/common file
> for external users. It will take more work to get the interface right
> but it will benefit other proposals. I think vfio-user has more or less
> the same needs.
OK.
>> +
>> /*
>> * The kernel may change non-emulated config bits. Exclude them from the
>> * changed-bits check in get_pci_config_device.
>> @@ -3472,13 +3561,39 @@ static int vfio_pci_pre_load(void *opaque)
>> return 0;
>> }
>> +static int vfio_pci_post_load(void *opaque, int version_id)
>> +{
>> + VFIOPCIDevice *vdev = opaque;
>> + PCIDevice *pdev = &vdev->pdev;
>> + int nr_vectors;
>> +
>> + if (msix_enabled(pdev)) {
>> + msix_set_vector_notifiers(pdev, vfio_msix_vector_use,
>> + vfio_msix_vector_release, NULL);
>> + nr_vectors = vdev->msix->entries;
>> + vfio_claim_vectors(vdev, nr_vectors, true);
>> +
>> + } else if (msi_enabled(pdev)) {
>> + nr_vectors = msi_nr_vectors_allocated(pdev);
>> + vfio_claim_vectors(vdev, nr_vectors, false);
>> +
>> + } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
>> + g_assert_not_reached(); /* completed in a subsequent patch */
>> + }
>> +
>> + return 0;
>> +}
>> +
>> static const VMStateDescription vfio_pci_vmstate = {
>> .name = "vfio-pci",
>> .version_id = 0,
>> .minimum_version_id = 0,
>> .pre_load = vfio_pci_pre_load,
>> + .post_load = vfio_pci_post_load,
>> .needed = cpr_needed_for_reuse,
>> .fields = (VMStateField[]) {
>> + VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
>> + VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
>> VMSTATE_END_OF_LIST()
>> }
>> };
>
>
> I think you can move vfio_pci_vmstate out of hw/vfio/pci.c too. Only
> cpr needs it.
OK.
- Steve
* Re: [PATCH V1 13/26] vfio-pci: preserve INTx
2025-02-05 17:13 ` Cédric Le Goater
@ 2025-02-06 14:43 ` Steven Sistare
0 siblings, 0 replies; 64+ messages in thread
From: Steven Sistare @ 2025-02-06 14:43 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel
Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas
On 2/5/2025 12:13 PM, Cédric Le Goater wrote:
> On 1/29/25 15:43, Steve Sistare wrote:
>> Preserve vfio INTx state across cpr-transfer. Preserve VFIOINTx fields as
>> follows:
>> pin : Recover this from the vfio config in kernel space
>> interrupt : Preserve its eventfd descriptor across exec.
>> unmask : Ditto
>> route.irq : This could perhaps be recovered in vfio_pci_post_load by
>> calling pci_device_route_intx_to_irq(pin), whose implementation reads
>> config space for a bridge device such as ich9. However, there is no
>> guarantee that the bridge vmstate is read before vfio vmstate. Rather
>> than fiddling with MigrationPriority for vmstate handlers, explicitly
>> save route.irq in vfio vmstate.
>> pending : save in vfio vmstate.
>> mmap_timeout, mmap_timer : Re-initialize
>> bool kvm_accel : Re-initialize
>>
>> In vfio_realize, defer calling vfio_intx_enable until the vmstate
>> is available, in vfio_pci_post_load. Modify vfio_intx_enable and
>> vfio_intx_kvm_enable to skip vfio initialization, but still perform
>> kvm initialization.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/pci.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++----
>> 1 file changed, 47 insertions(+), 4 deletions(-)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index df6e298..c50dbef 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -184,12 +184,17 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
>> return true;
>> }
>> + if (vdev->vbasedev.reused) {
>
> 1 x vdev->vbasedev.reused
>
>> + goto skip_state;
>> + }
>> +
>> /* Get to a known interrupt state */
>> qemu_set_fd_handler(irq_fd, NULL, NULL, vdev);
>> vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
>> vdev->intx.pending = false;
>> pci_irq_deassert(&vdev->pdev);
>> +skip_state:
>
>
> hmm, this skip_state label and ...
>
>> /* Get an eventfd for resample/unmask */
>> if (vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0)) {
>> error_setg(errp, "vfio_notifier_init intx-unmask failed");
>> @@ -204,6 +209,10 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
>> goto fail_irqfd;
>> }
>> + if (vdev->vbasedev.reused) {
>
> 2 x vdev->vbasedev.reused
>
>> + goto skip_irq;
>> + }
>> +
>> if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
>> VFIO_IRQ_SET_ACTION_UNMASK,
>> event_notifier_get_fd(&vdev->intx.unmask),
>> @@ -214,6 +223,7 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
>> /* Let'em rip */
>> vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
>> +skip_irq:
>
> ... this skip_irq label are one "very quick" way to get things done :)
I chose goto's and skip labels for your benefit as a reviewer: they reduce
the diff, so you can see that the non-cpr code is not changed. They were not
a shortcut. But if you prefer, I can use conditional blocks instead of goto's,
and let indentation create additional diffs:
    if (reused)
        goto skip;
    non-cpr code;
  skip:
vs
    if (!reused) {
        non-cpr code;
    }
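The two styles compared above behave identically; only the diff size differs. A minimal sketch, assuming a hypothetical Dev struct and counters that stand in for the real vfio and kvm setup steps (none of these names come from QEMU):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for device state; not a QEMU type. */
typedef struct {
    bool reused;      /* true while a preserved device is being restored */
    int vfio_setup;   /* counts "non-cpr" initialization steps */
    int kvm_setup;    /* counts steps that run in both cases */
} Dev;

/* Style 1: goto and a skip label; existing lines keep their indentation. */
static void enable_with_goto(Dev *d)
{
    if (d->reused) {
        goto skip;
    }
    d->vfio_setup++;        /* non-cpr code */
skip:
    d->kvm_setup++;         /* common code */
}

/* Style 2: conditional block; same behavior, but the body is re-indented. */
static void enable_with_block(Dev *d)
{
    if (!d->reused) {
        d->vfio_setup++;    /* non-cpr code */
    }
    d->kvm_setup++;         /* common code */
}
```

With reused set, both variants skip the vfio step and still run the common step, so the choice is about reviewability rather than behavior.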
>> vdev->intx.kvm_accel = true;
>> trace_vfio_intx_enable_kvm(vdev->vbasedev.name);
>> @@ -329,7 +339,13 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>> return true;
>> }
>> - vfio_disable_interrupts(vdev);
>> + /*
>> + * Do not alter interrupt state during vfio_realize and cpr load. The
>> + * reused flag is cleared thereafter.
>> + */
>> + if (!vdev->vbasedev.reused) {
>
> 3 x vdev->vbasedev.reused
>
>> + vfio_disable_interrupts(vdev);
>> + }
>> vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
>> pci_config_set_interrupt_pin(vdev->pdev.config, pin);
>> @@ -351,7 +367,8 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>> fd = event_notifier_get_fd(&vdev->intx.interrupt);
>> qemu_set_fd_handler(fd, vfio_intx_interrupt, NULL, vdev);
>> - if (!vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
>> + if (!vdev->vbasedev.reused &&
>
> 4 x vdev->vbasedev.reused
>
>> + !vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
>> VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
>> qemu_set_fd_handler(fd, NULL, NULL, vdev);
>> vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
>
>> @@ -3256,7 +3273,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>> vfio_intx_routing_notifier);
>> vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
>> kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
>> - if (!vfio_intx_enable(vdev, errp)) {
>> + /* Wait until cpr load reads intx routing data to enable */
>> + if (!vdev->vbasedev.reused && !vfio_intx_enable(vdev, errp)) {
>
> 5 x vdev->vbasedev.reused
>
> This patch already adds a test on vdev->vbasedev.reused at the top of
> vfio_intx_enable(). This one seems redundant.
This test is necessary. I will expand the comment to be more explicit:
/*
* During CPR, do not call vfio_intx_enable at this time. Instead,
* call it from vfio_pci_post_load after the intx routing data has
* been loaded from vmstate.
*/
if (!vdev->vbasedev.reused && !vfio_intx_enable(vdev, errp)) {
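The ordering described in that comment can be sketched as follows. This is a simplified model, not the QEMU code: Dev, realize, post_load, and intx_enable are hypothetical stand-ins, and the saved_route_irq argument stands in for the route.irq value that the vmstate load would supply.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device model; field names are illustrative only. */
typedef struct {
    bool reused;        /* set while CPR state is being loaded */
    int route_irq;      /* known only after vmstate load during CPR */
    bool intx_enabled;
    int enabled_irq;    /* route observed at the time INTx was enabled */
} Dev;

static bool intx_enable(Dev *d)
{
    d->enabled_irq = d->route_irq;
    d->intx_enabled = true;
    return true;
}

/* Realize: during CPR, do not enable INTx yet; routing data is not loaded. */
static bool realize(Dev *d)
{
    if (!d->reused && !intx_enable(d)) {
        return false;
    }
    return true;
}

/* Post-load: routing data is now available, so enabling is safe. */
static int post_load(Dev *d, int saved_route_irq)
{
    d->route_irq = saved_route_irq;     /* stands in for the vmstate load */
    return intx_enable(d) ? 0 : -1;
}
```

In this model the check in realize decides whether the enable routine is called at all, which is separate from any checks inside the routine that decide which steps it performs.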
> Please duplicate the whole vfio_intx_enable() routine and move it
> under a cpr file.
Do you just mean vfio_intx_enable? Or also vfio_intx_enable_kvm? The
occurrences of vdev->vbasedev.reused that you flag occur in both.
I coded with reused conditionals and "skip" labels for a good reason. By
keeping the common logic inline with the cpr conditionals, I minimize the
chance that changes in the common logic will break cpr. Conversely,
outlining cpr specific versions of these functions and duplicating common
code creates the very real possibility that changes in vfio core code will
not be made in the cpr copies, and break cpr.
>> goto out_deregister;
>> }
>> }
>> @@ -3578,12 +3596,36 @@ static int vfio_pci_post_load(void *opaque, int version_id)
>> vfio_claim_vectors(vdev, nr_vectors, false);
>> } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
>> - g_assert_not_reached(); /* completed in a subsequent patch */
>> + Error *err = NULL;
>> + if (!vfio_intx_enable(vdev, &err)) {
>> + error_report_err(err);
>> + return -1;
>> + }
>> }
>> return 0;
>> }
>> +static const VMStateDescription vfio_intx_vmstate = {
>> + .name = "vfio-intx",
>> + .version_id = 0,
>> + .minimum_version_id = 0,
>> + .fields = (VMStateField[]) {
>> + VMSTATE_BOOL(pending, VFIOINTx),
>> + VMSTATE_UINT32(route.mode, VFIOINTx),
>> + VMSTATE_INT32(route.irq, VFIOINTx),
>> + VMSTATE_END_OF_LIST()
>> + }
>> +};
>> +
>> +#define VMSTATE_VFIO_INTX(_field, _state) { \
>> + .name = (stringify(_field)), \
>> + .size = sizeof(VFIOINTx), \
>> + .vmsd = &vfio_intx_vmstate, \
>> + .flags = VMS_STRUCT, \
>> + .offset = vmstate_offset_value(_state, _field, VFIOINTx), \
>> +}
>> +
>
> move these to cpr file please.
OK.
- Steve
>> static const VMStateDescription vfio_pci_vmstate = {
>> .name = "vfio-pci",
>> .version_id = 0,
>> @@ -3594,6 +3636,7 @@ static const VMStateDescription vfio_pci_vmstate = {
>> .fields = (VMStateField[]) {
>> VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
>> VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
>> + VMSTATE_VFIO_INTX(intx, VFIOPCIDevice),
>> VMSTATE_END_OF_LIST()
>> }
>> };
>
* Re: [PATCH V1 16/26] vfio: return mr from vfio_get_xlat_addr
2025-02-04 17:42 ` Steven Sistare
@ 2025-02-16 23:19 ` John Levon
0 siblings, 0 replies; 64+ messages in thread
From: John Levon @ 2025-02-16 23:19 UTC (permalink / raw)
To: Steven Sistare
Cc: Cédric Le Goater, qemu-devel, Alex Williamson, Yi Liu,
Eric Auger, Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum,
Peter Xu, Fabiano Rosas
On Tue, Feb 04, 2025 at 12:42:20PM -0500, Steven Sistare wrote:
> On 2/4/2025 10:47 AM, Cédric Le Goater wrote:
> > + John (for vfio-user)
> >
> > On 1/29/25 15:43, Steve Sistare wrote:
> > > Return the memory region that the translated address is found in, for
> > > use in a subsequent patch. No functional change.
> >
> > Keeping a reference on this memory region could be risky. What for ?
>
> The returned mr is briefly used here in later patches:
>
> vfio_iommu_map_notify()
> vfio_get_xlat_addr(&mr)
> vfio_container_dma_map(mr->ram_block) ******
> if ram_block is right
> vioc->dma_map_file()
> else
> vioc->dma_map()
The need for ->ram_block in dma map/unmap is exactly the case for vfio-user too.
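A rough sketch of the dispatch outlined in the quoted call chain, under stated assumptions: the Block and Ops types below are hypothetical, and the "ram_block is right" test is assumed here to mean "the block has a backing file descriptor and the backend provides a file-based map callback". The real condition is not spelled out in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are not the QEMU definitions. */
typedef struct {
    int fd;              /* backing file descriptor, or -1 if anonymous */
    uint64_t fd_offset;  /* file offset of the range being mapped */
} Block;

typedef struct {
    /* Map by virtual address. */
    int (*dma_map)(uint64_t iova, uint64_t size, void *vaddr);
    /* Map by file descriptor and offset; optional, may be NULL. */
    int (*dma_map_file)(uint64_t iova, uint64_t size, int fd, uint64_t offset);
} Ops;

enum { MAPPED_BY_VADDR = 1, MAPPED_BY_FILE = 2 };

static int map_by_vaddr(uint64_t iova, uint64_t size, void *vaddr)
{
    (void)iova; (void)size; (void)vaddr;
    return MAPPED_BY_VADDR;
}

static int map_by_file(uint64_t iova, uint64_t size, int fd, uint64_t offset)
{
    (void)iova; (void)size; (void)fd; (void)offset;
    return MAPPED_BY_FILE;
}

/* Prefer the file-based mapping when the block and the backend allow it. */
static int container_dma_map(const Ops *ops, const Block *rb,
                             uint64_t iova, uint64_t size, void *vaddr)
{
    if (ops->dma_map_file && rb && rb->fd >= 0) {
        return ops->dma_map_file(iova, size, rb->fd, rb->fd_offset);
    }
    return ops->dma_map(iova, size, vaddr);
}
```

The block is only consulted for the duration of the call in this sketch, which matches the "briefly used" description in the quoted reply.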
Cédric:
> > There is a risk that the life cycle of the returned MemoryRegion
> > doesn't match VFIO expectations.
Can you explain your concerns in a bit more detail? Are you talking
about current code, or possible future uses?
Is there an alternative approach you could suggest?
regards
john
end of thread, other threads:[~2025-02-16 23:37 UTC | newest]
Thread overview: 64+ messages
2025-01-29 14:42 [PATCH V1 00/26] Live update: vfio and iommufd Steve Sistare
2025-01-29 14:42 ` [PATCH V1 01/26] migration: cpr helpers Steve Sistare
2025-01-29 14:42 ` [PATCH V1 02/26] migration: lower handler priority Steve Sistare
2025-02-03 16:21 ` Fabiano Rosas
2025-02-03 16:58 ` Peter Xu
2025-02-06 13:39 ` Steven Sistare
2025-01-29 14:42 ` [PATCH V1 03/26] vfio: vfio_find_ram_discard_listener Steve Sistare
2025-02-03 16:57 ` Cédric Le Goater
2025-01-29 14:43 ` [PATCH V1 04/26] vfio/container: register container for cpr Steve Sistare
2025-02-03 17:01 ` Cédric Le Goater
2025-02-03 22:26 ` Steven Sistare
2025-01-29 14:43 ` [PATCH V1 05/26] vfio/container: preserve descriptors Steve Sistare
2025-02-03 17:48 ` Cédric Le Goater
2025-02-03 22:26 ` Steven Sistare
2025-01-29 14:43 ` [PATCH V1 06/26] vfio/container: preserve DMA mappings Steve Sistare
2025-02-03 18:25 ` Cédric Le Goater
2025-02-03 22:27 ` Steven Sistare
2025-01-29 14:43 ` [PATCH V1 07/26] vfio/container: recover from unmap-all-vaddr failure Steve Sistare
2025-02-04 14:10 ` Cédric Le Goater
2025-02-04 16:13 ` Steven Sistare
2025-01-29 14:43 ` [PATCH V1 08/26] pci: skip reset during cpr Steve Sistare
2025-02-04 14:14 ` Cédric Le Goater
2025-02-04 16:13 ` Steven Sistare
2025-01-29 14:43 ` [PATCH V1 09/26] pci: export msix_is_pending Steve Sistare
2025-01-29 14:43 ` [PATCH V1 10/26] vfio-pci: refactor for cpr Steve Sistare
2025-02-04 14:39 ` Cédric Le Goater
2025-02-04 16:14 ` Steven Sistare
2025-01-29 14:43 ` [PATCH V1 11/26] vfio-pci: skip reset during cpr Steve Sistare
2025-02-04 14:56 ` Cédric Le Goater
2025-02-04 16:15 ` Steven Sistare
2025-01-29 14:43 ` [PATCH V1 12/26] vfio-pci: preserve MSI Steve Sistare
2025-02-05 16:48 ` Cédric Le Goater
2025-02-06 14:41 ` Steven Sistare
2025-01-29 14:43 ` [PATCH V1 13/26] vfio-pci: preserve INTx Steve Sistare
2025-02-05 17:13 ` Cédric Le Goater
2025-02-06 14:43 ` Steven Sistare
2025-01-29 14:43 ` [PATCH V1 14/26] migration: close kvm after cpr Steve Sistare
2025-01-29 14:43 ` [PATCH V1 15/26] migration: cpr_get_fd_param helper Steve Sistare
2025-01-29 14:43 ` [PATCH V1 16/26] vfio: return mr from vfio_get_xlat_addr Steve Sistare
2025-02-04 15:47 ` Cédric Le Goater
2025-02-04 17:42 ` Steven Sistare
2025-02-16 23:19 ` John Levon
2025-01-29 14:43 ` [PATCH V1 17/26] vfio: pass ramblock to vfio_container_dma_map Steve Sistare
2025-01-29 14:43 ` [PATCH V1 18/26] vfio/iommufd: define iommufd_cdev_make_hwpt Steve Sistare
2025-02-04 16:22 ` Cédric Le Goater
2025-02-04 17:42 ` Steven Sistare
2025-01-29 14:43 ` [PATCH V1 19/26] vfio/iommufd: use IOMMU_IOAS_MAP_FILE Steve Sistare
2025-02-05 17:23 ` Cédric Le Goater
2025-02-05 22:01 ` Steven Sistare
2025-01-29 14:43 ` [PATCH V1 20/26] vfio/iommufd: export iommufd_cdev_get_info_iova_range Steve Sistare
2025-02-05 17:33 ` Cédric Le Goater
2025-02-05 22:01 ` Steven Sistare
2025-01-29 14:43 ` [PATCH V1 21/26] iommufd: change process ioctl Steve Sistare
2025-02-05 17:34 ` Cédric Le Goater
2025-02-05 22:02 ` Steven Sistare
2025-01-29 14:43 ` [PATCH V1 22/26] vfio/iommufd: invariant device name Steve Sistare
2025-02-05 17:42 ` Cédric Le Goater
2025-02-05 22:02 ` Steven Sistare
2025-01-29 14:43 ` [PATCH V1 23/26] vfio/iommufd: register container for cpr Steve Sistare
2025-02-05 17:45 ` Cédric Le Goater
2025-02-05 22:03 ` Steven Sistare
2025-01-29 14:43 ` [PATCH V1 24/26] vfio/iommufd: preserve descriptors Steve Sistare
2025-01-29 14:43 ` [PATCH V1 25/26] vfio/iommufd: reconstruct device Steve Sistare
2025-01-29 14:43 ` [PATCH V1 26/26] iommufd: preserve DMA mappings Steve Sistare