qemu-devel.nongnu.org archive mirror
* [PATCH V3 00/42] Live update: vfio and iommufd
@ 2025-05-12 15:32 Steve Sistare
  2025-05-12 15:32 ` [PATCH V3 01/42] MAINTAINERS: Add reviewer for CPR Steve Sistare
                   ` (42 more replies)
  0 siblings, 43 replies; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Support vfio and iommufd devices with the cpr-transfer live migration mode.
Devices that do not support live migration can still support cpr-transfer,
allowing live update to a new version of QEMU on the same host, with no loss
of guest connectivity.

No user-visible interfaces are added.

For legacy containers:

Pass vfio device descriptors to new QEMU.  In new QEMU, during vfio_realize,
skip the ioctls that configure the device, because it is already configured.

Use VFIO_DMA_UNMAP_FLAG_VADDR to abandon the old VAs for DMA-mapped
regions, and use VFIO_DMA_MAP_FLAG_VADDR to register the new VA in new
QEMU and update the locked memory accounting.  The physical pages remain
pinned, because the descriptor of the device that locked them remains open,
so DMA to those pages continues without interruption.  Mediated devices are
not supported, however, because they require the VA to always be valid, and
there is a brief window where no VA is registered.
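
As a rough illustration of the two requests described above, the sketch
below only fills in the kernel uAPI structures from <linux/vfio.h>; it does
not issue any ioctl.  The demo_* helper names are hypothetical and are not
part of this series, and combining the VADDR flag with
VFIO_DMA_UNMAP_FLAG_ALL is an assumption based on the VFIO_UNMAP_ALL check
mentioned later in the series.

```c
#include <stdint.h>
#include <string.h>
#include <linux/vfio.h>

/* Old QEMU: request "forget the vaddr of every mapping, keep pages pinned". */
static void demo_fill_unmap_all_vaddr(struct vfio_iommu_type1_dma_unmap *unmap)
{
    memset(unmap, 0, sizeof(*unmap));
    unmap->argsz = sizeof(*unmap);
    /* With VFIO_DMA_UNMAP_FLAG_ALL, iova and size must remain 0. */
    unmap->flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL;
}

/* New QEMU: request "attach a new vaddr to an existing iova range". */
static void demo_fill_remap_vaddr(struct vfio_iommu_type1_dma_map *map,
                                  void *new_vaddr, uint64_t iova,
                                  uint64_t size)
{
    memset(map, 0, sizeof(*map));
    map->argsz = sizeof(*map);
    /* READ/WRITE stay clear: a vaddr update does not change protection. */
    map->flags = VFIO_DMA_MAP_FLAG_VADDR;
    map->vaddr = (uint64_t)(uintptr_t)new_vaddr;
    /* iova and size must match the original mapping. */
    map->iova = iova;
    map->size = size;
}
```

The filled structures would then be passed to ioctl(fd, VFIO_IOMMU_UNMAP_DMA,
...) in old QEMU and ioctl(fd, VFIO_IOMMU_MAP_DMA, ...) in new QEMU, where fd
is the preserved container descriptor.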

Save the MSI message area as part of vfio-pci vmstate, and pass the interrupt
and notifier eventfds to new QEMU.  New QEMU loads the MSI data, then the
vfio-pci post_load handler finds the eventfds in CPR state, rebuilds vector
data structures, and attaches the interrupts to the new KVM instance.  This
logic also applies to iommufd containers.

For iommufd containers:

Use IOMMU_IOAS_MAP_FILE to register memory regions for DMA when they are
backed by a file (including a memfd), so DMA mappings do not depend on VA,
which can differ after live update.  This allows mediated devices to be
supported.

Pass the iommufd and vfio device descriptors from old to new QEMU.  In new
QEMU, during vfio_realize, skip the ioctls that configure the device, because
it is already configured.

In new QEMU, call ioctl(IOMMU_IOAS_CHANGE_PROCESS) to update mm ownership and
locked memory accounting.

Patches 4 to 12 are specific to legacy containers.
Patches 25 to 41 are specific to iommufd containers.
The remainder apply to both.

Changes from previous versions:
  * V1 of this series contains minor changes from the "Live update: vfio" and
    "Live update: iommufd" series, mainly bug fixes and refactored patches.

Changes in V2:
  * refactored various vfio code snippets into new cpr helpers
  * refactored vfio struct members into cpr-specific structures
  * refactored various small changes into their own patches
  * split complex patches.  Notably:
    - split "refactor for cpr" into 5 patches
    - split "reconstruct device" into 4 patches
  * refactored vfio_connect_container using helpers and made its
    error recovery more robust.
  * moved vfio pci msi/vector/intx cpr functions to cpr.c
  * renamed "reused" to cpr_reused and cpr.reused
  * squashed vfio_cpr_[un]register_container to their call sites
  * simplified iommu_type setting after cpr
  * added cpr_open_fd and cpr_is_incoming helpers
  * removed changes from vfio_legacy_dma_map, and instead temporarily
    override dma_map and dma_unmap ops.
  * deleted error_report and returned Error to callers where possible.
  * simplified the memory_get_xlat_addr interface
  * fixed flags passed to iommufd_backend_alloc_hwpt
  * defined MIG_PRI_UNINITIALIZED
  * added maintainers

Changes in V3:
  * removed cleanup patches that were already pulled
  * rebased to latest master


Steve Sistare (42):
  MAINTAINERS: Add reviewer for CPR
  migration: cpr helpers
  migration: lower handler priority
  vfio: vfio_find_ram_discard_listener
  vfio: move vfio-cpr.h
  vfio/container: register container for cpr
  vfio/container: preserve descriptors
  vfio/container: export vfio_legacy_dma_map
  vfio/container: discard old DMA vaddr
  vfio/container: restore DMA vaddr
  vfio/container: mdev cpr blocker
  vfio/container: recover from unmap-all-vaddr failure
  pci: export msix_is_pending
  pci: skip reset during cpr
  vfio-pci: skip reset during cpr
  vfio/pci: vfio_vector_init
  vfio/pci: vfio_notifier_init
  vfio/pci: pass vector to virq functions
  vfio/pci: vfio_notifier_init cpr parameters
  vfio/pci: vfio_notifier_cleanup
  vfio/pci: export MSI functions
  vfio-pci: preserve MSI
  vfio-pci: preserve INTx
  migration: close kvm after cpr
  migration: cpr_get_fd_param helper
  vfio: return mr from vfio_get_xlat_addr
  vfio: pass ramblock to vfio_container_dma_map
  backends/iommufd: iommufd_backend_map_file_dma
  backends/iommufd: change process ioctl
  physmem: qemu_ram_get_fd_offset
  vfio/iommufd: use IOMMU_IOAS_MAP_FILE
  vfio/iommufd: export iommufd_cdev_get_info_iova_range
  vfio/iommufd: define hwpt constructors
  vfio/iommufd: invariant device name
  vfio/iommufd: register container for cpr
  vfio/iommufd: preserve descriptors
  vfio/iommufd: reconstruct device
  vfio/iommufd: reconstruct hw_caps
  vfio/iommufd: reconstruct hwpt
  vfio/iommufd: change process
  iommufd: preserve DMA mappings
  vfio/container: delete old cpr register

 MAINTAINERS                           |  10 ++
 accel/kvm/kvm-all.c                   |  28 ++++
 backends/iommufd.c                    |  83 ++++++++--
 backends/trace-events                 |   2 +
 hw/pci/msix.c                         |   2 +-
 hw/pci/pci.c                          |  13 ++
 hw/vfio/container-base.c              |  12 +-
 hw/vfio/container.c                   | 100 +++++++++---
 hw/vfio/cpr-iommufd.c                 | 174 +++++++++++++++++++++
 hw/vfio/cpr-legacy.c                  | 284 ++++++++++++++++++++++++++++++++++
 hw/vfio/cpr.c                         | 176 +++++++++++++++++++--
 hw/vfio/device.c                      |  20 +--
 hw/vfio/helpers.c                     |  10 ++
 hw/vfio/iommufd.c                     | 201 ++++++++++++++++--------
 hw/vfio/listener.c                    |  91 +++++++----
 hw/vfio/meson.build                   |   2 +
 hw/vfio/pci.c                         | 195 +++++++++++++++++------
 hw/vfio/pci.h                         |  12 ++
 hw/vfio/trace-events                  |   1 +
 hw/vfio/vfio-cpr.h                    |  15 --
 hw/vfio/vfio-iommufd.h                |  10 ++
 hw/virtio/vhost-vdpa.c                |   8 +-
 include/exec/cpu-common.h             |   1 +
 include/hw/pci/msix.h                 |   1 +
 include/hw/vfio/vfio-container-base.h |  15 +-
 include/hw/vfio/vfio-container.h      |   2 +
 include/hw/vfio/vfio-cpr.h            |  71 +++++++++
 include/hw/vfio/vfio-device.h         |   4 +
 include/migration/cpr.h               |   8 +
 include/migration/vmstate.h           |   6 +-
 include/qemu/vfio-helpers.h           |   1 -
 include/system/iommufd.h              |   6 +
 include/system/kvm.h                  |   1 +
 include/system/memory.h               |  16 +-
 migration/cpr-transfer.c              |  18 +++
 migration/cpr.c                       |  72 +++++++++
 migration/migration.c                 |   1 +
 migration/savevm.c                    |   4 +-
 system/memory.c                       |  25 +--
 system/physmem.c                      |   5 +
 40 files changed, 1460 insertions(+), 246 deletions(-)
 create mode 100644 hw/vfio/cpr-iommufd.c
 create mode 100644 hw/vfio/cpr-legacy.c
 delete mode 100644 hw/vfio/vfio-cpr.h
 create mode 100644 include/hw/vfio/vfio-cpr.h

base-commit: 7be29f2f1a3f5b037d27eedbd5df9f441e8c8c16
-- 
1.8.3.1




* [PATCH V3 01/42] MAINTAINERS: Add reviewer for CPR
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-15  7:36   ` Cédric Le Goater
  2025-05-12 15:32 ` [PATCH V3 02/42] migration: cpr helpers Steve Sistare
                   ` (41 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

CPR is integrated with live migration and has the same maintainers, but add
a separate CPR section so that a reviewer can be listed.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 MAINTAINERS | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 6dacd6d..d54a532 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3019,6 +3019,15 @@ F: include/qemu/co-shared-resource.h
 T: git https://gitlab.com/jsnow/qemu.git jobs
 T: git https://gitlab.com/vsementsov/qemu.git block
 
+CheckPoint and Restart (CPR)
+R: Steve Sistare <steven.sistare@oracle.com>
+S: Supported
+F: hw/vfio/cpr*
+F: include/migration/cpr.h
+F: migration/cpr*
+F: tests/qtest/migration/cpr*
+F: docs/devel/migration/CPR.rst
+
 Compute Express Link
 M: Jonathan Cameron <jonathan.cameron@huawei.com>
 R: Fan Ni <fan.ni@samsung.com>
-- 
1.8.3.1




* [PATCH V3 02/42] migration: cpr helpers
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
  2025-05-12 15:32 ` [PATCH V3 01/42] MAINTAINERS: Add reviewer for CPR Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-15  7:43   ` Cédric Le Goater
  2025-05-12 15:32 ` [PATCH V3 03/42] migration: lower handler priority Steve Sistare
                   ` (40 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Add the cpr_needed_for_reuse and cpr_open_fd helpers, for use when adding
CPR support for vfio and iommufd.
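
The find-or-open pattern behind cpr_open_fd can be sketched outside QEMU as
follows.  This is only a standalone model: the demo_* names and the
fixed-size table are hypothetical stand-ins for the CPR state, and plain
open() is used in place of qemu_open().

```c
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>

#define DEMO_MAX_FDS 16

/* Hypothetical stand-in for the saved-descriptor list kept in CPR state. */
struct demo_fd_entry {
    const char *name;
    int id;
    int fd;
};

static struct demo_fd_entry demo_fds[DEMO_MAX_FDS];
static int demo_nfds;

/* Return the saved descriptor for (name, id), or -1 if none was saved. */
static int demo_find_fd(const char *name, int id)
{
    for (int i = 0; i < demo_nfds; i++) {
        if (demo_fds[i].id == id && strcmp(demo_fds[i].name, name) == 0) {
            return demo_fds[i].fd;
        }
    }
    return -1;
}

/* Remember a descriptor; silently dropped if the demo table is full. */
static void demo_save_fd(const char *name, int id, int fd)
{
    if (demo_nfds < DEMO_MAX_FDS) {
        demo_fds[demo_nfds++] = (struct demo_fd_entry){ name, id, fd };
    }
}

/* Reuse a saved descriptor if present, else open the path and save it. */
static int demo_open_fd(const char *path, int flags, const char *name,
                        int id, bool *reused)
{
    int fd = demo_find_fd(name, id);

    if (reused) {
        *reused = (fd >= 0);
    }
    if (fd < 0) {
        fd = open(path, flags);
        if (fd >= 0) {
            demo_save_fd(name, id, fd);
        }
    }
    return fd;
}
```

In this model, a second call with the same (name, id) returns the first
descriptor and sets *reused, which is the signal a caller would use to skip
reconfiguring an already configured device.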

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/cpr.h |  4 ++++
 migration/cpr.c         | 24 ++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 7561fc7..fc6aa33 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -18,6 +18,8 @@
 void cpr_save_fd(const char *name, int id, int fd);
 void cpr_delete_fd(const char *name, int id);
 int cpr_find_fd(const char *name, int id);
+int cpr_open_fd(const char *path, int flags, const char *name, int id,
+                bool *reused, Error **errp);
 
 MigMode cpr_get_incoming_mode(void);
 void cpr_set_incoming_mode(MigMode mode);
@@ -28,6 +30,8 @@ int cpr_state_load(MigrationChannel *channel, Error **errp);
 void cpr_state_close(void);
 struct QIOChannel *cpr_state_ioc(void);
 
+bool cpr_needed_for_reuse(void *opaque);
+
 QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
 QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
 
diff --git a/migration/cpr.c b/migration/cpr.c
index 42c4656..0b01e25 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -95,6 +95,24 @@ int cpr_find_fd(const char *name, int id)
     trace_cpr_find_fd(name, id, fd);
     return fd;
 }
+
+int cpr_open_fd(const char *path, int flags, const char *name, int id,
+                bool *reused, Error **errp)
+{
+    int fd = cpr_find_fd(name, id);
+
+    if (reused) {
+        *reused = (fd >= 0);
+    }
+    if (fd < 0) {
+        fd = qemu_open(path, flags, errp);
+        if (fd >= 0) {
+            cpr_save_fd(name, id, fd);
+        }
+    }
+    return fd;
+}
+
 /*************************************************************************/
 #define CPR_STATE "CprState"
 
@@ -228,3 +246,9 @@ void cpr_state_close(void)
         cpr_state_file = NULL;
     }
 }
+
+bool cpr_needed_for_reuse(void *opaque)
+{
+    MigMode mode = migrate_mode();
+    return mode == MIG_MODE_CPR_TRANSFER;
+}
-- 
1.8.3.1




* [PATCH V3 03/42] migration: lower handler priority
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
  2025-05-12 15:32 ` [PATCH V3 01/42] MAINTAINERS: Add reviewer for CPR Steve Sistare
  2025-05-12 15:32 ` [PATCH V3 02/42] migration: cpr helpers Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-12 15:32 ` [PATCH V3 04/42] vfio: vfio_find_ram_discard_listener Steve Sistare
                   ` (39 subsequent siblings)
  42 siblings, 0 replies; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Define a vmstate priority that is lower than the default, so its handlers
run after all default priority handlers.  Since 0 is no longer the default
priority, translate an uninitialized priority of 0 to MIG_PRI_DEFAULT.

CPR for vfio will use this to install handlers for containers that run
after handlers for the devices that they contain.
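
A minimal standalone model of the priority translation is shown below.  The
DEMO_* enum is a hypothetical, shortened copy of the real one, and the
ordering rule (higher priority handlers run first) is assumed from the
description above.

```c
#include <stdbool.h>

/* Hypothetical, shortened mirror of the reordered priority enum. */
typedef enum {
    DEMO_PRI_UNINITIALIZED = 0, /* field left unset in a vmstate description */
    DEMO_PRI_LOW,               /* runs after default */
    DEMO_PRI_DEFAULT,
    DEMO_PRI_IOMMU,             /* runs before default */
} DemoPriority;

/* An unset (zero) priority is treated as the default priority. */
static DemoPriority demo_effective_priority(DemoPriority declared)
{
    return declared ? declared : DEMO_PRI_DEFAULT;
}

/* True if a handler declared with priority a runs before one with b. */
static bool demo_runs_before(DemoPriority a, DemoPriority b)
{
    return demo_effective_priority(a) > demo_effective_priority(b);
}
```

With this mapping, existing descriptions that never set a priority keep
their place relative to the IOMMU and PCI priorities, while a container
handler declared with the low priority runs after them.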

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 include/migration/vmstate.h | 6 +++++-
 migration/savevm.c          | 4 ++--
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index a1dfab4..1ff7bd9 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -155,7 +155,11 @@ enum VMStateFlags {
 };
 
 typedef enum {
-    MIG_PRI_DEFAULT = 0,
+    MIG_PRI_UNINITIALIZED = 0,  /* An uninitialized priority field maps to */
+                                /* MIG_PRI_DEFAULT in save_state_priority */
+
+    MIG_PRI_LOW,                /* Must happen after default */
+    MIG_PRI_DEFAULT,
     MIG_PRI_IOMMU,              /* Must happen before PCI devices */
     MIG_PRI_PCI_BUS,            /* Must happen before IOMMU */
     MIG_PRI_VIRTIO_MEM,         /* Must happen before IOMMU */
diff --git a/migration/savevm.c b/migration/savevm.c
index 006514c..7e87815 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -266,7 +266,7 @@ typedef struct SaveState {
 
 static SaveState savevm_state = {
     .handlers = QTAILQ_HEAD_INITIALIZER(savevm_state.handlers),
-    .handler_pri_head = { [MIG_PRI_DEFAULT ... MIG_PRI_MAX] = NULL },
+    .handler_pri_head = { [0 ... MIG_PRI_MAX] = NULL },
     .global_section_id = 0,
 };
 
@@ -737,7 +737,7 @@ static int calculate_compat_instance_id(const char *idstr)
 
 static inline MigrationPriority save_state_priority(SaveStateEntry *se)
 {
-    if (se->vmsd) {
+    if (se->vmsd && se->vmsd->priority) {
         return se->vmsd->priority;
     }
     return MIG_PRI_DEFAULT;
-- 
1.8.3.1




* [PATCH V3 04/42] vfio: vfio_find_ram_discard_listener
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (2 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 03/42] migration: lower handler priority Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-12 15:32 ` [PATCH V3 05/42] vfio: move vfio-cpr.h Steve Sistare
                   ` (38 subsequent siblings)
  42 siblings, 0 replies; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Define vfio_find_ram_discard_listener as a subroutine so additional calls to
it may be added in a subsequent patch.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
 hw/vfio/listener.c                    | 35 ++++++++++++++++++++++-------------
 include/hw/vfio/vfio-container-base.h |  3 +++
 2 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index bfacb3d..5642d04 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -449,6 +449,26 @@ static void vfio_device_error_append(VFIODevice *vbasedev, Error **errp)
     }
 }
 
+VFIORamDiscardListener *vfio_find_ram_discard_listener(
+    VFIOContainerBase *bcontainer, MemoryRegionSection *section)
+{
+    VFIORamDiscardListener *vrdl = NULL;
+
+    QLIST_FOREACH(vrdl, &bcontainer->vrdl_list, next) {
+        if (vrdl->mr == section->mr &&
+            vrdl->offset_within_address_space ==
+            section->offset_within_address_space) {
+            break;
+        }
+    }
+
+    if (!vrdl) {
+        hw_error("vfio: Trying to sync missing RAM discard listener");
+        /* does not return */
+    }
+    return vrdl;
+}
+
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
@@ -1075,19 +1095,8 @@ vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainerBase *bcontainer,
                                             MemoryRegionSection *section)
 {
     RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
-    VFIORamDiscardListener *vrdl = NULL;
-
-    QLIST_FOREACH(vrdl, &bcontainer->vrdl_list, next) {
-        if (vrdl->mr == section->mr &&
-            vrdl->offset_within_address_space ==
-            section->offset_within_address_space) {
-            break;
-        }
-    }
-
-    if (!vrdl) {
-        hw_error("vfio: Trying to sync missing RAM discard listener");
-    }
+    VFIORamDiscardListener *vrdl =
+        vfio_find_ram_discard_listener(bcontainer, section);
 
     /*
      * We only want/can synchronize the bitmap for actually mapped parts -
diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index 3d392b0..1dc760f 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -183,4 +183,7 @@ struct VFIOIOMMUClass {
     void (*release)(VFIOContainerBase *bcontainer);
 };
 
+VFIORamDiscardListener *vfio_find_ram_discard_listener(
+    VFIOContainerBase *bcontainer, MemoryRegionSection *section);
+
 #endif /* HW_VFIO_VFIO_CONTAINER_BASE_H */
-- 
1.8.3.1




* [PATCH V3 05/42] vfio: move vfio-cpr.h
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (3 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 04/42] vfio: vfio_find_ram_discard_listener Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-15  7:46   ` Cédric Le Goater
  2025-05-12 15:32 ` [PATCH V3 06/42] vfio/container: register container for cpr Steve Sistare
                   ` (37 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Move vfio-cpr.h to include/hw/vfio, because it will need to be included by
other files there.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 MAINTAINERS                |  1 +
 hw/vfio/container.c        |  2 +-
 hw/vfio/cpr.c              |  2 +-
 hw/vfio/iommufd.c          |  2 +-
 hw/vfio/vfio-cpr.h         | 15 ---------------
 include/hw/vfio/vfio-cpr.h | 18 ++++++++++++++++++
 6 files changed, 22 insertions(+), 18 deletions(-)
 delete mode 100644 hw/vfio/vfio-cpr.h
 create mode 100644 include/hw/vfio/vfio-cpr.h

diff --git a/MAINTAINERS b/MAINTAINERS
index d54a532..9bee3cf 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3023,6 +3023,7 @@ CheckPoint and Restart (CPR)
 R: Steve Sistare <steven.sistare@oracle.com>
 S: Supported
 F: hw/vfio/cpr*
+F: include/hw/vfio/vfio-cpr.h
 F: include/migration/cpr.h
 F: migration/cpr*
 F: tests/qtest/migration/cpr*
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index a9f0dba..eb56f00 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -33,8 +33,8 @@
 #include "qapi/error.h"
 #include "pci.h"
 #include "hw/vfio/vfio-container.h"
+#include "hw/vfio/vfio-cpr.h"
 #include "vfio-helpers.h"
-#include "vfio-cpr.h"
 #include "vfio-listener.h"
 
 #define TYPE_HOST_IOMMU_DEVICE_LEGACY_VFIO TYPE_HOST_IOMMU_DEVICE "-legacy-vfio"
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index 3214184..0210e76 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -8,9 +8,9 @@
 #include "qemu/osdep.h"
 #include "hw/vfio/vfio-device.h"
 #include "migration/misc.h"
+#include "hw/vfio/vfio-cpr.h"
 #include "qapi/error.h"
 #include "system/runstate.h"
-#include "vfio-cpr.h"
 
 static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
                                     MigrationEvent *e, Error **errp)
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index af1c7ab..167bda4 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -21,13 +21,13 @@
 #include "qapi/error.h"
 #include "system/iommufd.h"
 #include "hw/qdev-core.h"
+#include "hw/vfio/vfio-cpr.h"
 #include "system/reset.h"
 #include "qemu/cutils.h"
 #include "qemu/chardev_open.h"
 #include "pci.h"
 #include "vfio-iommufd.h"
 #include "vfio-helpers.h"
-#include "vfio-cpr.h"
 #include "vfio-listener.h"
 
 #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO             \
diff --git a/hw/vfio/vfio-cpr.h b/hw/vfio/vfio-cpr.h
deleted file mode 100644
index 134b83a..0000000
--- a/hw/vfio/vfio-cpr.h
+++ /dev/null
@@ -1,15 +0,0 @@
-/*
- * VFIO CPR
- *
- * Copyright (c) 2025 Oracle and/or its affiliates.
- *
- * SPDX-License-Identifier: GPL-2.0-or-later
- */
-
-#ifndef HW_VFIO_CPR_H
-#define HW_VFIO_CPR_H
-
-bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp);
-void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer);
-
-#endif /* HW_VFIO_CPR_H */
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
new file mode 100644
index 0000000..750ea5b
--- /dev/null
+++ b/include/hw/vfio/vfio-cpr.h
@@ -0,0 +1,18 @@
+/*
+ * VFIO CPR
+ *
+ * Copyright (c) 2025 Oracle and/or its affiliates.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef HW_VFIO_VFIO_CPR_H
+#define HW_VFIO_VFIO_CPR_H
+
+struct VFIOContainerBase;
+
+bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
+                                 Error **errp);
+void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
+
+#endif /* HW_VFIO_VFIO_CPR_H */
-- 
1.8.3.1




* [PATCH V3 06/42] vfio/container: register container for cpr
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (4 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 05/42] vfio: move vfio-cpr.h Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-15  7:54   ` Cédric Le Goater
  2025-05-12 15:32 ` [PATCH V3 07/42] vfio/container: preserve descriptors Steve Sistare
                   ` (36 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Register a legacy container for cpr-transfer, replacing the generic CPR
register call with a more specific legacy container register call.  Add a
blocker if the kernel does not support VFIO_UPDATE_VADDR or VFIO_UNMAP_ALL.

This is mostly boilerplate.  The fields to be saved and restored are added
in subsequent patches.
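
The registration decision can be modeled in isolation as below.  Everything
named demo_* or DEMO_* is hypothetical: the probe callback stands in for
ioctl(fd, VFIO_CHECK_EXTENSION, ...), and the blocker string stands in for
the Error object that the real code passes to the migration blocker API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical extension ids standing in for the two VFIO extensions. */
enum { DEMO_EXT_UPDATE_VADDR = 1, DEMO_EXT_UNMAP_ALL = 2 };

/* Returns nonzero if the extension is supported. */
typedef int (*demo_probe_fn)(int extension);

struct demo_container {
    const char *blocker;        /* non-NULL: cpr-transfer is blocked */
    bool vmstate_registered;
};

/* Return NULL if both extensions are present, else a blocker reason. */
static const char *demo_cpr_blocker_reason(demo_probe_fn probe)
{
    if (!probe(DEMO_EXT_UPDATE_VADDR)) {
        return "container does not support UPDATE_VADDR";
    }
    if (!probe(DEMO_EXT_UNMAP_ALL)) {
        return "container does not support UNMAP_ALL";
    }
    return NULL;
}

/* Either block cpr-transfer for this container or register its state. */
static bool demo_register_container(struct demo_container *c,
                                    demo_probe_fn probe)
{
    c->blocker = demo_cpr_blocker_reason(probe);
    c->vmstate_registered = (c->blocker == NULL);
    return true;
}

/* Sample probes: a fully capable kernel, and one lacking UNMAP_ALL. */
static int demo_probe_all(int extension)
{
    (void)extension;
    return 1;
}

static int demo_probe_no_unmap_all(int extension)
{
    return extension != DEMO_EXT_UNMAP_ALL;
}
```

The point of the design is that a missing kernel feature does not fail
device realization; it only prevents the cpr-transfer mode from being used
while such a container exists.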

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/container.c              |  6 ++--
 hw/vfio/cpr-legacy.c             | 70 ++++++++++++++++++++++++++++++++++++++++
 hw/vfio/cpr.c                    |  5 ++-
 hw/vfio/meson.build              |  1 +
 include/hw/vfio/vfio-container.h |  2 ++
 include/hw/vfio/vfio-cpr.h       | 14 ++++++++
 6 files changed, 92 insertions(+), 6 deletions(-)
 create mode 100644 hw/vfio/cpr-legacy.c

diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index eb56f00..85c76da 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -642,7 +642,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
     new_container = true;
     bcontainer = &container->bcontainer;
 
-    if (!vfio_cpr_register_container(bcontainer, errp)) {
+    if (!vfio_legacy_cpr_register_container(container, errp)) {
         goto fail;
     }
 
@@ -678,7 +678,7 @@ fail:
         vioc->release(bcontainer);
     }
     if (new_container) {
-        vfio_cpr_unregister_container(bcontainer);
+        vfio_legacy_cpr_unregister_container(container);
         object_unref(container);
     }
     if (fd >= 0) {
@@ -719,7 +719,7 @@ static void vfio_container_disconnect(VFIOGroup *group)
         VFIOAddressSpace *space = bcontainer->space;
 
         trace_vfio_container_disconnect(container->fd);
-        vfio_cpr_unregister_container(bcontainer);
+        vfio_legacy_cpr_unregister_container(container);
         close(container->fd);
         object_unref(container);
 
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
new file mode 100644
index 0000000..fac323c
--- /dev/null
+++ b/hw/vfio/cpr-legacy.c
@@ -0,0 +1,70 @@
+/*
+ * Copyright (c) 2021-2025 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+#include "hw/vfio/vfio-container.h"
+#include "hw/vfio/vfio-cpr.h"
+#include "migration/blocker.h"
+#include "migration/cpr.h"
+#include "migration/migration.h"
+#include "migration/vmstate.h"
+#include "qapi/error.h"
+
+static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
+{
+    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
+        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR");
+        return false;
+
+    } else if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
+        error_setg(errp, "VFIO container does not support VFIO_UNMAP_ALL");
+        return false;
+
+    } else {
+        return true;
+    }
+}
+
+static const VMStateDescription vfio_container_vmstate = {
+    .name = "vfio-container",
+    .version_id = 0,
+    .minimum_version_id = 0,
+    .needed = cpr_needed_for_reuse,
+    .fields = (VMStateField[]) {
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
+{
+    VFIOContainerBase *bcontainer = &container->bcontainer;
+    Error **cpr_blocker = &container->cpr.blocker;
+
+    migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
+                                vfio_cpr_reboot_notifier,
+                                MIG_MODE_CPR_REBOOT);
+
+    if (!vfio_cpr_supported(container, cpr_blocker)) {
+        return migrate_add_blocker_modes(cpr_blocker, errp,
+                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
+    }
+
+    vmstate_register(NULL, -1, &vfio_container_vmstate, container);
+
+    return true;
+}
+
+void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
+{
+    VFIOContainerBase *bcontainer = &container->bcontainer;
+
+    migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
+    migrate_del_blocker(&container->cpr.blocker);
+    vmstate_unregister(NULL, &vfio_container_vmstate, container);
+}
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index 0210e76..0e59612 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -7,13 +7,12 @@
 
 #include "qemu/osdep.h"
 #include "hw/vfio/vfio-device.h"
-#include "migration/misc.h"
 #include "hw/vfio/vfio-cpr.h"
 #include "qapi/error.h"
 #include "system/runstate.h"
 
-static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
-                                    MigrationEvent *e, Error **errp)
+int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
+                             MigrationEvent *e, Error **errp)
 {
     if (e->type == MIG_EVENT_PRECOPY_SETUP &&
         !runstate_check(RUN_STATE_SUSPENDED) && !vm_get_suspended()) {
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index bccb050..73d29f9 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -21,6 +21,7 @@ system_ss.add(when: 'CONFIG_VFIO_XGMAC', if_true: files('calxeda-xgmac.c'))
 system_ss.add(when: 'CONFIG_VFIO_AMD_XGBE', if_true: files('amd-xgbe.c'))
 system_ss.add(when: 'CONFIG_VFIO', if_true: files(
   'cpr.c',
+  'cpr-legacy.c',
   'device.c',
   'migration.c',
   'migration-multifd.c',
diff --git a/include/hw/vfio/vfio-container.h b/include/hw/vfio/vfio-container.h
index afc498d..21e5807 100644
--- a/include/hw/vfio/vfio-container.h
+++ b/include/hw/vfio/vfio-container.h
@@ -10,6 +10,7 @@
 #define HW_VFIO_CONTAINER_H
 
 #include "hw/vfio/vfio-container-base.h"
+#include "hw/vfio/vfio-cpr.h"
 
 typedef struct VFIOContainer VFIOContainer;
 typedef struct VFIODevice VFIODevice;
@@ -29,6 +30,7 @@ typedef struct VFIOContainer {
     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
     unsigned iommu_type;
     QLIST_HEAD(, VFIOGroup) group_list;
+    VFIOContainerCPR cpr;
 } VFIOContainer;
 
 OBJECT_DECLARE_SIMPLE_TYPE(VFIOContainer, VFIO_IOMMU_LEGACY);
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 750ea5b..f864547 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -9,8 +9,22 @@
 #ifndef HW_VFIO_VFIO_CPR_H
 #define HW_VFIO_VFIO_CPR_H
 
+#include "migration/misc.h"
+
+typedef struct VFIOContainerCPR {
+    Error *blocker;
+} VFIOContainerCPR;
+
+struct VFIOContainer;
 struct VFIOContainerBase;
 
+bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
+                                        Error **errp);
+void vfio_legacy_cpr_unregister_container(struct VFIOContainer *container);
+
+int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
+                             Error **errp);
+
 bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
                                  Error **errp);
 void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
-- 
1.8.3.1




* [PATCH V3 07/42] vfio/container: preserve descriptors
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (5 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 06/42] vfio/container: register container for cpr Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-15 12:59   ` Cédric Le Goater
  2025-05-22 13:51   ` Cédric Le Goater
  2025-05-12 15:32 ` [PATCH V3 08/42] vfio/container: export vfio_legacy_dma_map Steve Sistare
                   ` (35 subsequent siblings)
  42 siblings, 2 replies; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

At vfio creation time, save the values of the vfio container, group, and
device descriptors in CPR state.  On QEMU restart, vfio_realize() finds and
uses the saved descriptors, and remembers the reused status for subsequent
patches.  The reused status is cleared when vmstate load finishes.

During reuse, device and iommu state is already configured, so operations
in vfio_realize that would modify the configuration, such as vfio ioctls,
are skipped.  The result is that vfio_realize constructs QEMU data
structures that reflect the current state of the device.
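
The find-or-open pattern described above can be sketched in isolation.
This is a toy illustration only, not the QEMU implementation: the table,
the demo_* names, and the fake descriptor numbers are all hypothetical,
and a single in-process array stands in for CPR state that would really
be passed from old QEMU to new QEMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for CPR state: a small table of named descriptors. */
#define DEMO_MAX_FDS 16

typedef struct DemoFdEntry {
    char name[64];
    int id;
    int fd;
    bool used;
} DemoFdEntry;

static DemoFdEntry demo_fds[DEMO_MAX_FDS];
static int demo_next_fd = 100;      /* fake descriptor numbers */

/* Return the fd saved under (name, id), or -1 if there is none. */
static int demo_find_fd(const char *name, int id)
{
    for (int i = 0; i < DEMO_MAX_FDS; i++) {
        DemoFdEntry *e = &demo_fds[i];
        if (e->used && e->id == id && !strcmp(e->name, name)) {
            return e->fd;
        }
    }
    return -1;
}

/* Save fd under (name, id).  Returns false if the table is full. */
static bool demo_save_fd(const char *name, int id, int fd)
{
    assert(fd >= 0);
    for (int i = 0; i < DEMO_MAX_FDS; i++) {
        DemoFdEntry *e = &demo_fds[i];
        if (!e->used) {
            snprintf(e->name, sizeof(e->name), "%s", name);
            e->id = id;
            e->fd = fd;
            e->used = true;
            return true;
        }
    }
    return false;
}

/* Forget the fd saved under (name, id), if any. */
static void demo_delete_fd(const char *name, int id)
{
    for (int i = 0; i < DEMO_MAX_FDS; i++) {
        DemoFdEntry *e = &demo_fds[i];
        if (e->used && e->id == id && !strcmp(e->name, name)) {
            e->used = false;
        }
    }
}

/* Pretend to open and configure a device; returns a fresh fake fd. */
static int demo_fake_open(void)
{
    return demo_next_fd++;
}

/*
 * Reuse a saved descriptor if one exists, else open a new one and save it.
 * *reused tells the caller whether configuration must be skipped.
 */
static int demo_get_fd(const char *name, int id, bool *reused)
{
    int fd = demo_find_fd(name, id);

    *reused = (fd >= 0);
    if (!*reused) {
        fd = demo_fake_open();
        if (!demo_save_fd(name, id, fd)) {
            return -1;
        }
    }
    return fd;
}
```

The first call for a given (name, id) opens and saves; a later call finds
the saved value and reports it as reused, which is the signal to skip the
configuring ioctls.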

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/container.c           | 65 ++++++++++++++++++++++++++++++++++++-------
 hw/vfio/cpr-legacy.c          | 46 ++++++++++++++++++++++++++++++
 include/hw/vfio/vfio-cpr.h    |  9 ++++++
 include/hw/vfio/vfio-device.h |  2 ++
 4 files changed, 112 insertions(+), 10 deletions(-)

diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 85c76da..278a220 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -31,6 +31,8 @@
 #include "system/reset.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "migration/cpr.h"
+#include "migration/blocker.h"
 #include "pci.h"
 #include "hw/vfio/vfio-container.h"
 #include "hw/vfio/vfio-cpr.h"
@@ -414,7 +416,7 @@ static bool vfio_set_iommu(int container_fd, int group_fd,
 }
 
 static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
-                                            Error **errp)
+                                            bool cpr_reused, Error **errp)
 {
     int iommu_type;
     const char *vioc_name;
@@ -425,7 +427,11 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
         return NULL;
     }
 
-    if (!vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
+    /*
+     * If container is reused, just set its type and skip the ioctls, as the
+     * container and group are already configured in the kernel.
+     */
+    if (!cpr_reused && !vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
         return NULL;
     }
 
@@ -433,6 +439,7 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
 
     container = VFIO_IOMMU_LEGACY(object_new(vioc_name));
     container->fd = fd;
+    container->cpr.reused = cpr_reused;
     container->iommu_type = iommu_type;
     return container;
 }
@@ -584,7 +591,7 @@ static bool vfio_container_attach_discard_disable(VFIOContainer *container,
 }
 
 static bool vfio_container_group_add(VFIOContainer *container, VFIOGroup *group,
-                                     Error **errp)
+                                     bool cpr_reused, Error **errp)
 {
     if (!vfio_container_attach_discard_disable(container, group, errp)) {
         return false;
@@ -592,6 +599,9 @@ static bool vfio_container_group_add(VFIOContainer *container, VFIOGroup *group,
     group->container = container;
     QLIST_INSERT_HEAD(&container->group_list, group, container_next);
     vfio_group_add_kvm_device(group);
+    if (!cpr_reused) {
+        cpr_save_fd("vfio_container_for_group", group->groupid, container->fd);
+    }
     return true;
 }
 
@@ -601,6 +611,7 @@ static void vfio_container_group_del(VFIOContainer *container, VFIOGroup *group)
     group->container = NULL;
     vfio_group_del_kvm_device(group);
     vfio_ram_block_discard_disable(container, false);
+    cpr_delete_fd("vfio_container_for_group", group->groupid);
 }
 
 static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
@@ -613,17 +624,37 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
     VFIOIOMMUClass *vioc = NULL;
     bool new_container = false;
     bool group_was_added = false;
+    bool cpr_reused;
 
     space = vfio_address_space_get(as);
+    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
+    cpr_reused = (fd >= 0);
+
+    /*
+     * If the container is reused, then the group is already attached in the
+     * kernel.  If a container with matching fd is found, then update the
+     * userland group list and return.  If not, then after the loop, create
+     * the container struct and group list.
+     */
 
     QLIST_FOREACH(bcontainer, &space->containers, next) {
         container = container_of(bcontainer, VFIOContainer, bcontainer);
-        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
-            return vfio_container_group_add(container, group, errp);
+
+        if (cpr_reused) {
+            if (!vfio_cpr_container_match(container, group, &fd)) {
+                continue;
+            }
+        } else if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
+            continue;
         }
+
+        return vfio_container_group_add(container, group, cpr_reused, errp);
+    }
+
+    if (!cpr_reused) {
+        fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
     }
 
-    fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
     if (fd < 0) {
         goto fail;
     }
@@ -635,7 +666,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
         goto fail;
     }
 
-    container = vfio_create_container(fd, group, errp);
+    container = vfio_create_container(fd, group, cpr_reused, errp);
     if (!container) {
         goto fail;
     }
@@ -655,7 +686,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
 
     vfio_address_space_insert(space, bcontainer);
 
-    if (!vfio_container_group_add(container, group, errp)) {
+    if (!vfio_container_group_add(container, group, cpr_reused, errp)) {
         goto fail;
     }
     group_was_added = true;
@@ -697,6 +728,7 @@ static void vfio_container_disconnect(VFIOGroup *group)
 
     QLIST_REMOVE(group, container_next);
     group->container = NULL;
+    cpr_delete_fd("vfio_container_for_group", group->groupid);
 
     /*
      * Explicitly release the listener first before unset container,
@@ -750,7 +782,7 @@ static VFIOGroup *vfio_group_get(int groupid, AddressSpace *as, Error **errp)
     group = g_malloc0(sizeof(*group));
 
     snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
-    group->fd = qemu_open(path, O_RDWR, errp);
+    group->fd = cpr_open_fd(path, O_RDWR, "vfio_group", groupid, NULL, errp);
     if (group->fd < 0) {
         goto free_group_exit;
     }
@@ -782,6 +814,7 @@ static VFIOGroup *vfio_group_get(int groupid, AddressSpace *as, Error **errp)
     return group;
 
 close_fd_exit:
+    cpr_delete_fd("vfio_group", groupid);
     close(group->fd);
 
 free_group_exit:
@@ -803,6 +836,7 @@ static void vfio_group_put(VFIOGroup *group)
     vfio_container_disconnect(group);
     QLIST_REMOVE(group, next);
     trace_vfio_group_put(group->fd);
+    cpr_delete_fd("vfio_group", group->groupid);
     close(group->fd);
     g_free(group);
 }
@@ -812,8 +846,14 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
 {
     g_autofree struct vfio_device_info *info = NULL;
     int fd;
+    bool cpr_reused;
+
+    fd = cpr_find_fd(name, 0);
+    cpr_reused = (fd >= 0);
+    if (!cpr_reused) {
+        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
+    }
 
-    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
     if (fd < 0) {
         error_setg_errno(errp, errno, "error getting device from group %d",
                          group->groupid);
@@ -857,6 +897,10 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
     vbasedev->group = group;
     QLIST_INSERT_HEAD(&group->device_list, vbasedev, next);
 
+    vbasedev->cpr.reused = cpr_reused;
+    if (!cpr_reused) {
+        cpr_save_fd(name, 0, fd);
+    }
     trace_vfio_device_get(name, info->flags, info->num_regions, info->num_irqs);
 
     return true;
@@ -870,6 +914,7 @@ static void vfio_device_put(VFIODevice *vbasedev)
     QLIST_REMOVE(vbasedev, next);
     vbasedev->group = NULL;
     trace_vfio_device_put(vbasedev->fd);
+    cpr_delete_fd(vbasedev->name, 0);
     close(vbasedev->fd);
 }
 
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
index fac323c..638a8e0 100644
--- a/hw/vfio/cpr-legacy.c
+++ b/hw/vfio/cpr-legacy.c
@@ -10,6 +10,7 @@
 #include "qemu/osdep.h"
 #include "hw/vfio/vfio-container.h"
 #include "hw/vfio/vfio-cpr.h"
+#include "hw/vfio/vfio-device.h"
 #include "migration/blocker.h"
 #include "migration/cpr.h"
 #include "migration/migration.h"
@@ -31,10 +32,27 @@ static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
     }
 }
 
+static int vfio_container_post_load(void *opaque, int version_id)
+{
+    VFIOContainer *container = opaque;
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    container->cpr.reused = false;
+
+    QLIST_FOREACH(group, &container->group_list, container_next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            vbasedev->cpr.reused = false;
+        }
+    }
+    return 0;
+}
+
 static const VMStateDescription vfio_container_vmstate = {
     .name = "vfio-container",
     .version_id = 0,
     .minimum_version_id = 0,
+    .post_load = vfio_container_post_load,
     .needed = cpr_needed_for_reuse,
     .fields = (VMStateField[]) {
         VMSTATE_END_OF_LIST()
@@ -68,3 +86,31 @@ void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
     migrate_del_blocker(&container->cpr.blocker);
     vmstate_unregister(NULL, &vfio_container_vmstate, container);
 }
+
+static bool same_device(int fd1, int fd2)
+{
+    struct stat st1, st2;
+
+    return !fstat(fd1, &st1) && !fstat(fd2, &st2) && st1.st_dev == st2.st_dev;
+}
+
+bool vfio_cpr_container_match(VFIOContainer *container, VFIOGroup *group,
+                              int *pfd)
+{
+    if (container->fd == *pfd) {
+        return true;
+    }
+    if (!same_device(container->fd, *pfd)) {
+        return false;
+    }
+    /*
+     * Same device, different fd.  This occurs when the container fd is
+     * cpr_save'd multiple times, once for each groupid, so SCM_RIGHTS
+     * produces duplicates.  De-dup it.
+     */
+    cpr_delete_fd("vfio_container_for_group", group->groupid);
+    close(*pfd);
+    cpr_save_fd("vfio_container_for_group", group->groupid, container->fd);
+    *pfd = container->fd;
+    return true;
+}
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index f864547..1c4f070 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -13,10 +13,16 @@
 
 typedef struct VFIOContainerCPR {
     Error *blocker;
+    bool reused;
 } VFIOContainerCPR;
 
+typedef struct VFIODeviceCPR {
+    bool reused;
+} VFIODeviceCPR;
+
 struct VFIOContainer;
 struct VFIOContainerBase;
+struct VFIOGroup;
 
 bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
                                         Error **errp);
@@ -29,4 +35,7 @@ bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
                                  Error **errp);
 void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
 
+bool vfio_cpr_container_match(struct VFIOContainer *container,
+                              struct VFIOGroup *group, int *fd);
+
 #endif /* HW_VFIO_VFIO_CPR_H */
diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
index 8bcb3c1..4e4d0b6 100644
--- a/include/hw/vfio/vfio-device.h
+++ b/include/hw/vfio/vfio-device.h
@@ -28,6 +28,7 @@
 #endif
 #include "system/system.h"
 #include "hw/vfio/vfio-container-base.h"
+#include "hw/vfio/vfio-cpr.h"
 #include "system/host_iommu_device.h"
 #include "system/iommufd.h"
 
@@ -84,6 +85,7 @@ typedef struct VFIODevice {
     VFIOIOASHwpt *hwpt;
     QLIST_ENTRY(VFIODevice) hwpt_next;
     struct vfio_region_info **reginfo;
+    VFIODeviceCPR cpr;
 } VFIODevice;
 
 struct VFIODeviceOps {
-- 
1.8.3.1




* [PATCH V3 08/42] vfio/container: export vfio_legacy_dma_map
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (6 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 07/42] vfio/container: preserve descriptors Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-15 13:42   ` Cédric Le Goater
  2025-05-12 15:32 ` [PATCH V3 09/42] vfio/container: discard old DMA vaddr Steve Sistare
                   ` (34 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Export vfio_legacy_dma_map so it may be referenced outside the file
in a subsequent patch.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/container.c                   | 4 ++--
 include/hw/vfio/vfio-container-base.h | 3 +++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 278a220..a554683 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -208,8 +208,8 @@ static int vfio_legacy_dma_unmap(const VFIOContainerBase *bcontainer,
     return ret;
 }
 
-static int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
-                               ram_addr_t size, void *vaddr, bool readonly)
+int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
+                        ram_addr_t size, void *vaddr, bool readonly)
 {
     const VFIOContainer *container = container_of(bcontainer, VFIOContainer,
                                                   bcontainer);
diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index 1dc760f..a2f6c3a 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -186,4 +186,7 @@ struct VFIOIOMMUClass {
 VFIORamDiscardListener *vfio_find_ram_discard_listener(
     VFIOContainerBase *bcontainer, MemoryRegionSection *section);
 
+int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
+                        ram_addr_t size, void *vaddr, bool readonly);
+
 #endif /* HW_VFIO_VFIO_CONTAINER_BASE_H */
-- 
1.8.3.1




* [PATCH V3 09/42] vfio/container: discard old DMA vaddr
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (7 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 08/42] vfio/container: export vfio_legacy_dma_map Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-15 13:30   ` Cédric Le Goater
  2025-05-12 15:32 ` [PATCH V3 10/42] vfio/container: restore " Steve Sistare
                   ` (33 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

In the container pre_save handler, discard the virtual addresses in DMA
mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest RAM will be
remapped at a different VA in new QEMU.  DMA to already-mapped
pages continues.
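
As a rough mental model of why DMA continues, consider the toy sketch
below.  It is not kernel or QEMU code: the DemoMapping struct and the
demo_* helpers are hypothetical, and the model only assumes that a mapping
records both a pinned translation and a vaddr that can be invalidated
independently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of one kernel DMA mapping; field names are hypothetical. */
typedef struct DemoMapping {
    uint64_t iova;
    uint64_t size;
    uint64_t vaddr;
    bool vaddr_valid;   /* false between discard and restore */
    bool pinned;        /* physical pages stay pinned across the update */
} DemoMapping;

/* Models unmap with FLAG_VADDR | FLAG_ALL: drop every vaddr, keep the pins. */
static void demo_unmap_vaddr_all(DemoMapping *maps, size_t n)
{
    assert(maps || n == 0);
    for (size_t i = 0; i < n; i++) {
        maps[i].vaddr_valid = false;
    }
}

/* Device DMA only needs the iova -> page translation. */
static bool demo_device_dma_ok(const DemoMapping *m)
{
    return m->pinned;
}

/* Anything that must go through the vaddr needs it to be valid. */
static bool demo_vaddr_access_ok(const DemoMapping *m)
{
    return m->pinned && m->vaddr_valid;
}
```

In this model, device DMA stays usable after the discard, while any access
that depends on the vaddr does not until a new vaddr is supplied.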

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/cpr-legacy.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
index 638a8e0..519d772 100644
--- a/hw/vfio/cpr-legacy.c
+++ b/hw/vfio/cpr-legacy.c
@@ -17,6 +17,22 @@
 #include "migration/vmstate.h"
 #include "qapi/error.h"
 
+static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
+{
+    struct vfio_iommu_type1_dma_unmap unmap = {
+        .argsz = sizeof(unmap),
+        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
+        .iova = 0,
+        .size = 0,
+    };
+    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+        error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
+        return false;
+    }
+    return true;
+}
+
+
 static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
 {
     if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
@@ -32,6 +48,18 @@ static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
     }
 }
 
+static int vfio_container_pre_save(void *opaque)
+{
+    VFIOContainer *container = opaque;
+    Error *err = NULL;
+
+    if (!vfio_dma_unmap_vaddr_all(container, &err)) {
+        error_report_err(err);
+        return -1;
+    }
+    return 0;
+}
+
 static int vfio_container_post_load(void *opaque, int version_id)
 {
     VFIOContainer *container = opaque;
@@ -52,6 +80,7 @@ static const VMStateDescription vfio_container_vmstate = {
     .name = "vfio-container",
     .version_id = 0,
     .minimum_version_id = 0,
+    .pre_save = vfio_container_pre_save,
     .post_load = vfio_container_post_load,
     .needed = cpr_needed_for_reuse,
     .fields = (VMStateField[]) {
-- 
1.8.3.1




* [PATCH V3 10/42] vfio/container: restore DMA vaddr
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (8 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 09/42] vfio/container: discard old DMA vaddr Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-15 13:42   ` Cédric Le Goater
  2025-05-22  6:37   ` Cédric Le Goater
  2025-05-12 15:32 ` [PATCH V3 11/42] vfio/container: mdev cpr blocker Steve Sistare
                   ` (32 subsequent siblings)
  42 siblings, 2 replies; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

In new QEMU, do not register the memory listener at device creation time.
Register it later, in the container post_load handler, after all vmstate
that may affect regions and mapping boundaries has been loaded.  The
post_load registration will cause the listener to invoke its callback on
each flat section, and the calls will match the mappings remembered by the
kernel.

The listener calls a special dma_map handler that passes the new VA of each
section to the kernel using VFIO_DMA_MAP_FLAG_VADDR.  Restore the normal
handler at the end.
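
The handler diversion can be sketched on its own as follows.  This is a
simplified illustration under stated assumptions, not the patch's code: the
DemoContainer and DemoSection types and all demo_* functions are
hypothetical, and the handler pointer is stored per container here, whereas
the patch swaps the dma_map member of the VFIOIOMMUClass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct DemoContainer DemoContainer;
typedef int (*DemoDmaMapFn)(DemoContainer *c, uint64_t iova, uint64_t size,
                            void *vaddr);

struct DemoContainer {
    DemoDmaMapFn dma_map;   /* handler currently in effect */
    bool reused;
    int full_maps;          /* mappings created from scratch */
    int vaddr_updates;      /* existing mappings given a new vaddr */
};

typedef struct DemoSection {
    uint64_t iova;
    uint64_t size;
    void *vaddr;
} DemoSection;

/* Normal handler: would create a new mapping. */
static int demo_dma_map(DemoContainer *c, uint64_t iova, uint64_t size,
                        void *vaddr)
{
    (void)iova; (void)size; (void)vaddr;
    c->full_maps++;
    return 0;
}

/* Diverted handler: would only supply the new vaddr for an old mapping. */
static int demo_cpr_dma_map(DemoContainer *c, uint64_t iova, uint64_t size,
                            void *vaddr)
{
    (void)iova; (void)size; (void)vaddr;
    assert(c->reused);
    c->vaddr_updates++;
    return 0;
}

/* At creation, divert dma_map if the container is reused. */
static void demo_container_init(DemoContainer *c, bool reused)
{
    c->reused = reused;
    c->full_maps = 0;
    c->vaddr_updates = 0;
    c->dma_map = reused ? demo_cpr_dma_map : demo_dma_map;
}

/* Registering the listener invokes the handler once per flat section. */
static int demo_listener_register(DemoContainer *c, const DemoSection *s,
                                  size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int ret = c->dma_map(c, s[i].iova, s[i].size, s[i].vaddr);
        if (ret) {
            return ret;
        }
    }
    return 0;
}

/* post_load: register late, then clear reused and restore the handler. */
static int demo_post_load(DemoContainer *c, const DemoSection *s, size_t n)
{
    int ret = demo_listener_register(c, s, n);

    if (ret) {
        return ret;
    }
    c->reused = false;
    c->dma_map = demo_dma_map;
    return 0;
}
```

The ordering matters in this sketch: the listener runs while the diverted
handler is still installed, and only afterwards is the normal handler put
back.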

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/container.c  | 15 +++++++++++++--
 hw/vfio/cpr-legacy.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index a554683..0e02726 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -137,6 +137,8 @@ static int vfio_legacy_dma_unmap_one(const VFIOContainerBase *bcontainer,
     int ret;
     Error *local_err = NULL;
 
+    assert(!container->cpr.reused);
+
     if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer)) {
         if (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
             bcontainer->dirty_pages_supported) {
@@ -691,8 +693,17 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
     }
     group_was_added = true;
 
-    if (!vfio_listener_register(bcontainer, errp)) {
-        goto fail;
+    /*
+     * If reused, register the listener later, after all state that may
+     * affect regions and mapping boundaries has been cpr load'ed.  Later,
+     * the listener will invoke its callback on each flat section and call
+     * dma_map to supply the new vaddr, and the calls will match the mappings
+     * remembered by the kernel.
+     */
+    if (!cpr_reused) {
+        if (!vfio_listener_register(bcontainer, errp)) {
+            goto fail;
+        }
     }
 
     bcontainer->initialized = true;
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
index 519d772..bbcf71e 100644
--- a/hw/vfio/cpr-legacy.c
+++ b/hw/vfio/cpr-legacy.c
@@ -11,11 +11,13 @@
 #include "hw/vfio/vfio-container.h"
 #include "hw/vfio/vfio-cpr.h"
 #include "hw/vfio/vfio-device.h"
+#include "hw/vfio/vfio-listener.h"
 #include "migration/blocker.h"
 #include "migration/cpr.h"
 #include "migration/migration.h"
 #include "migration/vmstate.h"
 #include "qapi/error.h"
+#include "qemu/error-report.h"
 
 static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
 {
@@ -32,6 +34,34 @@ static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
     return true;
 }
 
+/*
+ * Set the new @vaddr for any mappings registered during cpr load.
+ * Reused is cleared thereafter.
+ */
+static int vfio_legacy_cpr_dma_map(const VFIOContainerBase *bcontainer,
+                                   hwaddr iova, ram_addr_t size, void *vaddr,
+                                   bool readonly)
+{
+    const VFIOContainer *container = container_of(bcontainer, VFIOContainer,
+                                                  bcontainer);
+    struct vfio_iommu_type1_dma_map map = {
+        .argsz = sizeof(map),
+        .flags = VFIO_DMA_MAP_FLAG_VADDR,
+        .vaddr = (__u64)(uintptr_t)vaddr,
+        .iova = iova,
+        .size = size,
+    };
+
+    assert(container->cpr.reused);
+
+    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
+        error_report("vfio_legacy_cpr_dma_map (iova %lu, size %ld, va %p): %s",
+                     iova, size, vaddr, strerror(errno));
+        return -errno;
+    }
+
+    return 0;
+}
 
 static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
 {
@@ -63,12 +93,24 @@ static int vfio_container_pre_save(void *opaque)
 static int vfio_container_post_load(void *opaque, int version_id)
 {
     VFIOContainer *container = opaque;
+    VFIOContainerBase *bcontainer = &container->bcontainer;
     VFIOGroup *group;
     VFIODevice *vbasedev;
+    Error *err = NULL;
+
+    if (!vfio_listener_register(bcontainer, &err)) {
+        error_report_err(err);
+        return -1;
+    }
 
     container->cpr.reused = false;
 
     QLIST_FOREACH(group, &container->group_list, container_next) {
+        VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
+
+        /* Restore original dma_map function */
+        vioc->dma_map = vfio_legacy_dma_map;
+
         QLIST_FOREACH(vbasedev, &group->device_list, next) {
             vbasedev->cpr.reused = false;
         }
@@ -80,6 +122,7 @@ static const VMStateDescription vfio_container_vmstate = {
     .name = "vfio-container",
     .version_id = 0,
     .minimum_version_id = 0,
+    .priority = MIG_PRI_LOW,  /* Must happen after devices and groups */
     .pre_save = vfio_container_pre_save,
     .post_load = vfio_container_post_load,
     .needed = cpr_needed_for_reuse,
@@ -104,6 +147,11 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
 
     vmstate_register(NULL, -1, &vfio_container_vmstate, container);
 
+    /* During incoming CPR, divert calls to dma_map. */
+    if (container->cpr.reused) {
+        VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
+        vioc->dma_map = vfio_legacy_cpr_dma_map;
+    }
     return true;
 }
 
-- 
1.8.3.1




* [PATCH V3 11/42] vfio/container: mdev cpr blocker
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (9 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 10/42] vfio/container: restore " Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-16  8:16   ` Cédric Le Goater
  2025-05-12 15:32 ` [PATCH V3 12/42] vfio/container: recover from unmap-all-vaddr failure Steve Sistare
                   ` (31 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

During CPR, after VFIO_DMA_UNMAP_FLAG_VADDR, the vaddr is temporarily
invalid, so mediated devices cannot be supported.  Add a blocker for them.
This restriction will not apply to iommufd containers when CPR is added
for them in a future patch.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/container.c        | 8 ++++++++
 include/hw/vfio/vfio-cpr.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 0e02726..562e3bd 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -995,6 +995,13 @@ static bool vfio_legacy_attach_device(const char *name, VFIODevice *vbasedev,
         goto device_put_exit;
     }
 
+    if (vbasedev->mdev) {
+        error_setg(&vbasedev->cpr.mdev_blocker,
+                   "CPR does not support vfio mdev %s", vbasedev->name);
+        migrate_add_blocker_modes(&vbasedev->cpr.mdev_blocker, &error_fatal,
+                                  MIG_MODE_CPR_TRANSFER, -1);
+    }
+
     return true;
 
 device_put_exit:
@@ -1012,6 +1019,7 @@ static void vfio_legacy_detach_device(VFIODevice *vbasedev)
 
     vfio_device_unprepare(vbasedev);
 
+    migrate_del_blocker(&vbasedev->cpr.mdev_blocker);
     object_unref(vbasedev->hiod);
     vfio_device_put(vbasedev);
     vfio_group_put(group);
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 1c4f070..0fc7ab2 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -18,6 +18,7 @@ typedef struct VFIOContainerCPR {
 
 typedef struct VFIODeviceCPR {
     bool reused;
+    Error *mdev_blocker;
 } VFIODeviceCPR;
 
 struct VFIOContainer;
-- 
1.8.3.1




* [PATCH V3 12/42] vfio/container: recover from unmap-all-vaddr failure
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (10 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 11/42] vfio/container: mdev cpr blocker Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-20  6:29   ` Cédric Le Goater
  2025-05-12 15:32 ` [PATCH V3 13/42] pci: export msix_is_pending Steve Sistare
                   ` (30 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

If there are multiple containers and unmap-all fails for some container, we
need to remap vaddr for the other containers for which unmap-all succeeded.
Recover by walking all address ranges of all containers to restore the vaddr
for each.  Do so by invoking the vfio listener callback, and passing a new
"remap" flag that tells it to restore a mapping without allocating new
userland data structures.
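
The partial-failure case can be sketched with a toy model.  The names below
are hypothetical and the model is deliberately simplified: a boolean stands
in for the kernel's per-mapping vaddr state, and one loop stands in for the
per-container failure notifiers that the patch registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct DemoContainer {
    bool fail_unmap;        /* test knob: make unmap-all fail here */
    bool vaddr_valid;       /* stands in for the kernel's vaddr state */
    bool vaddr_unmapped;    /* set only after a successful unmap-all */
} DemoContainer;

/* Models unmap-all-vaddr for one container. */
static bool demo_unmap_vaddr_all(DemoContainer *c)
{
    if (c->fail_unmap) {
        return false;
    }
    c->vaddr_valid = false;
    c->vaddr_unmapped = true;
    return true;
}

/* Models pre_save across containers: stops at the first failure. */
static bool demo_pre_save_all(DemoContainer *conts, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!demo_unmap_vaddr_all(&conts[i])) {
            return false;
        }
    }
    return true;
}

/* Models the failure notifier: remap only where unmap-all succeeded. */
static void demo_fail_notifier(DemoContainer *conts, size_t n)
{
    assert(conts || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (conts[i].vaddr_unmapped) {
            conts[i].vaddr_valid = true;
            conts[i].vaddr_unmapped = false;
        }
    }
}
```

The vaddr_unmapped flag is what lets recovery touch only the containers
that actually lost their vaddr, leaving the others alone.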

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/cpr-legacy.c                  | 91 +++++++++++++++++++++++++++++++++++
 hw/vfio/listener.c                    | 19 +++++++-
 include/hw/vfio/vfio-container-base.h |  3 ++
 include/hw/vfio/vfio-cpr.h            | 10 ++++
 4 files changed, 122 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
index bbcf71e..f8ddf78 100644
--- a/hw/vfio/cpr-legacy.c
+++ b/hw/vfio/cpr-legacy.c
@@ -31,6 +31,7 @@ static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
         error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
         return false;
     }
+    container->cpr.vaddr_unmapped = true;
     return true;
 }
 
@@ -63,6 +64,14 @@ static int vfio_legacy_cpr_dma_map(const VFIOContainerBase *bcontainer,
     return 0;
 }
 
+static void vfio_region_remap(MemoryListener *listener,
+                              MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            cpr.remap_listener);
+    vfio_container_region_add(&container->bcontainer, section, true);
+}
+
 static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
 {
     if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
@@ -131,6 +140,40 @@ static const VMStateDescription vfio_container_vmstate = {
     }
 };
 
+static int vfio_cpr_fail_notifier(NotifierWithReturn *notifier,
+                                  MigrationEvent *e, Error **errp)
+{
+    VFIOContainer *container =
+        container_of(notifier, VFIOContainer, cpr.transfer_notifier);
+    VFIOContainerBase *bcontainer = &container->bcontainer;
+
+    if (e->type != MIG_EVENT_PRECOPY_FAILED) {
+        return 0;
+    }
+
+    if (container->cpr.vaddr_unmapped) {
+        /*
+         * Force a call to vfio_region_remap for each mapped section by
+         * temporarily registering a listener, and temporarily diverting
+         * dma_map to vfio_legacy_cpr_dma_map.  The latter restores vaddr.
+         */
+
+        VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
+        vioc->dma_map = vfio_legacy_cpr_dma_map;
+
+        container->cpr.remap_listener = (MemoryListener) {
+            .name = "vfio cpr recover",
+            .region_add = vfio_region_remap
+        };
+        memory_listener_register(&container->cpr.remap_listener,
+                                 bcontainer->space->as);
+        memory_listener_unregister(&container->cpr.remap_listener);
+        container->cpr.vaddr_unmapped = false;
+        vioc->dma_map = vfio_legacy_dma_map;
+    }
+    return 0;
+}
+
 bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
 {
     VFIOContainerBase *bcontainer = &container->bcontainer;
@@ -152,6 +195,10 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
         VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
         vioc->dma_map = vfio_legacy_cpr_dma_map;
     }
+
+    migration_add_notifier_mode(&container->cpr.transfer_notifier,
+                                vfio_cpr_fail_notifier,
+                                MIG_MODE_CPR_TRANSFER);
     return true;
 }
 
@@ -162,6 +209,50 @@ void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
     migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
     migrate_del_blocker(&container->cpr.blocker);
     vmstate_unregister(NULL, &vfio_container_vmstate, container);
+    migration_remove_notifier(&container->cpr.transfer_notifier);
+}
+
+/*
+ * In old QEMU, VFIO_DMA_UNMAP_FLAG_VADDR may fail on some mapping after
+ * succeeding for others, so the latter have lost their vaddr.  Call this
+ * to restore vaddr for a section with a giommu.
+ *
+ * The giommu already exists.  Find it and replay it, which calls
+ * vfio_legacy_cpr_dma_map further down the stack.
+ */
+void vfio_cpr_giommu_remap(VFIOContainerBase *bcontainer,
+                           MemoryRegionSection *section)
+{
+    VFIOGuestIOMMU *giommu = NULL;
+    hwaddr as_offset = section->offset_within_address_space;
+    hwaddr iommu_offset = as_offset - section->offset_within_region;
+
+    QLIST_FOREACH(giommu, &bcontainer->giommu_list, giommu_next) {
+        if (giommu->iommu_mr == IOMMU_MEMORY_REGION(section->mr) &&
+            giommu->iommu_offset == iommu_offset) {
+            break;
+        }
+    }
+    g_assert(giommu);
+    memory_region_iommu_replay(giommu->iommu_mr, &giommu->n);
+}
+
+/*
+ * In old QEMU, VFIO_DMA_UNMAP_FLAG_VADDR may fail on some mapping after
+ * succeeding for others, so the latter have lost their vaddr.  Call this
+ * to restore vaddr for a section with a RamDiscardManager.
+ *
+ * The ram discard listener already exists.  Call its populate function
+ * directly, which calls vfio_legacy_cpr_dma_map.
+ */
+bool vfio_cpr_ram_discard_register_listener(VFIOContainerBase *bcontainer,
+                                            MemoryRegionSection *section)
+{
+    VFIORamDiscardListener *vrdl =
+        vfio_find_ram_discard_listener(bcontainer, section);
+
+    g_assert(vrdl);
+    return vrdl->listener.notify_populate(&vrdl->listener, section) == 0;
 }
 
 static bool same_device(int fd1, int fd2)
diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index 5642d04..e86ffcf 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -474,6 +474,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
 {
     VFIOContainerBase *bcontainer = container_of(listener, VFIOContainerBase,
                                                  listener);
+    vfio_container_region_add(bcontainer, section, false);
+}
+
+void vfio_container_region_add(VFIOContainerBase *bcontainer,
+                               MemoryRegionSection *section,
+                               bool cpr_remap)
+{
     hwaddr iova, end;
     Int128 llend, llsize;
     void *vaddr;
@@ -509,6 +516,11 @@ static void vfio_listener_region_add(MemoryListener *listener,
         int iommu_idx;
 
         trace_vfio_listener_region_add_iommu(section->mr->name, iova, end);
+
+        if (cpr_remap) {
+            vfio_cpr_giommu_remap(bcontainer, section);
+        }
+
         /*
          * FIXME: For VFIO iommu types which have KVM acceleration to
          * avoid bouncing all map/unmaps through qemu this way, this
@@ -551,7 +563,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
      * about changes.
      */
     if (memory_region_has_ram_discard_manager(section->mr)) {
-        vfio_ram_discard_register_listener(bcontainer, section);
+        if (!cpr_remap) {
+            vfio_ram_discard_register_listener(bcontainer, section);
+        } else if (!vfio_cpr_ram_discard_register_listener(bcontainer,
+                                                           section)) {
+            goto fail;
+        }
         return;
     }
 
diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index a2f6c3a..5776fd7 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -189,4 +189,7 @@ VFIORamDiscardListener *vfio_find_ram_discard_listener(
 int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
                         ram_addr_t size, void *vaddr, bool readonly);
 
+void vfio_container_region_add(VFIOContainerBase *bcontainer,
+                               MemoryRegionSection *section, bool cpr_remap);
+
 #endif /* HW_VFIO_VFIO_CONTAINER_BASE_H */
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 0fc7ab2..d6d22f2 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -10,10 +10,14 @@
 #define HW_VFIO_VFIO_CPR_H
 
 #include "migration/misc.h"
+#include "system/memory.h"
 
 typedef struct VFIOContainerCPR {
     Error *blocker;
     bool reused;
+    bool vaddr_unmapped;
+    NotifierWithReturn transfer_notifier;
+    MemoryListener remap_listener;
 } VFIOContainerCPR;
 
 typedef struct VFIODeviceCPR {
@@ -39,4 +43,10 @@ void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
 bool vfio_cpr_container_match(struct VFIOContainer *container,
                               struct VFIOGroup *group, int *fd);
 
+void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
+                           MemoryRegionSection *section);
+
+bool vfio_cpr_ram_discard_register_listener(
+    struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
+
 #endif /* HW_VFIO_VFIO_CPR_H */
-- 
1.8.3.1




* [PATCH V3 13/42] pci: export msix_is_pending
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (11 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 12/42] vfio/container: recover from unmap-all-vaddr failure Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-12 15:32 ` [PATCH V3 14/42] pci: skip reset during cpr Steve Sistare
                   ` (29 subsequent siblings)
  42 siblings, 0 replies; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Export msix_is_pending for use by cpr.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/pci/msix.c         | 2 +-
 include/hw/pci/msix.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)
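
Reader's note, not part of the patch: msix_is_pending tests a single bit
of the MSI-X Pending Bit Array, using the byte at vector / 8 as shown by
msix_pending_byte in the hunk below.  The following is a minimal standalone
sketch of that lookup.  It assumes the bit within the byte is selected by
vector % 8, and every model_* name is hypothetical rather than a QEMU
function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a Pending Bit Array: one bit per vector. */
static uint8_t *model_pba_byte(uint8_t *pba, unsigned int vector)
{
    return pba + vector / 8;
}

/* Assumption: the bit position within the byte is vector % 8. */
static uint8_t model_pba_mask(unsigned int vector)
{
    return (uint8_t)(1u << (vector % 8));
}

/* Mark a vector as pending. */
static void model_set_pending(uint8_t *pba, unsigned int vector)
{
    *model_pba_byte(pba, vector) |= model_pba_mask(vector);
}

/* Return nonzero if the pending bit of the vector is set. */
static int model_is_pending(uint8_t *pba, unsigned int vector)
{
    return *model_pba_byte(pba, vector) & model_pba_mask(vector);
}
```

The return value is the masked byte, so callers should test it for
nonzero rather than compare it to 1.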

diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index 66f27b9..8c7f670 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -72,7 +72,7 @@ static uint8_t *msix_pending_byte(PCIDevice *dev, int vector)
     return dev->msix_pba + vector / 8;
 }
 
-static int msix_is_pending(PCIDevice *dev, int vector)
+int msix_is_pending(PCIDevice *dev, unsigned int vector)
 {
     return *msix_pending_byte(dev, vector) & msix_pending_mask(vector);
 }
diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
index 0e6f257..11ef945 100644
--- a/include/hw/pci/msix.h
+++ b/include/hw/pci/msix.h
@@ -32,6 +32,7 @@ int msix_present(PCIDevice *dev);
 bool msix_is_masked(PCIDevice *dev, unsigned vector);
 void msix_set_pending(PCIDevice *dev, unsigned vector);
 void msix_clr_pending(PCIDevice *dev, int vector);
+int msix_is_pending(PCIDevice *dev, unsigned vector);
 
 void msix_vector_use(PCIDevice *dev, unsigned vector);
 void msix_vector_unuse(PCIDevice *dev, unsigned vector);
-- 
1.8.3.1




* [PATCH V3 14/42] pci: skip reset during cpr
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (12 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 13/42] pci: export msix_is_pending Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-16  8:19   ` Cédric Le Goater
  2025-05-12 15:32 ` [PATCH V3 15/42] vfio-pci: " Steve Sistare
                   ` (28 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

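Do not reset a PCI device while CPR state is incoming.  The device is
already configured, and resetting it before cpr load may lose interrupts
for vfio-pci devices.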

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/pci/pci.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)
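
Reader's note, not part of the patch: the change is an early-return guard
at the top of the reset path.  A minimal standalone sketch of the pattern
follows.  The struct, the model_incoming flag (standing in for
cpr_is_incoming()), and the function name are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; these are not the QEMU types or functions. */
struct model_dev {
    int irq_state;      /* state that must survive the live update */
    int reset_count;    /* how many real resets were performed */
};

static bool model_incoming;  /* stands in for cpr_is_incoming() */

static void model_device_reset(struct model_dev *dev)
{
    /* The device is already configured, so leave it untouched. */
    if (model_incoming) {
        return;
    }
    dev->irq_state = 0;
    dev->reset_count++;
}
```

With the flag set, the call leaves the device state as it was.  With the
flag clear, the reset proceeds as before.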

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index fe38c4c..2ba2e0f 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -32,6 +32,8 @@
 #include "hw/pci/pci_host.h"
 #include "hw/qdev-properties.h"
 #include "hw/qdev-properties-system.h"
+#include "migration/cpr.h"
+#include "migration/misc.h"
 #include "migration/qemu-file-types.h"
 #include "migration/vmstate.h"
 #include "net/net.h"
@@ -537,6 +539,17 @@ static void pci_reset_regions(PCIDevice *dev)
 
 static void pci_do_device_reset(PCIDevice *dev)
 {
+    /*
+     * A PCI device that is resuming for cpr is already configured, so do
+     * not reset it here when we are called from qemu_system_reset prior to
+     * cpr load, else interrupts may be lost for vfio-pci devices.  It is
+     * safe to skip this reset for all PCI devices, because vmstate load will
+     * set all fields that would have been set here.
+     */
+    if (cpr_is_incoming()) {
+        return;
+    }
+
     pci_device_deassert_intx(dev);
     assert(dev->irq_state == 0);
 
-- 
1.8.3.1




* [PATCH V3 15/42] vfio-pci: skip reset during cpr
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (13 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 14/42] pci: skip reset during cpr Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-20  6:48   ` Cédric Le Goater
  2025-05-12 15:32 ` [PATCH V3 16/42] vfio/pci: vfio_vector_init Steve Sistare
                   ` (27 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Do not reset a vfio-pci device during CPR, and do not complain if the
kernel's PCI config space changes for non-emulated bits between the
vmstate save and load, which can happen due to ongoing interrupt activity.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/cpr.c              | 31 +++++++++++++++++++++++++++++++
 hw/vfio/pci.c              |  6 ++++++
 include/hw/vfio/vfio-cpr.h |  2 ++
 3 files changed, 39 insertions(+)
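
Reader's note, not part of the patch: the pre_load handler narrows the
compare mask so that config bits owned by the kernel are no longer
compared at load time.  The sketch below models this with a simplified
changed-bits check of the form (saved ^ current) & cmask.  The real check
in get_pci_config_device may apply further masks, and the model_* names
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: true if no compared bit differs. */
static bool model_config_matches(const uint8_t *saved, const uint8_t *current,
                                 const uint8_t *cmask, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        if ((saved[i] ^ current[i]) & cmask[i]) {
            return false;
        }
    }
    return true;
}

/* Keep compare bits only where the config bits are emulated. */
static void model_pre_load(uint8_t *cmask, const uint8_t *emulated,
                           size_t size)
{
    for (size_t i = 0; i < size; i++) {
        cmask[i] &= emulated[i];
    }
}
```

A byte whose emulated mask is zero ends up with a zero compare mask, so
a change made by the kernel in that byte between save and load does not
fail the check.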

diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index 0e59612..6ea8e9f 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -8,6 +8,8 @@
 #include "qemu/osdep.h"
 #include "hw/vfio/vfio-device.h"
 #include "hw/vfio/vfio-cpr.h"
+#include "hw/vfio/pci.h"
+#include "migration/cpr.h"
 #include "qapi/error.h"
 #include "system/runstate.h"
 
@@ -37,3 +39,32 @@ void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
 {
     migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
 }
+
+/*
+ * The kernel may change non-emulated config bits.  Exclude them from the
+ * changed-bits check in get_pci_config_device.
+ */
+static int vfio_cpr_pci_pre_load(void *opaque)
+{
+    VFIOPCIDevice *vdev = opaque;
+    PCIDevice *pdev = &vdev->pdev;
+    int size = MIN(pci_config_size(pdev), vdev->config_size);
+    int i;
+
+    for (i = 0; i < size; i++) {
+        pdev->cmask[i] &= vdev->emulated_config_bits[i];
+    }
+
+    return 0;
+}
+
+const VMStateDescription vfio_cpr_pci_vmstate = {
+    .name = "vfio-cpr-pci",
+    .version_id = 0,
+    .minimum_version_id = 0,
+    .pre_load = vfio_cpr_pci_pre_load,
+    .needed = cpr_needed_for_reuse,
+    .fields = (VMStateField[]) {
+        VMSTATE_END_OF_LIST()
+    }
+};
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index a1bfdfe..4aa83b1 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3344,6 +3344,11 @@ static void vfio_pci_reset(DeviceState *dev)
 {
     VFIOPCIDevice *vdev = VFIO_PCI_BASE(dev);
 
+    /* Do not reset the device during qemu_system_reset prior to cpr load */
+    if (vdev->vbasedev.cpr.reused) {
+        return;
+    }
+
     trace_vfio_pci_reset(vdev->vbasedev.name);
 
     vfio_pci_pre_reset(vdev);
@@ -3513,6 +3518,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, const void *data)
 #ifdef CONFIG_IOMMUFD
     object_class_property_add_str(klass, "fd", NULL, vfio_pci_set_fd);
 #endif
+    dc->vmsd = &vfio_cpr_pci_vmstate;
     dc->desc = "VFIO-based PCI device assignment";
     pdc->realize = vfio_realize;
 
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index d6d22f2..e93600f 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -49,4 +49,6 @@ void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
 bool vfio_cpr_ram_discard_register_listener(
     struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
 
+extern const VMStateDescription vfio_cpr_pci_vmstate;
+
 #endif /* HW_VFIO_VFIO_CPR_H */
-- 
1.8.3.1




* [PATCH V3 16/42] vfio/pci: vfio_vector_init
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (14 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 15/42] vfio-pci: " Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-16  8:32   ` Cédric Le Goater
  2025-05-12 15:32 ` [PATCH V3 17/42] vfio/pci: vfio_notifier_init Steve Sistare
                   ` (26 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Extract a subroutine vfio_vector_init.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/pci.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 4aa83b1..b46c42e 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -511,6 +511,22 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
     kvm_irqchip_commit_routes(kvm_state);
 }
 
+static void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
+{
+    VFIOMSIVector *vector = &vdev->msi_vectors[nr];
+    PCIDevice *pdev = &vdev->pdev;
+
+    vector->vdev = vdev;
+    vector->virq = -1;
+    if (event_notifier_init(&vector->interrupt, 0)) {
+        error_report("vfio: Error: event_notifier_init failed");
+    }
+    vector->use = true;
+    if (vdev->interrupt == VFIO_INT_MSIX) {
+        msix_vector_use(pdev, nr);
+    }
+}
+
 static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
                                    MSIMessage *msg, IOHandler *handler)
 {
@@ -524,13 +540,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
     vector = &vdev->msi_vectors[nr];
 
     if (!vector->use) {
-        vector->vdev = vdev;
-        vector->virq = -1;
-        if (event_notifier_init(&vector->interrupt, 0)) {
-            error_report("vfio: Error: event_notifier_init failed");
-        }
-        vector->use = true;
-        msix_vector_use(pdev, nr);
+        vfio_vector_init(vdev, nr);
     }
 
     qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
-- 
1.8.3.1




* [PATCH V3 17/42] vfio/pci: vfio_notifier_init
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (15 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 16/42] vfio/pci: vfio_vector_init Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-16  8:29   ` Cédric Le Goater
  2025-05-12 15:32 ` [PATCH V3 18/42] vfio/pci: pass vector to virq functions Steve Sistare
                   ` (25 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Move event_notifier_init calls to a helper vfio_notifier_init.
This version is trivial, but it will be expanded to support CPR
in subsequent patches.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/pci.c | 40 +++++++++++++++++++++++++---------------
 1 file changed, 25 insertions(+), 15 deletions(-)
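
Reader's note, not part of the patch: the helper funnels every notifier
creation through one function that returns a bool and reports the
notifier name on failure.  The sketch below shows the same shape outside
QEMU.  It assumes Linux eventfd(2) as the backing object and uses a plain
message buffer in place of Error **errp.  All model_* names are
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical wrapper: returns true on success, fills msg on failure. */
static bool model_notifier_init(int *fd, const char *name,
                                char *msg, size_t msglen)
{
    int ret = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);

    if (ret < 0) {
        snprintf(msg, msglen, "model_notifier_init %s failed: %s",
                 name, strerror(errno));
        return false;
    }
    *fd = ret;
    return true;
}

/* Matching teardown, so that init and cleanup share one code path each. */
static void model_notifier_cleanup(int *fd)
{
    close(*fd);
    *fd = -1;
}
```

Keeping creation and teardown in a single pair of helpers gives later
changes one place to hook in, which is the stated motivation for the
refactoring.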

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index b46c42e..4159deb 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -56,6 +56,16 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
 static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
 
+static bool vfio_notifier_init(EventNotifier *e, const char *name, Error **errp)
+{
+    int ret = event_notifier_init(e, 0);
+
+    if (ret) {
+        error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
+    }
+    return !ret;
+}
+
 /*
  * Disabling BAR mmaping can be slow, but toggling it around INTx can
  * also be a huge overhead.  We try to get the best of both worlds by
@@ -136,8 +146,7 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
     pci_irq_deassert(&vdev->pdev);
 
     /* Get an eventfd for resample/unmask */
-    if (event_notifier_init(&vdev->intx.unmask, 0)) {
-        error_setg(errp, "event_notifier_init failed eoi");
+    if (!vfio_notifier_init(&vdev->intx.unmask, "intx-unmask", errp)) {
         goto fail;
     }
 
@@ -268,7 +277,6 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
     uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
     Error *err = NULL;
     int32_t fd;
-    int ret;
 
 
     if (!pin) {
@@ -291,9 +299,7 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
     }
 #endif
 
-    ret = event_notifier_init(&vdev->intx.interrupt, 0);
-    if (ret) {
-        error_setg_errno(errp, -ret, "event_notifier_init failed");
+    if (!vfio_notifier_init(&vdev->intx.interrupt, "intx-interrupt", errp)) {
         return false;
     }
     fd = event_notifier_get_fd(&vdev->intx.interrupt);
@@ -473,11 +479,13 @@ static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
 
 static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector)
 {
+    const char *name = "kvm_interrupt";
+
     if (vector->virq < 0) {
         return;
     }
 
-    if (event_notifier_init(&vector->kvm_interrupt, 0)) {
+    if (!vfio_notifier_init(&vector->kvm_interrupt, name, NULL)) {
         goto fail_notifier;
     }
 
@@ -515,11 +523,12 @@ static void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
 {
     VFIOMSIVector *vector = &vdev->msi_vectors[nr];
     PCIDevice *pdev = &vdev->pdev;
+    Error *err = NULL;
 
     vector->vdev = vdev;
     vector->virq = -1;
-    if (event_notifier_init(&vector->interrupt, 0)) {
-        error_report("vfio: Error: event_notifier_init failed");
+    if (!vfio_notifier_init(&vector->interrupt, "interrupt", &err)) {
+        error_report_err(err);
     }
     vector->use = true;
     if (vdev->interrupt == VFIO_INT_MSIX) {
@@ -749,13 +758,14 @@ retry:
 
     for (i = 0; i < vdev->nr_vectors; i++) {
         VFIOMSIVector *vector = &vdev->msi_vectors[i];
+        Error *err = NULL;
 
         vector->vdev = vdev;
         vector->virq = -1;
         vector->use = true;
 
-        if (event_notifier_init(&vector->interrupt, 0)) {
-            error_report("vfio: Error: event_notifier_init failed");
+        if (!vfio_notifier_init(&vector->interrupt, "interrupt", &err)) {
+            error_report_err(err);
         }
 
         qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
@@ -2907,8 +2917,8 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
         return;
     }
 
-    if (event_notifier_init(&vdev->err_notifier, 0)) {
-        error_report("vfio: Unable to init event notifier for error detection");
+    if (!vfio_notifier_init(&vdev->err_notifier, "err_notifier", &err)) {
+        error_report_err(err);
         vdev->pci_aer = false;
         return;
     }
@@ -2974,8 +2984,8 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
         return;
     }
 
-    if (event_notifier_init(&vdev->req_notifier, 0)) {
-        error_report("vfio: Unable to init event notifier for device request");
+    if (!vfio_notifier_init(&vdev->req_notifier, "req_notifier", &err)) {
+        error_report_err(err);
         return;
     }
 
-- 
1.8.3.1




* [PATCH V3 18/42] vfio/pci: pass vector to virq functions
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (16 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 17/42] vfio/pci: vfio_notifier_init Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-16  8:28   ` Cédric Le Goater
  2025-05-12 15:32 ` [PATCH V3 19/42] vfio/pci: vfio_notifier_init cpr parameters Steve Sistare
                   ` (24 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Pass the vector number to vfio_connect_kvm_msi_virq and
vfio_remove_kvm_msi_virq, so it can be passed to their subroutines in
a subsequent patch.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/pci.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 4159deb..dad6209 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -477,7 +477,7 @@ static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
                                              vector_n, &vdev->pdev);
 }
 
-static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector)
+static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector, int nr)
 {
     const char *name = "kvm_interrupt";
 
@@ -503,7 +503,8 @@ fail_notifier:
     vector->virq = -1;
 }
 
-static void vfio_remove_kvm_msi_virq(VFIOMSIVector *vector)
+static void vfio_remove_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
+                                     int nr)
 {
     kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
                                           vector->virq);
@@ -561,7 +562,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
      */
     if (vector->virq >= 0) {
         if (!msg) {
-            vfio_remove_kvm_msi_virq(vector);
+            vfio_remove_kvm_msi_virq(vdev, vector, nr);
         } else {
             vfio_update_kvm_msi_virq(vector, *msg, pdev);
         }
@@ -573,7 +574,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
                 vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
                 vfio_add_kvm_msi_virq(vdev, vector, nr, true);
                 kvm_irqchip_commit_route_changes(&vfio_route_change);
-                vfio_connect_kvm_msi_virq(vector);
+                vfio_connect_kvm_msi_virq(vector, nr);
             }
         }
     }
@@ -681,7 +682,7 @@ static void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
     kvm_irqchip_commit_route_changes(&vfio_route_change);
 
     for (i = 0; i < vdev->nr_vectors; i++) {
-        vfio_connect_kvm_msi_virq(&vdev->msi_vectors[i]);
+        vfio_connect_kvm_msi_virq(&vdev->msi_vectors[i], i);
     }
 }
 
@@ -821,7 +822,7 @@ static void vfio_msi_disable_common(VFIOPCIDevice *vdev)
         VFIOMSIVector *vector = &vdev->msi_vectors[i];
         if (vdev->msi_vectors[i].use) {
             if (vector->virq >= 0) {
-                vfio_remove_kvm_msi_virq(vector);
+                vfio_remove_kvm_msi_virq(vdev, vector, i);
             }
             qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
                                 NULL, NULL, NULL);
-- 
1.8.3.1




* [PATCH V3 19/42] vfio/pci: vfio_notifier_init cpr parameters
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (17 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 18/42] vfio/pci: pass vector to virq functions Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-16  8:29   ` Cédric Le Goater
  2025-05-12 15:32 ` [PATCH V3 20/42] vfio/pci: vfio_notifier_cleanup Steve Sistare
                   ` (23 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Pass vdev and nr to vfio_notifier_init, for use by CPR in a subsequent
patch.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/pci.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index dad6209..bfeaafa 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -56,7 +56,8 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
 static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
 
-static bool vfio_notifier_init(EventNotifier *e, const char *name, Error **errp)
+static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
+                               const char *name, int nr, Error **errp)
 {
     int ret = event_notifier_init(e, 0);
 
@@ -146,7 +147,7 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
     pci_irq_deassert(&vdev->pdev);
 
     /* Get an eventfd for resample/unmask */
-    if (!vfio_notifier_init(&vdev->intx.unmask, "intx-unmask", errp)) {
+    if (!vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0, errp)) {
         goto fail;
     }
 
@@ -299,7 +300,8 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
     }
 #endif
 
-    if (!vfio_notifier_init(&vdev->intx.interrupt, "intx-interrupt", errp)) {
+    if (!vfio_notifier_init(vdev, &vdev->intx.interrupt, "intx-interrupt", 0,
+                            errp)) {
         return false;
     }
     fd = event_notifier_get_fd(&vdev->intx.interrupt);
@@ -485,7 +487,8 @@ static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector, int nr)
         return;
     }
 
-    if (!vfio_notifier_init(&vector->kvm_interrupt, name, NULL)) {
+    if (!vfio_notifier_init(vector->vdev, &vector->kvm_interrupt, name, nr,
+                            NULL)) {
         goto fail_notifier;
     }
 
@@ -528,7 +531,7 @@ static void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
 
     vector->vdev = vdev;
     vector->virq = -1;
-    if (!vfio_notifier_init(&vector->interrupt, "interrupt", &err)) {
+    if (!vfio_notifier_init(vdev, &vector->interrupt, "interrupt", nr, &err)) {
         error_report_err(err);
     }
     vector->use = true;
@@ -765,7 +768,8 @@ retry:
         vector->virq = -1;
         vector->use = true;
 
-        if (!vfio_notifier_init(&vector->interrupt, "interrupt", &err)) {
+        if (!vfio_notifier_init(vdev, &vector->interrupt, "interrupt", i,
+                                &err)) {
             error_report_err(err);
         }
 
@@ -2918,7 +2922,8 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
         return;
     }
 
-    if (!vfio_notifier_init(&vdev->err_notifier, "err_notifier", &err)) {
+    if (!vfio_notifier_init(vdev, &vdev->err_notifier, "err_notifier", 0,
+                            &err)) {
         error_report_err(err);
         vdev->pci_aer = false;
         return;
@@ -2985,7 +2990,8 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
         return;
     }
 
-    if (!vfio_notifier_init(&vdev->req_notifier, "req_notifier", &err)) {
+    if (!vfio_notifier_init(vdev, &vdev->req_notifier, "req_notifier", 0,
+                            &err)) {
         error_report_err(err);
         return;
     }
-- 
1.8.3.1




* [PATCH V3 20/42] vfio/pci: vfio_notifier_cleanup
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (18 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 19/42] vfio/pci: vfio_notifier_init cpr parameters Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-16  8:30   ` Cédric Le Goater
  2025-05-12 15:32 ` [PATCH V3 21/42] vfio/pci: export MSI functions Steve Sistare
                   ` (22 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Move event_notifier_cleanup calls to a helper vfio_notifier_cleanup.
This version is trivial, and does not yet use the vdev and nr parameters.
No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/pci.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index bfeaafa..d2b08a3 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -67,6 +67,12 @@ static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
     return !ret;
 }
 
+static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
+                                  const char *name, int nr)
+{
+    event_notifier_cleanup(e);
+}
+
 /*
  * Disabling BAR mmaping can be slow, but toggling it around INTx can
  * also be a huge overhead.  We try to get the best of both worlds by
@@ -179,7 +185,7 @@ fail_vfio:
     kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vdev->intx.interrupt,
                                           vdev->intx.route.irq);
 fail_irqfd:
-    event_notifier_cleanup(&vdev->intx.unmask);
+    vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
 fail:
     qemu_set_fd_handler(irq_fd, vfio_intx_interrupt, NULL, vdev);
     vfio_device_irq_unmask(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
@@ -211,7 +217,7 @@ static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
     }
 
     /* We only need to close the eventfd for VFIO to cleanup the kernel side */
-    event_notifier_cleanup(&vdev->intx.unmask);
+    vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
 
     /* QEMU starts listening for interrupt events. */
     qemu_set_fd_handler(event_notifier_get_fd(&vdev->intx.interrupt),
@@ -310,7 +316,7 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
     if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
                                 VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
         qemu_set_fd_handler(fd, NULL, NULL, vdev);
-        event_notifier_cleanup(&vdev->intx.interrupt);
+        vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
         return false;
     }
 
@@ -337,7 +343,7 @@ static void vfio_intx_disable(VFIOPCIDevice *vdev)
 
     fd = event_notifier_get_fd(&vdev->intx.interrupt);
     qemu_set_fd_handler(fd, NULL, NULL, vdev);
-    event_notifier_cleanup(&vdev->intx.interrupt);
+    vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
 
     vdev->interrupt = VFIO_INT_NONE;
 
@@ -500,7 +506,7 @@ static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector, int nr)
     return;
 
 fail_kvm:
-    event_notifier_cleanup(&vector->kvm_interrupt);
+    vfio_notifier_cleanup(vector->vdev, &vector->kvm_interrupt, name, nr);
 fail_notifier:
     kvm_irqchip_release_virq(kvm_state, vector->virq);
     vector->virq = -1;
@@ -513,7 +519,7 @@ static void vfio_remove_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
                                           vector->virq);
     kvm_irqchip_release_virq(kvm_state, vector->virq);
     vector->virq = -1;
-    event_notifier_cleanup(&vector->kvm_interrupt);
+    vfio_notifier_cleanup(vdev, &vector->kvm_interrupt, "kvm_interrupt", nr);
 }
 
 static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
@@ -830,7 +836,7 @@ static void vfio_msi_disable_common(VFIOPCIDevice *vdev)
             }
             qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
                                 NULL, NULL, NULL);
-            event_notifier_cleanup(&vector->interrupt);
+            vfio_notifier_cleanup(vdev, &vector->interrupt, "interrupt", i);
         }
     }
 
@@ -2936,7 +2942,7 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
                                        VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
         error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
         qemu_set_fd_handler(fd, NULL, NULL, vdev);
-        event_notifier_cleanup(&vdev->err_notifier);
+        vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
         vdev->pci_aer = false;
     }
 }
@@ -2955,7 +2961,7 @@ static void vfio_unregister_err_notifier(VFIOPCIDevice *vdev)
     }
     qemu_set_fd_handler(event_notifier_get_fd(&vdev->err_notifier),
                         NULL, NULL, vdev);
-    event_notifier_cleanup(&vdev->err_notifier);
+    vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
 }
 
 static void vfio_req_notifier_handler(void *opaque)
@@ -3003,7 +3009,7 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
                                        VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
         error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
         qemu_set_fd_handler(fd, NULL, NULL, vdev);
-        event_notifier_cleanup(&vdev->req_notifier);
+        vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
     } else {
         vdev->req_enabled = true;
     }
@@ -3023,7 +3029,7 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
     }
     qemu_set_fd_handler(event_notifier_get_fd(&vdev->req_notifier),
                         NULL, NULL, vdev);
-    event_notifier_cleanup(&vdev->req_notifier);
+    vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
 
     vdev->req_enabled = false;
 }
-- 
1.8.3.1




* [PATCH V3 21/42] vfio/pci: export MSI functions
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (19 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 20/42] vfio/pci: vfio_notifier_cleanup Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-16  8:31   ` Cédric Le Goater
  2025-05-12 15:32 ` [PATCH V3 22/42] vfio-pci: preserve MSI Steve Sistare
                   ` (21 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Export various MSI functions, for use by CPR in subsequent patches.
No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/pci.c | 21 ++++++++++-----------
 hw/vfio/pci.h | 12 ++++++++++++
 2 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index d2b08a3..1bca415 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -279,7 +279,7 @@ static void vfio_irqchip_change(Notifier *notify, void *data)
     vfio_intx_update(vdev, &vdev->intx.route);
 }
 
-static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
+bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
 {
     uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
     Error *err = NULL;
@@ -353,7 +353,7 @@ static void vfio_intx_disable(VFIOPCIDevice *vdev)
 /*
  * MSI/X
  */
-static void vfio_msi_interrupt(void *opaque)
+void vfio_msi_interrupt(void *opaque)
 {
     VFIOMSIVector *vector = opaque;
     VFIOPCIDevice *vdev = vector->vdev;
@@ -474,8 +474,8 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
     return ret;
 }
 
-static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
-                                  int vector_n, bool msix)
+void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
+                           int vector_n, bool msix)
 {
     if ((msix && vdev->no_kvm_msix) || (!msix && vdev->no_kvm_msi)) {
         return;
@@ -529,7 +529,7 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
     kvm_irqchip_commit_routes(kvm_state);
 }
 
-static void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
+void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
 {
     VFIOMSIVector *vector = &vdev->msi_vectors[nr];
     PCIDevice *pdev = &vdev->pdev;
@@ -641,13 +641,12 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
     return 0;
 }
 
-static int vfio_msix_vector_use(PCIDevice *pdev,
-                                unsigned int nr, MSIMessage msg)
+int vfio_msix_vector_use(PCIDevice *pdev, unsigned int nr, MSIMessage msg)
 {
     return vfio_msix_vector_do_use(pdev, nr, &msg, vfio_msi_interrupt);
 }
 
-static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
+void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
 {
     VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
     VFIOMSIVector *vector = &vdev->msi_vectors[nr];
@@ -674,14 +673,14 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
     }
 }
 
-static void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
+void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
 {
     assert(!vdev->defer_kvm_irq_routing);
     vdev->defer_kvm_irq_routing = true;
     vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
 }
 
-static void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
+void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
 {
     int i;
 
@@ -2632,7 +2631,7 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
     return OBJECT(vdev);
 }
 
-static bool vfio_msix_present(void *opaque, int version_id)
+bool vfio_msix_present(void *opaque, int version_id)
 {
     PCIDevice *pdev = opaque;
 
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 5ce0fb9..c892054 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -210,6 +210,18 @@ static inline bool vfio_is_vga(VFIOPCIDevice *vdev)
     return class == PCI_CLASS_DISPLAY_VGA;
 }
 
+/* MSI/MSI-X/INTx */
+void vfio_vector_init(VFIOPCIDevice *vdev, int nr);
+void vfio_msi_interrupt(void *opaque);
+void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
+                           int vector_n, bool msix);
+int vfio_msix_vector_use(PCIDevice *pdev, unsigned int nr, MSIMessage msg);
+void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr);
+bool vfio_msix_present(void *opaque, int version_id);
+void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
+void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
+bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp);
+
 uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
 void vfio_pci_write_config(PCIDevice *pdev,
                            uint32_t addr, uint32_t val, int len);
-- 
1.8.3.1




* [PATCH V3 22/42] vfio-pci: preserve MSI
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (20 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 21/42] vfio/pci: export MSI functions Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-28 17:44   ` Steven Sistare
  2025-05-12 15:32 ` [PATCH V3 23/42] vfio-pci: preserve INTx Steve Sistare
                   ` (20 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Save the MSI message area as part of vfio-pci vmstate, and preserve the
interrupt and notifier eventfds.  migrate_incoming loads the MSI data,
then the vfio-pci post_load handler finds the eventfds in CPR state,
rebuilds vector data structures, and attaches the interrupts to the new
KVM instance.
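
As a rough illustration of the lookup that the post_load handler relies
on, the sketch below keeps descriptors in a small table keyed by
(name, index).  It is a standalone toy and not QEMU's implementation:
fd_table_save, fd_table_find and fd_table_delete are hypothetical
stand-ins for cpr_save_fd, cpr_find_fd and cpr_delete_fd, whose real
storage is not shown in this patch, and the fixed table size is an
assumption made only to keep the example short.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define FD_TABLE_MAX 64
#define FD_NAME_MAX  64

/* Hypothetical stand-in for CPR state: descriptors keyed by (name, index). */
struct fd_entry {
    char name[FD_NAME_MAX];
    int index;
    int fd;
    int used;
};

static struct fd_entry fd_table[FD_TABLE_MAX];

/* Return the entry for (name, index), or NULL if none exists. */
static struct fd_entry *fd_table_lookup(const char *name, int index)
{
    for (int i = 0; i < FD_TABLE_MAX; i++) {
        struct fd_entry *e = &fd_table[i];

        if (e->used && e->index == index && !strcmp(e->name, name)) {
            return e;
        }
    }
    return NULL;
}

/* Record fd under (name, index), replacing any old value.  -1 if full. */
int fd_table_save(const char *name, int index, int fd)
{
    struct fd_entry *e = fd_table_lookup(name, index);

    assert(strlen(name) < FD_NAME_MAX);

    /* No existing entry: take the first free slot. */
    for (int i = 0; !e && i < FD_TABLE_MAX; i++) {
        if (!fd_table[i].used) {
            e = &fd_table[i];
        }
    }
    if (!e) {
        return -1;
    }
    snprintf(e->name, sizeof(e->name), "%s", name);
    e->index = index;
    e->fd = fd;
    e->used = 1;
    return 0;
}

/* Return the saved fd, or -1 if (name, index) was never saved. */
int fd_table_find(const char *name, int index)
{
    struct fd_entry *e = fd_table_lookup(name, index);

    return e ? e->fd : -1;
}

/* Forget (name, index).  The descriptor itself is not closed. */
void fd_table_delete(const char *name, int index)
{
    struct fd_entry *e = fd_table_lookup(name, index);

    if (e) {
        e->used = 0;
    }
}
```

In this model, a find that returns -1 corresponds to the "create a new
eventfd" branch, and a non-negative result corresponds to the "reuse the
preserved eventfd" branch.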

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/cpr.c              | 91 ++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c              | 40 ++++++++++++++++++--
 include/hw/vfio/vfio-cpr.h |  8 ++++
 3 files changed, 136 insertions(+), 3 deletions(-)

diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index 6ea8e9f..be132fa 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -9,6 +9,8 @@
 #include "hw/vfio/vfio-device.h"
 #include "hw/vfio/vfio-cpr.h"
 #include "hw/vfio/pci.h"
+#include "hw/pci/msix.h"
+#include "hw/pci/msi.h"
 #include "migration/cpr.h"
 #include "qapi/error.h"
 #include "system/runstate.h"
@@ -40,6 +42,69 @@ void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
     migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
 }
 
+#define STRDUP_VECTOR_FD_NAME(vdev, name)   \
+    g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
+
+void vfio_cpr_save_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr,
+                             int fd)
+{
+    g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
+    cpr_save_fd(fdname, nr, fd);
+}
+
+int vfio_cpr_load_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+    g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
+    return cpr_find_fd(fdname, nr);
+}
+
+void vfio_cpr_delete_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+    g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
+    cpr_delete_fd(fdname, nr);
+}
+
+static void vfio_cpr_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors,
+                                   bool msix)
+{
+    int i, fd;
+    bool pending = false;
+    PCIDevice *pdev = &vdev->pdev;
+
+    vdev->nr_vectors = nr_vectors;
+    vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
+    vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
+
+    vfio_prepare_kvm_msi_virq_batch(vdev);
+
+    for (i = 0; i < nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+
+        fd = vfio_cpr_load_vector_fd(vdev, "interrupt", i);
+        if (fd >= 0) {
+            vfio_vector_init(vdev, i);
+            qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
+        }
+
+        if (vfio_cpr_load_vector_fd(vdev, "kvm_interrupt", i) >= 0) {
+            vfio_add_kvm_msi_virq(vdev, vector, i, msix);
+        } else {
+            vdev->msi_vectors[i].virq = -1;
+        }
+
+        if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
+            set_bit(i, vdev->msix->pending);
+            pending = true;
+        }
+    }
+
+    vfio_commit_kvm_msi_virq_batch(vdev);
+
+    if (msix) {
+        memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
+    }
+}
+
 /*
  * The kernel may change non-emulated config bits.  Exclude them from the
  * changed-bits check in get_pci_config_device.
@@ -58,13 +123,39 @@ static int vfio_cpr_pci_pre_load(void *opaque)
     return 0;
 }
 
+static int vfio_cpr_pci_post_load(void *opaque, int version_id)
+{
+    VFIOPCIDevice *vdev = opaque;
+    PCIDevice *pdev = &vdev->pdev;
+    int nr_vectors;
+
+    if (msix_enabled(pdev)) {
+        msix_set_vector_notifiers(pdev, vfio_msix_vector_use,
+                                   vfio_msix_vector_release, NULL);
+        nr_vectors = vdev->msix->entries;
+        vfio_cpr_claim_vectors(vdev, nr_vectors, true);
+
+    } else if (msi_enabled(pdev)) {
+        nr_vectors = msi_nr_vectors_allocated(pdev);
+        vfio_cpr_claim_vectors(vdev, nr_vectors, false);
+
+    } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
+        g_assert_not_reached();      /* completed in a subsequent patch */
+    }
+
+    return 0;
+}
+
 const VMStateDescription vfio_cpr_pci_vmstate = {
     .name = "vfio-cpr-pci",
     .version_id = 0,
     .minimum_version_id = 0,
     .pre_load = vfio_cpr_pci_pre_load,
+    .post_load = vfio_cpr_pci_post_load,
     .needed = cpr_needed_for_reuse,
     .fields = (VMStateField[]) {
+        VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
+        VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
         VMSTATE_END_OF_LIST()
     }
 };
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 1bca415..bfa72bc 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -29,6 +29,7 @@
 #include "hw/pci/pci_bridge.h"
 #include "hw/qdev-properties.h"
 #include "hw/qdev-properties-system.h"
+#include "hw/vfio/vfio-cpr.h"
 #include "migration/vmstate.h"
 #include "qobject/qdict.h"
 #include "qemu/error-report.h"
@@ -56,13 +57,25 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
 static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
 
+/* Create new or reuse existing eventfd */
 static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
                                const char *name, int nr, Error **errp)
 {
-    int ret = event_notifier_init(e, 0);
+    int fd = vfio_cpr_load_vector_fd(vdev, name, nr);
+    int ret = 0;
 
-    if (ret) {
-        error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
+    if (fd >= 0) {
+        event_notifier_init_fd(e, fd);
+    } else {
+        ret = event_notifier_init(e, 0);
+        if (ret) {
+            error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
+        } else {
+            fd = event_notifier_get_fd(e);
+            if (fd >= 0) {
+                vfio_cpr_save_vector_fd(vdev, name, nr, fd);
+            }
+        }
     }
     return !ret;
 }
@@ -70,6 +83,7 @@ static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
 static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
                                   const char *name, int nr)
 {
+    vfio_cpr_delete_vector_fd(vdev, name, nr);
     event_notifier_cleanup(e);
 }
 
@@ -554,6 +568,15 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
     int ret;
     bool resizing = !!(vdev->nr_vectors < nr + 1);
 
+    /*
+     * Ignore the callback from msix_set_vector_notifiers during resume.
+     * The necessary subset of these actions is called from
+     * vfio_cpr_claim_vectors during post load.
+     */
+    if (vdev->vbasedev.cpr.reused) {
+        return 0;
+    }
+
     trace_vfio_msix_vector_do_use(vdev->vbasedev.name, nr);
 
     vector = &vdev->msi_vectors[nr];
@@ -2937,6 +2960,11 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
     fd = event_notifier_get_fd(&vdev->err_notifier);
     qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
 
+    /* Do not alter irq_signaling during vfio_realize for cpr */
+    if (vdev->vbasedev.cpr.reused) {
+        return;
+    }
+
     if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_ERR_IRQ_INDEX, 0,
                                        VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
         error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
@@ -3004,6 +3032,12 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
     fd = event_notifier_get_fd(&vdev->req_notifier);
     qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
 
+    /* Do not alter irq_signaling during vfio_realize for cpr */
+    if (vdev->vbasedev.cpr.reused) {
+        vdev->req_enabled = true;
+        return;
+    }
+
     if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_REQ_IRQ_INDEX, 0,
                                        VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
         error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index e93600f..765e334 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -28,6 +28,7 @@ typedef struct VFIODeviceCPR {
 struct VFIOContainer;
 struct VFIOContainerBase;
 struct VFIOGroup;
+struct VFIOPCIDevice;
 
 bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
                                         Error **errp);
@@ -49,6 +50,13 @@ void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
 bool vfio_cpr_ram_discard_register_listener(
     struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
 
+void vfio_cpr_save_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
+                             int nr, int fd);
+int vfio_cpr_load_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
+                            int nr);
+void vfio_cpr_delete_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
+                               int nr);
+
 extern const VMStateDescription vfio_cpr_pci_vmstate;
 
 #endif /* HW_VFIO_VFIO_CPR_H */
-- 
1.8.3.1




* [PATCH V3 23/42] vfio-pci: preserve INTx
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (21 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 22/42] vfio-pci: preserve MSI Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-12 15:32 ` [PATCH V3 24/42] migration: close kvm after cpr Steve Sistare
                   ` (19 subsequent siblings)
  42 siblings, 0 replies; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Preserve vfio INTx state across cpr-transfer.  Preserve VFIOINTx fields as
follows:
  pin : Recover this from the vfio config in kernel space
  interrupt : Preserve its eventfd descriptor across exec.
  unmask : Ditto
  route.irq : This could perhaps be recovered in vfio_cpr_pci_post_load by
    calling pci_device_route_intx_to_irq(pin), whose implementation reads
    config space for a bridge device such as ich9.  However, there is no
    guarantee that the bridge vmstate is read before vfio vmstate.  Rather
    than fiddling with MigrationPriority for vmstate handlers, explicitly
    save route.irq in vfio vmstate.
  pending : Save in vfio vmstate.
  mmap_timeout, mmap_timer : Re-initialize
  kvm_accel : Re-initialize

In vfio_realize, defer calling vfio_intx_enable until the vmstate
is available, in vfio_cpr_pci_post_load.  Modify vfio_intx_enable and
vfio_intx_enable_kvm to skip vfio initialization, but still perform
kvm initialization.
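
To make the "explicitly save" choice concrete, here is a minimal sketch
of writing the three saved INTx fields to a byte buffer and reading them
back.  It is only an illustration under stated assumptions: struct
intx_state is a simplified stand-in rather than the real VFIOINTx
layout, the 9-byte big-endian encoding is chosen for this example, and
intx_state_save and intx_state_load are hypothetical names.  The patch
itself uses the vmstate macros instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative subset of INTx state; not the real VFIOINTx layout. */
struct intx_state {
    bool pending;
    uint32_t route_mode;
    int32_t route_irq;
};

#define INTX_WIRE_SIZE 9    /* 1 byte for the bool + two 32-bit words */

static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Encode the state into buf, which must hold INTX_WIRE_SIZE bytes. */
void intx_state_save(const struct intx_state *s, uint8_t *buf)
{
    uint32_t irq;

    /* Copy the bit pattern so that a negative irq survives unchanged. */
    memcpy(&irq, &s->route_irq, sizeof(irq));

    buf[0] = s->pending ? 1 : 0;
    put_be32(buf + 1, s->route_mode);
    put_be32(buf + 5, irq);
}

/* Decode the state from buf, which must hold INTX_WIRE_SIZE bytes. */
void intx_state_load(struct intx_state *s, const uint8_t *buf)
{
    uint32_t irq = get_be32(buf + 5);

    s->pending = buf[0] != 0;
    s->route_mode = get_be32(buf + 1);
    memcpy(&s->route_irq, &irq, sizeof(irq));
}
```

Because the loader reads route_irq directly from the saved bytes, it
does not depend on any other device's state having been loaded first,
which is the ordering concern described above.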

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/cpr.c | 27 ++++++++++++++++++++++++++-
 hw/vfio/pci.c | 29 ++++++++++++++++++++++++++---
 2 files changed, 52 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index be132fa..6081a89 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -140,12 +140,36 @@ static int vfio_cpr_pci_post_load(void *opaque, int version_id)
         vfio_cpr_claim_vectors(vdev, nr_vectors, false);
 
     } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
-        g_assert_not_reached();      /* completed in a subsequent patch */
+        Error *err = NULL;
+        if (!vfio_intx_enable(vdev, &err)) {
+            error_report_err(err);
+            return -1;
+        }
     }
 
     return 0;
 }
 
+static const VMStateDescription vfio_intx_vmstate = {
+    .name = "vfio-cpr-intx",
+    .version_id = 0,
+    .minimum_version_id = 0,
+    .fields = (VMStateField[]) {
+        VMSTATE_BOOL(pending, VFIOINTx),
+        VMSTATE_UINT32(route.mode, VFIOINTx),
+        VMSTATE_INT32(route.irq, VFIOINTx),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+#define VMSTATE_VFIO_INTX(_field, _state) {                         \
+    .name       = (stringify(_field)),                              \
+    .size       = sizeof(VFIOINTx),                                 \
+    .vmsd       = &vfio_intx_vmstate,                               \
+    .flags      = VMS_STRUCT,                                       \
+    .offset     = vmstate_offset_value(_state, _field, VFIOINTx),   \
+}
+
 const VMStateDescription vfio_cpr_pci_vmstate = {
     .name = "vfio-cpr-pci",
     .version_id = 0,
@@ -156,6 +180,7 @@ const VMStateDescription vfio_cpr_pci_vmstate = {
     .fields = (VMStateField[]) {
         VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
         VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
+        VMSTATE_VFIO_INTX(intx, VFIOPCIDevice),
         VMSTATE_END_OF_LIST()
     }
 };
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index bfa72bc..84282c0 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -160,12 +160,17 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
         return true;
     }
 
+    if (vdev->vbasedev.cpr.reused) {
+        goto skip_state;
+    }
+
     /* Get to a known interrupt state */
     qemu_set_fd_handler(irq_fd, NULL, NULL, vdev);
     vfio_device_irq_mask(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
     vdev->intx.pending = false;
     pci_irq_deassert(&vdev->pdev);
 
+skip_state:
     /* Get an eventfd for resample/unmask */
     if (!vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0, errp)) {
         goto fail;
@@ -179,6 +184,10 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
         goto fail_irqfd;
     }
 
+    if (vdev->vbasedev.cpr.reused) {
+        goto skip_irq;
+    }
+
     if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
                                        VFIO_IRQ_SET_ACTION_UNMASK,
                                        event_notifier_get_fd(&vdev->intx.unmask),
@@ -189,6 +198,7 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
     /* Let'em rip */
     vfio_device_irq_unmask(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
 
+skip_irq:
     vdev->intx.kvm_accel = true;
 
     trace_vfio_intx_enable_kvm(vdev->vbasedev.name);
@@ -304,7 +314,13 @@ bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
         return true;
     }
 
-    vfio_disable_interrupts(vdev);
+    /*
+     * Do not alter interrupt state during vfio_realize and cpr load.  The
+     * reused flag is cleared thereafter.
+     */
+    if (!vdev->vbasedev.cpr.reused) {
+        vfio_disable_interrupts(vdev);
+    }
 
     vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
     pci_config_set_interrupt_pin(vdev->pdev.config, pin);
@@ -327,7 +343,8 @@ bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
     fd = event_notifier_get_fd(&vdev->intx.interrupt);
     qemu_set_fd_handler(fd, vfio_intx_interrupt, NULL, vdev);
 
-    if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
+    if (!vdev->vbasedev.cpr.reused &&
+        !vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
                                 VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
         qemu_set_fd_handler(fd, NULL, NULL, vdev);
         vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
@@ -3182,7 +3199,13 @@ static bool vfio_interrupt_setup(VFIOPCIDevice *vdev, Error **errp)
                                              vfio_intx_routing_notifier);
         vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
         kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
-        if (!vfio_intx_enable(vdev, errp)) {
+
+        /*
+         * During CPR, do not call vfio_intx_enable at this time.  Instead,
+         * call it from vfio_cpr_pci_post_load after the intx routing data has
+         * been loaded from vmstate.
+         */
+        if (!vdev->vbasedev.cpr.reused && !vfio_intx_enable(vdev, errp)) {
             timer_free(vdev->intx.mmap_timer);
             pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
             kvm_irqchip_remove_change_notifier(&vdev->irqchip_change_notifier);
-- 
1.8.3.1




* [PATCH V3 24/42] migration: close kvm after cpr
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (22 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 23/42] vfio-pci: preserve INTx Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-16  8:35   ` Cédric Le Goater
  2025-05-12 15:32 ` [PATCH V3 25/42] migration: cpr_get_fd_param helper Steve Sistare
                   ` (18 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

cpr-transfer breaks vfio network connectivity to and from the guest, and
the host system log shows:
  irq bypass consumer (token 00000000a03c32e5) registration fails: -16
which is EBUSY.  This occurs because KVM descriptors are still open in
the old QEMU process.  Close them.
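
The underlying rule is that a kernel object stays alive, and can stay
busy, until the last descriptor that refers to it is closed.  The sketch
below shows that rule with an ordinary pipe, used purely as an analogy:
it does not touch KVM or vfio, and close_fd and pipe_at_eof are
hypothetical helper names.  close_fd follows the same "close, then set
to -1" pattern that this patch uses.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Close *fd if it is open and mark it closed, so a second call is a no-op. */
void close_fd(int *fd)
{
    if (*fd != -1) {
        close(*fd);
        *fd = -1;
    }
}

/*
 * For a non-blocking pipe read end: return 1 if every write end has been
 * closed (read reports EOF), 0 if a writer still exists, -1 on error.
 */
int pipe_at_eof(int rfd)
{
    char c;
    ssize_t n = read(rfd, &c, 1);

    if (n == 0) {
        return 1;               /* EOF: no descriptor refers to the writer */
    }
    if (n > 0 || errno == EAGAIN || errno == EWOULDBLOCK) {
        return 0;               /* data or would-block: a writer remains */
    }
    return -1;
}
```

With a duplicated write end, closing only one of the two descriptors
leaves pipe_at_eof returning 0.  It returns 1 only after the duplicate
is closed too, in the same way that the new QEMU's registration can
succeed only after the old process drops its KVM descriptors.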

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 accel/kvm/kvm-all.c           | 28 ++++++++++++++++++++++++++++
 hw/vfio/helpers.c             | 10 ++++++++++
 include/hw/vfio/vfio-device.h |  2 ++
 include/migration/cpr.h       |  2 ++
 include/qemu/vfio-helpers.h   |  1 -
 include/system/kvm.h          |  1 +
 migration/cpr-transfer.c      | 18 ++++++++++++++++++
 migration/cpr.c               |  8 ++++++++
 migration/migration.c         |  1 +
 9 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 278a506..d619448 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -512,16 +512,23 @@ static int do_kvm_destroy_vcpu(CPUState *cpu)
         goto err;
     }
 
+    /* If I am the CPU that created coalesced_mmio_ring, then discard it */
+    if (s->coalesced_mmio_ring == (void *)cpu->kvm_run + PAGE_SIZE) {
+        s->coalesced_mmio_ring = NULL;
+    }
+
     ret = munmap(cpu->kvm_run, mmap_size);
     if (ret < 0) {
         goto err;
     }
+    cpu->kvm_run = NULL;
 
     if (cpu->kvm_dirty_gfns) {
         ret = munmap(cpu->kvm_dirty_gfns, s->kvm_dirty_ring_bytes);
         if (ret < 0) {
             goto err;
         }
+        cpu->kvm_dirty_gfns = NULL;
     }
 
     kvm_park_vcpu(cpu);
@@ -600,6 +607,27 @@ err:
     return ret;
 }
 
+void kvm_close(void)
+{
+    CPUState *cpu;
+
+    CPU_FOREACH(cpu) {
+        cpu_remove_sync(cpu);
+        close(cpu->kvm_fd);
+        cpu->kvm_fd = -1;
+        close(cpu->kvm_vcpu_stats_fd);
+        cpu->kvm_vcpu_stats_fd = -1;
+    }
+
+    if (kvm_state && kvm_state->fd != -1) {
+        close(kvm_state->vmfd);
+        kvm_state->vmfd = -1;
+        close(kvm_state->fd);
+        kvm_state->fd = -1;
+    }
+    kvm_state = NULL;
+}
+
 /*
  * dirty pages logging control
  */
diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
index d0dbab1..af1db2f 100644
--- a/hw/vfio/helpers.c
+++ b/hw/vfio/helpers.c
@@ -117,6 +117,16 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
 int vfio_kvm_device_fd = -1;
 #endif
 
+void vfio_kvm_device_close(void)
+{
+#ifdef CONFIG_KVM
+    if (vfio_kvm_device_fd != -1) {
+        close(vfio_kvm_device_fd);
+        vfio_kvm_device_fd = -1;
+    }
+#endif
+}
+
 int vfio_kvm_device_add_fd(int fd, Error **errp)
 {
 #ifdef CONFIG_KVM
diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
index 4e4d0b6..6eb6f21 100644
--- a/include/hw/vfio/vfio-device.h
+++ b/include/hw/vfio/vfio-device.h
@@ -231,4 +231,6 @@ void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
 void vfio_device_init(VFIODevice *vbasedev, int type, VFIODeviceOps *ops,
                       DeviceState *dev, bool ram_discard);
 int vfio_device_get_aw_bits(VFIODevice *vdev);
+
+void vfio_kvm_device_close(void);
 #endif /* HW_VFIO_VFIO_COMMON_H */
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index fc6aa33..5f1ff10 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -31,7 +31,9 @@ void cpr_state_close(void);
 struct QIOChannel *cpr_state_ioc(void);
 
 bool cpr_needed_for_reuse(void *opaque);
+void cpr_kvm_close(void);
 
+void cpr_transfer_init(void);
 QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
 QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
 
diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
index bde9495..a029036 100644
--- a/include/qemu/vfio-helpers.h
+++ b/include/qemu/vfio-helpers.h
@@ -28,5 +28,4 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar,
                              uint64_t offset, uint64_t size);
 int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
                            int irq_type, Error **errp);
-
 #endif
diff --git a/include/system/kvm.h b/include/system/kvm.h
index b690dda..cfaa94c 100644
--- a/include/system/kvm.h
+++ b/include/system/kvm.h
@@ -194,6 +194,7 @@ bool kvm_has_sync_mmu(void);
 int kvm_has_vcpu_events(void);
 int kvm_max_nested_state_length(void);
 int kvm_has_gsi_routing(void);
+void kvm_close(void);
 
 /**
  * kvm_arm_supports_user_irq
diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c
index e1f1403..396558f 100644
--- a/migration/cpr-transfer.c
+++ b/migration/cpr-transfer.c
@@ -17,6 +17,24 @@
 #include "migration/vmstate.h"
 #include "trace.h"
 
+static int cpr_transfer_notifier(NotifierWithReturn *notifier,
+                                 MigrationEvent *e,
+                                 Error **errp)
+{
+    if (e->type == MIG_EVENT_PRECOPY_DONE) {
+        cpr_kvm_close();
+    }
+    return 0;
+}
+
+void cpr_transfer_init(void)
+{
+    static NotifierWithReturn notifier;
+
+    migration_add_notifier_mode(&notifier, cpr_transfer_notifier,
+                                MIG_MODE_CPR_TRANSFER);
+}
+
 QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp)
 {
     MigrationAddress *addr = channel->addr;
diff --git a/migration/cpr.c b/migration/cpr.c
index 0b01e25..6102d04 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -7,12 +7,14 @@
 
 #include "qemu/osdep.h"
 #include "qapi/error.h"
+#include "hw/vfio/vfio-device.h"
 #include "migration/cpr.h"
 #include "migration/misc.h"
 #include "migration/options.h"
 #include "migration/qemu-file.h"
 #include "migration/savevm.h"
 #include "migration/vmstate.h"
+#include "system/kvm.h"
 #include "system/runstate.h"
 #include "trace.h"
 
@@ -252,3 +254,9 @@ bool cpr_needed_for_reuse(void *opaque)
     MigMode mode = migrate_mode();
     return mode == MIG_MODE_CPR_TRANSFER;
 }
+
+void cpr_kvm_close(void)
+{
+    kvm_close();
+    vfio_kvm_device_close();
+}
diff --git a/migration/migration.c b/migration/migration.c
index 4697732..89e2026 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -337,6 +337,7 @@ void migration_object_init(void)
 
     ram_mig_init();
     dirty_bitmap_mig_init();
+    cpr_transfer_init();
 
     /* Initialize cpu throttle timers */
     cpu_throttle_init();
-- 
1.8.3.1




* [PATCH V3 25/42] migration: cpr_get_fd_param helper
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (23 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 24/42] migration: close kvm after cpr Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-19 21:22   ` Fabiano Rosas
  2025-05-12 15:32 ` [PATCH V3 26/42] vfio: return mr from vfio_get_xlat_addr Steve Sistare
                   ` (17 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Add the helper function cpr_get_fd_param, to use when preserving
a file descriptor that is opened externally and passed to QEMU.
cpr_get_fd_param returns a descriptor number either from a QEMU
command-line parameter, from a getfd command, or from CPR state.

When a descriptor is passed to new QEMU via SCM_RIGHTS, its number
changes.  Hence, during CPR, the command-line parameter is ignored
in new QEMU, and overridden by the value found in CPR state.

Similarly, if the descriptor was originally specified by a getfd
command in old QEMU, the fd number is not known outside of QEMU,
and it changes when sent to new QEMU via SCM_RIGHTS.  Hence the
user cannot send getfd to new QEMU, but when the user sends a
hotplug command that references the fd, cpr_get_fd_param finds
its value in CPR state.
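
For readers unfamiliar with why the number changes, the following sketch
passes a descriptor over a Unix socket with SCM_RIGHTS.  It uses only
standard POSIX socket calls and is not taken from QEMU.  send_fd and
recv_fd are hypothetical helper names, and a single dummy payload byte
is sent because ancillary data needs some regular data to accompany it
on a stream socket.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send fd over the Unix socket sock.  Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;           /* forces correct alignment */
    } u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from sock.  Returns the new fd, or -1 on error. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    int fd;

    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    /* The kernel installs a new descriptor number in the receiver. */
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The value returned by recv_fd refers to the same open file as the
sender's descriptor but is in general a different integer, so a number
recorded on the command line of the old process cannot be trusted in the
new one.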

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/cpr.h |  2 ++
 migration/cpr.c         | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 5f1ff10..353fd34 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -32,6 +32,8 @@ struct QIOChannel *cpr_state_ioc(void);
 
 bool cpr_needed_for_reuse(void *opaque);
 void cpr_kvm_close(void);
+int cpr_get_fd_param(const char *name, const char *fdname, int index,
+                     bool *reused, Error **errp);
 
 void cpr_transfer_init(void);
 QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
diff --git a/migration/cpr.c b/migration/cpr.c
index 6102d04..179aed1 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -14,6 +14,7 @@
 #include "migration/qemu-file.h"
 #include "migration/savevm.h"
 #include "migration/vmstate.h"
+#include "monitor/monitor.h"
 #include "system/kvm.h"
 #include "system/runstate.h"
 #include "trace.h"
@@ -260,3 +261,42 @@ void cpr_kvm_close(void)
     kvm_close();
     vfio_kvm_device_close();
 }
+
+/*
+ * cpr_get_fd_param: find a descriptor and return its value.
+ *
+ * @name: CPR name for the descriptor
+ * @fdname: An integer-valued string, or a name passed to a getfd command
+ * @index: CPR index of the descriptor
+ * @reused: returns true if the fd is found in CPR state, else false.
+ * @errp: returned error message
+ *
+ * If CPR is not being performed, then use @fdname to find the fd.
+ * If CPR is being performed, then ignore @fdname, and look for @name
+ * and @index in CPR state.
+ *
+ * On success returns the fd value, else returns -1.
+ */
+int cpr_get_fd_param(const char *name, const char *fdname, int index,
+                     bool *reused, Error **errp)
+{
+    ERRP_GUARD();
+    int fd;
+
+    if (cpr_is_incoming()) {
+        fd = cpr_find_fd(name, index);
+        if (fd < 0) {
+            error_setg(errp, "cannot find saved value for fd %s", fdname);
+        }
+        *reused = true;
+    } else {
+        fd = monitor_fd_param(monitor_cur(), fdname, errp);
+        if (fd >= 0) {
+            cpr_save_fd(name, index, fd);
+        } else {
+            error_prepend(errp, "Could not parse object fd %s:", fdname);
+        }
+        *reused = false;
+    }
+    return fd;
+}
-- 
1.8.3.1




* [PATCH V3 26/42] vfio: return mr from vfio_get_xlat_addr
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (24 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 25/42] migration: cpr_get_fd_param helper Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-12 20:51   ` John Levon
  2025-05-13 11:12   ` Mark Cave-Ayland
  2025-05-12 15:32 ` [PATCH V3 27/42] vfio: pass ramblock to vfio_container_dma_map Steve Sistare
                   ` (16 subsequent siblings)
  42 siblings, 2 replies; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Modify memory_get_xlat_addr and vfio_get_xlat_addr to return the memory
region that the translated address is found in.  This will be needed by
CPR in a subsequent patch to map blocks using IOMMU_IOAS_MAP_FILE.

Also return the xlat offset, so we can simplify the interface by removing
the out parameters that can be trivially derived from mr and xlat.
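
As an illustration of "trivially derived", the three removed out
parameters can each be computed from the region and the offset.  The
sketch below uses a hypothetical, simplified region type (sketch_region)
rather than MemoryRegion, and a plain bool in place of the IOMMU
permission bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of a RAM-backed memory region. */
struct sketch_region {
    uint8_t *host_base;   /* host virtual address of the region start */
    uint64_t ram_base;    /* RAM address of the region start */
    bool readonly;        /* region itself forbids writes */
};

/* Host virtual address of the translated location. */
uint8_t *sketch_vaddr(const struct sketch_region *mr, uint64_t xlat)
{
    assert(mr != NULL);
    return mr->host_base + xlat;
}

/* RAM address of the translated location. */
uint64_t sketch_ram_addr(const struct sketch_region *mr, uint64_t xlat)
{
    assert(mr != NULL);
    return mr->ram_base + xlat;
}

/* Read-only if the mapping lacks write permission or the region is readonly. */
bool sketch_read_only(const struct sketch_region *mr, bool perm_write)
{
    assert(mr != NULL);
    return !perm_write || mr->readonly;
}
```

Each caller then computes only the values it needs, instead of passing
NULL for the out parameters it does not use.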

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/listener.c      | 29 +++++++++++++++++++----------
 hw/virtio/vhost-vdpa.c  |  8 ++++++--
 include/system/memory.h | 16 +++++++---------
 system/memory.c         | 25 ++++---------------------
 4 files changed, 36 insertions(+), 42 deletions(-)

diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index e86ffcf..87b7a3c 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -90,16 +90,17 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
            section->offset_within_address_space & (1ULL << 63);
 }
 
-/* Called with rcu_read_lock held.  */
-static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
-                               ram_addr_t *ram_addr, bool *read_only,
-                               Error **errp)
+/*
+ * Called with rcu_read_lock held.
+ * The returned MemoryRegion must not be accessed after calling rcu_read_unlock.
+ */
+static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, MemoryRegion **mr_p,
+                               hwaddr *xlat_p, Error **errp)
 {
-    bool ret, mr_has_discard_manager;
+    bool ret;
 
-    ret = memory_get_xlat_addr(iotlb, vaddr, ram_addr, read_only,
-                               &mr_has_discard_manager, errp);
-    if (ret && mr_has_discard_manager) {
+    ret = memory_get_xlat_addr(iotlb, mr_p, xlat_p, errp);
+    if (ret && memory_region_has_ram_discard_manager(*mr_p)) {
         /*
          * Malicious VMs might trigger discarding of IOMMU-mapped memory. The
          * pages will remain pinned inside vfio until unmapped, resulting in a
@@ -126,6 +127,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
     VFIOContainerBase *bcontainer = giommu->bcontainer;
     hwaddr iova = iotlb->iova + giommu->iommu_offset;
+    MemoryRegion *mr;
+    hwaddr xlat;
     void *vaddr;
     int ret;
     Error *local_err = NULL;
@@ -150,10 +153,13 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
         bool read_only;
 
-        if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, &local_err)) {
+        if (!vfio_get_xlat_addr(iotlb, &mr, &xlat, &local_err)) {
             error_report_err(local_err);
             goto out;
         }
+        vaddr = memory_region_get_ram_ptr(mr) + xlat;
+        read_only = !(iotlb->perm & IOMMU_WO) || mr->readonly;
+
         /*
          * vaddr is only valid until rcu_read_unlock(). But after
          * vfio_dma_map has set up the mapping the pages will be
@@ -1047,6 +1053,8 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     ram_addr_t translated_addr;
     Error *local_err = NULL;
     int ret = -EINVAL;
+    MemoryRegion *mr;
+    ram_addr_t xlat;
 
     trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask);
 
@@ -1058,9 +1066,10 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     }
 
     rcu_read_lock();
-    if (!vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL, &local_err)) {
+    if (!vfio_get_xlat_addr(iotlb, &mr, &xlat, &local_err)) {
         goto out_unlock;
     }
+    translated_addr = memory_region_get_ram_addr(mr) + xlat;
 
     ret = vfio_container_query_dirty_bitmap(bcontainer, iova, iotlb->addr_mask + 1,
                                 translated_addr, &local_err);
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 1ab2c11..f191360 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -209,6 +209,8 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     int ret;
     Int128 llend;
     Error *local_err = NULL;
+    MemoryRegion *mr;
+    hwaddr xlat;
 
     if (iotlb->target_as != &address_space_memory) {
         error_report("Wrong target AS \"%s\", only system memory is allowed",
@@ -228,11 +230,13 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
         bool read_only;
 
-        if (!memory_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, NULL,
-                                  &local_err)) {
+        if (!memory_get_xlat_addr(iotlb, &mr, &xlat, &local_err)) {
             error_report_err(local_err);
             return;
         }
+        vaddr = memory_region_get_ram_ptr(mr) + xlat;
+        read_only = !(iotlb->perm & IOMMU_WO) || mr->readonly;
+
         ret = vhost_vdpa_dma_map(s, VHOST_VDPA_GUEST_PA_ASID, iova,
                                  iotlb->addr_mask + 1, vaddr, read_only);
         if (ret) {
diff --git a/include/system/memory.h b/include/system/memory.h
index fbbf4cf..d743214 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -738,21 +738,19 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
                                              RamDiscardListener *rdl);
 
 /**
- * memory_get_xlat_addr: Extract addresses from a TLB entry
+ * memory_get_xlat_addr: Extract addresses from a TLB entry.
+ *                       Called with rcu_read_lock held.
  *
  * @iotlb: pointer to an #IOMMUTLBEntry
- * @vaddr: virtual address
- * @ram_addr: RAM address
- * @read_only: indicates if writes are allowed
- * @mr_has_discard_manager: indicates memory is controlled by a
- *                          RamDiscardManager
+ * @mr_p: return the MemoryRegion containing the @iotlb translated addr.
+ *        The MemoryRegion must not be accessed after rcu_read_unlock.
+ * @xlat_p: return the offset of the entry from the start of @mr_p
  * @errp: pointer to Error*, to store an error if it happens.
  *
  * Return: true on success, else false setting @errp with error.
  */
-bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
-                          ram_addr_t *ram_addr, bool *read_only,
-                          bool *mr_has_discard_manager, Error **errp);
+bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, MemoryRegion **mr_p,
+                          hwaddr *xlat_p, Error **errp);
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
 typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
diff --git a/system/memory.c b/system/memory.c
index 63b983e..4894c0d 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2174,18 +2174,14 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
 }
 
 /* Called with rcu_read_lock held.  */
-bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
-                          ram_addr_t *ram_addr, bool *read_only,
-                          bool *mr_has_discard_manager, Error **errp)
+bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, MemoryRegion **mr_p,
+                          hwaddr *xlat_p, Error **errp)
 {
     MemoryRegion *mr;
     hwaddr xlat;
     hwaddr len = iotlb->addr_mask + 1;
     bool writable = iotlb->perm & IOMMU_WO;
 
-    if (mr_has_discard_manager) {
-        *mr_has_discard_manager = false;
-    }
     /*
      * The IOMMU TLB entry we have just covers translation through
      * this IOMMU to its immediate target.  We need to translate
@@ -2203,9 +2199,6 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
             .offset_within_region = xlat,
             .size = int128_make64(len),
         };
-        if (mr_has_discard_manager) {
-            *mr_has_discard_manager = true;
-        }
         /*
          * Malicious VMs can map memory into the IOMMU, which is expected
          * to remain discarded. vfio will pin all pages, populating memory.
@@ -2229,18 +2222,8 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
         return false;
     }
 
-    if (vaddr) {
-        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
-    }
-
-    if (ram_addr) {
-        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
-    }
-
-    if (read_only) {
-        *read_only = !writable || mr->readonly;
-    }
-
+    *xlat_p = xlat;
+    *mr_p = mr;
     return true;
 }
 
-- 
1.8.3.1




* [PATCH V3 27/42] vfio: pass ramblock to vfio_container_dma_map
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (25 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 26/42] vfio: return mr from vfio_get_xlat_addr Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-16  8:26   ` Duan, Zhenzhong
  2025-05-12 15:32 ` [PATCH V3 28/42] backends/iommufd: iommufd_backend_map_file_dma Steve Sistare
                   ` (15 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Pass ramblock to vfio_container_dma_map for use in a subsequent patch.
The ramblock's attributes will be needed to map the block using
IOMMU_IOAS_MAP_FILE.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/container-base.c              | 3 ++-
 hw/vfio/listener.c                    | 8 +++++---
 include/hw/vfio/vfio-container-base.h | 3 ++-
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
index 1c6ca94..8f43bc8 100644
--- a/hw/vfio/container-base.c
+++ b/hw/vfio/container-base.c
@@ -75,7 +75,8 @@ void vfio_address_space_insert(VFIOAddressSpace *space,
 
 int vfio_container_dma_map(VFIOContainerBase *bcontainer,
                            hwaddr iova, ram_addr_t size,
-                           void *vaddr, bool readonly)
+                           void *vaddr, bool readonly,
+                           RAMBlock *rb)
 {
     VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
 
diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index 87b7a3c..653c6fb 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -169,7 +169,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
          */
         ret = vfio_container_dma_map(bcontainer, iova,
                                      iotlb->addr_mask + 1, vaddr,
-                                     read_only);
+                                     read_only, mr->ram_block);
         if (ret) {
             error_report("vfio_container_dma_map(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx", %p) = %d (%s)",
@@ -239,7 +239,8 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
         vaddr = memory_region_get_ram_ptr(section->mr) + start;
 
         ret = vfio_container_dma_map(bcontainer, iova, next - start,
-                                     vaddr, section->readonly);
+                                     vaddr, section->readonly,
+                                     section->mr->ram_block);
         if (ret) {
             /* Rollback */
             vfio_ram_discard_notify_discard(rdl, section);
@@ -600,7 +601,8 @@ void vfio_container_region_add(VFIOContainerBase *bcontainer,
     }
 
     ret = vfio_container_dma_map(bcontainer, iova, int128_get64(llsize),
-                                 vaddr, section->readonly);
+                                 vaddr, section->readonly,
+                                 section->mr->ram_block);
     if (ret) {
         error_setg(&err, "vfio_container_dma_map(%p, 0x%"HWADDR_PRIx", "
                    "0x%"HWADDR_PRIx", %p) = %d (%s)",
diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index 5776fd7..03b3f9c 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -78,7 +78,8 @@ void vfio_address_space_insert(VFIOAddressSpace *space,
 
 int vfio_container_dma_map(VFIOContainerBase *bcontainer,
                            hwaddr iova, ram_addr_t size,
-                           void *vaddr, bool readonly);
+                           void *vaddr, bool readonly,
+                           RAMBlock *rb);
 int vfio_container_dma_unmap(VFIOContainerBase *bcontainer,
                              hwaddr iova, ram_addr_t size,
                              IOMMUTLBEntry *iotlb, bool unmap_all);
-- 
1.8.3.1




* [PATCH V3 28/42] backends/iommufd: iommufd_backend_map_file_dma
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (26 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 27/42] vfio: pass ramblock to vfio_container_dma_map Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-16  8:26   ` Duan, Zhenzhong
  2025-05-12 15:32 ` [PATCH V3 29/42] backends/iommufd: change process ioctl Steve Sistare
                   ` (14 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Define iommufd_backend_map_file_dma to implement IOMMU_IOAS_MAP_FILE.
This will be called as a substitute for iommufd_backend_map_dma, so
the error conditions for BARs are copied as-is from that function.
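
A minimal sketch of how the request is assembled, under stated
assumptions: the SKETCH_* bit values and struct sketch_map_file are
local stand-ins so the example builds without kernel headers.  Real code
uses the IOMMU_IOAS_MAP_* flags and struct iommu_ioas_map_file from
<linux/iommufd.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in flag bits, local to this sketch. */
enum {
    SKETCH_MAP_FIXED_IOVA = 1u << 0,
    SKETCH_MAP_WRITEABLE  = 1u << 1,
    SKETCH_MAP_READABLE   = 1u << 2,
};

/* Stand-in for the ioctl argument: which file range maps to which iova. */
struct sketch_map_file {
    uint32_t flags;
    int fd;            /* file (or memfd) backing the memory */
    uint64_t start;    /* byte offset of the range within the file */
    uint64_t iova;     /* device-visible address, fixed by the caller */
    uint64_t length;   /* size of the range in bytes */
};

/* Always readable at a fixed iova; writeable unless readonly is set. */
struct sketch_map_file sketch_build_map_file(int fd, uint64_t start,
                                             uint64_t iova, uint64_t length,
                                             bool readonly)
{
    struct sketch_map_file map = {
        .flags = SKETCH_MAP_READABLE | SKETCH_MAP_FIXED_IOVA,
        .fd = fd,
        .start = start,
        .iova = iova,
        .length = length,
    };

    assert(fd >= 0);
    if (!readonly) {
        map.flags |= SKETCH_MAP_WRITEABLE;
    }
    return map;
}
```

The request names a file and an offset instead of a virtual address,
which is the only difference from the address-based map request in this
model.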

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 backends/iommufd.c       | 36 ++++++++++++++++++++++++++++++++++++
 backends/trace-events    |  1 +
 include/system/iommufd.h |  3 +++
 3 files changed, 40 insertions(+)

diff --git a/backends/iommufd.c b/backends/iommufd.c
index b73f75c..5c1958f 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -172,6 +172,42 @@ int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas_id, hwaddr iova,
     return ret;
 }
 
+int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
+                                 hwaddr iova, ram_addr_t size,
+                                 int mfd, unsigned long start, bool readonly)
+{
+    int ret, fd = be->fd;
+    struct iommu_ioas_map_file map = {
+        .size = sizeof(map),
+        .flags = IOMMU_IOAS_MAP_READABLE |
+                 IOMMU_IOAS_MAP_FIXED_IOVA,
+        .ioas_id = ioas_id,
+        .fd = mfd,
+        .start = start,
+        .iova = iova,
+        .length = size,
+    };
+
+    if (!readonly) {
+        map.flags |= IOMMU_IOAS_MAP_WRITEABLE;
+    }
+
+    ret = ioctl(fd, IOMMU_IOAS_MAP_FILE, &map);
+    trace_iommufd_backend_map_file_dma(fd, ioas_id, iova, size, mfd, start,
+                                       readonly, ret);
+    if (ret) {
+        ret = -errno;
+
+        /* TODO: Not support mapping hardware PCI BAR region for now. */
+        if (errno == EFAULT) {
+            warn_report("IOMMU_IOAS_MAP_FILE failed: %m, PCI BAR?");
+        } else {
+            error_report("IOMMU_IOAS_MAP_FILE failed: %m");
+        }
+    }
+    return ret;
+}
+
 int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
                               hwaddr iova, ram_addr_t size)
 {
diff --git a/backends/trace-events b/backends/trace-events
index 40811a3..f478e18 100644
--- a/backends/trace-events
+++ b/backends/trace-events
@@ -11,6 +11,7 @@ iommufd_backend_connect(int fd, bool owned, uint32_t users) "fd=%d owned=%d user
 iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
 iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
 iommufd_backend_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, void *vaddr, bool readonly, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" addr=%p readonly=%d (%d)"
+iommufd_backend_map_file_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int fd, unsigned long start, bool readonly, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" fd=%d start=%ld readonly=%d (%d)"
 iommufd_backend_unmap_dma_non_exist(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " Unmap nonexistent mapping: iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
 iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
 iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d"
diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index cbab75b..ac700b8 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -43,6 +43,9 @@ void iommufd_backend_disconnect(IOMMUFDBackend *be);
 bool iommufd_backend_alloc_ioas(IOMMUFDBackend *be, uint32_t *ioas_id,
                                 Error **errp);
 void iommufd_backend_free_id(IOMMUFDBackend *be, uint32_t id);
+int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
+                                 hwaddr iova, ram_addr_t size, int fd,
+                                 unsigned long start, bool readonly);
 int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas_id, hwaddr iova,
                             ram_addr_t size, void *vaddr, bool readonly);
 int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
-- 
1.8.3.1




* [PATCH V3 29/42] backends/iommufd: change process ioctl
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (27 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 28/42] backends/iommufd: iommufd_backend_map_file_dma Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-16  8:42   ` Duan, Zhenzhong
  2025-05-12 15:32 ` [PATCH V3 30/42] physmem: qemu_ram_get_fd_offset Steve Sistare
                   ` (13 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

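Define helpers for the IOMMU_IOAS_CHANGE_PROCESS ioctl:
iommufd_change_process_capable, which reports whether the ioctl
succeeds, and iommufd_change_process, which issues the ioctl and
reports failure through errp.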

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 backends/iommufd.c       | 20 ++++++++++++++++++++
 backends/trace-events    |  1 +
 include/system/iommufd.h |  2 ++
 3 files changed, 23 insertions(+)

diff --git a/backends/iommufd.c b/backends/iommufd.c
index 5c1958f..6fed1c1 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -73,6 +73,26 @@ static void iommufd_backend_class_init(ObjectClass *oc, const void *data)
     object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
 }
 
+bool iommufd_change_process_capable(IOMMUFDBackend *be)
+{
+    struct iommu_ioas_change_process args = {.size = sizeof(args)};
+
+    return !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
+}
+
+bool iommufd_change_process(IOMMUFDBackend *be, Error **errp)
+{
+    struct iommu_ioas_change_process args = {.size = sizeof(args)};
+    bool ret = !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
+
+    if (!ret) {
+        error_setg_errno(errp, errno, "IOMMU_IOAS_CHANGE_PROCESS fd %d failed",
+                         be->fd);
+    }
+    trace_iommufd_change_process(be->fd, ret);
+    return ret;
+}
+
 bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
 {
     int fd;
diff --git a/backends/trace-events b/backends/trace-events
index f478e18..5ccdf90 100644
--- a/backends/trace-events
+++ b/backends/trace-events
@@ -7,6 +7,7 @@ dbus_vmstate_loading(const char *id) "id: %s"
 dbus_vmstate_saving(const char *id) "id: %s"
 
 # iommufd.c
+iommufd_change_process(int fd, bool ret) "fd=%d (%d)"
 iommufd_backend_connect(int fd, bool owned, uint32_t users) "fd=%d owned=%d users=%d"
 iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
 iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index ac700b8..db9ed53 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -64,6 +64,8 @@ bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t hwpt_id,
                                       uint64_t iova, ram_addr_t size,
                                       uint64_t page_size, uint64_t *data,
                                       Error **errp);
+bool iommufd_change_process_capable(IOMMUFDBackend *be);
+bool iommufd_change_process(IOMMUFDBackend *be, Error **errp);
 
 #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
 #endif
-- 
1.8.3.1




* [PATCH V3 30/42] physmem: qemu_ram_get_fd_offset
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (28 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 29/42] backends/iommufd: change process ioctl Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-16  8:40   ` Duan, Zhenzhong
  2025-05-12 15:32 ` [PATCH V3 31/42] vfio/iommufd: use IOMMU_IOAS_MAP_FILE Steve Sistare
                   ` (12 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Define qemu_ram_get_fd_offset, so CPR can map a memory region using
IOMMU_IOAS_MAP_FILE in a subsequent patch.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 include/exec/cpu-common.h | 1 +
 system/physmem.c          | 5 +++++
 2 files changed, 6 insertions(+)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index dab1e7e..782bc73 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -85,6 +85,7 @@ void qemu_ram_unset_idstr(RAMBlock *block);
 const char *qemu_ram_get_idstr(RAMBlock *rb);
 void *qemu_ram_get_host_addr(RAMBlock *rb);
 ram_addr_t qemu_ram_get_offset(RAMBlock *rb);
+ram_addr_t qemu_ram_get_fd_offset(RAMBlock *rb);
 ram_addr_t qemu_ram_get_used_length(RAMBlock *rb);
 ram_addr_t qemu_ram_get_max_length(RAMBlock *rb);
 bool qemu_ram_is_shared(RAMBlock *rb);
diff --git a/system/physmem.c b/system/physmem.c
index a8a9ca3..18684a4 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1593,6 +1593,11 @@ ram_addr_t qemu_ram_get_offset(RAMBlock *rb)
     return rb->offset;
 }
 
+ram_addr_t qemu_ram_get_fd_offset(RAMBlock *rb)
+{
+    return rb->fd_offset;
+}
+
 ram_addr_t qemu_ram_get_used_length(RAMBlock *rb)
 {
     return rb->used_length;
-- 
1.8.3.1




* [PATCH V3 31/42] vfio/iommufd: use IOMMU_IOAS_MAP_FILE
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (29 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 30/42] physmem: qemu_ram_get_fd_offset Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-16  8:48   ` Duan, Zhenzhong
  2025-05-20 12:27   ` Cédric Le Goater
  2025-05-12 15:32 ` [PATCH V3 32/42] vfio/iommufd: export iommufd_cdev_get_info_iova_range Steve Sistare
                   ` (11 subsequent siblings)
  42 siblings, 2 replies; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Use IOMMU_IOAS_MAP_FILE when the mapped region is backed by a file.
Such a mapping can be preserved without modification during CPR,
because it depends on the file's address space, which does not change,
rather than on the process's address space, which does change.
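
The file offset passed to the map-file operation is the distance of the
mapped address from the start of the block, plus the block's own offset
within its file.  A minimal sketch of that arithmetic follows.  The
sketch_ramblock type and the function name are hypothetical stand-ins,
not the QEMU RAMBlock API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of a RAM block. */
struct sketch_ramblock {
    int fd;              /* backing file descriptor, or -1 if anonymous */
    uint8_t *host;       /* host virtual address where the block is mapped */
    uint64_t fd_offset;  /* offset of the block within the backing file */
};

/*
 * Compute the file offset that corresponds to vaddr.
 * Returns 0 and stores the offset if the block is file-backed.
 * Returns -1 if there is no block or no backing file, in which case
 * the caller would fall back to an address-based mapping.
 */
int sketch_file_start(const struct sketch_ramblock *rb, const uint8_t *vaddr,
                      uint64_t *file_start)
{
    assert(file_start != NULL);
    if (rb == NULL || rb->fd < 0) {
        return -1;
    }
    assert(vaddr >= rb->host);
    *file_start = (uint64_t)(vaddr - rb->host) + rb->fd_offset;
    return 0;
}
```

For example, a block mapped from file offset 0x200000 and an address
0x3000 bytes into the block yield a file offset of 0x203000.  That value
stays the same in a new process that maps the same file at a different
virtual address.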

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/container-base.c              |  9 +++++++++
 hw/vfio/iommufd.c                     | 13 +++++++++++++
 include/hw/vfio/vfio-container-base.h |  3 +++
 3 files changed, 25 insertions(+)

diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
index 8f43bc8..72a51a6 100644
--- a/hw/vfio/container-base.c
+++ b/hw/vfio/container-base.c
@@ -79,7 +79,16 @@ int vfio_container_dma_map(VFIOContainerBase *bcontainer,
                            RAMBlock *rb)
 {
     VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
+    int mfd = rb ? qemu_ram_get_fd(rb) : -1;
 
+    if (mfd >= 0 && vioc->dma_map_file) {
+        unsigned long start = vaddr - qemu_ram_get_host_addr(rb);
+        unsigned long offset = qemu_ram_get_fd_offset(rb);
+
+        vioc->dma_map_file(bcontainer, iova, size, mfd, start + offset,
+                           readonly);
+        return 0;
+    }
     g_assert(vioc->dma_map);
     return vioc->dma_map(bcontainer, iova, size, vaddr, readonly);
 }
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 167bda4..6eb417a 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -44,6 +44,18 @@ static int iommufd_cdev_map(const VFIOContainerBase *bcontainer, hwaddr iova,
                                    iova, size, vaddr, readonly);
 }
 
+static int iommufd_cdev_map_file(const VFIOContainerBase *bcontainer,
+                                 hwaddr iova, ram_addr_t size,
+                                 int fd, unsigned long start, bool readonly)
+{
+    const VFIOIOMMUFDContainer *container =
+        container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
+
+    return iommufd_backend_map_file_dma(container->be,
+                                        container->ioas_id,
+                                        iova, size, fd, start, readonly);
+}
+
 static int iommufd_cdev_unmap(const VFIOContainerBase *bcontainer,
                               hwaddr iova, ram_addr_t size,
                               IOMMUTLBEntry *iotlb, bool unmap_all)
@@ -802,6 +814,7 @@ static void vfio_iommu_iommufd_class_init(ObjectClass *klass, const void *data)
     VFIOIOMMUClass *vioc = VFIO_IOMMU_CLASS(klass);
 
     vioc->dma_map = iommufd_cdev_map;
+    vioc->dma_map_file = iommufd_cdev_map_file;
     vioc->dma_unmap = iommufd_cdev_unmap;
     vioc->attach_device = iommufd_cdev_attach;
     vioc->detach_device = iommufd_cdev_detach;
diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index 03b3f9c..f30f828 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -123,6 +123,9 @@ struct VFIOIOMMUClass {
     int (*dma_map)(const VFIOContainerBase *bcontainer,
                    hwaddr iova, ram_addr_t size,
                    void *vaddr, bool readonly);
+    int (*dma_map_file)(const VFIOContainerBase *bcontainer,
+                        hwaddr iova, ram_addr_t size,
+                        int fd, unsigned long start, bool readonly);
     /**
      * @dma_unmap
      *
-- 
1.8.3.1




* [PATCH V3 32/42] vfio/iommufd: export iommufd_cdev_get_info_iova_range
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (30 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 31/42] vfio/iommufd: use IOMMU_IOAS_MAP_FILE Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-21 18:35   ` Steven Sistare
  2025-05-12 15:32 ` [PATCH V3 33/42] vfio/iommufd: define hwpt constructors Steve Sistare
                   ` (10 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Export iommufd_cdev_get_info_iova_range, for use by CPR in a subsequent
patch to reconstruct the userland device state.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/iommufd.c      | 4 ++--
 hw/vfio/vfio-iommufd.h | 3 +++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 6eb417a..f645a62 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -459,8 +459,8 @@ static int iommufd_cdev_ram_block_discard_disable(bool state)
     return ram_block_uncoordinated_discard_disable(state);
 }
 
-static bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
-                                             uint32_t ioas_id, Error **errp)
+bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
+                                      uint32_t ioas_id, Error **errp)
 {
     VFIOContainerBase *bcontainer = &container->bcontainer;
     g_autofree struct iommu_ioas_iova_ranges *info = NULL;
diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
index 07ea0f4..5615dcd 100644
--- a/hw/vfio/vfio-iommufd.h
+++ b/hw/vfio/vfio-iommufd.h
@@ -31,4 +31,7 @@ typedef struct VFIOIOMMUFDContainer {
 
 OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer, VFIO_IOMMU_IOMMUFD);
 
+bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
+                                      uint32_t ioas_id, Error **errp);
+
 #endif /* HW_VFIO_VFIO_IOMMUFD_H */
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 157+ messages in thread

* [PATCH V3 33/42] vfio/iommufd: define hwpt constructors
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (31 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 32/42] vfio/iommufd: export iommufd_cdev_get_info_iova_range Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-16  8:55   ` Duan, Zhenzhong
  2025-05-12 15:32 ` [PATCH V3 34/42] vfio/iommufd: invariant device name Steve Sistare
                   ` (9 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Extract hwpt creation code from iommufd_cdev_autodomains_get into the
helpers iommufd_cdev_use_hwpt and iommufd_cdev_make_hwpt.  These will
be used by CPR in a subsequent patch.

Call vfio_device_hiod_create_and_realize earlier so iommufd_cdev_make_hwpt
can use vbasedev->hiod hw_caps, avoiding an extra call to
iommufd_backend_get_device_info.

No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/iommufd.c | 116 ++++++++++++++++++++++++++++++------------------------
 1 file changed, 65 insertions(+), 51 deletions(-)

diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index f645a62..8661947 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -310,16 +310,70 @@ static bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
     return true;
 }
 
+static void iommufd_cdev_use_hwpt(VFIODevice *vbasedev, VFIOIOASHwpt *hwpt)
+{
+    vbasedev->hwpt = hwpt;
+    vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
+    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
+}
+
+/*
+ * iommufd_cdev_make_hwpt: If @alloc_id, allocate a hwpt_id, else use @hwpt_id.
+ * Create and add a hwpt struct to the container's list and to the device.
+ * Always succeeds if !@alloc_id.
+ */
+static bool iommufd_cdev_make_hwpt(VFIODevice *vbasedev,
+                                   VFIOIOMMUFDContainer *container,
+                                   uint32_t hwpt_id, bool alloc_id,
+                                   Error **errp)
+{
+    VFIOIOASHwpt *hwpt;
+    uint32_t flags = 0;
+
+    /*
+     * This is quite early and VFIO Migration state isn't yet fully
+     * initialized, thus rely only on IOMMU hardware capabilities as to
+     * whether IOMMU dirty tracking is going to be requested. Later
+     * vfio_migration_realize() may decide to use VF dirty tracking
+     * instead.
+     */
+    g_assert(vbasedev->hiod);
+    if (vbasedev->hiod->caps.hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
+        flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
+    }
+
+    if (alloc_id) {
+        if (!iommufd_backend_alloc_hwpt(vbasedev->iommufd, vbasedev->devid,
+                                        container->ioas_id, flags,
+                                        IOMMU_HWPT_DATA_NONE, 0, NULL,
+                                        &hwpt_id, errp)) {
+            return false;
+        }
+
+        if (iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt_id, errp)) {
+            iommufd_backend_free_id(container->be, hwpt_id);
+            return false;
+        }
+    }
+
+    hwpt = g_malloc0(sizeof(*hwpt));
+    hwpt->hwpt_id = hwpt_id;
+    hwpt->hwpt_flags = flags;
+    QLIST_INIT(&hwpt->device_list);
+
+    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
+    iommufd_cdev_use_hwpt(vbasedev, hwpt);
+    container->bcontainer.dirty_pages_supported |=
+                                vbasedev->iommu_dirty_tracking;
+    return true;
+}
+
 static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
                                          VFIOIOMMUFDContainer *container,
                                          Error **errp)
 {
     ERRP_GUARD();
-    IOMMUFDBackend *iommufd = vbasedev->iommufd;
-    uint32_t type, flags = 0;
-    uint64_t hw_caps;
     VFIOIOASHwpt *hwpt;
-    uint32_t hwpt_id;
     int ret;
 
     /* Try to find a domain */
@@ -340,54 +394,14 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
 
             return false;
         } else {
-            vbasedev->hwpt = hwpt;
-            QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
-            vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
+            iommufd_cdev_use_hwpt(vbasedev, hwpt);
             return true;
         }
     }
-
-    /*
-     * This is quite early and VFIO Migration state isn't yet fully
-     * initialized, thus rely only on IOMMU hardware capabilities as to
-     * whether IOMMU dirty tracking is going to be requested. Later
-     * vfio_migration_realize() may decide to use VF dirty tracking
-     * instead.
-     */
-    if (!iommufd_backend_get_device_info(vbasedev->iommufd, vbasedev->devid,
-                                         &type, NULL, 0, &hw_caps, errp)) {
-        return false;
-    }
-
-    if (hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
-        flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
-    }
-
-    if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
-                                    container->ioas_id, flags,
-                                    IOMMU_HWPT_DATA_NONE, 0, NULL,
-                                    &hwpt_id, errp)) {
-        return false;
-    }
-
-    hwpt = g_malloc0(sizeof(*hwpt));
-    hwpt->hwpt_id = hwpt_id;
-    hwpt->hwpt_flags = flags;
-    QLIST_INIT(&hwpt->device_list);
-
-    ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
-    if (ret) {
-        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
-        g_free(hwpt);
+    if (!iommufd_cdev_make_hwpt(vbasedev, container, 0, true, errp)) {
         return false;
     }
 
-    vbasedev->hwpt = hwpt;
-    vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
-    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
-    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
-    container->bcontainer.dirty_pages_supported |=
-                                vbasedev->iommu_dirty_tracking;
     if (container->bcontainer.dirty_pages_supported &&
         !vbasedev->iommu_dirty_tracking) {
         warn_report("IOMMU instance for device %s doesn't support dirty tracking",
@@ -530,6 +544,11 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
 
     space = vfio_address_space_get(as);
 
+    if (!vfio_device_hiod_create_and_realize(vbasedev,
+            TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO, errp)) {
+        goto err_alloc_ioas;
+    }
+
     /* try to attach to an existing container in this space */
     QLIST_FOREACH(bcontainer, &space->containers, next) {
         container = container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
@@ -604,11 +623,6 @@ found_container:
         goto err_listener_register;
     }
 
-    if (!vfio_device_hiod_create_and_realize(vbasedev,
-                     TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO, errp)) {
-        goto err_listener_register;
-    }
-
     /*
      * TODO: examine RAM_BLOCK_DISCARD stuff, should we do group level
      * for discarding incompatibility check as well?
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 157+ messages in thread

* [PATCH V3 34/42] vfio/iommufd: invariant device name
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (32 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 33/42] vfio/iommufd: define hwpt constructors Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-16  9:29   ` Duan, Zhenzhong
  2025-05-20 13:55   ` Cédric Le Goater
  2025-05-12 15:32 ` [PATCH V3 35/42] vfio/iommufd: register container for cpr Steve Sistare
                   ` (8 subsequent siblings)
  42 siblings, 2 replies; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

cpr-transfer will use the device name as a key to find the value
of the device descriptor in new QEMU.  However, if the descriptor
number is specified by a command-line fd parameter, then
vfio_device_get_name creates a name that includes the fd number.
This causes a chicken-and-egg problem: new QEMU must know the fd
number to construct a name to find the fd number.

To fix, create an invariant name based on the id command-line
parameter.  If id is not defined, add a CPR blocker.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
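For readers following along, the lookup problem described above can be
sketched with a toy name-to-fd table.  This is only an illustration under
stated assumptions: struct fd_entry, toy_find_fd and toy_device_name are
hypothetical names invented here, not QEMU functions, and the table stands
in for whatever CPR state old QEMU hands over.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the CPR fd table; not QEMU's implementation. */
struct fd_entry {
    const char *name;
    int fd;
};

/* Return the fd saved under @name, or -1 if there is none. */
static int toy_find_fd(const struct fd_entry *tab, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(tab[i].name, name) == 0) {
            return tab[i].fd;
        }
    }
    return -1;
}

/* Build a device name: prefer the invariant id, else fall back to the fd. */
static void toy_device_name(char *buf, size_t len, const char *id, int fd)
{
    assert(buf && len > 0);
    if (id) {
        snprintf(buf, len, "%s", id);
    } else {
        snprintf(buf, len, "VFIO_FD%d", fd);
    }
}
```

A name built from the id is the same string in both processes, so the
lookup succeeds.  A name built from the fd number only matches if new QEMU
happens to be given the same number, which is the circular dependency the
commit message describes.
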
 hw/vfio/cpr.c              | 21 +++++++++++++++++++++
 hw/vfio/device.c           | 10 ++++------
 hw/vfio/iommufd.c          |  2 ++
 include/hw/vfio/vfio-cpr.h |  4 ++++
 4 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index 6081a89..7609c62 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -11,6 +11,7 @@
 #include "hw/vfio/pci.h"
 #include "hw/pci/msix.h"
 #include "hw/pci/msi.h"
+#include "migration/blocker.h"
 #include "migration/cpr.h"
 #include "qapi/error.h"
 #include "system/runstate.h"
@@ -184,3 +185,23 @@ const VMStateDescription vfio_cpr_pci_vmstate = {
         VMSTATE_END_OF_LIST()
     }
 };
+
+bool vfio_cpr_set_device_name(VFIODevice *vbasedev, Error **errp)
+{
+    if (vbasedev->dev->id) {
+        vbasedev->name = g_strdup(vbasedev->dev->id);
+        return true;
+    } else {
+        /*
+         * Assign a name so any function printing it will not break, but the
+         * fd number changes across processes, so this cannot be used as an
+         * invariant name for CPR.
+         */
+        vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
+        error_setg(&vbasedev->cpr.id_blocker,
+                   "vfio device with fd=%d needs an id property",
+                   vbasedev->fd);
+        return migrate_add_blocker_modes(&vbasedev->cpr.id_blocker, errp,
+                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
+    }
+}
diff --git a/hw/vfio/device.c b/hw/vfio/device.c
index 9fba2c7..8e9de68 100644
--- a/hw/vfio/device.c
+++ b/hw/vfio/device.c
@@ -28,6 +28,7 @@
 #include "qapi/error.h"
 #include "qemu/error-report.h"
 #include "qemu/units.h"
+#include "migration/cpr.h"
 #include "monitor/monitor.h"
 #include "vfio-helpers.h"
 
@@ -284,6 +285,7 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
 {
     ERRP_GUARD();
     struct stat st;
+    bool ret = true;
 
     if (vbasedev->fd < 0) {
         if (stat(vbasedev->sysfsdev, &st) < 0) {
@@ -300,16 +302,12 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
             error_setg(errp, "Use FD passing only with iommufd backend");
             return false;
         }
-        /*
-         * Give a name with fd so any function printing out vbasedev->name
-         * will not break.
-         */
         if (!vbasedev->name) {
-            vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
+            ret = vfio_cpr_set_device_name(vbasedev, errp);
         }
     }
 
-    return true;
+    return ret;
 }
 
 void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 8661947..ea99b8d 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -25,6 +25,7 @@
 #include "system/reset.h"
 #include "qemu/cutils.h"
 #include "qemu/chardev_open.h"
+#include "migration/blocker.h"
 #include "pci.h"
 #include "vfio-iommufd.h"
 #include "vfio-helpers.h"
@@ -669,6 +670,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
     iommufd_cdev_container_destroy(container);
     vfio_address_space_put(space);
 
+    migrate_del_blocker(&vbasedev->cpr.id_blocker);
     iommufd_cdev_unbind_and_disconnect(vbasedev);
     close(vbasedev->fd);
 }
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 765e334..d06d117 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -23,12 +23,14 @@ typedef struct VFIOContainerCPR {
 typedef struct VFIODeviceCPR {
     bool reused;
     Error *mdev_blocker;
+    Error *id_blocker;
 } VFIODeviceCPR;
 
 struct VFIOContainer;
 struct VFIOContainerBase;
 struct VFIOGroup;
 struct VFIOPCIDevice;
+struct VFIODevice;
 
 bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
                                         Error **errp);
@@ -59,4 +61,6 @@ void vfio_cpr_delete_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
 
 extern const VMStateDescription vfio_cpr_pci_vmstate;
 
+bool vfio_cpr_set_device_name(struct VFIODevice *vbasedev, Error **errp);
+
 #endif /* HW_VFIO_VFIO_CPR_H */
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 157+ messages in thread

* [PATCH V3 35/42] vfio/iommufd: register container for cpr
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (33 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 34/42] vfio/iommufd: invariant device name Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-16 10:23   ` Duan, Zhenzhong
  2025-05-12 15:32 ` [PATCH V3 36/42] vfio/iommufd: preserve descriptors Steve Sistare
                   ` (7 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Register a vfio iommufd container and device for CPR, replacing the generic
CPR register call with a more specific iommufd register call.  Add a
blocker if the kernel does not support IOMMU_IOAS_CHANGE_PROCESS.

This is mostly boilerplate.  The fields to be saved and restored are added
in subsequent patches.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
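The idea of a blocker that applies to one migration mode only can be
sketched as a reason string plus a mode bitmask.  This is a toy model, not
the migration blocker API: enum toy_mode, struct toy_blocker and
toy_blocked_reason are hypothetical names used here for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mode numbering; the real enum lives in QEMU's QAPI schema. */
enum toy_mode {
    TOY_MODE_NORMAL,
    TOY_MODE_CPR_REBOOT,
    TOY_MODE_CPR_TRANSFER,
};

struct toy_blocker {
    const char *reason;
    unsigned mode_mask;     /* bit N set: blocks mode N */
};

/* Return the reason @mode is blocked, or NULL if it may proceed. */
static const char *toy_blocked_reason(const struct toy_blocker *b, size_t n,
                                      enum toy_mode mode)
{
    assert(b || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (b[i].mode_mask & (1u << mode)) {
            return b[i].reason;
        }
    }
    return NULL;
}
```

With a single blocker whose mask covers only the cpr-transfer bit, the
other modes are unaffected.  That mirrors the intent above: a kernel
without IOMMU_IOAS_CHANGE_PROCESS blocks cpr-transfer and nothing else.
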
 hw/vfio/cpr-iommufd.c      | 97 ++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/iommufd.c          |  6 ++-
 hw/vfio/meson.build        |  1 +
 hw/vfio/vfio-iommufd.h     |  1 +
 include/hw/vfio/vfio-cpr.h |  8 ++++
 5 files changed, 111 insertions(+), 2 deletions(-)
 create mode 100644 hw/vfio/cpr-iommufd.c

diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
new file mode 100644
index 0000000..46f2006
--- /dev/null
+++ b/hw/vfio/cpr-iommufd.c
@@ -0,0 +1,97 @@
+/*
+ * Copyright (c) 2024-2025 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "hw/vfio/vfio-cpr.h"
+#include "migration/blocker.h"
+#include "migration/cpr.h"
+#include "migration/migration.h"
+#include "migration/vmstate.h"
+#include "system/iommufd.h"
+#include "vfio-iommufd.h"
+
+static bool vfio_cpr_supported(VFIOIOMMUFDContainer *container, Error **errp)
+{
+    if (!iommufd_change_process_capable(container->be)) {
+        error_setg(errp,
+                   "VFIO container does not support IOMMU_IOAS_CHANGE_PROCESS");
+        return false;
+    }
+    return true;
+}
+
+static const VMStateDescription vfio_container_vmstate = {
+    .name = "vfio-iommufd-container",
+    .version_id = 0,
+    .minimum_version_id = 0,
+    .needed = cpr_needed_for_reuse,
+    .fields = (VMStateField[]) {
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static const VMStateDescription iommufd_cpr_vmstate = {
+    .name = "iommufd",
+    .version_id = 0,
+    .minimum_version_id = 0,
+    .needed = cpr_needed_for_reuse,
+    .fields = (VMStateField[]) {
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+bool vfio_iommufd_cpr_register_container(VFIOIOMMUFDContainer *container,
+                                         Error **errp)
+{
+    VFIOContainerBase *bcontainer = &container->bcontainer;
+    Error **cpr_blocker = &container->cpr_blocker;
+
+    migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
+                                vfio_cpr_reboot_notifier,
+                                MIG_MODE_CPR_REBOOT);
+
+    if (!vfio_cpr_supported(container, cpr_blocker)) {
+        return migrate_add_blocker_modes(cpr_blocker, errp,
+                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
+    }
+
+    vmstate_register(NULL, -1, &vfio_container_vmstate, container);
+    vmstate_register(NULL, -1, &iommufd_cpr_vmstate, container->be);
+
+    return true;
+}
+
+void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
+{
+    VFIOContainerBase *bcontainer = &container->bcontainer;
+
+    vmstate_unregister(NULL, &iommufd_cpr_vmstate, container->be);
+    vmstate_unregister(NULL, &vfio_container_vmstate, container);
+    migrate_del_blocker(&container->cpr_blocker);
+    migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
+}
+
+static const VMStateDescription vfio_device_vmstate = {
+    .name = "vfio-iommufd-device",
+    .version_id = 0,
+    .minimum_version_id = 0,
+    .needed = cpr_needed_for_reuse,
+    .fields = (VMStateField[]) {
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
+{
+    vmstate_register(NULL, -1, &vfio_device_vmstate, vbasedev);
+}
+
+void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
+{
+    vmstate_unregister(NULL, &vfio_device_vmstate, vbasedev);
+}
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index ea99b8d..dabb948 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -460,7 +460,7 @@ static void iommufd_cdev_container_destroy(VFIOIOMMUFDContainer *container)
     if (!QLIST_EMPTY(&bcontainer->device_list)) {
         return;
     }
-    vfio_cpr_unregister_container(bcontainer);
+    vfio_iommufd_cpr_unregister_container(container);
     vfio_listener_unregister(bcontainer);
     iommufd_backend_free_id(container->be, container->ioas_id);
     object_unref(container);
@@ -611,7 +611,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
         goto err_listener_register;
     }
 
-    if (!vfio_cpr_register_container(bcontainer, errp)) {
+    if (!vfio_iommufd_cpr_register_container(container, errp)) {
         goto err_listener_register;
     }
 
@@ -633,6 +633,7 @@ found_container:
     }
 
     vfio_device_prepare(vbasedev, bcontainer, &dev_info);
+    vfio_iommufd_cpr_register_device(vbasedev);
 
     trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
                                    vbasedev->num_regions, vbasedev->flags);
@@ -671,6 +672,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
     vfio_address_space_put(space);
 
     migrate_del_blocker(&vbasedev->cpr.id_blocker);
+    vfio_iommufd_cpr_unregister_device(vbasedev);
     iommufd_cdev_unbind_and_disconnect(vbasedev);
     close(vbasedev->fd);
 }
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index 73d29f9..a158fd8 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -21,6 +21,7 @@ system_ss.add(when: 'CONFIG_VFIO_XGMAC', if_true: files('calxeda-xgmac.c'))
 system_ss.add(when: 'CONFIG_VFIO_AMD_XGBE', if_true: files('amd-xgbe.c'))
 system_ss.add(when: 'CONFIG_VFIO', if_true: files(
   'cpr.c',
+  'cpr-iommufd.c',
   'cpr-legacy.c',
   'device.c',
   'migration.c',
diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
index 5615dcd..cc57a05 100644
--- a/hw/vfio/vfio-iommufd.h
+++ b/hw/vfio/vfio-iommufd.h
@@ -25,6 +25,7 @@ typedef struct IOMMUFDBackend IOMMUFDBackend;
 typedef struct VFIOIOMMUFDContainer {
     VFIOContainerBase bcontainer;
     IOMMUFDBackend *be;
+    Error *cpr_blocker;
     uint32_t ioas_id;
     QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
 } VFIOIOMMUFDContainer;
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index d06d117..1379b20 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -31,6 +31,7 @@ struct VFIOContainerBase;
 struct VFIOGroup;
 struct VFIOPCIDevice;
 struct VFIODevice;
+struct VFIOIOMMUFDContainer;
 
 bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
                                         Error **errp);
@@ -43,6 +44,13 @@ bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
                                  Error **errp);
 void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
 
+bool vfio_iommufd_cpr_register_container(struct VFIOIOMMUFDContainer *container,
+                                         Error **errp);
+void vfio_iommufd_cpr_unregister_container(
+    struct VFIOIOMMUFDContainer *container);
+void vfio_iommufd_cpr_register_device(struct VFIODevice *vbasedev);
+void vfio_iommufd_cpr_unregister_device(struct VFIODevice *vbasedev);
+
 bool vfio_cpr_container_match(struct VFIOContainer *container,
                               struct VFIOGroup *group, int *fd);
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 157+ messages in thread

* [PATCH V3 36/42] vfio/iommufd: preserve descriptors
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (34 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 35/42] vfio/iommufd: register container for cpr Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-16 10:06   ` Duan, Zhenzhong
  2025-05-12 15:32 ` [PATCH V3 37/42] vfio/iommufd: reconstruct device Steve Sistare
                   ` (6 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Save the iommu and vfio device fds in CPR state when they are created.
After CPR, each fd number is found in CPR state and reused.  Remember
the reused status for subsequent patches.  The reused status is cleared
when vmstate load finishes.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
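The save/find/reuse flow described above follows an "open or reuse"
pattern, which can be sketched as below.  All names (struct toy_cpr_state,
toy_find_fd, toy_save_fd, toy_open_or_reuse, toy_fake_open) are
hypothetical stand-ins, and the fixed-size table is an assumption made to
keep the example short; it does not reflect how QEMU stores CPR state.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_FDS 8

/* Hypothetical registry of preserved descriptors, keyed by name. */
struct toy_cpr_state {
    struct { char name[32]; int fd; } ent[TOY_MAX_FDS];
    int count;
};

/* Return the fd saved under @name, or -1 if there is none. */
static int toy_find_fd(const struct toy_cpr_state *s, const char *name)
{
    for (int i = 0; i < s->count; i++) {
        if (strcmp(s->ent[i].name, name) == 0) {
            return s->ent[i].fd;
        }
    }
    return -1;
}

/* Record @fd under @name so a later process can find it. */
static void toy_save_fd(struct toy_cpr_state *s, const char *name, int fd)
{
    assert(s->count < TOY_MAX_FDS);
    assert(strlen(name) < sizeof(s->ent[0].name));
    strcpy(s->ent[s->count].name, name);
    s->ent[s->count].fd = fd;
    s->count++;
}

/* Stand-in for a real open(); always returns the same fake descriptor. */
static int toy_fake_open(void)
{
    return 42;
}

/*
 * Reuse the descriptor saved under @name if there is one; otherwise call
 * @open_fn and record the result.  *@reused reports which path was taken.
 */
static int toy_open_or_reuse(struct toy_cpr_state *s, const char *name,
                             int (*open_fn)(void), bool *reused)
{
    int fd = toy_find_fd(s, name);

    *reused = (fd >= 0);
    if (!*reused) {
        fd = open_fn();
        if (fd >= 0) {
            toy_save_fd(s, name, fd);
        }
    }
    return fd;
}
```

The first call for a given name opens and saves, and any later call finds
the saved entry and reports reused.  Note that the sketch checks the open
result before saving it; a failed open is returned to the caller as a
negative value and nothing is recorded.
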
 backends/iommufd.c       | 19 ++++++++++---------
 hw/vfio/cpr-iommufd.c    | 16 ++++++++++++++++
 hw/vfio/device.c         | 10 ++--------
 hw/vfio/iommufd.c        | 13 +++++++++++--
 include/system/iommufd.h |  1 +
 5 files changed, 40 insertions(+), 19 deletions(-)

diff --git a/backends/iommufd.c b/backends/iommufd.c
index 6fed1c1..492747c 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -16,12 +16,18 @@
 #include "qemu/module.h"
 #include "qom/object_interfaces.h"
 #include "qemu/error-report.h"
+#include "migration/cpr.h"
 #include "monitor/monitor.h"
 #include "trace.h"
 #include "hw/vfio/vfio-device.h"
 #include <sys/ioctl.h>
 #include <linux/iommufd.h>
 
+static const char *iommufd_fd_name(IOMMUFDBackend *be)
+{
+    return object_get_canonical_path_component(OBJECT(be));
+}
+
 static void iommufd_backend_init(Object *obj)
 {
     IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
@@ -47,9 +53,8 @@ static void iommufd_backend_set_fd(Object *obj, const char *str, Error **errp)
     IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
     int fd = -1;
 
-    fd = monitor_fd_param(monitor_cur(), str, errp);
+    fd = cpr_get_fd_param(iommufd_fd_name(be), str, 0, &be->cpr_reused, errp);
     if (fd == -1) {
-        error_prepend(errp, "Could not parse remote object fd %s:", str);
         return;
     }
     be->fd = fd;
@@ -95,14 +100,9 @@ bool iommufd_change_process(IOMMUFDBackend *be, Error **errp)
 
 bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
 {
-    int fd;
-
     if (be->owned && !be->users) {
-        fd = qemu_open("/dev/iommu", O_RDWR, errp);
-        if (fd < 0) {
-            return false;
-        }
-        be->fd = fd;
+        be->fd = cpr_open_fd("/dev/iommu", O_RDWR, iommufd_fd_name(be), 0,
+                             &be->cpr_reused, errp);
     }
     be->users++;
 
@@ -121,6 +121,7 @@ void iommufd_backend_disconnect(IOMMUFDBackend *be)
         be->fd = -1;
     }
 out:
+    cpr_delete_fd(iommufd_fd_name(be), 0);
     trace_iommufd_backend_disconnect(be->fd, be->users);
 }
 
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 46f2006..b760bd3 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -8,6 +8,7 @@
 #include "qemu/osdep.h"
 #include "qapi/error.h"
 #include "hw/vfio/vfio-cpr.h"
+#include "hw/vfio/vfio-device.h"
 #include "migration/blocker.h"
 #include "migration/cpr.h"
 #include "migration/migration.h"
@@ -25,10 +26,25 @@ static bool vfio_cpr_supported(VFIOIOMMUFDContainer *container, Error **errp)
     return true;
 }
 
+static int vfio_container_post_load(void *opaque, int version_id)
+{
+    VFIOIOMMUFDContainer *container = opaque;
+    VFIOContainerBase *bcontainer = &container->bcontainer;
+    VFIODevice *vbasedev;
+
+    QLIST_FOREACH(vbasedev, &bcontainer->device_list, container_next) {
+        vbasedev->cpr.reused = false;
+    }
+    container->be->cpr_reused = false;
+
+    return 0;
+}
+
 static const VMStateDescription vfio_container_vmstate = {
     .name = "vfio-iommufd-container",
     .version_id = 0,
     .minimum_version_id = 0,
+    .post_load = vfio_container_post_load,
     .needed = cpr_needed_for_reuse,
     .fields = (VMStateField[]) {
         VMSTATE_END_OF_LIST()
diff --git a/hw/vfio/device.c b/hw/vfio/device.c
index 8e9de68..02f384e 100644
--- a/hw/vfio/device.c
+++ b/hw/vfio/device.c
@@ -312,14 +312,8 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
 
 void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
 {
-    ERRP_GUARD();
-    int fd = monitor_fd_param(monitor_cur(), str, errp);
-
-    if (fd < 0) {
-        error_prepend(errp, "Could not parse remote object fd %s:", str);
-        return;
-    }
-    vbasedev->fd = fd;
+    vbasedev->fd = cpr_get_fd_param(vbasedev->dev->id, str, 0,
+                                    &vbasedev->cpr.reused, errp);
 }
 
 static VFIODeviceIOOps vfio_device_io_ops_ioctl;
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index dabb948..046f601 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -26,6 +26,7 @@
 #include "qemu/cutils.h"
 #include "qemu/chardev_open.h"
 #include "migration/blocker.h"
+#include "migration/cpr.h"
 #include "pci.h"
 #include "vfio-iommufd.h"
 #include "vfio-helpers.h"
@@ -530,13 +531,18 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
         VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_IOMMUFD));
 
     if (vbasedev->fd < 0) {
-        devfd = iommufd_cdev_getfd(vbasedev->sysfsdev, errp);
+        devfd = cpr_find_fd(vbasedev->name, 0);
+        vbasedev->cpr.reused = (devfd >= 0);
+        if (!vbasedev->cpr.reused) {
+            devfd = iommufd_cdev_getfd(vbasedev->sysfsdev, errp);
+        }
         if (devfd < 0) {
             return false;
         }
         vbasedev->fd = devfd;
     } else {
         devfd = vbasedev->fd;
+        /* reused was set in vfio_device_set_fd */
     }
 
     if (!iommufd_cdev_connect_and_bind(vbasedev, errp)) {
@@ -634,7 +640,9 @@ found_container:
 
     vfio_device_prepare(vbasedev, bcontainer, &dev_info);
     vfio_iommufd_cpr_register_device(vbasedev);
-
+    if (!vbasedev->cpr.reused) {
+        cpr_save_fd(vbasedev->name, 0, vbasedev->fd);
+    }
     trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
                                    vbasedev->num_regions, vbasedev->flags);
     return true;
@@ -673,6 +681,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
 
     migrate_del_blocker(&vbasedev->cpr.id_blocker);
     vfio_iommufd_cpr_unregister_device(vbasedev);
+    cpr_delete_fd(vbasedev->name, 0);
     iommufd_cdev_unbind_and_disconnect(vbasedev);
     close(vbasedev->fd);
 }
diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index db9ed53..5c17abd 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -32,6 +32,7 @@ struct IOMMUFDBackend {
     /*< protected >*/
     int fd;            /* /dev/iommu file descriptor */
     bool owned;        /* is the /dev/iommu opened internally */
+    bool cpr_reused;   /* fd is reused after CPR */
     uint32_t users;
 
     /*< public >*/
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 157+ messages in thread

* [PATCH V3 37/42] vfio/iommufd: reconstruct device
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (35 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 36/42] vfio/iommufd: preserve descriptors Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-16 10:22   ` Duan, Zhenzhong
  2025-05-21 18:38   ` Steven Sistare
  2025-05-12 15:32 ` [PATCH V3 38/42] vfio/iommufd: reconstruct hw_caps Steve Sistare
                   ` (5 subsequent siblings)
  42 siblings, 2 replies; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Reconstruct userland device state after CPR.  During vfio_realize, skip
all ioctls that configure the device, as it was already configured in old
QEMU.

Save the ioas_id in vmstate, and skip its allocation in vfio_realize.
Because we skip ioctls, it is not needed at realize time.  However, we do
need the range info, so defer the call to iommufd_cdev_get_info_iova_range
to a post_load handler, at which time the ioas_id is known.

This reconstruction is not complete.  hwpt_id and devid need special
treatment, handled in subsequent patches.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
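The deferral described above (skip allocation at realize, run the range
query once the saved id has been loaded) can be sketched with a toy
container.  struct toy_container, TOY_IOAS_UNKNOWN, toy_query_ranges,
toy_realize and toy_post_load are hypothetical names for illustration; the
real code uses vmstate and a post_load handler rather than an explicit
saved_id argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder id used until the saved value arrives. */
#define TOY_IOAS_UNKNOWN UINT32_MAX

struct toy_container {
    uint32_t ioas_id;
    bool have_ranges;
};

/* Stand-in for a kernel query that only works with a valid id. */
static bool toy_query_ranges(struct toy_container *c)
{
    if (c->ioas_id == TOY_IOAS_UNKNOWN) {
        return false;
    }
    c->have_ranges = true;
    return true;
}

/* Realize: allocate and query now, unless state will arrive later. */
static bool toy_realize(struct toy_container *c, bool reused,
                        uint32_t fresh_id)
{
    assert(c);
    c->have_ranges = false;
    if (reused) {
        c->ioas_id = TOY_IOAS_UNKNOWN;   /* filled in by the load step */
        return true;
    }
    c->ioas_id = fresh_id;
    return toy_query_ranges(c);
}

/* Load step: the saved id is now known, so the deferred query can run. */
static int toy_post_load(struct toy_container *c, uint32_t saved_id)
{
    assert(c);
    c->ioas_id = saved_id;
    return toy_query_ranges(c) ? 0 : -1;
}
```

In the reused case the container has no range info between realize and
load, so anything that runs in that window must not depend on it.
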
 hw/vfio/cpr-iommufd.c |  8 ++++++++
 hw/vfio/iommufd.c     | 17 +++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index b760bd3..3d430f0 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -31,6 +31,13 @@ static int vfio_container_post_load(void *opaque, int version_id)
     VFIOIOMMUFDContainer *container = opaque;
     VFIOContainerBase *bcontainer = &container->bcontainer;
     VFIODevice *vbasedev;
+    Error *err = NULL;
+    uint32_t ioas_id = container->ioas_id;
+
+    if (!iommufd_cdev_get_info_iova_range(container, ioas_id, &err)) {
+        error_report_err(err);
+        return -1;
+    }
 
     QLIST_FOREACH(vbasedev, &bcontainer->device_list, container_next) {
         vbasedev->cpr.reused = false;
@@ -47,6 +54,7 @@ static const VMStateDescription vfio_container_vmstate = {
     .post_load = vfio_container_post_load,
     .needed = cpr_needed_for_reuse,
     .fields = (VMStateField[]) {
+        VMSTATE_UINT32(ioas_id, VFIOIOMMUFDContainer),
         VMSTATE_END_OF_LIST()
     }
 };
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 046f601..c49a7e7 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -122,6 +122,10 @@ static bool iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
         goto err_kvm_device_add;
     }
 
+    if (vbasedev->cpr.reused) {
+        goto skip_bind;
+    }
+
     /* Bind device to iommufd */
     bind.iommufd = iommufd->fd;
     if (ioctl(vbasedev->fd, VFIO_DEVICE_BIND_IOMMUFD, &bind)) {
@@ -133,6 +137,8 @@ static bool iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
     vbasedev->devid = bind.out_devid;
     trace_iommufd_cdev_connect_and_bind(bind.iommufd, vbasedev->name,
                                         vbasedev->fd, vbasedev->devid);
+
+skip_bind:
     return true;
 err_bind:
     iommufd_cdev_kvm_device_del(vbasedev);
@@ -580,6 +586,11 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
         }
     }
 
+    if (vbasedev->cpr.reused) {
+        ioas_id = -1;           /* ioas_id will be received from vmstate */
+        goto skip_ioas_alloc;
+    }
+
     /* Need to allocate a new dedicated container */
     if (!iommufd_backend_alloc_ioas(vbasedev->iommufd, &ioas_id, errp)) {
         goto err_alloc_ioas;
@@ -587,6 +598,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
 
     trace_iommufd_cdev_alloc_ioas(vbasedev->iommufd->fd, ioas_id);
 
+skip_ioas_alloc:
     container = VFIO_IOMMU_IOMMUFD(object_new(TYPE_VFIO_IOMMU_IOMMUFD));
     container->be = vbasedev->iommufd;
     container->ioas_id = ioas_id;
@@ -605,6 +617,10 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
         goto err_discard_disable;
     }
 
+    if (vbasedev->cpr.reused) {
+        goto skip_info;
+    }
+
     if (!iommufd_cdev_get_info_iova_range(container, ioas_id, &err)) {
         error_append_hint(&err,
                    "Fallback to default 64bit IOVA range and 4K page size\n");
@@ -613,6 +629,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
         bcontainer->pgsizes = qemu_real_host_page_size();
     }
 
+skip_info:
     if (!vfio_listener_register(bcontainer, errp)) {
         goto err_listener_register;
     }
-- 
1.8.3.1




* [PATCH V3 38/42] vfio/iommufd: reconstruct hw_caps
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (36 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 37/42] vfio/iommufd: reconstruct device Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-21 19:59   ` Steven Sistare
  2025-05-12 15:32 ` [PATCH V3 39/42] vfio/iommufd: reconstruct hwpt Steve Sistare
                   ` (4 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

hw_caps is normally derived during realize, at
vfio_device_hiod_create_and_realize -> hiod_iommufd_vfio_realize ->
iommufd_backend_get_device_info.  However, this depends on the devid, which
is not preserved during CPR.

Save devid in vmstate.  Defer the vfio_device_hiod_create_and_realize call
to post_load time, after devid has been recovered from vmstate.
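
As a rough illustration of the ordering (not part of this patch), the
sketch below models a device whose capability query needs a valid devid.
All names are hypothetical stand-ins, not QEMU APIs, and the hw_caps value
is made up.  The attach path skips the query when the device is reused,
and a post_load-style hook runs it once devid has been loaded, returning 0
on success and a negative value on failure, as vmstate hooks do.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few VFIODevice fields that matter here. */
typedef struct MockDevice {
    int32_t devid;        /* kernel-assigned id, carried in vmstate */
    uint64_t hw_caps;     /* derived from devid by the capability query */
    bool reused;          /* true in new QEMU until load completes */
    bool hiod_realized;   /* set once the capability query has run */
} MockDevice;

/* Stand-in for the realize step: it cannot work without a valid devid. */
bool mock_hiod_realize(MockDevice *dev)
{
    if (dev->devid < 0) {
        return false;
    }
    dev->hw_caps = 0x1000u + (uint64_t)dev->devid;   /* made-up value */
    dev->hiod_realized = true;
    return true;
}

/* Attach path: realize immediately unless the device is being reused. */
bool mock_attach(MockDevice *dev)
{
    if (!dev->reused && !mock_hiod_realize(dev)) {
        return false;
    }
    return true;
}

/* post_load-style hook: devid is known now, so realize here. */
int mock_device_post_load(MockDevice *dev)
{
    if (!mock_hiod_realize(dev)) {
        return -1;
    }
    return 0;
}
```

Calling mock_attach on a reused device leaves hw_caps unset.  Only the
later mock_device_post_load call, made after devid is filled in, derives it.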

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/cpr-iommufd.c  | 15 +++++++++++++++
 hw/vfio/iommufd.c      |  6 ++----
 hw/vfio/vfio-iommufd.h |  3 +++
 3 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 3d430f0..24cdf10 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -100,12 +100,27 @@ void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
     migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
 }
 
+static int vfio_device_post_load(void *opaque, int version_id)
+{
+    VFIODevice *vbasedev = opaque;
+    Error *err = NULL;
+
+    if (!vfio_device_hiod_create_and_realize(vbasedev,
+                     TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO, &err)) {
+        error_report_err(err);
+        return -1;
+    }
+    return 0;
+}
+
 static const VMStateDescription vfio_device_vmstate = {
     .name = "vfio-iommufd-device",
     .version_id = 0,
     .minimum_version_id = 0,
+    .post_load = vfio_device_post_load,
     .needed = cpr_needed_for_reuse,
     .fields = (VMStateField[]) {
+        VMSTATE_INT32(devid, VFIODevice),
         VMSTATE_END_OF_LIST()
     }
 };
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index c49a7e7..d980684 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -32,9 +32,6 @@
 #include "vfio-helpers.h"
 #include "vfio-listener.h"
 
-#define TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO             \
-            TYPE_HOST_IOMMU_DEVICE_IOMMUFD "-vfio"
-
 static int iommufd_cdev_map(const VFIOContainerBase *bcontainer, hwaddr iova,
                             ram_addr_t size, void *vaddr, bool readonly)
 {
@@ -557,7 +554,8 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
 
     space = vfio_address_space_get(as);
 
-    if (!vfio_device_hiod_create_and_realize(vbasedev,
+    if (!vbasedev->cpr.reused &&
+        !vfio_device_hiod_create_and_realize(vbasedev,
             TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO, errp)) {
         goto err_alloc_ioas;
     }
diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
index cc57a05..148ce89 100644
--- a/hw/vfio/vfio-iommufd.h
+++ b/hw/vfio/vfio-iommufd.h
@@ -11,6 +11,9 @@
 
 #include "hw/vfio/vfio-container-base.h"
 
+#define TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO             \
+            TYPE_HOST_IOMMU_DEVICE_IOMMUFD "-vfio"
+
 typedef struct VFIODevice VFIODevice;
 
 typedef struct VFIOIOASHwpt {
-- 
1.8.3.1




* [PATCH V3 39/42] vfio/iommufd: reconstruct hwpt
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (37 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 38/42] vfio/iommufd: reconstruct hw_caps Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-19  3:25   ` Duan, Zhenzhong
  2025-05-12 15:32 ` [PATCH V3 40/42] vfio/iommufd: change process Steve Sistare
                   ` (3 subsequent siblings)
  42 siblings, 1 reply; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Save the hwpt_id in vmstate.  In realize, skip its allocation from
iommufd_cdev_attach -> iommufd_cdev_attach_container ->
iommufd_cdev_autodomains_get.

Rebuild userland structures to hold hwpt_id by calling
iommufd_cdev_rebuild_hwpt at post load time.  This depends on hw_caps, which
was restored by the post_load call to vfio_device_hiod_create_and_realize.
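
The lookup-or-create step can be sketched roughly as follows.  This is a
simplified model under stated assumptions: the types and the function name
are hypothetical, a plain singly linked list stands in for QLIST, and no
kernel call is made because the hwpt already exists in the kernel.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userland record for one hardware page table. */
typedef struct MockHwpt {
    uint32_t hwpt_id;
    unsigned users;            /* number of devices sharing this hwpt */
    struct MockHwpt *next;
} MockHwpt;

typedef struct MockContainer {
    MockHwpt *hwpt_list;
} MockContainer;

typedef struct MockDev {
    uint32_t saved_hwpt_id;    /* value recovered from vmstate */
    MockHwpt *hwpt;
} MockDev;

/*
 * Reuse the container's record for saved_hwpt_id if one exists,
 * otherwise create a record for it.  Returns NULL on allocation failure.
 */
MockHwpt *mock_rebuild_hwpt(MockDev *dev, MockContainer *c)
{
    MockHwpt *h;

    for (h = c->hwpt_list; h; h = h->next) {
        if (h->hwpt_id == dev->saved_hwpt_id) {
            break;
        }
    }
    if (!h) {
        h = calloc(1, sizeof(*h));
        if (!h) {
            return NULL;
        }
        h->hwpt_id = dev->saved_hwpt_id;
        h->next = c->hwpt_list;
        c->hwpt_list = h;
    }
    h->users++;
    dev->hwpt = h;
    return h;
}
```

Two devices that saved the same hwpt_id end up sharing one record, which
mirrors the sharing they had in old QEMU.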

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/cpr-iommufd.c      |  7 +++++++
 hw/vfio/iommufd.c          | 24 ++++++++++++++++++++++--
 hw/vfio/trace-events       |  1 +
 hw/vfio/vfio-iommufd.h     |  3 +++
 include/hw/vfio/vfio-cpr.h |  1 +
 5 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 24cdf10..6d3f4e0 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -110,6 +110,12 @@ static int vfio_device_post_load(void *opaque, int version_id)
         error_report_err(err);
         return -1;
     }
+    if (!vbasedev->mdev) {
+        VFIOIOMMUFDContainer *container = container_of(vbasedev->bcontainer,
+                                                       VFIOIOMMUFDContainer,
+                                                       bcontainer);
+        iommufd_cdev_rebuild_hwpt(vbasedev, container);
+    }
     return 0;
 }
 
@@ -121,6 +127,7 @@ static const VMStateDescription vfio_device_vmstate = {
     .needed = cpr_needed_for_reuse,
     .fields = (VMStateField[]) {
         VMSTATE_INT32(devid, VFIODevice),
+        VMSTATE_UINT32(cpr.hwpt_id, VFIODevice),
         VMSTATE_END_OF_LIST()
     }
 };
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index d980684..ec79c83 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -318,6 +318,7 @@ static bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
 static void iommufd_cdev_use_hwpt(VFIODevice *vbasedev, VFIOIOASHwpt *hwpt)
 {
     vbasedev->hwpt = hwpt;
+    vbasedev->cpr.hwpt_id = hwpt->hwpt_id;
     vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
     QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
 }
@@ -373,6 +374,23 @@ static bool iommufd_cdev_make_hwpt(VFIODevice *vbasedev,
     return true;
 }
 
+void iommufd_cdev_rebuild_hwpt(VFIODevice *vbasedev,
+                               VFIOIOMMUFDContainer *container)
+{
+    VFIOIOASHwpt *hwpt;
+    int hwpt_id = vbasedev->cpr.hwpt_id;
+
+    trace_iommufd_cdev_rebuild_hwpt(container->be->fd, hwpt_id);
+
+    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
+        if (hwpt->hwpt_id == hwpt_id) {
+            iommufd_cdev_use_hwpt(vbasedev, hwpt);
+            return;
+        }
+    }
+    iommufd_cdev_make_hwpt(vbasedev, container, hwpt_id, false, NULL);
+}
+
 static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
                                          VFIOIOMMUFDContainer *container,
                                          Error **errp)
@@ -567,7 +585,8 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
             vbasedev->iommufd != container->be) {
             continue;
         }
-        if (!iommufd_cdev_attach_container(vbasedev, container, &err)) {
+        if (!vbasedev->cpr.reused &&
+            !iommufd_cdev_attach_container(vbasedev, container, &err)) {
             const char *msg = error_get_pretty(err);
 
             trace_iommufd_cdev_fail_attach_existing_container(msg);
@@ -605,7 +624,8 @@ skip_ioas_alloc:
     bcontainer = &container->bcontainer;
     vfio_address_space_insert(space, bcontainer);
 
-    if (!iommufd_cdev_attach_container(vbasedev, container, errp)) {
+    if (!vbasedev->cpr.reused &&
+        !iommufd_cdev_attach_container(vbasedev, container, errp)) {
         goto err_attach_container;
     }
 
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index e90ec9b..4955264 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -190,6 +190,7 @@ iommufd_cdev_connect_and_bind(int iommufd, const char *name, int devfd, int devi
 iommufd_cdev_getfd(const char *dev, int devfd) " %s (fd=%d)"
 iommufd_cdev_attach_ioas_hwpt(int iommufd, const char *name, int devfd, int id) " [iommufd=%d] Successfully attached device %s (%d) to id=%d"
 iommufd_cdev_detach_ioas_hwpt(int iommufd, const char *name) " [iommufd=%d] Successfully detached %s"
+iommufd_cdev_rebuild_hwpt(int iommufd, int hwpt_id) " [iommufd=%d] hwpt %d"
 iommufd_cdev_fail_attach_existing_container(const char *msg) " %s"
 iommufd_cdev_alloc_ioas(int iommufd, int ioas_id) " [iommufd=%d] new IOMMUFD container with ioasid=%d"
 iommufd_cdev_device_info(char *name, int devfd, int num_irqs, int num_regions, int flags) " %s (%d) num_irqs=%d num_regions=%d flags=%d"
diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
index 148ce89..78af0d8 100644
--- a/hw/vfio/vfio-iommufd.h
+++ b/hw/vfio/vfio-iommufd.h
@@ -38,4 +38,7 @@ OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer, VFIO_IOMMU_IOMMUFD);
 bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
                                       uint32_t ioas_id, Error **errp);
 
+void iommufd_cdev_rebuild_hwpt(VFIODevice *vbasedev,
+                               VFIOIOMMUFDContainer *container);
+
 #endif /* HW_VFIO_VFIO_IOMMUFD_H */
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 1379b20..b98c247 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -24,6 +24,7 @@ typedef struct VFIODeviceCPR {
     bool reused;
     Error *mdev_blocker;
     Error *id_blocker;
+    uint32_t hwpt_id;
 } VFIODeviceCPR;
 
 struct VFIOContainer;
-- 
1.8.3.1




* [PATCH V3 40/42] vfio/iommufd: change process
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (38 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 39/42] vfio/iommufd: reconstruct hwpt Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-12 15:32 ` [PATCH V3 41/42] iommufd: preserve DMA mappings Steve Sistare
                   ` (2 subsequent siblings)
  42 siblings, 0 replies; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Finish CPR by changing the owning process of the iommufd device in
post load.
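
A minimal sketch of the two call sites, using a mock backend rather than
the real ioctl.  Every name below is hypothetical, and the rule that only
file-backed mappings can change owner is an assumption made for the
example.  The pre_save call passes the current owner, so it changes
nothing on success but reports unsupported mappings before the transfer
starts.  The post_load call performs the real change.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mapping kinds; the support rule below is an assumption. */
typedef enum { MOCK_MAP_FILE, MOCK_MAP_VADDR } MockMapKind;

typedef struct MockBackend {
    const MockMapKind *maps;
    size_t nr_maps;
    int owner;                /* pretend id of the owning process */
} MockBackend;

/* Stand-in for the change-process call: all mappings or nothing. */
bool mock_change_process(MockBackend *be, int new_owner)
{
    for (size_t i = 0; i < be->nr_maps; i++) {
        if (be->maps[i] != MOCK_MAP_FILE) {
            return false;     /* unsupported mapping, owner unchanged */
        }
    }
    be->owner = new_owner;
    return true;
}

/* Old QEMU: same owner, so this only validates the mappings. */
int mock_container_pre_save(MockBackend *be)
{
    return mock_change_process(be, be->owner) ? 0 : -1;
}

/* New QEMU: actually take ownership. */
int mock_backend_post_load(MockBackend *be, int new_owner)
{
    return mock_change_process(be, new_owner) ? 0 : -1;
}
```

Failing in pre_save leaves old QEMU still running and owning the device,
which is easier to recover from than a failure found in new QEMU.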

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/cpr-iommufd.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 6d3f4e0..67be775 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -47,10 +47,27 @@ static int vfio_container_post_load(void *opaque, int version_id)
     return 0;
 }
 
+static int vfio_container_pre_save(void *opaque)
+{
+    VFIOIOMMUFDContainer *container = opaque;
+    Error *err = NULL;
+
+    /*
+     * The process has not changed yet, but proactively call the ioctl,
+     * and it will fail if any DMA mappings are not supported.
+     */
+    if (!iommufd_change_process(container->be, &err)) {
+        error_report_err(err);
+        return -1;
+    }
+    return 0;
+}
+
 static const VMStateDescription vfio_container_vmstate = {
     .name = "vfio-iommufd-container",
     .version_id = 0,
     .minimum_version_id = 0,
+    .pre_save = vfio_container_pre_save,
     .post_load = vfio_container_post_load,
     .needed = cpr_needed_for_reuse,
     .fields = (VMStateField[]) {
@@ -59,10 +76,23 @@ static const VMStateDescription vfio_container_vmstate = {
     }
 };
 
+static int iommufd_cpr_post_load(void *opaque, int version_id)
+{
+    IOMMUFDBackend *be = opaque;
+    Error *err = NULL;
+
+    if (!iommufd_change_process(be, &err)) {
+        error_report_err(err);
+        return -1;
+    }
+    return 0;
+}
+
 static const VMStateDescription iommufd_cpr_vmstate = {
     .name = "iommufd",
     .version_id = 0,
     .minimum_version_id = 0,
+    .post_load = iommufd_cpr_post_load,
     .needed = cpr_needed_for_reuse,
     .fields = (VMStateField[]) {
         VMSTATE_END_OF_LIST()
-- 
1.8.3.1




* [PATCH V3 41/42] iommufd: preserve DMA mappings
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (39 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 40/42] vfio/iommufd: change process Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-12 15:32 ` [PATCH V3 42/42] vfio/container: delete old cpr register Steve Sistare
  2025-05-16 16:37 ` [PATCH V3 00/42] Live update: vfio and iommufd Cédric Le Goater
  42 siblings, 0 replies; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

During cpr-transfer load in new QEMU, the vfio_memory_listener causes
spurious calls to map and unmap DMA regions, as devices are created and
the address space is built.  This memory was already mapped by the
device in old QEMU, so suppress the map and unmap callbacks during CPR,
that is, while the reused flag is set.  The reused flag is cleared in the
post_load handler.
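
The effect of the guard can be modelled as below.  This is a sketch with
hypothetical names.  A counter stands in for the map and unmap ioctls, and
the helper that clears the flag is an assumption about where post_load
would do so.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical backend: counts how often the kernel would be called. */
typedef struct MockIommufd {
    bool cpr_reused;          /* set in new QEMU while state is loading */
    unsigned kernel_calls;    /* stand-in for map/unmap ioctls issued */
} MockIommufd;

int mock_map_dma(MockIommufd *be, uint64_t iova, uint64_t size)
{
    (void)iova;
    (void)size;
    if (be->cpr_reused) {
        return 0;             /* mapping already exists, report success */
    }
    be->kernel_calls++;
    return 0;
}

int mock_unmap_dma(MockIommufd *be, uint64_t iova, uint64_t size)
{
    (void)iova;
    (void)size;
    if (be->cpr_reused) {
        return 0;             /* keep the preserved mapping intact */
    }
    be->kernel_calls++;
    return 0;
}

/* Assumed to run from post_load: normal map/unmap resumes afterwards. */
void mock_clear_reused(MockIommufd *be)
{
    be->cpr_reused = false;
}
```

Returning 0 rather than an error matters here: the listener treats the
suppressed call as a success and continues building the address space.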

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 backends/iommufd.c    | 8 ++++++++
 hw/vfio/cpr-iommufd.c | 1 +
 2 files changed, 9 insertions(+)

diff --git a/backends/iommufd.c b/backends/iommufd.c
index 492747c..c765f2d 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -209,6 +209,10 @@ int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
         .length = size,
     };
 
+    if (be->cpr_reused) {
+        return 0;
+    }
+
     if (!readonly) {
         map.flags |= IOMMU_IOAS_MAP_WRITEABLE;
     }
@@ -240,6 +244,10 @@ int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
         .length = size,
     };
 
+    if (be->cpr_reused) {
+        return 0;
+    }
+
     ret = ioctl(fd, IOMMU_IOAS_UNMAP, &unmap);
     /*
      * IOMMUFD takes mapping as some kind of object, unmapping
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 67be775..9680f5a 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -67,6 +67,7 @@ static const VMStateDescription vfio_container_vmstate = {
     .name = "vfio-iommufd-container",
     .version_id = 0,
     .minimum_version_id = 0,
+    .priority = MIG_PRI_LOW,   /* Must happen after devices and groups */
     .pre_save = vfio_container_pre_save,
     .post_load = vfio_container_post_load,
     .needed = cpr_needed_for_reuse,
-- 
1.8.3.1




* [PATCH V3 42/42] vfio/container: delete old cpr register
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (40 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 41/42] iommufd: preserve DMA mappings Steve Sistare
@ 2025-05-12 15:32 ` Steve Sistare
  2025-05-16 16:37 ` [PATCH V3 00/42] Live update: vfio and iommufd Cédric Le Goater
  42 siblings, 0 replies; 157+ messages in thread
From: Steve Sistare @ 2025-05-12 15:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

vfio_cpr_[un]register_container are no longer used, since they were
subsumed by container type-specific registration.  Delete them.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/cpr.c              | 13 -------------
 include/hw/vfio/vfio-cpr.h |  4 ----
 2 files changed, 17 deletions(-)

diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index 7609c62..ea1773e 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -30,19 +30,6 @@ int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
     return 0;
 }
 
-bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp)
-{
-    migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
-                                vfio_cpr_reboot_notifier,
-                                MIG_MODE_CPR_REBOOT);
-    return true;
-}
-
-void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
-{
-    migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
-}
-
 #define STRDUP_VECTOR_FD_NAME(vdev, name)   \
     g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
 
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index b98c247..601037b 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -41,10 +41,6 @@ void vfio_legacy_cpr_unregister_container(struct VFIOContainer *container);
 int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
                              Error **errp);
 
-bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
-                                 Error **errp);
-void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
-
 bool vfio_iommufd_cpr_register_container(struct VFIOIOMMUFDContainer *container,
                                          Error **errp);
 void vfio_iommufd_cpr_unregister_container(
-- 
1.8.3.1




* Re: [PATCH V3 26/42] vfio: return mr from vfio_get_xlat_addr
  2025-05-12 15:32 ` [PATCH V3 26/42] vfio: return mr from vfio_get_xlat_addr Steve Sistare
@ 2025-05-12 20:51   ` John Levon
  2025-05-14 17:03     ` Cédric Le Goater
  2025-05-15 17:24     ` Steven Sistare
  2025-05-13 11:12   ` Mark Cave-Ayland
  1 sibling, 2 replies; 157+ messages in thread
From: John Levon @ 2025-05-12 20:51 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas

On Mon, May 12, 2025 at 08:32:37AM -0700, Steve Sistare wrote:

> Modify memory_get_xlat_addr and vfio_get_xlat_addr to return the memory
> region that the translated address is found in.  This will be needed by
> CPR in a subsequent patch to map blocks using IOMMU_IOAS_MAP_FILE.
> 
> Also return the xlat offset, so we can simplify the interface by removing
> the out parameters that can be trivially derived from mr and xlat.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Steve, would you consider splitting this out from the full CPR series and
submitting it as a standalone patch? We both depend on this change, and your
patch seems much nicer than the current one in the vfio-user series.

regards
john



* Re: [PATCH V3 26/42] vfio: return mr from vfio_get_xlat_addr
  2025-05-12 15:32 ` [PATCH V3 26/42] vfio: return mr from vfio_get_xlat_addr Steve Sistare
  2025-05-12 20:51   ` John Levon
@ 2025-05-13 11:12   ` Mark Cave-Ayland
  2025-05-15 19:40     ` Steven Sistare
  1 sibling, 1 reply; 157+ messages in thread
From: Mark Cave-Ayland @ 2025-05-13 11:12 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas

On 12/05/2025 16:32, Steve Sistare wrote:

> Modify memory_get_xlat_addr and vfio_get_xlat_addr to return the memory
> region that the translated address is found in.  This will be needed by
> CPR in a subsequent patch to map blocks using IOMMU_IOAS_MAP_FILE.
> 
> Also return the xlat offset, so we can simplify the interface by removing
> the out parameters that can be trivially derived from mr and xlat.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   hw/vfio/listener.c      | 29 +++++++++++++++++++----------
>   hw/virtio/vhost-vdpa.c  |  8 ++++++--
>   include/system/memory.h | 16 +++++++---------
>   system/memory.c         | 25 ++++---------------------
>   4 files changed, 36 insertions(+), 42 deletions(-)
> 
> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
> index e86ffcf..87b7a3c 100644
> --- a/hw/vfio/listener.c
> +++ b/hw/vfio/listener.c
> @@ -90,16 +90,17 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>              section->offset_within_address_space & (1ULL << 63);
>   }
>   
> -/* Called with rcu_read_lock held.  */
> -static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> -                               ram_addr_t *ram_addr, bool *read_only,
> -                               Error **errp)
> +/*
> + * Called with rcu_read_lock held.
> + * The returned MemoryRegion must not be accessed after calling rcu_read_unlock.
> + */
> +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, MemoryRegion **mr_p,
> +                               hwaddr *xlat_p, Error **errp)
>   {
> -    bool ret, mr_has_discard_manager;
> +    bool ret;
>   
> -    ret = memory_get_xlat_addr(iotlb, vaddr, ram_addr, read_only,
> -                               &mr_has_discard_manager, errp);
> -    if (ret && mr_has_discard_manager) {
> +    ret = memory_get_xlat_addr(iotlb, mr_p, xlat_p, errp);
> +    if (ret && memory_region_has_ram_discard_manager(*mr_p)) {

I'm trying to understand the underlying intention of this patch: is it 
just so that you can access the corresponding RAMBlock in 
vfio_container_dma_map() in patch 31 "vfio/iommufd: use 
IOMMU_IOAS_MAP_FILE"?

Given that the flatview can theoretically change at any point, it feels 
as if the current API whereby the vaddr is passed around is the correct 
approach, and that the final MemoryRegion lookup should be done at the 
point where it is required.

If this is the case, is it not simpler to add a call to 
address_space_translate() in patch 31 to obtain the MemoryRegion pointer 
there instead?
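
For reference, the values the patch stops returning are all simple
functions of (mr, xlat), which is what lets callers derive them.  A
self-contained model, with a hypothetical struct and flag values standing
in for MemoryRegion and the IOMMU permission bits:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical permission bits for this sketch only. */
enum { MOCK_IOMMU_RO = 1, MOCK_IOMMU_WO = 2 };

/* Hypothetical stand-in for the MemoryRegion fields used below. */
typedef struct MockMemoryRegion {
    uint8_t *ram_ptr;     /* host virtual address of the region start */
    uint64_t ram_addr;    /* ram address of the region start */
    bool readonly;
} MockMemoryRegion;

/* Host virtual address of the translated entry. */
void *mock_vaddr(const MockMemoryRegion *mr, uint64_t xlat)
{
    return mr->ram_ptr + xlat;
}

/* RAM address of the translated entry. */
uint64_t mock_ram_addr(const MockMemoryRegion *mr, uint64_t xlat)
{
    return mr->ram_addr + xlat;
}

/* Read-only if the entry lacks write permission or the region forbids it. */
bool mock_read_only(const MockMemoryRegion *mr, int perm)
{
    return !(perm & MOCK_IOMMU_WO) || mr->readonly;
}
```

Either placement of the lookup would compute the same three values.  The
open question above is only where the MemoryRegion pointer is obtained.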

>           /*
>            * Malicious VMs might trigger discarding of IOMMU-mapped memory. The
>            * pages will remain pinned inside vfio until unmapped, resulting in a
> @@ -126,6 +127,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>       VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>       VFIOContainerBase *bcontainer = giommu->bcontainer;
>       hwaddr iova = iotlb->iova + giommu->iommu_offset;
> +    MemoryRegion *mr;
> +    hwaddr xlat;
>       void *vaddr;
>       int ret;
>       Error *local_err = NULL;
> @@ -150,10 +153,13 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>       if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>           bool read_only;
>   
> -        if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, &local_err)) {
> +        if (!vfio_get_xlat_addr(iotlb, &mr, &xlat, &local_err)) {
>               error_report_err(local_err);
>               goto out;
>           }
> +        vaddr = memory_region_get_ram_ptr(mr) + xlat;
> +        read_only = !(iotlb->perm & IOMMU_WO) || mr->readonly;
> +
>           /*
>            * vaddr is only valid until rcu_read_unlock(). But after
>            * vfio_dma_map has set up the mapping the pages will be
> @@ -1047,6 +1053,8 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>       ram_addr_t translated_addr;
>       Error *local_err = NULL;
>       int ret = -EINVAL;
> +    MemoryRegion *mr;
> +    ram_addr_t xlat;
>   
>       trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask);
>   
> @@ -1058,9 +1066,10 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>       }
>   
>       rcu_read_lock();
> -    if (!vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL, &local_err)) {
> +    if (!vfio_get_xlat_addr(iotlb, &mr, &xlat, &local_err)) {
>           goto out_unlock;
>       }
> +    translated_addr = memory_region_get_ram_addr(mr) + xlat;
>   
>       ret = vfio_container_query_dirty_bitmap(bcontainer, iova, iotlb->addr_mask + 1,
>                                   translated_addr, &local_err);
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 1ab2c11..f191360 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -209,6 +209,8 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>       int ret;
>       Int128 llend;
>       Error *local_err = NULL;
> +    MemoryRegion *mr;
> +    hwaddr xlat;
>   
>       if (iotlb->target_as != &address_space_memory) {
>           error_report("Wrong target AS \"%s\", only system memory is allowed",
> @@ -228,11 +230,13 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>       if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>           bool read_only;
>   
> -        if (!memory_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, NULL,
> -                                  &local_err)) {
> +        if (!memory_get_xlat_addr(iotlb, &mr, &xlat, &local_err)) {
>               error_report_err(local_err);
>               return;
>           }
> +        vaddr = memory_region_get_ram_ptr(mr) + xlat;
> +        read_only = !(iotlb->perm & IOMMU_WO) || mr->readonly;
> +
>           ret = vhost_vdpa_dma_map(s, VHOST_VDPA_GUEST_PA_ASID, iova,
>                                    iotlb->addr_mask + 1, vaddr, read_only);
>           if (ret) {
> diff --git a/include/system/memory.h b/include/system/memory.h
> index fbbf4cf..d743214 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -738,21 +738,19 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
>                                                RamDiscardListener *rdl);
>   
>   /**
> - * memory_get_xlat_addr: Extract addresses from a TLB entry
> + * memory_get_xlat_addr: Extract addresses from a TLB entry.
> + *                       Called with rcu_read_lock held.
>    *
>    * @iotlb: pointer to an #IOMMUTLBEntry
> - * @vaddr: virtual address
> - * @ram_addr: RAM address
> - * @read_only: indicates if writes are allowed
> - * @mr_has_discard_manager: indicates memory is controlled by a
> - *                          RamDiscardManager
> + * @mr_p: return the MemoryRegion containing the @iotlb translated addr.
> + *        The MemoryRegion must not be accessed after rcu_read_unlock.
> + * @xlat_p: return the offset of the entry from the start of @mr_p
>    * @errp: pointer to Error*, to store an error if it happens.
>    *
>    * Return: true on success, else false setting @errp with error.
>    */
> -bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> -                          ram_addr_t *ram_addr, bool *read_only,
> -                          bool *mr_has_discard_manager, Error **errp);
> +bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, MemoryRegion **mr_p,
> +                          hwaddr *xlat_p, Error **errp);
>   
>   typedef struct CoalescedMemoryRange CoalescedMemoryRange;
>   typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
> diff --git a/system/memory.c b/system/memory.c
> index 63b983e..4894c0d 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -2174,18 +2174,14 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
>   }
>   
>   /* Called with rcu_read_lock held.  */
> -bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> -                          ram_addr_t *ram_addr, bool *read_only,
> -                          bool *mr_has_discard_manager, Error **errp)
> +bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, MemoryRegion **mr_p,
> +                          hwaddr *xlat_p, Error **errp)
>   {
>       MemoryRegion *mr;
>       hwaddr xlat;
>       hwaddr len = iotlb->addr_mask + 1;
>       bool writable = iotlb->perm & IOMMU_WO;
>   
> -    if (mr_has_discard_manager) {
> -        *mr_has_discard_manager = false;
> -    }
>       /*
>        * The IOMMU TLB entry we have just covers translation through
>        * this IOMMU to its immediate target.  We need to translate
> @@ -2203,9 +2199,6 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>               .offset_within_region = xlat,
>               .size = int128_make64(len),
>           };
> -        if (mr_has_discard_manager) {
> -            *mr_has_discard_manager = true;
> -        }
>           /*
>            * Malicious VMs can map memory into the IOMMU, which is expected
>            * to remain discarded. vfio will pin all pages, populating memory.
> @@ -2229,18 +2222,8 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>           return false;
>       }
>   
> -    if (vaddr) {
> -        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
> -    }
> -
> -    if (ram_addr) {
> -        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
> -    }
> -
> -    if (read_only) {
> -        *read_only = !writable || mr->readonly;
> -    }
> -
> +    *xlat_p = xlat;
> +    *mr_p = mr;
>       return true;
>   }


ATB,

Mark.




* Re: [PATCH V3 26/42] vfio: return mr from vfio_get_xlat_addr
  2025-05-12 20:51   ` John Levon
@ 2025-05-14 17:03     ` Cédric Le Goater
  2025-05-15  8:22       ` David Hildenbrand
  2025-05-15 17:24     ` Steven Sistare
  1 sibling, 1 reply; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-14 17:03 UTC (permalink / raw)
  To: John Levon, Steve Sistare
  Cc: qemu-devel, Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas,
	Paolo Bonzini, David Hildenbrand, Peter Xu,
	Philippe Mathieu-Daudé

+ Paolo
+ David
+ Peter
+ Phil

On 5/12/25 22:51, John Levon wrote:
> On Mon, May 12, 2025 at 08:32:37AM -0700, Steve Sistare wrote:
> 
>> Modify memory_get_xlat_addr and vfio_get_xlat_addr to return the memory
>> region that the translated address is found in.  This will be needed by
>> CPR in a subsequent patch to map blocks using IOMMU_IOAS_MAP_FILE.
>>
>> Also return the xlat offset, so we can simplify the interface by removing
>> the out parameters that can be trivially derived from mr and xlat.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> Steve, would you consider splitting this out from the full CPR series and
> submitting as a standalone, as we both have a dependency on doing this, and your
> patch seems much nicer than the current one in vfio-user series?

Maybe we can merge this version if the maintainers ack the change?

Thanks,

C.



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH V3 01/42] MAINTAINERS: Add reviewer for CPR
  2025-05-12 15:32 ` [PATCH V3 01/42] MAINTAINERS: Add reviewer for CPR Steve Sistare
@ 2025-05-15  7:36   ` Cédric Le Goater
  0 siblings, 0 replies; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-15  7:36 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> CPR is integrated with live migration, and has the same maintainers.
> But, add a CPR section to add a reviewer.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   MAINTAINERS | 9 +++++++++
>   1 file changed, 9 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 6dacd6d..d54a532 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3019,6 +3019,15 @@ F: include/qemu/co-shared-resource.h
>   T: git https://gitlab.com/jsnow/qemu.git jobs
>   T: git https://gitlab.com/vsementsov/qemu.git block
>   
> +CheckPoint and Restart (CPR)
> +R: Steve Sistare <steven.sistare@oracle.com>
> +S: Supported
> +F: hw/vfio/cpr*
> +F: include/migration/cpr.h
> +F: migration/cpr*
> +F: tests/qtest/migration/cpr*
> +F: docs/devel/migration/CPR.rst
> +
>   Compute Express Link
>   M: Jonathan Cameron <jonathan.cameron@huawei.com>
>   R: Fan Ni <fan.ni@samsung.com>

Please add:

   include/hw/vfio/vfio-cpr.h

with that,
  
Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH V3 02/42] migration: cpr helpers
  2025-05-12 15:32 ` [PATCH V3 02/42] migration: cpr helpers Steve Sistare
@ 2025-05-15  7:43   ` Cédric Le Goater
  0 siblings, 0 replies; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-15  7:43 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> Add the cpr_needed_for_reuse and cpr_open_fd helpers, for use when adding cpr
> support for vfio and iommufd.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
>   include/migration/cpr.h |  4 ++++
>   migration/cpr.c         | 24 ++++++++++++++++++++++++
>   2 files changed, 28 insertions(+)
> 
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> index 7561fc7..fc6aa33 100644
> --- a/include/migration/cpr.h
> +++ b/include/migration/cpr.h
> @@ -18,6 +18,8 @@
>   void cpr_save_fd(const char *name, int id, int fd);
>   void cpr_delete_fd(const char *name, int id);
>   int cpr_find_fd(const char *name, int id);
> +int cpr_open_fd(const char *path, int flags, const char *name, int id,
> +                bool *reused, Error **errp);
>   
>   MigMode cpr_get_incoming_mode(void);
>   void cpr_set_incoming_mode(MigMode mode);
> @@ -28,6 +30,8 @@ int cpr_state_load(MigrationChannel *channel, Error **errp);
>   void cpr_state_close(void);
>   struct QIOChannel *cpr_state_ioc(void);
>   
> +bool cpr_needed_for_reuse(void *opaque);
> +
>   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>   QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>   
> diff --git a/migration/cpr.c b/migration/cpr.c
> index 42c4656..0b01e25 100644
> --- a/migration/cpr.c
> +++ b/migration/cpr.c
> @@ -95,6 +95,24 @@ int cpr_find_fd(const char *name, int id)
>       trace_cpr_find_fd(name, id, fd);
>       return fd;
>   }
> +
> +int cpr_open_fd(const char *path, int flags, const char *name, int id,
> +                bool *reused, Error **errp)
> +{
> +    int fd = cpr_find_fd(name, id);
> +
> +    if (reused) {
> +        *reused = (fd >= 0);
> +    }
> +    if (fd < 0) {
> +        fd = qemu_open(path, flags, errp);
> +        if (fd >= 0) {
> +            cpr_save_fd(name, id, fd);
> +        }
> +    }
> +    return fd;
> +}
> +
>   /*************************************************************************/
>   #define CPR_STATE "CprState"
>   
> @@ -228,3 +246,9 @@ void cpr_state_close(void)
>           cpr_state_file = NULL;
>       }
>   }
> +
> +bool cpr_needed_for_reuse(void *opaque)
> +{
> +    MigMode mode = migrate_mode();
> +    return mode == MIG_MODE_CPR_TRANSFER;
> +}



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH V3 05/42] vfio: move vfio-cpr.h
  2025-05-12 15:32 ` [PATCH V3 05/42] vfio: move vfio-cpr.h Steve Sistare
@ 2025-05-15  7:46   ` Cédric Le Goater
  0 siblings, 0 replies; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-15  7:46 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> Move vfio-cpr.h to include/hw/vfio, because it will need to be included by
> other files there.


So patch 1 is fine. Forget my comment.


> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   MAINTAINERS                |  1 +
>   hw/vfio/container.c        |  2 +-
>   hw/vfio/cpr.c              |  2 +-
>   hw/vfio/iommufd.c          |  2 +-
>   hw/vfio/vfio-cpr.h         | 15 ---------------
>   include/hw/vfio/vfio-cpr.h | 18 ++++++++++++++++++
>   6 files changed, 22 insertions(+), 18 deletions(-)
>   delete mode 100644 hw/vfio/vfio-cpr.h
>   create mode 100644 include/hw/vfio/vfio-cpr.h


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> diff --git a/MAINTAINERS b/MAINTAINERS
> index d54a532..9bee3cf 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3023,6 +3023,7 @@ CheckPoint and Restart (CPR)
>   R: Steve Sistare <steven.sistare@oracle.com>
>   S: Supported
>   F: hw/vfio/cpr*
> +F: include/hw/vfio/vfio-cpr.h
>   F: include/migration/cpr.h
>   F: migration/cpr*
>   F: tests/qtest/migration/cpr*
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index a9f0dba..eb56f00 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -33,8 +33,8 @@
>   #include "qapi/error.h"
>   #include "pci.h"
>   #include "hw/vfio/vfio-container.h"
> +#include "hw/vfio/vfio-cpr.h"
>   #include "vfio-helpers.h"
> -#include "vfio-cpr.h"
>   #include "vfio-listener.h"
>   
>   #define TYPE_HOST_IOMMU_DEVICE_LEGACY_VFIO TYPE_HOST_IOMMU_DEVICE "-legacy-vfio"
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> index 3214184..0210e76 100644
> --- a/hw/vfio/cpr.c
> +++ b/hw/vfio/cpr.c
> @@ -8,9 +8,9 @@
>   #include "qemu/osdep.h"
>   #include "hw/vfio/vfio-device.h"
>   #include "migration/misc.h"
> +#include "hw/vfio/vfio-cpr.h"
>   #include "qapi/error.h"
>   #include "system/runstate.h"
> -#include "vfio-cpr.h"
>   
>   static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
>                                       MigrationEvent *e, Error **errp)
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index af1c7ab..167bda4 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -21,13 +21,13 @@
>   #include "qapi/error.h"
>   #include "system/iommufd.h"
>   #include "hw/qdev-core.h"
> +#include "hw/vfio/vfio-cpr.h"
>   #include "system/reset.h"
>   #include "qemu/cutils.h"
>   #include "qemu/chardev_open.h"
>   #include "pci.h"
>   #include "vfio-iommufd.h"
>   #include "vfio-helpers.h"
> -#include "vfio-cpr.h"
>   #include "vfio-listener.h"
>   
>   #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO             \
> diff --git a/hw/vfio/vfio-cpr.h b/hw/vfio/vfio-cpr.h
> deleted file mode 100644
> index 134b83a..0000000
> --- a/hw/vfio/vfio-cpr.h
> +++ /dev/null
> @@ -1,15 +0,0 @@
> -/*
> - * VFIO CPR
> - *
> - * Copyright (c) 2025 Oracle and/or its affiliates.
> - *
> - * SPDX-License-Identifier: GPL-2.0-or-later
> - */
> -
> -#ifndef HW_VFIO_CPR_H
> -#define HW_VFIO_CPR_H
> -
> -bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp);
> -void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer);
> -
> -#endif /* HW_VFIO_CPR_H */
> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
> new file mode 100644
> index 0000000..750ea5b
> --- /dev/null
> +++ b/include/hw/vfio/vfio-cpr.h
> @@ -0,0 +1,18 @@
> +/*
> + * VFIO CPR
> + *
> + * Copyright (c) 2025 Oracle and/or its affiliates.
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef HW_VFIO_VFIO_CPR_H
> +#define HW_VFIO_VFIO_CPR_H
> +
> +struct VFIOContainerBase;
> +
> +bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
> +                                 Error **errp);
> +void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
> +
> +#endif /* HW_VFIO_VFIO_CPR_H */



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH V3 06/42] vfio/container: register container for cpr
  2025-05-12 15:32 ` [PATCH V3 06/42] vfio/container: register container for cpr Steve Sistare
@ 2025-05-15  7:54   ` Cédric Le Goater
  2025-05-15 19:06     ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-15  7:54 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> Register a legacy container for cpr-transfer, replacing the generic CPR
> register call with a more specific legacy container register call.  Add a
> blocker if the kernel does not support VFIO_UPDATE_VADDR or VFIO_UNMAP_ALL.
> 
> This is mostly boilerplate.  The fields to be saved and restored are added
> in subsequent patches.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   hw/vfio/container.c              |  6 ++--
>   hw/vfio/cpr-legacy.c             | 70 ++++++++++++++++++++++++++++++++++++++++
>   hw/vfio/cpr.c                    |  5 ++-
>   hw/vfio/meson.build              |  1 +
>   include/hw/vfio/vfio-container.h |  2 ++
>   include/hw/vfio/vfio-cpr.h       | 14 ++++++++
>   6 files changed, 92 insertions(+), 6 deletions(-)
>   create mode 100644 hw/vfio/cpr-legacy.c
> 
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index eb56f00..85c76da 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -642,7 +642,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>       new_container = true;
>       bcontainer = &container->bcontainer;
>   
> -    if (!vfio_cpr_register_container(bcontainer, errp)) {
> +    if (!vfio_legacy_cpr_register_container(container, errp)) {
>           goto fail;
>       }
>   
> @@ -678,7 +678,7 @@ fail:
>           vioc->release(bcontainer);
>       }
>       if (new_container) {
> -        vfio_cpr_unregister_container(bcontainer);
> +        vfio_legacy_cpr_unregister_container(container);
>           object_unref(container);
>       }
>       if (fd >= 0) {
> @@ -719,7 +719,7 @@ static void vfio_container_disconnect(VFIOGroup *group)
>           VFIOAddressSpace *space = bcontainer->space;
>   
>           trace_vfio_container_disconnect(container->fd);
> -        vfio_cpr_unregister_container(bcontainer);
> +        vfio_legacy_cpr_unregister_container(container);
>           close(container->fd);
>           object_unref(container);
>   
> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
> new file mode 100644
> index 0000000..fac323c
> --- /dev/null
> +++ b/hw/vfio/cpr-legacy.c
> @@ -0,0 +1,70 @@
> +/*
> + * Copyright (c) 2021-2025 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.

Please add an SPDX-License-Identifier tag.


> + */
> +
> +#include <sys/ioctl.h>
> +#include <linux/vfio.h>
> +#include "qemu/osdep.h"
> +#include "hw/vfio/vfio-container.h"
> +#include "hw/vfio/vfio-cpr.h"
> +#include "migration/blocker.h"
> +#include "migration/cpr.h"
> +#include "migration/migration.h"
> +#include "migration/vmstate.h"
> +#include "qapi/error.h"
> +
> +static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
> +{
> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR");
> +        return false;
> +
> +    } else if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
> +        error_setg(errp, "VFIO container does not support VFIO_UNMAP_ALL");
> +        return false;
> +
> +    } else {
> +        return true;
> +    }
> +}
> +
> +static const VMStateDescription vfio_container_vmstate = {
> +    .name = "vfio-container",
> +    .version_id = 0,
> +    .minimum_version_id = 0,
> +    .needed = cpr_needed_for_reuse,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
> +{
> +    VFIOContainerBase *bcontainer = &container->bcontainer;
> +    Error **cpr_blocker = &container->cpr.blocker;
> +
> +    migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
> +                                vfio_cpr_reboot_notifier,
> +                                MIG_MODE_CPR_REBOOT);
> +
> +    if (!vfio_cpr_supported(container, cpr_blocker)) {
> +        return migrate_add_blocker_modes(cpr_blocker, errp,
> +                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
> +    }
> +
> +    vmstate_register(NULL, -1, &vfio_container_vmstate, container);
> +
> +    return true;
> +}
> +
> +void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
> +{
> +    VFIOContainerBase *bcontainer = &container->bcontainer;
> +
> +    migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
> +    migrate_del_blocker(&container->cpr.blocker);
> +    vmstate_unregister(NULL, &vfio_container_vmstate, container);
> +}
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> index 0210e76..0e59612 100644
> --- a/hw/vfio/cpr.c
> +++ b/hw/vfio/cpr.c
> @@ -7,13 +7,12 @@
>   
>   #include "qemu/osdep.h"
>   #include "hw/vfio/vfio-device.h"
> -#include "migration/misc.h"
>   #include "hw/vfio/vfio-cpr.h"
>   #include "qapi/error.h"
>   #include "system/runstate.h"
>   
> -static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
> -                                    MigrationEvent *e, Error **errp)
> +int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
> +                             MigrationEvent *e, Error **errp)
>   {
>       if (e->type == MIG_EVENT_PRECOPY_SETUP &&
>           !runstate_check(RUN_STATE_SUSPENDED) && !vm_get_suspended()) {
> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index bccb050..73d29f9 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -21,6 +21,7 @@ system_ss.add(when: 'CONFIG_VFIO_XGMAC', if_true: files('calxeda-xgmac.c'))
>   system_ss.add(when: 'CONFIG_VFIO_AMD_XGBE', if_true: files('amd-xgbe.c'))
>   system_ss.add(when: 'CONFIG_VFIO', if_true: files(
>     'cpr.c',
> +  'cpr-legacy.c',
>     'device.c',
>     'migration.c',
>     'migration-multifd.c',
> diff --git a/include/hw/vfio/vfio-container.h b/include/hw/vfio/vfio-container.h
> index afc498d..21e5807 100644
> --- a/include/hw/vfio/vfio-container.h
> +++ b/include/hw/vfio/vfio-container.h
> @@ -10,6 +10,7 @@
>   #define HW_VFIO_CONTAINER_H
>   
>   #include "hw/vfio/vfio-container-base.h"
> +#include "hw/vfio/vfio-cpr.h"
>   
>   typedef struct VFIOContainer VFIOContainer;
>   typedef struct VFIODevice VFIODevice;
> @@ -29,6 +30,7 @@ typedef struct VFIOContainer {
>       int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>       unsigned iommu_type;
>       QLIST_HEAD(, VFIOGroup) group_list;
> +    VFIOContainerCPR cpr;
>   } VFIOContainer;
>   
>   OBJECT_DECLARE_SIMPLE_TYPE(VFIOContainer, VFIO_IOMMU_LEGACY);
> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
> index 750ea5b..f864547 100644
> --- a/include/hw/vfio/vfio-cpr.h
> +++ b/include/hw/vfio/vfio-cpr.h
> @@ -9,8 +9,22 @@
>   #ifndef HW_VFIO_VFIO_CPR_H
>   #define HW_VFIO_VFIO_CPR_H
>   
> +#include "migration/misc.h"
> +
> +typedef struct VFIOContainerCPR {
> +    Error *blocker;
> +} VFIOContainerCPR;
> +
> +struct VFIOContainer;
>   struct VFIOContainerBase;
>   
> +bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
> +                                        Error **errp);
> +void vfio_legacy_cpr_unregister_container(struct VFIOContainer *container);
> +
> +int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
> +                             Error **errp);
> +
>   bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
>                                    Error **errp);
>   void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);

What about vfio_cpr_register_container() and vfio_cpr_unregister_container()?
Shouldn't we remove them?


Thanks,

C.




^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH V3 26/42] vfio: return mr from vfio_get_xlat_addr
  2025-05-14 17:03     ` Cédric Le Goater
@ 2025-05-15  8:22       ` David Hildenbrand
  2025-05-15 19:13         ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: David Hildenbrand @ 2025-05-15  8:22 UTC (permalink / raw)
  To: Cédric Le Goater, John Levon, Steve Sistare
  Cc: qemu-devel, Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas,
	Paolo Bonzini, Philippe Mathieu-Daudé

On 14.05.25 19:03, Cédric Le Goater wrote:
> + Paolo
> + David
> + Peter
> + Phil
> 
> On 5/12/25 22:51, John Levon wrote:
>> On Mon, May 12, 2025 at 08:32:37AM -0700, Steve Sistare wrote:
>>
>>> Modify memory_get_xlat_addr and vfio_get_xlat_addr to return the memory
>>> region that the translated address is found in.  This will be needed by
>>> CPR in a subsequent patch to map blocks using IOMMU_IOAS_MAP_FILE.
>>>
>>> Also return the xlat offset, so we can simplify the interface by removing
>>> the out parameters that can be trivially derived from mr and xlat.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>
>> Steve, would you consider splitting this out from the full CPR series and
>> submitting as a standalone, as we both have a dependency on doing this, and your
>> patch seems much nicer than the current one in vfio-user series?
> 
> Maybe we can merge this version if the maintainers ack the change?

The change itself looks good to me. Now that we want to return the mr
from memory_get_xlat_addr(), why not make that the return type (NULL vs.
non-NULL), to get rid of the boolean?

MemoryRegion *memory_get_xlat_addr(IOMMUTLBEntry *iotlb, hwaddr *xlat_p,
		Error **errp);

Same with "vfio_get_xlat_addr".

Of course, we could consider renaming both functions to something like

memory_translate_iotlb()
vfio_translate_iotlb()
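
A minimal sketch of the pointer-return shape suggested above, using toy
stand-in types rather than QEMU's MemoryRegion and IOMMUTLBEntry. The names
ToyRegion, toy_translate and toy_region_addr are hypothetical and exist only
for illustration. NULL signals failure, a non-NULL result is the region, and
the offset comes back through xlat_p:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for illustration only; this is not QEMU's MemoryRegion. */
typedef struct ToyRegion {
    uint64_t base;   /* start address of the region */
    uint64_t size;   /* length in bytes */
    int is_ram;      /* only RAM regions can be translated */
} ToyRegion;

/*
 * Return the RAM region containing addr, or NULL if there is none.
 * On success, *xlat_p holds the offset of addr within that region.
 * The pointer itself replaces the separate boolean return value.
 */
ToyRegion *toy_translate(ToyRegion *regions, size_t n, uint64_t addr,
                         uint64_t *xlat_p)
{
    assert(xlat_p != NULL);

    for (size_t i = 0; i < n; i++) {
        ToyRegion *r = &regions[i];

        if (r->is_ram && addr >= r->base && addr - r->base < r->size) {
            *xlat_p = addr - r->base;
            return r;
        }
    }
    return NULL;
}

/*
 * Values that used to be separate out parameters can be derived by the
 * caller from the region and the offset.
 */
uint64_t toy_region_addr(const ToyRegion *r, uint64_t xlat)
{
    return r->base + xlat;
}
```

A caller then tests the result directly, for example
"r = toy_translate(regions, n, addr, &xlat); if (!r) { /* fail */ }", and
derives anything else it needs from r and xlat.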

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH V3 07/42] vfio/container: preserve descriptors
  2025-05-12 15:32 ` [PATCH V3 07/42] vfio/container: preserve descriptors Steve Sistare
@ 2025-05-15 12:59   ` Cédric Le Goater
  2025-05-15 19:08     ` Steven Sistare
  2025-05-22 13:51   ` Cédric Le Goater
  1 sibling, 1 reply; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-15 12:59 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> At vfio creation time, save the value of vfio container, group, and device
> descriptors in CPR state.  On qemu restart, vfio_realize() finds and uses
> the saved descriptors, and remembers the reused status for subsequent
> patches.  The reused status is cleared when vmstate load finishes.
> 
> During reuse, device and iommu state is already configured, so operations
> in vfio_realize that would modify the configuration, such as vfio ioctl's,
> are skipped.  The result is that vfio_realize constructs qemu data
> structures that reflect the current state of the device.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   hw/vfio/container.c           | 65 ++++++++++++++++++++++++++++++++++++-------
>   hw/vfio/cpr-legacy.c          | 46 ++++++++++++++++++++++++++++++
>   include/hw/vfio/vfio-cpr.h    |  9 ++++++
>   include/hw/vfio/vfio-device.h |  2 ++
>   4 files changed, 112 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index 85c76da..278a220 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -31,6 +31,8 @@
>   #include "system/reset.h"
>   #include "trace.h"
>   #include "qapi/error.h"
> +#include "migration/cpr.h"
> +#include "migration/blocker.h"
>   #include "pci.h"
>   #include "hw/vfio/vfio-container.h"
>   #include "hw/vfio/vfio-cpr.h"
> @@ -414,7 +416,7 @@ static bool vfio_set_iommu(int container_fd, int group_fd,
>   }
>   
>   static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
> -                                            Error **errp)
> +                                            bool cpr_reused, Error **errp)
>   {
>       int iommu_type;
>       const char *vioc_name;
> @@ -425,7 +427,11 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
>           return NULL;
>       }
>   
> -    if (!vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
> +    /*
> +     * If container is reused, just set its type and skip the ioctls, as the
> +     * container and group are already configured in the kernel.
> +     */
> +    if (!cpr_reused && !vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
>           return NULL;
>       }
>   
> @@ -433,6 +439,7 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
>   
>       container = VFIO_IOMMU_LEGACY(object_new(vioc_name));
>       container->fd = fd;
> +    container->cpr.reused = cpr_reused;
>       container->iommu_type = iommu_type;
>       return container;
>   }
> @@ -584,7 +591,7 @@ static bool vfio_container_attach_discard_disable(VFIOContainer *container,
>   }
>   
>   static bool vfio_container_group_add(VFIOContainer *container, VFIOGroup *group,
> -                                     Error **errp)
> +                                     bool cpr_reused, Error **errp)
>   {
>       if (!vfio_container_attach_discard_disable(container, group, errp)) {
>           return false;
> @@ -592,6 +599,9 @@ static bool vfio_container_group_add(VFIOContainer *container, VFIOGroup *group,
>       group->container = container;
>       QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>       vfio_group_add_kvm_device(group);
> +    if (!cpr_reused) {
> +        cpr_save_fd("vfio_container_for_group", group->groupid, container->fd);
> +    }

Could we avoid the test on cpr_reused and always call cpr_save_fd()?

>       return true;
>   }
>   
> @@ -601,6 +611,7 @@ static void vfio_container_group_del(VFIOContainer *container, VFIOGroup *group)
>       group->container = NULL;
>       vfio_group_del_kvm_device(group);
>       vfio_ram_block_discard_disable(container, false);
> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>   }
>   
>   static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
> @@ -613,17 +624,37 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>       VFIOIOMMUClass *vioc = NULL;
>       bool new_container = false;
>       bool group_was_added = false;
> +    bool cpr_reused;
>   
>       space = vfio_address_space_get(as);
> +    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
> +    cpr_reused = (fd > 0);


The code above does two things: it grabs a restored fd, and it
deduces from the fd value that the VM is doing a CPR reboot.

Instead of adding this cpr_reused flag, I would prefer to duplicate
the code into something like:

if (!cpr_reboot) {
    QLIST_FOREACH(bcontainer, &space->containers, next) {
        container = container_of(bcontainer, VFIOContainer, bcontainer);
        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
            return vfio_container_group_add(container, group, errp);
        }
    }

    fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
    if (fd < 0) {
        goto fail;
    }

    ret = ioctl(fd, VFIO_GET_API_VERSION);
    if (ret != VFIO_API_VERSION) {
        error_setg(errp, "supported vfio version: %d, "
                   "reported version: %d", VFIO_API_VERSION, ret);
        goto fail;
    }

    container = vfio_create_container(fd, group, errp);
} else {
    /* ... */
}



> +    /*
> +     * If the container is reused, then the group is already attached in the
> +     * kernel.  If a container with matching fd is found, then update the
> +     * userland group list and return.  If not, then after the loop, create
> +     * the container struct and group list.
> +     */
>   
>       QLIST_FOREACH(bcontainer, &space->containers, next) {
>           container = container_of(bcontainer, VFIOContainer, bcontainer);
> -        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> -            return vfio_container_group_add(container, group, errp);
> +
> +        if (cpr_reused) {
> +            if (!vfio_cpr_container_match(container, group, &fd)) {

Why do we need to modify fd?

> +                continue;
> +            }
> +        } else if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> +            continue;
>           }
> +        return vfio_container_group_add(container, group, cpr_reused, errp);
> +    }
> +
> +    if (!cpr_reused) {
> +        fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
>       }
>   
> -    fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
>       if (fd < 0) {
>           goto fail;
>       }
> @@ -635,7 +666,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>           goto fail;
>       }
>   
> -    container = vfio_create_container(fd, group, errp);
> +    container = vfio_create_container(fd, group, cpr_reused, errp);
>       if (!container) {
>           goto fail;
>       }
> @@ -655,7 +686,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>   
>       vfio_address_space_insert(space, bcontainer);
>   
> -    if (!vfio_container_group_add(container, group, errp)) {
> +    if (!vfio_container_group_add(container, group, cpr_reused, errp)) {
>           goto fail;
>       }
>       group_was_added = true;
> @@ -697,6 +728,7 @@ static void vfio_container_disconnect(VFIOGroup *group)
>   
>       QLIST_REMOVE(group, container_next);
>       group->container = NULL;
> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>   
>       /*
>        * Explicitly release the listener first before unset container,
> @@ -750,7 +782,7 @@ static VFIOGroup *vfio_group_get(int groupid, AddressSpace *as, Error **errp)
>       group = g_malloc0(sizeof(*group));
>   
>       snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
> -    group->fd = qemu_open(path, O_RDWR, errp);
> +    group->fd = cpr_open_fd(path, O_RDWR, "vfio_group", groupid, NULL, errp);
>       if (group->fd < 0) {
>           goto free_group_exit;
>       }
> @@ -782,6 +814,7 @@ static VFIOGroup *vfio_group_get(int groupid, AddressSpace *as, Error **errp)
>       return group;
>   
>   close_fd_exit:
> +    cpr_delete_fd("vfio_group", groupid);
>       close(group->fd);
>   
>   free_group_exit:
> @@ -803,6 +836,7 @@ static void vfio_group_put(VFIOGroup *group)
>       vfio_container_disconnect(group);
>       QLIST_REMOVE(group, next);
>       trace_vfio_group_put(group->fd);
> +    cpr_delete_fd("vfio_group", group->groupid);
>       close(group->fd);
>       g_free(group);
>   }
> @@ -812,8 +846,14 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
>   {
>       g_autofree struct vfio_device_info *info = NULL;
>       int fd;
> +    bool cpr_reused;
> +
> +    fd = cpr_find_fd(name, 0);
> +    cpr_reused = (fd >= 0);
> +    if (!cpr_reused) {
> +        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
> +    }
>   

Could we introduce a helper routine to open this file, like we have
cpr_open_fd()?
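
One possible shape for such a helper, sketched against a toy registry rather
than the real CPR state. Everything prefixed toy_ is hypothetical, and the
callback stands in for whatever obtains a fresh descriptor (for the device
case, that would be the VFIO_GROUP_GET_DEVICE_FD ioctl). In this toy, saving
an existing entry just updates it, so the caller needs no test on reuse
before saving. Whether the real cpr_save_fd() tolerates repeated saves is not
established here.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_FDS 16

/* Toy registry entry keyed by (name, id); not the real CPR state. */
typedef struct ToyFdEntry {
    char name[32];
    int id;
    int fd;
    bool used;
} ToyFdEntry;

ToyFdEntry toy_fds[TOY_MAX_FDS];

/* Return the saved fd for (name, id), or -1 if none was saved. */
int toy_find_fd(const char *name, int id)
{
    for (int i = 0; i < TOY_MAX_FDS; i++) {
        if (toy_fds[i].used && toy_fds[i].id == id &&
            strcmp(toy_fds[i].name, name) == 0) {
            return toy_fds[i].fd;
        }
    }
    return -1;
}

/* Save fd under (name, id).  An existing entry is updated in place. */
void toy_save_fd(const char *name, int id, int fd)
{
    ToyFdEntry *slot = NULL;

    assert(strlen(name) < sizeof(toy_fds[0].name));
    for (int i = 0; i < TOY_MAX_FDS; i++) {
        if (toy_fds[i].used && toy_fds[i].id == id &&
            strcmp(toy_fds[i].name, name) == 0) {
            slot = &toy_fds[i];        /* existing entry wins */
            break;
        }
        if (!toy_fds[i].used && !slot) {
            slot = &toy_fds[i];        /* remember the first free slot */
        }
    }
    assert(slot != NULL);              /* toy table is full otherwise */
    strcpy(slot->name, name);
    slot->id = id;
    slot->fd = fd;
    slot->used = true;
}

/* Callback that produces a new descriptor when none was preserved. */
typedef int (*ToyFdOpener)(void *opaque);

/*
 * Find a preserved fd, or obtain one through opener and save it.
 * *reused (if non-NULL) reports which of the two happened.
 */
int toy_get_fd(const char *name, int id, ToyFdOpener opener, void *opaque,
               bool *reused)
{
    int fd = toy_find_fd(name, id);

    if (reused) {
        *reused = (fd >= 0);
    }
    if (fd < 0) {
        fd = opener(opaque);
        if (fd >= 0) {
            toy_save_fd(name, id, fd);
        }
    }
    return fd;
}

/* Example opener: returns the number stored in opaque instead of a real fd. */
int toy_fixed_opener(void *opaque)
{
    return *(int *)opaque;
}
```

With a helper of this shape, the find, reuse-flag and save steps live in one
place, and the call site shrinks to a single call plus the error check.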


> -    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>       if (fd < 0) {
>           error_setg_errno(errp, errno, "error getting device from group %d",
>                            group->groupid);
> @@ -857,6 +897,10 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
>       vbasedev->group = group;
>       QLIST_INSERT_HEAD(&group->device_list, vbasedev, next);
>   
> +    vbasedev->cpr.reused = cpr_reused;
> +    if (!cpr_reused) {
> +        cpr_save_fd(name, 0, fd);

Could we avoid the test on cpr_reused and always call cpr_save_fd()?

> +    }
>       trace_vfio_device_get(name, info->flags, info->num_regions, info->num_irqs);
>   
>       return true;
> @@ -870,6 +914,7 @@ static void vfio_device_put(VFIODevice *vbasedev)
>       QLIST_REMOVE(vbasedev, next);
>       vbasedev->group = NULL;
>       trace_vfio_device_put(vbasedev->fd);
> +    cpr_delete_fd(vbasedev->name, 0);
>       close(vbasedev->fd);
>   }
>   
> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
> index fac323c..638a8e0 100644
> --- a/hw/vfio/cpr-legacy.c
> +++ b/hw/vfio/cpr-legacy.c
> @@ -10,6 +10,7 @@
>   #include "qemu/osdep.h"
>   #include "hw/vfio/vfio-container.h"
>   #include "hw/vfio/vfio-cpr.h"
> +#include "hw/vfio/vfio-device.h"
>   #include "migration/blocker.h"
>   #include "migration/cpr.h"
>   #include "migration/migration.h"
> @@ -31,10 +32,27 @@ static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>       }
>   }
>   
> +static int vfio_container_post_load(void *opaque, int version_id)
> +{
> +    VFIOContainer *container = opaque;
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +
> +    container->cpr.reused = false;
> +
> +    QLIST_FOREACH(group, &container->group_list, container_next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            vbasedev->cpr.reused = false;
> +        }
> +    }
> +    return 0;
> +}
> +
>   static const VMStateDescription vfio_container_vmstate = {
>       .name = "vfio-container",
>       .version_id = 0,
>       .minimum_version_id = 0,
> +    .post_load = vfio_container_post_load,
>       .needed = cpr_needed_for_reuse,
>       .fields = (VMStateField[]) {
>           VMSTATE_END_OF_LIST()
> @@ -68,3 +86,31 @@ void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
>       migrate_del_blocker(&container->cpr.blocker);
>       vmstate_unregister(NULL, &vfio_container_vmstate, container);
>   }
> +
> +static bool same_device(int fd1, int fd2)
> +{
> +    struct stat st1, st2;
> +
> +    return !fstat(fd1, &st1) && !fstat(fd2, &st2) && st1.st_dev == st2.st_dev;
> +}
> +
> +bool vfio_cpr_container_match(VFIOContainer *container, VFIOGroup *group,
> +                              int *pfd)
> +{
> +    if (container->fd == *pfd) {
> +        return true;
> +    }
> +    if (!same_device(container->fd, *pfd)) {
> +        return false;
> +    }
> +    /*
> +     * Same device, different fd.  This occurs when the container fd is
> +     * cpr_save'd multiple times, once for each groupid, so SCM_RIGHTS
> +     * produces duplicates.  De-dup it.
> +     */
> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
> +    close(*pfd);
> +    cpr_save_fd("vfio_container_for_group", group->groupid, container->fd);
> +    *pfd = container->fd;

I am not sure 'pfd' is used afterwards. Is it?

> +    return true;
> +}
> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
> index f864547..1c4f070 100644
> --- a/include/hw/vfio/vfio-cpr.h
> +++ b/include/hw/vfio/vfio-cpr.h
> @@ -13,10 +13,16 @@
>   
>   typedef struct VFIOContainerCPR {
>       Error *blocker;
> +    bool reused;
>   } VFIOContainerCPR;
>   
> +typedef struct VFIODeviceCPR {
> +    bool reused;
> +} VFIODeviceCPR;
> +
>   struct VFIOContainer;
>   struct VFIOContainerBase;
> +struct VFIOGroup;
>   
>   bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>                                           Error **errp);
> @@ -29,4 +35,7 @@ bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
>                                    Error **errp);
>   void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>   
> +bool vfio_cpr_container_match(struct VFIOContainer *container,
> +                              struct VFIOGroup *group, int *fd);
> +
>   #endif /* HW_VFIO_VFIO_CPR_H */
> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
> index 8bcb3c1..4e4d0b6 100644
> --- a/include/hw/vfio/vfio-device.h
> +++ b/include/hw/vfio/vfio-device.h
> @@ -28,6 +28,7 @@
>   #endif
>   #include "system/system.h"
>   #include "hw/vfio/vfio-container-base.h"
> +#include "hw/vfio/vfio-cpr.h"
>   #include "system/host_iommu_device.h"
>   #include "system/iommufd.h"
>   
> @@ -84,6 +85,7 @@ typedef struct VFIODevice {
>       VFIOIOASHwpt *hwpt;
>       QLIST_ENTRY(VFIODevice) hwpt_next;
>       struct vfio_region_info **reginfo;
> +    VFIODeviceCPR cpr;
>   } VFIODevice;
>   
>   struct VFIODeviceOps {


Thanks,

C.





* Re: [PATCH V3 09/42] vfio/container: discard old DMA vaddr
  2025-05-12 15:32 ` [PATCH V3 09/42] vfio/container: discard old DMA vaddr Steve Sistare
@ 2025-05-15 13:30   ` Cédric Le Goater
  0 siblings, 0 replies; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-15 13:30 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> In the container pre_save handler, discard the virtual addresses in DMA
> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest RAM will be
> remapped at a different VA in new QEMU.  DMA to already-mapped
> pages continues.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Looks OK. Too bad the pre_save() handler doesn't have an
'Error **' parameter.

It shouldn't be too complex to add in vmstate_save_state_v().


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
>   hw/vfio/cpr-legacy.c | 29 +++++++++++++++++++++++++++++
>   1 file changed, 29 insertions(+)
> 
> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
> index 638a8e0..519d772 100644
> --- a/hw/vfio/cpr-legacy.c
> +++ b/hw/vfio/cpr-legacy.c
> @@ -17,6 +17,22 @@
>   #include "migration/vmstate.h"
>   #include "qapi/error.h"
>   
> +static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
> +{
> +    struct vfio_iommu_type1_dma_unmap unmap = {
> +        .argsz = sizeof(unmap),
> +        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
> +        .iova = 0,
> +        .size = 0,
> +    };
> +    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
> +        error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
> +        return false;
> +    }
> +    return true;
> +}
> +
> +
>   static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>   {
>       if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
> @@ -32,6 +48,18 @@ static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>       }
>   }
>   
> +static int vfio_container_pre_save(void *opaque)
> +{
> +    VFIOContainer *container = opaque;
> +    Error *err = NULL;
> +
> +    if (!vfio_dma_unmap_vaddr_all(container, &err)) {
> +        error_report_err(err);
> +        return -1;
> +    }
> +    return 0;
> +}
> +
>   static int vfio_container_post_load(void *opaque, int version_id)
>   {
>       VFIOContainer *container = opaque;
> @@ -52,6 +80,7 @@ static const VMStateDescription vfio_container_vmstate = {
>       .name = "vfio-container",
>       .version_id = 0,
>       .minimum_version_id = 0,
> +    .pre_save = vfio_container_pre_save,
>       .post_load = vfio_container_post_load,
>       .needed = cpr_needed_for_reuse,
>       .fields = (VMStateField[]) {




* Re: [PATCH V3 08/42] vfio/container: export vfio_legacy_dma_map
  2025-05-12 15:32 ` [PATCH V3 08/42] vfio/container: export vfio_legacy_dma_map Steve Sistare
@ 2025-05-15 13:42   ` Cédric Le Goater
  2025-05-15 19:08     ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-15 13:42 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> Export vfio_legacy_dma_map so it may be referenced outside the file
> in a subsequent patch.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   hw/vfio/container.c                   | 4 ++--
>   include/hw/vfio/vfio-container-base.h | 3 +++
>   2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index 278a220..a554683 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -208,8 +208,8 @@ static int vfio_legacy_dma_unmap(const VFIOContainerBase *bcontainer,
>       return ret;
>   }
>   
> -static int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
> -                               ram_addr_t size, void *vaddr, bool readonly)
> +int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
> +                        ram_addr_t size, void *vaddr, bool readonly)
>   {
>       const VFIOContainer *container = container_of(bcontainer, VFIOContainer,
>                                                     bcontainer);
> diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
> index 1dc760f..a2f6c3a 100644
> --- a/include/hw/vfio/vfio-container-base.h
> +++ b/include/hw/vfio/vfio-container-base.h
> @@ -186,4 +186,7 @@ struct VFIOIOMMUClass {
>   VFIORamDiscardListener *vfio_find_ram_discard_listener(
>       VFIOContainerBase *bcontainer, MemoryRegionSection *section);
>   
> +int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
> +                        ram_addr_t size, void *vaddr, bool readonly);
> +
>   #endif /* HW_VFIO_VFIO_CONTAINER_BASE_H */

I don't think this export is necessary. See comment on patch 10.


Thanks,

C.





* Re: [PATCH V3 10/42] vfio/container: restore DMA vaddr
  2025-05-12 15:32 ` [PATCH V3 10/42] vfio/container: restore " Steve Sistare
@ 2025-05-15 13:42   ` Cédric Le Goater
  2025-05-15 19:08     ` Steven Sistare
  2025-05-22  6:37   ` Cédric Le Goater
  1 sibling, 1 reply; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-15 13:42 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> In new QEMU, do not register the memory listener at device creation time.
> Register it later, in the container post_load handler, after all vmstate
> that may affect regions and mapping boundaries has been loaded.  The
> post_load registration will cause the listener to invoke its callback on
> each flat section, and the calls will match the mappings remembered by the
> kernel.
> 
> The listener calls a special dma_map handler that passes the new VA of each
> section to the kernel using VFIO_DMA_MAP_FLAG_VADDR.  Restore the normal
> handler at the end.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   hw/vfio/container.c  | 15 +++++++++++++--
>   hw/vfio/cpr-legacy.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 61 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index a554683..0e02726 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -137,6 +137,8 @@ static int vfio_legacy_dma_unmap_one(const VFIOContainerBase *bcontainer,
>       int ret;
>       Error *local_err = NULL;
>   
> +    assert(!container->cpr.reused);
> +
>       if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer)) {
>           if (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
>               bcontainer->dirty_pages_supported) {
> @@ -691,8 +693,17 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>       }
>       group_was_added = true;
>   
> -    if (!vfio_listener_register(bcontainer, errp)) {
> -        goto fail;
> +    /*
> +     * If reused, register the listener later, after all state that may
> +     * affect regions and mapping boundaries has been cpr load'ed.  Later,
> +     * the listener will invoke its callback on each flat section and call
> +     * dma_map to supply the new vaddr, and the calls will match the mappings
> +     * remembered by the kernel.
> +     */
> +    if (!cpr_reused) {
> +        if (!vfio_listener_register(bcontainer, errp)) {
> +            goto fail;
> +        }

hmm, I am starting to think we should have a vfio_cpr_container_connect
routine too.


>       }
>   
>       bcontainer->initialized = true;
> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
> index 519d772..bbcf71e 100644
> --- a/hw/vfio/cpr-legacy.c
> +++ b/hw/vfio/cpr-legacy.c
> @@ -11,11 +11,13 @@
>   #include "hw/vfio/vfio-container.h"
>   #include "hw/vfio/vfio-cpr.h"
>   #include "hw/vfio/vfio-device.h"
> +#include "hw/vfio/vfio-listener.h"
>   #include "migration/blocker.h"
>   #include "migration/cpr.h"
>   #include "migration/migration.h"
>   #include "migration/vmstate.h"
>   #include "qapi/error.h"
> +#include "qemu/error-report.h"
>   
>   static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>   {
> @@ -32,6 +34,34 @@ static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>       return true;
>   }
>   
> +/*
> + * Set the new @vaddr for any mappings registered during cpr load.
> + * Reused is cleared thereafter.
> + */
> +static int vfio_legacy_cpr_dma_map(const VFIOContainerBase *bcontainer,
> +                                   hwaddr iova, ram_addr_t size, void *vaddr,
> +                                   bool readonly)
> +{
> +    const VFIOContainer *container = container_of(bcontainer, VFIOContainer,
> +                                                  bcontainer);
> +    struct vfio_iommu_type1_dma_map map = {
> +        .argsz = sizeof(map),
> +        .flags = VFIO_DMA_MAP_FLAG_VADDR,
> +        .vaddr = (__u64)(uintptr_t)vaddr,
> +        .iova = iova,
> +        .size = size,
> +    };
> +
> +    assert(container->cpr.reused);
> +
> +    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
> +        error_report("vfio_legacy_cpr_dma_map (iova %lu, size %ld, va %p): %s",
> +                     iova, size, vaddr, strerror(errno));

Callers should also report the error. No need to do it here.

> +        return -errno;
> +    }
> +
> +    return 0;
> +}
>   
>   static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>   {
> @@ -63,12 +93,24 @@ static int vfio_container_pre_save(void *opaque)
>   static int vfio_container_post_load(void *opaque, int version_id)
>   {
>       VFIOContainer *container = opaque;
> +    VFIOContainerBase *bcontainer = &container->bcontainer;
>       VFIOGroup *group;
>       VFIODevice *vbasedev;
> +    Error *err = NULL;
> +
> +    if (!vfio_listener_register(bcontainer, &err)) {
> +        error_report_err(err);
> +        return -1;
> +    }
>   
>       container->cpr.reused = false;
>   
>       QLIST_FOREACH(group, &container->group_list, container_next) {
> +        VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
> +
> +        /* Restore original dma_map function */
> +        vioc->dma_map = vfio_legacy_dma_map;
> +
>           QLIST_FOREACH(vbasedev, &group->device_list, next) {
>               vbasedev->cpr.reused = false;
>           }
> @@ -80,6 +122,7 @@ static const VMStateDescription vfio_container_vmstate = {
>       .name = "vfio-container",
>       .version_id = 0,
>       .minimum_version_id = 0,
> +    .priority = MIG_PRI_LOW,  /* Must happen after devices and groups */
>       .pre_save = vfio_container_pre_save,
>       .post_load = vfio_container_post_load,
>       .needed = cpr_needed_for_reuse,
> @@ -104,6 +147,11 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
>   
>       vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>   
> +    /* During incoming CPR, divert calls to dma_map. */
> +    if (container->cpr.reused) {
> +        VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
> +        vioc->dma_map = vfio_legacy_cpr_dma_map;

You could back up the previous dma_map() handler in a static variable or,
better, under container->cpr.


Thanks,

C.




> +    }
>       return true;
>   }
>   




* Re: [PATCH V3 26/42] vfio: return mr from vfio_get_xlat_addr
  2025-05-12 20:51   ` John Levon
  2025-05-14 17:03     ` Cédric Le Goater
@ 2025-05-15 17:24     ` Steven Sistare
  1 sibling, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-15 17:24 UTC (permalink / raw)
  To: John Levon
  Cc: qemu-devel, Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas

On 5/12/2025 4:51 PM, John Levon wrote:
> On Mon, May 12, 2025 at 08:32:37AM -0700, Steve Sistare wrote:
> 
>> Modify memory_get_xlat_addr and vfio_get_xlat_addr to return the memory
>> region that the translated address is found in.  This will be needed by
>> CPR in a subsequent patch to map blocks using IOMMU_IOAS_MAP_FILE.
>>
>> Also return the xlat offset, so we can simplify the interface by removing
>> the out parameters that can be trivially derived from mr and xlat.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> Steve, would you consider splitting this out from the full CPR series and
> submitting as a standalone, as we both have a dependency on doing this, and your
> patch seems much nicer than the current one in vfio-user series?

Sure.  I just returned from vacation, and I see you submitted it.

- Steve



* Re: [PATCH V3 06/42] vfio/container: register container for cpr
  2025-05-15  7:54   ` Cédric Le Goater
@ 2025-05-15 19:06     ` Steven Sistare
  2025-05-16 16:20       ` Cédric Le Goater
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-15 19:06 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/15/2025 3:54 AM, Cédric Le Goater wrote:
> On 5/12/25 17:32, Steve Sistare wrote:
>> Register a legacy container for cpr-transfer, replacing the generic CPR
>> register call with a more specific legacy container register call.  Add a
>> blocker if the kernel does not support VFIO_UPDATE_VADDR or VFIO_UNMAP_ALL.
>>
>> This is mostly boilerplate.  The fields to be saved and restored are added
>> in subsequent patches.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   hw/vfio/container.c              |  6 ++--
>>   hw/vfio/cpr-legacy.c             | 70 ++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/cpr.c                    |  5 ++-
>>   hw/vfio/meson.build              |  1 +
>>   include/hw/vfio/vfio-container.h |  2 ++
>>   include/hw/vfio/vfio-cpr.h       | 14 ++++++++
>>   6 files changed, 92 insertions(+), 6 deletions(-)
>>   create mode 100644 hw/vfio/cpr-legacy.c
>>
>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>> index eb56f00..85c76da 100644
>> --- a/hw/vfio/container.c
>> +++ b/hw/vfio/container.c
>> @@ -642,7 +642,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>       new_container = true;
>>       bcontainer = &container->bcontainer;
>> -    if (!vfio_cpr_register_container(bcontainer, errp)) {
>> +    if (!vfio_legacy_cpr_register_container(container, errp)) {
>>           goto fail;
>>       }
>> @@ -678,7 +678,7 @@ fail:
>>           vioc->release(bcontainer);
>>       }
>>       if (new_container) {
>> -        vfio_cpr_unregister_container(bcontainer);
>> +        vfio_legacy_cpr_unregister_container(container);
>>           object_unref(container);
>>       }
>>       if (fd >= 0) {
>> @@ -719,7 +719,7 @@ static void vfio_container_disconnect(VFIOGroup *group)
>>           VFIOAddressSpace *space = bcontainer->space;
>>           trace_vfio_container_disconnect(container->fd);
>> -        vfio_cpr_unregister_container(bcontainer);
>> +        vfio_legacy_cpr_unregister_container(container);
>>           close(container->fd);
>>           object_unref(container);
>> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
>> new file mode 100644
>> index 0000000..fac323c
>> --- /dev/null
>> +++ b/hw/vfio/cpr-legacy.c
>> @@ -0,0 +1,70 @@
>> +/*
>> + * Copyright (c) 2021-2025 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
> 
> Please add a SPDX-License-Identifier tag.

Sure.  I'll do the same for my other new files.

>> + */
>> +
>> +#include <sys/ioctl.h>
>> +#include <linux/vfio.h>
>> +#include "qemu/osdep.h"
>> +#include "hw/vfio/vfio-container.h"
>> +#include "hw/vfio/vfio-cpr.h"
>> +#include "migration/blocker.h"
>> +#include "migration/cpr.h"
>> +#include "migration/migration.h"
>> +#include "migration/vmstate.h"
>> +#include "qapi/error.h"
>> +
>> +static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>> +{
>> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
>> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR");
>> +        return false;
>> +
>> +    } else if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
>> +        error_setg(errp, "VFIO container does not support VFIO_UNMAP_ALL");
>> +        return false;
>> +
>> +    } else {
>> +        return true;
>> +    }
>> +}
>> +
>> +static const VMStateDescription vfio_container_vmstate = {
>> +    .name = "vfio-container",
>> +    .version_id = 0,
>> +    .minimum_version_id = 0,
>> +    .needed = cpr_needed_for_reuse,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_END_OF_LIST()
>> +    }
>> +};
>> +
>> +bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
>> +{
>> +    VFIOContainerBase *bcontainer = &container->bcontainer;
>> +    Error **cpr_blocker = &container->cpr.blocker;
>> +
>> +    migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
>> +                                vfio_cpr_reboot_notifier,
>> +                                MIG_MODE_CPR_REBOOT);
>> +
>> +    if (!vfio_cpr_supported(container, cpr_blocker)) {
>> +        return migrate_add_blocker_modes(cpr_blocker, errp,
>> +                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
>> +    }
>> +
>> +    vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>> +
>> +    return true;
>> +}
>> +
>> +void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
>> +{
>> +    VFIOContainerBase *bcontainer = &container->bcontainer;
>> +
>> +    migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
>> +    migrate_del_blocker(&container->cpr.blocker);
>> +    vmstate_unregister(NULL, &vfio_container_vmstate, container);
>> +}
>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>> index 0210e76..0e59612 100644
>> --- a/hw/vfio/cpr.c
>> +++ b/hw/vfio/cpr.c
>> @@ -7,13 +7,12 @@
>>   #include "qemu/osdep.h"
>>   #include "hw/vfio/vfio-device.h"
>> -#include "migration/misc.h"
>>   #include "hw/vfio/vfio-cpr.h"
>>   #include "qapi/error.h"
>>   #include "system/runstate.h"
>> -static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
>> -                                    MigrationEvent *e, Error **errp)
>> +int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
>> +                             MigrationEvent *e, Error **errp)
>>   {
>>       if (e->type == MIG_EVENT_PRECOPY_SETUP &&
>>           !runstate_check(RUN_STATE_SUSPENDED) && !vm_get_suspended()) {
>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>> index bccb050..73d29f9 100644
>> --- a/hw/vfio/meson.build
>> +++ b/hw/vfio/meson.build
>> @@ -21,6 +21,7 @@ system_ss.add(when: 'CONFIG_VFIO_XGMAC', if_true: files('calxeda-xgmac.c'))
>>   system_ss.add(when: 'CONFIG_VFIO_AMD_XGBE', if_true: files('amd-xgbe.c'))
>>   system_ss.add(when: 'CONFIG_VFIO', if_true: files(
>>     'cpr.c',
>> +  'cpr-legacy.c',
>>     'device.c',
>>     'migration.c',
>>     'migration-multifd.c',
>> diff --git a/include/hw/vfio/vfio-container.h b/include/hw/vfio/vfio-container.h
>> index afc498d..21e5807 100644
>> --- a/include/hw/vfio/vfio-container.h
>> +++ b/include/hw/vfio/vfio-container.h
>> @@ -10,6 +10,7 @@
>>   #define HW_VFIO_CONTAINER_H
>>   #include "hw/vfio/vfio-container-base.h"
>> +#include "hw/vfio/vfio-cpr.h"
>>   typedef struct VFIOContainer VFIOContainer;
>>   typedef struct VFIODevice VFIODevice;
>> @@ -29,6 +30,7 @@ typedef struct VFIOContainer {
>>       int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>>       unsigned iommu_type;
>>       QLIST_HEAD(, VFIOGroup) group_list;
>> +    VFIOContainerCPR cpr;
>>   } VFIOContainer;
>>   OBJECT_DECLARE_SIMPLE_TYPE(VFIOContainer, VFIO_IOMMU_LEGACY);
>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>> index 750ea5b..f864547 100644
>> --- a/include/hw/vfio/vfio-cpr.h
>> +++ b/include/hw/vfio/vfio-cpr.h
>> @@ -9,8 +9,22 @@
>>   #ifndef HW_VFIO_VFIO_CPR_H
>>   #define HW_VFIO_VFIO_CPR_H
>> +#include "migration/misc.h"
>> +
>> +typedef struct VFIOContainerCPR {
>> +    Error *blocker;
>> +} VFIOContainerCPR;
>> +
>> +struct VFIOContainer;
>>   struct VFIOContainerBase;
>> +bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>> +                                        Error **errp);
>> +void vfio_legacy_cpr_unregister_container(struct VFIOContainer *container);
>> +
>> +int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
>> +                             Error **errp);
>> +
>>   bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
>>                                    Error **errp);
>>   void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
> 
> What about vfio_cpr_un/register_container? Shouldn't we remove them?

At this patch in the series, those are still used by iommufd containers.
Those uses are removed in "vfio/iommufd: register container for cpr", and
vfio_cpr_un/register_container are deleted by the last patch in the series.

- Steve




* Re: [PATCH V3 07/42] vfio/container: preserve descriptors
  2025-05-15 12:59   ` Cédric Le Goater
@ 2025-05-15 19:08     ` Steven Sistare
  2025-05-19 13:20       ` Cédric Le Goater
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-15 19:08 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/15/2025 8:59 AM, Cédric Le Goater wrote:
> On 5/12/25 17:32, Steve Sistare wrote:
>> At vfio creation time, save the value of vfio container, group, and device
>> descriptors in CPR state.  On qemu restart, vfio_realize() finds and uses
>> the saved descriptors, and remembers the reused status for subsequent
>> patches.  The reused status is cleared when vmstate load finishes.
>>
>> During reuse, device and iommu state is already configured, so operations
>> in vfio_realize that would modify the configuration, such as vfio ioctl's,
>> are skipped.  The result is that vfio_realize constructs qemu data
>> structures that reflect the current state of the device.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   hw/vfio/container.c           | 65 ++++++++++++++++++++++++++++++++++++-------
>>   hw/vfio/cpr-legacy.c          | 46 ++++++++++++++++++++++++++++++
>>   include/hw/vfio/vfio-cpr.h    |  9 ++++++
>>   include/hw/vfio/vfio-device.h |  2 ++
>>   4 files changed, 112 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>> index 85c76da..278a220 100644
>> --- a/hw/vfio/container.c
>> +++ b/hw/vfio/container.c
>> @@ -31,6 +31,8 @@
>>   #include "system/reset.h"
>>   #include "trace.h"
>>   #include "qapi/error.h"
>> +#include "migration/cpr.h"
>> +#include "migration/blocker.h"
>>   #include "pci.h"
>>   #include "hw/vfio/vfio-container.h"
>>   #include "hw/vfio/vfio-cpr.h"
>> @@ -414,7 +416,7 @@ static bool vfio_set_iommu(int container_fd, int group_fd,
>>   }
>>   static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
>> -                                            Error **errp)
>> +                                            bool cpr_reused, Error **errp)
>>   {
>>       int iommu_type;
>>       const char *vioc_name;
>> @@ -425,7 +427,11 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
>>           return NULL;
>>       }
>> -    if (!vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
>> +    /*
>> +     * If container is reused, just set its type and skip the ioctls, as the
>> +     * container and group are already configured in the kernel.
>> +     */
>> +    if (!cpr_reused && !vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
>>           return NULL;
>>       }
>> @@ -433,6 +439,7 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
>>       container = VFIO_IOMMU_LEGACY(object_new(vioc_name));
>>       container->fd = fd;
>> +    container->cpr.reused = cpr_reused;
>>       container->iommu_type = iommu_type;
>>       return container;
>>   }
>> @@ -584,7 +591,7 @@ static bool vfio_container_attach_discard_disable(VFIOContainer *container,
>>   }
>>   static bool vfio_container_group_add(VFIOContainer *container, VFIOGroup *group,
>> -                                     Error **errp)
>> +                                     bool cpr_reused, Error **errp)
>>   {
>>       if (!vfio_container_attach_discard_disable(container, group, errp)) {
>>           return false;
>> @@ -592,6 +599,9 @@ static bool vfio_container_group_add(VFIOContainer *container, VFIOGroup *group,
>>       group->container = container;
>>       QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>>       vfio_group_add_kvm_device(group);
>> +    if (!cpr_reused) {
>> +        cpr_save_fd("vfio_container_for_group", group->groupid, container->fd);
>> +    }
> 
> Could we avoid the test on cpr_reused and always call cpr_save_fd()?

No.  If cpr_reused is true, then the fd is already on cpr's save list.
We don't want to save duplicates of the same entry.
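
As an illustration of the point above, here is a minimal sketch of a
save list that does a plain head insert with no check for an existing
(name, id) entry, which is what the reply implies. All sketch_* names
and the SketchFd type are hypothetical stand-ins, not the functions in
migration/cpr.c. It shows why the caller has to guard the save when the
fd was inherited from old QEMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a CPR fd list; not QEMU's implementation. */
typedef struct SketchFd {
    char name[40];
    int id;
    int fd;
    struct SketchFd *next;
} SketchFd;

static SketchFd *sketch_fds;

/* Plain insert at the head: no check for an existing (name, id) entry. */
static void sketch_save_fd(const char *name, int id, int fd)
{
    SketchFd *e = calloc(1, sizeof(*e));

    assert(e);
    snprintf(e->name, sizeof(e->name), "%s", name);
    e->id = id;
    e->fd = fd;
    e->next = sketch_fds;
    sketch_fds = e;
}

/* Return the saved fd for (name, id), or -1 if there is none. */
static int sketch_find_fd(const char *name, int id)
{
    for (SketchFd *e = sketch_fds; e; e = e->next) {
        if (e->id == id && !strcmp(e->name, name)) {
            return e->fd;
        }
    }
    return -1;
}

/* Count entries for (name, id); more than one means a duplicate. */
static int sketch_count_fd(const char *name, int id)
{
    int n = 0;

    for (SketchFd *e = sketch_fds; e; e = e->next) {
        n += (e->id == id && !strcmp(e->name, name));
    }
    return n;
}

/* Caller-side guard, mirroring the "if (!cpr_reused)" test in the patch. */
static void sketch_group_add(int groupid, int container_fd, bool reused)
{
    if (!reused) {
        sketch_save_fd("vfio_container_for_group", groupid, container_fd);
    }
}
```

Moving the de-dup into the save routine itself would be the alternative
design; the sketch keeps the guard in the caller only because that is
what the patch does.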

>>       return true;
>>   }
>> @@ -601,6 +611,7 @@ static void vfio_container_group_del(VFIOContainer *container, VFIOGroup *group)
>>       group->container = NULL;
>>       vfio_group_del_kvm_device(group);
>>       vfio_ram_block_discard_disable(container, false);
>> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>>   }
>>   static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>> @@ -613,17 +624,37 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>       VFIOIOMMUClass *vioc = NULL;
>>       bool new_container = false;
>>       bool group_was_added = false;
>> +    bool cpr_reused;
>>       space = vfio_address_space_get(as);
>> +    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
>> +    cpr_reused = (fd > 0);
> 
> 
> The code above is doing two things: it grabs a restored fd and
> deduces from the fd value that the VM is doing a CPR
> reboot.
> 
> Instead of adding this cpr_reused flag, I would prefer to duplicate
> the code into something like:
> 
> if (!cpr_reboot) {
>     QLIST_FOREACH(bcontainer, &space->containers, next) {
>          container = container_of(bcontainer, VFIOContainer, bcontainer);
>          if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>              return vfio_container_group_add(container, group, errp);
>          }
>      }
> 
>      fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
>      if (fd < 0) {
>          goto fail;
>      }
> 
>      ret = ioctl(fd, VFIO_GET_API_VERSION);
>      if (ret != VFIO_API_VERSION) {
>          error_setg(errp, "supported vfio version: %d, "
>                     "reported version: %d", VFIO_API_VERSION, ret);
>          goto fail;
>      }
> 
>      container = vfio_create_container(fd, group, errp);
> } else {
>     /* ... */
> }
> 

OK, but there is no sense in duplicating the identical code for
VFIO_GET_API_VERSION and vfio_create_container.  If you want me to
simplify the loop, I suggest:

if (!cpr_reused) {
     QLIST_FOREACH(bcontainer, &space->containers, next) {
          container = container_of(bcontainer, VFIOContainer, bcontainer);
          if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
              return vfio_container_group_add(container, group, false, errp);
          }
      }

      fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
      if (fd < 0) {
          goto fail;
      }
} else {
     QLIST_FOREACH(bcontainer, &space->containers, next) {
         container = container_of(bcontainer, VFIOContainer, bcontainer);
         if (vfio_cpr_container_match(container, group, &fd)) {
             return vfio_container_group_add(container, group, true, errp);
         }
     }
}

ret = ioctl(fd, VFIO_GET_API_VERSION);
...

>> +    /*
>> +     * If the container is reused, then the group is already attached in the
>> +     * kernel.  If a container with matching fd is found, then update the
>> +     * userland group list and return.  If not, then after the loop, create
>> +     * the container struct and group list.
>> +     */
>>       QLIST_FOREACH(bcontainer, &space->containers, next) {
>>           container = container_of(bcontainer, VFIOContainer, bcontainer);
>> -        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>> -            return vfio_container_group_add(container, group, errp);
>> +
>> +        if (cpr_reused) {
>> +            if (!vfio_cpr_container_match(container, group, &fd)) {
> 
> why do we need to modify fd ?

That is explained by the comments inside vfio_cpr_container_match, where the
explanation is more easily understood.

>> +                continue;
>> +            }
>> +        } else if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>> +            continue;
>>           }
>> +        return vfio_container_group_add(container, group, cpr_reused, errp);
>> +    }
>> +
>> +    if (!cpr_reused) {
>> +        fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
>>       }
>> -    fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
>>       if (fd < 0) {
>>           goto fail;
>>       }
>> @@ -635,7 +666,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>           goto fail;
>>       }
>> -    container = vfio_create_container(fd, group, errp);
>> +    container = vfio_create_container(fd, group, cpr_reused, errp);
>>       if (!container) {
>>           goto fail;
>>       }
>> @@ -655,7 +686,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>       vfio_address_space_insert(space, bcontainer);
>> -    if (!vfio_container_group_add(container, group, errp)) {
>> +    if (!vfio_container_group_add(container, group, cpr_reused, errp)) {
>>           goto fail;
>>       }
>>       group_was_added = true;
>> @@ -697,6 +728,7 @@ static void vfio_container_disconnect(VFIOGroup *group)
>>       QLIST_REMOVE(group, container_next);
>>       group->container = NULL;
>> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>>       /*
>>        * Explicitly release the listener first before unset container,
>> @@ -750,7 +782,7 @@ static VFIOGroup *vfio_group_get(int groupid, AddressSpace *as, Error **errp)
>>       group = g_malloc0(sizeof(*group));
>>       snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
>> -    group->fd = qemu_open(path, O_RDWR, errp);
>> +    group->fd = cpr_open_fd(path, O_RDWR, "vfio_group", groupid, NULL, errp);
>>       if (group->fd < 0) {
>>           goto free_group_exit;
>>       }
>> @@ -782,6 +814,7 @@ static VFIOGroup *vfio_group_get(int groupid, AddressSpace *as, Error **errp)
>>       return group;
>>   close_fd_exit:
>> +    cpr_delete_fd("vfio_group", groupid);
>>       close(group->fd);
>>   free_group_exit:
>> @@ -803,6 +836,7 @@ static void vfio_group_put(VFIOGroup *group)
>>       vfio_container_disconnect(group);
>>       QLIST_REMOVE(group, next);
>>       trace_vfio_group_put(group->fd);
>> +    cpr_delete_fd("vfio_group", group->groupid);
>>       close(group->fd);
>>       g_free(group);
>>   }
>> @@ -812,8 +846,14 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
>>   {
>>       g_autofree struct vfio_device_info *info = NULL;
>>       int fd;
>> +    bool cpr_reused;
>> +
>> +    fd = cpr_find_fd(name, 0);
>> +    cpr_reused = (fd >= 0);
>> +    if (!cpr_reused) {
>> +        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>> +    }
> 
> Could we introduce a helper routine to open this file, like we have
> cpr_open_fd() ?

OK, but this would be the only use of the helper, and it would bury
generic vfio functionality -- VFIO_GROUP_GET_DEVICE_FD -- inside a cpr
flavored helper.  IMO not an improvement.

>> -    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>>       if (fd < 0) {
>>           error_setg_errno(errp, errno, "error getting device from group %d",
>>                            group->groupid);
>> @@ -857,6 +897,10 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
>>       vbasedev->group = group;
>>       QLIST_INSERT_HEAD(&group->device_list, vbasedev, next);
>> +    vbasedev->cpr.reused = cpr_reused;
>> +    if (!cpr_reused) {
>> +        cpr_save_fd(name, 0, fd);
> 
> Could we avoid the test on cpr_reused and always call cpr_save_fd() ?

No.  Must avoid adding duplicate entries.
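
To illustrate the duplicate-entry concern, here is a minimal sketch.  It
assumes a registry that appends unconditionally on save; the toy_* names
and the fixed-size table are hypothetical and are not QEMU's cpr_save_fd
implementation.  The lookup-then-save shape mirrors the hunk above: only a
newly opened descriptor is saved, so a reused one keeps a single entry.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical toy registry keyed by (name, id); not QEMU's CPR state. */
#define TOY_MAX 16

struct toy_entry {
    const char *name;
    int id;
    int fd;
};

static struct toy_entry toy_tab[TOY_MAX];
static int toy_count;

/* Appends unconditionally, so calling it twice creates two entries. */
static void toy_save_fd(const char *name, int id, int fd)
{
    assert(toy_count < TOY_MAX);
    toy_tab[toy_count++] = (struct toy_entry){ name, id, fd };
}

/* Returns the first matching fd, or -1 if there is none. */
static int toy_find_fd(const char *name, int id)
{
    for (int i = 0; i < toy_count; i++) {
        if (toy_tab[i].id == id && !strcmp(toy_tab[i].name, name)) {
            return toy_tab[i].fd;
        }
    }
    return -1;
}

/* Counts entries for a key, to make duplicates visible. */
static int toy_count_entries(const char *name, int id)
{
    int n = 0;

    for (int i = 0; i < toy_count; i++) {
        if (toy_tab[i].id == id && !strcmp(toy_tab[i].name, name)) {
            n++;
        }
    }
    return n;
}

/* Look up first; save only a descriptor that was not already present. */
static int toy_get_fd(const char *name, int id, int fresh_fd)
{
    int fd = toy_find_fd(name, id);
    int reused = (fd >= 0);

    if (!reused) {
        fd = fresh_fd;
        toy_save_fd(name, id, fd);
    }
    return fd;
}
```

In this sketch, dropping the "reused" test would add a second entry for
the same key on the reuse path, and a single later delete would leave one
stale entry behind.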

>> +    }
>>       trace_vfio_device_get(name, info->flags, info->num_regions, info->num_irqs);
>>       return true;
>> @@ -870,6 +914,7 @@ static void vfio_device_put(VFIODevice *vbasedev)
>>       QLIST_REMOVE(vbasedev, next);
>>       vbasedev->group = NULL;
>>       trace_vfio_device_put(vbasedev->fd);
>> +    cpr_delete_fd(vbasedev->name, 0);
>>       close(vbasedev->fd);
>>   }
>> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
>> index fac323c..638a8e0 100644
>> --- a/hw/vfio/cpr-legacy.c
>> +++ b/hw/vfio/cpr-legacy.c
>> @@ -10,6 +10,7 @@
>>   #include "qemu/osdep.h"
>>   #include "hw/vfio/vfio-container.h"
>>   #include "hw/vfio/vfio-cpr.h"
>> +#include "hw/vfio/vfio-device.h"
>>   #include "migration/blocker.h"
>>   #include "migration/cpr.h"
>>   #include "migration/migration.h"
>> @@ -31,10 +32,27 @@ static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>>       }
>>   }
>> +static int vfio_container_post_load(void *opaque, int version_id)
>> +{
>> +    VFIOContainer *container = opaque;
>> +    VFIOGroup *group;
>> +    VFIODevice *vbasedev;
>> +
>> +    container->cpr.reused = false;
>> +
>> +    QLIST_FOREACH(group, &container->group_list, container_next) {
>> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> +            vbasedev->cpr.reused = false;
>> +        }
>> +    }
>> +    return 0;
>> +}
>> +
>>   static const VMStateDescription vfio_container_vmstate = {
>>       .name = "vfio-container",
>>       .version_id = 0,
>>       .minimum_version_id = 0,
>> +    .post_load = vfio_container_post_load,
>>       .needed = cpr_needed_for_reuse,
>>       .fields = (VMStateField[]) {
>>           VMSTATE_END_OF_LIST()
>> @@ -68,3 +86,31 @@ void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
>>       migrate_del_blocker(&container->cpr.blocker);
>>       vmstate_unregister(NULL, &vfio_container_vmstate, container);
>>   }
>> +
>> +static bool same_device(int fd1, int fd2)
>> +{
>> +    struct stat st1, st2;
>> +
>> +    return !fstat(fd1, &st1) && !fstat(fd2, &st2) && st1.st_dev == st2.st_dev;
>> +}
>> +
>> +bool vfio_cpr_container_match(VFIOContainer *container, VFIOGroup *group,
>> +                              int *pfd)
>> +{
>> +    if (container->fd == *pfd) {
>> +        return true;
>> +    }
>> +    if (!same_device(container->fd, *pfd)) {
>> +        return false;
>> +    }
>> +    /*
>> +     * Same device, different fd.  This occurs when the container fd is
>> +     * cpr_save'd multiple times, once for each groupid, so SCM_RIGHTS
>> +     * produces duplicates.  De-dup it.
>> +     */
>> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>> +    close(*pfd);
>> +    cpr_save_fd("vfio_container_for_group", group->groupid, container->fd);
>> +    *pfd = container->fd;
> 
> I am not sure 'pfd' is used afterwards. Is it ?

True, good eye.  I will change it to "int fd" and stop returning the new value.
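
For reference, a rough sketch of the de-dup idea with the fd passed by
value.  The toy_* names are hypothetical.  Note that the patch compares
only st_dev; the additional st_ino comparison below is an assumption of
this sketch (a stricter variant), and dup() merely stands in for the
duplicates that SCM_RIGHTS can produce.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* True if both descriptors refer to the same file (device and inode). */
static bool toy_same_file(int fd1, int fd2)
{
    struct stat st1, st2;

    return !fstat(fd1, &st1) && !fstat(fd2, &st2) &&
           st1.st_dev == st2.st_dev && st1.st_ino == st2.st_ino;
}

/*
 * If candidate is a second descriptor for the file behind kept, close it
 * and return kept.  Otherwise return candidate unchanged.
 */
static int toy_dedup_fd(int kept, int candidate)
{
    if (candidate == kept) {
        return kept;
    }
    if (!toy_same_file(kept, candidate)) {
        return candidate;
    }
    close(candidate);
    return kept;
}
```

Returning the surviving descriptor is one option; if no caller needs it,
the function can return a plain bool as in the patch.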

- Steve

> 
>> +    return true;
>> +}
>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>> index f864547..1c4f070 100644
>> --- a/include/hw/vfio/vfio-cpr.h
>> +++ b/include/hw/vfio/vfio-cpr.h
>> @@ -13,10 +13,16 @@
>>   typedef struct VFIOContainerCPR {
>>       Error *blocker;
>> +    bool reused;
>>   } VFIOContainerCPR;
>> +typedef struct VFIODeviceCPR {
>> +    bool reused;
>> +} VFIODeviceCPR;
>> +
>>   struct VFIOContainer;
>>   struct VFIOContainerBase;
>> +struct VFIOGroup;
>>   bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>>                                           Error **errp);
>> @@ -29,4 +35,7 @@ bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
>>                                    Error **errp);
>>   void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>> +bool vfio_cpr_container_match(struct VFIOContainer *container,
>> +                              struct VFIOGroup *group, int *fd);
>> +
>>   #endif /* HW_VFIO_VFIO_CPR_H */
>> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
>> index 8bcb3c1..4e4d0b6 100644
>> --- a/include/hw/vfio/vfio-device.h
>> +++ b/include/hw/vfio/vfio-device.h
>> @@ -28,6 +28,7 @@
>>   #endif
>>   #include "system/system.h"
>>   #include "hw/vfio/vfio-container-base.h"
>> +#include "hw/vfio/vfio-cpr.h"
>>   #include "system/host_iommu_device.h"
>>   #include "system/iommufd.h"
>> @@ -84,6 +85,7 @@ typedef struct VFIODevice {
>>       VFIOIOASHwpt *hwpt;
>>       QLIST_ENTRY(VFIODevice) hwpt_next;
>>       struct vfio_region_info **reginfo;
>> +    VFIODeviceCPR cpr;
>>   } VFIODevice;
>>   struct VFIODeviceOps {
> 
> 
> Thanks,
> 
> C.
> 
> 




* Re: [PATCH V3 08/42] vfio/container: export vfio_legacy_dma_map
  2025-05-15 13:42   ` Cédric Le Goater
@ 2025-05-15 19:08     ` Steven Sistare
  0 siblings, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-15 19:08 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/15/2025 9:42 AM, Cédric Le Goater wrote:
> On 5/12/25 17:32, Steve Sistare wrote:
>> Export vfio_legacy_dma_map so it may be referenced outside the file
>> in a subsequent patch.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   hw/vfio/container.c                   | 4 ++--
>>   include/hw/vfio/vfio-container-base.h | 3 +++
>>   2 files changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>> index 278a220..a554683 100644
>> --- a/hw/vfio/container.c
>> +++ b/hw/vfio/container.c
>> @@ -208,8 +208,8 @@ static int vfio_legacy_dma_unmap(const VFIOContainerBase *bcontainer,
>>       return ret;
>>   }
>> -static int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
>> -                               ram_addr_t size, void *vaddr, bool readonly)
>> +int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
>> +                        ram_addr_t size, void *vaddr, bool readonly)
>>   {
>>       const VFIOContainer *container = container_of(bcontainer, VFIOContainer,
>>                                                     bcontainer);
>> diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
>> index 1dc760f..a2f6c3a 100644
>> --- a/include/hw/vfio/vfio-container-base.h
>> +++ b/include/hw/vfio/vfio-container-base.h
>> @@ -186,4 +186,7 @@ struct VFIOIOMMUClass {
>>   VFIORamDiscardListener *vfio_find_ram_discard_listener(
>>       VFIOContainerBase *bcontainer, MemoryRegionSection *section);
>> +int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
>> +                        ram_addr_t size, void *vaddr, bool readonly);
>> +
>>   #endif /* HW_VFIO_VFIO_CONTAINER_BASE_H */
> 
> I don't think this export is necessary. See comment on patch 10.

OK, I will drop this patch - steve




* Re: [PATCH V3 10/42] vfio/container: restore DMA vaddr
  2025-05-15 13:42   ` Cédric Le Goater
@ 2025-05-15 19:08     ` Steven Sistare
  2025-05-19 13:32       ` Cédric Le Goater
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-15 19:08 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/15/2025 9:42 AM, Cédric Le Goater wrote:
> On 5/12/25 17:32, Steve Sistare wrote:
>> In new QEMU, do not register the memory listener at device creation time.
>> Register it later, in the container post_load handler, after all vmstate
>> that may affect regions and mapping boundaries has been loaded.  The
>> post_load registration will cause the listener to invoke its callback on
>> each flat section, and the calls will match the mappings remembered by the
>> kernel.
>>
>> The listener calls a special dma_map handler that passes the new VA of each
>> section to the kernel using VFIO_DMA_MAP_FLAG_VADDR.  Restore the normal
>> handler at the end.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   hw/vfio/container.c  | 15 +++++++++++++--
>>   hw/vfio/cpr-legacy.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 61 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>> index a554683..0e02726 100644
>> --- a/hw/vfio/container.c
>> +++ b/hw/vfio/container.c
>> @@ -137,6 +137,8 @@ static int vfio_legacy_dma_unmap_one(const VFIOContainerBase *bcontainer,
>>       int ret;
>>       Error *local_err = NULL;
>> +    assert(!container->cpr.reused);
>> +
>>       if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer)) {
>>           if (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
>>               bcontainer->dirty_pages_supported) {
>> @@ -691,8 +693,17 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>       }
>>       group_was_added = true;
>> -    if (!vfio_listener_register(bcontainer, errp)) {
>> -        goto fail;
>> +    /*
>> +     * If reused, register the listener later, after all state that may
>> +     * affect regions and mapping boundaries has been cpr load'ed.  Later,
>> +     * the listener will invoke its callback on each flat section and call
>> +     * dma_map to supply the new vaddr, and the calls will match the mappings
>> +     * remembered by the kernel.
>> +     */
>> +    if (!cpr_reused) {
>> +        if (!vfio_listener_register(bcontainer, errp)) {
>> +            goto fail;
>> +        }
> 
> hmm, I am starting to think we should have a vfio_cpr_container_connect
> routine too.

I think that would obscure rather than clarify the code, since the normal
non-cpr action of calling vfio_listener_register would be buried in a
cpr flavored function name.

>>       }
>>       bcontainer->initialized = true;
>> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
>> index 519d772..bbcf71e 100644
>> --- a/hw/vfio/cpr-legacy.c
>> +++ b/hw/vfio/cpr-legacy.c
>> @@ -11,11 +11,13 @@
>>   #include "hw/vfio/vfio-container.h"
>>   #include "hw/vfio/vfio-cpr.h"
>>   #include "hw/vfio/vfio-device.h"
>> +#include "hw/vfio/vfio-listener.h"
>>   #include "migration/blocker.h"
>>   #include "migration/cpr.h"
>>   #include "migration/migration.h"
>>   #include "migration/vmstate.h"
>>   #include "qapi/error.h"
>> +#include "qemu/error-report.h"
>>   static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>>   {
>> @@ -32,6 +34,34 @@ static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>>       return true;
>>   }
>> +/*
>> + * Set the new @vaddr for any mappings registered during cpr load.
>> + * Reused is cleared thereafter.
>> + */
>> +static int vfio_legacy_cpr_dma_map(const VFIOContainerBase *bcontainer,
>> +                                   hwaddr iova, ram_addr_t size, void *vaddr,
>> +                                   bool readonly)
>> +{
>> +    const VFIOContainer *container = container_of(bcontainer, VFIOContainer,
>> +                                                  bcontainer);
>> +    struct vfio_iommu_type1_dma_map map = {
>> +        .argsz = sizeof(map),
>> +        .flags = VFIO_DMA_MAP_FLAG_VADDR,
>> +        .vaddr = (__u64)(uintptr_t)vaddr,
>> +        .iova = iova,
>> +        .size = size,
>> +    };
>> +
>> +    assert(container->cpr.reused);
>> +
>> +    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
>> +        error_report("vfio_legacy_cpr_dma_map (iova %lu, size %ld, va %p): %s",
>> +                     iova, size, vaddr, strerror(errno));
> 
> Callers should also report the error. No need to do it here.

This function has the same signature as the dma_map class method,
which does not return an error message.  Its existing implementations
use error_report.
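
A small sketch of that local-report pattern, under assumptions: the
toy_* names are hypothetical, the message buffer stands in for
error_report(), and the failing ioctl is simulated.  It also saves errno
before formatting the message, since the reporting path may modify errno,
and uses PRIx64 so that 64-bit values print portably.

```c
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a failing VFIO_IOMMU_MAP_DMA ioctl. */
static int toy_failing_ioctl(void)
{
    errno = EINVAL;
    return -1;
}

/*
 * Same shape as a dma_map method: there is no Error ** argument, so the
 * failure is reported locally and a negative errno value is returned.
 */
static int toy_cpr_dma_map(uint64_t iova, uint64_t size, void *vaddr,
                           char *msg, size_t msg_len)
{
    if (toy_failing_ioctl()) {
        int saved_errno = errno;    /* reporting may clobber errno */

        snprintf(msg, msg_len,
                 "toy_cpr_dma_map (iova 0x%" PRIx64 ", size 0x%" PRIx64
                 ", va %p): %s",
                 iova, size, vaddr, strerror(saved_errno));
        return -saved_errno;
    }
    return 0;
}
```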

>> +        return -errno;
>> +    }
>> +
>> +    return 0;
>> +}
>>   static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>>   {
>> @@ -63,12 +93,24 @@ static int vfio_container_pre_save(void *opaque)
>>   static int vfio_container_post_load(void *opaque, int version_id)
>>   {
>>       VFIOContainer *container = opaque;
>> +    VFIOContainerBase *bcontainer = &container->bcontainer;
>>       VFIOGroup *group;
>>       VFIODevice *vbasedev;
>> +    Error *err = NULL;
>> +
>> +    if (!vfio_listener_register(bcontainer, &err)) {
>> +        error_report_err(err);
>> +        return -1;
>> +    }
>>       container->cpr.reused = false;
>>       QLIST_FOREACH(group, &container->group_list, container_next) {
>> +        VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
>> +
>> +        /* Restore original dma_map function */
>> +        vioc->dma_map = vfio_legacy_dma_map;
>> +
>>           QLIST_FOREACH(vbasedev, &group->device_list, next) {
>>               vbasedev->cpr.reused = false;
>>           }
>> @@ -80,6 +122,7 @@ static const VMStateDescription vfio_container_vmstate = {
>>       .name = "vfio-container",
>>       .version_id = 0,
>>       .minimum_version_id = 0,
>> +    .priority = MIG_PRI_LOW,  /* Must happen after devices and groups */
>>       .pre_save = vfio_container_pre_save,
>>       .post_load = vfio_container_post_load,
>>       .needed = cpr_needed_for_reuse,
>> @@ -104,6 +147,11 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
>>       vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>> +    /* During incoming CPR, divert calls to dma_map. */
>> +    if (container->cpr.reused) {
>> +        VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
>> +        vioc->dma_map = vfio_legacy_cpr_dma_map;
> 
> You could backup the previous dma_map() handler in a static variable or,
> better, under container->cpr.

OK.
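
Roughly what I have in mind, as a sketch only.  The Toy* types and the
saved_dma_map field are hypothetical stand-ins for VFIOIOMMUClass and a
new member of VFIOContainerCPR; the real field name and placement may
differ.

```c
#include <stddef.h>
#include <stdint.h>

typedef int (*toy_dma_map_fn)(uint64_t iova, uint64_t size, void *vaddr);

/* Hypothetical stand-in for the class holding the dma_map method. */
typedef struct ToyIOMMUClass {
    toy_dma_map_fn dma_map;
} ToyIOMMUClass;

/* Hypothetical stand-in for per-container CPR state. */
typedef struct ToyContainerCPR {
    toy_dma_map_fn saved_dma_map;
} ToyContainerCPR;

/* Normal handler; returns 0 so the test can tell the handlers apart. */
static int toy_normal_dma_map(uint64_t iova, uint64_t size, void *vaddr)
{
    (void)iova; (void)size; (void)vaddr;
    return 0;
}

/* CPR handler; returns 1 so the test can tell the handlers apart. */
static int toy_cpr_dma_map(uint64_t iova, uint64_t size, void *vaddr)
{
    (void)iova; (void)size; (void)vaddr;
    return 1;
}

/* During incoming CPR: remember the current handler, then divert. */
static void toy_cpr_divert(ToyIOMMUClass *vioc, ToyContainerCPR *cpr)
{
    cpr->saved_dma_map = vioc->dma_map;
    vioc->dma_map = toy_cpr_dma_map;
}

/* In post_load: restore whatever handler was installed before. */
static void toy_cpr_restore(ToyIOMMUClass *vioc, ToyContainerCPR *cpr)
{
    vioc->dma_map = cpr->saved_dma_map;
    cpr->saved_dma_map = NULL;
}
```

With this shape the restore path does not need to name the normal handler.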

- Steve



* Re: [PATCH V3 26/42] vfio: return mr from vfio_get_xlat_addr
  2025-05-15  8:22       ` David Hildenbrand
@ 2025-05-15 19:13         ` Steven Sistare
  0 siblings, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-15 19:13 UTC (permalink / raw)
  To: David Hildenbrand, Cédric Le Goater, John Levon
  Cc: qemu-devel, Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas,
	Paolo Bonzini, Philippe Mathieu-Daudé

On 5/15/2025 4:22 AM, David Hildenbrand wrote:
> On 14.05.25 19:03, Cédric Le Goater wrote:
>> + Paolo
>> + David
>> + Peter
>> + Phil
>>
>> On 5/12/25 22:51, John Levon wrote:
>>> On Mon, May 12, 2025 at 08:32:37AM -0700, Steve Sistare wrote:
>>>
>>>> Modify memory_get_xlat_addr and vfio_get_xlat_addr to return the memory
>>>> region that the translated address is found in.  This will be needed by
>>>> CPR in a subsequent patch to map blocks using IOMMU_IOAS_MAP_FILE.
>>>>
>>>> Also return the xlat offset, so we can simplify the interface by removing
>>>> the out parameters that can be trivially derived from mr and xlat.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>
>>> Steve, would you consider splitting this out from the full CPR series and
>>> submitting as a standalone, as we both have a dependency on doing this, and your
>>> patch seems much nicer than the current one in vfio-user series?
>>
>> Maybe we can merge this version if the maintainers ack the change?
> 
> The change itself looks good to me. Now that we want to return the mr from memory_get_xlat_addr(), why not make that the return type (NULL vs. ! NULL), to get rid of the boolean?
> 
> MemoryRegion *memory_get_xlat_addr(IOMMUTLBEntry *iotlb, hwaddr *xlat_p,
>          Error **errp);
> 
> Same with "vfio_get_xlat_addr".
> 
> Of course, we could consider renaming both functions to something like
> 
> memory_translate_iotlb()
> vfio_translate_iotlb()

Sure. I'll post those changes tomorrow unless someone tells me not to.
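
As a rough sketch of the suggested shape (pointer return, NULL on failure,
caller derives the rest from mr and xlat).  Everything here is a toy
model: the Toy* types, TOY_PERM_* flags, region table, and the string
error are hypothetical and do not reflect the real MemoryRegion,
IOMMUTLBEntry, or Error APIs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PERM_RO 1u
#define TOY_PERM_WO 2u

typedef struct ToyRegion {
    uint8_t *host_base;     /* host address of the start of the region */
    uint64_t guest_base;    /* first translated address it covers */
    uint64_t size;
    bool readonly;
} ToyRegion;

typedef struct ToyTLBEntry {
    uint64_t translated_addr;
    unsigned perm;
} ToyTLBEntry;

/* Returns the containing region and sets *xlat_p, or NULL with *errp set. */
static ToyRegion *toy_translate_iotlb(ToyRegion *regions, size_t n,
                                      const ToyTLBEntry *e,
                                      uint64_t *xlat_p, const char **errp)
{
    for (size_t i = 0; i < n; i++) {
        ToyRegion *mr = &regions[i];

        if (e->translated_addr >= mr->guest_base &&
            e->translated_addr - mr->guest_base < mr->size) {
            *xlat_p = e->translated_addr - mr->guest_base;
            return mr;
        }
    }
    *errp = "address is not backed by a known region";
    return NULL;
}

/* Caller side: vaddr and read_only are derived from mr and xlat. */
static bool toy_map_notify(ToyRegion *regions, size_t n,
                           const ToyTLBEntry *e, void **vaddr,
                           bool *read_only, const char **errp)
{
    uint64_t xlat;
    ToyRegion *mr = toy_translate_iotlb(regions, n, e, &xlat, errp);

    if (!mr) {
        return false;
    }
    *vaddr = mr->host_base + xlat;
    *read_only = !(e->perm & TOY_PERM_WO) || mr->readonly;
    return true;
}
```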

- Steve




* Re: [PATCH V3 26/42] vfio: return mr from vfio_get_xlat_addr
  2025-05-13 11:12   ` Mark Cave-Ayland
@ 2025-05-15 19:40     ` Steven Sistare
  0 siblings, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-15 19:40 UTC (permalink / raw)
  To: Mark Cave-Ayland, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas

On 5/13/2025 7:12 AM, Mark Cave-Ayland wrote:
> On 12/05/2025 16:32, Steve Sistare wrote:
> 
>> Modify memory_get_xlat_addr and vfio_get_xlat_addr to return the memory
>> region that the translated address is found in.  This will be needed by
>> CPR in a subsequent patch to map blocks using IOMMU_IOAS_MAP_FILE.
>>
>> Also return the xlat offset, so we can simplify the interface by removing
>> the out parameters that can be trivially derived from mr and xlat.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   hw/vfio/listener.c      | 29 +++++++++++++++++++----------
>>   hw/virtio/vhost-vdpa.c  |  8 ++++++--
>>   include/system/memory.h | 16 +++++++---------
>>   system/memory.c         | 25 ++++---------------------
>>   4 files changed, 36 insertions(+), 42 deletions(-)
>>
>> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
>> index e86ffcf..87b7a3c 100644
>> --- a/hw/vfio/listener.c
>> +++ b/hw/vfio/listener.c
>> @@ -90,16 +90,17 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>              section->offset_within_address_space & (1ULL << 63);
>>   }
>> -/* Called with rcu_read_lock held.  */
>> -static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>> -                               ram_addr_t *ram_addr, bool *read_only,
>> -                               Error **errp)
>> +/*
>> + * Called with rcu_read_lock held.
>> + * The returned MemoryRegion must not be accessed after calling rcu_read_unlock.
>> + */
>> +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, MemoryRegion **mr_p,
>> +                               hwaddr *xlat_p, Error **errp)
>>   {
>> -    bool ret, mr_has_discard_manager;
>> +    bool ret;
>> -    ret = memory_get_xlat_addr(iotlb, vaddr, ram_addr, read_only,
>> -                               &mr_has_discard_manager, errp);
>> -    if (ret && mr_has_discard_manager) {
>> +    ret = memory_get_xlat_addr(iotlb, mr_p, xlat_p, errp);
>> +    if (ret && memory_region_has_ram_discard_manager(*mr_p)) {
> 
> I'm trying to understand the underlying intention of this patch: is it just so that you can access the corresponding RAMBlock in vfio_container_dma_map() in patch 31 "vfio/iommufd: use IOMMU_IOAS_MAP_FILE"?

Yes.

> Given that the flatview can theoretically change at any point, it feels as if the current API whereby the vaddr is passed around is the correct approach, and that the final MemoryRegion lookup should be done at the point where it is required.

The existing code already guarantees a stable address space when vfio_container_dma_map()
is called ...

     vfio_iommu_map_notify()
         rcu_read_lock();
         vfio_get_xlat_addr()
         vfio_container_dma_map()


> If this is the case, is it not simpler to add a call to address_space_translate() in patch 31 to obtain the MemoryRegion pointer there instead?

... so it is simpler and more efficient (saving a translation) if we simply
expose mr->ram_block in that range of code.

- Steve

>>           /*
>>            * Malicious VMs might trigger discarding of IOMMU-mapped memory. The
>>            * pages will remain pinned inside vfio until unmapped, resulting in a
>> @@ -126,6 +127,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>>       VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>>       VFIOContainerBase *bcontainer = giommu->bcontainer;
>>       hwaddr iova = iotlb->iova + giommu->iommu_offset;
>> +    MemoryRegion *mr;
>> +    hwaddr xlat;
>>       void *vaddr;
>>       int ret;
>>       Error *local_err = NULL;
>> @@ -150,10 +153,13 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>>       if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>>           bool read_only;
>> -        if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, &local_err)) {
>> +        if (!vfio_get_xlat_addr(iotlb, &mr, &xlat, &local_err)) {
>>               error_report_err(local_err);
>>               goto out;
>>           }
>> +        vaddr = memory_region_get_ram_ptr(mr) + xlat;
>> +        read_only = !(iotlb->perm & IOMMU_WO) || mr->readonly;
>> +
>>           /*
>>            * vaddr is only valid until rcu_read_unlock(). But after
>>            * vfio_dma_map has set up the mapping the pages will be
>> @@ -1047,6 +1053,8 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>>       ram_addr_t translated_addr;
>>       Error *local_err = NULL;
>>       int ret = -EINVAL;
>> +    MemoryRegion *mr;
>> +    ram_addr_t xlat;
>>       trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask);
>> @@ -1058,9 +1066,10 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>>       }
>>       rcu_read_lock();
>> -    if (!vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL, &local_err)) {
>> +    if (!vfio_get_xlat_addr(iotlb, &mr, &xlat, &local_err)) {
>>           goto out_unlock;
>>       }
>> +    translated_addr = memory_region_get_ram_addr(mr) + xlat;
>>       ret = vfio_container_query_dirty_bitmap(bcontainer, iova, iotlb->addr_mask + 1,
>>                                   translated_addr, &local_err);
>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>> index 1ab2c11..f191360 100644
>> --- a/hw/virtio/vhost-vdpa.c
>> +++ b/hw/virtio/vhost-vdpa.c
>> @@ -209,6 +209,8 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>>       int ret;
>>       Int128 llend;
>>       Error *local_err = NULL;
>> +    MemoryRegion *mr;
>> +    hwaddr xlat;
>>       if (iotlb->target_as != &address_space_memory) {
>>           error_report("Wrong target AS \"%s\", only system memory is allowed",
>> @@ -228,11 +230,13 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>>       if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>>           bool read_only;
>> -        if (!memory_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, NULL,
>> -                                  &local_err)) {
>> +        if (!memory_get_xlat_addr(iotlb, &mr, &xlat, &local_err)) {
>>               error_report_err(local_err);
>>               return;
>>           }
>> +        vaddr = memory_region_get_ram_ptr(mr) + xlat;
>> +        read_only = !(iotlb->perm & IOMMU_WO) || mr->readonly;
>> +
>>           ret = vhost_vdpa_dma_map(s, VHOST_VDPA_GUEST_PA_ASID, iova,
>>                                    iotlb->addr_mask + 1, vaddr, read_only);
>>           if (ret) {
>> diff --git a/include/system/memory.h b/include/system/memory.h
>> index fbbf4cf..d743214 100644
>> --- a/include/system/memory.h
>> +++ b/include/system/memory.h
>> @@ -738,21 +738,19 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
>>                                                RamDiscardListener *rdl);
>>   /**
>> - * memory_get_xlat_addr: Extract addresses from a TLB entry
>> + * memory_get_xlat_addr: Extract addresses from a TLB entry.
>> + *                       Called with rcu_read_lock held.
>>    *
>>    * @iotlb: pointer to an #IOMMUTLBEntry
>> - * @vaddr: virtual address
>> - * @ram_addr: RAM address
>> - * @read_only: indicates if writes are allowed
>> - * @mr_has_discard_manager: indicates memory is controlled by a
>> - *                          RamDiscardManager
>> + * @mr_p: return the MemoryRegion containing the @iotlb translated addr.
>> + *        The MemoryRegion must not be accessed after rcu_read_unlock.
>> + * @xlat_p: return the offset of the entry from the start of @mr_p
>>    * @errp: pointer to Error*, to store an error if it happens.
>>    *
>>    * Return: true on success, else false setting @errp with error.
>>    */
>> -bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>> -                          ram_addr_t *ram_addr, bool *read_only,
>> -                          bool *mr_has_discard_manager, Error **errp);
>> +bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, MemoryRegion **mr_p,
>> +                          hwaddr *xlat_p, Error **errp);
>>   typedef struct CoalescedMemoryRange CoalescedMemoryRange;
>>   typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
>> diff --git a/system/memory.c b/system/memory.c
>> index 63b983e..4894c0d 100644
>> --- a/system/memory.c
>> +++ b/system/memory.c
>> @@ -2174,18 +2174,14 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
>>   }
>>   /* Called with rcu_read_lock held.  */
>> -bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>> -                          ram_addr_t *ram_addr, bool *read_only,
>> -                          bool *mr_has_discard_manager, Error **errp)
>> +bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, MemoryRegion **mr_p,
>> +                          hwaddr *xlat_p, Error **errp)
>>   {
>>       MemoryRegion *mr;
>>       hwaddr xlat;
>>       hwaddr len = iotlb->addr_mask + 1;
>>       bool writable = iotlb->perm & IOMMU_WO;
>> -    if (mr_has_discard_manager) {
>> -        *mr_has_discard_manager = false;
>> -    }
>>       /*
>>        * The IOMMU TLB entry we have just covers translation through
>>        * this IOMMU to its immediate target.  We need to translate
>> @@ -2203,9 +2199,6 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>>               .offset_within_region = xlat,
>>               .size = int128_make64(len),
>>           };
>> -        if (mr_has_discard_manager) {
>> -            *mr_has_discard_manager = true;
>> -        }
>>           /*
>>            * Malicious VMs can map memory into the IOMMU, which is expected
>>            * to remain discarded. vfio will pin all pages, populating memory.
>> @@ -2229,18 +2222,8 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>>           return false;
>>       }
>> -    if (vaddr) {
>> -        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
>> -    }
>> -
>> -    if (ram_addr) {
>> -        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
>> -    }
>> -
>> -    if (read_only) {
>> -        *read_only = !writable || mr->readonly;
>> -    }
>> -
>> +    *xlat_p = xlat;
>> +    *mr_p = mr;
>>       return true;
>>   }
> 
> 
> ATB,
> 
> Mark.
> 




* Re: [PATCH V3 11/42] vfio/container: mdev cpr blocker
  2025-05-12 15:32 ` [PATCH V3 11/42] vfio/container: mdev cpr blocker Steve Sistare
@ 2025-05-16  8:16   ` Cédric Le Goater
  0 siblings, 0 replies; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-16  8:16 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> During CPR, after VFIO_DMA_UNMAP_FLAG_VADDR, the vaddr is temporarily
> invalid, so mediated devices cannot be supported.  Add a blocker for them.
> This restriction will not apply to iommufd containers when CPR is added
> for them in a future patch.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
>   hw/vfio/container.c        | 8 ++++++++
>   include/hw/vfio/vfio-cpr.h | 1 +
>   2 files changed, 9 insertions(+)
> 
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index 0e02726..562e3bd 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -995,6 +995,13 @@ static bool vfio_legacy_attach_device(const char *name, VFIODevice *vbasedev,
>           goto device_put_exit;
>       }
>   
> +    if (vbasedev->mdev) {
> +        error_setg(&vbasedev->cpr.mdev_blocker,
> +                   "CPR does not support vfio mdev %s", vbasedev->name);
> +        migrate_add_blocker_modes(&vbasedev->cpr.mdev_blocker, &error_fatal,
> +                                  MIG_MODE_CPR_TRANSFER, -1);
> +    }
> +
>       return true;
>   
>   device_put_exit:
> @@ -1012,6 +1019,7 @@ static void vfio_legacy_detach_device(VFIODevice *vbasedev)
>   
>       vfio_device_unprepare(vbasedev);
>   
> +    migrate_del_blocker(&vbasedev->cpr.mdev_blocker);
>       object_unref(vbasedev->hiod);
>       vfio_device_put(vbasedev);
>       vfio_group_put(group);
> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
> index 1c4f070..0fc7ab2 100644
> --- a/include/hw/vfio/vfio-cpr.h
> +++ b/include/hw/vfio/vfio-cpr.h
> @@ -18,6 +18,7 @@ typedef struct VFIOContainerCPR {
>   
>   typedef struct VFIODeviceCPR {
>       bool reused;
> +    Error *mdev_blocker;
>   } VFIODeviceCPR;
>   
>   struct VFIOContainer;
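
The blocker above is registered for a single migration mode by passing a mode list that ends with -1. A minimal sketch of that calling convention follows; it is not QEMU's implementation. The names modes_to_mask_sketch, blocks_mode_sketch and the two enum values are hypothetical and only illustrate how a -1-terminated variadic list can be folded into a per-mode mask.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>

/* Hypothetical mode numbers, standing in for the MigMode values. */
enum { MODE_NORMAL_SKETCH = 0, MODE_CPR_TRANSFER_SKETCH = 1 };

/*
 * Fold a list of modes, terminated by -1, into a bitmask.
 * Mirrors the shape of a call such as f(..., MODE_X, -1).
 */
static unsigned modes_to_mask_sketch(int mode, ...)
{
    unsigned mask = 0;
    va_list ap;

    va_start(ap, mode);
    while (mode != -1) {
        mask |= 1u << mode;
        mode = va_arg(ap, int);
    }
    va_end(ap);
    return mask;
}

/* True if a blocker registered with this mask applies to the given mode. */
static bool blocks_mode_sketch(unsigned mask, int mode)
{
    return (mask & (1u << mode)) != 0;
}
```

With a mask built from only the cpr-transfer mode, a blocker of this kind would stop a cpr-transfer but leave other modes alone, which matches the intent stated in the commit message.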



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH V3 14/42] pci: skip reset during cpr
  2025-05-12 15:32 ` [PATCH V3 14/42] pci: skip reset during cpr Steve Sistare
@ 2025-05-16  8:19   ` Cédric Le Goater
  2025-05-16 17:58     ` Steven Sistare
  2025-05-24  9:34     ` Michael S. Tsirkin
  0 siblings, 2 replies; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-16  8:19 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> Do not reset a vfio-pci device during CPR.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   hw/pci/pci.c | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
> 
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index fe38c4c..2ba2e0f 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -32,6 +32,8 @@
>   #include "hw/pci/pci_host.h"
>   #include "hw/qdev-properties.h"
>   #include "hw/qdev-properties-system.h"
> +#include "migration/cpr.h"
> +#include "migration/misc.h"
>   #include "migration/qemu-file-types.h"
>   #include "migration/vmstate.h"
>   #include "net/net.h"
> @@ -537,6 +539,17 @@ static void pci_reset_regions(PCIDevice *dev)
>   
>   static void pci_do_device_reset(PCIDevice *dev)
>   {
> +    /*
> +     * A PCI device that is resuming for cpr is already configured, so do
> +     * not reset it here when we are called from qemu_system_reset prior to
> +     * cpr load, else interrupts may be lost for vfio-pci devices.  It is
> +     * safe to skip this reset for all PCI devices, because vmstate load will
> +     * set all fields that would have been set here.
> +     */
> +    if (cpr_is_incoming()) {

Why can't we use cpr_is_incoming() in vfio instead of using a heuristic
on saved fds?

Thanks,

C.



> +        return;
> +    }
> +
>       pci_device_deassert_intx(dev);
>       assert(dev->irq_state == 0);
>   
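
The guard discussed above is an early return keyed on a single global predicate. A minimal standalone sketch of that pattern follows. DevSketch, incoming_sketch and device_reset_sketch are hypothetical stand-ins (incoming_sketch plays the role of cpr_is_incoming()), not QEMU code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device state: only the fields a reset would touch. */
typedef struct DevSketch {
    int irq_state;
    int resets;
} DevSketch;

/* Stand-in for cpr_is_incoming(); set by the test harness. */
static bool incoming_sketch;

static void device_reset_sketch(DevSketch *dev)
{
    /*
     * The device is already configured when resuming, and the later
     * state load is assumed to set every field a reset would have set.
     */
    if (incoming_sketch) {
        return;
    }

    dev->irq_state = 0;
    dev->resets++;
}
```

Keying the decision on one predicate, rather than inferring it from which descriptors happen to be saved, keeps the condition in one place, which is the point of the question raised in the review.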



^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [PATCH V3 28/42] backends/iommufd: iommufd_backend_map_file_dma
  2025-05-12 15:32 ` [PATCH V3 28/42] backends/iommufd: iommufd_backend_map_file_dma Steve Sistare
@ 2025-05-16  8:26   ` Duan, Zhenzhong
  2025-05-19 15:51     ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Duan, Zhenzhong @ 2025-05-16  8:26 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V3 28/42] backends/iommufd:
>iommufd_backend_map_file_dma
>
>Define iommufd_backend_map_file_dma to implement IOMMU_IOAS_MAP_FILE.
>This will be called as a substitute for iommufd_backend_map_dma, so
>the error conditions for BARs are copied as-is from that function.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>---
> backends/iommufd.c       | 36 ++++++++++++++++++++++++++++++++++++
> backends/trace-events    |  1 +
> include/system/iommufd.h |  3 +++
> 3 files changed, 40 insertions(+)
>
>diff --git a/backends/iommufd.c b/backends/iommufd.c
>index b73f75c..5c1958f 100644
>--- a/backends/iommufd.c
>+++ b/backends/iommufd.c
>@@ -172,6 +172,42 @@ int iommufd_backend_map_dma(IOMMUFDBackend
>*be, uint32_t ioas_id, hwaddr iova,
>     return ret;
> }
>
>+int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>+                                 hwaddr iova, ram_addr_t size,
>+                                 int mfd, unsigned long start, bool readonly)
>+{
>+    int ret, fd = be->fd;
>+    struct iommu_ioas_map_file map = {
>+        .size = sizeof(map),
>+        .flags = IOMMU_IOAS_MAP_READABLE |
>+                 IOMMU_IOAS_MAP_FIXED_IOVA,
>+        .ioas_id = ioas_id,
>+        .fd = mfd,
>+        .start = start,
>+        .iova = iova,
>+        .length = size,
>+    };
>+
>+    if (!readonly) {
>+        map.flags |= IOMMU_IOAS_MAP_WRITEABLE;
>+    }
>+
>+    ret = ioctl(fd, IOMMU_IOAS_MAP_FILE, &map);
>+    trace_iommufd_backend_map_file_dma(fd, ioas_id, iova, size, mfd, start,
>+                                       readonly, ret);
>+    if (ret) {
>+        ret = -errno;
>+
>+        /* TODO: Not support mapping hardware PCI BAR region for now. */
>+        if (errno == EFAULT) {
>+            warn_report("IOMMU_IOAS_MAP_FILE failed: %m, PCI BAR?");
>+        } else {
>+            error_report("IOMMU_IOAS_MAP_FILE failed: %m");

No need to print an error here, as the caller does the same thing.

>+        }
>+    }
>+    return ret;
>+}
>+
> int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>                               hwaddr iova, ram_addr_t size)
> {
>diff --git a/backends/trace-events b/backends/trace-events
>index 40811a3..f478e18 100644
>--- a/backends/trace-events
>+++ b/backends/trace-events
>@@ -11,6 +11,7 @@ iommufd_backend_connect(int fd, bool owned, uint32_t
>users) "fd=%d owned=%d user
> iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
> iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
> iommufd_backend_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t
>size, void *vaddr, bool readonly, int ret) " iommufd=%d ioas=%d
>iova=0x%"PRIx64" size=0x%"PRIx64" addr=%p readonly=%d (%d)"
>+iommufd_backend_map_file_dma(int iommufd, uint32_t ioas, uint64_t iova,
>uint64_t size, int fd, unsigned long start, bool readonly, int ret) " iommufd=%d
>ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" fd=%d start=%ld readonly=%d
>(%d)"
> iommufd_backend_unmap_dma_non_exist(int iommufd, uint32_t ioas, uint64_t
>iova, uint64_t size, int ret) " Unmap nonexistent mapping: iommufd=%d ioas=%d
>iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
> iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova,
>uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64"
>size=0x%"PRIx64" (%d)"
> iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d
>ioas=%d"
>diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>index cbab75b..ac700b8 100644
>--- a/include/system/iommufd.h
>+++ b/include/system/iommufd.h
>@@ -43,6 +43,9 @@ void iommufd_backend_disconnect(IOMMUFDBackend
>*be);
> bool iommufd_backend_alloc_ioas(IOMMUFDBackend *be, uint32_t *ioas_id,
>                                 Error **errp);
> void iommufd_backend_free_id(IOMMUFDBackend *be, uint32_t id);
>+int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>+                                 hwaddr iova, ram_addr_t size, int fd,
>+                                 unsigned long start, bool readonly);
> int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>hwaddr iova,
>                             ram_addr_t size, void *vaddr, bool readonly);
> int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>--
>1.8.3.1
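
The review comment above suggests reporting the failure in one place only. A minimal sketch of that split follows, assuming a backend helper that returns 0 or -errno without printing and a caller that formats the message. backend_op_sketch and caller_sketch are hypothetical names, and write(2) on an invalid descriptor stands in for the ioctl.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical backend helper: performs the system call and returns
 * 0 on success or -errno on failure.  It prints nothing.
 */
static int backend_op_sketch(int fd)
{
    char byte = 0;

    if (write(fd, &byte, 1) < 0) {
        /* Capture errno right away, before any other call can change it. */
        return -errno;
    }
    return 0;
}

/* Hypothetical caller: the single place where the error is reported. */
static int caller_sketch(int fd, char *msg, size_t len)
{
    int ret = backend_op_sketch(fd);

    if (ret) {
        snprintf(msg, len, "map failed: %s", strerror(-ret));
    }
    return ret;
}
```

Returning -errno from the helper gives the caller everything it needs to build the message, so the user sees one report per failure rather than two.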



^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [PATCH V3 27/42] vfio: pass ramblock to vfio_container_dma_map
  2025-05-12 15:32 ` [PATCH V3 27/42] vfio: pass ramblock to vfio_container_dma_map Steve Sistare
@ 2025-05-16  8:26   ` Duan, Zhenzhong
  0 siblings, 0 replies; 157+ messages in thread
From: Duan, Zhenzhong @ 2025-05-16  8:26 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V3 27/42] vfio: pass ramblock to vfio_container_dma_map
>
>Pass ramblock to vfio_container_dma_map for use in a subsequent patch.
>The ramblock's attributes will be needed to map the block using
>IOMMU_IOAS_MAP_FILE.  No functional change.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>

Zhenzhong


^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH V3 18/42] vfio/pci: pass vector to virq functions
  2025-05-12 15:32 ` [PATCH V3 18/42] vfio/pci: pass vector to virq functions Steve Sistare
@ 2025-05-16  8:28   ` Cédric Le Goater
  0 siblings, 0 replies; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-16  8:28 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> Pass the vector number to vfio_connect_kvm_msi_virq and
> vfio_remove_kvm_msi_virq, so it can be passed to their subroutines in
> a subsequent patch.  No functional change.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
>   hw/vfio/pci.c | 13 +++++++------
>   1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 4159deb..dad6209 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -477,7 +477,7 @@ static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
>                                                vector_n, &vdev->pdev);
>   }
>   
> -static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector)
> +static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector, int nr)
>   {
>       const char *name = "kvm_interrupt";
>   
> @@ -503,7 +503,8 @@ fail_notifier:
>       vector->virq = -1;
>   }
>   
> -static void vfio_remove_kvm_msi_virq(VFIOMSIVector *vector)
> +static void vfio_remove_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
> +                                     int nr)
>   {
>       kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
>                                             vector->virq);
> @@ -561,7 +562,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>        */
>       if (vector->virq >= 0) {
>           if (!msg) {
> -            vfio_remove_kvm_msi_virq(vector);
> +            vfio_remove_kvm_msi_virq(vdev, vector, nr);
>           } else {
>               vfio_update_kvm_msi_virq(vector, *msg, pdev);
>           }
> @@ -573,7 +574,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>                   vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
>                   vfio_add_kvm_msi_virq(vdev, vector, nr, true);
>                   kvm_irqchip_commit_route_changes(&vfio_route_change);
> -                vfio_connect_kvm_msi_virq(vector);
> +                vfio_connect_kvm_msi_virq(vector, nr);
>               }
>           }
>       }
> @@ -681,7 +682,7 @@ static void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
>       kvm_irqchip_commit_route_changes(&vfio_route_change);
>   
>       for (i = 0; i < vdev->nr_vectors; i++) {
> -        vfio_connect_kvm_msi_virq(&vdev->msi_vectors[i]);
> +        vfio_connect_kvm_msi_virq(&vdev->msi_vectors[i], i);
>       }
>   }
>   
> @@ -821,7 +822,7 @@ static void vfio_msi_disable_common(VFIOPCIDevice *vdev)
>           VFIOMSIVector *vector = &vdev->msi_vectors[i];
>           if (vdev->msi_vectors[i].use) {
>               if (vector->virq >= 0) {
> -                vfio_remove_kvm_msi_virq(vector);
> +                vfio_remove_kvm_msi_virq(vdev, vector, i);
>               }
>               qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
>                                   NULL, NULL, NULL);



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH V3 17/42] vfio/pci: vfio_notifier_init
  2025-05-12 15:32 ` [PATCH V3 17/42] vfio/pci: vfio_notifier_init Steve Sistare
@ 2025-05-16  8:29   ` Cédric Le Goater
  0 siblings, 0 replies; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-16  8:29 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> Move event_notifier_init calls to a helper vfio_notifier_init.
> This version is trivial, but it will be expanded to support CPR
> in subsequent patches.  No functional change.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   hw/vfio/pci.c | 40 +++++++++++++++++++++++++---------------
>   1 file changed, 25 insertions(+), 15 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index b46c42e..4159deb 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -56,6 +56,16 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
>   static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
>   static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
>   
> +static bool vfio_notifier_init(EventNotifier *e, const char *name, Error **errp)
> +{
> +    int ret = event_notifier_init(e, 0);
> +
> +    if (ret) {
> +        error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
> +    }
> +    return !ret;
> +}
> +
>   /*
>    * Disabling BAR mmaping can be slow, but toggling it around INTx can
>    * also be a huge overhead.  We try to get the best of both worlds by
> @@ -136,8 +146,7 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
>       pci_irq_deassert(&vdev->pdev);
>   
>       /* Get an eventfd for resample/unmask */
> -    if (event_notifier_init(&vdev->intx.unmask, 0)) {
> -        error_setg(errp, "event_notifier_init failed eoi");
> +    if (!vfio_notifier_init(&vdev->intx.unmask, "intx-unmask", errp)) {
>           goto fail;
>       }
>   
> @@ -268,7 +277,6 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>       uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
>       Error *err = NULL;
>       int32_t fd;
> -    int ret;
>   
>   
>       if (!pin) {
> @@ -291,9 +299,7 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>       }
>   #endif
>   
> -    ret = event_notifier_init(&vdev->intx.interrupt, 0);
> -    if (ret) {
> -        error_setg_errno(errp, -ret, "event_notifier_init failed");
> +    if (!vfio_notifier_init(&vdev->intx.interrupt, "intx-interrupt", errp)) {
>           return false;
>       }
>       fd = event_notifier_get_fd(&vdev->intx.interrupt);
> @@ -473,11 +479,13 @@ static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
>   
>   static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector)
>   {
> +    const char *name = "kvm_interrupt";
> +
>       if (vector->virq < 0) {
>           return;
>       }
>   
> -    if (event_notifier_init(&vector->kvm_interrupt, 0)) {
> +    if (!vfio_notifier_init(&vector->kvm_interrupt, name, NULL)) {
>           goto fail_notifier;
>       }
>   
> @@ -515,11 +523,12 @@ static void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
>   {
>       VFIOMSIVector *vector = &vdev->msi_vectors[nr];
>       PCIDevice *pdev = &vdev->pdev;
> +    Error *err = NULL;

In case you resend, I prefer naming local Error variables local_err.


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


>   
>       vector->vdev = vdev;
>       vector->virq = -1;
> -    if (event_notifier_init(&vector->interrupt, 0)) {
> -        error_report("vfio: Error: event_notifier_init failed");
> +    if (!vfio_notifier_init(&vector->interrupt, "interrupt", &err)) {
> +        error_report_err(err);
>       }
>       vector->use = true;
>       if (vdev->interrupt == VFIO_INT_MSIX) {
> @@ -749,13 +758,14 @@ retry:
>   
>       for (i = 0; i < vdev->nr_vectors; i++) {
>           VFIOMSIVector *vector = &vdev->msi_vectors[i];
> +        Error *err = NULL;
>   
>           vector->vdev = vdev;
>           vector->virq = -1;
>           vector->use = true;
>   
> -        if (event_notifier_init(&vector->interrupt, 0)) {
> -            error_report("vfio: Error: event_notifier_init failed");
> +        if (!vfio_notifier_init(&vector->interrupt, "interrupt", &err)) {
> +            error_report_err(err);
>           }
>   
>           qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
> @@ -2907,8 +2917,8 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
>           return;
>       }
>   
> -    if (event_notifier_init(&vdev->err_notifier, 0)) {
> -        error_report("vfio: Unable to init event notifier for error detection");
> +    if (!vfio_notifier_init(&vdev->err_notifier, "err_notifier", &err)) {
> +        error_report_err(err);
>           vdev->pci_aer = false;
>           return;
>       }
> @@ -2974,8 +2984,8 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
>           return;
>       }
>   
> -    if (event_notifier_init(&vdev->req_notifier, 0)) {
> -        error_report("vfio: Unable to init event notifier for device request");
> +    if (!vfio_notifier_init(&vdev->req_notifier, "req_notifier", &err)) {
> +        error_report_err(err);
>           return;
>       }
>   
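
The helper introduced above wraps notifier creation so that every call site gets the same success/failure convention and a named error message. A minimal Linux-only sketch of the same shape follows, using eventfd(2) directly. notifier_init_sketch and notifier_cleanup_sketch are hypothetical names, and the error buffer stands in for QEMU's Error object.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Create an eventfd and store it in *fd.  Returns true on success.
 * On failure, writes a message naming the notifier into errbuf
 * (if errbuf is not NULL) and returns false.
 */
static bool notifier_init_sketch(int *fd, const char *name,
                                 char *errbuf, size_t errlen)
{
    int efd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);

    if (efd < 0) {
        if (errbuf) {
            snprintf(errbuf, errlen, "notifier_init %s failed: %s",
                     name, strerror(errno));
        }
        return false;
    }
    *fd = efd;
    return true;
}

/* Close the descriptor and mark the slot unused. */
static void notifier_cleanup_sketch(int *fd)
{
    if (*fd >= 0) {
        close(*fd);
        *fd = -1;
    }
}
```

Routing every creation through one function gives a later change a single place to substitute a preserved descriptor for a freshly created one, which is the stated purpose of the refactoring.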



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH V3 19/42] vfio/pci: vfio_notifier_init cpr parameters
  2025-05-12 15:32 ` [PATCH V3 19/42] vfio/pci: vfio_notifier_init cpr parameters Steve Sistare
@ 2025-05-16  8:29   ` Cédric Le Goater
  0 siblings, 0 replies; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-16  8:29 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> Pass vdev and nr to vfio_notifier_init, for use by CPR in a subsequent
> patch.  No functional change.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
>   hw/vfio/pci.c | 22 ++++++++++++++--------
>   1 file changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index dad6209..bfeaafa 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -56,7 +56,8 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
>   static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
>   static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
>   
> -static bool vfio_notifier_init(EventNotifier *e, const char *name, Error **errp)
> +static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
> +                               const char *name, int nr, Error **errp)
>   {
>       int ret = event_notifier_init(e, 0);
>   
> @@ -146,7 +147,7 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
>       pci_irq_deassert(&vdev->pdev);
>   
>       /* Get an eventfd for resample/unmask */
> -    if (!vfio_notifier_init(&vdev->intx.unmask, "intx-unmask", errp)) {
> +    if (!vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0, errp)) {
>           goto fail;
>       }
>   
> @@ -299,7 +300,8 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>       }
>   #endif
>   
> -    if (!vfio_notifier_init(&vdev->intx.interrupt, "intx-interrupt", errp)) {
> +    if (!vfio_notifier_init(vdev, &vdev->intx.interrupt, "intx-interrupt", 0,
> +                            errp)) {
>           return false;
>       }
>       fd = event_notifier_get_fd(&vdev->intx.interrupt);
> @@ -485,7 +487,8 @@ static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector, int nr)
>           return;
>       }
>   
> -    if (!vfio_notifier_init(&vector->kvm_interrupt, name, NULL)) {
> +    if (!vfio_notifier_init(vector->vdev, &vector->kvm_interrupt, name, nr,
> +                            NULL)) {
>           goto fail_notifier;
>       }
>   
> @@ -528,7 +531,7 @@ static void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
>   
>       vector->vdev = vdev;
>       vector->virq = -1;
> -    if (!vfio_notifier_init(&vector->interrupt, "interrupt", &err)) {
> +    if (!vfio_notifier_init(vdev, &vector->interrupt, "interrupt", nr, &err)) {
>           error_report_err(err);
>       }
>       vector->use = true;
> @@ -765,7 +768,8 @@ retry:
>           vector->virq = -1;
>           vector->use = true;
>   
> -        if (!vfio_notifier_init(&vector->interrupt, "interrupt", &err)) {
> +        if (!vfio_notifier_init(vdev, &vector->interrupt, "interrupt", i,
> +                                &err)) {
>               error_report_err(err);
>           }
>   
> @@ -2918,7 +2922,8 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
>           return;
>       }
>   
> -    if (!vfio_notifier_init(&vdev->err_notifier, "err_notifier", &err)) {
> +    if (!vfio_notifier_init(vdev, &vdev->err_notifier, "err_notifier", 0,
> +                            &err)) {
>           error_report_err(err);
>           vdev->pci_aer = false;
>           return;
> @@ -2985,7 +2990,8 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
>           return;
>       }
>   
> -    if (!vfio_notifier_init(&vdev->req_notifier, "req_notifier", &err)) {
> +    if (!vfio_notifier_init(vdev, &vdev->req_notifier, "req_notifier", 0,
> +                            &err)) {
>           error_report_err(err);
>           return;
>       }



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH V3 20/42] vfio/pci: vfio_notifier_cleanup
  2025-05-12 15:32 ` [PATCH V3 20/42] vfio/pci: vfio_notifier_cleanup Steve Sistare
@ 2025-05-16  8:30   ` Cédric Le Goater
  0 siblings, 0 replies; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-16  8:30 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> Move event_notifier_cleanup calls to a helper vfio_notifier_cleanup.
> This version is trivial, and does not yet use the vdev and nr parameters.
> No functional change.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.

	
> ---
>   hw/vfio/pci.c | 28 +++++++++++++++++-----------
>   1 file changed, 17 insertions(+), 11 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index bfeaafa..d2b08a3 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -67,6 +67,12 @@ static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
>       return !ret;
>   }
>   
> +static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
> +                                  const char *name, int nr)
> +{
> +    event_notifier_cleanup(e);
> +}
> +
>   /*
>    * Disabling BAR mmaping can be slow, but toggling it around INTx can
>    * also be a huge overhead.  We try to get the best of both worlds by
> @@ -179,7 +185,7 @@ fail_vfio:
>       kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vdev->intx.interrupt,
>                                             vdev->intx.route.irq);
>   fail_irqfd:
> -    event_notifier_cleanup(&vdev->intx.unmask);
> +    vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
>   fail:
>       qemu_set_fd_handler(irq_fd, vfio_intx_interrupt, NULL, vdev);
>       vfio_device_irq_unmask(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
> @@ -211,7 +217,7 @@ static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
>       }
>   
>       /* We only need to close the eventfd for VFIO to cleanup the kernel side */
> -    event_notifier_cleanup(&vdev->intx.unmask);
> +    vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
>   
>       /* QEMU starts listening for interrupt events. */
>       qemu_set_fd_handler(event_notifier_get_fd(&vdev->intx.interrupt),
> @@ -310,7 +316,7 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>       if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
>                                   VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
>           qemu_set_fd_handler(fd, NULL, NULL, vdev);
> -        event_notifier_cleanup(&vdev->intx.interrupt);
> +        vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
>           return false;
>       }
>   
> @@ -337,7 +343,7 @@ static void vfio_intx_disable(VFIOPCIDevice *vdev)
>   
>       fd = event_notifier_get_fd(&vdev->intx.interrupt);
>       qemu_set_fd_handler(fd, NULL, NULL, vdev);
> -    event_notifier_cleanup(&vdev->intx.interrupt);
> +    vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
>   
>       vdev->interrupt = VFIO_INT_NONE;
>   
> @@ -500,7 +506,7 @@ static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector, int nr)
>       return;
>   
>   fail_kvm:
> -    event_notifier_cleanup(&vector->kvm_interrupt);
> +    vfio_notifier_cleanup(vector->vdev, &vector->kvm_interrupt, name, nr);
>   fail_notifier:
>       kvm_irqchip_release_virq(kvm_state, vector->virq);
>       vector->virq = -1;
> @@ -513,7 +519,7 @@ static void vfio_remove_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
>                                             vector->virq);
>       kvm_irqchip_release_virq(kvm_state, vector->virq);
>       vector->virq = -1;
> -    event_notifier_cleanup(&vector->kvm_interrupt);
> +    vfio_notifier_cleanup(vdev, &vector->kvm_interrupt, "kvm_interrupt", nr);
>   }
>   
>   static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
> @@ -830,7 +836,7 @@ static void vfio_msi_disable_common(VFIOPCIDevice *vdev)
>               }
>               qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
>                                   NULL, NULL, NULL);
> -            event_notifier_cleanup(&vector->interrupt);
> +            vfio_notifier_cleanup(vdev, &vector->interrupt, "interrupt", i);
>           }
>       }
>   
> @@ -2936,7 +2942,7 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
>                                          VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
>           error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
>           qemu_set_fd_handler(fd, NULL, NULL, vdev);
> -        event_notifier_cleanup(&vdev->err_notifier);
> +        vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
>           vdev->pci_aer = false;
>       }
>   }
> @@ -2955,7 +2961,7 @@ static void vfio_unregister_err_notifier(VFIOPCIDevice *vdev)
>       }
>       qemu_set_fd_handler(event_notifier_get_fd(&vdev->err_notifier),
>                           NULL, NULL, vdev);
> -    event_notifier_cleanup(&vdev->err_notifier);
> +    vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
>   }
>   
>   static void vfio_req_notifier_handler(void *opaque)
> @@ -3003,7 +3009,7 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
>                                          VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
>           error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
>           qemu_set_fd_handler(fd, NULL, NULL, vdev);
> -        event_notifier_cleanup(&vdev->req_notifier);
> +        vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
>       } else {
>           vdev->req_enabled = true;
>       }
> @@ -3023,7 +3029,7 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>       }
>       qemu_set_fd_handler(event_notifier_get_fd(&vdev->req_notifier),
>                           NULL, NULL, vdev);
> -    event_notifier_cleanup(&vdev->req_notifier);
> +    vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
>   
>       vdev->req_enabled = false;
>   }



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH V3 21/42] vfio/pci: export MSI functions
  2025-05-12 15:32 ` [PATCH V3 21/42] vfio/pci: export MSI functions Steve Sistare
@ 2025-05-16  8:31   ` Cédric Le Goater
  2025-05-16 17:58     ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-16  8:31 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> Export various MSI functions, for use by CPR in subsequent patches.
> No functional change.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Please rename these routines with a 'vfio_pci' prefix.


Thanks,

C.



> ---
>   hw/vfio/pci.c | 21 ++++++++++-----------
>   hw/vfio/pci.h | 12 ++++++++++++
>   2 files changed, 22 insertions(+), 11 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index d2b08a3..1bca415 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -279,7 +279,7 @@ static void vfio_irqchip_change(Notifier *notify, void *data)
>       vfio_intx_update(vdev, &vdev->intx.route);
>   }
>   
> -static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
> +bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>   {
>       uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
>       Error *err = NULL;
> @@ -353,7 +353,7 @@ static void vfio_intx_disable(VFIOPCIDevice *vdev)
>   /*
>    * MSI/X
>    */
> -static void vfio_msi_interrupt(void *opaque)
> +void vfio_msi_interrupt(void *opaque)
>   {
>       VFIOMSIVector *vector = opaque;
>       VFIOPCIDevice *vdev = vector->vdev;
> @@ -474,8 +474,8 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
>       return ret;
>   }
>   
> -static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
> -                                  int vector_n, bool msix)
> +void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
> +                           int vector_n, bool msix)
>   {
>       if ((msix && vdev->no_kvm_msix) || (!msix && vdev->no_kvm_msi)) {
>           return;
> @@ -529,7 +529,7 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
>       kvm_irqchip_commit_routes(kvm_state);
>   }
>   
> -static void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
> +void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
>   {
>       VFIOMSIVector *vector = &vdev->msi_vectors[nr];
>       PCIDevice *pdev = &vdev->pdev;
> @@ -641,13 +641,12 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>       return 0;
>   }
>   
> -static int vfio_msix_vector_use(PCIDevice *pdev,
> -                                unsigned int nr, MSIMessage msg)
> +int vfio_msix_vector_use(PCIDevice *pdev, unsigned int nr, MSIMessage msg)
>   {
>       return vfio_msix_vector_do_use(pdev, nr, &msg, vfio_msi_interrupt);
>   }
>   
> -static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
> +void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
>   {
>       VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
>       VFIOMSIVector *vector = &vdev->msi_vectors[nr];
> @@ -674,14 +673,14 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
>       }
>   }
>   
> -static void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
> +void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
>   {
>       assert(!vdev->defer_kvm_irq_routing);
>       vdev->defer_kvm_irq_routing = true;
>       vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
>   }
>   
> -static void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
> +void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
>   {
>       int i;
>   
> @@ -2632,7 +2631,7 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
>       return OBJECT(vdev);
>   }
>   
> -static bool vfio_msix_present(void *opaque, int version_id)
> +bool vfio_msix_present(void *opaque, int version_id)
>   {
>       PCIDevice *pdev = opaque;
>   
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 5ce0fb9..c892054 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -210,6 +210,18 @@ static inline bool vfio_is_vga(VFIOPCIDevice *vdev)
>       return class == PCI_CLASS_DISPLAY_VGA;
>   }
>   
> +/* MSI/MSI-X/INTx */
> +void vfio_vector_init(VFIOPCIDevice *vdev, int nr);
> +void vfio_msi_interrupt(void *opaque);
> +void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
> +                           int vector_n, bool msix);
> +int vfio_msix_vector_use(PCIDevice *pdev, unsigned int nr, MSIMessage msg);
> +void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr);
> +bool vfio_msix_present(void *opaque, int version_id);
> +void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
> +void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
> +bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp);
> +
>   uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
>   void vfio_pci_write_config(PCIDevice *pdev,
>                              uint32_t addr, uint32_t val, int len);



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH V3 16/42] vfio/pci: vfio_vector_init
  2025-05-12 15:32 ` [PATCH V3 16/42] vfio/pci: vfio_vector_init Steve Sistare
@ 2025-05-16  8:32   ` Cédric Le Goater
  0 siblings, 0 replies; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-16  8:32 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> Extract a subroutine vfio_vector_init.  No functional change.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   hw/vfio/pci.c | 24 +++++++++++++++++-------
>   1 file changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 4aa83b1..b46c42e 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -511,6 +511,22 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
>       kvm_irqchip_commit_routes(kvm_state);
>   }
>   
> +static void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
> +{
> +    VFIOMSIVector *vector = &vdev->msi_vectors[nr];
> +    PCIDevice *pdev = &vdev->pdev;
> +
> +    vector->vdev = vdev;
> +    vector->virq = -1;
> +    if (event_notifier_init(&vector->interrupt, 0)) {
> +        error_report("vfio: Error: event_notifier_init failed");
> +    }
> +    vector->use = true;
> +    if (vdev->interrupt == VFIO_INT_MSIX) {

You seem to have plans to use vfio_vector_init elsewhere (patch 22).

Looks ok for now.

Thanks,

C.

  


> +        msix_vector_use(pdev, nr);
> +    }
> +}
> +
>   static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>                                      MSIMessage *msg, IOHandler *handler)
>   {
> @@ -524,13 +540,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>       vector = &vdev->msi_vectors[nr];
>   
>       if (!vector->use) {
> -        vector->vdev = vdev;
> -        vector->virq = -1;
> -        if (event_notifier_init(&vector->interrupt, 0)) {
> -            error_report("vfio: Error: event_notifier_init failed");
> -        }
> -        vector->use = true;
> -        msix_vector_use(pdev, nr);
> +        vfio_vector_init(vdev, nr);
>       }
>   
>       qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),



^ permalink raw reply	[flat|nested] 157+ messages in thread

* Re: [PATCH V3 24/42] migration: close kvm after cpr
  2025-05-12 15:32 ` [PATCH V3 24/42] migration: close kvm after cpr Steve Sistare
@ 2025-05-16  8:35   ` Cédric Le Goater
  2025-05-16 17:14     ` Peter Xu
  2025-05-16 18:18     ` Steven Sistare
  0 siblings, 2 replies; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-16  8:35 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> cpr-transfer breaks vfio network connectivity to and from the guest, and
> the host system log shows:
>    irq bypass consumer (token 00000000a03c32e5) registration fails: -16
> which is EBUSY.  This occurs because KVM descriptors are still open in
> the old QEMU process.  Close them.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

This patch doesn't build.

/usr/bin/ld: libcommon.a.p/migration_cpr.c.o: in function `cpr_kvm_close':
./build/../migration/cpr.c:260: undefined reference to `kvm_close'



Thanks,

C.



> ---
>   accel/kvm/kvm-all.c           | 28 ++++++++++++++++++++++++++++
>   hw/vfio/helpers.c             | 10 ++++++++++
>   include/hw/vfio/vfio-device.h |  2 ++
>   include/migration/cpr.h       |  2 ++
>   include/qemu/vfio-helpers.h   |  1 -
>   include/system/kvm.h          |  1 +
>   migration/cpr-transfer.c      | 18 ++++++++++++++++++
>   migration/cpr.c               |  8 ++++++++
>   migration/migration.c         |  1 +
>   9 files changed, 70 insertions(+), 1 deletion(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 278a506..d619448 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -512,16 +512,23 @@ static int do_kvm_destroy_vcpu(CPUState *cpu)
>           goto err;
>       }
>   
> +    /* If I am the CPU that created coalesced_mmio_ring, then discard it */
> +    if (s->coalesced_mmio_ring == (void *)cpu->kvm_run + PAGE_SIZE) {
> +        s->coalesced_mmio_ring = NULL;
> +    }
> +
>       ret = munmap(cpu->kvm_run, mmap_size);
>       if (ret < 0) {
>           goto err;
>       }
> +    cpu->kvm_run = NULL;
>   
>       if (cpu->kvm_dirty_gfns) {
>           ret = munmap(cpu->kvm_dirty_gfns, s->kvm_dirty_ring_bytes);
>           if (ret < 0) {
>               goto err;
>           }
> +        cpu->kvm_dirty_gfns = NULL;
>       }
>   
>       kvm_park_vcpu(cpu);
> @@ -600,6 +607,27 @@ err:
>       return ret;
>   }
>   
> +void kvm_close(void)
> +{
> +    CPUState *cpu;
> +
> +    CPU_FOREACH(cpu) {
> +        cpu_remove_sync(cpu);
> +        close(cpu->kvm_fd);
> +        cpu->kvm_fd = -1;
> +        close(cpu->kvm_vcpu_stats_fd);
> +        cpu->kvm_vcpu_stats_fd = -1;
> +    }
> +
> +    if (kvm_state && kvm_state->fd != -1) {
> +        close(kvm_state->vmfd);
> +        kvm_state->vmfd = -1;
> +        close(kvm_state->fd);
> +        kvm_state->fd = -1;
> +    }
> +    kvm_state = NULL;
> +}
> +
>   /*
>    * dirty pages logging control
>    */
> diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
> index d0dbab1..af1db2f 100644
> --- a/hw/vfio/helpers.c
> +++ b/hw/vfio/helpers.c
> @@ -117,6 +117,16 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
>   int vfio_kvm_device_fd = -1;
>   #endif
>   
> +void vfio_kvm_device_close(void)
> +{
> +#ifdef CONFIG_KVM
> +    if (vfio_kvm_device_fd != -1) {
> +        close(vfio_kvm_device_fd);
> +        vfio_kvm_device_fd = -1;
> +    }
> +#endif
> +}
> +
>   int vfio_kvm_device_add_fd(int fd, Error **errp)
>   {
>   #ifdef CONFIG_KVM
> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
> index 4e4d0b6..6eb6f21 100644
> --- a/include/hw/vfio/vfio-device.h
> +++ b/include/hw/vfio/vfio-device.h
> @@ -231,4 +231,6 @@ void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
>   void vfio_device_init(VFIODevice *vbasedev, int type, VFIODeviceOps *ops,
>                         DeviceState *dev, bool ram_discard);
>   int vfio_device_get_aw_bits(VFIODevice *vdev);
> +
> +void vfio_kvm_device_close(void);
>   #endif /* HW_VFIO_VFIO_COMMON_H */
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> index fc6aa33..5f1ff10 100644
> --- a/include/migration/cpr.h
> +++ b/include/migration/cpr.h
> @@ -31,7 +31,9 @@ void cpr_state_close(void);
>   struct QIOChannel *cpr_state_ioc(void);
>   
>   bool cpr_needed_for_reuse(void *opaque);
> +void cpr_kvm_close(void);
>   
> +void cpr_transfer_init(void);
>   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>   QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>   
> diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
> index bde9495..a029036 100644
> --- a/include/qemu/vfio-helpers.h
> +++ b/include/qemu/vfio-helpers.h
> @@ -28,5 +28,4 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar,
>                                uint64_t offset, uint64_t size);
>   int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
>                              int irq_type, Error **errp);
> -
>   #endif
> diff --git a/include/system/kvm.h b/include/system/kvm.h
> index b690dda..cfaa94c 100644
> --- a/include/system/kvm.h
> +++ b/include/system/kvm.h
> @@ -194,6 +194,7 @@ bool kvm_has_sync_mmu(void);
>   int kvm_has_vcpu_events(void);
>   int kvm_max_nested_state_length(void);
>   int kvm_has_gsi_routing(void);
> +void kvm_close(void);
>   
>   /**
>    * kvm_arm_supports_user_irq
> diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c
> index e1f1403..396558f 100644
> --- a/migration/cpr-transfer.c
> +++ b/migration/cpr-transfer.c
> @@ -17,6 +17,24 @@
>   #include "migration/vmstate.h"
>   #include "trace.h"
>   
> +static int cpr_transfer_notifier(NotifierWithReturn *notifier,
> +                                 MigrationEvent *e,
> +                                 Error **errp)
> +{
> +    if (e->type == MIG_EVENT_PRECOPY_DONE) {
> +        cpr_kvm_close();
> +    }
> +    return 0;
> +}
> +
> +void cpr_transfer_init(void)
> +{
> +    static NotifierWithReturn notifier;
> +
> +    migration_add_notifier_mode(&notifier, cpr_transfer_notifier,
> +                                MIG_MODE_CPR_TRANSFER);
> +}
> +
>   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp)
>   {
>       MigrationAddress *addr = channel->addr;
> diff --git a/migration/cpr.c b/migration/cpr.c
> index 0b01e25..6102d04 100644
> --- a/migration/cpr.c
> +++ b/migration/cpr.c
> @@ -7,12 +7,14 @@
>   
>   #include "qemu/osdep.h"
>   #include "qapi/error.h"
> +#include "hw/vfio/vfio-device.h"
>   #include "migration/cpr.h"
>   #include "migration/misc.h"
>   #include "migration/options.h"
>   #include "migration/qemu-file.h"
>   #include "migration/savevm.h"
>   #include "migration/vmstate.h"
> +#include "system/kvm.h"
>   #include "system/runstate.h"
>   #include "trace.h"
>   
> @@ -252,3 +254,9 @@ bool cpr_needed_for_reuse(void *opaque)
>       MigMode mode = migrate_mode();
>       return mode == MIG_MODE_CPR_TRANSFER;
>   }
> +
> +void cpr_kvm_close(void)
> +{
> +    kvm_close();
> +    vfio_kvm_device_close();
> +}
> diff --git a/migration/migration.c b/migration/migration.c
> index 4697732..89e2026 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -337,6 +337,7 @@ void migration_object_init(void)
>   
>       ram_mig_init();
>       dirty_bitmap_mig_init();
> +    cpr_transfer_init();
>   
>       /* Initialize cpu throttle timers */
>       cpu_throttle_init();



^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [PATCH V3 30/42] physmem: qemu_ram_get_fd_offset
  2025-05-12 15:32 ` [PATCH V3 30/42] physmem: qemu_ram_get_fd_offset Steve Sistare
@ 2025-05-16  8:40   ` Duan, Zhenzhong
  0 siblings, 0 replies; 157+ messages in thread
From: Duan, Zhenzhong @ 2025-05-16  8:40 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V3 30/42] physmem: qemu_ram_get_fd_offset
>
>Define qemu_ram_get_fd_offset, so CPR can map a memory region using
>IOMMU_IOAS_MAP_FILE in a subsequent patch.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>Reviewed-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>

Zhenzhong


^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [PATCH V3 29/42] backends/iommufd: change process ioctl
  2025-05-12 15:32 ` [PATCH V3 29/42] backends/iommufd: change process ioctl Steve Sistare
@ 2025-05-16  8:42   ` Duan, Zhenzhong
  2025-05-19 15:51     ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Duan, Zhenzhong @ 2025-05-16  8:42 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V3 29/42] backends/iommufd: change process ioctl
>
>Define the change process ioctl
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>---
> backends/iommufd.c       | 20 ++++++++++++++++++++
> backends/trace-events    |  1 +
> include/system/iommufd.h |  2 ++
> 3 files changed, 23 insertions(+)
>
>diff --git a/backends/iommufd.c b/backends/iommufd.c
>index 5c1958f..6fed1c1 100644
>--- a/backends/iommufd.c
>+++ b/backends/iommufd.c
>@@ -73,6 +73,26 @@ static void iommufd_backend_class_init(ObjectClass *oc, const void *data)
>     object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
> }
>
>+bool iommufd_change_process_capable(IOMMUFDBackend *be)
>+{
>+    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>+
>+    return !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>+}
>+
>+bool iommufd_change_process(IOMMUFDBackend *be, Error **errp)
>+{
>+    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>+    bool ret = !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);

This is the same ioctl as the capability check above. Can it be called more than once for the same process?

>+
>+    if (!ret) {
>+        error_setg_errno(errp, errno, "IOMMU_IOAS_CHANGE_PROCESS fd %d failed",
>+                         be->fd);
>+    }
>+    trace_iommufd_change_process(be->fd, ret);
>+    return ret;
>+}
>+
> bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
> {
>     int fd;
>diff --git a/backends/trace-events b/backends/trace-events
>index f478e18..5ccdf90 100644
>--- a/backends/trace-events
>+++ b/backends/trace-events
>@@ -7,6 +7,7 @@ dbus_vmstate_loading(const char *id) "id: %s"
> dbus_vmstate_saving(const char *id) "id: %s"
>
> # iommufd.c
>+iommufd_change_process(int fd, bool ret) "fd=%d (%d)"
> iommufd_backend_connect(int fd, bool owned, uint32_t users) "fd=%d owned=%d users=%d"
> iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
> iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
>diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>index ac700b8..db9ed53 100644
>--- a/include/system/iommufd.h
>+++ b/include/system/iommufd.h
>@@ -64,6 +64,8 @@ bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t hwpt_id,
>                                       uint64_t iova, ram_addr_t size,
>                                       uint64_t page_size, uint64_t *data,
>                                       Error **errp);
>+bool iommufd_change_process_capable(IOMMUFDBackend *be);
>+bool iommufd_change_process(IOMMUFDBackend *be, Error **errp);
>
> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
> #endif
>--
>1.8.3.1



^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [PATCH V3 31/42] vfio/iommufd: use IOMMU_IOAS_MAP_FILE
  2025-05-12 15:32 ` [PATCH V3 31/42] vfio/iommufd: use IOMMU_IOAS_MAP_FILE Steve Sistare
@ 2025-05-16  8:48   ` Duan, Zhenzhong
  2025-05-19 15:52     ` Steven Sistare
  2025-05-20 12:27   ` Cédric Le Goater
  1 sibling, 1 reply; 157+ messages in thread
From: Duan, Zhenzhong @ 2025-05-16  8:48 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V3 31/42] vfio/iommufd: use IOMMU_IOAS_MAP_FILE
>
>Use IOMMU_IOAS_MAP_FILE when the mapped region is backed by a file.
>Such a mapping can be preserved without modification during CPR,
>because it depends on the file's address space, which does not change,
>rather than on the process's address space, which does change.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>---
> hw/vfio/container-base.c              |  9 +++++++++
> hw/vfio/iommufd.c                     | 13 +++++++++++++
> include/hw/vfio/vfio-container-base.h |  3 +++
> 3 files changed, 25 insertions(+)
>
>diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
>index 8f43bc8..72a51a6 100644
>--- a/hw/vfio/container-base.c
>+++ b/hw/vfio/container-base.c
>@@ -79,7 +79,16 @@ int vfio_container_dma_map(VFIOContainerBase *bcontainer,
>                            RAMBlock *rb)
> {
>     VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
>+    int mfd = rb ? qemu_ram_get_fd(rb) : -1;
>
>+    if (mfd >= 0 && vioc->dma_map_file) {
>+        unsigned long start = vaddr - qemu_ram_get_host_addr(rb);
>+        unsigned long offset = qemu_ram_get_fd_offset(rb);
>+
>+        vioc->dma_map_file(bcontainer, iova, size, mfd, start + offset,
>+                           readonly);

Shouldn't we return the result to the call site?
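
For illustration only, a minimal standalone sketch of propagating that result. The names below (struct container_ops, map_region, fake_map, fake_map_file) are hypothetical stand-ins and not QEMU's API. The only point is that the dispatcher returns whatever the file-mapping callback returns instead of an unconditional 0; the offset arithmetic mirrors the start + offset computation in the hunk above.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the container class callbacks. */
struct container_ops {
    int (*map)(uint64_t iova, uint64_t size, void *vaddr);
    int (*map_file)(uint64_t iova, uint64_t size, int fd, uint64_t offset);
};

/* Records the file offset seen by the hypothetical file-mapping callback. */
static uint64_t last_file_offset;

/* Hypothetical VA-based callback that always succeeds. */
static int fake_map(uint64_t iova, uint64_t size, void *vaddr)
{
    (void)iova; (void)size; (void)vaddr;
    return 0;
}

/* Hypothetical file-based callback that always fails with -EINVAL. */
static int fake_map_file(uint64_t iova, uint64_t size, int fd, uint64_t offset)
{
    (void)iova; (void)size; (void)fd;
    last_file_offset = offset;
    return -EINVAL;
}

/*
 * Map by file when the region has a backing fd and the container
 * supports it; otherwise fall back to mapping by virtual address.
 * Either way, the callback's return value reaches the caller.
 */
static int map_region(const struct container_ops *ops, uint64_t iova,
                      uint64_t size, void *vaddr, void *host_base,
                      int fd, uint64_t fd_offset)
{
    if (fd >= 0 && ops->map_file) {
        /* Offset of vaddr within the block, plus the block's file offset. */
        uint64_t start = (uint64_t)((char *)vaddr - (char *)host_base);

        return ops->map_file(iova, size, fd, start + fd_offset);
    }
    return ops->map(iova, size, vaddr);
}
```

With this shape, a failure in the file-backed path is visible to the caller in the same way as a failure in the VA-based path.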

>+        return 0;
>+    }
>     g_assert(vioc->dma_map);
>     return vioc->dma_map(bcontainer, iova, size, vaddr, readonly);
> }
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index 167bda4..6eb417a 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -44,6 +44,18 @@ static int iommufd_cdev_map(const VFIOContainerBase *bcontainer, hwaddr iova,
>                                    iova, size, vaddr, readonly);
> }
>
>+static int iommufd_cdev_map_file(const VFIOContainerBase *bcontainer,
>+                                 hwaddr iova, ram_addr_t size,
>+                                 int fd, unsigned long start, bool readonly)
>+{
>+    const VFIOIOMMUFDContainer *container =
>+        container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
>+
>+    return iommufd_backend_map_file_dma(container->be,
>+                                        container->ioas_id,
>+                                        iova, size, fd, start, readonly);
>+}
>+
> static int iommufd_cdev_unmap(const VFIOContainerBase *bcontainer,
>                               hwaddr iova, ram_addr_t size,
>                               IOMMUTLBEntry *iotlb, bool unmap_all)
>@@ -802,6 +814,7 @@ static void vfio_iommu_iommufd_class_init(ObjectClass *klass, const void *data)
>     VFIOIOMMUClass *vioc = VFIO_IOMMU_CLASS(klass);
>
>     vioc->dma_map = iommufd_cdev_map;
>+    vioc->dma_map_file = iommufd_cdev_map_file;
>     vioc->dma_unmap = iommufd_cdev_unmap;
>     vioc->attach_device = iommufd_cdev_attach;
>     vioc->detach_device = iommufd_cdev_detach;
>diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
>index 03b3f9c..f30f828 100644
>--- a/include/hw/vfio/vfio-container-base.h
>+++ b/include/hw/vfio/vfio-container-base.h
>@@ -123,6 +123,9 @@ struct VFIOIOMMUClass {
>     int (*dma_map)(const VFIOContainerBase *bcontainer,
>                    hwaddr iova, ram_addr_t size,
>                    void *vaddr, bool readonly);
>+    int (*dma_map_file)(const VFIOContainerBase *bcontainer,
>+                        hwaddr iova, ram_addr_t size,
>+                        int fd, unsigned long start, bool readonly);
>     /**
>      * @dma_unmap
>      *
>--
>1.8.3.1



^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [PATCH V3 33/42] vfio/iommufd: define hwpt constructors
  2025-05-12 15:32 ` [PATCH V3 33/42] vfio/iommufd: define hwpt constructors Steve Sistare
@ 2025-05-16  8:55   ` Duan, Zhenzhong
  2025-05-19 15:55     ` Steven Sistare
  2025-05-20 12:34     ` Cédric Le Goater
  0 siblings, 2 replies; 157+ messages in thread
From: Duan, Zhenzhong @ 2025-05-16  8:55 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V3 33/42] vfio/iommufd: define hwpt constructors
>
>Extract hwpt creation code from iommufd_cdev_autodomains_get into the
>helpers iommufd_cdev_use_hwpt and iommufd_cdev_make_hwpt.  These will
>be used by CPR in a subsequent patch.
>
>Call vfio_device_hiod_create_and_realize earlier so iommufd_cdev_make_hwpt
>can use vbasedev->hiod hw_caps, avoiding an extra call to
>iommufd_backend_get_device_info

We had reached consensus to realize hiod after attachment.
It's not a hot path, so an extra call is acceptable per Cedric.

>
>No functional change.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>---
> hw/vfio/iommufd.c | 116 ++++++++++++++++++++++++++++++------------------------
> 1 file changed, 65 insertions(+), 51 deletions(-)
>
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index f645a62..8661947 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -310,16 +310,70 @@ static bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>     return true;
> }
>
>+static void iommufd_cdev_use_hwpt(VFIODevice *vbasedev, VFIOIOASHwpt *hwpt)
>+{
>+    vbasedev->hwpt = hwpt;
>+    vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
>+    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>+}
>+
>+/*
>+ * iommufd_cdev_make_hwpt: If @alloc_id, allocate a hwpt_id, else use @hwpt_id.
>+ * Create and add a hwpt struct to the container's list and to the device.
>+ * Always succeeds if !@alloc_id.
>+ */
>+static bool iommufd_cdev_make_hwpt(VFIODevice *vbasedev,
>+                                   VFIOIOMMUFDContainer *container,
>+                                   uint32_t hwpt_id, bool alloc_id,
>+                                   Error **errp)
>+{
>+    VFIOIOASHwpt *hwpt;
>+    uint32_t flags = 0;
>+
>+    /*
>+     * This is quite early and VFIO Migration state isn't yet fully
>+     * initialized, thus rely only on IOMMU hardware capabilities as to
>+     * whether IOMMU dirty tracking is going to be requested. Later
>+     * vfio_migration_realize() may decide to use VF dirty tracking
>+     * instead.
>+     */
>+    g_assert(vbasedev->hiod);
>+    if (vbasedev->hiod->caps.hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
>+        flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>+    }
>+
>+    if (alloc_id) {
>+        if (!iommufd_backend_alloc_hwpt(vbasedev->iommufd, vbasedev->devid,
>+                                        container->ioas_id, flags,
>+                                        IOMMU_HWPT_DATA_NONE, 0, NULL,
>+                                        &hwpt_id, errp)) {
>+            return false;
>+        }
>+
>+        if (iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt_id, errp)) {
>+            iommufd_backend_free_id(container->be, hwpt_id);
>+            return false;
>+        }
>+    }
>+
>+    hwpt = g_malloc0(sizeof(*hwpt));
>+    hwpt->hwpt_id = hwpt_id;
>+    hwpt->hwpt_flags = flags;
>+    QLIST_INIT(&hwpt->device_list);
>+
>+    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
>+    container->bcontainer.dirty_pages_supported |=
>+                                vbasedev->iommu_dirty_tracking;
>+    iommufd_cdev_use_hwpt(vbasedev, hwpt);
>+    return true;
>+}
>+
> static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>                                          VFIOIOMMUFDContainer *container,
>                                          Error **errp)
> {
>     ERRP_GUARD();
>-    IOMMUFDBackend *iommufd = vbasedev->iommufd;
>-    uint32_t type, flags = 0;
>-    uint64_t hw_caps;
>     VFIOIOASHwpt *hwpt;
>-    uint32_t hwpt_id;
>     int ret;
>
>     /* Try to find a domain */
>@@ -340,54 +394,14 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>
>             return false;
>         } else {
>-            vbasedev->hwpt = hwpt;
>-            QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>-            vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
>+            iommufd_cdev_use_hwpt(vbasedev, hwpt);
>             return true;
>         }
>     }
>-
>-    /*
>-     * This is quite early and VFIO Migration state isn't yet fully
>-     * initialized, thus rely only on IOMMU hardware capabilities as to
>-     * whether IOMMU dirty tracking is going to be requested. Later
>-     * vfio_migration_realize() may decide to use VF dirty tracking
>-     * instead.
>-     */
>-    if (!iommufd_backend_get_device_info(vbasedev->iommufd, vbasedev->devid,
>-                                         &type, NULL, 0, &hw_caps, errp)) {
>-        return false;
>-    }
>-
>-    if (hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
>-        flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>-    }
>-
>-    if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
>-                                    container->ioas_id, flags,
>-                                    IOMMU_HWPT_DATA_NONE, 0, NULL,
>-                                    &hwpt_id, errp)) {
>-        return false;
>-    }
>-
>-    hwpt = g_malloc0(sizeof(*hwpt));
>-    hwpt->hwpt_id = hwpt_id;
>-    hwpt->hwpt_flags = flags;
>-    QLIST_INIT(&hwpt->device_list);
>-
>-    ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
>-    if (ret) {
>-        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>-        g_free(hwpt);
>+    if (!iommufd_cdev_make_hwpt(vbasedev, container, 0, true, errp)) {
>         return false;
>     }
>
>-    vbasedev->hwpt = hwpt;
>-    vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
>-    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>-    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
>-    container->bcontainer.dirty_pages_supported |=
>-                                vbasedev->iommu_dirty_tracking;
>     if (container->bcontainer.dirty_pages_supported &&
>         !vbasedev->iommu_dirty_tracking) {
>         warn_report("IOMMU instance for device %s doesn't support dirty tracking",
>@@ -530,6 +544,11 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>
>     space = vfio_address_space_get(as);
>
>+    if (!vfio_device_hiod_create_and_realize(vbasedev,
>+            TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO, errp)) {
>+        goto err_alloc_ioas;
>+    }
>+
>     /* try to attach to an existing container in this space */
>     QLIST_FOREACH(bcontainer, &space->containers, next) {
>         container = container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
>@@ -604,11 +623,6 @@ found_container:
>         goto err_listener_register;
>     }
>
>-    if (!vfio_device_hiod_create_and_realize(vbasedev,
>-                     TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO, errp)) {
>-        goto err_listener_register;
>-    }
>-
>     /*
>      * TODO: examine RAM_BLOCK_DISCARD stuff, should we do group level
>      * for discarding incompatibility check as well?
>--
>1.8.3.1



^ permalink raw reply	[flat|nested] 157+ messages in thread

* RE: [PATCH V3 34/42] vfio/iommufd: invariant device name
  2025-05-12 15:32 ` [PATCH V3 34/42] vfio/iommufd: invariant device name Steve Sistare
@ 2025-05-16  9:29   ` Duan, Zhenzhong
  2025-05-19 15:52     ` Steven Sistare
  2025-05-20 13:55   ` Cédric Le Goater
  1 sibling, 1 reply; 157+ messages in thread
From: Duan, Zhenzhong @ 2025-05-16  9:29 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V3 34/42] vfio/iommufd: invariant device name
>
>cpr-transfer will use the device name as a key to find the value
>of the device descriptor in new QEMU.  However, if the descriptor
>number is specified by a command-line fd parameter, then
>vfio_device_get_name creates a name that includes the fd number.
>This causes a chicken-and-egg problem: new QEMU must know the fd
>number to construct a name to find the fd number.
>
>To fix, create an invariant name based on the id command-line
>parameter.  If id is not defined, add a CPR blocker.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>---
> hw/vfio/cpr.c              | 21 +++++++++++++++++++++
> hw/vfio/device.c           | 10 ++++------
> hw/vfio/iommufd.c          |  2 ++
> include/hw/vfio/vfio-cpr.h |  4 ++++
> 4 files changed, 31 insertions(+), 6 deletions(-)
>
>diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>index 6081a89..7609c62 100644
>--- a/hw/vfio/cpr.c
>+++ b/hw/vfio/cpr.c
>@@ -11,6 +11,7 @@
> #include "hw/vfio/pci.h"
> #include "hw/pci/msix.h"
> #include "hw/pci/msi.h"
>+#include "migration/blocker.h"
> #include "migration/cpr.h"
> #include "qapi/error.h"
> #include "system/runstate.h"
>@@ -184,3 +185,23 @@ const VMStateDescription vfio_cpr_pci_vmstate = {
>         VMSTATE_END_OF_LIST()
>     }
> };
>+
>+bool vfio_cpr_set_device_name(VFIODevice *vbasedev, Error **errp)
>+{
>+    if (vbasedev->dev->id) {
>+        vbasedev->name = g_strdup(vbasedev->dev->id);
>+        return true;
>+    } else {
>+        /*
>+         * Assign a name so any function printing it will not break, but the
>+         * fd number changes across processes, so this cannot be used as an
>+         * invariant name for CPR.
>+         */
>+        vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
>+        error_setg(&vbasedev->cpr.id_blocker,
>+                   "vfio device with fd=%d needs an id property",
>+                   vbasedev->fd);
>+        return migrate_add_blocker_modes(&vbasedev->cpr.id_blocker, errp,
>+                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
>+    }
>+}
>diff --git a/hw/vfio/device.c b/hw/vfio/device.c
>index 9fba2c7..8e9de68 100644
>--- a/hw/vfio/device.c
>+++ b/hw/vfio/device.c
>@@ -28,6 +28,7 @@
> #include "qapi/error.h"
> #include "qemu/error-report.h"
> #include "qemu/units.h"
>+#include "migration/cpr.h"
> #include "monitor/monitor.h"
> #include "vfio-helpers.h"
>
>@@ -284,6 +285,7 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
> {
>     ERRP_GUARD();
>     struct stat st;
>+    bool ret = true;
>
>     if (vbasedev->fd < 0) {
>         if (stat(vbasedev->sysfsdev, &st) < 0) {
>@@ -300,16 +302,12 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
>             error_setg(errp, "Use FD passing only with iommufd backend");
>             return false;
>         }
>-        /*
>-         * Give a name with fd so any function printing out vbasedev->name
>-         * will not break.
>-         */
>         if (!vbasedev->name) {
>-            vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
>+            ret = vfio_cpr_set_device_name(vbasedev, errp);
>         }
>     }
>
>-    return true;
>+    return ret;
> }
>
> void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index 8661947..ea99b8d 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -25,6 +25,7 @@
> #include "system/reset.h"
> #include "qemu/cutils.h"
> #include "qemu/chardev_open.h"
>+#include "migration/blocker.h"
> #include "pci.h"
> #include "vfio-iommufd.h"
> #include "vfio-helpers.h"
>@@ -669,6 +670,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
>     iommufd_cdev_container_destroy(container);
>     vfio_address_space_put(space);
>
>+    migrate_del_blocker(&vbasedev->cpr.id_blocker);

We also need to delete the blocker in the error path, e.g., when attach fails.
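
A minimal standalone sketch of that cleanup pattern, for illustration. The helpers below (add_blocker, del_blocker, attach_device, detach_device) are hypothetical and only model the behavior; they are not the QEMU functions. The blocker registered while naming the device is removed on the failure path as well as at detach.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: number of blockers currently registered. */
static int active_blockers;

struct device {
    const char *id;      /* NULL when no id property was given */
    bool has_blocker;    /* true while this device holds a blocker */
};

static void add_blocker(struct device *dev)
{
    dev->has_blocker = true;
    active_blockers++;
}

/* Safe to call when no blocker is registered for the device. */
static void del_blocker(struct device *dev)
{
    if (dev->has_blocker) {
        dev->has_blocker = false;
        active_blockers--;
    }
}

/* Attach the device; fail_late simulates an error after naming. */
static bool attach_device(struct device *dev, bool fail_late)
{
    if (!dev->id) {
        /* No invariant name is available, so block the transfer. */
        add_blocker(dev);
    }
    if (fail_late) {
        goto err;
    }
    return true;

err:
    /* Undo the blocker on the error path, not only at detach. */
    del_blocker(dev);
    return false;
}

static void detach_device(struct device *dev)
{
    del_blocker(dev);
}
```

Making the removal helper a no-op when nothing is registered lets the error path and the detach path share it without extra bookkeeping.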

>     iommufd_cdev_unbind_and_disconnect(vbasedev);
>     close(vbasedev->fd);
> }
>diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>index 765e334..d06d117 100644
>--- a/include/hw/vfio/vfio-cpr.h
>+++ b/include/hw/vfio/vfio-cpr.h
>@@ -23,12 +23,14 @@ typedef struct VFIOContainerCPR {
> typedef struct VFIODeviceCPR {
>     bool reused;
>     Error *mdev_blocker;
>+    Error *id_blocker;
> } VFIODeviceCPR;
>
> struct VFIOContainer;
> struct VFIOContainerBase;
> struct VFIOGroup;
> struct VFIOPCIDevice;
>+struct VFIODevice;
>
> bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>                                         Error **errp);
>@@ -59,4 +61,6 @@ void vfio_cpr_delete_vector_fd(struct VFIOPCIDevice *vdev,
>const char *name,
>
> extern const VMStateDescription vfio_cpr_pci_vmstate;
>
>+bool vfio_cpr_set_device_name(struct VFIODevice *vbasedev, Error **errp);
>+
> #endif /* HW_VFIO_VFIO_CPR_H */
>--
>1.8.3.1




* RE: [PATCH V3 36/42] vfio/iommufd: preserve descriptors
  2025-05-12 15:32 ` [PATCH V3 36/42] vfio/iommufd: preserve descriptors Steve Sistare
@ 2025-05-16 10:06   ` Duan, Zhenzhong
  2025-05-19 15:53     ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Duan, Zhenzhong @ 2025-05-16 10:06 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V3 36/42] vfio/iommufd: preserve descriptors
>
>Save the iommu and vfio device fd in CPR state when it is created.
>After CPR, the fd number is found in CPR state and reused.  Remember
>the reused status for subsequent patches.  The reused status is cleared
>when vmstate load finishes.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>---
> backends/iommufd.c       | 19 ++++++++++---------
> hw/vfio/cpr-iommufd.c    | 16 ++++++++++++++++
> hw/vfio/device.c         | 10 ++--------
> hw/vfio/iommufd.c        | 13 +++++++++++--
> include/system/iommufd.h |  1 +
> 5 files changed, 40 insertions(+), 19 deletions(-)
>
>diff --git a/backends/iommufd.c b/backends/iommufd.c
>index 6fed1c1..492747c 100644
>--- a/backends/iommufd.c
>+++ b/backends/iommufd.c
>@@ -16,12 +16,18 @@
> #include "qemu/module.h"
> #include "qom/object_interfaces.h"
> #include "qemu/error-report.h"
>+#include "migration/cpr.h"
> #include "monitor/monitor.h"
> #include "trace.h"
> #include "hw/vfio/vfio-device.h"
> #include <sys/ioctl.h>
> #include <linux/iommufd.h>
>
>+static const char *iommufd_fd_name(IOMMUFDBackend *be)
>+{
>+    return object_get_canonical_path_component(OBJECT(be));
>+}
>+
> static void iommufd_backend_init(Object *obj)
> {
>     IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
>@@ -47,9 +53,8 @@ static void iommufd_backend_set_fd(Object *obj, const
>char *str, Error **errp)
>     IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
>     int fd = -1;
>
>-    fd = monitor_fd_param(monitor_cur(), str, errp);
>+    fd = cpr_get_fd_param(iommufd_fd_name(be), str, 0, &be->cpr_reused, errp);
>     if (fd == -1) {
>-        error_prepend(errp, "Could not parse remote object fd %s:", str);
>         return;
>     }
>     be->fd = fd;
>@@ -95,14 +100,9 @@ bool iommufd_change_process(IOMMUFDBackend *be,
>Error **errp)
>
> bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
> {
>-    int fd;
>-
>     if (be->owned && !be->users) {
>-        fd = qemu_open("/dev/iommu", O_RDWR, errp);
>-        if (fd < 0) {
>-            return false;
>-        }
>-        be->fd = fd;
>+        be->fd = cpr_open_fd("/dev/iommu", O_RDWR, iommufd_fd_name(be), 0,
>+                             &be->cpr_reused, errp);

Need to check for an error before assigning to be->fd.
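
Roughly what I have in mind, as a standalone sketch. Backend, open_fd_stub and backend_connect are hypothetical stand-ins for the real types and for the fd-opening helper; the assumption is only that the helper returns a negative value on failure.

```c
#include <stdbool.h>

typedef struct Backend {
    int fd;
    unsigned users;
} Backend;

/* Hypothetical stand-in for an fd-opening helper that returns -1 on failure. */
static int open_fd_stub(bool succeed)
{
    return succeed ? 42 : -1;
}

static bool backend_connect(Backend *be, bool open_succeeds)
{
    if (!be->users) {
        int fd = open_fd_stub(open_succeeds);

        if (fd < 0) {
            return false;   /* be->fd and be->users stay untouched */
        }
        be->fd = fd;
    }
    be->users++;
    return true;
}
```

Keeping the result in a local until it is validated means a failed open neither overwrites be->fd nor bumps the user count.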

>     }
>     be->users++;
>
>@@ -121,6 +121,7 @@ void iommufd_backend_disconnect(IOMMUFDBackend
>*be)
>         be->fd = -1;
>     }
> out:
>+    cpr_delete_fd(iommufd_fd_name(be), 0);
>     trace_iommufd_backend_disconnect(be->fd, be->users);
> }
>
>diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>index 46f2006..b760bd3 100644
>--- a/hw/vfio/cpr-iommufd.c
>+++ b/hw/vfio/cpr-iommufd.c
>@@ -8,6 +8,7 @@
> #include "qemu/osdep.h"
> #include "qapi/error.h"
> #include "hw/vfio/vfio-cpr.h"
>+#include "hw/vfio/vfio-device.h"
> #include "migration/blocker.h"
> #include "migration/cpr.h"
> #include "migration/migration.h"
>@@ -25,10 +26,25 @@ static bool vfio_cpr_supported(VFIOIOMMUFDContainer
>*container, Error **errp)
>     return true;
> }
>
>+static int vfio_container_post_load(void *opaque, int version_id)
>+{
>+    VFIOIOMMUFDContainer *container = opaque;
>+    VFIOContainerBase *bcontainer = &container->bcontainer;
>+    VFIODevice *vbasedev;
>+
>+    QLIST_FOREACH(vbasedev, &bcontainer->device_list, container_next) {
>+        vbasedev->cpr.reused = false;
>+    }
>+    container->be->cpr_reused = false;

It's strange to clear the reused flags of the iommufd backend and the vfio
devices in the container's post_load handler. Maybe it would be better to do
that in their own post_load handlers?

>+
>+    return 0;
>+}
>+
> static const VMStateDescription vfio_container_vmstate = {
>     .name = "vfio-iommufd-container",
>     .version_id = 0,
>     .minimum_version_id = 0,
>+    .post_load = vfio_container_post_load,
>     .needed = cpr_needed_for_reuse,
>     .fields = (VMStateField[]) {
>         VMSTATE_END_OF_LIST()
>diff --git a/hw/vfio/device.c b/hw/vfio/device.c
>index 8e9de68..02f384e 100644
>--- a/hw/vfio/device.c
>+++ b/hw/vfio/device.c
>@@ -312,14 +312,8 @@ bool vfio_device_get_name(VFIODevice *vbasedev,
>Error **errp)
>
> void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
> {
>-    ERRP_GUARD();
>-    int fd = monitor_fd_param(monitor_cur(), str, errp);
>-
>-    if (fd < 0) {
>-        error_prepend(errp, "Could not parse remote object fd %s:", str);
>-        return;
>-    }
>-    vbasedev->fd = fd;
>+    vbasedev->fd = cpr_get_fd_param(vbasedev->dev->id, str, 0,
>+                                    &vbasedev->cpr.reused, errp);

Same here.

> }
>
> static VFIODeviceIOOps vfio_device_io_ops_ioctl;
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index dabb948..046f601 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -26,6 +26,7 @@
> #include "qemu/cutils.h"
> #include "qemu/chardev_open.h"
> #include "migration/blocker.h"
>+#include "migration/cpr.h"
> #include "pci.h"
> #include "vfio-iommufd.h"
> #include "vfio-helpers.h"
>@@ -530,13 +531,18 @@ static bool iommufd_cdev_attach(const char *name,
>VFIODevice *vbasedev,
>
>VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_IOMMUFD));
>
>     if (vbasedev->fd < 0) {
>-        devfd = iommufd_cdev_getfd(vbasedev->sysfsdev, errp);
>+        devfd = cpr_find_fd(vbasedev->name, 0);
>+        vbasedev->cpr.reused = (devfd >= 0);
>+        if (!vbasedev->cpr.reused) {
>+            devfd = iommufd_cdev_getfd(vbasedev->sysfsdev, errp);
>+        }
>         if (devfd < 0) {
>             return false;
>         }
>         vbasedev->fd = devfd;
>     } else {
>         devfd = vbasedev->fd;
>+        /* reused was set in iommufd_backend_set_fd */

Should be vfio_device_set_fd

>     }
>
>     if (!iommufd_cdev_connect_and_bind(vbasedev, errp)) {
>@@ -634,7 +640,9 @@ found_container:
>
>     vfio_device_prepare(vbasedev, bcontainer, &dev_info);
>     vfio_iommufd_cpr_register_device(vbasedev);
>-
>+    if (!vbasedev->cpr.reused) {
>+        cpr_save_fd(vbasedev->name, 0, vbasedev->fd);
>+    }
>     trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev-
>>num_irqs,
>                                    vbasedev->num_regions, vbasedev->flags);
>     return true;
>@@ -673,6 +681,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
>
>     migrate_del_blocker(&vbasedev->cpr.id_blocker);
>     vfio_iommufd_cpr_unregister_device(vbasedev);
>+    cpr_delete_fd(vbasedev->name, 0);
>     iommufd_cdev_unbind_and_disconnect(vbasedev);
>     close(vbasedev->fd);
> }
>diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>index db9ed53..5c17abd 100644
>--- a/include/system/iommufd.h
>+++ b/include/system/iommufd.h
>@@ -32,6 +32,7 @@ struct IOMMUFDBackend {
>     /*< protected >*/
>     int fd;            /* /dev/iommu file descriptor */
>     bool owned;        /* is the /dev/iommu opened internally */
>+    bool cpr_reused;   /* fd is reused after CPR */
>     uint32_t users;
>
>     /*< public >*/
>--
>1.8.3.1




* RE: [PATCH V3 37/42] vfio/iommufd: reconstruct device
  2025-05-12 15:32 ` [PATCH V3 37/42] vfio/iommufd: reconstruct device Steve Sistare
@ 2025-05-16 10:22   ` Duan, Zhenzhong
  2025-05-19 15:53     ` Steven Sistare
  2025-05-21 18:38   ` Steven Sistare
  1 sibling, 1 reply; 157+ messages in thread
From: Duan, Zhenzhong @ 2025-05-16 10:22 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V3 37/42] vfio/iommufd: reconstruct device
>
>Reconstruct userland device state after CPR.  During vfio_realize, skip
>all ioctls that configure the device, as it was already configured in old
>QEMU.
>
>Save the ioas_id in vmstate, and skip its allocation in vfio_realize.
>Because we skip ioctl's, it is not needed at realize time.  However, we do
>need the range info, so defer the call to iommufd_cdev_get_info_iova_range
>to a post_load handler, at which time the ioas_id is known.
>
>This reconstruction is not complete.  hwpt_id and devid need special
>treatment, handled in subsequent patches.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>---
> hw/vfio/cpr-iommufd.c |  8 ++++++++
> hw/vfio/iommufd.c     | 17 +++++++++++++++++
> 2 files changed, 25 insertions(+)
>
>diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>index b760bd3..3d430f0 100644
>--- a/hw/vfio/cpr-iommufd.c
>+++ b/hw/vfio/cpr-iommufd.c
>@@ -31,6 +31,13 @@ static int vfio_container_post_load(void *opaque, int
>version_id)
>     VFIOIOMMUFDContainer *container = opaque;
>     VFIOContainerBase *bcontainer = &container->bcontainer;
>     VFIODevice *vbasedev;
>+    Error *err = NULL;
>+    uint32_t ioas_id = container->ioas_id;
>+
>+    if (!iommufd_cdev_get_info_iova_range(container, ioas_id, &err)) {
>+        error_report_err(err);
>+        return -1;
>+    }
>
>     QLIST_FOREACH(vbasedev, &bcontainer->device_list, container_next) {
>         vbasedev->cpr.reused = false;
>@@ -47,6 +54,7 @@ static const VMStateDescription vfio_container_vmstate = {
>     .post_load = vfio_container_post_load,
>     .needed = cpr_needed_for_reuse,
>     .fields = (VMStateField[]) {
>+        VMSTATE_UINT32(ioas_id, VFIOIOMMUFDContainer),
>         VMSTATE_END_OF_LIST()
>     }
> };
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index 046f601..c49a7e7 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -122,6 +122,10 @@ static bool
>iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
>         goto err_kvm_device_add;
>     }
>
>+    if (vbasedev->cpr.reused) {
>+        goto skip_bind;
>+    }
>+
>     /* Bind device to iommufd */
>     bind.iommufd = iommufd->fd;
>     if (ioctl(vbasedev->fd, VFIO_DEVICE_BIND_IOMMUFD, &bind)) {
>@@ -133,6 +137,8 @@ static bool
>iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
>     vbasedev->devid = bind.out_devid;
>     trace_iommufd_cdev_connect_and_bind(bind.iommufd, vbasedev->name,
>                                         vbasedev->fd, vbasedev->devid);
>+
>+skip_bind:
>     return true;
> err_bind:
>     iommufd_cdev_kvm_device_del(vbasedev);
>@@ -580,6 +586,11 @@ static bool iommufd_cdev_attach(const char *name,
>VFIODevice *vbasedev,
>         }
>     }
>
>+    if (vbasedev->cpr.reused) {
>+        ioas_id = -1;           /* ioas_id will be received from vmstate */
>+        goto skip_ioas_alloc;
>+    }
>+
>     /* Need to allocate a new dedicated container */
>     if (!iommufd_backend_alloc_ioas(vbasedev->iommufd, &ioas_id, errp)) {
>         goto err_alloc_ioas;
>@@ -587,6 +598,7 @@ static bool iommufd_cdev_attach(const char *name,
>VFIODevice *vbasedev,
>
>     trace_iommufd_cdev_alloc_ioas(vbasedev->iommufd->fd, ioas_id);
>
>+skip_ioas_alloc:
>     container =
>VFIO_IOMMU_IOMMUFD(object_new(TYPE_VFIO_IOMMU_IOMMUFD));
>     container->be = vbasedev->iommufd;
>     container->ioas_id = ioas_id;
>@@ -605,6 +617,10 @@ static bool iommufd_cdev_attach(const char *name,
>VFIODevice *vbasedev,
>         goto err_discard_disable;
>     }
>
>+    if (vbasedev->cpr.reused) {
>+        goto skip_info;

I suspect this will break virtio-iommu, see virtio_iommu_set_iommu_device().
When virtio-iommu tries to get host_iova_ranges, they are not ready until
post load.

>+    }
>+
>     if (!iommufd_cdev_get_info_iova_range(container, ioas_id, &err)) {
>         error_append_hint(&err,
>                    "Fallback to default 64bit IOVA range and 4K page size\n");
>@@ -613,6 +629,7 @@ static bool iommufd_cdev_attach(const char *name,
>VFIODevice *vbasedev,
>         bcontainer->pgsizes = qemu_real_host_page_size();
>     }
>
>+skip_info:
>     if (!vfio_listener_register(bcontainer, errp)) {
>         goto err_listener_register;
>     }
>--
>1.8.3.1




* RE: [PATCH V3 35/42] vfio/iommufd: register container for cpr
  2025-05-12 15:32 ` [PATCH V3 35/42] vfio/iommufd: register container for cpr Steve Sistare
@ 2025-05-16 10:23   ` Duan, Zhenzhong
  2025-05-19 15:52     ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Duan, Zhenzhong @ 2025-05-16 10:23 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V3 35/42] vfio/iommufd: register container for cpr
>
>Register a vfio iommufd container and device for CPR, replacing the generic
>CPR register call with a more specific iommufd register call.  Add a
>blocker if the kernel does not support IOMMU_IOAS_CHANGE_PROCESS.
>
>This is mostly boilerplate.  The fields to be saved and restored are added
>in subsequent patches.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>---
> hw/vfio/cpr-iommufd.c      | 97
>++++++++++++++++++++++++++++++++++++++++++++++
> hw/vfio/iommufd.c          |  6 ++-
> hw/vfio/meson.build        |  1 +
> hw/vfio/vfio-iommufd.h     |  1 +
> include/hw/vfio/vfio-cpr.h |  8 ++++
> 5 files changed, 111 insertions(+), 2 deletions(-)
> create mode 100644 hw/vfio/cpr-iommufd.c
>
>diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>new file mode 100644
>index 0000000..46f2006
>--- /dev/null
>+++ b/hw/vfio/cpr-iommufd.c
>@@ -0,0 +1,97 @@
>+/*
>+ * Copyright (c) 2024-2025 Oracle and/or its affiliates.
>+ *
>+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
>+ * See the COPYING file in the top-level directory.
>+ */
>+
>+#include "qemu/osdep.h"
>+#include "qapi/error.h"
>+#include "hw/vfio/vfio-cpr.h"
>+#include "migration/blocker.h"
>+#include "migration/cpr.h"
>+#include "migration/migration.h"
>+#include "migration/vmstate.h"
>+#include "system/iommufd.h"
>+#include "vfio-iommufd.h"
>+
>+static bool vfio_cpr_supported(VFIOIOMMUFDContainer *container, Error
>**errp)
>+{
>+    if (!iommufd_change_process_capable(container->be)) {
>+        error_setg(errp,
>+                   "VFIO container does not support IOMMU_IOAS_CHANGE_PROCESS");
>+        return false;
>+    }
>+    return true;
>+}
>+
>+static const VMStateDescription vfio_container_vmstate = {
>+    .name = "vfio-iommufd-container",
>+    .version_id = 0,
>+    .minimum_version_id = 0,
>+    .needed = cpr_needed_for_reuse,
>+    .fields = (VMStateField[]) {
>+        VMSTATE_END_OF_LIST()
>+    }
>+};
>+
>+static const VMStateDescription iommufd_cpr_vmstate = {
>+    .name = "iommufd",
>+    .version_id = 0,
>+    .minimum_version_id = 0,
>+    .needed = cpr_needed_for_reuse,
>+    .fields = (VMStateField[]) {
>+        VMSTATE_END_OF_LIST()
>+    }
>+};
>+
>+bool vfio_iommufd_cpr_register_container(VFIOIOMMUFDContainer
>*container,
>+                                         Error **errp)
>+{
>+    VFIOContainerBase *bcontainer = &container->bcontainer;
>+    Error **cpr_blocker = &container->cpr_blocker;
>+
>+    migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
>+                                vfio_cpr_reboot_notifier,
>+                                MIG_MODE_CPR_REBOOT);
>+
>+    if (!vfio_cpr_supported(container, cpr_blocker)) {
>+        return migrate_add_blocker_modes(cpr_blocker, errp,
>+                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
>+    }
>+
>+    vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>+    vmstate_register(NULL, -1, &iommufd_cpr_vmstate, container->be);

Will this register the iommufd multiple times if there are multiple
containers under one iommufd? Maybe introduce a cpr_register_iommufd()?
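
One possible shape, as a standalone sketch under my own assumptions: the backend keeps a count of containers relying on the registration, and only the first register and the last unregister touch the underlying state. Backend, cpr_users, backend_cpr_register and backend_cpr_unregister are hypothetical names, and the registrations counter merely stands in for the vmstate_register()/vmstate_unregister() calls.

```c
#include <stdbool.h>

typedef struct Backend {
    unsigned cpr_users;     /* containers currently relying on the registration */
    unsigned registrations; /* counts effective calls to the register step */
} Backend;

static void backend_cpr_register(Backend *be)
{
    if (be->cpr_users++ == 0) {
        be->registrations++;    /* stands in for vmstate_register() */
    }
}

static void backend_cpr_unregister(Backend *be)
{
    if (be->cpr_users > 0 && --be->cpr_users == 0) {
        be->registrations--;    /* stands in for vmstate_unregister() */
    }
}
```

Two containers sharing one backend then produce a single registration, which is dropped only when the second container goes away.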

>+
>+    return true;
>+}
>+
>+void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer
>*container)
>+{
>+    VFIOContainerBase *bcontainer = &container->bcontainer;
>+
>+    vmstate_unregister(NULL, &iommufd_cpr_vmstate, container->be);
>+    vmstate_unregister(NULL, &vfio_container_vmstate, container);
>+    migrate_del_blocker(&container->cpr_blocker);
>+    migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
>+}
>+
>+static const VMStateDescription vfio_device_vmstate = {
>+    .name = "vfio-iommufd-device",
>+    .version_id = 0,
>+    .minimum_version_id = 0,
>+    .needed = cpr_needed_for_reuse,
>+    .fields = (VMStateField[]) {
>+        VMSTATE_END_OF_LIST()
>+    }
>+};
>+
>+void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
>+{
>+    vmstate_register(NULL, -1, &vfio_device_vmstate, vbasedev);
>+}
>+
>+void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
>+{
>+    vmstate_unregister(NULL, &vfio_device_vmstate, vbasedev);
>+}
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index ea99b8d..dabb948 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -460,7 +460,7 @@ static void
>iommufd_cdev_container_destroy(VFIOIOMMUFDContainer *container)
>     if (!QLIST_EMPTY(&bcontainer->device_list)) {
>         return;
>     }
>-    vfio_cpr_unregister_container(bcontainer);
>+    vfio_iommufd_cpr_unregister_container(container);
>     vfio_listener_unregister(bcontainer);
>     iommufd_backend_free_id(container->be, container->ioas_id);
>     object_unref(container);
>@@ -611,7 +611,7 @@ static bool iommufd_cdev_attach(const char *name,
>VFIODevice *vbasedev,
>         goto err_listener_register;
>     }
>
>-    if (!vfio_cpr_register_container(bcontainer, errp)) {
>+    if (!vfio_iommufd_cpr_register_container(container, errp)) {
>         goto err_listener_register;
>     }
>
>@@ -633,6 +633,7 @@ found_container:
>     }
>
>     vfio_device_prepare(vbasedev, bcontainer, &dev_info);
>+    vfio_iommufd_cpr_register_device(vbasedev);
>
>     trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev-
>>num_irqs,
>                                    vbasedev->num_regions, vbasedev->flags);
>@@ -671,6 +672,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
>     vfio_address_space_put(space);
>
>     migrate_del_blocker(&vbasedev->cpr.id_blocker);
>+    vfio_iommufd_cpr_unregister_device(vbasedev);
>     iommufd_cdev_unbind_and_disconnect(vbasedev);
>     close(vbasedev->fd);
> }
>diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>index 73d29f9..a158fd8 100644
>--- a/hw/vfio/meson.build
>+++ b/hw/vfio/meson.build
>@@ -21,6 +21,7 @@ system_ss.add(when: 'CONFIG_VFIO_XGMAC', if_true:
>files('calxeda-xgmac.c'))
> system_ss.add(when: 'CONFIG_VFIO_AMD_XGBE', if_true: files('amd-xgbe.c'))
> system_ss.add(when: 'CONFIG_VFIO', if_true: files(
>   'cpr.c',
>+  'cpr-iommufd.c',
>   'cpr-legacy.c',
>   'device.c',
>   'migration.c',
>diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
>index 5615dcd..cc57a05 100644
>--- a/hw/vfio/vfio-iommufd.h
>+++ b/hw/vfio/vfio-iommufd.h
>@@ -25,6 +25,7 @@ typedef struct IOMMUFDBackend IOMMUFDBackend;
> typedef struct VFIOIOMMUFDContainer {
>     VFIOContainerBase bcontainer;
>     IOMMUFDBackend *be;
>+    Error *cpr_blocker;
>     uint32_t ioas_id;
>     QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
> } VFIOIOMMUFDContainer;
>diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>index d06d117..1379b20 100644
>--- a/include/hw/vfio/vfio-cpr.h
>+++ b/include/hw/vfio/vfio-cpr.h
>@@ -31,6 +31,7 @@ struct VFIOContainerBase;
> struct VFIOGroup;
> struct VFIOPCIDevice;
> struct VFIODevice;
>+struct VFIOIOMMUFDContainer;
>
> bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>                                         Error **errp);
>@@ -43,6 +44,13 @@ bool vfio_cpr_register_container(struct
>VFIOContainerBase *bcontainer,
>                                  Error **errp);
> void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>
>+bool vfio_iommufd_cpr_register_container(struct VFIOIOMMUFDContainer
>*container,
>+                                         Error **errp);
>+void vfio_iommufd_cpr_unregister_container(
>+    struct VFIOIOMMUFDContainer *container);
>+void vfio_iommufd_cpr_register_device(struct VFIODevice *vbasedev);
>+void vfio_iommufd_cpr_unregister_device(struct VFIODevice *vbasedev);
>+
> bool vfio_cpr_container_match(struct VFIOContainer *container,
>                               struct VFIOGroup *group, int *fd);
>
>--
>1.8.3.1




* Re: [PATCH V3 06/42] vfio/container: register container for cpr
  2025-05-15 19:06     ` Steven Sistare
@ 2025-05-16 16:20       ` Cédric Le Goater
  2025-05-16 17:21         ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-16 16:20 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/15/25 21:06, Steven Sistare wrote:
> On 5/15/2025 3:54 AM, Cédric Le Goater wrote:
>> On 5/12/25 17:32, Steve Sistare wrote:
>>> Register a legacy container for cpr-transfer, replacing the generic CPR
>>> register call with a more specific legacy container register call.  Add a
>>> blocker if the kernel does not support VFIO_UPDATE_VADDR or VFIO_UNMAP_ALL.
>>>
>>> This is mostly boilerplate.  The fields to be saved and restored are added
>>> in subsequent patches.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>>   hw/vfio/container.c              |  6 ++--
>>>   hw/vfio/cpr-legacy.c             | 70 ++++++++++++++++++++++++++++++++++++++++
>>>   hw/vfio/cpr.c                    |  5 ++-
>>>   hw/vfio/meson.build              |  1 +
>>>   include/hw/vfio/vfio-container.h |  2 ++
>>>   include/hw/vfio/vfio-cpr.h       | 14 ++++++++
>>>   6 files changed, 92 insertions(+), 6 deletions(-)
>>>   create mode 100644 hw/vfio/cpr-legacy.c
>>>
>>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>>> index eb56f00..85c76da 100644
>>> --- a/hw/vfio/container.c
>>> +++ b/hw/vfio/container.c
>>> @@ -642,7 +642,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>>       new_container = true;
>>>       bcontainer = &container->bcontainer;
>>> -    if (!vfio_cpr_register_container(bcontainer, errp)) {
>>> +    if (!vfio_legacy_cpr_register_container(container, errp)) {
>>>           goto fail;
>>>       }
>>> @@ -678,7 +678,7 @@ fail:
>>>           vioc->release(bcontainer);
>>>       }
>>>       if (new_container) {
>>> -        vfio_cpr_unregister_container(bcontainer);
>>> +        vfio_legacy_cpr_unregister_container(container);
>>>           object_unref(container);
>>>       }
>>>       if (fd >= 0) {
>>> @@ -719,7 +719,7 @@ static void vfio_container_disconnect(VFIOGroup *group)
>>>           VFIOAddressSpace *space = bcontainer->space;
>>>           trace_vfio_container_disconnect(container->fd);
>>> -        vfio_cpr_unregister_container(bcontainer);
>>> +        vfio_legacy_cpr_unregister_container(container);
>>>           close(container->fd);
>>>           object_unref(container);
>>> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
>>> new file mode 100644
>>> index 0000000..fac323c
>>> --- /dev/null
>>> +++ b/hw/vfio/cpr-legacy.c
>>> @@ -0,0 +1,70 @@
>>> +/*
>>> + * Copyright (c) 2021-2025 Oracle and/or its affiliates.
>>> + *
>>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>>> + * See the COPYING file in the top-level directory.
>>
>> Please add a SPDX-License-Identifier tag.
> 
> Sure.  I'll do the same for my other new files.

and remove the license boilerplate too, please.

A newer version of checkpatch will complain with:

   ERROR: New file 'hw/vfio/cpr-legacy.c' requires 'SPDX-License-Identifier'
   ERROR: New file 'hw/vfio/cpr-legacy.c' must not have license boilerplate header text unless this file is copied from existing code with such  text already present.
   WARNING: added, moved or deleted file(s):

     hw/vfio/cpr-legacy.c

   Does MAINTAINERS need updating?

   total: 2 errors, 1 warnings, 152 lines checked
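
For reference, a header along these lines should address both errors. This is only a sketch: the copyright line is copied from the patch, and I am assuming GPL-2.0-or-later is the identifier that matches the "version 2 or later" wording being removed.

```c
/*
 * Copyright (c) 2021-2025 Oracle and/or its affiliates.
 *
 * SPDX-License-Identifier: GPL-2.0-or-later
 */
```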


Thanks,

C.


> 
>>> + */
>>> +
>>> +#include <sys/ioctl.h>
>>> +#include <linux/vfio.h>
>>> +#include "qemu/osdep.h"
>>> +#include "hw/vfio/vfio-container.h"
>>> +#include "hw/vfio/vfio-cpr.h"
>>> +#include "migration/blocker.h"
>>> +#include "migration/cpr.h"
>>> +#include "migration/migration.h"
>>> +#include "migration/vmstate.h"
>>> +#include "qapi/error.h"
>>> +
>>> +static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>>> +{
>>> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
>>> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR");
>>> +        return false;
>>> +
>>> +    } else if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
>>> +        error_setg(errp, "VFIO container does not support VFIO_UNMAP_ALL");
>>> +        return false;
>>> +
>>> +    } else {
>>> +        return true;
>>> +    }
>>> +}
>>> +
>>> +static const VMStateDescription vfio_container_vmstate = {
>>> +    .name = "vfio-container",
>>> +    .version_id = 0,
>>> +    .minimum_version_id = 0,
>>> +    .needed = cpr_needed_for_reuse,
>>> +    .fields = (VMStateField[]) {
>>> +        VMSTATE_END_OF_LIST()
>>> +    }
>>> +};
>>> +
>>> +bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
>>> +{
>>> +    VFIOContainerBase *bcontainer = &container->bcontainer;
>>> +    Error **cpr_blocker = &container->cpr.blocker;
>>> +
>>> +    migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
>>> +                                vfio_cpr_reboot_notifier,
>>> +                                MIG_MODE_CPR_REBOOT);
>>> +
>>> +    if (!vfio_cpr_supported(container, cpr_blocker)) {
>>> +        return migrate_add_blocker_modes(cpr_blocker, errp,
>>> +                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
>>> +    }
>>> +
>>> +    vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>>> +
>>> +    return true;
>>> +}
>>> +
>>> +void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
>>> +{
>>> +    VFIOContainerBase *bcontainer = &container->bcontainer;
>>> +
>>> +    migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
>>> +    migrate_del_blocker(&container->cpr.blocker);
>>> +    vmstate_unregister(NULL, &vfio_container_vmstate, container);
>>> +}
>>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>>> index 0210e76..0e59612 100644
>>> --- a/hw/vfio/cpr.c
>>> +++ b/hw/vfio/cpr.c
>>> @@ -7,13 +7,12 @@
>>>   #include "qemu/osdep.h"
>>>   #include "hw/vfio/vfio-device.h"
>>> -#include "migration/misc.h"
>>>   #include "hw/vfio/vfio-cpr.h"
>>>   #include "qapi/error.h"
>>>   #include "system/runstate.h"
>>> -static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
>>> -                                    MigrationEvent *e, Error **errp)
>>> +int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
>>> +                             MigrationEvent *e, Error **errp)
>>>   {
>>>       if (e->type == MIG_EVENT_PRECOPY_SETUP &&
>>>           !runstate_check(RUN_STATE_SUSPENDED) && !vm_get_suspended()) {
>>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>>> index bccb050..73d29f9 100644
>>> --- a/hw/vfio/meson.build
>>> +++ b/hw/vfio/meson.build
>>> @@ -21,6 +21,7 @@ system_ss.add(when: 'CONFIG_VFIO_XGMAC', if_true: files('calxeda-xgmac.c'))
>>>   system_ss.add(when: 'CONFIG_VFIO_AMD_XGBE', if_true: files('amd-xgbe.c'))
>>>   system_ss.add(when: 'CONFIG_VFIO', if_true: files(
>>>     'cpr.c',
>>> +  'cpr-legacy.c',
>>>     'device.c',
>>>     'migration.c',
>>>     'migration-multifd.c',
>>> diff --git a/include/hw/vfio/vfio-container.h b/include/hw/vfio/vfio-container.h
>>> index afc498d..21e5807 100644
>>> --- a/include/hw/vfio/vfio-container.h
>>> +++ b/include/hw/vfio/vfio-container.h
>>> @@ -10,6 +10,7 @@
>>>   #define HW_VFIO_CONTAINER_H
>>>   #include "hw/vfio/vfio-container-base.h"
>>> +#include "hw/vfio/vfio-cpr.h"
>>>   typedef struct VFIOContainer VFIOContainer;
>>>   typedef struct VFIODevice VFIODevice;
>>> @@ -29,6 +30,7 @@ typedef struct VFIOContainer {
>>>       int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>>>       unsigned iommu_type;
>>>       QLIST_HEAD(, VFIOGroup) group_list;
>>> +    VFIOContainerCPR cpr;
>>>   } VFIOContainer;
>>>   OBJECT_DECLARE_SIMPLE_TYPE(VFIOContainer, VFIO_IOMMU_LEGACY);
>>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>>> index 750ea5b..f864547 100644
>>> --- a/include/hw/vfio/vfio-cpr.h
>>> +++ b/include/hw/vfio/vfio-cpr.h
>>> @@ -9,8 +9,22 @@
>>>   #ifndef HW_VFIO_VFIO_CPR_H
>>>   #define HW_VFIO_VFIO_CPR_H
>>> +#include "migration/misc.h"
>>> +
>>> +typedef struct VFIOContainerCPR {
>>> +    Error *blocker;
>>> +} VFIOContainerCPR;
>>> +
>>> +struct VFIOContainer;
>>>   struct VFIOContainerBase;
>>> +bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>>> +                                        Error **errp);
>>> +void vfio_legacy_cpr_unregister_container(struct VFIOContainer *container);
>>> +
>>> +int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
>>> +                             Error **errp);
>>> +
>>>   bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
>>>                                    Error **errp);
>>>   void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>>
>> what about vfio_cpr_un/register_container ? Shouldn't we remove them ?
> 
> At this patch in the series, those are still used by iommufd containers.
> Those uses are removed in "vfio/iommufd: register container for cpr", and
> vfio_cpr_un/register_container are deleted by the last patch in the series.
> 
> - Steve
> 




* Re: [PATCH V3 00/42] Live update: vfio and iommufd
  2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
                   ` (41 preceding siblings ...)
  2025-05-12 15:32 ` [PATCH V3 42/42] vfio/container: delete old cpr register Steve Sistare
@ 2025-05-16 16:37 ` Cédric Le Goater
  2025-05-16 17:17   ` Steven Sistare
  42 siblings, 1 reply; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-16 16:37 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas,
	John Levon

Steve,

+ John

On 5/12/25 17:32, Steve Sistare wrote:
> Support vfio and iommufd devices with the cpr-transfer live migration mode.
> Devices that do not support live migration can still support cpr-transfer,
> allowing live update to a new version of QEMU on the same host, with no loss
> of guest connectivity.
> 
> No user-visible interfaces are added.
> 
> For legacy containers:
> 
> Pass vfio device descriptors to new QEMU.  In new QEMU, during vfio_realize,
> skip the ioctls that configure the device, because it is already configured.
> 
> Use VFIO_DMA_UNMAP_FLAG_VADDR to abandon the old VA's for DMA mapped
> regions, and use VFIO_DMA_MAP_FLAG_VADDR to register the new VA in new
> QEMU and update the locked memory accounting.  The physical pages remain
> pinned, because the descriptor of the device that locked them remains open,
> so DMA to those pages continues without interruption.  Mediated devices are
> not supported, however, because they require the VA to always be valid, and
> there is a brief window where no VA is registered.
> 
> Save the MSI message area as part of vfio-pci vmstate, and pass the interrupt
> and notifier eventfd's to new QEMU.  New QEMU loads the MSI data, then the
> vfio-pci post_load handler finds the eventfds in CPR state, rebuilds vector
> data structures, and attaches the interrupts to the new KVM instance.  This
> logic also applies to iommufd containers.
> 
> For iommufd containers:
> 
> Use IOMMU_IOAS_MAP_FILE to register memory regions for DMA when they are
> backed by a file (including a memfd), so DMA mappings do not depend on VA,
> which can differ after live update.  This allows mediated devices to be
> supported.
> 
> Pass the iommufd and vfio device descriptors from old to new QEMU.  In new
> QEMU, during vfio_realize, skip the ioctls that configure the device, because
> it is already configured.
> 
> In new QEMU, call ioctl(IOMMU_IOAS_CHANGE_PROCESS) to update mm ownership and
> locked memory accounting.
> 
> Patches 4 to 12 are specific to legacy containers.
> Patches 25 to 41 are specific to iommufd containers.
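
For readers following the legacy-container part of the description above,
the VA handoff can be sketched roughly as below.  This is only an
illustration, not code from the series: the helper names
make_vaddr_unmap_all() and make_vaddr_remap() are made up, and combining
VFIO_DMA_UNMAP_FLAG_VADDR with VFIO_DMA_UNMAP_FLAG_ALL is an assumption
based on the series checking for the VFIO_UNMAP_ALL extension.  The structs
and flags come from <linux/vfio.h>.

```c
#include <stdint.h>
#include <string.h>
#include <linux/vfio.h>

/*
 * Old QEMU (hypothetical helper): build a request that invalidates the
 * registered VAs of all mappings without unpinning the pages.  With
 * VFIO_DMA_UNMAP_FLAG_ALL, iova and size must both be 0.
 * It would be issued as ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap).
 */
static struct vfio_iommu_type1_dma_unmap make_vaddr_unmap_all(void)
{
    struct vfio_iommu_type1_dma_unmap unmap;

    memset(&unmap, 0, sizeof(unmap));
    unmap.argsz = sizeof(unmap);
    unmap.flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL;
    return unmap;
}

/*
 * New QEMU (hypothetical helper): build a request that registers the new VA
 * for an existing mapping.  iova and size must match the original mapping,
 * and the READ/WRITE flags must be left clear.
 * It would be issued as ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map).
 */
static struct vfio_iommu_type1_dma_map make_vaddr_remap(uint64_t iova,
                                                        uint64_t size,
                                                        void *new_va)
{
    struct vfio_iommu_type1_dma_map map;

    memset(&map, 0, sizeof(map));
    map.argsz = sizeof(map);
    map.flags = VFIO_DMA_MAP_FLAG_VADDR;
    map.vaddr = (uint64_t)(uintptr_t)new_va;
    map.iova = iova;
    map.size = size;
    return map;
}
```

Between the two ioctls no VA is registered for the mapping, which is the
brief window the description refers to when it excludes mediated devices.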

For v4, could you please send a first "part I" with patches [1-20] ?
I think these are reviewed, or nearly, and could be merged quickly.
Even if the "Live update: vfio and iommufd" series is not fully
reviewed yet, there are good signs that it will be before the end of
the QEMU 10.1 cycle. The same applies to vfio-user.

We need to bring together the proposals changing memory_get_xlat_addr().
It's important as it is blocking both the vfio-user series and yours.
This can be done in parallel.

Then we can address the iommufd part.

Thanks,

C.





* Re: [PATCH V3 24/42] migration: close kvm after cpr
  2025-05-16  8:35   ` Cédric Le Goater
@ 2025-05-16 17:14     ` Peter Xu
  2025-05-16 19:17       ` Steven Sistare
  2025-05-16 18:18     ` Steven Sistare
  1 sibling, 1 reply; 157+ messages in thread
From: Peter Xu @ 2025-05-16 17:14 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Steve Sistare, qemu-devel, Alex Williamson, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum,
	Fabiano Rosas

On Fri, May 16, 2025 at 10:35:44AM +0200, Cédric Le Goater wrote:
> On 5/12/25 17:32, Steve Sistare wrote:
> > cpr-transfer breaks vfio network connectivity to and from the guest, and
> > the host system log shows:
> >    irq bypass consumer (token 00000000a03c32e5) registration fails: -16
> > which is EBUSY.  This occurs because KVM descriptors are still open in
> > the old QEMU process.  Close them.
> > 

[1]

> > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> This patch doesn't build.
> 
> /usr/bin/ld: libcommon.a.p/migration_cpr.c.o: in function `cpr_kvm_close':
> ./build/../migration/cpr.c:260: undefined reference to `kvm_close'

It would also be good for this patch to keep copying the kvm maintainer
(Paolo).  I have Paolo copied.  This patch is more of a kvm change than a
migration change, afaict.  Maybe we should split it into two patches.

Steve, you could attach a cc line in this patch to make sure it won't be
forgotten when you repost (at [1] above, I think git-send-email would
remember that then):

Cc: Paolo Bonzini <pbonzini@redhat.com>

Some other questions below.

> 
> 
> 
> Thanks,
> 
> C.
> 
> 
> 
> > ---
> >   accel/kvm/kvm-all.c           | 28 ++++++++++++++++++++++++++++
> >   hw/vfio/helpers.c             | 10 ++++++++++
> >   include/hw/vfio/vfio-device.h |  2 ++
> >   include/migration/cpr.h       |  2 ++
> >   include/qemu/vfio-helpers.h   |  1 -
> >   include/system/kvm.h          |  1 +
> >   migration/cpr-transfer.c      | 18 ++++++++++++++++++
> >   migration/cpr.c               |  8 ++++++++
> >   migration/migration.c         |  1 +
> >   9 files changed, 70 insertions(+), 1 deletion(-)
> > 
> > diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> > index 278a506..d619448 100644
> > --- a/accel/kvm/kvm-all.c
> > +++ b/accel/kvm/kvm-all.c
> > @@ -512,16 +512,23 @@ static int do_kvm_destroy_vcpu(CPUState *cpu)
> >           goto err;
> >       }
> > +    /* If I am the CPU that created coalesced_mmio_ring, then discard it */

Are these "reset to NULL" below required or cleanup?  It's not yet clear to
me when coalesced_mmio_ring isn't owned by the current CPU state.  Maybe
also better to split this chunk with some commit message?

> > +    if (s->coalesced_mmio_ring == (void *)cpu->kvm_run + PAGE_SIZE) {
> > +        s->coalesced_mmio_ring = NULL;
> > +    }
> > +
> >       ret = munmap(cpu->kvm_run, mmap_size);
> >       if (ret < 0) {
> >           goto err;
> >       }
> > +    cpu->kvm_run = NULL;
> >       if (cpu->kvm_dirty_gfns) {
> >           ret = munmap(cpu->kvm_dirty_gfns, s->kvm_dirty_ring_bytes);
> >           if (ret < 0) {
> >               goto err;
> >           }
> > +        cpu->kvm_dirty_gfns = NULL;
> >       }
> >       kvm_park_vcpu(cpu);
> > @@ -600,6 +607,27 @@ err:
> >       return ret;
> >   }
> > +void kvm_close(void)
> > +{
> > +    CPUState *cpu;
> > +
> > +    CPU_FOREACH(cpu) {
> > +        cpu_remove_sync(cpu);
> > +        close(cpu->kvm_fd);
> > +        cpu->kvm_fd = -1;
> > +        close(cpu->kvm_vcpu_stats_fd);
> > +        cpu->kvm_vcpu_stats_fd = -1;
> > +    }
> > +
> > +    if (kvm_state && kvm_state->fd != -1) {
> > +        close(kvm_state->vmfd);
> > +        kvm_state->vmfd = -1;
> > +        close(kvm_state->fd);
> > +        kvm_state->fd = -1;
> > +    }
> > +    kvm_state = NULL;
> > +}
> > +
> >   /*
> >    * dirty pages logging control
> >    */
> > diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
> > index d0dbab1..af1db2f 100644
> > --- a/hw/vfio/helpers.c
> > +++ b/hw/vfio/helpers.c
> > @@ -117,6 +117,16 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
> >   int vfio_kvm_device_fd = -1;
> >   #endif
> > +void vfio_kvm_device_close(void)
> > +{
> > +#ifdef CONFIG_KVM
> > +    if (vfio_kvm_device_fd != -1) {
> > +        close(vfio_kvm_device_fd);
> > +        vfio_kvm_device_fd = -1;
> > +    }
> > +#endif
> > +}
> > +
> >   int vfio_kvm_device_add_fd(int fd, Error **errp)
> >   {
> >   #ifdef CONFIG_KVM
> > diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
> > index 4e4d0b6..6eb6f21 100644
> > --- a/include/hw/vfio/vfio-device.h
> > +++ b/include/hw/vfio/vfio-device.h
> > @@ -231,4 +231,6 @@ void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
> >   void vfio_device_init(VFIODevice *vbasedev, int type, VFIODeviceOps *ops,
> >                         DeviceState *dev, bool ram_discard);
> >   int vfio_device_get_aw_bits(VFIODevice *vdev);
> > +
> > +void vfio_kvm_device_close(void);
> >   #endif /* HW_VFIO_VFIO_COMMON_H */
> > diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> > index fc6aa33..5f1ff10 100644
> > --- a/include/migration/cpr.h
> > +++ b/include/migration/cpr.h
> > @@ -31,7 +31,9 @@ void cpr_state_close(void);
> >   struct QIOChannel *cpr_state_ioc(void);
> >   bool cpr_needed_for_reuse(void *opaque);
> > +void cpr_kvm_close(void);
> > +void cpr_transfer_init(void);
> >   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
> >   QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
> > diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
> > index bde9495..a029036 100644
> > --- a/include/qemu/vfio-helpers.h
> > +++ b/include/qemu/vfio-helpers.h
> > @@ -28,5 +28,4 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar,
> >                                uint64_t offset, uint64_t size);
> >   int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
> >                              int irq_type, Error **errp);
> > -
> >   #endif
> > diff --git a/include/system/kvm.h b/include/system/kvm.h
> > index b690dda..cfaa94c 100644
> > --- a/include/system/kvm.h
> > +++ b/include/system/kvm.h
> > @@ -194,6 +194,7 @@ bool kvm_has_sync_mmu(void);
> >   int kvm_has_vcpu_events(void);
> >   int kvm_max_nested_state_length(void);
> >   int kvm_has_gsi_routing(void);
> > +void kvm_close(void);
> >   /**
> >    * kvm_arm_supports_user_irq
> > diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c
> > index e1f1403..396558f 100644
> > --- a/migration/cpr-transfer.c
> > +++ b/migration/cpr-transfer.c
> > @@ -17,6 +17,24 @@
> >   #include "migration/vmstate.h"
> >   #include "trace.h"
> > +static int cpr_transfer_notifier(NotifierWithReturn *notifier,
> > +                                 MigrationEvent *e,
> > +                                 Error **errp)
> > +{
> > +    if (e->type == MIG_EVENT_PRECOPY_DONE) {
> > +        cpr_kvm_close();
> > +    }
> > +    return 0;
> > +}
> > +
> > +void cpr_transfer_init(void)
> > +{
> > +    static NotifierWithReturn notifier;
> > +
> > +    migration_add_notifier_mode(&notifier, cpr_transfer_notifier,
> > +                                MIG_MODE_CPR_TRANSFER);
> > +}
> > +
> >   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp)
> >   {
> >       MigrationAddress *addr = channel->addr;
> > diff --git a/migration/cpr.c b/migration/cpr.c
> > index 0b01e25..6102d04 100644
> > --- a/migration/cpr.c
> > +++ b/migration/cpr.c
> > @@ -7,12 +7,14 @@
> >   #include "qemu/osdep.h"
> >   #include "qapi/error.h"
> > +#include "hw/vfio/vfio-device.h"
> >   #include "migration/cpr.h"
> >   #include "migration/misc.h"
> >   #include "migration/options.h"
> >   #include "migration/qemu-file.h"
> >   #include "migration/savevm.h"
> >   #include "migration/vmstate.h"
> > +#include "system/kvm.h"
> >   #include "system/runstate.h"
> >   #include "trace.h"
> > @@ -252,3 +254,9 @@ bool cpr_needed_for_reuse(void *opaque)
> >       MigMode mode = migrate_mode();
> >       return mode == MIG_MODE_CPR_TRANSFER;
> >   }
> > +
> > +void cpr_kvm_close(void)
> > +{
> > +    kvm_close();
> > +    vfio_kvm_device_close();
> > +}
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 4697732..89e2026 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -337,6 +337,7 @@ void migration_object_init(void)
> >       ram_mig_init();
> >       dirty_bitmap_mig_init();
> > +    cpr_transfer_init();
> >       /* Initialize cpu throttle timers */
> >       cpu_throttle_init();
> 

-- 
Peter Xu




* Re: [PATCH V3 00/42] Live update: vfio and iommufd
  2025-05-16 16:37 ` [PATCH V3 00/42] Live update: vfio and iommufd Cédric Le Goater
@ 2025-05-16 17:17   ` Steven Sistare
  2025-05-16 19:48     ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-16 17:17 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas,
	John Levon

On 5/16/2025 12:37 PM, Cédric Le Goater wrote:
> Steve,
> 
> + John
> 
> On 5/12/25 17:32, Steve Sistare wrote:
>> Support vfio and iommufd devices with the cpr-transfer live migration mode.
>> Devices that do not support live migration can still support cpr-transfer,
>> allowing live update to a new version of QEMU on the same host, with no loss
>> of guest connectivity.
>>
>> No user-visible interfaces are added.
>>
>> For legacy containers:
>>
>> Pass vfio device descriptors to new QEMU.  In new QEMU, during vfio_realize,
>> skip the ioctls that configure the device, because it is already configured.
>>
>> Use VFIO_DMA_UNMAP_FLAG_VADDR to abandon the old VA's for DMA mapped
>> regions, and use VFIO_DMA_MAP_FLAG_VADDR to register the new VA in new
>> QEMU and update the locked memory accounting.  The physical pages remain
>> pinned, because the descriptor of the device that locked them remains open,
>> so DMA to those pages continues without interruption.  Mediated devices are
>> not supported, however, because they require the VA to always be valid, and
>> there is a brief window where no VA is registered.
>>
>> Save the MSI message area as part of vfio-pci vmstate, and pass the interrupt
>> and notifier eventfd's to new QEMU.  New QEMU loads the MSI data, then the
>> vfio-pci post_load handler finds the eventfds in CPR state, rebuilds vector
>> data structures, and attaches the interrupts to the new KVM instance.  This
>> logic also applies to iommufd containers.
>>
>> For iommufd containers:
>>
>> Use IOMMU_IOAS_MAP_FILE to register memory regions for DMA when they are
>> backed by a file (including a memfd), so DMA mappings do not depend on VA,
>> which can differ after live update.  This allows mediated devices to be
>> supported.
>>
>> Pass the iommufd and vfio device descriptors from old to new QEMU.  In new
>> QEMU, during vfio_realize, skip the ioctls that configure the device, because
>> it is already configured.
>>
>> In new QEMU, call ioctl(IOMMU_IOAS_CHANGE_PROCESS) to update mm ownership and
>> locked memory accounting.
>>
>> Patches 4 to 12 are specific to legacy containers.
>> Patches 25 to 41 are specific to iommufd containers.
> 
> For v4, could you please send a first "part I" with patches [1-20] ?
> I think these are reviewed, or nearly, and could be merged quickly.
> Even if the "Live update: vfio and iommufd" series is not fully
> reviewed yet, there are good signs that it will be before the end of
> the QEMU 10.1 cycle. The same applies to vfio-user.
> 
> We need to bring together the proposals changing memory_get_xlat_addr().
> It's important as it is blocking both the vfio-user series and yours.
> This can be done in parallel.
> 
> Then we can address the iommufd part.

OK.  I was already preparing memory_get_xlat_addr when I received your email.

- Steve




* Re: [PATCH V3 06/42] vfio/container: register container for cpr
  2025-05-16 16:20       ` Cédric Le Goater
@ 2025-05-16 17:21         ` Steven Sistare
  0 siblings, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-16 17:21 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/16/2025 12:20 PM, Cédric Le Goater wrote:
> On 5/15/25 21:06, Steven Sistare wrote:
>> On 5/15/2025 3:54 AM, Cédric Le Goater wrote:
>>> On 5/12/25 17:32, Steve Sistare wrote:
>>>> Register a legacy container for cpr-transfer, replacing the generic CPR
>>>> register call with a more specific legacy container register call.  Add a
>>>> blocker if the kernel does not support VFIO_UPDATE_VADDR or VFIO_UNMAP_ALL.
>>>>
>>>> This is mostly boiler plate.  The fields to be saved and restored are added
>>>> in subsequent patches.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>>   hw/vfio/container.c              |  6 ++--
>>>>   hw/vfio/cpr-legacy.c             | 70 ++++++++++++++++++++++++++++++++++++++++
>>>>   hw/vfio/cpr.c                    |  5 ++-
>>>>   hw/vfio/meson.build              |  1 +
>>>>   include/hw/vfio/vfio-container.h |  2 ++
>>>>   include/hw/vfio/vfio-cpr.h       | 14 ++++++++
>>>>   6 files changed, 92 insertions(+), 6 deletions(-)
>>>>   create mode 100644 hw/vfio/cpr-legacy.c
>>>>
>>>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>>>> index eb56f00..85c76da 100644
>>>> --- a/hw/vfio/container.c
>>>> +++ b/hw/vfio/container.c
>>>> @@ -642,7 +642,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>>>       new_container = true;
>>>>       bcontainer = &container->bcontainer;
>>>> -    if (!vfio_cpr_register_container(bcontainer, errp)) {
>>>> +    if (!vfio_legacy_cpr_register_container(container, errp)) {
>>>>           goto fail;
>>>>       }
>>>> @@ -678,7 +678,7 @@ fail:
>>>>           vioc->release(bcontainer);
>>>>       }
>>>>       if (new_container) {
>>>> -        vfio_cpr_unregister_container(bcontainer);
>>>> +        vfio_legacy_cpr_unregister_container(container);
>>>>           object_unref(container);
>>>>       }
>>>>       if (fd >= 0) {
>>>> @@ -719,7 +719,7 @@ static void vfio_container_disconnect(VFIOGroup *group)
>>>>           VFIOAddressSpace *space = bcontainer->space;
>>>>           trace_vfio_container_disconnect(container->fd);
>>>> -        vfio_cpr_unregister_container(bcontainer);
>>>> +        vfio_legacy_cpr_unregister_container(container);
>>>>           close(container->fd);
>>>>           object_unref(container);
>>>> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
>>>> new file mode 100644
>>>> index 0000000..fac323c
>>>> --- /dev/null
>>>> +++ b/hw/vfio/cpr-legacy.c
>>>> @@ -0,0 +1,70 @@
>>>> +/*
>>>> + * Copyright (c) 2021-2025 Oracle and/or its affiliates.
>>>> + *
>>>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>>>> + * See the COPYING file in the top-level directory.
>>>
>>> Please add a SPDX-License-Identifier tag.
>>
>> Sure.  I'll do the same for my other new files.
> 
> and remove the License boiler plate too please.

Yes.  I understood you wanted me to replace one with the other.
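
For illustration only, and assuming the license stays "GPL, version 2 or
later" (for which the SPDX identifier is GPL-2.0-or-later), the header of
the new file would then look something like this:

```c
/*
 * Copyright (c) 2021-2025 Oracle and/or its affiliates.
 *
 * SPDX-License-Identifier: GPL-2.0-or-later
 */
```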

- Steve

> 
> A newer version of checkpatch will complain with :
> 
>    ERROR: New file 'hw/vfio/cpr-legacy.c' requires 'SPDX-License-Identifier'
>    ERROR: New file 'hw/vfio/cpr-legacy.c' must not have license boilerplate header text unless this file is copied from existing code with such  text already present.
>    WARNING: added, moved or deleted file(s):
> 
>      hw/vfio/cpr-legacy.c
> 
>    Does MAINTAINERS need updating?
> 
>    total: 2 errors, 1 warnings, 152 lines checked
> 
> 
> Thanks,
> 
> C.
> 
> 
>>
>>>> + */
>>>> +
>>>> +#include <sys/ioctl.h>
>>>> +#include <linux/vfio.h>
>>>> +#include "qemu/osdep.h"
>>>> +#include "hw/vfio/vfio-container.h"
>>>> +#include "hw/vfio/vfio-cpr.h"
>>>> +#include "migration/blocker.h"
>>>> +#include "migration/cpr.h"
>>>> +#include "migration/migration.h"
>>>> +#include "migration/vmstate.h"
>>>> +#include "qapi/error.h"
>>>> +
>>>> +static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>>>> +{
>>>> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
>>>> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR");
>>>> +        return false;
>>>> +
>>>> +    } else if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
>>>> +        error_setg(errp, "VFIO container does not support VFIO_UNMAP_ALL");
>>>> +        return false;
>>>> +
>>>> +    } else {
>>>> +        return true;
>>>> +    }
>>>> +}
>>>> +
>>>> +static const VMStateDescription vfio_container_vmstate = {
>>>> +    .name = "vfio-container",
>>>> +    .version_id = 0,
>>>> +    .minimum_version_id = 0,
>>>> +    .needed = cpr_needed_for_reuse,
>>>> +    .fields = (VMStateField[]) {
>>>> +        VMSTATE_END_OF_LIST()
>>>> +    }
>>>> +};
>>>> +
>>>> +bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
>>>> +{
>>>> +    VFIOContainerBase *bcontainer = &container->bcontainer;
>>>> +    Error **cpr_blocker = &container->cpr.blocker;
>>>> +
>>>> +    migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
>>>> +                                vfio_cpr_reboot_notifier,
>>>> +                                MIG_MODE_CPR_REBOOT);
>>>> +
>>>> +    if (!vfio_cpr_supported(container, cpr_blocker)) {
>>>> +        return migrate_add_blocker_modes(cpr_blocker, errp,
>>>> +                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
>>>> +    }
>>>> +
>>>> +    vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>>>> +
>>>> +    return true;
>>>> +}
>>>> +
>>>> +void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
>>>> +{
>>>> +    VFIOContainerBase *bcontainer = &container->bcontainer;
>>>> +
>>>> +    migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
>>>> +    migrate_del_blocker(&container->cpr.blocker);
>>>> +    vmstate_unregister(NULL, &vfio_container_vmstate, container);
>>>> +}
>>>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>>>> index 0210e76..0e59612 100644
>>>> --- a/hw/vfio/cpr.c
>>>> +++ b/hw/vfio/cpr.c
>>>> @@ -7,13 +7,12 @@
>>>>   #include "qemu/osdep.h"
>>>>   #include "hw/vfio/vfio-device.h"
>>>> -#include "migration/misc.h"
>>>>   #include "hw/vfio/vfio-cpr.h"
>>>>   #include "qapi/error.h"
>>>>   #include "system/runstate.h"
>>>> -static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
>>>> -                                    MigrationEvent *e, Error **errp)
>>>> +int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
>>>> +                             MigrationEvent *e, Error **errp)
>>>>   {
>>>>       if (e->type == MIG_EVENT_PRECOPY_SETUP &&
>>>>           !runstate_check(RUN_STATE_SUSPENDED) && !vm_get_suspended()) {
>>>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>>>> index bccb050..73d29f9 100644
>>>> --- a/hw/vfio/meson.build
>>>> +++ b/hw/vfio/meson.build
>>>> @@ -21,6 +21,7 @@ system_ss.add(when: 'CONFIG_VFIO_XGMAC', if_true: files('calxeda-xgmac.c'))
>>>>   system_ss.add(when: 'CONFIG_VFIO_AMD_XGBE', if_true: files('amd-xgbe.c'))
>>>>   system_ss.add(when: 'CONFIG_VFIO', if_true: files(
>>>>     'cpr.c',
>>>> +  'cpr-legacy.c',
>>>>     'device.c',
>>>>     'migration.c',
>>>>     'migration-multifd.c',
>>>> diff --git a/include/hw/vfio/vfio-container.h b/include/hw/vfio/vfio-container.h
>>>> index afc498d..21e5807 100644
>>>> --- a/include/hw/vfio/vfio-container.h
>>>> +++ b/include/hw/vfio/vfio-container.h
>>>> @@ -10,6 +10,7 @@
>>>>   #define HW_VFIO_CONTAINER_H
>>>>   #include "hw/vfio/vfio-container-base.h"
>>>> +#include "hw/vfio/vfio-cpr.h"
>>>>   typedef struct VFIOContainer VFIOContainer;
>>>>   typedef struct VFIODevice VFIODevice;
>>>> @@ -29,6 +30,7 @@ typedef struct VFIOContainer {
>>>>       int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>>>>       unsigned iommu_type;
>>>>       QLIST_HEAD(, VFIOGroup) group_list;
>>>> +    VFIOContainerCPR cpr;
>>>>   } VFIOContainer;
>>>>   OBJECT_DECLARE_SIMPLE_TYPE(VFIOContainer, VFIO_IOMMU_LEGACY);
>>>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>>>> index 750ea5b..f864547 100644
>>>> --- a/include/hw/vfio/vfio-cpr.h
>>>> +++ b/include/hw/vfio/vfio-cpr.h
>>>> @@ -9,8 +9,22 @@
>>>>   #ifndef HW_VFIO_VFIO_CPR_H
>>>>   #define HW_VFIO_VFIO_CPR_H
>>>> +#include "migration/misc.h"
>>>> +
>>>> +typedef struct VFIOContainerCPR {
>>>> +    Error *blocker;
>>>> +} VFIOContainerCPR;
>>>> +
>>>> +struct VFIOContainer;
>>>>   struct VFIOContainerBase;
>>>> +bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>>>> +                                        Error **errp);
>>>> +void vfio_legacy_cpr_unregister_container(struct VFIOContainer *container);
>>>> +
>>>> +int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
>>>> +                             Error **errp);
>>>> +
>>>>   bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
>>>>                                    Error **errp);
>>>>   void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>>>
>>> what about vfio_cpr_un/register_container ? Shouldn't we remove them ?
>>
>> At this patch in the series, those are still used by iommufd containers.
>> Those uses are removed in "vfio/iommufd: register container for cpr", and
>> vfio_cpr_un/register_container are deleted by the last patch in the series.
>>
>> - Steve
>>
> 




* Re: [PATCH V3 14/42] pci: skip reset during cpr
  2025-05-16  8:19   ` Cédric Le Goater
@ 2025-05-16 17:58     ` Steven Sistare
  2025-05-24  9:34     ` Michael S. Tsirkin
  1 sibling, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-16 17:58 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/16/2025 4:19 AM, Cédric Le Goater wrote:
> On 5/12/25 17:32, Steve Sistare wrote:
>> Do not reset a vfio-pci device during CPR.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   hw/pci/pci.c | 13 +++++++++++++
>>   1 file changed, 13 insertions(+)
>>
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index fe38c4c..2ba2e0f 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -32,6 +32,8 @@
>>   #include "hw/pci/pci_host.h"
>>   #include "hw/qdev-properties.h"
>>   #include "hw/qdev-properties-system.h"
>> +#include "migration/cpr.h"
>> +#include "migration/misc.h"
>>   #include "migration/qemu-file-types.h"
>>   #include "migration/vmstate.h"
>>   #include "net/net.h"
>> @@ -537,6 +539,17 @@ static void pci_reset_regions(PCIDevice *dev)
>>   static void pci_do_device_reset(PCIDevice *dev)
>>   {
>> +    /*
>> +     * A PCI device that is resuming for cpr is already configured, so do
>> +     * not reset it here when we are called from qemu_system_reset prior to
>> +     * cpr load, else interrupts may be lost for vfio-pci devices.  It is
>> +     * safe to skip this reset for all PCI devices, because vmstate load will
>> +     * set all fields that would have been set here.
>> +     */
>> +    if (cpr_is_incoming()) {
> 
> Why can't we use cpr_is_incoming() in vfio instead of using a heuristic
> on saved fds?

We could (and we had the same discussion in V1 or V2).
I thought it slightly more object-oriented to derive the cpr_reused
boolean where an fd is involved, and save/use that in the associated object,
rather than call a global function everywhere. I do not feel strongly about it,
but it is used a lot:

$ git grep -F cpr_reused -- hw/vfio | wc -l
19
$ git grep -F cpr.reused -- hw/vfio | wc -l
27
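
To make the trade-off concrete, here is a rough sketch of the per-object
approach.  Everything in it is hypothetical and simplified for illustration
(DemoDevice, demo_find_saved_fd(), demo_open_device(), demo_configure() are
not names from the series); it only shows the shape of deriving a
cpr_reused flag at the point where the fd is looked up, then consulting the
flag instead of a global predicate:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the table of fds preserved by old QEMU. */
static int demo_saved_fd = -1;

/* Hypothetical lookup: returns the preserved fd, or -1 if there is none. */
static int demo_find_saved_fd(const char *name)
{
    (void)name;
    return demo_saved_fd;
}

typedef struct DemoDevice {
    int fd;
    bool cpr_reused;    /* true if fd was inherited from old QEMU */
} DemoDevice;

static int demo_configure_count;

/* Hypothetical stand-in for opening the device from scratch. */
static int demo_open_device(void)
{
    return 100;
}

/* Hypothetical stand-in for the ioctls that configure the device. */
static void demo_configure(DemoDevice *dev)
{
    (void)dev;
    demo_configure_count++;
}

static void demo_device_realize(DemoDevice *dev, const char *name)
{
    int fd = demo_find_saved_fd(name);

    /* Derive the flag once, where the fd is involved, and save it. */
    dev->cpr_reused = (fd >= 0);
    dev->fd = dev->cpr_reused ? fd : demo_open_device();

    /* Later code asks the object, not a global function. */
    if (!dev->cpr_reused) {
        demo_configure(dev);
    }
}
```

The alternative would replace each dev->cpr_reused test with a call to
cpr_is_incoming(); the behavior is the same either way in this sketch, so
the difference is mainly where the decision is recorded.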

- Steve




* Re: [PATCH V3 21/42] vfio/pci: export MSI functions
  2025-05-16  8:31   ` Cédric Le Goater
@ 2025-05-16 17:58     ` Steven Sistare
  2025-05-20  5:52       ` Cédric Le Goater
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-16 17:58 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/16/2025 4:31 AM, Cédric Le Goater wrote:
> On 5/12/25 17:32, Steve Sistare wrote:
>> Export various MSI functions, for use by CPR in subsequent patches.
>> No functional change.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> Please rename these routines with a 'vfio_pci' prefix.

Are you sure?  That makes sense for:
   vfio_vector_init -> vfio_pci_vector_init

but the rest already have msi or intx in the name which unambiguously
means pci.  Adding pci_ seems unnecessarily verbose:

+void vfio_msi_interrupt(void *opaque);
+void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
+                           int vector_n, bool msix);
+int vfio_msix_vector_use(PCIDevice *pdev, unsigned int nr, MSIMessage msg);
+void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr);
+bool vfio_msix_present(void *opaque, int version_id);
+void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
+void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
+bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp);

- Steve

>> ---
>>   hw/vfio/pci.c | 21 ++++++++++-----------
>>   hw/vfio/pci.h | 12 ++++++++++++
>>   2 files changed, 22 insertions(+), 11 deletions(-)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index d2b08a3..1bca415 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -279,7 +279,7 @@ static void vfio_irqchip_change(Notifier *notify, void *data)
>>       vfio_intx_update(vdev, &vdev->intx.route);
>>   }
>> -static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>> +bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>>   {
>>       uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
>>       Error *err = NULL;
>> @@ -353,7 +353,7 @@ static void vfio_intx_disable(VFIOPCIDevice *vdev)
>>   /*
>>    * MSI/X
>>    */
>> -static void vfio_msi_interrupt(void *opaque)
>> +void vfio_msi_interrupt(void *opaque)
>>   {
>>       VFIOMSIVector *vector = opaque;
>>       VFIOPCIDevice *vdev = vector->vdev;
>> @@ -474,8 +474,8 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
>>       return ret;
>>   }
>> -static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
>> -                                  int vector_n, bool msix)
>> +void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
>> +                           int vector_n, bool msix)
>>   {
>>       if ((msix && vdev->no_kvm_msix) || (!msix && vdev->no_kvm_msi)) {
>>           return;
>> @@ -529,7 +529,7 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
>>       kvm_irqchip_commit_routes(kvm_state);
>>   }
>> -static void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
>> +void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
>>   {
>>       VFIOMSIVector *vector = &vdev->msi_vectors[nr];
>>       PCIDevice *pdev = &vdev->pdev;
>> @@ -641,13 +641,12 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>>       return 0;
>>   }
>> -static int vfio_msix_vector_use(PCIDevice *pdev,
>> -                                unsigned int nr, MSIMessage msg)
>> +int vfio_msix_vector_use(PCIDevice *pdev, unsigned int nr, MSIMessage msg)
>>   {
>>       return vfio_msix_vector_do_use(pdev, nr, &msg, vfio_msi_interrupt);
>>   }
>> -static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
>> +void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
>>   {
>>       VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
>>       VFIOMSIVector *vector = &vdev->msi_vectors[nr];
>> @@ -674,14 +673,14 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
>>       }
>>   }
>> -static void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
>> +void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
>>   {
>>       assert(!vdev->defer_kvm_irq_routing);
>>       vdev->defer_kvm_irq_routing = true;
>>       vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
>>   }
>> -static void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
>> +void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
>>   {
>>       int i;
>> @@ -2632,7 +2631,7 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
>>       return OBJECT(vdev);
>>   }
>> -static bool vfio_msix_present(void *opaque, int version_id)
>> +bool vfio_msix_present(void *opaque, int version_id)
>>   {
>>       PCIDevice *pdev = opaque;
>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
>> index 5ce0fb9..c892054 100644
>> --- a/hw/vfio/pci.h
>> +++ b/hw/vfio/pci.h
>> @@ -210,6 +210,18 @@ static inline bool vfio_is_vga(VFIOPCIDevice *vdev)
>>       return class == PCI_CLASS_DISPLAY_VGA;
>>   }
>> +/* MSI/MSI-X/INTx */
>> +void vfio_vector_init(VFIOPCIDevice *vdev, int nr);
>> +void vfio_msi_interrupt(void *opaque);
>> +void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
>> +                           int vector_n, bool msix);
>> +int vfio_msix_vector_use(PCIDevice *pdev, unsigned int nr, MSIMessage msg);
>> +void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr);
>> +bool vfio_msix_present(void *opaque, int version_id);
>> +void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
>> +void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
>> +bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp);
>> +
>>   uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
>>   void vfio_pci_write_config(PCIDevice *pdev,
>>                              uint32_t addr, uint32_t val, int len);
> 




* Re: [PATCH V3 24/42] migration: close kvm after cpr
  2025-05-16  8:35   ` Cédric Le Goater
  2025-05-16 17:14     ` Peter Xu
@ 2025-05-16 18:18     ` Steven Sistare
  2025-05-19  8:51       ` Cédric Le Goater
  1 sibling, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-16 18:18 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/16/2025 4:35 AM, Cédric Le Goater wrote:
> On 5/12/25 17:32, Steve Sistare wrote:
>> cpr-transfer breaks vfio network connectivity to and from the guest, and
>> the host system log shows:
>>    irq bypass consumer (token 00000000a03c32e5) registration fails: -16
>> which is EBUSY.  This occurs because KVM descriptors are still open in
>> the old QEMU process.  Close them.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> This patch doesn't build.
> 
> /usr/bin/ld: libcommon.a.p/migration_cpr.c.o: in function `cpr_kvm_close':
> ./build/../migration/cpr.c:260: undefined reference to `kvm_close'

My build works.
For what binary does this ld command fail?
Could you send the complete ld command with make V=1?

- Steve

>> ---
>>   accel/kvm/kvm-all.c           | 28 ++++++++++++++++++++++++++++
>>   hw/vfio/helpers.c             | 10 ++++++++++
>>   include/hw/vfio/vfio-device.h |  2 ++
>>   include/migration/cpr.h       |  2 ++
>>   include/qemu/vfio-helpers.h   |  1 -
>>   include/system/kvm.h          |  1 +
>>   migration/cpr-transfer.c      | 18 ++++++++++++++++++
>>   migration/cpr.c               |  8 ++++++++
>>   migration/migration.c         |  1 +
>>   9 files changed, 70 insertions(+), 1 deletion(-)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 278a506..d619448 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -512,16 +512,23 @@ static int do_kvm_destroy_vcpu(CPUState *cpu)
>>           goto err;
>>       }
>> +    /* If I am the CPU that created coalesced_mmio_ring, then discard it */
>> +    if (s->coalesced_mmio_ring == (void *)cpu->kvm_run + PAGE_SIZE) {
>> +        s->coalesced_mmio_ring = NULL;
>> +    }
>> +
>>       ret = munmap(cpu->kvm_run, mmap_size);
>>       if (ret < 0) {
>>           goto err;
>>       }
>> +    cpu->kvm_run = NULL;
>>       if (cpu->kvm_dirty_gfns) {
>>           ret = munmap(cpu->kvm_dirty_gfns, s->kvm_dirty_ring_bytes);
>>           if (ret < 0) {
>>               goto err;
>>           }
>> +        cpu->kvm_dirty_gfns = NULL;
>>       }
>>       kvm_park_vcpu(cpu);
>> @@ -600,6 +607,27 @@ err:
>>       return ret;
>>   }
>> +void kvm_close(void)
>> +{
>> +    CPUState *cpu;
>> +
>> +    CPU_FOREACH(cpu) {
>> +        cpu_remove_sync(cpu);
>> +        close(cpu->kvm_fd);
>> +        cpu->kvm_fd = -1;
>> +        close(cpu->kvm_vcpu_stats_fd);
>> +        cpu->kvm_vcpu_stats_fd = -1;
>> +    }
>> +
>> +    if (kvm_state && kvm_state->fd != -1) {
>> +        close(kvm_state->vmfd);
>> +        kvm_state->vmfd = -1;
>> +        close(kvm_state->fd);
>> +        kvm_state->fd = -1;
>> +    }
>> +    kvm_state = NULL;
>> +}
>> +
>>   /*
>>    * dirty pages logging control
>>    */
>> diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
>> index d0dbab1..af1db2f 100644
>> --- a/hw/vfio/helpers.c
>> +++ b/hw/vfio/helpers.c
>> @@ -117,6 +117,16 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
>>   int vfio_kvm_device_fd = -1;
>>   #endif
>> +void vfio_kvm_device_close(void)
>> +{
>> +#ifdef CONFIG_KVM
>> +    if (vfio_kvm_device_fd != -1) {
>> +        close(vfio_kvm_device_fd);
>> +        vfio_kvm_device_fd = -1;
>> +    }
>> +#endif
>> +}
>> +
>>   int vfio_kvm_device_add_fd(int fd, Error **errp)
>>   {
>>   #ifdef CONFIG_KVM
>> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
>> index 4e4d0b6..6eb6f21 100644
>> --- a/include/hw/vfio/vfio-device.h
>> +++ b/include/hw/vfio/vfio-device.h
>> @@ -231,4 +231,6 @@ void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
>>   void vfio_device_init(VFIODevice *vbasedev, int type, VFIODeviceOps *ops,
>>                         DeviceState *dev, bool ram_discard);
>>   int vfio_device_get_aw_bits(VFIODevice *vdev);
>> +
>> +void vfio_kvm_device_close(void);
>>   #endif /* HW_VFIO_VFIO_COMMON_H */
>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>> index fc6aa33..5f1ff10 100644
>> --- a/include/migration/cpr.h
>> +++ b/include/migration/cpr.h
>> @@ -31,7 +31,9 @@ void cpr_state_close(void);
>>   struct QIOChannel *cpr_state_ioc(void);
>>   bool cpr_needed_for_reuse(void *opaque);
>> +void cpr_kvm_close(void);
>> +void cpr_transfer_init(void);
>>   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>>   QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>> diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
>> index bde9495..a029036 100644
>> --- a/include/qemu/vfio-helpers.h
>> +++ b/include/qemu/vfio-helpers.h
>> @@ -28,5 +28,4 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar,
>>                                uint64_t offset, uint64_t size);
>>   int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
>>                              int irq_type, Error **errp);
>> -
>>   #endif
>> diff --git a/include/system/kvm.h b/include/system/kvm.h
>> index b690dda..cfaa94c 100644
>> --- a/include/system/kvm.h
>> +++ b/include/system/kvm.h
>> @@ -194,6 +194,7 @@ bool kvm_has_sync_mmu(void);
>>   int kvm_has_vcpu_events(void);
>>   int kvm_max_nested_state_length(void);
>>   int kvm_has_gsi_routing(void);
>> +void kvm_close(void);
>>   /**
>>    * kvm_arm_supports_user_irq
>> diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c
>> index e1f1403..396558f 100644
>> --- a/migration/cpr-transfer.c
>> +++ b/migration/cpr-transfer.c
>> @@ -17,6 +17,24 @@
>>   #include "migration/vmstate.h"
>>   #include "trace.h"
>> +static int cpr_transfer_notifier(NotifierWithReturn *notifier,
>> +                                 MigrationEvent *e,
>> +                                 Error **errp)
>> +{
>> +    if (e->type == MIG_EVENT_PRECOPY_DONE) {
>> +        cpr_kvm_close();
>> +    }
>> +    return 0;
>> +}
>> +
>> +void cpr_transfer_init(void)
>> +{
>> +    static NotifierWithReturn notifier;
>> +
>> +    migration_add_notifier_mode(&notifier, cpr_transfer_notifier,
>> +                                MIG_MODE_CPR_TRANSFER);
>> +}
>> +
>>   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp)
>>   {
>>       MigrationAddress *addr = channel->addr;
>> diff --git a/migration/cpr.c b/migration/cpr.c
>> index 0b01e25..6102d04 100644
>> --- a/migration/cpr.c
>> +++ b/migration/cpr.c
>> @@ -7,12 +7,14 @@
>>   #include "qemu/osdep.h"
>>   #include "qapi/error.h"
>> +#include "hw/vfio/vfio-device.h"
>>   #include "migration/cpr.h"
>>   #include "migration/misc.h"
>>   #include "migration/options.h"
>>   #include "migration/qemu-file.h"
>>   #include "migration/savevm.h"
>>   #include "migration/vmstate.h"
>> +#include "system/kvm.h"
>>   #include "system/runstate.h"
>>   #include "trace.h"
>> @@ -252,3 +254,9 @@ bool cpr_needed_for_reuse(void *opaque)
>>       MigMode mode = migrate_mode();
>>       return mode == MIG_MODE_CPR_TRANSFER;
>>   }
>> +
>> +void cpr_kvm_close(void)
>> +{
>> +    kvm_close();
>> +    vfio_kvm_device_close();
>> +}
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 4697732..89e2026 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -337,6 +337,7 @@ void migration_object_init(void)
>>       ram_mig_init();
>>       dirty_bitmap_mig_init();
>> +    cpr_transfer_init();
>>       /* Initialize cpu throttle timers */
>>       cpu_throttle_init();
> 




* Re: [PATCH V3 24/42] migration: close kvm after cpr
  2025-05-16 17:14     ` Peter Xu
@ 2025-05-16 19:17       ` Steven Sistare
  0 siblings, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-16 19:17 UTC (permalink / raw)
  To: Peter Xu, Cédric Le Goater
  Cc: qemu-devel, Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Fabiano Rosas,
	Paolo Bonzini

On 5/16/2025 1:14 PM, Peter Xu wrote:
> On Fri, May 16, 2025 at 10:35:44AM +0200, Cédric Le Goater wrote:
>> On 5/12/25 17:32, Steve Sistare wrote:
>>> cpr-transfer breaks vfio network connectivity to and from the guest, and
>>> the host system log shows:
>>>     irq bypass consumer (token 00000000a03c32e5) registration fails: -16
>>> which is EBUSY.  This occurs because KVM descriptors are still open in
>>> the old QEMU process.  Close them.
>>>
> 
> [1]
> 
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>
>> This patch doesn't build.
>>
>> /usr/bin/ld: libcommon.a.p/migration_cpr.c.o: in function `cpr_kvm_close':
>> ./build/../migration/cpr.c:260: undefined reference to `kvm_close'
> 
> And it'll be also good if this patch can keep copying the kvm maintainer
> (Paolo)..  I have Paolo copied.  So this patch is more of a kvm change not
> migration, afaict.  Maybe we should split this into two patches.
> 
> Steve, you could attach a cc line in this patch to make sure it won't be
> forgotten when you repost (at [1] above, I think git-send-email would
> remember that then):
> 
> Cc: Paolo Bonzini <pbonzini@redhat.com>

Righto.  My intention was to cc him on this specific patch and not the whole
series, which I did in V2 but forgot in V3.  Thanks for the tip on embedding
cc in the patch.

> Some other questions below.
> 
>>> ---
>>>    accel/kvm/kvm-all.c           | 28 ++++++++++++++++++++++++++++
>>>    hw/vfio/helpers.c             | 10 ++++++++++
>>>    include/hw/vfio/vfio-device.h |  2 ++
>>>    include/migration/cpr.h       |  2 ++
>>>    include/qemu/vfio-helpers.h   |  1 -
>>>    include/system/kvm.h          |  1 +
>>>    migration/cpr-transfer.c      | 18 ++++++++++++++++++
>>>    migration/cpr.c               |  8 ++++++++
>>>    migration/migration.c         |  1 +
>>>    9 files changed, 70 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>>> index 278a506..d619448 100644
>>> --- a/accel/kvm/kvm-all.c
>>> +++ b/accel/kvm/kvm-all.c
>>> @@ -512,16 +512,23 @@ static int do_kvm_destroy_vcpu(CPUState *cpu)
>>>            goto err;
>>>        }
>>> +    /* If I am the CPU that created coalesced_mmio_ring, then discard it */
> 
> Are these "reset to NULL" below required or cleanup?  It's not yet clear to
> me when coalesced_mmio_ring isn't owned by the current CPU state.  Maybe
> also better to split this chunk with some commit message?

The pointers are not valid after this point.  Setting them to NULL is cleanup,
but best practice IMO.  The vcpus are paused, so nothing should be touching
coalesced_mmio_ring, and it does not matter in which order we destroy the vcpus.

- Steve
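
For illustration only, a minimal sketch of the unmap-then-clear pattern
discussed above.  The struct and function names are hypothetical and are not
QEMU code; the point is that clearing the pointer after a successful munmap()
makes a repeated teardown harmless.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical stand-in for per-vcpu state; not QEMU's CPUState. */
struct vcpu_sketch {
    void *run;        /* mapped region, or NULL when not mapped */
    size_t run_size;  /* length that was passed to mmap() */
};

/* Map an anonymous region.  Returns 0 on success, -1 on failure. */
static int vcpu_sketch_map(struct vcpu_sketch *v, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED) {
        return -1;
    }
    v->run = p;
    v->run_size = size;
    return 0;
}

/* Unmap the region and clear the pointer so a stale address is never reused. */
static int vcpu_sketch_unmap(struct vcpu_sketch *v)
{
    if (v->run == NULL) {
        return 0;     /* already torn down, nothing to do */
    }
    if (munmap(v->run, v->run_size) < 0) {
        return -1;    /* leave the pointer intact on failure */
    }
    v->run = NULL;
    v->run_size = 0;
    return 0;
}
```

With the pointer cleared, a second call to vcpu_sketch_unmap() returns 0
without touching the address again.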

>>> +    if (s->coalesced_mmio_ring == (void *)cpu->kvm_run + PAGE_SIZE) {
>>> +        s->coalesced_mmio_ring = NULL;
>>> +    }
>>> +
>>>        ret = munmap(cpu->kvm_run, mmap_size);
>>>        if (ret < 0) {
>>>            goto err;
>>>        }
>>> +    cpu->kvm_run = NULL;
>>>        if (cpu->kvm_dirty_gfns) {
>>>            ret = munmap(cpu->kvm_dirty_gfns, s->kvm_dirty_ring_bytes);
>>>            if (ret < 0) {
>>>                goto err;
>>>            }
>>> +        cpu->kvm_dirty_gfns = NULL;
>>>        }
>>>        kvm_park_vcpu(cpu);
>>> @@ -600,6 +607,27 @@ err:
>>>        return ret;
>>>    }
>>> +void kvm_close(void)
>>> +{
>>> +    CPUState *cpu;
>>> +
>>> +    CPU_FOREACH(cpu) {
>>> +        cpu_remove_sync(cpu);
>>> +        close(cpu->kvm_fd);
>>> +        cpu->kvm_fd = -1;
>>> +        close(cpu->kvm_vcpu_stats_fd);
>>> +        cpu->kvm_vcpu_stats_fd = -1;
>>> +    }
>>> +
>>> +    if (kvm_state && kvm_state->fd != -1) {
>>> +        close(kvm_state->vmfd);
>>> +        kvm_state->vmfd = -1;
>>> +        close(kvm_state->fd);
>>> +        kvm_state->fd = -1;
>>> +    }
>>> +    kvm_state = NULL;
>>> +}
>>> +
>>>    /*
>>>     * dirty pages logging control
>>>     */
>>> diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
>>> index d0dbab1..af1db2f 100644
>>> --- a/hw/vfio/helpers.c
>>> +++ b/hw/vfio/helpers.c
>>> @@ -117,6 +117,16 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
>>>    int vfio_kvm_device_fd = -1;
>>>    #endif
>>> +void vfio_kvm_device_close(void)
>>> +{
>>> +#ifdef CONFIG_KVM
>>> +    if (vfio_kvm_device_fd != -1) {
>>> +        close(vfio_kvm_device_fd);
>>> +        vfio_kvm_device_fd = -1;
>>> +    }
>>> +#endif
>>> +}
>>> +
>>>    int vfio_kvm_device_add_fd(int fd, Error **errp)
>>>    {
>>>    #ifdef CONFIG_KVM
>>> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
>>> index 4e4d0b6..6eb6f21 100644
>>> --- a/include/hw/vfio/vfio-device.h
>>> +++ b/include/hw/vfio/vfio-device.h
>>> @@ -231,4 +231,6 @@ void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
>>>    void vfio_device_init(VFIODevice *vbasedev, int type, VFIODeviceOps *ops,
>>>                          DeviceState *dev, bool ram_discard);
>>>    int vfio_device_get_aw_bits(VFIODevice *vdev);
>>> +
>>> +void vfio_kvm_device_close(void);
>>>    #endif /* HW_VFIO_VFIO_COMMON_H */
>>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>>> index fc6aa33..5f1ff10 100644
>>> --- a/include/migration/cpr.h
>>> +++ b/include/migration/cpr.h
>>> @@ -31,7 +31,9 @@ void cpr_state_close(void);
>>>    struct QIOChannel *cpr_state_ioc(void);
>>>    bool cpr_needed_for_reuse(void *opaque);
>>> +void cpr_kvm_close(void);
>>> +void cpr_transfer_init(void);
>>>    QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>>>    QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>>> diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
>>> index bde9495..a029036 100644
>>> --- a/include/qemu/vfio-helpers.h
>>> +++ b/include/qemu/vfio-helpers.h
>>> @@ -28,5 +28,4 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar,
>>>                                 uint64_t offset, uint64_t size);
>>>    int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
>>>                               int irq_type, Error **errp);
>>> -
>>>    #endif
>>> diff --git a/include/system/kvm.h b/include/system/kvm.h
>>> index b690dda..cfaa94c 100644
>>> --- a/include/system/kvm.h
>>> +++ b/include/system/kvm.h
>>> @@ -194,6 +194,7 @@ bool kvm_has_sync_mmu(void);
>>>    int kvm_has_vcpu_events(void);
>>>    int kvm_max_nested_state_length(void);
>>>    int kvm_has_gsi_routing(void);
>>> +void kvm_close(void);
>>>    /**
>>>     * kvm_arm_supports_user_irq
>>> diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c
>>> index e1f1403..396558f 100644
>>> --- a/migration/cpr-transfer.c
>>> +++ b/migration/cpr-transfer.c
>>> @@ -17,6 +17,24 @@
>>>    #include "migration/vmstate.h"
>>>    #include "trace.h"
>>> +static int cpr_transfer_notifier(NotifierWithReturn *notifier,
>>> +                                 MigrationEvent *e,
>>> +                                 Error **errp)
>>> +{
>>> +    if (e->type == MIG_EVENT_PRECOPY_DONE) {
>>> +        cpr_kvm_close();
>>> +    }
>>> +    return 0;
>>> +}
>>> +
>>> +void cpr_transfer_init(void)
>>> +{
>>> +    static NotifierWithReturn notifier;
>>> +
>>> +    migration_add_notifier_mode(&notifier, cpr_transfer_notifier,
>>> +                                MIG_MODE_CPR_TRANSFER);
>>> +}
>>> +
>>>    QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp)
>>>    {
>>>        MigrationAddress *addr = channel->addr;
>>> diff --git a/migration/cpr.c b/migration/cpr.c
>>> index 0b01e25..6102d04 100644
>>> --- a/migration/cpr.c
>>> +++ b/migration/cpr.c
>>> @@ -7,12 +7,14 @@
>>>    #include "qemu/osdep.h"
>>>    #include "qapi/error.h"
>>> +#include "hw/vfio/vfio-device.h"
>>>    #include "migration/cpr.h"
>>>    #include "migration/misc.h"
>>>    #include "migration/options.h"
>>>    #include "migration/qemu-file.h"
>>>    #include "migration/savevm.h"
>>>    #include "migration/vmstate.h"
>>> +#include "system/kvm.h"
>>>    #include "system/runstate.h"
>>>    #include "trace.h"
>>> @@ -252,3 +254,9 @@ bool cpr_needed_for_reuse(void *opaque)
>>>        MigMode mode = migrate_mode();
>>>        return mode == MIG_MODE_CPR_TRANSFER;
>>>    }
>>> +
>>> +void cpr_kvm_close(void)
>>> +{
>>> +    kvm_close();
>>> +    vfio_kvm_device_close();
>>> +}
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 4697732..89e2026 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -337,6 +337,7 @@ void migration_object_init(void)
>>>        ram_mig_init();
>>>        dirty_bitmap_mig_init();
>>> +    cpr_transfer_init();
>>>        /* Initialize cpu throttle timers */
>>>        cpu_throttle_init();
>>
> 




* Re: [PATCH V3 00/42] Live update: vfio and iommufd
  2025-05-16 17:17   ` Steven Sistare
@ 2025-05-16 19:48     ` Steven Sistare
  2025-05-19  8:54       ` Cédric Le Goater
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-16 19:48 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas,
	John Levon

On 5/16/2025 1:17 PM, Steven Sistare wrote:
> On 5/16/2025 12:37 PM, Cédric Le Goater wrote:
>> Steve,
>>
>> + John
>>
>> On 5/12/25 17:32, Steve Sistare wrote:
>>> Support vfio and iommufd devices with the cpr-transfer live migration mode.
>>> Devices that do not support live migration can still support cpr-transfer,
>>> allowing live update to a new version of QEMU on the same host, with no loss
>>> of guest connectivity.
>>>
>>> No user-visible interfaces are added.
>>>
>>> For legacy containers:
>>>
>>> Pass vfio device descriptors to new QEMU.  In new QEMU, during vfio_realize,
>>> skip the ioctls that configure the device, because it is already configured.
>>>
>>> Use VFIO_DMA_UNMAP_FLAG_VADDR to abandon the old VA's for DMA mapped
>>> regions, and use VFIO_DMA_MAP_FLAG_VADDR to register the new VA in new
>>> QEMU and update the locked memory accounting.  The physical pages remain
>>> pinned, because the descriptor of the device that locked them remains open,
>>> so DMA to those pages continues without interruption.  Mediated devices are
>>> not supported, however, because they require the VA to always be valid, and
>>> there is a brief window where no VA is registered.
>>>
>>> Save the MSI message area as part of vfio-pci vmstate, and pass the interrupt
>>> and notifier eventfd's to new QEMU.  New QEMU loads the MSI data, then the
>>> vfio-pci post_load handler finds the eventfds in CPR state, rebuilds vector
>>> data structures, and attaches the interrupts to the new KVM instance.  This
>>> logic also applies to iommufd containers.
>>>
>>> For iommufd containers:
>>>
>>> Use IOMMU_IOAS_MAP_FILE to register memory regions for DMA when they are
>>> backed by a file (including a memfd), so DMA mappings do not depend on VA,
>>> which can differ after live update.  This allows mediated devices to be
>>> supported.
>>>
>>> Pass the iommufd and vfio device descriptors from old to new QEMU.  In new
>>> QEMU, during vfio_realize, skip the ioctls that configure the device, because
>>> it is already configured.
>>>
>>> In new QEMU, call ioctl(IOMMU_IOAS_CHANGE_PROCESS) to update mm ownership and
>>> locked memory accounting.
>>>
>>> Patches 4 to 12 are specific to legacy containers.
>>> Patches 25 to 41 are specific to iommufd containers.
>>
>> For v4, could you please send a first "part I" with patches [1-20] ?
>> I think these are reviewed, or nearly, and could be merged quickly.

Just to help you keep track, these are the remaining vfio patches for you
to review before I send V4:

   vfio/container: recover from unmap-all-vaddr failure
   vfio-pci: preserve MSI
   vfio-pci: preserve INTx

- Steve

>> Even if the "Live update: vfio and iommufd" series is not fully
>> reviewed yet, there are good signs that it will before the end of
>> the QEMU 10.1 cycle. The same applies to vfio-user.
>>
>> We need to bring together the proposals changing memory_get_xlat_addr().
>> It's important as it is blocking both the vfio-user series and yours.
>> This can be done in parallel.
>>
>> Then we can address the iommufd part.
> 
> OK.  I was already preparing memory_get_xlat_addr when I received your email.
> 
> - Steve
> 




* RE: [PATCH V3 39/42] vfio/iommufd: reconstruct hwpt
  2025-05-12 15:32 ` [PATCH V3 39/42] vfio/iommufd: reconstruct hwpt Steve Sistare
@ 2025-05-19  3:25   ` Duan, Zhenzhong
  2025-05-19 15:53     ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Duan, Zhenzhong @ 2025-05-19  3:25 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V3 39/42] vfio/iommufd: reconstruct hwpt
>
>Save the hwpt_id in vmstate.  In realize, skip its allocation from
>iommufd_cdev_attach -> iommufd_cdev_attach_container ->
>iommufd_cdev_autodomains_get.
>
>Rebuild userland structures to hold hwpt_id by calling
>iommufd_cdev_rebuild_hwpt at post load time.  This depends on hw_caps, which
>was restored by the post_load call to vfio_device_hiod_create_and_realize.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>---
> hw/vfio/cpr-iommufd.c      |  7 +++++++
> hw/vfio/iommufd.c          | 24 ++++++++++++++++++++++--
> hw/vfio/trace-events       |  1 +
> hw/vfio/vfio-iommufd.h     |  3 +++
> include/hw/vfio/vfio-cpr.h |  1 +
> 5 files changed, 34 insertions(+), 2 deletions(-)
>
>diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>index 24cdf10..6d3f4e0 100644
>--- a/hw/vfio/cpr-iommufd.c
>+++ b/hw/vfio/cpr-iommufd.c
>@@ -110,6 +110,12 @@ static int vfio_device_post_load(void *opaque, int
>version_id)
>         error_report_err(err);
>         return false;
>     }
>+    if (!vbasedev->mdev) {
>+        VFIOIOMMUFDContainer *container = container_of(vbasedev->bcontainer,
>+                                                       VFIOIOMMUFDContainer,
>+                                                       bcontainer);
>+        iommufd_cdev_rebuild_hwpt(vbasedev, container);
>+    }
>     return true;
> }
>
>@@ -121,6 +127,7 @@ static const VMStateDescription vfio_device_vmstate = {
>     .needed = cpr_needed_for_reuse,
>     .fields = (VMStateField[]) {
>         VMSTATE_INT32(devid, VFIODevice),
>+        VMSTATE_UINT32(cpr.hwpt_id, VFIODevice),
>         VMSTATE_END_OF_LIST()
>     }
> };
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index d980684..ec79c83 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -318,6 +318,7 @@ static bool
>iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
> static void iommufd_cdev_use_hwpt(VFIODevice *vbasedev, VFIOIOASHwpt
>*hwpt)
> {
>     vbasedev->hwpt = hwpt;
>+    vbasedev->cpr.hwpt_id = hwpt->hwpt_id;
>     vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
>     QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
> }
>@@ -373,6 +374,23 @@ static bool iommufd_cdev_make_hwpt(VFIODevice
>*vbasedev,
>     return true;
> }
>
>+void iommufd_cdev_rebuild_hwpt(VFIODevice *vbasedev,
>+                               VFIOIOMMUFDContainer *container)
>+{
>+    VFIOIOASHwpt *hwpt;
>+    int hwpt_id = vbasedev->cpr.hwpt_id;
>+
>+    trace_iommufd_cdev_rebuild_hwpt(container->be->fd, hwpt_id);
>+
>+    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>+        if (hwpt->hwpt_id == hwpt_id) {
>+            iommufd_cdev_use_hwpt(vbasedev, hwpt);
>+            return;
>+        }
>+    }
>+    iommufd_cdev_make_hwpt(vbasedev, container, hwpt_id, false, NULL);
>+}
>+
> static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>                                          VFIOIOMMUFDContainer *container,
>                                          Error **errp)
>@@ -567,7 +585,8 @@ static bool iommufd_cdev_attach(const char *name,
>VFIODevice *vbasedev,
>             vbasedev->iommufd != container->be) {
>             continue;
>         }
>-        if (!iommufd_cdev_attach_container(vbasedev, container, &err)) {
>+        if (!vbasedev->cpr.reused &&
>+            !iommufd_cdev_attach_container(vbasedev, container, &err)) {
>             const char *msg = error_get_pretty(err);
>
>             trace_iommufd_cdev_fail_attach_existing_container(msg);
>@@ -605,7 +624,8 @@ skip_ioas_alloc:
>     bcontainer = &container->bcontainer;
>     vfio_address_space_insert(space, bcontainer);
>
>-    if (!iommufd_cdev_attach_container(vbasedev, container, errp)) {
>+    if (!vbasedev->cpr.reused &&
>+        !iommufd_cdev_attach_container(vbasedev, container, errp)) {

All container attachment is bypassed in new QEMU. My concern is that new QEMU may not create the same containers as old QEMU when old QEMU has more than one container.
Devices could then be attached to the wrong container, or the attach could fail in post load.
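
One way to picture the concern, as a rough sketch only: the types and helpers
below are hypothetical and much simpler than the real VFIOIOMMUFDContainer
and VFIOIOASHwpt.  The lookup of a saved hwpt_id is scoped to a single
container's list, so if a device ends up in a different container than it
had in old QEMU, the lookup misses and a second record is created for the
same id.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HWPT_SKETCH 4

/* Hypothetical container: an IOAS id plus the hwpt ids it tracks. */
struct container_sketch {
    int ioas_id;
    int hwpt_ids[MAX_HWPT_SKETCH];
    size_t nr_hwpt;
};

/* Return 1 if this container already tracks hwpt_id, else 0. */
static int container_sketch_has_hwpt(const struct container_sketch *c,
                                     int hwpt_id)
{
    for (size_t i = 0; i < c->nr_hwpt; i++) {
        if (c->hwpt_ids[i] == hwpt_id) {
            return 1;
        }
    }
    return 0;
}

/*
 * Reuse an existing record for hwpt_id, or add a new one.
 * Returns 0 if reused, 1 if a new record was added, -1 if the list is full.
 */
static int container_sketch_use_hwpt(struct container_sketch *c, int hwpt_id)
{
    if (container_sketch_has_hwpt(c, hwpt_id)) {
        return 0;
    }
    if (c->nr_hwpt == MAX_HWPT_SKETCH) {
        return -1;
    }
    c->hwpt_ids[c->nr_hwpt++] = hwpt_id;
    return 1;
}
```

In this sketch, a device whose saved id is found in container A reuses the
record, while the same id presented to container B silently creates a new
one, which is the kind of mismatch described above.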

>         goto err_attach_container;
>     }
>
>diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>index e90ec9b..4955264 100644
>--- a/hw/vfio/trace-events
>+++ b/hw/vfio/trace-events
>@@ -190,6 +190,7 @@ iommufd_cdev_connect_and_bind(int iommufd, const
>char *name, int devfd, int devi
> iommufd_cdev_getfd(const char *dev, int devfd) " %s (fd=%d)"
> iommufd_cdev_attach_ioas_hwpt(int iommufd, const char *name, int devfd, int
>id) " [iommufd=%d] Successfully attached device %s (%d) to id=%d"
> iommufd_cdev_detach_ioas_hwpt(int iommufd, const char *name) "
>[iommufd=%d] Successfully detached %s"
>+iommufd_cdev_rebuild_hwpt(int iommufd, int hwpt_id) " [iommufd=%d]
>hwpt %d"
> iommufd_cdev_fail_attach_existing_container(const char *msg) " %s"
> iommufd_cdev_alloc_ioas(int iommufd, int ioas_id) " [iommufd=%d] new
>IOMMUFD container with ioasid=%d"
> iommufd_cdev_device_info(char *name, int devfd, int num_irqs, int
>num_regions, int flags) " %s (%d) num_irqs=%d num_regions=%d flags=%d"
>diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
>index 148ce89..78af0d8 100644
>--- a/hw/vfio/vfio-iommufd.h
>+++ b/hw/vfio/vfio-iommufd.h
>@@ -38,4 +38,7 @@ OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer,
>VFIO_IOMMU_IOMMUFD);
> bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
>                                       uint32_t ioas_id, Error **errp);
>
>+void iommufd_cdev_rebuild_hwpt(VFIODevice *vbasedev,
>+                               VFIOIOMMUFDContainer *container);
>+
> #endif /* HW_VFIO_VFIO_IOMMUFD_H */
>diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>index 1379b20..b98c247 100644
>--- a/include/hw/vfio/vfio-cpr.h
>+++ b/include/hw/vfio/vfio-cpr.h
>@@ -24,6 +24,7 @@ typedef struct VFIODeviceCPR {
>     bool reused;
>     Error *mdev_blocker;
>     Error *id_blocker;
>+    uint32_t hwpt_id;
> } VFIODeviceCPR;
>
> struct VFIOContainer;
>--
>1.8.3.1




* Re: [PATCH V3 24/42] migration: close kvm after cpr
  2025-05-16 18:18     ` Steven Sistare
@ 2025-05-19  8:51       ` Cédric Le Goater
  2025-05-19 19:07         ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-19  8:51 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/16/25 20:18, Steven Sistare wrote:
> On 5/16/2025 4:35 AM, Cédric Le Goater wrote:
>> On 5/12/25 17:32, Steve Sistare wrote:
>>> cpr-transfer breaks vfio network connectivity to and from the guest, and
>>> the host system log shows:
>>>    irq bypass consumer (token 00000000a03c32e5) registration fails: -16
>>> which is EBUSY.  This occurs because KVM descriptors are still open in
>>> the old QEMU process.  Close them.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>
>> This patch doesn't build.
>>
>> /usr/bin/ld: libcommon.a.p/migration_cpr.c.o: in function `cpr_kvm_close':
>> ./build/../migration/cpr.c:260: undefined reference to `kvm_close'
> 
> My build works.
> For what binary does this ld command fail?


FAILED: qemu-system-s390x
FAILED: qemu-system-ppc
FAILED: qemu-system-ppc64
FAILED: qemu-system-arm
FAILED: qemu-system-aarch64


Thanks,

C.
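
A guess at the cause, not confirmed in this thread: the failing targets may
be ones built without KVM on this host, so common code calls a function that
is only compiled when KVM is enabled.  A minimal single-file sketch of the
usual remedy, a no-op stub for the disabled case, is shown below.
HAVE_KVM_SKETCH and the *_sketch names are hypothetical, and the way QEMU
itself provides stubs may differ.

```c
#include <assert.h>

/*
 * HAVE_KVM_SKETCH is a hypothetical build switch standing in for a
 * per-target "KVM is compiled in" option.  It is left undefined here,
 * so the stub branch is the one that gets built.
 */
#ifdef HAVE_KVM_SKETCH
/* Real implementation: would close the accelerator descriptors. */
static int kvm_close_sketch(void)
{
    return 1;
}
#else
/* Stub: no accelerator state exists, so there is nothing to close. */
static int kvm_close_sketch(void)
{
    return 0;
}
#endif

/* Common code can call the function unconditionally and still link. */
static int cpr_kvm_close_sketch(void)
{
    return kvm_close_sketch();
}
```

The caller stays free of #ifdef, and the choice between the real function and
the stub is made once, at build time.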



> Could you send the complete ld command with make V=1?
> 
> - Steve
> 
>>> ---
>>>   accel/kvm/kvm-all.c           | 28 ++++++++++++++++++++++++++++
>>>   hw/vfio/helpers.c             | 10 ++++++++++
>>>   include/hw/vfio/vfio-device.h |  2 ++
>>>   include/migration/cpr.h       |  2 ++
>>>   include/qemu/vfio-helpers.h   |  1 -
>>>   include/system/kvm.h          |  1 +
>>>   migration/cpr-transfer.c      | 18 ++++++++++++++++++
>>>   migration/cpr.c               |  8 ++++++++
>>>   migration/migration.c         |  1 +
>>>   9 files changed, 70 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>>> index 278a506..d619448 100644
>>> --- a/accel/kvm/kvm-all.c
>>> +++ b/accel/kvm/kvm-all.c
>>> @@ -512,16 +512,23 @@ static int do_kvm_destroy_vcpu(CPUState *cpu)
>>>           goto err;
>>>       }
>>> +    /* If I am the CPU that created coalesced_mmio_ring, then discard it */
>>> +    if (s->coalesced_mmio_ring == (void *)cpu->kvm_run + PAGE_SIZE) {
>>> +        s->coalesced_mmio_ring = NULL;
>>> +    }
>>> +
>>>       ret = munmap(cpu->kvm_run, mmap_size);
>>>       if (ret < 0) {
>>>           goto err;
>>>       }
>>> +    cpu->kvm_run = NULL;
>>>       if (cpu->kvm_dirty_gfns) {
>>>           ret = munmap(cpu->kvm_dirty_gfns, s->kvm_dirty_ring_bytes);
>>>           if (ret < 0) {
>>>               goto err;
>>>           }
>>> +        cpu->kvm_dirty_gfns = NULL;
>>>       }
>>>       kvm_park_vcpu(cpu);
>>> @@ -600,6 +607,27 @@ err:
>>>       return ret;
>>>   }
>>> +void kvm_close(void)
>>> +{
>>> +    CPUState *cpu;
>>> +
>>> +    CPU_FOREACH(cpu) {
>>> +        cpu_remove_sync(cpu);
>>> +        close(cpu->kvm_fd);
>>> +        cpu->kvm_fd = -1;
>>> +        close(cpu->kvm_vcpu_stats_fd);
>>> +        cpu->kvm_vcpu_stats_fd = -1;
>>> +    }
>>> +
>>> +    if (kvm_state && kvm_state->fd != -1) {
>>> +        close(kvm_state->vmfd);
>>> +        kvm_state->vmfd = -1;
>>> +        close(kvm_state->fd);
>>> +        kvm_state->fd = -1;
>>> +    }
>>> +    kvm_state = NULL;
>>> +}
>>> +
>>>   /*
>>>    * dirty pages logging control
>>>    */
>>> diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
>>> index d0dbab1..af1db2f 100644
>>> --- a/hw/vfio/helpers.c
>>> +++ b/hw/vfio/helpers.c
>>> @@ -117,6 +117,16 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
>>>   int vfio_kvm_device_fd = -1;
>>>   #endif
>>> +void vfio_kvm_device_close(void)
>>> +{
>>> +#ifdef CONFIG_KVM
>>> +    if (vfio_kvm_device_fd != -1) {
>>> +        close(vfio_kvm_device_fd);
>>> +        vfio_kvm_device_fd = -1;
>>> +    }
>>> +#endif
>>> +}
>>> +
>>>   int vfio_kvm_device_add_fd(int fd, Error **errp)
>>>   {
>>>   #ifdef CONFIG_KVM
>>> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
>>> index 4e4d0b6..6eb6f21 100644
>>> --- a/include/hw/vfio/vfio-device.h
>>> +++ b/include/hw/vfio/vfio-device.h
>>> @@ -231,4 +231,6 @@ void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
>>>   void vfio_device_init(VFIODevice *vbasedev, int type, VFIODeviceOps *ops,
>>>                         DeviceState *dev, bool ram_discard);
>>>   int vfio_device_get_aw_bits(VFIODevice *vdev);
>>> +
>>> +void vfio_kvm_device_close(void);
>>>   #endif /* HW_VFIO_VFIO_COMMON_H */
>>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>>> index fc6aa33..5f1ff10 100644
>>> --- a/include/migration/cpr.h
>>> +++ b/include/migration/cpr.h
>>> @@ -31,7 +31,9 @@ void cpr_state_close(void);
>>>   struct QIOChannel *cpr_state_ioc(void);
>>>   bool cpr_needed_for_reuse(void *opaque);
>>> +void cpr_kvm_close(void);
>>> +void cpr_transfer_init(void);
>>>   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>>>   QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>>> diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
>>> index bde9495..a029036 100644
>>> --- a/include/qemu/vfio-helpers.h
>>> +++ b/include/qemu/vfio-helpers.h
>>> @@ -28,5 +28,4 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar,
>>>                                uint64_t offset, uint64_t size);
>>>   int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
>>>                              int irq_type, Error **errp);
>>> -
>>>   #endif
>>> diff --git a/include/system/kvm.h b/include/system/kvm.h
>>> index b690dda..cfaa94c 100644
>>> --- a/include/system/kvm.h
>>> +++ b/include/system/kvm.h
>>> @@ -194,6 +194,7 @@ bool kvm_has_sync_mmu(void);
>>>   int kvm_has_vcpu_events(void);
>>>   int kvm_max_nested_state_length(void);
>>>   int kvm_has_gsi_routing(void);
>>> +void kvm_close(void);
>>>   /**
>>>    * kvm_arm_supports_user_irq
>>> diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c
>>> index e1f1403..396558f 100644
>>> --- a/migration/cpr-transfer.c
>>> +++ b/migration/cpr-transfer.c
>>> @@ -17,6 +17,24 @@
>>>   #include "migration/vmstate.h"
>>>   #include "trace.h"
>>> +static int cpr_transfer_notifier(NotifierWithReturn *notifier,
>>> +                                 MigrationEvent *e,
>>> +                                 Error **errp)
>>> +{
>>> +    if (e->type == MIG_EVENT_PRECOPY_DONE) {
>>> +        cpr_kvm_close();
>>> +    }
>>> +    return 0;
>>> +}
>>> +
>>> +void cpr_transfer_init(void)
>>> +{
>>> +    static NotifierWithReturn notifier;
>>> +
>>> +    migration_add_notifier_mode(&notifier, cpr_transfer_notifier,
>>> +                                MIG_MODE_CPR_TRANSFER);
>>> +}
>>> +
>>>   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp)
>>>   {
>>>       MigrationAddress *addr = channel->addr;
>>> diff --git a/migration/cpr.c b/migration/cpr.c
>>> index 0b01e25..6102d04 100644
>>> --- a/migration/cpr.c
>>> +++ b/migration/cpr.c
>>> @@ -7,12 +7,14 @@
>>>   #include "qemu/osdep.h"
>>>   #include "qapi/error.h"
>>> +#include "hw/vfio/vfio-device.h"
>>>   #include "migration/cpr.h"
>>>   #include "migration/misc.h"
>>>   #include "migration/options.h"
>>>   #include "migration/qemu-file.h"
>>>   #include "migration/savevm.h"
>>>   #include "migration/vmstate.h"
>>> +#include "system/kvm.h"
>>>   #include "system/runstate.h"
>>>   #include "trace.h"
>>> @@ -252,3 +254,9 @@ bool cpr_needed_for_reuse(void *opaque)
>>>       MigMode mode = migrate_mode();
>>>       return mode == MIG_MODE_CPR_TRANSFER;
>>>   }
>>> +
>>> +void cpr_kvm_close(void)
>>> +{
>>> +    kvm_close();
>>> +    vfio_kvm_device_close();
>>> +}
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 4697732..89e2026 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -337,6 +337,7 @@ void migration_object_init(void)
>>>       ram_mig_init();
>>>       dirty_bitmap_mig_init();
>>> +    cpr_transfer_init();
>>>       /* Initialize cpu throttle timers */
>>>       cpu_throttle_init();
>>
> 




* Re: [PATCH V3 00/42] Live update: vfio and iommufd
  2025-05-16 19:48     ` Steven Sistare
@ 2025-05-19  8:54       ` Cédric Le Goater
  0 siblings, 0 replies; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-19  8:54 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas,
	John Levon

On 5/16/25 21:48, Steven Sistare wrote:
> On 5/16/2025 1:17 PM, Steven Sistare wrote:
>> On 5/16/2025 12:37 PM, Cédric Le Goater wrote:
>>> Steve,
>>>
>>> + John
>>>
>>> On 5/12/25 17:32, Steve Sistare wrote:
>>>> Support vfio and iommufd devices with the cpr-transfer live migration mode.
>>>> Devices that do not support live migration can still support cpr-transfer,
>>>> allowing live update to a new version of QEMU on the same host, with no loss
>>>> of guest connectivity.
>>>>
>>>> No user-visible interfaces are added.
>>>>
>>>> For legacy containers:
>>>>
>>>> Pass vfio device descriptors to new QEMU.  In new QEMU, during vfio_realize,
>>>> skip the ioctls that configure the device, because it is already configured.
>>>>
>>>> Use VFIO_DMA_UNMAP_FLAG_VADDR to abandon the old VA's for DMA mapped
>>>> regions, and use VFIO_DMA_MAP_FLAG_VADDR to register the new VA in new
>>>> QEMU and update the locked memory accounting.  The physical pages remain
>>>> pinned, because the descriptor of the device that locked them remains open,
>>>> so DMA to those pages continues without interruption.  Mediated devices are
>>>> not supported, however, because they require the VA to always be valid, and
>>>> there is a brief window where no VA is registered.
>>>>
>>>> Save the MSI message area as part of vfio-pci vmstate, and pass the interrupt
>>>> and notifier eventfd's to new QEMU.  New QEMU loads the MSI data, then the
>>>> vfio-pci post_load handler finds the eventfds in CPR state, rebuilds vector
>>>> data structures, and attaches the interrupts to the new KVM instance.  This
>>>> logic also applies to iommufd containers.
>>>>
>>>> For iommufd containers:
>>>>
>>>> Use IOMMU_IOAS_MAP_FILE to register memory regions for DMA when they are
>>>> backed by a file (including a memfd), so DMA mappings do not depend on VA,
>>>> which can differ after live update.  This allows mediated devices to be
>>>> supported.
>>>>
>>>> Pass the iommufd and vfio device descriptors from old to new QEMU.  In new
>>>> QEMU, during vfio_realize, skip the ioctls that configure the device, because
>>>> it is already configured.
>>>>
>>>> In new QEMU, call ioctl(IOMMU_IOAS_CHANGE_PROCESS) to update mm ownership and
>>>> locked memory accounting.
>>>>
>>>> Patches 4 to 12 are specific to legacy containers.
>>>> Patches 25 to 41 are specific to iommufd containers.
>>>
>>> For v4, could you please send a first "part I" with patches [1-20] ?
>>> I think these are reviewed, or nearly, and could be merged quickly.
> 
> Just to help you keep track, these are the remaining vfio patches for you
> to review before I send V4:
> 
>    vfio/container: recover from unmap-all-vaddr failure

yes. I haven't read your other replies yet.

>    vfio-pci: preserve MSI
>    vfio-pci: preserve INTx

IRQ changes are for later because they conflict with vfio-user.

Thanks,

C.



> 
> - Steve
> 
>>> Even if the "Live update: vfio and iommufd" series is not fully
>>> reviewed yet, there are good signs that it will before the end of
>>> the QEMU 10.1 cycle. The same applies to vfio-user.
>>>
>>> We need to bring together the proposals changing memory_get_xlat_addr().
>>> It's important as it is blocking both the vfio-user series and yours.
>>> This can be done in parallel.
>>>
>>> Then we can address the iommufd part.
>>
>> OK.  I was already preparing memory_get_xlat_addr when I received your email.
>>
>> - Steve
>>
> 




* Re: [PATCH V3 07/42] vfio/container: preserve descriptors
  2025-05-15 19:08     ` Steven Sistare
@ 2025-05-19 13:20       ` Cédric Le Goater
  2025-05-19 16:21         ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-19 13:20 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/15/25 21:08, Steven Sistare wrote:
> On 5/15/2025 8:59 AM, Cédric Le Goater wrote:
>> On 5/12/25 17:32, Steve Sistare wrote:
>>> At vfio creation time, save the value of vfio container, group, and device
>>> descriptors in CPR state.  On qemu restart, vfio_realize() finds and uses
>>> the saved descriptors, and remembers the reused status for subsequent
>>> patches.  The reused status is cleared when vmstate load finishes.
>>>
>>> During reuse, device and iommu state is already configured, so operations
>>> in vfio_realize that would modify the configuration, such as vfio ioctl's,
>>> are skipped.  The result is that vfio_realize constructs qemu data
>>> structures that reflect the current state of the device.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>>   hw/vfio/container.c           | 65 ++++++++++++++++++++++++++++++++++++-------
>>>   hw/vfio/cpr-legacy.c          | 46 ++++++++++++++++++++++++++++++
>>>   include/hw/vfio/vfio-cpr.h    |  9 ++++++
>>>   include/hw/vfio/vfio-device.h |  2 ++
>>>   4 files changed, 112 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>>> index 85c76da..278a220 100644
>>> --- a/hw/vfio/container.c
>>> +++ b/hw/vfio/container.c
>>> @@ -31,6 +31,8 @@
>>>   #include "system/reset.h"
>>>   #include "trace.h"
>>>   #include "qapi/error.h"
>>> +#include "migration/cpr.h"
>>> +#include "migration/blocker.h"
>>>   #include "pci.h"
>>>   #include "hw/vfio/vfio-container.h"
>>>   #include "hw/vfio/vfio-cpr.h"
>>> @@ -414,7 +416,7 @@ static bool vfio_set_iommu(int container_fd, int group_fd,
>>>   }
>>>   static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
>>> -                                            Error **errp)
>>> +                                            bool cpr_reused, Error **errp)
>>>   {
>>>       int iommu_type;
>>>       const char *vioc_name;
>>> @@ -425,7 +427,11 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
>>>           return NULL;
>>>       }
>>> -    if (!vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
>>> +    /*
>>> +     * If container is reused, just set its type and skip the ioctls, as the
>>> +     * container and group are already configured in the kernel.
>>> +     */
>>> +    if (!cpr_reused && !vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
>>>           return NULL;
>>>       }
>>> @@ -433,6 +439,7 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
>>>       container = VFIO_IOMMU_LEGACY(object_new(vioc_name));
>>>       container->fd = fd;
>>> +    container->cpr.reused = cpr_reused;
>>>       container->iommu_type = iommu_type;
>>>       return container;
>>>   }
>>> @@ -584,7 +591,7 @@ static bool vfio_container_attach_discard_disable(VFIOContainer *container,
>>>   }
>>>   static bool vfio_container_group_add(VFIOContainer *container, VFIOGroup *group,
>>> -                                     Error **errp)
>>> +                                     bool cpr_reused, Error **errp)
>>>   {
>>>       if (!vfio_container_attach_discard_disable(container, group, errp)) {
>>>           return false;
>>> @@ -592,6 +599,9 @@ static bool vfio_container_group_add(VFIOContainer *container, VFIOGroup *group,
>>>       group->container = container;
>>>       QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>>>       vfio_group_add_kvm_device(group);
>>> +    if (!cpr_reused) {
>>> +        cpr_save_fd("vfio_container_for_group", group->groupid, container->fd);
>>> +    }
>>
>> Could we avoid the test on cpr_reused and always call cpr_save_fd() ?
> 
> No.  If cpr_reused is true, then the fd is already on cpr's save list.
> We don't want to save duplicates of the same entry.

Can't we call cpr_find_fd() like in cpr_open_fd() ?
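
A minimal sketch of that idea, assuming a toy fd table in place of the real
CPR state.  All toy_* names are hypothetical and are not QEMU APIs; the point
is only that a find-before-save helper makes the save idempotent, so the
caller would not need to test cpr_reused:

```c
#include <stdbool.h>
#include <string.h>

/* Toy stand-in for the CPR fd list; not QEMU's implementation. */
#define TOY_MAX_FDS 16

struct toy_cpr_fd {
    const char *name;
    int id;
    int fd;
    bool used;
};

static struct toy_cpr_fd toy_fds[TOY_MAX_FDS];

/* Return the saved fd for (name, id), or -1 if there is no such entry. */
static int toy_cpr_find_fd(const char *name, int id)
{
    for (int i = 0; i < TOY_MAX_FDS; i++) {
        if (toy_fds[i].used && toy_fds[i].id == id &&
            !strcmp(toy_fds[i].name, name)) {
            return toy_fds[i].fd;
        }
    }
    return -1;
}

/* Unconditionally add an entry; returns false if the table is full. */
static bool toy_cpr_save_fd(const char *name, int id, int fd)
{
    for (int i = 0; i < TOY_MAX_FDS; i++) {
        if (!toy_fds[i].used) {
            toy_fds[i] = (struct toy_cpr_fd){ name, id, fd, true };
            return true;
        }
    }
    return false;
}

/* Save only if (name, id) is not already present. */
static bool toy_cpr_save_fd_once(const char *name, int id, int fd)
{
    if (toy_cpr_find_fd(name, id) >= 0) {
        return true;
    }
    return toy_cpr_save_fd(name, id, fd);
}

/* Count entries for (name, id); used only to check for duplicates. */
static int toy_cpr_count(const char *name, int id)
{
    int n = 0;

    for (int i = 0; i < TOY_MAX_FDS; i++) {
        if (toy_fds[i].used && toy_fds[i].id == id &&
            !strcmp(toy_fds[i].name, name)) {
            n++;
        }
    }
    return n;
}
```

Calling toy_cpr_save_fd_once() twice with the same key leaves a single entry,
which is the property the cpr_reused test currently provides.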

> 
>>>       return true;
>>>   }
>>> @@ -601,6 +611,7 @@ static void vfio_container_group_del(VFIOContainer *container, VFIOGroup *group)
>>>       group->container = NULL;
>>>       vfio_group_del_kvm_device(group);
>>>       vfio_ram_block_discard_disable(container, false);
>>> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>>>   }
>>>   static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>> @@ -613,17 +624,37 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>>       VFIOIOMMUClass *vioc = NULL;
>>>       bool new_container = false;
>>>       bool group_was_added = false;
>>> +    bool cpr_reused;
>>>       space = vfio_address_space_get(as);
>>> +    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
>>> +    cpr_reused = (fd > 0);
>>
>>
>> The code above is doing 2 things : it grabs a restored fd and
>> deduces from the fd value that the VM is doing a CPR
>> reboot.
>>
>> Instead of adding this cpr_reused flag, I would prefer to duplicate
>> the code into something like:
>>
>> if (!cpr_reboot) {
>>     QLIST_FOREACH(bcontainer, &space->containers, next) {
>>          container = container_of(bcontainer, VFIOContainer, bcontainer);
>>          if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>>              return vfio_container_group_add(container, group, errp);
>>          }
>>      }
>>
>>      fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
>>      if (fd < 0) {
>>          goto fail;
>>      }
>>
>>      ret = ioctl(fd, VFIO_GET_API_VERSION);
>>      if (ret != VFIO_API_VERSION) {
>>          error_setg(errp, "supported vfio version: %d, "
>>                     "reported version: %d", VFIO_API_VERSION, ret);
>>          goto fail;
>>      }
>>
>>      container = vfio_create_container(fd, group, errp);
>> } else {
>>     /* ... */
>> }
>>
> 
> OK, but there is no sense in duplicating the identical code for
> VFIO_GET_API_VERSION and vfio_create_container.  If you want me to
> simplify the loop, I suggest:
> 
> if (!cpr_reused) {
>      QLIST_FOREACH(bcontainer, &space->containers, next) {
>           container = container_of(bcontainer, VFIOContainer, bcontainer);
>           if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>               return vfio_container_group_add(container, group, false, errp);
>           }
>       }
> 
>       fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
>       if (fd < 0) {
>           goto fail;
>       }
> } else {
>      QLIST_FOREACH(bcontainer, &space->containers, next) {
>          container = container_of(bcontainer, VFIOContainer, bcontainer);
>          if (vfio_cpr_container_match(container, group, &fd)) {
>              return vfio_container_group_add(container, group, true, errp);
>          }
>      }
> }
> 
> ret = ioctl(fd, VFIO_GET_API_VERSION);
> ...

OK. Let's do that. I find it easier to read.


>>> +    /*
>>> +     * If the container is reused, then the group is already attached in the
>>> +     * kernel.  If a container with matching fd is found, then update the
>>> +     * userland group list and return.  If not, then after the loop, create
>>> +     * the container struct and group list.
>>> +     */
>>>       QLIST_FOREACH(bcontainer, &space->containers, next) {
>>>           container = container_of(bcontainer, VFIOContainer, bcontainer);
>>> -        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>>> -            return vfio_container_group_add(container, group, errp);
>>> +
>>> +        if (cpr_reused) {
>>> +            if (!vfio_cpr_container_match(container, group, &fd)) {
>>
>> why do we need to modify fd ?
> 
> That is explained by the comments inside vfio_cpr_container_match, where the
> explanation is more easily understood.

I haven't been able to see what a modified fd was useful for before because
we test cpr_reused and in other places !cpr_reused :

         if (cpr_reused) {
             if (!vfio_cpr_container_match(container, group, &fd)) {
                 continue;
             }

and later

     if (!cpr_reused) {
         fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
     }

I think I got it now. This was a bit confusing.

> 
>>> +                continue;
>>> +            }
>>> +        } else if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>>> +            continue;
>>>           }
>>> +        return vfio_container_group_add(container, group, cpr_reused, errp);
>>> +    }
>>> +
>>> +    if (!cpr_reused) {
>>> +        fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
>>>       }
>>> -    fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
>>>       if (fd < 0) {
>>>           goto fail;
>>>       }
>>> @@ -635,7 +666,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>>           goto fail;
>>>       }
>>> -    container = vfio_create_container(fd, group, errp);
>>> +    container = vfio_create_container(fd, group, cpr_reused, errp);
>>>       if (!container) {
>>>           goto fail;
>>>       }
>>> @@ -655,7 +686,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>>       vfio_address_space_insert(space, bcontainer);
>>> -    if (!vfio_container_group_add(container, group, errp)) {
>>> +    if (!vfio_container_group_add(container, group, cpr_reused, errp)) {
>>>           goto fail;
>>>       }
>>>       group_was_added = true;
>>> @@ -697,6 +728,7 @@ static void vfio_container_disconnect(VFIOGroup *group)
>>>       QLIST_REMOVE(group, container_next);
>>>       group->container = NULL;
>>> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>>>       /*
>>>        * Explicitly release the listener first before unset container,
>>> @@ -750,7 +782,7 @@ static VFIOGroup *vfio_group_get(int groupid, AddressSpace *as, Error **errp)
>>>       group = g_malloc0(sizeof(*group));
>>>       snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
>>> -    group->fd = qemu_open(path, O_RDWR, errp);
>>> +    group->fd = cpr_open_fd(path, O_RDWR, "vfio_group", groupid, NULL, errp);
>>>       if (group->fd < 0) {
>>>           goto free_group_exit;
>>>       }
>>> @@ -782,6 +814,7 @@ static VFIOGroup *vfio_group_get(int groupid, AddressSpace *as, Error **errp)
>>>       return group;
>>>   close_fd_exit:
>>> +    cpr_delete_fd("vfio_group", groupid);
>>>       close(group->fd);
>>>   free_group_exit:
>>> @@ -803,6 +836,7 @@ static void vfio_group_put(VFIOGroup *group)
>>>       vfio_container_disconnect(group);
>>>       QLIST_REMOVE(group, next);
>>>       trace_vfio_group_put(group->fd);
>>> +    cpr_delete_fd("vfio_group", group->groupid);
>>>       close(group->fd);
>>>       g_free(group);
>>>   }
>>> @@ -812,8 +846,14 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
>>>   {
>>>       g_autofree struct vfio_device_info *info = NULL;
>>>       int fd;
>>> +    bool cpr_reused;
>>> +
>>> +    fd = cpr_find_fd(name, 0);
>>> +    cpr_reused = (fd >= 0);
>>> +    if (!cpr_reused) {
>>> +        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>>> +    }
>>
>> Could we introduce a helper routine to open this file, like we have
>> cpr_open_fd() ?
> 
> OK, but this would be the only use of the helper, and it would bury
> generic vfio functionality -- VFIO_GROUP_GET_DEVICE_FD -- inside a cpr
> flavored helper.  IMO not an improvement.

VFIO_GROUP_GET_DEVICE_FD would still be passed as a parameter and
so it won't be buried IMO. I don't dislike it that much.

However, I don't like the "if (cpr_reused)" statements scattered
throughout the code, so I'm looking for ways to bury them.
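
One possible shape for such a helper, as a rough sketch only.  It uses a
callback rather than passing the ioctl request number as a parameter, and a
toy table in place of CPR state; every toy_* name is hypothetical and is not
an existing QEMU function:

```c
#include <stdbool.h>
#include <string.h>

/* Callback that obtains a fresh fd, e.g. by issuing an ioctl. */
typedef int (*toy_open_fn)(void *opaque);

/* Toy stand-in for the saved-fd list; not QEMU's CPR state. */
struct toy_saved {
    const char *name;
    int fd;
};

static struct toy_saved toy_saved[8];
static int toy_nsaved;

/* Return the saved fd for name, or -1 if none was saved. */
static int toy_find_fd(const char *name)
{
    for (int i = 0; i < toy_nsaved; i++) {
        if (!strcmp(toy_saved[i].name, name)) {
            return toy_saved[i].fd;
        }
    }
    return -1;
}

/*
 * Return a saved fd if one exists, else obtain one through open_fn and
 * save it.  The reused test lives here, once, instead of at each call
 * site.  If the toy table is full the new fd is returned but not saved.
 */
static int toy_get_fd(const char *name, toy_open_fn open_fn, void *opaque,
                      bool *reused)
{
    int fd = toy_find_fd(name);

    *reused = (fd >= 0);
    if (!*reused) {
        fd = open_fn(opaque);
        if (fd >= 0 && toy_nsaved < 8) {
            toy_saved[toy_nsaved++] = (struct toy_saved){ name, fd };
        }
    }
    return fd;
}

/* Fake opener for illustration: counts its calls and returns fd 100. */
static int toy_fake_open(void *opaque)
{
    int *calls = opaque;

    (*calls)++;
    return 100;
}
```

The first call for a given name goes through the opener and records the fd;
a second call finds the recorded fd and reports it as reused without calling
the opener again.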


Thanks,

C.



> 
>>> -    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>>>       if (fd < 0) {
>>>           error_setg_errno(errp, errno, "error getting device from group %d",
>>>                            group->groupid);
>>> @@ -857,6 +897,10 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
>>>       vbasedev->group = group;
>>>       QLIST_INSERT_HEAD(&group->device_list, vbasedev, next);
>>> +    vbasedev->cpr.reused = cpr_reused;
>>> +    if (!cpr_reused) {
>>> +        cpr_save_fd(name, 0, fd);
>>
>> Could we avoid the test on cpr_reused and always call cpr_save_fd() ?
> 
> No.  Must avoid adding duplicate entries.
> 
>>> +    }
>>>       trace_vfio_device_get(name, info->flags, info->num_regions, info->num_irqs);
>>>       return true;
>>> @@ -870,6 +914,7 @@ static void vfio_device_put(VFIODevice *vbasedev)
>>>       QLIST_REMOVE(vbasedev, next);
>>>       vbasedev->group = NULL;
>>>       trace_vfio_device_put(vbasedev->fd);
>>> +    cpr_delete_fd(vbasedev->name, 0);
>>>       close(vbasedev->fd);
>>>   }
>>> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
>>> index fac323c..638a8e0 100644
>>> --- a/hw/vfio/cpr-legacy.c
>>> +++ b/hw/vfio/cpr-legacy.c
>>> @@ -10,6 +10,7 @@
>>>   #include "qemu/osdep.h"
>>>   #include "hw/vfio/vfio-container.h"
>>>   #include "hw/vfio/vfio-cpr.h"
>>> +#include "hw/vfio/vfio-device.h"
>>>   #include "migration/blocker.h"
>>>   #include "migration/cpr.h"
>>>   #include "migration/migration.h"
>>> @@ -31,10 +32,27 @@ static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>>>       }
>>>   }
>>> +static int vfio_container_post_load(void *opaque, int version_id)
>>> +{
>>> +    VFIOContainer *container = opaque;
>>> +    VFIOGroup *group;
>>> +    VFIODevice *vbasedev;
>>> +
>>> +    container->cpr.reused = false;
>>> +
>>> +    QLIST_FOREACH(group, &container->group_list, container_next) {
>>> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>>> +            vbasedev->cpr.reused = false;
>>> +        }
>>> +    }
>>> +    return 0;
>>> +}
>>> +
>>>   static const VMStateDescription vfio_container_vmstate = {
>>>       .name = "vfio-container",
>>>       .version_id = 0,
>>>       .minimum_version_id = 0,
>>> +    .post_load = vfio_container_post_load,
>>>       .needed = cpr_needed_for_reuse,
>>>       .fields = (VMStateField[]) {
>>>           VMSTATE_END_OF_LIST()
>>> @@ -68,3 +86,31 @@ void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
>>>       migrate_del_blocker(&container->cpr.blocker);
>>>       vmstate_unregister(NULL, &vfio_container_vmstate, container);
>>>   }
>>> +
>>> +static bool same_device(int fd1, int fd2)
>>> +{
>>> +    struct stat st1, st2;
>>> +
>>> +    return !fstat(fd1, &st1) && !fstat(fd2, &st2) && st1.st_dev == st2.st_dev;
>>> +}
>>> +
>>> +bool vfio_cpr_container_match(VFIOContainer *container, VFIOGroup *group,
>>> +                              int *pfd)
>>> +{
>>> +    if (container->fd == *pfd) {
>>> +        return true;
>>> +    }
>>> +    if (!same_device(container->fd, *pfd)) {
>>> +        return false;
>>> +    }
>>> +    /*
>>> +     * Same device, different fd.  This occurs when the container fd is
>>> +     * cpr_save'd multiple times, once for each groupid, so SCM_RIGHTS
>>> +     * produces duplicates.  De-dup it.
>>> +     */
>>> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>>> +    close(*pfd);
>>> +    cpr_save_fd("vfio_container_for_group", group->groupid, container->fd);
>>> +    *pfd = container->fd;
>>
>> I am not sure 'pfd' is used afterwards. Is it ?
> 
> True, good eye.  I will change it to "int fd" and stop returning the new value.
> 
> - Steve
> 
>>
>>> +    return true;
>>> +}
>>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>>> index f864547..1c4f070 100644
>>> --- a/include/hw/vfio/vfio-cpr.h
>>> +++ b/include/hw/vfio/vfio-cpr.h
>>> @@ -13,10 +13,16 @@
>>>   typedef struct VFIOContainerCPR {
>>>       Error *blocker;
>>> +    bool reused;
>>>   } VFIOContainerCPR;
>>> +typedef struct VFIODeviceCPR {
>>> +    bool reused;
>>> +} VFIODeviceCPR;
>>> +
>>>   struct VFIOContainer;
>>>   struct VFIOContainerBase;
>>> +struct VFIOGroup;
>>>   bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>>>                                           Error **errp);
>>> @@ -29,4 +35,7 @@ bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
>>>                                    Error **errp);
>>>   void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>>> +bool vfio_cpr_container_match(struct VFIOContainer *container,
>>> +                              struct VFIOGroup *group, int *fd);
>>> +
>>>   #endif /* HW_VFIO_VFIO_CPR_H */
>>> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
>>> index 8bcb3c1..4e4d0b6 100644
>>> --- a/include/hw/vfio/vfio-device.h
>>> +++ b/include/hw/vfio/vfio-device.h
>>> @@ -28,6 +28,7 @@
>>>   #endif
>>>   #include "system/system.h"
>>>   #include "hw/vfio/vfio-container-base.h"
>>> +#include "hw/vfio/vfio-cpr.h"
>>>   #include "system/host_iommu_device.h"
>>>   #include "system/iommufd.h"
>>> @@ -84,6 +85,7 @@ typedef struct VFIODevice {
>>>       VFIOIOASHwpt *hwpt;
>>>       QLIST_ENTRY(VFIODevice) hwpt_next;
>>>       struct vfio_region_info **reginfo;
>>> +    VFIODeviceCPR cpr;
>>>   } VFIODevice;
>>>   struct VFIODeviceOps {
>>
>>
>> Thanks,
>>
>> C.
>>
>>
> 




* Re: [PATCH V3 10/42] vfio/container: restore DMA vaddr
  2025-05-15 19:08     ` Steven Sistare
@ 2025-05-19 13:32       ` Cédric Le Goater
  2025-05-19 16:33         ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-19 13:32 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/15/25 21:08, Steven Sistare wrote:
> On 5/15/2025 9:42 AM, Cédric Le Goater wrote:
>> On 5/12/25 17:32, Steve Sistare wrote:
>>> In new QEMU, do not register the memory listener at device creation time.
>>> Register it later, in the container post_load handler, after all vmstate
>>> that may affect regions and mapping boundaries has been loaded.  The
>>> post_load registration will cause the listener to invoke its callback on
>>> each flat section, and the calls will match the mappings remembered by the
>>> kernel.
>>>
>>> The listener calls a special dma_map handler that passes the new VA of each
>>> section to the kernel using VFIO_DMA_MAP_FLAG_VADDR.  Restore the normal
>>> handler at the end.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>>   hw/vfio/container.c  | 15 +++++++++++++--
>>>   hw/vfio/cpr-legacy.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>   2 files changed, 61 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>>> index a554683..0e02726 100644
>>> --- a/hw/vfio/container.c
>>> +++ b/hw/vfio/container.c
>>> @@ -137,6 +137,8 @@ static int vfio_legacy_dma_unmap_one(const VFIOContainerBase *bcontainer,
>>>       int ret;
>>>       Error *local_err = NULL;
>>> +    assert(!container->cpr.reused);
>>> +
>>>       if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer)) {
>>>           if (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
>>>               bcontainer->dirty_pages_supported) {
>>> @@ -691,8 +693,17 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>>       }
>>>       group_was_added = true;
>>> -    if (!vfio_listener_register(bcontainer, errp)) {
>>> -        goto fail;
>>> +    /*
>>> +     * If reused, register the listener later, after all state that may
>>> +     * affect regions and mapping boundaries has been cpr load'ed.  Later,
>>> +     * the listener will invoke its callback on each flat section and call
>>> +     * dma_map to supply the new vaddr, and the calls will match the mappings
>>> +     * remembered by the kernel.
>>> +     */
>>> +    if (!cpr_reused) {
>>> +        if (!vfio_listener_register(bcontainer, errp)) {
>>> +            goto fail;
>>> +        }
>>
>> hmm, I am starting to think we should have a vfio_cpr_container_connect
>> routine too.
> 
> I think that would obscure rather than clarify the code, since the normal
> non-cpr action of calling vfio_listener_register would be buried in a
> cpr flavored function name.
> 
>>>       }
>>>       bcontainer->initialized = true;
>>> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
>>> index 519d772..bbcf71e 100644
>>> --- a/hw/vfio/cpr-legacy.c
>>> +++ b/hw/vfio/cpr-legacy.c
>>> @@ -11,11 +11,13 @@
>>>   #include "hw/vfio/vfio-container.h"
>>>   #include "hw/vfio/vfio-cpr.h"
>>>   #include "hw/vfio/vfio-device.h"
>>> +#include "hw/vfio/vfio-listener.h"
>>>   #include "migration/blocker.h"
>>>   #include "migration/cpr.h"
>>>   #include "migration/migration.h"
>>>   #include "migration/vmstate.h"
>>>   #include "qapi/error.h"
>>> +#include "qemu/error-report.h"
>>>   static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>>>   {
>>> @@ -32,6 +34,34 @@ static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>>>       return true;
>>>   }
>>> +/*
>>> + * Set the new @vaddr for any mappings registered during cpr load.
>>> + * Reused is cleared thereafter.
>>> + */
>>> +static int vfio_legacy_cpr_dma_map(const VFIOContainerBase *bcontainer,
>>> +                                   hwaddr iova, ram_addr_t size, void *vaddr,
>>> +                                   bool readonly)
>>> +{
>>> +    const VFIOContainer *container = container_of(bcontainer, VFIOContainer,
>>> +                                                  bcontainer);
>>> +    struct vfio_iommu_type1_dma_map map = {
>>> +        .argsz = sizeof(map),
>>> +        .flags = VFIO_DMA_MAP_FLAG_VADDR,
>>> +        .vaddr = (__u64)(uintptr_t)vaddr,
>>> +        .iova = iova,
>>> +        .size = size,
>>> +    };
>>> +
>>> +    assert(container->cpr.reused);
>>> +
>>> +    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
>>> +        error_report("vfio_legacy_cpr_dma_map (iova %lu, size %ld, va %p): %s",
>>> +                     iova, size, vaddr, strerror(errno));
>>
>> Callers should also report the error. No need to do it here.
> 
> This function has the same signature as the dma_map class method,
> which does not return an error message.  Its existing implementations
> use error_report.


backends .dma_map handlers : vfio_legacy_dma_map(), iommufd_backend_map_dma()
don't report errors. vfio_container_dma_map() doesn't either.

callers of vfio_container_dma_map() : vfio_iommu_map_notify(),
vfio_listener_region_add() report errors.
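
The convention described above, sketched with toy functions.  The names are
hypothetical and only mirror the split between a handler that returns -errno
silently and a caller that formats the message:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Handler-style function: returns 0 or -errno and prints nothing. */
static int toy_dma_map(int should_fail)
{
    if (should_fail) {
        errno = EINVAL;     /* pretend the ioctl failed */
        return -errno;
    }
    return 0;
}

/* Caller-style function: the only place that formats an error message. */
static int toy_region_add(int should_fail, char *msg, size_t len)
{
    int ret = toy_dma_map(should_fail);

    if (ret) {
        snprintf(msg, len, "toy_dma_map failed: %s", strerror(-ret));
    }
    return ret;
}
```

With this split, a failure produces one message at the call site rather than
one from the handler and a second from its caller.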


Thanks,

C.


> 
>>> +        return -errno;
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>>   static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>>>   {
>>> @@ -63,12 +93,24 @@ static int vfio_container_pre_save(void *opaque)
>>>   static int vfio_container_post_load(void *opaque, int version_id)
>>>   {
>>>       VFIOContainer *container = opaque;
>>> +    VFIOContainerBase *bcontainer = &container->bcontainer;
>>>       VFIOGroup *group;
>>>       VFIODevice *vbasedev;
>>> +    Error *err = NULL;
>>> +
>>> +    if (!vfio_listener_register(bcontainer, &err)) {
>>> +        error_report_err(err);
>>> +        return -1;
>>> +    }
>>>       container->cpr.reused = false;
>>>       QLIST_FOREACH(group, &container->group_list, container_next) {
>>> +        VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
>>> +
>>> +        /* Restore original dma_map function */
>>> +        vioc->dma_map = vfio_legacy_dma_map;
>>> +
>>>           QLIST_FOREACH(vbasedev, &group->device_list, next) {
>>>               vbasedev->cpr.reused = false;
>>>           }
>>> @@ -80,6 +122,7 @@ static const VMStateDescription vfio_container_vmstate = {
>>>       .name = "vfio-container",
>>>       .version_id = 0,
>>>       .minimum_version_id = 0,
>>> +    .priority = MIG_PRI_LOW,  /* Must happen after devices and groups */
>>>       .pre_save = vfio_container_pre_save,
>>>       .post_load = vfio_container_post_load,
>>>       .needed = cpr_needed_for_reuse,
>>> @@ -104,6 +147,11 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
>>>       vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>>> +    /* During incoming CPR, divert calls to dma_map. */
>>> +    if (container->cpr.reused) {
>>> +        VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
>>> +        vioc->dma_map = vfio_legacy_cpr_dma_map;
>>
>> You could back up the previous dma_map() handler in a static variable or,
>> better, under container->cpr.
> 
> OK.
> 
> - Steve
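
A rough sketch of the suggestion above, saving the previous handler next to
the CPR state instead of hard-coding vfio_legacy_dma_map at restore time.
The types and the saved_dma_map field are hypothetical, pared-down stand-ins
for VFIOIOMMUClass and container->cpr, chosen only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

typedef int (*dma_map_fn)(void *vaddr);

/* Hypothetical, pared-down stand-ins for VFIOIOMMUClass and container->cpr. */
struct iommu_class { dma_map_fn dma_map; };
struct container_cpr { bool reused; dma_map_fn saved_dma_map; };

static int legacy_dma_map(void *vaddr) { (void)vaddr; return 1; }
static int cpr_dma_map(void *vaddr)    { (void)vaddr; return 2; }

/* On registration during incoming CPR: remember the handler, then divert. */
static void cpr_divert(struct iommu_class *vioc, struct container_cpr *cpr)
{
    if (cpr->reused) {
        cpr->saved_dma_map = vioc->dma_map;
        vioc->dma_map = cpr_dma_map;
    }
}

/* In post_load: put back whatever was there, without naming the function. */
static void cpr_restore(struct iommu_class *vioc, struct container_cpr *cpr)
{
    if (cpr->saved_dma_map) {
        vioc->dma_map = cpr->saved_dma_map;
        cpr->saved_dma_map = NULL;
    }
    cpr->reused = false;
}
```

The restore path then stays correct even if the original handler is renamed
or replaced later.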




* Re: [PATCH V3 28/42] backends/iommufd: iommufd_backend_map_file_dma
  2025-05-16  8:26   ` Duan, Zhenzhong
@ 2025-05-19 15:51     ` Steven Sistare
  2025-05-20 19:32       ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-19 15:51 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/16/2025 4:26 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steve Sistare <steven.sistare@oracle.com>
>> Subject: [PATCH V3 28/42] backends/iommufd:
>> iommufd_backend_map_file_dma
>>
>> Define iommufd_backend_map_file_dma to implement IOMMU_IOAS_MAP_FILE.
>> This will be called as a substitute for iommufd_backend_map_dma, so
>> the error conditions for BARs are copied as-is from that function.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> backends/iommufd.c       | 36 ++++++++++++++++++++++++++++++++++++
>> backends/trace-events    |  1 +
>> include/system/iommufd.h |  3 +++
>> 3 files changed, 40 insertions(+)
>>
>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>> index b73f75c..5c1958f 100644
>> --- a/backends/iommufd.c
>> +++ b/backends/iommufd.c
>> @@ -172,6 +172,42 @@ int iommufd_backend_map_dma(IOMMUFDBackend
>> *be, uint32_t ioas_id, hwaddr iova,
>>      return ret;
>> }
>>
>> +int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>> +                                 hwaddr iova, ram_addr_t size,
>> +                                 int mfd, unsigned long start, bool readonly)
>> +{
>> +    int ret, fd = be->fd;
>> +    struct iommu_ioas_map_file map = {
>> +        .size = sizeof(map),
>> +        .flags = IOMMU_IOAS_MAP_READABLE |
>> +                 IOMMU_IOAS_MAP_FIXED_IOVA,
>> +        .ioas_id = ioas_id,
>> +        .fd = mfd,
>> +        .start = start,
>> +        .iova = iova,
>> +        .length = size,
>> +    };
>> +
>> +    if (!readonly) {
>> +        map.flags |= IOMMU_IOAS_MAP_WRITEABLE;
>> +    }
>> +
>> +    ret = ioctl(fd, IOMMU_IOAS_MAP_FILE, &map);
>> +    trace_iommufd_backend_map_file_dma(fd, ioas_id, iova, size, mfd, start,
>> +                                       readonly, ret);
>> +    if (ret) {
>> +        ret = -errno;
>> +
>> +        /* TODO: Not support mapping hardware PCI BAR region for now. */
>> +        if (errno == EFAULT) {
>> +            warn_report("IOMMU_IOAS_MAP_FILE failed: %m, PCI BAR?");
>> +        } else {
>> +            error_report("IOMMU_IOAS_MAP_FILE failed: %m");
> 
> No need to print error here as caller does the same thing.

OK.  I was copying iommufd_backend_map_dma, but I see it has recently
dropped the error_report.

- Steve

>> +        }
>> +    }
>> +    return ret;
>> +}
>> +
>> int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>>                                hwaddr iova, ram_addr_t size)
>> {
>> diff --git a/backends/trace-events b/backends/trace-events
>> index 40811a3..f478e18 100644
>> --- a/backends/trace-events
>> +++ b/backends/trace-events
>> @@ -11,6 +11,7 @@ iommufd_backend_connect(int fd, bool owned, uint32_t
>> users) "fd=%d owned=%d user
>> iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
>> iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
>> iommufd_backend_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t
>> size, void *vaddr, bool readonly, int ret) " iommufd=%d ioas=%d
>> iova=0x%"PRIx64" size=0x%"PRIx64" addr=%p readonly=%d (%d)"
>> +iommufd_backend_map_file_dma(int iommufd, uint32_t ioas, uint64_t iova,
>> uint64_t size, int fd, unsigned long start, bool readonly, int ret) " iommufd=%d
>> ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" fd=%d start=%ld readonly=%d
>> (%d)"
>> iommufd_backend_unmap_dma_non_exist(int iommufd, uint32_t ioas, uint64_t
>> iova, uint64_t size, int ret) " Unmap nonexistent mapping: iommufd=%d ioas=%d
>> iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
>> iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova,
>> uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64"
>> size=0x%"PRIx64" (%d)"
>> iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d
>> ioas=%d"
>> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>> index cbab75b..ac700b8 100644
>> --- a/include/system/iommufd.h
>> +++ b/include/system/iommufd.h
>> @@ -43,6 +43,9 @@ void iommufd_backend_disconnect(IOMMUFDBackend
>> *be);
>> bool iommufd_backend_alloc_ioas(IOMMUFDBackend *be, uint32_t *ioas_id,
>>                                  Error **errp);
>> void iommufd_backend_free_id(IOMMUFDBackend *be, uint32_t id);
>> +int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>> +                                 hwaddr iova, ram_addr_t size, int fd,
>> +                                 unsigned long start, bool readonly);
>> int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>> hwaddr iova,
>>                              ram_addr_t size, void *vaddr, bool readonly);
>> int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>> --
>> 1.8.3.1
> 




* Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
  2025-05-16  8:42   ` Duan, Zhenzhong
@ 2025-05-19 15:51     ` Steven Sistare
  2025-05-20 19:34       ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-19 15:51 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/16/2025 4:42 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steve Sistare <steven.sistare@oracle.com>
>> Subject: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>
>> Define the change process ioctl
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> backends/iommufd.c       | 20 ++++++++++++++++++++
>> backends/trace-events    |  1 +
>> include/system/iommufd.h |  2 ++
>> 3 files changed, 23 insertions(+)
>>
>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>> index 5c1958f..6fed1c1 100644
>> --- a/backends/iommufd.c
>> +++ b/backends/iommufd.c
>> @@ -73,6 +73,26 @@ static void iommufd_backend_class_init(ObjectClass *oc,
>> const void *data)
>>      object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
>> }
>>
>> +bool iommufd_change_process_capable(IOMMUFDBackend *be)
>> +{
>> +    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>> +
>> +    return !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>> +}
>> +
>> +bool iommufd_change_process(IOMMUFDBackend *be, Error **errp)
>> +{
>> +    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>> +    bool ret = !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
> 
> This is the same ioctl as the check above. Could it be called more than once for the same process?

Yes, and it is a no-op if the process has not changed since the last time DMA
was mapped.

- Steve
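
To make the answer above concrete, here is a toy model of that behaviour.
It is not the kernel implementation: struct fake_iommufd and both functions
are hypothetical, and the model only assumes what is stated in this thread,
namely that the call succeeds when supported and does nothing if the calling
process has not changed.

```c
#include <stdbool.h>

/* Hypothetical model of the kernel-side state; not the real iommufd code. */
struct fake_iommufd {
    bool supports_change_process;
    int owner_pid;          /* process the mappings are accounted to */
    int transfers;          /* how many times ownership actually moved */
};

/* Models the ioctl: 0 on success, -1 if the operation is unsupported. */
static int fake_change_process(struct fake_iommufd *k, int caller_pid)
{
    if (!k->supports_change_process) {
        return -1;
    }
    if (k->owner_pid != caller_pid) {
        k->owner_pid = caller_pid;
        k->transfers++;
    }
    return 0;               /* same process: nothing to do */
}

/* Capability probe: simply try the call, as the patch does. */
static bool change_process_capable(struct fake_iommufd *k, int caller_pid)
{
    return fake_change_process(k, caller_pid) == 0;
}
```

Because a repeated call from the same process changes nothing, using the
operation itself as the capability probe has no side effect in this model.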

>> +
>> +    if (!ret) {
>> +        error_setg_errno(errp, errno, "IOMMU_IOAS_CHANGE_PROCESS fd %d
>> failed",
>> +                         be->fd);
>> +    }
>> +    trace_iommufd_change_process(be->fd, ret);
>> +    return ret;
>> +}
>> +
>> bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
>> {
>>      int fd;
>> diff --git a/backends/trace-events b/backends/trace-events
>> index f478e18..5ccdf90 100644
>> --- a/backends/trace-events
>> +++ b/backends/trace-events
>> @@ -7,6 +7,7 @@ dbus_vmstate_loading(const char *id) "id: %s"
>> dbus_vmstate_saving(const char *id) "id: %s"
>>
>> # iommufd.c
>> +iommufd_change_process(int fd, bool ret) "fd=%d (%d)"
>> iommufd_backend_connect(int fd, bool owned, uint32_t users) "fd=%d
>> owned=%d users=%d"
>> iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
>> iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
>> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>> index ac700b8..db9ed53 100644
>> --- a/include/system/iommufd.h
>> +++ b/include/system/iommufd.h
>> @@ -64,6 +64,8 @@ bool
>> iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t hwpt_id,
>>                                        uint64_t iova, ram_addr_t size,
>>                                        uint64_t page_size, uint64_t *data,
>>                                        Error **errp);
>> +bool iommufd_change_process_capable(IOMMUFDBackend *be);
>> +bool iommufd_change_process(IOMMUFDBackend *be, Error **errp);
>>
>> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE
>> "-iommufd"
>> #endif
>> --
>> 1.8.3.1
> 




* Re: [PATCH V3 31/42] vfio/iommufd: use IOMMU_IOAS_MAP_FILE
  2025-05-16  8:48   ` Duan, Zhenzhong
@ 2025-05-19 15:52     ` Steven Sistare
  2025-05-20 19:39       ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-19 15:52 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/16/2025 4:48 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steve Sistare <steven.sistare@oracle.com>
>> Subject: [PATCH V3 31/42] vfio/iommufd: use IOMMU_IOAS_MAP_FILE
>>
>> Use IOMMU_IOAS_MAP_FILE when the mapped region is backed by a file.
>> Such a mapping can be preserved without modification during CPR,
>> because it depends on the file's address space, which does not change,
>> rather than on the process's address space, which does change.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/container-base.c              |  9 +++++++++
>> hw/vfio/iommufd.c                     | 13 +++++++++++++
>> include/hw/vfio/vfio-container-base.h |  3 +++
>> 3 files changed, 25 insertions(+)
>>
>> diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
>> index 8f43bc8..72a51a6 100644
>> --- a/hw/vfio/container-base.c
>> +++ b/hw/vfio/container-base.c
>> @@ -79,7 +79,16 @@ int vfio_container_dma_map(VFIOContainerBase
>> *bcontainer,
>>                             RAMBlock *rb)
>> {
>>      VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
>> +    int mfd = rb ? qemu_ram_get_fd(rb) : -1;
>>
>> +    if (mfd >= 0 && vioc->dma_map_file) {
>> +        unsigned long start = vaddr - qemu_ram_get_host_addr(rb);
>> +        unsigned long offset = qemu_ram_get_fd_offset(rb);
>> +
>> +        vioc->dma_map_file(bcontainer, iova, size, mfd, start + offset,
>> +                           readonly);
> 
> Shouldn't we return result to call site?

Yes!  Good catch, thanks.

- Steve
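
For reference, the fix agreed on above amounts to returning the hook's
result rather than an unconditional 0. A reduced sketch, with a hypothetical
two-hook class and stub hooks standing in for the real VFIOIOMMUClass
methods:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, reduced class with only the two map hooks. */
struct iommu_class {
    int (*dma_map)(unsigned long iova, unsigned long size, void *vaddr);
    int (*dma_map_file)(unsigned long iova, unsigned long size,
                        int fd, unsigned long start);
};

static int container_dma_map(const struct iommu_class *vioc,
                             unsigned long iova, unsigned long size,
                             void *vaddr, int fd, unsigned long start)
{
    if (fd >= 0 && vioc->dma_map_file) {
        /* Return the hook's result instead of an unconditional 0. */
        return vioc->dma_map_file(iova, size, fd, start);
    }
    return vioc->dma_map(iova, size, vaddr);
}

/* Stub hooks used only to exercise the dispatch above. */
static int map_va_ok(unsigned long iova, unsigned long size, void *vaddr)
{
    (void)iova; (void)size; (void)vaddr;
    return 0;
}

static int map_file_fail(unsigned long iova, unsigned long size,
                         int fd, unsigned long start)
{
    (void)iova; (void)size; (void)fd; (void)start;
    return -EFAULT;
}
```

With the stubs above, a file-backed request surfaces -EFAULT to the call
site, while a request without an fd still goes through dma_map.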

>> +        return 0;
>> +    }
>>      g_assert(vioc->dma_map);
>>      return vioc->dma_map(bcontainer, iova, size, vaddr, readonly);
>> }
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index 167bda4..6eb417a 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -44,6 +44,18 @@ static int iommufd_cdev_map(const VFIOContainerBase
>> *bcontainer, hwaddr iova,
>>                                     iova, size, vaddr, readonly);
>> }
>>
>> +static int iommufd_cdev_map_file(const VFIOContainerBase *bcontainer,
>> +                                 hwaddr iova, ram_addr_t size,
>> +                                 int fd, unsigned long start, bool readonly)
>> +{
>> +    const VFIOIOMMUFDContainer *container =
>> +        container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
>> +
>> +    return iommufd_backend_map_file_dma(container->be,
>> +                                        container->ioas_id,
>> +                                        iova, size, fd, start, readonly);
>> +}
>> +
>> static int iommufd_cdev_unmap(const VFIOContainerBase *bcontainer,
>>                                hwaddr iova, ram_addr_t size,
>>                                IOMMUTLBEntry *iotlb, bool unmap_all)
>> @@ -802,6 +814,7 @@ static void vfio_iommu_iommufd_class_init(ObjectClass
>> *klass, const void *data)
>>      VFIOIOMMUClass *vioc = VFIO_IOMMU_CLASS(klass);
>>
>>      vioc->dma_map = iommufd_cdev_map;
>> +    vioc->dma_map_file = iommufd_cdev_map_file;
>>      vioc->dma_unmap = iommufd_cdev_unmap;
>>      vioc->attach_device = iommufd_cdev_attach;
>>      vioc->detach_device = iommufd_cdev_detach;
>> diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-
>> container-base.h
>> index 03b3f9c..f30f828 100644
>> --- a/include/hw/vfio/vfio-container-base.h
>> +++ b/include/hw/vfio/vfio-container-base.h
>> @@ -123,6 +123,9 @@ struct VFIOIOMMUClass {
>>      int (*dma_map)(const VFIOContainerBase *bcontainer,
>>                     hwaddr iova, ram_addr_t size,
>>                     void *vaddr, bool readonly);
>> +    int (*dma_map_file)(const VFIOContainerBase *bcontainer,
>> +                        hwaddr iova, ram_addr_t size,
>> +                        int fd, unsigned long start, bool readonly);
>>      /**
>>       * @dma_unmap
>>       *
>> --
>> 1.8.3.1
> 




* Re: [PATCH V3 34/42] vfio/iommufd: invariant device name
  2025-05-16  9:29   ` Duan, Zhenzhong
@ 2025-05-19 15:52     ` Steven Sistare
  0 siblings, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-19 15:52 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/16/2025 5:29 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steve Sistare <steven.sistare@oracle.com>
>> Subject: [PATCH V3 34/42] vfio/iommufd: invariant device name
>>
>> cpr-transfer will use the device name as a key to find the value
>> of the device descriptor in new QEMU.  However, if the descriptor
>> number is specified by a command-line fd parameter, then
>> vfio_device_get_name creates a name that includes the fd number.
>> This causes a chicken-and-egg problem: new QEMU must know the fd
>> number to construct a name to find the fd number.
>>
>> To fix, create an invariant name based on the id command-line
>> parameter.  If id is not defined, add a CPR blocker.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/cpr.c              | 21 +++++++++++++++++++++
>> hw/vfio/device.c           | 10 ++++------
>> hw/vfio/iommufd.c          |  2 ++
>> include/hw/vfio/vfio-cpr.h |  4 ++++
>> 4 files changed, 31 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>> index 6081a89..7609c62 100644
>> --- a/hw/vfio/cpr.c
>> +++ b/hw/vfio/cpr.c
>> @@ -11,6 +11,7 @@
>> #include "hw/vfio/pci.h"
>> #include "hw/pci/msix.h"
>> #include "hw/pci/msi.h"
>> +#include "migration/blocker.h"
>> #include "migration/cpr.h"
>> #include "qapi/error.h"
>> #include "system/runstate.h"
>> @@ -184,3 +185,23 @@ const VMStateDescription vfio_cpr_pci_vmstate = {
>>          VMSTATE_END_OF_LIST()
>>      }
>> };
>> +
>> +bool vfio_cpr_set_device_name(VFIODevice *vbasedev, Error **errp)
>> +{
>> +    if (vbasedev->dev->id) {
>> +        vbasedev->name = g_strdup(vbasedev->dev->id);
>> +        return true;
>> +    } else {
>> +        /*
>> +         * Assign a name so any function printing it will not break, but the
>> +         * fd number changes across processes, so this cannot be used as an
>> +         * invariant name for CPR.
>> +         */
>> +        vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
>> +        error_setg(&vbasedev->cpr.id_blocker,
>> +                   "vfio device with fd=%d needs an id property",
>> +                   vbasedev->fd);
>> +        return migrate_add_blocker_modes(&vbasedev->cpr.id_blocker, errp,
>> +                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
>> +    }
>> +}
>> diff --git a/hw/vfio/device.c b/hw/vfio/device.c
>> index 9fba2c7..8e9de68 100644
>> --- a/hw/vfio/device.c
>> +++ b/hw/vfio/device.c
>> @@ -28,6 +28,7 @@
>> #include "qapi/error.h"
>> #include "qemu/error-report.h"
>> #include "qemu/units.h"
>> +#include "migration/cpr.h"
>> #include "monitor/monitor.h"
>> #include "vfio-helpers.h"
>>
>> @@ -284,6 +285,7 @@ bool vfio_device_get_name(VFIODevice *vbasedev,
>> Error **errp)
>> {
>>      ERRP_GUARD();
>>      struct stat st;
>> +    bool ret = true;
>>
>>      if (vbasedev->fd < 0) {
>>          if (stat(vbasedev->sysfsdev, &st) < 0) {
>> @@ -300,16 +302,12 @@ bool vfio_device_get_name(VFIODevice *vbasedev,
>> Error **errp)
>>              error_setg(errp, "Use FD passing only with iommufd backend");
>>              return false;
>>          }
>> -        /*
>> -         * Give a name with fd so any function printing out vbasedev->name
>> -         * will not break.
>> -         */
>>          if (!vbasedev->name) {
>> -            vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
>> +            ret = vfio_cpr_set_device_name(vbasedev, errp);
>>          }
>>      }
>>
>> -    return true;
>> +    return ret;
>> }
>>
>> void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index 8661947..ea99b8d 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -25,6 +25,7 @@
>> #include "system/reset.h"
>> #include "qemu/cutils.h"
>> #include "qemu/chardev_open.h"
>> +#include "migration/blocker.h"
>> #include "pci.h"
>> #include "vfio-iommufd.h"
>> #include "vfio-helpers.h"
>> @@ -669,6 +670,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
>>      iommufd_cdev_container_destroy(container);
>>      vfio_address_space_put(space);
>>
>> +    migrate_del_blocker(&vbasedev->cpr.id_blocker);
> 
> We also need to del blocker in error path, e.g., when attach fails.

Good catch, will do - steve
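
A small sketch of the error-path cleanup discussed above. Everything here is
a hypothetical model: a counter stands in for the global blocker list, and
device_attach() takes a flag that simulates whether the later attach steps
succeed.

```c
#include <stdbool.h>

/* Hypothetical stand-in: a counter models the global list of blockers. */
static int active_blockers;

struct device { bool has_id; bool blocker_added; };

static void add_id_blocker(struct device *d)
{
    if (!d->has_id) {
        d->blocker_added = true;
        active_blockers++;
    }
}

static void del_id_blocker(struct device *d)
{
    if (d->blocker_added) {       /* safe to call when none was added */
        d->blocker_added = false;
        active_blockers--;
    }
}

/* attach_ok simulates whether the later attach steps succeed. */
static bool device_attach(struct device *d, bool attach_ok)
{
    add_id_blocker(d);            /* happens early, while naming the device */

    if (!attach_ok) {
        goto err;
    }
    return true;

err:
    del_id_blocker(d);            /* undo on the failure path too */
    return false;
}

static void device_detach(struct device *d)
{
    del_id_blocker(d);
}
```

Without the call under the err label, a failed attach would leave a stale
blocker behind for a device that no longer exists.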

>>      iommufd_cdev_unbind_and_disconnect(vbasedev);
>>      close(vbasedev->fd);
>> }
>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>> index 765e334..d06d117 100644
>> --- a/include/hw/vfio/vfio-cpr.h
>> +++ b/include/hw/vfio/vfio-cpr.h
>> @@ -23,12 +23,14 @@ typedef struct VFIOContainerCPR {
>> typedef struct VFIODeviceCPR {
>>      bool reused;
>>      Error *mdev_blocker;
>> +    Error *id_blocker;
>> } VFIODeviceCPR;
>>
>> struct VFIOContainer;
>> struct VFIOContainerBase;
>> struct VFIOGroup;
>> struct VFIOPCIDevice;
>> +struct VFIODevice;
>>
>> bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>>                                          Error **errp);
>> @@ -59,4 +61,6 @@ void vfio_cpr_delete_vector_fd(struct VFIOPCIDevice *vdev,
>> const char *name,
>>
>> extern const VMStateDescription vfio_cpr_pci_vmstate;
>>
>> +bool vfio_cpr_set_device_name(struct VFIODevice *vbasedev, Error **errp);
>> +
>> #endif /* HW_VFIO_VFIO_CPR_H */
>> --
>> 1.8.3.1
> 




* Re: [PATCH V3 35/42] vfio/iommufd: register container for cpr
  2025-05-16 10:23   ` Duan, Zhenzhong
@ 2025-05-19 15:52     ` Steven Sistare
  0 siblings, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-19 15:52 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/16/2025 6:23 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steve Sistare <steven.sistare@oracle.com>
>> Subject: [PATCH V3 35/42] vfio/iommufd: register container for cpr
>>
>> Register a vfio iommufd container and device for CPR, replacing the generic
>> CPR register call with a more specific iommufd register call.  Add a
>> blocker if the kernel does not support IOMMU_IOAS_CHANGE_PROCESS.
>>
>> This is mostly boilerplate.  The fields to be saved and restored are added
>> in subsequent patches.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/cpr-iommufd.c      | 97
>> ++++++++++++++++++++++++++++++++++++++++++++++
>> hw/vfio/iommufd.c          |  6 ++-
>> hw/vfio/meson.build        |  1 +
>> hw/vfio/vfio-iommufd.h     |  1 +
>> include/hw/vfio/vfio-cpr.h |  8 ++++
>> 5 files changed, 111 insertions(+), 2 deletions(-)
>> create mode 100644 hw/vfio/cpr-iommufd.c
>>
>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>> new file mode 100644
>> index 0000000..46f2006
>> --- /dev/null
>> +++ b/hw/vfio/cpr-iommufd.c
>> @@ -0,0 +1,97 @@
>> +/*
>> + * Copyright (c) 2024-2025 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qapi/error.h"
>> +#include "hw/vfio/vfio-cpr.h"
>> +#include "migration/blocker.h"
>> +#include "migration/cpr.h"
>> +#include "migration/migration.h"
>> +#include "migration/vmstate.h"
>> +#include "system/iommufd.h"
>> +#include "vfio-iommufd.h"
>> +
>> +static bool vfio_cpr_supported(VFIOIOMMUFDContainer *container, Error
>> **errp)
>> +{
>> +    if (!iommufd_change_process_capable(container->be)) {
>> +        error_setg(errp,
>> +                   "VFIO container does not support IOMMU_IOAS_CHANGE_PROCESS");
>> +        return false;
>> +    }
>> +    return true;
>> +}
>> +
>> +static const VMStateDescription vfio_container_vmstate = {
>> +    .name = "vfio-iommufd-container",
>> +    .version_id = 0,
>> +    .minimum_version_id = 0,
>> +    .needed = cpr_needed_for_reuse,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_END_OF_LIST()
>> +    }
>> +};
>> +
>> +static const VMStateDescription iommufd_cpr_vmstate = {
>> +    .name = "iommufd",
>> +    .version_id = 0,
>> +    .minimum_version_id = 0,
>> +    .needed = cpr_needed_for_reuse,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_END_OF_LIST()
>> +    }
>> +};
>> +
>> +bool vfio_iommufd_cpr_register_container(VFIOIOMMUFDContainer
>> *container,
>> +                                         Error **errp)
>> +{
>> +    VFIOContainerBase *bcontainer = &container->bcontainer;
>> +    Error **cpr_blocker = &container->cpr_blocker;
>> +
>> +    migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
>> +                                vfio_cpr_reboot_notifier,
>> +                                MIG_MODE_CPR_REBOOT);
>> +
>> +    if (!vfio_cpr_supported(container, cpr_blocker)) {
>> +        return migrate_add_blocker_modes(cpr_blocker, errp,
>> +                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
>> +    }
>> +
>> +    vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>> +    vmstate_register(NULL, -1, &iommufd_cpr_vmstate, container->be);
> 
> Will this register the iommufd multiple times if there are multiple containers under one iommufd?
> Maybe introduce a cpr_register_iommufd()?

I thought iommufd:container is 1:1 because of this logic in iommufd_cdev_attach:

     QLIST_FOREACH(bcontainer, &space->containers, next) {
         container = container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
         if (VFIO_IOMMU_GET_CLASS(bcontainer) != iommufd_vioc ||
             vbasedev->iommufd != container->be) {
             continue;
         }
         ...
     }
     /* Need to allocate a new dedicated container */

- Steve

>> +
>> +    return true;
>> +}
>> +
>> +void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer
>> *container)
>> +{
>> +    VFIOContainerBase *bcontainer = &container->bcontainer;
>> +
>> +    vmstate_unregister(NULL, &iommufd_cpr_vmstate, container->be);
>> +    vmstate_unregister(NULL, &vfio_container_vmstate, container);
>> +    migrate_del_blocker(&container->cpr_blocker);
>> +    migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
>> +}
>> +
>> +static const VMStateDescription vfio_device_vmstate = {
>> +    .name = "vfio-iommufd-device",
>> +    .version_id = 0,
>> +    .minimum_version_id = 0,
>> +    .needed = cpr_needed_for_reuse,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_END_OF_LIST()
>> +    }
>> +};
>> +
>> +void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
>> +{
>> +    vmstate_register(NULL, -1, &vfio_device_vmstate, vbasedev);
>> +}
>> +
>> +void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
>> +{
>> +    vmstate_unregister(NULL, &vfio_device_vmstate, vbasedev);
>> +}
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index ea99b8d..dabb948 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -460,7 +460,7 @@ static void
>> iommufd_cdev_container_destroy(VFIOIOMMUFDContainer *container)
>>      if (!QLIST_EMPTY(&bcontainer->device_list)) {
>>          return;
>>      }
>> -    vfio_cpr_unregister_container(bcontainer);
>> +    vfio_iommufd_cpr_unregister_container(container);
>>      vfio_listener_unregister(bcontainer);
>>      iommufd_backend_free_id(container->be, container->ioas_id);
>>      object_unref(container);
>> @@ -611,7 +611,7 @@ static bool iommufd_cdev_attach(const char *name,
>> VFIODevice *vbasedev,
>>          goto err_listener_register;
>>      }
>>
>> -    if (!vfio_cpr_register_container(bcontainer, errp)) {
>> +    if (!vfio_iommufd_cpr_register_container(container, errp)) {
>>          goto err_listener_register;
>>      }
>>
>> @@ -633,6 +633,7 @@ found_container:
>>      }
>>
>>      vfio_device_prepare(vbasedev, bcontainer, &dev_info);
>> +    vfio_iommufd_cpr_register_device(vbasedev);
>>
>>      trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev-
>>> num_irqs,
>>                                     vbasedev->num_regions, vbasedev->flags);
>> @@ -671,6 +672,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
>>      vfio_address_space_put(space);
>>
>>      migrate_del_blocker(&vbasedev->cpr.id_blocker);
>> +    vfio_iommufd_cpr_unregister_device(vbasedev);
>>      iommufd_cdev_unbind_and_disconnect(vbasedev);
>>      close(vbasedev->fd);
>> }
>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>> index 73d29f9..a158fd8 100644
>> --- a/hw/vfio/meson.build
>> +++ b/hw/vfio/meson.build
>> @@ -21,6 +21,7 @@ system_ss.add(when: 'CONFIG_VFIO_XGMAC', if_true:
>> files('calxeda-xgmac.c'))
>> system_ss.add(when: 'CONFIG_VFIO_AMD_XGBE', if_true: files('amd-xgbe.c'))
>> system_ss.add(when: 'CONFIG_VFIO', if_true: files(
>>    'cpr.c',
>> +  'cpr-iommufd.c',
>>    'cpr-legacy.c',
>>    'device.c',
>>    'migration.c',
>> diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
>> index 5615dcd..cc57a05 100644
>> --- a/hw/vfio/vfio-iommufd.h
>> +++ b/hw/vfio/vfio-iommufd.h
>> @@ -25,6 +25,7 @@ typedef struct IOMMUFDBackend IOMMUFDBackend;
>> typedef struct VFIOIOMMUFDContainer {
>>      VFIOContainerBase bcontainer;
>>      IOMMUFDBackend *be;
>> +    Error *cpr_blocker;
>>      uint32_t ioas_id;
>>      QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>> } VFIOIOMMUFDContainer;
>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>> index d06d117..1379b20 100644
>> --- a/include/hw/vfio/vfio-cpr.h
>> +++ b/include/hw/vfio/vfio-cpr.h
>> @@ -31,6 +31,7 @@ struct VFIOContainerBase;
>> struct VFIOGroup;
>> struct VFIOPCIDevice;
>> struct VFIODevice;
>> +struct VFIOIOMMUFDContainer;
>>
>> bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>>                                          Error **errp);
>> @@ -43,6 +44,13 @@ bool vfio_cpr_register_container(struct
>> VFIOContainerBase *bcontainer,
>>                                   Error **errp);
>> void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>>
>> +bool vfio_iommufd_cpr_register_container(struct VFIOIOMMUFDContainer
>> *container,
>> +                                         Error **errp);
>> +void vfio_iommufd_cpr_unregister_container(
>> +    struct VFIOIOMMUFDContainer *container);
>> +void vfio_iommufd_cpr_register_device(struct VFIODevice *vbasedev);
>> +void vfio_iommufd_cpr_unregister_device(struct VFIODevice *vbasedev);
>> +
>> bool vfio_cpr_container_match(struct VFIOContainer *container,
>>                                struct VFIOGroup *group, int *fd);
>>
>> --
>> 1.8.3.1
> 




* Re: [PATCH V3 36/42] vfio/iommufd: preserve descriptors
  2025-05-16 10:06   ` Duan, Zhenzhong
@ 2025-05-19 15:53     ` Steven Sistare
  2025-05-20  9:15       ` Duan, Zhenzhong
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-19 15:53 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/16/2025 6:06 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steve Sistare <steven.sistare@oracle.com>
>> Subject: [PATCH V3 36/42] vfio/iommufd: preserve descriptors
>>
>> Save the iommu and vfio device fd in CPR state when it is created.
>> After CPR, the fd number is found in CPR state and reused.  Remember
>> the reused status for subsequent patches.  The reused status is cleared
>> when vmstate load finishes.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> backends/iommufd.c       | 19 ++++++++++---------
>> hw/vfio/cpr-iommufd.c    | 16 ++++++++++++++++
>> hw/vfio/device.c         | 10 ++--------
>> hw/vfio/iommufd.c        | 13 +++++++++++--
>> include/system/iommufd.h |  1 +
>> 5 files changed, 40 insertions(+), 19 deletions(-)
>>
>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>> index 6fed1c1..492747c 100644
>> --- a/backends/iommufd.c
>> +++ b/backends/iommufd.c
>> @@ -16,12 +16,18 @@
>> #include "qemu/module.h"
>> #include "qom/object_interfaces.h"
>> #include "qemu/error-report.h"
>> +#include "migration/cpr.h"
>> #include "monitor/monitor.h"
>> #include "trace.h"
>> #include "hw/vfio/vfio-device.h"
>> #include <sys/ioctl.h>
>> #include <linux/iommufd.h>
>>
>> +static const char *iommufd_fd_name(IOMMUFDBackend *be)
>> +{
>> +    return object_get_canonical_path_component(OBJECT(be));
>> +}
>> +
>> static void iommufd_backend_init(Object *obj)
>> {
>>      IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
>> @@ -47,9 +53,8 @@ static void iommufd_backend_set_fd(Object *obj, const
>> char *str, Error **errp)
>>      IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
>>      int fd = -1;
>>
>> -    fd = monitor_fd_param(monitor_cur(), str, errp);
>> +    fd = cpr_get_fd_param(iommufd_fd_name(be), str, 0, &be->cpr_reused, errp);
>>      if (fd == -1) {
>> -        error_prepend(errp, "Could not parse remote object fd %s:", str);
>>          return;
>>      }
>>      be->fd = fd;
>> @@ -95,14 +100,9 @@ bool iommufd_change_process(IOMMUFDBackend *be,
>> Error **errp)
>>
>> bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
>> {
>> -    int fd;
>> -
>>      if (be->owned && !be->users) {
>> -        fd = qemu_open("/dev/iommu", O_RDWR, errp);
>> -        if (fd < 0) {
>> -            return false;
>> -        }
>> -        be->fd = fd;
>> +        be->fd = cpr_open_fd("/dev/iommu", O_RDWR, iommufd_fd_name(be), 0,
>> +                             &be->cpr_reused, errp);
> 
> Need to check error before assign to be->fd.

will do.

>>      }
>>      be->users++;
>>
>> @@ -121,6 +121,7 @@ void iommufd_backend_disconnect(IOMMUFDBackend
>> *be)
>>          be->fd = -1;
>>      }
>> out:
>> +    cpr_delete_fd(iommufd_fd_name(be), 0);
>>      trace_iommufd_backend_disconnect(be->fd, be->users);
>> }
>>
>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>> index 46f2006..b760bd3 100644
>> --- a/hw/vfio/cpr-iommufd.c
>> +++ b/hw/vfio/cpr-iommufd.c
>> @@ -8,6 +8,7 @@
>> #include "qemu/osdep.h"
>> #include "qapi/error.h"
>> #include "hw/vfio/vfio-cpr.h"
>> +#include "hw/vfio/vfio-device.h"
>> #include "migration/blocker.h"
>> #include "migration/cpr.h"
>> #include "migration/migration.h"
>> @@ -25,10 +26,25 @@ static bool vfio_cpr_supported(VFIOIOMMUFDContainer
>> *container, Error **errp)
>>      return true;
>> }
>>
>> +static int vfio_container_post_load(void *opaque, int version_id)
>> +{
>> +    VFIOIOMMUFDContainer *container = opaque;
>> +    VFIOContainerBase *bcontainer = &container->bcontainer;
>> +    VFIODevice *vbasedev;
>> +
>> +    QLIST_FOREACH(vbasedev, &bcontainer->device_list, container_next) {
>> +        vbasedev->cpr.reused = false;
>> +    }
>> +    container->be->cpr_reused = false;
> 
> It's strange to set iommufd and vfio device's reused in container's post load,
> Maybe better to do it in their own post load handler?

vfio_container_post_load has MIG_PRI_LOW so it is called last, which guarantees
that be->cpr_reused remains true while all devices are loaded.  This is required
so that we suppress dma_map calls during device load processing:

   iommufd_backend_map_file_dma()
     if (be->cpr_reused)
       return 0;

"vbasedev->cpr.reused = false" could be moved to vfio_device_post_load.
I put it here to be future proof -- all reused flags are cleared together,
at the end of post_load -- and to be consistent with
cpr-legacy.c:vfio_container_post_load.
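
To illustrate the ordering argument, here is a minimal sketch.  It is a toy
model, not QEMU code: the ToyBackend type and all toy_* functions are
hypothetical, and the priority ordering is simulated by simply calling the
container handler after the device handlers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the iommufd backend; only the flag matters. */
typedef struct {
    bool cpr_reused;   /* true while state inherited from old QEMU loads */
    int map_calls;     /* counts the mappings that were really issued */
} ToyBackend;

/* Mirrors the early return quoted above: no mapping while reused. */
int toy_map_dma(ToyBackend *be)
{
    if (be->cpr_reused) {
        return 0;
    }
    be->map_calls++;
    return 0;
}

/* A device post_load handler that triggers a mapping as a side effect. */
int toy_device_post_load(ToyBackend *be)
{
    return toy_map_dma(be);
}

/* The container handler: runs after every device handler, clears the flag. */
int toy_container_post_load(ToyBackend *be)
{
    be->cpr_reused = false;
    return 0;
}

/* Run the handlers in the simulated order; return the mappings issued. */
int toy_load_all(ToyBackend *be, int ndevices)
{
    assert(ndevices >= 0);
    for (int i = 0; i < ndevices; i++) {
        if (toy_device_post_load(be) != 0) {
            return -1;
        }
    }
    toy_container_post_load(be);
    return be->map_calls;
}
```

With cpr_reused initially true, toy_load_all() issues no mappings no matter
how many devices load, and mapping works normally once it returns.  If the
flag were cleared in an earlier handler, later device loads would start
issuing mappings again.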

>> +
>> +    return 0;
>> +}
>> +
>> static const VMStateDescription vfio_container_vmstate = {
>>      .name = "vfio-iommufd-container",
>>      .version_id = 0,
>>      .minimum_version_id = 0,
>> +    .post_load = vfio_container_post_load,
>>      .needed = cpr_needed_for_reuse,
>>      .fields = (VMStateField[]) {
>>          VMSTATE_END_OF_LIST()
>> diff --git a/hw/vfio/device.c b/hw/vfio/device.c
>> index 8e9de68..02f384e 100644
>> --- a/hw/vfio/device.c
>> +++ b/hw/vfio/device.c
>> @@ -312,14 +312,8 @@ bool vfio_device_get_name(VFIODevice *vbasedev,
>> Error **errp)
>>
>> void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
>> {
>> -    ERRP_GUARD();
>> -    int fd = monitor_fd_param(monitor_cur(), str, errp);
>> -
>> -    if (fd < 0) {
>> -        error_prepend(errp, "Could not parse remote object fd %s:", str);
>> -        return;
>> -    }
>> -    vbasedev->fd = fd;
>> +    vbasedev->fd = cpr_get_fd_param(vbasedev->dev->id, str, 0,
>> +                                    &vbasedev->cpr.reused, errp);
> 
> Same here.

Do you mean, "need to check error"?
If so, no need.  The new function definition is:

void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
{
     vbasedev->fd = cpr_get_fd_param(vbasedev->dev->id, str, 0,
                                     &vbasedev->cpr.reused, errp);
}

cpr_get_fd_param() returns -1 on error and sets errp.
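
The difference between the two call sites can be sketched as follows.  This
is a toy model under stated assumptions: ToyError, ToyBackend, and the toy_*
functions are hypothetical stand-ins, and toy_get_fd() only imitates the
"return -1 and set the error" convention described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, much simplified error object. */
typedef struct {
    const char *msg;
} ToyError;

typedef struct {
    int fd;
    unsigned users;
} ToyBackend;

/* Imitates the convention: return -1 and set the error on failure. */
static int toy_get_fd(bool fail, ToyError *err)
{
    if (fail) {
        err->msg = "could not get fd";
        return -1;
    }
    return 5;   /* arbitrary descriptor number for the sketch */
}

/* Setter: nothing follows the assignment, so storing -1 is harmless. */
void toy_set_fd(ToyBackend *be, bool fail, ToyError *err)
{
    be->fd = toy_get_fd(fail, err);
}

/* Connect: more work follows, so check before committing any state. */
bool toy_connect(ToyBackend *be, bool fail, ToyError *err)
{
    int fd = toy_get_fd(fail, err);

    if (fd < 0) {
        return false;
    }
    be->fd = fd;
    be->users++;
    return true;
}
```

In the setter the caller sees the failure through the error object alone.
In the connect path the function must stop before bumping the user count,
which is why the check is needed there.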

>> }
>>
>> static VFIODeviceIOOps vfio_device_io_ops_ioctl;
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index dabb948..046f601 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -26,6 +26,7 @@
>> #include "qemu/cutils.h"
>> #include "qemu/chardev_open.h"
>> #include "migration/blocker.h"
>> +#include "migration/cpr.h"
>> #include "pci.h"
>> #include "vfio-iommufd.h"
>> #include "vfio-helpers.h"
>> @@ -530,13 +531,18 @@ static bool iommufd_cdev_attach(const char *name,
>> VFIODevice *vbasedev,
>>
>> VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_IOMMUFD));
>>
>>      if (vbasedev->fd < 0) {
>> -        devfd = iommufd_cdev_getfd(vbasedev->sysfsdev, errp);
>> +        devfd = cpr_find_fd(vbasedev->name, 0);
>> +        vbasedev->cpr.reused = (devfd >= 0);
>> +        if (!vbasedev->cpr.reused) {
>> +            devfd = iommufd_cdev_getfd(vbasedev->sysfsdev, errp);
>> +        }
>>          if (devfd < 0) {
>>              return false;
>>          }
>>          vbasedev->fd = devfd;
>>      } else {
>>          devfd = vbasedev->fd;
>> +        /* reused was set in iommufd_backend_set_fd */
> 
> Should be vfio_device_set_fd

yes.

- Steve

>>      }
>>
>>      if (!iommufd_cdev_connect_and_bind(vbasedev, errp)) {
>> @@ -634,7 +640,9 @@ found_container:
>>
>>      vfio_device_prepare(vbasedev, bcontainer, &dev_info);
>>      vfio_iommufd_cpr_register_device(vbasedev);
>> -
>> +    if (!vbasedev->cpr.reused) {
>> +        cpr_save_fd(vbasedev->name, 0, vbasedev->fd);
>> +    }
>>      trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev-
>>> num_irqs,
>>                                     vbasedev->num_regions, vbasedev->flags);
>>      return true;
>> @@ -673,6 +681,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
>>
>>      migrate_del_blocker(&vbasedev->cpr.id_blocker);
>>      vfio_iommufd_cpr_unregister_device(vbasedev);
>> +    cpr_delete_fd(vbasedev->name, 0);
>>      iommufd_cdev_unbind_and_disconnect(vbasedev);
>>      close(vbasedev->fd);
>> }
>> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>> index db9ed53..5c17abd 100644
>> --- a/include/system/iommufd.h
>> +++ b/include/system/iommufd.h
>> @@ -32,6 +32,7 @@ struct IOMMUFDBackend {
>>      /*< protected >*/
>>      int fd;            /* /dev/iommu file descriptor */
>>      bool owned;        /* is the /dev/iommu opened internally */
>> +    bool cpr_reused;   /* fd is reused after CPR */
>>      uint32_t users;
>>
>>      /*< public >*/
>> --
>> 1.8.3.1
> 




* Re: [PATCH V3 37/42] vfio/iommufd: reconstruct device
  2025-05-16 10:22   ` Duan, Zhenzhong
@ 2025-05-19 15:53     ` Steven Sistare
  2025-05-20  9:14       ` Duan, Zhenzhong
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-19 15:53 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/16/2025 6:22 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steve Sistare <steven.sistare@oracle.com>
>> Subject: [PATCH V3 37/42] vfio/iommufd: reconstruct device
>>
>> Reconstruct userland device state after CPR.  During vfio_realize, skip
>> all ioctls that configure the device, as it was already configured in old
>> QEMU.
>>
>> Save the ioas_id in vmstate, and skip its allocation in vfio_realize.
>> Because we skip ioctl's, it is not needed at realize time.  However, we do
>> need the range info, so defer the call to iommufd_cdev_get_info_iova_range
>> to a post_load handler, at which time the ioas_id is known.
>>
>> This reconstruction is not complete.  hwpt_id and devid need special
>> treatment, handled in subsequent patches.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/cpr-iommufd.c |  8 ++++++++
>> hw/vfio/iommufd.c     | 17 +++++++++++++++++
>> 2 files changed, 25 insertions(+)
>>
>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>> index b760bd3..3d430f0 100644
>> --- a/hw/vfio/cpr-iommufd.c
>> +++ b/hw/vfio/cpr-iommufd.c
>> @@ -31,6 +31,13 @@ static int vfio_container_post_load(void *opaque, int
>> version_id)
>>      VFIOIOMMUFDContainer *container = opaque;
>>      VFIOContainerBase *bcontainer = &container->bcontainer;
>>      VFIODevice *vbasedev;
>> +    Error *err = NULL;
>> +    uint32_t ioas_id = container->ioas_id;
>> +
>> +    if (!iommufd_cdev_get_info_iova_range(container, ioas_id, &err)) {
>> +        error_report_err(err);
>> +        return -1;
>> +    }
>>
>>      QLIST_FOREACH(vbasedev, &bcontainer->device_list, container_next) {
>>          vbasedev->cpr.reused = false;
>> @@ -47,6 +54,7 @@ static const VMStateDescription vfio_container_vmstate = {
>>      .post_load = vfio_container_post_load,
>>      .needed = cpr_needed_for_reuse,
>>      .fields = (VMStateField[]) {
>> +        VMSTATE_UINT32(ioas_id, VFIOIOMMUFDContainer),
>>          VMSTATE_END_OF_LIST()
>>      }
>> };
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index 046f601..c49a7e7 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -122,6 +122,10 @@ static bool
>> iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
>>          goto err_kvm_device_add;
>>      }
>>
>> +    if (vbasedev->cpr.reused) {
>> +        goto skip_bind;
>> +    }
>> +
>>      /* Bind device to iommufd */
>>      bind.iommufd = iommufd->fd;
>>      if (ioctl(vbasedev->fd, VFIO_DEVICE_BIND_IOMMUFD, &bind)) {
>> @@ -133,6 +137,8 @@ static bool
>> iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
>>      vbasedev->devid = bind.out_devid;
>>      trace_iommufd_cdev_connect_and_bind(bind.iommufd, vbasedev->name,
>>                                          vbasedev->fd, vbasedev->devid);
>> +
>> +skip_bind:
>>      return true;
>> err_bind:
>>      iommufd_cdev_kvm_device_del(vbasedev);
>> @@ -580,6 +586,11 @@ static bool iommufd_cdev_attach(const char *name,
>> VFIODevice *vbasedev,
>>          }
>>      }
>>
>> +    if (vbasedev->cpr.reused) {
>> +        ioas_id = -1;           /* ioas_id will be received from vmstate */
>> +        goto skip_ioas_alloc;
>> +    }
>> +
>>      /* Need to allocate a new dedicated container */
>>      if (!iommufd_backend_alloc_ioas(vbasedev->iommufd, &ioas_id, errp)) {
>>          goto err_alloc_ioas;
>> @@ -587,6 +598,7 @@ static bool iommufd_cdev_attach(const char *name,
>> VFIODevice *vbasedev,
>>
>>      trace_iommufd_cdev_alloc_ioas(vbasedev->iommufd->fd, ioas_id);
>>
>> +skip_ioas_alloc:
>>      container =
>> VFIO_IOMMU_IOMMUFD(object_new(TYPE_VFIO_IOMMU_IOMMUFD));
>>      container->be = vbasedev->iommufd;
>>      container->ioas_id = ioas_id;
>> @@ -605,6 +617,10 @@ static bool iommufd_cdev_attach(const char *name,
>> VFIODevice *vbasedev,
>>          goto err_discard_disable;
>>      }
>>
>> +    if (vbasedev->cpr.reused) {
>> +        goto skip_info;
> 
> I suspect this will break virtio-iommu, see virtio_iommu_set_iommu_device().
> When virtio-iommu tries to get host_iova_ranges, they are not ready until post load.

Thanks, I'll look into it.
Can you give me a clue or a pointer on command line options to set this up?

- Steve

>> +    }
>> +
>>      if (!iommufd_cdev_get_info_iova_range(container, ioas_id, &err)) {
>>          error_append_hint(&err,
>>                     "Fallback to default 64bit IOVA range and 4K page size\n");
>> @@ -613,6 +629,7 @@ static bool iommufd_cdev_attach(const char *name,
>> VFIODevice *vbasedev,
>>          bcontainer->pgsizes = qemu_real_host_page_size();
>>      }
>>
>> +skip_info:
>>      if (!vfio_listener_register(bcontainer, errp)) {
>>          goto err_listener_register;
>>      }
>> --
>> 1.8.3.1
> 




* Re: [PATCH V3 39/42] vfio/iommufd: reconstruct hwpt
  2025-05-19  3:25   ` Duan, Zhenzhong
@ 2025-05-19 15:53     ` Steven Sistare
  2025-05-20  9:16       ` Duan, Zhenzhong
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-19 15:53 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/18/2025 11:25 PM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steve Sistare <steven.sistare@oracle.com>
>> Subject: [PATCH V3 39/42] vfio/iommufd: reconstruct hwpt
>>
>> Save the hwpt_id in vmstate.  In realize, skip its allocation from
>> iommufd_cdev_attach -> iommufd_cdev_attach_container ->
>> iommufd_cdev_autodomains_get.
>>
>> Rebuild userland structures to hold hwpt_id by calling
>> iommufd_cdev_rebuild_hwpt at post load time.  This depends on hw_caps, which
>> was restored by the post_load call to vfio_device_hiod_create_and_realize.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/cpr-iommufd.c      |  7 +++++++
>> hw/vfio/iommufd.c          | 24 ++++++++++++++++++++++--
>> hw/vfio/trace-events       |  1 +
>> hw/vfio/vfio-iommufd.h     |  3 +++
>> include/hw/vfio/vfio-cpr.h |  1 +
>> 5 files changed, 34 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>> index 24cdf10..6d3f4e0 100644
>> --- a/hw/vfio/cpr-iommufd.c
>> +++ b/hw/vfio/cpr-iommufd.c
>> @@ -110,6 +110,12 @@ static int vfio_device_post_load(void *opaque, int
>> version_id)
>>          error_report_err(err);
>>          return false;
>>      }
>> +    if (!vbasedev->mdev) {
>> +        VFIOIOMMUFDContainer *container = container_of(vbasedev->bcontainer,
>> +                                                       VFIOIOMMUFDContainer,
>> +                                                       bcontainer);
>> +        iommufd_cdev_rebuild_hwpt(vbasedev, container);
>> +    }
>>      return true;
>> }
>>
>> @@ -121,6 +127,7 @@ static const VMStateDescription vfio_device_vmstate = {
>>      .needed = cpr_needed_for_reuse,
>>      .fields = (VMStateField[]) {
>>          VMSTATE_INT32(devid, VFIODevice),
>> +        VMSTATE_UINT32(cpr.hwpt_id, VFIODevice),
>>          VMSTATE_END_OF_LIST()
>>      }
>> };
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index d980684..ec79c83 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -318,6 +318,7 @@ static bool
>> iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>> static void iommufd_cdev_use_hwpt(VFIODevice *vbasedev, VFIOIOASHwpt
>> *hwpt)
>> {
>>      vbasedev->hwpt = hwpt;
>> +    vbasedev->cpr.hwpt_id = hwpt->hwpt_id;
>>      vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
>>      QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>> }
>> @@ -373,6 +374,23 @@ static bool iommufd_cdev_make_hwpt(VFIODevice
>> *vbasedev,
>>      return true;
>> }
>>
>> +void iommufd_cdev_rebuild_hwpt(VFIODevice *vbasedev,
>> +                               VFIOIOMMUFDContainer *container)
>> +{
>> +    VFIOIOASHwpt *hwpt;
>> +    int hwpt_id = vbasedev->cpr.hwpt_id;
>> +
>> +    trace_iommufd_cdev_rebuild_hwpt(container->be->fd, hwpt_id);
>> +
>> +    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>> +        if (hwpt->hwpt_id == hwpt_id) {
>> +            iommufd_cdev_use_hwpt(vbasedev, hwpt);
>> +            return;
>> +        }
>> +    }
>> +    iommufd_cdev_make_hwpt(vbasedev, container, hwpt_id, false, NULL);
>> +}
>> +
>> static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>                                           VFIOIOMMUFDContainer *container,
>>                                           Error **errp)
>> @@ -567,7 +585,8 @@ static bool iommufd_cdev_attach(const char *name,
>> VFIODevice *vbasedev,
>>              vbasedev->iommufd != container->be) {
>>              continue;
>>          }
>> -        if (!iommufd_cdev_attach_container(vbasedev, container, &err)) {
>> +        if (!vbasedev->cpr.reused &&
>> +            !iommufd_cdev_attach_container(vbasedev, container, &err)) {
>>              const char *msg = error_get_pretty(err);
>>
>>              trace_iommufd_cdev_fail_attach_existing_container(msg);
>> @@ -605,7 +624,8 @@ skip_ioas_alloc:
>>      bcontainer = &container->bcontainer;
>>      vfio_address_space_insert(space, bcontainer);
>>
>> -    if (!iommufd_cdev_attach_container(vbasedev, container, errp)) {
>> +    if (!vbasedev->cpr.reused &&
>> +        !iommufd_cdev_attach_container(vbasedev, container, errp)) {
> 
> All container attachment is bypassed in new QEMU. I am concerned that new QEMU may not generate the same containers as old QEMU if there is more than one container in old QEMU.
> Devices could then be attached to the wrong container, or attachment could fail in post load.

Yes, this relates to our discussion in patch 35.  Please explain, how can a single
iommufd backend have multiple containers?

- Steve

>>          goto err_attach_container;
>>      }
>>
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index e90ec9b..4955264 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -190,6 +190,7 @@ iommufd_cdev_connect_and_bind(int iommufd, const
>> char *name, int devfd, int devi
>> iommufd_cdev_getfd(const char *dev, int devfd) " %s (fd=%d)"
>> iommufd_cdev_attach_ioas_hwpt(int iommufd, const char *name, int devfd, int
>> id) " [iommufd=%d] Successfully attached device %s (%d) to id=%d"
>> iommufd_cdev_detach_ioas_hwpt(int iommufd, const char *name) "
>> [iommufd=%d] Successfully detached %s"
>> +iommufd_cdev_rebuild_hwpt(int iommufd, int hwpt_id) " [iommufd=%d]
>> hwpt %d"
>> iommufd_cdev_fail_attach_existing_container(const char *msg) " %s"
>> iommufd_cdev_alloc_ioas(int iommufd, int ioas_id) " [iommufd=%d] new
>> IOMMUFD container with ioasid=%d"
>> iommufd_cdev_device_info(char *name, int devfd, int num_irqs, int
>> num_regions, int flags) " %s (%d) num_irqs=%d num_regions=%d flags=%d"
>> diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
>> index 148ce89..78af0d8 100644
>> --- a/hw/vfio/vfio-iommufd.h
>> +++ b/hw/vfio/vfio-iommufd.h
>> @@ -38,4 +38,7 @@ OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer,
>> VFIO_IOMMU_IOMMUFD);
>> bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
>>                                        uint32_t ioas_id, Error **errp);
>>
>> +void iommufd_cdev_rebuild_hwpt(VFIODevice *vbasedev,
>> +                               VFIOIOMMUFDContainer *container);
>> +
>> #endif /* HW_VFIO_VFIO_IOMMUFD_H */
>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>> index 1379b20..b98c247 100644
>> --- a/include/hw/vfio/vfio-cpr.h
>> +++ b/include/hw/vfio/vfio-cpr.h
>> @@ -24,6 +24,7 @@ typedef struct VFIODeviceCPR {
>>      bool reused;
>>      Error *mdev_blocker;
>>      Error *id_blocker;
>> +    uint32_t hwpt_id;
>> } VFIODeviceCPR;
>>
>> struct VFIOContainer;
>> --
>> 1.8.3.1
> 




* Re: [PATCH V3 33/42] vfio/iommufd: define hwpt constructors
  2025-05-16  8:55   ` Duan, Zhenzhong
@ 2025-05-19 15:55     ` Steven Sistare
  2025-05-23 17:47       ` Steven Sistare
  2025-05-20 12:34     ` Cédric Le Goater
  1 sibling, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-19 15:55 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/16/2025 4:55 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steve Sistare <steven.sistare@oracle.com>
>> Subject: [PATCH V3 33/42] vfio/iommufd: define hwpt constructors
>>
>> Extract hwpt creation code from iommufd_cdev_autodomains_get into the
>> helpers iommufd_cdev_use_hwpt and iommufd_cdev_make_hwpt.  These will
>> be used by CPR in a subsequent patch.
>>
>> Call vfio_device_hiod_create_and_realize earlier so iommufd_cdev_make_hwpt
>> can use vbasedev->hiod hw_caps, avoiding an extra call to
>> iommufd_backend_get_device_info
> 
> We had reached consensus to realize hiod after attachment;
> it's not a hot path, so an extra call is acceptable per Cedric.

I'll rework it per the consensus, but I suspect the result will be less pretty --
code duplication, or more conditionals.

- Steve

>> No functional change.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/iommufd.c | 116 ++++++++++++++++++++++++++++++----------------------
>> --
>> 1 file changed, 65 insertions(+), 51 deletions(-)
>>
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index f645a62..8661947 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -310,16 +310,70 @@ static bool
>> iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>>      return true;
>> }
>>
>> +static void iommufd_cdev_use_hwpt(VFIODevice *vbasedev, VFIOIOASHwpt
>> *hwpt)
>> +{
>> +    vbasedev->hwpt = hwpt;
>> +    vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
>> +    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>> +}
>> +
>> +/*
>> + * iommufd_cdev_make_hwpt: If @alloc_id, allocate a hwpt_id, else use
>> @hwpt_id.
>> + * Create and add a hwpt struct to the container's list and to the device.
>> + * Always succeeds if !@alloc_id.
>> + */
>> +static bool iommufd_cdev_make_hwpt(VFIODevice *vbasedev,
>> +                                   VFIOIOMMUFDContainer *container,
>> +                                   uint32_t hwpt_id, bool alloc_id,
>> +                                   Error **errp)
>> +{
>> +    VFIOIOASHwpt *hwpt;
>> +    uint32_t flags = 0;
>> +
>> +    /*
>> +     * This is quite early and VFIO Migration state isn't yet fully
>> +     * initialized, thus rely only on IOMMU hardware capabilities as to
>> +     * whether IOMMU dirty tracking is going to be requested. Later
>> +     * vfio_migration_realize() may decide to use VF dirty tracking
>> +     * instead.
>> +     */
>> +    g_assert(vbasedev->hiod);
>> +    if (vbasedev->hiod->caps.hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
>> +        flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>> +    }
>> +
>> +    if (alloc_id) {
>> +        if (!iommufd_backend_alloc_hwpt(vbasedev->iommufd, vbasedev->devid,
>> +                                        container->ioas_id, flags,
>> +                                        IOMMU_HWPT_DATA_NONE, 0, NULL,
>> +                                        &hwpt_id, errp)) {
>> +            return false;
>> +        }
>> +
>> +        if (iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt_id, errp)) {
>> +            iommufd_backend_free_id(container->be, hwpt_id);
>> +            return false;
>> +        }
>> +    }
>> +
>> +    hwpt = g_malloc0(sizeof(*hwpt));
>> +    hwpt->hwpt_id = hwpt_id;
>> +    hwpt->hwpt_flags = flags;
>> +    QLIST_INIT(&hwpt->device_list);
>> +
>> +    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
>> +    container->bcontainer.dirty_pages_supported |=
>> +                                vbasedev->iommu_dirty_tracking;
>> +    iommufd_cdev_use_hwpt(vbasedev, hwpt);
>> +    return true;
>> +}
>> +
>> static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>                                           VFIOIOMMUFDContainer *container,
>>                                           Error **errp)
>> {
>>      ERRP_GUARD();
>> -    IOMMUFDBackend *iommufd = vbasedev->iommufd;
>> -    uint32_t type, flags = 0;
>> -    uint64_t hw_caps;
>>      VFIOIOASHwpt *hwpt;
>> -    uint32_t hwpt_id;
>>      int ret;
>>
>>      /* Try to find a domain */
>> @@ -340,54 +394,14 @@ static bool
>> iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>
>>              return false;
>>          } else {
>> -            vbasedev->hwpt = hwpt;
>> -            QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>> -            vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
>> +            iommufd_cdev_use_hwpt(vbasedev, hwpt);
>>              return true;
>>          }
>>      }
>> -
>> -    /*
>> -     * This is quite early and VFIO Migration state isn't yet fully
>> -     * initialized, thus rely only on IOMMU hardware capabilities as to
>> -     * whether IOMMU dirty tracking is going to be requested. Later
>> -     * vfio_migration_realize() may decide to use VF dirty tracking
>> -     * instead.
>> -     */
>> -    if (!iommufd_backend_get_device_info(vbasedev->iommufd, vbasedev->devid,
>> -                                         &type, NULL, 0, &hw_caps, errp)) {
>> -        return false;
>> -    }
>> -
>> -    if (hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
>> -        flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>> -    }
>> -
>> -    if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
>> -                                    container->ioas_id, flags,
>> -                                    IOMMU_HWPT_DATA_NONE, 0, NULL,
>> -                                    &hwpt_id, errp)) {
>> -        return false;
>> -    }
>> -
>> -    hwpt = g_malloc0(sizeof(*hwpt));
>> -    hwpt->hwpt_id = hwpt_id;
>> -    hwpt->hwpt_flags = flags;
>> -    QLIST_INIT(&hwpt->device_list);
>> -
>> -    ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
>> -    if (ret) {
>> -        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>> -        g_free(hwpt);
>> +    if (!iommufd_cdev_make_hwpt(vbasedev, container, 0, true, errp)) {
>>          return false;
>>      }
>>
>> -    vbasedev->hwpt = hwpt;
>> -    vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
>> -    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>> -    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
>> -    container->bcontainer.dirty_pages_supported |=
>> -                                vbasedev->iommu_dirty_tracking;
>>      if (container->bcontainer.dirty_pages_supported &&
>>          !vbasedev->iommu_dirty_tracking) {
>>          warn_report("IOMMU instance for device %s doesn't support dirty tracking",
>> @@ -530,6 +544,11 @@ static bool iommufd_cdev_attach(const char *name,
>> VFIODevice *vbasedev,
>>
>>      space = vfio_address_space_get(as);
>>
>> +    if (!vfio_device_hiod_create_and_realize(vbasedev,
>> +            TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO, errp)) {
>> +        goto err_alloc_ioas;
>> +    }
>> +
>>      /* try to attach to an existing container in this space */
>>      QLIST_FOREACH(bcontainer, &space->containers, next) {
>>          container = container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
>> @@ -604,11 +623,6 @@ found_container:
>>          goto err_listener_register;
>>      }
>>
>> -    if (!vfio_device_hiod_create_and_realize(vbasedev,
>> -                     TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO, errp)) {
>> -        goto err_listener_register;
>> -    }
>> -
>>      /*
>>       * TODO: examine RAM_BLOCK_DISCARD stuff, should we do group level
>>       * for discarding incompatibility check as well?
>> --
>> 1.8.3.1
> 




* Re: [PATCH V3 07/42] vfio/container: preserve descriptors
  2025-05-19 13:20       ` Cédric Le Goater
@ 2025-05-19 16:21         ` Steven Sistare
  0 siblings, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-19 16:21 UTC (permalink / raw)
  To: Cédric Le Goater, Peter Xu
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Fabiano Rosas, qemu-devel

On 5/19/2025 9:20 AM, Cédric Le Goater wrote:
> On 5/15/25 21:08, Steven Sistare wrote:
>> On 5/15/2025 8:59 AM, Cédric Le Goater wrote:
>>> On 5/12/25 17:32, Steve Sistare wrote:
>>>> At vfio creation time, save the value of vfio container, group, and device
>>>> descriptors in CPR state.  On qemu restart, vfio_realize() finds and uses
>>>> the saved descriptors, and remembers the reused status for subsequent
>>>> patches.  The reused status is cleared when vmstate load finishes.
>>>>
>>>> During reuse, device and iommu state is already configured, so operations
>>>> in vfio_realize that would modify the configuration, such as vfio ioctl's,
>>>> are skipped.  The result is that vfio_realize constructs qemu data
>>>> structures that reflect the current state of the device.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>>   hw/vfio/container.c           | 65 ++++++++++++++++++++++++++++++++++++-------
>>>>   hw/vfio/cpr-legacy.c          | 46 ++++++++++++++++++++++++++++++
>>>>   include/hw/vfio/vfio-cpr.h    |  9 ++++++
>>>>   include/hw/vfio/vfio-device.h |  2 ++
>>>>   4 files changed, 112 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>>>> index 85c76da..278a220 100644
>>>> --- a/hw/vfio/container.c
>>>> +++ b/hw/vfio/container.c
>>>> @@ -31,6 +31,8 @@
>>>>   #include "system/reset.h"
>>>>   #include "trace.h"
>>>>   #include "qapi/error.h"
>>>> +#include "migration/cpr.h"
>>>> +#include "migration/blocker.h"
>>>>   #include "pci.h"
>>>>   #include "hw/vfio/vfio-container.h"
>>>>   #include "hw/vfio/vfio-cpr.h"
>>>> @@ -414,7 +416,7 @@ static bool vfio_set_iommu(int container_fd, int group_fd,
>>>>   }
>>>>   static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
>>>> -                                            Error **errp)
>>>> +                                            bool cpr_reused, Error **errp)
>>>>   {
>>>>       int iommu_type;
>>>>       const char *vioc_name;
>>>> @@ -425,7 +427,11 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
>>>>           return NULL;
>>>>       }
>>>> -    if (!vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
>>>> +    /*
>>>> +     * If container is reused, just set its type and skip the ioctls, as the
>>>> +     * container and group are already configured in the kernel.
>>>> +     */
>>>> +    if (!cpr_reused && !vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
>>>>           return NULL;
>>>>       }
>>>> @@ -433,6 +439,7 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
>>>>       container = VFIO_IOMMU_LEGACY(object_new(vioc_name));
>>>>       container->fd = fd;
>>>> +    container->cpr.reused = cpr_reused;
>>>>       container->iommu_type = iommu_type;
>>>>       return container;
>>>>   }
>>>> @@ -584,7 +591,7 @@ static bool vfio_container_attach_discard_disable(VFIOContainer *container,
>>>>   }
>>>>   static bool vfio_container_group_add(VFIOContainer *container, VFIOGroup *group,
>>>> -                                     Error **errp)
>>>> +                                     bool cpr_reused, Error **errp)
>>>>   {
>>>>       if (!vfio_container_attach_discard_disable(container, group, errp)) {
>>>>           return false;
>>>> @@ -592,6 +599,9 @@ static bool vfio_container_group_add(VFIOContainer *container, VFIOGroup *group,
>>>>       group->container = container;
>>>>       QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>>>>       vfio_group_add_kvm_device(group);
>>>> +    if (!cpr_reused) {
>>>> +        cpr_save_fd("vfio_container_for_group", group->groupid, container->fd);
>>>> +    }
>>>
>>> Could we avoid the test on cpr_reused and always call cpr_save_fd() ?
>>
>> No.  If cpr_reused is true, then the fd is already on cpr's save list.
>> We don't want to save duplicates of the same entry.
> 
> Can't we call cpr_find_fd() like in cpr_open_fd() ?

I could indeed.  You have re-invented the cpr_resave_fd() helper, which was
used here and elsewhere in V2, but Peter didn't like it.

Peter said:
   If the caller knows the fd was created, then IIUC the caller shouldn't
   invoke the call.
   For the other case, could you give an example when the caller may have been
   created, but maybe not?

I said:
   It avoids the need to remember that an fd was reused, and test that fact before
   calling cpr_save_fd.  And sometimes those operations occur in different functions.
   Thus resave saves a few lines of code.

Peter, can I bring back cpr_resave_fd() ?
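
For reference, the idea can be sketched as follows.  This is a toy model
with hypothetical toy_* names and a fixed-size table, not the V2 helper
itself: a plain save appends unconditionally and can duplicate an entry,
while a resave looks the entry up first and saves only if it is absent.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_FDS 16

/* One saved descriptor, keyed by (name, id). */
typedef struct {
    const char *name;
    int id;
    int fd;
} ToyFd;

static ToyFd toy_fds[TOY_MAX_FDS];
static int toy_nfds;

/* Return the saved fd for (name, id), or -1 if there is none. */
int toy_find_fd(const char *name, int id)
{
    for (int i = 0; i < toy_nfds; i++) {
        if (toy_fds[i].id == id && strcmp(toy_fds[i].name, name) == 0) {
            return toy_fds[i].fd;
        }
    }
    return -1;
}

/* Append unconditionally; calling this twice duplicates the entry. */
void toy_save_fd(const char *name, int id, int fd)
{
    assert(toy_nfds < TOY_MAX_FDS);
    toy_fds[toy_nfds++] = (ToyFd){ name, id, fd };
}

/* Save only if (name, id) is not already on the list. */
void toy_resave_fd(const char *name, int id, int fd)
{
    if (toy_find_fd(name, id) < 0) {
        toy_save_fd(name, id, fd);
    }
}

/* Count the entries for (name, id); used to show duplicates. */
int toy_count(const char *name, int id)
{
    int n = 0;

    for (int i = 0; i < toy_nfds; i++) {
        if (toy_fds[i].id == id && strcmp(toy_fds[i].name, name) == 0) {
            n++;
        }
    }
    return n;
}
```

With a helper of this shape the call site needs no cpr_reused test, at the
cost of a lookup on every call.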

>>>>       return true;
>>>>   }
>>>> @@ -601,6 +611,7 @@ static void vfio_container_group_del(VFIOContainer *container, VFIOGroup *group)
>>>>       group->container = NULL;
>>>>       vfio_group_del_kvm_device(group);
>>>>       vfio_ram_block_discard_disable(container, false);
>>>> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>>>>   }
>>>>   static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>>> @@ -613,17 +624,37 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>>>       VFIOIOMMUClass *vioc = NULL;
>>>>       bool new_container = false;
>>>>       bool group_was_added = false;
>>>> +    bool cpr_reused;
>>>>       space = vfio_address_space_get(as);
>>>> +    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
>>>> +    cpr_reused = (fd > 0);
>>>
>>>
>>> The code above is doing 2 things : it grabs a restored fd and
>>> deduces from the fd value that the VM is doing a CPR
>>> reboot.
>>>
>>> Instead of adding this cpr_reused flag, I would prefer to duplicate
>>> the code into something like:
>>>
>>> if (!cpr_reboot) {
>>>     QLIST_FOREACH(bcontainer, &space->containers, next) {
>>>          container = container_of(bcontainer, VFIOContainer, bcontainer);
>>>          if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>>>              return vfio_container_group_add(container, group, errp);
>>>          }
>>>      }
>>>
>>>      fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
>>>      if (fd < 0) {
>>>          goto fail;
>>>      }
>>>
>>>      ret = ioctl(fd, VFIO_GET_API_VERSION);
>>>      if (ret != VFIO_API_VERSION) {
>>>          error_setg(errp, "supported vfio version: %d, "
>>>                     "reported version: %d", VFIO_API_VERSION, ret);
>>>          goto fail;
>>>      }
>>>
>>>      container = vfio_create_container(fd, group, errp);
>>> } else {
>>>     /* ... */
>>> }
>>>
>>
>> OK, but there is no sense in duplicating the identical code for
>> VFIO_GET_API_VERSION and vfio_create_container.  If you want me to
>> simplify the loop, I suggest:
>>
>> if (!cpr_reused) {
>>      QLIST_FOREACH(bcontainer, &space->containers, next) {
>>           container = container_of(bcontainer, VFIOContainer, bcontainer);
>>           if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>>               return vfio_container_group_add(container, group, false, errp);
>>           }
>>       }
>>
>>       fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
>>       if (fd < 0) {
>>           goto fail;
>>       }
>> } else {
>>      QLIST_FOREACH(bcontainer, &space->containers, next) {
>>          container = container_of(bcontainer, VFIOContainer, bcontainer);
>>          if (vfio_cpr_container_match(container, group, &fd)) {
>>              return vfio_container_group_add(container, group, true, errp);
>>          }
>>      }
>> }
>>
>> ret = ioctl(fd, VFIO_GET_API_VERSION);
>> ...
> 
> OK. Let's do that. I find it easier to read.

will do.

>>>> +    /*
>>>> +     * If the container is reused, then the group is already attached in the
>>>> +     * kernel.  If a container with matching fd is found, then update the
>>>> +     * userland group list and return.  If not, then after the loop, create
>>>> +     * the container struct and group list.
>>>> +     */
>>>>       QLIST_FOREACH(bcontainer, &space->containers, next) {
>>>>           container = container_of(bcontainer, VFIOContainer, bcontainer);
>>>> -        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>>>> -            return vfio_container_group_add(container, group, errp);
>>>> +
>>>> +        if (cpr_reused) {
>>>> +            if (!vfio_cpr_container_match(container, group, &fd)) {
>>>
>>> why do we need to modify fd ?
>>
>> That is explained by the comments inside vfio_cpr_container_match, where the
>> explanation is more easily understood.
> 
> I haven't been able to see what a modified fd was useful for before because
> we test cpr_reused and in other places !cpr_reused :
> 
>          if (cpr_reused) {
>              if (!vfio_cpr_container_match(container, group, &fd)) {
>                  continue;
>              }
> 
> and later
> 
>      if (!cpr_reused) {
>          fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
>      }
> 
> I think I got it now. This was a bit confusing.
> 
>>
>>>> +                continue;
>>>> +            }
>>>> +        } else if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>>>> +            continue;
>>>>           }
>>>> +        return vfio_container_group_add(container, group, cpr_reused, errp);
>>>> +    }
>>>> +
>>>> +    if (!cpr_reused) {
>>>> +        fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
>>>>       }
>>>> -    fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
>>>>       if (fd < 0) {
>>>>           goto fail;
>>>>       }
>>>> @@ -635,7 +666,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>>>           goto fail;
>>>>       }
>>>> -    container = vfio_create_container(fd, group, errp);
>>>> +    container = vfio_create_container(fd, group, cpr_reused, errp);
>>>>       if (!container) {
>>>>           goto fail;
>>>>       }
>>>> @@ -655,7 +686,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>>>       vfio_address_space_insert(space, bcontainer);
>>>> -    if (!vfio_container_group_add(container, group, errp)) {
>>>> +    if (!vfio_container_group_add(container, group, cpr_reused, errp)) {
>>>>           goto fail;
>>>>       }
>>>>       group_was_added = true;
>>>> @@ -697,6 +728,7 @@ static void vfio_container_disconnect(VFIOGroup *group)
>>>>       QLIST_REMOVE(group, container_next);
>>>>       group->container = NULL;
>>>> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>>>>       /*
>>>>        * Explicitly release the listener first before unset container,
>>>> @@ -750,7 +782,7 @@ static VFIOGroup *vfio_group_get(int groupid, AddressSpace *as, Error **errp)
>>>>       group = g_malloc0(sizeof(*group));
>>>>       snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
>>>> -    group->fd = qemu_open(path, O_RDWR, errp);
>>>> +    group->fd = cpr_open_fd(path, O_RDWR, "vfio_group", groupid, NULL, errp);
>>>>       if (group->fd < 0) {
>>>>           goto free_group_exit;
>>>>       }
>>>> @@ -782,6 +814,7 @@ static VFIOGroup *vfio_group_get(int groupid, AddressSpace *as, Error **errp)
>>>>       return group;
>>>>   close_fd_exit:
>>>> +    cpr_delete_fd("vfio_group", groupid);
>>>>       close(group->fd);
>>>>   free_group_exit:
>>>> @@ -803,6 +836,7 @@ static void vfio_group_put(VFIOGroup *group)
>>>>       vfio_container_disconnect(group);
>>>>       QLIST_REMOVE(group, next);
>>>>       trace_vfio_group_put(group->fd);
>>>> +    cpr_delete_fd("vfio_group", group->groupid);
>>>>       close(group->fd);
>>>>       g_free(group);
>>>>   }
>>>> @@ -812,8 +846,14 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
>>>>   {
>>>>       g_autofree struct vfio_device_info *info = NULL;
>>>>       int fd;
>>>> +    bool cpr_reused;
>>>> +
>>>> +    fd = cpr_find_fd(name, 0);
>>>> +    cpr_reused = (fd >= 0);
>>>> +    if (!cpr_reused) {
>>>> +        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>>>> +    }
>>>
>>> Could we introduce a helper routine to open this file, like we have
>>> cpr_open_fd() ?
>>
>> OK, but this would be the only use of the helper, and it would bury
>> generic vfio functionality -- VFIO_GROUP_GET_DEVICE_FD -- inside a cpr
>> flavored helper.  IMO not an improvement.
> 
> VFIO_GROUP_GET_DEVICE_FD would still be passed as a parameter and
> so it won't be buried IMO. I don't dislike it that much.

OK.

> However, I don't like the "if (cpr_reused)" statements scattered
> throughout the code, so I'm looking for ways to bury them.

cpr_resave_fd will help.
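
One shape the device-fd helper discussed above could take, keeping VFIO_GROUP_GET_DEVICE_FD visible at the call site, is a find-or-open wrapper that receives the opener as a callback. The sketch below is hypothetical: sketch_get_fd, sketch_open_fn, and the one-entry store are invented for the example and exist only so that it compiles on its own.

```c
#include <stdbool.h>
#include <string.h>

/* One-entry stand-in for the CPR fd list, just enough for the example. */
static struct {
    bool used;
    char name[64];
    int id;
    int fd;
} sketch_slot;

static int sketch_find_fd(const char *name, int id)
{
    if (sketch_slot.used && sketch_slot.id == id &&
        !strcmp(sketch_slot.name, name)) {
        return sketch_slot.fd;
    }
    return -1;
}

static void sketch_save_fd(const char *name, int id, int fd)
{
    sketch_slot.used = true;
    strncpy(sketch_slot.name, name, sizeof(sketch_slot.name) - 1);
    sketch_slot.name[sizeof(sketch_slot.name) - 1] = '\0';
    sketch_slot.id = id;
    sketch_slot.fd = fd;
}

/* Caller-supplied opener, e.g. a thin wrapper around an ioctl. */
typedef int (*sketch_open_fn)(void *opaque, const char *name);

/*
 * Return the preserved fd for (name, id) if one exists; otherwise call
 * open_fn and remember the result.  *reused reports which path was taken,
 * so the caller can skip device configuration when the fd was preserved.
 */
static int sketch_get_fd(const char *name, int id, sketch_open_fn open_fn,
                         void *opaque, bool *reused)
{
    int fd = sketch_find_fd(name, id);

    *reused = (fd >= 0);
    if (!*reused) {
        fd = open_fn(opaque, name);
        if (fd >= 0) {
            sketch_save_fd(name, id, fd);
        }
    }
    return fd;
}

/*
 * Example opener.  In vfio this would wrap the VFIO_GROUP_GET_DEVICE_FD
 * ioctl on group->fd; here it just returns the int that opaque points to.
 */
static int sketch_fake_open(void *opaque, const char *name)
{
    (void)name;
    return *(int *)opaque;
}
```

The first call for a given name goes through the opener and records the fd; a second call returns the recorded fd with *reused set, without invoking the opener again.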

- Steve

>>>> -    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>>>>       if (fd < 0) {
>>>>           error_setg_errno(errp, errno, "error getting device from group %d",
>>>>                            group->groupid);
>>>> @@ -857,6 +897,10 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
>>>>       vbasedev->group = group;
>>>>       QLIST_INSERT_HEAD(&group->device_list, vbasedev, next);
>>>> +    vbasedev->cpr.reused = cpr_reused;
>>>> +    if (!cpr_reused) {
>>>> +        cpr_save_fd(name, 0, fd);
>>>
>>> Could we avoid the test on cpr_reused and always call cpr_save_fd() ?
>>
>> No.  Must avoid adding duplicate entries.
>>
>>>> +    }
>>>>       trace_vfio_device_get(name, info->flags, info->num_regions, info->num_irqs);
>>>>       return true;
>>>> @@ -870,6 +914,7 @@ static void vfio_device_put(VFIODevice *vbasedev)
>>>>       QLIST_REMOVE(vbasedev, next);
>>>>       vbasedev->group = NULL;
>>>>       trace_vfio_device_put(vbasedev->fd);
>>>> +    cpr_delete_fd(vbasedev->name, 0);
>>>>       close(vbasedev->fd);
>>>>   }
>>>> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
>>>> index fac323c..638a8e0 100644
>>>> --- a/hw/vfio/cpr-legacy.c
>>>> +++ b/hw/vfio/cpr-legacy.c
>>>> @@ -10,6 +10,7 @@
>>>>   #include "qemu/osdep.h"
>>>>   #include "hw/vfio/vfio-container.h"
>>>>   #include "hw/vfio/vfio-cpr.h"
>>>> +#include "hw/vfio/vfio-device.h"
>>>>   #include "migration/blocker.h"
>>>>   #include "migration/cpr.h"
>>>>   #include "migration/migration.h"
>>>> @@ -31,10 +32,27 @@ static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>>>>       }
>>>>   }
>>>> +static int vfio_container_post_load(void *opaque, int version_id)
>>>> +{
>>>> +    VFIOContainer *container = opaque;
>>>> +    VFIOGroup *group;
>>>> +    VFIODevice *vbasedev;
>>>> +
>>>> +    container->cpr.reused = false;
>>>> +
>>>> +    QLIST_FOREACH(group, &container->group_list, container_next) {
>>>> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>>>> +            vbasedev->cpr.reused = false;
>>>> +        }
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>>   static const VMStateDescription vfio_container_vmstate = {
>>>>       .name = "vfio-container",
>>>>       .version_id = 0,
>>>>       .minimum_version_id = 0,
>>>> +    .post_load = vfio_container_post_load,
>>>>       .needed = cpr_needed_for_reuse,
>>>>       .fields = (VMStateField[]) {
>>>>           VMSTATE_END_OF_LIST()
>>>> @@ -68,3 +86,31 @@ void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
>>>>       migrate_del_blocker(&container->cpr.blocker);
>>>>       vmstate_unregister(NULL, &vfio_container_vmstate, container);
>>>>   }
>>>> +
>>>> +static bool same_device(int fd1, int fd2)
>>>> +{
>>>> +    struct stat st1, st2;
>>>> +
>>>> +    return !fstat(fd1, &st1) && !fstat(fd2, &st2) && st1.st_dev == st2.st_dev;
>>>> +}
>>>> +
>>>> +bool vfio_cpr_container_match(VFIOContainer *container, VFIOGroup *group,
>>>> +                              int *pfd)
>>>> +{
>>>> +    if (container->fd == *pfd) {
>>>> +        return true;
>>>> +    }
>>>> +    if (!same_device(container->fd, *pfd)) {
>>>> +        return false;
>>>> +    }
>>>> +    /*
>>>> +     * Same device, different fd.  This occurs when the container fd is
>>>> +     * cpr_save'd multiple times, once for each groupid, so SCM_RIGHTS
>>>> +     * produces duplicates.  De-dup it.
>>>> +     */
>>>> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>>>> +    close(*pfd);
>>>> +    cpr_save_fd("vfio_container_for_group", group->groupid, container->fd);
>>>> +    *pfd = container->fd;
>>>
>>> I am not sure 'pfd' is used afterwards. Is it ?
>>
>> True, good eye.  I will change it to "int fd" and stop returning the new value.
>>
>> - Steve
>>
>>>
>>>> +    return true;
>>>> +}
>>>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>>>> index f864547..1c4f070 100644
>>>> --- a/include/hw/vfio/vfio-cpr.h
>>>> +++ b/include/hw/vfio/vfio-cpr.h
>>>> @@ -13,10 +13,16 @@
>>>>   typedef struct VFIOContainerCPR {
>>>>       Error *blocker;
>>>> +    bool reused;
>>>>   } VFIOContainerCPR;
>>>> +typedef struct VFIODeviceCPR {
>>>> +    bool reused;
>>>> +} VFIODeviceCPR;
>>>> +
>>>>   struct VFIOContainer;
>>>>   struct VFIOContainerBase;
>>>> +struct VFIOGroup;
>>>>   bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>>>>                                           Error **errp);
>>>> @@ -29,4 +35,7 @@ bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
>>>>                                    Error **errp);
>>>>   void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>>>> +bool vfio_cpr_container_match(struct VFIOContainer *container,
>>>> +                              struct VFIOGroup *group, int *fd);
>>>> +
>>>>   #endif /* HW_VFIO_VFIO_CPR_H */
>>>> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
>>>> index 8bcb3c1..4e4d0b6 100644
>>>> --- a/include/hw/vfio/vfio-device.h
>>>> +++ b/include/hw/vfio/vfio-device.h
>>>> @@ -28,6 +28,7 @@
>>>>   #endif
>>>>   #include "system/system.h"
>>>>   #include "hw/vfio/vfio-container-base.h"
>>>> +#include "hw/vfio/vfio-cpr.h"
>>>>   #include "system/host_iommu_device.h"
>>>>   #include "system/iommufd.h"
>>>> @@ -84,6 +85,7 @@ typedef struct VFIODevice {
>>>>       VFIOIOASHwpt *hwpt;
>>>>       QLIST_ENTRY(VFIODevice) hwpt_next;
>>>>       struct vfio_region_info **reginfo;
>>>> +    VFIODeviceCPR cpr;
>>>>   } VFIODevice;
>>>>   struct VFIODeviceOps {
>>>
>>>
>>> Thanks,
>>>
>>> C.
>>>
>>>
>>
> 




* Re: [PATCH V3 10/42] vfio/container: restore DMA vaddr
  2025-05-19 13:32       ` Cédric Le Goater
@ 2025-05-19 16:33         ` Steven Sistare
  0 siblings, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-19 16:33 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/19/2025 9:32 AM, Cédric Le Goater wrote:
> On 5/15/25 21:08, Steven Sistare wrote:
>> On 5/15/2025 9:42 AM, Cédric Le Goater wrote:
>>> On 5/12/25 17:32, Steve Sistare wrote:
>>>> In new QEMU, do not register the memory listener at device creation time.
>>>> Register it later, in the container post_load handler, after all vmstate
>>>> that may affect regions and mapping boundaries has been loaded.  The
>>>> post_load registration will cause the listener to invoke its callback on
>>>> each flat section, and the calls will match the mappings remembered by the
>>>> kernel.
>>>>
>>>> The listener calls a special dma_map handler that passes the new VA of each
>>>> section to the kernel using VFIO_DMA_MAP_FLAG_VADDR.  Restore the normal
>>>> handler at the end.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>>   hw/vfio/container.c  | 15 +++++++++++++--
>>>>   hw/vfio/cpr-legacy.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>>   2 files changed, 61 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>>>> index a554683..0e02726 100644
>>>> --- a/hw/vfio/container.c
>>>> +++ b/hw/vfio/container.c
>>>> @@ -137,6 +137,8 @@ static int vfio_legacy_dma_unmap_one(const VFIOContainerBase *bcontainer,
>>>>       int ret;
>>>>       Error *local_err = NULL;
>>>> +    assert(!container->cpr.reused);
>>>> +
>>>>       if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer)) {
>>>>           if (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
>>>>               bcontainer->dirty_pages_supported) {
>>>> @@ -691,8 +693,17 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>>>       }
>>>>       group_was_added = true;
>>>> -    if (!vfio_listener_register(bcontainer, errp)) {
>>>> -        goto fail;
>>>> +    /*
>>>> +     * If reused, register the listener later, after all state that may
>>>> +     * affect regions and mapping boundaries has been cpr load'ed.  Later,
>>>> +     * the listener will invoke its callback on each flat section and call
>>>> +     * dma_map to supply the new vaddr, and the calls will match the mappings
>>>> +     * remembered by the kernel.
>>>> +     */
>>>> +    if (!cpr_reused) {
>>>> +        if (!vfio_listener_register(bcontainer, errp)) {
>>>> +            goto fail;
>>>> +        }
>>>
>>> hmm, I am starting to think we should have a vfio_cpr_container_connect
>>> routine too.
>>
>> I think that would obscure rather than clarify the code, since the normal
>> non-cpr action of calling vfio_listener_register would be buried in a
>> cpr flavored function name.
>>
>>>>       }
>>>>       bcontainer->initialized = true;
>>>> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
>>>> index 519d772..bbcf71e 100644
>>>> --- a/hw/vfio/cpr-legacy.c
>>>> +++ b/hw/vfio/cpr-legacy.c
>>>> @@ -11,11 +11,13 @@
>>>>   #include "hw/vfio/vfio-container.h"
>>>>   #include "hw/vfio/vfio-cpr.h"
>>>>   #include "hw/vfio/vfio-device.h"
>>>> +#include "hw/vfio/vfio-listener.h"
>>>>   #include "migration/blocker.h"
>>>>   #include "migration/cpr.h"
>>>>   #include "migration/migration.h"
>>>>   #include "migration/vmstate.h"
>>>>   #include "qapi/error.h"
>>>> +#include "qemu/error-report.h"
>>>>   static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>>>>   {
>>>> @@ -32,6 +34,34 @@ static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>>>>       return true;
>>>>   }
>>>> +/*
>>>> + * Set the new @vaddr for any mappings registered during cpr load.
>>>> + * Reused is cleared thereafter.
>>>> + */
>>>> +static int vfio_legacy_cpr_dma_map(const VFIOContainerBase *bcontainer,
>>>> +                                   hwaddr iova, ram_addr_t size, void *vaddr,
>>>> +                                   bool readonly)
>>>> +{
>>>> +    const VFIOContainer *container = container_of(bcontainer, VFIOContainer,
>>>> +                                                  bcontainer);
>>>> +    struct vfio_iommu_type1_dma_map map = {
>>>> +        .argsz = sizeof(map),
>>>> +        .flags = VFIO_DMA_MAP_FLAG_VADDR,
>>>> +        .vaddr = (__u64)(uintptr_t)vaddr,
>>>> +        .iova = iova,
>>>> +        .size = size,
>>>> +    };
>>>> +
>>>> +    assert(container->cpr.reused);
>>>> +
>>>> +    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
>>>> +        error_report("vfio_legacy_cpr_dma_map (iova %lu, size %ld, va %p): %s",
>>>> +                     iova, size, vaddr, strerror(errno));
>>>
>>> Callers should also report the error. No need to do it here.
>>
>> This function has the same signature as the dma_map class method,
>> which does not return an error message.  Its existing implementations
>> use error_report.
> 
> backends .dma_map handlers : vfio_legacy_dma_map(), iommufd_backend_map_dma()
> don't report errors. vfio_container_dma_map() doesn't either.
> 
> callers of vfio_container_dma_map() : vfio_iommu_map_notify(),
> vfio_listener_region_add() report errors.

OK, I misunderstood your suggestion.

I will drop the error_report and just return the errno.
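
The split being agreed on here, where the backend handler only returns an error code and the caller decides how to report it, can be sketched in isolation as follows. The function names are made up for the illustration and the ioctl request is left generic; this is not the revised patch.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Backend-style handler: perform the ioctl and return 0 or -errno,
 * without printing anything.
 */
static int sketch_backend_map(int fd, unsigned long request, void *arg)
{
    if (ioctl(fd, request, arg)) {
        return -errno;
    }
    return 0;
}

/* Caller: the one place that turns the error code into a message. */
static int sketch_caller_map(int fd, unsigned long request, void *arg)
{
    int ret = sketch_backend_map(fd, request, arg);

    if (ret) {
        fprintf(stderr, "sketch: map failed: %s\n", strerror(-ret));
    }
    return ret;
}
```

Called with an invalid descriptor such as -1, the backend returns -EBADF silently and only the caller prints a message, so each failure is reported once.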

- Steve

>>>> +        return -errno;
>>>> +    }
>>>> +
>>>> +    return 0;
>>>> +}
>>>>   static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>>>>   {
>>>> @@ -63,12 +93,24 @@ static int vfio_container_pre_save(void *opaque)
>>>>   static int vfio_container_post_load(void *opaque, int version_id)
>>>>   {
>>>>       VFIOContainer *container = opaque;
>>>> +    VFIOContainerBase *bcontainer = &container->bcontainer;
>>>>       VFIOGroup *group;
>>>>       VFIODevice *vbasedev;
>>>> +    Error *err = NULL;
>>>> +
>>>> +    if (!vfio_listener_register(bcontainer, &err)) {
>>>> +        error_report_err(err);
>>>> +        return -1;
>>>> +    }
>>>>       container->cpr.reused = false;
>>>>       QLIST_FOREACH(group, &container->group_list, container_next) {
>>>> +        VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
>>>> +
>>>> +        /* Restore original dma_map function */
>>>> +        vioc->dma_map = vfio_legacy_dma_map;
>>>> +
>>>>           QLIST_FOREACH(vbasedev, &group->device_list, next) {
>>>>               vbasedev->cpr.reused = false;
>>>>           }
>>>> @@ -80,6 +122,7 @@ static const VMStateDescription vfio_container_vmstate = {
>>>>       .name = "vfio-container",
>>>>       .version_id = 0,
>>>>       .minimum_version_id = 0,
>>>> +    .priority = MIG_PRI_LOW,  /* Must happen after devices and groups */
>>>>       .pre_save = vfio_container_pre_save,
>>>>       .post_load = vfio_container_post_load,
>>>>       .needed = cpr_needed_for_reuse,
>>>> @@ -104,6 +147,11 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
>>>>       vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>>>> +    /* During incoming CPR, divert calls to dma_map. */
>>>> +    if (container->cpr.reused) {
>>>> +        VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
>>>> +        vioc->dma_map = vfio_legacy_cpr_dma_map;
>>>
>>> You could backup the previous dma_map() handler in a static variable or,
>>> better, under container->cpr.
>>
>> OK.
>>
>> - Steve
> 
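
The suggestion quoted above, saving the previous dma_map handler under container->cpr rather than hard-coding vfio_legacy_dma_map on restore, could look roughly like the following. All types and the saved_dma_map field are hypothetical stand-ins; the sketch only shows the save, divert, and restore steps.

```c
#include <stddef.h>

typedef int (*sketch_dma_map_fn)(void *container, unsigned long iova,
                                 unsigned long size, void *vaddr);

/* Stand-in for the class that holds the dma_map method. */
struct sketch_ops {
    sketch_dma_map_fn dma_map;
};

struct sketch_container {
    struct sketch_ops *ops;
    struct {
        sketch_dma_map_fn saved_dma_map;   /* hypothetical field */
    } cpr;
};

/* Normal handler; returns 0 so the two handlers can be told apart. */
static int sketch_normal_map(void *container, unsigned long iova,
                             unsigned long size, void *vaddr)
{
    (void)container; (void)iova; (void)size; (void)vaddr;
    return 0;
}

/* CPR handler used while incoming state is being loaded; returns 1. */
static int sketch_cpr_map(void *container, unsigned long iova,
                          unsigned long size, void *vaddr)
{
    (void)container; (void)iova; (void)size; (void)vaddr;
    return 1;
}

/* Remember whatever handler is installed, then divert to the CPR one. */
static void sketch_divert(struct sketch_container *c)
{
    c->cpr.saved_dma_map = c->ops->dma_map;
    c->ops->dma_map = sketch_cpr_map;
}

/* Put back the remembered handler, without naming it explicitly. */
static void sketch_restore(struct sketch_container *c)
{
    c->ops->dma_map = c->cpr.saved_dma_map;
    c->cpr.saved_dma_map = NULL;
}
```

The restore path then works for whichever handler was installed before the diversion, which is the point of keeping the backup next to the other CPR state.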




* Re: [PATCH V3 24/42] migration: close kvm after cpr
  2025-05-19  8:51       ` Cédric Le Goater
@ 2025-05-19 19:07         ` Steven Sistare
  0 siblings, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-19 19:07 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/19/2025 4:51 AM, Cédric Le Goater wrote:
> On 5/16/25 20:18, Steven Sistare wrote:
>> On 5/16/2025 4:35 AM, Cédric Le Goater wrote:
>>> On 5/12/25 17:32, Steve Sistare wrote:
>>>> cpr-transfer breaks vfio network connectivity to and from the guest, and
>>>> the host system log shows:
>>>>    irq bypass consumer (token 00000000a03c32e5) registration fails: -16
>>>> which is EBUSY.  This occurs because KVM descriptors are still open in
>>>> the old QEMU process.  Close them.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>
>>> This patch doesn't build.
>>>
>>> /usr/bin/ld: libcommon.a.p/migration_cpr.c.o: in function `cpr_kvm_close':
>>> ./build/../migration/cpr.c:260: undefined reference to `kvm_close'
>>
>> My build works.
>> For what binary does this ld command fail?
> 
> 
> FAILED: qemu-system-s390x
> FAILED: qemu-system-ppc
> FAILED: qemu-system-ppc64
> FAILED: qemu-system-arm
> FAILED: qemu-system-aarch64

OK, I finally reproduced this using configure --disable-kvm.

I will add the necessary CONFIG_KVM conditionals.
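
As a rough illustration of what such a conditional can look like, the sketch below gives generic code an empty stub when CONFIG_KVM is not defined, so the caller still links on targets built with --disable-kvm. The sketch_* names are invented, and the thread does not say whether the fix will use a stub like this or guard the call site instead.

```c
/* Build with -DCONFIG_KVM to select the "real" path; without it, the stub. */

static int sketch_kvm_close_calls;   /* counts calls, for illustration only */

#ifdef CONFIG_KVM
/* Stand-in for the implementation that lives in the KVM accelerator. */
static void sketch_kvm_close(void)
{
    sketch_kvm_close_calls++;
}
#else
/* No KVM in this build: provide an empty stub so callers still link. */
static inline void sketch_kvm_close(void)
{
}
#endif

/* Generic caller, compiled for every target regardless of KVM support. */
static void sketch_cpr_kvm_close(void)
{
    sketch_kvm_close();
}
```

Either way, the generic migration code keeps a single unconditional call, and the decision about KVM support is made in one place.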

- Steve

>> Could you send the complete ld command with make V=1?
>>
>> - Steve
>>
>>>> ---
>>>>   accel/kvm/kvm-all.c           | 28 ++++++++++++++++++++++++++++
>>>>   hw/vfio/helpers.c             | 10 ++++++++++
>>>>   include/hw/vfio/vfio-device.h |  2 ++
>>>>   include/migration/cpr.h       |  2 ++
>>>>   include/qemu/vfio-helpers.h   |  1 -
>>>>   include/system/kvm.h          |  1 +
>>>>   migration/cpr-transfer.c      | 18 ++++++++++++++++++
>>>>   migration/cpr.c               |  8 ++++++++
>>>>   migration/migration.c         |  1 +
>>>>   9 files changed, 70 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>>>> index 278a506..d619448 100644
>>>> --- a/accel/kvm/kvm-all.c
>>>> +++ b/accel/kvm/kvm-all.c
>>>> @@ -512,16 +512,23 @@ static int do_kvm_destroy_vcpu(CPUState *cpu)
>>>>           goto err;
>>>>       }
>>>> +    /* If I am the CPU that created coalesced_mmio_ring, then discard it */
>>>> +    if (s->coalesced_mmio_ring == (void *)cpu->kvm_run + PAGE_SIZE) {
>>>> +        s->coalesced_mmio_ring = NULL;
>>>> +    }
>>>> +
>>>>       ret = munmap(cpu->kvm_run, mmap_size);
>>>>       if (ret < 0) {
>>>>           goto err;
>>>>       }
>>>> +    cpu->kvm_run = NULL;
>>>>       if (cpu->kvm_dirty_gfns) {
>>>>           ret = munmap(cpu->kvm_dirty_gfns, s->kvm_dirty_ring_bytes);
>>>>           if (ret < 0) {
>>>>               goto err;
>>>>           }
>>>> +        cpu->kvm_dirty_gfns = NULL;
>>>>       }
>>>>       kvm_park_vcpu(cpu);
>>>> @@ -600,6 +607,27 @@ err:
>>>>       return ret;
>>>>   }
>>>> +void kvm_close(void)
>>>> +{
>>>> +    CPUState *cpu;
>>>> +
>>>> +    CPU_FOREACH(cpu) {
>>>> +        cpu_remove_sync(cpu);
>>>> +        close(cpu->kvm_fd);
>>>> +        cpu->kvm_fd = -1;
>>>> +        close(cpu->kvm_vcpu_stats_fd);
>>>> +        cpu->kvm_vcpu_stats_fd = -1;
>>>> +    }
>>>> +
>>>> +    if (kvm_state && kvm_state->fd != -1) {
>>>> +        close(kvm_state->vmfd);
>>>> +        kvm_state->vmfd = -1;
>>>> +        close(kvm_state->fd);
>>>> +        kvm_state->fd = -1;
>>>> +    }
>>>> +    kvm_state = NULL;
>>>> +}
>>>> +
>>>>   /*
>>>>    * dirty pages logging control
>>>>    */
>>>> diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
>>>> index d0dbab1..af1db2f 100644
>>>> --- a/hw/vfio/helpers.c
>>>> +++ b/hw/vfio/helpers.c
>>>> @@ -117,6 +117,16 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
>>>>   int vfio_kvm_device_fd = -1;
>>>>   #endif
>>>> +void vfio_kvm_device_close(void)
>>>> +{
>>>> +#ifdef CONFIG_KVM
>>>> +    if (vfio_kvm_device_fd != -1) {
>>>> +        close(vfio_kvm_device_fd);
>>>> +        vfio_kvm_device_fd = -1;
>>>> +    }
>>>> +#endif
>>>> +}
>>>> +
>>>>   int vfio_kvm_device_add_fd(int fd, Error **errp)
>>>>   {
>>>>   #ifdef CONFIG_KVM
>>>> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
>>>> index 4e4d0b6..6eb6f21 100644
>>>> --- a/include/hw/vfio/vfio-device.h
>>>> +++ b/include/hw/vfio/vfio-device.h
>>>> @@ -231,4 +231,6 @@ void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
>>>>   void vfio_device_init(VFIODevice *vbasedev, int type, VFIODeviceOps *ops,
>>>>                         DeviceState *dev, bool ram_discard);
>>>>   int vfio_device_get_aw_bits(VFIODevice *vdev);
>>>> +
>>>> +void vfio_kvm_device_close(void);
>>>>   #endif /* HW_VFIO_VFIO_COMMON_H */
>>>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>>>> index fc6aa33..5f1ff10 100644
>>>> --- a/include/migration/cpr.h
>>>> +++ b/include/migration/cpr.h
>>>> @@ -31,7 +31,9 @@ void cpr_state_close(void);
>>>>   struct QIOChannel *cpr_state_ioc(void);
>>>>   bool cpr_needed_for_reuse(void *opaque);
>>>> +void cpr_kvm_close(void);
>>>> +void cpr_transfer_init(void);
>>>>   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>>>>   QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>>>> diff --git a/include/qemu/vfio-helpers.h b/include/qemu/vfio-helpers.h
>>>> index bde9495..a029036 100644
>>>> --- a/include/qemu/vfio-helpers.h
>>>> +++ b/include/qemu/vfio-helpers.h
>>>> @@ -28,5 +28,4 @@ void qemu_vfio_pci_unmap_bar(QEMUVFIOState *s, int index, void *bar,
>>>>                                uint64_t offset, uint64_t size);
>>>>   int qemu_vfio_pci_init_irq(QEMUVFIOState *s, EventNotifier *e,
>>>>                              int irq_type, Error **errp);
>>>> -
>>>>   #endif
>>>> diff --git a/include/system/kvm.h b/include/system/kvm.h
>>>> index b690dda..cfaa94c 100644
>>>> --- a/include/system/kvm.h
>>>> +++ b/include/system/kvm.h
>>>> @@ -194,6 +194,7 @@ bool kvm_has_sync_mmu(void);
>>>>   int kvm_has_vcpu_events(void);
>>>>   int kvm_max_nested_state_length(void);
>>>>   int kvm_has_gsi_routing(void);
>>>> +void kvm_close(void);
>>>>   /**
>>>>    * kvm_arm_supports_user_irq
>>>> diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c
>>>> index e1f1403..396558f 100644
>>>> --- a/migration/cpr-transfer.c
>>>> +++ b/migration/cpr-transfer.c
>>>> @@ -17,6 +17,24 @@
>>>>   #include "migration/vmstate.h"
>>>>   #include "trace.h"
>>>> +static int cpr_transfer_notifier(NotifierWithReturn *notifier,
>>>> +                                 MigrationEvent *e,
>>>> +                                 Error **errp)
>>>> +{
>>>> +    if (e->type == MIG_EVENT_PRECOPY_DONE) {
>>>> +        cpr_kvm_close();
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +void cpr_transfer_init(void)
>>>> +{
>>>> +    static NotifierWithReturn notifier;
>>>> +
>>>> +    migration_add_notifier_mode(&notifier, cpr_transfer_notifier,
>>>> +                                MIG_MODE_CPR_TRANSFER);
>>>> +}
>>>> +
>>>>   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp)
>>>>   {
>>>>       MigrationAddress *addr = channel->addr;
>>>> diff --git a/migration/cpr.c b/migration/cpr.c
>>>> index 0b01e25..6102d04 100644
>>>> --- a/migration/cpr.c
>>>> +++ b/migration/cpr.c
>>>> @@ -7,12 +7,14 @@
>>>>   #include "qemu/osdep.h"
>>>>   #include "qapi/error.h"
>>>> +#include "hw/vfio/vfio-device.h"
>>>>   #include "migration/cpr.h"
>>>>   #include "migration/misc.h"
>>>>   #include "migration/options.h"
>>>>   #include "migration/qemu-file.h"
>>>>   #include "migration/savevm.h"
>>>>   #include "migration/vmstate.h"
>>>> +#include "system/kvm.h"
>>>>   #include "system/runstate.h"
>>>>   #include "trace.h"
>>>> @@ -252,3 +254,9 @@ bool cpr_needed_for_reuse(void *opaque)
>>>>       MigMode mode = migrate_mode();
>>>>       return mode == MIG_MODE_CPR_TRANSFER;
>>>>   }
>>>> +
>>>> +void cpr_kvm_close(void)
>>>> +{
>>>> +    kvm_close();
>>>> +    vfio_kvm_device_close();
>>>> +}
>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>> index 4697732..89e2026 100644
>>>> --- a/migration/migration.c
>>>> +++ b/migration/migration.c
>>>> @@ -337,6 +337,7 @@ void migration_object_init(void)
>>>>       ram_mig_init();
>>>>       dirty_bitmap_mig_init();
>>>> +    cpr_transfer_init();
>>>>       /* Initialize cpu throttle timers */
>>>>       cpu_throttle_init();
>>>
>>
> 




* Re: [PATCH V3 25/42] migration: cpr_get_fd_param helper
  2025-05-12 15:32 ` [PATCH V3 25/42] migration: cpr_get_fd_param helper Steve Sistare
@ 2025-05-19 21:22   ` Fabiano Rosas
  0 siblings, 0 replies; 157+ messages in thread
From: Fabiano Rosas @ 2025-05-19 21:22 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Steve Sistare

Steve Sistare <steven.sistare@oracle.com> writes:

> Add the helper function cpr_get_fd_param, to use when preserving
> a file descriptor that is opened externally and passed to QEMU.
> cpr_get_fd_param returns a descriptor number either from a QEMU
> command-line parameter, from a getfd command, or from CPR state.
>
> When a descriptor is passed to new QEMU via SCM_RIGHTS, its number
> changes.  Hence, during CPR, the command-line parameter is ignored
> in new QEMU, and over-ridden by the value found in CPR state.
>
> Similarly, if the descriptor was originally specified by a getfd
> command in old QEMU, the fd number is not known outside of QEMU,
> and it changes when sent to new QEMU via SCM_RIGHTS.  Hence the
> user cannot send getfd to new QEMU, but when the user sends a
> hotplug command that references the fd, cpr_get_fd_param finds
> its value in CPR state.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH V3 21/42] vfio/pci: export MSI functions
  2025-05-16 17:58     ` Steven Sistare
@ 2025-05-20  5:52       ` Cédric Le Goater
  2025-05-20 14:56         ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-20  5:52 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/16/25 19:58, Steven Sistare wrote:
> On 5/16/2025 4:31 AM, Cédric Le Goater wrote:
>> On 5/12/25 17:32, Steve Sistare wrote:
>>> Export various MSI functions, for use by CPR in subsequent patches.
>>> No functional change.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>
>> Please rename these routines with a 'vfio_pci' prefix.
> 
> Are you sure?  That makes sense for:
>    vfio_vector_init -> vfio_pci_vector_init
> 
> but the rest already have msi or intx in the name which unambiguously
> means pci.  Adding pci_ seems unnecessarily verbose:

We are slowly defining an API for an internal VFIO library. I prefer
to ensure the interface is clean by changing the names of external
services to reflect the namespace they belong to.

All routines are implemented in hw/vfio/pci.c and most take a
VFIOPCIDevice as first argument.

> +void vfio_msi_interrupt(void *opaque);
> +void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
> +                           int vector_n, bool msix);
> +int vfio_msix_vector_use(PCIDevice *pdev, unsigned int nr, MSIMessage msg);
> +void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr);

vfio_msi_interrupt(), vfio_msix_vector_use() and
vfio_msix_vector_release() are rather low level routines.
I think we need a wrapper to avoid exposing them.


Thanks,

C.


> +bool vfio_msix_present(void *opaque, int version_id);
> +void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
> +void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
> +bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp);

> 
> - Steve
> 
>>> ---
>>>   hw/vfio/pci.c | 21 ++++++++++-----------
>>>   hw/vfio/pci.h | 12 ++++++++++++
>>>   2 files changed, 22 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index d2b08a3..1bca415 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -279,7 +279,7 @@ static void vfio_irqchip_change(Notifier *notify, void *data)
>>>       vfio_intx_update(vdev, &vdev->intx.route);
>>>   }
>>> -static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>>> +bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>>>   {
>>>       uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
>>>       Error *err = NULL;
>>> @@ -353,7 +353,7 @@ static void vfio_intx_disable(VFIOPCIDevice *vdev)
>>>   /*
>>>    * MSI/X
>>>    */
>>> -static void vfio_msi_interrupt(void *opaque)
>>> +void vfio_msi_interrupt(void *opaque)
>>>   {
>>>       VFIOMSIVector *vector = opaque;
>>>       VFIOPCIDevice *vdev = vector->vdev;
>>> @@ -474,8 +474,8 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
>>>       return ret;
>>>   }
>>> -static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
>>> -                                  int vector_n, bool msix)
>>> +void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
>>> +                           int vector_n, bool msix)
>>>   {
>>>       if ((msix && vdev->no_kvm_msix) || (!msix && vdev->no_kvm_msi)) {
>>>           return;
>>> @@ -529,7 +529,7 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
>>>       kvm_irqchip_commit_routes(kvm_state);
>>>   }
>>> -static void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
>>> +void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
>>>   {
>>>       VFIOMSIVector *vector = &vdev->msi_vectors[nr];
>>>       PCIDevice *pdev = &vdev->pdev;
>>> @@ -641,13 +641,12 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>>>       return 0;
>>>   }
>>> -static int vfio_msix_vector_use(PCIDevice *pdev,
>>> -                                unsigned int nr, MSIMessage msg)
>>> +int vfio_msix_vector_use(PCIDevice *pdev, unsigned int nr, MSIMessage msg)
>>>   {
>>>       return vfio_msix_vector_do_use(pdev, nr, &msg, vfio_msi_interrupt);
>>>   }
>>> -static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
>>> +void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
>>>   {
>>>       VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
>>>       VFIOMSIVector *vector = &vdev->msi_vectors[nr];
>>> @@ -674,14 +673,14 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
>>>       }
>>>   }
>>> -static void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
>>> +void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
>>>   {
>>>       assert(!vdev->defer_kvm_irq_routing);
>>>       vdev->defer_kvm_irq_routing = true;
>>>       vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
>>>   }
>>> -static void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
>>> +void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
>>>   {
>>>       int i;
>>> @@ -2632,7 +2631,7 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
>>>       return OBJECT(vdev);
>>>   }
>>> -static bool vfio_msix_present(void *opaque, int version_id)
>>> +bool vfio_msix_present(void *opaque, int version_id)
>>>   {
>>>       PCIDevice *pdev = opaque;
>>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
>>> index 5ce0fb9..c892054 100644
>>> --- a/hw/vfio/pci.h
>>> +++ b/hw/vfio/pci.h
>>> @@ -210,6 +210,18 @@ static inline bool vfio_is_vga(VFIOPCIDevice *vdev)
>>>       return class == PCI_CLASS_DISPLAY_VGA;
>>>   }
>>> +/* MSI/MSI-X/INTx */
>>> +void vfio_vector_init(VFIOPCIDevice *vdev, int nr);
>>> +void vfio_msi_interrupt(void *opaque);
>>> +void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
>>> +                           int vector_n, bool msix);
>>> +int vfio_msix_vector_use(PCIDevice *pdev, unsigned int nr, MSIMessage msg);
>>> +void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr);
>>> +bool vfio_msix_present(void *opaque, int version_id);
>>> +void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
>>> +void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
>>> +bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp);
>>> +
>>>   uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
>>>   void vfio_pci_write_config(PCIDevice *pdev,
>>>                              uint32_t addr, uint32_t val, int len);
>>
> 




* Re: [PATCH V3 12/42] vfio/container: recover from unmap-all-vaddr failure
  2025-05-12 15:32 ` [PATCH V3 12/42] vfio/container: recover from unmap-all-vaddr failure Steve Sistare
@ 2025-05-20  6:29   ` Cédric Le Goater
  2025-05-20 13:39     ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-20  6:29 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> If there are multiple containers and unmap-all fails for some container, we
> need to remap vaddr for the other containers for which unmap-all succeeded.
> Recover by walking all address ranges of all containers to restore the vaddr
> for each.  Do so by invoking the vfio listener callback, and passing a new
> "remap" flag that tells it to restore a mapping without re-allocating new
> userland data structures.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   hw/vfio/cpr-legacy.c                  | 91 +++++++++++++++++++++++++++++++++++
>   hw/vfio/listener.c                    | 19 +++++++-
>   include/hw/vfio/vfio-container-base.h |  3 ++
>   include/hw/vfio/vfio-cpr.h            | 10 ++++
>   4 files changed, 122 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
> index bbcf71e..f8ddf78 100644
> --- a/hw/vfio/cpr-legacy.c
> +++ b/hw/vfio/cpr-legacy.c
> @@ -31,6 +31,7 @@ static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>           error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
>           return false;
>       }
> +    container->cpr.vaddr_unmapped = true;
>       return true;
>   }
>   
> @@ -63,6 +64,14 @@ static int vfio_legacy_cpr_dma_map(const VFIOContainerBase *bcontainer,
>       return 0;
>   }
>   
> +static void vfio_region_remap(MemoryListener *listener,
> +                              MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            cpr.remap_listener);
> +    vfio_container_region_add(&container->bcontainer, section, true);
> +}
> +
>   static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>   {
>       if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
> @@ -131,6 +140,40 @@ static const VMStateDescription vfio_container_vmstate = {
>       }
>   };
>   
> +static int vfio_cpr_fail_notifier(NotifierWithReturn *notifier,
> +                                  MigrationEvent *e, Error **errp)
> +{
> +    VFIOContainer *container =
> +        container_of(notifier, VFIOContainer, cpr.transfer_notifier);
> +    VFIOContainerBase *bcontainer = &container->bcontainer;
> +
> +    if (e->type != MIG_EVENT_PRECOPY_FAILED) {
> +        return 0;
> +    }
> +
> +    if (container->cpr.vaddr_unmapped) {
> +        /*
> +         * Force a call to vfio_region_remap for each mapped section by
> +         * temporarily registering a listener, and temporarily diverting
> +         * dma_map to vfio_legacy_cpr_dma_map.  The latter restores vaddr.
> +         */
> +
> +        VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
> +        vioc->dma_map = vfio_legacy_cpr_dma_map;
> +
> +        container->cpr.remap_listener = (MemoryListener) {
> +            .name = "vfio cpr recover",
> +            .region_add = vfio_region_remap
> +        };
> +        memory_listener_register(&container->cpr.remap_listener,
> +                                 bcontainer->space->as);
> +        memory_listener_unregister(&container->cpr.remap_listener);
> +        container->cpr.vaddr_unmapped = false;
> +        vioc->dma_map = vfio_legacy_dma_map;
> +    }
> +    return 0;
> +}
> +
>   bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
>   {
>       VFIOContainerBase *bcontainer = &container->bcontainer;
> @@ -152,6 +195,10 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
>           VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
>           vioc->dma_map = vfio_legacy_cpr_dma_map;
>       }
> +
> +    migration_add_notifier_mode(&container->cpr.transfer_notifier,
> +                                vfio_cpr_fail_notifier,
> +                                MIG_MODE_CPR_TRANSFER);
>       return true;
>   }
>   
> @@ -162,6 +209,50 @@ void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
>       migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
>       migrate_del_blocker(&container->cpr.blocker);
>       vmstate_unregister(NULL, &vfio_container_vmstate, container);
> +    migration_remove_notifier(&container->cpr.transfer_notifier);
> +}
> +
> +/*
> + * In old QEMU, VFIO_DMA_UNMAP_FLAG_VADDR may fail on some mapping after
> + * succeeding for others, so the latter have lost their vaddr.  Call this
> + * to restore vaddr for a section with a giommu.
> + *
> + * The giommu already exists.  Find it and replay it, which calls
> + * vfio_legacy_cpr_dma_map further down the stack.
> + */
> +void vfio_cpr_giommu_remap(VFIOContainerBase *bcontainer,
> +                           MemoryRegionSection *section)
> +{
> +    VFIOGuestIOMMU *giommu = NULL;
> +    hwaddr as_offset = section->offset_within_address_space;
> +    hwaddr iommu_offset = as_offset - section->offset_within_region;
> +
> +    QLIST_FOREACH(giommu, &bcontainer->giommu_list, giommu_next) {
> +        if (giommu->iommu_mr == IOMMU_MEMORY_REGION(section->mr) &&
> +            giommu->iommu_offset == iommu_offset) {
> +            break;
> +        }
> +    }
> +    g_assert(giommu);
> +    memory_region_iommu_replay(giommu->iommu_mr, &giommu->n);
> +}
> +
> +/*
> + * In old QEMU, VFIO_DMA_UNMAP_FLAG_VADDR may fail on some mapping after
> + * succeeding for others, so the latter have lost their vaddr.  Call this
> + * to restore vaddr for a section with a RamDiscardManager.
> + *
> + * The ram discard listener already exists.  Call its populate function
> + * directly, which calls vfio_legacy_cpr_dma_map.
> + */
> +bool vfio_cpr_ram_discard_register_listener(VFIOContainerBase *bcontainer,
> +                                            MemoryRegionSection *section)
> +{
> +    VFIORamDiscardListener *vrdl =
> +        vfio_find_ram_discard_listener(bcontainer, section);
> +
> +    g_assert(vrdl);
> +    return vrdl->listener.notify_populate(&vrdl->listener, section) == 0;
>   }
>   
>   static bool same_device(int fd1, int fd2)
> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
> index 5642d04..e86ffcf 100644
> --- a/hw/vfio/listener.c
> +++ b/hw/vfio/listener.c
> @@ -474,6 +474,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
>   {
>       VFIOContainerBase *bcontainer = container_of(listener, VFIOContainerBase,
>                                                    listener);
> +    vfio_container_region_add(bcontainer, section, false);
> +}
> +
> +void vfio_container_region_add(VFIOContainerBase *bcontainer,
> +                               MemoryRegionSection *section,
> +                               bool cpr_remap)
> +{
>       hwaddr iova, end;
>       Int128 llend, llsize;
>       void *vaddr;
> @@ -509,6 +516,11 @@ static void vfio_listener_region_add(MemoryListener *listener,
>           int iommu_idx;
>   
>           trace_vfio_listener_region_add_iommu(section->mr->name, iova, end);
> +
> +        if (cpr_remap) {
> +            vfio_cpr_giommu_remap(bcontainer, section);
> +        }
> +
>           /*
>            * FIXME: For VFIO iommu types which have KVM acceleration to
>            * avoid bouncing all map/unmaps through qemu this way, this
> @@ -551,7 +563,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
>        * about changes.
>        */
>       if (memory_region_has_ram_discard_manager(section->mr)) {
> -        vfio_ram_discard_register_listener(bcontainer, section);
> +        if (!cpr_remap) {
> +            vfio_ram_discard_register_listener(bcontainer, section);
> +        } else if (!vfio_cpr_ram_discard_register_listener(bcontainer,
> +                                                           section)) {
> +            goto fail;
> +        }
>           return;
>       }
>   
> diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
> index a2f6c3a..5776fd7 100644
> --- a/include/hw/vfio/vfio-container-base.h
> +++ b/include/hw/vfio/vfio-container-base.h
> @@ -189,4 +189,7 @@ VFIORamDiscardListener *vfio_find_ram_discard_listener(
>   int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
>                           ram_addr_t size, void *vaddr, bool readonly);
>   
> +void vfio_container_region_add(VFIOContainerBase *bcontainer,
> +                               MemoryRegionSection *section, bool cpr_remap);
> +
>   #endif /* HW_VFIO_VFIO_CONTAINER_BASE_H */
> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
> index 0fc7ab2..d6d22f2 100644
> --- a/include/hw/vfio/vfio-cpr.h
> +++ b/include/hw/vfio/vfio-cpr.h
> @@ -10,10 +10,14 @@
>   #define HW_VFIO_VFIO_CPR_H
>   
>   #include "migration/misc.h"
> +#include "system/memory.h"
>   
>   typedef struct VFIOContainerCPR {
>       Error *blocker;
>       bool reused;
> +    bool vaddr_unmapped;
> +    NotifierWithReturn transfer_notifier;
> +    MemoryListener remap_listener;
>   } VFIOContainerCPR;
>   
>   typedef struct VFIODeviceCPR {
> @@ -39,4 +43,10 @@ void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>   bool vfio_cpr_container_match(struct VFIOContainer *container,
>                                 struct VFIOGroup *group, int *fd);
>   
> +void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
> +                           MemoryRegionSection *section);
> +
> +bool vfio_cpr_ram_discard_register_listener(
> +    struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
> +
>   #endif /* HW_VFIO_VFIO_CPR_H */

Please add to your .gitconfig :

[diff]
     orderFile = /path/to/qemu/scripts/git.orderfile




Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.
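
As a rough illustration of the recovery idea in the quoted commit message:
when unmap-all fails part way through, vaddr is restored only for the
containers whose unmap had already succeeded.  The sketch below flattens the
patch's notifier and listener machinery into one loop; struct container,
remap_all and unmap_all_or_recover are hypothetical names, and the failure is
injected by index rather than coming from an ioctl.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced container state. */
struct container {
    bool vaddr_unmapped;   /* set once unmap-all succeeded for this one */
    int nr_ranges;         /* number of DMA-mapped ranges */
    int remapped;          /* counts ranges whose vaddr was restored */
};

/* Stand-in for walking every range and re-registering its vaddr. */
static void remap_all(struct container *c)
{
    c->remapped += c->nr_ranges;
    c->vaddr_unmapped = false;
}

/*
 * Unmap vaddr for each container in turn.  fail_at injects a failure at
 * that index (pass n for "no failure").  On failure, walk all containers
 * and restore vaddr for those that had already been unmapped.
 */
static bool unmap_all_or_recover(struct container *cs, size_t n,
                                 size_t fail_at)
{
    for (size_t i = 0; i < n; i++) {
        if (i == fail_at) {
            for (size_t j = 0; j < n; j++) {
                if (cs[j].vaddr_unmapped) {
                    remap_all(&cs[j]);
                }
            }
            return false;
        }
        cs[i].vaddr_unmapped = true;
    }
    return true;
}
```

The per-container vaddr_unmapped flag is what keeps the recovery from
touching a container whose mappings were never abandoned.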





* Re: [PATCH V3 15/42] vfio-pci: skip reset during cpr
  2025-05-12 15:32 ` [PATCH V3 15/42] vfio-pci: " Steve Sistare
@ 2025-05-20  6:48   ` Cédric Le Goater
  2025-05-20 13:44     ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-20  6:48 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> Do not reset a vfio-pci device during CPR, and do not complain if the
> kernel's PCI config space changes for non-emulated bits between the
> vmstate save and load, which can happen due to ongoing interrupt activity.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   hw/vfio/cpr.c              | 31 +++++++++++++++++++++++++++++++
>   hw/vfio/pci.c              |  6 ++++++
>   include/hw/vfio/vfio-cpr.h |  2 ++
>   3 files changed, 39 insertions(+)
> 
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> index 0e59612..6ea8e9f 100644
> --- a/hw/vfio/cpr.c
> +++ b/hw/vfio/cpr.c
> @@ -8,6 +8,8 @@
>   #include "qemu/osdep.h"
>   #include "hw/vfio/vfio-device.h"
>   #include "hw/vfio/vfio-cpr.h"
> +#include "hw/vfio/pci.h"
> +#include "migration/cpr.h"
>   #include "qapi/error.h"
>   #include "system/runstate.h"
>   
> @@ -37,3 +39,32 @@ void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
>   {
>       migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
>   }
> +
> +/*
> + * The kernel may change non-emulated config bits.  Exclude them from the
> + * changed-bits check in get_pci_config_device.
> + */
> +static int vfio_cpr_pci_pre_load(void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    PCIDevice *pdev = &vdev->pdev;
> +    int size = MIN(pci_config_size(pdev), vdev->config_size);
> +    int i;
> +
> +    for (i = 0; i < size; i++) {
> +        pdev->cmask[i] &= vdev->emulated_config_bits[i];
> +    }
> +
> +    return 0;
> +}
> +
> +const VMStateDescription vfio_cpr_pci_vmstate = {
> +    .name = "vfio-cpr-pci",
> +    .version_id = 0,
> +    .minimum_version_id = 0,
> +    .pre_load = vfio_cpr_pci_pre_load,
> +    .needed = cpr_needed_for_reuse,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index a1bfdfe..4aa83b1 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3344,6 +3344,11 @@ static void vfio_pci_reset(DeviceState *dev)
>   {
>       VFIOPCIDevice *vdev = VFIO_PCI_BASE(dev);
>   
> +    /* Do not reset the device during qemu_system_reset prior to cpr load */
> +    if (vdev->vbasedev.cpr.reused) {
> +        return;
> +    }
> +

hw/pci/pci.c does :

     if (cpr_is_incoming()) {
         return;
     }

So, to be consistent, I think VFIO should do the same.


Thanks,

C.




>       trace_vfio_pci_reset(vdev->vbasedev.name);
>   
>       vfio_pci_pre_reset(vdev);
> @@ -3513,6 +3518,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, const void *data)
>   #ifdef CONFIG_IOMMUFD
>       object_class_property_add_str(klass, "fd", NULL, vfio_pci_set_fd);
>   #endif
> +    dc->vmsd = &vfio_cpr_pci_vmstate;
>       dc->desc = "VFIO-based PCI device assignment";
>       pdc->realize = vfio_realize;
>   
> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
> index d6d22f2..e93600f 100644
> --- a/include/hw/vfio/vfio-cpr.h
> +++ b/include/hw/vfio/vfio-cpr.h
> @@ -49,4 +49,6 @@ void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
>   bool vfio_cpr_ram_discard_register_listener(
>       struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
>   
> +extern const VMStateDescription vfio_cpr_pci_vmstate;
> +
>   #endif /* HW_VFIO_VFIO_CPR_H */
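
To make the effect of the quoted pre_load handler concrete, here is a small
sketch of the masking step.  It assumes a simplified compatibility check that
only compares the bits selected by cmask; the real check in the PCI core also
considers other masks, which are omitted here.  mask_non_emulated and
config_compatible are hypothetical helper names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Clear from cmask every bit that is not emulated. */
static void mask_non_emulated(uint8_t *cmask, const uint8_t *emulated,
                              size_t size)
{
    for (size_t i = 0; i < size; i++) {
        cmask[i] &= emulated[i];
    }
}

/*
 * Simplified check: saved and current config must agree on every bit
 * that cmask selects.  Bits outside cmask are allowed to differ.
 */
static bool config_compatible(const uint8_t *saved, const uint8_t *cur,
                              const uint8_t *cmask, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        if ((saved[i] ^ cur[i]) & cmask[i]) {
            return false;
        }
    }
    return true;
}
```

If a non-emulated bit changes between save and load, the check fails with the
original cmask and passes once mask_non_emulated has been applied.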




* RE: [PATCH V3 37/42] vfio/iommufd: reconstruct device
  2025-05-19 15:53     ` Steven Sistare
@ 2025-05-20  9:14       ` Duan, Zhenzhong
  0 siblings, 0 replies; 157+ messages in thread
From: Duan, Zhenzhong @ 2025-05-20  9:14 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steven Sistare <steven.sistare@oracle.com>
>Subject: Re: [PATCH V3 37/42] vfio/iommufd: reconstruct device
>
>On 5/16/2025 6:22 AM, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Steve Sistare <steven.sistare@oracle.com>
>>> Subject: [PATCH V3 37/42] vfio/iommufd: reconstruct device
>>>
>>> Reconstruct userland device state after CPR.  During vfio_realize, skip
>>> all ioctls that configure the device, as it was already configured in old
>>> QEMU.
>>>
>>> Save the ioas_id in vmstate, and skip its allocation in vfio_realize.
>>> Because we skip ioctl's, it is not needed at realize time.  However, we do
>>> need the range info, so defer the call to iommufd_cdev_get_info_iova_range
>>> to a post_load handler, at which time the ioas_id is known.
>>>
>>> This reconstruction is not complete.  hwpt_id and devid need special
>>> treatment, handled in subsequent patches.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>> hw/vfio/cpr-iommufd.c |  8 ++++++++
>>> hw/vfio/iommufd.c     | 17 +++++++++++++++++
>>> 2 files changed, 25 insertions(+)
>>>
>>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>>> index b760bd3..3d430f0 100644
>>> --- a/hw/vfio/cpr-iommufd.c
>>> +++ b/hw/vfio/cpr-iommufd.c
>>> @@ -31,6 +31,13 @@ static int vfio_container_post_load(void *opaque, int version_id)
>>>      VFIOIOMMUFDContainer *container = opaque;
>>>      VFIOContainerBase *bcontainer = &container->bcontainer;
>>>      VFIODevice *vbasedev;
>>> +    Error *err = NULL;
>>> +    uint32_t ioas_id = container->ioas_id;
>>> +
>>> +    if (!iommufd_cdev_get_info_iova_range(container, ioas_id, &err)) {
>>> +        error_report_err(err);
>>> +        return -1;
>>> +    }
>>>
>>>      QLIST_FOREACH(vbasedev, &bcontainer->device_list, container_next) {
>>>          vbasedev->cpr.reused = false;
>>> @@ -47,6 +54,7 @@ static const VMStateDescription vfio_container_vmstate = {
>>>      .post_load = vfio_container_post_load,
>>>      .needed = cpr_needed_for_reuse,
>>>      .fields = (VMStateField[]) {
>>> +        VMSTATE_UINT32(ioas_id, VFIOIOMMUFDContainer),
>>>          VMSTATE_END_OF_LIST()
>>>      }
>>> };
>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>> index 046f601..c49a7e7 100644
>>> --- a/hw/vfio/iommufd.c
>>> +++ b/hw/vfio/iommufd.c
>>> @@ -122,6 +122,10 @@ static bool iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
>>>          goto err_kvm_device_add;
>>>      }
>>>
>>> +    if (vbasedev->cpr.reused) {
>>> +        goto skip_bind;
>>> +    }
>>> +
>>>      /* Bind device to iommufd */
>>>      bind.iommufd = iommufd->fd;
>>>      if (ioctl(vbasedev->fd, VFIO_DEVICE_BIND_IOMMUFD, &bind)) {
>>> @@ -133,6 +137,8 @@ static bool iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
>>>      vbasedev->devid = bind.out_devid;
>>>      trace_iommufd_cdev_connect_and_bind(bind.iommufd, vbasedev->name,
>>>                                          vbasedev->fd, vbasedev->devid);
>>> +
>>> +skip_bind:
>>>      return true;
>>> err_bind:
>>>      iommufd_cdev_kvm_device_del(vbasedev);
>>> @@ -580,6 +586,11 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>>>          }
>>>      }
>>>
>>> +    if (vbasedev->cpr.reused) {
>>> +        ioas_id = -1;           /* ioas_id will be received from vmstate */
>>> +        goto skip_ioas_alloc;
>>> +    }
>>> +
>>>      /* Need to allocate a new dedicated container */
>>>      if (!iommufd_backend_alloc_ioas(vbasedev->iommufd, &ioas_id, errp)) {
>>>          goto err_alloc_ioas;
>>> @@ -587,6 +598,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>>>
>>>      trace_iommufd_cdev_alloc_ioas(vbasedev->iommufd->fd, ioas_id);
>>>
>>> +skip_ioas_alloc:
>>>      container = VFIO_IOMMU_IOMMUFD(object_new(TYPE_VFIO_IOMMU_IOMMUFD));
>>>      container->be = vbasedev->iommufd;
>>>      container->ioas_id = ioas_id;
>>> @@ -605,6 +617,10 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>>>          goto err_discard_disable;
>>>      }
>>>
>>> +    if (vbasedev->cpr.reused) {
>>> +        goto skip_info;
>>
>> I suspect this will break virtio-iommu, see virtio_iommu_set_iommu_device().
>> When virtio-iommu tries to get host_iova_ranges, they are not ready until post load.
>
>Thanks, I'll look into it.
>Can you give me a clue or a pointer on command line options to set this up?

    -device virtio-iommu-pci \
    -device vfio-pci,host=0000:01:00.0 \
    -trace virtio_iommu_host_resv_regions

The vfio device needs to have a reserved region.  Diffing the trace output
between old and new QEMU then shows whether the reserved region is lost in
new QEMU.


* RE: [PATCH V3 36/42] vfio/iommufd: preserve descriptors
  2025-05-19 15:53     ` Steven Sistare
@ 2025-05-20  9:15       ` Duan, Zhenzhong
  0 siblings, 0 replies; 157+ messages in thread
From: Duan, Zhenzhong @ 2025-05-20  9:15 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steven Sistare <steven.sistare@oracle.com>
>Subject: Re: [PATCH V3 36/42] vfio/iommufd: preserve descriptors
>
>On 5/16/2025 6:06 AM, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Steve Sistare <steven.sistare@oracle.com>
>>> Subject: [PATCH V3 36/42] vfio/iommufd: preserve descriptors
>>>
>>> Save the iommu and vfio device fd in CPR state when it is created.
>>> After CPR, the fd number is found in CPR state and reused.  Remember
>>> the reused status for subsequent patches.  The reused status is cleared
>>> when vmstate load finishes.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>> backends/iommufd.c       | 19 ++++++++++---------
>>> hw/vfio/cpr-iommufd.c    | 16 ++++++++++++++++
>>> hw/vfio/device.c         | 10 ++--------
>>> hw/vfio/iommufd.c        | 13 +++++++++++--
>>> include/system/iommufd.h |  1 +
>>> 5 files changed, 40 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>> index 6fed1c1..492747c 100644
>>> --- a/backends/iommufd.c
>>> +++ b/backends/iommufd.c
>>> @@ -16,12 +16,18 @@
>>> #include "qemu/module.h"
>>> #include "qom/object_interfaces.h"
>>> #include "qemu/error-report.h"
>>> +#include "migration/cpr.h"
>>> #include "monitor/monitor.h"
>>> #include "trace.h"
>>> #include "hw/vfio/vfio-device.h"
>>> #include <sys/ioctl.h>
>>> #include <linux/iommufd.h>
>>>
>>> +static const char *iommufd_fd_name(IOMMUFDBackend *be)
>>> +{
>>> +    return object_get_canonical_path_component(OBJECT(be));
>>> +}
>>> +
>>> static void iommufd_backend_init(Object *obj)
>>> {
>>>      IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
>>> @@ -47,9 +53,8 @@ static void iommufd_backend_set_fd(Object *obj, const char *str, Error **errp)
>>>      IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
>>>      int fd = -1;
>>>
>>> -    fd = monitor_fd_param(monitor_cur(), str, errp);
>>> +    fd = cpr_get_fd_param(iommufd_fd_name(be), str, 0, &be->cpr_reused, errp);
>>>      if (fd == -1) {
>>> -        error_prepend(errp, "Could not parse remote object fd %s:", str);
>>>          return;
>>>      }
>>>      be->fd = fd;
>>> @@ -95,14 +100,9 @@ bool iommufd_change_process(IOMMUFDBackend *be, Error **errp)
>>>
>>> bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
>>> {
>>> -    int fd;
>>> -
>>>      if (be->owned && !be->users) {
>>> -        fd = qemu_open("/dev/iommu", O_RDWR, errp);
>>> -        if (fd < 0) {
>>> -            return false;
>>> -        }
>>> -        be->fd = fd;
>>> +        be->fd = cpr_open_fd("/dev/iommu", O_RDWR, iommufd_fd_name(be), 0,
>>> +                             &be->cpr_reused, errp);
>>
>> Need to check for an error before assigning to be->fd.
>
>will do.
>
>>>      }
>>>      be->users++;
>>>
>>> @@ -121,6 +121,7 @@ void iommufd_backend_disconnect(IOMMUFDBackend *be)
>>>          be->fd = -1;
>>>      }
>>> out:
>>> +    cpr_delete_fd(iommufd_fd_name(be), 0);
>>>      trace_iommufd_backend_disconnect(be->fd, be->users);
>>> }
>>>
>>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>>> index 46f2006..b760bd3 100644
>>> --- a/hw/vfio/cpr-iommufd.c
>>> +++ b/hw/vfio/cpr-iommufd.c
>>> @@ -8,6 +8,7 @@
>>> #include "qemu/osdep.h"
>>> #include "qapi/error.h"
>>> #include "hw/vfio/vfio-cpr.h"
>>> +#include "hw/vfio/vfio-device.h"
>>> #include "migration/blocker.h"
>>> #include "migration/cpr.h"
>>> #include "migration/migration.h"
>>> @@ -25,10 +26,25 @@ static bool vfio_cpr_supported(VFIOIOMMUFDContainer *container, Error **errp)
>>>      return true;
>>> }
>>>
>>> +static int vfio_container_post_load(void *opaque, int version_id)
>>> +{
>>> +    VFIOIOMMUFDContainer *container = opaque;
>>> +    VFIOContainerBase *bcontainer = &container->bcontainer;
>>> +    VFIODevice *vbasedev;
>>> +
>>> +    QLIST_FOREACH(vbasedev, &bcontainer->device_list, container_next) {
>>> +        vbasedev->cpr.reused = false;
>>> +    }
>>> +    container->be->cpr_reused = false;
>>
>> It's strange to clear the iommufd and vfio device reused flags in the
>> container's post load.  Maybe better to do it in their own post load handlers?
>
>vfio_container_post_load has MIG_PRI_LOW so it is called last, which guarantees
>that be->cpr_reused remains true while all devices are loaded.  This is required
>so that we suppress dma_map calls during device load processing:
>
>   iommufd_backend_map_file_dma()
>     if (be->cpr_reused)
>       return 0;
>
>"vbasedev->cpr.reused = false" could be moved to vfio_device_post_load.
>I put it here to be future proof -- all reused flags are cleared together,
>at the end of post_load, and to be consistent with
>cpr-legacy.c:vfio_container_post_load

OK

>
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> static const VMStateDescription vfio_container_vmstate = {
>>>      .name = "vfio-iommufd-container",
>>>      .version_id = 0,
>>>      .minimum_version_id = 0,
>>> +    .post_load = vfio_container_post_load,
>>>      .needed = cpr_needed_for_reuse,
>>>      .fields = (VMStateField[]) {
>>>          VMSTATE_END_OF_LIST()
>>> diff --git a/hw/vfio/device.c b/hw/vfio/device.c
>>> index 8e9de68..02f384e 100644
>>> --- a/hw/vfio/device.c
>>> +++ b/hw/vfio/device.c
>>> @@ -312,14 +312,8 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
>>>
>>> void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
>>> {
>>> -    ERRP_GUARD();
>>> -    int fd = monitor_fd_param(monitor_cur(), str, errp);
>>> -
>>> -    if (fd < 0) {
>>> -        error_prepend(errp, "Could not parse remote object fd %s:", str);
>>> -        return;
>>> -    }
>>> -    vbasedev->fd = fd;
>>> +    vbasedev->fd = cpr_get_fd_param(vbasedev->dev->id, str, 0,
>>> +                                    &vbasedev->cpr.reused, errp);
>>
>> Same here.
>
>Do you mean, "need to check error"?
>If so, no need.  The new function definition is:
>
>void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
>{
>     vbasedev->fd = cpr_get_fd_param(vbasedev->dev->id, str, 0,
>                                     &vbasedev->cpr.reused, errp);
>}
>
>cpr_get_fd_param() returns -1 on error and sets errp.

OK.

Zhenzhong
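
For reference, the "check the error before assigning" pattern raised in the
review above looks roughly like this.  It is a sketch using plain POSIX
open(), not the cpr_open_fd helper from the series; struct backend and
backend_connect are hypothetical names, and the ownership check from the real
function is left out.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical, reduced backend state. */
struct backend {
    int fd;            /* -1 while not connected */
    unsigned users;    /* number of connected users */
};

/*
 * Open the backing device on first use.  The result is held in a local
 * until it is known to be valid, so a failed open leaves be->fd and
 * be->users exactly as they were.
 */
static bool backend_connect(struct backend *be, const char *path)
{
    if (be->users == 0) {
        int fd = open(path, O_RDWR);

        if (fd < 0) {
            return false;
        }
        be->fd = fd;
    }
    be->users++;
    return true;
}
```

Keeping the descriptor in a local first means a later disconnect path never
sees a half-initialized backend.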



* RE: [PATCH V3 39/42] vfio/iommufd: reconstruct hwpt
  2025-05-19 15:53     ` Steven Sistare
@ 2025-05-20  9:16       ` Duan, Zhenzhong
  2025-05-21 17:40         ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Duan, Zhenzhong @ 2025-05-20  9:16 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steven Sistare <steven.sistare@oracle.com>
>Subject: Re: [PATCH V3 39/42] vfio/iommufd: reconstruct hwpt
>
>On 5/18/2025 11:25 PM, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Steve Sistare <steven.sistare@oracle.com>
>>> Subject: [PATCH V3 39/42] vfio/iommufd: reconstruct hwpt
>>>
>>> Save the hwpt_id in vmstate.  In realize, skip its allocation from
>>> iommufd_cdev_attach -> iommufd_cdev_attach_container ->
>>> iommufd_cdev_autodomains_get.
>>>
>>> Rebuild userland structures to hold hwpt_id by calling
>>> iommufd_cdev_rebuild_hwpt at post load time.  This depends on hw_caps, which
>>> was restored by the post_load call to vfio_device_hiod_create_and_realize.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>> hw/vfio/cpr-iommufd.c      |  7 +++++++
>>> hw/vfio/iommufd.c          | 24 ++++++++++++++++++++++--
>>> hw/vfio/trace-events       |  1 +
>>> hw/vfio/vfio-iommufd.h     |  3 +++
>>> include/hw/vfio/vfio-cpr.h |  1 +
>>> 5 files changed, 34 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>>> index 24cdf10..6d3f4e0 100644
>>> --- a/hw/vfio/cpr-iommufd.c
>>> +++ b/hw/vfio/cpr-iommufd.c
>>> @@ -110,6 +110,12 @@ static int vfio_device_post_load(void *opaque, int
>>> version_id)
>>>          error_report_err(err);
>>>          return false;
>>>      }
>>> +    if (!vbasedev->mdev) {
>>> +        VFIOIOMMUFDContainer *container = container_of(vbasedev->bcontainer,
>>> +                                                       VFIOIOMMUFDContainer,
>>> +                                                       bcontainer);
>>> +        iommufd_cdev_rebuild_hwpt(vbasedev, container);
>>> +    }
>>>      return true;
>>> }
>>>
>>> @@ -121,6 +127,7 @@ static const VMStateDescription vfio_device_vmstate = {
>>>      .needed = cpr_needed_for_reuse,
>>>      .fields = (VMStateField[]) {
>>>          VMSTATE_INT32(devid, VFIODevice),
>>> +        VMSTATE_UINT32(cpr.hwpt_id, VFIODevice),
>>>          VMSTATE_END_OF_LIST()
>>>      }
>>> };
>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>> index d980684..ec79c83 100644
>>> --- a/hw/vfio/iommufd.c
>>> +++ b/hw/vfio/iommufd.c
>>> @@ -318,6 +318,7 @@ static bool
>>> iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>>> static void iommufd_cdev_use_hwpt(VFIODevice *vbasedev, VFIOIOASHwpt
>>> *hwpt)
>>> {
>>>      vbasedev->hwpt = hwpt;
>>> +    vbasedev->cpr.hwpt_id = hwpt->hwpt_id;
>>>      vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
>>>      QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>>> }
>>> @@ -373,6 +374,23 @@ static bool iommufd_cdev_make_hwpt(VFIODevice
>>> *vbasedev,
>>>      return true;
>>> }
>>>
>>> +void iommufd_cdev_rebuild_hwpt(VFIODevice *vbasedev,
>>> +                               VFIOIOMMUFDContainer *container)
>>> +{
>>> +    VFIOIOASHwpt *hwpt;
>>> +    int hwpt_id = vbasedev->cpr.hwpt_id;
>>> +
>>> +    trace_iommufd_cdev_rebuild_hwpt(container->be->fd, hwpt_id);
>>> +
>>> +    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>>> +        if (hwpt->hwpt_id == hwpt_id) {
>>> +            iommufd_cdev_use_hwpt(vbasedev, hwpt);
>>> +            return;
>>> +        }
>>> +    }
>>> +    iommufd_cdev_make_hwpt(vbasedev, container, hwpt_id, false, NULL);
>>> +}
>>> +
>>> static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>>                                           VFIOIOMMUFDContainer *container,
>>>                                           Error **errp)
>>> @@ -567,7 +585,8 @@ static bool iommufd_cdev_attach(const char *name,
>>> VFIODevice *vbasedev,
>>>              vbasedev->iommufd != container->be) {
>>>              continue;
>>>          }
>>> -        if (!iommufd_cdev_attach_container(vbasedev, container, &err)) {
>>> +        if (!vbasedev->cpr.reused &&
>>> +            !iommufd_cdev_attach_container(vbasedev, container, &err)) {
>>>              const char *msg = error_get_pretty(err);
>>>
>>>              trace_iommufd_cdev_fail_attach_existing_container(msg);
>>> @@ -605,7 +624,8 @@ skip_ioas_alloc:
>>>      bcontainer = &container->bcontainer;
>>>      vfio_address_space_insert(space, bcontainer);
>>>
>>> -    if (!iommufd_cdev_attach_container(vbasedev, container, errp)) {
>>> +    if (!vbasedev->cpr.reused &&
>>> +        !iommufd_cdev_attach_container(vbasedev, container, errp)) {
>>
>> All container attaching is bypassed in new QEMU. I am concerned that new
>> QEMU will not generate the same containers as old QEMU if old QEMU has more
>> than one container.  Then devices could be attached to the wrong container,
>> or attaching could fail in post load.
>
>Yes, this relates to our discussion in patch 35.  Please explain, how can a single
>iommufd backend have multiple containers?

As with legacy containers, there can be multiple containers in one address
space.  If an existing mapping in one container conflicts with a new device's
reserved region, attaching to that container fails, and a new container must
be created to accept the new device's reserved region.

Maybe you need to do the same thing as for legacy containers, e.g., save
ioas_id just as you save container->fd, then check for an existing ioas_id
and restore the iommufd container based on that.
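
A rough standalone sketch of that idea follows.  Container, Space and both
helpers are hypothetical, simplified stand-ins that do not exist in the series.
It assumes old QEMU saves each device's ioas_id, and that new QEMU looks up or
recreates a container keyed on that id instead of taking the first container
in the address space.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for an iommufd container. */
typedef struct Container {
    uint32_t ioas_id;
    int ndevices;
    struct Container *next;
} Container;

/* Hypothetical address space holding a singly linked list of containers. */
typedef struct Space {
    Container *containers;
} Space;

/* Return the container whose IOAS matches the saved id, or NULL. */
Container *space_find_container(Space *space, uint32_t ioas_id)
{
    assert(space);
    for (Container *c = space->containers; c; c = c->next) {
        if (c->ioas_id == ioas_id) {
            return c;
        }
    }
    return NULL;
}

/*
 * On restore: reuse the container that matches the saved ioas_id, else
 * recreate one that carries the saved id.  Returns NULL only if the
 * allocation fails.
 */
Container *space_restore_container(Space *space, uint32_t saved_ioas_id)
{
    Container *c = space_find_container(space, saved_ioas_id);

    if (!c) {
        c = calloc(1, sizeof(*c));
        if (!c) {
            return NULL;
        }
        c->ioas_id = saved_ioas_id;
        c->next = space->containers;
        space->containers = c;
    }
    c->ndevices++;
    return c;
}
```

With this shape, two devices saved with the same ioas_id end up in the same
container after restore, and a device saved with a different ioas_id gets its
own, which mirrors the layout old QEMU had.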

Zhenzhong



* Re: [PATCH V3 31/42] vfio/iommufd: use IOMMU_IOAS_MAP_FILE
  2025-05-12 15:32 ` [PATCH V3 31/42] vfio/iommufd: use IOMMU_IOAS_MAP_FILE Steve Sistare
  2025-05-16  8:48   ` Duan, Zhenzhong
@ 2025-05-20 12:27   ` Cédric Le Goater
  2025-05-20 13:58     ` Steven Sistare
  1 sibling, 1 reply; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-20 12:27 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> Use IOMMU_IOAS_MAP_FILE when the mapped region is backed by a file.
> Such a mapping can be preserved without modification during CPR,
> because it depends on the file's address space, which does not change,
> rather than on the process's address space, which does change.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   hw/vfio/container-base.c              |  9 +++++++++
>   hw/vfio/iommufd.c                     | 13 +++++++++++++
>   include/hw/vfio/vfio-container-base.h |  3 +++
>   3 files changed, 25 insertions(+)
> 
> diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
> index 8f43bc8..72a51a6 100644
> --- a/hw/vfio/container-base.c
> +++ b/hw/vfio/container-base.c
> @@ -79,7 +79,16 @@ int vfio_container_dma_map(VFIOContainerBase *bcontainer,
>                              RAMBlock *rb)
>   {
>       VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
> +    int mfd = rb ? qemu_ram_get_fd(rb) : -1;
>   
> +    if (mfd >= 0 && vioc->dma_map_file) {
> +        unsigned long start = vaddr - qemu_ram_get_host_addr(rb);
> +        unsigned long offset = qemu_ram_get_fd_offset(rb);
> +
> +        vioc->dma_map_file(bcontainer, iova, size, mfd, start + offset,
> +                           readonly);
> +        return 0;
> +    }
>       g_assert(vioc->dma_map);
>       return vioc->dma_map(bcontainer, iova, size, vaddr, readonly);
>   }
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 167bda4..6eb417a 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -44,6 +44,18 @@ static int iommufd_cdev_map(const VFIOContainerBase *bcontainer, hwaddr iova,
>                                      iova, size, vaddr, readonly);
>   }
>   
> +static int iommufd_cdev_map_file(const VFIOContainerBase *bcontainer,
> +                                 hwaddr iova, ram_addr_t size,
> +                                 int fd, unsigned long start, bool readonly)
> +{
> +    const VFIOIOMMUFDContainer *container =
> +        container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
> +
> +    return iommufd_backend_map_file_dma(container->be,
> +                                        container->ioas_id,
> +                                        iova, size, fd, start, readonly);
> +}
> +
>   static int iommufd_cdev_unmap(const VFIOContainerBase *bcontainer,
>                                 hwaddr iova, ram_addr_t size,
>                                 IOMMUTLBEntry *iotlb, bool unmap_all)
> @@ -802,6 +814,7 @@ static void vfio_iommu_iommufd_class_init(ObjectClass *klass, const void *data)
>       VFIOIOMMUClass *vioc = VFIO_IOMMU_CLASS(klass);
>   
>       vioc->dma_map = iommufd_cdev_map;
> +    vioc->dma_map_file = iommufd_cdev_map_file;
>       vioc->dma_unmap = iommufd_cdev_unmap;
>       vioc->attach_device = iommufd_cdev_attach;
>       vioc->detach_device = iommufd_cdev_detach;
> diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
> index 03b3f9c..f30f828 100644
> --- a/include/hw/vfio/vfio-container-base.h
> +++ b/include/hw/vfio/vfio-container-base.h
> @@ -123,6 +123,9 @@ struct VFIOIOMMUClass {
>       int (*dma_map)(const VFIOContainerBase *bcontainer,
>                      hwaddr iova, ram_addr_t size,
>                      void *vaddr, bool readonly);
> +    int (*dma_map_file)(const VFIOContainerBase *bcontainer,
> +                        hwaddr iova, ram_addr_t size,
> +                        int fd, unsigned long start, bool readonly);

Please add documentation.
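
As a starting point, here is a standalone sketch of what the callback's
contract appears to be, judging from the caller in vfio_container_dma_map:
@start is a byte offset into the file, computed as the address's offset within
its RAMBlock plus the block's offset within the fd.  The comment wording, the
return-value convention, and the helper name file_map_offset are suggestions
and assumptions, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Suggested documentation wording (hypothetical):
 *
 * @dma_map_file
 *
 * Map a file range for DMA in the container.
 *
 * @bcontainer: #VFIOContainerBase to use for the map
 * @iova: start address to map
 * @size: size of the range to map
 * @fd: descriptor of the file backing the range
 * @start: byte offset of the range within the file
 * @readonly: map read only if true
 *
 * Returns 0 on success, a negative errno otherwise (assumed to match
 * @dma_map).
 */

/*
 * Offset within the backing file of a host virtual address that lies
 * inside a RAMBlock: distance from the block's host base, plus the
 * block's own offset within the fd.
 */
uint64_t file_map_offset(const uint8_t *vaddr, const uint8_t *block_host_base,
                         uint64_t block_fd_offset)
{
    assert(vaddr >= block_host_base);
    return (uint64_t)(vaddr - block_host_base) + block_fd_offset;
}
```

For example, an address 256 bytes into a block that starts at file offset 8192
maps at file offset 8448, regardless of where the block is mapped in the
process.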

Thanks,

C.



>       /**
>        * @dma_unmap
>        *




* Re: [PATCH V3 33/42] vfio/iommufd: define hwpt constructors
  2025-05-16  8:55   ` Duan, Zhenzhong
  2025-05-19 15:55     ` Steven Sistare
@ 2025-05-20 12:34     ` Cédric Le Goater
  2025-05-21  2:48       ` Duan, Zhenzhong
  1 sibling, 1 reply; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-20 12:34 UTC (permalink / raw)
  To: Duan, Zhenzhong, Steve Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Liu, Yi L, Eric Auger, Michael S. Tsirkin,
	Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/16/25 10:55, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Steve Sistare <steven.sistare@oracle.com>
>> Subject: [PATCH V3 33/42] vfio/iommufd: define hwpt constructors
>>
>> Extract hwpt creation code from iommufd_cdev_autodomains_get into the
>> helpers iommufd_cdev_use_hwpt and iommufd_cdev_make_hwpt.  These will
>> be used by CPR in a subsequent patch.
>>
>> Call vfio_device_hiod_create_and_realize earlier so iommufd_cdev_make_hwpt
>> can use vbasedev->hiod hw_caps, avoiding an extra call to
>> iommufd_backend_get_device_info
> 
> We had reached consensus to realize hiod after attachment;
> it's not a hot path, so an extra call is acceptable per Cedric.

We also placed the realize call where it is in preparation for
nested IOMMU support, and to avoid a late_realize handler, AFAICR.


>> No functional change.


We should add a comment above the call to make sure the code is not
moved around.


Thanks,

C.




>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/iommufd.c | 116 ++++++++++++++++++++++++++++++----------------------
>> --
>> 1 file changed, 65 insertions(+), 51 deletions(-)
>>
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index f645a62..8661947 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -310,16 +310,70 @@ static bool
>> iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>>      return true;
>> }
>>
>> +static void iommufd_cdev_use_hwpt(VFIODevice *vbasedev, VFIOIOASHwpt
>> *hwpt)
>> +{
>> +    vbasedev->hwpt = hwpt;
>> +    vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
>> +    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>> +}
>> +
>> +/*
>> + * iommufd_cdev_make_hwpt: If @alloc_id, allocate a hwpt_id, else use
>> @hwpt_id.
>> + * Create and add a hwpt struct to the container's list and to the device.
>> + * Always succeeds if !@alloc_id.
>> + */
>> +static bool iommufd_cdev_make_hwpt(VFIODevice *vbasedev,
>> +                                   VFIOIOMMUFDContainer *container,
>> +                                   uint32_t hwpt_id, bool alloc_id,
>> +                                   Error **errp)
>> +{
>> +    VFIOIOASHwpt *hwpt;
>> +    uint32_t flags = 0;
>> +
>> +    /*
>> +     * This is quite early and VFIO Migration state isn't yet fully
>> +     * initialized, thus rely only on IOMMU hardware capabilities as to
>> +     * whether IOMMU dirty tracking is going to be requested. Later
>> +     * vfio_migration_realize() may decide to use VF dirty tracking
>> +     * instead.
>> +     */
>> +    g_assert(vbasedev->hiod);
>> +    if (vbasedev->hiod->caps.hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
>> +        flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>> +    }
>> +
>> +    if (alloc_id) {
>> +        if (!iommufd_backend_alloc_hwpt(vbasedev->iommufd, vbasedev->devid,
>> +                                        container->ioas_id, flags,
>> +                                        IOMMU_HWPT_DATA_NONE, 0, NULL,
>> +                                        &hwpt_id, errp)) {
>> +            return false;
>> +        }
>> +
>> +        if (iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt_id, errp)) {
>> +            iommufd_backend_free_id(container->be, hwpt_id);
>> +            return false;
>> +        }
>> +    }
>> +
>> +    hwpt = g_malloc0(sizeof(*hwpt));
>> +    hwpt->hwpt_id = hwpt_id;
>> +    hwpt->hwpt_flags = flags;
>> +    QLIST_INIT(&hwpt->device_list);
>> +
>> +    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
>> +    container->bcontainer.dirty_pages_supported |=
>> +                                vbasedev->iommu_dirty_tracking;
>> +    iommufd_cdev_use_hwpt(vbasedev, hwpt);
>> +    return true;
>> +}
>> +
>> static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>                                           VFIOIOMMUFDContainer *container,
>>                                           Error **errp)
>> {
>>      ERRP_GUARD();
>> -    IOMMUFDBackend *iommufd = vbasedev->iommufd;
>> -    uint32_t type, flags = 0;
>> -    uint64_t hw_caps;
>>      VFIOIOASHwpt *hwpt;
>> -    uint32_t hwpt_id;
>>      int ret;
>>
>>      /* Try to find a domain */
>> @@ -340,54 +394,14 @@ static bool
>> iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>
>>              return false;
>>          } else {
>> -            vbasedev->hwpt = hwpt;
>> -            QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>> -            vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
>> +            iommufd_cdev_use_hwpt(vbasedev, hwpt);
>>              return true;
>>          }
>>      }
>> -
>> -    /*
>> -     * This is quite early and VFIO Migration state isn't yet fully
>> -     * initialized, thus rely only on IOMMU hardware capabilities as to
>> -     * whether IOMMU dirty tracking is going to be requested. Later
>> -     * vfio_migration_realize() may decide to use VF dirty tracking
>> -     * instead.
>> -     */
>> -    if (!iommufd_backend_get_device_info(vbasedev->iommufd, vbasedev->devid,
>> -                                         &type, NULL, 0, &hw_caps, errp)) {
>> -        return false;
>> -    }
>> -
>> -    if (hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
>> -        flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>> -    }
>> -
>> -    if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
>> -                                    container->ioas_id, flags,
>> -                                    IOMMU_HWPT_DATA_NONE, 0, NULL,
>> -                                    &hwpt_id, errp)) {
>> -        return false;
>> -    }
>> -
>> -    hwpt = g_malloc0(sizeof(*hwpt));
>> -    hwpt->hwpt_id = hwpt_id;
>> -    hwpt->hwpt_flags = flags;
>> -    QLIST_INIT(&hwpt->device_list);
>> -
>> -    ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
>> -    if (ret) {
>> -        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>> -        g_free(hwpt);
>> +    if (!iommufd_cdev_make_hwpt(vbasedev, container, 0, true, errp)) {
>>          return false;
>>      }
>>
>> -    vbasedev->hwpt = hwpt;
>> -    vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
>> -    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>> -    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
>> -    container->bcontainer.dirty_pages_supported |=
>> -                                vbasedev->iommu_dirty_tracking;
>>      if (container->bcontainer.dirty_pages_supported &&
>>          !vbasedev->iommu_dirty_tracking) {
>>          warn_report("IOMMU instance for device %s doesn't support dirty tracking",
>> @@ -530,6 +544,11 @@ static bool iommufd_cdev_attach(const char *name,
>> VFIODevice *vbasedev,
>>
>>      space = vfio_address_space_get(as);
>>
>> +    if (!vfio_device_hiod_create_and_realize(vbasedev,
>> +            TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO, errp)) {
>> +        goto err_alloc_ioas;
>> +    }
>> +
>>      /* try to attach to an existing container in this space */
>>      QLIST_FOREACH(bcontainer, &space->containers, next) {
>>          container = container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
>> @@ -604,11 +623,6 @@ found_container:
>>          goto err_listener_register;
>>      }
>>
>> -    if (!vfio_device_hiod_create_and_realize(vbasedev,
>> -                     TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO, errp)) {
>> -        goto err_listener_register;
>> -    }
>> -
>>      /*
>>       * TODO: examine RAM_BLOCK_DISCARD stuff, should we do group level
>>       * for discarding incompatibility check as well?
>> --
>> 1.8.3.1
> 




* Re: [PATCH V3 12/42] vfio/container: recover from unmap-all-vaddr failure
  2025-05-20  6:29   ` Cédric Le Goater
@ 2025-05-20 13:39     ` Steven Sistare
  0 siblings, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-20 13:39 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/20/2025 2:29 AM, Cédric Le Goater wrote:
> On 5/12/25 17:32, Steve Sistare wrote:
>> If there are multiple containers and unmap-all fails for some container, we
>> need to remap vaddr for the other containers for which unmap-all succeeded.
>> Recover by walking all address ranges of all containers to restore the vaddr
>> for each.  Do so by invoking the vfio listener callback, and passing a new
>> "remap" flag that tells it to restore a mapping without re-allocating new
>> userland data structures.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   hw/vfio/cpr-legacy.c                  | 91 +++++++++++++++++++++++++++++++++++
>>   hw/vfio/listener.c                    | 19 +++++++-
>>   include/hw/vfio/vfio-container-base.h |  3 ++
>>   include/hw/vfio/vfio-cpr.h            | 10 ++++
>>   4 files changed, 122 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
>> index bbcf71e..f8ddf78 100644
>> --- a/hw/vfio/cpr-legacy.c
>> +++ b/hw/vfio/cpr-legacy.c
>> @@ -31,6 +31,7 @@ static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>>           error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
>>           return false;
>>       }
>> +    container->cpr.vaddr_unmapped = true;
>>       return true;
>>   }
>> @@ -63,6 +64,14 @@ static int vfio_legacy_cpr_dma_map(const VFIOContainerBase *bcontainer,
>>       return 0;
>>   }
>> +static void vfio_region_remap(MemoryListener *listener,
>> +                              MemoryRegionSection *section)
>> +{
>> +    VFIOContainer *container = container_of(listener, VFIOContainer,
>> +                                            cpr.remap_listener);
>> +    vfio_container_region_add(&container->bcontainer, section, true);
>> +}
>> +
>>   static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>>   {
>>       if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
>> @@ -131,6 +140,40 @@ static const VMStateDescription vfio_container_vmstate = {
>>       }
>>   };
>> +static int vfio_cpr_fail_notifier(NotifierWithReturn *notifier,
>> +                                  MigrationEvent *e, Error **errp)
>> +{
>> +    VFIOContainer *container =
>> +        container_of(notifier, VFIOContainer, cpr.transfer_notifier);
>> +    VFIOContainerBase *bcontainer = &container->bcontainer;
>> +
>> +    if (e->type != MIG_EVENT_PRECOPY_FAILED) {
>> +        return 0;
>> +    }
>> +
>> +    if (container->cpr.vaddr_unmapped) {
>> +        /*
>> +         * Force a call to vfio_region_remap for each mapped section by
>> +         * temporarily registering a listener, and temporarily diverting
>> +         * dma_map to vfio_legacy_cpr_dma_map.  The latter restores vaddr.
>> +         */
>> +
>> +        VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
>> +        vioc->dma_map = vfio_legacy_cpr_dma_map;
>> +
>> +        container->cpr.remap_listener = (MemoryListener) {
>> +            .name = "vfio cpr recover",
>> +            .region_add = vfio_region_remap
>> +        };
>> +        memory_listener_register(&container->cpr.remap_listener,
>> +                                 bcontainer->space->as);
>> +        memory_listener_unregister(&container->cpr.remap_listener);
>> +        container->cpr.vaddr_unmapped = false;
>> +        vioc->dma_map = vfio_legacy_dma_map;
>> +    }
>> +    return 0;
>> +}
>> +
>>   bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
>>   {
>>       VFIOContainerBase *bcontainer = &container->bcontainer;
>> @@ -152,6 +195,10 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
>>           VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
>>           vioc->dma_map = vfio_legacy_cpr_dma_map;
>>       }
>> +
>> +    migration_add_notifier_mode(&container->cpr.transfer_notifier,
>> +                                vfio_cpr_fail_notifier,
>> +                                MIG_MODE_CPR_TRANSFER);
>>       return true;
>>   }
>> @@ -162,6 +209,50 @@ void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
>>       migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
>>       migrate_del_blocker(&container->cpr.blocker);
>>       vmstate_unregister(NULL, &vfio_container_vmstate, container);
>> +    migration_remove_notifier(&container->cpr.transfer_notifier);
>> +}
>> +
>> +/*
>> + * In old QEMU, VFIO_DMA_UNMAP_FLAG_VADDR may fail on some mapping after
>> + * succeeding for others, so the latter have lost their vaddr.  Call this
>> + * to restore vaddr for a section with a giommu.
>> + *
>> + * The giommu already exists.  Find it and replay it, which calls
>> + * vfio_legacy_cpr_dma_map further down the stack.
>> + */
>> +void vfio_cpr_giommu_remap(VFIOContainerBase *bcontainer,
>> +                           MemoryRegionSection *section)
>> +{
>> +    VFIOGuestIOMMU *giommu = NULL;
>> +    hwaddr as_offset = section->offset_within_address_space;
>> +    hwaddr iommu_offset = as_offset - section->offset_within_region;
>> +
>> +    QLIST_FOREACH(giommu, &bcontainer->giommu_list, giommu_next) {
>> +        if (giommu->iommu_mr == IOMMU_MEMORY_REGION(section->mr) &&
>> +            giommu->iommu_offset == iommu_offset) {
>> +            break;
>> +        }
>> +    }
>> +    g_assert(giommu);
>> +    memory_region_iommu_replay(giommu->iommu_mr, &giommu->n);
>> +}
>> +
>> +/*
>> + * In old QEMU, VFIO_DMA_UNMAP_FLAG_VADDR may fail on some mapping after
>> + * succeeding for others, so the latter have lost their vaddr.  Call this
>> + * to restore vaddr for a section with a RamDiscardManager.
>> + *
>> + * The ram discard listener already exists.  Call its populate function
>> + * directly, which calls vfio_legacy_cpr_dma_map.
>> + */
>> +bool vfio_cpr_ram_discard_register_listener(VFIOContainerBase *bcontainer,
>> +                                            MemoryRegionSection *section)
>> +{
>> +    VFIORamDiscardListener *vrdl =
>> +        vfio_find_ram_discard_listener(bcontainer, section);
>> +
>> +    g_assert(vrdl);
>> +    return vrdl->listener.notify_populate(&vrdl->listener, section) == 0;
>>   }
>>   static bool same_device(int fd1, int fd2)
>> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
>> index 5642d04..e86ffcf 100644
>> --- a/hw/vfio/listener.c
>> +++ b/hw/vfio/listener.c
>> @@ -474,6 +474,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>   {
>>       VFIOContainerBase *bcontainer = container_of(listener, VFIOContainerBase,
>>                                                    listener);
>> +    vfio_container_region_add(bcontainer, section, false);
>> +}
>> +
>> +void vfio_container_region_add(VFIOContainerBase *bcontainer,
>> +                               MemoryRegionSection *section,
>> +                               bool cpr_remap)
>> +{
>>       hwaddr iova, end;
>>       Int128 llend, llsize;
>>       void *vaddr;
>> @@ -509,6 +516,11 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>           int iommu_idx;
>>           trace_vfio_listener_region_add_iommu(section->mr->name, iova, end);
>> +
>> +        if (cpr_remap) {
>> +            vfio_cpr_giommu_remap(bcontainer, section);
>> +        }
>> +
>>           /*
>>            * FIXME: For VFIO iommu types which have KVM acceleration to
>>            * avoid bouncing all map/unmaps through qemu this way, this
>> @@ -551,7 +563,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>        * about changes.
>>        */
>>       if (memory_region_has_ram_discard_manager(section->mr)) {
>> -        vfio_ram_discard_register_listener(bcontainer, section);
>> +        if (!cpr_remap) {
>> +            vfio_ram_discard_register_listener(bcontainer, section);
>> +        } else if (!vfio_cpr_ram_discard_register_listener(bcontainer,
>> +                                                           section)) {
>> +            goto fail;
>> +        }
>>           return;
>>       }
>> diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
>> index a2f6c3a..5776fd7 100644
>> --- a/include/hw/vfio/vfio-container-base.h
>> +++ b/include/hw/vfio/vfio-container-base.h
>> @@ -189,4 +189,7 @@ VFIORamDiscardListener *vfio_find_ram_discard_listener(
>>   int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
>>                           ram_addr_t size, void *vaddr, bool readonly);
>> +void vfio_container_region_add(VFIOContainerBase *bcontainer,
>> +                               MemoryRegionSection *section, bool cpr_remap);
>> +
>>   #endif /* HW_VFIO_VFIO_CONTAINER_BASE_H */
>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>> index 0fc7ab2..d6d22f2 100644
>> --- a/include/hw/vfio/vfio-cpr.h
>> +++ b/include/hw/vfio/vfio-cpr.h
>> @@ -10,10 +10,14 @@
>>   #define HW_VFIO_VFIO_CPR_H
>>   #include "migration/misc.h"
>> +#include "system/memory.h"
>>   typedef struct VFIOContainerCPR {
>>       Error *blocker;
>>       bool reused;
>> +    bool vaddr_unmapped;
>> +    NotifierWithReturn transfer_notifier;
>> +    MemoryListener remap_listener;
>>   } VFIOContainerCPR;
>>   typedef struct VFIODeviceCPR {
>> @@ -39,4 +43,10 @@ void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>>   bool vfio_cpr_container_match(struct VFIOContainer *container,
>>                                 struct VFIOGroup *group, int *fd);
>> +void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
>> +                           MemoryRegionSection *section);
>> +
>> +bool vfio_cpr_ram_discard_register_listener(
>> +    struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
>> +
>>   #endif /* HW_VFIO_VFIO_CPR_H */
> 
> Please add to your .gitconfig :
> 
> [diff]
>      orderFile = /path/to/qemu/scripts/git.orderfile

Cool, thanks for the tip - steve

> Reviewed-by: Cédric Le Goater <clg@redhat.com>
> 
> Thanks,
> 
> C.



* Re: [PATCH V3 15/42] vfio-pci: skip reset during cpr
  2025-05-20  6:48   ` Cédric Le Goater
@ 2025-05-20 13:44     ` Steven Sistare
  0 siblings, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-20 13:44 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/20/2025 2:48 AM, Cédric Le Goater wrote:
> On 5/12/25 17:32, Steve Sistare wrote:
>> Do not reset a vfio-pci device during CPR, and do not complain if the
>> kernel's PCI config space changes for non-emulated bits between the
>> vmstate save and load, which can happen due to ongoing interrupt activity.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   hw/vfio/cpr.c              | 31 +++++++++++++++++++++++++++++++
>>   hw/vfio/pci.c              |  6 ++++++
>>   include/hw/vfio/vfio-cpr.h |  2 ++
>>   3 files changed, 39 insertions(+)
>>
>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>> index 0e59612..6ea8e9f 100644
>> --- a/hw/vfio/cpr.c
>> +++ b/hw/vfio/cpr.c
>> @@ -8,6 +8,8 @@
>>   #include "qemu/osdep.h"
>>   #include "hw/vfio/vfio-device.h"
>>   #include "hw/vfio/vfio-cpr.h"
>> +#include "hw/vfio/pci.h"
>> +#include "migration/cpr.h"
>>   #include "qapi/error.h"
>>   #include "system/runstate.h"
>> @@ -37,3 +39,32 @@ void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
>>   {
>>       migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
>>   }
>> +
>> +/*
>> + * The kernel may change non-emulated config bits.  Exclude them from the
>> + * changed-bits check in get_pci_config_device.
>> + */
>> +static int vfio_cpr_pci_pre_load(void *opaque)
>> +{
>> +    VFIOPCIDevice *vdev = opaque;
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    int size = MIN(pci_config_size(pdev), vdev->config_size);
>> +    int i;
>> +
>> +    for (i = 0; i < size; i++) {
>> +        pdev->cmask[i] &= vdev->emulated_config_bits[i];
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +const VMStateDescription vfio_cpr_pci_vmstate = {
>> +    .name = "vfio-cpr-pci",
>> +    .version_id = 0,
>> +    .minimum_version_id = 0,
>> +    .pre_load = vfio_cpr_pci_pre_load,
>> +    .needed = cpr_needed_for_reuse,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_END_OF_LIST()
>> +    }
>> +};
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index a1bfdfe..4aa83b1 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -3344,6 +3344,11 @@ static void vfio_pci_reset(DeviceState *dev)
>>   {
>>       VFIOPCIDevice *vdev = VFIO_PCI_BASE(dev);
>> +    /* Do not reset the device during qemu_system_reset prior to cpr load */
>> +    if (vdev->vbasedev.cpr.reused) {
>> +        return;
>> +    }
>> +
> 
> hw/pci/pci.c does :
> 
>      if (cpr_is_incoming()) {
>          return;
>      }
> 
> So, to be consistent, I think VFIO should do the same.
>  

will do - steve
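
Roughly, the reset handler would then test the global CPR state instead of the
per-device flag.  The sketch below is standalone: cpr_is_incoming() is stubbed
with a hypothetical global and setter, and Device/device_reset are simplified
stand-ins for VFIOPCIDevice/vfio_pci_reset.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stub; in QEMU this state is owned by the CPR code. */
static bool cpr_incoming;

void cpr_set_incoming(bool incoming)
{
    cpr_incoming = incoming;
}

bool cpr_is_incoming(void)
{
    return cpr_incoming;
}

/* Simplified stand-in for a vfio-pci device. */
typedef struct Device {
    int reset_count;
} Device;

void device_reset(Device *dev)
{
    assert(dev);

    /* Do not reset the device during qemu_system_reset prior to cpr load */
    if (cpr_is_incoming()) {
        return;
    }
    dev->reset_count++;
}
```

The device is reset normally outside of CPR and left alone while an incoming
CPR load is pending, matching the check quoted from hw/pci/pci.c.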

>>       trace_vfio_pci_reset(vdev->vbasedev.name);
>>       vfio_pci_pre_reset(vdev);
>> @@ -3513,6 +3518,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, const void *data)
>>   #ifdef CONFIG_IOMMUFD
>>       object_class_property_add_str(klass, "fd", NULL, vfio_pci_set_fd);
>>   #endif
>> +    dc->vmsd = &vfio_cpr_pci_vmstate;
>>       dc->desc = "VFIO-based PCI device assignment";
>>       pdc->realize = vfio_realize;
>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>> index d6d22f2..e93600f 100644
>> --- a/include/hw/vfio/vfio-cpr.h
>> +++ b/include/hw/vfio/vfio-cpr.h
>> @@ -49,4 +49,6 @@ void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
>>   bool vfio_cpr_ram_discard_register_listener(
>>       struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
>> +extern const VMStateDescription vfio_cpr_pci_vmstate;
>> +
>>   #endif /* HW_VFIO_VFIO_CPR_H */
> 




* Re: [PATCH V3 34/42] vfio/iommufd: invariant device name
  2025-05-12 15:32 ` [PATCH V3 34/42] vfio/iommufd: invariant device name Steve Sistare
  2025-05-16  9:29   ` Duan, Zhenzhong
@ 2025-05-20 13:55   ` Cédric Le Goater
  2025-05-20 21:00     ` Steven Sistare
  1 sibling, 1 reply; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-20 13:55 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> cpr-transfer will use the device name as a key to find the value
> of the device descriptor in new QEMU.  However, if the descriptor
> number is specified by a command-line fd parameter, then
> vfio_device_get_name creates a name that includes the fd number.
> This causes a chicken-and-egg problem: new QEMU must know the fd
> number to construct a name to find the fd number.
> 
> To fix, create an invariant name based on the id command-line
> parameter.  If id is not defined, add a CPR blocker.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   hw/vfio/cpr.c              | 21 +++++++++++++++++++++
>   hw/vfio/device.c           | 10 ++++------
>   hw/vfio/iommufd.c          |  2 ++
>   include/hw/vfio/vfio-cpr.h |  4 ++++
>   4 files changed, 31 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> index 6081a89..7609c62 100644
> --- a/hw/vfio/cpr.c
> +++ b/hw/vfio/cpr.c
> @@ -11,6 +11,7 @@
>   #include "hw/vfio/pci.h"
>   #include "hw/pci/msix.h"
>   #include "hw/pci/msi.h"
> +#include "migration/blocker.h"
>   #include "migration/cpr.h"
>   #include "qapi/error.h"
>   #include "system/runstate.h"
> @@ -184,3 +185,23 @@ const VMStateDescription vfio_cpr_pci_vmstate = {
>           VMSTATE_END_OF_LIST()
>       }
>   };
> +
> +bool vfio_cpr_set_device_name(VFIODevice *vbasedev, Error **errp)
> +{
> +    if (vbasedev->dev->id) {
> +        vbasedev->name = g_strdup(vbasedev->dev->id);
> +        return true;
> +    } else {
> +        /*
> +         * Assign a name so any function printing it will not break, but the
> +         * fd number changes across processes, so this cannot be used as an
> +         * invariant name for CPR.
> +         */
> +        vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);

The code above should be in vfio_device_get_name(), proposed in its own patch.


> +        error_setg(&vbasedev->cpr.id_blocker,
> +                   "vfio device with fd=%d needs an id property",
> +                   vbasedev->fd);
> +        return migrate_add_blocker_modes(&vbasedev->cpr.id_blocker, errp,
> +                                         MIG_MODE_CPR_TRANSFER, -1) == 0;

The cpr blocker should be proposed in a second patch, maybe with a small
wrapper to set the 'Error *'.


Thanks,

C.
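
For illustration only, the naming rule under review can be reduced to a
small standalone sketch.  This is not QEMU code: struct demo_dev and
demo_pick_name() are hypothetical stand-ins, and the boolean result merely
marks the case where the real patch would register a CPR blocker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the VFIODevice fields used by the patch. */
struct demo_dev {
    const char *id;   /* value of the id command-line parameter, or NULL */
    int fd;           /* descriptor number given by the fd parameter */
    char name[64];    /* chosen device name */
};

/*
 * Choose a name for a device that was passed in by descriptor.
 * Returns true if the name is invariant across processes (taken from id),
 * false if it embeds the fd number, which is the case that would need
 * a CPR blocker.
 */
static bool demo_pick_name(struct demo_dev *dev)
{
    assert(dev != NULL);

    if (dev->id) {
        snprintf(dev->name, sizeof(dev->name), "%s", dev->id);
        return true;
    }

    /* Still assign a printable name, but it cannot serve as a CPR key. */
    snprintf(dev->name, sizeof(dev->name), "VFIO_FD%d", dev->fd);
    return false;
}
```

Keeping the name choice separate from the blocker registration, as
suggested above, would let each part be reviewed on its own.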



> +    }
> +}
> diff --git a/hw/vfio/device.c b/hw/vfio/device.c
> index 9fba2c7..8e9de68 100644
> --- a/hw/vfio/device.c
> +++ b/hw/vfio/device.c
> @@ -28,6 +28,7 @@
>   #include "qapi/error.h"
>   #include "qemu/error-report.h"
>   #include "qemu/units.h"
> +#include "migration/cpr.h"
>   #include "monitor/monitor.h"
>   #include "vfio-helpers.h"
>   
> @@ -284,6 +285,7 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
>   {
>       ERRP_GUARD();
>       struct stat st;
> +    bool ret = true;
>   
>       if (vbasedev->fd < 0) {
>           if (stat(vbasedev->sysfsdev, &st) < 0) {
> @@ -300,16 +302,12 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
>               error_setg(errp, "Use FD passing only with iommufd backend");
>               return false;
>           }
> -        /*
> -         * Give a name with fd so any function printing out vbasedev->name
> -         * will not break.
> -         */
>           if (!vbasedev->name) {
> -            vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
> +            ret = vfio_cpr_set_device_name(vbasedev, errp);
>           }
>       }
>   
> -    return true;
> +    return ret;
>   }
>   
>   void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 8661947..ea99b8d 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -25,6 +25,7 @@
>   #include "system/reset.h"
>   #include "qemu/cutils.h"
>   #include "qemu/chardev_open.h"
> +#include "migration/blocker.h"
>   #include "pci.h"
>   #include "vfio-iommufd.h"
>   #include "vfio-helpers.h"
> @@ -669,6 +670,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
>       iommufd_cdev_container_destroy(container);
>       vfio_address_space_put(space);
>   
> +    migrate_del_blocker(&vbasedev->cpr.id_blocker);
>       iommufd_cdev_unbind_and_disconnect(vbasedev);
>       close(vbasedev->fd);
>   }
> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
> index 765e334..d06d117 100644
> --- a/include/hw/vfio/vfio-cpr.h
> +++ b/include/hw/vfio/vfio-cpr.h
> @@ -23,12 +23,14 @@ typedef struct VFIOContainerCPR {
>   typedef struct VFIODeviceCPR {
>       bool reused;
>       Error *mdev_blocker;
> +    Error *id_blocker;
>   } VFIODeviceCPR;
>   
>   struct VFIOContainer;
>   struct VFIOContainerBase;
>   struct VFIOGroup;
>   struct VFIOPCIDevice;
> +struct VFIODevice;
>   
>   bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>                                           Error **errp);
> @@ -59,4 +61,6 @@ void vfio_cpr_delete_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
>   
>   extern const VMStateDescription vfio_cpr_pci_vmstate;
>   
> +bool vfio_cpr_set_device_name(struct VFIODevice *vbasedev, Error **errp);
> +
>   #endif /* HW_VFIO_VFIO_CPR_H */




* Re: [PATCH V3 31/42] vfio/iommufd: use IOMMU_IOAS_MAP_FILE
  2025-05-20 12:27   ` Cédric Le Goater
@ 2025-05-20 13:58     ` Steven Sistare
  0 siblings, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-20 13:58 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/20/2025 8:27 AM, Cédric Le Goater wrote:
> On 5/12/25 17:32, Steve Sistare wrote:
>> Use IOMMU_IOAS_MAP_FILE when the mapped region is backed by a file.
>> Such a mapping can be preserved without modification during CPR,
>> because it depends on the file's address space, which does not change,
>> rather than on the process's address space, which does change.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   hw/vfio/container-base.c              |  9 +++++++++
>>   hw/vfio/iommufd.c                     | 13 +++++++++++++
>>   include/hw/vfio/vfio-container-base.h |  3 +++
>>   3 files changed, 25 insertions(+)
>>
>> diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
>> index 8f43bc8..72a51a6 100644
>> --- a/hw/vfio/container-base.c
>> +++ b/hw/vfio/container-base.c
>> @@ -79,7 +79,16 @@ int vfio_container_dma_map(VFIOContainerBase *bcontainer,
>>                              RAMBlock *rb)
>>   {
>>       VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
>> +    int mfd = rb ? qemu_ram_get_fd(rb) : -1;
>> +    if (mfd >= 0 && vioc->dma_map_file) {
>> +        unsigned long start = vaddr - qemu_ram_get_host_addr(rb);
>> +        unsigned long offset = qemu_ram_get_fd_offset(rb);
>> +
>> +        vioc->dma_map_file(bcontainer, iova, size, mfd, start + offset,
>> +                           readonly);
>> +        return 0;
>> +    }
>>       g_assert(vioc->dma_map);
>>       return vioc->dma_map(bcontainer, iova, size, vaddr, readonly);
>>   }
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index 167bda4..6eb417a 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -44,6 +44,18 @@ static int iommufd_cdev_map(const VFIOContainerBase *bcontainer, hwaddr iova,
>>                                      iova, size, vaddr, readonly);
>>   }
>> +static int iommufd_cdev_map_file(const VFIOContainerBase *bcontainer,
>> +                                 hwaddr iova, ram_addr_t size,
>> +                                 int fd, unsigned long start, bool readonly)
>> +{
>> +    const VFIOIOMMUFDContainer *container =
>> +        container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
>> +
>> +    return iommufd_backend_map_file_dma(container->be,
>> +                                        container->ioas_id,
>> +                                        iova, size, fd, start, readonly);
>> +}
>> +
>>   static int iommufd_cdev_unmap(const VFIOContainerBase *bcontainer,
>>                                 hwaddr iova, ram_addr_t size,
>>                                 IOMMUTLBEntry *iotlb, bool unmap_all)
>> @@ -802,6 +814,7 @@ static void vfio_iommu_iommufd_class_init(ObjectClass *klass, const void *data)
>>       VFIOIOMMUClass *vioc = VFIO_IOMMU_CLASS(klass);
>>       vioc->dma_map = iommufd_cdev_map;
>> +    vioc->dma_map_file = iommufd_cdev_map_file;
>>       vioc->dma_unmap = iommufd_cdev_unmap;
>>       vioc->attach_device = iommufd_cdev_attach;
>>       vioc->detach_device = iommufd_cdev_detach;
>> diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
>> index 03b3f9c..f30f828 100644
>> --- a/include/hw/vfio/vfio-container-base.h
>> +++ b/include/hw/vfio/vfio-container-base.h
>> @@ -123,6 +123,9 @@ struct VFIOIOMMUClass {
>>       int (*dma_map)(const VFIOContainerBase *bcontainer,
>>                      hwaddr iova, ram_addr_t size,
>>                      void *vaddr, bool readonly);
>> +    int (*dma_map_file)(const VFIOContainerBase *bcontainer,
>> +                        hwaddr iova, ram_addr_t size,
>> +                        int fd, unsigned long start, bool readonly);
> 
> Please add documentation.

OK.  Using @dma_unmap as a template:

     /**
      * @dma_map_file
      *
      * Map a file range for the container.
      *
      * @bcontainer: #VFIOContainerBase to use for map
      * @iova: start address to map
      * @size: size of the range to map
      * @fd: descriptor of the file to map
      * @start: starting file offset of the range to map
      * @readonly: map read only if true
      */

- Steve

>>       /**
>>        * @dma_unmap
>>        *
> 




* Re: [PATCH V3 21/42] vfio/pci: export MSI functions
  2025-05-20  5:52       ` Cédric Le Goater
@ 2025-05-20 14:56         ` Steven Sistare
  2025-05-20 15:10           ` Cédric Le Goater
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-20 14:56 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/20/2025 1:52 AM, Cédric Le Goater wrote:
> On 5/16/25 19:58, Steven Sistare wrote:
>> On 5/16/2025 4:31 AM, Cédric Le Goater wrote:
>>> On 5/12/25 17:32, Steve Sistare wrote:
>>>> Export various MSI functions, for use by CPR in subsequent patches.
>>>> No functional change.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>
>>> Please rename these routines with a 'vfio_pci' prefix.
>>
>> Are you sure?  That makes sense for:
>>    vfio_vector_init -> vfio_pci_vector_init
>>
>> but the rest already have msi or intx in the name which unambiguously
>> means pci.  Adding pci_ seems unnecessarily verbose:
> 
> We are slowly defining an API for an internal VFIO library. I prefer
> to ensure the interface is clean by changing the names of external
> services to reflect the namespace they belong to.
> 
> All routines are implemented in hw/vfio/pci.c and most take a
> VFIOPCIDevice as first argument.

OK.  So this:

void vfio_pci_vector_init(VFIOPCIDevice *vdev, int nr);
void vfio_pci_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
                                int vector_n, bool msix);
void vfio_pci_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
void vfio_pci_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
bool vfio_pci_intx_enable(VFIOPCIDevice *vdev, Error **errp);

> vfio_msi_interrupt(), vfio_msix_vector_use() and
> vfio_msix_vector_release() are rather low level routines.
> I think we need a wrapper to avoid exposing them.

OK.  These will do the trick, defined in pci.c and exported to cpr.c:

void vfio_pci_msix_set_notifiers(VFIOPCIDevice *vdev)
{
     msix_set_vector_notifiers(&vdev->pdev, vfio_msix_vector_use,
                               vfio_msix_vector_release, NULL);
}

void vfio_pci_msi_set_handler(VFIOPCIDevice *vdev, int nr)
{
     VFIOMSIVector *vector = &vdev->msi_vectors[nr];
     int fd = event_notifier_get_fd(&vector->interrupt);

     qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
}

- Steve

>>>> ---
>>>>   hw/vfio/pci.c | 21 ++++++++++-----------
>>>>   hw/vfio/pci.h | 12 ++++++++++++
>>>>   2 files changed, 22 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index d2b08a3..1bca415 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -279,7 +279,7 @@ static void vfio_irqchip_change(Notifier *notify, void *data)
>>>>       vfio_intx_update(vdev, &vdev->intx.route);
>>>>   }
>>>> -static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>>>> +bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>>>>   {
>>>>       uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
>>>>       Error *err = NULL;
>>>> @@ -353,7 +353,7 @@ static void vfio_intx_disable(VFIOPCIDevice *vdev)
>>>>   /*
>>>>    * MSI/X
>>>>    */
>>>> -static void vfio_msi_interrupt(void *opaque)
>>>> +void vfio_msi_interrupt(void *opaque)
>>>>   {
>>>>       VFIOMSIVector *vector = opaque;
>>>>       VFIOPCIDevice *vdev = vector->vdev;
>>>> @@ -474,8 +474,8 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
>>>>       return ret;
>>>>   }
>>>> -static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
>>>> -                                  int vector_n, bool msix)
>>>> +void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
>>>> +                           int vector_n, bool msix)
>>>>   {
>>>>       if ((msix && vdev->no_kvm_msix) || (!msix && vdev->no_kvm_msi)) {
>>>>           return;
>>>> @@ -529,7 +529,7 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
>>>>       kvm_irqchip_commit_routes(kvm_state);
>>>>   }
>>>> -static void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
>>>> +void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
>>>>   {
>>>>       VFIOMSIVector *vector = &vdev->msi_vectors[nr];
>>>>       PCIDevice *pdev = &vdev->pdev;
>>>> @@ -641,13 +641,12 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>>>>       return 0;
>>>>   }
>>>> -static int vfio_msix_vector_use(PCIDevice *pdev,
>>>> -                                unsigned int nr, MSIMessage msg)
>>>> +int vfio_msix_vector_use(PCIDevice *pdev, unsigned int nr, MSIMessage msg)
>>>>   {
>>>>       return vfio_msix_vector_do_use(pdev, nr, &msg, vfio_msi_interrupt);
>>>>   }
>>>> -static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
>>>> +void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
>>>>   {
>>>>       VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
>>>>       VFIOMSIVector *vector = &vdev->msi_vectors[nr];
>>>> @@ -674,14 +673,14 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
>>>>       }
>>>>   }
>>>> -static void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
>>>> +void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
>>>>   {
>>>>       assert(!vdev->defer_kvm_irq_routing);
>>>>       vdev->defer_kvm_irq_routing = true;
>>>>       vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
>>>>   }
>>>> -static void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
>>>> +void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
>>>>   {
>>>>       int i;
>>>> @@ -2632,7 +2631,7 @@ static Object *vfio_pci_get_object(VFIODevice *vbasedev)
>>>>       return OBJECT(vdev);
>>>>   }
>>>> -static bool vfio_msix_present(void *opaque, int version_id)
>>>> +bool vfio_msix_present(void *opaque, int version_id)
>>>>   {
>>>>       PCIDevice *pdev = opaque;
>>>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
>>>> index 5ce0fb9..c892054 100644
>>>> --- a/hw/vfio/pci.h
>>>> +++ b/hw/vfio/pci.h
>>>> @@ -210,6 +210,18 @@ static inline bool vfio_is_vga(VFIOPCIDevice *vdev)
>>>>       return class == PCI_CLASS_DISPLAY_VGA;
>>>>   }
>>>> +/* MSI/MSI-X/INTx */
>>>> +void vfio_vector_init(VFIOPCIDevice *vdev, int nr);
>>>> +void vfio_msi_interrupt(void *opaque);
>>>> +void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
>>>> +                           int vector_n, bool msix);
>>>> +int vfio_msix_vector_use(PCIDevice *pdev, unsigned int nr, MSIMessage msg);
>>>> +void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr);
>>>> +bool vfio_msix_present(void *opaque, int version_id);
>>>> +void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
>>>> +void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
>>>> +bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp);
>>>> +
>>>>   uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
>>>>   void vfio_pci_write_config(PCIDevice *pdev,
>>>>                              uint32_t addr, uint32_t val, int len);
>>>
>>
> 




* Re: [PATCH V3 21/42] vfio/pci: export MSI functions
  2025-05-20 14:56         ` Steven Sistare
@ 2025-05-20 15:10           ` Cédric Le Goater
  0 siblings, 0 replies; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-20 15:10 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/20/25 16:56, Steven Sistare wrote:
> On 5/20/2025 1:52 AM, Cédric Le Goater wrote:
>> On 5/16/25 19:58, Steven Sistare wrote:
>>> On 5/16/2025 4:31 AM, Cédric Le Goater wrote:
>>>> On 5/12/25 17:32, Steve Sistare wrote:
>>>>> Export various MSI functions, for use by CPR in subsequent patches.
>>>>> No functional change.
>>>>>
>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>
>>>> Please rename these routines with a 'vfio_pci' prefix.
>>>
>>> Are you sure?  That makes sense for:
>>>    vfio_vector_init -> vfio_pci_vector_init
>>>
>>> but the rest already have msi or intx in the name which unambiguously
>>> means pci.  Adding pci_ seems unnecessarily verbose:
>>
>> We are slowly defining an API for an internal VFIO library. I prefer
>> to ensure the interface is clean by changing the names of external
>> services to reflect the namespace they belong to.
>>
>> All routines are implemented in hw/vfio/pci.c and most take a
>> VFIOPCIDevice as first argument.
> 
> OK.  So this:
> 
> void vfio_pci_vector_init(VFIOPCIDevice *vdev, int nr);
> void vfio_pci_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
>                                 int vector_n, bool msix);
> void vfio_pci_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
> void vfio_pci_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
> bool vfio_pci_intx_enable(VFIOPCIDevice *vdev, Error **errp);
> 
>> vfio_msi_interrupt(), vfio_msix_vector_use() and
>> vfio_msix_vector_release() are rather low level routines.
>> I think we need a wrapper to avoid exposing them.
> 
> OK.  These will do the trick, defined in pci.c and exported to cpr.c:
> 
> void vfio_pci_msix_set_notifiers(VFIOPCIDevice *vdev)
> {
>      msix_set_vector_notifiers(&vdev->pdev, vfio_msix_vector_use,
>                                vfio_msix_vector_release, NULL);
> }
> 
> void vfio_pci_msi_set_handler(VFIOPCIDevice *vdev, int nr)
> {
>      VFIOMSIVector *vector = &vdev->msi_vectors[nr];
>      int fd = event_notifier_get_fd(&vector->interrupt);
> 
>      qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
> }


LGTM,

Thanks,

C.






* Re: [PATCH V3 28/42] backends/iommufd: iommufd_backend_map_file_dma
  2025-05-19 15:51     ` Steven Sistare
@ 2025-05-20 19:32       ` Steven Sistare
  2025-05-21  2:48         ` Duan, Zhenzhong
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-20 19:32 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/19/2025 11:51 AM, Steven Sistare wrote:
> On 5/16/2025 4:26 AM, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Steve Sistare <steven.sistare@oracle.com>
>>> Subject: [PATCH V3 28/42] backends/iommufd:
>>> iommufd_backend_map_file_dma
>>>
>>> Define iommufd_backend_map_file_dma to implement IOMMU_IOAS_MAP_FILE.
>>> This will be called as a substitute for iommufd_backend_map_dma, so
>>> the error conditions for BARs are copied as-is from that function.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>> backends/iommufd.c       | 36 ++++++++++++++++++++++++++++++++++++
>>> backends/trace-events    |  1 +
>>> include/system/iommufd.h |  3 +++
>>> 3 files changed, 40 insertions(+)
>>>
>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>> index b73f75c..5c1958f 100644
>>> --- a/backends/iommufd.c
>>> +++ b/backends/iommufd.c
>>> @@ -172,6 +172,42 @@ int iommufd_backend_map_dma(IOMMUFDBackend
>>> *be, uint32_t ioas_id, hwaddr iova,
>>>      return ret;
>>> }
>>>
>>> +int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>>> +                                 hwaddr iova, ram_addr_t size,
>>> +                                 int mfd, unsigned long start, bool readonly)
>>> +{
>>> +    int ret, fd = be->fd;
>>> +    struct iommu_ioas_map_file map = {
>>> +        .size = sizeof(map),
>>> +        .flags = IOMMU_IOAS_MAP_READABLE |
>>> +                 IOMMU_IOAS_MAP_FIXED_IOVA,
>>> +        .ioas_id = ioas_id,
>>> +        .fd = mfd,
>>> +        .start = start,
>>> +        .iova = iova,
>>> +        .length = size,
>>> +    };
>>> +
>>> +    if (!readonly) {
>>> +        map.flags |= IOMMU_IOAS_MAP_WRITEABLE;
>>> +    }
>>> +
>>> +    ret = ioctl(fd, IOMMU_IOAS_MAP_FILE, &map);
>>> +    trace_iommufd_backend_map_file_dma(fd, ioas_id, iova, size, mfd, start,
>>> +                                       readonly, ret);
>>> +    if (ret) {
>>> +        ret = -errno;
>>> +
>>> +        /* TODO: Not support mapping hardware PCI BAR region for now. */
>>> +        if (errno == EFAULT) {
>>> +            warn_report("IOMMU_IOAS_MAP_FILE failed: %m, PCI BAR?");
>>> +        } else {
>>> +            error_report("IOMMU_IOAS_MAP_FILE failed: %m");
>>
>> No need to print an error here, as the caller does the same thing.
> 
> OK.  I was copying iommufd_backend_map_dma, but I see it has recently
> dropped the error_report.

If I delete the error_report line, can I add your RB?

- Steve

>>> +        }
>>> +    }
>>> +    return ret;
>>> +}
>>> +
>>> int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>>>                                hwaddr iova, ram_addr_t size)
>>> {
>>> diff --git a/backends/trace-events b/backends/trace-events
>>> index 40811a3..f478e18 100644
>>> --- a/backends/trace-events
>>> +++ b/backends/trace-events
>>> @@ -11,6 +11,7 @@ iommufd_backend_connect(int fd, bool owned, uint32_t
>>> users) "fd=%d owned=%d user
>>> iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
>>> iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
>>> iommufd_backend_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t
>>> size, void *vaddr, bool readonly, int ret) " iommufd=%d ioas=%d
>>> iova=0x%"PRIx64" size=0x%"PRIx64" addr=%p readonly=%d (%d)"
>>> +iommufd_backend_map_file_dma(int iommufd, uint32_t ioas, uint64_t iova,
>>> uint64_t size, int fd, unsigned long start, bool readonly, int ret) " iommufd=%d
>>> ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" fd=%d start=%ld readonly=%d
>>> (%d)"
>>> iommufd_backend_unmap_dma_non_exist(int iommufd, uint32_t ioas, uint64_t
>>> iova, uint64_t size, int ret) " Unmap nonexistent mapping: iommufd=%d ioas=%d
>>> iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
>>> iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova,
>>> uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64"
>>> size=0x%"PRIx64" (%d)"
>>> iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d
>>> ioas=%d"
>>> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>>> index cbab75b..ac700b8 100644
>>> --- a/include/system/iommufd.h
>>> +++ b/include/system/iommufd.h
>>> @@ -43,6 +43,9 @@ void iommufd_backend_disconnect(IOMMUFDBackend
>>> *be);
>>> bool iommufd_backend_alloc_ioas(IOMMUFDBackend *be, uint32_t *ioas_id,
>>>                                  Error **errp);
>>> void iommufd_backend_free_id(IOMMUFDBackend *be, uint32_t id);
>>> +int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>>> +                                 hwaddr iova, ram_addr_t size, int fd,
>>> +                                 unsigned long start, bool readonly);
>>> int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>>> hwaddr iova,
>>>                              ram_addr_t size, void *vaddr, bool readonly);
>>> int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>>> -- 
>>> 1.8.3.1
>>
> 




* Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
  2025-05-19 15:51     ` Steven Sistare
@ 2025-05-20 19:34       ` Steven Sistare
  2025-05-21  3:11         ` Duan, Zhenzhong
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-20 19:34 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/19/2025 11:51 AM, Steven Sistare wrote:
> On 5/16/2025 4:42 AM, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Steve Sistare <steven.sistare@oracle.com>
>>> Subject: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>
>>> Define the change process ioctl
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>> backends/iommufd.c       | 20 ++++++++++++++++++++
>>> backends/trace-events    |  1 +
>>> include/system/iommufd.h |  2 ++
>>> 3 files changed, 23 insertions(+)
>>>
>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>> index 5c1958f..6fed1c1 100644
>>> --- a/backends/iommufd.c
>>> +++ b/backends/iommufd.c
>>> @@ -73,6 +73,26 @@ static void iommufd_backend_class_init(ObjectClass *oc,
>>> const void *data)
>>>      object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
>>> }
>>>
>>> +bool iommufd_change_process_capable(IOMMUFDBackend *be)
>>> +{
>>> +    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>>> +
>>> +    return !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>>> +}
>>> +
>>> +bool iommufd_change_process(IOMMUFDBackend *be, Error **errp)
>>> +{
>>> +    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>>> +    bool ret = !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>>
>> This is the same ioctl as the check above; could it be called more than once for the same process?
> 
> Yes, and it is a no-op if the process has not changed since the last time DMA
> was mapped.

More questions?
RB?

- Steve

>>> +
>>> +    if (!ret) {
>>> +        error_setg_errno(errp, errno, "IOMMU_IOAS_CHANGE_PROCESS fd %d
>>> failed",
>>> +                         be->fd);
>>> +    }
>>> +    trace_iommufd_change_process(be->fd, ret);
>>> +    return ret;
>>> +}
>>> +
>>> bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
>>> {
>>>      int fd;
>>> diff --git a/backends/trace-events b/backends/trace-events
>>> index f478e18..5ccdf90 100644
>>> --- a/backends/trace-events
>>> +++ b/backends/trace-events
>>> @@ -7,6 +7,7 @@ dbus_vmstate_loading(const char *id) "id: %s"
>>> dbus_vmstate_saving(const char *id) "id: %s"
>>>
>>> # iommufd.c
>>> +iommufd_change_process(int fd, bool ret) "fd=%d (%d)"
>>> iommufd_backend_connect(int fd, bool owned, uint32_t users) "fd=%d
>>> owned=%d users=%d"
>>> iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
>>> iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
>>> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>>> index ac700b8..db9ed53 100644
>>> --- a/include/system/iommufd.h
>>> +++ b/include/system/iommufd.h
>>> @@ -64,6 +64,8 @@ bool
>>> iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t hwpt_id,
>>>                                        uint64_t iova, ram_addr_t size,
>>>                                        uint64_t page_size, uint64_t *data,
>>>                                        Error **errp);
>>> +bool iommufd_change_process_capable(IOMMUFDBackend *be);
>>> +bool iommufd_change_process(IOMMUFDBackend *be, Error **errp);
>>>
>>> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE
>>> "-iommufd"
>>> #endif
>>> -- 
>>> 1.8.3.1
>>
> 




* Re: [PATCH V3 31/42] vfio/iommufd: use IOMMU_IOAS_MAP_FILE
  2025-05-19 15:52     ` Steven Sistare
@ 2025-05-20 19:39       ` Steven Sistare
  2025-05-21  3:13         ` Duan, Zhenzhong
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-20 19:39 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/19/2025 11:52 AM, Steven Sistare wrote:
> On 5/16/2025 4:48 AM, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Steve Sistare <steven.sistare@oracle.com>
>>> Subject: [PATCH V3 31/42] vfio/iommufd: use IOMMU_IOAS_MAP_FILE
>>>
>>> Use IOMMU_IOAS_MAP_FILE when the mapped region is backed by a file.
>>> Such a mapping can be preserved without modification during CPR,
>>> because it depends on the file's address space, which does not change,
>>> rather than on the process's address space, which does change.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>> hw/vfio/container-base.c              |  9 +++++++++
>>> hw/vfio/iommufd.c                     | 13 +++++++++++++
>>> include/hw/vfio/vfio-container-base.h |  3 +++
>>> 3 files changed, 25 insertions(+)
>>>
>>> diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
>>> index 8f43bc8..72a51a6 100644
>>> --- a/hw/vfio/container-base.c
>>> +++ b/hw/vfio/container-base.c
>>> @@ -79,7 +79,16 @@ int vfio_container_dma_map(VFIOContainerBase
>>> *bcontainer,
>>>                             RAMBlock *rb)
>>> {
>>>      VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
>>> +    int mfd = rb ? qemu_ram_get_fd(rb) : -1;
>>>
>>> +    if (mfd >= 0 && vioc->dma_map_file) {
>>> +        unsigned long start = vaddr - qemu_ram_get_host_addr(rb);
>>> +        unsigned long offset = qemu_ram_get_fd_offset(rb);
>>> +
>>> +        vioc->dma_map_file(bcontainer, iova, size, mfd, start + offset,
>>> +                           readonly);
>>
>> Shouldn't we return the result to the call site?
> 
> Yes!  Good catch, thanks.

With that simple fix:
   return vioc->dma_map_file(...)
can I add your RB?

- Steve
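
As a rough standalone illustration of the corrected control flow (not the
QEMU code itself), the sketch below computes the file offset of the range
the same way the patch does and hands the callback's result back to the
caller.  Every demo_* name, the struct layout, and the stub backends are
assumptions made only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the RAMBlock properties the patch consults. */
struct demo_ramblock {
    uint8_t *host;       /* start of the block's mapping in this process */
    int fd;              /* backing file descriptor, or -1 if none */
    uint64_t fd_offset;  /* file offset at which the block's mapping begins */
};

typedef int (*demo_map_file_fn)(uint64_t iova, uint64_t size, int fd,
                                uint64_t start, int readonly);
typedef int (*demo_map_fn)(uint64_t iova, uint64_t size, void *vaddr,
                           int readonly);

/* File offset last seen by the stub, so a caller can inspect it. */
static uint64_t demo_last_start;

/* Stub file-mapping backend that always fails, to show propagation. */
static int demo_stub_map_file(uint64_t iova, uint64_t size, int fd,
                              uint64_t start, int readonly)
{
    (void)iova; (void)size; (void)fd; (void)readonly;
    demo_last_start = start;
    return -EINVAL;
}

/* Stub VA-based backend that always succeeds. */
static int demo_stub_map(uint64_t iova, uint64_t size, void *vaddr,
                         int readonly)
{
    (void)iova; (void)size; (void)vaddr; (void)readonly;
    return 0;
}

static int demo_dma_map(const struct demo_ramblock *rb,
                        demo_map_file_fn map_file, demo_map_fn map,
                        uint64_t iova, uint64_t size, void *vaddr,
                        int readonly)
{
    int fd = rb ? rb->fd : -1;

    if (fd >= 0 && map_file) {
        /* Offset of vaddr within the block, then within the file. */
        uint64_t start = (uint64_t)((uint8_t *)vaddr - rb->host);

        /* Return the callback's result instead of a hard-coded 0. */
        return map_file(iova, size, fd, start + rb->fd_offset, readonly);
    }

    assert(map);
    return map(iova, size, vaddr, readonly);
}
```

With the stubs above, a file-backed block reports the stub's -EINVAL to
the caller, while a block without a descriptor falls through to the
VA-based path.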

>>> +        return 0;
>>> +    }
>>>      g_assert(vioc->dma_map);
>>>      return vioc->dma_map(bcontainer, iova, size, vaddr, readonly);
>>> }
>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>> index 167bda4..6eb417a 100644
>>> --- a/hw/vfio/iommufd.c
>>> +++ b/hw/vfio/iommufd.c
>>> @@ -44,6 +44,18 @@ static int iommufd_cdev_map(const VFIOContainerBase
>>> *bcontainer, hwaddr iova,
>>>                                     iova, size, vaddr, readonly);
>>> }
>>>
>>> +static int iommufd_cdev_map_file(const VFIOContainerBase *bcontainer,
>>> +                                 hwaddr iova, ram_addr_t size,
>>> +                                 int fd, unsigned long start, bool readonly)
>>> +{
>>> +    const VFIOIOMMUFDContainer *container =
>>> +        container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
>>> +
>>> +    return iommufd_backend_map_file_dma(container->be,
>>> +                                        container->ioas_id,
>>> +                                        iova, size, fd, start, readonly);
>>> +}
>>> +
>>> static int iommufd_cdev_unmap(const VFIOContainerBase *bcontainer,
>>>                                hwaddr iova, ram_addr_t size,
>>>                                IOMMUTLBEntry *iotlb, bool unmap_all)
>>> @@ -802,6 +814,7 @@ static void vfio_iommu_iommufd_class_init(ObjectClass
>>> *klass, const void *data)
>>>      VFIOIOMMUClass *vioc = VFIO_IOMMU_CLASS(klass);
>>>
>>>      vioc->dma_map = iommufd_cdev_map;
>>> +    vioc->dma_map_file = iommufd_cdev_map_file;
>>>      vioc->dma_unmap = iommufd_cdev_unmap;
>>>      vioc->attach_device = iommufd_cdev_attach;
>>>      vioc->detach_device = iommufd_cdev_detach;
>>> diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
>>> index 03b3f9c..f30f828 100644
>>> --- a/include/hw/vfio/vfio-container-base.h
>>> +++ b/include/hw/vfio/vfio-container-base.h
>>> @@ -123,6 +123,9 @@ struct VFIOIOMMUClass {
>>>      int (*dma_map)(const VFIOContainerBase *bcontainer,
>>>                     hwaddr iova, ram_addr_t size,
>>>                     void *vaddr, bool readonly);
>>> +    int (*dma_map_file)(const VFIOContainerBase *bcontainer,
>>> +                        hwaddr iova, ram_addr_t size,
>>> +                        int fd, unsigned long start, bool readonly);
>>>      /**
>>>       * @dma_unmap
>>>       *
>>> -- 
>>> 1.8.3.1
>>
> 
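
For reference, a minimal sketch of the control flow after the fix agreed above: compute the offset into the backing file, and return the result of the file-based callback instead of discarding it. The Sketch* types and the two example callbacks are hypothetical stand-ins, not QEMU's VFIOIOMMUClass or RAMBlock; only the shape of the logic is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the container class callbacks. */
typedef struct SketchOps {
    int (*dma_map)(uint64_t iova, uint64_t size, void *vaddr, int readonly);
    int (*dma_map_file)(uint64_t iova, uint64_t size, int fd,
                        unsigned long start, int readonly);  /* optional */
} SketchOps;

/* Hypothetical stand-in for the RAM block fields the patch reads. */
typedef struct SketchBlock {
    int fd;                  /* backing file descriptor, or -1 if none */
    unsigned long fd_offset; /* offset of the block within the file */
    char *host;              /* host virtual address of the block start */
} SketchBlock;

static int sketch_dma_map(const SketchOps *ops, const SketchBlock *rb,
                          uint64_t iova, uint64_t size, void *vaddr,
                          int readonly)
{
    int fd = rb ? rb->fd : -1;

    if (fd >= 0 && ops->dma_map_file) {
        /* Translate the host address into an offset within the block. */
        unsigned long start = (unsigned long)((char *)vaddr - rb->host);

        /* Return the callback's result so a failure reaches the caller. */
        return ops->dma_map_file(iova, size, fd, start + rb->fd_offset,
                                 readonly);
    }
    assert(ops->dma_map);
    return ops->dma_map(iova, size, vaddr, readonly);
}

/* Example callbacks, used only to exercise the sketch. */
static unsigned long sketch_last_start;

static int sketch_file_cb(uint64_t iova, uint64_t size, int fd,
                          unsigned long start, int readonly)
{
    (void)iova; (void)fd; (void)readonly;
    sketch_last_start = start;
    return size ? 0 : -22;  /* reject empty ranges as a stand-in failure */
}

static int sketch_vaddr_cb(uint64_t iova, uint64_t size, void *vaddr,
                           int readonly)
{
    (void)iova; (void)size; (void)vaddr; (void)readonly;
    return 1;  /* distinct value so the two paths can be told apart */
}
```

With the original "return 0", a failure from the file-based callback would have been reported to the caller as success.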




* Re: [PATCH V3 34/42] vfio/iommufd: invariant device name
  2025-05-20 13:55   ` Cédric Le Goater
@ 2025-05-20 21:00     ` Steven Sistare
  2025-05-21  8:20       ` Cédric Le Goater
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-20 21:00 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/20/2025 9:55 AM, Cédric Le Goater wrote:
> On 5/12/25 17:32, Steve Sistare wrote:
>> cpr-transfer will use the device name as a key to find the value
>> of the device descriptor in new QEMU.  However, if the descriptor
>> number is specified by a command-line fd parameter, then
>> vfio_device_get_name creates a name that includes the fd number.
>> This causes a chicken-and-egg problem: new QEMU must know the fd
>> number to construct a name to find the fd number.
>>
>> To fix, create an invariant name based on the id command-line
>> parameter.  If id is not defined, add a CPR blocker.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   hw/vfio/cpr.c              | 21 +++++++++++++++++++++
>>   hw/vfio/device.c           | 10 ++++------
>>   hw/vfio/iommufd.c          |  2 ++
>>   include/hw/vfio/vfio-cpr.h |  4 ++++
>>   4 files changed, 31 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>> index 6081a89..7609c62 100644
>> --- a/hw/vfio/cpr.c
>> +++ b/hw/vfio/cpr.c
>> @@ -11,6 +11,7 @@
>>   #include "hw/vfio/pci.h"
>>   #include "hw/pci/msix.h"
>>   #include "hw/pci/msi.h"
>> +#include "migration/blocker.h"
>>   #include "migration/cpr.h"
>>   #include "qapi/error.h"
>>   #include "system/runstate.h"
>> @@ -184,3 +185,23 @@ const VMStateDescription vfio_cpr_pci_vmstate = {
>>           VMSTATE_END_OF_LIST()
>>       }
>>   };
>> +
>> +bool vfio_cpr_set_device_name(VFIODevice *vbasedev, Error **errp)
>> +{
>> +    if (vbasedev->dev->id) {
>> +        vbasedev->name = g_strdup(vbasedev->dev->id);
>> +        return true;
>> +    } else {
>> +        /*
>> +         * Assign a name so any function printing it will not break, but the
>> +         * fd number changes across processes, so this cannot be used as an
>> +         * invariant name for CPR.
>> +         */
>> +        vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
> 
> The code above should be in vfio_device_get_name() proposed in its own path.

I understand, "in its own patch".  Will do.

>> +        error_setg(&vbasedev->cpr.id_blocker,
>> +                   "vfio device with fd=%d needs an id property",
>> +                   vbasedev->fd);
>> +        return migrate_add_blocker_modes(&vbasedev->cpr.id_blocker, errp,
>> +                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
> 
> The cpr blocker should be proposed in a second patch, maybe with a small
> wrapper to set the 'Error *'.

will do.

- Steve

>> +    }
>> +}
>> diff --git a/hw/vfio/device.c b/hw/vfio/device.c
>> index 9fba2c7..8e9de68 100644
>> --- a/hw/vfio/device.c
>> +++ b/hw/vfio/device.c
>> @@ -28,6 +28,7 @@
>>   #include "qapi/error.h"
>>   #include "qemu/error-report.h"
>>   #include "qemu/units.h"
>> +#include "migration/cpr.h"
>>   #include "monitor/monitor.h"
>>   #include "vfio-helpers.h"
>> @@ -284,6 +285,7 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
>>   {
>>       ERRP_GUARD();
>>       struct stat st;
>> +    bool ret = true;
>>       if (vbasedev->fd < 0) {
>>           if (stat(vbasedev->sysfsdev, &st) < 0) {
>> @@ -300,16 +302,12 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
>>               error_setg(errp, "Use FD passing only with iommufd backend");
>>               return false;
>>           }
>> -        /*
>> -         * Give a name with fd so any function printing out vbasedev->name
>> -         * will not break.
>> -         */
>>           if (!vbasedev->name) {
>> -            vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
>> +            ret = vfio_cpr_set_device_name(vbasedev, errp);
>>           }
>>       }
>> -    return true;
>> +    return ret;
>>   }
>>   void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index 8661947..ea99b8d 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -25,6 +25,7 @@
>>   #include "system/reset.h"
>>   #include "qemu/cutils.h"
>>   #include "qemu/chardev_open.h"
>> +#include "migration/blocker.h"
>>   #include "pci.h"
>>   #include "vfio-iommufd.h"
>>   #include "vfio-helpers.h"
>> @@ -669,6 +670,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
>>       iommufd_cdev_container_destroy(container);
>>       vfio_address_space_put(space);
>> +    migrate_del_blocker(&vbasedev->cpr.id_blocker);
>>       iommufd_cdev_unbind_and_disconnect(vbasedev);
>>       close(vbasedev->fd);
>>   }
>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>> index 765e334..d06d117 100644
>> --- a/include/hw/vfio/vfio-cpr.h
>> +++ b/include/hw/vfio/vfio-cpr.h
>> @@ -23,12 +23,14 @@ typedef struct VFIOContainerCPR {
>>   typedef struct VFIODeviceCPR {
>>       bool reused;
>>       Error *mdev_blocker;
>> +    Error *id_blocker;
>>   } VFIODeviceCPR;
>>   struct VFIOContainer;
>>   struct VFIOContainerBase;
>>   struct VFIOGroup;
>>   struct VFIOPCIDevice;
>> +struct VFIODevice;
>>   bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>>                                           Error **errp);
>> @@ -59,4 +61,6 @@ void vfio_cpr_delete_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
>>   extern const VMStateDescription vfio_cpr_pci_vmstate;
>> +bool vfio_cpr_set_device_name(struct VFIODevice *vbasedev, Error **errp);
>> +
>>   #endif /* HW_VFIO_VFIO_CPR_H */
> 
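
To illustrate the naming problem described in the commit message, here is a minimal sketch, assuming a simplified device record. SketchDevice and sketch_set_device_name are hypothetical and are not QEMU's VFIODevice or vfio_cpr_set_device_name; the cpr_blocked flag only stands in for adding a CPR blocker.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a device; not QEMU's VFIODevice. */
typedef struct SketchDevice {
    const char *id;   /* value of the id command-line parameter, or NULL */
    int fd;           /* descriptor number, which differs per process */
    char name[64];
    bool cpr_blocked; /* set when no invariant name is available */
} SketchDevice;

static void sketch_set_device_name(SketchDevice *dev)
{
    if (dev->id) {
        /* The id is chosen by the user and is the same in old and new QEMU. */
        snprintf(dev->name, sizeof(dev->name), "%s", dev->id);
        dev->cpr_blocked = false;
    } else {
        /* Still printable, but unusable as a lookup key across processes. */
        snprintf(dev->name, sizeof(dev->name), "VFIO_FD%d", dev->fd);
        dev->cpr_blocked = true;
    }
}
```

Two devices with the same id but different fd numbers end up with the same name, so new QEMU can use the name as a key without first knowing the fd.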




* RE: [PATCH V3 33/42] vfio/iommufd: define hwpt constructors
  2025-05-20 12:34     ` Cédric Le Goater
@ 2025-05-21  2:48       ` Duan, Zhenzhong
  2025-05-21  8:19         ` Cédric Le Goater
  0 siblings, 1 reply; 157+ messages in thread
From: Duan, Zhenzhong @ 2025-05-21  2:48 UTC (permalink / raw)
  To: Cédric Le Goater, Steve Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Liu, Yi L, Eric Auger, Michael S. Tsirkin,
	Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH V3 33/42] vfio/iommufd: define hwpt constructors
>
>On 5/16/25 10:55, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Steve Sistare <steven.sistare@oracle.com>
>>> Subject: [PATCH V3 33/42] vfio/iommufd: define hwpt constructors
>>>
>>> Extract hwpt creation code from iommufd_cdev_autodomains_get into the
>>> helpers iommufd_cdev_use_hwpt and iommufd_cdev_make_hwpt.  These will
>>> be used by CPR in a subsequent patch.
>>>
>>> Call vfio_device_hiod_create_and_realize earlier so iommufd_cdev_make_hwpt
>>> can use vbasedev->hiod hw_caps, avoiding an extra call to
>>> iommufd_backend_get_device_info
>>
>> We had reached consensus to realize hiod after attachment,
>> it's not a hot path so an extra call is acceptable per Cedric.
>
>We also placed the realize call where it is in preparation for
>nested IOMMU support, and avoid a late_realize handler AFAICR
>
>
>>> No functional change.
>
>
>We should add a comment before to make sure the code is not moved
>around.

Yes, I should have done that last time. Do you want me to send a patch to fix it?

Thanks
Zhenzhong



* RE: [PATCH V3 28/42] backends/iommufd: iommufd_backend_map_file_dma
  2025-05-20 19:32       ` Steven Sistare
@ 2025-05-21  2:48         ` Duan, Zhenzhong
  0 siblings, 0 replies; 157+ messages in thread
From: Duan, Zhenzhong @ 2025-05-21  2:48 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steven Sistare <steven.sistare@oracle.com>
>Subject: Re: [PATCH V3 28/42] backends/iommufd:
>iommufd_backend_map_file_dma
>
>On 5/19/2025 11:51 AM, Steven Sistare wrote:
>> On 5/16/2025 4:26 AM, Duan, Zhenzhong wrote:
>>>> -----Original Message-----
>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>> Subject: [PATCH V3 28/42] backends/iommufd:
>>>> iommufd_backend_map_file_dma
>>>>
>>>> Define iommufd_backend_map_file_dma to implement IOMMU_IOAS_MAP_FILE.
>>>> This will be called as a substitute for iommufd_backend_map_dma, so
>>>> the error conditions for BARs are copied as-is from that function.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>> backends/iommufd.c       | 36 ++++++++++++++++++++++++++++++++++++
>>>> backends/trace-events    |  1 +
>>>> include/system/iommufd.h |  3 +++
>>>> 3 files changed, 40 insertions(+)
>>>>
>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>> index b73f75c..5c1958f 100644
>>>> --- a/backends/iommufd.c
>>>> +++ b/backends/iommufd.c
>>>> @@ -172,6 +172,42 @@ int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas_id, hwaddr iova,
>>>>      return ret;
>>>> }
>>>>
>>>> +int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>>>> +                                 hwaddr iova, ram_addr_t size,
>>>> +                                 int mfd, unsigned long start, bool readonly)
>>>> +{
>>>> +    int ret, fd = be->fd;
>>>> +    struct iommu_ioas_map_file map = {
>>>> +        .size = sizeof(map),
>>>> +        .flags = IOMMU_IOAS_MAP_READABLE |
>>>> +                 IOMMU_IOAS_MAP_FIXED_IOVA,
>>>> +        .ioas_id = ioas_id,
>>>> +        .fd = mfd,
>>>> +        .start = start,
>>>> +        .iova = iova,
>>>> +        .length = size,
>>>> +    };
>>>> +
>>>> +    if (!readonly) {
>>>> +        map.flags |= IOMMU_IOAS_MAP_WRITEABLE;
>>>> +    }
>>>> +
>>>> +    ret = ioctl(fd, IOMMU_IOAS_MAP_FILE, &map);
>>>> +    trace_iommufd_backend_map_file_dma(fd, ioas_id, iova, size, mfd, start,
>>>> +                                       readonly, ret);
>>>> +    if (ret) {
>>>> +        ret = -errno;
>>>> +
>>>> +        /* TODO: Not support mapping hardware PCI BAR region for now. */
>>>> +        if (errno == EFAULT) {
>>>> +            warn_report("IOMMU_IOAS_MAP_FILE failed: %m, PCI BAR?");
>>>> +        } else {
>>>> +            error_report("IOMMU_IOAS_MAP_FILE failed: %m");
>>>
>>> No need to print error here as caller does the same thing.
>>
>> OK.  I was copying iommufd_backend_map_dma, but I see it has recently
>> dropped the error_report.
>
>If I delete the error_report line, can I add your RB?

Sure.

Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>

Thanks
Zhenzhong
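
A minimal sketch of the two small pieces of logic in the function reviewed above: composing the map flags from the readonly argument, and converting an ioctl-style failure into a negative errno for the caller to report. The SKETCH_* flag values are placeholders for illustration; real code takes the IOMMU_IOAS_MAP_* values from the kernel's iommufd header. Both helper names are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder flag values; not taken from the kernel header. */
enum {
    SKETCH_MAP_FIXED_IOVA = 1 << 0,
    SKETCH_MAP_WRITEABLE  = 1 << 1,
    SKETCH_MAP_READABLE   = 1 << 2,
};

/* Mappings are always readable and at a fixed IOVA; writable unless readonly. */
static uint32_t sketch_map_flags(bool readonly)
{
    uint32_t flags = SKETCH_MAP_READABLE | SKETCH_MAP_FIXED_IOVA;

    if (!readonly) {
        flags |= SKETCH_MAP_WRITEABLE;
    }
    return flags;
}

/* Convert an ioctl-style result (nonzero with errno set) to a negative errno. */
static int sketch_ioctl_result(int ioctl_ret)
{
    return ioctl_ret ? -errno : 0;
}
```

Returning the negative errno without printing matches the review comment: the caller already reports the failure.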



* RE: [PATCH V3 29/42] backends/iommufd: change process ioctl
  2025-05-20 19:34       ` Steven Sistare
@ 2025-05-21  3:11         ` Duan, Zhenzhong
  2025-05-21 13:01           ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Duan, Zhenzhong @ 2025-05-21  3:11 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steven Sistare <steven.sistare@oracle.com>
>Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>
>On 5/19/2025 11:51 AM, Steven Sistare wrote:
>> On 5/16/2025 4:42 AM, Duan, Zhenzhong wrote:
>>>> -----Original Message-----
>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>> Subject: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>
>>>> Define the change process ioctl
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>> backends/iommufd.c       | 20 ++++++++++++++++++++
>>>> backends/trace-events    |  1 +
>>>> include/system/iommufd.h |  2 ++
>>>> 3 files changed, 23 insertions(+)
>>>>
>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>> index 5c1958f..6fed1c1 100644
>>>> --- a/backends/iommufd.c
>>>> +++ b/backends/iommufd.c
>>>> @@ -73,6 +73,26 @@ static void iommufd_backend_class_init(ObjectClass *oc, const void *data)
>>>>      object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
>>>> }
>>>>
>>>> +bool iommufd_change_process_capable(IOMMUFDBackend *be)
>>>> +{
>>>> +    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>>>> +
>>>> +    return !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>>>> +}
>>>> +
>>>> +bool iommufd_change_process(IOMMUFDBackend *be, Error **errp)
>>>> +{
>>>> +    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>>>> +    bool ret = !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>>>
>>> This is the same ioctl as the check above.  Could it be called more than once for
>>> the same process?
>>
>> Yes, and it is a no-op if the process has not changed since the last time DMA
>> was mapped.
>
>More questions?

This looks a bit redundant to me. Also, if iommufd_change_process_capable() is called in the target QEMU, could it perform both the check and the change?

I would suggest defining only iommufd_change_process(), with a comment that it is a no-op if the process has not changed.

Thanks
Zhenzhong



* RE: [PATCH V3 31/42] vfio/iommufd: use IOMMU_IOAS_MAP_FILE
  2025-05-20 19:39       ` Steven Sistare
@ 2025-05-21  3:13         ` Duan, Zhenzhong
  0 siblings, 0 replies; 157+ messages in thread
From: Duan, Zhenzhong @ 2025-05-21  3:13 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steven Sistare <steven.sistare@oracle.com>
>Subject: Re: [PATCH V3 31/42] vfio/iommufd: use IOMMU_IOAS_MAP_FILE
>
>On 5/19/2025 11:52 AM, Steven Sistare wrote:
>> On 5/16/2025 4:48 AM, Duan, Zhenzhong wrote:
>>>> -----Original Message-----
>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>> Subject: [PATCH V3 31/42] vfio/iommufd: use IOMMU_IOAS_MAP_FILE
>>>>
>>>> Use IOMMU_IOAS_MAP_FILE when the mapped region is backed by a file.
>>>> Such a mapping can be preserved without modification during CPR,
>>>> because it depends on the file's address space, which does not change,
>>>> rather than on the process's address space, which does change.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>> hw/vfio/container-base.c              |  9 +++++++++
>>>> hw/vfio/iommufd.c                     | 13 +++++++++++++
>>>> include/hw/vfio/vfio-container-base.h |  3 +++
>>>> 3 files changed, 25 insertions(+)
>>>>
>>>> diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
>>>> index 8f43bc8..72a51a6 100644
>>>> --- a/hw/vfio/container-base.c
>>>> +++ b/hw/vfio/container-base.c
>>>> @@ -79,7 +79,16 @@ int vfio_container_dma_map(VFIOContainerBase
>>>> *bcontainer,
>>>>                             RAMBlock *rb)
>>>> {
>>>>      VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
>>>> +    int mfd = rb ? qemu_ram_get_fd(rb) : -1;
>>>>
>>>> +    if (mfd >= 0 && vioc->dma_map_file) {
>>>> +        unsigned long start = vaddr - qemu_ram_get_host_addr(rb);
>>>> +        unsigned long offset = qemu_ram_get_fd_offset(rb);
>>>> +
>>>> +        vioc->dma_map_file(bcontainer, iova, size, mfd, start + offset,
>>>> +                           readonly);
>>>
>>> Shouldn't we return result to call site?
>>
>> Yes!  Good catch, thanks.
>
>With that simple fix:
>   return vioc->dma_map_file(...)
>can I add your RB?
Yes,

Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>

Thanks
Zhenzhong



* Re: [PATCH V3 33/42] vfio/iommufd: define hwpt constructors
  2025-05-21  2:48       ` Duan, Zhenzhong
@ 2025-05-21  8:19         ` Cédric Le Goater
  0 siblings, 0 replies; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-21  8:19 UTC (permalink / raw)
  To: Duan, Zhenzhong, Steve Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Liu, Yi L, Eric Auger, Michael S. Tsirkin,
	Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/21/25 04:48, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Cédric Le Goater <clg@redhat.com>
>> Subject: Re: [PATCH V3 33/42] vfio/iommufd: define hwpt constructors
>>
>> On 5/16/25 10:55, Duan, Zhenzhong wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>> Subject: [PATCH V3 33/42] vfio/iommufd: define hwpt constructors
>>>>
>>>> Extract hwpt creation code from iommufd_cdev_autodomains_get into the
>>>> helpers iommufd_cdev_use_hwpt and iommufd_cdev_make_hwpt.  These will
>>>> be used by CPR in a subsequent patch.
>>>>
>>>> Call vfio_device_hiod_create_and_realize earlier so iommufd_cdev_make_hwpt
>>>> can use vbasedev->hiod hw_caps, avoiding an extra call to
>>>> iommufd_backend_get_device_info
>>>
>>> We had reached consensus to realize hiod after attachment,
>>> it's not a hot path so an extra call is acceptable per Cedric.
>>
>> We also placed the realize call where it is in preparation for
>> nested IOMMU support, and avoid a late_realize handler AFAICR
>>
>>
>>>> No functional change.
>>
>>
>> We should add a comment before to make sure the code is not moved
>> around.
> 
> Yes, I should have done that last time. Do you want me to send a patch to fix it?

Sure. I will handle the conflicts if needed.

Thanks Zhenzhong,

C.




* Re: [PATCH V3 34/42] vfio/iommufd: invariant device name
  2025-05-20 21:00     ` Steven Sistare
@ 2025-05-21  8:20       ` Cédric Le Goater
  0 siblings, 0 replies; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-21  8:20 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/20/25 23:00, Steven Sistare wrote:
> On 5/20/2025 9:55 AM, Cédric Le Goater wrote:
>> On 5/12/25 17:32, Steve Sistare wrote:
>>> cpr-transfer will use the device name as a key to find the value
>>> of the device descriptor in new QEMU.  However, if the descriptor
>>> number is specified by a command-line fd parameter, then
>>> vfio_device_get_name creates a name that includes the fd number.
>>> This causes a chicken-and-egg problem: new QEMU must know the fd
>>> number to construct a name to find the fd number.
>>>
>>> To fix, create an invariant name based on the id command-line
>>> parameter.  If id is not defined, add a CPR blocker.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>>   hw/vfio/cpr.c              | 21 +++++++++++++++++++++
>>>   hw/vfio/device.c           | 10 ++++------
>>>   hw/vfio/iommufd.c          |  2 ++
>>>   include/hw/vfio/vfio-cpr.h |  4 ++++
>>>   4 files changed, 31 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>>> index 6081a89..7609c62 100644
>>> --- a/hw/vfio/cpr.c
>>> +++ b/hw/vfio/cpr.c
>>> @@ -11,6 +11,7 @@
>>>   #include "hw/vfio/pci.h"
>>>   #include "hw/pci/msix.h"
>>>   #include "hw/pci/msi.h"
>>> +#include "migration/blocker.h"
>>>   #include "migration/cpr.h"
>>>   #include "qapi/error.h"
>>>   #include "system/runstate.h"
>>> @@ -184,3 +185,23 @@ const VMStateDescription vfio_cpr_pci_vmstate = {
>>>           VMSTATE_END_OF_LIST()
>>>       }
>>>   };
>>> +
>>> +bool vfio_cpr_set_device_name(VFIODevice *vbasedev, Error **errp)
>>> +{
>>> +    if (vbasedev->dev->id) {
>>> +        vbasedev->name = g_strdup(vbasedev->dev->id);
>>> +        return true;
>>> +    } else {
>>> +        /*
>>> +         * Assign a name so any function printing it will not break, but the
>>> +         * fd number changes across processes, so this cannot be used as an
>>> +         * invariant name for CPR.
>>> +         */
>>> +        vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
>>
>> The code above should be in vfio_device_get_name() proposed in its own path.
> 
> I understand, "in its own patch".  Will do.

yes. This typo could clearly be misunderstood :/ Sorry for the noise.


> 
>>> +        error_setg(&vbasedev->cpr.id_blocker,
>>> +                   "vfio device with fd=%d needs an id property",
>>> +                   vbasedev->fd);
>>> +        return migrate_add_blocker_modes(&vbasedev->cpr.id_blocker, errp,
>>> +                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
>>
>> The cpr blocker should be proposed in a second patch, maybe with a small
>> wrapper to set the 'Error *'.
> 
> will do.


Thanks,

C.





* Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
  2025-05-21  3:11         ` Duan, Zhenzhong
@ 2025-05-21 13:01           ` Steven Sistare
  2025-05-22  3:19             ` Duan, Zhenzhong
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-21 13:01 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/20/2025 11:11 PM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steven Sistare <steven.sistare@oracle.com>
>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>
>> On 5/19/2025 11:51 AM, Steven Sistare wrote:
>>> On 5/16/2025 4:42 AM, Duan, Zhenzhong wrote:
>>>>> -----Original Message-----
>>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>>> Subject: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>>
>>>>> Define the change process ioctl
>>>>>
>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>> ---
>>>>> backends/iommufd.c       | 20 ++++++++++++++++++++
>>>>> backends/trace-events    |  1 +
>>>>> include/system/iommufd.h |  2 ++
>>>>> 3 files changed, 23 insertions(+)
>>>>>
>>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>>> index 5c1958f..6fed1c1 100644
>>>>> --- a/backends/iommufd.c
>>>>> +++ b/backends/iommufd.c
>>>>> @@ -73,6 +73,26 @@ static void iommufd_backend_class_init(ObjectClass *oc, const void *data)
>>>>>       object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
>>>>> }
>>>>>
>>>>> +bool iommufd_change_process_capable(IOMMUFDBackend *be)
>>>>> +{
>>>>> +    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>>>>> +
>>>>> +    return !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>>>>> +}
>>>>> +
>>>>> +bool iommufd_change_process(IOMMUFDBackend *be, Error **errp)
>>>>> +{
>>>>> +    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>>>>> +    bool ret = !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>>>>
>>>> This is the same ioctl as the check above.  Could it be called more than once for
>>>> the same process?
>>>
>>> Yes, and it is a no-op if the process has not changed since the last time DMA
>>> was mapped.
>>
>> More questions?
> 
> This looks a bit redundant to me. Also, if iommufd_change_process_capable() is called in the target QEMU, could it perform both the check and the change?
> 
> I would suggest defining only iommufd_change_process(), with a comment that it is a no-op if the process has not changed.

We need to check if IOMMU_IOAS_CHANGE_PROCESS is allowed before performing
live update so we can add a blocker and prevent live update cleanly:

vfio_iommufd_cpr_register_container
     if !vfio_cpr_supported()        // calls iommufd_change_process_capable
         migrate_add_blocker_modes()

How about I just add a comment:

bool iommufd_change_process_capable(IOMMUFDBackend *be)
{
     /*
      * Call IOMMU_IOAS_CHANGE_PROCESS to verify it is a recognized ioctl.
      * This is a no-op if the process has not changed since DMA was mapped.
      */

- Steve
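
A minimal sketch of the probe-then-block pattern outlined above, assuming the probe can be modeled as a callback that returns 0 on success, as ioctl() does. SketchContainer, SketchProbe, and the two example probes are hypothetical; the blocker string only stands in for migrate_add_blocker_modes().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical probe type: returns 0 on success, like ioctl(). */
typedef int (*SketchProbe)(void *opaque);

typedef struct SketchContainer {
    SketchProbe change_process; /* stands in for the change process call */
    void *opaque;
    const char *blocker;        /* non-NULL when live update is blocked */
} SketchContainer;

static bool sketch_change_process_capable(const SketchContainer *c)
{
    /* Safe to call early: a no-op if the process has not changed. */
    return c->change_process(c->opaque) == 0;
}

static void sketch_register_container(SketchContainer *c)
{
    c->blocker = NULL;
    if (!sketch_change_process_capable(c)) {
        /* Refuse live update up front instead of failing in the middle. */
        c->blocker = "change process is not supported";
    }
}

/* Example probes, used only to exercise the sketch. */
static int sketch_probe_ok(void *opaque)
{
    (void)opaque;
    return 0;
}

static int sketch_probe_unsupported(void *opaque)
{
    (void)opaque;
    return -1;
}
```

The capability check runs at registration time, so an unsupported kernel is detected before any live update is attempted.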




* Re: [PATCH V3 39/42] vfio/iommufd: reconstruct hwpt
  2025-05-20  9:16       ` Duan, Zhenzhong
@ 2025-05-21 17:40         ` Steven Sistare
  0 siblings, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-21 17:40 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/20/2025 5:16 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steven Sistare <steven.sistare@oracle.com>
>> Subject: Re: [PATCH V3 39/42] vfio/iommufd: reconstruct hwpt
>>
>> On 5/18/2025 11:25 PM, Duan, Zhenzhong wrote:
>>>> -----Original Message-----
>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>> Subject: [PATCH V3 39/42] vfio/iommufd: reconstruct hwpt
>>>>
>>>> Save the hwpt_id in vmstate.  In realize, skip its allocation from
>>>> iommufd_cdev_attach -> iommufd_cdev_attach_container ->
>>>> iommufd_cdev_autodomains_get.
>>>>
>>>> Rebuild userland structures to hold hwpt_id by calling
>>>> iommufd_cdev_rebuild_hwpt at post load time.  This depends on hw_caps, which
>>>> was restored by the post_load call to vfio_device_hiod_create_and_realize.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>> hw/vfio/cpr-iommufd.c      |  7 +++++++
>>>> hw/vfio/iommufd.c          | 24 ++++++++++++++++++++++--
>>>> hw/vfio/trace-events       |  1 +
>>>> hw/vfio/vfio-iommufd.h     |  3 +++
>>>> include/hw/vfio/vfio-cpr.h |  1 +
>>>> 5 files changed, 34 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>>>> index 24cdf10..6d3f4e0 100644
>>>> --- a/hw/vfio/cpr-iommufd.c
>>>> +++ b/hw/vfio/cpr-iommufd.c
>>>> @@ -110,6 +110,12 @@ static int vfio_device_post_load(void *opaque, int
>>>> version_id)
>>>>           error_report_err(err);
>>>>           return false;
>>>>       }
>>>> +    if (!vbasedev->mdev) {
>>>> +        VFIOIOMMUFDContainer *container = container_of(vbasedev->bcontainer,
>>>> +                                                       VFIOIOMMUFDContainer,
>>>> +                                                       bcontainer);
>>>> +        iommufd_cdev_rebuild_hwpt(vbasedev, container);
>>>> +    }
>>>>       return true;
>>>> }
>>>>
>>>> @@ -121,6 +127,7 @@ static const VMStateDescription vfio_device_vmstate = {
>>>>       .needed = cpr_needed_for_reuse,
>>>>       .fields = (VMStateField[]) {
>>>>           VMSTATE_INT32(devid, VFIODevice),
>>>> +        VMSTATE_UINT32(cpr.hwpt_id, VFIODevice),
>>>>           VMSTATE_END_OF_LIST()
>>>>       }
>>>> };
>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>> index d980684..ec79c83 100644
>>>> --- a/hw/vfio/iommufd.c
>>>> +++ b/hw/vfio/iommufd.c
>>>> @@ -318,6 +318,7 @@ static bool
>>>> iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>>>> static void iommufd_cdev_use_hwpt(VFIODevice *vbasedev, VFIOIOASHwpt
>>>> *hwpt)
>>>> {
>>>>       vbasedev->hwpt = hwpt;
>>>> +    vbasedev->cpr.hwpt_id = hwpt->hwpt_id;
>>>>       vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
>>>>       QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>>>> }
>>>> @@ -373,6 +374,23 @@ static bool iommufd_cdev_make_hwpt(VFIODevice
>>>> *vbasedev,
>>>>       return true;
>>>> }
>>>>
>>>> +void iommufd_cdev_rebuild_hwpt(VFIODevice *vbasedev,
>>>> +                               VFIOIOMMUFDContainer *container)
>>>> +{
>>>> +    VFIOIOASHwpt *hwpt;
>>>> +    int hwpt_id = vbasedev->cpr.hwpt_id;
>>>> +
>>>> +    trace_iommufd_cdev_rebuild_hwpt(container->be->fd, hwpt_id);
>>>> +
>>>> +    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>>>> +        if (hwpt->hwpt_id == hwpt_id) {
>>>> +            iommufd_cdev_use_hwpt(vbasedev, hwpt);
>>>> +            return;
>>>> +        }
>>>> +    }
>>>> +    iommufd_cdev_make_hwpt(vbasedev, container, hwpt_id, false, NULL);
>>>> +}
>>>> +
>>>> static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>>>                                            VFIOIOMMUFDContainer *container,
>>>>                                            Error **errp)
>>>> @@ -567,7 +585,8 @@ static bool iommufd_cdev_attach(const char *name,
>>>> VFIODevice *vbasedev,
>>>>               vbasedev->iommufd != container->be) {
>>>>               continue;
>>>>           }
>>>> -        if (!iommufd_cdev_attach_container(vbasedev, container, &err)) {
>>>> +        if (!vbasedev->cpr.reused &&
>>>> +            !iommufd_cdev_attach_container(vbasedev, container, &err)) {
>>>>               const char *msg = error_get_pretty(err);
>>>>
>>>>               trace_iommufd_cdev_fail_attach_existing_container(msg);
>>>> @@ -605,7 +624,8 @@ skip_ioas_alloc:
>>>>       bcontainer = &container->bcontainer;
>>>>       vfio_address_space_insert(space, bcontainer);
>>>>
>>>> -    if (!iommufd_cdev_attach_container(vbasedev, container, errp)) {
>>>> +    if (!vbasedev->cpr.reused &&
>>>> +        !iommufd_cdev_attach_container(vbasedev, container, errp)) {
>>>
>>> All container attaching is bypassed in new QEMU.  I am concerned that new
>>> QEMU does not generate the same containers as old QEMU if old QEMU has more
>>> than one container.  Then devices could be attached to the wrong container,
>>> or attaching could fail in post load.
>>
>> Yes, this relates to our discussion in patch 35.  Please explain, how can a single
>> iommufd backend have multiple containers?
> 
> As with legacy containers, there can be multiple containers in one address space.
> If an existing mapping in one container conflicts with a new device's reserved region,
> attaching to that container fails, and a new container must be created to accept
> the new device's reserved region.
> 
> Maybe you need to do the same as you do for legacy containers: save the ioas_id,
> just as you save container->fd, then check for an existing ioas_id and restore the
> iommufd container based on that.

Thanks, now I understand.
iommufd_cdev_attach calls
   iommufd_cdev_attach_container -> iommufd_cdev_attach_ioas_hwpt(container->ioas_id)
until it finds a container that works, or creates a new container with a new ioas_id.

To fix, I need to record each device's ioas_id in cpr-state, so it is available when
vfio_realize -> iommufd_cdev_attach is called.  Saving it in vmstate as I do now and
recovering it in a post_load handler is too late.
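
A minimal sketch of that lookup, for illustration only.  The table below is
a hypothetical stand-in for cpr-state, and the names saved_ioas,
find_saved_ioas_id and match_container are assumptions made up for this
example, not QEMU functions.  It shows the intended order: recover the
device's ioas_id first, then pick the existing container with that ioas_id,
or create a new one if none matches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for cpr-state: one saved ioas_id per device name. */
struct saved_ioas {
    const char *name;
    uint32_t ioas_id;
};

/*
 * Look up the ioas_id that old QEMU recorded for @name.
 * Returns false if the device has no saved entry (a normal, non-CPR start).
 */
static bool find_saved_ioas_id(const struct saved_ioas *tbl, size_t n,
                               const char *name, uint32_t *ioas_id)
{
    for (size_t i = 0; i < n; i++) {
        if (!strcmp(tbl[i].name, name)) {
            *ioas_id = tbl[i].ioas_id;
            return true;
        }
    }
    return false;
}

/*
 * Pick the container whose ioas_id matches the saved one.  Returns its
 * index, or -1 if a new container must be created with the saved ioas_id.
 */
static int match_container(const uint32_t *container_ioas, size_t n,
                           uint32_t ioas_id)
{
    for (size_t i = 0; i < n; i++) {
        if (container_ioas[i] == ioas_id) {
            return (int)i;
        }
    }
    return -1;
}
```

With the id known at realize time, a device is matched to its old container
by ioas_id rather than to the first container that accepts the attach.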

- Steve




* Re: [PATCH V3 32/42] vfio/iommufd: export iommufd_cdev_get_info_iova_range
  2025-05-12 15:32 ` [PATCH V3 32/42] vfio/iommufd: export iommufd_cdev_get_info_iova_range Steve Sistare
@ 2025-05-21 18:35   ` Steven Sistare
  0 siblings, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-21 18:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas

I withdraw this patch.  It is not needed if I save ioas_id in cpr-state.

- Steve

On 5/12/2025 11:32 AM, Steve Sistare wrote:
> Export iommufd_cdev_get_info_iova_range, for use by CPR in a subsequent
> patch to reconstruct the userland device state.  No functional change.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   hw/vfio/iommufd.c      | 4 ++--
>   hw/vfio/vfio-iommufd.h | 3 +++
>   2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 6eb417a..f645a62 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -459,8 +459,8 @@ static int iommufd_cdev_ram_block_discard_disable(bool state)
>       return ram_block_uncoordinated_discard_disable(state);
>   }
>   
> -static bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
> -                                             uint32_t ioas_id, Error **errp)
> +bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
> +                                      uint32_t ioas_id, Error **errp)
>   {
>       VFIOContainerBase *bcontainer = &container->bcontainer;
>       g_autofree struct iommu_ioas_iova_ranges *info = NULL;
> diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
> index 07ea0f4..5615dcd 100644
> --- a/hw/vfio/vfio-iommufd.h
> +++ b/hw/vfio/vfio-iommufd.h
> @@ -31,4 +31,7 @@ typedef struct VFIOIOMMUFDContainer {
>   
>   OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer, VFIO_IOMMU_IOMMUFD);
>   
> +bool iommufd_cdev_get_info_iova_range(VFIOIOMMUFDContainer *container,
> +                                      uint32_t ioas_id, Error **errp);
> +
>   #endif /* HW_VFIO_VFIO_IOMMUFD_H */




* Re: [PATCH V3 37/42] vfio/iommufd: reconstruct device
  2025-05-12 15:32 ` [PATCH V3 37/42] vfio/iommufd: reconstruct device Steve Sistare
  2025-05-16 10:22   ` Duan, Zhenzhong
@ 2025-05-21 18:38   ` Steven Sistare
  1 sibling, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-21 18:38 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas

I withdraw this patch.  Most of it is not needed if I save ioas_id
in cpr-state.  I will move a tiny bit that remains to another patch.

- Steve

On 5/12/2025 11:32 AM, Steve Sistare wrote:
> Reconstruct userland device state after CPR.  During vfio_realize, skip
> all ioctls that configure the device, as it was already configured in old
> QEMU.
> 
> Save the ioas_id in vmstate, and skip its allocation in vfio_realize.
> Because we skip ioctl's, it is not needed at realize time.  However, we do
> need the range info, so defer the call to iommufd_cdev_get_info_iova_range
> to a post_load handler, at which time the ioas_id is known.
> 
> This reconstruction is not complete.  hwpt_id and devid need special
> treatment, handled in subsequent patches.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   hw/vfio/cpr-iommufd.c |  8 ++++++++
>   hw/vfio/iommufd.c     | 17 +++++++++++++++++
>   2 files changed, 25 insertions(+)
> 
> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
> index b760bd3..3d430f0 100644
> --- a/hw/vfio/cpr-iommufd.c
> +++ b/hw/vfio/cpr-iommufd.c
> @@ -31,6 +31,13 @@ static int vfio_container_post_load(void *opaque, int version_id)
>       VFIOIOMMUFDContainer *container = opaque;
>       VFIOContainerBase *bcontainer = &container->bcontainer;
>       VFIODevice *vbasedev;
> +    Error *err = NULL;
> +    uint32_t ioas_id = container->ioas_id;
> +
> +    if (!iommufd_cdev_get_info_iova_range(container, ioas_id, &err)) {
> +        error_report_err(err);
> +        return -1;
> +    }
>   
>       QLIST_FOREACH(vbasedev, &bcontainer->device_list, container_next) {
>           vbasedev->cpr.reused = false;
> @@ -47,6 +54,7 @@ static const VMStateDescription vfio_container_vmstate = {
>       .post_load = vfio_container_post_load,
>       .needed = cpr_needed_for_reuse,
>       .fields = (VMStateField[]) {
> +        VMSTATE_UINT32(ioas_id, VFIOIOMMUFDContainer),
>           VMSTATE_END_OF_LIST()
>       }
>   };
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 046f601..c49a7e7 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -122,6 +122,10 @@ static bool iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
>           goto err_kvm_device_add;
>       }
>   
> +    if (vbasedev->cpr.reused) {
> +        goto skip_bind;
> +    }
> +
>       /* Bind device to iommufd */
>       bind.iommufd = iommufd->fd;
>       if (ioctl(vbasedev->fd, VFIO_DEVICE_BIND_IOMMUFD, &bind)) {
> @@ -133,6 +137,8 @@ static bool iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
>       vbasedev->devid = bind.out_devid;
>       trace_iommufd_cdev_connect_and_bind(bind.iommufd, vbasedev->name,
>                                           vbasedev->fd, vbasedev->devid);
> +
> +skip_bind:
>       return true;
>   err_bind:
>       iommufd_cdev_kvm_device_del(vbasedev);
> @@ -580,6 +586,11 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>           }
>       }
>   
> +    if (vbasedev->cpr.reused) {
> +        ioas_id = -1;           /* ioas_id will be received from vmstate */
> +        goto skip_ioas_alloc;
> +    }
> +
>       /* Need to allocate a new dedicated container */
>       if (!iommufd_backend_alloc_ioas(vbasedev->iommufd, &ioas_id, errp)) {
>           goto err_alloc_ioas;
> @@ -587,6 +598,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>   
>       trace_iommufd_cdev_alloc_ioas(vbasedev->iommufd->fd, ioas_id);
>   
> +skip_ioas_alloc:
>       container = VFIO_IOMMU_IOMMUFD(object_new(TYPE_VFIO_IOMMU_IOMMUFD));
>       container->be = vbasedev->iommufd;
>       container->ioas_id = ioas_id;
> @@ -605,6 +617,10 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>           goto err_discard_disable;
>       }
>   
> +    if (vbasedev->cpr.reused) {
> +        goto skip_info;
> +    }
> +
>       if (!iommufd_cdev_get_info_iova_range(container, ioas_id, &err)) {
>           error_append_hint(&err,
>                      "Fallback to default 64bit IOVA range and 4K page size\n");
> @@ -613,6 +629,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>           bcontainer->pgsizes = qemu_real_host_page_size();
>       }
>   
> +skip_info:
>       if (!vfio_listener_register(bcontainer, errp)) {
>           goto err_listener_register;
>       }




* Re: [PATCH V3 38/42] vfio/iommufd: reconstruct hw_caps
  2025-05-12 15:32 ` [PATCH V3 38/42] vfio/iommufd: reconstruct hw_caps Steve Sistare
@ 2025-05-21 19:59   ` Steven Sistare
  0 siblings, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-21 19:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas

I withdraw this patch.  I will instead save devid in cpr-state.

- Steve

On 5/12/2025 11:32 AM, Steve Sistare wrote:
> hw_caps is normally derived during realize, at
> vfio_device_hiod_create_and_realize -> hiod_iommufd_vfio_realize ->
> iommufd_backend_get_device_info.  However, this depends on the devid, which
> is not preserved during CPR.
> 
> Save devid in vmstate.  Defer the vfio_device_hiod_create_and_realize call
> to post_load time, after devid has been recovered from vmstate.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   hw/vfio/cpr-iommufd.c  | 15 +++++++++++++++
>   hw/vfio/iommufd.c      |  6 ++----
>   hw/vfio/vfio-iommufd.h |  3 +++
>   3 files changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
> index 3d430f0..24cdf10 100644
> --- a/hw/vfio/cpr-iommufd.c
> +++ b/hw/vfio/cpr-iommufd.c
> @@ -100,12 +100,27 @@ void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
>       migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
>   }
>   
> +static int vfio_device_post_load(void *opaque, int version_id)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    Error *err = NULL;
> +
> +    if (!vfio_device_hiod_create_and_realize(vbasedev,
> +                     TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO, &err)) {
> +        error_report_err(err);
> +        return false;
> +    }
> +    return true;
> +}
> +
>   static const VMStateDescription vfio_device_vmstate = {
>       .name = "vfio-iommufd-device",
>       .version_id = 0,
>       .minimum_version_id = 0,
> +    .post_load = vfio_device_post_load,
>       .needed = cpr_needed_for_reuse,
>       .fields = (VMStateField[]) {
> +        VMSTATE_INT32(devid, VFIODevice),
>           VMSTATE_END_OF_LIST()
>       }
>   };
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index c49a7e7..d980684 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -32,9 +32,6 @@
>   #include "vfio-helpers.h"
>   #include "vfio-listener.h"
>   
> -#define TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO             \
> -            TYPE_HOST_IOMMU_DEVICE_IOMMUFD "-vfio"
> -
>   static int iommufd_cdev_map(const VFIOContainerBase *bcontainer, hwaddr iova,
>                               ram_addr_t size, void *vaddr, bool readonly)
>   {
> @@ -557,7 +554,8 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>   
>       space = vfio_address_space_get(as);
>   
> -    if (!vfio_device_hiod_create_and_realize(vbasedev,
> +    if (!vbasedev->cpr.reused &&
> +        !vfio_device_hiod_create_and_realize(vbasedev,
>               TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO, errp)) {
>           goto err_alloc_ioas;
>       }
> diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
> index cc57a05..148ce89 100644
> --- a/hw/vfio/vfio-iommufd.h
> +++ b/hw/vfio/vfio-iommufd.h
> @@ -11,6 +11,9 @@
>   
>   #include "hw/vfio/vfio-container-base.h"
>   
> +#define TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO             \
> +            TYPE_HOST_IOMMU_DEVICE_IOMMUFD "-vfio"
> +
>   typedef struct VFIODevice VFIODevice;
>   
>   typedef struct VFIOIOASHwpt {




* RE: [PATCH V3 29/42] backends/iommufd: change process ioctl
  2025-05-21 13:01           ` Steven Sistare
@ 2025-05-22  3:19             ` Duan, Zhenzhong
  2025-05-22 21:11               ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Duan, Zhenzhong @ 2025-05-22  3:19 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steven Sistare <steven.sistare@oracle.com>
>Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>
>On 5/20/2025 11:11 PM, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Steven Sistare <steven.sistare@oracle.com>
>>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>
>>> On 5/19/2025 11:51 AM, Steven Sistare wrote:
>>>> On 5/16/2025 4:42 AM, Duan, Zhenzhong wrote:
>>>>>> -----Original Message-----
>>>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>>>> Subject: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>>>
>>>>>> Define the change process ioctl
>>>>>>
>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>> ---
>>>>>> backends/iommufd.c       | 20 ++++++++++++++++++++
>>>>>> backends/trace-events    |  1 +
>>>>>> include/system/iommufd.h |  2 ++
>>>>>> 3 files changed, 23 insertions(+)
>>>>>>
>>>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>>>> index 5c1958f..6fed1c1 100644
>>>>>> --- a/backends/iommufd.c
>>>>>> +++ b/backends/iommufd.c
>>>>>> @@ -73,6 +73,26 @@ static void iommufd_backend_class_init(ObjectClass *oc, const void *data)
>>>>>>       object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
>>>>>> }
>>>>>>
>>>>>> +bool iommufd_change_process_capable(IOMMUFDBackend *be)
>>>>>> +{
>>>>>> +    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>>>>>> +
>>>>>> +    return !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>>>>>> +}
>>>>>> +
>>>>>> +bool iommufd_change_process(IOMMUFDBackend *be, Error **errp)
>>>>>> +{
>>>>>> +    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>>>>>> +    bool ret = !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>>>>>
>>>>> This is the same ioctl as the check above; could it be called more than once for
>>>>> the same process?
>>>>
>>>> Yes, and it is a no-op if the process has not changed since the last time DMA
>>>> was mapped.
>>>
>>> More questions?
>>
>> This looks a bit redundant to me. Also, if iommufd_change_process_capable()
>> is called on the target qemu, might it do both the check and the change?
>>
>> I would suggest defining only iommufd_change_process() and commenting that
>> it is a no-op if the process has not changed...
>
>We need to check if IOMMU_IOAS_CHANGE_PROCESS is allowed before performing
>live update so we can add a blocker and prevent live update cleanly:
>
>vfio_iommufd_cpr_register_container
>     if !vfio_cpr_supported()        // calls iommufd_change_process_capable
>         migrate_add_blocker_modes()

This reminds me of other questions: is this ioctl() suitable for checking whether cpr-transfer is supported?
If there is a vIOMMU, there may be no mappings, so the process_capable() check will pass,
but what if memory is not file backed...
Does cpr-transfer support vIOMMU or not?

QEMU knows the details of all memory backends, so why not check the memory backends directly instead of making a system call?

>
>How about I just add a comment:
>
>bool iommufd_change_process_capable(IOMMUFDBackend *be)
>{
>     /*
>      * Call IOMMU_IOAS_CHANGE_PROCESS to verify it is a recognized ioctl.
>      * This is a no-op if the process has not changed since DMA was mapped.
>      */
>
>- Steve




* Re: [PATCH V3 10/42] vfio/container: restore DMA vaddr
  2025-05-12 15:32 ` [PATCH V3 10/42] vfio/container: restore " Steve Sistare
  2025-05-15 13:42   ` Cédric Le Goater
@ 2025-05-22  6:37   ` Cédric Le Goater
  2025-05-22 14:00     ` Steven Sistare
  1 sibling, 1 reply; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-22  6:37 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> In new QEMU, do not register the memory listener at device creation time.
> Register it later, in the container post_load handler, after all vmstate
> that may affect regions and mapping boundaries has been loaded.  The
> post_load registration will cause the listener to invoke its callback on
> each flat section, and the calls will match the mappings remembered by the
> kernel.
> 
> The listener calls a special dma_map handler that passes the new VA of each
> section to the kernel using VFIO_DMA_MAP_FLAG_VADDR.  Restore the normal
> handler at the end.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   hw/vfio/container.c  | 15 +++++++++++++--
>   hw/vfio/cpr-legacy.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 61 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index a554683..0e02726 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -137,6 +137,8 @@ static int vfio_legacy_dma_unmap_one(const VFIOContainerBase *bcontainer,
>       int ret;
>       Error *local_err = NULL;
>   
> +    assert(!container->cpr.reused);

assert -> g_assert

This can be called at runtime, which would mean crashing QEMU in case
of error. Calling error_report() instead would be friendlier.
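
To illustrate the suggestion, here is a minimal standalone sketch, not the
actual vfio code.  The struct and function names are hypothetical, and
fprintf(stderr, ...) stands in for QEMU's error_report().  The idea is that
the unexpected state is reported and turned into an error return that the
caller can handle, instead of aborting the process.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the container; only the field needed here. */
struct container_sketch {
    bool cpr_reused;
};

/*
 * Hypothetical unmap entry point.  Rather than asserting that the
 * container is not in the reused state, report it and fail the call.
 */
static int dma_unmap_sketch(const struct container_sketch *c)
{
    if (c->cpr_reused) {
        /* error_report() would be used here in QEMU proper */
        fprintf(stderr, "dma unmap called while container is reused\n");
        return -EINVAL;
    }
    /* ... the real unmap work would go here ... */
    return 0;
}
```

Whether -EINVAL is the right error code for this path is an assumption of
the sketch.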


Thanks,

C.


> +
>       if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer)) {
>           if (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
>               bcontainer->dirty_pages_supported) {
> @@ -691,8 +693,17 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>       }
>       group_was_added = true;
>   
> -    if (!vfio_listener_register(bcontainer, errp)) {
> -        goto fail;
> +    /*
> +     * If reused, register the listener later, after all state that may
> +     * affect regions and mapping boundaries has been cpr load'ed.  Later,
> +     * the listener will invoke its callback on each flat section and call
> +     * dma_map to supply the new vaddr, and the calls will match the mappings
> +     * remembered by the kernel.
> +     */
> +    if (!cpr_reused) {
> +        if (!vfio_listener_register(bcontainer, errp)) {
> +            goto fail;
> +        }
>       }
>   
>       bcontainer->initialized = true;
> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
> index 519d772..bbcf71e 100644
> --- a/hw/vfio/cpr-legacy.c
> +++ b/hw/vfio/cpr-legacy.c
> @@ -11,11 +11,13 @@
>   #include "hw/vfio/vfio-container.h"
>   #include "hw/vfio/vfio-cpr.h"
>   #include "hw/vfio/vfio-device.h"
> +#include "hw/vfio/vfio-listener.h"
>   #include "migration/blocker.h"
>   #include "migration/cpr.h"
>   #include "migration/migration.h"
>   #include "migration/vmstate.h"
>   #include "qapi/error.h"
> +#include "qemu/error-report.h"
>   
>   static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>   {
> @@ -32,6 +34,34 @@ static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>       return true;
>   }
>   
> +/*
> + * Set the new @vaddr for any mappings registered during cpr load.
> + * Reused is cleared thereafter.
> + */
> +static int vfio_legacy_cpr_dma_map(const VFIOContainerBase *bcontainer,
> +                                   hwaddr iova, ram_addr_t size, void *vaddr,
> +                                   bool readonly)
> +{
> +    const VFIOContainer *container = container_of(bcontainer, VFIOContainer,
> +                                                  bcontainer);
> +    struct vfio_iommu_type1_dma_map map = {
> +        .argsz = sizeof(map),
> +        .flags = VFIO_DMA_MAP_FLAG_VADDR,
> +        .vaddr = (__u64)(uintptr_t)vaddr,
> +        .iova = iova,
> +        .size = size,
> +    };
> +
> +    assert(container->cpr.reused);
> +
> +    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
> +        error_report("vfio_legacy_cpr_dma_map (iova %lu, size %ld, va %p): %s",
> +                     iova, size, vaddr, strerror(errno));
> +        return -errno;
> +    }
> +
> +    return 0;
> +}
>   
>   static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>   {
> @@ -63,12 +93,24 @@ static int vfio_container_pre_save(void *opaque)
>   static int vfio_container_post_load(void *opaque, int version_id)
>   {
>       VFIOContainer *container = opaque;
> +    VFIOContainerBase *bcontainer = &container->bcontainer;
>       VFIOGroup *group;
>       VFIODevice *vbasedev;
> +    Error *err = NULL;
> +
> +    if (!vfio_listener_register(bcontainer, &err)) {
> +        error_report_err(err);
> +        return -1;
> +    }
>   
>       container->cpr.reused = false;
>   
>       QLIST_FOREACH(group, &container->group_list, container_next) {
> +        VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
> +
> +        /* Restore original dma_map function */
> +        vioc->dma_map = vfio_legacy_dma_map;
> +
>           QLIST_FOREACH(vbasedev, &group->device_list, next) {
>               vbasedev->cpr.reused = false;
>           }
> @@ -80,6 +122,7 @@ static const VMStateDescription vfio_container_vmstate = {
>       .name = "vfio-container",
>       .version_id = 0,
>       .minimum_version_id = 0,
> +    .priority = MIG_PRI_LOW,  /* Must happen after devices and groups */
>       .pre_save = vfio_container_pre_save,
>       .post_load = vfio_container_post_load,
>       .needed = cpr_needed_for_reuse,
> @@ -104,6 +147,11 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
>   
>       vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>   
> +    /* During incoming CPR, divert calls to dma_map. */
> +    if (container->cpr.reused) {
> +        VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
> +        vioc->dma_map = vfio_legacy_cpr_dma_map;
> +    }
>       return true;
>   }
>   




* Re: [PATCH V3 07/42] vfio/container: preserve descriptors
  2025-05-12 15:32 ` [PATCH V3 07/42] vfio/container: preserve descriptors Steve Sistare
  2025-05-15 12:59   ` Cédric Le Goater
@ 2025-05-22 13:51   ` Cédric Le Goater
  2025-05-22 13:56     ` Steven Sistare
  1 sibling, 1 reply; 157+ messages in thread
From: Cédric Le Goater @ 2025-05-22 13:51 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/12/25 17:32, Steve Sistare wrote:
> At vfio creation time, save the value of vfio container, group, and device
> descriptors in CPR state.  On qemu restart, vfio_realize() finds and uses
> the saved descriptors, and remembers the reused status for subsequent
> patches.  The reused status is cleared when vmstate load finishes.
> 
> During reuse, device and iommu state is already configured, so operations
> in vfio_realize that would modify the configuration, such as vfio ioctl's,
> are skipped.  The result is that vfio_realize constructs qemu data
> structures that reflect the current state of the device.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   hw/vfio/container.c           | 65 ++++++++++++++++++++++++++++++++++++-------
>   hw/vfio/cpr-legacy.c          | 46 ++++++++++++++++++++++++++++++
>   include/hw/vfio/vfio-cpr.h    |  9 ++++++
>   include/hw/vfio/vfio-device.h |  2 ++
>   4 files changed, 112 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index 85c76da..278a220 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -31,6 +31,8 @@
>   #include "system/reset.h"
>   #include "trace.h"
>   #include "qapi/error.h"
> +#include "migration/cpr.h"
> +#include "migration/blocker.h"
>   #include "pci.h"
>   #include "hw/vfio/vfio-container.h"
>   #include "hw/vfio/vfio-cpr.h"
> @@ -414,7 +416,7 @@ static bool vfio_set_iommu(int container_fd, int group_fd,
>   }
>   
>   static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
> -                                            Error **errp)
> +                                            bool cpr_reused, Error **errp)
>   {
>       int iommu_type;
>       const char *vioc_name;
> @@ -425,7 +427,11 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
>           return NULL;
>       }
>   
> -    if (!vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
> +    /*
> +     * If container is reused, just set its type and skip the ioctls, as the
> +     * container and group are already configured in the kernel.
> +     */
> +    if (!cpr_reused && !vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
>           return NULL;
>       }
>   
> @@ -433,6 +439,7 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
>   
>       container = VFIO_IOMMU_LEGACY(object_new(vioc_name));
>       container->fd = fd;
> +    container->cpr.reused = cpr_reused;
>       container->iommu_type = iommu_type;
>       return container;
>   }
> @@ -584,7 +591,7 @@ static bool vfio_container_attach_discard_disable(VFIOContainer *container,
>   }
>   
>   static bool vfio_container_group_add(VFIOContainer *container, VFIOGroup *group,
> -                                     Error **errp)
> +                                     bool cpr_reused, Error **errp)
>   {
>       if (!vfio_container_attach_discard_disable(container, group, errp)) {
>           return false;
> @@ -592,6 +599,9 @@ static bool vfio_container_group_add(VFIOContainer *container, VFIOGroup *group,
>       group->container = container;
>       QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>       vfio_group_add_kvm_device(group);
> +    if (!cpr_reused) {
> +        cpr_save_fd("vfio_container_for_group", group->groupid, container->fd);
> +    }
>       return true;
>   }
>   
> @@ -601,6 +611,7 @@ static void vfio_container_group_del(VFIOContainer *container, VFIOGroup *group)
>       group->container = NULL;
>       vfio_group_del_kvm_device(group);
>       vfio_ram_block_discard_disable(container, false);
> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>   }
>   
>   static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
> @@ -613,17 +624,37 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>       VFIOIOMMUClass *vioc = NULL;
>       bool new_container = false;
>       bool group_was_added = false;
> +    bool cpr_reused;
>   
>       space = vfio_address_space_get(as);
> +    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
> +    cpr_reused = (fd > 0);

btw, 0 is a valid fd number.
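
A small sketch of the check this implies, assuming that cpr_find_fd()
returns a negative value when no descriptor was saved (vfio_device_get in
this same patch already tests "fd >= 0").  The helper name cpr_fd_is_reused
is hypothetical and exists only for this example.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether a descriptor lookup found a
 * preserved fd.  Descriptor 0 is valid, so only negative values mean
 * "not found".
 */
static bool cpr_fd_is_reused(int fd)
{
    return fd >= 0;    /* ">= 0", not "> 0" */
}
```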

Thanks,

C.


> +
> +    /*
> +     * If the container is reused, then the group is already attached in the
> +     * kernel.  If a container with matching fd is found, then update the
> +     * userland group list and return.  If not, then after the loop, create
> +     * the container struct and group list.
> +     */
>   
>       QLIST_FOREACH(bcontainer, &space->containers, next) {
>           container = container_of(bcontainer, VFIOContainer, bcontainer);
> -        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> -            return vfio_container_group_add(container, group, errp);
> +
> +        if (cpr_reused) {
> +            if (!vfio_cpr_container_match(container, group, &fd)) {
> +                continue;
> +            }
> +        } else if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> +            continue;
>           }
> +
> +        return vfio_container_group_add(container, group, cpr_reused, errp);
> +    }
> +
> +    if (!cpr_reused) {
> +        fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
>       }
>   
> -    fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
>       if (fd < 0) {
>           goto fail;
>       }
> @@ -635,7 +666,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>           goto fail;
>       }
>   
> -    container = vfio_create_container(fd, group, errp);
> +    container = vfio_create_container(fd, group, cpr_reused, errp);
>       if (!container) {
>           goto fail;
>       }
> @@ -655,7 +686,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>   
>       vfio_address_space_insert(space, bcontainer);
>   
> -    if (!vfio_container_group_add(container, group, errp)) {
> +    if (!vfio_container_group_add(container, group, cpr_reused, errp)) {
>           goto fail;
>       }
>       group_was_added = true;
> @@ -697,6 +728,7 @@ static void vfio_container_disconnect(VFIOGroup *group)
>   
>       QLIST_REMOVE(group, container_next);
>       group->container = NULL;
> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>   
>       /*
>        * Explicitly release the listener first before unset container,
> @@ -750,7 +782,7 @@ static VFIOGroup *vfio_group_get(int groupid, AddressSpace *as, Error **errp)
>       group = g_malloc0(sizeof(*group));
>   
>       snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
> -    group->fd = qemu_open(path, O_RDWR, errp);
> +    group->fd = cpr_open_fd(path, O_RDWR, "vfio_group", groupid, NULL, errp);
>       if (group->fd < 0) {
>           goto free_group_exit;
>       }
> @@ -782,6 +814,7 @@ static VFIOGroup *vfio_group_get(int groupid, AddressSpace *as, Error **errp)
>       return group;
>   
>   close_fd_exit:
> +    cpr_delete_fd("vfio_group", groupid);
>       close(group->fd);
>   
>   free_group_exit:
> @@ -803,6 +836,7 @@ static void vfio_group_put(VFIOGroup *group)
>       vfio_container_disconnect(group);
>       QLIST_REMOVE(group, next);
>       trace_vfio_group_put(group->fd);
> +    cpr_delete_fd("vfio_group", group->groupid);
>       close(group->fd);
>       g_free(group);
>   }
> @@ -812,8 +846,14 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
>   {
>       g_autofree struct vfio_device_info *info = NULL;
>       int fd;
> +    bool cpr_reused;
> +
> +    fd = cpr_find_fd(name, 0);
> +    cpr_reused = (fd >= 0);
> +    if (!cpr_reused) {
> +        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
> +    }
>   
> -    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>       if (fd < 0) {
>           error_setg_errno(errp, errno, "error getting device from group %d",
>                            group->groupid);
> @@ -857,6 +897,10 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
>       vbasedev->group = group;
>       QLIST_INSERT_HEAD(&group->device_list, vbasedev, next);
>   
> +    vbasedev->cpr.reused = cpr_reused;
> +    if (!cpr_reused) {
> +        cpr_save_fd(name, 0, fd);
> +    }
>       trace_vfio_device_get(name, info->flags, info->num_regions, info->num_irqs);
>   
>       return true;
> @@ -870,6 +914,7 @@ static void vfio_device_put(VFIODevice *vbasedev)
>       QLIST_REMOVE(vbasedev, next);
>       vbasedev->group = NULL;
>       trace_vfio_device_put(vbasedev->fd);
> +    cpr_delete_fd(vbasedev->name, 0);
>       close(vbasedev->fd);
>   }
>   
> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
> index fac323c..638a8e0 100644
> --- a/hw/vfio/cpr-legacy.c
> +++ b/hw/vfio/cpr-legacy.c
> @@ -10,6 +10,7 @@
>   #include "qemu/osdep.h"
>   #include "hw/vfio/vfio-container.h"
>   #include "hw/vfio/vfio-cpr.h"
> +#include "hw/vfio/vfio-device.h"
>   #include "migration/blocker.h"
>   #include "migration/cpr.h"
>   #include "migration/migration.h"
> @@ -31,10 +32,27 @@ static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>       }
>   }
>   
> +static int vfio_container_post_load(void *opaque, int version_id)
> +{
> +    VFIOContainer *container = opaque;
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +
> +    container->cpr.reused = false;
> +
> +    QLIST_FOREACH(group, &container->group_list, container_next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            vbasedev->cpr.reused = false;
> +        }
> +    }
> +    return 0;
> +}
> +
>   static const VMStateDescription vfio_container_vmstate = {
>       .name = "vfio-container",
>       .version_id = 0,
>       .minimum_version_id = 0,
> +    .post_load = vfio_container_post_load,
>       .needed = cpr_needed_for_reuse,
>       .fields = (VMStateField[]) {
>           VMSTATE_END_OF_LIST()
> @@ -68,3 +86,31 @@ void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
>       migrate_del_blocker(&container->cpr.blocker);
>       vmstate_unregister(NULL, &vfio_container_vmstate, container);
>   }
> +
> +static bool same_device(int fd1, int fd2)
> +{
> +    struct stat st1, st2;
> +
> +    return !fstat(fd1, &st1) && !fstat(fd2, &st2) && st1.st_dev == st2.st_dev;
> +}
> +
> +bool vfio_cpr_container_match(VFIOContainer *container, VFIOGroup *group,
> +                              int *pfd)
> +{
> +    if (container->fd == *pfd) {
> +        return true;
> +    }
> +    if (!same_device(container->fd, *pfd)) {
> +        return false;
> +    }
> +    /*
> +     * Same device, different fd.  This occurs when the container fd is
> +     * cpr_save'd multiple times, once for each groupid, so SCM_RIGHTS
> +     * produces duplicates.  De-dup it.
> +     */
> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
> +    close(*pfd);
> +    cpr_save_fd("vfio_container_for_group", group->groupid, container->fd);
> +    *pfd = container->fd;
> +    return true;
> +}
> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
> index f864547..1c4f070 100644
> --- a/include/hw/vfio/vfio-cpr.h
> +++ b/include/hw/vfio/vfio-cpr.h
> @@ -13,10 +13,16 @@
>   
>   typedef struct VFIOContainerCPR {
>       Error *blocker;
> +    bool reused;
>   } VFIOContainerCPR;
>   
> +typedef struct VFIODeviceCPR {
> +    bool reused;
> +} VFIODeviceCPR;
> +
>   struct VFIOContainer;
>   struct VFIOContainerBase;
> +struct VFIOGroup;
>   
>   bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>                                           Error **errp);
> @@ -29,4 +35,7 @@ bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
>                                    Error **errp);
>   void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>   
> +bool vfio_cpr_container_match(struct VFIOContainer *container,
> +                              struct VFIOGroup *group, int *fd);
> +
>   #endif /* HW_VFIO_VFIO_CPR_H */
> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
> index 8bcb3c1..4e4d0b6 100644
> --- a/include/hw/vfio/vfio-device.h
> +++ b/include/hw/vfio/vfio-device.h
> @@ -28,6 +28,7 @@
>   #endif
>   #include "system/system.h"
>   #include "hw/vfio/vfio-container-base.h"
> +#include "hw/vfio/vfio-cpr.h"
>   #include "system/host_iommu_device.h"
>   #include "system/iommufd.h"
>   
> @@ -84,6 +85,7 @@ typedef struct VFIODevice {
>       VFIOIOASHwpt *hwpt;
>       QLIST_ENTRY(VFIODevice) hwpt_next;
>       struct vfio_region_info **reginfo;
> +    VFIODeviceCPR cpr;
>   } VFIODevice;
>   
>   struct VFIODeviceOps {




* Re: [PATCH V3 07/42] vfio/container: preserve descriptors
  2025-05-22 13:51   ` Cédric Le Goater
@ 2025-05-22 13:56     ` Steven Sistare
  0 siblings, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-22 13:56 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/22/2025 9:51 AM, Cédric Le Goater wrote:
> On 5/12/25 17:32, Steve Sistare wrote:
>> At vfio creation time, save the value of vfio container, group, and device
>> descriptors in CPR state.  On qemu restart, vfio_realize() finds and uses
>> the saved descriptors, and remembers the reused status for subsequent
>> patches.  The reused status is cleared when vmstate load finishes.
>>
>> During reuse, device and iommu state is already configured, so operations
>> in vfio_realize that would modify the configuration, such as vfio ioctl's,
>> are skipped.  The result is that vfio_realize constructs qemu data
>> structures that reflect the current state of the device.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   hw/vfio/container.c           | 65 ++++++++++++++++++++++++++++++++++++-------
>>   hw/vfio/cpr-legacy.c          | 46 ++++++++++++++++++++++++++++++
>>   include/hw/vfio/vfio-cpr.h    |  9 ++++++
>>   include/hw/vfio/vfio-device.h |  2 ++
>>   4 files changed, 112 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>> index 85c76da..278a220 100644
>> --- a/hw/vfio/container.c
>> +++ b/hw/vfio/container.c
>> @@ -31,6 +31,8 @@
>>   #include "system/reset.h"
>>   #include "trace.h"
>>   #include "qapi/error.h"
>> +#include "migration/cpr.h"
>> +#include "migration/blocker.h"
>>   #include "pci.h"
>>   #include "hw/vfio/vfio-container.h"
>>   #include "hw/vfio/vfio-cpr.h"
>> @@ -414,7 +416,7 @@ static bool vfio_set_iommu(int container_fd, int group_fd,
>>   }
>>   static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
>> -                                            Error **errp)
>> +                                            bool cpr_reused, Error **errp)
>>   {
>>       int iommu_type;
>>       const char *vioc_name;
>> @@ -425,7 +427,11 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
>>           return NULL;
>>       }
>> -    if (!vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
>> +    /*
>> +     * If container is reused, just set its type and skip the ioctls, as the
>> +     * container and group are already configured in the kernel.
>> +     */
>> +    if (!cpr_reused && !vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
>>           return NULL;
>>       }
>> @@ -433,6 +439,7 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
>>       container = VFIO_IOMMU_LEGACY(object_new(vioc_name));
>>       container->fd = fd;
>> +    container->cpr.reused = cpr_reused;
>>       container->iommu_type = iommu_type;
>>       return container;
>>   }
>> @@ -584,7 +591,7 @@ static bool vfio_container_attach_discard_disable(VFIOContainer *container,
>>   }
>>   static bool vfio_container_group_add(VFIOContainer *container, VFIOGroup *group,
>> -                                     Error **errp)
>> +                                     bool cpr_reused, Error **errp)
>>   {
>>       if (!vfio_container_attach_discard_disable(container, group, errp)) {
>>           return false;
>> @@ -592,6 +599,9 @@ static bool vfio_container_group_add(VFIOContainer *container, VFIOGroup *group,
>>       group->container = container;
>>       QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>>       vfio_group_add_kvm_device(group);
>> +    if (!cpr_reused) {
>> +        cpr_save_fd("vfio_container_for_group", group->groupid, container->fd);
>> +    }
>>       return true;
>>   }
>> @@ -601,6 +611,7 @@ static void vfio_container_group_del(VFIOContainer *container, VFIOGroup *group)
>>       group->container = NULL;
>>       vfio_group_del_kvm_device(group);
>>       vfio_ram_block_discard_disable(container, false);
>> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>>   }
>>   static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>> @@ -613,17 +624,37 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>       VFIOIOMMUClass *vioc = NULL;
>>       bool new_container = false;
>>       bool group_was_added = false;
>> +    bool cpr_reused;
>>       space = vfio_address_space_get(as);
>> +    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
>> +    cpr_reused = (fd > 0);
> 
> btw, 0 is a valid fd number.

That's a typo, but a bad one, thanks!  That is the only broken one:

$ fgrep '(fd >' hw/vfio/*.c
container.c:    cpr_reused = (fd > 0);
container.c:    if (fd >= 0) {
cpr.c:        if (fd >= 0) {
cpr-legacy.c:        *reused = (fd >= 0);
cpr-legacy.c:        if (fd >= 0) {
pci.c:    if (fd >= 0) {
pci.c:            if (fd >= 0) {
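
For reference, a minimal sketch of why the test must be ">= 0", assuming a
POSIX system where /dev/null exists.  The helper names fd_is_valid() and
open_after_closing_stdin() are hypothetical and are not part of the patch:

```c
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical stand-in for checking the result of a lookup that
 * returns -1 when no descriptor was saved. */
static bool fd_is_valid(int fd)
{
    return fd >= 0;     /* not "fd > 0": descriptor 0 is legitimate */
}

/*
 * POSIX open() returns the lowest unused descriptor, so once fd 0 is
 * closed the next open() reuses it.  Returns the descriptor obtained
 * (or -1 on error) and restores the original fd 0 before returning.
 */
static int open_after_closing_stdin(void)
{
    int saved = dup(STDIN_FILENO);      /* -1 if fd 0 was not open */
    int got;

    close(STDIN_FILENO);
    got = open("/dev/null", O_RDONLY);
    if (got > 0) {
        close(got);                     /* unexpected; do not leak it */
    }
    if (saved >= 0) {
        dup2(saved, STDIN_FILENO);      /* replaces whatever is on fd 0 */
        close(saved);
    } else if (got == 0) {
        close(got);
    }
    return got;
}
```

With (fd > 0), a saved descriptor that happened to be 0 would be treated as
not found, and vfio_container_connect would fall through to qemu_open.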

- Steve

>> +
>> +    /*
>> +     * If the container is reused, then the group is already attached in the
>> +     * kernel.  If a container with matching fd is found, then update the
>> +     * userland group list and return.  If not, then after the loop, create
>> +     * the container struct and group list.
>> +     */
>>       QLIST_FOREACH(bcontainer, &space->containers, next) {
>>           container = container_of(bcontainer, VFIOContainer, bcontainer);
>> -        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>> -            return vfio_container_group_add(container, group, errp);
>> +
>> +        if (cpr_reused) {
>> +            if (!vfio_cpr_container_match(container, group, &fd)) {
>> +                continue;
>> +            }
>> +        } else if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>> +            continue;
>>           }
>> +
>> +        return vfio_container_group_add(container, group, cpr_reused, errp);
>> +    }
>> +
>> +    if (!cpr_reused) {
>> +        fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
>>       }
>> -    fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
>>       if (fd < 0) {
>>           goto fail;
>>       }
>> @@ -635,7 +666,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>           goto fail;
>>       }
>> -    container = vfio_create_container(fd, group, errp);
>> +    container = vfio_create_container(fd, group, cpr_reused, errp);
>>       if (!container) {
>>           goto fail;
>>       }
>> @@ -655,7 +686,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>       vfio_address_space_insert(space, bcontainer);
>> -    if (!vfio_container_group_add(container, group, errp)) {
>> +    if (!vfio_container_group_add(container, group, cpr_reused, errp)) {
>>           goto fail;
>>       }
>>       group_was_added = true;
>> @@ -697,6 +728,7 @@ static void vfio_container_disconnect(VFIOGroup *group)
>>       QLIST_REMOVE(group, container_next);
>>       group->container = NULL;
>> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>>       /*
>>        * Explicitly release the listener first before unset container,
>> @@ -750,7 +782,7 @@ static VFIOGroup *vfio_group_get(int groupid, AddressSpace *as, Error **errp)
>>       group = g_malloc0(sizeof(*group));
>>       snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
>> -    group->fd = qemu_open(path, O_RDWR, errp);
>> +    group->fd = cpr_open_fd(path, O_RDWR, "vfio_group", groupid, NULL, errp);
>>       if (group->fd < 0) {
>>           goto free_group_exit;
>>       }
>> @@ -782,6 +814,7 @@ static VFIOGroup *vfio_group_get(int groupid, AddressSpace *as, Error **errp)
>>       return group;
>>   close_fd_exit:
>> +    cpr_delete_fd("vfio_group", groupid);
>>       close(group->fd);
>>   free_group_exit:
>> @@ -803,6 +836,7 @@ static void vfio_group_put(VFIOGroup *group)
>>       vfio_container_disconnect(group);
>>       QLIST_REMOVE(group, next);
>>       trace_vfio_group_put(group->fd);
>> +    cpr_delete_fd("vfio_group", group->groupid);
>>       close(group->fd);
>>       g_free(group);
>>   }
>> @@ -812,8 +846,14 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
>>   {
>>       g_autofree struct vfio_device_info *info = NULL;
>>       int fd;
>> +    bool cpr_reused;
>> +
>> +    fd = cpr_find_fd(name, 0);
>> +    cpr_reused = (fd >= 0);
>> +    if (!cpr_reused) {
>> +        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>> +    }
>> -    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>>       if (fd < 0) {
>>           error_setg_errno(errp, errno, "error getting device from group %d",
>>                            group->groupid);
>> @@ -857,6 +897,10 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
>>       vbasedev->group = group;
>>       QLIST_INSERT_HEAD(&group->device_list, vbasedev, next);
>> +    vbasedev->cpr.reused = cpr_reused;
>> +    if (!cpr_reused) {
>> +        cpr_save_fd(name, 0, fd);
>> +    }
>>       trace_vfio_device_get(name, info->flags, info->num_regions, info->num_irqs);
>>       return true;
>> @@ -870,6 +914,7 @@ static void vfio_device_put(VFIODevice *vbasedev)
>>       QLIST_REMOVE(vbasedev, next);
>>       vbasedev->group = NULL;
>>       trace_vfio_device_put(vbasedev->fd);
>> +    cpr_delete_fd(vbasedev->name, 0);
>>       close(vbasedev->fd);
>>   }
>> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
>> index fac323c..638a8e0 100644
>> --- a/hw/vfio/cpr-legacy.c
>> +++ b/hw/vfio/cpr-legacy.c
>> @@ -10,6 +10,7 @@
>>   #include "qemu/osdep.h"
>>   #include "hw/vfio/vfio-container.h"
>>   #include "hw/vfio/vfio-cpr.h"
>> +#include "hw/vfio/vfio-device.h"
>>   #include "migration/blocker.h"
>>   #include "migration/cpr.h"
>>   #include "migration/migration.h"
>> @@ -31,10 +32,27 @@ static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>>       }
>>   }
>> +static int vfio_container_post_load(void *opaque, int version_id)
>> +{
>> +    VFIOContainer *container = opaque;
>> +    VFIOGroup *group;
>> +    VFIODevice *vbasedev;
>> +
>> +    container->cpr.reused = false;
>> +
>> +    QLIST_FOREACH(group, &container->group_list, container_next) {
>> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> +            vbasedev->cpr.reused = false;
>> +        }
>> +    }
>> +    return 0;
>> +}
>> +
>>   static const VMStateDescription vfio_container_vmstate = {
>>       .name = "vfio-container",
>>       .version_id = 0,
>>       .minimum_version_id = 0,
>> +    .post_load = vfio_container_post_load,
>>       .needed = cpr_needed_for_reuse,
>>       .fields = (VMStateField[]) {
>>           VMSTATE_END_OF_LIST()
>> @@ -68,3 +86,31 @@ void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
>>       migrate_del_blocker(&container->cpr.blocker);
>>       vmstate_unregister(NULL, &vfio_container_vmstate, container);
>>   }
>> +
>> +static bool same_device(int fd1, int fd2)
>> +{
>> +    struct stat st1, st2;
>> +
>> +    return !fstat(fd1, &st1) && !fstat(fd2, &st2) && st1.st_dev == st2.st_dev;
>> +}
>> +
>> +bool vfio_cpr_container_match(VFIOContainer *container, VFIOGroup *group,
>> +                              int *pfd)
>> +{
>> +    if (container->fd == *pfd) {
>> +        return true;
>> +    }
>> +    if (!same_device(container->fd, *pfd)) {
>> +        return false;
>> +    }
>> +    /*
>> +     * Same device, different fd.  This occurs when the container fd is
>> +     * cpr_save'd multiple times, once for each groupid, so SCM_RIGHTS
>> +     * produces duplicates.  De-dup it.
>> +     */
>> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>> +    close(*pfd);
>> +    cpr_save_fd("vfio_container_for_group", group->groupid, container->fd);
>> +    *pfd = container->fd;
>> +    return true;
>> +}
>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>> index f864547..1c4f070 100644
>> --- a/include/hw/vfio/vfio-cpr.h
>> +++ b/include/hw/vfio/vfio-cpr.h
>> @@ -13,10 +13,16 @@
>>   typedef struct VFIOContainerCPR {
>>       Error *blocker;
>> +    bool reused;
>>   } VFIOContainerCPR;
>> +typedef struct VFIODeviceCPR {
>> +    bool reused;
>> +} VFIODeviceCPR;
>> +
>>   struct VFIOContainer;
>>   struct VFIOContainerBase;
>> +struct VFIOGroup;
>>   bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>>                                           Error **errp);
>> @@ -29,4 +35,7 @@ bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
>>                                    Error **errp);
>>   void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>> +bool vfio_cpr_container_match(struct VFIOContainer *container,
>> +                              struct VFIOGroup *group, int *fd);
>> +
>>   #endif /* HW_VFIO_VFIO_CPR_H */
>> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
>> index 8bcb3c1..4e4d0b6 100644
>> --- a/include/hw/vfio/vfio-device.h
>> +++ b/include/hw/vfio/vfio-device.h
>> @@ -28,6 +28,7 @@
>>   #endif
>>   #include "system/system.h"
>>   #include "hw/vfio/vfio-container-base.h"
>> +#include "hw/vfio/vfio-cpr.h"
>>   #include "system/host_iommu_device.h"
>>   #include "system/iommufd.h"
>> @@ -84,6 +85,7 @@ typedef struct VFIODevice {
>>       VFIOIOASHwpt *hwpt;
>>       QLIST_ENTRY(VFIODevice) hwpt_next;
>>       struct vfio_region_info **reginfo;
>> +    VFIODeviceCPR cpr;
>>   } VFIODevice;
>>   struct VFIODeviceOps {
> 




* Re: [PATCH V3 10/42] vfio/container: restore DMA vaddr
  2025-05-22  6:37   ` Cédric Le Goater
@ 2025-05-22 14:00     ` Steven Sistare
  0 siblings, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-22 14:00 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/22/2025 2:37 AM, Cédric Le Goater wrote:
> On 5/12/25 17:32, Steve Sistare wrote:
>> In new QEMU, do not register the memory listener at device creation time.
>> Register it later, in the container post_load handler, after all vmstate
>> that may affect regions and mapping boundaries has been loaded.  The
>> post_load registration will cause the listener to invoke its callback on
>> each flat section, and the calls will match the mappings remembered by the
>> kernel.
>>
>> The listener calls a special dma_map handler that passes the new VA of each
>> section to the kernel using VFIO_DMA_MAP_FLAG_VADDR.  Restore the normal
>> handler at the end.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   hw/vfio/container.c  | 15 +++++++++++++--
>>   hw/vfio/cpr-legacy.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 61 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>> index a554683..0e02726 100644
>> --- a/hw/vfio/container.c
>> +++ b/hw/vfio/container.c
>> @@ -137,6 +137,8 @@ static int vfio_legacy_dma_unmap_one(const VFIOContainerBase *bcontainer,
>>       int ret;
>>       Error *local_err = NULL;
>> +    assert(!container->cpr.reused);
> 
> assert -> g_assert

will do.

> this can be called at runtime, which would mean crashing QEMU in case
> of error. Doing an error_report() call is more friendly.

It is an internal error if this assertion is hit, so the state of the system
cannot be trusted.  Hence assert rather than error_report and attempt to recover.
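
A minimal sketch of the distinction, using plain assert() and hypothetical
names (demo_container, demo_unmap) rather than the real QEMU types.  The
invariant is asserted because its violation means a bug, while an ordinary
bad argument is reported to the caller:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the container state; not the QEMU struct. */
struct demo_container {
    bool reused;
};

static int demo_unmap(const struct demo_container *c, size_t size)
{
    /* Internal invariant: unmap must never run while reused is set.
     * If it does, program state is already inconsistent, so abort. */
    assert(!c->reused);

    /* Runtime error: caller-visible and recoverable, so return it. */
    if (size == 0) {
        return -EINVAL;
    }
    return 0;
}
```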

- Steve

>> +
>>       if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer)) {
>>           if (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
>>               bcontainer->dirty_pages_supported) {
>> @@ -691,8 +693,17 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>       }
>>       group_was_added = true;
>> -    if (!vfio_listener_register(bcontainer, errp)) {
>> -        goto fail;
>> +    /*
>> +     * If reused, register the listener later, after all state that may
>> +     * affect regions and mapping boundaries has been cpr load'ed.  Later,
>> +     * the listener will invoke its callback on each flat section and call
>> +     * dma_map to supply the new vaddr, and the calls will match the mappings
>> +     * remembered by the kernel.
>> +     */
>> +    if (!cpr_reused) {
>> +        if (!vfio_listener_register(bcontainer, errp)) {
>> +            goto fail;
>> +        }
>>       }
>>       bcontainer->initialized = true;
>> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
>> index 519d772..bbcf71e 100644
>> --- a/hw/vfio/cpr-legacy.c
>> +++ b/hw/vfio/cpr-legacy.c
>> @@ -11,11 +11,13 @@
>>   #include "hw/vfio/vfio-container.h"
>>   #include "hw/vfio/vfio-cpr.h"
>>   #include "hw/vfio/vfio-device.h"
>> +#include "hw/vfio/vfio-listener.h"
>>   #include "migration/blocker.h"
>>   #include "migration/cpr.h"
>>   #include "migration/migration.h"
>>   #include "migration/vmstate.h"
>>   #include "qapi/error.h"
>> +#include "qemu/error-report.h"
>>   static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>>   {
>> @@ -32,6 +34,34 @@ static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>>       return true;
>>   }
>> +/*
>> + * Set the new @vaddr for any mappings registered during cpr load.
>> + * Reused is cleared thereafter.
>> + */
>> +static int vfio_legacy_cpr_dma_map(const VFIOContainerBase *bcontainer,
>> +                                   hwaddr iova, ram_addr_t size, void *vaddr,
>> +                                   bool readonly)
>> +{
>> +    const VFIOContainer *container = container_of(bcontainer, VFIOContainer,
>> +                                                  bcontainer);
>> +    struct vfio_iommu_type1_dma_map map = {
>> +        .argsz = sizeof(map),
>> +        .flags = VFIO_DMA_MAP_FLAG_VADDR,
>> +        .vaddr = (__u64)(uintptr_t)vaddr,
>> +        .iova = iova,
>> +        .size = size,
>> +    };
>> +
>> +    assert(container->cpr.reused);
>> +
>> +    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
>> +        error_report("vfio_legacy_cpr_dma_map (iova %lu, size %ld, va %p): %s",
>> +                     iova, size, vaddr, strerror(errno));
>> +        return -errno;
>> +    }
>> +
>> +    return 0;
>> +}
>>   static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>>   {
>> @@ -63,12 +93,24 @@ static int vfio_container_pre_save(void *opaque)
>>   static int vfio_container_post_load(void *opaque, int version_id)
>>   {
>>       VFIOContainer *container = opaque;
>> +    VFIOContainerBase *bcontainer = &container->bcontainer;
>>       VFIOGroup *group;
>>       VFIODevice *vbasedev;
>> +    Error *err = NULL;
>> +
>> +    if (!vfio_listener_register(bcontainer, &err)) {
>> +        error_report_err(err);
>> +        return -1;
>> +    }
>>       container->cpr.reused = false;
>>       QLIST_FOREACH(group, &container->group_list, container_next) {
>> +        VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
>> +
>> +        /* Restore original dma_map function */
>> +        vioc->dma_map = vfio_legacy_dma_map;
>> +
>>           QLIST_FOREACH(vbasedev, &group->device_list, next) {
>>               vbasedev->cpr.reused = false;
>>           }
>> @@ -80,6 +122,7 @@ static const VMStateDescription vfio_container_vmstate = {
>>       .name = "vfio-container",
>>       .version_id = 0,
>>       .minimum_version_id = 0,
>> +    .priority = MIG_PRI_LOW,  /* Must happen after devices and groups */
>>       .pre_save = vfio_container_pre_save,
>>       .post_load = vfio_container_post_load,
>>       .needed = cpr_needed_for_reuse,
>> @@ -104,6 +147,11 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
>>       vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>> +    /* During incoming CPR, divert calls to dma_map. */
>> +    if (container->cpr.reused) {
>> +        VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
>> +        vioc->dma_map = vfio_legacy_cpr_dma_map;
>> +    }
>>       return true;
>>   }
> 




* Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
  2025-05-22  3:19             ` Duan, Zhenzhong
@ 2025-05-22 21:11               ` Steven Sistare
  2025-05-23  8:56                 ` Duan, Zhenzhong
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-22 21:11 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/21/2025 11:19 PM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steven Sistare <steven.sistare@oracle.com>
>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>
>> On 5/20/2025 11:11 PM, Duan, Zhenzhong wrote:
>>>> -----Original Message-----
>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>
>>>> On 5/19/2025 11:51 AM, Steven Sistare wrote:
>>>>> On 5/16/2025 4:42 AM, Duan, Zhenzhong wrote:
>>>>>>> -----Original Message-----
>>>>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>>>>> Subject: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>>>>
>>>>>>> Define the change process ioctl
>>>>>>>
>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>> ---
>>>>>>> backends/iommufd.c       | 20 ++++++++++++++++++++
>>>>>>> backends/trace-events    |  1 +
>>>>>>> include/system/iommufd.h |  2 ++
>>>>>>> 3 files changed, 23 insertions(+)
>>>>>>>
>>>>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>>>>> index 5c1958f..6fed1c1 100644
>>>>>>> --- a/backends/iommufd.c
>>>>>>> +++ b/backends/iommufd.c
>>>>>>> @@ -73,6 +73,26 @@ static void iommufd_backend_class_init(ObjectClass *oc, const void *data)
>>>>>>>        object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
>>>>>>> }
>>>>>>>
>>>>>>> +bool iommufd_change_process_capable(IOMMUFDBackend *be)
>>>>>>> +{
>>>>>>> +    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>>>>>>> +
>>>>>>> +    return !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>>>>>>> +}
>>>>>>> +
>>>>>>> +bool iommufd_change_process(IOMMUFDBackend *be, Error **errp)
>>>>>>> +{
>>>>>>> +    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>>>>>>> +    bool ret = !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>>>>>>
>>>>>> This is same ioctl as above check, could it be called more than once for
>>>>>> same process?
>>>>>
>>>>> Yes, and it is a no-op if the process has not changed since the last time DMA
>>>>> was mapped.
>>>>
>>>> More questions?
>>>
>>> Looks a bit redundant for me, meanwhile if iommufd_change_process_capable()
>>> is called on target qemu, may it do both checking and change?
>>>
>>> I would suggest to define only iommufd_change_process() and comment that
>>> it's no-op if process not changed...
>>
>> We need to check if IOMMU_IOAS_CHANGE_PROCESS is allowed before performing
>> live update so we can add a blocker and prevent live update cleanly:
>>
>> vfio_iommufd_cpr_register_container
>>      if !vfio_cpr_supported()        // calls iommufd_change_process_capable
>>          migrate_add_blocker_modes()
> 
> This reminds me of other questions, is this ioctl() suitable for checking if cpr-transfer supported?
> If there is vIOMMU, there can be no mapping and process_capable() check will pass,
> but if memory is not file backed...
> Does cpr-transfer support vIOMMU or not?

I don't know, I have not tried your sample args yet, but I will.
With vIOMMU, what entity/interface pins memory for the vfio device?

> QEMU knows details of all memory backends, why not checking memory backends directly instead of a system call?

IOMMU_IOAS_CHANGE_PROCESS is relatively new. The ioctl verifies that the kernel
supports it.  And if supported, it also verifies that all dma mappings are
of the file type.
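
A minimal sketch of that probing pattern, assuming Linux.  ioctl_probe() is a
hypothetical helper, not QEMU code; it issues the request only to learn
whether it would succeed and hands back errno so the caller can tell the
cases apart (on Linux an unrecognized request typically fails with ENOTTY):

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper: return true if the ioctl succeeds on @fd.
 * On failure, store errno in @err (if non-NULL) so the caller can
 * distinguish "kernel does not know this request" from "request was
 * recognized but refused in the current state".
 */
static bool ioctl_probe(int fd, unsigned long request, void *arg, int *err)
{
    if (ioctl(fd, request, arg) == 0) {
        if (err) {
            *err = 0;
        }
        return true;
    }
    if (err) {
        *err = errno;
    }
    return false;
}
```

In the patch the equivalent call passes be->fd and IOMMU_IOAS_CHANGE_PROCESS,
and any failure is treated as "not capable".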

- Steve

>> How about I just add a comment:
>>
>> bool iommufd_change_process_capable(IOMMUFDBackend *be)
>> {
>>      /*
>>       * Call IOMMU_IOAS_CHANGE_PROCESS to verify it is a recognized ioctl.
>>       * This is a no-op if the process has not changed since DMA was mapped.
>>       */
>>
>> - Steve
> 




* RE: [PATCH V3 29/42] backends/iommufd: change process ioctl
  2025-05-22 21:11               ` Steven Sistare
@ 2025-05-23  8:56                 ` Duan, Zhenzhong
  2025-05-23 14:56                   ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Duan, Zhenzhong @ 2025-05-23  8:56 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steven Sistare <steven.sistare@oracle.com>
>Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>
>On 5/21/2025 11:19 PM, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Steven Sistare <steven.sistare@oracle.com>
>>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>
>>> On 5/20/2025 11:11 PM, Duan, Zhenzhong wrote:
>>>>> -----Original Message-----
>>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>>
>>>>> On 5/19/2025 11:51 AM, Steven Sistare wrote:
>>>>>> On 5/16/2025 4:42 AM, Duan, Zhenzhong wrote:
>>>>>>>> -----Original Message-----
>>>>>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>> Subject: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>>>>>
>>>>>>>> Define the change process ioctl
>>>>>>>>
>>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>> ---
>>>>>>>> backends/iommufd.c       | 20 ++++++++++++++++++++
>>>>>>>> backends/trace-events    |  1 +
>>>>>>>> include/system/iommufd.h |  2 ++
>>>>>>>> 3 files changed, 23 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>>>>>> index 5c1958f..6fed1c1 100644
>>>>>>>> --- a/backends/iommufd.c
>>>>>>>> +++ b/backends/iommufd.c
>>>>>>>> @@ -73,6 +73,26 @@ static void iommufd_backend_class_init(ObjectClass *oc, const void *data)
>>>>>>>>        object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
>>>>>>>> }
>>>>>>>>
>>>>>>>> +bool iommufd_change_process_capable(IOMMUFDBackend *be)
>>>>>>>> +{
>>>>>>>> +    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>>>>>>>> +
>>>>>>>> +    return !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +bool iommufd_change_process(IOMMUFDBackend *be, Error **errp)
>>>>>>>> +{
>>>>>>>> +    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>>>>>>>> +    bool ret = !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>>>>>>>
>>>>>>> This is same ioctl as above check, could it be called more than once for
>>>>>>> same process?
>>>>>>
>>>>>> Yes, and it is a no-op if the process has not changed since the last time
>>>>>> DMA was mapped.
>>>>>
>>>>> More questions?
>>>>
>>>> Looks a bit redundant for me, meanwhile if iommufd_change_process_capable()
>>>> is called on target qemu, may it do both checking and change?
>>>>
>>>> I would suggest to define only iommufd_change_process() and comment that
>>>> it's no-op if process not changed...
>>>
>>> We need to check if IOMMU_IOAS_CHANGE_PROCESS is allowed before performing
>>> live update so we can add a blocker and prevent live update cleanly:
>>>
>>> vfio_iommufd_cpr_register_container
>>>      if !vfio_cpr_supported()        // calls iommufd_change_process_capable
>>>          migrate_add_blocker_modes()
>>
>> This reminds me of other questions, is this ioctl() suitable for checking if
>> cpr-transfer supported?
>> If there is vIOMMU, there can be no mapping and process_capable() check will
>> pass, but if memory is not file backed...
>> Does cpr-transfer support vIOMMU or not?
>
>I don't know, I have not tried your sample args yet, but I will.
>With vIOMMU, what entity/interface pins memory for the vfio device?

Oh, I don't mean virtio-iommu specifically; it can be intel-iommu or virtio-iommu for this issue.
I mean that when the guest attaches a device to a DMA domain, there may be no mappings in that domain initially.

>
>> QEMU knows details of all memory backends, why not checking memory backends
>> directly instead of a system call?
>
>IOMMU_IOAS_CHANGE_PROCESS is relatively new. The ioctl verifies that the kernel
>supports it.  And if supported, it also verifies that all dma mappings are
>of the file type.

But the DMA mappings are dynamic if there is a vIOMMU, so checking DMA mappings checks nothing if there is no mapping in the DMA domain.

>
>- Steve
>
>>> How about I just add a comment:
>>>
>>> bool iommufd_change_process_capable(IOMMUFDBackend *be)
>>> {
>>>      /*
>>>       * Call IOMMU_IOAS_CHANGE_PROCESS to verify it is a recognized ioctl.
>>>       * This is a no-op if the process has not changed since DMA was mapped.
>>>       */
>>>
>>> - Steve
>>



* Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
  2025-05-23  8:56                 ` Duan, Zhenzhong
@ 2025-05-23 14:56                   ` Steven Sistare
  2025-05-23 19:19                     ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-23 14:56 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/23/2025 4:56 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steven Sistare <steven.sistare@oracle.com>
>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>
>> On 5/21/2025 11:19 PM, Duan, Zhenzhong wrote:
>>>> -----Original Message-----
>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>
>>>> On 5/20/2025 11:11 PM, Duan, Zhenzhong wrote:
>>>>>> -----Original Message-----
>>>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>>>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>>>
>>>>>> On 5/19/2025 11:51 AM, Steven Sistare wrote:
>>>>>>> On 5/16/2025 4:42 AM, Duan, Zhenzhong wrote:
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>>> Subject: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>>>>>>
>>>>>>>>> Define the change process ioctl
>>>>>>>>>
>>>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>>> ---
>>>>>>>>> backends/iommufd.c       | 20 ++++++++++++++++++++
>>>>>>>>> backends/trace-events    |  1 +
>>>>>>>>> include/system/iommufd.h |  2 ++
>>>>>>>>> 3 files changed, 23 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>>>>>>> index 5c1958f..6fed1c1 100644
>>>>>>>>> --- a/backends/iommufd.c
>>>>>>>>> +++ b/backends/iommufd.c
>>>>>>>>> @@ -73,6 +73,26 @@ static void iommufd_backend_class_init(ObjectClass *oc,
>>>>>>>>> const void *data)
>>>>>>>>>         object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> +bool iommufd_change_process_capable(IOMMUFDBackend *be)
>>>>>>>>> +{
>>>>>>>>> +    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>>>>>>>>> +
>>>>>>>>> +    return !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +bool iommufd_change_process(IOMMUFDBackend *be, Error **errp)
>>>>>>>>> +{
>>>>>>>>> +    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>>>>>>>>> +    bool ret = !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>>>>>>>>
>>>>>>>> This is same ioctl as above check, could it be called more than once for
>>>>>>>> same process?
>>>>>>>
>>>>>>> Yes, and it is a no-op if the process has not changed since the last time
>>>>>>> DMA was mapped.
>>>>>>
>>>>>> More questions?
>>>>>
>>>>> Looks a bit redundant for me, meanwhile if iommufd_change_process_capable()
>>>>> is called on target qemu, may it do both checking and change?
>>>>>
>>>>> I would suggest to define only iommufd_change_process() and comment that
>>>>> it's no-op if process not changed...
>>>>
>>>> We need to check if IOMMU_IOAS_CHANGE_PROCESS is allowed before performing
>>>> live update so we can add a blocker and prevent live update cleanly:
>>>>
>>>> vfio_iommufd_cpr_register_container
>>>>       if !vfio_cpr_supported()        // calls iommufd_change_process_capable
>>>>           migrate_add_blocker_modes()
>>>
>>> This reminds me of other questions, is this ioctl() suitable for checking if
>>> cpr-transfer supported?
>>> If there is vIOMMU, there can be no mapping and process_capable() check will
>>> pass, but if memory is not file backed...
>>> Does cpr-transfer support vIOMMU or not?
>>
>> I don't know, I have not tried your sample args yet, but I will.
>> With vIOMMU, what entity/interface pins memory for the vfio device?
> 
> Oh, I don't mean virtio-iommu, it can be intel-iommu or virtio-iommu for this issue.
> I mean when guest attach device to a DMA domain, there can be no mapping in that domain initially.
> 
>>
>>> QEMU knows details of all memory backends, why not checking memory
>>> backends directly instead of a system call?
>>
>> IOMMU_IOAS_CHANGE_PROCESS is relatively new. The ioctl verifies that the
>> kernel supports it.  And if supported, it also verifies that all dma mappings
>> are of the file type.
> 
> But the dma mappings are dynamic if there is vIOMMU, so checking dma mappings are checking nothing if there is no mapping in the DMA domain.

Yes, so there are 2 checks:
   * At realize -> cpr register time: if cpr can never work because
     IOMMU_IOAS_CHANGE_PROCESS is not supported, add a blocker.

   * At cpr time, in vfio_container_pre_save: refuse to proceed if
     iommufd_change_process() fails because non-file mappings are present.
     Allow cpr if there are no mappings present.
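
To make the interplay of the two checks concrete, here is a minimal sketch in
plain C.  It is a toy model, not QEMU or kernel code: struct model_ioas,
model_change_process(), model_needs_blocker() and model_pre_save_ok() are
hypothetical names, and the errno values are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the kernel-side state; every name here is hypothetical. */
struct model_ioas {
    bool ioctl_supported;       /* kernel recognizes the change-process ioctl */
    unsigned file_mappings;     /* DMA mappings backed by a file or memfd */
    unsigned nonfile_mappings;  /* DMA mappings backed by anonymous memory */
};

/*
 * Stand-in for ioctl(fd, IOMMU_IOAS_CHANGE_PROCESS, &args).
 * The errno values below are assumptions, not taken from the kernel source.
 */
int model_change_process(const struct model_ioas *ioas)
{
    assert(ioas != NULL);
    if (!ioas->ioctl_supported) {
        errno = ENOTTY;
        return -1;
    }
    if (ioas->nonfile_mappings > 0) {
        errno = EINVAL;
        return -1;
    }
    return 0;   /* succeeds, including when there are no mappings at all */
}

/* Check 1, at realize -> cpr register time: should a blocker be added? */
bool model_needs_blocker(const struct model_ioas *ioas)
{
    return model_change_process(ioas) != 0;
}

/* Check 2, at cpr time (pre-save): may the update proceed? */
bool model_pre_save_ok(const struct model_ioas *ioas)
{
    return model_change_process(ioas) == 0;
}
```

With ioctl_supported set and both mapping counts at zero, neither check
objects, which is the "no mappings present" case in the list above.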

- Steve

>>>> How about I just add a comment:
>>>>
>>>> bool iommufd_change_process_capable(IOMMUFDBackend *be)
>>>> {
>>>>       /*
>>>>        * Call IOMMU_IOAS_CHANGE_PROCESS to verify it is a recognized ioctl.
>>>>        * This is a no-op if the process has not changed since DMA was mapped.
>>>>        */
>>>>
>>>> - Steve
>>>
> 




* Re: [PATCH V3 33/42] vfio/iommufd: define hwpt constructors
  2025-05-19 15:55     ` Steven Sistare
@ 2025-05-23 17:47       ` Steven Sistare
  0 siblings, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-23 17:47 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/19/2025 11:55 AM, Steven Sistare wrote:
> On 5/16/2025 4:55 AM, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Steve Sistare <steven.sistare@oracle.com>
>>> Subject: [PATCH V3 33/42] vfio/iommufd: define hwpt constructors
>>>
>>> Extract hwpt creation code from iommufd_cdev_autodomains_get into the
>>> helpers iommufd_cdev_use_hwpt and iommufd_cdev_make_hwpt.  These will
>>> be used by CPR in a subsequent patch.
>>>
>>> Call vfio_device_hiod_create_and_realize earlier so iommufd_cdev_make_hwpt
>>> can use vbasedev->hiod hw_caps, avoiding an extra call to
>>> iommufd_backend_get_device_info
>>
>> We had made consensus to realize hiod after attachment,
>> it's not a hot path so an extra call is acceptable per Cedric.
> 
> I'll rework it per the consensus, but I suspect the result will be less pretty --
> code duplication, or more conditionals.

I withdraw this patch.  I will instead save hwpt_id in cpr state, and the existing
hwpt functions, called in the existing order, will use it.

- Steve

>>> No functional change.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>> hw/vfio/iommufd.c | 116 ++++++++++++++++++++++++++++++----------------------
>>> -- 
>>> 1 file changed, 65 insertions(+), 51 deletions(-)
>>>
>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>> index f645a62..8661947 100644
>>> --- a/hw/vfio/iommufd.c
>>> +++ b/hw/vfio/iommufd.c
>>> @@ -310,16 +310,70 @@ static bool
>>> iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>>>      return true;
>>> }
>>>
>>> +static void iommufd_cdev_use_hwpt(VFIODevice *vbasedev, VFIOIOASHwpt
>>> *hwpt)
>>> +{
>>> +    vbasedev->hwpt = hwpt;
>>> +    vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
>>> +    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>>> +}
>>> +
>>> +/*
>>> + * iommufd_cdev_make_hwpt: If @alloc_id, allocate a hwpt_id, else use
>>> @hwpt_id.
>>> + * Create and add a hwpt struct to the container's list and to the device.
>>> + * Always succeeds if !@alloc_id.
>>> + */
>>> +static bool iommufd_cdev_make_hwpt(VFIODevice *vbasedev,
>>> +                                   VFIOIOMMUFDContainer *container,
>>> +                                   uint32_t hwpt_id, bool alloc_id,
>>> +                                   Error **errp)
>>> +{
>>> +    VFIOIOASHwpt *hwpt;
>>> +    uint32_t flags = 0;
>>> +
>>> +    /*
>>> +     * This is quite early and VFIO Migration state isn't yet fully
>>> +     * initialized, thus rely only on IOMMU hardware capabilities as to
>>> +     * whether IOMMU dirty tracking is going to be requested. Later
>>> +     * vfio_migration_realize() may decide to use VF dirty tracking
>>> +     * instead.
>>> +     */
>>> +    g_assert(vbasedev->hiod);
>>> +    if (vbasedev->hiod->caps.hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
>>> +        flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>>> +    }
>>> +
>>> +    if (alloc_id) {
>>> +        if (!iommufd_backend_alloc_hwpt(vbasedev->iommufd, vbasedev->devid,
>>> +                                        container->ioas_id, flags,
>>> +                                        IOMMU_HWPT_DATA_NONE, 0, NULL,
>>> +                                        &hwpt_id, errp)) {
>>> +            return false;
>>> +        }
>>> +
>>> +        if (iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt_id, errp)) {
>>> +            iommufd_backend_free_id(container->be, hwpt_id);
>>> +            return false;
>>> +        }
>>> +    }
>>> +
>>> +    hwpt = g_malloc0(sizeof(*hwpt));
>>> +    hwpt->hwpt_id = hwpt_id;
>>> +    hwpt->hwpt_flags = flags;
>>> +    QLIST_INIT(&hwpt->device_list);
>>> +
>>> +    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
>>> +    container->bcontainer.dirty_pages_supported |=
>>> +                                vbasedev->iommu_dirty_tracking;
>>> +    iommufd_cdev_use_hwpt(vbasedev, hwpt);
>>> +    return true;
>>> +}
>>> +
>>> static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>>                                           VFIOIOMMUFDContainer *container,
>>>                                           Error **errp)
>>> {
>>>      ERRP_GUARD();
>>> -    IOMMUFDBackend *iommufd = vbasedev->iommufd;
>>> -    uint32_t type, flags = 0;
>>> -    uint64_t hw_caps;
>>>      VFIOIOASHwpt *hwpt;
>>> -    uint32_t hwpt_id;
>>>      int ret;
>>>
>>>      /* Try to find a domain */
>>> @@ -340,54 +394,14 @@ static bool
>>> iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>>
>>>              return false;
>>>          } else {
>>> -            vbasedev->hwpt = hwpt;
>>> -            QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>>> -            vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
>>> +            iommufd_cdev_use_hwpt(vbasedev, hwpt);
>>>              return true;
>>>          }
>>>      }
>>> -
>>> -    /*
>>> -     * This is quite early and VFIO Migration state isn't yet fully
>>> -     * initialized, thus rely only on IOMMU hardware capabilities as to
>>> -     * whether IOMMU dirty tracking is going to be requested. Later
>>> -     * vfio_migration_realize() may decide to use VF dirty tracking
>>> -     * instead.
>>> -     */
>>> -    if (!iommufd_backend_get_device_info(vbasedev->iommufd, vbasedev->devid,
>>> -                                         &type, NULL, 0, &hw_caps, errp)) {
>>> -        return false;
>>> -    }
>>> -
>>> -    if (hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
>>> -        flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>>> -    }
>>> -
>>> -    if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
>>> -                                    container->ioas_id, flags,
>>> -                                    IOMMU_HWPT_DATA_NONE, 0, NULL,
>>> -                                    &hwpt_id, errp)) {
>>> -        return false;
>>> -    }
>>> -
>>> -    hwpt = g_malloc0(sizeof(*hwpt));
>>> -    hwpt->hwpt_id = hwpt_id;
>>> -    hwpt->hwpt_flags = flags;
>>> -    QLIST_INIT(&hwpt->device_list);
>>> -
>>> -    ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
>>> -    if (ret) {
>>> -        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>>> -        g_free(hwpt);
>>> +    if (!iommufd_cdev_make_hwpt(vbasedev, container, 0, true, errp)) {
>>>          return false;
>>>      }
>>>
>>> -    vbasedev->hwpt = hwpt;
>>> -    vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
>>> -    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>>> -    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
>>> -    container->bcontainer.dirty_pages_supported |=
>>> -                                vbasedev->iommu_dirty_tracking;
>>>      if (container->bcontainer.dirty_pages_supported &&
>>>          !vbasedev->iommu_dirty_tracking) {
>>>          warn_report("IOMMU instance for device %s doesn't support dirty tracking",
>>> @@ -530,6 +544,11 @@ static bool iommufd_cdev_attach(const char *name,
>>> VFIODevice *vbasedev,
>>>
>>>      space = vfio_address_space_get(as);
>>>
>>> +    if (!vfio_device_hiod_create_and_realize(vbasedev,
>>> +            TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO, errp)) {
>>> +        goto err_alloc_ioas;
>>> +    }
>>> +
>>>      /* try to attach to an existing container in this space */
>>>      QLIST_FOREACH(bcontainer, &space->containers, next) {
>>>          container = container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
>>> @@ -604,11 +623,6 @@ found_container:
>>>          goto err_listener_register;
>>>      }
>>>
>>> -    if (!vfio_device_hiod_create_and_realize(vbasedev,
>>> -                     TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO, errp)) {
>>> -        goto err_listener_register;
>>> -    }
>>> -
>>>      /*
>>>       * TODO: examine RAM_BLOCK_DISCARD stuff, should we do group level
>>>       * for discarding incompatibility check as well?
>>> -- 
>>> 1.8.3.1
>>
> 




* Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
  2025-05-23 14:56                   ` Steven Sistare
@ 2025-05-23 19:19                     ` Steven Sistare
  2025-05-26  2:31                       ` Duan, Zhenzhong
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-23 19:19 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/23/2025 10:56 AM, Steven Sistare wrote:
> On 5/23/2025 4:56 AM, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Steven Sistare <steven.sistare@oracle.com>
>>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>
>>> On 5/21/2025 11:19 PM, Duan, Zhenzhong wrote:
>>>>> -----Original Message-----
>>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>>
>>>>> On 5/20/2025 11:11 PM, Duan, Zhenzhong wrote:
>>>>>>> -----Original Message-----
>>>>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>>>>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>>>>
>>>>>>> On 5/19/2025 11:51 AM, Steven Sistare wrote:
>>>>>>>> On 5/16/2025 4:42 AM, Duan, Zhenzhong wrote:
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>>>> Subject: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>>>>>>>
>>>>>>>>>> Define the change process ioctl
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>>>> ---
>>>>>>>>>> backends/iommufd.c       | 20 ++++++++++++++++++++
>>>>>>>>>> backends/trace-events    |  1 +
>>>>>>>>>> include/system/iommufd.h |  2 ++
>>>>>>>>>> 3 files changed, 23 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>>>>>>>> index 5c1958f..6fed1c1 100644
>>>>>>>>>> --- a/backends/iommufd.c
>>>>>>>>>> +++ b/backends/iommufd.c
>>>>>>>>>> @@ -73,6 +73,26 @@ static void iommufd_backend_class_init(ObjectClass *oc,
>>>>>>>>>> const void *data)
>>>>>>>>>>         object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> +bool iommufd_change_process_capable(IOMMUFDBackend *be)
>>>>>>>>>> +{
>>>>>>>>>> +    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>>>>>>>>>> +
>>>>>>>>>> +    return !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +bool iommufd_change_process(IOMMUFDBackend *be, Error **errp)
>>>>>>>>>> +{
>>>>>>>>>> +    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>>>>>>>>>> +    bool ret = !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>>>>>>>>>
>>>>>>>>> This is same ioctl as above check, could it be called more than once for
>>>>>>>>> same process?
>>>>>>>>
>>>>>>>> Yes, and it is a no-op if the process has not changed since the last time
>>>>>>>> DMA was mapped.
>>>>>>>
>>>>>>> More questions?
>>>>>>
>>>>>> Looks a bit redundant for me, meanwhile if iommufd_change_process_capable()
>>>>>> is called on target qemu, may it do both checking and change?
>>>>>>
>>>>>> I would suggest to define only iommufd_change_process() and comment that
>>>>>> it's no-op if process not changed...
>>>>>
>>>>> We need to check if IOMMU_IOAS_CHANGE_PROCESS is allowed before performing
>>>>> live update so we can add a blocker and prevent live update cleanly:
>>>>>
>>>>> vfio_iommufd_cpr_register_container
>>>>>       if !vfio_cpr_supported()        // calls iommufd_change_process_capable
>>>>>           migrate_add_blocker_modes()
>>>>
>>>> This reminds me of other questions, is this ioctl() suitable for checking if
>>>> cpr-transfer supported?
>>>> If there is vIOMMU, there can be no mapping and process_capable() check will
>>>> pass, but if memory is not file backed...
>>>> Does cpr-transfer support vIOMMU or not?
>>>
>>> I don't know, I have not tried your sample args yet, but I will.
>>> With vIOMMU, what entity/interface pins memory for the vfio device?
>>
>> Oh, I don't mean virtio-iommu, it can be intel-iommu or virtio-iommu for this issue.
>> I mean when guest attach device to a DMA domain, there can be no mapping in that domain initially.
>>
>>>
>>>> QEMU knows details of all memory backends, why not checking memory
>>>> backends directly instead of a system call?
>>>
>>> IOMMU_IOAS_CHANGE_PROCESS is relatively new. The ioctl verifies that the
>>> kernel supports it.  And if supported, it also verifies that all dma mappings
>>> are of the file type.
>>
>> But the dma mappings are dynamic if there is vIOMMU, so checking dma mappings are checking nothing if there is no mapping in the DMA domain.
> 
> Yes, so there are 2 checks:
>    * at realize -> cpr register time.  if cpr can never work because
>      IOMMU_IOAS_CHANGE_PROCESS is not supported, then adds a blocker.
> 
>    * at cpr time, in vfio_container_pre_save.  refuses to proceed if
>      iommufd_change_process() fails because non-file mappings are present.
>      Allows cpr if there are no mappings present.
> 

If my explanation makes sense, any chance of getting an RB for this and the
related patch?
   backends/iommufd: change process ioctl
   vfio/iommufd: change process

They are not affected by the other changes we have discussed.

- Steve

>>>>> How about I just add a comment:
>>>>>
>>>>> bool iommufd_change_process_capable(IOMMUFDBackend *be)
>>>>> {
>>>>>       /*
>>>>>        * Call IOMMU_IOAS_CHANGE_PROCESS to verify it is a recognized ioctl.
>>>>>        * This is a no-op if the process has not changed since DMA was mapped.
>>>>>        */
>>>>>
>>>>> - Steve
>>>>
>>
> 




* Re: [PATCH V3 14/42] pci: skip reset during cpr
  2025-05-16  8:19   ` Cédric Le Goater
  2025-05-16 17:58     ` Steven Sistare
@ 2025-05-24  9:34     ` Michael S. Tsirkin
  2025-05-27 20:42       ` Steven Sistare
  1 sibling, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2025-05-24  9:34 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Steve Sistare, qemu-devel, Alex Williamson, Yi Liu, Eric Auger,
	Zhenzhong Duan, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On Fri, May 16, 2025 at 10:19:09AM +0200, Cédric Le Goater wrote:
> On 5/12/25 17:32, Steve Sistare wrote:
> > Do not reset a vfio-pci device during CPR.
> > 
> > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > ---
> >   hw/pci/pci.c | 13 +++++++++++++
> >   1 file changed, 13 insertions(+)
> > 
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index fe38c4c..2ba2e0f 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -32,6 +32,8 @@
> >   #include "hw/pci/pci_host.h"
> >   #include "hw/qdev-properties.h"
> >   #include "hw/qdev-properties-system.h"
> > +#include "migration/cpr.h"
> > +#include "migration/misc.h"
> >   #include "migration/qemu-file-types.h"
> >   #include "migration/vmstate.h"
> >   #include "net/net.h"
> > @@ -537,6 +539,17 @@ static void pci_reset_regions(PCIDevice *dev)
> >   static void pci_do_device_reset(PCIDevice *dev)
> >   {
> > +    /*
> > +     * A PCI device that is resuming for cpr is already configured, so do
> > +     * not reset it here when we are called from qemu_system_reset prior to
> > +     * cpr load, else interrupts may be lost for vfio-pci devices.  It is
> > +     * safe to skip this reset for all PCI devices, because vmstate load will
> > +     * set all fields that would have been set here.
> > +     */
> > +    if (cpr_is_incoming()) {
> 
> Why can't we use cpr_is_incoming() in vfio instead of using an heuristic
> on saved fds?
> 
> Thanks,
> 
> C.

Think I agree.

> 
> 
> > +        return;
> > +    }
> > +
> >       pci_device_deassert_intx(dev);
> >       assert(dev->irq_state == 0);




* RE: [PATCH V3 29/42] backends/iommufd: change process ioctl
  2025-05-23 19:19                     ` Steven Sistare
@ 2025-05-26  2:31                       ` Duan, Zhenzhong
  2025-05-28 13:31                         ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Duan, Zhenzhong @ 2025-05-26  2:31 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steven Sistare <steven.sistare@oracle.com>
>Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>
>On 5/23/2025 10:56 AM, Steven Sistare wrote:
>> On 5/23/2025 4:56 AM, Duan, Zhenzhong wrote:
>>>> -----Original Message-----
>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>
>>>> On 5/21/2025 11:19 PM, Duan, Zhenzhong wrote:
>>>>>> -----Original Message-----
>>>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>>>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>>>
>>>>>> On 5/20/2025 11:11 PM, Duan, Zhenzhong wrote:
>>>>>>>> -----Original Message-----
>>>>>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>>>>>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>>>>>
>>>>>>>> On 5/19/2025 11:51 AM, Steven Sistare wrote:
>>>>>>>>> On 5/16/2025 4:42 AM, Duan, Zhenzhong wrote:
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>>>>> Subject: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>>>>>>>>
>>>>>>>>>>> Define the change process ioctl
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>>>>> ---
>>>>>>>>>>> backends/iommufd.c       | 20 ++++++++++++++++++++
>>>>>>>>>>> backends/trace-events    |  1 +
>>>>>>>>>>> include/system/iommufd.h |  2 ++
>>>>>>>>>>> 3 files changed, 23 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>>>>>>>>> index 5c1958f..6fed1c1 100644
>>>>>>>>>>> --- a/backends/iommufd.c
>>>>>>>>>>> +++ b/backends/iommufd.c
>>>>>>>>>>> @@ -73,6 +73,26 @@ static void iommufd_backend_class_init(ObjectClass *oc,
>>>>>>>>>>> const void *data)
>>>>>>>>>>>         object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> +bool iommufd_change_process_capable(IOMMUFDBackend *be)
>>>>>>>>>>> +{
>>>>>>>>>>> +    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>>>>>>>>>>> +
>>>>>>>>>>> +    return !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +bool iommufd_change_process(IOMMUFDBackend *be, Error **errp)
>>>>>>>>>>> +{
>>>>>>>>>>> +    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>>>>>>>>>>> +    bool ret = !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>>>>>>>>>>
>>>>>>>>>> This is same ioctl as above check, could it be called more than once
>>>>>>>>>> for same process?
>>>>>>>>>
>>>>>>>>> Yes, and it is a no-op if the process has not changed since the last time
>>>>>>>>> DMA was mapped.
>>>>>>>>
>>>>>>>> More questions?
>>>>>>>
>>>>>>> Looks a bit redundant for me, meanwhile if iommufd_change_process_capable()
>>>>>>> is called on target qemu, may it do both checking and change?
>>>>>>>
>>>>>>> I would suggest to define only iommufd_change_process() and comment that
>>>>>>> it's no-op if process not changed...
>>>>>>
>>>>>> We need to check if IOMMU_IOAS_CHANGE_PROCESS is allowed before performing
>>>>>> live update so we can add a blocker and prevent live update cleanly:
>>>>>>
>>>>>> vfio_iommufd_cpr_register_container
>>>>>>       if !vfio_cpr_supported()        // calls iommufd_change_process_capable
>>>>>>           migrate_add_blocker_modes()
>>>>>
>>>>> This reminds me of other questions, is this ioctl() suitable for checking if
>>>>> cpr-transfer supported?
>>>>> If there is vIOMMU, there can be no mapping and process_capable() check will
>>>>> pass, but if memory is not file backed...
>>>>> Does cpr-transfer support vIOMMU or not?
>>>>
>>>> I don't know, I have not tried your sample args yet, but I will.
>>>> With vIOMMU, what entity/interface pins memory for the vfio device?
>>>
>>> Oh, I don't mean virtio-iommu, it can be intel-iommu or virtio-iommu for this
>>> issue.
>>> I mean when guest attach device to a DMA domain, there can be no mapping
>>> in that domain initially.
>>>
>>>>
>>>>> QEMU knows details of all memory backends, why not checking memory
>>>>> backends directly instead of a system call?
>>>>
>>>> IOMMU_IOAS_CHANGE_PROCESS is relatively new. The ioctl verifies that the
>>>> kernel supports it.  And if supported, it also verifies that all dma mappings
>>>> are of the file type.
>>>
>>> But the dma mappings are dynamic if there is vIOMMU, so checking dma
>>> mappings are checking nothing if there is no mapping in the DMA domain.
>>
>> Yes, so there are 2 checks:
>>    * at realize -> cpr register time.  if cpr can never work because
>>      IOMMU_IOAS_CHANGE_PROCESS is not supported, then adds a blocker.
>>
>>    * at cpr time, in vfio_container_pre_save.  refuses to proceed if
>>      iommufd_change_process() fails because non-file mappings are present.
>>      Allows cpr if there are no mappings present.

Let me explain further.

There is a corner case that could bypass the above checks. The source QEMU starts
with a vIOMMU and a non-file memory backend, then a VFIO device is hotplugged. If
the guest driver doesn't set up any mapping, or no guest driver is attached, the
mappings on the host side can be empty, and then both of the above checks will pass.

I'm not sure if that's a case we need to support. If not, feel free to add my RB.
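
For comparison, the alternative raised earlier in this thread (checking memory
backends directly) would not depend on which mappings happen to exist at the
time of the check. A minimal sketch, assuming a hypothetical
struct model_ram_backend and helper name rather than any real QEMU type or
function:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of a guest RAM backend; not a QEMU type. */
struct model_ram_backend {
    const char *name;
    bool file_backed;   /* true for a file or memfd, false for anonymous */
};

/*
 * Return true only if every backend could later be mapped by file,
 * regardless of how many DMA mappings exist right now.
 */
bool model_all_backends_file_backed(const struct model_ram_backend *b,
                                    size_t n)
{
    assert(b != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!b[i].file_backed) {
            return false;
        }
    }
    return true;
}
```

In this model an anonymous backend is rejected even when the DMA domain is
still empty, which is the situation the corner case above describes.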

>>
>
>If my explanation makes sense, any chance of getting an RB for this and the
>related patch?
>   backends/iommufd: change process ioctl
>   vfio/iommufd: change process
>
>They are not affected by the other changes we have discussed.
>
>- Steve
>
>>>>>> How about I just add a comment:
>>>>>>
>>>>>> bool iommufd_change_process_capable(IOMMUFDBackend *be)
>>>>>> {
>>>>>>       /*
>>>>>>        * Call IOMMU_IOAS_CHANGE_PROCESS to verify it is a recognized ioctl.
>>>>>>        * This is a no-op if the process has not changed since DMA was mapped.
>>>>>>        */
>>>>>>
>>>>>> - Steve
>>>>>
>>>
>>



* Re: [PATCH V3 14/42] pci: skip reset during cpr
  2025-05-24  9:34     ` Michael S. Tsirkin
@ 2025-05-27 20:42       ` Steven Sistare
  2025-05-27 21:03         ` Michael S. Tsirkin
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-27 20:42 UTC (permalink / raw)
  To: Michael S. Tsirkin, Cédric Le Goater
  Cc: qemu-devel, Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Marcel Apfelbaum, Peter Xu, Fabiano Rosas



On 5/24/2025 5:34 AM, Michael S. Tsirkin wrote:
> On Fri, May 16, 2025 at 10:19:09AM +0200, Cédric Le Goater wrote:
>> On 5/12/25 17:32, Steve Sistare wrote:
>>> Do not reset a vfio-pci device during CPR.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>>    hw/pci/pci.c | 13 +++++++++++++
>>>    1 file changed, 13 insertions(+)
>>>
>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>> index fe38c4c..2ba2e0f 100644
>>> --- a/hw/pci/pci.c
>>> +++ b/hw/pci/pci.c
>>> @@ -32,6 +32,8 @@
>>>    #include "hw/pci/pci_host.h"
>>>    #include "hw/qdev-properties.h"
>>>    #include "hw/qdev-properties-system.h"
>>> +#include "migration/cpr.h"
>>> +#include "migration/misc.h"
>>>    #include "migration/qemu-file-types.h"
>>>    #include "migration/vmstate.h"
>>>    #include "net/net.h"
>>> @@ -537,6 +539,17 @@ static void pci_reset_regions(PCIDevice *dev)
>>>    static void pci_do_device_reset(PCIDevice *dev)
>>>    {
>>> +    /*
>>> +     * A PCI device that is resuming for cpr is already configured, so do
>>> +     * not reset it here when we are called from qemu_system_reset prior to
>>> +     * cpr load, else interrupts may be lost for vfio-pci devices.  It is
>>> +     * safe to skip this reset for all PCI devices, because vmstate load will
>>> +     * set all fields that would have been set here.
>>> +     */
>>> +    if (cpr_is_incoming()) {
>>
>> Why can't we use cpr_is_incoming() in vfio instead of using an heuristic
>> on saved fds?
>>
>> Thanks,
>>
>> C.
> 
> Think I agree.

OK.  I will delete the "reused" variable everywhere, and use cpr_is_incoming.

Michael, since I already use cpr_is_incoming in this pci patch, can I have
your RB or ack?

- Steve

> 
>>
>>
>>> +        return;
>>> +    }
>>> +
>>>        pci_device_deassert_intx(dev);
>>>        assert(dev->irq_state == 0);
> 




* Re: [PATCH V3 14/42] pci: skip reset during cpr
  2025-05-27 20:42       ` Steven Sistare
@ 2025-05-27 21:03         ` Michael S. Tsirkin
  2025-05-28 16:11           ` Steven Sistare
  0 siblings, 1 reply; 157+ messages in thread
From: Michael S. Tsirkin @ 2025-05-27 21:03 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Cédric Le Goater, qemu-devel, Alex Williamson, Yi Liu,
	Eric Auger, Zhenzhong Duan, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas

On Tue, May 27, 2025 at 04:42:16PM -0400, Steven Sistare wrote:
> 
> 
> On 5/24/2025 5:34 AM, Michael S. Tsirkin wrote:
> > On Fri, May 16, 2025 at 10:19:09AM +0200, Cédric Le Goater wrote:
> > > On 5/12/25 17:32, Steve Sistare wrote:
> > > > Do not reset a vfio-pci device during CPR.
> > > > 
> > > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > > > ---
> > > >    hw/pci/pci.c | 13 +++++++++++++
> > > >    1 file changed, 13 insertions(+)
> > > > 
> > > > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > > > index fe38c4c..2ba2e0f 100644
> > > > --- a/hw/pci/pci.c
> > > > +++ b/hw/pci/pci.c
> > > > @@ -32,6 +32,8 @@
> > > >    #include "hw/pci/pci_host.h"
> > > >    #include "hw/qdev-properties.h"
> > > >    #include "hw/qdev-properties-system.h"
> > > > +#include "migration/cpr.h"
> > > > +#include "migration/misc.h"
> > > >    #include "migration/qemu-file-types.h"
> > > >    #include "migration/vmstate.h"
> > > >    #include "net/net.h"
> > > > @@ -537,6 +539,17 @@ static void pci_reset_regions(PCIDevice *dev)
> > > >    static void pci_do_device_reset(PCIDevice *dev)
> > > >    {
> > > > +    /*
> > > > +     * A PCI device that is resuming for cpr is already configured, so do
> > > > +     * not reset it here when we are called from qemu_system_reset prior to
> > > > +     * cpr load, else interrupts may be lost for vfio-pci devices.  It is
> > > > +     * safe to skip this reset for all PCI devices, because vmstate load will
> > > > +     * set all fields that would have been set here.
> > > > +     */
> > > > +    if (cpr_is_incoming()) {
> > > 
> > > Why can't we use cpr_is_incoming() in vfio instead of using a heuristic
> > > on saved fds?
> > > 
> > > Thanks,
> > > 
> > > C.
> > 
> > Think I agree.
> 
> OK.  I will delete the "reused" variable everywhere, and use cpr_is_incoming.
> 
> Michael, since I already use cpr_is_incoming in this pci patch, can I have
> your RB or ack?
> 
> - Steve

My problem is not with cpr_is_incoming as such.

First, this comment is a very low-level thing to say in common pci code.
vfio will change and we will not remember to keep this up to date.

Second, do we really know that vmload for all devices sets all fields, as
opposed to assuming that qemu_system_reset cleared them?  If not, this
introduces an information leak.

It feels safer to just add a way for VFIO to opt out of
(all or part of) reset, instead.

> > 
> > > 
> > > 
> > > > +        return;
> > > > +    }
> > > > +
> > > >        pci_device_deassert_intx(dev);
> > > >        assert(dev->irq_state == 0);
> > 




* Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
  2025-05-26  2:31                       ` Duan, Zhenzhong
@ 2025-05-28 13:31                         ` Steven Sistare
  2025-05-30  9:56                           ` Duan, Zhenzhong
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-28 13:31 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/25/2025 10:31 PM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steven Sistare <steven.sistare@oracle.com>
>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>
>> On 5/23/2025 10:56 AM, Steven Sistare wrote:
>>> On 5/23/2025 4:56 AM, Duan, Zhenzhong wrote:
>>>>> -----Original Message-----
>>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>>
>>>>> On 5/21/2025 11:19 PM, Duan, Zhenzhong wrote:
>>>>>>> -----Original Message-----
>>>>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>>>>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>>>>
>>>>>>> On 5/20/2025 11:11 PM, Duan, Zhenzhong wrote:
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>>>>>>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>>>>>>
>>>>>>>>> On 5/19/2025 11:51 AM, Steven Sistare wrote:
>>>>>>>>>> On 5/16/2025 4:42 AM, Duan, Zhenzhong wrote:
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>>>>>> Subject: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>>>>>>>>>
>>>>>>>>>>>> Define the change process ioctl
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>> backends/iommufd.c       | 20 ++++++++++++++++++++
>>>>>>>>>>>> backends/trace-events    |  1 +
>>>>>>>>>>>> include/system/iommufd.h |  2 ++
>>>>>>>>>>>> 3 files changed, 23 insertions(+)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>>>>>>>>>> index 5c1958f..6fed1c1 100644
>>>>>>>>>>>> --- a/backends/iommufd.c
>>>>>>>>>>>> +++ b/backends/iommufd.c
>>>>>>>>>>>> @@ -73,6 +73,26 @@ static void
>>>>>>> iommufd_backend_class_init(ObjectClass
>>>>>>>>> *oc,
>>>>>>>>>>>> const void *data)
>>>>>>>>>>>>          object_class_property_add_str(oc, "fd", NULL,
>>>>>>> iommufd_backend_set_fd);
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> +bool iommufd_change_process_capable(IOMMUFDBackend *be)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>>>>>>>>>>>> +
>>>>>>>>>>>> +    return !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>> +bool iommufd_change_process(IOMMUFDBackend *be, Error
>> **errp)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +    struct iommu_ioas_change_process args = {.size = sizeof(args)};
>>>>>>>>>>>> +    bool ret = !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS,
>> &args);
>>>>>>>>>>>
>>>>>>>>>>> This is same ioctl as above check, could it be called more than once
>> for
>>>>>>> same
>>>>>>>>> process?
>>>>>>>>>>
>>>>>>>>>> Yes, and it is a no-op if the process has not changed since the last time
>>>>> DMA
>>>>>>>>>> was mapped.
>>>>>>>>>
>>>>>>>>> More questions?
>>>>>>>>
>>>>>>>> Looks a bit redundant for me, meanwhile if
>>>>> iommufd_change_process_capable()
>>>>>>> is called on target qemu, may it do both checking and change?
>>>>>>>>
>>>>>>>> I would suggest to define only iommufd_change_process() and comment
>> that
>>>>>>> it's no-op if process not changed...
>>>>>>>
>>>>>>> We need to check if IOMMU_IOAS_CHANGE_PROCESS is allowed before
>>>>>>> performing
>>>>>>> live update so we can add a blocker and prevent live update cleanly:
>>>>>>>
>>>>>>> vfio_iommufd_cpr_register_container
>>>>>>>        if !vfio_cpr_supported()        // calls iommufd_change_process_capable
>>>>>>>            migrate_add_blocker_modes()
>>>>>>
>>>>>> This reminds me of other questions, is this ioctl() suitable for checking if cpr-
>>>>> transfer supported?
>>>>>> If there is vIOMMU, there can be no mapping and process_capable() check
>> will
>>>>> pass,
>>>>>> but if memory is not file backed...
>>>>>> Does cpr-transfer support vIOMMU or not?
>>>>>
>>>>> I don't know, I have not tried your sample args yet, but I will.
>>>>> With vIOMMU, what entity/interface pins memory for the vfio device?
>>>>
>>>> Oh, I don't mean virtio-iommu, it can be intel-iommu or virtio-iommu for this
>> issue.
>>>> I mean when guest attach device to a DMA domain, there can be no mapping
>> in that domain initially.
>>>>
>>>>>
>>>>>> QEMU knows details of all memory backends, why not checking memory
>>>>> backends directly instead of a system call?
>>>>>
>>>>> IOMMU_IOAS_CHANGE_PROCESS is relatively new. The ioctl verifies that the
>>>>> kernel
>>>>> supports it.  And if supported, it also verifies that all dma mappings are
>>>>> of the file type.
>>>>
>>>> But the dma mappings are dynamic if there is vIOMMU, so checking dma
>> mappings are checking nothing if there is no mapping in the DMA domain.
>>>
>>> Yes, so there are 2 checks:
>>>     * at realize -> cpr register time.  if cpr can never work because
>>>       IOMMU_IOAS_CHANGE_PROCESS is not supported, then adds a blocker.
>>>
>>>     * at cpr time, in vfio_container_pre_save.  refuses to proceed if
>>>       iommufd_change_process() fails because non-file mappings are present.
>>>       Allows cpr if there are no mappings present.
> 
> Let me explain further.
> 
> There is a corner case that could bypass the above checks. Source qemu starts
> with vIOMMU and a non-file memory backend, then a VFIO device is hotplugged.
> If the guest driver doesn't set up any mapping, or no guest driver is attached,
> the mapping on the host side can be empty, and both of the above checks pass.

That is OK.  CPR is allowed in that case and succeeds because iommufd_change_process
has nothing to do.

However, after CPR, if non-file mappings are added, then the next CPR operation
would be blocked.
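
To make the two checks concrete, here is a minimal self-contained model in C.
It does not issue the real ioctl.  The names model_backend,
model_change_process, model_needs_blocker and model_pre_save_ok are
hypothetical stand-ins for illustration only.  The behavior modeled is just
what is described in this thread: the request fails if the kernel does not
recognize it or if any non-file mapping is present, and it succeeds as a no-op
when there are no mappings at all.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an iommufd backend; not QEMU's IOMMUFDBackend. */
struct model_backend {
    bool kernel_supports_change_process; /* models the ioctl being recognized */
    int file_mappings;                   /* DMA mappings backed by a file */
    int anon_mappings;                   /* DMA mappings not backed by a file */
};

/*
 * Models the outcome of IOMMU_IOAS_CHANGE_PROCESS as described in the thread:
 * fails if unsupported or if any non-file mapping exists, and succeeds
 * trivially when no mappings are present.
 */
static bool model_change_process(const struct model_backend *be)
{
    assert(be->file_mappings >= 0 && be->anon_mappings >= 0);
    if (!be->kernel_supports_change_process) {
        return false;
    }
    return be->anon_mappings == 0;
}

/* Check 1, at realize -> cpr register time: should a blocker be added? */
static bool model_needs_blocker(const struct model_backend *be)
{
    return !model_change_process(be);
}

/* Check 2, at cpr time (pre_save): may this CPR operation proceed? */
static bool model_pre_save_ok(const struct model_backend *be)
{
    return model_change_process(be);
}
```

In this model the corner case above (no mappings at all) passes both checks,
and a later non-file mapping makes only the second check fail.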

- Steve

> I'm not sure if that's a case we need to support. If not, feel free to add my RB.
> 
>>>
>>
>> If my explanation makes sense, any chance of getting an RB for this and the
>> related patch?
>>    backends/iommufd: change process ioctl
>>    vfio/iommufd: change process
>>
>> They are not affected by the other changes we have discussed.
>>
>> - Steve
>>
>>>>>>> How about I just add a comment:
>>>>>>>
>>>>>>> bool iommufd_change_process_capable(IOMMUFDBackend *be)
>>>>>>> {
>>>>>>>        /*
>>>>>>>         * Call IOMMU_IOAS_CHANGE_PROCESS to verify it is a recognized
>> ioctl.
>>>>>>>         * This is a no-op if the process has not changed since DMA was
>> mapped.
>>>>>>>         */
>>>>>>>
>>>>>>> - Steve
>>>>>>
>>>>
>>>
> 




* Re: [PATCH V3 14/42] pci: skip reset during cpr
  2025-05-27 21:03         ` Michael S. Tsirkin
@ 2025-05-28 16:11           ` Steven Sistare
  0 siblings, 0 replies; 157+ messages in thread
From: Steven Sistare @ 2025-05-28 16:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Cédric Le Goater, qemu-devel, Alex Williamson, Yi Liu,
	Eric Auger, Zhenzhong Duan, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas

On 5/27/2025 5:03 PM, Michael S. Tsirkin wrote:
> On Tue, May 27, 2025 at 04:42:16PM -0400, Steven Sistare wrote:
>> On 5/24/2025 5:34 AM, Michael S. Tsirkin wrote:
>>> On Fri, May 16, 2025 at 10:19:09AM +0200, Cédric Le Goater wrote:
>>>> On 5/12/25 17:32, Steve Sistare wrote:
>>>>> Do not reset a vfio-pci device during CPR.
>>>>>
>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>> ---
>>>>>     hw/pci/pci.c | 13 +++++++++++++
>>>>>     1 file changed, 13 insertions(+)
>>>>>
>>>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>>>> index fe38c4c..2ba2e0f 100644
>>>>> --- a/hw/pci/pci.c
>>>>> +++ b/hw/pci/pci.c
>>>>> @@ -32,6 +32,8 @@
>>>>>     #include "hw/pci/pci_host.h"
>>>>>     #include "hw/qdev-properties.h"
>>>>>     #include "hw/qdev-properties-system.h"
>>>>> +#include "migration/cpr.h"
>>>>> +#include "migration/misc.h"
>>>>>     #include "migration/qemu-file-types.h"
>>>>>     #include "migration/vmstate.h"
>>>>>     #include "net/net.h"
>>>>> @@ -537,6 +539,17 @@ static void pci_reset_regions(PCIDevice *dev)
>>>>>     static void pci_do_device_reset(PCIDevice *dev)
>>>>>     {
>>>>> +    /*
>>>>> +     * A PCI device that is resuming for cpr is already configured, so do
>>>>> +     * not reset it here when we are called from qemu_system_reset prior to
>>>>> +     * cpr load, else interrupts may be lost for vfio-pci devices.  It is
>>>>> +     * safe to skip this reset for all PCI devices, because vmstate load will
>>>>> +     * set all fields that would have been set here.
>>>>> +     */
>>>>> +    if (cpr_is_incoming()) {
>>>>
>>>> Why can't we use cpr_is_incoming() in vfio instead of using a heuristic
>>>> on saved fds?
>>>>
>>>> Thanks,
>>>>
>>>> C.
>>>
>>> Think I agree.
>>
>> OK.  I will delete the "reused" variable everywhere, and use cpr_is_incoming.
>>
>> Michael, since I already use cpr_is_incoming in this pci patch, can I have
>> your RB or ack?
>>
>> - Steve
> 
> My problem is not with cpr_is_incoming as such.
> 
> First, this comment is a very low-level thing to say in common pci code.
> vfio will change and we will not remember to keep this up to date.
> 
> Second, do we really know that vmload for all devices sets all fields, as
> opposed to assuming that qemu_system_reset cleared them?  If not, this
> introduces an information leak.
> 
> It feels safer to just add a way for VFIO to opt out of
> (all or part of) reset, instead.

Thanks very much for the feedback.  How about:

hw/vfio/pci.c
vfio_instance_init()
     /*
      * A device that is resuming for cpr is already configured, so do not
      * reset it during qemu_system_reset prior to cpr load, else interrupts
      * may be lost.
      */
     pci_dev->skip_reset_on_cpr = true;

hw/pci/pci.c
pci_do_device_reset()
     if (dev->skip_reset_on_cpr && cpr_is_incoming()) {
         return;
     }
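
As a rough illustration of the intended behavior, here is a small standalone
model in C.  It is only a sketch: model_pci_dev, model_cpr_incoming and
model_pci_do_device_reset are hypothetical stand-ins, not the QEMU types or
functions, and irq_state merely represents any state that a reset would
clear.  The point is that only a device that opted in skips the reset, and
only while a cpr load is incoming.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical miniature of a PCI device; not QEMU's PCIDevice. */
struct model_pci_dev {
    bool skip_reset_on_cpr;  /* opt-in flag proposed above */
    int irq_state;           /* stands in for state cleared by reset */
    int reset_count;         /* how many times the reset body actually ran */
};

/* Stand-in for cpr_is_incoming(); a plain global so the model is testable. */
static bool model_cpr_incoming;

/*
 * Models pci_do_device_reset(): devices that set the flag skip the reset
 * while cpr is incoming; every other device is reset as before.
 */
static void model_pci_do_device_reset(struct model_pci_dev *dev)
{
    assert(dev);
    if (dev->skip_reset_on_cpr && model_cpr_incoming) {
        return;
    }
    dev->irq_state = 0;
    dev->reset_count++;
}
```

With this shape, the common pci code no longer needs to know why a device
wants to skip, and devices that never set the flag keep today's behavior.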

- Steve




* Re: [PATCH V3 22/42] vfio-pci: preserve MSI
  2025-05-12 15:32 ` [PATCH V3 22/42] vfio-pci: preserve MSI Steve Sistare
@ 2025-05-28 17:44   ` Steven Sistare
  2025-06-01 17:28     ` Cédric Le Goater
  0 siblings, 1 reply; 157+ messages in thread
From: Steven Sistare @ 2025-05-28 17:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas

Hi Cedric,
   Do you have any comments on this before I send V4?
Ditto for patch "vfio-pci: preserve INTx".
In both, I made the changes you requested in V2.
And I will change all "reused" tests to cpr_is_incoming as we
discussed elsewhere.

You mentioned these possibly conflict with vfio-user, but it would
help to get your stylistic and correctness comments on these from a
cpr-only point of view before I send the next version.

And as I mentioned, I propose to block CPR when vfio-user is used,
at least initially, so you can ignore vfio-user in the cpr load paths below.

- Steve

On 5/12/2025 11:32 AM, Steve Sistare wrote:
> Save the MSI message area as part of vfio-pci vmstate, and preserve the
> interrupt and notifier eventfd's.  migrate_incoming loads the MSI data,
> then the vfio-pci post_load handler finds the eventfds in CPR state,
> rebuilds vector data structures, and attaches the interrupts to the new
> KVM instance.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   hw/vfio/cpr.c              | 91 ++++++++++++++++++++++++++++++++++++++++++++++
>   hw/vfio/pci.c              | 40 ++++++++++++++++++--
>   include/hw/vfio/vfio-cpr.h |  8 ++++
>   3 files changed, 136 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> index 6ea8e9f..be132fa 100644
> --- a/hw/vfio/cpr.c
> +++ b/hw/vfio/cpr.c
> @@ -9,6 +9,8 @@
>   #include "hw/vfio/vfio-device.h"
>   #include "hw/vfio/vfio-cpr.h"
>   #include "hw/vfio/pci.h"
> +#include "hw/pci/msix.h"
> +#include "hw/pci/msi.h"
>   #include "migration/cpr.h"
>   #include "qapi/error.h"
>   #include "system/runstate.h"
> @@ -40,6 +42,69 @@ void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
>       migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
>   }
>   
> +#define STRDUP_VECTOR_FD_NAME(vdev, name)   \
> +    g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
> +
> +void vfio_cpr_save_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr,
> +                             int fd)
> +{
> +    g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
> +    cpr_save_fd(fdname, nr, fd);
> +}
> +
> +int vfio_cpr_load_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr)
> +{
> +    g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
> +    return cpr_find_fd(fdname, nr);
> +}
> +
> +void vfio_cpr_delete_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr)
> +{
> +    g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
> +    cpr_delete_fd(fdname, nr);
> +}
> +
> +static void vfio_cpr_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors,
> +                                   bool msix)
> +{
> +    int i, fd;
> +    bool pending = false;
> +    PCIDevice *pdev = &vdev->pdev;
> +
> +    vdev->nr_vectors = nr_vectors;
> +    vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
> +    vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
> +
> +    vfio_prepare_kvm_msi_virq_batch(vdev);
> +
> +    for (i = 0; i < nr_vectors; i++) {
> +        VFIOMSIVector *vector = &vdev->msi_vectors[i];
> +
> +        fd = vfio_cpr_load_vector_fd(vdev, "interrupt", i);
> +        if (fd >= 0) {
> +            vfio_vector_init(vdev, i);
> +            qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
> +        }
> +
> +        if (vfio_cpr_load_vector_fd(vdev, "kvm_interrupt", i) >= 0) {
> +            vfio_add_kvm_msi_virq(vdev, vector, i, msix);
> +        } else {
> +            vdev->msi_vectors[i].virq = -1;
> +        }
> +
> +        if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
> +            set_bit(i, vdev->msix->pending);
> +            pending = true;
> +        }
> +    }
> +
> +    vfio_commit_kvm_msi_virq_batch(vdev);
> +
> +    if (msix) {
> +        memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
> +    }
> +}
> +
>   /*
>    * The kernel may change non-emulated config bits.  Exclude them from the
>    * changed-bits check in get_pci_config_device.
> @@ -58,13 +123,39 @@ static int vfio_cpr_pci_pre_load(void *opaque)
>       return 0;
>   }
>   
> +static int vfio_cpr_pci_post_load(void *opaque, int version_id)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    PCIDevice *pdev = &vdev->pdev;
> +    int nr_vectors;
> +
> +    if (msix_enabled(pdev)) {
> +        msix_set_vector_notifiers(pdev, vfio_msix_vector_use,
> +                                   vfio_msix_vector_release, NULL);
> +        nr_vectors = vdev->msix->entries;
> +        vfio_cpr_claim_vectors(vdev, nr_vectors, true);
> +
> +    } else if (msi_enabled(pdev)) {
> +        nr_vectors = msi_nr_vectors_allocated(pdev);
> +        vfio_cpr_claim_vectors(vdev, nr_vectors, false);
> +
> +    } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
> +        g_assert_not_reached();      /* completed in a subsequent patch */
> +    }
> +
> +    return 0;
> +}
> +
>   const VMStateDescription vfio_cpr_pci_vmstate = {
>       .name = "vfio-cpr-pci",
>       .version_id = 0,
>       .minimum_version_id = 0,
>       .pre_load = vfio_cpr_pci_pre_load,
> +    .post_load = vfio_cpr_pci_post_load,
>       .needed = cpr_needed_for_reuse,
>       .fields = (VMStateField[]) {
> +        VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
> +        VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
>           VMSTATE_END_OF_LIST()
>       }
>   };
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 1bca415..bfa72bc 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -29,6 +29,7 @@
>   #include "hw/pci/pci_bridge.h"
>   #include "hw/qdev-properties.h"
>   #include "hw/qdev-properties-system.h"
> +#include "hw/vfio/vfio-cpr.h"
>   #include "migration/vmstate.h"
>   #include "qobject/qdict.h"
>   #include "qemu/error-report.h"
> @@ -56,13 +57,25 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
>   static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
>   static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
>   
> +/* Create new or reuse existing eventfd */
>   static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
>                                  const char *name, int nr, Error **errp)
>   {
> -    int ret = event_notifier_init(e, 0);
> +    int fd = vfio_cpr_load_vector_fd(vdev, name, nr);
> +    int ret = 0;
>   
> -    if (ret) {
> -        error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
> +    if (fd >= 0) {
> +        event_notifier_init_fd(e, fd);
> +    } else {
> +        ret = event_notifier_init(e, 0);
> +        if (ret) {
> +            error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
> +        } else {
> +            fd = event_notifier_get_fd(e);
> +            if (fd >= 0) {
> +                vfio_cpr_save_vector_fd(vdev, name, nr, fd);
> +            }
> +        }
>       }
>       return !ret;
>   }
> @@ -70,6 +83,7 @@ static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
>   static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
>                                     const char *name, int nr)
>   {
> +    vfio_cpr_delete_vector_fd(vdev, name, nr);
>       event_notifier_cleanup(e);
>   }
>   
> @@ -554,6 +568,15 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>       int ret;
>       bool resizing = !!(vdev->nr_vectors < nr + 1);
>   
> +    /*
> +     * Ignore the callback from msix_set_vector_notifiers during resume.
> +     * The necessary subset of these actions is called from
> +     * vfio_cpr_claim_vectors during post load.
> +     */
> +    if (vdev->vbasedev.cpr.reused) {
> +        return 0;
> +    }
> +
>       trace_vfio_msix_vector_do_use(vdev->vbasedev.name, nr);
>   
>       vector = &vdev->msi_vectors[nr];
> @@ -2937,6 +2960,11 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
>       fd = event_notifier_get_fd(&vdev->err_notifier);
>       qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
>   
> +    /* Do not alter irq_signaling during vfio_realize for cpr */
> +    if (vdev->vbasedev.cpr.reused) {
> +        return;
> +    }
> +
>       if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_ERR_IRQ_INDEX, 0,
>                                          VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
>           error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
> @@ -3004,6 +3032,12 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
>       fd = event_notifier_get_fd(&vdev->req_notifier);
>       qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
>   
> +    /* Do not alter irq_signaling during vfio_realize for cpr */
> +    if (vdev->vbasedev.cpr.reused) {
> +        vdev->req_enabled = true;
> +        return;
> +    }
> +
>       if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_REQ_IRQ_INDEX, 0,
>                                          VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
>           error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
> index e93600f..765e334 100644
> --- a/include/hw/vfio/vfio-cpr.h
> +++ b/include/hw/vfio/vfio-cpr.h
> @@ -28,6 +28,7 @@ typedef struct VFIODeviceCPR {
>   struct VFIOContainer;
>   struct VFIOContainerBase;
>   struct VFIOGroup;
> +struct VFIOPCIDevice;
>   
>   bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>                                           Error **errp);
> @@ -49,6 +50,13 @@ void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
>   bool vfio_cpr_ram_discard_register_listener(
>       struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
>   
> +void vfio_cpr_save_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
> +                             int nr, int fd);
> +int vfio_cpr_load_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
> +                            int nr);
> +void vfio_cpr_delete_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
> +                               int nr);
> +
>   extern const VMStateDescription vfio_cpr_pci_vmstate;
>   
>   #endif /* HW_VFIO_VFIO_CPR_H */
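
For readers skimming the patch, the create-or-reuse pattern in
vfio_notifier_init can be sketched in isolation as follows.  This is a
simplified model under stated assumptions, not the QEMU code: the model_*
names, the fixed-size table and the fake descriptor numbers are all
hypothetical.  What it keeps from the patch is the key format
"<device name>_<notifier name>" plus a vector number, and the rule that a
saved fd is reused if found, otherwise a new one is created and saved.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MODEL_MAX_FDS 16

/* Hypothetical saved-fd table keyed by (name, id); models CPR state only. */
struct model_fd_entry {
    char name[64];
    int id;
    int fd;
    int used;
};

static struct model_fd_entry model_fds[MODEL_MAX_FDS];
static int model_next_fd = 100;   /* fake descriptor numbers for the sketch */

/* Return the saved fd for (name, id), or -1 if none was saved. */
static int model_find_fd(const char *name, int id)
{
    for (int i = 0; i < MODEL_MAX_FDS; i++) {
        if (model_fds[i].used && model_fds[i].id == id &&
            strcmp(model_fds[i].name, name) == 0) {
            return model_fds[i].fd;
        }
    }
    return -1;
}

/* Record fd under (name, id) in the first free slot. */
static void model_save_fd(const char *name, int id, int fd)
{
    for (int i = 0; i < MODEL_MAX_FDS; i++) {
        if (!model_fds[i].used) {
            snprintf(model_fds[i].name, sizeof(model_fds[i].name), "%s", name);
            model_fds[i].id = id;
            model_fds[i].fd = fd;
            model_fds[i].used = 1;
            return;
        }
    }
    assert(!"model fd table full");
}

/* Reuse a saved fd if present, else create one and save it for next time. */
static int model_notifier_init(const char *dev, const char *name, int nr)
{
    char key[64];
    int fd;

    snprintf(key, sizeof(key), "%s_%s", dev, name);
    fd = model_find_fd(key, nr);
    if (fd < 0) {
        fd = model_next_fd++;
        model_save_fd(key, nr, fd);
    }
    return fd;
}
```

In the model, calling model_notifier_init twice with the same device, name and
vector number returns the same descriptor, which is the property new QEMU
relies on when it finds the eventfds in CPR state.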




* RE: [PATCH V3 29/42] backends/iommufd: change process ioctl
  2025-05-28 13:31                         ` Steven Sistare
@ 2025-05-30  9:56                           ` Duan, Zhenzhong
  0 siblings, 0 replies; 157+ messages in thread
From: Duan, Zhenzhong @ 2025-05-30  9:56 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steven Sistare <steven.sistare@oracle.com>
>Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>
>On 5/25/2025 10:31 PM, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Steven Sistare <steven.sistare@oracle.com>
>>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>
>>> On 5/23/2025 10:56 AM, Steven Sistare wrote:
>>>> On 5/23/2025 4:56 AM, Duan, Zhenzhong wrote:
>>>>>> -----Original Message-----
>>>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>>>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>>>
>>>>>> On 5/21/2025 11:19 PM, Duan, Zhenzhong wrote:
>>>>>>>> -----Original Message-----
>>>>>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>>>>>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process ioctl
>>>>>>>>
>>>>>>>> On 5/20/2025 11:11 PM, Duan, Zhenzhong wrote:
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>>>>>>>> Subject: Re: [PATCH V3 29/42] backends/iommufd: change process
>ioctl
>>>>>>>>>>
>>>>>>>>>> On 5/19/2025 11:51 AM, Steven Sistare wrote:
>>>>>>>>>>> On 5/16/2025 4:42 AM, Duan, Zhenzhong wrote:
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>>>>>>> Subject: [PATCH V3 29/42] backends/iommufd: change process
>ioctl
>>>>>>>>>>>>>
>>>>>>>>>>>>> Define the change process ioctl
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>> backends/iommufd.c       | 20 ++++++++++++++++++++
>>>>>>>>>>>>> backends/trace-events    |  1 +
>>>>>>>>>>>>> include/system/iommufd.h |  2 ++
>>>>>>>>>>>>> 3 files changed, 23 insertions(+)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>>>>>>>>>>> index 5c1958f..6fed1c1 100644
>>>>>>>>>>>>> --- a/backends/iommufd.c
>>>>>>>>>>>>> +++ b/backends/iommufd.c
>>>>>>>>>>>>> @@ -73,6 +73,26 @@ static void
>>>>>>>> iommufd_backend_class_init(ObjectClass
>>>>>>>>>> *oc,
>>>>>>>>>>>>> const void *data)
>>>>>>>>>>>>>          object_class_property_add_str(oc, "fd", NULL,
>>>>>>>> iommufd_backend_set_fd);
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> +bool iommufd_change_process_capable(IOMMUFDBackend *be)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +    struct iommu_ioas_change_process args = {.size =
>sizeof(args)};
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +    return !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
>>>>>>>>>>>>> +}
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +bool iommufd_change_process(IOMMUFDBackend *be, Error
>>> **errp)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +    struct iommu_ioas_change_process args = {.size =
>sizeof(args)};
>>>>>>>>>>>>> +    bool ret = !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS,
>>> &args);
>>>>>>>>>>>>
>>>>>>>>>>>> This is same ioctl as above check, could it be called more than once
>>> for
>>>>>>>> same
>>>>>>>>>> process?
>>>>>>>>>>>
>>>>>>>>>>> Yes, and it is a no-op if the process has not changed since the last
>time
>>>>>> DMA
>>>>>>>>>>> was mapped.
>>>>>>>>>>
>>>>>>>>>> More questions?
>>>>>>>>>
>>>>>>>>> Looks a bit redundant for me, meanwhile if
>>>>>> iommufd_change_process_capable()
>>>>>>>> is called on target qemu, may it do both checking and change?
>>>>>>>>>
>>>>>>>>> I would suggest to define only iommufd_change_process() and
>comment
>>> that
>>>>>>>> it's no-op if process not changed...
>>>>>>>>
>>>>>>>> We need to check if IOMMU_IOAS_CHANGE_PROCESS is allowed before
>>>>>>>> performing
>>>>>>>> live update so we can add a blocker and prevent live update cleanly:
>>>>>>>>
>>>>>>>> vfio_iommufd_cpr_register_container
>>>>>>>>        if !vfio_cpr_supported()        // calls
>iommufd_change_process_capable
>>>>>>>>            migrate_add_blocker_modes()
>>>>>>>
>>>>>>> This reminds me of other questions, is this ioctl() suitable for checking if
>cpr-
>>>>>> transfer supported?
>>>>>>> If there is vIOMMU, there can be no mapping and process_capable()
>check
>>> will
>>>>>> pass,
>>>>>>> but if memory is not file backed...
>>>>>>> Does cpr-transfer support vIOMMU or not?
>>>>>>
>>>>>> I don't know, I have not tried your sample args yet, but I will.
>>>>>> With vIOMMU, what entity/interface pins memory for the vfio device?
>>>>>
>>>>> Oh, I don't mean virtio-iommu, it can be intel-iommu or virtio-iommu for
>this
>>> issue.
>>>>> I mean when guest attach device to a DMA domain, there can be no
>mapping
>>> in that domain initially.
>>>>>
>>>>>>
>>>>>>> QEMU knows details of all memory backends, why not checking memory
>>>>>> backends directly instead of a system call?
>>>>>>
>>>>>> IOMMU_IOAS_CHANGE_PROCESS is relatively new. The ioctl verifies that
>the
>>>>>> kernel
>>>>>> supports it.  And if supported, it also verifies that all dma mappings are
>>>>>> of the file type.
>>>>>
>>>>> But the dma mappings are dynamic if there is vIOMMU, so checking dma
>>> mappings are checking nothing if there is no mapping in the DMA domain.
>>>>
>>>> Yes, so there are 2 checks:
>>>>     * at realize -> cpr register time.  if cpr can never work because
>>>>       IOMMU_IOAS_CHANGE_PROCESS is not supported, then adds a blocker.
>>>>
>>>>     * at cpr time, in vfio_container_pre_save.  refuses to proceed if
>>>>       iommufd_change_process() fails because non-file mappings are present.
>>>>       Allows cpr if there are no mappings present.
>>
>> Let me explain further.
>>
>> There is a corner case that could bypass the above checks. Source qemu starts
>> with vIOMMU and a non-file memory backend, then a VFIO device is hotplugged.
>> If the guest driver doesn't set up any mapping, or no guest driver is attached,
>> the mapping on the host side can be empty, and both of the above checks pass.
>
>That is OK.  CPR is allowed in that case and succeeds because
>iommufd_change_process
>has nothing to do.
>
>However, after CPR, if non-file mappings are added, then the next CPR operation
>would be blocked.

Clear, no problem. Thanks for your explanation.

Zhenzhong


* Re: [PATCH V3 22/42] vfio-pci: preserve MSI
  2025-05-28 17:44   ` Steven Sistare
@ 2025-06-01 17:28     ` Cédric Le Goater
  0 siblings, 0 replies; 157+ messages in thread
From: Cédric Le Goater @ 2025-06-01 17:28 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/28/25 19:44, Steven Sistare wrote:
> Hi Cedric,
>    Do you have any comments on this before I send V4?
> Ditto for patch "vfio-pci: preserve INTx".
> In both, I made the changes you requested in V2.
> And I will change all "reused" tests to cpr_is_incoming as we
> discussed elsewhere.

I saw. Thanks for the changes.

> You mentioned these possibly conflict with vfio-user, but it would
> help to get your stylistic and correctness comments on these from a
> cpr-only point of view before I send the next version.

OK. I will try to do that on v4 after the PR (with part 1) is sent.

> And as I mentioned, I propose to block CPR when vfio-user is used,
> at least initially, so you can ignore vfio-user in the cpr load paths below.

Thanks,

C.



> 
> - Steve
> 
> On 5/12/2025 11:32 AM, Steve Sistare wrote:
>> Save the MSI message area as part of vfio-pci vmstate, and preserve the
>> interrupt and notifier eventfd's.  migrate_incoming loads the MSI data,
>> then the vfio-pci post_load handler finds the eventfds in CPR state,
>> rebuilds vector data structures, and attaches the interrupts to the new
>> KVM instance.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   hw/vfio/cpr.c              | 91 ++++++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/pci.c              | 40 ++++++++++++++++++--
>>   include/hw/vfio/vfio-cpr.h |  8 ++++
>>   3 files changed, 136 insertions(+), 3 deletions(-)
>>
>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>> index 6ea8e9f..be132fa 100644
>> --- a/hw/vfio/cpr.c
>> +++ b/hw/vfio/cpr.c
>> @@ -9,6 +9,8 @@
>>   #include "hw/vfio/vfio-device.h"
>>   #include "hw/vfio/vfio-cpr.h"
>>   #include "hw/vfio/pci.h"
>> +#include "hw/pci/msix.h"
>> +#include "hw/pci/msi.h"
>>   #include "migration/cpr.h"
>>   #include "qapi/error.h"
>>   #include "system/runstate.h"
>> @@ -40,6 +42,69 @@ void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
>>       migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
>>   }
>> +#define STRDUP_VECTOR_FD_NAME(vdev, name)   \
>> +    g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
>> +
>> +void vfio_cpr_save_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr,
>> +                             int fd)
>> +{
>> +    g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
>> +    cpr_save_fd(fdname, nr, fd);
>> +}
>> +
>> +int vfio_cpr_load_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr)
>> +{
>> +    g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
>> +    return cpr_find_fd(fdname, nr);
>> +}
>> +
>> +void vfio_cpr_delete_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr)
>> +{
>> +    g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
>> +    cpr_delete_fd(fdname, nr);
>> +}
>> +
>> +static void vfio_cpr_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors,
>> +                                   bool msix)
>> +{
>> +    int i, fd;
>> +    bool pending = false;
>> +    PCIDevice *pdev = &vdev->pdev;
>> +
>> +    vdev->nr_vectors = nr_vectors;
>> +    vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
>> +    vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
>> +
>> +    vfio_prepare_kvm_msi_virq_batch(vdev);
>> +
>> +    for (i = 0; i < nr_vectors; i++) {
>> +        VFIOMSIVector *vector = &vdev->msi_vectors[i];
>> +
>> +        fd = vfio_cpr_load_vector_fd(vdev, "interrupt", i);
>> +        if (fd >= 0) {
>> +            vfio_vector_init(vdev, i);
>> +            qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
>> +        }
>> +
>> +        if (vfio_cpr_load_vector_fd(vdev, "kvm_interrupt", i) >= 0) {
>> +            vfio_add_kvm_msi_virq(vdev, vector, i, msix);
>> +        } else {
>> +            vdev->msi_vectors[i].virq = -1;
>> +        }
>> +
>> +        if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
>> +            set_bit(i, vdev->msix->pending);
>> +            pending = true;
>> +        }
>> +    }
>> +
>> +    vfio_commit_kvm_msi_virq_batch(vdev);
>> +
>> +    if (msix) {
>> +        memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
>> +    }
>> +}
>> +
>>   /*
>>    * The kernel may change non-emulated config bits.  Exclude them from the
>>    * changed-bits check in get_pci_config_device.
>> @@ -58,13 +123,39 @@ static int vfio_cpr_pci_pre_load(void *opaque)
>>       return 0;
>>   }
>> +static int vfio_cpr_pci_post_load(void *opaque, int version_id)
>> +{
>> +    VFIOPCIDevice *vdev = opaque;
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    int nr_vectors;
>> +
>> +    if (msix_enabled(pdev)) {
>> +        msix_set_vector_notifiers(pdev, vfio_msix_vector_use,
>> +                                   vfio_msix_vector_release, NULL);
>> +        nr_vectors = vdev->msix->entries;
>> +        vfio_cpr_claim_vectors(vdev, nr_vectors, true);
>> +
>> +    } else if (msi_enabled(pdev)) {
>> +        nr_vectors = msi_nr_vectors_allocated(pdev);
>> +        vfio_cpr_claim_vectors(vdev, nr_vectors, false);
>> +
>> +    } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
>> +        g_assert_not_reached();      /* completed in a subsequent patch */
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>   const VMStateDescription vfio_cpr_pci_vmstate = {
>>       .name = "vfio-cpr-pci",
>>       .version_id = 0,
>>       .minimum_version_id = 0,
>>       .pre_load = vfio_cpr_pci_pre_load,
>> +    .post_load = vfio_cpr_pci_post_load,
>>       .needed = cpr_needed_for_reuse,
>>       .fields = (VMStateField[]) {
>> +        VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
>> +        VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
>>           VMSTATE_END_OF_LIST()
>>       }
>>   };
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 1bca415..bfa72bc 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -29,6 +29,7 @@
>>   #include "hw/pci/pci_bridge.h"
>>   #include "hw/qdev-properties.h"
>>   #include "hw/qdev-properties-system.h"
>> +#include "hw/vfio/vfio-cpr.h"
>>   #include "migration/vmstate.h"
>>   #include "qobject/qdict.h"
>>   #include "qemu/error-report.h"
>> @@ -56,13 +57,25 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
>>   static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
>>   static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
>> +/* Create new or reuse existing eventfd */
>>   static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
>>                                  const char *name, int nr, Error **errp)
>>   {
>> -    int ret = event_notifier_init(e, 0);
>> +    int fd = vfio_cpr_load_vector_fd(vdev, name, nr);
>> +    int ret = 0;
>> -    if (ret) {
>> -        error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
>> +    if (fd >= 0) {
>> +        event_notifier_init_fd(e, fd);
>> +    } else {
>> +        ret = event_notifier_init(e, 0);
>> +        if (ret) {
>> +            error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
>> +        } else {
>> +            fd = event_notifier_get_fd(e);
>> +            if (fd >= 0) {
>> +                vfio_cpr_save_vector_fd(vdev, name, nr, fd);
>> +            }
>> +        }
>>       }
>>       return !ret;
>>   }
>> @@ -70,6 +83,7 @@ static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
>>   static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
>>                                     const char *name, int nr)
>>   {
>> +    vfio_cpr_delete_vector_fd(vdev, name, nr);
>>       event_notifier_cleanup(e);
>>   }
>> @@ -554,6 +568,15 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>>       int ret;
>>       bool resizing = !!(vdev->nr_vectors < nr + 1);
>> +    /*
>> +     * Ignore the callback from msix_set_vector_notifiers during resume.
>> +     * The necessary subset of these actions is called from
>> +     * vfio_cpr_claim_vectors during post load.
>> +     */
>> +    if (vdev->vbasedev.cpr.reused) {
>> +        return 0;
>> +    }
>> +
>>       trace_vfio_msix_vector_do_use(vdev->vbasedev.name, nr);
>>       vector = &vdev->msi_vectors[nr];
>> @@ -2937,6 +2960,11 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
>>       fd = event_notifier_get_fd(&vdev->err_notifier);
>>       qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
>> +    /* Do not alter irq_signaling during vfio_realize for cpr */
>> +    if (vdev->vbasedev.cpr.reused) {
>> +        return;
>> +    }
>> +
>>       if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_ERR_IRQ_INDEX, 0,
>>                                          VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
>>           error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
>> @@ -3004,6 +3032,12 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
>>       fd = event_notifier_get_fd(&vdev->req_notifier);
>>       qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
>> +    /* Do not alter irq_signaling during vfio_realize for cpr */
>> +    if (vdev->vbasedev.cpr.reused) {
>> +        vdev->req_enabled = true;
>> +        return;
>> +    }
>> +
>>       if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_REQ_IRQ_INDEX, 0,
>>                                          VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
>>           error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>> index e93600f..765e334 100644
>> --- a/include/hw/vfio/vfio-cpr.h
>> +++ b/include/hw/vfio/vfio-cpr.h
>> @@ -28,6 +28,7 @@ typedef struct VFIODeviceCPR {
>>   struct VFIOContainer;
>>   struct VFIOContainerBase;
>>   struct VFIOGroup;
>> +struct VFIOPCIDevice;
>>   bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>>                                           Error **errp);
>> @@ -49,6 +50,13 @@ void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
>>   bool vfio_cpr_ram_discard_register_listener(
>>       struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
>> +void vfio_cpr_save_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
>> +                             int nr, int fd);
>> +int vfio_cpr_load_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
>> +                            int nr);
>> +void vfio_cpr_delete_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
>> +                               int nr);
>> +
>>   extern const VMStateDescription vfio_cpr_pci_vmstate;
>>   #endif /* HW_VFIO_VFIO_CPR_H */
> 
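
For readers following the thread, the "create new or reuse existing eventfd"
step in vfio_notifier_init can be illustrated outside of QEMU. The sketch
below is only an approximation under stated assumptions: the fixed-size table
and the sketch_* helpers are hypothetical stand-ins for cpr_save_fd,
cpr_find_fd and cpr_delete_fd, and a plain Linux eventfd() call replaces
event_notifier_init. It shows only the lookup-by-(name, nr) idea, not how
descriptors are actually passed to new QEMU.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>

#define MAX_SAVED_FDS 64
#define NAME_LEN 64

/* Hypothetical stand-in for the CPR fd list: one entry per (name, nr). */
struct saved_fd {
    char name[NAME_LEN];
    int nr;
    int fd;
    int used;
};

static struct saved_fd saved_fds[MAX_SAVED_FDS];

/* Return the saved fd for (name, nr), or -1 if none was saved. */
static int sketch_find_fd(const char *name, int nr)
{
    for (int i = 0; i < MAX_SAVED_FDS; i++) {
        if (saved_fds[i].used && saved_fds[i].nr == nr &&
            strcmp(saved_fds[i].name, name) == 0) {
            return saved_fds[i].fd;
        }
    }
    return -1;
}

/* Record fd under (name, nr); return 0 on success, -1 if the table is full. */
static int sketch_save_fd(const char *name, int nr, int fd)
{
    for (int i = 0; i < MAX_SAVED_FDS; i++) {
        if (!saved_fds[i].used) {
            snprintf(saved_fds[i].name, NAME_LEN, "%s", name);
            saved_fds[i].nr = nr;
            saved_fds[i].fd = fd;
            saved_fds[i].used = 1;
            return 0;
        }
    }
    return -1;
}

/* Forget the entry for (name, nr); the caller still owns the fd itself. */
static void sketch_delete_fd(const char *name, int nr)
{
    for (int i = 0; i < MAX_SAVED_FDS; i++) {
        if (saved_fds[i].used && saved_fds[i].nr == nr &&
            strcmp(saved_fds[i].name, name) == 0) {
            saved_fds[i].used = 0;
        }
    }
}

/*
 * Reuse a previously saved eventfd for "<dev>_<name>" / nr if one exists,
 * otherwise create a new eventfd and save it under that key.
 * Sets *reused to 1 or 0 accordingly. Returns the fd, or -1 on failure.
 */
static int sketch_notifier_init(const char *dev, const char *name, int nr,
                                int *reused)
{
    char key[NAME_LEN];
    int fd;

    assert(dev && name && reused);
    snprintf(key, sizeof(key), "%s_%s", dev, name);

    fd = sketch_find_fd(key, nr);
    if (fd >= 0) {
        *reused = 1;
        return fd;
    }

    *reused = 0;
    fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    if (fd >= 0 && sketch_save_fd(key, nr, fd) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Calling sketch_notifier_init twice with the same device name, notifier name
and vector number returns the same descriptor the second time, which is the
property the patch relies on when new QEMU rebuilds its vectors. Deleting
the entry first, as vfio_notifier_cleanup does in the patch, makes the next
call create a fresh eventfd.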




end of thread, other threads:[~2025-06-01 17:28 UTC | newest]

Thread overview: 157+ messages
2025-05-12 15:32 [PATCH V3 00/42] Live update: vfio and iommufd Steve Sistare
2025-05-12 15:32 ` [PATCH V3 01/42] MAINTAINERS: Add reviewer for CPR Steve Sistare
2025-05-15  7:36   ` Cédric Le Goater
2025-05-12 15:32 ` [PATCH V3 02/42] migration: cpr helpers Steve Sistare
2025-05-15  7:43   ` Cédric Le Goater
2025-05-12 15:32 ` [PATCH V3 03/42] migration: lower handler priority Steve Sistare
2025-05-12 15:32 ` [PATCH V3 04/42] vfio: vfio_find_ram_discard_listener Steve Sistare
2025-05-12 15:32 ` [PATCH V3 05/42] vfio: move vfio-cpr.h Steve Sistare
2025-05-15  7:46   ` Cédric Le Goater
2025-05-12 15:32 ` [PATCH V3 06/42] vfio/container: register container for cpr Steve Sistare
2025-05-15  7:54   ` Cédric Le Goater
2025-05-15 19:06     ` Steven Sistare
2025-05-16 16:20       ` Cédric Le Goater
2025-05-16 17:21         ` Steven Sistare
2025-05-12 15:32 ` [PATCH V3 07/42] vfio/container: preserve descriptors Steve Sistare
2025-05-15 12:59   ` Cédric Le Goater
2025-05-15 19:08     ` Steven Sistare
2025-05-19 13:20       ` Cédric Le Goater
2025-05-19 16:21         ` Steven Sistare
2025-05-22 13:51   ` Cédric Le Goater
2025-05-22 13:56     ` Steven Sistare
2025-05-12 15:32 ` [PATCH V3 08/42] vfio/container: export vfio_legacy_dma_map Steve Sistare
2025-05-15 13:42   ` Cédric Le Goater
2025-05-15 19:08     ` Steven Sistare
2025-05-12 15:32 ` [PATCH V3 09/42] vfio/container: discard old DMA vaddr Steve Sistare
2025-05-15 13:30   ` Cédric Le Goater
2025-05-12 15:32 ` [PATCH V3 10/42] vfio/container: restore " Steve Sistare
2025-05-15 13:42   ` Cédric Le Goater
2025-05-15 19:08     ` Steven Sistare
2025-05-19 13:32       ` Cédric Le Goater
2025-05-19 16:33         ` Steven Sistare
2025-05-22  6:37   ` Cédric Le Goater
2025-05-22 14:00     ` Steven Sistare
2025-05-12 15:32 ` [PATCH V3 11/42] vfio/container: mdev cpr blocker Steve Sistare
2025-05-16  8:16   ` Cédric Le Goater
2025-05-12 15:32 ` [PATCH V3 12/42] vfio/container: recover from unmap-all-vaddr failure Steve Sistare
2025-05-20  6:29   ` Cédric Le Goater
2025-05-20 13:39     ` Steven Sistare
2025-05-12 15:32 ` [PATCH V3 13/42] pci: export msix_is_pending Steve Sistare
2025-05-12 15:32 ` [PATCH V3 14/42] pci: skip reset during cpr Steve Sistare
2025-05-16  8:19   ` Cédric Le Goater
2025-05-16 17:58     ` Steven Sistare
2025-05-24  9:34     ` Michael S. Tsirkin
2025-05-27 20:42       ` Steven Sistare
2025-05-27 21:03         ` Michael S. Tsirkin
2025-05-28 16:11           ` Steven Sistare
2025-05-12 15:32 ` [PATCH V3 15/42] vfio-pci: " Steve Sistare
2025-05-20  6:48   ` Cédric Le Goater
2025-05-20 13:44     ` Steven Sistare
2025-05-12 15:32 ` [PATCH V3 16/42] vfio/pci: vfio_vector_init Steve Sistare
2025-05-16  8:32   ` Cédric Le Goater
2025-05-12 15:32 ` [PATCH V3 17/42] vfio/pci: vfio_notifier_init Steve Sistare
2025-05-16  8:29   ` Cédric Le Goater
2025-05-12 15:32 ` [PATCH V3 18/42] vfio/pci: pass vector to virq functions Steve Sistare
2025-05-16  8:28   ` Cédric Le Goater
2025-05-12 15:32 ` [PATCH V3 19/42] vfio/pci: vfio_notifier_init cpr parameters Steve Sistare
2025-05-16  8:29   ` Cédric Le Goater
2025-05-12 15:32 ` [PATCH V3 20/42] vfio/pci: vfio_notifier_cleanup Steve Sistare
2025-05-16  8:30   ` Cédric Le Goater
2025-05-12 15:32 ` [PATCH V3 21/42] vfio/pci: export MSI functions Steve Sistare
2025-05-16  8:31   ` Cédric Le Goater
2025-05-16 17:58     ` Steven Sistare
2025-05-20  5:52       ` Cédric Le Goater
2025-05-20 14:56         ` Steven Sistare
2025-05-20 15:10           ` Cédric Le Goater
2025-05-12 15:32 ` [PATCH V3 22/42] vfio-pci: preserve MSI Steve Sistare
2025-05-28 17:44   ` Steven Sistare
2025-06-01 17:28     ` Cédric Le Goater
2025-05-12 15:32 ` [PATCH V3 23/42] vfio-pci: preserve INTx Steve Sistare
2025-05-12 15:32 ` [PATCH V3 24/42] migration: close kvm after cpr Steve Sistare
2025-05-16  8:35   ` Cédric Le Goater
2025-05-16 17:14     ` Peter Xu
2025-05-16 19:17       ` Steven Sistare
2025-05-16 18:18     ` Steven Sistare
2025-05-19  8:51       ` Cédric Le Goater
2025-05-19 19:07         ` Steven Sistare
2025-05-12 15:32 ` [PATCH V3 25/42] migration: cpr_get_fd_param helper Steve Sistare
2025-05-19 21:22   ` Fabiano Rosas
2025-05-12 15:32 ` [PATCH V3 26/42] vfio: return mr from vfio_get_xlat_addr Steve Sistare
2025-05-12 20:51   ` John Levon
2025-05-14 17:03     ` Cédric Le Goater
2025-05-15  8:22       ` David Hildenbrand
2025-05-15 19:13         ` Steven Sistare
2025-05-15 17:24     ` Steven Sistare
2025-05-13 11:12   ` Mark Cave-Ayland
2025-05-15 19:40     ` Steven Sistare
2025-05-12 15:32 ` [PATCH V3 27/42] vfio: pass ramblock to vfio_container_dma_map Steve Sistare
2025-05-16  8:26   ` Duan, Zhenzhong
2025-05-12 15:32 ` [PATCH V3 28/42] backends/iommufd: iommufd_backend_map_file_dma Steve Sistare
2025-05-16  8:26   ` Duan, Zhenzhong
2025-05-19 15:51     ` Steven Sistare
2025-05-20 19:32       ` Steven Sistare
2025-05-21  2:48         ` Duan, Zhenzhong
2025-05-12 15:32 ` [PATCH V3 29/42] backends/iommufd: change process ioctl Steve Sistare
2025-05-16  8:42   ` Duan, Zhenzhong
2025-05-19 15:51     ` Steven Sistare
2025-05-20 19:34       ` Steven Sistare
2025-05-21  3:11         ` Duan, Zhenzhong
2025-05-21 13:01           ` Steven Sistare
2025-05-22  3:19             ` Duan, Zhenzhong
2025-05-22 21:11               ` Steven Sistare
2025-05-23  8:56                 ` Duan, Zhenzhong
2025-05-23 14:56                   ` Steven Sistare
2025-05-23 19:19                     ` Steven Sistare
2025-05-26  2:31                       ` Duan, Zhenzhong
2025-05-28 13:31                         ` Steven Sistare
2025-05-30  9:56                           ` Duan, Zhenzhong
2025-05-12 15:32 ` [PATCH V3 30/42] physmem: qemu_ram_get_fd_offset Steve Sistare
2025-05-16  8:40   ` Duan, Zhenzhong
2025-05-12 15:32 ` [PATCH V3 31/42] vfio/iommufd: use IOMMU_IOAS_MAP_FILE Steve Sistare
2025-05-16  8:48   ` Duan, Zhenzhong
2025-05-19 15:52     ` Steven Sistare
2025-05-20 19:39       ` Steven Sistare
2025-05-21  3:13         ` Duan, Zhenzhong
2025-05-20 12:27   ` Cédric Le Goater
2025-05-20 13:58     ` Steven Sistare
2025-05-12 15:32 ` [PATCH V3 32/42] vfio/iommufd: export iommufd_cdev_get_info_iova_range Steve Sistare
2025-05-21 18:35   ` Steven Sistare
2025-05-12 15:32 ` [PATCH V3 33/42] vfio/iommufd: define hwpt constructors Steve Sistare
2025-05-16  8:55   ` Duan, Zhenzhong
2025-05-19 15:55     ` Steven Sistare
2025-05-23 17:47       ` Steven Sistare
2025-05-20 12:34     ` Cédric Le Goater
2025-05-21  2:48       ` Duan, Zhenzhong
2025-05-21  8:19         ` Cédric Le Goater
2025-05-12 15:32 ` [PATCH V3 34/42] vfio/iommufd: invariant device name Steve Sistare
2025-05-16  9:29   ` Duan, Zhenzhong
2025-05-19 15:52     ` Steven Sistare
2025-05-20 13:55   ` Cédric Le Goater
2025-05-20 21:00     ` Steven Sistare
2025-05-21  8:20       ` Cédric Le Goater
2025-05-12 15:32 ` [PATCH V3 35/42] vfio/iommufd: register container for cpr Steve Sistare
2025-05-16 10:23   ` Duan, Zhenzhong
2025-05-19 15:52     ` Steven Sistare
2025-05-12 15:32 ` [PATCH V3 36/42] vfio/iommufd: preserve descriptors Steve Sistare
2025-05-16 10:06   ` Duan, Zhenzhong
2025-05-19 15:53     ` Steven Sistare
2025-05-20  9:15       ` Duan, Zhenzhong
2025-05-12 15:32 ` [PATCH V3 37/42] vfio/iommufd: reconstruct device Steve Sistare
2025-05-16 10:22   ` Duan, Zhenzhong
2025-05-19 15:53     ` Steven Sistare
2025-05-20  9:14       ` Duan, Zhenzhong
2025-05-21 18:38   ` Steven Sistare
2025-05-12 15:32 ` [PATCH V3 38/42] vfio/iommufd: reconstruct hw_caps Steve Sistare
2025-05-21 19:59   ` Steven Sistare
2025-05-12 15:32 ` [PATCH V3 39/42] vfio/iommufd: reconstruct hwpt Steve Sistare
2025-05-19  3:25   ` Duan, Zhenzhong
2025-05-19 15:53     ` Steven Sistare
2025-05-20  9:16       ` Duan, Zhenzhong
2025-05-21 17:40         ` Steven Sistare
2025-05-12 15:32 ` [PATCH V3 40/42] vfio/iommufd: change process Steve Sistare
2025-05-12 15:32 ` [PATCH V3 41/42] iommufd: preserve DMA mappings Steve Sistare
2025-05-12 15:32 ` [PATCH V3 42/42] vfio/container: delete old cpr register Steve Sistare
2025-05-16 16:37 ` [PATCH V3 00/42] Live update: vfio and iommufd Cédric Le Goater
2025-05-16 17:17   ` Steven Sistare
2025-05-16 19:48     ` Steven Sistare
2025-05-19  8:54       ` Cédric Le Goater
