qemu-devel.nongnu.org archive mirror
* [PATCH V4 00/43] Live update: vfio and iommufd
@ 2025-05-29 19:23 Steve Sistare
  2025-05-29 19:23 ` [PATCH V4 01/43] MAINTAINERS: Add reviewer for CPR Steve Sistare
                   ` (44 more replies)
  0 siblings, 45 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:23 UTC
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Support vfio and iommufd devices with the cpr-transfer live migration mode.
Devices that do not support live migration can still support cpr-transfer,
allowing live update to a new version of QEMU on the same host, with no loss
of guest connectivity.

No user-visible interfaces are added.

For legacy containers:

Pass vfio device descriptors to new QEMU.  In new QEMU, during vfio_realize,
skip the ioctls that configure the device, because it is already configured.

Use VFIO_DMA_UNMAP_FLAG_VADDR to abandon the old VAs for DMA mapped
regions, and use VFIO_DMA_MAP_FLAG_VADDR to register the new VA in new
QEMU and update the locked memory accounting.  The physical pages remain
pinned, because the descriptor of the device that locked them remains open,
so DMA to those pages continues without interruption.  Mediated devices are
not supported, however, because they require the VA to always be valid, and
there is a brief window where no VA is registered.
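
As a rough illustration of the two VADDR steps described above, the sketch
below fills the type1 ioctl arguments that old and new QEMU would each issue.
The sketch_* helper names are hypothetical and are not taken from this series;
the structs and flags come from <linux/vfio.h>, and the sketch assumes kernel
headers recent enough to define the VADDR flags.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Old QEMU: invalidate the VA of every mapping.  Pages stay pinned. */
void sketch_fill_vaddr_discard(struct vfio_iommu_type1_dma_unmap *unmap)
{
    assert(unmap != NULL);
    memset(unmap, 0, sizeof(*unmap));
    unmap->argsz = sizeof(*unmap);
    /* With FLAG_ALL, iova and size must both be 0. */
    unmap->flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL;
}

/* New QEMU: register the new VA for an existing iova range. */
void sketch_fill_vaddr_restore(struct vfio_iommu_type1_dma_map *map,
                               uint64_t iova, uint64_t size, void *new_vaddr)
{
    assert(map != NULL);
    memset(map, 0, sizeof(*map));
    map->argsz = sizeof(*map);
    map->flags = VFIO_DMA_MAP_FLAG_VADDR;
    map->vaddr = (uint64_t)(uintptr_t)new_vaddr;
    map->iova = iova;    /* must match the original mapping */
    map->size = size;    /* must match the original mapping */
}

/* Returns 0 on success, else a negative errno value. */
int sketch_discard_all_vaddr(int container_fd)
{
    struct vfio_iommu_type1_dma_unmap unmap;

    sketch_fill_vaddr_discard(&unmap);
    return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap) ? -errno : 0;
}

/* Returns 0 on success, else a negative errno value. */
int sketch_restore_vaddr(int container_fd, uint64_t iova, uint64_t size,
                         void *new_vaddr)
{
    struct vfio_iommu_type1_dma_map map;

    sketch_fill_vaddr_restore(&map, iova, size, new_vaddr);
    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map) ? -errno : 0;
}
```

Between the discard in old QEMU and the restore in new QEMU no VA is
registered, which is the window that rules out mediated devices here.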

Save the MSI message area as part of vfio-pci vmstate, and pass the interrupt
and notifier eventfds to new QEMU.  New QEMU loads the MSI data, then the
vfio-pci post_load handler finds the eventfds in CPR state, rebuilds vector
data structures, and attaches the interrupts to the new KVM instance.  This
logic also applies to iommufd containers.

For iommufd containers:

Use IOMMU_IOAS_MAP_FILE to register memory regions for DMA when they are
backed by a file (including a memfd), so DMA mappings do not depend on VA,
which can differ after live update.  This allows mediated devices to be
supported.
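
The idea behind file-based mapping can be illustrated without iommufd at all.
The sketch below maps one file twice, standing in for old and new QEMU, and
checks that the two mappings have different VAs but alias the same bytes, so
a (file, offset) pair identifies the memory where a VA does not.  This is
only an illustration of the principle: it does not call IOMMU_IOAS_MAP_FILE,
it uses an unlinked temporary file rather than a memfd, and the function name
is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if two mappings of the same file have different VAs but
 * share contents, 0 if not, and -1 if the setup fails.
 */
int sketch_file_backing_is_va_independent(size_t len)
{
    char path[] = "/tmp/cpr-sketch-XXXXXX";
    const char payload[] = "dma payload";
    char *old_va, *new_va;
    int fd, ok;

    assert(len >= sizeof(payload));

    fd = mkstemp(path);
    if (fd < 0) {
        return -1;
    }
    unlink(path);                       /* keep only the open descriptor */
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }

    /* "Old QEMU" and "new QEMU" each map the same file. */
    old_va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (old_va == MAP_FAILED) {
        close(fd);
        return -1;
    }
    new_va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (new_va == MAP_FAILED) {
        munmap(old_va, len);
        close(fd);
        return -1;
    }

    /* A store through one VA is visible through the other. */
    memcpy(old_va, payload, sizeof(payload));
    ok = old_va != new_va &&
         memcmp(new_va, payload, sizeof(payload)) == 0;

    munmap(old_va, len);
    munmap(new_va, len);
    close(fd);
    return ok;
}
```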

Pass the iommufd and vfio device descriptors from old to new QEMU.  In new
QEMU, during vfio_realize, skip the ioctls that configure the device, because
it is already configured.

In new QEMU, call ioctl(IOMMU_IOAS_CHANGE_PROCESS) to update mm ownership and
locked memory accounting.

Patches 4 to 12 are specific to legacy containers.
Patches 25 to 41 are specific to iommufd containers.
The remainder apply to both.

Changes from previous versions:
  * V1 of this series contains minor changes from the "Live update: vfio" and
    "Live update: iommufd" series, mainly bug fixes and refactored patches.

Changes in V2:
  * refactored various vfio code snippets into new cpr helpers
  * refactored vfio struct members into cpr-specific structures
  * refactored various small changes into their own patches
  * split complex patches.  Notably:
    - split "refactor for cpr" into 5 patches
    - split "reconstruct device" into 4 patches
  * refactored vfio_connect_container using helpers and made its
    error recovery more robust.
  * moved vfio pci msi/vector/intx cpr functions to cpr.c
  * renamed "reused" to cpr_reused and cpr.reused
  * squashed vfio_cpr_[un]register_container to their call sites
  * simplified iommu_type setting after cpr
  * added cpr_open_fd and cpr_is_incoming helpers
  * removed changes from vfio_legacy_dma_map, and instead temporarily
    override dma_map and dma_unmap ops.
  * deleted error_report and returned Error to callers where possible.
  * simplified the memory_get_xlat_addr interface
  * fixed flags passed to iommufd_backend_alloc_hwpt
  * defined MIG_PRI_UNINITIALIZED
  * added maintainers

Changes in V3:
  * removed cleanup patches that were already pulled
  * rebased to latest master

Changes in V4:
  * added SPDX-License-Identifier
  * patch "vfio/container: preserve descriptors"
    - rewrote search loop in vfio_container_connect
    - do not return pfd from vfio_cpr_container_match
    - add helper for VFIO_GROUP_GET_DEVICE_FD
  * deleted patch "export vfio_legacy_dma_map"
  * patch "vfio/container: restore DMA vaddr"
    - deleted redundant error_report from vfio_legacy_cpr_dma_map
    - save old dma_map function
  * patch "vfio-pci: skip reset during cpr"
    - use cpr_is_incoming instead of cpr_reused
  * renamed err -> local_err in all new code
  * patch "export MSI functions"
    - renamed with vfio_pci prefix, and defined wrappers for low level
      routines instead of exporting them.
  * patch "close kvm after cpr"
    - fixed build error for !CONFIG_KVM
  * added the cpr_resave_fd helper
  * dropped patch "pass ramblock to vfio_container_dma_map", relying on
    "pass MemoryRegion" from the vfio-user series instead.
  * deleted "reused" variables, replaced with cpr_is_incoming()
  * renamed cpr_needed_for_reuse -> cpr_incoming_needed
  * rewrote patch "pci: skip reset during cpr"
  * rebased to latest master

  for iommufd:
    * deleted redundant error_report from iommufd_backend_map_file_dma
    * added interface doc for dma_map_file
    * check return value of cpr_open_fd
    * deleted "export iommufd_cdev_get_info_iova_range"
    * deleted "reconstruct device"
    * deleted "reconstruct hw_caps"
    * deleted "define hwpt constructors"
    * separated cpr registration for iommufd backend and vfio container
    * correctly attach to multiple containers per iommufd using ioas_id
    * simplified "reconstruct hwpt" by matching against hwpt_id.
    * added patch "add vfio_device_free_name"


Steve Sistare (43):
  MAINTAINERS: Add reviewer for CPR
  vfio: return mr from vfio_get_xlat_addr
  vfio/container: pass MemoryRegion to DMA operations
  vfio/pci: vfio_pci_put_device on failure
  migration: cpr helpers
  migration: lower handler priority
  vfio: vfio_find_ram_discard_listener
  vfio: move vfio-cpr.h
  vfio/container: register container for cpr
  vfio/container: preserve descriptors
  vfio/container: discard old DMA vaddr
  vfio/container: restore DMA vaddr
  vfio/container: mdev cpr blocker
  vfio/container: recover from unmap-all-vaddr failure
  pci: export msix_is_pending
  pci: skip reset during cpr
  vfio-pci: skip reset during cpr
  vfio/pci: vfio_pci_vector_init
  vfio/pci: vfio_notifier_init
  vfio/pci: pass vector to virq functions
  vfio/pci: vfio_notifier_init cpr parameters
  vfio/pci: vfio_notifier_cleanup
  vfio/pci: export MSI functions
  vfio-pci: preserve MSI
  vfio-pci: preserve INTx
  migration: close kvm after cpr
  migration: cpr_get_fd_param helper
  backends/iommufd: iommufd_backend_map_file_dma
  backends/iommufd: change process ioctl
  physmem: qemu_ram_get_fd_offset
  vfio/iommufd: use IOMMU_IOAS_MAP_FILE
  vfio/iommufd: invariant device name
  vfio/iommufd: add vfio_device_free_name
  vfio/iommufd: device name blocker
  vfio/iommufd: register container for cpr
  migration: vfio cpr state hook
  vfio/iommufd: cpr state
  vfio/iommufd: preserve descriptors
  vfio/iommufd: reconstruct device
  vfio/iommufd: reconstruct hwpt
  vfio/iommufd: change process
  iommufd: preserve DMA mappings
  vfio/container: delete old cpr register

 MAINTAINERS                           |  10 ++
 hw/vfio/pci.h                         |  10 ++
 hw/vfio/vfio-cpr.h                    |  15 --
 include/exec/cpu-common.h             |   1 +
 include/hw/pci/msix.h                 |   1 +
 include/hw/pci/pci_device.h           |   3 +
 include/hw/vfio/vfio-container-base.h |  38 ++++-
 include/hw/vfio/vfio-container.h      |   2 +
 include/hw/vfio/vfio-cpr.h            |  78 +++++++++
 include/hw/vfio/vfio-device.h         |   5 +
 include/migration/cpr.h               |  21 +++
 include/migration/vmstate.h           |   6 +-
 include/system/iommufd.h              |   6 +
 include/system/kvm.h                  |   1 +
 include/system/memory.h               |  19 ++-
 accel/kvm/kvm-all.c                   |  28 ++++
 accel/stubs/kvm-stub.c                |   5 +
 backends/iommufd.c                    | 101 +++++++++++-
 hw/pci/msix.c                         |   2 +-
 hw/pci/pci.c                          |   5 +
 hw/vfio/ap.c                          |   2 +-
 hw/vfio/ccw.c                         |   2 +-
 hw/vfio/container-base.c              |  13 +-
 hw/vfio/container.c                   | 101 +++++++++---
 hw/vfio/cpr-iommufd.c                 | 220 ++++++++++++++++++++++++++
 hw/vfio/cpr-legacy.c                  | 288 ++++++++++++++++++++++++++++++++++
 hw/vfio/cpr.c                         | 161 +++++++++++++++++--
 hw/vfio/device.c                      |  40 +++--
 hw/vfio/helpers.c                     |  10 ++
 hw/vfio/iommufd.c                     |  86 ++++++++--
 hw/vfio/listener.c                    |  93 +++++++----
 hw/vfio/pci.c                         | 232 ++++++++++++++++++++-------
 hw/vfio/platform.c                    |   2 +-
 hw/vfio/vfio-stubs.c                  |  13 ++
 hw/virtio/vhost-vdpa.c                |   9 +-
 migration/cpr-transfer.c              |  18 +++
 migration/cpr.c                       |  95 +++++++++--
 migration/migration.c                 |   1 +
 migration/savevm.c                    |   4 +-
 system/memory.c                       |  32 +---
 system/physmem.c                      |   5 +
 backends/trace-events                 |   2 +
 hw/vfio/meson.build                   |   4 +
 43 files changed, 1576 insertions(+), 214 deletions(-)
 delete mode 100644 hw/vfio/vfio-cpr.h
 create mode 100644 include/hw/vfio/vfio-cpr.h
 create mode 100644 hw/vfio/cpr-iommufd.c
 create mode 100644 hw/vfio/cpr-legacy.c
 create mode 100644 hw/vfio/vfio-stubs.c

base-commit: d2e9b78162e31b1eaf20f3a4f563da82da56908d
-- 
1.8.3.1




* [PATCH V4 01/43] MAINTAINERS: Add reviewer for CPR
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
@ 2025-05-29 19:23 ` Steve Sistare
  2025-05-29 19:23 ` [PATCH V4 02/43] vfio: return mr from vfio_get_xlat_addr Steve Sistare
                   ` (43 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:23 UTC
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

CPR is integrated with live migration and has the same maintainers, but
add a separate CPR section so that a reviewer can be listed.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
 MAINTAINERS | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index e27d145..e29fb4f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3030,6 +3030,15 @@ F: include/qemu/co-shared-resource.h
 T: git https://gitlab.com/jsnow/qemu.git jobs
 T: git https://gitlab.com/vsementsov/qemu.git block
 
+CheckPoint and Restart (CPR)
+R: Steve Sistare <steven.sistare@oracle.com>
+S: Supported
+F: hw/vfio/cpr*
+F: include/migration/cpr.h
+F: migration/cpr*
+F: tests/qtest/migration/cpr*
+F: docs/devel/migration/CPR.rst
+
 Compute Express Link
 M: Jonathan Cameron <jonathan.cameron@huawei.com>
 R: Fan Ni <fan.ni@samsung.com>
-- 
1.8.3.1




* [PATCH V4 02/43] vfio: return mr from vfio_get_xlat_addr
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
  2025-05-29 19:23 ` [PATCH V4 01/43] MAINTAINERS: Add reviewer for CPR Steve Sistare
@ 2025-05-29 19:23 ` Steve Sistare
  2025-06-03 10:39   ` Duan, Zhenzhong
  2025-05-29 19:23 ` [PATCH V4 03/43] vfio/container: pass MemoryRegion to DMA operations Steve Sistare
                   ` (42 subsequent siblings)
  44 siblings, 1 reply; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:23 UTC
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Modify memory_get_xlat_addr and vfio_get_xlat_addr to return the memory
region that the translated address is found in.  This will be needed by
CPR in a subsequent patch to map blocks using IOMMU_IOAS_MAP_FILE.

Also return the xlat offset, so we can simplify the interface by removing
the out parameters that can be trivially derived from mr and xlat.

Lastly, rename the functions to memory_translate_iotlb() and
vfio_translate_iotlb().
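
To show why the removed out parameters are trivially derivable, here is a
minimal sketch of what a caller computes from the returned region and
offset.  SketchRegion and the SKETCH_IOMMU_* values are hypothetical
stand-ins for MemoryRegion and the IOMMU permission bits, reduced to what the
example needs; they are not QEMU definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the IOMMU permission bits (illustrative values). */
enum { SKETCH_IOMMU_RO = 1, SKETCH_IOMMU_WO = 2 };

/* Stand-in for the parts of a RAM MemoryRegion the callers consult. */
typedef struct SketchRegion {
    uint8_t *ram_ptr;     /* host VA of the region's first byte */
    uint64_t ram_addr;    /* RAM address of the region's first byte */
    bool readonly;
} SketchRegion;

typedef struct SketchDerived {
    void *vaddr;
    uint64_t ram_addr;
    bool read_only;
} SketchDerived;

/* Everything the old interface returned follows from (mr, xlat, perm). */
SketchDerived sketch_derive(const SketchRegion *mr, uint64_t xlat, int perm)
{
    SketchDerived d;

    assert(mr != NULL);
    d.vaddr = mr->ram_ptr + xlat;
    d.ram_addr = mr->ram_addr + xlat;
    d.read_only = !(perm & SKETCH_IOMMU_WO) || mr->readonly;
    return d;
}
```

Each caller in the diff below computes only the subset it needs: the map
notifiers derive vaddr and read_only, and the dirty notifier derives the RAM
address.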

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Levon <john.levon@nutanix.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/system/memory.h | 19 +++++++++----------
 hw/vfio/listener.c      | 33 ++++++++++++++++++++++-----------
 hw/virtio/vhost-vdpa.c  |  9 +++++++--
 system/memory.c         | 32 +++++++-------------------------
 4 files changed, 45 insertions(+), 48 deletions(-)

diff --git a/include/system/memory.h b/include/system/memory.h
index fbbf4cf..13416d7 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -738,21 +738,20 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
                                              RamDiscardListener *rdl);
 
 /**
- * memory_get_xlat_addr: Extract addresses from a TLB entry
+ * memory_translate_iotlb: Extract addresses from a TLB entry.
+ *                         Called with rcu_read_lock held.
  *
  * @iotlb: pointer to an #IOMMUTLBEntry
- * @vaddr: virtual address
- * @ram_addr: RAM address
- * @read_only: indicates if writes are allowed
- * @mr_has_discard_manager: indicates memory is controlled by a
- *                          RamDiscardManager
+ * @xlat_p: return the offset of the entry from the start of the returned
+ *          MemoryRegion.
  * @errp: pointer to Error*, to store an error if it happens.
  *
- * Return: true on success, else false setting @errp with error.
+ * Return: On success, return the MemoryRegion containing the @iotlb translated
+ *         addr.  The MemoryRegion must not be accessed after rcu_read_unlock.
+ *         On failure, return NULL, setting @errp with error.
  */
-bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
-                          ram_addr_t *ram_addr, bool *read_only,
-                          bool *mr_has_discard_manager, Error **errp);
+MemoryRegion *memory_translate_iotlb(IOMMUTLBEntry *iotlb, hwaddr *xlat_p,
+                                     Error **errp);
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
 typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index bfacb3d..0afafe3 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -90,16 +90,17 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
            section->offset_within_address_space & (1ULL << 63);
 }
 
-/* Called with rcu_read_lock held.  */
-static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
-                               ram_addr_t *ram_addr, bool *read_only,
-                               Error **errp)
+/*
+ * Called with rcu_read_lock held.
+ * The returned MemoryRegion must not be accessed after calling rcu_read_unlock.
+ */
+static MemoryRegion *vfio_translate_iotlb(IOMMUTLBEntry *iotlb, hwaddr *xlat_p,
+                                          Error **errp)
 {
-    bool ret, mr_has_discard_manager;
+    MemoryRegion *mr;
 
-    ret = memory_get_xlat_addr(iotlb, vaddr, ram_addr, read_only,
-                               &mr_has_discard_manager, errp);
-    if (ret && mr_has_discard_manager) {
+    mr = memory_translate_iotlb(iotlb, xlat_p, errp);
+    if (mr && memory_region_has_ram_discard_manager(mr)) {
         /*
          * Malicious VMs might trigger discarding of IOMMU-mapped memory. The
          * pages will remain pinned inside vfio until unmapped, resulting in a
@@ -118,7 +119,7 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
                          " intended via an IOMMU. It's possible to mitigate "
                          " by setting/adjusting RLIMIT_MEMLOCK.");
     }
-    return ret;
+    return mr;
 }
 
 static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
@@ -126,6 +127,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
     VFIOContainerBase *bcontainer = giommu->bcontainer;
     hwaddr iova = iotlb->iova + giommu->iommu_offset;
+    MemoryRegion *mr;
+    hwaddr xlat;
     void *vaddr;
     int ret;
     Error *local_err = NULL;
@@ -150,10 +153,14 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
         bool read_only;
 
-        if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, &local_err)) {
+        mr = vfio_translate_iotlb(iotlb, &xlat, &local_err);
+        if (!mr) {
             error_report_err(local_err);
             goto out;
         }
+        vaddr = memory_region_get_ram_ptr(mr) + xlat;
+        read_only = !(iotlb->perm & IOMMU_WO) || mr->readonly;
+
         /*
          * vaddr is only valid until rcu_read_unlock(). But after
          * vfio_dma_map has set up the mapping the pages will be
@@ -1010,6 +1017,8 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     ram_addr_t translated_addr;
     Error *local_err = NULL;
     int ret = -EINVAL;
+    MemoryRegion *mr;
+    ram_addr_t xlat;
 
     trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask);
 
@@ -1021,9 +1030,11 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     }
 
     rcu_read_lock();
-    if (!vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL, &local_err)) {
+    mr = vfio_translate_iotlb(iotlb, &xlat, &local_err);
+    if (!mr) {
         goto out_unlock;
     }
+    translated_addr = memory_region_get_ram_addr(mr) + xlat;
 
     ret = vfio_container_query_dirty_bitmap(bcontainer, iova, iotlb->addr_mask + 1,
                                 translated_addr, &local_err);
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 1ab2c11..a1dd9e1 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -209,6 +209,8 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     int ret;
     Int128 llend;
     Error *local_err = NULL;
+    MemoryRegion *mr;
+    hwaddr xlat;
 
     if (iotlb->target_as != &address_space_memory) {
         error_report("Wrong target AS \"%s\", only system memory is allowed",
@@ -228,11 +230,14 @@ static void vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
         bool read_only;
 
-        if (!memory_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, NULL,
-                                  &local_err)) {
+        mr = memory_translate_iotlb(iotlb, &xlat, &local_err);
+        if (!mr) {
             error_report_err(local_err);
             return;
         }
+        vaddr = memory_region_get_ram_ptr(mr) + xlat;
+        read_only = !(iotlb->perm & IOMMU_WO) || mr->readonly;
+
         ret = vhost_vdpa_dma_map(s, VHOST_VDPA_GUEST_PA_ASID, iova,
                                  iotlb->addr_mask + 1, vaddr, read_only);
         if (ret) {
diff --git a/system/memory.c b/system/memory.c
index 63b983e..306e9ff 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2174,18 +2174,14 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
 }
 
 /* Called with rcu_read_lock held.  */
-bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
-                          ram_addr_t *ram_addr, bool *read_only,
-                          bool *mr_has_discard_manager, Error **errp)
+MemoryRegion *memory_translate_iotlb(IOMMUTLBEntry *iotlb, hwaddr *xlat_p,
+                                     Error **errp)
 {
     MemoryRegion *mr;
     hwaddr xlat;
     hwaddr len = iotlb->addr_mask + 1;
     bool writable = iotlb->perm & IOMMU_WO;
 
-    if (mr_has_discard_manager) {
-        *mr_has_discard_manager = false;
-    }
     /*
      * The IOMMU TLB entry we have just covers translation through
      * this IOMMU to its immediate target.  We need to translate
@@ -2195,7 +2191,7 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
                                  &xlat, &len, writable, MEMTXATTRS_UNSPECIFIED);
     if (!memory_region_is_ram(mr)) {
         error_setg(errp, "iommu map to non memory area %" HWADDR_PRIx "", xlat);
-        return false;
+        return NULL;
     } else if (memory_region_has_ram_discard_manager(mr)) {
         RamDiscardManager *rdm = memory_region_get_ram_discard_manager(mr);
         MemoryRegionSection tmp = {
@@ -2203,9 +2199,6 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
             .offset_within_region = xlat,
             .size = int128_make64(len),
         };
-        if (mr_has_discard_manager) {
-            *mr_has_discard_manager = true;
-        }
         /*
          * Malicious VMs can map memory into the IOMMU, which is expected
          * to remain discarded. vfio will pin all pages, populating memory.
@@ -2216,7 +2209,7 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
             error_setg(errp, "iommu map to discarded memory (e.g., unplugged"
                          " via virtio-mem): %" HWADDR_PRIx "",
                          iotlb->translated_addr);
-            return false;
+            return NULL;
         }
     }
 
@@ -2226,22 +2219,11 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
      */
     if (len & iotlb->addr_mask) {
         error_setg(errp, "iommu has granularity incompatible with target AS");
-        return false;
+        return NULL;
     }
 
-    if (vaddr) {
-        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
-    }
-
-    if (ram_addr) {
-        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
-    }
-
-    if (read_only) {
-        *read_only = !writable || mr->readonly;
-    }
-
-    return true;
+    *xlat_p = xlat;
+    return mr;
 }
 
 void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
-- 
1.8.3.1




* [PATCH V4 03/43] vfio/container: pass MemoryRegion to DMA operations
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
  2025-05-29 19:23 ` [PATCH V4 01/43] MAINTAINERS: Add reviewer for CPR Steve Sistare
  2025-05-29 19:23 ` [PATCH V4 02/43] vfio: return mr from vfio_get_xlat_addr Steve Sistare
@ 2025-05-29 19:23 ` Steve Sistare
  2025-06-03 10:39   ` Duan, Zhenzhong
  2025-05-29 19:24 ` [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure Steve Sistare
                   ` (41 subsequent siblings)
  44 siblings, 1 reply; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:23 UTC
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Pass through the MemoryRegion to DMA operation handlers of vfio
containers. The vfio-user container will need this later, to translate
the vaddr into an offset for the dma map vfio-user message.

Originally-by: John Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John Levon <john.levon@nutanix.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/hw/vfio/vfio-container-base.h | 17 +++++++++++++++--
 hw/vfio/container-base.c              |  4 ++--
 hw/vfio/container.c                   |  3 ++-
 hw/vfio/iommufd.c                     |  3 ++-
 hw/vfio/listener.c                    |  6 +++---
 5 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index 3d392b0..83ba7a5 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -78,7 +78,7 @@ void vfio_address_space_insert(VFIOAddressSpace *space,
 
 int vfio_container_dma_map(VFIOContainerBase *bcontainer,
                            hwaddr iova, ram_addr_t size,
-                           void *vaddr, bool readonly);
+                           void *vaddr, bool readonly, MemoryRegion *mr);
 int vfio_container_dma_unmap(VFIOContainerBase *bcontainer,
                              hwaddr iova, ram_addr_t size,
                              IOMMUTLBEntry *iotlb, bool unmap_all);
@@ -119,9 +119,22 @@ struct VFIOIOMMUClass {
     bool (*setup)(VFIOContainerBase *bcontainer, Error **errp);
     void (*listener_begin)(VFIOContainerBase *bcontainer);
     void (*listener_commit)(VFIOContainerBase *bcontainer);
+    /**
+     * @dma_map
+     *
+     * Map an address range into the container. Note that @mr will be within
+     * an RCU read lock region across this call.
+     *
+     * @bcontainer: #VFIOContainerBase to use
+     * @iova: start address to map
+     * @size: size of the range to map
+     * @vaddr: process virtual address of mapping
+     * @readonly: true if mapping should be readonly
+     * @mr: the memory region for this mapping
+     */
     int (*dma_map)(const VFIOContainerBase *bcontainer,
                    hwaddr iova, ram_addr_t size,
-                   void *vaddr, bool readonly);
+                   void *vaddr, bool readonly, MemoryRegion *mr);
     /**
      * @dma_unmap
      *
diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
index 1c6ca94..d834bd4 100644
--- a/hw/vfio/container-base.c
+++ b/hw/vfio/container-base.c
@@ -75,12 +75,12 @@ void vfio_address_space_insert(VFIOAddressSpace *space,
 
 int vfio_container_dma_map(VFIOContainerBase *bcontainer,
                            hwaddr iova, ram_addr_t size,
-                           void *vaddr, bool readonly)
+                           void *vaddr, bool readonly, MemoryRegion *mr)
 {
     VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
 
     g_assert(vioc->dma_map);
-    return vioc->dma_map(bcontainer, iova, size, vaddr, readonly);
+    return vioc->dma_map(bcontainer, iova, size, vaddr, readonly, mr);
 }
 
 int vfio_container_dma_unmap(VFIOContainerBase *bcontainer,
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index a9f0dba..a8c76eb 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -207,7 +207,8 @@ static int vfio_legacy_dma_unmap(const VFIOContainerBase *bcontainer,
 }
 
 static int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr iova,
-                               ram_addr_t size, void *vaddr, bool readonly)
+                               ram_addr_t size, void *vaddr, bool readonly,
+                               MemoryRegion *mr)
 {
     const VFIOContainer *container = container_of(bcontainer, VFIOContainer,
                                                   bcontainer);
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index af1c7ab..a8cc543 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -34,7 +34,8 @@
             TYPE_HOST_IOMMU_DEVICE_IOMMUFD "-vfio"
 
 static int iommufd_cdev_map(const VFIOContainerBase *bcontainer, hwaddr iova,
-                            ram_addr_t size, void *vaddr, bool readonly)
+                            ram_addr_t size, void *vaddr, bool readonly,
+                            MemoryRegion *mr)
 {
     const VFIOIOMMUFDContainer *container =
         container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index 0afafe3..a1d2d25 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -170,7 +170,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
          */
         ret = vfio_container_dma_map(bcontainer, iova,
                                      iotlb->addr_mask + 1, vaddr,
-                                     read_only);
+                                     read_only, mr);
         if (ret) {
             error_report("vfio_container_dma_map(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx", %p) = %d (%s)",
@@ -240,7 +240,7 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
         vaddr = memory_region_get_ram_ptr(section->mr) + start;
 
         ret = vfio_container_dma_map(bcontainer, iova, next - start,
-                                     vaddr, section->readonly);
+                                     vaddr, section->readonly, section->mr);
         if (ret) {
             /* Rollback */
             vfio_ram_discard_notify_discard(rdl, section);
@@ -564,7 +564,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
     }
 
     ret = vfio_container_dma_map(bcontainer, iova, int128_get64(llsize),
-                                 vaddr, section->readonly);
+                                 vaddr, section->readonly, section->mr);
     if (ret) {
         error_setg(&err, "vfio_container_dma_map(%p, 0x%"HWADDR_PRIx", "
                    "0x%"HWADDR_PRIx", %p) = %d (%s)",
-- 
1.8.3.1




* [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (2 preceding siblings ...)
  2025-05-29 19:23 ` [PATCH V4 03/43] vfio/container: pass MemoryRegion to DMA operations Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-06-03 10:40   ` Duan, Zhenzhong
  2025-05-29 19:24 ` [PATCH V4 05/43] migration: cpr helpers Steve Sistare
                   ` (40 subsequent siblings)
  44 siblings, 1 reply; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

If vfio_realize fails after vfio_device_attach, it should call
vfio_device_detach during error recovery.  If it fails after
vfio_device_get_name, it should free vbasedev->name.  If it fails
after vfio_pci_config_setup, it should free vdev->msix.

To fix all, call vfio_pci_put_device().
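
The pattern is a single teardown routine that is safe to call no matter
which step failed.  A minimal sketch follows, using a hypothetical SketchDev
type rather than the real VFIOPCIDevice; the step order and the resources are
illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct SketchDev {
    bool attached;    /* set once the attach step has succeeded */
    char *name;
    void *msix;
} SketchDev;

/* One teardown for every failure point: free(NULL) is a no-op. */
void sketch_put_device(SketchDev *d)
{
    assert(d != NULL);
    if (d->attached) {
        d->attached = false;    /* stands in for a detach call */
    }
    free(d->name);
    d->name = NULL;
    free(d->msix);
    d->msix = NULL;
}

/* fail_at: 0 means no injected failure, 1..3 fail after that step. */
int sketch_realize(SketchDev *d, int fail_at)
{
    static const char name[] = "sketch-device";

    assert(d != NULL);
    memset(d, 0, sizeof(*d));

    d->name = malloc(sizeof(name));             /* step 1: name */
    if (d->name) {
        memcpy(d->name, name, sizeof(name));
    }
    if (!d->name || fail_at == 1) {
        goto error;
    }

    d->attached = true;                         /* step 2: attach */
    if (fail_at == 2) {
        goto error;
    }

    d->msix = calloc(1, 64);                    /* step 3: config setup */
    if (!d->msix || fail_at == 3) {
        goto error;
    }
    return 0;

error:
    sketch_put_device(d);
    return -1;
}
```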

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/pci.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index a1bfdfe..7d3b9ff 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3296,6 +3296,7 @@ out_teardown:
     vfio_bars_exit(vdev);
 error:
     error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name);
+    vfio_pci_put_device(vdev);
 }
 
 static void vfio_instance_finalize(Object *obj)
-- 
1.8.3.1




* [PATCH V4 05/43] migration: cpr helpers
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (3 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 06/43] migration: lower handler priority Steve Sistare
                   ` (39 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Add the cpr_incoming_needed, cpr_open_fd, and cpr_resave_fd helpers,
for use when adding cpr support for vfio and iommufd.
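
The find-or-open behavior of cpr_open_fd can be modeled with a toy
descriptor table.  The sketch below is a simplified stand-in, not the QEMU
implementation: the sketch_* names, the fixed-size table, and the use of
plain open() are all assumptions made for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define SKETCH_MAX_FDS 16

struct sketch_fd {
    char name[32];
    int id;
    int fd;
};

static struct sketch_fd sketch_fds[SKETCH_MAX_FDS];
static int sketch_nfds;

/* Return the saved descriptor for (name, id), or -1 if there is none. */
int sketch_find_fd(const char *name, int id)
{
    int i;

    for (i = 0; i < sketch_nfds; i++) {
        if (sketch_fds[i].id == id && !strcmp(sketch_fds[i].name, name)) {
            return sketch_fds[i].fd;
        }
    }
    return -1;
}

void sketch_save_fd(const char *name, int id, int fd)
{
    assert(sketch_nfds < SKETCH_MAX_FDS);
    assert(strlen(name) < sizeof(sketch_fds[0].name));
    strcpy(sketch_fds[sketch_nfds].name, name);
    sketch_fds[sketch_nfds].id = id;
    sketch_fds[sketch_nfds].fd = fd;
    sketch_nfds++;
}

/* Reuse a saved descriptor if one exists, else open the path and save it. */
int sketch_open_fd(const char *path, int flags, const char *name, int id)
{
    int fd = sketch_find_fd(name, id);

    if (fd < 0) {
        fd = open(path, flags);
        if (fd >= 0) {
            sketch_save_fd(name, id, fd);
        }
    }
    return fd;
}
```

In this model, a second sketch_open_fd call with the same name and id
returns the saved descriptor and never touches the path.  That is the
property a table preloaded from old QEMU relies on, since the device is
already open and configured.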

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
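Note for readers: cpr_open_fd follows a find-or-open-then-save pattern, so
the same call works both on a fresh start and after a live update where
the descriptor was inherited.  The standalone sketch below shows that
pattern with a toy fixed-size table.  The registry_* names and the table
are assumptions made for illustration; the real helpers use the CPR state
list in migration/cpr.c.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAX_FDS 16

/* Toy stand-in for the saved-descriptor list, keyed by (name, id). */
typedef struct SavedFd {
    char name[32];
    int id;
    int fd;
    int used;
} SavedFd;

static SavedFd saved_fds[MAX_FDS];

/* Return the saved fd for (name, id), or -1 if none is saved. */
static int registry_find_fd(const char *name, int id)
{
    for (int i = 0; i < MAX_FDS; i++) {
        if (saved_fds[i].used && saved_fds[i].id == id &&
            !strcmp(saved_fds[i].name, name)) {
            return saved_fds[i].fd;
        }
    }
    return -1;
}

/* Record fd under (name, id).  Returns 0, or -1 if the table is full. */
static int registry_save_fd(const char *name, int id, int fd)
{
    for (int i = 0; i < MAX_FDS; i++) {
        if (!saved_fds[i].used) {
            snprintf(saved_fds[i].name, sizeof(saved_fds[i].name),
                     "%s", name);
            saved_fds[i].id = id;
            saved_fds[i].fd = fd;
            saved_fds[i].used = 1;
            return 0;
        }
    }
    return -1;
}

/* Reuse a saved fd if one exists, else open the path and save the result. */
static int registry_open_fd(const char *path, int flags,
                            const char *name, int id)
{
    int fd = registry_find_fd(name, id);

    if (fd < 0) {
        fd = open(path, flags);
        if (fd >= 0 && registry_save_fd(name, id, fd) < 0) {
            close(fd);
            fd = -1;
        }
    }
    return fd;
}
```

Calling registry_open_fd twice with the same (name, id) returns the same
descriptor without opening the path a second time.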
 include/migration/cpr.h |  5 +++++
 migration/cpr.c         | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 7561fc7..07858e9 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -18,6 +18,9 @@
 void cpr_save_fd(const char *name, int id, int fd);
 void cpr_delete_fd(const char *name, int id);
 int cpr_find_fd(const char *name, int id);
+void cpr_resave_fd(const char *name, int id, int fd);
+int cpr_open_fd(const char *path, int flags, const char *name, int id,
+                Error **errp);
 
 MigMode cpr_get_incoming_mode(void);
 void cpr_set_incoming_mode(MigMode mode);
@@ -28,6 +31,8 @@ int cpr_state_load(MigrationChannel *channel, Error **errp);
 void cpr_state_close(void);
 struct QIOChannel *cpr_state_ioc(void);
 
+bool cpr_incoming_needed(void *opaque);
+
 QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
 QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
 
diff --git a/migration/cpr.c b/migration/cpr.c
index 42c4656..a50a57e 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -95,6 +95,36 @@ int cpr_find_fd(const char *name, int id)
     trace_cpr_find_fd(name, id, fd);
     return fd;
 }
+
+void cpr_resave_fd(const char *name, int id, int fd)
+{
+    CprFd *elem = find_fd(&cpr_state.fds, name, id);
+    int old_fd = elem ? elem->fd : -1;
+
+    if (old_fd < 0) {
+        cpr_save_fd(name, id, fd);
+    } else if (old_fd != fd) {
+        error_setg(&error_fatal,
+                   "internal error: cpr fd '%s' id %d value %d "
+                   "already saved with a different value %d",
+                   name, id, fd, old_fd);
+    }
+}
+
+int cpr_open_fd(const char *path, int flags, const char *name, int id,
+                Error **errp)
+{
+    int fd = cpr_find_fd(name, id);
+
+    if (fd < 0) {
+        fd = qemu_open(path, flags, errp);
+        if (fd >= 0) {
+            cpr_save_fd(name, id, fd);
+        }
+    }
+    return fd;
+}
+
 /*************************************************************************/
 #define CPR_STATE "CprState"
 
@@ -228,3 +258,9 @@ void cpr_state_close(void)
         cpr_state_file = NULL;
     }
 }
+
+bool cpr_incoming_needed(void *opaque)
+{
+    MigMode mode = migrate_mode();
+    return mode == MIG_MODE_CPR_TRANSFER;
+}
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH V4 06/43] migration: lower handler priority
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (4 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 05/43] migration: cpr helpers Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 07/43] vfio: vfio_find_ram_discard_listener Steve Sistare
                   ` (38 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Define a vmstate priority that is lower than the default, so its handlers
run after all default priority handlers.  Since 0 is no longer the default
priority, translate an uninitialized priority of 0 to MIG_PRI_DEFAULT.

CPR for vfio will use this to install handlers for containers that run
after handlers for the devices that they contain.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
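Note for readers: the sketch below illustrates the idea of reserving 0 for
"not set" so that a zero-initialized descriptor still gets the default
priority once a lower level exists.  The enum, Handler struct and helper
names are hypothetical, and sorting with qsort is an assumption of the
sketch; savevm.c keeps its handlers in a priority-ordered list instead.

```c
#include <stdlib.h>

typedef enum {
    PRI_UNINITIALIZED = 0,  /* field left at zero; treated as default */
    PRI_LOW,                /* runs after default */
    PRI_DEFAULT,
    PRI_HIGH,               /* runs before default */
} Priority;

typedef struct Handler {
    const char *name;
    Priority priority;
} Handler;

/* Map an unset priority to the default; pass others through. */
static Priority effective_priority(const Handler *h)
{
    return h->priority ? h->priority : PRI_DEFAULT;
}

/* qsort comparator: higher effective priority sorts first. */
static int by_priority_desc(const void *a, const void *b)
{
    Priority pa = effective_priority(a);
    Priority pb = effective_priority(b);

    return (pb > pa) - (pb < pa);
}

/* Order handlers so that they can be run front to back. */
static void sort_handlers(Handler *handlers, size_t count)
{
    qsort(handlers, count, sizeof(handlers[0]), by_priority_desc);
}
```

With one handler at PRI_LOW, one left unset, and one at PRI_HIGH, the
unset handler lands between the other two, which is the ordering the
container handlers depend on.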
 include/migration/vmstate.h | 6 +++++-
 migration/savevm.c          | 4 ++--
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index a1dfab4..1ff7bd9 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -155,7 +155,11 @@ enum VMStateFlags {
 };
 
 typedef enum {
-    MIG_PRI_DEFAULT = 0,
+    MIG_PRI_UNINITIALIZED = 0,  /* An uninitialized priority field maps to */
+                                /* MIG_PRI_DEFAULT in save_state_priority */
+
+    MIG_PRI_LOW,                /* Must happen after default */
+    MIG_PRI_DEFAULT,
     MIG_PRI_IOMMU,              /* Must happen before PCI devices */
     MIG_PRI_PCI_BUS,            /* Must happen before IOMMU */
     MIG_PRI_VIRTIO_MEM,         /* Must happen before IOMMU */
diff --git a/migration/savevm.c b/migration/savevm.c
index 006514c..7e87815 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -266,7 +266,7 @@ typedef struct SaveState {
 
 static SaveState savevm_state = {
     .handlers = QTAILQ_HEAD_INITIALIZER(savevm_state.handlers),
-    .handler_pri_head = { [MIG_PRI_DEFAULT ... MIG_PRI_MAX] = NULL },
+    .handler_pri_head = { [0 ... MIG_PRI_MAX] = NULL },
     .global_section_id = 0,
 };
 
@@ -737,7 +737,7 @@ static int calculate_compat_instance_id(const char *idstr)
 
 static inline MigrationPriority save_state_priority(SaveStateEntry *se)
 {
-    if (se->vmsd) {
+    if (se->vmsd && se->vmsd->priority) {
         return se->vmsd->priority;
     }
     return MIG_PRI_DEFAULT;
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH V4 07/43] vfio: vfio_find_ram_discard_listener
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (5 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 06/43] migration: lower handler priority Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-06-03 10:59   ` Duan, Zhenzhong
  2025-05-29 19:24 ` [PATCH V4 08/43] vfio: move vfio-cpr.h Steve Sistare
                   ` (37 subsequent siblings)
  44 siblings, 1 reply; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Define vfio_find_ram_discard_listener as a subroutine so additional calls to
it may be added in a subsequent patch.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
 include/hw/vfio/vfio-container-base.h |  3 +++
 hw/vfio/listener.c                    | 35 ++++++++++++++++++++++-------------
 2 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index 83ba7a5..01cdcb6 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -196,4 +196,7 @@ struct VFIOIOMMUClass {
     void (*release)(VFIOContainerBase *bcontainer);
 };
 
+VFIORamDiscardListener *vfio_find_ram_discard_listener(
+    VFIOContainerBase *bcontainer, MemoryRegionSection *section);
+
 #endif /* HW_VFIO_VFIO_CONTAINER_BASE_H */
diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index a1d2d25..fb1fd84 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -456,6 +456,26 @@ static void vfio_device_error_append(VFIODevice *vbasedev, Error **errp)
     }
 }
 
+VFIORamDiscardListener *vfio_find_ram_discard_listener(
+    VFIOContainerBase *bcontainer, MemoryRegionSection *section)
+{
+    VFIORamDiscardListener *vrdl = NULL;
+
+    QLIST_FOREACH(vrdl, &bcontainer->vrdl_list, next) {
+        if (vrdl->mr == section->mr &&
+            vrdl->offset_within_address_space ==
+            section->offset_within_address_space) {
+            break;
+        }
+    }
+
+    if (!vrdl) {
+        hw_error("vfio: Trying to sync missing RAM discard listener");
+        /* does not return */
+    }
+    return vrdl;
+}
+
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
@@ -1086,19 +1106,8 @@ vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainerBase *bcontainer,
                                             MemoryRegionSection *section)
 {
     RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
-    VFIORamDiscardListener *vrdl = NULL;
-
-    QLIST_FOREACH(vrdl, &bcontainer->vrdl_list, next) {
-        if (vrdl->mr == section->mr &&
-            vrdl->offset_within_address_space ==
-            section->offset_within_address_space) {
-            break;
-        }
-    }
-
-    if (!vrdl) {
-        hw_error("vfio: Trying to sync missing RAM discard listener");
-    }
+    VFIORamDiscardListener *vrdl =
+        vfio_find_ram_discard_listener(bcontainer, section);
 
     /*
      * We only want/can synchronize the bitmap for actually mapped parts -
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH V4 08/43] vfio: move vfio-cpr.h
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (6 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 07/43] vfio: vfio_find_ram_discard_listener Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-06-03 11:01   ` Duan, Zhenzhong
  2025-05-29 19:24 ` [PATCH V4 09/43] vfio/container: register container for cpr Steve Sistare
                   ` (36 subsequent siblings)
  44 siblings, 1 reply; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Move vfio-cpr.h to include/hw/vfio, because it will need to be included by
other files there.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
 MAINTAINERS                |  1 +
 hw/vfio/vfio-cpr.h         | 15 ---------------
 include/hw/vfio/vfio-cpr.h | 18 ++++++++++++++++++
 hw/vfio/container.c        |  2 +-
 hw/vfio/cpr.c              |  2 +-
 hw/vfio/iommufd.c          |  2 +-
 6 files changed, 22 insertions(+), 18 deletions(-)
 delete mode 100644 hw/vfio/vfio-cpr.h
 create mode 100644 include/hw/vfio/vfio-cpr.h

diff --git a/MAINTAINERS b/MAINTAINERS
index e29fb4f..7b919d7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3034,6 +3034,7 @@ CheckPoint and Restart (CPR)
 R: Steve Sistare <steven.sistare@oracle.com>
 S: Supported
 F: hw/vfio/cpr*
+F: include/hw/vfio/vfio-cpr.h
 F: include/migration/cpr.h
 F: migration/cpr*
 F: tests/qtest/migration/cpr*
diff --git a/hw/vfio/vfio-cpr.h b/hw/vfio/vfio-cpr.h
deleted file mode 100644
index 134b83a..0000000
--- a/hw/vfio/vfio-cpr.h
+++ /dev/null
@@ -1,15 +0,0 @@
-/*
- * VFIO CPR
- *
- * Copyright (c) 2025 Oracle and/or its affiliates.
- *
- * SPDX-License-Identifier: GPL-2.0-or-later
- */
-
-#ifndef HW_VFIO_CPR_H
-#define HW_VFIO_CPR_H
-
-bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp);
-void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer);
-
-#endif /* HW_VFIO_CPR_H */
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
new file mode 100644
index 0000000..750ea5b
--- /dev/null
+++ b/include/hw/vfio/vfio-cpr.h
@@ -0,0 +1,18 @@
+/*
+ * VFIO CPR
+ *
+ * Copyright (c) 2025 Oracle and/or its affiliates.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef HW_VFIO_VFIO_CPR_H
+#define HW_VFIO_VFIO_CPR_H
+
+struct VFIOContainerBase;
+
+bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
+                                 Error **errp);
+void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
+
+#endif /* HW_VFIO_VFIO_CPR_H */
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index a8c76eb..0f948d0 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -33,8 +33,8 @@
 #include "qapi/error.h"
 #include "pci.h"
 #include "hw/vfio/vfio-container.h"
+#include "hw/vfio/vfio-cpr.h"
 #include "vfio-helpers.h"
-#include "vfio-cpr.h"
 #include "vfio-listener.h"
 
 #define TYPE_HOST_IOMMU_DEVICE_LEGACY_VFIO TYPE_HOST_IOMMU_DEVICE "-legacy-vfio"
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index 3214184..0210e76 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -8,9 +8,9 @@
 #include "qemu/osdep.h"
 #include "hw/vfio/vfio-device.h"
 #include "migration/misc.h"
+#include "hw/vfio/vfio-cpr.h"
 #include "qapi/error.h"
 #include "system/runstate.h"
-#include "vfio-cpr.h"
 
 static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
                                     MigrationEvent *e, Error **errp)
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index a8cc543..eb2f88d 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -21,13 +21,13 @@
 #include "qapi/error.h"
 #include "system/iommufd.h"
 #include "hw/qdev-core.h"
+#include "hw/vfio/vfio-cpr.h"
 #include "system/reset.h"
 #include "qemu/cutils.h"
 #include "qemu/chardev_open.h"
 #include "pci.h"
 #include "vfio-iommufd.h"
 #include "vfio-helpers.h"
-#include "vfio-cpr.h"
 #include "vfio-listener.h"
 
 #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO             \
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH V4 09/43] vfio/container: register container for cpr
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (7 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 08/43] vfio: move vfio-cpr.h Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-06-01 15:21   ` Cédric Le Goater
  2025-06-03 11:57   ` Duan, Zhenzhong
  2025-05-29 19:24 ` [PATCH V4 10/43] vfio/container: preserve descriptors Steve Sistare
                   ` (35 subsequent siblings)
  44 siblings, 2 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Register a legacy container for cpr-transfer, replacing the generic CPR
register call with a more specific legacy container register call.  Add a
blocker if the kernel does not support VFIO_UPDATE_VADDR or VFIO_UNMAP_ALL.

This is mostly boilerplate.  The fields to be saved and restored are added
in subsequent patches.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
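Note for readers: registration succeeds even when the kernel lacks a
required extension; the container simply records a blocker so that
cpr-transfer is refused later.  The sketch below models that decision
without touching /dev/vfio.  The probe callback, the EXT_* values, and
the string blocker are all assumptions made for illustration; the patch
itself uses ioctl(VFIO_CHECK_EXTENSION) and migrate_add_blocker_modes().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Illustrative extension ids; not taken from any kernel header. */
enum { EXT_UPDATE_VADDR = 1, EXT_UNMAP_ALL = 2 };

/* Returns > 0 if the extension is supported. */
typedef int (*ProbeFn)(void *opaque, int extension);

typedef struct Container {
    ProbeFn probe;
    void *opaque;
    char blocker[96];   /* empty string: live update is allowed */
} Container;

/* Example probe: opaque points to a bitmask of supported extensions. */
static int probe_mask(void *opaque, int extension)
{
    unsigned mask = *(unsigned *)opaque;

    return (mask >> extension) & 1;
}

/* Check each required extension; record the first one that is missing. */
static void container_register(Container *c)
{
    static const struct {
        int ext;
        const char *name;
    } required[] = {
        { EXT_UPDATE_VADDR, "UPDATE_VADDR" },
        { EXT_UNMAP_ALL, "UNMAP_ALL" },
    };

    c->blocker[0] = '\0';
    for (size_t i = 0; i < sizeof(required) / sizeof(required[0]); i++) {
        if (c->probe(c->opaque, required[i].ext) <= 0) {
            snprintf(c->blocker, sizeof(c->blocker),
                     "container does not support %s", required[i].name);
            return;
        }
    }
}

static bool live_update_allowed(const Container *c)
{
    return c->blocker[0] == '\0';
}
```

The point of the design is that an unsupported kernel only disables the
live update mode; the device itself remains usable.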
 include/hw/vfio/vfio-container.h |  2 ++
 include/hw/vfio/vfio-cpr.h       | 15 +++++++++
 hw/vfio/container.c              |  6 ++--
 hw/vfio/cpr-legacy.c             | 69 ++++++++++++++++++++++++++++++++++++++++
 hw/vfio/cpr.c                    |  5 ++-
 hw/vfio/meson.build              |  1 +
 6 files changed, 92 insertions(+), 6 deletions(-)
 create mode 100644 hw/vfio/cpr-legacy.c

diff --git a/include/hw/vfio/vfio-container.h b/include/hw/vfio/vfio-container.h
index afc498d..21e5807 100644
--- a/include/hw/vfio/vfio-container.h
+++ b/include/hw/vfio/vfio-container.h
@@ -10,6 +10,7 @@
 #define HW_VFIO_CONTAINER_H
 
 #include "hw/vfio/vfio-container-base.h"
+#include "hw/vfio/vfio-cpr.h"
 
 typedef struct VFIOContainer VFIOContainer;
 typedef struct VFIODevice VFIODevice;
@@ -29,6 +30,7 @@ typedef struct VFIOContainer {
     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
     unsigned iommu_type;
     QLIST_HEAD(, VFIOGroup) group_list;
+    VFIOContainerCPR cpr;
 } VFIOContainer;
 
 OBJECT_DECLARE_SIMPLE_TYPE(VFIOContainer, VFIO_IOMMU_LEGACY);
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 750ea5b..d4e0bd5 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -9,8 +9,23 @@
 #ifndef HW_VFIO_VFIO_CPR_H
 #define HW_VFIO_VFIO_CPR_H
 
+#include "migration/misc.h"
+
+struct VFIOContainer;
 struct VFIOContainerBase;
 
+typedef struct VFIOContainerCPR {
+    Error *blocker;
+} VFIOContainerCPR;
+
+
+bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
+                                        Error **errp);
+void vfio_legacy_cpr_unregister_container(struct VFIOContainer *container);
+
+int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
+                             Error **errp);
+
 bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
                                  Error **errp);
 void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 0f948d0..7d2035c 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -643,7 +643,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
     new_container = true;
     bcontainer = &container->bcontainer;
 
-    if (!vfio_cpr_register_container(bcontainer, errp)) {
+    if (!vfio_legacy_cpr_register_container(container, errp)) {
         goto fail;
     }
 
@@ -679,7 +679,7 @@ fail:
         vioc->release(bcontainer);
     }
     if (new_container) {
-        vfio_cpr_unregister_container(bcontainer);
+        vfio_legacy_cpr_unregister_container(container);
         object_unref(container);
     }
     if (fd >= 0) {
@@ -720,7 +720,7 @@ static void vfio_container_disconnect(VFIOGroup *group)
         VFIOAddressSpace *space = bcontainer->space;
 
         trace_vfio_container_disconnect(container->fd);
-        vfio_cpr_unregister_container(bcontainer);
+        vfio_legacy_cpr_unregister_container(container);
         close(container->fd);
         object_unref(container);
 
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
new file mode 100644
index 0000000..419b9fb
--- /dev/null
+++ b/hw/vfio/cpr-legacy.c
@@ -0,0 +1,69 @@
+/*
+ * Copyright (c) 2021-2025 Oracle and/or its affiliates.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+#include "hw/vfio/vfio-container.h"
+#include "hw/vfio/vfio-cpr.h"
+#include "migration/blocker.h"
+#include "migration/cpr.h"
+#include "migration/migration.h"
+#include "migration/vmstate.h"
+#include "qapi/error.h"
+
+static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
+{
+    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
+        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR");
+        return false;
+
+    } else if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
+        error_setg(errp, "VFIO container does not support VFIO_UNMAP_ALL");
+        return false;
+
+    } else {
+        return true;
+    }
+}
+
+static const VMStateDescription vfio_container_vmstate = {
+    .name = "vfio-container",
+    .version_id = 0,
+    .minimum_version_id = 0,
+    .needed = cpr_incoming_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
+{
+    VFIOContainerBase *bcontainer = &container->bcontainer;
+    Error **cpr_blocker = &container->cpr.blocker;
+
+    migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
+                                vfio_cpr_reboot_notifier,
+                                MIG_MODE_CPR_REBOOT);
+
+    if (!vfio_cpr_supported(container, cpr_blocker)) {
+        return migrate_add_blocker_modes(cpr_blocker, errp,
+                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
+    }
+
+    vmstate_register(NULL, -1, &vfio_container_vmstate, container);
+
+    return true;
+}
+
+void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
+{
+    VFIOContainerBase *bcontainer = &container->bcontainer;
+
+    migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
+    migrate_del_blocker(&container->cpr.blocker);
+    vmstate_unregister(NULL, &vfio_container_vmstate, container);
+}
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index 0210e76..0e59612 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -7,13 +7,12 @@
 
 #include "qemu/osdep.h"
 #include "hw/vfio/vfio-device.h"
-#include "migration/misc.h"
 #include "hw/vfio/vfio-cpr.h"
 #include "qapi/error.h"
 #include "system/runstate.h"
 
-static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
-                                    MigrationEvent *e, Error **errp)
+int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
+                             MigrationEvent *e, Error **errp)
 {
     if (e->type == MIG_EVENT_PRECOPY_SETUP &&
         !runstate_check(RUN_STATE_SUSPENDED) && !vm_get_suspended()) {
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index bccb050..73d29f9 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -21,6 +21,7 @@ system_ss.add(when: 'CONFIG_VFIO_XGMAC', if_true: files('calxeda-xgmac.c'))
 system_ss.add(when: 'CONFIG_VFIO_AMD_XGBE', if_true: files('amd-xgbe.c'))
 system_ss.add(when: 'CONFIG_VFIO', if_true: files(
   'cpr.c',
+  'cpr-legacy.c',
   'device.c',
   'migration.c',
   'migration-multifd.c',
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH V4 10/43] vfio/container: preserve descriptors
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (8 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 09/43] vfio/container: register container for cpr Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-06-01 16:57   ` Cédric Le Goater
  2025-06-03 11:57   ` Duan, Zhenzhong
  2025-05-29 19:24 ` [PATCH V4 11/43] vfio/container: discard old DMA vaddr Steve Sistare
                   ` (34 subsequent siblings)
  44 siblings, 2 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

At vfio creation time, save the value of vfio container, group, and device
descriptors in CPR state.  On qemu restart, vfio_realize() finds and uses
the saved descriptors.

During reuse, device and iommu state is already configured, so operations
in vfio_realize that would modify the configuration, such as vfio ioctl's,
are skipped.  The result is that vfio_realize constructs qemu data
structures that reflect the current state of the device.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/hw/vfio/vfio-cpr.h |  6 +++++
 hw/vfio/container.c        | 67 +++++++++++++++++++++++++++++++++++-----------
 hw/vfio/cpr-legacy.c       | 42 +++++++++++++++++++++++++++++
 3 files changed, 100 insertions(+), 15 deletions(-)

diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index d4e0bd5..5a2e5f6 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -13,6 +13,7 @@
 
 struct VFIOContainer;
 struct VFIOContainerBase;
+struct VFIOGroup;
 
 typedef struct VFIOContainerCPR {
     Error *blocker;
@@ -30,4 +31,9 @@ bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
                                  Error **errp);
 void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
 
+int vfio_cpr_group_get_device_fd(int d, const char *name);
+
+bool vfio_cpr_container_match(struct VFIOContainer *container,
+                              struct VFIOGroup *group, int fd);
+
 #endif /* HW_VFIO_VFIO_CPR_H */
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 7d2035c..798abda 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -31,6 +31,8 @@
 #include "system/reset.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "migration/cpr.h"
+#include "migration/blocker.h"
 #include "pci.h"
 #include "hw/vfio/vfio-container.h"
 #include "hw/vfio/vfio-cpr.h"
@@ -426,7 +428,12 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
         return NULL;
     }
 
-    if (!vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
+    /*
+     * During CPR, just set the container type and skip the ioctls, as the
+     * container and group are already configured in the kernel.
+     */
+    if (!cpr_is_incoming() &&
+        !vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
         return NULL;
     }
 
@@ -593,6 +600,11 @@ static bool vfio_container_group_add(VFIOContainer *container, VFIOGroup *group,
     group->container = container;
     QLIST_INSERT_HEAD(&container->group_list, group, container_next);
     vfio_group_add_kvm_device(group);
+    /*
+     * Remember the container fd for each group, so we can attach to the same
+     * container after CPR.
+     */
+    cpr_resave_fd("vfio_container_for_group", group->groupid, container->fd);
     return true;
 }
 
@@ -602,6 +614,7 @@ static void vfio_container_group_del(VFIOContainer *container, VFIOGroup *group)
     group->container = NULL;
     vfio_group_del_kvm_device(group);
     vfio_ram_block_discard_disable(container, false);
+    cpr_delete_fd("vfio_container_for_group", group->groupid);
 }
 
 static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
@@ -616,17 +629,34 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
     bool group_was_added = false;
 
     space = vfio_address_space_get(as);
+    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
 
-    QLIST_FOREACH(bcontainer, &space->containers, next) {
-        container = container_of(bcontainer, VFIOContainer, bcontainer);
-        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
-            return vfio_container_group_add(container, group, errp);
+    if (!cpr_is_incoming()) {
+        QLIST_FOREACH(bcontainer, &space->containers, next) {
+            container = container_of(bcontainer, VFIOContainer, bcontainer);
+            if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
+                return vfio_container_group_add(container, group, errp);
+            }
         }
-    }
 
-    fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
-    if (fd < 0) {
-        goto fail;
+        fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
+        if (fd < 0) {
+            goto fail;
+        }
+    } else {
+        /*
+         * For incoming CPR, the group is already attached in the kernel.
+         * If a container with matching fd is found, then update the
+         * userland group list and return.  If not, then after the loop,
+         * create the container struct and group list.
+         */
+        QLIST_FOREACH(bcontainer, &space->containers, next) {
+            container = container_of(bcontainer, VFIOContainer, bcontainer);
+
+            if (vfio_cpr_container_match(container, group, fd)) {
+                return vfio_container_group_add(container, group, errp);
+            }
+        }
     }
 
     ret = ioctl(fd, VFIO_GET_API_VERSION);
@@ -698,6 +728,7 @@ static void vfio_container_disconnect(VFIOGroup *group)
 
     QLIST_REMOVE(group, container_next);
     group->container = NULL;
+    cpr_delete_fd("vfio_container_for_group", group->groupid);
 
     /*
      * Explicitly release the listener first before unset container,
@@ -751,7 +782,7 @@ static VFIOGroup *vfio_group_get(int groupid, AddressSpace *as, Error **errp)
     group = g_malloc0(sizeof(*group));
 
     snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
-    group->fd = qemu_open(path, O_RDWR, errp);
+    group->fd = cpr_open_fd(path, O_RDWR, "vfio_group", groupid, errp);
     if (group->fd < 0) {
         goto free_group_exit;
     }
@@ -783,6 +814,7 @@ static VFIOGroup *vfio_group_get(int groupid, AddressSpace *as, Error **errp)
     return group;
 
 close_fd_exit:
+    cpr_delete_fd("vfio_group", groupid);
     close(group->fd);
 
 free_group_exit:
@@ -804,6 +836,7 @@ static void vfio_group_put(VFIOGroup *group)
     vfio_container_disconnect(group);
     QLIST_REMOVE(group, next);
     trace_vfio_group_put(group->fd);
+    cpr_delete_fd("vfio_group", group->groupid);
     close(group->fd);
     g_free(group);
 }
@@ -814,7 +847,7 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
     g_autofree struct vfio_device_info *info = NULL;
     int fd;
 
-    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
+    fd = vfio_cpr_group_get_device_fd(group->fd, name);
     if (fd < 0) {
         error_setg_errno(errp, errno, "error getting device from group %d",
                          group->groupid);
@@ -827,8 +860,7 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
     info = vfio_get_device_info(fd);
     if (!info) {
         error_setg_errno(errp, errno, "error getting device info");
-        close(fd);
-        return false;
+        goto fail;
     }
 
     /*
@@ -842,8 +874,7 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
         if (!QLIST_EMPTY(&group->device_list)) {
             error_setg(errp, "Inconsistent setting of support for discarding "
                        "RAM (e.g., balloon) within group");
-            close(fd);
-            return false;
+            goto fail;
         }
 
         if (!group->ram_block_discard_allowed) {
@@ -861,6 +892,11 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
     trace_vfio_device_get(name, info->flags, info->num_regions, info->num_irqs);
 
     return true;
+
+fail:
+    close(fd);
+    cpr_delete_fd(name, 0);
+    return false;
 }
 
 static void vfio_device_put(VFIODevice *vbasedev)
@@ -871,6 +907,7 @@ static void vfio_device_put(VFIODevice *vbasedev)
     QLIST_REMOVE(vbasedev, next);
     vbasedev->group = NULL;
     trace_vfio_device_put(vbasedev->fd);
+    cpr_delete_fd(vbasedev->name, 0);
     close(vbasedev->fd);
 }
 
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
index 419b9fb..29be64f 100644
--- a/hw/vfio/cpr-legacy.c
+++ b/hw/vfio/cpr-legacy.c
@@ -9,6 +9,7 @@
 #include "qemu/osdep.h"
 #include "hw/vfio/vfio-container.h"
 #include "hw/vfio/vfio-cpr.h"
+#include "hw/vfio/vfio-device.h"
 #include "migration/blocker.h"
 #include "migration/cpr.h"
 #include "migration/migration.h"
@@ -67,3 +68,44 @@ void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
     migrate_del_blocker(&container->cpr.blocker);
     vmstate_unregister(NULL, &vfio_container_vmstate, container);
 }
+
+int vfio_cpr_group_get_device_fd(int d, const char *name)
+{
+    const int id = 0;
+    int fd = cpr_find_fd(name, id);
+
+    if (fd < 0) {
+        fd = ioctl(d, VFIO_GROUP_GET_DEVICE_FD, name);
+        if (fd >= 0) {
+            cpr_save_fd(name, id, fd);
+        }
+    }
+    return fd;
+}
+
+static bool same_device(int fd1, int fd2)
+{
+    struct stat st1, st2;
+
+    return !fstat(fd1, &st1) && !fstat(fd2, &st2) && st1.st_dev == st2.st_dev;
+}
+
+bool vfio_cpr_container_match(VFIOContainer *container, VFIOGroup *group,
+                              int fd)
+{
+    if (container->fd == fd) {
+        return true;
+    }
+    if (!same_device(container->fd, fd)) {
+        return false;
+    }
+    /*
+     * Same device, different fd.  This occurs when the container fd is
+     * cpr_save'd multiple times, once for each groupid, so SCM_RIGHTS
+     * produces duplicates.  De-dup it.
+     */
+    cpr_delete_fd("vfio_container_for_group", group->groupid);
+    close(fd);
+    cpr_save_fd("vfio_container_for_group", group->groupid, container->fd);
+    return true;
+}
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH V4 11/43] vfio/container: discard old DMA vaddr
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (9 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 10/43] vfio/container: preserve descriptors Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 12/43] vfio/container: restore " Steve Sistare
                   ` (33 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

In the container pre_save handler, discard the virtual addresses in DMA
mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest RAM will be
remapped at a different VA in new QEMU.  DMA to already-mapped
pages continues.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
 hw/vfio/cpr-legacy.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
index 29be64f..cf80332 100644
--- a/hw/vfio/cpr-legacy.c
+++ b/hw/vfio/cpr-legacy.c
@@ -16,6 +16,22 @@
 #include "migration/vmstate.h"
 #include "qapi/error.h"
 
+static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
+{
+    struct vfio_iommu_type1_dma_unmap unmap = {
+        .argsz = sizeof(unmap),
+        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
+        .iova = 0,
+        .size = 0,
+    };
+    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+        error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
+        return false;
+    }
+    return true;
+}
+
+
 static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
 {
     if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
@@ -31,10 +47,23 @@ static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
     }
 }
 
+static int vfio_container_pre_save(void *opaque)
+{
+    VFIOContainer *container = opaque;
+    Error *local_err = NULL;
+
+    if (!vfio_dma_unmap_vaddr_all(container, &local_err)) {
+        error_report_err(local_err);
+        return -1;
+    }
+    return 0;
+}
+
 static const VMStateDescription vfio_container_vmstate = {
     .name = "vfio-container",
     .version_id = 0,
     .minimum_version_id = 0,
+    .pre_save = vfio_container_pre_save,
     .needed = cpr_incoming_needed,
     .fields = (VMStateField[]) {
         VMSTATE_END_OF_LIST()
-- 
1.8.3.1




* [PATCH V4 12/43] vfio/container: restore DMA vaddr
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (10 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 11/43] vfio/container: discard old DMA vaddr Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-06-01 16:48   ` Cédric Le Goater
  2025-05-29 19:24 ` [PATCH V4 13/43] vfio/container: mdev cpr blocker Steve Sistare
                   ` (32 subsequent siblings)
  44 siblings, 1 reply; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

In new QEMU, do not register the memory listener at device creation time.
Register it later, in the container post_load handler, after all vmstate
that may affect regions and mapping boundaries has been loaded.  The
post_load registration will cause the listener to invoke its callback on
each flat section, and the calls will match the mappings remembered by the
kernel.

The listener calls a special dma_map handler that passes the new VA of each
section to the kernel using VFIO_DMA_MAP_FLAG_VADDR.  Restore the normal
handler at the end.
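
As an illustration only, the divert-and-restore pattern described above can
be sketched outside QEMU as follows.  All names here (iommu_ops, cpr_state,
normal_dma_map, cpr_dma_map, cpr_divert, cpr_restore) are hypothetical
stand-ins, not the QEMU types, and the handlers only return a marker value
instead of issuing an ioctl.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the class that holds the dma_map hook. */
typedef int (*dma_map_fn)(void *vaddr, size_t size);

struct iommu_ops {
    dma_map_fn dma_map;
};

/* Hypothetical per-container CPR state: remembers the normal handler. */
struct cpr_state {
    dma_map_fn saved_dma_map;
};

/* Normal path: would create a new mapping.  Returns 1 as a marker. */
int normal_dma_map(void *vaddr, size_t size)
{
    (void)vaddr;
    (void)size;
    return 1;
}

/* CPR path: would only supply the new vaddr.  Returns 2 as a marker. */
int cpr_dma_map(void *vaddr, size_t size)
{
    (void)vaddr;
    (void)size;
    return 2;
}

/* Save the current handler and install the CPR one. */
void cpr_divert(struct iommu_ops *ops, struct cpr_state *cpr)
{
    cpr->saved_dma_map = ops->dma_map;
    ops->dma_map = cpr_dma_map;
}

/* Put the saved handler back once all sections have been replayed. */
void cpr_restore(struct iommu_ops *ops, struct cpr_state *cpr)
{
    assert(cpr->saved_dma_map != NULL);
    ops->dma_map = cpr->saved_dma_map;
    cpr->saved_dma_map = NULL;
}
```

In this sketch the handler lives in a shared ops structure, so the diversion
affects every user of that structure until it is restored.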

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/hw/vfio/vfio-cpr.h |  3 +++
 hw/vfio/container.c        | 15 ++++++++++--
 hw/vfio/cpr-legacy.c       | 57 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 5a2e5f6..0462447 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -17,6 +17,9 @@ struct VFIOGroup;
 
 typedef struct VFIOContainerCPR {
     Error *blocker;
+    int (*saved_dma_map)(const struct VFIOContainerBase *bcontainer,
+                         hwaddr iova, ram_addr_t size,
+                         void *vaddr, bool readonly, MemoryRegion *mr);
 } VFIOContainerCPR;
 
 
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 798abda..f91f2d5 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -137,6 +137,8 @@ static int vfio_legacy_dma_unmap_one(const VFIOContainerBase *bcontainer,
     int ret;
     Error *local_err = NULL;
 
+    g_assert(!cpr_is_incoming());
+
     if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer)) {
         if (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
             bcontainer->dirty_pages_supported) {
@@ -691,8 +693,17 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
     }
     group_was_added = true;
 
-    if (!vfio_listener_register(bcontainer, errp)) {
-        goto fail;
+    /*
+     * If CPR, register the listener later, after all state that may
+     * affect regions and mapping boundaries has been cpr load'ed.  Later,
+     * the listener will invoke its callback on each flat section and call
+     * dma_map to supply the new vaddr, and the calls will match the mappings
+     * remembered by the kernel.
+     */
+    if (!cpr_is_incoming()) {
+        if (!vfio_listener_register(bcontainer, errp)) {
+            goto fail;
+        }
     }
 
     bcontainer->initialized = true;
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
index cf80332..512ef41 100644
--- a/hw/vfio/cpr-legacy.c
+++ b/hw/vfio/cpr-legacy.c
@@ -10,11 +10,13 @@
 #include "hw/vfio/vfio-container.h"
 #include "hw/vfio/vfio-cpr.h"
 #include "hw/vfio/vfio-device.h"
+#include "hw/vfio/vfio-listener.h"
 #include "migration/blocker.h"
 #include "migration/cpr.h"
 #include "migration/migration.h"
 #include "migration/vmstate.h"
 #include "qapi/error.h"
+#include "qemu/error-report.h"
 
 static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
 {
@@ -31,6 +33,32 @@ static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
     return true;
 }
 
+/*
+ * Set the new @vaddr for any mappings registered during cpr load.
+ * The incoming state is cleared thereafter.
+ */
+static int vfio_legacy_cpr_dma_map(const VFIOContainerBase *bcontainer,
+                                   hwaddr iova, ram_addr_t size, void *vaddr,
+                                   bool readonly, MemoryRegion *mr)
+{
+    const VFIOContainer *container = container_of(bcontainer, VFIOContainer,
+                                                  bcontainer);
+    struct vfio_iommu_type1_dma_map map = {
+        .argsz = sizeof(map),
+        .flags = VFIO_DMA_MAP_FLAG_VADDR,
+        .vaddr = (__u64)(uintptr_t)vaddr,
+        .iova = iova,
+        .size = size,
+    };
+
+    g_assert(cpr_is_incoming());
+
+    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
+        return -errno;
+    }
+
+    return 0;
+}
 
 static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
 {
@@ -59,11 +87,34 @@ static int vfio_container_pre_save(void *opaque)
     return 0;
 }
 
+static int vfio_container_post_load(void *opaque, int version_id)
+{
+    VFIOContainer *container = opaque;
+    VFIOContainerBase *bcontainer = &container->bcontainer;
+    VFIOGroup *group;
+    Error *local_err = NULL;
+
+    if (!vfio_listener_register(bcontainer, &local_err)) {
+        error_report_err(local_err);
+        return -1;
+    }
+
+    QLIST_FOREACH(group, &container->group_list, container_next) {
+        VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
+
+        /* Restore original dma_map function */
+        vioc->dma_map = container->cpr.saved_dma_map;
+    }
+    return 0;
+}
+
 static const VMStateDescription vfio_container_vmstate = {
     .name = "vfio-container",
     .version_id = 0,
     .minimum_version_id = 0,
+    .priority = MIG_PRI_LOW,  /* Must happen after devices and groups */
     .pre_save = vfio_container_pre_save,
+    .post_load = vfio_container_post_load,
     .needed = cpr_incoming_needed,
     .fields = (VMStateField[]) {
         VMSTATE_END_OF_LIST()
@@ -86,6 +137,12 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
 
     vmstate_register(NULL, -1, &vfio_container_vmstate, container);
 
+    /* During incoming CPR, divert calls to dma_map. */
+    if (cpr_is_incoming()) {
+        VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
+        container->cpr.saved_dma_map = vioc->dma_map;
+        vioc->dma_map = vfio_legacy_cpr_dma_map;
+    }
     return true;
 }
 
-- 
1.8.3.1




* [PATCH V4 13/43] vfio/container: mdev cpr blocker
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (11 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 12/43] vfio/container: restore " Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 14/43] vfio/container: recover from unmap-all-vaddr failure Steve Sistare
                   ` (31 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

During CPR, after VFIO_DMA_UNMAP_FLAG_VADDR, the vaddr is temporarily
invalid, so mediated devices cannot be supported.  Add a blocker for them.
This restriction will not apply to iommufd containers when CPR is added
for them in a future patch.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
 include/hw/vfio/vfio-cpr.h    | 3 +++
 include/hw/vfio/vfio-device.h | 2 ++
 hw/vfio/container.c           | 8 ++++++++
 3 files changed, 13 insertions(+)

diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 0462447..b83dd42 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -22,6 +22,9 @@ typedef struct VFIOContainerCPR {
                          void *vaddr, bool readonly, MemoryRegion *mr);
 } VFIOContainerCPR;
 
+typedef struct VFIODeviceCPR {
+    Error *mdev_blocker;
+} VFIODeviceCPR;
 
 bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
                                         Error **errp);
diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
index 8bcb3c1..4e4d0b6 100644
--- a/include/hw/vfio/vfio-device.h
+++ b/include/hw/vfio/vfio-device.h
@@ -28,6 +28,7 @@
 #endif
 #include "system/system.h"
 #include "hw/vfio/vfio-container-base.h"
+#include "hw/vfio/vfio-cpr.h"
 #include "system/host_iommu_device.h"
 #include "system/iommufd.h"
 
@@ -84,6 +85,7 @@ typedef struct VFIODevice {
     VFIOIOASHwpt *hwpt;
     QLIST_ENTRY(VFIODevice) hwpt_next;
     struct vfio_region_info **reginfo;
+    VFIODeviceCPR cpr;
 } VFIODevice;
 
 struct VFIODeviceOps {
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index f91f2d5..f801a0d 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -988,6 +988,13 @@ static bool vfio_legacy_attach_device(const char *name, VFIODevice *vbasedev,
         goto device_put_exit;
     }
 
+    if (vbasedev->mdev) {
+        error_setg(&vbasedev->cpr.mdev_blocker,
+                   "CPR does not support vfio mdev %s", vbasedev->name);
+        migrate_add_blocker_modes(&vbasedev->cpr.mdev_blocker, &error_fatal,
+                                  MIG_MODE_CPR_TRANSFER, -1);
+    }
+
     return true;
 
 device_put_exit:
@@ -1005,6 +1012,7 @@ static void vfio_legacy_detach_device(VFIODevice *vbasedev)
 
     vfio_device_unprepare(vbasedev);
 
+    migrate_del_blocker(&vbasedev->cpr.mdev_blocker);
     object_unref(vbasedev->hiod);
     vfio_device_put(vbasedev);
     vfio_group_put(group);
-- 
1.8.3.1




* [PATCH V4 14/43] vfio/container: recover from unmap-all-vaddr failure
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (12 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 13/43] vfio/container: mdev cpr blocker Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 15/43] pci: export msix_is_pending Steve Sistare
                   ` (30 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

If there are multiple containers and unmap-all fails for some container, we
need to remap vaddr for the other containers for which unmap-all succeeded.
Recover by walking all address ranges of all containers to restore the vaddr
for each.  Do so by invoking the vfio listener callback, and passing a new
"remap" flag that tells it to restore a mapping without re-allocating new
userland data structures.
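
A minimal sketch of the recovery idea, assuming a simplified model: the
struct and function names below are hypothetical, the remap is reduced to a
counter, and recovery is called directly rather than from a migration
failure notifier as in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced model of a container for this sketch. */
struct container {
    bool fail_unmap;      /* test knob: make unmap-all fail here */
    bool vaddr_unmapped;  /* set once unmap-all has succeeded */
    int remap_calls;      /* how many times vaddr was restored */
};

/* Model of unmap-all: either fails, or marks the vaddr as lost. */
bool unmap_vaddr_all(struct container *c)
{
    if (c->fail_unmap) {
        return false;
    }
    c->vaddr_unmapped = true;
    return true;
}

/* Restore vaddr only for containers whose unmap-all succeeded. */
void recover(struct container *cs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (cs[i].vaddr_unmapped) {
            cs[i].remap_calls++;
            cs[i].vaddr_unmapped = false;
        }
    }
}

/* Unmap every container; on the first failure, undo the earlier ones. */
bool save_all(struct container *cs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!unmap_vaddr_all(&cs[i])) {
            recover(cs, n);
            return false;
        }
    }
    return true;
}
```

The per-container flag is what keeps recovery from touching containers that
never lost their vaddr.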

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
 include/hw/vfio/vfio-container-base.h |  3 ++
 include/hw/vfio/vfio-cpr.h            | 10 ++++
 hw/vfio/cpr-legacy.c                  | 91 +++++++++++++++++++++++++++++++++++
 hw/vfio/listener.c                    | 19 +++++++-
 4 files changed, 122 insertions(+), 1 deletion(-)

diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index 01cdcb6..dbbe87d 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -199,4 +199,7 @@ struct VFIOIOMMUClass {
 VFIORamDiscardListener *vfio_find_ram_discard_listener(
     VFIOContainerBase *bcontainer, MemoryRegionSection *section);
 
+void vfio_container_region_add(VFIOContainerBase *bcontainer,
+                               MemoryRegionSection *section, bool cpr_remap);
+
 #endif /* HW_VFIO_VFIO_CONTAINER_BASE_H */
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index b83dd42..56ede04 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -10,6 +10,7 @@
 #define HW_VFIO_VFIO_CPR_H
 
 #include "migration/misc.h"
+#include "system/memory.h"
 
 struct VFIOContainer;
 struct VFIOContainerBase;
@@ -17,6 +18,9 @@ struct VFIOGroup;
 
 typedef struct VFIOContainerCPR {
     Error *blocker;
+    bool vaddr_unmapped;
+    NotifierWithReturn transfer_notifier;
+    MemoryListener remap_listener;
     int (*saved_dma_map)(const struct VFIOContainerBase *bcontainer,
                          hwaddr iova, ram_addr_t size,
                          void *vaddr, bool readonly, MemoryRegion *mr);
@@ -42,4 +46,10 @@ int vfio_cpr_group_get_device_fd(int d, const char *name);
 bool vfio_cpr_container_match(struct VFIOContainer *container,
                               struct VFIOGroup *group, int fd);
 
+void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
+                           MemoryRegionSection *section);
+
+bool vfio_cpr_ram_discard_register_listener(
+    struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
+
 #endif /* HW_VFIO_VFIO_CPR_H */
diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
index 512ef41..59e2599 100644
--- a/hw/vfio/cpr-legacy.c
+++ b/hw/vfio/cpr-legacy.c
@@ -30,6 +30,7 @@ static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
         error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
         return false;
     }
+    container->cpr.vaddr_unmapped = true;
     return true;
 }
 
@@ -60,6 +61,14 @@ static int vfio_legacy_cpr_dma_map(const VFIOContainerBase *bcontainer,
     return 0;
 }
 
+static void vfio_region_remap(MemoryListener *listener,
+                              MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            cpr.remap_listener);
+    vfio_container_region_add(&container->bcontainer, section, true);
+}
+
 static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
 {
     if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
@@ -121,6 +130,40 @@ static const VMStateDescription vfio_container_vmstate = {
     }
 };
 
+static int vfio_cpr_fail_notifier(NotifierWithReturn *notifier,
+                                  MigrationEvent *e, Error **errp)
+{
+    VFIOContainer *container =
+        container_of(notifier, VFIOContainer, cpr.transfer_notifier);
+    VFIOContainerBase *bcontainer = &container->bcontainer;
+
+    if (e->type != MIG_EVENT_PRECOPY_FAILED) {
+        return 0;
+    }
+
+    if (container->cpr.vaddr_unmapped) {
+        /*
+         * Force a call to vfio_region_remap for each mapped section by
+         * temporarily registering a listener, and temporarily diverting
+         * dma_map to vfio_legacy_cpr_dma_map.  The latter restores vaddr.
+         */
+
+        VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
+        vioc->dma_map = vfio_legacy_cpr_dma_map;
+
+        container->cpr.remap_listener = (MemoryListener) {
+            .name = "vfio cpr recover",
+            .region_add = vfio_region_remap
+        };
+        memory_listener_register(&container->cpr.remap_listener,
+                                 bcontainer->space->as);
+        memory_listener_unregister(&container->cpr.remap_listener);
+        container->cpr.vaddr_unmapped = false;
+        vioc->dma_map = container->cpr.saved_dma_map;
+    }
+    return 0;
+}
+
 bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
 {
     VFIOContainerBase *bcontainer = &container->bcontainer;
@@ -143,6 +186,10 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
         container->cpr.saved_dma_map = vioc->dma_map;
         vioc->dma_map = vfio_legacy_cpr_dma_map;
     }
+
+    migration_add_notifier_mode(&container->cpr.transfer_notifier,
+                                vfio_cpr_fail_notifier,
+                                MIG_MODE_CPR_TRANSFER);
     return true;
 }
 
@@ -153,6 +200,50 @@ void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
     migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
     migrate_del_blocker(&container->cpr.blocker);
     vmstate_unregister(NULL, &vfio_container_vmstate, container);
+    migration_remove_notifier(&container->cpr.transfer_notifier);
+}
+
+/*
+ * In old QEMU, VFIO_DMA_UNMAP_FLAG_VADDR may fail on some mapping after
+ * succeeding for others, so the latter have lost their vaddr.  Call this
+ * to restore vaddr for a section with a giommu.
+ *
+ * The giommu already exists.  Find it and replay it, which calls
+ * vfio_legacy_cpr_dma_map further down the stack.
+ */
+void vfio_cpr_giommu_remap(VFIOContainerBase *bcontainer,
+                           MemoryRegionSection *section)
+{
+    VFIOGuestIOMMU *giommu = NULL;
+    hwaddr as_offset = section->offset_within_address_space;
+    hwaddr iommu_offset = as_offset - section->offset_within_region;
+
+    QLIST_FOREACH(giommu, &bcontainer->giommu_list, giommu_next) {
+        if (giommu->iommu_mr == IOMMU_MEMORY_REGION(section->mr) &&
+            giommu->iommu_offset == iommu_offset) {
+            break;
+        }
+    }
+    g_assert(giommu);
+    memory_region_iommu_replay(giommu->iommu_mr, &giommu->n);
+}
+
+/*
+ * In old QEMU, VFIO_DMA_UNMAP_FLAG_VADDR may fail on some mapping after
+ * succeeding for others, so the latter have lost their vaddr.  Call this
+ * to restore vaddr for a section with a RamDiscardManager.
+ *
+ * The ram discard listener already exists.  Call its populate function
+ * directly, which calls vfio_legacy_cpr_dma_map.
+ */
+bool vfio_cpr_ram_discard_register_listener(VFIOContainerBase *bcontainer,
+                                            MemoryRegionSection *section)
+{
+    VFIORamDiscardListener *vrdl =
+        vfio_find_ram_discard_listener(bcontainer, section);
+
+    g_assert(vrdl);
+    return vrdl->listener.notify_populate(&vrdl->listener, section) == 0;
 }
 
 int vfio_cpr_group_get_device_fd(int d, const char *name)
diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index fb1fd84..1106dc9 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -481,6 +481,13 @@ static void vfio_listener_region_add(MemoryListener *listener,
 {
     VFIOContainerBase *bcontainer = container_of(listener, VFIOContainerBase,
                                                  listener);
+    vfio_container_region_add(bcontainer, section, false);
+}
+
+void vfio_container_region_add(VFIOContainerBase *bcontainer,
+                               MemoryRegionSection *section,
+                               bool cpr_remap)
+{
     hwaddr iova, end;
     Int128 llend, llsize;
     void *vaddr;
@@ -516,6 +523,11 @@ static void vfio_listener_region_add(MemoryListener *listener,
         int iommu_idx;
 
         trace_vfio_listener_region_add_iommu(section->mr->name, iova, end);
+
+        if (cpr_remap) {
+            vfio_cpr_giommu_remap(bcontainer, section);
+        }
+
         /*
          * FIXME: For VFIO iommu types which have KVM acceleration to
          * avoid bouncing all map/unmaps through qemu this way, this
@@ -558,7 +570,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
      * about changes.
      */
     if (memory_region_has_ram_discard_manager(section->mr)) {
-        vfio_ram_discard_register_listener(bcontainer, section);
+        if (!cpr_remap) {
+            vfio_ram_discard_register_listener(bcontainer, section);
+        } else if (!vfio_cpr_ram_discard_register_listener(bcontainer,
+                                                           section)) {
+            goto fail;
+        }
         return;
     }
 
-- 
1.8.3.1




* [PATCH V4 15/43] pci: export msix_is_pending
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (13 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 14/43] vfio/container: recover from unmap-all-vaddr failure Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 16/43] pci: skip reset during cpr Steve Sistare
                   ` (29 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Export msix_is_pending for use by cpr.  No functional change.
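
For readers unfamiliar with the pending-bit lookup, here is a standalone
sketch of the byte-and-mask test that the exported function performs.  The
helper names and the plain byte array standing in for the pending bit array
are assumptions for illustration; the one-bit-per-vector layout follows the
"vector / 8" indexing visible in the diff context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte of the pending bit array that holds the bit for this vector. */
uint8_t *pending_byte(uint8_t *pba, unsigned vector)
{
    assert(pba != NULL);
    return pba + vector / 8;
}

/* Mask selecting this vector's bit within its byte. */
uint8_t pending_mask(unsigned vector)
{
    return (uint8_t)(1u << (vector % 8));
}

/* Non-zero if the vector's pending bit is set. */
int is_pending(uint8_t *pba, unsigned vector)
{
    return *pending_byte(pba, vector) & pending_mask(vector);
}

/* Set the vector's pending bit. */
void set_pending(uint8_t *pba, unsigned vector)
{
    *pending_byte(pba, vector) |= pending_mask(vector);
}
```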

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/hw/pci/msix.h | 1 +
 hw/pci/msix.c         | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
index 0e6f257..11ef945 100644
--- a/include/hw/pci/msix.h
+++ b/include/hw/pci/msix.h
@@ -32,6 +32,7 @@ int msix_present(PCIDevice *dev);
 bool msix_is_masked(PCIDevice *dev, unsigned vector);
 void msix_set_pending(PCIDevice *dev, unsigned vector);
 void msix_clr_pending(PCIDevice *dev, int vector);
+int msix_is_pending(PCIDevice *dev, unsigned vector);
 
 void msix_vector_use(PCIDevice *dev, unsigned vector);
 void msix_vector_unuse(PCIDevice *dev, unsigned vector);
diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index 66f27b9..8c7f670 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -72,7 +72,7 @@ static uint8_t *msix_pending_byte(PCIDevice *dev, int vector)
     return dev->msix_pba + vector / 8;
 }
 
-static int msix_is_pending(PCIDevice *dev, int vector)
+int msix_is_pending(PCIDevice *dev, unsigned int vector)
 {
     return *msix_pending_byte(dev, vector) & msix_pending_mask(vector);
 }
-- 
1.8.3.1




* [PATCH V4 16/43] pci: skip reset during cpr
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (14 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 15/43] pci: export msix_is_pending Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-06-01 16:38   ` Cédric Le Goater
  2025-05-29 19:24 ` [PATCH V4 17/43] vfio-pci: " Steve Sistare
                   ` (28 subsequent siblings)
  44 siblings, 1 reply; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Do not reset a vfio-pci device during CPR.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/hw/pci/pci_device.h | 3 +++
 hw/pci/pci.c                | 5 +++++
 hw/vfio/pci.c               | 7 +++++++
 3 files changed, 15 insertions(+)

diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index e41d95b..b481c5d 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -181,6 +181,9 @@ struct PCIDevice {
     uint32_t max_bounce_buffer_size;
 
     char *sriov_pf;
+
+    /* CPR */
+    bool skip_reset_on_cpr;
 };
 
 static inline int pci_intx(PCIDevice *pci_dev)
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index f5ab510..21eb11c 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -32,6 +32,7 @@
 #include "hw/pci/pci_host.h"
 #include "hw/qdev-properties.h"
 #include "hw/qdev-properties-system.h"
+#include "migration/cpr.h"
 #include "migration/qemu-file-types.h"
 #include "migration/vmstate.h"
 #include "net/net.h"
@@ -531,6 +532,10 @@ static void pci_reset_regions(PCIDevice *dev)
 
 static void pci_do_device_reset(PCIDevice *dev)
 {
+    if (dev->skip_reset_on_cpr && cpr_is_incoming()) {
+        return;
+    }
+
     pci_device_deassert_intx(dev);
     assert(dev->irq_state == 0);
 
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 7d3b9ff..56e7fdd 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3402,6 +3402,13 @@ static void vfio_instance_init(Object *obj)
     /* QEMU_PCI_CAP_EXPRESS initialization does not depend on QEMU command
      * line, therefore, no need to wait to realize like other devices */
     pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
+
+    /*
+     * A device that is resuming for cpr is already configured, so do not
+     * reset it during qemu_system_reset prior to cpr load, else interrupts
+     * may be lost.
+     */
+    pci_dev->skip_reset_on_cpr = true;
 }
 
 static void vfio_pci_base_dev_class_init(ObjectClass *klass, const void *data)
-- 
1.8.3.1




* [PATCH V4 17/43] vfio-pci: skip reset during cpr
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (15 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 16/43] pci: skip reset during cpr Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-06-01 16:39   ` Cédric Le Goater
  2025-05-29 19:24 ` [PATCH V4 18/43] vfio/pci: vfio_pci_vector_init Steve Sistare
                   ` (27 subsequent siblings)
  44 siblings, 1 reply; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Do not reset a vfio-pci device during CPR, and do not complain if the
kernel's PCI config space changes for non-emulated bits between the
vmstate save and load, which can happen due to ongoing interrupt activity.
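
The effect of the mask update can be sketched in isolation.  This is a
simplified model, not the QEMU check itself: config_changed below is a
hypothetical reduction that compares only the bits selected by cmask, and
exclude_non_emulated mirrors the "cmask &= emulated bits" loop in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified changed-bits check: only bits set in cmask are compared. */
bool config_changed(const uint8_t *saved, const uint8_t *current,
                    const uint8_t *cmask, size_t size)
{
    assert(saved && current && cmask);
    for (size_t i = 0; i < size; i++) {
        if ((saved[i] ^ current[i]) & cmask[i]) {
            return true;
        }
    }
    return false;
}

/* Drop non-emulated bits from the comparison mask. */
void exclude_non_emulated(uint8_t *cmask, const uint8_t *emulated_bits,
                          size_t size)
{
    for (size_t i = 0; i < size; i++) {
        cmask[i] &= emulated_bits[i];
    }
}
```

With the mask narrowed this way, a difference in a bit the kernel owns no
longer trips the check, while a difference in an emulated bit still does.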

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/hw/vfio/vfio-cpr.h |  2 ++
 hw/vfio/cpr.c              | 31 +++++++++++++++++++++++++++++++
 hw/vfio/pci.c              |  7 +++++++
 3 files changed, 40 insertions(+)

diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 56ede04..8bf85b9 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -52,4 +52,6 @@ void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
 bool vfio_cpr_ram_discard_register_listener(
     struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
 
+extern const VMStateDescription vfio_cpr_pci_vmstate;
+
 #endif /* HW_VFIO_VFIO_CPR_H */
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index 0e59612..fdbb58e 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -8,6 +8,8 @@
 #include "qemu/osdep.h"
 #include "hw/vfio/vfio-device.h"
 #include "hw/vfio/vfio-cpr.h"
+#include "hw/vfio/pci.h"
+#include "migration/cpr.h"
 #include "qapi/error.h"
 #include "system/runstate.h"
 
@@ -37,3 +39,32 @@ void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
 {
     migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
 }
+
+/*
+ * The kernel may change non-emulated config bits.  Exclude them from the
+ * changed-bits check in get_pci_config_device.
+ */
+static int vfio_cpr_pci_pre_load(void *opaque)
+{
+    VFIOPCIDevice *vdev = opaque;
+    PCIDevice *pdev = &vdev->pdev;
+    int size = MIN(pci_config_size(pdev), vdev->config_size);
+    int i;
+
+    for (i = 0; i < size; i++) {
+        pdev->cmask[i] &= vdev->emulated_config_bits[i];
+    }
+
+    return 0;
+}
+
+const VMStateDescription vfio_cpr_pci_vmstate = {
+    .name = "vfio-cpr-pci",
+    .version_id = 0,
+    .minimum_version_id = 0,
+    .pre_load = vfio_cpr_pci_pre_load,
+    .needed = cpr_incoming_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_END_OF_LIST()
+    }
+};
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 56e7fdd..840590c 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -30,6 +30,7 @@
 #include "hw/qdev-properties.h"
 #include "hw/qdev-properties-system.h"
 #include "migration/vmstate.h"
+#include "migration/cpr.h"
 #include "qobject/qdict.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
@@ -3345,6 +3346,11 @@ static void vfio_pci_reset(DeviceState *dev)
 {
     VFIOPCIDevice *vdev = VFIO_PCI_BASE(dev);
 
+    /* Do not reset the device during qemu_system_reset prior to cpr load */
+    if (cpr_is_incoming()) {
+        return;
+    }
+
     trace_vfio_pci_reset(vdev->vbasedev.name);
 
     vfio_pci_pre_reset(vdev);
@@ -3521,6 +3527,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, const void *data)
 #ifdef CONFIG_IOMMUFD
     object_class_property_add_str(klass, "fd", NULL, vfio_pci_set_fd);
 #endif
+    dc->vmsd = &vfio_cpr_pci_vmstate;
     dc->desc = "VFIO-based PCI device assignment";
     pdc->realize = vfio_realize;
 
-- 
1.8.3.1




* [PATCH V4 18/43] vfio/pci: vfio_pci_vector_init
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (16 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 17/43] vfio-pci: " Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-06-01 15:25   ` Cédric Le Goater
  2025-05-29 19:24 ` [PATCH V4 19/43] vfio/pci: vfio_notifier_init Steve Sistare
                   ` (26 subsequent siblings)
  44 siblings, 1 reply; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Extract a subroutine vfio_pci_vector_init.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/pci.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 840590c..2d6dc54 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -512,6 +512,22 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
     kvm_irqchip_commit_routes(kvm_state);
 }
 
+static void vfio_pci_vector_init(VFIOPCIDevice *vdev, int nr)
+{
+    VFIOMSIVector *vector = &vdev->msi_vectors[nr];
+    PCIDevice *pdev = &vdev->pdev;
+
+    vector->vdev = vdev;
+    vector->virq = -1;
+    if (event_notifier_init(&vector->interrupt, 0)) {
+        error_report("vfio: Error: event_notifier_init failed");
+    }
+    vector->use = true;
+    if (vdev->interrupt == VFIO_INT_MSIX) {
+        msix_vector_use(pdev, nr);
+    }
+}
+
 static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
                                    MSIMessage *msg, IOHandler *handler)
 {
@@ -525,13 +541,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
     vector = &vdev->msi_vectors[nr];
 
     if (!vector->use) {
-        vector->vdev = vdev;
-        vector->virq = -1;
-        if (event_notifier_init(&vector->interrupt, 0)) {
-            error_report("vfio: Error: event_notifier_init failed");
-        }
-        vector->use = true;
-        msix_vector_use(pdev, nr);
+        vfio_pci_vector_init(vdev, nr);
     }
 
     qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
-- 
1.8.3.1




* [PATCH V4 19/43] vfio/pci: vfio_notifier_init
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (17 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 18/43] vfio/pci: vfio_pci_vector_init Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 20/43] vfio/pci: pass vector to virq functions Steve Sistare
                   ` (25 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Move event_notifier_init calls to a helper vfio_notifier_init.
This version is trivial, but it will be expanded to support CPR
in subsequent patches.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
 hw/vfio/pci.c | 40 +++++++++++++++++++++++++---------------
 1 file changed, 25 insertions(+), 15 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 2d6dc54..12386a8 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -57,6 +57,16 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
 static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
 
+static bool vfio_notifier_init(EventNotifier *e, const char *name, Error **errp)
+{
+    int ret = event_notifier_init(e, 0);
+
+    if (ret) {
+        error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
+    }
+    return !ret;
+}
+
 /*
  * Disabling BAR mmaping can be slow, but toggling it around INTx can
  * also be a huge overhead.  We try to get the best of both worlds by
@@ -137,8 +147,7 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
     pci_irq_deassert(&vdev->pdev);
 
     /* Get an eventfd for resample/unmask */
-    if (event_notifier_init(&vdev->intx.unmask, 0)) {
-        error_setg(errp, "event_notifier_init failed eoi");
+    if (!vfio_notifier_init(&vdev->intx.unmask, "intx-unmask", errp)) {
         goto fail;
     }
 
@@ -269,7 +278,6 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
     uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
     Error *err = NULL;
     int32_t fd;
-    int ret;
 
 
     if (!pin) {
@@ -292,9 +300,7 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
     }
 #endif
 
-    ret = event_notifier_init(&vdev->intx.interrupt, 0);
-    if (ret) {
-        error_setg_errno(errp, -ret, "event_notifier_init failed");
+    if (!vfio_notifier_init(&vdev->intx.interrupt, "intx-interrupt", errp)) {
         return false;
     }
     fd = event_notifier_get_fd(&vdev->intx.interrupt);
@@ -474,11 +480,13 @@ static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
 
 static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector)
 {
+    const char *name = "kvm_interrupt";
+
     if (vector->virq < 0) {
         return;
     }
 
-    if (event_notifier_init(&vector->kvm_interrupt, 0)) {
+    if (!vfio_notifier_init(&vector->kvm_interrupt, name, NULL)) {
         goto fail_notifier;
     }
 
@@ -516,11 +524,12 @@ static void vfio_pci_vector_init(VFIOPCIDevice *vdev, int nr)
 {
     VFIOMSIVector *vector = &vdev->msi_vectors[nr];
     PCIDevice *pdev = &vdev->pdev;
+    Error *err = NULL;
 
     vector->vdev = vdev;
     vector->virq = -1;
-    if (event_notifier_init(&vector->interrupt, 0)) {
-        error_report("vfio: Error: event_notifier_init failed");
+    if (!vfio_notifier_init(&vector->interrupt, "interrupt", &err)) {
+        error_report_err(err);
     }
     vector->use = true;
     if (vdev->interrupt == VFIO_INT_MSIX) {
@@ -750,13 +759,14 @@ retry:
 
     for (i = 0; i < vdev->nr_vectors; i++) {
         VFIOMSIVector *vector = &vdev->msi_vectors[i];
+        Error *err = NULL;
 
         vector->vdev = vdev;
         vector->virq = -1;
         vector->use = true;
 
-        if (event_notifier_init(&vector->interrupt, 0)) {
-            error_report("vfio: Error: event_notifier_init failed");
+        if (!vfio_notifier_init(&vector->interrupt, "interrupt", &err)) {
+            error_report_err(err);
         }
 
         qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
@@ -2908,8 +2918,8 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
         return;
     }
 
-    if (event_notifier_init(&vdev->err_notifier, 0)) {
-        error_report("vfio: Unable to init event notifier for error detection");
+    if (!vfio_notifier_init(&vdev->err_notifier, "err_notifier", &err)) {
+        error_report_err(err);
         vdev->pci_aer = false;
         return;
     }
@@ -2975,8 +2985,8 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
         return;
     }
 
-    if (event_notifier_init(&vdev->req_notifier, 0)) {
-        error_report("vfio: Unable to init event notifier for device request");
+    if (!vfio_notifier_init(&vdev->req_notifier, "req_notifier", &err)) {
+        error_report_err(err);
         return;
     }
 
-- 
1.8.3.1




* [PATCH V4 20/43] vfio/pci: pass vector to virq functions
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (18 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 19/43] vfio/pci: vfio_notifier_init Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 21/43] vfio/pci: vfio_notifier_init cpr parameters Steve Sistare
                   ` (24 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Pass the vector number to vfio_connect_kvm_msi_virq, and pass vdev and the
vector number to vfio_remove_kvm_msi_virq, so they can be passed to their
subroutines in a subsequent patch.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
 hw/vfio/pci.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 12386a8..fa0601c 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -478,7 +478,7 @@ static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
                                              vector_n, &vdev->pdev);
 }
 
-static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector)
+static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector, int nr)
 {
     const char *name = "kvm_interrupt";
 
@@ -504,7 +504,8 @@ fail_notifier:
     vector->virq = -1;
 }
 
-static void vfio_remove_kvm_msi_virq(VFIOMSIVector *vector)
+static void vfio_remove_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
+                                     int nr)
 {
     kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
                                           vector->virq);
@@ -562,7 +563,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
      */
     if (vector->virq >= 0) {
         if (!msg) {
-            vfio_remove_kvm_msi_virq(vector);
+            vfio_remove_kvm_msi_virq(vdev, vector, nr);
         } else {
             vfio_update_kvm_msi_virq(vector, *msg, pdev);
         }
@@ -574,7 +575,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
                 vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
                 vfio_add_kvm_msi_virq(vdev, vector, nr, true);
                 kvm_irqchip_commit_route_changes(&vfio_route_change);
-                vfio_connect_kvm_msi_virq(vector);
+                vfio_connect_kvm_msi_virq(vector, nr);
             }
         }
     }
@@ -682,7 +683,7 @@ static void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
     kvm_irqchip_commit_route_changes(&vfio_route_change);
 
     for (i = 0; i < vdev->nr_vectors; i++) {
-        vfio_connect_kvm_msi_virq(&vdev->msi_vectors[i]);
+        vfio_connect_kvm_msi_virq(&vdev->msi_vectors[i], i);
     }
 }
 
@@ -822,7 +823,7 @@ static void vfio_msi_disable_common(VFIOPCIDevice *vdev)
         VFIOMSIVector *vector = &vdev->msi_vectors[i];
         if (vdev->msi_vectors[i].use) {
             if (vector->virq >= 0) {
-                vfio_remove_kvm_msi_virq(vector);
+                vfio_remove_kvm_msi_virq(vdev, vector, i);
             }
             qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
                                 NULL, NULL, NULL);
-- 
1.8.3.1




* [PATCH V4 21/43] vfio/pci: vfio_notifier_init cpr parameters
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (19 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 20/43] vfio/pci: pass vector to virq functions Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 22/43] vfio/pci: vfio_notifier_cleanup Steve Sistare
                   ` (23 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Pass vdev and nr to vfio_notifier_init, for use by CPR in a subsequent
patch.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
 hw/vfio/pci.c | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index fa0601c..b776793 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -57,7 +57,8 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
 static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
 
-static bool vfio_notifier_init(EventNotifier *e, const char *name, Error **errp)
+static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
+                               const char *name, int nr, Error **errp)
 {
     int ret = event_notifier_init(e, 0);
 
@@ -147,7 +148,7 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
     pci_irq_deassert(&vdev->pdev);
 
     /* Get an eventfd for resample/unmask */
-    if (!vfio_notifier_init(&vdev->intx.unmask, "intx-unmask", errp)) {
+    if (!vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0, errp)) {
         goto fail;
     }
 
@@ -300,7 +301,8 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
     }
 #endif
 
-    if (!vfio_notifier_init(&vdev->intx.interrupt, "intx-interrupt", errp)) {
+    if (!vfio_notifier_init(vdev, &vdev->intx.interrupt, "intx-interrupt", 0,
+                            errp)) {
         return false;
     }
     fd = event_notifier_get_fd(&vdev->intx.interrupt);
@@ -486,7 +488,8 @@ static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector, int nr)
         return;
     }
 
-    if (!vfio_notifier_init(&vector->kvm_interrupt, name, NULL)) {
+    if (!vfio_notifier_init(vector->vdev, &vector->kvm_interrupt, name, nr,
+                            NULL)) {
         goto fail_notifier;
     }
 
@@ -525,12 +528,13 @@ static void vfio_pci_vector_init(VFIOPCIDevice *vdev, int nr)
 {
     VFIOMSIVector *vector = &vdev->msi_vectors[nr];
     PCIDevice *pdev = &vdev->pdev;
-    Error *err = NULL;
+    Error *local_err = NULL;
 
     vector->vdev = vdev;
     vector->virq = -1;
-    if (!vfio_notifier_init(&vector->interrupt, "interrupt", &err)) {
-        error_report_err(err);
+    if (!vfio_notifier_init(vdev, &vector->interrupt, "interrupt", nr,
+                            &local_err)) {
+        error_report_err(local_err);
     }
     vector->use = true;
     if (vdev->interrupt == VFIO_INT_MSIX) {
@@ -760,14 +764,15 @@ retry:
 
     for (i = 0; i < vdev->nr_vectors; i++) {
         VFIOMSIVector *vector = &vdev->msi_vectors[i];
-        Error *err = NULL;
+        Error *local_err = NULL;
 
         vector->vdev = vdev;
         vector->virq = -1;
         vector->use = true;
 
-        if (!vfio_notifier_init(&vector->interrupt, "interrupt", &err)) {
-            error_report_err(err);
+        if (!vfio_notifier_init(vdev, &vector->interrupt, "interrupt", i,
+                                &local_err)) {
+            error_report_err(local_err);
         }
 
         qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
@@ -2919,7 +2924,8 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
         return;
     }
 
-    if (!vfio_notifier_init(&vdev->err_notifier, "err_notifier", &err)) {
+    if (!vfio_notifier_init(vdev, &vdev->err_notifier, "err_notifier", 0,
+                            &err)) {
         error_report_err(err);
         vdev->pci_aer = false;
         return;
@@ -2986,7 +2992,8 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
         return;
     }
 
-    if (!vfio_notifier_init(&vdev->req_notifier, "req_notifier", &err)) {
+    if (!vfio_notifier_init(vdev, &vdev->req_notifier, "req_notifier", 0,
+                            &err)) {
         error_report_err(err);
         return;
     }
-- 
1.8.3.1




* [PATCH V4 22/43] vfio/pci: vfio_notifier_cleanup
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (20 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 21/43] vfio/pci: vfio_notifier_init cpr parameters Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 23/43] vfio/pci: export MSI functions Steve Sistare
                   ` (22 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Move event_notifier_cleanup calls to a helper vfio_notifier_cleanup.
This version is trivial, and does not yet use the vdev and nr parameters.
No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
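For illustration only, a minimal sketch of how the name and nr parameters
could key a preserved descriptor, assuming the init and cleanup helpers
later consult CPR state.  Every sketch_* identifier and the fixed-size table
are hypothetical stand-ins, not QEMU code; only the "<device>_<name>" key
plus index follows the vfio_cpr_*_vector_fd helpers in this series.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_FDS 16

/* Hypothetical in-memory stand-in for CPR state; not QEMU's implementation. */
typedef struct SketchFdEntry {
    char key[64];   /* "<device>_<notifier name>" */
    int nr;         /* vector index */
    int fd;
    int used;
} SketchFdEntry;

static SketchFdEntry sketch_fds[SKETCH_MAX_FDS];

static SketchFdEntry *sketch_lookup(const char *dev, const char *name, int nr)
{
    char key[64];

    snprintf(key, sizeof(key), "%s_%s", dev, name);
    for (int i = 0; i < SKETCH_MAX_FDS; i++) {
        SketchFdEntry *e = &sketch_fds[i];

        if (e->used && e->nr == nr && strcmp(e->key, key) == 0) {
            return e;
        }
    }
    return NULL;
}

/* Returns the saved descriptor, or -1 if none is recorded. */
static int sketch_find_fd(const char *dev, const char *name, int nr)
{
    SketchFdEntry *e = sketch_lookup(dev, name, nr);

    return e ? e->fd : -1;
}

/* Returns 0 on success, -1 if the table is full. */
static int sketch_save_fd(const char *dev, const char *name, int nr, int fd)
{
    assert(fd >= 0);    /* only valid descriptors are recorded */
    for (int i = 0; i < SKETCH_MAX_FDS; i++) {
        SketchFdEntry *e = &sketch_fds[i];

        if (!e->used) {
            snprintf(e->key, sizeof(e->key), "%s_%s", dev, name);
            e->nr = nr;
            e->fd = fd;
            e->used = 1;
            return 0;
        }
    }
    return -1;
}

/*
 * Assumed init behavior: reuse a preserved descriptor when one exists,
 * otherwise record the freshly created one (new_fd) for a later update.
 */
static int sketch_notifier_init(const char *dev, const char *name, int nr,
                                int new_fd)
{
    int fd = sketch_find_fd(dev, name, nr);

    if (fd >= 0) {
        return fd;
    }
    return sketch_save_fd(dev, name, nr, new_fd) ? -1 : new_fd;
}

/* Assumed cleanup behavior: forget the descriptor so it is not reused. */
static void sketch_notifier_cleanup(const char *dev, const char *name, int nr)
{
    SketchFdEntry *e = sketch_lookup(dev, name, nr);

    if (e) {
        e->used = 0;
    }
}
```

In this model, a second sketch_notifier_init call with the same device,
name, and nr returns the first descriptor rather than the new one, which is
why each call site in the patch supplies a distinct name and vector number.
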
 hw/vfio/pci.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index b776793..6aa37fe 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -68,6 +68,12 @@ static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
     return !ret;
 }
 
+static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
+                                  const char *name, int nr)
+{
+    event_notifier_cleanup(e);
+}
+
 /*
  * Disabling BAR mmaping can be slow, but toggling it around INTx can
  * also be a huge overhead.  We try to get the best of both worlds by
@@ -180,7 +186,7 @@ fail_vfio:
     kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vdev->intx.interrupt,
                                           vdev->intx.route.irq);
 fail_irqfd:
-    event_notifier_cleanup(&vdev->intx.unmask);
+    vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
 fail:
     qemu_set_fd_handler(irq_fd, vfio_intx_interrupt, NULL, vdev);
     vfio_device_irq_unmask(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
@@ -212,7 +218,7 @@ static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
     }
 
     /* We only need to close the eventfd for VFIO to cleanup the kernel side */
-    event_notifier_cleanup(&vdev->intx.unmask);
+    vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
 
     /* QEMU starts listening for interrupt events. */
     qemu_set_fd_handler(event_notifier_get_fd(&vdev->intx.interrupt),
@@ -311,7 +317,7 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
     if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
                                 VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
         qemu_set_fd_handler(fd, NULL, NULL, vdev);
-        event_notifier_cleanup(&vdev->intx.interrupt);
+        vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
         return false;
     }
 
@@ -338,7 +344,7 @@ static void vfio_intx_disable(VFIOPCIDevice *vdev)
 
     fd = event_notifier_get_fd(&vdev->intx.interrupt);
     qemu_set_fd_handler(fd, NULL, NULL, vdev);
-    event_notifier_cleanup(&vdev->intx.interrupt);
+    vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
 
     vdev->interrupt = VFIO_INT_NONE;
 
@@ -501,7 +507,7 @@ static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector, int nr)
     return;
 
 fail_kvm:
-    event_notifier_cleanup(&vector->kvm_interrupt);
+    vfio_notifier_cleanup(vector->vdev, &vector->kvm_interrupt, name, nr);
 fail_notifier:
     kvm_irqchip_release_virq(kvm_state, vector->virq);
     vector->virq = -1;
@@ -514,7 +520,7 @@ static void vfio_remove_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
                                           vector->virq);
     kvm_irqchip_release_virq(kvm_state, vector->virq);
     vector->virq = -1;
-    event_notifier_cleanup(&vector->kvm_interrupt);
+    vfio_notifier_cleanup(vdev, &vector->kvm_interrupt, "kvm_interrupt", nr);
 }
 
 static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
@@ -832,7 +838,7 @@ static void vfio_msi_disable_common(VFIOPCIDevice *vdev)
             }
             qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
                                 NULL, NULL, NULL);
-            event_notifier_cleanup(&vector->interrupt);
+            vfio_notifier_cleanup(vdev, &vector->interrupt, "interrupt", i);
         }
     }
 
@@ -2938,7 +2944,7 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
                                        VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
         error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
         qemu_set_fd_handler(fd, NULL, NULL, vdev);
-        event_notifier_cleanup(&vdev->err_notifier);
+        vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
         vdev->pci_aer = false;
     }
 }
@@ -2957,7 +2963,7 @@ static void vfio_unregister_err_notifier(VFIOPCIDevice *vdev)
     }
     qemu_set_fd_handler(event_notifier_get_fd(&vdev->err_notifier),
                         NULL, NULL, vdev);
-    event_notifier_cleanup(&vdev->err_notifier);
+    vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
 }
 
 static void vfio_req_notifier_handler(void *opaque)
@@ -3005,7 +3011,7 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
                                        VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
         error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
         qemu_set_fd_handler(fd, NULL, NULL, vdev);
-        event_notifier_cleanup(&vdev->req_notifier);
+        vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
     } else {
         vdev->req_enabled = true;
     }
@@ -3025,7 +3031,7 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
     }
     qemu_set_fd_handler(event_notifier_get_fd(&vdev->req_notifier),
                         NULL, NULL, vdev);
-    event_notifier_cleanup(&vdev->req_notifier);
+    vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
 
     vdev->req_enabled = false;
 }
-- 
1.8.3.1




* [PATCH V4 23/43] vfio/pci: export MSI functions
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (21 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 22/43] vfio/pci: vfio_notifier_cleanup Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-06-01 15:27   ` Cédric Le Goater
  2025-05-29 19:24 ` [PATCH V4 24/43] vfio-pci: preserve MSI Steve Sistare
                   ` (21 subsequent siblings)
  44 siblings, 1 reply; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Export various MSI functions, renamed with a vfio_pci prefix, and add
vfio_pci_intx_enable as an exported wrapper for vfio_intx_enable, for use by
CPR in subsequent patches.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/pci.h |  8 ++++++++
 hw/vfio/pci.c | 29 +++++++++++++++++------------
 2 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 5ce0fb9..6e4840d 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -210,6 +210,14 @@ static inline bool vfio_is_vga(VFIOPCIDevice *vdev)
     return class == PCI_CLASS_DISPLAY_VGA;
 }
 
+/* MSI/MSI-X/INTx */
+void vfio_pci_vector_init(VFIOPCIDevice *vdev, int nr);
+void vfio_pci_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
+                               int vector_n, bool msix);
+void vfio_pci_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
+void vfio_pci_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
+bool vfio_pci_intx_enable(VFIOPCIDevice *vdev, Error **errp);
+
 uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
 void vfio_pci_write_config(PCIDevice *pdev,
                            uint32_t addr, uint32_t val, int len);
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 6aa37fe..13d7c84 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -351,6 +351,11 @@ static void vfio_intx_disable(VFIOPCIDevice *vdev)
     trace_vfio_intx_disable(vdev->vbasedev.name);
 }
 
+bool vfio_pci_intx_enable(VFIOPCIDevice *vdev, Error **errp)
+{
+    return vfio_intx_enable(vdev, errp);
+}
+
 /*
  * MSI/X
  */
@@ -475,8 +480,8 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
     return ret;
 }
 
-static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
-                                  int vector_n, bool msix)
+void vfio_pci_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
+                               int vector_n, bool msix)
 {
     if ((msix && vdev->no_kvm_msix) || (!msix && vdev->no_kvm_msi)) {
         return;
@@ -530,7 +535,7 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
     kvm_irqchip_commit_routes(kvm_state);
 }
 
-static void vfio_pci_vector_init(VFIOPCIDevice *vdev, int nr)
+void vfio_pci_vector_init(VFIOPCIDevice *vdev, int nr)
 {
     VFIOMSIVector *vector = &vdev->msi_vectors[nr];
     PCIDevice *pdev = &vdev->pdev;
@@ -580,10 +585,10 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
     } else {
         if (msg) {
             if (vdev->defer_kvm_irq_routing) {
-                vfio_add_kvm_msi_virq(vdev, vector, nr, true);
+                vfio_pci_add_kvm_msi_virq(vdev, vector, nr, true);
             } else {
                 vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
-                vfio_add_kvm_msi_virq(vdev, vector, nr, true);
+                vfio_pci_add_kvm_msi_virq(vdev, vector, nr, true);
                 kvm_irqchip_commit_route_changes(&vfio_route_change);
                 vfio_connect_kvm_msi_virq(vector, nr);
             }
@@ -676,14 +681,14 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
     }
 }
 
-static void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
+void vfio_pci_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
 {
     assert(!vdev->defer_kvm_irq_routing);
     vdev->defer_kvm_irq_routing = true;
     vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
 }
 
-static void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
+void vfio_pci_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
 {
     int i;
 
@@ -713,14 +718,14 @@ static void vfio_msix_enable(VFIOPCIDevice *vdev)
      * routes once rather than per vector provides a substantial
      * performance improvement.
      */
-    vfio_prepare_kvm_msi_virq_batch(vdev);
+    vfio_pci_prepare_kvm_msi_virq_batch(vdev);
 
     if (msix_set_vector_notifiers(&vdev->pdev, vfio_msix_vector_use,
                                   vfio_msix_vector_release, NULL)) {
         error_report("vfio: msix_set_vector_notifiers failed");
     }
 
-    vfio_commit_kvm_msi_virq_batch(vdev);
+    vfio_pci_commit_kvm_msi_virq_batch(vdev);
 
     if (vdev->nr_vectors) {
         ret = vfio_enable_vectors(vdev, true);
@@ -764,7 +769,7 @@ retry:
      * Deferring to commit the KVM routes once rather than per vector
      * provides a substantial performance improvement.
      */
-    vfio_prepare_kvm_msi_virq_batch(vdev);
+    vfio_pci_prepare_kvm_msi_virq_batch(vdev);
 
     vdev->msi_vectors = g_new0(VFIOMSIVector, vdev->nr_vectors);
 
@@ -788,10 +793,10 @@ retry:
          * Attempt to enable route through KVM irqchip,
          * default to userspace handling if unavailable.
          */
-        vfio_add_kvm_msi_virq(vdev, vector, i, false);
+        vfio_pci_add_kvm_msi_virq(vdev, vector, i, false);
     }
 
-    vfio_commit_kvm_msi_virq_batch(vdev);
+    vfio_pci_commit_kvm_msi_virq_batch(vdev);
 
     /* Set interrupt type prior to possible interrupts */
     vdev->interrupt = VFIO_INT_MSI;
-- 
1.8.3.1




* [PATCH V4 24/43] vfio-pci: preserve MSI
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (22 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 23/43] vfio/pci: export MSI functions Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 25/43] vfio-pci: preserve INTx Steve Sistare
                   ` (20 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Save the MSI message area as part of vfio-pci vmstate, and preserve the
interrupt and notifier eventfds.  migrate_incoming loads the MSI data,
then the vfio-pci post_load handler finds the eventfds in CPR state,
rebuilds vector data structures, and attaches the interrupts to the new
KVM instance.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
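For illustration only, a simplified model of the per-vector decisions made
when the post_load handler rebuilds vector state.  It loosely follows the
loop in vfio_cpr_claim_vectors but is not QEMU code: the Sketch* types, the
saved-descriptor array, and the placeholder virq value are all assumptions.
In QEMU the virq comes from KVM route setup, which this sketch omits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified per-vector state; not QEMU's VFIOMSIVector. */
typedef struct SketchVector {
    bool use;   /* vector was rebuilt from a preserved "interrupt" eventfd */
    int virq;   /* -1 when the vector has no KVM route */
} SketchVector;

/* Hypothetical record of what old QEMU preserved for one vector. */
typedef struct SketchSaved {
    int interrupt_fd;       /* -1 if not preserved */
    int kvm_interrupt_fd;   /* -1 if not preserved */
} SketchSaved;

/*
 * Rebuild vector state from preserved descriptors.  A vector is marked in
 * use only if its "interrupt" eventfd was preserved, and gets a KVM route
 * only if its "kvm_interrupt" eventfd was preserved.  Returns the number
 * of vectors marked in use.
 */
static int sketch_claim_vectors(SketchVector *vectors,
                                const SketchSaved *saved, int nr_vectors)
{
    int claimed = 0;

    assert(vectors != NULL && saved != NULL);

    for (int i = 0; i < nr_vectors; i++) {
        SketchVector *v = &vectors[i];

        v->use = false;
        v->virq = -1;

        if (saved[i].interrupt_fd >= 0) {
            v->use = true;
            claimed++;
        }
        if (saved[i].kvm_interrupt_fd >= 0) {
            v->virq = 100 + i;  /* placeholder; the real value is from KVM */
        }
    }
    return claimed;
}
```

The two lookups are independent, so a vector whose interrupts were handled
in userspace in old QEMU is rebuilt with virq of -1 and stays on that path.
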
 hw/vfio/pci.h              |  2 +
 include/hw/vfio/vfio-cpr.h |  8 ++++
 hw/vfio/cpr.c              | 97 ++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c              | 54 ++++++++++++++++++++++++--
 4 files changed, 158 insertions(+), 3 deletions(-)

diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 6e4840d..4d1203c 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -217,6 +217,8 @@ void vfio_pci_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
 void vfio_pci_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
 void vfio_pci_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
 bool vfio_pci_intx_enable(VFIOPCIDevice *vdev, Error **errp);
+void vfio_pci_msix_set_notifiers(VFIOPCIDevice *vdev);
+void vfio_pci_msi_set_handler(VFIOPCIDevice *vdev, int nr);
 
 uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
 void vfio_pci_write_config(PCIDevice *pdev,
diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 8bf85b9..25e74ee 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -15,6 +15,7 @@
 struct VFIOContainer;
 struct VFIOContainerBase;
 struct VFIOGroup;
+struct VFIOPCIDevice;
 
 typedef struct VFIOContainerCPR {
     Error *blocker;
@@ -52,6 +53,13 @@ void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
 bool vfio_cpr_ram_discard_register_listener(
     struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
 
+void vfio_cpr_save_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
+                             int nr, int fd);
+int vfio_cpr_load_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
+                            int nr);
+void vfio_cpr_delete_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
+                               int nr);
+
 extern const VMStateDescription vfio_cpr_pci_vmstate;
 
 #endif /* HW_VFIO_VFIO_CPR_H */
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index fdbb58e..e467373 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -9,6 +9,8 @@
 #include "hw/vfio/vfio-device.h"
 #include "hw/vfio/vfio-cpr.h"
 #include "hw/vfio/pci.h"
+#include "hw/pci/msix.h"
+#include "hw/pci/msi.h"
 #include "migration/cpr.h"
 #include "qapi/error.h"
 #include "system/runstate.h"
@@ -40,6 +42,69 @@ void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
     migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
 }
 
+#define STRDUP_VECTOR_FD_NAME(vdev, name)   \
+    g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
+
+void vfio_cpr_save_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr,
+                             int fd)
+{
+    g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
+    cpr_save_fd(fdname, nr, fd);
+}
+
+int vfio_cpr_load_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+    g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
+    return cpr_find_fd(fdname, nr);
+}
+
+void vfio_cpr_delete_vector_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+    g_autofree char *fdname = STRDUP_VECTOR_FD_NAME(vdev, name);
+    cpr_delete_fd(fdname, nr);
+}
+
+static void vfio_cpr_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors,
+                                   bool msix)
+{
+    int i, fd;
+    bool pending = false;
+    PCIDevice *pdev = &vdev->pdev;
+
+    vdev->nr_vectors = nr_vectors;
+    vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
+    vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
+
+    vfio_pci_prepare_kvm_msi_virq_batch(vdev);
+
+    for (i = 0; i < nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+
+        fd = vfio_cpr_load_vector_fd(vdev, "interrupt", i);
+        if (fd >= 0) {
+            vfio_pci_vector_init(vdev, i);
+            vfio_pci_msi_set_handler(vdev, i);
+        }
+
+        if (vfio_cpr_load_vector_fd(vdev, "kvm_interrupt", i) >= 0) {
+            vfio_pci_add_kvm_msi_virq(vdev, vector, i, msix);
+        } else {
+            vdev->msi_vectors[i].virq = -1;
+        }
+
+        if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
+            set_bit(i, vdev->msix->pending);
+            pending = true;
+        }
+    }
+
+    vfio_pci_commit_kvm_msi_virq_batch(vdev);
+
+    if (msix) {
+        memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
+    }
+}
+
 /*
  * The kernel may change non-emulated config bits.  Exclude them from the
  * changed-bits check in get_pci_config_device.
@@ -58,13 +123,45 @@ static int vfio_cpr_pci_pre_load(void *opaque)
     return 0;
 }
 
+static int vfio_cpr_pci_post_load(void *opaque, int version_id)
+{
+    VFIOPCIDevice *vdev = opaque;
+    PCIDevice *pdev = &vdev->pdev;
+    int nr_vectors;
+
+    if (msix_enabled(pdev)) {
+        vfio_pci_msix_set_notifiers(vdev);
+        nr_vectors = vdev->msix->entries;
+        vfio_cpr_claim_vectors(vdev, nr_vectors, true);
+
+    } else if (msi_enabled(pdev)) {
+        nr_vectors = msi_nr_vectors_allocated(pdev);
+        vfio_cpr_claim_vectors(vdev, nr_vectors, false);
+
+    } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
+        g_assert_not_reached();      /* completed in a subsequent patch */
+    }
+
+    return 0;
+}
+
+static bool pci_msix_present(void *opaque, int version_id)
+{
+    PCIDevice *pdev = opaque;
+
+    return msix_present(pdev);
+}
+
 const VMStateDescription vfio_cpr_pci_vmstate = {
     .name = "vfio-cpr-pci",
     .version_id = 0,
     .minimum_version_id = 0,
     .pre_load = vfio_cpr_pci_pre_load,
+    .post_load = vfio_cpr_pci_post_load,
     .needed = cpr_incoming_needed,
     .fields = (VMStateField[]) {
+        VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
+        VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, pci_msix_present),
         VMSTATE_END_OF_LIST()
     }
 };
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 13d7c84..643683c 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -29,6 +29,7 @@
 #include "hw/pci/pci_bridge.h"
 #include "hw/qdev-properties.h"
 #include "hw/qdev-properties-system.h"
+#include "hw/vfio/vfio-cpr.h"
 #include "migration/vmstate.h"
 #include "migration/cpr.h"
 #include "qobject/qdict.h"
@@ -57,13 +58,25 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
 static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
 
+/* Create new or reuse existing eventfd */
 static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
                                const char *name, int nr, Error **errp)
 {
-    int ret = event_notifier_init(e, 0);
+    int fd = vfio_cpr_load_vector_fd(vdev, name, nr);
+    int ret = 0;
 
-    if (ret) {
-        error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
+    if (fd >= 0) {
+        event_notifier_init_fd(e, fd);
+    } else {
+        ret = event_notifier_init(e, 0);
+        if (ret) {
+            error_setg_errno(errp, -ret, "vfio_notifier_init %s failed", name);
+        } else {
+            fd = event_notifier_get_fd(e);
+            if (fd >= 0) {
+                vfio_cpr_save_vector_fd(vdev, name, nr, fd);
+            }
+        }
     }
     return !ret;
 }
@@ -71,6 +84,7 @@ static bool vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
 static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
                                   const char *name, int nr)
 {
+    vfio_cpr_delete_vector_fd(vdev, name, nr);
     event_notifier_cleanup(e);
 }
 
@@ -394,6 +408,14 @@ static void vfio_msi_interrupt(void *opaque)
     notify(&vdev->pdev, nr);
 }
 
+void vfio_pci_msi_set_handler(VFIOPCIDevice *vdev, int nr)
+{
+    VFIOMSIVector *vector = &vdev->msi_vectors[nr];
+    int fd = event_notifier_get_fd(&vector->interrupt);
+
+    qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
+}
+
 /*
  * Get MSI-X enabled, but no vector enabled, by setting vector 0 with an invalid
  * fd to kernel.
@@ -561,6 +583,15 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
     int ret;
     bool resizing = !!(vdev->nr_vectors < nr + 1);
 
+    /*
+     * Ignore the callback from msix_set_vector_notifiers during resume.
+     * The necessary subset of these actions is called from
+     * vfio_cpr_claim_vectors during post load.
+     */
+    if (cpr_is_incoming()) {
+        return 0;
+    }
+
     trace_vfio_msix_vector_do_use(vdev->vbasedev.name, nr);
 
     vector = &vdev->msi_vectors[nr];
@@ -681,6 +712,12 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
     }
 }
 
+void vfio_pci_msix_set_notifiers(VFIOPCIDevice *vdev)
+{
+    msix_set_vector_notifiers(&vdev->pdev, vfio_msix_vector_use,
+                              vfio_msix_vector_release, NULL);
+}
+
 void vfio_pci_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
 {
     assert(!vdev->defer_kvm_irq_routing);
@@ -2945,6 +2982,11 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
     fd = event_notifier_get_fd(&vdev->err_notifier);
     qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
 
+    /* Do not alter irq_signaling during vfio_realize for cpr */
+    if (cpr_is_incoming()) {
+        return;
+    }
+
     if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_ERR_IRQ_INDEX, 0,
                                        VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
         error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
@@ -3012,6 +3054,12 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
     fd = event_notifier_get_fd(&vdev->req_notifier);
     qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
 
+    /* Do not alter irq_signaling during vfio_realize for cpr */
+    if (cpr_is_incoming()) {
+        vdev->req_enabled = true;
+        return;
+    }
+
     if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_REQ_IRQ_INDEX, 0,
                                        VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
         error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
-- 
1.8.3.1




* [PATCH V4 25/43] vfio-pci: preserve INTx
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (23 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 24/43] vfio-pci: preserve MSI Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 26/43] migration: close kvm after cpr Steve Sistare
                   ` (19 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Preserve vfio INTx state across cpr-transfer.  Preserve VFIOINTx fields as
follows:
  pin : Recover this from the vfio config in kernel space.
  interrupt : Preserve its eventfd descriptor across exec.
  unmask : Ditto.
  route.irq : This could perhaps be recovered in vfio_cpr_pci_post_load by
    calling pci_device_route_intx_to_irq(pin), whose implementation reads
    config space for a bridge device such as ich9.  However, there is no
    guarantee that the bridge vmstate is read before vfio vmstate.  Rather
    than fiddling with MigrationPriority for vmstate handlers, explicitly
    save route.irq (and route.mode) in vfio vmstate.
  pending : Save in vfio vmstate.
  mmap_timeout, mmap_timer : Re-initialize.
  kvm_accel : Re-initialize.

In vfio_realize, defer calling vfio_intx_enable until the vmstate
is available, in vfio_cpr_pci_post_load.  Modify vfio_intx_enable and
vfio_intx_enable_kvm to skip vfio initialization, but still perform
kvm initialization.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
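Illustration only, not part of the patch.  The commit message relies on
the INTx "interrupt" and "unmask" eventfds surviving the exec.  The sketch
below, which assumes Linux eventfd(2), shows the property that makes this
safe: a count signaled through one descriptor is still readable through
another descriptor for the same kernel object.  Here dup() merely stands
in for new QEMU inheriting the descriptor, and the helper name is
hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper, not QEMU code: signal an eventfd through one
 * descriptor, close that descriptor, then read the count through a
 * duplicate.  Returns the count read, or -1 on failure.  A zero count
 * is rejected because reading an unsignaled eventfd would block.
 */
static int64_t eventfd_count_survives_dup(uint64_t signals)
{
    uint64_t count = 0;
    int old_fd, new_fd;

    if (signals == 0 || signals > INT64_MAX) {
        return -1;
    }

    old_fd = eventfd(0, 0);
    if (old_fd < 0) {
        return -1;
    }

    /* The "device" signals while the old owner still holds the fd. */
    if (write(old_fd, &signals, sizeof(signals)) != sizeof(signals)) {
        close(old_fd);
        return -1;
    }

    /* A new descriptor number that refers to the same kernel object. */
    new_fd = dup(old_fd);
    close(old_fd);
    if (new_fd < 0) {
        return -1;
    }

    /* The pending count is still there for whoever holds new_fd. */
    if (read(new_fd, &count, sizeof(count)) != sizeof(count)) {
        close(new_fd);
        return -1;
    }
    close(new_fd);
    return (int64_t)count;
}
```

Calling eventfd_count_survives_dup(3) returns 3: the three signals are
delivered to the new holder even though the original descriptor is gone.
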
 hw/vfio/cpr.c | 27 ++++++++++++++++++++++++++-
 hw/vfio/pci.c | 32 ++++++++++++++++++++++++++++----
 2 files changed, 54 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index e467373..f5555ca 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -139,7 +139,11 @@ static int vfio_cpr_pci_post_load(void *opaque, int version_id)
         vfio_cpr_claim_vectors(vdev, nr_vectors, false);
 
     } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
-        g_assert_not_reached();      /* completed in a subsequent patch */
+        Error *local_err = NULL;
+        if (!vfio_pci_intx_enable(vdev, &local_err)) {
+            error_report_err(local_err);
+            return -1;
+        }
     }
 
     return 0;
@@ -152,6 +156,26 @@ static bool pci_msix_present(void *opaque, int version_id)
     return msix_present(pdev);
 }
 
+static const VMStateDescription vfio_intx_vmstate = {
+    .name = "vfio-cpr-intx",
+    .version_id = 0,
+    .minimum_version_id = 0,
+    .fields = (VMStateField[]) {
+        VMSTATE_BOOL(pending, VFIOINTx),
+        VMSTATE_UINT32(route.mode, VFIOINTx),
+        VMSTATE_INT32(route.irq, VFIOINTx),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+#define VMSTATE_VFIO_INTX(_field, _state) {                         \
+    .name       = (stringify(_field)),                              \
+    .size       = sizeof(VFIOINTx),                                 \
+    .vmsd       = &vfio_intx_vmstate,                               \
+    .flags      = VMS_STRUCT,                                       \
+    .offset     = vmstate_offset_value(_state, _field, VFIOINTx),   \
+}
+
 const VMStateDescription vfio_cpr_pci_vmstate = {
     .name = "vfio-cpr-pci",
     .version_id = 0,
@@ -162,6 +186,7 @@ const VMStateDescription vfio_cpr_pci_vmstate = {
     .fields = (VMStateField[]) {
         VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
         VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, pci_msix_present),
+        VMSTATE_VFIO_INTX(intx, VFIOPCIDevice),
         VMSTATE_END_OF_LIST()
     }
 };
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 643683c..c8d6ee0 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -161,12 +161,17 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
         return true;
     }
 
+    if (cpr_is_incoming()) {
+        goto skip_state;
+    }
+
     /* Get to a known interrupt state */
     qemu_set_fd_handler(irq_fd, NULL, NULL, vdev);
     vfio_device_irq_mask(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
     vdev->intx.pending = false;
     pci_irq_deassert(&vdev->pdev);
 
+skip_state:
     /* Get an eventfd for resample/unmask */
     if (!vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0, errp)) {
         goto fail;
@@ -180,6 +185,10 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
         goto fail_irqfd;
     }
 
+    if (cpr_is_incoming()) {
+        goto skip_irq;
+    }
+
     if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
                                        VFIO_IRQ_SET_ACTION_UNMASK,
                                        event_notifier_get_fd(&vdev->intx.unmask),
@@ -190,6 +199,7 @@ static bool vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
     /* Let'em rip */
     vfio_device_irq_unmask(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
 
+skip_irq:
     vdev->intx.kvm_accel = true;
 
     trace_vfio_intx_enable_kvm(vdev->vbasedev.name);
@@ -305,7 +315,13 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
         return true;
     }
 
-    vfio_disable_interrupts(vdev);
+    /*
+     * Do not alter interrupt state during vfio_realize and cpr load.
+     * The incoming state is cleared thereafter.
+     */
+    if (!cpr_is_incoming()) {
+        vfio_disable_interrupts(vdev);
+    }
 
     vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
     pci_config_set_interrupt_pin(vdev->pdev.config, pin);
@@ -328,8 +344,10 @@ static bool vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
     fd = event_notifier_get_fd(&vdev->intx.interrupt);
     qemu_set_fd_handler(fd, vfio_intx_interrupt, NULL, vdev);
 
-    if (!vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
-                                VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
+    if (!cpr_is_incoming() &&
+        !vfio_device_irq_set_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX,
+                                       0, VFIO_IRQ_SET_ACTION_TRIGGER, fd,
+                                       errp)) {
         qemu_set_fd_handler(fd, NULL, NULL, vdev);
         vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
         return false;
@@ -3204,7 +3222,13 @@ static bool vfio_interrupt_setup(VFIOPCIDevice *vdev, Error **errp)
                                              vfio_intx_routing_notifier);
         vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
         kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
-        if (!vfio_intx_enable(vdev, errp)) {
+
+        /*
+         * During CPR, do not call vfio_intx_enable at this time.  Instead,
+         * call it from vfio_cpr_pci_post_load after the intx routing data has
+         * been loaded from vmstate.
+         */
+        if (!cpr_is_incoming() && !vfio_intx_enable(vdev, errp)) {
             timer_free(vdev->intx.mmap_timer);
             pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
             kvm_irqchip_remove_change_notifier(&vdev->irqchip_change_notifier);
-- 
1.8.3.1




* [PATCH V4 26/43] migration: close kvm after cpr
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (24 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 25/43] vfio-pci: preserve INTx Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 27/43] migration: cpr_get_fd_param helper Steve Sistare
                   ` (18 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

cpr-transfer breaks vfio network connectivity to and from the guest, and
the host system log shows:
  irq bypass consumer (token 00000000a03c32e5) registration fails: -16
which is EBUSY.  This occurs because KVM descriptors are still open in
the old QEMU process.  Close them in old QEMU, along with the vfio KVM
device descriptor, when precopy completes.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
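Illustration only, not part of the patch.  The EBUSY above comes from a
kernel object that stays claimed for as long as any descriptor for it is
open.  The sketch below is an analogy using a plain pipe, not the KVM irq
bypass mechanism itself: the read end reports EOF only after the last
write-side descriptor is closed.  All names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Return 1 at EOF, 0 if the read would block, -1 otherwise. */
static int read_end_sees_eof(int rd)
{
    char c;
    ssize_t n = read(rd, &c, 1);

    if (n == 0) {
        return 1;
    }
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        return 0;
    }
    return -1;
}

/*
 * Returns 1 if the pipe is released only after the last write-side
 * descriptor is closed, 0 if not, -1 on setup failure.
 */
static int released_only_after_last_close(void)
{
    int p[2], old_holder, busy, released;

    if (pipe(p) < 0) {
        return -1;
    }

    /* A second descriptor for the write end, like one left in old QEMU. */
    old_holder = dup(p[1]);
    if (old_holder < 0 || fcntl(p[0], F_SETFL, O_NONBLOCK) < 0) {
        if (old_holder >= 0) {
            close(old_holder);
        }
        close(p[0]);
        close(p[1]);
        return -1;
    }

    close(p[1]);
    busy = read_end_sees_eof(p[0]) == 0;        /* still held open */

    close(old_holder);
    released = read_end_sees_eof(p[0]) == 1;    /* last holder is gone */

    close(p[0]);
    return busy && released;
}
```

released_only_after_last_close() returns 1, which is the same reason old
QEMU must actually close its KVM descriptors rather than just stop using
them.
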
 include/hw/vfio/vfio-device.h |  2 ++
 include/migration/cpr.h       |  2 ++
 include/system/kvm.h          |  1 +
 accel/kvm/kvm-all.c           | 28 ++++++++++++++++++++++++++++
 accel/stubs/kvm-stub.c        |  5 +++++
 hw/vfio/helpers.c             | 10 ++++++++++
 hw/vfio/vfio-stubs.c          | 13 +++++++++++++
 migration/cpr-transfer.c      | 18 ++++++++++++++++++
 migration/cpr.c               |  8 ++++++++
 migration/migration.c         |  1 +
 hw/vfio/meson.build           |  2 ++
 11 files changed, 90 insertions(+)
 create mode 100644 hw/vfio/vfio-stubs.c

diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
index 4e4d0b6..6eb6f21 100644
--- a/include/hw/vfio/vfio-device.h
+++ b/include/hw/vfio/vfio-device.h
@@ -231,4 +231,6 @@ void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
 void vfio_device_init(VFIODevice *vbasedev, int type, VFIODeviceOps *ops,
                       DeviceState *dev, bool ram_discard);
 int vfio_device_get_aw_bits(VFIODevice *vdev);
+
+void vfio_kvm_device_close(void);
 #endif /* HW_VFIO_VFIO_COMMON_H */
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 07858e9..d09b657 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -32,7 +32,9 @@ void cpr_state_close(void);
 struct QIOChannel *cpr_state_ioc(void);
 
 bool cpr_incoming_needed(void *opaque);
+void cpr_kvm_close(void);
 
+void cpr_transfer_init(void);
 QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
 QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
 
diff --git a/include/system/kvm.h b/include/system/kvm.h
index b690dda..cfaa94c 100644
--- a/include/system/kvm.h
+++ b/include/system/kvm.h
@@ -194,6 +194,7 @@ bool kvm_has_sync_mmu(void);
 int kvm_has_vcpu_events(void);
 int kvm_max_nested_state_length(void);
 int kvm_has_gsi_routing(void);
+void kvm_close(void);
 
 /**
  * kvm_arm_supports_user_irq
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 278a506..d619448 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -512,16 +512,23 @@ static int do_kvm_destroy_vcpu(CPUState *cpu)
         goto err;
     }
 
+    /* If I am the CPU that created coalesced_mmio_ring, then discard it */
+    if (s->coalesced_mmio_ring == (void *)cpu->kvm_run + PAGE_SIZE) {
+        s->coalesced_mmio_ring = NULL;
+    }
+
     ret = munmap(cpu->kvm_run, mmap_size);
     if (ret < 0) {
         goto err;
     }
+    cpu->kvm_run = NULL;
 
     if (cpu->kvm_dirty_gfns) {
         ret = munmap(cpu->kvm_dirty_gfns, s->kvm_dirty_ring_bytes);
         if (ret < 0) {
             goto err;
         }
+        cpu->kvm_dirty_gfns = NULL;
     }
 
     kvm_park_vcpu(cpu);
@@ -600,6 +607,27 @@ err:
     return ret;
 }
 
+void kvm_close(void)
+{
+    CPUState *cpu;
+
+    CPU_FOREACH(cpu) {
+        cpu_remove_sync(cpu);
+        close(cpu->kvm_fd);
+        cpu->kvm_fd = -1;
+        close(cpu->kvm_vcpu_stats_fd);
+        cpu->kvm_vcpu_stats_fd = -1;
+    }
+
+    if (kvm_state && kvm_state->fd != -1) {
+        close(kvm_state->vmfd);
+        kvm_state->vmfd = -1;
+        close(kvm_state->fd);
+        kvm_state->fd = -1;
+    }
+    kvm_state = NULL;
+}
+
 /*
  * dirty pages logging control
  */
diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
index ecfd763..97dacb3 100644
--- a/accel/stubs/kvm-stub.c
+++ b/accel/stubs/kvm-stub.c
@@ -134,3 +134,8 @@ int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp)
 {
     return -ENOSYS;
 }
+
+void kvm_close(void)
+{
+    return;
+}
diff --git a/hw/vfio/helpers.c b/hw/vfio/helpers.c
index d0dbab1..af1db2f 100644
--- a/hw/vfio/helpers.c
+++ b/hw/vfio/helpers.c
@@ -117,6 +117,16 @@ bool vfio_get_info_dma_avail(struct vfio_iommu_type1_info *info,
 int vfio_kvm_device_fd = -1;
 #endif
 
+void vfio_kvm_device_close(void)
+{
+#ifdef CONFIG_KVM
+    if (vfio_kvm_device_fd != -1) {
+        close(vfio_kvm_device_fd);
+        vfio_kvm_device_fd = -1;
+    }
+#endif
+}
+
 int vfio_kvm_device_add_fd(int fd, Error **errp)
 {
 #ifdef CONFIG_KVM
diff --git a/hw/vfio/vfio-stubs.c b/hw/vfio/vfio-stubs.c
new file mode 100644
index 0000000..a4c8b56
--- /dev/null
+++ b/hw/vfio/vfio-stubs.c
@@ -0,0 +1,13 @@
+/*
+ * Copyright (c) 2025 Oracle and/or its affiliates.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+#include "hw/vfio/vfio-device.h"
+
+void vfio_kvm_device_close(void)
+{
+    return;
+}
diff --git a/migration/cpr-transfer.c b/migration/cpr-transfer.c
index e1f1403..396558f 100644
--- a/migration/cpr-transfer.c
+++ b/migration/cpr-transfer.c
@@ -17,6 +17,24 @@
 #include "migration/vmstate.h"
 #include "trace.h"
 
+static int cpr_transfer_notifier(NotifierWithReturn *notifier,
+                                 MigrationEvent *e,
+                                 Error **errp)
+{
+    if (e->type == MIG_EVENT_PRECOPY_DONE) {
+        cpr_kvm_close();
+    }
+    return 0;
+}
+
+void cpr_transfer_init(void)
+{
+    static NotifierWithReturn notifier;
+
+    migration_add_notifier_mode(&notifier, cpr_transfer_notifier,
+                                MIG_MODE_CPR_TRANSFER);
+}
+
 QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp)
 {
     MigrationAddress *addr = channel->addr;
diff --git a/migration/cpr.c b/migration/cpr.c
index a50a57e..49fb0a5 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -7,12 +7,14 @@
 
 #include "qemu/osdep.h"
 #include "qapi/error.h"
+#include "hw/vfio/vfio-device.h"
 #include "migration/cpr.h"
 #include "migration/misc.h"
 #include "migration/options.h"
 #include "migration/qemu-file.h"
 #include "migration/savevm.h"
 #include "migration/vmstate.h"
+#include "system/kvm.h"
 #include "system/runstate.h"
 #include "trace.h"
 
@@ -264,3 +266,9 @@ bool cpr_incoming_needed(void *opaque)
     MigMode mode = migrate_mode();
     return mode == MIG_MODE_CPR_TRANSFER;
 }
+
+void cpr_kvm_close(void)
+{
+    kvm_close();
+    vfio_kvm_device_close();
+}
diff --git a/migration/migration.c b/migration/migration.c
index 4697732..89e2026 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -337,6 +337,7 @@ void migration_object_init(void)
 
     ram_mig_init();
     dirty_bitmap_mig_init();
+    cpr_transfer_init();
 
     /* Initialize cpu throttle timers */
     cpu_throttle_init();
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index 73d29f9..98134a7 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -17,6 +17,8 @@ vfio_ss.add(when: 'CONFIG_VFIO_IGD', if_true: files('igd.c'))
 
 specific_ss.add_all(when: 'CONFIG_VFIO', if_true: vfio_ss)
 
+system_ss.add(when: 'CONFIG_VFIO', if_false: files('vfio-stubs.c'))
+
 system_ss.add(when: 'CONFIG_VFIO_XGMAC', if_true: files('calxeda-xgmac.c'))
 system_ss.add(when: 'CONFIG_VFIO_AMD_XGBE', if_true: files('amd-xgbe.c'))
 system_ss.add(when: 'CONFIG_VFIO', if_true: files(
-- 
1.8.3.1




* [PATCH V4 27/43] migration: cpr_get_fd_param helper
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (25 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 26/43] migration: close kvm after cpr Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 28/43] backends/iommufd: iommufd_backend_map_file_dma Steve Sistare
                   ` (17 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Add the helper function cpr_get_fd_param, to use when preserving
a file descriptor that is opened externally and passed to QEMU.
cpr_get_fd_param returns a descriptor number either from a QEMU
command-line parameter, from a getfd command, or from CPR state.

When a descriptor is passed to new QEMU via SCM_RIGHTS, its number
changes.  Hence, during CPR, the command-line parameter is ignored
in new QEMU, and overridden by the value found in CPR state.

Similarly, if the descriptor was originally specified by a getfd
command in old QEMU, the fd number is not known outside of QEMU,
and it changes when sent to new QEMU via SCM_RIGHTS.  Hence the
user cannot send getfd to new QEMU, but when the user sends a
hotplug command that references the fd, cpr_get_fd_param finds
its value in CPR state.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
---
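Illustration only, not part of the patch.  The sketch below shows why a
descriptor number from the command line or from getfd cannot be trusted
after the transfer: a descriptor received via SCM_RIGHTS gets a new
number, although it refers to the same open file.  For brevity it passes
the write end of a pipe through a socketpair inside one process; the
function names are hypothetical and unrelated to the QEMU helpers.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for one descriptor. */
union fd_cmsg {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send fd over a UNIX socket as SCM_RIGHTS data.  Returns 0 or -1. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor.  Returns its number in this process, or -1. */
static int recv_fd(int sock)
{
    char byte;
    int fd = -1;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/*
 * Returns 1 if the received fd has a different number yet writes to the
 * same pipe, 0 if not, -1 on setup or transfer failure.
 */
static int fd_renumbered_same_object(void)
{
    int sv[2], p[2], got = -1, ok = -1;
    char c = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }
    if (pipe(p) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    if (send_fd(sv[0], p[1]) == 0 && (got = recv_fd(sv[1])) >= 0) {
        ok = got != p[1] &&
             write(got, "x", 1) == 1 &&
             read(p[0], &c, 1) == 1 && c == 'x';
        close(got);
    }

    close(p[0]);
    close(p[1]);
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

Because the number differs on the receiving side, the receiver has to
find the descriptor by a stable key, which is the role of the (name,
index) lookup in CPR state.
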
 include/migration/cpr.h |  2 ++
 migration/cpr.c         | 37 +++++++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index d09b657..7fd8065 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -33,6 +33,8 @@ struct QIOChannel *cpr_state_ioc(void);
 
 bool cpr_incoming_needed(void *opaque);
 void cpr_kvm_close(void);
+int cpr_get_fd_param(const char *name, const char *fdname, int index,
+                     Error **errp);
 
 void cpr_transfer_init(void);
 QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
diff --git a/migration/cpr.c b/migration/cpr.c
index 49fb0a5..4574608 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -14,6 +14,7 @@
 #include "migration/qemu-file.h"
 #include "migration/savevm.h"
 #include "migration/vmstate.h"
+#include "monitor/monitor.h"
 #include "system/kvm.h"
 #include "system/runstate.h"
 #include "trace.h"
@@ -272,3 +273,39 @@ void cpr_kvm_close(void)
     kvm_close();
     vfio_kvm_device_close();
 }
+
+/*
+ * cpr_get_fd_param: find a descriptor and return its value.
+ *
+ * @name: CPR name for the descriptor
+ * @fdname: An integer-valued string, or a name passed to a getfd command
+ * @index: CPR index of the descriptor
+ * @errp: returned error message
+ *
+ * If CPR is not being performed, then use @fdname to find the fd.
+ * If CPR is being performed, then ignore @fdname, and look for @name
+ * and @index in CPR state.
+ *
+ * On success returns the fd value, else returns -1.
+ */
+int cpr_get_fd_param(const char *name, const char *fdname, int index,
+                     Error **errp)
+{
+    ERRP_GUARD();
+    int fd;
+
+    if (cpr_is_incoming()) {
+        fd = cpr_find_fd(name, index);
+        if (fd < 0) {
+            error_setg(errp, "cannot find saved value for fd %s", fdname);
+        }
+    } else {
+        fd = monitor_fd_param(monitor_cur(), fdname, errp);
+        if (fd >= 0) {
+            cpr_save_fd(name, index, fd);
+        } else {
+            error_prepend(errp, "Could not parse object fd %s:", fdname);
+        }
+    }
+    return fd;
+}
-- 
1.8.3.1




* [PATCH V4 28/43] backends/iommufd: iommufd_backend_map_file_dma
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (26 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 27/43] migration: cpr_get_fd_param helper Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 29/43] backends/iommufd: change process ioctl Steve Sistare
                   ` (16 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Define iommufd_backend_map_file_dma to implement IOMMU_IOAS_MAP_FILE.
This will be called as a substitute for iommufd_backend_map_dma, so
the error conditions for BARs are copied as-is from that function.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
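Illustration only, not part of the patch.  IOMMU_IOAS_MAP_FILE names the
memory by descriptor and offset (the .fd and .start fields below) rather
than by virtual address.  The sketch shows the underlying property with
plain mmap: contents written through one mapping of a file remain
reachable through the descriptor after that virtual address is unmapped.
An ordinary temporary file stands in for a memfd, and the function name
is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if data written through one mapping is visible through a
 * second mapping created after the first VA is gone, 0 if not, and -1
 * on setup failure.
 */
static int file_contents_outlive_va(void)
{
    const size_t len = 4096;
    const char msg[] = "dma";
    FILE *f = tmpfile();
    char *va_old, *va_new;
    int fd, ok;

    if (!f) {
        return -1;
    }
    fd = fileno(f);
    if (ftruncate(fd, len) < 0) {
        fclose(f);
        return -1;
    }

    /* First mapping: write through it, then discard the address. */
    va_old = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (va_old == MAP_FAILED) {
        fclose(f);
        return -1;
    }
    memcpy(va_old, msg, sizeof(msg));
    munmap(va_old, len);

    /* Second mapping: same fd and offset, whatever address we get. */
    va_new = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (va_new == MAP_FAILED) {
        fclose(f);
        return -1;
    }
    ok = memcmp(va_new, msg, sizeof(msg)) == 0;

    munmap(va_new, len);
    fclose(f);
    return ok;
}
```

file_contents_outlive_va() returns 1: the pair (fd, offset) identifies
the pages without reference to any particular virtual address.
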
 include/system/iommufd.h |  3 +++
 backends/iommufd.c       | 34 ++++++++++++++++++++++++++++++++++
 backends/trace-events    |  1 +
 3 files changed, 38 insertions(+)

diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index cbab75b..ac700b8 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -43,6 +43,9 @@ void iommufd_backend_disconnect(IOMMUFDBackend *be);
 bool iommufd_backend_alloc_ioas(IOMMUFDBackend *be, uint32_t *ioas_id,
                                 Error **errp);
 void iommufd_backend_free_id(IOMMUFDBackend *be, uint32_t id);
+int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
+                                 hwaddr iova, ram_addr_t size, int fd,
+                                 unsigned long start, bool readonly);
 int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas_id, hwaddr iova,
                             ram_addr_t size, void *vaddr, bool readonly);
 int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
diff --git a/backends/iommufd.c b/backends/iommufd.c
index b73f75c..4f97b2c 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -172,6 +172,40 @@ int iommufd_backend_map_dma(IOMMUFDBackend *be, uint32_t ioas_id, hwaddr iova,
     return ret;
 }
 
+int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
+                                 hwaddr iova, ram_addr_t size,
+                                 int mfd, unsigned long start, bool readonly)
+{
+    int ret, fd = be->fd;
+    struct iommu_ioas_map_file map = {
+        .size = sizeof(map),
+        .flags = IOMMU_IOAS_MAP_READABLE |
+                 IOMMU_IOAS_MAP_FIXED_IOVA,
+        .ioas_id = ioas_id,
+        .fd = mfd,
+        .start = start,
+        .iova = iova,
+        .length = size,
+    };
+
+    if (!readonly) {
+        map.flags |= IOMMU_IOAS_MAP_WRITEABLE;
+    }
+
+    ret = ioctl(fd, IOMMU_IOAS_MAP_FILE, &map);
+    trace_iommufd_backend_map_file_dma(fd, ioas_id, iova, size, mfd, start,
+                                       readonly, ret);
+    if (ret) {
+        ret = -errno;
+
+        /* TODO: Not support mapping hardware PCI BAR region for now. */
+        if (errno == EFAULT) {
+            warn_report("IOMMU_IOAS_MAP_FILE failed: %m, PCI BAR?");
+        }
+    }
+    return ret;
+}
+
 int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
                               hwaddr iova, ram_addr_t size)
 {
diff --git a/backends/trace-events b/backends/trace-events
index 40811a3..f478e18 100644
--- a/backends/trace-events
+++ b/backends/trace-events
@@ -11,6 +11,7 @@ iommufd_backend_connect(int fd, bool owned, uint32_t users) "fd=%d owned=%d user
 iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
 iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
 iommufd_backend_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, void *vaddr, bool readonly, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" addr=%p readonly=%d (%d)"
+iommufd_backend_map_file_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int fd, unsigned long start, bool readonly, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" fd=%d start=%lu readonly=%d (%d)"
 iommufd_backend_unmap_dma_non_exist(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " Unmap nonexistent mapping: iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
 iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
 iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d"
-- 
1.8.3.1




* [PATCH V4 29/43] backends/iommufd: change process ioctl
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (27 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 28/43] backends/iommufd: iommufd_backend_map_file_dma Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 30/43] physmem: qemu_ram_get_fd_offset Steve Sistare
                   ` (15 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

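Define iommufd_change_process, which issues the IOMMU_IOAS_CHANGE_PROCESS
ioctl, and iommufd_change_process_capable, which tests whether the kernel
supports it.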

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
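Illustration only, not part of the patch.  The capability check in this
patch detects support by issuing the ioctl and looking at the result.
The sketch below shows that probing idiom on a pipe, using two standard
requests in place of the iommufd one: FIONREAD, which a pipe accepts, and
TIOCGWINSZ, which it rejects.  The helper names are hypothetical, and the
idiom is only appropriate for requests that are harmless to issue.

```c
#include <assert.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Report whether fd accepts the request, by issuing it. */
static int ioctl_is_supported(int fd, unsigned long request, void *arg)
{
    return ioctl(fd, request, arg) == 0;
}

/*
 * Returns 1 if the pipe accepts FIONREAD and rejects TIOCGWINSZ, 0 if
 * not, -1 on setup failure.
 */
static int probe_demo(void)
{
    int p[2], nbytes = 0, has_fionread, has_winsize;
    struct winsize ws;

    if (pipe(p) < 0) {
        return -1;
    }

    has_fionread = ioctl_is_supported(p[0], FIONREAD, &nbytes);
    has_winsize = ioctl_is_supported(p[0], TIOCGWINSZ, &ws);

    close(p[0]);
    close(p[1]);
    return has_fionread && !has_winsize;
}
```

probe_demo() returns 1 on Linux.  A caller that needs to know why a probe
failed can inspect errno immediately after the failing call.
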
 include/system/iommufd.h |  2 ++
 backends/iommufd.c       | 24 ++++++++++++++++++++++++
 backends/trace-events    |  1 +
 3 files changed, 27 insertions(+)

diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index ac700b8..db9ed53 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -64,6 +64,8 @@ bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t hwpt_id,
                                       uint64_t iova, ram_addr_t size,
                                       uint64_t page_size, uint64_t *data,
                                       Error **errp);
+bool iommufd_change_process_capable(IOMMUFDBackend *be);
+bool iommufd_change_process(IOMMUFDBackend *be, Error **errp);
 
 #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
 #endif
diff --git a/backends/iommufd.c b/backends/iommufd.c
index 4f97b2c..ed8bb4c 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -73,6 +73,30 @@ static void iommufd_backend_class_init(ObjectClass *oc, const void *data)
     object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
 }
 
+bool iommufd_change_process_capable(IOMMUFDBackend *be)
+{
+    struct iommu_ioas_change_process args = {.size = sizeof(args)};
+
+    /*
+     * Call IOMMU_IOAS_CHANGE_PROCESS to verify it is a recognized ioctl.
+     * This is a no-op if the process has not changed since DMA was mapped.
+     */
+    return !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
+}
+
+bool iommufd_change_process(IOMMUFDBackend *be, Error **errp)
+{
+    struct iommu_ioas_change_process args = {.size = sizeof(args)};
+    bool ret = !ioctl(be->fd, IOMMU_IOAS_CHANGE_PROCESS, &args);
+
+    if (!ret) {
+        error_setg_errno(errp, errno, "IOMMU_IOAS_CHANGE_PROCESS fd %d failed",
+                         be->fd);
+    }
+    trace_iommufd_change_process(be->fd, ret);
+    return ret;
+}
+
 bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
 {
     int fd;
diff --git a/backends/trace-events b/backends/trace-events
index f478e18..5ccdf90 100644
--- a/backends/trace-events
+++ b/backends/trace-events
@@ -7,6 +7,7 @@ dbus_vmstate_loading(const char *id) "id: %s"
 dbus_vmstate_saving(const char *id) "id: %s"
 
 # iommufd.c
+iommufd_change_process(int fd, bool ret) "fd=%d (%d)"
 iommufd_backend_connect(int fd, bool owned, uint32_t users) "fd=%d owned=%d users=%d"
 iommufd_backend_disconnect(int fd, uint32_t users) "fd=%d users=%d"
 iommu_backend_set_fd(int fd) "pre-opened /dev/iommu fd=%d"
-- 
1.8.3.1




* [PATCH V4 30/43] physmem: qemu_ram_get_fd_offset
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (28 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 29/43] backends/iommufd: change process ioctl Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 31/43] vfio/iommufd: use IOMMU_IOAS_MAP_FILE Steve Sistare
                   ` (14 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Define qemu_ram_get_fd_offset, so CPR can map a memory region using
IOMMU_IOAS_MAP_FILE in a subsequent patch.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/exec/cpu-common.h | 1 +
 system/physmem.c          | 5 +++++
 2 files changed, 6 insertions(+)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index a684855..9b658a3 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -85,6 +85,7 @@ void qemu_ram_unset_idstr(RAMBlock *block);
 const char *qemu_ram_get_idstr(RAMBlock *rb);
 void *qemu_ram_get_host_addr(RAMBlock *rb);
 ram_addr_t qemu_ram_get_offset(RAMBlock *rb);
+ram_addr_t qemu_ram_get_fd_offset(RAMBlock *rb);
 ram_addr_t qemu_ram_get_used_length(RAMBlock *rb);
 ram_addr_t qemu_ram_get_max_length(RAMBlock *rb);
 bool qemu_ram_is_shared(RAMBlock *rb);
diff --git a/system/physmem.c b/system/physmem.c
index a8a9ca3..18684a4 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1593,6 +1593,11 @@ ram_addr_t qemu_ram_get_offset(RAMBlock *rb)
     return rb->offset;
 }
 
+ram_addr_t qemu_ram_get_fd_offset(RAMBlock *rb)
+{
+    return rb->fd_offset;
+}
+
 ram_addr_t qemu_ram_get_used_length(RAMBlock *rb)
 {
     return rb->used_length;
-- 
1.8.3.1




* [PATCH V4 31/43] vfio/iommufd: use IOMMU_IOAS_MAP_FILE
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (29 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 30/43] physmem: qemu_ram_get_fd_offset Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 32/43] vfio/iommufd: invariant device name Steve Sistare
                   ` (13 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Use IOMMU_IOAS_MAP_FILE when the mapped region is backed by a file.
Such a mapping can be preserved without modification during CPR,
because it depends on the file's address space, which does not change,
rather than on the process's address space, which does change.
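
As a rough illustration of why a file-based mapping survives CPR, the
sketch below translates a VA inside a block to an offset in the backing
file.  The struct and function names are hypothetical stand-ins, not
QEMU's RAMBlock or any function in this series.  Only the arithmetic
(VA minus the block's host base, plus the block's file offset) follows
the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the RAMBlock fields involved; not QEMU's type. */
struct block_sketch {
    uint8_t *host;       /* VA where the block is mapped in this process */
    uint64_t fd_offset;  /* offset of the block within its backing file */
};

/* Translate a VA inside the block to an offset in the backing file. */
static uint64_t file_offset_of(const struct block_sketch *rb,
                               const uint8_t *vaddr)
{
    assert(vaddr >= rb->host);
    return (uint64_t)(vaddr - rb->host) + rb->fd_offset;
}
```

Two processes that map the same file at different VAs compute the same
file offset for the same byte, which is why the mapping need not be
redone when the process changes.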

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/hw/vfio/vfio-container-base.h | 15 +++++++++++++++
 hw/vfio/container-base.c              |  9 +++++++++
 hw/vfio/iommufd.c                     | 13 +++++++++++++
 3 files changed, 37 insertions(+)

diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index dbbe87d..91ffafc 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -136,6 +136,21 @@ struct VFIOIOMMUClass {
                    hwaddr iova, ram_addr_t size,
                    void *vaddr, bool readonly, MemoryRegion *mr);
     /**
+     * @dma_map_file
+     *
+     * Map a file range for the container.
+     *
+     * @bcontainer: #VFIOContainerBase to use for map
+     * @iova: start address to map
+     * @size: size of the range to map
+     * @fd: descriptor of the file to map
+     * @start: starting file offset of the range to map
+     * @readonly: map read only if true
+     */
+    int (*dma_map_file)(const VFIOContainerBase *bcontainer,
+                        hwaddr iova, ram_addr_t size,
+                        int fd, unsigned long start, bool readonly);
+    /**
      * @dma_unmap
      *
      * Unmap an address range from the container.
diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
index d834bd4..5630497 100644
--- a/hw/vfio/container-base.c
+++ b/hw/vfio/container-base.c
@@ -78,7 +78,16 @@ int vfio_container_dma_map(VFIOContainerBase *bcontainer,
                            void *vaddr, bool readonly, MemoryRegion *mr)
 {
     VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
+    RAMBlock *rb = mr->ram_block;
+    int mfd = rb ? qemu_ram_get_fd(rb) : -1;
 
+    if (mfd >= 0 && vioc->dma_map_file) {
+        unsigned long start = vaddr - qemu_ram_get_host_addr(rb);
+        unsigned long offset = qemu_ram_get_fd_offset(rb);
+
+        return vioc->dma_map_file(bcontainer, iova, size, mfd, start + offset,
+                                  readonly);
+    }
     g_assert(vioc->dma_map);
     return vioc->dma_map(bcontainer, iova, size, vaddr, readonly, mr);
 }
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index eb2f88d..ca00d08 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -45,6 +45,18 @@ static int iommufd_cdev_map(const VFIOContainerBase *bcontainer, hwaddr iova,
                                    iova, size, vaddr, readonly);
 }
 
+static int iommufd_cdev_map_file(const VFIOContainerBase *bcontainer,
+                                 hwaddr iova, ram_addr_t size,
+                                 int fd, unsigned long start, bool readonly)
+{
+    const VFIOIOMMUFDContainer *container =
+        container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
+
+    return iommufd_backend_map_file_dma(container->be,
+                                        container->ioas_id,
+                                        iova, size, fd, start, readonly);
+}
+
 static int iommufd_cdev_unmap(const VFIOContainerBase *bcontainer,
                               hwaddr iova, ram_addr_t size,
                               IOMMUTLBEntry *iotlb, bool unmap_all)
@@ -803,6 +815,7 @@ static void vfio_iommu_iommufd_class_init(ObjectClass *klass, const void *data)
     VFIOIOMMUClass *vioc = VFIO_IOMMU_CLASS(klass);
 
     vioc->dma_map = iommufd_cdev_map;
+    vioc->dma_map_file = iommufd_cdev_map_file;
     vioc->dma_unmap = iommufd_cdev_unmap;
     vioc->attach_device = iommufd_cdev_attach;
     vioc->detach_device = iommufd_cdev_detach;
-- 
1.8.3.1




* [PATCH V4 32/43] vfio/iommufd: invariant device name
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (30 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 31/43] vfio/iommufd: use IOMMU_IOAS_MAP_FILE Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-06-10  6:10   ` Cédric Le Goater
  2025-05-29 19:24 ` [PATCH V4 33/43] vfio/iommufd: add vfio_device_free_name Steve Sistare
                   ` (12 subsequent siblings)
  44 siblings, 1 reply; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

cpr-transfer will use the device name as a key to find the value
of the device descriptor in new QEMU.  However, if the descriptor
number is specified by a command-line fd parameter, then
vfio_device_get_name creates a name that includes the fd number.
This causes a chicken-and-egg problem: new QEMU must know the fd
number to construct a name to find the fd number.

To fix, create an invariant name based on the id command-line parameter,
if id is defined.  The user will need to provide such an id to use CPR.
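
The naming rule can be sketched in isolation as follows.  This is an
illustrative helper only: pick_device_name is a hypothetical name, and
treating an empty id as undefined is an assumption of the sketch.  The
patch itself only tests whether the id pointer is set.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch only: choose a device name for use as a lookup key.
 * Returns 1 if the name is invariant (copied from id), or 0 if it fell
 * back to an fd-based name, which can differ between old and new QEMU.
 */
static int pick_device_name(char *buf, size_t len, const char *id, int fd)
{
    assert(buf && len > 0);
    if (id && strlen(id) > 0) {
        snprintf(buf, len, "%s", id);
        return 1;
    }
    snprintf(buf, len, "VFIO_FD%d", fd);
    return 0;
}
```

A caller would treat a return of 0 as "no stable key available", which
is the case in which CPR cannot look up the descriptor by name.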

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/device.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/device.c b/hw/vfio/device.c
index 9fba2c7..71fa9f4 100644
--- a/hw/vfio/device.c
+++ b/hw/vfio/device.c
@@ -300,12 +300,17 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
             error_setg(errp, "Use FD passing only with iommufd backend");
             return false;
         }
-        /*
-         * Give a name with fd so any function printing out vbasedev->name
-         * will not break.
-         */
         if (!vbasedev->name) {
-            vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
+
+            if (vbasedev->dev->id) {
+                vbasedev->name = g_strdup(vbasedev->dev->id);
+                return true;
+            } else {
+                /*
+                 * Assign a name so any function printing it will not break.
+                 */
+                vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
+            }
         }
     }
 
-- 
1.8.3.1




* [PATCH V4 33/43] vfio/iommufd: add vfio_device_free_name
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (31 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 32/43] vfio/iommufd: invariant device name Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-06-10  6:12   ` Cédric Le Goater
  2025-05-29 19:24 ` [PATCH V4 34/43] vfio/iommufd: device name blocker Steve Sistare
                   ` (11 subsequent siblings)
  44 siblings, 1 reply; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Define vfio_device_free_name to free the name created by
vfio_device_get_name.  A subsequent patch will do more there.
No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/hw/vfio/vfio-device.h | 1 +
 hw/vfio/ap.c                  | 2 +-
 hw/vfio/ccw.c                 | 2 +-
 hw/vfio/device.c              | 5 +++++
 hw/vfio/pci.c                 | 2 +-
 hw/vfio/platform.c            | 2 +-
 6 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
index 6eb6f21..321b442 100644
--- a/include/hw/vfio/vfio-device.h
+++ b/include/hw/vfio/vfio-device.h
@@ -227,6 +227,7 @@ int vfio_device_get_irq_info(VFIODevice *vbasedev, int index,
 
 /* Returns 0 on success, or a negative errno. */
 bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp);
+void vfio_device_free_name(VFIODevice *vbasedev);
 void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
 void vfio_device_init(VFIODevice *vbasedev, int type, VFIODeviceOps *ops,
                       DeviceState *dev, bool ram_discard);
diff --git a/hw/vfio/ap.c b/hw/vfio/ap.c
index 785c0a0..013bd59 100644
--- a/hw/vfio/ap.c
+++ b/hw/vfio/ap.c
@@ -180,7 +180,7 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp)
 
 error:
     error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name);
-    g_free(vbasedev->name);
+    vfio_device_free_name(vbasedev);
 }
 
 static void vfio_ap_unrealize(DeviceState *dev)
diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
index cea9d6e..903b8b0 100644
--- a/hw/vfio/ccw.c
+++ b/hw/vfio/ccw.c
@@ -619,7 +619,7 @@ out_io_notifier_err:
 out_region_err:
     vfio_device_detach(vbasedev);
 out_attach_dev_err:
-    g_free(vbasedev->name);
+    vfio_device_free_name(vbasedev);
 out_unrealize:
     if (cdc->unrealize) {
         cdc->unrealize(cdev);
diff --git a/hw/vfio/device.c b/hw/vfio/device.c
index 71fa9f4..151c618 100644
--- a/hw/vfio/device.c
+++ b/hw/vfio/device.c
@@ -317,6 +317,11 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
     return true;
 }
 
+void vfio_device_free_name(VFIODevice *vbasedev)
+{
+    g_free(vbasedev->name);
+}
+
 void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
 {
     ERRP_GUARD();
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index c8d6ee0..7da7a9c 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2949,7 +2949,7 @@ static void vfio_pci_put_device(VFIOPCIDevice *vdev)
 {
     vfio_device_detach(&vdev->vbasedev);
 
-    g_free(vdev->vbasedev.name);
+    vfio_device_free_name(&vdev->vbasedev);
     g_free(vdev->msix);
 }
 
diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
index 9a21f2e..5c1795a 100644
--- a/hw/vfio/platform.c
+++ b/hw/vfio/platform.c
@@ -530,7 +530,7 @@ static bool vfio_base_device_init(VFIODevice *vbasedev, Error **errp)
 {
     /* @fd takes precedence over @sysfsdev which takes precedence over @host */
     if (vbasedev->fd < 0 && vbasedev->sysfsdev) {
-        g_free(vbasedev->name);
+        vfio_device_free_name(vbasedev);
         vbasedev->name = g_path_get_basename(vbasedev->sysfsdev);
     } else if (vbasedev->fd < 0) {
         if (!vbasedev->name || strchr(vbasedev->name, '/')) {
-- 
1.8.3.1




* [PATCH V4 34/43] vfio/iommufd: device name blocker
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (32 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 33/43] vfio/iommufd: add vfio_device_free_name Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 35/43] vfio/iommufd: register container for cpr Steve Sistare
                   ` (10 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

If an invariant device name cannot be created, block CPR.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/hw/vfio/vfio-cpr.h |  1 +
 hw/vfio/device.c           | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 25e74ee..170a116 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -29,6 +29,7 @@ typedef struct VFIOContainerCPR {
 
 typedef struct VFIODeviceCPR {
     Error *mdev_blocker;
+    Error *id_blocker;
 } VFIODeviceCPR;
 
 bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
diff --git a/hw/vfio/device.c b/hw/vfio/device.c
index 151c618..0f29063 100644
--- a/hw/vfio/device.c
+++ b/hw/vfio/device.c
@@ -28,6 +28,8 @@
 #include "qapi/error.h"
 #include "qemu/error-report.h"
 #include "qemu/units.h"
+#include "migration/cpr.h"
+#include "migration/blocker.h"
 #include "monitor/monitor.h"
 #include "vfio-helpers.h"
 
@@ -308,8 +310,16 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
             } else {
                 /*
                  * Assign a name so any function printing it will not break.
+                 * The fd number changes across processes, so this cannot be
+                 * used as an invariant name for CPR.
                  */
                 vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
+                error_setg(&vbasedev->cpr.id_blocker,
+                           "vfio device with fd=%d needs an id property",
+                           vbasedev->fd);
+                return migrate_add_blocker_modes(&vbasedev->cpr.id_blocker,
+                                                 errp, MIG_MODE_CPR_TRANSFER,
+                                                 -1) == 0;
             }
         }
     }
@@ -320,6 +330,7 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
 void vfio_device_free_name(VFIODevice *vbasedev)
 {
     g_free(vbasedev->name);
+    migrate_del_blocker(&vbasedev->cpr.id_blocker);
 }
 
 void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
-- 
1.8.3.1




* [PATCH V4 35/43] vfio/iommufd: register container for cpr
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (33 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 34/43] vfio/iommufd: device name blocker Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-06-09 20:30   ` Cédric Le Goater
  2025-05-29 19:24 ` [PATCH V4 36/43] migration: vfio cpr state hook Steve Sistare
                   ` (9 subsequent siblings)
  44 siblings, 1 reply; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Register a vfio iommufd container and device for CPR, replacing the generic
CPR register call with a more specific iommufd register call.  Add a
blocker if the kernel does not support IOMMU_IOAS_CHANGE_PROCESS.

This is mostly boilerplate.  The fields to be saved and restored are added
in subsequent patches.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/hw/vfio/vfio-cpr.h | 12 +++++++
 include/system/iommufd.h   |  1 +
 backends/iommufd.c         | 10 ++++++
 hw/vfio/cpr-iommufd.c      | 84 ++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/iommufd.c          |  6 ++--
 hw/vfio/meson.build        |  1 +
 6 files changed, 112 insertions(+), 2 deletions(-)
 create mode 100644 hw/vfio/cpr-iommufd.c

diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 170a116..b9b77ae 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -15,7 +15,10 @@
 struct VFIOContainer;
 struct VFIOContainerBase;
 struct VFIOGroup;
+struct VFIODevice;
 struct VFIOPCIDevice;
+struct VFIOIOMMUFDContainer;
+struct IOMMUFDBackend;
 
 typedef struct VFIOContainerCPR {
     Error *blocker;
@@ -43,6 +46,15 @@ bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
                                  Error **errp);
 void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
 
+bool vfio_iommufd_cpr_register_container(struct VFIOIOMMUFDContainer *container,
+                                         Error **errp);
+void vfio_iommufd_cpr_unregister_container(
+    struct VFIOIOMMUFDContainer *container);
+bool vfio_iommufd_cpr_register_iommufd(struct IOMMUFDBackend *be, Error **errp);
+void vfio_iommufd_cpr_unregister_iommufd(struct IOMMUFDBackend *be);
+void vfio_iommufd_cpr_register_device(struct VFIODevice *vbasedev);
+void vfio_iommufd_cpr_unregister_device(struct VFIODevice *vbasedev);
+
 int vfio_cpr_group_get_device_fd(int d, const char *name);
 
 bool vfio_cpr_container_match(struct VFIOContainer *container,
diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index db9ed53..3c58ea8 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -32,6 +32,7 @@ struct IOMMUFDBackend {
     /*< protected >*/
     int fd;            /* /dev/iommu file descriptor */
     bool owned;        /* is the /dev/iommu opened internally */
+    Error *cpr_blocker;/* set if be does not support CPR */
     uint32_t users;
 
     /*< public >*/
diff --git a/backends/iommufd.c b/backends/iommufd.c
index ed8bb4c..2e9d6cb 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -108,6 +108,13 @@ bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
         }
         be->fd = fd;
     }
+    if (!be->users && !vfio_iommufd_cpr_register_iommufd(be, errp)) {
+        if (be->owned) {
+            close(be->fd);
+            be->fd = -1;
+        }
+        return false;
+    }
     be->users++;
 
     trace_iommufd_backend_connect(be->fd, be->owned, be->users);
@@ -125,6 +132,9 @@ void iommufd_backend_disconnect(IOMMUFDBackend *be)
         be->fd = -1;
     }
 out:
+    if (!be->users) {
+        vfio_iommufd_cpr_unregister_iommufd(be);
+    }
     trace_iommufd_backend_disconnect(be->fd, be->users);
 }
 
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
new file mode 100644
index 0000000..60bd7e8
--- /dev/null
+++ b/hw/vfio/cpr-iommufd.c
@@ -0,0 +1,84 @@
+/*
+ * Copyright (c) 2024-2025 Oracle and/or its affiliates.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "hw/vfio/vfio-cpr.h"
+#include "migration/blocker.h"
+#include "migration/cpr.h"
+#include "migration/migration.h"
+#include "migration/vmstate.h"
+#include "system/iommufd.h"
+#include "vfio-iommufd.h"
+
+static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
+{
+    if (!iommufd_change_process_capable(be)) {
+        if (errp) {
+            error_setg(errp, "vfio iommufd backend does not support "
+                       "IOMMU_IOAS_CHANGE_PROCESS");
+        }
+        return false;
+    }
+    return true;
+}
+
+static const VMStateDescription iommufd_cpr_vmstate = {
+    .name = "iommufd",
+    .version_id = 0,
+    .minimum_version_id = 0,
+    .needed = cpr_incoming_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+bool vfio_iommufd_cpr_register_iommufd(IOMMUFDBackend *be, Error **errp)
+{
+    Error **cpr_blocker = &be->cpr_blocker;
+
+    if (!vfio_cpr_supported(be, cpr_blocker)) {
+        return migrate_add_blocker_modes(cpr_blocker, errp,
+                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
+    }
+
+    vmstate_register(NULL, -1, &iommufd_cpr_vmstate, be);
+
+    return true;
+}
+
+void vfio_iommufd_cpr_unregister_iommufd(IOMMUFDBackend *be)
+{
+    vmstate_unregister(NULL, &iommufd_cpr_vmstate, be);
+    migrate_del_blocker(&be->cpr_blocker);
+}
+
+bool vfio_iommufd_cpr_register_container(VFIOIOMMUFDContainer *container,
+                                         Error **errp)
+{
+    VFIOContainerBase *bcontainer = &container->bcontainer;
+
+    migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
+                                vfio_cpr_reboot_notifier,
+                                MIG_MODE_CPR_REBOOT);
+
+    return true;
+}
+
+void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
+{
+    VFIOContainerBase *bcontainer = &container->bcontainer;
+
+    migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
+}
+
+void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
+{
+}
+
+void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
+{
+}
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index ca00d08..c690c2c 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -446,7 +446,7 @@ static void iommufd_cdev_container_destroy(VFIOIOMMUFDContainer *container)
     if (!QLIST_EMPTY(&bcontainer->device_list)) {
         return;
     }
-    vfio_cpr_unregister_container(bcontainer);
+    vfio_iommufd_cpr_unregister_container(container);
     vfio_listener_unregister(bcontainer);
     iommufd_backend_free_id(container->be, container->ioas_id);
     object_unref(container);
@@ -592,7 +592,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
         goto err_listener_register;
     }
 
-    if (!vfio_cpr_register_container(bcontainer, errp)) {
+    if (!vfio_iommufd_cpr_register_container(container, errp)) {
         goto err_listener_register;
     }
 
@@ -619,6 +619,7 @@ found_container:
     }
 
     vfio_device_prepare(vbasedev, bcontainer, &dev_info);
+    vfio_iommufd_cpr_register_device(vbasedev);
 
     trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
                                    vbasedev->num_regions, vbasedev->flags);
@@ -656,6 +657,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
     iommufd_cdev_container_destroy(container);
     vfio_address_space_put(space);
 
+    vfio_iommufd_cpr_unregister_device(vbasedev);
     iommufd_cdev_unbind_and_disconnect(vbasedev);
     close(vbasedev->fd);
 }
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index 98134a7..12711fb 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -23,6 +23,7 @@ system_ss.add(when: 'CONFIG_VFIO_XGMAC', if_true: files('calxeda-xgmac.c'))
 system_ss.add(when: 'CONFIG_VFIO_AMD_XGBE', if_true: files('amd-xgbe.c'))
 system_ss.add(when: 'CONFIG_VFIO', if_true: files(
   'cpr.c',
+  'cpr-iommufd.c',
   'cpr-legacy.c',
   'device.c',
   'migration.c',
-- 
1.8.3.1




* [PATCH V4 36/43] migration: vfio cpr state hook
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (34 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 35/43] vfio/iommufd: register container for cpr Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-06-10  6:14   ` Cédric Le Goater
  2025-05-29 19:24 ` [PATCH V4 37/43] vfio/iommufd: cpr state Steve Sistare
                   ` (8 subsequent siblings)
  44 siblings, 1 reply; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Define a list of vfio devices in CPR state, in a subsection so that
older QEMU can be live updated to this version.  However, new QEMU
will not be live updateable to old QEMU.  This is acceptable because
CPR is not yet commonly used, and updates to older versions are unusual.

The contents of each device object will be defined by the vfio subsystem
in a subsequent patch.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/hw/vfio/vfio-cpr.h |  1 +
 include/migration/cpr.h    | 12 ++++++++++++
 hw/vfio/cpr-iommufd.c      |  2 ++
 migration/cpr.c            | 14 +++++---------
 4 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index b9b77ae..619af07 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -74,5 +74,6 @@ void vfio_cpr_delete_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
                                int nr);
 
 extern const VMStateDescription vfio_cpr_pci_vmstate;
+extern const VMStateDescription vmstate_cpr_vfio_devices;
 
 #endif /* HW_VFIO_VFIO_CPR_H */
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 7fd8065..8fd8bfe 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -9,11 +9,23 @@
 #define MIGRATION_CPR_H
 
 #include "qapi/qapi-types-migration.h"
+#include "qemu/queue.h"
 
 #define MIG_MODE_NONE           -1
 
 #define QEMU_CPR_FILE_MAGIC     0x51435052
 #define QEMU_CPR_FILE_VERSION   0x00000001
+#define CPR_STATE "CprState"
+
+typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
+typedef QLIST_HEAD(CprVFIODeviceList, CprVFIODevice) CprVFIODeviceList;
+
+typedef struct CprState {
+    CprFdList fds;
+    CprVFIODeviceList vfio_devices;
+} CprState;
+
+extern CprState cpr_state;
 
 void cpr_save_fd(const char *name, int id, int fd);
 void cpr_delete_fd(const char *name, int id);
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 60bd7e8..3e78265 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -14,6 +14,8 @@
 #include "system/iommufd.h"
 #include "vfio-iommufd.h"
 
+const VMStateDescription vmstate_cpr_vfio_devices;  /* TBD in a later patch */
+
 static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
 {
     if (!iommufd_change_process_capable(be)) {
diff --git a/migration/cpr.c b/migration/cpr.c
index 4574608..47898ab 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -22,13 +22,7 @@
 /*************************************************************************/
 /* cpr state container for all information to be saved. */
 
-typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
-
-typedef struct CprState {
-    CprFdList fds;
-} CprState;
-
-static CprState cpr_state;
+CprState cpr_state;
 
 /****************************************************************************/
 
@@ -129,8 +123,6 @@ int cpr_open_fd(const char *path, int flags, const char *name, int id,
 }
 
 /*************************************************************************/
-#define CPR_STATE "CprState"
-
 static const VMStateDescription vmstate_cpr_state = {
     .name = CPR_STATE,
     .version_id = 1,
@@ -138,6 +130,10 @@ static const VMStateDescription vmstate_cpr_state = {
     .fields = (VMStateField[]) {
         VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, next),
         VMSTATE_END_OF_LIST()
+    },
+    .subsections = (const VMStateDescription * const []) {
+        &vmstate_cpr_vfio_devices,
+        NULL
     }
 };
 /*************************************************************************/
-- 
1.8.3.1




* [PATCH V4 37/43] vfio/iommufd: cpr state
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (35 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 36/43] migration: vfio cpr state hook Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 38/43] vfio/iommufd: preserve descriptors Steve Sistare
                   ` (7 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

VFIO iommufd devices will need access to ioas_id, devid, and hwpt_id in
new QEMU at realize time, so add them to CPR state.  Define CprVFIODevice
as the object which holds the state and is serialized to the vmstate file.
Define accessors to copy state between VFIODevice and CprVFIODevice.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/hw/vfio/vfio-cpr.h |  3 ++
 hw/vfio/cpr-iommufd.c      | 96 +++++++++++++++++++++++++++++++++++++++++++++-
 hw/vfio/iommufd.c          |  2 +
 3 files changed, 100 insertions(+), 1 deletion(-)

diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index 619af07..f88e4ba 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -33,6 +33,8 @@ typedef struct VFIOContainerCPR {
 typedef struct VFIODeviceCPR {
     Error *mdev_blocker;
     Error *id_blocker;
+    uint32_t hwpt_id;
+    uint32_t ioas_id;
 } VFIODeviceCPR;
 
 bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
@@ -54,6 +56,7 @@ bool vfio_iommufd_cpr_register_iommufd(struct IOMMUFDBackend *be, Error **errp);
 void vfio_iommufd_cpr_unregister_iommufd(struct IOMMUFDBackend *be);
 void vfio_iommufd_cpr_register_device(struct VFIODevice *vbasedev);
 void vfio_iommufd_cpr_unregister_device(struct VFIODevice *vbasedev);
+void vfio_cpr_load_device(struct VFIODevice *vbasedev);
 
 int vfio_cpr_group_get_device_fd(int d, const char *name);
 
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 3e78265..2eca8a6 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -7,6 +7,7 @@
 #include "qemu/osdep.h"
 #include "qapi/error.h"
 #include "hw/vfio/vfio-cpr.h"
+#include "hw/vfio/vfio-device.h"
 #include "migration/blocker.h"
 #include "migration/cpr.h"
 #include "migration/migration.h"
@@ -14,7 +15,88 @@
 #include "system/iommufd.h"
 #include "vfio-iommufd.h"
 
-const VMStateDescription vmstate_cpr_vfio_devices;  /* TBD in a later patch */
+typedef struct CprVFIODevice {
+    char *name;
+    unsigned int namelen;
+    uint32_t ioas_id;
+    int devid;
+    uint32_t hwpt_id;
+    QLIST_ENTRY(CprVFIODevice) next;
+} CprVFIODevice;
+
+static const VMStateDescription vmstate_cpr_vfio_device = {
+    .name = "cpr vfio device",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT32(namelen, CprVFIODevice),
+        VMSTATE_VBUFFER_ALLOC_UINT32(name, CprVFIODevice, 0, NULL, namelen),
+        VMSTATE_INT32(devid, CprVFIODevice),
+        VMSTATE_UINT32(ioas_id, CprVFIODevice),
+        VMSTATE_UINT32(hwpt_id, CprVFIODevice),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+const VMStateDescription vmstate_cpr_vfio_devices = {
+    .name = CPR_STATE "/vfio devices",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (const VMStateField[]){
+        VMSTATE_QLIST_V(vfio_devices, CprState, 1, vmstate_cpr_vfio_device,
+                        CprVFIODevice, next),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static void vfio_cpr_save_device(VFIODevice *vbasedev)
+{
+    CprVFIODevice *elem = g_new0(CprVFIODevice, 1);
+
+    elem->name = g_strdup(vbasedev->name);
+    elem->namelen = strlen(vbasedev->name) + 1;
+    elem->ioas_id = vbasedev->cpr.ioas_id;
+    elem->devid = vbasedev->devid;
+    elem->hwpt_id = vbasedev->cpr.hwpt_id;
+    QLIST_INSERT_HEAD(&cpr_state.vfio_devices, elem, next);
+}
+
+static CprVFIODevice *find_device(const char *name)
+{
+    CprVFIODeviceList *head = &cpr_state.vfio_devices;
+    CprVFIODevice *elem;
+
+    QLIST_FOREACH(elem, head, next) {
+        if (!strcmp(elem->name, name)) {
+            return elem;
+        }
+    }
+    return NULL;
+}
+
+static void vfio_cpr_delete_device(const char *name)
+{
+    CprVFIODevice *elem = find_device(name);
+
+    if (elem) {
+        QLIST_REMOVE(elem, next);
+        g_free(elem->name);
+        g_free(elem);
+    }
+}
+
+static bool vfio_cpr_find_device(VFIODevice *vbasedev)
+{
+    CprVFIODevice *elem = find_device(vbasedev->name);
+
+    if (elem) {
+        vbasedev->cpr.ioas_id = elem->ioas_id;
+        vbasedev->devid = elem->devid;
+        vbasedev->cpr.hwpt_id = elem->hwpt_id;
+        return true;
+    }
+    return false;
+}
 
 static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
 {
@@ -79,8 +161,20 @@ void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
 
 void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
 {
+    if (!cpr_is_incoming()) {
+        vfio_cpr_save_device(vbasedev);
+    }
 }
 
 void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
 {
+    vfio_cpr_delete_device(vbasedev->name);
+}
+
+void vfio_cpr_load_device(VFIODevice *vbasedev)
+{
+    if (cpr_is_incoming()) {
+        bool ret = vfio_cpr_find_device(vbasedev);
+        g_assert(ret);
+    }
 }
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index c690c2c..1fd383e 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -515,6 +515,8 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
     const VFIOIOMMUClass *iommufd_vioc =
         VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_IOMMUFD));
 
+    vfio_cpr_load_device(vbasedev);
+
     if (vbasedev->fd < 0) {
         devfd = iommufd_cdev_getfd(vbasedev->sysfsdev, errp);
         if (devfd < 0) {
-- 
1.8.3.1




* [PATCH V4 38/43] vfio/iommufd: preserve descriptors
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (36 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 37/43] vfio/iommufd: cpr state Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 39/43] vfio/iommufd: reconstruct device Steve Sistare
                   ` (6 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Save the iommufd and vfio device fds in CPR state when they are created.
After CPR, the fd numbers are found in CPR state and reused.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
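The "find in CPR state, else open and save" idea can be illustrated with a
small standalone sketch.  It assumes a fixed-size table keyed by (name, id);
SavedFd, saved_fd_find, saved_fd_save, saved_fd_delete and open_or_reuse_fd
are hypothetical names and only approximate what the cpr_*_fd helpers do.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAX_SAVED_FDS 16

/* Hypothetical table entry; names longer than 63 bytes are truncated. */
typedef struct SavedFd {
    bool used;
    char name[64];
    int id;
    int fd;
} SavedFd;

static SavedFd saved_fds[MAX_SAVED_FDS];

static SavedFd *saved_fd_lookup(const char *name, int id)
{
    for (int i = 0; i < MAX_SAVED_FDS; i++) {
        SavedFd *e = &saved_fds[i];

        if (e->used && e->id == id && !strcmp(e->name, name)) {
            return e;
        }
    }
    return NULL;
}

/* Return the saved fd for (name, id), or -1 if none was saved. */
static int saved_fd_find(const char *name, int id)
{
    SavedFd *e = saved_fd_lookup(name, id);

    return e ? e->fd : -1;
}

/* Record fd under (name, id); returns false if the table is full. */
static bool saved_fd_save(const char *name, int id, int fd)
{
    for (int i = 0; i < MAX_SAVED_FDS; i++) {
        SavedFd *e = &saved_fds[i];

        if (!e->used) {
            snprintf(e->name, sizeof(e->name), "%s", name);
            e->id = id;
            e->fd = fd;
            e->used = true;
            return true;
        }
    }
    return false;
}

static void saved_fd_delete(const char *name, int id)
{
    SavedFd *e = saved_fd_lookup(name, id);

    if (e) {
        e->used = false;
    }
}

/* Reuse a preserved fd if one exists, otherwise open the path and save it. */
static int open_or_reuse_fd(const char *path, int flags,
                            const char *name, int id)
{
    int fd = saved_fd_find(name, id);

    if (fd < 0) {
        fd = open(path, flags);
        if (fd >= 0 && !saved_fd_save(name, id, fd)) {
            close(fd);
            return -1;
        }
    }
    return fd;
}
```

Calling open_or_reuse_fd() twice with the same name returns the same
descriptor, which is the property new QEMU relies on after the update.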
 backends/iommufd.c    | 25 ++++++++++++++++++++++++-
 hw/vfio/cpr-iommufd.c | 10 ++++++++++
 hw/vfio/device.c      |  9 +--------
 3 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/backends/iommufd.c b/backends/iommufd.c
index 2e9d6cb..98d83aa 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -16,12 +16,18 @@
 #include "qemu/module.h"
 #include "qom/object_interfaces.h"
 #include "qemu/error-report.h"
+#include "migration/cpr.h"
 #include "monitor/monitor.h"
 #include "trace.h"
 #include "hw/vfio/vfio-device.h"
 #include <sys/ioctl.h>
 #include <linux/iommufd.h>
 
+static const char *iommufd_fd_name(IOMMUFDBackend *be)
+{
+    return object_get_canonical_path_component(OBJECT(be));
+}
+
 static void iommufd_backend_init(Object *obj)
 {
     IOMMUFDBackend *be = IOMMUFD_BACKEND(obj);
@@ -64,11 +70,27 @@ static bool iommufd_backend_can_be_deleted(UserCreatable *uc)
     return !be->users;
 }
 
+static void iommufd_backend_complete(UserCreatable *uc, Error **errp)
+{
+    IOMMUFDBackend *be = IOMMUFD_BACKEND(uc);
+    const char *name = iommufd_fd_name(be);
+
+    if (!be->owned) {
+        /* fd came from the command line. Fetch updated value from cpr state. */
+        if (cpr_is_incoming()) {
+            be->fd = cpr_find_fd(name, 0);
+        } else {
+            cpr_save_fd(name, 0, be->fd);
+        }
+    }
+}
+
 static void iommufd_backend_class_init(ObjectClass *oc, const void *data)
 {
     UserCreatableClass *ucc = USER_CREATABLE_CLASS(oc);
 
     ucc->can_be_deleted = iommufd_backend_can_be_deleted;
+    ucc->complete = iommufd_backend_complete;
 
     object_class_property_add_str(oc, "fd", NULL, iommufd_backend_set_fd);
 }
@@ -102,7 +124,7 @@ bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
     int fd;
 
     if (be->owned && !be->users) {
-        fd = qemu_open("/dev/iommu", O_RDWR, errp);
+        fd = cpr_open_fd("/dev/iommu", O_RDWR, iommufd_fd_name(be), 0, errp);
         if (fd < 0) {
             return false;
         }
@@ -134,6 +156,7 @@ void iommufd_backend_disconnect(IOMMUFDBackend *be)
 out:
     if (!be->users) {
         vfio_iommufd_cpr_unregister_iommufd(be);
+        cpr_delete_fd(iommufd_fd_name(be), 0);
     }
     trace_iommufd_backend_disconnect(be->fd, be->users);
 }
diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 2eca8a6..152a661 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -162,17 +162,27 @@ void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
 void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
 {
     if (!cpr_is_incoming()) {
+        /*
+         * Beware fd may have already been saved by vfio_device_set_fd,
+         * so call resave to avoid a duplicate entry.
+         */
+        cpr_resave_fd(vbasedev->name, 0, vbasedev->fd);
         vfio_cpr_save_device(vbasedev);
     }
 }
 
 void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
 {
+    cpr_delete_fd(vbasedev->name, 0);
     vfio_cpr_delete_device(vbasedev->name);
 }
 
 void vfio_cpr_load_device(VFIODevice *vbasedev)
 {
+    if (vbasedev->fd < 0) {
+        vbasedev->fd = cpr_find_fd(vbasedev->name, 0);
+    }
+
     if (cpr_is_incoming()) {
         bool ret = vfio_cpr_find_device(vbasedev);
         g_assert(ret);
diff --git a/hw/vfio/device.c b/hw/vfio/device.c
index 0f29063..3dd3bd9 100644
--- a/hw/vfio/device.c
+++ b/hw/vfio/device.c
@@ -335,14 +335,7 @@ void vfio_device_free_name(VFIODevice *vbasedev)
 
 void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
 {
-    ERRP_GUARD();
-    int fd = monitor_fd_param(monitor_cur(), str, errp);
-
-    if (fd < 0) {
-        error_prepend(errp, "Could not parse remote object fd %s:", str);
-        return;
-    }
-    vbasedev->fd = fd;
+    vbasedev->fd = cpr_get_fd_param(vbasedev->dev->id, str, 0, errp);
 }
 
 static VFIODeviceIOOps vfio_device_io_ops_ioctl;
-- 
1.8.3.1




* [PATCH V4 39/43] vfio/iommufd: reconstruct device
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (37 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 38/43] vfio/iommufd: preserve descriptors Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 40/43] vfio/iommufd: reconstruct hwpt Steve Sistare
                   ` (5 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Reconstruct userland device state after CPR.  During vfio_realize, skip all
ioctls that configure the device, as it was already configured in old QEMU.

Skip bind, and use the devid from CPR state.

Skip allocation of, and attachment to, ioas_id.  Recover ioas_id from CPR
state, and use it to find a matching container, if any, before creating a
new one.

This reconstruction is not complete.  hwpt_id is handled in a subsequent
patch.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
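The container selection described above can be sketched in isolation.  This
is a simplified model, not the patch itself: Container, its attach_ok field
(a stand-in for the attach ioctl succeeding) and pick_container are
hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an existing container. */
typedef struct Container {
    uint32_t ioas_id;
    bool attach_ok;     /* stand-in for a successful attach ioctl */
} Container;

/*
 * Return the index of the container to reuse, or -1 if the caller must
 * create a new one.  During an incoming update no attach is attempted;
 * the container is matched purely by the ioas_id recovered from CPR state.
 */
static int pick_container(const Container *list, size_t n,
                          bool incoming, uint32_t saved_ioas_id)
{
    for (size_t i = 0; i < n; i++) {
        if (incoming) {
            if (list[i].ioas_id == saved_ioas_id) {
                return (int)i;      /* already attached in old QEMU */
            }
            continue;               /* never attach during incoming CPR */
        }
        if (list[i].attach_ok) {
            return (int)i;          /* normal path: first container that attaches */
        }
    }
    return -1;
}
```

When the function returns -1 during an incoming update, the new container
would be created around the saved ioas_id rather than a freshly allocated one.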
 hw/vfio/iommufd.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 1fd383e..5119c17 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -25,6 +25,7 @@
 #include "system/reset.h"
 #include "qemu/cutils.h"
 #include "qemu/chardev_open.h"
+#include "migration/cpr.h"
 #include "pci.h"
 #include "vfio-iommufd.h"
 #include "vfio-helpers.h"
@@ -121,6 +122,10 @@ static bool iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
         goto err_kvm_device_add;
     }
 
+    if (cpr_is_incoming()) {
+        goto skip_bind;
+    }
+
     /* Bind device to iommufd */
     bind.iommufd = iommufd->fd;
     if (ioctl(vbasedev->fd, VFIO_DEVICE_BIND_IOMMUFD, &bind)) {
@@ -132,6 +137,8 @@ static bool iommufd_cdev_connect_and_bind(VFIODevice *vbasedev, Error **errp)
     vbasedev->devid = bind.out_devid;
     trace_iommufd_cdev_connect_and_bind(bind.iommufd, vbasedev->name,
                                         vbasedev->fd, vbasedev->devid);
+
+skip_bind:
     return true;
 err_bind:
     iommufd_cdev_kvm_device_del(vbasedev);
@@ -421,7 +428,9 @@ static bool iommufd_cdev_attach_container(VFIODevice *vbasedev,
         return iommufd_cdev_autodomains_get(vbasedev, container, errp);
     }
 
-    return !iommufd_cdev_attach_ioas_hwpt(vbasedev, container->ioas_id, errp);
+    /* If CPR, we are already attached to ioas_id. */
+    return cpr_is_incoming() ||
+           !iommufd_cdev_attach_ioas_hwpt(vbasedev, container->ioas_id, errp);
 }
 
 static void iommufd_cdev_detach_container(VFIODevice *vbasedev,
@@ -510,6 +519,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
     VFIOAddressSpace *space;
     struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
     int ret, devfd;
+    bool res;
     uint32_t ioas_id;
     Error *err = NULL;
     const VFIOIOMMUClass *iommufd_vioc =
@@ -540,7 +550,16 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
             vbasedev->iommufd != container->be) {
             continue;
         }
-        if (!iommufd_cdev_attach_container(vbasedev, container, &err)) {
+
+        if (!cpr_is_incoming()) {
+            res = iommufd_cdev_attach_container(vbasedev, container, &err);
+        } else if (vbasedev->cpr.ioas_id == container->ioas_id) {
+            res = true;
+        } else {
+            continue;
+        }
+
+        if (!res) {
             const char *msg = error_get_pretty(err);
 
             trace_iommufd_cdev_fail_attach_existing_container(msg);
@@ -557,6 +576,11 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
         }
     }
 
+    if (cpr_is_incoming()) {
+        ioas_id = vbasedev->cpr.ioas_id;
+        goto skip_ioas_alloc;
+    }
+
     /* Need to allocate a new dedicated container */
     if (!iommufd_backend_alloc_ioas(vbasedev->iommufd, &ioas_id, errp)) {
         goto err_alloc_ioas;
@@ -564,10 +588,12 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
 
     trace_iommufd_cdev_alloc_ioas(vbasedev->iommufd->fd, ioas_id);
 
+skip_ioas_alloc:
     container = VFIO_IOMMU_IOMMUFD(object_new(TYPE_VFIO_IOMMU_IOMMUFD));
     container->be = vbasedev->iommufd;
     container->ioas_id = ioas_id;
     QLIST_INIT(&container->hwpt_list);
+    vbasedev->cpr.ioas_id = ioas_id;
 
     bcontainer = &container->bcontainer;
     vfio_address_space_insert(space, bcontainer);
-- 
1.8.3.1




* [PATCH V4 40/43] vfio/iommufd: reconstruct hwpt
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (38 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 39/43] vfio/iommufd: reconstruct device Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 41/43] vfio/iommufd: change process Steve Sistare
                   ` (4 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Skip allocation of, and attachment to, hwpt_id.  Recover it from CPR state.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/iommufd.c | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 5119c17..f0abd41 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -332,7 +332,14 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
 
     /* Try to find a domain */
     QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
-        ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
+        if (!cpr_is_incoming()) {
+            ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
+        } else if (vbasedev->cpr.hwpt_id == hwpt->hwpt_id) {
+            ret = 0;
+        } else {
+            continue;
+        }
+
         if (ret) {
             /* -EINVAL means the domain is incompatible with the device. */
             if (ret == -EINVAL) {
@@ -349,6 +356,7 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
             return false;
         } else {
             vbasedev->hwpt = hwpt;
+            vbasedev->cpr.hwpt_id = hwpt->hwpt_id;
             QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
             vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
             return true;
@@ -371,6 +379,11 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
         flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
     }
 
+    if (cpr_is_incoming()) {
+        hwpt_id = vbasedev->cpr.hwpt_id;
+        goto skip_alloc;
+    }
+
     if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
                                     container->ioas_id, flags,
                                     IOMMU_HWPT_DATA_NONE, 0, NULL,
@@ -378,19 +391,20 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
         return false;
     }
 
+    ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt_id, errp);
+    if (ret) {
+        iommufd_backend_free_id(container->be, hwpt_id);
+        return false;
+    }
+
+skip_alloc:
     hwpt = g_malloc0(sizeof(*hwpt));
     hwpt->hwpt_id = hwpt_id;
     hwpt->hwpt_flags = flags;
     QLIST_INIT(&hwpt->device_list);
 
-    ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
-    if (ret) {
-        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
-        g_free(hwpt);
-        return false;
-    }
-
     vbasedev->hwpt = hwpt;
+    vbasedev->cpr.hwpt_id = hwpt->hwpt_id;
     vbasedev->iommu_dirty_tracking = iommufd_hwpt_dirty_tracking(hwpt);
     QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
     QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
-- 
1.8.3.1




* [PATCH V4 41/43] vfio/iommufd: change process
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (39 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 40/43] vfio/iommufd: reconstruct hwpt Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 42/43] iommufd: preserve DMA mappings Steve Sistare
                   ` (3 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

Finish CPR by changing the owning process of the iommufd device in
post load.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/cpr-iommufd.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
index 152a661..a9e3f68 100644
--- a/hw/vfio/cpr-iommufd.c
+++ b/hw/vfio/cpr-iommufd.c
@@ -110,10 +110,40 @@ static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
     return true;
 }
 
+static int iommufd_cpr_pre_save(void *opaque)
+{
+    IOMMUFDBackend *be = opaque;
+    Error *local_err = NULL;
+
+    /*
+     * The process has not changed yet, but proactively call the ioctl,
+     * and it will fail if any DMA mappings are not supported.
+     */
+    if (!iommufd_change_process(be, &local_err)) {
+        error_report_err(local_err);
+        return -1;
+    }
+    return 0;
+}
+
+static int iommufd_cpr_post_load(void *opaque, int version_id)
+{
+    IOMMUFDBackend *be = opaque;
+    Error *local_err = NULL;
+
+    if (!iommufd_change_process(be, &local_err)) {
+        error_report_err(local_err);
+        return -1;
+    }
+    return 0;
+}
+
 static const VMStateDescription iommufd_cpr_vmstate = {
     .name = "iommufd",
     .version_id = 0,
     .minimum_version_id = 0,
+    .pre_save = iommufd_cpr_pre_save,
+    .post_load = iommufd_cpr_post_load,
     .needed = cpr_incoming_needed,
     .fields = (VMStateField[]) {
         VMSTATE_END_OF_LIST()
-- 
1.8.3.1




* [PATCH V4 42/43] iommufd: preserve DMA mappings
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (40 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 41/43] vfio/iommufd: change process Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-05-29 19:24 ` [PATCH V4 43/43] vfio/container: delete old cpr register Steve Sistare
                   ` (2 subsequent siblings)
  44 siblings, 0 replies; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

During cpr-transfer load in new QEMU, the vfio_memory_listener causes
spurious calls to map and unmap DMA regions, as devices are created and
the address space is built.  This memory was already mapped by the
device in old QEMU, so suppress the map and unmap callbacks during incoming
CPR.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
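The effect of the early return can be shown with a tiny standalone model.
The flag incoming_update stands in for cpr_is_incoming(), and the counter
kernel_map_calls stands in for the map ioctl; both names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

static bool incoming_update;        /* stand-in for cpr_is_incoming() */
static unsigned kernel_map_calls;   /* counts simulated map ioctls */

/* Map a DMA region, unless the mapping was inherited from the old process. */
static int backend_map(uint64_t iova, uint64_t size)
{
    (void)iova;
    (void)size;

    if (incoming_update) {
        return 0;                   /* report success without touching the kernel */
    }
    kernel_map_calls++;             /* the real code issues the ioctl here */
    return 0;
}
```

Returning 0 rather than an error lets the memory listener proceed as if the
mapping had been created, which it effectively was, by old QEMU.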
 backends/iommufd.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/backends/iommufd.c b/backends/iommufd.c
index 98d83aa..62d1f71 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -245,6 +245,10 @@ int iommufd_backend_map_file_dma(IOMMUFDBackend *be, uint32_t ioas_id,
         .length = size,
     };
 
+    if (cpr_is_incoming()) {
+        return 0;
+    }
+
     if (!readonly) {
         map.flags |= IOMMU_IOAS_MAP_WRITEABLE;
     }
@@ -274,6 +278,10 @@ int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
         .length = size,
     };
 
+    if (cpr_is_incoming()) {
+        return 0;
+    }
+
     ret = ioctl(fd, IOMMU_IOAS_UNMAP, &unmap);
     /*
      * IOMMUFD takes mapping as some kind of object, unmapping
-- 
1.8.3.1




* [PATCH V4 43/43] vfio/container: delete old cpr register
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (41 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 42/43] iommufd: preserve DMA mappings Steve Sistare
@ 2025-05-29 19:24 ` Steve Sistare
  2025-06-10  6:14   ` Cédric Le Goater
  2025-06-01 17:26 ` [PATCH V4 00/43] Live update: vfio and iommufd Cédric Le Goater
  2025-06-03 12:09 ` Duan, Zhenzhong
  44 siblings, 1 reply; 90+ messages in thread
From: Steve Sistare @ 2025-05-29 19:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cedric Le Goater, Yi Liu, Eric Auger,
	Zhenzhong Duan, Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu,
	Fabiano Rosas, Steve Sistare

vfio_cpr_[un]register_container are no longer used, since they were
subsumed by container type-specific registration.  Delete them.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/hw/vfio/vfio-cpr.h |  4 ----
 hw/vfio/cpr.c              | 13 -------------
 2 files changed, 17 deletions(-)

diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
index f88e4ba..5b6c960 100644
--- a/include/hw/vfio/vfio-cpr.h
+++ b/include/hw/vfio/vfio-cpr.h
@@ -44,10 +44,6 @@ void vfio_legacy_cpr_unregister_container(struct VFIOContainer *container);
 int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
                              Error **errp);
 
-bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
-                                 Error **errp);
-void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
-
 bool vfio_iommufd_cpr_register_container(struct VFIOIOMMUFDContainer *container,
                                          Error **errp);
 void vfio_iommufd_cpr_unregister_container(
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index f5555ca..c97e467 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -29,19 +29,6 @@ int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
     return 0;
 }
 
-bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp)
-{
-    migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
-                                vfio_cpr_reboot_notifier,
-                                MIG_MODE_CPR_REBOOT);
-    return true;
-}
-
-void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
-{
-    migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
-}
-
 #define STRDUP_VECTOR_FD_NAME(vdev, name)   \
     g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
 
-- 
1.8.3.1




* Re: [PATCH V4 09/43] vfio/container: register container for cpr
  2025-05-29 19:24 ` [PATCH V4 09/43] vfio/container: register container for cpr Steve Sistare
@ 2025-06-01 15:21   ` Cédric Le Goater
  2025-06-03 11:57   ` Duan, Zhenzhong
  1 sibling, 0 replies; 90+ messages in thread
From: Cédric Le Goater @ 2025-06-01 15:21 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/29/25 21:24, Steve Sistare wrote:
> Register a legacy container for cpr-transfer, replacing the generic CPR
> register call with a more specific legacy container register call.  Add a
> blocker if the kernel does not support VFIO_UPDATE_VADDR or VFIO_UNMAP_ALL.
> 
> This is mostly boilerplate.  The fields to be saved and restored are added
> in subsequent patches.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
>   include/hw/vfio/vfio-container.h |  2 ++
>   include/hw/vfio/vfio-cpr.h       | 15 +++++++++
>   hw/vfio/container.c              |  6 ++--
>   hw/vfio/cpr-legacy.c             | 69 ++++++++++++++++++++++++++++++++++++++++
>   hw/vfio/cpr.c                    |  5 ++-
>   hw/vfio/meson.build              |  1 +
>   6 files changed, 92 insertions(+), 6 deletions(-)
>   create mode 100644 hw/vfio/cpr-legacy.c
> 
> diff --git a/include/hw/vfio/vfio-container.h b/include/hw/vfio/vfio-container.h
> index afc498d..21e5807 100644
> --- a/include/hw/vfio/vfio-container.h
> +++ b/include/hw/vfio/vfio-container.h
> @@ -10,6 +10,7 @@
>   #define HW_VFIO_CONTAINER_H
>   
>   #include "hw/vfio/vfio-container-base.h"
> +#include "hw/vfio/vfio-cpr.h"
>   
>   typedef struct VFIOContainer VFIOContainer;
>   typedef struct VFIODevice VFIODevice;
> @@ -29,6 +30,7 @@ typedef struct VFIOContainer {
>       int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>       unsigned iommu_type;
>       QLIST_HEAD(, VFIOGroup) group_list;
> +    VFIOContainerCPR cpr;
>   } VFIOContainer;
>   
>   OBJECT_DECLARE_SIMPLE_TYPE(VFIOContainer, VFIO_IOMMU_LEGACY);
> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
> index 750ea5b..d4e0bd5 100644
> --- a/include/hw/vfio/vfio-cpr.h
> +++ b/include/hw/vfio/vfio-cpr.h
> @@ -9,8 +9,23 @@
>   #ifndef HW_VFIO_VFIO_CPR_H
>   #define HW_VFIO_VFIO_CPR_H
>   
> +#include "migration/misc.h"
> +
> +struct VFIOContainer;
>   struct VFIOContainerBase;
>   
> +typedef struct VFIOContainerCPR {
> +    Error *blocker;
> +} VFIOContainerCPR;
> +
> +
> +bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
> +                                        Error **errp);
> +void vfio_legacy_cpr_unregister_container(struct VFIOContainer *container);
> +
> +int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
> +                             Error **errp);
> +
>   bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
>                                    Error **errp);
>   void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index 0f948d0..7d2035c 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -643,7 +643,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>       new_container = true;
>       bcontainer = &container->bcontainer;
>   
> -    if (!vfio_cpr_register_container(bcontainer, errp)) {
> +    if (!vfio_legacy_cpr_register_container(container, errp)) {
>           goto fail;
>       }
>   
> @@ -679,7 +679,7 @@ fail:
>           vioc->release(bcontainer);
>       }
>       if (new_container) {
> -        vfio_cpr_unregister_container(bcontainer);
> +        vfio_legacy_cpr_unregister_container(container);
>           object_unref(container);
>       }
>       if (fd >= 0) {
> @@ -720,7 +720,7 @@ static void vfio_container_disconnect(VFIOGroup *group)
>           VFIOAddressSpace *space = bcontainer->space;
>   
>           trace_vfio_container_disconnect(container->fd);
> -        vfio_cpr_unregister_container(bcontainer);
> +        vfio_legacy_cpr_unregister_container(container);
>           close(container->fd);
>           object_unref(container);
>   
> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
> new file mode 100644
> index 0000000..419b9fb
> --- /dev/null
> +++ b/hw/vfio/cpr-legacy.c
> @@ -0,0 +1,69 @@
> +/*
> + * Copyright (c) 2021-2025 Oracle and/or its affiliates.
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#include <sys/ioctl.h>
> +#include <linux/vfio.h>
> +#include "qemu/osdep.h"
> +#include "hw/vfio/vfio-container.h"
> +#include "hw/vfio/vfio-cpr.h"
> +#include "migration/blocker.h"
> +#include "migration/cpr.h"
> +#include "migration/migration.h"
> +#include "migration/vmstate.h"
> +#include "qapi/error.h"
> +
> +static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
> +{
> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR");
> +        return false;
> +
> +    } else if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
> +        error_setg(errp, "VFIO container does not support VFIO_UNMAP_ALL");
> +        return false;
> +
> +    } else {
> +        return true;
> +    }
> +}
> +
> +static const VMStateDescription vfio_container_vmstate = {
> +    .name = "vfio-container",
> +    .version_id = 0,
> +    .minimum_version_id = 0,
> +    .needed = cpr_incoming_needed,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
> +{
> +    VFIOContainerBase *bcontainer = &container->bcontainer;
> +    Error **cpr_blocker = &container->cpr.blocker;
> +
> +    migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
> +                                vfio_cpr_reboot_notifier,
> +                                MIG_MODE_CPR_REBOOT);
> +
> +    if (!vfio_cpr_supported(container, cpr_blocker)) {
> +        return migrate_add_blocker_modes(cpr_blocker, errp,
> +                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
> +    }
> +
> +    vmstate_register(NULL, -1, &vfio_container_vmstate, container);
> +
> +    return true;
> +}
> +
> +void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
> +{
> +    VFIOContainerBase *bcontainer = &container->bcontainer;
> +
> +    migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
> +    migrate_del_blocker(&container->cpr.blocker);
> +    vmstate_unregister(NULL, &vfio_container_vmstate, container);
> +}
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> index 0210e76..0e59612 100644
> --- a/hw/vfio/cpr.c
> +++ b/hw/vfio/cpr.c
> @@ -7,13 +7,12 @@
>   
>   #include "qemu/osdep.h"
>   #include "hw/vfio/vfio-device.h"
> -#include "migration/misc.h"
>   #include "hw/vfio/vfio-cpr.h"
>   #include "qapi/error.h"
>   #include "system/runstate.h"
>   
> -static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
> -                                    MigrationEvent *e, Error **errp)
> +int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
> +                             MigrationEvent *e, Error **errp)
>   {
>       if (e->type == MIG_EVENT_PRECOPY_SETUP &&
>           !runstate_check(RUN_STATE_SUSPENDED) && !vm_get_suspended()) {
> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index bccb050..73d29f9 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -21,6 +21,7 @@ system_ss.add(when: 'CONFIG_VFIO_XGMAC', if_true: files('calxeda-xgmac.c'))
>   system_ss.add(when: 'CONFIG_VFIO_AMD_XGBE', if_true: files('amd-xgbe.c'))
>   system_ss.add(when: 'CONFIG_VFIO', if_true: files(
>     'cpr.c',
> +  'cpr-legacy.c',
>     'device.c',
>     'migration.c',
>     'migration-multifd.c',



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH V4 18/43] vfio/pci: vfio_pci_vector_init
  2025-05-29 19:24 ` [PATCH V4 18/43] vfio/pci: vfio_pci_vector_init Steve Sistare
@ 2025-06-01 15:25   ` Cédric Le Goater
  0 siblings, 0 replies; 90+ messages in thread
From: Cédric Le Goater @ 2025-06-01 15:25 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/29/25 21:24, Steve Sistare wrote:
> Extract a subroutine vfio_pci_vector_init.  No functional change.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
>   hw/vfio/pci.c | 24 +++++++++++++++++-------
>   1 file changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 840590c..2d6dc54 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -512,6 +512,22 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
>       kvm_irqchip_commit_routes(kvm_state);
>   }
>   
> +static void vfio_pci_vector_init(VFIOPCIDevice *vdev, int nr)
> +{
> +    VFIOMSIVector *vector = &vdev->msi_vectors[nr];
> +    PCIDevice *pdev = &vdev->pdev;
> +
> +    vector->vdev = vdev;
> +    vector->virq = -1;
> +    if (event_notifier_init(&vector->interrupt, 0)) {
> +        error_report("vfio: Error: event_notifier_init failed");
> +    }
> +    vector->use = true;
> +    if (vdev->interrupt == VFIO_INT_MSIX) {
> +        msix_vector_use(pdev, nr);
> +    }
> +}
> +
>   static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>                                      MSIMessage *msg, IOHandler *handler)
>   {
> @@ -525,13 +541,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>       vector = &vdev->msi_vectors[nr];
>   
>       if (!vector->use) {
> -        vector->vdev = vdev;
> -        vector->virq = -1;
> -        if (event_notifier_init(&vector->interrupt, 0)) {
> -            error_report("vfio: Error: event_notifier_init failed");
> -        }
> -        vector->use = true;
> -        msix_vector_use(pdev, nr);
> +        vfio_pci_vector_init(vdev, nr);
>       }
>   
>       qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),




* Re: [PATCH V4 23/43] vfio/pci: export MSI functions
  2025-05-29 19:24 ` [PATCH V4 23/43] vfio/pci: export MSI functions Steve Sistare
@ 2025-06-01 15:27   ` Cédric Le Goater
  0 siblings, 0 replies; 90+ messages in thread
From: Cédric Le Goater @ 2025-06-01 15:27 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/29/25 21:24, Steve Sistare wrote:
> Export various MSI functions, renamed with a vfio_pci prefix, for use by
> CPR in subsequent patches.  No functional change.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
>   hw/vfio/pci.h |  8 ++++++++
>   hw/vfio/pci.c | 29 +++++++++++++++++------------
>   2 files changed, 25 insertions(+), 12 deletions(-)
> 
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 5ce0fb9..6e4840d 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -210,6 +210,14 @@ static inline bool vfio_is_vga(VFIOPCIDevice *vdev)
>       return class == PCI_CLASS_DISPLAY_VGA;
>   }
>   
> +/* MSI/MSI-X/INTx */
> +void vfio_pci_vector_init(VFIOPCIDevice *vdev, int nr);
> +void vfio_pci_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
> +                               int vector_n, bool msix);
> +void vfio_pci_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
> +void vfio_pci_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev);
> +bool vfio_pci_intx_enable(VFIOPCIDevice *vdev, Error **errp);
> +
>   uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
>   void vfio_pci_write_config(PCIDevice *pdev,
>                              uint32_t addr, uint32_t val, int len);
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 6aa37fe..13d7c84 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -351,6 +351,11 @@ static void vfio_intx_disable(VFIOPCIDevice *vdev)
>       trace_vfio_intx_disable(vdev->vbasedev.name);
>   }
>   
> +bool vfio_pci_intx_enable(VFIOPCIDevice *vdev, Error **errp)
> +{
> +    return vfio_intx_enable(vdev, errp);
> +}
> +
>   /*
>    * MSI/X
>    */
> @@ -475,8 +480,8 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
>       return ret;
>   }
>   
> -static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
> -                                  int vector_n, bool msix)
> +void vfio_pci_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
> +                               int vector_n, bool msix)
>   {
>       if ((msix && vdev->no_kvm_msix) || (!msix && vdev->no_kvm_msi)) {
>           return;
> @@ -530,7 +535,7 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
>       kvm_irqchip_commit_routes(kvm_state);
>   }
>   
> -static void vfio_pci_vector_init(VFIOPCIDevice *vdev, int nr)
> +void vfio_pci_vector_init(VFIOPCIDevice *vdev, int nr)
>   {
>       VFIOMSIVector *vector = &vdev->msi_vectors[nr];
>       PCIDevice *pdev = &vdev->pdev;
> @@ -580,10 +585,10 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>       } else {
>           if (msg) {
>               if (vdev->defer_kvm_irq_routing) {
> -                vfio_add_kvm_msi_virq(vdev, vector, nr, true);
> +                vfio_pci_add_kvm_msi_virq(vdev, vector, nr, true);
>               } else {
>                   vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
> -                vfio_add_kvm_msi_virq(vdev, vector, nr, true);
> +                vfio_pci_add_kvm_msi_virq(vdev, vector, nr, true);
>                   kvm_irqchip_commit_route_changes(&vfio_route_change);
>                   vfio_connect_kvm_msi_virq(vector, nr);
>               }
> @@ -676,14 +681,14 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
>       }
>   }
>   
> -static void vfio_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
> +void vfio_pci_prepare_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
>   {
>       assert(!vdev->defer_kvm_irq_routing);
>       vdev->defer_kvm_irq_routing = true;
>       vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
>   }
>   
> -static void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
> +void vfio_pci_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
>   {
>       int i;
>   
> @@ -713,14 +718,14 @@ static void vfio_msix_enable(VFIOPCIDevice *vdev)
>        * routes once rather than per vector provides a substantial
>        * performance improvement.
>        */
> -    vfio_prepare_kvm_msi_virq_batch(vdev);
> +    vfio_pci_prepare_kvm_msi_virq_batch(vdev);
>   
>       if (msix_set_vector_notifiers(&vdev->pdev, vfio_msix_vector_use,
>                                     vfio_msix_vector_release, NULL)) {
>           error_report("vfio: msix_set_vector_notifiers failed");
>       }
>   
> -    vfio_commit_kvm_msi_virq_batch(vdev);
> +    vfio_pci_commit_kvm_msi_virq_batch(vdev);
>   
>       if (vdev->nr_vectors) {
>           ret = vfio_enable_vectors(vdev, true);
> @@ -764,7 +769,7 @@ retry:
>        * Deferring to commit the KVM routes once rather than per vector
>        * provides a substantial performance improvement.
>        */
> -    vfio_prepare_kvm_msi_virq_batch(vdev);
> +    vfio_pci_prepare_kvm_msi_virq_batch(vdev);
>   
>       vdev->msi_vectors = g_new0(VFIOMSIVector, vdev->nr_vectors);
>   
> @@ -788,10 +793,10 @@ retry:
>            * Attempt to enable route through KVM irqchip,
>            * default to userspace handling if unavailable.
>            */
> -        vfio_add_kvm_msi_virq(vdev, vector, i, false);
> +        vfio_pci_add_kvm_msi_virq(vdev, vector, i, false);
>       }
>   
> -    vfio_commit_kvm_msi_virq_batch(vdev);
> +    vfio_pci_commit_kvm_msi_virq_batch(vdev);
>   
>       /* Set interrupt type prior to possible interrupts */
>       vdev->interrupt = VFIO_INT_MSI;




* Re: [PATCH V4 16/43] pci: skip reset during cpr
  2025-05-29 19:24 ` [PATCH V4 16/43] pci: skip reset during cpr Steve Sistare
@ 2025-06-01 16:38   ` Cédric Le Goater
  2025-06-01 19:07     ` Michael S. Tsirkin
  0 siblings, 1 reply; 90+ messages in thread
From: Cédric Le Goater @ 2025-06-01 16:38 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/29/25 21:24, Steve Sistare wrote:
> Do not reset a vfio-pci device during CPR.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   include/hw/pci/pci_device.h | 3 +++
>   hw/pci/pci.c                | 5 +++++
>   hw/vfio/pci.c               | 7 +++++++
>   3 files changed, 15 insertions(+)
> 
> diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
> index e41d95b..b481c5d 100644
> --- a/include/hw/pci/pci_device.h
> +++ b/include/hw/pci/pci_device.h
> @@ -181,6 +181,9 @@ struct PCIDevice {
>       uint32_t max_bounce_buffer_size;
>   
>       char *sriov_pf;
> +
> +    /* CPR */
> +    bool skip_reset_on_cpr;
>   };
>   
>   static inline int pci_intx(PCIDevice *pci_dev)
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index f5ab510..21eb11c 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -32,6 +32,7 @@
>   #include "hw/pci/pci_host.h"
>   #include "hw/qdev-properties.h"
>   #include "hw/qdev-properties-system.h"
> +#include "migration/cpr.h"
>   #include "migration/qemu-file-types.h"
>   #include "migration/vmstate.h"
>   #include "net/net.h"
> @@ -531,6 +532,10 @@ static void pci_reset_regions(PCIDevice *dev)
>   
>   static void pci_do_device_reset(PCIDevice *dev)
>   {
> +    if (dev->skip_reset_on_cpr && cpr_is_incoming()) {
> +        return;
> +    }

Since ->skip_reset_on_cpr is true only for vfio-pci devices, the flag could
be replaced by a type check: object_dynamic_cast(OBJECT(dev), "vfio-pci")

Thanks,

C.


> +
>       pci_device_deassert_intx(dev);
>       assert(dev->irq_state == 0);
>   
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 7d3b9ff..56e7fdd 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3402,6 +3402,13 @@ static void vfio_instance_init(Object *obj)
>       /* QEMU_PCI_CAP_EXPRESS initialization does not depend on QEMU command
>        * line, therefore, no need to wait to realize like other devices */
>       pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
> +
> +    /*
> +     * A device that is resuming for cpr is already configured, so do not
> +     * reset it during qemu_system_reset prior to cpr load, else interrupts
> +     * may be lost.
> +     */
> +    pci_dev->skip_reset_on_cpr = true;
>   }
>   
>   static void vfio_pci_base_dev_class_init(ObjectClass *klass, const void *data)




* Re: [PATCH V4 17/43] vfio-pci: skip reset during cpr
  2025-05-29 19:24 ` [PATCH V4 17/43] vfio-pci: " Steve Sistare
@ 2025-06-01 16:39   ` Cédric Le Goater
  0 siblings, 0 replies; 90+ messages in thread
From: Cédric Le Goater @ 2025-06-01 16:39 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/29/25 21:24, Steve Sistare wrote:
> Do not reset a vfio-pci device during CPR, and do not complain if the
> kernel's PCI config space changes for non-emulated bits between the
> vmstate save and load, which can happen due to ongoing interrupt activity.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
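
To illustrate why narrowing cmask silences the complaint, here is a minimal
standalone sketch; it is not QEMU code.  config_matches() is a simplified,
hypothetical stand-in for the changed-bits check (as far as I understand,
the real check in get_pci_config_device also excludes writable bits), and
restrict_cmask_to_emulated() mirrors the loop in vfio_cpr_pci_pre_load.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified stand-in for the changed-bits check: report a mismatch when
 * the saved and current config bytes differ in any bit selected by cmask.
 */
static bool config_matches(const uint8_t *saved, const uint8_t *current,
                           const uint8_t *cmask, size_t size)
{
    assert(saved && current && cmask);

    for (size_t i = 0; i < size; i++) {
        if ((saved[i] ^ current[i]) & cmask[i]) {
            return false;
        }
    }
    return true;
}

/* Keep only the emulated bits in the check mask, as the pre_load loop does. */
static void restrict_cmask_to_emulated(uint8_t *cmask,
                                       const uint8_t *emulated_bits,
                                       size_t size)
{
    assert(cmask && emulated_bits);

    for (size_t i = 0; i < size; i++) {
        cmask[i] &= emulated_bits[i];
    }
}
```

With a full mask, a byte that changed from 0x10 to 0x14 fails the check.
Once the mask is reduced to emulated bits 0xF0, the same change passes,
while a change in an emulated bit would still be reported.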


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
>   include/hw/vfio/vfio-cpr.h |  2 ++
>   hw/vfio/cpr.c              | 31 +++++++++++++++++++++++++++++++
>   hw/vfio/pci.c              |  7 +++++++
>   3 files changed, 40 insertions(+)
> 
> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
> index 56ede04..8bf85b9 100644
> --- a/include/hw/vfio/vfio-cpr.h
> +++ b/include/hw/vfio/vfio-cpr.h
> @@ -52,4 +52,6 @@ void vfio_cpr_giommu_remap(struct VFIOContainerBase *bcontainer,
>   bool vfio_cpr_ram_discard_register_listener(
>       struct VFIOContainerBase *bcontainer, MemoryRegionSection *section);
>   
> +extern const VMStateDescription vfio_cpr_pci_vmstate;
> +
>   #endif /* HW_VFIO_VFIO_CPR_H */
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> index 0e59612..fdbb58e 100644
> --- a/hw/vfio/cpr.c
> +++ b/hw/vfio/cpr.c
> @@ -8,6 +8,8 @@
>   #include "qemu/osdep.h"
>   #include "hw/vfio/vfio-device.h"
>   #include "hw/vfio/vfio-cpr.h"
> +#include "hw/vfio/pci.h"
> +#include "migration/cpr.h"
>   #include "qapi/error.h"
>   #include "system/runstate.h"
>   
> @@ -37,3 +39,32 @@ void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
>   {
>       migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
>   }
> +
> +/*
> + * The kernel may change non-emulated config bits.  Exclude them from the
> + * changed-bits check in get_pci_config_device.
> + */
> +static int vfio_cpr_pci_pre_load(void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    PCIDevice *pdev = &vdev->pdev;
> +    int size = MIN(pci_config_size(pdev), vdev->config_size);
> +    int i;
> +
> +    for (i = 0; i < size; i++) {
> +        pdev->cmask[i] &= vdev->emulated_config_bits[i];
> +    }
> +
> +    return 0;
> +}
> +
> +const VMStateDescription vfio_cpr_pci_vmstate = {
> +    .name = "vfio-cpr-pci",
> +    .version_id = 0,
> +    .minimum_version_id = 0,
> +    .pre_load = vfio_cpr_pci_pre_load,
> +    .needed = cpr_incoming_needed,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 56e7fdd..840590c 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -30,6 +30,7 @@
>   #include "hw/qdev-properties.h"
>   #include "hw/qdev-properties-system.h"
>   #include "migration/vmstate.h"
> +#include "migration/cpr.h"
>   #include "qobject/qdict.h"
>   #include "qemu/error-report.h"
>   #include "qemu/main-loop.h"
> @@ -3345,6 +3346,11 @@ static void vfio_pci_reset(DeviceState *dev)
>   {
>       VFIOPCIDevice *vdev = VFIO_PCI_BASE(dev);
>   
> +    /* Do not reset the device during qemu_system_reset prior to cpr load */
> +    if (cpr_is_incoming()) {
> +        return;
> +    }
> +
>       trace_vfio_pci_reset(vdev->vbasedev.name);
>   
>       vfio_pci_pre_reset(vdev);
> @@ -3521,6 +3527,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, const void *data)
>   #ifdef CONFIG_IOMMUFD
>       object_class_property_add_str(klass, "fd", NULL, vfio_pci_set_fd);
>   #endif
> +    dc->vmsd = &vfio_cpr_pci_vmstate;
>       dc->desc = "VFIO-based PCI device assignment";
>       pdc->realize = vfio_realize;
>   




* Re: [PATCH V4 12/43] vfio/container: restore DMA vaddr
  2025-05-29 19:24 ` [PATCH V4 12/43] vfio/container: restore " Steve Sistare
@ 2025-06-01 16:48   ` Cédric Le Goater
  0 siblings, 0 replies; 90+ messages in thread
From: Cédric Le Goater @ 2025-06-01 16:48 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/29/25 21:24, Steve Sistare wrote:
> In new QEMU, do not register the memory listener at device creation time.
> Register it later, in the container post_load handler, after all vmstate
> that may affect regions and mapping boundaries has been loaded.  The
> post_load registration will cause the listener to invoke its callback on
> each flat section, and the calls will match the mappings remembered by the
> kernel.
> 
> The listener calls a special dma_map handler that passes the new VA of each
> section to the kernel using VFIO_DMA_MAP_FLAG_VADDR.  Restore the normal
> handler at the end.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
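
The divert-then-restore pattern described above can be sketched in
isolation as follows.  This is a simplified illustration, not the patch:
MockContainer, the three-argument handler type, and the call counters are
hypothetical, and the pointer is kept per container here for brevity,
whereas the patch swaps it in the VFIOIOMMUClass.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handler type; the real dma_map takes more arguments. */
typedef int (*dma_map_fn)(uint64_t iova, uint64_t size, void *vaddr);

typedef struct {
    dma_map_fn dma_map;        /* handler currently installed */
    dma_map_fn saved_dma_map;  /* original handler while diverted */
} MockContainer;

static int normal_calls;
static int cpr_calls;

/* Stand-in for the normal handler that creates a new mapping. */
static int normal_dma_map(uint64_t iova, uint64_t size, void *vaddr)
{
    (void)iova; (void)size; (void)vaddr;
    normal_calls++;
    return 0;
}

/* Stand-in for the CPR handler that only supplies a new vaddr. */
static int cpr_dma_map(uint64_t iova, uint64_t size, void *vaddr)
{
    (void)iova; (void)size; (void)vaddr;
    cpr_calls++;
    return 0;
}

/* At registration time: remember the normal handler, install the CPR one. */
static void divert_dma_map(MockContainer *c)
{
    assert(c->saved_dma_map == NULL);
    c->saved_dma_map = c->dma_map;
    c->dma_map = cpr_dma_map;
}

/* After the listener has replayed all sections: put the original back. */
static void restore_dma_map(MockContainer *c)
{
    assert(c->saved_dma_map != NULL);
    c->dma_map = c->saved_dma_map;
    c->saved_dma_map = NULL;
}
```

Calls made through c->dma_map between divert_dma_map() and
restore_dma_map() reach the CPR handler; calls made afterwards reach the
normal handler again.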


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
>   include/hw/vfio/vfio-cpr.h |  3 +++
>   hw/vfio/container.c        | 15 ++++++++++--
>   hw/vfio/cpr-legacy.c       | 57 ++++++++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 73 insertions(+), 2 deletions(-)
> 
> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
> index 5a2e5f6..0462447 100644
> --- a/include/hw/vfio/vfio-cpr.h
> +++ b/include/hw/vfio/vfio-cpr.h
> @@ -17,6 +17,9 @@ struct VFIOGroup;
>   
>   typedef struct VFIOContainerCPR {
>       Error *blocker;
> +    int (*saved_dma_map)(const struct VFIOContainerBase *bcontainer,
> +                         hwaddr iova, ram_addr_t size,
> +                         void *vaddr, bool readonly, MemoryRegion *mr);
>   } VFIOContainerCPR;
>   
>   
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index 798abda..f91f2d5 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -137,6 +137,8 @@ static int vfio_legacy_dma_unmap_one(const VFIOContainerBase *bcontainer,
>       int ret;
>       Error *local_err = NULL;
>   
> +    g_assert(!cpr_is_incoming());
> +
>       if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer)) {
>           if (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
>               bcontainer->dirty_pages_supported) {
> @@ -691,8 +693,17 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>       }
>       group_was_added = true;
>   
> -    if (!vfio_listener_register(bcontainer, errp)) {
> -        goto fail;
> +    /*
> +     * If CPR, register the listener later, after all state that may
> +     * affect regions and mapping boundaries has been cpr load'ed.  Later,
> +     * the listener will invoke its callback on each flat section and call
> +     * dma_map to supply the new vaddr, and the calls will match the mappings
> +     * remembered by the kernel.
> +     */
> +    if (!cpr_is_incoming()) {
> +        if (!vfio_listener_register(bcontainer, errp)) {
> +            goto fail;
> +        }
>       }
>   
>       bcontainer->initialized = true;
> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
> index cf80332..512ef41 100644
> --- a/hw/vfio/cpr-legacy.c
> +++ b/hw/vfio/cpr-legacy.c
> @@ -10,11 +10,13 @@
>   #include "hw/vfio/vfio-container.h"
>   #include "hw/vfio/vfio-cpr.h"
>   #include "hw/vfio/vfio-device.h"
> +#include "hw/vfio/vfio-listener.h"
>   #include "migration/blocker.h"
>   #include "migration/cpr.h"
>   #include "migration/migration.h"
>   #include "migration/vmstate.h"
>   #include "qapi/error.h"
> +#include "qemu/error-report.h"
>   
>   static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>   {
> @@ -31,6 +33,32 @@ static bool vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>       return true;
>   }
>   
> +/*
> + * Set the new @vaddr for any mappings registered during cpr load.
> + * The incoming state is cleared thereafter.
> + */
> +static int vfio_legacy_cpr_dma_map(const VFIOContainerBase *bcontainer,
> +                                   hwaddr iova, ram_addr_t size, void *vaddr,
> +                                   bool readonly, MemoryRegion *mr)
> +{
> +    const VFIOContainer *container = container_of(bcontainer, VFIOContainer,
> +                                                  bcontainer);
> +    struct vfio_iommu_type1_dma_map map = {
> +        .argsz = sizeof(map),
> +        .flags = VFIO_DMA_MAP_FLAG_VADDR,
> +        .vaddr = (__u64)(uintptr_t)vaddr,
> +        .iova = iova,
> +        .size = size,
> +    };
> +
> +    g_assert(cpr_is_incoming());
> +
> +    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
> +        return -errno;
> +    }
> +
> +    return 0;
> +}
>   
>   static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>   {
> @@ -59,11 +87,34 @@ static int vfio_container_pre_save(void *opaque)
>       return 0;
>   }
>   
> +static int vfio_container_post_load(void *opaque, int version_id)
> +{
> +    VFIOContainer *container = opaque;
> +    VFIOContainerBase *bcontainer = &container->bcontainer;
> +    VFIOGroup *group;
> +    Error *local_err = NULL;
> +
> +    if (!vfio_listener_register(bcontainer, &local_err)) {
> +        error_report_err(local_err);
> +        return -1;
> +    }
> +
> +    QLIST_FOREACH(group, &container->group_list, container_next) {
> +        VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
> +
> +        /* Restore original dma_map function */
> +        vioc->dma_map = container->cpr.saved_dma_map;
> +    }
> +    return 0;
> +}
> +
>   static const VMStateDescription vfio_container_vmstate = {
>       .name = "vfio-container",
>       .version_id = 0,
>       .minimum_version_id = 0,
> +    .priority = MIG_PRI_LOW,  /* Must happen after devices and groups */
>       .pre_save = vfio_container_pre_save,
> +    .post_load = vfio_container_post_load,
>       .needed = cpr_incoming_needed,
>       .fields = (VMStateField[]) {
>           VMSTATE_END_OF_LIST()
> @@ -86,6 +137,12 @@ bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error **errp)
>   
>       vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>   
> +    /* During incoming CPR, divert calls to dma_map. */
> +    if (cpr_is_incoming()) {
> +        VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
> +        container->cpr.saved_dma_map = vioc->dma_map;
> +        vioc->dma_map = vfio_legacy_cpr_dma_map;
> +    }
>       return true;
>   }
>   




* Re: [PATCH V4 10/43] vfio/container: preserve descriptors
  2025-05-29 19:24 ` [PATCH V4 10/43] vfio/container: preserve descriptors Steve Sistare
@ 2025-06-01 16:57   ` Cédric Le Goater
  2025-06-03 11:57   ` Duan, Zhenzhong
  1 sibling, 0 replies; 90+ messages in thread
From: Cédric Le Goater @ 2025-06-01 16:57 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/29/25 21:24, Steve Sistare wrote:
> At vfio creation time, save the value of vfio container, group, and device
> descriptors in CPR state.  On qemu restart, vfio_realize() finds and uses
> the saved descriptors.
> 
> During reuse, device and iommu state is already configured, so operations
> in vfio_realize that would modify the configuration, such as vfio ioctl's,
> are skipped.  The result is that vfio_realize constructs qemu data
> structures that reflect the current state of the device.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
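
The find-or-create behaviour described above can be sketched with a small
table keyed by (name, id).  This is an illustration only: the table_*
helpers, the fixed-size table, and fake_open() are hypothetical stand-ins,
not the cpr_*_fd implementation.  In the real flow the table in new QEMU
would already hold the descriptors passed from old QEMU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SAVED_FDS 16

/* One preserved descriptor, keyed by (name, id). */
typedef struct {
    char name[64];
    int id;
    int fd;
    int used;
} SavedFd;

static SavedFd saved_fds[MAX_SAVED_FDS];

static SavedFd *table_lookup(const char *name, int id)
{
    for (size_t i = 0; i < MAX_SAVED_FDS; i++) {
        SavedFd *s = &saved_fds[i];
        if (s->used && s->id == id && strcmp(s->name, name) == 0) {
            return s;
        }
    }
    return NULL;
}

/* Return the descriptor saved under (name, id), or -1 if there is none. */
static int table_find_fd(const char *name, int id)
{
    SavedFd *s = table_lookup(name, id);
    return s ? s->fd : -1;
}

/* Record fd under (name, id); the sketch asserts instead of growing. */
static void table_save_fd(const char *name, int id, int fd)
{
    assert(strlen(name) < sizeof(saved_fds[0].name));
    for (size_t i = 0; i < MAX_SAVED_FDS; i++) {
        SavedFd *s = &saved_fds[i];
        if (!s->used) {
            strcpy(s->name, name);
            s->id = id;
            s->fd = fd;
            s->used = 1;
            return;
        }
    }
    assert(!"saved fd table is full");
}

/* Forget the entry for (name, id), if any; the caller closes the fd. */
static void table_delete_fd(const char *name, int id)
{
    SavedFd *s = table_lookup(name, id);
    if (s) {
        s->used = 0;
    }
}

/* Stand-in for the open() or ioctl() that creates a new descriptor. */
static int open_calls;
static int fake_open(const char *name)
{
    (void)name;
    open_calls++;
    return 100 + open_calls;
}

/* Reuse a preserved descriptor if present, else create and record one. */
static int table_open_fd(const char *name, int id, int (*opener)(const char *))
{
    int fd = table_find_fd(name, id);

    if (fd < 0) {
        fd = opener(name);
        if (fd >= 0) {
            table_save_fd(name, id, fd);
        }
    }
    return fd;
}
```

A second table_open_fd() call with the same key returns the recorded
descriptor without calling the opener again, which is the property that
lets realize skip the configuring ioctls on restart.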

Thanks for making the changes. They look much better in container.c.

Reviewed-by: Cédric Le Goater <clg@redhat.com>

C.


> ---
>   include/hw/vfio/vfio-cpr.h |  6 +++++
>   hw/vfio/container.c        | 67 +++++++++++++++++++++++++++++++++++-----------
>   hw/vfio/cpr-legacy.c       | 42 +++++++++++++++++++++++++++++
>   3 files changed, 100 insertions(+), 15 deletions(-)
> 
> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
> index d4e0bd5..5a2e5f6 100644
> --- a/include/hw/vfio/vfio-cpr.h
> +++ b/include/hw/vfio/vfio-cpr.h
> @@ -13,6 +13,7 @@
>   
>   struct VFIOContainer;
>   struct VFIOContainerBase;
> +struct VFIOGroup;
>   
>   typedef struct VFIOContainerCPR {
>       Error *blocker;
> @@ -30,4 +31,9 @@ bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
>                                    Error **errp);
>   void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>   
> +int vfio_cpr_group_get_device_fd(int d, const char *name);
> +
> +bool vfio_cpr_container_match(struct VFIOContainer *container,
> +                              struct VFIOGroup *group, int fd);
> +
>   #endif /* HW_VFIO_VFIO_CPR_H */
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index 7d2035c..798abda 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -31,6 +31,8 @@
>   #include "system/reset.h"
>   #include "trace.h"
>   #include "qapi/error.h"
> +#include "migration/cpr.h"
> +#include "migration/blocker.h"
>   #include "pci.h"
>   #include "hw/vfio/vfio-container.h"
>   #include "hw/vfio/vfio-cpr.h"
> @@ -426,7 +428,12 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
>           return NULL;
>       }
>   
> -    if (!vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
> +    /*
> +     * During CPR, just set the container type and skip the ioctls, as the
> +     * container and group are already configured in the kernel.
> +     */
> +    if (!cpr_is_incoming() &&
> +        !vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
>           return NULL;
>       }
>   
> @@ -593,6 +600,11 @@ static bool vfio_container_group_add(VFIOContainer *container, VFIOGroup *group,
>       group->container = container;
>       QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>       vfio_group_add_kvm_device(group);
> +    /*
> +     * Remember the container fd for each group, so we can attach to the same
> +     * container after CPR.
> +     */
> +    cpr_resave_fd("vfio_container_for_group", group->groupid, container->fd);
>       return true;
>   }
>   
> @@ -602,6 +614,7 @@ static void vfio_container_group_del(VFIOContainer *container, VFIOGroup *group)
>       group->container = NULL;
>       vfio_group_del_kvm_device(group);
>       vfio_ram_block_discard_disable(container, false);
> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>   }
>   
>   static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
> @@ -616,17 +629,34 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>       bool group_was_added = false;
>   
>       space = vfio_address_space_get(as);
> +    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
>   
> -    QLIST_FOREACH(bcontainer, &space->containers, next) {
> -        container = container_of(bcontainer, VFIOContainer, bcontainer);
> -        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> -            return vfio_container_group_add(container, group, errp);
> +    if (!cpr_is_incoming()) {
> +        QLIST_FOREACH(bcontainer, &space->containers, next) {
> +            container = container_of(bcontainer, VFIOContainer, bcontainer);
> +            if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> +                return vfio_container_group_add(container, group, errp);
> +            }
>           }
> -    }
>   
> -    fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
> -    if (fd < 0) {
> -        goto fail;
> +        fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
> +        if (fd < 0) {
> +            goto fail;
> +        }
> +    } else {
> +        /*
> +         * For incoming CPR, the group is already attached in the kernel.
> +         * If a container with matching fd is found, then update the
> +         * userland group list and return.  If not, then after the loop,
> +         * create the container struct and group list.
> +         */
> +        QLIST_FOREACH(bcontainer, &space->containers, next) {
> +            container = container_of(bcontainer, VFIOContainer, bcontainer);
> +
> +            if (vfio_cpr_container_match(container, group, fd)) {
> +                return vfio_container_group_add(container, group, errp);
> +            }
> +        }
>       }
>   
>       ret = ioctl(fd, VFIO_GET_API_VERSION);
> @@ -698,6 +728,7 @@ static void vfio_container_disconnect(VFIOGroup *group)
>   
>       QLIST_REMOVE(group, container_next);
>       group->container = NULL;
> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>   
>       /*
>        * Explicitly release the listener first before unset container,
> @@ -751,7 +782,7 @@ static VFIOGroup *vfio_group_get(int groupid, AddressSpace *as, Error **errp)
>       group = g_malloc0(sizeof(*group));
>   
>       snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
> -    group->fd = qemu_open(path, O_RDWR, errp);
> +    group->fd = cpr_open_fd(path, O_RDWR, "vfio_group", groupid, errp);
>       if (group->fd < 0) {
>           goto free_group_exit;
>       }
> @@ -783,6 +814,7 @@ static VFIOGroup *vfio_group_get(int groupid, AddressSpace *as, Error **errp)
>       return group;
>   
>   close_fd_exit:
> +    cpr_delete_fd("vfio_group", groupid);
>       close(group->fd);
>   
>   free_group_exit:
> @@ -804,6 +836,7 @@ static void vfio_group_put(VFIOGroup *group)
>       vfio_container_disconnect(group);
>       QLIST_REMOVE(group, next);
>       trace_vfio_group_put(group->fd);
> +    cpr_delete_fd("vfio_group", group->groupid);
>       close(group->fd);
>       g_free(group);
>   }
> @@ -814,7 +847,7 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
>       g_autofree struct vfio_device_info *info = NULL;
>       int fd;
>   
> -    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
> +    fd = vfio_cpr_group_get_device_fd(group->fd, name);
>       if (fd < 0) {
>           error_setg_errno(errp, errno, "error getting device from group %d",
>                            group->groupid);
> @@ -827,8 +860,7 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
>       info = vfio_get_device_info(fd);
>       if (!info) {
>           error_setg_errno(errp, errno, "error getting device info");
> -        close(fd);
> -        return false;
> +        goto fail;
>       }
>   
>       /*
> @@ -842,8 +874,7 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
>           if (!QLIST_EMPTY(&group->device_list)) {
>               error_setg(errp, "Inconsistent setting of support for discarding "
>                          "RAM (e.g., balloon) within group");
> -            close(fd);
> -            return false;
> +            goto fail;
>           }
>   
>           if (!group->ram_block_discard_allowed) {
> @@ -861,6 +892,11 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
>       trace_vfio_device_get(name, info->flags, info->num_regions, info->num_irqs);
>   
>       return true;
> +
> +fail:
> +    close(fd);
> +    cpr_delete_fd(name, 0);
> +    return false;
>   }
>   
>   static void vfio_device_put(VFIODevice *vbasedev)
> @@ -871,6 +907,7 @@ static void vfio_device_put(VFIODevice *vbasedev)
>       QLIST_REMOVE(vbasedev, next);
>       vbasedev->group = NULL;
>       trace_vfio_device_put(vbasedev->fd);
> +    cpr_delete_fd(vbasedev->name, 0);
>       close(vbasedev->fd);
>   }
>   
> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
> index 419b9fb..29be64f 100644
> --- a/hw/vfio/cpr-legacy.c
> +++ b/hw/vfio/cpr-legacy.c
> @@ -9,6 +9,7 @@
>   #include "qemu/osdep.h"
>   #include "hw/vfio/vfio-container.h"
>   #include "hw/vfio/vfio-cpr.h"
> +#include "hw/vfio/vfio-device.h"
>   #include "migration/blocker.h"
>   #include "migration/cpr.h"
>   #include "migration/migration.h"
> @@ -67,3 +68,44 @@ void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
>       migrate_del_blocker(&container->cpr.blocker);
>       vmstate_unregister(NULL, &vfio_container_vmstate, container);
>   }
> +
> +int vfio_cpr_group_get_device_fd(int d, const char *name)
> +{
> +    const int id = 0;
> +    int fd = cpr_find_fd(name, id);
> +
> +    if (fd < 0) {
> +        fd = ioctl(d, VFIO_GROUP_GET_DEVICE_FD, name);
> +        if (fd >= 0) {
> +            cpr_save_fd(name, id, fd);
> +        }
> +    }
> +    return fd;
> +}
> +
> +static bool same_device(int fd1, int fd2)
> +{
> +    struct stat st1, st2;
> +
> +    return !fstat(fd1, &st1) && !fstat(fd2, &st2) && st1.st_dev == st2.st_dev;
> +}
> +
> +bool vfio_cpr_container_match(VFIOContainer *container, VFIOGroup *group,
> +                              int fd)
> +{
> +    if (container->fd == fd) {
> +        return true;
> +    }
> +    if (!same_device(container->fd, fd)) {
> +        return false;
> +    }
> +    /*
> +     * Same device, different fd.  This occurs when the container fd is
> +     * cpr_save'd multiple times, once for each groupid, so SCM_RIGHTS
> +     * produces duplicates.  De-dup it.
> +     */
> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
> +    close(fd);
> +    cpr_save_fd("vfio_container_for_group", group->groupid, container->fd);
> +    return true;
> +}
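
The find-or-create pattern in vfio_cpr_group_get_device_fd above can be sketched in isolation. This is only a minimal model, assuming a fixed-size table in place of the real CPR state list. Every demo_* name is hypothetical, and demo_open_device stands in for the VFIO_GROUP_GET_DEVICE_FD ioctl.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the CPR fd list; not QEMU's implementation. */
#define DEMO_MAX_FDS 16

struct demo_saved_fd {
    char name[64];
    int id;
    int fd;
    int used;
};

static struct demo_saved_fd demo_fds[DEMO_MAX_FDS];
static int demo_open_calls;

/* Return the entry saved under (name, id), or NULL if there is none. */
static struct demo_saved_fd *demo_lookup(const char *name, int id)
{
    for (size_t i = 0; i < DEMO_MAX_FDS; i++) {
        struct demo_saved_fd *e = &demo_fds[i];
        if (e->used && e->id == id && strcmp(e->name, name) == 0) {
            return e;
        }
    }
    return NULL;
}

/* Return the saved fd for (name, id), or -1 if it was never saved. */
static int demo_find_fd(const char *name, int id)
{
    struct demo_saved_fd *e = demo_lookup(name, id);
    return e ? e->fd : -1;
}

/* Record fd under (name, id) in the first free slot. */
static void demo_save_fd(const char *name, int id, int fd)
{
    for (size_t i = 0; i < DEMO_MAX_FDS; i++) {
        struct demo_saved_fd *e = &demo_fds[i];
        if (!e->used) {
            strncpy(e->name, name, sizeof(e->name) - 1);
            e->name[sizeof(e->name) - 1] = '\0';
            e->id = id;
            e->fd = fd;
            e->used = 1;
            return;
        }
    }
    assert(0 && "demo fd table full");
}

/* Forget the entry for (name, id), if any. */
static void demo_delete_fd(const char *name, int id)
{
    struct demo_saved_fd *e = demo_lookup(name, id);
    if (e) {
        e->used = 0;
    }
}

/* Stands in for ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, name). */
static int demo_open_device(const char *name)
{
    (void)name;
    return 100 + demo_open_calls++;
}

/* Reuse a preserved descriptor if one exists, else open and record one. */
static int demo_get_device_fd(const char *name)
{
    const int id = 0;
    int fd = demo_find_fd(name, id);

    if (fd < 0) {
        fd = demo_open_device(name);
        if (fd >= 0) {
            demo_save_fd(name, id, fd);
        }
    }
    return fd;
}
```

In this model a second demo_get_device_fd call for the same name returns the recorded descriptor without opening the device again. That is the property new QEMU relies on when it finds the descriptor already present in CPR state.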



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH V4 00/43] Live update: vfio and iommufd
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (42 preceding siblings ...)
  2025-05-29 19:24 ` [PATCH V4 43/43] vfio/container: delete old cpr register Steve Sistare
@ 2025-06-01 17:26 ` Cédric Le Goater
  2025-06-02 12:42   ` Steven Sistare
  2025-06-03 12:09 ` Duan, Zhenzhong
  44 siblings, 1 reply; 90+ messages in thread
From: Cédric Le Goater @ 2025-06-01 17:26 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas,
	John Levon

On 5/29/25 21:23, Steve Sistare wrote:
> Support vfio and iommufd devices with the cpr-transfer live migration mode.
> Devices that do not support live migration can still support cpr-transfer,
> allowing live update to a new version of QEMU on the same host, with no loss
> of guest connectivity.
> 
> No user-visible interfaces are added.
> 
> For legacy containers:
> 
> Pass vfio device descriptors to new QEMU.  In new QEMU, during vfio_realize,
> skip the ioctls that configure the device, because it is already configured.
> 
> Use VFIO_DMA_UNMAP_FLAG_VADDR to abandon the old VA's for DMA mapped
> regions, and use VFIO_DMA_MAP_FLAG_VADDR to register the new VA in new
> QEMU and update the locked memory accounting.  The physical pages remain
> pinned, because the descriptor of the device that locked them remains open,
> so DMA to those pages continues without interruption.  Mediated devices are
> not supported, however, because they require the VA to always be valid, and
> there is a brief window where no VA is registered.
> 
> Save the MSI message area as part of vfio-pci vmstate, and pass the interrupt
> and notifier eventfd's to new QEMU.  New QEMU loads the MSI data, then the
> vfio-pci post_load handler finds the eventfds in CPR state, rebuilds vector
> data structures, and attaches the interrupts to the new KVM instance.  This
> logic also applies to iommufd containers.
> 
> For iommufd containers:
> 
> Use IOMMU_IOAS_MAP_FILE to register memory regions for DMA when they are
> backed by a file (including a memfd), so DMA mappings do not depend on VA,
> which can differ after live update.  This allows mediated devices to be
> supported.
> 
> Pass the iommufd and vfio device descriptors from old to new QEMU.  In new
> QEMU, during vfio_realize, skip the ioctls that configure the device, because
> it is already configured.
> 
> In new QEMU, call ioctl(IOMMU_IOAS_CHANGE_PROCESS) to update mm ownership and
> locked memory accounting.
> 
> Patches 4 to 12 are specific to legacy containers.
> Patches 25 to 41 are specific to iommufd containers.
> The remainder apply to both.

Steve,

I am considering patches 1-23 for vfio-next. Merging a first part quickly
makes sense because we still have time ahead of us, and it lets us catch
issues early. It will also help John to rebase.

I think patch 16 can be simplified. If you agree, and Michael too, please
resend just this patch. I will update the series and send a PR.

Thanks,

C.

> Changes from previous versions:
>    * V1 of this series contains minor changes from the "Live update: vfio" and
>      "Live update: iommufd" series, mainly bug fixes and refactored patches.
> 
> Changes in V2:
>    * refactored various vfio code snippets into new cpr helpers
>    * refactored vfio struct members into cpr-specific structures
>    * refactored various small changes into their own patches
>    * split complex patches.  Notably:
>      - split "refactor for cpr" into 5 patches
>      - split "reconstruct device" into 4 patches
>    * refactored vfio_connect_container using helpers and made its
>      error recovery more robust.
>    * moved vfio pci msi/vector/intx cpr functions to cpr.c
>    * renamed "reused" to cpr_reused and cpr.reused
>    * squashed vfio_cpr_[un]register_container to their call sites
>    * simplified iommu_type setting after cpr
>    * added cpr_open_fd and cpr_is_incoming helpers
>    * removed changes from vfio_legacy_dma_map, and instead temporarily
>      override dma_map and dma_unmap ops.
>    * deleted error_report and returned Error to callers where possible.
>    * simplified the memory_get_xlat_addr interface
>    * fixed flags passed to iommufd_backend_alloc_hwpt
>    * defined MIG_PRI_UNINITIALIZED
>    * added maintainers
> 
> Changes in V3:
>    * removed cleanup patches that were already pulled
>    * rebased to latest master
> 
> Changes in V4:
>    * added SPDX-License-Identifier
>    * patch "vfio/container: preserve descriptors"
>      - rewrote search loop in vfio_container_connect
>      - do not return pfd from vfio_cpr_container_match
>      - add helper for VFIO_GROUP_GET_DEVICE_FD
>    * deleted patch "export vfio_legacy_dma_map"
>    * patch "vfio/container: restore DMA vaddr"
>      - deleted redundant error_report from vfio_legacy_cpr_dma_map
>      - save old dma_map function
>    * patch "vfio-pci: skip reset during cpr"
>      - use cpr_is_incoming instead of cpr_reused
>    * renamed err -> local_err in all new code
>    * patch "export MSI functions"
>      -  renamed with vfio_pci prefix, and defined wrappers for low level
>         routines instead of exporting them.
>    * patch "close kvm after cpr"
>      - fixed build error for !CONFIG_KVM
>    * added the cpr_resave_fd helper
>    * dropped patch "pass ramblock to vfio_container_dma_map", relying on
>      "pass MemoryRegion" from the vfio-user series instead.
>    * deleted "reused" variables, replaced with cpr_is_incoming()
>    * renamed cpr_needed_for_reuse -> cpr_incoming_needed
>    * rewrote patch "pci: skip reset during cpr"
>    * rebased to latest master
> 
>    for iommufd:
>      * deleted redundant error_report from iommufd_backend_map_file_dma
>      * added interface doc for dma_map_file
>      * check return value of cpr_open_fd
>      * deleted "export iommufd_cdev_get_info_iova_range"
>      * deleted "reconstruct device"
>      * deleted "reconstruct hw_caps"
>      * deleted "define hwpt constructors"
>      * separated cpr registration for iommufd be and vfio container
>      * correctly attach to multiple containers per iommufd using ioas_id
>      * simplified "reconstruct hwpt" by matching against hwpt_id.
>      * added patch "add vfio_device_free_name"
> 
> 
> Steve Sistare (43):
>    MAINTAINERS: Add reviewer for CPR
>    vfio: return mr from vfio_get_xlat_addr
>    vfio/container: pass MemoryRegion to DMA operations
>    vfio/pci: vfio_pci_put_device on failure
>    migration: cpr helpers
>    migration: lower handler priority
>    vfio: vfio_find_ram_discard_listener
>    vfio: move vfio-cpr.h
>    vfio/container: register container for cpr
>    vfio/container: preserve descriptors
>    vfio/container: discard old DMA vaddr
>    vfio/container: restore DMA vaddr
>    vfio/container: mdev cpr blocker
>    vfio/container: recover from unmap-all-vaddr failure
>    pci: export msix_is_pending
>    pci: skip reset during cpr
>    vfio-pci: skip reset during cpr
>    vfio/pci: vfio_pci_vector_init
>    vfio/pci: vfio_notifier_init
>    vfio/pci: pass vector to virq functions
>    vfio/pci: vfio_notifier_init cpr parameters
>    vfio/pci: vfio_notifier_cleanup
>    vfio/pci: export MSI functions
>    vfio-pci: preserve MSI
>    vfio-pci: preserve INTx
>    migration: close kvm after cpr
>    migration: cpr_get_fd_param helper
>    backends/iommufd: iommufd_backend_map_file_dma
>    backends/iommufd: change process ioctl
>    physmem: qemu_ram_get_fd_offset
>    vfio/iommufd: use IOMMU_IOAS_MAP_FILE
>    vfio/iommufd: invariant device name
>    vfio/iommufd: add vfio_device_free_name
>    vfio/iommufd: device name blocker
>    vfio/iommufd: register container for cpr
>    migration: vfio cpr state hook
>    vfio/iommufd: cpr state
>    vfio/iommufd: preserve descriptors
>    vfio/iommufd: reconstruct device
>    vfio/iommufd: reconstruct hwpt
>    vfio/iommufd: change process
>    iommufd: preserve DMA mappings
>    vfio/container: delete old cpr register
> 
>   MAINTAINERS                           |  10 ++
>   hw/vfio/pci.h                         |  10 ++
>   hw/vfio/vfio-cpr.h                    |  15 --
>   include/exec/cpu-common.h             |   1 +
>   include/hw/pci/msix.h                 |   1 +
>   include/hw/pci/pci_device.h           |   3 +
>   include/hw/vfio/vfio-container-base.h |  38 ++++-
>   include/hw/vfio/vfio-container.h      |   2 +
>   include/hw/vfio/vfio-cpr.h            |  78 +++++++++
>   include/hw/vfio/vfio-device.h         |   5 +
>   include/migration/cpr.h               |  21 +++
>   include/migration/vmstate.h           |   6 +-
>   include/system/iommufd.h              |   6 +
>   include/system/kvm.h                  |   1 +
>   include/system/memory.h               |  19 ++-
>   accel/kvm/kvm-all.c                   |  28 ++++
>   accel/stubs/kvm-stub.c                |   5 +
>   backends/iommufd.c                    | 101 +++++++++++-
>   hw/pci/msix.c                         |   2 +-
>   hw/pci/pci.c                          |   5 +
>   hw/vfio/ap.c                          |   2 +-
>   hw/vfio/ccw.c                         |   2 +-
>   hw/vfio/container-base.c              |  13 +-
>   hw/vfio/container.c                   | 101 +++++++++---
>   hw/vfio/cpr-iommufd.c                 | 220 ++++++++++++++++++++++++++
>   hw/vfio/cpr-legacy.c                  | 288 ++++++++++++++++++++++++++++++++++
>   hw/vfio/cpr.c                         | 161 +++++++++++++++++--
>   hw/vfio/device.c                      |  40 +++--
>   hw/vfio/helpers.c                     |  10 ++
>   hw/vfio/iommufd.c                     |  86 ++++++++--
>   hw/vfio/listener.c                    |  93 +++++++----
>   hw/vfio/pci.c                         | 232 ++++++++++++++++++++-------
>   hw/vfio/platform.c                    |   2 +-
>   hw/vfio/vfio-stubs.c                  |  13 ++
>   hw/virtio/vhost-vdpa.c                |   9 +-
>   migration/cpr-transfer.c              |  18 +++
>   migration/cpr.c                       |  95 +++++++++--
>   migration/migration.c                 |   1 +
>   migration/savevm.c                    |   4 +-
>   system/memory.c                       |  32 +---
>   system/physmem.c                      |   5 +
>   backends/trace-events                 |   2 +
>   hw/vfio/meson.build                   |   4 +
>   43 files changed, 1576 insertions(+), 214 deletions(-)
>   delete mode 100644 hw/vfio/vfio-cpr.h
>   create mode 100644 include/hw/vfio/vfio-cpr.h
>   create mode 100644 hw/vfio/cpr-iommufd.c
>   create mode 100644 hw/vfio/cpr-legacy.c
>   create mode 100644 hw/vfio/vfio-stubs.c
> 
> base-commit: d2e9b78162e31b1eaf20f3a4f563da82da56908d



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH V4 16/43] pci: skip reset during cpr
  2025-06-01 16:38   ` Cédric Le Goater
@ 2025-06-01 19:07     ` Michael S. Tsirkin
  2025-06-02 12:36       ` Steven Sistare
  0 siblings, 1 reply; 90+ messages in thread
From: Michael S. Tsirkin @ 2025-06-01 19:07 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Steve Sistare, qemu-devel, Alex Williamson, Yi Liu, Eric Auger,
	Zhenzhong Duan, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On Sun, Jun 01, 2025 at 06:38:43PM +0200, Cédric Le Goater wrote:
> On 5/29/25 21:24, Steve Sistare wrote:
> > Do not reset a vfio-pci device during CPR.
> > 
> > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > ---
> >   include/hw/pci/pci_device.h | 3 +++
> >   hw/pci/pci.c                | 5 +++++
> >   hw/vfio/pci.c               | 7 +++++++
> >   3 files changed, 15 insertions(+)
> > 
> > diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
> > index e41d95b..b481c5d 100644
> > --- a/include/hw/pci/pci_device.h
> > +++ b/include/hw/pci/pci_device.h
> > @@ -181,6 +181,9 @@ struct PCIDevice {
> >       uint32_t max_bounce_buffer_size;
> >       char *sriov_pf;
> > +
> > +    /* CPR */
> > +    bool skip_reset_on_cpr;
> >   };
> >   static inline int pci_intx(PCIDevice *pci_dev)
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index f5ab510..21eb11c 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -32,6 +32,7 @@
> >   #include "hw/pci/pci_host.h"
> >   #include "hw/qdev-properties.h"
> >   #include "hw/qdev-properties-system.h"
> > +#include "migration/cpr.h"
> >   #include "migration/qemu-file-types.h"
> >   #include "migration/vmstate.h"
> >   #include "net/net.h"
> > @@ -531,6 +532,10 @@ static void pci_reset_regions(PCIDevice *dev)
> >   static void pci_do_device_reset(PCIDevice *dev)
> >   {
> > +    if (dev->skip_reset_on_cpr && cpr_is_incoming()) {
> > +        return;
> > +    }
> 
> Since ->skip_reset_on_cpr is only true for vfio-pci devices, it could be
> replaced by : object_dynamic_cast(OBJECT(dev), "vfio-pci")
> 
> Thanks,
> 
> C.

True, but I don't really like driver-dependent hacks.
What exactly about vfio makes it survive without this reset?

> 
> > +
> >       pci_device_deassert_intx(dev);
> >       assert(dev->irq_state == 0);
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index 7d3b9ff..56e7fdd 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -3402,6 +3402,13 @@ static void vfio_instance_init(Object *obj)
> >       /* QEMU_PCI_CAP_EXPRESS initialization does not depend on QEMU command
> >        * line, therefore, no need to wait to realize like other devices */
> >       pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
> > +
> > +    /*
> > +     * A device that is resuming for cpr is already configured, so do not
> > +     * reset it during qemu_system_reset prior to cpr load, else interrupts
> > +     * may be lost.
> > +     */
> > +    pci_dev->skip_reset_on_cpr = true;
> >   }>     static void vfio_pci_base_dev_class_init(ObjectClass *klass,
> > const void *data)



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH V4 16/43] pci: skip reset during cpr
  2025-06-01 19:07     ` Michael S. Tsirkin
@ 2025-06-02 12:36       ` Steven Sistare
  2025-06-04  7:09         ` Cédric Le Goater
  0 siblings, 1 reply; 90+ messages in thread
From: Steven Sistare @ 2025-06-02 12:36 UTC (permalink / raw)
  To: Michael S. Tsirkin, Cédric Le Goater
  Cc: qemu-devel, Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 6/1/2025 3:07 PM, Michael S. Tsirkin wrote:
> On Sun, Jun 01, 2025 at 06:38:43PM +0200, Cédric Le Goater wrote:
>> On 5/29/25 21:24, Steve Sistare wrote:
>>> Do not reset a vfio-pci device during CPR.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>>    include/hw/pci/pci_device.h | 3 +++
>>>    hw/pci/pci.c                | 5 +++++
>>>    hw/vfio/pci.c               | 7 +++++++
>>>    3 files changed, 15 insertions(+)
>>>
>>> diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
>>> index e41d95b..b481c5d 100644
>>> --- a/include/hw/pci/pci_device.h
>>> +++ b/include/hw/pci/pci_device.h
>>> @@ -181,6 +181,9 @@ struct PCIDevice {
>>>        uint32_t max_bounce_buffer_size;
>>>        char *sriov_pf;
>>> +
>>> +    /* CPR */
>>> +    bool skip_reset_on_cpr;
>>>    };
>>>    static inline int pci_intx(PCIDevice *pci_dev)
>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>> index f5ab510..21eb11c 100644
>>> --- a/hw/pci/pci.c
>>> +++ b/hw/pci/pci.c
>>> @@ -32,6 +32,7 @@
>>>    #include "hw/pci/pci_host.h"
>>>    #include "hw/qdev-properties.h"
>>>    #include "hw/qdev-properties-system.h"
>>> +#include "migration/cpr.h"
>>>    #include "migration/qemu-file-types.h"
>>>    #include "migration/vmstate.h"
>>>    #include "net/net.h"
>>> @@ -531,6 +532,10 @@ static void pci_reset_regions(PCIDevice *dev)
>>>    static void pci_do_device_reset(PCIDevice *dev)
>>>    {
>>> +    if (dev->skip_reset_on_cpr && cpr_is_incoming()) {
>>> +        return;
>>> +    }
>>
>> Since ->skip_reset_on_cpr is only true for vfio-pci devices, it could be
>> replaced by : object_dynamic_cast(OBJECT(dev), "vfio-pci")
>>
>> Thanks,
>>
>> C.
> 
> True, but I don't really like driver-dependent hacks.
> What exactly about vfio makes it survive without this reset?

The kernel descriptors remain open and all the active kernel PCI state
remains in place.  The device was never quiesced or de-configured in old QEMU.

The cast is fine with me; it depends on what Michael wants.

- Steve
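
To make the behavior under discussion concrete, here is a minimal sketch of the guard in pci_do_device_reset, assuming simplified stand-ins for the real types. The demo_* names are hypothetical, and demo_cpr_incoming models the result of cpr_is_incoming().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced model of a PCI device; not the QEMU PCIDevice. */
struct demo_pci_dev {
    bool skip_reset_on_cpr;   /* set by the device model, e.g. vfio-pci */
    int irq_state;            /* emulated interrupt state */
    int reset_count;          /* how many resets actually ran */
};

/* Models cpr_is_incoming(): true while new QEMU is loading CPR state. */
static bool demo_cpr_incoming;

static void demo_device_reset(struct demo_pci_dev *dev)
{
    assert(dev);

    /*
     * The kernel-side state of a preserved device is still live, so
     * leave it alone during the reset that precedes cpr load.
     */
    if (dev->skip_reset_on_cpr && demo_cpr_incoming) {
        return;
    }

    dev->irq_state = 0;
    dev->reset_count++;
}
```

With the object_dynamic_cast alternative, the per-device flag would go away and the first half of the condition would become a type check. The early return itself would stay the same.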

>>>        pci_device_deassert_intx(dev);
>>>        assert(dev->irq_state == 0);
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index 7d3b9ff..56e7fdd 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -3402,6 +3402,13 @@ static void vfio_instance_init(Object *obj)
>>>        /* QEMU_PCI_CAP_EXPRESS initialization does not depend on QEMU command
>>>         * line, therefore, no need to wait to realize like other devices */
>>>        pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
>>> +
>>> +    /*
>>> +     * A device that is resuming for cpr is already configured, so do not
>>> +     * reset it during qemu_system_reset prior to cpr load, else interrupts
>>> +     * may be lost.
>>> +     */
>>> +    pci_dev->skip_reset_on_cpr = true;
>>>    }
>>>    static void vfio_pci_base_dev_class_init(ObjectClass *klass,
>>>                                             const void *data)
> 



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH V4 00/43] Live update: vfio and iommufd
  2025-06-01 17:26 ` [PATCH V4 00/43] Live update: vfio and iommufd Cédric Le Goater
@ 2025-06-02 12:42   ` Steven Sistare
  0 siblings, 0 replies; 90+ messages in thread
From: Steven Sistare @ 2025-06-02 12:42 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas,
	John Levon

On 6/1/2025 1:26 PM, Cédric Le Goater wrote:
> On 5/29/25 21:23, Steve Sistare wrote:
>> Support vfio and iommufd devices with the cpr-transfer live migration mode.
>> Devices that do not support live migration can still support cpr-transfer,
>> allowing live update to a new version of QEMU on the same host, with no loss
>> of guest connectivity.
>>
>> No user-visible interfaces are added.
>>
>> For legacy containers:
>>
>> Pass vfio device descriptors to new QEMU.  In new QEMU, during vfio_realize,
>> skip the ioctls that configure the device, because it is already configured.
>>
>> Use VFIO_DMA_UNMAP_FLAG_VADDR to abandon the old VA's for DMA mapped
>> regions, and use VFIO_DMA_MAP_FLAG_VADDR to register the new VA in new
>> QEMU and update the locked memory accounting.  The physical pages remain
>> pinned, because the descriptor of the device that locked them remains open,
>> so DMA to those pages continues without interruption.  Mediated devices are
>> not supported, however, because they require the VA to always be valid, and
>> there is a brief window where no VA is registered.
>>
>> Save the MSI message area as part of vfio-pci vmstate, and pass the interrupt
>> and notifier eventfd's to new QEMU.  New QEMU loads the MSI data, then the
>> vfio-pci post_load handler finds the eventfds in CPR state, rebuilds vector
>> data structures, and attaches the interrupts to the new KVM instance.  This
>> logic also applies to iommufd containers.
>>
>> For iommufd containers:
>>
>> Use IOMMU_IOAS_MAP_FILE to register memory regions for DMA when they are
>> backed by a file (including a memfd), so DMA mappings do not depend on VA,
>> which can differ after live update.  This allows mediated devices to be
>> supported.
>>
>> Pass the iommufd and vfio device descriptors from old to new QEMU.  In new
>> QEMU, during vfio_realize, skip the ioctls that configure the device, because
>> it is already configured.
>>
>> In new QEMU, call ioctl(IOMMU_IOAS_CHANGE_PROCESS) to update mm ownership and
>> locked memory accounting.
>>
>> Patches 4 to 12 are specific to legacy containers.
>> Patches 25 to 41 are specific to iommufd containers.
>> The remainder apply to both.
> 
> Steve,
> 
> I am considering patches 1-23 for vfio-next. Merging a first part quickly
> makes sense because we still have time ahead of us, and it lets us catch
> issues early. It will also help John to rebase.

Cool.  Do not pull these two; they landed in vfio-next a few days after I posted V4:
   vfio: return mr from vfio_get_xlat_addr
   vfio/container: pass MemoryRegion to DMA operations

> I think patch 16 can be simplified. If you agree, and Michael too, please
> resend just this patch. I will update the series and send a PR.

Will do, depends on Michael's response.

- Steve

>> Changes from previous versions:
>>    * V1 of this series contains minor changes from the "Live update: vfio" and
>>      "Live update: iommufd" series, mainly bug fixes and refactored patches.
>>
>> Changes in V2:
>>    * refactored various vfio code snippets into new cpr helpers
>>    * refactored vfio struct members into cpr-specific structures
>>    * refactored various small changes into their own patches
>>    * split complex patches.  Notably:
>>      - split "refactor for cpr" into 5 patches
>>      - split "reconstruct device" into 4 patches
>>    * refactored vfio_connect_container using helpers and made its
>>      error recovery more robust.
>>    * moved vfio pci msi/vector/intx cpr functions to cpr.c
>>    * renamed "reused" to cpr_reused and cpr.reused
>>    * squashed vfio_cpr_[un]register_container to their call sites
>>    * simplified iommu_type setting after cpr
>>    * added cpr_open_fd and cpr_is_incoming helpers
>>    * removed changes from vfio_legacy_dma_map, and instead temporarily
>>      override dma_map and dma_unmap ops.
>>    * deleted error_report and returned Error to callers where possible.
>>    * simplified the memory_get_xlat_addr interface
>>    * fixed flags passed to iommufd_backend_alloc_hwpt
>>    * defined MIG_PRI_UNINITIALIZED
>>    * added maintainers
>>
>> Changes in V3:
>>    * removed cleanup patches that were already pulled
>>    * rebased to latest master
>>
>> Changes in V4:
>>    * added SPDX-License-Identifier
>>    * patch "vfio/container: preserve descriptors"
>>      - rewrote search loop in vfio_container_connect
>>      - do not return pfd from vfio_cpr_container_match
>>      - add helper for VFIO_GROUP_GET_DEVICE_FD
>>    * deleted patch "export vfio_legacy_dma_map"
>>    * patch "vfio/container: restore DMA vaddr"
>>      - deleted redundant error_report from vfio_legacy_cpr_dma_map
>>      - save old dma_map function
>>    * patch "vfio-pci: skip reset during cpr"
>>      - use cpr_is_incoming instead of cpr_reused
>>    * renamed err -> local_err in all new code
>>    * patch "export MSI functions"
>>      -  renamed with vfio_pci prefix, and defined wrappers for low level
>>         routines instead of exporting them.
>>    * patch "close kvm after cpr"
>>      - fixed build error for !CONFIG_KVM
>>    * added the cpr_resave_fd helper
>>    * dropped patch "pass ramblock to vfio_container_dma_map", relying on
>>      "pass MemoryRegion" from the vfio-user series instead.
>>    * deleted "reused" variables, replaced with cpr_is_incoming()
>>    * renamed cpr_needed_for_reuse -> cpr_incoming_needed
>>    * rewrote patch "pci: skip reset during cpr"
>>    * rebased to latest master
>>
>>    for iommufd:
>>      * deleted redundant error_report from iommufd_backend_map_file_dma
>>      * added interface doc for dma_map_file
>>      * check return value of cpr_open_fd
>>      * deleted "export iommufd_cdev_get_info_iova_range"
>>      * deleted "reconstruct device"
>>      * deleted "reconstruct hw_caps"
>>      * deleted "define hwpt constructors"
>>      * separated cpr registration for iommufd be and vfio container
>>      * correctly attach to multiple containers per iommufd using ioas_id
>>      * simplified "reconstruct hwpt" by matching against hwpt_id.
>>      * added patch "add vfio_device_free_name"
>>
>>
>> Steve Sistare (43):
>>    MAINTAINERS: Add reviewer for CPR
>>    vfio: return mr from vfio_get_xlat_addr
>>    vfio/container: pass MemoryRegion to DMA operations
>>    vfio/pci: vfio_pci_put_device on failure
>>    migration: cpr helpers
>>    migration: lower handler priority
>>    vfio: vfio_find_ram_discard_listener
>>    vfio: move vfio-cpr.h
>>    vfio/container: register container for cpr
>>    vfio/container: preserve descriptors
>>    vfio/container: discard old DMA vaddr
>>    vfio/container: restore DMA vaddr
>>    vfio/container: mdev cpr blocker
>>    vfio/container: recover from unmap-all-vaddr failure
>>    pci: export msix_is_pending
>>    pci: skip reset during cpr
>>    vfio-pci: skip reset during cpr
>>    vfio/pci: vfio_pci_vector_init
>>    vfio/pci: vfio_notifier_init
>>    vfio/pci: pass vector to virq functions
>>    vfio/pci: vfio_notifier_init cpr parameters
>>    vfio/pci: vfio_notifier_cleanup
>>    vfio/pci: export MSI functions
>>    vfio-pci: preserve MSI
>>    vfio-pci: preserve INTx
>>    migration: close kvm after cpr
>>    migration: cpr_get_fd_param helper
>>    backends/iommufd: iommufd_backend_map_file_dma
>>    backends/iommufd: change process ioctl
>>    physmem: qemu_ram_get_fd_offset
>>    vfio/iommufd: use IOMMU_IOAS_MAP_FILE
>>    vfio/iommufd: invariant device name
>>    vfio/iommufd: add vfio_device_free_name
>>    vfio/iommufd: device name blocker
>>    vfio/iommufd: register container for cpr
>>    migration: vfio cpr state hook
>>    vfio/iommufd: cpr state
>>    vfio/iommufd: preserve descriptors
>>    vfio/iommufd: reconstruct device
>>    vfio/iommufd: reconstruct hwpt
>>    vfio/iommufd: change process
>>    iommufd: preserve DMA mappings
>>    vfio/container: delete old cpr register
>>
>>   MAINTAINERS                           |  10 ++
>>   hw/vfio/pci.h                         |  10 ++
>>   hw/vfio/vfio-cpr.h                    |  15 --
>>   include/exec/cpu-common.h             |   1 +
>>   include/hw/pci/msix.h                 |   1 +
>>   include/hw/pci/pci_device.h           |   3 +
>>   include/hw/vfio/vfio-container-base.h |  38 ++++-
>>   include/hw/vfio/vfio-container.h      |   2 +
>>   include/hw/vfio/vfio-cpr.h            |  78 +++++++++
>>   include/hw/vfio/vfio-device.h         |   5 +
>>   include/migration/cpr.h               |  21 +++
>>   include/migration/vmstate.h           |   6 +-
>>   include/system/iommufd.h              |   6 +
>>   include/system/kvm.h                  |   1 +
>>   include/system/memory.h               |  19 ++-
>>   accel/kvm/kvm-all.c                   |  28 ++++
>>   accel/stubs/kvm-stub.c                |   5 +
>>   backends/iommufd.c                    | 101 +++++++++++-
>>   hw/pci/msix.c                         |   2 +-
>>   hw/pci/pci.c                          |   5 +
>>   hw/vfio/ap.c                          |   2 +-
>>   hw/vfio/ccw.c                         |   2 +-
>>   hw/vfio/container-base.c              |  13 +-
>>   hw/vfio/container.c                   | 101 +++++++++---
>>   hw/vfio/cpr-iommufd.c                 | 220 ++++++++++++++++++++++++++
>>   hw/vfio/cpr-legacy.c                  | 288 ++++++++++++++++++++++++++++++++++
>>   hw/vfio/cpr.c                         | 161 +++++++++++++++++--
>>   hw/vfio/device.c                      |  40 +++--
>>   hw/vfio/helpers.c                     |  10 ++
>>   hw/vfio/iommufd.c                     |  86 ++++++++--
>>   hw/vfio/listener.c                    |  93 +++++++----
>>   hw/vfio/pci.c                         | 232 ++++++++++++++++++++-------
>>   hw/vfio/platform.c                    |   2 +-
>>   hw/vfio/vfio-stubs.c                  |  13 ++
>>   hw/virtio/vhost-vdpa.c                |   9 +-
>>   migration/cpr-transfer.c              |  18 +++
>>   migration/cpr.c                       |  95 +++++++++--
>>   migration/migration.c                 |   1 +
>>   migration/savevm.c                    |   4 +-
>>   system/memory.c                       |  32 +---
>>   system/physmem.c                      |   5 +
>>   backends/trace-events                 |   2 +
>>   hw/vfio/meson.build                   |   4 +
>>   43 files changed, 1576 insertions(+), 214 deletions(-)
>>   delete mode 100644 hw/vfio/vfio-cpr.h
>>   create mode 100644 include/hw/vfio/vfio-cpr.h
>>   create mode 100644 hw/vfio/cpr-iommufd.c
>>   create mode 100644 hw/vfio/cpr-legacy.c
>>   create mode 100644 hw/vfio/vfio-stubs.c
>>
>> base-commit: d2e9b78162e31b1eaf20f3a4f563da82da56908d
> 



^ permalink raw reply	[flat|nested] 90+ messages in thread

* RE: [PATCH V4 02/43] vfio: return mr from vfio_get_xlat_addr
  2025-05-29 19:23 ` [PATCH V4 02/43] vfio: return mr from vfio_get_xlat_addr Steve Sistare
@ 2025-06-03 10:39   ` Duan, Zhenzhong
  0 siblings, 0 replies; 90+ messages in thread
From: Duan, Zhenzhong @ 2025-06-03 10:39 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V4 02/43] vfio: return mr from vfio_get_xlat_addr
>
>Modify memory_get_xlat_addr and vfio_get_xlat_addr to return the memory
>region that the translated address is found in.  This will be needed by
>CPR in a subsequent patch to map blocks using IOMMU_IOAS_MAP_FILE.
>
>Also return the xlat offset, so we can simplify the interface by removing
>the out parameters that can be trivially derived from mr and xlat.
>
>Lastly, rename the functions to memory_translate_iotlb() and
>vfio_translate_iotlb().
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>Acked-by: David Hildenbrand <david@redhat.com>
>Reviewed-by: John Levon <john.levon@nutanix.com>
>Reviewed-by: Cédric Le Goater <clg@redhat.com>
>Acked-by: Michael S. Tsirkin <mst@redhat.com>

Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>

>---
> include/system/memory.h | 19 +++++++++----------
> hw/vfio/listener.c      | 33 ++++++++++++++++++++++-----------
> hw/virtio/vhost-vdpa.c  |  9 +++++++--
> system/memory.c         | 32 +++++++-------------------------
> 4 files changed, 45 insertions(+), 48 deletions(-)
>
>diff --git a/include/system/memory.h b/include/system/memory.h
>index fbbf4cf..13416d7 100644
>--- a/include/system/memory.h
>+++ b/include/system/memory.h
>@@ -738,21 +738,20 @@ void ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
>                                              RamDiscardListener *rdl);
>
> /**
>- * memory_get_xlat_addr: Extract addresses from a TLB entry
>+ * memory_translate_iotlb: Extract addresses from a TLB entry.
>+ *                         Called with rcu_read_lock held.
>  *
>  * @iotlb: pointer to an #IOMMUTLBEntry
>- * @vaddr: virtual address
>- * @ram_addr: RAM address
>- * @read_only: indicates if writes are allowed
>- * @mr_has_discard_manager: indicates memory is controlled by a
>- *                          RamDiscardManager
>+ * @xlat_p: return the offset of the entry from the start of the returned
>+ *          MemoryRegion.
>  * @errp: pointer to Error*, to store an error if it happens.
>  *
>- * Return: true on success, else false setting @errp with error.
>+ * Return: On success, return the MemoryRegion containing the @iotlb
>translated
>+ *         addr.  The MemoryRegion must not be accessed after rcu_read_unlock.
>+ *         On failure, return NULL, setting @errp with error.
>  */
>-bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>-                          ram_addr_t *ram_addr, bool *read_only,
>-                          bool *mr_has_discard_manager, Error **errp);
>+MemoryRegion *memory_translate_iotlb(IOMMUTLBEntry *iotlb, hwaddr
>*xlat_p,
>+                                     Error **errp);
>
> typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
>diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
>index bfacb3d..0afafe3 100644
>--- a/hw/vfio/listener.c
>+++ b/hw/vfio/listener.c
>@@ -90,16 +90,17 @@ static bool
>vfio_listener_skipped_section(MemoryRegionSection *section)
>            section->offset_within_address_space & (1ULL << 63);
> }
>
>-/* Called with rcu_read_lock held.  */
>-static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>-                               ram_addr_t *ram_addr, bool *read_only,
>-                               Error **errp)
>+/*
>+ * Called with rcu_read_lock held.
>+ * The returned MemoryRegion must not be accessed after calling
>rcu_read_unlock.
>+ */
>+static MemoryRegion *vfio_translate_iotlb(IOMMUTLBEntry *iotlb, hwaddr
>*xlat_p,
>+                                          Error **errp)
> {
>-    bool ret, mr_has_discard_manager;
>+    MemoryRegion *mr;
>
>-    ret = memory_get_xlat_addr(iotlb, vaddr, ram_addr, read_only,
>-                               &mr_has_discard_manager, errp);
>-    if (ret && mr_has_discard_manager) {
>+    mr = memory_translate_iotlb(iotlb, xlat_p, errp);
>+    if (mr && memory_region_has_ram_discard_manager(mr)) {
>         /*
>          * Malicious VMs might trigger discarding of IOMMU-mapped memory. The
>          * pages will remain pinned inside vfio until unmapped, resulting in a
>@@ -118,7 +119,7 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb,
>void **vaddr,
>                          " intended via an IOMMU. It's possible to mitigate "
>                          " by setting/adjusting RLIMIT_MEMLOCK.");
>     }
>-    return ret;
>+    return mr;
> }
>
> static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>@@ -126,6 +127,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n,
>IOMMUTLBEntry *iotlb)
>     VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>     VFIOContainerBase *bcontainer = giommu->bcontainer;
>     hwaddr iova = iotlb->iova + giommu->iommu_offset;
>+    MemoryRegion *mr;
>+    hwaddr xlat;
>     void *vaddr;
>     int ret;
>     Error *local_err = NULL;
>@@ -150,10 +153,14 @@ static void vfio_iommu_map_notify(IOMMUNotifier
>*n, IOMMUTLBEntry *iotlb)
>     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>         bool read_only;
>
>-        if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, &local_err)) {
>+        mr = vfio_translate_iotlb(iotlb, &xlat, &local_err);
>+        if (!mr) {
>             error_report_err(local_err);
>             goto out;
>         }
>+        vaddr = memory_region_get_ram_ptr(mr) + xlat;
>+        read_only = !(iotlb->perm & IOMMU_WO) || mr->readonly;
>+
>         /*
>          * vaddr is only valid until rcu_read_unlock(). But after
>          * vfio_dma_map has set up the mapping the pages will be
>@@ -1010,6 +1017,8 @@ static void
>vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>     ram_addr_t translated_addr;
>     Error *local_err = NULL;
>     int ret = -EINVAL;
>+    MemoryRegion *mr;
>+    ram_addr_t xlat;
>
>     trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask);
>
>@@ -1021,9 +1030,11 @@ static void
>vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>     }
>
>     rcu_read_lock();
>-    if (!vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL, &local_err)) {
>+    mr = vfio_translate_iotlb(iotlb, &xlat, &local_err);
>+    if (!mr) {
>         goto out_unlock;
>     }
>+    translated_addr = memory_region_get_ram_addr(mr) + xlat;
>
>     ret = vfio_container_query_dirty_bitmap(bcontainer, iova, iotlb->addr_mask +
>1,
>                                 translated_addr, &local_err);
>diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>index 1ab2c11..a1dd9e1 100644
>--- a/hw/virtio/vhost-vdpa.c
>+++ b/hw/virtio/vhost-vdpa.c
>@@ -209,6 +209,8 @@ static void
>vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>     int ret;
>     Int128 llend;
>     Error *local_err = NULL;
>+    MemoryRegion *mr;
>+    hwaddr xlat;
>
>     if (iotlb->target_as != &address_space_memory) {
>         error_report("Wrong target AS \"%s\", only system memory is allowed",
>@@ -228,11 +230,14 @@ static void
>vhost_vdpa_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>         bool read_only;
>
>-        if (!memory_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, NULL,
>-                                  &local_err)) {
>+        mr = memory_translate_iotlb(iotlb, &xlat, &local_err);
>+        if (!mr) {
>             error_report_err(local_err);
>             return;
>         }
>+        vaddr = memory_region_get_ram_ptr(mr) + xlat;
>+        read_only = !(iotlb->perm & IOMMU_WO) || mr->readonly;
>+
>         ret = vhost_vdpa_dma_map(s, VHOST_VDPA_GUEST_PA_ASID, iova,
>                                  iotlb->addr_mask + 1, vaddr, read_only);
>         if (ret) {
>diff --git a/system/memory.c b/system/memory.c
>index 63b983e..306e9ff 100644
>--- a/system/memory.c
>+++ b/system/memory.c
>@@ -2174,18 +2174,14 @@ void
>ram_discard_manager_unregister_listener(RamDiscardManager *rdm,
> }
>
> /* Called with rcu_read_lock held.  */
>-bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
>-                          ram_addr_t *ram_addr, bool *read_only,
>-                          bool *mr_has_discard_manager, Error **errp)
>+MemoryRegion *memory_translate_iotlb(IOMMUTLBEntry *iotlb, hwaddr
>*xlat_p,
>+                                     Error **errp)
> {
>     MemoryRegion *mr;
>     hwaddr xlat;
>     hwaddr len = iotlb->addr_mask + 1;
>     bool writable = iotlb->perm & IOMMU_WO;
>
>-    if (mr_has_discard_manager) {
>-        *mr_has_discard_manager = false;
>-    }
>     /*
>      * The IOMMU TLB entry we have just covers translation through
>      * this IOMMU to its immediate target.  We need to translate
>@@ -2195,7 +2191,7 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb,
>void **vaddr,
>                                  &xlat, &len, writable, MEMTXATTRS_UNSPECIFIED);
>     if (!memory_region_is_ram(mr)) {
>         error_setg(errp, "iommu map to non memory area %" HWADDR_PRIx "",
>xlat);
>-        return false;
>+        return NULL;
>     } else if (memory_region_has_ram_discard_manager(mr)) {
>         RamDiscardManager *rdm =
>memory_region_get_ram_discard_manager(mr);
>         MemoryRegionSection tmp = {
>@@ -2203,9 +2199,6 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb,
>void **vaddr,
>             .offset_within_region = xlat,
>             .size = int128_make64(len),
>         };
>-        if (mr_has_discard_manager) {
>-            *mr_has_discard_manager = true;
>-        }
>         /*
>          * Malicious VMs can map memory into the IOMMU, which is expected
>          * to remain discarded. vfio will pin all pages, populating memory.
>@@ -2216,7 +2209,7 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb,
>void **vaddr,
>             error_setg(errp, "iommu map to discarded memory (e.g., unplugged"
>                          " via virtio-mem): %" HWADDR_PRIx "",
>                          iotlb->translated_addr);
>-            return false;
>+            return NULL;
>         }
>     }
>
>@@ -2226,22 +2219,11 @@ bool memory_get_xlat_addr(IOMMUTLBEntry
>*iotlb, void **vaddr,
>      */
>     if (len & iotlb->addr_mask) {
>         error_setg(errp, "iommu has granularity incompatible with target AS");
>-        return false;
>+        return NULL;
>     }
>
>-    if (vaddr) {
>-        *vaddr = memory_region_get_ram_ptr(mr) + xlat;
>-    }
>-
>-    if (ram_addr) {
>-        *ram_addr = memory_region_get_ram_addr(mr) + xlat;
>-    }
>-
>-    if (read_only) {
>-        *read_only = !writable || mr->readonly;
>-    }
>-
>-    return true;
>+    *xlat_p = xlat;
>+    return mr;
> }
>
> void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
>--
>1.8.3.1
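
For readers following the interface change, here is a minimal standalone
sketch of the pattern: the translate call returns the containing region
plus an offset, and each caller derives only the values it needs.  The
Region and Entry types and all function names below are hypothetical
stand-ins, not QEMU's MemoryRegion or IOMMUTLBEntry, and the RCU lifetime
rule from the patch is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a RAM-backed region; not QEMU's MemoryRegion. */
typedef struct Region {
    uint8_t *host_base;   /* host virtual address of the region start */
    uint64_t ram_base;    /* RAM address of the region start */
    bool readonly;        /* region rejects writes */
} Region;

/* Hypothetical TLB entry: target region, offset into it, and permission. */
typedef struct Entry {
    Region *target;
    uint64_t offset;
    bool writable;
} Entry;

/*
 * Return the region the entry translates into and store the offset in
 * *xlat_p.  Return NULL on failure, leaving *xlat_p untouched.
 */
static Region *translate_entry(const Entry *e, uint64_t *xlat_p)
{
    if (!e->target) {
        return NULL;
    }
    *xlat_p = e->offset;
    return e->target;
}

/* Callers derive what they need from (region, xlat). */
static void *entry_vaddr(Region *r, uint64_t xlat)
{
    return r->host_base + xlat;
}

static uint64_t entry_ram_addr(const Region *r, uint64_t xlat)
{
    return r->ram_base + xlat;
}

static bool entry_read_only(const Entry *e, const Region *r)
{
    return !e->writable || r->readonly;
}
```

The design point is that one return value plus one out parameter replaces
several optional out parameters, so a caller that only wants the RAM
address no longer passes NULL for the rest.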



* RE: [PATCH V4 03/43] vfio/container: pass MemoryRegion to DMA operations
  2025-05-29 19:23 ` [PATCH V4 03/43] vfio/container: pass MemoryRegion to DMA operations Steve Sistare
@ 2025-06-03 10:39   ` Duan, Zhenzhong
  0 siblings, 0 replies; 90+ messages in thread
From: Duan, Zhenzhong @ 2025-06-03 10:39 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V4 03/43] vfio/container: pass MemoryRegion to DMA
>operations
>
>Pass through the MemoryRegion to DMA operation handlers of vfio
>containers. The vfio-user container will need this later, to translate
>the vaddr into an offset for the dma map vfio-user message.
>
>Originally-by: John Johnson <john.g.johnson@oracle.com>
>Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
>Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>Signed-off-by: John Levon <john.levon@nutanix.com>
>Reviewed-by: Cédric Le Goater <clg@redhat.com>
>Reviewed-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>

>---
> include/hw/vfio/vfio-container-base.h | 17 +++++++++++++++--
> hw/vfio/container-base.c              |  4 ++--
> hw/vfio/container.c                   |  3 ++-
> hw/vfio/iommufd.c                     |  3 ++-
> hw/vfio/listener.c                    |  6 +++---
> 5 files changed, 24 insertions(+), 9 deletions(-)
>
>diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-
>container-base.h
>index 3d392b0..83ba7a5 100644
>--- a/include/hw/vfio/vfio-container-base.h
>+++ b/include/hw/vfio/vfio-container-base.h
>@@ -78,7 +78,7 @@ void vfio_address_space_insert(VFIOAddressSpace *space,
>
> int vfio_container_dma_map(VFIOContainerBase *bcontainer,
>                            hwaddr iova, ram_addr_t size,
>-                           void *vaddr, bool readonly);
>+                           void *vaddr, bool readonly, MemoryRegion *mr);
> int vfio_container_dma_unmap(VFIOContainerBase *bcontainer,
>                              hwaddr iova, ram_addr_t size,
>                              IOMMUTLBEntry *iotlb, bool unmap_all);
>@@ -119,9 +119,22 @@ struct VFIOIOMMUClass {
>     bool (*setup)(VFIOContainerBase *bcontainer, Error **errp);
>     void (*listener_begin)(VFIOContainerBase *bcontainer);
>     void (*listener_commit)(VFIOContainerBase *bcontainer);
>+    /**
>+     * @dma_map
>+     *
>+     * Map an address range into the container. Note that @mr will be within an
>+     * RCU read lock region across this call.
>+     *
>+     * @bcontainer: #VFIOContainerBase to use
>+     * @iova: start address to map
>+     * @size: size of the range to map
>+     * @vaddr: process virtual address of mapping
>+     * @readonly: true if mapping should be readonly
>+     * @mr: the memory region for this mapping
>+     */
>     int (*dma_map)(const VFIOContainerBase *bcontainer,
>                    hwaddr iova, ram_addr_t size,
>-                   void *vaddr, bool readonly);
>+                   void *vaddr, bool readonly, MemoryRegion *mr);
>     /**
>      * @dma_unmap
>      *
>diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
>index 1c6ca94..d834bd4 100644
>--- a/hw/vfio/container-base.c
>+++ b/hw/vfio/container-base.c
>@@ -75,12 +75,12 @@ void vfio_address_space_insert(VFIOAddressSpace
>*space,
>
> int vfio_container_dma_map(VFIOContainerBase *bcontainer,
>                            hwaddr iova, ram_addr_t size,
>-                           void *vaddr, bool readonly)
>+                           void *vaddr, bool readonly, MemoryRegion *mr)
> {
>     VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
>
>     g_assert(vioc->dma_map);
>-    return vioc->dma_map(bcontainer, iova, size, vaddr, readonly);
>+    return vioc->dma_map(bcontainer, iova, size, vaddr, readonly, mr);
> }
>
> int vfio_container_dma_unmap(VFIOContainerBase *bcontainer,
>diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>index a9f0dba..a8c76eb 100644
>--- a/hw/vfio/container.c
>+++ b/hw/vfio/container.c
>@@ -207,7 +207,8 @@ static int vfio_legacy_dma_unmap(const
>VFIOContainerBase *bcontainer,
> }
>
> static int vfio_legacy_dma_map(const VFIOContainerBase *bcontainer, hwaddr
>iova,
>-                               ram_addr_t size, void *vaddr, bool readonly)
>+                               ram_addr_t size, void *vaddr, bool readonly,
>+                               MemoryRegion *mr)
> {
>     const VFIOContainer *container = container_of(bcontainer, VFIOContainer,
>                                                   bcontainer);
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index af1c7ab..a8cc543 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -34,7 +34,8 @@
>             TYPE_HOST_IOMMU_DEVICE_IOMMUFD "-vfio"
>
> static int iommufd_cdev_map(const VFIOContainerBase *bcontainer, hwaddr
>iova,
>-                            ram_addr_t size, void *vaddr, bool readonly)
>+                            ram_addr_t size, void *vaddr, bool readonly,
>+                            MemoryRegion *mr)
> {
>     const VFIOIOMMUFDContainer *container =
>         container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
>diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
>index 0afafe3..a1d2d25 100644
>--- a/hw/vfio/listener.c
>+++ b/hw/vfio/listener.c
>@@ -170,7 +170,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n,
>IOMMUTLBEntry *iotlb)
>          */
>         ret = vfio_container_dma_map(bcontainer, iova,
>                                      iotlb->addr_mask + 1, vaddr,
>-                                     read_only);
>+                                     read_only, mr);
>         if (ret) {
>             error_report("vfio_container_dma_map(%p, 0x%"HWADDR_PRIx", "
>                          "0x%"HWADDR_PRIx", %p) = %d (%s)",
>@@ -240,7 +240,7 @@ static int
>vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
>         vaddr = memory_region_get_ram_ptr(section->mr) + start;
>
>         ret = vfio_container_dma_map(bcontainer, iova, next - start,
>-                                     vaddr, section->readonly);
>+                                     vaddr, section->readonly, section->mr);
>         if (ret) {
>             /* Rollback */
>             vfio_ram_discard_notify_discard(rdl, section);
>@@ -564,7 +564,7 @@ static void vfio_listener_region_add(MemoryListener
>*listener,
>     }
>
>     ret = vfio_container_dma_map(bcontainer, iova, int128_get64(llsize),
>-                                 vaddr, section->readonly);
>+                                 vaddr, section->readonly, section->mr);
>     if (ret) {
>         error_setg(&err, "vfio_container_dma_map(%p, 0x%"HWADDR_PRIx", "
>                    "0x%"HWADDR_PRIx", %p) = %d (%s)",
>--
>1.8.3.1



* RE: [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
  2025-05-29 19:24 ` [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure Steve Sistare
@ 2025-06-03 10:40   ` Duan, Zhenzhong
  2025-06-03 14:09     ` Steven Sistare
  0 siblings, 1 reply; 90+ messages in thread
From: Duan, Zhenzhong @ 2025-06-03 10:40 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
>
>If vfio_realize fails after vfio_device_attach, it should call
>vfio_device_detach during error recovery.  If it fails after
>vfio_device_get_name, it should free vbasedev->name.  If it fails
>after vfio_pci_config_setup, it should free vdev->msix.
>
>To fix all, call vfio_pci_put_device().
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>---
> hw/vfio/pci.c | 1 +
> 1 file changed, 1 insertion(+)
>
>diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>index a1bfdfe..7d3b9ff 100644
>--- a/hw/vfio/pci.c
>+++ b/hw/vfio/pci.c
>@@ -3296,6 +3296,7 @@ out_teardown:
>     vfio_bars_exit(vdev);
> error:
>     error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name);
>+    vfio_pci_put_device(vdev);

This causes a double free: vfio_pci_put_device() is also called in vfio_instance_finalize().
Freeing vdev->vbasedev.name early will also break later users, e.g., trace_vfio_region_finalize(region->vbasedev->name, region->nr);

> }
>
> static void vfio_instance_finalize(Object *obj)
>--
>1.8.3.1




* RE: [PATCH V4 07/43] vfio: vfio_find_ram_discard_listener
  2025-05-29 19:24 ` [PATCH V4 07/43] vfio: vfio_find_ram_discard_listener Steve Sistare
@ 2025-06-03 10:59   ` Duan, Zhenzhong
  0 siblings, 0 replies; 90+ messages in thread
From: Duan, Zhenzhong @ 2025-06-03 10:59 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V4 07/43] vfio: vfio_find_ram_discard_listener
>
>Define vfio_find_ram_discard_listener as a subroutine so additional calls to
>it may be added in a subsequent patch.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>Reviewed-by: Cédric Le Goater <clg@redhat.com>

Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>

>---
> include/hw/vfio/vfio-container-base.h |  3 +++
> hw/vfio/listener.c                    | 35 ++++++++++++++++++++++-------------
> 2 files changed, 25 insertions(+), 13 deletions(-)
>
>diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-
>container-base.h
>index 83ba7a5..01cdcb6 100644
>--- a/include/hw/vfio/vfio-container-base.h
>+++ b/include/hw/vfio/vfio-container-base.h
>@@ -196,4 +196,7 @@ struct VFIOIOMMUClass {
>     void (*release)(VFIOContainerBase *bcontainer);
> };
>
>+VFIORamDiscardListener *vfio_find_ram_discard_listener(
>+    VFIOContainerBase *bcontainer, MemoryRegionSection *section);
>+
> #endif /* HW_VFIO_VFIO_CONTAINER_BASE_H */
>diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
>index a1d2d25..fb1fd84 100644
>--- a/hw/vfio/listener.c
>+++ b/hw/vfio/listener.c
>@@ -456,6 +456,26 @@ static void vfio_device_error_append(VFIODevice
>*vbasedev, Error **errp)
>     }
> }
>
>+VFIORamDiscardListener *vfio_find_ram_discard_listener(
>+    VFIOContainerBase *bcontainer, MemoryRegionSection *section)
>+{
>+    VFIORamDiscardListener *vrdl = NULL;
>+
>+    QLIST_FOREACH(vrdl, &bcontainer->vrdl_list, next) {
>+        if (vrdl->mr == section->mr &&
>+            vrdl->offset_within_address_space ==
>+            section->offset_within_address_space) {
>+            break;
>+        }
>+    }
>+
>+    if (!vrdl) {
>+        hw_error("vfio: Trying to sync missing RAM discard listener");
>+        /* does not return */
>+    }
>+    return vrdl;
>+}
>+
> static void vfio_listener_region_add(MemoryListener *listener,
>                                      MemoryRegionSection *section)
> {
>@@ -1086,19 +1106,8 @@
>vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainerBase *bcontainer,
>                                             MemoryRegionSection *section)
> {
>     RamDiscardManager *rdm =
>memory_region_get_ram_discard_manager(section->mr);
>-    VFIORamDiscardListener *vrdl = NULL;
>-
>-    QLIST_FOREACH(vrdl, &bcontainer->vrdl_list, next) {
>-        if (vrdl->mr == section->mr &&
>-            vrdl->offset_within_address_space ==
>-            section->offset_within_address_space) {
>-            break;
>-        }
>-    }
>-
>-    if (!vrdl) {
>-        hw_error("vfio: Trying to sync missing RAM discard listener");
>-    }
>+    VFIORamDiscardListener *vrdl =
>+        vfio_find_ram_discard_listener(bcontainer, section);
>
>     /*
>      * We only want/can synchronize the bitmap for actually mapped parts -
>--
>1.8.3.1



* RE: [PATCH V4 08/43] vfio: move vfio-cpr.h
  2025-05-29 19:24 ` [PATCH V4 08/43] vfio: move vfio-cpr.h Steve Sistare
@ 2025-06-03 11:01   ` Duan, Zhenzhong
  0 siblings, 0 replies; 90+ messages in thread
From: Duan, Zhenzhong @ 2025-06-03 11:01 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V4 08/43] vfio: move vfio-cpr.h
>
>Move vfio-cpr.h to include/hw/vfio, because it will need to be included by
>other files there.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>Reviewed-by: Cédric Le Goater <clg@redhat.com>

Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>

>---
> MAINTAINERS                |  1 +
> hw/vfio/vfio-cpr.h         | 15 ---------------
> include/hw/vfio/vfio-cpr.h | 18 ++++++++++++++++++
> hw/vfio/container.c        |  2 +-
> hw/vfio/cpr.c              |  2 +-
> hw/vfio/iommufd.c          |  2 +-
> 6 files changed, 22 insertions(+), 18 deletions(-)
> delete mode 100644 hw/vfio/vfio-cpr.h
> create mode 100644 include/hw/vfio/vfio-cpr.h
>
>diff --git a/MAINTAINERS b/MAINTAINERS
>index e29fb4f..7b919d7 100644
>--- a/MAINTAINERS
>+++ b/MAINTAINERS
>@@ -3034,6 +3034,7 @@ CheckPoint and Restart (CPR)
> R: Steve Sistare <steven.sistare@oracle.com>
> S: Supported
> F: hw/vfio/cpr*
>+F: include/hw/vfio/vfio-cpr.h
> F: include/migration/cpr.h
> F: migration/cpr*
> F: tests/qtest/migration/cpr*
>diff --git a/hw/vfio/vfio-cpr.h b/hw/vfio/vfio-cpr.h
>deleted file mode 100644
>index 134b83a..0000000
>--- a/hw/vfio/vfio-cpr.h
>+++ /dev/null
>@@ -1,15 +0,0 @@
>-/*
>- * VFIO CPR
>- *
>- * Copyright (c) 2025 Oracle and/or its affiliates.
>- *
>- * SPDX-License-Identifier: GPL-2.0-or-later
>- */
>-
>-#ifndef HW_VFIO_CPR_H
>-#define HW_VFIO_CPR_H
>-
>-bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp);
>-void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer);
>-
>-#endif /* HW_VFIO_CPR_H */
>diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>new file mode 100644
>index 0000000..750ea5b
>--- /dev/null
>+++ b/include/hw/vfio/vfio-cpr.h
>@@ -0,0 +1,18 @@
>+/*
>+ * VFIO CPR
>+ *
>+ * Copyright (c) 2025 Oracle and/or its affiliates.
>+ *
>+ * SPDX-License-Identifier: GPL-2.0-or-later
>+ */
>+
>+#ifndef HW_VFIO_VFIO_CPR_H
>+#define HW_VFIO_VFIO_CPR_H
>+
>+struct VFIOContainerBase;
>+
>+bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
>+                                 Error **errp);
>+void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>+
>+#endif /* HW_VFIO_VFIO_CPR_H */
>diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>index a8c76eb..0f948d0 100644
>--- a/hw/vfio/container.c
>+++ b/hw/vfio/container.c
>@@ -33,8 +33,8 @@
> #include "qapi/error.h"
> #include "pci.h"
> #include "hw/vfio/vfio-container.h"
>+#include "hw/vfio/vfio-cpr.h"
> #include "vfio-helpers.h"
>-#include "vfio-cpr.h"
> #include "vfio-listener.h"
>
> #define TYPE_HOST_IOMMU_DEVICE_LEGACY_VFIO
>TYPE_HOST_IOMMU_DEVICE "-legacy-vfio"
>diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>index 3214184..0210e76 100644
>--- a/hw/vfio/cpr.c
>+++ b/hw/vfio/cpr.c
>@@ -8,9 +8,9 @@
> #include "qemu/osdep.h"
> #include "hw/vfio/vfio-device.h"
> #include "migration/misc.h"
>+#include "hw/vfio/vfio-cpr.h"
> #include "qapi/error.h"
> #include "system/runstate.h"
>-#include "vfio-cpr.h"
>
> static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
>                                     MigrationEvent *e, Error **errp)
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index a8cc543..eb2f88d 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -21,13 +21,13 @@
> #include "qapi/error.h"
> #include "system/iommufd.h"
> #include "hw/qdev-core.h"
>+#include "hw/vfio/vfio-cpr.h"
> #include "system/reset.h"
> #include "qemu/cutils.h"
> #include "qemu/chardev_open.h"
> #include "pci.h"
> #include "vfio-iommufd.h"
> #include "vfio-helpers.h"
>-#include "vfio-cpr.h"
> #include "vfio-listener.h"
>
> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD_VFIO             \
>--
>1.8.3.1



* RE: [PATCH V4 09/43] vfio/container: register container for cpr
  2025-05-29 19:24 ` [PATCH V4 09/43] vfio/container: register container for cpr Steve Sistare
  2025-06-01 15:21   ` Cédric Le Goater
@ 2025-06-03 11:57   ` Duan, Zhenzhong
  2025-06-03 14:09     ` Steven Sistare
  1 sibling, 1 reply; 90+ messages in thread
From: Duan, Zhenzhong @ 2025-06-03 11:57 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V4 09/43] vfio/container: register container for cpr
>
>Register a legacy container for cpr-transfer, replacing the generic CPR
>register call with a more specific legacy container register call.  Add a
>blocker if the kernel does not support VFIO_UPDATE_VADDR or
>VFIO_UNMAP_ALL.
>
>This is mostly boilerplate.  The fields to be saved and restored are added
>in subsequent patches.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>---
> include/hw/vfio/vfio-container.h |  2 ++
> include/hw/vfio/vfio-cpr.h       | 15 +++++++++
> hw/vfio/container.c              |  6 ++--
> hw/vfio/cpr-legacy.c             | 69
>++++++++++++++++++++++++++++++++++++++++
> hw/vfio/cpr.c                    |  5 ++-
> hw/vfio/meson.build              |  1 +
> 6 files changed, 92 insertions(+), 6 deletions(-)
> create mode 100644 hw/vfio/cpr-legacy.c
>
>diff --git a/include/hw/vfio/vfio-container.h b/include/hw/vfio/vfio-container.h
>index afc498d..21e5807 100644
>--- a/include/hw/vfio/vfio-container.h
>+++ b/include/hw/vfio/vfio-container.h
>@@ -10,6 +10,7 @@
> #define HW_VFIO_CONTAINER_H
>
> #include "hw/vfio/vfio-container-base.h"
>+#include "hw/vfio/vfio-cpr.h"

Now that we have this change, can we remove the #include of vfio-cpr.h in hw/vfio/container.c?
Maybe this belongs in patch 8?

>
> typedef struct VFIOContainer VFIOContainer;
> typedef struct VFIODevice VFIODevice;
>@@ -29,6 +30,7 @@ typedef struct VFIOContainer {
>     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>     unsigned iommu_type;
>     QLIST_HEAD(, VFIOGroup) group_list;
>+    VFIOContainerCPR cpr;
> } VFIOContainer;
>
> OBJECT_DECLARE_SIMPLE_TYPE(VFIOContainer, VFIO_IOMMU_LEGACY);
>diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>index 750ea5b..d4e0bd5 100644
>--- a/include/hw/vfio/vfio-cpr.h
>+++ b/include/hw/vfio/vfio-cpr.h
>@@ -9,8 +9,23 @@
> #ifndef HW_VFIO_VFIO_CPR_H
> #define HW_VFIO_VFIO_CPR_H
>
>+#include "migration/misc.h"
>+
>+struct VFIOContainer;
> struct VFIOContainerBase;
>
>+typedef struct VFIOContainerCPR {
>+    Error *blocker;
>+} VFIOContainerCPR;
>+
>+
>+bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>+                                        Error **errp);
>+void vfio_legacy_cpr_unregister_container(struct VFIOContainer *container);
>+
>+int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
>+                             Error **errp);
>+
> bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
>                                  Error **errp);
> void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>index 0f948d0..7d2035c 100644
>--- a/hw/vfio/container.c
>+++ b/hw/vfio/container.c
>@@ -643,7 +643,7 @@ static bool vfio_container_connect(VFIOGroup *group,
>AddressSpace *as,
>     new_container = true;
>     bcontainer = &container->bcontainer;
>
>-    if (!vfio_cpr_register_container(bcontainer, errp)) {
>+    if (!vfio_legacy_cpr_register_container(container, errp)) {
>         goto fail;
>     }
>
>@@ -679,7 +679,7 @@ fail:
>         vioc->release(bcontainer);
>     }
>     if (new_container) {
>-        vfio_cpr_unregister_container(bcontainer);
>+        vfio_legacy_cpr_unregister_container(container);
>         object_unref(container);
>     }
>     if (fd >= 0) {
>@@ -720,7 +720,7 @@ static void vfio_container_disconnect(VFIOGroup *group)
>         VFIOAddressSpace *space = bcontainer->space;
>
>         trace_vfio_container_disconnect(container->fd);
>-        vfio_cpr_unregister_container(bcontainer);
>+        vfio_legacy_cpr_unregister_container(container);
>         close(container->fd);
>         object_unref(container);
>
>diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
>new file mode 100644
>index 0000000..419b9fb
>--- /dev/null
>+++ b/hw/vfio/cpr-legacy.c
>@@ -0,0 +1,69 @@
>+/*
>+ * Copyright (c) 2021-2025 Oracle and/or its affiliates.
>+ *
>+ * SPDX-License-Identifier: GPL-2.0-or-later
>+ */
>+
>+#include <sys/ioctl.h>
>+#include <linux/vfio.h>
>+#include "qemu/osdep.h"
>+#include "hw/vfio/vfio-container.h"
>+#include "hw/vfio/vfio-cpr.h"

Ditto.

>+#include "migration/blocker.h"
>+#include "migration/cpr.h"
>+#include "migration/migration.h"
>+#include "migration/vmstate.h"
>+#include "qapi/error.h"
>+
>+static bool vfio_cpr_supported(VFIOContainer *container, Error **errp)
>+{
>+    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR)) {
>+        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR");
>+        return false;
>+
>+    } else if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
>+        error_setg(errp, "VFIO container does not support VFIO_UNMAP_ALL");
>+        return false;
>+
>+    } else {
>+        return true;
>+    }
>+}
>+
>+static const VMStateDescription vfio_container_vmstate = {
>+    .name = "vfio-container",
>+    .version_id = 0,
>+    .minimum_version_id = 0,
>+    .needed = cpr_incoming_needed,
>+    .fields = (VMStateField[]) {
>+        VMSTATE_END_OF_LIST()
>+    }
>+};
>+
>+bool vfio_legacy_cpr_register_container(VFIOContainer *container, Error
>**errp)
>+{
>+    VFIOContainerBase *bcontainer = &container->bcontainer;
>+    Error **cpr_blocker = &container->cpr.blocker;
>+
>+    migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
>+                                vfio_cpr_reboot_notifier,
>+                                MIG_MODE_CPR_REBOOT);
>+
>+    if (!vfio_cpr_supported(container, cpr_blocker)) {
>+        return migrate_add_blocker_modes(cpr_blocker, errp,
>+                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
>+    }
>+
>+    vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>+
>+    return true;
>+}
>+
>+void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
>+{
>+    VFIOContainerBase *bcontainer = &container->bcontainer;
>+
>+    migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
>+    migrate_del_blocker(&container->cpr.blocker);
>+    vmstate_unregister(NULL, &vfio_container_vmstate, container);
>+}
>diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>index 0210e76..0e59612 100644
>--- a/hw/vfio/cpr.c
>+++ b/hw/vfio/cpr.c
>@@ -7,13 +7,12 @@
>
> #include "qemu/osdep.h"
> #include "hw/vfio/vfio-device.h"
>-#include "migration/misc.h"
> #include "hw/vfio/vfio-cpr.h"
> #include "qapi/error.h"
> #include "system/runstate.h"
>
>-static int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
>-                                    MigrationEvent *e, Error **errp)
>+int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
>+                             MigrationEvent *e, Error **errp)
> {
>     if (e->type == MIG_EVENT_PRECOPY_SETUP &&
>         !runstate_check(RUN_STATE_SUSPENDED) && !vm_get_suspended()) {
>diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>index bccb050..73d29f9 100644
>--- a/hw/vfio/meson.build
>+++ b/hw/vfio/meson.build
>@@ -21,6 +21,7 @@ system_ss.add(when: 'CONFIG_VFIO_XGMAC', if_true: files('calxeda-xgmac.c'))
> system_ss.add(when: 'CONFIG_VFIO_AMD_XGBE', if_true: files('amd-xgbe.c'))
> system_ss.add(when: 'CONFIG_VFIO', if_true: files(
>   'cpr.c',
>+  'cpr-legacy.c',
>   'device.c',
>   'migration.c',
>   'migration-multifd.c',
>--
>1.8.3.1




* RE: [PATCH V4 10/43] vfio/container: preserve descriptors
  2025-05-29 19:24 ` [PATCH V4 10/43] vfio/container: preserve descriptors Steve Sistare
  2025-06-01 16:57   ` Cédric Le Goater
@ 2025-06-03 11:57   ` Duan, Zhenzhong
  1 sibling, 0 replies; 90+ messages in thread
From: Duan, Zhenzhong @ 2025-06-03 11:57 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>

>Subject: [PATCH V4 10/43] vfio/container: preserve descriptors
>
>At vfio creation time, save the value of vfio container, group, and device
>descriptors in CPR state.  On qemu restart, vfio_realize() finds and uses
>the saved descriptors.
>
>During reuse, device and iommu state is already configured, so operations
>in vfio_realize that would modify the configuration, such as vfio ioctl's,
>are skipped.  The result is that vfio_realize constructs qemu data
>structures that reflect the current state of the device.
>
>Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
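
For readers following the thread, the reuse pattern described in the commit
message can be sketched as a lookup keyed by (name, id) that falls back to
open().  This is only a minimal illustration under stated assumptions, not
QEMU's CPR code: every sketch_* name and the fixed-size table are
hypothetical.  Only the overall flow (find, else open, then save) follows
what vfio_cpr_group_get_device_fd does in the patch.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical fixed-size table standing in for saved CPR state. */
#define SKETCH_MAX_FDS 16

struct sketch_fd_entry {
    char name[64];   /* longer names are truncated in this sketch */
    int id;
    int fd;
    int used;
};

static struct sketch_fd_entry sketch_table[SKETCH_MAX_FDS];

/* Return the saved descriptor for (name, id), or -1 if none is saved. */
static int sketch_find_fd(const char *name, int id)
{
    assert(name != NULL);
    for (int i = 0; i < SKETCH_MAX_FDS; i++) {
        struct sketch_fd_entry *e = &sketch_table[i];
        if (e->used && e->id == id && strcmp(e->name, name) == 0) {
            return e->fd;
        }
    }
    return -1;
}

/* Remember fd under (name, id).  Returns 0, or -1 if the table is full. */
static int sketch_save_fd(const char *name, int id, int fd)
{
    for (int i = 0; i < SKETCH_MAX_FDS; i++) {
        struct sketch_fd_entry *e = &sketch_table[i];
        if (!e->used) {
            snprintf(e->name, sizeof(e->name), "%s", name);
            e->id = id;
            e->fd = fd;
            e->used = 1;
            return 0;
        }
    }
    return -1;
}

/* Forget (name, id).  The caller remains responsible for close(). */
static void sketch_delete_fd(const char *name, int id)
{
    for (int i = 0; i < SKETCH_MAX_FDS; i++) {
        struct sketch_fd_entry *e = &sketch_table[i];
        if (e->used && e->id == id && strcmp(e->name, name) == 0) {
            e->used = 0;
        }
    }
}

/* Reuse a saved descriptor when one exists; otherwise open and save it. */
static int sketch_open_fd(const char *path, int flags,
                          const char *name, int id)
{
    int fd = sketch_find_fd(name, id);

    if (fd < 0) {
        fd = open(path, flags);
        if (fd >= 0 && sketch_save_fd(name, id, fd) < 0) {
            close(fd);
            fd = -1;
        }
    }
    return fd;
}
```

A second sketch_open_fd() call with the same (name, id) returns the same
descriptor number without calling open() again.  That is the property the
new process relies on when it inherits already-configured descriptors.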

>---
> include/hw/vfio/vfio-cpr.h |  6 +++++
> hw/vfio/container.c        | 67 +++++++++++++++++++++++++++++++++++-----------
> hw/vfio/cpr-legacy.c       | 42 +++++++++++++++++++++++++++++
> 3 files changed, 100 insertions(+), 15 deletions(-)
>
>diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>index d4e0bd5..5a2e5f6 100644
>--- a/include/hw/vfio/vfio-cpr.h
>+++ b/include/hw/vfio/vfio-cpr.h
>@@ -13,6 +13,7 @@
>
> struct VFIOContainer;
> struct VFIOContainerBase;
>+struct VFIOGroup;
>
> typedef struct VFIOContainerCPR {
>     Error *blocker;
>@@ -30,4 +31,9 @@ bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
>                                  Error **errp);
> void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>
>+int vfio_cpr_group_get_device_fd(int d, const char *name);
>+
>+bool vfio_cpr_container_match(struct VFIOContainer *container,
>+                              struct VFIOGroup *group, int fd);
>+
> #endif /* HW_VFIO_VFIO_CPR_H */
>diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>index 7d2035c..798abda 100644
>--- a/hw/vfio/container.c
>+++ b/hw/vfio/container.c
>@@ -31,6 +31,8 @@
> #include "system/reset.h"
> #include "trace.h"
> #include "qapi/error.h"
>+#include "migration/cpr.h"
>+#include "migration/blocker.h"
> #include "pci.h"
> #include "hw/vfio/vfio-container.h"
> #include "hw/vfio/vfio-cpr.h"
>@@ -426,7 +428,12 @@ static VFIOContainer *vfio_create_container(int fd, VFIOGroup *group,
>         return NULL;
>     }
>
>-    if (!vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
>+    /*
>+     * During CPR, just set the container type and skip the ioctls, as the
>+     * container and group are already configured in the kernel.
>+     */
>+    if (!cpr_is_incoming() &&
>+        !vfio_set_iommu(fd, group->fd, &iommu_type, errp)) {
>         return NULL;
>     }
>
>@@ -593,6 +600,11 @@ static bool vfio_container_group_add(VFIOContainer *container, VFIOGroup *group,
>     group->container = container;
>     QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>     vfio_group_add_kvm_device(group);
>+    /*
>+     * Remember the container fd for each group, so we can attach to the same
>+     * container after CPR.
>+     */
>+    cpr_resave_fd("vfio_container_for_group", group->groupid, container->fd);
>     return true;
> }
>
>@@ -602,6 +614,7 @@ static void vfio_container_group_del(VFIOContainer *container, VFIOGroup *group)
>     group->container = NULL;
>     vfio_group_del_kvm_device(group);
>     vfio_ram_block_discard_disable(container, false);
>+    cpr_delete_fd("vfio_container_for_group", group->groupid);
> }
>
> static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>@@ -616,17 +629,34 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>     bool group_was_added = false;
>
>     space = vfio_address_space_get(as);
>+    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
>
>-    QLIST_FOREACH(bcontainer, &space->containers, next) {
>-        container = container_of(bcontainer, VFIOContainer, bcontainer);
>-        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>-            return vfio_container_group_add(container, group, errp);
>+    if (!cpr_is_incoming()) {
>+        QLIST_FOREACH(bcontainer, &space->containers, next) {
>+            container = container_of(bcontainer, VFIOContainer, bcontainer);
>+            if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>+                return vfio_container_group_add(container, group, errp);
>+            }
>         }
>-    }
>
>-    fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
>-    if (fd < 0) {
>-        goto fail;
>+        fd = qemu_open("/dev/vfio/vfio", O_RDWR, errp);
>+        if (fd < 0) {
>+            goto fail;
>+        }
>+    } else {
>+        /*
>+         * For incoming CPR, the group is already attached in the kernel.
>+         * If a container with matching fd is found, then update the
>+         * userland group list and return.  If not, then after the loop,
>+         * create the container struct and group list.
>+         */
>+        QLIST_FOREACH(bcontainer, &space->containers, next) {
>+            container = container_of(bcontainer, VFIOContainer, bcontainer);
>+
>+            if (vfio_cpr_container_match(container, group, fd)) {
>+                return vfio_container_group_add(container, group, errp);
>+            }
>+        }
>     }
>
>     ret = ioctl(fd, VFIO_GET_API_VERSION);
>@@ -698,6 +728,7 @@ static void vfio_container_disconnect(VFIOGroup *group)
>
>     QLIST_REMOVE(group, container_next);
>     group->container = NULL;
>+    cpr_delete_fd("vfio_container_for_group", group->groupid);
>
>     /*
>      * Explicitly release the listener first before unset container,
>@@ -751,7 +782,7 @@ static VFIOGroup *vfio_group_get(int groupid, AddressSpace *as, Error **errp)
>     group = g_malloc0(sizeof(*group));
>
>     snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
>-    group->fd = qemu_open(path, O_RDWR, errp);
>+    group->fd = cpr_open_fd(path, O_RDWR, "vfio_group", groupid, errp);
>     if (group->fd < 0) {
>         goto free_group_exit;
>     }
>@@ -783,6 +814,7 @@ static VFIOGroup *vfio_group_get(int groupid, AddressSpace *as, Error **errp)
>     return group;
>
> close_fd_exit:
>+    cpr_delete_fd("vfio_group", groupid);
>     close(group->fd);
>
> free_group_exit:
>@@ -804,6 +836,7 @@ static void vfio_group_put(VFIOGroup *group)
>     vfio_container_disconnect(group);
>     QLIST_REMOVE(group, next);
>     trace_vfio_group_put(group->fd);
>+    cpr_delete_fd("vfio_group", group->groupid);
>     close(group->fd);
>     g_free(group);
> }
>@@ -814,7 +847,7 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
>     g_autofree struct vfio_device_info *info = NULL;
>     int fd;
>
>-    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>+    fd = vfio_cpr_group_get_device_fd(group->fd, name);
>     if (fd < 0) {
>         error_setg_errno(errp, errno, "error getting device from group %d",
>                          group->groupid);
>@@ -827,8 +860,7 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
>     info = vfio_get_device_info(fd);
>     if (!info) {
>         error_setg_errno(errp, errno, "error getting device info");
>-        close(fd);
>-        return false;
>+        goto fail;
>     }
>
>     /*
>@@ -842,8 +874,7 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
>         if (!QLIST_EMPTY(&group->device_list)) {
>             error_setg(errp, "Inconsistent setting of support for discarding "
>                        "RAM (e.g., balloon) within group");
>-            close(fd);
>-            return false;
>+            goto fail;
>         }
>
>         if (!group->ram_block_discard_allowed) {
>@@ -861,6 +892,11 @@ static bool vfio_device_get(VFIOGroup *group, const char *name,
>     trace_vfio_device_get(name, info->flags, info->num_regions, info->num_irqs);
>
>     return true;
>+
>+fail:
>+    close(fd);
>+    cpr_delete_fd(name, 0);
>+    return false;
> }
>
> static void vfio_device_put(VFIODevice *vbasedev)
>@@ -871,6 +907,7 @@ static void vfio_device_put(VFIODevice *vbasedev)
>     QLIST_REMOVE(vbasedev, next);
>     vbasedev->group = NULL;
>     trace_vfio_device_put(vbasedev->fd);
>+    cpr_delete_fd(vbasedev->name, 0);
>     close(vbasedev->fd);
> }
>
>diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
>index 419b9fb..29be64f 100644
>--- a/hw/vfio/cpr-legacy.c
>+++ b/hw/vfio/cpr-legacy.c
>@@ -9,6 +9,7 @@
> #include "qemu/osdep.h"
> #include "hw/vfio/vfio-container.h"
> #include "hw/vfio/vfio-cpr.h"
>+#include "hw/vfio/vfio-device.h"
> #include "migration/blocker.h"
> #include "migration/cpr.h"
> #include "migration/migration.h"
>@@ -67,3 +68,44 @@ void vfio_legacy_cpr_unregister_container(VFIOContainer *container)
>     migrate_del_blocker(&container->cpr.blocker);
>     vmstate_unregister(NULL, &vfio_container_vmstate, container);
> }
>+
>+int vfio_cpr_group_get_device_fd(int d, const char *name)
>+{
>+    const int id = 0;
>+    int fd = cpr_find_fd(name, id);
>+
>+    if (fd < 0) {
>+        fd = ioctl(d, VFIO_GROUP_GET_DEVICE_FD, name);
>+        if (fd >= 0) {
>+            cpr_save_fd(name, id, fd);
>+        }
>+    }
>+    return fd;
>+}
>+
>+static bool same_device(int fd1, int fd2)
>+{
>+    struct stat st1, st2;
>+
>+    return !fstat(fd1, &st1) && !fstat(fd2, &st2) && st1.st_dev == st2.st_dev;
>+}
>+
>+bool vfio_cpr_container_match(VFIOContainer *container, VFIOGroup *group,
>+                              int fd)
>+{
>+    if (container->fd == fd) {
>+        return true;
>+    }
>+    if (!same_device(container->fd, fd)) {
>+        return false;
>+    }
>+    /*
>+     * Same device, different fd.  This occurs when the container fd is
>+     * cpr_save'd multiple times, once for each groupid, so SCM_RIGHTS
>+     * produces duplicates.  De-dup it.
>+     */
>+    cpr_delete_fd("vfio_container_for_group", group->groupid);
>+    close(fd);
>+    cpr_save_fd("vfio_container_for_group", group->groupid, container->fd);
>+    return true;
>+}
>--
>1.8.3.1
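
The de-dup step in vfio_cpr_container_match can be illustrated outside QEMU.
The sketch below is a standalone example, not the patch's code: the sketch_*
names are hypothetical.  It also makes one deliberate change, labeled here as
an assumption: the patch's same_device() compares only st_dev, while this
version additionally compares st_ino so that the example can tell two
unrelated pipes apart.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * True if both descriptors refer to the same underlying file.
 * Assumption of this sketch: compare st_dev and st_ino (the patch
 * compares st_dev only).
 */
static bool sketch_same_file(int fd1, int fd2)
{
    struct stat st1, st2;

    return !fstat(fd1, &st1) && !fstat(fd2, &st2) &&
           st1.st_dev == st2.st_dev && st1.st_ino == st2.st_ino;
}

/*
 * If 'received' duplicates 'known' (for example, because the same file
 * was passed twice over SCM_RIGHTS), close the duplicate and return
 * 'known'.  Otherwise return 'received' unchanged.
 */
static int sketch_dedup_fd(int known, int received)
{
    assert(known >= 0 && received >= 0);

    if (known == received) {
        return known;
    }
    if (sketch_same_file(known, received)) {
        close(received);
        return known;
    }
    return received;
}
```

A descriptor created with dup() has a different number but the same
underlying file, so sketch_dedup_fd() closes it and hands back the original.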




* RE: [PATCH V4 00/43] Live update: vfio and iommufd
  2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
                   ` (43 preceding siblings ...)
  2025-06-01 17:26 ` [PATCH V4 00/43] Live update: vfio and iommufd Cédric Le Goater
@ 2025-06-03 12:09 ` Duan, Zhenzhong
  2025-06-03 14:09   ` Steven Sistare
  44 siblings, 1 reply; 90+ messages in thread
From: Duan, Zhenzhong @ 2025-06-03 12:09 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steve Sistare <steven.sistare@oracle.com>
>Subject: [PATCH V4 00/43] Live update: vfio and iommufd
>
>Support vfio and iommufd devices with the cpr-transfer live migration mode.
>Devices that do not support live migration can still support cpr-transfer,
>allowing live update to a new version of QEMU on the same host, with no loss
>of guest connectivity.

Just curious. My understanding is: for a device that does not support live migration, the
device is not stopped during cpr-transfer and no device state is saved or restored.
But for a device that supports live migration, it is stopped, and its state is saved on the
source and restored on the destination. Is that right?

Thanks
Zhenzhong


* Re: [PATCH V4 00/43] Live update: vfio and iommufd
  2025-06-03 12:09 ` Duan, Zhenzhong
@ 2025-06-03 14:09   ` Steven Sistare
  0 siblings, 0 replies; 90+ messages in thread
From: Steven Sistare @ 2025-06-03 14:09 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 6/3/2025 8:09 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steve Sistare <steven.sistare@oracle.com>
>> Subject: [PATCH V4 00/43] Live update: vfio and iommufd
>>
>> Support vfio and iommufd devices with the cpr-transfer live migration mode.
>> Devices that do not support live migration can still support cpr-transfer,
>> allowing live update to a new version of QEMU on the same host, with no loss
>> of guest connectivity.
> 
> Just curious. My understanding is: for a device that does not support live migration, the
> device is not stopped during cpr-transfer and no device state is saved or restored.

Yes.

> But for a device that supports live migration, it is stopped, and its state is saved on the
> source and restored on the destination. Is that right?

As currently written, yes.
However, it may be faster and more reliable if I disable savevm_vfio_handlers and the
associated notifiers during CPR, and only use the CPR mechanism for preserving the device.

- Steve



* Re: [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
  2025-06-03 10:40   ` Duan, Zhenzhong
@ 2025-06-03 14:09     ` Steven Sistare
  2025-06-04  3:55       ` Duan, Zhenzhong
  0 siblings, 1 reply; 90+ messages in thread
From: Steven Sistare @ 2025-06-03 14:09 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 6/3/2025 6:40 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steve Sistare <steven.sistare@oracle.com>
>> Subject: [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
>>
>> If vfio_realize fails after vfio_device_attach, it should call
>> vfio_device_detach during error recovery.  If it fails after
>> vfio_device_get_name, it should free vbasedev->name.  If it fails
>> after vfio_pci_config_setup, it should free vdev->msix.
>>
>> To fix all, call vfio_pci_put_device().
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> hw/vfio/pci.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index a1bfdfe..7d3b9ff 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -3296,6 +3296,7 @@ out_teardown:
>>      vfio_bars_exit(vdev);
>> error:
>>      error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name);
>> +    vfio_pci_put_device(vdev);
> 
> Double free, vfio_pci_put_device() is also called in vfio_instance_finalize().

If vfio_realize fails with an error, vfio_instance_finalize is not called.
I tested that.

> Early free of vdev->vbasedev.name will also break something, e.g., trace_vfio_region_finalize(region->vbasedev->name, region->nr);

All unwinding, and every call to a function that might use the name, happens in the
vfio_realize failure path before vfio_pci_put_device is called.  That call is the very
last operation, and the last thing it does is free the name string.
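
To make the ordering argument concrete, here is a small standalone sketch of
the same unwinding shape.  It is not the vfio code: the sketch_* names, the
struct, and the fail_at parameter are all hypothetical.  It only shows a
single error label that uses the name for the message first and releases it
last.  Clearing pointers after free() is a choice made in this sketch so that
a second release is harmless.  Nothing in this thread says whether
vfio_pci_put_device does the same.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct sketch_dev {
    char *name;      /* acquired first, released last */
    int *msix;       /* stands in for state set up late in realize */
    bool attached;
};

/* Release everything the device owns.  The name goes last. */
static void sketch_put_device(struct sketch_dev *dev)
{
    assert(dev != NULL);
    free(dev->msix);
    dev->msix = NULL;
    dev->attached = false;
    free(dev->name);
    dev->name = NULL;
}

/*
 * fail_at: 0 = succeed, 1 = fail after the name is set,
 *          2 = fail after attach, 3 = fail after msix setup.
 */
static bool sketch_realize(struct sketch_dev *dev, int fail_at,
                           char *errbuf, size_t errlen)
{
    dev->name = malloc(16);
    if (!dev->name) {
        return false;
    }
    strcpy(dev->name, "sketch-dev");
    if (fail_at == 1) {
        goto error;
    }

    dev->attached = true;
    if (fail_at == 2) {
        goto error;
    }

    dev->msix = calloc(1, sizeof(*dev->msix));
    if (!dev->msix || fail_at == 3) {
        goto error;
    }
    return true;

error:
    /* The name is still valid here, so the message can include it. */
    snprintf(errbuf, errlen, "%s: realize failed at step %d",
             dev->name, fail_at);
    /* Last operation on the failure path: release all, name last. */
    sketch_put_device(dev);
    return false;
}
```

Whichever step fails, the error text is built while the name is still
allocated, and the caller sees a device with no dangling pointers.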

- Steve

>> static void vfio_instance_finalize(Object *obj)
>> --
>> 1.8.3.1
> 




* Re: [PATCH V4 09/43] vfio/container: register container for cpr
  2025-06-03 11:57   ` Duan, Zhenzhong
@ 2025-06-03 14:09     ` Steven Sistare
  2025-06-03 14:17       ` Steven Sistare
  0 siblings, 1 reply; 90+ messages in thread
From: Steven Sistare @ 2025-06-03 14:09 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 6/3/2025 7:57 AM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steve Sistare <steven.sistare@oracle.com>
>> Subject: [PATCH V4 09/43] vfio/container: register container for cpr
>>
>> Register a legacy container for cpr-transfer, replacing the generic CPR
>> register call with a more specific legacy container register call.  Add a
>> blocker if the kernel does not support VFIO_UPDATE_VADDR or
>> VFIO_UNMAP_ALL.
>>
>> This is mostly boilerplate.  The fields to be saved and restored are added
>> in subsequent patches.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>> include/hw/vfio/vfio-container.h |  2 ++
>> include/hw/vfio/vfio-cpr.h       | 15 +++++++++
>> hw/vfio/container.c              |  6 ++--
>> hw/vfio/cpr-legacy.c             | 69 ++++++++++++++++++++++++++++++++++++++++
>> hw/vfio/cpr.c                    |  5 ++-
>> hw/vfio/meson.build              |  1 +
>> 6 files changed, 92 insertions(+), 6 deletions(-)
>> create mode 100644 hw/vfio/cpr-legacy.c
>>
>> diff --git a/include/hw/vfio/vfio-container.h b/include/hw/vfio/vfio-container.h
>> index afc498d..21e5807 100644
>> --- a/include/hw/vfio/vfio-container.h
>> +++ b/include/hw/vfio/vfio-container.h
>> @@ -10,6 +10,7 @@
>> #define HW_VFIO_CONTAINER_H
>>
>> #include "hw/vfio/vfio-container-base.h"
>> +#include "hw/vfio/vfio-cpr.h"
> 
> Now that we have this change, may we remove #include of vfio-cpr.h in hw/vfio/container.c?
> Maybe this belongs in patch 8?

Yes, thanks.
Patch 8 should not add #include of vfio-cpr.h in hw/vfio/container.c

>> typedef struct VFIOContainer VFIOContainer;
>> typedef struct VFIODevice VFIODevice;
>> @@ -29,6 +30,7 @@ typedef struct VFIOContainer {
>>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>>      unsigned iommu_type;
>>      QLIST_HEAD(, VFIOGroup) group_list;
>> +    VFIOContainerCPR cpr;
>> } VFIOContainer;
>>
>> OBJECT_DECLARE_SIMPLE_TYPE(VFIOContainer, VFIO_IOMMU_LEGACY);
>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>> index 750ea5b..d4e0bd5 100644
>> --- a/include/hw/vfio/vfio-cpr.h
>> +++ b/include/hw/vfio/vfio-cpr.h
>> @@ -9,8 +9,23 @@
>> #ifndef HW_VFIO_VFIO_CPR_H
>> #define HW_VFIO_VFIO_CPR_H
>>
>> +#include "migration/misc.h"
>> +
>> +struct VFIOContainer;
>> struct VFIOContainerBase;
>>
>> +typedef struct VFIOContainerCPR {
>> +    Error *blocker;
>> +} VFIOContainerCPR;
>> +
>> +
>> +bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>> +                                        Error **errp);
>> +void vfio_legacy_cpr_unregister_container(struct VFIOContainer *container);
>> +
>> +int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
>> +                             Error **errp);
>> +
>> bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
>>                                   Error **errp);
>> void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>> index 0f948d0..7d2035c 100644
>> --- a/hw/vfio/container.c
>> +++ b/hw/vfio/container.c
>> @@ -643,7 +643,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>      new_container = true;
>>      bcontainer = &container->bcontainer;
>>
>> -    if (!vfio_cpr_register_container(bcontainer, errp)) {
>> +    if (!vfio_legacy_cpr_register_container(container, errp)) {
>>          goto fail;
>>      }
>>
>> @@ -679,7 +679,7 @@ fail:
>>          vioc->release(bcontainer);
>>      }
>>      if (new_container) {
>> -        vfio_cpr_unregister_container(bcontainer);
>> +        vfio_legacy_cpr_unregister_container(container);
>>          object_unref(container);
>>      }
>>      if (fd >= 0) {
>> @@ -720,7 +720,7 @@ static void vfio_container_disconnect(VFIOGroup *group)
>>          VFIOAddressSpace *space = bcontainer->space;
>>
>>          trace_vfio_container_disconnect(container->fd);
>> -        vfio_cpr_unregister_container(bcontainer);
>> +        vfio_legacy_cpr_unregister_container(container);
>>          close(container->fd);
>>          object_unref(container);
>>
>> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
>> new file mode 100644
>> index 0000000..419b9fb
>> --- /dev/null
>> +++ b/hw/vfio/cpr-legacy.c
>> @@ -0,0 +1,69 @@
>> +/*
>> + * Copyright (c) 2021-2025 Oracle and/or its affiliates.
>> + *
>> + * SPDX-License-Identifier: GPL-2.0-or-later
>> + */
>> +
>> +#include <sys/ioctl.h>
>> +#include <linux/vfio.h>
>> +#include "qemu/osdep.h"
>> +#include "hw/vfio/vfio-container.h"
>> +#include "hw/vfio/vfio-cpr.h"
> 
> Ditto.

Yes, this #include vfio-cpr.h should be dropped from this patch.

- Steve




* Re: [PATCH V4 09/43] vfio/container: register container for cpr
  2025-06-03 14:09     ` Steven Sistare
@ 2025-06-03 14:17       ` Steven Sistare
  2025-06-03 15:27         ` Cédric Le Goater
  0 siblings, 1 reply; 90+ messages in thread
From: Steven Sistare @ 2025-06-03 14:17 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 6/3/2025 10:09 AM, Steven Sistare wrote:
> On 6/3/2025 7:57 AM, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Steve Sistare <steven.sistare@oracle.com>
>>> Subject: [PATCH V4 09/43] vfio/container: register container for cpr
>>>
>>> Register a legacy container for cpr-transfer, replacing the generic CPR
>>> register call with a more specific legacy container register call.  Add a
>>> blocker if the kernel does not support VFIO_UPDATE_VADDR or
>>> VFIO_UNMAP_ALL.
>>>
>>> This is mostly boilerplate.  The fields to be saved and restored are added
>>> in subsequent patches.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>> include/hw/vfio/vfio-container.h |  2 ++
>>> include/hw/vfio/vfio-cpr.h       | 15 +++++++++
>>> hw/vfio/container.c              |  6 ++--
>>> hw/vfio/cpr-legacy.c             | 69 ++++++++++++++++++++++++++++++++++++++++
>>> hw/vfio/cpr.c                    |  5 ++-
>>> hw/vfio/meson.build              |  1 +
>>> 6 files changed, 92 insertions(+), 6 deletions(-)
>>> create mode 100644 hw/vfio/cpr-legacy.c
>>>
>>> diff --git a/include/hw/vfio/vfio-container.h b/include/hw/vfio/vfio-container.h
>>> index afc498d..21e5807 100644
>>> --- a/include/hw/vfio/vfio-container.h
>>> +++ b/include/hw/vfio/vfio-container.h
>>> @@ -10,6 +10,7 @@
>>> #define HW_VFIO_CONTAINER_H
>>>
>>> #include "hw/vfio/vfio-container-base.h"
>>> +#include "hw/vfio/vfio-cpr.h"
>>
>> Now that we have this change, may we remove #include of vfio-cpr.h in hw/vfio/container.c?
>> Maybe this belongs in patch 8?
> 
> Yes, thanks.
> Patch 8 should not add #include of vfio-cpr.h in hw/vfio/container.c

However, I see that Cedric has staged these patches in vfio-next.
We can make these tweaks in a future patch.

- Steve

>>> typedef struct VFIOContainer VFIOContainer;
>>> typedef struct VFIODevice VFIODevice;
>>> @@ -29,6 +30,7 @@ typedef struct VFIOContainer {
>>>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>>>      unsigned iommu_type;
>>>      QLIST_HEAD(, VFIOGroup) group_list;
>>> +    VFIOContainerCPR cpr;
>>> } VFIOContainer;
>>>
>>> OBJECT_DECLARE_SIMPLE_TYPE(VFIOContainer, VFIO_IOMMU_LEGACY);
>>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>>> index 750ea5b..d4e0bd5 100644
>>> --- a/include/hw/vfio/vfio-cpr.h
>>> +++ b/include/hw/vfio/vfio-cpr.h
>>> @@ -9,8 +9,23 @@
>>> #ifndef HW_VFIO_VFIO_CPR_H
>>> #define HW_VFIO_VFIO_CPR_H
>>>
>>> +#include "migration/misc.h"
>>> +
>>> +struct VFIOContainer;
>>> struct VFIOContainerBase;
>>>
>>> +typedef struct VFIOContainerCPR {
>>> +    Error *blocker;
>>> +} VFIOContainerCPR;
>>> +
>>> +
>>> +bool vfio_legacy_cpr_register_container(struct VFIOContainer *container,
>>> +                                        Error **errp);
>>> +void vfio_legacy_cpr_unregister_container(struct VFIOContainer *container);
>>> +
>>> +int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
>>> +                             Error **errp);
>>> +
>>> bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
>>>                                   Error **errp);
>>> void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>>> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
>>> index 0f948d0..7d2035c 100644
>>> --- a/hw/vfio/container.c
>>> +++ b/hw/vfio/container.c
>>> @@ -643,7 +643,7 @@ static bool vfio_container_connect(VFIOGroup *group, AddressSpace *as,
>>>      new_container = true;
>>>      bcontainer = &container->bcontainer;
>>>
>>> -    if (!vfio_cpr_register_container(bcontainer, errp)) {
>>> +    if (!vfio_legacy_cpr_register_container(container, errp)) {
>>>          goto fail;
>>>      }
>>>
>>> @@ -679,7 +679,7 @@ fail:
>>>          vioc->release(bcontainer);
>>>      }
>>>      if (new_container) {
>>> -        vfio_cpr_unregister_container(bcontainer);
>>> +        vfio_legacy_cpr_unregister_container(container);
>>>          object_unref(container);
>>>      }
>>>      if (fd >= 0) {
>>> @@ -720,7 +720,7 @@ static void vfio_container_disconnect(VFIOGroup *group)
>>>          VFIOAddressSpace *space = bcontainer->space;
>>>
>>>          trace_vfio_container_disconnect(container->fd);
>>> -        vfio_cpr_unregister_container(bcontainer);
>>> +        vfio_legacy_cpr_unregister_container(container);
>>>          close(container->fd);
>>>          object_unref(container);
>>>
>>> diff --git a/hw/vfio/cpr-legacy.c b/hw/vfio/cpr-legacy.c
>>> new file mode 100644
>>> index 0000000..419b9fb
>>> --- /dev/null
>>> +++ b/hw/vfio/cpr-legacy.c
>>> @@ -0,0 +1,69 @@
>>> +/*
>>> + * Copyright (c) 2021-2025 Oracle and/or its affiliates.
>>> + *
>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>> + */
>>> +
>>> +#include <sys/ioctl.h>
>>> +#include <linux/vfio.h>
>>> +#include "qemu/osdep.h"
>>> +#include "hw/vfio/vfio-container.h"
>>> +#include "hw/vfio/vfio-cpr.h"
>>
>> Ditto.
> 
> Yes, this #include vfio-cpr.h should be dropped from this patch.
> 
> - Steve
> 




* Re: [PATCH V4 09/43] vfio/container: register container for cpr
  2025-06-03 14:17       ` Steven Sistare
@ 2025-06-03 15:27         ` Cédric Le Goater
  0 siblings, 0 replies; 90+ messages in thread
From: Cédric Le Goater @ 2025-06-03 15:27 UTC (permalink / raw)
  To: Steven Sistare, Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Liu, Yi L, Eric Auger, Michael S. Tsirkin,
	Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 6/3/25 16:17, Steven Sistare wrote:
> On 6/3/2025 10:09 AM, Steven Sistare wrote:
>> On 6/3/2025 7:57 AM, Duan, Zhenzhong wrote:
>>>> -----Original Message-----
>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>> Subject: [PATCH V4 09/43] vfio/container: register container for cpr
>>>>
>>>> Register a legacy container for cpr-transfer, replacing the generic CPR
>>>> register call with a more specific legacy container register call.  Add a
>>>> blocker if the kernel does not support VFIO_UPDATE_VADDR or
>>>> VFIO_UNMAP_ALL.
>>>>
>>>> This is mostly boilerplate.  The fields to be saved and restored are added
>>>> in subsequent patches.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>> include/hw/vfio/vfio-container.h |  2 ++
>>>> include/hw/vfio/vfio-cpr.h       | 15 +++++++++
>>>> hw/vfio/container.c              |  6 ++--
>>>> hw/vfio/cpr-legacy.c             | 69 ++++++++++++++++++++++++++++++++++++++++
>>>> hw/vfio/cpr.c                    |  5 ++-
>>>> hw/vfio/meson.build              |  1 +
>>>> 6 files changed, 92 insertions(+), 6 deletions(-)
>>>> create mode 100644 hw/vfio/cpr-legacy.c
>>>>
>>>> diff --git a/include/hw/vfio/vfio-container.h b/include/hw/vfio/vfio-container.h
>>>> index afc498d..21e5807 100644
>>>> --- a/include/hw/vfio/vfio-container.h
>>>> +++ b/include/hw/vfio/vfio-container.h
>>>> @@ -10,6 +10,7 @@
>>>> #define HW_VFIO_CONTAINER_H
>>>>
>>>> #include "hw/vfio/vfio-container-base.h"
>>>> +#include "hw/vfio/vfio-cpr.h"
>>>
>>> Now that we have this change, may we remove #include of vfio-cpr.h in hw/vfio/container.c?
>>> Maybe this belongs in patch 8?
>>
>> Yes, thanks.
>> Patch 8 should not add #include of vfio-cpr.h in hw/vfio/container.c
> 
> However, I see that Cedric has staged these patches in vfio-next.
> We can make these tweaks in a future patch.

That is fine, you can resend. Let's first wait for Michael's feedback on the
pci and vfio-pci reset handlers. That is what is blocking me from sending
a PR.

Thanks,

C.




* RE: [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
  2025-06-03 14:09     ` Steven Sistare
@ 2025-06-04  3:55       ` Duan, Zhenzhong
  2025-06-04 13:33         ` Steven Sistare
  0 siblings, 1 reply; 90+ messages in thread
From: Duan, Zhenzhong @ 2025-06-04  3:55 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steven Sistare <steven.sistare@oracle.com>
>Subject: Re: [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
>
>On 6/3/2025 6:40 AM, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Steve Sistare <steven.sistare@oracle.com>
>>> Subject: [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
>>>
>>> If vfio_realize fails after vfio_device_attach, it should call
>>> vfio_device_detach during error recovery.  If it fails after
>>> vfio_device_get_name, it should free vbasedev->name.  If it fails
>>> after vfio_pci_config_setup, it should free vdev->msix.
>>>
>>> To fix all, call vfio_pci_put_device().
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>> hw/vfio/pci.c | 1 +
>>> 1 file changed, 1 insertion(+)
>>>
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index a1bfdfe..7d3b9ff 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -3296,6 +3296,7 @@ out_teardown:
>>>      vfio_bars_exit(vdev);
>>> error:
>>>      error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name);
>>> +    vfio_pci_put_device(vdev);
>>
>> Double free, vfio_pci_put_device() is also called in vfio_instance_finalize().
>
>If vfio_realize fails with an error, vfio_instance_finalize is not called.
>I tested that.

Have you tried with hot plugged device?

>
>> Early free of vdev->vbasedev.name will also break something, e.g.,
>> trace_vfio_region_finalize(region->vbasedev->name, region->nr);
>
>All unwinding and calling functions that might use the name is done in the
>vfio_realize failure path, and the very last operation is vfio_pci_put_device,
>and the last operation of that function is freeing the name string.
>
>- Steve
>
>>> static void vfio_instance_finalize(Object *obj)
>>> --
>>> 1.8.3.1
>>



* Re: [PATCH V4 16/43] pci: skip reset during cpr
  2025-06-02 12:36       ` Steven Sistare
@ 2025-06-04  7:09         ` Cédric Le Goater
  2025-06-04 11:59           ` Cédric Le Goater
  0 siblings, 1 reply; 90+ messages in thread
From: Cédric Le Goater @ 2025-06-04  7:09 UTC (permalink / raw)
  To: Steven Sistare, Michael S. Tsirkin
  Cc: qemu-devel, Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 6/2/25 14:36, Steven Sistare wrote:
> On 6/1/2025 3:07 PM, Michael S. Tsirkin wrote:
>> On Sun, Jun 01, 2025 at 06:38:43PM +0200, Cédric Le Goater wrote:
>>> On 5/29/25 21:24, Steve Sistare wrote:
>>>> Do not reset a vfio-pci device during CPR.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>>    include/hw/pci/pci_device.h | 3 +++
>>>>    hw/pci/pci.c                | 5 +++++
>>>>    hw/vfio/pci.c               | 7 +++++++
>>>>    3 files changed, 15 insertions(+)
>>>>
>>>> diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
>>>> index e41d95b..b481c5d 100644
>>>> --- a/include/hw/pci/pci_device.h
>>>> +++ b/include/hw/pci/pci_device.h
>>>> @@ -181,6 +181,9 @@ struct PCIDevice {
>>>>        uint32_t max_bounce_buffer_size;
>>>>        char *sriov_pf;
>>>> +
>>>> +    /* CPR */
>>>> +    bool skip_reset_on_cpr;
>>>>    };
>>>>    static inline int pci_intx(PCIDevice *pci_dev)
>>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>>> index f5ab510..21eb11c 100644
>>>> --- a/hw/pci/pci.c
>>>> +++ b/hw/pci/pci.c
>>>> @@ -32,6 +32,7 @@
>>>>    #include "hw/pci/pci_host.h"
>>>>    #include "hw/qdev-properties.h"
>>>>    #include "hw/qdev-properties-system.h"
>>>> +#include "migration/cpr.h"
>>>>    #include "migration/qemu-file-types.h"
>>>>    #include "migration/vmstate.h"
>>>>    #include "net/net.h"
>>>> @@ -531,6 +532,10 @@ static void pci_reset_regions(PCIDevice *dev)
>>>>    static void pci_do_device_reset(PCIDevice *dev)
>>>>    {
>>>> +    if (dev->skip_reset_on_cpr && cpr_is_incoming()) {
>>>> +        return;
>>>> +    }
>>>
>>> Since ->skip_reset_on_cpr is only true for vfio-pci devices, it could be
>>> replaced by : object_dynamic_cast(OBJECT(dev), "vfio-pci")
>>>
>>> Thanks,
>>>
>>> C.
>>
>> True but I don't really like driver dependent hacks.
>> what exactly about vfio makes it survive without this reset?
> 
> The kernel descriptors remain open and all the active kernel PCI state
> remains in place.  The device was never quiesced or de-configured in old QEMU.
> 
> The cast is fine with me; it depends on what Michael wants.
I don't see any good way to avoid doing the reset when a cpr resume
is in progress. I agree the cast is pretty ugly. We could keep the
'skip_reset_on_cpr' attribute and make it a class attribute instead.

Which raises another question: is this specific to vfio-pci? What
about virtio devices?

Thanks,

C.






* Re: [PATCH V4 16/43] pci: skip reset during cpr
  2025-06-04  7:09         ` Cédric Le Goater
@ 2025-06-04 11:59           ` Cédric Le Goater
  2025-06-04 13:15             ` Steven Sistare
  0 siblings, 1 reply; 90+ messages in thread
From: Cédric Le Goater @ 2025-06-04 11:59 UTC (permalink / raw)
  To: Steven Sistare, Michael S. Tsirkin
  Cc: qemu-devel, Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 6/4/25 09:09, Cédric Le Goater wrote:
> On 6/2/25 14:36, Steven Sistare wrote:
>> On 6/1/2025 3:07 PM, Michael S. Tsirkin wrote:
>>> On Sun, Jun 01, 2025 at 06:38:43PM +0200, Cédric Le Goater wrote:
>>>> On 5/29/25 21:24, Steve Sistare wrote:
>>>>> Do not reset a vfio-pci device during CPR.
>>>>>
>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>> ---
>>>>>    include/hw/pci/pci_device.h | 3 +++
>>>>>    hw/pci/pci.c                | 5 +++++
>>>>>    hw/vfio/pci.c               | 7 +++++++
>>>>>    3 files changed, 15 insertions(+)
>>>>>
>>>>> diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
>>>>> index e41d95b..b481c5d 100644
>>>>> --- a/include/hw/pci/pci_device.h
>>>>> +++ b/include/hw/pci/pci_device.h
>>>>> @@ -181,6 +181,9 @@ struct PCIDevice {
>>>>>        uint32_t max_bounce_buffer_size;
>>>>>        char *sriov_pf;
>>>>> +
>>>>> +    /* CPR */
>>>>> +    bool skip_reset_on_cpr;
>>>>>    };
>>>>>    static inline int pci_intx(PCIDevice *pci_dev)
>>>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>>>> index f5ab510..21eb11c 100644
>>>>> --- a/hw/pci/pci.c
>>>>> +++ b/hw/pci/pci.c
>>>>> @@ -32,6 +32,7 @@
>>>>>    #include "hw/pci/pci_host.h"
>>>>>    #include "hw/qdev-properties.h"
>>>>>    #include "hw/qdev-properties-system.h"
>>>>> +#include "migration/cpr.h"
>>>>>    #include "migration/qemu-file-types.h"
>>>>>    #include "migration/vmstate.h"
>>>>>    #include "net/net.h"
>>>>> @@ -531,6 +532,10 @@ static void pci_reset_regions(PCIDevice *dev)
>>>>>    static void pci_do_device_reset(PCIDevice *dev)
>>>>>    {
>>>>> +    if (dev->skip_reset_on_cpr && cpr_is_incoming()) {
>>>>> +        return;
>>>>> +    }
>>>>
>>>> Since ->skip_reset_on_cpr is only true for vfio-pci devices, it could be
>>>> replaced by : object_dynamic_cast(OBJECT(dev), "vfio-pci")
>>>>
>>>> Thanks,
>>>>
>>>> C.
>>>
>>> True but I don't really like driver dependent hacks.
>>> what exactly about vfio makes it survive without this reset?
>>
>> The kernel descriptors remain open and all the active kernel PCI state
>> remains in place.  The device was never quiesced or de-configured in old QEMU.
>>
>> The cast is fine with me; it depends on what Michael wants.
> I don't see any good ways to avoid doing the reset when a cpr resume
> is in progress. I agree the cast is pretty ugly. We could keep the
> 'skip_reset_on_cpr' attribute and make it a class attribute instead.
Also,

I wonder if the resettable interface, and more specifically the
RESET_TYPE_SNAPSHOT_LOAD type, might be useful. Have you explored
this alternative?


Thanks,

C.





* Re: [PATCH V4 16/43] pci: skip reset during cpr
  2025-06-04 11:59           ` Cédric Le Goater
@ 2025-06-04 13:15             ` Steven Sistare
  2025-06-04 13:48               ` Cédric Le Goater
  0 siblings, 1 reply; 90+ messages in thread
From: Steven Sistare @ 2025-06-04 13:15 UTC (permalink / raw)
  To: Cédric Le Goater, Michael S. Tsirkin
  Cc: qemu-devel, Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 6/4/2025 7:59 AM, Cédric Le Goater wrote:
> On 6/4/25 09:09, Cédric Le Goater wrote:
>> On 6/2/25 14:36, Steven Sistare wrote:
>>> On 6/1/2025 3:07 PM, Michael S. Tsirkin wrote:
>>>> On Sun, Jun 01, 2025 at 06:38:43PM +0200, Cédric Le Goater wrote:
>>>>> On 5/29/25 21:24, Steve Sistare wrote:
>>>>>> Do not reset a vfio-pci device during CPR.
>>>>>>
>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>> ---
>>>>>>    include/hw/pci/pci_device.h | 3 +++
>>>>>>    hw/pci/pci.c                | 5 +++++
>>>>>>    hw/vfio/pci.c               | 7 +++++++
>>>>>>    3 files changed, 15 insertions(+)
>>>>>>
>>>>>> diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
>>>>>> index e41d95b..b481c5d 100644
>>>>>> --- a/include/hw/pci/pci_device.h
>>>>>> +++ b/include/hw/pci/pci_device.h
>>>>>> @@ -181,6 +181,9 @@ struct PCIDevice {
>>>>>>        uint32_t max_bounce_buffer_size;
>>>>>>        char *sriov_pf;
>>>>>> +
>>>>>> +    /* CPR */
>>>>>> +    bool skip_reset_on_cpr;
>>>>>>    };
>>>>>>    static inline int pci_intx(PCIDevice *pci_dev)
>>>>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>>>>> index f5ab510..21eb11c 100644
>>>>>> --- a/hw/pci/pci.c
>>>>>> +++ b/hw/pci/pci.c
>>>>>> @@ -32,6 +32,7 @@
>>>>>>    #include "hw/pci/pci_host.h"
>>>>>>    #include "hw/qdev-properties.h"
>>>>>>    #include "hw/qdev-properties-system.h"
>>>>>> +#include "migration/cpr.h"
>>>>>>    #include "migration/qemu-file-types.h"
>>>>>>    #include "migration/vmstate.h"
>>>>>>    #include "net/net.h"
>>>>>> @@ -531,6 +532,10 @@ static void pci_reset_regions(PCIDevice *dev)
>>>>>>    static void pci_do_device_reset(PCIDevice *dev)
>>>>>>    {
>>>>>> +    if (dev->skip_reset_on_cpr && cpr_is_incoming()) {
>>>>>> +        return;
>>>>>> +    }
>>>>>
>>>>> Since ->skip_reset_on_cpr is only true for vfio-pci devices, it could be
>>>>> replaced by : object_dynamic_cast(OBJECT(dev), "vfio-pci")
>>>>>
>>>>> Thanks,
>>>>>
>>>>> C.
>>>>
>>>> True but I don't really like driver dependent hacks.
>>>> what exactly about vfio makes it survive without this reset?
>>>
>>> The kernel descriptors remain open and all the active kernel PCI state
>>> remains in place.  The device was never quiesced or de-configured in old QEMU.
>>>
>>> The cast is fine with me; it depends on what Michael wants.
>> I don't see any good ways to avoid doing the reset when a cpr resume
>> is in progress. I agree the cast is pretty ugly. We could keep the
>> 'skip_reset_on_cpr' attribute and make it a class attribute instead.

I don't see any advantage to making this a class attribute.  I looked for examples
of using such attributes for vfio to configure pci, and found very little.  It
sounds like overkill since vfio already sets and gets PCIDevice members directly
in many places.

I defined skip_reset_on_cpr based on this existing example:

vfio_instance_init()
     pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS

> Also,
> 
> I wonder if the resettable interface, and more specifically the
> RESET_TYPE_SNAPSHOT_LOAD type, might be useful. Have you explored
> this alternative ?

RESET_TYPE_SNAPSHOT_LOAD (or a new type such as RESET_TYPE_CPR) would skip
reset for all devices, but we only skip for vfio_pci.  All other devices
(including virtio) save and restore state using standard migration vmstate,
and must call reset.

- Steve
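
The per-device approach argued for in this message can be sketched in isolation.
This is a minimal toy model, not QEMU code: ToyDevice, toy_cpr_incoming,
toy_device_reset and toy_bus_reset are hypothetical stand-ins for PCIDevice,
cpr_is_incoming(), pci_do_device_reset() and a bus-wide reset.  It only shows
that an opt-in flag lets one device type keep its state while every other
device is still reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for PCIDevice; only the fields the sketch needs. */
typedef struct ToyDevice {
    bool skip_reset_on_cpr;   /* opt-in, set only by the vfio-pci-like device */
    int state;                /* nonzero models "device is configured" */
} ToyDevice;

/* Hypothetical stand-in for cpr_is_incoming(). */
static bool toy_cpr_incoming;

/* Models the early return added to pci_do_device_reset() in the patch. */
static void toy_device_reset(ToyDevice *dev)
{
    if (dev->skip_reset_on_cpr && toy_cpr_incoming) {
        return;               /* state was preserved across the update */
    }
    dev->state = 0;           /* ordinary devices are reset, then reloaded */
}

/* Reset every device on a toy bus; the flag is checked per device. */
static void toy_bus_reset(ToyDevice *devs, size_t n)
{
    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        toy_device_reset(&devs[i]);
    }
}
```

Calling toy_bus_reset() on an array holding one flagged and one unflagged
device, with toy_cpr_incoming set, leaves the flagged device's state untouched
and clears the other.  A reset type that applied to the whole bus would not
make that distinction, which is the objection raised above.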




* Re: [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
  2025-06-04  3:55       ` Duan, Zhenzhong
@ 2025-06-04 13:33         ` Steven Sistare
  2025-06-05  3:02           ` Duan, Zhenzhong
  0 siblings, 1 reply; 90+ messages in thread
From: Steven Sistare @ 2025-06-04 13:33 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 6/3/2025 11:55 PM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steven Sistare <steven.sistare@oracle.com>
>> Subject: Re: [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
>>
>> On 6/3/2025 6:40 AM, Duan, Zhenzhong wrote:
>>>> -----Original Message-----
>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>> Subject: [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
>>>>
>>>> If vfio_realize fails after vfio_device_attach, it should call
>>>> vfio_device_detach during error recovery.  If it fails after
>>>> vfio_device_get_name, it should free vbasedev->name.  If it fails
>>>> after vfio_pci_config_setup, it should free vdev->msix.
>>>>
>>>> To fix all, call vfio_pci_put_device().
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>> hw/vfio/pci.c | 1 +
>>>> 1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index a1bfdfe..7d3b9ff 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -3296,6 +3296,7 @@ out_teardown:
>>>>       vfio_bars_exit(vdev);
>>>> error:
>>>>       error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name);
>>>> +    vfio_pci_put_device(vdev);
>>>
>>> Double free, vfio_pci_put_device() is also called in vfio_instance_finalize().
>>
>> If vfio_realize fails with an error, vfio_instance_finalize is not called.
>> I tested that.
> 
> Have you tried with hot plugged device?

Not before, but I just tried it now, thanks for the suggestion.
Same result -- vfio_instance_finalize is not called.

- Steve

>>> Early free of vdev->vbasedev.name will also break something, e.g.,
>>> trace_vfio_region_finalize(region->vbasedev->name, region->nr);
>>
>> All unwinding and calling functions that might use the name is done in the
>> vfio_realize failure path, and the very last operation is vfio_pci_put_device,
>> and the last operation of that function is freeing the name string.
>>
>> - Steve
>>
>>>> static void vfio_instance_finalize(Object *obj)
>>>> --
>>>> 1.8.3.1
>>>
> 




* Re: [PATCH V4 16/43] pci: skip reset during cpr
  2025-06-04 13:15             ` Steven Sistare
@ 2025-06-04 13:48               ` Cédric Le Goater
  2025-06-10 16:31                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 90+ messages in thread
From: Cédric Le Goater @ 2025-06-04 13:48 UTC (permalink / raw)
  To: Steven Sistare, Michael S. Tsirkin
  Cc: qemu-devel, Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Marcel Apfelbaum, Peter Xu, Fabiano Rosas

> I don't see any advantage to making this a class attribute.  I looked for examples
> of using such attributes for vfio to configure pci, and found very little.  It
> sounds like overkill since vfio already sets and gets PCIDevice members directly
> in many places.
> 
> I defined skip_reset_on_cpr based on this existing example:
> 
> vfio_instance_init()
>      pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS

pci_dev->cap_present can be modified at realize time. skip_reset_on_cpr
is a constant, for which a class attribute is more appropriate.
This is minor.

Michael,

   Are you ok with the 'skip_reset_on_cpr' bool?

>> I wonder if the resettable interface, and more specifically the
>> RESET_TYPE_SNAPSHOT_LOAD type, might be useful. Have you explored
>> this alternative ?
> 
> RESET_TYPE_SNAPSHOT_LOAD (or a new type such as RESET_TYPE_CPR) would skip
> reset for all devices, but we only skip for vfio_pci.  All other devices
> (including virtio) save and restore state using standard migration vmstate,
> and must call reset.
OK.

C.




* RE: [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
  2025-06-04 13:33         ` Steven Sistare
@ 2025-06-05  3:02           ` Duan, Zhenzhong
  2025-06-05 15:16             ` Steven Sistare
  0 siblings, 1 reply; 90+ messages in thread
From: Duan, Zhenzhong @ 2025-06-05  3:02 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel@nongnu.org
  Cc: Alex Williamson, Cedric Le Goater, Liu, Yi L, Eric Auger,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas



>-----Original Message-----
>From: Steven Sistare <steven.sistare@oracle.com>
>Subject: Re: [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
>
>On 6/3/2025 11:55 PM, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Steven Sistare <steven.sistare@oracle.com>
>>> Subject: Re: [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
>>>
>>> On 6/3/2025 6:40 AM, Duan, Zhenzhong wrote:
>>>>> -----Original Message-----
>>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>>> Subject: [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
>>>>>
>>>>> If vfio_realize fails after vfio_device_attach, it should call
>>>>> vfio_device_detach during error recovery.  If it fails after
>>>>> vfio_device_get_name, it should free vbasedev->name.  If it fails
>>>>> after vfio_pci_config_setup, it should free vdev->msix.
>>>>>
>>>>> To fix all, call vfio_pci_put_device().
>>>>>
>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>> ---
>>>>> hw/vfio/pci.c | 1 +
>>>>> 1 file changed, 1 insertion(+)
>>>>>
>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>> index a1bfdfe..7d3b9ff 100644
>>>>> --- a/hw/vfio/pci.c
>>>>> +++ b/hw/vfio/pci.c
>>>>> @@ -3296,6 +3296,7 @@ out_teardown:
>>>>>       vfio_bars_exit(vdev);
>>>>> error:
>>>>>       error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name);
>>>>> +    vfio_pci_put_device(vdev);
>>>>
>>>> Double free, vfio_pci_put_device() is also called in vfio_instance_finalize().
>>>
>>> If vfio_realize fails with an error, vfio_instance_finalize is not called.
>>> I tested that.
>>
>> Have you tried with hot plugged device?
>
>Not before, but I just tried it now, thanks for the suggestion.
>Same result -- vfio_instance_finalize is not called.

That's strange. I tried the change below while hot-plugging a device through QMP, and I see "vfio_instance_finalize called":

device_add vfio-pci,host=04:10.1,id=vfio0,bus=root0,iommufd=iommufd0

--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3167,6 +3167,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)

     trace_vfio_mdev(vbasedev->name, vbasedev->mdev);

+error_setg(errp, "faking error in vfio_realize");
+goto error;
+
     if (vbasedev->ram_block_discard_allowed && !vbasedev->mdev) {
         error_setg(errp, "x-balloon-allowed only potentially compatible "
                    "with mdev devices");
@@ -3301,6 +3304,8 @@ error:
 static void vfio_instance_finalize(Object *obj)
 {
     VFIOPCIDevice *vdev = VFIO_PCI_BASE(obj);
+printf("vfio_instance_finalize called\n");
+exit(1);

     vfio_display_finalize(vdev);
     vfio_bars_finalize(vdev);


* Re: [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
  2025-06-05  3:02           ` Duan, Zhenzhong
@ 2025-06-05 15:16             ` Steven Sistare
  2025-06-05 21:14               ` Cédric Le Goater
  0 siblings, 1 reply; 90+ messages in thread
From: Steven Sistare @ 2025-06-05 15:16 UTC (permalink / raw)
  To: Duan, Zhenzhong, Cedric Le Goater
  Cc: Alex Williamson, Liu, Yi L, Eric Auger, Michael S. Tsirkin,
	Marcel Apfelbaum, Peter Xu, Fabiano Rosas, qemu-devel@nongnu.org

On 6/4/2025 11:02 PM, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Steven Sistare <steven.sistare@oracle.com>
>> Subject: Re: [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
>>
>> On 6/3/2025 11:55 PM, Duan, Zhenzhong wrote:
>>>> -----Original Message-----
>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>> Subject: Re: [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
>>>>
>>>> On 6/3/2025 6:40 AM, Duan, Zhenzhong wrote:
>>>>>> -----Original Message-----
>>>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>>>> Subject: [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
>>>>>>
>>>>>> If vfio_realize fails after vfio_device_attach, it should call
>>>>>> vfio_device_detach during error recovery.  If it fails after
>>>>>> vfio_device_get_name, it should free vbasedev->name.  If it fails
>>>>>> after vfio_pci_config_setup, it should free vdev->msix.
>>>>>>
>>>>>> To fix all, call vfio_pci_put_device().
>>>>>>
>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>> ---
>>>>>> hw/vfio/pci.c | 1 +
>>>>>> 1 file changed, 1 insertion(+)
>>>>>>
>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>>> index a1bfdfe..7d3b9ff 100644
>>>>>> --- a/hw/vfio/pci.c
>>>>>> +++ b/hw/vfio/pci.c
>>>>>> @@ -3296,6 +3296,7 @@ out_teardown:
>>>>>>        vfio_bars_exit(vdev);
>>>>>> error:
>>>>>>        error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name);
>>>>>> +    vfio_pci_put_device(vdev);
>>>>>
>>>>> Double free, vfio_pci_put_device() is also called in vfio_instance_finalize().

Agreed, this line must be deleted.
Cedric, this must be fixed in vfio-next.

>>>> If vfio_realize fails with an error, vfio_instance_finalize is not called.
>>>> I tested that.
>>>
>>> Have you tried with hot plugged device?
>>
>> Not before, but I just tried it now, thanks for the suggestion.
>> Same result -- vfio_instance_finalize is not called.
> 
> That's strange, I tried below change with hotplug a device through qmp, I see "vfio_instance_finalize called"
> 
> device_add vfio-pci,host=04:10.1,id=vfio0,bus=root0,iommufd=iommufd0
> 
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3167,6 +3167,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
> 
>       trace_vfio_mdev(vbasedev->name, vbasedev->mdev);
> 
> +error_setg(errp, "faking error in vfio_realize");
> +goto error;

Thank you, with this I see finalize being called.

In my test, I had injected an error as late as possible in realize, to verify all
state is unwound, and I did it wrong:

     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);

     error_setg(errp, "forced error");
     goto out_deregister;

     return;
   out_deregister:

and finalize is not called.  Probably some reference is taken in those last few
function calls, and is not released.

This is correct, and calls finalize:

     error_setg(errp, "forced error");
     goto out_deregister;

     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);

     return;
   out_deregister:

- Steve
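
The double release agreed on above can be sketched with a toy model.  This is
not QEMU code: ToyDev, toy_put_device, toy_realize, toy_finalize and
toy_failed_plug are hypothetical stand-ins for the vfio-pci device, its put,
realize and instance-finalize functions, and a failed device_add.  It assumes,
as Zhenzhong's test showed, that finalize still runs after realize fails.  The
toy put clears its pointer so the sketch itself stays well defined; nothing
here claims the real function does the same.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the device; only what the sketch needs. */
typedef struct ToyDev {
    char *name;       /* models vbasedev->name */
    int put_calls;    /* how many times the release function ran */
} ToyDev;

/* Models the release function; counts calls so a double release is visible. */
static void toy_put_device(ToyDev *d)
{
    d->put_calls++;
    free(d->name);    /* free(NULL) is a no-op, so the toy stays safe */
    d->name = NULL;
}

/* Models a realize that always fails after allocating the name. */
static bool toy_realize(ToyDev *d, bool put_on_error)
{
    d->put_calls = 0;
    d->name = malloc(5);
    assert(d->name != NULL);
    strcpy(d->name, "toy0");

    /* error path: put_on_error models the line added by the patch */
    if (put_on_error) {
        toy_put_device(d);
    }
    return false;
}

/* Models instance finalize, which also releases the device. */
static void toy_finalize(ToyDev *d)
{
    toy_put_device(d);
}

/* A failed plug: realize fails, then the object is finalized. */
static int toy_failed_plug(bool put_on_error)
{
    ToyDev d;

    if (!toy_realize(&d, put_on_error)) {
        toy_finalize(&d);
    }
    return d.put_calls;
}
```

With the extra call in the error path, toy_failed_plug(true) reports two
releases for one device.  Without it, toy_failed_plug(false) reports one,
which matches the conclusion that the added line must be deleted.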




* Re: [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
  2025-06-05 15:16             ` Steven Sistare
@ 2025-06-05 21:14               ` Cédric Le Goater
  0 siblings, 0 replies; 90+ messages in thread
From: Cédric Le Goater @ 2025-06-05 21:14 UTC (permalink / raw)
  To: Steven Sistare, Duan, Zhenzhong
  Cc: Alex Williamson, Liu, Yi L, Eric Auger, Michael S. Tsirkin,
	Marcel Apfelbaum, Peter Xu, Fabiano Rosas, qemu-devel@nongnu.org

On 6/5/25 17:16, Steven Sistare wrote:
> On 6/4/2025 11:02 PM, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Steven Sistare <steven.sistare@oracle.com>
>>> Subject: Re: [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
>>>
>>> On 6/3/2025 11:55 PM, Duan, Zhenzhong wrote:
>>>>> -----Original Message-----
>>>>> From: Steven Sistare <steven.sistare@oracle.com>
>>>>> Subject: Re: [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
>>>>>
>>>>> On 6/3/2025 6:40 AM, Duan, Zhenzhong wrote:
>>>>>>> -----Original Message-----
>>>>>>> From: Steve Sistare <steven.sistare@oracle.com>
>>>>>>> Subject: [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure
>>>>>>>
>>>>>>> If vfio_realize fails after vfio_device_attach, it should call
>>>>>>> vfio_device_detach during error recovery.  If it fails after
>>>>>>> vfio_device_get_name, it should free vbasedev->name.  If it fails
>>>>>>> after vfio_pci_config_setup, it should free vdev->msix.
>>>>>>>
>>>>>>> To fix all, call vfio_pci_put_device().
>>>>>>>
>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>> ---
>>>>>>> hw/vfio/pci.c | 1 +
>>>>>>> 1 file changed, 1 insertion(+)
>>>>>>>
>>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>>>> index a1bfdfe..7d3b9ff 100644
>>>>>>> --- a/hw/vfio/pci.c
>>>>>>> +++ b/hw/vfio/pci.c
>>>>>>> @@ -3296,6 +3296,7 @@ out_teardown:
>>>>>>>        vfio_bars_exit(vdev);
>>>>>>> error:
>>>>>>>        error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name);
>>>>>>> +    vfio_pci_put_device(vdev);
>>>>>>
>>>>>> Double free, vfio_pci_put_device() is also called in vfio_instance_finalize().
> 
> Agreed, this line must be deleted.
> Cedric, this must be fixed in vfio-next.


Yes. It was not merged.

Thanks to you both for the analysis. I lacked the time.

C.





* Re: [PATCH V4 35/43] vfio/iommufd: register container for cpr
  2025-05-29 19:24 ` [PATCH V4 35/43] vfio/iommufd: register container for cpr Steve Sistare
@ 2025-06-09 20:30   ` Cédric Le Goater
  2025-06-09 20:47     ` Steven Sistare
  0 siblings, 1 reply; 90+ messages in thread
From: Cédric Le Goater @ 2025-06-09 20:30 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/29/25 21:24, Steve Sistare wrote:
> Register a vfio iommufd container and device for CPR, replacing the generic
> CPR register call with a more specific iommufd register call.  Add a
> blocker if the kernel does not support IOMMU_IOAS_CHANGE_PROCESS.
> 
> This is mostly boilerplate.  The fields to be saved and restored are added
> in subsequent patches.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   include/hw/vfio/vfio-cpr.h | 12 +++++++
>   include/system/iommufd.h   |  1 +
>   backends/iommufd.c         | 10 ++++++
>   hw/vfio/cpr-iommufd.c      | 84 ++++++++++++++++++++++++++++++++++++++++++++++
>   hw/vfio/iommufd.c          |  6 ++--
>   hw/vfio/meson.build        |  1 +
>   6 files changed, 112 insertions(+), 2 deletions(-)
>   create mode 100644 hw/vfio/cpr-iommufd.c
> 
> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
> index 170a116..b9b77ae 100644
> --- a/include/hw/vfio/vfio-cpr.h
> +++ b/include/hw/vfio/vfio-cpr.h
> @@ -15,7 +15,10 @@
>   struct VFIOContainer;
>   struct VFIOContainerBase;
>   struct VFIOGroup;
> +struct VFIODevice;
>   struct VFIOPCIDevice;
> +struct VFIOIOMMUFDContainer;
> +struct IOMMUFDBackend;
>   
>   typedef struct VFIOContainerCPR {
>       Error *blocker;
> @@ -43,6 +46,15 @@ bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
>                                    Error **errp);
>   void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>   
> +bool vfio_iommufd_cpr_register_container(struct VFIOIOMMUFDContainer *container,
> +                                         Error **errp);
> +void vfio_iommufd_cpr_unregister_container(
> +    struct VFIOIOMMUFDContainer *container);
> +bool vfio_iommufd_cpr_register_iommufd(struct IOMMUFDBackend *be, Error **errp);
> +void vfio_iommufd_cpr_unregister_iommufd(struct IOMMUFDBackend *be);
> +void vfio_iommufd_cpr_register_device(struct VFIODevice *vbasedev);
> +void vfio_iommufd_cpr_unregister_device(struct VFIODevice *vbasedev);
> +
>   int vfio_cpr_group_get_device_fd(int d, const char *name);
>   
>   bool vfio_cpr_container_match(struct VFIOContainer *container,
> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
> index db9ed53..3c58ea8 100644
> --- a/include/system/iommufd.h
> +++ b/include/system/iommufd.h
> @@ -32,6 +32,7 @@ struct IOMMUFDBackend {
>       /*< protected >*/
>       int fd;            /* /dev/iommu file descriptor */
>       bool owned;        /* is the /dev/iommu opened internally */
> +    Error *cpr_blocker;/* set if be does not support CPR */
>       uint32_t users;
>   
>       /*< public >*/
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> index ed8bb4c..2e9d6cb 100644
> --- a/backends/iommufd.c
> +++ b/backends/iommufd.c
> @@ -108,6 +108,13 @@ bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
>           }
>           be->fd = fd;
>       }
> +    if (!be->users && !vfio_iommufd_cpr_register_iommufd(be, errp)) {
> +        if (be->owned) {
> +            close(be->fd);
> +            be->fd = -1;
> +        }
> +        return false;
> +    }
>       be->users++;
>   
>       trace_iommufd_backend_connect(be->fd, be->owned, be->users);
> @@ -125,6 +132,9 @@ void iommufd_backend_disconnect(IOMMUFDBackend *be)
>           be->fd = -1;
>       }
>   out:
> +    if (!be->users) {
> +        vfio_iommufd_cpr_unregister_iommufd(be);
> +    }
>       trace_iommufd_backend_disconnect(be->fd, be->users);
>   }
>   
> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
> new file mode 100644
> index 0000000..60bd7e8
> --- /dev/null
> +++ b/hw/vfio/cpr-iommufd.c
> @@ -0,0 +1,84 @@
> +/*
> + * Copyright (c) 2024-2025 Oracle and/or its affiliates.
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "hw/vfio/vfio-cpr.h"
> +#include "migration/blocker.h"
> +#include "migration/cpr.h"
> +#include "migration/migration.h"
> +#include "migration/vmstate.h"
> +#include "system/iommufd.h"
> +#include "vfio-iommufd.h"
> +
> +static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
> +{
> +    if (!iommufd_change_process_capable(be)) {
> +        if (errp) {
> +            error_setg(errp, "vfio iommufd backend does not support "
> +                       "IOMMU_IOAS_CHANGE_PROCESS");
> +        }
> +        return false;
> +    }
> +    return true;
> +}
> +
> +static const VMStateDescription iommufd_cpr_vmstate = {
> +    .name = "iommufd",
> +    .version_id = 0,
> +    .minimum_version_id = 0,
> +    .needed = cpr_incoming_needed,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +bool vfio_iommufd_cpr_register_iommufd(IOMMUFDBackend *be, Error **errp)
> +{
> +    Error **cpr_blocker = &be->cpr_blocker;
> +
> +    if (!vfio_cpr_supported(be, cpr_blocker)) {
> +        return migrate_add_blocker_modes(cpr_blocker, errp,
> +                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
> +    }
> +
> +    vmstate_register(NULL, -1, &iommufd_cpr_vmstate, be);
> +
> +    return true;
> +}
> +
> +void vfio_iommufd_cpr_unregister_iommufd(IOMMUFDBackend *be)
> +{
> +    vmstate_unregister(NULL, &iommufd_cpr_vmstate, be);
> +    migrate_del_blocker(&be->cpr_blocker);
> +}
> +
> +bool vfio_iommufd_cpr_register_container(VFIOIOMMUFDContainer *container,
> +                                         Error **errp)
> +{
> +    VFIOContainerBase *bcontainer = &container->bcontainer;
> +
> +    migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
> +                                vfio_cpr_reboot_notifier,
> +                                MIG_MODE_CPR_REBOOT);
> +
> +    return true;
> +}
> +
> +void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
> +{
> +    VFIOContainerBase *bcontainer = &container->bcontainer;
> +
> +    migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
> +}
> +
> +void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
> +{
> +}
> +
> +void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
> +{
> +}
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index ca00d08..c690c2c 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -446,7 +446,7 @@ static void iommufd_cdev_container_destroy(VFIOIOMMUFDContainer *container)
>       if (!QLIST_EMPTY(&bcontainer->device_list)) {
>           return;
>       }
> -    vfio_cpr_unregister_container(bcontainer);
> +    vfio_iommufd_cpr_unregister_container(container);
>       vfio_listener_unregister(bcontainer);
>       iommufd_backend_free_id(container->be, container->ioas_id);
>       object_unref(container);
> @@ -592,7 +592,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>           goto err_listener_register;
>       }
>   
> -    if (!vfio_cpr_register_container(bcontainer, errp)) {
> +    if (!vfio_iommufd_cpr_register_container(container, errp)) {
>           goto err_listener_register;
>       }
>   
> @@ -619,6 +619,7 @@ found_container:
>       }
>   
>       vfio_device_prepare(vbasedev, bcontainer, &dev_info);
> +    vfio_iommufd_cpr_register_device(vbasedev);
>   
>       trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
>                                      vbasedev->num_regions, vbasedev->flags);
> @@ -656,6 +657,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
>       iommufd_cdev_container_destroy(container);
>       vfio_address_space_put(space);
>   
> +    vfio_iommufd_cpr_unregister_device(vbasedev);
>       iommufd_cdev_unbind_and_disconnect(vbasedev);
>       close(vbasedev->fd);
>   }
> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index 98134a7..12711fb 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -23,6 +23,7 @@ system_ss.add(when: 'CONFIG_VFIO_XGMAC', if_true: files('calxeda-xgmac.c'))
>   system_ss.add(when: 'CONFIG_VFIO_AMD_XGBE', if_true: files('amd-xgbe.c'))
>   system_ss.add(when: 'CONFIG_VFIO', if_true: files(
>     'cpr.c',
> +  'cpr-iommufd.c',

This file should be compiled under CONFIG_IOMMUFD.


Thanks,

C.



>     'cpr-legacy.c',
>     'device.c',
>     'migration.c',



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH V4 35/43] vfio/iommufd: register container for cpr
  2025-06-09 20:30   ` Cédric Le Goater
@ 2025-06-09 20:47     ` Steven Sistare
  2025-06-10  6:11       ` Cédric Le Goater
  0 siblings, 1 reply; 90+ messages in thread
From: Steven Sistare @ 2025-06-09 20:47 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 6/9/2025 4:30 PM, Cédric Le Goater wrote:
> On 5/29/25 21:24, Steve Sistare wrote:
>> Register a vfio iommufd container and device for CPR, replacing the generic
>> CPR register call with a more specific iommufd register call.  Add a
>> blocker if the kernel does not support IOMMU_IOAS_CHANGE_PROCESS.
>>
>> This is mostly boilerplate.  The fields to be saved and restored are added
>> in subsequent patches.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   include/hw/vfio/vfio-cpr.h | 12 +++++++
>>   include/system/iommufd.h   |  1 +
>>   backends/iommufd.c         | 10 ++++++
>>   hw/vfio/cpr-iommufd.c      | 84 ++++++++++++++++++++++++++++++++++++++++++++++
>>   hw/vfio/iommufd.c          |  6 ++--
>>   hw/vfio/meson.build        |  1 +
>>   6 files changed, 112 insertions(+), 2 deletions(-)
>>   create mode 100644 hw/vfio/cpr-iommufd.c
>>
>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>> index 170a116..b9b77ae 100644
>> --- a/include/hw/vfio/vfio-cpr.h
>> +++ b/include/hw/vfio/vfio-cpr.h
>> @@ -15,7 +15,10 @@
>>   struct VFIOContainer;
>>   struct VFIOContainerBase;
>>   struct VFIOGroup;
>> +struct VFIODevice;
>>   struct VFIOPCIDevice;
>> +struct VFIOIOMMUFDContainer;
>> +struct IOMMUFDBackend;
>>   typedef struct VFIOContainerCPR {
>>       Error *blocker;
>> @@ -43,6 +46,15 @@ bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
>>                                    Error **errp);
>>   void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>> +bool vfio_iommufd_cpr_register_container(struct VFIOIOMMUFDContainer *container,
>> +                                         Error **errp);
>> +void vfio_iommufd_cpr_unregister_container(
>> +    struct VFIOIOMMUFDContainer *container);
>> +bool vfio_iommufd_cpr_register_iommufd(struct IOMMUFDBackend *be, Error **errp);
>> +void vfio_iommufd_cpr_unregister_iommufd(struct IOMMUFDBackend *be);
>> +void vfio_iommufd_cpr_register_device(struct VFIODevice *vbasedev);
>> +void vfio_iommufd_cpr_unregister_device(struct VFIODevice *vbasedev);
>> +
>>   int vfio_cpr_group_get_device_fd(int d, const char *name);
>>   bool vfio_cpr_container_match(struct VFIOContainer *container,
>> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>> index db9ed53..3c58ea8 100644
>> --- a/include/system/iommufd.h
>> +++ b/include/system/iommufd.h
>> @@ -32,6 +32,7 @@ struct IOMMUFDBackend {
>>       /*< protected >*/
>>       int fd;            /* /dev/iommu file descriptor */
>>       bool owned;        /* is the /dev/iommu opened internally */
>> +    Error *cpr_blocker;/* set if be does not support CPR */
>>       uint32_t users;
>>       /*< public >*/
>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>> index ed8bb4c..2e9d6cb 100644
>> --- a/backends/iommufd.c
>> +++ b/backends/iommufd.c
>> @@ -108,6 +108,13 @@ bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
>>           }
>>           be->fd = fd;
>>       }
>> +    if (!be->users && !vfio_iommufd_cpr_register_iommufd(be, errp)) {
>> +        if (be->owned) {
>> +            close(be->fd);
>> +            be->fd = -1;
>> +        }
>> +        return false;
>> +    }
>>       be->users++;
>>       trace_iommufd_backend_connect(be->fd, be->owned, be->users);
>> @@ -125,6 +132,9 @@ void iommufd_backend_disconnect(IOMMUFDBackend *be)
>>           be->fd = -1;
>>       }
>>   out:
>> +    if (!be->users) {
>> +        vfio_iommufd_cpr_unregister_iommufd(be);
>> +    }
>>       trace_iommufd_backend_disconnect(be->fd, be->users);
>>   }
>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>> new file mode 100644
>> index 0000000..60bd7e8
>> --- /dev/null
>> +++ b/hw/vfio/cpr-iommufd.c
>> @@ -0,0 +1,84 @@
>> +/*
>> + * Copyright (c) 2024-2025 Oracle and/or its affiliates.
>> + *
>> + * SPDX-License-Identifier: GPL-2.0-or-later
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qapi/error.h"
>> +#include "hw/vfio/vfio-cpr.h"
>> +#include "migration/blocker.h"
>> +#include "migration/cpr.h"
>> +#include "migration/migration.h"
>> +#include "migration/vmstate.h"
>> +#include "system/iommufd.h"
>> +#include "vfio-iommufd.h"
>> +
>> +static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
>> +{
>> +    if (!iommufd_change_process_capable(be)) {
>> +        if (errp) {
>> +            error_setg(errp, "vfio iommufd backend does not support "
>> +                       "IOMMU_IOAS_CHANGE_PROCESS");
>> +        }
>> +        return false;
>> +    }
>> +    return true;
>> +}
>> +
>> +static const VMStateDescription iommufd_cpr_vmstate = {
>> +    .name = "iommufd",
>> +    .version_id = 0,
>> +    .minimum_version_id = 0,
>> +    .needed = cpr_incoming_needed,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_END_OF_LIST()
>> +    }
>> +};
>> +
>> +bool vfio_iommufd_cpr_register_iommufd(IOMMUFDBackend *be, Error **errp)
>> +{
>> +    Error **cpr_blocker = &be->cpr_blocker;
>> +
>> +    if (!vfio_cpr_supported(be, cpr_blocker)) {
>> +        return migrate_add_blocker_modes(cpr_blocker, errp,
>> +                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
>> +    }
>> +
>> +    vmstate_register(NULL, -1, &iommufd_cpr_vmstate, be);
>> +
>> +    return true;
>> +}
>> +
>> +void vfio_iommufd_cpr_unregister_iommufd(IOMMUFDBackend *be)
>> +{
>> +    vmstate_unregister(NULL, &iommufd_cpr_vmstate, be);
>> +    migrate_del_blocker(&be->cpr_blocker);
>> +}
>> +
>> +bool vfio_iommufd_cpr_register_container(VFIOIOMMUFDContainer *container,
>> +                                         Error **errp)
>> +{
>> +    VFIOContainerBase *bcontainer = &container->bcontainer;
>> +
>> +    migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
>> +                                vfio_cpr_reboot_notifier,
>> +                                MIG_MODE_CPR_REBOOT);
>> +
>> +    return true;
>> +}
>> +
>> +void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
>> +{
>> +    VFIOContainerBase *bcontainer = &container->bcontainer;
>> +
>> +    migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
>> +}
>> +
>> +void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
>> +{
>> +}
>> +
>> +void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
>> +{
>> +}
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index ca00d08..c690c2c 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -446,7 +446,7 @@ static void iommufd_cdev_container_destroy(VFIOIOMMUFDContainer *container)
>>       if (!QLIST_EMPTY(&bcontainer->device_list)) {
>>           return;
>>       }
>> -    vfio_cpr_unregister_container(bcontainer);
>> +    vfio_iommufd_cpr_unregister_container(container);
>>       vfio_listener_unregister(bcontainer);
>>       iommufd_backend_free_id(container->be, container->ioas_id);
>>       object_unref(container);
>> @@ -592,7 +592,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>>           goto err_listener_register;
>>       }
>> -    if (!vfio_cpr_register_container(bcontainer, errp)) {
>> +    if (!vfio_iommufd_cpr_register_container(container, errp)) {
>>           goto err_listener_register;
>>       }
>> @@ -619,6 +619,7 @@ found_container:
>>       }
>>       vfio_device_prepare(vbasedev, bcontainer, &dev_info);
>> +    vfio_iommufd_cpr_register_device(vbasedev);
>>       trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
>>                                      vbasedev->num_regions, vbasedev->flags);
>> @@ -656,6 +657,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
>>       iommufd_cdev_container_destroy(container);
>>       vfio_address_space_put(space);
>> +    vfio_iommufd_cpr_unregister_device(vbasedev);
>>       iommufd_cdev_unbind_and_disconnect(vbasedev);
>>       close(vbasedev->fd);
>>   }
>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>> index 98134a7..12711fb 100644
>> --- a/hw/vfio/meson.build
>> +++ b/hw/vfio/meson.build
>> @@ -23,6 +23,7 @@ system_ss.add(when: 'CONFIG_VFIO_XGMAC', if_true: files('calxeda-xgmac.c'))
>>   system_ss.add(when: 'CONFIG_VFIO_AMD_XGBE', if_true: files('amd-xgbe.c'))
>>   system_ss.add(when: 'CONFIG_VFIO', if_true: files(
>>     'cpr.c',
>> +  'cpr-iommufd.c',
> 
> This file should be compiled under CONFIG_IOMMUFD.

Sure, will fix.
Tomorrow I plan to rebase to the latest master, address this and the few other
comments that came up for V4, and send V5.  Sound OK?

- Steve



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH V4 32/43] vfio/iommufd: invariant device name
  2025-05-29 19:24 ` [PATCH V4 32/43] vfio/iommufd: invariant device name Steve Sistare
@ 2025-06-10  6:10   ` Cédric Le Goater
  0 siblings, 0 replies; 90+ messages in thread
From: Cédric Le Goater @ 2025-06-10  6:10 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/29/25 21:24, Steve Sistare wrote:
> cpr-transfer will use the device name as a key to find the value
> of the device descriptor in new QEMU.  However, if the descriptor
> number is specified by a command-line fd parameter, then
> vfio_device_get_name creates a name that includes the fd number.
> This causes a chicken-and-egg problem: new QEMU must know the fd
> number to construct a name to find the fd number.
> 
> To fix, create an invariant name based on the id command-line parameter,
> if id is defined.  The user will need to provide such an id to use CPR.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
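
To illustrate the chicken-and-egg problem described above, here is a minimal
standalone sketch of the naming rule.  pick_device_name() is a hypothetical
helper, not a QEMU function, and the id "hostdev0" in the note below is made
up; only the "VFIO_FD%d" format comes from the patch.  With an id, the name
is the same no matter which fd number the process happens to hold:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write the device name into buf.
 * If an id was given, use it verbatim so the name does not depend on fd.
 * Otherwise fall back to the fd-based name, which differs between the
 * old and new process whenever the fd number differs.
 * Returns the value of snprintf().
 */
static int pick_device_name(char *buf, size_t len, const char *id, int fd)
{
    assert(buf != NULL && len > 0);

    if (id) {
        return snprintf(buf, len, "%s", id);
    }
    return snprintf(buf, len, "VFIO_FD%d", fd);
}
```

Calling pick_device_name() with id "hostdev0" and fd 17, then again with
fd 42, gives the same name both times, so it can serve as a lookup key.
With a NULL id the name changes with the fd, which is why CPR needs the id.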


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
>   hw/vfio/device.c | 15 ++++++++++-----
>   1 file changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/vfio/device.c b/hw/vfio/device.c
> index 9fba2c7..71fa9f4 100644
> --- a/hw/vfio/device.c
> +++ b/hw/vfio/device.c
> @@ -300,12 +300,17 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
>               error_setg(errp, "Use FD passing only with iommufd backend");
>               return false;
>           }
> -        /*
> -         * Give a name with fd so any function printing out vbasedev->name
> -         * will not break.
> -         */
>           if (!vbasedev->name) {
> -            vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
> +
> +            if (vbasedev->dev->id) {
> +                vbasedev->name = g_strdup(vbasedev->dev->id);
> +                return true;
> +            } else {
> +                /*
> +                 * Assign a name so any function printing it will not break.
> +                 */
> +                vbasedev->name = g_strdup_printf("VFIO_FD%d", vbasedev->fd);
> +            }
>           }
>       }
>   



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH V4 35/43] vfio/iommufd: register container for cpr
  2025-06-09 20:47     ` Steven Sistare
@ 2025-06-10  6:11       ` Cédric Le Goater
  0 siblings, 0 replies; 90+ messages in thread
From: Cédric Le Goater @ 2025-06-10  6:11 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 6/9/25 22:47, Steven Sistare wrote:
> On 6/9/2025 4:30 PM, Cédric Le Goater wrote:
>> On 5/29/25 21:24, Steve Sistare wrote:
>>> Register a vfio iommufd container and device for CPR, replacing the generic
>>> CPR register call with a more specific iommufd register call.  Add a
>>> blocker if the kernel does not support IOMMU_IOAS_CHANGE_PROCESS.
>>>
>>> This is mostly boilerplate.  The fields to be saved and restored are added
>>> in subsequent patches.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>>   include/hw/vfio/vfio-cpr.h | 12 +++++++
>>>   include/system/iommufd.h   |  1 +
>>>   backends/iommufd.c         | 10 ++++++
>>>   hw/vfio/cpr-iommufd.c      | 84 ++++++++++++++++++++++++++++++++++++++++++++++
>>>   hw/vfio/iommufd.c          |  6 ++--
>>>   hw/vfio/meson.build        |  1 +
>>>   6 files changed, 112 insertions(+), 2 deletions(-)
>>>   create mode 100644 hw/vfio/cpr-iommufd.c
>>>
>>> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
>>> index 170a116..b9b77ae 100644
>>> --- a/include/hw/vfio/vfio-cpr.h
>>> +++ b/include/hw/vfio/vfio-cpr.h
>>> @@ -15,7 +15,10 @@
>>>   struct VFIOContainer;
>>>   struct VFIOContainerBase;
>>>   struct VFIOGroup;
>>> +struct VFIODevice;
>>>   struct VFIOPCIDevice;
>>> +struct VFIOIOMMUFDContainer;
>>> +struct IOMMUFDBackend;
>>>   typedef struct VFIOContainerCPR {
>>>       Error *blocker;
>>> @@ -43,6 +46,15 @@ bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
>>>                                    Error **errp);
>>>   void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
>>> +bool vfio_iommufd_cpr_register_container(struct VFIOIOMMUFDContainer *container,
>>> +                                         Error **errp);
>>> +void vfio_iommufd_cpr_unregister_container(
>>> +    struct VFIOIOMMUFDContainer *container);
>>> +bool vfio_iommufd_cpr_register_iommufd(struct IOMMUFDBackend *be, Error **errp);
>>> +void vfio_iommufd_cpr_unregister_iommufd(struct IOMMUFDBackend *be);
>>> +void vfio_iommufd_cpr_register_device(struct VFIODevice *vbasedev);
>>> +void vfio_iommufd_cpr_unregister_device(struct VFIODevice *vbasedev);
>>> +
>>>   int vfio_cpr_group_get_device_fd(int d, const char *name);
>>>   bool vfio_cpr_container_match(struct VFIOContainer *container,
>>> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>>> index db9ed53..3c58ea8 100644
>>> --- a/include/system/iommufd.h
>>> +++ b/include/system/iommufd.h
>>> @@ -32,6 +32,7 @@ struct IOMMUFDBackend {
>>>       /*< protected >*/
>>>       int fd;            /* /dev/iommu file descriptor */
>>>       bool owned;        /* is the /dev/iommu opened internally */
>>> +    Error *cpr_blocker;/* set if be does not support CPR */
>>>       uint32_t users;
>>>       /*< public >*/
>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>> index ed8bb4c..2e9d6cb 100644
>>> --- a/backends/iommufd.c
>>> +++ b/backends/iommufd.c
>>> @@ -108,6 +108,13 @@ bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp)
>>>           }
>>>           be->fd = fd;
>>>       }
>>> +    if (!be->users && !vfio_iommufd_cpr_register_iommufd(be, errp)) {
>>> +        if (be->owned) {
>>> +            close(be->fd);
>>> +            be->fd = -1;
>>> +        }
>>> +        return false;
>>> +    }
>>>       be->users++;
>>>       trace_iommufd_backend_connect(be->fd, be->owned, be->users);
>>> @@ -125,6 +132,9 @@ void iommufd_backend_disconnect(IOMMUFDBackend *be)
>>>           be->fd = -1;
>>>       }
>>>   out:
>>> +    if (!be->users) {
>>> +        vfio_iommufd_cpr_unregister_iommufd(be);
>>> +    }
>>>       trace_iommufd_backend_disconnect(be->fd, be->users);
>>>   }
>>> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
>>> new file mode 100644
>>> index 0000000..60bd7e8
>>> --- /dev/null
>>> +++ b/hw/vfio/cpr-iommufd.c
>>> @@ -0,0 +1,84 @@
>>> +/*
>>> + * Copyright (c) 2024-2025 Oracle and/or its affiliates.
>>> + *
>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>> + */
>>> +
>>> +#include "qemu/osdep.h"
>>> +#include "qapi/error.h"
>>> +#include "hw/vfio/vfio-cpr.h"
>>> +#include "migration/blocker.h"
>>> +#include "migration/cpr.h"
>>> +#include "migration/migration.h"
>>> +#include "migration/vmstate.h"
>>> +#include "system/iommufd.h"
>>> +#include "vfio-iommufd.h"
>>> +
>>> +static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
>>> +{
>>> +    if (!iommufd_change_process_capable(be)) {
>>> +        if (errp) {
>>> +            error_setg(errp, "vfio iommufd backend does not support "
>>> +                       "IOMMU_IOAS_CHANGE_PROCESS");
>>> +        }
>>> +        return false;
>>> +    }
>>> +    return true;
>>> +}
>>> +
>>> +static const VMStateDescription iommufd_cpr_vmstate = {
>>> +    .name = "iommufd",
>>> +    .version_id = 0,
>>> +    .minimum_version_id = 0,
>>> +    .needed = cpr_incoming_needed,
>>> +    .fields = (VMStateField[]) {
>>> +        VMSTATE_END_OF_LIST()
>>> +    }
>>> +};
>>> +
>>> +bool vfio_iommufd_cpr_register_iommufd(IOMMUFDBackend *be, Error **errp)
>>> +{
>>> +    Error **cpr_blocker = &be->cpr_blocker;
>>> +
>>> +    if (!vfio_cpr_supported(be, cpr_blocker)) {
>>> +        return migrate_add_blocker_modes(cpr_blocker, errp,
>>> +                                         MIG_MODE_CPR_TRANSFER, -1) == 0;
>>> +    }
>>> +
>>> +    vmstate_register(NULL, -1, &iommufd_cpr_vmstate, be);
>>> +
>>> +    return true;
>>> +}
>>> +
>>> +void vfio_iommufd_cpr_unregister_iommufd(IOMMUFDBackend *be)
>>> +{
>>> +    vmstate_unregister(NULL, &iommufd_cpr_vmstate, be);
>>> +    migrate_del_blocker(&be->cpr_blocker);
>>> +}
>>> +
>>> +bool vfio_iommufd_cpr_register_container(VFIOIOMMUFDContainer *container,
>>> +                                         Error **errp)
>>> +{
>>> +    VFIOContainerBase *bcontainer = &container->bcontainer;
>>> +
>>> +    migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
>>> +                                vfio_cpr_reboot_notifier,
>>> +                                MIG_MODE_CPR_REBOOT);
>>> +
>>> +    return true;
>>> +}
>>> +
>>> +void vfio_iommufd_cpr_unregister_container(VFIOIOMMUFDContainer *container)
>>> +{
>>> +    VFIOContainerBase *bcontainer = &container->bcontainer;
>>> +
>>> +    migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
>>> +}
>>> +
>>> +void vfio_iommufd_cpr_register_device(VFIODevice *vbasedev)
>>> +{
>>> +}
>>> +
>>> +void vfio_iommufd_cpr_unregister_device(VFIODevice *vbasedev)
>>> +{
>>> +}
>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>> index ca00d08..c690c2c 100644
>>> --- a/hw/vfio/iommufd.c
>>> +++ b/hw/vfio/iommufd.c
>>> @@ -446,7 +446,7 @@ static void iommufd_cdev_container_destroy(VFIOIOMMUFDContainer *container)
>>>       if (!QLIST_EMPTY(&bcontainer->device_list)) {
>>>           return;
>>>       }
>>> -    vfio_cpr_unregister_container(bcontainer);
>>> +    vfio_iommufd_cpr_unregister_container(container);
>>>       vfio_listener_unregister(bcontainer);
>>>       iommufd_backend_free_id(container->be, container->ioas_id);
>>>       object_unref(container);
>>> @@ -592,7 +592,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>>>           goto err_listener_register;
>>>       }
>>> -    if (!vfio_cpr_register_container(bcontainer, errp)) {
>>> +    if (!vfio_iommufd_cpr_register_container(container, errp)) {
>>>           goto err_listener_register;
>>>       }
>>> @@ -619,6 +619,7 @@ found_container:
>>>       }
>>>       vfio_device_prepare(vbasedev, bcontainer, &dev_info);
>>> +    vfio_iommufd_cpr_register_device(vbasedev);
>>>       trace_iommufd_cdev_device_info(vbasedev->name, devfd, vbasedev->num_irqs,
>>>                                      vbasedev->num_regions, vbasedev->flags);
>>> @@ -656,6 +657,7 @@ static void iommufd_cdev_detach(VFIODevice *vbasedev)
>>>       iommufd_cdev_container_destroy(container);
>>>       vfio_address_space_put(space);
>>> +    vfio_iommufd_cpr_unregister_device(vbasedev);
>>>       iommufd_cdev_unbind_and_disconnect(vbasedev);
>>>       close(vbasedev->fd);
>>>   }
>>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>>> index 98134a7..12711fb 100644
>>> --- a/hw/vfio/meson.build
>>> +++ b/hw/vfio/meson.build
>>> @@ -23,6 +23,7 @@ system_ss.add(when: 'CONFIG_VFIO_XGMAC', if_true: files('calxeda-xgmac.c'))
>>>   system_ss.add(when: 'CONFIG_VFIO_AMD_XGBE', if_true: files('amd-xgbe.c'))
>>>   system_ss.add(when: 'CONFIG_VFIO', if_true: files(
>>>     'cpr.c',
>>> +  'cpr-iommufd.c',
>>
>> This file should be compiled under CONFIG_IOMMUFD.
> 
> Sure, will fix.
> Tomorrow I plan to rebase to the latest master, address this and the few other
> comments that came up for V4, and send V5.  Sound OK?
Please check that it also compiles on Windows.


Thanks,

C.






^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH V4 33/43] vfio/iommufd: add vfio_device_free_name
  2025-05-29 19:24 ` [PATCH V4 33/43] vfio/iommufd: add vfio_device_free_name Steve Sistare
@ 2025-06-10  6:12   ` Cédric Le Goater
  0 siblings, 0 replies; 90+ messages in thread
From: Cédric Le Goater @ 2025-06-10  6:12 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/29/25 21:24, Steve Sistare wrote:
> Define vfio_device_free_name to free the name created by
> vfio_device_get_name.  A subsequent patch will do more there.
> No functional change.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   include/hw/vfio/vfio-device.h | 1 +
>   hw/vfio/ap.c                  | 2 +-
>   hw/vfio/ccw.c                 | 2 +-
>   hw/vfio/device.c              | 5 +++++
>   hw/vfio/pci.c                 | 2 +-
>   hw/vfio/platform.c            | 2 +-
>   6 files changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
> index 6eb6f21..321b442 100644
> --- a/include/hw/vfio/vfio-device.h
> +++ b/include/hw/vfio/vfio-device.h
> @@ -227,6 +227,7 @@ int vfio_device_get_irq_info(VFIODevice *vbasedev, int index,
>   
>   /* Returns 0 on success, or a negative errno. */
>   bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp);
> +void vfio_device_free_name(VFIODevice *vbasedev);
>   void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp);
>   void vfio_device_init(VFIODevice *vbasedev, int type, VFIODeviceOps *ops,
>                         DeviceState *dev, bool ram_discard);
> diff --git a/hw/vfio/ap.c b/hw/vfio/ap.c
> index 785c0a0..013bd59 100644
> --- a/hw/vfio/ap.c
> +++ b/hw/vfio/ap.c
> @@ -180,7 +180,7 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp)
>   
>   error:
>       error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->name);
> -    g_free(vbasedev->name);
> +    vfio_device_free_name(vbasedev);
>   }
>   
>   static void vfio_ap_unrealize(DeviceState *dev)
> diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
> index cea9d6e..903b8b0 100644
> --- a/hw/vfio/ccw.c
> +++ b/hw/vfio/ccw.c
> @@ -619,7 +619,7 @@ out_io_notifier_err:
>   out_region_err:
>       vfio_device_detach(vbasedev);
>   out_attach_dev_err:
> -    g_free(vbasedev->name);
> +    vfio_device_free_name(vbasedev);
>   out_unrealize:
>       if (cdc->unrealize) {
>           cdc->unrealize(cdev);
> diff --git a/hw/vfio/device.c b/hw/vfio/device.c
> index 71fa9f4..151c618 100644
> --- a/hw/vfio/device.c
> +++ b/hw/vfio/device.c
> @@ -317,6 +317,11 @@ bool vfio_device_get_name(VFIODevice *vbasedev, Error **errp)
>       return true;
>   }
>   
> +void vfio_device_free_name(VFIODevice *vbasedev)
> +{
> +    g_free(vbasedev->name);

You could use g_clear_pointer() here; it frees the name and resets the
pointer to NULL.
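
A minimal standalone sketch of that pattern, for illustration only.  It
deliberately avoids GLib so that it builds on its own: FakeDevice and both
helpers are hypothetical, and plain free() plus an explicit NULL assignment
stands in for what g_clear_pointer(&vbasedev->name, g_free) would do in the
real function.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for VFIODevice; only the name field matters here. */
typedef struct FakeDevice {
    char *name;
} FakeDevice;

/* Hypothetical helper: replace the device name with a heap copy of name. */
static void fake_device_set_name(FakeDevice *dev, const char *name)
{
    size_t len;

    assert(dev != NULL && name != NULL);
    len = strlen(name) + 1;
    free(dev->name);
    dev->name = malloc(len);
    if (dev->name) {
        memcpy(dev->name, name, len);
    }
}

/*
 * Free the name and clear the pointer, so that a second call or a later
 * "if (!dev->name)" check cannot touch freed memory.
 */
static void fake_device_free_name(FakeDevice *dev)
{
    assert(dev != NULL);
    free(dev->name);
    dev->name = NULL;
}
```

This matters for callers such as vfio_base_device_init, which assign a new
name right after freeing the old one, and for error paths that may run the
free more than once.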

Thanks,

C.

  
> +}
> +
>   void vfio_device_set_fd(VFIODevice *vbasedev, const char *str, Error **errp)
>   {
>       ERRP_GUARD();
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index c8d6ee0..7da7a9c 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2949,7 +2949,7 @@ static void vfio_pci_put_device(VFIOPCIDevice *vdev)
>   {
>       vfio_device_detach(&vdev->vbasedev);
>   
> -    g_free(vdev->vbasedev.name);
> +    vfio_device_free_name(&vdev->vbasedev);
>       g_free(vdev->msix);
>   }
>   
> diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
> index 9a21f2e..5c1795a 100644
> --- a/hw/vfio/platform.c
> +++ b/hw/vfio/platform.c
> @@ -530,7 +530,7 @@ static bool vfio_base_device_init(VFIODevice *vbasedev, Error **errp)
>   {
>       /* @fd takes precedence over @sysfsdev which takes precedence over @host */
>       if (vbasedev->fd < 0 && vbasedev->sysfsdev) {
> -        g_free(vbasedev->name);
> +        vfio_device_free_name(vbasedev);
>           vbasedev->name = g_path_get_basename(vbasedev->sysfsdev);
>       } else if (vbasedev->fd < 0) {
>           if (!vbasedev->name || strchr(vbasedev->name, '/')) {



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH V4 36/43] migration: vfio cpr state hook
  2025-05-29 19:24 ` [PATCH V4 36/43] migration: vfio cpr state hook Steve Sistare
@ 2025-06-10  6:14   ` Cédric Le Goater
  0 siblings, 0 replies; 90+ messages in thread
From: Cédric Le Goater @ 2025-06-10  6:14 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/29/25 21:24, Steve Sistare wrote:
> Define a list of vfio devices in CPR state, in a subsection so that
> older QEMU can be live updated to this version.  However, new QEMU
> will not be live updateable to old QEMU.  This is acceptable because
> CPR is not yet commonly used, and updates to older versions are unusual.
> 
> The contents of each device object will be defined by the vfio subsystem
> in a subsequent patch.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   include/hw/vfio/vfio-cpr.h |  1 +
>   include/migration/cpr.h    | 12 ++++++++++++
>   hw/vfio/cpr-iommufd.c      |  2 ++
>   migration/cpr.c            | 14 +++++---------
>   4 files changed, 20 insertions(+), 9 deletions(-)
> 
> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
> index b9b77ae..619af07 100644
> --- a/include/hw/vfio/vfio-cpr.h
> +++ b/include/hw/vfio/vfio-cpr.h
> @@ -74,5 +74,6 @@ void vfio_cpr_delete_vector_fd(struct VFIOPCIDevice *vdev, const char *name,
>                                  int nr);
>   
>   extern const VMStateDescription vfio_cpr_pci_vmstate;
> +extern const VMStateDescription vmstate_cpr_vfio_devices;
>   
>   #endif /* HW_VFIO_VFIO_CPR_H */
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> index 7fd8065..8fd8bfe 100644
> --- a/include/migration/cpr.h
> +++ b/include/migration/cpr.h
> @@ -9,11 +9,23 @@
>   #define MIGRATION_CPR_H
>   
>   #include "qapi/qapi-types-migration.h"
> +#include "qemu/queue.h"
>   
>   #define MIG_MODE_NONE           -1
>   
>   #define QEMU_CPR_FILE_MAGIC     0x51435052
>   #define QEMU_CPR_FILE_VERSION   0x00000001
> +#define CPR_STATE "CprState"
> +
> +typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
> +typedef QLIST_HEAD(CprVFIODeviceList, CprVFIODevice) CprVFIODeviceList;
> +
> +typedef struct CprState {
> +    CprFdList fds;
> +    CprVFIODeviceList vfio_devices;
> +} CprState;
> +
> +extern CprState cpr_state;
>   
>   void cpr_save_fd(const char *name, int id, int fd);
>   void cpr_delete_fd(const char *name, int id);
> diff --git a/hw/vfio/cpr-iommufd.c b/hw/vfio/cpr-iommufd.c
> index 60bd7e8..3e78265 100644
> --- a/hw/vfio/cpr-iommufd.c
> +++ b/hw/vfio/cpr-iommufd.c
> @@ -14,6 +14,8 @@
>   #include "system/iommufd.h"
>   #include "vfio-iommufd.h"
>   
> +const VMStateDescription vmstate_cpr_vfio_devices;  /* TBD in a later patch */
> +

So vmstate_cpr_vfio_devices should be only compiled if CONFIG_IOMMUFD
is set but ...

>   static bool vfio_cpr_supported(IOMMUFDBackend *be, Error **errp)
>   {
>       if (!iommufd_change_process_capable(be)) {
> diff --git a/migration/cpr.c b/migration/cpr.c
> index 4574608..47898ab 100644
> --- a/migration/cpr.c
> +++ b/migration/cpr.c
> @@ -22,13 +22,7 @@
>   /*************************************************************************/
>   /* cpr state container for all information to be saved. */
>   
> -typedef QLIST_HEAD(CprFdList, CprFd) CprFdList;
> -
> -typedef struct CprState {
> -    CprFdList fds;
> -} CprState;
> -
> -static CprState cpr_state;
> +CprState cpr_state;
>   
>   /****************************************************************************/
>   
> @@ -129,8 +123,6 @@ int cpr_open_fd(const char *path, int flags, const char *name, int id,
>   }
>   
>   /*************************************************************************/
> -#define CPR_STATE "CprState"
> -
>   static const VMStateDescription vmstate_cpr_state = {
>       .name = CPR_STATE,
>       .version_id = 1,
> @@ -138,6 +130,10 @@ static const VMStateDescription vmstate_cpr_state = {
>       .fields = (VMStateField[]) {
>           VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, next),
>           VMSTATE_END_OF_LIST()
> +    },
> +    .subsections = (const VMStateDescription * const []) {
> +        &vmstate_cpr_vfio_devices,

... vmstate_cpr_vfio_devices is also used when CONFIG_IOMMUFD is not set.
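
One possible way to keep this reference resolvable is to provide a fallback
definition when iommufd support is compiled out.  The sketch below is a
single-file illustration under that assumption, not a proposal for the
actual file layout: FakeVMStateDescription, the name strings, and
count_subsections() are hypothetical stand-ins, and CONFIG_IOMMUFD is
treated as a plain preprocessor macro.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for QEMU's VMStateDescription. */
typedef struct FakeVMStateDescription {
    const char *name;
} FakeVMStateDescription;

#ifdef CONFIG_IOMMUFD
/* Would live in the iommufd-only file; the name string is made up. */
const FakeVMStateDescription vmstate_cpr_vfio_devices = {
    .name = "CprState/vfio-devices",
};
#else
/* Fallback so the unconditional reference below still resolves. */
const FakeVMStateDescription vmstate_cpr_vfio_devices = {
    .name = "CprState/vfio-devices-stub",
};
#endif

/* Referenced unconditionally, like the subsections array in the patch. */
static const FakeVMStateDescription *const subsections[] = {
    &vmstate_cpr_vfio_devices,
    NULL,
};

/* Count the entries of a NULL-terminated subsection list. */
static size_t count_subsections(const FakeVMStateDescription *const *list)
{
    size_t n = 0;

    assert(list != NULL);
    while (list[n]) {
        n++;
    }
    return n;
}
```

Either branch defines the symbol exactly once, so the array links whether or
not CONFIG_IOMMUFD is defined.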


Thanks,

C.





> +        NULL
>       }
>   };
>   /*************************************************************************/



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH V4 43/43] vfio/container: delete old cpr register
  2025-05-29 19:24 ` [PATCH V4 43/43] vfio/container: delete old cpr register Steve Sistare
@ 2025-06-10  6:14   ` Cédric Le Goater
  0 siblings, 0 replies; 90+ messages in thread
From: Cédric Le Goater @ 2025-06-10  6:14 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Michael S. Tsirkin, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 5/29/25 21:24, Steve Sistare wrote:
> vfio_cpr_[un]register_container are no longer used since they were
> subsumed by container type-specific registration.  Delete them.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>



Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
>   include/hw/vfio/vfio-cpr.h |  4 ----
>   hw/vfio/cpr.c              | 13 -------------
>   2 files changed, 17 deletions(-)
> 
> diff --git a/include/hw/vfio/vfio-cpr.h b/include/hw/vfio/vfio-cpr.h
> index f88e4ba..5b6c960 100644
> --- a/include/hw/vfio/vfio-cpr.h
> +++ b/include/hw/vfio/vfio-cpr.h
> @@ -44,10 +44,6 @@ void vfio_legacy_cpr_unregister_container(struct VFIOContainer *container);
>   int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
>                                Error **errp);
>   
> -bool vfio_cpr_register_container(struct VFIOContainerBase *bcontainer,
> -                                 Error **errp);
> -void vfio_cpr_unregister_container(struct VFIOContainerBase *bcontainer);
> -
>   bool vfio_iommufd_cpr_register_container(struct VFIOIOMMUFDContainer *container,
>                                            Error **errp);
>   void vfio_iommufd_cpr_unregister_container(
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> index f5555ca..c97e467 100644
> --- a/hw/vfio/cpr.c
> +++ b/hw/vfio/cpr.c
> @@ -29,19 +29,6 @@ int vfio_cpr_reboot_notifier(NotifierWithReturn *notifier,
>       return 0;
>   }
>   
> -bool vfio_cpr_register_container(VFIOContainerBase *bcontainer, Error **errp)
> -{
> -    migration_add_notifier_mode(&bcontainer->cpr_reboot_notifier,
> -                                vfio_cpr_reboot_notifier,
> -                                MIG_MODE_CPR_REBOOT);
> -    return true;
> -}
> -
> -void vfio_cpr_unregister_container(VFIOContainerBase *bcontainer)
> -{
> -    migration_remove_notifier(&bcontainer->cpr_reboot_notifier);
> -}
> -
>   #define STRDUP_VECTOR_FD_NAME(vdev, name)   \
>       g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
>   



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH V4 16/43] pci: skip reset during cpr
  2025-06-04 13:48               ` Cédric Le Goater
@ 2025-06-10 16:31                 ` Michael S. Tsirkin
  2025-06-10 17:05                   ` Steven Sistare
  2025-06-10 17:09                   ` Cédric Le Goater
  0 siblings, 2 replies; 90+ messages in thread
From: Michael S. Tsirkin @ 2025-06-10 16:31 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Steven Sistare, qemu-devel, Alex Williamson, Yi Liu, Eric Auger,
	Zhenzhong Duan, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On Wed, Jun 04, 2025 at 03:48:40PM +0200, Cédric Le Goater wrote:
> > I don't see any advantage to making this a class attribute.  I looked for examples
> > of using such attributes for vfio to configure pci, and found very little.  It
> > sounds like overkill since vfio already sets and gets PCIDevice members directly
> > in many places.
> > 
> > I defined skip_reset_on_cpr based on this existing example:
> > 
> > vfio_instance_init()
> >      pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS
> 
> pci_dev->cap_present can be modified at realize time. skip_reset_on_cpr
> is a constant, for which a class attribute is more appropriate.
> This is minor.
> 
> Michael,
> 
>   Are you ok with the 'skip_reset_on_cpr' bool ?

Generally yes, but maybe cap_present bit is even cleaner?
vfio already pokes at it, and we have a history of encoding
quirks there, see QEMU_PCIE_LNKSTA_DLLLA_BITNR for example.


> > > I wonder if the resettable interface, and more specifically the
> > > RESET_TYPE_SNAPSHOT_LOAD type, might be useful. Have you explored
> > > this alternative ?
> > 
> > RESET_TYPE_SNAPSHOT_LOAD (or a new type such as RESET_TYPE_CPR) would skip
> > reset for all devices, but we only skip for vfio_pci.  All other devices
> > (including virtio) save and restore state using standard migration vmstate,
> > and must call reset.
> OK.
> 
> C.



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH V4 16/43] pci: skip reset during cpr
  2025-06-10 16:31                 ` Michael S. Tsirkin
@ 2025-06-10 17:05                   ` Steven Sistare
  2025-06-10 17:11                     ` Cédric Le Goater
  2025-06-10 17:09                   ` Cédric Le Goater
  1 sibling, 1 reply; 90+ messages in thread
From: Steven Sistare @ 2025-06-10 17:05 UTC (permalink / raw)
  To: Michael S. Tsirkin, Cédric Le Goater
  Cc: qemu-devel, Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 6/10/2025 12:31 PM, Michael S. Tsirkin wrote:
> On Wed, Jun 04, 2025 at 03:48:40PM +0200, Cédric Le Goater wrote:
>>> I don't see any advantage to making this a class attribute.  I looked for examples
>>> of using such attributes for vfio to configure pci, and found very little.  It
>>> sounds like overkill since vfio already sets and gets PCIDevice members directly
>>> in many places.
>>>
>>> I defined skip_reset_on_cpr based on this existing example:
>>>
>>> vfio_instance_init()
>>>       pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS
>>
>> pci_dev->cap_present can be modified at realize time. skip_reset_on_cpr
>> is a constant, for which a class attribute is more appropriate.
>> This is minor.
>>
>> Michael,
>>
>>    Are you ok with the 'skip_reset_on_cpr' bool ?
> 
> Generally yes, but maybe cap_present bit is even cleaner?
> vfio already pokes at it, and we have a history of encoding
> quirks there, see QEMU_PCIE_LNKSTA_DLLLA_BITNR for example.

Sure, I can send a new version based on a cap_present bit QEMU_PCI_SKIP_RESET_ON_CPR.
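
For illustration only, the cap_present approach discussed above could look
roughly like the standalone sketch below. It is not the actual patch: the
Demo* types, the DEMO_* flag names and bit positions, and the cpr_incoming
parameter are hypothetical stand-ins, not QEMU's real PCIDevice or
QEMU_PCI_* definitions. It only shows the pattern of a device opting in
through one bit at instance-init time and the reset path testing that bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit numbers; QEMU's real QEMU_PCI_* values differ. */
enum {
    DEMO_PCI_CAP_EXPRESS_BITNR = 0,
    DEMO_PCI_SKIP_RESET_ON_CPR_BITNR = 1,
};
#define DEMO_PCI_CAP_EXPRESS       (1u << DEMO_PCI_CAP_EXPRESS_BITNR)
#define DEMO_PCI_SKIP_RESET_ON_CPR (1u << DEMO_PCI_SKIP_RESET_ON_CPR_BITNR)

/* Hypothetical stand-in for the device state that carries cap_present. */
typedef struct DemoPCIDevice {
    uint32_t cap_present;   /* bitmask of capabilities and quirks */
    unsigned reset_count;   /* counts resets actually performed */
} DemoPCIDevice;

/* A vfio-like device opts in to skipping reset when it is initialized. */
static void demo_instance_init(DemoPCIDevice *dev)
{
    assert(dev != NULL);
    dev->cap_present = DEMO_PCI_CAP_EXPRESS | DEMO_PCI_SKIP_RESET_ON_CPR;
    dev->reset_count = 0;
}

/*
 * Reset path: skip the reset only when a CPR load is in progress and the
 * device has set the opt-in bit. Returns true if a reset was performed.
 */
static bool demo_pci_reset(DemoPCIDevice *dev, bool cpr_incoming)
{
    assert(dev != NULL);
    if (cpr_incoming && (dev->cap_present & DEMO_PCI_SKIP_RESET_ON_CPR)) {
        return false;       /* device is already configured; leave it alone */
    }
    dev->reset_count++;
    return true;
}
```

A device that never sets the bit is still reset in every case, which
matches the point made earlier in the thread that only vfio-pci skips
reset while other devices restore state from vmstate and must be reset.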

- Steve


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH V4 16/43] pci: skip reset during cpr
  2025-06-10 16:31                 ` Michael S. Tsirkin
  2025-06-10 17:05                   ` Steven Sistare
@ 2025-06-10 17:09                   ` Cédric Le Goater
  1 sibling, 0 replies; 90+ messages in thread
From: Cédric Le Goater @ 2025-06-10 17:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Steven Sistare, qemu-devel, Alex Williamson, Yi Liu, Eric Auger,
	Zhenzhong Duan, Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 6/10/25 18:31, Michael S. Tsirkin wrote:
> On Wed, Jun 04, 2025 at 03:48:40PM +0200, Cédric Le Goater wrote:
>>> I don't see any advantage to making this a class attribute.  I looked for examples
>>> of using such attributes for vfio to configure pci, and found very little.  It
>>> sounds like overkill since vfio already sets and gets PCIDevice members directly
>>> in many places.
>>>
>>> I defined skip_reset_on_cpr based on this existing example:
>>>
>>> vfio_instance_init()
>>>       pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS
>>
>> pci_dev->cap_present can be modified at realize time. skip_reset_on_cpr
>> is a constant, for which a class attribute is more appropriate.
>> This is minor.
>>
>> Michael,
>>
>>    Are you ok with the 'skip_reset_on_cpr' bool ?
> 
> Generally yes, but maybe cap_present bit is even cleaner?
> vfio already pokes at it, and we have a history of encoding
> quirks there, see QEMU_PCIE_LNKSTA_DLLLA_BITNR for example.

I agree. An extra bit in the cap_present field would be cleaner.

Thanks,

C.



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH V4 16/43] pci: skip reset during cpr
  2025-06-10 17:05                   ` Steven Sistare
@ 2025-06-10 17:11                     ` Cédric Le Goater
  2025-06-10 17:14                       ` Steven Sistare
  0 siblings, 1 reply; 90+ messages in thread
From: Cédric Le Goater @ 2025-06-10 17:11 UTC (permalink / raw)
  To: Steven Sistare, Michael S. Tsirkin
  Cc: qemu-devel, Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 6/10/25 19:05, Steven Sistare wrote:
> On 6/10/2025 12:31 PM, Michael S. Tsirkin wrote:
>> On Wed, Jun 04, 2025 at 03:48:40PM +0200, Cédric Le Goater wrote:
>>>> I don't see any advantage to making this a class attribute.  I looked for examples
>>>> of using such attributes for vfio to configure pci, and found very little.  It
>>>> sounds like overkill since vfio already sets and gets PCIDevice members directly
>>>> in many places.
>>>>
>>>> I defined skip_reset_on_cpr based on this existing example:
>>>>
>>>> vfio_instance_init()
>>>>       pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS
>>>
>>> pci_dev->cap_present can be modified at realize time. skip_reset_on_cpr
>>> is a constant, for which a class attribute is more appropriate.
>>> This is minor.
>>>
>>> Michael,
>>>
>>>    Are you ok with the 'skip_reset_on_cpr' bool ?
>>
>> Generally yes, but maybe cap_present bit is even cleaner?
>> vfio already pokes at it, and we have a history of encoding
>> quirks there, see QEMU_PCIE_LNKSTA_DLLLA_BITNR for example.
> 
> Sure, I can send a new version based on a cap_present bit QEMU_PCI_SKIP_RESET_ON_CPR.

Please send an update of patch "pci: skip reset during cpr" in the v5 series.
Hopefully it will apply cleanly.


Thanks,

C.








^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH V4 16/43] pci: skip reset during cpr
  2025-06-10 17:11                     ` Cédric Le Goater
@ 2025-06-10 17:14                       ` Steven Sistare
  2025-06-10 17:19                         ` Cédric Le Goater
  0 siblings, 1 reply; 90+ messages in thread
From: Steven Sistare @ 2025-06-10 17:14 UTC (permalink / raw)
  To: Cédric Le Goater, Michael S. Tsirkin
  Cc: qemu-devel, Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 6/10/2025 1:11 PM, Cédric Le Goater wrote:
> On 6/10/25 19:05, Steven Sistare wrote:
>> On 6/10/2025 12:31 PM, Michael S. Tsirkin wrote:
>>> On Wed, Jun 04, 2025 at 03:48:40PM +0200, Cédric Le Goater wrote:
>>>>> I don't see any advantage to making this a class attribute.  I looked for examples
>>>>> of using such attributes for vfio to configure pci, and found very little.  It
>>>>> sounds like overkill since vfio already sets and gets PCIDevice members directly
>>>>> in many places.
>>>>>
>>>>> I defined skip_reset_on_cpr based on this existing example:
>>>>>
>>>>> vfio_instance_init()
>>>>>       pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS
>>>>
>>>> pci_dev->cap_present can be modified at realize time. skip_reset_on_cpr
>>>> is a constant, for which a class attribute is more appropriate.
>>>> This is minor.
>>>>
>>>> Michael,
>>>>
>>>>    Are you ok with the 'skip_reset_on_cpr' bool ?
>>>
>>> Generally yes, but maybe cap_present bit is even cleaner?
>>> vfio already pokes at it, and we have a history of encoding
>>> quirks there, see QEMU_PCIE_LNKSTA_DLLLA_BITNR for example.
>>
>> Sure, I can send a new version based on a cap_present bit QEMU_PCI_SKIP_RESET_ON_CPR.
> 
> Please send an update of patch "pci: skip reset during cpr" in the v5 series.
> Hopefully it will apply cleanly.

Please clarify: do you want me to send a delta that applies on top of the
old patch, or send a new version of the old patch?

- Steve



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH V4 16/43] pci: skip reset during cpr
  2025-06-10 17:14                       ` Steven Sistare
@ 2025-06-10 17:19                         ` Cédric Le Goater
  0 siblings, 0 replies; 90+ messages in thread
From: Cédric Le Goater @ 2025-06-10 17:19 UTC (permalink / raw)
  To: Steven Sistare, Michael S. Tsirkin
  Cc: qemu-devel, Alex Williamson, Yi Liu, Eric Auger, Zhenzhong Duan,
	Marcel Apfelbaum, Peter Xu, Fabiano Rosas

On 6/10/25 19:14, Steven Sistare wrote:
> On 6/10/2025 1:11 PM, Cédric Le Goater wrote:
>> On 6/10/25 19:05, Steven Sistare wrote:
>>> On 6/10/2025 12:31 PM, Michael S. Tsirkin wrote:
>>>> On Wed, Jun 04, 2025 at 03:48:40PM +0200, Cédric Le Goater wrote:
>>>>>> I don't see any advantage to making this a class attribute.  I looked for examples
>>>>>> of using such attributes for vfio to configure pci, and found very little.  It
>>>>>> sounds like overkill since vfio already sets and gets PCIDevice members directly
>>>>>> in many places.
>>>>>>
>>>>>> I defined skip_reset_on_cpr based on this existing example:
>>>>>>
>>>>>> vfio_instance_init()
>>>>>>       pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS
>>>>>
>>>>> pci_dev->cap_present can be modified at realize time. skip_reset_on_cpr
>>>>> is a constant, for which a class attribute is more appropriate.
>>>>> This is minor.
>>>>>
>>>>> Michael,
>>>>>
>>>>>    Are you ok with the 'skip_reset_on_cpr' bool ?
>>>>
>>>> Generally yes, but maybe cap_present bit is even cleaner?
>>>> vfio already pokes at it, and we have a history of encoding
>>>> quirks there, see QEMU_PCIE_LNKSTA_DLLLA_BITNR for example.
>>>
>>> Sure, I can send a new version based on a cap_present bit QEMU_PCI_SKIP_RESET_ON_CPR.
>>
>> Please send an update of patch "pci: skip reset during cpr" in the v5 series.
>> Hopefully it will apply cleanly.
> 
> Please clarify: do you want me to send a delta that applies on top of the
> old patch, or send a new version of the old patch?

A new version of the patch.

Thanks,

C.




^ permalink raw reply	[flat|nested] 90+ messages in thread

end of thread, other threads:[~2025-06-10 17:19 UTC | newest]

Thread overview: 90+ messages
2025-05-29 19:23 [PATCH V4 00/43] Live update: vfio and iommufd Steve Sistare
2025-05-29 19:23 ` [PATCH V4 01/43] MAINTAINERS: Add reviewer for CPR Steve Sistare
2025-05-29 19:23 ` [PATCH V4 02/43] vfio: return mr from vfio_get_xlat_addr Steve Sistare
2025-06-03 10:39   ` Duan, Zhenzhong
2025-05-29 19:23 ` [PATCH V4 03/43] vfio/container: pass MemoryRegion to DMA operations Steve Sistare
2025-06-03 10:39   ` Duan, Zhenzhong
2025-05-29 19:24 ` [PATCH V4 04/43] vfio/pci: vfio_pci_put_device on failure Steve Sistare
2025-06-03 10:40   ` Duan, Zhenzhong
2025-06-03 14:09     ` Steven Sistare
2025-06-04  3:55       ` Duan, Zhenzhong
2025-06-04 13:33         ` Steven Sistare
2025-06-05  3:02           ` Duan, Zhenzhong
2025-06-05 15:16             ` Steven Sistare
2025-06-05 21:14               ` Cédric Le Goater
2025-05-29 19:24 ` [PATCH V4 05/43] migration: cpr helpers Steve Sistare
2025-05-29 19:24 ` [PATCH V4 06/43] migration: lower handler priority Steve Sistare
2025-05-29 19:24 ` [PATCH V4 07/43] vfio: vfio_find_ram_discard_listener Steve Sistare
2025-06-03 10:59   ` Duan, Zhenzhong
2025-05-29 19:24 ` [PATCH V4 08/43] vfio: move vfio-cpr.h Steve Sistare
2025-06-03 11:01   ` Duan, Zhenzhong
2025-05-29 19:24 ` [PATCH V4 09/43] vfio/container: register container for cpr Steve Sistare
2025-06-01 15:21   ` Cédric Le Goater
2025-06-03 11:57   ` Duan, Zhenzhong
2025-06-03 14:09     ` Steven Sistare
2025-06-03 14:17       ` Steven Sistare
2025-06-03 15:27         ` Cédric Le Goater
2025-05-29 19:24 ` [PATCH V4 10/43] vfio/container: preserve descriptors Steve Sistare
2025-06-01 16:57   ` Cédric Le Goater
2025-06-03 11:57   ` Duan, Zhenzhong
2025-05-29 19:24 ` [PATCH V4 11/43] vfio/container: discard old DMA vaddr Steve Sistare
2025-05-29 19:24 ` [PATCH V4 12/43] vfio/container: restore " Steve Sistare
2025-06-01 16:48   ` Cédric Le Goater
2025-05-29 19:24 ` [PATCH V4 13/43] vfio/container: mdev cpr blocker Steve Sistare
2025-05-29 19:24 ` [PATCH V4 14/43] vfio/container: recover from unmap-all-vaddr failure Steve Sistare
2025-05-29 19:24 ` [PATCH V4 15/43] pci: export msix_is_pending Steve Sistare
2025-05-29 19:24 ` [PATCH V4 16/43] pci: skip reset during cpr Steve Sistare
2025-06-01 16:38   ` Cédric Le Goater
2025-06-01 19:07     ` Michael S. Tsirkin
2025-06-02 12:36       ` Steven Sistare
2025-06-04  7:09         ` Cédric Le Goater
2025-06-04 11:59           ` Cédric Le Goater
2025-06-04 13:15             ` Steven Sistare
2025-06-04 13:48               ` Cédric Le Goater
2025-06-10 16:31                 ` Michael S. Tsirkin
2025-06-10 17:05                   ` Steven Sistare
2025-06-10 17:11                     ` Cédric Le Goater
2025-06-10 17:14                       ` Steven Sistare
2025-06-10 17:19                         ` Cédric Le Goater
2025-06-10 17:09                   ` Cédric Le Goater
2025-05-29 19:24 ` [PATCH V4 17/43] vfio-pci: " Steve Sistare
2025-06-01 16:39   ` Cédric Le Goater
2025-05-29 19:24 ` [PATCH V4 18/43] vfio/pci: vfio_pci_vector_init Steve Sistare
2025-06-01 15:25   ` Cédric Le Goater
2025-05-29 19:24 ` [PATCH V4 19/43] vfio/pci: vfio_notifier_init Steve Sistare
2025-05-29 19:24 ` [PATCH V4 20/43] vfio/pci: pass vector to virq functions Steve Sistare
2025-05-29 19:24 ` [PATCH V4 21/43] vfio/pci: vfio_notifier_init cpr parameters Steve Sistare
2025-05-29 19:24 ` [PATCH V4 22/43] vfio/pci: vfio_notifier_cleanup Steve Sistare
2025-05-29 19:24 ` [PATCH V4 23/43] vfio/pci: export MSI functions Steve Sistare
2025-06-01 15:27   ` Cédric Le Goater
2025-05-29 19:24 ` [PATCH V4 24/43] vfio-pci: preserve MSI Steve Sistare
2025-05-29 19:24 ` [PATCH V4 25/43] vfio-pci: preserve INTx Steve Sistare
2025-05-29 19:24 ` [PATCH V4 26/43] migration: close kvm after cpr Steve Sistare
2025-05-29 19:24 ` [PATCH V4 27/43] migration: cpr_get_fd_param helper Steve Sistare
2025-05-29 19:24 ` [PATCH V4 28/43] backends/iommufd: iommufd_backend_map_file_dma Steve Sistare
2025-05-29 19:24 ` [PATCH V4 29/43] backends/iommufd: change process ioctl Steve Sistare
2025-05-29 19:24 ` [PATCH V4 30/43] physmem: qemu_ram_get_fd_offset Steve Sistare
2025-05-29 19:24 ` [PATCH V4 31/43] vfio/iommufd: use IOMMU_IOAS_MAP_FILE Steve Sistare
2025-05-29 19:24 ` [PATCH V4 32/43] vfio/iommufd: invariant device name Steve Sistare
2025-06-10  6:10   ` Cédric Le Goater
2025-05-29 19:24 ` [PATCH V4 33/43] vfio/iommufd: add vfio_device_free_name Steve Sistare
2025-06-10  6:12   ` Cédric Le Goater
2025-05-29 19:24 ` [PATCH V4 34/43] vfio/iommufd: device name blocker Steve Sistare
2025-05-29 19:24 ` [PATCH V4 35/43] vfio/iommufd: register container for cpr Steve Sistare
2025-06-09 20:30   ` Cédric Le Goater
2025-06-09 20:47     ` Steven Sistare
2025-06-10  6:11       ` Cédric Le Goater
2025-05-29 19:24 ` [PATCH V4 36/43] migration: vfio cpr state hook Steve Sistare
2025-06-10  6:14   ` Cédric Le Goater
2025-05-29 19:24 ` [PATCH V4 37/43] vfio/iommufd: cpr state Steve Sistare
2025-05-29 19:24 ` [PATCH V4 38/43] vfio/iommufd: preserve descriptors Steve Sistare
2025-05-29 19:24 ` [PATCH V4 39/43] vfio/iommufd: reconstruct device Steve Sistare
2025-05-29 19:24 ` [PATCH V4 40/43] vfio/iommufd: reconstruct hwpt Steve Sistare
2025-05-29 19:24 ` [PATCH V4 41/43] vfio/iommufd: change process Steve Sistare
2025-05-29 19:24 ` [PATCH V4 42/43] iommufd: preserve DMA mappings Steve Sistare
2025-05-29 19:24 ` [PATCH V4 43/43] vfio/container: delete old cpr register Steve Sistare
2025-06-10  6:14   ` Cédric Le Goater
2025-06-01 17:26 ` [PATCH V4 00/43] Live update: vfio and iommufd Cédric Le Goater
2025-06-02 12:42   ` Steven Sistare
2025-06-03 12:09 ` Duan, Zhenzhong
2025-06-03 14:09   ` Steven Sistare
