qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5] vfio: relax the vIOMMU check
@ 2025-09-10  2:36 Zhenzhong Duan
  2025-09-10  2:36 ` [PATCH 1/5] vfio/iommufd: Add framework code to support getting dirty bitmap before unmap Zhenzhong Duan
                   ` (4 more replies)
  0 siblings, 5 replies; 36+ messages in thread
From: Zhenzhong Duan @ 2025-09-10  2:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, yi.l.liu,
	clement.mathieu--drif, eric.auger, joao.m.martins, avihaih,
	xudong.hao, giovanni.cabiddu, mark.gross, arjan.van.de.ven,
	Zhenzhong Duan

Hi,

This series relax the vIOMMU check and allows live migration with vIOMMU
without VFs using device dirty tracking. It's rewritten based on first 4
patches of [1] from Joao.

Currently what block us is the lack of dirty bitmap query with iommufd
before unmap. By adding that query in patch2 plus an extra optimization
in patch3 that let us simply read the dirty bit (without clearing it),
then an optimization in patch4 to handle a corner case of domain switch
in guest, finally we relax the check in patch5.

We tested VM live migration (running QAT workload in VM) with QAT
device passthrough, below matrix configs:
1.Scalable mode vIOMMU + IOMMUFD cdev mode
2.Scalable mode vIOMMU + legacy VFIO mode
3.legacy mode vIOMMU + IOMMUFD cdev mode
4.legacy mode vIOMMU + legacy VFIO mode

[1] https://github.com/jpemartins/qemu/commits/vfio-migration-viommu/

Thanks
Zhenzhong

Zhenzhong Duan (5):
  vfio/iommufd: Add framework code to support getting dirty bitmap
    before unmap
  vfio/iommufd: Query dirty bitmap before DMA unmap
  vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
  intel_iommu: Optimize unmap_bitmap during migration
  vfio/migration: Allow live migration with vIOMMU without VFs using
    device dirty tracking

 hw/vfio/vfio-iommufd.h        |  1 +
 include/hw/vfio/vfio-device.h | 10 ++++++
 include/system/iommufd.h      |  2 +-
 backends/iommufd.c            |  5 +--
 hw/i386/intel_iommu.c         | 42 ++++++++++++++++++++++
 hw/vfio/container-base.c      |  5 +--
 hw/vfio/device.c              |  6 ++++
 hw/vfio/iommufd.c             | 66 +++++++++++++++++++++++++++++------
 hw/vfio/migration.c           |  6 ++--
 backends/trace-events         |  2 +-
 10 files changed, 123 insertions(+), 22 deletions(-)

-- 
2.47.1



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/5] vfio/iommufd: Add framework code to support getting dirty bitmap before unmap
  2025-09-10  2:36 [PATCH 0/5] vfio: relax the vIOMMU check Zhenzhong Duan
@ 2025-09-10  2:36 ` Zhenzhong Duan
  2025-09-19  9:30   ` Cédric Le Goater
  2025-09-10  2:36 ` [PATCH 2/5] vfio/iommufd: Query dirty bitmap before DMA unmap Zhenzhong Duan
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 36+ messages in thread
From: Zhenzhong Duan @ 2025-09-10  2:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, yi.l.liu,
	clement.mathieu--drif, eric.auger, joao.m.martins, avihaih,
	xudong.hao, giovanni.cabiddu, mark.gross, arjan.van.de.ven,
	Zhenzhong Duan

Currently we support device and iommu dirty tracking, device dirty
tracking is preferred.

Add the framework code in iommufd_cdev_unmap_one() to choose either
device or iommu dirty tracking, just like vfio_legacy_dma_unmap_one().

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
---
 hw/vfio/iommufd.c | 56 +++++++++++++++++++++++++++++++++++++----------
 1 file changed, 45 insertions(+), 11 deletions(-)

diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 48c590b6a9..b5d6e54c45 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -58,33 +58,67 @@ static int iommufd_cdev_map_file(const VFIOContainerBase *bcontainer,
                                         iova, size, fd, start, readonly);
 }
 
-static int iommufd_cdev_unmap(const VFIOContainerBase *bcontainer,
-                              hwaddr iova, ram_addr_t size,
-                              IOMMUTLBEntry *iotlb, bool unmap_all)
+static int iommufd_cdev_unmap_one(const VFIOContainerBase *bcontainer,
+                                  hwaddr iova, ram_addr_t size,
+                                  IOMMUTLBEntry *iotlb)
 {
     const VFIOIOMMUFDContainer *container =
         container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
+    bool need_dirty_sync = false;
+    Error *local_err = NULL;
+    IOMMUFDBackend *be = container->be;
+    uint32_t ioas_id = container->ioas_id;
+    int ret;
+
+    if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer)) {
+        if (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
+            bcontainer->dirty_pages_supported) {
+            /* TODO: query dirty bitmap before DMA unmap */
+            return iommufd_backend_unmap_dma(be, ioas_id, iova, size);
+        }
+
+        need_dirty_sync = true;
+    }
+
+    ret = iommufd_backend_unmap_dma(be, ioas_id, iova, size);
+    if (ret) {
+        return ret;
+    }
 
+    if (need_dirty_sync) {
+        ret = vfio_container_query_dirty_bitmap(bcontainer, iova, size,
+                                                iotlb->translated_addr,
+                                                &local_err);
+        if (ret) {
+            error_report_err(local_err);
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+static int iommufd_cdev_unmap(const VFIOContainerBase *bcontainer,
+                              hwaddr iova, ram_addr_t size,
+                              IOMMUTLBEntry *iotlb, bool unmap_all)
+{
     /* unmap in halves */
     if (unmap_all) {
         Int128 llsize = int128_rshift(int128_2_64(), 1);
         int ret;
 
-        ret = iommufd_backend_unmap_dma(container->be, container->ioas_id,
-                                        0, int128_get64(llsize));
+        ret = iommufd_cdev_unmap_one(bcontainer, 0, int128_get64(llsize),
+                                     iotlb);
 
         if (ret == 0) {
-            ret = iommufd_backend_unmap_dma(container->be, container->ioas_id,
-                                            int128_get64(llsize),
-                                            int128_get64(llsize));
+            ret = iommufd_cdev_unmap_one(bcontainer, int128_get64(llsize),
+                                         int128_get64(llsize), iotlb);
         }
 
         return ret;
     }
 
-    /* TODO: Handle dma_unmap_bitmap with iotlb args (migration) */
-    return iommufd_backend_unmap_dma(container->be,
-                                     container->ioas_id, iova, size);
+    return iommufd_cdev_unmap_one(bcontainer, iova, size, iotlb);
 }
 
 static bool iommufd_cdev_kvm_device_add(VFIODevice *vbasedev, Error **errp)
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 2/5] vfio/iommufd: Query dirty bitmap before DMA unmap
  2025-09-10  2:36 [PATCH 0/5] vfio: relax the vIOMMU check Zhenzhong Duan
  2025-09-10  2:36 ` [PATCH 1/5] vfio/iommufd: Add framework code to support getting dirty bitmap before unmap Zhenzhong Duan
@ 2025-09-10  2:36 ` Zhenzhong Duan
  2025-09-10  2:36 ` [PATCH 3/5] vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support Zhenzhong Duan
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 36+ messages in thread
From: Zhenzhong Duan @ 2025-09-10  2:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, yi.l.liu,
	clement.mathieu--drif, eric.auger, joao.m.martins, avihaih,
	xudong.hao, giovanni.cabiddu, mark.gross, arjan.van.de.ven,
	Zhenzhong Duan

When a existing mapping is unmapped, there could already be dirty bits
which need to be recorded before unmap.

If query dirty bitmap fails, we still need to do unmapping or else there
is stale mapping and it's risky to guest.

Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
---
 hw/vfio/iommufd.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index b5d6e54c45..0057488ce9 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -73,7 +73,13 @@ static int iommufd_cdev_unmap_one(const VFIOContainerBase *bcontainer,
     if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer)) {
         if (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
             bcontainer->dirty_pages_supported) {
-            /* TODO: query dirty bitmap before DMA unmap */
+            ret = vfio_container_query_dirty_bitmap(bcontainer, iova, size,
+                                                    iotlb->translated_addr,
+                                                    &local_err);
+            if (ret) {
+                error_report_err(local_err);
+            }
+            /* Unmap stale mapping even if query dirty bitmap fails */
             return iommufd_backend_unmap_dma(be, ioas_id, iova, size);
         }
 
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 3/5] vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
  2025-09-10  2:36 [PATCH 0/5] vfio: relax the vIOMMU check Zhenzhong Duan
  2025-09-10  2:36 ` [PATCH 1/5] vfio/iommufd: Add framework code to support getting dirty bitmap before unmap Zhenzhong Duan
  2025-09-10  2:36 ` [PATCH 2/5] vfio/iommufd: Query dirty bitmap before DMA unmap Zhenzhong Duan
@ 2025-09-10  2:36 ` Zhenzhong Duan
  2025-09-19 10:27   ` Joao Martins
  2025-09-10  2:37 ` [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration Zhenzhong Duan
  2025-09-10  2:37 ` [PATCH 5/5] vfio/migration: Allow live migration with vIOMMU without VFs using device dirty tracking Zhenzhong Duan
  4 siblings, 1 reply; 36+ messages in thread
From: Zhenzhong Duan @ 2025-09-10  2:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, yi.l.liu,
	clement.mathieu--drif, eric.auger, joao.m.martins, avihaih,
	xudong.hao, giovanni.cabiddu, mark.gross, arjan.van.de.ven,
	Zhenzhong Duan

Pass IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR when doing the last dirty
bitmap query right before unmap, no PTEs flushes. This accelerates the
query without issue because unmap will tear down the mapping anyway.

Add a new element dirty_tracking_flags in VFIOIOMMUFDContainer to
be used for the flags of iommufd dirty tracking. Currently it is
set to either IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR or 0 based on
the scenario.

Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
---
 hw/vfio/vfio-iommufd.h   | 1 +
 include/system/iommufd.h | 2 +-
 backends/iommufd.c       | 5 +++--
 hw/vfio/iommufd.c        | 6 +++++-
 backends/trace-events    | 2 +-
 5 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
index 07ea0f4304..e0af241c75 100644
--- a/hw/vfio/vfio-iommufd.h
+++ b/hw/vfio/vfio-iommufd.h
@@ -26,6 +26,7 @@ typedef struct VFIOIOMMUFDContainer {
     VFIOContainerBase bcontainer;
     IOMMUFDBackend *be;
     uint32_t ioas_id;
+    uint64_t dirty_tracking_flags;
     QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
 } VFIOIOMMUFDContainer;
 
diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index c9c72ffc45..63898e7b0d 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -64,7 +64,7 @@ bool iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be, uint32_t hwpt_id,
 bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t hwpt_id,
                                       uint64_t iova, ram_addr_t size,
                                       uint64_t page_size, uint64_t *data,
-                                      Error **errp);
+                                      uint64_t flags, Error **errp);
 bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t id,
                                       uint32_t data_type, uint32_t entry_len,
                                       uint32_t *entry_num, void *data,
diff --git a/backends/iommufd.c b/backends/iommufd.c
index 2a33c7ab0b..3c4f6157e2 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -361,7 +361,7 @@ bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
                                       uint32_t hwpt_id,
                                       uint64_t iova, ram_addr_t size,
                                       uint64_t page_size, uint64_t *data,
-                                      Error **errp)
+                                      uint64_t flags, Error **errp)
 {
     int ret;
     struct iommu_hwpt_get_dirty_bitmap get_dirty_bitmap = {
@@ -371,11 +371,12 @@ bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
         .length = size,
         .page_size = page_size,
         .data = (uintptr_t)data,
+        .flags = flags,
     };
 
     ret = ioctl(be->fd, IOMMU_HWPT_GET_DIRTY_BITMAP, &get_dirty_bitmap);
     trace_iommufd_backend_get_dirty_bitmap(be->fd, hwpt_id, iova, size,
-                                           page_size, ret ? errno : 0);
+                                           flags, page_size, ret ? errno : 0);
     if (ret) {
         error_setg_errno(errp, errno,
                          "IOMMU_HWPT_GET_DIRTY_BITMAP (iova: 0x%"HWADDR_PRIx
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 0057488ce9..c897aa6b17 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -62,7 +62,7 @@ static int iommufd_cdev_unmap_one(const VFIOContainerBase *bcontainer,
                                   hwaddr iova, ram_addr_t size,
                                   IOMMUTLBEntry *iotlb)
 {
-    const VFIOIOMMUFDContainer *container =
+    VFIOIOMMUFDContainer *container =
         container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
     bool need_dirty_sync = false;
     Error *local_err = NULL;
@@ -73,9 +73,12 @@ static int iommufd_cdev_unmap_one(const VFIOContainerBase *bcontainer,
     if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer)) {
         if (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
             bcontainer->dirty_pages_supported) {
+            container->dirty_tracking_flags =
+                                      IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR;
             ret = vfio_container_query_dirty_bitmap(bcontainer, iova, size,
                                                     iotlb->translated_addr,
                                                     &local_err);
+            container->dirty_tracking_flags = 0;
             if (ret) {
                 error_report_err(local_err);
             }
@@ -248,6 +251,7 @@ static int iommufd_query_dirty_bitmap(const VFIOContainerBase *bcontainer,
         if (!iommufd_backend_get_dirty_bitmap(container->be, hwpt->hwpt_id,
                                               iova, size, page_size,
                                               (uint64_t *)vbmap->bitmap,
+                                              container->dirty_tracking_flags,
                                               errp)) {
             return -EINVAL;
         }
diff --git a/backends/trace-events b/backends/trace-events
index 56132d3fd2..e1992ba12f 100644
--- a/backends/trace-events
+++ b/backends/trace-events
@@ -19,5 +19,5 @@ iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d"
 iommufd_backend_alloc_hwpt(int iommufd, uint32_t dev_id, uint32_t pt_id, uint32_t flags, uint32_t hwpt_type, uint32_t len, uint64_t data_ptr, uint32_t out_hwpt_id, int ret) " iommufd=%d dev_id=%u pt_id=%u flags=0x%x hwpt_type=%u len=%u data_ptr=0x%"PRIx64" out_hwpt=%u (%d)"
 iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d id=%d (%d)"
 iommufd_backend_set_dirty(int iommufd, uint32_t hwpt_id, bool start, int ret) " iommufd=%d hwpt=%u enable=%d (%d)"
-iommufd_backend_get_dirty_bitmap(int iommufd, uint32_t hwpt_id, uint64_t iova, uint64_t size, uint64_t page_size, int ret) " iommufd=%d hwpt=%u iova=0x%"PRIx64" size=0x%"PRIx64" page_size=0x%"PRIx64" (%d)"
+iommufd_backend_get_dirty_bitmap(int iommufd, uint32_t hwpt_id, uint64_t iova, uint64_t size, uint64_t flags, uint64_t page_size, int ret) " iommufd=%d hwpt=%u iova=0x%"PRIx64" size=0x%"PRIx64" flags=0x%"PRIx64" page_size=0x%"PRIx64" (%d)"
 iommufd_backend_invalidate_cache(int iommufd, uint32_t id, uint32_t data_type, uint32_t entry_len, uint32_t entry_num, uint32_t done_num, uint64_t data_ptr, int ret) " iommufd=%d id=%u data_type=%u entry_len=%u entry_num=%u done_num=%u data_ptr=0x%"PRIx64" (%d)"
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
  2025-09-10  2:36 [PATCH 0/5] vfio: relax the vIOMMU check Zhenzhong Duan
                   ` (2 preceding siblings ...)
  2025-09-10  2:36 ` [PATCH 3/5] vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support Zhenzhong Duan
@ 2025-09-10  2:37 ` Zhenzhong Duan
  2025-10-12 10:31   ` Yi Liu
  2025-09-10  2:37 ` [PATCH 5/5] vfio/migration: Allow live migration with vIOMMU without VFs using device dirty tracking Zhenzhong Duan
  4 siblings, 1 reply; 36+ messages in thread
From: Zhenzhong Duan @ 2025-09-10  2:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, yi.l.liu,
	clement.mathieu--drif, eric.auger, joao.m.martins, avihaih,
	xudong.hao, giovanni.cabiddu, mark.gross, arjan.van.de.ven,
	Zhenzhong Duan

If a VFIO device in guest switches from IOMMU domain to block domain,
vtd_address_space_unmap() is called to unmap whole address space.

If that happens during migration, migration fails with legacy VFIO
backend as below:

Status: failed (vfio_container_dma_unmap(0x561bbbd92d90, 0x100000000000, 0x100000000000) = -7 (Argument list too long))

Because legacy VFIO limits maximum bitmap size to 256MB which maps to 8TB on
4K page system, when 16TB sized UNMAP notification is sent, unmap_bitmap
ioctl fails.

There is no such limitation with iommufd backend, but it's still not optimal
to allocate large bitmap.

Optimize it by iterating over DMAMap list to unmap each range with active
mapping when migration is active. If migration is not active, unmapping the
whole address space in one go is optimal.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
---
 hw/i386/intel_iommu.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 83c5e44413..6876dae727 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -37,6 +37,7 @@
 #include "system/system.h"
 #include "hw/i386/apic_internal.h"
 #include "kvm/kvm_i386.h"
+#include "migration/misc.h"
 #include "migration/vmstate.h"
 #include "trace.h"
 
@@ -4423,6 +4424,42 @@ static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
     vtd_iommu_unlock(s);
 }
 
+/*
+ * Unmapping a large range in one go is not optimal during migration because
+ * a large dirty bitmap needs to be allocated while there may be only small
+ * mappings, iterate over DMAMap list to unmap each range with active mapping.
+ */
+static void vtd_address_space_unmap_in_migration(VTDAddressSpace *as,
+                                                 IOMMUNotifier *n)
+{
+    const DMAMap *map;
+    const DMAMap target = {
+        .iova = n->start,
+        .size = n->end,
+    };
+    IOVATree *tree = as->iova_tree;
+
+    /*
+     * DMAMap is created during IOMMU page table sync, it's either 4KB or huge
+     * page size and always a power of 2 in size. So the range of DMAMap could
+     * be used for UNMAP notification directly.
+     */
+    while ((map = iova_tree_find(tree, &target))) {
+        IOMMUTLBEvent event;
+
+        event.type = IOMMU_NOTIFIER_UNMAP;
+        event.entry.iova = map->iova;
+        event.entry.addr_mask = map->size;
+        event.entry.target_as = &address_space_memory;
+        event.entry.perm = IOMMU_NONE;
+        /* This field is meaningless for unmap */
+        event.entry.translated_addr = 0;
+        memory_region_notify_iommu_one(n, &event);
+
+        iova_tree_remove(tree, *map);
+    }
+}
+
 /* Unmap the whole range in the notifier's scope. */
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
 {
@@ -4432,6 +4469,11 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
     IntelIOMMUState *s = as->iommu_state;
     DMAMap map;
 
+    if (migration_is_running()) {
+        vtd_address_space_unmap_in_migration(as, n);
+        return;
+    }
+
     /*
      * Note: all the codes in this function has a assumption that IOVA
      * bits are no more than VTD_MGAW bits (which is restricted by
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 5/5] vfio/migration: Allow live migration with vIOMMU without VFs using device dirty tracking
  2025-09-10  2:36 [PATCH 0/5] vfio: relax the vIOMMU check Zhenzhong Duan
                   ` (3 preceding siblings ...)
  2025-09-10  2:37 ` [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration Zhenzhong Duan
@ 2025-09-10  2:37 ` Zhenzhong Duan
  4 siblings, 0 replies; 36+ messages in thread
From: Zhenzhong Duan @ 2025-09-10  2:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, yi.l.liu,
	clement.mathieu--drif, eric.auger, joao.m.martins, avihaih,
	xudong.hao, giovanni.cabiddu, mark.gross, arjan.van.de.ven,
	Zhenzhong Duan, Jason Zeng

Commit e46883204c38 ("vfio/migration: Block migration with vIOMMU")
introduces a migration blocker when vIOMMU is enabled, because we need
to calculate the IOVA ranges for device dirty tracking. But this is
unnecessary for iommu dirty tracking.

Limit the vfio_viommu_preset() check to those devices which use device
dirty tracking. This allows live migration with VFIO devices which use
iommu dirty tracking.

Introduce a helper vfio_device_dirty_pages_disabled() to facilicate it.

Suggested-by: Jason Zeng <jason.zeng@intel.com>
Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
---
 include/hw/vfio/vfio-device.h | 10 ++++++++++
 hw/vfio/container-base.c      |  5 +----
 hw/vfio/device.c              |  6 ++++++
 hw/vfio/migration.c           |  6 +++---
 4 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
index 6e4d5ccdac..0c663a49d5 100644
--- a/include/hw/vfio/vfio-device.h
+++ b/include/hw/vfio/vfio-device.h
@@ -148,6 +148,16 @@ bool vfio_device_irq_set_signaling(VFIODevice *vbasedev, int index, int subindex
 
 void vfio_device_reset_handler(void *opaque);
 bool vfio_device_is_mdev(VFIODevice *vbasedev);
+/**
+ * vfio_device_dirty_pages_disabled: Check if device dirty tracking will be
+ * used for a VFIO device
+ *
+ * @vbasedev: The VFIODevice to transform
+ *
+ * Return: true if either @vbasedev doesn't support device dirty tracking or
+ * is forcedly disabled from command line, otherwise false.
+ */
+bool vfio_device_dirty_pages_disabled(VFIODevice *vbasedev);
 bool vfio_device_hiod_create_and_realize(VFIODevice *vbasedev,
                                          const char *typename, Error **errp);
 bool vfio_device_attach(char *name, VFIODevice *vbasedev,
diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
index 56304978e1..39c88812f8 100644
--- a/hw/vfio/container-base.c
+++ b/hw/vfio/container-base.c
@@ -177,10 +177,7 @@ bool vfio_container_devices_dirty_tracking_is_supported(
     VFIODevice *vbasedev;
 
     QLIST_FOREACH(vbasedev, &bcontainer->device_list, container_next) {
-        if (vbasedev->device_dirty_page_tracking == ON_OFF_AUTO_OFF) {
-            return false;
-        }
-        if (!vbasedev->dirty_pages_supported) {
+        if (vfio_device_dirty_pages_disabled(vbasedev)) {
             return false;
         }
     }
diff --git a/hw/vfio/device.c b/hw/vfio/device.c
index 52a1996dc4..d0975e0e40 100644
--- a/hw/vfio/device.c
+++ b/hw/vfio/device.c
@@ -400,6 +400,12 @@ bool vfio_device_is_mdev(VFIODevice *vbasedev)
     return subsys && (strcmp(subsys, "/sys/bus/mdev") == 0);
 }
 
+bool vfio_device_dirty_pages_disabled(VFIODevice *vbasedev)
+{
+    return (!vbasedev->dirty_pages_supported ||
+            vbasedev->device_dirty_page_tracking == ON_OFF_AUTO_OFF);
+}
+
 bool vfio_device_hiod_create_and_realize(VFIODevice *vbasedev,
                                          const char *typename, Error **errp)
 {
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 4c06e3db93..810a5cb157 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -1183,8 +1183,7 @@ bool vfio_migration_realize(VFIODevice *vbasedev, Error **errp)
         return !vfio_block_migration(vbasedev, err, errp);
     }
 
-    if ((!vbasedev->dirty_pages_supported ||
-         vbasedev->device_dirty_page_tracking == ON_OFF_AUTO_OFF) &&
+    if (vfio_device_dirty_pages_disabled(vbasedev) &&
         !vbasedev->iommu_dirty_tracking) {
         if (vbasedev->enable_migration == ON_OFF_AUTO_AUTO) {
             error_setg(&err,
@@ -1202,7 +1201,8 @@ bool vfio_migration_realize(VFIODevice *vbasedev, Error **errp)
         goto out_deinit;
     }
 
-    if (vfio_viommu_preset(vbasedev)) {
+    if (!vfio_device_dirty_pages_disabled(vbasedev) &&
+        vfio_viommu_preset(vbasedev)) {
         error_setg(&err, "%s: Migration is currently not supported "
                    "with vIOMMU enabled", vbasedev->name);
         goto add_blocker;
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/5] vfio/iommufd: Add framework code to support getting dirty bitmap before unmap
  2025-09-10  2:36 ` [PATCH 1/5] vfio/iommufd: Add framework code to support getting dirty bitmap before unmap Zhenzhong Duan
@ 2025-09-19  9:30   ` Cédric Le Goater
  2025-09-22  3:17     ` Duan, Zhenzhong
  0 siblings, 1 reply; 36+ messages in thread
From: Cédric Le Goater @ 2025-09-19  9:30 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, mst, jasowang, yi.l.liu, clement.mathieu--drif,
	eric.auger, joao.m.martins, avihaih, xudong.hao, giovanni.cabiddu,
	mark.gross, arjan.van.de.ven

Hello Zhenzhong

On 9/10/25 04:36, Zhenzhong Duan wrote:
> Currently we support device and iommu dirty tracking, device dirty
> tracking is preferred.
> 
> Add the framework code in iommufd_cdev_unmap_one() to choose either
> device or iommu dirty tracking, just like vfio_legacy_dma_unmap_one().

I wonder if commit 567d7d3e6be5 ("vfio/common: Work around kernel
overflow bug in DMA unmap") could be removed now to make the code
common to both VFIO IOMMU Type1 and IOMMUFD backends.

I asked Alex and Peter in another thread.

Thanks,

C.




> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Tested-by: Xudong Hao <xudong.hao@intel.com>
> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
> ---
>   hw/vfio/iommufd.c | 56 +++++++++++++++++++++++++++++++++++++----------
>   1 file changed, 45 insertions(+), 11 deletions(-)
> 
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 48c590b6a9..b5d6e54c45 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -58,33 +58,67 @@ static int iommufd_cdev_map_file(const VFIOContainerBase *bcontainer,
>                                           iova, size, fd, start, readonly);
>   }
>   
> -static int iommufd_cdev_unmap(const VFIOContainerBase *bcontainer,
> -                              hwaddr iova, ram_addr_t size,
> -                              IOMMUTLBEntry *iotlb, bool unmap_all)
> +static int iommufd_cdev_unmap_one(const VFIOContainerBase *bcontainer,
> +                                  hwaddr iova, ram_addr_t size,
> +                                  IOMMUTLBEntry *iotlb)
>   {
>       const VFIOIOMMUFDContainer *container =
>           container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
> +    bool need_dirty_sync = false;
> +    Error *local_err = NULL;
> +    IOMMUFDBackend *be = container->be;
> +    uint32_t ioas_id = container->ioas_id;
> +    int ret;
> +
> +    if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer)) {
> +        if (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
> +            bcontainer->dirty_pages_supported) {
> +            /* TODO: query dirty bitmap before DMA unmap */
> +            return iommufd_backend_unmap_dma(be, ioas_id, iova, size);
> +        }
> +
> +        need_dirty_sync = true;
> +    }
> +
> +    ret = iommufd_backend_unmap_dma(be, ioas_id, iova, size);
> +    if (ret) {
> +        return ret;
> +    }
>   
> +    if (need_dirty_sync) {
> +        ret = vfio_container_query_dirty_bitmap(bcontainer, iova, size,
> +                                                iotlb->translated_addr,
> +                                                &local_err);
> +        if (ret) {
> +            error_report_err(local_err);
> +            return ret;
> +        }
> +    }
> +
> +    return 0;
> +}
> +static int iommufd_cdev_unmap(const VFIOContainerBase *bcontainer,
> +                              hwaddr iova, ram_addr_t size,
> +                              IOMMUTLBEntry *iotlb, bool unmap_all)
> +{
>       /* unmap in halves */
>       if (unmap_all) {
>           Int128 llsize = int128_rshift(int128_2_64(), 1);
>           int ret;
>   
> -        ret = iommufd_backend_unmap_dma(container->be, container->ioas_id,
> -                                        0, int128_get64(llsize));
> +        ret = iommufd_cdev_unmap_one(bcontainer, 0, int128_get64(llsize),
> +                                     iotlb);
>   
>           if (ret == 0) {
> -            ret = iommufd_backend_unmap_dma(container->be, container->ioas_id,
> -                                            int128_get64(llsize),
> -                                            int128_get64(llsize));
> +            ret = iommufd_cdev_unmap_one(bcontainer, int128_get64(llsize),
> +                                         int128_get64(llsize), iotlb);
>           }
>   
>           return ret;
>       }
>   
> -    /* TODO: Handle dma_unmap_bitmap with iotlb args (migration) */
> -    return iommufd_backend_unmap_dma(container->be,
> -                                     container->ioas_id, iova, size);
> +    return iommufd_cdev_unmap_one(bcontainer, iova, size, iotlb);
>   }
>   
>   static bool iommufd_cdev_kvm_device_add(VFIODevice *vbasedev, Error **errp)



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/5] vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
  2025-09-10  2:36 ` [PATCH 3/5] vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support Zhenzhong Duan
@ 2025-09-19 10:27   ` Joao Martins
  2025-09-22  5:49     ` Duan, Zhenzhong
  0 siblings, 1 reply; 36+ messages in thread
From: Joao Martins @ 2025-09-19 10:27 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, yi.l.liu,
	clement.mathieu--drif, eric.auger, avihaih, xudong.hao,
	giovanni.cabiddu, mark.gross, arjan.van.de.ven

On 10/09/2025 03:36, Zhenzhong Duan wrote:
> Pass IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR when doing the last dirty
> bitmap query right before unmap, no PTEs flushes. This accelerates the
> query without issue because unmap will tear down the mapping anyway.
> 
> Add a new element dirty_tracking_flags in VFIOIOMMUFDContainer to
> be used for the flags of iommufd dirty tracking. Currently it is
> set to either IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR or 0 based on
> the scenario.
> 
> Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Tested-by: Xudong Hao <xudong.hao@intel.com>
> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
> ---
>  hw/vfio/vfio-iommufd.h   | 1 +
>  include/system/iommufd.h | 2 +-
>  backends/iommufd.c       | 5 +++--
>  hw/vfio/iommufd.c        | 6 +++++-
>  backends/trace-events    | 2 +-
>  5 files changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
> index 07ea0f4304..e0af241c75 100644
> --- a/hw/vfio/vfio-iommufd.h
> +++ b/hw/vfio/vfio-iommufd.h
> @@ -26,6 +26,7 @@ typedef struct VFIOIOMMUFDContainer {
>      VFIOContainerBase bcontainer;
>      IOMMUFDBackend *be;
>      uint32_t ioas_id;
> +    uint64_t dirty_tracking_flags;
>      QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>  } VFIOIOMMUFDContainer;
>  
> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
> index c9c72ffc45..63898e7b0d 100644
> --- a/include/system/iommufd.h
> +++ b/include/system/iommufd.h
> @@ -64,7 +64,7 @@ bool iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be, uint32_t hwpt_id,
>  bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t hwpt_id,
>                                        uint64_t iova, ram_addr_t size,
>                                        uint64_t page_size, uint64_t *data,
> -                                      Error **errp);
> +                                      uint64_t flags, Error **errp);
>  bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t id,
>                                        uint32_t data_type, uint32_t entry_len,
>                                        uint32_t *entry_num, void *data,
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> index 2a33c7ab0b..3c4f6157e2 100644
> --- a/backends/iommufd.c
> +++ b/backends/iommufd.c
> @@ -361,7 +361,7 @@ bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>                                        uint32_t hwpt_id,
>                                        uint64_t iova, ram_addr_t size,
>                                        uint64_t page_size, uint64_t *data,
> -                                      Error **errp)
> +                                      uint64_t flags, Error **errp)
>  {
>      int ret;
>      struct iommu_hwpt_get_dirty_bitmap get_dirty_bitmap = {
> @@ -371,11 +371,12 @@ bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>          .length = size,
>          .page_size = page_size,
>          .data = (uintptr_t)data,
> +        .flags = flags,
>      };
>  
>      ret = ioctl(be->fd, IOMMU_HWPT_GET_DIRTY_BITMAP, &get_dirty_bitmap);
>      trace_iommufd_backend_get_dirty_bitmap(be->fd, hwpt_id, iova, size,
> -                                           page_size, ret ? errno : 0);
> +                                           flags, page_size, ret ? errno : 0);
>      if (ret) {
>          error_setg_errno(errp, errno,
>                           "IOMMU_HWPT_GET_DIRTY_BITMAP (iova: 0x%"HWADDR_PRIx
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 0057488ce9..c897aa6b17 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -62,7 +62,7 @@ static int iommufd_cdev_unmap_one(const VFIOContainerBase *bcontainer,
>                                    hwaddr iova, ram_addr_t size,
>                                    IOMMUTLBEntry *iotlb)
>  {
> -    const VFIOIOMMUFDContainer *container =
> +    VFIOIOMMUFDContainer *container =
>          container_of(bcontainer, VFIOIOMMUFDContainer, bcontainer);
>      bool need_dirty_sync = false;
>      Error *local_err = NULL;
> @@ -73,9 +73,12 @@ static int iommufd_cdev_unmap_one(const VFIOContainerBase *bcontainer,
>      if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer)) {
>          if (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
>              bcontainer->dirty_pages_supported) {
> +            container->dirty_tracking_flags =
> +                                      IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR;
>              ret = vfio_container_query_dirty_bitmap(bcontainer, iova, size,
>                                                      iotlb->translated_addr,
>                                                      &local_err);
> +            container->dirty_tracking_flags = 0;

Why not changing vfio_container_query_dirty_bitmap to pass a flags too, like the
original patches? This is a little unnecssary odd style to pass a flag via
container structure rather and then clearing.

Part of the reason the original series had a VFIO_GET_DIRTY_NO_FLUSH for generic
container abstraction was to not mix IOMMUFD UAPI specifics into base container
API. Then in getting a VFIO_GET_DIRTY_NO_FLUSH, then type1 backend could just
ignore the flag, while IOMMUFD translates it to IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR

>              if (ret) {
>                  error_report_err(local_err);
>              }
> @@ -248,6 +251,7 @@ static int iommufd_query_dirty_bitmap(const VFIOContainerBase *bcontainer,
>          if (!iommufd_backend_get_dirty_bitmap(container->be, hwpt->hwpt_id,
>                                                iova, size, page_size,
>                                                (uint64_t *)vbmap->bitmap,
> +                                              container->dirty_tracking_flags,
>                                                errp)) {
>              return -EINVAL;
>          }
> diff --git a/backends/trace-events b/backends/trace-events
> index 56132d3fd2..e1992ba12f 100644
> --- a/backends/trace-events
> +++ b/backends/trace-events
> @@ -19,5 +19,5 @@ iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d"
>  iommufd_backend_alloc_hwpt(int iommufd, uint32_t dev_id, uint32_t pt_id, uint32_t flags, uint32_t hwpt_type, uint32_t len, uint64_t data_ptr, uint32_t out_hwpt_id, int ret) " iommufd=%d dev_id=%u pt_id=%u flags=0x%x hwpt_type=%u len=%u data_ptr=0x%"PRIx64" out_hwpt=%u (%d)"
>  iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d id=%d (%d)"
>  iommufd_backend_set_dirty(int iommufd, uint32_t hwpt_id, bool start, int ret) " iommufd=%d hwpt=%u enable=%d (%d)"
> -iommufd_backend_get_dirty_bitmap(int iommufd, uint32_t hwpt_id, uint64_t iova, uint64_t size, uint64_t page_size, int ret) " iommufd=%d hwpt=%u iova=0x%"PRIx64" size=0x%"PRIx64" page_size=0x%"PRIx64" (%d)"
> +iommufd_backend_get_dirty_bitmap(int iommufd, uint32_t hwpt_id, uint64_t iova, uint64_t size, uint64_t flags, uint64_t page_size, int ret) " iommufd=%d hwpt=%u iova=0x%"PRIx64" size=0x%"PRIx64" flags=0x%"PRIx64" page_size=0x%"PRIx64" (%d)"
>  iommufd_backend_invalidate_cache(int iommufd, uint32_t id, uint32_t data_type, uint32_t entry_len, uint32_t entry_num, uint32_t done_num, uint64_t data_ptr, int ret) " iommufd=%d id=%u data_type=%u entry_len=%u entry_num=%u done_num=%u data_ptr=0x%"PRIx64" (%d)"



^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH 1/5] vfio/iommufd: Add framework code to support getting dirty bitmap before unmap
  2025-09-19  9:30   ` Cédric Le Goater
@ 2025-09-22  3:17     ` Duan, Zhenzhong
  2025-09-22  8:27       ` Cédric Le Goater
  0 siblings, 1 reply; 36+ messages in thread
From: Duan, Zhenzhong @ 2025-09-22  3:17 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, mst@redhat.com, jasowang@redhat.com,
	Liu, Yi L, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, joao.m.martins@oracle.com,
	avihaih@nvidia.com, Hao,  Xudong, Cabiddu, Giovanni, Gross, Mark,
	Van De Ven, Arjan

Hi Cedric,

>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH 1/5] vfio/iommufd: Add framework code to support
>getting dirty bitmap before unmap
>
>Hello Zhenzhong
>
>On 9/10/25 04:36, Zhenzhong Duan wrote:
>> Currently we support device and iommu dirty tracking, device dirty
>> tracking is preferred.
>>
>> Add the framework code in iommufd_cdev_unmap_one() to choose either
>> device or iommu dirty tracking, just like vfio_legacy_dma_unmap_one().
>
>I wonder if commit 567d7d3e6be5 ("vfio/common: Work around kernel
>overflow bug in DMA unmap") could be removed now to make the code
>common to both VFIO IOMMU Type1 and IOMMUFD backends.

I am not clear if there is other reason to keep the workaround, but the original
kernel issue had been fixed with below commit:

commit 58fec830fc19208354895d9832785505046d6c01
Author: Alex Williamson <alex.williamson@redhat.com>
Date:   Mon Jan 7 22:13:22 2019 -0700

    vfio/type1: Fix unmap overflow off-by-one

    The below referenced commit adds a test for integer overflow, but in
    doing so prevents the unmap ioctl from ever including the last page of
    the address space.  Subtract one to compare to the last address of the
    unmap to avoid the overflow and wrap-around.

    Fixes: 71a7d3d78e3c ("vfio/type1: silence integer overflow warning")
    Link: https://bugzilla.redhat.com/show_bug.cgi?id=1662291
    Cc: stable@vger.kernel.org # v4.15+
    Reported-by: Pei Zhang <pezhang@redhat.com>
    Debugged-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
    Reviewed-by: Peter Xu <peterx@redhat.com>
    Tested-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: Cornelia Huck <cohuck@redhat.com>
    Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

>
>I asked Alex and Peter in another thread.

Just curious on the answer, may I ask which thread?

btw: I just found unmapping in halves seems unnecessary as both backends of kernel side support unmap_all now.

    if (unmap_all) {
        /* The unmap ioctl doesn't accept a full 64-bit span. */
        Int128 llsize = int128_rshift(int128_2_64(), 1);

        ret = vfio_legacy_dma_unmap_one(bcontainer, 0, int128_get64(llsize),
                                        iotlb);

        if (ret == 0) {
            ret = vfio_legacy_dma_unmap_one(bcontainer, int128_get64(llsize),
                                            int128_get64(llsize), iotlb);
        }

    } else {
        ret = vfio_legacy_dma_unmap_one(bcontainer, iova, size, iotlb);
    }

Thanks
Zhenzhong



^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH 3/5] vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
  2025-09-19 10:27   ` Joao Martins
@ 2025-09-22  5:49     ` Duan, Zhenzhong
  2025-09-22 16:02       ` Cédric Le Goater
  0 siblings, 1 reply; 36+ messages in thread
From: Duan, Zhenzhong @ 2025-09-22  5:49 UTC (permalink / raw)
  To: Joao Martins, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, Liu, Yi L, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, avihaih@nvidia.com, Hao, Xudong,
	Cabiddu, Giovanni, Gross, Mark, Van De Ven, Arjan

Hi Joao,

>-----Original Message-----
>From: Joao Martins <joao.m.martins@oracle.com>
>Subject: Re: [PATCH 3/5] vfio/iommufd: Add
>IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
>
>On 10/09/2025 03:36, Zhenzhong Duan wrote:
>> Pass IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR when doing the last
>dirty
>> bitmap query right before unmap, no PTEs flushes. This accelerates the
>> query without issue because unmap will tear down the mapping anyway.
>>
>> Add a new element dirty_tracking_flags in VFIOIOMMUFDContainer to
>> be used for the flags of iommufd dirty tracking. Currently it is
>> set to either IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR or 0 based on
>> the scenario.
>>
>> Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> Tested-by: Xudong Hao <xudong.hao@intel.com>
>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>> ---
>>  hw/vfio/vfio-iommufd.h   | 1 +
>>  include/system/iommufd.h | 2 +-
>>  backends/iommufd.c       | 5 +++--
>>  hw/vfio/iommufd.c        | 6 +++++-
>>  backends/trace-events    | 2 +-
>>  5 files changed, 11 insertions(+), 5 deletions(-)
>>
>> diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
>> index 07ea0f4304..e0af241c75 100644
>> --- a/hw/vfio/vfio-iommufd.h
>> +++ b/hw/vfio/vfio-iommufd.h
>> @@ -26,6 +26,7 @@ typedef struct VFIOIOMMUFDContainer {
>>      VFIOContainerBase bcontainer;
>>      IOMMUFDBackend *be;
>>      uint32_t ioas_id;
>> +    uint64_t dirty_tracking_flags;
>>      QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>  } VFIOIOMMUFDContainer;
>>
>> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>> index c9c72ffc45..63898e7b0d 100644
>> --- a/include/system/iommufd.h
>> +++ b/include/system/iommufd.h
>> @@ -64,7 +64,7 @@ bool
>iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be, uint32_t
>hwpt_id,
>>  bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>uint32_t hwpt_id,
>>                                        uint64_t iova, ram_addr_t
>size,
>>                                        uint64_t page_size,
>uint64_t *data,
>> -                                      Error **errp);
>> +                                      uint64_t flags, Error
>**errp);
>>  bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be,
>uint32_t id,
>>                                        uint32_t data_type,
>uint32_t entry_len,
>>                                        uint32_t *entry_num, void
>*data,
>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>> index 2a33c7ab0b..3c4f6157e2 100644
>> --- a/backends/iommufd.c
>> +++ b/backends/iommufd.c
>> @@ -361,7 +361,7 @@ bool
>iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>                                        uint32_t hwpt_id,
>>                                        uint64_t iova, ram_addr_t
>size,
>>                                        uint64_t page_size,
>uint64_t *data,
>> -                                      Error **errp)
>> +                                      uint64_t flags, Error **errp)
>>  {
>>      int ret;
>>      struct iommu_hwpt_get_dirty_bitmap get_dirty_bitmap = {
>> @@ -371,11 +371,12 @@ bool
>iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>          .length = size,
>>          .page_size = page_size,
>>          .data = (uintptr_t)data,
>> +        .flags = flags,
>>      };
>>
>>      ret = ioctl(be->fd, IOMMU_HWPT_GET_DIRTY_BITMAP,
>&get_dirty_bitmap);
>>      trace_iommufd_backend_get_dirty_bitmap(be->fd, hwpt_id, iova,
>size,
>> -                                           page_size, ret ? errno :
>0);
>> +                                           flags, page_size, ret ?
>errno : 0);
>>      if (ret) {
>>          error_setg_errno(errp, errno,
>>                           "IOMMU_HWPT_GET_DIRTY_BITMAP
>(iova: 0x%"HWADDR_PRIx
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index 0057488ce9..c897aa6b17 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -62,7 +62,7 @@ static int iommufd_cdev_unmap_one(const
>VFIOContainerBase *bcontainer,
>>                                    hwaddr iova, ram_addr_t size,
>>                                    IOMMUTLBEntry *iotlb)
>>  {
>> -    const VFIOIOMMUFDContainer *container =
>> +    VFIOIOMMUFDContainer *container =
>>          container_of(bcontainer, VFIOIOMMUFDContainer,
>bcontainer);
>>      bool need_dirty_sync = false;
>>      Error *local_err = NULL;
>> @@ -73,9 +73,12 @@ static int iommufd_cdev_unmap_one(const
>VFIOContainerBase *bcontainer,
>>      if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer)) {
>>          if
>(!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
>>              bcontainer->dirty_pages_supported) {
>> +            container->dirty_tracking_flags =
>> +
>IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR;
>>              ret = vfio_container_query_dirty_bitmap(bcontainer, iova,
>size,
>>
>iotlb->translated_addr,
>>
>&local_err);
>> +            container->dirty_tracking_flags = 0;
>
>Why not changing vfio_container_query_dirty_bitmap to pass a flags too, like
>the
>original patches? This is a little unnecssary odd style to pass a flag via
>container structure rather and then clearing.

Just want to be simpler, original patch introduced a new parameter to almost all
variants of *_query_dirty_bitmap() while the flags parameter is only used by
IOMMUFD backend when doing unmap_bitmap. Currently we already have three
backends, legacy VFIO, IOMMUFD and VFIO-user, only IOMMUFD need the flag.

I take container->dirty_tracking_flags as a notification mechanism, so set it before
vfio_container_query_dirty_bitmap() and clear it thereafter. Maybe clearing it in
iommufd_query_dirty_bitmap() is easier to be acceptable?

>
>Part of the reason the original series had a VFIO_GET_DIRTY_NO_FLUSH for
>generic
>container abstraction was to not mix IOMMUFD UAPI specifics into base
>container
>API. Then in getting a VFIO_GET_DIRTY_NO_FLUSH, then type1 backend
>could just
>ignore the flag, while IOMMUFD translates it to
>IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR

I did port original patch https://github.com/yiliu1765/qemu/commit/99f83595d79d2e4170c9e456cf1a7b9521bd4f80
But it looks complex to have 'flags' parameter everywhere.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/5] vfio/iommufd: Add framework code to support getting dirty bitmap before unmap
  2025-09-22  3:17     ` Duan, Zhenzhong
@ 2025-09-22  8:27       ` Cédric Le Goater
  2025-09-23  2:45         ` Duan, Zhenzhong
  0 siblings, 1 reply; 36+ messages in thread
From: Cédric Le Goater @ 2025-09-22  8:27 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, mst@redhat.com, jasowang@redhat.com,
	Liu, Yi L, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, joao.m.martins@oracle.com,
	avihaih@nvidia.com, Hao, Xudong, Cabiddu, Giovanni, Gross, Mark,
	Van De Ven, Arjan

On 9/22/25 05:17, Duan, Zhenzhong wrote:
> Hi Cedric,
> 
>> -----Original Message-----
>> From: Cédric Le Goater <clg@redhat.com>
>> Subject: Re: [PATCH 1/5] vfio/iommufd: Add framework code to support
>> getting dirty bitmap before unmap
>>
>> Hello Zhenzhong
>>
>> On 9/10/25 04:36, Zhenzhong Duan wrote:
>>> Currently we support device and iommu dirty tracking, device dirty
>>> tracking is preferred.
>>>
>>> Add the framework code in iommufd_cdev_unmap_one() to choose either
>>> device or iommu dirty tracking, just like vfio_legacy_dma_unmap_one().
>>
>> I wonder if commit 567d7d3e6be5 ("vfio/common: Work around kernel
>> overflow bug in DMA unmap") could be removed now to make the code
>> common to both VFIO IOMMU Type1 and IOMMUFD backends.
> 
> I am not clear if there is other reason to keep the workaround, but the original
> kernel issue had been fixed with below commit:
> 
> commit 58fec830fc19208354895d9832785505046d6c01
> Author: Alex Williamson <alex.williamson@redhat.com>
> Date:   Mon Jan 7 22:13:22 2019 -0700
> 
>      vfio/type1: Fix unmap overflow off-by-one
> 
>      The below referenced commit adds a test for integer overflow, but in
>      doing so prevents the unmap ioctl from ever including the last page of
>      the address space.  Subtract one to compare to the last address of the
>      unmap to avoid the overflow and wrap-around.
> 
>      Fixes: 71a7d3d78e3c ("vfio/type1: silence integer overflow warning")
>      Link: https://bugzilla.redhat.com/show_bug.cgi?id=1662291
>      Cc: stable@vger.kernel.org # v4.15+
>      Reported-by: Pei Zhang <pezhang@redhat.com>
>      Debugged-by: Peter Xu <peterx@redhat.com>
>      Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
>      Reviewed-by: Peter Xu <peterx@redhat.com>
>      Tested-by: Peter Xu <peterx@redhat.com>
>      Reviewed-by: Cornelia Huck <cohuck@redhat.com>
>      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> 
>>
>> I asked Alex and Peter in another thread.
> 
> Just curious on the answer, may I ask which thread?

According to Alex, the QEMU workaround can be removed :

https://lore.kernel.org/qemu-devel/20250919102447.748e17fe.alex.williamson@redhat.com/

> btw: I just found unmapping in halves seems unnecessary as both backends of kernel side support unmap_all now.
> 
>      if (unmap_all) {
>          /* The unmap ioctl doesn't accept a full 64-bit span. */
>          Int128 llsize = int128_rshift(int128_2_64(), 1);
> 
>          ret = vfio_legacy_dma_unmap_one(bcontainer, 0, int128_get64(llsize),
>                                          iotlb);
> 
>          if (ret == 0) {
>              ret = vfio_legacy_dma_unmap_one(bcontainer, int128_get64(llsize),
>                                              int128_get64(llsize), iotlb);
>          }
> 
>      } else {
>          ret = vfio_legacy_dma_unmap_one(bcontainer, iova, size, iotlb);
>      }

Good. So we can simply both backends it seems.

Thanks,

C.








^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/5] vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
  2025-09-22  5:49     ` Duan, Zhenzhong
@ 2025-09-22 16:02       ` Cédric Le Goater
  2025-09-22 16:06         ` Joao Martins
                           ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Cédric Le Goater @ 2025-09-22 16:02 UTC (permalink / raw)
  To: Duan, Zhenzhong, Joao Martins, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, mst@redhat.com, jasowang@redhat.com,
	Liu, Yi L, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, avihaih@nvidia.com, Hao, Xudong,
	Cabiddu, Giovanni, Gross, Mark, Van De Ven, Arjan

On 9/22/25 07:49, Duan, Zhenzhong wrote:
> Hi Joao,
> 
>> -----Original Message-----
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Subject: Re: [PATCH 3/5] vfio/iommufd: Add
>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
>>
>> On 10/09/2025 03:36, Zhenzhong Duan wrote:
>>> Pass IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR when doing the last
>> dirty
>>> bitmap query right before unmap, no PTEs flushes. This accelerates the
>>> query without issue because unmap will tear down the mapping anyway.
>>>
>>> Add a new element dirty_tracking_flags in VFIOIOMMUFDContainer to
>>> be used for the flags of iommufd dirty tracking. Currently it is
>>> set to either IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR or 0 based on
>>> the scenario.
>>>
>>> Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> Tested-by: Xudong Hao <xudong.hao@intel.com>
>>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>>> ---
>>>   hw/vfio/vfio-iommufd.h   | 1 +
>>>   include/system/iommufd.h | 2 +-
>>>   backends/iommufd.c       | 5 +++--
>>>   hw/vfio/iommufd.c        | 6 +++++-
>>>   backends/trace-events    | 2 +-
>>>   5 files changed, 11 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
>>> index 07ea0f4304..e0af241c75 100644
>>> --- a/hw/vfio/vfio-iommufd.h
>>> +++ b/hw/vfio/vfio-iommufd.h
>>> @@ -26,6 +26,7 @@ typedef struct VFIOIOMMUFDContainer {
>>>       VFIOContainerBase bcontainer;
>>>       IOMMUFDBackend *be;
>>>       uint32_t ioas_id;
>>> +    uint64_t dirty_tracking_flags;
>>>       QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>>   } VFIOIOMMUFDContainer;
>>>
>>> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>>> index c9c72ffc45..63898e7b0d 100644
>>> --- a/include/system/iommufd.h
>>> +++ b/include/system/iommufd.h
>>> @@ -64,7 +64,7 @@ bool
>> iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be, uint32_t
>> hwpt_id,
>>>   bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>> uint32_t hwpt_id,
>>>                                         uint64_t iova, ram_addr_t
>> size,
>>>                                         uint64_t page_size,
>> uint64_t *data,
>>> -                                      Error **errp);
>>> +                                      uint64_t flags, Error
>> **errp);
>>>   bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be,
>> uint32_t id,
>>>                                         uint32_t data_type,
>> uint32_t entry_len,
>>>                                         uint32_t *entry_num, void
>> *data,
>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>> index 2a33c7ab0b..3c4f6157e2 100644
>>> --- a/backends/iommufd.c
>>> +++ b/backends/iommufd.c
>>> @@ -361,7 +361,7 @@ bool
>> iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>>                                         uint32_t hwpt_id,
>>>                                         uint64_t iova, ram_addr_t
>> size,
>>>                                         uint64_t page_size,
>> uint64_t *data,
>>> -                                      Error **errp)
>>> +                                      uint64_t flags, Error **errp)
>>>   {
>>>       int ret;
>>>       struct iommu_hwpt_get_dirty_bitmap get_dirty_bitmap = {
>>> @@ -371,11 +371,12 @@ bool
>> iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>>           .length = size,
>>>           .page_size = page_size,
>>>           .data = (uintptr_t)data,
>>> +        .flags = flags,
>>>       };
>>>
>>>       ret = ioctl(be->fd, IOMMU_HWPT_GET_DIRTY_BITMAP,
>> &get_dirty_bitmap);
>>>       trace_iommufd_backend_get_dirty_bitmap(be->fd, hwpt_id, iova,
>> size,
>>> -                                           page_size, ret ? errno :
>> 0);
>>> +                                           flags, page_size, ret ?
>> errno : 0);
>>>       if (ret) {
>>>           error_setg_errno(errp, errno,
>>>                            "IOMMU_HWPT_GET_DIRTY_BITMAP
>> (iova: 0x%"HWADDR_PRIx
>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>> index 0057488ce9..c897aa6b17 100644
>>> --- a/hw/vfio/iommufd.c
>>> +++ b/hw/vfio/iommufd.c
>>> @@ -62,7 +62,7 @@ static int iommufd_cdev_unmap_one(const
>> VFIOContainerBase *bcontainer,
>>>                                     hwaddr iova, ram_addr_t size,
>>>                                     IOMMUTLBEntry *iotlb)
>>>   {
>>> -    const VFIOIOMMUFDContainer *container =
>>> +    VFIOIOMMUFDContainer *container =
>>>           container_of(bcontainer, VFIOIOMMUFDContainer,
>> bcontainer);
>>>       bool need_dirty_sync = false;
>>>       Error *local_err = NULL;
>>> @@ -73,9 +73,12 @@ static int iommufd_cdev_unmap_one(const
>> VFIOContainerBase *bcontainer,
>>>       if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer)) {
>>>           if
>> (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
>>>               bcontainer->dirty_pages_supported) {
>>> +            container->dirty_tracking_flags =
>>> +
>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR;
>>>               ret = vfio_container_query_dirty_bitmap(bcontainer, iova,
>> size,
>>>
>> iotlb->translated_addr,
>>>
>> &local_err);
>>> +            container->dirty_tracking_flags = 0;
>>
>> Why not changing vfio_container_query_dirty_bitmap to pass a flags too, like
>> the
>> original patches? This is a little unnecssary odd style to pass a flag via
>> container structure rather and then clearing.
> 
> Just want to be simpler, original patch introduced a new parameter to almost all
> variants of *_query_dirty_bitmap() while the flags parameter is only used by
> IOMMUFD backend when doing unmap_bitmap. Currently we already have three
> backends, legacy VFIO, IOMMUFD and VFIO-user, only IOMMUFD need the flag.
> 
> I take container->dirty_tracking_flags as a notification mechanism, so set it before
> vfio_container_query_dirty_bitmap() and clear it thereafter. Maybe clearing it in
> iommufd_query_dirty_bitmap() is easier to be acceptable?
> 
>>
>> Part of the reason the original series had a VFIO_GET_DIRTY_NO_FLUSH for
>> generic
>> container abstraction was to not mix IOMMUFD UAPI specifics into base
>> container
>> API. Then in getting a VFIO_GET_DIRTY_NO_FLUSH, then type1 backend
>> could just
>> ignore the flag, while IOMMUFD translates it to
>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR
> 
> I did port original patch https://github.com/yiliu1765/qemu/commit/99f83595d79d2e4170c9e456cf1a7b9521bd4f80
> But it looks complex to have 'flags' parameter everywhere.
I think I would prefer like Joao to avoid caching information if possible
but I haven't check closely the mess it would introduce in the code. Let
me check.

Thanks,

C.




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/5] vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
  2025-09-22 16:02       ` Cédric Le Goater
@ 2025-09-22 16:06         ` Joao Martins
  2025-09-22 17:01           ` Joao Martins
  2025-09-23  2:47         ` Duan, Zhenzhong
  2025-10-09 10:20         ` Duan, Zhenzhong
  2 siblings, 1 reply; 36+ messages in thread
From: Joao Martins @ 2025-09-22 16:06 UTC (permalink / raw)
  To: Cédric Le Goater, Duan, Zhenzhong
  Cc: alex.williamson@redhat.com, mst@redhat.com, jasowang@redhat.com,
	Liu, Yi L, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, avihaih@nvidia.com, Hao, Xudong,
	Cabiddu, Giovanni, Gross, Mark, Van De Ven, Arjan,
	qemu-devel@nongnu.org

On 22/09/2025 17:02, Cédric Le Goater wrote:
> On 9/22/25 07:49, Duan, Zhenzhong wrote:
>> Hi Joao,
>>
>>> -----Original Message-----
>>> From: Joao Martins <joao.m.martins@oracle.com>
>>> Subject: Re: [PATCH 3/5] vfio/iommufd: Add
>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
>>>
>>> On 10/09/2025 03:36, Zhenzhong Duan wrote:
>>>> Pass IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR when doing the last
>>> dirty
>>>> bitmap query right before unmap, no PTEs flushes. This accelerates the
>>>> query without issue because unmap will tear down the mapping anyway.
>>>>
>>>> Add a new element dirty_tracking_flags in VFIOIOMMUFDContainer to
>>>> be used for the flags of iommufd dirty tracking. Currently it is
>>>> set to either IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR or 0 based on
>>>> the scenario.
>>>>
>>>> Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>> Tested-by: Xudong Hao <xudong.hao@intel.com>
>>>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>>>> ---
>>>>   hw/vfio/vfio-iommufd.h   | 1 +
>>>>   include/system/iommufd.h | 2 +-
>>>>   backends/iommufd.c       | 5 +++--
>>>>   hw/vfio/iommufd.c        | 6 +++++-
>>>>   backends/trace-events    | 2 +-
>>>>   5 files changed, 11 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
>>>> index 07ea0f4304..e0af241c75 100644
>>>> --- a/hw/vfio/vfio-iommufd.h
>>>> +++ b/hw/vfio/vfio-iommufd.h
>>>> @@ -26,6 +26,7 @@ typedef struct VFIOIOMMUFDContainer {
>>>>       VFIOContainerBase bcontainer;
>>>>       IOMMUFDBackend *be;
>>>>       uint32_t ioas_id;
>>>> +    uint64_t dirty_tracking_flags;
>>>>       QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>>>   } VFIOIOMMUFDContainer;
>>>>
>>>> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>>>> index c9c72ffc45..63898e7b0d 100644
>>>> --- a/include/system/iommufd.h
>>>> +++ b/include/system/iommufd.h
>>>> @@ -64,7 +64,7 @@ bool
>>> iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be, uint32_t
>>> hwpt_id,
>>>>   bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>> uint32_t hwpt_id,
>>>>                                         uint64_t iova, ram_addr_t
>>> size,
>>>>                                         uint64_t page_size,
>>> uint64_t *data,
>>>> -                                      Error **errp);
>>>> +                                      uint64_t flags, Error
>>> **errp);
>>>>   bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be,
>>> uint32_t id,
>>>>                                         uint32_t data_type,
>>> uint32_t entry_len,
>>>>                                         uint32_t *entry_num, void
>>> *data,
>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>> index 2a33c7ab0b..3c4f6157e2 100644
>>>> --- a/backends/iommufd.c
>>>> +++ b/backends/iommufd.c
>>>> @@ -361,7 +361,7 @@ bool
>>> iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>>>                                         uint32_t hwpt_id,
>>>>                                         uint64_t iova, ram_addr_t
>>> size,
>>>>                                         uint64_t page_size,
>>> uint64_t *data,
>>>> -                                      Error **errp)
>>>> +                                      uint64_t flags, Error **errp)
>>>>   {
>>>>       int ret;
>>>>       struct iommu_hwpt_get_dirty_bitmap get_dirty_bitmap = {
>>>> @@ -371,11 +371,12 @@ bool
>>> iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>>>           .length = size,
>>>>           .page_size = page_size,
>>>>           .data = (uintptr_t)data,
>>>> +        .flags = flags,
>>>>       };
>>>>
>>>>       ret = ioctl(be->fd, IOMMU_HWPT_GET_DIRTY_BITMAP,
>>> &get_dirty_bitmap);
>>>>       trace_iommufd_backend_get_dirty_bitmap(be->fd, hwpt_id, iova,
>>> size,
>>>> -                                           page_size, ret ? errno :
>>> 0);
>>>> +                                           flags, page_size, ret ?
>>> errno : 0);
>>>>       if (ret) {
>>>>           error_setg_errno(errp, errno,
>>>>                            "IOMMU_HWPT_GET_DIRTY_BITMAP
>>> (iova: 0x%"HWADDR_PRIx
>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>> index 0057488ce9..c897aa6b17 100644
>>>> --- a/hw/vfio/iommufd.c
>>>> +++ b/hw/vfio/iommufd.c
>>>> @@ -62,7 +62,7 @@ static int iommufd_cdev_unmap_one(const
>>> VFIOContainerBase *bcontainer,
>>>>                                     hwaddr iova, ram_addr_t size,
>>>>                                     IOMMUTLBEntry *iotlb)
>>>>   {
>>>> -    const VFIOIOMMUFDContainer *container =
>>>> +    VFIOIOMMUFDContainer *container =
>>>>           container_of(bcontainer, VFIOIOMMUFDContainer,
>>> bcontainer);
>>>>       bool need_dirty_sync = false;
>>>>       Error *local_err = NULL;
>>>> @@ -73,9 +73,12 @@ static int iommufd_cdev_unmap_one(const
>>> VFIOContainerBase *bcontainer,
>>>>       if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer)) {
>>>>           if
>>> (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
>>>>               bcontainer->dirty_pages_supported) {
>>>> +            container->dirty_tracking_flags =
>>>> +
>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR;
>>>>               ret = vfio_container_query_dirty_bitmap(bcontainer, iova,
>>> size,
>>>>
>>> iotlb->translated_addr,
>>>>
>>> &local_err);
>>>> +            container->dirty_tracking_flags = 0;
>>>
>>> Why not changing vfio_container_query_dirty_bitmap to pass a flags too, like
>>> the
>>> original patches? This is a little unnecssary odd style to pass a flag via
>>> container structure rather and then clearing.
>>
>> Just want to be simpler, original patch introduced a new parameter to almost all
>> variants of *_query_dirty_bitmap() while the flags parameter is only used by
>> IOMMUFD backend when doing unmap_bitmap. Currently we already have three
>> backends, legacy VFIO, IOMMUFD and VFIO-user, only IOMMUFD need the flag.
>>
>> I take container->dirty_tracking_flags as a notification mechanism, so set it
>> before
>> vfio_container_query_dirty_bitmap() and clear it thereafter. Maybe clearing it in
>> iommufd_query_dirty_bitmap() is easier to be acceptable?
>>
>>>
>>> Part of the reason the original series had a VFIO_GET_DIRTY_NO_FLUSH for
>>> generic
>>> container abstraction was to not mix IOMMUFD UAPI specifics into base
>>> container
>>> API. Then in getting a VFIO_GET_DIRTY_NO_FLUSH, then type1 backend
>>> could just
>>> ignore the flag, while IOMMUFD translates it to
>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR
>>
>> I did port original patch https://github.com/yiliu1765/qemu/
>> commit/99f83595d79d2e4170c9e456cf1a7b9521bd4f80
>> But it looks complex to have 'flags' parameter everywhere.
> I think I would prefer like Joao to avoid caching information if possible
> but I haven't check closely the mess it would introduce in the code. Let
> me check.
> 

My recollection was that it wasn't that much churn added as this series is
already doing the most of the churn. But I am checking how the code would look
like to properly respond to his suggestion on why he changing it towards
structuref field.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/5] vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
  2025-09-22 16:06         ` Joao Martins
@ 2025-09-22 17:01           ` Joao Martins
  2025-09-23  2:50             ` Duan, Zhenzhong
  0 siblings, 1 reply; 36+ messages in thread
From: Joao Martins @ 2025-09-22 17:01 UTC (permalink / raw)
  To: Cédric Le Goater, Duan, Zhenzhong
  Cc: alex.williamson@redhat.com, mst@redhat.com, jasowang@redhat.com,
	Liu, Yi L, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, avihaih@nvidia.com, Hao, Xudong,
	Cabiddu, Giovanni, Gross, Mark, Van De Ven, Arjan,
	qemu-devel@nongnu.org

On 22/09/2025 17:06, Joao Martins wrote:
> On 22/09/2025 17:02, Cédric Le Goater wrote:
>> On 9/22/25 07:49, Duan, Zhenzhong wrote:
>>> Hi Joao,
>>>
>>>> -----Original Message-----
>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>> Subject: Re: [PATCH 3/5] vfio/iommufd: Add
>>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
>>>>
>>>> On 10/09/2025 03:36, Zhenzhong Duan wrote:
>>>>> Pass IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR when doing the last
>>>> dirty
>>>>> bitmap query right before unmap, no PTEs flushes. This accelerates the
>>>>> query without issue because unmap will tear down the mapping anyway.
>>>>>
>>>>> Add a new element dirty_tracking_flags in VFIOIOMMUFDContainer to
>>>>> be used for the flags of iommufd dirty tracking. Currently it is
>>>>> set to either IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR or 0 based on
>>>>> the scenario.
>>>>>
>>>>> Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
>>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>>> Tested-by: Xudong Hao <xudong.hao@intel.com>
>>>>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>>>>> ---
>>>>>   hw/vfio/vfio-iommufd.h   | 1 +
>>>>>   include/system/iommufd.h | 2 +-
>>>>>   backends/iommufd.c       | 5 +++--
>>>>>   hw/vfio/iommufd.c        | 6 +++++-
>>>>>   backends/trace-events    | 2 +-
>>>>>   5 files changed, 11 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
>>>>> index 07ea0f4304..e0af241c75 100644
>>>>> --- a/hw/vfio/vfio-iommufd.h
>>>>> +++ b/hw/vfio/vfio-iommufd.h
>>>>> @@ -26,6 +26,7 @@ typedef struct VFIOIOMMUFDContainer {
>>>>>       VFIOContainerBase bcontainer;
>>>>>       IOMMUFDBackend *be;
>>>>>       uint32_t ioas_id;
>>>>> +    uint64_t dirty_tracking_flags;
>>>>>       QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>>>>   } VFIOIOMMUFDContainer;
>>>>>
>>>>> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>>>>> index c9c72ffc45..63898e7b0d 100644
>>>>> --- a/include/system/iommufd.h
>>>>> +++ b/include/system/iommufd.h
>>>>> @@ -64,7 +64,7 @@ bool
>>>> iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be, uint32_t
>>>> hwpt_id,
>>>>>   bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>>> uint32_t hwpt_id,
>>>>>                                         uint64_t iova, ram_addr_t
>>>> size,
>>>>>                                         uint64_t page_size,
>>>> uint64_t *data,
>>>>> -                                      Error **errp);
>>>>> +                                      uint64_t flags, Error
>>>> **errp);
>>>>>   bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be,
>>>> uint32_t id,
>>>>>                                         uint32_t data_type,
>>>> uint32_t entry_len,
>>>>>                                         uint32_t *entry_num, void
>>>> *data,
>>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>>> index 2a33c7ab0b..3c4f6157e2 100644
>>>>> --- a/backends/iommufd.c
>>>>> +++ b/backends/iommufd.c
>>>>> @@ -361,7 +361,7 @@ bool
>>>> iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>>>>                                         uint32_t hwpt_id,
>>>>>                                         uint64_t iova, ram_addr_t
>>>> size,
>>>>>                                         uint64_t page_size,
>>>> uint64_t *data,
>>>>> -                                      Error **errp)
>>>>> +                                      uint64_t flags, Error **errp)
>>>>>   {
>>>>>       int ret;
>>>>>       struct iommu_hwpt_get_dirty_bitmap get_dirty_bitmap = {
>>>>> @@ -371,11 +371,12 @@ bool
>>>> iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>>>>           .length = size,
>>>>>           .page_size = page_size,
>>>>>           .data = (uintptr_t)data,
>>>>> +        .flags = flags,
>>>>>       };
>>>>>
>>>>>       ret = ioctl(be->fd, IOMMU_HWPT_GET_DIRTY_BITMAP,
>>>> &get_dirty_bitmap);
>>>>>       trace_iommufd_backend_get_dirty_bitmap(be->fd, hwpt_id, iova,
>>>> size,
>>>>> -                                           page_size, ret ? errno :
>>>> 0);
>>>>> +                                           flags, page_size, ret ?
>>>> errno : 0);
>>>>>       if (ret) {
>>>>>           error_setg_errno(errp, errno,
>>>>>                            "IOMMU_HWPT_GET_DIRTY_BITMAP
>>>> (iova: 0x%"HWADDR_PRIx
>>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>>> index 0057488ce9..c897aa6b17 100644
>>>>> --- a/hw/vfio/iommufd.c
>>>>> +++ b/hw/vfio/iommufd.c
>>>>> @@ -62,7 +62,7 @@ static int iommufd_cdev_unmap_one(const
>>>> VFIOContainerBase *bcontainer,
>>>>>                                     hwaddr iova, ram_addr_t size,
>>>>>                                     IOMMUTLBEntry *iotlb)
>>>>>   {
>>>>> -    const VFIOIOMMUFDContainer *container =
>>>>> +    VFIOIOMMUFDContainer *container =
>>>>>           container_of(bcontainer, VFIOIOMMUFDContainer,
>>>> bcontainer);
>>>>>       bool need_dirty_sync = false;
>>>>>       Error *local_err = NULL;
>>>>> @@ -73,9 +73,12 @@ static int iommufd_cdev_unmap_one(const
>>>> VFIOContainerBase *bcontainer,
>>>>>       if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer)) {
>>>>>           if
>>>> (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
>>>>>               bcontainer->dirty_pages_supported) {
>>>>> +            container->dirty_tracking_flags =
>>>>> +
>>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR;
>>>>>               ret = vfio_container_query_dirty_bitmap(bcontainer, iova,
>>>> size,
>>>>>
>>>> iotlb->translated_addr,
>>>>>
>>>> &local_err);
>>>>> +            container->dirty_tracking_flags = 0;
>>>>
>>>> Why not changing vfio_container_query_dirty_bitmap to pass a flags too, like
>>>> the
>>>> original patches? This is a little unnecssary odd style to pass a flag via
>>>> container structure rather and then clearing.
>>>
>>> Just want to be simpler, original patch introduced a new parameter to almost all
>>> variants of *_query_dirty_bitmap() while the flags parameter is only used by
>>> IOMMUFD backend when doing unmap_bitmap. Currently we already have three
>>> backends, legacy VFIO, IOMMUFD and VFIO-user, only IOMMUFD need the flag.
>>>
>>> I take container->dirty_tracking_flags as a notification mechanism, so set it
>>> before
>>> vfio_container_query_dirty_bitmap() and clear it thereafter. Maybe clearing it in
>>> iommufd_query_dirty_bitmap() is easier to be acceptable?
>>>
>>>>
>>>> Part of the reason the original series had a VFIO_GET_DIRTY_NO_FLUSH for
>>>> generic
>>>> container abstraction was to not mix IOMMUFD UAPI specifics into base
>>>> container
>>>> API. Then in getting a VFIO_GET_DIRTY_NO_FLUSH, then type1 backend
>>>> could just
>>>> ignore the flag, while IOMMUFD translates it to
>>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR
>>>
>>> I did port original patch https://github.com/yiliu1765/qemu/
>>> commit/99f83595d79d2e4170c9e456cf1a7b9521bd4f80
>>> But it looks complex to have 'flags' parameter everywhere.
>> I think I would prefer like Joao to avoid caching information if possible
>> but I haven't check closely the mess it would introduce in the code. Let
>> me check.
>>
> 
> My recollection was that it wasn't that much churn added as this series is
> already doing the most of the churn. But I am checking how the code would look
> like to properly respond to his suggestion on why he changing it towards
> structuref field.

The churn should be similar to this patch:

https://github.com/jpemartins/qemu/commit/5e1f11075299a5fa9564a26788dc9cc1717e297c

It's mostly an interface parameter addition of flags.
But there's now a few more callsites and I suppose that's the extra churn. But I don't think
vfio-user would need to change. See snip below.

I've pushed this here https://github.com/jpemartins/qemu/commits/relax-viommu-lm but showing the
series wide changes for discussion. The thing I didn't do was have an intermediate 'backend
agnostic' flag thingie like the original one
(https://github.com/jpemartins/qemu/commit/1ef8abc896ae69be8896a68705fe4a87204709e0) with
VFIO_GET_DIRTY_NO_FLUSH

diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
index 56304978e1e8..2fe08e010405 100644
--- a/hw/vfio/container-base.c
+++ b/hw/vfio/container-base.c
@@ -211,17 +211,16 @@ static int vfio_device_dma_logging_report(VFIODevice *vbasedev, hwaddr iova,
 }

 static int vfio_container_iommu_query_dirty_bitmap(const VFIOContainerBase *bcontainer,
-                   VFIOBitmap *vbmap, hwaddr iova, hwaddr size, Error **errp)
+                   VFIOBitmap *vbmap, hwaddr iova, hwaddr size, uint64_t flags, Error **errp)
 {
     VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);

     g_assert(vioc->query_dirty_bitmap);
-    return vioc->query_dirty_bitmap(bcontainer, vbmap, iova, size,
-                                               errp);
+    return vioc->query_dirty_bitmap(bcontainer, vbmap, iova, size, flags, errp);
 }

 static int vfio_container_devices_query_dirty_bitmap(const VFIOContainerBase *bcontainer,
-                 VFIOBitmap *vbmap, hwaddr iova, hwaddr size, Error **errp)
+                 VFIOBitmap *vbmap, hwaddr iova, hwaddr size, uint64_t flags, Error **errp)
 {
     VFIODevice *vbasedev;
     int ret;
@@ -243,7 +242,7 @@ static int vfio_container_devices_query_dirty_bitmap(const VFIOContainerBase *bc
 }

 int vfio_container_query_dirty_bitmap(const VFIOContainerBase *bcontainer, uint64_t iova,
-                          uint64_t size, ram_addr_t ram_addr, Error **errp)
+                          uint64_t size, uint64_t flags, ram_addr_t ram_addr, Error **errp)
 {
     bool all_device_dirty_tracking =
         vfio_container_devices_dirty_tracking_is_supported(bcontainer);
@@ -267,10 +266,10 @@ int vfio_container_query_dirty_bitmap(const VFIOContainerBase *bcontainer, uint6

     if (all_device_dirty_tracking) {
         ret = vfio_container_devices_query_dirty_bitmap(bcontainer, &vbmap, iova, size,
-                                                        errp);
+                                                        flags, errp);
     } else {
         ret = vfio_container_iommu_query_dirty_bitmap(bcontainer, &vbmap, iova, size,
-                                                     errp);
+                                                      flags, errp);
     }

     if (ret) {
@@ -280,7 +279,7 @@ int vfio_container_query_dirty_bitmap(const VFIOContainerBase *bcontainer, uint6
     dirty_pages = cpu_physical_memory_set_dirty_lebitmap(vbmap.bitmap, ram_addr,
                                                          vbmap.pages);

-    trace_vfio_container_query_dirty_bitmap(iova, size, vbmap.size, ram_addr,
+    trace_vfio_container_query_dirty_bitmap(iova, size, flags, vbmap.size, ram_addr,
                                             dirty_pages);
 out:
     g_free(vbmap.bitmap);
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 030c6d3f89cf..86c4d06298f7 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -169,7 +169,7 @@ static int vfio_legacy_dma_unmap_one(const VFIOContainerBase *bcontainer,
     }

     if (need_dirty_sync) {
-        ret = vfio_container_query_dirty_bitmap(bcontainer, iova, size,
+        ret = vfio_container_query_dirty_bitmap(bcontainer, iova, size, 0,
                                     iotlb->translated_addr, &local_err);
         if (ret) {
             error_report_err(local_err);
@@ -267,7 +267,8 @@ vfio_legacy_set_dirty_page_tracking(const VFIOContainerBase *bcontainer,
 }

 static int vfio_legacy_query_dirty_bitmap(const VFIOContainerBase *bcontainer,
-                      VFIOBitmap *vbmap, hwaddr iova, hwaddr size, Error **errp)
+                      VFIOBitmap *vbmap, hwaddr iova, hwaddr size, uint64_t flags,
+                      Error **errp)
 {
     const VFIOContainer *container = VFIO_IOMMU_LEGACY(bcontainer);
     struct vfio_iommu_type1_dirty_bitmap *dbitmap;
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index b82597de9116..c1c8755c6d7c 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -73,12 +73,10 @@ static int iommufd_cdev_unmap_one(const VFIOContainerBase *bcontainer,
     if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer)) {
         if (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
             bcontainer->dirty_pages_supported) {
-            container->dirty_tracking_flags =
-                                      IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR;
             ret = vfio_container_query_dirty_bitmap(bcontainer, iova, size,
+                                                    IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR,
                                                     iotlb->translated_addr,
                                                     &local_err);
-            container->dirty_tracking_flags = 0;
             if (ret) {
                 error_report_err(local_err);
             }
@@ -95,7 +93,7 @@ static int iommufd_cdev_unmap_one(const VFIOContainerBase *bcontainer,
     }

     if (need_dirty_sync) {
-        ret = vfio_container_query_dirty_bitmap(bcontainer, iova, size,
+        ret = vfio_container_query_dirty_bitmap(bcontainer, iova, size, 0,
                                                 iotlb->translated_addr,
                                                 &local_err);
         if (ret) {
@@ -235,7 +233,7 @@ err:

 static int iommufd_query_dirty_bitmap(const VFIOContainerBase *bcontainer,
                                       VFIOBitmap *vbmap, hwaddr iova,
-                                      hwaddr size, Error **errp)
+                                      hwaddr size, uint64_t flags, Error **errp)
 {
     VFIOIOMMUFDContainer *container = container_of(bcontainer,
                                                    VFIOIOMMUFDContainer,
@@ -251,8 +249,7 @@ static int iommufd_query_dirty_bitmap(const VFIOContainerBase *bcontainer,
         if (!iommufd_backend_get_dirty_bitmap(container->be, hwpt->hwpt_id,
                                               iova, size, page_size,
                                               (uint64_t *)vbmap->bitmap,
-                                              container->dirty_tracking_flags,
-                                              errp)) {
+                                              flags, errp)) {
             return -EINVAL;
         }
     }
diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index e09383316598..27cbb45282ae 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -1082,7 +1082,7 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     translated_addr = memory_region_get_ram_addr(mr) + xlat;

     ret = vfio_container_query_dirty_bitmap(bcontainer, iova, iotlb->addr_mask + 1,
-                                translated_addr, &local_err);
+                                0, translated_addr, &local_err);
     if (ret) {
         error_prepend(&local_err,
                       "vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", "
@@ -1118,7 +1118,7 @@ static int vfio_ram_discard_query_dirty_bitmap(MemoryRegionSection *section,
      * Sync the whole mapped region (spanning multiple individual mappings)
      * in one go.
      */
-    ret = vfio_container_query_dirty_bitmap(vrdl->bcontainer, iova, size, ram_addr,
+    ret = vfio_container_query_dirty_bitmap(vrdl->bcontainer, iova, size, 0, ram_addr,
                                 &local_err);
     if (ret) {
         error_report_err(local_err);
@@ -1203,7 +1203,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainerBase *bcontainer,

     return vfio_container_query_dirty_bitmap(bcontainer,
                    REAL_HOST_PAGE_ALIGN(section->offset_within_address_space),
-                                 int128_get64(section->size), ram_addr, errp);
+                                 int128_get64(section->size), 0, ram_addr, errp);
 }

 static void vfio_listener_log_sync(MemoryListener *listener,
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index e3d571f8c845..ee18dafdb2ae 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -105,7 +105,7 @@ vfio_device_dirty_tracking_start(int nr_ranges, uint64_t min32, uint64_t max32,
 vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" -
0x%"PRIx64

 # container-base.c
-vfio_container_query_dirty_bitmap(uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t
start, uint64_t dirty_pages) "
iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64" dirty_pages=%"PRIu64
+vfio_container_query_dirty_bitmap(uint64_t iova, uint64_t size, uint64_t flags, uint64_t
bitmap_size, uint64_t start, uint64_
t dirty_pages) "iova=0x%"PRIx64" size= 0x%"PRIx64" flags= 0x%"PRIx64" bitmap_size=0x%"PRIx64"
start=0x%"PRIx64" dirty_pages=%"
PRIu64

 # container.c
 vfio_container_disconnect(int fd) "close container->fd=%d"
diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
index e0af241c75cf..07ea0f430496 100644
--- a/hw/vfio/vfio-iommufd.h
+++ b/hw/vfio/vfio-iommufd.h
@@ -26,7 +26,6 @@ typedef struct VFIOIOMMUFDContainer {
     VFIOContainerBase bcontainer;
     IOMMUFDBackend *be;
     uint32_t ioas_id;
-    uint64_t dirty_tracking_flags;
     QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
 } VFIOIOMMUFDContainer;
diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index acbd48a18a3a..e88b690cf423 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -98,7 +98,7 @@ bool vfio_container_dirty_tracking_is_started(
 bool vfio_container_devices_dirty_tracking_is_supported(
     const VFIOContainerBase *bcontainer);
 int vfio_container_query_dirty_bitmap(const VFIOContainerBase *bcontainer,
-    uint64_t iova, uint64_t size, ram_addr_t ram_addr, Error **errp);
+    uint64_t iova, uint64_t size, uint64_t flags, ram_addr_t ram_addr, Error **errp);

 GList *vfio_container_get_iova_ranges(const VFIOContainerBase *bcontainer);

@@ -252,12 +252,14 @@ struct VFIOIOMMUClass {
      * @vbmap: #VFIOBitmap internal bitmap structure
      * @iova: iova base address
      * @size: size of iova range
+     * @flags: flags to the dirty tracking query
      * @errp: pointer to Error*, to store an error if it happens.
      *
      * Returns zero to indicate success and negative for error.
      */
     int (*query_dirty_bitmap)(const VFIOContainerBase *bcontainer,
-                VFIOBitmap *vbmap, hwaddr iova, hwaddr size, Error **errp);
+                VFIOBitmap *vbmap, hwaddr iova, hwaddr size, uint64_t flags,
+                Error **errp);
     /* PCI specific */
     int (*pci_hot_reset)(VFIODevice *vbasedev, bool single);


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* RE: [PATCH 1/5] vfio/iommufd: Add framework code to support getting dirty bitmap before unmap
  2025-09-22  8:27       ` Cédric Le Goater
@ 2025-09-23  2:45         ` Duan, Zhenzhong
  2025-09-23  7:06           ` Cédric Le Goater
  0 siblings, 1 reply; 36+ messages in thread
From: Duan, Zhenzhong @ 2025-09-23  2:45 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, mst@redhat.com, jasowang@redhat.com,
	Liu, Yi L, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, joao.m.martins@oracle.com,
	avihaih@nvidia.com, Hao,  Xudong, Cabiddu, Giovanni, Gross, Mark,
	Van De Ven, Arjan



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH 1/5] vfio/iommufd: Add framework code to support
>getting dirty bitmap before unmap
>
>On 9/22/25 05:17, Duan, Zhenzhong wrote:
>> Hi Cedric,
>>
>>> -----Original Message-----
>>> From: Cédric Le Goater <clg@redhat.com>
>>> Subject: Re: [PATCH 1/5] vfio/iommufd: Add framework code to support
>>> getting dirty bitmap before unmap
>>>
>>> Hello Zhenzhong
>>>
>>> On 9/10/25 04:36, Zhenzhong Duan wrote:
>>>> Currently we support device and iommu dirty tracking, device dirty
>>>> tracking is preferred.
>>>>
>>>> Add the framework code in iommufd_cdev_unmap_one() to choose
>either
>>>> device or iommu dirty tracking, just like vfio_legacy_dma_unmap_one().
>>>
>>> I wonder if commit 567d7d3e6be5 ("vfio/common: Work around kernel
>>> overflow bug in DMA unmap") could be removed now to make the code
>>> common to both VFIO IOMMU Type1 and IOMMUFD backends.
>>
>> I am not clear if there is other reason to keep the workaround, but the
>original
>> kernel issue had been fixed with below commit:
>>
>> commit 58fec830fc19208354895d9832785505046d6c01
>> Author: Alex Williamson <alex.williamson@redhat.com>
>> Date:   Mon Jan 7 22:13:22 2019 -0700
>>
>>      vfio/type1: Fix unmap overflow off-by-one
>>
>>      The below referenced commit adds a test for integer overflow, but in
>>      doing so prevents the unmap ioctl from ever including the last page
>of
>>      the address space.  Subtract one to compare to the last address of
>the
>>      unmap to avoid the overflow and wrap-around.
>>
>>      Fixes: 71a7d3d78e3c ("vfio/type1: silence integer overflow warning")
>>      Link: https://bugzilla.redhat.com/show_bug.cgi?id=1662291
>>      Cc: stable@vger.kernel.org # v4.15+
>>      Reported-by: Pei Zhang <pezhang@redhat.com>
>>      Debugged-by: Peter Xu <peterx@redhat.com>
>>      Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
>>      Reviewed-by: Peter Xu <peterx@redhat.com>
>>      Tested-by: Peter Xu <peterx@redhat.com>
>>      Reviewed-by: Cornelia Huck <cohuck@redhat.com>
>>      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>
>>>
>>> I asked Alex and Peter in another thread.
>>
>> Just curious on the answer, may I ask which thread?
>
>According to Alex, the QEMU workaround can be removed :
>
>https://lore.kernel.org/qemu-devel/20250919102447.748e17fe.alex.williams
>on@redhat.com/
>
>> btw: I just found unmapping in halves seems unnecessary as both backends
>of kernel side support unmap_all now.
>>
>>      if (unmap_all) {
>>          /* The unmap ioctl doesn't accept a full 64-bit span. */
>>          Int128 llsize = int128_rshift(int128_2_64(), 1);
>>
>>          ret = vfio_legacy_dma_unmap_one(bcontainer, 0,
>int128_get64(llsize),
>>                                          iotlb);
>>
>>          if (ret == 0) {
>>              ret = vfio_legacy_dma_unmap_one(bcontainer,
>int128_get64(llsize),
>>                                              int128_get64(llsize),
>iotlb);
>>          }
>>
>>      } else {
>>          ret = vfio_legacy_dma_unmap_one(bcontainer, iova, size,
>iotlb);
>>      }
>
>Good. So we can simply both backends it seems.

Will you handle them or not? I mean the workaround removing and unmapping_all optimization.

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH 3/5] vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
  2025-09-22 16:02       ` Cédric Le Goater
  2025-09-22 16:06         ` Joao Martins
@ 2025-09-23  2:47         ` Duan, Zhenzhong
  2025-10-09 10:20         ` Duan, Zhenzhong
  2 siblings, 0 replies; 36+ messages in thread
From: Duan, Zhenzhong @ 2025-09-23  2:47 UTC (permalink / raw)
  To: Cédric Le Goater, Joao Martins, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, mst@redhat.com, jasowang@redhat.com,
	Liu, Yi L, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, avihaih@nvidia.com, Hao, Xudong,
	Cabiddu, Giovanni, Gross, Mark, Van De Ven, Arjan



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH 3/5] vfio/iommufd: Add
>IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
>
>On 9/22/25 07:49, Duan, Zhenzhong wrote:
>> Hi Joao,
>>
>>> -----Original Message-----
>>> From: Joao Martins <joao.m.martins@oracle.com>
>>> Subject: Re: [PATCH 3/5] vfio/iommufd: Add
>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
>>>
>>> On 10/09/2025 03:36, Zhenzhong Duan wrote:
>>>> Pass IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR when doing the
>last
>>> dirty
>>>> bitmap query right before unmap, no PTEs flushes. This accelerates the
>>>> query without issue because unmap will tear down the mapping anyway.
>>>>
>>>> Add a new element dirty_tracking_flags in VFIOIOMMUFDContainer to
>>>> be used for the flags of iommufd dirty tracking. Currently it is
>>>> set to either IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR or 0 based
>on
>>>> the scenario.
>>>>
>>>> Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>> Tested-by: Xudong Hao <xudong.hao@intel.com>
>>>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>>>> ---
>>>>   hw/vfio/vfio-iommufd.h   | 1 +
>>>>   include/system/iommufd.h | 2 +-
>>>>   backends/iommufd.c       | 5 +++--
>>>>   hw/vfio/iommufd.c        | 6 +++++-
>>>>   backends/trace-events    | 2 +-
>>>>   5 files changed, 11 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
>>>> index 07ea0f4304..e0af241c75 100644
>>>> --- a/hw/vfio/vfio-iommufd.h
>>>> +++ b/hw/vfio/vfio-iommufd.h
>>>> @@ -26,6 +26,7 @@ typedef struct VFIOIOMMUFDContainer {
>>>>       VFIOContainerBase bcontainer;
>>>>       IOMMUFDBackend *be;
>>>>       uint32_t ioas_id;
>>>> +    uint64_t dirty_tracking_flags;
>>>>       QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>>>   } VFIOIOMMUFDContainer;
>>>>
>>>> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>>>> index c9c72ffc45..63898e7b0d 100644
>>>> --- a/include/system/iommufd.h
>>>> +++ b/include/system/iommufd.h
>>>> @@ -64,7 +64,7 @@ bool
>>> iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be, uint32_t
>>> hwpt_id,
>>>>   bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>> uint32_t hwpt_id,
>>>>                                         uint64_t iova,
>ram_addr_t
>>> size,
>>>>                                         uint64_t page_size,
>>> uint64_t *data,
>>>> -                                      Error **errp);
>>>> +                                      uint64_t flags, Error
>>> **errp);
>>>>   bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be,
>>> uint32_t id,
>>>>                                         uint32_t data_type,
>>> uint32_t entry_len,
>>>>                                         uint32_t *entry_num,
>void
>>> *data,
>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>> index 2a33c7ab0b..3c4f6157e2 100644
>>>> --- a/backends/iommufd.c
>>>> +++ b/backends/iommufd.c
>>>> @@ -361,7 +361,7 @@ bool
>>> iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>>>                                         uint32_t hwpt_id,
>>>>                                         uint64_t iova,
>ram_addr_t
>>> size,
>>>>                                         uint64_t page_size,
>>> uint64_t *data,
>>>> -                                      Error **errp)
>>>> +                                      uint64_t flags, Error
>**errp)
>>>>   {
>>>>       int ret;
>>>>       struct iommu_hwpt_get_dirty_bitmap get_dirty_bitmap = {
>>>> @@ -371,11 +371,12 @@ bool
>>> iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>>>           .length = size,
>>>>           .page_size = page_size,
>>>>           .data = (uintptr_t)data,
>>>> +        .flags = flags,
>>>>       };
>>>>
>>>>       ret = ioctl(be->fd, IOMMU_HWPT_GET_DIRTY_BITMAP,
>>> &get_dirty_bitmap);
>>>>       trace_iommufd_backend_get_dirty_bitmap(be->fd, hwpt_id,
>iova,
>>> size,
>>>> -                                           page_size, ret ?
>errno :
>>> 0);
>>>> +                                           flags, page_size,
>ret ?
>>> errno : 0);
>>>>       if (ret) {
>>>>           error_setg_errno(errp, errno,
>>>>                            "IOMMU_HWPT_GET_DIRTY_BITMAP
>>> (iova: 0x%"HWADDR_PRIx
>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>> index 0057488ce9..c897aa6b17 100644
>>>> --- a/hw/vfio/iommufd.c
>>>> +++ b/hw/vfio/iommufd.c
>>>> @@ -62,7 +62,7 @@ static int iommufd_cdev_unmap_one(const
>>> VFIOContainerBase *bcontainer,
>>>>                                     hwaddr iova, ram_addr_t
>size,
>>>>                                     IOMMUTLBEntry *iotlb)
>>>>   {
>>>> -    const VFIOIOMMUFDContainer *container =
>>>> +    VFIOIOMMUFDContainer *container =
>>>>           container_of(bcontainer, VFIOIOMMUFDContainer,
>>> bcontainer);
>>>>       bool need_dirty_sync = false;
>>>>       Error *local_err = NULL;
>>>> @@ -73,9 +73,12 @@ static int iommufd_cdev_unmap_one(const
>>> VFIOContainerBase *bcontainer,
>>>>       if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer))
>{
>>>>           if
>>> (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
>>>>               bcontainer->dirty_pages_supported) {
>>>> +            container->dirty_tracking_flags =
>>>> +
>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR;
>>>>               ret = vfio_container_query_dirty_bitmap(bcontainer,
>iova,
>>> size,
>>>>
>>> iotlb->translated_addr,
>>>>
>>> &local_err);
>>>> +            container->dirty_tracking_flags = 0;
>>>
>>> Why not changing vfio_container_query_dirty_bitmap to pass a flags too,
>like
>>> the
>>> original patches? This is a little unnecssary odd style to pass a flag via
>>> container structure rather and then clearing.
>>
>> Just want to be simpler, original patch introduced a new parameter to
>almost all
>> variants of *_query_dirty_bitmap() while the flags parameter is only used by
>> IOMMUFD backend when doing unmap_bitmap. Currently we already have
>three
>> backends, legacy VFIO, IOMMUFD and VFIO-user, only IOMMUFD need the
>flag.
>>
>> I take container->dirty_tracking_flags as a notification mechanism, so set it
>before
>> vfio_container_query_dirty_bitmap() and clear it thereafter. Maybe clearing
>it in
>> iommufd_query_dirty_bitmap() is easier to be acceptable?
>>
>>>
>>> Part of the reason the original series had a VFIO_GET_DIRTY_NO_FLUSH
>for
>>> generic
>>> container abstraction was to not mix IOMMUFD UAPI specifics into base
>>> container
>>> API. Then in getting a VFIO_GET_DIRTY_NO_FLUSH, then type1 backend
>>> could just
>>> ignore the flag, while IOMMUFD translates it to
>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR
>>
>> I did port original patch
>https://github.com/yiliu1765/qemu/commit/99f83595d79d2e4170c9e456cf1
>a7b9521bd4f80
>> But it looks complex to have 'flags' parameter everywhere.
>I think I would prefer like Joao to avoid caching information if possible
>but I haven't check closely the mess it would introduce in the code. Let
>me check.

OK, waiting for your finial decision, I'm not that eager on a simpler change,
I can cherry-pick Joao's change if you prefer it.

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH 3/5] vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
  2025-09-22 17:01           ` Joao Martins
@ 2025-09-23  2:50             ` Duan, Zhenzhong
  2025-09-23  9:17               ` Joao Martins
  0 siblings, 1 reply; 36+ messages in thread
From: Duan, Zhenzhong @ 2025-09-23  2:50 UTC (permalink / raw)
  To: Joao Martins, Cédric Le Goater
  Cc: alex.williamson@redhat.com, mst@redhat.com, jasowang@redhat.com,
	Liu, Yi L, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, avihaih@nvidia.com, Hao, Xudong,
	Cabiddu, Giovanni, Gross, Mark, Van De Ven, Arjan,
	qemu-devel@nongnu.org



>-----Original Message-----
>From: Joao Martins <joao.m.martins@oracle.com>
>Subject: Re: [PATCH 3/5] vfio/iommufd: Add
>IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
>
>On 22/09/2025 17:06, Joao Martins wrote:
>> On 22/09/2025 17:02, Cédric Le Goater wrote:
>>> On 9/22/25 07:49, Duan, Zhenzhong wrote:
>>>> Hi Joao,
>>>>
>>>>> -----Original Message-----
>>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>>> Subject: Re: [PATCH 3/5] vfio/iommufd: Add
>>>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
>>>>>
>>>>> On 10/09/2025 03:36, Zhenzhong Duan wrote:
>>>>>> Pass IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR when doing the
>last
>>>>> dirty
>>>>>> bitmap query right before unmap, no PTEs flushes. This accelerates the
>>>>>> query without issue because unmap will tear down the mapping
>anyway.
>>>>>>
>>>>>> Add a new element dirty_tracking_flags in VFIOIOMMUFDContainer to
>>>>>> be used for the flags of iommufd dirty tracking. Currently it is
>>>>>> set to either IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR or 0
>based on
>>>>>> the scenario.
>>>>>>
>>>>>> Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
>>>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>>>> Tested-by: Xudong Hao <xudong.hao@intel.com>
>>>>>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>>>>>> ---
>>>>>>   hw/vfio/vfio-iommufd.h   | 1 +
>>>>>>   include/system/iommufd.h | 2 +-
>>>>>>   backends/iommufd.c       | 5 +++--
>>>>>>   hw/vfio/iommufd.c        | 6 +++++-
>>>>>>   backends/trace-events    | 2 +-
>>>>>>   5 files changed, 11 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
>>>>>> index 07ea0f4304..e0af241c75 100644
>>>>>> --- a/hw/vfio/vfio-iommufd.h
>>>>>> +++ b/hw/vfio/vfio-iommufd.h
>>>>>> @@ -26,6 +26,7 @@ typedef struct VFIOIOMMUFDContainer {
>>>>>>       VFIOContainerBase bcontainer;
>>>>>>       IOMMUFDBackend *be;
>>>>>>       uint32_t ioas_id;
>>>>>> +    uint64_t dirty_tracking_flags;
>>>>>>       QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>>>>>   } VFIOIOMMUFDContainer;
>>>>>>
>>>>>> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>>>>>> index c9c72ffc45..63898e7b0d 100644
>>>>>> --- a/include/system/iommufd.h
>>>>>> +++ b/include/system/iommufd.h
>>>>>> @@ -64,7 +64,7 @@ bool
>>>>> iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be, uint32_t
>>>>> hwpt_id,
>>>>>>   bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>>>> uint32_t hwpt_id,
>>>>>>                                         uint64_t
>iova, ram_addr_t
>>>>> size,
>>>>>>                                         uint64_t
>page_size,
>>>>> uint64_t *data,
>>>>>> -                                      Error
>**errp);
>>>>>> +                                      uint64_t
>flags, Error
>>>>> **errp);
>>>>>>   bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be,
>>>>> uint32_t id,
>>>>>>                                         uint32_t
>data_type,
>>>>> uint32_t entry_len,
>>>>>>                                         uint32_t
>*entry_num, void
>>>>> *data,
>>>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>>>> index 2a33c7ab0b..3c4f6157e2 100644
>>>>>> --- a/backends/iommufd.c
>>>>>> +++ b/backends/iommufd.c
>>>>>> @@ -361,7 +361,7 @@ bool
>>>>> iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>>>>>                                         uint32_t
>hwpt_id,
>>>>>>                                         uint64_t
>iova, ram_addr_t
>>>>> size,
>>>>>>                                         uint64_t
>page_size,
>>>>> uint64_t *data,
>>>>>> -                                      Error **errp)
>>>>>> +                                      uint64_t
>flags, Error **errp)
>>>>>>   {
>>>>>>       int ret;
>>>>>>       struct iommu_hwpt_get_dirty_bitmap get_dirty_bitmap = {
>>>>>> @@ -371,11 +371,12 @@ bool
>>>>> iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>>>>>           .length = size,
>>>>>>           .page_size = page_size,
>>>>>>           .data = (uintptr_t)data,
>>>>>> +        .flags = flags,
>>>>>>       };
>>>>>>
>>>>>>       ret = ioctl(be->fd, IOMMU_HWPT_GET_DIRTY_BITMAP,
>>>>> &get_dirty_bitmap);
>>>>>>       trace_iommufd_backend_get_dirty_bitmap(be->fd, hwpt_id,
>iova,
>>>>> size,
>>>>>>
>-                                           page_size,
>ret ? errno :
>>>>> 0);
>>>>>> +                                           flags,
>page_size, ret ?
>>>>> errno : 0);
>>>>>>       if (ret) {
>>>>>>           error_setg_errno(errp, errno,
>>>>>>                            "IOMMU_HWPT_GET_DIRTY_
>BITMAP
>>>>> (iova: 0x%"HWADDR_PRIx
>>>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>>>> index 0057488ce9..c897aa6b17 100644
>>>>>> --- a/hw/vfio/iommufd.c
>>>>>> +++ b/hw/vfio/iommufd.c
>>>>>> @@ -62,7 +62,7 @@ static int iommufd_cdev_unmap_one(const
>>>>> VFIOContainerBase *bcontainer,
>>>>>>                                     hwaddr iova,
>ram_addr_t size,
>>>>>>                                     IOMMUTLBEntr
>y *iotlb)
>>>>>>   {
>>>>>> -    const VFIOIOMMUFDContainer *container =
>>>>>> +    VFIOIOMMUFDContainer *container =
>>>>>>           container_of(bcontainer, VFIOIOMMUFDContainer,
>>>>> bcontainer);
>>>>>>       bool need_dirty_sync = false;
>>>>>>       Error *local_err = NULL;
>>>>>> @@ -73,9 +73,12 @@ static int iommufd_cdev_unmap_one(const
>>>>> VFIOContainerBase *bcontainer,
>>>>>>       if (iotlb &&
>vfio_container_dirty_tracking_is_started(bcontainer)) {
>>>>>>           if
>>>>> (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
>>>>>>               bcontainer->dirty_pages_supported) {
>>>>>> +            container->dirty_tracking_flags =
>>>>>> +
>>>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR;
>>>>>>               ret =
>vfio_container_query_dirty_bitmap(bcontainer, iova,
>>>>> size,
>>>>>>
>>>>> iotlb->translated_addr,
>>>>>>
>>>>> &local_err);
>>>>>> +            container->dirty_tracking_flags = 0;
>>>>>
>>>>> Why not changing vfio_container_query_dirty_bitmap to pass a flags
>too, like
>>>>> the
>>>>> original patches? This is a little unnecssary odd style to pass a flag via
>>>>> container structure rather and then clearing.
>>>>
>>>> Just want to be simpler, original patch introduced a new parameter to
>almost all
>>>> variants of *_query_dirty_bitmap() while the flags parameter is only used
>by
>>>> IOMMUFD backend when doing unmap_bitmap. Currently we already
>have three
>>>> backends, legacy VFIO, IOMMUFD and VFIO-user, only IOMMUFD need
>the flag.
>>>>
>>>> I take container->dirty_tracking_flags as a notification mechanism, so set
>it
>>>> before
>>>> vfio_container_query_dirty_bitmap() and clear it thereafter. Maybe
>clearing it in
>>>> iommufd_query_dirty_bitmap() is easier to be acceptable?
>>>>
>>>>>
>>>>> Part of the reason the original series had a VFIO_GET_DIRTY_NO_FLUSH
>for
>>>>> generic
>>>>> container abstraction was to not mix IOMMUFD UAPI specifics into base
>>>>> container
>>>>> API. Then in getting a VFIO_GET_DIRTY_NO_FLUSH, then type1 backend
>>>>> could just
>>>>> ignore the flag, while IOMMUFD translates it to
>>>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR
>>>>
>>>> I did port original patch https://github.com/yiliu1765/qemu/
>>>> commit/99f83595d79d2e4170c9e456cf1a7b9521bd4f80
>>>> But it looks complex to have 'flags' parameter everywhere.
>>> I think I would prefer like Joao to avoid caching information if possible
>>> but I haven't check closely the mess it would introduce in the code. Let
>>> me check.
>>>
>>
>> My recollection was that it wasn't that much churn added as this series is
>> already doing the most of the churn. But I am checking how the code would
>look
>> like to properly respond to his suggestion on why he changing it towards
>> structuref field.
>
>The churn should be similar to this patch:
>
>https://github.com/jpemartins/qemu/commit/5e1f11075299a5fa9564a26788
>dc9cc1717e297c
>
>It's mostly an interface parameter addition of flags.

Yes, it's a general interface parameter but it's only used by IOMMUFD.
That's my only argument.

>But there's now a few more callsites and I suppose that's the extra churn. But
>I don't think
>vfio-user would need to change. See snip below.

vfio-user need same change as vfio-legacy though they don't use 'flags' parameter,
or else there is build error:

../hw/vfio-user/container.c:340:30: error: assignment to ‘int (*)(const VFIOContainerBase *, VFIOBitmap *, hwaddr,  hwaddr,  uint64_t,  Error **)’ {aka ‘int (*)(const VFIOContainerBase *, VFIOBitmap *, long unsigned int,  long unsigned int,  long unsigned int,  Error **)’} from incompatible pointer type ‘int (*)(const VFIOContainerBase *, VFIOBitmap *, hwaddr,  hwaddr,  Error **)’ {aka ‘int (*)(const VFIOContainerBase *, VFIOBitmap *, long unsigned int,  long unsigned int,  Error **)’} [-Werror=incompatible-pointer-types]
  340 |     vioc->query_dirty_bitmap = vfio_user_query_dirty_bitmap;

>
>I've pushed this here
>https://github.com/jpemartins/qemu/commits/relax-viommu-lm but showing
>the
>series wide changes for discussion. The thing I didn't do was have an
>intermediate 'backend
>agnostic' flag thingie like the original one
>(https://github.com/jpemartins/qemu/commit/1ef8abc896ae69be8896a6870
>5fe4a87204709e0) with
>VFIO_GET_DIRTY_NO_FLUSH
>
>diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
>index 56304978e1e8..2fe08e010405 100644
>--- a/hw/vfio/container-base.c
>+++ b/hw/vfio/container-base.c
>@@ -211,17 +211,16 @@ static int
>vfio_device_dma_logging_report(VFIODevice *vbasedev, hwaddr iova,
> }
>
> static int vfio_container_iommu_query_dirty_bitmap(const
>VFIOContainerBase *bcontainer,
>-                   VFIOBitmap *vbmap, hwaddr iova, hwaddr size, Error
>**errp)
>+                   VFIOBitmap *vbmap, hwaddr iova, hwaddr size,
>uint64_t flags, Error **errp)
> {
>     VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
>
>     g_assert(vioc->query_dirty_bitmap);
>-    return vioc->query_dirty_bitmap(bcontainer, vbmap, iova, size,
>-                                               errp);
>+    return vioc->query_dirty_bitmap(bcontainer, vbmap, iova, size, flags,
>errp);
> }
>
> static int vfio_container_devices_query_dirty_bitmap(const
>VFIOContainerBase *bcontainer,
>-                 VFIOBitmap *vbmap, hwaddr iova, hwaddr size, Error
>**errp)
>+                 VFIOBitmap *vbmap, hwaddr iova, hwaddr size,
>uint64_t flags, Error **errp)

I think we can drop 'flags' here as you did in your old patch?

> {
>     VFIODevice *vbasedev;
>     int ret;
>@@ -243,7 +242,7 @@ static int
>vfio_container_devices_query_dirty_bitmap(const VFIOContainerBase *bc
> }
>
> int vfio_container_query_dirty_bitmap(const VFIOContainerBase
>*bcontainer, uint64_t iova,
>-                          uint64_t size, ram_addr_t ram_addr, Error
>**errp)
>+                          uint64_t size, uint64_t flags, ram_addr_t
>ram_addr, Error **errp)
> {
>     bool all_device_dirty_tracking =
>         vfio_container_devices_dirty_tracking_is_supported(bcontainer);
>@@ -267,10 +266,10 @@ int vfio_container_query_dirty_bitmap(const
>VFIOContainerBase *bcontainer, uint6
>
>     if (all_device_dirty_tracking) {
>         ret = vfio_container_devices_query_dirty_bitmap(bcontainer,
>&vbmap, iova, size,
>-                                                        errp);
>+                                                        flags,
>errp);

Same here.

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/5] vfio/iommufd: Add framework code to support getting dirty bitmap before unmap
  2025-09-23  2:45         ` Duan, Zhenzhong
@ 2025-09-23  7:06           ` Cédric Le Goater
  2025-09-23  9:50             ` Duan, Zhenzhong
  0 siblings, 1 reply; 36+ messages in thread
From: Cédric Le Goater @ 2025-09-23  7:06 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, mst@redhat.com, jasowang@redhat.com,
	Liu, Yi L, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, joao.m.martins@oracle.com,
	avihaih@nvidia.com, Hao, Xudong, Cabiddu, Giovanni, Gross, Mark,
	Van De Ven, Arjan

On 9/23/25 04:45, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Cédric Le Goater <clg@redhat.com>
>> Subject: Re: [PATCH 1/5] vfio/iommufd: Add framework code to support
>> getting dirty bitmap before unmap
>>
>> On 9/22/25 05:17, Duan, Zhenzhong wrote:
>>> Hi Cedric,
>>>
>>>> -----Original Message-----
>>>> From: Cédric Le Goater <clg@redhat.com>
>>>> Subject: Re: [PATCH 1/5] vfio/iommufd: Add framework code to support
>>>> getting dirty bitmap before unmap
>>>>
>>>> Hello Zhenzhong
>>>>
>>>> On 9/10/25 04:36, Zhenzhong Duan wrote:
>>>>> Currently we support device and iommu dirty tracking, device dirty
>>>>> tracking is preferred.
>>>>>
>>>>> Add the framework code in iommufd_cdev_unmap_one() to choose
>> either
>>>>> device or iommu dirty tracking, just like vfio_legacy_dma_unmap_one().
>>>>
>>>> I wonder if commit 567d7d3e6be5 ("vfio/common: Work around kernel
>>>> overflow bug in DMA unmap") could be removed now to make the code
>>>> common to both VFIO IOMMU Type1 and IOMMUFD backends.
>>>
>>> I am not clear if there is other reason to keep the workaround, but the
>> original
>>> kernel issue had been fixed with below commit:
>>>
>>> commit 58fec830fc19208354895d9832785505046d6c01
>>> Author: Alex Williamson <alex.williamson@redhat.com>
>>> Date:   Mon Jan 7 22:13:22 2019 -0700
>>>
>>>       vfio/type1: Fix unmap overflow off-by-one
>>>
>>>       The below referenced commit adds a test for integer overflow, but in
>>>       doing so prevents the unmap ioctl from ever including the last page
>> of
>>>       the address space.  Subtract one to compare to the last address of
>> the
>>>       unmap to avoid the overflow and wrap-around.
>>>
>>>       Fixes: 71a7d3d78e3c ("vfio/type1: silence integer overflow warning")
>>>       Link: https://bugzilla.redhat.com/show_bug.cgi?id=1662291
>>>       Cc: stable@vger.kernel.org # v4.15+
>>>       Reported-by: Pei Zhang <pezhang@redhat.com>
>>>       Debugged-by: Peter Xu <peterx@redhat.com>
>>>       Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
>>>       Reviewed-by: Peter Xu <peterx@redhat.com>
>>>       Tested-by: Peter Xu <peterx@redhat.com>
>>>       Reviewed-by: Cornelia Huck <cohuck@redhat.com>
>>>       Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>>
>>>>
>>>> I asked Alex and Peter in another thread.
>>>
>>> Just curious on the answer, may I ask which thread?
>>
>> According to Alex, the QEMU workaround can be removed :
>>
>> https://lore.kernel.org/qemu-devel/20250919102447.748e17fe.alex.williams
>> on@redhat.com/
>>
>>> btw: I just found unmapping in halves seems unnecessary as both backends
>> of kernel side support unmap_all now.
>>>
>>>       if (unmap_all) {
>>>           /* The unmap ioctl doesn't accept a full 64-bit span. */
>>>           Int128 llsize = int128_rshift(int128_2_64(), 1);
>>>
>>>           ret = vfio_legacy_dma_unmap_one(bcontainer, 0,
>> int128_get64(llsize),
>>>                                           iotlb);
>>>
>>>           if (ret == 0) {
>>>               ret = vfio_legacy_dma_unmap_one(bcontainer,
>> int128_get64(llsize),
>>>                                               int128_get64(llsize),
>> iotlb);
>>>           }
>>>
>>>       } else {
>>>           ret = vfio_legacy_dma_unmap_one(bcontainer, iova, size,
>> iotlb);
>>>       }
>>
>> Good. So we can simply both backends it seems.

*ify

> 
> Will you handle them or not? I mean the workaround removing and unmapping_all optimization.

I can revert 567d7d3e6be5 ("vfio/common: Work around kernel overflow
bug in DMA unmap") but, AFAICT, the "unmap DMAs in halves" method (see
1b296c3def4b) should be kept.

Thanks,

C.



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/5] vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
  2025-09-23  2:50             ` Duan, Zhenzhong
@ 2025-09-23  9:17               ` Joao Martins
  2025-09-23  9:55                 ` Duan, Zhenzhong
  0 siblings, 1 reply; 36+ messages in thread
From: Joao Martins @ 2025-09-23  9:17 UTC (permalink / raw)
  To: Duan, Zhenzhong, Cédric Le Goater
  Cc: alex.williamson@redhat.com, mst@redhat.com, jasowang@redhat.com,
	Liu, Yi L, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, avihaih@nvidia.com, Hao, Xudong,
	Cabiddu, Giovanni, Gross, Mark, Van De Ven, Arjan,
	qemu-devel@nongnu.org

On 23/09/2025 03:50, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Subject: Re: [PATCH 3/5] vfio/iommufd: Add
>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
>>
>> On 22/09/2025 17:06, Joao Martins wrote:
>>> On 22/09/2025 17:02, Cédric Le Goater wrote:
>>>> On 9/22/25 07:49, Duan, Zhenzhong wrote:
>>>>> Hi Joao,
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>>>> Subject: Re: [PATCH 3/5] vfio/iommufd: Add
>>>>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
>>>>>>
>>>>>> On 10/09/2025 03:36, Zhenzhong Duan wrote:
>>>>>>> Pass IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR when doing the
>> last
>>>>>> dirty
>>>>>>> bitmap query right before unmap, no PTEs flushes. This accelerates the
>>>>>>> query without issue because unmap will tear down the mapping
>> anyway.
>>>>>>>
>>>>>>> Add a new element dirty_tracking_flags in VFIOIOMMUFDContainer to
>>>>>>> be used for the flags of iommufd dirty tracking. Currently it is
>>>>>>> set to either IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR or 0
>> based on
>>>>>>> the scenario.
>>>>>>>
>>>>>>> Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
>>>>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>>>>> Tested-by: Xudong Hao <xudong.hao@intel.com>
>>>>>>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>>>>>>> ---
>>>>>>>   hw/vfio/vfio-iommufd.h   | 1 +
>>>>>>>   include/system/iommufd.h | 2 +-
>>>>>>>   backends/iommufd.c       | 5 +++--
>>>>>>>   hw/vfio/iommufd.c        | 6 +++++-
>>>>>>>   backends/trace-events    | 2 +-
>>>>>>>   5 files changed, 11 insertions(+), 5 deletions(-)
>>>>>>>
>>>>>>> diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
>>>>>>> index 07ea0f4304..e0af241c75 100644
>>>>>>> --- a/hw/vfio/vfio-iommufd.h
>>>>>>> +++ b/hw/vfio/vfio-iommufd.h
>>>>>>> @@ -26,6 +26,7 @@ typedef struct VFIOIOMMUFDContainer {
>>>>>>>       VFIOContainerBase bcontainer;
>>>>>>>       IOMMUFDBackend *be;
>>>>>>>       uint32_t ioas_id;
>>>>>>> +    uint64_t dirty_tracking_flags;
>>>>>>>       QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>>>>>>   } VFIOIOMMUFDContainer;
>>>>>>>
>>>>>>> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>>>>>>> index c9c72ffc45..63898e7b0d 100644
>>>>>>> --- a/include/system/iommufd.h
>>>>>>> +++ b/include/system/iommufd.h
>>>>>>> @@ -64,7 +64,7 @@ bool
>>>>>> iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be, uint32_t
>>>>>> hwpt_id,
>>>>>>>   bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>>>>> uint32_t hwpt_id,
>>>>>>>                                         uint64_t
>> iova, ram_addr_t
>>>>>> size,
>>>>>>>                                         uint64_t
>> page_size,
>>>>>> uint64_t *data,
>>>>>>> -                                      Error
>> **errp);
>>>>>>> +                                      uint64_t
>> flags, Error
>>>>>> **errp);
>>>>>>>   bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be,
>>>>>> uint32_t id,
>>>>>>>                                         uint32_t
>> data_type,
>>>>>> uint32_t entry_len,
>>>>>>>                                         uint32_t
>> *entry_num, void
>>>>>> *data,
>>>>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>>>>> index 2a33c7ab0b..3c4f6157e2 100644
>>>>>>> --- a/backends/iommufd.c
>>>>>>> +++ b/backends/iommufd.c
>>>>>>> @@ -361,7 +361,7 @@ bool
>>>>>> iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>>>>>>                                         uint32_t
>> hwpt_id,
>>>>>>>                                         uint64_t
>> iova, ram_addr_t
>>>>>> size,
>>>>>>>                                         uint64_t
>> page_size,
>>>>>> uint64_t *data,
>>>>>>> -                                      Error **errp)
>>>>>>> +                                      uint64_t
>> flags, Error **errp)
>>>>>>>   {
>>>>>>>       int ret;
>>>>>>>       struct iommu_hwpt_get_dirty_bitmap get_dirty_bitmap = {
>>>>>>> @@ -371,11 +371,12 @@ bool
>>>>>> iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>>>>>>           .length = size,
>>>>>>>           .page_size = page_size,
>>>>>>>           .data = (uintptr_t)data,
>>>>>>> +        .flags = flags,
>>>>>>>       };
>>>>>>>
>>>>>>>       ret = ioctl(be->fd, IOMMU_HWPT_GET_DIRTY_BITMAP,
>>>>>> &get_dirty_bitmap);
>>>>>>>       trace_iommufd_backend_get_dirty_bitmap(be->fd, hwpt_id,
>> iova,
>>>>>> size,
>>>>>>>
>> -                                           page_size,
>> ret ? errno :
>>>>>> 0);
>>>>>>> +                                           flags,
>> page_size, ret ?
>>>>>> errno : 0);
>>>>>>>       if (ret) {
>>>>>>>           error_setg_errno(errp, errno,
>>>>>>>                            "IOMMU_HWPT_GET_DIRTY_
>> BITMAP
>>>>>> (iova: 0x%"HWADDR_PRIx
>>>>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>>>>> index 0057488ce9..c897aa6b17 100644
>>>>>>> --- a/hw/vfio/iommufd.c
>>>>>>> +++ b/hw/vfio/iommufd.c
>>>>>>> @@ -62,7 +62,7 @@ static int iommufd_cdev_unmap_one(const
>>>>>> VFIOContainerBase *bcontainer,
>>>>>>>                                     hwaddr iova,
>> ram_addr_t size,
>>>>>>>                                     IOMMUTLBEntr
>> y *iotlb)
>>>>>>>   {
>>>>>>> -    const VFIOIOMMUFDContainer *container =
>>>>>>> +    VFIOIOMMUFDContainer *container =
>>>>>>>           container_of(bcontainer, VFIOIOMMUFDContainer,
>>>>>> bcontainer);
>>>>>>>       bool need_dirty_sync = false;
>>>>>>>       Error *local_err = NULL;
>>>>>>> @@ -73,9 +73,12 @@ static int iommufd_cdev_unmap_one(const
>>>>>> VFIOContainerBase *bcontainer,
>>>>>>>       if (iotlb &&
>> vfio_container_dirty_tracking_is_started(bcontainer)) {
>>>>>>>           if
>>>>>> (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
>>>>>>>               bcontainer->dirty_pages_supported) {
>>>>>>> +            container->dirty_tracking_flags =
>>>>>>> +
>>>>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR;
>>>>>>>               ret =
>> vfio_container_query_dirty_bitmap(bcontainer, iova,
>>>>>> size,
>>>>>>>
>>>>>> iotlb->translated_addr,
>>>>>>>
>>>>>> &local_err);
>>>>>>> +            container->dirty_tracking_flags = 0;
>>>>>>
>>>>>> Why not changing vfio_container_query_dirty_bitmap to pass a flags
>> too, like
>>>>>> the
>>>>>> original patches? This is a little unnecssary odd style to pass a flag via
>>>>>> container structure rather and then clearing.
>>>>>
>>>>> Just want to be simpler, original patch introduced a new parameter to
>> almost all
>>>>> variants of *_query_dirty_bitmap() while the flags parameter is only used
>> by
>>>>> IOMMUFD backend when doing unmap_bitmap. Currently we already
>> have three
>>>>> backends, legacy VFIO, IOMMUFD and VFIO-user, only IOMMUFD need
>> the flag.
>>>>>
>>>>> I take container->dirty_tracking_flags as a notification mechanism, so set
>> it
>>>>> before
>>>>> vfio_container_query_dirty_bitmap() and clear it thereafter. Maybe
>> clearing it in
>>>>> iommufd_query_dirty_bitmap() is easier to be acceptable?
>>>>>
>>>>>>
>>>>>> Part of the reason the original series had a VFIO_GET_DIRTY_NO_FLUSH
>> for
>>>>>> generic
>>>>>> container abstraction was to not mix IOMMUFD UAPI specifics into base
>>>>>> container
>>>>>> API. Then in getting a VFIO_GET_DIRTY_NO_FLUSH, then type1 backend
>>>>>> could just
>>>>>> ignore the flag, while IOMMUFD translates it to
>>>>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR
>>>>>
>>>>> I did port original patch https://github.com/yiliu1765/qemu/
>>>>> commit/99f83595d79d2e4170c9e456cf1a7b9521bd4f80
>>>>> But it looks complex to have 'flags' parameter everywhere.
>>>> I think I would prefer like Joao to avoid caching information if possible
>>>> but I haven't check closely the mess it would introduce in the code. Let
>>>> me check.
>>>>
>>>
>>> My recollection was that it wasn't that much churn added as this series is
>>> already doing the most of the churn. But I am checking how the code would
>> look
>>> like to properly respond to his suggestion on why he changing it towards
>>> structuref field.
>>
>> The churn should be similar to this patch:
>>
>> https://github.com/jpemartins/qemu/commit/5e1f11075299a5fa9564a26788
>> dc9cc1717e297c
>>
>> It's mostly an interface parameter addition of flags.
> 
> Yes, it's a general interface parameter but it's only used by IOMMUFD.
> That's my only argument.
> 

You could use that argument for anything that's not common and unique to each
backend. type1 dirty tracking is effectively "frozen" in UAPI (IIRC) so I won't
expect flags. There's device dirty tracking, IOMMUFD dirty tracking and
vfio-user dirty tracking. While I am not anticipating any new thing here (at
least in the needs I have ahead, can't speak for other companies/contributors);
but you could see the flags as future proofing it.

The ugly part of the alternative is that it sort of limits the interface  to be
single-user only as nobody call into IOMMUFD with the dirty_tracking_flags if
it's called concurrently for some other purpose than unmap. We sort of use the
container data structure as a argument storage for a single call as opposed to
store some intermediary state.

I mean I agree with you that there's churn, but it's essentially a argument
change that's only churn-y because there's 3 users.

>> But there's now a few more callsites and I suppose that's the extra churn. But
>> I don't think
>> vfio-user would need to change. See snip below.
> 
> vfio-user need same change as vfio-legacy though they don't use 'flags' parameter,
> or else there is build error:
> 
> ../hw/vfio-user/container.c:340:30: error: assignment to ‘int (*)(const VFIOContainerBase *, VFIOBitmap *, hwaddr,  hwaddr,  uint64_t,  Error **)’ {aka ‘int (*)(const VFIOContainerBase *, VFIOBitmap *, long unsigned int,  long unsigned int,  long unsigned int,  Error **)’} from incompatible pointer type ‘int (*)(const VFIOContainerBase *, VFIOBitmap *, hwaddr,  hwaddr,  Error **)’ {aka ‘int (*)(const VFIOContainerBase *, VFIOBitmap *, long unsigned int,  long unsigned int,  Error **)’} [-Werror=incompatible-pointer-types]
>   340 |     vioc->query_dirty_bitmap = vfio_user_query_dirty_bitmap;
> 

yes -- I didn't have VFIO_USER compiled in probably why I didn't notice it.

>>
>> I've pushed this here
>> https://github.com/jpemartins/qemu/commits/relax-viommu-lm but showing
>> the
>> series wide changes for discussion. The thing I didn't do was have an
>> intermediate 'backend
>> agnostic' flag thingie like the original one
>> (https://github.com/jpemartins/qemu/commit/1ef8abc896ae69be8896a6870
>> 5fe4a87204709e0) with
>> VFIO_GET_DIRTY_NO_FLUSH
>>

Instead of a flags, an alternative would be to have a 'bool unmap' which then
IOMMUFD can translate to IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR.

That's a bit cleaner as to avoid leaking backend UAPIs into the API but it
limits to 'unmap' case.

>> diff --git a/hw/vfio/container-base.c b/hw/vfio/container-base.c
>> index 56304978e1e8..2fe08e010405 100644
>> --- a/hw/vfio/container-base.c
>> +++ b/hw/vfio/container-base.c
>> @@ -211,17 +211,16 @@ static int
>> vfio_device_dma_logging_report(VFIODevice *vbasedev, hwaddr iova,
>> }
>>
>> static int vfio_container_iommu_query_dirty_bitmap(const
>> VFIOContainerBase *bcontainer,
>> -                   VFIOBitmap *vbmap, hwaddr iova, hwaddr size, Error
>> **errp)
>> +                   VFIOBitmap *vbmap, hwaddr iova, hwaddr size,
>> uint64_t flags, Error **errp)
>> {
>>     VFIOIOMMUClass *vioc = VFIO_IOMMU_GET_CLASS(bcontainer);
>>
>>     g_assert(vioc->query_dirty_bitmap);
>> -    return vioc->query_dirty_bitmap(bcontainer, vbmap, iova, size,
>> -                                               errp);
>> +    return vioc->query_dirty_bitmap(bcontainer, vbmap, iova, size, flags,
>> errp);
>> }
>>
>> static int vfio_container_devices_query_dirty_bitmap(const
>> VFIOContainerBase *bcontainer,
>> -                 VFIOBitmap *vbmap, hwaddr iova, hwaddr size, Error
>> **errp)
>> +                 VFIOBitmap *vbmap, hwaddr iova, hwaddr size,
>> uint64_t flags, Error **errp)
> 
> I think we can drop 'flags' here as you did in your old patch?
> 

Yeah

>> {
>>     VFIODevice *vbasedev;
>>     int ret;
>> @@ -243,7 +242,7 @@ static int
>> vfio_container_devices_query_dirty_bitmap(const VFIOContainerBase *bc
>> }
>>
>> int vfio_container_query_dirty_bitmap(const VFIOContainerBase
>> *bcontainer, uint64_t iova,
>> -                          uint64_t size, ram_addr_t ram_addr, Error
>> **errp)
>> +                          uint64_t size, uint64_t flags, ram_addr_t
>> ram_addr, Error **errp)
>> {
>>     bool all_device_dirty_tracking =
>>         vfio_container_devices_dirty_tracking_is_supported(bcontainer);
>> @@ -267,10 +266,10 @@ int vfio_container_query_dirty_bitmap(const
>> VFIOContainerBase *bcontainer, uint6
>>
>>     if (all_device_dirty_tracking) {
>>         ret = vfio_container_devices_query_dirty_bitmap(bcontainer,
>> &vbmap, iova, size,
>> -                                                        errp);
>> +                                                        flags,
>> errp);
> 
> Same here.

/me nods


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH 1/5] vfio/iommufd: Add framework code to support getting dirty bitmap before unmap
  2025-09-23  7:06           ` Cédric Le Goater
@ 2025-09-23  9:50             ` Duan, Zhenzhong
  0 siblings, 0 replies; 36+ messages in thread
From: Duan, Zhenzhong @ 2025-09-23  9:50 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, mst@redhat.com, jasowang@redhat.com,
	Liu, Yi L, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, joao.m.martins@oracle.com,
	avihaih@nvidia.com, Hao,  Xudong, Cabiddu, Giovanni, Gross, Mark,
	Van De Ven, Arjan



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH 1/5] vfio/iommufd: Add framework code to support
>getting dirty bitmap before unmap
>
>On 9/23/25 04:45, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Cédric Le Goater <clg@redhat.com>
>>> Subject: Re: [PATCH 1/5] vfio/iommufd: Add framework code to support
>>> getting dirty bitmap before unmap
>>>
>>> On 9/22/25 05:17, Duan, Zhenzhong wrote:
>>>> Hi Cedric,
>>>>
>>>>> -----Original Message-----
>>>>> From: Cédric Le Goater <clg@redhat.com>
>>>>> Subject: Re: [PATCH 1/5] vfio/iommufd: Add framework code to support
>>>>> getting dirty bitmap before unmap
>>>>>
>>>>> Hello Zhenzhong
>>>>>
>>>>> On 9/10/25 04:36, Zhenzhong Duan wrote:
>>>>>> Currently we support device and iommu dirty tracking, device dirty
>>>>>> tracking is preferred.
>>>>>>
>>>>>> Add the framework code in iommufd_cdev_unmap_one() to choose
>>> either
>>>>>> device or iommu dirty tracking, just like
>vfio_legacy_dma_unmap_one().
>>>>>
>>>>> I wonder if commit 567d7d3e6be5 ("vfio/common: Work around kernel
>>>>> overflow bug in DMA unmap") could be removed now to make the code
>>>>> common to both VFIO IOMMU Type1 and IOMMUFD backends.
>>>>
>>>> I am not clear if there is other reason to keep the workaround, but the
>>> original
>>>> kernel issue had been fixed with below commit:
>>>>
>>>> commit 58fec830fc19208354895d9832785505046d6c01
>>>> Author: Alex Williamson <alex.williamson@redhat.com>
>>>> Date:   Mon Jan 7 22:13:22 2019 -0700
>>>>
>>>>       vfio/type1: Fix unmap overflow off-by-one
>>>>
>>>>       The below referenced commit adds a test for integer overflow,
>but in
>>>>       doing so prevents the unmap ioctl from ever including the last
>page
>>> of
>>>>       the address space.  Subtract one to compare to the last address
>of
>>> the
>>>>       unmap to avoid the overflow and wrap-around.
>>>>
>>>>       Fixes: 71a7d3d78e3c ("vfio/type1: silence integer overflow
>warning")
>>>>       Link: https://bugzilla.redhat.com/show_bug.cgi?id=1662291
>>>>       Cc: stable@vger.kernel.org # v4.15+
>>>>       Reported-by: Pei Zhang <pezhang@redhat.com>
>>>>       Debugged-by: Peter Xu <peterx@redhat.com>
>>>>       Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
>>>>       Reviewed-by: Peter Xu <peterx@redhat.com>
>>>>       Tested-by: Peter Xu <peterx@redhat.com>
>>>>       Reviewed-by: Cornelia Huck <cohuck@redhat.com>
>>>>       Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>>>
>>>>>
>>>>> I asked Alex and Peter in another thread.
>>>>
>>>> Just curious on the answer, may I ask which thread?
>>>
>>> According to Alex, the QEMU workaround can be removed :
>>>
>>>
>https://lore.kernel.org/qemu-devel/20250919102447.748e17fe.alex.williams
>>> on@redhat.com/
>>>
>>>> btw: I just found unmapping in halves seems unnecessary as both
>backends
>>> of kernel side support unmap_all now.
>>>>
>>>>       if (unmap_all) {
>>>>           /* The unmap ioctl doesn't accept a full 64-bit span. */
>>>>           Int128 llsize = int128_rshift(int128_2_64(), 1);
>>>>
>>>>           ret = vfio_legacy_dma_unmap_one(bcontainer, 0,
>>> int128_get64(llsize),
>>>>                                           iotlb);
>>>>
>>>>           if (ret == 0) {
>>>>               ret = vfio_legacy_dma_unmap_one(bcontainer,
>>> int128_get64(llsize),
>>>>
>int128_get64(llsize),
>>> iotlb);
>>>>           }
>>>>
>>>>       } else {
>>>>           ret = vfio_legacy_dma_unmap_one(bcontainer, iova, size,
>>> iotlb);
>>>>       }
>>>
>>> Good. So we can simply both backends it seems.
>
>*ify
>
>>
>> Will you handle them or not? I mean the workaround removing and
>unmapping_all optimization.
>
>I can revert 567d7d3e6be5 ("vfio/common: Work around kernel overflow
>bug in DMA unmap") but, AFAICT, the "unmap DMAs in halves" method (see
>1b296c3def4b) should be kept.

OK, let me take the unmap_all part. For example, for iommufd, it can be simplified to:
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -67,19 +67,8 @@ static int iommufd_cdev_unmap(const VFIOContainerBase *bcontainer,

     /* unmap in halves */
     if (unmap_all) {
-        Int128 llsize = int128_rshift(int128_2_64(), 1);
-        int ret;
-
-        ret = iommufd_backend_unmap_dma(container->be, container->ioas_id,
-                                        0, int128_get64(llsize));
-
-        if (ret == 0) {
-            ret = iommufd_backend_unmap_dma(container->be, container->ioas_id,
-                                            int128_get64(llsize),
-                                            int128_get64(llsize));
-        }
-
-        return ret;
+        assert(!iova && !size);
+        size = UINT64_MAX;
     }

     /* TODO: Handle dma_unmap_bitmap with iotlb args (migration) */

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH 3/5] vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
  2025-09-23  9:17               ` Joao Martins
@ 2025-09-23  9:55                 ` Duan, Zhenzhong
  2025-09-23 10:06                   ` Duan, Zhenzhong
  0 siblings, 1 reply; 36+ messages in thread
From: Duan, Zhenzhong @ 2025-09-23  9:55 UTC (permalink / raw)
  To: Joao Martins, Cédric Le Goater
  Cc: alex.williamson@redhat.com, mst@redhat.com, jasowang@redhat.com,
	Liu, Yi L, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, avihaih@nvidia.com, Hao, Xudong,
	Cabiddu, Giovanni, Gross, Mark, Van De Ven, Arjan,
	qemu-devel@nongnu.org



>-----Original Message-----
>From: Joao Martins <joao.m.martins@oracle.com>
>Subject: Re: [PATCH 3/5] vfio/iommufd: Add
>IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
>
>On 23/09/2025 03:50, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Joao Martins <joao.m.martins@oracle.com>
>>> Subject: Re: [PATCH 3/5] vfio/iommufd: Add
>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
>>>
>>> On 22/09/2025 17:06, Joao Martins wrote:
>>>> On 22/09/2025 17:02, Cédric Le Goater wrote:
>>>>> On 9/22/25 07:49, Duan, Zhenzhong wrote:
>>>>>> Hi Joao,
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>>>>> Subject: Re: [PATCH 3/5] vfio/iommufd: Add
>>>>>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
>>>>>>>
>>>>>>> On 10/09/2025 03:36, Zhenzhong Duan wrote:
>>>>>>>> Pass IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR when doing
>the
>>> last
>>>>>>> dirty
>>>>>>>> bitmap query right before unmap, no PTEs flushes. This accelerates
>the
>>>>>>>> query without issue because unmap will tear down the mapping
>>> anyway.
>>>>>>>>
>>>>>>>> Add a new element dirty_tracking_flags in VFIOIOMMUFDContainer
>to
>>>>>>>> be used for the flags of iommufd dirty tracking. Currently it is
>>>>>>>> set to either IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR or 0
>>> based on
>>>>>>>> the scenario.
>>>>>>>>
>>>>>>>> Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
>>>>>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>>>>>> Tested-by: Xudong Hao <xudong.hao@intel.com>
>>>>>>>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>>>>>>>> ---
>>>>>>>>   hw/vfio/vfio-iommufd.h   | 1 +
>>>>>>>>   include/system/iommufd.h | 2 +-
>>>>>>>>   backends/iommufd.c       | 5 +++--
>>>>>>>>   hw/vfio/iommufd.c        | 6 +++++-
>>>>>>>>   backends/trace-events    | 2 +-
>>>>>>>>   5 files changed, 11 insertions(+), 5 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
>>>>>>>> index 07ea0f4304..e0af241c75 100644
>>>>>>>> --- a/hw/vfio/vfio-iommufd.h
>>>>>>>> +++ b/hw/vfio/vfio-iommufd.h
>>>>>>>> @@ -26,6 +26,7 @@ typedef struct VFIOIOMMUFDContainer {
>>>>>>>>       VFIOContainerBase bcontainer;
>>>>>>>>       IOMMUFDBackend *be;
>>>>>>>>       uint32_t ioas_id;
>>>>>>>> +    uint64_t dirty_tracking_flags;
>>>>>>>>       QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>>>>>>>   } VFIOIOMMUFDContainer;
>>>>>>>>
>>>>>>>> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>>>>>>>> index c9c72ffc45..63898e7b0d 100644
>>>>>>>> --- a/include/system/iommufd.h
>>>>>>>> +++ b/include/system/iommufd.h
>>>>>>>> @@ -64,7 +64,7 @@ bool
>>>>>>> iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be,
>uint32_t
>>>>>>> hwpt_id,
>>>>>>>>   bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend
>*be,
>>>>>>> uint32_t hwpt_id,
>>>>>>>>                                         uint64_
>t
>>> iova, ram_addr_t
>>>>>>> size,
>>>>>>>>                                         uint64_
>t
>>> page_size,
>>>>>>> uint64_t *data,
>>>>>>>> -                                      Error
>>> **errp);
>>>>>>>> +                                      uint64_t
>>> flags, Error
>>>>>>> **errp);
>>>>>>>>   bool iommufd_backend_invalidate_cache(IOMMUFDBackend
>*be,
>>>>>>> uint32_t id,
>>>>>>>>                                         uint32_
>t
>>> data_type,
>>>>>>> uint32_t entry_len,
>>>>>>>>                                         uint32_
>t
>>> *entry_num, void
>>>>>>> *data,
>>>>>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>>>>>> index 2a33c7ab0b..3c4f6157e2 100644
>>>>>>>> --- a/backends/iommufd.c
>>>>>>>> +++ b/backends/iommufd.c
>>>>>>>> @@ -361,7 +361,7 @@ bool
>>>>>>> iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>>>>>>>                                         uint32_
>t
>>> hwpt_id,
>>>>>>>>                                         uint64_
>t
>>> iova, ram_addr_t
>>>>>>> size,
>>>>>>>>                                         uint64_
>t
>>> page_size,
>>>>>>> uint64_t *data,
>>>>>>>> -                                      Error
>**errp)
>>>>>>>> +                                      uint64_t
>>> flags, Error **errp)
>>>>>>>>   {
>>>>>>>>       int ret;
>>>>>>>>       struct iommu_hwpt_get_dirty_bitmap get_dirty_bitmap = {
>>>>>>>> @@ -371,11 +371,12 @@ bool
>>>>>>> iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>>>>>>>           .length = size,
>>>>>>>>           .page_size = page_size,
>>>>>>>>           .data = (uintptr_t)data,
>>>>>>>> +        .flags = flags,
>>>>>>>>       };
>>>>>>>>
>>>>>>>>       ret = ioctl(be->fd, IOMMU_HWPT_GET_DIRTY_BITMAP,
>>>>>>> &get_dirty_bitmap);
>>>>>>>>       trace_iommufd_backend_get_dirty_bitmap(be->fd,
>hwpt_id,
>>> iova,
>>>>>>> size,
>>>>>>>>
>>>
>-                                           page_size,
>>> ret ? errno :
>>>>>>> 0);
>>>>>>>>
>+                                           flags,
>>> page_size, ret ?
>>>>>>> errno : 0);
>>>>>>>>       if (ret) {
>>>>>>>>           error_setg_errno(errp, errno,
>>>>>>>>                            "IOMMU_HWPT_GET_DIRT
>Y_
>>> BITMAP
>>>>>>> (iova: 0x%"HWADDR_PRIx
>>>>>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>>>>>> index 0057488ce9..c897aa6b17 100644
>>>>>>>> --- a/hw/vfio/iommufd.c
>>>>>>>> +++ b/hw/vfio/iommufd.c
>>>>>>>> @@ -62,7 +62,7 @@ static int iommufd_cdev_unmap_one(const
>>>>>>> VFIOContainerBase *bcontainer,
>>>>>>>>                                     hwaddr iova,
>>> ram_addr_t size,
>>>>>>>>                                     IOMMUTLBE
>ntr
>>> y *iotlb)
>>>>>>>>   {
>>>>>>>> -    const VFIOIOMMUFDContainer *container =
>>>>>>>> +    VFIOIOMMUFDContainer *container =
>>>>>>>>           container_of(bcontainer, VFIOIOMMUFDContainer,
>>>>>>> bcontainer);
>>>>>>>>       bool need_dirty_sync = false;
>>>>>>>>       Error *local_err = NULL;
>>>>>>>> @@ -73,9 +73,12 @@ static int iommufd_cdev_unmap_one(const
>>>>>>> VFIOContainerBase *bcontainer,
>>>>>>>>       if (iotlb &&
>>> vfio_container_dirty_tracking_is_started(bcontainer)) {
>>>>>>>>           if
>>>>>>> (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
>>>>>>>>               bcontainer->dirty_pages_supported) {
>>>>>>>> +            container->dirty_tracking_flags =
>>>>>>>> +
>>>>>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR;
>>>>>>>>               ret =
>>> vfio_container_query_dirty_bitmap(bcontainer, iova,
>>>>>>> size,
>>>>>>>>
>>>>>>> iotlb->translated_addr,
>>>>>>>>
>>>>>>> &local_err);
>>>>>>>> +            container->dirty_tracking_flags = 0;
>>>>>>>
>>>>>>> Why not changing vfio_container_query_dirty_bitmap to pass a flags
>>> too, like
>>>>>>> the
>>>>>>> original patches? This is a little unnecssary odd style to pass a flag via
>>>>>>> container structure rather and then clearing.
>>>>>>
>>>>>> Just want to be simpler, original patch introduced a new parameter to
>>> almost all
>>>>>> variants of *_query_dirty_bitmap() while the flags parameter is only
>used
>>> by
>>>>>> IOMMUFD backend when doing unmap_bitmap. Currently we already
>>> have three
>>>>>> backends, legacy VFIO, IOMMUFD and VFIO-user, only IOMMUFD need
>>> the flag.
>>>>>>
>>>>>> I take container->dirty_tracking_flags as a notification mechanism, so
>set
>>> it
>>>>>> before
>>>>>> vfio_container_query_dirty_bitmap() and clear it thereafter. Maybe
>>> clearing it in
>>>>>> iommufd_query_dirty_bitmap() is easier to be acceptable?
>>>>>>
>>>>>>>
>>>>>>> Part of the reason the original series had a
>VFIO_GET_DIRTY_NO_FLUSH
>>> for
>>>>>>> generic
>>>>>>> container abstraction was to not mix IOMMUFD UAPI specifics into
>base
>>>>>>> container
>>>>>>> API. Then in getting a VFIO_GET_DIRTY_NO_FLUSH, then type1
>backend
>>>>>>> could just
>>>>>>> ignore the flag, while IOMMUFD translates it to
>>>>>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR
>>>>>>
>>>>>> I did port original patch https://github.com/yiliu1765/qemu/
>>>>>> commit/99f83595d79d2e4170c9e456cf1a7b9521bd4f80
>>>>>> But it looks complex to have 'flags' parameter everywhere.
>>>>> I think I would prefer like Joao to avoid caching information if possible
>>>>> but I haven't check closely the mess it would introduce in the code. Let
>>>>> me check.
>>>>>
>>>>
>>>> My recollection was that it wasn't that much churn added as this series is
>>>> already doing the most of the churn. But I am checking how the code
>would
>>> look
>>>> like to properly respond to his suggestion on why he changing it towards
>>>> structuref field.
>>>
>>> The churn should be similar to this patch:
>>>
>>>
>https://github.com/jpemartins/qemu/commit/5e1f11075299a5fa9564a26788
>>> dc9cc1717e297c
>>>
>>> It's mostly an interface parameter addition of flags.
>>
>> Yes, it's a general interface parameter but it's only used by IOMMUFD.
>> That's my only argument.
>>
>
>You could use that argument for anything that's not common and unique to
>each
>backend. type1 dirty tracking is effectively "frozen" in UAPI (IIRC) so I won't
>expect flags. There's device dirty tracking, IOMMUFD dirty tracking and
>vfio-user dirty tracking. While I am not anticipating any new thing here (at
>least in the needs I have ahead, can't speak for other
>companies/contributors);
>but you could see the flags as future proofing it.

Yes, agree it's future proofing.

>
>The ugly part of the alternative is that it sort of limits the interface  to be
>single-user only as nobody call into IOMMUFD with the dirty_tracking_flags if
>it's called concurrently for some other purpose than unmap. We sort of use
>the
>container data structure as a argument storage for a single call as opposed to
>store some intermediary state.

dirty_tracking_flags only take effect during the unmap path except you take it
for other purpose. But yes, I made the change based on the fact that unmap path
is protected by BQL, if the concurrent access is allowed in future, this change will
break.

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH 3/5] vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
  2025-09-23  9:55                 ` Duan, Zhenzhong
@ 2025-09-23 10:06                   ` Duan, Zhenzhong
  0 siblings, 0 replies; 36+ messages in thread
From: Duan, Zhenzhong @ 2025-09-23 10:06 UTC (permalink / raw)
  To: Joao Martins, Cédric Le Goater
  Cc: alex.williamson@redhat.com, mst@redhat.com, jasowang@redhat.com,
	Liu, Yi L, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, avihaih@nvidia.com, Hao, Xudong,
	Cabiddu, Giovanni, Gross, Mark, Van De Ven, Arjan,
	qemu-devel@nongnu.org



>-----Original Message-----
>From: Duan, Zhenzhong
>Subject: RE: [PATCH 3/5] vfio/iommufd: Add
>IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
>>The ugly part of the alternative is that it sort of limits the interface  to be
>>single-user only as nobody call into IOMMUFD with the dirty_tracking_flags
>if
>>it's called concurrently for some other purpose than unmap. We sort of use
>>the
>>container data structure as a argument storage for a single call as opposed
>to
>>store some intermediary state.
>
>dirty_tracking_flags only take effect during the unmap path except you take it
>for other purpose. But yes, I made the change based on the fact that unmap
>path
>is protected by BQL, if the concurrent access is allowed in future, this change
>will
>break.

I found I misunderstood your comment above. Yes, I planned dirty_tracking_flags
only for unmap usage, not for others. I didn't take it as argument but a notification
mechanism only for unmap_all path optimization.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH 3/5] vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
  2025-09-22 16:02       ` Cédric Le Goater
  2025-09-22 16:06         ` Joao Martins
  2025-09-23  2:47         ` Duan, Zhenzhong
@ 2025-10-09 10:20         ` Duan, Zhenzhong
  2025-10-09 12:43           ` Cédric Le Goater
  2 siblings, 1 reply; 36+ messages in thread
From: Duan, Zhenzhong @ 2025-10-09 10:20 UTC (permalink / raw)
  To: Cédric Le Goater, Joao Martins, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, mst@redhat.com, jasowang@redhat.com,
	Liu, Yi L, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, avihaih@nvidia.com, Hao, Xudong,
	Cabiddu, Giovanni, Gross, Mark, Van De Ven, Arjan

Hi Cédric,

>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH 3/5] vfio/iommufd: Add
>IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
>
>On 9/22/25 07:49, Duan, Zhenzhong wrote:
>> Hi Joao,
>>
>>> -----Original Message-----
>>> From: Joao Martins <joao.m.martins@oracle.com>
>>> Subject: Re: [PATCH 3/5] vfio/iommufd: Add
>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
>>>
>>> On 10/09/2025 03:36, Zhenzhong Duan wrote:
>>>> Pass IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR when doing the
>last
>>> dirty
>>>> bitmap query right before unmap, no PTEs flushes. This accelerates the
>>>> query without issue because unmap will tear down the mapping anyway.
>>>>
>>>> Add a new element dirty_tracking_flags in VFIOIOMMUFDContainer to
>>>> be used for the flags of iommufd dirty tracking. Currently it is
>>>> set to either IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR or 0 based
>on
>>>> the scenario.
>>>>
>>>> Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>> Tested-by: Xudong Hao <xudong.hao@intel.com>
>>>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>>>> ---
>>>>   hw/vfio/vfio-iommufd.h   | 1 +
>>>>   include/system/iommufd.h | 2 +-
>>>>   backends/iommufd.c       | 5 +++--
>>>>   hw/vfio/iommufd.c        | 6 +++++-
>>>>   backends/trace-events    | 2 +-
>>>>   5 files changed, 11 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
>>>> index 07ea0f4304..e0af241c75 100644
>>>> --- a/hw/vfio/vfio-iommufd.h
>>>> +++ b/hw/vfio/vfio-iommufd.h
>>>> @@ -26,6 +26,7 @@ typedef struct VFIOIOMMUFDContainer {
>>>>       VFIOContainerBase bcontainer;
>>>>       IOMMUFDBackend *be;
>>>>       uint32_t ioas_id;
>>>> +    uint64_t dirty_tracking_flags;
>>>>       QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>>>   } VFIOIOMMUFDContainer;
>>>>
>>>> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>>>> index c9c72ffc45..63898e7b0d 100644
>>>> --- a/include/system/iommufd.h
>>>> +++ b/include/system/iommufd.h
>>>> @@ -64,7 +64,7 @@ bool
>>> iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be, uint32_t
>>> hwpt_id,
>>>>   bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>> uint32_t hwpt_id,
>>>>                                         uint64_t iova,
>ram_addr_t
>>> size,
>>>>                                         uint64_t page_size,
>>> uint64_t *data,
>>>> -                                      Error **errp);
>>>> +                                      uint64_t flags, Error
>>> **errp);
>>>>   bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be,
>>> uint32_t id,
>>>>                                         uint32_t data_type,
>>> uint32_t entry_len,
>>>>                                         uint32_t *entry_num,
>void
>>> *data,
>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>> index 2a33c7ab0b..3c4f6157e2 100644
>>>> --- a/backends/iommufd.c
>>>> +++ b/backends/iommufd.c
>>>> @@ -361,7 +361,7 @@ bool
>>> iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>>>                                         uint32_t hwpt_id,
>>>>                                         uint64_t iova,
>ram_addr_t
>>> size,
>>>>                                         uint64_t page_size,
>>> uint64_t *data,
>>>> -                                      Error **errp)
>>>> +                                      uint64_t flags, Error
>**errp)
>>>>   {
>>>>       int ret;
>>>>       struct iommu_hwpt_get_dirty_bitmap get_dirty_bitmap = {
>>>> @@ -371,11 +371,12 @@ bool
>>> iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>>>           .length = size,
>>>>           .page_size = page_size,
>>>>           .data = (uintptr_t)data,
>>>> +        .flags = flags,
>>>>       };
>>>>
>>>>       ret = ioctl(be->fd, IOMMU_HWPT_GET_DIRTY_BITMAP,
>>> &get_dirty_bitmap);
>>>>       trace_iommufd_backend_get_dirty_bitmap(be->fd, hwpt_id,
>iova,
>>> size,
>>>> -                                           page_size, ret ?
>errno :
>>> 0);
>>>> +                                           flags, page_size,
>ret ?
>>> errno : 0);
>>>>       if (ret) {
>>>>           error_setg_errno(errp, errno,
>>>>                            "IOMMU_HWPT_GET_DIRTY_BITMAP
>>> (iova: 0x%"HWADDR_PRIx
>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>> index 0057488ce9..c897aa6b17 100644
>>>> --- a/hw/vfio/iommufd.c
>>>> +++ b/hw/vfio/iommufd.c
>>>> @@ -62,7 +62,7 @@ static int iommufd_cdev_unmap_one(const
>>> VFIOContainerBase *bcontainer,
>>>>                                     hwaddr iova, ram_addr_t
>size,
>>>>                                     IOMMUTLBEntry *iotlb)
>>>>   {
>>>> -    const VFIOIOMMUFDContainer *container =
>>>> +    VFIOIOMMUFDContainer *container =
>>>>           container_of(bcontainer, VFIOIOMMUFDContainer,
>>> bcontainer);
>>>>       bool need_dirty_sync = false;
>>>>       Error *local_err = NULL;
>>>> @@ -73,9 +73,12 @@ static int iommufd_cdev_unmap_one(const
>>> VFIOContainerBase *bcontainer,
>>>>       if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer))
>{
>>>>           if
>>> (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
>>>>               bcontainer->dirty_pages_supported) {
>>>> +            container->dirty_tracking_flags =
>>>> +
>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR;
>>>>               ret = vfio_container_query_dirty_bitmap(bcontainer,
>iova,
>>> size,
>>>>
>>> iotlb->translated_addr,
>>>>
>>> &local_err);
>>>> +            container->dirty_tracking_flags = 0;
>>>
>>> Why not changing vfio_container_query_dirty_bitmap to pass a flags too,
>like
>>> the
>>> original patches? This is a little unnecssary odd style to pass a flag via
>>> container structure rather and then clearing.
>>
>> Just want to be simpler, original patch introduced a new parameter to
>almost all
>> variants of *_query_dirty_bitmap() while the flags parameter is only used by
>> IOMMUFD backend when doing unmap_bitmap. Currently we already have
>three
>> backends, legacy VFIO, IOMMUFD and VFIO-user, only IOMMUFD need the
>flag.
>>
>> I take container->dirty_tracking_flags as a notification mechanism, so set it
>before
>> vfio_container_query_dirty_bitmap() and clear it thereafter. Maybe clearing
>it in
>> iommufd_query_dirty_bitmap() is easier to be acceptable?
>>
>>>
>>> Part of the reason the original series had a VFIO_GET_DIRTY_NO_FLUSH
>for
>>> generic
>>> container abstraction was to not mix IOMMUFD UAPI specifics into base
>>> container
>>> API. Then in getting a VFIO_GET_DIRTY_NO_FLUSH, then type1 backend
>>> could just
>>> ignore the flag, while IOMMUFD translates it to
>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR
>>
>> I did port original patch
>https://github.com/yiliu1765/qemu/commit/99f83595d79d2e4170c9e456cf1
>a7b9521bd4f80
>> But it looks complex to have 'flags' parameter everywhere.
>I think I would prefer like Joao to avoid caching information if possible
>but I haven't check closely the mess it would introduce in the code. Let
>me check.

Kindly ping. Just in case you have further comments on this.

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/5] vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
  2025-10-09 10:20         ` Duan, Zhenzhong
@ 2025-10-09 12:43           ` Cédric Le Goater
  2025-10-10  4:09             ` Duan, Zhenzhong
  0 siblings, 1 reply; 36+ messages in thread
From: Cédric Le Goater @ 2025-10-09 12:43 UTC (permalink / raw)
  To: Duan, Zhenzhong, Joao Martins, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, mst@redhat.com, jasowang@redhat.com,
	Liu, Yi L, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, avihaih@nvidia.com, Hao, Xudong,
	Cabiddu, Giovanni, Gross, Mark, Van De Ven, Arjan

Hello Zhenzhong, Joao,

On 10/9/25 12:20, Duan, Zhenzhong wrote:
> Hi Cédric,
> 
>> -----Original Message-----
>> From: Cédric Le Goater <clg@redhat.com>
>> Subject: Re: [PATCH 3/5] vfio/iommufd: Add
>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
>>
>> On 9/22/25 07:49, Duan, Zhenzhong wrote:
>>> Hi Joao,
>>>
>>>> -----Original Message-----
>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>> Subject: Re: [PATCH 3/5] vfio/iommufd: Add
>>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
>>>>
>>>> On 10/09/2025 03:36, Zhenzhong Duan wrote:
>>>>> Pass IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR when doing the
>> last
>>>> dirty
>>>>> bitmap query right before unmap, no PTEs flushes. This accelerates the
>>>>> query without issue because unmap will tear down the mapping anyway.
>>>>>
>>>>> Add a new element dirty_tracking_flags in VFIOIOMMUFDContainer to
>>>>> be used for the flags of iommufd dirty tracking. Currently it is
>>>>> set to either IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR or 0 based
>> on
>>>>> the scenario.
>>>>>
>>>>> Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
>>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>>> Tested-by: Xudong Hao <xudong.hao@intel.com>
>>>>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>>>>> ---
>>>>>    hw/vfio/vfio-iommufd.h   | 1 +
>>>>>    include/system/iommufd.h | 2 +-
>>>>>    backends/iommufd.c       | 5 +++--
>>>>>    hw/vfio/iommufd.c        | 6 +++++-
>>>>>    backends/trace-events    | 2 +-
>>>>>    5 files changed, 11 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/hw/vfio/vfio-iommufd.h b/hw/vfio/vfio-iommufd.h
>>>>> index 07ea0f4304..e0af241c75 100644
>>>>> --- a/hw/vfio/vfio-iommufd.h
>>>>> +++ b/hw/vfio/vfio-iommufd.h
>>>>> @@ -26,6 +26,7 @@ typedef struct VFIOIOMMUFDContainer {
>>>>>        VFIOContainerBase bcontainer;
>>>>>        IOMMUFDBackend *be;
>>>>>        uint32_t ioas_id;
>>>>> +    uint64_t dirty_tracking_flags;
>>>>>        QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>>>>    } VFIOIOMMUFDContainer;
>>>>>
>>>>> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>>>>> index c9c72ffc45..63898e7b0d 100644
>>>>> --- a/include/system/iommufd.h
>>>>> +++ b/include/system/iommufd.h
>>>>> @@ -64,7 +64,7 @@ bool
>>>> iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be, uint32_t
>>>> hwpt_id,
>>>>>    bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>>> uint32_t hwpt_id,
>>>>>                                          uint64_t iova,
>> ram_addr_t
>>>> size,
>>>>>                                          uint64_t page_size,
>>>> uint64_t *data,
>>>>> -                                      Error **errp);
>>>>> +                                      uint64_t flags, Error
>>>> **errp);
>>>>>    bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be,
>>>> uint32_t id,
>>>>>                                          uint32_t data_type,
>>>> uint32_t entry_len,
>>>>>                                          uint32_t *entry_num,
>> void
>>>> *data,
>>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>>> index 2a33c7ab0b..3c4f6157e2 100644
>>>>> --- a/backends/iommufd.c
>>>>> +++ b/backends/iommufd.c
>>>>> @@ -361,7 +361,7 @@ bool
>>>> iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>>>>                                          uint32_t hwpt_id,
>>>>>                                          uint64_t iova,
>> ram_addr_t
>>>> size,
>>>>>                                          uint64_t page_size,
>>>> uint64_t *data,
>>>>> -                                      Error **errp)
>>>>> +                                      uint64_t flags, Error
>> **errp)
>>>>>    {
>>>>>        int ret;
>>>>>        struct iommu_hwpt_get_dirty_bitmap get_dirty_bitmap = {
>>>>> @@ -371,11 +371,12 @@ bool
>>>> iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
>>>>>            .length = size,
>>>>>            .page_size = page_size,
>>>>>            .data = (uintptr_t)data,
>>>>> +        .flags = flags,
>>>>>        };
>>>>>
>>>>>        ret = ioctl(be->fd, IOMMU_HWPT_GET_DIRTY_BITMAP,
>>>> &get_dirty_bitmap);
>>>>>        trace_iommufd_backend_get_dirty_bitmap(be->fd, hwpt_id,
>> iova,
>>>> size,
>>>>> -                                           page_size, ret ?
>> errno :
>>>> 0);
>>>>> +                                           flags, page_size,
>> ret ?
>>>> errno : 0);
>>>>>        if (ret) {
>>>>>            error_setg_errno(errp, errno,
>>>>>                             "IOMMU_HWPT_GET_DIRTY_BITMAP
>>>> (iova: 0x%"HWADDR_PRIx
>>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>>> index 0057488ce9..c897aa6b17 100644
>>>>> --- a/hw/vfio/iommufd.c
>>>>> +++ b/hw/vfio/iommufd.c
>>>>> @@ -62,7 +62,7 @@ static int iommufd_cdev_unmap_one(const
>>>> VFIOContainerBase *bcontainer,
>>>>>                                      hwaddr iova, ram_addr_t
>> size,
>>>>>                                      IOMMUTLBEntry *iotlb)
>>>>>    {
>>>>> -    const VFIOIOMMUFDContainer *container =
>>>>> +    VFIOIOMMUFDContainer *container =
>>>>>            container_of(bcontainer, VFIOIOMMUFDContainer,
>>>> bcontainer);
>>>>>        bool need_dirty_sync = false;
>>>>>        Error *local_err = NULL;
>>>>> @@ -73,9 +73,12 @@ static int iommufd_cdev_unmap_one(const
>>>> VFIOContainerBase *bcontainer,
>>>>>        if (iotlb && vfio_container_dirty_tracking_is_started(bcontainer))
>> {
>>>>>            if
>>>> (!vfio_container_devices_dirty_tracking_is_supported(bcontainer) &&
>>>>>                bcontainer->dirty_pages_supported) {
>>>>> +            container->dirty_tracking_flags =
>>>>> +
>>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR;
>>>>>                ret = vfio_container_query_dirty_bitmap(bcontainer,
>> iova,
>>>> size,
>>>>>
>>>> iotlb->translated_addr,
>>>>>
>>>> &local_err);
>>>>> +            container->dirty_tracking_flags = 0;
>>>>
>>>> Why not changing vfio_container_query_dirty_bitmap to pass a flags too,
>> like
>>>> the
>>>> original patches? This is a little unnecssary odd style to pass a flag via
>>>> container structure rather and then clearing.
>>>
>>> Just want to be simpler, original patch introduced a new parameter to
>> almost all
>>> variants of *_query_dirty_bitmap() while the flags parameter is only used by
>>> IOMMUFD backend when doing unmap_bitmap. Currently we already have
>> three
>>> backends, legacy VFIO, IOMMUFD and VFIO-user, only IOMMUFD need the
>> flag.
>>>
>>> I take container->dirty_tracking_flags as a notification mechanism, so set it
>> before
>>> vfio_container_query_dirty_bitmap() and clear it thereafter. Maybe clearing
>> it in
>>> iommufd_query_dirty_bitmap() is easier to be acceptable?
>>>
>>>>
>>>> Part of the reason the original series had a VFIO_GET_DIRTY_NO_FLUSH
>> for
>>>> generic
>>>> container abstraction was to not mix IOMMUFD UAPI specifics into base
>>>> container
>>>> API. Then in getting a VFIO_GET_DIRTY_NO_FLUSH, then type1 backend
>>>> could just
>>>> ignore the flag, while IOMMUFD translates it to
>>>> IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR
>>>
>>> I did port original patch
>> https://github.com/yiliu1765/qemu/commit/99f83595d79d2e4170c9e456cf1
>> a7b9521bd4f80
>>> But it looks complex to have 'flags' parameter everywhere.
>> I think I would prefer like Joao to avoid caching information if possible
>> but I haven't check closely the mess it would introduce in the code. Let
>> me check.
> 
> Kindly ping. Just in case you have further comments on this.
Regarding the whole series,

* [1/5] vfio/iommufd: Add framework code to support getting dirty bitmap before unmap

   Needs refactoring. iommufd_cdev_unmap_one() seems superfluous now ?

* [2/5] vfio/iommufd: Query dirty bitmap before DMA unmap

   Looks OK.

   In an extra patch, could we rename 'vfio_dma_unmap_bitmap()' to
   'vfio_legacy_dma_unmap_get_dirty_bitmap()' ? Helps (me) remember
   what it does.

* [3/5] vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support

   Sorry, I lost track of the discussion between you and Joao regarding
   the new '->dirty_tracking_flags' attribute.

   I'd prefer adding a 'backend_flag' parameter, even if it’s currently
   only used by IOMMUFD. Since we’re not going to redefine all possible
   IOMMU backend flags, we can treat it as an opaque parameter for
   the upper container layer and document it as such.

   Is that ok for you ? A bit more churn for you but Joao did provide
   some parts.

* [4/5] intel_iommu: Optimize unmap_bitmap during migration

   Ack by Clément may be ?

* [5/5] vfio/migration: Allow live migration with vIOMMU without VFs
   using device dirty tracking

   ok.


Thanks,

C.



^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH 3/5] vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support
  2025-10-09 12:43           ` Cédric Le Goater
@ 2025-10-10  4:09             ` Duan, Zhenzhong
  0 siblings, 0 replies; 36+ messages in thread
From: Duan, Zhenzhong @ 2025-10-10  4:09 UTC (permalink / raw)
  To: Cédric Le Goater, Joao Martins, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, mst@redhat.com, jasowang@redhat.com,
	Liu, Yi L, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, avihaih@nvidia.com, Hao, Xudong,
	Cabiddu, Giovanni, Gross, Mark, Van De Ven, Arjan



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Regarding the whole series,
>
>* [1/5] vfio/iommufd: Add framework code to support getting dirty bitmap
>before unmap
>
>   Needs refactoring. iommufd_cdev_unmap_one() seems superfluous
>now ?

Yes.

>
>* [2/5] vfio/iommufd: Query dirty bitmap before DMA unmap
>
>   Looks OK.
>
>   In an extra patch, could we rename 'vfio_dma_unmap_bitmap()' to
>   'vfio_legacy_dma_unmap_get_dirty_bitmap()' ? Helps (me) remember
>   what it does.

Sure

>
>* [3/5] vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR
>flag support
>
>   Sorry, I lost track of the discussion between you and Joao regarding
>   the new '->dirty_tracking_flags' attribute.
>
>   I'd prefer adding a 'backend_flag' parameter, even if it’s currently
>   only used by IOMMUFD. Since we’re not going to redefine all possible
>   IOMMU backend flags, we can treat it as an opaque parameter for
>   the upper container layer and document it as such.
>
>   Is that ok for you ? A bit more churn for you but Joao did provide
>   some parts.

Sure, will cherry-pick Joao's change.

>
>* [4/5] intel_iommu: Optimize unmap_bitmap during migration
>
>   Ack by Clément may be ?

Yes

Thanks
Zhenzhong

>
>* [5/5] vfio/migration: Allow live migration with vIOMMU without VFs
>   using device dirty tracking
>
>   ok.
>
>
>Thanks,
>
>C.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
  2025-09-10  2:37 ` [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration Zhenzhong Duan
@ 2025-10-12 10:31   ` Yi Liu
  2025-10-13  2:50     ` Duan, Zhenzhong
  0 siblings, 1 reply; 36+ messages in thread
From: Yi Liu @ 2025-10-12 10:31 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, clement.mathieu--drif,
	eric.auger, joao.m.martins, avihaih, xudong.hao, giovanni.cabiddu,
	mark.gross, arjan.van.de.ven

On 2025/9/10 10:37, Zhenzhong Duan wrote:
> If a VFIO device in guest switches from IOMMU domain to block domain,
> vtd_address_space_unmap() is called to unmap whole address space.
> 
> If that happens during migration, migration fails with legacy VFIO
> backend as below:
> 
> Status: failed (vfio_container_dma_unmap(0x561bbbd92d90, 0x100000000000, 0x100000000000) = -7 (Argument list too long))

this should be a giant and busy VM. right? Is a fix tag needed by the way?

> 
> Because legacy VFIO limits maximum bitmap size to 256MB which maps to 8TB on
> 4K page system, when 16TB sized UNMAP notification is sent, unmap_bitmap
> ioctl fails.
> 
> There is no such limitation with iommufd backend, but it's still not optimal
> to allocate large bitmap.
> 
> Optimize it by iterating over DMAMap list to unmap each range with active
> mapping when migration is active. If migration is not active, unmapping the
> whole address space in one go is optimal.
> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
> ---
>   hw/i386/intel_iommu.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 42 insertions(+)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 83c5e44413..6876dae727 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -37,6 +37,7 @@
>   #include "system/system.h"
>   #include "hw/i386/apic_internal.h"
>   #include "kvm/kvm_i386.h"
> +#include "migration/misc.h"
>   #include "migration/vmstate.h"
>   #include "trace.h"
>   
> @@ -4423,6 +4424,42 @@ static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
>       vtd_iommu_unlock(s);
>   }
>   
> +/*
> + * Unmapping a large range in one go is not optimal during migration because
> + * a large dirty bitmap needs to be allocated while there may be only small
> + * mappings, iterate over DMAMap list to unmap each range with active mapping.
> + */
> +static void vtd_address_space_unmap_in_migration(VTDAddressSpace *as,
> +                                                 IOMMUNotifier *n)
> +{
> +    const DMAMap *map;
> +    const DMAMap target = {
> +        .iova = n->start,
> +        .size = n->end,
> +    };
> +    IOVATree *tree = as->iova_tree;
> +
> +    /*
> +     * DMAMap is created during IOMMU page table sync, it's either 4KB or huge
> +     * page size and always a power of 2 in size. So the range of DMAMap could
> +     * be used for UNMAP notification directly.
> +     */
> +    while ((map = iova_tree_find(tree, &target))) {

how about an empty iova_tree? If guest has not mapped anything for the
device, the tree is empty. And it is fine to not unmap anyting. While,
if the device is attached to an identify domain, the iova_tree is empty
as well. Are we sure that we need not to unmap anything here? It looks
the answer is yes. But I'm suspecting the unmap failure will happen in 
the vfio side? If yes, need to consider a complete fix. :)

> +        IOMMUTLBEvent event;
> +
> +        event.type = IOMMU_NOTIFIER_UNMAP;
> +        event.entry.iova = map->iova;
> +        event.entry.addr_mask = map->size;
> +        event.entry.target_as = &address_space_memory;
> +        event.entry.perm = IOMMU_NONE;
> +        /* This field is meaningless for unmap */
> +        event.entry.translated_addr = 0;
> +        memory_region_notify_iommu_one(n, &event);
> +
> +        iova_tree_remove(tree, *map);
> +    }
> +}
> +
>   /* Unmap the whole range in the notifier's scope. */
>   static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>   {
> @@ -4432,6 +4469,11 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>       IntelIOMMUState *s = as->iommu_state;
>       DMAMap map;
>   
> +    if (migration_is_running()) {

If the range is not big enough, it is still better to unmap in one-go.
right? If so, might add a check on the range here to go to the iova_tee
iteration conditionally.

> +        vtd_address_space_unmap_in_migration(as, n);
> +        return;
> +    }
> +
>       /*
>        * Note: all the codes in this function has a assumption that IOVA
>        * bits are no more than VTD_MGAW bits (which is restricted by

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
  2025-10-12 10:31   ` Yi Liu
@ 2025-10-13  2:50     ` Duan, Zhenzhong
  2025-10-13 12:56       ` Yi Liu
  0 siblings, 1 reply; 36+ messages in thread
From: Duan, Zhenzhong @ 2025-10-13  2:50 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, joao.m.martins@oracle.com,
	avihaih@nvidia.com, Hao,  Xudong, Cabiddu, Giovanni, Gross, Mark,
	Van De Ven, Arjan



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>migration
>
>On 2025/9/10 10:37, Zhenzhong Duan wrote:
>> If a VFIO device in guest switches from IOMMU domain to block domain,
>> vtd_address_space_unmap() is called to unmap whole address space.
>>
>> If that happens during migration, migration fails with legacy VFIO
>> backend as below:
>>
>> Status: failed (vfio_container_dma_unmap(0x561bbbd92d90,
>0x100000000000, 0x100000000000) = -7 (Argument list too long))
>
>this should be a giant and busy VM. right? Is a fix tag needed by the way?

VM size is unrelated, it's not a bug, just current code doesn't work well with migration.

When device switches from IOMMU domain to block domain, the whole iommu
memory region is disabled, this trigger the unmap on the whole iommu memory
region, no matter how many or how large the mappings are in the iommu MR.

>
>>
>> Because legacy VFIO limits maximum bitmap size to 256MB which maps to
>8TB on
>> 4K page system, when 16TB sized UNMAP notification is sent,
>unmap_bitmap
>> ioctl fails.
>>
>> There is no such limitation with iommufd backend, but it's still not optimal
>> to allocate large bitmap.
>>
>> Optimize it by iterating over DMAMap list to unmap each range with active
>> mapping when migration is active. If migration is not active, unmapping the
>> whole address space in one go is optimal.
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>> ---
>>   hw/i386/intel_iommu.c | 42
>++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 42 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 83c5e44413..6876dae727 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -37,6 +37,7 @@
>>   #include "system/system.h"
>>   #include "hw/i386/apic_internal.h"
>>   #include "kvm/kvm_i386.h"
>> +#include "migration/misc.h"
>>   #include "migration/vmstate.h"
>>   #include "trace.h"
>>
>> @@ -4423,6 +4424,42 @@ static void
>vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
>>       vtd_iommu_unlock(s);
>>   }
>>
>> +/*
>> + * Unmapping a large range in one go is not optimal during migration
>because
>> + * a large dirty bitmap needs to be allocated while there may be only small
>> + * mappings, iterate over DMAMap list to unmap each range with active
>mapping.
>> + */
>> +static void vtd_address_space_unmap_in_migration(VTDAddressSpace
>*as,
>> +
>IOMMUNotifier *n)
>> +{
>> +    const DMAMap *map;
>> +    const DMAMap target = {
>> +        .iova = n->start,
>> +        .size = n->end,
>> +    };
>> +    IOVATree *tree = as->iova_tree;
>> +
>> +    /*
>> +     * DMAMap is created during IOMMU page table sync, it's either 4KB
>or huge
>> +     * page size and always a power of 2 in size. So the range of
>DMAMap could
>> +     * be used for UNMAP notification directly.
>> +     */
>> +    while ((map = iova_tree_find(tree, &target))) {
>
>how about an empty iova_tree? If guest has not mapped anything for the
>device, the tree is empty. And it is fine to not unmap anyting. While,
>if the device is attached to an identify domain, the iova_tree is empty
>as well. Are we sure that we need not to unmap anything here? It looks
>the answer is yes. But I'm suspecting the unmap failure will happen in
>the vfio side? If yes, need to consider a complete fix. :)

Not get what failure will happen, could you elaborate?
In case of identity domain, IOMMU memory region is disabled, no iommu
notifier will ever be triggered. vfio_listener monitors memory address space,
if any memory region is disabled, vfio_listener will catch it and do dirty tracking.


>
>> +        IOMMUTLBEvent event;
>> +
>> +        event.type = IOMMU_NOTIFIER_UNMAP;
>> +        event.entry.iova = map->iova;
>> +        event.entry.addr_mask = map->size;
>> +        event.entry.target_as = &address_space_memory;
>> +        event.entry.perm = IOMMU_NONE;
>> +        /* This field is meaningless for unmap */
>> +        event.entry.translated_addr = 0;
>> +        memory_region_notify_iommu_one(n, &event);
>> +
>> +        iova_tree_remove(tree, *map);
>> +    }
>> +}
>> +
>>   /* Unmap the whole range in the notifier's scope. */
>>   static void vtd_address_space_unmap(VTDAddressSpace *as,
>IOMMUNotifier *n)
>>   {
>> @@ -4432,6 +4469,11 @@ static void
>vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>>       IntelIOMMUState *s = as->iommu_state;
>>       DMAMap map;
>>
>> +    if (migration_is_running()) {
>
>If the range is not big enough, it is still better to unmap in one-go.
>right? If so, might add a check on the range here to go to the iova_tee
>iteration conditionally.

We don't want to ditry track IOVA holes between IOVA ranges because it's time consuming and useless work. The hole may be large depending on guest behavior.
Meanwhile the time iterating on iova_tree is trivial. So we prefer tracking the exact iova ranges that may be dirty actually.

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
  2025-10-13  2:50     ` Duan, Zhenzhong
@ 2025-10-13 12:56       ` Yi Liu
  2025-10-14  2:31         ` Duan, Zhenzhong
  0 siblings, 1 reply; 36+ messages in thread
From: Yi Liu @ 2025-10-13 12:56 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, joao.m.martins@oracle.com,
	avihaih@nvidia.com, Hao,  Xudong, Cabiddu, Giovanni, Gross, Mark,
	Van De Ven, Arjan

On 2025/10/13 10:50, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>> migration
>>
>> On 2025/9/10 10:37, Zhenzhong Duan wrote:
>>> If a VFIO device in guest switches from IOMMU domain to block domain,
>>> vtd_address_space_unmap() is called to unmap whole address space.
>>>
>>> If that happens during migration, migration fails with legacy VFIO
>>> backend as below:
>>>
>>> Status: failed (vfio_container_dma_unmap(0x561bbbd92d90,
>> 0x100000000000, 0x100000000000) = -7 (Argument list too long))
>>
>> this should be a giant and busy VM. right? Is a fix tag needed by the way?
> 
> VM size is unrelated, it's not a bug, just current code doesn't work well with migration.
> 
> When device switches from IOMMU domain to block domain, the whole iommu
> memory region is disabled, this trigger the unmap on the whole iommu memory
> region,

I got this part.

> no matter how many or how large the mappings are in the iommu MR.

hmmm. A more explicit question: does this error happen with 4G VM memory
as well?

>>
>>>
>>> Because legacy VFIO limits maximum bitmap size to 256MB which maps to
>> 8TB on
>>> 4K page system, when 16TB sized UNMAP notification is sent,
>> unmap_bitmap
>>> ioctl fails.
>>>
>>> There is no such limitation with iommufd backend, but it's still not optimal
>>> to allocate large bitmap.
>>>
>>> Optimize it by iterating over DMAMap list to unmap each range with active
>>> mapping when migration is active. If migration is not active, unmapping the
>>> whole address space in one go is optimal.
>>>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>>> ---
>>>    hw/i386/intel_iommu.c | 42
>> ++++++++++++++++++++++++++++++++++++++++++
>>>    1 file changed, 42 insertions(+)
>>>
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index 83c5e44413..6876dae727 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -37,6 +37,7 @@
>>>    #include "system/system.h"
>>>    #include "hw/i386/apic_internal.h"
>>>    #include "kvm/kvm_i386.h"
>>> +#include "migration/misc.h"
>>>    #include "migration/vmstate.h"
>>>    #include "trace.h"
>>>
>>> @@ -4423,6 +4424,42 @@ static void
>> vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
>>>        vtd_iommu_unlock(s);
>>>    }
>>>
>>> +/*
>>> + * Unmapping a large range in one go is not optimal during migration
>> because
>>> + * a large dirty bitmap needs to be allocated while there may be only small
>>> + * mappings, iterate over DMAMap list to unmap each range with active
>> mapping.
>>> + */
>>> +static void vtd_address_space_unmap_in_migration(VTDAddressSpace
>> *as,
>>> +
>> IOMMUNotifier *n)
>>> +{
>>> +    const DMAMap *map;
>>> +    const DMAMap target = {
>>> +        .iova = n->start,
>>> +        .size = n->end,
>>> +    };
>>> +    IOVATree *tree = as->iova_tree;
>>> +
>>> +    /*
>>> +     * DMAMap is created during IOMMU page table sync, it's either 4KB
>> or huge
>>> +     * page size and always a power of 2 in size. So the range of
>> DMAMap could
>>> +     * be used for UNMAP notification directly.
>>> +     */
>>> +    while ((map = iova_tree_find(tree, &target))) {
>>
>> how about an empty iova_tree? If guest has not mapped anything for the
>> device, the tree is empty. And it is fine to not unmap anyting. While,
>> if the device is attached to an identify domain, the iova_tree is empty
>> as well. Are we sure that we need not to unmap anything here? It looks
>> the answer is yes. But I'm suspecting the unmap failure will happen in
>> the vfio side? If yes, need to consider a complete fix. :)
> 
> Not get what failure will happen, could you elaborate?
> In case of identity domain, IOMMU memory region is disabled, no iommu
> notifier will ever be triggered. vfio_listener monitors memory address space,
> if any memory region is disabled, vfio_listener will catch it and do dirty tracking.

My question comes from the reason why DMA unmap fails. It is due to
a big range is given to kernel while kernel does not support. So if
VFIO gives a big range as well, it should fail as well. And this is
possible when guest (a VM with large size memory) switches from identify
domain to a paging domain. In this case, vfio_listener will unmap all
the system MRs. And it can be a big range if VM size is big enough.

>>
>>> +        IOMMUTLBEvent event;
>>> +
>>> +        event.type = IOMMU_NOTIFIER_UNMAP;
>>> +        event.entry.iova = map->iova;
>>> +        event.entry.addr_mask = map->size;
>>> +        event.entry.target_as = &address_space_memory;
>>> +        event.entry.perm = IOMMU_NONE;
>>> +        /* This field is meaningless for unmap */
>>> +        event.entry.translated_addr = 0;
>>> +        memory_region_notify_iommu_one(n, &event);
>>> +
>>> +        iova_tree_remove(tree, *map);
>>> +    }
>>> +}
>>> +
>>>    /* Unmap the whole range in the notifier's scope. */
>>>    static void vtd_address_space_unmap(VTDAddressSpace *as,
>> IOMMUNotifier *n)
>>>    {
>>> @@ -4432,6 +4469,11 @@ static void
>> vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>>>        IntelIOMMUState *s = as->iommu_state;
>>>        DMAMap map;
>>>
>>> +    if (migration_is_running()) {
>>
>> If the range is not big enough, it is still better to unmap in one-go.
>> right? If so, might add a check on the range here to go to the iova_tee
>> iteration conditionally.
> 
> We don't want to ditry track IOVA holes between IOVA ranges because it's time consuming and useless work. The hole may be large depending on guest behavior.
> Meanwhile the time iterating on iova_tree is trivial. So we prefer tracking the exact iova ranges that may be dirty actually.

I see. So this is the optimization. And it also WA the above DMA
unmap issue as well. right? If so, you may want to call out in the
commit message.

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
  2025-10-13 12:56       ` Yi Liu
@ 2025-10-14  2:31         ` Duan, Zhenzhong
  2025-10-14  6:46           ` Yi Liu
  0 siblings, 1 reply; 36+ messages in thread
From: Duan, Zhenzhong @ 2025-10-14  2:31 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, joao.m.martins@oracle.com,
	avihaih@nvidia.com, Hao,  Xudong, Cabiddu, Giovanni, Gross, Mark,
	Van De Ven, Arjan



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>migration
>
>On 2025/10/13 10:50, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>>> migration
>>>
>>> On 2025/9/10 10:37, Zhenzhong Duan wrote:
>>>> If a VFIO device in guest switches from IOMMU domain to block domain,
>>>> vtd_address_space_unmap() is called to unmap whole address space.
>>>>
>>>> If that happens during migration, migration fails with legacy VFIO
>>>> backend as below:
>>>>
>>>> Status: failed (vfio_container_dma_unmap(0x561bbbd92d90,
>>> 0x100000000000, 0x100000000000) = -7 (Argument list too long))
>>>
>>> this should be a giant and busy VM. right? Is a fix tag needed by the way?
>>
>> VM size is unrelated, it's not a bug, just current code doesn't work well with
>migration.
>>
>> When device switches from IOMMU domain to block domain, the whole
>iommu
>> memory region is disabled, this trigger the unmap on the whole iommu
>memory
>> region,
>
>I got this part.
>
>> no matter how many or how large the mappings are in the iommu MR.
>
>hmmm. A more explicit question: does this error happen with 4G VM memory
>as well?

Coincidently, I remember QAT team reported this issue just with 4G VM memory.

>
>>>
>>>>
>>>> Because legacy VFIO limits maximum bitmap size to 256MB which maps
>to
>>> 8TB on
>>>> 4K page system, when 16TB sized UNMAP notification is sent,
>>> unmap_bitmap
>>>> ioctl fails.
>>>>
>>>> There is no such limitation with iommufd backend, but it's still not optimal
>>>> to allocate large bitmap.
>>>>
>>>> Optimize it by iterating over DMAMap list to unmap each range with
>active
>>>> mapping when migration is active. If migration is not active, unmapping
>the
>>>> whole address space in one go is optimal.
>>>>
>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>>>> ---
>>>>    hw/i386/intel_iommu.c | 42
>>> ++++++++++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 42 insertions(+)
>>>>
>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>> index 83c5e44413..6876dae727 100644
>>>> --- a/hw/i386/intel_iommu.c
>>>> +++ b/hw/i386/intel_iommu.c
>>>> @@ -37,6 +37,7 @@
>>>>    #include "system/system.h"
>>>>    #include "hw/i386/apic_internal.h"
>>>>    #include "kvm/kvm_i386.h"
>>>> +#include "migration/misc.h"
>>>>    #include "migration/vmstate.h"
>>>>    #include "trace.h"
>>>>
>>>> @@ -4423,6 +4424,42 @@ static void
>>> vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
>>>>        vtd_iommu_unlock(s);
>>>>    }
>>>>
>>>> +/*
>>>> + * Unmapping a large range in one go is not optimal during migration
>>> because
>>>> + * a large dirty bitmap needs to be allocated while there may be only
>small
>>>> + * mappings, iterate over DMAMap list to unmap each range with active
>>> mapping.
>>>> + */
>>>> +static void vtd_address_space_unmap_in_migration(VTDAddressSpace
>>> *as,
>>>> +
>>> IOMMUNotifier *n)
>>>> +{
>>>> +    const DMAMap *map;
>>>> +    const DMAMap target = {
>>>> +        .iova = n->start,
>>>> +        .size = n->end,
>>>> +    };
>>>> +    IOVATree *tree = as->iova_tree;
>>>> +
>>>> +    /*
>>>> +     * DMAMap is created during IOMMU page table sync, it's either
>4KB
>>> or huge
>>>> +     * page size and always a power of 2 in size. So the range of
>>> DMAMap could
>>>> +     * be used for UNMAP notification directly.
>>>> +     */
>>>> +    while ((map = iova_tree_find(tree, &target))) {
>>>
>>> how about an empty iova_tree? If guest has not mapped anything for the
>>> device, the tree is empty. And it is fine to not unmap anyting. While,
>>> if the device is attached to an identify domain, the iova_tree is empty
>>> as well. Are we sure that we need not to unmap anything here? It looks
>>> the answer is yes. But I'm suspecting the unmap failure will happen in
>>> the vfio side? If yes, need to consider a complete fix. :)
>>
>> Not get what failure will happen, could you elaborate?
>> In case of identity domain, IOMMU memory region is disabled, no iommu
>> notifier will ever be triggered. vfio_listener monitors memory address
>space,
>> if any memory region is disabled, vfio_listener will catch it and do dirty
>tracking.
>
>My question comes from the reason why DMA unmap fails. It is due to
>a big range is given to kernel while kernel does not support. So if
>VFIO gives a big range as well, it should fail as well. And this is
>possible when guest (a VM with large size memory) switches from identify
>domain to a paging domain. In this case, vfio_listener will unmap all
>the system MRs. And it can be a big range if VM size is big enough.

Got you point. Yes, currently vfio_type1 driver limits unmap_bitmap to 8TB size.
If guest memory is large enough and lead to a memory region of more than 8TB size,
unmap_bitmap will fail. It's a rare case to live migrate VM with more than 8TB memory,
instead of fixing it in qemu with complex change, I'd suggest to bump below MACRO
value to enlarge the limit in kernel, or switch to use iommufd which doesn't have such limit.

/*
 * Input argument of number of bits to bitmap_set() is unsigned integer, which
 * further casts to signed integer for unaligned multi-bit operation,
 * __bitmap_set().
 * Then maximum bitmap size supported is 2^31 bits divided by 2^3 bits/byte,
 * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K page
 * system.
 */
#define DIRTY_BITMAP_PAGES_MAX   ((u64)INT_MAX)
#define DIRTY_BITMAP_SIZE_MAX    DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)

>
>>>
>>>> +        IOMMUTLBEvent event;
>>>> +
>>>> +        event.type = IOMMU_NOTIFIER_UNMAP;
>>>> +        event.entry.iova = map->iova;
>>>> +        event.entry.addr_mask = map->size;
>>>> +        event.entry.target_as = &address_space_memory;
>>>> +        event.entry.perm = IOMMU_NONE;
>>>> +        /* This field is meaningless for unmap */
>>>> +        event.entry.translated_addr = 0;
>>>> +        memory_region_notify_iommu_one(n, &event);
>>>> +
>>>> +        iova_tree_remove(tree, *map);
>>>> +    }
>>>> +}
>>>> +
>>>>    /* Unmap the whole range in the notifier's scope. */
>>>>    static void vtd_address_space_unmap(VTDAddressSpace *as,
>>> IOMMUNotifier *n)
>>>>    {
>>>> @@ -4432,6 +4469,11 @@ static void
>>> vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>>>>        IntelIOMMUState *s = as->iommu_state;
>>>>        DMAMap map;
>>>>
>>>> +    if (migration_is_running()) {
>>>
>>> If the range is not big enough, it is still better to unmap in one-go.
>>> right? If so, might add a check on the range here to go to the iova_tee
>>> iteration conditionally.
>>
>> We don't want to ditry track IOVA holes between IOVA ranges because it's
>time consuming and useless work. The hole may be large depending on guest
>behavior.
>> Meanwhile the time iterating on iova_tree is trivial. So we prefer tracking
>the exact iova ranges that may be dirty actually.
>
>I see. So this is the optimization. And it also WA the above DMA
>unmap issue as well. right? If so, you may want to call out in the
>commit message.

Yes, the main purpose of this patch is to fix the unmap_bitmap issue, then the optimization.
I'll rephrase the description and subject.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
  2025-10-14  2:31         ` Duan, Zhenzhong
@ 2025-10-14  6:46           ` Yi Liu
  2025-10-15  7:48             ` Duan, Zhenzhong
  0 siblings, 1 reply; 36+ messages in thread
From: Yi Liu @ 2025-10-14  6:46 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, joao.m.martins@oracle.com,
	avihaih@nvidia.com, Hao,  Xudong, Cabiddu, Giovanni, Gross, Mark,
	Van De Ven, Arjan

On 2025/10/14 10:31, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>> migration
>>
>> On 2025/10/13 10:50, Duan, Zhenzhong wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>>>> migration
>>>>
>>>> On 2025/9/10 10:37, Zhenzhong Duan wrote:
>>>>> If a VFIO device in guest switches from IOMMU domain to block domain,
>>>>> vtd_address_space_unmap() is called to unmap whole address space.
>>>>>
>>>>> If that happens during migration, migration fails with legacy VFIO
>>>>> backend as below:
>>>>>
>>>>> Status: failed (vfio_container_dma_unmap(0x561bbbd92d90,
>>>> 0x100000000000, 0x100000000000) = -7 (Argument list too long))
>>>>
>>>> this should be a giant and busy VM. right? Is a fix tag needed by the way?
>>>
>>> VM size is unrelated, it's not a bug, just current code doesn't work well with
>> migration.
>>>
>>> When device switches from IOMMU domain to block domain, the whole
>> iommu
>>> memory region is disabled, this trigger the unmap on the whole iommu
>> memory
>>> region,
>>
>> I got this part.
>>
>>> no matter how many or how large the mappings are in the iommu MR.
>>
>> hmmm. A more explicit question: does this error happen with 4G VM memory
>> as well?
> 
> Coincidently, I remember QAT team reported this issue just with 4G VM memory.

ok. this might happen with legacy vIOMMU as guest triggers map/unmap.
It can be a large range. But it's still not clear to me how can guest
map a range more than 4G if VM only has 4G memory.

> 
>>
>>>>
>>>>>
>>>>> Because legacy VFIO limits maximum bitmap size to 256MB which maps
>> to
>>>> 8TB on
>>>>> 4K page system, when 16TB sized UNMAP notification is sent,
>>>> unmap_bitmap
>>>>> ioctl fails.
>>>>>
>>>>> There is no such limitation with iommufd backend, but it's still not optimal
>>>>> to allocate large bitmap.
>>>>>
>>>>> Optimize it by iterating over DMAMap list to unmap each range with
>> active
>>>>> mapping when migration is active. If migration is not active, unmapping
>> the
>>>>> whole address space in one go is optimal.
>>>>>
>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>>>>> ---
>>>>>     hw/i386/intel_iommu.c | 42
>>>> ++++++++++++++++++++++++++++++++++++++++++
>>>>>     1 file changed, 42 insertions(+)
>>>>>
>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>>> index 83c5e44413..6876dae727 100644
>>>>> --- a/hw/i386/intel_iommu.c
>>>>> +++ b/hw/i386/intel_iommu.c
>>>>> @@ -37,6 +37,7 @@
>>>>>     #include "system/system.h"
>>>>>     #include "hw/i386/apic_internal.h"
>>>>>     #include "kvm/kvm_i386.h"
>>>>> +#include "migration/misc.h"
>>>>>     #include "migration/vmstate.h"
>>>>>     #include "trace.h"
>>>>>
>>>>> @@ -4423,6 +4424,42 @@ static void
>>>> vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
>>>>>         vtd_iommu_unlock(s);
>>>>>     }
>>>>>
>>>>> +/*
>>>>> + * Unmapping a large range in one go is not optimal during migration
>>>> because
>>>>> + * a large dirty bitmap needs to be allocated while there may be only
>> small
>>>>> + * mappings, iterate over DMAMap list to unmap each range with active
>>>> mapping.
>>>>> + */
>>>>> +static void vtd_address_space_unmap_in_migration(VTDAddressSpace
>>>> *as,
>>>>> +
>>>> IOMMUNotifier *n)
>>>>> +{
>>>>> +    const DMAMap *map;
>>>>> +    const DMAMap target = {
>>>>> +        .iova = n->start,
>>>>> +        .size = n->end,
>>>>> +    };
>>>>> +    IOVATree *tree = as->iova_tree;
>>>>> +
>>>>> +    /*
>>>>> +     * DMAMap is created during IOMMU page table sync, it's either
>> 4KB
>>>> or huge
>>>>> +     * page size and always a power of 2 in size. So the range of
>>>> DMAMap could
>>>>> +     * be used for UNMAP notification directly.
>>>>> +     */
>>>>> +    while ((map = iova_tree_find(tree, &target))) {
>>>>
>>>> how about an empty iova_tree? If guest has not mapped anything for the
>>>> device, the tree is empty. And it is fine to not unmap anyting. While,
>>>> if the device is attached to an identify domain, the iova_tree is empty
>>>> as well. Are we sure that we need not to unmap anything here? It looks
>>>> the answer is yes. But I'm suspecting the unmap failure will happen in
>>>> the vfio side? If yes, need to consider a complete fix. :)
>>>
>>> Not get what failure will happen, could you elaborate?
>>> In case of identity domain, IOMMU memory region is disabled, no iommu
>>> notifier will ever be triggered. vfio_listener monitors memory address
>> space,
>>> if any memory region is disabled, vfio_listener will catch it and do dirty
>> tracking.
>>
>> My question comes from the reason why DMA unmap fails. It is due to
>> a big range is given to kernel while kernel does not support. So if
>> VFIO gives a big range as well, it should fail as well. And this is
>> possible when guest (a VM with large size memory) switches from identify
>> domain to a paging domain. In this case, vfio_listener will unmap all
>> the system MRs. And it can be a big range if VM size is big enough.
> 
> Got you point. Yes, currently vfio_type1 driver limits unmap_bitmap to 8TB size.
> If guest memory is large enough and lead to a memory region of more than 8TB size,
> unmap_bitmap will fail. It's a rare case to live migrate VM with more than 8TB memory,
> instead of fixing it in qemu with complex change, I'd suggest to bump below MACRO
> value to enlarge the limit in kernel, or switch to use iommufd which doesn't have such limit.

This limit shall not affect the usage of device dirty tracking. right?
If yes, add something to tell user use iommufd backend is better. e.g
if memory size is bigger than the limit of vfio iommu type1's dirty
bitmap limit (query cap_mig.max_dirty_bitmap_size), then fail user if
user wants migration capability.

> /*
>   * Input argument of number of bits to bitmap_set() is unsigned integer, which
>   * further casts to signed integer for unaligned multi-bit operation,
>   * __bitmap_set().
>   * Then maximum bitmap size supported is 2^31 bits divided by 2^3 bits/byte,
>   * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K page
>   * system.
>   */
> #define DIRTY_BITMAP_PAGES_MAX   ((u64)INT_MAX)
> #define DIRTY_BITMAP_SIZE_MAX    DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
> 
>>
>>>>
>>>>> +        IOMMUTLBEvent event;
>>>>> +
>>>>> +        event.type = IOMMU_NOTIFIER_UNMAP;
>>>>> +        event.entry.iova = map->iova;
>>>>> +        event.entry.addr_mask = map->size;
>>>>> +        event.entry.target_as = &address_space_memory;
>>>>> +        event.entry.perm = IOMMU_NONE;
>>>>> +        /* This field is meaningless for unmap */
>>>>> +        event.entry.translated_addr = 0;
>>>>> +        memory_region_notify_iommu_one(n, &event);
>>>>> +
>>>>> +        iova_tree_remove(tree, *map);
>>>>> +    }
>>>>> +}
>>>>> +
>>>>>     /* Unmap the whole range in the notifier's scope. */
>>>>>     static void vtd_address_space_unmap(VTDAddressSpace *as,
>>>> IOMMUNotifier *n)
>>>>>     {
>>>>> @@ -4432,6 +4469,11 @@ static void
>>>> vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>>>>>         IntelIOMMUState *s = as->iommu_state;
>>>>>         DMAMap map;
>>>>>
>>>>> +    if (migration_is_running()) {
>>>>
>>>> If the range is not big enough, it is still better to unmap in one-go.
>>>> right? If so, might add a check on the range here to go to the iova_tee
>>>> iteration conditionally.
>>>
>>> We don't want to ditry track IOVA holes between IOVA ranges because it's
>> time consuming and useless work. The hole may be large depending on guest
>> behavior.
>>> Meanwhile the time iterating on iova_tree is trivial. So we prefer tracking
>> the exact iova ranges that may be dirty actually.
>>
>> I see. So this is the optimization. And it also WA the above DMA
>> unmap issue as well. right? If so, you may want to call out in the
>> commit message.
> 
> Yes, the main purpose of this patch is to fix the unmap_bitmap issue, then the optimization.
> I'll rephrase the description and subject.

yes. The commit message gives me the impression this is bug fix. While
subject is optimization. BTW. perhaps call it as an optimization is
clearer since this smells more like an optimization. For fix, I guess
you may need to consider the vfio_listener as well.

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
  2025-10-14  6:46           ` Yi Liu
@ 2025-10-15  7:48             ` Duan, Zhenzhong
  2025-10-15 12:57               ` Yi Liu
  0 siblings, 1 reply; 36+ messages in thread
From: Duan, Zhenzhong @ 2025-10-15  7:48 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, joao.m.martins@oracle.com,
	avihaih@nvidia.com, Hao,  Xudong, Cabiddu, Giovanni, Gross, Mark,
	Van De Ven, Arjan



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>migration
>
>On 2025/10/14 10:31, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>>> migration
>>>
>>> On 2025/10/13 10:50, Duan, Zhenzhong wrote:
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>>>>> migration
>>>>>
>>>>> On 2025/9/10 10:37, Zhenzhong Duan wrote:
>>>>>> If a VFIO device in guest switches from IOMMU domain to block
>domain,
>>>>>> vtd_address_space_unmap() is called to unmap whole address space.
>>>>>>
>>>>>> If that happens during migration, migration fails with legacy VFIO
>>>>>> backend as below:
>>>>>>
>>>>>> Status: failed (vfio_container_dma_unmap(0x561bbbd92d90,
>>>>> 0x100000000000, 0x100000000000) = -7 (Argument list too long))
>>>>>
>>>>> this should be a giant and busy VM. right? Is a fix tag needed by the
>way?
>>>>
>>>> VM size is unrelated, it's not a bug, just current code doesn't work well
>with
>>> migration.
>>>>
>>>> When device switches from IOMMU domain to block domain, the whole
>>> iommu
>>>> memory region is disabled, this trigger the unmap on the whole iommu
>>> memory
>>>> region,
>>>
>>> I got this part.
>>>
>>>> no matter how many or how large the mappings are in the iommu MR.
>>>
>>> hmmm. A more explicit question: does this error happen with 4G VM
>memory
>>> as well?
>>
>> Coincidently, I remember QAT team reported this issue just with 4G VM
>memory.
>
>ok. this might happen with legacy vIOMMU as guest triggers map/unmap.
>It can be a large range. But it's still not clear to me how can guest
>map a range more than 4G if VM only has 4G memory.

It happens when guest switch from DMA domain to block domain, below sequence is triggered:

vtd_context_device_invalidate
	vtd_address_space_sync
		vtd_address_space_unmap

You can see the whole iommu address space is unmapped, it's unrelated to actual mapping in guest.

>
>>
>>>
>>>>>
>>>>>>
>>>>>> Because legacy VFIO limits maximum bitmap size to 256MB which
>maps
>>> to
>>>>> 8TB on
>>>>>> 4K page system, when 16TB sized UNMAP notification is sent,
>>>>> unmap_bitmap
>>>>>> ioctl fails.
>>>>>>
>>>>>> There is no such limitation with iommufd backend, but it's still not
>optimal
>>>>>> to allocate large bitmap.
>>>>>>
>>>>>> Optimize it by iterating over DMAMap list to unmap each range with
>>> active
>>>>>> mapping when migration is active. If migration is not active,
>unmapping
>>> the
>>>>>> whole address space in one go is optimal.
>>>>>>
>>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>>>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>>>>>> ---
>>>>>>     hw/i386/intel_iommu.c | 42
>>>>> ++++++++++++++++++++++++++++++++++++++++++
>>>>>>     1 file changed, 42 insertions(+)
>>>>>>
>>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>>>> index 83c5e44413..6876dae727 100644
>>>>>> --- a/hw/i386/intel_iommu.c
>>>>>> +++ b/hw/i386/intel_iommu.c
>>>>>> @@ -37,6 +37,7 @@
>>>>>>     #include "system/system.h"
>>>>>>     #include "hw/i386/apic_internal.h"
>>>>>>     #include "kvm/kvm_i386.h"
>>>>>> +#include "migration/misc.h"
>>>>>>     #include "migration/vmstate.h"
>>>>>>     #include "trace.h"
>>>>>>
>>>>>> @@ -4423,6 +4424,42 @@ static void
>>>>> vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
>>>>>>         vtd_iommu_unlock(s);
>>>>>>     }
>>>>>>
>>>>>> +/*
>>>>>> + * Unmapping a large range in one go is not optimal during migration
>>>>> because
>>>>>> + * a large dirty bitmap needs to be allocated while there may be only
>>> small
>>>>>> + * mappings, iterate over DMAMap list to unmap each range with
>active
>>>>> mapping.
>>>>>> + */
>>>>>> +static void
>vtd_address_space_unmap_in_migration(VTDAddressSpace
>>>>> *as,
>>>>>> +
>>>>> IOMMUNotifier *n)
>>>>>> +{
>>>>>> +    const DMAMap *map;
>>>>>> +    const DMAMap target = {
>>>>>> +        .iova = n->start,
>>>>>> +        .size = n->end,
>>>>>> +    };
>>>>>> +    IOVATree *tree = as->iova_tree;
>>>>>> +
>>>>>> +    /*
>>>>>> +     * DMAMap is created during IOMMU page table sync, it's either
>>> 4KB
>>>>> or huge
>>>>>> +     * page size and always a power of 2 in size. So the range of
>>>>> DMAMap could
>>>>>> +     * be used for UNMAP notification directly.
>>>>>> +     */
>>>>>> +    while ((map = iova_tree_find(tree, &target))) {
>>>>>
>>>>> how about an empty iova_tree? If guest has not mapped anything for
>the
>>>>> device, the tree is empty. And it is fine to not unmap anyting. While,
>>>>> if the device is attached to an identify domain, the iova_tree is empty
>>>>> as well. Are we sure that we need not to unmap anything here? It looks
>>>>> the answer is yes. But I'm suspecting the unmap failure will happen in
>>>>> the vfio side? If yes, need to consider a complete fix. :)
>>>>
>>>> Not get what failure will happen, could you elaborate?
>>>> In case of identity domain, IOMMU memory region is disabled, no iommu
>>>> notifier will ever be triggered. vfio_listener monitors memory address
>>> space,
>>>> if any memory region is disabled, vfio_listener will catch it and do dirty
>>> tracking.
>>>
>>> My question comes from the reason why DMA unmap fails. It is due to
>>> a big range is given to kernel while kernel does not support. So if
>>> VFIO gives a big range as well, it should fail as well. And this is
>>> possible when guest (a VM with large size memory) switches from identify
>>> domain to a paging domain. In this case, vfio_listener will unmap all
>>> the system MRs. And it can be a big range if VM size is big enough.
>>
>> Got you point. Yes, currently vfio_type1 driver limits unmap_bitmap to 8TB
>size.
>> If guest memory is large enough and lead to a memory region of more than
>8TB size,
>> unmap_bitmap will fail. It's a rare case to live migrate VM with more than
>8TB memory,
>> instead of fixing it in qemu with complex change, I'd suggest to bump below
>MACRO
>> value to enlarge the limit in kernel, or switch to use iommufd which doesn't
>have such limit.
>
>This limit shall not affect the usage of device dirty tracking. right?
>If yes, add something to tell user use iommufd backend is better. e.g
>if memory size is bigger than the limit of vfio iommu type1's dirty
>bitmap limit (query cap_mig.max_dirty_bitmap_size), then fail user if
>user wants migration capability.

Do you mean just dirty tracking instead of migration, like dirtyrate?
In that case, there is error print as above, I think that's enough as a hint?

I guess you mean to add a migration blocker if limit is reached? It's hard
because the limit is only helpful for identity domain, DMA domain in guest
doesn't have such limit, and we can't know guest's choice of domain type
of each VFIO device attached.

>
>> /*
>>   * Input argument of number of bits to bitmap_set() is unsigned integer,
>which
>>   * further casts to signed integer for unaligned multi-bit operation,
>>   * __bitmap_set().
>>   * Then maximum bitmap size supported is 2^31 bits divided by 2^3
>bits/byte,
>>   * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K
>page
>>   * system.
>>   */
>> #define DIRTY_BITMAP_PAGES_MAX   ((u64)INT_MAX)
>> #define DIRTY_BITMAP_SIZE_MAX
>DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
>>
>>>
>>>>>
>>>>>> +        IOMMUTLBEvent event;
>>>>>> +
>>>>>> +        event.type = IOMMU_NOTIFIER_UNMAP;
>>>>>> +        event.entry.iova = map->iova;
>>>>>> +        event.entry.addr_mask = map->size;
>>>>>> +        event.entry.target_as = &address_space_memory;
>>>>>> +        event.entry.perm = IOMMU_NONE;
>>>>>> +        /* This field is meaningless for unmap */
>>>>>> +        event.entry.translated_addr = 0;
>>>>>> +        memory_region_notify_iommu_one(n, &event);
>>>>>> +
>>>>>> +        iova_tree_remove(tree, *map);
>>>>>> +    }
>>>>>> +}
>>>>>> +
>>>>>>     /* Unmap the whole range in the notifier's scope. */
>>>>>>     static void vtd_address_space_unmap(VTDAddressSpace *as,
>>>>> IOMMUNotifier *n)
>>>>>>     {
>>>>>> @@ -4432,6 +4469,11 @@ static void
>>>>> vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>>>>>>         IntelIOMMUState *s = as->iommu_state;
>>>>>>         DMAMap map;
>>>>>>
>>>>>> +    if (migration_is_running()) {
>>>>>
>>>>> If the range is not big enough, it is still better to unmap in one-go.
>>>>> right? If so, might add a check on the range here to go to the iova_tee
>>>>> iteration conditionally.
>>>>
>>>> We don't want to ditry track IOVA holes between IOVA ranges because
>it's
>>> time consuming and useless work. The hole may be large depending on
>guest
>>> behavior.
>>>> Meanwhile the time iterating on iova_tree is trivial. So we prefer tracking
>>> the exact iova ranges that may be dirty actually.
>>>
>>> I see. So this is the optimization. And it also WA the above DMA
>>> unmap issue as well. right? If so, you may want to call out in the
>>> commit message.
>>
>> Yes, the main purpose of this patch is to fix the unmap_bitmap issue, then
>the optimization.
>> I'll rephrase the description and subject.
>
>yes. The commit message gives me the impression this is bug fix. While
>subject is optimization. BTW. perhaps call it as an optimization is
>clearer since this smells more like an optimization. For fix, I guess
>you may need to consider the vfio_listener as well.

Do you have any idea to fix it with vfio_listener?

My understanding is, it's hard to fix this issue from vfio core, because vfio_listener doesn't
know the mapping details in the guest, only vIOMMU cached them through DMAMap.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
  2025-10-15  7:48             ` Duan, Zhenzhong
@ 2025-10-15 12:57               ` Yi Liu
  2025-10-16  8:48                 ` Duan, Zhenzhong
  0 siblings, 1 reply; 36+ messages in thread
From: Yi Liu @ 2025-10-15 12:57 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, joao.m.martins@oracle.com,
	avihaih@nvidia.com, Hao,  Xudong, Cabiddu, Giovanni, Gross, Mark,
	Van De Ven, Arjan

On 2025/10/15 15:48, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>> migration
>>
>> On 2025/10/14 10:31, Duan, Zhenzhong wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>>>> migration
>>>>
>>>> On 2025/10/13 10:50, Duan, Zhenzhong wrote:
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>>>>>> migration
>>>>>>
>>>>>> On 2025/9/10 10:37, Zhenzhong Duan wrote:
>>>>>>> If a VFIO device in guest switches from IOMMU domain to block
>> domain,
>>>>>>> vtd_address_space_unmap() is called to unmap whole address space.
>>>>>>>
>>>>>>> If that happens during migration, migration fails with legacy VFIO
>>>>>>> backend as below:
>>>>>>>
>>>>>>> Status: failed (vfio_container_dma_unmap(0x561bbbd92d90,
>>>>>> 0x100000000000, 0x100000000000) = -7 (Argument list too long))
>>>>>>
>>>>>> this should be a giant and busy VM. right? Is a fix tag needed by the
>> way?
>>>>>
>>>>> VM size is unrelated, it's not a bug, just current code doesn't work well
>> with
>>>> migration.
>>>>>
>>>>> When device switches from IOMMU domain to block domain, the whole
>>>> iommu
>>>>> memory region is disabled, this trigger the unmap on the whole iommu
>>>> memory
>>>>> region,
>>>>
>>>> I got this part.
>>>>
>>>>> no matter how many or how large the mappings are in the iommu MR.
>>>>
>>>> hmmm. A more explicit question: does this error happen with 4G VM
>> memory
>>>> as well?
>>>
>>> Coincidently, I remember QAT team reported this issue just with 4G VM
>> memory.
>>
>> ok. this might happen with legacy vIOMMU as guest triggers map/unmap.
>> It can be a large range. But it's still not clear to me how can guest
>> map a range more than 4G if VM only has 4G memory.
> 
> It happens when guest switch from DMA domain to block domain, below sequence is triggered:
> 
> vtd_context_device_invalidate
> 	vtd_address_space_sync
> 		vtd_address_space_unmap
> 
> You can see the whole iommu address space is unmapped, it's unrelated to actual mapping in guest.

got it.

>>
>>>
>>>>
>>>>>>
>>>>>>>
>>>>>>> Because legacy VFIO limits maximum bitmap size to 256MB which
>> maps
>>>> to
>>>>>> 8TB on
>>>>>>> 4K page system, when 16TB sized UNMAP notification is sent,
>>>>>> unmap_bitmap
>>>>>>> ioctl fails.
>>>>>>>
>>>>>>> There is no such limitation with iommufd backend, but it's still not
>> optimal
>>>>>>> to allocate large bitmap.
>>>>>>>
>>>>>>> Optimize it by iterating over DMAMap list to unmap each range with
>>>> active
>>>>>>> mapping when migration is active. If migration is not active,
>> unmapping
>>>> the
>>>>>>> whole address space in one go is optimal.
>>>>>>>
>>>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>>>>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>>>>>>> ---
>>>>>>>      hw/i386/intel_iommu.c | 42
>>>>>> ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>      1 file changed, 42 insertions(+)
>>>>>>>
>>>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>>>>> index 83c5e44413..6876dae727 100644
>>>>>>> --- a/hw/i386/intel_iommu.c
>>>>>>> +++ b/hw/i386/intel_iommu.c
>>>>>>> @@ -37,6 +37,7 @@
>>>>>>>      #include "system/system.h"
>>>>>>>      #include "hw/i386/apic_internal.h"
>>>>>>>      #include "kvm/kvm_i386.h"
>>>>>>> +#include "migration/misc.h"
>>>>>>>      #include "migration/vmstate.h"
>>>>>>>      #include "trace.h"
>>>>>>>
>>>>>>> @@ -4423,6 +4424,42 @@ static void
>>>>>> vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
>>>>>>>          vtd_iommu_unlock(s);
>>>>>>>      }
>>>>>>>
>>>>>>> +/*
>>>>>>> + * Unmapping a large range in one go is not optimal during migration
>>>>>> because
>>>>>>> + * a large dirty bitmap needs to be allocated while there may be only
>>>> small
>>>>>>> + * mappings, iterate over DMAMap list to unmap each range with
>> active
>>>>>> mapping.
>>>>>>> + */
>>>>>>> +static void
>> vtd_address_space_unmap_in_migration(VTDAddressSpace
>>>>>> *as,
>>>>>>> +
>>>>>> IOMMUNotifier *n)
>>>>>>> +{
>>>>>>> +    const DMAMap *map;
>>>>>>> +    const DMAMap target = {
>>>>>>> +        .iova = n->start,
>>>>>>> +        .size = n->end,
>>>>>>> +    };
>>>>>>> +    IOVATree *tree = as->iova_tree;
>>>>>>> +
>>>>>>> +    /*
>>>>>>> +     * DMAMap is created during IOMMU page table sync, it's either
>>>> 4KB
>>>>>> or huge
>>>>>>> +     * page size and always a power of 2 in size. So the range of
>>>>>> DMAMap could
>>>>>>> +     * be used for UNMAP notification directly.
>>>>>>> +     */
>>>>>>> +    while ((map = iova_tree_find(tree, &target))) {
>>>>>>
>>>>>> how about an empty iova_tree? If guest has not mapped anything for
>> the
>>>>>> device, the tree is empty. And it is fine to not unmap anyting. While,
>>>>>> if the device is attached to an identify domain, the iova_tree is empty
>>>>>> as well. Are we sure that we need not to unmap anything here? It looks
>>>>>> the answer is yes. But I'm suspecting the unmap failure will happen in
>>>>>> the vfio side? If yes, need to consider a complete fix. :)
>>>>>
>>>>> Not get what failure will happen, could you elaborate?
>>>>> In case of identity domain, IOMMU memory region is disabled, no iommu
>>>>> notifier will ever be triggered. vfio_listener monitors memory address
>>>> space,
>>>>> if any memory region is disabled, vfio_listener will catch it and do dirty
>>>> tracking.
>>>>
>>>> My question comes from the reason why DMA unmap fails. It is due to
>>>> a big range is given to kernel while kernel does not support. So if
>>>> VFIO gives a big range as well, it should fail as well. And this is
>>>> possible when guest (a VM with large size memory) switches from identify
>>>> domain to a paging domain. In this case, vfio_listener will unmap all
>>>> the system MRs. And it can be a big range if VM size is big enough.
>>>
>>> Got you point. Yes, currently vfio_type1 driver limits unmap_bitmap to 8TB
>> size.
>>> If guest memory is large enough and lead to a memory region of more than
>> 8TB size,
>>> unmap_bitmap will fail. It's a rare case to live migrate VM with more than
>> 8TB memory,
>>> instead of fixing it in qemu with complex change, I'd suggest to bump below
>> MACRO
>>> value to enlarge the limit in kernel, or switch to use iommufd which doesn't
>> have such limit.
>>
>> This limit shall not affect the usage of device dirty tracking. right?
>> If yes, add something to tell user use iommufd backend is better. e.g
>> if memory size is bigger than the limit of vfio iommu type1's dirty
>> bitmap limit (query cap_mig.max_dirty_bitmap_size), then fail user if
>> user wants migration capability.
> 
> Do you mean just dirty tracking instead of migration, like dirtyrate?
> In that case, there is error print as above, I think that's enough as a hint?

it's not related to diryrate.

> I guess you mean to add a migration blocker if limit is reached? It's hard
> because the limit is only helpful for identity domain, DMA domain in guest
> doesn't have such limit, and we can't know guest's choice of domain type
> of each VFIO device attached.

I meant a blocker to boot QEMU if there is limit. something like below:

	if (VM memory > 8TB && legacy_container_backend && migration_enabled)
		fail the VM boot.

>>
>>> /*
>>>    * Input argument of number of bits to bitmap_set() is unsigned integer,
>> which
>>>    * further casts to signed integer for unaligned multi-bit operation,
>>>    * __bitmap_set().
>>>    * Then maximum bitmap size supported is 2^31 bits divided by 2^3
>> bits/byte,
>>>    * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K
>> page
>>>    * system.
>>>    */
>>> #define DIRTY_BITMAP_PAGES_MAX   ((u64)INT_MAX)
>>> #define DIRTY_BITMAP_SIZE_MAX
>> DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
>>>
>>>>
>>>>>>
>>>>>>> +        IOMMUTLBEvent event;
>>>>>>> +
>>>>>>> +        event.type = IOMMU_NOTIFIER_UNMAP;
>>>>>>> +        event.entry.iova = map->iova;
>>>>>>> +        event.entry.addr_mask = map->size;
>>>>>>> +        event.entry.target_as = &address_space_memory;
>>>>>>> +        event.entry.perm = IOMMU_NONE;
>>>>>>> +        /* This field is meaningless for unmap */
>>>>>>> +        event.entry.translated_addr = 0;
>>>>>>> +        memory_region_notify_iommu_one(n, &event);
>>>>>>> +
>>>>>>> +        iova_tree_remove(tree, *map);
>>>>>>> +    }
>>>>>>> +}
>>>>>>> +
>>>>>>>      /* Unmap the whole range in the notifier's scope. */
>>>>>>>      static void vtd_address_space_unmap(VTDAddressSpace *as,
>>>>>> IOMMUNotifier *n)
>>>>>>>      {
>>>>>>> @@ -4432,6 +4469,11 @@ static void
>>>>>> vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>>>>>>>          IntelIOMMUState *s = as->iommu_state;
>>>>>>>          DMAMap map;
>>>>>>>
>>>>>>> +    if (migration_is_running()) {
>>>>>>
>>>>>> If the range is not big enough, it is still better to unmap in one-go.
>>>>>> right? If so, might add a check on the range here to go to the iova_tee
>>>>>> iteration conditionally.
>>>>>
>>>>> We don't want to ditry track IOVA holes between IOVA ranges because
>> it's
>>>> time consuming and useless work. The hole may be large depending on
>> guest
>>>> behavior.
>>>>> Meanwhile the time iterating on iova_tree is trivial. So we prefer tracking
>>>> the exact iova ranges that may be dirty actually.
>>>>
>>>> I see. So this is the optimization. And it also WA the above DMA
>>>> unmap issue as well. right? If so, you may want to call out in the
>>>> commit message.
>>>
>>> Yes, the main purpose of this patch is to fix the unmap_bitmap issue, then
>> the optimization.
>>> I'll rephrase the description and subject.
>>
>> yes. The commit message gives me the impression this is bug fix. While
>> subject is optimization. BTW. perhaps call it as an optimization is
>> clearer since this smells more like an optimization. For fix, I guess
>> you may need to consider the vfio_listener as well.
> 
> Do you have any idea to fix it with vfio_listener?

no good idea.

> My understanding is, it's hard to fix this issue from vfio core, because vfio_listener doesn't
> know the mapping details in the guest, only vIOMMU cached them through DMAMap.

yes. that's why I suggest above fix/WA to avoid it.

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
  2025-10-15 12:57               ` Yi Liu
@ 2025-10-16  8:48                 ` Duan, Zhenzhong
  2025-10-16  9:53                   ` Yi Liu
  0 siblings, 1 reply; 36+ messages in thread
From: Duan, Zhenzhong @ 2025-10-16  8:48 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, joao.m.martins@oracle.com,
	avihaih@nvidia.com, Hao,  Xudong, Cabiddu, Giovanni, Gross, Mark,
	Van De Ven, Arjan



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>migration
>
>On 2025/10/15 15:48, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>>> migration
>>>
>>> On 2025/10/14 10:31, Duan, Zhenzhong wrote:
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>>>>> migration
>>>>>
>>>>> On 2025/10/13 10:50, Duan, Zhenzhong wrote:
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>>>> Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap
>during
>>>>>>> migration
>>>>>>>
>>>>>>> On 2025/9/10 10:37, Zhenzhong Duan wrote:
>>>>>>>> If a VFIO device in guest switches from IOMMU domain to block
>>> domain,
>>>>>>>> vtd_address_space_unmap() is called to unmap whole address
>space.
>>>>>>>>
>>>>>>>> If that happens during migration, migration fails with legacy VFIO
>>>>>>>> backend as below:
>>>>>>>>
>>>>>>>> Status: failed (vfio_container_dma_unmap(0x561bbbd92d90,
>>>>>>> 0x100000000000, 0x100000000000) = -7 (Argument list too long))
>>>>>>>
>>>>>>> this should be a giant and busy VM. right? Is a fix tag needed by the
>>> way?
>>>>>>
>>>>>> VM size is unrelated, it's not a bug, just current code doesn't work well
>>> with
>>>>> migration.
>>>>>>
>>>>>> When device switches from IOMMU domain to block domain, the
>whole
>>>>> iommu
>>>>>> memory region is disabled, this trigger the unmap on the whole iommu
>>>>> memory
>>>>>> region,
>>>>>
>>>>> I got this part.
>>>>>
>>>>>> no matter how many or how large the mappings are in the iommu MR.
>>>>>
>>>>> hmmm. A more explicit question: does this error happen with 4G VM
>>> memory
>>>>> as well?
>>>>
>>>> Coincidently, I remember QAT team reported this issue just with 4G VM
>>> memory.
>>>
>>> ok. this might happen with legacy vIOMMU as guest triggers map/unmap.
>>> It can be a large range. But it's still not clear to me how can guest
>>> map a range more than 4G if VM only has 4G memory.
>>
>> It happens when guest switch from DMA domain to block domain, below
>sequence is triggered:
>>
>> vtd_context_device_invalidate
>> 	vtd_address_space_sync
>> 		vtd_address_space_unmap
>>
>> You can see the whole iommu address space is unmapped, it's unrelated to
>actual mapping in guest.
>
>got it.
>
>>>
>>>>
>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Because legacy VFIO limits maximum bitmap size to 256MB which
>>> maps
>>>>> to
>>>>>>> 8TB on
>>>>>>>> 4K page system, when 16TB sized UNMAP notification is sent,
>>>>>>> unmap_bitmap
>>>>>>>> ioctl fails.
>>>>>>>>
>>>>>>>> There is no such limitation with iommufd backend, but it's still not
>>> optimal
>>>>>>>> to allocate large bitmap.
>>>>>>>>
>>>>>>>> Optimize it by iterating over DMAMap list to unmap each range with
>>>>> active
>>>>>>>> mapping when migration is active. If migration is not active,
>>> unmapping
>>>>> the
>>>>>>>> whole address space in one go is optimal.
>>>>>>>>
>>>>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>>>>>> Tested-by: Giovannio Cabiddu <giovanni.cabiddu@intel.com>
>>>>>>>> ---
>>>>>>>>      hw/i386/intel_iommu.c | 42
>>>>>>> ++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>      1 file changed, 42 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>>>>>> index 83c5e44413..6876dae727 100644
>>>>>>>> --- a/hw/i386/intel_iommu.c
>>>>>>>> +++ b/hw/i386/intel_iommu.c
>>>>>>>> @@ -37,6 +37,7 @@
>>>>>>>>      #include "system/system.h"
>>>>>>>>      #include "hw/i386/apic_internal.h"
>>>>>>>>      #include "kvm/kvm_i386.h"
>>>>>>>> +#include "migration/misc.h"
>>>>>>>>      #include "migration/vmstate.h"
>>>>>>>>      #include "trace.h"
>>>>>>>>
>>>>>>>> @@ -4423,6 +4424,42 @@ static void
>>>>>>> vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
>>>>>>>>          vtd_iommu_unlock(s);
>>>>>>>>      }
>>>>>>>>
>>>>>>>> +/*
>>>>>>>> + * Unmapping a large range in one go is not optimal during
>migration
>>>>>>> because
>>>>>>>> + * a large dirty bitmap needs to be allocated while there may be
>only
>>>>> small
>>>>>>>> + * mappings, iterate over DMAMap list to unmap each range with
>>> active
>>>>>>> mapping.
>>>>>>>> + */
>>>>>>>> +static void
>>> vtd_address_space_unmap_in_migration(VTDAddressSpace
>>>>>>> *as,
>>>>>>>> +
>>>>>>> IOMMUNotifier *n)
>>>>>>>> +{
>>>>>>>> +    const DMAMap *map;
>>>>>>>> +    const DMAMap target = {
>>>>>>>> +        .iova = n->start,
>>>>>>>> +        .size = n->end,
>>>>>>>> +    };
>>>>>>>> +    IOVATree *tree = as->iova_tree;
>>>>>>>> +
>>>>>>>> +    /*
>>>>>>>> +     * DMAMap is created during IOMMU page table sync, it's
>either
>>>>> 4KB
>>>>>>> or huge
>>>>>>>> +     * page size and always a power of 2 in size. So the range of
>>>>>>> DMAMap could
>>>>>>>> +     * be used for UNMAP notification directly.
>>>>>>>> +     */
>>>>>>>> +    while ((map = iova_tree_find(tree, &target))) {
>>>>>>>
>>>>>>> how about an empty iova_tree? If guest has not mapped anything for
>>> the
>>>>>>> device, the tree is empty. And it is fine to not unmap anyting. While,
>>>>>>> if the device is attached to an identify domain, the iova_tree is empty
>>>>>>> as well. Are we sure that we need not to unmap anything here? It
>looks
>>>>>>> the answer is yes. But I'm suspecting the unmap failure will happen in
>>>>>>> the vfio side? If yes, need to consider a complete fix. :)
>>>>>>
>>>>>> Not get what failure will happen, could you elaborate?
>>>>>> In case of identity domain, IOMMU memory region is disabled, no
>iommu
>>>>>> notifier will ever be triggered. vfio_listener monitors memory address
>>>>> space,
>>>>>> if any memory region is disabled, vfio_listener will catch it and do dirty
>>>>> tracking.
>>>>>
>>>>> My question comes from the reason why DMA unmap fails. It is due to
>>>>> a big range is given to kernel while kernel does not support. So if
>>>>> VFIO gives a big range as well, it should fail as well. And this is
>>>>> possible when guest (a VM with large size memory) switches from
>identify
>>>>> domain to a paging domain. In this case, vfio_listener will unmap all
>>>>> the system MRs. And it can be a big range if VM size is big enough.
>>>>
>>>> Got you point. Yes, currently vfio_type1 driver limits unmap_bitmap to
>8TB
>>> size.
>>>> If guest memory is large enough and lead to a memory region of more
>than
>>> 8TB size,
>>>> unmap_bitmap will fail. It's a rare case to live migrate VM with more than
>>> 8TB memory,
>>>> instead of fixing it in qemu with complex change, I'd suggest to bump
>below
>>> MACRO
>>>> value to enlarge the limit in kernel, or switch to use iommufd which
>doesn't
>>> have such limit.
>>>
>>> This limit shall not affect the usage of device dirty tracking. right?
>>> If yes, add something to tell user use iommufd backend is better. e.g
>>> if memory size is bigger than the limit of vfio iommu type1's dirty
>>> bitmap limit (query cap_mig.max_dirty_bitmap_size), then fail user if
>>> user wants migration capability.
>>
>> Do you mean just dirty tracking instead of migration, like dirtyrate?
>> In that case, there is error print as above, I think that's enough as a hint?
>
>it's not related to diryrate.
>
>> I guess you mean to add a migration blocker if limit is reached? It's hard
>> because the limit is only helpful for identity domain, DMA domain in guest
>> doesn't have such limit, and we can't know guest's choice of domain type
>> of each VFIO device attached.
>
>I meant a blocker to boot QEMU if there is limit. something like below:
>
>	if (VM memory > 8TB && legacy_container_backend &&
>migration_enabled)
>		fail the VM boot.

OK, will add below to vfio_migration_realize() with an extra patch:

    if (!vbasedev->iommufd && current_machine->ram_size > 8 * TiB) {
        /*
         * The 8TB comes from default kernel and QEMU config, it may be
         * conservative here as VM can use large page or run with vIOMMU
         * so the limitation may be relaxed. But 8TB is already quite
         * large for live migration. One can also switch to use IOMMUFD
         * backend if there is a need to migrate large VM.
         */
        error_setg(&err, "%s: Migration is currently not supported "
                   "with large memory VM with approximately 8TB memory "
                   "due to limitation in VFIO type1 driver", vbasedev->name);
        goto add_blocker;
    }

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
  2025-10-16  8:48                 ` Duan, Zhenzhong
@ 2025-10-16  9:53                   ` Yi Liu
  2025-10-16 16:24                     ` Cédric Le Goater
  0 siblings, 1 reply; 36+ messages in thread
From: Yi Liu @ 2025-10-16  9:53 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, clement.mathieu--drif@eviden.com,
	eric.auger@redhat.com, joao.m.martins@oracle.com,
	avihaih@nvidia.com, Hao,  Xudong, Cabiddu, Giovanni, Gross, Mark,
	Van De Ven, Arjan

On 2025/10/16 16:48, Duan, Zhenzhong wrote:
> 
>>>>>>>> how about an empty iova_tree? If guest has not mapped anything for
>>>> the
>>>>>>>> device, the tree is empty. And it is fine to not unmap anyting. While,
>>>>>>>> if the device is attached to an identify domain, the iova_tree is empty
>>>>>>>> as well. Are we sure that we need not to unmap anything here? It
>> looks
>>>>>>>> the answer is yes. But I'm suspecting the unmap failure will happen in
>>>>>>>> the vfio side? If yes, need to consider a complete fix. :)
>>>>>>>
>>>>>>> Not get what failure will happen, could you elaborate?
>>>>>>> In case of identity domain, IOMMU memory region is disabled, no
>> iommu
>>>>>>> notifier will ever be triggered. vfio_listener monitors memory address
>>>>>> space,
>>>>>>> if any memory region is disabled, vfio_listener will catch it and do dirty
>>>>>> tracking.
>>>>>>
>>>>>> My question comes from the reason why DMA unmap fails. It is due to
>>>>>> a big range is given to kernel while kernel does not support. So if
>>>>>> VFIO gives a big range as well, it should fail as well. And this is
>>>>>> possible when guest (a VM with large size memory) switches from
>> identify
>>>>>> domain to a paging domain. In this case, vfio_listener will unmap all
>>>>>> the system MRs. And it can be a big range if VM size is big enough.
>>>>>
>>>>> Got you point. Yes, currently vfio_type1 driver limits unmap_bitmap to
>> 8TB
>>>> size.
>>>>> If guest memory is large enough and lead to a memory region of more
>> than
>>>> 8TB size,
>>>>> unmap_bitmap will fail. It's a rare case to live migrate VM with more than
>>>> 8TB memory,
>>>>> instead of fixing it in qemu with complex change, I'd suggest to bump
>> below
>>>> MACRO
>>>>> value to enlarge the limit in kernel, or switch to use iommufd which
>> doesn't
>>>> have such limit.
>>>>
>>>> This limit shall not affect the usage of device dirty tracking. right?
>>>> If yes, add something to tell user use iommufd backend is better. e.g
>>>> if memory size is bigger than the limit of vfio iommu type1's dirty
>>>> bitmap limit (query cap_mig.max_dirty_bitmap_size), then fail user if
>>>> user wants migration capability.
>>>
>>> Do you mean just dirty tracking instead of migration, like dirtyrate?
>>> In that case, there is error print as above, I think that's enough as a hint?
>>
>> it's not related to diryrate.
>>
>>> I guess you mean to add a migration blocker if limit is reached? It's hard
>>> because the limit is only helpful for identity domain, DMA domain in guest
>>> doesn't have such limit, and we can't know guest's choice of domain type
>>> of each VFIO device attached.
>>
>> I meant a blocker to boot QEMU if there is limit. something like below:
>>
>> 	if (VM memory > 8TB && legacy_container_backend &&
>> migration_enabled)
>> 		fail the VM boot.
> 
> OK, will add below to vfio_migration_realize() with an extra patch:

yeah, let's see Alex and Cedric's feedback.

>      if (!vbasedev->iommufd && current_machine->ram_size > 8 * TiB) {
>          /*
>           * The 8TB comes from default kernel and QEMU config, it may be
>           * conservative here as VM can use large page or run with vIOMMU
>           * so the limitation may be relaxed. But 8TB is already quite
>           * large for live migration. One can also switch to use IOMMUFD
>           * backend if there is a need to migrate large VM.
>           */

instead of hard code 8TB. May convert cap_mig.max_dirty_bitmap_size to
memory size. :)

>          error_setg(&err, "%s: Migration is currently not supported "
>                     "with large memory VM with approximately 8TB memory "
>                     "due to limitation in VFIO type1 driver", vbasedev->name);
>          goto add_blocker;
>      }
> 
> Thanks
> Zhenzhong


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
  2025-10-16  9:53                   ` Yi Liu
@ 2025-10-16 16:24                     ` Cédric Le Goater
  2025-10-17  2:31                       ` Duan, Zhenzhong
  0 siblings, 1 reply; 36+ messages in thread
From: Cédric Le Goater @ 2025-10-16 16:24 UTC (permalink / raw)
  To: Yi Liu, Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, mst@redhat.com, jasowang@redhat.com,
	clement.mathieu--drif@eviden.com, eric.auger@redhat.com,
	joao.m.martins@oracle.com, avihaih@nvidia.com, Hao, Xudong,
	Cabiddu, Giovanni, Gross, Mark, Van De Ven, Arjan

On 10/16/25 11:53, Yi Liu wrote:
> On 2025/10/16 16:48, Duan, Zhenzhong wrote:
>>
>>>>>>>>> how about an empty iova_tree? If guest has not mapped anything for
>>>>> the
>>>>>>>>> device, the tree is empty. And it is fine to not unmap anyting. While,
>>>>>>>>> if the device is attached to an identify domain, the iova_tree is empty
>>>>>>>>> as well. Are we sure that we need not to unmap anything here? It
>>> looks
>>>>>>>>> the answer is yes. But I'm suspecting the unmap failure will happen in
>>>>>>>>> the vfio side? If yes, need to consider a complete fix. :)
>>>>>>>>
>>>>>>>> Not get what failure will happen, could you elaborate?
>>>>>>>> In case of identity domain, IOMMU memory region is disabled, no
>>> iommu
>>>>>>>> notifier will ever be triggered. vfio_listener monitors memory address
>>>>>>> space,
>>>>>>>> if any memory region is disabled, vfio_listener will catch it and do dirty
>>>>>>> tracking.
>>>>>>>
>>>>>>> My question comes from the reason why DMA unmap fails. It is due to
>>>>>>> a big range is given to kernel while kernel does not support. So if
>>>>>>> VFIO gives a big range as well, it should fail as well. And this is
>>>>>>> possible when guest (a VM with large size memory) switches from
>>> identify
>>>>>>> domain to a paging domain. In this case, vfio_listener will unmap all
>>>>>>> the system MRs. And it can be a big range if VM size is big enough.
>>>>>>
>>>>>> Got you point. Yes, currently vfio_type1 driver limits unmap_bitmap to
>>> 8TB
>>>>> size.
>>>>>> If guest memory is large enough and lead to a memory region of more
>>> than
>>>>> 8TB size,
>>>>>> unmap_bitmap will fail. It's a rare case to live migrate VM with more than
>>>>> 8TB memory,
>>>>>> instead of fixing it in qemu with complex change, I'd suggest to bump
>>> below
>>>>> MACRO
>>>>>> value to enlarge the limit in kernel, or switch to use iommufd which
>>> doesn't
>>>>> have such limit.
>>>>>
>>>>> This limit shall not affect the usage of device dirty tracking. right?
>>>>> If yes, add something to tell user use iommufd backend is better. e.g
>>>>> if memory size is bigger than the limit of vfio iommu type1's dirty
>>>>> bitmap limit (query cap_mig.max_dirty_bitmap_size), then fail user if
>>>>> user wants migration capability.
>>>>
>>>> Do you mean just dirty tracking instead of migration, like dirtyrate?
>>>> In that case, there is error print as above, I think that's enough as a hint?
>>>
>>> it's not related to diryrate.
>>>
>>>> I guess you mean to add a migration blocker if limit is reached? It's hard
>>>> because the limit is only helpful for identity domain, DMA domain in guest
>>>> doesn't have such limit, and we can't know guest's choice of domain type
>>>> of each VFIO device attached.
>>>
>>> I meant a blocker to boot QEMU if there is limit. something like below:
>>>
>>>     if (VM memory > 8TB && legacy_container_backend &&
>>> migration_enabled)
>>>         fail the VM boot.
>>
>> OK, will add below to vfio_migration_realize() with an extra patch:
> 
> yeah, let's see Alex and Cedric's feedback.
> 
>>      if (!vbasedev->iommufd && current_machine->ram_size > 8 * TiB) {
>>          /*
>>           * The 8TB comes from default kernel and QEMU config, it may be
>>           * conservative here as VM can use large page or run with vIOMMU
>>           * so the limitation may be relaxed. But 8TB is already quite
>>           * large for live migration. One can also switch to use IOMMUFD
>>           * backend if there is a need to migrate large VM.
>>           */
> 
> instead of hard code 8TB. May convert cap_mig.max_dirty_bitmap_size to
> memory size. :)
yes. It would reflect better that it's a VFIO dirty tracking limitation.


Zhenzhong,

Soft freeze is w45. I plan to send a PR next week, w43, and I will be out
w44. I will have some (limited) time to address more changes on w45.

Thanks,

C.




^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration
  2025-10-16 16:24                     ` Cédric Le Goater
@ 2025-10-17  2:31                       ` Duan, Zhenzhong
  0 siblings, 0 replies; 36+ messages in thread
From: Duan, Zhenzhong @ 2025-10-17  2:31 UTC (permalink / raw)
  To: Cédric Le Goater, Liu, Yi L, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, mst@redhat.com, jasowang@redhat.com,
	clement.mathieu--drif@eviden.com, eric.auger@redhat.com,
	joao.m.martins@oracle.com, avihaih@nvidia.com, Hao,  Xudong,
	Cabiddu, Giovanni, Gross, Mark, Van De Ven, Arjan



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during
>migration
>
>On 10/16/25 11:53, Yi Liu wrote:
>> On 2025/10/16 16:48, Duan, Zhenzhong wrote:
>>>
>>>>>>>>>> how about an empty iova_tree? If guest has not mapped anything
>for
>>>>>> the
>>>>>>>>>> device, the tree is empty. And it is fine to not unmap anyting.
>While,
>>>>>>>>>> if the device is attached to an identify domain, the iova_tree is
>empty
>>>>>>>>>> as well. Are we sure that we need not to unmap anything here? It
>>>> looks
>>>>>>>>>> the answer is yes. But I'm suspecting the unmap failure will
>happen in
>>>>>>>>>> the vfio side? If yes, need to consider a complete fix. :)
>>>>>>>>>
>>>>>>>>> Not get what failure will happen, could you elaborate?
>>>>>>>>> In case of identity domain, IOMMU memory region is disabled, no
>>>> iommu
>>>>>>>>> notifier will ever be triggered. vfio_listener monitors memory
>address
>>>>>>>> space,
>>>>>>>>> if any memory region is disabled, vfio_listener will catch it and do
>dirty
>>>>>>>> tracking.
>>>>>>>>
>>>>>>>> My question comes from the reason why DMA unmap fails. It is due
>to
>>>>>>>> a big range is given to kernel while kernel does not support. So if
>>>>>>>> VFIO gives a big range as well, it should fail as well. And this is
>>>>>>>> possible when guest (a VM with large size memory) switches from
>>>> identify
>>>>>>>> domain to a paging domain. In this case, vfio_listener will unmap all
>>>>>>>> the system MRs. And it can be a big range if VM size is big enough.
>>>>>>>
>>>>>>> Got you point. Yes, currently vfio_type1 driver limits unmap_bitmap
>to
>>>> 8TB
>>>>>> size.
>>>>>>> If guest memory is large enough and lead to a memory region of
>more
>>>> than
>>>>>> 8TB size,
>>>>>>> unmap_bitmap will fail. It's a rare case to live migrate VM with more
>than
>>>>>> 8TB memory,
>>>>>>> instead of fixing it in qemu with complex change, I'd suggest to bump
>>>> below
>>>>>> MACRO
>>>>>>> value to enlarge the limit in kernel, or switch to use iommufd which
>>>> doesn't
>>>>>> have such limit.
>>>>>>
>>>>>> This limit shall not affect the usage of device dirty tracking. right?
>>>>>> If yes, add something to tell user use iommufd backend is better. e.g
>>>>>> if memory size is bigger than the limit of vfio iommu type1's dirty
>>>>>> bitmap limit (query cap_mig.max_dirty_bitmap_size), then fail user if
>>>>>> user wants migration capability.
>>>>>
>>>>> Do you mean just dirty tracking instead of migration, like dirtyrate?
>>>>> In that case, there is error print as above, I think that's enough as a hint?
>>>>
>>>> it's not related to diryrate.
>>>>
>>>>> I guess you mean to add a migration blocker if limit is reached? It's hard
>>>>> because the limit is only helpful for identity domain, DMA domain in
>guest
>>>>> doesn't have such limit, and we can't know guest's choice of domain
>type
>>>>> of each VFIO device attached.
>>>>
>>>> I meant a blocker to boot QEMU if there is limit. something like below:
>>>>
>>>>     if (VM memory > 8TB && legacy_container_backend &&
>>>> migration_enabled)
>>>>         fail the VM boot.
>>>
>>> OK, will add below to vfio_migration_realize() with an extra patch:
>>
>> yeah, let's see Alex and Cedric's feedback.
>>
>>>      if (!vbasedev->iommufd && current_machine->ram_size > 8 * TiB)
>{
>>>          /*
>>>           * The 8TB comes from default kernel and QEMU config, it
>may be
>>>           * conservative here as VM can use large page or run with
>vIOMMU
>>>           * so the limitation may be relaxed. But 8TB is already quite
>>>           * large for live migration. One can also switch to use
>IOMMUFD
>>>           * backend if there is a need to migrate large VM.
>>>           */
>>
>> instead of hard code 8TB. May convert cap_mig.max_dirty_bitmap_size to
>> memory size. :)
>yes. It would reflect better that it's a VFIO dirty tracking limitation.
>
>
>Zhenzhong,
>
>Soft freeze is w45. I plan to send a PR next week, w43, and I will be out
>w44. I will have some (limited) time to address more changes on w45.

Got it, I'll send a new version soon.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2025-10-17  2:33 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-10  2:36 [PATCH 0/5] vfio: relax the vIOMMU check Zhenzhong Duan
2025-09-10  2:36 ` [PATCH 1/5] vfio/iommufd: Add framework code to support getting dirty bitmap before unmap Zhenzhong Duan
2025-09-19  9:30   ` Cédric Le Goater
2025-09-22  3:17     ` Duan, Zhenzhong
2025-09-22  8:27       ` Cédric Le Goater
2025-09-23  2:45         ` Duan, Zhenzhong
2025-09-23  7:06           ` Cédric Le Goater
2025-09-23  9:50             ` Duan, Zhenzhong
2025-09-10  2:36 ` [PATCH 2/5] vfio/iommufd: Query dirty bitmap before DMA unmap Zhenzhong Duan
2025-09-10  2:36 ` [PATCH 3/5] vfio/iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag support Zhenzhong Duan
2025-09-19 10:27   ` Joao Martins
2025-09-22  5:49     ` Duan, Zhenzhong
2025-09-22 16:02       ` Cédric Le Goater
2025-09-22 16:06         ` Joao Martins
2025-09-22 17:01           ` Joao Martins
2025-09-23  2:50             ` Duan, Zhenzhong
2025-09-23  9:17               ` Joao Martins
2025-09-23  9:55                 ` Duan, Zhenzhong
2025-09-23 10:06                   ` Duan, Zhenzhong
2025-09-23  2:47         ` Duan, Zhenzhong
2025-10-09 10:20         ` Duan, Zhenzhong
2025-10-09 12:43           ` Cédric Le Goater
2025-10-10  4:09             ` Duan, Zhenzhong
2025-09-10  2:37 ` [PATCH 4/5] intel_iommu: Optimize unmap_bitmap during migration Zhenzhong Duan
2025-10-12 10:31   ` Yi Liu
2025-10-13  2:50     ` Duan, Zhenzhong
2025-10-13 12:56       ` Yi Liu
2025-10-14  2:31         ` Duan, Zhenzhong
2025-10-14  6:46           ` Yi Liu
2025-10-15  7:48             ` Duan, Zhenzhong
2025-10-15 12:57               ` Yi Liu
2025-10-16  8:48                 ` Duan, Zhenzhong
2025-10-16  9:53                   ` Yi Liu
2025-10-16 16:24                     ` Cédric Le Goater
2025-10-17  2:31                       ` Duan, Zhenzhong
2025-09-10  2:37 ` [PATCH 5/5] vfio/migration: Allow live migration with vIOMMU without VFs using device dirty tracking Zhenzhong Duan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).