qemu-devel.nongnu.org archive mirror
* [PATCH v2 0/4] Optimize UNMAP call and bug fix
@ 2023-06-01  6:33 Zhenzhong Duan
  2023-06-01  6:33 ` [PATCH v2 1/4] util: Add iova_tree_foreach_range_data Zhenzhong Duan
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Zhenzhong Duan @ 2023-06-01  6:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, peterx, jasowang, pbonzini, richard.henderson, eduardo,
	marcel.apfelbaum, alex.williamson, clg, david, philmd, kwankhede,
	cjia, yi.l.liu, chao.p.peng

Hi All,

This is an extension to the original patch [1], based on the discussion in [2].

1. Fix a potential VFIO migration issue Peter found, and incidentally
   optimize VFIO dirty page sync in intel_iommu.

2. Clarify the definition of replay() to match the existing optimization
   in intel_iommu.

3. Optimize out some unnecessary UNMAP calls.

Tested NIC passthrough: ping/ssh pass.
Tested DSA vdev passthrough: started dmatest, then performed live migration; pass.
I compared live-migration performance before and after the patch; there was
no noticeable difference. This may be because dirty page sync isn't a
significant factor in live migration, and the number of dirty pages in my
test was small, 4 or 5.

[1] https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg05549.html
[2] https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg06708.html

Thanks

Zhenzhong Duan (4):
  util: Add iova_tree_foreach_range_data
  intel_iommu: Fix a potential issue in VFIO dirty page sync
  memory: Document update on replay()
  intel_iommu: Optimize out some unnecessary UNMAP calls

 hw/i386/intel_iommu.c    | 75 +++++++++++++++++++++++++++++-----------
 hw/vfio/common.c         |  2 +-
 include/exec/memory.h    | 19 ++++++++--
 include/qemu/iova-tree.h | 17 +++++++--
 softmmu/memory.c         |  4 +++
 util/iova-tree.c         | 31 +++++++++++++++++
 6 files changed, 122 insertions(+), 26 deletions(-)

-- 
2.34.1



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 1/4] util: Add iova_tree_foreach_range_data
  2023-06-01  6:33 [PATCH v2 0/4] Optimize UNMAP call and bug fix Zhenzhong Duan
@ 2023-06-01  6:33 ` Zhenzhong Duan
  2023-06-05 18:39   ` Peter Xu
  2023-06-01  6:33 ` [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty page sync Zhenzhong Duan
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Zhenzhong Duan @ 2023-06-01  6:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, peterx, jasowang, pbonzini, richard.henderson, eduardo,
	marcel.apfelbaum, alex.williamson, clg, david, philmd, kwankhede,
	cjia, yi.l.liu, chao.p.peng

This function is a variant of iova_tree_foreach() that supports traversing
a range of the tree and invoking the callback with private data.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/qemu/iova-tree.h | 17 +++++++++++++++--
 util/iova-tree.c         | 31 +++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 8528e5c98fbc..8bbf049dc3f7 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -39,6 +39,7 @@ typedef struct DMAMap {
     IOMMUAccessFlags perm;
 } QEMU_PACKED DMAMap;
 typedef gboolean (*iova_tree_iterator)(DMAMap *map);
+typedef gboolean (*iova_tree_iterator_2)(DMAMap *map, gpointer *private);
 
 /**
  * iova_tree_new:
@@ -131,11 +132,23 @@ const DMAMap *iova_tree_find_address(const IOVATree *tree, hwaddr iova);
  * @iterator: the interator for the mappings, return true to stop
  *
  * Iterate over the iova tree.
- *
- * Return: 1 if found any overlap, 0 if not, <0 if error.
  */
 void iova_tree_foreach(IOVATree *tree, iova_tree_iterator iterator);
 
+/**
+ * iova_tree_foreach_range_data:
+ *
+ * @tree: the iova tree to iterate on
+ * @range: the iova range to iterate in
+ * @func: the iterator for the mappings, return true to stop
+ * @private: parameter passed to @func
+ *
+ * Iterate over an iova range in iova tree.
+ */
+void iova_tree_foreach_range_data(IOVATree *tree, DMAMap *range,
+                                  iova_tree_iterator_2 func,
+                                  gpointer *private);
+
 /**
  * iova_tree_alloc_map:
  *
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e47..a3cbd5198410 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -42,6 +42,12 @@ typedef struct IOVATreeFindIOVAArgs {
     const DMAMap *result;
 } IOVATreeFindIOVAArgs;
 
+typedef struct IOVATreeIterator {
+    DMAMap *range;
+    iova_tree_iterator_2 func;
+    gpointer *private;
+} IOVATreeIterator;
+
 /**
  * Iterate args to the next hole
  *
@@ -164,6 +170,31 @@ void iova_tree_foreach(IOVATree *tree, iova_tree_iterator iterator)
     g_tree_foreach(tree->tree, iova_tree_traverse, iterator);
 }
 
+static gboolean iova_tree_traverse_range(gpointer key, gpointer value,
+                                         gpointer data)
+{
+    DMAMap *map = key;
+    IOVATreeIterator *iterator = data;
+    DMAMap *range = iterator->range;
+
+    g_assert(key == value);
+
+    if (iova_tree_compare(map, range, NULL)) {
+        return false;
+    }
+
+    return iterator->func(map, iterator->private);
+}
+
+void iova_tree_foreach_range_data(IOVATree *tree, DMAMap *range,
+                                  iova_tree_iterator_2 func,
+                                  gpointer *private)
+{
+    IOVATreeIterator iterator = {range, func, private};
+
+    g_tree_foreach(tree->tree, iova_tree_traverse_range, &iterator);
+}
+
 void iova_tree_remove(IOVATree *tree, DMAMap map)
 {
     const DMAMap *overlap;
-- 
2.34.1




* [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty page sync
  2023-06-01  6:33 [PATCH v2 0/4] Optimize UNMAP call and bug fix Zhenzhong Duan
  2023-06-01  6:33 ` [PATCH v2 1/4] util: Add iova_tree_foreach_range_data Zhenzhong Duan
@ 2023-06-01  6:33 ` Zhenzhong Duan
  2023-06-05 18:39   ` Peter Xu
  2023-06-01  6:33 ` [PATCH v2 3/4] memory: Document update on replay() Zhenzhong Duan
  2023-06-01  6:33 ` [PATCH v2 4/4] intel_iommu: Optimize out some unnecessary UNMAP calls Zhenzhong Duan
  3 siblings, 1 reply; 13+ messages in thread
From: Zhenzhong Duan @ 2023-06-01  6:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, peterx, jasowang, pbonzini, richard.henderson, eduardo,
	marcel.apfelbaum, alex.williamson, clg, david, philmd, kwankhede,
	cjia, yi.l.liu, chao.p.peng

Peter Xu found a potential issue:

"The other thing is when I am looking at the new code I found that we
actually extended the replay() to be used also in dirty tracking of vfio,
in vfio_sync_dirty_bitmap().  For that maybe it's already broken if
unmap_all() because afaiu log_sync() can be called in migration thread
anytime during DMA so I think it means the device is prone to DMA with the
IOMMU pgtable quickly erased and rebuilt here, which means the DMA could
fail unexpectedly.  Copy Alex, Kirti and Neo."

To eliminate this small window with an empty mapping, remove the call to
unmap_all(). Besides that, introduce a new notifier type,
IOMMU_NOTIFIER_FULL_MAP, to retrieve the full set of mappings, as
intel_iommu only notifies changed mappings while VFIO dirty page sync
needs all of them. Thanks to the current iova tree implementation, we can
pick mappings from the iova tree directly instead of walking the guest
IOMMU page table.

IOMMU_NOTIFIER_MAP is still used to get changed mappings for optimization
purposes. As long as notifications for IOMMU_NOTIFIER_MAP keep the shadow
page table in sync, this is fine.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu.c | 49 +++++++++++++++++++++++++++++++++++--------
 hw/vfio/common.c      |  2 +-
 include/exec/memory.h | 13 ++++++++++++
 softmmu/memory.c      |  4 ++++
 4 files changed, 58 insertions(+), 10 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 94d52f4205d2..061fcded0dfb 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3819,6 +3819,41 @@ static int vtd_replay_hook(IOMMUTLBEvent *event, void *private)
     return 0;
 }
 
+static gboolean vtd_replay_full_map(DMAMap *map, gpointer *private)
+{
+    IOMMUTLBEvent event;
+
+    event.type = IOMMU_NOTIFIER_MAP;
+    event.entry.iova = map->iova;
+    event.entry.addr_mask = map->size;
+    event.entry.target_as = &address_space_memory;
+    event.entry.perm = map->perm;
+    event.entry.translated_addr = map->translated_addr;
+
+    return vtd_replay_hook(&event, private);
+}
+
+/*
+ * This is a fast path to notify the full mappings falling in the scope
+ * of IOMMU notifier. The call site should ensure no iova tree update by
+ * taking the necessary locks (e.g. the BQL).
+ */
+static int vtd_page_walk_full_map_fast_path(IOVATree *iova_tree,
+                                            IOMMUNotifier *n)
+{
+    DMAMap map;
+
+    map.iova = n->start;
+    map.size = n->end - n->start;
+    if (!iova_tree_find(iova_tree, &map)) {
+        return 0;
+    }
+
+    iova_tree_foreach_range_data(iova_tree, &map, vtd_replay_full_map,
+                                 (gpointer *)n);
+    return 0;
+}
+
 static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
 {
     VTDAddressSpace *vtd_as = container_of(iommu_mr, VTDAddressSpace, iommu);
@@ -3826,13 +3861,6 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
     uint8_t bus_n = pci_bus_num(vtd_as->bus);
     VTDContextEntry ce;
 
-    /*
-     * The replay can be triggered by either a invalidation or a newly
-     * created entry. No matter what, we release existing mappings
-     * (it means flushing caches for UNMAP-only registers).
-     */
-    vtd_address_space_unmap(vtd_as, n);
-
     if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
         trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" :
                                   "legacy mode",
@@ -3850,8 +3878,11 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
                 .as = vtd_as,
                 .domain_id = vtd_get_domain_id(s, &ce, vtd_as->pasid),
             };
-
-            vtd_page_walk(s, &ce, 0, ~0ULL, &info, vtd_as->pasid);
+            if (n->notifier_flags & IOMMU_NOTIFIER_FULL_MAP) {
+                vtd_page_walk_full_map_fast_path(vtd_as->iova_tree, n);
+            } else {
+                vtd_page_walk(s, &ce, 0, ~0ULL, &info, vtd_as->pasid);
+            }
         }
     } else {
         trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 78358ede2764..5dae4502b908 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1890,7 +1890,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container,
 
                 iommu_notifier_init(&gdn.n,
                                     vfio_iommu_map_dirty_notify,
-                                    IOMMU_NOTIFIER_MAP,
+                                    IOMMU_NOTIFIER_FULL_MAP,
                                     section->offset_within_region,
                                     int128_get64(llend),
                                     idx);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index c3661b2276c7..eecc3eec6702 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -142,6 +142,10 @@ struct IOMMUTLBEntry {
  *       events (e.g. VFIO). Both notifications must be accurate so that
  *       the shadow page table is fully in sync with the guest view.
  *
+ *       Besides MAP, there is a special use case called FULL_MAP which
+ * requests notification for all the existing mappings (e.g. VFIO
+ *       dirty page sync).
+ *
  *   (2) When the device doesn't need accurate synchronizations of the
  *       vIOMMU page tables, it needs to register only with UNMAP or
  *       DEVIOTLB_UNMAP notifies.
@@ -164,6 +168,8 @@ typedef enum {
     IOMMU_NOTIFIER_MAP = 0x2,
     /* Notify changes on device IOTLB entries */
     IOMMU_NOTIFIER_DEVIOTLB_UNMAP = 0x04,
+    /* Notify all existing entries */
+    IOMMU_NOTIFIER_FULL_MAP = 0x8,
 } IOMMUNotifierFlag;
 
 #define IOMMU_NOTIFIER_IOTLB_EVENTS (IOMMU_NOTIFIER_MAP | IOMMU_NOTIFIER_UNMAP)
@@ -237,6 +243,13 @@ static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
                                        hwaddr start, hwaddr end,
                                        int iommu_idx)
 {
+    /*
+     * memory_region_notify_iommu_one() needs IOMMU_NOTIFIER_MAP set to
+     * trigger notifier.
+     */
+    if (flags & IOMMU_NOTIFIER_FULL_MAP) {
+        flags |= IOMMU_NOTIFIER_MAP;
+    }
     n->notify = fn;
     n->notifier_flags = flags;
     n->start = start;
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 7d9494ce7028..0a8465007c66 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1922,6 +1922,10 @@ int memory_region_register_iommu_notifier(MemoryRegion *mr,
     assert(n->iommu_idx >= 0 &&
            n->iommu_idx < memory_region_iommu_num_indexes(iommu_mr));
 
+    if (n->notifier_flags & IOMMU_NOTIFIER_FULL_MAP) {
+        error_setg(errp, "FULL_MAP could only be used in replay");
+    }
+
     QLIST_INSERT_HEAD(&iommu_mr->iommu_notify, n, node);
     ret = memory_region_update_iommu_notify_flags(iommu_mr, errp);
     if (ret) {
-- 
2.34.1




* [PATCH v2 3/4] memory: Document update on replay()
  2023-06-01  6:33 [PATCH v2 0/4] Optimize UNMAP call and bug fix Zhenzhong Duan
  2023-06-01  6:33 ` [PATCH v2 1/4] util: Add iova_tree_foreach_range_data Zhenzhong Duan
  2023-06-01  6:33 ` [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty page sync Zhenzhong Duan
@ 2023-06-01  6:33 ` Zhenzhong Duan
  2023-06-05 18:42   ` Peter Xu
  2023-06-01  6:33 ` [PATCH v2 4/4] intel_iommu: Optimize out some unnecessary UNMAP calls Zhenzhong Duan
  3 siblings, 1 reply; 13+ messages in thread
From: Zhenzhong Duan @ 2023-06-01  6:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, peterx, jasowang, pbonzini, richard.henderson, eduardo,
	marcel.apfelbaum, alex.williamson, clg, david, philmd, kwankhede,
	cjia, yi.l.liu, chao.p.peng

Currently the replay() callback is declared to have exactly the same
semantics as memory_region_iommu_replay().

A customized replay() may provide some degree of optimization depending on
the notifier's type. E.g. in intel_iommu, IOMMU_NOTIFIER_MAP is optimized
to notify only changed entries.

Clarify the semantics of replay() to provide more flexibility.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/exec/memory.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index eecc3eec6702..9a523ef62a94 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -441,9 +441,9 @@ struct IOMMUMemoryRegionClass {
      * call the IOMMU translate method for every page in the address space
      * with flag == IOMMU_NONE and then call the notifier if translate
      * returns a valid mapping. If this method is implemented then it
-     * overrides the default behaviour, and must provide the full semantics
-     * of memory_region_iommu_replay(), by calling @notifier for every
-     * translation present in the IOMMU.
+     * overrides the default behavior, and must provide the corresponding
+     * semantics depending on the notifier's type: e.g. for IOMMU_NOTIFIER_MAP,
+     * notify changed entries; for IOMMU_NOTIFIER_FULL_MAP, notify all entries.
      *
      * Optional method -- an IOMMU only needs to provide this method
      * if the default is inefficient or produces undesirable side effects.
-- 
2.34.1




* [PATCH v2 4/4] intel_iommu: Optimize out some unnecessary UNMAP calls
  2023-06-01  6:33 [PATCH v2 0/4] Optimize UNMAP call and bug fix Zhenzhong Duan
                   ` (2 preceding siblings ...)
  2023-06-01  6:33 ` [PATCH v2 3/4] memory: Document update on replay() Zhenzhong Duan
@ 2023-06-01  6:33 ` Zhenzhong Duan
  3 siblings, 0 replies; 13+ messages in thread
From: Zhenzhong Duan @ 2023-06-01  6:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, peterx, jasowang, pbonzini, richard.henderson, eduardo,
	marcel.apfelbaum, alex.williamson, clg, david, philmd, kwankhede,
	cjia, yi.l.liu, chao.p.peng

Commit 63b88968f1 ("intel-iommu: rework the page walk logic") added an IOVA
tree to cache mapped ranges so we only need to send MAP or UNMAP when
there are changes. But there is still a corner case of unnecessary UNMAPs.

During invalidation, either domain- or device-selective, we only need to
unmap when there are recorded mapped IOVA ranges, presuming most OSes
allocate IOVA ranges contiguously, e.g. on x86, Linux sets up mappings
starting from 0xffffffff and growing downwards.

Strace shows each UNMAP ioctl taking 0.000014s, and we have 28 such
ioctl() calls in one invalidation, as the two notifier ranges on x86 are
split into power-of-2 pieces.

ioctl(48, VFIO_IOMMU_UNMAP_DMA, 0x7ffffd5c42f0) = 0 <0.000014>

The other purpose of this patch is to eliminate a noisy error log when we
work with IOMMUFD. It appears that duplicate UNMAP calls fail with IOMMUFD
while they always succeed with the legacy container. This behavioral
difference leads to the error log below for IOMMUFD:

IOMMU_IOAS_UNMAP failed: No such file or directory
vfio_container_dma_unmap(0x562012d6b6d0, 0x0, 0x80000000) = -2 (No such file or directory)
IOMMU_IOAS_UNMAP failed: No such file or directory
vfio_container_dma_unmap(0x562012d6b6d0, 0x80000000, 0x40000000) = -2 (No such file or directory)
...

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu.c | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 061fcded0dfb..a5fd144aa246 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3743,6 +3743,7 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
     hwaddr start = n->start;
     hwaddr end = n->end;
     IntelIOMMUState *s = as->iommu_state;
+    IOMMUTLBEvent event;
     DMAMap map;
 
     /*
@@ -3762,22 +3763,25 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
     assert(start <= end);
     size = remain = end - start + 1;
 
+    event.type = IOMMU_NOTIFIER_UNMAP;
+    event.entry.target_as = &address_space_memory;
+    event.entry.perm = IOMMU_NONE;
+    /* This field is meaningless for unmap */
+    event.entry.translated_addr = 0;
+
     while (remain >= VTD_PAGE_SIZE) {
-        IOMMUTLBEvent event;
         uint64_t mask = dma_aligned_pow2_mask(start, end, s->aw_bits);
         uint64_t size = mask + 1;
 
         assert(size);
 
-        event.type = IOMMU_NOTIFIER_UNMAP;
-        event.entry.iova = start;
-        event.entry.addr_mask = mask;
-        event.entry.target_as = &address_space_memory;
-        event.entry.perm = IOMMU_NONE;
-        /* This field is meaningless for unmap */
-        event.entry.translated_addr = 0;
-
-        memory_region_notify_iommu_one(n, &event);
+        map.iova = start;
+        map.size = mask;
+        if (iova_tree_find(as->iova_tree, &map)) {
+            event.entry.iova = start;
+            event.entry.addr_mask = mask;
+            memory_region_notify_iommu_one(n, &event);
+        }
 
         start += size;
         remain -= size;
@@ -3791,7 +3795,7 @@ static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
                              n->start, size);
 
     map.iova = n->start;
-    map.size = size;
+    map.size = size - 1; /* Inclusive */
     iova_tree_remove(as->iova_tree, map);
 }
 
-- 
2.34.1




* Re: [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty page sync
  2023-06-01  6:33 ` [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty page sync Zhenzhong Duan
@ 2023-06-05 18:39   ` Peter Xu
  2023-06-06  2:35     ` Duan, Zhenzhong
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Xu @ 2023-06-05 18:39 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, mst, jasowang, pbonzini, richard.henderson, eduardo,
	marcel.apfelbaum, alex.williamson, clg, david, philmd, kwankhede,
	cjia, yi.l.liu, chao.p.peng

On Thu, Jun 01, 2023 at 02:33:18PM +0800, Zhenzhong Duan wrote:
> Peter Xu found a potential issue:
> 
> "The other thing is when I am looking at the new code I found that we
> actually extended the replay() to be used also in dirty tracking of vfio,
> in vfio_sync_dirty_bitmap().  For that maybe it's already broken if
> unmap_all() because afaiu log_sync() can be called in migration thread
> anytime during DMA so I think it means the device is prone to DMA with the
> IOMMU pgtable quickly erased and rebuilt here, which means the DMA could
> fail unexpectedly.  Copy Alex, Kirti and Neo."
> 
> To eliminate this small window with empty mapping, we should remove the
> call to unmap_all(). Besides that, introduce a new notifier type called
> IOMMU_NOTIFIER_FULL_MAP to get full mappings as intel_iommu only notifies
> changed mappings while VFIO dirty page sync needs full mappings. Thanks
> to current implementation of iova tree, we could pick mappings from iova
> trees directly instead of walking through guest IOMMU page table.
> 
> IOMMU_NOTIFIER_MAP is still used to get changed mappings for optimization
> purpose. As long as notification for IOMMU_NOTIFIER_MAP could ensure shadow
> page table in sync, then it's OK.
> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/i386/intel_iommu.c | 49 +++++++++++++++++++++++++++++++++++--------
>  hw/vfio/common.c      |  2 +-
>  include/exec/memory.h | 13 ++++++++++++
>  softmmu/memory.c      |  4 ++++
>  4 files changed, 58 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 94d52f4205d2..061fcded0dfb 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -3819,6 +3819,41 @@ static int vtd_replay_hook(IOMMUTLBEvent *event, void *private)
>      return 0;
>  }
>  
> +static gboolean vtd_replay_full_map(DMAMap *map, gpointer *private)
> +{
> +    IOMMUTLBEvent event;
> +
> +    event.type = IOMMU_NOTIFIER_MAP;
> +    event.entry.iova = map->iova;
> +    event.entry.addr_mask = map->size;
> +    event.entry.target_as = &address_space_memory;
> +    event.entry.perm = map->perm;
> +    event.entry.translated_addr = map->translated_addr;
> +
> +    return vtd_replay_hook(&event, private);
> +}
> +
> +/*
> + * This is a fast path to notify the full mappings falling in the scope
> + * of IOMMU notifier. The call site should ensure no iova tree update by
> + * taking necessary locks(e.x. BQL).

We should be accurate on the locking - I think it's the BQL so far.

> + */
> +static int vtd_page_walk_full_map_fast_path(IOVATree *iova_tree,
> +                                            IOMMUNotifier *n)
> +{
> +    DMAMap map;
> +
> +    map.iova = n->start;
> +    map.size = n->end - n->start;
> +    if (!iova_tree_find(iova_tree, &map)) {
> +        return 0;
> +    }
> +
> +    iova_tree_foreach_range_data(iova_tree, &map, vtd_replay_full_map,
> +                                 (gpointer *)n);
> +    return 0;
> +}
> +
>  static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>  {
>      VTDAddressSpace *vtd_as = container_of(iommu_mr, VTDAddressSpace, iommu);
> @@ -3826,13 +3861,6 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>      uint8_t bus_n = pci_bus_num(vtd_as->bus);
>      VTDContextEntry ce;
>  
> -    /*
> -     * The replay can be triggered by either a invalidation or a newly
> -     * created entry. No matter what, we release existing mappings
> -     * (it means flushing caches for UNMAP-only registers).
> -     */
> -    vtd_address_space_unmap(vtd_as, n);
> -
>      if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
>          trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" :
>                                    "legacy mode",
> @@ -3850,8 +3878,11 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>                  .as = vtd_as,
>                  .domain_id = vtd_get_domain_id(s, &ce, vtd_as->pasid),
>              };
> -
> -            vtd_page_walk(s, &ce, 0, ~0ULL, &info, vtd_as->pasid);
> +            if (n->notifier_flags & IOMMU_NOTIFIER_FULL_MAP) {
> +                vtd_page_walk_full_map_fast_path(vtd_as->iova_tree, n);
> +            } else {
> +                vtd_page_walk(s, &ce, 0, ~0ULL, &info, vtd_as->pasid);
> +            }
>          }
>      } else {
>          trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 78358ede2764..5dae4502b908 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1890,7 +1890,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainer *container,
>  
>                  iommu_notifier_init(&gdn.n,
>                                      vfio_iommu_map_dirty_notify,
> -                                    IOMMU_NOTIFIER_MAP,
> +                                    IOMMU_NOTIFIER_FULL_MAP,
>                                      section->offset_within_region,
>                                      int128_get64(llend),
>                                      idx);
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index c3661b2276c7..eecc3eec6702 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -142,6 +142,10 @@ struct IOMMUTLBEntry {
>   *       events (e.g. VFIO). Both notifications must be accurate so that
>   *       the shadow page table is fully in sync with the guest view.
>   *
> + *       Besides MAP, there is a special use case called FULL_MAP which
> + *       requests notification for all the existent mappings (e.g. VFIO
> + *       dirty page sync).

Why do we need FULL_MAP?  Can we simply reimpl MAP?

> + *
>   *   (2) When the device doesn't need accurate synchronizations of the
>   *       vIOMMU page tables, it needs to register only with UNMAP or
>   *       DEVIOTLB_UNMAP notifies.
> @@ -164,6 +168,8 @@ typedef enum {
>      IOMMU_NOTIFIER_MAP = 0x2,
>      /* Notify changes on device IOTLB entries */
>      IOMMU_NOTIFIER_DEVIOTLB_UNMAP = 0x04,
> +    /* Notify every existent entries */
> +    IOMMU_NOTIFIER_FULL_MAP = 0x8,
>  } IOMMUNotifierFlag;
>  
>  #define IOMMU_NOTIFIER_IOTLB_EVENTS (IOMMU_NOTIFIER_MAP | IOMMU_NOTIFIER_UNMAP)
> @@ -237,6 +243,13 @@ static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
>                                         hwaddr start, hwaddr end,
>                                         int iommu_idx)
>  {
> +    /*
> +     * memory_region_notify_iommu_one() needs IOMMU_NOTIFIER_MAP set to
> +     * trigger notifier.
> +     */
> +    if (flags & IOMMU_NOTIFIER_FULL_MAP) {
> +        flags |= IOMMU_NOTIFIER_MAP;
> +    }
>      n->notify = fn;
>      n->notifier_flags = flags;
>      n->start = start;
> diff --git a/softmmu/memory.c b/softmmu/memory.c
> index 7d9494ce7028..0a8465007c66 100644
> --- a/softmmu/memory.c
> +++ b/softmmu/memory.c
> @@ -1922,6 +1922,10 @@ int memory_region_register_iommu_notifier(MemoryRegion *mr,
>      assert(n->iommu_idx >= 0 &&
>             n->iommu_idx < memory_region_iommu_num_indexes(iommu_mr));
>  
> +    if (n->notifier_flags & IOMMU_NOTIFIER_FULL_MAP) {
> +        error_setg(errp, "FULL_MAP could only be used in replay");
> +    }
> +
>      QLIST_INSERT_HEAD(&iommu_mr->iommu_notify, n, node);
>      ret = memory_region_update_iommu_notify_flags(iommu_mr, errp);
>      if (ret) {
> -- 
> 2.34.1
> 

-- 
Peter Xu




* Re: [PATCH v2 1/4] util: Add iova_tree_foreach_range_data
  2023-06-01  6:33 ` [PATCH v2 1/4] util: Add iova_tree_foreach_range_data Zhenzhong Duan
@ 2023-06-05 18:39   ` Peter Xu
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Xu @ 2023-06-05 18:39 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, mst, jasowang, pbonzini, richard.henderson, eduardo,
	marcel.apfelbaum, alex.williamson, clg, david, philmd, kwankhede,
	cjia, yi.l.liu, chao.p.peng

On Thu, Jun 01, 2023 at 02:33:17PM +0800, Zhenzhong Duan wrote:
> This function is a variant of iova_tree_foreach and support tranversing
> a range to trigger callback with a private data.
> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu




* Re: [PATCH v2 3/4] memory: Document update on replay()
  2023-06-01  6:33 ` [PATCH v2 3/4] memory: Document update on replay() Zhenzhong Duan
@ 2023-06-05 18:42   ` Peter Xu
  2023-06-06  2:48     ` Duan, Zhenzhong
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Xu @ 2023-06-05 18:42 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, mst, jasowang, pbonzini, richard.henderson, eduardo,
	marcel.apfelbaum, alex.williamson, clg, david, philmd, kwankhede,
	cjia, yi.l.liu, chao.p.peng

On Thu, Jun 01, 2023 at 02:33:19PM +0800, Zhenzhong Duan wrote:
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index eecc3eec6702..9a523ef62a94 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -441,9 +441,9 @@ struct IOMMUMemoryRegionClass {
>       * call the IOMMU translate method for every page in the address space
>       * with flag == IOMMU_NONE and then call the notifier if translate
>       * returns a valid mapping. If this method is implemented then it
> -     * overrides the default behaviour, and must provide the full semantics
> -     * of memory_region_iommu_replay(), by calling @notifier for every
> -     * translation present in the IOMMU.
> +     * overrides the default behavior, and must provide corresponding
> +     * semantics depending on notifier's type, e.g. IOMMU_NOTIFIER_MAP,
> +     * notify changed entries; IOMMU_NOTIFIER_FULL_MAP, notify full entries.

IIUC it was always trying to notify all existing entries only, rather than
changed entries.  VT-d used to unmap all so it was also true.

-- 
Peter Xu




* RE: [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty page sync
  2023-06-05 18:39   ` Peter Xu
@ 2023-06-06  2:35     ` Duan, Zhenzhong
  2023-06-06 15:41       ` Peter Xu
  0 siblings, 1 reply; 13+ messages in thread
From: Duan, Zhenzhong @ 2023-06-06  2:35 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel@nongnu.org, mst@redhat.com, jasowang@redhat.com,
	pbonzini@redhat.com, richard.henderson@linaro.org,
	eduardo@habkost.net, marcel.apfelbaum@gmail.com,
	alex.williamson@redhat.com, clg@redhat.com, david@redhat.com,
	philmd@linaro.org, kwankhede@nvidia.com, cjia@nvidia.com,
	Liu, Yi L, Peng,  Chao P

>-----Original Message-----
>From: Peter Xu <peterx@redhat.com>
>Sent: Tuesday, June 6, 2023 2:39 AM
>To: Duan, Zhenzhong <zhenzhong.duan@intel.com>
>Cc: qemu-devel@nongnu.org; mst@redhat.com; jasowang@redhat.com;
>pbonzini@redhat.com; richard.henderson@linaro.org; eduardo@habkost.net;
>marcel.apfelbaum@gmail.com; alex.williamson@redhat.com;
>clg@redhat.com; david@redhat.com; philmd@linaro.org;
>kwankhede@nvidia.com; cjia@nvidia.com; Liu, Yi L <yi.l.liu@intel.com>; Peng,
>Chao P <chao.p.peng@intel.com>
>Subject: Re: [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty
>page sync
>
>On Thu, Jun 01, 2023 at 02:33:18PM +0800, Zhenzhong Duan wrote:
>> Peter Xu found a potential issue:
>>
>> "The other thing is when I am looking at the new code I found that we
>> actually extended the replay() to be used also in dirty tracking of vfio,
>> in vfio_sync_dirty_bitmap().  For that maybe it's already broken if
>> unmap_all() because afaiu log_sync() can be called in migration thread
>> anytime during DMA so I think it means the device is prone to DMA with the
>> IOMMU pgtable quickly erased and rebuilt here, which means the DMA
>could
>> fail unexpectedly.  Copy Alex, Kirti and Neo."
>>
>> To eliminate this small window with empty mapping, we should remove the
>> call to unmap_all(). Besides that, introduce a new notifier type called
>> IOMMU_NOTIFIER_FULL_MAP to get full mappings as intel_iommu only
>notifies
>> changed mappings while VFIO dirty page sync needs full mappings. Thanks
>> to current implementation of iova tree, we could pick mappings from iova
>> trees directly instead of walking through guest IOMMU page table.
>>
>> IOMMU_NOTIFIER_MAP is still used to get changed mappings for
>optimization
>> purpose. As long as notification for IOMMU_NOTIFIER_MAP could ensure
>shadow
>> page table in sync, then it's OK.
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  hw/i386/intel_iommu.c | 49 +++++++++++++++++++++++++++++++++++------
>--
>>  hw/vfio/common.c      |  2 +-
>>  include/exec/memory.h | 13 ++++++++++++
>>  softmmu/memory.c      |  4 ++++
>>  4 files changed, 58 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 94d52f4205d2..061fcded0dfb 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -3819,6 +3819,41 @@ static int vtd_replay_hook(IOMMUTLBEvent
>*event, void *private)
>>      return 0;
>>  }
>>
>> +static gboolean vtd_replay_full_map(DMAMap *map, gpointer *private)
>> +{
>> +    IOMMUTLBEvent event;
>> +
>> +    event.type = IOMMU_NOTIFIER_MAP;
>> +    event.entry.iova = map->iova;
>> +    event.entry.addr_mask = map->size;
>> +    event.entry.target_as = &address_space_memory;
>> +    event.entry.perm = map->perm;
>> +    event.entry.translated_addr = map->translated_addr;
>> +
>> +    return vtd_replay_hook(&event, private);
>> +}
>> +
>> +/*
>> + * This is a fast path to notify the full mappings falling in the scope
>> + * of IOMMU notifier. The call site should ensure no iova tree update by
>> + * taking necessary locks (e.g. BQL).
>
>We should be accurate on the locking - I think it's the BQL so far.

Will update comments.

>
>> + */
>> +static int vtd_page_walk_full_map_fast_path(IOVATree *iova_tree,
>> +                                            IOMMUNotifier *n)
>> +{
>> +    DMAMap map;
>> +
>> +    map.iova = n->start;
>> +    map.size = n->end - n->start;
>> +    if (!iova_tree_find(iova_tree, &map)) {
>> +        return 0;
>> +    }
>> +
>> +    iova_tree_foreach_range_data(iova_tree, &map, vtd_replay_full_map,
>> +                                 (gpointer *)n);
>> +    return 0;
>> +}
>> +
>>  static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr,
>IOMMUNotifier *n)
>>  {
>>      VTDAddressSpace *vtd_as = container_of(iommu_mr, VTDAddressSpace,
>iommu);
>> @@ -3826,13 +3861,6 @@ static void
>vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>>      uint8_t bus_n = pci_bus_num(vtd_as->bus);
>>      VTDContextEntry ce;
>>
>> -    /*
>> -     * The replay can be triggered by either a invalidation or a newly
>> -     * created entry. No matter what, we release existing mappings
>> -     * (it means flushing caches for UNMAP-only registers).
>> -     */
>> -    vtd_address_space_unmap(vtd_as, n);
>> -
>>      if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
>>          trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" :
>>                                    "legacy mode",
>> @@ -3850,8 +3878,11 @@ static void
>vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>>                  .as = vtd_as,
>>                  .domain_id = vtd_get_domain_id(s, &ce, vtd_as->pasid),
>>              };
>> -
>> -            vtd_page_walk(s, &ce, 0, ~0ULL, &info, vtd_as->pasid);
>> +            if (n->notifier_flags & IOMMU_NOTIFIER_FULL_MAP) {
>> +                vtd_page_walk_full_map_fast_path(vtd_as->iova_tree, n);
>> +            } else {
>> +                vtd_page_walk(s, &ce, 0, ~0ULL, &info, vtd_as->pasid);
>> +            }
>>          }
>>      } else {
>>          trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 78358ede2764..5dae4502b908 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -1890,7 +1890,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainer
>*container,
>>
>>                  iommu_notifier_init(&gdn.n,
>>                                      vfio_iommu_map_dirty_notify,
>> -                                    IOMMU_NOTIFIER_MAP,
>> +                                    IOMMU_NOTIFIER_FULL_MAP,
>>                                      section->offset_within_region,
>>                                      int128_get64(llend),
>>                                      idx);
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index c3661b2276c7..eecc3eec6702 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -142,6 +142,10 @@ struct IOMMUTLBEntry {
>>   *       events (e.g. VFIO). Both notifications must be accurate so that
>>   *       the shadow page table is fully in sync with the guest view.
>>   *
>> + *       Besides MAP, there is a special use case called FULL_MAP which
>> + *       requests notification for all the existent mappings (e.g. VFIO
>> + *       dirty page sync).
>
>Why do we need FULL_MAP?  Can we simply reimpl MAP?

Sorry, I just realized IOMMU_NOTIFIER_FULL_MAP is confusing.
Maybe IOMMU_NOTIFIER_MAP_FAST_PATH could be a bit more accurate.

IIUC, replay() is currently called from two paths. One is the VFIO device
address space switch, which walks over the IOMMU page table to set up the
initial mappings and cache them in the IOVA tree. The other is VFIO dirty
sync, which walks over the IOMMU page table to notify the mappings. Because
we already cache the mappings in the IOVA tree, and VFIO dirty sync is
protected by the BQL, I think it's fine to pick the mappings from the IOVA
tree directly instead of walking over the IOMMU page table. That's the
reason for FULL_MAP (IOMMU_NOTIFIER_MAP_FAST_PATH would be more accurate).

About "reimpl MAP", do you mean walking over the IOMMU page table to notify
all existing MAP events without checking the IOVA tree for differences? If
you prefer, I'll rewrite the implementation this way.

Thanks
Zhenzhong


* RE: [PATCH v2 3/4] memory: Document update on replay()
  2023-06-05 18:42   ` Peter Xu
@ 2023-06-06  2:48     ` Duan, Zhenzhong
  0 siblings, 0 replies; 13+ messages in thread
From: Duan, Zhenzhong @ 2023-06-06  2:48 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel@nongnu.org, mst@redhat.com, jasowang@redhat.com,
	pbonzini@redhat.com, richard.henderson@linaro.org,
	eduardo@habkost.net, marcel.apfelbaum@gmail.com,
	alex.williamson@redhat.com, clg@redhat.com, david@redhat.com,
	philmd@linaro.org, kwankhede@nvidia.com, cjia@nvidia.com,
	Liu, Yi L, Peng,  Chao P

>-----Original Message-----
>From: Peter Xu <peterx@redhat.com>
>Sent: Tuesday, June 6, 2023 2:42 AM
>To: Duan, Zhenzhong <zhenzhong.duan@intel.com>
>Cc: qemu-devel@nongnu.org; mst@redhat.com; jasowang@redhat.com;
>pbonzini@redhat.com; richard.henderson@linaro.org; eduardo@habkost.net;
>marcel.apfelbaum@gmail.com; alex.williamson@redhat.com;
>clg@redhat.com; david@redhat.com; philmd@linaro.org;
>kwankhede@nvidia.com; cjia@nvidia.com; Liu, Yi L <yi.l.liu@intel.com>; Peng,
>Chao P <chao.p.peng@intel.com>
>Subject: Re: [PATCH v2 3/4] memory: Document update on replay()
>
>On Thu, Jun 01, 2023 at 02:33:19PM +0800, Zhenzhong Duan wrote:
>> diff --git a/include/exec/memory.h b/include/exec/memory.h index
>> eecc3eec6702..9a523ef62a94 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -441,9 +441,9 @@ struct IOMMUMemoryRegionClass {
>>       * call the IOMMU translate method for every page in the address space
>>       * with flag == IOMMU_NONE and then call the notifier if translate
>>       * returns a valid mapping. If this method is implemented then it
>> -     * overrides the default behaviour, and must provide the full semantics
>> -     * of memory_region_iommu_replay(), by calling @notifier for every
>> -     * translation present in the IOMMU.
>> +     * overrides the default behavior, and must provide corresponding
>> +     * semantics depending on notifier's type, e.g. IOMMU_NOTIFIER_MAP,
>> +     * notify changed entries; IOMMU_NOTIFIER_FULL_MAP, notify full
>entries.
>
>IIUC it was always trying to notify all existing entries only, rather than changed
>entries.  VT-d used to unmap all so it was also true.

Thanks for pointing that out, I confused it with the invalidation path; I'll drop it.

Zhenzhong 


* Re: [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty page sync
  2023-06-06  2:35     ` Duan, Zhenzhong
@ 2023-06-06 15:41       ` Peter Xu
  2023-06-07  3:14         ` Duan, Zhenzhong
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Xu @ 2023-06-06 15:41 UTC (permalink / raw)
  To: Duan, Zhenzhong
  Cc: qemu-devel@nongnu.org, mst@redhat.com, jasowang@redhat.com,
	pbonzini@redhat.com, richard.henderson@linaro.org,
	eduardo@habkost.net, marcel.apfelbaum@gmail.com,
	alex.williamson@redhat.com, clg@redhat.com, david@redhat.com,
	philmd@linaro.org, kwankhede@nvidia.com, cjia@nvidia.com,
	Liu, Yi L, Peng, Chao P

On Tue, Jun 06, 2023 at 02:35:41AM +0000, Duan, Zhenzhong wrote:
> >-----Original Message-----
> >From: Peter Xu <peterx@redhat.com>
> >Sent: Tuesday, June 6, 2023 2:39 AM
> >To: Duan, Zhenzhong <zhenzhong.duan@intel.com>
> >Cc: qemu-devel@nongnu.org; mst@redhat.com; jasowang@redhat.com;
> >pbonzini@redhat.com; richard.henderson@linaro.org; eduardo@habkost.net;
> >marcel.apfelbaum@gmail.com; alex.williamson@redhat.com;
> >clg@redhat.com; david@redhat.com; philmd@linaro.org;
> >kwankhede@nvidia.com; cjia@nvidia.com; Liu, Yi L <yi.l.liu@intel.com>; Peng,
> >Chao P <chao.p.peng@intel.com>
> >Subject: Re: [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty
> >page sync
> >
> >On Thu, Jun 01, 2023 at 02:33:18PM +0800, Zhenzhong Duan wrote:
> >> Peter Xu found a potential issue:
> >>
> >> "The other thing is when I am looking at the new code I found that we
> >> actually extended the replay() to be used also in dirty tracking of vfio,
> >> in vfio_sync_dirty_bitmap().  For that maybe it's already broken if
> >> unmap_all() because afaiu log_sync() can be called in migration thread
> >> anytime during DMA so I think it means the device is prone to DMA with the
> >> IOMMU pgtable quickly erased and rebuilt here, which means the DMA
> >could
> >> fail unexpectedly.  Copy Alex, Kirti and Neo."
> >>
> >> To eliminate this small window with empty mapping, we should remove the
> >> call to unmap_all(). Besides that, introduce a new notifier type called
> >> IOMMU_NOTIFIER_FULL_MAP to get full mappings as intel_iommu only
> >notifies
> >> changed mappings while VFIO dirty page sync needs full mappings. Thanks
> >> to current implementation of iova tree, we could pick mappings from iova
> >> trees directly instead of walking through guest IOMMU page table.
> >>
> >> IOMMU_NOTIFIER_MAP is still used to get changed mappings for
> >optimization
> >> purpose. As long as notification for IOMMU_NOTIFIER_MAP could ensure
> >shadow
> >> page table in sync, then it's OK.
> >>
> >> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> >> ---
> >>  hw/i386/intel_iommu.c | 49 +++++++++++++++++++++++++++++++++++------
> >--
> >>  hw/vfio/common.c      |  2 +-
> >>  include/exec/memory.h | 13 ++++++++++++
> >>  softmmu/memory.c      |  4 ++++
> >>  4 files changed, 58 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> >> index 94d52f4205d2..061fcded0dfb 100644
> >> --- a/hw/i386/intel_iommu.c
> >> +++ b/hw/i386/intel_iommu.c
> >> @@ -3819,6 +3819,41 @@ static int vtd_replay_hook(IOMMUTLBEvent
> >*event, void *private)
> >>      return 0;
> >>  }
> >>
> >> +static gboolean vtd_replay_full_map(DMAMap *map, gpointer *private)
> >> +{
> >> +    IOMMUTLBEvent event;
> >> +
> >> +    event.type = IOMMU_NOTIFIER_MAP;
> >> +    event.entry.iova = map->iova;
> >> +    event.entry.addr_mask = map->size;
> >> +    event.entry.target_as = &address_space_memory;
> >> +    event.entry.perm = map->perm;
> >> +    event.entry.translated_addr = map->translated_addr;
> >> +
> >> +    return vtd_replay_hook(&event, private);
> >> +}
> >> +
> >> +/*
> >> + * This is a fast path to notify the full mappings falling in the scope
> >> + * of IOMMU notifier. The call site should ensure no iova tree update by
> >> + * taking necessary locks (e.g. BQL).
> >
> >We should be accurate on the locking - I think it's the BQL so far.
> 
> Will update comments.
> 
> >
> >> + */
> >> +static int vtd_page_walk_full_map_fast_path(IOVATree *iova_tree,
> >> +                                            IOMMUNotifier *n)
> >> +{
> >> +    DMAMap map;
> >> +
> >> +    map.iova = n->start;
> >> +    map.size = n->end - n->start;
> >> +    if (!iova_tree_find(iova_tree, &map)) {
> >> +        return 0;
> >> +    }
> >> +
> >> +    iova_tree_foreach_range_data(iova_tree, &map, vtd_replay_full_map,
> >> +                                 (gpointer *)n);
> >> +    return 0;
> >> +}
> >> +
> >>  static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr,
> >IOMMUNotifier *n)
> >>  {
> >>      VTDAddressSpace *vtd_as = container_of(iommu_mr, VTDAddressSpace,
> >iommu);
> >> @@ -3826,13 +3861,6 @@ static void
> >vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> >>      uint8_t bus_n = pci_bus_num(vtd_as->bus);
> >>      VTDContextEntry ce;
> >>
> >> -    /*
> >> -     * The replay can be triggered by either a invalidation or a newly
> >> -     * created entry. No matter what, we release existing mappings
> >> -     * (it means flushing caches for UNMAP-only registers).
> >> -     */
> >> -    vtd_address_space_unmap(vtd_as, n);
> >> -
> >>      if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
> >>          trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" :
> >>                                    "legacy mode",
> >> @@ -3850,8 +3878,11 @@ static void
> >vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> >>                  .as = vtd_as,
> >>                  .domain_id = vtd_get_domain_id(s, &ce, vtd_as->pasid),
> >>              };
> >> -
> >> -            vtd_page_walk(s, &ce, 0, ~0ULL, &info, vtd_as->pasid);
> >> +            if (n->notifier_flags & IOMMU_NOTIFIER_FULL_MAP) {
> >> +                vtd_page_walk_full_map_fast_path(vtd_as->iova_tree, n);
> >> +            } else {
> >> +                vtd_page_walk(s, &ce, 0, ~0ULL, &info, vtd_as->pasid);
> >> +            }
> >>          }
> >>      } else {
> >>          trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index 78358ede2764..5dae4502b908 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -1890,7 +1890,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainer
> >*container,
> >>
> >>                  iommu_notifier_init(&gdn.n,
> >>                                      vfio_iommu_map_dirty_notify,
> >> -                                    IOMMU_NOTIFIER_MAP,
> >> +                                    IOMMU_NOTIFIER_FULL_MAP,
> >>                                      section->offset_within_region,
> >>                                      int128_get64(llend),
> >>                                      idx);
> >> diff --git a/include/exec/memory.h b/include/exec/memory.h
> >> index c3661b2276c7..eecc3eec6702 100644
> >> --- a/include/exec/memory.h
> >> +++ b/include/exec/memory.h
> >> @@ -142,6 +142,10 @@ struct IOMMUTLBEntry {
> >>   *       events (e.g. VFIO). Both notifications must be accurate so that
> >>   *       the shadow page table is fully in sync with the guest view.
> >>   *
> >> + *       Besides MAP, there is a special use case called FULL_MAP which
> >> + *       requests notification for all the existent mappings (e.g. VFIO
> >> + *       dirty page sync).
> >
> >Why do we need FULL_MAP?  Can we simply reimpl MAP?
> 
> Sorry, I just realized IOMMU_NOTIFIER_FULL_MAP is confusing.
> Maybe IOMMU_NOTIFIER_MAP_FAST_PATH could be a bit more accurate.
> 
> IIUC, replay() is currently called from two paths. One is the VFIO device
> address space switch, which walks over the IOMMU page table to set up the
> initial mappings and cache them in the IOVA tree. The other is VFIO dirty
> sync, which walks over the IOMMU page table to notify the mappings. Because
> we already cache the mappings in the IOVA tree, and VFIO dirty sync is
> protected by the BQL, I think it's fine to pick the mappings from the IOVA
> tree directly instead of walking over the IOMMU page table. That's the
> reason for FULL_MAP (IOMMU_NOTIFIER_MAP_FAST_PATH would be more accurate).
> 
> About "reimpl MAP", do you mean walking over the IOMMU page table to notify
> all existing MAP events without checking the IOVA tree for differences? If
> you prefer, I'll rewrite the implementation this way.

We still need to maintain the iova tree. IIUC that's the major complexity of
vt-d emulation, because we have that extra cache layer to sync with the
real guest iommu pgtables.

But I think we were just wrong to also notify in the unmap_all() procedure.

IIUC the right thing to do (keeping replay() the interface as-is, per it
used to be defined) is we should replace the unmap_all() to only evacuate
the iova tree (keeping all host mappings untouched, IOW, don't notify
UNMAP), and do a full resync there, which will notify all existing mappings
as MAP.  Then we don't interrupt with any existing mapping if there is
(e.g. for the dirty sync case), meanwhile we keep sync too to latest (for
moving a vfio device into an existing iommu group).

Do you think that'll work for us?

-- 
Peter Xu




* RE: [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty page sync
  2023-06-06 15:41       ` Peter Xu
@ 2023-06-07  3:14         ` Duan, Zhenzhong
  2023-06-07 14:06           ` Peter Xu
  0 siblings, 1 reply; 13+ messages in thread
From: Duan, Zhenzhong @ 2023-06-07  3:14 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel@nongnu.org, mst@redhat.com, jasowang@redhat.com,
	pbonzini@redhat.com, richard.henderson@linaro.org,
	eduardo@habkost.net, marcel.apfelbaum@gmail.com,
	alex.williamson@redhat.com, clg@redhat.com, david@redhat.com,
	philmd@linaro.org, kwankhede@nvidia.com, cjia@nvidia.com,
	Liu, Yi L, Peng,  Chao P



>-----Original Message-----
>From: Peter Xu <peterx@redhat.com>
>Sent: Tuesday, June 6, 2023 11:42 PM
>Subject: Re: [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty
>page sync
>
...
>> >> a/include/exec/memory.h b/include/exec/memory.h index
>> >> c3661b2276c7..eecc3eec6702 100644
>> >> --- a/include/exec/memory.h
>> >> +++ b/include/exec/memory.h
>> >> @@ -142,6 +142,10 @@ struct IOMMUTLBEntry {
>> >>   *       events (e.g. VFIO). Both notifications must be accurate so that
>> >>   *       the shadow page table is fully in sync with the guest view.
>> >>   *
>> >> + *       Besides MAP, there is a special use case called FULL_MAP which
>> >> + *       requests notification for all the existent mappings (e.g. VFIO
>> >> + *       dirty page sync).
>> >
>> >Why do we need FULL_MAP?  Can we simply reimpl MAP?
>>
>> Sorry, I just realized IOMMU_NOTIFIER_FULL_MAP is confusing.
>> Maybe IOMMU_NOTIFIER_MAP_FAST_PATH could be a bit more accurate.
>>
>> IIUC, replay() is currently called from two paths. One is the VFIO
>> device address space switch, which walks over the IOMMU page table to
>> set up the initial mappings and cache them in the IOVA tree. The other
>> is VFIO dirty sync, which walks over the IOMMU page table to notify the
>> mappings. Because we already cache the mappings in the IOVA tree, and
>> VFIO dirty sync is protected by the BQL, I think it's fine to pick the
>> mappings from the IOVA tree directly instead of walking over the IOMMU
>> page table. That's the reason for FULL_MAP
>> (IOMMU_NOTIFIER_MAP_FAST_PATH would be more accurate).
>>
>> About "reimpl MAP", do you mean walking over the IOMMU page table to
>> notify all existing MAP events without checking the IOVA tree for
>> differences? If you prefer, I'll rewrite the implementation this way.
>
>We still need to maintain the iova tree. IIUC that's the major complexity of
>vt-d emulation, because we have that extra cache layer to sync with the real
>guest iommu pgtables.

Can't agree more; it looks like only intel-iommu and virtio-iommu have
implemented such an optimization for now.

>
>But I think we were just wrong to also notify in the unmap_all() procedure.
>
>IIUC the right thing to do (keeping replay() the interface as-is, per it used to be
>defined) is we should replace the unmap_all() to only evacuate the iova tree
>(keeping all host mappings untouched, IOW, don't notify UNMAP), and do a
>full resync there, which will notify all existing mappings as MAP.  Then we
>don't interrupt with any existing mapping if there is (e.g. for the dirty sync
>case), meanwhile we keep sync too to latest (for moving a vfio device into an
>existing iommu group).
>
>Do you think that'll work for us?

Yes, I think I get your point.
The simple change below should work in the way you suggested, do you agree?

@@ -3825,13 +3833,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
     IntelIOMMUState *s = vtd_as->iommu_state;
     uint8_t bus_n = pci_bus_num(vtd_as->bus);
     VTDContextEntry ce;
+    DMAMap map = { .iova = 0, .size = HWADDR_MAX };

-    /*
-     * The replay can be triggered by either a invalidation or a newly
-     * created entry. No matter what, we release existing mappings
-     * (it means flushing caches for UNMAP-only registers).
-     */
-    vtd_address_space_unmap(vtd_as, n);
+    /* replay is protected by BQL, page walk will re-setup IOVA tree safely */
+    iova_tree_remove(vtd_as->iova_tree, map);

     if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
         trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" :

Thanks
Zhenzhong


* Re: [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty page sync
  2023-06-07  3:14         ` Duan, Zhenzhong
@ 2023-06-07 14:06           ` Peter Xu
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Xu @ 2023-06-07 14:06 UTC (permalink / raw)
  To: Duan, Zhenzhong
  Cc: qemu-devel@nongnu.org, mst@redhat.com, jasowang@redhat.com,
	pbonzini@redhat.com, richard.henderson@linaro.org,
	eduardo@habkost.net, marcel.apfelbaum@gmail.com,
	alex.williamson@redhat.com, clg@redhat.com, david@redhat.com,
	philmd@linaro.org, kwankhede@nvidia.com, cjia@nvidia.com,
	Liu, Yi L, Peng, Chao P

On Wed, Jun 07, 2023 at 03:14:07AM +0000, Duan, Zhenzhong wrote:
> 
> 
> >-----Original Message-----
> >From: Peter Xu <peterx@redhat.com>
> >Sent: Tuesday, June 6, 2023 11:42 PM
> >Subject: Re: [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty
> >page sync
> >
> ...
> >> >> a/include/exec/memory.h b/include/exec/memory.h index
> >> >> c3661b2276c7..eecc3eec6702 100644
> >> >> --- a/include/exec/memory.h
> >> >> +++ b/include/exec/memory.h
> >> >> @@ -142,6 +142,10 @@ struct IOMMUTLBEntry {
> >> >>   *       events (e.g. VFIO). Both notifications must be accurate so that
> >> >>   *       the shadow page table is fully in sync with the guest view.
> >> >>   *
> >> >> + *       Besides MAP, there is a special use case called FULL_MAP which
> >> >> + *       requests notification for all the existent mappings (e.g. VFIO
> >> >> + *       dirty page sync).
> >> >
> >> >Why do we need FULL_MAP?  Can we simply reimpl MAP?
> >>
> >> Sorry, I just realized IOMMU_NOTIFIER_FULL_MAP is confusing.
> >> Maybe IOMMU_NOTIFIER_MAP_FAST_PATH could be a bit more accurate.
> >>
> >> IIUC, currently replay() is called from two paths, one is VFIO device
> >> address space switch which walks over the IOMMU page table to setup
> >> initial mapping and cache it in IOVA tree. The other is VFIO dirty
> >> sync which walks over the IOMMU page table to notify the mapping,
> >> because we already cache the mapping in IOVA tree and VFIO dirty sync
> >> is protected by BQL, so I think it's fine to pick mapping from IOVA
> >> tree directly instead of walking over IOMMU page table. That's the
> >> reason of FULL_MAP (IOMMU_NOTIFIER_MAP_FAST_PATH better).
> >>
> >> About "reimpl MAP", do you mean to walk over IOMMU page table to
> >> notify all existing MAP events without checking with the IOVA tree for
> >> difference? If you prefer, I'll rewrite an implementation this way.
> >
> >We still need to maintain iova tree. IIUC that's the major complexity of vt-d
> >emulation, because we have that extra cache layer to sync with the real guest
> >iommu pgtables.
> 
> Can't agree more; it looks like only intel-iommu and virtio-iommu have
> implemented such an optimization for now.
> 
> >
> >But I think we were just wrong to also notify in the unmap_all() procedure.
> >
> >IIUC the right thing to do (keeping replay() the interface as-is, per it used to be
> >defined) is we should replace the unmap_all() to only evacuate the iova tree
> >(keeping all host mappings untouched, IOW, don't notify UNMAP), and do a
> >full resync there, which will notify all existing mappings as MAP.  Then we
> >don't interrupt with any existing mapping if there is (e.g. for the dirty sync
> >case), meanwhile we keep sync too to latest (for moving a vfio device into an
> >existing iommu group).
> >
> >Do you think that'll work for us?
> 
> Yes, I think I get your point.
> The simple change below should work in the way you suggested, do you agree?
> 
> @@ -3825,13 +3833,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>      IntelIOMMUState *s = vtd_as->iommu_state;
>      uint8_t bus_n = pci_bus_num(vtd_as->bus);
>      VTDContextEntry ce;
> +    DMAMap map = { .iova = 0, .size = HWADDR_MAX };
> 
> -    /*
> -     * The replay can be triggered by either a invalidation or a newly
> -     * created entry. No matter what, we release existing mappings
> -     * (it means flushing caches for UNMAP-only registers).
> -     */
> -    vtd_address_space_unmap(vtd_as, n);
> +    /* replay is protected by BQL, page walk will re-setup IOVA tree safely */
> +    iova_tree_remove(vtd_as->iova_tree, map);
> 
>      if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
>          trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" :

Yes, thanks!

-- 
Peter Xu




end of thread, other threads:[~2023-06-07 14:31 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-01  6:33 [PATCH v2 0/4] Optimize UNMAP call and bug fix Zhenzhong Duan
2023-06-01  6:33 ` [PATCH v2 1/4] util: Add iova_tree_foreach_range_data Zhenzhong Duan
2023-06-05 18:39   ` Peter Xu
2023-06-01  6:33 ` [PATCH v2 2/4] intel_iommu: Fix a potential issue in VFIO dirty page sync Zhenzhong Duan
2023-06-05 18:39   ` Peter Xu
2023-06-06  2:35     ` Duan, Zhenzhong
2023-06-06 15:41       ` Peter Xu
2023-06-07  3:14         ` Duan, Zhenzhong
2023-06-07 14:06           ` Peter Xu
2023-06-01  6:33 ` [PATCH v2 3/4] memory: Document update on replay() Zhenzhong Duan
2023-06-05 18:42   ` Peter Xu
2023-06-06  2:48     ` Duan, Zhenzhong
2023-06-01  6:33 ` [PATCH v2 4/4] intel_iommu: Optimize out some unnecessary UNMAP calls Zhenzhong Duan
