* [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices
@ 2025-09-19 21:34 Alejandro Jimenez
  2025-09-19 21:34 ` [PATCH v3 01/22] memory: Adjust event ranges to fit within notifier boundaries Alejandro Jimenez
                   ` (22 more replies)
  0 siblings, 23 replies; 34+ messages in thread
From: Alejandro Jimenez @ 2025-09-19 21:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
This series adds support for guests using the AMD vIOMMU to enable DMA remapping
for VFIO devices. Please see the v1[0] cover letter for additional details, such
as example QEMU command line parameters used in testing.
I have sanity tested on an AMD EPYC Genoa host, booting a Linux guest with
'iommu.passthrough=0' and several CX6 VFs, and there are no issues during
typical guest operation.
When using the non-default parameter 'iommu.forcedac=1' in the guest kernel
cmdline, this initially fails due to a VFIO integer overflow bug which requires
the following fix in the host kernel:
https://github.com/aljimenezb/linux/commit/014be8cafe7464d278729583a2dd5d94514e2e2a
This is a work in progress as there are other locations in the driver that are
susceptible to overflows, but the above is sufficient to fix the initial
problem.
Even after that fix is applied, I see an issue on guest reboot when 'forcedac=1'
is in use. Although the guest boots, the VF is not properly initialized, failing
with a timeout. Once the guest reaches userspace the VF driver can be reloaded
and it then works as expected. I am still investigating the root cause for this
issue, and will need to discuss all the steps I have tried to eliminate
potential sources of errors in a separate thread.
I am sending v3 despite this known issue since forcedac=1 is not a default or
commonly known/used setting. Having large portions of the infrastructure for
DMA remapping already in place (and working) will make it easier to debug this
corner case and get feedback/testing from the community. I hope this is a viable
approach, otherwise I am happy to discuss all the steps I have taken to debug
this issue in this thread and test any suggestions to address it.
Changes since v2[2]:
- P5: Fixed missed check for AMDVI_FR_DTE_RTR_ERR in amdvi_do_translate() (Sairaj)
- P6: Reword commit message to clarify the need to discern between empty PTEs and errors (Vasant)
- P9: Use correct enum type for notifier flags and remove whitespace changes (Sairaj)
- P11: Fixed integer overflow bug when guest uses iommu.forcedac=1. Fixed in P8. (Sairaj)
- P15: Fixed typo in commit message (Sairaj)
- P16: On reset, use passthrough mode by default on all address spaces (Sairaj)
- P18: Enforce isolation by using DMA mode on errors retrieving DTE (Ethan & Sairaj)
- P20: Removed unused pte_override_page_mask() and pte_get_page_mask() to avoid -Wunused-function error.
- Add HATDis support patches from Joao Martins (HATDis available in Linux since [1])
Thank you,
Alejandro
[0] https://lore.kernel.org/all/20250414020253.443831-1-alejandro.j.jimenez@oracle.com/
[1] https://lore.kernel.org/all/cover.1749016436.git.Ankit.Soni@amd.com/
[2] https://lore.kernel.org/qemu-devel/20250502021605.1795985-1-alejandro.j.jimenez@oracle.com/
Alejandro Jimenez (20):
  memory: Adjust event ranges to fit within notifier boundaries
  amd_iommu: Document '-device amd-iommu' common options
  amd_iommu: Reorder device and page table helpers
  amd_iommu: Helper to decode size of page invalidation command
  amd_iommu: Add helper function to extract the DTE
  amd_iommu: Return an error when unable to read PTE from guest memory
  amd_iommu: Add helpers to walk AMD v1 Page Table format
  amd_iommu: Add a page walker to sync shadow page tables on
    invalidation
  amd_iommu: Add basic structure to support IOMMU notifier updates
  amd_iommu: Sync shadow page tables on page invalidation
  amd_iommu: Use iova_tree records to determine large page size on UNMAP
  amd_iommu: Unmap all address spaces under the AMD IOMMU on reset
  amd_iommu: Add replay callback
  amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL
  amd_iommu: Toggle memory regions based on address translation mode
  amd_iommu: Set all address spaces to use passthrough mode on reset
  amd_iommu: Add dma-remap property to AMD vIOMMU device
  amd_iommu: Toggle address translation mode on devtab entry
    invalidation
  amd_iommu: Do not assume passthrough translation when DTE[TV]=0
  amd_iommu: Refactor amdvi_page_walk() to use common code for page walk
Joao Martins (2):
  i386/intel-iommu: Move dma_translation to x86-iommu
  amd_iommu: HATDis/HATS=11 support
 hw/i386/acpi-build.c        |    6 +-
 hw/i386/amd_iommu.c         | 1056 ++++++++++++++++++++++++++++++-----
 hw/i386/amd_iommu.h         |   51 ++
 hw/i386/intel_iommu.c       |    5 +-
 hw/i386/x86-iommu.c         |    1 +
 include/hw/i386/x86-iommu.h |    1 +
 qemu-options.hx             |   23 +
 system/memory.c             |   10 +-
 8 files changed, 999 insertions(+), 154 deletions(-)
base-commit: ab8008b231e758e03c87c1c483c03afdd9c02e19
-- 
2.43.5
* [PATCH v3 01/22] memory: Adjust event ranges to fit within notifier boundaries
  2025-09-19 21:34 [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
@ 2025-09-19 21:34 ` Alejandro Jimenez
  2025-09-19 21:34 ` [PATCH v3 02/22] amd_iommu: Document '-device amd-iommu' common options Alejandro Jimenez
                   ` (21 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Alejandro Jimenez @ 2025-09-19 21:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
Invalidating the entire address space (i.e. range of [0, ~0ULL]) is a
valid and required operation by vIOMMU implementations. However, such
invalidations currently trigger an assertion unless they originate from
device IOTLB invalidations.
Although in recent Linux guests this case is not exercised by the VTD
implementation due to various optimizations, the assertion will be hit
by upcoming AMD vIOMMU changes to support DMA address translation. More
specifically, it is hit when running a Linux guest with a VFIO
passthrough device and a kernel that does not contain commit
3f2571fed2fa ("iommu/amd: Remove redundant domain flush from
attach_device()").
Remove the assertion altogether and adjust the range to ensure it does
not cross notifier boundaries.
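As a rough illustration of the cropping described above, the standalone C
sketch below clamps an event range to a notifier window. The names EventRange
and crop_to_window are hypothetical and not part of QEMU. The sketch assumes
the event already overlaps the window, which the caller is expected to have
checked beforehand.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an IOMMU event: [iova, iova + addr_mask]. */
typedef struct {
    uint64_t iova;
    uint64_t addr_mask;
} EventRange;

/*
 * Clamp ev to the inclusive window [start, end].
 * Assumes the two ranges overlap; the caller must check this first.
 */
static EventRange crop_to_window(EventRange ev, uint64_t start, uint64_t end)
{
    uint64_t ev_end = ev.iova + ev.addr_mask;
    EventRange out;

    assert(ev.iova <= end && ev_end >= start);
    out.iova = ev.iova > start ? ev.iova : start;
    out.addr_mask = (ev_end < end ? ev_end : end) - out.iova;
    return out;
}
```

With this shape, a full address space invalidation (iova 0, addr_mask ~0ULL)
delivered to a notifier covering a single 4K window is reduced to that window
instead of tripping an assertion.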
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
---
 system/memory.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)
diff --git a/system/memory.c b/system/memory.c
index cf8cad6961156..5c6ccc5c57412 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2044,13 +2044,9 @@ void memory_region_notify_iommu_one(IOMMUNotifier *notifier,
         return;
     }
 
-    if (notifier->notifier_flags & IOMMU_NOTIFIER_DEVIOTLB_UNMAP) {
-        /* Crop (iova, addr_mask) to range */
-        tmp.iova = MAX(tmp.iova, notifier->start);
-        tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
-    } else {
-        assert(entry->iova >= notifier->start && entry_end <= notifier->end);
-    }
+    /* Crop (iova, addr_mask) to range */
+    tmp.iova = MAX(tmp.iova, notifier->start);
+    tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
 
     if (event->type & notifier->notifier_flags) {
         notifier->notify(notifier, &tmp);
-- 
2.43.5
* [PATCH v3 02/22] amd_iommu: Document '-device amd-iommu' common options
  2025-09-19 21:34 [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
  2025-09-19 21:34 ` [PATCH v3 01/22] memory: Adjust event ranges to fit within notifier boundaries Alejandro Jimenez
@ 2025-09-19 21:34 ` Alejandro Jimenez
  2025-09-19 21:34 ` [PATCH v3 03/22] amd_iommu: Reorder device and page table helpers Alejandro Jimenez
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Alejandro Jimenez @ 2025-09-19 21:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
Document the common parameters used when emulating AMD vIOMMU.
Besides the two amd-iommu specific options, 'xtsup' and 'dma-remap', the
generic x86 IOMMU option 'intremap' is also included, since it is
typically specified in QEMU command line examples and mailing list threads.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 qemu-options.hx | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)
diff --git a/qemu-options.hx b/qemu-options.hx
index 075f4be2e3e67..6615123e6a11a 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -1238,6 +1238,29 @@ SRST
         Accepts either the default root complex (pcie.0) or a
         pxb-pcie based root complex.
 
+``-device amd-iommu[,option=...]``
+    Enables emulation of an AMD-Vi I/O Memory Management Unit (IOMMU).
+    Only available with ``-machine q35``, it supports the following options:
+
+    ``dma-remap=on|off`` (default: off)
+        Support for DMA address translation and access permission checking for
+        guests attaching passthrough devices to paging domains, using the AMD v1
+        I/O Page Table format. This enables ``-device vfio-pci,...`` to work
+        correctly with a guest using the DMA remapping feature of the vIOMMU.
+
+    ``intremap=on|off`` (default: auto)
+        Generic x86 IOMMU functionality implemented by ``amd-iommu`` device.
+        Enables interrupt remapping feature in guests, which is also required to
+        enable x2apic support.
+        Currently only available with ``kernel-irqchip=off|split``, it is
+        automatically enabled when either of those modes is in use, and disabled
+        with ``kernel-irqchip=on``.
+
+    ``xtsup=on|off`` (default: off)
+        Interrupt remapping table supports x2apic mode, enabling the use of
+        128-bit IRTE format with 32-bit destination field by the guest. Required
+        to support routing interrupts to vCPUs with APIC IDs larger than 0xff.
+
 ERST
 
 DEF("name", HAS_ARG, QEMU_OPTION_name,
-- 
2.43.5
* [PATCH v3 03/22] amd_iommu: Reorder device and page table helpers
  2025-09-19 21:34 [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
  2025-09-19 21:34 ` [PATCH v3 01/22] memory: Adjust event ranges to fit within notifier boundaries Alejandro Jimenez
  2025-09-19 21:34 ` [PATCH v3 02/22] amd_iommu: Document '-device amd-iommu' common options Alejandro Jimenez
@ 2025-09-19 21:34 ` Alejandro Jimenez
  2025-09-19 21:34 ` [PATCH v3 04/22] amd_iommu: Helper to decode size of page invalidation command Alejandro Jimenez
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Alejandro Jimenez @ 2025-09-19 21:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
Move code related to Device Table and Page Table to an earlier location in
the file, where it does not require forward declarations to be used by the
various invalidation functions that will need to query the DTE and walk the
page table in upcoming changes.
This change consists of code movement only; no functional change intended.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 172 ++++++++++++++++++++++----------------------
 1 file changed, 86 insertions(+), 86 deletions(-)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 26be69bec8ae2..3cbc9499dbcc4 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -438,6 +438,92 @@ static void amdvi_completion_wait(AMDVIState *s, uint64_t *cmd)
     trace_amdvi_completion_wait(addr, data);
 }
 
+static inline uint64_t amdvi_get_perms(uint64_t entry)
+{
+    return (entry & (AMDVI_DEV_PERM_READ | AMDVI_DEV_PERM_WRITE)) >>
+           AMDVI_DEV_PERM_SHIFT;
+}
+
+/* validate that reserved bits are honoured */
+static bool amdvi_validate_dte(AMDVIState *s, uint16_t devid,
+                               uint64_t *dte)
+{
+    if ((dte[0] & AMDVI_DTE_QUAD0_RESERVED) ||
+        (dte[1] & AMDVI_DTE_QUAD1_RESERVED) ||
+        (dte[2] & AMDVI_DTE_QUAD2_RESERVED) ||
+        (dte[3] & AMDVI_DTE_QUAD3_RESERVED)) {
+        amdvi_log_illegaldevtab_error(s, devid,
+                                      s->devtab +
+                                      devid * AMDVI_DEVTAB_ENTRY_SIZE, 0);
+        return false;
+    }
+
+    return true;
+}
+
+/* get a device table entry given the devid */
+static bool amdvi_get_dte(AMDVIState *s, int devid, uint64_t *entry)
+{
+    uint32_t offset = devid * AMDVI_DEVTAB_ENTRY_SIZE;
+
+    if (dma_memory_read(&address_space_memory, s->devtab + offset, entry,
+                        AMDVI_DEVTAB_ENTRY_SIZE, MEMTXATTRS_UNSPECIFIED)) {
+        trace_amdvi_dte_get_fail(s->devtab, offset);
+        /* log error accessing dte */
+        amdvi_log_devtab_error(s, devid, s->devtab + offset, 0);
+        return false;
+    }
+
+    *entry = le64_to_cpu(*entry);
+    if (!amdvi_validate_dte(s, devid, entry)) {
+        trace_amdvi_invalid_dte(entry[0]);
+        return false;
+    }
+
+    return true;
+}
+
+/* get pte translation mode */
+static inline uint8_t get_pte_translation_mode(uint64_t pte)
+{
+    return (pte >> AMDVI_DEV_MODE_RSHIFT) & AMDVI_DEV_MODE_MASK;
+}
+
+static inline uint64_t pte_override_page_mask(uint64_t pte)
+{
+    uint8_t page_mask = 13;
+    uint64_t addr = (pte & AMDVI_DEV_PT_ROOT_MASK) >> 12;
+    /* find the first zero bit */
+    while (addr & 1) {
+        page_mask++;
+        addr = addr >> 1;
+    }
+
+    return ~((1ULL << page_mask) - 1);
+}
+
+static inline uint64_t pte_get_page_mask(uint64_t oldlevel)
+{
+    return ~((1UL << ((oldlevel * 9) + 3)) - 1);
+}
+
+static inline uint64_t amdvi_get_pte_entry(AMDVIState *s, uint64_t pte_addr,
+                                          uint16_t devid)
+{
+    uint64_t pte;
+
+    if (dma_memory_read(&address_space_memory, pte_addr,
+                        &pte, sizeof(pte), MEMTXATTRS_UNSPECIFIED)) {
+        trace_amdvi_get_pte_hwerror(pte_addr);
+        amdvi_log_pagetab_error(s, devid, pte_addr, 0);
+        pte = 0;
+        return pte;
+    }
+
+    pte = le64_to_cpu(pte);
+    return pte;
+}
+
 /* log error without aborting since linux seems to be using reserved bits */
 static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
 {
@@ -894,92 +980,6 @@ static void amdvi_mmio_write(void *opaque, hwaddr addr, uint64_t val,
     }
 }
 
-static inline uint64_t amdvi_get_perms(uint64_t entry)
-{
-    return (entry & (AMDVI_DEV_PERM_READ | AMDVI_DEV_PERM_WRITE)) >>
-           AMDVI_DEV_PERM_SHIFT;
-}
-
-/* validate that reserved bits are honoured */
-static bool amdvi_validate_dte(AMDVIState *s, uint16_t devid,
-                               uint64_t *dte)
-{
-    if ((dte[0] & AMDVI_DTE_QUAD0_RESERVED) ||
-        (dte[1] & AMDVI_DTE_QUAD1_RESERVED) ||
-        (dte[2] & AMDVI_DTE_QUAD2_RESERVED) ||
-        (dte[3] & AMDVI_DTE_QUAD3_RESERVED)) {
-        amdvi_log_illegaldevtab_error(s, devid,
-                                      s->devtab +
-                                      devid * AMDVI_DEVTAB_ENTRY_SIZE, 0);
-        return false;
-    }
-
-    return true;
-}
-
-/* get a device table entry given the devid */
-static bool amdvi_get_dte(AMDVIState *s, int devid, uint64_t *entry)
-{
-    uint32_t offset = devid * AMDVI_DEVTAB_ENTRY_SIZE;
-
-    if (dma_memory_read(&address_space_memory, s->devtab + offset, entry,
-                        AMDVI_DEVTAB_ENTRY_SIZE, MEMTXATTRS_UNSPECIFIED)) {
-        trace_amdvi_dte_get_fail(s->devtab, offset);
-        /* log error accessing dte */
-        amdvi_log_devtab_error(s, devid, s->devtab + offset, 0);
-        return false;
-    }
-
-    *entry = le64_to_cpu(*entry);
-    if (!amdvi_validate_dte(s, devid, entry)) {
-        trace_amdvi_invalid_dte(entry[0]);
-        return false;
-    }
-
-    return true;
-}
-
-/* get pte translation mode */
-static inline uint8_t get_pte_translation_mode(uint64_t pte)
-{
-    return (pte >> AMDVI_DEV_MODE_RSHIFT) & AMDVI_DEV_MODE_MASK;
-}
-
-static inline uint64_t pte_override_page_mask(uint64_t pte)
-{
-    uint8_t page_mask = 13;
-    uint64_t addr = (pte & AMDVI_DEV_PT_ROOT_MASK) >> 12;
-    /* find the first zero bit */
-    while (addr & 1) {
-        page_mask++;
-        addr = addr >> 1;
-    }
-
-    return ~((1ULL << page_mask) - 1);
-}
-
-static inline uint64_t pte_get_page_mask(uint64_t oldlevel)
-{
-    return ~((1UL << ((oldlevel * 9) + 3)) - 1);
-}
-
-static inline uint64_t amdvi_get_pte_entry(AMDVIState *s, uint64_t pte_addr,
-                                          uint16_t devid)
-{
-    uint64_t pte;
-
-    if (dma_memory_read(&address_space_memory, pte_addr,
-                        &pte, sizeof(pte), MEMTXATTRS_UNSPECIFIED)) {
-        trace_amdvi_get_pte_hwerror(pte_addr);
-        amdvi_log_pagetab_error(s, devid, pte_addr, 0);
-        pte = 0;
-        return pte;
-    }
-
-    pte = le64_to_cpu(pte);
-    return pte;
-}
-
 static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
                             IOMMUTLBEntry *ret, unsigned perms,
                             hwaddr addr)
-- 
2.43.5
* [PATCH v3 04/22] amd_iommu: Helper to decode size of page invalidation command
  2025-09-19 21:34 [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (2 preceding siblings ...)
  2025-09-19 21:34 ` [PATCH v3 03/22] amd_iommu: Reorder device and page table helpers Alejandro Jimenez
@ 2025-09-19 21:34 ` Alejandro Jimenez
  2025-09-19 21:34 ` [PATCH v3 05/22] amd_iommu: Add helper function to extract the DTE Alejandro Jimenez
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Alejandro Jimenez @ 2025-09-19 21:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
The size of the region to invalidate depends on the S bit and address
encoded in the command. Add a helper to extract this information, which
will be used to sync shadow page tables in upcoming changes.
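To make the encoding concrete, here is a minimal standalone sketch of the
decoding rule, written outside of QEMU. The names decode_inval_size and
count_trailing_ones are hypothetical; the latter is a portable stand-in for
QEMU's cto64(). The 4K page size and the (1ULL << 52) "invalidate all" value
follow the definitions used in this patch.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_4K   (1ULL << 12)
#define INV_ALL_PAGES  (1ULL << 52)   /* mirrors AMDVI_INV_ALL_PAGES */

/* Count trailing one bits; portable stand-in for QEMU's cto64(). */
static unsigned count_trailing_ones(uint64_t v)
{
    unsigned n = 0;

    while (v & 1) {
        n++;
        v >>= 1;
    }
    return n;
}

/* Decode the invalidation size from the command address and the S bit. */
static uint64_t decode_inval_size(uint64_t addr, int s_bit)
{
    unsigned fzbit;

    if (!s_bit) {
        return PAGE_SIZE_4K;            /* S=0: always a single 4K page */
    }

    /* Bits 11:0 are not part of the encoding, so treat them as ones. */
    fzbit = count_trailing_ones(addr | 0xFFF);
    if (fzbit >= 51) {
        return INV_ALL_PAGES;           /* entire address space */
    }
    return 1ULL << (fzbit + 1);
}
```

For example, with S=1 an address whose bits 14:12 are set and bit 15 is clear
(0x7000) decodes to a 64K range, while an address with bit 12 clear decodes to
8K.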
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 34 ++++++++++++++++++++++++++++++++++
 hw/i386/amd_iommu.h |  4 ++++
 2 files changed, 38 insertions(+)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 3cbc9499dbcc4..202f0f8c6e90c 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -577,6 +577,40 @@ static gboolean amdvi_iotlb_remove_by_domid(gpointer key, gpointer value,
     return entry->domid == domid;
 }
 
+/*
+ * Helper to decode the size of the range to invalidate encoded in the
+ * INVALIDATE_IOMMU_PAGES Command format.
+ * The size of the region to invalidate depends on the S bit and address.
+ * S bit value:
+ * 0 :  Invalidation size is 4 Kbytes.
+ * 1 :  Invalidation size is determined by first zero bit in the address
+ *      starting from Address[12].
+ *
+ * In the AMD IOMMU Linux driver, an invalidation command with address
+ * ((1 << 63) - 1) is sent when intending to clear the entire cache.
+ * However, Table 14: Example Page Size Encodings shows that an address of
+ * ((1ULL << 51) - 1) encodes the entire cache, so effectively any address with
+ * first zero at bit 51 or larger is a request to invalidate the entire address
+ * space.
+ */
+static uint64_t __attribute__((unused))
+amdvi_decode_invalidation_size(hwaddr addr, uint16_t flags)
+{
+    uint64_t size = AMDVI_PAGE_SIZE;
+    uint8_t fzbit = 0;
+
+    if (flags & AMDVI_CMD_INVAL_IOMMU_PAGES_S) {
+        fzbit = cto64(addr | 0xFFF);
+
+        if (fzbit >= 51) {
+            size = AMDVI_INV_ALL_PAGES;
+        } else {
+            size = 1ULL << (fzbit + 1);
+        }
+    }
+    return size;
+}
+
 /* we don't have devid - we can't remove pages by address */
 static void amdvi_inval_pages(AMDVIState *s, uint64_t *cmd)
 {
diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
index 2476296c49023..c1170a820257e 100644
--- a/hw/i386/amd_iommu.h
+++ b/hw/i386/amd_iommu.h
@@ -126,6 +126,10 @@
 #define AMDVI_CMD_COMPLETE_PPR_REQUEST    0x07
 #define AMDVI_CMD_INVAL_AMDVI_ALL         0x08
 
+
+#define AMDVI_CMD_INVAL_IOMMU_PAGES_S   (1ULL << 0)
+#define AMDVI_INV_ALL_PAGES             (1ULL << 52)
+
 #define AMDVI_DEVTAB_ENTRY_SIZE           32
 
 /* Device table entry bits 0:63 */
-- 
2.43.5
* [PATCH v3 05/22] amd_iommu: Add helper function to extract the DTE
  2025-09-19 21:34 [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (3 preceding siblings ...)
  2025-09-19 21:34 ` [PATCH v3 04/22] amd_iommu: Helper to decode size of page invalidation command Alejandro Jimenez
@ 2025-09-19 21:34 ` Alejandro Jimenez
  2025-09-19 21:34 ` [PATCH v3 06/22] amd_iommu: Return an error when unable to read PTE from guest memory Alejandro Jimenez
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Alejandro Jimenez @ 2025-09-19 21:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
Extracting the DTE from a given AMDVIAddressSpace pointer structure is a
common operation required for syncing the shadow page tables. Implement a
helper to do it and check for common error conditions.
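The error convention used by the helper (0 on success, a negated fault reason
otherwise) can be sketched in isolation as follows. This is only an
illustration: check_dte, the enum names, and the DTE_V/DTE_TV bit positions
(bits 0 and 1 of the first quadword) are assumptions made for the example, and
the failed guest memory read is modelled with a plain boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DTE_V   (1ULL << 0)   /* assumed position of DTE[V] */
#define DTE_TV  (1ULL << 1)   /* assumed position of DTE[TV] */

/* Hypothetical fault reasons, numbered from 1 so that 0 can mean success. */
enum dte_fault {
    FR_DTE_RTR_ERR = 1,   /* DTE could not be retrieved */
    FR_DTE_V,             /* DTE[V] = 0 */
    FR_DTE_TV,            /* DTE[TV] = 0 */
};

/* Return 0 if the DTE is usable for translation, -fault otherwise. */
static int check_dte(bool read_ok, const uint64_t dte[4])
{
    if (!read_ok) {
        return -FR_DTE_RTR_ERR;
    }
    if (!(dte[0] & DTE_V)) {
        return -FR_DTE_V;
    }
    if (!(dte[0] & DTE_TV)) {
        return -FR_DTE_TV;
    }
    return 0;
}
```

Callers can then test for "ret < 0" in general and compare against a specific
negated reason when one case needs different handling, such as passing
addresses untranslated when DTE[V]=0.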
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 48 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 42 insertions(+), 6 deletions(-)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 202f0f8c6e90c..dc7531fd4a8b9 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -77,6 +77,18 @@ typedef struct AMDVIIOTLBEntry {
     uint64_t page_mask;         /* physical page size  */
 } AMDVIIOTLBEntry;
 
+/*
+ * These 'fault' reasons have an overloaded meaning since they are not only
+ * intended for describing reasons that generate an IO_PAGE_FAULT as per the AMD
+ * IOMMU specification, but are also used to signal internal errors in the
+ * emulation code.
+ */
+typedef enum AMDVIFaultReason {
+    AMDVI_FR_DTE_RTR_ERR = 1,   /* Failure to retrieve DTE */
+    AMDVI_FR_DTE_V,             /* DTE[V] = 0 */
+    AMDVI_FR_DTE_TV,            /* DTE[TV] = 0 */
+} AMDVIFaultReason;
+
 uint64_t amdvi_extended_feature_register(AMDVIState *s)
 {
     uint64_t feature = AMDVI_DEFAULT_EXT_FEATURES;
@@ -524,6 +536,28 @@ static inline uint64_t amdvi_get_pte_entry(AMDVIState *s, uint64_t pte_addr,
     return pte;
 }
 
+static int amdvi_as_to_dte(AMDVIAddressSpace *as, uint64_t *dte)
+{
+    uint16_t devid = PCI_BUILD_BDF(as->bus_num, as->devfn);
+    AMDVIState *s = as->iommu_state;
+
+    if (!amdvi_get_dte(s, devid, dte)) {
+        /* Unable to retrieve DTE for devid */
+        return -AMDVI_FR_DTE_RTR_ERR;
+    }
+
+    if (!(dte[0] & AMDVI_DEV_VALID)) {
+        /* DTE[V] not set, address is passed untranslated for devid */
+        return -AMDVI_FR_DTE_V;
+    }
+
+    if (!(dte[0] & AMDVI_DEV_TRANSLATION_VALID)) {
+        /* DTE[TV] not set, host page table not valid for devid */
+        return -AMDVI_FR_DTE_TV;
+    }
+    return 0;
+}
+
 /* log error without aborting since linux seems to be using reserved bits */
 static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
 {
@@ -1081,6 +1115,7 @@ static void amdvi_do_translate(AMDVIAddressSpace *as, hwaddr addr,
     uint16_t devid = PCI_BUILD_BDF(as->bus_num, as->devfn);
     AMDVIIOTLBEntry *iotlb_entry = amdvi_iotlb_lookup(s, addr, devid);
     uint64_t entry[4];
+    int dte_ret;
 
     if (iotlb_entry) {
         trace_amdvi_iotlb_hit(PCI_BUS_NUM(devid), PCI_SLOT(devid),
@@ -1092,13 +1127,14 @@ static void amdvi_do_translate(AMDVIAddressSpace *as, hwaddr addr,
         return;
     }
 
-    if (!amdvi_get_dte(s, devid, entry)) {
-        return;
-    }
+    dte_ret = amdvi_as_to_dte(as, entry);
 
-    /* devices with V = 0 are not translated */
-    if (!(entry[0] & AMDVI_DEV_VALID)) {
-        goto out;
+    if (dte_ret < 0) {
+        if (dte_ret == -AMDVI_FR_DTE_V) {
+            /* DTE[V]=0, address is passed untranslated */
+            goto out;
+        }
+        return;
     }
 
     amdvi_page_walk(as, entry, ret,
-- 
2.43.5
* [PATCH v3 06/22] amd_iommu: Return an error when unable to read PTE from guest memory
  2025-09-19 21:34 [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (4 preceding siblings ...)
  2025-09-19 21:34 ` [PATCH v3 05/22] amd_iommu: Add helper function to extract the DTE Alejandro Jimenez
@ 2025-09-19 21:34 ` Alejandro Jimenez
  2025-09-19 21:35 ` [PATCH v3 07/22] amd_iommu: Add helpers to walk AMD v1 Page Table format Alejandro Jimenez
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Alejandro Jimenez @ 2025-09-19 21:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
Make amdvi_get_pte_entry() return an error value (-1) in cases where the
memory read fails, versus the current return of 0 to indicate failure.
The reason is that 0 is also a valid value to have stored in the PTE in
guest memory, i.e. the guest does not have a mapping. Before this change,
amdvi_get_pte_entry() returned 0 for both an error and for empty PTEs, but
the page walker implementation that will be introduced in upcoming changes
needs a method to differentiate between the two scenarios.
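A small sketch of how a caller might tell the two cases apart once the error
sentinel is in place. PteKind, classify_pte and PTE_READ_ERROR are hypothetical
names used only for this illustration; the sentinel value matches the
(uint64_t)-1 introduced by this patch.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_READ_ERROR ((uint64_t)-1)   /* sentinel for a failed read */

typedef enum {
    PTE_EMPTY,    /* guest has no mapping: PTE reads back as 0 */
    PTE_ERROR,    /* PTE could not be read from guest memory */
    PTE_USABLE,   /* anything else: needs further decoding */
} PteKind;

/* Classify the raw value returned by a PTE fetch helper. */
static PteKind classify_pte(uint64_t pte)
{
    if (pte == PTE_READ_ERROR) {
        return PTE_ERROR;
    }
    if (pte == 0) {
        return PTE_EMPTY;
    }
    return PTE_USABLE;
}
```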
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index dc7531fd4a8b9..29ed3f0ef292e 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -528,7 +528,7 @@ static inline uint64_t amdvi_get_pte_entry(AMDVIState *s, uint64_t pte_addr,
                         &pte, sizeof(pte), MEMTXATTRS_UNSPECIFIED)) {
         trace_amdvi_get_pte_hwerror(pte_addr);
         amdvi_log_pagetab_error(s, devid, pte_addr, 0);
-        pte = 0;
+        pte = (uint64_t)-1;
         return pte;
     }
 
@@ -1081,7 +1081,7 @@ static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
             /* add offset and load pte */
             pte_addr += ((addr >> (3 + 9 * level)) & 0x1FF) << 3;
             pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
-            if (!pte) {
+            if (!pte || (pte == (uint64_t)-1)) {
                 return;
             }
             oldlevel = level;
-- 
2.43.5
* [PATCH v3 07/22] amd_iommu: Add helpers to walk AMD v1 Page Table format
  2025-09-19 21:34 [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (5 preceding siblings ...)
  2025-09-19 21:34 ` [PATCH v3 06/22] amd_iommu: Return an error when unable to read PTE from guest memory Alejandro Jimenez
@ 2025-09-19 21:35 ` Alejandro Jimenez
  2025-09-19 21:35 ` [PATCH v3 08/22] amd_iommu: Add a page walker to sync shadow page tables on invalidation Alejandro Jimenez
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Alejandro Jimenez @ 2025-09-19 21:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
The current amdvi_page_walk() is designed to be called by the replay()
method. Rather than drastically altering it, introduce helpers to fetch
guest PTEs that will be used by a page walker implementation.
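For readers unfamiliar with the v1 page table layout, the sketch below shows
the per-level index arithmetic, mirroring the expression already used by the
existing amdvi_page_walk(). The helper names are hypothetical and levels are
assumed to be numbered from 1 (4K leaf) upwards, as in that code; the macros
added by this patch may use a different convention.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Byte offset of the PTE for 'iova' within the table at 'level'.
 * Each table holds 512 eight-byte entries, so 9 IOVA bits select the entry.
 */
static uint64_t pte_offset(uint64_t iova, unsigned level)
{
    return ((iova >> (3 + 9 * level)) & 0x1FF) << 3;
}

/* Size of the region mapped by a single entry at 'level'. */
static uint64_t level_page_size(unsigned level)
{
    return 1ULL << (3 + 9 * level);
}
```

Under this numbering an entry at level 1 maps 4K, at level 2 maps 2M, and at
level 3 maps 1G.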
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 123 ++++++++++++++++++++++++++++++++++++++++++++
 hw/i386/amd_iommu.h |  40 ++++++++++++++
 2 files changed, 163 insertions(+)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 29ed3f0ef292e..c25981ff93c02 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -87,6 +87,8 @@ typedef enum AMDVIFaultReason {
     AMDVI_FR_DTE_RTR_ERR = 1,   /* Failure to retrieve DTE */
     AMDVI_FR_DTE_V,             /* DTE[V] = 0 */
     AMDVI_FR_DTE_TV,            /* DTE[TV] = 0 */
+    AMDVI_FR_PT_ROOT_INV,       /* Page Table Root ptr invalid */
+    AMDVI_FR_PT_ENTRY_INV,      /* Failure to read PTE from guest memory */
 } AMDVIFaultReason;
 
 uint64_t amdvi_extended_feature_register(AMDVIState *s)
@@ -558,6 +560,127 @@ static int amdvi_as_to_dte(AMDVIAddressSpace *as, uint64_t *dte)
     return 0;
 }
 
+/*
+ * For a PTE encoding a large page, return the page size it encodes as described
+ * by the AMD IOMMU Specification Table 14: Example Page Size Encodings.
+ * No need to adjust the value of the PTE to point to the first PTE in the large
+ * page since the encoding guarantees all "base" PTEs in the large page are the
+ * same.
+ */
+static uint64_t large_pte_page_size(uint64_t pte)
+{
+    assert(PTE_NEXT_LEVEL(pte) == 7);
+
+    /* Determine size of the large/contiguous page encoded in the PTE */
+    return PTE_LARGE_PAGE_SIZE(pte);
+}
+
+/*
+ * Helper function to fetch a PTE using AMD v1 pgtable format.
+ * On successful page walk, returns 0 and pte parameter points to a valid PTE.
+ * On failure, returns:
+ * -AMDVI_FR_PT_ROOT_INV: A page walk is not possible due to conditions like DTE
+ *      with invalid permissions, Page Table Root can not be read from DTE, or a
+ *      larger IOVA than supported by page table level encoded in DTE[Mode].
+ * -AMDVI_FR_PT_ENTRY_INV: A PTE could not be read from guest memory during a
+ *      page table walk. This means that the DTE has valid data, but one of the
+ *      lower level entries in the Page Table could not be read.
+ */
+static int __attribute__((unused))
+fetch_pte(AMDVIAddressSpace *as, hwaddr address, uint64_t dte, uint64_t *pte,
+          hwaddr *page_size)
+{
+    IOMMUAccessFlags perms = amdvi_get_perms(dte);
+
+    uint8_t level, mode;
+    uint64_t pte_addr;
+
+    *pte = dte;
+    *page_size = 0;
+
+    if (perms == IOMMU_NONE) {
+        return -AMDVI_FR_PT_ROOT_INV;
+    }
+
+    /*
+     * The Linux kernel driver initializes the default mode to 3, corresponding
+     * to a 39-bit GPA space, where each entry in the pagetable translates to a
+     * 1GB (2^30) page size.
+     */
+    level = mode = get_pte_translation_mode(dte);
+    assert(mode > 0 && mode < 7);
+
+    /*
+     * If IOVA is larger than the max supported by the current pgtable level,
+     * there is nothing to do.
+     */
+    if (address > PT_LEVEL_MAX_ADDR(mode - 1)) {
+        /* IOVA too large for the current DTE */
+        return -AMDVI_FR_PT_ROOT_INV;
+    }
+
+    do {
+        level -= 1;
+
+        /* Update the page_size */
+        *page_size = PTE_LEVEL_PAGE_SIZE(level);
+
+        /* Permission bits are ANDed at every level, including the DTE */
+        perms &= amdvi_get_perms(*pte);
+        if (perms == IOMMU_NONE) {
+            return 0;
+        }
+
+        /* Not Present */
+        if (!IOMMU_PTE_PRESENT(*pte)) {
+            return 0;
+        }
+
+        /* Large or Leaf PTE found */
+        if (PTE_NEXT_LEVEL(*pte) == 7 || PTE_NEXT_LEVEL(*pte) == 0) {
+            /* Leaf PTE found */
+            break;
+        }
+
+        /*
+         * Index the pgtable using the IOVA bits corresponding to current level
+         * and walk down to the lower level.
+         */
+        pte_addr = NEXT_PTE_ADDR(*pte, level, address);
+        *pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
+
+        if (*pte == (uint64_t)-1) {
+            /*
+             * A returned PTE of -1 indicates a failure to read the page table
+             * entry from guest memory.
+             */
+            if (level == mode - 1) {
+                /* Failure to retrieve the Page Table from Root Pointer */
+                *page_size = 0;
+                return -AMDVI_FR_PT_ROOT_INV;
+            } else {
+                /* Failure to read PTE. Page walk skips a page_size chunk */
+                return -AMDVI_FR_PT_ENTRY_INV;
+            }
+        }
+    } while (level > 0);
+
+    assert(PTE_NEXT_LEVEL(*pte) == 0 || PTE_NEXT_LEVEL(*pte) == 7 ||
+           level == 0);
+    /*
+     * Page walk ends when Next Level field on PTE shows that either a leaf PTE
+     * or a series of large PTEs have been reached. In the latter case, even if
+     * the range starts in the middle of a contiguous page, the returned PTE
+     * must be the first PTE of the series.
+     */
+    if (PTE_NEXT_LEVEL(*pte) == 7) {
+        /* Update page_size with the large PTE page size */
+        *page_size = large_pte_page_size(*pte);
+    }
+
+    return 0;
+}
+
 /* log error without aborting since linux seems to be using reserved bits */
 static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
 {
diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
index c1170a820257e..9f833b297d25c 100644
--- a/hw/i386/amd_iommu.h
+++ b/hw/i386/amd_iommu.h
@@ -178,6 +178,46 @@
 #define AMDVI_GATS_MODE                 (2ULL <<  12)
 #define AMDVI_HATS_MODE                 (2ULL <<  10)
 
+/* Page Table format */
+
+#define AMDVI_PTE_PR                    (1ULL << 0)
+#define AMDVI_PTE_NEXT_LEVEL_MASK       GENMASK64(11, 9)
+
+#define IOMMU_PTE_PRESENT(pte)          ((pte) & AMDVI_PTE_PR)
+
+/* Using level=0 for leaf PTE at 4K page size */
+#define PT_LEVEL_SHIFT(level)           (12 + ((level) * 9))
+
+/* Return IOVA bit group used to index the Page Table at specific level */
+#define PT_LEVEL_INDEX(level, iova)     (((iova) >> PT_LEVEL_SHIFT(level)) & \
+                                        GENMASK64(8, 0))
+
+/* Return the max address for a specified level i.e. max_oaddr */
+#define PT_LEVEL_MAX_ADDR(x)    (((x) < 5) ? \
+                                ((1ULL << PT_LEVEL_SHIFT((x) + 1)) - 1) : \
+                                (~(0ULL)))
+
+/* Extract the NextLevel field from PTE/PDE */
+#define PTE_NEXT_LEVEL(pte)     (((pte) & AMDVI_PTE_NEXT_LEVEL_MASK) >> 9)
+
+/* Take page table level and return default page size for level */
+#define PTE_LEVEL_PAGE_SIZE(level)      (1ULL << (PT_LEVEL_SHIFT(level)))
+
+/*
+ * Return address of lower level page table encoded in PTE and specified by
+ * current level and corresponding IOVA bit group at such level.
+ */
+#define NEXT_PTE_ADDR(pte, level, iova) (((pte) & AMDVI_DEV_PT_ROOT_MASK) + \
+                                        (PT_LEVEL_INDEX(level, iova) * 8))
+
+/*
+ * Take a PTE value with NextLevel=0x07 and return the page size it encodes.
+ */
+#define PTE_LARGE_PAGE_SIZE(pte)    (1ULL << (1 + cto64(((pte) | 0xfffULL))))
+
+/* Return number of PTEs to use for a given page size (expected power of 2) */
+#define PAGE_SIZE_PTE_COUNT(pgsz)       (1ULL << ((ctz64(pgsz) - 12) % 9))
+
 /* IOTLB */
 #define AMDVI_IOTLB_MAX_SIZE 1024
 #define AMDVI_DEVID_SHIFT    36
-- 
2.43.5
* [PATCH v3 08/22] amd_iommu: Add a page walker to sync shadow page tables on invalidation
From: Alejandro Jimenez @ 2025-09-19 21:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
For the specified address range, walk the page table identifying regions
as mapped or unmapped and invoke registered notifiers with the
corresponding event type.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
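Reviewer note (not part of the commit): a minimal sketch of how the walker
below steps through a range, assuming a fixed page size for simplicity. The
names next_iova and count_pages are hypothetical; in the patch the page size
comes from fetch_pte() and may change on every iteration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Compute the start of the next page. Returns false when the addition
 * wraps around 64 bits, in which case the walk must stop.
 */
static bool next_iova(uint64_t iova, uint64_t pagesize, uint64_t *next)
{
    uint64_t page_mask = ~(pagesize - 1);

    *next = (iova & page_mask) + pagesize;
    return *next >= iova;
}

/* Count the pages visited over [addr, addr + size - 1] */
static unsigned count_pages(uint64_t addr, uint64_t size, uint64_t pagesize)
{
    uint64_t iova = addr;
    uint64_t end = addr + size - 1;
    unsigned n = 0;

    while (iova < end) {
        n++;    /* the real walker classifies the page and notifies here */
        if (!next_iova(iova, pagesize, &iova)) {
            break;
        }
    }
    return n;
}
```

An unaligned start address still advances to the next page boundary, and a
range ending at the top of the 64-bit space terminates through the wrap check
rather than looping.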
 hw/i386/amd_iommu.c | 80 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 80 insertions(+)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index c25981ff93c02..0e45435c77be9 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -681,6 +681,86 @@ fetch_pte(AMDVIAddressSpace *as, hwaddr address, uint64_t dte, uint64_t *pte,
     return 0;
 }
 
+/*
+ * Walk the guest page table for an IOVA and range and signal the registered
+ * notifiers to sync the shadow page tables in the host.
+ * Must be called with a valid DTE for DMA remapping i.e. V=1,TV=1
+ */
+static void __attribute__((unused))
+amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as, uint64_t *dte,
+                                   hwaddr addr, uint64_t size, bool send_unmap)
+{
+    IOMMUTLBEvent event;
+
+    hwaddr iova_next, page_mask, pagesize;
+    hwaddr iova = addr;
+    hwaddr end = iova + size - 1;
+
+    uint64_t pte;
+    int ret;
+
+    while (iova < end) {
+
+        ret = fetch_pte(as, iova, dte[0], &pte, &pagesize);
+
+        if (ret == -AMDVI_FR_PT_ROOT_INV) {
+            /*
+             * Invalid conditions such as the IOVA being larger than supported
+             * by current page table mode as configured in the DTE, or a failure
+             * to fetch the Page Table from the Page Table Root Pointer in DTE.
+             */
+            assert(pagesize == 0);
+            return;
+        }
+        /* PTE has been validated for major errors and pagesize is set */
+        assert(pagesize);
+        page_mask = ~(pagesize - 1);
+        iova_next = (iova & page_mask) + pagesize;
+
+        if (ret == -AMDVI_FR_PT_ENTRY_INV) {
+            /*
+             * Failure to read PTE from memory, the pagesize matches the current
+             * level. Unable to determine the region type, so a safe strategy is
+             * to skip the range and continue the page walk.
+             */
+            goto next;
+        }
+
+        event.entry.target_as = &address_space_memory;
+        event.entry.iova = iova & page_mask;
+        /* translated_addr is irrelevant for the unmap case */
+        event.entry.translated_addr = (pte & AMDVI_DEV_PT_ROOT_MASK) &
+                                      page_mask;
+        event.entry.addr_mask = ~page_mask;
+        event.entry.perm = amdvi_get_perms(pte);
+
+        /*
+         * In cases where the leaf PTE is not found, or it has invalid
+         * permissions, an UNMAP type notification is sent, but only if the
+         * caller requested it.
+         */
+        if (!IOMMU_PTE_PRESENT(pte) || (event.entry.perm == IOMMU_NONE)) {
+            if (!send_unmap) {
+                goto next;
+            }
+            event.type = IOMMU_NOTIFIER_UNMAP;
+        } else {
+            event.type = IOMMU_NOTIFIER_MAP;
+        }
+
+        /* Invoke the notifiers registered for this address space */
+        memory_region_notify_iommu(&as->iommu, 0, event);
+
+next:
+        /* Check for 64-bit overflow and terminate walk in such cases */
+        if (iova_next < iova) {
+            break;
+        } else {
+            iova = iova_next;
+        }
+    }
+}
+
 /* log error without aborting since linux seems to be using reserved bits */
 static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
 {
-- 
2.43.5
* [PATCH v3 09/22] amd_iommu: Add basic structure to support IOMMU notifier updates
From: Alejandro Jimenez @ 2025-09-19 21:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
Add the minimal data structures required to maintain a list of address
spaces (i.e. devices) with registered notifiers, and to update the type of
events that require notifications.
Note that the ability to register for MAP notifications is not available.
It will be unblocked by subsequent changes that enable the synchronization of
guest I/O page tables with host IOMMU state, at which point an amd-iommu
device property will be introduced to control this capability.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
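Reviewer note (not part of the commit): the list bookkeeping done on a flag
change can be sketched with a plain singly linked list. All types and names
below (addr_space, iommu_state, notify_flag_changed) are hypothetical and only
mirror the insert/remove conditions in the patch; the patch itself uses the
QLIST macros.

```c
#include <assert.h>
#include <stddef.h>

enum { NOTIFY_NONE = 0, NOTIFY_UNMAP = 1 };

struct addr_space {
    int notifier_flags;
    struct addr_space *next;
};

struct iommu_state {
    struct addr_space *with_notifiers;   /* head of the list */
};

/* Unlink 'as' from the list if it is present */
static void list_remove(struct iommu_state *s, struct addr_space *as)
{
    struct addr_space **pp = &s->with_notifiers;

    while (*pp && *pp != as) {
        pp = &(*pp)->next;
    }
    if (*pp) {
        *pp = as->next;
        as->next = NULL;
    }
}

/* Track the address space while it has at least one notifier flag set */
static void notify_flag_changed(struct iommu_state *s, struct addr_space *as,
                                int old_flags, int new_flags)
{
    as->notifier_flags = new_flags;

    if (old_flags == NOTIFY_NONE) {
        as->next = s->with_notifiers;    /* first notifier: insert at head */
        s->with_notifiers = as;
    } else if (new_flags == NOTIFY_NONE) {
        list_remove(s, as);              /* last notifier gone: drop it */
    }
}
```

A change between two non-empty flag sets leaves the list untouched, which
matches the if/else-if structure in amdvi_iommu_notify_flag_changed().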
 hw/i386/amd_iommu.c | 20 ++++++++++++++++++++
 hw/i386/amd_iommu.h |  3 +++
 2 files changed, 23 insertions(+)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 0e45435c77be9..d8a451b3a5ff1 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -66,6 +66,11 @@ struct AMDVIAddressSpace {
     MemoryRegion iommu_nodma;   /* Alias of shared nodma memory region  */
     MemoryRegion iommu_ir;      /* Device's interrupt remapping region  */
     AddressSpace as;            /* device's corresponding address space */
+
+    /* DMA address translation support */
+    IOMMUNotifierFlag notifier_flags;
+    /* entry in list of Address spaces with registered notifiers */
+    QLIST_ENTRY(AMDVIAddressSpace) next;
 };
 
 /* AMDVI cache entry */
@@ -1773,6 +1778,7 @@ static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
         iommu_as[devfn]->bus_num = (uint8_t)bus_num;
         iommu_as[devfn]->devfn = (uint8_t)devfn;
         iommu_as[devfn]->iommu_state = s;
+        iommu_as[devfn]->notifier_flags = IOMMU_NOTIFIER_NONE;
 
         amdvi_dev_as = iommu_as[devfn];
 
@@ -1846,6 +1852,7 @@ static int amdvi_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
                                            Error **errp)
 {
     AMDVIAddressSpace *as = container_of(iommu, AMDVIAddressSpace, iommu);
+    AMDVIState *s = as->iommu_state;
 
     if (new & IOMMU_NOTIFIER_MAP) {
         error_setg(errp,
@@ -1854,6 +1861,19 @@ static int amdvi_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
                    PCI_FUNC(as->devfn));
         return -EINVAL;
     }
+
+    /*
+     * Update notifier flags for address space and the list of address spaces
+     * with registered notifiers.
+     */
+    as->notifier_flags = new;
+
+    if (old == IOMMU_NOTIFIER_NONE) {
+        QLIST_INSERT_HEAD(&s->amdvi_as_with_notifiers, as, next);
+    } else if (new == IOMMU_NOTIFIER_NONE) {
+        QLIST_REMOVE(as, next);
+    }
+
     return 0;
 }
 
diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
index 9f833b297d25c..b51aa74368995 100644
--- a/hw/i386/amd_iommu.h
+++ b/hw/i386/amd_iommu.h
@@ -409,6 +409,9 @@ struct AMDVIState {
     /* for each served device */
     AMDVIAddressSpace **address_spaces[PCI_BUS_MAX];
 
+    /* list of address spaces with registered notifiers */
+    QLIST_HEAD(, AMDVIAddressSpace) amdvi_as_with_notifiers;
+
     /* IOTLB */
     GHashTable *iotlb;
 
-- 
2.43.5
* [PATCH v3 10/22] amd_iommu: Sync shadow page tables on page invalidation
From: Alejandro Jimenez @ 2025-09-19 21:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
When the guest issues an INVALIDATE_IOMMU_PAGES command, decode the address
and size of the invalidation and sync the guest page table state with the
host. This requires walking the guest page table and calling notifiers
registered for address spaces matching the domain ID encoded in the command.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
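Reviewer note (not part of the commit): a simplified sketch of the per-domain
sync described above. The struct dev_as, the INV_ALL_PAGES marker and both
helpers are hypothetical; they only illustrate the range alignment and the
fact that every matching address space is visited instead of returning at the
first match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker for "invalidate everything", not the QEMU constant */
#define INV_ALL_PAGES UINT64_MAX

struct dev_as {
    uint16_t domid;          /* DomainID read from the device's DTE */
    bool map_notifier;       /* a MAP notifier is registered */
};

/* Start of the range to sync: 0 for a full flush, else aligned down to size */
static uint64_t sync_start(uint64_t addr, uint64_t size)
{
    return size == INV_ALL_PAGES ? 0 : addr & ~(size - 1);
}

/* Visit every address space in the domain; do not stop at the first match */
static unsigned sync_domain(const struct dev_as *as, size_t n, uint16_t domid)
{
    unsigned synced = 0;

    for (size_t i = 0; i < n; i++) {
        if (as[i].domid != domid || !as[i].map_notifier) {
            continue;
        }
        synced++;            /* the real code walks the page table here */
    }
    return synced;
}
```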
 hw/i386/amd_iommu.c | 82 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 74 insertions(+), 8 deletions(-)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index d8a451b3a5ff1..caae65c4b3565 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -591,9 +591,8 @@ static uint64_t large_pte_page_size(uint64_t pte)
  *      page table walk. This means that the DTE has valid data, but one of the
  *      lower level entries in the Page Table could not be read.
  */
-static int __attribute__((unused))
-fetch_pte(AMDVIAddressSpace *as, hwaddr address, uint64_t dte, uint64_t *pte,
-          hwaddr *page_size)
+static uint64_t fetch_pte(AMDVIAddressSpace *as, hwaddr address, uint64_t dte,
+                          uint64_t *pte, hwaddr *page_size)
 {
     IOMMUAccessFlags perms = amdvi_get_perms(dte);
 
@@ -691,9 +690,9 @@ fetch_pte(AMDVIAddressSpace *as, hwaddr address, uint64_t dte, uint64_t *pte,
  * notifiers to sync the shadow page tables in the host.
  * Must be called with a valid DTE for DMA remapping i.e. V=1,TV=1
  */
-static void __attribute__((unused))
-amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as, uint64_t *dte,
-                                   hwaddr addr, uint64_t size, bool send_unmap)
+static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
+                                               uint64_t *dte, hwaddr addr,
+                                               uint64_t size, bool send_unmap)
 {
     IOMMUTLBEvent event;
 
@@ -835,8 +834,7 @@ static gboolean amdvi_iotlb_remove_by_domid(gpointer key, gpointer value,
  * first zero at bit 51 or larger is a request to invalidate the entire address
  * space.
  */
-static uint64_t __attribute__((unused))
-amdvi_decode_invalidation_size(hwaddr addr, uint16_t flags)
+static uint64_t amdvi_decode_invalidation_size(hwaddr addr, uint16_t flags)
 {
     uint64_t size = AMDVI_PAGE_SIZE;
     uint8_t fzbit = 0;
@@ -853,10 +851,76 @@ amdvi_decode_invalidation_size(hwaddr addr, uint16_t flags)
     return size;
 }
 
+/*
+ * Synchronize the guest page tables with the shadow page tables kept in the
+ * host for the specified range.
+ * The invalidation command issued by the guest and intercepted by the VMM
+ * does not specify a device, but a domain, since all devices in the same domain
+ * share the same page tables. However, vIOMMU emulation creates separate
+ * address spaces per device, so it is necessary to traverse the list of all
+ * address spaces (i.e. devices) that have notifiers registered in order to
+ * propagate the changes to the host page tables.
+ * We cannot return early from this function once a matching domain has been
+ * identified and its page tables synced (based on the fact that all devices in
+ * the same domain share the page tables). The reason is that different devices
+ * (i.e. address spaces) could have different notifiers registered, and by
+ * skipping address spaces that appear later on the amdvi_as_with_notifiers list
+ * their notifiers (which could differ from the ones registered for the first
+ * device/address space) would not be invoked.
+ */
+static void amdvi_sync_domain(AMDVIState *s, uint16_t domid, uint64_t addr,
+                              uint16_t flags)
+{
+    AMDVIAddressSpace *as;
+
+    uint64_t size = amdvi_decode_invalidation_size(addr, flags);
+
+    if (size == AMDVI_INV_ALL_PAGES) {
+        addr = 0;       /* Set start address to 0 and invalidate entire AS */
+    } else {
+        addr &= ~(size - 1);
+    }
+
+    /*
+     * Call notifiers that have registered for each address space matching the
+     * domain ID, in order to sync the guest pagetable state with the host.
+     */
+    QLIST_FOREACH(as, &s->amdvi_as_with_notifiers, next) {
+
+        uint64_t dte[4] = { 0 };
+
+        /*
+         * Retrieve the Device Table entry for the devid corresponding to the
+         * current address space, and verify the DomainID matches i.e. the page
+         * tables to be synced belong to devices in the domain.
+         */
+        if (amdvi_as_to_dte(as, dte)) {
+            continue;
+        }
+
+        /* Only need to sync the Page Tables for a matching domain */
+        if (domid != (dte[1] & AMDVI_DEV_DOMID_ID_MASK)) {
+            continue;
+        }
+
+        /*
+         * We have determined that there is a valid Device Table Entry for a
+         * device matching the DomainID in the INV_IOMMU_PAGES command issued by
+         * the guest. Walk the guest page table to sync shadow page table.
+         */
+        if (as->notifier_flags & IOMMU_NOTIFIER_MAP) {
+            /* Sync guest IOMMU mappings with host */
+            amdvi_sync_shadow_page_table_range(as, &dte[0], addr, size, true);
+        }
+    }
+}
+
 /* we don't have devid - we can't remove pages by address */
 static void amdvi_inval_pages(AMDVIState *s, uint64_t *cmd)
 {
     uint16_t domid = cpu_to_le16((uint16_t)extract64(cmd[0], 32, 16));
+    uint64_t addr = cpu_to_le64(extract64(cmd[1], 12, 52)) << 12;
+    uint16_t flags = cpu_to_le16((uint16_t)extract64(cmd[1], 0, 3));
 
     if (extract64(cmd[0], 20, 12) || extract64(cmd[0], 48, 12) ||
         extract64(cmd[1], 3, 9)) {
@@ -866,6 +930,8 @@ static void amdvi_inval_pages(AMDVIState *s, uint64_t *cmd)
 
     g_hash_table_foreach_remove(s->iotlb, amdvi_iotlb_remove_by_domid,
                                 &domid);
+
+    amdvi_sync_domain(s, domid, addr, flags);
     trace_amdvi_pages_inval(domid);
 }
 
-- 
2.43.5
* [PATCH v3 11/22] amd_iommu: Use iova_tree records to determine large page size on UNMAP
From: Alejandro Jimenez @ 2025-09-19 21:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
Keep a record of mapped IOVA ranges per address space, using the iova_tree
implementation. Besides enabling optimizations like avoiding unnecessary
notifications, a record of existing <IOVA, size> mappings makes it possible
to determine if a specific IOVA is mapped by the guest using a large page,
and adjust the size when notifying UNMAP events.
When unmapping a large page, the information in the guest PTE encoding the
page size is lost, since the guest clears the PTE before issuing the
invalidation command to the IOMMU. In such a case, the size of the original
mapping can be retrieved from the iova_tree and used to issue the UNMAP
notification. Using the correct size is essential since the VFIO IOMMU
Type1v2 driver in the host kernel will reject unmap requests that do not
fully cover previous mappings.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
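Reviewer note (not part of the commit): the UNMAP size adjustment can be
sketched with a small array standing in for the iova_tree. The struct map_rec
and the helpers find_rec and prepare_unmap are hypothetical; sizes are stored
inclusive (bytes - 1), the same convention the patch uses when it copies
addr_mask into DMAMap.size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct map_rec {
    uint64_t iova;
    uint64_t size;    /* inclusive, i.e. bytes - 1, like addr_mask */
    bool used;
};

/* Return the record overlapping [iova, iova + mask], or NULL */
static struct map_rec *find_rec(struct map_rec *r, size_t n,
                                uint64_t iova, uint64_t mask)
{
    for (size_t i = 0; i < n; i++) {
        if (r[i].used && r[i].iova <= iova + mask &&
            iova <= r[i].iova + r[i].size) {
            return &r[i];
        }
    }
    return NULL;
}

/*
 * Decide whether an UNMAP must be sent. When a record exists, widen *mask to
 * the size of the original mapping and drop the record.
 */
static bool prepare_unmap(struct map_rec *r, size_t n,
                          uint64_t iova, uint64_t *mask)
{
    struct map_rec *rec = find_rec(r, n, iova, *mask);

    if (!rec) {
        return false;        /* nothing recorded, skip the notification */
    }
    if (rec->size > *mask) {
        *mask = rec->size;   /* guest cleared the PTE, use recorded size */
    }
    rec->used = false;
    return true;
}
```

With a recorded 2M mapping, an UNMAP that arrives with a 4K mask is widened to
cover the full 2M, and a repeated UNMAP for the same IOVA is skipped.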
 hw/i386/amd_iommu.c | 95 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 89 insertions(+), 6 deletions(-)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index caae65c4b3565..4376e977f8886 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -33,6 +33,7 @@
 #include "hw/i386/apic-msidef.h"
 #include "hw/qdev-properties.h"
 #include "kvm/kvm_i386.h"
+#include "qemu/iova-tree.h"
 
 /* used AMD-Vi MMIO registers */
 const char *amdvi_mmio_low[] = {
@@ -71,6 +72,8 @@ struct AMDVIAddressSpace {
     IOMMUNotifierFlag notifier_flags;
     /* entry in list of Address spaces with registered notifiers */
     QLIST_ENTRY(AMDVIAddressSpace) next;
+    /* Record DMA translation ranges */
+    IOVATree *iova_tree;
 };
 
 /* AMDVI cache entry */
@@ -685,6 +688,75 @@ static uint64_t fetch_pte(AMDVIAddressSpace *as, hwaddr address, uint64_t dte,
     return 0;
 }
 
+/*
+ * Invoke notifiers registered for the address space. Update record of mapped
+ * ranges in IOVA Tree.
+ */
+static void amdvi_notify_iommu(AMDVIAddressSpace *as, IOMMUTLBEvent *event)
+{
+    IOMMUTLBEntry *entry = &event->entry;
+
+    DMAMap target = {
+        .iova = entry->iova,
+        .size = entry->addr_mask,
+        .translated_addr = entry->translated_addr,
+        .perm = entry->perm,
+    };
+
+    /*
+     * Search the IOVA Tree for an existing translation for the target, and skip
+     * the notification if the mapping is already recorded.
+     * When the guest uses large pages, comparing against the record makes it
+     * possible to determine the size of the original MAP and adjust the UNMAP
+     * request to match it. This avoids failed checks against the mappings kept
+     * by the VFIO kernel driver.
+     */
+    const DMAMap *mapped = iova_tree_find(as->iova_tree, &target);
+
+    if (event->type == IOMMU_NOTIFIER_UNMAP) {
+        if (!mapped) {
+            /* No record exists of this mapping, nothing to do */
+            return;
+        }
+        /*
+         * Adjust the size based on the original record. This is essential to
+         * determine when large/contiguous pages are used, since the guest has
+         * already cleared the PTE (erasing the pagesize encoded on it) before
+         * issuing the invalidation command.
+         */
+        if (mapped->size != target.size) {
+            assert(mapped->size > target.size);
+            target.size = mapped->size;
+            /* Adjust event to invoke notifier with correct range */
+            entry->addr_mask = mapped->size;
+        }
+        iova_tree_remove(as->iova_tree, target);
+    } else { /* IOMMU_NOTIFIER_MAP */
+        if (mapped) {
+            /*
+             * If a mapping is present and matches the request, skip the
+             * notification.
+             */
+            if (!memcmp(mapped, &target, sizeof(DMAMap))) {
+                return;
+            } else {
+                /*
+                 * This should never happen unless a buggy guest OS omits or
+                 * sends incorrect invalidation(s). Report an error in the event
+                 * it does happen.
+                 */
+                error_report("Found conflicting translation. This could be due "
+                             "to an incorrect or missing invalidation command");
+            }
+        }
+        /* Record the new mapping */
+        iova_tree_insert(as->iova_tree, &target);
+    }
+
+    /* Invoke the notifiers registered for this address space */
+    memory_region_notify_iommu(&as->iommu, 0, *event);
+}
+
 /*
  * Walk the guest page table for an IOVA and range and signal the registered
  * notifiers to sync the shadow page tables in the host.
@@ -696,7 +768,7 @@ static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
 {
     IOMMUTLBEvent event;
 
-    hwaddr iova_next, page_mask, pagesize;
+    hwaddr page_mask, pagesize;
     hwaddr iova = addr;
     hwaddr end = iova + size - 1;
 
@@ -719,7 +791,6 @@ static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
         /* PTE has been validated for major errors and pagesize is set */
         assert(pagesize);
         page_mask = ~(pagesize - 1);
-        iova_next = (iova & page_mask) + pagesize;
 
         if (ret == -AMDVI_FR_PT_ENTRY_INV) {
             /*
@@ -752,15 +823,26 @@ static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
             event.type = IOMMU_NOTIFIER_MAP;
         }
 
-        /* Invoke the notifiers registered for this address space */
-        memory_region_notify_iommu(&as->iommu, 0, event);
+        /*
+         * The following call might need to adjust event.entry.addr_mask in
+         * cases where the guest unmapped a series of large pages.
+         */
+        amdvi_notify_iommu(as, &event);
+        /*
+         * In the special scenario where the guest is unmapping a large page,
+         * addr_mask has been adjusted before sending the notification. Update
+         * pagesize accordingly in order to correctly compute the next IOVA.
+         */
+        pagesize = event.entry.addr_mask + 1;
 
 next:
+        iova &= ~(pagesize - 1);
+
         /* Check for 64-bit overflow and terminate walk in such cases */
-        if (iova_next < iova) {
+        if ((iova + pagesize) < iova) {
             break;
         } else {
-            iova = iova_next;
+            iova += pagesize;
         }
     }
 }
@@ -1845,6 +1927,7 @@ static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
         iommu_as[devfn]->devfn = (uint8_t)devfn;
         iommu_as[devfn]->iommu_state = s;
         iommu_as[devfn]->notifier_flags = IOMMU_NOTIFIER_NONE;
+        iommu_as[devfn]->iova_tree = iova_tree_new();
 
         amdvi_dev_as = iommu_as[devfn];
 
-- 
2.43.5
* [PATCH v3 12/22] amd_iommu: Unmap all address spaces under the AMD IOMMU on reset
From: Alejandro Jimenez @ 2025-09-19 21:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
Support dropping all existing mappings on reset. When the guest kernel
reboots it will create new ones, but other components that run before
the kernel (e.g. OVMF) should not be able to use existing mappings from
the previous boot.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
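Reviewer note (not part of the commit): a sketch of how a notifier range can
be split into naturally aligned power-of-two chunks. aligned_pow2_mask is a
hypothetical, simplified stand-in for dma_aligned_pow2_mask() with the address
width fixed at 64 bits, and count_chunks mirrors the loop structure in the
patch under that assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Mask of the largest aligned power-of-two chunk starting at 'start' */
static uint64_t aligned_pow2_mask(uint64_t start, uint64_t end)
{
    uint64_t span = end - start;                     /* inclusive size - 1 */
    uint64_t align = start ? (start & (~start + 1)) - 1 : UINT64_MAX;
    uint64_t p = 1;

    if (align <= span) {
        return align;        /* limited by the alignment of start */
    }
    if (span == UINT64_MAX) {
        return UINT64_MAX;   /* the whole 64-bit space in one chunk */
    }
    /* Otherwise use the largest power of two not exceeding span + 1 */
    while (p <= (span + 1) / 2) {
        p <<= 1;
    }
    return p - 1;
}

/* Number of chunks needed to cover [start, end] with 4K granularity */
static unsigned count_chunks(uint64_t start, uint64_t end)
{
    uint64_t remain = end - start + 1;
    unsigned n = 0;

    while (remain >= 0x1000) {
        uint64_t mask = aligned_pow2_mask(start, end);

        start += mask + 1;
        remain -= mask + 1;
        n++;
    }
    return n;
}
```

For the range [0x1000, 0x4fff] this yields chunks of 4K, 8K and 4K, each
aligned to its own size and none crossing the end of the range.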
 hw/i386/amd_iommu.c | 74 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 4376e977f8886..497f18c540666 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -847,6 +847,77 @@ next:
     }
 }
 
+/*
+ * Unmap entire range that the notifier registered for i.e. the full AS.
+ *
+ * This appears equivalent to directly calling
+ * memory_region_unmap_iommu_notifier_range(), but it allows checking notifier
+ * boundaries and issuing notifications with ranges within those bounds.
+ */
+static void amdvi_address_space_unmap(AMDVIAddressSpace *as, IOMMUNotifier *n)
+{
+
+    hwaddr start = n->start;
+    hwaddr end = n->end;
+    hwaddr remain;
+    DMAMap map;
+
+    assert(start <= end);
+    remain = end - start + 1;
+
+    /*
+     * Divide the notifier range into chunks that are aligned and do not exceed
+     * the notifier boundaries.
+     */
+    while (remain >= AMDVI_PAGE_SIZE) {
+
+        IOMMUTLBEvent event;
+
+        uint64_t mask = dma_aligned_pow2_mask(start, end, 64);
+
+        event.type = IOMMU_NOTIFIER_UNMAP;
+
+        IOMMUTLBEntry entry = {
+            .target_as = &address_space_memory,
+            .iova = start,
+            .translated_addr = 0,   /* irrelevant for unmap case */
+            .addr_mask = mask,
+            .perm = IOMMU_NONE,
+        };
+        event.entry = entry;
+
+        /* Call notifier registered for updates on this address space */
+        memory_region_notify_iommu_one(n, &event);
+
+        start += mask + 1;
+        remain -= mask + 1;
+    }
+
+    assert(!remain);
+
+    map.iova = n->start;
+    map.size = n->end - n->start;
+
+    iova_tree_remove(as->iova_tree, map);
+}
+
+/*
+ * For all the address spaces with notifiers registered, unmap the entire range
+ * the notifier registered for i.e. clear all the address spaces managed by the
+ * IOMMU.
+ */
+static void amdvi_address_space_unmap_all(AMDVIState *s)
+{
+    AMDVIAddressSpace *as;
+    IOMMUNotifier *n;
+
+    QLIST_FOREACH(as, &s->amdvi_as_with_notifiers, next) {
+        IOMMU_NOTIFIER_FOREACH(n, &as->iommu) {
+            amdvi_address_space_unmap(as, n);
+        }
+    }
+}
+
 /* log error without aborting since linux seems to be using reserved bits */
 static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
 {
@@ -2099,6 +2170,9 @@ static void amdvi_sysbus_reset(DeviceState *dev)
 
     msi_reset(&s->pci->dev);
     amdvi_init(s);
+
+    /* Discard all mappings on device reset */
+    amdvi_address_space_unmap_all(s);
 }
 
 static const VMStateDescription vmstate_amdvi_sysbus_migratable = {
-- 
2.43.5
* [PATCH v3 13/22] amd_iommu: Add replay callback
  2025-09-19 21:34 [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (11 preceding siblings ...)
  2025-09-19 21:35 ` [PATCH v3 12/22] amd_iommu: Unmap all address spaces under the AMD IOMMU on reset Alejandro Jimenez
@ 2025-09-19 21:35 ` Alejandro Jimenez
  2025-09-19 21:35 ` [PATCH v3 14/22] amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL Alejandro Jimenez
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Alejandro Jimenez @ 2025-09-19 21:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
A replay() method is necessary to efficiently synchronize the host page
tables after VFIO registers a notifier for IOMMU events. It is called to
ensure that existing mappings from an IOMMU memory region are "replayed" to
a specified notifier, initializing or updating the shadow page tables on the
host.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 497f18c540666..9027f7c0544a7 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -918,6 +918,29 @@ static void amdvi_address_space_unmap_all(AMDVIState *s)
     }
 }
 
+/*
+ * For every translation present in the IOMMU, construct IOMMUTLBEntry data
+ * and pass it as parameter to notifier callback.
+ */
+static void amdvi_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
+{
+    AMDVIAddressSpace *as = container_of(iommu_mr, AMDVIAddressSpace, iommu);
+    uint64_t dte[4] = { 0 };
+
+    if (!(n->notifier_flags & IOMMU_NOTIFIER_MAP)) {
+        return;
+    }
+
+    if (amdvi_as_to_dte(as, dte)) {
+        return;
+    }
+
+    /* Dropping all mappings for the address space. Also clears the IOVA tree */
+    amdvi_address_space_unmap(as, n);
+
+    amdvi_sync_shadow_page_table_range(as, &dte[0], 0, UINT64_MAX, false);
+}
+
 /* log error without aborting since linux seems to be using reserved bits */
 static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
 {
@@ -2364,6 +2387,7 @@ static void amdvi_iommu_memory_region_class_init(ObjectClass *klass,
 
     imrc->translate = amdvi_translate;
     imrc->notify_flag_changed = amdvi_iommu_notify_flag_changed;
+    imrc->replay = amdvi_iommu_replay;
 }
 
 static const TypeInfo amdvi_iommu_memory_region_info = {
-- 
2.43.5
* [PATCH v3 14/22] amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL
  2025-09-19 21:34 [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (12 preceding siblings ...)
  2025-09-19 21:35 ` [PATCH v3 13/22] amd_iommu: Add replay callback Alejandro Jimenez
@ 2025-09-19 21:35 ` Alejandro Jimenez
  2025-09-19 21:35 ` [PATCH v3 15/22] amd_iommu: Toggle memory regions based on address translation mode Alejandro Jimenez
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Alejandro Jimenez @ 2025-09-19 21:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
When the kernel IOMMU driver issues an INVALIDATE_IOMMU_ALL, the address
translation and interrupt remapping information must be cleared for all
Device IDs and all domains. Introduce a helper to sync the shadow page table
for all the address spaces with registered notifiers, which replays both MAP
and UNMAP events.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
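Note for readers: the difference between replay() (previous patch) and the
sync helper added here comes down to which events each one emits for a given
set of notifier flags. The sketch below restates that policy as I read it
from the two patches. The enum, the function names and the flag values are
hypothetical and exist only for this illustration; they are not QEMU
definitions.

```c
/* Illustrative flag values, standing in for IOMMU_NOTIFIER_UNMAP/MAP */
#define NOTIFIER_UNMAP 0x1u
#define NOTIFIER_MAP   0x2u

/* Hypothetical labels for what each path ends up doing */
enum sync_action {
    ACT_NONE,               /* nothing to do */
    ACT_UNMAP_ALL,          /* drop every mapping in the notifier range */
    ACT_UNMAP_THEN_MAP,     /* drop everything, then walk and send MAPs */
    ACT_SYNC_MAP_AND_UNMAP, /* walk and send both MAP and UNMAP events */
};

/* replay(): only meaningful for notifiers interested in MAP events */
static enum sync_action replay_action(unsigned flags)
{
    return (flags & NOTIFIER_MAP) ? ACT_UNMAP_THEN_MAP : ACT_NONE;
}

/* sync after a global invalidation: UNMAP-only notifiers are flushed */
static enum sync_action sync_action_for(unsigned flags)
{
    return (flags & NOTIFIER_MAP) ? ACT_SYNC_MAP_AND_UNMAP : ACT_UNMAP_ALL;
}
```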
 hw/i386/amd_iommu.c | 48 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 9027f7c0544a7..d74d42b3dda8e 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -941,6 +941,47 @@ static void amdvi_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
     amdvi_sync_shadow_page_table_range(as, &dte[0], 0, UINT64_MAX, false);
 }
 
+static void amdvi_address_space_sync(AMDVIAddressSpace *as)
+{
+    IOMMUNotifier *n;
+    uint64_t dte[4] = { 0 };
+
+    /* If only UNMAP notifiers are registered, drop all existing mappings */
+    if (!(as->notifier_flags & IOMMU_NOTIFIER_MAP)) {
+        IOMMU_NOTIFIER_FOREACH(n, &as->iommu) {
+            /*
+             * Directly calling memory_region_unmap_iommu_notifier_range() does
+             * not guarantee that the addr_mask eventually passed as parameter
+             * to the notifier is valid. Use amdvi_address_space_unmap() which
+             * ensures the notifier range is divided into properly aligned
+             * regions, and issues notifications for each one.
+             */
+            amdvi_address_space_unmap(as, n);
+        }
+        return;
+    }
+
+    if (amdvi_as_to_dte(as, dte)) {
+        return;
+    }
+
+    amdvi_sync_shadow_page_table_range(as, &dte[0], 0, UINT64_MAX, true);
+}
+
+/*
+ * This differs from the replay() method in that it issues both MAP and UNMAP
+ * notifications since it is called after global invalidation events in order to
+ * re-sync all address spaces.
+ */
+static void amdvi_iommu_address_space_sync_all(AMDVIState *s)
+{
+    AMDVIAddressSpace *as;
+
+    QLIST_FOREACH(as, &s->amdvi_as_with_notifiers, next) {
+        amdvi_address_space_sync(as);
+    }
+}
+
 /* log error without aborting since linux seems to be using reserved bits */
 static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
 {
@@ -983,6 +1024,13 @@ static void amdvi_inval_all(AMDVIState *s, uint64_t *cmd)
     amdvi_intremap_inval_notify_all(s, true, 0, 0);
 
     amdvi_iotlb_reset(s);
+
+    /*
+     * Fully replay the address space i.e. send both UNMAP and MAP events in
+     * order to synchronize guest and host IO page tables.
+     */
+    amdvi_iommu_address_space_sync_all(s);
+
     trace_amdvi_all_inval();
 }
 
-- 
2.43.5
* [PATCH v3 15/22] amd_iommu: Toggle memory regions based on address translation mode
  2025-09-19 21:34 [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (13 preceding siblings ...)
  2025-09-19 21:35 ` [PATCH v3 14/22] amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL Alejandro Jimenez
@ 2025-09-19 21:35 ` Alejandro Jimenez
  2025-09-19 21:35 ` [PATCH v3 16/22] amd_iommu: Set all address spaces to use passthrough mode on reset Alejandro Jimenez
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Alejandro Jimenez @ 2025-09-19 21:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
Enable the appropriate memory region for an address space depending on the
address translation mode selected for it. This is currently based on a
generic x86 IOMMU property, and only done during the address space
initialization. Extract the code into a helper and toggle the regions based
on whether the specific address space is using address translation (via the
newly introduced addr_translation field). Later, region activation will also
be controlled by availability of DMA remapping capability (via dma-remap
property to be introduced in follow up changes).
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index d74d42b3dda8e..67a26f524706b 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -74,6 +74,8 @@ struct AMDVIAddressSpace {
     QLIST_ENTRY(AMDVIAddressSpace) next;
     /* Record DMA translation ranges */
     IOVATree *iova_tree;
+    /* DMA address translation active */
+    bool addr_translation;
 };
 
 /* AMDVI cache entry */
@@ -982,6 +984,23 @@ static void amdvi_iommu_address_space_sync_all(AMDVIState *s)
     }
 }
 
+/*
+ * Toggle between address translation and passthrough modes by enabling the
+ * corresponding memory regions.
+ */
+static void amdvi_switch_address_space(AMDVIAddressSpace *amdvi_as)
+{
+    if (amdvi_as->addr_translation) {
+        /* Enabling DMA region */
+        memory_region_set_enabled(&amdvi_as->iommu_nodma, false);
+        memory_region_set_enabled(MEMORY_REGION(&amdvi_as->iommu), true);
+    } else {
+        /* Disabling DMA region, using passthrough */
+        memory_region_set_enabled(MEMORY_REGION(&amdvi_as->iommu), false);
+        memory_region_set_enabled(&amdvi_as->iommu_nodma, true);
+    }
+}
+
 /* log error without aborting since linux seems to be using reserved bits */
 static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
 {
@@ -2070,6 +2089,7 @@ static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
         iommu_as[devfn]->iommu_state = s;
         iommu_as[devfn]->notifier_flags = IOMMU_NOTIFIER_NONE;
         iommu_as[devfn]->iova_tree = iova_tree_new();
+        iommu_as[devfn]->addr_translation = false;
 
         amdvi_dev_as = iommu_as[devfn];
 
@@ -2112,8 +2132,7 @@ static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
                                             AMDVI_INT_ADDR_FIRST,
                                             &amdvi_dev_as->iommu_ir, 1);
 
-        memory_region_set_enabled(&amdvi_dev_as->iommu_nodma, false);
-        memory_region_set_enabled(MEMORY_REGION(&amdvi_dev_as->iommu), true);
+        amdvi_switch_address_space(amdvi_dev_as);
     }
     return &iommu_as[devfn]->as;
 }
-- 
2.43.5
* [PATCH v3 16/22] amd_iommu: Set all address spaces to use passthrough mode on reset
  2025-09-19 21:34 [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (14 preceding siblings ...)
  2025-09-19 21:35 ` [PATCH v3 15/22] amd_iommu: Toggle memory regions based on address translation mode Alejandro Jimenez
@ 2025-09-19 21:35 ` Alejandro Jimenez
  2025-09-19 21:35 ` [PATCH v3 17/22] amd_iommu: Add dma-remap property to AMD vIOMMU device Alejandro Jimenez
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Alejandro Jimenez @ 2025-09-19 21:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
On reset, restore the default address translation mode (passthrough) for all
the address spaces managed by the vIOMMU.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 67a26f524706b..e9ce7b46e854f 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -1001,6 +1001,35 @@ static void amdvi_switch_address_space(AMDVIAddressSpace *amdvi_as)
     }
 }
 
+/*
+ * For all existing address spaces managed by the IOMMU, enable/disable the
+ * corresponding memory regions to reset the address translation mode and
+ * use passthrough by default.
+ */
+static void amdvi_reset_address_translation_all(AMDVIState *s)
+{
+    AMDVIAddressSpace **iommu_as;
+
+    for (int bus_num = 0; bus_num < PCI_BUS_MAX; bus_num++) {
+
+        /* Nothing to do if there are no devices on the current bus */
+        if (!s->address_spaces[bus_num]) {
+            continue;
+        }
+        iommu_as = s->address_spaces[bus_num];
+
+        for (int devfn = 0; devfn < PCI_DEVFN_MAX; devfn++) {
+
+            if (!iommu_as[devfn]) {
+                continue;
+            }
+            /* Use passthrough as default mode after reset */
+            iommu_as[devfn]->addr_translation = false;
+            amdvi_switch_address_space(iommu_as[devfn]);
+        }
+    }
+}
+
 /* log error without aborting since linux seems to be using reserved bits */
 static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
 {
@@ -2263,6 +2292,7 @@ static void amdvi_sysbus_reset(DeviceState *dev)
 
     /* Discard all mappings on device reset */
     amdvi_address_space_unmap_all(s);
+    amdvi_reset_address_translation_all(s);
 }
 
 static const VMStateDescription vmstate_amdvi_sysbus_migratable = {
-- 
2.43.5
* [PATCH v3 17/22] amd_iommu: Add dma-remap property to AMD vIOMMU device
  2025-09-19 21:34 [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (15 preceding siblings ...)
  2025-09-19 21:35 ` [PATCH v3 16/22] amd_iommu: Set all address spaces to use passthrough mode on reset Alejandro Jimenez
@ 2025-09-19 21:35 ` Alejandro Jimenez
  2025-09-19 21:35 ` [PATCH v3 18/22] amd_iommu: Toggle address translation mode on devtab entry invalidation Alejandro Jimenez
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Alejandro Jimenez @ 2025-09-19 21:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
In order to enable device assignment with IOMMU protection and guest DMA
address translation, IOMMU MAP notifier support is necessary to allow users
like VFIO to synchronize the shadow page tables i.e. to receive
notifications when the guest updates its I/O page tables and replay the
mappings onto host I/O page tables.
Provide a new dma-remap property to govern the ability to register for MAP
notifications, effectively providing global control over the DMA address
translation functionality that was implemented in previous changes.
Note that DMA remapping support also requires that the vIOMMU is configured
with the NpCache capability, so that a guest driver issues IOMMU invalidations
for both map() and unmap() operations. This capability is already set by default
and written to the configuration in amdvi_pci_realize() as part of
AMDVI_CAPAB_FEATURES.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
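Note for readers: the two checks this patch introduces can be summarized in a
small standalone sketch. It assumes nothing beyond what the diff below shows:
the DMA region is only selected when both the global property and the
per-address-space flag are set, and MAP notifiers are refused while the
property is off. The enum, function names and flag value are hypothetical and
used only for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative value, standing in for IOMMU_NOTIFIER_MAP */
#define NOTIFIER_MAP 0x2u

/* Hypothetical labels for the two memory regions of an address space */
enum as_region {
    REGION_NODMA,   /* passthrough alias */
    REGION_IOMMU,   /* translated IOMMU region */
};

/* Mirrors the condition in amdvi_switch_address_space() */
static enum as_region select_region(bool dma_remap, bool addr_translation)
{
    return (dma_remap && addr_translation) ? REGION_IOMMU : REGION_NODMA;
}

/* Mirrors the gate in amdvi_iommu_notify_flag_changed() */
static int check_notifier_flags(bool dma_remap, unsigned new_flags)
{
    if (!dma_remap && (new_flags & NOTIFIER_MAP)) {
        return -ENOTSUP;
    }
    return 0;
}
```

With dma-remap left at its default (off), a device that needs MAP
notifications is rejected up front and every address space stays on the
passthrough region, regardless of what the guest programs into the DTE.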
 hw/i386/amd_iommu.c | 24 +++++++++++++++++-------
 hw/i386/amd_iommu.h |  3 +++
 2 files changed, 20 insertions(+), 7 deletions(-)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index e9ce7b46e854f..ce5d4c36624fd 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -990,7 +990,9 @@ static void amdvi_iommu_address_space_sync_all(AMDVIState *s)
  */
 static void amdvi_switch_address_space(AMDVIAddressSpace *amdvi_as)
 {
-    if (amdvi_as->addr_translation) {
+    AMDVIState *s = amdvi_as->iommu_state;
+
+    if (s->dma_remap && amdvi_as->addr_translation) {
         /* Enabling DMA region */
         memory_region_set_enabled(&amdvi_as->iommu_nodma, false);
         memory_region_set_enabled(MEMORY_REGION(&amdvi_as->iommu), true);
@@ -2193,12 +2195,19 @@ static int amdvi_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
     AMDVIAddressSpace *as = container_of(iommu, AMDVIAddressSpace, iommu);
     AMDVIState *s = as->iommu_state;
 
-    if (new & IOMMU_NOTIFIER_MAP) {
-        error_setg(errp,
-                   "device %02x.%02x.%x requires iommu notifier which is not "
-                   "currently supported", as->bus_num, PCI_SLOT(as->devfn),
-                   PCI_FUNC(as->devfn));
-        return -EINVAL;
+    /*
+     * Accurate synchronization of the vIOMMU page tables required to support
+     * MAP notifiers is provided by the dma-remap feature. In addition, this
+     * also requires that the vIOMMU presents the NpCache capability, so a guest
+     * driver issues invalidations for both map() and unmap() operations. The
+     * capability is already set by default as part of AMDVI_CAPAB_FEATURES and
+     * written to the configuration in amdvi_pci_realize().
+     */
+    if (!s->dma_remap && (new & IOMMU_NOTIFIER_MAP)) {
+        error_setg_errno(errp, ENOTSUP,
+            "device %02x.%02x.%x requires dma-remap=1",
+            as->bus_num, PCI_SLOT(as->devfn), PCI_FUNC(as->devfn));
+        return -ENOTSUP;
     }
 
     /*
@@ -2423,6 +2432,7 @@ static void amdvi_sysbus_realize(DeviceState *dev, Error **errp)
 static const Property amdvi_properties[] = {
     DEFINE_PROP_BOOL("xtsup", AMDVIState, xtsup, false),
     DEFINE_PROP_STRING("pci-id", AMDVIState, pci_id),
+    DEFINE_PROP_BOOL("dma-remap", AMDVIState, dma_remap, false),
 };
 
 static const VMStateDescription vmstate_amdvi_sysbus = {
diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
index b51aa74368995..e1354686b6f03 100644
--- a/hw/i386/amd_iommu.h
+++ b/hw/i386/amd_iommu.h
@@ -418,6 +418,9 @@ struct AMDVIState {
     /* Interrupt remapping */
     bool ga_enabled;
     bool xtsup;
+
+    /* DMA address translation */
+    bool dma_remap;
 };
 
 uint64_t amdvi_extended_feature_register(AMDVIState *s);
-- 
2.43.5
* [PATCH v3 18/22] amd_iommu: Toggle address translation mode on devtab entry invalidation
  2025-09-19 21:34 [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (16 preceding siblings ...)
  2025-09-19 21:35 ` [PATCH v3 17/22] amd_iommu: Add dma-remap property to AMD vIOMMU device Alejandro Jimenez
@ 2025-09-19 21:35 ` Alejandro Jimenez
  2025-10-06  6:08   ` Sairaj Kodilkar
  2025-09-19 21:35 ` [PATCH v3 19/22] amd_iommu: Do not assume passthrough translation when DTE[TV]=0 Alejandro Jimenez
                   ` (4 subsequent siblings)
  22 siblings, 1 reply; 34+ messages in thread
From: Alejandro Jimenez @ 2025-09-19 21:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
A guest must issue an INVALIDATE_DEVTAB_ENTRY command after changing a
Device Table entry (DTE) e.g. after attaching a device and setting up its
DTE. When intercepting this event, determine if the DTE has been configured
for paging or not, and toggle the appropriate memory regions to allow DMA
address translation for the address space if needed. Requires dma-remap=on.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
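Note for readers: the mode switching done on INVALIDATE_DEVTAB_ENTRY is a
small state machine, sketched below as a standalone model. It follows my
reading of enable_dma_mode(), enable_nodma_mode() and
amdvi_update_addr_translation_mode() in the diff. The struct, enum and
function names are hypothetical, and the counters merely stand in for the
real unmap and sync calls.

```c
#include <stdbool.h>

/* Hypothetical model of one address space */
struct as_model {
    bool addr_translation;  /* true when the DMA (IOMMU) region is active */
    unsigned unmap_count;   /* times all mappings were dropped */
    unsigned sync_count;    /* times shadow page tables were synced */
};

/* Hypothetical outcome of fetching and validating the DTE */
enum dte_status { DTE_OK, DTE_V_CLEAR, DTE_TV_CLEAR, DTE_FETCH_ERR };

static void to_dma(struct as_model *as, bool inval_current)
{
    if (inval_current) {
        as->unmap_count++;          /* isolation case: drop stale mappings */
    }
    if (as->addr_translation) {
        return;                     /* already translating */
    }
    as->addr_translation = true;
    as->sync_count++;               /* populate shadow page tables */
}

static void to_nodma(struct as_model *as)
{
    if (!as->addr_translation) {
        return;                     /* passthrough already active */
    }
    as->addr_translation = false;
    as->unmap_count++;              /* mappings no longer apply */
}

static void update_mode(struct as_model *as, enum dte_status st,
                        unsigned dte_mode)
{
    switch (st) {
    case DTE_OK:
        if (dte_mode == 0) {
            to_nodma(as);           /* V=1, Mode=0: passthrough */
        } else {
            to_dma(as, false);
        }
        break;
    case DTE_V_CLEAR:
        to_nodma(as);               /* V=0: address passed untranslated */
        break;
    case DTE_TV_CLEAR:
    case DTE_FETCH_ERR:
        to_dma(as, true);           /* isolate the device */
        break;
    }
}
```

The point of the model is that both transitions are idempotent: repeating the
same invalidation does not re-sync or re-unmap, except in the isolation case,
where existing mappings are dropped every time.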
 hw/i386/amd_iommu.c | 122 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 120 insertions(+), 2 deletions(-)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index ce5d4c36624fd..e916dcb2be381 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -1032,18 +1032,136 @@ static void amdvi_reset_address_translation_all(AMDVIState *s)
     }
 }
 
+static void enable_dma_mode(AMDVIAddressSpace *as, bool inval_current)
+{
+    /*
+     * When enabling DMA mode to isolate guest devices because the DTE could
+     * not be retrieved or is invalid, all existing mappings must be
+     * dropped.
+     */
+    if (inval_current) {
+        IOMMUNotifier *n;
+        IOMMU_NOTIFIER_FOREACH(n, &as->iommu) {
+            amdvi_address_space_unmap(as, n);
+        }
+    }
+
+    if (as->addr_translation) {
+        return;
+    }
+
+    /* Installing DTE enabling translation, activate region */
+    as->addr_translation = true;
+    amdvi_switch_address_space(as);
+    /* Sync shadow page tables */
+    amdvi_address_space_sync(as);
+}
+
+/*
+ * If paging was previously in use in the address space
+ * - invalidate all existing mappings
+ * - switch to no_dma memory region
+ */
+static void enable_nodma_mode(AMDVIAddressSpace *as)
+{
+    IOMMUNotifier *n;
+
+    if (!as->addr_translation) {
+        /* passthrough is already active, nothing to do */
+        return;
+    }
+
+    as->addr_translation = false;
+    IOMMU_NOTIFIER_FOREACH(n, &as->iommu) {
+        /* Drop all mappings for the address space */
+        amdvi_address_space_unmap(as, n);
+    }
+    amdvi_switch_address_space(as);
+}
+
+/*
+ * A guest driver must issue the INVALIDATE_DEVTAB_ENTRY command to the IOMMU
+ * after changing a Device Table entry. We can use this fact to detect when a
+ * Device Table entry is created for a device attached to a paging domain and
+ * enable the corresponding IOMMU memory region to allow for DMA translation if
+ * appropriate.
+ */
+static void amdvi_update_addr_translation_mode(AMDVIState *s, uint16_t devid)
+{
+    uint8_t bus_num, devfn, dte_mode;
+    AMDVIAddressSpace *as;
+    uint64_t dte[4] = { 0 };
+    int ret;
+
+    /*
+     * Convert the devid encoded in the command to a bus and devfn in
+     * order to retrieve the corresponding address space.
+     */
+    bus_num = PCI_BUS_NUM(devid);
+    devfn = devid & 0xff;
+
+    /*
+     * The main buffer of size (AMDVIAddressSpace *) * (PCI_BUS_MAX) has already
+     * been allocated within AMDVIState, but we must be careful not to access
+     * an unallocated devfn.
+     */
+    if (!s->address_spaces[bus_num] || !s->address_spaces[bus_num][devfn]) {
+        return;
+    }
+    as = s->address_spaces[bus_num][devfn];
+
+    ret = amdvi_as_to_dte(as, dte);
+
+    if (!ret) {
+        dte_mode = (dte[0] >> AMDVI_DEV_MODE_RSHIFT) & AMDVI_DEV_MODE_MASK;
+    }
+
+    switch (ret) {
+    case 0:
+        /* DTE was successfully retrieved */
+        if (!dte_mode) {
+            enable_nodma_mode(as); /* DTE[V]=1 && DTE[Mode]=0 => passthrough */
+        } else {
+            enable_dma_mode(as, false); /* Enable DMA translation */
+        }
+        break;
+    case -AMDVI_FR_DTE_V:
+        /* DTE[V]=0, address is passed untranslated */
+        enable_nodma_mode(as);
+        break;
+    case -AMDVI_FR_DTE_RTR_ERR:
+    case -AMDVI_FR_DTE_TV:
+        /*
+         * Enforce isolation by using DMA in rare scenarios where the DTE cannot
+         * be retrieved or DTE[TV]=0. Existing mappings are dropped.
+         */
+        enable_dma_mode(as, true);
+        break;
+    }
+}
+
 /* log error without aborting since linux seems to be using reserved bits */
 static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
 {
     uint16_t devid = cpu_to_le16((uint16_t)extract64(cmd[0], 0, 16));
 
+    trace_amdvi_devtab_inval(PCI_BUS_NUM(devid), PCI_SLOT(devid),
+                             PCI_FUNC(devid));
+
     /* This command should invalidate internal caches of which there isn't */
     if (extract64(cmd[0], 16, 44) || cmd[1]) {
         amdvi_log_illegalcom_error(s, extract64(cmd[0], 60, 4),
                                    s->cmdbuf + s->cmdbuf_head);
+        return;
+    }
+
+    /*
+     * When DMA remapping capability is enabled, check whether the updated DTE
+     * is set up for paging, and configure the corresponding memory regions.
+     */
+    if (s->dma_remap) {
+        amdvi_update_addr_translation_mode(s, devid);
     }
-    trace_amdvi_devtab_inval(PCI_BUS_NUM(devid), PCI_SLOT(devid),
-                             PCI_FUNC(devid));
 }
 
 static void amdvi_complete_ppr(AMDVIState *s, uint64_t *cmd)
-- 
2.43.5
* [PATCH v3 19/22] amd_iommu: Do not assume passthrough translation when DTE[TV]=0
  2025-09-19 21:34 [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (17 preceding siblings ...)
  2025-09-19 21:35 ` [PATCH v3 18/22] amd_iommu: Toggle address translation mode on devtab entry invalidation Alejandro Jimenez
@ 2025-09-19 21:35 ` Alejandro Jimenez
  2025-09-19 21:35 ` [PATCH v3 20/22] amd_iommu: Refactor amdvi_page_walk() to use common code for page walk Alejandro Jimenez
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Alejandro Jimenez @ 2025-09-19 21:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
The AMD I/O Virtualization Technology (IOMMU) Specification (see Table
8: V, TV, and GV Fields in Device Table Entry) specifies that a DTE
with V=1, TV=0 does not contain valid address translation information.
If a request requires a table walk, the walk is terminated when this
condition is encountered.
Do not assume that addresses for a device with DTE[TV]=0 are passed
through (i.e. not remapped) and instead terminate the page table walk
early.
Fixes: d29a09ca6842 ("hw/i386: Introduce AMD IOMMU")
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
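Note for readers: the behaviour after this patch can be summarized by
classifying the first quadword of the DTE, as in the sketch below. The bit
positions (V in bit 0, TV in bit 1, Mode in bits 11:9) are my assumptions
about what AMDVI_DEV_VALID, AMDVI_DEV_TRANSLATION_VALID and the
AMDVI_DEV_MODE_* macros encode and should be checked against amd_iommu.h.
The enum and function name are hypothetical. The V=0 case is included for
completeness, although in the driver it is handled before amdvi_page_walk()
is reached.

```c
#include <stdint.h>

/* Assumed DTE[0] layout, to be verified against amd_iommu.h */
#define DTE_V           (1ULL << 0)
#define DTE_TV          (1ULL << 1)
#define DTE_MODE_SHIFT  9
#define DTE_MODE_MASK   0x7

/* Hypothetical labels for what happens to a DMA request */
enum walk_outcome {
    WALK_PASSTHROUGH,   /* address forwarded without translation */
    WALK_TERMINATED,    /* no valid translation info, walk stops */
    WALK_TABLES,        /* page tables are walked from the root pointer */
    WALK_INVALID_MODE,  /* reserved Mode value, walk is abandoned */
};

static enum walk_outcome classify_dte(uint64_t dte0)
{
    unsigned mode;

    if (!(dte0 & DTE_V)) {
        return WALK_PASSTHROUGH;
    }
    if (!(dte0 & DTE_TV)) {
        /* Previously treated like passthrough; now the walk terminates */
        return WALK_TERMINATED;
    }

    mode = (dte0 >> DTE_MODE_SHIFT) & DTE_MODE_MASK;
    if (mode == 0) {
        return WALK_PASSTHROUGH;
    }
    if (mode == 7) {
        return WALK_INVALID_MODE;
    }
    return WALK_TABLES;
}
```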
 hw/i386/amd_iommu.c | 87 +++++++++++++++++++++++++--------------------
 1 file changed, 48 insertions(+), 39 deletions(-)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index e916dcb2be381..1bda2a8ac3a16 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -1722,51 +1722,60 @@ static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
     uint64_t pte = dte[0], pte_addr, page_mask;
 
     /* make sure the DTE has TV = 1 */
-    if (pte & AMDVI_DEV_TRANSLATION_VALID) {
-        level = get_pte_translation_mode(pte);
-        if (level >= 7) {
-            trace_amdvi_mode_invalid(level, addr);
+    if (!(pte & AMDVI_DEV_TRANSLATION_VALID)) {
+        /*
+         * A DTE with V=1, TV=0 does not have a valid Page Table Root Pointer.
+         * An IOMMU processing a request that requires a table walk terminates
+         * the walk when it encounters this condition. Do the same and return
+         * instead of assuming that the address is forwarded without translation
+         * i.e. the passthrough case, as it is done for the case where DTE[V]=0.
+         */
+        return;
+    }
+
+    level = get_pte_translation_mode(pte);
+    if (level >= 7) {
+        trace_amdvi_mode_invalid(level, addr);
+        return;
+    }
+    if (level == 0) {
+        goto no_remap;
+    }
+
+    /* we are at the leaf page table or page table encodes a huge page */
+    do {
+        pte_perms = amdvi_get_perms(pte);
+        present = pte & 1;
+        if (!present || perms != (perms & pte_perms)) {
+            amdvi_page_fault(as->iommu_state, as->devfn, addr, perms);
+            trace_amdvi_page_fault(addr);
             return;
         }
-        if (level == 0) {
-            goto no_remap;
+        /* go to the next lower level */
+        pte_addr = pte & AMDVI_DEV_PT_ROOT_MASK;
+        /* add offset and load pte */
+        pte_addr += ((addr >> (3 + 9 * level)) & 0x1FF) << 3;
+        pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
+        if (!pte) {
+            return;
         }
+        oldlevel = level;
+        level = get_pte_translation_mode(pte);
+    } while (level > 0 && level < 7);
 
-        /* we are at the leaf page table or page table encodes a huge page */
-        do {
-            pte_perms = amdvi_get_perms(pte);
-            present = pte & 1;
-            if (!present || perms != (perms & pte_perms)) {
-                amdvi_page_fault(as->iommu_state, as->devfn, addr, perms);
-                trace_amdvi_page_fault(addr);
-                return;
-            }
-
-            /* go to the next lower level */
-            pte_addr = pte & AMDVI_DEV_PT_ROOT_MASK;
-            /* add offset and load pte */
-            pte_addr += ((addr >> (3 + 9 * level)) & 0x1FF) << 3;
-            pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
-            if (!pte || (pte == (uint64_t)-1)) {
-                return;
-            }
-            oldlevel = level;
-            level = get_pte_translation_mode(pte);
-        } while (level > 0 && level < 7);
+    if (level == 0x7) {
+        page_mask = pte_override_page_mask(pte);
+    } else {
+        page_mask = pte_get_page_mask(oldlevel);
+    }
 
-        if (level == 0x7) {
-            page_mask = pte_override_page_mask(pte);
-        } else {
-            page_mask = pte_get_page_mask(oldlevel);
-        }
+    /* get access permissions from pte */
+    ret->iova = addr & page_mask;
+    ret->translated_addr = (pte & AMDVI_DEV_PT_ROOT_MASK) & page_mask;
+    ret->addr_mask = ~page_mask;
+    ret->perm = amdvi_get_perms(pte);
+    return;
 
-        /* get access permissions from pte */
-        ret->iova = addr & page_mask;
-        ret->translated_addr = (pte & AMDVI_DEV_PT_ROOT_MASK) & page_mask;
-        ret->addr_mask = ~page_mask;
-        ret->perm = amdvi_get_perms(pte);
-        return;
-    }
 no_remap:
     ret->iova = addr & AMDVI_PAGE_MASK_4K;
     ret->translated_addr = addr & AMDVI_PAGE_MASK_4K;
-- 
2.43.5
* [PATCH v3 20/22] amd_iommu: Refactor amdvi_page_walk() to use common code for page walk
  2025-09-19 21:34 [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (18 preceding siblings ...)
  2025-09-19 21:35 ` [PATCH v3 19/22] amd_iommu: Do not assume passthrough translation when DTE[TV]=0 Alejandro Jimenez
@ 2025-09-19 21:35 ` Alejandro Jimenez
  2025-09-19 21:35 ` [PATCH v3 21/22] i386/intel-iommu: Move dma_translation to x86-iommu Alejandro Jimenez
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 34+ messages in thread
From: Alejandro Jimenez @ 2025-09-19 21:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
Simplify amdvi_page_walk() by making it call the fetch_pte() helper that is
already in use by the shadow page synchronization code. This ensures all code
uses the same page table walking algorithm.
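For readers following the mask arithmetic, here is a minimal sketch (not code from this patch) comparing the two derivations. The removed pte_get_page_mask() computed the mask from the walk level, while the refactored amdvi_page_walk() computes it from the page size reported by fetch_pte(). The helper names below are hypothetical, and the level-to-size mapping is an assumption taken from the removed helper's arithmetic rather than from fetch_pte() itself, which is not shown in this hunk.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: page size covered by a regular (non-override) PTE
 * reached from the given level, i.e. 4 KiB at level 1, 2 MiB at level 2.
 * Assumption: same arithmetic as the removed pte_get_page_mask().
 */
static inline uint64_t level_to_pagesize(unsigned level)
{
    assert(level >= 1 && level <= 6);
    return 1ULL << (level * 9 + 3);
}

/* Mask derivation used after the refactor: ~(pagesize - 1). */
static inline uint64_t pagesize_to_mask(uint64_t pagesize)
{
    /* The size must be a non-zero power of two for the mask to be valid. */
    assert(pagesize && (pagesize & (pagesize - 1)) == 0);
    return ~(pagesize - 1);
}

/* Mask as the removed level-based helper computed it. */
static inline uint64_t old_level_mask(unsigned level)
{
    return ~((1ULL << (level * 9 + 3)) - 1);
}
```

Under that assumption both derivations agree for every level, so the refactor only changes where the size comes from, not the resulting mask for regular PTEs.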
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 77 ++++++++++++++++-----------------------------
 1 file changed, 27 insertions(+), 50 deletions(-)
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 1bda2a8ac3a16..b6851784fb9f1 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -513,24 +513,6 @@ static inline uint8_t get_pte_translation_mode(uint64_t pte)
     return (pte >> AMDVI_DEV_MODE_RSHIFT) & AMDVI_DEV_MODE_MASK;
 }
 
-static inline uint64_t pte_override_page_mask(uint64_t pte)
-{
-    uint8_t page_mask = 13;
-    uint64_t addr = (pte & AMDVI_DEV_PT_ROOT_MASK) >> 12;
-    /* find the first zero bit */
-    while (addr & 1) {
-        page_mask++;
-        addr = addr >> 1;
-    }
-
-    return ~((1ULL << page_mask) - 1);
-}
-
-static inline uint64_t pte_get_page_mask(uint64_t oldlevel)
-{
-    return ~((1UL << ((oldlevel * 9) + 3)) - 1);
-}
-
 static inline uint64_t amdvi_get_pte_entry(AMDVIState *s, uint64_t pte_addr,
                                           uint16_t devid)
 {
@@ -1718,11 +1700,13 @@ static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
                             IOMMUTLBEntry *ret, unsigned perms,
                             hwaddr addr)
 {
-    unsigned level, present, pte_perms, oldlevel;
-    uint64_t pte = dte[0], pte_addr, page_mask;
+    hwaddr page_mask, pagesize = 0;
+    uint8_t mode;
+    uint64_t pte;
+    int fetch_ret;
 
     /* make sure the DTE has TV = 1 */
-    if (!(pte & AMDVI_DEV_TRANSLATION_VALID)) {
+    if (!(dte[0] & AMDVI_DEV_TRANSLATION_VALID)) {
         /*
          * A DTE with V=1, TV=0 does not have a valid Page Table Root Pointer.
          * An IOMMU processing a request that requires a table walk terminates
@@ -1733,42 +1717,35 @@ static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
         return;
     }
 
-    level = get_pte_translation_mode(pte);
-    if (level >= 7) {
-        trace_amdvi_mode_invalid(level, addr);
+    mode = get_pte_translation_mode(dte[0]);
+    if (mode >= 7) {
+        trace_amdvi_mode_invalid(mode, addr);
         return;
     }
-    if (level == 0) {
+    if (mode == 0) {
         goto no_remap;
     }
 
-    /* we are at the leaf page table or page table encodes a huge page */
-    do {
-        pte_perms = amdvi_get_perms(pte);
-        present = pte & 1;
-        if (!present || perms != (perms & pte_perms)) {
-            amdvi_page_fault(as->iommu_state, as->devfn, addr, perms);
-            trace_amdvi_page_fault(addr);
-            return;
-        }
-        /* go to the next lower level */
-        pte_addr = pte & AMDVI_DEV_PT_ROOT_MASK;
-        /* add offset and load pte */
-        pte_addr += ((addr >> (3 + 9 * level)) & 0x1FF) << 3;
-        pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
-        if (!pte) {
-            return;
-        }
-        oldlevel = level;
-        level = get_pte_translation_mode(pte);
-    } while (level > 0 && level < 7);
+    /* Attempt to fetch the PTE to determine if a valid mapping exists */
+    fetch_ret = fetch_pte(as, addr, dte[0], &pte, &pagesize);
 
-    if (level == 0x7) {
-        page_mask = pte_override_page_mask(pte);
-    } else {
-        page_mask = pte_get_page_mask(oldlevel);
+    /*
+     * If walking the page table results in an error of any type, returns an
+     * empty PTE i.e. no mapping, or the permissions do not match, return since
+     * there is no translation available.
+     */
+    if (fetch_ret < 0 || !IOMMU_PTE_PRESENT(pte) ||
+        perms != (perms & amdvi_get_perms(pte))) {
+
+        amdvi_page_fault(as->iommu_state, as->devfn, addr, perms);
+        trace_amdvi_page_fault(addr);
+        return;
     }
 
+    /* A valid PTE and page size has been retrieved */
+    assert(pagesize);
+    page_mask = ~(pagesize - 1);
+
     /* get access permissions from pte */
     ret->iova = addr & page_mask;
     ret->translated_addr = (pte & AMDVI_DEV_PT_ROOT_MASK) & page_mask;
@@ -1780,7 +1757,7 @@ no_remap:
     ret->iova = addr & AMDVI_PAGE_MASK_4K;
     ret->translated_addr = addr & AMDVI_PAGE_MASK_4K;
     ret->addr_mask = ~AMDVI_PAGE_MASK_4K;
-    ret->perm = amdvi_get_perms(pte);
+    ret->perm = amdvi_get_perms(dte[0]);
 }
 
 static void amdvi_do_translate(AMDVIAddressSpace *as, hwaddr addr,
-- 
2.43.5
* [PATCH v3 21/22] i386/intel-iommu: Move dma_translation to x86-iommu
  2025-09-19 21:34 [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (19 preceding siblings ...)
  2025-09-19 21:35 ` [PATCH v3 20/22] amd_iommu: Refactor amdvi_page_walk() to use common code for page walk Alejandro Jimenez
@ 2025-09-19 21:35 ` Alejandro Jimenez
  2025-09-22  5:33   ` CLEMENT MATHIEU--DRIF
  2025-09-19 21:35 ` [PATCH v3 22/22] amd_iommu: HATDis/HATS=11 support Alejandro Jimenez
  2025-10-06 16:07 ` [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Cédric Le Goater
  22 siblings, 1 reply; 34+ messages in thread
From: Alejandro Jimenez @ 2025-09-19 21:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
From: Joao Martins <joao.m.martins@oracle.com>
To be reused later by AMD, now that it shares a similar property.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/intel_iommu.c       | 5 ++---
 hw/i386/x86-iommu.c         | 1 +
 include/hw/i386/x86-iommu.h | 1 +
 3 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 83c5e444131a3..2b848d094cfb7 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2701,7 +2701,7 @@ static void vtd_handle_gcmd_write(IntelIOMMUState *s)
     uint32_t changed = status ^ val;
 
     trace_vtd_reg_write_gcmd(status, val);
-    if ((changed & VTD_GCMD_TE) && s->dma_translation) {
+    if ((changed & VTD_GCMD_TE) && x86_iommu->dma_translation) {
         /* Translation enable/disable */
         vtd_handle_gcmd_te(s, val & VTD_GCMD_TE);
     }
@@ -3835,7 +3835,6 @@ static const Property vtd_properties[] = {
     DEFINE_PROP_BOOL("snoop-control", IntelIOMMUState, snoop_control, false),
     DEFINE_PROP_BOOL("x-pasid-mode", IntelIOMMUState, pasid, false),
     DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
-    DEFINE_PROP_BOOL("dma-translation", IntelIOMMUState, dma_translation, true),
     DEFINE_PROP_BOOL("stale-tm", IntelIOMMUState, stale_tm, false),
     DEFINE_PROP_BOOL("fs1gp", IntelIOMMUState, fs1gp, true),
 };
@@ -4553,7 +4552,7 @@ static void vtd_cap_init(IntelIOMMUState *s)
     if (s->dma_drain) {
         s->cap |= VTD_CAP_DRAIN;
     }
-    if (s->dma_translation) {
+    if (x86_iommu->dma_translation) {
             if (s->aw_bits >= VTD_HOST_AW_39BIT) {
                     s->cap |= VTD_CAP_SAGAW_39bit;
             }
diff --git a/hw/i386/x86-iommu.c b/hw/i386/x86-iommu.c
index d34a6849f4ae9..c127a44bb4bc8 100644
--- a/hw/i386/x86-iommu.c
+++ b/hw/i386/x86-iommu.c
@@ -130,6 +130,7 @@ static const Property x86_iommu_properties[] = {
                             intr_supported, ON_OFF_AUTO_AUTO),
     DEFINE_PROP_BOOL("device-iotlb", X86IOMMUState, dt_supported, false),
     DEFINE_PROP_BOOL("pt", X86IOMMUState, pt_supported, true),
+    DEFINE_PROP_BOOL("dma-translation", X86IOMMUState, dma_translation, true),
 };
 
 static void x86_iommu_class_init(ObjectClass *klass, const void *data)
diff --git a/include/hw/i386/x86-iommu.h b/include/hw/i386/x86-iommu.h
index bfd21649d0838..e89f55a5c215c 100644
--- a/include/hw/i386/x86-iommu.h
+++ b/include/hw/i386/x86-iommu.h
@@ -64,6 +64,7 @@ struct X86IOMMUState {
     OnOffAuto intr_supported;   /* Whether vIOMMU supports IR */
     bool dt_supported;          /* Whether vIOMMU supports DT */
     bool pt_supported;          /* Whether vIOMMU supports pass-through */
+    bool dma_translation;       /* Whether vIOMMU supports DMA translation */
     QLIST_HEAD(, IEC_Notifier) iec_notifiers; /* IEC notify list */
 };
 
-- 
2.43.5
* [PATCH v3 22/22] amd_iommu: HATDis/HATS=11 support
  2025-09-19 21:34 [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (20 preceding siblings ...)
  2025-09-19 21:35 ` [PATCH v3 21/22] i386/intel-iommu: Move dma_translation to x86-iommu Alejandro Jimenez
@ 2025-09-19 21:35 ` Alejandro Jimenez
  2025-10-06 16:07 ` [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Cédric Le Goater
  22 siblings, 0 replies; 34+ messages in thread
From: Alejandro Jimenez @ 2025-09-19 21:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez
From: Joao Martins <joao.m.martins@oracle.com>
Add a way to disable DMA translation support in the AMD IOMMU by
allowing IVHD HATDis to be set to 1 and exposing HATS (Host Address
Translation Size) as the Reserved value.
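As a rough illustration of the encodings involved (a sketch, not part of this patch): the two AMDVI_HATS_* values are copied from the amd_iommu.h hunk below, and the HATDis bit position follows the acpi-build.c hunk. The helper names, HATS_SHIFT and HATS_MASK are hypothetical, and the sketch assumes HATS is the two-bit field starting at bit 10, which is what the shifts in the defines imply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the amd_iommu.h hunk in this patch. */
#define AMDVI_HATS_MODE           (2ULL << 10)
#define AMDVI_HATS_MODE_RESERVED  (3ULL << 10)

/* Hypothetical names for the field position, for illustration only. */
#define HATS_SHIFT  10
#define HATS_MASK   3ULL

static_assert((AMDVI_HATS_MODE_RESERVED >> HATS_SHIFT) == HATS_MASK,
              "the Reserved encoding sets both HATS bits");

/* Extract the two-bit HATS field from an EFR value. */
static inline unsigned efr_get_hats(uint64_t efr)
{
    return (unsigned)((efr >> HATS_SHIFT) & HATS_MASK);
}

/* Mirror of the EFR change: OR in the Reserved value when translation is off. */
static inline uint64_t efr_apply_dma_translation(uint64_t efr,
                                                 bool dma_translation)
{
    if (!dma_translation) {
        efr |= AMDVI_HATS_MODE_RESERVED;
    }
    return efr;
}

/* Mirror of the IVHD change: bit 0 of IOMMU Attributes is HATDis. */
static inline uint32_t ivhd_attributes(bool dma_translation)
{
    return dma_translation ? 0 : (1U << 0);
}
```

Because the Reserved value sets both HATS bits, OR-ing it into an EFR that already carries AMDVI_HATS_MODE yields 11b without needing to clear the field first.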
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/acpi-build.c |  6 +++++-
 hw/i386/amd_iommu.c  | 19 +++++++++++++++++++
 hw/i386/amd_iommu.h  |  1 +
 3 files changed, 25 insertions(+), 1 deletion(-)
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 423c4959fe809..9446a9f862ca4 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -1863,7 +1863,11 @@ build_amd_iommu(GArray *table_data, BIOSLinker *linker, const char *oem_id,
     /* IOMMU info */
     build_append_int_noprefix(table_data, 0, 2);
     /* IOMMU Attributes */
-    build_append_int_noprefix(table_data, 0, 4);
+    if (!s->iommu.dma_translation) {
+        build_append_int_noprefix(table_data, (1UL << 0) /* HATDis */, 4);
+    } else {
+        build_append_int_noprefix(table_data, 0, 4);
+    }
     /* EFR Register Image */
     build_append_int_noprefix(table_data,
                               amdvi_extended_feature_register(s),
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index b6851784fb9f1..378e0cb55eab6 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -107,6 +107,9 @@ uint64_t amdvi_extended_feature_register(AMDVIState *s)
     if (s->xtsup) {
         feature |= AMDVI_FEATURE_XT;
     }
+    if (!s->iommu.dma_translation) {
+        feature |= AMDVI_HATS_MODE_RESERVED;
+    }
 
     return feature;
 }
@@ -472,6 +475,9 @@ static inline uint64_t amdvi_get_perms(uint64_t entry)
 static bool amdvi_validate_dte(AMDVIState *s, uint16_t devid,
                                uint64_t *dte)
 {
+
+    uint64_t root;
+
     if ((dte[0] & AMDVI_DTE_QUAD0_RESERVED) ||
         (dte[1] & AMDVI_DTE_QUAD1_RESERVED) ||
         (dte[2] & AMDVI_DTE_QUAD2_RESERVED) ||
@@ -482,6 +488,19 @@ static bool amdvi_validate_dte(AMDVIState *s, uint16_t devid,
         return false;
     }
 
+    /*
+     * 1 = Host Address Translation is not supported. Value in MMIO Offset
+     * 0030h[HATS] is not meaningful. A non-zero host page table root pointer
+     * in the DTE would result in an ILLEGAL_DEV_TABLE_ENTRY event.
+     */
+    root = (dte[0] & AMDVI_DEV_PT_ROOT_MASK) >> 12;
+    if (root && !s->iommu.dma_translation) {
+        amdvi_log_illegaldevtab_error(s, devid,
+                                      s->devtab +
+                                      devid * AMDVI_DEVTAB_ENTRY_SIZE, 0);
+        return false;
+    }
+
     return true;
 }
 
diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
index e1354686b6f03..daf82fc85f961 100644
--- a/hw/i386/amd_iommu.h
+++ b/hw/i386/amd_iommu.h
@@ -177,6 +177,7 @@
 /* AMDVI paging mode */
 #define AMDVI_GATS_MODE                 (2ULL <<  12)
 #define AMDVI_HATS_MODE                 (2ULL <<  10)
+#define AMDVI_HATS_MODE_RESERVED        (3ULL <<  10)
 
 /* Page Table format */
 
-- 
2.43.5
* Re: [PATCH v3 21/22] i386/intel-iommu: Move dma_translation to x86-iommu
  2025-09-19 21:35 ` [PATCH v3 21/22] i386/intel-iommu: Move dma_translation to x86-iommu Alejandro Jimenez
@ 2025-09-22  5:33   ` CLEMENT MATHIEU--DRIF
  0 siblings, 0 replies; 34+ messages in thread
From: CLEMENT MATHIEU--DRIF @ 2025-09-22  5:33 UTC (permalink / raw)
  To: Alejandro Jimenez, qemu-devel@nongnu.org
  Cc: mst@redhat.com, pbonzini@redhat.com, richard.henderson@linaro.org,
	eduardo@habkost.net, peterx@redhat.com, david@redhat.com,
	philmd@linaro.org, marcel.apfelbaum@gmail.com,
	alex.williamson@redhat.com, imammedo@redhat.com,
	anisinha@redhat.com, vasant.hegde@amd.com,
	suravee.suthikulpanit@amd.com, santosh.shukla@amd.com,
	sarunkod@amd.com, Wei.Huang2@amd.com, Ankit.Soni@amd.com,
	Ethan MILON, joao.m.martins@oracle.com,
	boris.ostrovsky@oracle.com
On Fri, 2025-09-19 at 21:35 +0000, Alejandro Jimenez wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
> 
> To be reused later by AMD, now that it shares a similar property.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Hi Alejandro,
Most commit messages for the Intel IOMMU are formatted like "intel_iommu: XYZ".
We should probably stick to that.
Otherwise, the change looks good to me.
Thanks  
cmd
> ---  
>  hw/i386/intel_iommu.c       | 5 ++---  
>  hw/i386/x86-iommu.c         | 1 +  
>  include/hw/i386/x86-iommu.h | 1 +  
>  3 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c  
> index 83c5e444131a3..2b848d094cfb7 100644  
> --- a/hw/i386/intel_iommu.c  
> +++ b/hw/i386/intel_iommu.c  
> @@ -2701,7 +2701,7 @@ static void vtd_handle_gcmd_write(IntelIOMMUState *s)  
>      uint32_t changed = status ^ val;  
>    
>      trace_vtd_reg_write_gcmd(status, val);  
> -    if ((changed & VTD_GCMD_TE) && s->dma_translation) {  
> +    if ((changed & VTD_GCMD_TE) && x86_iommu->dma_translation) {  
>          /* Translation enable/disable */  
>          vtd_handle_gcmd_te(s, val & VTD_GCMD_TE);  
>      }  
> @@ -3835,7 +3835,6 @@ static const Property vtd_properties[] = {  
>      DEFINE_PROP_BOOL("snoop-control", IntelIOMMUState, snoop_control, false),  
>      DEFINE_PROP_BOOL("x-pasid-mode", IntelIOMMUState, pasid, false),  
>      DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),  
> -    DEFINE_PROP_BOOL("dma-translation", IntelIOMMUState, dma_translation, true),  
>      DEFINE_PROP_BOOL("stale-tm", IntelIOMMUState, stale_tm, false),  
>      DEFINE_PROP_BOOL("fs1gp", IntelIOMMUState, fs1gp, true),  
>  };  
> @@ -4553,7 +4552,7 @@ static void vtd_cap_init(IntelIOMMUState *s)  
>      if (s->dma_drain) {  
>          s->cap |= VTD_CAP_DRAIN;  
>      }  
> -    if (s->dma_translation) {  
> +    if (x86_iommu->dma_translation) {  
>              if (s->aw_bits >= VTD_HOST_AW_39BIT) {  
>                      s->cap |= VTD_CAP_SAGAW_39bit;  
>              }  
> diff --git a/hw/i386/x86-iommu.c b/hw/i386/x86-iommu.c  
> index d34a6849f4ae9..c127a44bb4bc8 100644  
> --- a/hw/i386/x86-iommu.c  
> +++ b/hw/i386/x86-iommu.c  
> @@ -130,6 +130,7 @@ static const Property x86_iommu_properties[] = {  
>                              intr_supported, ON_OFF_AUTO_AUTO),  
>      DEFINE_PROP_BOOL("device-iotlb", X86IOMMUState, dt_supported, false),  
>      DEFINE_PROP_BOOL("pt", X86IOMMUState, pt_supported, true),  
> +    DEFINE_PROP_BOOL("dma-translation", X86IOMMUState, dma_translation, true),  
>  };  
>    
>  static void x86_iommu_class_init(ObjectClass *klass, const void *data)  
> diff --git a/include/hw/i386/x86-iommu.h b/include/hw/i386/x86-iommu.h  
> index bfd21649d0838..e89f55a5c215c 100644  
> --- a/include/hw/i386/x86-iommu.h  
> +++ b/include/hw/i386/x86-iommu.h  
> @@ -64,6 +64,7 @@ struct X86IOMMUState {  
>      OnOffAuto intr_supported;   /* Whether vIOMMU supports IR */  
>      bool dt_supported;          /* Whether vIOMMU supports DT */  
>      bool pt_supported;          /* Whether vIOMMU supports pass-through */  
> +    bool dma_translation;       /* Whether vIOMMU supports DMA translation */  
>      QLIST_HEAD(, IEC_Notifier) iec_notifiers; /* IEC notify list */  
>  };  
>  
* Re: [PATCH v3 18/22] amd_iommu: Toggle address translation mode on devtab entry invalidation
  2025-09-19 21:35 ` [PATCH v3 18/22] amd_iommu: Toggle address translation mode on devtab entry invalidation Alejandro Jimenez
@ 2025-10-06  6:08   ` Sairaj Kodilkar
  2025-10-06  6:15     ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: Sairaj Kodilkar @ 2025-10-06  6:08 UTC (permalink / raw)
  To: Alejandro Jimenez, qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky
On 9/20/2025 3:05 AM, Alejandro Jimenez wrote:
> A guest must issue an INVALIDATE_DEVTAB_ENTRY command after changing a
> Device Table entry (DTE) e.g. after attaching a device and setting up its
> DTE. When intercepting this event, determine if the DTE has been configured
> for paging or not, and toggle the appropriate memory regions to allow DMA
> address translation for the address space if needed. Requires dma-remap=on.
>
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> ---
>   hw/i386/amd_iommu.c | 122 +++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 120 insertions(+), 2 deletions(-)
>
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index ce5d4c36624fd..e916dcb2be381 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -1032,18 +1032,136 @@ static void amdvi_reset_address_translation_all(AMDVIState *s)
>       }
>   }
>   
> +static void enable_dma_mode(AMDVIAddressSpace *as, bool inval_current)
> +{
> +    /*
> +     * When enabling DMA mode for the purpose of isolating guest devices on
> +     * a failure to retrieve or invalid DTE, all existing mappings must be
> +     * dropped.
> +     */
> +    if (inval_current) {
> +        IOMMUNotifier *n;
> +        IOMMU_NOTIFIER_FOREACH(n, &as->iommu) {
> +            amdvi_address_space_unmap(as, n);
> +        }
> +    }
> +
> +    if (as->addr_translation) {
> +        return;
> +    }
> +
> +    /* Installing DTE enabling translation, activate region */
> +    as->addr_translation = true;
> +    amdvi_switch_address_space(as);
> +    /* Sync shadow page tables */
> +    amdvi_address_space_sync(as);
Hi Alejandro,
I think we can skip amdvi_address_space_sync, because
amdvi_switch_address_space will trigger amdvi_iommu_replay. This replay
should unmap all the old mappings and sync the shadow page table.
Thanks
Sairaj
* Re: [PATCH v3 18/22] amd_iommu: Toggle address translation mode on devtab entry invalidation
  2025-10-06  6:08   ` Sairaj Kodilkar
@ 2025-10-06  6:15     ` Michael S. Tsirkin
  2025-10-06  6:25       ` Sairaj Kodilkar
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2025-10-06  6:15 UTC (permalink / raw)
  To: Sairaj Kodilkar
  Cc: Alejandro Jimenez, qemu-devel, clement.mathieu--drif, pbonzini,
	richard.henderson, eduardo, peterx, david, philmd,
	marcel.apfelbaum, alex.williamson, imammedo, anisinha,
	vasant.hegde, suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
	Ankit.Soni, ethan.milon, joao.m.martins, boris.ostrovsky
On Mon, Oct 06, 2025 at 11:38:28AM +0530, Sairaj Kodilkar wrote:
> 
> 
> On 9/20/2025 3:05 AM, Alejandro Jimenez wrote:
> > A guest must issue an INVALIDATE_DEVTAB_ENTRY command after changing a
> > Device Table entry (DTE) e.g. after attaching a device and setting up its
> > DTE. When intercepting this event, determine if the DTE has been configured
> > for paging or not, and toggle the appropriate memory regions to allow DMA
> > address translation for the address space if needed. Requires dma-remap=on.
> > 
> > Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> > ---
> >   hw/i386/amd_iommu.c | 122 +++++++++++++++++++++++++++++++++++++++++++-
> >   1 file changed, 120 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> > index ce5d4c36624fd..e916dcb2be381 100644
> > --- a/hw/i386/amd_iommu.c
> > +++ b/hw/i386/amd_iommu.c
> > @@ -1032,18 +1032,136 @@ static void amdvi_reset_address_translation_all(AMDVIState *s)
> >       }
> >   }
> > +static void enable_dma_mode(AMDVIAddressSpace *as, bool inval_current)
> > +{
> > +    /*
> > +     * When enabling DMA mode for the purpose of isolating guest devices on
> > +     * a failure to retrieve or invalid DTE, all existing mappings must be
> > +     * dropped.
> > +     */
> > +    if (inval_current) {
> > +        IOMMUNotifier *n;
> > +        IOMMU_NOTIFIER_FOREACH(n, &as->iommu) {
> > +            amdvi_address_space_unmap(as, n);
> > +        }
> > +    }
> > +
> > +    if (as->addr_translation) {
> > +        return;
> > +    }
> > +
> > +    /* Installing DTE enabling translation, activate region */
> > +    as->addr_translation = true;
> > +    amdvi_switch_address_space(as);
> > +    /* Sync shadow page tables */
> > +    amdvi_address_space_sync(as);
> Hi Alejandro,
> I think we can skip amdvi_address_space_sync, because
> amdvi_switch_address_space will trigger
> amdvi_iommu_replay. this replay should unmap all the old mappings and sync
> shadow page table.
> 
> Thanks
> Sairaj
Well I queued this but this speedup can be done on top.
-- 
MST
* Re: [PATCH v3 18/22] amd_iommu: Toggle address translation mode on devtab entry invalidation
  2025-10-06  6:15     ` Michael S. Tsirkin
@ 2025-10-06  6:25       ` Sairaj Kodilkar
  2025-10-06 16:03         ` Alejandro Jimenez
  0 siblings, 1 reply; 34+ messages in thread
From: Sairaj Kodilkar @ 2025-10-06  6:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alejandro Jimenez, qemu-devel, clement.mathieu--drif, pbonzini,
	richard.henderson, eduardo, peterx, david, philmd,
	marcel.apfelbaum, alex.williamson, imammedo, anisinha,
	vasant.hegde, suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
	Ankit.Soni, ethan.milon, joao.m.martins, boris.ostrovsky
On 10/6/2025 11:45 AM, Michael S. Tsirkin wrote:
> On Mon, Oct 06, 2025 at 11:38:28AM +0530, Sairaj Kodilkar wrote:
>>
>> On 9/20/2025 3:05 AM, Alejandro Jimenez wrote:
>>> A guest must issue an INVALIDATE_DEVTAB_ENTRY command after changing a
>>> Device Table entry (DTE) e.g. after attaching a device and setting up its
>>> DTE. When intercepting this event, determine if the DTE has been configured
>>> for paging or not, and toggle the appropriate memory regions to allow DMA
>>> address translation for the address space if needed. Requires dma-remap=on.
>>>
>>> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
>>> ---
>>>    hw/i386/amd_iommu.c | 122 +++++++++++++++++++++++++++++++++++++++++++-
>>>    1 file changed, 120 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
>>> index ce5d4c36624fd..e916dcb2be381 100644
>>> --- a/hw/i386/amd_iommu.c
>>> +++ b/hw/i386/amd_iommu.c
>>> @@ -1032,18 +1032,136 @@ static void amdvi_reset_address_translation_all(AMDVIState *s)
>>>        }
>>>    }
>>> +static void enable_dma_mode(AMDVIAddressSpace *as, bool inval_current)
>>> +{
>>> +    /*
>>> +     * When enabling DMA mode for the purpose of isolating guest devices on
>>> +     * a failure to retrieve or invalid DTE, all existing mappings must be
>>> +     * dropped.
>>> +     */
>>> +    if (inval_current) {
>>> +        IOMMUNotifier *n;
>>> +        IOMMU_NOTIFIER_FOREACH(n, &as->iommu) {
>>> +            amdvi_address_space_unmap(as, n);
>>> +        }
>>> +    }
>>> +
>>> +    if (as->addr_translation) {
>>> +        return;
>>> +    }
>>> +
>>> +    /* Installing DTE enabling translation, activate region */
>>> +    as->addr_translation = true;
>>> +    amdvi_switch_address_space(as);
>>> +    /* Sync shadow page tables */
>>> +    amdvi_address_space_sync(as);
>> Hi Alejandro,
>> I think we can skip amdvi_address_space_sync, because
>> amdvi_switch_address_space will trigger
>> amdvi_iommu_replay. this replay should unmap all the old mappings and sync
>> shadow page table.
>>
>> Thanks
>> Sairaj
> Well I queued this but this speedup can be done on top.
Sorry for the delay in reviewing, I was on vacation for 2 weeks.
I have reviewed all the patches.
Reviewed-by: Sairaj Kodilkar <sarunkod@amd.com>
Thanks
Sairaj
* Re: [PATCH v3 18/22] amd_iommu: Toggle address translation mode on devtab entry invalidation
  2025-10-06  6:25       ` Sairaj Kodilkar
@ 2025-10-06 16:03         ` Alejandro Jimenez
  0 siblings, 0 replies; 34+ messages in thread
From: Alejandro Jimenez @ 2025-10-06 16:03 UTC (permalink / raw)
  To: Sairaj Kodilkar, Michael S. Tsirkin
  Cc: qemu-devel, clement.mathieu--drif, pbonzini, richard.henderson,
	eduardo, peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky
On 10/6/25 2:25 AM, Sairaj Kodilkar wrote:
>
>
> On 10/6/2025 11:45 AM, Michael S. Tsirkin wrote:
>> On Mon, Oct 06, 2025 at 11:38:28AM +0530, Sairaj Kodilkar wrote:
>>>
>>> On 9/20/2025 3:05 AM, Alejandro Jimenez wrote:
>>>> A guest must issue an INVALIDATE_DEVTAB_ENTRY command after changing a
>>>> Device Table entry (DTE) e.g. after attaching a device and setting up its
>>>> DTE. When intercepting this event, determine if the DTE has been configured
>>>> for paging or not, and toggle the appropriate memory regions to allow DMA
>>>> address translation for the address space if needed. Requires dma-remap=on.
>>>>
>>>> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
>>>> ---
>>>>    hw/i386/amd_iommu.c | 122 +++++++++++++++++++++++++++++++++++++++++++-
>>>>    1 file changed, 120 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
>>>> index ce5d4c36624fd..e916dcb2be381 100644
>>>> --- a/hw/i386/amd_iommu.c
>>>> +++ b/hw/i386/amd_iommu.c
>>>> @@ -1032,18 +1032,136 @@ static void amdvi_reset_address_translation_all(AMDVIState *s)
>>>>        }
>>>>    }
>>>> +static void enable_dma_mode(AMDVIAddressSpace *as, bool inval_current)
>>>> +{
>>>> +    /*
>>>> +     * When enabling DMA mode for the purpose of isolating guest devices on
>>>> +     * a failure to retrieve or invalid DTE, all existing mappings must be
>>>> +     * dropped.
>>>> +     */
>>>> +    if (inval_current) {
>>>> +        IOMMUNotifier *n;
>>>> +        IOMMU_NOTIFIER_FOREACH(n, &as->iommu) {
>>>> +            amdvi_address_space_unmap(as, n);
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if (as->addr_translation) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    /* Installing DTE enabling translation, activate region */
>>>> +    as->addr_translation = true;
>>>> +    amdvi_switch_address_space(as);
>>>> +    /* Sync shadow page tables */
>>>> +    amdvi_address_space_sync(as);
>>> Hi Alejandro,
>>> I think we can skip amdvi_address_space_sync, because
>>> amdvi_switch_address_space will trigger
>>> amdvi_iommu_replay. this replay should unmap all the old mappings and sync
>>> shadow page table.
>>>
>>> Thanks
>>> Sairaj
>> Well I queued this but this speedup can be done on top.
ACK
I would rather be explicit and avoid relying on replay(), but sync is
expensive, so this could be worth the trouble with an added comment. I'll
test and include Sairaj's optimization in a different patchset.
If possible, please also add Sairaj's R-b to this series; he provided
valuable feedback and testing, so I'd like it to be recognized.
Alejandro
>>
> Sorry for the delay in reviewing, I was on vacation for 2 weeks.
> I have reviewed all the patches.
>
> Reviewed-by: Sairaj Kodilkar <sarunkod@amd.com>
> Thanks
> Sairaj
* Re: [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices
  2025-09-19 21:34 [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (21 preceding siblings ...)
  2025-09-19 21:35 ` [PATCH v3 22/22] amd_iommu: HATDis/HATS=11 support Alejandro Jimenez
@ 2025-10-06 16:07 ` Cédric Le Goater
  2025-10-06 18:44   ` Alejandro Jimenez
  22 siblings, 1 reply; 34+ messages in thread
From: Cédric Le Goater @ 2025-10-06 16:07 UTC (permalink / raw)
  To: Alejandro Jimenez, qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky
Hello Alejandro,
On 9/19/25 23:34, Alejandro Jimenez wrote:
> This series adds support for guests using the AMD vIOMMU to enable DMA remapping
> for VFIO devices. Please see v1[0] cover letter for additional details such as
> example QEMU command line parameters used in testing.
> 
> I have sanity tested on an AMD EPYC Genoa host, booting a Linux guest with
> 'iommu.passthrough=0' and several CX6 VFs, and there are no issues during
> typical guest operation.
> 
> When using the non-default parameter 'iommu.forcedac=1' in the guest kernel
> cmdline, this initially fails due to a VFIO integer overflow bug which requires
> the following fix in the host kernel:
> 
> https://github.com/aljimenezb/linux/commit/014be8cafe7464d278729583a2dd5d94514e2e2a
> This is a work in progress as there are other locations in the driver that are
> susceptible to overflows, but the above is sufficient to fix the initial
> problem.
> 
> Even after that fix is applied, I see an issue on guest reboot when 'forcedac=1'
> is in use. Although the guest boots, the VF is not properly initialized, failing
> with a timeout. Once the guest reaches userspace the VF driver can be reloaded
> and it then works as expected. I am still investigating the root cause for this
> issue, and will need to discuss all the steps I have tried to eliminate
> potential sources of errors in a separate thread.
> 
> I am sending v3 despite this known issue since forcedac=1 is not a default or
> commonly known/used setting. Having the large portions of the infrastructure for
> DMA remapping already in place (and working) will make it easier to debug this
> corner case and get feedback/testing from the community. I hope this is a viable
> approach, otherwise I am happy to discuss all the steps I have taken to debug
> this issue in this thread and test any suggestions to address it.
> 
> Changes since v2[2]:
> - P5: Fixed missed check for AMDVI_FR_DTE_RTR_ERR in amdvi_do_translate() (Sairaj)
> - P6: Reword commit message to clarify the need to discern between empty PTEs and errors (Vasant)
> - P9: Use correct enum type for notifier flags and remove whitespace changes (Sairaj)
> - P11: Fixed integer overflow bug when guest uses iommu.forcedac=1. Fixed in P8. (Sairaj)
> - P15: Fixed typo in commit message (Sairaj)
> - P16: On reset, use passthrough mode by default on all address spaces (Sairaj)
> - P18: Enforce isolation by using DMA mode on errors retrieving DTE (Ethan & Sairaj)
> - P20: Removed unused pte_override_page_mask() and pte_get_page_mask() to avoid -Wunused-function error.
> - Add HATDis support patches from Joao Martins (HATDis available in Linux since [1])
> 
> Thank you,
> Alejandro
> 
> [0] https://lore.kernel.org/all/20250414020253.443831-1-alejandro.j.jimenez@oracle.com/
> [1] https://lore.kernel.org/all/cover.1749016436.git.Ankit.Soni@amd.com/
> [2] https://lore.kernel.org/qemu-devel/20250502021605.1795985-1-alejandro.j.jimenez@oracle.com/
> 
> Alejandro Jimenez (20):
>    memory: Adjust event ranges to fit within notifier boundaries
>    amd_iommu: Document '-device amd-iommu' common options
>    amd_iommu: Reorder device and page table helpers
>    amd_iommu: Helper to decode size of page invalidation command
>    amd_iommu: Add helper function to extract the DTE
>    amd_iommu: Return an error when unable to read PTE from guest memory
>    amd_iommu: Add helpers to walk AMD v1 Page Table format
>    amd_iommu: Add a page walker to sync shadow page tables on
>      invalidation
>    amd_iommu: Add basic structure to support IOMMU notifier updates
>    amd_iommu: Sync shadow page tables on page invalidation
>    amd_iommu: Use iova_tree records to determine large page size on UNMAP
>    amd_iommu: Unmap all address spaces under the AMD IOMMU on reset
>    amd_iommu: Add replay callback
>    amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL
>    amd_iommu: Toggle memory regions based on address translation mode
>    amd_iommu: Set all address spaces to use passthrough mode on reset
>    amd_iommu: Add dma-remap property to AMD vIOMMU device
>    amd_iommu: Toggle address translation mode on devtab entry
>      invalidation
>    amd_iommu: Do not assume passthrough translation when DTE[TV]=0
>    amd_iommu: Refactor amdvi_page_walk() to use common code for page walk
> 
> Joao Martins (2):
>    i386/intel-iommu: Move dma_translation to x86-iommu
>    amd_iommu: HATDis/HATS=11 support
> 
>   hw/i386/acpi-build.c        |    6 +-
>   hw/i386/amd_iommu.c         | 1056 ++++++++++++++++++++++++++++++-----
>   hw/i386/amd_iommu.h         |   51 ++
The current status of AMD-Vi Emulation in MAINTAINERS is Orphan.
Since this series is about to be merged, should AMD-Vi be considered
maintained now ? and if so by whom ?
Thanks,
C.
>   hw/i386/intel_iommu.c       |    5 +-
>   hw/i386/x86-iommu.c         |    1 +
>   include/hw/i386/x86-iommu.h |    1 +
>   qemu-options.hx             |   23 +
>   system/memory.c             |   10 +-
>   8 files changed, 999 insertions(+), 154 deletions(-)
> 
> base-commit: ab8008b231e758e03c87c1c483c03afdd9c02e19
* Re: [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices
  2025-10-06 16:07 ` [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Cédric Le Goater
@ 2025-10-06 18:44   ` Alejandro Jimenez
  2025-10-07  5:45     ` Cédric Le Goater
  0 siblings, 1 reply; 34+ messages in thread
From: Alejandro Jimenez @ 2025-10-06 18:44 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	joao.m.martins, boris.ostrovsky
Hi Cédric,
On 10/6/25 12:07 PM, Cédric Le Goater wrote:
> Hello Alejandro,
> 
> On 9/19/25 23:34, Alejandro Jimenez wrote:
[...]
> 
> 
> The current status of AMD-Vi Emulation in MAINTAINERS is Orphan.
> Since this series is about to be merged, should AMD-Vi be considered
> maintained now ?
It should be considered maintained.
> and if so by whom ?
> 
I volunteer as maintainer. Assuming no objections from the community, I
will send a follow-up patch updating MAINTAINERS.
If there are additional suggestions/volunteers for co-maintainers,
please reply to this thread and I'll include them in the patch.
Thank you,
Alejandro
> Thanks,
> 
> C.
> 
> 
> 
> 
>>   hw/i386/intel_iommu.c       |    5 +-
>>   hw/i386/x86-iommu.c         |    1 +
>>   include/hw/i386/x86-iommu.h |    1 +
>>   qemu-options.hx             |   23 +
>>   system/memory.c             |   10 +-
>>   8 files changed, 999 insertions(+), 154 deletions(-)
>> base-commit: ab8008b231e758e03c87c1c483c03afdd9c02e19
> 
* Re: [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices
  2025-10-06 18:44   ` Alejandro Jimenez
@ 2025-10-07  5:45     ` Cédric Le Goater
  2025-10-07  8:17       ` Vasant Hegde
  2025-10-07 19:04       ` Joao Martins
  0 siblings, 2 replies; 34+ messages in thread
From: Cédric Le Goater @ 2025-10-07  5:45 UTC (permalink / raw)
  To: Alejandro Jimenez, qemu-devel, Joao Martins
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	boris.ostrovsky
Hello,
On 10/6/25 20:44, Alejandro Jimenez wrote:
> Hi Cédric,
> 
> On 10/6/25 12:07 PM, Cédric Le Goater wrote:
>> Hello Alejandro,
>>
>> On 9/19/25 23:34, Alejandro Jimenez wrote:
> 
> [...]
> 
>>
>>
>> The current status of AMD-Vi Emulation in MAINTAINERS is Orphan.
>> Since this series is about to be merged, should AMD-Vi be considered
>> maintained now ?
> 
> It should be considered maintained.
Great :)
>> and if so by whom ?
>>
> 
> I volunteer as maintainer. Assuming no objections from the community, I will send a follow up patch updating MAINTAINERS.
Thanks. 
> If there are additional suggestions/volunteers for co-maintainers, please reply to this thread and I'll include them on the patch.
This series includes a co-author who would make an excellent reviewer !
C.
* Re: [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices
  2025-10-07  5:45     ` Cédric Le Goater
@ 2025-10-07  8:17       ` Vasant Hegde
  2025-10-07 19:04       ` Joao Martins
  1 sibling, 0 replies; 34+ messages in thread
From: Vasant Hegde @ 2025-10-07  8:17 UTC (permalink / raw)
  To: Cédric Le Goater, Alejandro Jimenez, qemu-devel,
	Joao Martins
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, suravee.suthikulpanit, santosh.shukla,
	sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon, boris.ostrovsky
Cedric, Alejandro,
On 10/7/2025 11:15 AM, Cédric Le Goater wrote:
> Hello,
> 
> On 10/6/25 20:44, Alejandro Jimenez wrote:
>> Hi Cédric,
>>
>> On 10/6/25 12:07 PM, Cédric Le Goater wrote:
>>> Hello Alejandro,
>>>
>>> On 9/19/25 23:34, Alejandro Jimenez wrote:
>>
>> [...]
>>
>>>
>>>
>>> The current status of AMD-Vi Emulation in MAINTAINERS is Orphan.
>>> Since this series is about to be merged, should AMD-Vi be considered
>>> maintained now ?
>>
>> It should be considered maintained.
> 
> Great :)
>>> and if so by whom ?
>>>
>>
>> I volunteer as maintainer. Assuming no objections from the community, I will
>> send a follow up patch updating MAINTAINERS.
> 
> Thanks.
>> If there are additional suggestions/volunteers for co-maintainers, please
>> reply to this thread and I'll include them on the patch.
> This series includes a co-author who would make an excellent reviewer !
Ack. Can you please add "Sairaj" as Reviewer?
Sairaj Kodilkar <sarunkod@amd.com>
-Vasant
* Re: [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices
  2025-10-07  5:45     ` Cédric Le Goater
  2025-10-07  8:17       ` Vasant Hegde
@ 2025-10-07 19:04       ` Joao Martins
  2025-10-07 20:41         ` Cédric Le Goater
  1 sibling, 1 reply; 34+ messages in thread
From: Joao Martins @ 2025-10-07 19:04 UTC (permalink / raw)
  To: Cédric Le Goater, Alejandro Jimenez
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	boris.ostrovsky, qemu-devel
On 07/10/2025 06:45, Cédric Le Goater wrote:
> Hello,
> 
> On 10/6/25 20:44, Alejandro Jimenez wrote:
>> Hi Cédric,
>>
>> On 10/6/25 12:07 PM, Cédric Le Goater wrote:
>>> Hello Alejandro,
>>>
>>> On 9/19/25 23:34, Alejandro Jimenez wrote:
>>
>> [...]
>>
>>>
>>>
>>> The current status of AMD-Vi Emulation in MAINTAINERS is Orphan.
>>> Since this series is about to be merged, should AMD-Vi be considered
>>> maintained now ?
>>
>> It should be considered maintained.
> 
> Great :)
>>> and if so by whom ?
>>>
>>
>> I volunteer as maintainer. Assuming no objections from the community, I will
>> send a follow up patch updating MAINTAINERS.
> 
> Thanks.
>> If there are additional suggestions/volunteers for co-maintainers, please
>> reply to this thread and I'll include them on the patch.
> This series includes a co-author who would make an excellent reviewer !
> 
Heh, I wasn't sure whether you were talking about me or Sairaj, but FWIW: while I
know some things here and there, I am not nearly as deep into AMD-Vi as Alejandro
and Sairaj. Hence why I haven't volunteered :)
	Joao
* Re: [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices
  2025-10-07 19:04       ` Joao Martins
@ 2025-10-07 20:41         ` Cédric Le Goater
  0 siblings, 0 replies; 34+ messages in thread
From: Cédric Le Goater @ 2025-10-07 20:41 UTC (permalink / raw)
  To: Joao Martins, Alejandro Jimenez
  Cc: mst, clement.mathieu--drif, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	imammedo, anisinha, vasant.hegde, suravee.suthikulpanit,
	santosh.shukla, sarunkod, Wei.Huang2, Ankit.Soni, ethan.milon,
	boris.ostrovsky, qemu-devel
On 10/7/25 21:04, Joao Martins wrote:
> On 07/10/2025 06:45, Cédric Le Goater wrote:
>> Hello,
>>
>> On 10/6/25 20:44, Alejandro Jimenez wrote:
>>> Hi Cédric,
>>>
>>> On 10/6/25 12:07 PM, Cédric Le Goater wrote:
>>>> Hello Alejandro,
>>>>
>>>> On 9/19/25 23:34, Alejandro Jimenez wrote:
>>>
>>> [...]
>>>
>>>>
>>>>
>>>> The current status of AMD-Vi Emulation in MAINTAINERS is Orphan.
>>>> Since this series is about to be merged, should AMD-Vi be considered
>>>> maintained now ?
>>>
>>> It should be considered maintained.
>>
>> Great :)
>>>> and if so by whom ?
>>>>
>>>
>>> I volunteer as maintainer. Assuming no objections from the community, I will
>>> send a follow up patch updating MAINTAINERS.
>>
>> Thanks.
>>> If there are additional suggestions/volunteers for co-maintainers, please
>>> reply to this thread and I'll include them on the patch.
>> This series includes a co-author who would make an excellent reviewer !
>>
> 
> Heh, I wasn't sure whether you were talking about me or Sairaj, but FWIW: while I
> know some things here and there, I am not nearly as deep into AMD-Vi as Alejandro
> and Sairaj. Hence why I haven't volunteered :)
We don't have a cap on the number of maintainers or reviewers :)
   
It would be helpful to add someone responsible for handling PRs.
Thanks,
C.
Thread overview: 34+ messages
2025-09-19 21:34 [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
2025-09-19 21:34 ` [PATCH v3 01/22] memory: Adjust event ranges to fit within notifier boundaries Alejandro Jimenez
2025-09-19 21:34 ` [PATCH v3 02/22] amd_iommu: Document '-device amd-iommu' common options Alejandro Jimenez
2025-09-19 21:34 ` [PATCH v3 03/22] amd_iommu: Reorder device and page table helpers Alejandro Jimenez
2025-09-19 21:34 ` [PATCH v3 04/22] amd_iommu: Helper to decode size of page invalidation command Alejandro Jimenez
2025-09-19 21:34 ` [PATCH v3 05/22] amd_iommu: Add helper function to extract the DTE Alejandro Jimenez
2025-09-19 21:34 ` [PATCH v3 06/22] amd_iommu: Return an error when unable to read PTE from guest memory Alejandro Jimenez
2025-09-19 21:35 ` [PATCH v3 07/22] amd_iommu: Add helpers to walk AMD v1 Page Table format Alejandro Jimenez
2025-09-19 21:35 ` [PATCH v3 08/22] amd_iommu: Add a page walker to sync shadow page tables on invalidation Alejandro Jimenez
2025-09-19 21:35 ` [PATCH v3 09/22] amd_iommu: Add basic structure to support IOMMU notifier updates Alejandro Jimenez
2025-09-19 21:35 ` [PATCH v3 10/22] amd_iommu: Sync shadow page tables on page invalidation Alejandro Jimenez
2025-09-19 21:35 ` [PATCH v3 11/22] amd_iommu: Use iova_tree records to determine large page size on UNMAP Alejandro Jimenez
2025-09-19 21:35 ` [PATCH v3 12/22] amd_iommu: Unmap all address spaces under the AMD IOMMU on reset Alejandro Jimenez
2025-09-19 21:35 ` [PATCH v3 13/22] amd_iommu: Add replay callback Alejandro Jimenez
2025-09-19 21:35 ` [PATCH v3 14/22] amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL Alejandro Jimenez
2025-09-19 21:35 ` [PATCH v3 15/22] amd_iommu: Toggle memory regions based on address translation mode Alejandro Jimenez
2025-09-19 21:35 ` [PATCH v3 16/22] amd_iommu: Set all address spaces to use passthrough mode on reset Alejandro Jimenez
2025-09-19 21:35 ` [PATCH v3 17/22] amd_iommu: Add dma-remap property to AMD vIOMMU device Alejandro Jimenez
2025-09-19 21:35 ` [PATCH v3 18/22] amd_iommu: Toggle address translation mode on devtab entry invalidation Alejandro Jimenez
2025-10-06  6:08   ` Sairaj Kodilkar
2025-10-06  6:15     ` Michael S. Tsirkin
2025-10-06  6:25       ` Sairaj Kodilkar
2025-10-06 16:03         ` Alejandro Jimenez
2025-09-19 21:35 ` [PATCH v3 19/22] amd_iommu: Do not assume passthrough translation when DTE[TV]=0 Alejandro Jimenez
2025-09-19 21:35 ` [PATCH v3 20/22] amd_iommu: Refactor amdvi_page_walk() to use common code for page walk Alejandro Jimenez
2025-09-19 21:35 ` [PATCH v3 21/22] i386/intel-iommu: Move dma_translation to x86-iommu Alejandro Jimenez
2025-09-22  5:33   ` CLEMENT MATHIEU--DRIF
2025-09-19 21:35 ` [PATCH v3 22/22] amd_iommu: HATDis/HATS=11 support Alejandro Jimenez
2025-10-06 16:07 ` [PATCH v3 00/22] AMD vIOMMU: DMA remapping support for VFIO devices Cédric Le Goater
2025-10-06 18:44   ` Alejandro Jimenez
2025-10-07  5:45     ` Cédric Le Goater
2025-10-07  8:17       ` Vasant Hegde
2025-10-07 19:04       ` Joao Martins
2025-10-07 20:41         ` Cédric Le Goater