qemu-devel.nongnu.org archive mirror
* [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices
@ 2025-04-14  2:02 Alejandro Jimenez
  2025-04-14  2:02 ` [PATCH 01/18] memory: Adjust event ranges to fit within notifier boundaries Alejandro Jimenez
                   ` (18 more replies)
  0 siblings, 19 replies; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-14  2:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez

This series adds support for guests using the AMD vIOMMU to enable DMA
remapping for VFIO devices. In addition to the currently supported
passthrough (PT) mode, guest kernels are now able to provide DMA
address translation and access permission checking to VFs attached to
paging domains, using the AMD v1 I/O page table format.

These changes provide the essential emulation required to boot and
support regular operation of a Linux guest that enables DMA remapping,
e.g. via the kernel parameters "iommu=nopt" or "iommu.passthrough=0".

A new amd-iommu device property "dma-remap" (default: off) is introduced
to control whether the feature is available. See below for a full
example of QEMU cmdline parameters used in testing.

The patchset has been tested on an AMD EPYC Genoa host, with Linux 6.14
host and guest kernels, launching guests with up to 256 vCPUs, 512G
memory, and 16 CX6 VFs. Testing with IOMMU x2apic support enabled (i.e.
xtsup=on) requires this fix:
https://lore.kernel.org/all/20250410064447.29583-3-sarunkod@amd.com/

Although there is more work to do, I am sending this series as a patch
and not an RFC since it provides a working implementation of the
feature. With this basic infrastructure in place it becomes easier to
add/verify enhancements and new functionality. Here are some items I am
working to address in follow-up patches:

- Page Fault and error reporting
- Add QEMU tracing and tests
- Provide control over VA Size advertised to guests
- Support hotplug/unplug of devices and other advanced features
  (suggestions welcome)

Thank you,
Alejandro

---
Example QEMU command line:

$QEMU \
-nodefaults \
-snapshot \
-no-user-config \
-display none \
-serial mon:stdio -nographic \
-machine q35,accel=kvm,kernel_irqchip=split \
-cpu host,+topoext,+x2apic,-svm,-vmx,-kvm-msi-ext-dest-id \
-smp 32 \
-m 128G \
-kernel $KERNEL \
-initrd $INITRD \
-append "console=tty0 console=ttyS0 root=/dev/mapper/ol-root ro rd.lvm.lv=ol/root rd.lvm.lv=ol/swap iommu.passthrough=0" \
-device amd-iommu,intremap=on,xtsup=on,dma-remap=on \
-blockdev node-name=drive0,driver=qcow2,file.driver=file,file.filename=./OracleLinux-uefi-x86_64.qcow2 \
-device virtio-blk-pci,drive=drive0,id=virtio-disk0 \
-drive if=pflash,format=raw,unit=0,file=/usr/share/edk2/ovmf/OVMF_CODE.fd,readonly=on \
-drive if=pflash,format=raw,unit=1,file=./OVMF_VARS.fd \
-device vfio-pci,host=0000:a1:00.1,id=net0
---

Alejandro Jimenez (18):
  memory: Adjust event ranges to fit within notifier boundaries
  amd_iommu: Add helper function to extract the DTE
  amd_iommu: Add support for IOMMU notifier
  amd_iommu: Unmap all address spaces under the AMD IOMMU on reset
  amd_iommu: Toggle memory regions based on address translation mode
  amd_iommu: Set all address spaces to default translation mode on reset
  amd_iommu: Return an error when unable to read PTE from guest memory
  amd_iommu: Helper to decode size of page invalidation command
  amd_iommu: Add helpers to walk AMD v1 Page Table format
  amd_iommu: Add a page walker to sync shadow page tables on
    invalidation
  amd_iommu: Sync shadow page tables on page invalidation
  amd_iommu: Add replay callback
  amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL
  amd_iommu: Toggle address translation on device table entry
    invalidation
  amd_iommu: Use iova_tree records to determine large page size on UNMAP
  amd_iommu: Do not assume passthrough translation when DTE[TV]=0
  amd_iommu: Refactor amdvi_page_walk() to use common code for page walk
  amd_iommu: Do not emit I/O page fault events during replay()

 hw/i386/amd_iommu.c | 856 ++++++++++++++++++++++++++++++++++++++++----
 hw/i386/amd_iommu.h |  52 +++
 system/memory.c     |  10 +-
 3 files changed, 843 insertions(+), 75 deletions(-)


base-commit: 56c6e249b6988c1b6edc2dd34ebb0f1e570a1365
-- 
2.43.5




* [PATCH 01/18] memory: Adjust event ranges to fit within notifier boundaries
  2025-04-14  2:02 [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
@ 2025-04-14  2:02 ` Alejandro Jimenez
  2025-04-14  2:02 ` [PATCH 02/18] amd_iommu: Add helper function to extract the DTE Alejandro Jimenez
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-14  2:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez

Invalidating the entire address space (i.e. range of [0, ~0ULL]) is a
valid and required operation by vIOMMU implementations. However, such
invalidations currently trigger an assertion unless they originate from
device IOTLB invalidations.

Although in recent Linux guests this case is not exercised by the VTD
implementation due to various optimizations, the assertion will be hit
by upcoming AMD vIOMMU changes to support DMA address translation. More
specifically, it is hit when running a Linux guest with a VFIO passthrough
device and a kernel that does not contain commit 3f2571fed2fa ("iommu/amd:
Remove redundant domain flush from attach_device()").

Remove the assertion altogether and adjust the range to ensure it does
not cross notifier boundaries.
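
As a worked illustration of the cropping (the notifier bounds below are
made up for the example; the MAX/MIN arithmetic mirrors the patch):

    /* Hypothetical notifier watching [0x100000, 0x1FFFFF] */
    hwaddr start = 0x100000, end = 0x1FFFFF;

    /* Event invalidating the whole address space: [0, ~0ULL] */
    hwaddr iova = 0;
    hwaddr addr_mask = ~0ULL;
    hwaddr entry_end = iova + addr_mask;    /* == ~0ULL */

    /* Crop (iova, addr_mask) to the notifier range */
    iova = MAX(iova, start);                /* 0x100000 */
    addr_mask = MIN(entry_end, end) - iova; /* 0xFFFFF  */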

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 system/memory.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/system/memory.c b/system/memory.c
index 4c829793a0ad..7d120e25abe8 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2030,13 +2030,9 @@ void memory_region_notify_iommu_one(IOMMUNotifier *notifier,
         return;
     }
 
-    if (notifier->notifier_flags & IOMMU_NOTIFIER_DEVIOTLB_UNMAP) {
-        /* Crop (iova, addr_mask) to range */
-        tmp.iova = MAX(tmp.iova, notifier->start);
-        tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
-    } else {
-        assert(entry->iova >= notifier->start && entry_end <= notifier->end);
-    }
+    /* Crop (iova, addr_mask) to range */
+    tmp.iova = MAX(tmp.iova, notifier->start);
+    tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
 
     if (event->type & notifier->notifier_flags) {
         notifier->notify(notifier, &tmp);
-- 
2.43.5




* [PATCH 02/18] amd_iommu: Add helper function to extract the DTE
  2025-04-14  2:02 [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
  2025-04-14  2:02 ` [PATCH 01/18] memory: Adjust event ranges to fit within notifier boundaries Alejandro Jimenez
@ 2025-04-14  2:02 ` Alejandro Jimenez
  2025-04-16 11:36   ` Sairaj Kodilkar
  2025-04-14  2:02 ` [PATCH 03/18] amd_iommu: Add support for IOMMU notifier Alejandro Jimenez
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-14  2:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez

Extracting the DTE from a given AMDVIAddressSpace structure is a
common operation required for syncing the shadow page tables. Implement a
helper to do it and check for common error conditions.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 47 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 42 insertions(+), 5 deletions(-)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 5f9b95279997..22d648c2e0e3 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -77,6 +77,20 @@ typedef struct AMDVIIOTLBEntry {
     uint64_t page_mask;         /* physical page size  */
 } AMDVIIOTLBEntry;
 
+/*
+ * These 'fault' reasons have an overloaded meaning since they are not only
+ * intended for describing reasons that generate an IO_PAGE_FAULT as per the AMD
+ * IOMMU specification, but are also used to signal internal errors in the
+ * emulation code.
+ */
+typedef enum AMDVIFaultReason {
+    AMDVI_FR_DTE_RTR_ERR = 1,           /* Failure to retrieve DTE */
+    AMDVI_FR_DTE_V,                     /* DTE[V] = 0 */
+    AMDVI_FR_DTE_TV,                    /* DTE[TV] = 0 */
+} AMDVIFaultReason;
+
+static int amdvi_as_to_dte(AMDVIAddressSpace *as, uint64_t *dte);
+
 uint64_t amdvi_extended_feature_register(AMDVIState *s)
 {
     uint64_t feature = AMDVI_DEFAULT_EXT_FEATURES;
@@ -882,6 +896,28 @@ static bool amdvi_get_dte(AMDVIState *s, int devid, uint64_t *entry)
     return true;
 }
 
+static int amdvi_as_to_dte(AMDVIAddressSpace *as, uint64_t *dte)
+{
+    uint16_t devid = PCI_BUILD_BDF(as->bus_num, as->devfn);
+    AMDVIState *s = as->iommu_state;
+
+    if (!amdvi_get_dte(s, devid, dte)) {
+        /* Unable to retrieve DTE for devid */
+        return -AMDVI_FR_DTE_RTR_ERR;
+    }
+
+    if (!(dte[0] & AMDVI_DEV_VALID)) {
+        /* DTE[V] not set, address is passed untranslated for devid */
+        return -AMDVI_FR_DTE_V;
+    }
+
+    if (!(dte[0] & AMDVI_DEV_TRANSLATION_VALID)) {
+        /* DTE[TV] not set, host page table not valid for devid */
+        return -AMDVI_FR_DTE_TV;
+    }
+    return 0;
+}
+
 /* get pte translation mode */
 static inline uint8_t get_pte_translation_mode(uint64_t pte)
 {
@@ -990,6 +1026,7 @@ static void amdvi_do_translate(AMDVIAddressSpace *as, hwaddr addr,
     uint16_t devid = PCI_BUILD_BDF(as->bus_num, as->devfn);
     AMDVIIOTLBEntry *iotlb_entry = amdvi_iotlb_lookup(s, addr, devid);
     uint64_t entry[4];
+    int dte_ret;
 
     if (iotlb_entry) {
         trace_amdvi_iotlb_hit(PCI_BUS_NUM(devid), PCI_SLOT(devid),
@@ -1001,13 +1038,13 @@ static void amdvi_do_translate(AMDVIAddressSpace *as, hwaddr addr,
         return;
     }
 
-    if (!amdvi_get_dte(s, devid, entry)) {
-        return;
-    }
+    dte_ret = amdvi_as_to_dte(as, entry);
 
-    /* devices with V = 0 are not translated */
-    if (!(entry[0] & AMDVI_DEV_VALID)) {
+    if (dte_ret == -AMDVI_FR_DTE_V) {
+        /* DTE[V]=0, address is passed untranslated */
         goto out;
+    } else if (dte_ret == -AMDVI_FR_DTE_TV) {
+        return;
     }
 
     amdvi_page_walk(as, entry, ret,
-- 
2.43.5




* [PATCH 03/18] amd_iommu: Add support for IOMMU notifier
  2025-04-14  2:02 [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
  2025-04-14  2:02 ` [PATCH 01/18] memory: Adjust event ranges to fit within notifier boundaries Alejandro Jimenez
  2025-04-14  2:02 ` [PATCH 02/18] amd_iommu: Add helper function to extract the DTE Alejandro Jimenez
@ 2025-04-14  2:02 ` Alejandro Jimenez
  2025-04-16 12:14   ` Sairaj Kodilkar
  2025-04-14  2:02 ` [PATCH 04/18] amd_iommu: Unmap all address spaces under the AMD IOMMU on reset Alejandro Jimenez
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-14  2:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez

In order to enable device assignment with IOMMU protection and guest
DMA address translation, IOMMU notifier support is necessary to allow
users like VFIO to synchronize the shadow page tables, i.e. to receive
notifications when the guest updates its IO page tables and replay the
mappings onto host IO page tables.

This requires that the vIOMMU be configured with the NpCache capability,
so that the guest issues IOMMU invalidations for both map() and unmap()
operations. This capability is already part of AMDVI_CAPAB_FEATURES,
and is written to the configuration in amdvi_pci_realize().
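
For context, a minimal sketch of how a consumer such as VFIO exercises
this path (the callback and the address space variable are placeholders;
iommu_notifier_init() and memory_region_register_iommu_notifier() are the
existing QEMU APIs):

    static void shadow_pt_notify(IOMMUNotifier *n, IOMMUTLBEntry *entry)
    {
        /* Replay the MAP/UNMAP event onto the host IO page tables */
    }

    IOMMUNotifier n;

    /*
     * Request both MAP and UNMAP events over the entire range. With this
     * patch, amdvi_iommu_notify_flag_changed() accepts the MAP flag when
     * dma-remap=on and tracks the address space in the notifier list.
     */
    iommu_notifier_init(&n, shadow_pt_notify, IOMMU_NOTIFIER_IOTLB_EVENTS,
                        0, UINT64_MAX, 0);
    memory_region_register_iommu_notifier(MEMORY_REGION(&as->iommu), &n,
                                          &error_fatal);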

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 34 ++++++++++++++++++++++++++++------
 hw/i386/amd_iommu.h |  6 ++++++
 2 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 22d648c2e0e3..8dbb10d91339 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -66,6 +66,13 @@ struct AMDVIAddressSpace {
     MemoryRegion iommu_nodma;   /* Alias of shared nodma memory region  */
     MemoryRegion iommu_ir;      /* Device's interrupt remapping region  */
     AddressSpace as;            /* device's corresponding address space */
+
+    /* DMA address translation support */
+    IOMMUNotifierFlag notifier_flags;
+    /* entry in list of Address spaces with registered notifiers */
+    QLIST_ENTRY(AMDVIAddressSpace) next;
+    /* DMA address translation active */
+    bool addr_translation;
 };
 
 /* AMDVI cache entry */
@@ -1561,14 +1568,28 @@ static int amdvi_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
                                            Error **errp)
 {
     AMDVIAddressSpace *as = container_of(iommu, AMDVIAddressSpace, iommu);
+    AMDVIState *s = as->iommu_state;
+
+    /* DMA remapping capability is required to implement IOMMU notifier */
+    if (!s->dma_remap && (new & IOMMU_NOTIFIER_MAP)) {
+        error_setg_errno(errp, ENOTSUP,
+                "device %02x.%02x.%x requires dma-remap=1",
+                as->bus_num, PCI_SLOT(as->devfn), PCI_FUNC(as->devfn));
+        return -ENOTSUP;
+    }
 
-    if (new & IOMMU_NOTIFIER_MAP) {
-        error_setg(errp,
-                   "device %02x.%02x.%x requires iommu notifier which is not "
-                   "currently supported", as->bus_num, PCI_SLOT(as->devfn),
-                   PCI_FUNC(as->devfn));
-        return -EINVAL;
+    /*
+     * Update notifier flags for address space and the list of address spaces
+     * with registered notifiers.
+     */
+    as->notifier_flags = new;
+
+    if (old == IOMMU_NOTIFIER_NONE) {
+        QLIST_INSERT_HEAD(&s->amdvi_as_with_notifiers, as, next);
+    } else if (new == IOMMU_NOTIFIER_NONE) {
+        QLIST_REMOVE(as, next);
     }
+
     return 0;
 }
 
@@ -1700,6 +1721,7 @@ static void amdvi_sysbus_realize(DeviceState *dev, Error **errp)
 
 static const Property amdvi_properties[] = {
     DEFINE_PROP_BOOL("xtsup", AMDVIState, xtsup, false),
+    DEFINE_PROP_BOOL("dma-remap", AMDVIState, dma_remap, false),
 };
 
 static const VMStateDescription vmstate_amdvi_sysbus = {
diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
index 28125130c6fc..e12ecade4baa 100644
--- a/hw/i386/amd_iommu.h
+++ b/hw/i386/amd_iommu.h
@@ -365,12 +365,18 @@ struct AMDVIState {
     /* for each served device */
     AMDVIAddressSpace **address_spaces[PCI_BUS_MAX];
 
+    /* list of address spaces with registered notifiers */
+    QLIST_HEAD(, AMDVIAddressSpace) amdvi_as_with_notifiers;
+
     /* IOTLB */
     GHashTable *iotlb;
 
     /* Interrupt remapping */
     bool ga_enabled;
     bool xtsup;
+
+    /* DMA address translation */
+    bool dma_remap;
 };
 
 uint64_t amdvi_extended_feature_register(AMDVIState *s);
-- 
2.43.5




* [PATCH 04/18] amd_iommu: Unmap all address spaces under the AMD IOMMU on reset
  2025-04-14  2:02 [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (2 preceding siblings ...)
  2025-04-14  2:02 ` [PATCH 03/18] amd_iommu: Add support for IOMMU notifier Alejandro Jimenez
@ 2025-04-14  2:02 ` Alejandro Jimenez
  2025-04-14  2:02 ` [PATCH 05/18] amd_iommu: Toggle memory regions based on address translation mode Alejandro Jimenez
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-14  2:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez

Support dropping all existing mappings on reset. When the guest kernel
reboots it will create new ones, but other components that run before
the kernel (e.g. OVMF) should not be able to use existing mappings from
the previous boot.
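
A worked example of the chunking done by the new helper below (addresses
are made up; dma_aligned_pow2_mask() is QEMU's existing helper):

    /*
     * Notifier range [0x1000, 0x4FFF], 0x4000 bytes total, is split into:
     *   [0x1000, 0x1FFF]  4KB chunk (start is only 4KB-aligned)
     *   [0x2000, 0x3FFF]  8KB chunk (8KB-aligned, largest fit remaining)
     *   [0x4000, 0x4FFF]  4KB chunk (remainder)
     * Each chunk is emitted as a separate IOMMU_NOTIFIER_UNMAP event.
     */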

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 68 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 8dbb10d91339..ad5869e72fdc 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -1109,6 +1109,71 @@ static IOMMUTLBEntry amdvi_translate(IOMMUMemoryRegion *iommu, hwaddr addr,
     return ret;
 }
 
+/*
+ * Unmap the entire range that the notifier registered for, i.e. the full AS.
+ *
+ * This is technically equivalent to directly calling
+ * memory_region_unmap_iommu_notifier_range(), but it allows checking for
+ * notifier boundaries and issuing notifications with ranges within those bounds.
+ */
+static void amdvi_address_space_unmap(AMDVIAddressSpace *as, IOMMUNotifier *n)
+{
+
+    hwaddr start = n->start;
+    hwaddr end = n->end;
+    hwaddr remain;
+
+    assert(start <= end);
+    remain = end - start + 1;
+
+    /*
+     * Divide the notifier range into chunks that are aligned and do not exceed
+     * the notifier boundaries.
+     */
+    while (remain >= AMDVI_PAGE_SIZE) {
+
+        IOMMUTLBEvent event;
+
+        uint64_t mask = dma_aligned_pow2_mask(start, end, 64);
+
+        event.type = IOMMU_NOTIFIER_UNMAP;
+
+        IOMMUTLBEntry entry = {
+            .target_as = &address_space_memory,
+            .iova = start,
+            .translated_addr = 0,   /* irrelevant for unmap case */
+            .addr_mask = mask,
+            .perm = IOMMU_NONE,
+        };
+        event.entry = entry;
+
+        /* Call notifier registered for updates on this address space */
+        memory_region_notify_iommu_one(n, &event);
+
+        start += mask + 1;
+        remain -= mask + 1;
+    }
+
+    assert(!remain);
+}
+
+/*
+ * For all address spaces with registered notifiers, unmap the entire range
+ * each notifier registered for, i.e. clear all the address spaces managed by
+ * the IOMMU.
+ */
+static void amdvi_address_space_unmap_all(AMDVIState *s)
+{
+    AMDVIAddressSpace *as;
+    IOMMUNotifier *n;
+
+    QLIST_FOREACH(as, &s->amdvi_as_with_notifiers, next) {
+        IOMMU_NOTIFIER_FOREACH(n, &as->iommu) {
+            amdvi_address_space_unmap(as, n);
+        }
+    }
+}
+
 static int amdvi_get_irte(AMDVIState *s, MSIMessage *origin, uint64_t *dte,
                           union irte *irte, uint16_t devid)
 {
@@ -1667,6 +1732,9 @@ static void amdvi_sysbus_reset(DeviceState *dev)
 
     msi_reset(&s->pci.dev);
     amdvi_init(s);
+
+    /* Discard all mappings on device reset */
+    amdvi_address_space_unmap_all(s);
 }
 
 static void amdvi_sysbus_realize(DeviceState *dev, Error **errp)
-- 
2.43.5




* [PATCH 05/18] amd_iommu: Toggle memory regions based on address translation mode
  2025-04-14  2:02 [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (3 preceding siblings ...)
  2025-04-14  2:02 ` [PATCH 04/18] amd_iommu: Unmap all address spaces under the AMD IOMMU on reset Alejandro Jimenez
@ 2025-04-14  2:02 ` Alejandro Jimenez
  2025-04-22 12:17   ` Sairaj Kodilkar
  2025-04-14  2:02 ` [PATCH 06/18] amd_iommu: Set all address spaces to default translation mode on reset Alejandro Jimenez
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-14  2:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez

Enable the appropriate memory region for an address space depending on
the address translation mode selected for it. This is currently based on
a generic x86 IOMMU property, and only done during address space
initialization. Extract the code into a helper and toggle the regions
based on whether DMA remapping is available as a global capability, and
whether the specific address space is using address translation.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 30 ++++++++++++++++++++----------
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index ad5869e72fdc..3f9aa2cc8d31 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -1529,13 +1529,31 @@ static const MemoryRegionOps amdvi_ir_ops = {
     }
 };
 
+/*
+ * Toggle between address translation and passthrough modes by enabling the
+ * corresponding memory regions.
+ */
+static void amdvi_switch_address_space(AMDVIAddressSpace *amdvi_as)
+{
+    AMDVIState *s = amdvi_as->iommu_state;
+
+    if (s->dma_remap && amdvi_as->addr_translation) {
+        /* Enabling DMA region */
+        memory_region_set_enabled(&amdvi_as->iommu_nodma, false);
+        memory_region_set_enabled(MEMORY_REGION(&amdvi_as->iommu), true);
+    } else {
+        /* Disabling DMA region, using passthrough */
+        memory_region_set_enabled(MEMORY_REGION(&amdvi_as->iommu), false);
+        memory_region_set_enabled(&amdvi_as->iommu_nodma, true);
+    }
+}
+
 static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
 {
     char name[128];
     AMDVIState *s = opaque;
     AMDVIAddressSpace **iommu_as, *amdvi_dev_as;
     int bus_num = pci_bus_num(bus);
-    X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
 
     iommu_as = s->address_spaces[bus_num];
 
@@ -1595,15 +1613,7 @@ static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
                                             AMDVI_INT_ADDR_FIRST,
                                             &amdvi_dev_as->iommu_ir, 1);
 
-        if (!x86_iommu->pt_supported) {
-            memory_region_set_enabled(&amdvi_dev_as->iommu_nodma, false);
-            memory_region_set_enabled(MEMORY_REGION(&amdvi_dev_as->iommu),
-                                      true);
-        } else {
-            memory_region_set_enabled(MEMORY_REGION(&amdvi_dev_as->iommu),
-                                      false);
-            memory_region_set_enabled(&amdvi_dev_as->iommu_nodma, true);
-        }
+        amdvi_switch_address_space(amdvi_dev_as);
     }
     return &iommu_as[devfn]->as;
 }
-- 
2.43.5




* [PATCH 06/18] amd_iommu: Set all address spaces to default translation mode on reset
  2025-04-14  2:02 [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (4 preceding siblings ...)
  2025-04-14  2:02 ` [PATCH 05/18] amd_iommu: Toggle memory regions based on address translation mode Alejandro Jimenez
@ 2025-04-14  2:02 ` Alejandro Jimenez
  2025-04-14  2:02 ` [PATCH 07/18] amd_iommu: Return an error when unable to read PTE from guest memory Alejandro Jimenez
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-14  2:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez

On reset, restore the default address translation mode for all the
address spaces managed by the vIOMMU.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 3f9aa2cc8d31..0df658712ec0 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -1548,6 +1548,33 @@ static void amdvi_switch_address_space(AMDVIAddressSpace *amdvi_as)
     }
 }
 
+/*
+ * For all existing address spaces managed by the IOMMU, enable/disable the
+ * corresponding memory regions depending on the address translation mode
+ * as determined by the global and individual address space settings.
+ */
+static void amdvi_switch_address_space_all(AMDVIState *s)
+{
+    AMDVIAddressSpace **iommu_as;
+
+    for (int bus_num = 0; bus_num < PCI_BUS_MAX; bus_num++) {
+
+        /* Nothing to do if there are no devices on the current bus */
+        if (!s->address_spaces[bus_num]) {
+            continue;
+        }
+        iommu_as = s->address_spaces[bus_num];
+
+        for (int devfn = 0; devfn < PCI_DEVFN_MAX; devfn++) {
+
+            if (!iommu_as[devfn]) {
+                continue;
+            }
+            amdvi_switch_address_space(iommu_as[devfn]);
+        }
+    }
+}
+
 static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
 {
     char name[128];
@@ -1745,6 +1772,7 @@ static void amdvi_sysbus_reset(DeviceState *dev)
 
     /* Discard all mappings on device reset */
     amdvi_address_space_unmap_all(s);
+    amdvi_switch_address_space_all(s);
 }
 
 static void amdvi_sysbus_realize(DeviceState *dev, Error **errp)
-- 
2.43.5




* [PATCH 07/18] amd_iommu: Return an error when unable to read PTE from guest memory
  2025-04-14  2:02 [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (5 preceding siblings ...)
  2025-04-14  2:02 ` [PATCH 06/18] amd_iommu: Set all address spaces to default translation mode on reset Alejandro Jimenez
@ 2025-04-14  2:02 ` Alejandro Jimenez
  2025-04-14  2:02 ` [PATCH 08/18] amd_iommu: Helper to decode size of page invalidation command Alejandro Jimenez
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-14  2:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez

Make amdvi_get_pte_entry() return an error value (-1) in cases where the
memory read fails, rather than the current return of 0 to indicate
failure. The reason is that 0 is also a valid PTE value, and it is useful
to know when a PTE points to memory that is zero, i.e. the guest unmapped
the page.
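
A brief sketch of how a caller can now distinguish the outcomes (the
surrounding variables follow the page walk code in this patch):

    uint64_t pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);

    if (pte == (uint64_t)-1) {
        /* The read from guest memory failed; abort the walk */
    } else if (!pte) {
        /* Valid read of a zero PTE: the guest unmapped the page */
    } else {
        /* Valid, non-zero PTE */
    }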

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 0df658712ec0..5f55be1f4d36 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -958,7 +958,7 @@ static inline uint64_t amdvi_get_pte_entry(AMDVIState *s, uint64_t pte_addr,
                         &pte, sizeof(pte), MEMTXATTRS_UNSPECIFIED)) {
         trace_amdvi_get_pte_hwerror(pte_addr);
         amdvi_log_pagetab_error(s, devid, pte_addr, 0);
-        pte = 0;
+        pte = (uint64_t)-1;
         return pte;
     }
 
@@ -999,7 +999,7 @@ static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
             /* add offset and load pte */
             pte_addr += ((addr >> (3 + 9 * level)) & 0x1FF) << 3;
             pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
-            if (!pte) {
+            if (!pte || (pte == (uint64_t)-1)) {
                 return;
             }
             oldlevel = level;
-- 
2.43.5




* [PATCH 08/18] amd_iommu: Helper to decode size of page invalidation command
  2025-04-14  2:02 [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (6 preceding siblings ...)
  2025-04-14  2:02 ` [PATCH 07/18] amd_iommu: Return an error when unable to read PTE from guest memory Alejandro Jimenez
@ 2025-04-14  2:02 ` Alejandro Jimenez
  2025-04-22 12:26   ` Sairaj Kodilkar
  2025-04-14  2:02 ` [PATCH 09/18] amd_iommu: Add helpers to walk AMD v1 Page Table format Alejandro Jimenez
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-14  2:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez

The size of the region to invalidate depends on the S bit and address
encoded in the command. Add a helper to extract this information, which
will be used to sync shadow page tables in upcoming changes.
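
A few worked examples of the decoding described above (addresses chosen
for illustration, following Table 14 of the AMD IOMMU specification):

    uint16_t s_bit = AMDVI_CMD_INVAL_IOMMU_PAGES_S;

    amdvi_decode_invalidation_size(0x10000000, 0);  /* S=0 -> 4KB */
    amdvi_decode_invalidation_size(0x1000, s_bit);  /* first zero at bit 13 -> 16KB */
    amdvi_decode_invalidation_size(0x7000, s_bit);  /* first zero at bit 15 -> 64KB */
    amdvi_decode_invalidation_size((1ULL << 63) - 1, s_bit);
                                    /* first zero at bit 63 -> entire AS */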

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 34 ++++++++++++++++++++++++++++++++++
 hw/i386/amd_iommu.h |  4 ++++
 2 files changed, 38 insertions(+)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 5f55be1f4d36..0af873b66a31 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -481,6 +481,40 @@ static gboolean amdvi_iotlb_remove_by_domid(gpointer key, gpointer value,
     return entry->domid == domid;
 }
 
+/*
+ * Helper to decode the size of the range to invalidate encoded in the
+ * INVALIDATE_IOMMU_PAGES Command format.
+ * The size of the region to invalidate depends on the S bit and address.
+ * S bit value:
+ * 0 :  Invalidation size is 4 Kbytes.
+ * 1 :  Invalidation size is determined by first zero bit in the address
+ *      starting from Address[12].
+ *
+ * In the AMD IOMMU Linux driver, an invalidation command with address
+ * ((1ULL << 63) - 1) is sent when intending to clear the entire cache.
+ * However, Table 14: Example Page Size Encodings shows that an address of
+ * ((1ULL << 51) - 1) encodes the entire cache, so effectively any address with
+ * first zero at bit 51 or larger is a request to invalidate the entire address
+ * space.
+ */
+static uint64_t __attribute__((unused))
+amdvi_decode_invalidation_size(hwaddr addr, uint16_t flags)
+{
+    uint64_t size = AMDVI_PAGE_SIZE;
+    uint8_t fzbit = 0;
+
+    if (flags & AMDVI_CMD_INVAL_IOMMU_PAGES_S) {
+        fzbit = cto64(addr | 0xFFF);
+
+        if (fzbit >= 51 || !addr) {
+            size = AMDVI_INV_ALL_PAGES;
+        } else {
+            size = 1ULL << (fzbit + 1);
+        }
+    }
+    return size;
+}
+
 /* we don't have devid - we can't remove pages by address */
 static void amdvi_inval_pages(AMDVIState *s, uint64_t *cmd)
 {
diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
index e12ecade4baa..c89e7dc9947d 100644
--- a/hw/i386/amd_iommu.h
+++ b/hw/i386/amd_iommu.h
@@ -123,6 +123,10 @@
 #define AMDVI_CMD_COMPLETE_PPR_REQUEST    0x07
 #define AMDVI_CMD_INVAL_AMDVI_ALL         0x08
 
+
+#define AMDVI_CMD_INVAL_IOMMU_PAGES_S   (1ULL << 0)
+#define AMDVI_INV_ALL_PAGES             (1ULL << 52)
+
 #define AMDVI_DEVTAB_ENTRY_SIZE           32
 
 /* Device table entry bits 0:63 */
-- 
2.43.5




* [PATCH 09/18] amd_iommu: Add helpers to walk AMD v1 Page Table format
  2025-04-14  2:02 [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (7 preceding siblings ...)
  2025-04-14  2:02 ` [PATCH 08/18] amd_iommu: Helper to decode size of page invalidation command Alejandro Jimenez
@ 2025-04-14  2:02 ` Alejandro Jimenez
  2025-04-17 12:40   ` CLEMENT MATHIEU--DRIF
  2025-04-14  2:02 ` [PATCH 10/18] amd_iommu: Add a page walker to sync shadow page tables on invalidation Alejandro Jimenez
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-14  2:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez

The current amdvi_page_walk() is designed to be called by the replay()
method. Rather than drastically altering it, introduce helpers to fetch
guest PTEs that will be used by a page walker implementation.
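
As a worked example of the indexing macros introduced below (the IOVA is
arbitrary; mode 3 is the Linux driver default noted in the code comments):

    /* IOVA 0x12345678 walked with DTE[Mode]=3 (39-bit GPA space) */
    PT_LEVEL_INDEX(2, 0x12345678);  /* bits 38:30 -> index 0x000 */
    PT_LEVEL_INDEX(1, 0x12345678);  /* bits 29:21 -> index 0x091 */
    PT_LEVEL_INDEX(0, 0x12345678);  /* bits 20:12 -> index 0x145 */

    PTE_LEVEL_PAGE_SIZE(0);         /* 4KB: a level-0 leaf maps one page */
    PTE_LEVEL_PAGE_SIZE(1);         /* 2MB: a level-1 leaf maps 512 pages */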

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 125 ++++++++++++++++++++++++++++++++++++++++++++
 hw/i386/amd_iommu.h |  42 +++++++++++++++
 2 files changed, 167 insertions(+)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 0af873b66a31..d089fdc28ef1 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -1563,6 +1563,131 @@ static const MemoryRegionOps amdvi_ir_ops = {
     }
 };
 
+/*
+ * For a PTE encoding a large page, return the page size it encodes as described
+ * by the AMD IOMMU Specification Table 14: Example Page Size Encodings.
+ * No need to adjust the value of the PTE to point to the first PTE in the large
+ * page since the encoding guarantees all "base" PTEs in the large page are the
+ * same.
+ */
+static uint64_t large_pte_page_size(uint64_t pte)
+{
+    assert(PTE_NEXT_LEVEL(pte) == 7);
+
+    /* Determine size of the large/contiguous page encoded in the PTE */
+    return PTE_LARGE_PAGE_SIZE(pte);
+}
+
+/*
+ * Helper function to fetch a PTE using AMD v1 pgtable format.
+ * Returns:
+ * -2:  The Page Table Root could not be read from DTE, or IOVA is larger than
+ *      supported by the current page table level encoded in DTE[Mode].
+ * -1:  PTE could not be read from guest memory during a page table walk.
+ *      This means that the DTE has valid data, and one of the lower level
+ *      entries in the Page Table could not be read.
+ *  0:  PTE is marked not present, or entry is 0.
+ * >0:  Leaf PTE value resolved from walking Guest IO Page Table.
+ */
+static uint64_t __attribute__((unused))
+fetch_pte(AMDVIAddressSpace *as, const hwaddr address, uint64_t dte,
+          hwaddr *page_size)
+{
+    IOMMUAccessFlags perms = amdvi_get_perms(dte);
+
+    uint8_t level, mode;
+    uint64_t pte = dte, pte_addr;
+
+    *page_size = 0;
+
+    if (perms == IOMMU_NONE) {
+        return (uint64_t)-2;
+    }
+
+    /*
+     * The Linux kernel driver initializes the default mode to 3, corresponding
+     * to a 39-bit GPA space, where each entry in the top-level pagetable
+     * covers a 1GB (2^30) range.
+     */
+    level = mode = get_pte_translation_mode(dte);
+    assert(mode > 0 && mode < 7);
+
+    /*
+     * If IOVA is larger than the max supported by the current pgtable level,
+     * there is nothing to do. This signals that the pagetable level should be
+     * increased, or is an address meant to have special behavior like
+     * invalidating the entire cache.
+     */
+    if (address > PT_LEVEL_MAX_ADDR(mode - 1)) {
+        /* IOVA too large for the current DTE */
+        return (uint64_t)-2;
+    }
+
+    do {
+        level -= 1;
+
+        /* Update the page_size */
+        *page_size = PTE_LEVEL_PAGE_SIZE(level);
+
+        /* Permission bits are ANDed at every level, including the DTE */
+        perms &= amdvi_get_perms(pte);
+        if (perms == IOMMU_NONE) {
+            return pte;
+        }
+
+        /* Not Present */
+        if (!IOMMU_PTE_PRESENT(pte)) {
+            return 0;
+        }
+
+        /* Large or Leaf PTE found */
+        if (PTE_NEXT_LEVEL(pte) == 7 || PTE_NEXT_LEVEL(pte) == 0) {
+            /* Leaf PTE found */
+            break;
+        }
+
+        /*
+         * Index the pgtable using the IOVA bits corresponding to current level
+         * and walk down to the lower level.
+         */
+        pte_addr = NEXT_PTE_ADDR(pte, level, address);
+        pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
+
+        if (pte == (uint64_t)-1) {
+            /*
+             * A returned PTE of -1 indicates a failure to read the page table
+             * entry from guest memory.
+             */
+            if (level == mode - 1) {
+                /* Failure to retrieve the Page Table from Root Pointer */
+                *page_size = 0;
+                return (uint64_t)-2;
+            } else {
+                /* Failure to read PTE. Page walk skips a page_size chunk */
+                return pte;
+            }
+        }
+    } while (level > 0);
+
+    /*
+     * Page walk ends when Next Level field on PTE shows that either a leaf PTE
+     * or a series of large PTEs have been reached. In the latter case, return
+     * the pointer to the first PTE of the series.
+     */
+    assert(level == 0 || PTE_NEXT_LEVEL(pte) == 0 || PTE_NEXT_LEVEL(pte) == 7);
+
+    /*
+     * In case the range starts in the middle of a contiguous page, we need
+     * to return the first PTE.
+     */
+    if (PTE_NEXT_LEVEL(pte) == 7) {
+        /* Update page_size with the large PTE page size */
+        *page_size = large_pte_page_size(pte);
+    }
+
+    return pte;
+}
+
 /*
  * Toggle between address translation and passthrough modes by enabling the
  * corresponding memory regions.
diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
index c89e7dc9947d..fc4d2f7a4575 100644
--- a/hw/i386/amd_iommu.h
+++ b/hw/i386/amd_iommu.h
@@ -25,6 +25,8 @@
 #include "hw/i386/x86-iommu.h"
 #include "qom/object.h"
 
+#define GENMASK64(h, l)  (((~0ULL) >> (63 - (h) + (l))) << (l))
+
 /* Capability registers */
 #define AMDVI_CAPAB_BAR_LOW           0x04
 #define AMDVI_CAPAB_BAR_HIGH          0x08
@@ -174,6 +176,46 @@
 #define AMDVI_GATS_MODE                 (2ULL <<  12)
 #define AMDVI_HATS_MODE                 (2ULL <<  10)
 
+/* Page Table format */
+
+#define AMDVI_PTE_PR                    (1ULL << 0)
+#define AMDVI_PTE_NEXT_LEVEL_MASK       GENMASK64(11, 9)
+
+#define IOMMU_PTE_PRESENT(pte)          ((pte) & AMDVI_PTE_PR)
+
+/* Using level=0 for leaf PTE at 4K page size */
+#define PT_LEVEL_SHIFT(level)           (12 + ((level) * 9))
+
+/* Return IOVA bit group used to index the Page Table at specific level */
+#define PT_LEVEL_INDEX(level, iova)     (((iova) >> PT_LEVEL_SHIFT(level)) & \
+                                        GENMASK64(8, 0))
+
+/* Return the max address for a specified level i.e. max_oaddr */
+#define PT_LEVEL_MAX_ADDR(x)    (((x) < 5) ? \
+                                ((1ULL << PT_LEVEL_SHIFT((x + 1))) - 1) : \
+                                (~(0ULL)))
+
+/* Extract the NextLevel field from PTE/PDE */
+#define PTE_NEXT_LEVEL(pte)     (((pte) & AMDVI_PTE_NEXT_LEVEL_MASK) >> 9)
+
+/* Take page table level and return default pagetable size for level */
+#define PTE_LEVEL_PAGE_SIZE(level)      (1ULL << (PT_LEVEL_SHIFT(level)))
+
+/*
+ * Return address of lower level page table encoded in PTE and specified by
+ * current level and corresponding IOVA bit group at such level.
+ */
+#define NEXT_PTE_ADDR(pte, level, iova) (((pte) & AMDVI_DEV_PT_ROOT_MASK) + \
+                                        (PT_LEVEL_INDEX(level, iova) * 8))
+
+/*
+ * Take a PTE value with mode=0x07 and return the page size it encodes.
+ */
+#define PTE_LARGE_PAGE_SIZE(pte)    (1ULL << (1 + cto64(((pte) | 0xfffULL))))
+
+/* Return number of PTEs to use for a given page size (expected power of 2) */
+#define PAGE_SIZE_PTE_COUNT(pgsz)       (1ULL << ((ctz64(pgsz) - 12) % 9))
+
 /* IOTLB */
 #define AMDVI_IOTLB_MAX_SIZE 1024
 #define AMDVI_DEVID_SHIFT    36
-- 
2.43.5




* [PATCH 10/18] amd_iommu: Add a page walker to sync shadow page tables on invalidation
  2025-04-14  2:02 [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (8 preceding siblings ...)
  2025-04-14  2:02 ` [PATCH 09/18] amd_iommu: Add helpers to walk AMD v1 Page Table format Alejandro Jimenez
@ 2025-04-14  2:02 ` Alejandro Jimenez
  2025-04-17 15:14   ` Ethan MILON
  2025-04-14  2:02 ` [PATCH 11/18] amd_iommu: Sync shadow page tables on page invalidation Alejandro Jimenez
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-14  2:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez

For the specified address range, walk the page table identifying regions
as mapped or unmapped and invoke registered notifiers with the
corresponding event type.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 74 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index d089fdc28ef1..6789e1e9b688 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -1688,6 +1688,80 @@ fetch_pte(AMDVIAddressSpace *as, const hwaddr address, uint64_t dte,
     return pte;
 }
 
+/*
+ * Walk the guest page table for an IOVA and range and signal the registered
+ * notifiers to sync the shadow page tables in the host.
+ * Must be called with a valid DTE for DMA remapping i.e. V=1, TV=1.
+ */
+static void __attribute__((unused))
+amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as, uint64_t *dte,
+                                   hwaddr addr, uint64_t size, bool send_unmap)
+{
+    IOMMUTLBEvent event;
+
+    hwaddr iova_next, page_mask, pagesize;
+    hwaddr iova = addr;
+    hwaddr end = iova + size - 1;
+
+    uint64_t pte;
+
+    while (iova < end) {
+
+        pte = fetch_pte(as, iova, dte[0], &pagesize);
+
+        if (pte == (uint64_t)-2) {
+            /*
+             * Invalid conditions such as the IOVA being larger than supported
+             * by current page table mode as configured in the DTE, or a failure
+             * to fetch the Page Table from the Page Table Root Pointer in DTE.
+             */
+            assert(pagesize == 0);
+            return;
+        }
+        /* PTE has been validated for major errors and pagesize is set */
+        assert(pagesize);
+        page_mask = ~(pagesize - 1);
+        iova_next = (iova & page_mask) + pagesize;
+
+        if (pte == (uint64_t)-1) {
+            /*
+             * Failure to read the PTE from memory; the pagesize matches the
+             * current level. Unable to determine the region type, so a safe
+             * strategy is to skip the range and continue the page walk.
+             */
+            goto next;
+        }
+
+        event.entry.target_as = &address_space_memory;
+        event.entry.iova = iova & page_mask;
+        /* translated_addr is irrelevant for the unmap case */
+        event.entry.translated_addr = (pte & AMDVI_DEV_PT_ROOT_MASK) &
+                                      page_mask;
+        event.entry.addr_mask = ~page_mask;
+        event.entry.perm = amdvi_get_perms(pte);
+
+        /*
+         * In cases where the leaf PTE is not found, or it has invalid
+         * permissions, an UNMAP type notification is sent, but only if the
+         * caller requested it.
+         */
+        if (!IOMMU_PTE_PRESENT(pte) || (event.entry.perm == IOMMU_NONE)) {
+            if (!send_unmap) {
+                goto next;
+            }
+            event.type = IOMMU_NOTIFIER_UNMAP;
+        } else {
+            event.type = IOMMU_NOTIFIER_MAP;
+        }
+
+        /* Invoke the notifiers registered for this address space */
+        memory_region_notify_iommu(&as->iommu, 0, event);
+
+next:
+        iova = iova_next;
+    }
+}
+
 /*
  * Toggle between address translation and passthrough modes by enabling the
  * corresponding memory regions.
-- 
2.43.5




* [PATCH 11/18] amd_iommu: Sync shadow page tables on page invalidation
  2025-04-14  2:02 [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (9 preceding siblings ...)
  2025-04-14  2:02 ` [PATCH 10/18] amd_iommu: Add a page walker to sync shadow page tables on invalidation Alejandro Jimenez
@ 2025-04-14  2:02 ` Alejandro Jimenez
  2025-04-22 12:38   ` Sairaj Kodilkar
  2025-04-22 12:38   ` Sairaj Kodilkar
  2025-04-14  2:02 ` [PATCH 12/18] amd_iommu: Add replay callback Alejandro Jimenez
                   ` (7 subsequent siblings)
  18 siblings, 2 replies; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-14  2:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez

When the guest issues an INVALIDATE_IOMMU_PAGES command, decode the
address and size of the invalidation and sync the guest page table state
with the host. This requires walking the guest page table and calling
notifiers registered for address spaces matching the domain ID encoded
in the command.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 110 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 102 insertions(+), 8 deletions(-)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 6789e1e9b688..cf83ac607064 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -97,6 +97,9 @@ typedef enum AMDVIFaultReason {
 } AMDVIFaultReason;
 
 static int amdvi_as_to_dte(AMDVIAddressSpace *as, uint64_t *dte);
+static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
+                                               uint64_t *dte, hwaddr addr,
+                                               uint64_t size, bool send_unmap);
 
 uint64_t amdvi_extended_feature_register(AMDVIState *s)
 {
@@ -497,8 +500,7 @@ static gboolean amdvi_iotlb_remove_by_domid(gpointer key, gpointer value,
  * first zero at bit 51 or larger is a request to invalidate the entire address
  * space.
  */
-static uint64_t __attribute__((unused))
-amdvi_decode_invalidation_size(hwaddr addr, uint16_t flags)
+static uint64_t amdvi_decode_invalidation_size(hwaddr addr, uint16_t flags)
 {
     uint64_t size = AMDVI_PAGE_SIZE;
     uint8_t fzbit = 0;
@@ -515,10 +517,101 @@ amdvi_decode_invalidation_size(hwaddr addr, uint16_t flags)
     return size;
 }
 
+/*
+ * Synchronize the guest page tables with the shadow page tables kept in the
+ * host for the specified range.
+ * The invalidation command issued by the guest and intercepted by the VMM
+ * does not specify a device, but a domain, since all devices in the same domain
+ * share the same page tables. However, vIOMMU emulation creates separate
+ * address spaces per device, so it is necessary to traverse the list of all of
+ * address spaces per device, so it is necessary to traverse the list of all
+ * address spaces (i.e. devices) that have notifiers registered in order to
+ * propagate the changes to the host page tables. This can generate redundant
+ * requests to map/unmap regions when several devices are in the same domain,
+ * but it can be optimized by maintaining an internal representation of the
+ * per-domain address space and avoiding a notifier invocation when the change
+ * is already reflected in the host page tables.
+ * We cannot return early from this function once a matching domain has been
+ * identified and its page tables synced (based on the fact that all devices in
+ * the same domain share the page tables). The reason is that different devices
+ * (i.e. address spaces) could have different notifiers registered, and by
+ * skipping address spaces that appear later on the amdvi_as_with_notifiers list
+ * their notifiers (which could differ from the ones registered for the first
+ * device/address space) would not be invoked.
+ */
+static void amdvi_sync_domain(AMDVIState *s, uint16_t domid, uint64_t addr,
+                              uint16_t flags)
+{
+    AMDVIAddressSpace *as;
+
+    uint64_t size = amdvi_decode_invalidation_size(addr, flags);
+
+    if (size == AMDVI_INV_ALL_PAGES) {
+        addr = 0;       /* Set start address to 0 and invalidate entire AS */
+    } else {
+        addr &= ~(size - 1);
+    }
+
+    /*
+     * Call notifiers that have registered for each address space matching the
+     * domain ID, in order to sync the guest pagetable state with the host.
+     */
+    QLIST_FOREACH(as, &s->amdvi_as_with_notifiers, next) {
+
+        uint64_t dte[4] = { 0 };
+
+        /*
+         * Retrieve the Device Table entry for the devid corresponding to the
+         * current address space, and verify the DomainID matches i.e. the page
+         * tables to be synced belong to devices in the domain.
+         */
+        if (amdvi_as_to_dte(as, dte)) {
+            continue;
+        }
+
+        /* Only need to sync the Page Tables for a matching domain */
+        if (domid != (dte[1] & AMDVI_DEV_DOMID_ID_MASK)) {
+            continue;
+        }
+
+        /*
+         * We have determined that there is a valid Device Table Entry for a
+         * device matching the DomainID in the INV_IOMMU_PAGES command issued by
+         * the guest. Walk the guest page table to sync shadow page table.
+         *
+         * An optimization can be made if only UNMAP notifiers are registered to
+         * avoid walking the page table and just invalidate the requested range.
+         */
+        if (as->notifier_flags & IOMMU_NOTIFIER_MAP) {
+
+            /* Sync guest IOMMU mappings with host */
+            amdvi_sync_shadow_page_table_range(as, &dte[0], addr, size, true);
+        } else {
+            /*
+             * For UNMAP only notifiers, invalidate the requested size. No page
+             * table walk is required since there is no need to replay mappings.
+             */
+            IOMMUTLBEvent event = {
+                .type = IOMMU_NOTIFIER_UNMAP,
+                .entry = {
+                    .target_as = &address_space_memory,
+                    .iova = addr,
+                    .translated_addr = 0, /* Irrelevant for unmap case */
+                    .addr_mask = size - 1,
+                    .perm = IOMMU_NONE,
+                },
+            };
+            memory_region_notify_iommu(&as->iommu, 0, event);
+        }
+    }
+}
+
 /* we don't have devid - we can't remove pages by address */
 static void amdvi_inval_pages(AMDVIState *s, uint64_t *cmd)
 {
     uint16_t domid = cpu_to_le16((uint16_t)extract64(cmd[0], 32, 16));
+    uint64_t addr = cpu_to_le64(extract64(cmd[1], 12, 52)) << 12;
+    uint16_t flags = cpu_to_le16((uint16_t)extract64(cmd[1], 0, 3));
 
     if (extract64(cmd[0], 20, 12) || extract64(cmd[0], 48, 12) ||
         extract64(cmd[1], 3, 9)) {
@@ -528,6 +621,8 @@ static void amdvi_inval_pages(AMDVIState *s, uint64_t *cmd)
 
     g_hash_table_foreach_remove(s->iotlb, amdvi_iotlb_remove_by_domid,
                                 &domid);
+
+    amdvi_sync_domain(s, domid, addr, flags);
     trace_amdvi_pages_inval(domid);
 }
 
@@ -1589,9 +1684,8 @@ static uint64_t large_pte_page_size(uint64_t pte)
  *  0:  PTE is marked not present, or entry is 0.
  * >0:  Leaf PTE value resolved from walking Guest IO Page Table.
  */
-static uint64_t __attribute__((unused))
-fetch_pte(AMDVIAddressSpace *as, const hwaddr address, uint64_t dte,
-          hwaddr *page_size)
+static uint64_t fetch_pte(AMDVIAddressSpace *as, const hwaddr address,
+                          uint64_t dte, hwaddr *page_size)
 {
     IOMMUAccessFlags perms = amdvi_get_perms(dte);
 
@@ -1693,9 +1787,9 @@ fetch_pte(AMDVIAddressSpace *as, const hwaddr address, uint64_t dte,
  * notifiers to sync the shadow page tables in the host.
  * Must be called with a valid DTE for DMA remapping i.e. V=1, TV=1.
  */
-static void __attribute__((unused))
-amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as, uint64_t *dte,
-                                   hwaddr addr, uint64_t size, bool send_unmap)
+static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
+                                               uint64_t *dte, hwaddr addr,
+                                               uint64_t size, bool send_unmap)
 {
     IOMMUTLBEvent event;
 
-- 
2.43.5




* [PATCH 12/18] amd_iommu: Add replay callback
  2025-04-14  2:02 [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (10 preceding siblings ...)
  2025-04-14  2:02 ` [PATCH 11/18] amd_iommu: Sync shadow page tables on page invalidation Alejandro Jimenez
@ 2025-04-14  2:02 ` Alejandro Jimenez
  2025-04-14  2:02 ` [PATCH 13/18] amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL Alejandro Jimenez
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-14  2:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez

A replay() method is necessary to efficiently synchronize the host page
tables after VFIO registers a notifier for IOMMU events. It is called
to ensure that existing mappings from an IOMMU memory region are
"replayed" to a specified notifier, initializing or updating the shadow
page tables on the host.
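
For reference, a short sketch of how the callback is reached (the
notifier is a placeholder; memory_region_iommu_replay() is the existing
QEMU entry point that dispatches to the replay() method):

    /*
     * VFIO replays existing mappings after registering its notifier,
     * which invokes amdvi_iommu_replay() below through the memory API.
     */
    memory_region_iommu_replay(&as->iommu, &n);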

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index cf83ac607064..e24eab34c9e0 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -1902,6 +1902,28 @@ static void amdvi_switch_address_space_all(AMDVIState *s)
     }
 }
 
+/*
+ * For every translation present in the IOMMU, construct IOMMUTLBEntry data
+ * and pass it as a parameter to the notifier callback.
+ */
+static void amdvi_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
+{
+    AMDVIAddressSpace *as = container_of(iommu_mr, AMDVIAddressSpace, iommu);
+    uint64_t dte[4] = { 0 };
+
+    if (!(n->notifier_flags & IOMMU_NOTIFIER_MAP)) {
+        return;
+    }
+
+    if (amdvi_as_to_dte(as, dte)) {
+        return;
+    }
+
+    amdvi_address_space_unmap(as, n);
+
+    amdvi_sync_shadow_page_table_range(as, &dte[0], 0, UINT64_MAX, false);
+}
+
 static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
 {
     char name[128];
@@ -2223,6 +2245,7 @@ static void amdvi_iommu_memory_region_class_init(ObjectClass *klass, void *data)
 
     imrc->translate = amdvi_translate;
     imrc->notify_flag_changed = amdvi_iommu_notify_flag_changed;
+    imrc->replay = amdvi_iommu_replay;
 }
 
 static const TypeInfo amdvi_iommu_memory_region_info = {
-- 
2.43.5




* [PATCH 13/18] amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL
  2025-04-14  2:02 [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (11 preceding siblings ...)
  2025-04-14  2:02 ` [PATCH 12/18] amd_iommu: Add replay callback Alejandro Jimenez
@ 2025-04-14  2:02 ` Alejandro Jimenez
  2025-04-14  2:02 ` [PATCH 14/18] amd_iommu: Toggle address translation on device table entry invalidation Alejandro Jimenez
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-14  2:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez

When the kernel IOMMU driver issues an INVALIDATE_IOMMU_ALL, the address
translation and interrupt remapping information must be cleared for all
Device IDs and all domains. Introduce a helper to sync the shadow page
table for all the address spaces with registered notifiers, which
replays both MAP and UNMAP events.
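
For context, each command arrives as a 128-bit entry on the command buffer.
A minimal sketch of the INVALIDATE_IOMMU_ALL encoding, assuming the layout
in the AMD IOMMU spec (opcode 0x8 in bits [63:60] of the low quadword, all
other bits reserved):

  uint64_t cmd[2] = { (uint64_t)0x8 << 60, 0 };   /* INVALIDATE_IOMMU_ALL */

This matches the reserved-bits check that amdvi_inval_all() performs below.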

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 49 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index e24eab34c9e0..3bfa08419ffe 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -100,6 +100,7 @@ static int amdvi_as_to_dte(AMDVIAddressSpace *as, uint64_t *dte);
 static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
                                                uint64_t *dte, hwaddr addr,
                                                uint64_t size, bool send_unmap);
+static void amdvi_address_space_unmap(AMDVIAddressSpace *as, IOMMUNotifier *n);
 
 uint64_t amdvi_extended_feature_register(AMDVIState *s)
 {
@@ -462,6 +463,47 @@ static void amdvi_intremap_inval_notify_all(AMDVIState *s, bool global,
     x86_iommu_iec_notify_all(X86_IOMMU_DEVICE(s), global, index, mask);
 }
 
+static void amdvi_address_space_sync(AMDVIAddressSpace *as)
+{
+    IOMMUNotifier *n;
+    uint64_t dte[4] = { 0 };
+
+    /* If only UNMAP notifiers are registered, drop all existing mappings */
+    if (!(as->notifier_flags & IOMMU_NOTIFIER_MAP)) {
+        IOMMU_NOTIFIER_FOREACH(n, &as->iommu) {
+            /*
+             * Directly calling memory_region_unmap_iommu_notifier_range() does
+             * not guarantee that the addr_mask eventually passed as parameter
+             * to the notifier is valid. Use amdvi_address_space_unmap() which
+             * ensures the notifier range is divided into properly aligned
+             * regions, and issues notifications for each one.
+             */
+            amdvi_address_space_unmap(as, n);
+        }
+        return;
+    }
+
+    if (amdvi_as_to_dte(as, dte)) {
+        return;
+    }
+
+    amdvi_sync_shadow_page_table_range(as, &dte[0], 0, UINT64_MAX, true);
+}
+
+/*
+ * This differs from the replay() method in that it issues both MAP and UNMAP
+ * notifications since it is called after global invalidation events in order to
+ * re-sync all address spaces.
+ */
+static void amdvi_iommu_address_space_sync_all(AMDVIState *s)
+{
+    AMDVIAddressSpace *as;
+
+    QLIST_FOREACH(as, &s->amdvi_as_with_notifiers, next) {
+        amdvi_address_space_sync(as);
+    }
+}
+
 static void amdvi_inval_all(AMDVIState *s, uint64_t *cmd)
 {
     if (extract64(cmd[0], 0, 60) || cmd[1]) {
@@ -473,6 +515,13 @@ static void amdvi_inval_all(AMDVIState *s, uint64_t *cmd)
     amdvi_intremap_inval_notify_all(s, true, 0, 0);
 
     amdvi_iotlb_reset(s);
+
+    /*
+     * Fully replay the address space i.e. send both UNMAP and MAP events in
+     * order to synchronize guest and host IO page tables.
+     */
+    amdvi_iommu_address_space_sync_all(s);
+
     trace_amdvi_all_inval();
 }
 
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 14/18] amd_iommu: Toggle address translation on device table entry invalidation
  2025-04-14  2:02 [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (12 preceding siblings ...)
  2025-04-14  2:02 ` [PATCH 13/18] amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL Alejandro Jimenez
@ 2025-04-14  2:02 ` Alejandro Jimenez
  2025-04-22 12:48   ` Sairaj Kodilkar
  2025-04-14  2:02 ` [PATCH 15/18] amd_iommu: Use iova_tree records to determine large page size on UNMAP Alejandro Jimenez
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-14  2:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez

A guest must issue an INVALIDATE_DEVTAB_ENTRY command after changing a
Device Table entry (DTE) e.g. after attaching a device and setting up
its DTE. When intercepting this event, determine if the DTE has been
configured for paging or not, and toggle the appropriate memory regions
to allow DMA address translation for the address space if needed.
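
A minimal sketch of the paging decision applied here, assuming the DTE[0]
layout from the AMD IOMMU spec (V in bit 0, TV in bit 1, Mode in bits
[11:9]); the helper name is hypothetical:

  static bool dte_enables_paging(uint64_t dte0)
  {
      return (dte0 & 0x1) &&              /* V: DTE is valid */
             (dte0 & 0x2) &&              /* TV: translation info is valid */
             ((dte0 >> 9) & 0x7) != 0;    /* Mode != 0: page tables present */
  }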

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 68 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 67 insertions(+), 1 deletion(-)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 3bfa08419ffe..abdd67f6b12c 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -101,6 +101,8 @@ static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
                                                uint64_t *dte, hwaddr addr,
                                                uint64_t size, bool send_unmap);
 static void amdvi_address_space_unmap(AMDVIAddressSpace *as, IOMMUNotifier *n);
+static void amdvi_address_space_sync(AMDVIAddressSpace *as);
+static void amdvi_switch_address_space(AMDVIAddressSpace *amdvi_as);
 
 uint64_t amdvi_extended_feature_register(AMDVIState *s)
 {
@@ -432,7 +434,15 @@ static void amdvi_completion_wait(AMDVIState *s, uint64_t *cmd)
     trace_amdvi_completion_wait(addr, data);
 }
 
-/* log error without aborting since linux seems to be using reserved bits */
+/*
+ * A guest driver must issue the INVALIDATE_DEVTAB_ENTRY command to the IOMMU
+ * after changing a Device Table entry. We can use this fact to detect when a
+ * Device Table entry is created for a device attached to a paging domain and
+ * enable the corresponding IOMMU memory region to allow for DMA
+ * translation if appropriate.
+ *
+ * log error without aborting since linux seems to be using reserved bits
+ */
 static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
 {
     uint16_t devid = cpu_to_le16((uint16_t)extract64(cmd[0], 0, 16));
@@ -442,6 +452,62 @@ static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
         amdvi_log_illegalcom_error(s, extract64(cmd[0], 60, 4),
                                    s->cmdbuf + s->cmdbuf_head);
     }
+
+    /*
+     * Convert the devid encoded in the command to a bus and devfn in
+     * order to retrieve the corresponding address space.
+     */
+    uint8_t bus_num, devfn, dte_mode;
+    AMDVIAddressSpace *as;
+    uint64_t dte[4] = { 0 };
+    IOMMUNotifier *n;
+    int ret;
+
+    bus_num = PCI_BUS_NUM(devid);
+    devfn = devid & 0xff;
+
+    /*
+     * The main buffer of size (AMDVIAddressSpace *) * (PCI_BUS_MAX) has already
+     * been allocated within AMDVIState, but must be careful to not access
+     * unallocated devfn.
+     */
+    if (!s->address_spaces[bus_num] || !s->address_spaces[bus_num][devfn]) {
+        return;
+    }
+    as = s->address_spaces[bus_num][devfn];
+
+    ret = amdvi_as_to_dte(as, dte);
+
+    if (!ret) {
+        dte_mode = (dte[0] >> AMDVI_DEV_MODE_RSHIFT) & AMDVI_DEV_MODE_MASK;
+    }
+
+    if ((ret < 0) || (!ret && !dte_mode)) {
+        /*
+         * The DTE could not be retrieved, it is not valid, or it is not
+         * set up for paging. In any of these cases, if paging was previously
+         * in use, switch to the no_dma memory region and invalidate all
+         * existing mappings.
+         */
+        if (as->addr_translation) {
+            as->addr_translation = false;
+
+            amdvi_switch_address_space(as);
+
+            IOMMU_NOTIFIER_FOREACH(n, &as->iommu) {
+                amdvi_address_space_unmap(as, n);
+            }
+        }
+    } else if (!as->addr_translation) {
+        /*
+         * Installing a DTE that enables translation where it wasn't previously
+         * active. Activate the DMA memory region.
+         */
+        as->addr_translation = true;
+        amdvi_switch_address_space(as);
+        amdvi_address_space_sync(as);
+    }
+
     trace_amdvi_devtab_inval(PCI_BUS_NUM(devid), PCI_SLOT(devid),
                              PCI_FUNC(devid));
 }
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 15/18] amd_iommu: Use iova_tree records to determine large page size on UNMAP
  2025-04-14  2:02 [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (13 preceding siblings ...)
  2025-04-14  2:02 ` [PATCH 14/18] amd_iommu: Toggle address translation on device table entry invalidation Alejandro Jimenez
@ 2025-04-14  2:02 ` Alejandro Jimenez
  2025-04-14  2:02 ` [PATCH 16/18] amd_iommu: Do not assume passthrough translation when DTE[TV]=0 Alejandro Jimenez
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-14  2:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez

Keep a record of mapped IOVA ranges per address space, using the
iova_tree implementation. Besides enabling optimizations like avoiding
unnecessary notifications, a record of existing <IOVA, size> mappings
makes it possible to determine if a specific IOVA is mapped by the guest
using a large page, and adjust the size when notifying UNMAP events.

When unmapping a large page, the information in the guest PTE encoding
the page size is lost, since the guest clears the PTE before issuing the
invalidation command to the IOMMU. In that case, the size of the
original mapping can be retrieved from the iova_tree and used to issue
the UNMAP notification. Using the correct size is essential since the
VFIO IOMMU Type1v2 driver in the host kernel will reject unmap requests
that do not fully cover previous mappings.
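
A minimal sketch of the record-keeping with QEMU's IOVATree (from
"qemu/iova-tree.h"); note that DMAMap.size holds an inclusive mask, i.e.
length - 1, matching addr_mask. The addresses are made up for illustration:

  IOVATree *t = iova_tree_new();
  DMAMap map = { .iova = 0x200000, .size = 0x1fffff,      /* one 2M page */
                 .translated_addr = 0x80000000, .perm = IOMMU_RW };
  iova_tree_insert(t, &map);

  /* guest invalidates 4K inside the large page; look up the original record */
  DMAMap probe = { .iova = 0x200000, .size = 0xfff };
  const DMAMap *hit = iova_tree_find(t, &probe);
  if (hit) {
      probe.size = hit->size;   /* grow the UNMAP to the recorded 2M size */
      iova_tree_remove(t, probe);
  }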

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 98 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 93 insertions(+), 5 deletions(-)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index abdd67f6b12c..e502bbbbb7d3 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -33,6 +33,7 @@
 #include "hw/i386/apic-msidef.h"
 #include "hw/qdev-properties.h"
 #include "kvm/kvm_i386.h"
+#include "qemu/iova-tree.h"
 
 /* used AMD-Vi MMIO registers */
 const char *amdvi_mmio_low[] = {
@@ -73,6 +74,7 @@ struct AMDVIAddressSpace {
     QLIST_ENTRY(AMDVIAddressSpace) next;
     /* DMA address translation active */
     bool addr_translation;
+    IOVATree *iova_tree;        /* Record DMA translation ranges */
 };
 
 /* AMDVI cache entry */
@@ -103,6 +105,7 @@ static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
 static void amdvi_address_space_unmap(AMDVIAddressSpace *as, IOMMUNotifier *n);
 static void amdvi_address_space_sync(AMDVIAddressSpace *as);
 static void amdvi_switch_address_space(AMDVIAddressSpace *amdvi_as);
+static void amdvi_notify_iommu(AMDVIAddressSpace *as, IOMMUTLBEvent *event);
 
 uint64_t amdvi_extended_feature_register(AMDVIState *s)
 {
@@ -1366,6 +1369,7 @@ static void amdvi_address_space_unmap(AMDVIAddressSpace *as, IOMMUNotifier *n)
     hwaddr start = n->start;
     hwaddr end = n->end;
     hwaddr remain;
+    DMAMap map;
 
     assert(start <= end);
     remain = end - start + 1;
@@ -1399,6 +1403,11 @@ static void amdvi_address_space_unmap(AMDVIAddressSpace *as, IOMMUNotifier *n)
     }
 
     assert(!remain);
+
+    map.iova = n->start;
+    map.size = n->end - n->start;
+
+    iova_tree_remove(as->iova_tree, map);
 }
 
 /*
@@ -1908,7 +1917,7 @@ static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
 {
     IOMMUTLBEvent event;
 
-    hwaddr iova_next, page_mask, pagesize;
+    hwaddr page_mask, pagesize;
     hwaddr iova = addr;
     hwaddr end = iova + size - 1;
 
@@ -1930,7 +1939,6 @@ static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
         /* PTE has been validated for major errors and pagesize is set */
         assert(pagesize);
         page_mask = ~(pagesize - 1);
-        iova_next = (iova & page_mask) + pagesize;
 
         if (pte == (uint64_t)-1) {
             /*
@@ -1963,12 +1971,90 @@ static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
             event.type = IOMMU_NOTIFIER_MAP;
         }
 
-        /* Invoke the notifiers registered for this address space */
-        memory_region_notify_iommu(&as->iommu, 0, event);
+        /*
+         * The following call might need to adjust event.entry.size in cases
+         * where the guest unmapped a series of large pages.
+         */
+        amdvi_notify_iommu(as, &event);
+        /*
+         * In the special scenario where the guest is unmapping a large page,
+         * addr_mask has been adjusted before sending the notification. Update
+         * pagesize accordingly in order to correctly compute the next IOVA.
+         */
+        pagesize = event.entry.addr_mask + 1;
 
 next:
-        iova = iova_next;
+        iova = (iova & ~(pagesize - 1)) + pagesize;
+    }
+}
+
+/*
+ * Invoke notifiers registered for the address space. Update record of mapped
+ * ranges in IOVA Tree.
+ */
+static void amdvi_notify_iommu(AMDVIAddressSpace *as, IOMMUTLBEvent *event)
+{
+    IOMMUTLBEntry *entry = &event->entry;
+
+    DMAMap target = {
+        .iova = entry->iova,
+        .size = entry->addr_mask,
+        .translated_addr = entry->translated_addr,
+        .perm = entry->perm,
+    };
+
+    /*
+     * Search the IOVA Tree for an existing translation for the target, and skip
+     * the notification if the mapping is already recorded.
+     * When the guest uses large pages, comparing against the record makes it
+     * possible to determine the size of the original MAP and adjust the UNMAP
+     * request to match it. This avoids failed checks against the mappings kept
+     * by the VFIO kernel driver.
+     */
+    const DMAMap *mapped = iova_tree_find(as->iova_tree, &target);
+
+    if (event->type == IOMMU_NOTIFIER_UNMAP) {
+        if (!mapped) {
+            /* No record exists of this mapping, nothing to do */
+            return;
+        }
+        /*
+         * Adjust the size based on the original record. This is essential to
+         * determine when large/contiguous pages are used, since the guest has
+         * already cleared the PTE (erasing the pagesize encoded on it) before
+         * issuing the invalidation command.
+         */
+        if (mapped->size != target.size) {
+            assert(mapped->size > target.size);
+            target.size = mapped->size;
+            /* Adjust event to invoke notifier with correct range */
+            entry->addr_mask = mapped->size;
+        }
+        iova_tree_remove(as->iova_tree, target);
+    } else { /* IOMMU_NOTIFIER_MAP */
+        if (mapped) {
+            /*
+             * If a mapping is present and matches the request, skip the
+             * notification.
+             */
+            if (!memcmp(mapped, &target, sizeof(DMAMap))) {
+                return;
+            } else {
+                /*
+                 * This should never happen unless a buggy guest OS omits or
+                 * sends incorrect invalidation(s). Report an error in the event
+                 * it does happen.
+                 */
+                error_report("Found conflicting translation. This could be due "
+                             "to an incorrect or missing invalidation command");
+            }
+        }
+        /* Record the new mapping */
+        iova_tree_insert(as->iova_tree, &target);
     }
+
+    /* Invoke the notifiers registered for this address space */
+    memory_region_notify_iommu(&as->iommu, 0, *event);
 }
 
 /*
@@ -2034,6 +2120,7 @@ static void amdvi_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
         return;
     }
 
+    /* Drop all mappings for the address space. Also clears the IOVA tree */
     amdvi_address_space_unmap(as, n);
 
     amdvi_sync_shadow_page_table_range(as, &dte[0], 0, UINT64_MAX, false);
@@ -2062,6 +2149,7 @@ static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
         iommu_as[devfn]->bus_num = (uint8_t)bus_num;
         iommu_as[devfn]->devfn = (uint8_t)devfn;
         iommu_as[devfn]->iommu_state = s;
+        iommu_as[devfn]->iova_tree = iova_tree_new();
 
         amdvi_dev_as = iommu_as[devfn];
 
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 16/18] amd_iommu: Do not assume passthrough translation when DTE[TV]=0
  2025-04-14  2:02 [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (14 preceding siblings ...)
  2025-04-14  2:02 ` [PATCH 15/18] amd_iommu: Use iova_tree records to determine large page size on UNMAP Alejandro Jimenez
@ 2025-04-14  2:02 ` Alejandro Jimenez
  2025-04-23  6:06   ` Sairaj Kodilkar
  2025-04-14  2:02 ` [PATCH 17/18] amd_iommu: Refactor amdvi_page_walk() to use common code for page walk Alejandro Jimenez
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-14  2:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez

The AMD I/O Virtualization Technology (IOMMU) Specification (see Table
8: V, TV, and GV Fields in Device Table Entry) specifies that a DTE
with V=1, TV=0 does not contain valid address translation information.
If a request requires a table walk, the walk is terminated when this
condition is encountered.

Do not assume that addresses for a device with DTE[TV]=0 are passed
through (i.e. not remapped) and instead terminate the page table walk
early.
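
A sketch of the resulting checks (V is bit 0 and TV is bit 1 of DTE[0]; the
branch bodies are descriptive placeholders, not code from this patch):

  if (!(dte[0] & 0x1)) {
      /* V=0: forward the address untranslated (passthrough) */
  } else if (!(dte[0] & 0x2)) {
      /* V=1, TV=0: no valid translation information, terminate the walk */
  } else {
      /* V=1, TV=1: walk the I/O page table */
  }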

Fixes: d29a09ca6842 ("hw/i386: Introduce AMD IOMMU")
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 87 +++++++++++++++++++++++++--------------------
 1 file changed, 48 insertions(+), 39 deletions(-)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index e502bbbbb7d3..edf2935f6a83 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -1221,51 +1221,60 @@ static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
     uint64_t pte = dte[0], pte_addr, page_mask;
 
     /* make sure the DTE has TV = 1 */
-    if (pte & AMDVI_DEV_TRANSLATION_VALID) {
-        level = get_pte_translation_mode(pte);
-        if (level >= 7) {
-            trace_amdvi_mode_invalid(level, addr);
+    if (!(pte & AMDVI_DEV_TRANSLATION_VALID)) {
+        /*
+         * A DTE with V=1, TV=0 does not have a valid Page Table Root Pointer.
+         * An IOMMU processing a request that requires a table walk terminates
+         * the walk when it encounters this condition. Do the same and return
+         * instead of assuming that the address is forwarded without translation
+         * i.e. the passthrough case, as it is done for the case where DTE[V]=0.
+         */
+        return;
+    }
+
+    level = get_pte_translation_mode(pte);
+    if (level >= 7) {
+        trace_amdvi_mode_invalid(level, addr);
+        return;
+    }
+    if (level == 0) {
+        goto no_remap;
+    }
+
+    /* we are at the leaf page table or page table encodes a huge page */
+    do {
+        pte_perms = amdvi_get_perms(pte);
+        present = pte & 1;
+        if (!present || perms != (perms & pte_perms)) {
+            amdvi_page_fault(as->iommu_state, as->devfn, addr, perms);
+            trace_amdvi_page_fault(addr);
             return;
         }
-        if (level == 0) {
-            goto no_remap;
+        /* go to the next lower level */
+        pte_addr = pte & AMDVI_DEV_PT_ROOT_MASK;
+        /* add offset and load pte */
+        pte_addr += ((addr >> (3 + 9 * level)) & 0x1FF) << 3;
+        pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
+        if (!pte) {
+            return;
         }
+        oldlevel = level;
+        level = get_pte_translation_mode(pte);
+    } while (level > 0 && level < 7);
 
-        /* we are at the leaf page table or page table encodes a huge page */
-        do {
-            pte_perms = amdvi_get_perms(pte);
-            present = pte & 1;
-            if (!present || perms != (perms & pte_perms)) {
-                amdvi_page_fault(as->iommu_state, as->devfn, addr, perms);
-                trace_amdvi_page_fault(addr);
-                return;
-            }
-
-            /* go to the next lower level */
-            pte_addr = pte & AMDVI_DEV_PT_ROOT_MASK;
-            /* add offset and load pte */
-            pte_addr += ((addr >> (3 + 9 * level)) & 0x1FF) << 3;
-            pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
-            if (!pte || (pte == (uint64_t)-1)) {
-                return;
-            }
-            oldlevel = level;
-            level = get_pte_translation_mode(pte);
-        } while (level > 0 && level < 7);
+    if (level == 0x7) {
+        page_mask = pte_override_page_mask(pte);
+    } else {
+        page_mask = pte_get_page_mask(oldlevel);
+    }
 
-        if (level == 0x7) {
-            page_mask = pte_override_page_mask(pte);
-        } else {
-            page_mask = pte_get_page_mask(oldlevel);
-        }
+    /* get access permissions from pte */
+    ret->iova = addr & page_mask;
+    ret->translated_addr = (pte & AMDVI_DEV_PT_ROOT_MASK) & page_mask;
+    ret->addr_mask = ~page_mask;
+    ret->perm = amdvi_get_perms(pte);
+    return;
 
-        /* get access permissions from pte */
-        ret->iova = addr & page_mask;
-        ret->translated_addr = (pte & AMDVI_DEV_PT_ROOT_MASK) & page_mask;
-        ret->addr_mask = ~page_mask;
-        ret->perm = amdvi_get_perms(pte);
-        return;
-    }
 no_remap:
     ret->iova = addr & AMDVI_PAGE_MASK_4K;
     ret->translated_addr = addr & AMDVI_PAGE_MASK_4K;
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 17/18] amd_iommu: Refactor amdvi_page_walk() to use common code for page walk
  2025-04-14  2:02 [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (15 preceding siblings ...)
  2025-04-14  2:02 ` [PATCH 16/18] amd_iommu: Do not assume passthrough translation when DTE[TV]=0 Alejandro Jimenez
@ 2025-04-14  2:02 ` Alejandro Jimenez
  2025-04-14  2:02 ` [PATCH 18/18] amd_iommu: Do not emit I/O page fault events during replay() Alejandro Jimenez
  2025-04-23 10:45 ` [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Sairaj Kodilkar
  18 siblings, 0 replies; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-14  2:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez

Simplify amdvi_page_walk() by making it call the fetch_pte() helper that
is already in use by the shadow page synchronization code. This ensures
that all code uses the same page table walking algorithm.
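
For reference, the contract assumed of fetch_pte() in the rewritten walk,
inferred from its callers in this series (the helper itself was introduced
in an earlier patch; the sentinels below are the values tested in the diff):

  hwaddr pagesize = 0;
  uint64_t pte = fetch_pte(as, addr, dte[0], &pagesize);

  /* (uint64_t)-1 / (uint64_t)-2 signal walk errors; on success the leaf
   * PTE is returned and pagesize is set to the size of the mapping */
  if (pte == (uint64_t)-1 || pte == (uint64_t)-2 ||
      !IOMMU_PTE_PRESENT(pte) || perms != (perms & amdvi_get_perms(pte))) {
      return;   /* no usable translation */
  }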

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 60 +++++++++++++++++++++------------------------
 1 file changed, 28 insertions(+), 32 deletions(-)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index edf2935f6a83..dc29a52bd845 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -106,6 +106,8 @@ static void amdvi_address_space_unmap(AMDVIAddressSpace *as, IOMMUNotifier *n);
 static void amdvi_address_space_sync(AMDVIAddressSpace *as);
 static void amdvi_switch_address_space(AMDVIAddressSpace *amdvi_as);
 static void amdvi_notify_iommu(AMDVIAddressSpace *as, IOMMUTLBEvent *event);
+static uint64_t fetch_pte(AMDVIAddressSpace *as, const hwaddr address,
+                          uint64_t dte, hwaddr *page_size);
 
 uint64_t amdvi_extended_feature_register(AMDVIState *s)
 {
@@ -1217,11 +1219,12 @@ static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
                             IOMMUTLBEntry *ret, unsigned perms,
                             hwaddr addr)
 {
-    unsigned level, present, pte_perms, oldlevel;
-    uint64_t pte = dte[0], pte_addr, page_mask;
+    hwaddr page_mask, pagesize = 0;
+    uint8_t mode;
+    uint64_t pte;
 
     /* make sure the DTE has TV = 1 */
-    if (!(pte & AMDVI_DEV_TRANSLATION_VALID)) {
+    if (!(dte[0] & AMDVI_DEV_TRANSLATION_VALID)) {
         /*
          * A DTE with V=1, TV=0 does not have a valid Page Table Root Pointer.
          * An IOMMU processing a request that requires a table walk terminates
@@ -1232,42 +1235,35 @@ static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
         return;
     }
 
-    level = get_pte_translation_mode(pte);
-    if (level >= 7) {
-        trace_amdvi_mode_invalid(level, addr);
+    mode = get_pte_translation_mode(dte[0]);
+    if (mode >= 7) {
+        trace_amdvi_mode_invalid(mode, addr);
         return;
     }
-    if (level == 0) {
+    if (mode == 0) {
         goto no_remap;
     }
 
-    /* we are at the leaf page table or page table encodes a huge page */
-    do {
-        pte_perms = amdvi_get_perms(pte);
-        present = pte & 1;
-        if (!present || perms != (perms & pte_perms)) {
-            amdvi_page_fault(as->iommu_state, as->devfn, addr, perms);
-            trace_amdvi_page_fault(addr);
-            return;
-        }
-        /* go to the next lower level */
-        pte_addr = pte & AMDVI_DEV_PT_ROOT_MASK;
-        /* add offset and load pte */
-        pte_addr += ((addr >> (3 + 9 * level)) & 0x1FF) << 3;
-        pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
-        if (!pte) {
-            return;
-        }
-        oldlevel = level;
-        level = get_pte_translation_mode(pte);
-    } while (level > 0 && level < 7);
+    /* Attempt to fetch the PTE to determine if a valid mapping exists */
+    pte = fetch_pte(as, addr, dte[0], &pagesize);
 
-    if (level == 0x7) {
-        page_mask = pte_override_page_mask(pte);
-    } else {
-        page_mask = pte_get_page_mask(oldlevel);
+    /*
+     * If the page table walk returned an error of any type or an empty PTE
+     * (i.e. no mapping), or if the permissions do not match, return since
+     * there is no translation available.
+     */
+    if ((pte == (uint64_t)-1) || (pte == (uint64_t)-2) ||
+        !IOMMU_PTE_PRESENT(pte) || perms != (perms & amdvi_get_perms(pte))) {
+
+        amdvi_page_fault(as->iommu_state, as->devfn, addr, perms);
+        trace_amdvi_page_fault(addr);
+        return;
     }
 
+    /* A valid PTE and page size have been retrieved */
+    assert(pagesize);
+    page_mask = ~(pagesize - 1);
+
     /* get access permissions from pte */
     ret->iova = addr & page_mask;
     ret->translated_addr = (pte & AMDVI_DEV_PT_ROOT_MASK) & page_mask;
@@ -1279,7 +1275,7 @@ no_remap:
     ret->iova = addr & AMDVI_PAGE_MASK_4K;
     ret->translated_addr = addr & AMDVI_PAGE_MASK_4K;
     ret->addr_mask = ~AMDVI_PAGE_MASK_4K;
-    ret->perm = amdvi_get_perms(pte);
+    ret->perm = amdvi_get_perms(dte[0]);
 }
 
 static void amdvi_do_translate(AMDVIAddressSpace *as, hwaddr addr,
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 18/18] amd_iommu: Do not emit I/O page fault events during replay()
  2025-04-14  2:02 [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (16 preceding siblings ...)
  2025-04-14  2:02 ` [PATCH 17/18] amd_iommu: Refactor amdvi_page_walk() to use common code for page walk Alejandro Jimenez
@ 2025-04-14  2:02 ` Alejandro Jimenez
  2025-04-23  6:18   ` Sairaj Kodilkar
  2025-04-23 10:45 ` [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Sairaj Kodilkar
  18 siblings, 1 reply; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-14  2:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, sarunkod, Wei.Huang2,
	joao.m.martins, boris.ostrovsky, alejandro.j.jimenez

Do not emit an I/O page fault on amdvi_page_walk() when a valid mapping
is not found. The current role of amdvi_page_walk() is to be a helper
for the translate() method and ultimately the IOMMU replay()
functionality. These operations might be executed while the guest is
running, but do not necessarily correspond 1:1 with guest-initiated page
walks.

One example of such a scenario is when VFIO code calls
memory_region_iommu_replay() (which ends up calling amdvi_page_walk())
to sync the dirty bitmap, or after registering a new notifier. The guest
would get IO_PAGE_FAULT events for all the regions where a mapping
doesn't currently exist, which is not correct.

Note that after this change there are no users of amdvi_page_fault(),
but since the IO page fault handling will be addressed in upcoming work,
I am choosing to mark it as unused rather than deleting it.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index dc29a52bd845..ac7000dbafc7 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -263,8 +263,8 @@ static void amdvi_encode_event(uint64_t *evt, uint16_t devid, uint64_t addr,
  *
  * @addr: virtual address in translation request
  */
-static void amdvi_page_fault(AMDVIState *s, uint16_t devid,
-                             hwaddr addr, uint16_t info)
+static void __attribute__((unused))
+amdvi_page_fault(AMDVIState *s, uint16_t devid, hwaddr addr, uint16_t info)
 {
     uint64_t evt[2];
 
@@ -1254,9 +1254,6 @@ static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
      */
     if ((pte == (uint64_t)-1) || (pte == (uint64_t)-2) ||
         !IOMMU_PTE_PRESENT(pte) || perms != (perms & amdvi_get_perms(pte))) {
-
-        amdvi_page_fault(as->iommu_state, as->devfn, addr, perms);
-        trace_amdvi_page_fault(addr);
         return;
     }
 
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/18] amd_iommu: Add helper function to extract the DTE
  2025-04-14  2:02 ` [PATCH 02/18] amd_iommu: Add helper function to extract the DTE Alejandro Jimenez
@ 2025-04-16 11:36   ` Sairaj Kodilkar
  2025-04-16 13:29     ` Alejandro Jimenez
  0 siblings, 1 reply; 49+ messages in thread
From: Sairaj Kodilkar @ 2025-04-16 11:36 UTC (permalink / raw)
  To: Alejandro Jimenez, qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, Wei.Huang2, joao.m.martins,
	boris.ostrovsky



On 4/14/2025 7:32 AM, Alejandro Jimenez wrote:
Hi Alejandro,

> Extracting the DTE from a given AMDVIAddressSpace pointer structure is a
> common operation required for syncing the shadow page tables. Implement a
> helper to do it and check for common error conditions.
> 
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> ---
>   hw/i386/amd_iommu.c | 47 ++++++++++++++++++++++++++++++++++++++++-----
>   1 file changed, 42 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index 5f9b95279997..22d648c2e0e3 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -77,6 +77,20 @@ typedef struct AMDVIIOTLBEntry {
>       uint64_t page_mask;         /* physical page size  */
>   } AMDVIIOTLBEntry;
>   
> +/*
> + * These 'fault' reasons have an overloaded meaning since they are not only
> + * intended for describing reasons that generate an IO_PAGE_FAULT as per the AMD
> + * IOMMU specification, but are also used to signal internal errors in the
> + * emulation code.
> + */
> +typedef enum AMDVIFaultReason {
> +    AMDVI_FR_DTE_RTR_ERR = 1,           /* Failure to retrieve DTE */
> +    AMDVI_FR_DTE_V,                     /* DTE[V] = 0 */
> +    AMDVI_FR_DTE_TV,                    /* DTE[TV] = 0 */
> +} AMDVIFaultReason;
> +
> +static int amdvi_as_to_dte(AMDVIAddressSpace *as, uint64_t *dte);
> +

No need to have this function declaration as it is a static function.

>   uint64_t amdvi_extended_feature_register(AMDVIState *s)
>   {
>       uint64_t feature = AMDVI_DEFAULT_EXT_FEATURES;
> @@ -882,6 +896,28 @@ static bool amdvi_get_dte(AMDVIState *s, int devid, uint64_t *entry)
>       return true;
>   }
>   
> +static int amdvi_as_to_dte(AMDVIAddressSpace *as, uint64_t *dte)
> +{
> +    uint16_t devid = PCI_BUILD_BDF(as->bus_num, as->devfn);
> +    AMDVIState *s = as->iommu_state;
> +
> +    if (!amdvi_get_dte(s, devid, dte)) {
> +        /* Unable to retrieve DTE for devid */
> +        return -AMDVI_FR_DTE_RTR_ERR;
> +    }
> +
> +    if (!(dte[0] & AMDVI_DEV_VALID)) {
> +        /* DTE[V] not set, address is passed untranslated for devid */
> +        return -AMDVI_FR_DTE_V;
> +    }
> +
> +    if (!(dte[0] & AMDVI_DEV_TRANSLATION_VALID)) {
> +        /* DTE[TV] not set, host page table not valid for devid */
> +        return -AMDVI_FR_DTE_TV;
> +    }
> +    return 0;
> +}
> +
>   /* get pte translation mode */
>   static inline uint8_t get_pte_translation_mode(uint64_t pte)
>   {
> @@ -990,6 +1026,7 @@ static void amdvi_do_translate(AMDVIAddressSpace *as, hwaddr addr,
>       uint16_t devid = PCI_BUILD_BDF(as->bus_num, as->devfn);
>       AMDVIIOTLBEntry *iotlb_entry = amdvi_iotlb_lookup(s, addr, devid);
>       uint64_t entry[4];
> +    int dte_ret;
>   
>       if (iotlb_entry) {
>           trace_amdvi_iotlb_hit(PCI_BUS_NUM(devid), PCI_SLOT(devid),
> @@ -1001,13 +1038,13 @@ static void amdvi_do_translate(AMDVIAddressSpace *as, hwaddr addr,
>           return;
>       }
>   
> -    if (!amdvi_get_dte(s, devid, entry)) {
> -        return;
> -    }
> +    dte_ret = amdvi_as_to_dte(as, entry);
>   
> -    /* devices with V = 0 are not translated */
> -    if (!(entry[0] & AMDVI_DEV_VALID)) {
> +    if (dte_ret == -AMDVI_FR_DTE_V) {
> +        /* DTE[V]=0, address is passed untranslated */
>           goto out;
> +    } else if (dte_ret == -AMDVI_FR_DTE_TV) {
> +        return;
>       }
>   
>       amdvi_page_walk(as, entry, ret,

Regards
Sairaj Kodilkar


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 03/18] amd_iommu: Add support for IOMMU notifier
  2025-04-14  2:02 ` [PATCH 03/18] amd_iommu: Add support for IOMMU notifier Alejandro Jimenez
@ 2025-04-16 12:14   ` Sairaj Kodilkar
  2025-04-16 22:17     ` Alejandro Jimenez
  0 siblings, 1 reply; 49+ messages in thread
From: Sairaj Kodilkar @ 2025-04-16 12:14 UTC (permalink / raw)
  To: Alejandro Jimenez, qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, Wei.Huang2, joao.m.martins,
	boris.ostrovsky

Hi Alejandro,

On 4/14/2025 7:32 AM, Alejandro Jimenez wrote:
> In order to enable device assignment with IOMMU protection and guest
> DMA address translation, IOMMU notifier support is necessary to allow
> users like VFIO to synchronize the shadow page tables i.e. to receive
> notifications when the guest updates its IO page tables and replay the
> mappings onto host IO page tables.
> 
> This requires the vIOMMU is configured with the NpCache capability,
> so the guest issues IOMMU invalidations for both map() and unmap()
> operations. This capability is already part of AMDVI_CAPAB_FEATURES,
> and is written to the configuration in amdvi_pci_realize().
> 
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> ---
>   hw/i386/amd_iommu.c | 34 ++++++++++++++++++++++++++++------
>   hw/i386/amd_iommu.h |  6 ++++++
>   2 files changed, 34 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index 22d648c2e0e3..8dbb10d91339 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -66,6 +66,13 @@ struct AMDVIAddressSpace {
>       MemoryRegion iommu_nodma;   /* Alias of shared nodma memory region  */
>       MemoryRegion iommu_ir;      /* Device's interrupt remapping region  */
>       AddressSpace as;            /* device's corresponding address space */
> +
> +    /* DMA address translation support */
> +    IOMMUNotifierFlag notifier_flags;
> +    /* entry in list of Address spaces with registered notifiers */
> +    QLIST_ENTRY(AMDVIAddressSpace) next;
> +    /* DMA address translation active */
> +    bool addr_translation;

I don't see any use of addr_translation in the current patch.
Maybe define it when you really need this flag?

>   };
>   
>   /* AMDVI cache entry */
> @@ -1561,14 +1568,28 @@ static int amdvi_iommu_notify_flag_changed(IOMMUMemoryRegion *iommu,
>                                              Error **errp)
>   {
>       AMDVIAddressSpace *as = container_of(iommu, AMDVIAddressSpace, iommu);
> +    AMDVIState *s = as->iommu_state;
> +
> +    /* DMA remapping capability is required to implement IOMMU notifier */
> +    if (!s->dma_remap && (new & IOMMU_NOTIFIER_MAP)) {
> +        error_setg_errno(errp, ENOTSUP,
> +                "device %02x.%02x.%x requires dma-remap=1",
> +                as->bus_num, PCI_SLOT(as->devfn), PCI_FUNC(as->devfn));
> +        return -ENOTSUP;
> +    }
>   
> -    if (new & IOMMU_NOTIFIER_MAP) {
> -        error_setg(errp,
> -                   "device %02x.%02x.%x requires iommu notifier which is not "
> -                   "currently supported", as->bus_num, PCI_SLOT(as->devfn),
> -                   PCI_FUNC(as->devfn));
> -        return -EINVAL;
> +    /*
> +     * Update notifier flags for address space and the list of address spaces
> +     * with registered notifiers.
> +     */
> +    as->notifier_flags = new;
> +
> +    if (old == IOMMU_NOTIFIER_NONE) {
> +        QLIST_INSERT_HEAD(&s->amdvi_as_with_notifiers, as, next);
> +    } else if (new == IOMMU_NOTIFIER_NONE) {
> +        QLIST_REMOVE(as, next);
>       }
> +
>       return 0;
>   }
>   
> @@ -1700,6 +1721,7 @@ static void amdvi_sysbus_realize(DeviceState *dev, Error **errp)
>   
>   static const Property amdvi_properties[] = {
>       DEFINE_PROP_BOOL("xtsup", AMDVIState, xtsup, false),
> +    DEFINE_PROP_BOOL("dma-remap", AMDVIState, dma_remap, false),
>   };

Please add a description in commit message for this flag

>   
>   static const VMStateDescription vmstate_amdvi_sysbus = {
> diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
> index 28125130c6fc..e12ecade4baa 100644
> --- a/hw/i386/amd_iommu.h
> +++ b/hw/i386/amd_iommu.h
> @@ -365,12 +365,18 @@ struct AMDVIState {
>       /* for each served device */
>       AMDVIAddressSpace **address_spaces[PCI_BUS_MAX];
>   
> +    /* list of address spaces with registered notifiers */
> +    QLIST_HEAD(, AMDVIAddressSpace) amdvi_as_with_notifiers;
> +
>       /* IOTLB */
>       GHashTable *iotlb;
>   
>       /* Interrupt remapping */
>       bool ga_enabled;
>       bool xtsup;
> +
> +    /* DMA address translation */
> +    bool dma_remap;

I think you should use this flag in the remapping path as well.
I am aware that you are using it later in this series to switch the
address space, but the current patch will make things inconsistent for
emulated and VFIO devices (possibly breaking bisection).

Also, the newer AMD IVRS format provides a HATDis bit (Table 96 in the
IOMMU spec [1]), informing the guest kernel that the v1 page table is
disabled in the IOMMU. It is a good idea to set this bit in the IVRS when
dma_remap=false. This way a "HATDis bit aware" guest can enable
iommu.passthrough.

Also, is it a good idea to have dma_remap default to false?
Consider a guest that is not aware of the HATDis bit. Things will break
if such a guest boots with iommu.passthrough=0 (recreating the pt=on
scenario).

[1] 
https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifications/48882_IOMMU.pdf

Regards
Sairaj Kodilkar

PS: Sorry if I missed something here, I haven't progressed much in the
series.

>   };
>   
>   uint64_t amdvi_extended_feature_register(AMDVIState *s);



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/18] amd_iommu: Add helper function to extract the DTE
  2025-04-16 11:36   ` Sairaj Kodilkar
@ 2025-04-16 13:29     ` Alejandro Jimenez
  2025-04-16 18:50       ` Michael S. Tsirkin
  0 siblings, 1 reply; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-16 13:29 UTC (permalink / raw)
  To: Sairaj Kodilkar, qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, Wei.Huang2, joao.m.martins,
	boris.ostrovsky



On 4/16/25 7:36 AM, Sairaj Kodilkar wrote:
> 
> 
> On 4/14/2025 7:32 AM, Alejandro Jimenez wrote:
> Hi Alejandro,
> 
>> Extracting the DTE from a given AMDVIAddressSpace pointer structure is a
>> common operation required for syncing the shadow page tables. Implement a
>> helper to do it and check for common error conditions.
>>
>> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
>> ---
>>   hw/i386/amd_iommu.c | 47 ++++++++++++++++++++++++++++++++++++++++-----
>>   1 file changed, 42 insertions(+), 5 deletions(-)
>>
>> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
>> index 5f9b95279997..22d648c2e0e3 100644
>> --- a/hw/i386/amd_iommu.c
>> +++ b/hw/i386/amd_iommu.c
>> @@ -77,6 +77,20 @@ typedef struct AMDVIIOTLBEntry {
>>       uint64_t page_mask;         /* physical page size  */
>>   } AMDVIIOTLBEntry;
>> +/*
>> + * These 'fault' reasons have an overloaded meaning since they are 
>> not only
>> + * intended for describing reasons that generate an IO_PAGE_FAULT as 
>> per the AMD
>> + * IOMMU specification, but are also used to signal internal errors 
>> in the
>> + * emulation code.
>> + */
>> +typedef enum AMDVIFaultReason {
>> +    AMDVI_FR_DTE_RTR_ERR = 1,           /* Failure to retrieve DTE */
>> +    AMDVI_FR_DTE_V,                     /* DTE[V] = 0 */
>> +    AMDVI_FR_DTE_TV,                    /* DTE[TV] = 0 */
>> +} AMDVIFaultReason;
>> +
>> +static int amdvi_as_to_dte(AMDVIAddressSpace *as, uint64_t *dte);
>> +
> 
> No need to have this function declaration as it is a static function.
> 

I am adding a forward declaration since this function will be used by 
several other new and existing functions throughout the series, and this 
avoids having to keep moving the function definition. I do the same for 
many other functions in later patches, since it is more practical than 
constantly moving definitions around to guarantee ordering constraints. 
That approach would create patches with large but non-functional diffs due
to code movement alone, make it harder to parse the changes that actually
matter, and make rebasing and conflict resolution harder, etc...

Alejandro
> 
> Regards
> Sairaj Kodilkar



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/18] amd_iommu: Add helper function to extract the DTE
  2025-04-16 13:29     ` Alejandro Jimenez
@ 2025-04-16 18:50       ` Michael S. Tsirkin
  2025-04-16 22:37         ` Alejandro Jimenez
  0 siblings, 1 reply; 49+ messages in thread
From: Michael S. Tsirkin @ 2025-04-16 18:50 UTC (permalink / raw)
  To: Alejandro Jimenez
  Cc: Sairaj Kodilkar, qemu-devel, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	vasant.hegde, suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
	joao.m.martins, boris.ostrovsky

On Wed, Apr 16, 2025 at 09:29:23AM -0400, Alejandro Jimenez wrote:
> 
> 
> On 4/16/25 7:36 AM, Sairaj Kodilkar wrote:
> > 
> > 
> > On 4/14/2025 7:32 AM, Alejandro Jimenez wrote:
> > Hi Alejandro,
> > 
> > > Extracting the DTE from a given AMDVIAddressSpace pointer structure is a
> > > common operation required for syncing the shadow page tables. Implement a
> > > helper to do it and check for common error conditions.
> > > 
> > > Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> > > ---
> > >   hw/i386/amd_iommu.c | 47 ++++++++++++++++++++++++++++++++++++++++-----
> > >   1 file changed, 42 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> > > index 5f9b95279997..22d648c2e0e3 100644
> > > --- a/hw/i386/amd_iommu.c
> > > +++ b/hw/i386/amd_iommu.c
> > > @@ -77,6 +77,20 @@ typedef struct AMDVIIOTLBEntry {
> > >       uint64_t page_mask;         /* physical page size  */
> > >   } AMDVIIOTLBEntry;
> > > +/*
> > > + * These 'fault' reasons have an overloaded meaning since they are
> > > not only
> > > + * intended for describing reasons that generate an IO_PAGE_FAULT
> > > as per the AMD
> > > + * IOMMU specification, but are also used to signal internal errors
> > > in the
> > > + * emulation code.
> > > + */
> > > +typedef enum AMDVIFaultReason {
> > > +    AMDVI_FR_DTE_RTR_ERR = 1,           /* Failure to retrieve DTE */
> > > +    AMDVI_FR_DTE_V,                     /* DTE[V] = 0 */
> > > +    AMDVI_FR_DTE_TV,                    /* DTE[TV] = 0 */
> > > +} AMDVIFaultReason;
> > > +
> > > +static int amdvi_as_to_dte(AMDVIAddressSpace *as, uint64_t *dte);
> > > +
> > 
> > No need to have this function declaration as it is a static function.
> > 
> 
> I am adding a forward declaration since this function will be used by
> several other new and existing functions throughout the series, and this
> avoids having to keep moving the function definition. I do the same for many
> other functions in later patches, since it is more practical than constantly
> moving definitions around to guarantee ordering constraints. That approach
> would create patches with large diffs but non-functional changes due to code
> movement alone, makes it harder to parse the changes that actually matter,
> harder to rebase and resolve conflicts, etc...
> 
> Alejandro

You can add forward declarations temporarily to simplify review.  But in
the end, I prefer seeing code without forward declarations, with
functions ordered sensibly by order of calls, rather than spread
randomly all over the place.



> > 
> > Regards
> > Sairaj Kodilkar



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 03/18] amd_iommu: Add support for IOMMU notifier
  2025-04-16 12:14   ` Sairaj Kodilkar
@ 2025-04-16 22:17     ` Alejandro Jimenez
  2025-04-17 10:19       ` Sairaj Kodilkar
  0 siblings, 1 reply; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-16 22:17 UTC (permalink / raw)
  To: Sairaj Kodilkar, qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, Wei.Huang2, joao.m.martins,
	boris.ostrovsky



On 4/16/25 8:14 AM, Sairaj Kodilkar wrote:

>> +
>> +    /* DMA address translation support */
>> +    IOMMUNotifierFlag notifier_flags;
>> +    /* entry in list of Address spaces with registered notifiers */
>> +    QLIST_ENTRY(AMDVIAddressSpace) next;
>> +    /* DMA address translation active */
>> +    bool addr_translation;
> 
> I don't see any use of addr_translation in the current patch.
> Maybe define it when you really need this flag?

ACK. I can move it to a later patch. My rationale was that this patch is 
adding all the new structure members that will be needed, but it makes 
sense to split it.


>>       return 0;
>>   }
>> @@ -1700,6 +1721,7 @@ static void amdvi_sysbus_realize(DeviceState 
>> *dev, Error **errp)
>>   static const Property amdvi_properties[] = {
>>       DEFINE_PROP_BOOL("xtsup", AMDVIState, xtsup, false),
>> +    DEFINE_PROP_BOOL("dma-remap", AMDVIState, dma_remap, false),
>>   };
> 
> Please add a description in commit message for this flag

Will change in next revision.

> 
>>   static const VMStateDescription vmstate_amdvi_sysbus = {
>> diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
>> index 28125130c6fc..e12ecade4baa 100644
>> --- a/hw/i386/amd_iommu.h
>> +++ b/hw/i386/amd_iommu.h
>> @@ -365,12 +365,18 @@ struct AMDVIState {
>>       /* for each served device */
>>       AMDVIAddressSpace **address_spaces[PCI_BUS_MAX];
>> +    /* list of address spaces with registered notifiers */
>> +    QLIST_HEAD(, AMDVIAddressSpace) amdvi_as_with_notifiers;
>> +
>>       /* IOTLB */
>>       GHashTable *iotlb;
>>       /* Interrupt remapping */
>>       bool ga_enabled;
>>       bool xtsup;
>> +
>> +    /* DMA address translation */
>> +    bool dma_remap;
> 
> I think you should use this flag in the remapping path as well.
> I am aware that you are using it later in this series to switch the
> address space, but the current patch will make things inconsistent for
> emulated and VFIO devices (possibly breaking bisection).

The change in behavior happens only if the user explicitly sets the
dma-remap=on property on the command line, which is why I made it off by
default.

To eliminate the possibility of using dma-remap=on before all the 
infrastructure to support it is in place, I will move this patch and 
[5/18] to the end of the series. Does that address your concern?

> 
> Also, the newer AMD IVRS format provides a HATDis bit (Table 96 in the
> IOMMU spec [1]), informing the guest kernel that the v1 page table is
> disabled in the IOMMU.

Sounds like this mechanism could have been used until now to prevent the
scenario where the guest requests DMA remapping but the vIOMMU doesn't
have the capability. Seems moot now that we are actually adding the
capability.


> It is a good idea to set this bit in the IVRS when dma_remap=false.
> This way a "HATDis bit aware" guest can enable iommu.passthrough.

I'd need to research how to implement this, but isn't this enhancement
better added in a separate series, since it also requires a companion
Linux change? I don't recall the Linux driver being "HATDis aware" (or 
even HATS aware for that matter), but perhaps I am mistaken...


> 
> Also, is it a good idea to have dma_remap default to false?
> Consider a guest that is not aware of the HATDis bit. Things will break
> if such a guest boots with iommu.passthrough=0 (recreating the pt=on
> scenario).

That is not new, that is the current behavior that this series is trying 
to fix by adding the missing functionality.

As for the default value of the dma-remap property, I think it must be
set to 0/false (i.e. current behavior unchanged) until we deem the DMA
remapping feature stable enough to be made available for guests.
On that topic, maybe it should be an experimental feature for now, i.e.
"x-dma-remap".


> 
> [1] https://www.amd.com/content/dam/amd/en/documents/processor-tech- 
> docs/specifications/48882_IOMMU.pdf
> 
> Regards
> Sairaj Kodilkar
> 
> PS: Sorry if I missed something here, I haven't progressed much in the
> series.

No problem at all, the feedback is appreciated.

Thank you,
Alejandro


> 
>>   };
>>   uint64_t amdvi_extended_feature_register(AMDVIState *s);
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 02/18] amd_iommu: Add helper function to extract the DTE
  2025-04-16 18:50       ` Michael S. Tsirkin
@ 2025-04-16 22:37         ` Alejandro Jimenez
  0 siblings, 0 replies; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-16 22:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sairaj Kodilkar, qemu-devel, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	vasant.hegde, suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
	joao.m.martins, boris.ostrovsky



On 4/16/25 2:50 PM, Michael S. Tsirkin wrote:
> On Wed, Apr 16, 2025 at 09:29:23AM -0400, Alejandro Jimenez wrote:
>>
>>
>> On 4/16/25 7:36 AM, Sairaj Kodilkar wrote:
>>>

>>>> +static int amdvi_as_to_dte(AMDVIAddressSpace *as, uint64_t *dte);
>>>> +
>>>
>>> No need to have this function declaration as it is a static function.
>>>
>>
>> I am adding a forward declaration since this function will be used by
>> several other new and existing functions throughout the series, and this
>> avoids having to keep moving the function definition. I do the same for many
>> other functions in later patches, since it is more practical than constantly
>> moving definitions around to guarantee ordering constraints. That approach
>> would create patches with large diffs but non-functional changes due to code
>> movement alone, makes it harder to parse the changes that actually matter,
>> harder to rebase and resolve conflicts, etc...
>>
>> Alejandro
> 
> You can add forward declarations temporarily to simplify review.  But in
> the end, I prefer seeing code without forward declarations, with
> functions ordered sensibly by order of calls, rather than spread
> randomly all over the place.

I see. I did try to add the new functions in locations where they would
logically fit rather than randomly spread, but as the patchset grew it
became harder to do so without code movement.

I'll rebase the series to avoid using forward declarations.

On a somewhat related note, there are several patches that introduce 
helpers that are not immediately used, so I marked the definitions with 
the unused attribute, to be removed later. I can obviously squash the 
helper patches with the later ones adding the functionality, but I chose 
the more modular approach to split the patches. Please let me know if 
you agree with this approach, or you'd like me to submit larger patches 
instead.

Thank you,
Alejandro

> 
> 
>>>
>>> Regards
>>> Sairaj Kodilkar
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 03/18] amd_iommu: Add support for IOMMU notifier
  2025-04-16 22:17     ` Alejandro Jimenez
@ 2025-04-17 10:19       ` Sairaj Kodilkar
  2025-04-17 16:21         ` Alejandro Jimenez
  0 siblings, 1 reply; 49+ messages in thread
From: Sairaj Kodilkar @ 2025-04-17 10:19 UTC (permalink / raw)
  To: Alejandro Jimenez, qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, Wei.Huang2, joao.m.martins,
	boris.ostrovsky



On 4/17/2025 3:47 AM, Alejandro Jimenez wrote:
> 
> 
> On 4/16/25 8:14 AM, Sairaj Kodilkar wrote:
> 
>>> +
>>> +    /* DMA address translation support */
>>> +    IOMMUNotifierFlag notifier_flags;
>>> +    /* entry in list of Address spaces with registered notifiers */
>>> +    QLIST_ENTRY(AMDVIAddressSpace) next;
>>> +    /* DMA address translation active */
>>> +    bool addr_translation;
>>
>> I don't see any use of addr_translation in the current patch.
>> Maybe define it when you really need this flag?
> 
> ACK. I can move it to a later patch. My rationale was that this patch is 
> adding all the new structure members that will be needed, but it makes 
> sense to split it.
> 
> 
>>>       return 0;
>>>   }
>>> @@ -1700,6 +1721,7 @@ static void amdvi_sysbus_realize(DeviceState 
>>> *dev, Error **errp)
>>>   static const Property amdvi_properties[] = {
>>>       DEFINE_PROP_BOOL("xtsup", AMDVIState, xtsup, false),
>>> +    DEFINE_PROP_BOOL("dma-remap", AMDVIState, dma_remap, false),
>>>   };
>>
>> Please add a description in commit message for this flag
> 
> Will change in next revision.
> 
>>
>>>   static const VMStateDescription vmstate_amdvi_sysbus = {
>>> diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
>>> index 28125130c6fc..e12ecade4baa 100644
>>> --- a/hw/i386/amd_iommu.h
>>> +++ b/hw/i386/amd_iommu.h
>>> @@ -365,12 +365,18 @@ struct AMDVIState {
>>>       /* for each served device */
>>>       AMDVIAddressSpace **address_spaces[PCI_BUS_MAX];
>>> +    /* list of address spaces with registered notifiers */
>>> +    QLIST_HEAD(, AMDVIAddressSpace) amdvi_as_with_notifiers;
>>> +
>>>       /* IOTLB */
>>>       GHashTable *iotlb;
>>>       /* Interrupt remapping */
>>>       bool ga_enabled;
>>>       bool xtsup;
>>> +
>>> +    /* DMA address translation */
>>> +    bool dma_remap;
>>
>> I think you should use this flag in the remapping path as well.
>> I am aware that you are using it later in this series to switch the
>> address space, but the current patch will make things inconsistent for
>> emulated and VFIO devices (possibly breaking bisection).
> 
> The change in behavior happens only if the user explicitly sets the
> dma-remap=on property on the command line, which is why I made it off by
> default.
> 
> To eliminate the possibility of using dma-remap=on before all the 
> infrastructure to support it is in place, I will move this patch and 
> [5/18] to the end of the series. Does that address your concern?
> 

Yep, that works for me. It's always better to introduce flags at the end
of the series when we have the infrastructure ready!

>>
>> Also, the newer AMD IVRS format provides a HATDis bit (Table 96 in the
>> IOMMU spec [1]), informing the guest kernel that the v1 page table is
>> disabled in the IOMMU. 
> 
> Sounds like this mechanism could have been used until now to prevent the 
> scenario where the guest requests DMA remapping but the vIOMMU doesn't 
> have the capability. Seems moot now that we are actually adding the 
> capability.
> 
> 
> It's a good idea to set this bit in the IVRS when dma_remap=false.
>> This way a "HATDis bit aware" guest can enable iommu.passthrough.
> 
> I'd need to research how to implement this, but isn't this enhancement 
> better added in a separate series, since it also requires a companion

Yeah, I agree. Let's have it as a separate series.

> Linux change? I don't recall the Linux driver being "HATDis aware" (or 
> even HATS aware for that matter), but perhaps I am mistaken...
> 

As of now, we don't have a "HATDis aware" guest. It's a fairly new 
flag, and we are working on adding support for this feature.

> 
>>
>> Also, is it a good idea to have the default value dma_remap=false?
>> Consider a guest which is not aware of the HATDis bit. Things will break 
>> if such a guest boots with iommu.passthrough=0 (recreating the pt=on
>> scenario).
> 
> That is not new; that is the current behavior that this series is trying 
> to fix by adding the missing functionality.
> 
> As for the default value of the dma-remap property, I think it must be 
> set to 0/false (i.e. current behavior unchanged) until we deem the DMA 
> remapping feature stable enough to be made available to guests.
> On that topic, maybe it should be an experimental feature for now, i.e. 
> "x-dma-remap".
> 
> 

But the current behaviour for emulated (virtio) devices is to have 
dma-remapping on by default... I still think it's better to have this 
flag on.
Honestly, I am confused here...

Regards
Sairaj Kodilkar

>> [1] https://www.amd.com/content/dam/amd/en/documents/processor-tech- 
>> docs/specifications/48882_IOMMU.pdf
>>
>> Regards
>> Sairaj Kodilkar
>>
>> PS: Sorry if I missed something here; I haven't progressed much through 
>> the series.
> 
> No problem at all, the feedback is appreciated.
> 
> Thank you,
> Alejandro
> 
> 
>>
>>>   };
>>>   uint64_t amdvi_extended_feature_register(AMDVIState *s);
>>
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/18] amd_iommu: Add helpers to walk AMD v1 Page Table format
  2025-04-14  2:02 ` [PATCH 09/18] amd_iommu: Add helpers to walk AMD v1 Page Table format Alejandro Jimenez
@ 2025-04-17 12:40   ` CLEMENT MATHIEU--DRIF
  2025-04-17 15:27     ` Alejandro Jimenez
  0 siblings, 1 reply; 49+ messages in thread
From: CLEMENT MATHIEU--DRIF @ 2025-04-17 12:40 UTC (permalink / raw)
  To: Alejandro Jimenez, qemu-devel@nongnu.org
  Cc: pbonzini@redhat.com, richard.henderson@linaro.org,
	eduardo@habkost.net, peterx@redhat.com, david@redhat.com,
	philmd@linaro.org, mst@redhat.com, marcel.apfelbaum@gmail.com,
	alex.williamson@redhat.com, vasant.hegde@amd.com,
	suravee.suthikulpanit@amd.com, santosh.shukla@amd.com,
	sarunkod@amd.com, Wei.Huang2@amd.com, joao.m.martins@oracle.com,
	boris.ostrovsky@oracle.com



On 14/04/2025 4:02 am, Alejandro Jimenez wrote:
> 
> 
> The current amdvi_page_walk() is designed to be called by the replay()
> method. Rather than drastically altering it, introduce helpers to fetch
> guest PTEs that will be used by a page walker implementation.
> 
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> ---
>   hw/i386/amd_iommu.c | 125 ++++++++++++++++++++++++++++++++++++++++++++
>   hw/i386/amd_iommu.h |  42 +++++++++++++++
>   2 files changed, 167 insertions(+)
> 
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index 0af873b66a31..d089fdc28ef1 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -1563,6 +1563,131 @@ static const MemoryRegionOps amdvi_ir_ops = {
>       }
>   };
> 
> +/*
> + * For a PTE encoding a large page, return the page size it encodes as described
> + * by the AMD IOMMU Specification Table 14: Example Page Size Encodings.
> + * No need to adjust the value of the PTE to point to the first PTE in the large
> + * page since the encoding guarantees all "base" PTEs in the large page are the
> + * same.
> + */
> +static uint64_t large_pte_page_size(uint64_t pte)
> +{
> +    assert(PTE_NEXT_LEVEL(pte) == 7);
> +
> +    /* Determine size of the large/contiguous page encoded in the PTE */
> +    return PTE_LARGE_PAGE_SIZE(pte);
> +}
> +
> +/*
> + * Helper function to fetch a PTE using AMD v1 pgtable format.
> + * Returns:
> + * -2:  The Page Table Root could not be read from DTE, or IOVA is larger than
> + *      supported by the current page table level encoded in DTE[Mode].
> + * -1:  PTE could not be read from guest memory during a page table walk.
> + *      This means that the DTE has valid data, and one of the lower level
> + *      entries in the Page Table could not be read.
> + *  0:  PTE is marked not present, or entry is 0.
> + * >0:  Leaf PTE value resolved from walking Guest IO Page Table.
> + */

This seems to be a bit error-prone, as any statement like "if (pte < 0)" 
might be completely removed by the compiler during optimization phases.
If you want to reuse such "high" values, defines could help.
Otherwise, pte could be an out parameter.
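
To illustrate the pitfall in a standalone way (the sentinel names here
are hypothetical; the series compares against bare (uint64_t)-1/-2
casts):

    #include <stdint.h>
    #include <stdio.h>

    #define PTE_ERR_ROOT ((uint64_t)-2)    /* hypothetical define */
    #define PTE_ERR_READ ((uint64_t)-1)    /* hypothetical define */

    int main(void)
    {
        uint64_t pte = PTE_ERR_ROOT;

        if (pte < 0) {              /* always false: pte is unsigned, */
            puts("never reached");  /* so this branch is dead code    */
        }
        if (pte == PTE_ERR_ROOT) {  /* correct: compare the cast value */
            puts("page table root could not be read");
        }
        return 0;
    }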

> +static uint64_t __attribute__((unused))
> +fetch_pte(AMDVIAddressSpace *as, const hwaddr address, uint64_t dte,
> +          hwaddr *page_size)
> +{
> +    IOMMUAccessFlags perms = amdvi_get_perms(dte);
> +
> +    uint8_t level, mode;
> +    uint64_t pte = dte, pte_addr;
> +
> +    *page_size = 0;
> +
> +    if (perms == IOMMU_NONE) {
> +        return (uint64_t)-2;
> +    }
> +
> +    /*
> +     * The Linux kernel driver initializes the default mode to 3, corresponding
> +     * to a 39-bit GPA space, where each entry in the pagetable translates to a
> +     * 1GB (2^30) page size.
> +     */
> +    level = mode = get_pte_translation_mode(dte);
> +    assert(mode > 0 && mode < 7);
> +
> +    /*
> +     * If IOVA is larger than the max supported by the current pgtable level,
> +     * there is nothing to do. This signals that the pagetable level should be
> +     * increased, or is an address meant to have special behavior like
> +     * invalidating the entire cache.
> +     */
> +    if (address > PT_LEVEL_MAX_ADDR(mode - 1)) {
> +        /* IOVA too large for the current DTE */
> +        return (uint64_t)-2;
> +    }
> +
> +    do {
> +        level -= 1;
> +
> +        /* Update the page_size */
> +        *page_size = PTE_LEVEL_PAGE_SIZE(level);
> +
> +        /* Permission bits are ANDed at every level, including the DTE */
> +        perms &= amdvi_get_perms(pte);
> +        if (perms == IOMMU_NONE) {
> +            return pte;
> +        }
> +
> +        /* Not Present */
> +        if (!IOMMU_PTE_PRESENT(pte)) {
> +            return 0;
> +        }
> +
> +        /* Large or Leaf PTE found */
> +        if (PTE_NEXT_LEVEL(pte) == 7 || PTE_NEXT_LEVEL(pte) == 0) {
> +            /* Leaf PTE found */
> +            break;
> +        }
> +
> +        /*
> +         * Index the pgtable using the IOVA bits corresponding to current level
> +         * and walk down to the lower level.
> +         */
> +        pte_addr = NEXT_PTE_ADDR(pte, level, address);
> +        pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
> +
> +        if (pte == (uint64_t)-1) {
> +            /*
> +             * A returned PTE of -1 indicates a failure to read the page table
> +             * entry from guest memory.
> +             */
> +            if (level == mode - 1) {
> +                /* Failure to retrieve the Page Table from Root Pointer */
> +                *page_size = 0;
> +                return (uint64_t)-2;
> +            } else {
> +                /* Failure to read PTE. Page walk skips a page_size chunk */
> +                return pte;
> +            }
> +        }
> +    } while (level > 0);
> +
> +    /*
> +     * Page walk ends when Next Level field on PTE shows that either a leaf PTE
> +     * or a series of large PTEs have been reached. In the latter case, return
> +     * the pointer to the first PTE of the series.
> +     */
> +    assert(level == 0 || PTE_NEXT_LEVEL(pte) == 0 || PTE_NEXT_LEVEL(pte) == 7);
> +
> +    /*
> +     * In case the range starts in the middle of a contiguous page, need to
> +     * return the first PTE
> +     */
> +    if (PTE_NEXT_LEVEL(pte) == 7) {
> +        /* Update page_size with the large PTE page size */
> +        *page_size = large_pte_page_size(pte);
> +    }
> +
> +    return pte;
> +}
> +
>   /*
>    * Toggle between address translation and passthrough modes by enabling the
>    * corresponding memory regions.
> diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
> index c89e7dc9947d..fc4d2f7a4575 100644
> --- a/hw/i386/amd_iommu.h
> +++ b/hw/i386/amd_iommu.h
> @@ -25,6 +25,8 @@
>   #include "hw/i386/x86-iommu.h"
>   #include "qom/object.h"
> 
> +#define GENMASK64(h, l)  (((~0ULL) >> (63 - (h) + (l))) << (l))
> +
>   /* Capability registers */
>   #define AMDVI_CAPAB_BAR_LOW           0x04
>   #define AMDVI_CAPAB_BAR_HIGH          0x08
> @@ -174,6 +176,46 @@
>   #define AMDVI_GATS_MODE                 (2ULL <<  12)
>   #define AMDVI_HATS_MODE                 (2ULL <<  10)
> 
> +/* Page Table format */
> +
> +#define AMDVI_PTE_PR                    (1ULL << 0)
> +#define AMDVI_PTE_NEXT_LEVEL_MASK       GENMASK64(11, 9)
> +
> +#define IOMMU_PTE_PRESENT(pte)          ((pte) & AMDVI_PTE_PR)
> +
> +/* Using level=0 for leaf PTE at 4K page size */
> +#define PT_LEVEL_SHIFT(level)           (12 + ((level) * 9))
> +
> +/* Return IOVA bit group used to index the Page Table at specific level */
> +#define PT_LEVEL_INDEX(level, iova)     (((iova) >> PT_LEVEL_SHIFT(level)) & \
> +                                        GENMASK64(8, 0))
> +
> +/* Return the max address for a specified level i.e. max_oaddr */
> +#define PT_LEVEL_MAX_ADDR(x)    (((x) < 5) ? \
> +                                ((1ULL << PT_LEVEL_SHIFT((x + 1))) - 1) : \
> +                                (~(0ULL)))
> +
> +/* Extract the NextLevel field from PTE/PDE */
> +#define PTE_NEXT_LEVEL(pte)     (((pte) & AMDVI_PTE_NEXT_LEVEL_MASK) >> 9)
> +
> +/* Take page table level and return default pagetable size for level */
> +#define PTE_LEVEL_PAGE_SIZE(level)      (1ULL << (PT_LEVEL_SHIFT(level)))
> +
> +/*
> + * Return address of lower level page table encoded in PTE and specified by
> + * current level and corresponding IOVA bit group at such level.
> + */
> +#define NEXT_PTE_ADDR(pte, level, iova) (((pte) & AMDVI_DEV_PT_ROOT_MASK) + \
> +                                        (PT_LEVEL_INDEX(level, iova) * 8))
> +
> +/*
> + * Take a PTE value with mode=0x07 and return the page size it encodes.
> + */
> +#define PTE_LARGE_PAGE_SIZE(pte)    (1ULL << (1 + cto64(((pte) | 0xfffULL))))
> +
> +/* Return number of PTEs to use for a given page size (expected power of 2) */
> +#define PAGE_SIZE_PTE_COUNT(pgsz)       (1ULL << ((ctz64(pgsz) - 12) % 9))
> +
>   /* IOTLB */
>   #define AMDVI_IOTLB_MAX_SIZE 1024
>   #define AMDVI_DEVID_SHIFT    36
> --
> 2.43.5
> 
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 10/18] amd_iommu: Add a page walker to sync shadow page tables on invalidation
  2025-04-14  2:02 ` [PATCH 10/18] amd_iommu: Add a page walker to sync shadow page tables on invalidation Alejandro Jimenez
@ 2025-04-17 15:14   ` Ethan MILON
  2025-04-17 15:45     ` Alejandro Jimenez
  0 siblings, 1 reply; 49+ messages in thread
From: Ethan MILON @ 2025-04-17 15:14 UTC (permalink / raw)
  To: Alejandro Jimenez, qemu-devel@nongnu.org
  Cc: pbonzini@redhat.com, richard.henderson@linaro.org,
	eduardo@habkost.net, peterx@redhat.com, david@redhat.com,
	philmd@linaro.org, mst@redhat.com, marcel.apfelbaum@gmail.com,
	alex.williamson@redhat.com, vasant.hegde@amd.com,
	suravee.suthikulpanit@amd.com, santosh.shukla@amd.com,
	sarunkod@amd.com, Wei.Huang2@amd.com, joao.m.martins@oracle.com,
	boris.ostrovsky@oracle.com

Hi,

On 4/13/25 10:02 PM, Alejandro Jimenez wrote:
> For the specified address range, walk the page table identifying regions
> as mapped or unmapped and invoke registered notifiers with the
> corresponding event type.
> 
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> ---
>  hw/i386/amd_iommu.c | 74 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 74 insertions(+)
> 
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index d089fdc28ef1..6789e1e9b688 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -1688,6 +1688,80 @@ fetch_pte(AMDVIAddressSpace *as, const hwaddr address, uint64_t dte,
>      return pte;
>  }
>  
> +/*
> + * Walk the guest page table for an IOVA and range and signal the registered
> + * notifiers to sync the shadow page tables in the host.
> + * Must be called with a valid DTE for DMA remapping i.e. V=1,TV=1
> + */
> +static void __attribute__((unused))
> +amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as, uint64_t *dte,
> +                                   hwaddr addr, uint64_t size, bool send_unmap)
> +{
> +    IOMMUTLBEvent event;
> +
> +    hwaddr iova_next, page_mask, pagesize;
> +    hwaddr iova = addr;
> +    hwaddr end = iova + size - 1;
> +
> +    uint64_t pte;
> +
> +    while (iova < end) {
> +
> +        pte = fetch_pte(as, iova, dte[0], &pagesize);
> +
> +        if (pte == (uint64_t)-2) {
> +            /*
> +             * Invalid conditions such as the IOVA being larger than supported
> +             * by current page table mode as configured in the DTE, or a failure
> +             * to fetch the Page Table from the Page Table Root Pointer in DTE.
> +             */
> +            assert(pagesize == 0);
> +            return;
> +        }
> +        /* PTE has been validated for major errors and pagesize is set */
> +        assert(pagesize);
> +        page_mask = ~(pagesize - 1);
> +        iova_next = (iova & page_mask) + pagesize;
> +
> +        if (pte == (uint64_t)-1) {
> +            /*
> +             * Failure to read PTE from memory, the pagesize matches the current
> +             * level. Unable to determine the region type, so a safe strategy is
> +             * to skip the range and continue the page walk.
> +             */
> +            goto next;
> +        }
> +
> +        event.entry.target_as = &address_space_memory;
> +        event.entry.iova = iova & page_mask;
> +        /* translated_addr is irrelevant for the unmap case */
> +        event.entry.translated_addr = (pte & AMDVI_DEV_PT_ROOT_MASK) &
> +                                      page_mask;
> +        event.entry.addr_mask = ~page_mask;
> +        event.entry.perm = amdvi_get_perms(pte);

Is it possible for the DTE permissions to be more restrictive than the
permissions of the fetched PTE?

> +
> +        /*
> +         * In cases where the leaf PTE is not found, or it has invalid
> +         * permissions, an UNMAP type notification is sent, but only if the
> +         * caller requested it.
> +         */
> +        if (!IOMMU_PTE_PRESENT(pte) || (event.entry.perm == IOMMU_NONE)) {
> +            if (!send_unmap) {
> +                goto next;
> +            }
> +            event.type = IOMMU_NOTIFIER_UNMAP;
> +        } else {
> +            event.type = IOMMU_NOTIFIER_MAP;
> +        }
> +
> +        /* Invoke the notifiers registered for this address space */
> +        memory_region_notify_iommu(&as->iommu, 0, event);
> +
> +next:
> +        iova = iova_next;
> +    }
> +}
> +
>  /*
>   * Toggle between address translation and passthrough modes by enabling the
>   * corresponding memory regions.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/18] amd_iommu: Add helpers to walk AMD v1 Page Table format
  2025-04-17 12:40   ` CLEMENT MATHIEU--DRIF
@ 2025-04-17 15:27     ` Alejandro Jimenez
  2025-04-18  5:30       ` CLEMENT MATHIEU--DRIF
  0 siblings, 1 reply; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-17 15:27 UTC (permalink / raw)
  To: CLEMENT MATHIEU--DRIF, qemu-devel@nongnu.org
  Cc: pbonzini@redhat.com, richard.henderson@linaro.org,
	eduardo@habkost.net, peterx@redhat.com, david@redhat.com,
	philmd@linaro.org, mst@redhat.com, marcel.apfelbaum@gmail.com,
	alex.williamson@redhat.com, vasant.hegde@amd.com,
	suravee.suthikulpanit@amd.com, santosh.shukla@amd.com,
	sarunkod@amd.com, Wei.Huang2@amd.com, joao.m.martins@oracle.com,
	boris.ostrovsky@oracle.com



On 4/17/25 8:40 AM, CLEMENT MATHIEU--DRIF wrote:
> 
> 
> On 14/04/2025 4:02 am, Alejandro Jimenez wrote:
>>
>>
>> The current amdvi_page_walk() is designed to be called by the replay()
>> method. Rather than drastically altering it, introduce helpers to fetch
>> guest PTEs that will be used by a page walker implementation.
>>
>> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
>> ---
>>    hw/i386/amd_iommu.c | 125 ++++++++++++++++++++++++++++++++++++++++++++
>>    hw/i386/amd_iommu.h |  42 +++++++++++++++
>>    2 files changed, 167 insertions(+)
>>
>> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
>> index 0af873b66a31..d089fdc28ef1 100644
>> --- a/hw/i386/amd_iommu.c
>> +++ b/hw/i386/amd_iommu.c
>> @@ -1563,6 +1563,131 @@ static const MemoryRegionOps amdvi_ir_ops = {
>>        }
>>    };
>>
>> +/*
>> + * For a PTE encoding a large page, return the page size it encodes as described
>> + * by the AMD IOMMU Specification Table 14: Example Page Size Encodings.
>> + * No need to adjust the value of the PTE to point to the first PTE in the large
>> + * page since the encoding guarantees all "base" PTEs in the large page are the
>> + * same.
>> + */
>> +static uint64_t large_pte_page_size(uint64_t pte)
>> +{
>> +    assert(PTE_NEXT_LEVEL(pte) == 7);
>> +
>> +    /* Determine size of the large/contiguous page encoded in the PTE */
>> +    return PTE_LARGE_PAGE_SIZE(pte);
>> +}
>> +
>> +/*
>> + * Helper function to fetch a PTE using AMD v1 pgtable format.
>> + * Returns:
>> + * -2:  The Page Table Root could not be read from DTE, or IOVA is larger than
>> + *      supported by current page table level encoded in DTE[Mode].
>> + * -1:  PTE could not be read from guest memory during a page table walk.
>> + *      This means that the DTE has valid data, and one of the lower level
>> + *      entries in the Page Table could not be read.
>> + *  0:  PTE is marked not present, or entry is 0.
>> + * >0:  Leaf PTE value resolved from walking Guest IO Page Table.
>> + */
> 
> This seems to be a bit error-prone, as any statement like "if (pte < 0)"
> might be completely removed by the compiler during optimization phases.

Yes, caller(s) of fetch_pte() must cast to uint64_t in any comparison to 
check for error values. As is the case in many of the patches, I am 
following the example of the VT-d implementation, which does the same 
thing in vtd_iova_to_slpte() to validate the return of vtd_get_pte().

When using fetch_pte() again in patch [17/18] I considered adding a 
helper to check whether fetch_pte() returned a valid PTE, but it seemed 
unnecessary given that there are only two errors to be checked.

Another choice was to make fetch_pte() return an int so the error check 
could be done simply via (pte < 0), since bit 63 is either Reserved 
(DTE) or Ignored (PDE/PTE), but that seemed more prone to confusion, 
since you'd expect a PTE to be an unsigned 64-bit value. Though I am 
aware the way error return/checking is implemented essentially relies on 
that behavior.

> If you want to reuse such "high" values, defines could help.

Sorry, I don't follow. Do you mean using defines as in still returning a 
uint64_t but giving -1 and -2 special definitions? That might make the 
code somewhat more readable when checking the error values, but it still 
requires casting to uint64_t on the check, and doesn't solve the problem 
of a careless caller using (pte < 0) to check for errors...

> Otherwise, pte could be an out parameter.

In general, I think we have to accept the caveat that callers of 
fetch_pte() must have some implementation-specific knowledge to know 
they cannot check for errors using (pte < 0). Maybe with the aid of a 
strongly worded warning in the function header comment...

But if that argument doesn't convince you, and none of the alternatives 
above seem better, then I am leaning towards using the out parameter 
approach.
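
For the record, the out-parameter shape under discussion could look
roughly like this (hypothetical signature, not what this series
implements; hwaddr is stubbed so the sketch stands alone):

    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t hwaddr;        /* stand-in for QEMU's hwaddr */

    enum {
        PTE_ERR_READ = -1,          /* PTE read from guest memory failed */
        PTE_ERR_ROOT = -2,          /* bad root, or IOVA beyond DTE[Mode] */
    };

    /* Signed error code; the PTE itself comes back via an out param */
    static int fetch_pte_alt(hwaddr address, uint64_t dte,
                             uint64_t *pte, hwaddr *page_size)
    {
        (void)address;
        (void)dte;
        *page_size = 4096;
        *pte = 0;                   /* stub: pretend "not present" */
        return 0;
    }

    int main(void)
    {
        uint64_t pte;
        hwaddr pgsz;

        if (fetch_pte_alt(0x1000, 0, &pte, &pgsz) < 0) {
            puts("walk failed");    /* (ret < 0) is now a valid check */
        }
        return 0;
    }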

Thank you for the feedback.
Alejandro

> 
>> +static uint64_t __attribute__((unused))
>> +fetch_pte(AMDVIAddressSpace *as, const hwaddr address, uint64_t dte,
>> +          hwaddr *page_size)
>> +{
>> +    IOMMUAccessFlags perms = amdvi_get_perms(dte);
>> +
>> +    uint8_t level, mode;
>> +    uint64_t pte = dte, pte_addr;
>> +
>> +    *page_size = 0;
>> +
>> +    if (perms == IOMMU_NONE) {
>> +        return (uint64_t)-2;
>> +    }
>> +
>> +    /*
>> +     * The Linux kernel driver initializes the default mode to 3, corresponding
>> +     * to a 39-bit GPA space, where each entry in the pagetable translates to a
>> +     * 1GB (2^30) page size.
>> +     */
>> +    level = mode = get_pte_translation_mode(dte);
>> +    assert(mode > 0 && mode < 7);
>> +
>> +    /*
>> +     * If IOVA is larger than the max supported by the current pgtable level,
>> +     * there is nothing to do. This signals that the pagetable level should be
>> +     * increased, or is an address meant to have special behavior like
>> +     * invalidating the entire cache.
>> +     */
>> +    if (address > PT_LEVEL_MAX_ADDR(mode - 1)) {
>> +        /* IOVA too large for the current DTE */
>> +        return (uint64_t)-2;
>> +    }
>> +
>> +    do {
>> +        level -= 1;
>> +
>> +        /* Update the page_size */
>> +        *page_size = PTE_LEVEL_PAGE_SIZE(level);
>> +
>> +        /* Permission bits are ANDed at every level, including the DTE */
>> +        perms &= amdvi_get_perms(pte);
>> +        if (perms == IOMMU_NONE) {
>> +            return pte;
>> +        }
>> +
>> +        /* Not Present */
>> +        if (!IOMMU_PTE_PRESENT(pte)) {
>> +            return 0;
>> +        }
>> +
>> +        /* Large or Leaf PTE found */
>> +        if (PTE_NEXT_LEVEL(pte) == 7 || PTE_NEXT_LEVEL(pte) == 0) {
>> +            /* Leaf PTE found */
>> +            break;
>> +        }
>> +
>> +        /*
>> +         * Index the pgtable using the IOVA bits corresponding to current level
>> +         * and walk down to the lower level.
>> +         */
>> +        pte_addr = NEXT_PTE_ADDR(pte, level, address);
>> +        pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
>> +
>> +        if (pte == (uint64_t)-1) {
>> +            /*
>> +             * A returned PTE of -1 indicates a failure to read the page table
>> +             * entry from guest memory.
>> +             */
>> +            if (level == mode - 1) {
>> +                /* Failure to retrieve the Page Table from Root Pointer */
>> +                *page_size = 0;
>> +                return (uint64_t)-2;
>> +            } else {
>> +                /* Failure to read PTE. Page walk skips a page_size chunk */
>> +                return pte;
>> +            }
>> +        }
>> +    } while (level > 0);
>> +
>> +    /*
>> +     * Page walk ends when Next Level field on PTE shows that either a leaf PTE
>> +     * or a series of large PTEs have been reached. In the latter case, return
>> +     * the pointer to the first PTE of the series.
>> +     */
>> +    assert(level == 0 || PTE_NEXT_LEVEL(pte) == 0 || PTE_NEXT_LEVEL(pte) == 7);
>> +
>> +    /*
>> +     * In case the range starts in the middle of a contiguous page, need to
>> +     * return the first PTE
>> +     */
>> +    if (PTE_NEXT_LEVEL(pte) == 7) {
>> +        /* Update page_size with the large PTE page size */
>> +        *page_size = large_pte_page_size(pte);
>> +    }
>> +
>> +    return pte;
>> +}
>> +
>>    /*
>>     * Toggle between address translation and passthrough modes by enabling the
>>     * corresponding memory regions.
>> diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
>> index c89e7dc9947d..fc4d2f7a4575 100644
>> --- a/hw/i386/amd_iommu.h
>> +++ b/hw/i386/amd_iommu.h
>> @@ -25,6 +25,8 @@
>>    #include "hw/i386/x86-iommu.h"
>>    #include "qom/object.h"
>>
>> +#define GENMASK64(h, l)  (((~0ULL) >> (63 - (h) + (l))) << (l))
>> +
>>    /* Capability registers */
>>    #define AMDVI_CAPAB_BAR_LOW           0x04
>>    #define AMDVI_CAPAB_BAR_HIGH          0x08
>> @@ -174,6 +176,46 @@
>>    #define AMDVI_GATS_MODE                 (2ULL <<  12)
>>    #define AMDVI_HATS_MODE                 (2ULL <<  10)
>>
>> +/* Page Table format */
>> +
>> +#define AMDVI_PTE_PR                    (1ULL << 0)
>> +#define AMDVI_PTE_NEXT_LEVEL_MASK       GENMASK64(11, 9)
>> +
>> +#define IOMMU_PTE_PRESENT(pte)          ((pte) & AMDVI_PTE_PR)
>> +
>> +/* Using level=0 for leaf PTE at 4K page size */
>> +#define PT_LEVEL_SHIFT(level)           (12 + ((level) * 9))
>> +
>> +/* Return IOVA bit group used to index the Page Table at specific level */
>> +#define PT_LEVEL_INDEX(level, iova)     (((iova) >> PT_LEVEL_SHIFT(level)) & \
>> +                                        GENMASK64(8, 0))
>> +
>> +/* Return the max address for a specified level i.e. max_oaddr */
>> +#define PT_LEVEL_MAX_ADDR(x)    (((x) < 5) ? \
>> +                                ((1ULL << PT_LEVEL_SHIFT((x + 1))) - 1) : \
>> +                                (~(0ULL)))
>> +
>> +/* Extract the NextLevel field from PTE/PDE */
>> +#define PTE_NEXT_LEVEL(pte)     (((pte) & AMDVI_PTE_NEXT_LEVEL_MASK) >> 9)
>> +
>> +/* Take page table level and return default pagetable size for level */
>> +#define PTE_LEVEL_PAGE_SIZE(level)      (1ULL << (PT_LEVEL_SHIFT(level)))
>> +
>> +/*
>> + * Return address of lower level page table encoded in PTE and specified by
>> + * current level and corresponding IOVA bit group at such level.
>> + */
>> +#define NEXT_PTE_ADDR(pte, level, iova) (((pte) & AMDVI_DEV_PT_ROOT_MASK) + \
>> +                                        (PT_LEVEL_INDEX(level, iova) * 8))
>> +
>> +/*
>> + * Take a PTE value with mode=0x07 and return the page size it encodes.
>> + */
>> +#define PTE_LARGE_PAGE_SIZE(pte)    (1ULL << (1 + cto64(((pte) | 0xfffULL))))
>> +
>> +/* Return number of PTEs to use for a given page size (expected power of 2) */
>> +#define PAGE_SIZE_PTE_COUNT(pgsz)       (1ULL << ((ctz64(pgsz) - 12) % 9))
>> +
>>    /* IOTLB */
>>    #define AMDVI_IOTLB_MAX_SIZE 1024
>>    #define AMDVI_DEVID_SHIFT    36
>> --
>> 2.43.5
>>
>>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 10/18] amd_iommu: Add a page walker to sync shadow page tables on invalidation
  2025-04-17 15:14   ` Ethan MILON
@ 2025-04-17 15:45     ` Alejandro Jimenez
  0 siblings, 0 replies; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-17 15:45 UTC (permalink / raw)
  To: Ethan MILON, qemu-devel@nongnu.org
  Cc: pbonzini@redhat.com, richard.henderson@linaro.org,
	eduardo@habkost.net, peterx@redhat.com, david@redhat.com,
	philmd@linaro.org, mst@redhat.com, marcel.apfelbaum@gmail.com,
	alex.williamson@redhat.com, vasant.hegde@amd.com,
	suravee.suthikulpanit@amd.com, santosh.shukla@amd.com,
	sarunkod@amd.com, Wei.Huang2@amd.com, joao.m.martins@oracle.com,
	boris.ostrovsky@oracle.com



On 4/17/25 11:14 AM, Ethan MILON wrote:
> Hi,
> 
> On 4/13/25 10:02 PM, Alejandro Jimenez wrote:
>> For the specified address range, walk the page table identifying regions
>> as mapped or unmapped and invoke registered notifiers with the
>> corresponding event type.
>>
>> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
>> ---
>>   hw/i386/amd_iommu.c | 74 +++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 74 insertions(+)
>>
>> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
>> index d089fdc28ef1..6789e1e9b688 100644
>> --- a/hw/i386/amd_iommu.c
>> +++ b/hw/i386/amd_iommu.c
>> @@ -1688,6 +1688,80 @@ fetch_pte(AMDVIAddressSpace *as, const hwaddr address, uint64_t dte,
>>       return pte;
>>   }
>>   
>> +/*
>> + * Walk the guest page table for an IOVA and range and signal the registered
>> + * notifiers to sync the shadow page tables in the host.
>> + * Must be called with a valid DTE for DMA remapping i.e. V=1,TV=1
>> + */
>> +static void __attribute__((unused))
>> +amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as, uint64_t *dte,
>> +                                   hwaddr addr, uint64_t size, bool send_unmap)
>> +{
>> +    IOMMUTLBEvent event;
>> +
>> +    hwaddr iova_next, page_mask, pagesize;
>> +    hwaddr iova = addr;
>> +    hwaddr end = iova + size - 1;
>> +
>> +    uint64_t pte;
>> +
>> +    while (iova < end) {
>> +
>> +        pte = fetch_pte(as, iova, dte[0], &pagesize);
>> +
>> +        if (pte == (uint64_t)-2) {
>> +            /*
>> +             * Invalid conditions such as the IOVA being larger than supported
>> +             * by current page table mode as configured in the DTE, or a failure
>> +             * to fetch the Page Table from the Page Table Root Pointer in DTE.
>> +             */
>> +            assert(pagesize == 0);
>> +            return;
>> +        }
>> +        /* PTE has been validated for major errors and pagesize is set */
>> +        assert(pagesize);
>> +        page_mask = ~(pagesize - 1);
>> +        iova_next = (iova & page_mask) + pagesize;
>> +
>> +        if (pte == (uint64_t)-1) {
>> +            /*
>> +             * Failure to read PTE from memory, the pagesize matches the current
>> +             * level. Unable to determine the region type, so a safe strategy is
>> +             * to skip the range and continue the page walk.
>> +             */
>> +            goto next;
>> +        }
>> +
>> +        event.entry.target_as = &address_space_memory;
>> +        event.entry.iova = iova & page_mask;
>> +        /* translated_addr is irrelevant for the unmap case */
>> +        event.entry.translated_addr = (pte & AMDVI_DEV_PT_ROOT_MASK) &
>> +                                      page_mask;
>> +        event.entry.addr_mask = ~page_mask;
>> +        event.entry.perm = amdvi_get_perms(pte);
> 
> Is it possible for the DTE permissions to be more restrictive than the
> permissions of the fetched PTE?

No. My understanding of the documentation is that permissions can only 
get more restrictive as you go down the page walk, because they are 
logically ANDed with the permissions of the levels above (including the 
DTE). This is more or less verbatim what the spec says in Table 17: I/O 
Page Translation Entry (PTE) Fields, PR=1.

More details:
I haven't found any place where the Linux driver modifies intermediate 
permissions. As far as I can tell, alloc_pte() will create all the PDEs 
with RW permissions, and only applies the permissions/prot requested in 
map_pages() to the leaf PTE. So the effective permissions during a page 
walk are really determined by the leaf PTE.

The above is why my initial prototype didn't bother to check the 
intermediate permissions in fetch_pte() and only checked the returned 
PTE. But I had to implement the intermediate checks, since this code is 
emulating a hardware page walk and has to comply with the specification.

Thank you,
Alejandro

> 
>> +
>> +        /*
>> +         * In cases where the leaf PTE is not found, or it has invalid
>> +         * permissions, an UNMAP type notification is sent, but only if the
>> +         * caller requested it.
>> +         */
>> +        if (!IOMMU_PTE_PRESENT(pte) || (event.entry.perm == IOMMU_NONE)) {
>> +            if (!send_unmap) {
>> +                goto next;
>> +            }
>> +            event.type = IOMMU_NOTIFIER_UNMAP;
>> +        } else {
>> +            event.type = IOMMU_NOTIFIER_MAP;
>> +        }
>> +
>> +        /* Invoke the notifiers registered for this address space */
>> +        memory_region_notify_iommu(&as->iommu, 0, event);
>> +
>> +next:
>> +        iova = iova_next;
>> +    }
>> +}
>> +
>>   /*
>>    * Toggle between address translation and passthrough modes by enabling the
>>    * corresponding memory regions.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 03/18] amd_iommu: Add support for IOMMU notifier
  2025-04-17 10:19       ` Sairaj Kodilkar
@ 2025-04-17 16:21         ` Alejandro Jimenez
  2025-04-17 16:34           ` Michael S. Tsirkin
  2025-04-18  6:33           ` Sairaj Kodilkar
  0 siblings, 2 replies; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-17 16:21 UTC (permalink / raw)
  To: Sairaj Kodilkar, qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, Wei.Huang2, joao.m.martins,
	boris.ostrovsky



On 4/17/25 6:19 AM, Sairaj Kodilkar wrote:
> 
> 
> On 4/17/2025 3:47 AM, Alejandro Jimenez wrote:
>>
>>
>> On 4/16/25 8:14 AM, Sairaj Kodilkar wrote:
>>


>>>> +
>>>> +    /* DMA address translation */
>>>> +    bool dma_remap;
>>>
>>> I think you should use this flag in the remapping path as well.
>>> I am aware that you are using it later in this series to switch the
>>> address space, but the current patch will make things inconsistent for
>>> emulated and VFIO devices (possibly breaking bisect).
>>
>> The change in behavior happens only if the user explicitly sets the 
>> dma-remap=on property on the command line, which is why I made it off 
>> by default.
>>
>> To eliminate the possibility of using dma-remap=on before all the 
>> infrastructure to support it is in place, I will move this patch and 
>> [5/18] to the end of the series. Does that address your concern?
>>
> 
> Yep, that works for me. It's always better to introduce flags at the end
> of the series when we have the infrastructure ready!
> 
>>>

>>
>>>
>>> Also, is it a good idea to have the default value dma_remap=false?
>>> Consider a guest which is not aware of the HATDis bit. Things will 
>>> break if such a guest boots with iommu.passthrough=0 (recreating the 
>>> pt=on scenario).
>>
>> That is not new; that is the current behavior that this series is 
>> trying to fix by adding the missing functionality.
>>
>> As for the default value of the dma-remap property, I think it must be 
>> set to 0/false (i.e. current behavior unchanged) until we deem the DMA 
>> remapping feature stable enough to be made available to guests.
>> On that topic, maybe it should be an experimental feature for now, i.e. 
>> "x-dma-remap".
>>
>>
> 
>> But the current behaviour for emulated (virtio) devices is to have 
>> dma-remapping on by default... I still think it's better to have this 
>> flag on.
>> Honestly, I am confused here...

The dma-remap property is meant to be the global on/off switch that 
controls whether any/all the code changes in this series have any 
effect. It is off by default, so the new code doesn't run and behavior 
stays the same.
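
As a toy model of that switch (a reduced standalone sketch of the
condition used in amdvi_switch_address_space() from patch 05):

    #include <stdbool.h>
    #include <stdio.h>

    /* Both the global property and the per-address-space translation
     * mode must be set before the translating region is enabled. */
    static const char *active_region(bool dma_remap, bool addr_translation)
    {
        return (dma_remap && addr_translation) ? "iommu (translating)"
                                               : "iommu_nodma (passthrough)";
    }

    int main(void)
    {
        printf("%s\n", active_region(false, true)); /* off: unchanged  */
        printf("%s\n", active_region(true, true));  /* opted in: remap */
        return 0;
    }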

This example case you mention:

 > guest boots with iommu.passthrough=0

is already broken today if you are using the AMD vIOMMU with VFIO 
devices. We cannot stop guests from doing invalid things if they choose 
to do so. Following up from what you said above:

 > It's always better to introduce flags at the end
 > of the series when we have the infrastructure ready!

and we only enable new features by default once we have sufficiently 
verified them; otherwise we risk regressions when launching guests that 
have not changed their configuration (i.e. have not explicitly opted in 
to the new feature) but are now running the new code/feature.

Perhaps I missed a different scenario that you are warning about, in 
which case please expand...

Alejandro

> 
> Regards
> Sairaj Kodilkar
> 




^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 03/18] amd_iommu: Add support for IOMMU notifier
  2025-04-17 16:21         ` Alejandro Jimenez
@ 2025-04-17 16:34           ` Michael S. Tsirkin
  2025-04-18  6:33           ` Sairaj Kodilkar
  1 sibling, 0 replies; 49+ messages in thread
From: Michael S. Tsirkin @ 2025-04-17 16:34 UTC (permalink / raw)
  To: Alejandro Jimenez
  Cc: Sairaj Kodilkar, qemu-devel, pbonzini, richard.henderson, eduardo,
	peterx, david, philmd, marcel.apfelbaum, alex.williamson,
	vasant.hegde, suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
	joao.m.martins, boris.ostrovsky

On Thu, Apr 17, 2025 at 12:21:57PM -0400, Alejandro Jimenez wrote:
> > guest boots with iommu.passthrough=0
> 
> is already broken today if you are using the AMD vIOMMU with VFIO devices.
> We cannot stop guests from doing invalid things if they choose to do so.

Yea no need for an "unbreak things" flag - just unbreak them.

-- 
MST



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/18] amd_iommu: Add helpers to walk AMD v1 Page Table format
  2025-04-17 15:27     ` Alejandro Jimenez
@ 2025-04-18  5:30       ` CLEMENT MATHIEU--DRIF
  2025-04-23  6:28         ` Sairaj Kodilkar
  0 siblings, 1 reply; 49+ messages in thread
From: CLEMENT MATHIEU--DRIF @ 2025-04-18  5:30 UTC (permalink / raw)
  To: Alejandro Jimenez, qemu-devel@nongnu.org
  Cc: pbonzini@redhat.com, richard.henderson@linaro.org,
	eduardo@habkost.net, peterx@redhat.com, david@redhat.com,
	philmd@linaro.org, mst@redhat.com, marcel.apfelbaum@gmail.com,
	alex.williamson@redhat.com, vasant.hegde@amd.com,
	suravee.suthikulpanit@amd.com, santosh.shukla@amd.com,
	sarunkod@amd.com, Wei.Huang2@amd.com, joao.m.martins@oracle.com,
	boris.ostrovsky@oracle.com



On 17/04/2025 5:27 pm, Alejandro Jimenez wrote:
> 
> 
> On 4/17/25 8:40 AM, CLEMENT MATHIEU--DRIF wrote:
>>
>>
>> On 14/04/2025 4:02 am, Alejandro Jimenez wrote:
>>>
>>>
>>> The current amdvi_page_walk() is designed to be called by the replay()
>>> method. Rather than drastically altering it, introduce helpers to fetch
>>> guest PTEs that will be used by a page walker implementation.
>>>
>>> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
>>> ---
>>>    hw/i386/amd_iommu.c | 125 ++++++++++++++++++++++++++++++++++++++++++++
>>>    hw/i386/amd_iommu.h |  42 +++++++++++++++
>>>    2 files changed, 167 insertions(+)
>>>
>>> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
>>> index 0af873b66a31..d089fdc28ef1 100644
>>> --- a/hw/i386/amd_iommu.c
>>> +++ b/hw/i386/amd_iommu.c
>>> @@ -1563,6 +1563,131 @@ static const MemoryRegionOps amdvi_ir_ops = {
>>>        }
>>>    };
>>>
>>> +/*
>>> + * For a PTE encoding a large page, return the page size it encodes 
>>> as described
>>> + * by the AMD IOMMU Specification Table 14: Example Page Size 
>>> Encodings.
>>> + * No need to adjust the value of the PTE to point to the first PTE 
>>> in the large
>>> + * page since the encoding guarantees all "base" PTEs in the large 
>>> page are the
>>> + * same.
>>> + */
>>> +static uint64_t large_pte_page_size(uint64_t pte)
>>> +{
>>> +    assert(PTE_NEXT_LEVEL(pte) == 7);
>>> +
>>> +    /* Determine size of the large/contiguous page encoded in the 
>>> PTE */
>>> +    return PTE_LARGE_PAGE_SIZE(pte);
>>> +}
>>> +
>>> +/*
>>> + * Helper function to fetch a PTE using AMD v1 pgtable format.
>>> + * Returns:
>>> + * -2:  The Page Table Root could not be read from DTE, or IOVA is 
>>> larger than
>>> + *      supported by current page table level encodedin DTE[Mode].
>>> + * -1:  PTE could not be read from guest memory during a page table 
>>> walk.
>>> + *      This means that the DTE has valid data, and one of the lower 
>>> level
>>> + *      entries in the Page Table could not be read.
>>> + *  0:  PTE is marked not present, or entry is 0.
>>> + * >0:  Leaf PTE value resolved from walking Guest IO Page Table.
>>> + */
>>
>> This seems to be a bit error-prone, as any statement like "if (pte < 0)"
>> might be completely removed by the compiler during optimization phases.
> 
> Yes, caller(s) of fetch_pte() must cast to uint64_t in any comparison to
> check for error values. Like it is the case in many of the patches, I am
> following the examples from the VTD implementations where they do the
> same thing in vtd_iova_to_slpte() to validate the return of vtd_get_pte().

Yes, I know :)

Note that VT-d only has one potential error code (-1), which seems 
easier to handle at the call site.

> 
> When using fetch_pte() again in patch [17/18] I considered adding a
> helper to check if fetch_pte() returned a valid PTE, but seemed
> unnecessary given that there are only two errors to be checked.
> 
> Another choice was to make fetch_pte() return an int so the error check
> could be done simply via (pte < 0), since bit 63 is either Reserved
> (DTE) or Ignored (PDE/PTE), but that seemed more prone to confusion
> since you'd expect a PTE to be a 64-bit long value. Though I am aware
> the way error return/checking is implemented essentially relies on that
> behavior.
> 
>> If you want to reuse such "high" values, defines could help.
> 
> Sorry, I don't follow. Do you mean using defines as in still returning a
> uint64_t but giving -1 and -2 special definitions? That might make the
> code a somewhat more readable when checking the error values, but still
> requires casting to uint64_t on the check, and doesn't solve the problem
> of a careless caller using (pte < 0) to check for errors...

Yes, I think that would be more readable.
When using defines, the caller no longer needs to be aware of the fact 
that the value has been cast from a negative number, which reduces the 
risk of writing things like (pte < 0).

I prefer the out-parameter solution, but let's see what other reviews say.

Thanks for this patch set :)

> 
>> Otherwise, pte could be an out parameter.
> 
> In general, I think we have to accept the caveat that callers of
> fetch_pte() must have some implementation specific knowledge to know
> they cannot check for errors using (pte < 0). Maybe with the aid of a
> strongly worded warning on the function header comment...
> 
> But if that argument doesn't convince you, and none of the alternatives
> above seem better, then I am leaning towards using the out parameter
> approach.
> 
> Thank you for the feedback.
> Alejandro
> 
>>
>>> +static uint64_t __attribute__((unused))
>>> +fetch_pte(AMDVIAddressSpace *as, const hwaddr address, uint64_t dte,
>>> +          hwaddr *page_size)
>>> +{
>>> +    IOMMUAccessFlags perms = amdvi_get_perms(dte);
>>> +
>>> +    uint8_t level, mode;
>>> +    uint64_t pte = dte, pte_addr;
>>> +
>>> +    *page_size = 0;
>>> +
>>> +    if (perms == IOMMU_NONE) {
>>> +        return (uint64_t)-2;
>>> +    }
>>> +
>>> +    /*
>>> +     * The Linux kernel driver initializes the default mode to 3, 
>>> corresponding
>>> +     * to a 39-bit GPA space, where each entry in the pagetable 
>>> translates to a
>>> +     * 1GB (2^30) page size.
>>> +     */
>>> +    level = mode = get_pte_translation_mode(dte);
>>> +    assert(mode > 0 && mode < 7);
>>> +
>>> +    /*
>>> +     * If IOVA is larger than the max supported by the current 
>>> pgtable level,
>>> +     * there is nothing to do. This signals that the pagetable level 
>>> should be
>>> +     * increased, or is an address meant to have special behavior like
>>> +     * invalidating the entire cache.
>>> +     */
>>> +    if (address > PT_LEVEL_MAX_ADDR(mode - 1)) {
>>> +        /* IOVA too large for the current DTE */
>>> +        return (uint64_t)-2;
>>> +    }
>>> +
>>> +    do {
>>> +        level -= 1;
>>> +
>>> +        /* Update the page_size */
>>> +        *page_size = PTE_LEVEL_PAGE_SIZE(level);
>>> +
>>> +        /* Permission bits are ANDed at every level, including the 
>>> DTE */
>>> +        perms &= amdvi_get_perms(pte);
>>> +        if (perms == IOMMU_NONE) {
>>> +            return pte;
>>> +        }
>>> +
>>> +        /* Not Present */
>>> +        if (!IOMMU_PTE_PRESENT(pte)) {
>>> +            return 0;
>>> +        }
>>> +
>>> +        /* Large or Leaf PTE found */
>>> +        if (PTE_NEXT_LEVEL(pte) == 7 || PTE_NEXT_LEVEL(pte) == 0) {
>>> +            /* Leaf PTE found */
>>> +            break;
>>> +        }
>>> +
>>> +        /*
>>> +         * Index the pgtable using the IOVA bits corresponding to 
>>> current level
>>> +         * and walk down to the lower level.
>>> +         */
>>> +        pte_addr = NEXT_PTE_ADDR(pte, level, address);
>>> +        pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
>>> +
>>> +        if (pte == (uint64_t)-1) {
>>> +            /*
>>> +             * A returned PTE of -1 indicates a failure to read the 
>>> page table
>>> +             * entry from guest memory.
>>> +             */
>>> +            if (level == mode - 1) {
>>> +                /* Failure to retrieve the Page Table from Root 
>>> Pointer */
>>> +                *page_size = 0;
>>> +                return (uint64_t)-2;
>>> +            } else {
>>> +                /* Failure to read PTE. Page walk skips a page_size 
>>> chunk */
>>> +                return pte;
>>> +            }
>>> +        }
>>> +    } while (level > 0);
>>> +
>>> +    /*
>>> +     * Page walk ends when Next Level field on PTE shows that either 
>>> a leaf PTE
>>> +     * or a series of large PTEs have been reached. In the latter 
>>> case, return
>>> +     * the pointer to the first PTE of the series.
>>> +     */
>>> +    assert(level == 0 || PTE_NEXT_LEVEL(pte) == 0 || 
>>> PTE_NEXT_LEVEL(pte) == 7);
>>> +
>>> +    /*
>>> +     * In case the range starts in the middle of a contiguous page, 
>>> need to
>>> +     * return the first PTE
>>> +     */
>>> +    if (PTE_NEXT_LEVEL(pte) == 7) {
>>> +        /* Update page_size with the large PTE page size */
>>> +        *page_size = large_pte_page_size(pte);
>>> +    }
>>> +
>>> +    return pte;
>>> +}
>>> +
>>>    /*
>>>     * Toggle between address translation and passthrough modes by 
>>> enabling the
>>>     * corresponding memory regions.
>>> diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
>>> index c89e7dc9947d..fc4d2f7a4575 100644
>>> --- a/hw/i386/amd_iommu.h
>>> +++ b/hw/i386/amd_iommu.h
>>> @@ -25,6 +25,8 @@
>>>    #include "hw/i386/x86-iommu.h"
>>>    #include "qom/object.h"
>>>
>>> +#define GENMASK64(h, l)  (((~0ULL) >> (63 - (h) + (l))) << (l))
>>> +
>>>    /* Capability registers */
>>>    #define AMDVI_CAPAB_BAR_LOW           0x04
>>>    #define AMDVI_CAPAB_BAR_HIGH          0x08
>>> @@ -174,6 +176,46 @@
>>>    #define AMDVI_GATS_MODE                 (2ULL <<  12)
>>>    #define AMDVI_HATS_MODE                 (2ULL <<  10)
>>>
>>> +/* Page Table format */
>>> +
>>> +#define AMDVI_PTE_PR                    (1ULL << 0)
>>> +#define AMDVI_PTE_NEXT_LEVEL_MASK       GENMASK64(11, 9)
>>> +
>>> +#define IOMMU_PTE_PRESENT(pte)          ((pte) & AMDVI_PTE_PR)
>>> +
>>> +/* Using level=0 for leaf PTE at 4K page size */
>>> +#define PT_LEVEL_SHIFT(level)           (12 + ((level) * 9))
>>> +
>>> +/* Return IOVA bit group used to index the Page Table at specific 
>>> level */
>>> +#define PT_LEVEL_INDEX(level, iova)     (((iova) >> 
>>> PT_LEVEL_SHIFT(level)) & \
>>> +                                        GENMASK64(8, 0))
>>> +
>>> +/* Return the max address for a specified level i.e. max_oaddr */
>>> +#define PT_LEVEL_MAX_ADDR(x)    (((x) < 5) ? \
>>> +                                ((1ULL << PT_LEVEL_SHIFT((x + 1))) - 
>>> 1) : \
>>> +                                (~(0ULL)))
>>> +
>>> +/* Extract the NextLevel field from PTE/PDE */
>>> +#define PTE_NEXT_LEVEL(pte)     (((pte) & AMDVI_PTE_NEXT_LEVEL_MASK) >> 9)
>>> +
>>> +/* Take page table level and return default pagetable size for level */
>>> +#define PTE_LEVEL_PAGE_SIZE(level)      (1ULL << 
>>> (PT_LEVEL_SHIFT(level)))
>>> +
>>> +/*
>>> + * Return address of lower level page table encoded in PTE and 
>>> specified by
>>> + * current level and corresponding IOVA bit group at such level.
>>> + */
>>> +#define NEXT_PTE_ADDR(pte, level, iova) (((pte) & 
>>> AMDVI_DEV_PT_ROOT_MASK) + \
>>> +                                        (PT_LEVEL_INDEX(level, iova) 
>>> * 8))
>>> +
>>> +/*
>>> + * Take a PTE value with mode=0x07 and return the page size it encodes.
>>> + */
>>> +#define PTE_LARGE_PAGE_SIZE(pte)    (1ULL << (1 + cto64(((pte) | 
>>> 0xfffULL))))
>>> +
>>> +/* Return number of PTEs to use for a given page size (expected 
>>> power of 2) */
>>> +#define PAGE_SIZE_PTE_COUNT(pgsz)       (1ULL << ((ctz64(pgsz) - 12) 
>>> % 9))
>>> +
>>>    /* IOTLB */
>>>    #define AMDVI_IOTLB_MAX_SIZE 1024
>>>    #define AMDVI_DEVID_SHIFT    36
>>> -- 
>>> 2.43.5
>>>
>>>
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 03/18] amd_iommu: Add support for IOMMU notifier
  2025-04-17 16:21         ` Alejandro Jimenez
  2025-04-17 16:34           ` Michael S. Tsirkin
@ 2025-04-18  6:33           ` Sairaj Kodilkar
  1 sibling, 0 replies; 49+ messages in thread
From: Sairaj Kodilkar @ 2025-04-18  6:33 UTC (permalink / raw)
  To: Alejandro Jimenez, qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, Wei.Huang2, joao.m.martins,
	boris.ostrovsky



On 4/17/2025 9:51 PM, Alejandro Jimenez wrote:
> 
> 
> On 4/17/25 6:19 AM, Sairaj Kodilkar wrote:
>>
>>
>> On 4/17/2025 3:47 AM, Alejandro Jimenez wrote:
>>>
>>>
>>> On 4/16/25 8:14 AM, Sairaj Kodilkar wrote:
>>>
> 
> 
>>>>> +
>>>>> +    /* DMA address translation */
>>>>> +    bool dma_remap;
>>>>
>>>> I think you should use this flag in the remapping path as well.
>>>> I am aware that you are using it later in this series to switch the
>>>> address space, but the current patch will make things inconsistent for
>>>> emulated and VFIO devices (possibly breaking bisect).
>>>
>>> The change in behavior happens only if the user explicitly sets the 
>>> dma-remap=on property on the command line, which is why I made it off 
>>> by default.
>>>
>>> To eliminate the possibility of using dma-remap=on before all the 
>>> infrastructure to support it is in place, I will move this patch and 
>>> [5/18] to the end of the series. Does that address your concern?
>>>
>>
>> Yep, that works for me. It's always better to introduce flags at the end
>> of the series when we have the infrastructure ready!
>>
>>>>
> 
>>>
>>>>
>>>> Also, is it a good idea to have the default value dma_remap=false?
>>>> Consider a guest which is not aware of the HATDis bit. Things will 
>>>> break if such a guest boots with iommu.passthrough=0 (recreating the 
>>>> pt=on scenario).
>>>
>>> That is not new; that is the current behavior that this series is 
>>> trying to fix by adding the missing functionality.
>>>
>>> As for the default value of the dma-remap property, I think it must 
>>> be set to 0/false (i.e. current behavior unchanged) until we deem the 
>>> DMA remapping feature stable enough to be made available to guests.
>>> On that topic, maybe it should be an experimental feature for now, 
>>> i.e. "x-dma-remap".
>>>
>>>
>>
>> But the current behaviour for emulated (virtio) devices is to have 
>> dma-remapping on by default... I still think it's better to have this 
>> flag on.
>> Honestly, I am confused here...
> 
> The dma-remap property is meant to be the global on/off switch that 
> controls whether any/all the code changes in this series have any 
> effect. It is off by default, so the new code doesn't run and behavior 
> stays the same.
> 
> This example case you mention:
> 
>  > guest boots with iommu.passthrough=0
> 
> is already broken today if you are using the AMD vIOMMU with VFIO 
> devices. We cannot stop guests from doing invalid things if they choose 
> to do so. Following up from what you said above:
> 
>  > It's always better to introduce flags at the end
>  > of the series when we have the infrastructure ready!
> 
> and we only enable new features by default once we have sufficiently 
> verified them; otherwise we risk regressions when launching guests that 
> have not changed their configuration (i.e. have not explicitly opted in 
> to the new feature) but are now running the new code/feature.
> 
> Perhaps I missed a different scenario that you are warning about, in 
> which case please expand...
>

Understood. Thanks for the explanation and sorry for the confusion.

Regards
Sairaj Kodilkar
> Alejandro
> 
>>
>> Regards
>> Sairaj Kodilkar
>>
> 
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 05/18] amd_iommu: Toggle memory regions based on address translation mode
  2025-04-14  2:02 ` [PATCH 05/18] amd_iommu: Toggle memory regions based on address translation mode Alejandro Jimenez
@ 2025-04-22 12:17   ` Sairaj Kodilkar
  2025-04-28 21:10     ` Alejandro Jimenez
  0 siblings, 1 reply; 49+ messages in thread
From: Sairaj Kodilkar @ 2025-04-22 12:17 UTC (permalink / raw)
  To: Alejandro Jimenez, qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, Wei.Huang2, joao.m.martins,
	boris.ostrovsky



On 4/14/2025 7:32 AM, Alejandro Jimenez wrote:
> Enable the appropriate memory region for an address space depending on
> the address translation mode selected for it. This is currently based on
> a generic x86 IOMMMU property, and only done during the address space
> initialization. Extract the code into a helper and toggle the regions
> based on whether DMA remapping is available as a global capability, and
> if the specific address space is using address translation.
> 
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> ---
>   hw/i386/amd_iommu.c | 30 ++++++++++++++++++++----------
>   1 file changed, 20 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index ad5869e72fdc..3f9aa2cc8d31 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -1529,13 +1529,31 @@ static const MemoryRegionOps amdvi_ir_ops = {
>       }
>   };
>   
> +/*
> + * Toggle between address translation and passthrough modes by enabling the
> + * corresponding memory regions.
> + */
> +static void amdvi_switch_address_space(AMDVIAddressSpace *amdvi_as)
> +{
> +    AMDVIState *s = amdvi_as->iommu_state;
> +
> +    if (s->dma_remap && amdvi_as->addr_translation) {

Hi Alejandro,

I know g_new0() initializes addr_translation to 0, but should we 
explicitly initialize it to 0 in order to make it more readable?
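
For clarity, the suggestion amounts to something like this (standalone
sketch with a reduced stand-in struct; g_new0() already zeroes every
member, so the assignment is purely documentary):

    #include <glib.h>
    #include <stdbool.h>

    typedef struct {
        bool addr_translation;  /* stand-in for the AMDVIAddressSpace field */
    } AMDVIAddressSpaceStub;

    int main(void)
    {
        AMDVIAddressSpaceStub *as = g_new0(AMDVIAddressSpaceStub, 1);

        /* redundant with g_new0(), but states the initial mode explicitly */
        as->addr_translation = false;

        g_free(as);
        return 0;
    }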

Regards
Sairaj Kodilkar

> +        /* Enabling DMA region */
> +        memory_region_set_enabled(&amdvi_as->iommu_nodma, false);
> +        memory_region_set_enabled(MEMORY_REGION(&amdvi_as->iommu), true);
> +    } else {
> +        /* Disabling DMA region, using passthrough */
> +        memory_region_set_enabled(MEMORY_REGION(&amdvi_as->iommu), false);
> +        memory_region_set_enabled(&amdvi_as->iommu_nodma, true);
> +    }
> +}
> +
>   static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>   {
>       char name[128];
>       AMDVIState *s = opaque;
>       AMDVIAddressSpace **iommu_as, *amdvi_dev_as;
>       int bus_num = pci_bus_num(bus);
> -    X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
>   
>       iommu_as = s->address_spaces[bus_num];
>   
> @@ -1595,15 +1613,7 @@ static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>                                               AMDVI_INT_ADDR_FIRST,
>                                               &amdvi_dev_as->iommu_ir, 1);
>   
> -        if (!x86_iommu->pt_supported) {
> -            memory_region_set_enabled(&amdvi_dev_as->iommu_nodma, false);
> -            memory_region_set_enabled(MEMORY_REGION(&amdvi_dev_as->iommu),
> -                                      true);
> -        } else {
> -            memory_region_set_enabled(MEMORY_REGION(&amdvi_dev_as->iommu),
> -                                      false);
> -            memory_region_set_enabled(&amdvi_dev_as->iommu_nodma, true);
> -        }
> +        amdvi_switch_address_space(amdvi_dev_as);
>       }
>       return &iommu_as[devfn]->as;
>   }



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 08/18] amd_iommu: Helper to decode size of page invalidation command
  2025-04-14  2:02 ` [PATCH 08/18] amd_iommu: Helper to decode size of page invalidation command Alejandro Jimenez
@ 2025-04-22 12:26   ` Sairaj Kodilkar
  2025-04-28 21:16     ` Alejandro Jimenez
  0 siblings, 1 reply; 49+ messages in thread
From: Sairaj Kodilkar @ 2025-04-22 12:26 UTC (permalink / raw)
  To: Alejandro Jimenez, qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, Wei.Huang2, joao.m.martins,
	boris.ostrovsky



On 4/14/2025 7:32 AM, Alejandro Jimenez wrote:
> The size of the region to invalidate depends on the S bit and address
> encoded in the command. Add a helper to extract this information, which
> will be used to sync shadow page tables in upcoming changes.
> 
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> ---
>   hw/i386/amd_iommu.c | 34 ++++++++++++++++++++++++++++++++++
>   hw/i386/amd_iommu.h |  4 ++++
>   2 files changed, 38 insertions(+)
> 
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index 5f55be1f4d36..0af873b66a31 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -481,6 +481,40 @@ static gboolean amdvi_iotlb_remove_by_domid(gpointer key, gpointer value,
>       return entry->domid == domid;
>   }
>   
> +/*
> + * Helper to decode the size of the range to invalidate encoded in the
> + * INVALIDATE_IOMMU_PAGES Command format.
> + * The size of the region to invalidate depends on the S bit and address.
> + * S bit value:
> + * 0 :  Invalidation size is 4 Kbytes.
> + * 1 :  Invalidation size is determined by first zero bit in the address
> + *      starting from Address[12].
> + *
> + * In the AMD IOMMU Linux driver, an invalidation command with address
> + * ((1 << 63) - 1) is sent when intending to clear the entire cache.
> + * However, Table 14: Example Page Size Encodings shows that an address of
> + * ((1ULL << 51) - 1) encodes the entire cache, so effectively any address with
> + * first zero at bit 51 or larger is a request to invalidate the entire address
> + * space.
> + */
> +static uint64_t __attribute__((unused))
> +amdvi_decode_invalidation_size(hwaddr addr, uint16_t flags)
> +{
> +    uint64_t size = AMDVI_PAGE_SIZE;
> +    uint8_t fzbit = 0;
> +
> +    if (flags & AMDVI_CMD_INVAL_IOMMU_PAGES_S) {
> +        fzbit = cto64(addr | 0xFFF);
> +
> +        if (fzbit >= 51 || !addr) {

I am skeptical about the condition addr == 0 (!addr)

Consider the case where the user wants to invalidate an 8K page
starting at address 0. That would cause the address field to be 0,
right? If so, we should invalidate only the 8K page, not the entire
address range.
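
For example, working through the decode above with S=1 and an address
field of 0: fzbit = cto64(0 | 0xFFF) = 12, so without the !addr special
case the size would be 1ULL << 13, i.e. an 8K invalidation starting at
IOVA 0 rather than a flush of the entire address space.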

Am I missing something here?

Regards
Sairaj
> +            size = AMDVI_INV_ALL_PAGES;
> +        } else {
> +            size = 1ULL << (fzbit + 1);
> +        }
> +    }
> +    return size;
> +}
> +
>   /* we don't have devid - we can't remove pages by address */
>   static void amdvi_inval_pages(AMDVIState *s, uint64_t *cmd)
>   {
> diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
> index e12ecade4baa..c89e7dc9947d 100644
> --- a/hw/i386/amd_iommu.h
> +++ b/hw/i386/amd_iommu.h
> @@ -123,6 +123,10 @@
>   #define AMDVI_CMD_COMPLETE_PPR_REQUEST    0x07
>   #define AMDVI_CMD_INVAL_AMDVI_ALL         0x08
>   
> +
> +#define AMDVI_CMD_INVAL_IOMMU_PAGES_S   (1ULL << 0)
> +#define AMDVI_INV_ALL_PAGES             (1ULL << 52)
> +
>   #define AMDVI_DEVTAB_ENTRY_SIZE           32
>   
>   /* Device table entry bits 0:63 */



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 11/18] amd_iommu: Sync shadow page tables on page invalidation
  2025-04-14  2:02 ` [PATCH 11/18] amd_iommu: Sync shadow page tables on page invalidation Alejandro Jimenez
  2025-04-22 12:38   ` Sairaj Kodilkar
@ 2025-04-22 12:38   ` Sairaj Kodilkar
  2025-04-29 19:47     ` Alejandro Jimenez
  1 sibling, 1 reply; 49+ messages in thread
From: Sairaj Kodilkar @ 2025-04-22 12:38 UTC (permalink / raw)
  To: Alejandro Jimenez, qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, Wei.Huang2, joao.m.martins,
	boris.ostrovsky



On 4/14/2025 7:32 AM, Alejandro Jimenez wrote:
> When the guest issues an INVALIDATE_IOMMU_PAGES command, decode the
> address and size of the invalidation and sync the guest page table state
> with the host. This requires walking the guest page table and calling
> notifiers registered for address spaces matching the domain ID encoded
> in the command.
> 
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> ---
>   hw/i386/amd_iommu.c | 110 ++++++++++++++++++++++++++++++++++++++++----
>   1 file changed, 102 insertions(+), 8 deletions(-)
> 
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index 6789e1e9b688..cf83ac607064 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -97,6 +97,9 @@ typedef enum AMDVIFaultReason {
>   } AMDVIFaultReason;
>   
>   static int amdvi_as_to_dte(AMDVIAddressSpace *as, uint64_t *dte);
> +static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
> +                                               uint64_t *dte, hwaddr addr,
> +                                               uint64_t size, bool send_unmap);
>   
>   uint64_t amdvi_extended_feature_register(AMDVIState *s)
>   {
> @@ -497,8 +500,7 @@ static gboolean amdvi_iotlb_remove_by_domid(gpointer key, gpointer value,
>    * first zero at bit 51 or larger is a request to invalidate the entire address
>    * space.
>    */
> -static uint64_t __attribute__((unused))
> -amdvi_decode_invalidation_size(hwaddr addr, uint16_t flags)
> +static uint64_t amdvi_decode_invalidation_size(hwaddr addr, uint16_t flags)
>   {
>       uint64_t size = AMDVI_PAGE_SIZE;
>       uint8_t fzbit = 0;
> @@ -515,10 +517,101 @@ amdvi_decode_invalidation_size(hwaddr addr, uint16_t flags)
>       return size;
>   }
>   
> +/*
> + * Synchronize the guest page tables with the shadow page tables kept in the
> + * host for the specified range.
> + * The invalidation command issued by the guest and intercepted by the VMM
> + * does not specify a device, but a domain, since all devices in the same domain
> + * share the same page tables. However, vIOMMU emulation creates separate
> + * address spaces per device, so it is necessary to traverse the list of all
> + * address spaces (i.e. devices) that have notifiers registered in order to
> + * propagate the changes to the host page tables. This could generate redundant
> + * requests to map/unmap regions when there are several devices in the same domain
> + * but it must be optimized by maintaining an internal representation of the
> + * per-domain address space, and avoid invoking a notifier when the change is
> + * already reflected in the host page tables.
> + *
> + * We cannot return early from this function once a matching domain has been
> + * identified and its page tables synced (based on the fact that all devices in
> + * the same domain share the page tables). The reason is that different devices
> + * (i.e. address spaces) could have different notifiers registered, and by
> + * skipping address spaces that appear later on the amdvi_as_with_notifiers list
> + * their notifiers (which could differ from the ones registered for the first
> + * device/address space) would not be invoked.
> + */
> +static void amdvi_sync_domain(AMDVIState *s, uint16_t domid, uint64_t addr,
> +                              uint16_t flags)
> +{
> +    AMDVIAddressSpace *as;
> +
> +    uint64_t size = amdvi_decode_invalidation_size(addr, flags);
> +
> +    if (size == AMDVI_INV_ALL_PAGES) {
> +        addr = 0;       /* Set start address to 0 and invalidate entire AS */
> +    } else {
> +        addr &= ~(size - 1);
> +    }
> +
> +    /*
> +     * Call notifiers that have registered for each address space matching the
> +     * domain ID, in order to sync the guest pagetable state with the host.
> +     */
> +    QLIST_FOREACH(as, &s->amdvi_as_with_notifiers, next) {
> +
> +        uint64_t dte[4] = { 0 };
> +
> +        /*
> +         * Retrieve the Device Table entry for the devid corresponding to the
> +         * current address space, and verify the DomainID matches i.e. the page
> +         * tables to be synced belong to devices in the domain.
> +         */
> +        if (amdvi_as_to_dte(as, dte)) {
> +            continue;
> +        }
> +
> +        /* Only need to sync the Page Tables for a matching domain */
> +        if (domid != (dte[1] & AMDVI_DEV_DOMID_ID_MASK)) {
> +            continue;
> +        }
> +
> +        /*
> +         * We have determined that there is a valid Device Table Entry for a
> +         * device matching the DomainID in the INV_IOMMU_PAGES command issued by
> +         * the guest. Walk the guest page table to sync shadow page table.
> +         *
> +         * An optimization can be made if only UNMAP notifiers are registered to
> +         * avoid walking the page table and just invalidate the requested range.
> +         */

Hi Alejandro,
I am not able to understand the optimization for the "UNMAP only"
notifiers. I am confused because the code will call the UNMAP notifier
even if the PTE is present (which seems wrong). Could you please
explain why we are unmapping IOVAs whose PTEs are present?

Regards
Sairaj Kodilkar

> +        if (as->notifier_flags & IOMMU_NOTIFIER_MAP) {
> +
> +            /* Sync guest IOMMU mappings with host */
> +            amdvi_sync_shadow_page_table_range(as, &dte[0], addr, size, true);
> +        } else {
> +            /*
> +             * For UNMAP only notifiers, invalidate the requested size. No page
> +             * table walk is required since there is no need to replay mappings.
> +             */
> +            IOMMUTLBEvent event = {
> +                .type = IOMMU_NOTIFIER_UNMAP,
> +                .entry = {
> +                    .target_as = &address_space_memory,
> +                    .iova = addr,
> +                    .translated_addr = 0, /* Irrelevant for unmap case */
> +                    .addr_mask = size - 1,
> +                    .perm = IOMMU_NONE,
> +                },
> +            };
> +            memory_region_notify_iommu(&as->iommu, 0, event);
> +        }
> +    }
> +}
> +
>   /* we don't have devid - we can't remove pages by address */
>   static void amdvi_inval_pages(AMDVIState *s, uint64_t *cmd)
>   {
>       uint16_t domid = cpu_to_le16((uint16_t)extract64(cmd[0], 32, 16));
> +    uint64_t addr = cpu_to_le64(extract64(cmd[1], 12, 52)) << 12;
> +    uint16_t flags = cpu_to_le16((uint16_t)extract64(cmd[1], 0, 3));
>   
>       if (extract64(cmd[0], 20, 12) || extract64(cmd[0], 48, 12) ||
>           extract64(cmd[1], 3, 9)) {
> @@ -528,6 +621,8 @@ static void amdvi_inval_pages(AMDVIState *s, uint64_t *cmd)
>   
>       g_hash_table_foreach_remove(s->iotlb, amdvi_iotlb_remove_by_domid,
>                                   &domid);
> +
> +    amdvi_sync_domain(s, domid, addr, flags);
>       trace_amdvi_pages_inval(domid);
>   }
>   
> @@ -1589,9 +1684,8 @@ static uint64_t large_pte_page_size(uint64_t pte)
>    *  0:  PTE is marked not present, or entry is 0.
>    * >0:  Leaf PTE value resolved from walking Guest IO Page Table.
>    */
> -static uint64_t __attribute__((unused))
> -fetch_pte(AMDVIAddressSpace *as, const hwaddr address, uint64_t dte,
> -          hwaddr *page_size)
> +static uint64_t fetch_pte(AMDVIAddressSpace *as, const hwaddr address,
> +                          uint64_t dte, hwaddr *page_size)
>   {
>       IOMMUAccessFlags perms = amdvi_get_perms(dte);
>   
> @@ -1693,9 +1787,9 @@ fetch_pte(AMDVIAddressSpace *as, const hwaddr address, uint64_t dte,
>    * notifiers to sync the shadow page tables in the host.
>    * Must be called with a valid DTE for DMA remapping i.e. V=1,TV=1
>    */
> -static void __attribute__((unused))
> -amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as, uint64_t *dte,
> -                                   hwaddr addr, uint64_t size, bool send_unmap)
> +static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
> +                                               uint64_t *dte, hwaddr addr,
> +                                               uint64_t size, bool send_unmap)
>   {
>       IOMMUTLBEvent event;
>   



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 14/18] amd_iommu: Toggle address translation on device table entry invalidation
  2025-04-14  2:02 ` [PATCH 14/18] amd_iommu: Toggle address translation on device table entry invalidation Alejandro Jimenez
@ 2025-04-22 12:48   ` Sairaj Kodilkar
  2025-04-29 20:45     ` Alejandro Jimenez
  0 siblings, 1 reply; 49+ messages in thread
From: Sairaj Kodilkar @ 2025-04-22 12:48 UTC (permalink / raw)
  To: Alejandro Jimenez, qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, Wei.Huang2, joao.m.martins,
	boris.ostrovsky



On 4/14/2025 7:32 AM, Alejandro Jimenez wrote:
> A guest must issue an INVALIDATE_DEVTAB_ENTRY command after changing a
> Device Table entry (DTE) e.g. after attaching a device and setting up
> its DTE. When intercepting this event, determine if the DTE has been
> configured for paging or not, and toggle the appropriate memory regions
> to allow DMA address translation for the address space if needed.
> 
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> ---
>   hw/i386/amd_iommu.c | 68 ++++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 67 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index 3bfa08419ffe..abdd67f6b12c 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -101,6 +101,8 @@ static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
>                                                  uint64_t *dte, hwaddr addr,
>                                                  uint64_t size, bool send_unmap);
>   static void amdvi_address_space_unmap(AMDVIAddressSpace *as, IOMMUNotifier *n);
> +static void amdvi_address_space_sync(AMDVIAddressSpace *as);
> +static void amdvi_switch_address_space(AMDVIAddressSpace *amdvi_as);
>   
>   uint64_t amdvi_extended_feature_register(AMDVIState *s)
>   {
> @@ -432,7 +434,15 @@ static void amdvi_completion_wait(AMDVIState *s, uint64_t *cmd)
>       trace_amdvi_completion_wait(addr, data);
>   }
>   
> -/* log error without aborting since linux seems to be using reserved bits */
> +/*
> + * A guest driver must issue the INVALIDATE_DEVTAB_ENTRY command to the IOMMU
> + * after changing a Device Table entry. We can use this fact to detect when a
> + * Device Table entry is created for a device attached to a paging domain,
> + * and enable the corresponding IOMMU memory region to allow for DMA
> + * translation if appropriate.
> + *
> + * log error without aborting since linux seems to be using reserved bits
> + */
>   static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
>   {
>       uint16_t devid = cpu_to_le16((uint16_t)extract64(cmd[0], 0, 16));
> @@ -442,6 +452,62 @@ static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
>           amdvi_log_illegalcom_error(s, extract64(cmd[0], 60, 4),
>                                      s->cmdbuf + s->cmdbuf_head);
>       }
> +
> +    /*
> +     * Convert the devid encoded in the command to a bus and devfn in
> +     * order to retrieve the corresponding address space.
> +     */
> +    uint8_t bus_num, devfn, dte_mode;
> +    AMDVIAddressSpace *as;
> +    uint64_t dte[4] = { 0 };
> +    IOMMUNotifier *n;
> +    int ret;
> +
> +    bus_num = PCI_BUS_NUM(devid);
> +    devfn = devid & 0xff;
> +
> +    /*
> +     * The main buffer of size (AMDVIAddressSpace *) * (PCI_BUS_MAX) has already
> +     * been allocated within AMDVIState, but we must be careful not to access
> +     * unallocated devfn.
> +     */
> +    if (!s->address_spaces[bus_num] || !s->address_spaces[bus_num][devfn]) {
> +        return;
> +    }
> +    as = s->address_spaces[bus_num][devfn];
> +
> +    ret = amdvi_as_to_dte(as, dte);
> +
> +    if (!ret) {
> +        dte_mode = (dte[0] >> AMDVI_DEV_MODE_RSHIFT) & AMDVI_DEV_MODE_MASK;
> +    }
> +
> +    if ((ret < 0) || (!ret && !dte_mode)) {
> +        /*
> +         * The DTE could not be retrieved, it is not valid, or it is not setup
> +         * for paging. In either case, ensure that if paging was previously in
> +         * use then switch to use the no_dma memory region, and invalidate all
> +         * existing mappings.
> +         */
> +        if (as->addr_translation) {
> +            as->addr_translation = false;
> +
> +            amdvi_switch_address_space(as);
> +
> +            IOMMU_NOTIFIER_FOREACH(n, &as->iommu) {
> +                amdvi_address_space_unmap(as, n);
> +            }

Hi,
I think amdvi_switch_address_space() should come after
amdvi_address_space_unmap(). amdvi_switch_address_space() unregisters
the VFIO notifier, so the mr->iommu_notify list is empty and we never
unmap the shadow page table.

The code works fine because vfio_iommu_map_notify() eventually maps the
entire address space, but we should keep the right ordering.
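
i.e. something like this ordering (just a sketch):

    if (as->addr_translation) {
        as->addr_translation = false;

        /* Unmap while the VFIO notifiers are still registered */
        IOMMU_NOTIFIER_FOREACH(n, &as->iommu) {
            amdvi_address_space_unmap(as, n);
        }

        amdvi_switch_address_space(as);
    }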

Regards
Sairaj Kodilkar

> +        }
> +    } else if (!as->addr_translation) {
> +        /*
> +         * Installing a DTE that enables translation where it wasn't previously
> +         * active. Activate the DMA memory region.
> +         */
> +        as->addr_translation = true;
> +        amdvi_switch_address_space(as);
> +        amdvi_address_space_sync(as);
> +    }
> +
>       trace_amdvi_devtab_inval(PCI_BUS_NUM(devid), PCI_SLOT(devid),
>                                PCI_FUNC(devid));
>   }



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 16/18] amd_iommu: Do not assume passthrough translation when DTE[TV]=0
  2025-04-14  2:02 ` [PATCH 16/18] amd_iommu: Do not assume passthrough translation when DTE[TV]=0 Alejandro Jimenez
@ 2025-04-23  6:06   ` Sairaj Kodilkar
  0 siblings, 0 replies; 49+ messages in thread
From: Sairaj Kodilkar @ 2025-04-23  6:06 UTC (permalink / raw)
  To: Alejandro Jimenez, qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, Wei.Huang2, joao.m.martins,
	boris.ostrovsky



On 4/14/2025 7:32 AM, Alejandro Jimenez wrote:
> The AMD I/O Virtualization Technology (IOMMU) Specification (see Table
> 8: V, TV, and GV Fields in Device Table Entry), specifies that a DTE
> with V=0, TV=1 does not contain valid address translation information.

This should be "V=1, TV=0".

Regards
Sairaj Kodilkar

> If a request requires a table walk, the walk is terminated when this
> condition is encountered.
> 
> Do not assume that addresses for a device with DTE[TV]=0 are passed
> through (i.e. not remapped) and instead terminate the page table walk
> early.
> 
> Fixes: d29a09ca6842 ("hw/i386: Introduce AMD IOMMU")
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> ---
>   hw/i386/amd_iommu.c | 87 +++++++++++++++++++++++++--------------------
>   1 file changed, 48 insertions(+), 39 deletions(-)
> 
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index e502bbbbb7d3..edf2935f6a83 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -1221,51 +1221,60 @@ static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
>       uint64_t pte = dte[0], pte_addr, page_mask;
>   
>       /* make sure the DTE has TV = 1 */
> -    if (pte & AMDVI_DEV_TRANSLATION_VALID) {
> -        level = get_pte_translation_mode(pte);
> -        if (level >= 7) {
> -            trace_amdvi_mode_invalid(level, addr);
> +    if (!(pte & AMDVI_DEV_TRANSLATION_VALID)) {
> +        /*
> +         * A DTE with V=1, TV=0 does not have a valid Page Table Root Pointer.
> +         * An IOMMU processing a request that requires a table walk terminates
> +         * the walk when it encounters this condition. Do the same and return
> +         * instead of assuming that the address is forwarded without translation
> +         * i.e. the passthrough case, as it is done for the case where DTE[V]=0.
> +         */
> +        return;
> +    }
> +
> +    level = get_pte_translation_mode(pte);
> +    if (level >= 7) {
> +        trace_amdvi_mode_invalid(level, addr);
> +        return;
> +    }
> +    if (level == 0) {
> +        goto no_remap;
> +    }
> +
> +    /* we are at the leaf page table or page table encodes a huge page */
> +    do {
> +        pte_perms = amdvi_get_perms(pte);
> +        present = pte & 1;
> +        if (!present || perms != (perms & pte_perms)) {
> +            amdvi_page_fault(as->iommu_state, as->devfn, addr, perms);
> +            trace_amdvi_page_fault(addr);
>               return;
>           }
> -        if (level == 0) {
> -            goto no_remap;
> +        /* go to the next lower level */
> +        pte_addr = pte & AMDVI_DEV_PT_ROOT_MASK;
> +        /* add offset and load pte */
> +        pte_addr += ((addr >> (3 + 9 * level)) & 0x1FF) << 3;
> +        pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
> +        if (!pte) {
> +            return;
>           }
> +        oldlevel = level;
> +        level = get_pte_translation_mode(pte);
> +    } while (level > 0 && level < 7);
>   
> -        /* we are at the leaf page table or page table encodes a huge page */
> -        do {
> -            pte_perms = amdvi_get_perms(pte);
> -            present = pte & 1;
> -            if (!present || perms != (perms & pte_perms)) {
> -                amdvi_page_fault(as->iommu_state, as->devfn, addr, perms);
> -                trace_amdvi_page_fault(addr);
> -                return;
> -            }
> -
> -            /* go to the next lower level */
> -            pte_addr = pte & AMDVI_DEV_PT_ROOT_MASK;
> -            /* add offset and load pte */
> -            pte_addr += ((addr >> (3 + 9 * level)) & 0x1FF) << 3;
> -            pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
> -            if (!pte || (pte == (uint64_t)-1)) {
> -                return;
> -            }
> -            oldlevel = level;
> -            level = get_pte_translation_mode(pte);
> -        } while (level > 0 && level < 7);
> +    if (level == 0x7) {
> +        page_mask = pte_override_page_mask(pte);
> +    } else {
> +        page_mask = pte_get_page_mask(oldlevel);
> +    }
>   
> -        if (level == 0x7) {
> -            page_mask = pte_override_page_mask(pte);
> -        } else {
> -            page_mask = pte_get_page_mask(oldlevel);
> -        }
> +    /* get access permissions from pte */
> +    ret->iova = addr & page_mask;
> +    ret->translated_addr = (pte & AMDVI_DEV_PT_ROOT_MASK) & page_mask;
> +    ret->addr_mask = ~page_mask;
> +    ret->perm = amdvi_get_perms(pte);
> +    return;
>   
> -        /* get access permissions from pte */
> -        ret->iova = addr & page_mask;
> -        ret->translated_addr = (pte & AMDVI_DEV_PT_ROOT_MASK) & page_mask;
> -        ret->addr_mask = ~page_mask;
> -        ret->perm = amdvi_get_perms(pte);
> -        return;
> -    }
>   no_remap:
>       ret->iova = addr & AMDVI_PAGE_MASK_4K;
>       ret->translated_addr = addr & AMDVI_PAGE_MASK_4K;



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 18/18] amd_iommu: Do not emit I/O page fault events during replay()
  2025-04-14  2:02 ` [PATCH 18/18] amd_iommu: Do not emit I/O page fault events during replay() Alejandro Jimenez
@ 2025-04-23  6:18   ` Sairaj Kodilkar
  0 siblings, 0 replies; 49+ messages in thread
From: Sairaj Kodilkar @ 2025-04-23  6:18 UTC (permalink / raw)
  To: Alejandro Jimenez, qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, Wei.Huang2, joao.m.martins,
	boris.ostrovsky



On 4/14/2025 7:32 AM, Alejandro Jimenez wrote:
> Do not emit an I/O page fault on amdvi_page_walk() when a valid mapping
> is not found. The current role of amdvi_page_walk() is to be a helper
> for the translate() method and ultimately the IOMMU replay()
> functionality. These operations might be executed while the guest is
> running, but do not necessarily correspond 1:1 with guest-initiated page
> walks.
> 
> One example of such a scenario is when VFIO code calls
> memory_region_iommu_replay() (which ends up calling amdvi_page_walk())

I don't think memory_region_iommu_replay() will call amdvi_page_walk()
after your changes. Since you have added a replay callback,
memory_region_iommu_replay() should return after calling
amdvi_iommu_replay() because of the following lines:

if (imrc->replay) {
    imrc->replay(iommu_mr, n);
    return;
}

Having said that, I am not sure if there are paths (other than emulated
device writes) that call translate(). So let's keep this patch, but
remove or change the example.

Regards
Sairaj Kodilkar

> to sync the dirty bitmap, or after registering a new notifier. The guest
> would get IO_PAGE_FAULT events for all the regions where a mapping
> doesn't currently exist, which is not correct.
> 
> Note that after this change there are no users of amdvi_page_fault(),
> but since the IO page fault handling will be addressed in upcoming work,
> I am choosing to mark it as unused rather than deleting it.
> 
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> ---
>   hw/i386/amd_iommu.c | 7 ++-----
>   1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index dc29a52bd845..ac7000dbafc7 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -263,8 +263,8 @@ static void amdvi_encode_event(uint64_t *evt, uint16_t devid, uint64_t addr,
>    *
>    * @addr: virtual address in translation request
>    */
> -static void amdvi_page_fault(AMDVIState *s, uint16_t devid,
> -                             hwaddr addr, uint16_t info)
> +static void __attribute__((unused))
> +amdvi_page_fault(AMDVIState *s, uint16_t devid, hwaddr addr, uint16_t info)
>   {
>       uint64_t evt[2];
>   
> @@ -1254,9 +1254,6 @@ static void amdvi_page_walk(AMDVIAddressSpace *as, uint64_t *dte,
>        */
>       if ((pte == (uint64_t)-1) || (pte == (uint64_t)-2) ||
>           !IOMMU_PTE_PRESENT(pte) || perms != (perms & amdvi_get_perms(pte))) {
> -
> -        amdvi_page_fault(as->iommu_state, as->devfn, addr, perms);
> -        trace_amdvi_page_fault(addr);
>           return;
>       }
>   



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 09/18] amd_iommu: Add helpers to walk AMD v1 Page Table format
  2025-04-18  5:30       ` CLEMENT MATHIEU--DRIF
@ 2025-04-23  6:28         ` Sairaj Kodilkar
  0 siblings, 0 replies; 49+ messages in thread
From: Sairaj Kodilkar @ 2025-04-23  6:28 UTC (permalink / raw)
  To: CLEMENT MATHIEU--DRIF, Alejandro Jimenez, qemu-devel@nongnu.org
  Cc: pbonzini@redhat.com, richard.henderson@linaro.org,
	eduardo@habkost.net, peterx@redhat.com, david@redhat.com,
	philmd@linaro.org, mst@redhat.com, marcel.apfelbaum@gmail.com,
	alex.williamson@redhat.com, vasant.hegde@amd.com,
	suravee.suthikulpanit@amd.com, santosh.shukla@amd.com,
	Wei.Huang2@amd.com, joao.m.martins@oracle.com,
	boris.ostrovsky@oracle.com



On 4/18/2025 11:00 AM, CLEMENT MATHIEU--DRIF wrote:
> 
> 
> On 17/04/2025 5:27 pm, Alejandro Jimenez wrote:
>>
>> On 4/17/25 8:40 AM, CLEMENT MATHIEU--DRIF wrote:
>>>
>>>
>>> On 14/04/2025 4:02 am, Alejandro Jimenez wrote:
>>>>
>>>> The current amdvi_page_walk() is designed to be called by the replay()
>>>> method. Rather than drastically altering it, introduce helpers to fetch
>>>> guest PTEs that will be used by a page walker implementation.
>>>>
>>>> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
>>>> ---
>>>>     hw/i386/amd_iommu.c | 125 ++++++++++++++++++++++++++++++++++++++++++++
>>>>     hw/i386/amd_iommu.h |  42 +++++++++++++++
>>>>     2 files changed, 167 insertions(+)
>>>>
>>>> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
>>>> index 0af873b66a31..d089fdc28ef1 100644
>>>> --- a/hw/i386/amd_iommu.c
>>>> +++ b/hw/i386/amd_iommu.c
>>>> @@ -1563,6 +1563,131 @@ static const MemoryRegionOps amdvi_ir_ops = {
>>>>         }
>>>>     };
>>>>
>>>> +/*
>>>> + * For a PTE encoding a large page, return the page size it encodes
>>>> as described
>>>> + * by the AMD IOMMU Specification Table 14: Example Page Size
>>>> Encodings.
>>>> + * No need to adjust the value of the PTE to point to the first PTE
>>>> in the large
>>>> + * page since the encoding guarantees all "base" PTEs in the large
>>>> page are the
>>>> + * same.
>>>> + */
>>>> +static uint64_t large_pte_page_size(uint64_t pte)
>>>> +{
>>>> +    assert(PTE_NEXT_LEVEL(pte) == 7);
>>>> +
>>>> +    /* Determine size of the large/contiguous page encoded in the PTE */
>>>> +    return PTE_LARGE_PAGE_SIZE(pte);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Helper function to fetch a PTE using AMD v1 pgtable format.
>>>> + * Returns:
>>>> + * -2:  The Page Table Root could not be read from DTE, or IOVA is
>>>> + *      larger than supported by current page table level encoded in
>>>> + *      DTE[Mode].
>>>> + * -1:  PTE could not be read from guest memory during a page table
>>>> + *      walk. This means that the DTE has valid data, and one of the
>>>> + *      lower level entries in the Page Table could not be read.
>>>> + *  0:  PTE is marked not present, or entry is 0.
>>>> + * >0:  Leaf PTE value resolved from walking Guest IO Page Table.
>>>> + */
>>>
>>> This seems to be a bit error prone as any statement like "if (pte < 0)"
>>> might be completely removed by the compiler during optimization phases.
>>
>> Yes, caller(s) of fetch_pte() must cast to uint64_t in any comparison to
>> check for error values. As in many of the patches, I am following the
>> example of the VT-d implementation, which does the
>> same thing in vtd_iova_to_slpte() to validate the return of vtd_get_pte().
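>>
>> i.e. callers end up doing something like (a sketch of the pattern, with
>> the signature from this patch):
>>
>>     uint64_t pte = fetch_pte(as, addr, dte, &page_size);
>>
>>     if (pte == (uint64_t)-1 || pte == (uint64_t)-2) {
>>         /* handle the error */
>>     }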
> 
> Yes, I know :)
> 
> Note that VT-d only has 1 potential error code (-1) which seems easier
> to handle at call site.
> 
>>
>> When using fetch_pte() again in patch [17/18] I considered adding a
>> helper to check if fetch_pte() returned a valid PTE, but it seemed
>> unnecessary given that there are only two errors to be checked.
>>
>> Another choice was to make fetch_pte() return an int so the error check
>> could be done simply via (pte < 0), since bit 63 is either Reserved
>> (DTE) or Ignored (PDE/PTE), but that seemed more prone to confusion
>> since you'd expect a PTE to be a 64-bit long value. Though I am aware
>> the way error return/checking is implemented essentially relies on that
>> behavior.
>>
>>> If you want to reuse such "high" values, defines could help.
>>
>> Sorry, I don't follow. Do you mean using defines as in still returning a
>> uint64_t but giving -1 and -2 special definitions? That might make the
>> code a somewhat more readable when checking the error values, but still
>> requires casting to uint64_t on the check, and doesn't solve the problem
>> of a careless caller using (pte < 0) to check for errors...
> 
> Yes, I think that it would be more readable.
> When using defines, the caller no longer needs to be aware of the fact
> that the value has been casted from a negative number, which reduces the
> risk of writing things like (pte < 0).
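> 
> e.g. with hypothetical names, just to illustrate the idea:
> 
>     #define AMDVI_PTE_READ_FAILURE  ((uint64_t)-1)
>     #define AMDVI_PTE_WALK_ABORT    ((uint64_t)-2)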
> 
> I prefer the out parameter solution but let's see what other reviews say.
> 

I think having pte as an out parameter is the better solution here:
less error prone and more readable!
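
Something like this, just a sketch of the signature I have in mind (not
tested):

    /* Return 0 on success, or a negative error code on failure */
    static int fetch_pte(AMDVIAddressSpace *as, hwaddr address, uint64_t dte,
                         uint64_t *pte, hwaddr *page_size);

That way callers can simply check for (ret < 0) without any casts.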

Regards
Sairaj Kodilkar

> Thanks for this patch set :)
> 
>>
>>> Otherwise, pte could be an out parameter.
>>
>> In general, I think we have to accept the caveat that callers of
>> fetch_pte() must have some implementation specific knowledge to know
>> they cannot check for errors using (pte < 0). Maybe with the aid of a
>> strongly worded warning on the function header comment...
>>
>> But if that argument doesn't convince you, and none of the alternatives
>> above seem better, then I am leaning towards using the out parameter
>> approach.
>>
>> Thank you for the feedback.
>> Alejandro
>>
>>>
>>>> +static uint64_t __attribute__((unused))
>>>> +fetch_pte(AMDVIAddressSpace *as, const hwaddr address, uint64_t dte,
>>>> +          hwaddr *page_size)
>>>> +{
>>>> +    IOMMUAccessFlags perms = amdvi_get_perms(dte);
>>>> +
>>>> +    uint8_t level, mode;
>>>> +    uint64_t pte = dte, pte_addr;
>>>> +
>>>> +    *page_size = 0;
>>>> +
>>>> +    if (perms == IOMMU_NONE) {
>>>> +        return (uint64_t)-2;
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * The Linux kernel driver initializes the default mode to 3,
>>>> corresponding
>>>> +     * to a 39-bit GPA space, where each entry in the pagetable
>>>> translates to a
>>>> +     * 1GB (2^30) page size.
>>>> +     */
>>>> +    level = mode = get_pte_translation_mode(dte);
>>>> +    assert(mode > 0 && mode < 7);
>>>> +
>>>> +    /*
>>>> +     * If IOVA is larger than the max supported by the current pgtable
>>>> +     * level, there is nothing to do. This signals that the pagetable
>>>> +     * level should be increased, or is an address meant to have special
>>>> +     * behavior like invalidating the entire cache.
>>>> +     */
>>>> +    if (address > PT_LEVEL_MAX_ADDR(mode - 1)) {
>>>> +        /* IOVA too large for the current DTE */
>>>> +        return (uint64_t)-2;
>>>> +    }
>>>> +
>>>> +    do {
>>>> +        level -= 1;
>>>> +
>>>> +        /* Update the page_size */
>>>> +        *page_size = PTE_LEVEL_PAGE_SIZE(level);
>>>> +
>>>> +        /* Permission bits are ANDed at every level, including the DTE */
>>>> +        perms &= amdvi_get_perms(pte);
>>>> +        if (perms == IOMMU_NONE) {
>>>> +            return pte;
>>>> +        }
>>>> +
>>>> +        /* Not Present */
>>>> +        if (!IOMMU_PTE_PRESENT(pte)) {
>>>> +            return 0;
>>>> +        }
>>>> +
>>>> +        /* Large or Leaf PTE found */
>>>> +        if (PTE_NEXT_LEVEL(pte) == 7 || PTE_NEXT_LEVEL(pte) == 0) {
>>>> +            /* Leaf PTE found */
>>>> +            break;
>>>> +        }
>>>> +
>>>> +        /*
>>>> +         * Index the pgtable using the IOVA bits corresponding to the
>>>> +         * current level and walk down to the lower level.
>>>> +         */
>>>> +        pte_addr = NEXT_PTE_ADDR(pte, level, address);
>>>> +        pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
>>>> +
>>>> +        if (pte == (uint64_t)-1) {
>>>> +            /*
>>>> +             * A returned PTE of -1 indicates a failure to read the
>>>> +             * page table entry from guest memory.
>>>> +             */
>>>> +            if (level == mode - 1) {
>>>> +                /* Failure to retrieve the Page Table from Root Pointer */
>>>> +                *page_size = 0;
>>>> +                return (uint64_t)-2;
>>>> +            } else {
>>>> +                /* Failure to read PTE. Page walk skips a page_size chunk */
>>>> +                return pte;
>>>> +            }
>>>> +        }
>>>> +    } while (level > 0);
>>>> +
>>>> +    /*
>>>> +     * Page walk ends when Next Level field on PTE shows that either a
>>>> +     * leaf PTE or a series of large PTEs have been reached. In the
>>>> +     * latter case, return the pointer to the first PTE of the series.
>>>> +     */
>>>> +    assert(level == 0 || PTE_NEXT_LEVEL(pte) == 0 ||
>>>> +           PTE_NEXT_LEVEL(pte) == 7);
>>>> +
>>>> +    /*
>>>> +     * In case the range starts in the middle of a contiguous page,
>>>> +     * need to return the first PTE.
>>>> +     */
>>>> +    if (PTE_NEXT_LEVEL(pte) == 7) {
>>>> +        /* Update page_size with the large PTE page size */
>>>> +        *page_size = large_pte_page_size(pte);
>>>> +    }
>>>> +
>>>> +    return pte;
>>>> +}
>>>> +
>>>>     /*
>>>>      * Toggle between address translation and passthrough modes by
>>>>      * enabling the corresponding memory regions.
>>>> diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
>>>> index c89e7dc9947d..fc4d2f7a4575 100644
>>>> --- a/hw/i386/amd_iommu.h
>>>> +++ b/hw/i386/amd_iommu.h
>>>> @@ -25,6 +25,8 @@
>>>>     #include "hw/i386/x86-iommu.h"
>>>>     #include "qom/object.h"
>>>>
>>>> +#define GENMASK64(h, l)  (((~0ULL) >> (63 - (h) + (l))) << (l))
>>>> +
>>>>     /* Capability registers */
>>>>     #define AMDVI_CAPAB_BAR_LOW           0x04
>>>>     #define AMDVI_CAPAB_BAR_HIGH          0x08
>>>> @@ -174,6 +176,46 @@
>>>>     #define AMDVI_GATS_MODE                 (2ULL <<  12)
>>>>     #define AMDVI_HATS_MODE                 (2ULL <<  10)
>>>>
>>>> +/* Page Table format */
>>>> +
>>>> +#define AMDVI_PTE_PR                    (1ULL << 0)
>>>> +#define AMDVI_PTE_NEXT_LEVEL_MASK       GENMASK64(11, 9)
>>>> +
>>>> +#define IOMMU_PTE_PRESENT(pte)          ((pte) & AMDVI_PTE_PR)
>>>> +
>>>> +/* Using level=0 for leaf PTE at 4K page size */
>>>> +#define PT_LEVEL_SHIFT(level)           (12 + ((level) * 9))
>>>> +
>>>> +/* Return IOVA bit group used to index the Page Table at specific level */
>>>> +#define PT_LEVEL_INDEX(level, iova)     (((iova) >> PT_LEVEL_SHIFT(level)) & \
>>>> +                                        GENMASK64(8, 0))
>>>> +
>>>> +/* Return the max address for a specified level i.e. max_oaddr */
>>>> +#define PT_LEVEL_MAX_ADDR(x)    (((x) < 5) ? \
>>>> +                                ((1ULL << PT_LEVEL_SHIFT((x + 1))) - 1) : \
>>>> +                                (~(0ULL)))
>>>> +
>>>> +/* Extract the NextLevel field from PTE/PDE */
>>>> +#define PTE_NEXT_LEVEL(pte)     (((pte) & AMDVI_PTE_NEXT_LEVEL_MASK) >> 9)
>>>> +
>>>> +/* Take page table level and return default pagetable size for level */
>>>> +#define PTE_LEVEL_PAGE_SIZE(level)      (1ULL << (PT_LEVEL_SHIFT(level)))
>>>> +
>>>> +/*
>>>> + * Return address of lower level page table encoded in PTE and specified
>>>> + * by current level and corresponding IOVA bit group at such level.
>>>> + */
>>>> +#define NEXT_PTE_ADDR(pte, level, iova) (((pte) & AMDVI_DEV_PT_ROOT_MASK) + \
>>>> +                                        (PT_LEVEL_INDEX(level, iova) * 8))
>>>> +
>>>> +/*
>>>> + * Take a PTE value with mode=0x07 and return the page size it encodes.
>>>> + */
>>>> +#define PTE_LARGE_PAGE_SIZE(pte)    (1ULL << (1 + cto64(((pte) | 0xfffULL))))
>>>> +
>>>> +/* Return number of PTEs to use for a given page size (expected power of 2) */
>>>> +#define PAGE_SIZE_PTE_COUNT(pgsz)       (1ULL << ((ctz64(pgsz) - 12) % 9))
>>>> +
>>>>     /* IOTLB */
>>>>     #define AMDVI_IOTLB_MAX_SIZE 1024
>>>>     #define AMDVI_DEVID_SHIFT    36
>>>> -- 
>>>> 2.43.5
>>>>
>>>>
>>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices
  2025-04-14  2:02 [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
                   ` (17 preceding siblings ...)
  2025-04-14  2:02 ` [PATCH 18/18] amd_iommu: Do not emit I/O page fault events during replay() Alejandro Jimenez
@ 2025-04-23 10:45 ` Sairaj Kodilkar
  2025-04-23 10:56   ` Sairaj Kodilkar
  18 siblings, 1 reply; 49+ messages in thread
From: Sairaj Kodilkar @ 2025-04-23 10:45 UTC (permalink / raw)
  To: Alejandro Jimenez, qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, Wei.Huang2, joao.m.martins,
	boris.ostrovsky



On 4/14/2025 7:32 AM, Alejandro Jimenez wrote:
> This series adds support for guests using the AMD vIOMMU to enable DMA
> remapping for VFIO devices. In addition to the currently supported
> passthrough (PT) mode, guest kernels are now able to to provide DMA
> address translation and access permission checking to VFs attached to
> paging domains, using the AMD v1 I/O page table format.
> 
> These changes provide the essential emulation required to boot and
> support regular operation for a Linux guest enabling DMA remapping e.g.
> via kernel parameters "iommu=nopt" or "iommu.passthrough=0".
> 
> A new amd-iommu device property "dma-remap" (default: off) is introduced
> to control whether the feature is available. See below for a full
> example of QEMU cmdline parameters used in testing.
> 
> The patchset has been tested on an AMD EPYC Genoa host, with Linux 6.14
> host and guest kernels, launching guests with up to 256 vCPUs, 512G
> memory, and 16 CX6 VFs. Testing with IOMMU x2apic support enabled (i.e.
> xtsup=on) requires fix:
> https://lore.kernel.org/all/20250410064447.29583-3-sarunkod@amd.com/
> 
> Although there is more work to do, I am sending this series as a patch
> and not an RFC since it provides a working implementation of the
> feature. With this basic infrastructure in place it becomes easier to
> add/verify enhancements and new functionality. Here are some items I am
> working to address in follow up patches:
> 
> - Page Fault and error reporting
> - Add QEMU tracing and tests
> - Provide control over VA Size advertised to guests
> - Support hotplug/unplug of devices and other advanced features
>    (suggestions welcomed)
> 
> Thank you,
> Alejandro
> 
> ---
> Example QEMU command line:
> 
> $QEMU \
> -nodefaults \
> -snapshot \
> -no-user-config \
> -display none \
> -serial mon:stdio -nographic \
> -machine q35,accel=kvm,kernel_irqchip=split \
> -cpu host,+topoext,+x2apic,-svm,-vmx,-kvm-msi-ext-dest-id \
> -smp 32 \
> -m 128G \
> -kernel $KERNEL \
> -initrd $INITRD \
> -append "console=tty0 console=ttyS0 root=/dev/mapper/ol-root ro rd.lvm.lv=ol/root rd.lvm.lv=ol/swap iommu.passthrough=0" \
> -device amd-iommu,intremap=on,xtsup=on,dma-remap=on \
> -blockdev node-name=drive0,driver=qcow2,file.driver=file,file.filename=./OracleLinux-uefi-x86_64.qcow2 \
> -device virtio-blk-pci,drive=drive0,id=virtio-disk0 \
> -drive if=pflash,format=raw,unit=0,file=/usr/share/edk2/ovmf/OVMF_CODE.fd,readonly=on \
> -drive if=pflash,format=raw,unit=1,file=./OVMF_VARS.fd \
> -device vfio-pci,host=0000:a1:00.1,id=net0
> ---
> 
> Alejandro Jimenez (18):
>    memory: Adjust event ranges to fit within notifier boundaries
>    amd_iommu: Add helper function to extract the DTE
>    amd_iommu: Add support for IOMMU notifier
>    amd_iommu: Unmap all address spaces under the AMD IOMMU on reset
>    amd_iommu: Toggle memory regions based on address translation mode
>    amd_iommu: Set all address spaces to default translation mode on reset
>    amd_iommu: Return an error when unable to read PTE from guest memory
>    amd_iommu: Helper to decode size of page invalidation command
>    amd_iommu: Add helpers to walk AMD v1 Page Table format
>    amd_iommu: Add a page walker to sync shadow page tables on
>      invalidation
>    amd_iommu: Sync shadow page tables on page invalidation
>    amd_iommu: Add replay callback
>    amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL
>    amd_iommu: Toggle address translation on device table entry
>      invalidation
>    amd_iommu: Use iova_tree records to determine large page size on UNMAP
>    amd_iommu: Do not assume passthrough translation when DTE[TV]=0
>    amd_iommu: Refactor amdvi_page_walk() to use common code for page walk
>    amd_iommu: Do not emit I/O page fault events during replay()
> 
>   hw/i386/amd_iommu.c | 856 ++++++++++++++++++++++++++++++++++++++++----
>   hw/i386/amd_iommu.h |  52 +++
>   system/memory.c     |  10 +-
>   3 files changed, 843 insertions(+), 75 deletions(-)
> 
> 
> base-commit: 56c6e249b6988c1b6edc2dd34ebb0f1e570a1365

Hi Alejandro,
I tested the patches with FIO and VFIO (using the guest's /dev/vfio/vfio)
tests inside the guest. Everything looks good to me.

I also compared the fio performance with the following parameters on a
passthrough nvme inside the guest with 16 vcpus.

[FIO PARAMETERS]
NVMEs     = 1
JOBS/NVME = 16
MODE      = RANDREAD
IOENGINE  = LIBAIO
IODEPTH   = 32
BLOCKSIZE = 4K
SIZE      = 100%

        RESULTS
=====================
Guest
IOMMU          IOPS
mode          (kilo)
=====================
nopt           13.7
pt           1191.0
--------------------

I see that nopt (emulate IOMMU) has a huge performance.
I wonder if the DMA remapping is really useful with such performance
penalty.

Regards
Sairaj Kodilkar



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices
  2025-04-23 10:45 ` [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Sairaj Kodilkar
@ 2025-04-23 10:56   ` Sairaj Kodilkar
  2025-04-24 11:49     ` Joao Martins
  0 siblings, 1 reply; 49+ messages in thread
From: Sairaj Kodilkar @ 2025-04-23 10:56 UTC (permalink / raw)
  To: Alejandro Jimenez, qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, Wei.Huang2, joao.m.martins,
	boris.ostrovsky



On 4/23/2025 4:15 PM, Sairaj Kodilkar wrote:
> 
> 
> On 4/14/2025 7:32 AM, Alejandro Jimenez wrote:
>> This series adds support for guests using the AMD vIOMMU to enable DMA
>> remapping for VFIO devices. In addition to the currently supported
>> passthrough (PT) mode, guest kernels are now able to to provide DMA
>> address translation and access permission checking to VFs attached to
>> paging domains, using the AMD v1 I/O page table format.
>>
>> These changes provide the essential emulation required to boot and
>> support regular operation for a Linux guest enabling DMA remapping e.g.
>> via kernel parameters "iommu=nopt" or "iommu.passthrough=0".
>>
>> A new amd-iommu device property "dma-remap" (default: off) is introduced
>> to control whether the feature is available. See below for a full
>> example of QEMU cmdline parameters used in testing.
>>
>> The patchset has been tested on an AMD EPYC Genoa host, with Linux 6.14
>> host and guest kernels, launching guests with up to 256 vCPUs, 512G
>> memory, and 16 CX6 VFs. Testing with IOMMU x2apic support enabled (i.e.
>> xtsup=on) requires fix:
>> https://lore.kernel.org/all/20250410064447.29583-3-sarunkod@amd.com/
>>
>> Although there is more work to do, I am sending this series as a patch
>> and not an RFC since it provides a working implementation of the
>> feature. With this basic infrastructure in place it becomes easier to
>> add/verify enhancements and new functionality. Here are some items I am
>> working to address in follow up patches:
>>
>> - Page Fault and error reporting
>> - Add QEMU tracing and tests
>> - Provide control over VA Size advertised to guests
>> - Support hotplug/unplug of devices and other advanced features
>>    (suggestions welcomed)
>>
>> Thank you,
>> Alejandro
>>
>> ---
>> Example QEMU command line:
>>
>> $QEMU \
>> -nodefaults \
>> -snapshot \
>> -no-user-config \
>> -display none \
>> -serial mon:stdio -nographic \
>> -machine q35,accel=kvm,kernel_irqchip=split \
>> -cpu host,+topoext,+x2apic,-svm,-vmx,-kvm-msi-ext-dest-id \
>> -smp 32 \
>> -m 128G \
>> -kernel $KERNEL \
>> -initrd $INITRD \
>> -append "console=tty0 console=ttyS0 root=/dev/mapper/ol-root ro 
>> rd.lvm.lv=ol/root rd.lvm.lv=ol/swap iommu.passthrough=0" \
>> -device amd-iommu,intremap=on,xtsup=on,dma-remap=on \
>> -blockdev node- 
>> name=drive0,driver=qcow2,file.driver=file,file.filename=./OracleLinux- 
>> uefi-x86_64.qcow2 \
>> -device virtio-blk-pci,drive=drive0,id=virtio-disk0 \
>> -drive if=pflash,format=raw,unit=0,file=/usr/share/edk2/ovmf/ 
>> OVMF_CODE.fd,readonly=on \
>> -drive if=pflash,format=raw,unit=1,file=./OVMF_VARS.fd \
>> -device vfio-pci,host=0000:a1:00.1,id=net0
>> ---
>>
>> Alejandro Jimenez (18):
>>    memory: Adjust event ranges to fit within notifier boundaries
>>    amd_iommu: Add helper function to extract the DTE
>>    amd_iommu: Add support for IOMMU notifier
>>    amd_iommu: Unmap all address spaces under the AMD IOMMU on reset
>>    amd_iommu: Toggle memory regions based on address translation mode
>>    amd_iommu: Set all address spaces to default translation mode on reset
>>    amd_iommu: Return an error when unable to read PTE from guest memory
>>    amd_iommu: Helper to decode size of page invalidation command
>>    amd_iommu: Add helpers to walk AMD v1 Page Table format
>>    amd_iommu: Add a page walker to sync shadow page tables on
>>      invalidation
>>    amd_iommu: Sync shadow page tables on page invalidation
>>    amd_iommu: Add replay callback
>>    amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL
>>    amd_iommu: Toggle address translation on device table entry
>>      invalidation
>>    amd_iommu: Use iova_tree records to determine large page size on UNMAP
>>    amd_iommu: Do not assume passthrough translation when DTE[TV]=0
>>    amd_iommu: Refactor amdvi_page_walk() to use common code for page walk
>>    amd_iommu: Do not emit I/O page fault events during replay()
>>
>>   hw/i386/amd_iommu.c | 856 ++++++++++++++++++++++++++++++++++++++++----
>>   hw/i386/amd_iommu.h |  52 +++
>>   system/memory.c     |  10 +-
>>   3 files changed, 843 insertions(+), 75 deletions(-)
>>
>>
>> base-commit: 56c6e249b6988c1b6edc2dd34ebb0f1e570a1365
> 
> Hi Alejandro,
> I tested the patches with FIO and with VFIO tests (using the guest's
> /dev/vfio/vfio) inside the guest. Everything looks good to me.
> 
> I also compared the fio performance with the following parameters on a
> passthrough NVMe inside a guest with 16 vCPUs.
> 
> [FIO PARAMETERS]
> NVMEs     = 1
> JOBS/NVME = 16
> MODE      = RANDREAD
> IOENGINE  = LIBAIO
> IODEPTH   = 32
> BLOCKSIZE = 4K
> SIZE      = 100%
> 
>         RESULTS
> =====================
> Guest
> IOMMU          IOPS
> mode          (kilo)
> =====================
> nopt           13.7
> pt           1191.0
> --------------------
> 
> I see that nopt (emulated IOMMU) has a huge performance penalty.
> I wonder if the DMA remapping is really useful with such a performance
> penalty.
> 
> Regards
> Sairaj Kodilkar
> 




* Re: [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices
  2025-04-23 10:56   ` Sairaj Kodilkar
@ 2025-04-24 11:49     ` Joao Martins
  0 siblings, 0 replies; 49+ messages in thread
From: Joao Martins @ 2025-04-24 11:49 UTC (permalink / raw)
  To: Sairaj Kodilkar, Alejandro Jimenez
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, Wei.Huang2,
	boris.ostrovsky, qemu-devel

On 23/04/2025 11:56, Sairaj Kodilkar wrote:
> On 4/23/2025 4:15 PM, Sairaj Kodilkar wrote:
>> On 4/14/2025 7:32 AM, Alejandro Jimenez wrote:
>>> [...]
>>
>> [...]
>>
>> I see that nopt (emulated IOMMU) has a huge performance penalty.
>> I wonder if the DMA remapping is really useful with such a performance
>> penalty.

This is not so much about performance but more about guest compatibility (or
just not breaking guests) once you expose an amd-iommu device, since you can't
control what your guests are running, old or new. It also matters for Windows
OSes, where the vIOMMU so far only works on Intel -- so this series brings
parity. There are also more niche features (Windows Credential Guard, or
Windows firmware DMA protection, which require a vIOMMU); and finally it
enables general testing/development of the (v)IOMMU with real VFs.



* Re: [PATCH 05/18] amd_iommu: Toggle memory regions based on address translation mode
  2025-04-22 12:17   ` Sairaj Kodilkar
@ 2025-04-28 21:10     ` Alejandro Jimenez
  0 siblings, 0 replies; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-28 21:10 UTC (permalink / raw)
  To: Sairaj Kodilkar, qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, Wei.Huang2, joao.m.martins,
	boris.ostrovsky



On 4/22/25 8:17 AM, Sairaj Kodilkar wrote:
> 
> 
> On 4/14/2025 7:32 AM, Alejandro Jimenez wrote:
>> Enable the appropriate memory region for an address space depending on
>> the address translation mode selected for it. This is currently based on
>> a generic x86 IOMMU property, and only done during the address space
>> initialization. Extract the code into a helper and toggle the regions
>> based on whether DMA remapping is available as a global capability, and
>> if the specific address space is using address translation.
>>
>> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
>> ---
>>   hw/i386/amd_iommu.c | 30 ++++++++++++++++++++----------
>>   1 file changed, 20 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
>> index ad5869e72fdc..3f9aa2cc8d31 100644
>> --- a/hw/i386/amd_iommu.c
>> +++ b/hw/i386/amd_iommu.c
>> @@ -1529,13 +1529,31 @@ static const MemoryRegionOps amdvi_ir_ops = {
>>       }
>>   };
>> +/*
>> + * Toggle between address translation and passthrough modes by enabling the
>> + * corresponding memory regions.
>> + */
>> +static void amdvi_switch_address_space(AMDVIAddressSpace *amdvi_as)
>> +{
>> +    AMDVIState *s = amdvi_as->iommu_state;
>> +
>> +    if (s->dma_remap && amdvi_as->addr_translation) {
> 
> Hi Alejandro,
> 
> I know g_new0 initializes addr_translation to 0, but should we explicitly
> initialize it to 0 in order to make it more readable?

I am generally in favor of making things more explicit, so I like this 
suggestion, and would also initialize the notifier_flags i.e.

@@ -2152,6 +2152,8 @@ static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
          iommu_as[devfn]->devfn = (uint8_t)devfn;
          iommu_as[devfn]->iommu_state = s;
          iommu_as[devfn]->iova_tree = iova_tree_new();
+        iommu_as[devfn]->addr_translation = false;
+        iommu_as[devfn]->notifier_flags = IOMMU_NONE;

          amdvi_dev_as = iommu_as[devfn];

It is possible that others consider these redundant or bad coding style, 
but unless there is pushback I'll include the changes in v2.
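
For context, g_new0() zero-fills the allocation, so the explicit assignments
are purely for readability; a toy illustration of what g_new0 already
guarantees (the struct here is hypothetical, not from the patch):

#include <glib.h>
#include <stdbool.h>

typedef struct {
    bool addr_translation;   /* zero-filled: starts out false */
    int  notifier_flags;     /* zero-filled: starts out 0 */
} Example;

int main(void)
{
    Example *e = g_new0(Example, 1);   /* allocate and zero the struct */

    g_assert(!e->addr_translation && e->notifier_flags == 0);
    g_free(e);
    return 0;
}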

Thank you,
Alejandro
> 
> Regards
> Sairaj Kodilkar
> 
>> +        /* Enabling DMA region */
>> +        memory_region_set_enabled(&amdvi_as->iommu_nodma, false);
>> +        memory_region_set_enabled(MEMORY_REGION(&amdvi_as->iommu), true);
>> +    } else {
>> +        /* Disabling DMA region, using passthrough */
>> +        memory_region_set_enabled(MEMORY_REGION(&amdvi_as->iommu), false);
>> +        memory_region_set_enabled(&amdvi_as->iommu_nodma, true);
>> +    }
>> +}
>> +
>>   static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>>   {
>>       char name[128];
>>       AMDVIState *s = opaque;
>>       AMDVIAddressSpace **iommu_as, *amdvi_dev_as;
>>       int bus_num = pci_bus_num(bus);
>> -    X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
>>       iommu_as = s->address_spaces[bus_num];
>> @@ -1595,15 +1613,7 @@ static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>>                                               AMDVI_INT_ADDR_FIRST,
>>                                               &amdvi_dev_as->iommu_ir, 1);
>> -        if (!x86_iommu->pt_supported) {
>> -            memory_region_set_enabled(&amdvi_dev_as->iommu_nodma, false);
>> -            memory_region_set_enabled(MEMORY_REGION(&amdvi_dev_as->iommu),
>> -                                      true);
>> -        } else {
>> -            memory_region_set_enabled(MEMORY_REGION(&amdvi_dev_as->iommu),
>> -                                      false);
>> -            memory_region_set_enabled(&amdvi_dev_as->iommu_nodma, true);
>> -        }
>> +        amdvi_switch_address_space(amdvi_dev_as);
>>       }
>>       return &iommu_as[devfn]->as;
>>   }
> 




* Re: [PATCH 08/18] amd_iommu: Helper to decode size of page invalidation command
  2025-04-22 12:26   ` Sairaj Kodilkar
@ 2025-04-28 21:16     ` Alejandro Jimenez
  0 siblings, 0 replies; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-28 21:16 UTC (permalink / raw)
  To: Sairaj Kodilkar, qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, Wei.Huang2, joao.m.martins,
	boris.ostrovsky



On 4/22/25 8:26 AM, Sairaj Kodilkar wrote:
> 
> 
> On 4/14/2025 7:32 AM, Alejandro Jimenez wrote:
>> The size of the region to invalidate depends on the S bit and address
>> encoded in the command. Add a helper to extract this information, which
>> will be used to sync shadow page tables in upcoming changes.
>>
>> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
>> ---
>>   hw/i386/amd_iommu.c | 34 ++++++++++++++++++++++++++++++++++
>>   hw/i386/amd_iommu.h |  4 ++++
>>   2 files changed, 38 insertions(+)
>>
>> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
>> index 5f55be1f4d36..0af873b66a31 100644
>> --- a/hw/i386/amd_iommu.c
>> +++ b/hw/i386/amd_iommu.c


>> +static uint64_t __attribute__((unused))
>> +amdvi_decode_invalidation_size(hwaddr addr, uint16_t flags)
>> +{
>> +    uint64_t size = AMDVI_PAGE_SIZE;
>> +    uint8_t fzbit = 0;
>> +
>> +    if (flags & AMDVI_CMD_INVAL_IOMMU_PAGES_S) {
>> +        fzbit = cto64(addr | 0xFFF);
>> +
>> +        if (fzbit >= 51 || !addr) {
> 
> I am skeptical about the condition addr == 0 (!addr)
> 
> Consider the case where the user wants to invalidate an 8K page starting
> at address 0. That causes the address field to be 0, right? If so, then
> we should invalidate only the 8K page, not the entire address range.

You are right, I'll fix the issue (i.e. remove the !addr special case) 
in v2.
This was a remnant of an earlier prototype where I was decoding the 
address with a different method, but as you point out this special case 
is not needed anymore and would result in sub-optimal behavior.
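
For reference, a minimal sketch of the helper with the !addr special case
removed, as described above (illustrative only, not the actual v2 hunk; it
assumes the standard AMD encoding where the run of low-order 1 bits in the
address selects the range):

/*
 * Sketch: decode the invalidation size for INVALIDATE_IOMMU_PAGES.
 * With S=1, a first zero bit at position N encodes a 2^N byte range,
 * and a first zero at bit 51 or above invalidates the whole space.
 */
static uint64_t decode_invalidation_size(hwaddr addr, uint16_t flags)
{
    uint64_t size = AMDVI_PAGE_SIZE;     /* S=0: single 4K page */
    uint8_t fzbit = 0;

    if (flags & AMDVI_CMD_INVAL_IOMMU_PAGES_S) {
        fzbit = cto64(addr | 0xFFF);     /* position of the first zero bit */

        if (fzbit >= 51) {
            size = AMDVI_INV_ALL_PAGES;  /* invalidate entire address space */
        } else {
            size = 1ULL << fzbit;        /* e.g. fzbit = 13 -> 8K */
        }
    }
    return size;
}

With this version Sairaj's example works out: an 8K invalidation based at
IOVA 0 is encoded as addr = 0x1000, so addr | 0xFFF = 0x1FFF, fzbit = 13,
and size = 8K, rather than the whole address space.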

Thank you!
Alejandro

> 
> Am I missing something here ?
> 
> Regards
> Sairaj





* Re: [PATCH 11/18] amd_iommu: Sync shadow page tables on page invalidation
  2025-04-22 12:38   ` Sairaj Kodilkar
@ 2025-04-29 19:47     ` Alejandro Jimenez
  0 siblings, 0 replies; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-29 19:47 UTC (permalink / raw)
  To: Sairaj Kodilkar, qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, Wei.Huang2, joao.m.martins,
	boris.ostrovsky



On 4/22/25 8:38 AM, Sairaj Kodilkar wrote:
> 
> 
> On 4/14/2025 7:32 AM, Alejandro Jimenez wrote:
>> When the guest issues an INVALIDATE_IOMMU_PAGES command, decode the
>> address and size of the invalidation and sync the guest page table state
>> with the host. This requires walking the guest page table and calling
>> notifiers registered for address spaces matching the domain ID encoded
>> in the command.
>>
>> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
>> ---
>>   hw/i386/amd_iommu.c | 110 ++++++++++++++++++++++++++++++++++++++++----
>>   1 file changed, 102 insertions(+), 8 deletions(-)
>>
>> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
>> index 6789e1e9b688..cf83ac607064 100644
>> --- a/hw/i386/amd_iommu.c
>> +++ b/hw/i386/amd_iommu.c
>> @@ -97,6 +97,9 @@ typedef enum AMDVIFaultReason {
>>   } AMDVIFaultReason;
>>   static int amdvi_as_to_dte(AMDVIAddressSpace *as, uint64_t *dte);
>> +static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
>> +                                               uint64_t *dte, hwaddr addr,
>> +                                               uint64_t size, bool send_unmap);
>>   uint64_t amdvi_extended_feature_register(AMDVIState *s)
>>   {
>> @@ -497,8 +500,7 @@ static gboolean amdvi_iotlb_remove_by_domid(gpointer key, gpointer value,
>>    * first zero at bit 51 or larger is a request to invalidate the entire address
>>    * space.
>>    */
>> -static uint64_t __attribute__((unused))
>> -amdvi_decode_invalidation_size(hwaddr addr, uint16_t flags)
>> +static uint64_t amdvi_decode_invalidation_size(hwaddr addr, uint16_t flags)
>>   {
>>       uint64_t size = AMDVI_PAGE_SIZE;
>>       uint8_t fzbit = 0;
>> @@ -515,10 +517,101 @@ amdvi_decode_invalidation_size(hwaddr addr, uint16_t flags)
>>       return size;
>>   }
>> +/*
>> + * Synchronize the guest page tables with the shadow page tables kept in the
>> + * host for the specified range.
>> + * The invalidation command issued by the guest and intercepted by the VMM
>> + * does not specify a device, but a domain, since all devices in the same
>> + * domain share the same page tables. However, vIOMMU emulation creates
>> + * separate address spaces per device, so it is necessary to traverse the
>> + * list of all address spaces (i.e. devices) that have notifiers registered
>> + * in order to propagate the changes to the host page tables. This could
>> + * generate redundant requests to map/unmap regions when there are several
>> + * devices in the same domain, but it must be optimized by maintaining an
>> + * internal representation of the per-domain address space, and avoid
>> + * invoking a notifier when the change is already reflected in the host
>> + * page tables.
>> + *
>> + * We cannot return early from this function once a matching domain has been
>> + * identified and its page tables synced (based on the fact that all devices
>> + * in the same domain share the page tables). The reason is that different
>> + * devices (i.e. address spaces) could have different notifiers registered,
>> + * and by skipping address spaces that appear later on the
>> + * amdvi_as_with_notifiers list their notifiers (which could differ from the
>> + * ones registered for the first device/address space) would not be invoked.
>> + */
>> +static void amdvi_sync_domain(AMDVIState *s, uint16_t domid, uint64_t addr,
>> +                              uint16_t flags)
>> +{
>> +    AMDVIAddressSpace *as;
>> +
>> +    uint64_t size = amdvi_decode_invalidation_size(addr, flags);
>> +
>> +    if (size == AMDVI_INV_ALL_PAGES) {
>> +        addr = 0;       /* Set start address to 0 and invalidate entire AS */
>> +    } else {
>> +        addr &= ~(size - 1);
>> +    }
>> +
>> +    /*
>> +     * Call notifiers that have registered for each address space matching
>> +     * the domain ID, in order to sync the guest pagetable state with the host.
>> +     */
>> +    QLIST_FOREACH(as, &s->amdvi_as_with_notifiers, next) {
>> +
>> +        uint64_t dte[4] = { 0 };
>> +
>> +        /*
>> +         * Retrieve the Device Table entry for the devid corresponding to the
>> +         * current address space, and verify the DomainID matches, i.e. the
>> +         * page tables to be synced belong to devices in the domain.
>> +         */
>> +        if (amdvi_as_to_dte(as, dte)) {
>> +            continue;
>> +        }
>> +
>> +        /* Only need to sync the Page Tables for a matching domain */
>> +        if (domid != (dte[1] & AMDVI_DEV_DOMID_ID_MASK)) {
>> +            continue;
>> +        }
>> +
>> +        /*
>> +         * We have determined that there is a valid Device Table Entry for a
>> +         * device matching the DomainID in the INV_IOMMU_PAGES command issued
>> +         * by the guest. Walk the guest page table to sync shadow page table.
>> +         *
>> +         * An optimization can be made if only UNMAP notifiers are registered
>> +         * to avoid walking the page table and just invalidate the requested
>> +         * range.
>> +         */
> 
> Hi Alejandro,
> I am not able to understand the optimization for the "UNMAP only"
> notifiers case.

The only example I am aware of is in vhost, which can register for 
UNMAP-only notifications (or for DEVIOTLB_UNMAP when supporting the 
device-iotlb capability for virtio-pci devices using ats=on). See the 
code at:

https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost.c#L896-898

And for more context, the comment at:

https://gitlab.com/qemu-project/qemu/-/blob/master/include/system/memory.h?ref_type=heads#L146-158

However, given that I am not testing with vhost, or implementing the 
rest of the changes to support the device IOTLB, I will remove this 
section. We can add it later if device-iotlb/ATS support makes sense...

> I am confused because the code will call the UNMAP notifier
> even if the pte is present (which should be wrong). Could you please
> explain why we are unmapping IOVAs with a present pte?

Taking this question in isolation, if any component registers only for 
UNMAP notifications, it means that it expects to receive invalidations 
only for unmap operations (with the correct range), and that is the only 
information it needs to work. So if you register for UNMAP-only 
notifications, you expect/know/trust the device driver in the guest that 
is managing the device (i.e. the address space) to only issue 
invalidations when it wants to discard mappings.
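
As a concrete illustration, registering such an UNMAP-only notifier with the
memory API looks roughly like this sketch (the handler and setup function are
made-up names; includes follow the current QEMU header layout):

#include "qemu/osdep.h"
#include "system/memory.h"
#include "qapi/error.h"

/* Hypothetical handler: with an UNMAP-only registration, only invalidations
 * arrive here (entry->perm == IOMMU_NONE), covering the range
 * [entry->iova, entry->iova + entry->addr_mask].
 */
static void my_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *entry)
{
    /* drop any cached translations for the reported range */
}

static void register_unmap_only(IOMMUMemoryRegion *iommu_mr)
{
    static IOMMUNotifier notifier;

    /* Request UNMAP events only, over the whole address space */
    iommu_notifier_init(&notifier, my_unmap_notify, IOMMU_NOTIFIER_UNMAP,
                        0, HWADDR_MAX, 0);
    memory_region_register_iommu_notifier(MEMORY_REGION(iommu_mr),
                                          &notifier, &error_abort);
}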

Alejandro

> 
> Regards
> Sairaj Kodilkar
> 
>> +        if (as->notifier_flags & IOMMU_NOTIFIER_MAP) {
>> +
>> +            /* Sync guest IOMMU mappings with host */
>> +            amdvi_sync_shadow_page_table_range(as, &dte[0], addr, size, true);
>> +        } else {
>> +            /*
>> +             * For UNMAP only notifiers, invalidate the requested size. No
>> +             * page table walk is required since there is no need to replay
>> +             * mappings.
>> +             */
>> +            IOMMUTLBEvent event = {
>> +                .type = IOMMU_NOTIFIER_UNMAP,
>> +                .entry = {
>> +                    .target_as = &address_space_memory,
>> +                    .iova = addr,
>> +                    .translated_addr = 0, /* Irrelevant for unmap case */
>> +                    .addr_mask = size - 1,
>> +                    .perm = IOMMU_NONE,
>> +                },
>> +            };
>> +            memory_region_notify_iommu(&as->iommu, 0, event);
>> +        }
>> +    }
>> +}
>> +
>>   /* we don't have devid - we can't remove pages by address */
>>   static void amdvi_inval_pages(AMDVIState *s, uint64_t *cmd)
>>   {
>>       uint16_t domid = cpu_to_le16((uint16_t)extract64(cmd[0], 32, 16));
>> +    uint64_t addr = cpu_to_le64(extract64(cmd[1], 12, 52)) << 12;
>> +    uint16_t flags = cpu_to_le16((uint16_t)extract64(cmd[1], 0, 3));
>>       if (extract64(cmd[0], 20, 12) || extract64(cmd[0], 48, 12) ||
>>           extract64(cmd[1], 3, 9)) {
>> @@ -528,6 +621,8 @@ static void amdvi_inval_pages(AMDVIState *s, uint64_t *cmd)
>>       g_hash_table_foreach_remove(s->iotlb, amdvi_iotlb_remove_by_domid,
>>                                   &domid);
>> +
>> +    amdvi_sync_domain(s, domid, addr, flags);
>>       trace_amdvi_pages_inval(domid);
>>   }
>> @@ -1589,9 +1684,8 @@ static uint64_t large_pte_page_size(uint64_t pte)
>>    *  0:  PTE is marked not present, or entry is 0.
>>    * >0:  Leaf PTE value resolved from walking Guest IO Page Table.
>>    */
>> -static uint64_t __attribute__((unused))
>> -fetch_pte(AMDVIAddressSpace *as, const hwaddr address, uint64_t dte,
>> -          hwaddr *page_size)
>> +static uint64_t fetch_pte(AMDVIAddressSpace *as, const hwaddr address,
>> +                          uint64_t dte, hwaddr *page_size)
>>   {
>>       IOMMUAccessFlags perms = amdvi_get_perms(dte);
>> @@ -1693,9 +1787,9 @@ fetch_pte(AMDVIAddressSpace *as, const hwaddr address, uint64_t dte,
>>    * notifiers to sync the shadow page tables in the host.
>>    * Must be called with a valid DTE for DMA remapping i.e. V=1,TV=1
>>    */
>> -static void __attribute__((unused))
>> -amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as, uint64_t *dte,
>> -                                   hwaddr addr, uint64_t size, bool send_unmap)
>> +static void amdvi_sync_shadow_page_table_range(AMDVIAddressSpace *as,
>> +                                               uint64_t *dte, hwaddr addr,
>> +                                               uint64_t size, bool send_unmap)
>>   {
>>       IOMMUTLBEvent event;
> 




* Re: [PATCH 14/18] amd_iommu: Toggle address translation on device table entry invalidation
  2025-04-22 12:48   ` Sairaj Kodilkar
@ 2025-04-29 20:45     ` Alejandro Jimenez
  0 siblings, 0 replies; 49+ messages in thread
From: Alejandro Jimenez @ 2025-04-29 20:45 UTC (permalink / raw)
  To: Sairaj Kodilkar, qemu-devel
  Cc: pbonzini, richard.henderson, eduardo, peterx, david, philmd, mst,
	marcel.apfelbaum, alex.williamson, vasant.hegde,
	suravee.suthikulpanit, santosh.shukla, Wei.Huang2, joao.m.martins,
	boris.ostrovsky



On 4/22/25 8:48 AM, Sairaj Kodilkar wrote:
> 
> 
> On 4/14/2025 7:32 AM, Alejandro Jimenez wrote:


>> +    if ((ret < 0) || (!ret && !dte_mode)) {
>> +        /*
>> +         * The DTE could not be retrieved, it is not valid, or it is not
>> +         * setup for paging. In either case, ensure that if paging was
>> +         * previously in use then switch to use the no_dma memory region,
>> +         * and invalidate all existing mappings.
>> +         */
>> +        if (as->addr_translation) {
>> +            as->addr_translation = false;
>> +
>> +            amdvi_switch_address_space(as);
>> +
>> +            IOMMU_NOTIFIER_FOREACH(n, &as->iommu) {
>> +                amdvi_address_space_unmap(as, n);
>> +            }
> 
> Hi,
> I think amdvi_switch_address_space() should come after
> amdvi_address_space_unmap(). amdvi_switch_address_space() unregisters the
> VFIO notifier, hence the mr->iommu_notify list is empty and we do not unmap
> the shadow page table.

Good point, fixed it.
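
For the archives, the reordered sequence being discussed would look roughly
like this (a sketch of the fix using names from the quoted patch, not the
actual v2 hunk; the wrapper function is hypothetical):

/* Deactivate translation for an address space whose DTE no longer
 * enables paging.
 */
static void amdvi_deactivate_translation(AMDVIAddressSpace *as)
{
    IOMMUNotifier *n;

    if (!as->addr_translation) {
        return;
    }
    as->addr_translation = false;

    /* Unmap the shadow mappings while the notifiers are still registered
     * on the (still enabled) DMA region...
     */
    IOMMU_NOTIFIER_FOREACH(n, &as->iommu) {
        amdvi_address_space_unmap(as, n);
    }

    /* ...and only then flip this address space to the no-DMA region. */
    amdvi_switch_address_space(as);
}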

Alejandro

> 
> The code works fine because eventually vfio_iommu_map_notify maps the
> entire address space, but we should keep the right ordering.
> 
> Regards
> Sairaj Kodilkar
> 
>> +        }
>> +    } else if (!as->addr_translation) {
>> +        /*
>> +         * Installing a DTE that enables translation where it wasn't
>> +         * previously active. Activate the DMA memory region.
>> +         */
>> +        as->addr_translation = true;
>> +        amdvi_switch_address_space(as);
>> +        amdvi_address_space_sync(as);
>> +    }
>> +
>>       trace_amdvi_devtab_inval(PCI_BUS_NUM(devid), PCI_SLOT(devid),
>>                                PCI_FUNC(devid));
>>   }
> 




Thread overview: 49+ messages
2025-04-14  2:02 [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Alejandro Jimenez
2025-04-14  2:02 ` [PATCH 01/18] memory: Adjust event ranges to fit within notifier boundaries Alejandro Jimenez
2025-04-14  2:02 ` [PATCH 02/18] amd_iommu: Add helper function to extract the DTE Alejandro Jimenez
2025-04-16 11:36   ` Sairaj Kodilkar
2025-04-16 13:29     ` Alejandro Jimenez
2025-04-16 18:50       ` Michael S. Tsirkin
2025-04-16 22:37         ` Alejandro Jimenez
2025-04-14  2:02 ` [PATCH 03/18] amd_iommu: Add support for IOMMU notifier Alejandro Jimenez
2025-04-16 12:14   ` Sairaj Kodilkar
2025-04-16 22:17     ` Alejandro Jimenez
2025-04-17 10:19       ` Sairaj Kodilkar
2025-04-17 16:21         ` Alejandro Jimenez
2025-04-17 16:34           ` Michael S. Tsirkin
2025-04-18  6:33           ` Sairaj Kodilkar
2025-04-14  2:02 ` [PATCH 04/18] amd_iommu: Unmap all address spaces under the AMD IOMMU on reset Alejandro Jimenez
2025-04-14  2:02 ` [PATCH 05/18] amd_iommu: Toggle memory regions based on address translation mode Alejandro Jimenez
2025-04-22 12:17   ` Sairaj Kodilkar
2025-04-28 21:10     ` Alejandro Jimenez
2025-04-14  2:02 ` [PATCH 06/18] amd_iommu: Set all address spaces to default translation mode on reset Alejandro Jimenez
2025-04-14  2:02 ` [PATCH 07/18] amd_iommu: Return an error when unable to read PTE from guest memory Alejandro Jimenez
2025-04-14  2:02 ` [PATCH 08/18] amd_iommu: Helper to decode size of page invalidation command Alejandro Jimenez
2025-04-22 12:26   ` Sairaj Kodilkar
2025-04-28 21:16     ` Alejandro Jimenez
2025-04-14  2:02 ` [PATCH 09/18] amd_iommu: Add helpers to walk AMD v1 Page Table format Alejandro Jimenez
2025-04-17 12:40   ` CLEMENT MATHIEU--DRIF
2025-04-17 15:27     ` Alejandro Jimenez
2025-04-18  5:30       ` CLEMENT MATHIEU--DRIF
2025-04-23  6:28         ` Sairaj Kodilkar
2025-04-14  2:02 ` [PATCH 10/18] amd_iommu: Add a page walker to sync shadow page tables on invalidation Alejandro Jimenez
2025-04-17 15:14   ` Ethan MILON
2025-04-17 15:45     ` Alejandro Jimenez
2025-04-14  2:02 ` [PATCH 11/18] amd_iommu: Sync shadow page tables on page invalidation Alejandro Jimenez
2025-04-22 12:38   ` Sairaj Kodilkar
2025-04-29 19:47     ` Alejandro Jimenez
2025-04-14  2:02 ` [PATCH 12/18] amd_iommu: Add replay callback Alejandro Jimenez
2025-04-14  2:02 ` [PATCH 13/18] amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL Alejandro Jimenez
2025-04-14  2:02 ` [PATCH 14/18] amd_iommu: Toggle address translation on device table entry invalidation Alejandro Jimenez
2025-04-22 12:48   ` Sairaj Kodilkar
2025-04-29 20:45     ` Alejandro Jimenez
2025-04-14  2:02 ` [PATCH 15/18] amd_iommu: Use iova_tree records to determine large page size on UNMAP Alejandro Jimenez
2025-04-14  2:02 ` [PATCH 16/18] amd_iommu: Do not assume passthrough translation when DTE[TV]=0 Alejandro Jimenez
2025-04-23  6:06   ` Sairaj Kodilkar
2025-04-14  2:02 ` [PATCH 17/18] amd_iommu: Refactor amdvi_page_walk() to use common code for page walk Alejandro Jimenez
2025-04-14  2:02 ` [PATCH 18/18] amd_iommu: Do not emit I/O page fault events during replay() Alejandro Jimenez
2025-04-23  6:18   ` Sairaj Kodilkar
2025-04-23 10:45 ` [PATCH 00/18] AMD vIOMMU: DMA remapping support for VFIO devices Sairaj Kodilkar
2025-04-23 10:56   ` Sairaj Kodilkar
2025-04-24 11:49     ` Joao Martins
