qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: "Michael S. Tsirkin" <mst@redhat.com>
To: qemu-devel@nongnu.org
Cc: Peter Maydell <peter.maydell@linaro.org>,
	Alejandro Jimenez <alejandro.j.jimenez@oracle.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Richard Henderson <richard.henderson@linaro.org>,
	Eduardo Habkost <eduardo@habkost.net>,
	Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Subject: [PULL 56/75] amd_iommu: Add helpers to walk AMD v1 Page Table format
Date: Sun, 5 Oct 2025 15:18:05 -0400	[thread overview]
Message-ID: <39789107398c5c554efe0ff3726b3b3671dfb376.1759691708.git.mst@redhat.com> (raw)
In-Reply-To: <cover.1759691708.git.mst@redhat.com>

From: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>

The current amdvi_page_walk() is designed to be called by the replay()
method. Rather than drastically altering it, introduce helpers to fetch
guest PTEs that will be used by a page walker implementation.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-8-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 hw/i386/amd_iommu.h |  40 ++++++++++++++
 hw/i386/amd_iommu.c | 123 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 163 insertions(+)

diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
index c1170a8202..9f833b297d 100644
--- a/hw/i386/amd_iommu.h
+++ b/hw/i386/amd_iommu.h
@@ -178,6 +178,46 @@
 #define AMDVI_GATS_MODE                 (2ULL <<  12)
 #define AMDVI_HATS_MODE                 (2ULL <<  10)
 
+/* Page Table format */
+
+#define AMDVI_PTE_PR                    (1ULL << 0)
+#define AMDVI_PTE_NEXT_LEVEL_MASK       GENMASK64(11, 9)
+
+#define IOMMU_PTE_PRESENT(pte)          ((pte) & AMDVI_PTE_PR)
+
+/* Using level=0 for leaf PTE at 4K page size */
+#define PT_LEVEL_SHIFT(level)           (12 + ((level) * 9))
+
+/* Return IOVA bit group used to index the Page Table at specific level */
+#define PT_LEVEL_INDEX(level, iova)     (((iova) >> PT_LEVEL_SHIFT(level)) & \
+                                        GENMASK64(8, 0))
+
+/* Return the max address for a specified level i.e. max_oaddr */
+#define PT_LEVEL_MAX_ADDR(x)    (((x) < 5) ? \
+                                ((1ULL << PT_LEVEL_SHIFT((x + 1))) - 1) : \
+                                (~(0ULL)))
+
+/* Extract the NextLevel field from PTE/PDE */
+#define PTE_NEXT_LEVEL(pte)     (((pte) & AMDVI_PTE_NEXT_LEVEL_MASK) >> 9)
+
+/* Take page table level and return default pagetable size for level */
+#define PTE_LEVEL_PAGE_SIZE(level)      (1ULL << (PT_LEVEL_SHIFT(level)))
+
+/*
+ * Return address of lower level page table encoded in PTE and specified by
+ * current level and corresponding IOVA bit group at such level.
+ */
+#define NEXT_PTE_ADDR(pte, level, iova) (((pte) & AMDVI_DEV_PT_ROOT_MASK) + \
+                                        (PT_LEVEL_INDEX(level, iova) * 8))
+
+/*
+ * Take a PTE value with mode=0x07 and return the page size it encodes.
+ */
+#define PTE_LARGE_PAGE_SIZE(pte)    (1ULL << (1 + cto64(((pte) | 0xfffULL))))
+
+/* Return number of PTEs to use for a given page size (expected power of 2) */
+#define PAGE_SIZE_PTE_COUNT(pgsz)       (1ULL << ((ctz64(pgsz) - 12) % 9))
+
 /* IOTLB */
 #define AMDVI_IOTLB_MAX_SIZE 1024
 #define AMDVI_DEVID_SHIFT    36
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 29ed3f0ef2..c25981ff93 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -87,6 +87,8 @@ typedef enum AMDVIFaultReason {
     AMDVI_FR_DTE_RTR_ERR = 1,   /* Failure to retrieve DTE */
     AMDVI_FR_DTE_V,             /* DTE[V] = 0 */
     AMDVI_FR_DTE_TV,            /* DTE[TV] = 0 */
+    AMDVI_FR_PT_ROOT_INV,       /* Page Table Root ptr invalid */
+    AMDVI_FR_PT_ENTRY_INV,      /* Failure to read PTE from guest memory */
 } AMDVIFaultReason;
 
 uint64_t amdvi_extended_feature_register(AMDVIState *s)
@@ -558,6 +560,127 @@ static int amdvi_as_to_dte(AMDVIAddressSpace *as, uint64_t *dte)
     return 0;
 }
 
+/*
+ * For a PTE encoding a large page, return the page size it encodes as described
+ * by the AMD IOMMU Specification Table 14: Example Page Size Encodings.
+ * No need to adjust the value of the PTE to point to the first PTE in the large
+ * page since the encoding guarantees all "base" PTEs in the large page are the
+ * same.
+ */
+static uint64_t large_pte_page_size(uint64_t pte)
+{
+    assert(PTE_NEXT_LEVEL(pte) == 7);
+
+    /* Determine size of the large/contiguous page encoded in the PTE */
+    return PTE_LARGE_PAGE_SIZE(pte);
+}
+
+/*
+ * Helper function to fetch a PTE using AMD v1 pgtable format.
+ * On successful page walk, returns 0 and pte parameter points to a valid PTE.
+ * On failure, returns:
+ * -AMDVI_FR_PT_ROOT_INV: A page walk is not possible due to conditions like DTE
+ *      with invalid permissions, Page Table Root can not be read from DTE, or a
+ *      larger IOVA than supported by page table level encoded in DTE[Mode].
+ * -AMDVI_FR_PT_ENTRY_INV: A PTE could not be read from guest memory during a
+ *      page table walk. This means that the DTE has valid data, but one of the
+ *      lower level entries in the Page Table could not be read.
+ */
+static int __attribute__((unused))
+fetch_pte(AMDVIAddressSpace *as, hwaddr address, uint64_t dte, uint64_t *pte,
+          hwaddr *page_size)
+{
+    IOMMUAccessFlags perms = amdvi_get_perms(dte);
+
+    uint8_t level, mode;
+    uint64_t pte_addr;
+
+    *pte = dte;
+    *page_size = 0;
+
+    if (perms == IOMMU_NONE) {
+        return -AMDVI_FR_PT_ROOT_INV;
+    }
+
+    /*
+     * The Linux kernel driver initializes the default mode to 3, corresponding
+     * to a 39-bit GPA space, where each entry in the pagetable translates to a
+     * 1GB (2^30) page size.
+     */
+    level = mode = get_pte_translation_mode(dte);
+    assert(mode > 0 && mode < 7);
+
+    /*
+     * If IOVA is larger than the max supported by the current pgtable level,
+     * there is nothing to do.
+     */
+    if (address > PT_LEVEL_MAX_ADDR(mode - 1)) {
+        /* IOVA too large for the current DTE */
+        return -AMDVI_FR_PT_ROOT_INV;
+    }
+
+    do {
+        level -= 1;
+
+        /* Update the page_size */
+        *page_size = PTE_LEVEL_PAGE_SIZE(level);
+
+        /* Permission bits are ANDed at every level, including the DTE */
+        perms &= amdvi_get_perms(*pte);
+        if (perms == IOMMU_NONE) {
+            return 0;
+        }
+
+        /* Not Present */
+        if (!IOMMU_PTE_PRESENT(*pte)) {
+            return 0;
+        }
+
+        /* Large or Leaf PTE found */
+        if (PTE_NEXT_LEVEL(*pte) == 7 || PTE_NEXT_LEVEL(*pte) == 0) {
+            /* Leaf PTE found */
+            break;
+        }
+
+        /*
+         * Index the pgtable using the IOVA bits corresponding to current level
+         * and walk down to the lower level.
+         */
+        pte_addr = NEXT_PTE_ADDR(*pte, level, address);
+        *pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
+
+        if (*pte == (uint64_t)-1) {
+            /*
+             * A returned PTE of -1 indicates a failure to read the page table
+             * entry from guest memory.
+             */
+            if (level == mode - 1) {
+                /* Failure to retrieve the Page Table from Root Pointer */
+                *page_size = 0;
+                return -AMDVI_FR_PT_ROOT_INV;
+            } else {
+                /* Failure to read PTE. Page walk skips a page_size chunk */
+                return -AMDVI_FR_PT_ENTRY_INV;
+            }
+        }
+    } while (level > 0);
+
+    assert(PTE_NEXT_LEVEL(*pte) == 0 || PTE_NEXT_LEVEL(*pte) == 7 ||
+           level == 0);
+    /*
+     * Page walk ends when Next Level field on PTE shows that either a leaf PTE
+     * or a series of large PTEs have been reached. In the latter case, even if
+     * the range starts in the middle of a contiguous page, the returned PTE
+     * must be the first PTE of the series.
+     */
+    if (PTE_NEXT_LEVEL(*pte) == 7) {
+        /* Update page_size with the large PTE page size */
+        *page_size = large_pte_page_size(*pte);
+    }
+
+    return 0;
+}
+
 /* log error without aborting since linux seems to be using reserved bits */
 static void amdvi_inval_devtab_entry(AMDVIState *s, uint64_t *cmd)
 {
-- 
MST



  parent reply	other threads:[~2025-10-05 19:26 UTC|newest]

Thread overview: 84+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-05 19:16 [PULL 00/75] virtio,pci,pc: features, fixes Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 01/75] net: bundle all offloads in a single struct Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 02/75] linux-headers: deal with counted_by annotation Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 03/75] linux-headers: Update to Linux v6.17-rc1 Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 04/75] virtio: introduce extended features type Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 05/75] virtio: serialize extended features state Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 06/75] virtio: add support for negotiating extended features Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 07/75] virtio-pci: implement support for " Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 08/75] vhost: add support for negotiating " Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 09/75] qmp: update virtio features map to support " Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 10/75] vhost-backend: implement extended features support Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 11/75] vhost-net: " Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 12/75] virtio-net: " Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 13/75] net: implement tunnel probing Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 14/75] net: implement UDP tunnel features offloading Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 15/75] Revert "hw/acpi/ghes: Make ghes_record_cper_errors() static" Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 16/75] acpi/ghes: Cleanup the code which gets ghes ged state Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 17/75] acpi/ghes: prepare to change the way HEST offsets are calculated Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 18/75] acpi/ghes: add a firmware file with HEST address Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 19/75] acpi/ghes: Use HEST table offsets when preparing GHES records Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 20/75] acpi/ghes: don't hard-code the number of sources for HEST table Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 21/75] acpi/ghes: add a notifier to notify when error data is ready Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 22/75] acpi/generic_event_device: Update GHES migration to cover hest addr Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 23/75] acpi/generic_event_device: add logic to detect if HEST addr is available Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 24/75] acpi/generic_event_device: add an APEI error device Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 25/75] tests/acpi: virt: allow acpi table changes at DSDT and HEST tables Michael S. Tsirkin
2025-10-05 19:16 ` [PULL 26/75] arm/virt: Wire up a GED error device for ACPI / GHES Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 27/75] qapi/acpi-hest: add an interface to do generic CPER error injection Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 28/75] acpi/generic_event_device.c: enable use_hest_addr for QEMU 10.x Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 29/75] tests/acpi: virt: update HEST and DSDT tables Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 30/75] docs: hest: add new "etc/acpi_table_hest_addr" and update workflow Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 31/75] scripts/ghes_inject: add a script to generate GHES error inject Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 32/75] hw/smbios: allow clearing the VM bit in SMBIOS table 0 Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 33/75] hw/i386/pc: Avoid overlap between CXL window and PCI 64bit BARs in QEMU Michael S. Tsirkin
2025-10-06 17:08   ` Michael Tokarev
2025-10-05 19:17 ` [PULL 34/75] pcie_sriov: Fix broken MMIO accesses from SR-IOV VFs Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 35/75] hw/virtio: rename vhost-user-device and make user creatable Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 36/75] smbios: cap DIMM size to 2Tb as workaround for broken Windows Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 37/75] pcie: Add a way to get the outstanding page request allocation (pri) from the config space Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 38/75] intel_iommu: Bypass barrier wait descriptor Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 39/75] intel_iommu: Declare PRI constants and structures Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 40/75] intel_iommu: Declare registers for PRI Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 41/75] intel_iommu: Add PRI operations support Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 42/75] x86: ich9: fix default value of 'No Reboot' bit in GCS Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 43/75] vhost: use virtio_config_get_guest_notifier() Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 44/75] virtio: unify virtio_notify_irqfd() and virtio_notify() Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 45/75] virtio: support irqfd in virtio_notify_config() Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 46/75] tests/libqos: extract qvirtqueue_set_avail_idx() Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 47/75] tests/virtio-scsi: add a virtio_error() IOThread test Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 48/75] pcie_sriov: make pcie_sriov_pf_exit() safe on non-SR-IOV devices Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 49/75] virtio: Add function name to error messages Michael S. Tsirkin
2025-10-05 20:13   ` Alessandro Ratti
2025-10-05 20:24     ` Michael S. Tsirkin
2025-10-08 10:01     ` Michael S. Tsirkin
2025-10-08 16:53       ` Alessandro Ratti
2025-10-05 20:19   ` [PULL v2 75/75] virtio: improve virtqueue mapping " Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 50/75] memory: Adjust event ranges to fit within notifier boundaries Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 51/75] amd_iommu: Document '-device amd-iommu' common options Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 52/75] amd_iommu: Reorder device and page table helpers Michael S. Tsirkin
2025-10-05 19:17 ` [PULL 53/75] amd_iommu: Helper to decode size of page invalidation command Michael S. Tsirkin
2025-10-05 19:18 ` [PULL 54/75] amd_iommu: Add helper function to extract the DTE Michael S. Tsirkin
2025-10-05 19:18 ` [PULL 55/75] amd_iommu: Return an error when unable to read PTE from guest memory Michael S. Tsirkin
2025-10-05 19:18 ` Michael S. Tsirkin [this message]
2025-10-05 19:18 ` [PULL 57/75] amd_iommu: Add a page walker to sync shadow page tables on invalidation Michael S. Tsirkin
2025-10-05 19:18 ` [PULL 58/75] amd_iommu: Add basic structure to support IOMMU notifier updates Michael S. Tsirkin
2025-10-05 19:18 ` [PULL 59/75] amd_iommu: Sync shadow page tables on page invalidation Michael S. Tsirkin
2025-10-05 19:18 ` [PULL 60/75] amd_iommu: Use iova_tree records to determine large page size on UNMAP Michael S. Tsirkin
2025-10-05 19:18 ` [PULL 61/75] amd_iommu: Unmap all address spaces under the AMD IOMMU on reset Michael S. Tsirkin
2025-10-05 19:18 ` [PULL 62/75] amd_iommu: Add replay callback Michael S. Tsirkin
2025-10-05 19:18 ` [PULL 63/75] amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL Michael S. Tsirkin
2025-10-05 19:18 ` [PULL 64/75] amd_iommu: Toggle memory regions based on address translation mode Michael S. Tsirkin
2025-10-05 19:18 ` [PULL 65/75] amd_iommu: Set all address spaces to use passthrough mode on reset Michael S. Tsirkin
2025-10-05 19:18 ` [PULL 66/75] amd_iommu: Add dma-remap property to AMD vIOMMU device Michael S. Tsirkin
2025-10-05 19:18 ` [PULL 67/75] amd_iommu: Toggle address translation mode on devtab entry invalidation Michael S. Tsirkin
2025-10-05 19:18 ` [PULL 68/75] amd_iommu: Do not assume passthrough translation when DTE[TV]=0 Michael S. Tsirkin
2025-10-05 19:18 ` [PULL 69/75] amd_iommu: Refactor amdvi_page_walk() to use common code for page walk Michael S. Tsirkin
2025-10-05 19:18 ` [PULL 70/75] intel-iommu: Move dma_translation to x86-iommu Michael S. Tsirkin
2025-10-05 19:18 ` [PULL 71/75] amd_iommu: HATDis/HATS=11 support Michael S. Tsirkin
2025-10-05 19:18 ` [PULL 72/75] vdpa-dev: add get_vhost() callback for vhost-vdpa device Michael S. Tsirkin
2025-10-05 19:18 ` [PULL 73/75] intel_iommu: Enable Enhanced Set Root Table Pointer Support (ESRTPS) Michael S. Tsirkin
2025-10-05 19:18 ` [PULL 74/75] intel_iommu: Simplify caching mode check with VFIO device Michael S. Tsirkin
2025-10-05 19:18 ` [PULL 75/75] pci: Fix wrong parameter passing to pci_device_get_iommu_bus_devfn() Michael S. Tsirkin
2025-10-05 20:20 ` [PULL 00/75] virtio,pci,pc: features, fixes Michael S. Tsirkin
2025-10-06 21:59 ` Richard Henderson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=39789107398c5c554efe0ff3726b3b3671dfb376.1759691708.git.mst@redhat.com \
    --to=mst@redhat.com \
    --cc=alejandro.j.jimenez@oracle.com \
    --cc=eduardo@habkost.net \
    --cc=marcel.apfelbaum@gmail.com \
    --cc=pbonzini@redhat.com \
    --cc=peter.maydell@linaro.org \
    --cc=qemu-devel@nongnu.org \
    --cc=richard.henderson@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).