public inbox for qemu-devel@nongnu.org
 help / color / mirror / Atom feed
* [PATCH 0/2] amd_iommu: Fix page size reporting on hugepage mappings
@ 2026-03-11 20:39 Alejandro Jimenez
  2026-03-11 20:39 ` [PATCH 1/2] amd_iommu: Follow root pointer before page walk and use 1-based levels Alejandro Jimenez
  2026-03-11 20:39 ` [PATCH 2/2] amd_iommu: Reject non-decreasing NextLevel in fetch_pte() Alejandro Jimenez
  0 siblings, 2 replies; 7+ messages in thread
From: Alejandro Jimenez @ 2026-03-11 20:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: sarunkod, qemu, mst, clement.mathieu--drif, pbonzini,
	richard.henderson, eduardo, boris.ostrovsky, alejandro.j.jimenez

David Hoppenbrouwers reported an issue in the page walking code where
hugepage mappings were reported as 4K pages in certain configurations:
https://lore.kernel.org/qemu-devel/20260225145831.28275-1-qemu@demindiro.com/

While the change proposed in that thread fixes that specific bug, there are
opportunities to simplify the algorithm by using 1-based level accounting
(matching DTE[Mode] and PTE NextLevel semantics).

The second change enforces a rule from the AMD-Vi specification which
requires that NextLevel must be strictly lower than the current level to
guarantee that the page walk eventually completes.

These changes make the page walk logic easier to follow, keep page size
tracking aligned with the current page table level, and prevent invalid
guest tables from causing page walk loops.

For stable: Please pick both patches together; PATCH 2 can be squashed into
PATCH 1 for ease of backporting.

Thank you, David and Sairaj, for the detailed reporting and analysis of the
original issue and the proposed fixes.

Alejandro

Alejandro Jimenez (2):
  amd_iommu: Follow root pointer before page walk and use 1-based levels
  amd_iommu: Reject non-decreasing NextLevel in fetch_pte()

 hw/i386/amd_iommu.c | 136 +++++++++++++++++++++++++++++++-------------
 hw/i386/amd_iommu.h |  11 ++--
 2 files changed, 101 insertions(+), 46 deletions(-)


base-commit: 1fd5ff9d76d23ab23a68419cbc76d5ee33e8b455
-- 
2.47.3



^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/2] amd_iommu: Follow root pointer before page walk and use 1-based levels
  2026-03-11 20:39 [PATCH 0/2] amd_iommu: Fix page size reporting on hugepage mappings Alejandro Jimenez
@ 2026-03-11 20:39 ` Alejandro Jimenez
  2026-03-12 17:01   ` David Hoppenbrouwers
  2026-03-23 11:01   ` Sairaj Kodilkar
  2026-03-11 20:39 ` [PATCH 2/2] amd_iommu: Reject non-decreasing NextLevel in fetch_pte() Alejandro Jimenez
  1 sibling, 2 replies; 7+ messages in thread
From: Alejandro Jimenez @ 2026-03-11 20:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: sarunkod, qemu, mst, clement.mathieu--drif, pbonzini,
	richard.henderson, eduardo, boris.ostrovsky, alejandro.j.jimenez

DTE[Mode] and PTE NextLevel encode page table levels as 1-based values, but
fetch_pte() currently uses a 0-based level counter, making the logic
harder to follow and requiring conversions between DTE mode and level.

Switch the page table walk logic to use 1-based level accounting in
fetch_pte() and the relevant macro helpers. To further simplify the page
walking loop, split the root page table access from the walk, i.e. rework
fetch_pte() to follow the DTE Page Table Root Pointer and retrieve the top
level pagetable entry before entering the loop, then iterate only over the
PDE/PTE entries.

The reworked algorithm fixes a page walk bug where the page size was
calculated for the next level before checking if the current PTE was already
a leaf/hugepage. That caused hugepage mappings to be reported as 4K pages,
leading to performance degradation and failures in some setups.

Fixes: a74bb3110a5b ("amd_iommu: Add helpers to walk AMD v1 Page Table format")
Cc: qemu-stable@nongnu.org
Reported-by: David Hoppenbrouwers <qemu@demindiro.com>
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 132 ++++++++++++++++++++++++++++++--------------
 hw/i386/amd_iommu.h |  11 ++--
 2 files changed, 97 insertions(+), 46 deletions(-)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 789e09d6f2..991c6c379a 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -648,6 +648,52 @@ static uint64_t large_pte_page_size(uint64_t pte)
     return PTE_LARGE_PAGE_SIZE(pte);
 }
 
+/*
+ * Validate DTE fields and extract permissions and top level data required to
+ * initiate the page table walk.
+ *
+ * On success, returns 0 and stores:
+ *   - top_level: highest page-table level encoded in DTE[Mode]
+ *   - dte_perms: effective permissions from the DTE
+ *
+ * On failure, returns -AMDVI_FR_PT_ROOT_INV. This includes cases where:
+ *   - DTE permissions disallow read AND write
+ *   - DTE[Mode] is invalid for translation
+ *   - IOVA exceeds the address width supported by DTE[Mode]
+ * In all such cases a page walk must be aborted.
+ */
+static uint64_t amdvi_get_top_pt_level_and_perms(hwaddr address, uint64_t dte,
+                                       uint8_t *top_level,
+                                       IOMMUAccessFlags *dte_perms)
+{
+    *dte_perms = amdvi_get_perms(dte);
+    if (*dte_perms == IOMMU_NONE) {
+        return -AMDVI_FR_PT_ROOT_INV;
+    }
+
+    /* Verifying a valid mode is encoded in DTE */
+    *top_level = get_pte_translation_mode(dte);
+
+    /*
+     * Page Table Root pointer is only valid for GPA->SPA translation on
+     * supported modes.
+     */
+    if (*top_level == 0 || *top_level > 6) {
+        return -AMDVI_FR_PT_ROOT_INV;
+    }
+
+    /*
+     * If IOVA is larger than the max supported by the highest pgtable level,
+     * there is nothing to do.
+     */
+    if (address > PT_LEVEL_MAX_ADDR(*top_level)) {
+        /* IOVA too large for the current DTE */
+        return -AMDVI_FR_PT_ROOT_INV;
+    }
+
+    return 0;
+}
+
 /*
  * Helper function to fetch a PTE using AMD v1 pgtable format.
  * On successful page walk, returns 0 and pte parameter points to a valid PTE.
@@ -662,40 +708,49 @@ static uint64_t large_pte_page_size(uint64_t pte)
 static uint64_t fetch_pte(AMDVIAddressSpace *as, hwaddr address, uint64_t dte,
                           uint64_t *pte, hwaddr *page_size)
 {
-    IOMMUAccessFlags perms = amdvi_get_perms(dte);
-
-    uint8_t level, mode;
     uint64_t pte_addr;
+    uint8_t pt_level, next_pt_level;
+    IOMMUAccessFlags perms;
+    int ret;
 
-    *pte = dte;
     *page_size = 0;
 
-    if (perms == IOMMU_NONE) {
-        return -AMDVI_FR_PT_ROOT_INV;
-    }
-
     /*
-     * The Linux kernel driver initializes the default mode to 3, corresponding
-     * to a 39-bit GPA space, where each entry in the pagetable translates to a
-     * 1GB (2^30) page size.
+     * Verify the DTE is properly configured before page walk, and extract
+     * top pagetable level and permissions.
      */
-    level = mode = get_pte_translation_mode(dte);
-    assert(mode > 0 && mode < 7);
+    ret = amdvi_get_top_pt_level_and_perms(address, dte, &pt_level, &perms);
+    if (ret < 0) {
+        return ret;
+    }
 
     /*
-     * If IOVA is larger than the max supported by the current pgtable level,
-     * there is nothing to do.
+     * Retrieve the top pagetable entry by following the DTE Page Table Root
+     * Pointer and indexing the top level table using the IOVA from the request.
      */
-    if (address > PT_LEVEL_MAX_ADDR(mode - 1)) {
-        /* IOVA too large for the current DTE */
+    pte_addr = NEXT_PTE_ADDR(dte, pt_level, address);
+    *pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
+
+    if (*pte == (uint64_t)-1) {
+        /*
+         * A returned PTE of -1 here indicates a failure to read the top level
+         * page table from guest memory. A page walk is not possible and page
+         * size must be returned as 0.
+         */
         return -AMDVI_FR_PT_ROOT_INV;
     }
 
-    do {
-        level -= 1;
+    /*
+     * Calculate page size for the top level page table entry.
+     * This ensures correct results for a single level Page Table setup.
+     */
+    *page_size = PTE_LEVEL_PAGE_SIZE(pt_level);
 
-        /* Update the page_size */
-        *page_size = PTE_LEVEL_PAGE_SIZE(level);
+    /*
+     * The root page table entry and its level have been determined. Begin the
+     * page walk.
+     */
+    while (pt_level > 0) {
 
         /* Permission bits are ANDed at every level, including the DTE */
         perms &= amdvi_get_perms(*pte);
@@ -708,37 +763,34 @@ static uint64_t fetch_pte(AMDVIAddressSpace *as, hwaddr address, uint64_t dte,
             return 0;
         }
 
+        next_pt_level = PTE_NEXT_LEVEL(*pte);
+
         /* Large or Leaf PTE found */
-        if (PTE_NEXT_LEVEL(*pte) == 7 || PTE_NEXT_LEVEL(*pte) == 0) {
+        if (next_pt_level == 0 || next_pt_level == 7) {
             /* Leaf PTE found */
             break;
         }
 
+        pt_level = next_pt_level;
+
         /*
-         * Index the pgtable using the IOVA bits corresponding to current level
-         * and walk down to the lower level.
+         * The current entry is a Page Directory Entry. Descend to the lower
+         * page table level encoded in current pte, and index the new table
+         * using the appropriate IOVA bits to retrieve the new entry.
          */
-        pte_addr = NEXT_PTE_ADDR(*pte, level, address);
+        *page_size = PTE_LEVEL_PAGE_SIZE(pt_level);
+
+        pte_addr = NEXT_PTE_ADDR(*pte, pt_level, address);
         *pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
 
         if (*pte == (uint64_t)-1) {
-            /*
-             * A returned PTE of -1 indicates a failure to read the page table
-             * entry from guest memory.
-             */
-            if (level == mode - 1) {
-                /* Failure to retrieve the Page Table from Root Pointer */
-                *page_size = 0;
-                return -AMDVI_FR_PT_ROOT_INV;
-            } else {
-                /* Failure to read PTE. Page walk skips a page_size chunk */
-                return -AMDVI_FR_PT_ENTRY_INV;
-            }
+            /* Failure to read PTE. Page walk skips a page_size chunk */
+            return -AMDVI_FR_PT_ENTRY_INV;
         }
-    } while (level > 0);
+    }
+
+    assert(PTE_NEXT_LEVEL(*pte) == 0 || PTE_NEXT_LEVEL(*pte) == 7);
 
-    assert(PTE_NEXT_LEVEL(*pte) == 0 || PTE_NEXT_LEVEL(*pte) == 7 ||
-           level == 0);
     /*
      * Page walk ends when Next Level field on PTE shows that either a leaf PTE
      * or a series of large PTEs have been reached. In the latter case, even if
diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
index 302ccca512..7af3c742b7 100644
--- a/hw/i386/amd_iommu.h
+++ b/hw/i386/amd_iommu.h
@@ -186,17 +186,16 @@
 
 #define IOMMU_PTE_PRESENT(pte)          ((pte) & AMDVI_PTE_PR)
 
-/* Using level=0 for leaf PTE at 4K page size */
-#define PT_LEVEL_SHIFT(level)           (12 + ((level) * 9))
+/* Using level=1 for leaf PTE at 4K page size */
+#define PT_LEVEL_SHIFT(level)           (12 + (((level) - 1) * 9))
 
 /* Return IOVA bit group used to index the Page Table at specific level */
 #define PT_LEVEL_INDEX(level, iova)     (((iova) >> PT_LEVEL_SHIFT(level)) & \
                                         GENMASK64(8, 0))
 
-/* Return the max address for a specified level i.e. max_oaddr */
-#define PT_LEVEL_MAX_ADDR(x)    (((x) < 5) ? \
-                                ((1ULL << PT_LEVEL_SHIFT((x + 1))) - 1) : \
-                                (~(0ULL)))
+/* Return the maximum output address for a specified page table level */
+#define PT_LEVEL_MAX_ADDR(level)    (((level) > 5) ? (~(0ULL)) : \
+                                    ((1ULL << PT_LEVEL_SHIFT((level) + 1)) - 1))
 
 /* Extract the NextLevel field from PTE/PDE */
 #define PTE_NEXT_LEVEL(pte)     (((pte) & AMDVI_PTE_NEXT_LEVEL_MASK) >> 9)
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 2/2] amd_iommu: Reject non-decreasing NextLevel in fetch_pte()
  2026-03-11 20:39 [PATCH 0/2] amd_iommu: Fix page size reporting on hugepage mappings Alejandro Jimenez
  2026-03-11 20:39 ` [PATCH 1/2] amd_iommu: Follow root pointer before page walk and use 1-based levels Alejandro Jimenez
@ 2026-03-11 20:39 ` Alejandro Jimenez
  1 sibling, 0 replies; 7+ messages in thread
From: Alejandro Jimenez @ 2026-03-11 20:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: sarunkod, qemu, mst, clement.mathieu--drif, pbonzini,
	richard.henderson, eduardo, boris.ostrovsky, alejandro.j.jimenez

The AMD-Vi specification requires that the NextLevel field for a page table
entry must not be greater than or equal to the current page table entry
level. Enforce this to avoid infinite page walk loops on corrupted or buggy
guest page tables.

The initial implementation of fetch_pte() did not implement this check, but
was not vulnerable since the page walk code explicitly decremented the level
instead of retrieving it from the page table entry.

Cc: qemu-stable@nongnu.org
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
---
 hw/i386/amd_iommu.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 991c6c379a..a5c873b705 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -771,6 +771,10 @@ static uint64_t fetch_pte(AMDVIAddressSpace *as, hwaddr address, uint64_t dte,
             break;
         }
 
+        /* Next level must always be less than current level */
+        if (pt_level <= next_pt_level) {
+            return -AMDVI_FR_PT_ENTRY_INV;
+        }
         pt_level = next_pt_level;
 
         /*
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] amd_iommu: Follow root pointer before page walk and use 1-based levels
  2026-03-11 20:39 ` [PATCH 1/2] amd_iommu: Follow root pointer before page walk and use 1-based levels Alejandro Jimenez
@ 2026-03-12 17:01   ` David Hoppenbrouwers
  2026-03-23 11:01   ` Sairaj Kodilkar
  1 sibling, 0 replies; 7+ messages in thread
From: David Hoppenbrouwers @ 2026-03-12 17:01 UTC (permalink / raw)
  To: Alejandro Jimenez, qemu-devel
  Cc: sarunkod, qemu, mst, clement.mathieu--drif, pbonzini,
	richard.henderson, eduardo, boris.ostrovsky

On 3/11/26 9:39 PM, Alejandro Jimenez wrote:
> DTE[Mode] and PTE NextLevel encode page table levels as 1-based values, but
> fetch_pte() currently uses a 0-based level counter, making the logic
> harder to follow and requiring conversions between DTE mode and level.
> 
> [...]
Reviewed-by: David Hoppenbrouwers <qemu@demindiro.com>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] amd_iommu: Follow root pointer before page walk and use 1-based levels
  2026-03-11 20:39 ` [PATCH 1/2] amd_iommu: Follow root pointer before page walk and use 1-based levels Alejandro Jimenez
  2026-03-12 17:01   ` David Hoppenbrouwers
@ 2026-03-23 11:01   ` Sairaj Kodilkar
  2026-03-24 14:25     ` Alejandro Jimenez
  1 sibling, 1 reply; 7+ messages in thread
From: Sairaj Kodilkar @ 2026-03-23 11:01 UTC (permalink / raw)
  To: Alejandro Jimenez, qemu-devel
  Cc: sarunkod, qemu, mst, clement.mathieu--drif, pbonzini,
	richard.henderson, eduardo, boris.ostrovsky



On 3/12/2026 2:09 AM, Alejandro Jimenez wrote:
> DTE[Mode] and PTE NextLevel encode page table levels as 1-based values, but
> fetch_pte() currently uses a 0-based level counter, making the logic
> harder to follow and requiring conversions between DTE mode and level.
>
> [...]
> +                                    ((1ULL << PT_LEVEL_SHIFT((level) + 1)) - 1))
>   
>   /* Extract the NextLevel field from PTE/PDE */
>   #define PTE_NEXT_LEVEL(pte)     (((pte) & AMDVI_PTE_NEXT_LEVEL_MASK) >> 9)

Hi Alejandro,

amdvi_sync_shadow_page_table_range() does not check whether the DTE
**valid** and **translation valid** bits are set. I added this check in
my reply to David's patch. Do you think we should include that check
here as well, to ensure that only DTEs with the **valid** and
**translation valid** bits set are passed to fetch_pte()?

Thanks
Sairaj Kodilkar



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] amd_iommu: Follow root pointer before page walk and use 1-based levels
  2026-03-23 11:01   ` Sairaj Kodilkar
@ 2026-03-24 14:25     ` Alejandro Jimenez
  2026-03-26  6:52       ` Sairaj Kodilkar
  0 siblings, 1 reply; 7+ messages in thread
From: Alejandro Jimenez @ 2026-03-24 14:25 UTC (permalink / raw)
  To: Sairaj Kodilkar, qemu-devel
  Cc: qemu, mst, clement.mathieu--drif, pbonzini, richard.henderson,
	eduardo, boris.ostrovsky



On 3/23/26 7:01 AM, Sairaj Kodilkar wrote:
> 
> 
> On 3/12/2026 2:09 AM, Alejandro Jimenez wrote:
>> DTE[Mode] and PTE NextLevel encode page table levels as 1-based values, but
>> fetch_pte() currently uses a 0-based level counter, making the logic
>> harder to follow and requiring conversions between DTE mode and level.
>>
>> Switch the page table walk logic to use 1-based level accounting in
>> fetch_pte() and the relevant macro helpers. To further simplify the page
>> walking loop, split the root page table access from the walk i.e. rework
>> fetch_pte() to follow the DTE Page Table Root Pointer and retrieve the top
>> level pagetable entry before entering the loop, then iterate only over the
>> PDE/PTE entries.
>>
>> The reworked algorithm fixes a page walk bug where the page size was
>> calculated for the next level before checking if the current PTE was already
>> a leaf/hugepage. That caused hugepage mappings to be reported as 4K pages,
>> leading to performance degradation and failures in some setups.
>>
>> Fixes: a74bb3110a5b ("amd_iommu: Add helpers to walk AMD v1 Page Table
>> format")
>> Cc: qemu-stable@nongnu.org
>> Reported-by: David Hoppenbrouwers <qemu@demindiro.com>
>> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
>> ---
>>   hw/i386/amd_iommu.c | 132 ++++++++++++++++++++++++++++++--------------
>>   hw/i386/amd_iommu.h |  11 ++--
>>   2 files changed, 97 insertions(+), 46 deletions(-)
>>

[...]

> 
> Hi Alejandro,
> 
> amdvi_sync_shadow_page_table_range() does not check whether the DTE
> **valid** and **translation valid** bits are set. I added this check in
> my reply to David's patch. Do you think we should include that check
> here as well, to ensure that only DTEs with the **valid** and
> **translation valid** bits set are passed to fetch_pte()?
> 

Right now the checks for DTE[V] and DTE[TV] are implemented in
amdvi_as_to_dte(), which all callers of
amdvi_sync_shadow_page_table_range() use to extract the DTE that will be
passed as its argument. So I think if we were to add a check, an assert
would be more appropriate.

On the other hand, I am worried about sprinkling assertions around the
code, and I am actually considering removing the ones the original series
added to amdvi_sync_shadow_page_table_range(), fetch_pte(), and
amdvi_notify_iommu(), since any failure in these cases is likely to cause
immediate symptoms, and I don't believe it creates any window for data
corruption.

With that in mind, to answer your question: I would not add any additional
check in amdvi_sync_shadow_page_table_range(), and would rely on the
current convention of checking when retrieving a DTE via the interface we
provided for it, which all the code already uses, i.e. amdvi_as_to_dte().


For a follow-up cleanup, maybe amdvi_as_to_dte() should be given a more
suggestive name that clearly indicates it validates those conditions. In
any case, all of its callers rely on the fact that it performs this
verification, and its return values also encode it.

While looking into this I noticed a chance for minor cleanup in this flow:
amdvi_do_translate()
    amdvi_as_to_dte()
    amdvi_page_walk()

Where amdvi_page_walk() does an unnecessary check for DTE[TV], which
amdvi_as_to_dte() already ensures is set.

Thank you,
Alejandro

> Thanks
> Sairaj Kodilkar
> 




* Re: [PATCH 1/2] amd_iommu: Follow root pointer before page walk and use 1-based levels
  2026-03-24 14:25     ` Alejandro Jimenez
@ 2026-03-26  6:52       ` Sairaj Kodilkar
  0 siblings, 0 replies; 7+ messages in thread
From: Sairaj Kodilkar @ 2026-03-26  6:52 UTC (permalink / raw)
  To: Alejandro Jimenez, qemu-devel
  Cc: sarunkod, qemu, mst, clement.mathieu--drif, pbonzini,
	richard.henderson, eduardo, boris.ostrovsky



On 3/24/2026 7:55 PM, Alejandro Jimenez wrote:
>
> On 3/23/26 7:01 AM, Sairaj Kodilkar wrote:
>>
>> On 3/12/2026 2:09 AM, Alejandro Jimenez wrote:
>>> DTE[Mode] and PTE NextLevel encode page table levels as 1-based values, but
>>> fetch_pte() currently uses a 0-based level counter, making the logic
>>> harder to follow and requiring conversions between DTE mode and level.
>>>
>>> Switch the page table walk logic to use 1-based level accounting in
>>> fetch_pte() and the relevant macro helpers. To further simplify the page
>>> walking loop, split the root page table access from the walk i.e. rework
>>> fetch_pte() to follow the DTE Page Table Root Pointer and retrieve the top
>>> level pagetable entry before entering the loop, then iterate only over the
>>> PDE/PTE entries.
>>>
>>> The reworked algorithm fixes a page walk bug where the page size was
>>> calculated for the next level before checking if the current PTE was already
>>> a leaf/hugepage. That caused hugepage mappings to be reported as 4K pages,
>>> leading to performance degradation and failures in some setups.
>>>
>>> Fixes: a74bb3110a5b ("amd_iommu: Add helpers to walk AMD v1 Page Table
>>> format")
>>> Cc: qemu-stable@nongnu.org
>>> Reported-by: David Hoppenbrouwers <qemu@demindiro.com>
>>> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
>>> ---
>>>    hw/i386/amd_iommu.c | 132 ++++++++++++++++++++++++++++++--------------
>>>    hw/i386/amd_iommu.h |  11 ++--
>>>    2 files changed, 97 insertions(+), 46 deletions(-)
>>>
> [...]
>
>> Hi Alejandro,
>>
>> amdvi_sync_shadow_page_table_range() does not check whether the DTE
>> **valid** and **translation valid** bits are set. I added this check in
>> my reply to David's patch. Do you think we should include that check
>> here as well, to ensure that only DTEs with the **valid** and
>> **translation valid** bits set are passed to fetch_pte()?
>>
> Right now the checks for DTE[V] and DTE[TV] are implemented in
> amdvi_as_to_dte(), which all callers of
> amdvi_sync_shadow_page_table_range() use to extract the DTE that will be
> passed as its argument. So I think if we were to add a check, an assert
> would be more appropriate.
>
> On the other hand, I am worried about sprinkling assertions around the
> code, and I am actually considering removing the ones the original series
> added to amdvi_sync_shadow_page_table_range(), fetch_pte(), and
> amdvi_notify_iommu(), since any failures in these cases are likely to cause
> immediate symptoms and I don't believe it creates any window for data
> corruption.
>
> With that in mind, to answer your question I would not add any additional
> check on amdvi_sync_shadow_page_table_range(), and would rely on the
> current convention to check when retrieving a DTE via the interface we
> provided for it and all the code already uses i.e. amdvi_as_to_dte().
>
>
> For a follow up cleanup, maybe amdvi_as_to_dte() should be renamed with a
> more suggestive name that clearly indicates that it validates for those
> conditions. But in context, all of its callers rely on the fact that it
> does this verification and its return values also encode it.
>
> While looking into this I noticed a chance for minor cleanup in this flow:
> amdvi_do_translate()
>      amdvi_as_to_dte()
>      amdvi_page_walk()
>
> Where amdvi_page_walk() does an unnecessary check for DTE[TV], which
> amdvi_as_to_dte() already ensures is set.

Hi Alejandro,
I agree with you about sprinkling assertions; I think some of the assertions
should ideally deliver events to the guest instead. But we can do this in a
later series.

Also for both the patches

Reviewed-by: Sairaj Kodilkar <sarunkod@amd.com>

Thanks
Sairaj

>
> Thank you,
> Alejandro
>
>> Thanks
>> Sairaj Kodilkar
>>




end of thread, other threads:[~2026-03-26  6:58 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-11 20:39 [PATCH 0/2] amd_iommu: Fix page size reporting on hugepage mappings Alejandro Jimenez
2026-03-11 20:39 ` [PATCH 1/2] amd_iommu: Follow root pointer before page walk and use 1-based levels Alejandro Jimenez
2026-03-12 17:01   ` David Hoppenbrouwers
2026-03-23 11:01   ` Sairaj Kodilkar
2026-03-24 14:25     ` Alejandro Jimenez
2026-03-26  6:52       ` Sairaj Kodilkar
2026-03-11 20:39 ` [PATCH 2/2] amd_iommu: Reject non-decreasing NextLevel in fetch_pte() Alejandro Jimenez

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox