public inbox for qemu-devel@nongnu.org
From: David Hoppenbrouwers <qemu@demindiro.com>
To: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>,
	qemu-devel@nongnu.org
Cc: sarunkod@amd.com, qemu@demindiro.com, mst@redhat.com,
	clement.mathieu--drif@eviden.com, pbonzini@redhat.com,
	richard.henderson@linaro.org, eduardo@habkost.net,
	boris.ostrovsky@oracle.com
Subject: Re: [PATCH 1/2] amd_iommu: Follow root pointer before page walk and use 1-based levels
Date: Thu, 12 Mar 2026 18:01:54 +0100	[thread overview]
Message-ID: <5c6c85df-c3ec-4bea-bc7b-f1b768766a12@demindiro.com> (raw)
In-Reply-To: <20260311203943.2309841-2-alejandro.j.jimenez@oracle.com>

On 3/11/26 9:39 PM, Alejandro Jimenez wrote:
> DTE[Mode] and PTE NextLevel encode page table levels as 1-based values, but
> fetch_pte() currently uses a 0-based level counter, making the logic
> harder to follow and requiring conversions between DTE mode and level.
> 
> Switch the page table walk logic to use 1-based level accounting in
> fetch_pte() and the relevant macro helpers. To further simplify the page
> walking loop, split the root page table access from the walk i.e. rework
> fetch_pte() to follow the DTE Page Table Root Pointer and retrieve the top
> level pagetable entry before entering the loop, then iterate only over the
> PDE/PTE entries.
> 
> The reworked algorithm fixes a page walk bug where the page size was
> calculated for the next level before checking if the current PTE was already
> a leaf/hugepage. That caused hugepage mappings to be reported as 4K pages,
> leading to performance degradation and failures in some setups.
> 
> Fixes: a74bb3110a5b ("amd_iommu: Add helpers to walk AMD v1 Page Table format")
> Cc: qemu-stable@nongnu.org
> Reported-by: David Hoppenbrouwers <qemu@demindiro.com>
> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> ---
>   hw/i386/amd_iommu.c | 132 ++++++++++++++++++++++++++++++--------------
>   hw/i386/amd_iommu.h |  11 ++--
>   2 files changed, 97 insertions(+), 46 deletions(-)
> 
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index 789e09d6f2..991c6c379a 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -648,6 +648,52 @@ static uint64_t large_pte_page_size(uint64_t pte)
>       return PTE_LARGE_PAGE_SIZE(pte);
>   }
>   
> +/*
> + * Validate DTE fields and extract permissions and top level data required to
> + * initiate the page table walk.
> + *
> + * On success, returns 0 and stores:
> + *   - top_level: highest page-table level encoded in DTE[Mode]
> + *   - dte_perms: effective permissions from the DTE
> + *
> + * On failure, returns -AMDVI_FR_PT_ROOT_INV. This includes cases where:
> + *   - DTE permissions disallow read AND write
> + *   - DTE[Mode] is invalid for translation
> + *   - IOVA exceeds the address width supported by DTE[Mode]
> + * In all such cases a page walk must be aborted.
> + */
> +static uint64_t amdvi_get_top_pt_level_and_perms(hwaddr address, uint64_t dte,
> +                                       uint8_t *top_level,
> +                                       IOMMUAccessFlags *dte_perms)
> +{
> +    *dte_perms = amdvi_get_perms(dte);
> +    if (*dte_perms == IOMMU_NONE) {
> +        return -AMDVI_FR_PT_ROOT_INV;
> +    }
> +
> +    /* Verify that a valid mode is encoded in the DTE */
> +    *top_level = get_pte_translation_mode(dte);
> +
> +    /*
> +     * Page Table Root pointer is only valid for GPA->SPA translation on
> +     * supported modes.
> +     */
> +    if (*top_level == 0 || *top_level > 6) {
> +        return -AMDVI_FR_PT_ROOT_INV;
> +    }
> +
> +    /*
> +     * If IOVA is larger than the max supported by the highest pgtable level,
> +     * there is nothing to do.
> +     */
> +    if (address > PT_LEVEL_MAX_ADDR(*top_level)) {
> +        /* IOVA too large for the current DTE */
> +        return -AMDVI_FR_PT_ROOT_INV;
> +    }
> +
> +    return 0;
> +}
> +
>   /*
>    * Helper function to fetch a PTE using AMD v1 pgtable format.
>    * On successful page walk, returns 0 and pte parameter points to a valid PTE.
> @@ -662,40 +708,49 @@ static uint64_t large_pte_page_size(uint64_t pte)
>   static uint64_t fetch_pte(AMDVIAddressSpace *as, hwaddr address, uint64_t dte,
>                             uint64_t *pte, hwaddr *page_size)
>   {
> -    IOMMUAccessFlags perms = amdvi_get_perms(dte);
> -
> -    uint8_t level, mode;
>       uint64_t pte_addr;
> +    uint8_t pt_level, next_pt_level;
> +    IOMMUAccessFlags perms;
> +    int ret;
>   
> -    *pte = dte;
>       *page_size = 0;
>   
> -    if (perms == IOMMU_NONE) {
> -        return -AMDVI_FR_PT_ROOT_INV;
> -    }
> -
>       /*
> -     * The Linux kernel driver initializes the default mode to 3, corresponding
> -     * to a 39-bit GPA space, where each entry in the pagetable translates to a
> -     * 1GB (2^30) page size.
> +     * Verify the DTE is properly configured before page walk, and extract
> +     * top pagetable level and permissions.
>        */
> -    level = mode = get_pte_translation_mode(dte);
> -    assert(mode > 0 && mode < 7);
> +    ret = amdvi_get_top_pt_level_and_perms(address, dte, &pt_level, &perms);
> +    if (ret < 0) {
> +        return ret;
> +    }
>   
>       /*
> -     * If IOVA is larger than the max supported by the current pgtable level,
> -     * there is nothing to do.
> +     * Retrieve the top pagetable entry by following the DTE Page Table Root
> +     * Pointer and indexing the top level table using the IOVA from the request.
>        */
> -    if (address > PT_LEVEL_MAX_ADDR(mode - 1)) {
> -        /* IOVA too large for the current DTE */
> +    pte_addr = NEXT_PTE_ADDR(dte, pt_level, address);
> +    *pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
> +
> +    if (*pte == (uint64_t)-1) {
> +        /*
> +         * A returned PTE of -1 here indicates a failure to read the top level
> +         * page table from guest memory. A page walk is not possible and page
> +         * size must be returned as 0.
> +         */
>           return -AMDVI_FR_PT_ROOT_INV;
>       }
>   
> -    do {
> -        level -= 1;
> +    /*
> +     * Calculate page size for the top level page table entry.
> +     * This ensures correct results for a single level Page Table setup.
> +     */
> +    *page_size = PTE_LEVEL_PAGE_SIZE(pt_level);
>   
> -        /* Update the page_size */
> -        *page_size = PTE_LEVEL_PAGE_SIZE(level);
> +    /*
> +     * The root page table entry and its level have been determined. Begin the
> +     * page walk.
> +     */
> +    while (pt_level > 0) {
>   
>           /* Permission bits are ANDed at every level, including the DTE */
>           perms &= amdvi_get_perms(*pte);
> @@ -708,37 +763,34 @@ static uint64_t fetch_pte(AMDVIAddressSpace *as, hwaddr address, uint64_t dte,
>               return 0;
>           }
>   
> +        next_pt_level = PTE_NEXT_LEVEL(*pte);
> +
>           /* Large or Leaf PTE found */
> -        if (PTE_NEXT_LEVEL(*pte) == 7 || PTE_NEXT_LEVEL(*pte) == 0) {
> +        if (next_pt_level == 0 || next_pt_level == 7) {
>               /* Leaf PTE found */
>               break;
>           }
>   
> +        pt_level = next_pt_level;
> +
>           /*
> -         * Index the pgtable using the IOVA bits corresponding to current level
> -         * and walk down to the lower level.
> +         * The current entry is a Page Directory Entry. Descend to the lower
> +         * page table level encoded in current pte, and index the new table
> +         * using the appropriate IOVA bits to retrieve the new entry.
>            */
> -        pte_addr = NEXT_PTE_ADDR(*pte, level, address);
> +        *page_size = PTE_LEVEL_PAGE_SIZE(pt_level);
> +
> +        pte_addr = NEXT_PTE_ADDR(*pte, pt_level, address);
>           *pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
>   
>           if (*pte == (uint64_t)-1) {
> -            /*
> -             * A returned PTE of -1 indicates a failure to read the page table
> -             * entry from guest memory.
> -             */
> -            if (level == mode - 1) {
> -                /* Failure to retrieve the Page Table from Root Pointer */
> -                *page_size = 0;
> -                return -AMDVI_FR_PT_ROOT_INV;
> -            } else {
> -                /* Failure to read PTE. Page walk skips a page_size chunk */
> -                return -AMDVI_FR_PT_ENTRY_INV;
> -            }
> +            /* Failure to read PTE. Page walk skips a page_size chunk */
> +            return -AMDVI_FR_PT_ENTRY_INV;
>           }
> -    } while (level > 0);
> +    }
> +
> +    assert(PTE_NEXT_LEVEL(*pte) == 0 || PTE_NEXT_LEVEL(*pte) == 7);
>   
> -    assert(PTE_NEXT_LEVEL(*pte) == 0 || PTE_NEXT_LEVEL(*pte) == 7 ||
> -           level == 0);
>       /*
>        * Page walk ends when Next Level field on PTE shows that either a leaf PTE
>        * or a series of large PTEs have been reached. In the latter case, even if
> diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
> index 302ccca512..7af3c742b7 100644
> --- a/hw/i386/amd_iommu.h
> +++ b/hw/i386/amd_iommu.h
> @@ -186,17 +186,16 @@
>   
>   #define IOMMU_PTE_PRESENT(pte)          ((pte) & AMDVI_PTE_PR)
>   
> -/* Using level=0 for leaf PTE at 4K page size */
> -#define PT_LEVEL_SHIFT(level)           (12 + ((level) * 9))
> +/* Using level=1 for leaf PTE at 4K page size */
> +#define PT_LEVEL_SHIFT(level)           (12 + (((level) - 1) * 9))
>   
>   /* Return IOVA bit group used to index the Page Table at specific level */
>   #define PT_LEVEL_INDEX(level, iova)     (((iova) >> PT_LEVEL_SHIFT(level)) & \
>                                           GENMASK64(8, 0))
>   
> -/* Return the max address for a specified level i.e. max_oaddr */
> -#define PT_LEVEL_MAX_ADDR(x)    (((x) < 5) ? \
> -                                ((1ULL << PT_LEVEL_SHIFT((x + 1))) - 1) : \
> -                                (~(0ULL)))
> +/* Return the maximum output address for a specified page table level */
> +#define PT_LEVEL_MAX_ADDR(level)    (((level) > 5) ? (~(0ULL)) : \
> +                                    ((1ULL << PT_LEVEL_SHIFT((level) + 1)) - 1))
>   
>   /* Extract the NextLevel field from PTE/PDE */
>   #define PTE_NEXT_LEVEL(pte)     (((pte) & AMDVI_PTE_NEXT_LEVEL_MASK) >> 9)
Reviewed-by: David Hoppenbrouwers <qemu@demindiro.com>



Thread overview: 7+ messages
2026-03-11 20:39 [PATCH 0/2] amd_iommu: Fix page size reporting on hugepage mappings Alejandro Jimenez
2026-03-11 20:39 ` [PATCH 1/2] amd_iommu: Follow root pointer before page walk and use 1-based levels Alejandro Jimenez
2026-03-12 17:01   ` David Hoppenbrouwers [this message]
2026-03-23 11:01   ` Sairaj Kodilkar
2026-03-24 14:25     ` Alejandro Jimenez
2026-03-26  6:52       ` Sairaj Kodilkar
2026-03-11 20:39 ` [PATCH 2/2] amd_iommu: Reject non-decreasing NextLevel in fetch_pte() Alejandro Jimenez
