* Re: [PATCH 06/17] KVM: x86/tdp_mmu: Morph the !is_frozen_spte() check into a KVM_MMU_WARN_ON()
[not found] ` <20260327201421.2824383-7-rick.p.edgecombe@intel.com>
@ 2026-03-30 5:00 ` Yan Zhao
2026-03-31 16:37 ` Edgecombe, Rick P
0 siblings, 1 reply; 43+ messages in thread
From: Yan Zhao @ 2026-03-30 5:00 UTC (permalink / raw)
To: Rick Edgecombe
Cc: seanjc, pbonzini, kai.huang, kvm, kas, linux-kernel, x86,
dave.hansen
On Fri, Mar 27, 2026 at 01:14:10PM -0700, Rick Edgecombe wrote:
> Remove the conditional logic for handling the setting of mirror EPTs to
Should we unify the terms "mirror EPTs," "mirror TDP," and "mirror page tables"
in this series?
> frozen in __tdp_mmu_set_spte_atomic() and add it as a warning instead.
>
> Mirror TDP needs propagate PTE changes to the to the external TDP. This
Two "to the".
> presents a problem for atomic updates which can't update both at once. So
> a special value, FROZEN_SPTE, is used as a temporary state during these
> updates to prevent concurrent operations to the PTE. If the TDP MMU tried
> to install this as a long term value, it would confuse these updates.
> Despite this __tdp_mmu_set_spte_atomic() includes a check to handle it
> being set. Remove this check and turn it into a warning.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 0809fe8e8737..338957bc5109 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -656,7 +656,13 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> */
> WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
>
> - if (is_mirror_sptep(iter->sptep) && !is_frozen_spte(new_spte)) {
> + /*
> + * FROZEN_SPTE is a temporary state and should never be set via higher
> + * level helpers.
> + */
> + KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
Why is KVM_MMU_WARN_ON() used here for new_spte while WARN_ON_ONCE() is used
above for old_spte?
> + if (is_mirror_sptep(iter->sptep)) {
> int ret;
>
> ret = set_external_spte_present(kvm, iter->sptep, iter->gfn,
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
[not found] ` <20260327201421.2824383-8-rick.p.edgecombe@intel.com>
@ 2026-03-30 6:14 ` Yan Zhao
2026-04-01 23:45 ` Edgecombe, Rick P
2026-03-31 10:09 ` Huang, Kai
2026-04-01 8:34 ` Yan Zhao
2 siblings, 1 reply; 43+ messages in thread
From: Yan Zhao @ 2026-03-30 6:14 UTC (permalink / raw)
To: Rick Edgecombe
Cc: seanjc, pbonzini, kai.huang, kvm, kas, linux-kernel, x86,
dave.hansen
On Fri, Mar 27, 2026 at 01:14:11PM -0700, Rick Edgecombe wrote:
> From: Sean Christopherson <seanjc@google.com>
>
> Centralize the updates to present external PTEs to the
> handle_changed_spte() function.
>
> When setting a PTE to present in the mirror page tables, the update needs
> to propagate to the external page tables (in TDX parlance the S-EPT).
> Today this is handled by special mirror page tables branching in
> __tdp_mmu_set_spte_atomic(), which is the only place where present PTEs
> are set for TDX.
>
> This keeps things running, but is a bit hacked on. The hook for setting
> present leaf PTEs are added only where TDX happens to need them. For
> example, TDX does not support any of the operations that use the
> non-atomic variant, tdp_mmu_set_spte() to set present PTEs. Since the hook
> is missing there, it is very hard to understand the code from a non-TDX
> lens. If the reader doesn’t know the TDX specifics it could look like the
> external update is missing.
>
> In addition to being confusing, it also litters the TDP MMU with
> "external" update callbacks. This is especially unfortunate because there
> is already a central place to react to TDP updates, handle_changed_spte().
>
> Begin the process of moving towards a model where all mirror page table
> updates are forwarded to TDX code where the TDX specific logic can live
> with a more proper separation of concerns. Do this by teaching
> handle_changed_spte() how to return error codes, such that it can
> propagate the failures that may come from TDX external page table updates.
>
> Atomic mirror page table updates need to be done in a special way to
> prevent concurrent updates to the mirror page table while the external
> page table is updated. The mirror page table is set to the frozen PTE
> value while the external version is updates. This frozen PTE dance is
"is updates" --> "is updating"?
> currently done in __tdp_mmu_set_spte_atomic(). Hoist it up a level so that
> the external update in handle_changed_spte() can be done while the PTE is
> frozen.
>
> Link: https://lore.kernel.org/lkml/aYYn0nf2cayYu8e7@google.com/
> Not-yet-Signed-off-by: Sean Christopherson <seanjc@google.com>
> [Based on a diff by Sean Chrisopherson]
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 150 ++++++++++++++++++++++---------------
> 1 file changed, 88 insertions(+), 62 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 338957bc5109..db16e81b9701 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -320,9 +320,9 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
> }
> }
>
> -static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> - u64 old_spte, u64 new_spte, int level,
> - bool shared);
> +static void handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
> + gfn_t gfn, u64 old_spte, u64 new_spte,
> + int level, bool shared);
>
> static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> @@ -471,8 +471,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
> old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte,
> FROZEN_SPTE, level);
> }
> - handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
> - old_spte, FROZEN_SPTE, level, shared);
> + handle_changed_spte(kvm, sp, gfn, old_spte, FROZEN_SPTE, level, shared);
>
> if (is_mirror_sp(sp)) {
> KVM_BUG_ON(shared, kvm);
> @@ -508,22 +507,15 @@ static void *get_external_spt(gfn_t gfn, u64 new_spte, int level)
> return NULL;
> }
>
> -static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
> - gfn_t gfn, u64 *old_spte,
> - u64 new_spte, int level)
> +static int __must_check set_external_spte_present(struct kvm *kvm,
> + gfn_t gfn, u64 old_spte,
> + u64 new_spte, int level)
> {
> bool is_present = is_shadow_present_pte(new_spte);
> bool is_leaf = is_present && is_last_spte(new_spte, level);
> int ret = 0;
>
> lockdep_assert_held(&kvm->mmu_lock);
> - /*
> - * We need to lock out other updates to the SPTE until the external
> - * page table has been modified. Use FROZEN_SPTE similar to
> - * the zapping case.
> - */
> - if (!try_cmpxchg64(rcu_dereference(sptep), old_spte, FROZEN_SPTE))
> - return -EBUSY;
>
> /*
> * Use different call to either set up middle level
> @@ -537,17 +529,13 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
> KVM_BUG_ON(!external_spt, kvm);
> ret = kvm_x86_call(link_external_spt)(kvm, gfn, level, external_spt);
> }
> - if (ret)
> - __kvm_tdp_mmu_write_spte(sptep, *old_spte);
> - else
> - __kvm_tdp_mmu_write_spte(sptep, new_spte);
> return ret;
> }
>
> /**
> - * handle_changed_spte - handle bookkeeping associated with an SPTE change
> + * __handle_changed_spte - handle bookkeeping associated with an SPTE change
> * @kvm: kvm instance
> - * @as_id: the address space of the paging structure the SPTE was a part of
> + * @sp: the page table in which the SPTE resides
> * @gfn: the base GFN that was mapped by the SPTE
> * @old_spte: The value of the SPTE before the change
> * @new_spte: The value of the SPTE after the change
> @@ -560,15 +548,17 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
> * dirty logging updates are handled in common code, not here (see make_spte()
> * and fast_pf_fix_direct_spte()).
> */
> -static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> - u64 old_spte, u64 new_spte, int level,
> - bool shared)
> +static int __handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
> + gfn_t gfn, u64 old_spte, u64 new_spte,
> + int level, bool shared)
> {
> bool was_present = is_shadow_present_pte(old_spte);
> bool is_present = is_shadow_present_pte(new_spte);
> bool was_leaf = was_present && is_last_spte(old_spte, level);
> bool is_leaf = is_present && is_last_spte(new_spte, level);
> bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
> + int as_id = kvm_mmu_page_as_id(sp);
> + int r;
>
> WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
> WARN_ON_ONCE(level < PG_LEVEL_4K);
> @@ -598,9 +588,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> }
>
> if (old_spte == new_spte)
> - return;
> -
> - trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
> + return 0;
>
> if (is_leaf)
> check_spte_writable_invariants(new_spte);
> @@ -627,21 +615,41 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> "a temporary frozen SPTE.\n"
> "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
> as_id, gfn, old_spte, new_spte, level);
> - return;
> + return 0;
> }
>
> - if (is_leaf != was_leaf)
> - kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
> -
> /*
> * Recursively handle child PTs if the change removed a subtree from
> * the paging structure. Note the WARN on the PFN changing without the
> * SPTE being converted to a hugepage (leaf) or being zapped. Shadow
> * pages are kernel allocations and should never be migrated.
> + *
> + * When modifying leaf entries in mirrored page tables, propagate all
> + * changes to the external SPTE.
> */
> if (was_present && !was_leaf &&
> - (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> + (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
> handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> + } else if (is_mirror_sp(sp) && is_present) {
> + r = set_external_spte_present(kvm, gfn, old_spte, new_spte, level);
> + if (r)
> + return r;
> + }
> +
> + trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
> +
> + if (is_leaf != was_leaf)
> + kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
> +
> + return 0;
> +}
> +
> +static void handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
> + gfn_t gfn, u64 old_spte, u64 new_spte,
> + int level, bool shared)
> +{
> + KVM_BUG_ON(__handle_changed_spte(kvm, sp, gfn, old_spte, new_spte,
> + level, shared), kvm);
> }
>
> static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> @@ -657,32 +665,14 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
>
> /*
> - * FROZEN_SPTE is a temporary state and should never be set via higher
> - * level helpers.
> + * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
> + * does not hold the mmu_lock. On failure, i.e. if a different logical
> + * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
> + * the current value, so the caller operates on fresh data, e.g. if it
> + * retries tdp_mmu_set_spte_atomic()
> */
> - KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> -
> - if (is_mirror_sptep(iter->sptep)) {
> - int ret;
> -
> - ret = set_external_spte_present(kvm, iter->sptep, iter->gfn,
> - &iter->old_spte, new_spte, iter->level);
> - if (ret)
> - return ret;
> - } else {
> - u64 *sptep = rcu_dereference(iter->sptep);
> -
> - /*
> - * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs
> - * and does not hold the mmu_lock. On failure, i.e. if a
> - * different logical CPU modified the SPTE, try_cmpxchg64()
> - * updates iter->old_spte with the current value, so the caller
> - * operates on fresh data, e.g. if it retries
> - * tdp_mmu_set_spte_atomic()
> - */
> - if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
> - return -EBUSY;
> - }
> + if (!try_cmpxchg64(rcu_dereference(iter->sptep), &iter->old_spte, new_spte))
> + return -EBUSY;
>
> return 0;
> }
> @@ -708,18 +698,49 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
> struct tdp_iter *iter,
> u64 new_spte)
> {
> + struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter->sptep));
> int ret;
>
> lockdep_assert_held_read(&kvm->mmu_lock);
>
> - ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> + /* KVM should never freeze SPTEs using higher level APIs. */
"higher level API" is ambiguous. e.g. kvm_tdp_mmu_write_spte_atomic() allows
new_spte to be FROZEN_SPTE.
What about just "callers of tdp_mmu_set_spte_atomic() should not freeze SPTEs
directly"?
> + KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> +
> + /*
> + * Temporarily freeze the SPTE until the external PTE operation has
> + * completed (unless the new SPTE itself will be frozen), e.g. so that
> + * concurrent faults don't attempt to install a child PTE in the
> + * external page table before the parent PTE has been written, or try
> + * to re-install a page table before the old one was removed.
> + */
> + if (is_mirror_sptep(iter->sptep))
> + ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
> + else
> + ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> if (ret)
> return ret;
>
> - handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> - new_spte, iter->level, true);
> + ret = __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> + new_spte, iter->level, true);
What about adding a comment for the tricky part for the mirror page table:
while new_spte is set to FROZEN_SPTE in the above __tdp_mmu_set_spte_atomic()
for freezing the mirror page table, the original new_spte from the caller of
tdp_mmu_set_spte_atomic() is passed to __handle_changed_spte() in order to
properly update statistics and propagate to the external page table.
> - return 0;
> + /*
> + * Unfreeze the mirror SPTE. If updating the external SPTE failed,
> + * restore the old SPTE so that the SPTE isn't frozen in perpetuity,
> + * otherwise set the mirror SPTE to the new desired value.
> + */
> + if (is_mirror_sptep(iter->sptep)) {
> + if (ret)
> + __kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte);
> + else
> + __kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
> + } else {
> + /*
> + * Bug the VM if handling the change failed, as failure is only
> + * allowed if KVM couldn't update the external SPTE.
> + */
> + KVM_BUG_ON(ret, kvm);
> + }
> + return ret;
> }
>
> /*
> @@ -738,6 +759,8 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
> static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> u64 old_spte, u64 new_spte, gfn_t gfn, int level)
> {
> + struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(sptep));
> +
> lockdep_assert_held_write(&kvm->mmu_lock);
>
> /*
> @@ -751,7 +774,7 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
>
> old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
>
> - handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
> + handle_changed_spte(kvm, sp, gfn, old_spte, new_spte, level, false);
>
> /*
> * Users that do non-atomic setting of PTEs don't operate on mirror
> @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm, struct tdp_iter *iter)
> {
> u64 new_spte;
>
> + if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> + return;
> +
Add a comment for why mirror page table is not expected here?
And do we need a similar WARN_ON_ONCE() in kvm_tdp_mmu_clear_dirty_pt_masked()
or clear_dirty_pt_masked()?
> if (spte_ad_enabled(iter->old_spte)) {
> iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
> shadow_accessed_mask);
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 08/17] KVM: TDX: Drop kvm_x86_ops.link_external_spt(), use .set_external_spte() for all
[not found] ` <20260327201421.2824383-9-rick.p.edgecombe@intel.com>
@ 2026-03-30 6:28 ` Yan Zhao
0 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2026-03-30 6:28 UTC (permalink / raw)
To: Rick Edgecombe
Cc: seanjc, pbonzini, kai.huang, kvm, kas, linux-kernel, x86,
dave.hansen
On Fri, Mar 27, 2026 at 01:14:12PM -0700, Rick Edgecombe wrote:
> From: Sean Christopherson <seanjc@google.com>
>
> Drop the dedicated .link_external_spt() for linking non-leaf S-EPT pages,
> and instead funnel everything through .set_external_spte(). Using separate
> hooks doesn't help prevent TDP MMU details from bleeding into TDX, and vice
> versa; to the contrary, dedicated callbacks will result in _more_ pollution
> when hugepage support is added, e.g. will require the TDP MMU to know
> details about the splitting rules for TDX that aren't all that relevant to
> the TDP MMU.
>
> Ideally, KVM would provide a single pair of hooks to set S-EPT entries,
> one hook for setting SPTEs under write-lock and another for settings SPTEs
> under read-lock (e.g. to ensure the entire operation is "atomic", to allow
> for failure, etc.). Sadly, TDX's requirement that all child S-EPT entries
> are removed before the parent makes that impractical: the TDP MMU
> deliberately prunes non-leaf SPTEs and _then_ processes its children, thus
> making it quite important for the TDP MMU to differentiate between zapping
> leaf and non-leaf S-EPT entries.
>
> However, that's the _only_ case that's truly special, and even that case
> could be shoehorned into a single hook; it's just wouldn't be a net
"it's" --> "it".
> positive.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 09/17] KVM: TDX: Add helper to handle mapping leaf SPTE into S-EPT
[not found] ` <20260327201421.2824383-10-rick.p.edgecombe@intel.com>
@ 2026-03-30 6:43 ` Yan Zhao
2026-04-01 23:59 ` Edgecombe, Rick P
0 siblings, 1 reply; 43+ messages in thread
From: Yan Zhao @ 2026-03-30 6:43 UTC (permalink / raw)
To: Rick Edgecombe
Cc: seanjc, pbonzini, kai.huang, kvm, kas, linux-kernel, x86,
dave.hansen
On Fri, Mar 27, 2026 at 01:14:13PM -0700, Rick Edgecombe wrote:
> From: Sean Christopherson <seanjc@google.com>
>
> Add a helper, tdx_sept_map_leaf_spte(), to wrap and isolate PAGE.ADD and
> PAGE.AUG operations, and thus complete tdx_sept_set_private_spte()'s
> transition into a "dispatch" routine for setting/writing S-EPT entries.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> arch/x86/kvm/vmx/tdx.c | 21 +++++++++++++--------
> 1 file changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 92a846b91bac..361a75b42ae7 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1689,18 +1689,12 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
> return 0;
> }
>
> -static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> - enum pg_level level, u64 mirror_spte)
> +static int tdx_sept_map_leaf_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> + u64 mirror_spte)
> {
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> kvm_pfn_t pfn = spte_to_pfn(mirror_spte);
>
> - if (KVM_BUG_ON(!is_shadow_present_pte(mirror_spte), kvm))
> - return -EIO;
> -
> - if (!is_last_spte(mirror_spte, level))
> - return tdx_sept_link_private_spt(kvm, gfn, level, mirror_spte);
> -
> /* TODO: handle large pages. */
> if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> return -EIO;
> @@ -1725,7 +1719,18 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> return tdx_mem_page_aug(kvm, gfn, level, pfn);
> }
>
> +static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> + enum pg_level level, u64 mirror_spte)
> +{
>
> + if (KVM_BUG_ON(!is_shadow_present_pte(mirror_spte), kvm))
> + return -EIO;
> +
> + if (!is_last_spte(mirror_spte, level))
> + return tdx_sept_link_private_spt(kvm, gfn, level, mirror_spte);
For symmetry, what about renaming tdx_sept_link_private_spt() to
tdx_sept_map_nonleaf_spte()?
> +
> + return tdx_sept_map_leaf_spte(kvm, gfn, level, mirror_spte);
> +}
>
> /*
> * Ensure shared and private EPTs to be flushed on all vCPUs.
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 14/17] KVM: x86/mmu: Remove KVM_BUG_ON() that checks lock when removing PTs
[not found] ` <20260327201421.2824383-15-rick.p.edgecombe@intel.com>
@ 2026-03-30 7:01 ` Yan Zhao
2026-03-31 10:46 ` Huang, Kai
0 siblings, 1 reply; 43+ messages in thread
From: Yan Zhao @ 2026-03-30 7:01 UTC (permalink / raw)
To: Rick Edgecombe
Cc: seanjc, pbonzini, kai.huang, kvm, kas, linux-kernel, x86,
dave.hansen
On Fri, Mar 27, 2026 at 01:14:18PM -0700, Rick Edgecombe wrote:
> As part of an ongoing effort to move TDX specific bits from the MMU into
> the TDX code, drop the KVM_BUG_ON() that checks the MMU lock is held for
> write while removing page tables.
>
> Future changes forward PTE removal mirror EPT updates into the
> set_private_spte() and let TDX code parse the PTE to decide what S-EPT
> operations to take. This operations does not pass a shared bool for this
"operations" --> "operation" ?
> KVM_BUG_ON() to use in the logics future home.
By "in the logics future home", do you mean "when this logic is moved to its
future location"?
> But even today there are already MMU write lockdep asserts that mostly
> cover the case. Since the KVM_BUG_ON() is already a bit redundant, just
> remove it instead of trying to plumb the bool into TDX code.
>
> Link: https://lore.kernel.org/kvm/aYUarHf3KEwHGuJe@google.com/
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 991870789863..5dc9633c866e 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -473,10 +473,8 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
> }
> handle_changed_spte(kvm, sp, gfn, old_spte, FROZEN_SPTE, level, shared);
>
> - if (is_mirror_sp(sp)) {
> - KVM_BUG_ON(shared, kvm);
> + if (is_mirror_sp(sp))
> remove_external_spte(kvm, gfn, old_spte, level);
> - }
> }
>
> if (is_mirror_sp(sp) &&
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 17/17] KVM: TDX: Move external page table freeing to TDX code
[not found] ` <20260327201421.2824383-18-rick.p.edgecombe@intel.com>
@ 2026-03-30 7:49 ` Yan Zhao
2026-04-02 0:17 ` Edgecombe, Rick P
2026-03-31 11:02 ` Huang, Kai
1 sibling, 1 reply; 43+ messages in thread
From: Yan Zhao @ 2026-03-30 7:49 UTC (permalink / raw)
To: Rick Edgecombe
Cc: seanjc, pbonzini, kai.huang, kvm, kas, linux-kernel, x86,
dave.hansen
On Fri, Mar 27, 2026 at 01:14:21PM -0700, Rick Edgecombe wrote:
> From: Sean Christopherson <seanjc@google.com>
>
> Move the freeing of external page tables into the reclaim operation that
> lives in TDX code.
>
> The TDP MMU supports traversing the TDP without holding locks. Page
> tables needs to be freed via RCU to prevent walking one that gets
> freed.
>
> While none of these lockless walk operations actually happen for the
> mirror EPT, the TDP MMU none-the-less frees the mirror EPT page tables in
> the same way, and because it’s a handy place to plug it in, the external
> page tables as well.
>
> However, the external page tables definitely can’t be walked once they are
> reclaimed from the TDX module. The TDX module releases the page for the
> host VMM to use, so this RCU-time free is unnecessary for external page
> tables.
>
> So move the free_page() call to TDX code. Create an
> tdp_mmu_free_unused_sp() to allow for freeing external page tables that
> have never left the TDP MMU code (i.e. don’t need freed in a special way.
>
> Link: https://lore.kernel.org/kvm/aYpjNrtGmogNzqwT@google.com/
> Not-yet-Signed-off-by: Sean Christopherson <seanjc@google.com>
> [Based on a diff by Sean, added log]
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 16 +++++++++++-----
> arch/x86/kvm/vmx/tdx.c | 11 ++++++++++-
> 2 files changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 575033cc7fe4..18e11c1c7631 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -53,13 +53,18 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> rcu_barrier();
> }
>
> -static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
> +static void __tdp_mmu_free_sp(struct kvm_mmu_page *sp)
> {
> - free_page((unsigned long)sp->external_spt);
> free_page((unsigned long)sp->spt);
> kmem_cache_free(mmu_page_header_cache, sp);
> }
>
> +static void tdp_mmu_free_unused_sp(struct kvm_mmu_page *sp)
> +{
> + free_page((unsigned long)sp->external_spt);
> + __tdp_mmu_free_sp(sp);
> +}
> +
> /*
> * This is called through call_rcu in order to free TDP page table memory
> * safely with respect to other kernel threads that may be operating on
> @@ -73,7 +78,8 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
> struct kvm_mmu_page *sp = container_of(head, struct kvm_mmu_page,
> rcu_head);
>
> - tdp_mmu_free_sp(sp);
> + WARN_ON_ONCE(sp->external_spt);
> + __tdp_mmu_free_sp(sp);
> }
>
> void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
> @@ -1261,7 +1267,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> * failed, e.g. because a different task modified the SPTE.
> */
> if (r) {
> - tdp_mmu_free_sp(sp);
> + tdp_mmu_free_unused_sp(sp);
> goto retry;
> }
>
> @@ -1571,7 +1577,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> * installs its own sp in place of the last sp we tried to split.
> */
> if (sp)
> - tdp_mmu_free_sp(sp);
> + tdp_mmu_free_unused_sp(sp);
>
> return 0;
> }
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index d064b40a6b31..1346e891ca94 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1782,7 +1782,16 @@ static void tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
> */
> if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm) ||
> tdx_reclaim_page(virt_to_page(sp->external_spt)))
> - sp->external_spt = NULL;
> + goto out;
> +
> + /*
> + * Immediately free the S-EPT page as the TDX subsystem doesn't support
> + * freeing pages from RCU callbacks, and more importantly because
> + * TDH.PHYMEM.PAGE.RECLAIM ensures there are no outstanding readers.
> + */
> + free_page((unsigned long)sp->external_spt);
Since sp->external_spt is allocated in kvm_mmu_alloc_external_spt() in a similar
way as sp->spt in tdp_mmu_alloc_sp(), why bother making freeing of
sp->external_spt special? i.e., what's the downside of freeing it together with
sp->spt in tdp_mmu_free_sp() ?
> +out:
> + sp->external_spt = NULL;
> }
>
> static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 02/17] KVM: x86/mmu: Update iter->old_spte if cmpxchg64 on mirror SPTE "fails"
2026-03-31 9:47 ` [PATCH 02/17] KVM: x86/mmu: Update iter->old_spte if cmpxchg64 on mirror SPTE "fails" Huang, Kai
@ 2026-03-31 9:17 ` Yan Zhao
2026-03-31 9:59 ` Huang, Kai
0 siblings, 1 reply; 43+ messages in thread
From: Yan Zhao @ 2026-03-31 9:17 UTC (permalink / raw)
To: Huang, Kai
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
seanjc@google.com, Edgecombe, Rick P, Hansen, Dave,
linux-kernel@vger.kernel.org, x86@kernel.org
On Tue, Mar 31, 2026 at 05:47:29PM +0800, Huang, Kai wrote:
> On Fri, 2026-03-27 at 13:14 -0700, Rick Edgecombe wrote:
> > From: Sean Christopherson <seanjc@google.com>
> >
> > Pass a pointer to iter->old_spte, not simply its value, when setting an
> > external SPTE in __tdp_mmu_set_spte_atomic(), so that the iterator's value
> > will be updated if the cmpxchg64 to freeze the mirror SPTE fails. The bug
> > is currently benign as TDX is mutualy exclusive with all paths that do
> > "local" retry", e.g. clear_dirty_gfn_range() and wrprot_gfn_range().
> >
> > Fixes: 77ac7079e66d ("KVM: x86/tdp_mmu: Propagate building mirror page tables")
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > ---
> > arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++-----
> > 1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 7b1102d26f9c..dbaeb80f2b64 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -509,10 +509,10 @@ static void *get_external_spt(gfn_t gfn, u64 new_spte, int level)
> > }
> >
> > static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
> > - gfn_t gfn, u64 old_spte,
> > + gfn_t gfn, u64 *old_spte,
> > u64 new_spte, int level)
> > {
> > - bool was_present = is_shadow_present_pte(old_spte);
> > + bool was_present = is_shadow_present_pte(*old_spte);
> > bool is_present = is_shadow_present_pte(new_spte);
> > bool is_leaf = is_present && is_last_spte(new_spte, level);
> > int ret = 0;
> > @@ -525,7 +525,7 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
> > * page table has been modified. Use FROZEN_SPTE similar to
> > * the zapping case.
> > */
> > - if (!try_cmpxchg64(rcu_dereference(sptep), &old_spte, FROZEN_SPTE))
> > + if (!try_cmpxchg64(rcu_dereference(sptep), old_spte, FROZEN_SPTE))
> > return -EBUSY;
> >
> > /*
> > @@ -541,7 +541,7 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
> > ret = kvm_x86_call(link_external_spt)(kvm, gfn, level, external_spt);
> > }
> > if (ret)
> > - __kvm_tdp_mmu_write_spte(sptep, old_spte);
> > + __kvm_tdp_mmu_write_spte(sptep, *old_spte);
> > else
> > __kvm_tdp_mmu_write_spte(sptep, new_spte);
> > return ret;
> > @@ -670,7 +670,7 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> > return -EBUSY;
> >
> > ret = set_external_spte_present(kvm, iter->sptep, iter->gfn,
> > - iter->old_spte, new_spte, iter->level);
> > + &iter->old_spte, new_spte, iter->level);
> > if (ret)
> > return ret;
> > } else {
>
> The __tdp_mmu_set_spte_atomic() has a WARN() at the beginning to check the
> iter->old_spte isn't a frozen SPTE:
>
> WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
>
> Thinking more, I _think_ this patch could potentially trigger this WARNING
> due to now set_external_spte_present() will set iter->old_spte to
> FROZEN_SPTE when try_cmpxchg64() fails.
>
> Consider there are 3 vCPUs trying to accept the same GFN, and they all reach
> __tdp_mmu_set_spte_atomic() simultaneously. Assuming vCPU1 does the
>
> if (!try_cmpxchg64(rcu_dereference(sptep), old_spte, FROZEN_SPTE))
> return -EBUSY;
>
> .. successfully in set_external_spte_present(), then vCPU2 will fail on the
> try_cmpxchg64(), but this will cause iter->old_spte to be updated to
> FROZEN_SPTE.
>
> Then when vCPU3 enters __tdp_mmu_set_spte_atomic(), AFAICT the WARNING will
> be triggered due to is_frozen_spte(iter->old_spte) will now return true.
The failed caller needs to check "if (is_frozen_spte(iter.old_spte))" before
retrying, as in kvm_tdp_mmu_map()?
> Or did I miss anything?
>
> Also, AFAICT this issue doesn't exist for non-TDX case because there's no
> case tdp_mmu_set_spte_atomic() is called to set new_spte as FROZEN_SPTE in
> such case.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 02/17] KVM: x86/mmu: Update iter->old_spte if cmpxchg64 on mirror SPTE "fails"
2026-03-31 9:59 ` Huang, Kai
@ 2026-03-31 9:22 ` Yan Zhao
2026-03-31 10:14 ` Huang, Kai
0 siblings, 1 reply; 43+ messages in thread
From: Yan Zhao @ 2026-03-31 9:22 UTC (permalink / raw)
To: Huang, Kai
Cc: Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
kas@kernel.org, Hansen, Dave, linux-kernel@vger.kernel.org,
pbonzini@redhat.com, kvm@vger.kernel.org
On Tue, Mar 31, 2026 at 05:59:54PM +0800, Huang, Kai wrote:
> > >
> > > The __tdp_mmu_set_spte_atomic() has a WARN() at the beginning to check the
> > > iter->old_spte isn't a frozen SPTE:
> > >
> > > WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
> > >
> > > Thinking more, I _think_ this patch could potentially trigger this WARNING
> > > due to now set_external_spte_present() will set iter->old_spte to
> > > FROZEN_SPTE when try_cmpxchg64() fails.
> > >
> > > Consider there are 3 vCPUs trying to accept the same GFN, and they all reach
> > > __tdp_mmu_set_spte_atomic() simultaneously. Assuming vCPU1 does the
> > >
> > > if (!try_cmpxchg64(rcu_dereference(sptep), old_spte, FROZEN_SPTE))
> > > return -EBUSY;
> > >
> > > .. successfully in set_external_spte_present(), then vCPU2 will fail on the
> > > try_cmpxchg64(), but this will cause iter->old_spte to be updated to
> > > FROZEN_SPTE.
> > >
> > > Then when vCPU3 enters __tdp_mmu_set_spte_atomic(), AFAICT the WARNING will
> > > be triggered due to is_frozen_spte(iter->old_spte) will now return true.
> >
> > The failed caller needs to check "if (is_frozen_spte(iter.old_spte))" before
> > retrying, as in kvm_tdp_mmu_map()?
>
> It's possible the vCPU3 is already about to go into
> __tdp_mmu_set_spte_atomic() when iter.old_spte becomes FROZEN_SPTE.
Hmm, different vCPU's &iter shouldn't locate at the same memory, where iter is a
local variable.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 02/17] KVM: x86/mmu: Update iter->old_spte if cmpxchg64 on mirror SPTE "fails"
[not found] ` <20260327201421.2824383-3-rick.p.edgecombe@intel.com>
@ 2026-03-31 9:47 ` Huang, Kai
2026-03-31 9:17 ` Yan Zhao
0 siblings, 1 reply; 43+ messages in thread
From: Huang, Kai @ 2026-03-31 9:47 UTC (permalink / raw)
To: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
seanjc@google.com, Edgecombe, Rick P, Zhao, Yan Y
Cc: Hansen, Dave, linux-kernel@vger.kernel.org, x86@kernel.org
On Fri, 2026-03-27 at 13:14 -0700, Rick Edgecombe wrote:
> From: Sean Christopherson <seanjc@google.com>
>
> Pass a pointer to iter->old_spte, not simply its value, when setting an
> external SPTE in __tdp_mmu_set_spte_atomic(), so that the iterator's value
> will be updated if the cmpxchg64 to freeze the mirror SPTE fails. The bug
> is currently benign as TDX is mutualy exclusive with all paths that do
> "local" retry", e.g. clear_dirty_gfn_range() and wrprot_gfn_range().
>
> Fixes: 77ac7079e66d ("KVM: x86/tdp_mmu: Propagate building mirror page tables")
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7b1102d26f9c..dbaeb80f2b64 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -509,10 +509,10 @@ static void *get_external_spt(gfn_t gfn, u64 new_spte, int level)
> }
>
> static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
> - gfn_t gfn, u64 old_spte,
> + gfn_t gfn, u64 *old_spte,
> u64 new_spte, int level)
> {
> - bool was_present = is_shadow_present_pte(old_spte);
> + bool was_present = is_shadow_present_pte(*old_spte);
> bool is_present = is_shadow_present_pte(new_spte);
> bool is_leaf = is_present && is_last_spte(new_spte, level);
> int ret = 0;
> @@ -525,7 +525,7 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
> * page table has been modified. Use FROZEN_SPTE similar to
> * the zapping case.
> */
> - if (!try_cmpxchg64(rcu_dereference(sptep), &old_spte, FROZEN_SPTE))
> + if (!try_cmpxchg64(rcu_dereference(sptep), old_spte, FROZEN_SPTE))
> return -EBUSY;
>
> /*
> @@ -541,7 +541,7 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
> ret = kvm_x86_call(link_external_spt)(kvm, gfn, level, external_spt);
> }
> if (ret)
> - __kvm_tdp_mmu_write_spte(sptep, old_spte);
> + __kvm_tdp_mmu_write_spte(sptep, *old_spte);
> else
> __kvm_tdp_mmu_write_spte(sptep, new_spte);
> return ret;
> @@ -670,7 +670,7 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> return -EBUSY;
>
> ret = set_external_spte_present(kvm, iter->sptep, iter->gfn,
> - iter->old_spte, new_spte, iter->level);
> + &iter->old_spte, new_spte, iter->level);
> if (ret)
> return ret;
> } else {
The __tdp_mmu_set_spte_atomic() has a WARN() at the beginning to check the
iter->old_spte isn't a frozen SPTE:
WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
Thinking more, I _think_ this patch could potentially trigger this WARNING
due to now set_external_spte_present() will set iter->old_spte to
FROZEN_SPTE when try_cmpxchg64() fails.
Consider there are 3 vCPUs trying to accept the same GFN, and they all reach
__tdp_mmu_set_spte_atomic() simultaneously. Assuming vCPU1 does the
if (!try_cmpxchg64(rcu_dereference(sptep), old_spte, FROZEN_SPTE))
return -EBUSY;
.. successfully in set_external_spte_present(), then vCPU2 will fail on the
try_cmpxchg64(), but this will cause iter->old_spte to be updated to
FROZEN_SPTE.
Then when vCPU3 enters __tdp_mmu_set_spte_atomic(), AFAICT the WARNING will
be triggered due to is_frozen_spte(iter->old_spte) will now return true.
Or did I miss anything?
Also, AFAICT this issue doesn't exist for non-TDX case because there's no
case tdp_mmu_set_spte_atomic() is called to set new_spte as FROZEN_SPTE in
such case.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 02/17] KVM: x86/mmu: Update iter->old_spte if cmpxchg64 on mirror SPTE "fails"
2026-03-31 9:17 ` Yan Zhao
@ 2026-03-31 9:59 ` Huang, Kai
2026-03-31 9:22 ` Yan Zhao
0 siblings, 1 reply; 43+ messages in thread
From: Huang, Kai @ 2026-03-31 9:59 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
kas@kernel.org, Hansen, Dave, linux-kernel@vger.kernel.org,
pbonzini@redhat.com, kvm@vger.kernel.org
> >
> > The __tdp_mmu_set_spte_atomic() has a WARN() at the beginning to check the
> > iter->old_spte isn't a frozen SPTE:
> >
> > WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
> >
> > Thinking more, I _think_ this patch could potentially trigger this WARNING
> > due to now set_external_spte_present() will set iter->old_spte to
> > FROZEN_SPTE when try_cmpxchg64() fails.
> >
> > Consider there are 3 vCPUs trying to accept the same GFN, and they all reach
> > __tdp_mmu_set_spte_atomic() simultaneously. Assuming vCPU1 does the
> >
> > if (!try_cmpxchg64(rcu_dereference(sptep), old_spte, FROZEN_SPTE))
> > return -EBUSY;
> >
> > .. successfully in set_external_spte_present(), then vCPU2 will fail on the
> > try_cmpxchg64(), but this will cause iter->old_spte to be updated to
> > FROZEN_SPTE.
> >
> > Then when vCPU3 enters __tdp_mmu_set_spte_atomic(), AFAICT the WARNING will
> > be triggered due to is_frozen_spte(iter->old_spte) will now return true.
>
> The failed caller needs to check "if (is_frozen_spte(iter.old_spte))" before
> retrying, as in kvm_tdp_mmu_map()?
It's possible the vCPU3 is already about to go into
__tdp_mmu_set_spte_atomic() when iter.old_spte becomes FROZEN_SPTE.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
[not found] ` <20260327201421.2824383-8-rick.p.edgecombe@intel.com>
2026-03-30 6:14 ` [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs Yan Zhao
@ 2026-03-31 10:09 ` Huang, Kai
2026-04-01 23:58 ` Edgecombe, Rick P
2026-04-01 8:34 ` Yan Zhao
2 siblings, 1 reply; 43+ messages in thread
From: Huang, Kai @ 2026-03-31 10:09 UTC (permalink / raw)
To: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
seanjc@google.com, Edgecombe, Rick P, Zhao, Yan Y
Cc: Hansen, Dave, linux-kernel@vger.kernel.org, x86@kernel.org
> @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm, struct tdp_iter *iter)
> {
> u64 new_spte;
>
> + if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> + return;
> +
> if (spte_ad_enabled(iter->old_spte)) {
> iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
> shadow_accessed_mask);
AFAICT this chunk isn't related to "Centralize updates to present external
PTEs"? Should it be in a separate patch?
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 02/17] KVM: x86/mmu: Update iter->old_spte if cmpxchg64 on mirror SPTE "fails"
2026-03-31 9:22 ` Yan Zhao
@ 2026-03-31 10:14 ` Huang, Kai
0 siblings, 0 replies; 43+ messages in thread
From: Huang, Kai @ 2026-03-31 10:14 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: Hansen, Dave, seanjc@google.com, x86@kernel.org, kas@kernel.org,
Edgecombe, Rick P, linux-kernel@vger.kernel.org,
kvm@vger.kernel.org, pbonzini@redhat.com
On Tue, 2026-03-31 at 17:22 +0800, Yan Zhao wrote:
> On Tue, Mar 31, 2026 at 05:59:54PM +0800, Huang, Kai wrote:
> > > >
> > > > The __tdp_mmu_set_spte_atomic() has a WARN() at the beginning to check the
> > > > iter->old_spte isn't a frozen SPTE:
> > > >
> > > > WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
> > > >
> > > > Thinking more, I _think_ this patch could potentially trigger this WARNING
> > > > due to now set_external_spte_present() will set iter->old_spte to
> > > > FROZEN_SPTE when try_cmpxchg64() fails.
> > > >
> > > > Consider there are 3 vCPUs trying to accept the same GFN, and they all reach
> > > > __tdp_mmu_set_spte_atomic() simultaneously. Assuming vCPU1 does the
> > > >
> > > > if (!try_cmpxchg64(rcu_dereference(sptep), old_spte, FROZEN_SPTE))
> > > > return -EBUSY;
> > > >
> > > > .. successfully in set_external_spte_present(), then vCPU2 will fail on the
> > > > try_cmpxchg64(), but this will cause iter->old_spte to be updated to
> > > > FROZEN_SPTE.
> > > >
> > > > Then when vCPU3 enters __tdp_mmu_set_spte_atomic(), AFAICT the WARNING will
> > > > be triggered due to is_frozen_spte(iter->old_spte) will now return true.
> > >
> > > The failed caller needs to check "if (is_frozen_spte(iter.old_spte))" before
> > > retrying, as in kvm_tdp_mmu_map()?
> >
> > It's possible the vCPU3 is already about to go into
> > __tdp_mmu_set_spte_atomic() when iter.old_spte becomes FROZEN_SPTE.
> Hmm, different vCPU's &iter shouldn't locate at the same memory, where iter is a
> local variable.
Ah right. I missed that :-)
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 10/17] KVM: TDX: Move set_external_spte_present() assert into TDX code
[not found] ` <20260327201421.2824383-11-rick.p.edgecombe@intel.com>
@ 2026-03-31 10:30 ` Huang, Kai
2026-04-02 0:00 ` Edgecombe, Rick P
2026-03-31 10:34 ` Huang, Kai
1 sibling, 1 reply; 43+ messages in thread
From: Huang, Kai @ 2026-03-31 10:30 UTC (permalink / raw)
To: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
seanjc@google.com, Edgecombe, Rick P, Zhao, Yan Y
Cc: Hansen, Dave, linux-kernel@vger.kernel.org, x86@kernel.org
On Fri, 2026-03-27 at 13:14 -0700, Rick Edgecombe wrote:
> static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> enum pg_level level, u64 mirror_spte)
> {
> -
> if (KVM_BUG_ON(!is_shadow_present_pte(mirror_spte), kvm))
> return -EIO;
This line deleting isn't needed if it wasn't introduced in the previous
patch.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 10/17] KVM: TDX: Move set_external_spte_present() assert into TDX code
[not found] ` <20260327201421.2824383-11-rick.p.edgecombe@intel.com>
2026-03-31 10:30 ` [PATCH 10/17] KVM: TDX: Move set_external_spte_present() assert into TDX code Huang, Kai
@ 2026-03-31 10:34 ` Huang, Kai
1 sibling, 0 replies; 43+ messages in thread
From: Huang, Kai @ 2026-03-31 10:34 UTC (permalink / raw)
To: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
seanjc@google.com, Edgecombe, Rick P, Zhao, Yan Y
Cc: Hansen, Dave, linux-kernel@vger.kernel.org, x86@kernel.org
On Fri, 2026-03-27 at 13:14 -0700, Rick Edgecombe wrote:
> Move the MMU lockdep assert in set_external_spte_present() into the TDX
> specific op because the assert is TDX specific in intention.
>
> The TDP MMU has many lockdep asserts for various scenarios, and in fact
> the callchains that are used for TDX already have a lockdep assert which
> cover the case in set_external_spte_present().
>
cover -> covers
> However, these asserts are
> for management of the TDP root owned by KVM. In the
> set_external_spte_present() assert case, it is helping with a scheme to
> avoid contention in the TDX module during zap operations. That is very
> TDX specific.
>
> One option would be to just remove the assert in
> set_external_spte_present() and rely on the other ones in the TDP MMU. But
> that assert is for an a different intention, and too far away from the
"an a" -> a
> SEAMCALL that needs it.
>
> So move just move it to TDX code.
So just move it ...
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 11/17] KVM: x86/mmu: Fold set_external_spte_present() into its sole caller
[not found] ` <20260327201421.2824383-12-rick.p.edgecombe@intel.com>
@ 2026-03-31 10:36 ` Huang, Kai
2026-04-01 7:41 ` Yan Zhao
1 sibling, 0 replies; 43+ messages in thread
From: Huang, Kai @ 2026-03-31 10:36 UTC (permalink / raw)
To: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
seanjc@google.com, Edgecombe, Rick P, Zhao, Yan Y
Cc: Hansen, Dave, linux-kernel@vger.kernel.org, x86@kernel.org
On Fri, 2026-03-27 at 13:14 -0700, Rick Edgecombe wrote:
> From: Sean Christopherson <seanjc@google.com>
>
> Fold set_external_spte_present() into __tdp_mmu_set_spte_atomic() as all
Fold .. into __handle_changed_spte()?
> the other functionality besides calling the op.
>
all the other functionality besides calling the op is gone?
> It now is a single line
> helper that is called once.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 9 +--------
> 1 file changed, 1 insertion(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 6763537098ee..85c92aec868f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -494,13 +494,6 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
> call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
> }
>
> -static int __must_check set_external_spte_present(struct kvm *kvm,
> - gfn_t gfn, u64 old_spte,
> - u64 new_spte, int level)
> -{
> - return kvm_x86_call(set_external_spte)(kvm, gfn, level, new_spte);
> -}
> -
> /**
> * __handle_changed_spte - handle bookkeeping associated with an SPTE change
> * @kvm: kvm instance
> @@ -600,7 +593,7 @@ static int __handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
> (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
> handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> } else if (is_mirror_sp(sp) && is_present) {
> - r = set_external_spte_present(kvm, gfn, old_spte, new_spte, level);
> + r = kvm_x86_call(set_external_spte)(kvm, gfn, level, new_spte);
> if (r)
> return r;
> }
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 13/17] KVM: TDX: Hoist tdx_sept_remove_private_spte() above set_private_spte()
[not found] ` <20260327201421.2824383-14-rick.p.edgecombe@intel.com>
@ 2026-03-31 10:42 ` Huang, Kai
2026-04-02 0:04 ` Edgecombe, Rick P
0 siblings, 1 reply; 43+ messages in thread
From: Huang, Kai @ 2026-03-31 10:42 UTC (permalink / raw)
To: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
seanjc@google.com, Edgecombe, Rick P, Zhao, Yan Y
Cc: Hansen, Dave, linux-kernel@vger.kernel.org, x86@kernel.org
On Fri, 2026-03-27 at 13:14 -0700, Rick Edgecombe wrote:
> From: Sean Christopherson <seanjc@google.com>
>
> Move tdx_sept_remove_private_spte() (and its tdx_track() helper) above
> tdx_sept_set_private_spte() in anticipation of routing all non-atomic
> S-EPT writes (with the exception of reclaiming non-leaf pages) through
> the "set" API.
The diff reads more like moving tdx_sept_set_private_spte() down, though.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 14/17] KVM: x86/mmu: Remove KVM_BUG_ON() that checks lock when removing PTs
2026-03-30 7:01 ` [PATCH 14/17] KVM: x86/mmu: Remove KVM_BUG_ON() that checks lock when removing PTs Yan Zhao
@ 2026-03-31 10:46 ` Huang, Kai
2026-04-02 0:08 ` Edgecombe, Rick P
0 siblings, 1 reply; 43+ messages in thread
From: Huang, Kai @ 2026-03-31 10:46 UTC (permalink / raw)
To: Zhao, Yan Y, Edgecombe, Rick P
Cc: Hansen, Dave, kvm@vger.kernel.org, pbonzini@redhat.com,
kas@kernel.org, seanjc@google.com, linux-kernel@vger.kernel.org,
x86@kernel.org
>
> > KVM_BUG_ON() to use in the logics future home.
> By "in the logics future home", do you mean "when this logic is moved to its
> future location"?
>
Maybe it just meant to be:
the logic's future home
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 17/17] KVM: TDX: Move external page table freeing to TDX code
[not found] ` <20260327201421.2824383-18-rick.p.edgecombe@intel.com>
2026-03-30 7:49 ` [PATCH 17/17] KVM: TDX: Move external page table freeing to TDX code Yan Zhao
@ 2026-03-31 11:02 ` Huang, Kai
1 sibling, 0 replies; 43+ messages in thread
From: Huang, Kai @ 2026-03-31 11:02 UTC (permalink / raw)
To: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
seanjc@google.com, Edgecombe, Rick P, Zhao, Yan Y
Cc: Hansen, Dave, linux-kernel@vger.kernel.org, x86@kernel.org
>
> So move the free_page() call to TDX code. Create an
> tdp_mmu_free_unused_sp() to allow for freeing external page tables that
> have never left the TDP MMU code (i.e. don’t need freed in a special way.
>
Missing ')' in the last sentence.
Also, don't need to be freed ?
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 06/17] KVM: x86/tdp_mmu: Morph the !is_frozen_spte() check into a KVM_MMU_WARN_ON()
2026-03-30 5:00 ` [PATCH 06/17] KVM: x86/tdp_mmu: Morph the !is_frozen_spte() check into a KVM_MMU_WARN_ON() Yan Zhao
@ 2026-03-31 16:37 ` Edgecombe, Rick P
2026-04-02 1:06 ` Yan Zhao
0 siblings, 1 reply; 43+ messages in thread
From: Edgecombe, Rick P @ 2026-03-31 16:37 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: Hansen, Dave, seanjc@google.com, Huang, Kai, kas@kernel.org,
x86@kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
pbonzini@redhat.com
On Mon, 2026-03-30 at 13:00 +0800, Yan Zhao wrote:
Yep on the typos.
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -656,7 +656,13 @@ static inline int __must_check
> > __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> > */
> > WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter-
> > >old_spte));
> >
> > - if (is_mirror_sptep(iter->sptep) &&
> > !is_frozen_spte(new_spte)) {
> > + /*
> > + * FROZEN_SPTE is a temporary state and should never be
> > set via higher
> > + * level helpers.
> > + */
> > + KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> Why is KVM_MMU_WARN_ON() used here for new_spte while WARN_ON_ONCE()
> is used
> above for old_spte?
For the KVM_MMU_WARN_ON() it was Sean's suggestion.
https://lore.kernel.org/lkml/aYYn0nf2cayYu8e7@google.com/
It allows for compiling it out, so probably a better choice. So I see
the options are leave them different or opportunistically convert the
other one to KVM_MMU_WARN_ON(). Thoughts?
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 11/17] KVM: x86/mmu: Fold set_external_spte_present() into its sole caller
[not found] ` <20260327201421.2824383-12-rick.p.edgecombe@intel.com>
2026-03-31 10:36 ` [PATCH 11/17] KVM: x86/mmu: Fold set_external_spte_present() into its sole caller Huang, Kai
@ 2026-04-01 7:41 ` Yan Zhao
1 sibling, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2026-04-01 7:41 UTC (permalink / raw)
To: Rick Edgecombe
Cc: seanjc, pbonzini, kai.huang, kvm, kas, linux-kernel, x86,
dave.hansen
On Fri, Mar 27, 2026 at 01:14:15PM -0700, Rick Edgecombe wrote:
> From: Sean Christopherson <seanjc@google.com>
>
> Fold set_external_spte_present() into __tdp_mmu_set_spte_atomic() as all
> the other functionality besides calling the op. It now is a single line
> helper that is called once.
Ack to the sashiko review comment:
"This isn't a bug, but the commit message says the function is folded into
__tdp_mmu_set_spte_atomic(), whereas the code actually folds it into
__handle_changed_spte(). Should the commit message be updated to reflect
the correct function name?
Additionally, the phrase "as all the other functionality besides calling
the op." appears to be missing a verb or is incomplete. Could this sentence
be clarified?" [1].
[1] https://sashiko.dev/#/patchset/20260327201421.2824383-1-rick.p.edgecombe%40intel.com
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
[not found] ` <20260327201421.2824383-8-rick.p.edgecombe@intel.com>
2026-03-30 6:14 ` [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs Yan Zhao
2026-03-31 10:09 ` Huang, Kai
@ 2026-04-01 8:34 ` Yan Zhao
2026-04-02 23:46 ` Edgecombe, Rick P
2 siblings, 1 reply; 43+ messages in thread
From: Yan Zhao @ 2026-04-01 8:34 UTC (permalink / raw)
To: Rick Edgecombe
Cc: seanjc, pbonzini, kai.huang, kvm, kas, linux-kernel, x86,
dave.hansen
Below are responses to issues reported by sashiko. [1]
[1] https://sashiko.dev/#/patchset/20260327201421.2824383-1-rick.p.edgecombe%40intel.com
> @@ -627,21 +615,41 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> "a temporary frozen SPTE.\n"
> "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
> as_id, gfn, old_spte, new_spte, level);
> - return;
> + return 0;
> }
>
> - if (is_leaf != was_leaf)
> - kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
> -
> /*
> * Recursively handle child PTs if the change removed a subtree from
> * the paging structure. Note the WARN on the PFN changing without the
> * SPTE being converted to a hugepage (leaf) or being zapped. Shadow
> * pages are kernel allocations and should never be migrated.
> + *
> + * When modifying leaf entries in mirrored page tables, propagate all
> + * changes to the external SPTE.
> */
> if (was_present && !was_leaf &&
> - (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> + (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
> handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> + } else if (is_mirror_sp(sp) && is_present) {
> + r = set_external_spte_present(kvm, gfn, old_spte, new_spte, level);
> + if (r)
> + return r;
> + }
Issue #1:
"If a present non-leaf SPTE is replaced with a present leaf SPTE, such as
during a hugepage collapse, the first branch executes and calls
handle_removed_pt(). Because of the mutually exclusive "else if",
set_external_spte_present() is bypassed for the new leaf SPTE. Could this
leave the external S-EPT out of sync, with the parent entry pointing to a
freed child page table?
"
Response:
This is the page merging scenario. The mirror root does not allow huge pages yet,
So, maybe nothing is required for now.
After the basic TDX huge page series, page merging is also disabled [4].
See more in the responses to issues #2 and #3.
Issue #2:
"Additionally, if an SPTE is atomically zapped, is_present evaluates to false.
Because tdp_mmu_set_spte_atomic() lacks a fallback remove_external_spte()
call like the non-atomic tdp_mmu_set_spte() has, does this mean the external
S-EPT mapping remains fully present while KVM marks it as non-present?
"
Response:
Up to this patch, this seems to be a valid concern. But atomic zap on the mirror
root isn't allowed yet (though warning on such case is dropped in patch 5 [2]).
Centralizing atomic zap to __handle_changed_spte() will be done in patch 15 [3].
Maybe move patch 5 to after patch 15?
[2] https://lore.kernel.org/kvm/20260327201421.2824383-6-rick.p.edgecombe@intel.com/
[3] https://lore.kernel.org/kvm/20260327201421.2824383-16-rick.p.edgecombe@intel.com/
[4] https://lore.kernel.org/all/20260106102024.25023-1-yan.y.zhao@intel.com/
Thinking more about centralizing TDX hooks, could we be more aggressive? i.e.,
let TDX just have a single hook set_external_spte() for propagation of changes
from mirror page table to S-EPT?
(below change is on code base with TDX huge page support).
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 13910d9af03a..d9404aeae5f3 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -95,7 +95,6 @@ KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
KVM_X86_OP(load_mmu_pgd)
KVM_X86_OP_OPTIONAL_RET0(set_external_spte)
-KVM_X86_OP_OPTIONAL(reclaim_external_spt)
KVM_X86_OP_OPTIONAL_RET0(topup_external_cache)
KVM_X86_OP(has_wbinvd_exit)
KVM_X86_OP(get_l2_tsc_offset)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 76f119a6412b..b722cce0b994 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1857,11 +1857,6 @@ struct kvm_x86_ops {
int (*set_external_spte)(struct kvm *kvm, gfn_t gfn, u64 old_spte,
u64 new_spte, enum pg_level level);
- /* Update external page tables for page table about to be freed. */
- void (*reclaim_external_spt)(struct kvm *kvm, gfn_t gfn,
- struct kvm_mmu_page *sp);
-
-
int (*topup_external_cache)(struct kvm *kvm, struct kvm_vcpu *vcpu, int min_nr_spts);
bool (*has_wbinvd_exit)(void);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 2f6bfc875de1..9b4fa5b31f23 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -461,9 +461,6 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
handle_changed_spte(kvm, sp, gfn, old_spte, FROZEN_SPTE, level, shared);
}
- if (is_mirror_sp(sp))
- kvm_x86_call(reclaim_external_spt)(kvm, base_gfn, sp);
-
call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
}
@@ -563,9 +560,17 @@ static int __handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
* changes to the external SPTE.
*/
if (was_present && !was_leaf &&
- (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
+ (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
- } else if (is_mirror_sp(sp)) {
+
+ if (is_mirror_sp(sp)) {
+ /*
+ * Can also propagate changes to remove external pt. Since this
+ * occurs after the call_rcu() in handle_removed_pt(), the RCU
+ * read lock must be held.
+ */
+ RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "no rcu read lock held");
+
r = kvm_x86_call(set_external_spte)(kvm, gfn, old_spte,
new_spte, level);
if (r)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 32d0c206ade6..cc03b9c7838c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1965,9 +1965,13 @@ static int tdx_sept_map_leaf_spte(struct kvm *kvm, gfn_t gfn, enum pg_level leve
return ret;
}
-static void tdx_sept_reclaim_private_spt(struct kvm *kvm, gfn_t gfn,
+static int tdx_sept_reclaim_private_spt(struct kvm *kvm, gfn_t gfn,
struct kvm_mmu_page *sp)
{
+ int ret;
+
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
/*
* KVM doesn't (yet) zap page table pages in mirror page table while
* TD is active, though guest pages mapped in mirror page table could be
@@ -1981,8 +1985,10 @@ static void tdx_sept_reclaim_private_spt(struct kvm *kvm, gfn_t gfn,
* the page to prevent the kernel from accessing the encrypted page.
*/
if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm) ||
- tdx_reclaim_page(virt_to_page(sp->external_spt)))
+ tdx_reclaim_page(virt_to_page(sp->external_spt))) {
+ ret = -EIO;
goto out;
+ }
/*
* Immediately free the S-EPT page as the TDX subsystem doesn't support
@@ -1990,8 +1996,10 @@ static void tdx_sept_reclaim_private_spt(struct kvm *kvm, gfn_t gfn,
* TDH.PHYMEM.PAGE.RECLAIM ensures there are no outstanding readers.
*/
free_page((unsigned long)sp->external_spt);
+ ret = 0;
out:
sp->external_spt = NULL;
+ return ret;
}
static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
@@ -2045,10 +2053,17 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
u64 new_spte, enum pg_level level)
{
- if (is_shadow_present_pte(old_spte) && is_shadow_present_pte(new_spte))
- return tdx_sept_split_private_spte(kvm, gfn, old_spte, new_spte, level);
- else if (is_shadow_present_pte(old_spte))
- return tdx_sept_remove_private_spte(kvm, gfn, level, old_spte);
+ if (is_shadow_present_pte(old_spte)) {
+ if (is_shadow_present_pte(new_spte)) {
+ if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
+ return tdx_sept_split_private_spte(kvm, gfn, old_spte, new_spte, level);
+ } else {
+ if (is_last_spte(old_spte, level))
+ return tdx_sept_remove_private_spte(kvm, gfn, level, old_spte);
+ else
+ return tdx_sept_reclaim_private_spt(kvm, gfn, spte_to_child_sp(old_spte));
+ }
+ }
if (KVM_BUG_ON(!is_shadow_present_pte(new_spte), kvm))
return -EIO;
@@ -3915,7 +3930,6 @@ void __init tdx_hardware_setup(void)
vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size, sizeof(struct kvm_tdx));
vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
- vt_x86_ops.reclaim_external_spt = tdx_sept_reclaim_private_spt;
vt_x86_ops.gmem_convert = tdx_gmem_convert;
> @@ -708,18 +698,49 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
> struct tdp_iter *iter,
> u64 new_spte)
> {
> + struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter->sptep));
> int ret;
>
> lockdep_assert_held_read(&kvm->mmu_lock);
>
> - ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> + /* KVM should never freeze SPTEs using higher level APIs. */
> + KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> +
> + /*
> + * Temporarily freeze the SPTE until the external PTE operation has
> + * completed (unless the new SPTE itself will be frozen), e.g. so that
> + * concurrent faults don't attempt to install a child PTE in the
> + * external page table before the parent PTE has been written, or try
> + * to re-install a page table before the old one was removed.
> + */
> + if (is_mirror_sptep(iter->sptep))
> + ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
> + else
> + ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> if (ret)
> return ret;
>
> - handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> - new_spte, iter->level, true);
> + ret = __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> + new_spte, iter->level, true);
>
> - return 0;
> + /*
> + * Unfreeze the mirror SPTE. If updating the external SPTE failed,
> + * restore the old SPTE so that the SPTE isn't frozen in perpetuity,
> + * otherwise set the mirror SPTE to the new desired value.
> + */
> + if (is_mirror_sptep(iter->sptep)) {
> + if (ret)
> + __kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte);
> + else
> + __kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
Issue 3:
"
If the mutually exclusive branches in __handle_changed_spte() are updated
so both can run, there seems to be a risk during rollback. The function
handle_removed_pt() unlinks the child page table and schedules it for RCU
destruction. If set_external_spte_present() then fails, the rollback
code here restores iter->old_spte. Since iter->old_spte points to the child
page table that was just scheduled for freeing, can this cause a
use-after-free when the old SPTE is restored?
"
Response:
As in the response to issue 1, changes from non-leaf present to leaf present
in mirror page table are not valid. We need special handling in TDX MMU to
support such scenario in the future. e.g., TDX requires all children entries
must have been all mapped or all unmapped before merging.
Maybe TDX huge page series could skip the handle_removed_pt() for the page
merging case first if more robustness is preferred to guard unexpected page
merging.
Issue 4:
"
Furthermore, tdp_mmu_set_spte_atomic() now returns errors from
__handle_changed_spte(). Previously, it only returned an error (like -EBUSY)
if the try_cmpxchg64() failed, which natively updated iter->old_spte to the
new memory value.
If try_cmpxchg64() succeeds but __handle_changed_spte() fails, iter->old_spte
is not updated. Callers like clear_dirty_gfn_range() loop on failure assuming
the old value was refreshed:
clear_dirty_gfn_range() {
...
retry:
...
if (tdp_mmu_set_spte_atomic(kvm, &iter, iter.old_spte & ~dbit))
goto retry;
...
}
Will this result in an infinite loop, as the caller retries the exact same
atomic operation with the unchanged iter.old_spte?
"
Response:
Though the only expected error from __handle_changed_spte() is also -EBUSY, the
infinite loop is possible if updating the external page table constantly fails
due to contentions.
However, it's currently benign since callers like clear_dirty_gfn_range() can't
operate on the mirror page table.
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
2026-03-30 6:14 ` [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs Yan Zhao
@ 2026-04-01 23:45 ` Edgecombe, Rick P
2026-04-02 1:59 ` Yan Zhao
0 siblings, 1 reply; 43+ messages in thread
From: Edgecombe, Rick P @ 2026-04-01 23:45 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: Hansen, Dave, seanjc@google.com, Huang, Kai, kas@kernel.org,
x86@kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
pbonzini@redhat.com
On Mon, 2026-03-30 at 14:14 +0800, Yan Zhao wrote:
> > u64 new_spte)
> > {
> > + struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter-
> > >sptep));
> > int ret;
> >
> > lockdep_assert_held_read(&kvm->mmu_lock);
> >
> > - ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> > + /* KVM should never freeze SPTEs using higher level APIs. */
> "higher level API" is ambiguous. e.g. kvm_tdp_mmu_write_spte_atomic() allows
> new_spte to be FROZEN_SPTE.
Yea you are right. It felt too fuzzy but I couldn't think of a better word.
>
> What about just "callers of tdp_mmu_set_spte_atomic() should not freeze SPTEs
> directly"?
Sure.
>
> > + KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> > +
> > + /*
> > + * Temporarily freeze the SPTE until the external PTE operation has
> > + * completed (unless the new SPTE itself will be frozen), e.g. so
> > that
> > + * concurrent faults don't attempt to install a child PTE in the
> > + * external page table before the parent PTE has been written, or
> > try
> > + * to re-install a page table before the old one was removed.
> > + */
> > + if (is_mirror_sptep(iter->sptep))
> > + ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
> > + else
> > + ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> > if (ret)
> > return ret;
> >
> > - handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> > - new_spte, iter->level, true);
> > + ret = __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> > + new_spte, iter->level, true);
>
> What about adding a comment for the tricky part for the mirror page table:
> while new_spte is set to FROZEN_SPTE in the above __tdp_mmu_set_spte_atomic()
You meant it sets iter->sptep I think.
> for freezing the mirror page table, the original new_spte from the caller of
> tdp_mmu_set_spte_atomic() is passed to __handle_changed_spte() in order to
> properly update statistics and propagate to the external page table.
new_spte was already passed in. What changed? You mean that
__tdp_mmu_set_spte_atomic() sets iter->sptep and doesn't update new_spte? If so
I'm not sure if it threshold TDP MMU.
>
> > - return 0;
> > + /*
> > + * Unfreeze the mirror SPTE. If updating the external SPTE failed,
> > + * restore the old SPTE so that the SPTE isn't frozen in
> > perpetuity,
> > + * otherwise set the mirror SPTE to the new desired value.
> > + */
> > + if (is_mirror_sptep(iter->sptep)) {
> > + if (ret)
> > + __kvm_tdp_mmu_write_spte(iter->sptep, iter-
> > >old_spte);
> > + else
> > + __kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
> > + } else {
> > + /*
> > + * Bug the VM if handling the change failed, as failure is
> > only
> > + * allowed if KVM couldn't update the external SPTE.
> > + */
> > + KVM_BUG_ON(ret, kvm);
> > + }
> > + return ret;
> > }
> >
> > /*
> > @@ -738,6 +759,8 @@ static inline int __must_check
> > tdp_mmu_set_spte_atomic(struct kvm *kvm,
> > static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> > u64 old_spte, u64 new_spte, gfn_t gfn, int
> > level)
> > {
> > + struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(sptep));
> > +
> > lockdep_assert_held_write(&kvm->mmu_lock);
> >
> > /*
> > @@ -751,7 +774,7 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id,
> > tdp_ptep_t sptep,
> >
> > old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte,
> > level);
> >
> > - handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
> > false);
> > + handle_changed_spte(kvm, sp, gfn, old_spte, new_spte, level,
> > false);
> >
> > /*
> > * Users that do non-atomic setting of PTEs don't operate on mirror
> > @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm,
> > struct tdp_iter *iter)
> > {
> > u64 new_spte;
> >
> > + if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> > + return;
> > +
> Add a comment for why mirror page table is not expected here?
Ehh, maybe. Thinking about what to put... The warning is kind of cheating a
little bit on the idea of the patch: to forward all changes through limited ops
in a central place, such that we don't have TDX specifics encoded in core MMU.
Trying to forward this through properly would result in more burden to the TDP
MMU, so that's not the right answer either.
"Mirror TDP doesn't support PTE aging" is a pretty obvious comment. I'm fine
just leaving it without comment, but I can add something like that. Or do you
have another suggestion?
>
> And do we need a similar WARN_ON_ONCE() in kvm_tdp_mmu_clear_dirty_pt_masked()
> or clear_dirty_pt_masked()?
Nothing changes for those in this patch though? For the kvm_tdp_mmu_age_spte()
case, warning coverage is removed in this patch.
>
> > if (spte_ad_enabled(iter->old_spte)) {
> > iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter-
> > >sptep,
> > shadow_acce
> > ssed_mask);
> > --
> > 2.53.0
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
2026-03-31 10:09 ` Huang, Kai
@ 2026-04-01 23:58 ` Edgecombe, Rick P
2026-04-02 23:21 ` Sean Christopherson
0 siblings, 1 reply; 43+ messages in thread
From: Edgecombe, Rick P @ 2026-04-01 23:58 UTC (permalink / raw)
To: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
seanjc@google.com, Huang, Kai, Zhao, Yan Y
Cc: Hansen, Dave, linux-kernel@vger.kernel.org, x86@kernel.org
On Tue, 2026-03-31 at 10:09 +0000, Huang, Kai wrote:
> > @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm,
> > struct tdp_iter *iter)
> > {
> > u64 new_spte;
> >
> > + if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> > + return;
> > +
> > if (spte_ad_enabled(iter->old_spte)) {
> > iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter-
> > >sptep,
> > shadow_acce
> > ssed_mask);
>
> AFAICT this chunk isn't related to "Centralize updates to present external
> PTEs"? Should it be in a separate patch?
It was because we lose warning coverage in this patch for the aging case. Though
the diff was from Sean. But my take was this just maintains coverage that
already existed.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 09/17] KVM: TDX: Add helper to handle mapping leaf SPTE into S-EPT
2026-03-30 6:43 ` [PATCH 09/17] KVM: TDX: Add helper to handle mapping leaf SPTE into S-EPT Yan Zhao
@ 2026-04-01 23:59 ` Edgecombe, Rick P
0 siblings, 0 replies; 43+ messages in thread
From: Edgecombe, Rick P @ 2026-04-01 23:59 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: Hansen, Dave, seanjc@google.com, Huang, Kai, kas@kernel.org,
x86@kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
pbonzini@redhat.com
On Mon, 2026-03-30 at 14:43 +0800, Yan Zhao wrote:
> >
> > +static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> > + enum pg_level level, u64 mirror_spte)
> > +{
> >
> > + if (KVM_BUG_ON(!is_shadow_present_pte(mirror_spte), kvm))
> > + return -EIO;
> > +
> > + if (!is_last_spte(mirror_spte, level))
> > + return tdx_sept_link_private_spt(kvm, gfn, level,
> > mirror_spte);
> For symmetry, what about renaming tdx_sept_link_private_spt() to
> tdx_sept_map_nonleaf_spte()?
Yea, it fits better.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 10/17] KVM: TDX: Move set_external_spte_present() assert into TDX code
2026-03-31 10:30 ` [PATCH 10/17] KVM: TDX: Move set_external_spte_present() assert into TDX code Huang, Kai
@ 2026-04-02 0:00 ` Edgecombe, Rick P
0 siblings, 0 replies; 43+ messages in thread
From: Edgecombe, Rick P @ 2026-04-02 0:00 UTC (permalink / raw)
To: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
seanjc@google.com, Huang, Kai, Zhao, Yan Y
Cc: Hansen, Dave, linux-kernel@vger.kernel.org, x86@kernel.org
On Tue, 2026-03-31 at 10:30 +0000, Huang, Kai wrote:
> On Fri, 2026-03-27 at 13:14 -0700, Rick Edgecombe wrote:
> > static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> > enum pg_level level, u64 mirror_spte)
> > {
> > -
> > if (KVM_BUG_ON(!is_shadow_present_pte(mirror_spte), kvm))
> > return -EIO;
>
> This line deleting isn't needed if it wasn't introduced in the previous
> patch.
Oops, will fix.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 13/17] KVM: TDX: Hoist tdx_sept_remove_private_spte() above set_private_spte()
2026-03-31 10:42 ` [PATCH 13/17] KVM: TDX: Hoist tdx_sept_remove_private_spte() above set_private_spte() Huang, Kai
@ 2026-04-02 0:04 ` Edgecombe, Rick P
0 siblings, 0 replies; 43+ messages in thread
From: Edgecombe, Rick P @ 2026-04-02 0:04 UTC (permalink / raw)
To: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
seanjc@google.com, Huang, Kai, Zhao, Yan Y
Cc: Hansen, Dave, linux-kernel@vger.kernel.org, x86@kernel.org
On Tue, 2026-03-31 at 10:42 +0000, Huang, Kai wrote:
> On Fri, 2026-03-27 at 13:14 -0700, Rick Edgecombe wrote:
> > From: Sean Christopherson <seanjc@google.com>
> >
> > Move tdx_sept_remove_private_spte() (and its tdx_track() helper) above
> > tdx_sept_set_private_spte() in anticipation of routing all non-atomic
> > S-EPT writes (with the exception of reclaiming non-leaf pages) through
> > the "set" API.
>
> The diff reads more like moving tdx_sept_set_private_spte() down, though.
Yea I noticed that too. We can make it more generic, like "arrange such that..."
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 14/17] KVM: x86/mmu: Remove KVM_BUG_ON() that checks lock when removing PTs
2026-03-31 10:46 ` Huang, Kai
@ 2026-04-02 0:08 ` Edgecombe, Rick P
2026-04-02 2:04 ` Yan Zhao
0 siblings, 1 reply; 43+ messages in thread
From: Edgecombe, Rick P @ 2026-04-02 0:08 UTC (permalink / raw)
To: Huang, Kai, Zhao, Yan Y
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Hansen, Dave,
kas@kernel.org, seanjc@google.com, x86@kernel.org,
linux-kernel@vger.kernel.org
On Tue, 2026-03-31 at 10:46 +0000, Huang, Kai wrote:
> >
> > > KVM_BUG_ON() to use in the logics future home.
> > By "in the logics future home", do you mean "when this logic is moved to its
> > future location"?
> >
>
> Maybe it just meant to be:
>
> the logic's future home
Yea that is what I meant. I'll update it. Unless Yan do you find the "future
home" phrasing confusing?
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 17/17] KVM: TDX: Move external page table freeing to TDX code
2026-03-30 7:49 ` [PATCH 17/17] KVM: TDX: Move external page table freeing to TDX code Yan Zhao
@ 2026-04-02 0:17 ` Edgecombe, Rick P
2026-04-02 2:16 ` Yan Zhao
0 siblings, 1 reply; 43+ messages in thread
From: Edgecombe, Rick P @ 2026-04-02 0:17 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: Hansen, Dave, seanjc@google.com, Huang, Kai, kas@kernel.org,
x86@kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
pbonzini@redhat.com
On Mon, 2026-03-30 at 15:49 +0800, Yan Zhao wrote:
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index d064b40a6b31..1346e891ca94 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1782,7 +1782,16 @@ static void tdx_sept_free_private_spt(struct kvm
> > *kvm, gfn_t gfn,
> > */
> > if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm) ||
> > tdx_reclaim_page(virt_to_page(sp->external_spt)))
> > - sp->external_spt = NULL;
> > + goto out;
> > +
> > + /*
> > + * Immediately free the S-EPT page as the TDX subsystem doesn't
> > support
> > + * freeing pages from RCU callbacks, and more importantly because
> > + * TDH.PHYMEM.PAGE.RECLAIM ensures there are no outstanding
> > readers.
> > + */
> > + free_page((unsigned long)sp->external_spt);
> Since sp->external_spt is allocated in kvm_mmu_alloc_external_spt() in a
> similar
> way as sp->spt in tdp_mmu_alloc_sp(), why bother making freeing of
> sp->external_spt special? i.e., what's the downside of freeing it together
> with
> sp->spt in tdp_mmu_free_sp() ?
I agree it's questionable as a cleanup. The change originally came from your
comment here actually:
https://lore.kernel.org/kvm/aYW5CbUvZrLogsWF@yzhao56-desk.sh.intel.com/
So for a more concrete DPAMT reason. That said I agree the reasoning as a stand
alone cleanup is pretty weak. I tried to put a brave face on it.
We have to think if we should try to include it in the initial set or include it
with the later DPAMT parts. I thought to try at least to do it earlier and leave
it at the end. It also kind of helps to see more of the overhaul together.
Thoughts?
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 06/17] KVM: x86/tdp_mmu: Morph the !is_frozen_spte() check into a KVM_MMU_WARN_ON()
2026-03-31 16:37 ` Edgecombe, Rick P
@ 2026-04-02 1:06 ` Yan Zhao
2026-04-02 19:21 ` Sean Christopherson
0 siblings, 1 reply; 43+ messages in thread
From: Yan Zhao @ 2026-04-02 1:06 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: Hansen, Dave, seanjc@google.com, Huang, Kai, kas@kernel.org,
x86@kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
pbonzini@redhat.com
On Wed, Apr 01, 2026 at 12:37:51AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2026-03-30 at 13:00 +0800, Yan Zhao wrote:
>
> Yep on the typos.
>
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -656,7 +656,13 @@ static inline int __must_check
> > > __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> > > */
> > > WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter-
> > > >old_spte));
> > >
> > > - if (is_mirror_sptep(iter->sptep) &&
> > > !is_frozen_spte(new_spte)) {
> > > + /*
> > > + * FROZEN_SPTE is a temporary state and should never be
> > > set via higher
> > > + * level helpers.
> > > + */
> > > + KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> > Why is KVM_MMU_WARN_ON() used here for new_spte while WARN_ON_ONCE()
> > is used
> > above for old_spte?
>
> For the KVM_MMU_WARN_ON() it was Sean's suggestion.
>
> https://lore.kernel.org/lkml/aYYn0nf2cayYu8e7@google.com/
>
> It allows for compiling it out, so probably a better choice. So I see
> the options are leave them different or opportunistically convert the
> other one to KVM_MMU_WARN_ON(). Thoughts?
I see there are mixed WARN_ON_ONCE() and KVM_MMU_WARN_ON() calls in mmu.c and
tdp_mmu.c. I'm not sure if there's a rule for which one to use.
Is it necessary to evaluate them all and have a separate patch to convert
WARN_ON_ONCE() to KVM_MMU_WARN_ON()?
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
2026-04-01 23:45 ` Edgecombe, Rick P
@ 2026-04-02 1:59 ` Yan Zhao
2026-04-02 23:10 ` Edgecombe, Rick P
0 siblings, 1 reply; 43+ messages in thread
From: Yan Zhao @ 2026-04-02 1:59 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: Hansen, Dave, seanjc@google.com, Huang, Kai, kas@kernel.org,
x86@kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
pbonzini@redhat.com
On Thu, Apr 02, 2026 at 07:45:54AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2026-03-30 at 14:14 +0800, Yan Zhao wrote:
> > > + KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> > > +
> > > + /*
> > > + * Temporarily freeze the SPTE until the external PTE operation has
> > > + * completed (unless the new SPTE itself will be frozen), e.g. so
> > > that
> > > + * concurrent faults don't attempt to install a child PTE in the
> > > + * external page table before the parent PTE has been written, or
> > > try
> > > + * to re-install a page table before the old one was removed.
> > > + */
> > > + if (is_mirror_sptep(iter->sptep))
> > > + ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
> > > + else
> > > + ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> > > if (ret)
> > > return ret;
> > >
> > > - handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> > > - new_spte, iter->level, true);
> > > + ret = __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> > > + new_spte, iter->level, true);
> >
> > What about adding a comment for the tricky part for the mirror page table:
> > while new_spte is set to FROZEN_SPTE in the above __tdp_mmu_set_spte_atomic()
>
> You meant it sets iter->sptep I think.
>
> > for freezing the mirror page table, the original new_spte from the caller of
> > tdp_mmu_set_spte_atomic() is passed to __handle_changed_spte() in order to
> > properly update statistics and propagate to the external page table.
>
> new_spte was already passed in. What changed? You mean that
> __tdp_mmu_set_spte_atomic() sets iter->sptep and doesn't update new_spte? If so
> I'm not sure if it threshold TDP MMU.
For mirror page table, a successful path in tdp_mmu_set_spte_atomic() looks
like this:
tdp_mmu_set_spte_atomic() {
__tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE); ==>sets mirror to frozen
__handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
new_spte, iter->level, true);==>sets S-EPT to new_spte
__kvm_tdp_mmu_write_spte(iter->sptep, new_spte); ==>sets mirror to new_spte
}
So, the tricky part is that S-EPT is updated to new_spte ahead of mirror EPT.
> > > - return 0;
> > > + /*
> > > + * Unfreeze the mirror SPTE. If updating the external SPTE failed,
> > > + * restore the old SPTE so that the SPTE isn't frozen in
> > > perpetuity,
> > > + * otherwise set the mirror SPTE to the new desired value.
> > > + */
> > > + if (is_mirror_sptep(iter->sptep)) {
> > > + if (ret)
> > > + __kvm_tdp_mmu_write_spte(iter->sptep, iter-
> > > >old_spte);
> > > + else
> > > + __kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
> > > + } else {
> > > + /*
> > > + * Bug the VM if handling the change failed, as failure is
> > > only
> > > + * allowed if KVM couldn't update the external SPTE.
> > > + */
> > > + KVM_BUG_ON(ret, kvm);
> > > + }
> > > + return ret;
> > > }
> > >
> > > /*
> > > @@ -738,6 +759,8 @@ static inline int __must_check
> > > tdp_mmu_set_spte_atomic(struct kvm *kvm,
> > > static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> > > u64 old_spte, u64 new_spte, gfn_t gfn, int
> > > level)
> > > {
> > > + struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(sptep));
> > > +
> > > lockdep_assert_held_write(&kvm->mmu_lock);
> > >
> > > /*
> > > @@ -751,7 +774,7 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id,
> > > tdp_ptep_t sptep,
> > >
> > > old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte,
> > > level);
> > >
> > > - handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
> > > false);
> > > + handle_changed_spte(kvm, sp, gfn, old_spte, new_spte, level,
> > > false);
> > >
> > > /*
> > > * Users that do non-atomic setting of PTEs don't operate on mirror
> > > @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm,
> > > struct tdp_iter *iter)
> > > {
> > > u64 new_spte;
> > >
> > > + if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> > > + return;
> > > +
> > Add a comment for why mirror page table is not expected here?
>
> Ehh, maybe. Thinking about what to put... The warning is kind of cheating a
> little bit on the idea of the patch: to forward all changes through limited ops
> in a central place, such that we don't have TDX specifics encoded in core MMU.
> Trying to forward this through properly would result in more burden to the TDP
> MMU, so that's not the right answer either.
>
> "Mirror TDP doesn't support PTE aging" is a pretty obvious comment. I'm fine
> just leaving it without comment, but I can add something like that. Or do you
> have another suggestion?
What about "External TDP doesn't support clearing PTE A/D bit"?
I'm ok with leaving without comment if you think it's too obvious.
> > And do we need a similar WARN_ON_ONCE() in kvm_tdp_mmu_clear_dirty_pt_masked()
> > or clear_dirty_pt_masked()?
>
> Nothing changes for those in this patch though? For the kvm_tdp_mmu_age_spte()
> case, warning coverage is removed in this patch.
In kvm_tdp_mmu_age_spte() and clear_dirty_pt_masked() SPTEs are updated via
__tdp_mmu_set_spte_atomic() or tdp_mmu_clear_spte_bits(), which don't propagate
changes to the external page table.
So, I think it's better to warn if a mirror root is accidentally passed in to
those functions.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 14/17] KVM: x86/mmu: Remove KVM_BUG_ON() that checks lock when removing PTs
2026-04-02 0:08 ` Edgecombe, Rick P
@ 2026-04-02 2:04 ` Yan Zhao
0 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2026-04-02 2:04 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: Huang, Kai, kvm@vger.kernel.org, pbonzini@redhat.com,
Hansen, Dave, kas@kernel.org, seanjc@google.com, x86@kernel.org,
linux-kernel@vger.kernel.org
On Thu, Apr 02, 2026 at 08:08:54AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2026-03-31 at 10:46 +0000, Huang, Kai wrote:
> > >
> > > > KVM_BUG_ON() to use in the logics future home.
> > > By "in the logics future home", do you mean "when this logic is moved to its
> > > future location"?
> > >
> >
> > Maybe it just meant to be:
> >
> > the logic's future home
>
> Yea that is what I meant. I'll update it. Unless Yan do you find the "future
> home" phrasing confusing?
Yeah, "future home" seems less clear to me. :)
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 17/17] KVM: TDX: Move external page table freeing to TDX code
2026-04-02 0:17 ` Edgecombe, Rick P
@ 2026-04-02 2:16 ` Yan Zhao
2026-04-02 2:17 ` Yan Zhao
0 siblings, 1 reply; 43+ messages in thread
From: Yan Zhao @ 2026-04-02 2:16 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: Hansen, Dave, seanjc@google.com, Huang, Kai, kas@kernel.org,
x86@kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
pbonzini@redhat.com
On Thu, Apr 02, 2026 at 08:17:19AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2026-03-30 at 15:49 +0800, Yan Zhao wrote:
> > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > index d064b40a6b31..1346e891ca94 100644
> > > --- a/arch/x86/kvm/vmx/tdx.c
> > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > @@ -1782,7 +1782,16 @@ static void tdx_sept_free_private_spt(struct kvm
> > > *kvm, gfn_t gfn,
> > > */
> > > if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm) ||
> > > tdx_reclaim_page(virt_to_page(sp->external_spt)))
> > > - sp->external_spt = NULL;
> > > + goto out;
> > > +
> > > + /*
> > > + * Immediately free the S-EPT page as the TDX subsystem doesn't
> > > support
> > > + * freeing pages from RCU callbacks, and more importantly because
> > > + * TDH.PHYMEM.PAGE.RECLAIM ensures there are no outstanding
> > > readers.
> > > + */
> > > + free_page((unsigned long)sp->external_spt);
> > Since sp->external_spt is allocated in kvm_mmu_alloc_external_spt() in a
> > similar
> > way as sp->spt in tdp_mmu_alloc_sp(), why bother making freeing of
> > sp->external_spt special? i.e., what's the downside of freeing it together
> > with
> > sp->spt in tdp_mmu_free_sp() ?
>
> I agree it's questionable as a cleanup. The change originally came from your
> comment here actually:
> https://lore.kernel.org/kvm/aYW5CbUvZrLogsWF@yzhao56-desk.sh.intel.com/
>
> So for a more concrete DPAMT reason. That said I agree the reasoning as a stand
> alone cleanup is pretty weak. I tried to put a brave face on it.
>
> We have to think if we should try to include it in the initial set or include it
> with the later DPAMT parts. I thought to try at least to do it earlier and leave
> it at the end. It also kind of helps to see more of the overhaul together.
> Thoughts?
However, even with DPAMT, sp->external_sp is also allocated via
kvm_mmu_alloc_external_spt().
static inline void kvm_mmu_alloc_external_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
{
/*
* external_spt is allocated for TDX module to hold private EPT mappings,
* TDX module will initialize the page by itself.
* Therefore, KVM does not need to initialize or access external_spt.
* KVM only interacts with sp->spt for private EPT operations.
*/
sp->external_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_external_spt_cache);
}
i.e., sp->external_spt is allocated with no difference from sp->spt.
So, I'm not sure why we have to free it differently.
What's the benefit of keeping the memory for a little longer than freeing it
immediately after TDX RECLAIM?
Or are you planning to allocate sp->external_spt via a TDX hook?
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 17/17] KVM: TDX: Move external page table freeing to TDX code
2026-04-02 2:16 ` Yan Zhao
@ 2026-04-02 2:17 ` Yan Zhao
0 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2026-04-02 2:17 UTC (permalink / raw)
To: Edgecombe, Rick P, Hansen, Dave, seanjc@google.com, Huang, Kai,
kas@kernel.org, x86@kernel.org, linux-kernel@vger.kernel.org,
kvm@vger.kernel.org, pbonzini@redhat.com
On Thu, Apr 02, 2026 at 10:16:07AM +0800, Yan Zhao wrote:
> On Thu, Apr 02, 2026 at 08:17:19AM +0800, Edgecombe, Rick P wrote:
> > On Mon, 2026-03-30 at 15:49 +0800, Yan Zhao wrote:
> > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > > index d064b40a6b31..1346e891ca94 100644
> > > > --- a/arch/x86/kvm/vmx/tdx.c
> > > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > > @@ -1782,7 +1782,16 @@ static void tdx_sept_free_private_spt(struct kvm
> > > > *kvm, gfn_t gfn,
> > > > */
> > > > if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm) ||
> > > > tdx_reclaim_page(virt_to_page(sp->external_spt)))
> > > > - sp->external_spt = NULL;
> > > > + goto out;
> > > > +
> > > > + /*
> > > > + * Immediately free the S-EPT page as the TDX subsystem doesn't
> > > > support
> > > > + * freeing pages from RCU callbacks, and more importantly because
> > > > + * TDH.PHYMEM.PAGE.RECLAIM ensures there are no outstanding
> > > > readers.
> > > > + */
> > > > + free_page((unsigned long)sp->external_spt);
> > > Since sp->external_spt is allocated in kvm_mmu_alloc_external_spt() in a
> > > similar
> > > way as sp->spt in tdp_mmu_alloc_sp(), why bother making freeing of
> > > sp->external_spt special? i.e., what's the downside of freeing it together
> > > with
> > > sp->spt in tdp_mmu_free_sp() ?
> >
> > I agree it's questionable as a cleanup. The change originally came from your
> > comment here actually:
> > https://lore.kernel.org/kvm/aYW5CbUvZrLogsWF@yzhao56-desk.sh.intel.com/
> >
> > So for a more concrete DPAMT reason. That said I agree the reasoning as a stand
> > alone cleanup is pretty weak. I tried to put a brave face on it.
> >
> > We have to think if we should try to include it in the initial set or include it
> > with the later DPAMT parts. I thought to try at least to do it earlier and leave
> > it at the end. It also kind of helps to see more of the overhaul together.
> > Thoughts?
> However, even with DPAMT, sp->external_sp is also allocated via
> kvm_mmu_alloc_external_spt().
>
> static inline void kvm_mmu_alloc_external_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> {
> /*
> * external_spt is allocated for TDX module to hold private EPT mappings,
> * TDX module will initialize the page by itself.
> * Therefore, KVM does not need to initialize or access external_spt.
> * KVM only interacts with sp->spt for private EPT operations.
> */
> sp->external_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_external_spt_cache);
> }
>
> i.e., sp->external_spt is allocated with no difference from sp->spt.
> So, I'm not sure why we have to free it differently.
> What's the benefit of keeping the memory for a little longer than freeing it
Typo: what's the downside
> immediately after TDX RECLAIM?
>
> Or are you planning to allocate sp->external_spt via a TDX hook?
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 06/17] KVM: x86/tdp_mmu: Morph the !is_frozen_spte() check into a KVM_MMU_WARN_ON()
2026-04-02 1:06 ` Yan Zhao
@ 2026-04-02 19:21 ` Sean Christopherson
2026-04-03 2:47 ` Yan Zhao
0 siblings, 1 reply; 43+ messages in thread
From: Sean Christopherson @ 2026-04-02 19:21 UTC (permalink / raw)
To: Yan Zhao
Cc: Rick P Edgecombe, Dave Hansen, Kai Huang, kas@kernel.org,
x86@kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
pbonzini@redhat.com
On Thu, Apr 02, 2026, Yan Zhao wrote:
> On Wed, Apr 01, 2026 at 12:37:51AM +0800, Edgecombe, Rick P wrote:
> > On Mon, 2026-03-30 at 13:00 +0800, Yan Zhao wrote:
> >
> > Yep on the typos.
> >
> > > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > > @@ -656,7 +656,13 @@ static inline int __must_check
> > > > __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> > > > */
> > > > WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter-
> > > > >old_spte));
> > > >
> > > > - if (is_mirror_sptep(iter->sptep) &&
> > > > !is_frozen_spte(new_spte)) {
> > > > + /*
> > > > + * FROZEN_SPTE is a temporary state and should never be
> > > > set via higher
> > > > + * level helpers.
> > > > + */
> > > > + KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> > > Why is KVM_MMU_WARN_ON() used here for new_spte while WARN_ON_ONCE()
> > > is used
> > > above for old_spte?
> >
> > For the KVM_MMU_WARN_ON() it was Sean's suggestion.
> >
> > https://lore.kernel.org/lkml/aYYn0nf2cayYu8e7@google.com/
> >
> > It allows for compiling it out, so probably a better choice. So I see
> > the options are leave them different or opportunistically convert the
> > other one to KVM_MMU_WARN_ON(). Thoughts?
> I see there are mixed WARN_ON_ONCE() and KVM_MMU_WARN_ON() calls in mmu.c and
> tdp_mmu.c. I'm not sure if there's a rule for which one to use.
KVM_MMU_WARN_ON() is intended to be used only when having a WARN_ON_ONCE() enabled
in production environments isn't worth the cost of the code. E.g. if the code in
question is a low level helper that used "everywhere" and in super hot paths, or
in extremely paranoid sanity checks where the platform is beyond hosed if the
sanity check fails.
The other thing with KVM_MMU_WARN_ON() is that it doesn't evaluate the expression
when CONFIG_KVM_PROVE_MMU=n, i.e. can't be used in if-statements (the only case
where it's used in an if-statement is wrapped with a CONFIG_KVM_PROVE_MMU #ifdef).
For infrequently used and/or slow paths, and for cases where KVM needs to take
action no matter what, WARN_ON_ONCE() is preferred because if something goes
sideways, we'll get a helpful splat even in production.
> Is it necessary to evaluate them all and have a separate patch to convert
> WARN_ON_ONCE() to KVM_MMU_WARN_ON()?
Nope. We could probably convert a few select ones if there's good reason to do
so, but we definitely don't want to do a bulk conversion.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
2026-04-02 1:59 ` Yan Zhao
@ 2026-04-02 23:10 ` Edgecombe, Rick P
2026-04-02 23:28 ` Sean Christopherson
2026-04-03 9:08 ` Yan Zhao
0 siblings, 2 replies; 43+ messages in thread
From: Edgecombe, Rick P @ 2026-04-02 23:10 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: Hansen, Dave, seanjc@google.com, x86@kernel.org, Huang, Kai,
kas@kernel.org, linux-kernel@vger.kernel.org, pbonzini@redhat.com,
kvm@vger.kernel.org
On Thu, 2026-04-02 at 09:59 +0800, Yan Zhao wrote:
> On Thu, Apr 02, 2026 at 07:45:54AM +0800, Edgecombe, Rick P wrote:
> > On Mon, 2026-03-30 at 14:14 +0800, Yan Zhao wrote:
> > > > + KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> > > > +
> > > > + /*
> > > > + * Temporarily freeze the SPTE until the external PTE operation has
> > > > + * completed (unless the new SPTE itself will be frozen), e.g. so
> > > > that
> > > > + * concurrent faults don't attempt to install a child PTE in the
> > > > + * external page table before the parent PTE has been written, or
> > > > try
> > > > + * to re-install a page table before the old one was removed.
> > > > + */
> > > > + if (is_mirror_sptep(iter->sptep))
> > > > + ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
> > > > + else
> > > > + ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> > > > if (ret)
> > > > return ret;
> > > >
> > > > - handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> > > > - new_spte, iter->level, true);
> > > > + ret = __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> > > > + new_spte, iter->level, true);
> > >
> > > What about adding a comment for the tricky part for the mirror page table:
> > > while new_spte is set to FROZEN_SPTE in the above __tdp_mmu_set_spte_atomic()
> >
> > You meant it sets iter->sptep I think.
> >
> > > for freezing the mirror page table, the original new_spte from the caller of
> > > tdp_mmu_set_spte_atomic() is passed to __handle_changed_spte() in order to
> > > properly update statistics and propagate to the external page table.
> >
> > new_spte was already passed in. What changed? You mean that
> > __tdp_mmu_set_spte_atomic() sets iter->sptep and doesn't update new_spte? If so
> > I'm not sure if it threshold TDP MMU.
>
> For mirror page table, a successful path in tdp_mmu_set_spte_atomic() looks
> like this:
> tdp_mmu_set_spte_atomic() {
> __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE); ==>sets mirror to frozen
> __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> new_spte, iter->level, true);==>sets S-EPT to new_spte
> __kvm_tdp_mmu_write_spte(iter->sptep, new_spte); ==>sets mirror to new_spte
> }
>
> So, the tricky part is that S-EPT is updated to new_spte ahead of mirror EPT.
I still don't see the point. That ordering is not new, and this patch actually
adds a bunch of comments around the operations above and below the
__handle_changed_spte() call. If you think something is still missing maybe you
can suggest something.
>
>
> > > > - return 0;
> > > > + /*
> > > > + * Unfreeze the mirror SPTE. If updating the external SPTE failed,
> > > > + * restore the old SPTE so that the SPTE isn't frozen in
> > > > perpetuity,
> > > > + * otherwise set the mirror SPTE to the new desired value.
> > > > + */
> > > > + if (is_mirror_sptep(iter->sptep)) {
> > > > + if (ret)
> > > > + __kvm_tdp_mmu_write_spte(iter->sptep, iter-
> > > > > old_spte);
> > > > + else
> > > > + __kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
> > > > + } else {
> > > > + /*
> > > > + * Bug the VM if handling the change failed, as failure is
> > > > only
> > > > + * allowed if KVM couldn't update the external SPTE.
> > > > + */
> > > > + KVM_BUG_ON(ret, kvm);
> > > > + }
> > > > + return ret;
> > > > }
> > > >
> > > > /*
> > > > @@ -738,6 +759,8 @@ static inline int __must_check
> > > > tdp_mmu_set_spte_atomic(struct kvm *kvm,
> > > > static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> > > > u64 old_spte, u64 new_spte, gfn_t gfn, int
> > > > level)
> > > > {
> > > > + struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(sptep));
> > > > +
> > > > lockdep_assert_held_write(&kvm->mmu_lock);
> > > >
> > > > /*
> > > > @@ -751,7 +774,7 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id,
> > > > tdp_ptep_t sptep,
> > > >
> > > > old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte,
> > > > level);
> > > >
> > > > - handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
> > > > false);
> > > > + handle_changed_spte(kvm, sp, gfn, old_spte, new_spte, level,
> > > > false);
> > > >
> > > > /*
> > > > * Users that do non-atomic setting of PTEs don't operate on mirror
> > > > @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm,
> > > > struct tdp_iter *iter)
> > > > {
> > > > u64 new_spte;
> > > >
> > > > + if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> > > > + return;
> > > > +
> > > Add a comment for why mirror page table is not expected here?
> >
> > Ehh, maybe. Thinking about what to put... The warning is kind of cheating a
> > little bit on the idea of the patch: to forward all changes through limited ops
> > in a central place, such that we don't have TDX specifics encoded in core MMU.
> > Trying to forward this through properly would result in more burden to the TDP
> > MMU, so that's not the right answer either.
> >
> > "Mirror TDP doesn't support PTE aging" is a pretty obvious comment. I'm fine
> > just leaving it without comment, but I can add something like that. Or do you
> > have another suggestion?
> What about "External TDP doesn't support clearing PTE A/D bit"?
It sounds too close to "TDX doesn't support..." to me. I think I'd prefer to not
add a comment unless you strongly object.
> I'm ok with leaving without comment if you think it's too obvious.
>
> > > And do we need a similar WARN_ON_ONCE() in kvm_tdp_mmu_clear_dirty_pt_masked()
> > > or clear_dirty_pt_masked()?
> >
> > Nothing changes for those in this patch though? For the kvm_tdp_mmu_age_spte()
> > case, warning coverage is removed in this patch.
>
> In kvm_tdp_mmu_age_spte() and clear_dirty_pt_masked() SPTEs are updated via
> __tdp_mmu_set_spte_atomic() or tdp_mmu_clear_spte_bits(), which don't propagate
> changes to the external page table.
> So, I think it's better to warn if a mirror root is accidentally passed in to
> those functions.
>
>
>
I don't think we should add warnings for new paths in this patch. Meaning, the
patch shouldn't increase coverage, just maintain. But it seems reasonable in
general. How about we add it to the cleanup list.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
2026-04-01 23:58 ` Edgecombe, Rick P
@ 2026-04-02 23:21 ` Sean Christopherson
0 siblings, 0 replies; 43+ messages in thread
From: Sean Christopherson @ 2026-04-02 23:21 UTC (permalink / raw)
To: Rick P Edgecombe
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
Kai Huang, Yan Y Zhao, Dave Hansen, linux-kernel@vger.kernel.org,
x86@kernel.org
On Wed, Apr 01, 2026, Rick P Edgecombe wrote:
> On Tue, 2026-03-31 at 10:09 +0000, Huang, Kai wrote:
> > > @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm,
> > > struct tdp_iter *iter)
> > > {
> > > u64 new_spte;
> > >
> > > + if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> > > + return;
> > > +
> > > if (spte_ad_enabled(iter->old_spte)) {
> > > iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter-
> > > >sptep,
> > > shadow_acce
> > > ssed_mask);
> >
> > AFAICT this chunk isn't related to "Centralize updates to present external
> > PTEs"? Should it be in a separate patch?
>
> It was because we lose warning coverage in this patch for the aging case. Though
> the diff was from Sean. But my take was this just maintains coverage that
> already existed.
Yep, exactly.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
2026-04-02 23:10 ` Edgecombe, Rick P
@ 2026-04-02 23:28 ` Sean Christopherson
2026-04-03 9:05 ` Yan Zhao
2026-04-03 9:08 ` Yan Zhao
1 sibling, 1 reply; 43+ messages in thread
From: Sean Christopherson @ 2026-04-02 23:28 UTC (permalink / raw)
To: Rick P Edgecombe
Cc: Yan Y Zhao, Dave Hansen, x86@kernel.org, Kai Huang,
kas@kernel.org, linux-kernel@vger.kernel.org, pbonzini@redhat.com,
kvm@vger.kernel.org
On Thu, Apr 02, 2026, Rick P Edgecombe wrote:
> On Thu, 2026-04-02 at 09:59 +0800, Yan Zhao wrote:
> > On Thu, Apr 02, 2026 at 07:45:54AM +0800, Edgecombe, Rick P wrote:
> > > On Mon, 2026-03-30 at 14:14 +0800, Yan Zhao wrote:
> > > > > + KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> > > > > +
> > > > > + /*
> > > > > + * Temporarily freeze the SPTE until the external PTE operation has
> > > > > + * completed (unless the new SPTE itself will be frozen), e.g. so
> > > > > that
> > > > > + * concurrent faults don't attempt to install a child PTE in the
> > > > > + * external page table before the parent PTE has been written, or
> > > > > try
> > > > > + * to re-install a page table before the old one was removed.
> > > > > + */
> > > > > + if (is_mirror_sptep(iter->sptep))
> > > > > + ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
> > > > > + else
> > > > > + ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> > > > > if (ret)
> > > > > return ret;
> > > > >
> > > > > - handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> > > > > - new_spte, iter->level, true);
> > > > > + ret = __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> > > > > + new_spte, iter->level, true);
> > > >
> > > > What about adding a comment for the tricky part for the mirror page table:
> > > > while new_spte is set to FROZEN_SPTE in the above __tdp_mmu_set_spte_atomic()
> > >
> > > You meant it sets iter->sptep I think.
> > >
> > > > for freezing the mirror page table, the original new_spte from the caller of
> > > > tdp_mmu_set_spte_atomic() is passed to __handle_changed_spte() in order to
> > > > properly update statistics and propagate to the external page table.
> > >
> > > new_spte was already passed in. What changed? You mean that
> > > __tdp_mmu_set_spte_atomic() sets iter->sptep and doesn't update new_spte? If so
> > > I'm not sure if it threshold TDP MMU.
> >
> > For mirror page table, a successful path in tdp_mmu_set_spte_atomic() looks
> > like this:
> > tdp_mmu_set_spte_atomic() {
> > __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE); ==>sets mirror to frozen
> > __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> > new_spte, iter->level, true);==>sets S-EPT to new_spte
> > __kvm_tdp_mmu_write_spte(iter->sptep, new_spte); ==>sets mirror to new_spte
> > }
> >
> > So, the tricky part is that S-EPT is updated to new_spte ahead of mirror EPT.
>
> I still don't see the point. That ordering is not new, and this patch actually
> adds a bunch of comments around the operations above and below the
> __handle_changed_spte() call. If you think something is still missing maybe you
> can suggest something.
Ya, I'm a bit confused too. For me, the "tricky" part is understanding the need
to set the mirror SPTE to FROZE_SPTE while updating the external SPTE. Once that
is understood, I don't find passing in @new_spte to be surprising in any way.
> > > > > @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm,
> > > > > struct tdp_iter *iter)
> > > > > {
> > > > > u64 new_spte;
> > > > >
> > > > > + if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> > > > > + return;
> > > > > +
> > > > Add a comment for why mirror page table is not expected here?
> > >
> > > Ehh, maybe. Thinking about what to put... The warning is kind of cheating a
> > > little bit on the idea of the patch: to forward all changes through limited ops
> > > in a central place, such that we don't have TDX specifics encoded in core MMU.
> > > Trying to forward this through properly would result in more burden to the TDP
> > > MMU, so that's not the right answer either.
> > >
> > > "Mirror TDP doesn't support PTE aging" is a pretty obvious comment. I'm fine
> > > just leaving it without comment, but I can add something like that. Or do you
> > > have another suggestion?
> > What about "External TDP doesn't support clearing PTE A/D bit"?
>
> It sounds too close to "TDX doesn't support..." to me. I think I'd prefer to not
> add a comment unless you strongly object.
How about something like this?
/* TODO: Add support for aging external SPTEs, if necessary. */
That makes it clear that this path is supposed to be unreachable because KVM doesn't
yet support aging external SPTEs, while not trying to say anything about *why*
KVM doesn't support aging external SPTEs.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
2026-04-01 8:34 ` Yan Zhao
@ 2026-04-02 23:46 ` Edgecombe, Rick P
2026-04-03 10:33 ` Yan Zhao
0 siblings, 1 reply; 43+ messages in thread
From: Edgecombe, Rick P @ 2026-04-02 23:46 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: Hansen, Dave, seanjc@google.com, Huang, Kai, kas@kernel.org,
x86@kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
pbonzini@redhat.com
On Wed, 2026-04-01 at 16:34 +0800, Yan Zhao wrote:
> Issue #2:
> "Additionally, if an SPTE is atomically zapped, is_present evaluates to false.
> Because tdp_mmu_set_spte_atomic() lacks a fallback remove_external_spte()
> call like the non-atomic tdp_mmu_set_spte() has, does this mean the external
> S-EPT mapping remains fully present while KVM marks it as non-present?
> "
>
> Response:
> Up to this patch, this seems to be a valid concern. But atomic zap on the mirror
> root isn't allowed yet (though warning on such case is dropped in patch 5 [2]).
> Centralizing atomic zap to __handle_changed_spte() will be done in patch 15 [3].
> Maybe move patch 5 to after patch 15?
>
> [2] https://lore.kernel.org/kvm/20260327201421.2824383-6-rick.p.edgecombe@intel.com/
> [3] https://lore.kernel.org/kvm/20260327201421.2824383-16-rick.p.edgecombe@intel.com/
> [4] https://lore.kernel.org/all/20260106102024.25023-1-yan.y.zhao@intel.com/
Thanks for checking the AI reviews. I'll see how it slots in there.
>
> Thinking more about centralizing TDX hooks, could we be more aggressive? i.e.,
> let TDX just have a single hook set_external_spte() for propagation of changes
> from mirror page table to S-EPT?
> (below change is on code base with TDX huge page support).
I was asking Yan internally why this works but Sean's earlier attempt failed.
Yan, let's finish the discussion externally now that Sean is poking around.
I'd be inclined to kind to call the cleanup a win and leave further unification
for the future. At least not going turning over rocks.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 06/17] KVM: x86/tdp_mmu: Morph the !is_frozen_spte() check into a KVM_MMU_WARN_ON()
2026-04-02 19:21 ` Sean Christopherson
@ 2026-04-03 2:47 ` Yan Zhao
0 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2026-04-03 2:47 UTC (permalink / raw)
To: Sean Christopherson
Cc: Rick P Edgecombe, Dave Hansen, Kai Huang, kas@kernel.org,
x86@kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
pbonzini@redhat.com
On Thu, Apr 02, 2026 at 12:21:13PM -0700, Sean Christopherson wrote:
> On Thu, Apr 02, 2026, Yan Zhao wrote:
> > On Wed, Apr 01, 2026 at 12:37:51AM +0800, Edgecombe, Rick P wrote:
> > > On Mon, 2026-03-30 at 13:00 +0800, Yan Zhao wrote:
> > >
> > > Yep on the typos.
> > >
> > > > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > > > @@ -656,7 +656,13 @@ static inline int __must_check
> > > > > __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> > > > > */
> > > > > WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter-
> > > > > >old_spte));
> > > > >
> > > > > - if (is_mirror_sptep(iter->sptep) &&
> > > > > !is_frozen_spte(new_spte)) {
> > > > > + /*
> > > > > + * FROZEN_SPTE is a temporary state and should never be
> > > > > set via higher
> > > > > + * level helpers.
> > > > > + */
> > > > > + KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> > > > Why is KVM_MMU_WARN_ON() used here for new_spte while WARN_ON_ONCE()
> > > > is used
> > > > above for old_spte?
> > >
> > > For the KVM_MMU_WARN_ON() it was Sean's suggestion.
> > >
> > > https://lore.kernel.org/lkml/aYYn0nf2cayYu8e7@google.com/
> > >
> > > It allows for compiling it out, so probably a better choice. So I see
> > > the options are leave them different or opportunistically convert the
> > > other one to KVM_MMU_WARN_ON(). Thoughts?
> > I see there are mixed WARN_ON_ONCE() and KVM_MMU_WARN_ON() calls in mmu.c and
> > tdp_mmu.c. I'm not sure if there's a rule for which one to use.
>
> KVM_MMU_WARN_ON() is intended to be used only when having a WARN_ON_ONCE() enabled
> in production environments isn't worth the cost of the code. E.g. if the code in
> question is a low level helper that used "everywhere" and in super hot paths, or
> in extremely paranoid sanity checks where the platform is beyond hosed if the
> sanity check fails.
>
> The other thing with KVM_MMU_WARN_ON() is that it doesn't evaluate the expression
> when CONFIG_KVM_PROVE_MMU=n, i.e. can't be used in if-statements (the only case
> where it's used in an if-statement is wrapped with a CONFIG_KVM_PROVE_MMU #ifdef).
Ah, I missed this one.
> For infrequently used and/or slow paths, and for cases where KVM needs to take
> action no matter what, WARN_ON_ONCE() is preferred because if something goes
> sideways, we'll get a helpful splat even in production.
Got it. Thanks for the detailed explanation!
> > Is it necessary to evaluate them all and have a separate patch to convert
> > WARN_ON_ONCE() to KVM_MMU_WARN_ON()?
>
> Nope. We could probably convert a few select ones if there's good reason to do
> so, but we definitely don't want to do a bulk conversion.
Makes sense.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
2026-04-02 23:28 ` Sean Christopherson
@ 2026-04-03 9:05 ` Yan Zhao
2026-04-04 0:15 ` Edgecombe, Rick P
0 siblings, 1 reply; 43+ messages in thread
From: Yan Zhao @ 2026-04-03 9:05 UTC (permalink / raw)
To: Sean Christopherson
Cc: Rick P Edgecombe, Dave Hansen, x86@kernel.org, Kai Huang,
kas@kernel.org, linux-kernel@vger.kernel.org, pbonzini@redhat.com,
kvm@vger.kernel.org
On Thu, Apr 02, 2026 at 04:28:33PM -0700, Sean Christopherson wrote:
> On Thu, Apr 02, 2026, Rick P Edgecombe wrote:
> > On Thu, 2026-04-02 at 09:59 +0800, Yan Zhao wrote:
> > > On Thu, Apr 02, 2026 at 07:45:54AM +0800, Edgecombe, Rick P wrote:
> > > > On Mon, 2026-03-30 at 14:14 +0800, Yan Zhao wrote:
> > > > > > + KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> > > > > > +
> > > > > > + /*
> > > > > > + * Temporarily freeze the SPTE until the external PTE operation has
> > > > > > + * completed (unless the new SPTE itself will be frozen), e.g. so
> > > > > > that
> > > > > > + * concurrent faults don't attempt to install a child PTE in the
> > > > > > + * external page table before the parent PTE has been written, or
> > > > > > try
> > > > > > + * to re-install a page table before the old one was removed.
> > > > > > + */
> > > > > > + if (is_mirror_sptep(iter->sptep))
> > > > > > + ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
> > > > > > + else
> > > > > > + ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> > > > > > if (ret)
> > > > > > return ret;
> > > > > >
> > > > > > - handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> > > > > > - new_spte, iter->level, true);
> > > > > > + ret = __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> > > > > > + new_spte, iter->level, true);
> > > > >
> > > > > What about adding a comment for the tricky part for the mirror page table:
> > > > > while new_spte is set to FROZEN_SPTE in the above __tdp_mmu_set_spte_atomic()
> > > >
> > > > You meant it sets iter->sptep I think.
> > > >
> > > > > for freezing the mirror page table, the original new_spte from the caller of
> > > > > tdp_mmu_set_spte_atomic() is passed to __handle_changed_spte() in order to
> > > > > properly update statistics and propagate to the external page table.
> > > >
> > > > new_spte was already passed in. What changed? You mean that
> > > > __tdp_mmu_set_spte_atomic() sets iter->sptep and doesn't update new_spte? If so
> > > > I'm not sure if it threshold TDP MMU.
> > >
> > > For mirror page table, a successful path in tdp_mmu_set_spte_atomic() looks
> > > like this:
> > > tdp_mmu_set_spte_atomic() {
> > > __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE); ==>sets mirror to frozen
> > > __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> > > new_spte, iter->level, true);==>sets S-EPT to new_spte
> > > __kvm_tdp_mmu_write_spte(iter->sptep, new_spte); ==>sets mirror to new_spte
> > > }
> > >
> > > So, the tricky part is that S-EPT is updated to new_spte ahead of mirror EPT.
> >
> > I still don't see the point. That ordering is not new, and this patch actually
> > adds a bunch of comments around the operations above and below the
> > __handle_changed_spte() call. If you think something is still missing maybe you
> > can suggest something.
Hmm, sorry for the confusion. I didn't express it clearly.
The ordering inside tdp_mmu_set_spte_atomic() for mirror root is:
Before this patch,
1. set mirror SPTE to frozen
2. invoke TDX op to update external PTE
3. set mirror SPTE to new_spte or restore old_spte
4. if 2 succeeds, invoke handle_changed_spte() to propagate changes to
child mirror SPTEs and child external PTEs
After this patch,
1. set mirror SPTE to frozen
2. invoke __handle_changed_spte(), which propagates changes to
(1) child mirror SPTEs and child external PTEs
(2) external PTE
3. set mirror SPTE to new_spte or restore old_spte
So, the step to propagate changes to child mirror SPTEs and child external PTEs
now occurs before the steps to update the external PTE and the mirror SPTE.
> Ya, I'm a bit confused too. For me, the "tricky" part is understanding the need
> to set the mirror SPTE to FROZE_SPTE while updating the external SPTE. Once that
> is understood, I don't find passing in @new_spte to be surprising in any way.
I still find it tricky because it seems strange to me to invoke a function named
handle_changed_spte() before the change actually occurs on the SPTE (i.e.,to me,
the SPTE has only changed from xxx to FROZEN_SPTE, but handle_changed_spte()
handles changes from xxx to new_spte).
Besides, another tricky point (currently benign to TDX) is that:
before this patch, tdp_mmu_set_spte_atomic() cannot be used to atomically zap
non-leaf mirror SPTEs, since TDX requires child PTEs to be zapped before the
parent PTE;
after this patch, performing atomic zapping of non-leaf mirror SPTEs seems to be
allowed in TDP MMU since the above step 2.1 now occurs before step 2.2. However,
if step 2.2 fails after step 2.1 succeeds, step 3 cannot easily restore the real
old state.
So, if we allow atomic zap on the mirror root in the future, it looks like we
need to ensure atomic zapping of S-EPT cannot fail.
> > > > > > @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm,
> > > > > > struct tdp_iter *iter)
> > > > > > {
> > > > > > u64 new_spte;
> > > > > >
> > > > > > + if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> > > > > > + return;
> > > > > > +
> > > > > Add a comment for why mirror page table is not expected here?
> > > >
> > > > Ehh, maybe. Thinking about what to put... The warning is kind of cheating a
> > > > little bit on the idea of the patch: to forward all changes through limited ops
> > > > in a central place, such that we don't have TDX specifics encoded in core MMU.
> > > > Trying to forward this through properly would result in more burden to the TDP
> > > > MMU, so that's not the right answer either.
> > > >
> > > > "Mirror TDP doesn't support PTE aging" is a pretty obvious comment. I'm fine
> > > > just leaving it without comment, but I can add something like that. Or do you
> > > > have another suggestion?
> > > What about "External TDP doesn't support clearing PTE A/D bit"?
> >
> > It sounds too close to "TDX doesn't support..." to me. I think I'd prefer to not
> > add a comment unless you strongly object.
>
> How about something like this?
>
> /* TODO: Add support for aging external SPTEs, if necessary. */
>
> That makes it clear that this path is supposed to be unreachable because KVM doesn't
> yet support aging external SPTEs, while not trying to say anything about *why*
> KVM doesn't support aging external SPTEs.
LGTM :)
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
2026-04-02 23:10 ` Edgecombe, Rick P
2026-04-02 23:28 ` Sean Christopherson
@ 2026-04-03 9:08 ` Yan Zhao
1 sibling, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2026-04-03 9:08 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: Hansen, Dave, seanjc@google.com, x86@kernel.org, Huang, Kai,
kas@kernel.org, linux-kernel@vger.kernel.org, pbonzini@redhat.com,
kvm@vger.kernel.org
On Fri, Apr 03, 2026 at 07:10:18AM +0800, Edgecombe, Rick P wrote:
> I don't think we should add warnings for new paths in this patch. Meaning, the
> patch shouldn't increase coverage, just maintain. But it seems reasonable in
> general. How about we add it to the cleanup list.
Adding it to the cleanup list looks good to me.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
2026-04-02 23:46 ` Edgecombe, Rick P
@ 2026-04-03 10:33 ` Yan Zhao
0 siblings, 0 replies; 43+ messages in thread
From: Yan Zhao @ 2026-04-03 10:33 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: Hansen, Dave, seanjc@google.com, Huang, Kai, kas@kernel.org,
x86@kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
pbonzini@redhat.com
On Fri, Apr 03, 2026 at 07:46:21AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2026-04-01 at 16:34 +0800, Yan Zhao wrote:
> > Thinking more about centralizing TDX hooks, could we be more aggressive? i.e.,
> > let TDX just have a single hook set_external_spte() for propagation of changes
> > from mirror page table to S-EPT?
> > (below change is on code base with TDX huge page support).
>
> I was asking Yan internally why this works but Sean's earlier attempt failed.
> Yan, let's finish the discussion externally now that Sean is poking around.
Hmm, I guess why Sean provided op reclaim_external_spt() (now named
free_external_spt() in this series) is because the old_spte and new_spte
required by op set_external_spte() are not available in handle_removed_pt(), and
also because there's a "call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback)"
in handle_removed_pt().
So, if I'm not missing something, we may have 2 options for further unification:
1. pass in required old_parent_spte and new_parent_spte to handle_removed_pt(),
and invoke op set_external_spte() (instead of reclaim_external_spt()) in
handle_removed_pt().
2. as I proposed in this thread (see below key changes), assert that RCU read
lock is always held during __handle_changed_spte() (which is a reasonable
assumption in TDP MMU) and invoke op set_external_spte() for reclaiming
external pt as well.
Though invoking call_rcu() occurs before invoking op set_external_spte(),
tdp_mmu_free_sp_rcu_callback() should only occur after invoking
set_external_spte() due to __handle_changed_spte() holding RCU read lock.
However, I agree it's odd to have call_rcu() invoked before reclaiming
external pt :)
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -461,9 +461,6 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
handle_changed_spte(kvm, sp, gfn, old_spte, FROZEN_SPTE, level, shared);
}
- if (is_mirror_sp(sp))
- kvm_x86_call(reclaim_external_spt)(kvm, base_gfn, sp);
-
call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
}
@@ -563,9 +560,17 @@ static int __handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
* changes to the external SPTE.
*/
if (was_present && !was_leaf &&
- (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
+ (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
- } else if (is_mirror_sp(sp)) {
+
+ if (is_mirror_sp(sp)) {
+ /*
+ * Can also propagate changes to remove external pt. Since this
+ * occurs after the call_rcu() in handle_removed_pt(), the RCU
+ * read lock must be held.
+ */
+ RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "no rcu read lock held");
+
r = kvm_x86_call(set_external_spte)(kvm, gfn, old_spte,
new_spte, level);
> I'd be inclined to kind to call the cleanup a win and leave further unification
> for the future. At least not going turning over rocks.
I'm ok with leaving it to future refactoring.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
2026-04-03 9:05 ` Yan Zhao
@ 2026-04-04 0:15 ` Edgecombe, Rick P
0 siblings, 0 replies; 43+ messages in thread
From: Edgecombe, Rick P @ 2026-04-04 0:15 UTC (permalink / raw)
To: seanjc@google.com, Zhao, Yan Y
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Hansen, Dave,
kas@kernel.org, Huang, Kai, x86@kernel.org,
linux-kernel@vger.kernel.org
On Fri, 2026-04-03 at 17:05 +0800, Yan Zhao wrote:
> Hmm, sorry for the confusion. I didn't express it clearly.
>
> The ordering inside tdp_mmu_set_spte_atomic() for mirror root is:
>
> Before this patch,
> 1. set mirror SPTE to frozen
> 2. invoke TDX op to update external PTE
> 3. set mirror SPTE to new_spte or restore old_spte
> 4. if 2 succeeds, invoke handle_changed_spte() to propagate changes to
> child mirror SPTEs and child external PTEs
>
> After this patch,
> 1. set mirror SPTE to frozen
> 2. invoke __handle_changed_spte(), which propagates changes to
> (1) child mirror SPTEs and child external PTEs
> (2) external PTE
> 3. set mirror SPTE to new_spte or restore old_spte
>
> So, the step to propagate changes to child mirror SPTEs and child external PTEs
> now occurs before the steps to update the external PTE and the mirror SPTE.
How about I add this info to the log. I think you are right, it's an important
change to call out.
Now it's making me think... Can you (if you haven't already) scrutinize for
races/reasons that may trigger the KVM_BUG_ON() in handle_changed_spte() due to
BUSY or other? Like in the handle_removed_pt() path. I guess the write lock
saves us?
Hmm... zero step?
>
> > Ya, I'm a bit confused too. For me, the "tricky" part is understanding the need
> > to set the mirror SPTE to FROZE_SPTE while updating the external SPTE. Once that
> > is understood, I don't find passing in @new_spte to be surprising in any way.
> I still find it tricky because it seems strange to me to invoke a function named
> handle_changed_spte() before the change actually occurs on the SPTE (i.e.,to me,
> the SPTE has only changed from xxx to FROZEN_SPTE, but handle_changed_spte()
> handles changes from xxx to new_spte).
>
> Besides, another tricky point (currently benign to TDX) is that:
> before this patch, tdp_mmu_set_spte_atomic() cannot be used to atomically zap
> non-leaf mirror SPTEs, since TDX requires child PTEs to be zapped before the
> parent PTE;
> after this patch, performing atomic zapping of non-leaf mirror SPTEs seems to be
> allowed in TDP MMU since the above step 2.1 now occurs before step 2.2. However,
> if step 2.2 fails after step 2.1 succeeds, step 3 cannot easily restore the real
> old state.
> So, if we allow atomic zap on the mirror root in the future, it looks like we
> need to ensure atomic zapping of S-EPT cannot fail.
I think we shouldn't comment these kind of TDX specifics. It won't confuse non-
TDX developers working in the TDP MMU I think.
^ permalink raw reply [flat|nested] 43+ messages in thread
end of thread, other threads:[~2026-04-04 0:15 UTC | newest]
Thread overview: 43+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20260327201421.2824383-1-rick.p.edgecombe@intel.com>
[not found] ` <20260327201421.2824383-7-rick.p.edgecombe@intel.com>
2026-03-30 5:00 ` [PATCH 06/17] KVM: x86/tdp_mmu: Morph the !is_frozen_spte() check into a KVM_MMU_WARN_ON() Yan Zhao
2026-03-31 16:37 ` Edgecombe, Rick P
2026-04-02 1:06 ` Yan Zhao
2026-04-02 19:21 ` Sean Christopherson
2026-04-03 2:47 ` Yan Zhao
[not found] ` <20260327201421.2824383-9-rick.p.edgecombe@intel.com>
2026-03-30 6:28 ` [PATCH 08/17] KVM: TDX: Drop kvm_x86_ops.link_external_spt(), use .set_external_spte() for all Yan Zhao
[not found] ` <20260327201421.2824383-10-rick.p.edgecombe@intel.com>
2026-03-30 6:43 ` [PATCH 09/17] KVM: TDX: Add helper to handle mapping leaf SPTE into S-EPT Yan Zhao
2026-04-01 23:59 ` Edgecombe, Rick P
[not found] ` <20260327201421.2824383-15-rick.p.edgecombe@intel.com>
2026-03-30 7:01 ` [PATCH 14/17] KVM: x86/mmu: Remove KVM_BUG_ON() that checks lock when removing PTs Yan Zhao
2026-03-31 10:46 ` Huang, Kai
2026-04-02 0:08 ` Edgecombe, Rick P
2026-04-02 2:04 ` Yan Zhao
[not found] ` <20260327201421.2824383-18-rick.p.edgecombe@intel.com>
2026-03-30 7:49 ` [PATCH 17/17] KVM: TDX: Move external page table freeing to TDX code Yan Zhao
2026-04-02 0:17 ` Edgecombe, Rick P
2026-04-02 2:16 ` Yan Zhao
2026-04-02 2:17 ` Yan Zhao
2026-03-31 11:02 ` Huang, Kai
[not found] ` <20260327201421.2824383-3-rick.p.edgecombe@intel.com>
2026-03-31 9:47 ` [PATCH 02/17] KVM: x86/mmu: Update iter->old_spte if cmpxchg64 on mirror SPTE "fails" Huang, Kai
2026-03-31 9:17 ` Yan Zhao
2026-03-31 9:59 ` Huang, Kai
2026-03-31 9:22 ` Yan Zhao
2026-03-31 10:14 ` Huang, Kai
[not found] ` <20260327201421.2824383-8-rick.p.edgecombe@intel.com>
2026-03-30 6:14 ` [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs Yan Zhao
2026-04-01 23:45 ` Edgecombe, Rick P
2026-04-02 1:59 ` Yan Zhao
2026-04-02 23:10 ` Edgecombe, Rick P
2026-04-02 23:28 ` Sean Christopherson
2026-04-03 9:05 ` Yan Zhao
2026-04-04 0:15 ` Edgecombe, Rick P
2026-04-03 9:08 ` Yan Zhao
2026-03-31 10:09 ` Huang, Kai
2026-04-01 23:58 ` Edgecombe, Rick P
2026-04-02 23:21 ` Sean Christopherson
2026-04-01 8:34 ` Yan Zhao
2026-04-02 23:46 ` Edgecombe, Rick P
2026-04-03 10:33 ` Yan Zhao
[not found] ` <20260327201421.2824383-11-rick.p.edgecombe@intel.com>
2026-03-31 10:30 ` [PATCH 10/17] KVM: TDX: Move set_external_spte_present() assert into TDX code Huang, Kai
2026-04-02 0:00 ` Edgecombe, Rick P
2026-03-31 10:34 ` Huang, Kai
[not found] ` <20260327201421.2824383-12-rick.p.edgecombe@intel.com>
2026-03-31 10:36 ` [PATCH 11/17] KVM: x86/mmu: Fold set_external_spte_present() into its sole caller Huang, Kai
2026-04-01 7:41 ` Yan Zhao
[not found] ` <20260327201421.2824383-14-rick.p.edgecombe@intel.com>
2026-03-31 10:42 ` [PATCH 13/17] KVM: TDX: Hoist tdx_sept_remove_private_spte() above set_private_spte() Huang, Kai
2026-04-02 0:04 ` Edgecombe, Rick P
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox