public inbox for kvm@vger.kernel.org
* [PATCH 00/17] TDX MMU refactors
@ 2026-03-27 20:14 Rick Edgecombe
  2026-03-27 20:14 ` [PATCH 01/17] x86/tdx: Use pg_level in TDX APIs, not the TDX-Module's 0-based level Rick Edgecombe
                   ` (16 more replies)
  0 siblings, 17 replies; 61+ messages in thread
From: Rick Edgecombe @ 2026-03-27 20:14 UTC (permalink / raw)
  To: seanjc, pbonzini, yan.y.zhao, kai.huang, kvm, kas
  Cc: linux-kernel, x86, dave.hansen, rick.p.edgecombe

Hi,

These are all the MMU refactors extracted from the discussion on Sean's 
DPAMT/Huge page combined series [0]. Since they are valid cleanups in 
their own right, we could maybe get them upstream ahead of time. Yan and 
I rebased DPAMT/huge pages on this series and made sure it all still 
fits together.

It’s not suggested for merging yet. We should go over it again for a more
thorough bug check with everything put together. Sean/Paolo please feel
free to ignore for now, except to maybe comment if you have any objections
to merging it earlier as a cleanup series.

Changes from on-list discussion:
 - Go back to the free_external_spt() name. Since free_external_sp() was 
   dropped from the changes, there is no longer a similarly named 
   function to confuse it with.
 - Suggestions around dropping or moving KVM_BUG_ON/WARNs turned into
   patches.
   
Another possible point of discussion:
 - Is the last patch needed, and does it fit into a general
   refactor/cleanup? (Yan)

Since some of the patches were created from diffs posted by Sean without
SOB, I left them not signed off for now.

This is based on kvm-x86-next-2026.03.13; testing was our usual CI suite.
AI review is pending. We'll work through any Sashiko feedback on the list.

Thanks,
Rick

[0] https://lore.kernel.org/kvm/20260129011517.3545883-1-seanjc@google.com/

Rick Edgecombe (4):
  KVM: x86/tdp_mmu: Drop zapping KVM_BUG_ON() in
    set_external_spte_present()
  KVM: x86/tdp_mmu: Morph the !is_frozen_spte() check into a
    KVM_MMU_WARN_ON()
  KVM: TDX: Move set_external_spte_present() assert into TDX code
  KVM: x86/mmu: Remove KVM_BUG_ON() that checks lock when removing PTs

Sean Christopherson (13):
  x86/tdx: Use pg_level in TDX APIs, not the TDX-Module's 0-based level
  KVM: x86/mmu: Update iter->old_spte if cmpxchg64 on mirror SPTE
    "fails"
  KVM: TDX: Account all non-transient page allocations for per-TD
    structures
  KVM: x86: Make "external SPTE" ops that can fail RET0 static calls
  KVM: x86/tdp_mmu: Centralize updates to present external PTEs
  KVM: TDX: Drop kvm_x86_ops.link_external_spt(), use
    .set_external_spte() for all
  KVM: TDX: Add helper to handle mapping leaf SPTE into S-EPT
  KVM: x86/mmu: Fold set_external_spte_present() into its sole caller
  KVM: x86/mmu: Plumb the old_spte into kvm_x86_ops.set_external_spte()
  KVM: TDX: Hoist tdx_sept_remove_private_spte() above
    set_private_spte()
  KVM: TDX: Handle removal of leaf SPTEs in .set_private_spte()
  KVM: x86: Move error handling inside free_external_spt()
  KVM: TDX: Move external page table freeing to TDX code

 arch/x86/include/asm/kvm-x86-ops.h |   4 +-
 arch/x86/include/asm/kvm_host.h    |  13 +-
 arch/x86/include/asm/tdx.h         |  14 +-
 arch/x86/kvm/mmu/tdp_mmu.c         | 253 ++++++++++++-----------------
 arch/x86/kvm/vmx/tdx.c             | 147 ++++++++++-------
 arch/x86/virt/vmx/tdx/tdx.c        |  26 ++-
 6 files changed, 219 insertions(+), 238 deletions(-)

-- 
2.53.0


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH 01/17] x86/tdx: Use pg_level in TDX APIs, not the TDX-Module's 0-based level
  2026-03-27 20:14 [PATCH 00/17] TDX MMU refactors Rick Edgecombe
@ 2026-03-27 20:14 ` Rick Edgecombe
  2026-03-27 20:14 ` [PATCH 02/17] KVM: x86/mmu: Update iter->old_spte if cmpxchg64 on mirror SPTE "fails" Rick Edgecombe
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 61+ messages in thread
From: Rick Edgecombe @ 2026-03-27 20:14 UTC (permalink / raw)
  To: seanjc, pbonzini, yan.y.zhao, kai.huang, kvm, kas
  Cc: linux-kernel, x86, dave.hansen, rick.p.edgecombe

From: Sean Christopherson <seanjc@google.com>

Rework the TDX APIs to take the kernel's 1-based pg_level enum, not the
TDX-Module's 0-based level.  The APIs are _kernel_ APIs, not TDX-Module
APIs, and the kernel (and KVM) uses "enum pg_level" literally everywhere.

Using "enum pg_level" eliminates ambiguity when looking at the APIs (it's
NOT clear that "int level" refers to the TDX-Module's level), and will
allow for using existing helpers like page_level_size() when support for
hugepages is added to the S-EPT APIs.

No functional change intended.

Cc: Kai Huang <kai.huang@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Acked-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/include/asm/tdx.h  | 14 ++++----------
 arch/x86/kvm/vmx/tdx.c      | 11 ++++-------
 arch/x86/virt/vmx/tdx/tdx.c | 26 ++++++++++++++++++--------
 3 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index a149740b24e8..c140ddde59ff 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -187,19 +187,13 @@ static inline u64 mk_keyed_paddr(u16 hkid, struct page *page)
 	return ret;
 }
 
-static inline int pg_level_to_tdx_sept_level(enum pg_level level)
-{
-        WARN_ON_ONCE(level == PG_LEVEL_NONE);
-        return level - 1;
-}
-
 u64 tdh_vp_enter(struct tdx_vp *vp, struct tdx_module_args *args);
 u64 tdh_mng_addcx(struct tdx_td *td, struct page *tdcs_page);
 u64 tdh_mem_page_add(struct tdx_td *td, u64 gpa, struct page *page, struct page *source, u64 *ext_err1, u64 *ext_err2);
-u64 tdh_mem_sept_add(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2);
+u64 tdh_mem_sept_add(struct tdx_td *td, u64 gpa, enum pg_level level, struct page *page, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_vp_addcx(struct tdx_vp *vp, struct page *tdcx_page);
-u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2);
-u64 tdh_mem_range_block(struct tdx_td *td, u64 gpa, int level, u64 *ext_err1, u64 *ext_err2);
+u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, enum pg_level level, struct page *page, u64 *ext_err1, u64 *ext_err2);
+u64 tdh_mem_range_block(struct tdx_td *td, u64 gpa, enum pg_level level, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_mng_key_config(struct tdx_td *td);
 u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
 u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
@@ -215,7 +209,7 @@ u64 tdh_vp_rd(struct tdx_vp *vp, u64 field, u64 *data);
 u64 tdh_vp_wr(struct tdx_vp *vp, u64 field, u64 data, u64 mask);
 u64 tdh_phymem_page_reclaim(struct page *page, u64 *tdx_pt, u64 *tdx_owner, u64 *tdx_size);
 u64 tdh_mem_track(struct tdx_td *tdr);
-u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u64 *ext_err2);
+u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, enum pg_level level, u64 *ext_err1, u64 *ext_err2);
 u64 tdh_phymem_cache_wb(bool resume);
 u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td);
 u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1e47c194af53..38e7b6fa8664 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1638,14 +1638,13 @@ static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level,
 static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
 			    enum pg_level level, kvm_pfn_t pfn)
 {
-	int tdx_level = pg_level_to_tdx_sept_level(level);
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	struct page *page = pfn_to_page(pfn);
 	gpa_t gpa = gfn_to_gpa(gfn);
 	u64 entry, level_state;
 	u64 err;
 
-	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
+	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, level, page, &entry, &level_state);
 	if (unlikely(tdx_operand_busy(err)))
 		return -EBUSY;
 
@@ -1689,12 +1688,11 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
 				     enum pg_level level, void *private_spt)
 {
-	int tdx_level = pg_level_to_tdx_sept_level(level);
 	gpa_t gpa = gfn_to_gpa(gfn);
 	struct page *page = virt_to_page(private_spt);
 	u64 err, entry, level_state;
 
-	err = tdh_mem_sept_add(&to_kvm_tdx(kvm)->td, gpa, tdx_level, page, &entry,
+	err = tdh_mem_sept_add(&to_kvm_tdx(kvm)->td, gpa, level, page, &entry,
 			       &level_state);
 	if (unlikely(tdx_operand_busy(err)))
 		return -EBUSY;
@@ -1778,7 +1776,6 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 					 enum pg_level level, u64 mirror_spte)
 {
 	struct page *page = pfn_to_page(spte_to_pfn(mirror_spte));
-	int tdx_level = pg_level_to_tdx_sept_level(level);
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	gpa_t gpa = gfn_to_gpa(gfn);
 	u64 err, entry, level_state;
@@ -1798,7 +1795,7 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 		return;
 
 	err = tdh_do_no_vcpus(tdh_mem_range_block, kvm, &kvm_tdx->td, gpa,
-			      tdx_level, &entry, &level_state);
+			      level, &entry, &level_state);
 	if (TDX_BUG_ON_2(err, TDH_MEM_RANGE_BLOCK, entry, level_state, kvm))
 		return;
 
@@ -1814,7 +1811,7 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	 * Race with TDH.VP.ENTER due to (0-step mitigation) and Guest TDCALLs.
 	 */
 	err = tdh_do_no_vcpus(tdh_mem_page_remove, kvm, &kvm_tdx->td, gpa,
-			      tdx_level, &entry, &level_state);
+			      level, &entry, &level_state);
 	if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_REMOVE, entry, level_state, kvm))
 		return;
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index cb9b3210ab71..a6e77afafa79 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1568,6 +1568,12 @@ static void tdx_clflush_page(struct page *page)
 	clflush_cache_range(page_to_virt(page), PAGE_SIZE);
 }
 
+static int pg_level_to_tdx_sept_level(enum pg_level level)
+{
+	WARN_ON_ONCE(level == PG_LEVEL_NONE);
+	return level - 1;
+}
+
 noinstr u64 tdh_vp_enter(struct tdx_vp *td, struct tdx_module_args *args)
 {
 	args->rcx = td->tdvpr_pa;
@@ -1608,10 +1614,11 @@ u64 tdh_mem_page_add(struct tdx_td *td, u64 gpa, struct page *page, struct page
 }
 EXPORT_SYMBOL_FOR_KVM(tdh_mem_page_add);
 
-u64 tdh_mem_sept_add(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2)
+u64 tdh_mem_sept_add(struct tdx_td *td, u64 gpa, enum pg_level level,
+		     struct page *page, u64 *ext_err1, u64 *ext_err2)
 {
 	struct tdx_module_args args = {
-		.rcx = gpa | level,
+		.rcx = gpa | pg_level_to_tdx_sept_level(level),
 		.rdx = tdx_tdr_pa(td),
 		.r8 = page_to_phys(page),
 	};
@@ -1639,10 +1646,11 @@ u64 tdh_vp_addcx(struct tdx_vp *vp, struct page *tdcx_page)
 }
 EXPORT_SYMBOL_FOR_KVM(tdh_vp_addcx);
 
-u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2)
+u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, enum pg_level level,
+		     struct page *page, u64 *ext_err1, u64 *ext_err2)
 {
 	struct tdx_module_args args = {
-		.rcx = gpa | level,
+		.rcx = gpa | pg_level_to_tdx_sept_level(level),
 		.rdx = tdx_tdr_pa(td),
 		.r8 = page_to_phys(page),
 	};
@@ -1658,10 +1666,11 @@ u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u
 }
 EXPORT_SYMBOL_FOR_KVM(tdh_mem_page_aug);
 
-u64 tdh_mem_range_block(struct tdx_td *td, u64 gpa, int level, u64 *ext_err1, u64 *ext_err2)
+u64 tdh_mem_range_block(struct tdx_td *td, u64 gpa, enum pg_level level,
+			u64 *ext_err1, u64 *ext_err2)
 {
 	struct tdx_module_args args = {
-		.rcx = gpa | level,
+		.rcx = gpa | pg_level_to_tdx_sept_level(level),
 		.rdx = tdx_tdr_pa(td),
 	};
 	u64 ret;
@@ -1874,10 +1883,11 @@ u64 tdh_mem_track(struct tdx_td *td)
 }
 EXPORT_SYMBOL_FOR_KVM(tdh_mem_track);
 
-u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u64 *ext_err2)
+u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, enum pg_level level,
+			u64 *ext_err1, u64 *ext_err2)
 {
 	struct tdx_module_args args = {
-		.rcx = gpa | level,
+		.rcx = gpa | pg_level_to_tdx_sept_level(level),
 		.rdx = tdx_tdr_pa(td),
 	};
 	u64 ret;
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 02/17] KVM: x86/mmu: Update iter->old_spte if cmpxchg64 on mirror SPTE "fails"
  2026-03-27 20:14 [PATCH 00/17] TDX MMU refactors Rick Edgecombe
  2026-03-27 20:14 ` [PATCH 01/17] x86/tdx: Use pg_level in TDX APIs, not the TDX-Module's 0-based level Rick Edgecombe
@ 2026-03-27 20:14 ` Rick Edgecombe
  2026-03-31  9:47   ` Huang, Kai
  2026-03-27 20:14 ` [PATCH 03/17] KVM: TDX: Account all non-transient page allocations for per-TD structures Rick Edgecombe
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 61+ messages in thread
From: Rick Edgecombe @ 2026-03-27 20:14 UTC (permalink / raw)
  To: seanjc, pbonzini, yan.y.zhao, kai.huang, kvm, kas
  Cc: linux-kernel, x86, dave.hansen, rick.p.edgecombe

From: Sean Christopherson <seanjc@google.com>

Pass a pointer to iter->old_spte, not simply its value, when setting an
external SPTE in __tdp_mmu_set_spte_atomic(), so that the iterator's value
will be updated if the cmpxchg64 to freeze the mirror SPTE fails.  The bug
is currently benign as TDX is mutually exclusive with all paths that do
"local retry", e.g. clear_dirty_gfn_range() and wrprot_gfn_range().

Fixes: 77ac7079e66d ("KVM: x86/tdp_mmu: Propagate building mirror page tables")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7b1102d26f9c..dbaeb80f2b64 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -509,10 +509,10 @@ static void *get_external_spt(gfn_t gfn, u64 new_spte, int level)
 }
 
 static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
-						 gfn_t gfn, u64 old_spte,
+						 gfn_t gfn, u64 *old_spte,
 						 u64 new_spte, int level)
 {
-	bool was_present = is_shadow_present_pte(old_spte);
+	bool was_present = is_shadow_present_pte(*old_spte);
 	bool is_present = is_shadow_present_pte(new_spte);
 	bool is_leaf = is_present && is_last_spte(new_spte, level);
 	int ret = 0;
@@ -525,7 +525,7 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
 	 * page table has been modified. Use FROZEN_SPTE similar to
 	 * the zapping case.
 	 */
-	if (!try_cmpxchg64(rcu_dereference(sptep), &old_spte, FROZEN_SPTE))
+	if (!try_cmpxchg64(rcu_dereference(sptep), old_spte, FROZEN_SPTE))
 		return -EBUSY;
 
 	/*
@@ -541,7 +541,7 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
 		ret = kvm_x86_call(link_external_spt)(kvm, gfn, level, external_spt);
 	}
 	if (ret)
-		__kvm_tdp_mmu_write_spte(sptep, old_spte);
+		__kvm_tdp_mmu_write_spte(sptep, *old_spte);
 	else
 		__kvm_tdp_mmu_write_spte(sptep, new_spte);
 	return ret;
@@ -670,7 +670,7 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
 			return -EBUSY;
 
 		ret = set_external_spte_present(kvm, iter->sptep, iter->gfn,
-						iter->old_spte, new_spte, iter->level);
+						&iter->old_spte, new_spte, iter->level);
 		if (ret)
 			return ret;
 	} else {
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 03/17] KVM: TDX: Account all non-transient page allocations for per-TD structures
  2026-03-27 20:14 [PATCH 00/17] TDX MMU refactors Rick Edgecombe
  2026-03-27 20:14 ` [PATCH 01/17] x86/tdx: Use pg_level in TDX APIs, not the TDX-Module's 0-based level Rick Edgecombe
  2026-03-27 20:14 ` [PATCH 02/17] KVM: x86/mmu: Update iter->old_spte if cmpxchg64 on mirror SPTE "fails" Rick Edgecombe
@ 2026-03-27 20:14 ` Rick Edgecombe
  2026-03-27 20:14 ` [PATCH 04/17] KVM: x86: Make "external SPTE" ops that can fail RET0 static calls Rick Edgecombe
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 61+ messages in thread
From: Rick Edgecombe @ 2026-03-27 20:14 UTC (permalink / raw)
  To: seanjc, pbonzini, yan.y.zhao, kai.huang, kvm, kas
  Cc: linux-kernel, x86, dave.hansen, rick.p.edgecombe

From: Sean Christopherson <seanjc@google.com>

Account all non-transient allocations associated with a single TD (or its
vCPUs), as KVM's ABI is that allocations that are active for the lifetime
of a VM are accounted.  Leave temporary allocations, i.e. allocations that
are freed within a single function/ioctl, unaccounted, to again align with
KVM's existing behavior, e.g. see commit dd103407ca31 ("KVM: X86: Remove
unnecessary GFP_KERNEL_ACCOUNT for temporary variables").

Fixes: 8d032b683c29 ("KVM: TDX: create/destroy VM structure")
Fixes: a50f673f25e0 ("KVM: TDX: Do TDX specific vcpu initialization")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 38e7b6fa8664..01e070ec10fd 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2384,7 +2384,7 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
 
 	ret = -ENOMEM;
 
-	tdr_page = alloc_page(GFP_KERNEL);
+	tdr_page = alloc_page(GFP_KERNEL_ACCOUNT);
 	if (!tdr_page)
 		goto free_hkid;
 
@@ -2392,12 +2392,13 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
 	/* TDVPS = TDVPR(4K page) + TDCX(multiple 4K pages), -1 for TDVPR. */
 	kvm_tdx->td.tdcx_nr_pages = tdx_sysinfo->td_ctrl.tdvps_base_size / PAGE_SIZE - 1;
 	tdcs_pages = kzalloc_objs(*kvm_tdx->td.tdcs_pages,
-				  kvm_tdx->td.tdcs_nr_pages);
+				  kvm_tdx->td.tdcs_nr_pages,
+				  GFP_KERNEL_ACCOUNT);
 	if (!tdcs_pages)
 		goto free_tdr;
 
 	for (i = 0; i < kvm_tdx->td.tdcs_nr_pages; i++) {
-		tdcs_pages[i] = alloc_page(GFP_KERNEL);
+		tdcs_pages[i] = alloc_page(GFP_KERNEL_ACCOUNT);
 		if (!tdcs_pages[i])
 			goto free_tdcs;
 	}
@@ -2872,7 +2873,7 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
 	int ret, i;
 	u64 err;
 
-	page = alloc_page(GFP_KERNEL);
+	page = alloc_page(GFP_KERNEL_ACCOUNT);
 	if (!page)
 		return -ENOMEM;
 	tdx->vp.tdvpr_page = page;
@@ -2885,14 +2886,14 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
 	tdx->vp.tdvpr_pa = page_to_phys(tdx->vp.tdvpr_page);
 
 	tdx->vp.tdcx_pages = kcalloc(kvm_tdx->td.tdcx_nr_pages, sizeof(*tdx->vp.tdcx_pages),
-			       	     GFP_KERNEL);
+				     GFP_KERNEL_ACCOUNT);
 	if (!tdx->vp.tdcx_pages) {
 		ret = -ENOMEM;
 		goto free_tdvpr;
 	}
 
 	for (i = 0; i < kvm_tdx->td.tdcx_nr_pages; i++) {
-		page = alloc_page(GFP_KERNEL);
+		page = alloc_page(GFP_KERNEL_ACCOUNT);
 		if (!page) {
 			ret = -ENOMEM;
 			goto free_tdcx;
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 04/17] KVM: x86: Make "external SPTE" ops that can fail RET0 static calls
  2026-03-27 20:14 [PATCH 00/17] TDX MMU refactors Rick Edgecombe
                   ` (2 preceding siblings ...)
  2026-03-27 20:14 ` [PATCH 03/17] KVM: TDX: Account all non-transient page allocations for per-TD structures Rick Edgecombe
@ 2026-03-27 20:14 ` Rick Edgecombe
  2026-03-27 20:14 ` [PATCH 05/17] KVM: x86/tdp_mmu: Drop zapping KVM_BUG_ON() in set_external_spte_present() Rick Edgecombe
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 61+ messages in thread
From: Rick Edgecombe @ 2026-03-27 20:14 UTC (permalink / raw)
  To: seanjc, pbonzini, yan.y.zhao, kai.huang, kvm, kas
  Cc: linux-kernel, x86, dave.hansen, rick.p.edgecombe

From: Sean Christopherson <seanjc@google.com>

Define kvm_x86_ops .link_external_spt(), .set_external_spte(), and
.free_external_spt() as RET0 static calls so that an unexpected call to a
default operation doesn't consume garbage.

Fixes: 77ac7079e66d ("KVM: x86/tdp_mmu: Propagate building mirror page tables")
Fixes: 94faba8999b9 ("KVM: x86/tdp_mmu: Propagate tearing down mirror page tables")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 3776cf5382a2..31d5c5d58ae6 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -95,9 +95,9 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
 KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
 KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
 KVM_X86_OP(load_mmu_pgd)
-KVM_X86_OP_OPTIONAL(link_external_spt)
-KVM_X86_OP_OPTIONAL(set_external_spte)
-KVM_X86_OP_OPTIONAL(free_external_spt)
+KVM_X86_OP_OPTIONAL_RET0(link_external_spt)
+KVM_X86_OP_OPTIONAL_RET0(set_external_spte)
+KVM_X86_OP_OPTIONAL_RET0(free_external_spt)
 KVM_X86_OP_OPTIONAL(remove_external_spte)
 KVM_X86_OP(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 05/17] KVM: x86/tdp_mmu: Drop zapping KVM_BUG_ON() in set_external_spte_present()
  2026-03-27 20:14 [PATCH 00/17] TDX MMU refactors Rick Edgecombe
                   ` (3 preceding siblings ...)
  2026-03-27 20:14 ` [PATCH 04/17] KVM: x86: Make "external SPTE" ops that can fail RET0 static calls Rick Edgecombe
@ 2026-03-27 20:14 ` Rick Edgecombe
  2026-03-27 20:14 ` [PATCH 06/17] KVM: x86/tdp_mmu: Morph the !is_frozen_spte() check into a KVM_MMU_WARN_ON() Rick Edgecombe
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 61+ messages in thread
From: Rick Edgecombe @ 2026-03-27 20:14 UTC (permalink / raw)
  To: seanjc, pbonzini, yan.y.zhao, kai.huang, kvm, kas
  Cc: linux-kernel, x86, dave.hansen, rick.p.edgecombe

Drop some KVM_BUG_ON()s that guard against the TDP MMU attempting to
propagate unsupported operations to the external set_external_spte() ops.

Despite the generic naming, the external TDP ops are designed completely
around TDX. They hook the bare minimum of what is needed, and exclude the
operations that are not supported by TDX. To help wrangle which operations
can be handled where, warnings and KVM_BUG_ON()s exist in the code. These
warnings and KVM_BUG_ON()s put the burden of understanding which
operations should be forwarded to TDX code on the TDP MMU developers, who
often read the code without TDX context.

Future changes will transition the encapsulation of this domain knowledge
to TDX code by funneling the external EPT updates through a central update
mechanism. In this paradigm, the central update mechanism can encapsulate
the special knowledge, but will not have as much knowledge about what
operation is in progress. So remove the set-external-SPTE-based
KVM_BUG_ON()s in preparation for this future change.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index dbaeb80f2b64..0809fe8e8737 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -512,13 +512,10 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
 						 gfn_t gfn, u64 *old_spte,
 						 u64 new_spte, int level)
 {
-	bool was_present = is_shadow_present_pte(*old_spte);
 	bool is_present = is_shadow_present_pte(new_spte);
 	bool is_leaf = is_present && is_last_spte(new_spte, level);
 	int ret = 0;
 
-	KVM_BUG_ON(was_present, kvm);
-
 	lockdep_assert_held(&kvm->mmu_lock);
 	/*
 	 * We need to lock out other updates to the SPTE until the external
@@ -662,13 +659,6 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
 	if (is_mirror_sptep(iter->sptep) && !is_frozen_spte(new_spte)) {
 		int ret;
 
-		/*
-		 * Users of atomic zapping don't operate on mirror roots,
-		 * so don't handle it and bug the VM if it's seen.
-		 */
-		if (KVM_BUG_ON(!is_shadow_present_pte(new_spte), kvm))
-			return -EBUSY;
-
 		ret = set_external_spte_present(kvm, iter->sptep, iter->gfn,
 						&iter->old_spte, new_spte, iter->level);
 		if (ret)
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 06/17] KVM: x86/tdp_mmu: Morph the !is_frozen_spte() check into a KVM_MMU_WARN_ON()
  2026-03-27 20:14 [PATCH 00/17] TDX MMU refactors Rick Edgecombe
                   ` (4 preceding siblings ...)
  2026-03-27 20:14 ` [PATCH 05/17] KVM: x86/tdp_mmu: Drop zapping KVM_BUG_ON() in set_external_spte_present() Rick Edgecombe
@ 2026-03-27 20:14 ` Rick Edgecombe
  2026-03-30  5:00   ` Yan Zhao
  2026-03-27 20:14 ` [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs Rick Edgecombe
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 61+ messages in thread
From: Rick Edgecombe @ 2026-03-27 20:14 UTC (permalink / raw)
  To: seanjc, pbonzini, yan.y.zhao, kai.huang, kvm, kas
  Cc: linux-kernel, x86, dave.hansen, rick.p.edgecombe

Remove the conditional logic that handles setting mirror SPTEs to frozen
in __tdp_mmu_set_spte_atomic() and add it as a warning instead.

Mirror TDP needs to propagate PTE changes to the external TDP. This
presents a problem for atomic updates, which can't update both at once. So
a special value, FROZEN_SPTE, is used as a temporary state during these
updates to prevent concurrent operations on the PTE. If the TDP MMU tried
to install this as a long-term value, it would confuse these updates.
Despite this, __tdp_mmu_set_spte_atomic() includes a check to handle it
being set. Remove this check and turn it into a warning.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 0809fe8e8737..338957bc5109 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -656,7 +656,13 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
 	 */
 	WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
 
-	if (is_mirror_sptep(iter->sptep) && !is_frozen_spte(new_spte)) {
+	/*
+	 * FROZEN_SPTE is a temporary state and should never be set via higher
+	 * level helpers.
+	 */
+	KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
+
+	if (is_mirror_sptep(iter->sptep)) {
 		int ret;
 
 		ret = set_external_spte_present(kvm, iter->sptep, iter->gfn,
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
  2026-03-27 20:14 [PATCH 00/17] TDX MMU refactors Rick Edgecombe
                   ` (5 preceding siblings ...)
  2026-03-27 20:14 ` [PATCH 06/17] KVM: x86/tdp_mmu: Morph the !is_frozen_spte() check into a KVM_MMU_WARN_ON() Rick Edgecombe
@ 2026-03-27 20:14 ` Rick Edgecombe
  2026-03-30  6:14   ` Yan Zhao
                     ` (2 more replies)
  2026-03-27 20:14 ` [PATCH 08/17] KVM: TDX: Drop kvm_x86_ops.link_external_spt(), use .set_external_spte() for all Rick Edgecombe
                   ` (9 subsequent siblings)
  16 siblings, 3 replies; 61+ messages in thread
From: Rick Edgecombe @ 2026-03-27 20:14 UTC (permalink / raw)
  To: seanjc, pbonzini, yan.y.zhao, kai.huang, kvm, kas
  Cc: linux-kernel, x86, dave.hansen, rick.p.edgecombe

From: Sean Christopherson <seanjc@google.com>

Centralize the updates to present external PTEs to the
handle_changed_spte() function.

When setting a PTE to present in the mirror page tables, the update needs
to propagate to the external page tables (in TDX parlance the S-EPT).
Today this is handled by special mirror page tables branching in
__tdp_mmu_set_spte_atomic(), which is the only place where present PTEs
are set for TDX.

This keeps things running, but is a bit hacked on. The hooks for setting
present leaf PTEs are added only where TDX happens to need them. For
example, TDX does not support any of the operations that use the
non-atomic variant, tdp_mmu_set_spte(), to set present PTEs. Since the
hook is missing there, it is very hard to understand the code from a
non-TDX lens. If the reader doesn't know the TDX specifics, it could look
like the external update is missing.

In addition to being confusing, it also litters the TDP MMU with
"external" update callbacks. This is especially unfortunate because there
is already a central place to react to TDP updates, handle_changed_spte().

Begin the process of moving towards a model where all mirror page table
updates are forwarded to TDX code where the TDX specific logic can live
with a more proper separation of concerns. Do this by teaching
handle_changed_spte() how to return error codes, such that it can
propagate the failures that may come from TDX external page table updates.

Atomic mirror page table updates need to be done in a special way to
prevent concurrent updates to the mirror page table while the external
page table is updated. The mirror page table is set to the frozen PTE
value while the external version is updated. This frozen PTE dance is
currently done in __tdp_mmu_set_spte_atomic(). Hoist it up a level so that
the external update in handle_changed_spte() can be done while the PTE is
frozen.

Link: https://lore.kernel.org/lkml/aYYn0nf2cayYu8e7@google.com/
Not-yet-Signed-off-by: Sean Christopherson <seanjc@google.com>
[Based on a diff by Sean Christopherson]
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 150 ++++++++++++++++++++++---------------
 1 file changed, 88 insertions(+), 62 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 338957bc5109..db16e81b9701 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -320,9 +320,9 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
 	}
 }
 
-static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-				u64 old_spte, u64 new_spte, int level,
-				bool shared);
+static void handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
+				gfn_t gfn, u64 old_spte, u64 new_spte,
+				int level, bool shared);
 
 static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
@@ -471,8 +471,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 			old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte,
 							  FROZEN_SPTE, level);
 		}
-		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
-				    old_spte, FROZEN_SPTE, level, shared);
+		handle_changed_spte(kvm, sp, gfn, old_spte, FROZEN_SPTE, level, shared);
 
 		if (is_mirror_sp(sp)) {
 			KVM_BUG_ON(shared, kvm);
@@ -508,22 +507,15 @@ static void *get_external_spt(gfn_t gfn, u64 new_spte, int level)
 	return NULL;
 }
 
-static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
-						 gfn_t gfn, u64 *old_spte,
-						 u64 new_spte, int level)
+static int __must_check set_external_spte_present(struct kvm *kvm,
+						  gfn_t gfn, u64 old_spte,
+						  u64 new_spte, int level)
 {
 	bool is_present = is_shadow_present_pte(new_spte);
 	bool is_leaf = is_present && is_last_spte(new_spte, level);
 	int ret = 0;
 
 	lockdep_assert_held(&kvm->mmu_lock);
-	/*
-	 * We need to lock out other updates to the SPTE until the external
-	 * page table has been modified. Use FROZEN_SPTE similar to
-	 * the zapping case.
-	 */
-	if (!try_cmpxchg64(rcu_dereference(sptep), old_spte, FROZEN_SPTE))
-		return -EBUSY;
 
 	/*
 	 * Use different call to either set up middle level
@@ -537,17 +529,13 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
 		KVM_BUG_ON(!external_spt, kvm);
 		ret = kvm_x86_call(link_external_spt)(kvm, gfn, level, external_spt);
 	}
-	if (ret)
-		__kvm_tdp_mmu_write_spte(sptep, *old_spte);
-	else
-		__kvm_tdp_mmu_write_spte(sptep, new_spte);
 	return ret;
 }
 
 /**
- * handle_changed_spte - handle bookkeeping associated with an SPTE change
+ * __handle_changed_spte - handle bookkeeping associated with an SPTE change
  * @kvm: kvm instance
- * @as_id: the address space of the paging structure the SPTE was a part of
+ * @sp: the page table in which the SPTE resides
  * @gfn: the base GFN that was mapped by the SPTE
  * @old_spte: The value of the SPTE before the change
  * @new_spte: The value of the SPTE after the change
@@ -560,15 +548,17 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
  * dirty logging updates are handled in common code, not here (see make_spte()
  * and fast_pf_fix_direct_spte()).
  */
-static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-				u64 old_spte, u64 new_spte, int level,
-				bool shared)
+static int __handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
+				 gfn_t gfn, u64 old_spte, u64 new_spte,
+				 int level, bool shared)
 {
 	bool was_present = is_shadow_present_pte(old_spte);
 	bool is_present = is_shadow_present_pte(new_spte);
 	bool was_leaf = was_present && is_last_spte(old_spte, level);
 	bool is_leaf = is_present && is_last_spte(new_spte, level);
 	bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+	int as_id = kvm_mmu_page_as_id(sp);
+	int r;
 
 	WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
 	WARN_ON_ONCE(level < PG_LEVEL_4K);
@@ -598,9 +588,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	}
 
 	if (old_spte == new_spte)
-		return;
-
-	trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
+		return 0;
 
 	if (is_leaf)
 		check_spte_writable_invariants(new_spte);
@@ -627,21 +615,41 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 			       "a temporary frozen SPTE.\n"
 			       "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
 			       as_id, gfn, old_spte, new_spte, level);
-		return;
+		return 0;
 	}
 
-	if (is_leaf != was_leaf)
-		kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
-
 	/*
 	 * Recursively handle child PTs if the change removed a subtree from
 	 * the paging structure.  Note the WARN on the PFN changing without the
 	 * SPTE being converted to a hugepage (leaf) or being zapped.  Shadow
 	 * pages are kernel allocations and should never be migrated.
+	 *
+	 * When modifying leaf entries in mirrored page tables, propagate all
+	 * changes to the external SPTE.
 	 */
 	if (was_present && !was_leaf &&
-	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
+	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
+	} else if (is_mirror_sp(sp) && is_present) {
+		r = set_external_spte_present(kvm, gfn, old_spte, new_spte, level);
+		if (r)
+			return r;
+	}
+
+	trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
+
+	if (is_leaf != was_leaf)
+		kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
+
+	return 0;
+}
+
+static void handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
+				gfn_t gfn, u64 old_spte, u64 new_spte,
+				int level, bool shared)
+{
+	KVM_BUG_ON(__handle_changed_spte(kvm, sp, gfn, old_spte, new_spte,
+					 level, shared), kvm);
 }
 
 static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
@@ -657,32 +665,14 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
 	WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
 
 	/*
-	 * FROZEN_SPTE is a temporary state and should never be set via higher
-	 * level helpers.
+	 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
+	 * does not hold the mmu_lock.  On failure, i.e. if a different logical
+	 * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
+	 * the current value, so the caller operates on fresh data, e.g. if it
+	 * retries tdp_mmu_set_spte_atomic()
 	 */
-	KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
-
-	if (is_mirror_sptep(iter->sptep)) {
-		int ret;
-
-		ret = set_external_spte_present(kvm, iter->sptep, iter->gfn,
-						&iter->old_spte, new_spte, iter->level);
-		if (ret)
-			return ret;
-	} else {
-		u64 *sptep = rcu_dereference(iter->sptep);
-
-		/*
-		 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs
-		 * and does not hold the mmu_lock.  On failure, i.e. if a
-		 * different logical CPU modified the SPTE, try_cmpxchg64()
-		 * updates iter->old_spte with the current value, so the caller
-		 * operates on fresh data, e.g. if it retries
-		 * tdp_mmu_set_spte_atomic()
-		 */
-		if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
-			return -EBUSY;
-	}
+	if (!try_cmpxchg64(rcu_dereference(iter->sptep), &iter->old_spte, new_spte))
+		return -EBUSY;
 
 	return 0;
 }
@@ -708,18 +698,49 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
 						       struct tdp_iter *iter,
 						       u64 new_spte)
 {
+	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter->sptep));
 	int ret;
 
 	lockdep_assert_held_read(&kvm->mmu_lock);
 
-	ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
+	/* KVM should never freeze SPTEs using higher level APIs. */
+	KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
+
+	/*
+	 * Temporarily freeze the SPTE until the external PTE operation has
+	 * completed (unless the new SPTE itself will be frozen), e.g. so that
+	 * concurrent faults don't attempt to install a child PTE in the
+	 * external page table before the parent PTE has been written, or try
+	 * to re-install a page table before the old one was removed.
+	 */
+	if (is_mirror_sptep(iter->sptep))
+		ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
+	else
+		ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
 	if (ret)
 		return ret;
 
-	handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
-			    new_spte, iter->level, true);
+	ret = __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
+				    new_spte, iter->level, true);
 
-	return 0;
+	/*
+	 * Unfreeze the mirror SPTE.  If updating the external SPTE failed,
+	 * restore the old SPTE so that the SPTE isn't frozen in perpetuity,
+	 * otherwise set the mirror SPTE to the new desired value.
+	 */
+	if (is_mirror_sptep(iter->sptep)) {
+		if (ret)
+			__kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte);
+		else
+			__kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
+	} else {
+		/*
+		 * Bug the VM if handling the change failed, as failure is only
+		 * allowed if KVM couldn't update the external SPTE.
+		 */
+		KVM_BUG_ON(ret, kvm);
+	}
+	return ret;
 }
 
 /*
@@ -738,6 +759,8 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
 static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
 			    u64 old_spte, u64 new_spte, gfn_t gfn, int level)
 {
+	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(sptep));
+
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
 	/*
@@ -751,7 +774,7 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
 
 	old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
 
-	handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
+	handle_changed_spte(kvm, sp, gfn, old_spte, new_spte, level, false);
 
 	/*
 	 * Users that do non-atomic setting of PTEs don't operate on mirror
@@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm, struct tdp_iter *iter)
 {
 	u64 new_spte;
 
+	if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
+		return;
+
 	if (spte_ad_enabled(iter->old_spte)) {
 		iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
 								shadow_accessed_mask);
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 08/17] KVM: TDX: Drop kvm_x86_ops.link_external_spt(), use .set_external_spte() for all
  2026-03-27 20:14 [PATCH 00/17] TDX MMU refactors Rick Edgecombe
                   ` (6 preceding siblings ...)
  2026-03-27 20:14 ` [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs Rick Edgecombe
@ 2026-03-27 20:14 ` Rick Edgecombe
  2026-03-30  6:28   ` Yan Zhao
  2026-03-27 20:14 ` [PATCH 09/17] KVM: TDX: Add helper to handle mapping leaf SPTE into S-EPT Rick Edgecombe
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 61+ messages in thread
From: Rick Edgecombe @ 2026-03-27 20:14 UTC (permalink / raw)
  To: seanjc, pbonzini, yan.y.zhao, kai.huang, kvm, kas
  Cc: linux-kernel, x86, dave.hansen, rick.p.edgecombe

From: Sean Christopherson <seanjc@google.com>

Drop the dedicated .link_external_spt() for linking non-leaf S-EPT pages,
and instead funnel everything through .set_external_spte().  Using separate
hooks doesn't help prevent TDP MMU details from bleeding into TDX, and vice
versa; to the contrary, dedicated callbacks will result in _more_ pollution
when hugepage support is added, e.g. will require the TDP MMU to know
details about the splitting rules for TDX that aren't all that relevant to
the TDP MMU.

Ideally, KVM would provide a single pair of hooks to set S-EPT entries,
one hook for setting SPTEs under write-lock and another for setting SPTEs
under read-lock (e.g. to ensure the entire operation is "atomic", to allow
for failure, etc.).  Sadly, TDX's requirement that all child S-EPT entries
are removed before the parent makes that impractical: the TDP MMU
deliberately prunes non-leaf SPTEs and _then_ processes its children, thus
making it quite important for the TDP MMU to differentiate between zapping
leaf and non-leaf S-EPT entries.

However, that's the _only_ case that's truly special, and even that case
could be shoehorned into a single hook; it just wouldn't be a net
positive.
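The resulting single-hook dispatch can be sketched as a toy model (standalone C, not kernel code; the leaf-bit encoding and enum names are illustrative):

```c
/*
 * Toy model of funneling everything through one .set_external_spte()
 * hook: the callback inspects the new SPTE and routes non-leaf entries
 * to the link (TDH.MEM.SEPT.ADD) path and leaf entries to the map
 * (PAGE.ADD/PAGE.AUG) path.  The leaf encoding here is made up.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum sept_op { SEPT_LINK, SEPT_MAP };

/* Illustrative: treat bit 62 as "last level" except at the 4K level. */
static bool model_is_last_spte(uint64_t spte, int level)
{
	return level == 1 || (spte & (1ULL << 62));
}

static enum sept_op model_set_external_spte(uint64_t new_spte, int level)
{
	if (!model_is_last_spte(new_spte, level))
		return SEPT_LINK;	/* non-leaf: add an S-EPT page */
	return SEPT_MAP;		/* leaf: ADD or AUG guest memory */
}
```

The point of the model is only that one entry point can distinguish the two cases from the SPTE itself, so no second hook is needed.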

Signed-off-by: Sean Christopherson <seanjc@google.com>
[add in trivial feedback]
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
v6:
 - rename external_spt->sept_pt (Rick, Yan)
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 -
 arch/x86/include/asm/kvm_host.h    |  3 --
 arch/x86/kvm/mmu/tdp_mmu.c         | 31 +--------------
 arch/x86/kvm/vmx/tdx.c             | 61 ++++++++++++++++++++----------
 4 files changed, 43 insertions(+), 53 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 31d5c5d58ae6..bced6d938702 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -95,7 +95,6 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
 KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
 KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
 KVM_X86_OP(load_mmu_pgd)
-KVM_X86_OP_OPTIONAL_RET0(link_external_spt)
 KVM_X86_OP_OPTIONAL_RET0(set_external_spte)
 KVM_X86_OP_OPTIONAL_RET0(free_external_spt)
 KVM_X86_OP_OPTIONAL(remove_external_spte)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d3bdc9828133..1139bd89f0cf 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1876,9 +1876,6 @@ struct kvm_x86_ops {
 	void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			     int root_level);
 
-	/* Update external mapping with page table link. */
-	int (*link_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
-				void *external_spt);
 	/* Update the external page table from spte getting set. */
 	int (*set_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
 				 u64 mirror_spte);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index db16e81b9701..6dc08fe22841 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -494,42 +494,13 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
 }
 
-static void *get_external_spt(gfn_t gfn, u64 new_spte, int level)
-{
-	if (is_shadow_present_pte(new_spte) && !is_last_spte(new_spte, level)) {
-		struct kvm_mmu_page *sp = spte_to_child_sp(new_spte);
-
-		WARN_ON_ONCE(sp->role.level + 1 != level);
-		WARN_ON_ONCE(sp->gfn != gfn);
-		return sp->external_spt;
-	}
-
-	return NULL;
-}
-
 static int __must_check set_external_spte_present(struct kvm *kvm,
 						  gfn_t gfn, u64 old_spte,
 						  u64 new_spte, int level)
 {
-	bool is_present = is_shadow_present_pte(new_spte);
-	bool is_leaf = is_present && is_last_spte(new_spte, level);
-	int ret = 0;
-
 	lockdep_assert_held(&kvm->mmu_lock);
 
-	/*
-	 * Use different call to either set up middle level
-	 * external page table, or leaf.
-	 */
-	if (is_leaf) {
-		ret = kvm_x86_call(set_external_spte)(kvm, gfn, level, new_spte);
-	} else {
-		void *external_spt = get_external_spt(gfn, new_spte, level);
-
-		KVM_BUG_ON(!external_spt, kvm);
-		ret = kvm_x86_call(link_external_spt)(kvm, gfn, level, external_spt);
-	}
-	return ret;
+	return kvm_x86_call(set_external_spte)(kvm, gfn, level, new_spte);
 }
 
 /**
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 01e070ec10fd..92a846b91bac 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1654,18 +1654,58 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
 	return 0;
 }
 
+static struct page *tdx_spte_to_sept_pt(struct kvm *kvm, gfn_t gfn,
+					     u64 new_spte, enum pg_level level)
+{
+	struct kvm_mmu_page *sp = spte_to_child_sp(new_spte);
+
+	if (KVM_BUG_ON(!sp->external_spt, kvm) ||
+	    KVM_BUG_ON(sp->role.level + 1 != level, kvm) ||
+	    KVM_BUG_ON(sp->gfn != gfn, kvm))
+		return NULL;
+
+	return virt_to_page(sp->external_spt);
+}
+
+static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
+				     enum pg_level level, u64 mirror_spte)
+{
+	gpa_t gpa = gfn_to_gpa(gfn);
+	u64 err, entry, level_state;
+	struct page *sept_pt;
+
+	sept_pt = tdx_spte_to_sept_pt(kvm, gfn, mirror_spte, level);
+	if (!sept_pt)
+		return -EIO;
+
+	err = tdh_mem_sept_add(&to_kvm_tdx(kvm)->td, gpa, level, sept_pt,
+			       &entry, &level_state);
+	if (unlikely(tdx_operand_busy(err)))
+		return -EBUSY;
+
+	if (TDX_BUG_ON_2(err, TDH_MEM_SEPT_ADD, entry, level_state, kvm))
+		return -EIO;
+
+	return 0;
+}
+
 static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 				     enum pg_level level, u64 mirror_spte)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	kvm_pfn_t pfn = spte_to_pfn(mirror_spte);
 
+	if (KVM_BUG_ON(!is_shadow_present_pte(mirror_spte), kvm))
+		return -EIO;
+
+	if (!is_last_spte(mirror_spte, level))
+		return tdx_sept_link_private_spt(kvm, gfn, level, mirror_spte);
+
 	/* TODO: handle large pages. */
 	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
 		return -EIO;
 
-	WARN_ON_ONCE(!is_shadow_present_pte(mirror_spte) ||
-		     (mirror_spte & VMX_EPT_RWX_MASK) != VMX_EPT_RWX_MASK);
+	WARN_ON_ONCE((mirror_spte & VMX_EPT_RWX_MASK) != VMX_EPT_RWX_MASK);
 
 	/*
 	 * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory()
@@ -1685,23 +1725,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	return tdx_mem_page_aug(kvm, gfn, level, pfn);
 }
 
-static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
-				     enum pg_level level, void *private_spt)
-{
-	gpa_t gpa = gfn_to_gpa(gfn);
-	struct page *page = virt_to_page(private_spt);
-	u64 err, entry, level_state;
 
-	err = tdh_mem_sept_add(&to_kvm_tdx(kvm)->td, gpa, level, page, &entry,
-			       &level_state);
-	if (unlikely(tdx_operand_busy(err)))
-		return -EBUSY;
-
-	if (TDX_BUG_ON_2(err, TDH_MEM_SEPT_ADD, entry, level_state, kvm))
-		return -EIO;
-
-	return 0;
-}
 
 /*
  * Ensure shared and private EPTs to be flushed on all vCPUs.
@@ -3413,7 +3437,6 @@ int __init tdx_hardware_setup(void)
 
 	vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size, sizeof(struct kvm_tdx));
 
-	vt_x86_ops.link_external_spt = tdx_sept_link_private_spt;
 	vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
 	vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
 	vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 09/17] KVM: TDX: Add helper to handle mapping leaf SPTE into S-EPT
  2026-03-27 20:14 [PATCH 00/17] TDX MMU refactors Rick Edgecombe
                   ` (7 preceding siblings ...)
  2026-03-27 20:14 ` [PATCH 08/17] KVM: TDX: Drop kvm_x86_ops.link_external_spt(), use .set_external_spte() for all Rick Edgecombe
@ 2026-03-27 20:14 ` Rick Edgecombe
  2026-03-30  6:43   ` Yan Zhao
  2026-03-27 20:14 ` [PATCH 10/17] KVM: TDX: Move set_external_spte_present() assert into TDX code Rick Edgecombe
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 61+ messages in thread
From: Rick Edgecombe @ 2026-03-27 20:14 UTC (permalink / raw)
  To: seanjc, pbonzini, yan.y.zhao, kai.huang, kvm, kas
  Cc: linux-kernel, x86, dave.hansen, rick.p.edgecombe

From: Sean Christopherson <seanjc@google.com>

Add a helper, tdx_sept_map_leaf_spte(), to wrap and isolate PAGE.ADD and
PAGE.AUG operations, and thus complete tdx_sept_set_private_spte()'s
transition into a "dispatch" routine for setting/writing S-EPT entries.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 92a846b91bac..361a75b42ae7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1689,18 +1689,12 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
 	return 0;
 }
 
-static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
-				     enum pg_level level, u64 mirror_spte)
+static int tdx_sept_map_leaf_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				  u64 mirror_spte)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	kvm_pfn_t pfn = spte_to_pfn(mirror_spte);
 
-	if (KVM_BUG_ON(!is_shadow_present_pte(mirror_spte), kvm))
-		return -EIO;
-
-	if (!is_last_spte(mirror_spte, level))
-		return tdx_sept_link_private_spt(kvm, gfn, level, mirror_spte);
-
 	/* TODO: handle large pages. */
 	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
 		return -EIO;
@@ -1725,7 +1719,18 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	return tdx_mem_page_aug(kvm, gfn, level, pfn);
 }
 
+static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
+				     enum pg_level level, u64 mirror_spte)
+{
 
+	if (KVM_BUG_ON(!is_shadow_present_pte(mirror_spte), kvm))
+		return -EIO;
+
+	if (!is_last_spte(mirror_spte, level))
+		return tdx_sept_link_private_spt(kvm, gfn, level, mirror_spte);
+
+	return tdx_sept_map_leaf_spte(kvm, gfn, level, mirror_spte);
+}
 
 /*
  * Ensure shared and private EPTs to be flushed on all vCPUs.
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 10/17] KVM: TDX: Move set_external_spte_present() assert into TDX code
  2026-03-27 20:14 [PATCH 00/17] TDX MMU refactors Rick Edgecombe
                   ` (8 preceding siblings ...)
  2026-03-27 20:14 ` [PATCH 09/17] KVM: TDX: Add helper to handle mapping leaf SPTE into S-EPT Rick Edgecombe
@ 2026-03-27 20:14 ` Rick Edgecombe
  2026-03-31 10:30   ` Huang, Kai
  2026-03-31 10:34   ` Huang, Kai
  2026-03-27 20:14 ` [PATCH 11/17] KVM: x86/mmu: Fold set_external_spte_present() into its sole caller Rick Edgecombe
                   ` (6 subsequent siblings)
  16 siblings, 2 replies; 61+ messages in thread
From: Rick Edgecombe @ 2026-03-27 20:14 UTC (permalink / raw)
  To: seanjc, pbonzini, yan.y.zhao, kai.huang, kvm, kas
  Cc: linux-kernel, x86, dave.hansen, rick.p.edgecombe

Move the MMU lockdep assert in set_external_spte_present() into the TDX
specific op because the intention behind the assert is TDX specific.

The TDP MMU has many lockdep asserts for various scenarios, and in fact
the callchains that are used for TDX already have a lockdep assert which
covers the case in set_external_spte_present(). However, these asserts are
for management of the TDP root owned by KVM. In the
set_external_spte_present() assert case, it is helping with a scheme to
avoid contention in the TDX module during zap operations. That is very
TDX specific.

One option would be to just remove the assert in
set_external_spte_present() and rely on the other ones in the TDP MMU. But
that assert has a different intention, and is too far away from the
SEAMCALL that needs it. So just move it to TDX code.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 2 --
 arch/x86/kvm/vmx/tdx.c     | 3 ++-
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6dc08fe22841..6763537098ee 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -498,8 +498,6 @@ static int __must_check set_external_spte_present(struct kvm *kvm,
 						  gfn_t gfn, u64 old_spte,
 						  u64 new_spte, int level)
 {
-	lockdep_assert_held(&kvm->mmu_lock);
-
 	return kvm_x86_call(set_external_spte)(kvm, gfn, level, new_spte);
 }
 
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 361a75b42ae7..b44a9c96c89e 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1722,10 +1722,11 @@ static int tdx_sept_map_leaf_spte(struct kvm *kvm, gfn_t gfn, enum pg_level leve
 static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 				     enum pg_level level, u64 mirror_spte)
 {
-
 	if (KVM_BUG_ON(!is_shadow_present_pte(mirror_spte), kvm))
 		return -EIO;
 
+	lockdep_assert_held(&kvm->mmu_lock);
+
 	if (!is_last_spte(mirror_spte, level))
 		return tdx_sept_link_private_spt(kvm, gfn, level, mirror_spte);
 
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 11/17] KVM: x86/mmu: Fold set_external_spte_present() into its sole caller
  2026-03-27 20:14 [PATCH 00/17] TDX MMU refactors Rick Edgecombe
                   ` (9 preceding siblings ...)
  2026-03-27 20:14 ` [PATCH 10/17] KVM: TDX: Move set_external_spte_present() assert into TDX code Rick Edgecombe
@ 2026-03-27 20:14 ` Rick Edgecombe
  2026-03-31 10:36   ` Huang, Kai
  2026-04-01  7:41   ` Yan Zhao
  2026-03-27 20:14 ` [PATCH 12/17] KVM: x86/mmu: Plumb the old_spte into kvm_x86_ops.set_external_spte() Rick Edgecombe
                   ` (5 subsequent siblings)
  16 siblings, 2 replies; 61+ messages in thread
From: Rick Edgecombe @ 2026-03-27 20:14 UTC (permalink / raw)
  To: seanjc, pbonzini, yan.y.zhao, kai.huang, kvm, kas
  Cc: linux-kernel, x86, dave.hansen, rick.p.edgecombe

From: Sean Christopherson <seanjc@google.com>

Fold set_external_spte_present() into __handle_changed_spte(), as all of
its functionality besides calling the op has been removed. It is now a
single-line helper that is called exactly once.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6763537098ee..85c92aec868f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -494,13 +494,6 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
 }
 
-static int __must_check set_external_spte_present(struct kvm *kvm,
-						  gfn_t gfn, u64 old_spte,
-						  u64 new_spte, int level)
-{
-	return kvm_x86_call(set_external_spte)(kvm, gfn, level, new_spte);
-}
-
 /**
  * __handle_changed_spte - handle bookkeeping associated with an SPTE change
  * @kvm: kvm instance
@@ -600,7 +593,7 @@ static int __handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
 	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
 	} else if (is_mirror_sp(sp) && is_present) {
-		r = set_external_spte_present(kvm, gfn, old_spte, new_spte, level);
+		r = kvm_x86_call(set_external_spte)(kvm, gfn, level, new_spte);
 		if (r)
 			return r;
 	}
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 12/17] KVM: x86/mmu: Plumb the old_spte into kvm_x86_ops.set_external_spte()
  2026-03-27 20:14 [PATCH 00/17] TDX MMU refactors Rick Edgecombe
                   ` (10 preceding siblings ...)
  2026-03-27 20:14 ` [PATCH 11/17] KVM: x86/mmu: Fold set_external_spte_present() into its sole caller Rick Edgecombe
@ 2026-03-27 20:14 ` Rick Edgecombe
  2026-03-27 20:14 ` [PATCH 13/17] KVM: TDX: Hoist tdx_sept_remove_private_spte() above set_private_spte() Rick Edgecombe
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 61+ messages in thread
From: Rick Edgecombe @ 2026-03-27 20:14 UTC (permalink / raw)
  To: seanjc, pbonzini, yan.y.zhao, kai.huang, kvm, kas
  Cc: linux-kernel, x86, dave.hansen, rick.p.edgecombe

From: Sean Christopherson <seanjc@google.com>

Plumb the old SPTE into .set_external_spte() so that the callback can be
used to handle removal and splitting of leaf SPTEs.  Rename mirror_spte to
new_spte to follow the TDP MMU's naming, and to make it more obvious what
value the parameter holds.

Opportunistically tweak the ordering of parameters to match the pattern of
most TDP MMU functions, which do "old, new, level".

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  4 ++--
 arch/x86/kvm/mmu/tdp_mmu.c      |  3 ++-
 arch/x86/kvm/vmx/tdx.c          | 18 +++++++++---------
 3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1139bd89f0cf..808d2c7ea546 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1877,8 +1877,8 @@ struct kvm_x86_ops {
 			     int root_level);
 
 	/* Update the external page table from spte getting set. */
-	int (*set_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
-				 u64 mirror_spte);
+	int (*set_external_spte)(struct kvm *kvm, gfn_t gfn, u64 old_spte,
+				 u64 new_spte, enum pg_level level);
 
 	/* Update external page tables for page table about to be freed. */
 	int (*free_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 85c92aec868f..991870789863 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -593,7 +593,8 @@ static int __handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
 	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
 	} else if (is_mirror_sp(sp) && is_present) {
-		r = kvm_x86_call(set_external_spte)(kvm, gfn, level, new_spte);
+		r = kvm_x86_call(set_external_spte)(kvm, gfn, old_spte,
+						    new_spte, level);
 		if (r)
 			return r;
 	}
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b44a9c96c89e..569a0576e7c9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1690,16 +1690,16 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
 }
 
 static int tdx_sept_map_leaf_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
-				  u64 mirror_spte)
+				  u64 new_spte)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
-	kvm_pfn_t pfn = spte_to_pfn(mirror_spte);
+	kvm_pfn_t pfn = spte_to_pfn(new_spte);
 
 	/* TODO: handle large pages. */
 	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
 		return -EIO;
 
-	WARN_ON_ONCE((mirror_spte & VMX_EPT_RWX_MASK) != VMX_EPT_RWX_MASK);
+	WARN_ON_ONCE((new_spte & VMX_EPT_RWX_MASK) != VMX_EPT_RWX_MASK);
 
 	/*
 	 * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory()
@@ -1719,18 +1719,18 @@ static int tdx_sept_map_leaf_spte(struct kvm *kvm, gfn_t gfn, enum pg_level leve
 	return tdx_mem_page_aug(kvm, gfn, level, pfn);
 }
 
-static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
-				     enum pg_level level, u64 mirror_spte)
+static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
+				     u64 new_spte, enum pg_level level)
 {
-	if (KVM_BUG_ON(!is_shadow_present_pte(mirror_spte), kvm))
+	if (KVM_BUG_ON(!is_shadow_present_pte(new_spte), kvm))
 		return -EIO;
 
 	lockdep_assert_held(&kvm->mmu_lock);
 
-	if (!is_last_spte(mirror_spte, level))
-		return tdx_sept_link_private_spt(kvm, gfn, level, mirror_spte);
+	if (!is_last_spte(new_spte, level))
+		return tdx_sept_link_private_spt(kvm, gfn, level, new_spte);
 
-	return tdx_sept_map_leaf_spte(kvm, gfn, level, mirror_spte);
+	return tdx_sept_map_leaf_spte(kvm, gfn, level, new_spte);
 }
 
 /*
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 13/17] KVM: TDX: Hoist tdx_sept_remove_private_spte() above set_private_spte()
  2026-03-27 20:14 [PATCH 00/17] TDX MMU refactors Rick Edgecombe
                   ` (11 preceding siblings ...)
  2026-03-27 20:14 ` [PATCH 12/17] KVM: x86/mmu: Plumb the old_spte into kvm_x86_ops.set_external_spte() Rick Edgecombe
@ 2026-03-27 20:14 ` Rick Edgecombe
  2026-03-31 10:42   ` Huang, Kai
  2026-03-27 20:14 ` [PATCH 14/17] KVM: x86/mmu: Remove KVM_BUG_ON() that checks lock when removing PTs Rick Edgecombe
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 61+ messages in thread
From: Rick Edgecombe @ 2026-03-27 20:14 UTC (permalink / raw)
  To: seanjc, pbonzini, yan.y.zhao, kai.huang, kvm, kas
  Cc: linux-kernel, x86, dave.hansen, rick.p.edgecombe

From: Sean Christopherson <seanjc@google.com>

Move tdx_sept_remove_private_spte() (and its tdx_track() helper) above
tdx_sept_set_private_spte() in anticipation of routing all non-atomic
S-EPT writes (with the exception of reclaiming non-leaf pages) through
the "set" API.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 569a0576e7c9..5a1a6610a98f 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1719,20 +1719,6 @@ static int tdx_sept_map_leaf_spte(struct kvm *kvm, gfn_t gfn, enum pg_level leve
 	return tdx_mem_page_aug(kvm, gfn, level, pfn);
 }
 
-static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
-				     u64 new_spte, enum pg_level level)
-{
-	if (KVM_BUG_ON(!is_shadow_present_pte(new_spte), kvm))
-		return -EIO;
-
-	lockdep_assert_held(&kvm->mmu_lock);
-
-	if (!is_last_spte(new_spte, level))
-		return tdx_sept_link_private_spt(kvm, gfn, level, new_spte);
-
-	return tdx_sept_map_leaf_spte(kvm, gfn, level, new_spte);
-}
-
 /*
  * Ensure shared and private EPTs to be flushed on all vCPUs.
  * tdh_mem_track() is the only caller that increases TD epoch. An increase in
@@ -1852,6 +1838,20 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	tdx_quirk_reset_page(page);
 }
 
+static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
+				     u64 new_spte, enum pg_level level)
+{
+	if (KVM_BUG_ON(!is_shadow_present_pte(new_spte), kvm))
+		return -EIO;
+
+	lockdep_assert_held(&kvm->mmu_lock);
+
+	if (!is_last_spte(new_spte, level))
+		return tdx_sept_link_private_spt(kvm, gfn, level, new_spte);
+
+	return tdx_sept_map_leaf_spte(kvm, gfn, level, new_spte);
+}
+
 void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
 			   int trig_mode, int vector)
 {
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 14/17] KVM: x86/mmu: Remove KVM_BUG_ON() that checks lock when removing PTs
  2026-03-27 20:14 [PATCH 00/17] TDX MMU refactors Rick Edgecombe
                   ` (12 preceding siblings ...)
  2026-03-27 20:14 ` [PATCH 13/17] KVM: TDX: Hoist tdx_sept_remove_private_spte() above set_private_spte() Rick Edgecombe
@ 2026-03-27 20:14 ` Rick Edgecombe
  2026-03-30  7:01   ` Yan Zhao
  2026-03-27 20:14 ` [PATCH 15/17] KVM: TDX: Handle removal of leaf SPTEs in .set_private_spte() Rick Edgecombe
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 61+ messages in thread
From: Rick Edgecombe @ 2026-03-27 20:14 UTC (permalink / raw)
  To: seanjc, pbonzini, yan.y.zhao, kai.huang, kvm, kas
  Cc: linux-kernel, x86, dave.hansen, rick.p.edgecombe

As part of an ongoing effort to move TDX specific bits from the MMU into
the TDX code, drop the KVM_BUG_ON() that checks the MMU lock is held for
write while removing page tables.

Future changes will forward PTE-removal mirror EPT updates into
set_private_spte() and let TDX code parse the PTE to decide which S-EPT
operations to take. That operation does not pass a shared bool for this
KVM_BUG_ON() to use in the logic's future home.

But even today there are already MMU write lockdep asserts that mostly
cover the case. Since the KVM_BUG_ON() is already a bit redundant, just
remove it instead of trying to plumb the bool into TDX code.

Link: https://lore.kernel.org/kvm/aYUarHf3KEwHGuJe@google.com/
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 991870789863..5dc9633c866e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -473,10 +473,8 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 		}
 		handle_changed_spte(kvm, sp, gfn, old_spte, FROZEN_SPTE, level, shared);
 
-		if (is_mirror_sp(sp)) {
-			KVM_BUG_ON(shared, kvm);
+		if (is_mirror_sp(sp))
 			remove_external_spte(kvm, gfn, old_spte, level);
-		}
 	}
 
 	if (is_mirror_sp(sp) &&
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 15/17] KVM: TDX: Handle removal of leaf SPTEs in .set_private_spte()
  2026-03-27 20:14 [PATCH 00/17] TDX MMU refactors Rick Edgecombe
                   ` (13 preceding siblings ...)
  2026-03-27 20:14 ` [PATCH 14/17] KVM: x86/mmu: Remove KVM_BUG_ON() that checks lock when removing PTs Rick Edgecombe
@ 2026-03-27 20:14 ` Rick Edgecombe
  2026-03-27 20:14 ` [PATCH 16/17] KVM: x86: Move error handling inside free_external_spt() Rick Edgecombe
  2026-03-27 20:14 ` [PATCH 17/17] KVM: TDX: Move external page table freeing to TDX code Rick Edgecombe
  16 siblings, 0 replies; 61+ messages in thread
From: Rick Edgecombe @ 2026-03-27 20:14 UTC (permalink / raw)
  To: seanjc, pbonzini, yan.y.zhao, kai.huang, kvm, kas
  Cc: linux-kernel, x86, dave.hansen, rick.p.edgecombe

From: Sean Christopherson <seanjc@google.com>

Drop kvm_x86_ops.remove_external_spte(), and instead handle the removal
of leaf SPTEs in the S-EPT (a.k.a. external root) in .set_private_spte().
This will allow extending tdx_sept_set_private_spte() to support splitting
a huge S-EPT entry without needing yet another kvm_x86_ops hook.

Bug the VM if the callback fails, as redundant KVM_BUG_ON() calls are
benign (the WARN will fire if and only if the VM isn't already bugged) and
handle_changed_spte() is most definitely not prepared to handle failure.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 -
 arch/x86/include/asm/kvm_host.h    |  3 ---
 arch/x86/kvm/mmu/tdp_mmu.c         | 33 +-----------------------------
 arch/x86/kvm/vmx/tdx.c             | 22 ++++++++++++--------
 4 files changed, 14 insertions(+), 45 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index bced6d938702..ed348c6dd445 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -97,7 +97,6 @@ KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
 KVM_X86_OP(load_mmu_pgd)
 KVM_X86_OP_OPTIONAL_RET0(set_external_spte)
 KVM_X86_OP_OPTIONAL_RET0(free_external_spt)
-KVM_X86_OP_OPTIONAL(remove_external_spte)
 KVM_X86_OP(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
 KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 808d2c7ea546..09588e797e4b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1884,9 +1884,6 @@ struct kvm_x86_ops {
 	int (*free_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
 				 void *external_spt);
 
-	/* Update external page table from spte getting removed, and flush TLB. */
-	void (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
-				     u64 mirror_spte);
 
 	bool (*has_wbinvd_exit)(void);
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5dc9633c866e..806788bdecce 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -359,25 +359,6 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
 	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 }
 
-static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
-				 int level)
-{
-	/*
-	 * External (TDX) SPTEs are limited to PG_LEVEL_4K, and external
-	 * PTs are removed in a special order, involving free_external_spt().
-	 * But remove_external_spte() will be called on non-leaf PTEs via
-	 * __tdp_mmu_zap_root(), so avoid the error the former would return
-	 * in this case.
-	 */
-	if (!is_last_spte(old_spte, level))
-		return;
-
-	/* Zapping leaf spte is allowed only when write lock is held. */
-	lockdep_assert_held_write(&kvm->mmu_lock);
-
-	kvm_x86_call(remove_external_spte)(kvm, gfn, level, old_spte);
-}
-
 /**
  * handle_removed_pt() - handle a page table removed from the TDP structure
  *
@@ -472,9 +453,6 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 							  FROZEN_SPTE, level);
 		}
 		handle_changed_spte(kvm, sp, gfn, old_spte, FROZEN_SPTE, level, shared);
-
-		if (is_mirror_sp(sp))
-			remove_external_spte(kvm, gfn, old_spte, level);
 	}
 
 	if (is_mirror_sp(sp) &&
@@ -590,7 +568,7 @@ static int __handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
 	if (was_present && !was_leaf &&
 	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
-	} else if (is_mirror_sp(sp) && is_present) {
+	} else if (is_mirror_sp(sp)) {
 		r = kvm_x86_call(set_external_spte)(kvm, gfn, old_spte,
 						    new_spte, level);
 		if (r)
@@ -737,15 +715,6 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
 
 	handle_changed_spte(kvm, sp, gfn, old_spte, new_spte, level, false);
 
-	/*
-	 * Users that do non-atomic setting of PTEs don't operate on mirror
-	 * roots, so don't handle it and bug the VM if it's seen.
-	 */
-	if (is_mirror_sptep(sptep)) {
-		KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
-		remove_external_spte(kvm, gfn, old_spte, level);
-	}
-
 	return old_spte;
 }
 
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 5a1a6610a98f..bfbadba8bc08 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1788,10 +1788,10 @@ static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
 	return tdx_reclaim_page(virt_to_page(private_spt));
 }
 
-static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
-					 enum pg_level level, u64 mirror_spte)
+static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
+					enum pg_level level, u64 old_spte)
 {
-	struct page *page = pfn_to_page(spte_to_pfn(mirror_spte));
+	struct page *page = pfn_to_page(spte_to_pfn(old_spte));
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	gpa_t gpa = gfn_to_gpa(gfn);
 	u64 err, entry, level_state;
@@ -1804,16 +1804,16 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	 * there can't be anything populated in the private EPT.
 	 */
 	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
-		return;
+		return -EIO;
 
 	/* TODO: handle large pages. */
 	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
-		return;
+		return -EIO;
 
 	err = tdh_do_no_vcpus(tdh_mem_range_block, kvm, &kvm_tdx->td, gpa,
 			      level, &entry, &level_state);
 	if (TDX_BUG_ON_2(err, TDH_MEM_RANGE_BLOCK, entry, level_state, kvm))
-		return;
+		return -EIO;
 
 	/*
 	 * TDX requires TLB tracking before dropping private page.  Do
@@ -1829,18 +1829,22 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	err = tdh_do_no_vcpus(tdh_mem_page_remove, kvm, &kvm_tdx->td, gpa,
 			      level, &entry, &level_state);
 	if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_REMOVE, entry, level_state, kvm))
-		return;
+		return -EIO;
 
 	err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
 	if (TDX_BUG_ON(err, TDH_PHYMEM_PAGE_WBINVD, kvm))
-		return;
+		return -EIO;
 
 	tdx_quirk_reset_page(page);
+	return 0;
 }
 
 static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
 				     u64 new_spte, enum pg_level level)
 {
+	if (is_shadow_present_pte(old_spte))
+		return tdx_sept_remove_private_spte(kvm, gfn, level, old_spte);
+
 	if (KVM_BUG_ON(!is_shadow_present_pte(new_spte), kvm))
 		return -EIO;
 
@@ -3445,7 +3449,7 @@ int __init tdx_hardware_setup(void)
 
 	vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
 	vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
-	vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;
+
 	vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt;
 	return 0;
 
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread
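[Editor's note: the control flow this patch gives tdx_sept_set_private_spte() can be sketched as a tiny userspace model. A single "set" callback now distinguishes removal, linking a non-leaf page table, and mapping a leaf page purely from the old/new SPTE values. The PRESENT/LEAF bit layout and all names below are illustrative, not KVM's actual SPTE encoding.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Made-up bits standing in for is_shadow_present_pte()/is_last_spte(). */
#define SPTE_PRESENT (1ull << 0)
#define SPTE_LEAF    (1ull << 1)

enum action { ACT_REMOVE, ACT_LINK, ACT_MAP, ACT_BUG };

static bool spte_present(uint64_t spte) { return spte & SPTE_PRESENT; }
static bool spte_leaf(uint64_t spte)    { return spte & SPTE_LEAF; }

/*
 * Mirrors the dispatch in the patched tdx_sept_set_private_spte(): a
 * present old_spte means the MMU is zapping; otherwise new_spte must be
 * present and is either a non-leaf link or a leaf mapping.
 */
static enum action dispatch(uint64_t old_spte, uint64_t new_spte)
{
	if (spte_present(old_spte))
		return ACT_REMOVE;
	if (!spte_present(new_spte))
		return ACT_BUG;	/* the KVM_BUG_ON() case */
	return spte_leaf(new_spte) ? ACT_MAP : ACT_LINK;
}
```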

* [PATCH 16/17] KVM: x86: Move error handling inside free_external_spt()
  2026-03-27 20:14 [PATCH 00/17] TDX MMU refactors Rick Edgecombe
                   ` (14 preceding siblings ...)
  2026-03-27 20:14 ` [PATCH 15/17] KVM: TDX: Handle removal of leaf SPTEs in .set_private_spte() Rick Edgecombe
@ 2026-03-27 20:14 ` Rick Edgecombe
  2026-03-27 20:14 ` [PATCH 17/17] KVM: TDX: Move external page table freeing to TDX code Rick Edgecombe
  16 siblings, 0 replies; 61+ messages in thread
From: Rick Edgecombe @ 2026-03-27 20:14 UTC (permalink / raw)
  To: seanjc, pbonzini, yan.y.zhao, kai.huang, kvm, kas
  Cc: linux-kernel, x86, dave.hansen, rick.p.edgecombe

From: Sean Christopherson <seanjc@google.com>

Move the logic for TDX's specific need to leak pages when reclaim
fails inside the free_external_spt() op, so it can live in TDX-specific
code rather than the generic MMU.

Do this by passing the SP in instead of the external page table
pointer. This way TDX code can set sp->external_spt to NULL. Since the
error is now handled internally, change the op to return void. This way
it also operates like a normal free in that success is guaranteed from
the caller's perspective.

Opportunistically, drop the unused level arg while adjusting the sp arg.

Signed-off-by: Sean Christopherson <seanjc@google.com>
[re-wrote log and massaged op name]
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
Notable changes since last discussion
 - Since free_external_sp() is dropped in the latter DPAMT patches, don't
   bother renaming free_external_spt().
---
 arch/x86/include/asm/kvm-x86-ops.h |  2 +-
 arch/x86/include/asm/kvm_host.h    |  3 +--
 arch/x86/kvm/mmu/tdp_mmu.c         | 13 ++-----------
 arch/x86/kvm/vmx/tdx.c             | 25 +++++++++++--------------
 4 files changed, 15 insertions(+), 28 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index ed348c6dd445..10ccf6ea9d9a 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -96,7 +96,7 @@ KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
 KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
 KVM_X86_OP(load_mmu_pgd)
 KVM_X86_OP_OPTIONAL_RET0(set_external_spte)
-KVM_X86_OP_OPTIONAL_RET0(free_external_spt)
+KVM_X86_OP_OPTIONAL(free_external_spt)
 KVM_X86_OP(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
 KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 09588e797e4b..fbc39f0bb491 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1881,8 +1881,7 @@ struct kvm_x86_ops {
 				 u64 new_spte, enum pg_level level);
 
 	/* Update external page tables for page table about to be freed. */
-	int (*free_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
-				 void *external_spt);
+	void (*free_external_spt)(struct kvm *kvm, gfn_t gfn, struct kvm_mmu_page *sp);
 
 
 	bool (*has_wbinvd_exit)(void);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 806788bdecce..575033cc7fe4 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -455,17 +455,8 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 		handle_changed_spte(kvm, sp, gfn, old_spte, FROZEN_SPTE, level, shared);
 	}
 
-	if (is_mirror_sp(sp) &&
-	    WARN_ON(kvm_x86_call(free_external_spt)(kvm, base_gfn, sp->role.level,
-						    sp->external_spt))) {
-		/*
-		 * Failed to free page table page in mirror page table and
-		 * there is nothing to do further.
-		 * Intentionally leak the page to prevent the kernel from
-		 * accessing the encrypted page.
-		 */
-		sp->external_spt = NULL;
-	}
+	if (is_mirror_sp(sp))
+		kvm_x86_call(free_external_spt)(kvm, base_gfn, sp);
 
 	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
 }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index bfbadba8bc08..d064b40a6b31 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1765,27 +1765,24 @@ static void tdx_track(struct kvm *kvm)
 	kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
 }
 
-static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
-				     enum pg_level level, void *private_spt)
+static void tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
+					struct kvm_mmu_page *sp)
 {
-	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
-
 	/*
-	 * free_external_spt() is only called after hkid is freed when TD is
-	 * tearing down.
 	 * KVM doesn't (yet) zap page table pages in mirror page table while
 	 * TD is active, though guest pages mapped in mirror page table could be
 	 * zapped during TD is active, e.g. for shared <-> private conversion
 	 * and slot move/deletion.
+	 *
+	 * In other words, KVM should only free mirror page tables after the
+	 * TD's hkid is freed, when the TD is being torn down.
+	 *
+	 * If the S-EPT PTE can't be removed for any reason, intentionally leak
+	 * the page to prevent the kernel from accessing the encrypted page.
 	 */
-	if (KVM_BUG_ON(is_hkid_assigned(kvm_tdx), kvm))
-		return -EIO;
-
-	/*
-	 * The HKID assigned to this TD was already freed and cache was
-	 * already flushed. We don't have to flush again.
-	 */
-	return tdx_reclaim_page(virt_to_page(private_spt));
+	if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm) ||
+	    tdx_reclaim_page(virt_to_page(sp->external_spt)))
+		sp->external_spt = NULL;
 }
 
 static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread
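[Editor's note: the "leak on reclaim failure" pattern this patch moves into TDX code can be modeled in isolation. If the page can't be reclaimed from the TDX module, the host intentionally drops its pointer (leaking the page) so the kernel never touches still-encrypted memory; either way the caller sees an unconditional free. All names here are hypothetical stand-ins.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct sp { void *external_spt; };

static int leaked;

/* Stand-in for tdx_reclaim_page(): 0 on success, nonzero on failure. */
static int fake_reclaim(void *page, bool fail)
{
	(void)page;
	return fail ? -1 : 0;
}

/*
 * Models the patched tdx_sept_free_private_spt(): on reclaim failure the
 * page is leaked on purpose; in both cases the pointer is forgotten, so
 * the caller can treat the op as a normal, infallible free.
 */
static void free_external_spt(struct sp *sp, bool reclaim_fails)
{
	if (fake_reclaim(sp->external_spt, reclaim_fails))
		leaked++;		/* intentional leak */
	else
		free(sp->external_spt);
	sp->external_spt = NULL;
}
```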

* [PATCH 17/17] KVM: TDX: Move external page table freeing to TDX code
  2026-03-27 20:14 [PATCH 00/17] TDX MMU refactors Rick Edgecombe
                   ` (15 preceding siblings ...)
  2026-03-27 20:14 ` [PATCH 16/17] KVM: x86: Move error handling inside free_external_spt() Rick Edgecombe
@ 2026-03-27 20:14 ` Rick Edgecombe
  2026-03-30  7:49   ` Yan Zhao
  2026-03-31 11:02   ` Huang, Kai
  16 siblings, 2 replies; 61+ messages in thread
From: Rick Edgecombe @ 2026-03-27 20:14 UTC (permalink / raw)
  To: seanjc, pbonzini, yan.y.zhao, kai.huang, kvm, kas
  Cc: linux-kernel, x86, dave.hansen, rick.p.edgecombe

From: Sean Christopherson <seanjc@google.com>

Move the freeing of external page tables into the reclaim operation that
lives in TDX code.

The TDP MMU supports traversing the TDP without holding locks. Page
tables need to be freed via RCU to prevent walking one that has been
freed.

While none of these lockless walk operations actually happen for the
mirror EPT, the TDP MMU nonetheless frees the mirror EPT page tables in
the same way, and, because it's a handy place to plug them in, the
external page tables as well.

However, the external page tables definitely can’t be walked once they are
reclaimed from the TDX module. The TDX module releases the page for the
host VMM to use, so this RCU-time free is unnecessary for external page
tables.

So move the free_page() call to TDX code. Create a
tdp_mmu_free_unused_sp() helper to free external page tables that have
never left the TDP MMU code (i.e. don't need to be freed in a special
way).

Link: https://lore.kernel.org/kvm/aYpjNrtGmogNzqwT@google.com/
Not-yet-Signed-off-by: Sean Christopherson <seanjc@google.com>
[Based on a diff by Sean, added log]
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 16 +++++++++++-----
 arch/x86/kvm/vmx/tdx.c     | 11 ++++++++++-
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 575033cc7fe4..18e11c1c7631 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -53,13 +53,18 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 	rcu_barrier();
 }
 
-static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
+static void __tdp_mmu_free_sp(struct kvm_mmu_page *sp)
 {
-	free_page((unsigned long)sp->external_spt);
 	free_page((unsigned long)sp->spt);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
 
+static void tdp_mmu_free_unused_sp(struct kvm_mmu_page *sp)
+{
+	free_page((unsigned long)sp->external_spt);
+	__tdp_mmu_free_sp(sp);
+}
+
 /*
  * This is called through call_rcu in order to free TDP page table memory
  * safely with respect to other kernel threads that may be operating on
@@ -73,7 +78,8 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
 	struct kvm_mmu_page *sp = container_of(head, struct kvm_mmu_page,
 					       rcu_head);
 
-	tdp_mmu_free_sp(sp);
+	WARN_ON_ONCE(sp->external_spt);
+	__tdp_mmu_free_sp(sp);
 }
 
 void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
@@ -1261,7 +1267,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		 * failed, e.g. because a different task modified the SPTE.
 		 */
 		if (r) {
-			tdp_mmu_free_sp(sp);
+			tdp_mmu_free_unused_sp(sp);
 			goto retry;
 		}
 
@@ -1571,7 +1577,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 	 * installs its own sp in place of the last sp we tried to split.
 	 */
 	if (sp)
-		tdp_mmu_free_sp(sp);
+		tdp_mmu_free_unused_sp(sp);
 
 	return 0;
 }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index d064b40a6b31..1346e891ca94 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1782,7 +1782,16 @@ static void tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
 	 */
 	if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm) ||
 	    tdx_reclaim_page(virt_to_page(sp->external_spt)))
-		sp->external_spt = NULL;
+		goto out;
+
+	/*
+	 * Immediately free the S-EPT page as the TDX subsystem doesn't support
+	 * freeing pages from RCU callbacks, and more importantly because
+	 * TDH.PHYMEM.PAGE.RECLAIM ensures there are no outstanding readers.
+	 */
+	free_page((unsigned long)sp->external_spt);
+out:
+	sp->external_spt = NULL;
 }
 
 static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread
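[Editor's note: the split this patch makes between eager and deferred freeing can be modeled roughly as below. Pages reclaimed from the TDX module are freed immediately, while the shadow page itself still goes through the deferred (RCU-callback-like) path, which now expects the external table to already be gone. Names and structure are a simplification, not the actual KVM code.]

```c
#include <assert.h>
#include <stdlib.h>

struct sp { void *spt; void *external_spt; };

static int warned;

static void __free_sp(struct sp *sp)
{
	free(sp->spt);
	free(sp);
}

/* Deferred path, like tdp_mmu_free_sp_rcu_callback(): external_spt must
 * already be NULL, mirroring WARN_ON_ONCE(sp->external_spt). */
static void free_sp_deferred(struct sp *sp)
{
	if (sp->external_spt)
		warned++;
	__free_sp(sp);
}

/* Eager path used after a successful reclaim: free the external table
 * immediately and forget the pointer before the deferred free runs. */
static void free_external_then_forget(struct sp *sp)
{
	free(sp->external_spt);
	sp->external_spt = NULL;
}
```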

* Re: [PATCH 06/17] KVM: x86/tdp_mmu: Morph the !is_frozen_spte() check into a KVM_MMU_WARN_ON()
  2026-03-27 20:14 ` [PATCH 06/17] KVM: x86/tdp_mmu: Morph the !is_frozen_spte() check into a KVM_MMU_WARN_ON() Rick Edgecombe
@ 2026-03-30  5:00   ` Yan Zhao
  2026-03-31 16:37     ` Edgecombe, Rick P
  0 siblings, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2026-03-30  5:00 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: seanjc, pbonzini, kai.huang, kvm, kas, linux-kernel, x86,
	dave.hansen

On Fri, Mar 27, 2026 at 01:14:10PM -0700, Rick Edgecombe wrote:
> Remove the conditional logic for handling the setting of mirror EPTs to
Should we unify the terms "mirror EPTs," "mirror TDP," and "mirror page tables"
in this series?

> frozen in __tdp_mmu_set_spte_atomic() and add it as a warning instead.
> 
> Mirror TDP needs propagate PTE changes to the to the external TDP. This
Two "to the".

> presents a problem for atomic updates which can't update both at once. So
> a special value, FROZEN_SPTE, is used as a temporary state during these
> updates to prevent concurrent operations to the PTE. If the TDP MMU tried
> to install this as a long term value, it would confuse these updates.
> Despite this __tdp_mmu_set_spte_atomic() includes a check to handle it
> being set. Remove this check and turn it into a warning.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 0809fe8e8737..338957bc5109 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -656,7 +656,13 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
>  	 */
>  	WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
>  
> -	if (is_mirror_sptep(iter->sptep) && !is_frozen_spte(new_spte)) {
> +	/*
> +	 * FROZEN_SPTE is a temporary state and should never be set via higher
> +	 * level helpers.
> +	 */
> +	KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
Why is KVM_MMU_WARN_ON() used here for new_spte while WARN_ON_ONCE() is used
above for old_spte?

> +	if (is_mirror_sptep(iter->sptep)) {
>  		int ret;
>  
>  		ret = set_external_spte_present(kvm, iter->sptep, iter->gfn,
> -- 
> 2.53.0
> 

^ permalink raw reply	[flat|nested] 61+ messages in thread
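[Editor's note: the FROZEN_SPTE "freeze, update externally, publish" protocol discussed in this patch can be sketched with C11 atomics. The sentinel value and function names below are made up for illustration; the real code lives in __tdp_mmu_set_spte_atomic() and set_external_spte_present().]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define FROZEN_SPTE 0xF0F0F0F0ull	/* made-up sentinel value */

/*
 * Freeze the PTE with cmpxchg so concurrent updaters back off, perform
 * the (slow) external update while frozen, then publish the final value.
 * On cmpxchg failure, *old_spte is updated with the current value so the
 * caller retries against fresh data, as with try_cmpxchg64().
 */
static int set_spte_frozen(_Atomic uint64_t *sptep, uint64_t *old_spte,
			   uint64_t new_spte)
{
	if (!atomic_compare_exchange_strong(sptep, old_spte, FROZEN_SPTE))
		return -1;	/* -EBUSY: another CPU changed the SPTE */

	/* ... external (S-EPT) update would happen here ... */

	atomic_store(sptep, new_spte);	/* unfreeze with the final value */
	return 0;
}
```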

* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
  2026-03-27 20:14 ` [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs Rick Edgecombe
@ 2026-03-30  6:14   ` Yan Zhao
  2026-04-01 23:45     ` Edgecombe, Rick P
  2026-03-31 10:09   ` Huang, Kai
  2026-04-01  8:34   ` Yan Zhao
  2 siblings, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2026-03-30  6:14 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: seanjc, pbonzini, kai.huang, kvm, kas, linux-kernel, x86,
	dave.hansen

On Fri, Mar 27, 2026 at 01:14:11PM -0700, Rick Edgecombe wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Centralize the updates to present external PTEs to the
> handle_changed_spte() function.
> 
> When setting a PTE to present in the mirror page tables, the update needs
> to propagate to the external page tables (in TDX parlance the S-EPT).
> Today this is handled by special mirror page tables branching in
> __tdp_mmu_set_spte_atomic(), which is the only place where present PTEs
> are set for TDX.
> 
> This keeps things running, but is a bit hacked on. The hook for setting
> present leaf PTEs are added only where TDX happens to need them. For
> example, TDX does not support any of the operations that use the
> non-atomic variant, tdp_mmu_set_spte() to set present PTEs. Since the hook
> is missing there, it is very hard to understand the code from a non-TDX
> lens. If the reader doesn’t know the TDX specifics it could look like the
> external update is missing.
> 
> In addition to being confusing, it also litters the TDP MMU with
> "external" update callbacks. This is especially unfortunate because there
> is already a central place to react to TDP updates, handle_changed_spte().
> 
> Begin the process of moving towards a model where all mirror page table
> updates are forwarded to TDX code where the TDX specific logic can live
> with a more proper separation of concerns. Do this by teaching
> handle_changed_spte() how to return error codes, such that it can
> propagate the failures that may come from TDX external page table updates.
> 
> Atomic mirror page table updates need to be done in a special way to
> prevent concurrent updates to the mirror page table while the external
> page table is updated. The mirror page table is set to the frozen PTE
> value while the external version is updates. This frozen PTE dance is
"is updates" --> "is updating"?

> currently done in __tdp_mmu_set_spte_atomic(). Hoist it up a level so that
> the external update in handle_changed_spte() can be done while the PTE is
> frozen.
> 
> Link: https://lore.kernel.org/lkml/aYYn0nf2cayYu8e7@google.com/
> Not-yet-Signed-off-by: Sean Christopherson <seanjc@google.com>
> [Based on a diff by Sean Chrisopherson]
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 150 ++++++++++++++++++++++---------------
>  1 file changed, 88 insertions(+), 62 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 338957bc5109..db16e81b9701 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -320,9 +320,9 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
>  	}
>  }
>  
> -static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> -				u64 old_spte, u64 new_spte, int level,
> -				bool shared);
> +static void handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
> +				gfn_t gfn, u64 old_spte, u64 new_spte,
> +				int level, bool shared);
>  
>  static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
>  {
> @@ -471,8 +471,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
>  			old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte,
>  							  FROZEN_SPTE, level);
>  		}
> -		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
> -				    old_spte, FROZEN_SPTE, level, shared);
> +		handle_changed_spte(kvm, sp, gfn, old_spte, FROZEN_SPTE, level, shared);
>  
>  		if (is_mirror_sp(sp)) {
>  			KVM_BUG_ON(shared, kvm);
> @@ -508,22 +507,15 @@ static void *get_external_spt(gfn_t gfn, u64 new_spte, int level)
>  	return NULL;
>  }
>  
> -static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
> -						 gfn_t gfn, u64 *old_spte,
> -						 u64 new_spte, int level)
> +static int __must_check set_external_spte_present(struct kvm *kvm,
> +						  gfn_t gfn, u64 old_spte,
> +						  u64 new_spte, int level)
>  {
>  	bool is_present = is_shadow_present_pte(new_spte);
>  	bool is_leaf = is_present && is_last_spte(new_spte, level);
>  	int ret = 0;
>  
>  	lockdep_assert_held(&kvm->mmu_lock);
> -	/*
> -	 * We need to lock out other updates to the SPTE until the external
> -	 * page table has been modified. Use FROZEN_SPTE similar to
> -	 * the zapping case.
> -	 */
> -	if (!try_cmpxchg64(rcu_dereference(sptep), old_spte, FROZEN_SPTE))
> -		return -EBUSY;
>  
>  	/*
>  	 * Use different call to either set up middle level
> @@ -537,17 +529,13 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
>  		KVM_BUG_ON(!external_spt, kvm);
>  		ret = kvm_x86_call(link_external_spt)(kvm, gfn, level, external_spt);
>  	}
> -	if (ret)
> -		__kvm_tdp_mmu_write_spte(sptep, *old_spte);
> -	else
> -		__kvm_tdp_mmu_write_spte(sptep, new_spte);
>  	return ret;
>  }
>  
>  /**
> - * handle_changed_spte - handle bookkeeping associated with an SPTE change
> + * __handle_changed_spte - handle bookkeeping associated with an SPTE change
>   * @kvm: kvm instance
> - * @as_id: the address space of the paging structure the SPTE was a part of
> + * @sp: the page table in which the SPTE resides
>   * @gfn: the base GFN that was mapped by the SPTE
>   * @old_spte: The value of the SPTE before the change
>   * @new_spte: The value of the SPTE after the change
> @@ -560,15 +548,17 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
>   * dirty logging updates are handled in common code, not here (see make_spte()
>   * and fast_pf_fix_direct_spte()).
>   */
> -static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> -				u64 old_spte, u64 new_spte, int level,
> -				bool shared)
> +static int __handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
> +				 gfn_t gfn, u64 old_spte, u64 new_spte,
> +				 int level, bool shared)
>  {
>  	bool was_present = is_shadow_present_pte(old_spte);
>  	bool is_present = is_shadow_present_pte(new_spte);
>  	bool was_leaf = was_present && is_last_spte(old_spte, level);
>  	bool is_leaf = is_present && is_last_spte(new_spte, level);
>  	bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
> +	int as_id = kvm_mmu_page_as_id(sp);
> +	int r;
>  
>  	WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
>  	WARN_ON_ONCE(level < PG_LEVEL_4K);
> @@ -598,9 +588,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>  	}
>  
>  	if (old_spte == new_spte)
> -		return;
> -
> -	trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
> +		return 0;
>  
>  	if (is_leaf)
>  		check_spte_writable_invariants(new_spte);
> @@ -627,21 +615,41 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>  			       "a temporary frozen SPTE.\n"
>  			       "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
>  			       as_id, gfn, old_spte, new_spte, level);
> -		return;
> +		return 0;
>  	}
>  
> -	if (is_leaf != was_leaf)
> -		kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
> -
>  	/*
>  	 * Recursively handle child PTs if the change removed a subtree from
>  	 * the paging structure.  Note the WARN on the PFN changing without the
>  	 * SPTE being converted to a hugepage (leaf) or being zapped.  Shadow
>  	 * pages are kernel allocations and should never be migrated.
> +	 *
> +	 * When modifying leaf entries in mirrored page tables, propagate all
> +	 * changes to the external SPTE.
>  	 */
>  	if (was_present && !was_leaf &&
> -	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> +	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
>  		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> +	} else if (is_mirror_sp(sp) && is_present) {
> +		r = set_external_spte_present(kvm, gfn, old_spte, new_spte, level);
> +		if (r)
> +			return r;
> +	}
> +
> +	trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
> +
> +	if (is_leaf != was_leaf)
> +		kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
> +
> +	return 0;
> +}
> +
> +static void handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
> +				gfn_t gfn, u64 old_spte, u64 new_spte,
> +				int level, bool shared)
> +{
> +	KVM_BUG_ON(__handle_changed_spte(kvm, sp, gfn, old_spte, new_spte,
> +					 level, shared), kvm);
>  }
>  
>  static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> @@ -657,32 +665,14 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
>  	WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
>  
>  	/*
> -	 * FROZEN_SPTE is a temporary state and should never be set via higher
> -	 * level helpers.
> +	 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
> +	 * does not hold the mmu_lock.  On failure, i.e. if a different logical
> +	 * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
> +	 * the current value, so the caller operates on fresh data, e.g. if it
> +	 * retries tdp_mmu_set_spte_atomic()
>  	 */
> -	KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> -
> -	if (is_mirror_sptep(iter->sptep)) {
> -		int ret;
> -
> -		ret = set_external_spte_present(kvm, iter->sptep, iter->gfn,
> -						&iter->old_spte, new_spte, iter->level);
> -		if (ret)
> -			return ret;
> -	} else {
> -		u64 *sptep = rcu_dereference(iter->sptep);
> -
> -		/*
> -		 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs
> -		 * and does not hold the mmu_lock.  On failure, i.e. if a
> -		 * different logical CPU modified the SPTE, try_cmpxchg64()
> -		 * updates iter->old_spte with the current value, so the caller
> -		 * operates on fresh data, e.g. if it retries
> -		 * tdp_mmu_set_spte_atomic()
> -		 */
> -		if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
> -			return -EBUSY;
> -	}
> +	if (!try_cmpxchg64(rcu_dereference(iter->sptep), &iter->old_spte, new_spte))
> +		return -EBUSY;
>  
>  	return 0;
>  }
> @@ -708,18 +698,49 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
>  						       struct tdp_iter *iter,
>  						       u64 new_spte)
>  {
> +	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter->sptep));
>  	int ret;
>  
>  	lockdep_assert_held_read(&kvm->mmu_lock);
>  
> -	ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> +	/* KVM should never freeze SPTEs using higher level APIs. */
"higher level API" is ambiguous. e.g. kvm_tdp_mmu_write_spte_atomic() allows
new_spte to be FROZEN_SPTE.

What about just "callers of tdp_mmu_set_spte_atomic() should not freeze SPTEs
directly"?

> +	KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> +
> +	/*
> +	 * Temporarily freeze the SPTE until the external PTE operation has
> +	 * completed (unless the new SPTE itself will be frozen), e.g. so that
> +	 * concurrent faults don't attempt to install a child PTE in the
> +	 * external page table before the parent PTE has been written, or try
> +	 * to re-install a page table before the old one was removed.
> +	 */
> +	if (is_mirror_sptep(iter->sptep))
> +		ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
> +	else
> +		ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
>  	if (ret)
>  		return ret;
>  
> -	handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> -			    new_spte, iter->level, true);
> +	ret = __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> +				    new_spte, iter->level, true);

What about adding a comment for the tricky part of the mirror page table case?
While new_spte is set to FROZEN_SPTE in the __tdp_mmu_set_spte_atomic() call
above to freeze the mirror page table, the original new_spte from the caller
of tdp_mmu_set_spte_atomic() is passed to __handle_changed_spte(), so that
statistics are updated properly and the value is propagated to the external
page table.

> -	return 0;
> +	/*
> +	 * Unfreeze the mirror SPTE.  If updating the external SPTE failed,
> +	 * restore the old SPTE so that the SPTE isn't frozen in perpetuity,
> +	 * otherwise set the mirror SPTE to the new desired value.
> +	 */
> +	if (is_mirror_sptep(iter->sptep)) {
> +		if (ret)
> +			__kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte);
> +		else
> +			__kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
> +	} else {
> +		/*
> +		 * Bug the VM if handling the change failed, as failure is only
> +		 * allowed if KVM couldn't update the external SPTE.
> +		 */
> +		KVM_BUG_ON(ret, kvm);
> +	}
> +	return ret;
>  }
>  
>  /*
> @@ -738,6 +759,8 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
>  static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
>  			    u64 old_spte, u64 new_spte, gfn_t gfn, int level)
>  {
> +	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(sptep));
> +
>  	lockdep_assert_held_write(&kvm->mmu_lock);
>  
>  	/*
> @@ -751,7 +774,7 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
>  
>  	old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
>  
> -	handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
> +	handle_changed_spte(kvm, sp, gfn, old_spte, new_spte, level, false);
>  
>  	/*
>  	 * Users that do non-atomic setting of PTEs don't operate on mirror
> @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm, struct tdp_iter *iter)
>  {
>  	u64 new_spte;
>  
> +	if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> +		return;
> +
Add a comment explaining why a mirror page table is not expected here?

And do we need a similar WARN_ON_ONCE() in kvm_tdp_mmu_clear_dirty_pt_masked()
or clear_dirty_pt_masked()?

>  	if (spte_ad_enabled(iter->old_spte)) {
>  		iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
>  								shadow_accessed_mask);
> -- 
> 2.53.0
> 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 08/17] KVM: TDX: Drop kvm_x86_ops.link_external_spt(), use .set_external_spte() for all
  2026-03-27 20:14 ` [PATCH 08/17] KVM: TDX: Drop kvm_x86_ops.link_external_spt(), use .set_external_spte() for all Rick Edgecombe
@ 2026-03-30  6:28   ` Yan Zhao
  0 siblings, 0 replies; 61+ messages in thread
From: Yan Zhao @ 2026-03-30  6:28 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: seanjc, pbonzini, kai.huang, kvm, kas, linux-kernel, x86,
	dave.hansen

On Fri, Mar 27, 2026 at 01:14:12PM -0700, Rick Edgecombe wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Drop the dedicated .link_external_spt() for linking non-leaf S-EPT pages,
> and instead funnel everything through .set_external_spte().  Using separate
> hooks doesn't help prevent TDP MMU details from bleeding into TDX, and vice
> versa; to the contrary, dedicated callbacks will result in _more_ pollution
> when hugepage support is added, e.g. will require the TDP MMU to know
> details about the splitting rules for TDX that aren't all that relevant to
> the TDP MMU.
> 
> Ideally, KVM would provide a single pair of hooks to set S-EPT entries,
> one hook for setting SPTEs under write-lock and another for settings SPTEs
> under read-lock (e.g. to ensure the entire operation is "atomic", to allow
> for failure, etc.).  Sadly, TDX's requirement that all child S-EPT entries
> are removed before the parent makes that impractical: the TDP MMU
> deliberately prunes non-leaf SPTEs and _then_ processes its children, thus
> making it quite important for the TDP MMU to differentiate between zapping
> leaf and non-leaf S-EPT entries.
> 
> However, that's the _only_ case that's truly special, and even that case
> could be shoehorned into a single hook; it's just wouldn't be a net
"it's" --> "it".

> positive.


* Re: [PATCH 09/17] KVM: TDX: Add helper to handle mapping leaf SPTE into S-EPT
  2026-03-27 20:14 ` [PATCH 09/17] KVM: TDX: Add helper to handle mapping leaf SPTE into S-EPT Rick Edgecombe
@ 2026-03-30  6:43   ` Yan Zhao
  2026-04-01 23:59     ` Edgecombe, Rick P
  0 siblings, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2026-03-30  6:43 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: seanjc, pbonzini, kai.huang, kvm, kas, linux-kernel, x86,
	dave.hansen

On Fri, Mar 27, 2026 at 01:14:13PM -0700, Rick Edgecombe wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Add a helper, tdx_sept_map_leaf_spte(), to wrap and isolate PAGE.ADD and
> PAGE.AUG operations, and thus complete tdx_sept_set_private_spte()'s
> transition into a "dispatch" routine for setting/writing S-EPT entries.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 21 +++++++++++++--------
>  1 file changed, 13 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 92a846b91bac..361a75b42ae7 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1689,18 +1689,12 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
>  	return 0;
>  }
>  
> -static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> -				     enum pg_level level, u64 mirror_spte)
> +static int tdx_sept_map_leaf_spte(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +				  u64 mirror_spte)
>  {
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>  	kvm_pfn_t pfn = spte_to_pfn(mirror_spte);
>  
> -	if (KVM_BUG_ON(!is_shadow_present_pte(mirror_spte), kvm))
> -		return -EIO;
> -
> -	if (!is_last_spte(mirror_spte, level))
> -		return tdx_sept_link_private_spt(kvm, gfn, level, mirror_spte);
> -
>  	/* TODO: handle large pages. */
>  	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
>  		return -EIO;
> @@ -1725,7 +1719,18 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>  	return tdx_mem_page_aug(kvm, gfn, level, pfn);
>  }
>  
> +static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> +				     enum pg_level level, u64 mirror_spte)
> +{
>  
> +	if (KVM_BUG_ON(!is_shadow_present_pte(mirror_spte), kvm))
> +		return -EIO;
> +
> +	if (!is_last_spte(mirror_spte, level))
> +		return tdx_sept_link_private_spt(kvm, gfn, level, mirror_spte);
For symmetry, what about renaming tdx_sept_link_private_spt() to
tdx_sept_map_nonleaf_spte()?

> +
> +	return tdx_sept_map_leaf_spte(kvm, gfn, level, mirror_spte);
> +}
>  
>  /*
>   * Ensure shared and private EPTs to be flushed on all vCPUs.
> -- 
> 2.53.0
> 


* Re: [PATCH 14/17] KVM: x86/mmu: Remove KVM_BUG_ON() that checks lock when removing PTs
  2026-03-27 20:14 ` [PATCH 14/17] KVM: x86/mmu: Remove KVM_BUG_ON() that checks lock when removing PTs Rick Edgecombe
@ 2026-03-30  7:01   ` Yan Zhao
  2026-03-31 10:46     ` Huang, Kai
  0 siblings, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2026-03-30  7:01 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: seanjc, pbonzini, kai.huang, kvm, kas, linux-kernel, x86,
	dave.hansen

On Fri, Mar 27, 2026 at 01:14:18PM -0700, Rick Edgecombe wrote:
> As part of an ongoing effort to move TDX specific bits from the MMU into
> the TDX code, drop the KVM_BUG_ON() that checks the MMU lock is held for
> write while removing page tables.
> 
> Future changes forward PTE removal mirror EPT updates into the
> set_private_spte() and let TDX code parse the PTE to decide what S-EPT
> operations to take. This operations does not pass a shared bool for this
"operations" --> "operation" ?

> KVM_BUG_ON() to use in the logics future home.
By "in the logics future home", do you mean "when this logic is moved to its
future location"?
 
> But even today there are already MMU write lockdep asserts that mostly
> cover the case. Since the KVM_BUG_ON() is already a bit redundant, just
> remove it instead of trying to plumb the bool into TDX code.
> 
> Link: https://lore.kernel.org/kvm/aYUarHf3KEwHGuJe@google.com/
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 991870789863..5dc9633c866e 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -473,10 +473,8 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
>  		}
>  		handle_changed_spte(kvm, sp, gfn, old_spte, FROZEN_SPTE, level, shared);
>  
> -		if (is_mirror_sp(sp)) {
> -			KVM_BUG_ON(shared, kvm);
> +		if (is_mirror_sp(sp))
>  			remove_external_spte(kvm, gfn, old_spte, level);
> -		}
>  	}
>  
>  	if (is_mirror_sp(sp) &&
> -- 
> 2.53.0
> 


* Re: [PATCH 17/17] KVM: TDX: Move external page table freeing to TDX code
  2026-03-27 20:14 ` [PATCH 17/17] KVM: TDX: Move external page table freeing to TDX code Rick Edgecombe
@ 2026-03-30  7:49   ` Yan Zhao
  2026-04-02  0:17     ` Edgecombe, Rick P
  2026-03-31 11:02   ` Huang, Kai
  1 sibling, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2026-03-30  7:49 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: seanjc, pbonzini, kai.huang, kvm, kas, linux-kernel, x86,
	dave.hansen

On Fri, Mar 27, 2026 at 01:14:21PM -0700, Rick Edgecombe wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Move the freeing of external page tables into the reclaim operation that
> lives in TDX code.
> 
> The TDP MMU supports traversing the TDP without holding locks. Page
> tables need to be freed via RCU to prevent walking one that gets
> freed.
> 
> While none of these lockless walk operations actually happen for the
> mirror EPT, the TDP MMU nonetheless frees the mirror EPT page tables in
> the same way, and because it’s a handy place to plug it in, the external
> page tables as well.
> 
> However, the external page tables definitely can’t be walked once they are
> reclaimed from the TDX module. The TDX module releases the page for the
> host VMM to use, so this RCU-time free is unnecessary for external page
> tables.
> 
> So move the free_page() call to TDX code. Create a
> tdp_mmu_free_unused_sp() to allow for freeing external page tables that
> have never left the TDP MMU code (i.e. don’t need to be freed in a
> special way).
> 
> Link: https://lore.kernel.org/kvm/aYpjNrtGmogNzqwT@google.com/
> Not-yet-Signed-off-by: Sean Christopherson <seanjc@google.com>
> [Based on a diff by Sean, added log]
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 16 +++++++++++-----
>  arch/x86/kvm/vmx/tdx.c     | 11 ++++++++++-
>  2 files changed, 21 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 575033cc7fe4..18e11c1c7631 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -53,13 +53,18 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>  	rcu_barrier();
>  }
>  
> -static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
> +static void __tdp_mmu_free_sp(struct kvm_mmu_page *sp)
>  {
> -	free_page((unsigned long)sp->external_spt);
>  	free_page((unsigned long)sp->spt);
>  	kmem_cache_free(mmu_page_header_cache, sp);
>  }
>  
> +static void tdp_mmu_free_unused_sp(struct kvm_mmu_page *sp)
> +{
> +	free_page((unsigned long)sp->external_spt);
> +	__tdp_mmu_free_sp(sp);
> +}
> +
>  /*
>   * This is called through call_rcu in order to free TDP page table memory
>   * safely with respect to other kernel threads that may be operating on
> @@ -73,7 +78,8 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
>  	struct kvm_mmu_page *sp = container_of(head, struct kvm_mmu_page,
>  					       rcu_head);
>  
> -	tdp_mmu_free_sp(sp);
> +	WARN_ON_ONCE(sp->external_spt);
> +	__tdp_mmu_free_sp(sp);
>  }
>  
>  void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
> @@ -1261,7 +1267,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  		 * failed, e.g. because a different task modified the SPTE.
>  		 */
>  		if (r) {
> -			tdp_mmu_free_sp(sp);
> +			tdp_mmu_free_unused_sp(sp);
>  			goto retry;
>  		}
>  
> @@ -1571,7 +1577,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
>  	 * installs its own sp in place of the last sp we tried to split.
>  	 */
>  	if (sp)
> -		tdp_mmu_free_sp(sp);
> +		tdp_mmu_free_unused_sp(sp);
>  
>  	return 0;
>  }
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index d064b40a6b31..1346e891ca94 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1782,7 +1782,16 @@ static void tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
>  	 */
>  	if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm) ||
>  	    tdx_reclaim_page(virt_to_page(sp->external_spt)))
> -		sp->external_spt = NULL;
> +		goto out;
> +
> +	/*
> +	 * Immediately free the S-EPT page as the TDX subsystem doesn't support
> +	 * freeing pages from RCU callbacks, and more importantly because
> +	 * TDH.PHYMEM.PAGE.RECLAIM ensures there are no outstanding readers.
> +	 */
> +	free_page((unsigned long)sp->external_spt);
Since sp->external_spt is allocated in kvm_mmu_alloc_external_spt() in a similar
way as sp->spt in tdp_mmu_alloc_sp(), why bother making freeing of
sp->external_spt special? i.e., what's the downside of freeing it together with
sp->spt in tdp_mmu_free_sp()?

> +out:
> +	sp->external_spt = NULL;
>  }
>  
>  static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> -- 
> 2.53.0
> 


* Re: [PATCH 02/17] KVM: x86/mmu: Update iter->old_spte if cmpxchg64 on mirror SPTE "fails"
  2026-03-31  9:47   ` Huang, Kai
@ 2026-03-31  9:17     ` Yan Zhao
  2026-03-31  9:59       ` Huang, Kai
  0 siblings, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2026-03-31  9:17 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
	seanjc@google.com, Edgecombe, Rick P, Hansen, Dave,
	linux-kernel@vger.kernel.org, x86@kernel.org

On Tue, Mar 31, 2026 at 05:47:29PM +0800, Huang, Kai wrote:
> On Fri, 2026-03-27 at 13:14 -0700, Rick Edgecombe wrote:
> > From: Sean Christopherson <seanjc@google.com>
> > 
> > Pass a pointer to iter->old_spte, not simply its value, when setting an
> > external SPTE in __tdp_mmu_set_spte_atomic(), so that the iterator's value
> > will be updated if the cmpxchg64 to freeze the mirror SPTE fails.  The bug
> > is currently benign as TDX is mutually exclusive with all paths that do
> > "local" retry, e.g. clear_dirty_gfn_range() and wrprot_gfn_range().
> > 
> > Fixes: 77ac7079e66d ("KVM: x86/tdp_mmu: Propagate building mirror page tables")
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 7b1102d26f9c..dbaeb80f2b64 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -509,10 +509,10 @@ static void *get_external_spt(gfn_t gfn, u64 new_spte, int level)
> >  }
> >  
> >  static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
> > -						 gfn_t gfn, u64 old_spte,
> > +						 gfn_t gfn, u64 *old_spte,
> >  						 u64 new_spte, int level)
> >  {
> > -	bool was_present = is_shadow_present_pte(old_spte);
> > +	bool was_present = is_shadow_present_pte(*old_spte);
> >  	bool is_present = is_shadow_present_pte(new_spte);
> >  	bool is_leaf = is_present && is_last_spte(new_spte, level);
> >  	int ret = 0;
> > @@ -525,7 +525,7 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
> >  	 * page table has been modified. Use FROZEN_SPTE similar to
> >  	 * the zapping case.
> >  	 */
> > -	if (!try_cmpxchg64(rcu_dereference(sptep), &old_spte, FROZEN_SPTE))
> > +	if (!try_cmpxchg64(rcu_dereference(sptep), old_spte, FROZEN_SPTE))
> >  		return -EBUSY;
> >  
> >  	/*
> > @@ -541,7 +541,7 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
> >  		ret = kvm_x86_call(link_external_spt)(kvm, gfn, level, external_spt);
> >  	}
> >  	if (ret)
> > -		__kvm_tdp_mmu_write_spte(sptep, old_spte);
> > +		__kvm_tdp_mmu_write_spte(sptep, *old_spte);
> >  	else
> >  		__kvm_tdp_mmu_write_spte(sptep, new_spte);
> >  	return ret;
> > @@ -670,7 +670,7 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> >  			return -EBUSY;
> >  
> >  		ret = set_external_spte_present(kvm, iter->sptep, iter->gfn,
> > -						iter->old_spte, new_spte, iter->level);
> > +						&iter->old_spte, new_spte, iter->level);
> >  		if (ret)
> >  			return ret;
> >  	} else {
> 
> The __tdp_mmu_set_spte_atomic() has a WARN() at the beginning to check the
> iter->old_spte isn't a frozen SPTE:
> 
>         WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
> 
> Thinking more, I _think_ this patch could potentially trigger this WARNING
> due to now set_external_spte_present() will set iter->old_spte to
> FROZEN_SPTE when try_cmpxchg64() fails.
> 
> Consider there are 3 vCPUs trying to accept the same GFN, and they all reach
> __tdp_mmu_set_spte_atomic() simultaneously.  Assuming vCPU1 does the 
> 
> 	if (!try_cmpxchg64(rcu_dereference(sptep), old_spte, FROZEN_SPTE))
>  		return -EBUSY;
> 
> .. successfully in set_external_spte_present(), then vCPU2 will fail on the
> try_cmpxchg64(), but this will cause iter->old_spte to be updated to
> FROZEN_SPTE.
> 
> Then when vCPU3 enters __tdp_mmu_set_spte_atomic(), AFAICT the WARNING will
> be triggered due to is_frozen_spte(iter->old_spte) will now return true.

The failed caller needs to check "if (is_frozen_spte(iter.old_spte))" before
retrying, as in kvm_tdp_mmu_map()?
 
> Or did I miss anything?
> 
> Also, AFAICT this issue doesn't exist for non-TDX case because there's no
> case tdp_mmu_set_spte_atomic() is called to set new_spte as FROZEN_SPTE in
> such case.



* Re: [PATCH 02/17] KVM: x86/mmu: Update iter->old_spte if cmpxchg64 on mirror SPTE "fails"
  2026-03-31  9:59       ` Huang, Kai
@ 2026-03-31  9:22         ` Yan Zhao
  2026-03-31 10:14           ` Huang, Kai
  0 siblings, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2026-03-31  9:22 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
	kas@kernel.org, Hansen, Dave, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, kvm@vger.kernel.org

On Tue, Mar 31, 2026 at 05:59:54PM +0800, Huang, Kai wrote:
> > > 
> > > The __tdp_mmu_set_spte_atomic() has a WARN() at the beginning to check the
> > > iter->old_spte isn't a frozen SPTE:
> > > 
> > >         WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
> > > 
> > > Thinking more, I _think_ this patch could potentially trigger this WARNING
> > > due to now set_external_spte_present() will set iter->old_spte to
> > > FROZEN_SPTE when try_cmpxchg64() fails.
> > > 
> > > Consider there are 3 vCPUs trying to accept the same GFN, and they all reach
> > > __tdp_mmu_set_spte_atomic() simultaneously.  Assuming vCPU1 does the 
> > > 
> > > 	if (!try_cmpxchg64(rcu_dereference(sptep), old_spte, FROZEN_SPTE))
> > >  		return -EBUSY;
> > > 
> > > .. successfully in set_external_spte_present(), then vCPU2 will fail on the
> > > try_cmpxchg64(), but this will cause iter->old_spte to be updated to
> > > FROZEN_SPTE.
> > > 
> > > Then when vCPU3 enters __tdp_mmu_set_spte_atomic(), AFAICT the WARNING will
> > > be triggered due to is_frozen_spte(iter->old_spte) will now return true.
> > 
> > The failed caller needs to check "if (is_frozen_spte(iter.old_spte))" before
> > retrying, as in kvm_tdp_mmu_map()?
> 
> It's possible the vCPU3 is already about to go into
> __tdp_mmu_set_spte_atomic() when iter.old_spte becomes FROZEN_SPTE.
Hmm, different vCPUs' &iter shouldn't point at the same memory, since iter is
a local variable.


* Re: [PATCH 02/17] KVM: x86/mmu: Update iter->old_spte if cmpxchg64 on mirror SPTE "fails"
  2026-03-27 20:14 ` [PATCH 02/17] KVM: x86/mmu: Update iter->old_spte if cmpxchg64 on mirror SPTE "fails" Rick Edgecombe
@ 2026-03-31  9:47   ` Huang, Kai
  2026-03-31  9:17     ` Yan Zhao
  0 siblings, 1 reply; 61+ messages in thread
From: Huang, Kai @ 2026-03-31  9:47 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
	seanjc@google.com, Edgecombe, Rick P, Zhao, Yan Y
  Cc: Hansen, Dave, linux-kernel@vger.kernel.org, x86@kernel.org

On Fri, 2026-03-27 at 13:14 -0700, Rick Edgecombe wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Pass a pointer to iter->old_spte, not simply its value, when setting an
> external SPTE in __tdp_mmu_set_spte_atomic(), so that the iterator's value
> will be updated if the cmpxchg64 to freeze the mirror SPTE fails.  The bug
> is currently benign as TDX is mutually exclusive with all paths that do
> "local" retry, e.g. clear_dirty_gfn_range() and wrprot_gfn_range().
> 
> Fixes: 77ac7079e66d ("KVM: x86/tdp_mmu: Propagate building mirror page tables")
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7b1102d26f9c..dbaeb80f2b64 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -509,10 +509,10 @@ static void *get_external_spt(gfn_t gfn, u64 new_spte, int level)
>  }
>  
>  static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
> -						 gfn_t gfn, u64 old_spte,
> +						 gfn_t gfn, u64 *old_spte,
>  						 u64 new_spte, int level)
>  {
> -	bool was_present = is_shadow_present_pte(old_spte);
> +	bool was_present = is_shadow_present_pte(*old_spte);
>  	bool is_present = is_shadow_present_pte(new_spte);
>  	bool is_leaf = is_present && is_last_spte(new_spte, level);
>  	int ret = 0;
> @@ -525,7 +525,7 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
>  	 * page table has been modified. Use FROZEN_SPTE similar to
>  	 * the zapping case.
>  	 */
> -	if (!try_cmpxchg64(rcu_dereference(sptep), &old_spte, FROZEN_SPTE))
> +	if (!try_cmpxchg64(rcu_dereference(sptep), old_spte, FROZEN_SPTE))
>  		return -EBUSY;
>  
>  	/*
> @@ -541,7 +541,7 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
>  		ret = kvm_x86_call(link_external_spt)(kvm, gfn, level, external_spt);
>  	}
>  	if (ret)
> -		__kvm_tdp_mmu_write_spte(sptep, old_spte);
> +		__kvm_tdp_mmu_write_spte(sptep, *old_spte);
>  	else
>  		__kvm_tdp_mmu_write_spte(sptep, new_spte);
>  	return ret;
> @@ -670,7 +670,7 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
>  			return -EBUSY;
>  
>  		ret = set_external_spte_present(kvm, iter->sptep, iter->gfn,
> -						iter->old_spte, new_spte, iter->level);
> +						&iter->old_spte, new_spte, iter->level);
>  		if (ret)
>  			return ret;
>  	} else {

The __tdp_mmu_set_spte_atomic() has a WARN() at the beginning to check the
iter->old_spte isn't a frozen SPTE:

        WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));

Thinking more, I _think_ this patch could potentially trigger this WARNING
due to now set_external_spte_present() will set iter->old_spte to
FROZEN_SPTE when try_cmpxchg64() fails.

Consider there are 3 vCPUs trying to accept the same GFN, and they all reach
__tdp_mmu_set_spte_atomic() simultaneously.  Assuming vCPU1 does the 

	if (!try_cmpxchg64(rcu_dereference(sptep), old_spte, FROZEN_SPTE))
 		return -EBUSY;

.. successfully in set_external_spte_present(), then vCPU2 will fail on the
try_cmpxchg64(), but this will cause iter->old_spte to be updated to
FROZEN_SPTE.

Then when vCPU3 enters __tdp_mmu_set_spte_atomic(), AFAICT the WARNING will
be triggered due to is_frozen_spte(iter->old_spte) will now return true.

Or did I miss anything?

Also, AFAICT this issue doesn't exist for non-TDX case because there's no
case tdp_mmu_set_spte_atomic() is called to set new_spte as FROZEN_SPTE in
such case.


* Re: [PATCH 02/17] KVM: x86/mmu: Update iter->old_spte if cmpxchg64 on mirror SPTE "fails"
  2026-03-31  9:17     ` Yan Zhao
@ 2026-03-31  9:59       ` Huang, Kai
  2026-03-31  9:22         ` Yan Zhao
  0 siblings, 1 reply; 61+ messages in thread
From: Huang, Kai @ 2026-03-31  9:59 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: Edgecombe, Rick P, seanjc@google.com, x86@kernel.org,
	kas@kernel.org, Hansen, Dave, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, kvm@vger.kernel.org

> > 
> > The __tdp_mmu_set_spte_atomic() has a WARN() at the beginning to check the
> > iter->old_spte isn't a frozen SPTE:
> > 
> >         WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
> > 
> > Thinking more, I _think_ this patch could potentially trigger this WARNING
> > due to now set_external_spte_present() will set iter->old_spte to
> > FROZEN_SPTE when try_cmpxchg64() fails.
> > 
> > Consider there are 3 vCPUs trying to accept the same GFN, and they all reach
> > __tdp_mmu_set_spte_atomic() simultaneously.  Assuming vCPU1 does the 
> > 
> > 	if (!try_cmpxchg64(rcu_dereference(sptep), old_spte, FROZEN_SPTE))
> >  		return -EBUSY;
> > 
> > .. successfully in set_external_spte_present(), then vCPU2 will fail on the
> > try_cmpxchg64(), but this will cause iter->old_spte to be updated to
> > FROZEN_SPTE.
> > 
> > Then when vCPU3 enters __tdp_mmu_set_spte_atomic(), AFAICT the WARNING will
> > be triggered due to is_frozen_spte(iter->old_spte) will now return true.
> 
> The failed caller needs to check "if (is_frozen_spte(iter.old_spte))" before
> retrying, as in kvm_tdp_mmu_map()?

It's possible the vCPU3 is already about to go into
__tdp_mmu_set_spte_atomic() when iter.old_spte becomes FROZEN_SPTE.


* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
  2026-03-27 20:14 ` [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs Rick Edgecombe
  2026-03-30  6:14   ` Yan Zhao
@ 2026-03-31 10:09   ` Huang, Kai
  2026-04-01 23:58     ` Edgecombe, Rick P
  2026-04-01  8:34   ` Yan Zhao
  2 siblings, 1 reply; 61+ messages in thread
From: Huang, Kai @ 2026-03-31 10:09 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
	seanjc@google.com, Edgecombe, Rick P, Zhao, Yan Y
  Cc: Hansen, Dave, linux-kernel@vger.kernel.org, x86@kernel.org


> @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm, struct tdp_iter *iter)
>  {
>  	u64 new_spte;
>  
> +	if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> +		return;
> +
>  	if (spte_ad_enabled(iter->old_spte)) {
>  		iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
>  								shadow_accessed_mask);

AFAICT this chunk isn't related to "Centralize updates to present external
PTEs"?  Should it be in a separate patch?


* Re: [PATCH 02/17] KVM: x86/mmu: Update iter->old_spte if cmpxchg64 on mirror SPTE "fails"
  2026-03-31  9:22         ` Yan Zhao
@ 2026-03-31 10:14           ` Huang, Kai
  0 siblings, 0 replies; 61+ messages in thread
From: Huang, Kai @ 2026-03-31 10:14 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: Hansen, Dave, seanjc@google.com, x86@kernel.org, kas@kernel.org,
	Edgecombe, Rick P, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, pbonzini@redhat.com

On Tue, 2026-03-31 at 17:22 +0800, Yan Zhao wrote:
> On Tue, Mar 31, 2026 at 05:59:54PM +0800, Huang, Kai wrote:
> > > > 
> > > > The __tdp_mmu_set_spte_atomic() has a WARN() at the beginning to check the
> > > > iter->old_spte isn't a frozen SPTE:
> > > > 
> > > >         WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
> > > > 
> > > > Thinking more, I _think_ this patch could potentially trigger this WARNING
> > > > due to now set_external_spte_present() will set iter->old_spte to
> > > > FROZEN_SPTE when try_cmpxchg64() fails.
> > > > 
> > > > Consider there are 3 vCPUs trying to accept the same GFN, and they all reach
> > > > __tdp_mmu_set_spte_atomic() simultaneously.  Assuming vCPU1 does the 
> > > > 
> > > > 	if (!try_cmpxchg64(rcu_dereference(sptep), old_spte, FROZEN_SPTE))
> > > >  		return -EBUSY;
> > > > 
> > > > .. successfully in set_external_spte_present(), then vCPU2 will fail on the
> > > > try_cmpxchg64(), but this will cause iter->old_spte to be updated to
> > > > FROZEN_SPTE.
> > > > 
> > > > Then when vCPU3 enters __tdp_mmu_set_spte_atomic(), AFAICT the WARNING will
> > > > be triggered due to is_frozen_spte(iter->old_spte) will now return true.
> > > 
> > > The failed caller needs to check "if (is_frozen_spte(iter.old_spte))" before
> > > retrying, as in kvm_tdp_mmu_map()?
> > 
> > It's possible the vCPU3 is already about to go into
> > __tdp_mmu_set_spte_atomic() when iter.old_spte becomes FROZEN_SPTE.
> Hmm, different vCPU's &iter shouldn't locate at the same memory, where iter is a
> local variable.

Ah right.  I missed that :-)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 10/17] KVM: TDX: Move set_external_spte_present() assert into TDX code
  2026-03-27 20:14 ` [PATCH 10/17] KVM: TDX: Move set_external_spte_present() assert into TDX code Rick Edgecombe
@ 2026-03-31 10:30   ` Huang, Kai
  2026-04-02  0:00     ` Edgecombe, Rick P
  2026-03-31 10:34   ` Huang, Kai
  1 sibling, 1 reply; 61+ messages in thread
From: Huang, Kai @ 2026-03-31 10:30 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
	seanjc@google.com, Edgecombe, Rick P, Zhao, Yan Y
  Cc: Hansen, Dave, linux-kernel@vger.kernel.org, x86@kernel.org

On Fri, 2026-03-27 at 13:14 -0700, Rick Edgecombe wrote:
>  static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>  				     enum pg_level level, u64 mirror_spte)
>  {
> -
>  	if (KVM_BUG_ON(!is_shadow_present_pte(mirror_spte), kvm))
>  		return -EIO;

This line deletion wouldn't be needed if the line hadn't been introduced in
the previous patch.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 10/17] KVM: TDX: Move set_external_spte_present() assert into TDX code
  2026-03-27 20:14 ` [PATCH 10/17] KVM: TDX: Move set_external_spte_present() assert into TDX code Rick Edgecombe
  2026-03-31 10:30   ` Huang, Kai
@ 2026-03-31 10:34   ` Huang, Kai
  1 sibling, 0 replies; 61+ messages in thread
From: Huang, Kai @ 2026-03-31 10:34 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
	seanjc@google.com, Edgecombe, Rick P, Zhao, Yan Y
  Cc: Hansen, Dave, linux-kernel@vger.kernel.org, x86@kernel.org

On Fri, 2026-03-27 at 13:14 -0700, Rick Edgecombe wrote:
> Move the MMU lockdep assert in set_external_spte_present() into the TDX
> specific op because the assert is TDX specific in intention.
> 
> The TDP MMU has many lockdep asserts for various scenarios, and in fact
> the callchains that are used for TDX already have a lockdep assert which
> cover the case in set_external_spte_present(). 
> 

cover -> covers

> However, these asserts are
> for management of the TDP root owned by KVM. In the
> set_external_spte_present() assert case, it is helping with a scheme to
> avoid contention in the TDX module during zap operations. That is very
> TDX specific.
> 
> One option would be to just remove the assert in
> set_external_spte_present() and rely on the other ones in the TDP MMU. But
> that assert is for an a different intention, and too far away from the

"an a" -> a

> SEAMCALL that needs it. 
> 

> So move just move it to TDX code.

So just move it ...



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 11/17] KVM: x86/mmu: Fold set_external_spte_present() into its sole caller
  2026-03-27 20:14 ` [PATCH 11/17] KVM: x86/mmu: Fold set_external_spte_present() into its sole caller Rick Edgecombe
@ 2026-03-31 10:36   ` Huang, Kai
  2026-04-01  7:41   ` Yan Zhao
  1 sibling, 0 replies; 61+ messages in thread
From: Huang, Kai @ 2026-03-31 10:36 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
	seanjc@google.com, Edgecombe, Rick P, Zhao, Yan Y
  Cc: Hansen, Dave, linux-kernel@vger.kernel.org, x86@kernel.org

On Fri, 2026-03-27 at 13:14 -0700, Rick Edgecombe wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Fold set_external_spte_present() into __tdp_mmu_set_spte_atomic() as all

Fold .. into __handle_changed_spte()?

> the other functionality besides calling the op. 
> 

all the other functionality besides calling the op is gone?

> It now is a single line
> helper that is called once.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 6763537098ee..85c92aec868f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -494,13 +494,6 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
>  	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
>  }
>  
> -static int __must_check set_external_spte_present(struct kvm *kvm,
> -						  gfn_t gfn, u64 old_spte,
> -						  u64 new_spte, int level)
> -{
> -	return kvm_x86_call(set_external_spte)(kvm, gfn, level, new_spte);
> -}
> -
>  /**
>   * __handle_changed_spte - handle bookkeeping associated with an SPTE change
>   * @kvm: kvm instance
> @@ -600,7 +593,7 @@ static int __handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
>  	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
>  		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
>  	} else if (is_mirror_sp(sp) && is_present) {
> -		r = set_external_spte_present(kvm, gfn, old_spte, new_spte, level);
> +		r = kvm_x86_call(set_external_spte)(kvm, gfn, level, new_spte);
>  		if (r)
>  			return r;
>  	}

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 13/17] KVM: TDX: Hoist tdx_sept_remove_private_spte() above set_private_spte()
  2026-03-27 20:14 ` [PATCH 13/17] KVM: TDX: Hoist tdx_sept_remove_private_spte() above set_private_spte() Rick Edgecombe
@ 2026-03-31 10:42   ` Huang, Kai
  2026-04-02  0:04     ` Edgecombe, Rick P
  0 siblings, 1 reply; 61+ messages in thread
From: Huang, Kai @ 2026-03-31 10:42 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
	seanjc@google.com, Edgecombe, Rick P, Zhao, Yan Y
  Cc: Hansen, Dave, linux-kernel@vger.kernel.org, x86@kernel.org

On Fri, 2026-03-27 at 13:14 -0700, Rick Edgecombe wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Move tdx_sept_remove_private_spte() (and its tdx_track() helper) above
> tdx_sept_set_private_spte() in anticipation of routing all non-atomic
> S-EPT writes (with the exception of reclaiming non-leaf pages) through
> the "set" API.

The diff reads more like moving tdx_sept_set_private_spte() down, though.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 14/17] KVM: x86/mmu: Remove KVM_BUG_ON() that checks lock when removing PTs
  2026-03-30  7:01   ` Yan Zhao
@ 2026-03-31 10:46     ` Huang, Kai
  2026-04-02  0:08       ` Edgecombe, Rick P
  0 siblings, 1 reply; 61+ messages in thread
From: Huang, Kai @ 2026-03-31 10:46 UTC (permalink / raw)
  To: Zhao, Yan Y, Edgecombe, Rick P
  Cc: Hansen, Dave, kvm@vger.kernel.org, pbonzini@redhat.com,
	kas@kernel.org, seanjc@google.com, linux-kernel@vger.kernel.org,
	x86@kernel.org

> 
> > KVM_BUG_ON() to use in the logics future home.
> By "in the logics future home", do you mean "when this logic is moved to its
> future location"?
> 

Maybe it just meant to be:

  the logic's future home

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 17/17] KVM: TDX: Move external page table freeing to TDX code
  2026-03-27 20:14 ` [PATCH 17/17] KVM: TDX: Move external page table freeing to TDX code Rick Edgecombe
  2026-03-30  7:49   ` Yan Zhao
@ 2026-03-31 11:02   ` Huang, Kai
  1 sibling, 0 replies; 61+ messages in thread
From: Huang, Kai @ 2026-03-31 11:02 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
	seanjc@google.com, Edgecombe, Rick P, Zhao, Yan Y
  Cc: Hansen, Dave, linux-kernel@vger.kernel.org, x86@kernel.org

> 
> So move the free_page() call to TDX code. Create an
> tdp_mmu_free_unused_sp() to allow for freeing external page tables that
> have never left the TDP MMU code (i.e. don’t need freed in a special way.
> 

Missing ')' in the last sentence.

Also, don't need to be freed ?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 06/17] KVM: x86/tdp_mmu: Morph the !is_frozen_spte() check into a KVM_MMU_WARN_ON()
  2026-03-30  5:00   ` Yan Zhao
@ 2026-03-31 16:37     ` Edgecombe, Rick P
  2026-04-02  1:06       ` Yan Zhao
  0 siblings, 1 reply; 61+ messages in thread
From: Edgecombe, Rick P @ 2026-03-31 16:37 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: Hansen, Dave, seanjc@google.com, Huang, Kai, kas@kernel.org,
	x86@kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	pbonzini@redhat.com

On Mon, 2026-03-30 at 13:00 +0800, Yan Zhao wrote:

Yep on the typos. 

> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -656,7 +656,13 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> >   	 */
> >   	WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
> >   
> > -	if (is_mirror_sptep(iter->sptep) && !is_frozen_spte(new_spte)) {
> > +	/*
> > +	 * FROZEN_SPTE is a temporary state and should never be set via higher
> > +	 * level helpers.
> > +	 */
> > +	KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> Why is KVM_MMU_WARN_ON() used here for new_spte while WARN_ON_ONCE()
> is used
> above for old_spte?

For the KVM_MMU_WARN_ON() it was Sean's suggestion.

https://lore.kernel.org/lkml/aYYn0nf2cayYu8e7@google.com/

It allows compiling the check out, so it's probably the better choice. I see
the options as either leaving them different or opportunistically converting
the other one to KVM_MMU_WARN_ON(). Thoughts?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 11/17] KVM: x86/mmu: Fold set_external_spte_present() into its sole caller
  2026-03-27 20:14 ` [PATCH 11/17] KVM: x86/mmu: Fold set_external_spte_present() into its sole caller Rick Edgecombe
  2026-03-31 10:36   ` Huang, Kai
@ 2026-04-01  7:41   ` Yan Zhao
  1 sibling, 0 replies; 61+ messages in thread
From: Yan Zhao @ 2026-04-01  7:41 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: seanjc, pbonzini, kai.huang, kvm, kas, linux-kernel, x86,
	dave.hansen

On Fri, Mar 27, 2026 at 01:14:15PM -0700, Rick Edgecombe wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Fold set_external_spte_present() into __tdp_mmu_set_spte_atomic() as all
> the other functionality besides calling the op. It now is a single line
> helper that is called once.
Ack to the sashiko review comment:

"This isn't a bug, but the commit message says the function is folded into
__tdp_mmu_set_spte_atomic(), whereas the code actually folds it into
__handle_changed_spte(). Should the commit message be updated to reflect
the correct function name?
Additionally, the phrase "as all the other functionality besides calling
the op." appears to be missing a verb or is incomplete. Could this sentence
be clarified?" [1].


[1] https://sashiko.dev/#/patchset/20260327201421.2824383-1-rick.p.edgecombe%40intel.com

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
  2026-03-27 20:14 ` [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs Rick Edgecombe
  2026-03-30  6:14   ` Yan Zhao
  2026-03-31 10:09   ` Huang, Kai
@ 2026-04-01  8:34   ` Yan Zhao
  2026-04-02 23:46     ` Edgecombe, Rick P
  2 siblings, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2026-04-01  8:34 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: seanjc, pbonzini, kai.huang, kvm, kas, linux-kernel, x86,
	dave.hansen

Below are responses to issues reported by sashiko. [1]

[1] https://sashiko.dev/#/patchset/20260327201421.2824383-1-rick.p.edgecombe%40intel.com

> @@ -627,21 +615,41 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>  			       "a temporary frozen SPTE.\n"
>  			       "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
>  			       as_id, gfn, old_spte, new_spte, level);
> -		return;
> +		return 0;
>  	}
>  
> -	if (is_leaf != was_leaf)
> -		kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
> -
>  	/*
>  	 * Recursively handle child PTs if the change removed a subtree from
>  	 * the paging structure.  Note the WARN on the PFN changing without the
>  	 * SPTE being converted to a hugepage (leaf) or being zapped.  Shadow
>  	 * pages are kernel allocations and should never be migrated.
> +	 *
> +	 * When modifying leaf entries in mirrored page tables, propagate all
> +	 * changes to the external SPTE.
>  	 */
>  	if (was_present && !was_leaf &&
> -	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> +	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
>  		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> +	} else if (is_mirror_sp(sp) && is_present) {
> +		r = set_external_spte_present(kvm, gfn, old_spte, new_spte, level);
> +		if (r)
> +			return r;
> +	}
Issue #1:
"If a present non-leaf SPTE is replaced with a present leaf SPTE, such as
 during a hugepage collapse, the first branch executes and calls
 handle_removed_pt(). Because of the mutually exclusive "else if",
 set_external_spte_present() is bypassed for the new leaf SPTE. Could this
 leave the external S-EPT out of sync, with the parent entry pointing to a
 freed child page table?
"
Response:
This is the page merging scenario. The mirror root does not allow huge pages yet,
so maybe nothing is required for now.
After the basic TDX huge page series, page merging is also disabled [4].
See more in the responses to issues #2 and #3.

Issue #2:
"Additionally, if an SPTE is atomically zapped, is_present evaluates to false.
 Because tdp_mmu_set_spte_atomic() lacks a fallback remove_external_spte()
 call like the non-atomic tdp_mmu_set_spte() has, does this mean the external
 S-EPT mapping remains fully present while KVM marks it as non-present?
"

Response:
Up to this patch, this seems to be a valid concern. But atomic zap on the mirror
root isn't allowed yet (though the warning on such a case is dropped in patch 5 [2]).
Centralizing atomic zap to __handle_changed_spte() will be done in patch 15 [3].
Maybe move patch 5 to after patch 15?

[2] https://lore.kernel.org/kvm/20260327201421.2824383-6-rick.p.edgecombe@intel.com/
[3] https://lore.kernel.org/kvm/20260327201421.2824383-16-rick.p.edgecombe@intel.com/
[4] https://lore.kernel.org/all/20260106102024.25023-1-yan.y.zhao@intel.com/

Thinking more about centralizing TDX hooks, could we be more aggressive? i.e.,
let TDX just have a single set_external_spte() hook for propagating changes
from the mirror page table to the S-EPT?
(below change is on code base with TDX huge page support).

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 13910d9af03a..d9404aeae5f3 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -95,7 +95,6 @@ KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
 KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
 KVM_X86_OP(load_mmu_pgd)
 KVM_X86_OP_OPTIONAL_RET0(set_external_spte)
-KVM_X86_OP_OPTIONAL(reclaim_external_spt)
 KVM_X86_OP_OPTIONAL_RET0(topup_external_cache)
 KVM_X86_OP(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 76f119a6412b..b722cce0b994 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1857,11 +1857,6 @@ struct kvm_x86_ops {
 	int (*set_external_spte)(struct kvm *kvm, gfn_t gfn, u64 old_spte,
 				 u64 new_spte, enum pg_level level);
 
-	/* Update external page tables for page table about to be freed. */
-	void (*reclaim_external_spt)(struct kvm *kvm, gfn_t gfn,
-				     struct kvm_mmu_page *sp);
-
-
 	int (*topup_external_cache)(struct kvm *kvm, struct kvm_vcpu *vcpu, int min_nr_spts);
 
 	bool (*has_wbinvd_exit)(void);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 2f6bfc875de1..9b4fa5b31f23 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -461,9 +461,6 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 		handle_changed_spte(kvm, sp, gfn, old_spte, FROZEN_SPTE, level, shared);
 	}
 
-	if (is_mirror_sp(sp))
-		kvm_x86_call(reclaim_external_spt)(kvm, base_gfn, sp);
-
 	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
 }
 
@@ -563,9 +560,17 @@ static int __handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
 	 * changes to the external SPTE.
 	 */
 	if (was_present && !was_leaf &&
-	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
+	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
-	} else if (is_mirror_sp(sp)) {
+
+	if (is_mirror_sp(sp)) {
+		/*
+		 * Can also propagate changes to remove external pt. Since this
+		 * occurs after the call_rcu() in handle_removed_pt(), the RCU
+		 * read lock must be held.
+		 */
+		RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "no rcu read lock held");
+
 		r = kvm_x86_call(set_external_spte)(kvm, gfn, old_spte,
 						    new_spte, level);
 		if (r)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 32d0c206ade6..cc03b9c7838c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1965,9 +1965,13 @@ static int tdx_sept_map_leaf_spte(struct kvm *kvm, gfn_t gfn, enum pg_level leve
 	return ret;
 }
 
-static void tdx_sept_reclaim_private_spt(struct kvm *kvm, gfn_t gfn,
+static int tdx_sept_reclaim_private_spt(struct kvm *kvm, gfn_t gfn,
 					struct kvm_mmu_page *sp)
 {
+	int ret;
+
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
 	/*
 	 * KVM doesn't (yet) zap page table pages in mirror page table while
 	 * TD is active, though guest pages mapped in mirror page table could be
@@ -1981,8 +1985,10 @@ static void tdx_sept_reclaim_private_spt(struct kvm *kvm, gfn_t gfn,
 	 * the page to prevent the kernel from accessing the encrypted page.
 	 */
 	if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm) ||
-	    tdx_reclaim_page(virt_to_page(sp->external_spt)))
+	    tdx_reclaim_page(virt_to_page(sp->external_spt))) {
+		ret = -EIO;
 		goto out;
+	}
 
 	/*
 	 * Immediately free the S-EPT page as the TDX subsystem doesn't support
@@ -1990,8 +1996,10 @@ static void tdx_sept_reclaim_private_spt(struct kvm *kvm, gfn_t gfn,
 	 * TDH.PHYMEM.PAGE.RECLAIM ensures there are no outstanding readers.
 	 */
 	free_page((unsigned long)sp->external_spt);
+	ret = 0;
 out:
 	sp->external_spt = NULL;
+	return ret;
 }
 
 static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
@@ -2045,10 +2053,17 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
 				     u64 new_spte, enum pg_level level)
 {
-	if (is_shadow_present_pte(old_spte) && is_shadow_present_pte(new_spte))
-		return tdx_sept_split_private_spte(kvm, gfn, old_spte, new_spte, level);
-	else if (is_shadow_present_pte(old_spte))
-		return tdx_sept_remove_private_spte(kvm, gfn, level, old_spte);
+	if (is_shadow_present_pte(old_spte)) {
+		if (is_shadow_present_pte(new_spte)) {
+			if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
+				return tdx_sept_split_private_spte(kvm, gfn, old_spte, new_spte, level);
+		} else {
+			if (is_last_spte(old_spte, level))
+				return tdx_sept_remove_private_spte(kvm, gfn, level, old_spte);
+			else
+				return tdx_sept_reclaim_private_spt(kvm, gfn, spte_to_child_sp(old_spte));
+		}
+	}
 
 	if (KVM_BUG_ON(!is_shadow_present_pte(new_spte), kvm))
 		return -EIO;
@@ -3915,7 +3930,6 @@ void __init tdx_hardware_setup(void)
 	vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size, sizeof(struct kvm_tdx));
 
 	vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
-	vt_x86_ops.reclaim_external_spt = tdx_sept_reclaim_private_spt;
 	vt_x86_ops.gmem_convert = tdx_gmem_convert;
 


> @@ -708,18 +698,49 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
>  						       struct tdp_iter *iter,
>  						       u64 new_spte)
>  {
> +	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter->sptep));
>  	int ret;
>  
>  	lockdep_assert_held_read(&kvm->mmu_lock);
>  
> -	ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> +	/* KVM should never freeze SPTEs using higher level APIs. */
> +	KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> +
> +	/*
> +	 * Temporarily freeze the SPTE until the external PTE operation has
> +	 * completed (unless the new SPTE itself will be frozen), e.g. so that
> +	 * concurrent faults don't attempt to install a child PTE in the
> +	 * external page table before the parent PTE has been written, or try
> +	 * to re-install a page table before the old one was removed.
> +	 */
> +	if (is_mirror_sptep(iter->sptep))
> +		ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
> +	else
> +		ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
>  	if (ret)
>  		return ret;
>  
> -	handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> -			    new_spte, iter->level, true);
> +	ret = __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> +				    new_spte, iter->level, true);
>  
> -	return 0;
> +	/*
> +	 * Unfreeze the mirror SPTE.  If updating the external SPTE failed,
> +	 * restore the old SPTE so that the SPTE isn't frozen in perpetuity,
> +	 * otherwise set the mirror SPTE to the new desired value.
> +	 */
> +	if (is_mirror_sptep(iter->sptep)) {
> +		if (ret)
> +			__kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte);
> +		else
> +			__kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
Issue 3:
"
 If the mutually exclusive branches in __handle_changed_spte() are updated
 so both can run, there seems to be a risk during rollback. The function
 handle_removed_pt() unlinks the child page table and schedules it for RCU
 destruction. If set_external_spte_present() then fails, the rollback
 code here restores iter->old_spte. Since iter->old_spte points to the child
 page table that was just scheduled for freeing, can this cause a
 use-after-free when the old SPTE is restored?
"

Response:
As in the response to issue 1, changes from present non-leaf to present leaf
in the mirror page table are not valid. We would need special handling in the
TDX MMU to support such a scenario in the future, e.g., TDX requires that all
child entries be either all mapped or all unmapped before merging.
Maybe the TDX huge page series could first skip handle_removed_pt() for the
page merging case, if more robustness is preferred to guard against unexpected
page merging.


Issue 4:
 "
 Furthermore, tdp_mmu_set_spte_atomic() now returns errors from
 __handle_changed_spte(). Previously, it only returned an error (like -EBUSY)
 if the try_cmpxchg64() failed, which natively updated iter->old_spte to the
 new memory value.
 If try_cmpxchg64() succeeds but __handle_changed_spte() fails, iter->old_spte
 is not updated. Callers like clear_dirty_gfn_range() loop on failure assuming
 the old value was refreshed:
 clear_dirty_gfn_range() {
 ...
 retry:
         ...
         if (tdp_mmu_set_spte_atomic(kvm, &iter, iter.old_spte & ~dbit))
                 goto retry;
 ...
 }
 Will this result in an infinite loop, as the caller retries the exact same
 atomic operation with the unchanged iter.old_spte?
"

Response:
Though the only expected error from __handle_changed_spte() is also -EBUSY, an
infinite loop is possible if updating the external page table constantly fails
due to contention.
However, it's currently benign since callers like clear_dirty_gfn_range() can't
operate on the mirror page table.

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
  2026-03-30  6:14   ` Yan Zhao
@ 2026-04-01 23:45     ` Edgecombe, Rick P
  2026-04-02  1:59       ` Yan Zhao
  0 siblings, 1 reply; 61+ messages in thread
From: Edgecombe, Rick P @ 2026-04-01 23:45 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: Hansen, Dave, seanjc@google.com, Huang, Kai, kas@kernel.org,
	x86@kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	pbonzini@redhat.com

On Mon, 2026-03-30 at 14:14 +0800, Yan Zhao wrote:
> >   						       u64 new_spte)
> >   {
> > +	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter->sptep));
> >   	int ret;
> >   
> >   	lockdep_assert_held_read(&kvm->mmu_lock);
> >   
> > -	ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> > +	/* KVM should never freeze SPTEs using higher level APIs. */
> "higher level API" is ambiguous. e.g. kvm_tdp_mmu_write_spte_atomic() allows
> new_spte to be FROZEN_SPTE.

Yea you are right. It felt too fuzzy but I couldn't think of a better word.

> 
> What about just "callers of tdp_mmu_set_spte_atomic() should not freeze SPTEs
> directly"?

Sure.

> 
> > +	KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> > +
> > +	/*
> > +	 * Temporarily freeze the SPTE until the external PTE operation has
> > +	 * completed (unless the new SPTE itself will be frozen), e.g. so that
> > +	 * concurrent faults don't attempt to install a child PTE in the
> > +	 * external page table before the parent PTE has been written, or try
> > +	 * to re-install a page table before the old one was removed.
> > +	 */
> > +	if (is_mirror_sptep(iter->sptep))
> > +		ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
> > +	else
> > +		ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> >   	if (ret)
> >   		return ret;
> >   
> > -	handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> > -			    new_spte, iter->level, true);
> > +	ret = __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> > +				    new_spte, iter->level, true);
> 
> What about adding a comment for the tricky part for the mirror page table:
> while new_spte is set to FROZEN_SPTE in the above __tdp_mmu_set_spte_atomic()

You meant it sets iter->sptep I think.

> for freezing the mirror page table, the original new_spte from the caller of
> tdp_mmu_set_spte_atomic() is passed to __handle_changed_spte() in order to
> properly update statistics and propagate to the external page table.

new_spte was already passed in. What changed? You mean that
__tdp_mmu_set_spte_atomic() sets iter->sptep and doesn't update new_spte? If
so, I'm not sure it warrants a comment in the TDP MMU.

> 
> > -	return 0;
> > +	/*
> > +	 * Unfreeze the mirror SPTE.  If updating the external SPTE failed,
> > +	 * restore the old SPTE so that the SPTE isn't frozen in perpetuity,
> > +	 * otherwise set the mirror SPTE to the new desired value.
> > +	 */
> > +	if (is_mirror_sptep(iter->sptep)) {
> > +		if (ret)
> > +			__kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte);
> > +		else
> > +			__kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
> > +	} else {
> > +		/*
> > +		 * Bug the VM if handling the change failed, as failure is only
> > +		 * allowed if KVM couldn't update the external SPTE.
> > +		 */
> > +		KVM_BUG_ON(ret, kvm);
> > +	}
> > +	return ret;
> >   }
> >   
> >   /*
> > @@ -738,6 +759,8 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
> >   static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> >   			    u64 old_spte, u64 new_spte, gfn_t gfn, int level)
> >   {
> > +	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(sptep));
> > +
> >   	lockdep_assert_held_write(&kvm->mmu_lock);
> >   
> >   	/*
> > @@ -751,7 +774,7 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> >   
> >   	old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
> >   
> > -	handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
> > +	handle_changed_spte(kvm, sp, gfn, old_spte, new_spte, level, false);
> >   
> >   	/*
> >   	 * Users that do non-atomic setting of PTEs don't operate on mirror
> > @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm, struct tdp_iter *iter)
> >   {
> >   	u64 new_spte;
> >   
> > +	if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> > +		return;
> > +
> Add a comment for why mirror page table is not expected here?

Ehh, maybe. Thinking about what to put... The warning is kind of cheating a
little bit on the idea of the patch: to forward all changes through limited ops
in a central place, such that we don't have TDX specifics encoded in the core
MMU. Trying to forward this through properly would result in more burden on
the TDP MMU, so that's not the right answer either.

"Mirror TDP doesn't support PTE aging" is a pretty obvious comment. I'm fine
just leaving it without comment, but I can add something like that. Or do you
have another suggestion?

> 
> And do we need a similar WARN_ON_ONCE() in kvm_tdp_mmu_clear_dirty_pt_masked()
> or clear_dirty_pt_masked()?

Nothing changes for those in this patch though? For the kvm_tdp_mmu_age_spte()
case, warning coverage is removed in this patch.

> 
> >   	if (spte_ad_enabled(iter->old_spte)) {
> >   		iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
> >   								shadow_accessed_mask);
> > -- 
> > 2.53.0


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
  2026-03-31 10:09   ` Huang, Kai
@ 2026-04-01 23:58     ` Edgecombe, Rick P
  2026-04-02 23:21       ` Sean Christopherson
  0 siblings, 1 reply; 61+ messages in thread
From: Edgecombe, Rick P @ 2026-04-01 23:58 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
	seanjc@google.com, Huang, Kai, Zhao, Yan Y
  Cc: Hansen, Dave, linux-kernel@vger.kernel.org, x86@kernel.org

On Tue, 2026-03-31 at 10:09 +0000, Huang, Kai wrote:
> > @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm, struct tdp_iter *iter)
> >   {
> >   	u64 new_spte;
> >   
> > +	if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> > +		return;
> > +
> >   	if (spte_ad_enabled(iter->old_spte)) {
> >   		iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
> >   								shadow_accessed_mask);
> 
> AFAICT this chunk isn't related to "Centralize updates to present external
> PTEs"?  Should it be in a separate patch?

It's because we lose warning coverage for the aging case in this patch. The
diff was from Sean, but my take is that this just maintains coverage that
already existed.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 09/17] KVM: TDX: Add helper to handle mapping leaf SPTE into S-EPT
  2026-03-30  6:43   ` Yan Zhao
@ 2026-04-01 23:59     ` Edgecombe, Rick P
  0 siblings, 0 replies; 61+ messages in thread
From: Edgecombe, Rick P @ 2026-04-01 23:59 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: Hansen, Dave, seanjc@google.com, Huang, Kai, kas@kernel.org,
	x86@kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	pbonzini@redhat.com

On Mon, 2026-03-30 at 14:43 +0800, Yan Zhao wrote:
> >   
> > +static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> > +				     enum pg_level level, u64 mirror_spte)
> > +{
> >   
> > +	if (KVM_BUG_ON(!is_shadow_present_pte(mirror_spte), kvm))
> > +		return -EIO;
> > +
> > +	if (!is_last_spte(mirror_spte, level))
> > +		return tdx_sept_link_private_spt(kvm, gfn, level, mirror_spte);
> For symmetry, what about renaming tdx_sept_link_private_spt() to
> tdx_sept_map_nonleaf_spte()?

Yea, it fits better.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 10/17] KVM: TDX: Move set_external_spte_present() assert into TDX code
  2026-03-31 10:30   ` Huang, Kai
@ 2026-04-02  0:00     ` Edgecombe, Rick P
  0 siblings, 0 replies; 61+ messages in thread
From: Edgecombe, Rick P @ 2026-04-02  0:00 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
	seanjc@google.com, Huang, Kai, Zhao, Yan Y
  Cc: Hansen, Dave, linux-kernel@vger.kernel.org, x86@kernel.org

On Tue, 2026-03-31 at 10:30 +0000, Huang, Kai wrote:
> On Fri, 2026-03-27 at 13:14 -0700, Rick Edgecombe wrote:
> >  static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> >  				     enum pg_level level, u64 mirror_spte)
> >  {
> > -
> >  	if (KVM_BUG_ON(!is_shadow_present_pte(mirror_spte), kvm))
> >  		return -EIO;
> 
> This line deleting isn't needed if it wasn't introduced in the previous
> patch.


Oops, will fix.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 13/17] KVM: TDX: Hoist tdx_sept_remove_private_spte() above set_private_spte()
  2026-03-31 10:42   ` Huang, Kai
@ 2026-04-02  0:04     ` Edgecombe, Rick P
  0 siblings, 0 replies; 61+ messages in thread
From: Edgecombe, Rick P @ 2026-04-02  0:04 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
	seanjc@google.com, Huang, Kai, Zhao, Yan Y
  Cc: Hansen, Dave, linux-kernel@vger.kernel.org, x86@kernel.org

On Tue, 2026-03-31 at 10:42 +0000, Huang, Kai wrote:
> On Fri, 2026-03-27 at 13:14 -0700, Rick Edgecombe wrote:
> > From: Sean Christopherson <seanjc@google.com>
> > 
> > Move tdx_sept_remove_private_spte() (and its tdx_track() helper) above
> > tdx_sept_set_private_spte() in anticipation of routing all non-atomic
> > S-EPT writes (with the exception of reclaiming non-leaf pages) through
> > the "set" API.
> 
> The diff reads more like moving tdx_sept_set_private_spte() down, though.

Yea, I noticed that too. We can make it more generic, like "arrange such that..."

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 14/17] KVM: x86/mmu: Remove KVM_BUG_ON() that checks lock when removing PTs
  2026-03-31 10:46     ` Huang, Kai
@ 2026-04-02  0:08       ` Edgecombe, Rick P
  2026-04-02  2:04         ` Yan Zhao
  0 siblings, 1 reply; 61+ messages in thread
From: Edgecombe, Rick P @ 2026-04-02  0:08 UTC (permalink / raw)
  To: Huang, Kai, Zhao, Yan Y
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Hansen, Dave,
	kas@kernel.org, seanjc@google.com, x86@kernel.org,
	linux-kernel@vger.kernel.org

On Tue, 2026-03-31 at 10:46 +0000, Huang, Kai wrote:
> > 
> > > KVM_BUG_ON() to use in the logics future home.
> > By "in the logics future home", do you mean "when this logic is moved to its
> > future location"?
> > 
> 
> Maybe it just meant to be:
> 
>   the logic's future home

Yea, that is what I meant. I'll update it. Yan, do you find the "future home"
phrasing confusing?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 17/17] KVM: TDX: Move external page table freeing to TDX code
  2026-03-30  7:49   ` Yan Zhao
@ 2026-04-02  0:17     ` Edgecombe, Rick P
  2026-04-02  2:16       ` Yan Zhao
  0 siblings, 1 reply; 61+ messages in thread
From: Edgecombe, Rick P @ 2026-04-02  0:17 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: Hansen, Dave, seanjc@google.com, Huang, Kai, kas@kernel.org,
	x86@kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	pbonzini@redhat.com

On Mon, 2026-03-30 at 15:49 +0800, Yan Zhao wrote:
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index d064b40a6b31..1346e891ca94 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1782,7 +1782,16 @@ static void tdx_sept_free_private_spt(struct kvm
> > *kvm, gfn_t gfn,
> >   	 */
> >   	if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm) ||
> >   	    tdx_reclaim_page(virt_to_page(sp->external_spt)))
> > -		sp->external_spt = NULL;
> > +		goto out;
> > +
> > +	/*
> > +	 * Immediately free the S-EPT page as the TDX subsystem doesn't
> > support
> > +	 * freeing pages from RCU callbacks, and more importantly because
> > +	 * TDH.PHYMEM.PAGE.RECLAIM ensures there are no outstanding
> > readers.
> > +	 */
> > +	free_page((unsigned long)sp->external_spt);
> Since sp->external_spt is allocated in kvm_mmu_alloc_external_spt() in a
> similar
> way as sp->spt in tdp_mmu_alloc_sp(), why bother making freeing of
> sp->external_spt special? i.e., what's the downside of freeing it together
> with
> sp->spt in tdp_mmu_free_sp() ?

I agree it's questionable as a cleanup. The change originally came from your
comment here actually:
https://lore.kernel.org/kvm/aYW5CbUvZrLogsWF@yzhao56-desk.sh.intel.com/

So it was for a more concrete DPAMT reason. That said, I agree the reasoning as
a standalone cleanup is pretty weak. I tried to put a brave face on it.

We have to decide whether to include it in the initial set or with the later
DPAMT parts. I thought to at least try to do it earlier and leave it at the end.
It also kind of helps to see more of the overhaul together. Thoughts?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 06/17] KVM: x86/tdp_mmu: Morph the !is_frozen_spte() check into a KVM_MMU_WARN_ON()
  2026-03-31 16:37     ` Edgecombe, Rick P
@ 2026-04-02  1:06       ` Yan Zhao
  2026-04-02 19:21         ` Sean Christopherson
  0 siblings, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2026-04-02  1:06 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Hansen, Dave, seanjc@google.com, Huang, Kai, kas@kernel.org,
	x86@kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	pbonzini@redhat.com

On Wed, Apr 01, 2026 at 12:37:51AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2026-03-30 at 13:00 +0800, Yan Zhao wrote:
> 
> Yep on the typos. 
> 
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -656,7 +656,13 @@ static inline int __must_check
> > > __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> > >   	 */
> > >   	WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter-
> > > >old_spte));
> > >   
> > > -	if (is_mirror_sptep(iter->sptep) &&
> > > !is_frozen_spte(new_spte)) {
> > > +	/*
> > > +	 * FROZEN_SPTE is a temporary state and should never be
> > > set via higher
> > > +	 * level helpers.
> > > +	 */
> > > +	KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> > Why is KVM_MMU_WARN_ON() used here for new_spte while WARN_ON_ONCE()
> > is used
> > above for old_spte?
> 
> For the KVM_MMU_WARN_ON() it was Sean's suggestion.
> 
> https://lore.kernel.org/lkml/aYYn0nf2cayYu8e7@google.com/
> 
> It allows for compiling it out, so probably a better choice. So I see
> the options are leave them different or opportunistically convert the
> other one to KVM_MMU_WARN_ON(). Thoughts?
I see there are mixed WARN_ON_ONCE() and KVM_MMU_WARN_ON() calls in mmu.c and
tdp_mmu.c. I'm not sure if there's a rule for which one to use.
Is it necessary to evaluate them all and have a separate patch to convert
WARN_ON_ONCE() to KVM_MMU_WARN_ON()?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
  2026-04-01 23:45     ` Edgecombe, Rick P
@ 2026-04-02  1:59       ` Yan Zhao
  2026-04-02 23:10         ` Edgecombe, Rick P
  0 siblings, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2026-04-02  1:59 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Hansen, Dave, seanjc@google.com, Huang, Kai, kas@kernel.org,
	x86@kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	pbonzini@redhat.com

On Thu, Apr 02, 2026 at 07:45:54AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2026-03-30 at 14:14 +0800, Yan Zhao wrote:
> > > +	KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> > > +
> > > +	/*
> > > +	 * Temporarily freeze the SPTE until the external PTE operation has
> > > +	 * completed (unless the new SPTE itself will be frozen), e.g. so
> > > that
> > > +	 * concurrent faults don't attempt to install a child PTE in the
> > > +	 * external page table before the parent PTE has been written, or
> > > try
> > > +	 * to re-install a page table before the old one was removed.
> > > +	 */
> > > +	if (is_mirror_sptep(iter->sptep))
> > > +		ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
> > > +	else
> > > +		ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> > >   	if (ret)
> > >   		return ret;
> > >   
> > > -	handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> > > -			    new_spte, iter->level, true);
> > > +	ret = __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> > > +				    new_spte, iter->level, true);
> > 
> > What about adding a comment for the tricky part for the mirror page table:
> > while new_spte is set to FROZEN_SPTE in the above __tdp_mmu_set_spte_atomic()
> 
> You meant it sets iter->sptep I think.
> 
> > for freezing the mirror page table, the original new_spte from the caller of
> > tdp_mmu_set_spte_atomic() is passed to __handle_changed_spte() in order to
> > properly update statistics and propagate to the external page table.
> 
new_spte was already passed in. What changed? You mean that
__tdp_mmu_set_spte_atomic() sets iter->sptep and doesn't update new_spte? If so,
I'm not sure how that's specific to the TDP MMU.

For the mirror page table, a successful path in tdp_mmu_set_spte_atomic() looks
like this:
tdp_mmu_set_spte_atomic() {
  __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE); ==>sets mirror to frozen
  __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
			new_spte, iter->level, true);==>sets S-EPT to new_spte
  __kvm_tdp_mmu_write_spte(iter->sptep, new_spte);   ==>sets mirror to new_spte
}

So, the tricky part is that S-EPT is updated to new_spte ahead of mirror EPT.


> > > -	return 0;
> > > +	/*
> > > +	 * Unfreeze the mirror SPTE.  If updating the external SPTE failed,
> > > +	 * restore the old SPTE so that the SPTE isn't frozen in
> > > perpetuity,
> > > +	 * otherwise set the mirror SPTE to the new desired value.
> > > +	 */
> > > +	if (is_mirror_sptep(iter->sptep)) {
> > > +		if (ret)
> > > +			__kvm_tdp_mmu_write_spte(iter->sptep, iter-
> > > >old_spte);
> > > +		else
> > > +			__kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
> > > +	} else {
> > > +		/*
> > > +		 * Bug the VM if handling the change failed, as failure is
> > > only
> > > +		 * allowed if KVM couldn't update the external SPTE.
> > > +		 */
> > > +		KVM_BUG_ON(ret, kvm);
> > > +	}
> > > +	return ret;
> > >   }
> > >   
> > >   /*
> > > @@ -738,6 +759,8 @@ static inline int __must_check
> > > tdp_mmu_set_spte_atomic(struct kvm *kvm,
> > >   static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> > >   			    u64 old_spte, u64 new_spte, gfn_t gfn, int
> > > level)
> > >   {
> > > +	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(sptep));
> > > +
> > >   	lockdep_assert_held_write(&kvm->mmu_lock);
> > >   
> > >   	/*
> > > @@ -751,7 +774,7 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id,
> > > tdp_ptep_t sptep,
> > >   
> > >   	old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte,
> > > level);
> > >   
> > > -	handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
> > > false);
> > > +	handle_changed_spte(kvm, sp, gfn, old_spte, new_spte, level,
> > > false);
> > >   
> > >   	/*
> > >   	 * Users that do non-atomic setting of PTEs don't operate on mirror
> > > @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm,
> > > struct tdp_iter *iter)
> > >   {
> > >   	u64 new_spte;
> > >   
> > > +	if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> > > +		return;
> > > +
> > Add a comment for why mirror page table is not expected here?
> 
> Ehh, maybe. Thinking about what to put... The warning is kind of cheating a
> little bit on the idea of the patch: to forward all changes through limited ops
> in a central place, such that we don't have TDX specifics encoded in core MMU.
> Trying to forward this through properly would result in more burden to the TDP
> MMU, so that's not the right answer either.
> 
> "Mirror TDP doesn't support PTE aging" is a pretty obvious comment. I'm fine
> just leaving it without comment, but I can add something like that. Or do you
> have another suggestion?
What about "External TDP doesn't support clearing PTE A/D bit"?
I'm ok with leaving it without a comment if you think it's too obvious.

> > And do we need a similar WARN_ON_ONCE() in kvm_tdp_mmu_clear_dirty_pt_masked()
> > or clear_dirty_pt_masked()?
> 
> Nothing changes for those in this patch though? For the kvm_tdp_mmu_age_spte()
> case, warning coverage is removed in this patch.

In kvm_tdp_mmu_age_spte() and clear_dirty_pt_masked() SPTEs are updated via
__tdp_mmu_set_spte_atomic() or tdp_mmu_clear_spte_bits(), which don't propagate
changes to the external page table.
So, I think it's better to warn if a mirror root is accidentally passed into
those functions.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 14/17] KVM: x86/mmu: Remove KVM_BUG_ON() that checks lock when removing PTs
  2026-04-02  0:08       ` Edgecombe, Rick P
@ 2026-04-02  2:04         ` Yan Zhao
  0 siblings, 0 replies; 61+ messages in thread
From: Yan Zhao @ 2026-04-02  2:04 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Huang, Kai, kvm@vger.kernel.org, pbonzini@redhat.com,
	Hansen, Dave, kas@kernel.org, seanjc@google.com, x86@kernel.org,
	linux-kernel@vger.kernel.org

On Thu, Apr 02, 2026 at 08:08:54AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2026-03-31 at 10:46 +0000, Huang, Kai wrote:
> > > 
> > > > KVM_BUG_ON() to use in the logics future home.
> > > By "in the logics future home", do you mean "when this logic is moved to its
> > > future location"?
> > > 
> > 
> > Maybe it just meant to be:
> > 
> >   the logic's future home
> 
> Yea that is what I meant. I'll update it. Unless Yan do you find the "future
> home" phrasing confusing?
Yeah, "future home" seems less clear to me. :)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 17/17] KVM: TDX: Move external page table freeing to TDX code
  2026-04-02  0:17     ` Edgecombe, Rick P
@ 2026-04-02  2:16       ` Yan Zhao
  2026-04-02  2:17         ` Yan Zhao
  0 siblings, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2026-04-02  2:16 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Hansen, Dave, seanjc@google.com, Huang, Kai, kas@kernel.org,
	x86@kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	pbonzini@redhat.com

On Thu, Apr 02, 2026 at 08:17:19AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2026-03-30 at 15:49 +0800, Yan Zhao wrote:
> > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > index d064b40a6b31..1346e891ca94 100644
> > > --- a/arch/x86/kvm/vmx/tdx.c
> > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > @@ -1782,7 +1782,16 @@ static void tdx_sept_free_private_spt(struct kvm
> > > *kvm, gfn_t gfn,
> > >   	 */
> > >   	if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm) ||
> > >   	    tdx_reclaim_page(virt_to_page(sp->external_spt)))
> > > -		sp->external_spt = NULL;
> > > +		goto out;
> > > +
> > > +	/*
> > > +	 * Immediately free the S-EPT page as the TDX subsystem doesn't
> > > support
> > > +	 * freeing pages from RCU callbacks, and more importantly because
> > > +	 * TDH.PHYMEM.PAGE.RECLAIM ensures there are no outstanding
> > > readers.
> > > +	 */
> > > +	free_page((unsigned long)sp->external_spt);
> > Since sp->external_spt is allocated in kvm_mmu_alloc_external_spt() in a
> > similar
> > way as sp->spt in tdp_mmu_alloc_sp(), why bother making freeing of
> > sp->external_spt special? i.e., what's the downside of freeing it together
> > with
> > sp->spt in tdp_mmu_free_sp() ?
> 
> I agree it's questionable as a cleanup. The change originally came from your
> comment here actually:
> https://lore.kernel.org/kvm/aYW5CbUvZrLogsWF@yzhao56-desk.sh.intel.com/
> 
> So for a more concrete DPAMT reason. That said I agree the reasoning as a stand
> alone cleanup is pretty weak. I tried to put a brave face on it.
> 
> We have to think if we should try to include it in the initial set or include it
> with the later DPAMT parts. I thought to try at least to do it earlier and leave
> it at the end. It also kind of helps to see more of the overhaul together.
> Thoughts?
However, even with DPAMT, sp->external_spt is also allocated via
kvm_mmu_alloc_external_spt().

static inline void kvm_mmu_alloc_external_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
{
        /*
         * external_spt is allocated for TDX module to hold private EPT mappings,
         * TDX module will initialize the page by itself.
         * Therefore, KVM does not need to initialize or access external_spt.
         * KVM only interacts with sp->spt for private EPT operations.
         */
        sp->external_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_external_spt_cache);
}

i.e., sp->external_spt is allocated with no difference from sp->spt.
So, I'm not sure why we have to free it differently.
What's the benefit of keeping the memory for a little longer than freeing it
immediately after TDX RECLAIM?

Or are you planning to allocate sp->external_spt via a TDX hook?


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 17/17] KVM: TDX: Move external page table freeing to TDX code
  2026-04-02  2:16       ` Yan Zhao
@ 2026-04-02  2:17         ` Yan Zhao
  0 siblings, 0 replies; 61+ messages in thread
From: Yan Zhao @ 2026-04-02  2:17 UTC (permalink / raw)
  To: Edgecombe, Rick P, Hansen, Dave, seanjc@google.com, Huang, Kai,
	kas@kernel.org, x86@kernel.org, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, pbonzini@redhat.com

On Thu, Apr 02, 2026 at 10:16:07AM +0800, Yan Zhao wrote:
> On Thu, Apr 02, 2026 at 08:17:19AM +0800, Edgecombe, Rick P wrote:
> > On Mon, 2026-03-30 at 15:49 +0800, Yan Zhao wrote:
> > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > > index d064b40a6b31..1346e891ca94 100644
> > > > --- a/arch/x86/kvm/vmx/tdx.c
> > > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > > @@ -1782,7 +1782,16 @@ static void tdx_sept_free_private_spt(struct kvm
> > > > *kvm, gfn_t gfn,
> > > >   	 */
> > > >   	if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm) ||
> > > >   	    tdx_reclaim_page(virt_to_page(sp->external_spt)))
> > > > -		sp->external_spt = NULL;
> > > > +		goto out;
> > > > +
> > > > +	/*
> > > > +	 * Immediately free the S-EPT page as the TDX subsystem doesn't
> > > > support
> > > > +	 * freeing pages from RCU callbacks, and more importantly because
> > > > +	 * TDH.PHYMEM.PAGE.RECLAIM ensures there are no outstanding
> > > > readers.
> > > > +	 */
> > > > +	free_page((unsigned long)sp->external_spt);
> > > Since sp->external_spt is allocated in kvm_mmu_alloc_external_spt() in a
> > > similar
> > > way as sp->spt in tdp_mmu_alloc_sp(), why bother making freeing of
> > > sp->external_spt special? i.e., what's the downside of freeing it together
> > > with
> > > sp->spt in tdp_mmu_free_sp() ?
> > 
> > I agree it's questionable as a cleanup. The change originally came from your
> > comment here actually:
> > https://lore.kernel.org/kvm/aYW5CbUvZrLogsWF@yzhao56-desk.sh.intel.com/
> > 
> > So for a more concrete DPAMT reason. That said I agree the reasoning as a stand
> > alone cleanup is pretty weak. I tried to put a brave face on it.
> > 
> > We have to think if we should try to include it in the initial set or include it
> > with the later DPAMT parts. I thought to try at least to do it earlier and leave
> > it at the end. It also kind of helps to see more of the overhaul together.
> > Thoughts?
> However, even with DPAMT, sp->external_sp is also allocated via
> kvm_mmu_alloc_external_spt().
> 
> static inline void kvm_mmu_alloc_external_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> {
>         /*
>          * external_spt is allocated for TDX module to hold private EPT mappings,
>          * TDX module will initialize the page by itself.
>          * Therefore, KVM does not need to initialize or access external_spt.
>          * KVM only interacts with sp->spt for private EPT operations.
>          */
>         sp->external_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_external_spt_cache);
> }
> 
> i.e., sp->external_spt is allocated with no difference from sp->spt.
> So, I'm not sure why we have to free it differently.
> What's the benefit of keeping the memory for a little longer than freeing it
Typo: what's the downside

> immediately after TDX RECLAIM?
> 
> Or are you planning to allocate sp->external_spt via a TDX hook?
> 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 06/17] KVM: x86/tdp_mmu: Morph the !is_frozen_spte() check into a KVM_MMU_WARN_ON()
  2026-04-02  1:06       ` Yan Zhao
@ 2026-04-02 19:21         ` Sean Christopherson
  2026-04-03  2:47           ` Yan Zhao
  0 siblings, 1 reply; 61+ messages in thread
From: Sean Christopherson @ 2026-04-02 19:21 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Rick P Edgecombe, Dave Hansen, Kai Huang, kas@kernel.org,
	x86@kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	pbonzini@redhat.com

On Thu, Apr 02, 2026, Yan Zhao wrote:
> On Wed, Apr 01, 2026 at 12:37:51AM +0800, Edgecombe, Rick P wrote:
> > On Mon, 2026-03-30 at 13:00 +0800, Yan Zhao wrote:
> > 
> > Yep on the typos. 
> > 
> > > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > > @@ -656,7 +656,13 @@ static inline int __must_check
> > > > __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> > > >   	 */
> > > >   	WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter-
> > > > >old_spte));
> > > >   
> > > > -	if (is_mirror_sptep(iter->sptep) &&
> > > > !is_frozen_spte(new_spte)) {
> > > > +	/*
> > > > +	 * FROZEN_SPTE is a temporary state and should never be
> > > > set via higher
> > > > +	 * level helpers.
> > > > +	 */
> > > > +	KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> > > Why is KVM_MMU_WARN_ON() used here for new_spte while WARN_ON_ONCE()
> > > is used
> > > above for old_spte?
> > 
> > For the KVM_MMU_WARN_ON() it was Sean's suggestion.
> > 
> > https://lore.kernel.org/lkml/aYYn0nf2cayYu8e7@google.com/
> > 
> > It allows for compiling it out, so probably a better choice. So I see
> > the options are leave them different or opportunistically convert the
> > other one to KVM_MMU_WARN_ON(). Thoughts?
> I see there are mixed WARN_ON_ONCE() and KVM_MMU_WARN_ON() calls in mmu.c and
> tdp_mmu.c. I'm not sure if there's a rule for which one to use.

KVM_MMU_WARN_ON() is intended to be used only when having a WARN_ON_ONCE() enabled
in production environments isn't worth the cost of the code.  E.g. if the code in
question is a low level helper that is used "everywhere" and in super hot paths, or
in extremely paranoid sanity checks where the platform is beyond hosed if the
sanity check fails.

The other thing with KVM_MMU_WARN_ON() is that it doesn't evaluate the expression
when CONFIG_KVM_PROVE_MMU=n, i.e. can't be used in if-statements (the only case
where it's used in an if-statement is wrapped with a CONFIG_KVM_PROVE_MMU #ifdef).

For infrequently used and/or slow paths, and for cases where KVM needs to take
action no matter what, WARN_ON_ONCE() is preferred because if something goes
sideways, we'll get a helpful splat even in production.

> Is it necessary to evaluate them all and have a separate patch to convert
> WARN_ON_ONCE() to KVM_MMU_WARN_ON()?

Nope.  We could probably convert a few select ones if there's good reason to do
so, but we definitely don't want to do a bulk conversion.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
  2026-04-02  1:59       ` Yan Zhao
@ 2026-04-02 23:10         ` Edgecombe, Rick P
  2026-04-02 23:28           ` Sean Christopherson
  2026-04-03  9:08           ` Yan Zhao
  0 siblings, 2 replies; 61+ messages in thread
From: Edgecombe, Rick P @ 2026-04-02 23:10 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: Hansen, Dave, seanjc@google.com, x86@kernel.org, Huang, Kai,
	kas@kernel.org, linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	kvm@vger.kernel.org

On Thu, 2026-04-02 at 09:59 +0800, Yan Zhao wrote:
> On Thu, Apr 02, 2026 at 07:45:54AM +0800, Edgecombe, Rick P wrote:
> > On Mon, 2026-03-30 at 14:14 +0800, Yan Zhao wrote:
> > > > +	KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> > > > +
> > > > +	/*
> > > > +	 * Temporarily freeze the SPTE until the external PTE operation has
> > > > +	 * completed (unless the new SPTE itself will be frozen), e.g. so
> > > > that
> > > > +	 * concurrent faults don't attempt to install a child PTE in the
> > > > +	 * external page table before the parent PTE has been written, or
> > > > try
> > > > +	 * to re-install a page table before the old one was removed.
> > > > +	 */
> > > > +	if (is_mirror_sptep(iter->sptep))
> > > > +		ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
> > > > +	else
> > > > +		ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> > > >   	if (ret)
> > > >   		return ret;
> > > >   
> > > > -	handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> > > > -			    new_spte, iter->level, true);
> > > > +	ret = __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> > > > +				    new_spte, iter->level, true);
> > > 
> > > What about adding a comment for the tricky part for the mirror page table:
> > > while new_spte is set to FROZEN_SPTE in the above __tdp_mmu_set_spte_atomic()
> > 
> > You meant it sets iter->sptep I think.
> > 
> > > for freezing the mirror page table, the original new_spte from the caller of
> > > tdp_mmu_set_spte_atomic() is passed to __handle_changed_spte() in order to
> > > properly update statistics and propagate to the external page table.
> > 
> > new_spte was already passed in. What changed? You mean that
> > __tdp_mmu_set_spte_atomic() sets iter->sptep and doesn't update new_spte? If so
> > I'm not sure if it threshold TDP MMU.
> 
> For mirror page table, a successful path in tdp_mmu_set_spte_atomic() looks
> like this:
> tdp_mmu_set_spte_atomic() {
>   __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE); ==>sets mirror to frozen
>   __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> 			new_spte, iter->level, true);==>sets S-EPT to new_spte
>   __kvm_tdp_mmu_write_spte(iter->sptep, new_spte);   ==>sets mirror to new_spte
> }
> 
> So, the tricky part is that S-EPT is updated to new_spte ahead of mirror EPT.

I still don't see the point. That ordering is not new, and this patch actually
adds a bunch of comments around the operations above and below the
__handle_changed_spte() call. If you think something is still missing, maybe you
can suggest something.

> 
> 
> > > > -	return 0;
> > > > +	/*
> > > > +	 * Unfreeze the mirror SPTE.  If updating the external SPTE failed,
> > > > +	 * restore the old SPTE so that the SPTE isn't frozen in
> > > > perpetuity,
> > > > +	 * otherwise set the mirror SPTE to the new desired value.
> > > > +	 */
> > > > +	if (is_mirror_sptep(iter->sptep)) {
> > > > +		if (ret)
> > > > +			__kvm_tdp_mmu_write_spte(iter->sptep, iter-
> > > > > old_spte);
> > > > +		else
> > > > +			__kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
> > > > +	} else {
> > > > +		/*
> > > > +		 * Bug the VM if handling the change failed, as failure is
> > > > only
> > > > +		 * allowed if KVM couldn't update the external SPTE.
> > > > +		 */
> > > > +		KVM_BUG_ON(ret, kvm);
> > > > +	}
> > > > +	return ret;
> > > >   }
> > > >   
> > > >   /*
> > > > @@ -738,6 +759,8 @@ static inline int __must_check
> > > > tdp_mmu_set_spte_atomic(struct kvm *kvm,
> > > >   static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> > > >   			    u64 old_spte, u64 new_spte, gfn_t gfn, int
> > > > level)
> > > >   {
> > > > +	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(sptep));
> > > > +
> > > >   	lockdep_assert_held_write(&kvm->mmu_lock);
> > > >   
> > > >   	/*
> > > > @@ -751,7 +774,7 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id,
> > > > tdp_ptep_t sptep,
> > > >   
> > > >   	old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte,
> > > > level);
> > > >   
> > > > -	handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
> > > > false);
> > > > +	handle_changed_spte(kvm, sp, gfn, old_spte, new_spte, level,
> > > > false);
> > > >   
> > > >   	/*
> > > >   	 * Users that do non-atomic setting of PTEs don't operate on mirror
> > > > @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm,
> > > > struct tdp_iter *iter)
> > > >   {
> > > >   	u64 new_spte;
> > > >   
> > > > +	if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> > > > +		return;
> > > > +
> > > Add a comment for why mirror page table is not expected here?
> > 
> > Ehh, maybe. Thinking about what to put... The warning is kind of cheating a
> > little bit on the idea of the patch: to forward all changes through limited ops
> > in a central place, such that we don't have TDX specifics encoded in core MMU.
> > Trying to forward this through properly would result in more burden to the TDP
> > MMU, so that's not the right answer either.
> > 
> > "Mirror TDP doesn't support PTE aging" is a pretty obvious comment. I'm fine
> > just leaving it without comment, but I can add something like that. Or do you
> > have another suggestion?
> What about "External TDP doesn't support clearing PTE A/D bit"?

It sounds too close to "TDX doesn't support..." to me. I think I'd prefer to not
add a comment unless you strongly object.


> I'm ok with leaving without comment if you think it's too obvious.
> 
> > > And do we need a similar WARN_ON_ONCE() in kvm_tdp_mmu_clear_dirty_pt_masked()
> > > or clear_dirty_pt_masked()?
> > 
> > Nothing changes for those in this patch though? For the kvm_tdp_mmu_age_spte()
> > case, warning coverage is removed in this patch.
> 
> In kvm_tdp_mmu_age_spte() and clear_dirty_pt_masked() SPTEs are updated via
> __tdp_mmu_set_spte_atomic() or tdp_mmu_clear_spte_bits(), which don't propagate
> changes to the external page table.
> So, I think it's better to warn if a mirror root is accidentally passed in to
> those functions.
> 
> 
>  

I don't think we should add warnings for new paths in this patch. Meaning, the
patch shouldn't increase coverage, just maintain it. But it seems reasonable in
general. How about we add it to the cleanup list?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
  2026-04-01 23:58     ` Edgecombe, Rick P
@ 2026-04-02 23:21       ` Sean Christopherson
  0 siblings, 0 replies; 61+ messages in thread
From: Sean Christopherson @ 2026-04-02 23:21 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, kas@kernel.org,
	Kai Huang, Yan Y Zhao, Dave Hansen, linux-kernel@vger.kernel.org,
	x86@kernel.org

On Wed, Apr 01, 2026, Rick P Edgecombe wrote:
> On Tue, 2026-03-31 at 10:09 +0000, Huang, Kai wrote:
> > > @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm,
> > > struct tdp_iter *iter)
> > >   {
> > >   	u64 new_spte;
> > >   
> > > +	if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> > > +		return;
> > > +
> > >   	if (spte_ad_enabled(iter->old_spte)) {
> > >   		iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter-
> > > >sptep,
> > >   								shadow_acce
> > > ssed_mask);
> > 
> > AFAICT this chunk isn't related to "Centralize updates to present external
> > PTEs"?  Should it be in a separate patch?
> 
> It was because we lose warning coverage in this patch for the aging case. Though
> the diff was from Sean. But my take was this just maintains coverage that
> already existed.

Yep, exactly.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
  2026-04-02 23:10         ` Edgecombe, Rick P
@ 2026-04-02 23:28           ` Sean Christopherson
  2026-04-03  9:05             ` Yan Zhao
  2026-04-03  9:08           ` Yan Zhao
  1 sibling, 1 reply; 61+ messages in thread
From: Sean Christopherson @ 2026-04-02 23:28 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Yan Y Zhao, Dave Hansen, x86@kernel.org, Kai Huang,
	kas@kernel.org, linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	kvm@vger.kernel.org

On Thu, Apr 02, 2026, Rick P Edgecombe wrote:
> On Thu, 2026-04-02 at 09:59 +0800, Yan Zhao wrote:
> > On Thu, Apr 02, 2026 at 07:45:54AM +0800, Edgecombe, Rick P wrote:
> > > On Mon, 2026-03-30 at 14:14 +0800, Yan Zhao wrote:
> > > > > +	KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> > > > > +
> > > > > +	/*
> > > > > +	 * Temporarily freeze the SPTE until the external PTE operation has
> > > > > +	 * completed (unless the new SPTE itself will be frozen), e.g. so that
> > > > > +	 * concurrent faults don't attempt to install a child PTE in the
> > > > > +	 * external page table before the parent PTE has been written, or try
> > > > > +	 * to re-install a page table before the old one was removed.
> > > > > +	 */
> > > > > +	if (is_mirror_sptep(iter->sptep))
> > > > > +		ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
> > > > > +	else
> > > > > +		ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> > > > >   	if (ret)
> > > > >   		return ret;
> > > > >   
> > > > > -	handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> > > > > -			    new_spte, iter->level, true);
> > > > > +	ret = __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> > > > > +				    new_spte, iter->level, true);
> > > > 
> > > > What about adding a comment for the tricky part for the mirror page table:
> > > > while new_spte is set to FROZEN_SPTE in the above __tdp_mmu_set_spte_atomic()
> > > 
> > > You meant it sets iter->sptep I think.
> > > 
> > > > for freezing the mirror page table, the original new_spte from the caller of
> > > > tdp_mmu_set_spte_atomic() is passed to __handle_changed_spte() in order to
> > > > properly update statistics and propagate to the external page table.
> > > 
> > > new_spte was already passed in. What changed? You mean that
> > > __tdp_mmu_set_spte_atomic() sets iter->sptep and doesn't update new_spte? If so,
> > > I'm not sure how that's tricky for the TDP MMU.
> > 
> > For mirror page table, a successful path in tdp_mmu_set_spte_atomic() looks
> > like this:
> > tdp_mmu_set_spte_atomic() {
> >   __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE); ==>sets mirror to frozen
> >   __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> > 			new_spte, iter->level, true);==>sets S-EPT to new_spte
> >   __kvm_tdp_mmu_write_spte(iter->sptep, new_spte);   ==>sets mirror to new_spte
> > }
> > 
> > So, the tricky part is that S-EPT is updated to new_spte ahead of mirror EPT.
> 
> I still don't see the point. That ordering is not new, and this patch actually
> adds a bunch of comments around the operations above and below the
> __handle_changed_spte() call. If you think something is still missing maybe you
> can suggest something.

Ya, I'm a bit confused too.  For me, the "tricky" part is understanding the need
to set the mirror SPTE to FROZEN_SPTE while updating the external SPTE.  Once that
is understood, I don't find passing in @new_spte to be surprising in any way.

> > > > > @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm,
> > > > > struct tdp_iter *iter)
> > > > >   {
> > > > >   	u64 new_spte;
> > > > >   
> > > > > +	if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> > > > > +		return;
> > > > > +
> > > > Add a comment for why mirror page table is not expected here?
> > > 
> > > Ehh, maybe. Thinking about what to put... The warning is kind of cheating a
> > > little bit on the idea of the patch: to forward all changes through limited ops
> > > in a central place, such that we don't have TDX specifics encoded in core MMU.
> > > Trying to forward this through properly would result in more burden to the TDP
> > > MMU, so that's not the right answer either.
> > > 
> > > "Mirror TDP doesn't support PTE aging" is a pretty obvious comment. I'm fine
> > > just leaving it without comment, but I can add something like that. Or do you
> > > have another suggestion?
> > What about "External TDP doesn't support clearing PTE A/D bit"?
> 
> It sounds too close to "TDX doesn't support..." to me. I think I'd prefer to not
> add a comment unless you strongly object.

How about something like this?

	/* TODO: Add support for aging external SPTEs, if necessary. */

That makes it clear that this path is supposed to be unreachable because KVM doesn't
yet support aging external SPTEs, while not trying to say anything about *why*
KVM doesn't support aging external SPTEs.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
  2026-04-01  8:34   ` Yan Zhao
@ 2026-04-02 23:46     ` Edgecombe, Rick P
  2026-04-03 10:33       ` Yan Zhao
  0 siblings, 1 reply; 61+ messages in thread
From: Edgecombe, Rick P @ 2026-04-02 23:46 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: Hansen, Dave, seanjc@google.com, Huang, Kai, kas@kernel.org,
	x86@kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	pbonzini@redhat.com

On Wed, 2026-04-01 at 16:34 +0800, Yan Zhao wrote:
> Issue #2:
> "Additionally, if an SPTE is atomically zapped, is_present evaluates to false.
>  Because tdp_mmu_set_spte_atomic() lacks a fallback remove_external_spte()
>  call like the non-atomic tdp_mmu_set_spte() has, does this mean the external
>  S-EPT mapping remains fully present while KVM marks it as non-present?
> "
> 
> Response:
> Up to this patch, this seems to be a valid concern. But atomic zap on the mirror
> root isn't allowed yet (though warning on such case is dropped in patch 5 [2]).
> Centralizing atomic zap to __handle_changed_spte() will be done in patch 15 [3].
> Maybe move patch 5 to after patch 15?
> 
> [2] https://lore.kernel.org/kvm/20260327201421.2824383-6-rick.p.edgecombe@intel.com/
> [3] https://lore.kernel.org/kvm/20260327201421.2824383-16-rick.p.edgecombe@intel.com/
> [4] https://lore.kernel.org/all/20260106102024.25023-1-yan.y.zhao@intel.com/

Thanks for checking the AI reviews. I'll see how it slots in there.

> 
> Thinking more about centralizing TDX hooks, could we be more aggressive? i.e.,
> let TDX just have a single hook set_external_spte() for propagation of changes
> from mirror page table to S-EPT?
> (below change is on code base with TDX huge page support).

I was asking Yan internally why this works but Sean's earlier attempt failed.
Yan, let's finish the discussion externally now that Sean is poking around.

I'd be inclined to just call the cleanup a win and leave further unification
for the future. At least not go turning over rocks.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 06/17] KVM: x86/tdp_mmu: Morph the !is_frozen_spte() check into a KVM_MMU_WARN_ON()
  2026-04-02 19:21         ` Sean Christopherson
@ 2026-04-03  2:47           ` Yan Zhao
  0 siblings, 0 replies; 61+ messages in thread
From: Yan Zhao @ 2026-04-03  2:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Rick P Edgecombe, Dave Hansen, Kai Huang, kas@kernel.org,
	x86@kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	pbonzini@redhat.com

On Thu, Apr 02, 2026 at 12:21:13PM -0700, Sean Christopherson wrote:
> On Thu, Apr 02, 2026, Yan Zhao wrote:
> > On Wed, Apr 01, 2026 at 12:37:51AM +0800, Edgecombe, Rick P wrote:
> > > On Mon, 2026-03-30 at 13:00 +0800, Yan Zhao wrote:
> > > 
> > > Yep on the typos. 
> > > 
> > > > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > > > @@ -656,7 +656,13 @@ static inline int __must_check
> > > > > __tdp_mmu_set_spte_atomic(struct kvm *kvm,
> > > > >   	 */
> > > > >   	WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter-
> > > > > >old_spte));
> > > > >   
> > > > > -	if (is_mirror_sptep(iter->sptep) &&
> > > > > !is_frozen_spte(new_spte)) {
> > > > > +	/*
> > > > > +	 * FROZEN_SPTE is a temporary state and should never be
> > > > > set via higher
> > > > > +	 * level helpers.
> > > > > +	 */
> > > > > +	KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> > > > Why is KVM_MMU_WARN_ON() used here for new_spte while WARN_ON_ONCE()
> > > > is used
> > > > above for old_spte?
> > > 
> > > For the KVM_MMU_WARN_ON() it was Sean's suggestion.
> > > 
> > > https://lore.kernel.org/lkml/aYYn0nf2cayYu8e7@google.com/
> > > 
> > > It allows for compiling it out, so probably a better choice. So I see
> > > the options are leave them different or opportunistically convert the
> > > other one to KVM_MMU_WARN_ON(). Thoughts?
> > I see there are mixed WARN_ON_ONCE() and KVM_MMU_WARN_ON() calls in mmu.c and
> > tdp_mmu.c. I'm not sure if there's a rule for which one to use.
> 
> KVM_MMU_WARN_ON() is intended to be used only when having a WARN_ON_ONCE() enabled
> in production environments isn't worth the cost of the code.  E.g. if the code in
> question is a low level helper that used "everywhere" and in super hot paths, or
> in extremely paranoid sanity checks where the platform is beyond hosed if the
> sanity check fails.
> 
> The other thing with KVM_MMU_WARN_ON() is that it doesn't evaluate the expression
> when CONFIG_KVM_PROVE_MMU=n, i.e. can't be used in if-statements (the only case
> where it's used in an if-statement is wrapped with a CONFIG_KVM_PROVE_MMU #ifdef).
Ah, I missed this one.

> For infrequently used and/or slow paths, and for cases where KVM needs to take
> action no matter what, WARN_ON_ONCE() is preferred because if something goes
> sideways, we'll get a helpful splat even in production.
Got it. Thanks for the detailed explanation!

> > Is it necessary to evaluate them all and have a separate patch to convert
> > WARN_ON_ONCE() to KVM_MMU_WARN_ON()?
> 
> Nope.  We could probably convert a few select ones if there's good reason to do
> so, but we definitely don't want to do a bulk conversion.
Makes sense.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
  2026-04-02 23:28           ` Sean Christopherson
@ 2026-04-03  9:05             ` Yan Zhao
  2026-04-04  0:15               ` Edgecombe, Rick P
  0 siblings, 1 reply; 61+ messages in thread
From: Yan Zhao @ 2026-04-03  9:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Rick P Edgecombe, Dave Hansen, x86@kernel.org, Kai Huang,
	kas@kernel.org, linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	kvm@vger.kernel.org

On Thu, Apr 02, 2026 at 04:28:33PM -0700, Sean Christopherson wrote:
> On Thu, Apr 02, 2026, Rick P Edgecombe wrote:
> > On Thu, 2026-04-02 at 09:59 +0800, Yan Zhao wrote:
> > > On Thu, Apr 02, 2026 at 07:45:54AM +0800, Edgecombe, Rick P wrote:
> > > > On Mon, 2026-03-30 at 14:14 +0800, Yan Zhao wrote:
> > > > > > +	KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Temporarily freeze the SPTE until the external PTE operation has
> > > > > > +	 * completed (unless the new SPTE itself will be frozen), e.g. so that
> > > > > > +	 * concurrent faults don't attempt to install a child PTE in the
> > > > > > +	 * external page table before the parent PTE has been written, or try
> > > > > > +	 * to re-install a page table before the old one was removed.
> > > > > > +	 */
> > > > > > +	if (is_mirror_sptep(iter->sptep))
> > > > > > +		ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
> > > > > > +	else
> > > > > > +		ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> > > > > >   	if (ret)
> > > > > >   		return ret;
> > > > > >   
> > > > > > -	handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> > > > > > -			    new_spte, iter->level, true);
> > > > > > +	ret = __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> > > > > > +				    new_spte, iter->level, true);
> > > > > 
> > > > > What about adding a comment for the tricky part for the mirror page table:
> > > > > while new_spte is set to FROZEN_SPTE in the above __tdp_mmu_set_spte_atomic()
> > > > 
> > > > You meant it sets iter->sptep I think.
> > > > 
> > > > > for freezing the mirror page table, the original new_spte from the caller of
> > > > > tdp_mmu_set_spte_atomic() is passed to __handle_changed_spte() in order to
> > > > > properly update statistics and propagate to the external page table.
> > > > 
> > > > new_spte was already passed in. What changed? You mean that
> > > > __tdp_mmu_set_spte_atomic() sets iter->sptep and doesn't update new_spte? If so,
> > > > I'm not sure how that's tricky for the TDP MMU.
> > > 
> > > For mirror page table, a successful path in tdp_mmu_set_spte_atomic() looks
> > > like this:
> > > tdp_mmu_set_spte_atomic() {
> > >   __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE); ==>sets mirror to frozen
> > >   __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
> > > 			new_spte, iter->level, true);==>sets S-EPT to new_spte
> > >   __kvm_tdp_mmu_write_spte(iter->sptep, new_spte);   ==>sets mirror to new_spte
> > > }
> > > 
> > > So, the tricky part is that S-EPT is updated to new_spte ahead of mirror EPT.
> > 
> > I still don't see the point. That ordering is not new, and this patch actually
> > adds a bunch of comments around the operations above and below the
> > __handle_changed_spte() call. If you think something is still missing maybe you
> > can suggest something.
Hmm, sorry for the confusion. I didn't express it clearly.

The ordering inside tdp_mmu_set_spte_atomic() for mirror root is:

Before this patch,
    1. set mirror SPTE to frozen
    2. invoke TDX op to update external PTE
    3. set mirror SPTE to new_spte or restore old_spte
    4. if 2 succeeds, invoke handle_changed_spte() to propagate changes to
       child mirror SPTEs and child external PTEs

After this patch,
   1. set mirror SPTE to frozen
   2. invoke __handle_changed_spte(), which propagates changes to
      (1) child mirror SPTEs and child external PTEs
      (2) external PTE
   3. set mirror SPTE to new_spte or restore old_spte 

So, the step to propagate changes to child mirror SPTEs and child external PTEs
now occurs before the steps to update the external PTE and the mirror SPTE.

> Ya, I'm a bit confused too.  For me, the "tricky" part is understanding the need
> to set the mirror SPTE to FROZE_SPTE while updating the external SPTE.  Once that
> is understood, I don't find passing in @new_spte to be surprising in any way.
I still find it tricky because it seems strange to me to invoke a function named
handle_changed_spte() before the change actually occurs on the SPTE (i.e., to me,
the SPTE has only changed from xxx to FROZEN_SPTE, but handle_changed_spte()
handles changes from xxx to new_spte).

Besides, another tricky point (currently benign to TDX) is that:
before this patch, tdp_mmu_set_spte_atomic() cannot be used to atomically zap
non-leaf mirror SPTEs, since TDX requires child PTEs to be zapped before the
parent PTE;
after this patch, performing atomic zapping of non-leaf mirror SPTEs seems to be
allowed in TDP MMU since the above step 2.1 now occurs before step 2.2. However,
if step 2.2 fails after step 2.1 succeeds, step 3 cannot easily restore the real
old state.
So, if we allow atomic zap on the mirror root in the future, it looks like we
need to ensure atomic zapping of S-EPT cannot fail.

> > > > > > @@ -1373,6 +1396,9 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm, struct tdp_iter *iter)
> > > > > >   {
> > > > > >   	u64 new_spte;
> > > > > >   
> > > > > > +	if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
> > > > > > +		return;
> > > > > > +
> > > > > Add a comment for why mirror page table is not expected here?
> > > > 
> > > > Ehh, maybe. Thinking about what to put... The warning is kind of cheating a
> > > > little bit on the idea of the patch: to forward all changes through limited ops
> > > > in a central place, such that we don't have TDX specifics encoded in core MMU.
> > > > Trying to forward this through properly would result in more burden to the TDP
> > > > MMU, so that's not the right answer either.
> > > > 
> > > > "Mirror TDP doesn't support PTE aging" is a pretty obvious comment. I'm fine
> > > > just leaving it without comment, but I can add something like that. Or do you
> > > > have another suggestion?
> > > What about "External TDP doesn't support clearing PTE A/D bit"?
> > 
> > It sounds too close to "TDX doesn't support..." to me. I think I'd prefer to not
> > add a comment unless you strongly object.
> 
> How about something like this?
> 
> 	/* TODO: Add support for aging external SPTEs, if necessary. */
> 
> That makes it clear that this path is supposed to be unreachable because KVM doesn't
> yet support aging external SPTEs, while not trying to say anything about *why*
> KVM doesn't support aging external SPTEs.
LGTM :)

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
  2026-04-02 23:10         ` Edgecombe, Rick P
  2026-04-02 23:28           ` Sean Christopherson
@ 2026-04-03  9:08           ` Yan Zhao
  1 sibling, 0 replies; 61+ messages in thread
From: Yan Zhao @ 2026-04-03  9:08 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Hansen, Dave, seanjc@google.com, x86@kernel.org, Huang, Kai,
	kas@kernel.org, linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	kvm@vger.kernel.org

On Fri, Apr 03, 2026 at 07:10:18AM +0800, Edgecombe, Rick P wrote:
> I don't think we should add warnings for new paths in this patch. Meaning, the
> patch shouldn't increase coverage, just maintain. But it seems reasonable in
> general. How about we add it to the cleanup list.
Adding it to the cleanup list looks good to me.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
  2026-04-02 23:46     ` Edgecombe, Rick P
@ 2026-04-03 10:33       ` Yan Zhao
  0 siblings, 0 replies; 61+ messages in thread
From: Yan Zhao @ 2026-04-03 10:33 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Hansen, Dave, seanjc@google.com, Huang, Kai, kas@kernel.org,
	x86@kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	pbonzini@redhat.com

On Fri, Apr 03, 2026 at 07:46:21AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2026-04-01 at 16:34 +0800, Yan Zhao wrote:
> > Thinking more about centralizing TDX hooks, could we be more aggressive? i.e.,
> > let TDX just have a single hook set_external_spte() for propagation of changes
> > from mirror page table to S-EPT?
> > (below change is on code base with TDX huge page support).
> 
> I was asking Yan internally why this works but Sean's earlier attempt failed.
> Yan, let's finish the discussion externally now that Sean is poking around.
Hmm, I guess Sean provided the reclaim_external_spt() op (now named
free_external_spt() in this series) because the old_spte and new_spte
required by the set_external_spte() op are not available in handle_removed_pt(),
and also because there's a "call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback)"
in handle_removed_pt().

So, if I'm not missing something, we may have 2 options for further unification:
1. pass in required old_parent_spte and new_parent_spte to handle_removed_pt(),
   and invoke op set_external_spte() (instead of reclaim_external_spt()) in
   handle_removed_pt().
2. as I proposed in this thread (see the key changes below), assert that the
   RCU read lock is always held during __handle_changed_spte() (which is a
   reasonable assumption in the TDP MMU) and invoke op set_external_spte() for
   reclaiming the external pt as well.
   
   Though invoking call_rcu() occurs before invoking op set_external_spte(),
   tdp_mmu_free_sp_rcu_callback() should only occur after invoking
   set_external_spte() due to __handle_changed_spte() holding RCU read lock.
   
   However, I agree it's odd to have call_rcu() invoked before reclaiming
   external pt :)

--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -461,9 +461,6 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 		handle_changed_spte(kvm, sp, gfn, old_spte, FROZEN_SPTE, level, shared);
 	}

-	if (is_mirror_sp(sp))
-		kvm_x86_call(reclaim_external_spt)(kvm, base_gfn, sp);
-
 	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
 }

@@ -563,9 +560,17 @@ static int __handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
 	 * changes to the external SPTE.
 	 */
 	if (was_present && !was_leaf &&
-	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
+	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
-	} else if (is_mirror_sp(sp)) {
+
+	if (is_mirror_sp(sp)) {
+		/*
+		 * Can also propagate changes to remove external pt. Since this
+		 * occurs after the call_rcu() in handle_removed_pt(), the RCU
+		 * read lock must be held.
+		 */
+		RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "no rcu read lock held");
+
 		r = kvm_x86_call(set_external_spte)(kvm, gfn, old_spte,
 						    new_spte, level);

   

> I'd be inclined to just call the cleanup a win and leave further unification
> for the future. At least not go turning over rocks.
I'm ok with leaving it to future refactoring.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs
  2026-04-03  9:05             ` Yan Zhao
@ 2026-04-04  0:15               ` Edgecombe, Rick P
  0 siblings, 0 replies; 61+ messages in thread
From: Edgecombe, Rick P @ 2026-04-04  0:15 UTC (permalink / raw)
  To: seanjc@google.com, Zhao, Yan Y
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Hansen, Dave,
	kas@kernel.org, Huang, Kai, x86@kernel.org,
	linux-kernel@vger.kernel.org

On Fri, 2026-04-03 at 17:05 +0800, Yan Zhao wrote:
> Hmm, sorry for the confusion. I didn't express it clearly.
> 
> The ordering inside tdp_mmu_set_spte_atomic() for mirror root is:
> 
> Before this patch,
>     1. set mirror SPTE to frozen
>     2. invoke TDX op to update external PTE
>     3. set mirror SPTE to new_spte or restore old_spte
>     4. if 2 succeeds, invoke handle_changed_spte() to propagate changes to
>        child mirror SPTEs and child external PTEs
> 
> After this patch,
>    1. set mirror SPTE to frozen
>    2. invoke __handle_changed_spte(), which propagates changes to
>       (1) child mirror SPTEs and child external PTEs
>       (2) external PTE
>    3. set mirror SPTE to new_spte or restore old_spte 
> 
> So, the step to propagate changes to child mirror SPTEs and child external PTEs
> now occurs before the steps to update the external PTE and the mirror SPTE.

How about I add this info to the log. I think you are right, it's an important
change to call out.

Now it's making me think... Can you (if you haven't already) scrutinize for 
races/reasons that may trigger the KVM_BUG_ON() in handle_changed_spte() due to
BUSY or other? Like in the handle_removed_pt() path. I guess the write lock
saves us?

Hmm... zero step?

> 
> > Ya, I'm a bit confused too.  For me, the "tricky" part is understanding the need
> to set the mirror SPTE to FROZEN_SPTE while updating the external SPTE.  Once that
> > is understood, I don't find passing in @new_spte to be surprising in any way.
> I still find it tricky because it seems strange to me to invoke a function named
> handle_changed_spte() before the change actually occurs on the SPTE (i.e.,to me,
> the SPTE has only changed from xxx to FROZEN_SPTE, but handle_changed_spte()
> handles changes from xxx to new_spte).
> 
> Besides, another tricky point (currently benign to TDX) is that:
> before this patch, tdp_mmu_set_spte_atomic() cannot be used to atomically zap
> non-leaf mirror SPTEs, since TDX requires child PTEs to be zapped before the
> parent PTE;
> after this patch, performing atomic zapping of non-leaf mirror SPTEs seems to be
> allowed in TDP MMU since the above step 2.1 now occurs before step 2.2. However,
> if step 2.2 fails after step 2.1 succeeds, step 3 cannot easily restore the real
> old state.
> So, if we allow atomic zap on the mirror root in the future, it looks like we
> need to ensure atomic zapping of S-EPT cannot fail.

I don't think we should comment on these kinds of TDX specifics. Leaving them
out won't confuse non-TDX developers working in the TDP MMU.


^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2026-04-04  0:15 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-27 20:14 [PATCH 00/17] TDX MMU refactors Rick Edgecombe
2026-03-27 20:14 ` [PATCH 01/17] x86/tdx: Use pg_level in TDX APIs, not the TDX-Module's 0-based level Rick Edgecombe
2026-03-27 20:14 ` [PATCH 02/17] KVM: x86/mmu: Update iter->old_spte if cmpxchg64 on mirror SPTE "fails" Rick Edgecombe
2026-03-31  9:47   ` Huang, Kai
2026-03-31  9:17     ` Yan Zhao
2026-03-31  9:59       ` Huang, Kai
2026-03-31  9:22         ` Yan Zhao
2026-03-31 10:14           ` Huang, Kai
2026-03-27 20:14 ` [PATCH 03/17] KVM: TDX: Account all non-transient page allocations for per-TD structures Rick Edgecombe
2026-03-27 20:14 ` [PATCH 04/17] KVM: x86: Make "external SPTE" ops that can fail RET0 static calls Rick Edgecombe
2026-03-27 20:14 ` [PATCH 05/17] KVM: x86/tdp_mmu: Drop zapping KVM_BUG_ON() set_external_spte_present() Rick Edgecombe
2026-03-27 20:14 ` [PATCH 06/17] KVM: x86/tdp_mmu: Morph the !is_frozen_spte() check into a KVM_MMU_WARN_ON() Rick Edgecombe
2026-03-30  5:00   ` Yan Zhao
2026-03-31 16:37     ` Edgecombe, Rick P
2026-04-02  1:06       ` Yan Zhao
2026-04-02 19:21         ` Sean Christopherson
2026-04-03  2:47           ` Yan Zhao
2026-03-27 20:14 ` [PATCH 07/17] KVM: x86/tdp_mmu: Centralize updates to present external PTEs Rick Edgecombe
2026-03-30  6:14   ` Yan Zhao
2026-04-01 23:45     ` Edgecombe, Rick P
2026-04-02  1:59       ` Yan Zhao
2026-04-02 23:10         ` Edgecombe, Rick P
2026-04-02 23:28           ` Sean Christopherson
2026-04-03  9:05             ` Yan Zhao
2026-04-04  0:15               ` Edgecombe, Rick P
2026-04-03  9:08           ` Yan Zhao
2026-03-31 10:09   ` Huang, Kai
2026-04-01 23:58     ` Edgecombe, Rick P
2026-04-02 23:21       ` Sean Christopherson
2026-04-01  8:34   ` Yan Zhao
2026-04-02 23:46     ` Edgecombe, Rick P
2026-04-03 10:33       ` Yan Zhao
2026-03-27 20:14 ` [PATCH 08/17] KVM: TDX: Drop kvm_x86_ops.link_external_spt(), use .set_external_spte() for all Rick Edgecombe
2026-03-30  6:28   ` Yan Zhao
2026-03-27 20:14 ` [PATCH 09/17] KVM: TDX: Add helper to handle mapping leaf SPTE into S-EPT Rick Edgecombe
2026-03-30  6:43   ` Yan Zhao
2026-04-01 23:59     ` Edgecombe, Rick P
2026-03-27 20:14 ` [PATCH 10/17] KVM: TDX: Move set_external_spte_present() assert into TDX code Rick Edgecombe
2026-03-31 10:30   ` Huang, Kai
2026-04-02  0:00     ` Edgecombe, Rick P
2026-03-31 10:34   ` Huang, Kai
2026-03-27 20:14 ` [PATCH 11/17] KVM: x86/mmu: Fold set_external_spte_present() into its sole caller Rick Edgecombe
2026-03-31 10:36   ` Huang, Kai
2026-04-01  7:41   ` Yan Zhao
2026-03-27 20:14 ` [PATCH 12/17] KVM: x86/mmu: Plumb the old_spte into kvm_x86_ops.set_external_spte() Rick Edgecombe
2026-03-27 20:14 ` [PATCH 13/17] KVM: TDX: Hoist tdx_sept_remove_private_spte() above set_private_spte() Rick Edgecombe
2026-03-31 10:42   ` Huang, Kai
2026-04-02  0:04     ` Edgecombe, Rick P
2026-03-27 20:14 ` [PATCH 14/17] KVM: x86/mmu: Remove KVM_BUG_ON() that checks lock when removing PTs Rick Edgecombe
2026-03-30  7:01   ` Yan Zhao
2026-03-31 10:46     ` Huang, Kai
2026-04-02  0:08       ` Edgecombe, Rick P
2026-04-02  2:04         ` Yan Zhao
2026-03-27 20:14 ` [PATCH 15/17] KVM: TDX: Handle removal of leaf SPTEs in .set_private_spte() Rick Edgecombe
2026-03-27 20:14 ` [PATCH 16/17] KVM: x86: Move error handling inside free_external_spt() Rick Edgecombe
2026-03-27 20:14 ` [PATCH 17/17] KVM: TDX: Move external page table freeing to TDX code Rick Edgecombe
2026-03-30  7:49   ` Yan Zhao
2026-04-02  0:17     ` Edgecombe, Rick P
2026-04-02  2:16       ` Yan Zhao
2026-04-02  2:17         ` Yan Zhao
2026-03-31 11:02   ` Huang, Kai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox