Kernel KVM virtualization development
From: Yan Zhao <yan.y.zhao@intel.com>
To: seanjc@google.com, pbonzini@redhat.com, kvm@vger.kernel.org,
	rick.p.edgecombe@intel.com, kas@kernel.org
Cc: linux-kernel@vger.kernel.org, x86@kernel.org,
	dave.hansen@intel.com, kai.huang@intel.com,
	binbin.wu@linux.intel.com, xiaoyao.li@intel.com,
	yan.y.zhao@intel.com
Subject: [PATCH v2 09/15] KVM: x86/tdp_mmu: Centrally propagate to-present/atomic zap updates to external PTEs
Date: Sat,  9 May 2026 15:56:34 +0800	[thread overview]
Message-ID: <20260509075634.4274-1-yan.y.zhao@intel.com> (raw)
In-Reply-To: <20260509075201.4077-1-yan.y.zhao@intel.com>

From: Sean Christopherson <seanjc@google.com>

Move propagation of to-present changes and atomic zap changes to external
PTEs from function __tdp_mmu_set_spte_atomic() to function
__handle_changed_spte(), which centrally handles changes of SPTEs.

When setting a PTE to present in the mirror page tables, the update needs
to be propagated to the external page tables (in TDX parlance the S-EPT).
Today this is handled by special mirror page tables branching in
__tdp_mmu_set_spte_atomic(), which is the only place where present PTEs are
set for TDX.

This keeps things running, but is a bit hacked on. The hook for setting
present leaf PTEs is added only where TDX happens to need it. For example,
TDX does not support any of the operations that use the non-atomic variant,
tdp_mmu_set_spte(), to set present PTEs. Since the hook is missing there, the
code is hard to follow from a non-TDX lens. If the reader
doesn't know the TDX specifics it could look like the external update is
missing.

In addition to being confusing, it also litters the TDP MMU with "external"
update callbacks. This is especially unfortunate because there is already a
central place to react to TDP updates, handle_changed_spte().

Begin the process of moving towards a model where all mirror page table
updates are forwarded to TDX code where the TDX-specific logic can live
with a more proper separation of concerns. Do this by adding a helper
__handle_changed_spte() and teaching it how to return error codes, such
that it can propagate the failures that may come from TDX external page
table updates. Make the original handle_changed_spte() a no-fail wrapper of
__handle_changed_spte() for changes that must not fail, i.e. those made
under exclusive mmu_lock or in the no-fail path handle_removed_pt(), and
have it trigger KVM_BUG_ON() on error returns.

Instead of having __tdp_mmu_set_spte_atomic() do the frozen mirror SPTE
dance and trigger propagation to external PTEs, make
__tdp_mmu_set_spte_atomic() a simple helper of try_cmpxchg64() and hoist
the frozen mirror SPTE dance up a level to tdp_mmu_set_spte_atomic(). Then,
the propagation of to-present changes to the external PTEs can be
centralized in __handle_changed_spte(). Aging external SPTEs is not yet
supported for the mirror page table, so just warn on mirror usage in
kvm_tdp_mmu_age_spte() and invoke __tdp_mmu_set_spte_atomic() directly,
without the frozen-SPTE dance. There is no need to warn on installing
FROZEN_SPTE as a long-term value in kvm_tdp_mmu_age_spte(), since clearing
the accessed bit is mutually exclusive with installing FROZEN_SPTE
(FROZEN_SPTE has the accessed bit set on all x86 platforms).

Since tdp_mmu_set_spte_atomic() can also be invoked to atomically zap SPTEs
(though no path currently triggers an atomic zap on the mirror page table),
also leverage the set_external_spte() op to propagate the atomic zaps when
tdp_mmu_set_spte_atomic() zaps leaf SPTEs directly. (When
tdp_mmu_set_spte_atomic() zaps a non-leaf SPTE, zaps of the child leaf
SPTEs are propagated via the remove_external_spte() op.)

Note: tdp_mmu_set_spte_atomic() invokes __handle_changed_spte() to handle
changes to new_spte while the mirror SPTE is frozen, so
(1) the update of the external PTEs and statistics, or
(2) the update of child mirror SPTEs, child external PTEs and corresponding
    statistics,
now occur before the mirror SPTE is actually set to new_spte.
(1) is OK since, if it fails, the mirror SPTE will be restored to its
original value. (2) is also OK since handle_removed_pt() is no-fail.

Link: https://lore.kernel.org/lkml/aYYn0nf2cayYu8e7@google.com/
Not-yet-Signed-off-by: Sean Christopherson <seanjc@google.com>
[Rick: Based on a diff by Sean Christopherson]
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
[Yan: added atomic zap case]
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
MMU_refactors v2:
- Updated comments and patch log (Yan, Binbin).
- Split the "KVM: x86/mmu: Plumb "sp" _pointer_ into the TDP MMU's
  handle_changed_spte()" out (which was in Sean's original series but
  had SPTE instead of "sp" in patch title). (Yan)
- Also invoke set_external_spte() op to propagate changes for atomic zap of
  leaf SPTEs. (Yan).
- Kept the "Not-yet-Signed-off-by" in v1
  https://lore.kernel.org/all/20260327201421.2824383-8-rick.p.edgecombe@intel.com
---
 arch/x86/kvm/mmu/tdp_mmu.c | 124 +++++++++++++++++++++++--------------
 1 file changed, 79 insertions(+), 45 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 05dc8bdc1ea5..ada4a0837298 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -495,7 +495,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 }
 
 /**
- * handle_changed_spte - handle bookkeeping associated with an SPTE change
+ * __handle_changed_spte - handle bookkeeping associated with an SPTE change
  * @kvm: kvm instance
  * @sp: the page table in which the SPTE resides
  * @gfn: the base GFN that was mapped by the SPTE
@@ -510,9 +510,9 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
  * dirty logging updates are handled in common code, not here (see make_spte()
  * and fast_pf_fix_direct_spte()).
  */
-static void handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
-				gfn_t gfn, u64 old_spte, u64 new_spte,
-				int level, bool shared)
+static int __handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
+				 gfn_t gfn, u64 old_spte, u64 new_spte,
+				 int level, bool shared)
 {
 	bool was_present = is_shadow_present_pte(old_spte);
 	bool is_present = is_shadow_present_pte(new_spte);
@@ -549,9 +549,7 @@ static void handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
 	}
 
 	if (old_spte == new_spte)
-		return;
-
-	trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
+		return 0;
 
 	if (is_leaf)
 		check_spte_writable_invariants(new_spte);
@@ -578,29 +576,49 @@ static void handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
 			       "a temporary frozen SPTE.\n"
 			       "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
 			       as_id, gfn, old_spte, new_spte, level);
-		return;
+		return 0;
 	}
 
-	if (is_leaf != was_leaf)
-		kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
-
 	/*
 	 * Recursively handle child PTs if the change removed a subtree from
 	 * the paging structure.  Note the WARN on the PFN changing without the
 	 * SPTE being converted to a hugepage (leaf) or being zapped.  Shadow
 	 * pages are kernel allocations and should never be migrated.
+	 *
+	 * For the mirror page table, propagate changes to present or changes of
+	 * leaf SPTEs to !present under shared mmu_lock to the external SPTE via
+	 * set_external_spte() op.
 	 */
 	if (was_present && !was_leaf &&
-	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
+	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
+	} else if (is_mirror_sp(sp) && (is_present || shared)) {
+		int r;
+
+		r = kvm_x86_call(set_external_spte)(kvm, gfn, old_spte, new_spte, level);
+		if (r)
+			return r;
+	}
+	trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
+
+	if (is_leaf != was_leaf)
+		kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
+
+	return 0;
+}
+
+static void handle_changed_spte(struct kvm *kvm, struct kvm_mmu_page *sp,
+				gfn_t gfn, u64 old_spte, u64 new_spte,
+				int level, bool shared)
+{
+	KVM_BUG_ON(__handle_changed_spte(kvm, sp, gfn, old_spte, new_spte,
+					 level, shared), kvm);
 }
 
 static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
 							 struct tdp_iter *iter,
 							 u64 new_spte)
 {
-	u64 *raw_sptep = rcu_dereference(iter->sptep);
-
 	/*
 	 * The caller is responsible for ensuring the old SPTE is not a FROZEN
 	 * SPTE.  KVM should never attempt to zap or manipulate a FROZEN SPTE,
@@ -609,31 +627,6 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
 	 */
 	WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
 
-	/* Should not set FROZEN_SPTE as a long-term value. */
-	KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
-
-	if (is_mirror_sptep(iter->sptep)) {
-		int ret;
-
-		/*
-		 * We need to lock out other updates to the SPTE until the external
-		 * page table has been modified. Use FROZEN_SPTE similar to
-		 * the zapping case.
-		 */
-		if (!try_cmpxchg64(raw_sptep, &iter->old_spte, FROZEN_SPTE))
-			return -EBUSY;
-
-		ret = kvm_x86_call(set_external_spte)(kvm, iter->gfn, iter->old_spte,
-						      new_spte, iter->level);
-
-		if (ret)
-			__kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte);
-		else
-			__kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
-
-		return ret;
-	}
-
 	/*
 	 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
 	 * does not hold the mmu_lock.  On failure, i.e. if a different logical
@@ -641,7 +634,7 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
 	 * the current value, so the caller operates on fresh data, e.g. if it
 	 * retries tdp_mmu_set_spte_atomic().
 	 */
-	if (!try_cmpxchg64(raw_sptep, &iter->old_spte, new_spte))
+	if (!try_cmpxchg64(rcu_dereference(iter->sptep), &iter->old_spte, new_spte))
 		return -EBUSY;
 
 	return 0;
@@ -673,14 +666,51 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
 
 	lockdep_assert_held_read(&kvm->mmu_lock);
 
-	ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
+	/* Should not set FROZEN_SPTE as a long-term value. */
+	KVM_MMU_WARN_ON(is_frozen_spte(new_spte));
+
+	/*
+	 * Temporarily freeze the SPTE until the external PTE operation has
+	 * completed, e.g. so that concurrent faults don't attempt to install a
+	 * child PTE in the external page table before the parent PTE has been
+	 * written.
+	 */
+	if (is_mirror_sptep(iter->sptep))
+		ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
+	else
+		ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
+
 	if (ret)
 		return ret;
 
-	handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
-			    new_spte, iter->level, true);
-
-	return 0;
+	/*
+	 * Handle the change from iter->old_spte to new_spte.
+	 *
+	 * Note: for mirror page table, this means the updates of the external
+	 * PTE, statistics, or updates of child SPTEs, child external PTEs and
+	 * corresponding statistics are performed while the mirror SPTE is in
+	 * frozen state (i.e., before the mirror SPTE is set to new_spte).
+	 */
+	ret = __handle_changed_spte(kvm, sp, iter->gfn, iter->old_spte,
+				    new_spte, iter->level, true);
+	/*
+	 * Unfreeze the mirror SPTE.  If updating the external SPTE failed,
+	 * restore the old value so that the mirror SPTE isn't frozen in
+	 * perpetuity, otherwise set the mirror SPTE to the new desired value.
+	 */
+	if (is_mirror_sptep(iter->sptep)) {
+		if (ret)
+			__kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte);
+		else
+			__kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
+	} else {
+		/*
+		 * Bug the VM if handling the change failed, as failure is only
+		 * allowed if KVM couldn't update the external SPTE.
+		 */
+		KVM_BUG_ON(ret, kvm);
+	}
+	return ret;
 }
 
 /*
@@ -1334,6 +1364,10 @@ static void kvm_tdp_mmu_age_spte(struct kvm *kvm, struct tdp_iter *iter)
 {
 	u64 new_spte;
 
+	/* TODO: Add support for aging external SPTEs, if necessary. */
+	if (WARN_ON_ONCE(is_mirror_sptep(iter->sptep)))
+		return;
+
 	if (spte_ad_enabled(iter->old_spte)) {
 		iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
 								shadow_accessed_mask);
-- 
2.43.2


Thread overview: 16+ messages
2026-05-09  7:52 [PATCH v2 00/15] TDX MMU refactors Yan Zhao
2026-05-09  7:53 ` [PATCH v2 01/15] KVM: TDX: Drop kvm_x86_ops.link_external_spt() Yan Zhao
2026-05-09  7:55 ` [PATCH v2 02/15] KVM: TDX: Wrap mapping of leaf and non-leaf S-EPT entries into helpers Yan Zhao
2026-05-09  7:55 ` [PATCH v2 03/15] KVM: x86/mmu: Fold set_external_spte_present() into its sole caller Yan Zhao
2026-05-09  7:55 ` [PATCH v2 04/15] KVM: x86/mmu: Plumb param "old_spte" into kvm_x86_ops.set_external_spte() Yan Zhao
2026-05-09  7:55 ` [PATCH v2 05/15] KVM: TDX: Move KVM_BUG_ON()s in __tdp_mmu_set_spte_atomic() to TDX code Yan Zhao
2026-05-09  7:55 ` [PATCH v2 06/15] KVM: TDX: Move lockdep assert " Yan Zhao
2026-05-09  7:56 ` [PATCH v2 07/15] KVM: x86/tdp_mmu: Morph !is_frozen_spte() check into a KVM_MMU_WARN_ON() Yan Zhao
2026-05-09  7:56 ` [PATCH v2 08/15] KVM: x86/mmu: Plumb "sp" _pointer_ into the TDP MMU's handle_changed_spte() Yan Zhao
2026-05-09  7:56 ` Yan Zhao [this message]
2026-05-09  7:56 ` [PATCH v2 10/15] KVM: x86/mmu: Drop KVM_BUG_ON() on shared lock to zap child external PTEs Yan Zhao
2026-05-09  7:56 ` [PATCH v2 11/15] KVM: TDX: Hoist tdx_sept_remove_private_spte() above set_private_spte() Yan Zhao
2026-05-09  7:57 ` [PATCH v2 12/15] KVM: TDX: Drop kvm_x86_ops.remove_external_spte() Yan Zhao
2026-05-09  7:57 ` [PATCH v2 13/15] KVM: TDX: Rename tdx_sept_remove_private_spte() to show it's for leaf SPTEs Yan Zhao
2026-05-09  7:57 ` [PATCH v2 14/15] KVM: x86: Move error handling inside free_external_spt() Yan Zhao
2026-05-09  7:57 ` [PATCH v2 15/15] KVM: TDX: Move external page table freeing to TDX code Yan Zhao