[RFC PATCH v2 00/18] KVM: x86/mmu: TDX post-populate cleanups

kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH v2 00/18] KVM: x86/mmu: TDX post-populate cleanups
@ 2025-08-29  0:06 Sean Christopherson
  2025-08-29  0:06 ` [RFC PATCH v2 01/18] KVM: TDX: Drop PROVE_MMU=y sanity check on to-be-populated mappings Sean Christopherson
                   ` (17 more replies)
  0 siblings, 18 replies; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29  0:06 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng

New (still largely untested) version of the TDX post-populate cleanup series
to address locking issues between gmem and TDX's post-populate hook[*], with
a pile of related cleanups throw in to (hopefully) simplify future development,
e.g. for hugepage and in-place conversion.

RFC as this is compile tested only again, and there are substantial differences
relative to v1.

P.S. I wasn't intending this to be 6.18 material (at all), but with the change
     in how TDH_MEM_PAGE_ADD is handled, I'm tempted to make a push to get this
     in sooner than later so that in-flight development can benefit.  Thoughts?

[*] http://lore.kernel.org/all/aG_pLUlHdYIZ2luh@google.com

v2:
 - Collect a few reviews (and ignore some because the patches went away).
   [Rick, Kai, Ira]
 - Move TDH_MEM_PAGE_ADD under mmu_lock and drop nr_premapped. [Yan, Rick]
 - Force max_level = PG_LEVEL_4K straightaway. [Yan]
 - s/kvm_tdp_prefault_page/kvm_tdp_page_prefault. [Rick]
 - Use Yan's version of "Say no to pinning!".  [Yan, Rick]
 - Tidy up helpers and macros to reduce boilerplate and copy+pate code, and
   to eliminate redundant/dead code (e.g. KVM_BUG_ON() the same error
   multiple times).
 - KVM_BUG_ON() if TDH_MR_EXTEND fails (I convinced myself it can't).

v1: https://lore.kernel.org/all/20250827000522.4022426-1-seanjc@google.com

Sean Christopherson (17):
  KVM: TDX: Drop PROVE_MMU=y sanity check on to-be-populated mappings
  KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU
  Revert "KVM: x86/tdp_mmu: Add a helper function to walk down the TDP
    MMU"
  KVM: x86/mmu: Rename kvm_tdp_map_page() to kvm_tdp_page_prefault()
  KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition
  KVM: TDX: Fold tdx_sept_drop_private_spte() into
    tdx_sept_remove_private_spte()
  KVM: x86/mmu: Drop the return code from
    kvm_x86_ops.remove_external_spte()
  KVM: TDX: Avoid a double-KVM_BUG_ON() in tdx_sept_zap_private_spte()
  KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent
  KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  KVM: TDX: Bug the VM if extended the initial measurement fails
  KVM: TDX: ADD pages to the TD image while populating mirror EPT
    entries
  KVM: TDX: Fold tdx_sept_zap_private_spte() into
    tdx_sept_remove_private_spte()
  KVM: TDX: Combine KVM_BUG_ON + pr_tdx_error() into TDX_BUG_ON()
  KVM: TDX: Derive error argument names from the local variable names
  KVM: TDX: Assert that mmu_lock is held for write when removing S-EPT
    entries
  KVM: TDX: Add macro to retry SEAMCALLs when forcing vCPUs out of guest

Yan Zhao (1):
  KVM: TDX: Drop superfluous page pinning in S-EPT management

 arch/x86/include/asm/kvm_host.h |   4 +-
 arch/x86/kvm/mmu.h              |   3 +-
 arch/x86/kvm/mmu/mmu.c          |  66 ++++-
 arch/x86/kvm/mmu/tdp_mmu.c      |  45 +---
 arch/x86/kvm/vmx/tdx.c          | 460 +++++++++++---------------------
 arch/x86/kvm/vmx/tdx.h          |   8 +-
 6 files changed, 234 insertions(+), 352 deletions(-)


base-commit: ecbcc2461839e848970468b44db32282e5059925
-- 
2.51.0.318.gd7df087d1a-goog


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 01/18] KVM: TDX: Drop PROVE_MMU=y sanity check on to-be-populated mappings
  2025-08-29  0:06 [RFC PATCH v2 00/18] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
@ 2025-08-29  0:06 ` Sean Christopherson
  2025-08-29  6:20   ` Binbin Wu
  2025-08-29  0:06 ` [RFC PATCH v2 02/18] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU Sean Christopherson
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29  0:06 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng

Drop TDX's sanity check that an S-EPT mapping isn't zapped between creating
said mapping and doing TDH.MEM.PAGE.ADD, as the check is simultaneously
superfluous and incomplete.  Per commit 2608f1057601 ("KVM: x86/tdp_mmu:
Add a helper function to walk down the TDP MMU"), the justification for
introducing kvm_tdp_mmu_gpa_is_mapped() was to check that the target gfn
was pre-populated, with a link that points to this snippet:

 : > One small question:
 : >
 : > What if the memory region passed to KVM_TDX_INIT_MEM_REGION hasn't been pre-
 : > populated?  If we want to make KVM_TDX_INIT_MEM_REGION work with these regions,
 : > then we still need to do the real map.  Or we can make KVM_TDX_INIT_MEM_REGION
 : > return error when it finds the region hasn't been pre-populated?
 :
 : Return an error.  I don't love the idea of bleeding so many TDX details into
 : userspace, but I'm pretty sure that ship sailed a long, long time ago.

But that justification makes little sense for the final code, as simply
doing TDH.MEM.PAGE.ADD without a paranoid sanity check will return an error
if the S-EPT mapping is invalid (as evidenced by the code being guarded
with CONFIG_KVM_PROVE_MMU=y).

The sanity check is also incomplete in the sense that mmu_lock is dropped
between the check and TDH.MEM.PAGE.ADD, i.e. will only detect KVM bugs that
zap SPTEs in a very specific window.

Removing the sanity check will allow removing kvm_tdp_mmu_gpa_is_mapped(),
which has no business being exposed to vendor code.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 6784aaaced87..71da245d160f 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3175,20 +3175,6 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	if (ret < 0)
 		goto out;

-	/*
-	 * The private mem cannot be zapped after kvm_tdp_map_page()
-	 * because all paths are covered by slots_lock and the
-	 * filemap invalidate lock.  Check that they are indeed enough.
-	 */
-	if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
-		scoped_guard(read_lock, &kvm->mmu_lock) {
-			if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa), kvm)) {
-				ret = -EIO;
-				goto out;
-			}
-		}
-	}
-
 	ret = 0;
 	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
 			       src_page, &entry, &level_state);
-- 
2.51.0.318.gd7df087d1a-goog

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 02/18] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU
  2025-08-29  0:06 [RFC PATCH v2 00/18] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
  2025-08-29  0:06 ` [RFC PATCH v2 01/18] KVM: TDX: Drop PROVE_MMU=y sanity check on to-be-populated mappings Sean Christopherson
@ 2025-08-29  0:06 ` Sean Christopherson
  2025-08-29 18:34   ` Edgecombe, Rick P
  2025-08-29  0:06 ` [RFC PATCH v2 03/18] Revert "KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU" Sean Christopherson
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29  0:06 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng

Add and use a new API for mapping a private pfn from guest_memfd into the
TDP MMU from TDX's post-populate hook instead of partially open-coding the
functionality into the TDX code.  Sharing code with the pre-fault path
sounded good on paper, but it's fatally flawed as simulating a fault loses
the pfn, and calling back into gmem to re-retrieve the pfn creates locking
problems, e.g. kvm_gmem_populate() already holds the gmem invalidation
lock.

Providing a dedicated API will also removing several MMU exports that
ideally would not be exposed outside of the MMU, let alone to vendor code.
On that topic, opportunistically drop the kvm_mmu_load() export.  Leave
kvm_tdp_mmu_gpa_is_mapped() alone for now; the entire commit that added
kvm_tdp_mmu_gpa_is_mapped() will be removed in the near future.

Cc: Michael Roth <michael.roth@amd.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/all/20250709232103.zwmufocd3l7sqk7y@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu.h     |  1 +
 arch/x86/kvm/mmu/mmu.c | 60 +++++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/tdx.c | 10 +++----
 3 files changed, 63 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b4b6860ab971..697b90a97f43 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -259,6 +259,7 @@ extern bool tdp_mmu_enabled;
 
 bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
 int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
+int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn);
 
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 92ff15969a36..65300e43d6a1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4994,6 +4994,65 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 	return min(range->size, end - range->gpa);
 }
 
+int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
+{
+	struct kvm_page_fault fault = {
+		.addr = gfn_to_gpa(gfn),
+		.error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS,
+		.prefetch = true,
+		.is_tdp = true,
+		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),
+
+		.max_level = PG_LEVEL_4K,
+		.req_level = PG_LEVEL_4K,
+		.goal_level = PG_LEVEL_4K,
+		.is_private = true,
+
+		.gfn = gfn,
+		.slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn),
+		.pfn = pfn,
+		.map_writable = true,
+	};
+	struct kvm *kvm = vcpu->kvm;
+	int r;
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
+		return -EIO;
+
+	if (kvm_gfn_is_write_tracked(kvm, fault.slot, fault.gfn))
+		return -EPERM;
+
+	r = kvm_mmu_reload(vcpu);
+	if (r)
+		return r;
+
+	r = mmu_topup_memory_caches(vcpu, false);
+	if (r)
+		return r;
+
+	do {
+		if (signal_pending(current))
+			return -EINTR;
+
+		if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu))
+			return -EIO;
+
+		cond_resched();
+
+		guard(read_lock)(&kvm->mmu_lock);
+
+		r = kvm_tdp_mmu_map(vcpu, &fault);
+	} while (r == RET_PF_RETRY);
+
+	if (r != RET_PF_FIXED)
+		return -EIO;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_tdp_mmu_map_private_pfn);
+
 static void nonpaging_init_context(struct kvm_mmu *context)
 {
 	context->page_fault = nonpaging_page_fault;
@@ -5977,7 +6036,6 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
 out:
 	return r;
 }
-EXPORT_SYMBOL_GPL(kvm_mmu_load);
 
 void kvm_mmu_unload(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 71da245d160f..c83e1ff02827 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3151,15 +3151,12 @@ struct tdx_gmem_post_populate_arg {
 static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 				  void __user *src, int order, void *_arg)
 {
-	u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
-	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	struct tdx_gmem_post_populate_arg *arg = _arg;
-	struct kvm_vcpu *vcpu = arg->vcpu;
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	u64 err, entry, level_state;
 	gpa_t gpa = gfn_to_gpa(gfn);
-	u8 level = PG_LEVEL_4K;
 	struct page *src_page;
 	int ret, i;
-	u64 err, entry, level_state;
 
 	/*
 	 * Get the source page if it has been faulted in. Return failure if the
@@ -3171,7 +3168,7 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	if (ret != 1)
 		return -ENOMEM;
 
-	ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
+	ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
 	if (ret < 0)
 		goto out;
 
@@ -3234,7 +3231,6 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
 	    !vt_is_tdx_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT) - 1))
 		return -EINVAL;
 
-	kvm_mmu_reload(vcpu);
 	ret = 0;
 	while (region.nr_pages) {
 		if (signal_pending(current)) {
-- 
2.51.0.318.gd7df087d1a-goog


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 03/18] Revert "KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU"
  2025-08-29  0:06 [RFC PATCH v2 00/18] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
  2025-08-29  0:06 ` [RFC PATCH v2 01/18] KVM: TDX: Drop PROVE_MMU=y sanity check on to-be-populated mappings Sean Christopherson
  2025-08-29  0:06 ` [RFC PATCH v2 02/18] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU Sean Christopherson
@ 2025-08-29  0:06 ` Sean Christopherson
  2025-08-29 19:00   ` Edgecombe, Rick P
  2025-08-29  0:06 ` [RFC PATCH v2 04/18] KVM: x86/mmu: Rename kvm_tdp_map_page() to kvm_tdp_page_prefault() Sean Christopherson
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29  0:06 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng

Remove the helper and exports that were added to allow TDX code to reuse
kvm_tdp_map_page() for its gmem post-populate flow now that a dedicated
TDP MMU API is provided to install a mapping given a gfn+pfn pair.

This reverts commit 2608f105760115e94a03efd9f12f8fbfd1f9af4b.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu.h         |  2 --
 arch/x86/kvm/mmu/mmu.c     |  4 ++--
 arch/x86/kvm/mmu/tdp_mmu.c | 37 +++++--------------------------------
 3 files changed, 7 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 697b90a97f43..dc6b965cea4f 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -257,8 +257,6 @@ extern bool tdp_mmu_enabled;
 #define tdp_mmu_enabled false
 #endif
 
-bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
-int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
 int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn);
 
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 65300e43d6a1..f808c437d738 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4904,7 +4904,8 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	return direct_page_fault(vcpu, fault);
 }
 
-int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
+static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
+			    u8 *level)
 {
 	int r;
 
@@ -4946,7 +4947,6 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
 		return -EIO;
 	}
 }
-EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
 
 long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 				    struct kvm_pre_fault_memory *range)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 31d921705dee..3ea2dd64ce72 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1939,13 +1939,16 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
  *
  * Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
  */
-static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
-				  struct kvm_mmu_page *root)
+int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+			 int *root_level)
 {
+	struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
 	struct tdp_iter iter;
 	gfn_t gfn = addr >> PAGE_SHIFT;
 	int leaf = -1;
 
+	*root_level = vcpu->arch.mmu->root_role.level;
+
 	for_each_tdp_pte(iter, vcpu->kvm, root, gfn, gfn + 1) {
 		leaf = iter.level;
 		sptes[leaf] = iter.old_spte;
@@ -1954,36 +1957,6 @@ static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
 	return leaf;
 }
 
-int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
-			 int *root_level)
-{
-	struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
-	*root_level = vcpu->arch.mmu->root_role.level;
-
-	return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, root);
-}
-
-bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa)
-{
-	struct kvm *kvm = vcpu->kvm;
-	bool is_direct = kvm_is_addr_direct(kvm, gpa);
-	hpa_t root = is_direct ? vcpu->arch.mmu->root.hpa :
-				 vcpu->arch.mmu->mirror_root_hpa;
-	u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
-	int leaf;
-
-	lockdep_assert_held(&kvm->mmu_lock);
-	rcu_read_lock();
-	leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, root_to_sp(root));
-	rcu_read_unlock();
-	if (leaf < 0)
-		return false;
-
-	spte = sptes[leaf];
-	return is_shadow_present_pte(spte) && is_last_spte(spte, leaf);
-}
-EXPORT_SYMBOL_GPL(kvm_tdp_mmu_gpa_is_mapped);
-
 /*
  * Returns the last level spte pointer of the shadow page walk for the given
  * gpa, and sets *spte to the spte value. This spte may be non-preset. If no
-- 
2.51.0.318.gd7df087d1a-goog


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 04/18] KVM: x86/mmu: Rename kvm_tdp_map_page() to kvm_tdp_page_prefault()
  2025-08-29  0:06 [RFC PATCH v2 00/18] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (2 preceding siblings ...)
  2025-08-29  0:06 ` [RFC PATCH v2 03/18] Revert "KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU" Sean Christopherson
@ 2025-08-29  0:06 ` Sean Christopherson
  2025-08-29 19:03   ` Edgecombe, Rick P
  2025-08-29  0:06 ` [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in S-EPT management Sean Christopherson
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29  0:06 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng

Rename kvm_tdp_map_page() to kvm_tdp_page_prefault() now that it's used
only by kvm_arch_vcpu_pre_fault_memory().

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f808c437d738..dddeda7f05eb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4904,8 +4904,8 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	return direct_page_fault(vcpu, fault);
 }
 
-static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
-			    u8 *level)
+static int kvm_tdp_page_prefault(struct kvm_vcpu *vcpu, gpa_t gpa,
+				 u64 error_code, u8 *level)
 {
 	int r;
 
@@ -4982,7 +4982,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 	 * Shadow paging uses GVA for kvm page fault, so restrict to
 	 * two-dimensional paging.
 	 */
-	r = kvm_tdp_map_page(vcpu, range->gpa | direct_bits, error_code, &level);
+	r = kvm_tdp_page_prefault(vcpu, range->gpa | direct_bits, error_code, &level);
 	if (r < 0)
 		return r;
 
-- 
2.51.0.318.gd7df087d1a-goog


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in S-EPT management
  2025-08-29  0:06 [RFC PATCH v2 00/18] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (3 preceding siblings ...)
  2025-08-29  0:06 ` [RFC PATCH v2 04/18] KVM: x86/mmu: Rename kvm_tdp_map_page() to kvm_tdp_page_prefault() Sean Christopherson
@ 2025-08-29  0:06 ` Sean Christopherson
  2025-08-29  8:36   ` Binbin Wu
  2025-08-29 19:53   ` Edgecombe, Rick P
  2025-08-29  0:06 ` [RFC PATCH v2 06/18] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition Sean Christopherson
                   ` (12 subsequent siblings)
  17 siblings, 2 replies; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29  0:06 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng

From: Yan Zhao <yan.y.zhao@intel.com>

Don't explicitly pin pages when mapping pages into the S-EPT, guest_memfd
doesn't support page migration in any capacity, i.e. there are no migrate
callbacks because guest_memfd pages *can't* be migrated.  See the WARN in
kvm_gmem_migrate_folio().

Eliminating TDX's explicit pinning will also enable guest_memfd to support
in-place conversion between shared and private memory[1][2].  Because KVM
cannot distinguish between speculative/transient refcounts and the
intentional refcount for TDX on private pages[3], failing to release
private page refcount in TDX could cause guest_memfd to indefinitely wait
on decreasing the refcount for the splitting.

Under normal conditions, not holding an extra page refcount in TDX is safe
because guest_memfd ensures pages are retained until its invalidation
notification to KVM MMU is completed. However, if there're bugs in KVM/TDX
module, not holding an extra refcount when a page is mapped in S-EPT could
result in a page being released from guest_memfd while still mapped in the
S-EPT.  But, doing work to make a fatal error slightly less fatal is a net
negative when that extra work adds complexity and confusion.

Several approaches were considered to address the refcount issue, including
  - Attempting to modify the KVM unmap operation to return a failure,
    which was deemed too complex and potentially incorrect[4].
 - Increasing the folio reference count only upon S-EPT zapping failure[5].
 - Use page flags or page_ext to indicate a page is still used by TDX[6],
   which does not work for HVO (HugeTLB Vmemmap Optimization).
  - Setting HWPOISON bit or leveraging folio_set_hugetlb_hwpoison()[7].

Due to the complexity or inappropriateness of these approaches, and the
fact that S-EPT zapping failure is currently only possible when there are
bugs in the KVM or TDX module, which is very rare in a production kernel,
a straightforward approach of simply not holding the page reference count
in TDX was chosen[8].

When S-EPT zapping errors occur, KVM_BUG_ON() is invoked to kick off all
vCPUs and mark the VM as dead. Although there is a potential window that a
private page mapped in the S-EPT could be reallocated and used outside the
VM, the loud warning from KVM_BUG_ON() should provide sufficient debug
information. To be robust against bugs, the user can enable panic_on_warn
as normal.

Link: https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com [1]
Link: https://youtu.be/UnBKahkAon4 [2]
Link: https://lore.kernel.org/all/CAGtprH_ypohFy9TOJ8Emm_roT4XbQUtLKZNFcM6Fr+fhTFkE0Q@mail.gmail.com [3]
Link: https://lore.kernel.org/all/aEEEJbTzlncbRaRA@yzhao56-desk.sh.intel.com [4]
Link: https://lore.kernel.org/all/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com [5]
Link: https://lore.kernel.org/all/aFkeBtuNBN1RrDAJ@yzhao56-desk.sh.intel.com [6]
Link: https://lore.kernel.org/all/diqzy0tikran.fsf@ackerleytng-ctop.c.googlers.com [7]
Link: https://lore.kernel.org/all/53ea5239f8ef9d8df9af593647243c10435fd219.camel@intel.com [8]
Suggested-by: Vishal Annapurve <vannapurve@google.com>
Suggested-by: Ackerley Tng <ackerleytng@google.com>
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
[sean: extract out of hugepage series, massage changelog accordingly]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 28 ++++------------------------
 1 file changed, 4 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c83e1ff02827..f24f8635b433 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1586,29 +1586,22 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
 }
 
-static void tdx_unpin(struct kvm *kvm, struct page *page)
-{
-	put_page(page);
-}
-
 static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
-			    enum pg_level level, struct page *page)
+			    enum pg_level level, kvm_pfn_t pfn)
 {
 	int tdx_level = pg_level_to_tdx_sept_level(level);
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct page *page = pfn_to_page(pfn);
 	gpa_t gpa = gfn_to_gpa(gfn);
 	u64 entry, level_state;
 	u64 err;
 
 	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
-	if (unlikely(tdx_operand_busy(err))) {
-		tdx_unpin(kvm, page);
+	if (unlikely(tdx_operand_busy(err)))
 		return -EBUSY;
-	}
 
 	if (KVM_BUG_ON(err, kvm)) {
 		pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
-		tdx_unpin(kvm, page);
 		return -EIO;
 	}
 
@@ -1642,29 +1635,18 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 				     enum pg_level level, kvm_pfn_t pfn)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
-	struct page *page = pfn_to_page(pfn);
 
 	/* TODO: handle large pages. */
 	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
 		return -EINVAL;
 
-	/*
-	 * Because guest_memfd doesn't support page migration with
-	 * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
-	 * migration.  Until guest_memfd supports page migration, prevent page
-	 * migration.
-	 * TODO: Once guest_memfd introduces callback on page migration,
-	 * implement it and remove get_page/put_page().
-	 */
-	get_page(page);
-
 	/*
 	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
 	 * barrier in tdx_td_finalize().
 	 */
 	smp_rmb();
 	if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
-		return tdx_mem_page_aug(kvm, gfn, level, page);
+		return tdx_mem_page_aug(kvm, gfn, level, pfn);
 
 	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
 }
@@ -1715,7 +1697,6 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 		return -EIO;
 	}
 	tdx_clear_page(page);
-	tdx_unpin(kvm, page);
 	return 0;
 }
 
@@ -1795,7 +1776,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
 	    !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
 		atomic64_dec(&kvm_tdx->nr_premapped);
-		tdx_unpin(kvm, page);
 		return 0;
 	}
 
-- 
2.51.0.318.gd7df087d1a-goog


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 06/18] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition
  2025-08-29  0:06 [RFC PATCH v2 00/18] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (4 preceding siblings ...)
  2025-08-29  0:06 ` [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in S-EPT management Sean Christopherson
@ 2025-08-29  0:06 ` Sean Christopherson
  2025-08-29  9:40   ` Binbin Wu
                     ` (2 more replies)
  2025-08-29  0:06 ` [RFC PATCH v2 07/18] KVM: TDX: Fold tdx_sept_drop_private_spte() into tdx_sept_remove_private_spte() Sean Christopherson
                   ` (11 subsequent siblings)
  17 siblings, 3 replies; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29  0:06 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng

Return -EIO when a KVM_BUG_ON() is tripped, as KVM's ABI is to return -EIO
when a VM has been killed due to a KVM bug, not -EINVAL.  Note, many (all?)
of the affected paths never propagate the error code to userspace, i.e.
this is about internal consistency more than anything else.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f24f8635b433..50a9d81dad53 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1624,7 +1624,7 @@ static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 
 	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
-		return -EINVAL;
+		return -EIO;
 
 	/* nr_premapped will be decreased when tdh_mem_page_add() is called. */
 	atomic64_inc(&kvm_tdx->nr_premapped);
@@ -1638,7 +1638,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 
 	/* TODO: handle large pages. */
 	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
-		return -EINVAL;
+		return -EIO;
 
 	/*
 	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
@@ -1661,10 +1661,10 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 
 	/* TODO: handle large pages. */
 	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
-		return -EINVAL;
+		return -EIO;
 
 	if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
-		return -EINVAL;
+		return -EIO;
 
 	/*
 	 * When zapping private page, write lock is held. So no race condition
@@ -1849,7 +1849,7 @@ static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
 	 * and slot move/deletion.
 	 */
 	if (KVM_BUG_ON(is_hkid_assigned(kvm_tdx), kvm))
-		return -EINVAL;
+		return -EIO;
 
 	/*
 	 * The HKID assigned to this TD was already freed and cache was
@@ -1870,7 +1870,7 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	 * there can't be anything populated in the private EPT.
 	 */
 	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
-		return -EINVAL;
+		return -EIO;
 
 	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
 	if (ret <= 0)
-- 
2.51.0.318.gd7df087d1a-goog


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 07/18] KVM: TDX: Fold tdx_sept_drop_private_spte() into tdx_sept_remove_private_spte()
  2025-08-29  0:06 [RFC PATCH v2 00/18] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (5 preceding siblings ...)
  2025-08-29  0:06 ` [RFC PATCH v2 06/18] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition Sean Christopherson
@ 2025-08-29  0:06 ` Sean Christopherson
  2025-08-29  9:49   ` Binbin Wu
  2025-08-29  0:06 ` [RFC PATCH v2 08/18] KVM: x86/mmu: Drop the return code from kvm_x86_ops.remove_external_spte() Sean Christopherson
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29  0:06 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng

Fold tdx_sept_drop_private_spte() into tdx_sept_remove_private_spte() to
avoid having to differnatiate between "zap", "drop", and "remove", and to
eliminate dead code due to redundant checks, e.g. on an HKID being
assigned.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 90 +++++++++++++++++++-----------------------
 1 file changed, 40 insertions(+), 50 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 50a9d81dad53..8cb6a2627eb2 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1651,55 +1651,6 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
 }
 
-static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
-				      enum pg_level level, struct page *page)
-{
-	int tdx_level = pg_level_to_tdx_sept_level(level);
-	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
-	gpa_t gpa = gfn_to_gpa(gfn);
-	u64 err, entry, level_state;
-
-	/* TODO: handle large pages. */
-	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
-		return -EIO;
-
-	if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
-		return -EIO;
-
-	/*
-	 * When zapping private page, write lock is held. So no race condition
-	 * with other vcpu sept operation.
-	 * Race with TDH.VP.ENTER due to (0-step mitigation) and Guest TDCALLs.
-	 */
-	err = tdh_mem_page_remove(&kvm_tdx->td, gpa, tdx_level, &entry,
-				  &level_state);
-
-	if (unlikely(tdx_operand_busy(err))) {
-		/*
-		 * The second retry is expected to succeed after kicking off all
-		 * other vCPUs and prevent them from invoking TDH.VP.ENTER.
-		 */
-		tdx_no_vcpus_enter_start(kvm);
-		err = tdh_mem_page_remove(&kvm_tdx->td, gpa, tdx_level, &entry,
-					  &level_state);
-		tdx_no_vcpus_enter_stop(kvm);
-	}
-
-	if (KVM_BUG_ON(err, kvm)) {
-		pr_tdx_error_2(TDH_MEM_PAGE_REMOVE, err, entry, level_state);
-		return -EIO;
-	}
-
-	err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
-
-	if (KVM_BUG_ON(err, kvm)) {
-		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
-		return -EIO;
-	}
-	tdx_clear_page(page);
-	return 0;
-}
-
 static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
 				     enum pg_level level, void *private_spt)
 {
@@ -1861,7 +1812,11 @@ static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
 static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 					enum pg_level level, kvm_pfn_t pfn)
 {
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	struct page *page = pfn_to_page(pfn);
+	gpa_t gpa = gfn_to_gpa(gfn);
+	u64 err, entry, level_state;
 	int ret;
 
 	/*
@@ -1872,6 +1827,10 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
 		return -EIO;
 
+	/* TODO: handle large pages. */
+	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+		return -EIO;
+
 	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
 	if (ret <= 0)
 		return ret;
@@ -1882,7 +1841,38 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	 */
 	tdx_track(kvm);
 
-	return tdx_sept_drop_private_spte(kvm, gfn, level, page);
+	/*
+	 * When zapping private page, write lock is held. So no race condition
+	 * with other vcpu sept operation.
+	 * Race with TDH.VP.ENTER due to (0-step mitigation) and Guest TDCALLs.
+	 */
+	err = tdh_mem_page_remove(&kvm_tdx->td, gpa, tdx_level, &entry,
+				  &level_state);
+
+	if (unlikely(tdx_operand_busy(err))) {
+		/*
+		 * The second retry is expected to succeed after kicking off all
+		 * other vCPUs and prevent them from invoking TDH.VP.ENTER.
+		 */
+		tdx_no_vcpus_enter_start(kvm);
+		err = tdh_mem_page_remove(&kvm_tdx->td, gpa, tdx_level, &entry,
+					  &level_state);
+		tdx_no_vcpus_enter_stop(kvm);
+	}
+
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error_2(TDH_MEM_PAGE_REMOVE, err, entry, level_state);
+		return -EIO;
+	}
+
+	err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
+		return -EIO;
+	}
+
+	tdx_clear_page(page);
+	return 0;
 }
 
 void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
-- 
2.51.0.318.gd7df087d1a-goog


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 08/18] KVM: x86/mmu: Drop the return code from kvm_x86_ops.remove_external_spte()
  2025-08-29  0:06 [RFC PATCH v2 00/18] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (6 preceding siblings ...)
  2025-08-29  0:06 ` [RFC PATCH v2 07/18] KVM: TDX: Fold tdx_sept_drop_private_spte() into tdx_sept_remove_private_spte() Sean Christopherson
@ 2025-08-29  0:06 ` Sean Christopherson
  2025-08-29  9:52   ` Binbin Wu
  2025-08-29  0:06 ` [RFC PATCH v2 09/18] KVM: TDX: Avoid a double-KVM_BUG_ON() in tdx_sept_zap_private_spte() Sean Christopherson
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29  0:06 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng

Drop the return code from kvm_x86_ops.remove_external_spte(), a.k.a.
tdx_sept_remove_private_spte(), as KVM simply does a KVM_BUG_ON() failure,
and that KVM_BUG_ON() is redundant since all error paths in TDX also do a
KVM_BUG_ON().

Opportunistically pass the spte instead of the pfn, as the API is clearly
about removing an spte.

Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  4 ++--
 arch/x86/kvm/mmu/tdp_mmu.c      |  8 ++------
 arch/x86/kvm/vmx/tdx.c          | 17 ++++++++---------
 3 files changed, 12 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0d3cc0fc27af..d0a8404a6b8f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1852,8 +1852,8 @@ struct kvm_x86_ops {
 				 void *external_spt);
 
 	/* Update external page table from spte getting removed, and flush TLB. */
-	int (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
-				    kvm_pfn_t pfn_for_gfn);
+	void (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				     u64 spte);
 
 	bool (*has_wbinvd_exit)(void);
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 3ea2dd64ce72..78ee085f7cbc 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -362,9 +362,6 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
 static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
 				 int level)
 {
-	kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
-	int ret;
-
 	/*
 	 * External (TDX) SPTEs are limited to PG_LEVEL_4K, and external
 	 * PTs are removed in a special order, involving free_external_spt().
@@ -377,9 +374,8 @@ static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
 
 	/* Zapping leaf spte is allowed only when write lock is held. */
 	lockdep_assert_held_write(&kvm->mmu_lock);
-	/* Because write lock is held, operation should success. */
-	ret = kvm_x86_call(remove_external_spte)(kvm, gfn, level, old_pfn);
-	KVM_BUG_ON(ret, kvm);
+
+	kvm_x86_call(remove_external_spte)(kvm, gfn, level, old_spte);
 }
 
 /**
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 8cb6a2627eb2..07f9ad1fbfb6 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1809,12 +1809,12 @@ static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
 	return tdx_reclaim_page(virt_to_page(private_spt));
 }
 
-static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
-					enum pg_level level, kvm_pfn_t pfn)
+static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
+					 enum pg_level level, u64 spte)
 {
+	struct page *page = pfn_to_page(spte_to_pfn(spte));
 	int tdx_level = pg_level_to_tdx_sept_level(level);
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
-	struct page *page = pfn_to_page(pfn);
 	gpa_t gpa = gfn_to_gpa(gfn);
 	u64 err, entry, level_state;
 	int ret;
@@ -1825,15 +1825,15 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	 * there can't be anything populated in the private EPT.
 	 */
 	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
-		return -EIO;
+		return;
 
 	/* TODO: handle large pages. */
 	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
-		return -EIO;
+		return;
 
 	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
 	if (ret <= 0)
-		return ret;
+		return;
 
 	/*
 	 * TDX requires TLB tracking before dropping private page.  Do
@@ -1862,17 +1862,16 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 
 	if (KVM_BUG_ON(err, kvm)) {
 		pr_tdx_error_2(TDH_MEM_PAGE_REMOVE, err, entry, level_state);
-		return -EIO;
+		return;
 	}
 
 	err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
 	if (KVM_BUG_ON(err, kvm)) {
 		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
-		return -EIO;
+		return;
 	}
 
 	tdx_clear_page(page);
-	return 0;
 }
 
 void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
-- 
2.51.0.318.gd7df087d1a-goog


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 09/18] KVM: TDX: Avoid a double-KVM_BUG_ON() in tdx_sept_zap_private_spte()
  2025-08-29  0:06 [RFC PATCH v2 00/18] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (7 preceding siblings ...)
  2025-08-29  0:06 ` [RFC PATCH v2 08/18] KVM: x86/mmu: Drop the return code from kvm_x86_ops.remove_external_spte() Sean Christopherson
@ 2025-08-29  0:06 ` Sean Christopherson
  2025-08-29  9:52   ` Binbin Wu
  2025-08-29  0:06 ` [RFC PATCH v2 10/18] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent Sean Christopherson
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29  0:06 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng

Return -EIO immediately from tdx_sept_zap_private_spte() if the number of
to-be-added pages underflows, so that the following "KVM_BUG_ON(err, kvm)"
isn't also triggered.  Isolating the check from the "is premap error"
if-statement will also allow adding a lockdep assertion that premap errors
are encountered if and only if slots_lock is held.

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 07f9ad1fbfb6..cafd618ca43c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1724,8 +1724,10 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 		err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
 		tdx_no_vcpus_enter_stop(kvm);
 	}
-	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
-	    !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
+	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
+		if (KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm))
+			return -EIO;
+
 		atomic64_dec(&kvm_tdx->nr_premapped);
 		return 0;
 	}
-- 
2.51.0.318.gd7df087d1a-goog


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 10/18] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent
  2025-08-29  0:06 [RFC PATCH v2 00/18] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (8 preceding siblings ...)
  2025-08-29  0:06 ` [RFC PATCH v2 09/18] KVM: TDX: Avoid a double-KVM_BUG_ON() in tdx_sept_zap_private_spte() Sean Christopherson
@ 2025-08-29  0:06 ` Sean Christopherson
  2025-08-29 10:06   ` Binbin Wu
  2025-08-29  0:06 ` [RFC PATCH v2 11/18] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller Sean Christopherson
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29  0:06 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng

Use atomic64_dec_return() when decrementing the number of "pre-mapped"
S-EPT pages to ensure that the count can't go negative without KVM
noticing.  In theory, checking for '0' and then decrementing in a separate
operation could miss a 0=>-1 transition.  In practice, such a condition is
impossible because nr_premapped is protected by slots_lock, i.e. doesn't
actually need to be an atomic (that wart will be addressed shortly).

Don't bother trying to keep the count non-negative, as the KVM_BUG_ON()
ensures the VM is dead, i.e. there's no point in trying to limp along.

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index cafd618ca43c..fe0815d542e3 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1725,10 +1725,9 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 		tdx_no_vcpus_enter_stop(kvm);
 	}
 	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
-		if (KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm))
+		if (KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm))
 			return -EIO;
 
-		atomic64_dec(&kvm_tdx->nr_premapped);
 		return 0;
 	}
 
@@ -3151,8 +3150,7 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 		goto out;
 	}
 
-	if (!KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm))
-		atomic64_dec(&kvm_tdx->nr_premapped);
+	KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm);
 
 	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
 		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
-- 
2.51.0.318.gd7df087d1a-goog


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 11/18] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-29  0:06 [RFC PATCH v2 00/18] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (9 preceding siblings ...)
  2025-08-29  0:06 ` [RFC PATCH v2 10/18] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent Sean Christopherson
@ 2025-08-29  0:06 ` Sean Christopherson
  2025-09-02 22:46   ` Edgecombe, Rick P
  2025-08-29  0:06 ` [RFC PATCH v2 12/18] KVM: TDX: Bug the VM if extended the initial measurement fails Sean Christopherson
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29  0:06 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng

Fold tdx_mem_page_record_premap_cnt() into tdx_sept_set_private_spte() as
providing a one-off helper for effectively three lines of code is at best a
wash, and splitting the code makes the comment for smp_rmb()  _extremely_
confusing as the comment talks about reading kvm->arch.pre_fault_allowed
before kvm_tdx->state, but the immediately visible code does the exact
opposite.

Opportunistically rewrite the comments to more explicitly explain who is
checking what, as well as _why_ the ordering matters.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 49 ++++++++++++++++++------------------------
 1 file changed, 21 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index fe0815d542e3..06dd2861eba7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1608,29 +1608,6 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
 	return 0;
 }
 
-/*
- * KVM_TDX_INIT_MEM_REGION calls kvm_gmem_populate() to map guest pages; the
- * callback tdx_gmem_post_populate() then maps pages into private memory.
- * through the a seamcall TDH.MEM.PAGE.ADD().  The SEAMCALL also requires the
- * private EPT structures for the page to have been built before, which is
- * done via kvm_tdp_map_page(). nr_premapped counts the number of pages that
- * were added to the EPT structures but not added with TDH.MEM.PAGE.ADD().
- * The counter has to be zero on KVM_TDX_FINALIZE_VM, to ensure that there
- * are no half-initialized shared EPT pages.
- */
-static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
-					  enum pg_level level, kvm_pfn_t pfn)
-{
-	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
-
-	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
-		return -EIO;
-
-	/* nr_premapped will be decreased when tdh_mem_page_add() is called. */
-	atomic64_inc(&kvm_tdx->nr_premapped);
-	return 0;
-}
-
 static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 				     enum pg_level level, kvm_pfn_t pfn)
 {
@@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 		return -EIO;
 
 	/*
-	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
-	 * barrier in tdx_td_finalize().
+	 * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory()
+	 * before kvm_tdx->state.  Userspace must not be allowed to pre-fault
+	 * arbitrary memory until the initial memory image is finalized.  Pairs
+	 * with the smp_wmb() in tdx_td_finalize().
 	 */
 	smp_rmb();
-	if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
-		return tdx_mem_page_aug(kvm, gfn, level, pfn);
 
-	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
+	/*
+	 * If the TD isn't finalized/runnable, then userspace is initializing
+	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Increment the number of
+	 * pages that need to be mapped and initialized via TDH.MEM.PAGE.ADD.
+	 * KVM_TDX_FINALIZE_VM checks the counter to ensure all mapped pages
+	 * have been added to the image, to prevent running the TD with a
+	 * valid mapping in the mirror EPT, but not in the S-EPT.
+	 */
+	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) {
+		if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
+			return -EIO;
+
+		atomic64_inc(&kvm_tdx->nr_premapped);
+		return 0;
+	}
+
+	return tdx_mem_page_aug(kvm, gfn, level, pfn);
 }
 
 static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
-- 
2.51.0.318.gd7df087d1a-goog


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 12/18] KVM: TDX: Bug the VM if extended the initial measurement fails
  2025-08-29  0:06 [RFC PATCH v2 00/18] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (10 preceding siblings ...)
  2025-08-29  0:06 ` [RFC PATCH v2 11/18] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller Sean Christopherson
@ 2025-08-29  0:06 ` Sean Christopherson
  2025-08-29  8:18   ` Yan Zhao
  2025-08-29  0:06 ` [RFC PATCH v2 13/18] KVM: TDX: ADD pages to the TD image while populating mirror EPT entries Sean Christopherson
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29  0:06 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng

WARN and terminate the VM if TDH_MR_EXTEND fails, as extending the
measurement should fail if and only if there is a KVM bug, or if the S-EPT
mapping is invalid, and it should be impossibe for the S-EPT mappings to
be removed between kvm_tdp_mmu_map_private_pfn() and tdh_mr_extend().

Holding slots_lock prevents zaps due to memslot updates,
filemap_invalidate_lock() prevents zaps due to guest_memfd PUNCH_HOLE,
and all usage of kvm_zap_gfn_range() is mutually exclusive with S-EPT
entries that can be used for the initial image.  The call from sev.c is
obviously mutually exclusive, TDX disallows KVM_X86_QUIRK_IGNORE_GUEST_PAT
so same goes for kvm_noncoherent_dma_assignment_start_or_stop, and while
__kvm_set_or_clear_apicv_inhibit() can likely be tripped while building the
image, the APIC page has its own non-guest_memfd memslot and so can't be
used for the initial image, which means that too is mutually exclusive.

Opportunistically switch to "goto" to jump around the measurement code,
partly to make it clear that KVM needs to bail entirely if extending the
measurement fails, partly in anticipation of reworking how and when
TDH_MEM_PAGE_ADD is done.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 06dd2861eba7..bc92e87a1dbb 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3145,14 +3145,22 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 
 	KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm);
 
-	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
-		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
-			err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry,
-					    &level_state);
-			if (err) {
-				ret = -EIO;
-				break;
-			}
+	if (!(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION))
+		goto out;
+
+	/*
+	 * Note, MR.EXTEND can fail if the S-EPT mapping is somehow removed
+	 * between mapping the pfn and now, but slots_lock prevents memslot
+	 * updates, filemap_invalidate_lock() prevents guest_memfd updates,
+	 * mmu_notifier events can't reach S-EPT entries, and KVM's internal
+	 * zapping flows are mutually exclusive with S-EPT mappings.
+	 */
+	for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
+		err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state);
+		if (KVM_BUG_ON(err, kvm)) {
+			pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state);
+			ret = -EIO;
+			goto out;
 		}
 	}
 
-- 
2.51.0.318.gd7df087d1a-goog


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 13/18] KVM: TDX: ADD pages to the TD image while populating mirror EPT entries
  2025-08-29  0:06 [RFC PATCH v2 00/18] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (11 preceding siblings ...)
  2025-08-29  0:06 ` [RFC PATCH v2 12/18] KVM: TDX: Bug the VM if extended the initial measurement fails Sean Christopherson
@ 2025-08-29  0:06 ` Sean Christopherson
  2025-08-29 23:42   ` Edgecombe, Rick P
  2025-08-29  0:06 ` [RFC PATCH v2 14/18] KVM: TDX: Fold tdx_sept_zap_private_spte() into tdx_sept_remove_private_spte() Sean Christopherson
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29  0:06 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng

When populating the initial memory image for a TDX guest, ADD pages to the
TD as part of establishing the mappings in the mirror EPT, as opposed to
creating the mappings and then doing ADD after the fact.  Doing ADD in the
S-EPT callbacks eliminates the need to track "premapped" pages, as the
mirror EPT (M-EPT) and S-EPT are always synchronized, e.g. if ADD fails,
KVM reverts to the previous M-EPT entry (guaranteed to be !PRESENT).

Eliminating the hole where the M-EPT can have a mapping that doesn't exist
in the S-EPT in turn obviates the need to handle errors that are unique to
encountering a missing S-EPT entry (see tdx_is_sept_zap_err_due_to_premap()).

Keeping the M-EPT and S-EPT synchronized also eliminates the need to check
for unconsumed "premap" entries during tdx_td_finalize(), as there simply
can't be any such entries.  Dropping that check in particular reduces the
overall cognitive load, as the managemented of nr_premapped with respect
to removal of S-EPT is _very_ subtle.  E.g. successful removal of an S-EPT
entry after it completed ADD doesn't adjust nr_premapped, but it's not
clear why that's "ok" but having half-baked entries is not (it's not truly
"ok" in that removing pages from the image will likely prevent the guest
from booting, but from KVM's perspective it's "ok").

Doing ADD in the S-EPT path requires passing an argument via a scratch
field, but the current approach of tracking the number of "premapped"
pages effectively does the same.  And the "premapped" counter is much more
dangerous, as it doesn't have a singular lock to protect its usage, since
nr_premapped can be modified as soon as mmu_lock is dropped, at least in
theory.  I.e. nr_premapped is guarded by slots_lock, but only for "happy"
paths.

Note, this approach was used/tried at various points in TDX development,
but was ultimately discarded due to a desire to avoid stashing temporary
state in kvm_tdx.  But as above, KVM ended up with such state anyways,
and fully committing to using temporary state provides better access
rules (100% guarded by slots_lock), and makes several edge cases flat out
impossible.

Note #2, continue to extend the measurement outside of mmu_lock, as it's
a slow operation (typically 16 SEAMCALLs per page whose data is included
in the measurement), and doesn't *need* to be done under mmu_lock, e.g.
for consistency purposes.  However, MR.EXTEND isn't _that_ slow, e.g.
~1ms latency to measure a full page, so if it needs to be done under
mmu_lock in the future, e.g. because KVM gains a flow that can remove
S-EPT entries uring KVM_TDX_INIT_MEM_REGION, then extending the
measurement can also be moved into the S-EPT mapping path (again, only if
absolutely necessary).  P.S. _If_ MR.EXTEND is moved into the S-EPT path,
take care not to return an error up the stack if TDH_MR_EXTEND fails, as
removing the M-EPT entry but not the S-EPT entry would result in
inconsistent state!

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 116 ++++++++++++++---------------------------
 arch/x86/kvm/vmx/tdx.h |   8 ++-
 2 files changed, 46 insertions(+), 78 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index bc92e87a1dbb..00c3dc376690 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1586,6 +1586,32 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
 }
 
+static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+			    kvm_pfn_t pfn)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	u64 err, entry, level_state;
+	gpa_t gpa = gfn_to_gpa(gfn);
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm) ||
+	    KVM_BUG_ON(!kvm_tdx->page_add_src, kvm))
+		return -EIO;
+
+	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
+			       kvm_tdx->page_add_src, &entry, &level_state);
+	if (unlikely(tdx_operand_busy(err)))
+		return -EBUSY;
+
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error_2(TDH_MEM_PAGE_ADD, err, entry, level_state);
+		return -EIO;
+	}
+
+	return 0;
+}
+
 static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
 			    enum pg_level level, kvm_pfn_t pfn)
 {
@@ -1627,19 +1653,10 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 
 	/*
 	 * If the TD isn't finalized/runnable, then userspace is initializing
-	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Increment the number of
-	 * pages that need to be mapped and initialized via TDH.MEM.PAGE.ADD.
-	 * KVM_TDX_FINALIZE_VM checks the counter to ensure all mapped pages
-	 * have been added to the image, to prevent running the TD with a
-	 * valid mapping in the mirror EPT, but not in the S-EPT.
+	 * the VM image via KVM_TDX_INIT_MEM_REGION; ADD the page to the TD.
 	 */
-	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) {
-		if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
-			return -EIO;
-
-		atomic64_inc(&kvm_tdx->nr_premapped);
-		return 0;
-	}
+	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE))
+		return tdx_mem_page_add(kvm, gfn, level, pfn);
 
 	return tdx_mem_page_aug(kvm, gfn, level, pfn);
 }
@@ -1665,39 +1682,6 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
 	return 0;
 }
 
-/*
- * Check if the error returned from a SEPT zap SEAMCALL is due to that a page is
- * mapped by KVM_TDX_INIT_MEM_REGION without tdh_mem_page_add() being called
- * successfully.
- *
- * Since tdh_mem_sept_add() must have been invoked successfully before a
- * non-leaf entry present in the mirrored page table, the SEPT ZAP related
- * SEAMCALLs should not encounter err TDX_EPT_WALK_FAILED. They should instead
- * find TDX_EPT_ENTRY_STATE_INCORRECT due to an empty leaf entry found in the
- * SEPT.
- *
- * Further check if the returned entry from SEPT walking is with RWX permissions
- * to filter out anything unexpected.
- *
- * Note: @level is pg_level, not the tdx_level. The tdx_level extracted from
- * level_state returned from a SEAMCALL error is the same as that passed into
- * the SEAMCALL.
- */
-static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err,
-					     u64 entry, int level)
-{
-	if (!err || kvm_tdx->state == TD_STATE_RUNNABLE)
-		return false;
-
-	if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))
-		return false;
-
-	if ((is_last_spte(entry, level) && (entry & VMX_EPT_RWX_MASK)))
-		return false;
-
-	return true;
-}
-
 static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 				     enum pg_level level, struct page *page)
 {
@@ -1717,12 +1701,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 		err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
 		tdx_no_vcpus_enter_stop(kvm);
 	}
-	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
-		if (KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm))
-			return -EIO;
-
-		return 0;
-	}
 
 	if (KVM_BUG_ON(err, kvm)) {
 		pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
@@ -2827,12 +2805,6 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 
 	if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE)
 		return -EINVAL;
-	/*
-	 * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
-	 * TDH.MEM.PAGE.ADD().
-	 */
-	if (atomic64_read(&kvm_tdx->nr_premapped))
-		return -EINVAL;
 
 	cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td);
 	if (tdx_operand_busy(cmd->hw_error))
@@ -3116,11 +3088,14 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 {
 	struct tdx_gmem_post_populate_arg *arg = _arg;
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
-	u64 err, entry, level_state;
 	gpa_t gpa = gfn_to_gpa(gfn);
+	u64 err, entry, level_state;
 	struct page *src_page;
 	int ret, i;
 
+	if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
+		return -EIO;
+
 	/*
 	 * Get the source page if it has been faulted in. Return failure if the
 	 * source page has been swapped out or unmapped in primary memory.
@@ -3131,22 +3106,14 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	if (ret != 1)
 		return -ENOMEM;
 
+	kvm_tdx->page_add_src = src_page;
 	ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
-	if (ret < 0)
-		goto out;
+	kvm_tdx->page_add_src = NULL;
 
-	ret = 0;
-	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
-			       src_page, &entry, &level_state);
-	if (err) {
-		ret = unlikely(tdx_operand_busy(err)) ? -EBUSY : -EIO;
-		goto out;
-	}
+	put_page(src_page);
 
-	KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm);
-
-	if (!(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION))
-		goto out;
+	if (ret || !(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION))
+		return ret;
 
 	/*
 	 * Note, MR.EXTEND can fail if the S-EPT mapping is somehow removed
@@ -3159,14 +3126,11 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 		err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state);
 		if (KVM_BUG_ON(err, kvm)) {
 			pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state);
-			ret = -EIO;
-			goto out;
+			return -EIO;
 		}
 	}
 
-out:
-	put_page(src_page);
-	return ret;
+	return 0;
 }
 
 static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index ca39a9391db1..1b00adbbaf77 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -36,8 +36,12 @@ struct kvm_tdx {
 
 	struct tdx_td td;
 
-	/* For KVM_TDX_INIT_MEM_REGION. */
-	atomic64_t nr_premapped;
+	/*
+	 * Scratch pointer used to pass the source page to tdx_mem_page_add.
+	 * Protected by slots_lock, and non-NULL only when mapping a private
+	 * pfn via tdx_gmem_post_populate().
+	 */
+	struct page *page_add_src;
 
 	/*
 	 * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do
-- 
2.51.0.318.gd7df087d1a-goog


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 14/18] KVM: TDX: Fold tdx_sept_zap_private_spte() into tdx_sept_remove_private_spte()
  2025-08-29  0:06 [RFC PATCH v2 00/18] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (12 preceding siblings ...)
  2025-08-29  0:06 ` [RFC PATCH v2 13/18] KVM: TDX: ADD pages to the TD image while populating mirror EPT entries Sean Christopherson
@ 2025-08-29  0:06 ` Sean Christopherson
  2025-09-02 17:31   ` Edgecombe, Rick P
  2025-08-29  0:06 ` [RFC PATCH v2 15/18] KVM: TDX: Combine KVM_BUG_ON + pr_tdx_error() into TDX_BUG_ON() Sean Christopherson
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29  0:06 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng

Do TDH_MEM_RANGE_BLOCK directly in tdx_sept_remove_private_spte() instead
of using a one-off helper now that the nr_premapped tracking is gone.

Opportunistically drop the WARN on hugepages, which was dead code (see the
KVM_BUG_ON() in tdx_sept_remove_private_spte()).

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 41 +++++++++++------------------------------
 1 file changed, 11 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 00c3dc376690..aa6d88629dae 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1682,33 +1682,6 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
 	return 0;
 }
 
-static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
-				     enum pg_level level, struct page *page)
-{
-	int tdx_level = pg_level_to_tdx_sept_level(level);
-	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
-	gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
-	u64 err, entry, level_state;
-
-	/* For now large page isn't supported yet. */
-	WARN_ON_ONCE(level != PG_LEVEL_4K);
-
-	err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
-
-	if (unlikely(tdx_operand_busy(err))) {
-		/* After no vCPUs enter, the second retry is expected to succeed */
-		tdx_no_vcpus_enter_start(kvm);
-		err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
-		tdx_no_vcpus_enter_stop(kvm);
-	}
-
-	if (KVM_BUG_ON(err, kvm)) {
-		pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
-		return -EIO;
-	}
-	return 1;
-}
-
 /*
  * Ensure shared and private EPTs to be flushed on all vCPUs.
  * tdh_mem_track() is the only caller that increases TD epoch. An increase in
@@ -1789,7 +1762,6 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	gpa_t gpa = gfn_to_gpa(gfn);
 	u64 err, entry, level_state;
-	int ret;
 
 	/*
 	 * HKID is released after all private pages have been removed, and set
@@ -1803,9 +1775,18 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
 		return;
 
-	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
-	if (ret <= 0)
+	err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
+	if (unlikely(tdx_operand_busy(err))) {
+		/* After no vCPUs enter, the second retry is expected to succeed */
+		tdx_no_vcpus_enter_start(kvm);
+		err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
+		tdx_no_vcpus_enter_stop(kvm);
+	}
+
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
 		return;
+	}
 
 	/*
 	 * TDX requires TLB tracking before dropping private page.  Do
-- 
2.51.0.318.gd7df087d1a-goog


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 15/18] KVM: TDX: Combine KVM_BUG_ON + pr_tdx_error() into TDX_BUG_ON()
  2025-08-29  0:06 [RFC PATCH v2 00/18] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (13 preceding siblings ...)
  2025-08-29  0:06 ` [RFC PATCH v2 14/18] KVM: TDX: Fold tdx_sept_zap_private_spte() into tdx_sept_remove_private_spte() Sean Christopherson
@ 2025-08-29  0:06 ` Sean Christopherson
  2025-08-29  9:03   ` Binbin Wu
  2025-09-02 18:55   ` Edgecombe, Rick P
  2025-08-29  0:06 ` [RFC PATCH v2 16/18] KVM: TDX: Derive error argument names from the local variable names Sean Christopherson
                   ` (2 subsequent siblings)
  17 siblings, 2 replies; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29  0:06 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng

Add TDX_BUG_ON() macros (with varying numbers of arguments) to deduplicate
the myriad flows that do KVM_BUG_ON()/WARN_ON_ONCE() followed by a call to
pr_tdx_error().  In addition to reducing boilerplate copy+paste code, this
also helps ensure that KVM provides consistent handling of SEAMCALL errors.

Opportunistically convert a handful of bare WARN_ON_ONCE() paths to the
equivalent of KVM_BUG_ON(), i.e. have them terminate the VM.  If a SEAMCALL
error is fatal enough to WARN on, it's fatal enough to terminate the TD.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 114 +++++++++++++++++------------------------
 1 file changed, 47 insertions(+), 67 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index aa6d88629dae..df9b4496cd01 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -24,20 +24,32 @@
 #undef pr_fmt
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
-#define pr_tdx_error(__fn, __err)	\
-	pr_err_ratelimited("SEAMCALL %s failed: 0x%llx\n", #__fn, __err)
+#define __TDX_BUG_ON(__err, __f, __kvm, __fmt, __args...)			\
+({										\
+	struct kvm *_kvm = (__kvm);						\
+	bool __ret = !!(__err);							\
+										\
+	if (WARN_ON_ONCE(__ret && (!_kvm || !_kvm->vm_bugged))) {		\
+		if (_kvm)							\
+			kvm_vm_bugged(_kvm);					\
+		pr_err_ratelimited("SEAMCALL " __f " failed: 0x%llx" __fmt "\n",\
+				   __err,  __args);				\
+	}									\
+	unlikely(__ret);							\
+})
 
-#define __pr_tdx_error_N(__fn_str, __err, __fmt, ...)		\
-	pr_err_ratelimited("SEAMCALL " __fn_str " failed: 0x%llx, " __fmt,  __err,  __VA_ARGS__)
+#define TDX_BUG_ON(__err, __fn, __kvm)				\
+	__TDX_BUG_ON(__err, #__fn, __kvm, "%s", "")
 
-#define pr_tdx_error_1(__fn, __err, __rcx)		\
-	__pr_tdx_error_N(#__fn, __err, "rcx 0x%llx\n", __rcx)
+#define TDX_BUG_ON_1(__err, __fn, __rcx, __kvm)			\
+	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx", __rcx)
 
-#define pr_tdx_error_2(__fn, __err, __rcx, __rdx)	\
-	__pr_tdx_error_N(#__fn, __err, "rcx 0x%llx, rdx 0x%llx\n", __rcx, __rdx)
+#define TDX_BUG_ON_2(__err, __fn, __rcx, __rdx, __kvm)		\
+	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx, rdx 0x%llx", __rcx, __rdx)
+
+#define TDX_BUG_ON_3(__err, __fn, __rcx, __rdx, __r8, __kvm)	\
+	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx, rdx 0x%llx, r8 0x%llx", __rcx, __rdx, __r8)
 
-#define pr_tdx_error_3(__fn, __err, __rcx, __rdx, __r8)	\
-	__pr_tdx_error_N(#__fn, __err, "rcx 0x%llx, rdx 0x%llx, r8 0x%llx\n", __rcx, __rdx, __r8)
 
 bool enable_tdx __ro_after_init;
 module_param_named(tdx, enable_tdx, bool, 0444);
@@ -332,10 +344,9 @@ static int __tdx_reclaim_page(struct page *page)
 	 * before the HKID is released and control pages have also been
 	 * released at this point, so there is no possibility of contention.
 	 */
-	if (WARN_ON_ONCE(err)) {
-		pr_tdx_error_3(TDH_PHYMEM_PAGE_RECLAIM, err, rcx, rdx, r8);
+	if (TDX_BUG_ON_3(err, TDH_PHYMEM_PAGE_RECLAIM, rcx, rdx, r8, NULL))
 		return -EIO;
-	}
+
 	return 0;
 }
 
@@ -423,8 +434,8 @@ static void tdx_flush_vp_on_cpu(struct kvm_vcpu *vcpu)
 		return;
 
 	smp_call_function_single(cpu, tdx_flush_vp, &arg, 1);
-	if (KVM_BUG_ON(arg.err, vcpu->kvm))
-		pr_tdx_error(TDH_VP_FLUSH, arg.err);
+
+	TDX_BUG_ON(arg.err, TDH_VP_FLUSH, vcpu->kvm);
 }
 
 void tdx_disable_virtualization_cpu(void)
@@ -473,8 +484,7 @@ static void smp_func_do_phymem_cache_wb(void *unused)
 	}
 
 out:
-	if (WARN_ON_ONCE(err))
-		pr_tdx_error(TDH_PHYMEM_CACHE_WB, err);
+	TDX_BUG_ON(err, TDH_PHYMEM_CACHE_WB, NULL);
 }
 
 void tdx_mmu_release_hkid(struct kvm *kvm)
@@ -513,8 +523,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
 	err = tdh_mng_vpflushdone(&kvm_tdx->td);
 	if (err == TDX_FLUSHVP_NOT_DONE)
 		goto out;
-	if (KVM_BUG_ON(err, kvm)) {
-		pr_tdx_error(TDH_MNG_VPFLUSHDONE, err);
+	if (TDX_BUG_ON(err, TDH_MNG_VPFLUSHDONE, kvm)) {
 		pr_err("tdh_mng_vpflushdone() failed. HKID %d is leaked.\n",
 		       kvm_tdx->hkid);
 		goto out;
@@ -537,8 +546,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
 	 * tdh_mng_key_freeid() will fail.
 	 */
 	err = tdh_mng_key_freeid(&kvm_tdx->td);
-	if (KVM_BUG_ON(err, kvm)) {
-		pr_tdx_error(TDH_MNG_KEY_FREEID, err);
+	if (TDX_BUG_ON(err, TDH_MNG_KEY_FREEID, kvm)) {
 		pr_err("tdh_mng_key_freeid() failed. HKID %d is leaked.\n",
 		       kvm_tdx->hkid);
 	} else {
@@ -589,10 +597,9 @@ static void tdx_reclaim_td_control_pages(struct kvm *kvm)
 	 * when it is reclaiming TDCS).
 	 */
 	err = tdh_phymem_page_wbinvd_tdr(&kvm_tdx->td);
-	if (KVM_BUG_ON(err, kvm)) {
-		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
+	if (TDX_BUG_ON(err, TDH_PHYMEM_PAGE_WBINVD, kvm))
 		return;
-	}
+
 	tdx_clear_page(kvm_tdx->td.tdr_page);
 
 	__free_page(kvm_tdx->td.tdr_page);
@@ -615,11 +622,8 @@ static int tdx_do_tdh_mng_key_config(void *param)
 
 	/* TDX_RND_NO_ENTROPY related retries are handled by sc_retry() */
 	err = tdh_mng_key_config(&kvm_tdx->td);
-
-	if (KVM_BUG_ON(err, &kvm_tdx->kvm)) {
-		pr_tdx_error(TDH_MNG_KEY_CONFIG, err);
+	if (TDX_BUG_ON(err, TDH_MNG_KEY_CONFIG, &kvm_tdx->kvm))
 		return -EIO;
-	}
 
 	return 0;
 }
@@ -1604,10 +1608,8 @@ static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level,
 	if (unlikely(tdx_operand_busy(err)))
 		return -EBUSY;
 
-	if (KVM_BUG_ON(err, kvm)) {
-		pr_tdx_error_2(TDH_MEM_PAGE_ADD, err, entry, level_state);
+	if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_ADD, entry, level_state, kvm))
 		return -EIO;
-	}
 
 	return 0;
 }
@@ -1626,10 +1628,8 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
 	if (unlikely(tdx_operand_busy(err)))
 		return -EBUSY;
 
-	if (KVM_BUG_ON(err, kvm)) {
-		pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
+	if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_AUG, entry, level_state, kvm))
 		return -EIO;
-	}
 
 	return 0;
 }
@@ -1674,10 +1674,8 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
 	if (unlikely(tdx_operand_busy(err)))
 		return -EBUSY;
 
-	if (KVM_BUG_ON(err, kvm)) {
-		pr_tdx_error_2(TDH_MEM_SEPT_ADD, err, entry, level_state);
+	if (TDX_BUG_ON_2(err, TDH_MEM_SEPT_ADD, entry, level_state, kvm))
 		return -EIO;
-	}
 
 	return 0;
 }
@@ -1725,8 +1723,7 @@ static void tdx_track(struct kvm *kvm)
 		tdx_no_vcpus_enter_stop(kvm);
 	}
 
-	if (KVM_BUG_ON(err, kvm))
-		pr_tdx_error(TDH_MEM_TRACK, err);
+	TDX_BUG_ON(err, TDH_MEM_TRACK, kvm);
 
 	kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
 }
@@ -1783,10 +1780,8 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 		tdx_no_vcpus_enter_stop(kvm);
 	}
 
-	if (KVM_BUG_ON(err, kvm)) {
-		pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
+	if (TDX_BUG_ON_2(err, TDH_MEM_RANGE_BLOCK, entry, level_state, kvm))
 		return;
-	}
 
 	/*
 	 * TDX requires TLB tracking before dropping private page.  Do
@@ -1813,16 +1808,12 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 		tdx_no_vcpus_enter_stop(kvm);
 	}
 
-	if (KVM_BUG_ON(err, kvm)) {
-		pr_tdx_error_2(TDH_MEM_PAGE_REMOVE, err, entry, level_state);
+	if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_REMOVE, entry, level_state, kvm))
 		return;
-	}
 
 	err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
-	if (KVM_BUG_ON(err, kvm)) {
-		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
+	if (TDX_BUG_ON(err, TDH_PHYMEM_PAGE_WBINVD, kvm))
 		return;
-	}
 
 	tdx_clear_page(page);
 }
@@ -2451,8 +2442,7 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
 		goto free_packages;
 	}
 
-	if (WARN_ON_ONCE(err)) {
-		pr_tdx_error(TDH_MNG_CREATE, err);
+	if (TDX_BUG_ON(err, TDH_MNG_CREATE, kvm)) {
 		ret = -EIO;
 		goto free_packages;
 	}
@@ -2493,8 +2483,7 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
 			ret = -EAGAIN;
 			goto teardown;
 		}
-		if (WARN_ON_ONCE(err)) {
-			pr_tdx_error(TDH_MNG_ADDCX, err);
+		if (TDX_BUG_ON(err, TDH_MNG_ADDCX, kvm)) {
 			ret = -EIO;
 			goto teardown;
 		}
@@ -2511,8 +2500,7 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
 		*seamcall_err = err;
 		ret = -EINVAL;
 		goto teardown;
-	} else if (WARN_ON_ONCE(err)) {
-		pr_tdx_error_1(TDH_MNG_INIT, err, rcx);
+	} else if (TDX_BUG_ON_1(err, TDH_MNG_INIT, rcx, kvm)) {
 		ret = -EIO;
 		goto teardown;
 	}
@@ -2790,10 +2778,8 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 	cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td);
 	if (tdx_operand_busy(cmd->hw_error))
 		return -EBUSY;
-	if (KVM_BUG_ON(cmd->hw_error, kvm)) {
-		pr_tdx_error(TDH_MR_FINALIZE, cmd->hw_error);
+	if (TDX_BUG_ON(cmd->hw_error, TDH_MR_FINALIZE, kvm))
 		return -EIO;
-	}
 
 	kvm_tdx->state = TD_STATE_RUNNABLE;
 	/* TD_STATE_RUNNABLE must be set before 'pre_fault_allowed' */
@@ -2873,16 +2859,14 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
 	}
 
 	err = tdh_vp_create(&kvm_tdx->td, &tdx->vp);
-	if (KVM_BUG_ON(err, vcpu->kvm)) {
+	if (TDX_BUG_ON(err, TDH_VP_CREATE, vcpu->kvm)) {
 		ret = -EIO;
-		pr_tdx_error(TDH_VP_CREATE, err);
 		goto free_tdcx;
 	}
 
 	for (i = 0; i < kvm_tdx->td.tdcx_nr_pages; i++) {
 		err = tdh_vp_addcx(&tdx->vp, tdx->vp.tdcx_pages[i]);
-		if (KVM_BUG_ON(err, vcpu->kvm)) {
-			pr_tdx_error(TDH_VP_ADDCX, err);
+		if (TDX_BUG_ON(err, TDH_VP_ADDCX, vcpu->kvm)) {
 			/*
 			 * Pages already added are reclaimed by the vcpu_free
 			 * method, but the rest are freed here.
@@ -2896,10 +2880,8 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
 	}
 
 	err = tdh_vp_init(&tdx->vp, vcpu_rcx, vcpu->vcpu_id);
-	if (KVM_BUG_ON(err, vcpu->kvm)) {
-		pr_tdx_error(TDH_VP_INIT, err);
+	if (TDX_BUG_ON(err, TDH_VP_INIT, vcpu->kvm))
 		return -EIO;
-	}
 
 	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
 
@@ -3105,10 +3087,8 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	 */
 	for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
 		err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state);
-		if (KVM_BUG_ON(err, kvm)) {
-			pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state);
+		if (TDX_BUG_ON_2(err, TDH_MR_EXTEND, entry, level_state, kvm))
 			return -EIO;
-		}
 	}
 
 	return 0;
-- 
2.51.0.318.gd7df087d1a-goog


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 16/18] KVM: TDX: Derive error argument names from the local variable names
  2025-08-29  0:06 [RFC PATCH v2 00/18] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (14 preceding siblings ...)
  2025-08-29  0:06 ` [RFC PATCH v2 15/18] KVM: TDX: Combine KVM_BUG_ON + pr_tdx_error() into TDX_BUG_ON() Sean Christopherson
@ 2025-08-29  0:06 ` Sean Christopherson
  2025-08-30  0:00   ` Edgecombe, Rick P
  2025-08-29  0:06 ` [RFC PATCH v2 17/18] KVM: TDX: Assert that mmu_lock is held for write when removing S-EPT entries Sean Christopherson
  2025-08-29  0:06 ` [RFC PATCH v2 18/18] KVM: TDX: Add macro to retry SEAMCALLs when forcing vCPUs out of guest Sean Christopherson
  17 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29  0:06 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng

When printing SEAMCALL errors, use the name of the variable holding an
error parameter instead of the register from whence it came, so that flows
which use descriptive variable names will similarly print descriptive
error messages.

Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index df9b4496cd01..b73f260a55fd 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -41,14 +41,15 @@
 #define TDX_BUG_ON(__err, __fn, __kvm)				\
 	__TDX_BUG_ON(__err, #__fn, __kvm, "%s", "")
 
-#define TDX_BUG_ON_1(__err, __fn, __rcx, __kvm)			\
-	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx", __rcx)
+#define TDX_BUG_ON_1(__err, __fn, a1, __kvm)			\
+	__TDX_BUG_ON(__err, #__fn, __kvm, ", " #a1 " 0x%llx", a1)
 
-#define TDX_BUG_ON_2(__err, __fn, __rcx, __rdx, __kvm)		\
-	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx, rdx 0x%llx", __rcx, __rdx)
+#define TDX_BUG_ON_2(__err, __fn, a1, a2, __kvm)	\
+	__TDX_BUG_ON(__err, #__fn, __kvm, ", " #a1 " 0x%llx, " #a2 " 0x%llx", a1, a2)
 
-#define TDX_BUG_ON_3(__err, __fn, __rcx, __rdx, __r8, __kvm)	\
-	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx, rdx 0x%llx, r8 0x%llx", __rcx, __rdx, __r8)
+#define TDX_BUG_ON_3(__err, __fn, a1, a2, a3, __kvm)	\
+	__TDX_BUG_ON(__err, #__fn, __kvm, ", " #a1 " 0x%llx, " #a2 ", 0x%llx, " #a3 " 0x%llx", \
+		     a1, a2, a3)
 
 
 bool enable_tdx __ro_after_init;
-- 
2.51.0.318.gd7df087d1a-goog


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 17/18] KVM: TDX: Assert that mmu_lock is held for write when removing S-EPT entries
  2025-08-29  0:06 [RFC PATCH v2 00/18] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (15 preceding siblings ...)
  2025-08-29  0:06 ` [RFC PATCH v2 16/18] KVM: TDX: Derive error argument names from the local variable names Sean Christopherson
@ 2025-08-29  0:06 ` Sean Christopherson
  2025-08-29  0:06 ` [RFC PATCH v2 18/18] KVM: TDX: Add macro to retry SEAMCALLs when forcing vCPUs out of guest Sean Christopherson
  17 siblings, 0 replies; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29  0:06 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng

Unconditionally assert that mmu_lock is held for write when removing S-EPT
entries, not just when removing S-EPT entries triggers certain conditions,
e.g. needs to do TDH_MEM_TRACK or kick vCPUs out of the guest.
Conditionally asserting implies that it's safe to hold mmu_lock for read
when those paths aren't hit, which is simply not true, as KVM doesn't
support removing S-EPT entries under read-lock.

Only two paths lead to remove_external_spte(), and both paths asserts that
mmu_lock is held for write (tdp_mmu_set_spte() via lockdep, and
handle_removed_pt() via KVM_BUG_ON()).

Deliberately leave lockdep assertions in the "no vCPUs" helpers to document
that wait_for_sept_zap is guarded by holding mmu_lock for write.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b73f260a55fd..aa740eeb1c2a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1714,8 +1714,6 @@ static void tdx_track(struct kvm *kvm)
 	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE))
 		return;
 
-	lockdep_assert_held_write(&kvm->mmu_lock);
-
 	err = tdh_mem_track(&kvm_tdx->td);
 	if (unlikely(tdx_operand_busy(err))) {
 		/* After no vCPUs enter, the second retry is expected to succeed */
@@ -1761,6 +1759,8 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	gpa_t gpa = gfn_to_gpa(gfn);
 	u64 err, entry, level_state;
 
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
 	/*
 	 * HKID is released after all private pages have been removed, and set
 	 * before any might be populated. Warn if zapping is attempted when
-- 
2.51.0.318.gd7df087d1a-goog


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH v2 18/18] KVM: TDX: Add macro to retry SEAMCALLs when forcing vCPUs out of guest
  2025-08-29  0:06 [RFC PATCH v2 00/18] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (16 preceding siblings ...)
  2025-08-29  0:06 ` [RFC PATCH v2 17/18] KVM: TDX: Assert that mmu_lock is held for write when removing S-EPT entries Sean Christopherson
@ 2025-08-29  0:06 ` Sean Christopherson
  17 siblings, 0 replies; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29  0:06 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng

Add a macro to handle kicking vCPUs out of the guest and retrying
SEAMCALLs on -EBUSY instead of providing small helpers to be used by each
SEAMCALL.  Wrapping the SEAMCALLs in a macro makes it a little harder to
tease out which SEAMCALL is being made, but significantly reduces the
amount of copy+paste code and makes it all but impossible to leave an
elevated wait_for_sept_zap.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 72 ++++++++++++++----------------------------
 1 file changed, 23 insertions(+), 49 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index aa740eeb1c2a..d6c9defad9cd 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -313,25 +313,24 @@ static void tdx_clear_page(struct page *page)
 	__mb();
 }
 
-static void tdx_no_vcpus_enter_start(struct kvm *kvm)
-{
-	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
-
-	lockdep_assert_held_write(&kvm->mmu_lock);
-
-	WRITE_ONCE(kvm_tdx->wait_for_sept_zap, true);
-
-	kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
-}
-
-static void tdx_no_vcpus_enter_stop(struct kvm *kvm)
-{
-	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
-
-	lockdep_assert_held_write(&kvm->mmu_lock);
-
-	WRITE_ONCE(kvm_tdx->wait_for_sept_zap, false);
-}
+#define tdh_do_no_vcpus(tdh_func, kvm, args...)					\
+({										\
+	struct kvm_tdx *__kvm_tdx = to_kvm_tdx(kvm);				\
+	u64 __err;								\
+										\
+	lockdep_assert_held_write(&kvm->mmu_lock);				\
+										\
+	__err = tdh_func(args);							\
+	if (unlikely(tdx_operand_busy(__err))) {				\
+		WRITE_ONCE(__kvm_tdx->wait_for_sept_zap, true);			\
+		kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);	\
+										\
+		__err = tdh_func(args);						\
+										\
+		WRITE_ONCE(__kvm_tdx->wait_for_sept_zap, false);		\
+	}									\
+	__err;									\
+})
 
 /* TDH.PHYMEM.PAGE.RECLAIM is allowed only when destroying the TD. */
 static int __tdx_reclaim_page(struct page *page)
@@ -1714,14 +1713,7 @@ static void tdx_track(struct kvm *kvm)
 	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE))
 		return;
 
-	err = tdh_mem_track(&kvm_tdx->td);
-	if (unlikely(tdx_operand_busy(err))) {
-		/* After no vCPUs enter, the second retry is expected to succeed */
-		tdx_no_vcpus_enter_start(kvm);
-		err = tdh_mem_track(&kvm_tdx->td);
-		tdx_no_vcpus_enter_stop(kvm);
-	}
-
+	err = tdh_do_no_vcpus(tdh_mem_track, kvm, &kvm_tdx->td);
 	TDX_BUG_ON(err, TDH_MEM_TRACK, kvm);
 
 	kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
@@ -1773,14 +1765,8 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
 		return;
 
-	err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
-	if (unlikely(tdx_operand_busy(err))) {
-		/* After no vCPUs enter, the second retry is expected to succeed */
-		tdx_no_vcpus_enter_start(kvm);
-		err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
-		tdx_no_vcpus_enter_stop(kvm);
-	}
-
+	err = tdh_do_no_vcpus(tdh_mem_range_block, kvm, &kvm_tdx->td, gpa,
+			      tdx_level, &entry, &level_state);
 	if (TDX_BUG_ON_2(err, TDH_MEM_RANGE_BLOCK, entry, level_state, kvm))
 		return;
 
@@ -1795,20 +1781,8 @@ static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	 * with other vcpu sept operation.
 	 * Race with TDH.VP.ENTER due to (0-step mitigation) and Guest TDCALLs.
 	 */
-	err = tdh_mem_page_remove(&kvm_tdx->td, gpa, tdx_level, &entry,
-				  &level_state);
-
-	if (unlikely(tdx_operand_busy(err))) {
-		/*
-		 * The second retry is expected to succeed after kicking off all
-		 * other vCPUs and prevent them from invoking TDH.VP.ENTER.
-		 */
-		tdx_no_vcpus_enter_start(kvm);
-		err = tdh_mem_page_remove(&kvm_tdx->td, gpa, tdx_level, &entry,
-					  &level_state);
-		tdx_no_vcpus_enter_stop(kvm);
-	}
-
+	err = tdh_do_no_vcpus(tdh_mem_page_remove, kvm, &kvm_tdx->td, gpa,
+			      tdx_level, &entry, &level_state);
 	if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_REMOVE, entry, level_state, kvm))
 		return;
 
-- 
2.51.0.318.gd7df087d1a-goog


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 01/18] KVM: TDX: Drop PROVE_MMU=y sanity check on to-be-populated mappings
  2025-08-29  0:06 ` [RFC PATCH v2 01/18] KVM: TDX: Drop PROVE_MMU=y sanity check on to-be-populated mappings Sean Christopherson
@ 2025-08-29  6:20   ` Binbin Wu
  0 siblings, 0 replies; 62+ messages in thread
From: Binbin Wu @ 2025-08-29  6:20 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng



On 8/29/2025 8:06 AM, Sean Christopherson wrote:
> Drop TDX's sanity check that an S-EPT mapping isn't zapped between creating
                                  ^
                                 should be M-EPT?

> said mapping and doing TDH.MEM.PAGE.ADD, as the check is simultaneously
> superfluous and incomplete.  Per commit 2608f1057601 ("KVM: x86/tdp_mmu:
> Add a helper function to walk down the TDP MMU"), the justification for
> introducing kvm_tdp_mmu_gpa_is_mapped() was to check that the target gfn
> was pre-populated, with a link that points to this snippet:
>
>   : > One small question:
>   : >
>   : > What if the memory region passed to KVM_TDX_INIT_MEM_REGION hasn't been pre-
>   : > populated?  If we want to make KVM_TDX_INIT_MEM_REGION work with these regions,
>   : > then we still need to do the real map.  Or we can make KVM_TDX_INIT_MEM_REGION
>   : > return error when it finds the region hasn't been pre-populated?
>   :
>   : Return an error.  I don't love the idea of bleeding so many TDX details into
>   : userspace, but I'm pretty sure that ship sailed a long, long time ago.
>
> But that justification makes little sense for the final code, as simply
> doing TDH.MEM.PAGE.ADD without a paranoid sanity check will return an error
> if the S-EPT mapping is invalid (as evidenced by the code being guarded
> with CONFIG_KVM_PROVE_MMU=y).

I think this also needs to be updated.
As Yan mentioned in https://lore.kernel.org/lkml/aK6+TQ0r1j5j2PCx@yzhao56-desk.sh.intel.com/
TDH.MEM.PAGE.ADD would succeed without error, but error is still can be detected
via the value of nr_premapped in the end.

>
> The sanity check is also incomplete in the sense that mmu_lock is dropped
> between the check and TDH.MEM.PAGE.ADD, i.e. will only detect KVM bugs that
> zap SPTEs in a very specific window.
>
> Removing the sanity check will allow removing kvm_tdp_mmu_gpa_is_mapped(),
> which has no business being exposed to vendor code.
>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/vmx/tdx.c | 14 --------------
>   1 file changed, 14 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 6784aaaced87..71da245d160f 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -3175,20 +3175,6 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>   	if (ret < 0)
>   		goto out;
>   
> -	/*
> -	 * The private mem cannot be zapped after kvm_tdp_map_page()
> -	 * because all paths are covered by slots_lock and the
> -	 * filemap invalidate lock.  Check that they are indeed enough.
> -	 */
> -	if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
> -		scoped_guard(read_lock, &kvm->mmu_lock) {
> -			if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa), kvm)) {
> -				ret = -EIO;
> -				goto out;
> -			}
> -		}
> -	}
> -
>   	ret = 0;
>   	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
>   			       src_page, &entry, &level_state);


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 12/18] KVM: TDX: Bug the VM if extended the initial measurement fails
  2025-08-29  0:06 ` [RFC PATCH v2 12/18] KVM: TDX: Bug the VM if extended the initial measurement fails Sean Christopherson
@ 2025-08-29  8:18   ` Yan Zhao
  2025-08-29 18:16     ` Edgecombe, Rick P
  0 siblings, 1 reply; 62+ messages in thread
From: Yan Zhao @ 2025-08-29  8:18 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Ira Weiny, Kai Huang,
	Michael Roth, Vishal Annapurve, Rick Edgecombe, Ackerley Tng

On Thu, Aug 28, 2025 at 05:06:12PM -0700, Sean Christopherson wrote:
> WARN and terminate the VM if TDH_MR_EXTEND fails, as extending the
> measurement should fail if and only if there is a KVM bug, or if the S-EPT
> mapping is invalid, and it should be impossibe for the S-EPT mappings to
> be removed between kvm_tdp_mmu_map_private_pfn() and tdh_mr_extend().
> 
> Holding slots_lock prevents zaps due to memslot updates,
> filemap_invalidate_lock() prevents zaps due to guest_memfd PUNCH_HOLE,
> and all usage of kvm_zap_gfn_range() is mutually exclusive with S-EPT
> entries that can be used for the initial image.  The call from sev.c is
> obviously mutually exclusive, TDX disallows KVM_X86_QUIRK_IGNORE_GUEST_PAT
> so same goes for kvm_noncoherent_dma_assignment_start_or_stop, and while
> __kvm_set_or_clear_apicv_inhibit() can likely be tripped while building the
> image, the APIC page has its own non-guest_memfd memslot and so can't be
> used for the initial image, which means that too is mutually exclusive.
> 
> Opportunistically switch to "goto" to jump around the measurement code,
> partly to make it clear that KVM needs to bail entirely if extending the
> measurement fails, partly in anticipation of reworking how and when
> TDH_MEM_PAGE_ADD is done.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 24 ++++++++++++++++--------
>  1 file changed, 16 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 06dd2861eba7..bc92e87a1dbb 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -3145,14 +3145,22 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  
>  	KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm);
>  
> -	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
> -		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
> -			err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry,
> -					    &level_state);
> -			if (err) {
> -				ret = -EIO;
> -				break;
> -			}
> +	if (!(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION))
> +		goto out;
> +
> +	/*
> +	 * Note, MR.EXTEND can fail if the S-EPT mapping is somehow removed
> +	 * between mapping the pfn and now, but slots_lock prevents memslot
> +	 * updates, filemap_invalidate_lock() prevents guest_memfd updates,
> +	 * mmu_notifier events can't reach S-EPT entries, and KVM's internal
> +	 * zapping flows are mutually exclusive with S-EPT mappings.
> +	 */
> +	for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
> +		err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state);
> +		if (KVM_BUG_ON(err, kvm)) {
I suspect tdh_mr_extend() running on one vCPU may contend with
tdh_vp_create()/tdh_vp_addcx()/tdh_vp_init*()/tdh_vp_rd()/tdh_vp_wr()/
tdh_mng_rd()/tdh_vp_flush() on other vCPUs, if userspace invokes ioctl
KVM_TDX_INIT_MEM_REGION on one vCPU while initializing other vCPUs.

It's similar to the analysis of contention of tdh_mem_page_add() [1], as
both tdh_mr_extend() and tdh_mem_page_add() acquire exclusive lock on
resource TDR.

I'll try to write a test to verify it and come back to you.

[1] https://lore.kernel.org/kvm/20250113021050.18828-1-yan.y.zhao@intel.com/ 
> +			pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state);
> +			ret = -EIO;
> +			goto out;
>  		}
>  	}
>  
> -- 
> 2.51.0.318.gd7df087d1a-goog
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in S-EPT management
  2025-08-29  0:06 ` [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in S-EPT management Sean Christopherson
@ 2025-08-29  8:36   ` Binbin Wu
  2025-08-29 19:53   ` Edgecombe, Rick P
  1 sibling, 0 replies; 62+ messages in thread
From: Binbin Wu @ 2025-08-29  8:36 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng



On 8/29/2025 8:06 AM, Sean Christopherson wrote:
> From: Yan Zhao <yan.y.zhao@intel.com>
>
> Don't explicitly pin pages when mapping pages into the S-EPT, guest_memfd
> doesn't support page migration in any capacity, i.e. there are no migrate
> callbacks because guest_memfd pages *can't* be migrated.  See the WARN in
> kvm_gmem_migrate_folio().
>
> Eliminating TDX's explicit pinning will also enable guest_memfd to support
> in-place conversion between shared and private memory[1][2].  Because KVM
> cannot distinguish between speculative/transient refcounts and the
> intentional refcount for TDX on private pages[3], failing to release
> private page refcount in TDX could cause guest_memfd to indefinitely wait
> on decreasing the refcount for the splitting.
>
> Under normal conditions, not holding an extra page refcount in TDX is safe
> because guest_memfd ensures pages are retained until its invalidation
> notification to KVM MMU is completed. However, if there're bugs in KVM/TDX
> module, not holding an extra refcount when a page is mapped in S-EPT could
> result in a page being released from guest_memfd while still mapped in the
> S-EPT.  But, doing work to make a fatal error slightly less fatal is a net
> negative when that extra work adds complexity and confusion.
>
> Several approaches were considered to address the refcount issue, including
>    - Attempting to modify the KVM unmap operation to return a failure,
>      which was deemed too complex and potentially incorrect[4].
>   - Increasing the folio reference count only upon S-EPT zapping failure[5].
>   - Use page flags or page_ext to indicate a page is still used by TDX[6],
>     which does not work for HVO (HugeTLB Vmemmap Optimization).
>    - Setting HWPOISON bit or leveraging folio_set_hugetlb_hwpoison()[7].
Nit: alignment issue with the bullets.

Otherwise,
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>

>
> Due to the complexity or inappropriateness of these approaches, and the
> fact that S-EPT zapping failure is currently only possible when there are
> bugs in the KVM or TDX module, which is very rare in a production kernel,
> a straightforward approach of simply not holding the page reference count
> in TDX was chosen[8].
>
> When S-EPT zapping errors occur, KVM_BUG_ON() is invoked to kick off all
> vCPUs and mark the VM as dead. Although there is a potential window that a
> private page mapped in the S-EPT could be reallocated and used outside the
> VM, the loud warning from KVM_BUG_ON() should provide sufficient debug
> information. To be robust against bugs, the user can enable panic_on_warn
> as normal.
>
> Link: https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com [1]
> Link: https://youtu.be/UnBKahkAon4 [2]
> Link: https://lore.kernel.org/all/CAGtprH_ypohFy9TOJ8Emm_roT4XbQUtLKZNFcM6Fr+fhTFkE0Q@mail.gmail.com [3]
> Link: https://lore.kernel.org/all/aEEEJbTzlncbRaRA@yzhao56-desk.sh.intel.com [4]
> Link: https://lore.kernel.org/all/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com [5]
> Link: https://lore.kernel.org/all/aFkeBtuNBN1RrDAJ@yzhao56-desk.sh.intel.com [6]
> Link: https://lore.kernel.org/all/diqzy0tikran.fsf@ackerleytng-ctop.c.googlers.com [7]
> Link: https://lore.kernel.org/all/53ea5239f8ef9d8df9af593647243c10435fd219.camel@intel.com [8]
> Suggested-by: Vishal Annapurve <vannapurve@google.com>
> Suggested-by: Ackerley Tng <ackerleytng@google.com>
> Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> [sean: extract out of hugepage series, massage changelog accordingly]
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/vmx/tdx.c | 28 ++++------------------------
>   1 file changed, 4 insertions(+), 24 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index c83e1ff02827..f24f8635b433 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1586,29 +1586,22 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
>   	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
>   }
>   
> -static void tdx_unpin(struct kvm *kvm, struct page *page)
> -{
> -	put_page(page);
> -}
> -
>   static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> -			    enum pg_level level, struct page *page)
> +			    enum pg_level level, kvm_pfn_t pfn)
>   {
>   	int tdx_level = pg_level_to_tdx_sept_level(level);
>   	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	struct page *page = pfn_to_page(pfn);
>   	gpa_t gpa = gfn_to_gpa(gfn);
>   	u64 entry, level_state;
>   	u64 err;
>   
>   	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
> -	if (unlikely(tdx_operand_busy(err))) {
> -		tdx_unpin(kvm, page);
> +	if (unlikely(tdx_operand_busy(err)))
>   		return -EBUSY;
> -	}
>   
>   	if (KVM_BUG_ON(err, kvm)) {
>   		pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
> -		tdx_unpin(kvm, page);
>   		return -EIO;
>   	}
>   
> @@ -1642,29 +1635,18 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>   				     enum pg_level level, kvm_pfn_t pfn)
>   {
>   	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> -	struct page *page = pfn_to_page(pfn);
>   
>   	/* TODO: handle large pages. */
>   	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
>   		return -EINVAL;
>   
> -	/*
> -	 * Because guest_memfd doesn't support page migration with
> -	 * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
> -	 * migration.  Until guest_memfd supports page migration, prevent page
> -	 * migration.
> -	 * TODO: Once guest_memfd introduces callback on page migration,
> -	 * implement it and remove get_page/put_page().
> -	 */
> -	get_page(page);
> -
>   	/*
>   	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
>   	 * barrier in tdx_td_finalize().
>   	 */
>   	smp_rmb();
>   	if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
> -		return tdx_mem_page_aug(kvm, gfn, level, page);
> +		return tdx_mem_page_aug(kvm, gfn, level, pfn);
>   
>   	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
>   }
> @@ -1715,7 +1697,6 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
>   		return -EIO;
>   	}
>   	tdx_clear_page(page);
> -	tdx_unpin(kvm, page);
>   	return 0;
>   }
>   
> @@ -1795,7 +1776,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>   	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
>   	    !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
>   		atomic64_dec(&kvm_tdx->nr_premapped);
> -		tdx_unpin(kvm, page);
>   		return 0;
>   	}
>   


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 15/18] KVM: TDX: Combine KVM_BUG_ON + pr_tdx_error() into TDX_BUG_ON()
  2025-08-29  0:06 ` [RFC PATCH v2 15/18] KVM: TDX: Combine KVM_BUG_ON + pr_tdx_error() into TDX_BUG_ON() Sean Christopherson
@ 2025-08-29  9:03   ` Binbin Wu
  2025-08-29 14:19     ` Sean Christopherson
  2025-09-02 18:55   ` Edgecombe, Rick P
  1 sibling, 1 reply; 62+ messages in thread
From: Binbin Wu @ 2025-08-29  9:03 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng



On 8/29/2025 8:06 AM, Sean Christopherson wrote:
> Add TDX_BUG_ON() macros (with varying numbers of arguments) to deduplicate
> the myriad flows that do KVM_BUG_ON()/WARN_ON_ONCE() followed by a call to
> pr_tdx_error().  In addition to reducing boilerplate copy+paste code, this
> also helps ensure that KVM provides consistent handling of SEAMCALL errors.
>
> Opportunistically convert a handful of bare WARN_ON_ONCE() paths to the
> equivalent of KVM_BUG_ON(), i.e. have them terminate the VM.  If a SEAMCALL
> error is fatal enough to WARN on, it's fatal enough to terminate the TD.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/vmx/tdx.c | 114 +++++++++++++++++------------------------
>   1 file changed, 47 insertions(+), 67 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index aa6d88629dae..df9b4496cd01 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -24,20 +24,32 @@
>   #undef pr_fmt
>   #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>   
> -#define pr_tdx_error(__fn, __err)	\
> -	pr_err_ratelimited("SEAMCALL %s failed: 0x%llx\n", #__fn, __err)
> +#define __TDX_BUG_ON(__err, __f, __kvm, __fmt, __args...)			\
> +({										\
> +	struct kvm *_kvm = (__kvm);						\
> +	bool __ret = !!(__err);							\
> +										\
> +	if (WARN_ON_ONCE(__ret && (!_kvm || !_kvm->vm_bugged))) {		\
> +		if (_kvm)							\
> +			kvm_vm_bugged(_kvm);					\
> +		pr_err_ratelimited("SEAMCALL " __f " failed: 0x%llx" __fmt "\n",\
> +				   __err,  __args);				\
> +	}									\
> +	unlikely(__ret);							\
> +})
>   
> -#define __pr_tdx_error_N(__fn_str, __err, __fmt, ...)		\
> -	pr_err_ratelimited("SEAMCALL " __fn_str " failed: 0x%llx, " __fmt,  __err,  __VA_ARGS__)
> +#define TDX_BUG_ON(__err, __fn, __kvm)				\
> +	__TDX_BUG_ON(__err, #__fn, __kvm, "%s", "")
>   
> -#define pr_tdx_error_1(__fn, __err, __rcx)		\
> -	__pr_tdx_error_N(#__fn, __err, "rcx 0x%llx\n", __rcx)
> +#define TDX_BUG_ON_1(__err, __fn, __rcx, __kvm)			\
> +	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx", __rcx)
>   
> -#define pr_tdx_error_2(__fn, __err, __rcx, __rdx)	\
> -	__pr_tdx_error_N(#__fn, __err, "rcx 0x%llx, rdx 0x%llx\n", __rcx, __rdx)
> +#define TDX_BUG_ON_2(__err, __fn, __rcx, __rdx, __kvm)		\
> +	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx, rdx 0x%llx", __rcx, __rdx)
> +
> +#define TDX_BUG_ON_3(__err, __fn, __rcx, __rdx, __r8, __kvm)	\
> +	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx, rdx 0x%llx, r8 0x%llx", __rcx, __rdx, __r8)
>   
> -#define pr_tdx_error_3(__fn, __err, __rcx, __rdx, __r8)	\
> -	__pr_tdx_error_N(#__fn, __err, "rcx 0x%llx, rdx 0x%llx, r8 0x%llx\n", __rcx, __rdx, __r8)

I thought you would use the format Rick proposed in
https://lore.kernel.org/all/9e55a0e767317d20fc45575c4ed6dafa863e1ca0.camel@intel.com/
     #define TDX_BUG_ON_2(__err, __fn, arg1, arg2, __kvm)        \
         __TDX_BUG_ON(__err, #__fn, __kvm, ", " #arg1 " 0x%llx, " #arg2 "
     0x%llx", arg1, arg2)

     so you get: entry: 0x00 level:0xF00

No?

[...]

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 06/18] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition
  2025-08-29  0:06 ` [RFC PATCH v2 06/18] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition Sean Christopherson
@ 2025-08-29  9:40   ` Binbin Wu
  2025-08-29 16:58   ` Ira Weiny
  2025-08-29 19:59   ` Edgecombe, Rick P
  2 siblings, 0 replies; 62+ messages in thread
From: Binbin Wu @ 2025-08-29  9:40 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng



On 8/29/2025 8:06 AM, Sean Christopherson wrote:
> Return -EIO when a KVM_BUG_ON() is tripped, as KVM's ABI is to return -EIO
> when a VM has been killed due to a KVM bug, not -EINVAL.  Note, many (all?)
> of the affected paths never propagate the error code to userspace, i.e.
> this is about internal consistency more than anything else.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>

> ---
>   arch/x86/kvm/vmx/tdx.c | 12 ++++++------
>   1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index f24f8635b433..50a9d81dad53 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1624,7 +1624,7 @@ static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
>   	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>   
>   	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
> -		return -EINVAL;
> +		return -EIO;
>   
>   	/* nr_premapped will be decreased when tdh_mem_page_add() is called. */
>   	atomic64_inc(&kvm_tdx->nr_premapped);
> @@ -1638,7 +1638,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>   
>   	/* TODO: handle large pages. */
>   	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> -		return -EINVAL;
> +		return -EIO;
>   
>   	/*
>   	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
> @@ -1661,10 +1661,10 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
>   
>   	/* TODO: handle large pages. */
>   	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> -		return -EINVAL;
> +		return -EIO;
>   
>   	if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
> -		return -EINVAL;
> +		return -EIO;
>   
>   	/*
>   	 * When zapping private page, write lock is held. So no race condition
> @@ -1849,7 +1849,7 @@ static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
>   	 * and slot move/deletion.
>   	 */
>   	if (KVM_BUG_ON(is_hkid_assigned(kvm_tdx), kvm))
> -		return -EINVAL;
> +		return -EIO;
>   
>   	/*
>   	 * The HKID assigned to this TD was already freed and cache was
> @@ -1870,7 +1870,7 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
>   	 * there can't be anything populated in the private EPT.
>   	 */
>   	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
> -		return -EINVAL;
> +		return -EIO;
>   
>   	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
>   	if (ret <= 0)


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 07/18] KVM: TDX: Fold tdx_sept_drop_private_spte() into tdx_sept_remove_private_spte()
  2025-08-29  0:06 ` [RFC PATCH v2 07/18] KVM: TDX: Fold tdx_sept_drop_private_spte() into tdx_sept_remove_private_spte() Sean Christopherson
@ 2025-08-29  9:49   ` Binbin Wu
  0 siblings, 0 replies; 62+ messages in thread
From: Binbin Wu @ 2025-08-29  9:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Ira Weiny, Kai Huang,
	Michael Roth, Yan Zhao, Vishal Annapurve, Rick Edgecombe,
	Ackerley Tng



On 8/29/2025 8:06 AM, Sean Christopherson wrote:
> Fold tdx_sept_drop_private_spte() into tdx_sept_remove_private_spte() to
> avoid having to differnatiate between "zap", "drop", and "remove", and to
> eliminate dead code due to redundant checks, e.g. on an HKID being
> assigned.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>

> ---
>   arch/x86/kvm/vmx/tdx.c | 90 +++++++++++++++++++-----------------------
>   1 file changed, 40 insertions(+), 50 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 50a9d81dad53..8cb6a2627eb2 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1651,55 +1651,6 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>   	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
>   }
>   
> -static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> -				      enum pg_level level, struct page *page)
> -{
> -	int tdx_level = pg_level_to_tdx_sept_level(level);
> -	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> -	gpa_t gpa = gfn_to_gpa(gfn);
> -	u64 err, entry, level_state;
> -
> -	/* TODO: handle large pages. */
> -	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> -		return -EIO;
> -
> -	if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
> -		return -EIO;
> -
> -	/*
> -	 * When zapping private page, write lock is held. So no race condition
> -	 * with other vcpu sept operation.
> -	 * Race with TDH.VP.ENTER due to (0-step mitigation) and Guest TDCALLs.
> -	 */
> -	err = tdh_mem_page_remove(&kvm_tdx->td, gpa, tdx_level, &entry,
> -				  &level_state);
> -
> -	if (unlikely(tdx_operand_busy(err))) {
> -		/*
> -		 * The second retry is expected to succeed after kicking off all
> -		 * other vCPUs and prevent them from invoking TDH.VP.ENTER.
> -		 */
> -		tdx_no_vcpus_enter_start(kvm);
> -		err = tdh_mem_page_remove(&kvm_tdx->td, gpa, tdx_level, &entry,
> -					  &level_state);
> -		tdx_no_vcpus_enter_stop(kvm);
> -	}
> -
> -	if (KVM_BUG_ON(err, kvm)) {
> -		pr_tdx_error_2(TDH_MEM_PAGE_REMOVE, err, entry, level_state);
> -		return -EIO;
> -	}
> -
> -	err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
> -
> -	if (KVM_BUG_ON(err, kvm)) {
> -		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
> -		return -EIO;
> -	}
> -	tdx_clear_page(page);
> -	return 0;
> -}
> -
>   static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
>   				     enum pg_level level, void *private_spt)
>   {
> @@ -1861,7 +1812,11 @@ static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
>   static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
>   					enum pg_level level, kvm_pfn_t pfn)
>   {
> +	int tdx_level = pg_level_to_tdx_sept_level(level);
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>   	struct page *page = pfn_to_page(pfn);
> +	gpa_t gpa = gfn_to_gpa(gfn);
> +	u64 err, entry, level_state;
>   	int ret;
>   
>   	/*
> @@ -1872,6 +1827,10 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
>   	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
>   		return -EIO;
>   
> +	/* TODO: handle large pages. */
> +	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> +		return -EIO;
> +
>   	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
>   	if (ret <= 0)
>   		return ret;
> @@ -1882,7 +1841,38 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
>   	 */
>   	tdx_track(kvm);
>   
> -	return tdx_sept_drop_private_spte(kvm, gfn, level, page);
> +	/*
> +	 * When zapping private page, write lock is held. So no race condition
> +	 * with other vcpu sept operation.
> +	 * Race with TDH.VP.ENTER due to (0-step mitigation) and Guest TDCALLs.
> +	 */
> +	err = tdh_mem_page_remove(&kvm_tdx->td, gpa, tdx_level, &entry,
> +				  &level_state);
> +
> +	if (unlikely(tdx_operand_busy(err))) {
> +		/*
> +		 * The second retry is expected to succeed after kicking off all
> +		 * other vCPUs and prevent them from invoking TDH.VP.ENTER.
> +		 */
> +		tdx_no_vcpus_enter_start(kvm);
> +		err = tdh_mem_page_remove(&kvm_tdx->td, gpa, tdx_level, &entry,
> +					  &level_state);
> +		tdx_no_vcpus_enter_stop(kvm);
> +	}
> +
> +	if (KVM_BUG_ON(err, kvm)) {
> +		pr_tdx_error_2(TDH_MEM_PAGE_REMOVE, err, entry, level_state);
> +		return -EIO;
> +	}
> +
> +	err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
> +	if (KVM_BUG_ON(err, kvm)) {
> +		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
> +		return -EIO;
> +	}
> +
> +	tdx_clear_page(page);
> +	return 0;
>   }
>   
>   void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 08/18] KVM: x86/mmu: Drop the return code from kvm_x86_ops.remove_external_spte()
  2025-08-29  0:06 ` [RFC PATCH v2 08/18] KVM: x86/mmu: Drop the return code from kvm_x86_ops.remove_external_spte() Sean Christopherson
@ 2025-08-29  9:52   ` Binbin Wu
  0 siblings, 0 replies; 62+ messages in thread
From: Binbin Wu @ 2025-08-29  9:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Ira Weiny, Kai Huang,
	Michael Roth, Yan Zhao, Vishal Annapurve, Rick Edgecombe,
	Ackerley Tng



On 8/29/2025 8:06 AM, Sean Christopherson wrote:
> Drop the return code from kvm_x86_ops.remove_external_spte(), a.k.a.
> tdx_sept_remove_private_spte(), as KVM simply does a KVM_BUG_ON() failure,
> and that KVM_BUG_ON() is redundant since all error paths in TDX also do a
> KVM_BUG_ON().
>
> Opportunistically pass the spte instead of the pfn, as the API is clearly
> about removing an spte.
>
> Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>

> ---
>   arch/x86/include/asm/kvm_host.h |  4 ++--
>   arch/x86/kvm/mmu/tdp_mmu.c      |  8 ++------
>   arch/x86/kvm/vmx/tdx.c          | 17 ++++++++---------
>   3 files changed, 12 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 0d3cc0fc27af..d0a8404a6b8f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1852,8 +1852,8 @@ struct kvm_x86_ops {
>   				 void *external_spt);
>   
>   	/* Update external page table from spte getting removed, and flush TLB. */
> -	int (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> -				    kvm_pfn_t pfn_for_gfn);
> +	void (*remove_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +				     u64 spte);
>   
>   	bool (*has_wbinvd_exit)(void);
>   
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 3ea2dd64ce72..78ee085f7cbc 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -362,9 +362,6 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>   static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
>   				 int level)
>   {
> -	kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
> -	int ret;
> -
>   	/*
>   	 * External (TDX) SPTEs are limited to PG_LEVEL_4K, and external
>   	 * PTs are removed in a special order, involving free_external_spt().
> @@ -377,9 +374,8 @@ static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
>   
>   	/* Zapping leaf spte is allowed only when write lock is held. */
>   	lockdep_assert_held_write(&kvm->mmu_lock);
> -	/* Because write lock is held, operation should success. */
> -	ret = kvm_x86_call(remove_external_spte)(kvm, gfn, level, old_pfn);
> -	KVM_BUG_ON(ret, kvm);
> +
> +	kvm_x86_call(remove_external_spte)(kvm, gfn, level, old_spte);
>   }
>   
>   /**
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 8cb6a2627eb2..07f9ad1fbfb6 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1809,12 +1809,12 @@ static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
>   	return tdx_reclaim_page(virt_to_page(private_spt));
>   }
>   
> -static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> -					enum pg_level level, kvm_pfn_t pfn)
> +static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> +					 enum pg_level level, u64 spte)
>   {
> +	struct page *page = pfn_to_page(spte_to_pfn(spte));
>   	int tdx_level = pg_level_to_tdx_sept_level(level);
>   	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> -	struct page *page = pfn_to_page(pfn);
>   	gpa_t gpa = gfn_to_gpa(gfn);
>   	u64 err, entry, level_state;
>   	int ret;
> @@ -1825,15 +1825,15 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
>   	 * there can't be anything populated in the private EPT.
>   	 */
>   	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
> -		return -EIO;
> +		return;
>   
>   	/* TODO: handle large pages. */
>   	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> -		return -EIO;
> +		return;
>   
>   	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
>   	if (ret <= 0)
> -		return ret;
> +		return;
>   
>   	/*
>   	 * TDX requires TLB tracking before dropping private page.  Do
> @@ -1862,17 +1862,16 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
>   
>   	if (KVM_BUG_ON(err, kvm)) {
>   		pr_tdx_error_2(TDH_MEM_PAGE_REMOVE, err, entry, level_state);
> -		return -EIO;
> +		return;
>   	}
>   
>   	err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
>   	if (KVM_BUG_ON(err, kvm)) {
>   		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
> -		return -EIO;
> +		return;
>   	}
>   
>   	tdx_clear_page(page);
> -	return 0;
>   }
>   
>   void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 09/18] KVM: TDX: Avoid a double-KVM_BUG_ON() in tdx_sept_zap_private_spte()
  2025-08-29  0:06 ` [RFC PATCH v2 09/18] KVM: TDX: Avoid a double-KVM_BUG_ON() in tdx_sept_zap_private_spte() Sean Christopherson
@ 2025-08-29  9:52   ` Binbin Wu
  0 siblings, 0 replies; 62+ messages in thread
From: Binbin Wu @ 2025-08-29  9:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Ira Weiny, Kai Huang,
	Michael Roth, Yan Zhao, Vishal Annapurve, Rick Edgecombe,
	Ackerley Tng



On 8/29/2025 8:06 AM, Sean Christopherson wrote:
> Return -EIO immediately from tdx_sept_zap_private_spte() if the number of
> to-be-added pages underflows, so that the following "KVM_BUG_ON(err, kvm)"
> isn't also triggered.  Isolating the check from the "is premap error"
> if-statement will also allow adding a lockdep assertion that premap errors
> are encountered if and only if slots_lock is held.
>
> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>

> ---
>   arch/x86/kvm/vmx/tdx.c | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 07f9ad1fbfb6..cafd618ca43c 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1724,8 +1724,10 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>   		err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
>   		tdx_no_vcpus_enter_stop(kvm);
>   	}
> -	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
> -	    !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
> +	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
> +		if (KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm))
> +			return -EIO;
> +
>   		atomic64_dec(&kvm_tdx->nr_premapped);
>   		return 0;
>   	}


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 10/18] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent
  2025-08-29  0:06 ` [RFC PATCH v2 10/18] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent Sean Christopherson
@ 2025-08-29 10:06   ` Binbin Wu
  0 siblings, 0 replies; 62+ messages in thread
From: Binbin Wu @ 2025-08-29 10:06 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Ira Weiny, Kai Huang,
	Michael Roth, Yan Zhao, Vishal Annapurve, Rick Edgecombe,
	Ackerley Tng



On 8/29/2025 8:06 AM, Sean Christopherson wrote:
> Use atomic64_dec_return() when decrementing the number of "pre-mapped"
> S-EPT pages to ensure that the count can't go negative without KVM
> noticing.  In theory, checking for '0' and then decrementing in a separate
> operation could miss a 0=>-1 transition.  In practice, such a condition is
> impossible because nr_premapped is protected by slots_lock, i.e. doesn't
> actually need to be an atomic (that wart will be addressed shortly).
>
> Don't bother trying to keep the count non-negative, as the KVM_BUG_ON()
> ensures the VM is dead, i.e. there's no point in trying to limp along.
>
> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>

> ---
>   arch/x86/kvm/vmx/tdx.c | 6 ++----
>   1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index cafd618ca43c..fe0815d542e3 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1725,10 +1725,9 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>   		tdx_no_vcpus_enter_stop(kvm);
>   	}
>   	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
> -		if (KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm))
> +		if (KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm))
>   			return -EIO;
>   
> -		atomic64_dec(&kvm_tdx->nr_premapped);
>   		return 0;
>   	}
>   
> @@ -3151,8 +3150,7 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>   		goto out;
>   	}
>   
> -	if (!KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm))
> -		atomic64_dec(&kvm_tdx->nr_premapped);
> +	KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm);
>   
>   	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
>   		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 15/18] KVM: TDX: Combine KVM_BUG_ON + pr_tdx_error() into TDX_BUG_ON()
  2025-08-29  9:03   ` Binbin Wu
@ 2025-08-29 14:19     ` Sean Christopherson
  2025-09-01  1:46       ` Binbin Wu
  0 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29 14:19 UTC (permalink / raw)
  To: Binbin Wu
  Cc: Paolo Bonzini, kvm, linux-kernel, Ira Weiny, Kai Huang,
	Michael Roth, Yan Zhao, Vishal Annapurve, Rick Edgecombe,
	Ackerley Tng

On Fri, Aug 29, 2025, Binbin Wu wrote:
> On 8/29/2025 8:06 AM, Sean Christopherson wrote:
> > Add TDX_BUG_ON() macros (with varying numbers of arguments) to deduplicate
> > the myriad flows that do KVM_BUG_ON()/WARN_ON_ONCE() followed by a call to
> > pr_tdx_error().  In addition to reducing boilerplate copy+paste code, this
> > also helps ensure that KVM provides consistent handling of SEAMCALL errors.
> > 
> > Opportunistically convert a handful of bare WARN_ON_ONCE() paths to the
> > equivalent of KVM_BUG_ON(), i.e. have them terminate the VM.  If a SEAMCALL
> > error is fatal enough to WARN on, it's fatal enough to terminate the TD.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >   arch/x86/kvm/vmx/tdx.c | 114 +++++++++++++++++------------------------
> >   1 file changed, 47 insertions(+), 67 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index aa6d88629dae..df9b4496cd01 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -24,20 +24,32 @@
> >   #undef pr_fmt
> >   #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > -#define pr_tdx_error(__fn, __err)	\
> > -	pr_err_ratelimited("SEAMCALL %s failed: 0x%llx\n", #__fn, __err)
> > +#define __TDX_BUG_ON(__err, __f, __kvm, __fmt, __args...)			\
> > +({										\
> > +	struct kvm *_kvm = (__kvm);						\
> > +	bool __ret = !!(__err);							\
> > +										\
> > +	if (WARN_ON_ONCE(__ret && (!_kvm || !_kvm->vm_bugged))) {		\
> > +		if (_kvm)							\
> > +			kvm_vm_bugged(_kvm);					\
> > +		pr_err_ratelimited("SEAMCALL " __f " failed: 0x%llx" __fmt "\n",\
> > +				   __err,  __args);				\
> > +	}									\
> > +	unlikely(__ret);							\
> > +})
> > -#define __pr_tdx_error_N(__fn_str, __err, __fmt, ...)		\
> > -	pr_err_ratelimited("SEAMCALL " __fn_str " failed: 0x%llx, " __fmt,  __err,  __VA_ARGS__)
> > +#define TDX_BUG_ON(__err, __fn, __kvm)				\
> > +	__TDX_BUG_ON(__err, #__fn, __kvm, "%s", "")
> > -#define pr_tdx_error_1(__fn, __err, __rcx)		\
> > -	__pr_tdx_error_N(#__fn, __err, "rcx 0x%llx\n", __rcx)
> > +#define TDX_BUG_ON_1(__err, __fn, __rcx, __kvm)			\
> > +	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx", __rcx)
> > -#define pr_tdx_error_2(__fn, __err, __rcx, __rdx)	\
> > -	__pr_tdx_error_N(#__fn, __err, "rcx 0x%llx, rdx 0x%llx\n", __rcx, __rdx)
> > +#define TDX_BUG_ON_2(__err, __fn, __rcx, __rdx, __kvm)		\
> > +	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx, rdx 0x%llx", __rcx, __rdx)
> > +
> > +#define TDX_BUG_ON_3(__err, __fn, __rcx, __rdx, __r8, __kvm)	\
> > +	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx, rdx 0x%llx, r8 0x%llx", __rcx, __rdx, __r8)
> > -#define pr_tdx_error_3(__fn, __err, __rcx, __rdx, __r8)	\
> > -	__pr_tdx_error_N(#__fn, __err, "rcx 0x%llx, rdx 0x%llx, r8 0x%llx\n", __rcx, __rdx, __r8)
> 
> I thought you would use the format Rick proposed in
> https://lore.kernel.org/all/9e55a0e767317d20fc45575c4ed6dafa863e1ca0.camel@intel.com/
>     #define TDX_BUG_ON_2(__err, __fn, arg1, arg2, __kvm)        \
>         __TDX_BUG_ON(__err, #__fn, __kvm, ", " #arg1 " 0x%llx, " #arg2 "
>     0x%llx", arg1, arg2)
> 
>     so you get: entry: 0x00 level:0xF00
> 
> No?

Ya, see the next patch :-)

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 06/18] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition
  2025-08-29  0:06 ` [RFC PATCH v2 06/18] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition Sean Christopherson
  2025-08-29  9:40   ` Binbin Wu
@ 2025-08-29 16:58   ` Ira Weiny
  2025-08-29 19:59   ` Edgecombe, Rick P
  2 siblings, 0 replies; 62+ messages in thread
From: Ira Weiny @ 2025-08-29 16:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Ira Weiny, Kai Huang, Michael Roth, Yan Zhao,
	Vishal Annapurve, Rick Edgecombe, Ackerley Tng

Sean Christopherson wrote:
> Return -EIO when a KVM_BUG_ON() is tripped, as KVM's ABI is to return -EIO
> when a VM has been killed due to a KVM bug, not -EINVAL.  Note, many (all?)
> of the affected paths never propagate the error code to userspace, i.e.
> this is about internal consistency more than anything else.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

[snip]

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 12/18] KVM: TDX: Bug the VM if extended the initial measurement fails
  2025-08-29  8:18   ` Yan Zhao
@ 2025-08-29 18:16     ` Edgecombe, Rick P
  2025-08-29 20:11       ` Sean Christopherson
  0 siblings, 1 reply; 62+ messages in thread
From: Edgecombe, Rick P @ 2025-08-29 18:16 UTC (permalink / raw)
  To: seanjc@google.com, Zhao, Yan Y
  Cc: Huang, Kai, ackerleytng@google.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Weiny, Ira, kvm@vger.kernel.org,
	michael.roth@amd.com, pbonzini@redhat.com

On Fri, 2025-08-29 at 16:18 +0800, Yan Zhao wrote:
> > +	/*
> > +	 * Note, MR.EXTEND can fail if the S-EPT mapping is somehow removed
> > +	 * between mapping the pfn and now, but slots_lock prevents memslot
> > +	 * updates, filemap_invalidate_lock() prevents guest_memfd updates,
> > +	 * mmu_notifier events can't reach S-EPT entries, and KVM's
> > internal
> > +	 * zapping flows are mutually exclusive with S-EPT mappings.
> > +	 */
> > +	for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
> > +		err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry,
> > &level_state);
> > +		if (KVM_BUG_ON(err, kvm)) {
> I suspect tdh_mr_extend() running on one vCPU may contend with
> tdh_vp_create()/tdh_vp_addcx()/tdh_vp_init*()/tdh_vp_rd()/tdh_vp_wr()/
> tdh_mng_rd()/tdh_vp_flush() on other vCPUs, if userspace invokes ioctl
> KVM_TDX_INIT_MEM_REGION on one vCPU while initializing other vCPUs.
> 
> It's similar to the analysis of contention of tdh_mem_page_add() [1], as
> both tdh_mr_extend() and tdh_mem_page_add() acquire exclusive lock on
> resource TDR.
> 
> I'll try to write a test to verify it and come back to you.

I'm seeing the same thing in the TDX module. It could fail because of contention
controllable from userspace. So the KVM_BUG_ON() is not appropriate.

Today though if tdh_mr_extend() fails because of contention then the TD is
essentially dead anyway. Trying to redo KVM_TDX_INIT_MEM_REGION will fail. The
M-EPT fault could be spurious but the second tdh_mem_page_add() would return an
error and never get back to the tdh_mr_extend().

The version in this patch can't recover for a different reason. That is 
kvm_tdp_mmu_map_private_pfn() doesn't handle spurious faults, so I'd say just
drop the KVM_BUG_ON(), and try to handle the contention in a separate effort.

I guess the two approaches could be to make KVM_TDX_INIT_MEM_REGION more robust,
or prevent the contention. For the latter case:
tdh_vp_create()/tdh_vp_addcx()/tdh_vp_init*()/tdh_vp_rd()/tdh_vp_wr()
...I think we could just take slots_lock during KVM_TDX_INIT_VCPU and
KVM_TDX_GET_CPUID.

For tdh_vp_flush() the vcpu_load() in kvm_arch_vcpu_ioctl() could be hard to
handle.

So I'd think maybe to look towards making KVM_TDX_INIT_MEM_REGION more robust,
which would mean the eventual solution wouldn't have ABI concerns by later
blocking things that used to be allowed.

Maybe having kvm_tdp_mmu_map_private_pfn() return success for spurious faults is
enough. But this is all for a case that userspace isn't expected to actually
hit, so seems like something that could be kicked down the road easily.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 02/18] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU
  2025-08-29  0:06 ` [RFC PATCH v2 02/18] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU Sean Christopherson
@ 2025-08-29 18:34   ` Edgecombe, Rick P
  2025-08-29 20:27     ` Sean Christopherson
  0 siblings, 1 reply; 62+ messages in thread
From: Edgecombe, Rick P @ 2025-08-29 18:34 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com
  Cc: Huang, Kai, ackerleytng@google.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, Weiny, Ira,
	kvm@vger.kernel.org, michael.roth@amd.com

On Thu, 2025-08-28 at 17:06 -0700, Sean Christopherson wrote:
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4994,6 +4994,65 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>  	return min(range->size, end - range->gpa);
>  }
>  
> +int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> +{
> +	struct kvm_page_fault fault = {
> +		.addr = gfn_to_gpa(gfn),
> +		.error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS,
> +		.prefetch = true,
> +		.is_tdp = true,
> +		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),

These fault's don't have fault->exec so nx_huge_page_workaround_enabled
shouldn't be a factor. Not a functional issue though. Maybe it is more robust?

> +
> +		.max_level = PG_LEVEL_4K,
> +		.req_level = PG_LEVEL_4K,
> +		.goal_level = PG_LEVEL_4K,
> +		.is_private = true,
> +
> +		.gfn = gfn,
> +		.slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn),
> +		.pfn = pfn,
> +		.map_writable = true,
> +	};
> +	struct kvm *kvm = vcpu->kvm;
> +	int r;
> +
> +	lockdep_assert_held(&kvm->slots_lock);
> +
> +	if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
> +		return -EIO;
> +
> +	if (kvm_gfn_is_write_tracked(kvm, fault.slot, fault.gfn))
> +		return -EPERM;

If we care about this, why don't we care about the read only memslot flag? TDX
doesn't need this or the nx huge page part above. So this function is more
general.

What about calling it __kvm_tdp_mmu_map_private_pfn() and making it a powerful
"map this pfn at this GFN and don't ask questions" function. Otherwise, I'm not
sure where to draw the line.

> +
> +	r = kvm_mmu_reload(vcpu);
> +	if (r)
> +		return r;
> +
> +	r = mmu_topup_memory_caches(vcpu, false);
> +	if (r)
> +		return r;
> +
> +	do {
> +		if (signal_pending(current))
> +			return -EINTR;
> +
> +		if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu))
> +			return -EIO;
> +
> +		cond_resched();
> +
> +		guard(read_lock)(&kvm->mmu_lock);
> +
> +		r = kvm_tdp_mmu_map(vcpu, &fault);
> +	} while (r == RET_PF_RETRY);
> +
> +	if (r != RET_PF_FIXED)
> +		return -EIO;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(kvm_tdp_mmu_map_private_pfn);


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 03/18] Revert "KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU"
  2025-08-29  0:06 ` [RFC PATCH v2 03/18] Revert "KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU" Sean Christopherson
@ 2025-08-29 19:00   ` Edgecombe, Rick P
  0 siblings, 0 replies; 62+ messages in thread
From: Edgecombe, Rick P @ 2025-08-29 19:00 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com
  Cc: Huang, Kai, ackerleytng@google.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, Weiny, Ira,
	kvm@vger.kernel.org, michael.roth@amd.com

On Thu, 2025-08-28 at 17:06 -0700, Sean Christopherson wrote:
> Remove the helper and exports that were added to allow TDX code to reuse
> kvm_tdp_map_page() for its gmem post-populate flow now that a dedicated
> TDP MMU API is provided to install a mapping given a gfn+pfn pair.
> 
> This reverts commit 2608f105760115e94a03efd9f12f8fbfd1f9af4b.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 04/18] KVM: x86/mmu: Rename kvm_tdp_map_page() to kvm_tdp_page_prefault()
  2025-08-29  0:06 ` [RFC PATCH v2 04/18] KVM: x86/mmu: Rename kvm_tdp_map_page() to kvm_tdp_page_prefault() Sean Christopherson
@ 2025-08-29 19:03   ` Edgecombe, Rick P
  0 siblings, 0 replies; 62+ messages in thread
From: Edgecombe, Rick P @ 2025-08-29 19:03 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com
  Cc: Huang, Kai, ackerleytng@google.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, Weiny, Ira,
	kvm@vger.kernel.org, michael.roth@amd.com

On Thu, 2025-08-28 at 17:06 -0700, Sean Christopherson wrote:
> Rename kvm_tdp_map_page() to kvm_tdp_page_prefault() now that it's used
> only by kvm_arch_vcpu_pre_fault_memory().
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in S-EPT management
  2025-08-29  0:06 ` [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in S-EPT management Sean Christopherson
  2025-08-29  8:36   ` Binbin Wu
@ 2025-08-29 19:53   ` Edgecombe, Rick P
  2025-08-29 20:19     ` Sean Christopherson
  2025-09-01  1:25     ` Yan Zhao
  1 sibling, 2 replies; 62+ messages in thread
From: Edgecombe, Rick P @ 2025-08-29 19:53 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com
  Cc: Huang, Kai, ackerleytng@google.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, Weiny, Ira,
	kvm@vger.kernel.org, michael.roth@amd.com

On Thu, 2025-08-28 at 17:06 -0700, Sean Christopherson wrote:
> From: Yan Zhao <yan.y.zhao@intel.com>
> 
> Don't explicitly pin pages when mapping pages into the S-EPT, guest_memfd
> doesn't support page migration in any capacity, i.e. there are no migrate
> callbacks because guest_memfd pages *can't* be migrated.  See the WARN in
> kvm_gmem_migrate_folio().
> 
> Eliminating TDX's explicit pinning will also enable guest_memfd to support
> in-place conversion between shared and private memory[1][2].  Because KVM
> cannot distinguish between speculative/transient refcounts and the
> intentional refcount for TDX on private pages[3], failing to release
> private page refcount in TDX could cause guest_memfd to indefinitely wait
> on decreasing the refcount for the splitting.
> 
> Under normal conditions, not holding an extra page refcount in TDX is safe
> because guest_memfd ensures pages are retained until its invalidation
> notification to KVM MMU is completed. However, if there're bugs in KVM/TDX
> module, not holding an extra refcount when a page is mapped in S-EPT could
> result in a page being released from guest_memfd while still mapped in the
> S-EPT.  But, doing work to make a fatal error slightly less fatal is a net
> negative when that extra work adds complexity and confusion.
> 
> Several approaches were considered to address the refcount issue, including
>   - Attempting to modify the KVM unmap operation to return a failure,
>     which was deemed too complex and potentially incorrect[4].
>  - Increasing the folio reference count only upon S-EPT zapping failure[5].
>  - Use page flags or page_ext to indicate a page is still used by TDX[6],
>    which does not work for HVO (HugeTLB Vmemmap Optimization).
>   - Setting HWPOISON bit or leveraging folio_set_hugetlb_hwpoison()[7].
> 
> Due to the complexity or inappropriateness of these approaches, and the
> fact that S-EPT zapping failure is currently only possible when there are
> bugs in the KVM or TDX module, which is very rare in a production kernel,
> a straightforward approach of simply not holding the page reference count
> in TDX was chosen[8].
> 
> When S-EPT zapping errors occur, KVM_BUG_ON() is invoked to kick off all
> vCPUs and mark the VM as dead. Although there is a potential window that a
> private page mapped in the S-EPT could be reallocated and used outside the
> VM, the loud warning from KVM_BUG_ON() should provide sufficient debug
> information.
> 

Yea, in the case of a bug, there could be a use-after-free. This logic applies
to all code that has allocations including the entire KVM MMU. But in this case,
we can actually catch the use-after-free scenario under scrutiny and not have it
happen silently, which does not apply to all code. But the special case here is
that the use-after-free depends on TDX module logic which is not part of the
kernel.

Yan, can you clarify what you mean by "there could be a small window"? I'm
thinking this is a hypothetical window around vm_dead races? Or more concrete? I
*don't* want to re-open the debate on whether to go with this approach, but I
think this is a good teaching edge case to settle on how we want to treat
similar issues. So I just want to make sure we have the justification right.

>  To be robust against bugs, the user can enable panic_on_warn
> as normal.
> 
> Link: https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com [1]
> Link: https://youtu.be/UnBKahkAon4 [2]
> Link: https://lore.kernel.org/all/CAGtprH_ypohFy9TOJ8Emm_roT4XbQUtLKZNFcM6Fr+fhTFkE0Q@mail.gmail.com [3]
> Link: https://lore.kernel.org/all/aEEEJbTzlncbRaRA@yzhao56-desk.sh.intel.com [4]
> Link: https://lore.kernel.org/all/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com [5]
> Link: https://lore.kernel.org/all/aFkeBtuNBN1RrDAJ@yzhao56-desk.sh.intel.com [6]
> Link: https://lore.kernel.org/all/diqzy0tikran.fsf@ackerleytng-ctop.c.googlers.com [7]
> Link: https://lore.kernel.org/all/53ea5239f8ef9d8df9af593647243c10435fd219.camel@intel.com [8]
> Suggested-by: Vishal Annapurve <vannapurve@google.com>
> Suggested-by: Ackerley Tng <ackerleytng@google.com>
> Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> [sean: extract out of hugepage series, massage changelog accordingly]
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Discussion aside, Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 06/18] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition
  2025-08-29  0:06 ` [RFC PATCH v2 06/18] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition Sean Christopherson
  2025-08-29  9:40   ` Binbin Wu
  2025-08-29 16:58   ` Ira Weiny
@ 2025-08-29 19:59   ` Edgecombe, Rick P
  2 siblings, 0 replies; 62+ messages in thread
From: Edgecombe, Rick P @ 2025-08-29 19:59 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com
  Cc: Huang, Kai, ackerleytng@google.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, Weiny, Ira,
	kvm@vger.kernel.org, michael.roth@amd.com

On Thu, 2025-08-28 at 17:06 -0700, Sean Christopherson wrote:
> Return -EIO when a KVM_BUG_ON() is tripped, as KVM's ABI is to return -EIO
> when a VM has been killed due to a KVM bug, not -EINVAL.  Note, many (all?)
> of the affected paths never propagate the error code to userspace, i.e.
> this is about internal consistency more than anything else.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 12/18] KVM: TDX: Bug the VM if extended the initial measurement fails
  2025-08-29 18:16     ` Edgecombe, Rick P
@ 2025-08-29 20:11       ` Sean Christopherson
  2025-08-29 22:39         ` Edgecombe, Rick P
  2025-09-02  9:24         ` Yan Zhao
  0 siblings, 2 replies; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29 20:11 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Yan Y Zhao, Kai Huang, ackerleytng@google.com, Vishal Annapurve,
	linux-kernel@vger.kernel.org, Ira Weiny, kvm@vger.kernel.org,
	michael.roth@amd.com, pbonzini@redhat.com

[-- Attachment #1: Type: text/plain, Size: 5605 bytes --]

On Fri, Aug 29, 2025, Rick P Edgecombe wrote:
> On Fri, 2025-08-29 at 16:18 +0800, Yan Zhao wrote:
> > > +	/*
> > > +	 * Note, MR.EXTEND can fail if the S-EPT mapping is somehow removed
> > > +	 * between mapping the pfn and now, but slots_lock prevents memslot
> > > +	 * updates, filemap_invalidate_lock() prevents guest_memfd updates,
> > > +	 * mmu_notifier events can't reach S-EPT entries, and KVM's
> > > internal
> > > +	 * zapping flows are mutually exclusive with S-EPT mappings.
> > > +	 */
> > > +	for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
> > > +		err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry,
> > > &level_state);
> > > +		if (KVM_BUG_ON(err, kvm)) {
> > I suspect tdh_mr_extend() running on one vCPU may contend with
> > tdh_vp_create()/tdh_vp_addcx()/tdh_vp_init*()/tdh_vp_rd()/tdh_vp_wr()/
> > tdh_mng_rd()/tdh_vp_flush() on other vCPUs, if userspace invokes ioctl
> > KVM_TDX_INIT_MEM_REGION on one vCPU while initializing other vCPUs.
> > 
> > It's similar to the analysis of contention of tdh_mem_page_add() [1], as
> > both tdh_mr_extend() and tdh_mem_page_add() acquire exclusive lock on
> > resource TDR.
> > 
> > I'll try to write a test to verify it and come back to you.
> 
> I'm seeing the same thing in the TDX module. It could fail because of contention
> controllable from userspace. So the KVM_BUG_ON() is not appropriate.
> 
> Today though if tdh_mr_extend() fails because of contention then the TD is
> essentially dead anyway. Trying to redo KVM_TDX_INIT_MEM_REGION will fail. The
> M-EPT fault could be spurious but the second tdh_mem_page_add() would return an
> error and never get back to the tdh_mr_extend().
> 
> The version in this patch can't recover for a different reason. That is 
> kvm_tdp_mmu_map_private_pfn() doesn't handle spurious faults, so I'd say just
> drop the KVM_BUG_ON(), and try to handle the contention in a separate effort.
> 
> I guess the two approaches could be to make KVM_TDX_INIT_MEM_REGION more robust,

This.  First and foremost, KVM's ordering and locking rules need to be explicit
(ideally documented, but at the very least apparent in the code), *especially*
when the locking (or lack thereof) impacts userspace.  Even if effectively relying
on the TDX-module to provide ordering "works", it's all but impossible to follow.

And it doesn't truly work, as everything in the TDX-Module is a trylock, and that
in turn prevents KVM from asserting success.  Sometimes KVM has better option than
to rely on hardware to detect failure, but it really should be a last resort,
because not being able to expect success makes debugging no fun.  Even worse, it
bleeds hard-to-document, specific ordering requirements into userspace, e.g. in
this case, it sounds like userspace can't do _anything_ on vCPUs while doing
KVM_TDX_INIT_MEM_REGION.  Which might not be a burden for userspace, but oof is
it nasty from an ABI perspective.

> or prevent the contention. For the latter case:
> tdh_vp_create()/tdh_vp_addcx()/tdh_vp_init*()/tdh_vp_rd()/tdh_vp_wr()
> ...I think we could just take slots_lock during KVM_TDX_INIT_VCPU and
> KVM_TDX_GET_CPUID.
> 
> For tdh_vp_flush() the vcpu_load() in kvm_arch_vcpu_ioctl() could be hard to
> handle.
> 
> So I'd think maybe to look towards making KVM_TDX_INIT_MEM_REGION more robust,
> which would mean the eventual solution wouldn't have ABI concerns by later
> blocking things that used to be allowed.
> 
> Maybe having kvm_tdp_mmu_map_private_pfn() return success for spurious faults is
> enough. But this is all for a case that userspace isn't expected to actually
> hit, so seems like something that could be kicked down the road easily.

You're trying to be too "nice", just smack 'em with a big hammer.  For all intents
and purposes, the paths in question are fully serialized, there's no reason to try
and allow anything remotely interesting to happen.

Acquire kvm->lock to prevent VM-wide things from happening, slots_lock to prevent
kvm_mmu_zap_all_fast(), and _all_ vCPU mutexes to prevent vCPUs from interefering.

Doing that for a vCPU ioctl is a bit awkward, but not awful.  E.g. we can abuse
kvm_arch_vcpu_async_ioctl().  In hindsight, a more clever approach would have
been to make KVM_TDX_INIT_MEM_REGION a VM-scoped ioctl that takes a vCPU fd.  Oh
well.

Anyways, I think we need to avoid the "synchronous" ioctl path anyways, because
taking kvm->slots_lock inside vcpu->mutex is gross.  AFAICT it's not actively
problematic today, but it feels like a deadlock waiting to happen.

The other oddity I see is the handling of kvm_tdx->state.  I don't see how this
check in tdx_vcpu_create() is safe:

	if (kvm_tdx->state != TD_STATE_INITIALIZED)
		return -EIO;

kvm_arch_vcpu_create() runs without any locks held, and so TDX effectively has
the same bug that SEV intra-host migration had, where an in-flight vCPU creation
could race with a VM-wide state transition (see commit ecf371f8b02d ("KVM: SVM:
Reject SEV{-ES} intra host migration if vCPU creation is in-flight").  To fix
that, kvm->lock needs to be taken and KVM needs to verify there's no in-flight
vCPU creation, e.g. so that a vCPU doesn't pop up and contend a TDX-Module lock.

We an even define a fancy new CLASS to handle the lock+check => unlock logic
with guard()-like syntax:

	CLASS(tdx_vm_state_guard, guard)(kvm);
	if (IS_ERR(guard))
		return PTR_ERR(guard);

IIUC, with all of those locks, KVM can KVM_BUG_ON() both TDH_MEM_PAGE_ADD and
TDH_MR_EXTEND, with no exceptions given for -EBUSY.  Attached patches are very
lightly tested as usual and need to be chunked up, but seem do to what I want.

[-- Attachment #2: 0001-KVM-Make-support-for-kvm_arch_vcpu_async_ioctl-manda.patch --]
[-- Type: text/x-diff, Size: 4250 bytes --]

From 44a96a0db69d9cd56e77813125aa1e318b11d718 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Fri, 29 Aug 2025 07:28:44 -0700
Subject: [PATCH 1/2] KVM: Make support for kvm_arch_vcpu_async_ioctl()
 mandatory

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/loongarch/kvm/Kconfig |  1 -
 arch/mips/kvm/Kconfig      |  1 -
 arch/powerpc/kvm/Kconfig   |  1 -
 arch/riscv/kvm/Kconfig     |  1 -
 arch/s390/kvm/Kconfig      |  1 -
 arch/x86/kvm/x86.c         |  6 ++++++
 include/linux/kvm_host.h   | 10 ----------
 virt/kvm/Kconfig           |  3 ---
 8 files changed, 6 insertions(+), 18 deletions(-)

diff --git a/arch/loongarch/kvm/Kconfig b/arch/loongarch/kvm/Kconfig
index 40eea6da7c25..e53948ec978a 100644
--- a/arch/loongarch/kvm/Kconfig
+++ b/arch/loongarch/kvm/Kconfig
@@ -25,7 +25,6 @@ config KVM
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_MSI
 	select HAVE_KVM_READONLY_MEM
-	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select KVM_COMMON
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select KVM_GENERIC_HARDWARE_ENABLING
diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig
index ab57221fa4dd..cc13cc35f208 100644
--- a/arch/mips/kvm/Kconfig
+++ b/arch/mips/kvm/Kconfig
@@ -22,7 +22,6 @@ config KVM
 	select EXPORT_UASM
 	select KVM_COMMON
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
-	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select KVM_MMIO
 	select KVM_GENERIC_MMU_NOTIFIER
 	select KVM_GENERIC_HARDWARE_ENABLING
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 2f2702c867f7..c9a2d50ff1b0 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -20,7 +20,6 @@ if VIRTUALIZATION
 config KVM
 	bool
 	select KVM_COMMON
-	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select KVM_VFIO
 	select HAVE_KVM_IRQ_BYPASS
 
diff --git a/arch/riscv/kvm/Kconfig b/arch/riscv/kvm/Kconfig
index 5a62091b0809..de67bfabebc8 100644
--- a/arch/riscv/kvm/Kconfig
+++ b/arch/riscv/kvm/Kconfig
@@ -23,7 +23,6 @@ config KVM
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_IRQ_ROUTING
 	select HAVE_KVM_MSI
-	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select HAVE_KVM_READONLY_MEM
 	select HAVE_KVM_DIRTY_RING_ACQ_REL
 	select KVM_COMMON
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index cae908d64550..96d16028e8b7 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -20,7 +20,6 @@ config KVM
 	def_tristate y
 	prompt "Kernel-based Virtual Machine (KVM) support"
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
-	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select KVM_ASYNC_PF
 	select KVM_ASYNC_PF_SYNC
 	select KVM_COMMON
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7ba2cdfdac44..92e916eba6a9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6943,6 +6943,12 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
 	return 0;
 }
 
+long kvm_arch_vcpu_async_ioctl(struct file *filp, unsigned int ioctl,
+			       unsigned long arg)
+{
+	return -ENOIOCTLCMD;
+}
+
 int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
 {
 	struct kvm *kvm = filp->private_data;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 15656b7fba6c..a1840aaf80d4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2421,18 +2421,8 @@ static inline bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
 }
 #endif /* CONFIG_HAVE_KVM_NO_POLL */
 
-#ifdef CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL
 long kvm_arch_vcpu_async_ioctl(struct file *filp,
 			       unsigned int ioctl, unsigned long arg);
-#else
-static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
-					     unsigned int ioctl,
-					     unsigned long arg)
-{
-	return -ENOIOCTLCMD;
-}
-#endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
-
 void kvm_arch_guest_memory_reclaimed(struct kvm *kvm);
 
 #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 727b542074e7..661a4b998875 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -78,9 +78,6 @@ config HAVE_KVM_IRQ_BYPASS
        tristate
        select IRQ_BYPASS_MANAGER
 
-config HAVE_KVM_VCPU_ASYNC_IOCTL
-       bool
-
 config HAVE_KVM_VCPU_RUN_PID_CHANGE
        bool
 

base-commit: f4b88d6c85871847340a86daf838e11986a97348
-- 
2.51.0.318.gd7df087d1a-goog


[-- Attachment #3: 0002-KVM-TDX-Guard-VM-state-transitions.patch --]
[-- Type: text/x-diff, Size: 9085 bytes --]

From 7277396033c21569dbed0a52fa92804307db111e Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Fri, 29 Aug 2025 09:19:11 -0700
Subject: [PATCH 2/2] KVM: TDX: Guard VM state transitions

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |   1 +
 arch/x86/include/asm/kvm_host.h    |   1 +
 arch/x86/kvm/vmx/main.c            |   9 +++
 arch/x86/kvm/vmx/tdx.c             | 112 ++++++++++++++++++++++-------
 arch/x86/kvm/vmx/x86_ops.h         |   1 +
 arch/x86/kvm/x86.c                 |   7 ++
 6 files changed, 105 insertions(+), 26 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 18a5c3119e1a..fe2bb2e2ebc8 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -128,6 +128,7 @@ KVM_X86_OP(enable_smi_window)
 KVM_X86_OP_OPTIONAL(dev_get_attr)
 KVM_X86_OP_OPTIONAL(mem_enc_ioctl)
 KVM_X86_OP_OPTIONAL(vcpu_mem_enc_ioctl)
+KVM_X86_OP_OPTIONAL(vcpu_mem_enc_async_ioctl)
 KVM_X86_OP_OPTIONAL(mem_enc_register_region)
 KVM_X86_OP_OPTIONAL(mem_enc_unregister_region)
 KVM_X86_OP_OPTIONAL(vm_copy_enc_context_from)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d0a8404a6b8f..ac5d3b8fa49f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1911,6 +1911,7 @@ struct kvm_x86_ops {
 	int (*dev_get_attr)(u32 group, u64 attr, u64 *val);
 	int (*mem_enc_ioctl)(struct kvm *kvm, void __user *argp);
 	int (*vcpu_mem_enc_ioctl)(struct kvm_vcpu *vcpu, void __user *argp);
+	int (*vcpu_mem_enc_async_ioctl)(struct kvm_vcpu *vcpu, void __user *argp);
 	int (*mem_enc_register_region)(struct kvm *kvm, struct kvm_enc_region *argp);
 	int (*mem_enc_unregister_region)(struct kvm *kvm, struct kvm_enc_region *argp);
 	int (*vm_copy_enc_context_from)(struct kvm *kvm, unsigned int source_fd);
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index dbab1c15b0cd..e0e35ceec9b1 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -831,6 +831,14 @@ static int vt_vcpu_mem_enc_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 	return tdx_vcpu_ioctl(vcpu, argp);
 }
 
+static int vt_vcpu_mem_enc_async_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
+{
+	if (!is_td_vcpu(vcpu))
+		return -EINVAL;
+
+	return tdx_vcpu_async_ioctl(vcpu, argp);
+}
+
 static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
 {
 	if (is_td(kvm))
@@ -1004,6 +1012,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 
 	.mem_enc_ioctl = vt_op_tdx_only(mem_enc_ioctl),
 	.vcpu_mem_enc_ioctl = vt_op_tdx_only(vcpu_mem_enc_ioctl),
+	.vcpu_mem_enc_async_ioctl = vt_op_tdx_only(vcpu_mem_enc_async_ioctl),
 
 	.private_max_mapping_level = vt_op_tdx_only(gmem_private_max_mapping_level)
 };
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index d6c9defad9cd..c595d9cb6dcd 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2624,6 +2624,44 @@ static int tdx_read_cpuid(struct kvm_vcpu *vcpu, u32 leaf, u32 sub_leaf,
 	return -EIO;
 }
 
+typedef void * tdx_vm_state_guard_t;
+
+static tdx_vm_state_guard_t tdx_acquire_vm_state_locks(struct kvm *kvm)
+{
+	int r;
+
+	mutex_lock(&kvm->lock);
+	mutex_lock(&kvm->slots_lock);
+
+	if (kvm->created_vcpus != atomic_read(&kvm->online_vcpus)) {
+		r = -EBUSY;
+		goto out_err;
+	}
+
+	r = kvm_lock_all_vcpus(kvm);
+	if (r)
+		goto out_err;
+
+	return kvm;
+
+out_err:
+	mutex_unlock(&kvm->slots_lock);
+	mutex_unlock(&kvm->lock);
+
+	return ERR_PTR(r);
+}
+
+static void tdx_release_vm_state_locks(struct kvm *kvm)
+{
+	kvm_unlock_all_vcpus(kvm);
+	mutex_unlock(&kvm->slots_lock);
+	mutex_unlock(&kvm->lock);
+}
+
+DEFINE_CLASS(tdx_vm_state_guard, tdx_vm_state_guard_t,
+	     if (!IS_ERR(_T)) tdx_release_vm_state_locks(_T),
+	     tdx_acquire_vm_state_locks(kvm), struct kvm *kvm);
+
 static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -2634,6 +2672,10 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 	BUILD_BUG_ON(sizeof(*init_vm) != 256 + sizeof_field(struct kvm_tdx_init_vm, cpuid));
 	BUILD_BUG_ON(sizeof(struct td_params) != 1024);
 
+	CLASS(tdx_vm_state_guard, guard)(kvm);
+	if (IS_ERR(guard))
+		return PTR_ERR(guard);
+
 	if (kvm_tdx->state != TD_STATE_UNINITIALIZED)
 		return -EINVAL;
 
@@ -2745,7 +2787,9 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 
-	guard(mutex)(&kvm->slots_lock);
+	CLASS(tdx_vm_state_guard, guard)(kvm);
+	if (IS_ERR(guard))
+		return PTR_ERR(guard);
 
 	if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE)
 		return -EINVAL;
@@ -2763,22 +2807,25 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 	return 0;
 }
 
+static int tdx_get_cmd(void __user *argp, struct kvm_tdx_cmd *cmd)
+{
+	if (copy_from_user(cmd, argp, sizeof(cmd)))
+		return -EFAULT;
+
+	if (cmd->hw_error)
+		return -EINVAL;
+
+	return 0;
+}
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_tdx_cmd tdx_cmd;
 	int r;
 
-	if (copy_from_user(&tdx_cmd, argp, sizeof(struct kvm_tdx_cmd)))
-		return -EFAULT;
-
-	/*
-	 * Userspace should never set hw_error. It is used to fill
-	 * hardware-defined error by the kernel.
-	 */
-	if (tdx_cmd.hw_error)
-		return -EINVAL;
-
-	mutex_lock(&kvm->lock);
+	r = tdx_get_cmd(argp, &tdx_cmd);
+	if (r)
+		return r;
 
 	switch (tdx_cmd.id) {
 	case KVM_TDX_CAPABILITIES:
@@ -2791,15 +2838,12 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 		r = tdx_td_finalize(kvm, &tdx_cmd);
 		break;
 	default:
-		r = -EINVAL;
-		goto out;
+		return -EINVAL;
 	}
 
 	if (copy_to_user(argp, &tdx_cmd, sizeof(struct kvm_tdx_cmd)))
 		r = -EFAULT;
 
-out:
-	mutex_unlock(&kvm->lock);
 	return r;
 }
 
@@ -3079,11 +3123,13 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
 	long gmem_ret;
 	int ret;
 
+	CLASS(tdx_vm_state_guard, guard)(kvm);
+	if (IS_ERR(guard))
+		return PTR_ERR(guard);
+
 	if (tdx->state != VCPU_TD_STATE_INITIALIZED)
 		return -EINVAL;
 
-	guard(mutex)(&kvm->slots_lock);
-
 	/* Once TD is finalized, the initial guest memory is fixed. */
 	if (kvm_tdx->state == TD_STATE_RUNNABLE)
 		return -EINVAL;
@@ -3101,6 +3147,8 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
 	    !vt_is_tdx_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT) - 1))
 		return -EINVAL;
 
+	vcpu_load(vcpu);
+
 	ret = 0;
 	while (region.nr_pages) {
 		if (signal_pending(current)) {
@@ -3132,11 +3180,28 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
 		cond_resched();
 	}
 
+	vcpu_put(vcpu);
+
 	if (copy_to_user(u64_to_user_ptr(cmd->data), &region, sizeof(region)))
 		ret = -EFAULT;
 	return ret;
 }
 
+int tdx_vcpu_async_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
+{
+	struct kvm_tdx_cmd cmd;
+	int r;
+
+	r = tdx_get_cmd(argp, &cmd);
+	if (r)
+		return r;
+
+	if (cmd.id != KVM_TDX_INIT_MEM_REGION)
+		return -ENOIOCTLCMD;
+
+	return tdx_vcpu_init_mem_region(vcpu, &cmd);
+}
+
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
@@ -3146,19 +3211,14 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 	if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE)
 		return -EINVAL;
 
-	if (copy_from_user(&cmd, argp, sizeof(cmd)))
-		return -EFAULT;
-
-	if (cmd.hw_error)
-		return -EINVAL;
+	ret = tdx_get_cmd(argp, &cmd);
+	if (ret)
+		return ret;
 
 	switch (cmd.id) {
 	case KVM_TDX_INIT_VCPU:
 		ret = tdx_vcpu_init(vcpu, &cmd);
 		break;
-	case KVM_TDX_INIT_MEM_REGION:
-		ret = tdx_vcpu_init_mem_region(vcpu, &cmd);
-		break;
 	case KVM_TDX_GET_CPUID:
 		ret = tdx_vcpu_get_cpuid(vcpu, &cmd);
 		break;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 2b3424f638db..a797101a2150 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -149,6 +149,7 @@ int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
 int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
 
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
+int tdx_vcpu_async_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
 
 void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
 void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 92e916eba6a9..281cd0980245 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6946,6 +6946,13 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
 long kvm_arch_vcpu_async_ioctl(struct file *filp, unsigned int ioctl,
 			       unsigned long arg)
 {
+	struct kvm_vcpu *vcpu = filp->private_data;
+	void __user *argp = (void __user *)arg;
+
+	if (ioctl == KVM_MEMORY_ENCRYPT_OP &&
+	    kvm_x86_ops.vcpu_mem_enc_async_ioctl)
+		return kvm_x86_call(vcpu_mem_enc_async_ioctl)(vcpu, argp);
+
 	return -ENOIOCTLCMD;
 }
 
-- 
2.51.0.318.gd7df087d1a-goog


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in S-EPT management
  2025-08-29 19:53   ` Edgecombe, Rick P
@ 2025-08-29 20:19     ` Sean Christopherson
  2025-08-29 21:54       ` Edgecombe, Rick P
  2025-09-01  1:25     ` Yan Zhao
  1 sibling, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29 20:19 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: pbonzini@redhat.com, Kai Huang, ackerleytng@google.com,
	Vishal Annapurve, linux-kernel@vger.kernel.org, Yan Y Zhao,
	Ira Weiny, kvm@vger.kernel.org, michael.roth@amd.com

On Fri, Aug 29, 2025, Rick P Edgecombe wrote:
> On Thu, 2025-08-28 at 17:06 -0700, Sean Christopherson wrote:
> > From: Yan Zhao <yan.y.zhao@intel.com>
> > 
> > Don't explicitly pin pages when mapping pages into the S-EPT, guest_memfd
> > doesn't support page migration in any capacity, i.e. there are no migrate
> > callbacks because guest_memfd pages *can't* be migrated.  See the WARN in
> > kvm_gmem_migrate_folio().
> > 
> > Eliminating TDX's explicit pinning will also enable guest_memfd to support
> > in-place conversion between shared and private memory[1][2].  Because KVM
> > cannot distinguish between speculative/transient refcounts and the
> > intentional refcount for TDX on private pages[3], failing to release
> > private page refcount in TDX could cause guest_memfd to indefinitely wait
> > on decreasing the refcount for the splitting.
> > 
> > Under normal conditions, not holding an extra page refcount in TDX is safe
> > because guest_memfd ensures pages are retained until its invalidation
> > notification to KVM MMU is completed. However, if there're bugs in KVM/TDX
> > module, not holding an extra refcount when a page is mapped in S-EPT could
> > result in a page being released from guest_memfd while still mapped in the
> > S-EPT.  But, doing work to make a fatal error slightly less fatal is a net
> > negative when that extra work adds complexity and confusion.
> > 
> > Several approaches were considered to address the refcount issue, including
> >   - Attempting to modify the KVM unmap operation to return a failure,
> >     which was deemed too complex and potentially incorrect[4].
> >  - Increasing the folio reference count only upon S-EPT zapping failure[5].
> >  - Use page flags or page_ext to indicate a page is still used by TDX[6],
> >    which does not work for HVO (HugeTLB Vmemmap Optimization).
> >   - Setting HWPOISON bit or leveraging folio_set_hugetlb_hwpoison()[7].
> > 
> > Due to the complexity or inappropriateness of these approaches, and the
> > fact that S-EPT zapping failure is currently only possible when there are
> > bugs in the KVM or TDX module, which is very rare in a production kernel,
> > a straightforward approach of simply not holding the page reference count
> > in TDX was chosen[8].
> > 
> > When S-EPT zapping errors occur, KVM_BUG_ON() is invoked to kick off all
> > vCPUs and mark the VM as dead. Although there is a potential window that a
> > private page mapped in the S-EPT could be reallocated and used outside the
> > VM, the loud warning from KVM_BUG_ON() should provide sufficient debug
> > information.
> > 
> 
> Yea, in the case of a bug, there could be a use-after-free. This logic applies
> to all code that has allocations including the entire KVM MMU. But in this case,
> we can actually catch the use-after-free scenario under scrutiny and not have it
> happen silently, which does not apply to all code. But the special case here is
> that the use-after-free depends on TDX module logic which is not part of the
> kernel.
> 
> Yan, can you clarify what you mean by "there could be a small window"? I'm
> thinking this is a hypothetical window around vm_dead races? Or more concrete? I
> *don't* want to re-open the debate on whether to go with this approach, but I
> think this is a good teaching edge case to settle on how we want to treat
> similar issues. So I just want to make sure we have the justification right.

The first paragraph is all the justification we need.  Seriously.  Bad things
will happen if you have UAF bugs, news at 11!

I'm all for defensive programming, but pinning pages goes too far, because that
itself can be dangerous, e.g. see commit 2bcb52a3602b ("KVM: Pin (as in FOLL_PIN)
pages during kvm_vcpu_map()") and the many messes KVM created with respect to
struct page refcounts.

I'm happy to include more context in the changelog, but I really don't want
anyone to walk away from this thinking that pinning pages in random KVM code is
at all encouraged.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 02/18] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU
  2025-08-29 18:34   ` Edgecombe, Rick P
@ 2025-08-29 20:27     ` Sean Christopherson
  0 siblings, 0 replies; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29 20:27 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: pbonzini@redhat.com, Kai Huang, ackerleytng@google.com,
	Vishal Annapurve, linux-kernel@vger.kernel.org, Yan Y Zhao,
	Ira Weiny, kvm@vger.kernel.org, michael.roth@amd.com

On Fri, Aug 29, 2025, Rick P Edgecombe wrote:
> On Thu, 2025-08-28 at 17:06 -0700, Sean Christopherson wrote:
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4994,6 +4994,65 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> >  	return min(range->size, end - range->gpa);
> >  }
> >  
> > +int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> > +{
> > +	struct kvm_page_fault fault = {
> > +		.addr = gfn_to_gpa(gfn),
> > +		.error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS,
> > +		.prefetch = true,
> > +		.is_tdp = true,
> > +		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),
> 
> These fault's don't have fault->exec so nx_huge_page_workaround_enabled
> shouldn't be a factor. Not a functional issue though. Maybe it is more robust?

Whether or not the fault itself is EXEC is irrelevant, nx_huge_page_workaround_enabled
is used to ensure KVM doesn't create hugepage overtop an exiting EXEC 4KiB mapping.
Of course, this fault is irrelevant on that front as well.  But I don't see any
reason to get cute and let .nx_huge_page_workaround_enabled be stale.

> > +
> > +		.max_level = PG_LEVEL_4K,
> > +		.req_level = PG_LEVEL_4K,
> > +		.goal_level = PG_LEVEL_4K,
> > +		.is_private = true,
> > +
> > +		.gfn = gfn,
> > +		.slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn),
> > +		.pfn = pfn,
> > +		.map_writable = true,
> > +	};
> > +	struct kvm *kvm = vcpu->kvm;
> > +	int r;
> > +
> > +	lockdep_assert_held(&kvm->slots_lock);
> > +
> > +	if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
> > +		return -EIO;
> > +
> > +	if (kvm_gfn_is_write_tracked(kvm, fault.slot, fault.gfn))
> > +		return -EPERM;
> 
> If we care about this, why don't we care about the read only memslot flag?

Because private memory fundamentally can't support read-only memslots.  If we
wanted to be paranoid, this code could assert that the memslot can be private
but for me that reaches a pointless level of paranoia.

> TDX doesn't need this or the nx huge page part above. So this function is
> more general.

I don't see anything that makes nx_huge_page_workaround_enabled mutually exclusive
with TDX though.

> What about calling it __kvm_tdp_mmu_map_private_pfn() and making it a powerful
> "map this pfn at this GFN and don't ask questions" function. Otherwise, I'm not
> sure where to draw the line.

Eh, for me, the line is pretty clear.  This is obviously specific to private memory,
and so implies a guest_memfd source, a private pfn, and everything that comes
along with private gmem pfns.  Everything else should be accounted for.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in S-EPT management
  2025-08-29 20:19     ` Sean Christopherson
@ 2025-08-29 21:54       ` Edgecombe, Rick P
  2025-08-29 22:02         ` Sean Christopherson
  0 siblings, 1 reply; 62+ messages in thread
From: Edgecombe, Rick P @ 2025-08-29 21:54 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: Huang, Kai, ackerleytng@google.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, Weiny, Ira,
	kvm@vger.kernel.org, pbonzini@redhat.com, michael.roth@amd.com

On Fri, 2025-08-29 at 13:19 -0700, Sean Christopherson wrote:
> > Yan, can you clarify what you mean by "there could be a small window"? I'm
> > thinking this is a hypothetical window around vm_dead races? Or more
> > concrete? I *don't* want to re-open the debate on whether to go with this
> > approach, but I think this is a good teaching edge case to settle on how we
> > want to treat similar issues. So I just want to make sure we have the
> > justification right.
> 
> The first paragraph is all the justification we need.  Seriously.  Bad things
> will happen if you have UAF bugs, news at 11!

Totally.

> 
> I'm all for defensive programming, but pinning pages goes too far, because
> that itself can be dangerous, e.g. see commit 2bcb52a3602b ("KVM: Pin (as in
> FOLL_PIN) pages during kvm_vcpu_map()") and the many messes KVM created with
> respect to struct page refcounts.
> 
> I'm happy to include more context in the changelog, but I really don't want
> anyone to walk away from this thinking that pinning pages in random KVM code
> is at all encouraged.

Sorry for going on a tangent. Defensive programming inside the kernel is a
little more settled. But for defensive programming against the TDX module, there
are various schools of thought internally. Currently we rely on some
undocumented behavior of the TDX module (as in not in the spec) for correctness.
But I don't think we do for security.

Speaking for Yan here, I think she was a little more worried about this scenario
then me, so I read this verbiage and thought to try to close it out.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in S-EPT management
  2025-08-29 21:54       ` Edgecombe, Rick P
@ 2025-08-29 22:02         ` Sean Christopherson
  2025-08-29 22:17           ` Edgecombe, Rick P
  0 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29 22:02 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Kai Huang, ackerleytng@google.com, Vishal Annapurve,
	linux-kernel@vger.kernel.org, Yan Y Zhao, Ira Weiny,
	kvm@vger.kernel.org, pbonzini@redhat.com, michael.roth@amd.com

On Fri, Aug 29, 2025, Rick P Edgecombe wrote:
> On Fri, 2025-08-29 at 13:19 -0700, Sean Christopherson wrote:
> > I'm happy to include more context in the changelog, but I really don't want
> > anyone to walk away from this thinking that pinning pages in random KVM code
> > is at all encouraged.
> 
> Sorry for going on a tangent. Defensive programming inside the kernel is a
> little more settled. But for defensive programming against the TDX module, there
> are various schools of thought internally. Currently we rely on some
> undocumented behavior of the TDX module (as in not in the spec) for correctness.

Examples?

> But I don't think we do for security.
> 
> Speaking for Yan here, I think she was a little more worried about this scenario
> then me, so I read this verbiage and thought to try to close it out.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in S-EPT management
  2025-08-29 22:02         ` Sean Christopherson
@ 2025-08-29 22:17           ` Edgecombe, Rick P
  2025-08-29 22:58             ` Sean Christopherson
  0 siblings, 1 reply; 62+ messages in thread
From: Edgecombe, Rick P @ 2025-08-29 22:17 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: Huang, Kai, ackerleytng@google.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, Weiny, Ira,
	pbonzini@redhat.com, kvm@vger.kernel.org, michael.roth@amd.com

On Fri, 2025-08-29 at 15:02 -0700, Sean Christopherson wrote:
> On Fri, Aug 29, 2025, Rick P Edgecombe wrote:
> > On Fri, 2025-08-29 at 13:19 -0700, Sean Christopherson wrote:
> > > I'm happy to include more context in the changelog, but I really don't want
> > > anyone to walk away from this thinking that pinning pages in random KVM code
> > > is at all encouraged.
> > 
> > Sorry for going on a tangent. Defensive programming inside the kernel is a
> > little more settled. But for defensive programming against the TDX module, there
> > are various schools of thought internally. Currently we rely on some
> > undocumented behavior of the TDX module (as in not in the spec) for correctness.
> 
> Examples?

I was thinking about the BUSY error code avoidance logic that is now called
tdh_do_no_vcpus(). We assume no new conditions will appear that cause a
TDX_OPERAND_BUSY. Like a guest opt-in or something.

It's on our todo list to transition those assumptions to promises. We just need
to formalize them.

> 
> > But I don't think we do for security.

But, actually they are some of the same paths. So same pattern.

> > 
> > Speaking for Yan here, I think she was a little more worried about this scenario
> > then me, so I read this verbiage and thought to try to close it out.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 12/18] KVM: TDX: Bug the VM if extended the initial measurement fails
  2025-08-29 20:11       ` Sean Christopherson
@ 2025-08-29 22:39         ` Edgecombe, Rick P
  2025-08-29 23:15           ` Edgecombe, Rick P
  2025-09-02  9:24         ` Yan Zhao
  1 sibling, 1 reply; 62+ messages in thread
From: Edgecombe, Rick P @ 2025-08-29 22:39 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: Huang, Kai, ackerleytng@google.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, Weiny, Ira,
	kvm@vger.kernel.org, michael.roth@amd.com, pbonzini@redhat.com

On Fri, 2025-08-29 at 13:11 -0700, Sean Christopherson wrote:
> > I guess the two approaches could be to make KVM_TDX_INIT_MEM_REGION more
> > robust,
> 
> This.  First and foremost, KVM's ordering and locking rules need to be
> explicit (ideally documented, but at the very least apparent in the code),
> *especially* when the locking (or lack thereof) impacts userspace.  Even if
> effectively relying on the TDX-module to provide ordering "works", it's all
> but impossible to follow.
> 
> And it doesn't truly work, as everything in the TDX-Module is a trylock, and
> that in turn prevents KVM from asserting success.  Sometimes KVM has better
> option than to rely on hardware to detect failure, but it really should be a
> last resort, because not being able to expect success makes debugging no fun. 
> Even worse, it bleeds hard-to-document, specific ordering requirements into
> userspace, e.g. in this case, it sounds like userspace can't do _anything_ on
> vCPUs while doing KVM_TDX_INIT_MEM_REGION.  Which might not be a burden for
> userspace, but oof is it nasty from an ABI perspective.

I could see that. I didn't think of the below.

> 
> > or prevent the contention. For the latter case:
> > tdh_vp_create()/tdh_vp_addcx()/tdh_vp_init*()/tdh_vp_rd()/tdh_vp_wr()
> > ...I think we could just take slots_lock during KVM_TDX_INIT_VCPU and
> > KVM_TDX_GET_CPUID.
> > 
> > For tdh_vp_flush() the vcpu_load() in kvm_arch_vcpu_ioctl() could be hard to
> > handle.
> > 
> > So I'd think maybe to look towards making KVM_TDX_INIT_MEM_REGION more
> > robust, which would mean the eventual solution wouldn't have ABI concerns by
> > later blocking things that used to be allowed.
> > 
> > Maybe having kvm_tdp_mmu_map_private_pfn() return success for spurious
> > faults is enough. But this is all for a case that userspace isn't expected
> > to actually hit, so seems like something that could be kicked down the road
> > easily.
> 
> You're trying to be too "nice", just smack 'em with a big hammer.  For all
> intents and purposes, the paths in question are fully serialized, there's no
> reason to try and allow anything remotely interesting to happen.
> 
> Acquire kvm->lock to prevent VM-wide things from happening, slots_lock to
> prevent kvm_mmu_zap_all_fast(), and _all_ vCPU mutexes to prevent vCPUs from
> interefering.
> 
> Doing that for a vCPU ioctl is a bit awkward, but not awful.  E.g. we can
> abuse kvm_arch_vcpu_async_ioctl().  In hindsight, a more clever approach would
> have been to make KVM_TDX_INIT_MEM_REGION a VM-scoped ioctl that takes a vCPU
> fd. Oh well.

Yea.

> 
> Anyways, I think we need to avoid the "synchronous" ioctl path anyways,
> because taking kvm->slots_lock inside vcpu->mutex is gross.  AFAICT it's not
> actively problematic today, but it feels like a deadlock waiting to happen.
> 
> The other oddity I see is the handling of kvm_tdx->state.  I don't see how
> this check in tdx_vcpu_create() is safe:
> 
> 	if (kvm_tdx->state != TD_STATE_INITIALIZED)
> 		return -EIO;
> 
> kvm_arch_vcpu_create() runs without any locks held, and so TDX effectively has
> the same bug that SEV intra-host migration had, where an in-flight vCPU
> creation could race with a VM-wide state transition (see commit ecf371f8b02d
> ("KVM: SVM: Reject SEV{-ES} intra host migration if vCPU creation is in-
> flight").  To fix that, kvm->lock needs to be taken and KVM needs to verify
> there's no in-flight vCPU creation, e.g. so that a vCPU doesn't pop up and
> contend a TDX-Module lock.
> 
> We an even define a fancy new CLASS to handle the lock+check => unlock logic
> with guard()-like syntax:
> 
> 	CLASS(tdx_vm_state_guard, guard)(kvm);
> 	if (IS_ERR(guard))
> 		return PTR_ERR(guard);
> 
> IIUC, with all of those locks, KVM can KVM_BUG_ON() both TDH_MEM_PAGE_ADD and
> TDH_MR_EXTEND, with no exceptions given for -EBUSY.  Attached patches are very
> lightly tested as usual and need to be chunked up, but seem do to what I want.

Ok, the direction seem clear. The patch has an issue, need to debug.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in S-EPT management
  2025-08-29 22:17           ` Edgecombe, Rick P
@ 2025-08-29 22:58             ` Sean Christopherson
  2025-08-29 22:59               ` Edgecombe, Rick P
  0 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29 22:58 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Kai Huang, ackerleytng@google.com, Vishal Annapurve,
	linux-kernel@vger.kernel.org, Yan Y Zhao, Ira Weiny,
	pbonzini@redhat.com, kvm@vger.kernel.org, michael.roth@amd.com

On Fri, Aug 29, 2025, Rick P Edgecombe wrote:
> On Fri, 2025-08-29 at 15:02 -0700, Sean Christopherson wrote:
> > On Fri, Aug 29, 2025, Rick P Edgecombe wrote:
> > > On Fri, 2025-08-29 at 13:19 -0700, Sean Christopherson wrote:
> > > > I'm happy to include more context in the changelog, but I really don't want
> > > > anyone to walk away from this thinking that pinning pages in random KVM code
> > > > is at all encouraged.
> > > 
> > > Sorry for going on a tangent. Defensive programming inside the kernel is a
> > > little more settled. But for defensive programming against the TDX module, there
> > > are various schools of thought internally. Currently we rely on some
> > > undocumented behavior of the TDX module (as in not in the spec) for correctness.
> > 
> > Examples?
> 
> I was thinking about the BUSY error code avoidance logic that is now called
> tdh_do_no_vcpus(). We assume no new conditions will appear that cause a
> TDX_OPERAND_BUSY. Like a guest opt-in or something.

Ah, gotcha.  If that happens, that's a TDX-Module ABI break.  Probably a good
idea to drill it into the TDX-Module authors/designers that ABI is established
when behavior is visible to the user, regardless of whether or not that behavior
is formally defined.

Note, breaking ABI _can_ be fine, e.g. if the behavior of some SEAMCALL changes,
but KVM doesn't care.  But if the TDX-Module suddenly starts failing a SEAMCALL
that previously succeeded, then we're going to have a problem.

> It's on our todo list to transition those assumptions to promises. We just need
> to formalize them.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in S-EPT management
  2025-08-29 22:58             ` Sean Christopherson
@ 2025-08-29 22:59               ` Edgecombe, Rick P
  0 siblings, 0 replies; 62+ messages in thread
From: Edgecombe, Rick P @ 2025-08-29 22:59 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: Huang, Kai, ackerleytng@google.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, Weiny, Ira,
	kvm@vger.kernel.org, pbonzini@redhat.com, michael.roth@amd.com

On Fri, 2025-08-29 at 15:58 -0700, Sean Christopherson wrote:
> > I was thinking about the BUSY error code avoidance logic that is now called
> > tdh_do_no_vcpus(). We assume no new conditions will appear that cause a
> > TDX_OPERAND_BUSY. Like a guest opt-in or something.
> 
> Ah, gotcha.  If that happens, that's a TDX-Module ABI break.  Probably a good
> idea to drill it into the TDX-Module authors/designers that ABI is established
> when behavior is visible to the user, regardless of whether or not that
> behavior is formally defined.
> 
> Note, breaking ABI _can_ be fine, e.g. if the behavior of some SEAMCALL
> changes, but KVM doesn't care.  But if the TDX-Module suddenly starts failing
> a SEAMCALL that previously succeeded, then we're going to have a problem.

Thanks! I'll use this quote.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 12/18] KVM: TDX: Bug the VM if extended the initial measurement fails
  2025-08-29 22:39         ` Edgecombe, Rick P
@ 2025-08-29 23:15           ` Edgecombe, Rick P
  2025-08-29 23:18             ` Sean Christopherson
  0 siblings, 1 reply; 62+ messages in thread
From: Edgecombe, Rick P @ 2025-08-29 23:15 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: Huang, Kai, ackerleytng@google.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, Weiny, Ira,
	kvm@vger.kernel.org, michael.roth@amd.com, pbonzini@redhat.com

On Fri, 2025-08-29 at 15:39 -0700, Rick Edgecombe wrote:
> > 
> > Anyways, I think we need to avoid the "synchronous" ioctl path anyways,
> > because taking kvm->slots_lock inside vcpu->mutex is gross.  AFAICT it's not
> > actively problematic today, but it feels like a deadlock waiting to happen.
> > 
> > The other oddity I see is the handling of kvm_tdx->state.  I don't see how
> > this check in tdx_vcpu_create() is safe:
> > 
> >  	if (kvm_tdx->state != TD_STATE_INITIALIZED)
> >  		return -EIO;
> > 
> > kvm_arch_vcpu_create() runs without any locks held, 

Oh, you're right. It's about those fields being set further down in the function
based on the results of KVM_TDX_INIT_VM, rather then TDX module locking. The
race would show if vCPU creation transitioned to TD_STATE_RUNNABLE in finalized
while another vCPU was getting created. Though I'm not sure exactly what would
go wrong, the code is wrong enough looking to be worth a fix.

> > and so TDX effectively has the same bug that SEV intra-host migration had,
> > where an in-flight vCPU creation could race with a VM-wide state transition
> > (see commit ecf371f8b02d ("KVM: SVM: Reject SEV{-ES} intra host migration if
> > vCPU creation is in-flight").  To fix that, kvm->lock needs to be taken and
> > KVM needs to verify there's no in-flight vCPU creation, e.g. so that a vCPU
> > doesn't pop up and contend a TDX-Module lock.
> > 
> > We an even define a fancy new CLASS to handle the lock+check => unlock logic
> > with guard()-like syntax:
> > 
> >  	CLASS(tdx_vm_state_guard, guard)(kvm);
> >  	if (IS_ERR(guard))
> >  		return PTR_ERR(guard);
> > 
> > IIUC, with all of those locks, KVM can KVM_BUG_ON() both TDH_MEM_PAGE_ADD
> > and TDH_MR_EXTEND, with no exceptions given for -EBUSY.  Attached patches
> > are very lightly tested as usual and need to be chunked up, but seem do to
> > what I want.
> 
> Ok, the direction seem clear. The patch has an issue, need to debug.

Just this:

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c595d9cb6dcd..e99d07611393 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2809,7 +2809,7 @@ static int tdx_td_finalize(struct kvm *kvm, struct
kvm_tdx_cmd *cmd)
 
 static int tdx_get_cmd(void __user *argp, struct kvm_tdx_cmd *cmd)
 {
-       if (copy_from_user(cmd, argp, sizeof(cmd)))
+       if (copy_from_user(cmd, argp, sizeof(*cmd)))
                return -EFAULT;
 
        if (cmd->hw_error)


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 12/18] KVM: TDX: Bug the VM if extended the initial measurement fails
  2025-08-29 23:15           ` Edgecombe, Rick P
@ 2025-08-29 23:18             ` Sean Christopherson
  0 siblings, 0 replies; 62+ messages in thread
From: Sean Christopherson @ 2025-08-29 23:18 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Kai Huang, ackerleytng@google.com, Vishal Annapurve,
	linux-kernel@vger.kernel.org, Yan Y Zhao, Ira Weiny,
	kvm@vger.kernel.org, michael.roth@amd.com, pbonzini@redhat.com

On Fri, Aug 29, 2025, Rick P Edgecombe wrote:
> On Fri, 2025-08-29 at 15:39 -0700, Rick Edgecombe wrote:
> > Ok, the direction seem clear. The patch has an issue, need to debug.
> 
> Just this:
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index c595d9cb6dcd..e99d07611393 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2809,7 +2809,7 @@ static int tdx_td_finalize(struct kvm *kvm, struct
> kvm_tdx_cmd *cmd)
>  
>  static int tdx_get_cmd(void __user *argp, struct kvm_tdx_cmd *cmd)
>  {
> -       if (copy_from_user(cmd, argp, sizeof(cmd)))
> +       if (copy_from_user(cmd, argp, sizeof(*cmd)))

LOL, it's always some mundane detail!

>                 return -EFAULT;
>  
>         if (cmd->hw_error)
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 13/18] KVM: TDX: ADD pages to the TD image while populating mirror EPT entries
  2025-08-29  0:06 ` [RFC PATCH v2 13/18] KVM: TDX: ADD pages to the TD image while populating mirror EPT entries Sean Christopherson
@ 2025-08-29 23:42   ` Edgecombe, Rick P
  2025-09-02 17:09     ` Sean Christopherson
  0 siblings, 1 reply; 62+ messages in thread
From: Edgecombe, Rick P @ 2025-08-29 23:42 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com
  Cc: Huang, Kai, ackerleytng@google.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, Weiny, Ira,
	kvm@vger.kernel.org, michael.roth@amd.com

On Thu, 2025-08-28 at 17:06 -0700, Sean Christopherson wrote:
> When populating the initial memory image for a TDX guest, ADD pages to the
> TD as part of establishing the mappings in the mirror EPT, as opposed to
> creating the mappings and then doing ADD after the fact.  Doing ADD in the
> S-EPT callbacks eliminates the need to track "premapped" pages, as the
> mirror EPT (M-EPT) and S-EPT are always synchronized, e.g. if ADD fails,
> KVM reverts to the previous M-EPT entry (guaranteed to be !PRESENT).
> 
> Eliminating the hole where the M-EPT can have a mapping that doesn't exist
> in the S-EPT in turn obviates the need to handle errors that are unique to
> encountering a missing S-EPT entry (see tdx_is_sept_zap_err_due_to_premap()).
> 
> Keeping the M-EPT and S-EPT synchronized also eliminates the need to check
> for unconsumed "premap" entries during tdx_td_finalize(), as there simply
> can't be any such entries.  Dropping that check in particular reduces the
> overall cognitive load, as the managemented of nr_premapped with respect
> to removal of S-EPT is _very_ subtle.  E.g. successful removal of an S-EPT
> entry after it completed ADD doesn't adjust nr_premapped, but it's not
> clear why that's "ok" but having half-baked entries is not (it's not truly
> "ok" in that removing pages from the image will likely prevent the guest
> from booting, but from KVM's perspective it's "ok").
> 
> Doing ADD in the S-EPT path requires passing an argument via a scratch
> field, but the current approach of tracking the number of "premapped"
> pages effectively does the same.  And the "premapped" counter is much more
> dangerous, as it doesn't have a singular lock to protect its usage, since
> nr_premapped can be modified as soon as mmu_lock is dropped, at least in
> theory.  I.e. nr_premapped is guarded by slots_lock, but only for "happy"
> paths.
> 
> Note, this approach was used/tried at various points in TDX development,
> but was ultimately discarded due to a desire to avoid stashing temporary
> state in kvm_tdx.  But as above, KVM ended up with such state anyways,
> and fully committing to using temporary state provides better access
> rules (100% guarded by slots_lock), and makes several edge cases flat out
> impossible.
> 
> Note #2, continue to extend the measurement outside of mmu_lock, as it's
> a slow operation (typically 16 SEAMCALLs per page whose data is included
> in the measurement), and doesn't *need* to be done under mmu_lock, e.g.
> for consistency purposes.  However, MR.EXTEND isn't _that_ slow, e.g.
> ~1ms latency to measure a full page, so if it needs to be done under
> mmu_lock in the future, e.g. because KVM gains a flow that can remove
> S-EPT entries uring KVM_TDX_INIT_MEM_REGION, then extending the
                ^using
> measurement can also be moved into the S-EPT mapping path (again, only if
> absolutely necessary).  P.S. _If_ MR.EXTEND is moved into the S-EPT path,
> take care not to return an error up the stack if TDH_MR_EXTEND fails, as
> removing the M-EPT entry but not the S-EPT entry would result in
> inconsistent state!
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

But some possible unintended changed below.

>  arch/x86/kvm/vmx/tdx.c | 116 ++++++++++++++---------------------------
>  arch/x86/kvm/vmx/tdx.h |   8 ++-
>  2 files changed, 46 insertions(+), 78 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index bc92e87a1dbb..00c3dc376690 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1586,6 +1586,32 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
>  	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
>  }
>  
> +static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +			    kvm_pfn_t pfn)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	u64 err, entry, level_state;
> +	gpa_t gpa = gfn_to_gpa(gfn);
> +
> +	lockdep_assert_held(&kvm->slots_lock);
> +
> +	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm) ||
> +	    KVM_BUG_ON(!kvm_tdx->page_add_src, kvm))
> +		return -EIO;
> +
> +	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
> +			       kvm_tdx->page_add_src, &entry, &level_state);
> +	if (unlikely(tdx_operand_busy(err)))
> +		return -EBUSY;
> +
> +	if (KVM_BUG_ON(err, kvm)) {
> +		pr_tdx_error_2(TDH_MEM_PAGE_ADD, err, entry, level_state);
> +		return -EIO;
> +	}
> +
> +	return 0;
> +}
> +
>  static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
>  			    enum pg_level level, kvm_pfn_t pfn)
>  {
> @@ -1627,19 +1653,10 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>  
>  	/*
>  	 * If the TD isn't finalized/runnable, then userspace is initializing
> -	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Increment the number of
> -	 * pages that need to be mapped and initialized via TDH.MEM.PAGE.ADD.
> -	 * KVM_TDX_FINALIZE_VM checks the counter to ensure all mapped pages
> -	 * have been added to the image, to prevent running the TD with a
> -	 * valid mapping in the mirror EPT, but not in the S-EPT.
> +	 * the VM image via KVM_TDX_INIT_MEM_REGION; ADD the page to the TD.
>  	 */
> -	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) {
> -		if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
> -			return -EIO;
> -
> -		atomic64_inc(&kvm_tdx->nr_premapped);
> -		return 0;
> -	}
> +	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE))
> +		return tdx_mem_page_add(kvm, gfn, level, pfn);
>  
>  	return tdx_mem_page_aug(kvm, gfn, level, pfn);
>  }
> @@ -1665,39 +1682,6 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
>  	return 0;
>  }
>  
> -/*
> - * Check if the error returned from a SEPT zap SEAMCALL is due to that a page is
> - * mapped by KVM_TDX_INIT_MEM_REGION without tdh_mem_page_add() being called
> - * successfully.
> - *
> - * Since tdh_mem_sept_add() must have been invoked successfully before a
> - * non-leaf entry present in the mirrored page table, the SEPT ZAP related
> - * SEAMCALLs should not encounter err TDX_EPT_WALK_FAILED. They should instead
> - * find TDX_EPT_ENTRY_STATE_INCORRECT due to an empty leaf entry found in the
> - * SEPT.
> - *
> - * Further check if the returned entry from SEPT walking is with RWX permissions
> - * to filter out anything unexpected.
> - *
> - * Note: @level is pg_level, not the tdx_level. The tdx_level extracted from
> - * level_state returned from a SEAMCALL error is the same as that passed into
> - * the SEAMCALL.
> - */
> -static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err,
> -					     u64 entry, int level)
> -{
> -	if (!err || kvm_tdx->state == TD_STATE_RUNNABLE)
> -		return false;
> -
> -	if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))
> -		return false;
> -
> -	if ((is_last_spte(entry, level) && (entry & VMX_EPT_RWX_MASK)))
> -		return false;
> -
> -	return true;
> -}
> -
>  static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>  				     enum pg_level level, struct page *page)
>  {
> @@ -1717,12 +1701,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>  		err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
>  		tdx_no_vcpus_enter_stop(kvm);
>  	}
> -	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
> -		if (KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm))
> -			return -EIO;
> -
> -		return 0;
> -	}
>  
>  	if (KVM_BUG_ON(err, kvm)) {
>  		pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
> @@ -2827,12 +2805,6 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>  
>  	if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE)
>  		return -EINVAL;
> -	/*
> -	 * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
> -	 * TDH.MEM.PAGE.ADD().
> -	 */
> -	if (atomic64_read(&kvm_tdx->nr_premapped))
> -		return -EINVAL;
>  
>  	cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td);
>  	if (tdx_operand_busy(cmd->hw_error))
> @@ -3116,11 +3088,14 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  {
>  	struct tdx_gmem_post_populate_arg *arg = _arg;
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> -	u64 err, entry, level_state;
>  	gpa_t gpa = gfn_to_gpa(gfn);
> +	u64 err, entry, level_state;

Fine, but ?

>  	struct page *src_page;
>  	int ret, i;
>  
> +	if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
> +		return -EIO;
> +
>  	/*
>  	 * Get the source page if it has been faulted in. Return failure if the
>  	 * source page has been swapped out or unmapped in primary memory.
> @@ -3131,22 +3106,14 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  	if (ret != 1)
>  		return -ENOMEM;
>  
> +	kvm_tdx->page_add_src = src_page;
>  	ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
> -	if (ret < 0)
> -		goto out;
> +	kvm_tdx->page_add_src = NULL;
>  
> -	ret = 0;
> -	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
> -			       src_page, &entry, &level_state);
> -	if (err) {
> -		ret = unlikely(tdx_operand_busy(err)) ? -EBUSY : -EIO;
> -		goto out;
> -	}
> +	put_page(src_page);
>  
> -	KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm);
> -
> -	if (!(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION))
> -		goto out;
> +	if (ret || !(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION))
> +		return ret;
>  
>  	/*
>  	 * Note, MR.EXTEND can fail if the S-EPT mapping is somehow removed
> @@ -3159,14 +3126,11 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  		err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state);
>  		if (KVM_BUG_ON(err, kvm)) {
>  			pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state);
> -			ret = -EIO;
> -			goto out;
> +			return -EIO;
>  		}
>  	}
>  
> -out:
> -	put_page(src_page);
> -	return ret;
> +	return 0;
>  }
>  
>  static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index ca39a9391db1..1b00adbbaf77 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -36,8 +36,12 @@ struct kvm_tdx {
>  
>  	struct tdx_td td;
>  
> -	/* For KVM_TDX_INIT_MEM_REGION. */
> -	atomic64_t nr_premapped;
> +	/*
> +	 * Scratch pointer used to pass the source page to tdx_mem_page_add.
> +	 * Protected by slots_lock, and non-NULL only when mapping a private
> +	 * pfn via tdx_gmem_post_populate().
> +	 */
> +	struct page *page_add_src;
>  
>  	/*
>  	 * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 16/18] KVM: TDX: Derive error argument names from the local variable names
  2025-08-29  0:06 ` [RFC PATCH v2 16/18] KVM: TDX: Derive error argument names from the local variable names Sean Christopherson
@ 2025-08-30  0:00   ` Edgecombe, Rick P
  0 siblings, 0 replies; 62+ messages in thread
From: Edgecombe, Rick P @ 2025-08-30  0:00 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com
  Cc: Huang, Kai, ackerleytng@google.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, Weiny, Ira,
	kvm@vger.kernel.org, michael.roth@amd.com

On Thu, 2025-08-28 at 17:06 -0700, Sean Christopherson wrote:
> When printing SEAMCALL errors, use the name of the variable holding an
> error parameter instead of the register from whence it came, so that flows
> which use descriptive variable names will similarly print descriptive
> error messages.
> 
> Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Tested that it actually prints out what is expected.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in S-EPT management
  2025-08-29 19:53   ` Edgecombe, Rick P
  2025-08-29 20:19     ` Sean Christopherson
@ 2025-09-01  1:25     ` Yan Zhao
  2025-09-02 17:33       ` Sean Christopherson
  1 sibling, 1 reply; 62+ messages in thread
From: Yan Zhao @ 2025-09-01  1:25 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, Huang, Kai,
	ackerleytng@google.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Weiny, Ira, kvm@vger.kernel.org,
	michael.roth@amd.com

On Sat, Aug 30, 2025 at 03:53:24AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-08-28 at 17:06 -0700, Sean Christopherson wrote:
> > From: Yan Zhao <yan.y.zhao@intel.com>
> > When S-EPT zapping errors occur, KVM_BUG_ON() is invoked to kick off all
> > vCPUs and mark the VM as dead. Although there is a potential window that a
> > private page mapped in the S-EPT could be reallocated and used outside the
> > VM, the loud warning from KVM_BUG_ON() should provide sufficient debug
> > information.
... 
> Yan, can you clarify what you mean by "there could be a small window"? I'm
> thinking this is a hypothetical window around vm_dead races? Or more concrete? I
> *don't* want to re-open the debate on whether to go with this approach, but I
> think this is a good teaching edge case to settle on how we want to treat
> similar issues. So I just want to make sure we have the justification right.
I think this window isn't hypothetical.

1. SEAMCALL failure in tdx_sept_remove_private_spte().
   KVM_BUG_ON() sets vm_dead and kicks off all vCPUs.
2. guest_memfd invalidation completes. memory is freed.
3. VM gets killed.

After 2, the page is still mapped in the S-EPT, but it could potentially be
reallocated and used outside the VM.

From the TDX module and hardware's perspective, the mapping in the S-EPT for
this page remains valid. So, I'm uncertain if the TDX module might do something
creative to access the guest page after 2.

Besides, a cache flush after 2 can essentially cause a memory write to the page.
Though we could invoke tdh_phymem_page_wbinvd_hkid() after the KVM_BUG_ON(), the
SEAMCALL itself can fail.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 15/18] KVM: TDX: Combine KVM_BUG_ON + pr_tdx_error() into TDX_BUG_ON()
  2025-08-29 14:19     ` Sean Christopherson
@ 2025-09-01  1:46       ` Binbin Wu
  0 siblings, 0 replies; 62+ messages in thread
From: Binbin Wu @ 2025-09-01  1:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Ira Weiny, Kai Huang,
	Michael Roth, Yan Zhao, Vishal Annapurve, Rick Edgecombe,
	Ackerley Tng



On 8/29/2025 10:19 PM, Sean Christopherson wrote:
> On Fri, Aug 29, 2025, Binbin Wu wrote:
>> On 8/29/2025 8:06 AM, Sean Christopherson wrote:
>>> Add TDX_BUG_ON() macros (with varying numbers of arguments) to deduplicate
>>> the myriad flows that do KVM_BUG_ON()/WARN_ON_ONCE() followed by a call to
>>> pr_tdx_error().  In addition to reducing boilerplate copy+paste code, this
>>> also helps ensure that KVM provides consistent handling of SEAMCALL errors.
>>>
>>> Opportunistically convert a handful of bare WARN_ON_ONCE() paths to the
>>> equivalent of KVM_BUG_ON(), i.e. have them terminate the VM.  If a SEAMCALL
>>> error is fatal enough to WARN on, it's fatal enough to terminate the TD.
>>>
>>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>>> ---
>>>    arch/x86/kvm/vmx/tdx.c | 114 +++++++++++++++++------------------------
>>>    1 file changed, 47 insertions(+), 67 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>>> index aa6d88629dae..df9b4496cd01 100644
>>> --- a/arch/x86/kvm/vmx/tdx.c
>>> +++ b/arch/x86/kvm/vmx/tdx.c
>>> @@ -24,20 +24,32 @@
>>>    #undef pr_fmt
>>>    #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>>> -#define pr_tdx_error(__fn, __err)	\
>>> -	pr_err_ratelimited("SEAMCALL %s failed: 0x%llx\n", #__fn, __err)
>>> +#define __TDX_BUG_ON(__err, __f, __kvm, __fmt, __args...)			\
>>> +({										\
>>> +	struct kvm *_kvm = (__kvm);						\
>>> +	bool __ret = !!(__err);							\
>>> +										\
>>> +	if (WARN_ON_ONCE(__ret && (!_kvm || !_kvm->vm_bugged))) {		\
>>> +		if (_kvm)							\
>>> +			kvm_vm_bugged(_kvm);					\
>>> +		pr_err_ratelimited("SEAMCALL " __f " failed: 0x%llx" __fmt "\n",\
>>> +				   __err,  __args);				\
>>> +	}									\
>>> +	unlikely(__ret);							\
>>> +})
>>> -#define __pr_tdx_error_N(__fn_str, __err, __fmt, ...)		\
>>> -	pr_err_ratelimited("SEAMCALL " __fn_str " failed: 0x%llx, " __fmt,  __err,  __VA_ARGS__)
>>> +#define TDX_BUG_ON(__err, __fn, __kvm)				\
>>> +	__TDX_BUG_ON(__err, #__fn, __kvm, "%s", "")
>>> -#define pr_tdx_error_1(__fn, __err, __rcx)		\
>>> -	__pr_tdx_error_N(#__fn, __err, "rcx 0x%llx\n", __rcx)
>>> +#define TDX_BUG_ON_1(__err, __fn, __rcx, __kvm)			\
>>> +	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx", __rcx)
>>> -#define pr_tdx_error_2(__fn, __err, __rcx, __rdx)	\
>>> -	__pr_tdx_error_N(#__fn, __err, "rcx 0x%llx, rdx 0x%llx\n", __rcx, __rdx)
>>> +#define TDX_BUG_ON_2(__err, __fn, __rcx, __rdx, __kvm)		\
>>> +	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx, rdx 0x%llx", __rcx, __rdx)
>>> +
>>> +#define TDX_BUG_ON_3(__err, __fn, __rcx, __rdx, __r8, __kvm)	\
>>> +	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx, rdx 0x%llx, r8 0x%llx", __rcx, __rdx, __r8)
>>> -#define pr_tdx_error_3(__fn, __err, __rcx, __rdx, __r8)	\
>>> -	__pr_tdx_error_N(#__fn, __err, "rcx 0x%llx, rdx 0x%llx, r8 0x%llx\n", __rcx, __rdx, __r8)
>> I thought you would use the format Rick proposed in
>> https://lore.kernel.org/all/9e55a0e767317d20fc45575c4ed6dafa863e1ca0.camel@intel.com/
>>      #define TDX_BUG_ON_2(__err, __fn, arg1, arg2, __kvm)        \
>>          __TDX_BUG_ON(__err, #__fn, __kvm, ", " #arg1 " 0x%llx, " #arg2 "
>>      0x%llx", arg1, arg2)
>>
>>      so you get: entry: 0x00 level:0xF00
>>
>> No?
> Ya, see the next patch :-)

Oh, sorry for the noise.
>


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 12/18] KVM: TDX: Bug the VM if extended the initial measurement fails
  2025-08-29 20:11       ` Sean Christopherson
  2025-08-29 22:39         ` Edgecombe, Rick P
@ 2025-09-02  9:24         ` Yan Zhao
  2025-09-02 17:04           ` Sean Christopherson
  1 sibling, 1 reply; 62+ messages in thread
From: Yan Zhao @ 2025-09-02  9:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Rick P Edgecombe, Kai Huang, ackerleytng@google.com,
	Vishal Annapurve, linux-kernel@vger.kernel.org, Ira Weiny,
	kvm@vger.kernel.org, michael.roth@amd.com, pbonzini@redhat.com

On Fri, Aug 29, 2025 at 01:11:35PM -0700, Sean Christopherson wrote:
> On Fri, Aug 29, 2025, Rick P Edgecombe wrote:
> > On Fri, 2025-08-29 at 16:18 +0800, Yan Zhao wrote:
> > > > +	/*
> > > > +	 * Note, MR.EXTEND can fail if the S-EPT mapping is somehow removed
> > > > +	 * between mapping the pfn and now, but slots_lock prevents memslot
> > > > +	 * updates, filemap_invalidate_lock() prevents guest_memfd updates,
> > > > +	 * mmu_notifier events can't reach S-EPT entries, and KVM's
> > > > internal
> > > > +	 * zapping flows are mutually exclusive with S-EPT mappings.
> > > > +	 */
> > > > +	for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
> > > > +		err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry,
> > > > &level_state);
> > > > +		if (KVM_BUG_ON(err, kvm)) {
> > > I suspect tdh_mr_extend() running on one vCPU may contend with
> > > tdh_vp_create()/tdh_vp_addcx()/tdh_vp_init*()/tdh_vp_rd()/tdh_vp_wr()/
> > > tdh_mng_rd()/tdh_vp_flush() on other vCPUs, if userspace invokes ioctl
> > > KVM_TDX_INIT_MEM_REGION on one vCPU while initializing other vCPUs.
> > > 
> > > It's similar to the analysis of contention of tdh_mem_page_add() [1], as
> > > both tdh_mr_extend() and tdh_mem_page_add() acquire exclusive lock on
> > > resource TDR.
> > > 
> > > I'll try to write a test to verify it and come back to you.
I've written a selftest and proved the contention between tdh_mr_extend() and
tdh_vp_create().

The KVM_BUG_ON() after tdh_mr_extend() now is not hittable with Sean's newly
provided 2 fixes.


But during writing another concurrency test, I found a sad news :

SEAMCALL TDH_VP_INIT requires to hold exclusive lock for resource TDR when its
leaf_opcode.version > 0. So, when I use v1 (which is the current value in
upstream, for x2apic?) to test executing ioctl KVM_TDX_INIT_VCPU on different
vCPUs concurrently, the TDX_BUG_ON() following tdh_vp_init() will print error
"SEAMCALL TDH_VP_INIT failed: 0x8000020000000080".

If I switch to using v0 version of TDH_VP_INIT, the contention will be gone.

Note: this acquiring of exclusive lock was not previously present in the public
repo https://github.com/intel/tdx-module.git, branch tdx_1.5.
(The branch has been force-updated to new implementation now).


> > I'm seeing the same thing in the TDX module. It could fail because of contention
> > controllable from userspace. So the KVM_BUG_ON() is not appropriate.
> > 
> > Today though if tdh_mr_extend() fails because of contention then the TD is
> > essentially dead anyway. Trying to redo KVM_TDX_INIT_MEM_REGION will fail. The
> > M-EPT fault could be spurious but the second tdh_mem_page_add() would return an
> > error and never get back to the tdh_mr_extend().
> > 
> > The version in this patch can't recover for a different reason. That is 
> > kvm_tdp_mmu_map_private_pfn() doesn't handle spurious faults, so I'd say just
> > drop the KVM_BUG_ON(), and try to handle the contention in a separate effort.
> > 
> > I guess the two approaches could be to make KVM_TDX_INIT_MEM_REGION more robust,
> 
> This.  First and foremost, KVM's ordering and locking rules need to be explicit
> (ideally documented, but at the very least apparent in the code), *especially*
> when the locking (or lack thereof) impacts userspace.  Even if effectively relying
> on the TDX-module to provide ordering "works", it's all but impossible to follow.
> 
> And it doesn't truly work, as everything in the TDX-Module is a trylock, and that
> in turn prevents KVM from asserting success.  Sometimes KVM has better option than
> to rely on hardware to detect failure, but it really should be a last resort,
> because not being able to expect success makes debugging no fun.  Even worse, it
> bleeds hard-to-document, specific ordering requirements into userspace, e.g. in
> this case, it sounds like userspace can't do _anything_ on vCPUs while doing
> KVM_TDX_INIT_MEM_REGION.  Which might not be a burden for userspace, but oof is
> it nasty from an ABI perspective.
> 
> > or prevent the contention. For the latter case:
> > tdh_vp_create()/tdh_vp_addcx()/tdh_vp_init*()/tdh_vp_rd()/tdh_vp_wr()
> > ...I think we could just take slots_lock during KVM_TDX_INIT_VCPU and
> > KVM_TDX_GET_CPUID.
> > 
> > For tdh_vp_flush() the vcpu_load() in kvm_arch_vcpu_ioctl() could be hard to
> > handle.
> > 
> > So I'd think maybe to look towards making KVM_TDX_INIT_MEM_REGION more robust,
> > which would mean the eventual solution wouldn't have ABI concerns by later
> > blocking things that used to be allowed.
> > 
> > Maybe having kvm_tdp_mmu_map_private_pfn() return success for spurious faults is
> > enough. But this is all for a case that userspace isn't expected to actually
> > hit, so seems like something that could be kicked down the road easily.
> 
> You're trying to be too "nice", just smack 'em with a big hammer.  For all intents
> and purposes, the paths in question are fully serialized, there's no reason to try
> and allow anything remotely interesting to happen.
This big hammer looks good to me :)

> 
> Acquire kvm->lock to prevent VM-wide things from happening, slots_lock to prevent
> kvm_mmu_zap_all_fast(), and _all_ vCPU mutexes to prevent vCPUs from interefering.
Nit: we should have no worry to kvm_mmu_zap_all_fast(), since it only zaps
!mirror roots. The slots_lock should be for slots deletion.

> 
> Doing that for a vCPU ioctl is a bit awkward, but not awful.  E.g. we can abuse
> kvm_arch_vcpu_async_ioctl().  In hindsight, a more clever approach would have
> been to make KVM_TDX_INIT_MEM_REGION a VM-scoped ioctl that takes a vCPU fd.  Oh
> well.
> 
> Anyways, I think we need to avoid the "synchronous" ioctl path anyways, because
> taking kvm->slots_lock inside vcpu->mutex is gross.  AFAICT it's not actively
> problematic today, but it feels like a deadlock waiting to happen.
Note: Looks kvm_inhibit_apic_access_page() also takes kvm->slots_lock inside
vcpu->mutex.

 
> The other oddity I see is the handling of kvm_tdx->state.  I don't see how this
> check in tdx_vcpu_create() is safe:
> 
> 	if (kvm_tdx->state != TD_STATE_INITIALIZED)
> 		return -EIO;

Right, if tdh_vp_create() contends with tdh_mr_finalize(), KVM_BUG_ON() will be
triggered.
I previously overlooked the KVM_BUG_ON() after tdh_vp_create(), thinking that
it's ok to have it return error once tdh_vp_create() is invoked after
tdh_mr_finalize().

...
>  int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
>  {
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> @@ -3146,19 +3211,14 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
>  	if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE)
>  		return -EINVAL;
>  
> -	if (copy_from_user(&cmd, argp, sizeof(cmd)))
> -		return -EFAULT;
> -
> -	if (cmd.hw_error)
> -		return -EINVAL;
> +	ret = tdx_get_cmd(argp, &cmd);
> +	if (ret)
> +		return ret;
>  
>  	switch (cmd.id) {
>  	case KVM_TDX_INIT_VCPU:
>  		ret = tdx_vcpu_init(vcpu, &cmd);
>  		break;
So, do we need to move KVM_TDX_INIT_VCPU to tdx_vcpu_async_ioctl() as well?

> -	case KVM_TDX_INIT_MEM_REGION:
> -		ret = tdx_vcpu_init_mem_region(vcpu, &cmd);
> -		break;
>  	case KVM_TDX_GET_CPUID:
>  		ret = tdx_vcpu_get_cpuid(vcpu, &cmd);


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 12/18] KVM: TDX: Bug the VM if extended the initial measurement fails
  2025-09-02  9:24         ` Yan Zhao
@ 2025-09-02 17:04           ` Sean Christopherson
  2025-09-03  0:18             ` Edgecombe, Rick P
  0 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-09-02 17:04 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Rick P Edgecombe, Kai Huang, ackerleytng@google.com,
	Vishal Annapurve, linux-kernel@vger.kernel.org, Ira Weiny,
	kvm@vger.kernel.org, michael.roth@amd.com, pbonzini@redhat.com

On Tue, Sep 02, 2025, Yan Zhao wrote:
> But during writing another concurrency test, I found a sad news :
> 
> SEAMCALL TDH_VP_INIT requires to hold exclusive lock for resource TDR when its
> leaf_opcode.version > 0. So, when I use v1 (which is the current value in
> upstream, for x2apic?) to test executing ioctl KVM_TDX_INIT_VCPU on different
> vCPUs concurrently, the TDX_BUG_ON() following tdh_vp_init() will print error
> "SEAMCALL TDH_VP_INIT failed: 0x8000020000000080".
> 
> If I switch to using v0 version of TDH_VP_INIT, the contention will be gone.

Uh, so that's exactly the type of breaking ABI change that isn't acceptable.  If
it's really truly necessary, then we can can probably handle the change in KVM
since TDX is so new, but generally speaking such changes simply must not happen.

> Note: this acquiring of exclusive lock was not previously present in the public
> repo https://github.com/intel/tdx-module.git, branch tdx_1.5.
> (The branch has been force-updated to new implementation now).

Lovely.

> > Acquire kvm->lock to prevent VM-wide things from happening, slots_lock to prevent
> > kvm_mmu_zap_all_fast(), and _all_ vCPU mutexes to prevent vCPUs from interefering.
> Nit: we should have no worry to kvm_mmu_zap_all_fast(), since it only zaps
> !mirror roots. The slots_lock should be for slots deletion.

Oof, I missed that.  We should have required nx_huge_pages=never for tdx=1.
Probably too late for that now though :-/

> > Doing that for a vCPU ioctl is a bit awkward, but not awful.  E.g. we can abuse
> > kvm_arch_vcpu_async_ioctl().  In hindsight, a more clever approach would have
> > been to make KVM_TDX_INIT_MEM_REGION a VM-scoped ioctl that takes a vCPU fd.  Oh
> > well.
> > 
> > Anyways, I think we need to avoid the "synchronous" ioctl path anyways, because
> > taking kvm->slots_lock inside vcpu->mutex is gross.  AFAICT it's not actively
> > problematic today, but it feels like a deadlock waiting to happen.
> Note: Looks kvm_inhibit_apic_access_page() also takes kvm->slots_lock inside
> vcpu->mutex.

Yikes.  As does kvm_alloc_apic_access_page(), which is likely why I thought it
was ok to take slots_lock.  But while kvm_alloc_apic_access_page() appears to be
called with vCPU scope, it's actually called from VM scope during vCPU creation.

I'll chew on this, though if someone has any ideas...

> So, do we need to move KVM_TDX_INIT_VCPU to tdx_vcpu_async_ioctl() as well?

If it's _just_ INIT_VCPU that can race (assuming the VM-scoped state transtitions
take all vcpu->mutex locks, as proposed), then a dedicated mutex (spinlock?) would
suffice, and probably would be preferable.  If INIT_VCPU needs to take kvm->lock
to protect against other races, then I guess the big hammer approach could work?

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 13/18] KVM: TDX: ADD pages to the TD image while populating mirror EPT entries
  2025-08-29 23:42   ` Edgecombe, Rick P
@ 2025-09-02 17:09     ` Sean Christopherson
  0 siblings, 0 replies; 62+ messages in thread
From: Sean Christopherson @ 2025-09-02 17:09 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: pbonzini@redhat.com, Kai Huang, ackerleytng@google.com,
	Vishal Annapurve, linux-kernel@vger.kernel.org, Yan Y Zhao,
	Ira Weiny, kvm@vger.kernel.org, michael.roth@amd.com

On Fri, Aug 29, 2025, Rick P Edgecombe wrote:
> On Thu, 2025-08-28 at 17:06 -0700, Sean Christopherson wrote:
> > @@ -3116,11 +3088,14 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> >  {
> >  	struct tdx_gmem_post_populate_arg *arg = _arg;
> >  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > -	u64 err, entry, level_state;
> >  	gpa_t gpa = gfn_to_gpa(gfn);
> > +	u64 err, entry, level_state;
> 
> Fine, but ?

Yeah, accidental copy+paste bug (I deleted several of the variables when trying
out an idea and then had to add them back when it didn't pan out).

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 14/18] KVM: TDX: Fold tdx_sept_zap_private_spte() into tdx_sept_remove_private_spte()
  2025-08-29  0:06 ` [RFC PATCH v2 14/18] KVM: TDX: Fold tdx_sept_zap_private_spte() into tdx_sept_remove_private_spte() Sean Christopherson
@ 2025-09-02 17:31   ` Edgecombe, Rick P
  0 siblings, 0 replies; 62+ messages in thread
From: Edgecombe, Rick P @ 2025-09-02 17:31 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com
  Cc: Huang, Kai, ackerleytng@google.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, Weiny, Ira,
	kvm@vger.kernel.org, michael.roth@amd.com

On Thu, 2025-08-28 at 17:06 -0700, Sean Christopherson wrote:
> Do TDH_MEM_RANGE_BLOCK directly in tdx_sept_remove_private_spte() instead
> of using a one-off helper now that the nr_premapped tracking is gone.
> 
> Opportunistically drop the WARN on hugepages, which was dead code (see the
> KVM_BUG_ON() in tdx_sept_remove_private_spte()).
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

And cleanup strange resultant return error value semantics from the removal of
nr_premapped.

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in S-EPT management
  2025-09-01  1:25     ` Yan Zhao
@ 2025-09-02 17:33       ` Sean Christopherson
  2025-09-02 18:55         ` Edgecombe, Rick P
  0 siblings, 1 reply; 62+ messages in thread
From: Sean Christopherson @ 2025-09-02 17:33 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Rick P Edgecombe, pbonzini@redhat.com, Kai Huang,
	ackerleytng@google.com, Vishal Annapurve,
	linux-kernel@vger.kernel.org, Ira Weiny, kvm@vger.kernel.org,
	michael.roth@amd.com

On Mon, Sep 01, 2025, Yan Zhao wrote:
> On Sat, Aug 30, 2025 at 03:53:24AM +0800, Edgecombe, Rick P wrote:
> > On Thu, 2025-08-28 at 17:06 -0700, Sean Christopherson wrote:
> > > From: Yan Zhao <yan.y.zhao@intel.com>
> > > When S-EPT zapping errors occur, KVM_BUG_ON() is invoked to kick off all
> > > vCPUs and mark the VM as dead. Although there is a potential window that a
> > > private page mapped in the S-EPT could be reallocated and used outside the
> > > VM, the loud warning from KVM_BUG_ON() should provide sufficient debug
> > > information.
> ... 
> > Yan, can you clarify what you mean by "there could be a small window"? I'm
> > thinking this is a hypothetical window around vm_dead races? Or more concrete? I
> > *don't* want to re-open the debate on whether to go with this approach, but I
> > think this is a good teaching edge case to settle on how we want to treat
> > similar issues. So I just want to make sure we have the justification right.
> I think this window isn't hypothetical.
> 
> 1. SEAMCALL failure in tdx_sept_remove_private_spte().

But tdx_sept_remove_private_spte() failing is already a hypothetical scenario.

>    KVM_BUG_ON() sets vm_dead and kicks off all vCPUs.
> 2. guest_memfd invalidation completes. memory is freed.
> 3. VM gets killed.
> 
> After 2, the page is still mapped in the S-EPT, but it could potentially be
> reallocated and used outside the VM.
> 
> >From the TDX module and hardware's perspective, the mapping in the S-EPT for
> this page remains valid. So, I'm uncertain if the TDX module might do something
> creative to access the guest page after 2.
> 
> Besides, a cache flush after 2 can essentially cause a memory write to the page.
> Though we could invoke tdh_phymem_page_wbinvd_hkid() after the KVM_BUG_ON(), the
> SEAMCALL itself can fail.

I think this falls into the category of "don't screw up" flows.  Failure to remove
a private SPTE is a near-catastrophic error.  Going out of our way to reduce the
impact of such errors increases complexity without providing much in the way of
value.

E.g. if VMCLEAR fails, KVM WARNs but continues on and hopes for the best, even
though there's a decent chance failure to purge the VMCS cache entry could be
lead to UAF-like problems.  To me, this is largely the same.

If anything, we should try to prevent #2, e.g. by marking the entire guest_memfd
as broken or something, and then deliberately leaking _all_ pages.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 15/18] KVM: TDX: Combine KVM_BUG_ON + pr_tdx_error() into TDX_BUG_ON()
  2025-08-29  0:06 ` [RFC PATCH v2 15/18] KVM: TDX: Combine KVM_BUG_ON + pr_tdx_error() into TDX_BUG_ON() Sean Christopherson
  2025-08-29  9:03   ` Binbin Wu
@ 2025-09-02 18:55   ` Edgecombe, Rick P
  1 sibling, 0 replies; 62+ messages in thread
From: Edgecombe, Rick P @ 2025-09-02 18:55 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com
  Cc: Huang, Kai, ackerleytng@google.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, Weiny, Ira,
	kvm@vger.kernel.org, michael.roth@amd.com

On Thu, 2025-08-28 at 17:06 -0700, Sean Christopherson wrote:
> Add TDX_BUG_ON() macros (with varying numbers of arguments) to deduplicate
> the myriad flows that do KVM_BUG_ON()/WARN_ON_ONCE() followed by a call to
> pr_tdx_error().  In addition to reducing boilerplate copy+paste code, this
> also helps ensure that KVM provides consistent handling of SEAMCALL errors.
> 
> Opportunistically convert a handful of bare WARN_ON_ONCE() paths to the
> equivalent of KVM_BUG_ON(), i.e. have them terminate the VM.  If a SEAMCALL
> error is fatal enough to WARN on, it's fatal enough to terminate the TD.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 114 +++++++++++++++++------------------------
>  1 file changed, 47 insertions(+), 67 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index aa6d88629dae..df9b4496cd01 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -24,20 +24,32 @@
>  #undef pr_fmt
>  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>  
> -#define pr_tdx_error(__fn, __err)	\
> -	pr_err_ratelimited("SEAMCALL %s failed: 0x%llx\n", #__fn, __err)
> +#define __TDX_BUG_ON(__err, __f, __kvm, __fmt, __args...)			\
> +({										\
> +	struct kvm *_kvm = (__kvm);						\
> +	bool __ret = !!(__err);							\
> +										\
> +	if (WARN_ON_ONCE(__ret && (!_kvm || !_kvm->vm_bugged))) {		\
> +		if (_kvm)							\
> +			kvm_vm_bugged(_kvm);					\
> +		pr_err_ratelimited("SEAMCALL " __f " failed: 0x%llx" __fmt "\n",\
> +				   __err,  __args);				\
> +	}									\
> +	unlikely(__ret);							\
> +})
>  
> -#define __pr_tdx_error_N(__fn_str, __err, __fmt, ...)		\
> -	pr_err_ratelimited("SEAMCALL " __fn_str " failed: 0x%llx, " __fmt,  __err,  __VA_ARGS__)
> +#define TDX_BUG_ON(__err, __fn, __kvm)				\
> +	__TDX_BUG_ON(__err, #__fn, __kvm, "%s", "")
>  
> -#define pr_tdx_error_1(__fn, __err, __rcx)		\
> -	__pr_tdx_error_N(#__fn, __err, "rcx 0x%llx\n", __rcx)
> +#define TDX_BUG_ON_1(__err, __fn, __rcx, __kvm)			\
> +	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx", __rcx)
>  
> -#define pr_tdx_error_2(__fn, __err, __rcx, __rdx)	\
> -	__pr_tdx_error_N(#__fn, __err, "rcx 0x%llx, rdx 0x%llx\n", __rcx, __rdx)
> +#define TDX_BUG_ON_2(__err, __fn, __rcx, __rdx, __kvm)		\
> +	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx, rdx 0x%llx", __rcx, __rdx)
> +
> +#define TDX_BUG_ON_3(__err, __fn, __rcx, __rdx, __r8, __kvm)	\
> +	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx, rdx 0x%llx, r8 0x%llx", __rcx, __rdx, __r8)
>  
> -#define pr_tdx_error_3(__fn, __err, __rcx, __rdx, __r8)	\
> -	__pr_tdx_error_N(#__fn, __err, "rcx 0x%llx, rdx 0x%llx, r8 0x%llx\n", __rcx, __rdx, __r8)

Having each TDX_BUG_ON() individually do #__fn is extra code today, but it
leaves __TDX_BUG_ON() usable for custom special TDX_BUG_ON()'s later.

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in S-EPT management
  2025-09-02 17:33       ` Sean Christopherson
@ 2025-09-02 18:55         ` Edgecombe, Rick P
  0 siblings, 0 replies; 62+ messages in thread
From: Edgecombe, Rick P @ 2025-09-02 18:55 UTC (permalink / raw)
  To: seanjc@google.com, Zhao, Yan Y
  Cc: Huang, Kai, ackerleytng@google.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Weiny, Ira, kvm@vger.kernel.org,
	michael.roth@amd.com, pbonzini@redhat.com

On Tue, 2025-09-02 at 10:33 -0700, Sean Christopherson wrote:
> > Besides, a cache flush after 2 can essentially cause a memory write to the
> > page.
> > Though we could invoke tdh_phymem_page_wbinvd_hkid() after the KVM_BUG_ON(),
> > the SEAMCALL itself can fail.
> 
> I think this falls into the category of "don't screw up" flows.  Failure to
> remove a private SPTE is a near-catastrophic error.  Going out of our way to
> reduce the impact of such errors increases complexity without providing much
> in the way of value.
> 
> E.g. if VMCLEAR fails, KVM WARNs but continues on and hopes for the best, even
> though there's a decent chance failure to purge the VMCS cache entry could be
> lead to UAF-like problems.  To me, this is largely the same.
> 
> If anything, we should try to prevent #2, e.g. by marking the entire
> guest_memfd as broken or something, and then deliberately leaking _all_ pages.

There was a marathon thread on this subject. We did discuss this option (link to
most relevant part I could find):
https://lore.kernel.org/kvm/a9affa03c7cdc8109d0ed6b5ca30ec69269e2f34.camel@intel.com/

The high level summary is that pinning the pages wrinkles guestmemfd's plans to
use refcount for other tracking purposes. Dropping refcounts interferes with the
error handling safety.

I strongly agree that we should not optimize for the error path at all. If we
could bug the guestmemfd (kind of what we were discussing in that link) I think
it would be appropriate to use in these cases. I guess the question is are we ok
dropping the safety before we have a solution like that. In that thread I was
advocating for yes, partly to close it because the conversation was getting
stuck. But there is probably a long tail of potential issues or ways of looking
at it that could put it in the grey area.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 11/18] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-29  0:06 ` [RFC PATCH v2 11/18] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller Sean Christopherson
@ 2025-09-02 22:46   ` Edgecombe, Rick P
  0 siblings, 0 replies; 62+ messages in thread
From: Edgecombe, Rick P @ 2025-09-02 22:46 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com
  Cc: Huang, Kai, ackerleytng@google.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, Weiny, Ira,
	kvm@vger.kernel.org, michael.roth@amd.com

On Thu, 2025-08-28 at 17:06 -0700, Sean Christopherson wrote:
> Fold tdx_mem_page_record_premap_cnt() into tdx_sept_set_private_spte() as
> providing a one-off helper for effectively three lines of code is at best a
> wash, and splitting the code makes the comment for smp_rmb()  _extremely_
> confusing as the comment talks about reading kvm->arch.pre_fault_allowed
> before kvm_tdx->state, but the immediately visible code does the exact
> opposite.
> 
> Opportunistically rewrite the comments to more explicitly explain who is
> checking what, as well as _why_ the ordering matters.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 12/18] KVM: TDX: Bug the VM if extended the initial measurement fails
  2025-09-02 17:04           ` Sean Christopherson
@ 2025-09-03  0:18             ` Edgecombe, Rick P
  2025-09-03  3:34               ` Yan Zhao
  0 siblings, 1 reply; 62+ messages in thread
From: Edgecombe, Rick P @ 2025-09-03  0:18 UTC (permalink / raw)
  To: seanjc@google.com, Zhao, Yan Y
  Cc: Huang, Kai, ackerleytng@google.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Weiny, Ira, pbonzini@redhat.com,
	michael.roth@amd.com, kvm@vger.kernel.org

On Tue, 2025-09-02 at 10:04 -0700, Sean Christopherson wrote:
> On Tue, Sep 02, 2025, Yan Zhao wrote:
> > But during writing another concurrency test, I found a sad news :
> > 
> > SEAMCALL TDH_VP_INIT requires to hold exclusive lock for resource TDR when its
> > leaf_opcode.version > 0. So, when I use v1 (which is the current value in
> > upstream, for x2apic?) to test executing ioctl KVM_TDX_INIT_VCPU on different
> > vCPUs concurrently, the TDX_BUG_ON() following tdh_vp_init() will print error
> > "SEAMCALL TDH_VP_INIT failed: 0x8000020000000080".
> > 
> > If I switch to using v0 version of TDH_VP_INIT, the contention will be gone.
> 
> Uh, so that's exactly the type of breaking ABI change that isn't acceptable.  If
> it's really truly necessary, then we can can probably handle the change in KVM
> since TDX is so new, but generally speaking such changes simply must not happen.
> 
> > Note: this acquiring of exclusive lock was not previously present in the public
> > repo https://github.com/intel/tdx-module.git, branch tdx_1.5.
> > (The branch has been force-updated to new implementation now).
> 
> Lovely.

Hmm, this exactly the kind of TDX module change we were just discussing
reporting as a bug. Not clear on the timing of the change as far as the landing
upstream. We could investigate whether whether we could fix it in the TDX
module. This probably falls into the category of not actually regressing any
userspace. But it does trigger a kernel warning, so warrant a fix, hmm.

> 
> > > Acquire kvm->lock to prevent VM-wide things from happening, slots_lock to prevent
> > > kvm_mmu_zap_all_fast(), and _all_ vCPU mutexes to prevent vCPUs from interefering.
> > Nit: we should have no worry to kvm_mmu_zap_all_fast(), since it only zaps
> > !mirror roots. The slots_lock should be for slots deletion.
> 
> Oof, I missed that.  We should have required nx_huge_pages=never for tdx=1.
> Probably too late for that now though :-/
> 
> > > Doing that for a vCPU ioctl is a bit awkward, but not awful.  E.g. we can abuse
> > > kvm_arch_vcpu_async_ioctl().  In hindsight, a more clever approach would have
> > > been to make KVM_TDX_INIT_MEM_REGION a VM-scoped ioctl that takes a vCPU fd.  Oh
> > > well.
> > > 
> > > Anyways, I think we need to avoid the "synchronous" ioctl path anyways, because
> > > taking kvm->slots_lock inside vcpu->mutex is gross.  AFAICT it's not actively
> > > problematic today, but it feels like a deadlock waiting to happen.
> > Note: Looks kvm_inhibit_apic_access_page() also takes kvm->slots_lock inside
> > vcpu->mutex.
> 
> Yikes.  As does kvm_alloc_apic_access_page(), which is likely why I thought it
> was ok to take slots_lock.  But while kvm_alloc_apic_access_page() appears to be
> called with vCPU scope, it's actually called from VM scope during vCPU creation.
> 
> I'll chew on this, though if someone has any ideas...
> 
> > So, do we need to move KVM_TDX_INIT_VCPU to tdx_vcpu_async_ioctl() as well?
> 
> If it's _just_ INIT_VCPU that can race (assuming the VM-scoped state transtitions
> take all vcpu->mutex locks, as proposed), then a dedicated mutex (spinlock?) would
> suffice, and probably would be preferable.  If INIT_VCPU needs to take kvm->lock
> to protect against other races, then I guess the big hammer approach could work?

A duplicate TDR lock inside KVM or maybe even the arch/x86 side would make the
reasoning easier to follow. Like, you don't need to remember "we take
slots_lock/kvm_lock because of TDR lock", it's just 1:1. I hate the idea of
adding more locks, and have argued against it in the past. But are we just
fooling ourselves though? There are already more locks.

Another reason to duplicate (some) locks is that if it gives the scheduler more
hints as far as waking up waiters, etc. The TDX module needs these locks to
protect itself, so those are required. But when we just do retry loops (or let
userspace do this), then we lose out on all of the locking goodness in the
kernel.

Anyway, just a strawman. I don't have any great ideas.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 12/18] KVM: TDX: Bug the VM if extended the initial measurement fails
  2025-09-03  0:18             ` Edgecombe, Rick P
@ 2025-09-03  3:34               ` Yan Zhao
  2025-09-03  9:19                 ` Yan Zhao
  0 siblings, 1 reply; 62+ messages in thread
From: Yan Zhao @ 2025-09-03  3:34 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: seanjc@google.com, Huang, Kai, ackerleytng@google.com,
	Annapurve, Vishal, linux-kernel@vger.kernel.org, Weiny, Ira,
	pbonzini@redhat.com, michael.roth@amd.com, kvm@vger.kernel.org

On Wed, Sep 03, 2025 at 08:18:10AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-09-02 at 10:04 -0700, Sean Christopherson wrote:
> > On Tue, Sep 02, 2025, Yan Zhao wrote:
> > > But during writing another concurrency test, I found a sad news :
> > > 
> > > SEAMCALL TDH_VP_INIT requires to hold exclusive lock for resource TDR when its
> > > leaf_opcode.version > 0. So, when I use v1 (which is the current value in
> > > upstream, for x2apic?) to test executing ioctl KVM_TDX_INIT_VCPU on different
> > > vCPUs concurrently, the TDX_BUG_ON() following tdh_vp_init() will print error
> > > "SEAMCALL TDH_VP_INIT failed: 0x8000020000000080".
> > > 
> > > If I switch to using v0 version of TDH_VP_INIT, the contention will be gone.
> > 
> > Uh, so that's exactly the type of breaking ABI change that isn't acceptable.  If
> > it's really truly necessary, then we can can probably handle the change in KVM
> > since TDX is so new, but generally speaking such changes simply must not happen.
> > 
> > > Note: this acquiring of exclusive lock was not previously present in the public
> > > repo https://github.com/intel/tdx-module.git, branch tdx_1.5.
> > > (The branch has been force-updated to new implementation now).
> > 
> > Lovely.
> 
> Hmm, this exactly the kind of TDX module change we were just discussing
> reporting as a bug. Not clear on the timing of the change as far as the landing
> upstream. We could investigate whether whether we could fix it in the TDX
> module. This probably falls into the category of not actually regressing any
> userspace. But it does trigger a kernel warning, so warrant a fix, hmm.
> 
> > 
> > > > Acquire kvm->lock to prevent VM-wide things from happening, slots_lock to prevent
> > > > kvm_mmu_zap_all_fast(), and _all_ vCPU mutexes to prevent vCPUs from interefering.
> > > Nit: we should have no worry to kvm_mmu_zap_all_fast(), since it only zaps
> > > !mirror roots. The slots_lock should be for slots deletion.
> > 
> > Oof, I missed that.  We should have required nx_huge_pages=never for tdx=1.
> > Probably too late for that now though :-/
> > 
> > > > Doing that for a vCPU ioctl is a bit awkward, but not awful.  E.g. we can abuse
> > > > kvm_arch_vcpu_async_ioctl().  In hindsight, a more clever approach would have
> > > > been to make KVM_TDX_INIT_MEM_REGION a VM-scoped ioctl that takes a vCPU fd.  Oh
> > > > well.
> > > > 
> > > > Anyways, I think we need to avoid the "synchronous" ioctl path anyways, because
> > > > taking kvm->slots_lock inside vcpu->mutex is gross.  AFAICT it's not actively
> > > > problematic today, but it feels like a deadlock waiting to happen.
> > > Note: Looks kvm_inhibit_apic_access_page() also takes kvm->slots_lock inside
> > > vcpu->mutex.
> > 
> > Yikes.  As does kvm_alloc_apic_access_page(), which is likely why I thought it
> > was ok to take slots_lock.  But while kvm_alloc_apic_access_page() appears to be
> > called with vCPU scope, it's actually called from VM scope during vCPU creation.
> > 
> > I'll chew on this, though if someone has any ideas...
> > 
> > > So, do we need to move KVM_TDX_INIT_VCPU to tdx_vcpu_async_ioctl() as well?
> > 
> > If it's _just_ INIT_VCPU that can race (assuming the VM-scoped state transtitions
> > take all vcpu->mutex locks, as proposed), then a dedicated mutex (spinlock?) would
> > suffice, and probably would be preferable.  If INIT_VCPU needs to take kvm->lock
> > to protect against other races, then I guess the big hammer approach could work?
We need the big hammer approach as INIT_VCPU may race with vcpu_load()
in other vCPU ioctls.

> A duplicate TDR lock inside KVM or maybe even the arch/x86 side would make the
> reasoning easier to follow. Like, you don't need to remember "we take
> slots_lock/kvm_lock because of TDR lock", it's just 1:1. I hate the idea of
> adding more locks, and have argued against it in the past. But are we just
> fooling ourselves though? There are already more locks.
> 
> Another reason to duplicate (some) locks is that if it gives the scheduler more
> hints as far as waking up waiters, etc. The TDX module needs these locks to
> protect itself, so those are required. But when we just do retry loops (or let
> userspace do this), then we lose out on all of the locking goodness in the
> kernel.
> 
> Anyway, just a strawman. I don't have any great ideas.
Do you think the following fix is good?
It moves INIT_VCPU to tdx_vcpu_async_ioctl and uses the big hammer to make it
impossible to contend with other SEAMCALLs, just as for tdh_mr_extend().

It passed my local concurrent test.

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 99381c8b4108..8a6f2feaab41 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3047,16 +3047,22 @@ static int tdx_vcpu_get_cpuid(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)

 static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
 {
-       u64 apic_base;
+       struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
        struct vcpu_tdx *tdx = to_tdx(vcpu);
+       u64 apic_base;
        int ret;

+       CLASS(tdx_vm_state_guard, guard)(vcpu->kvm);
+       if (IS_ERR(guard))
+               return PTR_ERR(guard);
+
        if (cmd->flags)
                return -EINVAL;

-       if (tdx->state != VCPU_TD_STATE_UNINITIALIZED)
+       if (!is_hkid_assigned(kvm_tdx) || tdx->state != VCPU_TD_STATE_UNINITIALIZED)
                return -EINVAL;

+       vcpu_load(vcpu);
        /*
         * TDX requires X2APIC, userspace is responsible for configuring guest
         * CPUID accordingly.
@@ -3075,6 +3081,7 @@ static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
        td_vmcs_setbit32(tdx, PIN_BASED_VM_EXEC_CONTROL, PIN_BASED_POSTED_INTR);

        tdx->state = VCPU_TD_STATE_INITIALIZED;
+       vcpu_put(vcpu);

        return 0;
 }
@@ -3228,10 +3235,18 @@ int tdx_vcpu_async_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
        if (r)
                return r;

-       if (cmd.id != KVM_TDX_INIT_MEM_REGION)
-               return -ENOIOCTLCMD;
-
-       return tdx_vcpu_init_mem_region(vcpu, &cmd);
+       switch (cmd.id) {
+       case KVM_TDX_INIT_MEM_REGION:
+               r = tdx_vcpu_init_mem_region(vcpu, &cmd);
+               break;
+       case KVM_TDX_INIT_VCPU:
+               r = tdx_vcpu_init(vcpu, &cmd);
+               break;
+       default:
+               r = -ENOIOCTLCMD;
+               break;
+       }
+       return r;
 }

 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
@@ -3248,9 +3263,6 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
                return ret;

        switch (cmd.id) {
-       case KVM_TDX_INIT_VCPU:
-               ret = tdx_vcpu_init(vcpu, &cmd);
-               break;
        case KVM_TDX_GET_CPUID:
                ret = tdx_vcpu_get_cpuid(vcpu, &cmd);
                break;


Besides, to unblock testing the above code, I fixed a bug related to vcpu_load()
in current TDX upstream code. Attached the fixup patch below.

Sean, please let me know if you want to include it into this series.
(It still lacks a Fixes tag as I haven't found out which commit is the best fit). 


From 0d1ba6d60315e34bdb0e54acceb6e8dd0fbdb262 Mon Sep 17 00:00:00 2001
From: Yan Zhao <yan.y.zhao@intel.com>
Date: Tue, 2 Sep 2025 18:31:27 -0700
Subject: [PATCH 1/2] KVM: TDX: Fix list_add corruption during vcpu_load()

During vCPU creation, a vCPU may be destroyed immediately after
kvm_arch_vcpu_create() (e.g., due to vCPU id confiliction). However, the
vcpu_load() inside kvm_arch_vcpu_create() may have associate the vCPU to
pCPU via "list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu))"
before invoking tdx_vcpu_free().

Though there's no need to invoke tdh_vp_flush() on the vCPU, failing to
dissociate the vCPU from pCPU (i.e., "list_del(&to_tdx(vcpu)->cpu_list)")
will cause list corruption of the per-pCPU list associated_tdvcpus.

Then, a later list_add() during vcpu_load() would detect list corruption
and print calltrace as shown below.

Dissociate a vCPU from its associated pCPU in tdx_vcpu_free() for the vCPUs
destroyed immediately after creation which must be in
VCPU_TD_STATE_UNINITIALIZED state.

kernel BUG at lib/list_debug.c:29!
Oops: invalid opcode: 0000 [#2] SMP NOPTI
RIP: 0010:__list_add_valid_or_report+0x82/0xd0

Call Trace:
 <TASK>
 tdx_vcpu_load+0xa8/0x120
 vt_vcpu_load+0x25/0x30
 kvm_arch_vcpu_load+0x81/0x300
 vcpu_load+0x55/0x90
 kvm_arch_vcpu_create+0x24f/0x330
 kvm_vm_ioctl_create_vcpu+0x1b1/0x53
 ? trace_lock_release+0x6d/0xb0
 kvm_vm_ioctl+0xc2/0xa60
 ? tty_ldisc_deref+0x16/0x20
 ? debug_smp_processor_id+0x17/0x20
 ? __fget_files+0xc2/0x1b0
 ? debug_smp_processor_id+0x17/0x20
 ? rcu_is_watching+0x13/0x70
 ? __fget_files+0xc2/0x1b0
 ? trace_lock_release+0x6d/0xb0
 ? lock_release+0x14/0xd0
 ? __fget_files+0xcc/0x1b0
 __x64_sys_ioctl+0x9a/0xf0
 ? rcu_is_watching+0x13/0x70
 x64_sys_call+0x10ee/0x20d0
 do_syscall_64+0xc3/0x470
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 42 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 37 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e99d07611393..99381c8b4108 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -837,19 +837,51 @@ void tdx_vcpu_put(struct kvm_vcpu *vcpu)
        tdx_prepare_switch_to_host(vcpu);
 }

+/*
+ * Life cycles for a TD and a vCPU:
+ * 1. KVM_CREATE_VM ioctl.
+ *    TD state is TD_STATE_UNINITIALIZED.
+ *    hkid is not assigned at this stage.
+ * 2. KVM_TDX_INIT_VM ioctl.
+ *    TD transistions to TD_STATE_INITIALIZED.
+ *    hkid is assigned after this stage.
+ * 3. KVM_CREATE_VCPU ioctl. (only when TD is TD_STATE_INITIALIZED).
+ *    3.1 tdx_vcpu_create() transitions vCPU state to VCPU_TD_STATE_UNINITIALIZED.
+ *    3.2 vcpu_load() and vcpu_put() in kvm_arch_vcpu_create().
+ *    3.3 (conditional) if any error encountered after kvm_arch_vcpu_create()
+ *        kvm_arch_vcpu_destroy() --> tdx_vcpu_free().
+ * 4. KVM_TDX_INIT_VCPU ioctl.
+ *    tdx_vcpu_init() transistions vCPU state to VCPU_TD_STATE_INITIALIZED.
+ *    vCPU control structures are allocated at this stage.
+ * 5. kvm_destroy_vm().
+ *    5.1 tdx_mmu_release_hkid(): (1) tdh_vp_flush(), disassociats all vCPUs.
+ *                                (2) puts hkid to !assigned state.
+ *    5.2 kvm_destroy_vcpus() --> tdx_vcpu_free():
+ *        transistions vCPU to VCPU_TD_STATE_UNINITIALIZED state.
+ *    5.3 tdx_vm_destroy()
+ *        transitions TD to TD_STATE_UNINITIALIZED state.
+ *
+ * tdx_vcpu_free() can be invoked only at 3.3 or 5.2.
+ * - If at 3.3, hkid is still assigned, but the vCPU must be in
+ *   VCPU_TD_STATE_UNINITIALIZED state.
+ * - if at 5.2, hkid must be !assigned and all vCPUs must be in
+ *   VCPU_TD_STATE_INITIALIZED state and have been dissociated.
+ */
 void tdx_vcpu_free(struct kvm_vcpu *vcpu)
 {
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
        struct vcpu_tdx *tdx = to_tdx(vcpu);
        int i;

+       if (vcpu->cpu != -1) {
+               KVM_BUG_ON(tdx->state == VCPU_TD_STATE_INITIALIZED, vcpu->kvm);
+               tdx_disassociate_vp(vcpu);
+               return;
+       }
        /*
         * It is not possible to reclaim pages while hkid is assigned. It might
-        * be assigned if:
-        * 1. the TD VM is being destroyed but freeing hkid failed, in which
-        * case the pages are leaked
-        * 2. TD VCPU creation failed and this on the error path, in which case
-        * there is nothing to do anyway
+        * be assigned if the TD VM is being destroyed but freeing hkid failed,
+        * in which case the pages are leaked.
         */
        if (is_hkid_assigned(kvm_tdx))
                return;
--
2.43.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH v2 12/18] KVM: TDX: Bug the VM if extended the initial measurement fails
  2025-09-03  3:34               ` Yan Zhao
@ 2025-09-03  9:19                 ` Yan Zhao
  0 siblings, 0 replies; 62+ messages in thread
From: Yan Zhao @ 2025-09-03  9:19 UTC (permalink / raw)
  To: Edgecombe, Rick P, seanjc@google.com, Huang, Kai,
	ackerleytng@google.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Weiny, Ira, pbonzini@redhat.com,
	michael.roth@amd.com, kvm@vger.kernel.org

> From 0d1ba6d60315e34bdb0e54acceb6e8dd0fbdb262 Mon Sep 17 00:00:00 2001
> From: Yan Zhao <yan.y.zhao@intel.com>
> Date: Tue, 2 Sep 2025 18:31:27 -0700
> Subject: [PATCH 1/2] KVM: TDX: Fix list_add corruption during vcpu_load()
> 
> During vCPU creation, a vCPU may be destroyed immediately after
> kvm_arch_vcpu_create() (e.g., due to vCPU id confiliction). However, the
> vcpu_load() inside kvm_arch_vcpu_create() may have associate the vCPU to
> pCPU via "list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu))"
> before invoking tdx_vcpu_free().
> 
> Though there's no need to invoke tdh_vp_flush() on the vCPU, failing to
> dissociate the vCPU from pCPU (i.e., "list_del(&to_tdx(vcpu)->cpu_list)")
> will cause list corruption of the per-pCPU list associated_tdvcpus.
> 
> Then, a later list_add() during vcpu_load() would detect list corruption
> and print calltrace as shown below.
> 
> Dissociate a vCPU from its associated pCPU in tdx_vcpu_free() for the vCPUs
> destroyed immediately after creation which must be in
> VCPU_TD_STATE_UNINITIALIZED state.
> 
> kernel BUG at lib/list_debug.c:29!
> Oops: invalid opcode: 0000 [#2] SMP NOPTI
> RIP: 0010:__list_add_valid_or_report+0x82/0xd0
> 
> Call Trace:
>  <TASK>
>  tdx_vcpu_load+0xa8/0x120
>  vt_vcpu_load+0x25/0x30
>  kvm_arch_vcpu_load+0x81/0x300
>  vcpu_load+0x55/0x90
>  kvm_arch_vcpu_create+0x24f/0x330
>  kvm_vm_ioctl_create_vcpu+0x1b1/0x53
>  ? trace_lock_release+0x6d/0xb0
>  kvm_vm_ioctl+0xc2/0xa60
>  ? tty_ldisc_deref+0x16/0x20
>  ? debug_smp_processor_id+0x17/0x20
>  ? __fget_files+0xc2/0x1b0
>  ? debug_smp_processor_id+0x17/0x20
>  ? rcu_is_watching+0x13/0x70
>  ? __fget_files+0xc2/0x1b0
>  ? trace_lock_release+0x6d/0xb0
>  ? lock_release+0x14/0xd0
>  ? __fget_files+0xcc/0x1b0
>  __x64_sys_ioctl+0x9a/0xf0
>  ? rcu_is_watching+0x13/0x70
>  x64_sys_call+0x10ee/0x20d0
>  do_syscall_64+0xc3/0x470
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
>

Fixes: d789fa6efac9 ("KVM: TDX: Handle vCPU dissociation")
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 42 +++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 37 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index e99d07611393..99381c8b4108 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -837,19 +837,51 @@ void tdx_vcpu_put(struct kvm_vcpu *vcpu)
>         tdx_prepare_switch_to_host(vcpu);
>  }
> 
> +/*
> + * Life cycles for a TD and a vCPU:
> + * 1. KVM_CREATE_VM ioctl.
> + *    TD state is TD_STATE_UNINITIALIZED.
> + *    hkid is not assigned at this stage.
> + * 2. KVM_TDX_INIT_VM ioctl.
> + *    TD transistions to TD_STATE_INITIALIZED.
> + *    hkid is assigned after this stage.
> + * 3. KVM_CREATE_VCPU ioctl. (only when TD is TD_STATE_INITIALIZED).
> + *    3.1 tdx_vcpu_create() transitions vCPU state to VCPU_TD_STATE_UNINITIALIZED.
> + *    3.2 vcpu_load() and vcpu_put() in kvm_arch_vcpu_create().
> + *    3.3 (conditional) if any error encountered after kvm_arch_vcpu_create()
> + *        kvm_arch_vcpu_destroy() --> tdx_vcpu_free().
> + * 4. KVM_TDX_INIT_VCPU ioctl.
> + *    tdx_vcpu_init() transistions vCPU state to VCPU_TD_STATE_INITIALIZED.
> + *    vCPU control structures are allocated at this stage.
> + * 5. kvm_destroy_vm().
> + *    5.1 tdx_mmu_release_hkid(): (1) tdh_vp_flush(), disassociats all vCPUs.
> + *                                (2) puts hkid to !assigned state.
> + *    5.2 kvm_destroy_vcpus() --> tdx_vcpu_free():
> + *        transistions vCPU to VCPU_TD_STATE_UNINITIALIZED state.
> + *    5.3 tdx_vm_destroy()
> + *        transitions TD to TD_STATE_UNINITIALIZED state.
> + *
> + * tdx_vcpu_free() can be invoked only at 3.3 or 5.2.
> + * - If at 3.3, hkid is still assigned, but the vCPU must be in
> + *   VCPU_TD_STATE_UNINITIALIZED state.
> + * - if at 5.2, hkid must be !assigned and all vCPUs must be in
> + *   VCPU_TD_STATE_INITIALIZED state and have been dissociated.
> + */
>  void tdx_vcpu_free(struct kvm_vcpu *vcpu)
>  {
>         struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>         struct vcpu_tdx *tdx = to_tdx(vcpu);
>         int i;
> 
> +       if (vcpu->cpu != -1) {
> +               KVM_BUG_ON(tdx->state == VCPU_TD_STATE_INITIALIZED, vcpu->kvm);
> +               tdx_disassociate_vp(vcpu);
Sorry, I should use "tdx_flush_vp_on_cpu(vcpu);" here to ensure the list_del()
is running on vcpu->cpu with local irq disabled.

> +               return;
> +       }
>         /*
>          * It is not possible to reclaim pages while hkid is assigned. It might
> -        * be assigned if:
> -        * 1. the TD VM is being destroyed but freeing hkid failed, in which
> -        * case the pages are leaked
> -        * 2. TD VCPU creation failed and this on the error path, in which case
> -        * there is nothing to do anyway
> +        * be assigned if the TD VM is being destroyed but freeing hkid failed,
> +        * in which case the pages are leaked.
>          */
>         if (is_hkid_assigned(kvm_tdx))
>                 return;
> --
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2025-09-03  9:20 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-29  0:06 [RFC PATCH v2 00/18] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
2025-08-29  0:06 ` [RFC PATCH v2 01/18] KVM: TDX: Drop PROVE_MMU=y sanity check on to-be-populated mappings Sean Christopherson
2025-08-29  6:20   ` Binbin Wu
2025-08-29  0:06 ` [RFC PATCH v2 02/18] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU Sean Christopherson
2025-08-29 18:34   ` Edgecombe, Rick P
2025-08-29 20:27     ` Sean Christopherson
2025-08-29  0:06 ` [RFC PATCH v2 03/18] Revert "KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU" Sean Christopherson
2025-08-29 19:00   ` Edgecombe, Rick P
2025-08-29  0:06 ` [RFC PATCH v2 04/18] KVM: x86/mmu: Rename kvm_tdp_map_page() to kvm_tdp_page_prefault() Sean Christopherson
2025-08-29 19:03   ` Edgecombe, Rick P
2025-08-29  0:06 ` [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in S-EPT management Sean Christopherson
2025-08-29  8:36   ` Binbin Wu
2025-08-29 19:53   ` Edgecombe, Rick P
2025-08-29 20:19     ` Sean Christopherson
2025-08-29 21:54       ` Edgecombe, Rick P
2025-08-29 22:02         ` Sean Christopherson
2025-08-29 22:17           ` Edgecombe, Rick P
2025-08-29 22:58             ` Sean Christopherson
2025-08-29 22:59               ` Edgecombe, Rick P
2025-09-01  1:25     ` Yan Zhao
2025-09-02 17:33       ` Sean Christopherson
2025-09-02 18:55         ` Edgecombe, Rick P
2025-08-29  0:06 ` [RFC PATCH v2 06/18] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition Sean Christopherson
2025-08-29  9:40   ` Binbin Wu
2025-08-29 16:58   ` Ira Weiny
2025-08-29 19:59   ` Edgecombe, Rick P
2025-08-29  0:06 ` [RFC PATCH v2 07/18] KVM: TDX: Fold tdx_sept_drop_private_spte() into tdx_sept_remove_private_spte() Sean Christopherson
2025-08-29  9:49   ` Binbin Wu
2025-08-29  0:06 ` [RFC PATCH v2 08/18] KVM: x86/mmu: Drop the return code from kvm_x86_ops.remove_external_spte() Sean Christopherson
2025-08-29  9:52   ` Binbin Wu
2025-08-29  0:06 ` [RFC PATCH v2 09/18] KVM: TDX: Avoid a double-KVM_BUG_ON() in tdx_sept_zap_private_spte() Sean Christopherson
2025-08-29  9:52   ` Binbin Wu
2025-08-29  0:06 ` [RFC PATCH v2 10/18] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent Sean Christopherson
2025-08-29 10:06   ` Binbin Wu
2025-08-29  0:06 ` [RFC PATCH v2 11/18] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller Sean Christopherson
2025-09-02 22:46   ` Edgecombe, Rick P
2025-08-29  0:06 ` [RFC PATCH v2 12/18] KVM: TDX: Bug the VM if extended the initial measurement fails Sean Christopherson
2025-08-29  8:18   ` Yan Zhao
2025-08-29 18:16     ` Edgecombe, Rick P
2025-08-29 20:11       ` Sean Christopherson
2025-08-29 22:39         ` Edgecombe, Rick P
2025-08-29 23:15           ` Edgecombe, Rick P
2025-08-29 23:18             ` Sean Christopherson
2025-09-02  9:24         ` Yan Zhao
2025-09-02 17:04           ` Sean Christopherson
2025-09-03  0:18             ` Edgecombe, Rick P
2025-09-03  3:34               ` Yan Zhao
2025-09-03  9:19                 ` Yan Zhao
2025-08-29  0:06 ` [RFC PATCH v2 13/18] KVM: TDX: ADD pages to the TD image while populating mirror EPT entries Sean Christopherson
2025-08-29 23:42   ` Edgecombe, Rick P
2025-09-02 17:09     ` Sean Christopherson
2025-08-29  0:06 ` [RFC PATCH v2 14/18] KVM: TDX: Fold tdx_sept_zap_private_spte() into tdx_sept_remove_private_spte() Sean Christopherson
2025-09-02 17:31   ` Edgecombe, Rick P
2025-08-29  0:06 ` [RFC PATCH v2 15/18] KVM: TDX: Combine KVM_BUG_ON + pr_tdx_error() into TDX_BUG_ON() Sean Christopherson
2025-08-29  9:03   ` Binbin Wu
2025-08-29 14:19     ` Sean Christopherson
2025-09-01  1:46       ` Binbin Wu
2025-09-02 18:55   ` Edgecombe, Rick P
2025-08-29  0:06 ` [RFC PATCH v2 16/18] KVM: TDX: Derive error argument names from the local variable names Sean Christopherson
2025-08-30  0:00   ` Edgecombe, Rick P
2025-08-29  0:06 ` [RFC PATCH v2 17/18] KVM: TDX: Assert that mmu_lock is held for write when removing S-EPT entries Sean Christopherson
2025-08-29  0:06 ` [RFC PATCH v2 18/18] KVM: TDX: Add macro to retry SEAMCALLs when forcing vCPUs out of guest Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).