kvm.vger.kernel.org archive mirror
* [RFC PATCH 00/12] KVM: x86/mmu: TDX post-populate cleanups
@ 2025-08-27  0:05 Sean Christopherson
  2025-08-27  0:05 ` [RFC PATCH 01/12] KVM: TDX: Drop PROVE_MMU=y sanity check on to-be-populated mappings Sean Christopherson
                   ` (13 more replies)
  0 siblings, 14 replies; 85+ messages in thread
From: Sean Christopherson @ 2025-08-27  0:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Michael Roth, Yan Zhao, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

This is a largely untested series to do most of what was discussed in the
thread regarding locking issues between gmem and TDX's post-populate hook[*],
with more than a few side quests thrown in as I was navigating through the
code to try to figure out how best to eliminate the copy_from_user() from
sev_gmem_post_populate(), which has the same locking problem (copying from
a userspace address can fault and in theory trigger the same problematic
path, I think).

Notably absent is the extraction of copy_from_user() from
sev_gmem_post_populate() to kvm_gmem_populate().  I've had this on my todo
list for a few weeks now, and haven't been able to focus on it for long
enough to get something hammered out, and with KVM Forum on the horizon, I
don't anticipate getting 'round to it within the next month (if not much
longer).

The thing that stymied me is what to do if snp_launch_update() is passed in
a huge batch of pages.  I waffled between doing a slow one-at-a-time approach
and a batched approach, and got especially stuck when trying to remember
and/or figure out how that handling would interact with hugepage support in
SNP in particular.

If anyone wants to tackle that project, the one change I definitely
think we should do is change the post_populate() callback to operate on
exactly one page.  KVM_SEV_SNP_LAUNCH_UPDATE allows for partial progress,
i.e. KVM's ABI doesn't require it to unwind a batch if adding a page fails.
If we take advantage of that, then sev_gmem_post_populate() will be a bit
simpler (though I wouldn't go so far as to call it "simple").
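
For the sake of discussion, here's a rough sketch of that shape, with the
copy_from_user() hoisted out of the vendor callback.  All names and signatures
below are made up for illustration, i.e. this is *not* the current
kvm_gmem_populate() API, and kvm_gmem_get_pfn_for_populate() is a stand-in for
however gmem would resolve the target pfn under its invalidation lock:

  typedef int (*kvm_gmem_populate_one_cb)(struct kvm *kvm, gfn_t gfn,
                                          kvm_pfn_t pfn, struct page *src_page,
                                          void *opaque);

  static int kvm_gmem_populate_one(struct kvm *kvm, gfn_t gfn, void __user *src,
                                   kvm_gmem_populate_one_cb cb, void *opaque)
  {
          struct page *src_page = alloc_page(GFP_KERNEL);
          kvm_pfn_t pfn;
          int ret;

          if (!src_page)
                  return -ENOMEM;

          /* Copy from userspace *before* taking gmem/filemap locks. */
          if (copy_from_user(page_address(src_page), src, PAGE_SIZE)) {
                  ret = -EFAULT;
                  goto out;
          }

          /* Hypothetical: grab the target pfn under the invalidate lock. */
          ret = kvm_gmem_get_pfn_for_populate(kvm, gfn, &pfn);
          if (!ret)
                  ret = cb(kvm, gfn, pfn, src_page, opaque);
  out:
          __free_page(src_page);
          return ret;
  }

Since partial progress is allowed, the caller would simply loop gfn by gfn and
stop at the first failure.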

RFC as this is compile tested only (mostly due to lack of access to a TDX
capable system, but also due to lack of cycles).

[*] http://lore.kernel.org/all/aG_pLUlHdYIZ2luh@google.com

Sean Christopherson (12):
  KVM: TDX: Drop PROVE_MMU=y sanity check on to-be-populated mappings
  KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU
  Revert "KVM: x86/tdp_mmu: Add a helper function to walk down the TDP
    MMU"
  KVM: x86/mmu: Rename kvm_tdp_map_page() to kvm_tdp_prefault_page()
  KVM: TDX: Drop superfluous page pinning in S-EPT management
  KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition
  KVM: TDX: Avoid a double-KVM_BUG_ON() in tdx_sept_zap_private_spte()
  KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent
  KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  KVM: TDX: Assert that slots_lock is held when nr_premapped is accessed
  KVM: TDX: Track nr_premapped as an "unsigned long", not an
    "atomic64_t"
  KVM: TDX: Rename nr_premapped to nr_pending_tdh_mem_page_adds

 arch/x86/kvm/mmu.h         |   3 +-
 arch/x86/kvm/mmu/mmu.c     |  66 ++++++++++++++++++--
 arch/x86/kvm/mmu/tdp_mmu.c |  37 ++---------
 arch/x86/kvm/vmx/tdx.c     | 123 +++++++++++++------------------------
 arch/x86/kvm/vmx/tdx.h     |   9 ++-
 5 files changed, 117 insertions(+), 121 deletions(-)


base-commit: 196d9e72c4b0bd68b74a4ec7f52d248f37d0f030
-- 
2.51.0.268.g9569e192d0-goog


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [RFC PATCH 01/12] KVM: TDX: Drop PROVE_MMU=y sanity check on to-be-populated mappings
  2025-08-27  0:05 [RFC PATCH 00/12] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
@ 2025-08-27  0:05 ` Sean Christopherson
  2025-08-27  8:14   ` Yan Zhao
                     ` (2 more replies)
  2025-08-27  0:05 ` [RFC PATCH 02/12] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU Sean Christopherson
                   ` (12 subsequent siblings)
  13 siblings, 3 replies; 85+ messages in thread
From: Sean Christopherson @ 2025-08-27  0:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Michael Roth, Yan Zhao, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

Drop TDX's sanity check that an S-EPT mapping isn't zapped between creating
said mapping and doing TDH.MEM.PAGE.ADD, as the check is simultaneously
superfluous and incomplete.  Per commit 2608f1057601 ("KVM: x86/tdp_mmu:
Add a helper function to walk down the TDP MMU"), the justification for
introducing kvm_tdp_mmu_gpa_is_mapped() was to check that the target gfn
was pre-populated, with a link that points to this snippet:

 : > One small question:
 : >
 : > What if the memory region passed to KVM_TDX_INIT_MEM_REGION hasn't been pre-
 : > populated?  If we want to make KVM_TDX_INIT_MEM_REGION work with these regions,
 : > then we still need to do the real map.  Or we can make KVM_TDX_INIT_MEM_REGION
 : > return error when it finds the region hasn't been pre-populated?
 :
 : Return an error.  I don't love the idea of bleeding so many TDX details into
 : userspace, but I'm pretty sure that ship sailed a long, long time ago.

But that justification makes little sense for the final code, as simply
doing TDH.MEM.PAGE.ADD without a paranoid sanity check will return an error
if the S-EPT mapping is invalid (as evidenced by the code being guarded
with CONFIG_KVM_PROVE_MMU=y).

The sanity check is also incomplete in the sense that mmu_lock is dropped
between the check and TDH.MEM.PAGE.ADD, i.e. will only detect KVM bugs that
zap SPTEs in a very specific window.
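
Illustrating the window (call chain heavily simplified):

  kvm_tdp_map_page()
  scoped_guard(read_lock, &kvm->mmu_lock)
          kvm_tdp_mmu_gpa_is_mapped()   <= sanity check passes
  <mmu_lock dropped; a buggy zap of the S-EPT mapping here goes undetected>
  tdh_mem_page_add()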

Removing the sanity check will allow removing kvm_tdp_mmu_gpa_is_mapped(),
which has no business being exposed to vendor code.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 66744f5768c8..a6155f76cc6a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3175,20 +3175,6 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	if (ret < 0)
 		goto out;
 
-	/*
-	 * The private mem cannot be zapped after kvm_tdp_map_page()
-	 * because all paths are covered by slots_lock and the
-	 * filemap invalidate lock.  Check that they are indeed enough.
-	 */
-	if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
-		scoped_guard(read_lock, &kvm->mmu_lock) {
-			if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa), kvm)) {
-				ret = -EIO;
-				goto out;
-			}
-		}
-	}
-
 	ret = 0;
 	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
 			       src_page, &entry, &level_state);
-- 
2.51.0.268.g9569e192d0-goog


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [RFC PATCH 02/12] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU
  2025-08-27  0:05 [RFC PATCH 00/12] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
  2025-08-27  0:05 ` [RFC PATCH 01/12] KVM: TDX: Drop PROVE_MMU=y sanity check on to-be-populated mappings Sean Christopherson
@ 2025-08-27  0:05 ` Sean Christopherson
  2025-08-27  8:25   ` Yan Zhao
  2025-08-28  0:40   ` Ira Weiny
  2025-08-27  0:05 ` [RFC PATCH 03/12] Revert "KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU" Sean Christopherson
                   ` (11 subsequent siblings)
  13 siblings, 2 replies; 85+ messages in thread
From: Sean Christopherson @ 2025-08-27  0:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Michael Roth, Yan Zhao, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

Add and use a new API for mapping a private pfn from guest_memfd into the
TDP MMU from TDX's post-populate hook instead of partially open-coding the
functionality into the TDX code.  Sharing code with the pre-fault path
sounded good on paper, but it's fatally flawed as simulating a fault loses
the pfn, and calling back into gmem to re-retrieve the pfn creates locking
problems, e.g. kvm_gmem_populate() already holds the gmem invalidation
lock.

Providing a dedicated API will also allow removing several MMU exports that
ideally would not be exposed outside of the MMU, let alone to vendor code.
On that topic, opportunistically drop the kvm_mmu_load() export.  Leave
kvm_tdp_mmu_gpa_is_mapped() alone for now; the entire commit that added
kvm_tdp_mmu_gpa_is_mapped() will be removed in the near future.

Cc: Michael Roth <michael.roth@amd.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/all/20250709232103.zwmufocd3l7sqk7y@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu.h     |  1 +
 arch/x86/kvm/mmu/mmu.c | 60 +++++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/tdx.c | 10 +++----
 3 files changed, 63 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b4b6860ab971..697b90a97f43 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -259,6 +259,7 @@ extern bool tdp_mmu_enabled;
 
 bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
 int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
+int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn);
 
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6e838cb6c9e1..d3625e00baf9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4990,6 +4990,65 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 	return min(range->size, end - range->gpa);
 }
 
+int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
+{
+	struct kvm_page_fault fault = {
+		.addr = gfn_to_gpa(gfn),
+		.error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS,
+		.prefetch = true,
+		.is_tdp = true,
+		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),
+
+		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
+		.req_level = PG_LEVEL_4K,
+		.goal_level = PG_LEVEL_4K,
+		.is_private = true,
+
+		.gfn = gfn,
+		.slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn),
+		.pfn = pfn,
+		.map_writable = true,
+	};
+	struct kvm *kvm = vcpu->kvm;
+	int r;
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
+		return -EIO;
+
+	if (kvm_gfn_is_write_tracked(kvm, fault.slot, fault.gfn))
+		return -EPERM;
+
+	r = kvm_mmu_reload(vcpu);
+	if (r)
+		return r;
+
+	r = mmu_topup_memory_caches(vcpu, false);
+	if (r)
+		return r;
+
+	do {
+		if (signal_pending(current))
+			return -EINTR;
+
+		if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu))
+			return -EIO;
+
+		cond_resched();
+
+		guard(read_lock)(&kvm->mmu_lock);
+
+		r = kvm_tdp_mmu_map(vcpu, &fault);
+	} while (r == RET_PF_RETRY);
+
+	if (r != RET_PF_FIXED)
+		return -EIO;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_tdp_mmu_map_private_pfn);
+
 static void nonpaging_init_context(struct kvm_mmu *context)
 {
 	context->page_fault = nonpaging_page_fault;
@@ -5973,7 +6032,6 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
 out:
 	return r;
 }
-EXPORT_SYMBOL_GPL(kvm_mmu_load);
 
 void kvm_mmu_unload(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index a6155f76cc6a..1724d82c8512 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3151,15 +3151,12 @@ struct tdx_gmem_post_populate_arg {
 static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 				  void __user *src, int order, void *_arg)
 {
-	u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
-	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	struct tdx_gmem_post_populate_arg *arg = _arg;
-	struct kvm_vcpu *vcpu = arg->vcpu;
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	u64 err, entry, level_state;
 	gpa_t gpa = gfn_to_gpa(gfn);
-	u8 level = PG_LEVEL_4K;
 	struct page *src_page;
 	int ret, i;
-	u64 err, entry, level_state;
 
 	/*
 	 * Get the source page if it has been faulted in. Return failure if the
@@ -3171,7 +3168,7 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	if (ret != 1)
 		return -ENOMEM;
 
-	ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
+	ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
 	if (ret < 0)
 		goto out;
 
@@ -3234,7 +3231,6 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
 	    !vt_is_tdx_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT) - 1))
 		return -EINVAL;
 
-	kvm_mmu_reload(vcpu);
 	ret = 0;
 	while (region.nr_pages) {
 		if (signal_pending(current)) {
-- 
2.51.0.268.g9569e192d0-goog


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [RFC PATCH 03/12] Revert "KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU"
  2025-08-27  0:05 [RFC PATCH 00/12] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
  2025-08-27  0:05 ` [RFC PATCH 01/12] KVM: TDX: Drop PROVE_MMU=y sanity check on to-be-populated mappings Sean Christopherson
  2025-08-27  0:05 ` [RFC PATCH 02/12] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU Sean Christopherson
@ 2025-08-27  0:05 ` Sean Christopherson
  2025-08-27  0:05 ` [RFC PATCH 04/12] KVM: x86/mmu: Rename kvm_tdp_map_page() to kvm_tdp_prefault_page() Sean Christopherson
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 85+ messages in thread
From: Sean Christopherson @ 2025-08-27  0:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Michael Roth, Yan Zhao, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

Remove the helper and exports that were added to allow TDX code to reuse
kvm_tdp_map_page() for its gmem post-populate flow now that a dedicated
TDP MMU API is provided to install a mapping given a gfn+pfn pair.

This reverts commit 2608f105760115e94a03efd9f12f8fbfd1f9af4b.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu.h         |  2 --
 arch/x86/kvm/mmu/mmu.c     |  4 ++--
 arch/x86/kvm/mmu/tdp_mmu.c | 37 +++++--------------------------------
 3 files changed, 7 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 697b90a97f43..dc6b965cea4f 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -257,8 +257,6 @@ extern bool tdp_mmu_enabled;
 #define tdp_mmu_enabled false
 #endif
 
-bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
-int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
 int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn);
 
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d3625e00baf9..f532beed9029 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4900,7 +4900,8 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	return direct_page_fault(vcpu, fault);
 }
 
-int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
+static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
+			    u8 *level)
 {
 	int r;
 
@@ -4942,7 +4943,6 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
 		return -EIO;
 	}
 }
-EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
 
 long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 				    struct kvm_pre_fault_memory *range)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7f3d7229b2c1..1b559a50db51 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1910,13 +1910,16 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
  *
  * Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
  */
-static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
-				  struct kvm_mmu_page *root)
+int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+			 int *root_level)
 {
+	struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
 	struct tdp_iter iter;
 	gfn_t gfn = addr >> PAGE_SHIFT;
 	int leaf = -1;
 
+	*root_level = vcpu->arch.mmu->root_role.level;
+
 	for_each_tdp_pte(iter, vcpu->kvm, root, gfn, gfn + 1) {
 		leaf = iter.level;
 		sptes[leaf] = iter.old_spte;
@@ -1925,36 +1928,6 @@ static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
 	return leaf;
 }
 
-int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
-			 int *root_level)
-{
-	struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
-	*root_level = vcpu->arch.mmu->root_role.level;
-
-	return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, root);
-}
-
-bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa)
-{
-	struct kvm *kvm = vcpu->kvm;
-	bool is_direct = kvm_is_addr_direct(kvm, gpa);
-	hpa_t root = is_direct ? vcpu->arch.mmu->root.hpa :
-				 vcpu->arch.mmu->mirror_root_hpa;
-	u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
-	int leaf;
-
-	lockdep_assert_held(&kvm->mmu_lock);
-	rcu_read_lock();
-	leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, root_to_sp(root));
-	rcu_read_unlock();
-	if (leaf < 0)
-		return false;
-
-	spte = sptes[leaf];
-	return is_shadow_present_pte(spte) && is_last_spte(spte, leaf);
-}
-EXPORT_SYMBOL_GPL(kvm_tdp_mmu_gpa_is_mapped);
-
 /*
  * Returns the last level spte pointer of the shadow page walk for the given
  * gpa, and sets *spte to the spte value. This spte may be non-preset. If no
-- 
2.51.0.268.g9569e192d0-goog


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [RFC PATCH 04/12] KVM: x86/mmu: Rename kvm_tdp_map_page() to kvm_tdp_prefault_page()
  2025-08-27  0:05 [RFC PATCH 00/12] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (2 preceding siblings ...)
  2025-08-27  0:05 ` [RFC PATCH 03/12] Revert "KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU" Sean Christopherson
@ 2025-08-27  0:05 ` Sean Christopherson
  2025-08-28  2:01   ` Edgecombe, Rick P
  2025-08-27  0:05 ` [RFC PATCH 05/12] KVM: TDX: Drop superfluous page pinning in S-EPT management Sean Christopherson
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 85+ messages in thread
From: Sean Christopherson @ 2025-08-27  0:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Michael Roth, Yan Zhao, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

Rename kvm_tdp_map_page() to kvm_tdp_prefault_page() now that it's used
only by kvm_arch_vcpu_pre_fault_memory().

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f532beed9029..cb08785ce29b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4900,8 +4900,8 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	return direct_page_fault(vcpu, fault);
 }
 
-static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
-			    u8 *level)
+static int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+				 u64 error_code, u8 *level)
 {
 	int r;
 
@@ -4978,7 +4978,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 	 * Shadow paging uses GVA for kvm page fault, so restrict to
 	 * two-dimensional paging.
 	 */
-	r = kvm_tdp_map_page(vcpu, range->gpa | direct_bits, error_code, &level);
+	r = kvm_tdp_prefault_page(vcpu, range->gpa | direct_bits, error_code, &level);
 	if (r < 0)
 		return r;
 
-- 
2.51.0.268.g9569e192d0-goog


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [RFC PATCH 05/12] KVM: TDX: Drop superfluous page pinning in S-EPT management
  2025-08-27  0:05 [RFC PATCH 00/12] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (3 preceding siblings ...)
  2025-08-27  0:05 ` [RFC PATCH 04/12] KVM: x86/mmu: Rename kvm_tdp_map_page() to kvm_tdp_prefault_page() Sean Christopherson
@ 2025-08-27  0:05 ` Sean Christopherson
  2025-08-27  8:33   ` Yan Zhao
                     ` (2 more replies)
  2025-08-27  0:05 ` [RFC PATCH 06/12] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition Sean Christopherson
                   ` (8 subsequent siblings)
  13 siblings, 3 replies; 85+ messages in thread
From: Sean Christopherson @ 2025-08-27  0:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Michael Roth, Yan Zhao, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

Don't explicitly pin pages when mapping pages into the S-EPT, as guest_memfd
doesn't support page migration in any capacity, i.e. there are no migrate
callbacks because guest_memfd pages *can't* be migrated.  See the WARN in
kvm_gmem_migrate_folio().

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 28 ++++------------------------
 1 file changed, 4 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1724d82c8512..9fb6e5f02cc9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1586,29 +1586,22 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
 }
 
-static void tdx_unpin(struct kvm *kvm, struct page *page)
-{
-	put_page(page);
-}
-
 static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
-			    enum pg_level level, struct page *page)
+			    enum pg_level level, kvm_pfn_t pfn)
 {
 	int tdx_level = pg_level_to_tdx_sept_level(level);
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct page *page = pfn_to_page(pfn);
 	gpa_t gpa = gfn_to_gpa(gfn);
 	u64 entry, level_state;
 	u64 err;
 
 	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
-	if (unlikely(tdx_operand_busy(err))) {
-		tdx_unpin(kvm, page);
+	if (unlikely(tdx_operand_busy(err)))
 		return -EBUSY;
-	}
 
 	if (KVM_BUG_ON(err, kvm)) {
 		pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
-		tdx_unpin(kvm, page);
 		return -EIO;
 	}
 
@@ -1642,29 +1635,18 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 				     enum pg_level level, kvm_pfn_t pfn)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
-	struct page *page = pfn_to_page(pfn);
 
 	/* TODO: handle large pages. */
 	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
 		return -EINVAL;
 
-	/*
-	 * Because guest_memfd doesn't support page migration with
-	 * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
-	 * migration.  Until guest_memfd supports page migration, prevent page
-	 * migration.
-	 * TODO: Once guest_memfd introduces callback on page migration,
-	 * implement it and remove get_page/put_page().
-	 */
-	get_page(page);
-
 	/*
 	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
 	 * barrier in tdx_td_finalize().
 	 */
 	smp_rmb();
 	if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
-		return tdx_mem_page_aug(kvm, gfn, level, page);
+		return tdx_mem_page_aug(kvm, gfn, level, pfn);
 
 	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
 }
@@ -1715,7 +1697,6 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 		return -EIO;
 	}
 	tdx_clear_page(page);
-	tdx_unpin(kvm, page);
 	return 0;
 }
 
@@ -1795,7 +1776,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
 	    !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
 		atomic64_dec(&kvm_tdx->nr_premapped);
-		tdx_unpin(kvm, page);
 		return 0;
 	}
 
-- 
2.51.0.268.g9569e192d0-goog


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [RFC PATCH 06/12] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition
  2025-08-27  0:05 [RFC PATCH 00/12] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (4 preceding siblings ...)
  2025-08-27  0:05 ` [RFC PATCH 05/12] KVM: TDX: Drop superfluous page pinning in S-EPT management Sean Christopherson
@ 2025-08-27  0:05 ` Sean Christopherson
  2025-08-27  8:39   ` Yan Zhao
                     ` (2 more replies)
  2025-08-27  0:05 ` [RFC PATCH 07/12] KVM: TDX: Avoid a double-KVM_BUG_ON() in tdx_sept_zap_private_spte() Sean Christopherson
                   ` (7 subsequent siblings)
  13 siblings, 3 replies; 85+ messages in thread
From: Sean Christopherson @ 2025-08-27  0:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Michael Roth, Yan Zhao, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

Return -EIO when a KVM_BUG_ON() is tripped, as KVM's ABI is to return -EIO
when a VM has been killed due to a KVM bug, not -EINVAL.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 9fb6e5f02cc9..ef4ffcad131f 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1624,7 +1624,7 @@ static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 
 	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
-		return -EINVAL;
+		return -EIO;
 
 	/* nr_premapped will be decreased when tdh_mem_page_add() is called. */
 	atomic64_inc(&kvm_tdx->nr_premapped);
@@ -1638,7 +1638,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 
 	/* TODO: handle large pages. */
 	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
-		return -EINVAL;
+		return -EIO;
 
 	/*
 	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
@@ -1849,7 +1849,7 @@ static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
 	 * and slot move/deletion.
 	 */
 	if (KVM_BUG_ON(is_hkid_assigned(kvm_tdx), kvm))
-		return -EINVAL;
+		return -EIO;
 
 	/*
 	 * The HKID assigned to this TD was already freed and cache was
@@ -1870,7 +1870,7 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	 * there can't be anything populated in the private EPT.
 	 */
 	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
-		return -EINVAL;
+		return -EIO;
 
 	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
 	if (ret <= 0)
-- 
2.51.0.268.g9569e192d0-goog


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [RFC PATCH 07/12] KVM: TDX: Avoid a double-KVM_BUG_ON() in tdx_sept_zap_private_spte()
  2025-08-27  0:05 [RFC PATCH 00/12] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (5 preceding siblings ...)
  2025-08-27  0:05 ` [RFC PATCH 06/12] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition Sean Christopherson
@ 2025-08-27  0:05 ` Sean Christopherson
  2025-08-28  2:19   ` Edgecombe, Rick P
  2025-08-28 15:02   ` Ira Weiny
  2025-08-27  0:05 ` [RFC PATCH 08/12] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent Sean Christopherson
                   ` (6 subsequent siblings)
  13 siblings, 2 replies; 85+ messages in thread
From: Sean Christopherson @ 2025-08-27  0:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Michael Roth, Yan Zhao, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

Return -EIO immediately from tdx_sept_zap_private_spte() if the number of
to-be-added pages underflows, so that the following "KVM_BUG_ON(err, kvm)"
isn't also triggered.  Isolating the check from the "is premap error"
if-statement will also allow adding a lockdep assertion that premap errors
are encountered if and only if slots_lock is held.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ef4ffcad131f..88079e2d45fb 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1773,8 +1773,10 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 		err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
 		tdx_no_vcpus_enter_stop(kvm);
 	}
-	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
-	    !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
+	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
+		if (KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm))
+			return -EIO;
+
 		atomic64_dec(&kvm_tdx->nr_premapped);
 		return 0;
 	}
-- 
2.51.0.268.g9569e192d0-goog


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [RFC PATCH 08/12] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent
  2025-08-27  0:05 [RFC PATCH 00/12] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (6 preceding siblings ...)
  2025-08-27  0:05 ` [RFC PATCH 07/12] KVM: TDX: Avoid a double-KVM_BUG_ON() in tdx_sept_zap_private_spte() Sean Christopherson
@ 2025-08-27  0:05 ` Sean Christopherson
  2025-08-28  2:56   ` Edgecombe, Rick P
  2025-08-28 15:03   ` Ira Weiny
  2025-08-27  0:05 ` [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller Sean Christopherson
                   ` (5 subsequent siblings)
  13 siblings, 2 replies; 85+ messages in thread
From: Sean Christopherson @ 2025-08-27  0:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Michael Roth, Yan Zhao, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

Use atomic64_dec_return() when decrementing the number of "pre-mapped"
S-EPT pages to ensure that the count can't go negative without KVM
noticing.  In theory, checking for '0' and then decrementing in a separate
operation could miss a 0=>-1 transition.  In practice, such a condition is
impossible because nr_premapped is protected by slots_lock, i.e. doesn't
actually need to be an atomic (that wart will be addressed shortly).
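
E.g. with a hypothetical pair of racing decrements and an initial count of 1,
the check-then-decrement pattern misses the underflow:

  CPU0                                    CPU1
  atomic64_read() == 1, no KVM_BUG_ON()
                                          atomic64_read() == 1, no KVM_BUG_ON()
  atomic64_dec()  => 0
                                          atomic64_dec()  => -1, undetected

atomic64_dec_return() observes the post-decrement value atomically, so the
0=>-1 transition trips the KVM_BUG_ON().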

Don't bother trying to keep the count non-negative, as the KVM_BUG_ON()
ensures the VM is dead, i.e. there's no point in trying to limp along.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 88079e2d45fb..b7559ea1e353 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1774,10 +1774,9 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 		tdx_no_vcpus_enter_stop(kvm);
 	}
 	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
-		if (KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm))
+		if (KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm))
 			return -EIO;
 
-		atomic64_dec(&kvm_tdx->nr_premapped);
 		return 0;
 	}
 
@@ -3162,8 +3161,7 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 		goto out;
 	}
 
-	if (!KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm))
-		atomic64_dec(&kvm_tdx->nr_premapped);
+	KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm);
 
 	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
 		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
-- 
2.51.0.268.g9569e192d0-goog


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-27  0:05 [RFC PATCH 00/12] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (7 preceding siblings ...)
  2025-08-27  0:05 ` [RFC PATCH 08/12] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent Sean Christopherson
@ 2025-08-27  0:05 ` Sean Christopherson
  2025-08-27  9:02   ` Yan Zhao
  2025-08-27  0:05 ` [RFC PATCH 10/12] KVM: TDX: Assert that slots_lock is held when nr_premapped is accessed Sean Christopherson
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 85+ messages in thread
From: Sean Christopherson @ 2025-08-27  0:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Michael Roth, Yan Zhao, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

Fold tdx_mem_page_record_premap_cnt() into tdx_sept_set_private_spte() as
providing a one-off helper for effectively three lines of code is at best a
wash, and splitting the code makes the comment for smp_rmb() _extremely_
confusing as the comment talks about reading kvm->arch.pre_fault_allowed
before kvm_tdx->state, but the immediately visible code does the exact
opposite.

Opportunistically rewrite the comments to more explicitly explain who is
checking what, as well as _why_ the ordering matters.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 49 ++++++++++++++++++------------------------
 1 file changed, 21 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b7559ea1e353..e4b70c0dbda3 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1608,29 +1608,6 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
 	return 0;
 }
 
-/*
- * KVM_TDX_INIT_MEM_REGION calls kvm_gmem_populate() to map guest pages; the
- * callback tdx_gmem_post_populate() then maps pages into private memory.
- * through the a seamcall TDH.MEM.PAGE.ADD().  The SEAMCALL also requires the
- * private EPT structures for the page to have been built before, which is
- * done via kvm_tdp_map_page(). nr_premapped counts the number of pages that
- * were added to the EPT structures but not added with TDH.MEM.PAGE.ADD().
- * The counter has to be zero on KVM_TDX_FINALIZE_VM, to ensure that there
- * are no half-initialized shared EPT pages.
- */
-static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
-					  enum pg_level level, kvm_pfn_t pfn)
-{
-	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
-
-	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
-		return -EIO;
-
-	/* nr_premapped will be decreased when tdh_mem_page_add() is called. */
-	atomic64_inc(&kvm_tdx->nr_premapped);
-	return 0;
-}
-
 static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 				     enum pg_level level, kvm_pfn_t pfn)
 {
@@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 		return -EIO;
 
 	/*
-	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
-	 * barrier in tdx_td_finalize().
+	 * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory()
+	 * before kvm_tdx->state.  Userspace must not be allowed to pre-fault
+	 * arbitrary memory until the initial memory image is finalized.  Pairs
+	 * with the smp_wmb() in tdx_td_finalize().
 	 */
 	smp_rmb();
-	if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
-		return tdx_mem_page_aug(kvm, gfn, level, pfn);
 
-	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
+	/*
+	 * If the TD isn't finalized/runnable, then userspace is initializing
+	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Increment the number of
+	 * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD
+	 * requires a pre-existing S-EPT mapping).  KVM_TDX_FINALIZE_VM checks
+	 * the counter to ensure all mapped pages have been added to the image,
+	 * to prevent running the TD with uninitialized memory.
+	 */
+	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) {
+		if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
+			return -EIO;
+
+		atomic64_inc(&kvm_tdx->nr_premapped);
+		return 0;
+	}
+
+	return tdx_mem_page_aug(kvm, gfn, level, pfn);
 }
 
 static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
-- 
2.51.0.268.g9569e192d0-goog


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [RFC PATCH 10/12] KVM: TDX: Assert that slots_lock is held when nr_premapped is accessed
  2025-08-27  0:05 [RFC PATCH 00/12] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (8 preceding siblings ...)
  2025-08-27  0:05 ` [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller Sean Christopherson
@ 2025-08-27  0:05 ` Sean Christopherson
  2025-08-27  0:05 ` [RFC PATCH 11/12] KVM: TDX: Track nr_premapped as an "unsigned long", not an "atomic64_t" Sean Christopherson
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 85+ messages in thread
From: Sean Christopherson @ 2025-08-27  0:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Michael Roth, Yan Zhao, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

Assert that slots_lock is held when the TDX code accesses the number of
premapped pfns, as KVM relies on calls to tdx_vcpu_init_mem_region() being
serialized to prevent double-population of gmem and false negatives on the
consumption of a "premapped" pfn.  In addition to helping document how the
TDX code works, this will allow converting "nr_premapped" to a non-atomic
variable, as all usage asserts that slots_lock is held.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e4b70c0dbda3..27941defb62e 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1634,6 +1634,8 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	 * to prevent running the TD with uninitialized memory.
 	 */
 	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) {
+		lockdep_assert_held(&kvm->slots_lock);
+
 		if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
 			return -EIO;
 
@@ -1767,6 +1769,8 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 		tdx_no_vcpus_enter_stop(kvm);
 	}
 	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
+		lockdep_assert_held(&kvm->slots_lock);
+
 		if (KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm))
 			return -EIO;
 
@@ -3132,6 +3136,8 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	struct page *src_page;
 	int ret, i;
 
+	lockdep_assert_held(&kvm->slots_lock);
+
 	/*
 	 * Get the source page if it has been faulted in. Return failure if the
 	 * source page has been swapped out or unmapped in primary memory.
-- 
2.51.0.268.g9569e192d0-goog


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [RFC PATCH 11/12] KVM: TDX: Track nr_premapped as an "unsigned long", not an "atomic64_t"
  2025-08-27  0:05 [RFC PATCH 00/12] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (9 preceding siblings ...)
  2025-08-27  0:05 ` [RFC PATCH 10/12] KVM: TDX: Assert that slots_lock is held when nr_premapped is accessed Sean Christopherson
@ 2025-08-27  0:05 ` Sean Christopherson
  2025-08-27  9:12   ` Yan Zhao
  2025-08-27  0:05 ` [RFC PATCH 12/12] KVM: TDX: Rename nr_premapped to nr_pending_tdh_mem_page_adds Sean Christopherson
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 85+ messages in thread
From: Sean Christopherson @ 2025-08-27  0:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Michael Roth, Yan Zhao, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

Track the number of premapped pfns as a non-atomic variable, since all usage
is guarded by slots_lock, and KVM now asserts as much.  Note, slots_lock
has always effectively guarded nr_premapped since TDX support landed, the
use of an atomic64_t was likely a leftover from development that was
never cleaned up.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 8 ++++----
 arch/x86/kvm/vmx/tdx.h | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 27941defb62e..5d2bb27f22da 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1639,7 +1639,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 		if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
 			return -EIO;
 
-		atomic64_inc(&kvm_tdx->nr_premapped);
+		kvm_tdx->nr_premapped++;
 		return 0;
 	}
 
@@ -1771,7 +1771,7 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
 		lockdep_assert_held(&kvm->slots_lock);
 
-		if (KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm))
+		if (KVM_BUG_ON(--kvm_tdx->nr_premapped < 0, kvm))
 			return -EIO;
 
 		return 0;
@@ -2846,7 +2846,7 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 	 * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
 	 * TDH.MEM.PAGE.ADD().
 	 */
-	if (atomic64_read(&kvm_tdx->nr_premapped))
+	if (kvm_tdx->nr_premapped)
 		return -EINVAL;
 
 	cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td);
@@ -3160,7 +3160,7 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 		goto out;
 	}
 
-	KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm);
+	KVM_BUG_ON(--kvm_tdx->nr_premapped < 0, kvm);
 
 	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
 		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index ca39a9391db1..04ba9ea3e0ba 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -37,7 +37,7 @@ struct kvm_tdx {
 	struct tdx_td td;
 
 	/* For KVM_TDX_INIT_MEM_REGION. */
-	atomic64_t nr_premapped;
+	unsigned long nr_premapped;
 
 	/*
 	 * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do
-- 
2.51.0.268.g9569e192d0-goog


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [RFC PATCH 12/12] KVM: TDX: Rename nr_premapped to nr_pending_tdh_mem_page_adds
  2025-08-27  0:05 [RFC PATCH 00/12] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (10 preceding siblings ...)
  2025-08-27  0:05 ` [RFC PATCH 11/12] KVM: TDX: Track nr_premapped as an "unsigned long", not an "atomic64_t" Sean Christopherson
@ 2025-08-27  0:05 ` Sean Christopherson
  2025-08-27  9:22   ` Yan Zhao
  2025-08-28 15:23   ` Ira Weiny
  2025-08-27  9:48 ` [RFC PATCH 00/12] KVM: x86/mmu: TDX post-populate cleanups Yan Zhao
  2025-08-28 19:01 ` Edgecombe, Rick P
  13 siblings, 2 replies; 85+ messages in thread
From: Sean Christopherson @ 2025-08-27  0:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Michael Roth, Yan Zhao, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

Rename "nr_premapped" to an asurdly verbose "nr_pending_tdh_mem_page_adds"
to make it explicitly clear what the counter tracks.  "pre-map" is far
too similar to "pre-fault", especially since tdx_sept_set_private_spte()
deals with both "pre_fault_allowed" and the counter.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/tdx.c | 8 ++++----
 arch/x86/kvm/vmx/tdx.h | 9 +++++++--
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 5d2bb27f22da..f9ac590e8ff0 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1639,7 +1639,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 		if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
 			return -EIO;
 
-		kvm_tdx->nr_premapped++;
+		kvm_tdx->nr_pending_tdh_mem_page_adds++;
 		return 0;
 	}
 
@@ -1771,7 +1771,7 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
 		lockdep_assert_held(&kvm->slots_lock);
 
-		if (KVM_BUG_ON(--kvm_tdx->nr_premapped < 0, kvm))
+		if (KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm))
 			return -EIO;
 
 		return 0;
@@ -2846,7 +2846,7 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 	 * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
 	 * TDH.MEM.PAGE.ADD().
 	 */
-	if (kvm_tdx->nr_premapped)
+	if (kvm_tdx->nr_pending_tdh_mem_page_adds)
 		return -EINVAL;
 
 	cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td);
@@ -3160,7 +3160,7 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 		goto out;
 	}
 
-	KVM_BUG_ON(--kvm_tdx->nr_premapped < 0, kvm);
+	KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm);
 
 	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
 		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 04ba9ea3e0ba..45d86f9fa41c 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -36,8 +36,13 @@ struct kvm_tdx {
 
 	struct tdx_td td;
 
-	/* For KVM_TDX_INIT_MEM_REGION. */
-	unsigned long nr_premapped;
+	/*
+	 * The number of pages that KVM_TDX_INIT_MEM_REGION has mapped into the
+	 * S-EPT, but not yet initialized via TDH.MEM.PAGE_ADD.  Used to sanity
+	 * check adding pages to the image, and to ensure that all pages have
+	 * been initialized before finalizing the TD.
+	 */
+	unsigned long nr_pending_tdh_mem_page_adds;
 
 	/*
 	 * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do
-- 
2.51.0.268.g9569e192d0-goog


^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 01/12] KVM: TDX: Drop PROVE_MMU=y sanity check on to-be-populated mappings
  2025-08-27  0:05 ` [RFC PATCH 01/12] KVM: TDX: Drop PROVE_MMU=y sanity check on to-be-populated mappings Sean Christopherson
@ 2025-08-27  8:14   ` Yan Zhao
  2025-08-28  0:37   ` Ira Weiny
  2025-08-28  2:13   ` Huang, Kai
  2 siblings, 0 replies; 85+ messages in thread
From: Yan Zhao @ 2025-08-27  8:14 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Michael Roth, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

On Tue, Aug 26, 2025 at 05:05:11PM -0700, Sean Christopherson wrote:
> Drop TDX's sanity check that an S-EPT mapping isn't zapped between creating
> said mapping and doing TDH.MEM.PAGE.ADD, as the check is simultaneously
> superfluous and incomplete.  Per commit 2608f1057601 ("KVM: x86/tdp_mmu:
> Add a helper function to walk down the TDP MMU"), the justification for
> introducing kvm_tdp_mmu_gpa_is_mapped() was to check that the target gfn
> was pre-populated, with a link that points to this snippet:
> 
>  : > One small question:
>  : >
>  : > What if the memory region passed to KVM_TDX_INIT_MEM_REGION hasn't been pre-
>  : > populated?  If we want to make KVM_TDX_INIT_MEM_REGION work with these regions,
>  : > then we still need to do the real map.  Or we can make KVM_TDX_INIT_MEM_REGION
>  : > return error when it finds the region hasn't been pre-populated?
>  :
>  : Return an error.  I don't love the idea of bleeding so many TDX details into
>  : userspace, but I'm pretty sure that ship sailed a long, long time ago.
> 
> But that justification makes little sense for the final code, as simply
> doing TDH.MEM.PAGE.ADD without a paranoid sanity check will return an error
> if the S-EPT mapping is invalid (as evidenced by the code being guarded
> with CONFIG_KVM_PROVE_MMU=y).
Checking of kvm_tdp_mmu_gpa_is_mapped() was intended to detect unexpected zaps
like kvm_zap_gfn_range() between kvm_tdp_map_page() and tdh_mem_page_add()?
In that case, TDH.MEM.PAGE.ADD would succeed without any error.

But as you said, the read mmu_lock is dropped before tdh_mem_page_add().
Moreover, it still cannot guard against atomic zaps.

As zaps between kvm_tdp_map_page() and tdh_mem_page_add() could still be
detectable through the incorrect value of nr_premapped in the end, dropping the
checks of kvm_tdp_mmu_gpa_is_mapped() looks good.

> The sanity check is also incomplete in the sense that mmu_lock is dropped
> between the check and TDH.MEM.PAGE.ADD, i.e. will only detect KVM bugs that
> zap SPTEs in a very specific window.
>
> Removing the sanity check will allow removing kvm_tdp_mmu_gpa_is_mapped(),
> which has no business being exposed to vendor code.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 14 --------------
>  1 file changed, 14 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 66744f5768c8..a6155f76cc6a 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -3175,20 +3175,6 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  	if (ret < 0)
>  		goto out;
>  
> -	/*
> -	 * The private mem cannot be zapped after kvm_tdp_map_page()
> -	 * because all paths are covered by slots_lock and the
> -	 * filemap invalidate lock.  Check that they are indeed enough.
> -	 */
> -	if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
> -		scoped_guard(read_lock, &kvm->mmu_lock) {
> -			if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa), kvm)) {
> -				ret = -EIO;
> -				goto out;
> -			}
> -		}
> -	}
> -
>  	ret = 0;
>  	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
>  			       src_page, &entry, &level_state);
> -- 
> 2.51.0.268.g9569e192d0-goog
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 02/12] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU
  2025-08-27  0:05 ` [RFC PATCH 02/12] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU Sean Christopherson
@ 2025-08-27  8:25   ` Yan Zhao
  2025-08-28  0:54     ` Edgecombe, Rick P
  2025-08-28  0:40   ` Ira Weiny
  1 sibling, 1 reply; 85+ messages in thread
From: Yan Zhao @ 2025-08-27  8:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Michael Roth, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

On Tue, Aug 26, 2025 at 05:05:12PM -0700, Sean Christopherson wrote:
> Add and use a new API for mapping a private pfn from guest_memfd into the
> TDP MMU from TDX's post-populate hook instead of partially open-coding the
> functionality into the TDX code.  Sharing code with the pre-fault path
> sounded good on paper, but it's fatally flawed as simulating a fault loses
> the pfn, and calling back into gmem to re-retrieve the pfn creates locking
> problems, e.g. kvm_gmem_populate() already holds the gmem invalidation
> lock.
> 
> Providing a dedicated API will also allow removing several MMU exports that
> ideally would not be exposed outside of the MMU, let alone to vendor code.
> On that topic, opportunistically drop the kvm_mmu_load() export.  Leave
> kvm_tdp_mmu_gpa_is_mapped() alone for now; the entire commit that added
> kvm_tdp_mmu_gpa_is_mapped() will be removed in the near future.
> 
> Cc: Michael Roth <michael.roth@amd.com>
> Cc: Yan Zhao <yan.y.zhao@intel.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: Vishal Annapurve <vannapurve@google.com>
> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Link: https://lore.kernel.org/all/20250709232103.zwmufocd3l7sqk7y@amd.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu.h     |  1 +
>  arch/x86/kvm/mmu/mmu.c | 60 +++++++++++++++++++++++++++++++++++++++++-
>  arch/x86/kvm/vmx/tdx.c | 10 +++----
>  3 files changed, 63 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index b4b6860ab971..697b90a97f43 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -259,6 +259,7 @@ extern bool tdp_mmu_enabled;
>  
>  bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
>  int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
> +int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn);
>  
>  static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
>  {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6e838cb6c9e1..d3625e00baf9 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4990,6 +4990,65 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>  	return min(range->size, end - range->gpa);
>  }
>  
> +int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
As the function starts with kvm_tdp_mmu, can we move it to tdp_mmu.c?

> +{
> +	struct kvm_page_fault fault = {
> +		.addr = gfn_to_gpa(gfn),
> +		.error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS,
> +		.prefetch = true,
> +		.is_tdp = true,
> +		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),
> +
> +		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
Looks like kvm_tdp_mmu_map_private_pfn() is only for initial memory mapping,
given that ".prefetch = true" and RET_PF_SPURIOUS is not a valid return value.

Then, what about setting
                .max_level = PG_LEVEL_4K,
directly?

Otherwise, the "(KVM_BUG_ON(level != PG_LEVEL_4K, kvm)" would be triggered in
tdx_sept_set_private_spte().

> +		.req_level = PG_LEVEL_4K,
> +		.goal_level = PG_LEVEL_4K,
> +		.is_private = true,
> +
> +		.gfn = gfn,
> +		.slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn),
> +		.pfn = pfn,
> +		.map_writable = true,
> +	};
> +	struct kvm *kvm = vcpu->kvm;
> +	int r;
> +
> +	lockdep_assert_held(&kvm->slots_lock);
> +
> +	if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
> +		return -EIO;
> +
> +	if (kvm_gfn_is_write_tracked(kvm, fault.slot, fault.gfn))
> +		return -EPERM;
> +
> +	r = kvm_mmu_reload(vcpu);
> +	if (r)
> +		return r;
> +
> +	r = mmu_topup_memory_caches(vcpu, false);
> +	if (r)
> +		return r;
> +
> +	do {
> +		if (signal_pending(current))
> +			return -EINTR;
> +
> +		if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu))
> +			return -EIO;
> +
> +		cond_resched();
> +
> +		guard(read_lock)(&kvm->mmu_lock);
> +
> +		r = kvm_tdp_mmu_map(vcpu, &fault);
> +	} while (r == RET_PF_RETRY);
> +
> +	if (r != RET_PF_FIXED)
> +		return -EIO;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(kvm_tdp_mmu_map_private_pfn);
> +
>  static void nonpaging_init_context(struct kvm_mmu *context)
>  {
>  	context->page_fault = nonpaging_page_fault;
> @@ -5973,7 +6032,6 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
>  out:
>  	return r;
>  }
> -EXPORT_SYMBOL_GPL(kvm_mmu_load);
>  
>  void kvm_mmu_unload(struct kvm_vcpu *vcpu)
>  {
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index a6155f76cc6a..1724d82c8512 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -3151,15 +3151,12 @@ struct tdx_gmem_post_populate_arg {
>  static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  				  void __user *src, int order, void *_arg)
>  {
> -	u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
> -	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>  	struct tdx_gmem_post_populate_arg *arg = _arg;
> -	struct kvm_vcpu *vcpu = arg->vcpu;
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	u64 err, entry, level_state;
>  	gpa_t gpa = gfn_to_gpa(gfn);
> -	u8 level = PG_LEVEL_4K;
>  	struct page *src_page;
>  	int ret, i;
> -	u64 err, entry, level_state;
>  
>  	/*
>  	 * Get the source page if it has been faulted in. Return failure if the
> @@ -3171,7 +3168,7 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  	if (ret != 1)
>  		return -ENOMEM;
>  
> -	ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> +	ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
>  	if (ret < 0)
>  		goto out;
>  
> @@ -3234,7 +3231,6 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
>  	    !vt_is_tdx_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT) - 1))
>  		return -EINVAL;
>  
> -	kvm_mmu_reload(vcpu);
>  	ret = 0;
>  	while (region.nr_pages) {
>  		if (signal_pending(current)) {
> -- 
> 2.51.0.268.g9569e192d0-goog
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 05/12] KVM: TDX: Drop superfluous page pinning in S-EPT management
  2025-08-27  0:05 ` [RFC PATCH 05/12] KVM: TDX: Drop superfluous page pinning in S-EPT management Sean Christopherson
@ 2025-08-27  8:33   ` Yan Zhao
  2025-08-28  2:05     ` Edgecombe, Rick P
  2025-08-28  0:36   ` Ira Weiny
  2025-08-28  2:45   ` Huang, Kai
  2 siblings, 1 reply; 85+ messages in thread
From: Yan Zhao @ 2025-08-27  8:33 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Michael Roth, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

On Tue, Aug 26, 2025 at 05:05:15PM -0700, Sean Christopherson wrote:
> Don't explicitly pin pages when mapping pages into the S-EPT, guest_memfd
> doesn't support page migration in any capacity, i.e. there are no migrate
> callbacks because guest_memfd pages *can't* be migrated.  See the WARN in
> kvm_gmem_migrate_folio().
Hmm, we implemented exactly the same patch at [1], where we explained the
potential problems of not holding the page refcount, explored various
approaches, and discussed related considerations.

[1] https://lore.kernel.org/all/20250807094241.4523-1-yan.y.zhao@intel.com/

> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 28 ++++------------------------
>  1 file changed, 4 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 1724d82c8512..9fb6e5f02cc9 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1586,29 +1586,22 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
>  	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
>  }
>  
> -static void tdx_unpin(struct kvm *kvm, struct page *page)
> -{
> -	put_page(page);
> -}
> -
>  static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> -			    enum pg_level level, struct page *page)
> +			    enum pg_level level, kvm_pfn_t pfn)
>  {
>  	int tdx_level = pg_level_to_tdx_sept_level(level);
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	struct page *page = pfn_to_page(pfn);
>  	gpa_t gpa = gfn_to_gpa(gfn);
>  	u64 entry, level_state;
>  	u64 err;
>  
>  	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
> -	if (unlikely(tdx_operand_busy(err))) {
> -		tdx_unpin(kvm, page);
> +	if (unlikely(tdx_operand_busy(err)))
>  		return -EBUSY;
> -	}
>  
>  	if (KVM_BUG_ON(err, kvm)) {
>  		pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
> -		tdx_unpin(kvm, page);
>  		return -EIO;
>  	}
>  
> @@ -1642,29 +1635,18 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>  				     enum pg_level level, kvm_pfn_t pfn)
>  {
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> -	struct page *page = pfn_to_page(pfn);
>  
>  	/* TODO: handle large pages. */
>  	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
>  		return -EINVAL;
>  
> -	/*
> -	 * Because guest_memfd doesn't support page migration with
> -	 * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
> -	 * migration.  Until guest_memfd supports page migration, prevent page
> -	 * migration.
> -	 * TODO: Once guest_memfd introduces callback on page migration,
> -	 * implement it and remove get_page/put_page().
> -	 */
> -	get_page(page);
> -
>  	/*
>  	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
>  	 * barrier in tdx_td_finalize().
>  	 */
>  	smp_rmb();
>  	if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
> -		return tdx_mem_page_aug(kvm, gfn, level, page);
> +		return tdx_mem_page_aug(kvm, gfn, level, pfn);
>  
>  	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
>  }
> @@ -1715,7 +1697,6 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
>  		return -EIO;
>  	}
>  	tdx_clear_page(page);
> -	tdx_unpin(kvm, page);
>  	return 0;
>  }
>  
> @@ -1795,7 +1776,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>  	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
>  	    !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
>  		atomic64_dec(&kvm_tdx->nr_premapped);
> -		tdx_unpin(kvm, page);
>  		return 0;
>  	}
>  
> -- 
> 2.51.0.268.g9569e192d0-goog
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 06/12] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition
  2025-08-27  0:05 ` [RFC PATCH 06/12] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition Sean Christopherson
@ 2025-08-27  8:39   ` Yan Zhao
  2025-08-27 17:26     ` Sean Christopherson
  2025-08-28  2:11   ` Edgecombe, Rick P
  2025-08-28 15:03   ` Ira Weiny
  2 siblings, 1 reply; 85+ messages in thread
From: Yan Zhao @ 2025-08-27  8:39 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Michael Roth, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

On Tue, Aug 26, 2025 at 05:05:16PM -0700, Sean Christopherson wrote:
> Return -EIO when a KVM_BUG_ON() is tripped, as KVM's ABI is to return -EIO
> when a VM has been killed due to a KVM bug, not -EINVAL.
Looks good to me, though currently the "-EIO" will not be returned to userspace
either. In the fault path, RET_PF_RETRY is returned instead, while the zap
paths return void.

> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 9fb6e5f02cc9..ef4ffcad131f 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1624,7 +1624,7 @@ static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>  
>  	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
> -		return -EINVAL;
> +		return -EIO;
>  
>  	/* nr_premapped will be decreased when tdh_mem_page_add() is called. */
>  	atomic64_inc(&kvm_tdx->nr_premapped);
> @@ -1638,7 +1638,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>  
>  	/* TODO: handle large pages. */
>  	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> -		return -EINVAL;
> +		return -EIO;
>  
>  	/*
>  	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
> @@ -1849,7 +1849,7 @@ static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
>  	 * and slot move/deletion.
>  	 */
>  	if (KVM_BUG_ON(is_hkid_assigned(kvm_tdx), kvm))
> -		return -EINVAL;
> +		return -EIO;
>  
>  	/*
>  	 * The HKID assigned to this TD was already freed and cache was
> @@ -1870,7 +1870,7 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
>  	 * there can't be anything populated in the private EPT.
>  	 */
>  	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
> -		return -EINVAL;
> +		return -EIO;
>  
>  	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
>  	if (ret <= 0)
> -- 
> 2.51.0.268.g9569e192d0-goog
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-27  0:05 ` [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller Sean Christopherson
@ 2025-08-27  9:02   ` Yan Zhao
  2025-08-27 19:08     ` Sean Christopherson
  2025-08-28 15:28     ` Ira Weiny
  0 siblings, 2 replies; 85+ messages in thread
From: Yan Zhao @ 2025-08-27  9:02 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Michael Roth, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote:
> Fold tdx_mem_page_record_premap_cnt() into tdx_sept_set_private_spte() as
> providing a one-off helper for effectively three lines of code is at best a
> wash, and splitting the code makes the comment for smp_rmb()  _extremely_
> confusing as the comment talks about reading kvm->arch.pre_fault_allowed
> before kvm_tdx->state, but the immediately visible code does the exact
> opposite.
> 
> Opportunistically rewrite the comments to more explicitly explain who is
> checking what, as well as _why_ the ordering matters.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 49 ++++++++++++++++++------------------------
>  1 file changed, 21 insertions(+), 28 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b7559ea1e353..e4b70c0dbda3 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1608,29 +1608,6 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
>  	return 0;
>  }
>  
> -/*
> - * KVM_TDX_INIT_MEM_REGION calls kvm_gmem_populate() to map guest pages; the
> - * callback tdx_gmem_post_populate() then maps pages into private memory.
> - * through the a seamcall TDH.MEM.PAGE.ADD().  The SEAMCALL also requires the
> - * private EPT structures for the page to have been built before, which is
> - * done via kvm_tdp_map_page(). nr_premapped counts the number of pages that
> - * were added to the EPT structures but not added with TDH.MEM.PAGE.ADD().
> - * The counter has to be zero on KVM_TDX_FINALIZE_VM, to ensure that there
> - * are no half-initialized shared EPT pages.
> - */
> -static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
> -					  enum pg_level level, kvm_pfn_t pfn)
> -{
> -	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> -
> -	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
> -		return -EIO;
> -
> -	/* nr_premapped will be decreased when tdh_mem_page_add() is called. */
> -	atomic64_inc(&kvm_tdx->nr_premapped);
> -	return 0;
> -}
> -
>  static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>  				     enum pg_level level, kvm_pfn_t pfn)
>  {
> @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>  		return -EIO;
>  
>  	/*
> -	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
> -	 * barrier in tdx_td_finalize().
> +	 * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory()
> +	 * before kvm_tdx->state.  Userspace must not be allowed to pre-fault
> +	 * arbitrary memory until the initial memory image is finalized.  Pairs
> +	 * with the smp_wmb() in tdx_td_finalize().
>  	 */
>  	smp_rmb();
> -	if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
> -		return tdx_mem_page_aug(kvm, gfn, level, pfn);
>  
> -	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
> +	/*
> +	 * If the TD isn't finalized/runnable, then userspace is initializing
> +	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Increment the number of
> +	 * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD
> +	 * requires a pre-existing S-EPT mapping).  KVM_TDX_FINALIZE_VM checks
> +	 * the counter to ensure all mapped pages have been added to the image,
> +	 * to prevent running the TD with uninitialized memory.
To prevent the mismatch between mirror EPT and the S-EPT?

e.g., Before KVM_TDX_FINALIZE_VM,
if userspace performs a zap after the TDH.MEM.PAGE.ADD, the page will be removed
from the S-EPT. The count of nr_premapped will not change after the successful
TDH.MEM.RANGE.BLOCK and TDH.MEM.PAGE.REMOVE.

As a result, the TD will still run with uninitialized memory.

> +	 */
> +	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) {
> +		if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
> +			return -EIO;
> +
> +		atomic64_inc(&kvm_tdx->nr_premapped);
> +		return 0;
> +	}
> +
> +	return tdx_mem_page_aug(kvm, gfn, level, pfn);
>  }
>  
>  static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> -- 
> 2.51.0.268.g9569e192d0-goog
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 11/12] KVM: TDX: Track nr_premapped as an "unsigned long", not an "atomic64_t"
  2025-08-27  0:05 ` [RFC PATCH 11/12] KVM: TDX: Track nr_premapped as an "unsigned long", not an "atomic64_t" Sean Christopherson
@ 2025-08-27  9:12   ` Yan Zhao
  0 siblings, 0 replies; 85+ messages in thread
From: Yan Zhao @ 2025-08-27  9:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Michael Roth, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

On Tue, Aug 26, 2025 at 05:05:21PM -0700, Sean Christopherson wrote:
> Track the number of premapped pfns as a non-atomic variable as all usage
> is guarded by slots_lock, and KVM now asserts as much.  Note, slots_lock
> has always effectively guarded nr_premapped since TDX support landed, the
> use of an atomic64_t was likely a leftover from development that was
> never cleaned up.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 8 ++++----
>  arch/x86/kvm/vmx/tdx.h | 2 +-
>  2 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 27941defb62e..5d2bb27f22da 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1639,7 +1639,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>  		if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
>  			return -EIO;
>  
> -		atomic64_inc(&kvm_tdx->nr_premapped);
> +		kvm_tdx->nr_premapped++;
>  		return 0;
>  	}
>  
> @@ -1771,7 +1771,7 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>  	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
>  		lockdep_assert_held(&kvm->slots_lock);
>  
> -		if (KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm))
> +		if (KVM_BUG_ON(--kvm_tdx->nr_premapped < 0, kvm))
>  			return -EIO;
>  
>  		return 0;
> @@ -2846,7 +2846,7 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>  	 * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
>  	 * TDH.MEM.PAGE.ADD().
>  	 */
> -	if (atomic64_read(&kvm_tdx->nr_premapped))
> +	if (kvm_tdx->nr_premapped)
>  		return -EINVAL;
>  
>  	cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td);
> @@ -3160,7 +3160,7 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  		goto out;
>  	}
>  
> -	KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm);
> +	KVM_BUG_ON(--kvm_tdx->nr_premapped < 0, kvm);
>  
>  	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
>  		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index ca39a9391db1..04ba9ea3e0ba 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -37,7 +37,7 @@ struct kvm_tdx {
>  	struct tdx_td td;
>  
>  	/* For KVM_TDX_INIT_MEM_REGION. */
> -	atomic64_t nr_premapped;
> +	unsigned long nr_premapped;

Due to the comparison with < 0, the type should be "s64" or "signed long"?
  
>  	/*
>  	 * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do
> -- 
> 2.51.0.268.g9569e192d0-goog
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 12/12] KVM: TDX: Rename nr_premapped to nr_pending_tdh_mem_page_adds
  2025-08-27  0:05 ` [RFC PATCH 12/12] KVM: TDX: Rename nr_premapped to nr_pending_tdh_mem_page_adds Sean Christopherson
@ 2025-08-27  9:22   ` Yan Zhao
  2025-08-28 15:23   ` Ira Weiny
  1 sibling, 0 replies; 85+ messages in thread
From: Yan Zhao @ 2025-08-27  9:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Michael Roth, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

On Tue, Aug 26, 2025 at 05:05:22PM -0700, Sean Christopherson wrote:
> Rename "nr_premapped" to an absurdly verbose "nr_pending_tdh_mem_page_adds"
> to make it explicitly clear what the counter tracks.  "pre-map" is far
> too similar to "pre-fault", especially since tdx_sept_set_private_spte()
> deals with both "pre_fault_allowed" and the counter.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 8 ++++----
>  arch/x86/kvm/vmx/tdx.h | 9 +++++++--
>  2 files changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 5d2bb27f22da..f9ac590e8ff0 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1639,7 +1639,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>  		if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
>  			return -EIO;
>  
> -		kvm_tdx->nr_premapped++;
> +		kvm_tdx->nr_pending_tdh_mem_page_adds++;
>  		return 0;
>  	}
>  
> @@ -1771,7 +1771,7 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>  	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
>  		lockdep_assert_held(&kvm->slots_lock);
>  
> -		if (KVM_BUG_ON(--kvm_tdx->nr_premapped < 0, kvm))
> +		if (KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm))
>  			return -EIO;
>  
>  		return 0;
> @@ -2846,7 +2846,7 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>  	 * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
>  	 * TDH.MEM.PAGE.ADD().
>  	 */
> -	if (kvm_tdx->nr_premapped)
> +	if (kvm_tdx->nr_pending_tdh_mem_page_adds)
>  		return -EINVAL;
>  
>  	cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td);
> @@ -3160,7 +3160,7 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  		goto out;
>  	}
>  
> -	KVM_BUG_ON(--kvm_tdx->nr_premapped < 0, kvm);
> +	KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm);
>  
>  	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
>  		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 04ba9ea3e0ba..45d86f9fa41c 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -36,8 +36,13 @@ struct kvm_tdx {
>  
>  	struct tdx_td td;
>  
> -	/* For KVM_TDX_INIT_MEM_REGION. */
> -	unsigned long nr_premapped;
> +	/*
> +	 * The number of pages that KVM_TDX_INIT_MEM_REGION has mapped into the
> +	 * S-EPT, but not yet initialized via TDH.MEM.PAGE_ADD.  Used to sanity
> +	 * check adding pages to the image, and to ensure that all pages have
> +	 * been initialized before finalizing the TD.
> +	 */
To ensure the consistency between mirror EPT and S-EPT? (as in
https://lore.kernel.org/all/aK7Ji3kAoDaEYn3h@yzhao56-desk.sh.intel.com).

tdx_td_finalize() holds slots_lock, so it can't run in-between a
KVM_TDX_INIT_MEM_REGION.

> +	unsigned long nr_pending_tdh_mem_page_adds;
>  
>  	/*
>  	 * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do
> -- 
> 2.51.0.268.g9569e192d0-goog
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 00/12] KVM: x86/mmu: TDX post-populate cleanups
  2025-08-27  0:05 [RFC PATCH 00/12] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (11 preceding siblings ...)
  2025-08-27  0:05 ` [RFC PATCH 12/12] KVM: TDX: Rename nr_premapped to nr_pending_tdh_mem_page_adds Sean Christopherson
@ 2025-08-27  9:48 ` Yan Zhao
  2025-08-28 19:01 ` Edgecombe, Rick P
  13 siblings, 0 replies; 85+ messages in thread
From: Yan Zhao @ 2025-08-27  9:48 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Michael Roth, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

On Tue, Aug 26, 2025 at 05:05:10PM -0700, Sean Christopherson wrote:
> This is a largely untested series to do most of what was discussed in the
> thread regarding locking issues between gmem and TDX's post-populate hook[*],
> with more than a few side quests thrown in as I was navigating through the
> code to try to figure out how best to eliminate the copy_from_user() from
> sev_gmem_post_populate(), which has the same locking problem (copying from
> a userspace address can fault and in theory trigger the same problematic
> path, I think).
> 
> Notably absent is the extraction of copy_from_user() from
> sev_gmem_post_populate() to kvm_gmem_populate().  I've had this on my todo
> list for a few weeks now, and haven't been able to focus on it for long
> enough to get something hammered out, and with KVM Forum on the horizon, I
> don't anticipate getting 'round to it within the next month (if not much
> longer).
> 
> The thing that stymied me is what to do if snp_launch_update() is passed in
> a huge batch of pages.  I waffled between doing a slow one-at-a-time approach
> and a batched approached, and got especially stuck when trying to remember
> and/or figure out how that handling would interact with hugepage support in
> SNP in particular.
> 
> If anyone wants to tackle that project, the one thing change I definitely
> think we should do is change the post_populate() callback to operate on
> exactly one page.
Not sure if I understand it correctly.
Do you mean something like the tdx_gmem_post_populate_4k() in
https://lore.kernel.org/all/20250424030500.32720-1-yan.y.zhao@intel.com, or
invoking hugepage_set_guest_inhibit() in the post_populate() callback? 

> KVM_SEV_SNP_LAUNCH_UPDATE allows for partial progress,
> i.e. KVM's ABI doesn't require it to unwind a batch if adding a page fails.
> If we take advantage of that, then sev_gmem_post_populate() will be a bit
> simpler (though I wouldn't go so far as to call it "simple").
> 
> RFC as this is compile tested only (mostly due to lack of access to a TDX
> capable system, but also due to lack of cycles).
> 
> [*] http://lore.kernel.org/all/aG_pLUlHdYIZ2luh@google.com
> 
> Sean Christopherson (12):
>   KVM: TDX: Drop PROVE_MMU=y sanity check on to-be-populated mappings
>   KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU
>   Revert "KVM: x86/tdp_mmu: Add a helper function to walk down the TDP
>     MMU"
>   KVM: x86/mmu: Rename kvm_tdp_map_page() to kvm_tdp_prefault_page()
>   KVM: TDX: Drop superfluous page pinning in S-EPT management
>   KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition
>   KVM: TDX: Avoid a double-KVM_BUG_ON() in tdx_sept_zap_private_spte()
>   KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent
>   KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
>   KVM: TDX: Assert that slots_lock is held when nr_premapped is accessed
>   KVM: TDX: Track nr_premapped as an "unsigned long", not an
>     "atomic64_t"
>   KVM: TDX: Rename nr_premapped to nr_pending_tdh_mem_page_adds
> 
>  arch/x86/kvm/mmu.h         |   3 +-
>  arch/x86/kvm/mmu/mmu.c     |  66 ++++++++++++++++++--
>  arch/x86/kvm/mmu/tdp_mmu.c |  37 ++---------
>  arch/x86/kvm/vmx/tdx.c     | 123 +++++++++++++------------------------
>  arch/x86/kvm/vmx/tdx.h     |   9 ++-
>  5 files changed, 117 insertions(+), 121 deletions(-)
> 
> 
> base-commit: 196d9e72c4b0bd68b74a4ec7f52d248f37d0f030
> -- 
> 2.51.0.268.g9569e192d0-goog
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 06/12] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition
  2025-08-27  8:39   ` Yan Zhao
@ 2025-08-27 17:26     ` Sean Christopherson
  0 siblings, 0 replies; 85+ messages in thread
From: Sean Christopherson @ 2025-08-27 17:26 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Paolo Bonzini, kvm, linux-kernel, Michael Roth, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

On Wed, Aug 27, 2025, Yan Zhao wrote:
> On Tue, Aug 26, 2025 at 05:05:16PM -0700, Sean Christopherson wrote:
> > Return -EIO when a KVM_BUG_ON() is tripped, as KVM's ABI is to return -EIO
> > when a VM has been killed due to a KVM bug, not -EINVAL.
> Looks good to me, though currently the "-EIO" will not be returned to userspace
> either. In the fault path, RET_PF_RETRY is returned instead, while in the zap
> paths, void is returned.

Yeah, I suspected as much.  I'll call that out in the changelog, i.e. that this
is really just for internal consistency.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-27  9:02   ` Yan Zhao
@ 2025-08-27 19:08     ` Sean Christopherson
  2025-08-28  3:13       ` Edgecombe, Rick P
                         ` (2 more replies)
  2025-08-28 15:28     ` Ira Weiny
  1 sibling, 3 replies; 85+ messages in thread
From: Sean Christopherson @ 2025-08-27 19:08 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Paolo Bonzini, kvm, linux-kernel, Michael Roth, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

On Wed, Aug 27, 2025, Yan Zhao wrote:
> On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote:
> > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> >  		return -EIO;
> >  
> >  	/*
> > -	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
> > -	 * barrier in tdx_td_finalize().
> > +	 * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory()
> > +	 * before kvm_tdx->state.  Userspace must not be allowed to pre-fault
> > +	 * arbitrary memory until the initial memory image is finalized.  Pairs
> > +	 * with the smp_wmb() in tdx_td_finalize().
> >  	 */
> >  	smp_rmb();
> > -	if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
> > -		return tdx_mem_page_aug(kvm, gfn, level, pfn);
> >  
> > -	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
> > +	/*
> > +	 * If the TD isn't finalized/runnable, then userspace is initializing
> > +	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Increment the number of
> > +	 * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD
> > +	 * requires a pre-existing S-EPT mapping).  KVM_TDX_FINALIZE_VM checks
> > +	 * the counter to ensure all mapped pages have been added to the image,
> > +	 * to prevent running the TD with uninitialized memory.
> To prevent the mismatch between mirror EPT and the S-EPT?

No?  Because KVM bumps the count when installing the S-EPT and decrements it
on AUG, so I don't see how nr_premapped guards against M-EPT vs. S-EPT issues?

> e.g., Before KVM_TDX_FINALIZE_VM, if userspace performs a zap after the
> TDH.MEM.PAGE.ADD, the page will be removed from the S-EPT. The count of
> nr_premapped will not change after the successful TDH.MEM.RANGE.BLOCK and
> TDH.MEM.PAGE.REMOVE.

Eww.  It would be nice to close that hole, but I suppose it's futile, e.g. the
underlying problem is unexpectedly removing pages from the initial image; whether the
VMM is doing stupid things before vs. after FINALIZE doesn't really matter.

> As a result, the TD will still run with uninitialized memory.

No?  Because BLOCK+REMOVE means there are no valid S-EPT mappings.  There's a
"hole" that the guest might not expect, but that hole will trigger an EPT
violation and only get "filled" if the guest explicitly accepts an AUG'd page.

Side topic, why does KVM tolerate tdh_mem_page_add() failure?  IIUC, playing
nice with tdh_mem_page_add() failure necessitates both the
tdx_is_sept_zap_err_due_to_premap() craziness and the check in tdx_td_finalize()
that all pending pages have been consumed.

What reasonable use case is there for gracefully handling tdh_mem_page_add() failure?

If there is a need to handle failure, I gotta imagine it's only for the -EBUSY
case.  And if it's only for -EBUSY, why can't that be handled by retrying in
tdx_vcpu_init_mem_region()?  If tdx_vcpu_init_mem_region() guarantees that all
pages mapped into the S-EPT are ADDed, then it can assert that there are no
pending pages when it completes (even if it "fails"), and similarly
tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being
non-zero.
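
Something like this, maybe (completely untested, just to illustrate the idea;
it assumes the post-populate callback's -EBUSY is propagated back out of
kvm_gmem_populate(), and the call shape / "region" / "arg" handling is copied
loosely from memory of the existing tdx_vcpu_init_mem_region() loop):

	while (region.nr_pages) {
		if (signal_pending(current)) {
			ret = -EINTR;
			break;
		}

		ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
					u64_to_user_ptr(region.source_addr),
					1, tdx_gmem_post_populate, &arg);
		/* Retry instead of bailing if PAGE.ADD hit S-EPT contention. */
		if (ret == -EBUSY)
			continue;
		if (ret < 0)
			break;

		region.source_addr += PAGE_SIZE;
		region.gpa += PAGE_SIZE;
		region.nr_pages--;
		ret = 0;
	}

	/* Every mapping created above must have been consumed by PAGE.ADD. */
	KVM_BUG_ON(atomic64_read(&to_kvm_tdx(kvm)->nr_premapped), kvm);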

> > +	 */
> > +	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) {
> > +		if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
> > +			return -EIO;
> > +
> > +		atomic64_inc(&kvm_tdx->nr_premapped);
> > +		return 0;
> > +	}
> > +
> > +	return tdx_mem_page_aug(kvm, gfn, level, pfn);
> >  }
> >  
> >  static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > -- 
> > 2.51.0.268.g9569e192d0-goog
> > 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 05/12] KVM: TDX: Drop superfluous page pinning in S-EPT management
  2025-08-27  0:05 ` [RFC PATCH 05/12] KVM: TDX: Drop superfluous page pinning in S-EPT management Sean Christopherson
  2025-08-27  8:33   ` Yan Zhao
@ 2025-08-28  0:36   ` Ira Weiny
  2025-08-28  7:08     ` Yan Zhao
  2025-08-28  2:45   ` Huang, Kai
  2 siblings, 1 reply; 85+ messages in thread
From: Ira Weiny @ 2025-08-28  0:36 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Michael Roth, Yan Zhao, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

Sean Christopherson wrote:
> Don't explicitly pin pages when mapping pages into the S-EPT, guest_memfd
> doesn't support page migration in any capacity, i.e. there are no migrate
> callbacks because guest_memfd pages *can't* be migrated.  See the WARN in
> kvm_gmem_migrate_folio().

I like the fact this removes a poorly named function tdx_unpin() as well.

That said, concerning gmem tracking page reference, I have some questions.
In the TDX.PAGE.AUG path, [via kvm_gmem_get_pfn()] gmem takes a folio
reference whereas the TDX.PAGE.ADD path [via kvm_gmem_populate()] does not
take a folio reference.

Why are these paths different?

For this patch.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 28 ++++------------------------
>  1 file changed, 4 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 1724d82c8512..9fb6e5f02cc9 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1586,29 +1586,22 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
>  	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
>  }
>  
> -static void tdx_unpin(struct kvm *kvm, struct page *page)
> -{
> -	put_page(page);
> -}
> -
>  static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> -			    enum pg_level level, struct page *page)
> +			    enum pg_level level, kvm_pfn_t pfn)
>  {
>  	int tdx_level = pg_level_to_tdx_sept_level(level);
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	struct page *page = pfn_to_page(pfn);
>  	gpa_t gpa = gfn_to_gpa(gfn);
>  	u64 entry, level_state;
>  	u64 err;
>  
>  	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
> -	if (unlikely(tdx_operand_busy(err))) {
> -		tdx_unpin(kvm, page);
> +	if (unlikely(tdx_operand_busy(err)))
>  		return -EBUSY;
> -	}
>  
>  	if (KVM_BUG_ON(err, kvm)) {
>  		pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
> -		tdx_unpin(kvm, page);
>  		return -EIO;
>  	}
>  
> @@ -1642,29 +1635,18 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>  				     enum pg_level level, kvm_pfn_t pfn)
>  {
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> -	struct page *page = pfn_to_page(pfn);
>  
>  	/* TODO: handle large pages. */
>  	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
>  		return -EINVAL;
>  
> -	/*
> -	 * Because guest_memfd doesn't support page migration with
> -	 * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
> -	 * migration.  Until guest_memfd supports page migration, prevent page
> -	 * migration.
> -	 * TODO: Once guest_memfd introduces callback on page migration,
> -	 * implement it and remove get_page/put_page().
> -	 */
> -	get_page(page);
> -
>  	/*
>  	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
>  	 * barrier in tdx_td_finalize().
>  	 */
>  	smp_rmb();
>  	if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
> -		return tdx_mem_page_aug(kvm, gfn, level, page);
> +		return tdx_mem_page_aug(kvm, gfn, level, pfn);
>  
>  	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
>  }
> @@ -1715,7 +1697,6 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
>  		return -EIO;
>  	}
>  	tdx_clear_page(page);
> -	tdx_unpin(kvm, page);
>  	return 0;
>  }
>  
> @@ -1795,7 +1776,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>  	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
>  	    !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
>  		atomic64_dec(&kvm_tdx->nr_premapped);
> -		tdx_unpin(kvm, page);
>  		return 0;
>  	}
>  
> -- 
> 2.51.0.268.g9569e192d0-goog
> 



^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 01/12] KVM: TDX: Drop PROVE_MMU=y sanity check on to-be-populated mappings
  2025-08-27  0:05 ` [RFC PATCH 01/12] KVM: TDX: Drop PROVE_MMU=y sanity check on to-be-populated mappings Sean Christopherson
  2025-08-27  8:14   ` Yan Zhao
@ 2025-08-28  0:37   ` Ira Weiny
  2025-08-28  2:13   ` Huang, Kai
  2 siblings, 0 replies; 85+ messages in thread
From: Ira Weiny @ 2025-08-28  0:37 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Michael Roth, Yan Zhao, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

Sean Christopherson wrote:
> Drop TDX's sanity check that an S-EPT mapping isn't zapped between creating
> said mapping and doing TDH.MEM.PAGE.ADD, as the check is simultaneously
> superfluous and incomplete.  Per commit 2608f1057601 ("KVM: x86/tdp_mmu:
> Add a helper function to walk down the TDP MMU"), the justification for
> introducing kvm_tdp_mmu_gpa_is_mapped() was to check that the target gfn
> was pre-populated, with a link that points to this snippet:
> 
>  : > One small question:
>  : >
>  : > What if the memory region passed to KVM_TDX_INIT_MEM_REGION hasn't been pre-
>  : > populated?  If we want to make KVM_TDX_INIT_MEM_REGION work with these regions,
>  : > then we still need to do the real map.  Or we can make KVM_TDX_INIT_MEM_REGION
>  : > return error when it finds the region hasn't been pre-populated?
>  :
>  : Return an error.  I don't love the idea of bleeding so many TDX details into
>  : userspace, but I'm pretty sure that ship sailed a long, long time ago.
> 
> But that justification makes little sense for the final code, as simply
> doing TDH.MEM.PAGE.ADD without a paranoid sanity check will return an error
> if the S-EPT mapping is invalid (as evidenced by the code being guarded
> with CONFIG_KVM_PROVE_MMU=y).
> 
> The sanity check is also incomplete in the sense that mmu_lock is dropped
> between the check and TDH.MEM.PAGE.ADD, i.e. will only detect KVM bugs that
> zap SPTEs in a very specific window.
> 
> Removing the sanity check will allow removing kvm_tdp_mmu_gpa_is_mapped(),
> which has no business being exposed to vendor code.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

[snip]

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 02/12] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU
  2025-08-27  0:05 ` [RFC PATCH 02/12] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU Sean Christopherson
  2025-08-27  8:25   ` Yan Zhao
@ 2025-08-28  0:40   ` Ira Weiny
  2025-08-28  1:51     ` Edgecombe, Rick P
  1 sibling, 1 reply; 85+ messages in thread
From: Ira Weiny @ 2025-08-28  0:40 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Michael Roth, Yan Zhao, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

Sean Christopherson wrote:

[snip]

> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6e838cb6c9e1..d3625e00baf9 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4990,6 +4990,65 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>  	return min(range->size, end - range->gpa);
>  }
>  
> +int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> +{
> +	struct kvm_page_fault fault = {
> +		.addr = gfn_to_gpa(gfn),
> +		.error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS,
> +		.prefetch = true,
> +		.is_tdp = true,
> +		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),
> +
> +		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
> +		.req_level = PG_LEVEL_4K,
> +		.goal_level = PG_LEVEL_4K,
> +		.is_private = true,
> +
> +		.gfn = gfn,
> +		.slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn),
> +		.pfn = pfn,
> +		.map_writable = true,

Why is map_writable set?  Doesn't this get translated into host_writable?

Ira

[snip]

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 02/12] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU
  2025-08-27  8:25   ` Yan Zhao
@ 2025-08-28  0:54     ` Edgecombe, Rick P
  2025-08-28  1:26       ` Edgecombe, Rick P
  2025-08-28  6:55       ` Yan Zhao
  0 siblings, 2 replies; 85+ messages in thread
From: Edgecombe, Rick P @ 2025-08-28  0:54 UTC (permalink / raw)
  To: seanjc@google.com, Zhao, Yan Y
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira

On Wed, 2025-08-27 at 16:25 +0800, Yan Zhao wrote:
> > +{
> > +	struct kvm_page_fault fault = {
> > +		.addr = gfn_to_gpa(gfn),
> > +		.error_code = PFERR_GUEST_FINAL_MASK |
> > PFERR_PRIVATE_ACCESS,
> > +		.prefetch = true,
> > +		.is_tdp = true,
> > +		.nx_huge_page_workaround_enabled =
> > is_nx_huge_page_enabled(vcpu->kvm),
> > +
> > +		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
> Looks like the kvm_tdp_mmu_map_private_pfn() is only for initial memory mapping,
> given that ".prefetch = true" and RET_PF_SPURIOUS is not a valid return value.

Hmm, what are you referring to regarding RET_PF_SPURIOUS?

> 
> Then, what about setting
>                 .max_level = PG_LEVEL_4K,
> directly?
> 
> Otherwise, the "(KVM_BUG_ON(level != PG_LEVEL_4K, kvm)" would be triggered in
> tdx_sept_set_private_spte().

Yes this fails to boot a TD. With max_level = PG_LEVEL_4K it passes the full
tests. I don't think it's ideal to encode PAGE.ADD details here though.

But I'm not immediately clear what is going wrong. The old struct kvm_page_fault
looks pretty similar. Did you root cause it?



^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 02/12] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU
  2025-08-28  0:54     ` Edgecombe, Rick P
@ 2025-08-28  1:26       ` Edgecombe, Rick P
  2025-08-28  6:23         ` Yan Zhao
  2025-08-28  6:55       ` Yan Zhao
  1 sibling, 1 reply; 85+ messages in thread
From: Edgecombe, Rick P @ 2025-08-28  1:26 UTC (permalink / raw)
  To: seanjc@google.com, Zhao, Yan Y
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira

On Wed, 2025-08-27 at 17:54 -0700, Rick Edgecombe wrote:
> > 
> > Then, what about setting
> >                 .max_level = PG_LEVEL_4K,
> > directly?
> > 
> > Otherwise, the "(KVM_BUG_ON(level != PG_LEVEL_4K, kvm)" would be triggered
> > in
> > tdx_sept_set_private_spte().
> 
> Yes this fails to boot a TD. With max_level = PG_LEVEL_4K it passes the full
> tests. I don't think it's ideal to encode PAGE.ADD details here though.
> 
> But I'm not immediately clear what is going wrong. The old struct
> kvm_page_fault
> looks pretty similar. Did you root cause it?

Oh, duh. Because we are passing in the PFN now so it can't know the size. So
it's not about PAGE.ADD actually.

Still, how about calling the function kvm_tdp_mmu_map_private_pfn_4k(), or
passing in the level?
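
Something like this, maybe (untested; the initializer is just patch 2's with
.max_level swapped out, and the mapping loop below it would stay the same):

	int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
					kvm_pfn_t pfn, u8 max_level)
	{
		struct kvm_page_fault fault = {
			.addr = gfn_to_gpa(gfn),
			.error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS,
			.prefetch = true,
			.is_tdp = true,
			.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),

			.max_level = max_level,
			.req_level = PG_LEVEL_4K,
			.goal_level = PG_LEVEL_4K,
			.is_private = true,

			.gfn = gfn,
			.slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn),
			.pfn = pfn,
			.map_writable = true,
		};

		/* ... retry loop from patch 2, unchanged ... */
	}

with tdx_gmem_post_populate() passing PG_LEVEL_4K explicitly until large page
support shows up.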

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 02/12] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU
  2025-08-28  0:40   ` Ira Weiny
@ 2025-08-28  1:51     ` Edgecombe, Rick P
  2025-08-28 19:57       ` Sean Christopherson
  0 siblings, 1 reply; 85+ messages in thread
From: Edgecombe, Rick P @ 2025-08-28  1:51 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com, Weiny, Ira
  Cc: kvm@vger.kernel.org, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, michael.roth@amd.com

On Wed, 2025-08-27 at 19:40 -0500, Ira Weiny wrote:
> > +		.map_writable = true,
> 
> Why is map_writable set?  Doesn't this get translated into host_writable?

I guess it's normally set only if it's a !KVM_MEM_READONLY slot for faults on
private memory. But that flag is invalid for gmem. So we should only have
map_writable=true cases for tdx.

Hypothetically this function should work for non-gmem. I guess since it's
exported, a comment could be nice to specify that the memslots are not
consulted. There are many MMU details that are not commented though, so it's
probably too much given that the struct is right there to look at what kind of
fault it is.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 04/12] KVM: x86/mmu: Rename kvm_tdp_map_page() to kvm_tdp_prefault_page()
  2025-08-27  0:05 ` [RFC PATCH 04/12] KVM: x86/mmu: Rename kvm_tdp_map_page() to kvm_tdp_prefault_page() Sean Christopherson
@ 2025-08-28  2:01   ` Edgecombe, Rick P
  2025-08-28 18:50     ` Sean Christopherson
  0 siblings, 1 reply; 85+ messages in thread
From: Edgecombe, Rick P @ 2025-08-28  2:01 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com
  Cc: kvm@vger.kernel.org, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, michael.roth@amd.com,
	Weiny, Ira

On Tue, 2025-08-26 at 17:05 -0700, Sean Christopherson wrote:
> Rename kvm_tdp_map_page() to kvm_tdp_prefault_page() now that it's used
> only by kvm_arch_vcpu_pre_fault_memory().
> 
> No functional change intended.

I realize you are just trying to do map->prefault here, but "page" seems
redundant once you have "prefault" in the name. Why page here vs all the other
fault handler functions without it?

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 05/12] KVM: TDX: Drop superfluous page pinning in S-EPT management
  2025-08-27  8:33   ` Yan Zhao
@ 2025-08-28  2:05     ` Edgecombe, Rick P
  2025-08-28 20:16       ` Sean Christopherson
  0 siblings, 1 reply; 85+ messages in thread
From: Edgecombe, Rick P @ 2025-08-28  2:05 UTC (permalink / raw)
  To: seanjc@google.com, Zhao, Yan Y
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira

On Wed, 2025-08-27 at 16:33 +0800, Yan Zhao wrote:
> On Tue, Aug 26, 2025 at 05:05:15PM -0700, Sean Christopherson wrote:
> > Don't explicitly pin pages when mapping pages into the S-EPT, guest_memfd
> > doesn't support page migration in any capacity, i.e. there are no migrate
> > callbacks because guest_memfd pages *can't* be migrated.  See the WARN in
> > kvm_gmem_migrate_folio().
> Hmm, we implemented exactly the same patch at [1], where we explained the
> potential problems of not holding page refcount, and the explored various
> approaches, and related considerations.
> 
> [1] https://lore.kernel.org/all/20250807094241.4523-1-yan.y.zhao@intel.com/

Yea, so the outcome of the huge page related discussion was that we should look
at some sort of emergency page reclaim feature for the TDX module to use in the
case of bugs. But in the meantime, to move forward without it, we can use a
solution like the one in this patch.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 06/12] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition
  2025-08-27  0:05 ` [RFC PATCH 06/12] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition Sean Christopherson
  2025-08-27  8:39   ` Yan Zhao
@ 2025-08-28  2:11   ` Edgecombe, Rick P
  2025-08-28 19:21     ` Sean Christopherson
  2025-08-28 15:03   ` Ira Weiny
  2 siblings, 1 reply; 85+ messages in thread
From: Edgecombe, Rick P @ 2025-08-28  2:11 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com
  Cc: kvm@vger.kernel.org, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, michael.roth@amd.com,
	Weiny, Ira

On Tue, 2025-08-26 at 17:05 -0700, Sean Christopherson wrote:
> Return -EIO when a KVM_BUG_ON() is tripped, as KVM's ABI is to return -EIO
> when a VM has been killed due to a KVM bug, not -EINVAL.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 9fb6e5f02cc9..ef4ffcad131f 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1624,7 +1624,7 @@ static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>  
>  	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
> -		return -EINVAL;
> +		return -EIO;
>  
>  	/* nr_premapped will be decreased when tdh_mem_page_add() is called. */
>  	atomic64_inc(&kvm_tdx->nr_premapped);
> @@ -1638,7 +1638,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>  
>  	/* TODO: handle large pages. */
>  	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> -		return -EINVAL;
> +		return -EIO;
>  
>  	/*
>  	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
> @@ -1849,7 +1849,7 @@ static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
>  	 * and slot move/deletion.
>  	 */
>  	if (KVM_BUG_ON(is_hkid_assigned(kvm_tdx), kvm))
> -		return -EINVAL;
> +		return -EIO;
>  
>  	/*
>  	 * The HKID assigned to this TD was already freed and cache was
> @@ -1870,7 +1870,7 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
>  	 * there can't be anything populated in the private EPT.
>  	 */
>  	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
> -		return -EINVAL;
> +		return -EIO;
>  
>  	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
>  	if (ret <= 0)


Did you miss?
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f9ac590e8ff0..fd1b8fea55a9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1656,10 +1656,10 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm,
gfn_t gfn,
 
        /* TODO: handle large pages. */
        if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
-               return -EINVAL;
+               return -EIO;
 
        if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
-               return -EINVAL;
+               return -EIO;
 
        /*
         * When zapping private page, write lock is held. So no race condition


We really have a lot of KVM_BUG_ON()s in tdx code. I hesitate to pull them out
but it feels a bit gratuitous.

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 01/12] KVM: TDX: Drop PROVE_MMU=y sanity check on to-be-populated mappings
  2025-08-27  0:05 ` [RFC PATCH 01/12] KVM: TDX: Drop PROVE_MMU=y sanity check on to-be-populated mappings Sean Christopherson
  2025-08-27  8:14   ` Yan Zhao
  2025-08-28  0:37   ` Ira Weiny
@ 2025-08-28  2:13   ` Huang, Kai
  2 siblings, 0 replies; 85+ messages in thread
From: Huang, Kai @ 2025-08-28  2:13 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com
  Cc: Edgecombe, Rick P, kvm@vger.kernel.org, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, michael.roth@amd.com,
	Weiny, Ira

On Tue, 2025-08-26 at 17:05 -0700, Sean Christopherson wrote:
> Drop TDX's sanity check that an S-EPT mapping isn't zapped between creating
> said mapping and doing TDH.MEM.PAGE.ADD, as the check is simultaneously
> superfluous and incomplete.  Per commit 2608f1057601 ("KVM: x86/tdp_mmu:
> Add a helper function to walk down the TDP MMU"), the justification for
> introducing kvm_tdp_mmu_gpa_is_mapped() was to check that the target gfn
> was pre-populated, with a link that points to this snippet:
> 
>  : > One small question:
>  : >
>  : > What if the memory region passed to KVM_TDX_INIT_MEM_REGION hasn't been pre-
>  : > populated?  If we want to make KVM_TDX_INIT_MEM_REGION work with these regions,
>  : > then we still need to do the real map.  Or we can make KVM_TDX_INIT_MEM_REGION
>  : > return error when it finds the region hasn't been pre-populated?
>  :
>  : Return an error.  I don't love the idea of bleeding so many TDX details into
>  : userspace, but I'm pretty sure that ship sailed a long, long time ago.
> 
> But that justification makes little sense for the final code, as simply
> doing TDH.MEM.PAGE.ADD without a paranoid sanity check will return an error
> if the S-EPT mapping is invalid (as evidenced by the code being guarded
> with CONFIG_KVM_PROVE_MMU=y).
> 
> The sanity check is also incomplete in the sense that mmu_lock is dropped
> between the check and TDH.MEM.PAGE.ADD, i.e. will only detect KVM bugs that
> zap SPTEs in a very specific window.
> 
> Removing the sanity check will allow removing kvm_tdp_mmu_gpa_is_mapped(),
> which has no business being exposed to vendor code.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

I guess I asked that small question :-)

Reviewed-by: Kai Huang <kai.huang@intel.com>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 07/12] KVM: TDX: Avoid a double-KVM_BUG_ON() in tdx_sept_zap_private_spte()
  2025-08-27  0:05 ` [RFC PATCH 07/12] KVM: TDX: Avoid a double-KVM_BUG_ON() in tdx_sept_zap_private_spte() Sean Christopherson
@ 2025-08-28  2:19   ` Edgecombe, Rick P
  2025-08-28 14:50     ` Edgecombe, Rick P
  2025-08-28 15:02   ` Ira Weiny
  1 sibling, 1 reply; 85+ messages in thread
From: Edgecombe, Rick P @ 2025-08-28  2:19 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com
  Cc: kvm@vger.kernel.org, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, michael.roth@amd.com,
	Weiny, Ira

On Tue, 2025-08-26 at 17:05 -0700, Sean Christopherson wrote:
> Return -EIO immediately from tdx_sept_zap_private_spte() if the number of
> to-be-added pages underflows, so that the following "KVM_BUG_ON(err, kvm)"
> isn't also triggered.  Isolating the check from the "is premap error"
> if-statement will also allow adding a lockdep assertion that premap errors
> are encountered if and only if slots_lock is held.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 05/12] KVM: TDX: Drop superfluous page pinning in S-EPT management
  2025-08-27  0:05 ` [RFC PATCH 05/12] KVM: TDX: Drop superfluous page pinning in S-EPT management Sean Christopherson
  2025-08-27  8:33   ` Yan Zhao
  2025-08-28  0:36   ` Ira Weiny
@ 2025-08-28  2:45   ` Huang, Kai
  2 siblings, 0 replies; 85+ messages in thread
From: Huang, Kai @ 2025-08-28  2:45 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com
  Cc: Edgecombe, Rick P, kvm@vger.kernel.org, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, michael.roth@amd.com,
	Weiny, Ira

On Tue, 2025-08-26 at 17:05 -0700, Sean Christopherson wrote:
> Don't explicitly pin pages when mapping pages into the S-EPT, guest_memfd
> doesn't support page migration in any capacity, i.e. there are no migrate
> callbacks because guest_memfd pages *can't* be migrated.  See the WARN in
> kvm_gmem_migrate_folio().
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> 

Reviewed-by: Kai Huang <kai.huang@intel.com>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 08/12] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent
  2025-08-27  0:05 ` [RFC PATCH 08/12] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent Sean Christopherson
@ 2025-08-28  2:56   ` Edgecombe, Rick P
  2025-08-28  6:48     ` Yan Zhao
  2025-08-28 15:03   ` Ira Weiny
  1 sibling, 1 reply; 85+ messages in thread
From: Edgecombe, Rick P @ 2025-08-28  2:56 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com
  Cc: kvm@vger.kernel.org, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, michael.roth@amd.com,
	Weiny, Ira

On Tue, 2025-08-26 at 17:05 -0700, Sean Christopherson wrote:
> Use atomic64_dec_return() when decrementing the number of "pre-mapped"
> S-EPT pages to ensure that the count can't go negative without KVM
> noticing.  In theory, checking for '0' and then decrementing in a separate
> operation could miss a 0=>-1 transition.  In practice, such a condition is
> impossible because nr_premapped is protected by slots_lock, i.e. doesn't
> actually need to be an atomic (that wart will be addressed shortly).
> 
> Don't bother trying to keep the count non-negative, as the KVM_BUG_ON()
> ensures the VM is dead, i.e. there's no point in trying to limp along.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

This area has gone through a lot of designs. In the v19 era PAGE.ADD got
performed deep inside the fault by stuffing the source page in the vCPU. Then we
switched to having userspace call KVM_PRE_FAULT_MEMORY manually to pre-populate
the mirror EPT, and then have TDX code look up the PFN. Then nearer the end, we
switched to current code which does something like KVM_PRE_FAULT_MEMORY
internally, then looks up what got faulted and does the PAGE.ADD. Then the
version in this series which does it even more directly.

nr_premapped got added during the KVM_PRE_FAULT_MEMORY era. I personally didn't
like it, but it was needed because userspace could do unexpected things. Now it
seems like its only purpose is to generate a KVM_BUG_ON() in
tdx_sept_zap_private_spte(). I wonder if we could drop it altogether and
accept less KVM_BUG_ON() coverage. It seems weird to focus in on this specific
error case.

Yan, am I missing something?

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-27 19:08     ` Sean Christopherson
@ 2025-08-28  3:13       ` Edgecombe, Rick P
  2025-08-28  5:56         ` Yan Zhao
  2025-08-28  5:43       ` Yan Zhao
  2025-08-28 15:30       ` Ira Weiny
  2 siblings, 1 reply; 85+ messages in thread
From: Edgecombe, Rick P @ 2025-08-28  3:13 UTC (permalink / raw)
  To: seanjc@google.com, Zhao, Yan Y
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira

On Wed, 2025-08-27 at 12:08 -0700, Sean Christopherson wrote:
> > e.g., Before KVM_TDX_FINALIZE_VM, if userspace performs a zap after the
> > TDH.MEM.PAGE.ADD, the page will be removed from the S-EPT. The count of
> > nr_premapped will not change after the successful TDH.MEM.RANGE.BLOCK and
> > TDH.MEM.PAGE.REMOVE.
> 
> Eww.  It would be nice to close that hole, but I suppose it's futile, e.g. the
> underlying problem is unexpectedly removing pages from the initial, whether
> the
> VMM is doing stupid things before vs. after FINALIZE doesn't really matter.
> 
> > As a result, the TD will still run with uninitialized memory.
> 
> No?  Because BLOCK+REMOVE means there are no valid S-EPT mappings.  There's a
> "hole" that the guest might not expect, but that hole will trigger an EPT
> violation and only get "filled" if the guest explicitly accepts an AUG'd page.

Ah, I just responded on another patch. I wonder if we can get rid of the premap
cnt.

> 
> Side topic, why does KVM tolerate tdh_mem_page_add() failure?  IIUC, playing
> nice with tdh_mem_page_add() failure necessitates both the
> tdx_is_sept_zap_err_due_to_premap() craziness and the check in
> tdx_td_finalize()
> that all pending pages have been consumed.

Reasons that tdh_mem_page_add() could get BUSY:
1. If two vCPUs tried to tdh_mem_page_add() the same gpa at the same time, they
could contend on the SEPT entry lock
2. If one vCPU tries to tdh_mem_page_add() while the other zaps (i.e.
tdh_mem_range_block()).

I guess since we don't hold MMU lock while we tdh_mem_page_add(), 2 is a
possibility.

> 
> What reasonable use case is there for gracefully handling tdh_mem_page_add()
> failure?
> 
> If there is a need to handle failure, I gotta imagine it's only for the -EBUSY
> case.  And if it's only for -EBUSY, why can't that be handled by retrying in
> tdx_vcpu_init_mem_region()?  If tdx_vcpu_init_mem_region() guarantees that all
> pages mapped into the S-EPT are ADDed, then it can assert that there are no
> pending pages when it completes (even if it "fails"), and similarly
> tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being
> non-zero.

Maybe we could take mmu write lock for the retry of tdh_mem_page_add(). Or maybe
even for a single call of it, until someone wants to parallelize the operation. 
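
I.e. something like the below in tdx_gmem_post_populate() (untested sketch, and
the tdh_mem_page_add() argument list is from memory, so don't trust it):

	/* Do PAGE.ADD under mmu_lock held for write so it can't race with a zap. */
	write_lock(&kvm->mmu_lock);
	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn), src_page,
			       &entry, &level_state);
	write_unlock(&kvm->mmu_lock);

	if (KVM_BUG_ON(err, kvm)) {
		pr_tdx_error_2(TDH_MEM_PAGE_ADD, err, entry, level_state);
		ret = -EIO;
		goto out;
	}

With mmu_lock held for write, a concurrent zap can't trigger the -EBUSY case,
so any error could just be treated as a KVM bug.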

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-27 19:08     ` Sean Christopherson
  2025-08-28  3:13       ` Edgecombe, Rick P
@ 2025-08-28  5:43       ` Yan Zhao
  2025-08-28 17:00         ` Sean Christopherson
  2025-08-28 15:30       ` Ira Weiny
  2 siblings, 1 reply; 85+ messages in thread
From: Yan Zhao @ 2025-08-28  5:43 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Michael Roth, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

On Wed, Aug 27, 2025 at 12:08:27PM -0700, Sean Christopherson wrote:
> On Wed, Aug 27, 2025, Yan Zhao wrote:
> > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote:
> > > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> > >  		return -EIO;
> > >  
> > >  	/*
> > > -	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
> > > -	 * barrier in tdx_td_finalize().
> > > +	 * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory()
> > > +	 * before kvm_tdx->state.  Userspace must not be allowed to pre-fault
> > > +	 * arbitrary memory until the initial memory image is finalized.  Pairs
> > > +	 * with the smp_wmb() in tdx_td_finalize().
> > >  	 */
> > >  	smp_rmb();
> > > -	if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
> > > -		return tdx_mem_page_aug(kvm, gfn, level, pfn);
> > >  
> > > -	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
> > > +	/*
> > > +	 * If the TD isn't finalized/runnable, then userspace is initializing
> > > +	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Increment the number of
> > > +	 * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD
> > > +	 * requires a pre-existing S-EPT mapping).  KVM_TDX_FINALIZE_VM checks
> > > +	 * the counter to ensure all mapped pages have been added to the image,
> > > +	 * to prevent running the TD with uninitialized memory.
> > To prevent the mismatch between mirror EPT and the S-EPT?
> 
> No?  Because KVM bumps the count when installing the S-EPT and decrements it
> on AUG, so I don't see how nr_premapped guards against M-EPT vs. S-EPT issues?
Hmm, I think there must be some misunderstanding.

Before userspace invokes KVM_TDX_FINALIZE_VM,
=======
1. the normal path (userspace invokes KVM_TDX_INIT_MEM_REGION).
   (1) KVM holds slot_lock and filemap lock.
   (2) KVM invokes kvm_tdp_map_page() (or kvm_tdp_mmu_map_private_pfn() in
       patch 2).
       KVM increases nr_premapped in tdx_sept_set_private_spte() to indicate
       that there's a page mapped in M-EPT, while it's not yet installed in
       S-EPT.
   (3) KVM invokes TDH.MEM.PAGE.ADD and decreases nr_premapped, indicating the
       page has been mapped in S-EPT too.
       
   As the name of nr_premapped indicates, the count means a page is pre-mapped
   in the M-EPT, before its real mapping in the S-EPT.
   If ADD fails in step (3), nr_premapped will not be decreased.

   With just the normal path, nr_premapped should return to 0 after all
   KVM_TDX_INIT_MEM_REGIONs.
      

2. Expected zap paths (e.g. If userspace does something strange, such as
   removing a slot after KVM_TDX_INIT_MEM_REGION)
   Those zap paths could be triggered by
   1) userspace performs a page attribute conversion
   2) userspace invokes gmem punch hole
   3) userspace removes a slot
   As all those paths either hold a slot_lock or a filemap lock, they can't
   contend with tdx_vcpu_init_mem_region() (tdx_vcpu_init_mem_region() holds
   both slot_lock and, internally, the filemap lock).
   Consequently, those zaps must occur
   a) before kvm_tdp_map_page() or
   b) after TDH.MEM.PAGE.ADD.
   For a), tdx_sept_zap_private_spte() won't be invoked as the page is not
           mapped in M-EPT either;
   For b), tdx_sept_zap_private_spte() should succeed, as the BLOCK and REMOVE
           SEAMCALLs follow the ADD.
   nr_premapped is therefore unchanged, since the zap does not change the
   consistency between M-EPT and S-EPT.

3. Unexpected zaps (such as kvm_zap_gfn_range()).
   Those zaps are currently just paranoid ones, not found in any existing paths
   yet. i.e., we want to detect any future code or any missed code pieces that
   invoke kvm_zap_gfn_range() (or maybe zap under read mmu_lock).

   As those zaps do not necessarily hold slot_lock or filemap lock, they may
   occur after installing M-EPT and before installing S-EPT.
   As a result, the BLOCK fails and tdx_is_sept_zap_err_due_to_premap() returns
   true.
   nr_premapped is decreased here to indicate that the count of pages mapped in
   the M-EPT but not in the S-EPT has gone down.

   TDH.MEM.PAGE.ADD after this zap can still succeed. If this occurs, the page
   will be mapped in S-EPT only. As KVM also decreases nr_premapped after a
   successful TDH.MEM.PAGE.ADD, the nr_premapped will be <0 in the end.
   So, we will be able to detect those unexpected zaps.
   

When userspace invokes KVM_TDX_FINALIZE_VM,
=======
The nr_premapped must be 0 before tdx_td_finalize() succeeds.

The nr_premapped could be 0 if
(1) userspace invokes KVM_TDX_INIT_MEM_REGIONs as in a normal way.
(2) userspace never triggers any KVM_TDX_INIT_MEM_REGION.
(3) userspace triggers KVM_TDX_INIT_MEM_REGION but zaps all initial memory
    regions.

For (2) and (3), KVM_TDX_FINALIZE_VM can still succeed. So, the TD can still run with
uninitialized memory.
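
To make the accounting above concrete, here's a toy userspace model of the
counter transitions (illustrative only, no KVM/TDX code involved):

#include <stdio.h>

static long nr_premapped;

static void map_mirror(void)     { nr_premapped++; } /* step (2): mapped in M-EPT       */
static void page_add(void)       { nr_premapped--; } /* step (3): TDH.MEM.PAGE.ADD done */
static void unexpected_zap(void) { nr_premapped--; } /* BLOCK found a premapped GFN     */

int main(void)
{
	/* Normal path: the count returns to zero. */
	map_mirror();
	page_add();

	/*
	 * Unexpected zap between the M-EPT map and PAGE.ADD, with PAGE.ADD
	 * still succeeding afterwards: the double decrement leaves the count
	 * negative, which KVM_TDX_FINALIZE_VM (or a KVM_BUG_ON) can detect.
	 */
	map_mirror();
	unexpected_zap();
	page_add();

	printf("nr_premapped = %ld (%s)\n", nr_premapped,
	       nr_premapped ? "mismatch detected" : "consistent");
	return 0;
}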

> > e.g., Before KVM_TDX_FINALIZE_VM, if userspace performs a zap after the
> > TDH.MEM.PAGE.ADD, the page will be removed from the S-EPT. The count of
> > nr_premapped will not change after the successful TDH.MEM.RANGE.BLOCK and
> > TDH.MEM.PAGE.REMOVE.
> 
> Eww.  It would be nice to close that hole, but I suppose it's futile, e.g. the
> underlying problem is unexpectedly removing pages from the initial image, whether the
> VMM is doing stupid things before vs. after FINALIZE doesn't really matter.
Are you referring to the above "case 2 Expected zap paths"?

It's equivalent to userspace never triggering any KVM_TDX_INIT_MEM_REGION.
We can't force userspace to invoke KVM_TDX_INIT_MEM_REGION after all.

I don't think there's a hole from the guest point of view. See below.

> > As a result, the TD will still run with uninitialized memory.
> 
> No?  Because BLOCK+REMOVE means there are no valid S-EPT mappings.  There's a
> "hole" that the guest might not expect, but that hole will trigger an EPT
> violation and only get "filled" if the guest explicitly accepts an AUG'd page.

If the TD runs with uninitialized memory,
- for a Linux guest, it will cause the TD to access unaccepted memory and get
  killed by KVM;
- for a non-Linux guest configured with #VE, the guest will see a #VE and
  be informed that the page must be accepted before access. Though the guest
  should not be able to run without any initial code, there's not any
  security problem.


> Side topic, why does KVM tolerate tdh_mem_page_add() failure?  IIUC, playing
We don't. It returns -EBUSY or -EIO immediately.

> nice with tdh_mem_page_add() failure necessitates both the
> tdx_is_sept_zap_err_due_to_premap() craziness and the check in tdx_td_finalize()
> that all pending pages have been consumed.

tdx_is_sept_zap_err_due_to_premap() detects the error of BLOCK, which is caused
by executing BLOCK before ADD.

> What reasonable use case is there for gracefully handling tdh_mem_page_add() failure?
If tdh_mem_page_add() fails, the KVM_TDX_INIT_MEM_REGION just fails.

> If there is a need to handle failure, I gotta imagine it's only for the -EBUSY
> case.  And if it's only for -EBUSY, why can't that be handled by retrying in
> tdx_vcpu_init_mem_region()?  If tdx_vcpu_init_mem_region() guarantees that all
I analyzed the contention status of tdh_mem_sept_add() at
https://lore.kernel.org/kvm/20250113021050.18828-1-yan.y.zhao@intel.com.

As userspace is expected to execute KVM_TDX_INIT_MEM_REGION on only one
vCPU, returning -EBUSY instead of retrying looks safer and easier.

> pages mapped into the S-EPT are ADDed, then it can assert that there are no
> pending pages when it completes (even if it "fails"), and similarly
> tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being
> non-zero.
tdx_td_finalize() now just returns -EINVAL in case nr_premapped is non-zero.
KVM_BUG_ON/WARN_ON should also be ok.

> > > +	 */
> > > +	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) {
> > > +		if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
> > > +			return -EIO;
> > > +
> > > +		atomic64_inc(&kvm_tdx->nr_premapped);
> > > +		return 0;
> > > +	}
> > > +
> > > +	return tdx_mem_page_aug(kvm, gfn, level, pfn);
> > >  }
> > >  
> > >  static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > > -- 
> > > 2.51.0.268.g9569e192d0-goog
> > > 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-28  3:13       ` Edgecombe, Rick P
@ 2025-08-28  5:56         ` Yan Zhao
  2025-08-28 19:08           ` Edgecombe, Rick P
  0 siblings, 1 reply; 85+ messages in thread
From: Yan Zhao @ 2025-08-28  5:56 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: seanjc@google.com, kvm@vger.kernel.org, pbonzini@redhat.com,
	Annapurve, Vishal, linux-kernel@vger.kernel.org,
	michael.roth@amd.com, Weiny, Ira

On Thu, Aug 28, 2025 at 11:13:11AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2025-08-27 at 12:08 -0700, Sean Christopherson wrote:
> > > e.g., Before KVM_TDX_FINALIZE_VM, if userspace performs a zap after the
> > > TDH.MEM.PAGE.ADD, the page will be removed from the S-EPT. The count of
> > > nr_premapped will not change after the successful TDH.MEM.RANGE.BLOCK and
> > > TDH.MEM.PAGE.REMOVE.
> > 
> > Eww.  It would be nice to close that hole, but I suppose it's futile, e.g. the
> > underlying problem is unexpectedly removing pages from the initial image, whether the
> > VMM is doing stupid things before vs. after FINALIZE doesn't really matter.
> > 
> > > As a result, the TD will still run with uninitialized memory.
> > 
> > No?  Because BLOCK+REMOVE means there are no valid S-EPT mappings.  There's a
> > "hole" that the guest might not expect, but that hole will trigger an EPT
> > violation and only get "filled" if the guest explicitly accepts an AUG'd page.
> 
> Ah, I just responded on another patch. I wonder if we can get rid of the premap
> cnt.

I think keeping it is safer.
See my explanation at [1].

[1] https://lore.kernel.org/all/aK%2Fsdr2OQqYv9DBZ@yzhao56-desk.sh.intel.com.

> > 
> > Side topic, why does KVM tolerate tdh_mem_page_add() failure?  IIUC, playing
> > nice with tdh_mem_page_add() failure necessitates both the
> > tdx_is_sept_zap_err_due_to_premap() craziness and the check in tdx_td_finalize()
> > that all pending pages have been consumed.
> 
> Reasons that tdh_mem_page_add() could get BUSY:
> 1. If two vCPUs tried to tdh_mem_page_add() the same gpa at the same time, they
> could contend on the SEPT entry lock.
> 2. If one vCPU tries to tdh_mem_page_add() while the other zaps (i.e.
> tdh_mem_range_block()).
Hmm, two tdh_mem_page_add()s can't contend as they are protected by both
slot_lock and filemap lock.

With regard to the contention to tdh_mem_range_block(), please check my analysis
at the above [1].

tdh_mem_page_add() could get BUSY though, when a misbehaved userspace invokes
KVM_TDX_INIT_MEM_REGION on one vCPU while initializing another vCPU.

Please check more details at [2].

[2] https://lore.kernel.org/kvm/20250113021050.18828-1-yan.y.zhao@intel.com/


> I guess since we don't hold MMU lock while we tdh_mem_page_add(), 2 is a
> possibility.
2 is possible only for paranoid zaps.
See "case 3. Unexpected zaps" in [1].


> > What reasonable use case is there for gracefully handling tdh_mem_page_add()
> > failure?
> > 
> > If there is a need to handle failure, I gotta imagine it's only for the -EBUSY
> > case.  And if it's only for -EBUSY, why can't that be handled by retrying in
> > tdx_vcpu_init_mem_region()?  If tdx_vcpu_init_mem_region() guarantees that all
> > pages mapped into the S-EPT are ADDed, then it can assert that there are no
> > pending pages when it completes (even if it "fails"), and similarly
> > tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being
> > non-zero.
> 
> Maybe we could take mmu write lock for the retry of tdh_mem_page_add(). Or maybe
> even for a single call of it, until someone wants to parallelize the operation.
Hmm. I prefer returning -EBUSY directly, as invoking KVM_TDX_INIT_MEM_REGION
before finishing initializing all vCPUs is uncommon.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 02/12] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU
  2025-08-28  1:26       ` Edgecombe, Rick P
@ 2025-08-28  6:23         ` Yan Zhao
  2025-08-28 19:40           ` Sean Christopherson
  0 siblings, 1 reply; 85+ messages in thread
From: Yan Zhao @ 2025-08-28  6:23 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: seanjc@google.com, kvm@vger.kernel.org, pbonzini@redhat.com,
	Annapurve, Vishal, linux-kernel@vger.kernel.org,
	michael.roth@amd.com, Weiny, Ira

On Thu, Aug 28, 2025 at 09:26:50AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2025-08-27 at 17:54 -0700, Rick Edgecombe wrote:
> > > 
> > > Then, what about setting
> > >                 .max_level = PG_LEVEL_4K,
> > > directly?
> > > 
> > > Otherwise, the "(KVM_BUG_ON(level != PG_LEVEL_4K, kvm)" would be triggered
> > > in
> > > tdx_sept_set_private_spte().
> > 
> > Yes this fails to boot a TD. With max_level = PG_LEVEL_4K it passes the full
> > tests. I don't think it's ideal to encode PAGE.ADD details here though.
> > 
> > But I'm not immediately clear what is going wrong. The old struct kvm_page_fault
> > looks pretty similar. Did you root cause it?
>
> Oh, duh. Because we are passing in the PFN now so it can't know the size. So
> it's not about PAGE.ADD actually.
Right, it's because the previous kvm_tdp_map_page() updates fault->max_level in
kvm_mmu_faultin_pfn_private() by checking the private_max_mapping_level hook.

However, kvm_tdp_mmu_map_private_pfn() skips the faultin step and goes straight
to kvm_tdp_mmu_map().

> Still, how about calling the function kvm_tdp_mmu_map_private_pfn_4k(), or
> passing in the level?
Looks [1] can also address this issue. Not sure which one Sean prefers.

[1] https://lore.kernel.org/all/20250729225455.670324-15-seanjc@google.com
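
FWIW, a minimal sketch of the "pass in the level" option (the wrapper name and
the extra parameter are assumptions, not the real patch):

/* 4K-only wrapper, matching what TDH.MEM.PAGE.ADD can accept today. */
static int kvm_tdp_mmu_map_private_pfn_4k(struct kvm_vcpu *vcpu, gfn_t gfn,
					  kvm_pfn_t pfn)
{
	/* Assumes the underlying helper grows a max_level parameter. */
	return kvm_tdp_mmu_map_private_pfn(vcpu, gfn, pfn, PG_LEVEL_4K);
}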

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 08/12] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent
  2025-08-28  2:56   ` Edgecombe, Rick P
@ 2025-08-28  6:48     ` Yan Zhao
  2025-08-28 19:14       ` Edgecombe, Rick P
  0 siblings, 1 reply; 85+ messages in thread
From: Yan Zhao @ 2025-08-28  6:48 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org,
	Annapurve, Vishal, linux-kernel@vger.kernel.org,
	michael.roth@amd.com, Weiny, Ira

On Thu, Aug 28, 2025 at 10:56:18AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-08-26 at 17:05 -0700, Sean Christopherson wrote:
> > Use atomic64_dec_return() when decrementing the number of "pre-mapped"
> > S-EPT pages to ensure that the count can't go negative without KVM
> > noticing.  In theory, checking for '0' and then decrementing in a separate
> > operation could miss a 0=>-1 transition.  In practice, such a condition is
> > impossible because nr_premapped is protected by slots_lock, i.e. doesn't
> > actually need to be an atomic (that wart will be addressed shortly).
> > 
> > Don't bother trying to keep the count non-negative, as the KVM_BUG_ON()
> > ensures the VM is dead, i.e. there's no point in trying to limp along.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> 
> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> This area has gone through a lot of designs. In the v19 era PAGE.ADD got
> performed deep inside the fault by stuffing the source page in the vCPU. Then we
> switched to having userspace call KVM_PRE_FAULT_MEMORY manually to pre-populate
> the mirror EPT, and then have TDX code look up the PFN. Then nearer the end, we
> switched to current code which does something like KVM_PRE_FAULT_MEMORY
> internally, then looks up what got faulted and does the PAGE.ADD. Then the
> version in this series which does it even more directly.
Right.
If we invoke PAGE.ADD directly in tdx_sept_set_private_spte() (similar to
PAGE.AUG), then we'll have to have some way to pass in the source page info.

So, rather than passing around the source page, we opted to record the count
of pages mapped in M-EPT while still unmapped in S-EPT, i.e.,

1. map a page in M-EPT
2. increase nr_premapped.
3. map the page in S-EPT
4. decrease nr_premapped.

If a page is zapped in the M-EPT before 3, decrease nr_premapped. If 3 is
still executed successfully after the zap of the M-EPT, decrease nr_premapped
too. The resulting imbalance of nr_premapped due to the double decrease
indicates a mismatch between the M-EPT and the S-EPT.
If 3 never comes or fails, it's ok.
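
With the atomic64_dec_return() form from this patch, the double decrease would
be caught at the point of the decrement, roughly (reconstructed from the
changelog, may not match the posted hunk exactly):

	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
		/* Underflow here means an unexpected zap raced with PAGE.ADD. */
		if (KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm))
			return -EIO;
		return 0;
	}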


> nr_premapped got added during the KVM_PRE_FAULT_MEMORY era. I personally didn't
> like it, but it was needed because userspace could do unexpected things. Now it
> seems like its only purpose is to generate a KVM_BUG_ON() in
> tdx_sept_zap_private_spte(). I wonder if we could drop it all together and
> accept less KVM_BUG_ON() coverage. It seems weird to focus in on this specific
> error case.
> 
> Yan, am I missing something?
Hmm, I still think it's safer to keep the nr_premapped to detect any unexpected
code change.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 02/12] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU
  2025-08-28  0:54     ` Edgecombe, Rick P
  2025-08-28  1:26       ` Edgecombe, Rick P
@ 2025-08-28  6:55       ` Yan Zhao
  1 sibling, 0 replies; 85+ messages in thread
From: Yan Zhao @ 2025-08-28  6:55 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: seanjc@google.com, kvm@vger.kernel.org, pbonzini@redhat.com,
	Annapurve, Vishal, linux-kernel@vger.kernel.org,
	michael.roth@amd.com, Weiny, Ira

On Thu, Aug 28, 2025 at 08:54:48AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2025-08-27 at 16:25 +0800, Yan Zhao wrote:
> > > +{
> > > +	struct kvm_page_fault fault = {
> > > +		.addr = gfn_to_gpa(gfn),
> > > +		.error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS,
> > > +		.prefetch = true,
> > > +		.is_tdp = true,
> > > +		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),
> > > +
> > > +		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
> > Looks the kvm_tdp_mmu_map_private_pfn() is only for initial memory mapping,
> > given that ".prefetch = true" and RET_PF_SPURIOUS is not a valid return value.
> 
> Hmm, what are you referring to regarding RET_PF_SPURIOUS?
If kvm_tdp_mmu_map_private_pfn() can also be invoked after the initial memory
mapping stage, RET_PF_SPURIOUS is a valid return case.

But in this patch, only RET_PF_RETRY and RET_PF_FIXED are valid.
So, I think it's expected to be invoked only during the initial memory mapping stage :)
 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 05/12] KVM: TDX: Drop superfluous page pinning in S-EPT management
  2025-08-28  0:36   ` Ira Weiny
@ 2025-08-28  7:08     ` Yan Zhao
  2025-08-28 15:54       ` Ira Weiny
  0 siblings, 1 reply; 85+ messages in thread
From: Yan Zhao @ 2025-08-28  7:08 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel,
	Michael Roth, Vishal Annapurve, Rick Edgecombe

On Wed, Aug 27, 2025 at 07:36:46PM -0500, Ira Weiny wrote:
> Sean Christopherson wrote:
> > Don't explicitly pin pages when mapping pages into the S-EPT, guest_memfd
> > doesn't support page migration in any capacity, i.e. there are no migrate
> > callbacks because guest_memfd pages *can't* be migrated.  See the WARN in
> > kvm_gmem_migrate_folio().
> 
> I like the fact this removes a poorly named function tdx_unpin() as well.
> 
> That said, concerning gmem tracking page reference, I have some questions.
> In the TDX.PAGE.AUG path, [via kvm_gmem_get_pfn()] gmem takes a folio
kvm_mmu_finish_page_fault() will decrease the folio refcount.

> reference whereas the TDX.PAGE.ADD path [via kvm_gmem_populate()] does not
> take a folio reference.
> 
> Why are these paths different?
> 
> For this patch.
> 
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> 
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kvm/vmx/tdx.c | 28 ++++------------------------
> >  1 file changed, 4 insertions(+), 24 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 1724d82c8512..9fb6e5f02cc9 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1586,29 +1586,22 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
> >  	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
> >  }
> >  
> > -static void tdx_unpin(struct kvm *kvm, struct page *page)
> > -{
> > -	put_page(page);
> > -}
> > -
> >  static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> > -			    enum pg_level level, struct page *page)
> > +			    enum pg_level level, kvm_pfn_t pfn)
> >  {
> >  	int tdx_level = pg_level_to_tdx_sept_level(level);
> >  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > +	struct page *page = pfn_to_page(pfn);
> >  	gpa_t gpa = gfn_to_gpa(gfn);
> >  	u64 entry, level_state;
> >  	u64 err;
> >  
> >  	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
> > -	if (unlikely(tdx_operand_busy(err))) {
> > -		tdx_unpin(kvm, page);
> > +	if (unlikely(tdx_operand_busy(err)))
> >  		return -EBUSY;
> > -	}
> >  
> >  	if (KVM_BUG_ON(err, kvm)) {
> >  		pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
> > -		tdx_unpin(kvm, page);
> >  		return -EIO;
> >  	}
> >  
> > @@ -1642,29 +1635,18 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> >  				     enum pg_level level, kvm_pfn_t pfn)
> >  {
> >  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > -	struct page *page = pfn_to_page(pfn);
> >  
> >  	/* TODO: handle large pages. */
> >  	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> >  		return -EINVAL;
> >  
> > -	/*
> > -	 * Because guest_memfd doesn't support page migration with
> > -	 * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
> > -	 * migration.  Until guest_memfd supports page migration, prevent page
> > -	 * migration.
> > -	 * TODO: Once guest_memfd introduces callback on page migration,
> > -	 * implement it and remove get_page/put_page().
> > -	 */
> > -	get_page(page);
> > -
> >  	/*
> >  	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
> >  	 * barrier in tdx_td_finalize().
> >  	 */
> >  	smp_rmb();
> >  	if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
> > -		return tdx_mem_page_aug(kvm, gfn, level, page);
> > +		return tdx_mem_page_aug(kvm, gfn, level, pfn);
> >  
> >  	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
> >  }
> > @@ -1715,7 +1697,6 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> >  		return -EIO;
> >  	}
> >  	tdx_clear_page(page);
> > -	tdx_unpin(kvm, page);
> >  	return 0;
> >  }
> >  
> > @@ -1795,7 +1776,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
> >  	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
> >  	    !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
> >  		atomic64_dec(&kvm_tdx->nr_premapped);
> > -		tdx_unpin(kvm, page);
> >  		return 0;
> >  	}
> >  
> > -- 
> > 2.51.0.268.g9569e192d0-goog
> > 
> 
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 07/12] KVM: TDX: Avoid a double-KVM_BUG_ON() in tdx_sept_zap_private_spte()
  2025-08-28  2:19   ` Edgecombe, Rick P
@ 2025-08-28 14:50     ` Edgecombe, Rick P
  2025-08-29  1:10       ` Yan Zhao
  0 siblings, 1 reply; 85+ messages in thread
From: Edgecombe, Rick P @ 2025-08-28 14:50 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com
  Cc: kvm@vger.kernel.org, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, michael.roth@amd.com,
	Weiny, Ira

On Wed, 2025-08-27 at 19:19 -0700, Rick Edgecombe wrote:
> On Tue, 2025-08-26 at 17:05 -0700, Sean Christopherson wrote:
> > Return -EIO immediately from tdx_sept_zap_private_spte() if the number of
> > to-be-added pages underflows, so that the following "KVM_BUG_ON(err, kvm)"
> > isn't also triggered.  Isolating the check from the "is premap error"
> > if-statement will also allow adding a lockdep assertion that premap errors
> > are encountered if and only if slots_lock is held.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> 
> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

There is actually another KVM_BUG_ON() in the path here:

static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
				 int level)
{
	kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
	int ret;

	/*
	 * External (TDX) SPTEs are limited to PG_LEVEL_4K, and external
	 * PTs are removed in a special order, involving free_external_spt().
	 * But remove_external_spte() will be called on non-leaf PTEs via
	 * __tdp_mmu_zap_root(), so avoid the error the former would return
	 * in this case.
	 */
	if (!is_last_spte(old_spte, level))
		return;

	/* Zapping leaf spte is allowed only when write lock is held. */
	lockdep_assert_held_write(&kvm->mmu_lock);
	/* Because write lock is held, operation should success. */
	ret = kvm_x86_call(remove_external_spte)(kvm, gfn, level, old_pfn);
->	KVM_BUG_ON(ret, kvm);

We don't need to do it in this patch, but we could remove the return value in
.remove_external_spte, and the KVM_BUG_ON(). Just let remove_external_spte
handle it internally.
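
Roughly, something like this (sketch only, not a real patch; assumes the
.remove_external_spte hook is changed to return void and the TDX side does its
own KVM_BUG_ON() on SEAMCALL failure):

static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
				 int level)
{
	kvm_pfn_t old_pfn = spte_to_pfn(old_spte);

	/* Non-leaf PTEs are handled via free_external_spt(), see above. */
	if (!is_last_spte(old_spte, level))
		return;

	/* Zapping leaf sptes is allowed only when write lock is held. */
	lockdep_assert_held_write(&kvm->mmu_lock);

	/* The TDX callback would bug the VM itself if the SEAMCALL fails. */
	kvm_x86_call(remove_external_spte)(kvm, gfn, level, old_pfn);
}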

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 07/12] KVM: TDX: Avoid a double-KVM_BUG_ON() in tdx_sept_zap_private_spte()
  2025-08-27  0:05 ` [RFC PATCH 07/12] KVM: TDX: Avoid a double-KVM_BUG_ON() in tdx_sept_zap_private_spte() Sean Christopherson
  2025-08-28  2:19   ` Edgecombe, Rick P
@ 2025-08-28 15:02   ` Ira Weiny
  1 sibling, 0 replies; 85+ messages in thread
From: Ira Weiny @ 2025-08-28 15:02 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Michael Roth, Yan Zhao, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

Sean Christopherson wrote:
> Return -EIO immediately from tdx_sept_zap_private_spte() if the number of
> to-be-added pages underflows, so that the following "KVM_BUG_ON(err, kvm)"
> isn't also triggered.  Isolating the check from the "is premap error"
> if-statement will also allow adding a lockdep assertion that premap errors
> are encountered if and only if slots_lock is held.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index ef4ffcad131f..88079e2d45fb 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1773,8 +1773,10 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>  		err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
>  		tdx_no_vcpus_enter_stop(kvm);
>  	}
> -	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
> -	    !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
> +	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
> +		if (KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm))
> +			return -EIO;

Won't this -EIO cause the KVM_BUG_ON() in remove_external_spte() to fire too?

static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
                                 int level)
{
	...
	ret = kvm_x86_call(remove_external_spte)(kvm, gfn, level, old_pfn);
	KVM_BUG_ON(ret, kvm);
}


This patch is better than 3 KVM_BUG_ON()s, but wouldn't it be better to make both
KVM_BUG_ON()s internal errors or debug output?

Something like this:

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 4920ee8ad773..83065f3fe605 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1774,14 +1774,16 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
                tdx_no_vcpus_enter_stop(kvm);
        }
        if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
-               if (KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm))
+               if (!atomic64_read(&kvm_tdx->nr_premapped)) {
+                       pr_err("nr_premapped underflow\n");
                        return -EIO;
+               }
 
                atomic64_dec(&kvm_tdx->nr_premapped);
                return 0;
        }
 
-       if (KVM_BUG_ON(err, kvm)) {
+       if (err) {
                pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
                return -EIO;
        }

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 08/12] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent
  2025-08-27  0:05 ` [RFC PATCH 08/12] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent Sean Christopherson
  2025-08-28  2:56   ` Edgecombe, Rick P
@ 2025-08-28 15:03   ` Ira Weiny
  1 sibling, 0 replies; 85+ messages in thread
From: Ira Weiny @ 2025-08-28 15:03 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Michael Roth, Yan Zhao, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

Sean Christopherson wrote:
> Use atomic64_dec_return() when decrementing the number of "pre-mapped"
> S-EPT pages to ensure that the count can't go negative without KVM
> noticing.  In theory, checking for '0' and then decrementing in a separate
> operation could miss a 0=>-1 transition.  In practice, such a condition is
> impossible because nr_premapped is protected by slots_lock, i.e. doesn't
> actually need to be an atomic (that wart will be addressed shortly).
> 
> Don't bother trying to keep the count non-negative, as the KVM_BUG_ON()
> ensures the VM is dead, i.e. there's no point in trying to limp along.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

[snip]

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 06/12] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition
  2025-08-27  0:05 ` [RFC PATCH 06/12] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition Sean Christopherson
  2025-08-27  8:39   ` Yan Zhao
  2025-08-28  2:11   ` Edgecombe, Rick P
@ 2025-08-28 15:03   ` Ira Weiny
  2 siblings, 0 replies; 85+ messages in thread
From: Ira Weiny @ 2025-08-28 15:03 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Michael Roth, Yan Zhao, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

Sean Christopherson wrote:
> Return -EIO when a KVM_BUG_ON() is tripped, as KVM's ABI is to return -EIO
> when a VM has been killed due to a KVM bug, not -EINVAL.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

[snip]

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 12/12] KVM: TDX: Rename nr_premapped to nr_pending_tdh_mem_page_adds
  2025-08-27  0:05 ` [RFC PATCH 12/12] KVM: TDX: Rename nr_premapped to nr_pending_tdh_mem_page_adds Sean Christopherson
  2025-08-27  9:22   ` Yan Zhao
@ 2025-08-28 15:23   ` Ira Weiny
  1 sibling, 0 replies; 85+ messages in thread
From: Ira Weiny @ 2025-08-28 15:23 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Michael Roth, Yan Zhao, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

Sean Christopherson wrote:
> Rename "nr_premapped" to an asurdly verbose "nr_pending_tdh_mem_page_adds"
> to make it explicitly clear what the counter tracks.  "pre-map" is far
> too similar to "pre-fault", especially since tdx_sept_set_private_spte()
> deals with both "pre_fault_allowed" and the counter.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

[snip]

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-27  9:02   ` Yan Zhao
  2025-08-27 19:08     ` Sean Christopherson
@ 2025-08-28 15:28     ` Ira Weiny
  1 sibling, 0 replies; 85+ messages in thread
From: Ira Weiny @ 2025-08-28 15:28 UTC (permalink / raw)
  To: Yan Zhao, Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Michael Roth, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

Yan Zhao wrote:
> On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote:

[snip]

> > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> >  		return -EIO;
> >  
> >  	/*
> > -	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
> > -	 * barrier in tdx_td_finalize().
> > +	 * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory()
> > +	 * before kvm_tdx->state.  Userspace must not be allowed to pre-fault
> > +	 * arbitrary memory until the initial memory image is finalized.  Pairs
> > +	 * with the smp_wmb() in tdx_td_finalize().
> >  	 */
> >  	smp_rmb();
> > -	if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
> > -		return tdx_mem_page_aug(kvm, gfn, level, pfn);
> >  
> > -	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
> > +	/*
> > +	 * If the TD isn't finalized/runnable, then userspace is initializing
> > +	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Increment the number of
> > +	 * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD
> > +	 * requires a pre-existing S-EPT mapping).  KVM_TDX_FINALIZE_VM checks
> > +	 * the counter to ensure all mapped pages have been added to the image,
> > +	 * to prevent running the TD with uninitialized memory.
> To prevent the mismatch between mirror EPT and the S-EPT?
> 
> e.g., Before KVM_TDX_FINALIZE_VM,
> if userspace performs a zap after the TDH.MEM.PAGE.ADD, the page will be removed
> from the S-EPT. The count of nr_premapped will not change after the successful
> TDH.MEM.RANGE.BLOCK and TDH.MEM.PAGE.REMOVE.
> 
> As a result, the TD will still run with uninitialized memory.

I'm wondering if we are trying to over-architect this.

Should we even allow KVM_TDX_FINALIZE_VM to race with
KVM_TDX_INIT_MEM_REGION?  What is the use case for that?

It seems a basic sanity check/KVM_BUG_ON would suffice to tell the user:
don't start adding memory dynamically until y'all have finalized the VM.

Ira

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-27 19:08     ` Sean Christopherson
  2025-08-28  3:13       ` Edgecombe, Rick P
  2025-08-28  5:43       ` Yan Zhao
@ 2025-08-28 15:30       ` Ira Weiny
  2 siblings, 0 replies; 85+ messages in thread
From: Ira Weiny @ 2025-08-28 15:30 UTC (permalink / raw)
  To: Sean Christopherson, Yan Zhao
  Cc: Paolo Bonzini, kvm, linux-kernel, Michael Roth, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

Sean Christopherson wrote:
> On Wed, Aug 27, 2025, Yan Zhao wrote:
> > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote:
> > > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> > >  		return -EIO;

[snip]

> > >  	/*
> > > -	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
> > > -	 * barrier in tdx_td_finalize().
> > > +	 * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory()
> > > +	 * before kvm_tdx->state.  Userspace must not be allowed to pre-fault
> > > +	 * arbitrary memory until the initial memory image is finalized.  Pairs
> > > +	 * with the smp_wmb() in tdx_td_finalize().
> > >  	 */
> > >  	smp_rmb();
> > > -	if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
> > > -		return tdx_mem_page_aug(kvm, gfn, level, pfn);
> > >  
> > > -	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
> > > +	/*
> > > +	 * If the TD isn't finalized/runnable, then userspace is initializing
> > > +	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Increment the number of
> > > +	 * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD
> > > +	 * requires a pre-existing S-EPT mapping).  KVM_TDX_FINALIZE_VM checks
> > > +	 * the counter to ensure all mapped pages have been added to the image,
> > > +	 * to prevent running the TD with uninitialized memory.
> > To prevent the mismatch between mirror EPT and the S-EPT?
> 
> No?  Because KVM bumps the count when installing the S-EPT and decrements it
> on AUG, so I don't see how nr_premapped guards against M-EPT vs. S-EPT issues?
> 
> > e.g., Before KVM_TDX_FINALIZE_VM, if userspace performs a zap after the
> > TDH.MEM.PAGE.ADD, the page will be removed from the S-EPT. The count of
> > nr_premapped will not change after the successful TDH.MEM.RANGE.BLOCK and
> > TDH.MEM.PAGE.REMOVE.
> 
> Eww.  It would be nice to close that hole, but I suppose it's futile, e.g. the
> underlying problem is unexpectedly removing pages from the initial image, whether the
> VMM is doing stupid things before vs. after FINALIZE doesn't really matter.
> 
> > As a result, the TD will still run with uninitialized memory.
> 
> No?  Because BLOCK+REMOVE means there are no valid S-EPT mappings.  There's a
> "hole" that the guest might not expect, but that hole will trigger an EPT
> violation and only get "filled" if the guest explicitly accepts an AUG'd page.
> 
> Side topic, why does KVM tolerate tdh_mem_page_add() failure?  IIUC, playing
> nice with tdh_mem_page_add() failure necessitates both the
> tdx_is_sept_zap_err_due_to_premap() craziness and the check in tdx_td_finalize()
> that all pending pages have been consumed.
> 
> What reasonable use case is there for gracefully handling tdh_mem_page_add() failure?
> 
> If there is a need to handle failure, I gotta imagine it's only for the -EBUSY
> case.  And if it's only for -EBUSY, why can't that be handled by retrying in
> tdx_vcpu_init_mem_region()?  If tdx_vcpu_init_mem_region() guarantees that all
> pages mapped into the S-EPT are ADDed, then it can assert that there are no
> pending pages when it completes (even if it "fails"), and similarly
> tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being
> non-zero.

Ah just reading this...  yea I'm wondering the same thing.

Ira

[snip]

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 05/12] KVM: TDX: Drop superfluous page pinning in S-EPT management
  2025-08-28  7:08     ` Yan Zhao
@ 2025-08-28 15:54       ` Ira Weiny
  0 siblings, 0 replies; 85+ messages in thread
From: Ira Weiny @ 2025-08-28 15:54 UTC (permalink / raw)
  To: Yan Zhao, Ira Weiny
  Cc: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel,
	Michael Roth, Vishal Annapurve, Rick Edgecombe

Yan Zhao wrote:
> On Wed, Aug 27, 2025 at 07:36:46PM -0500, Ira Weiny wrote:
> > Sean Christopherson wrote:
> > > Don't explicitly pin pages when mapping pages into the S-EPT, guest_memfd
> > > doesn't support page migration in any capacity, i.e. there are no migrate
> > > callbacks because guest_memfd pages *can't* be migrated.  See the WARN in
> > > kvm_gmem_migrate_folio().
> > 
> > I like the fact this removes a poorly named function tdx_unpin() as well.
> > 
> > That said, concerning gmem tracking page reference, I have some questions.
> > In the TDX.PAGE.AUG path, [via kvm_gmem_get_pfn()] gmem takes a folio
> kvm_mmu_finish_page_fault() will decrease the folio refcount.

Thanks,
Ira

> 
> > reference whereas the TDX.PAGE.ADD path [via kvm_gmem_populate()] does not
> > take a folio reference.
> > 
> > Why are these paths different?
> > 

[snip]

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-28  5:43       ` Yan Zhao
@ 2025-08-28 17:00         ` Sean Christopherson
  2025-08-28 18:52           ` Edgecombe, Rick P
  2025-08-29  2:31           ` Yan Zhao
  0 siblings, 2 replies; 85+ messages in thread
From: Sean Christopherson @ 2025-08-28 17:00 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Paolo Bonzini, kvm, linux-kernel, Michael Roth, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

On Thu, Aug 28, 2025, Yan Zhao wrote:
> On Wed, Aug 27, 2025 at 12:08:27PM -0700, Sean Christopherson wrote:
> > On Wed, Aug 27, 2025, Yan Zhao wrote:
> > > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote:
> > > > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> > > > +	/*
> > > > +	 * If the TD isn't finalized/runnable, then userspace is initializing
> > > > +	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Increment the number of
> > > > +	 * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD
> > > > +	 * requires a pre-existing S-EPT mapping).  KVM_TDX_FINALIZE_VM checks
> > > > +	 * the counter to ensure all mapped pages have been added to the image,
> > > > +	 * to prevent running the TD with uninitialized memory.
> > > To prevent the mismatch between mirror EPT and the S-EPT?
> > 
> > No?  Because KVM bumps the count when installing the S-EPT and decrements it
> > on AUG, so I don't see how nr_premapped guards against M-EPT vs. S-EPT issues?
> Hmm, I think there must be some misunderstanding.

Yeah, I forgot that AUG and ADD create the leaf S-EPT entries.

> Before userspace invokes KVM_TDX_FINALIZE_VM,
> =======
> 1. the normal path (userspace invokes KVM_TDX_INIT_MEM_REGION).
>    (1) KVM holds slot_lock and filemap lock.
>    (2) KVM invokes kvm_tdp_map_page() (or kvm_tdp_mmu_map_private_pfn() in
>        patch 2).
>        KVM increases nr_premapped in tdx_sept_set_private_spte() to indicate
>        that there's a page mapped in M-EPT, while it's not yet installed in
>        S-EPT.
>    (3) KVM invokes TDH.MEM.PAGE.ADD and decreases nr_premapped, indicating the
>        page has been mapped in S-EPT too.
>        
>    As the name of nr_premapped indicates, the count means a page is pre-mapped
>    in the M-EPT, before its real mapping in the S-EPT.
>    If ADD fails in step (3), nr_premapped will not be decreased.
> 
>    With just the normal path, nr_premapped should return to 0 after all
>    KVM_TDX_INIT_MEM_REGIONs.
>       
> 
> 2. Expected zap paths (e.g. If userspace does something strange, such as
>    removing a slot after KVM_TDX_INIT_MEM_REGION)
>    Those zap paths could be triggered by
>    1) userspace performs a page attribute conversion
>    2) userspace invokes gmem punch hole
>    3) userspace removes a slot
>    As all those paths either hold a slot_lock or a filemap lock, they can't
>    contend with tdx_vcpu_init_mem_region() (tdx_vcpu_init_mem_region holds both
>    slot_lock and internally filemap lock).
>    Consequently, those zaps must occur
>    a) before kvm_tdp_map_page() or
>    b) after TDH.MEM.PAGE.ADD.
>    For a), tdx_sept_zap_private_spte() won't be invoked as the page is not
>            mapped in M-EPT either;
>    For b), tdx_sept_zap_private_spte() should succeed, as the BLOCK and REMOVE
>            SEAMCALLs are following the ADD.
>    nr_premapped is therefore unchanged, since it does not change the consistency
>    between M-EPT and S-EPT.
> 
> 3. Unexpected zaps (such as kvm_zap_gfn_range()).

Side topic related to kvm_zap_gfn_range(), the KVM_BUG_ON() in vt_refresh_apicv_exec_ctrl()
is flawed.  If kvm_recalculate_apic_map() fails to allocate an optimized map, KVM
will mark APICv as inhibited, i.e. the associated WARN_ON_ONCE() is effectively
user-triggerable.

Easiest thing would be to mark the vCPU as dead (though we obviously need
"KVM: Never clear KVM_REQ_VM_DEAD from a vCPU's requests" for that to be robust).

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index dbab1c15b0cd..1c0b43ff9544 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -719,7 +719,8 @@ static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
 static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 {
        if (is_td_vcpu(vcpu)) {
-               KVM_BUG_ON(!kvm_vcpu_apicv_active(vcpu), vcpu->kvm);
+               if (!kvm_vcpu_apicv_active(vcpu))
+                       kvm_make_request(KVM_REQ_VM_DEAD, vcpu);
                return;
        }
 
>    Those zaps are currently just paranoid ones. Not found in any existing paths
>    yet. i.e.,
>    We want to detect any future code or any missed code pieces, which invokes
>    kvm_zap_gfn_range() (or maybe zaps under read mmu_lock).
> 
>    As those zaps do not necessarily hold slot_lock or filemap lock, they may
>    occur after installing M-EPT and before installing S-EPT.
>    As a result, the BLOCK fails and tdx_is_sept_zap_err_due_to_premap() returns
>    true.
>    Decreasing nr_premapped here to indicate the count of pages mapped in M-EPT
>    but not in S-EPT decreases.
> 
>    TDH.MEM.PAGE.ADD after this zap can still succeed. If this occurs, the page
>    will be mapped in S-EPT only. As KVM also decreases nr_premapped after a
>    successful TDH.MEM.PAGE.ADD, the nr_premapped will be <0 in the end.
>    So, we will be able to detect those unexpected zaps.
>    
> 
> When userspace invokes KVM_TDX_FINALIZE_VM,
> =======
> The nr_premapped must be 0 before tdx_td_finalize() succeeds.
> 
> The nr_premapped could be 0 if
> (1) userspace invokes KVM_TDX_INIT_MEM_REGIONs as in a normal way.
> (2) userspace never triggers any KVM_TDX_INIT_MEM_REGION.
> (3) userspace triggers KVM_TDX_INIT_MEM_REGION but zaps all initial memory
>     regions.
> 
> For (2)and(3), KVM_TDX_FINALIZE_VM can still succeed.

Ya, we're in agreement on what can happen.  I think all of the confusion was due
to me forgetting that TDH.MEM.PAGE.ADD is what actually installs the leaf S-EPT
entry.

> So, TD can still run with uninitialized memory.

No, the TD can never run with truly uninitialized memory.  By "uninitialized", I
mean memory that the guest can access and which has not been written to.  Again,
my confusion was due to thinking a page was already mapped into the guest, but
awaiting TDH.MEM.PAGE.ADD to actually initialize it.
 
> > Side topic, why does KVM tolerate tdh_mem_page_add() failure?  IIUC, playing
> We don't. It returns -EBUSY or -EIO immediately.

But that _is_ tolerating failure, in the sense that KVM doesn't prevent further
actions on the VM.  Tolerating failure is fine in general, but in this case it
leaves the MMU in a half-baked state.

> > nice with tdh_mem_page_add() failure necessitates both the
> > tdx_is_sept_zap_err_due_to_premap() craziness and the check in tdx_td_finalize()
> > that all pending pages have been consumed.
> 
> tdx_is_sept_zap_err_due_to_premap() detects the error of BLOCK, which is caused
> by executing BLOCK before ADD.

We need to make this situation impossible.

> > What reasonable use case is there for gracefully handling tdh_mem_page_add() failure?
> If tdh_mem_page_add() fails, the KVM_TDX_INIT_MEM_REGION just fails.
> 
> > If there is a need to handle failure, I gotta imagine it's only for the -EBUSY
> > case.  And if it's only for -EBUSY, why can't that be handled by retrying in
> > tdx_vcpu_init_mem_region()?  If tdx_vcpu_init_mem_region() guarantees that all
> I analyzed the contention status of tdh_mem_sept_add() at
> https://lore.kernel.org/kvm/20250113021050.18828-1-yan.y.zhao@intel.com.
> 
> As the userspace is expected to execute KVM_TDX_INIT_MEM_REGION in only one
> vCPU, returning -EBUSY instead of retrying looks safer and easier.
> 
> > pages mapped into the S-EPT are ADDed, then it can assert that there are no
> > pending pages when it completes (even if it "fails"), and similarly
> > tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being
> > non-zero.
> tdx_td_finalize() now just returns -EINVAL in case of nr_premapped being !0.
> KVM_BUG_ON/WARN_ON should be also ok.

Ok, so I vaguely recall that I may have pushed back on using a scratch field in
"struct kvm_tdx" for temporary data (or maybe it was abusing vcpus[0] that I
disliked?), but what we ended up with is far worse.

For all intents and purposes, nr_premapped _is_ a temporary scratch field, but
with access rules that are all but impossible to understand, e.g. there's practically
zero chance anyone could suss out complications with "Unexpected zaps", and the
existence of that super subtle edge case necessitates using an atomic because
KVM can't strictly guarantee that access to the field is mutually exclusive.  And
that also means it's inherently racy, e.g. if a zap happens while tdx_td_finalize()
is checking nr_premapped, what happens?

The real killer is that deferring TDH.MEM.PAGE.ADD and TDH.MR_EXTEND until after
the map completes and mmu_lock is dropped means that failure at that point leaves
the TDP MMU in an inconsistent state, where the M-EPT has a present page but the
S-EPT does not.  Eww.

Note, in no way am I trying to blame anyone; quite the opposite, you've done an
admirable job to get all of this landed.  And I apologize if any of my past
feedback led y'all down this road.  I suspect my prefaulting idea really screwed
things up; sorry :-(

Back to the code, unless I'm missing yet another complication, I think the obvious
fix to all of this is to pass the source page and metadata flags via a scratch
field in "struct kvm_tdx", and then do PAGE.ADD and MR.EXTEND in
tdx_sept_set_private_spte().  Then there is no need to keep track of pending
pages, because the M-EPT and S-EPT are always consistent.  E.g. if PAGE.ADD fails
with -EBUSY, then KVM will naturally revert the M-EPT entry from FROZEN to !PRESENT.
It also allows KVM to KVM_BUG_ON() MR.EXTEND failure, because it should be impossible
for the S-EPT entry to be modified between PAGE.ADD and MR.EXTEND.

Diff on top below for feedback on the idea.  A proper series for this would simply
replace several of the patches, e.g. asserting that slots_lock is held on
tdx_is_sept_zap_err_due_to_premap() is wrong.

---
 arch/x86/kvm/vmx/tdx.c | 157 +++++++++++++++++------------------------
 arch/x86/kvm/vmx/tdx.h |  11 +--
 2 files changed, 70 insertions(+), 98 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f9ac590e8ff0..5d981a061442 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1586,6 +1586,56 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
 }
 
+
+struct kvm_tdx_page_add {
+	struct page *src;
+	unsigned long flags;
+};
+
+static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+			    kvm_pfn_t pfn)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	u64 err, entry, level_state;
+	gpa_t gpa = gfn_to_gpa(gfn);
+	int i;
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm) ||
+	    KVM_BUG_ON(!kvm_tdx->page_add, kvm))
+		return -EIO;
+
+	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
+			       kvm_tdx->page_add->src, &entry, &level_state);
+	if (unlikely(tdx_operand_busy(err)))
+		return -EBUSY;
+
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error_2(TDH_MEM_PAGE_ADD, err, entry, level_state);
+		return -EIO;
+	}
+
+	if (!(kvm_tdx->page_add->flags & KVM_TDX_MEASURE_MEMORY_REGION))
+		return 0;
+
+	/*
+	 * Extend the measurement while holding mmu_lock to ensure MR.EXTEND
+	 * can't fail, e.g. due to the S-EPT entry being zapped after PAGE.ADD.
+	 * Note!  If extending the measurement fails, bug the VM, but do NOT
+	 * return an error, as mapping the page in the S-EPT succeeded and so
+	 * needs to be tracked in KVM's mirror page tables.
+	 */
+	for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
+		err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state);
+		if (KVM_BUG_ON(err, kvm)) {
+			pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state);
+			break;
+		}
+	}
+	return 0;
+}
+
 static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
 			    enum pg_level level, kvm_pfn_t pfn)
 {
@@ -1627,21 +1677,11 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 
 	/*
 	 * If the TD isn't finalized/runnable, then userspace is initializing
-	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Increment the number of
-	 * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD
-	 * requires a pre-existing S-EPT mapping).  KVM_TDX_FINALIZE_VM checks
-	 * the counter to ensure all mapped pages have been added to the image,
-	 * to prevent running the TD with uninitialized memory.
+	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Add the page to the TD,
+	 * and optionally extend the measurement with the page contents.
 	 */
-	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) {
-		lockdep_assert_held(&kvm->slots_lock);
-
-		if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
-			return -EIO;
-
-		kvm_tdx->nr_pending_tdh_mem_page_adds++;
-		return 0;
-	}
+	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE))
+		return tdx_mem_page_add(kvm, gfn, level, pfn);
 
 	return tdx_mem_page_aug(kvm, gfn, level, pfn);
 }
@@ -1716,39 +1756,6 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
 	return 0;
 }
 
-/*
- * Check if the error returned from a SEPT zap SEAMCALL is due to that a page is
- * mapped by KVM_TDX_INIT_MEM_REGION without tdh_mem_page_add() being called
- * successfully.
- *
- * Since tdh_mem_sept_add() must have been invoked successfully before a
- * non-leaf entry present in the mirrored page table, the SEPT ZAP related
- * SEAMCALLs should not encounter err TDX_EPT_WALK_FAILED. They should instead
- * find TDX_EPT_ENTRY_STATE_INCORRECT due to an empty leaf entry found in the
- * SEPT.
- *
- * Further check if the returned entry from SEPT walking is with RWX permissions
- * to filter out anything unexpected.
- *
- * Note: @level is pg_level, not the tdx_level. The tdx_level extracted from
- * level_state returned from a SEAMCALL error is the same as that passed into
- * the SEAMCALL.
- */
-static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err,
-					     u64 entry, int level)
-{
-	if (!err || kvm_tdx->state == TD_STATE_RUNNABLE)
-		return false;
-
-	if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))
-		return false;
-
-	if ((is_last_spte(entry, level) && (entry & VMX_EPT_RWX_MASK)))
-		return false;
-
-	return true;
-}
-
 static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 				     enum pg_level level, struct page *page)
 {
@@ -1768,15 +1775,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 		err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
 		tdx_no_vcpus_enter_stop(kvm);
 	}
-	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
-		lockdep_assert_held(&kvm->slots_lock);
-
-		if (KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm))
-			return -EIO;
-
-		return 0;
-	}
-
 	if (KVM_BUG_ON(err, kvm)) {
 		pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
 		return -EIO;
@@ -2842,12 +2840,6 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 
 	if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE)
 		return -EINVAL;
-	/*
-	 * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
-	 * TDH.MEM.PAGE.ADD().
-	 */
-	if (kvm_tdx->nr_pending_tdh_mem_page_adds)
-		return -EINVAL;
 
 	cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td);
 	if (tdx_operand_busy(cmd->hw_error))
@@ -3131,50 +3123,29 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 {
 	struct tdx_gmem_post_populate_arg *arg = _arg;
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
-	u64 err, entry, level_state;
-	gpa_t gpa = gfn_to_gpa(gfn);
-	struct page *src_page;
-	int ret, i;
+	struct kvm_tdx_page_add page_add = {
+		.flags = arg->flags,
+	};
+	int ret;
 
-	lockdep_assert_held(&kvm->slots_lock);
+	if (KVM_BUG_ON(kvm_tdx->page_add, kvm))
+		return -EIO;
 
 	/*
 	 * Get the source page if it has been faulted in. Return failure if the
 	 * source page has been swapped out or unmapped in primary memory.
 	 */
-	ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page);
+	ret = get_user_pages_fast((unsigned long)src, 1, 0, &page_add.src);
 	if (ret < 0)
 		return ret;
 	if (ret != 1)
 		return -ENOMEM;
 
+	kvm_tdx->page_add = &page_add;
 	ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
-	if (ret < 0)
-		goto out;
+	kvm_tdx->page_add = NULL;
 
-	ret = 0;
-	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
-			       src_page, &entry, &level_state);
-	if (err) {
-		ret = unlikely(tdx_operand_busy(err)) ? -EBUSY : -EIO;
-		goto out;
-	}
-
-	KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm);
-
-	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
-		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
-			err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry,
-					    &level_state);
-			if (err) {
-				ret = -EIO;
-				break;
-			}
-		}
-	}
-
-out:
-	put_page(src_page);
+	put_page(page_add.src);
 	return ret;
 }
 
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 45d86f9fa41c..39e0c3bcc866 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -21,6 +21,8 @@ enum kvm_tdx_state {
 	TD_STATE_RUNNABLE,
 };
 
+struct kvm_tdx_page_add;
+
 struct kvm_tdx {
 	struct kvm kvm;
 
@@ -37,12 +39,11 @@ struct kvm_tdx {
 	struct tdx_td td;
 
 	/*
-	 * The number of pages that KVM_TDX_INIT_MEM_REGION has mapped into the
-	 * S-EPT, but not yet initialized via TDH.MEM.PAGE_ADD.  Used to sanity
-	 * check adding pages to the image, and to ensure that all pages have
-	 * been initialized before finalizing the TD.
+	 * Scratch structure used to pass the source page and metadata flags to
+	 * tdx_mem_page_add.  Protected by slots_lock, and non-NULL only when
+	 * mapping a private pfn via tdx_gmem_post_populate().
 	 */
-	unsigned long nr_pending_tdh_mem_page_adds;
+	struct kvm_tdx_page_add *page_add;
 
 	/*
 	 * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do

base-commit: 7c7a3893b102bdeb4826f7140280b7b16081b385
--

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 04/12] KVM: x86/mmu: Rename kvm_tdp_map_page() to kvm_tdp_prefault_page()
  2025-08-28  2:01   ` Edgecombe, Rick P
@ 2025-08-28 18:50     ` Sean Christopherson
  2025-08-28 19:04       ` Edgecombe, Rick P
  0 siblings, 1 reply; 85+ messages in thread
From: Sean Christopherson @ 2025-08-28 18:50 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: pbonzini@redhat.com, kvm@vger.kernel.org, Vishal Annapurve,
	linux-kernel@vger.kernel.org, Yan Y Zhao, michael.roth@amd.com,
	Ira Weiny

On Thu, Aug 28, 2025, Rick P Edgecombe wrote:
> On Tue, 2025-08-26 at 17:05 -0700, Sean Christopherson wrote:
> > Rename kvm_tdp_map_page() to kvm_tdp_prefault_page() now that it's used
> > only by kvm_arch_vcpu_pre_fault_memory().
> > 
> > No functional change intended.
> 
> I realize you are just trying to do map->prefault here, but "page" seems
> redundant once you have "prefault" in the name. Why page here vs all the other
> fault handler functions without it?

kvm_tdp_prefault() feels a bit ambiguous/bare.  Many of the fault helpers do have
"page", it's just before the fault part.

  kvm_mmu_finish_page_fault
  kvm_handle_page_fault
  kvm_tdp_page_fault
  direct_page_fault
  nonpaging_page_fault
  kvm_tdp_mmu_page_fault

  (and probably more)

How about kvm_tdp_page_prefault()?  Or kvm_tdp_do_prefault(), but I think I like
kvm_tdp_page_prefault() a little more.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-28 17:00         ` Sean Christopherson
@ 2025-08-28 18:52           ` Edgecombe, Rick P
  2025-08-28 20:26             ` Sean Christopherson
                               ` (2 more replies)
  2025-08-29  2:31           ` Yan Zhao
  1 sibling, 3 replies; 85+ messages in thread
From: Edgecombe, Rick P @ 2025-08-28 18:52 UTC (permalink / raw)
  To: seanjc@google.com, Zhao, Yan Y
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira

On Thu, 2025-08-28 at 10:00 -0700, Sean Christopherson wrote:
> On Thu, Aug 28, 2025, Yan Zhao wrote:
> > On Wed, Aug 27, 2025 at 12:08:27PM -0700, Sean Christopherson wrote:
> > > On Wed, Aug 27, 2025, Yan Zhao wrote:
> > > > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote:
> > > > > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> > > > > +	/*
> > > > > +	 * If the TD isn't finalized/runnable, then userspace is initializing
> > > > > +	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Increment the number of
> > > > > +	 * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD
> > > > > +	 * requires a pre-existing S-EPT mapping).  KVM_TDX_FINALIZE_VM checks
> > > > > +	 * the counter to ensure all mapped pages have been added to the image,
> > > > > +	 * to prevent running the TD with uninitialized memory.
> > > > To prevent the mismatch between mirror EPT and the S-EPT?
> > > 
> > > No?  Because KVM bumps the count when installing the S-EPT and decrements it
> > > on AUG, so I don't see how nr_premapped guards against M-EPT vs. S-EPT issues?
> > Hmm, I think there must be some misunderstanding.
> 
> Yeah, I forgot that AUG and ADD create the leaf S-EPT entries.
> 
> > Before userspace invokes KVM_TDX_FINALIZE_VM,
> > =======
> > 1. the normal path (userspace invokes KVM_TDX_INIT_MEM_REGION).
> >    (1) KVM holds slot_lock and filemap lock.
> >    (2) KVM invokes kvm_tdp_map_page() (or kvm_tdp_mmu_map_private_pfn() in
> >        patch 2).
> >        KVM increases nr_premapped in tdx_sept_set_private_spte() to indicate
> >        that there's a page mapped in M-EPT, while it's not yet installed in
> >        S-EPT.
> >    (3) KVM invokes TDH.MEM.PAGE.ADD and decreases nr_premapped, indicating the
> >        page has been mapped in S-EPT too.
> >        
> >    As the name of nr_premapped indicates, the count means a page is pre-mapped
> >    in the M-EPT, before its real mapping in the S-EPT.
> >    If ADD fails in step (3), nr_premapped will not be decreased.
> > 
> >    With only the normal path, nr_premapped should return to 0 after all
> >    KVM_TDX_INIT_MEM_REGIONs.
> >       
> > 
> > 2. Expected zap paths (e.g. If userspace does something strange, such as
> >    removing a slot after KVM_TDX_INIT_MEM_REGION)
> >    Those zap paths could be triggered by
> >    1) userspace performs a page attribute conversion
> >    2) userspace invokes gmem punch hole
> >    3) userspace removes a slot
> >    As all those paths either hold a slot_lock or a filemap lock, they can't
> >    contend with tdx_vcpu_init_mem_region() (tdx_vcpu_init_mem_region holds both
> >    slot_lock and internally filemap lock).
> >    Consequently, those zaps must occur
> >    a) before kvm_tdp_map_page() or
> >    b) after TDH.MEM.PAGE.ADD.
> >    For a), tdx_sept_zap_private_spte() won't be invoked as the page is not
> >            mapped in M-EPT either;
> >    For b), tdx_sept_zap_private_spte() should succeed, as the BLOCK and REMOVE
> >            SEAMCALLs are following the ADD.
> >    nr_premapped is therefore unchanged, since it does not change the consistency
> >    between M-EPT and S-EPT.
> > 
> > 3. Unexpected zaps (such as kvm_zap_gfn_range()).
> 
> Side topic related to kvm_zap_gfn_range(), the KVM_BUG_ON() in vt_refresh_apicv_exec_ctrl()
> is flawed.  If kvm_recalculate_apic_map() fails to allocate an optimized map, KVM
> will mark APICv as inhibited, i.e. the associated WARN_ON_ONCE() is effectively
> user-triggerable.
> 
> Easiest thing would be to mark the vCPU as dead (though we obviously need
> "KVM: Never clear KVM_REQ_VM_DEAD from a vCPU's requests" for that to be robust).
> 
> 
> 
I'm going to need to look up the related apic discussions from the base series and
circle back.

[snip]

> > tdx_td_finalize() now just returns -EINVAL in case of nr_premapped being !0.
> > KVM_BUG_ON/WARN_ON should be also ok.
> 
> Ok, so I vaguely recall that I may have pushed back on using a scratch field in
> "struct kvm_tdx" for temporary data (or maybe it was abusing vcpus[0] that I
> disliked?), but what we ended up with is far worse.

I think it was also that the tdh_mr_extend() loop was too heavyweight for the
fault path. But that was before we got to the kick+lock stuff.

> 
> For all intents and purposes, nr_premapped _is_ a temporary scratch field, but
> with access rules that are all but impossible to understand, e.g. there's practically
> zero chance anyone could suss out complications with "Unexpected zaps", and the
> existence of that super subtle edge case necessitates using an atomic because
> KVM can't strictly guarantee that access to the field is mutually exclusive.  And
> that also means it's inherently racy, e.g. if a zap happens while tdx_td_finalize()
> is checking nr_premapped, what happens?
> 
> The real killer is that deferring TDH.MEM.PAGE.ADD and TDH.MR_EXTEND until after
> the map completes and mmu_lock is dropped means that failure at that point leaves
> the TDP MMU in an inconsistent state, where the M-EPT has a present page but the
> S-EPT does not.  Eww.
> 
> Note, in no way am I trying to blame anyone; quite the opposite, you've done an
> admirable job to get all of this landed.  And I apologize if any of my past
> feedback led y'all down this road.  I suspect my prefaulting idea really screwed
> things up; sorry :-(

It's unfortunate we didn't have the gmem mmap() support then. Otherwise we could
have just encrypted it in place.

Otherwise, I'm really glad to see these cleanups/scrutiny. I kind of got the
impression that you wanted to see less TDX churn for a bit. The fact is, the TDX
base support still needs more work like this.

> 
> Back to the code, unless I'm missing yet another complication, I think the obvious
> fix to all of this is to pass the source page and metadata flags via a scratch
> field in "struct kvm_tdx", and then do PAGE.ADD and MR.EXTEND in
> tdx_sept_set_private_spte().  Then there is no need to keep track of pending
> pages, because the M-EPT and S-EPT are always consistent.  E.g. if PAGE.ADD fails
> with -EBUSY, then KVM will naturally revert the M-EPT entry from FROZEN to !PRESENT.
> It also allows KVM to KVM_BUG_ON() MR.EXTEND failure, because it should be impossible
> for the S-EPT entry to be modified between PAGE.ADD and MR.EXTEND.
> 
> Diff on top below for feedback on the idea.  A proper series for this would simply
> replace several of the patches, e.g. asserting that slots_lock is held on
> tdx_is_sept_zap_err_due_to_premap() is wrong.

This works. The "stuffing data on the vCPU" part is a little ugly as it was
before, but the other solutions were worse. Especially nr_premap was problematic
for a number of reasons that keeps growing.

So, big Acked-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

> 
> ---
>  arch/x86/kvm/vmx/tdx.c | 157 +++++++++++++++++------------------------
>  arch/x86/kvm/vmx/tdx.h |  11 +--
>  2 files changed, 70 insertions(+), 98 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index f9ac590e8ff0..5d981a061442 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1586,6 +1586,56 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
>  	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
>  }
>  
> +
> +struct kvm_tdx_page_add {
> +	struct page *src;
> +	unsigned long flags;
> +};
> +
> +static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +			    kvm_pfn_t pfn)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	u64 err, entry, level_state;
> +	gpa_t gpa = gfn_to_gpa(gfn);
> +	int i;
> +
> +	lockdep_assert_held(&kvm->slots_lock);
> +
> +	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm) ||
> +	    KVM_BUG_ON(!kvm_tdx->page_add, kvm))
> +		return -EIO;
> +
> +	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
> +			       kvm_tdx->page_add->src, &entry, &level_state);
> +	if (unlikely(tdx_operand_busy(err)))
> +		return -EBUSY;
> +
> +	if (KVM_BUG_ON(err, kvm)) {
> +		pr_tdx_error_2(TDH_MEM_PAGE_ADD, err, entry, level_state);
> +		return -EIO;
> +	}
> +
> +	if (!(kvm_tdx->page_add->flags & KVM_TDX_MEASURE_MEMORY_REGION))
> +		return 0;
> +
> +	/*
> +	 * Extend the measurement while holding mmu_lock to ensure MR.EXTEND
> +	 * can't fail, e.g. due to the S-EPT entry being zapped after PAGE.ADD.
> +	 * Note!  If extending the measurement fails, bug the VM, but do NOT
> +	 * return an error, as mapping the page in the S-EPT succeeded and so
> +	 * needs to be tracked in KVM's mirror page tables.
> +	 */
> +	for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
> +		err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state);
> +		if (KVM_BUG_ON(err, kvm)) {
> +			pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state);
> +			break;
> +		}
> +	}
> +	return 0;
> +}
> +
>  static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
>  			    enum pg_level level, kvm_pfn_t pfn)
>  {
> @@ -1627,21 +1677,11 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>  
>  	/*
>  	 * If the TD isn't finalized/runnable, then userspace is initializing
> -	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Increment the number of
> -	 * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD
> -	 * requires a pre-existing S-EPT mapping).  KVM_TDX_FINALIZE_VM checks
> -	 * the counter to ensure all mapped pages have been added to the image,
> -	 * to prevent running the TD with uninitialized memory.
> +	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Add the page to the TD,
> +	 * and optionally extend the measurement with the page contents.
>  	 */
> -	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) {
> -		lockdep_assert_held(&kvm->slots_lock);
> -
> -		if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
> -			return -EIO;
> -
> -		kvm_tdx->nr_pending_tdh_mem_page_adds++;
> -		return 0;
> -	}
> +	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE))
> +		return tdx_mem_page_add(kvm, gfn, level, pfn);
>  
>  	return tdx_mem_page_aug(kvm, gfn, level, pfn);
>  }
> @@ -1716,39 +1756,6 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
>  	return 0;
>  }
>  
> -/*
> - * Check if the error returned from a SEPT zap SEAMCALL is due to that a page is
> - * mapped by KVM_TDX_INIT_MEM_REGION without tdh_mem_page_add() being called
> - * successfully.
> - *
> - * Since tdh_mem_sept_add() must have been invoked successfully before a
> - * non-leaf entry present in the mirrored page table, the SEPT ZAP related
> - * SEAMCALLs should not encounter err TDX_EPT_WALK_FAILED. They should instead
> - * find TDX_EPT_ENTRY_STATE_INCORRECT due to an empty leaf entry found in the
> - * SEPT.
> - *
> - * Further check if the returned entry from SEPT walking is with RWX permissions
> - * to filter out anything unexpected.
> - *
> - * Note: @level is pg_level, not the tdx_level. The tdx_level extracted from
> - * level_state returned from a SEAMCALL error is the same as that passed into
> - * the SEAMCALL.
> - */
> -static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err,
> -					     u64 entry, int level)
> -{
> -	if (!err || kvm_tdx->state == TD_STATE_RUNNABLE)
> -		return false;
> -
> -	if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))
> -		return false;
> -
> -	if ((is_last_spte(entry, level) && (entry & VMX_EPT_RWX_MASK)))
> -		return false;
> -
> -	return true;
> -}
> -
>  static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>  				     enum pg_level level, struct page *page)
>  {
> @@ -1768,15 +1775,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>  		err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
>  		tdx_no_vcpus_enter_stop(kvm);
>  	}
> -	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
> -		lockdep_assert_held(&kvm->slots_lock);
> -
> -		if (KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm))
> -			return -EIO;
> -
> -		return 0;
> -	}
> -
>  	if (KVM_BUG_ON(err, kvm)) {
>  		pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
>  		return -EIO;
> @@ -2842,12 +2840,6 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>  
>  	if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE)
>  		return -EINVAL;
> -	/*
> -	 * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
> -	 * TDH.MEM.PAGE.ADD().
> -	 */
> -	if (kvm_tdx->nr_pending_tdh_mem_page_adds)
> -		return -EINVAL;
>  
>  	cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td);
>  	if (tdx_operand_busy(cmd->hw_error))
> @@ -3131,50 +3123,29 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  {
>  	struct tdx_gmem_post_populate_arg *arg = _arg;
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> -	u64 err, entry, level_state;
> -	gpa_t gpa = gfn_to_gpa(gfn);
> -	struct page *src_page;
> -	int ret, i;
> +	struct kvm_tdx_page_add page_add = {
> +		.flags = arg->flags,
> +	};
> +	int ret;
>  
> -	lockdep_assert_held(&kvm->slots_lock);
> +	if (KVM_BUG_ON(kvm_tdx->page_add, kvm))
> +		return -EIO;
>  
>  	/*
>  	 * Get the source page if it has been faulted in. Return failure if the
>  	 * source page has been swapped out or unmapped in primary memory.
>  	 */
> -	ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page);
> +	ret = get_user_pages_fast((unsigned long)src, 1, 0, &page_add.src);
>  	if (ret < 0)
>  		return ret;
>  	if (ret != 1)
>  		return -ENOMEM;
>  
> +	kvm_tdx->page_add = &page_add;
>  	ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
> -	if (ret < 0)
> -		goto out;
> +	kvm_tdx->page_add = NULL;
>  
> -	ret = 0;
> -	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
> -			       src_page, &entry, &level_state);
> -	if (err) {
> -		ret = unlikely(tdx_operand_busy(err)) ? -EBUSY : -EIO;
> -		goto out;
> -	}
> -
> -	KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm);
> -
> -	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
> -		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
> -			err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry,
> -					    &level_state);
> -			if (err) {
> -				ret = -EIO;
> -				break;
> -			}
> -		}
> -	}
> -
> -out:
> -	put_page(src_page);
> +	put_page(page_add.src);
>  	return ret;
>  }
>  
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 45d86f9fa41c..39e0c3bcc866 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -21,6 +21,8 @@ enum kvm_tdx_state {
>  	TD_STATE_RUNNABLE,
>  };
>  
> +struct kvm_tdx_page_add;
> +
>  struct kvm_tdx {
>  	struct kvm kvm;
>  
> @@ -37,12 +39,11 @@ struct kvm_tdx {
>  	struct tdx_td td;
>  
>  	/*
> -	 * The number of pages that KVM_TDX_INIT_MEM_REGION has mapped into the
> -	 * S-EPT, but not yet initialized via TDH.MEM.PAGE_ADD.  Used to sanity
> -	 * check adding pages to the image, and to ensure that all pages have
> -	 * been initialized before finalizing the TD.
> +	 * Scratch structure used to pass the source page and metadata flags to
> +	 * tdx_mem_page_add.  Protected by slots_lock, and non-NULL only when
> +	 * mapping a private pfn via tdx_gmem_post_populate().
>  	 */
> -	unsigned long nr_pending_tdh_mem_page_adds;
> +	struct kvm_tdx_page_add *page_add;
>  
>  	/*
>  	 * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do
> 
> base-commit: 7c7a3893b102bdeb4826f7140280b7b16081b385
> --


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 00/12] KVM: x86/mmu: TDX post-populate cleanups
  2025-08-27  0:05 [RFC PATCH 00/12] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
                   ` (12 preceding siblings ...)
  2025-08-27  9:48 ` [RFC PATCH 00/12] KVM: x86/mmu: TDX post-populate cleanups Yan Zhao
@ 2025-08-28 19:01 ` Edgecombe, Rick P
  2025-08-28 23:19   ` Sean Christopherson
  13 siblings, 1 reply; 85+ messages in thread
From: Edgecombe, Rick P @ 2025-08-28 19:01 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com
  Cc: kvm@vger.kernel.org, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, michael.roth@amd.com,
	Weiny, Ira

On Tue, 2025-08-26 at 17:05 -0700, Sean Christopherson wrote:
> RFC as this is compile tested only (mostly due to lack of access to a TDX
> capable system, but also due to lack of cycles).

Let us know how we could best help with this. The series fails the tests because
of the page size issue Yan pointed out. We could just review and test a v2, or if
you want us to pull together the feedback, test the result, and repost please
let us know. I think either should work from our end.

I suspect Vishal could hook you up with a TDX machine. But if you need any setup
help there too, please shout.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 04/12] KVM: x86/mmu: Rename kvm_tdp_map_page() to kvm_tdp_prefault_page()
  2025-08-28 18:50     ` Sean Christopherson
@ 2025-08-28 19:04       ` Edgecombe, Rick P
  0 siblings, 0 replies; 85+ messages in thread
From: Edgecombe, Rick P @ 2025-08-28 19:04 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, michael.roth@amd.com,
	Weiny, Ira

On Thu, 2025-08-28 at 11:50 -0700, Sean Christopherson wrote:
> > 
> > I realize you are just trying to do map->prefault here, but "page" seems
> > redundant once you have "prefault" in the name. Why page here vs all the
> > other fault handler functions without it?
> 
> kvm_tdp_prefault() feels a bit ambiguous/bare.  Many of the fault helpers do
> have "page", it's just before the fault part.
> 
>   kvm_mmu_finish_page_fault
>   kvm_handle_page_fault
>   kvm_tdp_page_fault
>   direct_page_fault
>   nonpaging_page_fault
>   kvm_tdp_mmu_page_fault
> 
>   (and probably more)

True.

> 
> How about kvm_tdp_page_prefault()?  Or kvm_tdp_do_prefault(), but I think I
> like kvm_tdp_page_prefault() a little more.

kvm_tdp_page_prefault() would be my pick of those. 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-28  5:56         ` Yan Zhao
@ 2025-08-28 19:08           ` Edgecombe, Rick P
  0 siblings, 0 replies; 85+ messages in thread
From: Edgecombe, Rick P @ 2025-08-28 19:08 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Annapurve, Vishal,
	seanjc@google.com, linux-kernel@vger.kernel.org,
	michael.roth@amd.com, Weiny, Ira

On Thu, 2025-08-28 at 13:56 +0800, Yan Zhao wrote:
> > Reasons that tdh_mem_page_add() could get BUSY:
> > 1. If two vCPUs tried to tdh_mem_page_add() the same gpa at the same time
> > they
> > could contend the SEPT entry lock
> > 2. If one vCPU tries to tdh_mem_page_add() while the other zaps (i.e.
> > tdh_mem_range_block()).
> Hmm, two tdh_mem_page_add()s can't contend as they are protected by both
> slot_lock and filemap lock.
> 
> With regard to the contention to tdh_mem_range_block(), please check my
> analysis at the above [1].

The analysis missed the tdh_mem_page_add() failure path

> 
> tdh_mem_page_add() could get BUSY though, when a misbehaved userspace invokes
> KVM_TDX_INIT_MEM_REGION on one vCPU while initializing another vCPU.
> 
> Please check more details at [2].
> 
> [2] https://lore.kernel.org/kvm/20250113021050.18828-1-yan.y.zhao@intel.com/

Ah, the TDR lock. I actually referred to an older version of your locking
analysis that didn't have that one. But this means the premap count could get
out of sync for that reason too.

> 
> 
> > I guess since we don't hold MMU lock while we tdh_mem_page_add(), 2 is a
> > possibility.
> 2 is possible only for paranoid zaps.
> See "case 3. Unexpected zaps" in [1].

Sean's lockdep assert handles half of those cases. Maybe we could also
reconsider a KVM_BUG_ON() in the invalid zap paths if it comes to it.

> 
> 
> > > What reasonable use case is there for gracefully handling
> > > tdh_mem_page_add() failure?
> > > 
> > > If there is a need to handle failure, I gotta imagine it's only for the
> > > -EBUSY case.  And if it's only for -EBUSY, why can't that be handled by
> > > retrying in tdx_vcpu_init_mem_region()?  If tdx_vcpu_init_mem_region()
> > > guarantees that all pages mapped into the S-EPT are ADDed, then it can
> > > assert that there are no pending pages when it completes (even if it
> > > "fails"), and similarly tdx_td_finalize() can KVM_BUG_ON/WARN_ON the
> > > number of pending pages being non-zero.
> > 
> > Maybe we could take mmu write lock for the retry of tdh_mem_page_add(). Or
> > maybe even for a single call of it, until someone wants to parallelize the
> > operation.
> Hmm. I prefer returning -EBUSY directly, as invoking KVM_TDX_INIT_MEM_REGION
> before finishing initializing all vCPUs is uncommon.

I was looking at guaranteeing its success when Sean posted his suggestion to return
to the original pattern. I'm in favor of that direction. If you agree we can
call this moot. 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 08/12] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent
  2025-08-28  6:48     ` Yan Zhao
@ 2025-08-28 19:14       ` Edgecombe, Rick P
  2025-08-28 22:33         ` Sean Christopherson
  0 siblings, 1 reply; 85+ messages in thread
From: Edgecombe, Rick P @ 2025-08-28 19:14 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Annapurve, Vishal,
	seanjc@google.com, linux-kernel@vger.kernel.org,
	michael.roth@amd.com, Weiny, Ira

On Thu, 2025-08-28 at 14:48 +0800, Yan Zhao wrote:
> Hmm, I still think it's safer to keep the nr_premapped to detect any unexpected
> code change.

When I was checking patch 6 I saw how many more KVM_BUG_ON()s we ended up with
in TDX code compared to the rest of KVM (even after we dropped a bunch during
development). We have to differentiate between good safety and "safety" that is
really just propping up brittle code. Each KVM_BUG_ON() is a hint that there
might be design issues.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 06/12] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition
  2025-08-28  2:11   ` Edgecombe, Rick P
@ 2025-08-28 19:21     ` Sean Christopherson
  2025-08-28 20:13       ` Edgecombe, Rick P
  0 siblings, 1 reply; 85+ messages in thread
From: Sean Christopherson @ 2025-08-28 19:21 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: pbonzini@redhat.com, kvm@vger.kernel.org, Vishal Annapurve,
	linux-kernel@vger.kernel.org, Yan Y Zhao, michael.roth@amd.com,
	Ira Weiny

On Thu, Aug 28, 2025, Rick P Edgecombe wrote:
> On Tue, 2025-08-26 at 17:05 -0700, Sean Christopherson wrote:
> > Return -EIO when a KVM_BUG_ON() is tripped, as KVM's ABI is to return -EIO
> > when a VM has been killed due to a KVM bug, not -EINVAL.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kvm/vmx/tdx.c | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 9fb6e5f02cc9..ef4ffcad131f 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1624,7 +1624,7 @@ static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
> >  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> >  
> >  	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
> > -		return -EINVAL;
> > +		return -EIO;
> >  
> >  	/* nr_premapped will be decreased when tdh_mem_page_add() is called. */
> >  	atomic64_inc(&kvm_tdx->nr_premapped);
> > @@ -1638,7 +1638,7 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> >  
> >  	/* TODO: handle large pages. */
> >  	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> > -		return -EINVAL;
> > +		return -EIO;
> >  
> >  	/*
> >  	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
> > @@ -1849,7 +1849,7 @@ static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
> >  	 * and slot move/deletion.
> >  	 */
> >  	if (KVM_BUG_ON(is_hkid_assigned(kvm_tdx), kvm))
> > -		return -EINVAL;
> > +		return -EIO;
> >  
> >  	/*
> >  	 * The HKID assigned to this TD was already freed and cache was
> > @@ -1870,7 +1870,7 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> >  	 * there can't be anything populated in the private EPT.
> >  	 */
> >  	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
> > -		return -EINVAL;
> > +		return -EIO;
> >  
> >  	ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
> >  	if (ret <= 0)
> 
> 
> Did you miss?

I did indeed.

> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index f9ac590e8ff0..fd1b8fea55a9 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1656,10 +1656,10 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm,
> gfn_t gfn,
>  
>         /* TODO: handle large pages. */
>         if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> -               return -EINVAL;
> +               return -EIO;
>  
>         if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
> -               return -EINVAL;
> +               return -EIO;
>  
>         /*
>          * When zapping private page, write lock is held. So no race condition
> 
> 
> We really have a lot of KVM_BUG_ON()s in tdx code. I hesitate to pull them out
> but it feels a bit gratuitous.

Generally speaking, the number of KVM_BUG_ON()s is fine.  What we can do though
is reduce the amount of boilerplate and the number of paths that propagate a SEAMCALL
err through multiple layers, e.g. by eliminating single-use helpers (which is made
easier by reducing boilerplate and thus lines of code).

Concretely, if we combine the KVM_BUG_ON() usage with pr_tdx_error():

#define __TDX_BUG_ON(__err, __fn_str, __kvm, __fmt, __args...)			\
({										\
	struct kvm *_kvm = (__kvm);						\
	bool __ret = !!(__err);							\
										\
	if (WARN_ON_ONCE(__ret && (!_kvm || !_kvm->vm_bugged))) {		\
		if (_kvm)							\
			kvm_vm_bugged(_kvm);					\
		pr_err_ratelimited("SEAMCALL " __fn_str " failed: 0x%llx"	\
				   __fmt "\n",  __err,  __args); 		\
	}									\
	unlikely(__ret);							\
})

#define TDX_BUG_ON(__err, __fn, __kvm)				\
	__TDX_BUG_ON(__err, #__fn, __kvm, "%s", "")

#define TDX_BUG_ON_1(__err, __fn, __rcx, __kvm)			\
	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx", __rcx)

#define TDX_BUG_ON_2(__err, __fn, __rcx, __rdx, __kvm)		\
	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx, rdx 0x%llx", __rcx, __rdx)

#define TDX_BUG_ON_3(__err, __fn, __rcx, __rdx, __r8, __kvm)	\
	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx, rdx 0x%llx, r8 0x%llx", __rcx, __rdx, __r8)


And a macro to handle retry when kicking vCPUs out of the guest:

#define tdh_do_no_vcpus(tdh_func, kvm, args...)					\
({										\
	struct kvm_tdx *__kvm_tdx = to_kvm_tdx(kvm);				\
	u64 __err;								\
										\
	lockdep_assert_held_write(&kvm->mmu_lock);				\
										\
	__err = tdh_func(args);							\
	if (unlikely(tdx_operand_busy(__err))) {				\
		WRITE_ONCE(__kvm_tdx->wait_for_sept_zap, true);			\
		kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);	\
										\
		__err = tdh_func(args);						\
										\
		WRITE_ONCE(__kvm_tdx->wait_for_sept_zap, false);		\
	}									\
	__err;									\
})

And do a bit of massaging, then we can end up e.g. this, which IMO is much easier
to follow than the current form of tdx_sept_remove_private_spte(), which has
several duplicate sanity checks and error handlers.

The tdh_do_no_vcpus() macro is a little mean, but I think it's a net positive
as it eliminates quite a lot of "noise", and thus makes it easier to focus on
the logic.  An alternative to a trampoline macro would be to implement a guard()
and then do a scoped_guard(), but I think that'd be just as hard to read, and
would require almost as much boilerplate as there is today.
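
(Very roughly, and purely as a hypothetical sketch using the cleanup.h helpers
with made-up names, a guard-based variant might look like the below.  Note it
would kick vCPUs out unconditionally instead of only on a BUSY retry, so it's
not a drop-in replacement, which is part of why I don't think it reads any
better.)

/* Hypothetical sketch, not part of the proposal. */
DEFINE_GUARD(tdx_no_vcpus, struct kvm *,
	     ({
		WRITE_ONCE(to_kvm_tdx(_T)->wait_for_sept_zap, true);
		kvm_make_all_cpus_request(_T, KVM_REQ_OUTSIDE_GUEST_MODE);
	     }),
	     WRITE_ONCE(to_kvm_tdx(_T)->wait_for_sept_zap, false))

	/* ...and at the call site: */
	scoped_guard(tdx_no_vcpus, kvm)
		err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level,
					  &entry, &level_state);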

static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
					 enum pg_level level, u64 spte)
{
	struct page *page = pfn_to_page(spte_to_pfn(spte));
	int tdx_level = pg_level_to_tdx_sept_level(level);
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
	gpa_t gpa = gfn_to_gpa(gfn);
	u64 err, entry, level_state;

	/*
	 * HKID is released after all private pages have been removed, and set
	 * before any might be populated. Warn if zapping is attempted when
	 * there can't be anything populated in the private EPT.
	 */
	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
		return;

	/* TODO: handle large pages. */
	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
		return;

	err = tdh_do_no_vcpus(tdh_mem_range_block, kvm, &kvm_tdx->td, gpa,
			      tdx_level, &entry, &level_state);
	if (TDX_BUG_ON_2(err, TDH_MEM_RANGE_BLOCK, entry, level_state, kvm))
		return;

	/*
	 * TDX requires TLB tracking before dropping private page.  Do
	 * it here, although it is also done later.
	 */
	tdx_track(kvm);

	/*
	 * When zapping private page, write lock is held. So no race condition
	 * with other vcpu sept operation.
	 * Race with TDH.VP.ENTER due to (0-step mitigation) and Guest TDCALLs.
	 */
	err = tdh_do_no_vcpus(tdh_mem_page_remove, kvm, &kvm_tdx->td, gpa,
			      tdx_level, &entry, &level_state);
	if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_REMOVE, entry, level_state, kvm))
		return;

	err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
	if (TDX_BUG_ON(err, TDH_PHYMEM_PAGE_WBINVD, kvm))
		return;

	tdx_clear_page(page);
}


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 02/12] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU
  2025-08-28  6:23         ` Yan Zhao
@ 2025-08-28 19:40           ` Sean Christopherson
  2025-08-29  1:16             ` Yan Zhao
  0 siblings, 1 reply; 85+ messages in thread
From: Sean Christopherson @ 2025-08-28 19:40 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Rick P Edgecombe, kvm@vger.kernel.org, pbonzini@redhat.com,
	Vishal Annapurve, linux-kernel@vger.kernel.org,
	michael.roth@amd.com, Ira Weiny

On Thu, Aug 28, 2025, Yan Zhao wrote:
> On Thu, Aug 28, 2025 at 09:26:50AM +0800, Edgecombe, Rick P wrote:
> > On Wed, 2025-08-27 at 17:54 -0700, Rick Edgecombe wrote:
> > > > 
> > > > Then, what about setting
> > > >                 .max_level = PG_LEVEL_4K,
> > > > directly?
> > > > 
> > > > Otherwise, the "(KVM_BUG_ON(level != PG_LEVEL_4K, kvm)" would be triggered
> > > > in
> > > > tdx_sept_set_private_spte().
> > > 
> > > Yes this fails to boot a TD. With max_level = PG_LEVEL_4K it passes the full
> > > tests. I don't think it's ideal to encode PAGE.ADD details here though.
> > > 
> > > But I'm not immediately clear what is going wrong. The old struct
> > > kvm_page_fault
> > > looks pretty similar. Did you root cause it?
> >
> > Oh, duh. Because we are passing in the PFN now so it can't know the size. So
> > it's not about PAGE.ADD actually.
> Right, it's because the previous kvm_tdp_map_page() updates fault->max_level in
> kvm_mmu_faultin_pfn_private() by checking the private_max_mapping_level hook.
> 
> However, kvm_tdp_mmu_map_private_pfn() skips the faultin step and goes
> straight to kvm_tdp_mmu_map(), so the hook is never consulted.
> 
> > Still, how about calling the function kvm_tdp_mmu_map_private_pfn_4k(), or
> > passing in the level?
> Looks [1] can also address this issue. Not sure which one Sean prefers.
> 
> [1] https://lore.kernel.org/all/20250729225455.670324-15-seanjc@google.com

That won't fix this issue though, because @fault will be valid and so max_level
will still be KVM_MAX_HUGEPAGE_LEVEL.  Which is by design, the intent in that
flow is that KVM should have gotten the level when getting the pfn from gmem.

IIUC, this particular flow _must_ map at 4KiB, so I think forcing PG_LEVEL_4K is
the right solution.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 02/12] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU
  2025-08-28  1:51     ` Edgecombe, Rick P
@ 2025-08-28 19:57       ` Sean Christopherson
  0 siblings, 0 replies; 85+ messages in thread
From: Sean Christopherson @ 2025-08-28 19:57 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: pbonzini@redhat.com, Ira Weiny, kvm@vger.kernel.org,
	Vishal Annapurve, linux-kernel@vger.kernel.org, Yan Y Zhao,
	michael.roth@amd.com

On Thu, Aug 28, 2025, Rick P Edgecombe wrote:
> On Wed, 2025-08-27 at 19:40 -0500, Ira Weiny wrote:
> > > +		.map_writable = true,
> > 
> > Why is map_writable set?  Doesn't this get translated into host_writable?
> 
> I guess for private faults it's normally set only if it's a !KVM_MEM_READONLY
> slot.

map_writable can also be %false on read faults where the host userspace mapping
isn't writable.

> But that flag is invalid for gmem. So we should only have
> map_writable=true cases for tdx.

Yep.  And not TDX specific, map_writable _must_ be true for write faults.  The
reason there's two separate flags is so that KVM can opportunistically create a
writable mapping on read faults.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 06/12] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition
  2025-08-28 19:21     ` Sean Christopherson
@ 2025-08-28 20:13       ` Edgecombe, Rick P
  2025-08-28 21:00         ` Sean Christopherson
  0 siblings, 1 reply; 85+ messages in thread
From: Edgecombe, Rick P @ 2025-08-28 20:13 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, michael.roth@amd.com,
	Weiny, Ira

On Thu, 2025-08-28 at 12:21 -0700, Sean Christopherson wrote:
> Generally speaking, the number of KVM_BUG_ON()s is fine.  What we can do though
> is reduce the amount of boilerplate and the number of paths that propagate a SEAMCALL
> err through multiple layers, e.g. by eliminating single-use helpers (which is made
> easier by reducing boilerplate and thus lines of code).
> 
> Concretely, if we combine the KVM_BUG_ON() usage with pr_tdx_error():
> 
> #define __TDX_BUG_ON(__err, __fn_str, __kvm, __fmt, __args...)			\
> ({										\
> 	struct kvm *_kvm = (__kvm);						\
> 	bool __ret = !!(__err);							\
> 										\
> 	if (WARN_ON_ONCE(__ret && (!_kvm || !_kvm->vm_bugged))) {		\
> 		if (_kvm)							\
> 			kvm_vm_bugged(_kvm);					\
> 		pr_err_ratelimited("SEAMCALL " __fn_str " failed: 0x%llx"	\
> 				   __fmt "\n",  __err,  __args); 		\
> 	}									\
> 	unlikely(__ret);							\
> })
> 
> #define TDX_BUG_ON(__err, __fn, __kvm)				\
> 	__TDX_BUG_ON(__err, #__fn, __kvm, "%s", "")
> 
> #define TDX_BUG_ON_1(__err, __fn, __rcx, __kvm)			\
> 	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx", __rcx)
> 
> #define TDX_BUG_ON_2(__err, __fn, __rcx, __rdx, __kvm)		\
> 	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx, rdx 0x%llx", __rcx, __rdx)
> 
> #define TDX_BUG_ON_3(__err, __fn, __rcx, __rdx, __r8, __kvm)	\
> 	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx, rdx 0x%llx, r8 0x%llx", __rcx, __rdx, __r8)

In general sounds good. But there it's a bit strange to specify them rcx, rdx,
etc in a general helper. This is fallout from the existing chain of strange
naming:

For example tdh_mem_range_block() plucks them from those registers and calls
them ext_err1 due to their conditional meaning. Then KVM gives them some more
meaning with 'entry' and 'level_state". Then prints them out as original
register names. How about keeping the KVM names, like:

#define TDX_BUG_ON_2(__err, __fn, arg1, arg2, __kvm)		\
	__TDX_BUG_ON(__err, #__fn, __kvm,			\
		     ", " #arg1 " 0x%llx, " #arg2 " 0x%llx", arg1, arg2)

so you get: entry: 0x00 level:0xF00

I *think* there is a way to make this work like var args and have a single
function, but it becomes impossible for people to read.


> 
> 
> And a macro to handle retry when kicking vCPUs out of the guest:
> 
> #define tdh_do_no_vcpus(tdh_func, kvm, args...)					\
> ({										\
> 	struct kvm_tdx *__kvm_tdx = to_kvm_tdx(kvm);				\
> 	u64 __err;								\
> 										\
> 	lockdep_assert_held_write(&kvm->mmu_lock);				\

There is a functional change in that the lock assert is not required if BUSY
can be guaranteed to not happen. I don't think it should be needed
today. I guess it's probably better to not rely on hitting rare races to catch
an issue like that.

> 										\
> 	__err = tdh_func(args);							\
> 	if (unlikely(tdx_operand_busy(__err))) {				\
> 		WRITE_ONCE(__kvm_tdx->wait_for_sept_zap, true);			\
> 		kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);	\
> 										\
> 		__err = tdh_func(args);						\
> 										\
> 		WRITE_ONCE(__kvm_tdx->wait_for_sept_zap, false);		\
> 	}									\
> 	__err;									\
> })
> 
> And do a bit of massaging, then we can end up e.g. this, which IMO is much easier
> to follow than the current form of tdx_sept_remove_private_spte(), which has
> several duplicate sanity checks and error handlers.
> 
> The tdh_do_no_vcpus() macro is a little mean, but I think it's a net positive
> as it eliminates quite a lot of "noise", and thus makes it easier to focus on
> the logic.  An alternative to a trampoline macro would be to implement a guard()
> and then do a scoped_guard(), but I think that'd be just as hard to read, and
> would require almost as much boilerplate as there is today.
> 
> static void tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> 					 enum pg_level level, u64 spte)
> {
> 	struct page *page = pfn_to_page(spte_to_pfn(spte));
> 	int tdx_level = pg_level_to_tdx_sept_level(level);
> 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> 	gpa_t gpa = gfn_to_gpa(gfn);
> 	u64 err, entry, level_state;
> 
> 	/*
> 	 * HKID is released after all private pages have been removed, and set
> 	 * before any might be populated. Warn if zapping is attempted when
> 	 * there can't be anything populated in the private EPT.
> 	 */
> 	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
> 		return;
> 
> 	/* TODO: handle large pages. */
> 	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> 		return;
> 
> 	err = tdh_do_no_vcpus(tdh_mem_range_block, kvm, &kvm_tdx->td, gpa,
> 			      tdx_level, &entry, &level_state);
> 	if (TDX_BUG_ON_2(err, TDH_MEM_RANGE_BLOCK, entry, level_state, kvm))
> 		return;
> 
> 	/*
> 	 * TDX requires TLB tracking before dropping private page.  Do
> 	 * it here, although it is also done later.
> 	 */
> 	tdx_track(kvm);
> 
> 	/*
> 	 * When zapping private page, write lock is held. So no race condition
> 	 * with other vcpu sept operation.
> 	 * Race with TDH.VP.ENTER due to (0-step mitigation) and Guest TDCALLs.
> 	 */
> 	err = tdh_do_no_vcpus(tdh_mem_page_remove, kvm, &kvm_tdx->td, gpa,
> 			      tdx_level, &entry, &level_state);
> 	if (TDX_BUG_ON_2(err, TDH_MEM_PAGE_REMOVE, entry, level_state, kvm))
> 		return;
> 
> 	err = tdh_phymem_page_wbinvd_hkid((u16)kvm_tdx->hkid, page);
> 	if (TDX_BUG_ON(err, TDH_PHYMEM_PAGE_WBINVD, kvm))
> 		return;
> 
> 	tdx_clear_page(page);
> }

Seems like tasteful macro-ization to me.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 05/12] KVM: TDX: Drop superfluous page pinning in S-EPT management
  2025-08-28  2:05     ` Edgecombe, Rick P
@ 2025-08-28 20:16       ` Sean Christopherson
  0 siblings, 0 replies; 85+ messages in thread
From: Sean Christopherson @ 2025-08-28 20:16 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Yan Y Zhao, kvm@vger.kernel.org, pbonzini@redhat.com,
	Vishal Annapurve, linux-kernel@vger.kernel.org,
	michael.roth@amd.com, Ira Weiny

On Thu, Aug 28, 2025, Rick P Edgecombe wrote:
> On Wed, 2025-08-27 at 16:33 +0800, Yan Zhao wrote:
> > On Tue, Aug 26, 2025 at 05:05:15PM -0700, Sean Christopherson wrote:
> > > Don't explicitly pin pages when mapping pages into the S-EPT, guest_memfd
> > > doesn't support page migration in any capacity, i.e. there are no migrate
> > > callbacks because guest_memfd pages *can't* be migrated.  See the WARN in
> > > kvm_gmem_migrate_folio().
> > Hmm, we implemented exactly the same patch at [1], where we explained the
> > potential problems of not holding page refcount, and the explored various
> > approaches, and related considerations.
> > 
> > [1] https://lore.kernel.org/all/20250807094241.4523-1-yan.y.zhao@intel.com/

Oh, nice!  I'll grab that and massage the changelog to break the hard dependencies
on the rest of the hugepage support.

> Yea, so the outcome of the huge page related discussion was that we should look
> at some sort of emergency page reclaim feature for the TDX module to use in the
> case of bugs. But in the meantime, to move forward without it, we'd use a
> solution like the one in this patch.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-28 18:52           ` Edgecombe, Rick P
@ 2025-08-28 20:26             ` Sean Christopherson
  2025-08-28 21:33               ` Edgecombe, Rick P
  2025-08-28 21:44             ` Sean Christopherson
  2025-08-29  2:42             ` Binbin Wu
  2 siblings, 1 reply; 85+ messages in thread
From: Sean Christopherson @ 2025-08-28 20:26 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Yan Y Zhao, kvm@vger.kernel.org, pbonzini@redhat.com,
	Vishal Annapurve, linux-kernel@vger.kernel.org,
	michael.roth@amd.com, Ira Weiny

On Thu, Aug 28, 2025, Rick P Edgecombe wrote:
> On Thu, 2025-08-28 at 10:00 -0700, Sean Christopherson wrote:
> > > tdx_td_finalize() now just returns -EINVAL in case of nr_premapped being !0.
> > > KVM_BUG_ON/WARN_ON should be also ok.
> > 
> > Ok, so I vaguely recall that I may have pushed back on using a scratch field in
> > "struct kvm_tdx" for temporary data (or maybe it was abusing vcpus[0] that I
> > disliked?), but what we ended up with is far worse.
> 
> I think it was also that the tdh_mr_extend() loop was too heavyweight for the
> fault path. But that was before we got to the kick+lock stuff.

Me confused.  This is pre-boot, not the normal fault path, i.e. blocking other
operations is not a concern.

If tdh_mr_extend() is too heavy for a non-preemptible section, then the current
code is also broken in the sense that there are no cond_resched() calls.  The
vast majority of TDX hosts will be using non-preemptible kernels, so without an
explicit cond_resched(), there's no practical difference between extending the
measurement under mmu_lock versus outside of mmu_lock.

_If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should
still do tdh_mem_page_add() under mmu_lock.
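
E.g. a rough, untested sketch of what the tdx_gmem_post_populate() side might
look like if MR.EXTEND stays outside mmu_lock (declarations of gpa, entry,
level_state and i omitted; tdx_mem_page_add() would then do only PAGE.ADD, and
the flags field in the scratch struct could go away):

	kvm_tdx->page_add = &page_add;
	ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
	kvm_tdx->page_add = NULL;
	if (ret < 0)
		goto out;

	/* S-EPT mapping and PAGE.ADD succeeded, extend the measurement. */
	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
			err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry,
					    &level_state);
			if (err) {
				ret = -EIO;
				break;
			}
		}
	}
out:
	put_page(page_add.src);
	return ret;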

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 06/12] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition
  2025-08-28 20:13       ` Edgecombe, Rick P
@ 2025-08-28 21:00         ` Sean Christopherson
  2025-08-28 21:19           ` Edgecombe, Rick P
  0 siblings, 1 reply; 85+ messages in thread
From: Sean Christopherson @ 2025-08-28 21:00 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Vishal Annapurve,
	linux-kernel@vger.kernel.org, Yan Y Zhao, michael.roth@amd.com,
	Ira Weiny

On Thu, Aug 28, 2025, Rick P Edgecombe wrote:
> On Thu, 2025-08-28 at 12:21 -0700, Sean Christopherson wrote:
> > Generally speaking, the number of KVM_BUG_ON()s is fine.  What we can do though
> > is reduce the amount of boilerplate and the number of paths that propagate a SEAMCALL
> > err through multiple layers, e.g. by eliminating single-use helpers (which is made
> > easier by reducing boilerplate and thus lines of code).
> > 
> > Concretely, if we combine the KVM_BUG_ON() usage with pr_tdx_error():
> > 
> > #define __TDX_BUG_ON(__err, __fn_str, __kvm, __fmt, __args...)			\
> > ({										\
> > 	struct kvm *_kvm = (__kvm);						\
> > 	bool __ret = !!(__err);							\
> > 										\
> > 	if (WARN_ON_ONCE(__ret && (!_kvm || !_kvm->vm_bugged))) {		\
> > 		if (_kvm)							\
> > 			kvm_vm_bugged(_kvm);					\
> > 		pr_err_ratelimited("SEAMCALL " __fn_str " failed: 0x%llx"	\
> > 				   __fmt "\n",  __err,  __args); 		\
> > 	}									\
> > 	unlikely(__ret);							\
> > })
> > 
> > #define TDX_BUG_ON(__err, __fn, __kvm)				\
> > 	__TDX_BUG_ON(__err, #__fn, __kvm, "%s", "")
> > 
> > #define TDX_BUG_ON_1(__err, __fn, __rcx, __kvm)			\
> > 	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx", __rcx)
> > 
> > #define TDX_BUG_ON_2(__err, __fn, __rcx, __rdx, __kvm)		\
> > 	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx, rdx 0x%llx", __rcx, __rdx)
> > 
> > #define TDX_BUG_ON_3(__err, __fn, __rcx, __rdx, __r8, __kvm)	\
> > 	__TDX_BUG_ON(__err, #__fn, __kvm, ", rcx 0x%llx, rdx 0x%llx, r8 0x%llx", __rcx, __rdx, __r8)
> 
> In general sounds good. But there it's a bit strange to specify them rcx, rdx,
> etc in a general helper. This is fallout from the existing chain of strange
> naming:
> 
> For example tdh_mem_range_block() plucks them from those registers and calls
> them ext_err1 due to their conditional meaning. Then KVM gives them some more
> meaning with 'entry' and 'level_state". Then prints them out as original
> register names. How about keeping the KVM names, like:
> 
> #define TDX_BUG_ON_2(__err, __fn, arg1, arg2, __kvm)		\
> 	__TDX_BUG_ON(__err, #__fn, __kvm,			\
> 		     ", " #arg1 " 0x%llx, " #arg2 " 0x%llx", arg1, arg2)
> 
> so you get: entry: 0x00 level:0xF00

Ooh, nice, I'll tack on a patch.

> I *think* there is a way to make this work like var args and have a single
> function, but it becomes impossible for people to read.

Heh, and would probably take two months to decipher the compiler errors in order
to get it working :-)

> > And a macro to handle retry when kicking vCPUs out of the guest:
> > 
> > #define tdh_do_no_vcpus(tdh_func, kvm, args...)					\
> > ({										\
> > 	struct kvm_tdx *__kvm_tdx = to_kvm_tdx(kvm);				\
> > 	u64 __err;								\
> > 										\
> > 	lockdep_assert_held_write(&kvm->mmu_lock);				\
> 
> There is a functional change 

Ugh, I missed that.  I'll do a prep change to make that explicit.

> in that the lock assert is not required if BUSY
> can be guaranteed to not happen. I don't think it should be needed
> today. I guess it's probably better to not rely on hitting rare races to catch
> an issue like that.

But that's not actually what the code does.  The lockdep assert won't trip because
KVM never removes S-EPT entries under read-lock:

		if (is_mirror_sp(sp)) {
			KVM_BUG_ON(shared, kvm);
			remove_external_spte(kvm, gfn, old_spte, level);
		}

Not because KVM actually guarantees -EBUSY is avoided.  So the current code is
flawed, it just doesn't cause problems.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 06/12] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition
  2025-08-28 21:00         ` Sean Christopherson
@ 2025-08-28 21:19           ` Edgecombe, Rick P
  2025-08-28 21:34             ` Sean Christopherson
  0 siblings, 1 reply; 85+ messages in thread
From: Edgecombe, Rick P @ 2025-08-28 21:19 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, michael.roth@amd.com,
	Weiny, Ira

On Thu, 2025-08-28 at 14:00 -0700, Sean Christopherson wrote:
> But that's not actually what the code does.  The lockdep assert won't trip because
> KVM never removes S-EPT entries under read-lock:

Right

> 
> 		if (is_mirror_sp(sp)) {
> 			KVM_BUG_ON(shared, kvm);
> 			remove_external_spte(kvm, gfn, old_spte, level);
> 		}
> 
> Not because KVM actually guarantees -EBUSY is avoided.  So the current code is
> flawed, it just doesn't cause problems.

Flawed, as in the lockdep should assert regardless of EBUSY? Seems good to me.
Probably if we wanted to try to call tdx_sept_remove_private_spte() under read
lock with special plans to avoid EBUSY we should think twice anyway.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-28 20:26             ` Sean Christopherson
@ 2025-08-28 21:33               ` Edgecombe, Rick P
  2025-08-28 21:57                 ` Sean Christopherson
                                   ` (2 more replies)
  0 siblings, 3 replies; 85+ messages in thread
From: Edgecombe, Rick P @ 2025-08-28 21:33 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, michael.roth@amd.com,
	Weiny, Ira

On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote:
> Me confused.  This is pre-boot, not the normal fault path, i.e. blocking other
> operations is not a concern.

Just was my recollection of the discussion. I found it:
https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/

> 
> If tdh_mr_extend() is too heavy for a non-preemptible section, then the current
> code is also broken in the sense that there are no cond_resched() calls.  The
> vast majority of TDX hosts will be using non-preemptible kernels, so without an
> explicit cond_resched(), there's no practical difference between extending the
> measurement under mmu_lock versus outside of mmu_lock.
> 
> _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should
> still do tdh_mem_page_add() under mmu_lock.

I just did a quick test and we should be on the order of <1 ms per page for the
full loop. I can try to get some more formal test data if it matters. But that
doesn't sound too horrible?
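
(For reference, and assuming TDX_EXTENDMR_CHUNKSIZE is 256 bytes, the loop is
PAGE_SIZE / TDX_EXTENDMR_CHUNKSIZE = 4096 / 256 = 16 TDH.MR.EXTEND SEAMCALLs
per 4KiB page, so <1 ms per page seems plausible.)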

tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be
inside it. But maybe a better reason is that we could better handle errors
outside the fault. (i.e. no 5 line comment about why not to return an error in
tdx_mem_page_add() due to code in another file).

I wonder if Yan can give an analysis of any zapping races if we do that.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 06/12] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition
  2025-08-28 21:19           ` Edgecombe, Rick P
@ 2025-08-28 21:34             ` Sean Christopherson
  0 siblings, 0 replies; 85+ messages in thread
From: Sean Christopherson @ 2025-08-28 21:34 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Vishal Annapurve,
	linux-kernel@vger.kernel.org, Yan Y Zhao, michael.roth@amd.com,
	Ira Weiny

On Thu, Aug 28, 2025, Rick P Edgecombe wrote:
> On Thu, 2025-08-28 at 14:00 -0700, Sean Christopherson wrote:
> > But that's not actually what the code does.  The lockdep assert won't trip because
> > KVM never removes S-EPT entries under read-lock:
> 
> Right
> 
> > 
> > 		if (is_mirror_sp(sp)) {
> > 			KVM_BUG_ON(shared, kvm);
> > 			remove_external_spte(kvm, gfn, old_spte, level);
> > 		}
> > 
> > Not because KVM actually guarantees -EBUSY is avoided.  So the current code is
> > flawed, it just doesn't cause problems.
> 
> Flawed, as in the lockdep should assert regardless of EBUSY?

Yep, exactly.

> Seems good to me.
> Probably if we wanted to try to call tdx_sept_remove_private_spte() under read
> lock with special plans to avoid EBUSY we should think twice anyway.

Heh, add a few zeros to "twice" :-D

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-28 18:52           ` Edgecombe, Rick P
  2025-08-28 20:26             ` Sean Christopherson
@ 2025-08-28 21:44             ` Sean Christopherson
  2025-08-29  2:42             ` Binbin Wu
  2 siblings, 0 replies; 85+ messages in thread
From: Sean Christopherson @ 2025-08-28 21:44 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Yan Y Zhao, kvm@vger.kernel.org, pbonzini@redhat.com,
	Vishal Annapurve, linux-kernel@vger.kernel.org,
	michael.roth@amd.com, Ira Weiny

On Thu, Aug 28, 2025, Rick P Edgecombe wrote:
> It's unfortunate we didn't have the gmem mmap() support then. Otherwise we could
> have just encrypted it in place.

Yeah, hindsight is definitely 20/20 on that front.  Though I suspect that we'd
never have landed anything if we tried to go straight to supporting mmap().

> Otherwise, I'm really glad to see these cleanups/scrutiny. I kind of got the
> impression that you wanted to see less TDX churn for a bit.

Heh, believe me, I'm not exactly ecstatic to dive into this.  But, I don't want
to just ignore it, because some of these quirks/warts are already getting in the
way of new development, and if I/we delay such clean ups, the pain is only going
to get worse (and the total cost will be much higher).

Fatigue is a bit of a problem for me, but the biggest issue really is just lack
of cycles (the quick feedback and testing y'all are providing helps tremendously
on that front).  And lack of cycles should be mitigated to some extent as I
(re)familiarize myself with the code; I recognized most of the concepts, but
there are definitely a few places where I'm completely lost, and that makes
reviewing things like dynamic PAMT and hugepage support excruciatingly slow.

> The fact is, the TDX base support still needs more work like this.

IMO, the most important things to address are cases where the design choices we
made turned out to be suboptimal, i.e. where the behavior itself is problematic.
Code cleanups are definitely welcome too, but my priority is polishing the core
design.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-28 21:33               ` Edgecombe, Rick P
@ 2025-08-28 21:57                 ` Sean Christopherson
  2025-08-28 23:17                   ` Edgecombe, Rick P
  2025-08-29  6:08                   ` Yan Zhao
  2025-08-28 22:06                 ` Ira Weiny
  2025-08-29  6:06                 ` Yan Zhao
  2 siblings, 2 replies; 85+ messages in thread
From: Sean Christopherson @ 2025-08-28 21:57 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Vishal Annapurve,
	linux-kernel@vger.kernel.org, Yan Y Zhao, michael.roth@amd.com,
	Ira Weiny

On Thu, Aug 28, 2025, Rick P Edgecombe wrote:
> On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote:
> > Me confused.  This is pre-boot, not the normal fault path, i.e. blocking other
> > operations is not a concern.
> 
> Just was my recollection of the discussion. I found it:
> https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/

Ugh, another case where an honest question gets interpreted as "do it this way". :-(

> > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current
> > code is also broken in the sense that there are no cond_resched() calls.  The
> > vast majority of TDX hosts will be using non-preemptible kernels, so without an
> > explicit cond_resched(), there's no practical difference between extending the
> > measurement under mmu_lock versus outside of mmu_lock.
> > 
> > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should
> > still do tdh_mem_page_add() under mmu_lock.
> 
> I just did a quick test and we should be on the order of <1 ms per page for the
> full loop. I can try to get some more formal test data if it matters. But that
> doesn't sound too horrible?

1ms is totally reasonable.  I wouldn't bother with any more testing.

> tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be
> inside it.

Agreed, and it would eliminate the need for a "flags" argument.  But keeping it
in the mmu_lock critical section means KVM can WARN on failures.  If it's moved
out, then zapping S-EPT entries could induce failure, and I don't think it's
worth going through the effort to ensure it's impossible to trigger S-EPT removal.

Note, removing S-EPT entries during initialization of the image isn't something
I want to officially support, rather it's an endless stream of whack-a-mole due to
obscure edge cases.

Hmm, actually, maybe I take that back.  slots_lock prevents memslot updates,
filemap_invalidate_lock() prevents guest_memfd updates, and mmu_notifier events
shouldn't ever hit S-EPT.  I was worried about kvm_zap_gfn_range(), but the call
from sev.c is obviously mutually exclusive, TDX disallows KVM_X86_QUIRK_IGNORE_GUEST_PAT
so same goes for kvm_noncoherent_dma_assignment_start_or_stop, and while I'm 99%
certain there's a way to trip __kvm_set_or_clear_apicv_inhibit(), the APIC page
has its own non-guest_memfd memslot and so can't be used for the initial image,
which means that too is mutually exclusive.

So yeah, let's give it a shot.  Worst case scenario we're wrong and TDH_MR_EXTEND
errors can be triggered by userspace.

> But maybe a better reason is that we could better handle errors
> outside the fault. (i.e. no 5 line comment about why not to return an error in
> tdx_mem_page_add() due to code in another file).
> 
> I wonder if Yan can give an analysis of any zapping races if we do that.

As above, I think we're good?

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-28 21:33               ` Edgecombe, Rick P
  2025-08-28 21:57                 ` Sean Christopherson
@ 2025-08-28 22:06                 ` Ira Weiny
  2025-08-28 23:17                   ` Sean Christopherson
  2025-08-29  6:06                 ` Yan Zhao
  2 siblings, 1 reply; 85+ messages in thread
From: Ira Weiny @ 2025-08-28 22:06 UTC (permalink / raw)
  To: Edgecombe, Rick P, seanjc@google.com
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, michael.roth@amd.com,
	Weiny, Ira

Edgecombe, Rick P wrote:
> On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote:
> > Me confused.  This is pre-boot, not the normal fault path, i.e. blocking other
> > operations is not a concern.
> 
> Just was my recollection of the discussion. I found it:
> https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/
> 
> > 
> > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current
> > code is also broken in the sense that there are no cond_resched() calls.  The
> > vast majority of TDX hosts will be using non-preemptible kernels, so without an
> > explicit cond_resched(), there's no practical difference between extending the
> > measurement under mmu_lock versus outside of mmu_lock.
> > 
> > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should
> > still do tdh_mem_page_add() under mmu_lock.
> 
> I just did a quick test and we should be on the order of <1 ms per page for the
> full loop. I can try to get some more formal test data if it matters. But that
> doesn't sound too horrible?
> 
> tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be
> inside it.

I'm probably not following this conversation, so stupid question:  It
doesn't need to be in the lock because user space should not be setting up
memory and extending the measurement in an asynchronous way.  Is that
correct?

> But maybe a better reason is that we could better handle errors
> outside the fault. (i.e. no 5 line comment about why not to return an error in
> tdx_mem_page_add() due to code in another file).
> 
> I wonder if Yan can give an analysis of any zapping races if we do that.

When you say analysis, you mean detecting user space did something wrong
and failing gracefully?  Is that correct?

Ira

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 08/12] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent
  2025-08-28 19:14       ` Edgecombe, Rick P
@ 2025-08-28 22:33         ` Sean Christopherson
  2025-08-28 23:18           ` Edgecombe, Rick P
  0 siblings, 1 reply; 85+ messages in thread
From: Sean Christopherson @ 2025-08-28 22:33 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Yan Y Zhao, kvm@vger.kernel.org, pbonzini@redhat.com,
	Vishal Annapurve, linux-kernel@vger.kernel.org,
	michael.roth@amd.com, Ira Weiny

On Thu, Aug 28, 2025, Rick P Edgecombe wrote:
> On Thu, 2025-08-28 at 14:48 +0800, Yan Zhao wrote:
> > Hmm, I still think it's safer to keep the nr_premapped to detect any unexpected
> > code change.
> 
> When I was checking patch 6 I saw how many more KVM_BUG_ON()s we ended up with in
> TDX code compared to the rest of KVM. (even after we dropped a bunch during
> development) We have to differentiate between good safety and "safety" that is
> really just propping up brittle code. Each KVM_BUG_ON() is a hint that there
> might be design issues.

Nah, I think we're good.  The majority of the asserts are on SEAMCALLs, and those
are no different than the WARN_ONCE() in vmx_insn_failed(), just spread out to
individual call sites.

Once those are factored out, the numbers are entirely reasonable (WARNs and KVM_BUG_ON
are both assertions against bugs, one is just guaranteed to be fatal to the VM).

  $ git grep -e KVM_BUG_ON -e WARN_ vmx/tdx.c | wc -l
  25
  $ git grep -e KVM_BUG_ON -e WARN_  | wc -l
  459

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-28 21:57                 ` Sean Christopherson
@ 2025-08-28 23:17                   ` Edgecombe, Rick P
  2025-08-29  6:08                   ` Yan Zhao
  1 sibling, 0 replies; 85+ messages in thread
From: Edgecombe, Rick P @ 2025-08-28 23:17 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, michael.roth@amd.com,
	Weiny, Ira

On Thu, 2025-08-28 at 14:57 -0700, Sean Christopherson wrote:
> Agreed, and it would eliminate the need for a "flags" argument.  But keeping it
> in the mmu_lock critical section means KVM can WARN on failures.  If it's moved
> out, then zapping S-EPT entries could induce failure, and I don't think it's
> worth going through the effort to ensure it's impossible to trigger S-EPT removal.
> 
> Note, removing S-EPT entries during initialization of the image isn't something
> I want to officially support, rather it's an endless stream of whack-a-mole due to
> obscure edge cases.
> 
> Hmm, actually, maybe I take that back.  slots_lock prevents memslot updates,
> filemap_invalidate_lock() prevents guest_memfd updates, and mmu_notifier events
> shouldn't ever hit S-EPT.  I was worried about kvm_zap_gfn_range(), but the call
> from sev.c is obviously mutually exclusive, TDX disallows KVM_X86_QUIRK_IGNORE_GUEST_PAT
> so same goes for kvm_noncoherent_dma_assignment_start_or_stop, and
> 

Yea, in the other thread Yan was suggesting the same thing from the KVM side:
https://lore.kernel.org/all/aK%2Fsdr2OQqYv9DBZ@yzhao56-desk.sh.intel.com/

But she was concerned about "Unexpected zaps" (kvm_zap_gfn_range()). I think maybe
we could think about a KVM_BUG_ON() in the mirror EPT case to cover it from
another angle. IIRC we discussed this at some point.

I was wondering about TDH.MR.EXTEND error conditions. Coming back now, I'm not
sure what I was thinking.

> while I'm 99% certain there's a way to trip __kvm_set_or_clear_apicv_inhibit(),
> the APIC page has its own non-guest_memfd memslot and so can't be used for the
> initial image, which means that too is mutually exclusive.

Hmm, well maybe KVM_BUG_ON() for kvm_zap_gfn_range() only if this gets
addressed.

> 
> So yeah, let's give it a shot.  Worst case scenario we're wrong and TDH_MR_EXTEND
> errors can be triggered by userspace.
> 
> > But maybe a better reason is that we could better handle errors
> > outside the fault. (i.e. no 5 line comment about why not to return an error in
> > tdx_mem_page_add() due to code in another file).
> > 
> > I wonder if Yan can give an analysis of any zapping races if we do that.
> 
> As above, I think we're good?

Works for me.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-28 22:06                 ` Ira Weiny
@ 2025-08-28 23:17                   ` Sean Christopherson
  2025-08-29  0:35                     ` Ira Weiny
  0 siblings, 1 reply; 85+ messages in thread
From: Sean Christopherson @ 2025-08-28 23:17 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Rick P Edgecombe, kvm@vger.kernel.org, pbonzini@redhat.com,
	Vishal Annapurve, linux-kernel@vger.kernel.org, Yan Y Zhao,
	michael.roth@amd.com

On Thu, Aug 28, 2025, Ira Weiny wrote:
> Edgecombe, Rick P wrote:
> > On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote:
> > > Me confused.  This is pre-boot, not the normal fault path, i.e. blocking other
> > > operations is not a concern.
> > 
> > Just was my recollection of the discussion. I found it:
> > https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/
> > 
> > > 
> > > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current
> > > code is also broken in the sense that there are no cond_resched() calls.  The
> > > vast majority of TDX hosts will be using non-preemptible kernels, so without an
> > > explicit cond_resched(), there's no practical difference between extending the
> > > measurement under mmu_lock versus outside of mmu_lock.
> > > 
> > > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should
> > > still do tdh_mem_page_add() under mmu_lock.
> > 
> > I just did a quick test and we should be on the order of <1 ms per page for the
> > full loop. I can try to get some more formal test data if it matters. But that
> > doesn't sound too horrible?
> > 
> > tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be
> > inside it.
> 
> I'm probably not following this conversation, so stupid question:  It
> doesn't need to be in the lock because user space should not be setting up
> memory and extending the measurement in an asynchronous way.  Is that
> correct?

No, from userspace's perspective ADD+MEASURE is fully serialized.  ADD "needs"
to be under mmu_lock to guarantee consistency between the mirror EPT and the
"real" S-EPT entries.  E.g. if ADD is done after the fact, then KVM can end up
with a PRESENT M-EPT entry but a corresponding S-EPT entry that is !PRESENT.
That causes a pile of problems because it breaks KVM's fundamental assumption
that M-EPT and S-EPT entries are updated in lock-step.

TDH_MR_EXTEND doesn't have the same consistency issue.  If it fails, the
only thing that's left in a bad state is the measurement.  That's obviously not
ideal either, but we can handle that by forcefully terminating the VM, without
opening up KVM to edge cases that would otherwise be impossible.
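
For reference, a minimal sketch of that "bug the VM" handling (mirroring the diff
posted elsewhere in the thread), assuming the MR.EXTEND loop runs right after
PAGE.ADD while mmu_lock is still held:

	for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
		err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state);
		/*
		 * The S-EPT mapping itself is fine; only the measurement is
		 * suspect, so bug the VM rather than unwinding the mapping.
		 */
		if (KVM_BUG_ON(err, kvm)) {
			pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state);
			break;
		}
	}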

> > But maybe a better reason is that we could better handle errors
> > outside the fault. (i.e. no 5 line comment about why not to return an error in
> > tdx_mem_page_add() due to code in another file).
> > 
> > I wonder if Yan can give an analysis of any zapping races if we do that.
> 
> When you say analysis, you mean detecting user space did something wrong
> and failing gracefully?  Is that correct?

More specifically, whether or not KVM can WARN without the WARN being user
triggerable.  Kernel policy is that WARNs must not be triggerable absent kernel,
hardware, or firmware bugs.  What we're trying to figure out is if there's a
flow that can be triggered by userspace (misbehaving or not) that would trip a
WARN even if KVM is operating as expected.  I'm pretty sure the answer is "no".

Oh, and WARNing here is desirable, because it improves the chances of detecting
a fatal-to-the-VM bug, e.g. in KVM and/or in the TDX-Module.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 08/12] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent
  2025-08-28 22:33         ` Sean Christopherson
@ 2025-08-28 23:18           ` Edgecombe, Rick P
  0 siblings, 0 replies; 85+ messages in thread
From: Edgecombe, Rick P @ 2025-08-28 23:18 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, michael.roth@amd.com,
	Weiny, Ira

On Thu, 2025-08-28 at 15:33 -0700, Sean Christopherson wrote:
> On Thu, Aug 28, 2025, Rick P Edgecombe wrote:
> > On Thu, 2025-08-28 at 14:48 +0800, Yan Zhao wrote:
> > > Hmm, I still think it's safer to keep the nr_premapped to detect any unexpected
> > > code change.
> > 
> > When I was checking patch 6 I saw how many more KVM_BUG_ON()s we ended up with in
> > TDX code compared to the rest of KVM. (even after we dropped a bunch during
> > development) We have to differentiate between good safety and "safety" that is
> > really just propping up brittle code. Each KVM_BUG_ON() is a hint that there
> > might be design issues.
> 
> Nah, I think we're good.  The majority of the asserts are on SEAMCALLs, and those
> are no different than the WARN_ONCE() in vmx_insn_failed(), just spread out to
> individual call sites.
> 
> Once those are factored out, the numbers are entirely reasonable (WARNs and KVM_BUG_ON
> are both assertions against bugs, one is just guaranteed to be fatal to the VM).
> 
>   $ git grep -e KVM_BUG_ON -e WARN_ vmx/tdx.c | wc -l
>   25
>   $ git grep -e KVM_BUG_ON -e WARN_  | wc -l
>   459

Hmm, ok.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 00/12] KVM: x86/mmu: TDX post-populate cleanups
  2025-08-28 19:01 ` Edgecombe, Rick P
@ 2025-08-28 23:19   ` Sean Christopherson
  0 siblings, 0 replies; 85+ messages in thread
From: Sean Christopherson @ 2025-08-28 23:19 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, kvm@vger.kernel.org, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, michael.roth@amd.com,
	Weiny, Ira

On Thu, Aug 28, 2025, Edgecombe, Rick P wrote:
> On Tue, 2025-08-26 at 17:05 -0700, Sean Christopherson wrote:
> > RFC as this is compile tested only (mostly due to lack of access to a TDX
> > capable system, but also due to lack of cycles).
> 
> Let us know how we could best help with this. The series fails the tests because
> of the page size issue Yan pointed out. We could just review and test a v2, or if
> you want us to pull together the feedback, test the result, and repost, please
> let us know. I think either should work from our end.

I'll post a v2, it's going to look quite different.

> I suspect Vishal could hook you up with a TDX machine. But if you need any setup
> help there too, please shout.

Oh, he can, I just haven't crossed that testing bridge yet (ditto for SNP).  I'll
do so someday, but for now I'll abuse your generosity and throw noodles at ya :-)

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-28 23:17                   ` Sean Christopherson
@ 2025-08-29  0:35                     ` Ira Weiny
  0 siblings, 0 replies; 85+ messages in thread
From: Ira Weiny @ 2025-08-29  0:35 UTC (permalink / raw)
  To: Sean Christopherson, Ira Weiny
  Cc: Rick P Edgecombe, kvm@vger.kernel.org, pbonzini@redhat.com,
	Vishal Annapurve, linux-kernel@vger.kernel.org, Yan Y Zhao,
	michael.roth@amd.com

Sean Christopherson wrote:
> On Thu, Aug 28, 2025, Ira Weiny wrote:
> > Edgecombe, Rick P wrote:
> > > On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote:
> > > > Me confused.  This is pre-boot, not the normal fault path, i.e. blocking other
> > > > operations is not a concern.
> > > 
> > > Just was my recollection of the discussion. I found it:
> > > https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/
> > > 
> > > > 
> > > > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current
> > > > code is also broken in the sense that there are no cond_resched() calls.  The
> > > > vast majority of TDX hosts will be using non-preemptible kernels, so without an
> > > > explicit cond_resched(), there's no practical difference between extending the
> > > > measurement under mmu_lock versus outside of mmu_lock.
> > > > 
> > > > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should
> > > > still do tdh_mem_page_add() under mmu_lock.
> > > 
> > > I just did a quick test and we should be on the order of <1 ms per page for the
> > > full loop. I can try to get some more formal test data if it matters. But that
> > > doesn't sound too horrible?
> > > 
> > > tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be
> > > inside it.
> > 
> > I'm probably not following this conversation, so stupid question:  It
> > doesn't need to be in the lock because user space should not be setting up
> > memory and extending the measurement in an asynchronous way.  Is that
> > correct?
> 
> No, from userspace's perspective ADD+MEASURE is fully serialized.  ADD "needs"
> to be under mmu_lock to guarantee consistency between the mirror EPT and the
> "real" S-EPT entries.  E.g. if ADD is done after the fact, then KVM can end up
> with a PRESENT M-EPT entry but a corresponding S-EPT entry that is !PRESENT.
> That causes a pile of problems because it breaks KVM's fundamental assumption
> that M-EPT and S-EPT entries are updated in lock-step.

Ok yes, I think I worded my query incorrectly but this makes things clear.

Thanks!

> 
> TDH_MR_EXTEND doesn't have the same consistency issue.  If it fails, the
> only thing that's left in a bad state is the measurement.  That's obviously not
> ideal either, but we can handle that by forcefully terminating the VM, without
> opening up KVM to edge cases that would otherwise be impossible.
> 
> > > But maybe a better reason is that we could better handle errors
> > > outside the fault. (i.e. no 5 line comment about why not to return an error in
> > > tdx_mem_page_add() due to code in another file).
> > > 
> > > I wonder if Yan can give an analysis of any zapping races if we do that.
> > 
> > When you say analysis, you mean detecting user space did something wrong
> > and failing gracefully?  Is that correct?
> 
> More specifically, whether or not KVM can WARN without the WARN being user
> triggerable.  Kernel policy is that WARNs must not be triggerable absent kernel,
> hardware, or firmware bugs.  What we're trying to figure out is if there's a
> flow that can be triggered by userspace (misbehaving or not) that would trip a
> WARN even if KVM is operating as expected.  I'm pretty sure the answer is "no".
> 
> Oh, and WARNing here is desirable, because it improves the chances of detecting
> a fatal-to-the-VM bug, e.g. in KVM and/or in the TDX-Module.

OK...  In other areas of the kernel if the user misbehaves it is
reasonable to fail an operation.  I would think that being fatal to the VM
would be fine if QEMU did not properly synchronize ADD, measurement, and
finalize, for example.  Am I wrong in that assumption?

Ira

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 07/12] KVM: TDX: Avoid a double-KVM_BUG_ON() in tdx_sept_zap_private_spte()
  2025-08-28 14:50     ` Edgecombe, Rick P
@ 2025-08-29  1:10       ` Yan Zhao
  0 siblings, 0 replies; 85+ messages in thread
From: Yan Zhao @ 2025-08-29  1:10 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org,
	Annapurve, Vishal, linux-kernel@vger.kernel.org,
	michael.roth@amd.com, Weiny, Ira

On Thu, Aug 28, 2025 at 10:50:06PM +0800, Edgecombe, Rick P wrote:
> On Wed, 2025-08-27 at 19:19 -0700, Rick Edgecombe wrote:
> > On Tue, 2025-08-26 at 17:05 -0700, Sean Christopherson wrote:
> > > Return -EIO immediately from tdx_sept_zap_private_spte() if the number of
> > > to-be-added pages underflows, so that the following "KVM_BUG_ON(err, kvm)"
> > > isn't also triggered.  Isolating the check from the "is premap error"
> > > if-statement will also allow adding a lockdep assertion that premap errors
> > > are encountered if and only if slots_lock is held.
> > > 
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > ---
> > 
> > Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> There is actually another KVM_BUG_ON() in the path here:
> 
> static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
> 				 int level)
> {
> 	kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
> 	int ret;
> 
> 	/*
> 	 * External (TDX) SPTEs are limited to PG_LEVEL_4K, and external
> 	 * PTs are removed in a special order, involving free_external_spt().
> 	 * But remove_external_spte() will be called on non-leaf PTEs via
> 	 * __tdp_mmu_zap_root(), so avoid the error the former would return
> 	 * in this case.
> 	 */
> 	if (!is_last_spte(old_spte, level))
> 		return;
> 
> 	/* Zapping leaf spte is allowed only when write lock is held. */
> 	lockdep_assert_held_write(&kvm->mmu_lock);
> 	/* Because write lock is held, operation should success. */
> 	ret = kvm_x86_call(remove_external_spte)(kvm, gfn, level, old_pfn);
> ->	KVM_BUG_ON(ret, kvm);
> 
> We don't need to do it in this patch, but we could remove the return value in
> .remove_external_spte, and the KVM_BUG_ON(). Just let remove_external_spte
> handle it internally.
+1. Triggering KVM_BUG_ON() only in TDX internally is better.
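
A rough sketch of that shape (hypothetical, not part of this series), where the
hook returns void and the TDX implementation does its own KVM_BUG_ON() on SEAMCALL
failure, so the common TDP MMU code no longer checks a return value:

static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
				 int level)
{
	/* External SPTEs are 4K-only; non-leaf PTEs are torn down elsewhere. */
	if (!is_last_spte(old_spte, level))
		return;

	/* Zapping a leaf SPTE still requires mmu_lock held for write. */
	lockdep_assert_held_write(&kvm->mmu_lock);

	/* No return value to check; TDX bugs the VM internally if the zap fails. */
	kvm_x86_call(remove_external_spte)(kvm, gfn, level, spte_to_pfn(old_spte));
}

The corresponding tdx_sept_remove_private_spte() would then KVM_BUG_ON() its own
SEAMCALL errors and return void.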


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 02/12] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU
  2025-08-28 19:40           ` Sean Christopherson
@ 2025-08-29  1:16             ` Yan Zhao
  2025-09-01  0:39               ` Yan Zhao
  0 siblings, 1 reply; 85+ messages in thread
From: Yan Zhao @ 2025-08-29  1:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Rick P Edgecombe, kvm@vger.kernel.org, pbonzini@redhat.com,
	Vishal Annapurve, linux-kernel@vger.kernel.org,
	michael.roth@amd.com, Ira Weiny

On Thu, Aug 28, 2025 at 12:40:20PM -0700, Sean Christopherson wrote:
> On Thu, Aug 28, 2025, Yan Zhao wrote:
> > On Thu, Aug 28, 2025 at 09:26:50AM +0800, Edgecombe, Rick P wrote:
> > > On Wed, 2025-08-27 at 17:54 -0700, Rick Edgecombe wrote:
> > > > > 
> > > > > Then, what about setting
> > > > >                 .max_level = PG_LEVEL_4K,
> > > > > directly?
> > > > > 
> > > > > Otherwise, the "(KVM_BUG_ON(level != PG_LEVEL_4K, kvm)" would be triggered
> > > > > in
> > > > > tdx_sept_set_private_spte().
> > > > 
> > > > Yes this fails to boot a TD. With max_level = PG_LEVEL_4K it passes the full
> > > > tests. I don't think it's ideal to encode PAGE.ADD details here though.
> > > > 
> > > > But I'm not immediately clear what is going wrong. The old struct
> > > > kvm_page_fault
> > > > looks pretty similar. Did you root cause it?
> > >
> > > Oh, duh. Because we are passing in the PFN now so it can't know the size. So
> > > it's not about PAGE.ADD actually.
> > Right, it's because the previous kvm_tdp_map_page() updates fault->max_level in
> > kvm_mmu_faultin_pfn_private() by checking the private_max_mapping_level hook.
> > 
> > However, kvm_tdp_mmu_map_private_pfn() skips the faultin step and goes straight
> > to kvm_tdp_mmu_map().
> > 
> > > Still, how about calling the function kvm_tdp_mmu_map_private_pfn_4k(), or
> > > passing in the level?
> > Looks [1] can also address this issue. Not sure which one Sean prefers.
> > 
> > [1] https://lore.kernel.org/all/20250729225455.670324-15-seanjc@google.com
> 
> That won't fix this issue though, because @fault will be valid and so max_level
Ah, right, I missed that you composed a fault...

> will still be KVM_MAX_HUGEPAGE_LEVEL.  Which is by design, the intent in that
> flow is that KVM should have gotten the level when getting the pfn from gmem.
> 
> IIUC, this particular flow _must_ map at 4KiB, so I think forcing PG_LEVEL_4K is
> the right solution.
Forcing PG_LEVEL_4K looks good to me.
I was worried that SEV might want to use higher levels.
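
As a concrete sketch of that fix, only the relevant fields of the initializer are
shown here (the real helper from patch 2 sets more fields, so treat this as
illustrative rather than the actual code):

	/* In kvm_tdp_mmu_map_private_pfn(): PAGE.ADD only operates on 4KiB pages. */
	struct kvm_page_fault fault = {
		.addr		= gfn_to_gpa(gfn),
		.gfn		= gfn,
		.pfn		= pfn,
		.is_private	= true,
		.max_level	= PG_LEVEL_4K,
	};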

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-28 17:00         ` Sean Christopherson
  2025-08-28 18:52           ` Edgecombe, Rick P
@ 2025-08-29  2:31           ` Yan Zhao
  2025-08-29  6:33             ` Yan Zhao
  1 sibling, 1 reply; 85+ messages in thread
From: Yan Zhao @ 2025-08-29  2:31 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Michael Roth, Ira Weiny,
	Vishal Annapurve, Rick Edgecombe

On Thu, Aug 28, 2025 at 10:00:28AM -0700, Sean Christopherson wrote:
> On Thu, Aug 28, 2025, Yan Zhao wrote:
> > On Wed, Aug 27, 2025 at 12:08:27PM -0700, Sean Christopherson wrote:
> > > On Wed, Aug 27, 2025, Yan Zhao wrote:
> > > > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote:
> > > > > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> > > > > +	/*
> > > > > +	 * If the TD isn't finalized/runnable, then userspace is initializing
> > > > > +	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Increment the number of
> > > > > +	 * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD
> > > > > +	 * requires a pre-existing S-EPT mapping).  KVM_TDX_FINALIZE_VM checks
> > > > > +	 * the counter to ensure all mapped pages have been added to the image,
> > > > > +	 * to prevent running the TD with uninitialized memory.
> > > > To prevent the mismatch between mirror EPT and the S-EPT?
> > > 
> > > No?  Because KVM bumps the count when installing the S-EPT and decrements it
> > > on AUG, so I don't see how nr_premapped guards against M-EPT vs. S-EPT issues?
> > Hmm, I think there must be some misunderstanding.
> 
> Yeah, I forgot that AUG and ADD create the leaf S-EPT entries.
> 
> > Before userspace invokes KVM_TDX_FINALIZE_VM,
> > =======
> > 1. the normal path (userspace invokes KVM_TDX_INIT_MEM_REGION).
> >    (1) KVM holds slot_lock and filemap lock.
> >    (2) KVM invokes kvm_tdp_map_page() (or kvm_tdp_mmu_map_private_pfn() in
> >        patch 2).
> >        KVM increases nr_premapped in tdx_sept_set_private_spte() to indicate
> >        that there's a page mapped in M-EPT, while it's not yet installed in
> >        S-EPT.
> >    (3) KVM invokes TDH.MEM.PAGE.ADD and decreases nr_premapped, indicating the
> >        page has been mapped in S-EPT too.
> >        
> >    As the name of nr_premapped indicates, the count means a page is pre-mapped
> >    in the M-EPT, before its real mapping in the S-EPT.
> >    If ADD fails in step (3), nr_premapped will not be decreased.
> > 
> >    With just the normal path, nr_premapped should return to 0 after all
> >    KVM_TDX_INIT_MEM_REGIONs.
> >       
> > 
> > 2. Expected zap paths (e.g. If userspace does something strange, such as
> >    removing a slot after KVM_TDX_INIT_MEM_REGION)
> >    Those zap paths could be triggered by
> >    1) userspace performs a page attribute conversion
> >    2) userspace invokes gmem punch hole
> >    3) userspace removes a slot
> >    As all those paths either hold a slot_lock or a filemap lock, they can't
> >    contend with tdx_vcpu_init_mem_region() (tdx_vcpu_init_mem_region holds both
> >    slot_lock and internally filemap lock).
> >    Consequently, those zaps must occur
> >    a) before kvm_tdp_map_page() or
> >    b) after TDH.MEM.PAGE.ADD.
> >    For a), tdx_sept_zap_private_spte() won't be invoked as the page is not
> >            mapped in M-EPT either;
> >    For b), tdx_sept_zap_private_spte() should succeed, as the BLOCK and REMOVE
> >            SEAMCALLs are following the ADD.
> >    nr_premapped is therefore unchanged, since it does not change the consistency
> >    between M-EPT and S-EPT.
> > 
> > 3. Unexpected zaps (such as kvm_zap_gfn_range()).
> 
> Side topic related to kvm_zap_gfn_range(), the KVM_BUG_ON() in vt_refresh_apicv_exec_ctrl()
> is flawed.  If kvm_recalculate_apic_map() fails to allocate an optimized map, KVM
> will mark APICv as inhibited, i.e. the associated WARN_ON_ONCE() is effectively
> user-triggerable.
> 
> Easiest thing would be to mark the vCPU as dead (though we obviously need
> "KVM: Never clear KVM_REQ_VM_DEAD from a vCPU's requests" for that to be robust).
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index dbab1c15b0cd..1c0b43ff9544 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -719,7 +719,8 @@ static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
>  static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
>  {
>         if (is_td_vcpu(vcpu)) {
> -               KVM_BUG_ON(!kvm_vcpu_apicv_active(vcpu), vcpu->kvm);
> +               if (!kvm_vcpu_apicv_active(vcpu))
> +                       kvm_make_request(KVM_REQ_VM_DEAD, vcpu);
>                 return;
>         }
>  
> >    Those zaps are currently just paranoid ones. Not found in any existing paths
> >    yet. i.e.,
> >    We want to detect any future code or any missed code piecies, which invokes
> >    kvm_zap_gfn_range() (or maybe zaps under read mmu_lock).
> > 
> >    As those zaps do not necessarily hold slot_lock or filemap lock, they may
> >    occur after installing M-EPT and before installing S-EPT.
> >    As a result, the BLOCK fails and tdx_is_sept_zap_err_due_to_premap() returns
> >    true.
> >    Decreasing nr_premapped here to indicate the count of pages mapped in M-EPT
> >    but not in S-EPT decreases.
> > 
> >    TDH.MEM.PAGE.ADD after this zap can still succeed. If this occurs, the page
> >    will be mapped in S-EPT only. As KVM also decreases nr_premapped after a
> >    successful TDH.MEM.PAGE.ADD, the nr_premapped will be <0 in the end.
> >    So, we will be able to detect those unexpected zaps.
> >    
> > 
> > When userspace invokes KVM_TDX_FINALIZE_VM,
> > =======
> > The nr_premapped must be 0 before tdx_td_finalize() succeeds.
> > 
> > The nr_premapped could be 0 if
> > (1) userspace invokes KVM_TDX_INIT_MEM_REGIONs as in a normal way.
> > (2) userspace never triggers any KVM_TDX_INIT_MEM_REGION.
> > (3) userspace triggers KVM_TDX_INIT_MEM_REGION but zaps all initial memory
> >     regions.
> > 
> > For (2)and(3), KVM_TDX_FINALIZE_VM can still succeed.
> 
> Ya, we're in agreement on what can happen.  I think all of the confusion was due
> to me forgetting that TDH.MEM.PAGE.ADD is what actually installs the leaf S-EPT
> entry.
> 
> > So, TD can still run with uninitialized memory.
> 
> No, the TD can never run with truly uninitialized memory.  By "uninitialized", I
> mean memory that the guest can access and which has not been written to.  Again,
> my confusion was due to thinking a page was already mapped into the guest, but
> awaiting TDH.MEM.PAGE.ADD to 
>  
> > > Side topic, why does KVM tolerate tdh_mem_page_add() failure?  IIUC, playing
> > We don't. It returns -EBUSY or -EIO immediately.
> 
> But that _is_ tolerating failure, in the sense that KVM doesn't prevent further
> actions on the VM.  Tolerating failure is fine in general, but in this case it
> leaves the MMU in a half-baked state.
> 
> > > nice with tdh_mem_page_add() failure necessitates both the
> > > tdx_is_sept_zap_err_due_to_premap() craziness and the check in tdx_td_finalize()
> > > that all pending pages have been consumed.
> > 
> > tdx_is_sept_zap_err_due_to_premap() detects the error of BLOCK, which is caused
> > by executing BLOCK before ADD.
> 
> We need to make this situation impossible.
Currently this situation should be impossible already.
If there are still missing cases, we can fix them (as you did above).

But this tdx_is_sept_zap_err_due_to_premap() check is just to detect if anything
is still missing.
Or maybe a direct KVM_BUG_ON() on that case is also ok.
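
Something like this minimal sketch, reusing the existing helper inside
tdx_sept_zap_private_spte() (hypothetical, for illustration only):

	/* A pre-ADD BLOCK failure should be impossible; treat it as a KVM bug. */
	if (KVM_BUG_ON(tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level),
		       kvm))
		return -EIO;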


> > > What reasonable use case is there for gracefully handling tdh_mem_page_add() failure?
> > If tdh_mem_page_add() fails, the KVM_TDX_INIT_MEM_REGION just fails.
> > 
> > > If there is a need to handle failure, I gotta imagine it's only for the -EBUSY
> > > case.  And if it's only for -EBUSY, why can't that be handled by retrying in
> > > tdx_vcpu_init_mem_region()?  If tdx_vcpu_init_mem_region() guarantees that all
> > I analyzed the contention status of tdh_mem_sept_add() at
> > https://lore.kernel.org/kvm/20250113021050.18828-1-yan.y.zhao@intel.com.
> > 
> > As the userspace is expected to execute KVM_TDX_INIT_MEM_REGION in only one
> > vCPU, returning -EBUSY instead of retrying looks safer and easier.
> > 
> > > pages mapped into the S-EPT are ADDed, then it can assert that there are no
> > > pending pages when it completes (even if it "fails"), and similarly
> > > tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being
> > > non-zero.
> > tdx_td_finalize() now just returns -EINVAL in case of nr_premapped being !0.
> > KVM_BUG_ON/WARN_ON should be also ok.
> 
> Ok, so I vaguely recall that I may have pushed back on using a scratch field in
> "struct kvm_tdx" for temporary data (or maybe it was abusing vcpus[0] that I
> disliked?), but what we ended up with is far worse.
> 
> For all intents and purposes, nr_premapped _is_ a temporary scratch field, but
> with access rules that are all but impossible to understand, e.g. there's practically
> zero chance anyone could suss out complications with "Unexpected zaps", and the
> existence of that super subtle edge case necessitates using an atomic because
> KVM can't strictly guarantee that access to the field is mutually exclusive.  And
> that also means it's inherently racy, e.g. if a zap happens while tdx_td_finalize()
> is checking nr_premapped, what happens?
tdx_td_finalize() takes slots_lock, and you already assert against those unexpected
zaps in https://lore.kernel.org/all/20250827000522.4022426-11-seanjc@google.com.

Expected zaps can't occur while tdx_td_finalize() is checking nr_premapped either.

 
> The real killer is that deferring TDH.MEM.PAGE.ADD and TDH.MR_EXTEND until after
> the map completes and mmu_lock is dropped means that failure at that point leaves
> the TDP MMU in an inconsistent state, where the M-EPT has a present page but the
> S-EPT does not.  Eww.
Eww... That's why there's nr_premapped.
And it was suggested by you, though you called it a "Crazy idea"...
https://lore.kernel.org/kvm/Ze-TJh0BBOWm9spT@google.com/

> Note, in no way am I trying to blame anyone; quite the opposite, you've done an
> admirable job to get all of this landed.  And I apologize if any of my past
> feedback led y'all down this road.  I suspect my prefaulting idea really screwed
> things up; sorry :-(
It's ok :)

> Back to the code, unless I'm missing yet another complication, I think the obvious
> fix to all of this is to pass the source page and metadata flags via a scratch
> field in "struct kvm_tdx", and then do PAGE.ADD and MR.EXTEND in
> tdx_sept_set_private_spte().  Then there is no need to keep track of pending
> pages, because the M-EPT and S-EPT are always consistent.  E.g. if PAGE.ADD fails
> with -EBUSY, then KVM will naturally revert the M-EPT entry from FROZEN to !PRESENT.
> It also allows KVM to KVM_BUG_ON() MR.EXTEND failure, because it should be impossible
> for the S-EPT entry to be modified between PAGE.ADD and MR.EXTEND.
> 
> Diff on top below for feedback on the idea.  A proper series for this would simply
> replace several of the patches, e.g. asserting that slots_lock is held on
> tdx_is_sept_zap_err_due_to_premap() is wrong.
Looks like it's similar to the implementation in v19?
https://lore.kernel.org/all/bbac4998cfb34da496646491038b03f501964cbd.1708933498.git.isaku.yamahata@intel.com/

> ---
>  arch/x86/kvm/vmx/tdx.c | 157 +++++++++++++++++------------------------
>  arch/x86/kvm/vmx/tdx.h |  11 +--
>  2 files changed, 70 insertions(+), 98 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index f9ac590e8ff0..5d981a061442 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1586,6 +1586,56 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
>  	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
>  }
>  
> +
> +struct kvm_tdx_page_add {
> +	struct page *src;
> +	unsigned long flags;
> +};
> +
> +static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +			    kvm_pfn_t pfn)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	u64 err, entry, level_state;
> +	gpa_t gpa = gfn_to_gpa(gfn);
> +	int i;
> +
> +	lockdep_assert_held(&kvm->slots_lock);
> +
> +	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm) ||
> +	    KVM_BUG_ON(!kvm_tdx->page_add, kvm))
> +		return -EIO;
> +
> +	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
> +			       kvm_tdx->page_add->src, &entry, &level_state);
> +	if (unlikely(tdx_operand_busy(err)))
> +		return -EBUSY;
> +
> +	if (KVM_BUG_ON(err, kvm)) {
> +		pr_tdx_error_2(TDH_MEM_PAGE_ADD, err, entry, level_state);
> +		return -EIO;
> +	}
> +
> +	if (!(kvm_tdx->page_add->flags & KVM_TDX_MEASURE_MEMORY_REGION))
> +		return 0;
> +
> +	/*
> +	 * Extend the measurement while holding mmu_lock to ensure MR.EXTEND
> +	 * can't fail, e.g. due to the S-EPT entry being zapped after PAGE.ADD.
> +	 * Note!  If extending the measurement fails, bug the VM, but do NOT
> +	 * return an error, as mapping the page in the S-EPT succeeded and so
> +	 * needs to be tracked in KVM's mirror page tables.
> +	 */
> +	for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
> +		err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state);
> +		if (KVM_BUG_ON(err, kvm)) {
> +			pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state);
> +			break;
> +		}
> +	}
> +	return 0;
> +}
> +
>  static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
>  			    enum pg_level level, kvm_pfn_t pfn)
>  {
> @@ -1627,21 +1677,11 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>  
>  	/*
>  	 * If the TD isn't finalized/runnable, then userspace is initializing
> -	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Increment the number of
> -	 * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD
> -	 * requires a pre-existing S-EPT mapping).  KVM_TDX_FINALIZE_VM checks
> -	 * the counter to ensure all mapped pages have been added to the image,
> -	 * to prevent running the TD with uninitialized memory.
> +	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Add the page to the TD,
> +	 * and optionally extend the measurement with the page contents.
>  	 */
> -	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) {
> -		lockdep_assert_held(&kvm->slots_lock);
> -
> -		if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
> -			return -EIO;
> -
> -		kvm_tdx->nr_pending_tdh_mem_page_adds++;
> -		return 0;
> -	}
> +	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE))
> +		return tdx_mem_page_add(kvm, gfn, level, pfn);
>  
>  	return tdx_mem_page_aug(kvm, gfn, level, pfn);
>  }
> @@ -1716,39 +1756,6 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
>  	return 0;
>  }
>  
> -/*
> - * Check if the error returned from a SEPT zap SEAMCALL is due to that a page is
> - * mapped by KVM_TDX_INIT_MEM_REGION without tdh_mem_page_add() being called
> - * successfully.
> - *
> - * Since tdh_mem_sept_add() must have been invoked successfully before a
> - * non-leaf entry present in the mirrored page table, the SEPT ZAP related
> - * SEAMCALLs should not encounter err TDX_EPT_WALK_FAILED. They should instead
> - * find TDX_EPT_ENTRY_STATE_INCORRECT due to an empty leaf entry found in the
> - * SEPT.
> - *
> - * Further check if the returned entry from SEPT walking is with RWX permissions
> - * to filter out anything unexpected.
> - *
> - * Note: @level is pg_level, not the tdx_level. The tdx_level extracted from
> - * level_state returned from a SEAMCALL error is the same as that passed into
> - * the SEAMCALL.
> - */
> -static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err,
> -					     u64 entry, int level)
> -{
> -	if (!err || kvm_tdx->state == TD_STATE_RUNNABLE)
> -		return false;
> -
> -	if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))
> -		return false;
> -
> -	if ((is_last_spte(entry, level) && (entry & VMX_EPT_RWX_MASK)))
> -		return false;
> -
> -	return true;
> -}
> -
>  static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>  				     enum pg_level level, struct page *page)
>  {
> @@ -1768,15 +1775,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>  		err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
>  		tdx_no_vcpus_enter_stop(kvm);
>  	}
> -	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
> -		lockdep_assert_held(&kvm->slots_lock);
> -
> -		if (KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm))
> -			return -EIO;
> -
> -		return 0;
> -	}
> -
>  	if (KVM_BUG_ON(err, kvm)) {
>  		pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
>  		return -EIO;
> @@ -2842,12 +2840,6 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>  
>  	if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE)
>  		return -EINVAL;
> -	/*
> -	 * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
> -	 * TDH.MEM.PAGE.ADD().
> -	 */
> -	if (kvm_tdx->nr_pending_tdh_mem_page_adds)
> -		return -EINVAL;
>  
>  	cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td);
>  	if (tdx_operand_busy(cmd->hw_error))
> @@ -3131,50 +3123,29 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  {
>  	struct tdx_gmem_post_populate_arg *arg = _arg;
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> -	u64 err, entry, level_state;
> -	gpa_t gpa = gfn_to_gpa(gfn);
> -	struct page *src_page;
> -	int ret, i;
> +	struct kvm_tdx_page_add page_add = {
> +		.flags = arg->flags,
> +	};
> +	int ret;
>  
> -	lockdep_assert_held(&kvm->slots_lock);
> +	if (KVM_BUG_ON(kvm_tdx->page_add, kvm))
> +		return -EIO;
>  
>  	/*
>  	 * Get the source page if it has been faulted in. Return failure if the
>  	 * source page has been swapped out or unmapped in primary memory.
>  	 */
> -	ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page);
> +	ret = get_user_pages_fast((unsigned long)src, 1, 0, &page_add.src);
>  	if (ret < 0)
>  		return ret;
>  	if (ret != 1)
>  		return -ENOMEM;
>  
> +	kvm_tdx->page_add = &page_add;
>  	ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
> -	if (ret < 0)
> -		goto out;
> +	kvm_tdx->page_add = NULL;
>  
> -	ret = 0;
> -	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
> -			       src_page, &entry, &level_state);
> -	if (err) {
> -		ret = unlikely(tdx_operand_busy(err)) ? -EBUSY : -EIO;
> -		goto out;
> -	}
> -
> -	KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm);
> -
> -	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
> -		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
> -			err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry,
> -					    &level_state);
> -			if (err) {
> -				ret = -EIO;
> -				break;
> -			}
> -		}
> -	}
> -
> -out:
> -	put_page(src_page);
> +	put_page(page_add.src);
>  	return ret;
>  }
>  
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 45d86f9fa41c..39e0c3bcc866 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -21,6 +21,8 @@ enum kvm_tdx_state {
>  	TD_STATE_RUNNABLE,
>  };
>  
> +struct kvm_tdx_page_add;
> +
>  struct kvm_tdx {
>  	struct kvm kvm;
>  
> @@ -37,12 +39,11 @@ struct kvm_tdx {
>  	struct tdx_td td;
>  
>  	/*
> -	 * The number of pages that KVM_TDX_INIT_MEM_REGION has mapped into the
> -	 * S-EPT, but not yet initialized via TDH.MEM.PAGE_ADD.  Used to sanity
> -	 * check adding pages to the image, and to ensure that all pages have
> -	 * been initialized before finalizing the TD.
> +	 * Scratch structure used to pass the source page and metadata flags to
> +	 * tdx_mem_page_add.  Protected by slots_lock, and non-NULL only when
> +	 * mapping a private pfn via tdx_gmem_post_populate().
>  	 */
> -	unsigned long nr_pending_tdh_mem_page_adds;
> +	struct kvm_tdx_page_add *page_add;
>  
>  	/*
>  	 * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do
> 
> base-commit: 7c7a3893b102bdeb4826f7140280b7b16081b385
> --

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-28 18:52           ` Edgecombe, Rick P
  2025-08-28 20:26             ` Sean Christopherson
  2025-08-28 21:44             ` Sean Christopherson
@ 2025-08-29  2:42             ` Binbin Wu
  2 siblings, 0 replies; 85+ messages in thread
From: Binbin Wu @ 2025-08-29  2:42 UTC (permalink / raw)
  To: Edgecombe, Rick P, seanjc@google.com, Zhao, Yan Y
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Annapurve, Vishal,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Weiny, Ira



On 8/29/2025 2:52 AM, Edgecombe, Rick P wrote:
> On Thu, 2025-08-28 at 10:00 -0700, Sean Christopherson wrote:
>> On Thu, Aug 28, 2025, Yan Zhao wrote:
[...]
>>>
>>> 3. Unexpected zaps (such as kvm_zap_gfn_range()).
>> Side topic related to kvm_zap_gfn_range(), the KVM_BUG_ON() in vt_refresh_apicv_exec_ctrl()
>> is flawed.  If kvm_recalculate_apic_map() fails to allocate an optimized map, KVM
>> will mark APICv as inhibited, i.e. the associated WARN_ON_ONCE() is effectively
>> user-triggerable.
>>
>> Easiest thing would be to mark the vCPU as dead (though we obviously need
>> "KVM: Never clear KVM_REQ_VM_DEAD from a vCPU's requests" for that to be robust).
>>
>>
>>
> I'm going need to look up the related apic discussions from the base series and
> circle back.
There was an analysis about the inhibit reasons for TDX.
https://lore.kernel.org/lkml/e3a2e8fa-b496-4010-9a8c-bfeb131bc43b@linux.intel.com/

As Sean mentioned, if kvm_recalculate_apic_map() fails to allocate the memory
for the optimized map, it will trigger the KVM_BUG_ON() in
vt_refresh_apicv_exec_ctrl(). And a kvzalloc() failure should not be treated as
a KVM bug.

As for being user-triggerable, the kvzalloc() failure path could be
triggered by KVM_CREATE_VCPU and KVM_TDX_INIT_VCPU for a TD. After
KVM_TDX_INIT_VCPU, the mapping is not allowed to be changed.

Sean's suggested code change looks good to me.



^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-28 21:33               ` Edgecombe, Rick P
  2025-08-28 21:57                 ` Sean Christopherson
  2025-08-28 22:06                 ` Ira Weiny
@ 2025-08-29  6:06                 ` Yan Zhao
  2 siblings, 0 replies; 85+ messages in thread
From: Yan Zhao @ 2025-08-29  6:06 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: seanjc@google.com, kvm@vger.kernel.org, pbonzini@redhat.com,
	Annapurve, Vishal, linux-kernel@vger.kernel.org,
	michael.roth@amd.com, Weiny, Ira

On Fri, Aug 29, 2025 at 05:33:48AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote:
> > Me confused.  This is pre-boot, not the normal fault path, i.e. blocking other
> > operations is not a concern.
> 
> Just was my recollection of the discussion. I found it:
> https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/
> 
> > 
> > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current
> > code is also broken in the sense that there are no cond_resched() calls.  The
> > vast majority of TDX hosts will be using non-preemptible kernels, so without an
> > explicit cond_resched(), there's no practical difference between extending the
> > measurement under mmu_lock versus outside of mmu_lock.
> > 
> > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should
> > still do tdh_mem_page_add() under mmu_lock.
> 
> I just did a quick test and we should be on the order of <1 ms per page for the
> full loop. I can try to get some more formal test data if it matters. But that
> doesn't sound too horrible?
> 
> tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be
> inside it. But maybe a better reason is that we could better handle errors
> outside the fault. (i.e. no 5 line comment about why not to return an error in
> tdx_mem_page_add() due to code in another file).
> 
> I wonder if Yan can give an analysis of any zapping races if we do that.
I actually proposed holding write mmu_lock around tdh_mem_page_add() and
tdh_mr_extend(), as in
https://lore.kernel.org/kvm/Ztfn5gh5888PmEIe@yzhao56-desk.sh.intel.com.

I don't see any reason why tdh_mr_extend() can't be done inside mmu_lock
in the pre-boot stage.

But the previous conclusion was that with slots_lock and filemap invalidation
lock, it's ok to invoke tdh_mem_page_add() and tdh_mr_extend() without any
mmu_lock. nr_premapped can also detect unexpected zaps.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-28 21:57                 ` Sean Christopherson
  2025-08-28 23:17                   ` Edgecombe, Rick P
@ 2025-08-29  6:08                   ` Yan Zhao
  1 sibling, 0 replies; 85+ messages in thread
From: Yan Zhao @ 2025-08-29  6:08 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Rick P Edgecombe, kvm@vger.kernel.org, pbonzini@redhat.com,
	Vishal Annapurve, linux-kernel@vger.kernel.org,
	michael.roth@amd.com, Ira Weiny

On Thu, Aug 28, 2025 at 02:57:39PM -0700, Sean Christopherson wrote:
> On Thu, Aug 28, 2025, Rick P Edgecombe wrote:
> > On Thu, 2025-08-28 at 13:26 -0700, Sean Christopherson wrote:
> > > Me confused.  This is pre-boot, not the normal fault path, i.e. blocking other
> > > operations is not a concern.
> > 
> > Just was my recollection of the discussion. I found it:
> > https://lore.kernel.org/lkml/Zbrj5WKVgMsUFDtb@google.com/
> 
> Ugh, another case where an honest question gets interpreted as "do it this way". :-(
> 
> > > If tdh_mr_extend() is too heavy for a non-preemptible section, then the current
> > > code is also broken in the sense that there are no cond_resched() calls.  The
> > > vast majority of TDX hosts will be using non-preemptible kernels, so without an
> > > explicit cond_resched(), there's no practical difference between extending the
> > > measurement under mmu_lock versus outside of mmu_lock.
> > > 
> > > _If_ we need/want to do tdh_mr_extend() outside of mmu_lock, we can and should
> > > still do tdh_mem_page_add() under mmu_lock.
> > 
> > I just did a quick test and we should be on the order of <1 ms per page for the
> > full loop. I can try to get some more formal test data if it matters. But that
> > doesn't sound too horrible?
> 
> 1ms is totally reasonable.  I wouldn't bother with any more testing.
> 
> > tdh_mr_extend() outside MMU lock is tempting because it doesn't *need* to be
> > inside it.
> 
> Agreed, and it would eliminate the need for a "flags" argument.  But keeping it
> in the mmu_lock critical section means KVM can WARN on failures.  If it's moved
> out, then zapping S-EPT entries could induce failure, and I don't think it's
> worth going through the effort to ensure it's impossible to trigger S-EPT removal.
> 
> Note, removing S-EPT entries during initialization of the image isn't something
> I want to officially support, rather it's an endless stream of whack-a-mole due to
> obscure edge cases.
> 
> Hmm, actually, maybe I take that back.  slots_lock prevents memslot updates,
> filemap_invalidate_lock() prevents guest_memfd updates, and mmu_notifier events
> shouldn't ever hit S-EPT.  I was worried about kvm_zap_gfn_range(), but the call
> from sev.c is obviously mutually exclusive, TDX disallows KVM_X86_QUIRK_IGNORE_GUEST_PAT
> so same goes for kvm_noncoherent_dma_assignment_start_or_stop, and while I'm 99%
> certain there's a way to trip __kvm_set_or_clear_apicv_inhibit(), the APIC page
> has its own non-guest_memfd memslot and so can't be used for the initial image,
> which means that too is mutually exclusive.
> 
> So yeah, let's give it a shot.  Worst case scenario we're wrong and TDH_MR_EXTEND
> errors can be triggered by userspace.
> 
> > But maybe a better reason is that we could better handle errors
> > outside the fault. (i.e. no 5 line comment about why not to return an error in
> > tdx_mem_page_add() due to code in another file).
> > 
> > I wonder if Yan can give an analysis of any zapping races if we do that.
> 
> As above, I think we're good?
I think so.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller
  2025-08-29  2:31           ` Yan Zhao
@ 2025-08-29  6:33             ` Yan Zhao
  0 siblings, 0 replies; 85+ messages in thread
From: Yan Zhao @ 2025-08-29  6:33 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel,
	Michael Roth, Ira Weiny, Vishal Annapurve, Rick Edgecombe

On Fri, Aug 29, 2025 at 10:31:29AM +0800, Yan Zhao wrote:
> On Thu, Aug 28, 2025 at 10:00:28AM -0700, Sean Christopherson wrote:
> > On Thu, Aug 28, 2025, Yan Zhao wrote:
> > > On Wed, Aug 27, 2025 at 12:08:27PM -0700, Sean Christopherson wrote:
> > > > On Wed, Aug 27, 2025, Yan Zhao wrote:
> > > > > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote:
... 
> > > > Side topic, why does KVM tolerate tdh_mem_page_add() failure?  IIUC, playing
> > > We don't. It returns -EBUSY or -EIO immediately.
> > 
> > But that _is_ tolerating failure, in the sense that KVM doesn't prevent further
> > actions on the VM.  Tolerating failure is fine in general, but in this case it
> > leaves the MMU in a half-baked state.
Yes, but nr_premapped will not be decreased on tdh_mem_page_add() failure.

So we rely on nr_premapped to prevent the TD from ever becoming runnable.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [RFC PATCH 02/12] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU
  2025-08-29  1:16             ` Yan Zhao
@ 2025-09-01  0:39               ` Yan Zhao
  0 siblings, 0 replies; 85+ messages in thread
From: Yan Zhao @ 2025-09-01  0:39 UTC (permalink / raw)
  To: Sean Christopherson, Rick P Edgecombe, kvm@vger.kernel.org,
	pbonzini@redhat.com, Vishal Annapurve,
	linux-kernel@vger.kernel.org, michael.roth@amd.com, Ira Weiny

On Fri, Aug 29, 2025 at 09:16:47AM +0800, Yan Zhao wrote:
> On Thu, Aug 28, 2025 at 12:40:20PM -0700, Sean Christopherson wrote:
> > On Thu, Aug 28, 2025, Yan Zhao wrote:
> > > On Thu, Aug 28, 2025 at 09:26:50AM +0800, Edgecombe, Rick P wrote:
> > > > On Wed, 2025-08-27 at 17:54 -0700, Rick Edgecombe wrote:
> > > > > > 
> > > > > > Then, what about setting
> > > > > >                 .max_level = PG_LEVEL_4K,
> > > > > > directly?
> > > > > > 
> > > > > > Otherwise, the "(KVM_BUG_ON(level != PG_LEVEL_4K, kvm)" would be triggered
> > > > > > in
> > > > > > tdx_sept_set_private_spte().
> > > > > 
> > > > > Yes this fails to boot a TD. With max_level = PG_LEVEL_4K it passes the full
> > > > > tests. I don't think it's ideal to encode PAGE.ADD details here though.
> > > > > 
> > > > > But I'm not immediately clear what is going wrong. The old struct
> > > > > kvm_page_fault
> > > > > looks pretty similar. Did you root cause it?
> > > >
> > > > Oh, duh. Because we are passing in the PFN now so it can't know the size. So
> > > > it's not about PAGE.ADD actually.
> > > Right, it's because the previous kvm_tdp_map_page() updates fault->max_level in
> > > kvm_mmu_faultin_pfn_private() by checking the private_max_mapping_level hook.
> > > 
> > > However, kvm_tdp_mmu_map_private_pfn() skips the faultin step and goes straight
> > > to kvm_tdp_mmu_map().
> > > 
> > > > Still, how about calling the function kvm_tdp_mmu_map_private_pfn_4k(), or
> > > > passing in the level?
> > > Looks [1] can also address this issue. Not sure which one Sean prefers.
> > > 
> > > [1] https://lore.kernel.org/all/20250729225455.670324-15-seanjc@google.com
> > 
> > That won't fix this issue though, because @fault will be valid and so max_level
> Ah, right, I missed that you composed a fault...
FWIW: after reviewing it again, I think [1] is still able to update the max_level
to 4KB.

The flow with a valid @fault:

kvm_mmu_hugepage_adjust
  kvm_mmu_max_mapping_level
    kvm_max_private_mapping_level
      kvm_x86_call(gmem_max_mapping_level)(kvm, pfn);
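
A toy model of that chain (hypothetical names, not the in-tree code), just to
show where the clamp happens: with a valid @fault, hugepage adjustment still
asks the gmem hook for the backend's maximum level and takes the minimum.

enum toy_level { TOY_LEVEL_4K = 1, TOY_LEVEL_2M, TOY_LEVEL_1G };

struct toy_fault {
	enum toy_level max_level;	/* starts out at the huge end of the range */
};

/*
 * Stand-in for the gmem_max_mapping_level() hook: a private page that still
 * needs PAGE.ADD can only be mapped at 4K.
 */
static enum toy_level toy_gmem_max_mapping_level(void)
{
	return TOY_LEVEL_4K;
}

static void toy_hugepage_adjust(struct toy_fault *fault)
{
	enum toy_level gmem_level = toy_gmem_max_mapping_level();

	/* Take the minimum of what the fault asked for and what gmem allows. */
	if (gmem_level < fault->max_level)
		fault->max_level = gmem_level;
}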
 

> > will still be KVM_MAX_HUGEPAGE_LEVEL.  Which is by design: the intent in that
> > flow is that KVM should have gotten the level when getting the pfn from gmem.
> > 
> > IIUC, this particular flow _must_ map at 4KiB, so I think forcing PG_LEVEL_4K is
> > the right solution.
> Forcing PG_LEVEL_4K looks good to me.
> I was worried that SEV might want to use higher levels.
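
For contrast, the agreed-upon direction in miniature (again a hypothetical
sketch, not the actual patch): the dedicated helper that maps an already-known
private pfn pins max_level to the 4K level up front, so no backend lookup is
needed at all.

struct toy_pfn_fault {
	unsigned long gfn;
	unsigned long pfn;
	int max_level;			/* 1 stands in for PG_LEVEL_4K */
};

static int toy_map_private_pfn(unsigned long gfn, unsigned long pfn)
{
	struct toy_pfn_fault fault = {
		.gfn = gfn,
		.pfn = pfn,
		.max_level = 1,		/* forced to 4K: PAGE.ADD operates on 4K pages */
	};

	/* ... hand the fault to the TDP MMU mapping path ... */
	(void)fault;
	return 0;
}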

^ permalink raw reply	[flat|nested] 85+ messages in thread

end of thread, other threads:[~2025-09-01  0:40 UTC | newest]

Thread overview: 85+ messages
2025-08-27  0:05 [RFC PATCH 00/12] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
2025-08-27  0:05 ` [RFC PATCH 01/12] KVM: TDX: Drop PROVE_MMU=y sanity check on to-be-populated mappings Sean Christopherson
2025-08-27  8:14   ` Yan Zhao
2025-08-28  0:37   ` Ira Weiny
2025-08-28  2:13   ` Huang, Kai
2025-08-27  0:05 ` [RFC PATCH 02/12] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU Sean Christopherson
2025-08-27  8:25   ` Yan Zhao
2025-08-28  0:54     ` Edgecombe, Rick P
2025-08-28  1:26       ` Edgecombe, Rick P
2025-08-28  6:23         ` Yan Zhao
2025-08-28 19:40           ` Sean Christopherson
2025-08-29  1:16             ` Yan Zhao
2025-09-01  0:39               ` Yan Zhao
2025-08-28  6:55       ` Yan Zhao
2025-08-28  0:40   ` Ira Weiny
2025-08-28  1:51     ` Edgecombe, Rick P
2025-08-28 19:57       ` Sean Christopherson
2025-08-27  0:05 ` [RFC PATCH 03/12] Revert "KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU" Sean Christopherson
2025-08-27  0:05 ` [RFC PATCH 04/12] KVM: x86/mmu: Rename kvm_tdp_map_page() to kvm_tdp_prefault_page() Sean Christopherson
2025-08-28  2:01   ` Edgecombe, Rick P
2025-08-28 18:50     ` Sean Christopherson
2025-08-28 19:04       ` Edgecombe, Rick P
2025-08-27  0:05 ` [RFC PATCH 05/12] KVM: TDX: Drop superfluous page pinning in S-EPT management Sean Christopherson
2025-08-27  8:33   ` Yan Zhao
2025-08-28  2:05     ` Edgecombe, Rick P
2025-08-28 20:16       ` Sean Christopherson
2025-08-28  0:36   ` Ira Weiny
2025-08-28  7:08     ` Yan Zhao
2025-08-28 15:54       ` Ira Weiny
2025-08-28  2:45   ` Huang, Kai
2025-08-27  0:05 ` [RFC PATCH 06/12] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition Sean Christopherson
2025-08-27  8:39   ` Yan Zhao
2025-08-27 17:26     ` Sean Christopherson
2025-08-28  2:11   ` Edgecombe, Rick P
2025-08-28 19:21     ` Sean Christopherson
2025-08-28 20:13       ` Edgecombe, Rick P
2025-08-28 21:00         ` Sean Christopherson
2025-08-28 21:19           ` Edgecombe, Rick P
2025-08-28 21:34             ` Sean Christopherson
2025-08-28 15:03   ` Ira Weiny
2025-08-27  0:05 ` [RFC PATCH 07/12] KVM: TDX: Avoid a double-KVM_BUG_ON() in tdx_sept_zap_private_spte() Sean Christopherson
2025-08-28  2:19   ` Edgecombe, Rick P
2025-08-28 14:50     ` Edgecombe, Rick P
2025-08-29  1:10       ` Yan Zhao
2025-08-28 15:02   ` Ira Weiny
2025-08-27  0:05 ` [RFC PATCH 08/12] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent Sean Christopherson
2025-08-28  2:56   ` Edgecombe, Rick P
2025-08-28  6:48     ` Yan Zhao
2025-08-28 19:14       ` Edgecombe, Rick P
2025-08-28 22:33         ` Sean Christopherson
2025-08-28 23:18           ` Edgecombe, Rick P
2025-08-28 15:03   ` Ira Weiny
2025-08-27  0:05 ` [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller Sean Christopherson
2025-08-27  9:02   ` Yan Zhao
2025-08-27 19:08     ` Sean Christopherson
2025-08-28  3:13       ` Edgecombe, Rick P
2025-08-28  5:56         ` Yan Zhao
2025-08-28 19:08           ` Edgecombe, Rick P
2025-08-28  5:43       ` Yan Zhao
2025-08-28 17:00         ` Sean Christopherson
2025-08-28 18:52           ` Edgecombe, Rick P
2025-08-28 20:26             ` Sean Christopherson
2025-08-28 21:33               ` Edgecombe, Rick P
2025-08-28 21:57                 ` Sean Christopherson
2025-08-28 23:17                   ` Edgecombe, Rick P
2025-08-29  6:08                   ` Yan Zhao
2025-08-28 22:06                 ` Ira Weiny
2025-08-28 23:17                   ` Sean Christopherson
2025-08-29  0:35                     ` Ira Weiny
2025-08-29  6:06                 ` Yan Zhao
2025-08-28 21:44             ` Sean Christopherson
2025-08-29  2:42             ` Binbin Wu
2025-08-29  2:31           ` Yan Zhao
2025-08-29  6:33             ` Yan Zhao
2025-08-28 15:30       ` Ira Weiny
2025-08-28 15:28     ` Ira Weiny
2025-08-27  0:05 ` [RFC PATCH 10/12] KVM: TDX: Assert that slots_lock is held when nr_premapped is accessed Sean Christopherson
2025-08-27  0:05 ` [RFC PATCH 11/12] KVM: TDX: Track nr_premapped as an "unsigned long", not an "atomic64_t" Sean Christopherson
2025-08-27  9:12   ` Yan Zhao
2025-08-27  0:05 ` [RFC PATCH 12/12] KVM: TDX: Rename nr_premapped to nr_pending_tdh_mem_page_adds Sean Christopherson
2025-08-27  9:22   ` Yan Zhao
2025-08-28 15:23   ` Ira Weiny
2025-08-27  9:48 ` [RFC PATCH 00/12] KVM: x86/mmu: TDX post-populate cleanups Yan Zhao
2025-08-28 19:01 ` Edgecombe, Rick P
2025-08-28 23:19   ` Sean Christopherson
