* [PATCH] KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
@ 2025-06-11 0:10 Xiaoyao Li
2025-06-11 18:10 ` Sean Christopherson
2025-06-12 4:44 ` Paolo Bonzini
0 siblings, 2 replies; 18+ messages in thread
From: Xiaoyao Li @ 2025-06-11 0:10 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, rick.p.edgecombe
Cc: kvm, linux-kernel, yan.y.zhao, reinette.chatre, kai.huang,
adrian.hunter, isaku.yamahata, Binbin Wu, tony.lindgren,
xiaoyao.li
From: Paolo Bonzini <pbonzini@redhat.com>
A bug[*] was reported for the TDX case when enabling KVM_PRE_FAULT_MEMORY
in QEMU.

It turns out that the @gpa passed to kvm_mmu_do_page_fault() doesn't have
the shared bit set when the memory attribute of the gpa is shared, which
leads to selecting the wrong root in tdp_mmu_get_root_for_fault().

Fix it by embedding the direct bits into the gpa that is passed to
kvm_tdp_map_page() when the memory of the gpa is not private.
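For reference, the root selection that goes wrong is quoted below; without
the shared bit in fault->addr, kvm_is_addr_direct() returns false and the
fault is resolved against the mirror root even though the memory is shared:

static inline bool kvm_is_addr_direct(struct kvm *kvm, gpa_t gpa)
{
	gpa_t gpa_direct_bits = gfn_to_gpa(kvm_gfn_direct_bits(kvm));

	return !gpa_direct_bits || (gpa & gpa_direct_bits);
}

static inline struct kvm_mmu_page *tdp_mmu_get_root_for_fault(struct kvm_vcpu *vcpu,
							      struct kvm_page_fault *fault)
{
	if (unlikely(!kvm_is_addr_direct(vcpu->kvm, fault->addr)))
		return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);

	return root_to_sp(vcpu->arch.mmu->root.hpa);
}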
[*] https://lore.kernel.org/qemu-devel/4a757796-11c2-47f1-ae0d-335626e818fd@intel.com/
Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Closes: https://lore.kernel.org/qemu-devel/4a757796-11c2-47f1-ae0d-335626e818fd@intel.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
We have selftest enhancements for the TDX case of KVM_PRE_FAULT_MEMORY, but
the plan is to post them on top of the TDX selftests [1] once those land
upstream.
[1] https://lore.kernel.org/all/20250414214801.2693294-1-sagis@google.com/
---
arch/x86/kvm/mmu/mmu.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index cbc84c6abc2e..a4040578b537 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4896,6 +4896,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
{
u64 error_code = PFERR_GUEST_FINAL_MASK;
u8 level = PG_LEVEL_4K;
+ u64 direct_bits;
u64 end;
int r;
@@ -4910,15 +4911,18 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
if (r)
return r;
+ direct_bits = 0;
if (kvm_arch_has_private_mem(vcpu->kvm) &&
kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(range->gpa)))
error_code |= PFERR_PRIVATE_ACCESS;
+ else
+ direct_bits = gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));
/*
* Shadow paging uses GVA for kvm page fault, so restrict to
* two-dimensional paging.
*/
- r = kvm_tdp_map_page(vcpu, range->gpa, error_code, &level);
+ r = kvm_tdp_map_page(vcpu, range->gpa | direct_bits, error_code, &level);
if (r < 0)
return r;
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
--
2.43.0
* Re: [PATCH] KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
2025-06-11 0:10 [PATCH] KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY Xiaoyao Li
@ 2025-06-11 18:10 ` Sean Christopherson
2025-06-11 18:21 ` Paolo Bonzini
` (2 more replies)
2025-06-12 4:44 ` Paolo Bonzini
1 sibling, 3 replies; 18+ messages in thread
From: Sean Christopherson @ 2025-06-11 18:10 UTC (permalink / raw)
To: Xiaoyao Li
Cc: Paolo Bonzini, rick.p.edgecombe, kvm, linux-kernel, yan.y.zhao,
reinette.chatre, kai.huang, adrian.hunter, isaku.yamahata,
Binbin Wu, tony.lindgren
On Tue, Jun 10, 2025, Xiaoyao Li wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
>
> A bug[*] was reported for the TDX case when enabling KVM_PRE_FAULT_MEMORY
> in QEMU.
>
> It turns out that the @gpa passed to kvm_mmu_do_page_fault() doesn't have
> the shared bit set when the memory attribute of the gpa is shared, which
> leads to selecting the wrong root in tdp_mmu_get_root_for_fault().
>
> Fix it by embedding the direct bits into the gpa that is passed to
> kvm_tdp_map_page() when the memory of the gpa is not private.
>
> [*] https://lore.kernel.org/qemu-devel/4a757796-11c2-47f1-ae0d-335626e818fd@intel.com/
>
> Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Closes: https://lore.kernel.org/qemu-devel/4a757796-11c2-47f1-ae0d-335626e818fd@intel.com/
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
> We have selftest enhancements for the TDX case of KVM_PRE_FAULT_MEMORY, but
> the plan is to post them on top of the TDX selftests [1] once those land
> upstream.
>
> [1] https://lore.kernel.org/all/20250414214801.2693294-1-sagis@google.com/
> ---
> arch/x86/kvm/mmu/mmu.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index cbc84c6abc2e..a4040578b537 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4896,6 +4896,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> {
> u64 error_code = PFERR_GUEST_FINAL_MASK;
> u8 level = PG_LEVEL_4K;
> + u64 direct_bits;
> u64 end;
> int r;
>
> @@ -4910,15 +4911,18 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> if (r)
> return r;
>
> + direct_bits = 0;
> if (kvm_arch_has_private_mem(vcpu->kvm) &&
> kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(range->gpa)))
> error_code |= PFERR_PRIVATE_ACCESS;
> + else
> + direct_bits = gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));
Eww. It's bad enough that TDX bleeds its mirror needs into common MMU code,
but stuffing vendor specific GPA bits in common code goes too far. Actually,
all of this goes too far. There's zero reason any code outside of TDX needs to
*explicitly* care whether mirrors or "direct" MMUs have mandatory gfn bits.
Side topic, kvm_arch_vcpu_pre_fault_memory() should disallow aliased GFNs. It
will "work", because kvm_mmu_do_page_fault() will strip the SHARED bit, but it's
all kinds of confusing due to KVM also forcing shared vs. private based on memory
attributes.
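E.g. reject the alias up front in kvm_arch_vcpu_pre_fault_memory(), as the
compile-tested blob below does:

	if (kvm_is_gfn_alias(vcpu->kvm, gpa_to_gfn(range->gpa)))
		return -EINVAL;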
Back to the main topic, KVM needs to have a single source of truth when it comes
to whether a fault is private and thus mirrored (or not). Common KVM needs to be
aware of aliased GFN bits, but absolutely nothing outside of TDX (including
common VMX code) should be aware that the mirror vs. "direct" (I hate that
terminology; KVM has far, far too much history and baggage with "direct")
distinction is tied to the existence and polarity of aliased GFN bits.

What we have now barely works *today* (see this bug), and it will be a complete
trainwreck if we ever want to steal GFN bits for other reasons.
To detect a mirror fault:
static inline bool kvm_is_mirror_fault(struct kvm *kvm, u64 error_code)
{
return kvm_has_mirrored_tdp(kvm) &&
error_code & PFERR_PRIVATE_ACCESS;
}
And for TDX, it should darn well explicitly track the shared GPA mask:
static bool tdx_is_private_gpa(struct kvm *kvm, gpa_t gpa)
{
/* For TDX the direct mask is the shared mask. */
return !(gpa & to_kvm_tdx(kvm)->shared_gpa_mask);
}
Overloading a field in kvm_arch and bleeding TDX details into common code isn't
worth saving 8 bytes per VM.
Outside of TDX, detecting mirrors, and anti-aliasing logic, the only use of
kvm_gfn_direct_bits() is to constrain TDP MMU walks to the appropriate gfn range.
And for that, we can simply use kvm_mmu_page.gfn, with a kvm_x86_ops hook to get
the TDP MMU root GFN (root allocation is a slow path, the CALL+RET is a non-issue).
Compile tested only, and obviously needs to be split into multiple patches.
---
arch/x86/include/asm/kvm-x86-ops.h | 3 ++
arch/x86/include/asm/kvm_host.h | 5 +-
arch/x86/kvm/mmu.h | 21 +++++---
arch/x86/kvm/mmu/mmu.c | 3 ++
arch/x86/kvm/mmu/mmu_internal.h | 14 +-----
arch/x86/kvm/mmu/tdp_iter.c | 4 +-
arch/x86/kvm/mmu/tdp_iter.h | 12 ++---
arch/x86/kvm/mmu/tdp_mmu.c | 13 ++---
arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
arch/x86/kvm/vmx/common.h | 17 ++++---
arch/x86/kvm/vmx/main.c | 11 +++++
arch/x86/kvm/vmx/tdx.c | 78 ++++++++++++++++++------------
arch/x86/kvm/vmx/tdx.h | 2 +
arch/x86/kvm/vmx/vmx.c | 2 +-
arch/x86/kvm/vmx/x86_ops.h | 1 +
15 files changed, 112 insertions(+), 76 deletions(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 8d50e3e0a19b..f0a958ba4823 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -145,6 +145,9 @@ KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP_OPTIONAL(get_untagged_addr)
KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
+#ifdef CONFIG_X86_64
+KVM_X86_OP_OPTIONAL_RET0(get_tdp_mmu_root_gfn);
+#endif
KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
KVM_X86_OP_OPTIONAL_RET0(private_max_mapping_level)
KVM_X86_OP_OPTIONAL(gmem_invalidate)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 330cdcbed1a6..3a5efde1ab9d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1578,7 +1578,7 @@ struct kvm_arch {
#define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
struct kvm_mmu_memory_cache split_desc_cache;
- gfn_t gfn_direct_bits;
+ gfn_t aliased_gfn_bits;
/*
* Size of the CPU's dirty log buffer, i.e. VMX's PML buffer. A Zero
@@ -1897,6 +1897,9 @@ struct kvm_x86_ops {
gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
+#ifdef CONFIG_X86_64
+ gfn_t (*get_tdp_mmu_root_gfn)(struct kvm *kvm, bool is_mirror);
+#endif
int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
int (*private_max_mapping_level)(struct kvm *kvm, kvm_pfn_t pfn);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b4b6860ab971..cef726e59f9b 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -257,7 +257,7 @@ extern bool tdp_mmu_enabled;
#define tdp_mmu_enabled false
#endif
-bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
+bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa, u64 error_code);
int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
@@ -309,20 +309,25 @@ static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
return kvm->arch.vm_type == KVM_X86_TDX_VM;
}
-static inline gfn_t kvm_gfn_direct_bits(const struct kvm *kvm)
+static inline bool kvm_is_mirror_fault(struct kvm *kvm, u64 error_code)
{
- return kvm->arch.gfn_direct_bits;
+ return kvm_has_mirrored_tdp(kvm) &&
+ error_code & PFERR_PRIVATE_ACCESS;
}
-static inline bool kvm_is_addr_direct(struct kvm *kvm, gpa_t gpa)
+static inline gfn_t kvm_aliased_gfn_bits(struct kvm *kvm)
{
- gpa_t gpa_direct_bits = gfn_to_gpa(kvm_gfn_direct_bits(kvm));
-
- return !gpa_direct_bits || (gpa & gpa_direct_bits);
+ return kvm->arch.aliased_gfn_bits;
}
static inline bool kvm_is_gfn_alias(struct kvm *kvm, gfn_t gfn)
{
- return gfn & kvm_gfn_direct_bits(kvm);
+ return gfn & kvm_aliased_gfn_bits(kvm);
}
+
+static inline gpa_t kvm_get_unaliased_gpa(struct kvm *kvm, gpa_t gpa)
+{
+ return gpa & ~gfn_to_gpa(kvm_aliased_gfn_bits(kvm));
+}
+
#endif
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index cbc84c6abc2e..0228d49ac363 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4902,6 +4902,9 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
if (!vcpu->kvm->arch.pre_fault_allowed)
return -EOPNOTSUPP;
+ if (kvm_is_gfn_alias(vcpu->kvm, gpa_to_gfn(range->gpa)))
+ return -EINVAL;
+
/*
* reload is efficient when called repeatedly, so we can do it on
* every iteration.
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index db8f33e4de62..7d2c53d2b0ca 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -175,18 +175,6 @@ static inline void kvm_mmu_alloc_external_spt(struct kvm_vcpu *vcpu, struct kvm_
sp->external_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_external_spt_cache);
}
-static inline gfn_t kvm_gfn_root_bits(const struct kvm *kvm, const struct kvm_mmu_page *root)
-{
- /*
- * Since mirror SPs are used only for TDX, which maps private memory
- * at its "natural" GFN, no mask needs to be applied to them - and, dually,
- * we expect that the bits is only used for the shared PT.
- */
- if (is_mirror_sp(root))
- return 0;
- return kvm_gfn_direct_bits(kvm);
-}
-
static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm *kvm,
struct kvm_mmu_page *sp)
{
@@ -376,7 +364,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
* bit. Strip it so that the GFN can be used like normal, and the
* fault.addr can be used when the shared bit is needed.
*/
- fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_direct_bits(vcpu->kvm);
+ fault.gfn = gpa_to_gfn(fault.addr) & ~vcpu->kvm->arch.aliased_gfn_bits;
fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
}
diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index 9e17bfa80901..c36bfb920382 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -37,8 +37,10 @@ void tdp_iter_restart(struct tdp_iter *iter)
* rooted at root_pt, starting with the walk to translate next_last_level_gfn.
*/
void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
- int min_level, gfn_t next_last_level_gfn, gfn_t gfn_bits)
+ int min_level, gfn_t next_last_level_gfn, bool paranoid)
{
+ gfn_t gfn_bits = paranoid ? 0 : root->gfn;
+
if (WARN_ON_ONCE(!root || (root->role.level < 1) ||
(root->role.level > PT64_ROOT_MAX_LEVEL) ||
(gfn_bits && next_last_level_gfn >= gfn_bits))) {
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 364c5da6c499..5117d64952f7 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -120,13 +120,13 @@ struct tdp_iter {
* Iterates over every SPTE mapping the GFN range [start, end) in a
* preorder traversal.
*/
-#define for_each_tdp_pte_min_level(iter, kvm, root, min_level, start, end) \
- for (tdp_iter_start(&iter, root, min_level, start, kvm_gfn_root_bits(kvm, root)); \
- iter.valid && iter.gfn < end; \
+#define for_each_tdp_pte_min_level(iter, kvm, root, min_level, start, end) \
+ for (tdp_iter_start(&iter, root, min_level, start, false); \
+ iter.valid && iter.gfn < end; \
tdp_iter_next(&iter))
-#define for_each_tdp_pte_min_level_all(iter, root, min_level) \
- for (tdp_iter_start(&iter, root, min_level, 0, 0); \
+#define for_each_tdp_pte_min_level_paranoid(iter, root, min_level) \
+ for (tdp_iter_start(&iter, root, min_level, 0, true); \
iter.valid && iter.gfn < tdp_mmu_max_gfn_exclusive(); \
tdp_iter_next(&iter))
@@ -136,7 +136,7 @@ struct tdp_iter {
tdp_ptep_t spte_to_child_pt(u64 pte, int level);
void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
- int min_level, gfn_t next_last_level_gfn, gfn_t gfn_bits);
+ int min_level, gfn_t next_last_level_gfn, bool paranoid);
void tdp_iter_next(struct tdp_iter *iter);
void tdp_iter_restart(struct tdp_iter *iter);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7f3d7229b2c1..15daf4353ccc 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -257,6 +257,7 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
int as_id = kvm_mmu_role_as_id(role);
struct kvm *kvm = vcpu->kvm;
struct kvm_mmu_page *root;
+ gfn_t gfn;
if (mirror)
role.is_mirror = true;
@@ -291,7 +292,8 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
}
root = tdp_mmu_alloc_sp(vcpu);
- tdp_mmu_init_sp(root, NULL, 0, role);
+ gfn = kvm_x86_call(get_tdp_mmu_root_gfn)(vcpu->kvm, mirror);
+ tdp_mmu_init_sp(root, NULL, gfn, role);
/*
* TDP MMU roots are kept until they are explicitly invalidated, either
@@ -860,7 +862,7 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
{
struct tdp_iter iter;
- for_each_tdp_pte_min_level_all(iter, root, zap_level) {
+ for_each_tdp_pte_min_level_paranoid(iter, root, zap_level) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
continue;
@@ -1934,12 +1936,11 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, root);
}
-bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa)
+bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa, u64 error_code)
{
struct kvm *kvm = vcpu->kvm;
- bool is_direct = kvm_is_addr_direct(kvm, gpa);
- hpa_t root = is_direct ? vcpu->arch.mmu->root.hpa :
- vcpu->arch.mmu->mirror_root_hpa;
+ hpa_t root = kvm_is_mirror_fault(kvm, error_code) ? vcpu->arch.mmu->mirror_root_hpa :
+ vcpu->arch.mmu->root.hpa;
u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
int leaf;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 52acf99d40a0..397309dfc73f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -48,7 +48,7 @@ static inline enum kvm_tdp_mmu_root_types kvm_gfn_range_filter_to_root_types(str
static inline struct kvm_mmu_page *tdp_mmu_get_root_for_fault(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault)
{
- if (unlikely(!kvm_is_addr_direct(vcpu->kvm, fault->addr)))
+ if (unlikely(kvm_is_mirror_fault(vcpu->kvm, fault->error_code)))
return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
return root_to_sp(vcpu->arch.mmu->root.hpa);
diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index a0c5e8781c33..5065fd8d41e8 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -76,14 +76,9 @@ static __always_inline bool is_td_vcpu(struct kvm_vcpu *vcpu) { return false; }
#endif
-static inline bool vt_is_tdx_private_gpa(struct kvm *kvm, gpa_t gpa)
-{
- /* For TDX the direct mask is the shared mask. */
- return !kvm_is_addr_direct(kvm, gpa);
-}
-
static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
- unsigned long exit_qualification)
+ unsigned long exit_qualification,
+ bool is_private)
{
u64 error_code;
@@ -104,12 +99,18 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) ?
PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
- if (vt_is_tdx_private_gpa(vcpu->kvm, gpa))
+ if (is_private)
error_code |= PFERR_PRIVATE_ACCESS;
return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
}
+static inline int vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
+ unsigned long exit_qualification)
+{
+ return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification, false);
+}
+
static inline void kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
int pi_vec)
{
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index d1e02e567b57..6e0652ed0d22 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -855,6 +855,14 @@ static void vt_setup_mce(struct kvm_vcpu *vcpu)
vmx_setup_mce(vcpu);
}
+static gfn_t vt_get_tdp_mmu_root_gfn(struct kvm *kvm, bool is_mirror)
+{
+ if (!is_td(kvm))
+ return 0;
+
+ return tdx_get_tdp_mmu_root_gfn(kvm, is_mirror);
+}
+
static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -1041,6 +1049,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.get_untagged_addr = vmx_get_untagged_addr,
+#ifdef CONFIG_X86_64
+ .get_tdp_mmu_root_gfn = vt_op_tdx_only(get_tdp_mmu_root_gfn),
+#endif
.mem_enc_ioctl = vt_op_tdx_only(mem_enc_ioctl),
.vcpu_mem_enc_ioctl = vt_op_tdx_only(vcpu_mem_enc_ioctl),
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b952bc673271..1abf3c158cd5 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1138,6 +1138,12 @@ static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
return __kvm_emulate_hypercall(vcpu, 0, complete_hypercall_exit);
}
+static bool tdx_is_private_gpa(struct kvm *kvm, gpa_t gpa)
+{
+ /* For TDX the direct mask is the shared mask. */
+ return !(gpa & to_kvm_tdx(kvm)->shared_gpa_mask);
+}
+
/*
* Split into chunks and check interrupt pending between chunks. This allows
* for timely injection of interrupts to prevent issues with guest lockup
@@ -1192,9 +1198,9 @@ static void __tdx_map_gpa(struct vcpu_tdx *tdx)
* vcpu->run->hypercall.ret, ensuring that it is zero to not break QEMU.
*/
tdx->vcpu.run->hypercall.ret = 0;
- tdx->vcpu.run->hypercall.args[0] = gpa & ~gfn_to_gpa(kvm_gfn_direct_bits(tdx->vcpu.kvm));
+ tdx->vcpu.run->hypercall.args[0] = gfn_to_gpa(kvm_get_unaliased_gpa(tdx->vcpu.kvm, gpa));
tdx->vcpu.run->hypercall.args[1] = size / PAGE_SIZE;
- tdx->vcpu.run->hypercall.args[2] = vt_is_tdx_private_gpa(tdx->vcpu.kvm, gpa) ?
+ tdx->vcpu.run->hypercall.args[2] = tdx_is_private_gpa(tdx->vcpu.kvm, gpa) ?
KVM_MAP_GPA_RANGE_ENCRYPTED :
KVM_MAP_GPA_RANGE_DECRYPTED;
tdx->vcpu.run->hypercall.flags = KVM_EXIT_HYPERCALL_LONG_MODE;
@@ -1222,8 +1228,8 @@ static int tdx_map_gpa(struct kvm_vcpu *vcpu)
if (gpa + size <= gpa || !kvm_vcpu_is_legal_gpa(vcpu, gpa) ||
!kvm_vcpu_is_legal_gpa(vcpu, gpa + size - 1) ||
- (vt_is_tdx_private_gpa(vcpu->kvm, gpa) !=
- vt_is_tdx_private_gpa(vcpu->kvm, gpa + size - 1))) {
+ (tdx_is_private_gpa(vcpu->kvm, gpa) !=
+ tdx_is_private_gpa(vcpu->kvm, gpa + size - 1))) {
ret = TDVMCALL_STATUS_INVALID_OPERAND;
goto error;
}
@@ -1411,11 +1417,11 @@ static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
* TDG.VP.VMCALL<MMIO> allows only shared GPA, it makes no sense to
* do MMIO emulation for private GPA.
*/
- if (vt_is_tdx_private_gpa(vcpu->kvm, gpa) ||
- vt_is_tdx_private_gpa(vcpu->kvm, gpa + size - 1))
+ if (tdx_is_private_gpa(vcpu->kvm, gpa) ||
+ tdx_is_private_gpa(vcpu->kvm, gpa + size - 1))
goto error;
- gpa = gpa & ~gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));
+ gpa = kvm_get_unaliased_gpa(vcpu->kvm, gpa);
if (write)
r = tdx_mmio_write(vcpu, gpa, size, val);
@@ -1480,12 +1486,17 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
return 1;
}
+gfn_t tdx_get_tdp_mmu_root_gfn(struct kvm *kvm, bool is_mirror)
+{
+ return is_mirror ? 0 : gpa_to_gfn(to_kvm_tdx(kvm)->shared_gpa_mask);
+}
+
void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
{
u64 shared_bit = (pgd_level == 5) ? TDX_SHARED_BIT_PWL_5 :
TDX_SHARED_BIT_PWL_4;
- if (KVM_BUG_ON(shared_bit != kvm_gfn_direct_bits(vcpu->kvm), vcpu->kvm))
+ if (KVM_BUG_ON(tdx_is_private_gpa(vcpu->kvm, shared_bit), vcpu->kvm))
return;
td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
@@ -1837,10 +1848,10 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
{
unsigned long exit_qual;
gpa_t gpa = to_tdx(vcpu)->exit_gpa;
- bool local_retry = false;
+ bool is_private = tdx_is_private_gpa(vcpu->kvm, gpa);
int ret;
- if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) {
+ if (is_private) {
if (tdx_is_sept_violation_unexpected_pending(vcpu)) {
pr_warn("Guest access before accepting 0x%llx on vCPU %d\n",
gpa, vcpu->vcpu_id);
@@ -1857,9 +1868,6 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
* due to aliasing a single HPA to multiple GPAs.
*/
exit_qual = EPT_VIOLATION_ACC_WRITE;
-
- /* Only private GPA triggers zero-step mitigation */
- local_retry = true;
} else {
exit_qual = vmx_get_exit_qual(vcpu);
/*
@@ -1907,9 +1915,10 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
* handle retries locally in their EPT violation handlers.
*/
while (1) {
- ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
+ ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual, is_private);
- if (ret != RET_PF_RETRY || !local_retry)
+ /* Only private GPA triggers zero-step mitigation */
+ if (ret != RET_PF_RETRY || !is_private)
break;
if (kvm_vcpu_has_events(vcpu) || signal_pending(current))
@@ -2620,12 +2629,11 @@ static int tdx_read_cpuid(struct kvm_vcpu *vcpu, u32 leaf, u32 sub_leaf,
out->flags |= sub_leaf_set ? KVM_CPUID_FLAG_SIGNIFCANT_INDEX : 0;
/*
- * Work around missing support on old TDX modules, fetch
- * guest maxpa from gfn_direct_bits.
+ * Work around missing support on old TDX modules, derive the guest
+ * MAXPA from the shared bit.
*/
if (leaf == 0x80000008) {
- gpa_t gpa_bits = gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));
- unsigned int g_maxpa = __ffs(gpa_bits) + 1;
+ unsigned int g_maxpa = __ffs(kvm_tdx->shared_gpa_mask) + 1;
out->eax = tdx_set_guest_phys_addr_bits(out->eax, g_maxpa);
}
@@ -2648,6 +2656,7 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
struct kvm_tdx_init_vm *init_vm;
struct td_params *td_params = NULL;
+ gpa_t shared_bit;
int ret;
BUILD_BUG_ON(sizeof(*init_vm) != 256 + sizeof_field(struct kvm_tdx_init_vm, cpuid));
@@ -2712,9 +2721,12 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
kvm_tdx->xfam = td_params->xfam;
if (td_params->config_flags & TDX_CONFIG_FLAGS_MAX_GPAW)
- kvm->arch.gfn_direct_bits = TDX_SHARED_BIT_PWL_5;
+ shared_bit = TDX_SHARED_BIT_PWL_5;
else
- kvm->arch.gfn_direct_bits = TDX_SHARED_BIT_PWL_4;
+ shared_bit = TDX_SHARED_BIT_PWL_4;
+
+ kvm_tdx->shared_gpa_mask = shared_bit;
+ kvm->arch.aliased_gfn_bits = gpa_to_gfn(shared_bit);
kvm_tdx->state = TD_STATE_INITIALIZED;
out:
@@ -3052,6 +3064,16 @@ struct tdx_gmem_post_populate_arg {
__u32 flags;
};
+static int tdx_is_private_mapping_valid(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+ if (!IS_ENABLED(CONFIG_KVM_PROVE_MMU))
+ return 0;
+
+ guard(read_lock)(&vcpu->kvm->mmu_lock);
+
+ return kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa, PFERR_PRIVATE_ACCESS);
+}
+
static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
void __user *src, int order, void *_arg)
{
@@ -3084,14 +3106,8 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
* because all paths are covered by slots_lock and the
* filemap invalidate lock. Check that they are indeed enough.
*/
- if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
- scoped_guard(read_lock, &kvm->mmu_lock) {
- if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa), kvm)) {
- ret = -EIO;
- goto out;
- }
- }
- }
+ if (KVM_BUG_ON(!tdx_is_private_mapping_valid(vcpu, gpa), kvm))
+ return -EIO;
ret = 0;
err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
@@ -3148,8 +3164,8 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
if (!PAGE_ALIGNED(region.source_addr) || !PAGE_ALIGNED(region.gpa) ||
!region.nr_pages ||
region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa ||
- !vt_is_tdx_private_gpa(kvm, region.gpa) ||
- !vt_is_tdx_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT) - 1))
+ !tdx_is_private_gpa(kvm, region.gpa) ||
+ !tdx_is_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT) - 1))
return -EINVAL;
kvm_mmu_reload(vcpu);
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 51f98443e8a2..3807562d5b48 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -27,6 +27,8 @@ struct kvm_tdx {
int hkid;
enum kvm_tdx_state state;
+ gpa_t shared_gpa_mask;
+
u64 attributes;
u64 xfam;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9ff00ae9f05a..38103efe2f47 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5775,7 +5775,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa)))
return kvm_emulate_instruction(vcpu, 0);
- return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
+ return vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
}
static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index b4596f651232..ee56e96b4db3 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -151,6 +151,7 @@ int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
+gfn_t tdx_get_tdp_mmu_root_gfn(struct kvm *kvm, bool is_mirror);
int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
enum pg_level level, void *private_spt);
int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
base-commit: 61374cc145f4a56377eaf87c7409a97ec7a34041
--
* Re: [PATCH] KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
2025-06-11 18:10 ` Sean Christopherson
@ 2025-06-11 18:21 ` Paolo Bonzini
2025-06-11 19:37 ` Sean Christopherson
2025-06-11 20:45 ` Edgecombe, Rick P
2025-06-12 12:20 ` Yan Zhao
2 siblings, 1 reply; 18+ messages in thread
From: Paolo Bonzini @ 2025-06-11 18:21 UTC (permalink / raw)
To: Sean Christopherson
Cc: Xiaoyao Li, rick.p.edgecombe, kvm, linux-kernel, yan.y.zhao,
reinette.chatre, kai.huang, adrian.hunter, isaku.yamahata,
Binbin Wu, tony.lindgren
On Wed, Jun 11, 2025 at 8:10 PM Sean Christopherson <seanjc@google.com> wrote:
> > + direct_bits = 0;
> > if (kvm_arch_has_private_mem(vcpu->kvm) &&
> > kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(range->gpa)))
> > error_code |= PFERR_PRIVATE_ACCESS;
> > + else
> > + direct_bits = gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));
>
> Eww. It's bad enough that TDX bleeds its mirror needs into common MMU code,
> but stuffing vendor specific GPA bits in common code goes too far. Actually,
> all of this goes too far. There's zero reason any code outside of TDX needs to
> *explicitly* care whether mirrors or "direct" MMUs have mandatory gfn bits.
>
> Back to the main topic, KVM needs to have a single source of truth when it comes
> to whether a fault is private and thus mirrored (or not). Common KVM needs to be
> aware of aliased GFN bits, but absolutely nothing outside of TDX (including
> common VMX code) should be aware that the mirror vs. "direct" (I hate that
> terminology; KVM has far, far too much history and baggage with "direct")
> distinction is tied to the existence and polarity of aliased GFN bits.
>
> To detect a mirror fault:
>
> static inline bool kvm_is_mirror_fault(struct kvm *kvm, u64 error_code)
> {
> return kvm_has_mirrored_tdp(kvm) &&
> error_code & PFERR_PRIVATE_ACCESS;
> }
>
> And for TDX, it should darn well explicitly track the shared GPA mask:
>
> static bool tdx_is_private_gpa(struct kvm *kvm, gpa_t gpa)
> {
> /* For TDX the direct mask is the shared mask. */
> return !(gpa & to_kvm_tdx(kvm)->shared_gpa_mask);
> }
My fault - this is more similar, at least in spirit, to what
Yan and Xiaoyao had tested earlier:
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 52acf99d40a0..209103bf0f30 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -48,7 +48,7 @@ static inline enum kvm_tdp_mmu_root_types
static inline struct kvm_mmu_page *tdp_mmu_get_root_for_fault(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault)
{
- if (unlikely(!kvm_is_addr_direct(vcpu->kvm, fault->addr)))
+ if (unlikely(fault->is_private))
return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
and I instead proposed the version that you hate with such ardor.
My reasoning was that I preferred to have the pre-fault scenario "look like"
what you get while the VM runs.
> Outside of TDX, detecting mirrors, and anti-aliasing logic, the only use of
> kvm_gfn_direct_bits() is to constrain TDP MMU walks to the appropriate gfn range.
> And for that, we can simply use kvm_mmu_page.gfn, with a kvm_x86_ops hook to get
> the TDP MMU root GFN (root allocation is a slow path, the CALL+RET is a non-issue).
>
> Compile tested only, and obviously needs to be split into multiple patches.
Also obviously needs to be delayed to 6.17, since a working fix can be a
one-line change. :) (Plus your kvm_is_gfn_alias() test, which should be
included anyway and independently).
Which do you hate less, Yan's idea above or this patch? Just tell me
and I'll handle posting v2.
Paolo
* Re: [PATCH] KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
2025-06-11 18:21 ` Paolo Bonzini
@ 2025-06-11 19:37 ` Sean Christopherson
2025-06-11 20:25 ` Edgecombe, Rick P
0 siblings, 1 reply; 18+ messages in thread
From: Sean Christopherson @ 2025-06-11 19:37 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Xiaoyao Li, rick.p.edgecombe, kvm, linux-kernel, yan.y.zhao,
reinette.chatre, kai.huang, adrian.hunter, isaku.yamahata,
Binbin Wu, tony.lindgren
On Wed, Jun 11, 2025, Paolo Bonzini wrote:
> On Wed, Jun 11, 2025 at 8:10 PM Sean Christopherson <seanjc@google.com> wrote:
> > > + direct_bits = 0;
> > > if (kvm_arch_has_private_mem(vcpu->kvm) &&
> > > kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(range->gpa)))
> > > error_code |= PFERR_PRIVATE_ACCESS;
> > > + else
> > > + direct_bits = gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));
> >
> > Eww. It's bad enough that TDX bleeds its mirror needs into common MMU code,
> > but stuffing vendor specific GPA bits in common code goes too far. Actually,
> > all of this goes too far. There's zero reason any code outside of TDX needs to
> > *explicitly* care whether mirrors or "direct" MMUs have mandatory gfn bits.
> >
> > Back to the main topic, KVM needs to have a single source of truth when it comes
> > to whether a fault is private and thus mirrored (or not). Common KVM needs to be
> > aware of aliased GFN bits, but absolutely nothing outside of TDX (including
> > common VMX code) should be aware that the mirror vs. "direct" (I hate that
> > terminology; KVM has far, far too much history and baggage with "direct")
> > distinction is tied to the existence and polarity of aliased GFN bits.
> >
> > To detect a mirror fault:
> >
> > static inline bool kvm_is_mirror_fault(struct kvm *kvm, u64 error_code)
> > {
> > return kvm_has_mirrored_tdp(kvm) &&
> > error_code & PFERR_PRIVATE_ACCESS;
> > }
> >
> > And for TDX, it should darn well explicitly track the shared GPA mask:
> >
> > static bool tdx_is_private_gpa(struct kvm *kvm, gpa_t gpa)
> > {
> > /* For TDX the direct mask is the shared mask. */
> > return !(gpa & to_kvm_tdx(kvm)->shared_gpa_mask);
> > }
>
> My fault - this is more similar, at least in spirit, to what
> Yan and Xiaoyao had tested earlier:
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 52acf99d40a0..209103bf0f30 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -48,7 +48,7 @@ static inline enum kvm_tdp_mmu_root_types
> static inline struct kvm_mmu_page *tdp_mmu_get_root_for_fault(struct kvm_vcpu *vcpu,
> struct kvm_page_fault *fault)
> {
> - if (unlikely(!kvm_is_addr_direct(vcpu->kvm, fault->addr)))
> + if (unlikely(fault->is_private))
> return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
>
> and I instead proposed the version that you hate with such ardor.
>
> My reasoning was that I preferred to have the pre-fault scenario "look like"
> what you get while the VM runs.
Yes, 100% agreed. I forgot fault->addr has the unmodified GPA, whereas fault->gfn
has the unaliased GFN. :-/
> > Outside of TDX, detecting mirrors, and anti-aliasing logic, the only use of
> > kvm_gfn_direct_bits() is to constrain TDP MMU walks to the appropriate gfn range.
> > And for that, we can simply use kvm_mmu_page.gfn, with a kvm_x86_ops hook to get
> > the TDP MMU root GFN (root allocation is a slow path, the CALL+RET is a non-issue).
> >
> > Compile tested only, and obviously needs to be split into multiple patches.
>
> Also obviously needs to be delayed to 6.17, since a working fix can be a
> one-line change. :)
Ya, definitely.
> (Plus your kvm_is_gfn_alias() test, which should be
> included anyway and independently).
>
> Which do you hate less, Yan's idea above or this patch? Just tell me
> and I'll handle posting v2.
As much as it pains me, this version :-(
There are other things that rely on the GPA being "correct", e.g. walking the
SPTEs in fast_page_fault(). So for a 6.16 fix, this is the safer and more complete
option.
Ugh, and the whole tdp_mmu_get_root_for_fault() handling is broken.
is_page_fault_stale() only looks at mmu->root.hpa, i.e. could theoretically blow
up if the shared root is somehow valid but the mirror root is not. Probably can't
happen in practice, but it's ugly.
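For reference, today's check only consults the "normal" root (opening lines
quoted from the current tree, the rest of the function elided):

static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
				struct kvm_page_fault *fault)
{
	struct kvm_mmu_page *sp = root_to_sp(vcpu->arch.mmu->root.hpa);

	/* Special roots, e.g. pae_root, are not backed by shadow pages. */
	if (sp && is_obsolete_sp(vcpu->kvm, sp))
		return true;
	...
}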
Oof, and I've no idea what kvm_tdp_mmu_fast_pf_get_last_sptep() is doing. It
says:
/* Fast pf is not supported for mirrored roots */
but I don't see anything that actually enforces that.
So tdp_mmu_get_root_for_fault() should be a generic kvm_mmu_get_root_for_fault(),
and tdp_mmu_get_root() simply shouldn't exist.
As for stuffing the correct GPA, with kvm_mmu_get_root_for_fault() being generic
and the root holding its gfn modifier, kvm_tdp_map_page() can simply OR in the
appropriate gfn (and maybe WARN if there's overlap?).
Something like so (on top of the other untested blob):
-----
arch/x86/kvm/mmu/mmu.c | 8 ++++++--
arch/x86/kvm/mmu/spte.h | 15 +++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++-----
arch/x86/kvm/mmu/tdp_mmu.h | 21 ++-------------------
4 files changed, 28 insertions(+), 26 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0228d49ac363..3bcc8d4848bd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3589,7 +3589,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
u64 new_spte;
if (tdp_mmu_enabled)
- sptep = kvm_tdp_mmu_fast_pf_get_last_sptep(vcpu, fault->gfn, &spte);
+ sptep = kvm_tdp_mmu_fast_pf_get_last_sptep(vcpu, fault, &spte);
else
sptep = fast_pf_get_last_sptep(vcpu, fault->addr, &spte);
@@ -4682,7 +4682,7 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault)
{
- struct kvm_mmu_page *sp = root_to_sp(vcpu->arch.mmu->root.hpa);
+ struct kvm_mmu_page *sp = kvm_mmu_get_root_for_fault(vcpu, fault);
/* Special roots, e.g. pae_root, are not backed by shadow pages. */
if (sp && is_obsolete_sp(vcpu->kvm, sp))
@@ -4849,6 +4849,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
{
+ struct kvm_mmu_page *root = __kvm_mmu_get_root_for_fault(vcpu, error_code);
int r;
/*
@@ -4858,6 +4859,9 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
return -EOPNOTSUPP;
+ /* Comment goes here. */
+ gpa |= gfn_to_gpa(root->gfn);
+
do {
if (signal_pending(current))
return -EINTR;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 1e94f081bdaf..68e7979ac1fe 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -280,6 +280,21 @@ static inline bool is_mirror_sptep(tdp_ptep_t sptep)
return is_mirror_sp(sptep_to_sp(rcu_dereference(sptep)));
}
+static inline struct kvm_mmu_page *__kvm_mmu_get_root_for_fault(struct kvm_vcpu *vcpu,
+ u64 error_code)
+{
+ if (unlikely(kvm_is_mirror_fault(vcpu->kvm, error_code)))
+ return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
+
+ return root_to_sp(vcpu->arch.mmu->root.hpa);
+}
+
+static inline struct kvm_mmu_page *kvm_mmu_get_root_for_fault(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+ return __kvm_mmu_get_root_for_fault(vcpu, fault->error_code);
+}
+
static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
{
return (spte & shadow_mmio_mask) == kvm->arch.shadow_mmio_value &&
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 15daf4353ccc..ecfffc6fbb73 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1240,7 +1240,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
*/
int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
- struct kvm_mmu_page *root = tdp_mmu_get_root_for_fault(vcpu, fault);
+ struct kvm_mmu_page *root = kvm_mmu_get_root_for_fault(vcpu, fault);
struct kvm *kvm = vcpu->kvm;
struct tdp_iter iter;
struct kvm_mmu_page *sp;
@@ -1967,15 +1967,15 @@ EXPORT_SYMBOL_GPL(kvm_tdp_mmu_gpa_is_mapped);
*
* WARNING: This function is only intended to be called during fast_page_fault.
*/
-u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gfn_t gfn,
+u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault,
u64 *spte)
{
- /* Fast pf is not supported for mirrored roots */
- struct kvm_mmu_page *root = tdp_mmu_get_root(vcpu, KVM_DIRECT_ROOTS);
+ struct kvm_mmu_page *root = kvm_mmu_get_root_for_fault(vcpu, fault);
struct tdp_iter iter;
tdp_ptep_t sptep = NULL;
- for_each_tdp_pte(iter, vcpu->kvm, root, gfn, gfn + 1) {
+ for_each_tdp_pte(iter, vcpu->kvm, root, fault->gfn, fault->gfn + 1) {
*spte = iter.old_spte;
sptep = iter.sptep;
}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 397309dfc73f..f75888474b73 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -45,24 +45,6 @@ static inline enum kvm_tdp_mmu_root_types kvm_gfn_range_filter_to_root_types(str
return ret;
}
-static inline struct kvm_mmu_page *tdp_mmu_get_root_for_fault(struct kvm_vcpu *vcpu,
- struct kvm_page_fault *fault)
-{
- if (unlikely(kvm_is_mirror_fault(vcpu->kvm, fault->error_code)))
- return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
-
- return root_to_sp(vcpu->arch.mmu->root.hpa);
-}
-
-static inline struct kvm_mmu_page *tdp_mmu_get_root(struct kvm_vcpu *vcpu,
- enum kvm_tdp_mmu_root_types type)
-{
- if (unlikely(type == KVM_MIRROR_ROOTS))
- return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
-
- return root_to_sp(vcpu->arch.mmu->root.hpa);
-}
-
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush);
bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
@@ -109,7 +91,8 @@ static inline void kvm_tdp_mmu_walk_lockless_end(void)
int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
int *root_level);
-u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gfn_t gfn,
+u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault,
u64 *spte);
#ifdef CONFIG_X86_64
base-commit: 1abe48190d919c44e69aae17beb9e55d83db2303
--
* Re: [PATCH] KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
2025-06-11 19:37 ` Sean Christopherson
@ 2025-06-11 20:25 ` Edgecombe, Rick P
2025-06-11 20:43 ` Sean Christopherson
0 siblings, 1 reply; 18+ messages in thread
From: Edgecombe, Rick P @ 2025-06-11 20:25 UTC (permalink / raw)
To: pbonzini@redhat.com, seanjc@google.com
Cc: Huang, Kai, binbin.wu@linux.intel.com, Li, Xiaoyao,
Chatre, Reinette, Hunter, Adrian, Zhao, Yan Y,
kvm@vger.kernel.org, Yamahata, Isaku,
linux-kernel@vger.kernel.org, tony.lindgren@linux.intel.com
On Wed, 2025-06-11 at 12:37 -0700, Sean Christopherson wrote:
> Ugh, and the whole tdp_mmu_get_root_for_fault() handling is broken.
> is_page_fault_stale() only looks at mmu->root.hpa, i.e. could theoretically blow
> up if the shared root is somehow valid but the mirror root is not. Probably can't
> happen in practice, but it's ugly.
We had some discussion on this root valid/invalid pattern:
https://lore.kernel.org/kvm/d33d00b88707961126a24b19f940de43ba6e6c56.camel@intel.com/
It's brittle though.
>
> Oof, and I've no idea what kvm_tdp_mmu_fast_pf_get_last_sptep() is doing. It
> says:
>
> /* Fast pf is not supported for mirrored roots */
>
> but I don't see anything that actually enforces that.
Functionally, page_fault_can_be_fast() should have prevented this with the check
of kvm->arch.has_private_mem. But, yea, it's not correct from a readability
standpoint. The mirror/external concepts only work if they make sense as
independent concepts. Otherwise it's just naming obfuscation.
>
> So tdp_mmu_get_root_for_fault() should be a generic kvm_mmu_get_root_for_fault(),
> and tdp_mmu_get_root() simply shouldn't exist.
>
> As for stuffing the correct GPA, with kvm_mmu_get_root_for_fault() being generic
> and the root holding its gfn modifier, kvm_tdp_map_page() can simply OR in the
> appropriate gfn (and maybe WARN if there's overlap?).
* Re: [PATCH] KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
2025-06-11 20:25 ` Edgecombe, Rick P
@ 2025-06-11 20:43 ` Sean Christopherson
2025-06-11 21:16 ` Edgecombe, Rick P
2025-06-12 6:58 ` Yan Zhao
0 siblings, 2 replies; 18+ messages in thread
From: Sean Christopherson @ 2025-06-11 20:43 UTC (permalink / raw)
To: Rick P Edgecombe
Cc: pbonzini@redhat.com, Kai Huang, binbin.wu@linux.intel.com,
Xiaoyao Li, Reinette Chatre, Adrian Hunter, Yan Y Zhao,
kvm@vger.kernel.org, Isaku Yamahata, linux-kernel@vger.kernel.org,
tony.lindgren@linux.intel.com
On Wed, Jun 11, 2025, Rick P Edgecombe wrote:
> On Wed, 2025-06-11 at 12:37 -0700, Sean Christopherson wrote:
> > Ugh, and the whole tdp_mmu_get_root_for_fault() handling is broken.
> > is_page_fault_stale() only looks at mmu->root.hpa, i.e. could theoretically blow
> > up if the shared root is somehow valid but the mirror root is not. Probably can't
> > happen in practice, but it's ugly.
>
> We had some discussion on this root valid/invalid pattern:
> https://lore.kernel.org/kvm/d33d00b88707961126a24b19f940de43ba6e6c56.camel@intel.com/
>
> It's brittle though.
Hmm, yeah, the is_page_fault_stale() thing is definitely benign, just odd.
> > Oof, and I've no idea what kvm_tdp_mmu_fast_pf_get_last_sptep() is doing. It
> > says:
> >
> > /* Fast pf is not supported for mirrored roots */
> >
> > but I don't see anything that actually enforces that.
>
> Functionally, page_fault_can_be_fast() should have prevented this with the
> check of kvm->arch.has_private_mem.
No? I see this:
if (kvm->arch.has_private_mem &&
fault->is_private != kvm_mem_is_private(kvm, fault->gfn))
return false;
I.e. a private fault can be fast, so long as the page is already in the correct
shared vs. private state. I can imagine that it's impossible for TDX to generate
protection violations, but I think kvm_tdp_mmu_fast_pf_get_last_sptep() could be
reached with a mirror root if kvm_ad_enabled=false.
if (!fault->present)
return !kvm_ad_enabled;
/*
* Note, instruction fetches and writes are mutually exclusive, ignore
* the "exec" flag.
*/
return fault->write;
> But, yea, it's not correct from a readability standpoint. The mirror/external
> concepts only work if they make sense as independent concepts. Otherwise it's
> just naming obfuscation.
* Re: [PATCH] KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
2025-06-11 18:10 ` Sean Christopherson
2025-06-11 18:21 ` Paolo Bonzini
@ 2025-06-11 20:45 ` Edgecombe, Rick P
2025-06-11 21:09 ` Sean Christopherson
2025-06-12 12:20 ` Yan Zhao
2 siblings, 1 reply; 18+ messages in thread
From: Edgecombe, Rick P @ 2025-06-11 20:45 UTC (permalink / raw)
To: Li, Xiaoyao, seanjc@google.com
Cc: Huang, Kai, binbin.wu@linux.intel.com, Chatre, Reinette,
linux-kernel@vger.kernel.org, Hunter, Adrian, Zhao, Yan Y,
kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
tony.lindgren@linux.intel.com
On Wed, 2025-06-11 at 11:10 -0700, Sean Christopherson wrote:
> Back to the main topic, KVM needs to have a single source of truth when it comes
> to whether a fault is private and thus mirrored (or not). Common KVM needs to be
> aware of aliased GFN bits, but absolutely nothing outside of TDX (including
> common VMX code) should be aware that the mirror vs. "direct" (I hate that
> terminology; KVM has far, far too much history and baggage with "direct")
> distinction is tied to the existence and polarity of aliased GFN bits.
>
> What we have now barely works *today* (see this bug), and it will be a complete
> trainwreck if we ever want to steal GFN bits for other reasons.
KVM XO's time has come and gone. Out of curiosity, is there anything else?
Readability is the main objection here, right?
>
> To detect a mirror fault:
>
> static inline bool kvm_is_mirror_fault(struct kvm *kvm, u64 error_code)
> {
> return kvm_has_mirrored_tdp(kvm) &&
> error_code & PFERR_PRIVATE_ACCESS;
> }
>
> And for TDX, it should darn well explicitly track the shared GPA mask:
>
> static bool tdx_is_private_gpa(struct kvm *kvm, gpa_t gpa)
> {
> /* For TDX the direct mask is the shared mask. */
> return !(gpa & to_kvm_tdx(kvm)->shared_gpa_mask);
> }
>
> Overloading a field in kvm_arch and bleeding TDX details into common code isn't
> worth saving 8 bytes per VM.
>
> Outside of TDX, detecting mirrors, and anti-aliasing logic, the only use of
> kvm_gfn_direct_bits() is to constrain TDP MMU walks to the appropriate gfn range.
> And for that, we can simply use kvm_mmu_page.gfn,
>
Ooh, nice.
> with a kvm_x86_ops hook to get
> the TDP MMU root GFN (root allocation is a slow path, the CALL+RET is a non-issue).
>
>
* Re: [PATCH] KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
2025-06-11 20:45 ` Edgecombe, Rick P
@ 2025-06-11 21:09 ` Sean Christopherson
0 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2025-06-11 21:09 UTC (permalink / raw)
To: Rick P Edgecombe
Cc: Xiaoyao Li, Kai Huang, binbin.wu@linux.intel.com, Reinette Chatre,
linux-kernel@vger.kernel.org, Adrian Hunter, Yan Y Zhao,
kvm@vger.kernel.org, pbonzini@redhat.com, Isaku Yamahata,
tony.lindgren@linux.intel.com
On Wed, Jun 11, 2025, Rick P Edgecombe wrote:
> On Wed, 2025-06-11 at 11:10 -0700, Sean Christopherson wrote:
> > Back to the main topic, KVM needs to have a single source of truth when it comes
> > to whether a fault is private and thus mirrored (or not). Common KVM needs to be
> > aware of aliased GFN bits, but absolutely nothing outside of TDX (including
> > common VMX code) should be aware that the mirror vs. "direct" (I hate that
> > terminology; KVM has far, far too much history and baggage with "direct")
> > distinction is tied to the existence and polarity of aliased GFN bits.
> >
> > What we have now barely works *today* (see this bug), and it will be a complete
> > trainwreck if we ever want to steal GFN bits for other reasons.
>
> KVM XO's time has come and gone. Out of curiosity, is there anything else?
Not that I know of.
> Readability is the main objection here, right?
Maintainability first and foremost, but definitely readability too (which obviously
plays into maintainability). E.g. if we properly encapsulate TDX, then it'll be
harder to pile on hacks and whatnot simply because common code won't have access
to state that lets it misbehave.
But yes, you're correct in that I don't have a specific use case in mind.
* Re: [PATCH] KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
2025-06-11 20:43 ` Sean Christopherson
@ 2025-06-11 21:16 ` Edgecombe, Rick P
2025-06-12 7:19 ` Yan Zhao
2025-06-12 6:58 ` Yan Zhao
1 sibling, 1 reply; 18+ messages in thread
From: Edgecombe, Rick P @ 2025-06-11 21:16 UTC (permalink / raw)
To: seanjc@google.com
Cc: Huang, Kai, binbin.wu@linux.intel.com, Li, Xiaoyao,
Chatre, Reinette, Hunter, Adrian, Zhao, Yan Y,
kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
linux-kernel@vger.kernel.org, tony.lindgren@linux.intel.com
On Wed, 2025-06-11 at 13:43 -0700, Sean Christopherson wrote:
> > Functionally, page_fault_can_be_fast() should have prevented this with
> > the check of kvm->arch.has_private_mem.
>
> No? I see this:
>
> if (kvm->arch.has_private_mem &&
> fault->is_private != kvm_mem_is_private(kvm, fault->gfn))
> return false;
>
> I.e. a private fault can be fast, so long as the page is already in the
> correct shared vs. private state. I can imagine that it's impossible for
> TDX to generate protection violations, but I think
> kvm_tdp_mmu_fast_pf_get_last_sptep() could be reached with a mirror root
> if kvm_ad_enabled=false.
>
> if (!fault->present)
> return !kvm_ad_enabled;
>
> /*
> * Note, instruction fetches and writes are mutually exclusive,
> ignore
> * the "exec" flag.
> */
> return fault->write;
Oh, how embarrassing. Yes, I misread the code, but the way it's working is, oh
man...
TDX isn't setting PFERR_WRITE_MASK or PFERR_PRESENT_MASK in the error_code
passed into the fault handler. So page_fault_can_be_fast() should return false
for that reason for private/mirror faults.
* Re: [PATCH] KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
2025-06-11 0:10 [PATCH] KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY Xiaoyao Li
2025-06-11 18:10 ` Sean Christopherson
@ 2025-06-12 4:44 ` Paolo Bonzini
1 sibling, 0 replies; 18+ messages in thread
From: Paolo Bonzini @ 2025-06-12 4:44 UTC (permalink / raw)
To: Xiaoyao Li
Cc: Sean Christopherson, rick.p.edgecombe, kvm, linux-kernel,
yan.y.zhao, reinette.chatre, kai.huang, adrian.hunter,
isaku.yamahata, Binbin Wu, tony.lindgren
Queued, thanks.
Paolo
* Re: [PATCH] KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
2025-06-11 20:43 ` Sean Christopherson
2025-06-11 21:16 ` Edgecombe, Rick P
@ 2025-06-12 6:58 ` Yan Zhao
1 sibling, 0 replies; 18+ messages in thread
From: Yan Zhao @ 2025-06-12 6:58 UTC (permalink / raw)
To: Sean Christopherson
Cc: Rick P Edgecombe, pbonzini@redhat.com, Kai Huang,
binbin.wu@linux.intel.com, Xiaoyao Li, Reinette Chatre,
Adrian Hunter, kvm@vger.kernel.org, Isaku Yamahata,
linux-kernel@vger.kernel.org, tony.lindgren@linux.intel.com
On Wed, Jun 11, 2025 at 01:43:25PM -0700, Sean Christopherson wrote:
> On Wed, Jun 11, 2025, Rick P Edgecombe wrote:
> > On Wed, 2025-06-11 at 12:37 -0700, Sean Christopherson wrote:
> > > Ugh, and the whole tdp_mmu_get_root_for_fault() handling is broken.
> > > is_page_fault_stale() only looks at mmu->root.hpa, i.e. could theoretically blow
> > > up if the shared root is somehow valid but the mirror root is not. Probably can't
> > > happen in practice, but it's ugly.
> >
> > We had some discussion on this root valid/invalid pattern:
> > https://lore.kernel.org/kvm/d33d00b88707961126a24b19f940de43ba6e6c56.camel@intel.com/
> >
> > It's brittle though.
>
> Hmm, yeah, the is_page_fault_stale() thing is definitely benign, just odd.
Agreed.
> > > Oof, and I've no idea what kvm_tdp_mmu_fast_pf_get_last_sptep() is doing. It
> > > says:
> > >
> > > /* Fast pf is not supported for mirrored roots */
> > >
> > > but I don't see anything that actually enforces that.
> >
> > Functionally, page_fault_can_be_fast() should have prevented this with the
> > check of kvm->arch.has_private_mem.
>
> No? I see this:
>
> if (kvm->arch.has_private_mem &&
> fault->is_private != kvm_mem_is_private(kvm, fault->gfn))
> return false;
>
> I.e. a private fault can be fast, so long as the page is already in the correct
> shared vs. private state. I can imagine that it's impossible for TDX to generate
> protection violations, but I think kvm_tdp_mmu_fast_pf_get_last_sptep() could be
> reached with a mirror root if kvm_ad_enabled=false.
>
> if (!fault->present)
> return !kvm_ad_enabled;
For TDX private faults,
- fault->present is always false if !fault->prefetch, as the error_code is
  PFERR_PRIVATE_ACCESS | PFERR_WRITE_MASK due to the exit qualification being
  hardcoded to EPT_VIOLATION_ACC_WRITE.
- fault->present is false if fault->prefetch is true, as the error_code is
  PFERR_PRIVATE_ACCESS | PFERR_GUEST_FINAL_MASK.
In tdx_bringup(), enable_ept_ad_bits is checked as a prerequisite of TDX, so
kvm_ad_enabled should be true.
So, page_fault_can_be_fast() should always return false for mirror roots.
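I.e., roughly (a sketch of the resulting error codes, not verbatim from the
tree):

/* Non-prefetch private fault: exit_qual hardcoded to EPT_VIOLATION_ACC_WRITE. */
error_code = PFERR_PRIVATE_ACCESS | PFERR_WRITE_MASK;

/* Prefetch (KVM_PRE_FAULT_MEMORY) path. */
error_code = PFERR_PRIVATE_ACCESS | PFERR_GUEST_FINAL_MASK;

/*
 * Neither sets PFERR_PRESENT_MASK, and enable_ept_ad_bits is mandatory for
 * TDX, so page_fault_can_be_fast() always bails out at
 * "if (!fault->present) return !kvm_ad_enabled;" for mirror roots.
 */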
> /*
> * Note, instruction fetches and writes are mutually exclusive, ignore
> * the "exec" flag.
> */
> return fault->write;
>
> > But, yea it's not correct for being readable. The mirror/external concepts
> > only work if they make sense as independent concepts. Otherwise it's just
> > naming obfuscation.
* Re: [PATCH] KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
2025-06-11 21:16 ` Edgecombe, Rick P
@ 2025-06-12 7:19 ` Yan Zhao
2025-06-12 18:50 ` Edgecombe, Rick P
0 siblings, 1 reply; 18+ messages in thread
From: Yan Zhao @ 2025-06-12 7:19 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: seanjc@google.com, Huang, Kai, binbin.wu@linux.intel.com,
Li, Xiaoyao, Chatre, Reinette, Hunter, Adrian,
kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
linux-kernel@vger.kernel.org, tony.lindgren@linux.intel.com
On Thu, Jun 12, 2025 at 05:16:05AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2025-06-11 at 13:43 -0700, Sean Christopherson wrote:
> > > Functionally, page_fault_can_be_fast() should have prevented this with
> > > the check of kvm->arch.has_private_mem.
> >
> > No? I see this:
> >
> > if (kvm->arch.has_private_mem &&
> > fault->is_private != kvm_mem_is_private(kvm, fault->gfn))
> > return false;
> >
> > I.e. a private fault can be fast, so long as the page is already in the
> > correct shared vs. private state. I can imagine that it's impossible for
> > TDX to generate protection violations, but I think
> > kvm_tdp_mmu_fast_pf_get_last_sptep() could be reached with a mirror root
> > if kvm_ad_enabled=false.
> >
> > if (!fault->present)
> > return !kvm_ad_enabled;
> >
> > /*
> > * Note, instruction fetches and writes are mutually exclusive,
> > ignore
> > * the "exec" flag.
> > */
> > return fault->write;
>
> Oh, how embarrassing. Yes, I misread the code, but the way it's working is, oh
> man...
>
> TDX isn't setting PFERR_WRITE_MASK or PFERR_PRESENT_MASK in the error_code
> passed into the fault handler. So page_fault_can_be_fast() should return false
> for that reason for private/mirror faults.
Hmm, TDX does set PFERR_WRITE_MASK in the error_code when fault->prefetch is
false (since exit_qual is set to EPT_VIOLATION_ACC_WRITE in
tdx_handle_ept_violation()).
PFERR_PRESENT_MASK is always unset.
page_fault_can_be_fast() does always return false for private mirror faults
though, due to the reason in
https://lore.kernel.org/kvm/aEp6pDQgbjsfrg2h@yzhao56-desk.sh.intel.com :)
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
2025-06-11 18:10 ` Sean Christopherson
2025-06-11 18:21 ` Paolo Bonzini
2025-06-11 20:45 ` Edgecombe, Rick P
@ 2025-06-12 12:20 ` Yan Zhao
2025-06-12 18:40 ` Edgecombe, Rick P
2 siblings, 1 reply; 18+ messages in thread
From: Yan Zhao @ 2025-06-12 12:20 UTC (permalink / raw)
To: Sean Christopherson
Cc: Xiaoyao Li, Paolo Bonzini, rick.p.edgecombe, kvm, linux-kernel,
reinette.chatre, kai.huang, adrian.hunter, isaku.yamahata,
Binbin Wu, tony.lindgren
On Wed, Jun 11, 2025 at 11:10:21AM -0700, Sean Christopherson wrote:
> On Tue, Jun 10, 2025, Xiaoyao Li wrote:
> > From: Paolo Bonzini <pbonzini@redhat.com>
> >
> > Bug[*] reported for TDX case when enabling KVM_PRE_FAULT_MEMORY in QEMU.
> >
> > It turns out that @gpa passed to kvm_mmu_do_page_fault() doesn't have
> > shared bit set when the memory attribute of it is shared, and it leads
> > to wrong root in tdp_mmu_get_root_for_fault().
> >
> > Fix it by embedding the direct bits in the gpa that is passed to
> > kvm_tdp_map_page(), when the memory of the gpa is not private.
> >
> > [*] https://lore.kernel.org/qemu-devel/4a757796-11c2-47f1-ae0d-335626e818fd@intel.com/
> >
> > Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Closes: https://lore.kernel.org/qemu-devel/4a757796-11c2-47f1-ae0d-335626e818fd@intel.com/
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > ---
> > we have selftests enhancement for TDX case of KVM_PRE_FAULT_MEMORY, but
> > the plan is to post them on top of the TDX selftests [1] when they get
> > upstream.
> >
> > [1] https://lore.kernel.org/all/20250414214801.2693294-1-sagis@google.com/
> > ---
> > arch/x86/kvm/mmu/mmu.c | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index cbc84c6abc2e..a4040578b537 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4896,6 +4896,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> > {
> > u64 error_code = PFERR_GUEST_FINAL_MASK;
> > u8 level = PG_LEVEL_4K;
> > + u64 direct_bits;
> > u64 end;
> > int r;
> >
> > @@ -4910,15 +4911,18 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> > if (r)
> > return r;
> >
> > + direct_bits = 0;
> > if (kvm_arch_has_private_mem(vcpu->kvm) &&
> > kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(range->gpa)))
> > error_code |= PFERR_PRIVATE_ACCESS;
> > + else
> > + direct_bits = gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));
>
> Eww. It's bad enough that TDX bleeds its mirror needs into common MMU code,
> but stuffing vendor specific GPA bits in common code goes too far. Actually,
> all of this goes too far. There's zero reason any code outside of TDX needs to
> *explicitly* care whether mirrors or "direct" MMUs have mandatory gfn bits.
>
> Side topic, kvm_arch_vcpu_pre_fault_memory() should disallow aliased GFNs. It
> will "work", because kvm_mmu_do_page_fault() will strip the SHARED bit, but it's
> all kinds of confusing due to KVM also forcing shared vs. private based on memory
> attributes.
>
> Back to the main topic, KVM needs to have a single source of truth when it comes
> to whether a fault is private and thus mirrored (or not). Common KVM needs to be
> aware of aliased GFN bits, but absolutely nothing outside of TDX (including common
> VMX code) should be aware the mirror vs. "direct" (I hate that terminology; KVM
> has far, far too much history and baggage with "direct") is tied to the existence
> and polarity of aliased GFN bits.
>
> What we have now doesn't even work *today* (see this bug), and it will be a
> complete trainwreck if we ever want to steal GFN bits for other reasons.
>
> To detect a mirror fault:
>
> static inline bool kvm_is_mirror_fault(struct kvm *kvm, u64 error_code)
> {
> return kvm_has_mirrored_tdp(kvm) &&
> error_code & PFERR_PRIVATE_ACCESS;
> }
>
> And for TDX, it should darn well explicitly track the shared GPA mask:
>
> static bool tdx_is_private_gpa(struct kvm *kvm, gpa_t gpa)
> {
> /* For TDX the direct mask is the shared mask. */
> return !(gpa & to_kvm_tdx(kvm)->shared_gpa_mask);
> }
>
> Overloading a field in kvm_arch and bleeding TDX details into common code isn't
> worth saving 8 bytes per VM.
>
> Outside of TDX, detecting mirrors, and anti-aliasing logic, the only use of
> kvm_gfn_direct_bits() is to constrain TDP MMU walks to the appropriate gfn range.
> And for that, we can simply use kvm_mmu_page.gfn, with a kvm_x86_ops hook to get
> the TDP MMU root GFN (root allocation is a slow path, the CALL+RET is a non-issue).
>
> Compile tested only, and obviously needs to be split into multiple patches.
>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 3 ++
> arch/x86/include/asm/kvm_host.h | 5 +-
> arch/x86/kvm/mmu.h | 21 +++++---
> arch/x86/kvm/mmu/mmu.c | 3 ++
> arch/x86/kvm/mmu/mmu_internal.h | 14 +-----
> arch/x86/kvm/mmu/tdp_iter.c | 4 +-
> arch/x86/kvm/mmu/tdp_iter.h | 12 ++---
> arch/x86/kvm/mmu/tdp_mmu.c | 13 ++---
> arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
> arch/x86/kvm/vmx/common.h | 17 ++++---
> arch/x86/kvm/vmx/main.c | 11 +++++
> arch/x86/kvm/vmx/tdx.c | 78 ++++++++++++++++++------------
> arch/x86/kvm/vmx/tdx.h | 2 +
> arch/x86/kvm/vmx/vmx.c | 2 +-
> arch/x86/kvm/vmx/x86_ops.h | 1 +
> 15 files changed, 112 insertions(+), 76 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 8d50e3e0a19b..f0a958ba4823 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -145,6 +145,9 @@ KVM_X86_OP(vcpu_deliver_sipi_vector)
> KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
> KVM_X86_OP_OPTIONAL(get_untagged_addr)
> KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
> +#ifdef CONFIG_X86_64
> +KVM_X86_OP_OPTIONAL_RET0(get_tdp_mmu_root_gfn);
> +#endif
> KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
> KVM_X86_OP_OPTIONAL_RET0(private_max_mapping_level)
> KVM_X86_OP_OPTIONAL(gmem_invalidate)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 330cdcbed1a6..3a5efde1ab9d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1578,7 +1578,7 @@ struct kvm_arch {
> #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
> struct kvm_mmu_memory_cache split_desc_cache;
>
> - gfn_t gfn_direct_bits;
> + gfn_t aliased_gfn_bits;
>
> /*
> * Size of the CPU's dirty log buffer, i.e. VMX's PML buffer. A Zero
> @@ -1897,6 +1897,9 @@ struct kvm_x86_ops {
>
> gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
> void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
> +#ifdef CONFIG_X86_64
> + gfn_t (*get_tdp_mmu_root_gfn)(struct kvm *kvm, bool is_mirror);
> +#endif
> int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
> void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
> int (*private_max_mapping_level)(struct kvm *kvm, kvm_pfn_t pfn);
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index b4b6860ab971..cef726e59f9b 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -257,7 +257,7 @@ extern bool tdp_mmu_enabled;
> #define tdp_mmu_enabled false
> #endif
>
> -bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
> +bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa, u64 error_code);
> int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
>
> static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
> @@ -309,20 +309,25 @@ static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
> return kvm->arch.vm_type == KVM_X86_TDX_VM;
> }
>
> -static inline gfn_t kvm_gfn_direct_bits(const struct kvm *kvm)
> +static inline bool kvm_is_mirror_fault(struct kvm *kvm, u64 error_code)
> {
> - return kvm->arch.gfn_direct_bits;
> + return kvm_has_mirrored_tdp(kvm) &&
> + error_code & PFERR_PRIVATE_ACCESS;
> }
What about passing in is_private instead?
static inline bool kvm_is_mirror_fault(struct kvm *kvm, bool is_private)
{
return kvm_has_mirrored_tdp(kvm) && is_private;
}
tdp_mmu_get_root_for_fault() and kvm_tdp_mmu_gpa_is_mapped() can pass in
fault->is_private or is_private directly, leaving the parsing of error_code &
PFERR_PRIVATE_ACCESS only in kvm_mmu_do_page_fault().
> -static inline bool kvm_is_addr_direct(struct kvm *kvm, gpa_t gpa)
> +static inline gfn_t kvm_aliased_gfn_bits(struct kvm *kvm)
> {
> - gpa_t gpa_direct_bits = gfn_to_gpa(kvm_gfn_direct_bits(kvm));
> -
> - return !gpa_direct_bits || (gpa & gpa_direct_bits);
> + return kvm->arch.aliased_gfn_bits;
> }
>
> static inline bool kvm_is_gfn_alias(struct kvm *kvm, gfn_t gfn)
> {
> - return gfn & kvm_gfn_direct_bits(kvm);
> + return gfn & kvm_aliased_gfn_bits(kvm);
> }
> +
> +static inline gpa_t kvm_get_unaliased_gpa(struct kvm *kvm, gpa_t gpa)
> +{
> + return gpa & ~gfn_to_gpa(kvm_aliased_gfn_bits(kvm));
> +}
> +
> #endif
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index cbc84c6abc2e..0228d49ac363 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4902,6 +4902,9 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> if (!vcpu->kvm->arch.pre_fault_allowed)
> return -EOPNOTSUPP;
>
> + if (kvm_is_gfn_alias(vcpu->kvm, gpa_to_gfn(range->gpa)))
> + return -EINVAL;
> +
> /*
> * reload is efficient when called repeatedly, so we can do it on
> * every iteration.
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index db8f33e4de62..7d2c53d2b0ca 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -175,18 +175,6 @@ static inline void kvm_mmu_alloc_external_spt(struct kvm_vcpu *vcpu, struct kvm_
> sp->external_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_external_spt_cache);
> }
>
> -static inline gfn_t kvm_gfn_root_bits(const struct kvm *kvm, const struct kvm_mmu_page *root)
> -{
> - /*
> - * Since mirror SPs are used only for TDX, which maps private memory
> - * at its "natural" GFN, no mask needs to be applied to them - and, dually,
> - * we expect that the bits is only used for the shared PT.
> - */
> - if (is_mirror_sp(root))
> - return 0;
> - return kvm_gfn_direct_bits(kvm);
> -}
> -
> static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm *kvm,
> struct kvm_mmu_page *sp)
> {
> @@ -376,7 +364,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> * bit. Strip it so that the GFN can be used like normal, and the
> * fault.addr can be used when the shared bit is needed.
> */
> - fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_direct_bits(vcpu->kvm);
> + fault.gfn = gpa_to_gfn(fault.addr) & ~vcpu->kvm->arch.aliased_gfn_bits;
> fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
> }
>
> diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> index 9e17bfa80901..c36bfb920382 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.c
> +++ b/arch/x86/kvm/mmu/tdp_iter.c
> @@ -37,8 +37,10 @@ void tdp_iter_restart(struct tdp_iter *iter)
> * rooted at root_pt, starting with the walk to translate next_last_level_gfn.
> */
> void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
> - int min_level, gfn_t next_last_level_gfn, gfn_t gfn_bits)
> + int min_level, gfn_t next_last_level_gfn, bool paranoid)
> {
> + gfn_t gfn_bits = paranoid ? 0 : root->gfn;
> +
> if (WARN_ON_ONCE(!root || (root->role.level < 1) ||
> (root->role.level > PT64_ROOT_MAX_LEVEL) ||
> (gfn_bits && next_last_level_gfn >= gfn_bits))) {
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index 364c5da6c499..5117d64952f7 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -120,13 +120,13 @@ struct tdp_iter {
> * Iterates over every SPTE mapping the GFN range [start, end) in a
> * preorder traversal.
> */
> -#define for_each_tdp_pte_min_level(iter, kvm, root, min_level, start, end) \
> - for (tdp_iter_start(&iter, root, min_level, start, kvm_gfn_root_bits(kvm, root)); \
> - iter.valid && iter.gfn < end; \
> +#define for_each_tdp_pte_min_level(iter, kvm, root, min_level, start, end) \
> + for (tdp_iter_start(&iter, root, min_level, start, false); \
> + iter.valid && iter.gfn < end; \
> tdp_iter_next(&iter))
>
> -#define for_each_tdp_pte_min_level_all(iter, root, min_level) \
> - for (tdp_iter_start(&iter, root, min_level, 0, 0); \
> +#define for_each_tdp_pte_min_level_paranoid(iter, root, min_level) \
> + for (tdp_iter_start(&iter, root, min_level, 0, true); \
> iter.valid && iter.gfn < tdp_mmu_max_gfn_exclusive(); \
> tdp_iter_next(&iter))
>
> @@ -136,7 +136,7 @@ struct tdp_iter {
> tdp_ptep_t spte_to_child_pt(u64 pte, int level);
>
> void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
> - int min_level, gfn_t next_last_level_gfn, gfn_t gfn_bits);
> + int min_level, gfn_t next_last_level_gfn, bool paranoid);
> void tdp_iter_next(struct tdp_iter *iter);
> void tdp_iter_restart(struct tdp_iter *iter);
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7f3d7229b2c1..15daf4353ccc 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -257,6 +257,7 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
> int as_id = kvm_mmu_role_as_id(role);
> struct kvm *kvm = vcpu->kvm;
> struct kvm_mmu_page *root;
> + gfn_t gfn;
>
> if (mirror)
> role.is_mirror = true;
> @@ -291,7 +292,8 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
> }
>
> root = tdp_mmu_alloc_sp(vcpu);
> - tdp_mmu_init_sp(root, NULL, 0, role);
> + gfn = kvm_x86_call(get_tdp_mmu_root_gfn)(vcpu->kvm, mirror);
> + tdp_mmu_init_sp(root, NULL, gfn, role);
>
> /*
> * TDP MMU roots are kept until they are explicitly invalidated, either
> @@ -860,7 +862,7 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> {
> struct tdp_iter iter;
>
> - for_each_tdp_pte_min_level_all(iter, root, zap_level) {
> + for_each_tdp_pte_min_level_paranoid(iter, root, zap_level) {
> retry:
> if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
> continue;
> @@ -1934,12 +1936,11 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, root);
> }
>
> -bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa)
> +bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa, u64 error_code)
bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa, bool is_private)
> {
> struct kvm *kvm = vcpu->kvm;
> - bool is_direct = kvm_is_addr_direct(kvm, gpa);
> - hpa_t root = is_direct ? vcpu->arch.mmu->root.hpa :
> - vcpu->arch.mmu->mirror_root_hpa;
> + hpa_t root = kvm_is_mirror_fault(kvm, error_code) ? vcpu->arch.mmu->mirror_root_hpa :
> + vcpu->arch.mmu->root.hpa;
hpa_t root = kvm_is_mirror_fault(kvm, is_private) ? vcpu->arch.mmu->mirror_root_hpa :
vcpu->arch.mmu->root.hpa;
> u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
> int leaf;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 52acf99d40a0..397309dfc73f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -48,7 +48,7 @@ static inline enum kvm_tdp_mmu_root_types kvm_gfn_range_filter_to_root_types(str
> static inline struct kvm_mmu_page *tdp_mmu_get_root_for_fault(struct kvm_vcpu *vcpu,
> struct kvm_page_fault *fault)
> {
> - if (unlikely(!kvm_is_addr_direct(vcpu->kvm, fault->addr)))
> + if (unlikely(kvm_is_mirror_fault(vcpu->kvm, fault->error_code)))
So, here it can be
if (unlikely(kvm_is_mirror_fault(vcpu->kvm, fault->is_private)))
> return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
>
> return root_to_sp(vcpu->arch.mmu->root.hpa);
> diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
> index a0c5e8781c33..5065fd8d41e8 100644
> --- a/arch/x86/kvm/vmx/common.h
> +++ b/arch/x86/kvm/vmx/common.h
> @@ -76,14 +76,9 @@ static __always_inline bool is_td_vcpu(struct kvm_vcpu *vcpu) { return false; }
>
> #endif
>
> -static inline bool vt_is_tdx_private_gpa(struct kvm *kvm, gpa_t gpa)
> -{
> - /* For TDX the direct mask is the shared mask. */
> - return !kvm_is_addr_direct(kvm, gpa);
> -}
> -
> static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
> - unsigned long exit_qualification)
> + unsigned long exit_qualification,
> + bool is_private)
> {
> u64 error_code;
>
> @@ -104,12 +99,18 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
> error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) ?
> PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
>
> - if (vt_is_tdx_private_gpa(vcpu->kvm, gpa))
> + if (is_private)
> error_code |= PFERR_PRIVATE_ACCESS;
>
> return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> }
>
> +static inline int vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
> + unsigned long exit_qualification)
> +{
> + return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification, false);
> +}
> +
> static inline void kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
> int pi_vec)
> {
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index d1e02e567b57..6e0652ed0d22 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -855,6 +855,14 @@ static void vt_setup_mce(struct kvm_vcpu *vcpu)
> vmx_setup_mce(vcpu);
> }
>
> +static gfn_t vt_get_tdp_mmu_root_gfn(struct kvm *kvm, bool is_mirror)
> +{
> + if (!is_td(kvm))
> + return 0;
> +
> + return tdx_get_tdp_mmu_root_gfn(kvm, is_mirror);
> +}
> +
> static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> {
> if (!is_td(kvm))
> @@ -1041,6 +1049,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>
> .get_untagged_addr = vmx_get_untagged_addr,
>
> +#ifdef CONFIG_X86_64
> + .get_tdp_mmu_root_gfn = vt_op_tdx_only(get_tdp_mmu_root_gfn),
> +#endif
> .mem_enc_ioctl = vt_op_tdx_only(mem_enc_ioctl),
> .vcpu_mem_enc_ioctl = vt_op_tdx_only(vcpu_mem_enc_ioctl),
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b952bc673271..1abf3c158cd5 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1138,6 +1138,12 @@ static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
> return __kvm_emulate_hypercall(vcpu, 0, complete_hypercall_exit);
> }
>
> +static bool tdx_is_private_gpa(struct kvm *kvm, gpa_t gpa)
> +{
> + /* For TDX the direct mask is the shared mask. */
> + return !(gpa & to_kvm_tdx(kvm)->shared_gpa_mask);
> +}
> +
> /*
> * Split into chunks and check interrupt pending between chunks. This allows
> * for timely injection of interrupts to prevent issues with guest lockup
> @@ -1192,9 +1198,9 @@ static void __tdx_map_gpa(struct vcpu_tdx *tdx)
> * vcpu->run->hypercall.ret, ensuring that it is zero to not break QEMU.
> */
> tdx->vcpu.run->hypercall.ret = 0;
> - tdx->vcpu.run->hypercall.args[0] = gpa & ~gfn_to_gpa(kvm_gfn_direct_bits(tdx->vcpu.kvm));
> + tdx->vcpu.run->hypercall.args[0] = gfn_to_gpa(kvm_get_unaliased_gpa(tdx->vcpu.kvm, gpa));
No need to do gfn_to_gpa() here; kvm_get_unaliased_gpa() already takes and
returns a gpa_t.
> tdx->vcpu.run->hypercall.args[1] = size / PAGE_SIZE;
> - tdx->vcpu.run->hypercall.args[2] = vt_is_tdx_private_gpa(tdx->vcpu.kvm, gpa) ?
> + tdx->vcpu.run->hypercall.args[2] = tdx_is_private_gpa(tdx->vcpu.kvm, gpa) ?
> KVM_MAP_GPA_RANGE_ENCRYPTED :
> KVM_MAP_GPA_RANGE_DECRYPTED;
> tdx->vcpu.run->hypercall.flags = KVM_EXIT_HYPERCALL_LONG_MODE;
> @@ -1222,8 +1228,8 @@ static int tdx_map_gpa(struct kvm_vcpu *vcpu)
>
> if (gpa + size <= gpa || !kvm_vcpu_is_legal_gpa(vcpu, gpa) ||
> !kvm_vcpu_is_legal_gpa(vcpu, gpa + size - 1) ||
> - (vt_is_tdx_private_gpa(vcpu->kvm, gpa) !=
> - vt_is_tdx_private_gpa(vcpu->kvm, gpa + size - 1))) {
> + (tdx_is_private_gpa(vcpu->kvm, gpa) !=
> + tdx_is_private_gpa(vcpu->kvm, gpa + size - 1))) {
> ret = TDVMCALL_STATUS_INVALID_OPERAND;
> goto error;
> }
> @@ -1411,11 +1417,11 @@ static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
> * TDG.VP.VMCALL<MMIO> allows only shared GPA, it makes no sense to
> * do MMIO emulation for private GPA.
> */
> - if (vt_is_tdx_private_gpa(vcpu->kvm, gpa) ||
> - vt_is_tdx_private_gpa(vcpu->kvm, gpa + size - 1))
> + if (tdx_is_private_gpa(vcpu->kvm, gpa) ||
> + tdx_is_private_gpa(vcpu->kvm, gpa + size - 1))
> goto error;
>
> - gpa = gpa & ~gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));
> + gpa = kvm_get_unaliased_gpa(vcpu->kvm, gpa);
>
> if (write)
> r = tdx_mmio_write(vcpu, gpa, size, val);
> @@ -1480,12 +1486,17 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
> return 1;
> }
>
> +gfn_t tdx_get_tdp_mmu_root_gfn(struct kvm *kvm, bool is_mirror)
> +{
> + return is_mirror ? 0 : gpa_to_gfn(to_kvm_tdx(kvm)->shared_gpa_mask);
> +}
> +
> void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
> {
> u64 shared_bit = (pgd_level == 5) ? TDX_SHARED_BIT_PWL_5 :
> TDX_SHARED_BIT_PWL_4;
>
> - if (KVM_BUG_ON(shared_bit != kvm_gfn_direct_bits(vcpu->kvm), vcpu->kvm))
> + if (KVM_BUG_ON(tdx_is_private_gpa(vcpu->kvm, shared_bit), vcpu->kvm))
> return;
Should be

	if (KVM_BUG_ON(tdx_is_private_gpa(vcpu->kvm, gfn_to_gpa(shared_bit)), vcpu->kvm))

or remove the gpa_to_gfn() in TDX_SHARED_BIT_PWL_5 / TDX_SHARED_BIT_PWL_4;
tdx_is_private_gpa() expects a gpa, while shared_bit here is a gfn-scale value.
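(For context, the mismatch comes from the definitions in tdx.h, which look
like the below -- quoted from memory, so double-check the tree:

	#define TDX_SHARED_BIT_PWL_5	gpa_to_gfn(BIT_ULL(51))
	#define TDX_SHARED_BIT_PWL_4	gpa_to_gfn(BIT_ULL(47))

i.e. they are gfn-scale values despite the BIT-style names.)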
> td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
> @@ -1837,10 +1848,10 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> {
> unsigned long exit_qual;
> gpa_t gpa = to_tdx(vcpu)->exit_gpa;
> - bool local_retry = false;
> + bool is_private = tdx_is_private_gpa(vcpu->kvm, gpa);
> int ret;
>
> - if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) {
> + if (is_private) {
> if (tdx_is_sept_violation_unexpected_pending(vcpu)) {
> pr_warn("Guest access before accepting 0x%llx on vCPU %d\n",
> gpa, vcpu->vcpu_id);
> @@ -1857,9 +1868,6 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> * due to aliasing a single HPA to multiple GPAs.
> */
> exit_qual = EPT_VIOLATION_ACC_WRITE;
> -
> - /* Only private GPA triggers zero-step mitigation */
> - local_retry = true;
> } else {
> exit_qual = vmx_get_exit_qual(vcpu);
> /*
> @@ -1907,9 +1915,10 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> * handle retries locally in their EPT violation handlers.
> */
> while (1) {
> - ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> + ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual, is_private);
>
> - if (ret != RET_PF_RETRY || !local_retry)
> + /* Only private GPA triggers zero-step mitigation */
> + if (ret != RET_PF_RETRY || !is_private)
> break;
>
> if (kvm_vcpu_has_events(vcpu) || signal_pending(current))
> @@ -2620,12 +2629,11 @@ static int tdx_read_cpuid(struct kvm_vcpu *vcpu, u32 leaf, u32 sub_leaf,
> out->flags |= sub_leaf_set ? KVM_CPUID_FLAG_SIGNIFCANT_INDEX : 0;
>
> /*
> - * Work around missing support on old TDX modules, fetch
> - * guest maxpa from gfn_direct_bits.
> + * Work around missing support on old TDX modules, derive the guest
> + * MAXPA from the shared bit.
> */
> if (leaf == 0x80000008) {
> - gpa_t gpa_bits = gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));
> - unsigned int g_maxpa = __ffs(gpa_bits) + 1;
> + unsigned int g_maxpa = __ffs(kvm_tdx->shared_gpa_mask) + 1;
>
> out->eax = tdx_set_guest_phys_addr_bits(out->eax, g_maxpa);
> }
> @@ -2648,6 +2656,7 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> struct kvm_tdx_init_vm *init_vm;
> struct td_params *td_params = NULL;
> + gpa_t shared_bit;
> int ret;
>
> BUILD_BUG_ON(sizeof(*init_vm) != 256 + sizeof_field(struct kvm_tdx_init_vm, cpuid));
> @@ -2712,9 +2721,12 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> kvm_tdx->xfam = td_params->xfam;
>
> if (td_params->config_flags & TDX_CONFIG_FLAGS_MAX_GPAW)
> - kvm->arch.gfn_direct_bits = TDX_SHARED_BIT_PWL_5;
> + shared_bit = TDX_SHARED_BIT_PWL_5;
> else
> - kvm->arch.gfn_direct_bits = TDX_SHARED_BIT_PWL_4;
> + shared_bit = TDX_SHARED_BIT_PWL_4;
> +
> + kvm_tdx->shared_gpa_mask = shared_bit;
> + kvm->arch.aliased_gfn_bits = gpa_to_gfn(shared_bit);
Should be

	kvm_tdx->shared_gpa_mask = gfn_to_gpa(shared_bit);
	kvm->arch.aliased_gfn_bits = shared_bit;

or remove the gpa_to_gfn() in TDX_SHARED_BIT_PWL_5 / TDX_SHARED_BIT_PWL_4,
since those constants are gfn-scale values despite shared_bit being declared
as gpa_t here.
>
> kvm_tdx->state = TD_STATE_INITIALIZED;
> out:
> @@ -3052,6 +3064,16 @@ struct tdx_gmem_post_populate_arg {
> __u32 flags;
> };
>
> +static int tdx_is_private_mapping_valid(struct kvm_vcpu *vcpu, gpa_t gpa)
> +{
> + if (!IS_ENABLED(CONFIG_KVM_PROVE_MMU))
> + return 0;
Should be "return 1".
> +
> + guard(read_lock)(&vcpu->kvm->mmu_lock);
> +
> + return kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa, PFERR_PRIVATE_ACCESS);
So, here it can be
return kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa, true);
> +}
> +
> static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> void __user *src, int order, void *_arg)
> {
> @@ -3084,14 +3106,8 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> * because all paths are covered by slots_lock and the
> * filemap invalidate lock. Check that they are indeed enough.
> */
> - if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
> - scoped_guard(read_lock, &kvm->mmu_lock) {
> - if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa), kvm)) {
> - ret = -EIO;
> - goto out;
> - }
> - }
> - }
> + if (KVM_BUG_ON(!tdx_is_private_mapping_valid(vcpu, gpa), kvm))
> + return -EIO;
>
> ret = 0;
> err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
> @@ -3148,8 +3164,8 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
> if (!PAGE_ALIGNED(region.source_addr) || !PAGE_ALIGNED(region.gpa) ||
> !region.nr_pages ||
> region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa ||
> - !vt_is_tdx_private_gpa(kvm, region.gpa) ||
> - !vt_is_tdx_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT) - 1))
> + !tdx_is_private_gpa(kvm, region.gpa) ||
> + !tdx_is_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT) - 1))
> return -EINVAL;
>
> kvm_mmu_reload(vcpu);
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 51f98443e8a2..3807562d5b48 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -27,6 +27,8 @@ struct kvm_tdx {
> int hkid;
> enum kvm_tdx_state state;
>
> + gpa_t shared_gpa_mask;
> +
> u64 attributes;
> u64 xfam;
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 9ff00ae9f05a..38103efe2f47 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5775,7 +5775,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa)))
> return kvm_emulate_instruction(vcpu, 0);
>
> - return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
> + return vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
> }
>
> static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index b4596f651232..ee56e96b4db3 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -151,6 +151,7 @@ int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
>
> int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
>
> +gfn_t tdx_get_tdp_mmu_root_gfn(struct kvm *kvm, bool is_mirror);
> int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
> enum pg_level level, void *private_spt);
> int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
>
> base-commit: 61374cc145f4a56377eaf87c7409a97ec7a34041
> --
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
2025-06-12 12:20 ` Yan Zhao
@ 2025-06-12 18:40 ` Edgecombe, Rick P
2025-06-13 0:09 ` Sean Christopherson
0 siblings, 1 reply; 18+ messages in thread
From: Edgecombe, Rick P @ 2025-06-12 18:40 UTC (permalink / raw)
To: seanjc@google.com, Zhao, Yan Y
Cc: Huang, Kai, binbin.wu@linux.intel.com, Li, Xiaoyao,
Chatre, Reinette, Hunter, Adrian, tony.lindgren@linux.intel.com,
kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On Thu, 2025-06-12 at 20:20 +0800, Yan Zhao wrote:
> What about passing in is_private instead?
>
> static inline bool kvm_is_mirror_fault(struct kvm *kvm, bool is_private)
> {
> return kvm_has_mirrored_tdp(kvm) && is_private;
> }
>
> tdp_mmu_get_root_for_fault() and kvm_tdp_mmu_gpa_is_mapped() can pass in
> fault->is_private or is_private directly, leaving the parsing of error_code &
> PFERR_PRIVATE_ACCESS only in kvm_mmu_do_page_fault().
General question about the existing code...
Why do we have the error code bits separated out into bools in struct
kvm_page_fault? It transitions between:
1. Native exit info (exit qualification, AMD error code, etc)
2. Synthetic error codes
3. struct kvm_page_fault bools *and* synthetic error code.
Why don't we go right to struct kvm_page_fault bools? Or just leave the
synthetic error code in struct kvm_page_fault and refer to it? Having both in
struct kvm_page_fault seems wrong, at least.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
2025-06-12 7:19 ` Yan Zhao
@ 2025-06-12 18:50 ` Edgecombe, Rick P
2025-06-13 1:14 ` Yan Zhao
0 siblings, 1 reply; 18+ messages in thread
From: Edgecombe, Rick P @ 2025-06-12 18:50 UTC (permalink / raw)
To: Zhao, Yan Y
Cc: seanjc@google.com, Huang, Kai, binbin.wu@linux.intel.com,
Li, Xiaoyao, Chatre, Reinette, Hunter, Adrian,
tony.lindgren@linux.intel.com, pbonzini@redhat.com,
kvm@vger.kernel.org, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On Thu, 2025-06-12 at 15:19 +0800, Yan Zhao wrote:
> > TDX isn't setting PFERR_WRITE_MASK or PFERR_PRESENT_MASK in the error_code
> > passed into the fault handler. So page_fault_can_be_fast() should return
> > false for that reason for private/mirror faults.
> Hmm, TDX does set PFERR_WRITE_MASK in the error_code when fault->prefetch is
> false (since exit_qual is set to EPT_VIOLATION_ACC_WRITE in
> tdx_handle_ept_violation()).
>
> PFERR_PRESENT_MASK is always unset.
>
> page_fault_can_be_fast() does always return false for private mirror faults
> though, due to the reason in
> https://lore.kernel.org/kvm/aEp6pDQgbjsfrg2h@yzhao56-desk.sh.intel.com :)
Seems cleanup-worthy to me, but not a bug. I think we should follow up,
depending on the scope of Sean's cleanup.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
2025-06-12 18:40 ` Edgecombe, Rick P
@ 2025-06-13 0:09 ` Sean Christopherson
2025-06-13 16:12 ` Edgecombe, Rick P
0 siblings, 1 reply; 18+ messages in thread
From: Sean Christopherson @ 2025-06-13 0:09 UTC (permalink / raw)
To: Rick P Edgecombe
Cc: Yan Y Zhao, Kai Huang, binbin.wu@linux.intel.com, Xiaoyao Li,
Reinette Chatre, Adrian Hunter, tony.lindgren@linux.intel.com,
kvm@vger.kernel.org, pbonzini@redhat.com, Isaku Yamahata,
linux-kernel@vger.kernel.org
On Thu, Jun 12, 2025, Rick P Edgecombe wrote:
> On Thu, 2025-06-12 at 20:20 +0800, Yan Zhao wrote:
> > What about passing in is_private instead?
> >
> > static inline bool kvm_is_mirror_fault(struct kvm *kvm, bool is_private)
> > {
> > return kvm_has_mirrored_tdp(kvm) && is_private;
> > }
> >
> > tdp_mmu_get_root_for_fault() and kvm_tdp_mmu_gpa_is_mapped() can pass in
> > fault->is_private or is_private directly, leaving the parsing of error_code &
> > PFERR_PRIVATE_ACCESS only in kvm_mmu_do_page_fault().
>
> General question about the existing code...
>
> Why do we have the error code bits separated out into bools in struct
> kvm_page_fault? It transitions between:
> 1. Native exit info (exit qualification, AMD error code, etc)
This step should be obvious :-)
> 2. Synthetic error codes
> 3. struct kvm_page_fault bools *and* synthetic error code.
A few reasons.
a. The error_code is used in other paths, e.g. see the PFERR_IMPLICIT_ACCESS
usage in emulator_write_std(), and the @access parameter from FNAME(gva_to_gpa)
to FNAME(walk_addr_generic) (which is why FNAME(walk_addr) takes a sanitized
"access", a.k.a. error code, instead of e.g. kvm_page_fault.
b. Keeping the entire error code allowed adding kvm_page_fault without having
to churn *everything*.
c. Preserving the entire error code simplifies the handoff to async #PF.
d. Unpacking error_code into bools makes downstream code much cleaner, e.g.
page_fault_can_be_fast() is a good example.
e. Waiting until kvm_mmu_do_page_fault() to fill kvm_page_fault deduplicates a
_lot_ of boilerplate, and allows for many fields to be "const".
f. I really, really want to make (most of) kvm_page_fault a structure that's
common to all architectures, at which point tracking e.g. exec, read, write,
etc. using bool is pretty much the only sane option.
> Why don't we go right to struct kvm_page_fault bools? Or just leave the
> synthetic error code in struct kvm_page_fault and refer to it? Having both in
> struct kvm_page_fault seems wrong, at least.
I actually like it. It's like having both the raw and decoded information for
CPUID or RDMSR output. All of the relevant fields are "const", so there's very
little chance of the state becoming out of sync.
I suppose an alternative would be to create union+bitfield overlay, but that
wouldn't work if/when pieces of kvm_page_fault are shared with other architectures,
and even without that angle in play, I think I actually prefer manually filling
bools.
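(To illustrate, the overlay would be something like the below -- purely a
sketch, with only the low architectural #PF bits shown:

	union kvm_fault_error_code {
		u64 raw;
		struct {
			u64 present	: 1;	/* PFERR_PRESENT_MASK */
			u64 write	: 1;	/* PFERR_WRITE_MASK */
			u64 user	: 1;	/* PFERR_USER_MASK */
			u64 rsvd	: 1;	/* PFERR_RSVD_MASK */
			u64 fetch	: 1;	/* PFERR_FETCH_MASK */
			u64 ignored	: 59;
		};
	};

and that's exactly the sort of thing that stops working once kvm_page_fault
is shared across architectures.)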
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
2025-06-12 18:50 ` Edgecombe, Rick P
@ 2025-06-13 1:14 ` Yan Zhao
0 siblings, 0 replies; 18+ messages in thread
From: Yan Zhao @ 2025-06-13 1:14 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: seanjc@google.com, Huang, Kai, binbin.wu@linux.intel.com,
Li, Xiaoyao, Chatre, Reinette, Hunter, Adrian,
tony.lindgren@linux.intel.com, pbonzini@redhat.com,
kvm@vger.kernel.org, Yamahata, Isaku,
linux-kernel@vger.kernel.org
On Fri, Jun 13, 2025 at 02:50:48AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-06-12 at 15:19 +0800, Yan Zhao wrote:
> > > TDX isn't setting PFERR_WRITE_MASK or PFERR_PRESENT_MASK in the error_code
> > > passed into the fault handler. So page_fault_can_be_fast() should return
> > > false for that reason for private/mirror faults.
> > Hmm, TDX does set PFERR_WRITE_MASK in the error_code when fault->prefetch is
> > false (since exit_qual is set to EPT_VIOLATION_ACC_WRITE in
> > tdx_handle_ept_violation()).
> >
> > PFERR_PRESENT_MASK is always unset.
> >
> > page_fault_can_be_fast() does always return false for private mirror faults
> > though, due to the reason in
> > https://lore.kernel.org/kvm/aEp6pDQgbjsfrg2h@yzhao56-desk.sh.intel.com :)
>
> Seems cleanup worthy to me, but not a bug. I think we should follow up,
> depending on the scope of Sean's cleanup.
Ok. There was an explicit disallowing of fast page faults for mirror roots [1].
Maybe we can add it back after Sean's cleanup.
[1] https://lore.kernel.org/kvm/af70ce8626cb7366d9b86a41c5d731f8ebd144ff.1708933498.git.isaku.yamahata@intel.com/
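(Roughly along the lines of an explicit bail-out early in
page_fault_can_be_fast() -- a hypothetical sketch using the helper proposed
earlier in this thread, not the exact hunk from [1]:

	/* Fast page fault is not supported for mirrored roots. */
	if (kvm_is_mirror_fault(kvm, fault->error_code))
		return false;
)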
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
2025-06-13 0:09 ` Sean Christopherson
@ 2025-06-13 16:12 ` Edgecombe, Rick P
0 siblings, 0 replies; 18+ messages in thread
From: Edgecombe, Rick P @ 2025-06-13 16:12 UTC (permalink / raw)
To: seanjc@google.com
Cc: linux-kernel@vger.kernel.org, Huang, Kai,
binbin.wu@linux.intel.com, Li, Xiaoyao, Chatre, Reinette,
Hunter, Adrian, tony.lindgren@linux.intel.com,
kvm@vger.kernel.org, Zhao, Yan Y, Yamahata, Isaku,
pbonzini@redhat.com
On Thu, 2025-06-12 at 17:09 -0700, Sean Christopherson wrote:
> A few reasons.
>
> a. The error_code is used in other paths, e.g. see the PFERR_IMPLICIT_ACCESS
> usage in emulator_write_std(), and the @access parameter from FNAME(gva_to_gpa)
> to FNAME(walk_addr_generic) (which is why FNAME(walk_addr) takes a sanitized
> "access", a.k.a. error code, instead of e.g. kvm_page_fault.
It makes sense to not have to re-constitute it.
> b. Keeping the entire error code allowed adding kvm_page_fault without having
> to churn *everything*.
> c. Preserving the entire error code simplifies the handoff to async #PF.
> d. Unpacking error_code into bools makes downstream code much cleaner, e.g.
> page_fault_can_be_fast() is a good example.
> e. Waiting until kvm_mmu_do_page_fault() to fill kvm_page_fault deduplicates a
> _lot_ of boilerplate, and allows for many fields to be "const".
> f. I really, really want to make (most of) kvm_page_fault a structure that's
> common to all architectures, at which point tracking e.g. exec, read, write,
> etc. using bool is pretty much the only sane option.
Aha on (f)!
It still seems a bit unfortunate to me, but thanks for sharing the practical
reasons. You kind of have to know to read the bools rather than the error code
(like Yan pointed out in this case), to keep it from looking like the choice
between them carries meaning.
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread
Thread overview: 18+ messages
2025-06-11 0:10 [PATCH] KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY Xiaoyao Li
2025-06-11 18:10 ` Sean Christopherson
2025-06-11 18:21 ` Paolo Bonzini
2025-06-11 19:37 ` Sean Christopherson
2025-06-11 20:25 ` Edgecombe, Rick P
2025-06-11 20:43 ` Sean Christopherson
2025-06-11 21:16 ` Edgecombe, Rick P
2025-06-12 7:19 ` Yan Zhao
2025-06-12 18:50 ` Edgecombe, Rick P
2025-06-13 1:14 ` Yan Zhao
2025-06-12 6:58 ` Yan Zhao
2025-06-11 20:45 ` Edgecombe, Rick P
2025-06-11 21:09 ` Sean Christopherson
2025-06-12 12:20 ` Yan Zhao
2025-06-12 18:40 ` Edgecombe, Rick P
2025-06-13 0:09 ` Sean Christopherson
2025-06-13 16:12 ` Edgecombe, Rick P
2025-06-12 4:44 ` Paolo Bonzini