From: Sean Christopherson <seanjc@google.com>
To: Michael Roth <michael.roth@amd.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>,
	pbonzini@redhat.com, kvm@vger.kernel.org,
	 linux-kernel@vger.kernel.org, rick.p.edgecombe@intel.com,
	kai.huang@intel.com,  adrian.hunter@intel.com,
	reinette.chatre@intel.com, xiaoyao.li@intel.com,
	 tony.lindgren@intel.com, binbin.wu@linux.intel.com,
	dmatlack@google.com,  isaku.yamahata@intel.com,
	ira.weiny@intel.com, vannapurve@google.com,  david@redhat.com,
	ackerleytng@google.com, tabba@google.com,  chao.p.peng@intel.com
Subject: Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
Date: Thu, 10 Jul 2025 09:24:13 -0700
Message-ID: <aG_pLUlHdYIZ2luh@google.com>
In-Reply-To: <20250709232103.zwmufocd3l7sqk7y@amd.com>

On Wed, Jul 09, 2025, Michael Roth wrote:
> On Thu, Jul 03, 2025 at 02:26:41PM +0800, Yan Zhao wrote:
> > Rather than invoking kvm_gmem_populate(), allow tdx_vcpu_init_mem_region()
> > to open code populating the initial memory region into the mirror page
> > table and adding the region to the S-EPT.
> > 
> > Background
> > ===
> > Sean initially suggested that TDX populate the initial memory region in a
> > 4-step way [1]. Paolo refactored guest_memfd and introduced the
> > kvm_gmem_populate() interface [2] to help TDX populate the init memory
> > region.

I wouldn't give my suggestion too much weight; I did qualify it with "Crazy idea."
after all :-)

> > tdx_vcpu_init_mem_region
> >     guard(mutex)(&kvm->slots_lock)
> >     kvm_gmem_populate
> >         filemap_invalidate_lock(file->f_mapping)
> >             __kvm_gmem_get_pfn      //1. get private PFN
> >             post_populate           //tdx_gmem_post_populate
> >                 get_user_pages_fast //2. get source page
> >                 kvm_tdp_map_page    //3. map private PFN to mirror root
> >                 tdh_mem_page_add    //4. add private PFN to S-EPT and copy
> >                                          source page to it.
> > 
> > kvm_gmem_populate() helps TDX to "get private PFN" in step 1. Its file
> > invalidate lock also helps ensure the private PFN remains valid when
> > tdh_mem_page_add() is invoked in TDX's post_populate hook.
> > 
> > Though TDX does not need the folio preparation code, kvm_gmem_populate()
> > helps share common code between SEV-SNP and TDX.
> > 
> > Problem
> > ===
> > (1)
> > In Michael's series "KVM: gmem: 2MB THP support and preparedness tracking
> > changes" [4], kvm_gmem_get_pfn() was modified to rely on the filemap
> > invalidation lock for protecting its preparedness tracking. Similarly, the
> > in-place conversion version of the guest_memfd series by Ackerley also
> > requires kvm_gmem_get_pfn() to acquire the filemap invalidation lock [5].
> > 
> > kvm_gmem_get_pfn
> >     filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> > 
> > However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which is
> > in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on the
> > filemap invalidation lock.
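
For clarity, this is my read of how the locking ends up nesting once
kvm_gmem_get_pfn() takes the filemap invalidation lock per [4]/[5] (call
chain is approximate):

  tdx_vcpu_init_mem_region
      kvm_gmem_populate
          filemap_invalidate_lock(file->f_mapping)          // exclusive
              tdx_gmem_post_populate
                  kvm_tdp_map_page
                      ...                                   // fault path
                          kvm_mmu_faultin_pfn
                              kvm_gmem_get_pfn
                                  filemap_invalidate_lock_shared(...)
                                  // shared lock on the same mapping while this
                                  // task already holds it exclusively => deadlock
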
> 
> Bringing the prior discussion over to here: it seems wrong that
> kvm_gmem_get_pfn() is getting called within the kvm_gmem_populate()
> chain, because:
> 
> 1) kvm_gmem_populate() is specifically passing the gmem PFN down to
>    tdx_gmem_post_populate(), but we are throwing it away to grab it
>    again via kvm_gmem_get_pfn(), which is then creating these locking issues
>    that we are trying to work around. If we could simply pass that PFN down
>    to kvm_tdp_map_page() (or some variant), then we would not trigger any
>    deadlocks in the first place.

Yes, doing kvm_mmu_faultin_pfn() in tdx_gmem_post_populate() is a major flaw.
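
And the irony is that the PFN is already sitting right there in the hook's
parameters; from memory (modulo the exact typedef name), the post-populate
callback is declared roughly as:

  typedef int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
                                      void __user *src, int order, void *opaque);

i.e. the very (gfn, pfn) pair that the fault path then goes and re-resolves,
with all of the locking baggage that entails.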

> 2) kvm_gmem_populate() is intended for pre-boot population of guest
>    memory, and allows the post_populate callback to handle setting
>    up the architecture-specific preparation, whereas kvm_gmem_get_pfn()
>    calls kvm_arch_gmem_prepare(), which is intended to handle post-boot
>    setup of private memory. Having kvm_gmem_get_pfn() called as part of the
>    kvm_gmem_populate() chain brings these 2 things into conflict with each
>    other, and TDX seems to be relying on the fact that it doesn't implement
>    a handler for kvm_arch_gmem_prepare().
> 
> I don't think this hurts anything in the current code, and I don't
> personally see any issue with open-coding the population path if it doesn't
> fit TDX very well. But some effort went into making kvm_gmem_populate()
> usable for both TDX and SNP, and if the real issue isn't the design of the
> interface itself, but rather some inflexibility on the KVM MMU mapping side,
> then it seems more robust to address the latter if possible.
> 
> Would something like the below be reasonable? 

No, polluting the page fault paths is a non-starter for me.  TDX really shouldn't
be synthesizing a page fault when it has the PFN in hand.  And some of the behavior
that's desirable for pre-faults looks flat out wrong for TDX.  E.g. returning '0'
on RET_PF_WRITE_PROTECTED and RET_PF_SPURIOUS (though maybe spurious is fine?).
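
For reference, the mapping I'm talking about is the switch at the tail of
kvm_tdp_map_page() (quoting from memory, so the exact set of cases may be
slightly off):

        switch (r) {
        case RET_PF_FIXED:
        case RET_PF_SPURIOUS:
        case RET_PF_WRITE_PROTECTED:
                return 0;
        case RET_PF_EMULATE:
                return -ENOENT;
        case RET_PF_RETRY:
        default:
                return -EIO;
        }

That behavior is fine for a best-effort pre-fault, but for the init-image path
anything other than RET_PF_FIXED should arguably be an error, which is what
the helper below does.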

I would much rather special case this path, because it absolutely is a special
snowflake.  This even eliminates several exports of low level helpers that frankly
have no business being used by TDX, e.g. kvm_mmu_reload().

---
 arch/x86/kvm/mmu.h         |  2 +-
 arch/x86/kvm/mmu/mmu.c     | 78 ++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu/tdp_mmu.c |  1 -
 arch/x86/kvm/vmx/tdx.c     | 24 ++----------
 4 files changed, 78 insertions(+), 27 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b4b6860ab971..9cd7a34333af 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -258,7 +258,7 @@ extern bool tdp_mmu_enabled;
 #endif
 
 bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
-int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
+int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn);
 
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6e838cb6c9e1..bc937f8ed5a0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4900,7 +4900,8 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	return direct_page_fault(vcpu, fault);
 }
 
-int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
+static int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+				 u64 error_code, u8 *level)
 {
 	int r;
 
@@ -4942,7 +4943,6 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
 		return -EIO;
 	}
 }
-EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
 
 long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 				    struct kvm_pre_fault_memory *range)
@@ -4978,7 +4978,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 	 * Shadow paging uses GVA for kvm page fault, so restrict to
 	 * two-dimensional paging.
 	 */
-	r = kvm_tdp_map_page(vcpu, range->gpa | direct_bits, error_code, &level);
+	r = kvm_tdp_prefault_page(vcpu, range->gpa | direct_bits, error_code, &level);
 	if (r < 0)
 		return r;
 
@@ -4990,6 +4990,77 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 	return min(range->size, end - range->gpa);
 }
 
+int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
+{
+	struct kvm_page_fault fault = {
+		.addr = gfn_to_gpa(gfn),
+		.error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS,
+		.prefetch = true,
+		.is_tdp = true,
+		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),
+
+		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
+		.req_level = PG_LEVEL_4K,
+		.goal_level = PG_LEVEL_4K,
+		.is_private = true,
+
+		.gfn = gfn,
+		.slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn),
+		.pfn = pfn,
+		.map_writable = true,
+	};
+	struct kvm *kvm = vcpu->kvm;
+	int r;
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
+		return -EIO;
+
+	if (kvm_gfn_is_write_tracked(kvm, fault.slot, fault.gfn))
+		return -EPERM;
+
+	r = kvm_mmu_reload(vcpu);
+	if (r)
+		return r;
+
+	r = mmu_topup_memory_caches(vcpu, false);
+	if (r)
+		return r;
+
+	do {
+		if (signal_pending(current))
+			return -EINTR;
+
+		if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu))
+			return -EIO;
+
+		cond_resched();
+
+		guard(read_lock)(&kvm->mmu_lock);
+
+		r = kvm_tdp_mmu_map(vcpu, &fault);
+	} while (r == RET_PF_RETRY);
+
+	if (r != RET_PF_FIXED)
+		return -EIO;
+
+	/*
+	 * The caller is responsible for ensuring that no MMU invalidations can
+	 * occur.  Sanity check that the mapping hasn't been zapped.
+	 */
+	if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
+		cond_resched();
+
+		scoped_guard(read_lock, &kvm->mmu_lock) {
+			if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, fault.addr), kvm))
+				return -EIO;
+		}
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_tdp_mmu_map_private_pfn);
+
 static void nonpaging_init_context(struct kvm_mmu *context)
 {
 	context->page_fault = nonpaging_page_fault;
@@ -5973,7 +6044,6 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
 out:
 	return r;
 }
-EXPORT_SYMBOL_GPL(kvm_mmu_load);
 
 void kvm_mmu_unload(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7f3d7229b2c1..4f73d5341ebe 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1953,7 +1953,6 @@ bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa)
 	spte = sptes[leaf];
 	return is_shadow_present_pte(spte) && is_last_spte(spte, leaf);
 }
-EXPORT_SYMBOL_GPL(kvm_tdp_mmu_gpa_is_mapped);
 
 /*
  * Returns the last level spte pointer of the shadow page walk for the given
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f4d4fd5cc6e8..02142496754f 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3170,15 +3170,12 @@ struct tdx_gmem_post_populate_arg {
 static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 				  void __user *src, int order, void *_arg)
 {
-	u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
-	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	struct tdx_gmem_post_populate_arg *arg = _arg;
-	struct kvm_vcpu *vcpu = arg->vcpu;
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	u64 err, entry, level_state;
 	gpa_t gpa = gfn_to_gpa(gfn);
-	u8 level = PG_LEVEL_4K;
 	struct page *src_page;
 	int ret, i;
-	u64 err, entry, level_state;
 
 	/*
 	 * Get the source page if it has been faulted in. Return failure if the
@@ -3190,24 +3187,10 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	if (ret != 1)
 		return -ENOMEM;
 
-	ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
+	ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
 	if (ret < 0)
 		goto out;
 
-	/*
-	 * The private mem cannot be zapped after kvm_tdp_map_page()
-	 * because all paths are covered by slots_lock and the
-	 * filemap invalidate lock.  Check that they are indeed enough.
-	 */
-	if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
-		scoped_guard(read_lock, &kvm->mmu_lock) {
-			if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa), kvm)) {
-				ret = -EIO;
-				goto out;
-			}
-		}
-	}
-
 	ret = 0;
 	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
 			       src_page, &entry, &level_state);
@@ -3267,7 +3250,6 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
 	    !vt_is_tdx_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT) - 1))
 		return -EINVAL;
 
-	kvm_mmu_reload(vcpu);
 	ret = 0;
 	while (region.nr_pages) {
 		if (signal_pending(current)) {

base-commit: 6c7ecd725e503bf2ca69ff52c6cc48bb650b1f11
--

