[RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem

kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
@ 2025-07-03  6:26 Yan Zhao
  2025-07-03 16:51 ` Vishal Annapurve
  2025-07-09 23:21 ` Michael Roth
  0 siblings, 2 replies; 42+ messages in thread
From: Yan Zhao @ 2025-07-03  6:26 UTC (permalink / raw)
  To: pbonzini, seanjc, kvm
  Cc: linux-kernel, rick.p.edgecombe, kai.huang, adrian.hunter,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, ira.weiny, michael.roth, vannapurve, david,
	ackerleytng, tabba, chao.p.peng, Yan Zhao

Rather than invoking kvm_gmem_populate(), allow tdx_vcpu_init_mem_region()
to use open code to populate the initial memory region into the mirror page
table, and add the region to S-EPT.

Background
===
Sean initially suggested TDX to populate initial memory region in a 4-step
way [1]. Paolo refactored guest_memfd and introduced kvm_gmem_populate()
interface [2] to help TDX populate init memory region.

tdx_vcpu_init_mem_region
    guard(mutex)(&kvm->slots_lock)
    kvm_gmem_populate
        filemap_invalidate_lock(file->f_mapping)
            __kvm_gmem_get_pfn      //1. get private PFN
            post_populate           //tdx_gmem_post_populate
                get_user_pages_fast //2. get source page
                kvm_tdp_map_page    //3. map private PFN to mirror root
                tdh_mem_page_add    //4. add private PFN to S-EPT and copy
                                         source page to it.

kvm_gmem_populate() helps TDX to "get private PFN" in step 1. Its file
invalidate lock also helps ensure the private PFN remains valid when
tdh_mem_page_add() is invoked in TDX's post_populate hook.

Though TDX does not need the folio prepration code, kvm_gmem_populate()
helps on sharing common code between SEV-SNP and TDX.

Problem
===
(1)
In Michael's series "KVM: gmem: 2MB THP support and preparedness tracking
changes" [4], kvm_gmem_get_pfn() was modified to rely on the filemap
invalidation lock for protecting its preparedness tracking. Similarly, the
in-place conversion version of guest_memfd series by Ackerly also requires
kvm_gmem_get_pfn() to acquire filemap invalidation lock [5].

kvm_gmem_get_pfn
    filemap_invalidate_lock_shared(file_inode(file)->i_mapping);

However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which is
in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on the
filemap invalidation lock.

(2)
Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
resulting in the following lock sequence in tdx_vcpu_init_mem_region():
- filemap invalidation lock --> mm->mmap_lock

However, in future code, the shared filemap invalidation lock will be held
in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
- mm->mmap_lock --> filemap invalidation lock

This creates an AB-BA deadlock issue.

These two issues should still present in Michael Roth's code [7], [8].

Proposal
===
To prevent deadlock and the AB-BA issue, this patch enables TDX to populate
the initial memory region independently of kvm_gmem_populate(). The revised
sequence in tdx_vcpu_init_mem_region() is as follows:

tdx_vcpu_init_mem_region
    guard(mutex)(&kvm->slots_lock)
    tdx_init_mem_populate
        get_user_pages_fast //1. get source page
        kvm_tdp_map_page    //2. map private PFN to mirror root
        read_lock(&kvm->mmu_lock);
        kvm_tdp_mmu_gpa_is_mapped // 3. check if the gpa is mapped in the
                                        mirror root and return the mapped
                                        private PFN.
        tdh_mem_page_add    //4. add private PFN to S-EPT and copy source
                                 page to it
        read_unlock(&kvm->mmu_lock);

The original step 1 "get private PFN" is now integrated in the new step 3
"check if the gpa is mapped in the mirror root and return the mapped
private PFN".

With the protection of slots_lock, the read mmu_lock ensures the private
PFN added by tdh_mem_page_add() is the same one mapped in the mirror page
table. Addiontionally, before the TD state becomes TD_STATE_RUNNABLE, the
only permitted map level is 4KB, preventing any potential merging or
splitting in the mirror root under the read mmu_lock.

So, this approach should work for TDX. It still follows the spirit in
Sean's suggestion [1], where mapping the private PFN to mirror root and
adding it to the S-EPT with initial content from the source page are
executed in separate steps.

Discussions
===
The introduction of kvm_gmem_populate() was intended to make it usable by
both TDX and SEV-SNP [3], which is why Paolo provided the vendor hook
post_populate for both.

a) TDX keeps using kvm_gmem_populate().
   Pros: - keep the status quo
         - share common code between SEV-SNP and TDX, though TDX does not
           need to prepare folios.
   Cons: - we need to explore solutions to the locking issues, e.g. the
           proposal at [11].
         - PFN is faulted in twice for each GFN:
           one in __kvm_gmem_get_pfn(), another in kvm_gmem_get_pfn().

b) Michael suggested introducing some variant of
   kvm_tdp_map_page()/kvm_mmu_do_page_fault() to avoid invoking
   kvm_gmem_get_pfn() in the kvm_gmem_populate() path. [10].
   Pro:  - TDX can still invoke kvm_gmem_populate().
           can share common code between SEV-SNP and TDX.
   Cons: - only TDX needs this variant.
         - can't fix the 2nd AB-BA lock issue.

c) Change in this patch
   Pro: greater flexibility. Simplify the implementation for both SEV-SNP
        and TDX.
   Con: undermine the purpose of sharing common code.
        kvm_gmem_populate() may only be usable by SEV-SNP in future.

Link: https://lore.kernel.org/kvm/Ze-TJh0BBOWm9spT@google.com [1]
Link: https://lore.kernel.org/lkml/20240404185034.3184582-10-pbonzini@redhat.com [2]
Link: https://lore.kernel.org/lkml/20240404185034.3184582-1-pbonzini@redhat.com [3]
Link: https://lore.kernel.org/lkml/20241212063635.712877-4-michael.roth@amd.com [4]
Link: https://lore.kernel.org/all/b784326e9ccae6a08388f1bf39db70a2204bdc51.1747264138.git.ackerleytng@google.com [5]
Link: https://lore.kernel.org/all/20250430165655.605595-9-tabba@google.com [6]
Link: https://github.com/mdroth/linux/commits/mmap-swprot-v10-snp0-wip2 [7]
Link: https://lore.kernel.org/kvm/20250613005400.3694904-1-michael.roth@amd.com [8]
Link: https://lore.kernel.org/lkml/20250613151939.z5ztzrtibr6xatql@amd.com [9]
Link: https://lore.kernel.org/lkml/20250613180418.bo4vqveigxsq2ouu@amd.com [10]
Link: https://lore.kernel.org/lkml/aErK25Oo5VJna40z@yzhao56-desk.sh.intel.com [11]

Suggested-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
This is an RFC patch. Will split it later.
---
 arch/x86/kvm/mmu.h         |  3 +-
 arch/x86/kvm/mmu/tdp_mmu.c |  6 ++-
 arch/x86/kvm/vmx/tdx.c     | 96 ++++++++++++++------------------------
 3 files changed, 42 insertions(+), 63 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b4b6860ab971..b122255c7d4e 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -257,7 +257,8 @@ extern bool tdp_mmu_enabled;
 #define tdp_mmu_enabled false
 #endif
 
-bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
+bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, gpa_t gpa, int level,
+			       kvm_pfn_t *pfn);
 int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
 
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7f3d7229b2c1..bb95c95f6531 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1934,7 +1934,8 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
 	return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, root);
 }
 
-bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa)
+bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, gpa_t gpa, int level,
+			       kvm_pfn_t *pfn)
 {
 	struct kvm *kvm = vcpu->kvm;
 	bool is_direct = kvm_is_addr_direct(kvm, gpa);
@@ -1947,10 +1948,11 @@ bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa)
 	rcu_read_lock();
 	leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, root_to_sp(root));
 	rcu_read_unlock();
-	if (leaf < 0)
+	if (leaf < 0 || leaf != level)
 		return false;
 
 	spte = sptes[leaf];
+	*pfn = spte_to_pfn(spte);
 	return is_shadow_present_pte(spte) && is_last_spte(spte, leaf);
 }
 EXPORT_SYMBOL_GPL(kvm_tdp_mmu_gpa_is_mapped);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b952bc673271..f3c2bb3554c3 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1521,9 +1521,9 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
 }
 
 /*
- * KVM_TDX_INIT_MEM_REGION calls kvm_gmem_populate() to map guest pages; the
- * callback tdx_gmem_post_populate() then maps pages into private memory.
- * through the a seamcall TDH.MEM.PAGE.ADD().  The SEAMCALL also requires the
+ * KVM_TDX_INIT_MEM_REGION calls tdx_init_mem_populate() to first map guest
+ * pages into mirror page table and then maps pages into private memory through
+ * the a SEAMCALL TDH.MEM.PAGE.ADD().  The SEAMCALL also requires the
  * private EPT structures for the page to have been built before, which is
  * done via kvm_tdp_map_page(). nr_premapped counts the number of pages that
  * were added to the EPT structures but not added with TDH.MEM.PAGE.ADD().
@@ -3047,23 +3047,17 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	WARN_ON_ONCE(init_event);
 }
 
-struct tdx_gmem_post_populate_arg {
-	struct kvm_vcpu *vcpu;
-	__u32 flags;
-};
-
-static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
-				  void __user *src, int order, void *_arg)
+static int tdx_init_mem_populate(struct kvm_vcpu *vcpu, gpa_t gpa,
+				 void __user *src, __u32 flags)
 {
 	u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
-	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
-	struct tdx_gmem_post_populate_arg *arg = _arg;
-	struct kvm_vcpu *vcpu = arg->vcpu;
-	gpa_t gpa = gfn_to_gpa(gfn);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+	struct kvm *kvm = vcpu->kvm;
 	u8 level = PG_LEVEL_4K;
 	struct page *src_page;
 	int ret, i;
 	u64 err, entry, level_state;
+	kvm_pfn_t pfn;
 
 	/*
 	 * Get the source page if it has been faulted in. Return failure if the
@@ -3079,38 +3073,33 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	if (ret < 0)
 		goto out;
 
-	/*
-	 * The private mem cannot be zapped after kvm_tdp_map_page()
-	 * because all paths are covered by slots_lock and the
-	 * filemap invalidate lock.  Check that they are indeed enough.
-	 */
-	if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
-		scoped_guard(read_lock, &kvm->mmu_lock) {
-			if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa), kvm)) {
-				ret = -EIO;
-				goto out;
-			}
-		}
-	}
+	KVM_BUG_ON(level != PG_LEVEL_4K, kvm);
 
-	ret = 0;
-	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
-			       src_page, &entry, &level_state);
-	if (err) {
-		ret = unlikely(tdx_operand_busy(err)) ? -EBUSY : -EIO;
-		goto out;
-	}
+	scoped_guard(read_lock, &kvm->mmu_lock) {
+		if (!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa, level, &pfn)) {
+			ret = -EIO;
+			goto out;
+		}
 
-	if (!KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm))
-		atomic64_dec(&kvm_tdx->nr_premapped);
+		ret = 0;
+		err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
+				       src_page, &entry, &level_state);
+		if (err) {
+			ret = unlikely(tdx_operand_busy(err)) ? -EBUSY : -EIO;
+			goto out;
+		}
 
-	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
-		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
-			err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry,
-					    &level_state);
-			if (err) {
-				ret = -EIO;
-				break;
+		if (!KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm))
+			atomic64_dec(&kvm_tdx->nr_premapped);
+
+		if (flags & KVM_TDX_MEASURE_MEMORY_REGION) {
+			for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
+				err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry,
+						    &level_state);
+				if (err) {
+					ret = -EIO;
+					break;
+				}
 			}
 		}
 	}
@@ -3126,8 +3115,6 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
 	struct kvm *kvm = vcpu->kvm;
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	struct kvm_tdx_init_mem_region region;
-	struct tdx_gmem_post_populate_arg arg;
-	long gmem_ret;
 	int ret;
 
 	if (tdx->state != VCPU_TD_STATE_INITIALIZED)
@@ -3160,22 +3147,11 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
 			break;
 		}
 
-		arg = (struct tdx_gmem_post_populate_arg) {
-			.vcpu = vcpu,
-			.flags = cmd->flags,
-		};
-		gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
-					     u64_to_user_ptr(region.source_addr),
-					     1, tdx_gmem_post_populate, &arg);
-		if (gmem_ret < 0) {
-			ret = gmem_ret;
-			break;
-		}
-
-		if (gmem_ret != 1) {
-			ret = -EIO;
+		ret = tdx_init_mem_populate(vcpu, region.gpa,
+					    u64_to_user_ptr(region.source_addr),
+					    cmd->flags);
+		if (ret)
 			break;
-		}
 
 		region.source_addr += PAGE_SIZE;
 		region.gpa += PAGE_SIZE;
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-03  6:26 [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate() Yan Zhao
@ 2025-07-03 16:51 ` Vishal Annapurve
  2025-07-09 23:21 ` Michael Roth
  1 sibling, 0 replies; 42+ messages in thread
From: Vishal Annapurve @ 2025-07-03 16:51 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, kvm, linux-kernel, rick.p.edgecombe, kai.huang,
	adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, ira.weiny, michael.roth,
	david, ackerleytng, tabba, chao.p.peng

On Wed, Jul 2, 2025 at 11:29 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> Rather than invoking kvm_gmem_populate(), allow tdx_vcpu_init_mem_region()
> to use open code to populate the initial memory region into the mirror page
> table, and add the region to S-EPT.
>
> Background
> ===
> Sean initially suggested TDX to populate initial memory region in a 4-step
> way [1]. Paolo refactored guest_memfd and introduced kvm_gmem_populate()
> interface [2] to help TDX populate init memory region.
>
> tdx_vcpu_init_mem_region
>     guard(mutex)(&kvm->slots_lock)
>     kvm_gmem_populate
>         filemap_invalidate_lock(file->f_mapping)
>             __kvm_gmem_get_pfn      //1. get private PFN
>             post_populate           //tdx_gmem_post_populate
>                 get_user_pages_fast //2. get source page
>                 kvm_tdp_map_page    //3. map private PFN to mirror root
>                 tdh_mem_page_add    //4. add private PFN to S-EPT and copy
>                                          source page to it.
>
> kvm_gmem_populate() helps TDX to "get private PFN" in step 1. Its file
> invalidate lock also helps ensure the private PFN remains valid when
> tdh_mem_page_add() is invoked in TDX's post_populate hook.
>
> Though TDX does not need the folio prepration code, kvm_gmem_populate()
> helps on sharing common code between SEV-SNP and TDX.
>
> Problem
> ===
> (1)
> In Michael's series "KVM: gmem: 2MB THP support and preparedness tracking
> changes" [4], kvm_gmem_get_pfn() was modified to rely on the filemap
> invalidation lock for protecting its preparedness tracking. Similarly, the
> in-place conversion version of guest_memfd series by Ackerly also requires
> kvm_gmem_get_pfn() to acquire filemap invalidation lock [5].
>
> kvm_gmem_get_pfn
>     filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
>
> However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which is
> in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on the
> filemap invalidation lock.
>
> (2)
> Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> - filemap invalidation lock --> mm->mmap_lock
>
> However, in future code, the shared filemap invalidation lock will be held
> in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> - mm->mmap_lock --> filemap invalidation lock
>
> This creates an AB-BA deadlock issue.
>
> These two issues should still present in Michael Roth's code [7], [8].
>
> Proposal
> ===
> To prevent deadlock and the AB-BA issue, this patch enables TDX to populate
> the initial memory region independently of kvm_gmem_populate(). The revised
> sequence in tdx_vcpu_init_mem_region() is as follows:
>
> tdx_vcpu_init_mem_region
>     guard(mutex)(&kvm->slots_lock)
>     tdx_init_mem_populate
>         get_user_pages_fast //1. get source page
>         kvm_tdp_map_page    //2. map private PFN to mirror root
>         read_lock(&kvm->mmu_lock);
>         kvm_tdp_mmu_gpa_is_mapped // 3. check if the gpa is mapped in the
>                                         mirror root and return the mapped
>                                         private PFN.
>         tdh_mem_page_add    //4. add private PFN to S-EPT and copy source
>                                  page to it
>         read_unlock(&kvm->mmu_lock);
>
> The original step 1 "get private PFN" is now integrated in the new step 3
> "check if the gpa is mapped in the mirror root and return the mapped
> private PFN".
>
> With the protection of slots_lock, the read mmu_lock ensures the private
> PFN added by tdh_mem_page_add() is the same one mapped in the mirror page
> table. Addiontionally, before the TD state becomes TD_STATE_RUNNABLE, the
> only permitted map level is 4KB, preventing any potential merging or
> splitting in the mirror root under the read mmu_lock.
>
> So, this approach should work for TDX. It still follows the spirit in
> Sean's suggestion [1], where mapping the private PFN to mirror root and
> adding it to the S-EPT with initial content from the source page are
> executed in separate steps.
>
> Discussions
> ===
> The introduction of kvm_gmem_populate() was intended to make it usable by
> both TDX and SEV-SNP [3], which is why Paolo provided the vendor hook
> post_populate for both.
>
> a) TDX keeps using kvm_gmem_populate().
>    Pros: - keep the status quo
>          - share common code between SEV-SNP and TDX, though TDX does not
>            need to prepare folios.
>    Cons: - we need to explore solutions to the locking issues, e.g. the
>            proposal at [11].
>          - PFN is faulted in twice for each GFN:
>            one in __kvm_gmem_get_pfn(), another in kvm_gmem_get_pfn().
>
> b) Michael suggested introducing some variant of
>    kvm_tdp_map_page()/kvm_mmu_do_page_fault() to avoid invoking
>    kvm_gmem_get_pfn() in the kvm_gmem_populate() path. [10].
>    Pro:  - TDX can still invoke kvm_gmem_populate().
>            can share common code between SEV-SNP and TDX.
>    Cons: - only TDX needs this variant.
>          - can't fix the 2nd AB-BA lock issue.
>
> c) Change in this patch
>    Pro: greater flexibility. Simplify the implementation for both SEV-SNP
>         and TDX.
>    Con: undermine the purpose of sharing common code.
>         kvm_gmem_populate() may only be usable by SEV-SNP in future.
>
> Link: https://lore.kernel.org/kvm/Ze-TJh0BBOWm9spT@google.com [1]
> Link: https://lore.kernel.org/lkml/20240404185034.3184582-10-pbonzini@redhat.com [2]
> Link: https://lore.kernel.org/lkml/20240404185034.3184582-1-pbonzini@redhat.com [3]
> Link: https://lore.kernel.org/lkml/20241212063635.712877-4-michael.roth@amd.com [4]
> Link: https://lore.kernel.org/all/b784326e9ccae6a08388f1bf39db70a2204bdc51.1747264138.git.ackerleytng@google.com [5]
> Link: https://lore.kernel.org/all/20250430165655.605595-9-tabba@google.com [6]
> Link: https://github.com/mdroth/linux/commits/mmap-swprot-v10-snp0-wip2 [7]
> Link: https://lore.kernel.org/kvm/20250613005400.3694904-1-michael.roth@amd.com [8]
> Link: https://lore.kernel.org/lkml/20250613151939.z5ztzrtibr6xatql@amd.com [9]
> Link: https://lore.kernel.org/lkml/20250613180418.bo4vqveigxsq2ouu@amd.com [10]
> Link: https://lore.kernel.org/lkml/aErK25Oo5VJna40z@yzhao56-desk.sh.intel.com [11]
>
> Suggested-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---

Thanks Yan for revisiting the initial memory population for TDX VMs.
This implementation seems much cleaner to me.

Acked-by: Vishal Annapurve <vannapurve@google.com>
Tested-by: Vishal Annapurve <vannapurve@google.com>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-03  6:26 [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate() Yan Zhao
  2025-07-03 16:51 ` Vishal Annapurve
@ 2025-07-09 23:21 ` Michael Roth
  2025-07-10 16:24   ` Sean Christopherson
  1 sibling, 1 reply; 42+ messages in thread
From: Michael Roth @ 2025-07-09 23:21 UTC (permalink / raw)
  To: Yan Zhao
  Cc: pbonzini, seanjc, kvm, linux-kernel, rick.p.edgecombe, kai.huang,
	adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
	binbin.wu, dmatlack, isaku.yamahata, ira.weiny, vannapurve, david,
	ackerleytng, tabba, chao.p.peng

On Thu, Jul 03, 2025 at 02:26:41PM +0800, Yan Zhao wrote:
> Rather than invoking kvm_gmem_populate(), allow tdx_vcpu_init_mem_region()
> to use open code to populate the initial memory region into the mirror page
> table, and add the region to S-EPT.
> 
> Background
> ===
> Sean initially suggested TDX to populate initial memory region in a 4-step
> way [1]. Paolo refactored guest_memfd and introduced kvm_gmem_populate()
> interface [2] to help TDX populate init memory region.
> 
> tdx_vcpu_init_mem_region
>     guard(mutex)(&kvm->slots_lock)
>     kvm_gmem_populate
>         filemap_invalidate_lock(file->f_mapping)
>             __kvm_gmem_get_pfn      //1. get private PFN
>             post_populate           //tdx_gmem_post_populate
>                 get_user_pages_fast //2. get source page
>                 kvm_tdp_map_page    //3. map private PFN to mirror root
>                 tdh_mem_page_add    //4. add private PFN to S-EPT and copy
>                                          source page to it.
> 
> kvm_gmem_populate() helps TDX to "get private PFN" in step 1. Its file
> invalidate lock also helps ensure the private PFN remains valid when
> tdh_mem_page_add() is invoked in TDX's post_populate hook.
> 
> Though TDX does not need the folio prepration code, kvm_gmem_populate()
> helps on sharing common code between SEV-SNP and TDX.
> 
> Problem
> ===
> (1)
> In Michael's series "KVM: gmem: 2MB THP support and preparedness tracking
> changes" [4], kvm_gmem_get_pfn() was modified to rely on the filemap
> invalidation lock for protecting its preparedness tracking. Similarly, the
> in-place conversion version of guest_memfd series by Ackerly also requires
> kvm_gmem_get_pfn() to acquire filemap invalidation lock [5].
> 
> kvm_gmem_get_pfn
>     filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> 
> However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which is
> in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on the
> filemap invalidation lock.

Bringing the prior discussion over to here: it seems wrong that
kvm_gmem_get_pfn() is getting called within the kvm_gmem_populate()
chain, because:

1) kvm_gmem_populate() is specifically passing the gmem PFN down to
   tdx_gmem_post_populate(), but we are throwing it away to grab it
   again kvm_gmem_get_pfn(), which is then creating these locking issues
   that we are trying to work around. If we could simply pass that PFN down
   to kvm_tdp_map_page() (or some variant), then we would not trigger any
   deadlocks in the first place.

2) kvm_gmem_populate() is intended for pre-boot population of guest
   memory, and allows the post_populate callback to handle setting
   up the architecture-specific preparation, whereas kvm_gmem_get_pfn()
   calls kvm_arch_gmem_prepare(), which is intended to handle post-boot
   setup of private memory. Having kvm_gmem_get_pfn() called as part of
   kvm_gmem_populate() chain brings things 2 things in conflict with
   each other, and TDX seems to be relying on that fact that it doesn't
   implement a handler for kvm_arch_gmem_prepare(). 

I don't think this hurts anything in the current code, and I don't
personally see any issue with open-coding the population path if it doesn't
fit TDX very well, but there was some effort put into making
kvm_gmem_populate() usable for both TDX/SNP, and if the real issue isn't the
design of the interface itself, but instead just some inflexibility on the
KVM MMU mapping side, then it seems more robust to address the latter if
possible.

Would something like the below be reasonable? I scoped it to only be for
mapping gmem pages, but I guess it could be generalized if similar use-case
ever arose. (completely untested)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b4b6860ab971..93319f02f8b7 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -259,6 +259,8 @@ extern bool tdp_mmu_enabled;
 
 bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
 int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
+int kvm_tdp_map_page_gmem(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level,
+                         kvm_pfn_t gmem_pfn);
 
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4e06e2e89a8f..61766ac0fa29 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4503,11 +4503,17 @@ static int kvm_mmu_faultin_pfn_private(struct kvm_vcpu *vcpu,
                return -EFAULT;
        }
 
-       r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
-                            &fault->refcounted_page, &max_order);
-       if (r) {
-               kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
-               return r;
+       if (fault->gmem_pfn == KVM_PFN_ERR_FAULT) {
+               r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
+                                    &fault->refcounted_page, &max_order);
+               if (r) {
+                       kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+                       return r;
+               }
+       } else {
+               /* Only the above-handled KVM_PFN_ERR_FAULT is expected currently. */
+               WARN_ON_ONCE(is_error_pfn(fault->gmem_pfn);
+               fault->pfn = fault->gmem_pfn;
        }
 
        fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
@@ -4847,7 +4853,8 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
        return direct_page_fault(vcpu, fault);
 }
 
-int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
+int kvm_tdp_map_page_gmem(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level,
+                         kvm_pfn_t gmem_pfn)
 {
        int r;
 
@@ -4866,7 +4873,12 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
                        return -EIO;
 
                cond_resched();
-               r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
+
+               if (is_error_pfn(pfn))
+                       r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
+               else
+                       folio_get(
+                       r = kvm_mmu_do_page_fault_pfn(vcpu, gpa, error_code, true, NULL, level, pfn);
        } while (r == RET_PF_RETRY);
 
        if (r < 0)
@@ -4889,6 +4901,12 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
                return -EIO;
        }
 }
+EXPORT_SYMBOL_GPL(kvm_tdp_map_page_gmem);
+
+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
+{
+       return kvm_tdp_map_page_gmem(vcpu, gpa, error_code, level, KVM_PFN_ERR_FAULT);
+}
 EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
 
 long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index db8f33e4de62..b0ffa4541657 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -286,6 +286,7 @@ struct kvm_page_fault {
        /* Outputs of kvm_mmu_faultin_pfn().  */
        unsigned long mmu_seq;
        kvm_pfn_t pfn;
+       kvm_pfn_t gmem_pfn;
        struct page *refcounted_page;
        bool map_writable;
 
@@ -344,9 +345,10 @@ static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
                                      fault->is_private);
 }
 
-static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
-                                       u64 err, bool prefetch,
-                                       int *emulation_type, u8 *level)
+static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+                                         u64 err, bool prefetch,
+                                         int *emulation_type, u8 *level,
+                                         kvm_pfn_t gmem_pfn)
 {
        struct kvm_page_fault fault = {
                .addr = cr2_or_gpa,
@@ -367,6 +369,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
                .is_private = err & PFERR_PRIVATE_ACCESS,
 
                .pfn = KVM_PFN_ERR_FAULT,
+               .gmem_pfn = gmem_pfn,
        };
        int r;
 
@@ -408,6 +411,14 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
        return r;
 }
 
+static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+                                       u64 err, bool prefetch,
+                                       int *emulation_type, u8 *level)
+{
+       return __kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, err, prefetch,
+                                      emulation_type, level, KVM_PFN_ERR_FAULT);
+}
+
 int kvm_mmu_max_mapping_level(struct kvm *kvm,
                              const struct kvm_memory_slot *slot, gfn_t gfn);
 void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b952bc673271..516416a9af27 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3075,7 +3075,7 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
        if (ret != 1)
                return -ENOMEM;
 
-       ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
+       ret = kvm_tdp_map_page_gmem(vcpu, gpa, error_code, &level, pfn);
        if (ret < 0)
                goto out;

Thanks,

Mike

> 
> (2)
> Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> - filemap invalidation lock --> mm->mmap_lock
> 
> However, in future code, the shared filemap invalidation lock will be held
> in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> - mm->mmap_lock --> filemap invalidation lock
> 
> This creates an AB-BA deadlock issue.
> 
> These two issues should still present in Michael Roth's code [7], [8].
> 
> Proposal
> ===
> To prevent deadlock and the AB-BA issue, this patch enables TDX to populate
> the initial memory region independently of kvm_gmem_populate(). The revised
> sequence in tdx_vcpu_init_mem_region() is as follows:
> 
> tdx_vcpu_init_mem_region
>     guard(mutex)(&kvm->slots_lock)
>     tdx_init_mem_populate
>         get_user_pages_fast //1. get source page
>         kvm_tdp_map_page    //2. map private PFN to mirror root
>         read_lock(&kvm->mmu_lock);
>         kvm_tdp_mmu_gpa_is_mapped // 3. check if the gpa is mapped in the
>                                         mirror root and return the mapped
>                                         private PFN.
>         tdh_mem_page_add    //4. add private PFN to S-EPT and copy source
>                                  page to it
>         read_unlock(&kvm->mmu_lock);
> 
> The original step 1 "get private PFN" is now integrated in the new step 3
> "check if the gpa is mapped in the mirror root and return the mapped
> private PFN".
> 
> With the protection of slots_lock, the read mmu_lock ensures the private
> PFN added by tdh_mem_page_add() is the same one mapped in the mirror page
> table. Addiontionally, before the TD state becomes TD_STATE_RUNNABLE, the
> only permitted map level is 4KB, preventing any potential merging or
> splitting in the mirror root under the read mmu_lock.
> 
> So, this approach should work for TDX. It still follows the spirit in
> Sean's suggestion [1], where mapping the private PFN to mirror root and
> adding it to the S-EPT with initial content from the source page are
> executed in separate steps.
> 
> Discussions
> ===
> The introduction of kvm_gmem_populate() was intended to make it usable by
> both TDX and SEV-SNP [3], which is why Paolo provided the vendor hook
> post_populate for both.
> 
> a) TDX keeps using kvm_gmem_populate().
>    Pros: - keep the status quo
>          - share common code between SEV-SNP and TDX, though TDX does not
>            need to prepare folios.
>    Cons: - we need to explore solutions to the locking issues, e.g. the
>            proposal at [11].
>          - PFN is faulted in twice for each GFN:
>            one in __kvm_gmem_get_pfn(), another in kvm_gmem_get_pfn().
> 
> b) Michael suggested introducing some variant of
>    kvm_tdp_map_page()/kvm_mmu_do_page_fault() to avoid invoking
>    kvm_gmem_get_pfn() in the kvm_gmem_populate() path. [10].
>    Pro:  - TDX can still invoke kvm_gmem_populate().
>            can share common code between SEV-SNP and TDX.
>    Cons: - only TDX needs this variant.
>          - can't fix the 2nd AB-BA lock issue.
> 
> c) Change in this patch
>    Pro: greater flexibility. Simplify the implementation for both SEV-SNP
>         and TDX.
>    Con: undermine the purpose of sharing common code.
>         kvm_gmem_populate() may only be usable by SEV-SNP in future.
> 
> Link: https://lore.kernel.org/kvm/Ze-TJh0BBOWm9spT@google.com [1]
> Link: https://lore.kernel.org/lkml/20240404185034.3184582-10-pbonzini@redhat.com [2]
> Link: https://lore.kernel.org/lkml/20240404185034.3184582-1-pbonzini@redhat.com [3]
> Link: https://lore.kernel.org/lkml/20241212063635.712877-4-michael.roth@amd.com [4]
> Link: https://lore.kernel.org/all/b784326e9ccae6a08388f1bf39db70a2204bdc51.1747264138.git.ackerleytng@google.com [5]
> Link: https://lore.kernel.org/all/20250430165655.605595-9-tabba@google.com [6]
> Link: https://github.com/mdroth/linux/commits/mmap-swprot-v10-snp0-wip2 [7]
> Link: https://lore.kernel.org/kvm/20250613005400.3694904-1-michael.roth@amd.com [8]
> Link: https://lore.kernel.org/lkml/20250613151939.z5ztzrtibr6xatql@amd.com [9]
> Link: https://lore.kernel.org/lkml/20250613180418.bo4vqveigxsq2ouu@amd.com [10]
> Link: https://lore.kernel.org/lkml/aErK25Oo5VJna40z@yzhao56-desk.sh.intel.com [11]
> 
> Suggested-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
> This is an RFC patch. Will split it later.
> ---
>  arch/x86/kvm/mmu.h         |  3 +-
>  arch/x86/kvm/mmu/tdp_mmu.c |  6 ++-
>  arch/x86/kvm/vmx/tdx.c     | 96 ++++++++++++++------------------------
>  3 files changed, 42 insertions(+), 63 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index b4b6860ab971..b122255c7d4e 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -257,7 +257,8 @@ extern bool tdp_mmu_enabled;
>  #define tdp_mmu_enabled false
>  #endif
>  
> -bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
> +bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, gpa_t gpa, int level,
> +			       kvm_pfn_t *pfn);
>  int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
>  
>  static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7f3d7229b2c1..bb95c95f6531 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1934,7 +1934,8 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
>  	return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, root);
>  }
>  
> -bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa)
> +bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, gpa_t gpa, int level,
> +			       kvm_pfn_t *pfn)
>  {
>  	struct kvm *kvm = vcpu->kvm;
>  	bool is_direct = kvm_is_addr_direct(kvm, gpa);
> @@ -1947,10 +1948,11 @@ bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa)
>  	rcu_read_lock();
>  	leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, root_to_sp(root));
>  	rcu_read_unlock();
> -	if (leaf < 0)
> +	if (leaf < 0 || leaf != level)
>  		return false;
>  
>  	spte = sptes[leaf];
> +	*pfn = spte_to_pfn(spte);
>  	return is_shadow_present_pte(spte) && is_last_spte(spte, leaf);
>  }
>  EXPORT_SYMBOL_GPL(kvm_tdp_mmu_gpa_is_mapped);
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b952bc673271..f3c2bb3554c3 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1521,9 +1521,9 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
>  }
>  
>  /*
> - * KVM_TDX_INIT_MEM_REGION calls kvm_gmem_populate() to map guest pages; the
> - * callback tdx_gmem_post_populate() then maps pages into private memory.
> - * through the a seamcall TDH.MEM.PAGE.ADD().  The SEAMCALL also requires the
> + * KVM_TDX_INIT_MEM_REGION calls tdx_init_mem_populate() to first map guest
> + * pages into mirror page table and then maps pages into private memory through
> + * the a SEAMCALL TDH.MEM.PAGE.ADD().  The SEAMCALL also requires the
>   * private EPT structures for the page to have been built before, which is
>   * done via kvm_tdp_map_page(). nr_premapped counts the number of pages that
>   * were added to the EPT structures but not added with TDH.MEM.PAGE.ADD().
> @@ -3047,23 +3047,17 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>  	WARN_ON_ONCE(init_event);
>  }
>  
> -struct tdx_gmem_post_populate_arg {
> -	struct kvm_vcpu *vcpu;
> -	__u32 flags;
> -};
> -
> -static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> -				  void __user *src, int order, void *_arg)
> +static int tdx_init_mem_populate(struct kvm_vcpu *vcpu, gpa_t gpa,
> +				 void __user *src, __u32 flags)
>  {
>  	u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
> -	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> -	struct tdx_gmem_post_populate_arg *arg = _arg;
> -	struct kvm_vcpu *vcpu = arg->vcpu;
> -	gpa_t gpa = gfn_to_gpa(gfn);
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> +	struct kvm *kvm = vcpu->kvm;
>  	u8 level = PG_LEVEL_4K;
>  	struct page *src_page;
>  	int ret, i;
>  	u64 err, entry, level_state;
> +	kvm_pfn_t pfn;
>  
>  	/*
>  	 * Get the source page if it has been faulted in. Return failure if the
> @@ -3079,38 +3073,33 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  	if (ret < 0)
>  		goto out;
>  
> -	/*
> -	 * The private mem cannot be zapped after kvm_tdp_map_page()
> -	 * because all paths are covered by slots_lock and the
> -	 * filemap invalidate lock.  Check that they are indeed enough.
> -	 */
> -	if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
> -		scoped_guard(read_lock, &kvm->mmu_lock) {
> -			if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa), kvm)) {
> -				ret = -EIO;
> -				goto out;
> -			}
> -		}
> -	}
> +	KVM_BUG_ON(level != PG_LEVEL_4K, kvm);
>  
> -	ret = 0;
> -	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
> -			       src_page, &entry, &level_state);
> -	if (err) {
> -		ret = unlikely(tdx_operand_busy(err)) ? -EBUSY : -EIO;
> -		goto out;
> -	}
> +	scoped_guard(read_lock, &kvm->mmu_lock) {
> +		if (!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa, level, &pfn)) {
> +			ret = -EIO;
> +			goto out;
> +		}
>  
> -	if (!KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm))
> -		atomic64_dec(&kvm_tdx->nr_premapped);
> +		ret = 0;
> +		err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
> +				       src_page, &entry, &level_state);
> +		if (err) {
> +			ret = unlikely(tdx_operand_busy(err)) ? -EBUSY : -EIO;
> +			goto out;
> +		}
>  
> -	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
> -		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
> -			err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry,
> -					    &level_state);
> -			if (err) {
> -				ret = -EIO;
> -				break;
> +		if (!KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm))
> +			atomic64_dec(&kvm_tdx->nr_premapped);
> +
> +		if (flags & KVM_TDX_MEASURE_MEMORY_REGION) {
> +			for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
> +				err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry,
> +						    &level_state);
> +				if (err) {
> +					ret = -EIO;
> +					break;
> +				}
>  			}
>  		}
>  	}
> @@ -3126,8 +3115,6 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
>  	struct kvm *kvm = vcpu->kvm;
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>  	struct kvm_tdx_init_mem_region region;
> -	struct tdx_gmem_post_populate_arg arg;
> -	long gmem_ret;
>  	int ret;
>  
>  	if (tdx->state != VCPU_TD_STATE_INITIALIZED)
> @@ -3160,22 +3147,11 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
>  			break;
>  		}
>  
> -		arg = (struct tdx_gmem_post_populate_arg) {
> -			.vcpu = vcpu,
> -			.flags = cmd->flags,
> -		};
> -		gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
> -					     u64_to_user_ptr(region.source_addr),
> -					     1, tdx_gmem_post_populate, &arg);
> -		if (gmem_ret < 0) {
> -			ret = gmem_ret;
> -			break;
> -		}
> -
> -		if (gmem_ret != 1) {
> -			ret = -EIO;
> +		ret = tdx_init_mem_populate(vcpu, region.gpa,
> +					    u64_to_user_ptr(region.source_addr),
> +					    cmd->flags);
> +		if (ret)
>  			break;
> -		}
>  
>  		region.source_addr += PAGE_SIZE;
>  		region.gpa += PAGE_SIZE;
> -- 
> 2.43.2
> 

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-09 23:21 ` Michael Roth
@ 2025-07-10 16:24   ` Sean Christopherson
  2025-07-11  1:41     ` Ira Weiny
  2025-07-11  4:36     ` Yan Zhao
  0 siblings, 2 replies; 42+ messages in thread
From: Sean Christopherson @ 2025-07-10 16:24 UTC (permalink / raw)
  To: Michael Roth
  Cc: Yan Zhao, pbonzini, kvm, linux-kernel, rick.p.edgecombe,
	kai.huang, adrian.hunter, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, ira.weiny,
	vannapurve, david, ackerleytng, tabba, chao.p.peng

On Wed, Jul 09, 2025, Michael Roth wrote:
> On Thu, Jul 03, 2025 at 02:26:41PM +0800, Yan Zhao wrote:
> > Rather than invoking kvm_gmem_populate(), allow tdx_vcpu_init_mem_region()
> > to use open code to populate the initial memory region into the mirror page
> > table, and add the region to S-EPT.
> > 
> > Background
> > ===
> > Sean initially suggested TDX to populate initial memory region in a 4-step
> > way [1]. Paolo refactored guest_memfd and introduced kvm_gmem_populate()
> > interface [2] to help TDX populate init memory region.

I wouldn't give my suggestion too much weight; I did qualify it with "Crazy idea."
after all :-)

> > tdx_vcpu_init_mem_region
> >     guard(mutex)(&kvm->slots_lock)
> >     kvm_gmem_populate
> >         filemap_invalidate_lock(file->f_mapping)
> >             __kvm_gmem_get_pfn      //1. get private PFN
> >             post_populate           //tdx_gmem_post_populate
> >                 get_user_pages_fast //2. get source page
> >                 kvm_tdp_map_page    //3. map private PFN to mirror root
> >                 tdh_mem_page_add    //4. add private PFN to S-EPT and copy
> >                                          source page to it.
> > 
> > kvm_gmem_populate() helps TDX to "get private PFN" in step 1. Its file
> > invalidate lock also helps ensure the private PFN remains valid when
> > tdh_mem_page_add() is invoked in TDX's post_populate hook.
> > 
> > Though TDX does not need the folio prepration code, kvm_gmem_populate()
> > helps on sharing common code between SEV-SNP and TDX.
> > 
> > Problem
> > ===
> > (1)
> > In Michael's series "KVM: gmem: 2MB THP support and preparedness tracking
> > changes" [4], kvm_gmem_get_pfn() was modified to rely on the filemap
> > invalidation lock for protecting its preparedness tracking. Similarly, the
> > in-place conversion version of guest_memfd series by Ackerly also requires
> > kvm_gmem_get_pfn() to acquire filemap invalidation lock [5].
> > 
> > kvm_gmem_get_pfn
> >     filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> > 
> > However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which is
> > in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on the
> > filemap invalidation lock.
> 
> Bringing the prior discussion over to here: it seems wrong that
> kvm_gmem_get_pfn() is getting called within the kvm_gmem_populate()
> chain, because:
> 
> 1) kvm_gmem_populate() is specifically passing the gmem PFN down to
>    tdx_gmem_post_populate(), but we are throwing it away to grab it
>    again kvm_gmem_get_pfn(), which is then creating these locking issues
>    that we are trying to work around. If we could simply pass that PFN down
>    to kvm_tdp_map_page() (or some variant), then we would not trigger any
>    deadlocks in the first place.

Yes, doing kvm_mmu_faultin_pfn() in tdx_gmem_post_populate() is a major flaw.

> 2) kvm_gmem_populate() is intended for pre-boot population of guest
>    memory, and allows the post_populate callback to handle setting
>    up the architecture-specific preparation, whereas kvm_gmem_get_pfn()
>    calls kvm_arch_gmem_prepare(), which is intended to handle post-boot
>    setup of private memory. Having kvm_gmem_get_pfn() called as part of
>    kvm_gmem_populate() chain brings things 2 things in conflict with
>    each other, and TDX seems to be relying on that fact that it doesn't
>    implement a handler for kvm_arch_gmem_prepare(). 
> 
> I don't think this hurts anything in the current code, and I don't
> personally see any issue with open-coding the population path if it doesn't
> fit TDX very well, but there was some effort put into making
> kvm_gmem_populate() usable for both TDX/SNP, and if the real issue isn't the
> design of the interface itself, but instead just some inflexibility on the
> KVM MMU mapping side, then it seems more robust to address the latter if
> possible.
> 
> Would something like the below be reasonable? 

No, polluting the page fault paths is a non-starter for me.  TDX really shouldn't
be synthesizing a page fault when it has the PFN in hand.  And some of the behavior
that's desirable for pre-faults looks flat out wrong for TDX.  E.g. returning '0'
on RET_PF_WRITE_PROTECTED and RET_PF_SPURIOUS (though maybe spurious is fine?).

I would much rather special case this path, because it absolutely is a special
snowflake.  This even eliminates several exports of low level helpers that frankly
have no business being used by TDX, e.g. kvm_mmu_reload().

---
 arch/x86/kvm/mmu.h         |  2 +-
 arch/x86/kvm/mmu/mmu.c     | 78 ++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu/tdp_mmu.c |  1 -
 arch/x86/kvm/vmx/tdx.c     | 24 ++----------
 4 files changed, 78 insertions(+), 27 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b4b6860ab971..9cd7a34333af 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -258,7 +258,7 @@ extern bool tdp_mmu_enabled;
 #endif
 
 bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
-int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
+int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn);
 
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6e838cb6c9e1..bc937f8ed5a0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4900,7 +4900,8 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	return direct_page_fault(vcpu, fault);
 }
 
-int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
+static int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+				 u64 error_code, u8 *level)
 {
 	int r;
 
@@ -4942,7 +4943,6 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
 		return -EIO;
 	}
 }
-EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
 
 long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 				    struct kvm_pre_fault_memory *range)
@@ -4978,7 +4978,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 	 * Shadow paging uses GVA for kvm page fault, so restrict to
 	 * two-dimensional paging.
 	 */
-	r = kvm_tdp_map_page(vcpu, range->gpa | direct_bits, error_code, &level);
+	r = kvm_tdp_prefault_page(vcpu, range->gpa | direct_bits, error_code, &level);
 	if (r < 0)
 		return r;
 
@@ -4990,6 +4990,77 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 	return min(range->size, end - range->gpa);
 }
 
+int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
+{
+	struct kvm_page_fault fault = {
+		.addr = gfn_to_gpa(gfn),
+		.error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS,
+		.prefetch = true,
+		.is_tdp = true,
+		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),
+
+		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
+		.req_level = PG_LEVEL_4K,
+		.goal_level = PG_LEVEL_4K,
+		.is_private = true,
+
+		.gfn = gfn,
+		.slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn),
+		.pfn = pfn,
+		.map_writable = true,
+	};
+	struct kvm *kvm = vcpu->kvm;
+	int r;
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
+		return -EIO;
+
+	if (kvm_gfn_is_write_tracked(kvm, fault.slot, fault.gfn))
+		return -EPERM;
+
+	r = kvm_mmu_reload(vcpu);
+	if (r)
+		return r;
+
+	r = mmu_topup_memory_caches(vcpu, false);
+	if (r)
+		return r;
+
+	do {
+		if (signal_pending(current))
+			return -EINTR;
+
+		if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu))
+			return -EIO;
+
+		cond_resched();
+
+		guard(read_lock)(&kvm->mmu_lock);
+
+		r = kvm_tdp_mmu_map(vcpu, &fault);
+	} while (r == RET_PF_RETRY);
+
+	if (r != RET_PF_FIXED)
+		return -EIO;
+
+	/*
+	 * The caller is responsible for ensuring that no MMU invalidations can
+	 * occur.  Sanity check that the mapping hasn't been zapped.
+	 */
+	if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
+		cond_resched();
+
+		scoped_guard(read_lock, &kvm->mmu_lock) {
+			if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, fault.addr), kvm))
+				return -EIO;
+		}
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_tdp_mmu_map_private_pfn);
+
 static void nonpaging_init_context(struct kvm_mmu *context)
 {
 	context->page_fault = nonpaging_page_fault;
@@ -5973,7 +6044,6 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
 out:
 	return r;
 }
-EXPORT_SYMBOL_GPL(kvm_mmu_load);
 
 void kvm_mmu_unload(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7f3d7229b2c1..4f73d5341ebe 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1953,7 +1953,6 @@ bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa)
 	spte = sptes[leaf];
 	return is_shadow_present_pte(spte) && is_last_spte(spte, leaf);
 }
-EXPORT_SYMBOL_GPL(kvm_tdp_mmu_gpa_is_mapped);
 
 /*
  * Returns the last level spte pointer of the shadow page walk for the given
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f4d4fd5cc6e8..02142496754f 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3170,15 +3170,12 @@ struct tdx_gmem_post_populate_arg {
 static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 				  void __user *src, int order, void *_arg)
 {
-	u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
-	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	struct tdx_gmem_post_populate_arg *arg = _arg;
-	struct kvm_vcpu *vcpu = arg->vcpu;
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	u64 err, entry, level_state;
 	gpa_t gpa = gfn_to_gpa(gfn);
-	u8 level = PG_LEVEL_4K;
 	struct page *src_page;
 	int ret, i;
-	u64 err, entry, level_state;
 
 	/*
 	 * Get the source page if it has been faulted in. Return failure if the
@@ -3190,24 +3187,10 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	if (ret != 1)
 		return -ENOMEM;
 
-	ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
+	ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
 	if (ret < 0)
 		goto out;
 
-	/*
-	 * The private mem cannot be zapped after kvm_tdp_map_page()
-	 * because all paths are covered by slots_lock and the
-	 * filemap invalidate lock.  Check that they are indeed enough.
-	 */
-	if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
-		scoped_guard(read_lock, &kvm->mmu_lock) {
-			if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa), kvm)) {
-				ret = -EIO;
-				goto out;
-			}
-		}
-	}
-
 	ret = 0;
 	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
 			       src_page, &entry, &level_state);
@@ -3267,7 +3250,6 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
 	    !vt_is_tdx_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT) - 1))
 		return -EINVAL;
 
-	kvm_mmu_reload(vcpu);
 	ret = 0;
 	while (region.nr_pages) {
 		if (signal_pending(current)) {

base-commit: 6c7ecd725e503bf2ca69ff52c6cc48bb650b1f11
--

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-10 16:24   ` Sean Christopherson
@ 2025-07-11  1:41     ` Ira Weiny
  2025-07-11 14:21       ` Sean Christopherson
  2025-07-11  4:36     ` Yan Zhao
  1 sibling, 1 reply; 42+ messages in thread
From: Ira Weiny @ 2025-07-11  1:41 UTC (permalink / raw)
  To: Sean Christopherson, Michael Roth
  Cc: Yan Zhao, pbonzini, kvm, linux-kernel, rick.p.edgecombe,
	kai.huang, adrian.hunter, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, ira.weiny,
	vannapurve, david, ackerleytng, tabba, chao.p.peng

Sean Christopherson wrote:
> On Wed, Jul 09, 2025, Michael Roth wrote:
> > On Thu, Jul 03, 2025 at 02:26:41PM +0800, Yan Zhao wrote:
> > > Rather than invoking kvm_gmem_populate(), allow tdx_vcpu_init_mem_region()
> > > to use open code to populate the initial memory region into the mirror page
> > > table, and add the region to S-EPT.
> > > 
> > > Background
> > > ===
> > > Sean initially suggested TDX to populate initial memory region in a 4-step
> > > way [1]. Paolo refactored guest_memfd and introduced kvm_gmem_populate()
> > > interface [2] to help TDX populate init memory region.
> 
> I wouldn't give my suggestion too much weight; I did qualify it with "Crazy idea."
> after all :-)
> 
> > > tdx_vcpu_init_mem_region
> > >     guard(mutex)(&kvm->slots_lock)
> > >     kvm_gmem_populate
> > >         filemap_invalidate_lock(file->f_mapping)
> > >             __kvm_gmem_get_pfn      //1. get private PFN
> > >             post_populate           //tdx_gmem_post_populate
> > >                 get_user_pages_fast //2. get source page
> > >                 kvm_tdp_map_page    //3. map private PFN to mirror root
> > >                 tdh_mem_page_add    //4. add private PFN to S-EPT and copy
> > >                                          source page to it.
> > > 
> > > kvm_gmem_populate() helps TDX to "get private PFN" in step 1. Its file
> > > invalidate lock also helps ensure the private PFN remains valid when
> > > tdh_mem_page_add() is invoked in TDX's post_populate hook.
> > > 
> > > Though TDX does not need the folio prepration code, kvm_gmem_populate()
> > > helps on sharing common code between SEV-SNP and TDX.
> > > 
> > > Problem
> > > ===
> > > (1)
> > > In Michael's series "KVM: gmem: 2MB THP support and preparedness tracking
> > > changes" [4], kvm_gmem_get_pfn() was modified to rely on the filemap
> > > invalidation lock for protecting its preparedness tracking. Similarly, the
> > > in-place conversion version of guest_memfd series by Ackerly also requires
> > > kvm_gmem_get_pfn() to acquire filemap invalidation lock [5].
> > > 
> > > kvm_gmem_get_pfn
> > >     filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> > > 
> > > However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which is
> > > in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on the
> > > filemap invalidation lock.
> > 
> > Bringing the prior discussion over to here: it seems wrong that
> > kvm_gmem_get_pfn() is getting called within the kvm_gmem_populate()
> > chain, because:
> > 
> > 1) kvm_gmem_populate() is specifically passing the gmem PFN down to
> >    tdx_gmem_post_populate(), but we are throwing it away to grab it
> >    again kvm_gmem_get_pfn(), which is then creating these locking issues
> >    that we are trying to work around. If we could simply pass that PFN down
> >    to kvm_tdp_map_page() (or some variant), then we would not trigger any
> >    deadlocks in the first place.
> 
> Yes, doing kvm_mmu_faultin_pfn() in tdx_gmem_post_populate() is a major flaw.
> 
> > 2) kvm_gmem_populate() is intended for pre-boot population of guest
> >    memory, and allows the post_populate callback to handle setting
> >    up the architecture-specific preparation, whereas kvm_gmem_get_pfn()
> >    calls kvm_arch_gmem_prepare(), which is intended to handle post-boot
> >    setup of private memory. Having kvm_gmem_get_pfn() called as part of
> >    kvm_gmem_populate() chain brings things 2 things in conflict with
> >    each other, and TDX seems to be relying on that fact that it doesn't
> >    implement a handler for kvm_arch_gmem_prepare(). 
> > 
> > I don't think this hurts anything in the current code, and I don't
> > personally see any issue with open-coding the population path if it doesn't
> > fit TDX very well, but there was some effort put into making
> > kvm_gmem_populate() usable for both TDX/SNP, and if the real issue isn't the
> > design of the interface itself, but instead just some inflexibility on the
> > KVM MMU mapping side, then it seems more robust to address the latter if
> > possible.
> > 
> > Would something like the below be reasonable? 
> 
> No, polluting the page fault paths is a non-starter for me.  TDX really shouldn't
> be synthesizing a page fault when it has the PFN in hand.  And some of the behavior
> that's desirable for pre-faults looks flat out wrong for TDX.  E.g. returning '0'
> on RET_PF_WRITE_PROTECTED and RET_PF_SPURIOUS (though maybe spurious is fine?).
> 
> I would much rather special case this path, because it absolutely is a special
> snowflake.  This even eliminates several exports of low level helpers that frankly
> have no business being used by TDX, e.g. kvm_mmu_reload().

I'm not quite following what the code below is for.  Is it an addition to
Yan's patch to eliminate the use of kvm_gmem_populate() from TDX?
I don't see how this code helps with the lock invalidation so I think we
still need Yan's patch, correct?

From a higher level, as Yan said, TDX does not need to prepare folios.  So
the elimination of the use of kvm_gmem_populate() makes more sense to me.

Ira

> 
> ---
>  arch/x86/kvm/mmu.h         |  2 +-
>  arch/x86/kvm/mmu/mmu.c     | 78 ++++++++++++++++++++++++++++++++++++--
>  arch/x86/kvm/mmu/tdp_mmu.c |  1 -
>  arch/x86/kvm/vmx/tdx.c     | 24 ++----------
>  4 files changed, 78 insertions(+), 27 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index b4b6860ab971..9cd7a34333af 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -258,7 +258,7 @@ extern bool tdp_mmu_enabled;
>  #endif
>  
>  bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
> -int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
> +int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn);
>  
>  static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
>  {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6e838cb6c9e1..bc937f8ed5a0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4900,7 +4900,8 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  	return direct_page_fault(vcpu, fault);
>  }
>  
> -int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> +static int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa,
> +				 u64 error_code, u8 *level)
>  {
>  	int r;
>  
> @@ -4942,7 +4943,6 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
>  		return -EIO;
>  	}
>  }
> -EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
>  
>  long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>  				    struct kvm_pre_fault_memory *range)
> @@ -4978,7 +4978,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>  	 * Shadow paging uses GVA for kvm page fault, so restrict to
>  	 * two-dimensional paging.
>  	 */
> -	r = kvm_tdp_map_page(vcpu, range->gpa | direct_bits, error_code, &level);
> +	r = kvm_tdp_prefault_page(vcpu, range->gpa | direct_bits, error_code, &level);
>  	if (r < 0)
>  		return r;
>  
> @@ -4990,6 +4990,77 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>  	return min(range->size, end - range->gpa);
>  }
>  
> +int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> +{
> +	struct kvm_page_fault fault = {
> +		.addr = gfn_to_gpa(gfn),
> +		.error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS,
> +		.prefetch = true,
> +		.is_tdp = true,
> +		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),
> +
> +		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
> +		.req_level = PG_LEVEL_4K,
> +		.goal_level = PG_LEVEL_4K,
> +		.is_private = true,
> +
> +		.gfn = gfn,
> +		.slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn),
> +		.pfn = pfn,
> +		.map_writable = true,
> +	};
> +	struct kvm *kvm = vcpu->kvm;
> +	int r;
> +
> +	lockdep_assert_held(&kvm->slots_lock);
> +
> +	if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
> +		return -EIO;
> +
> +	if (kvm_gfn_is_write_tracked(kvm, fault.slot, fault.gfn))
> +		return -EPERM;
> +
> +	r = kvm_mmu_reload(vcpu);
> +	if (r)
> +		return r;
> +
> +	r = mmu_topup_memory_caches(vcpu, false);
> +	if (r)
> +		return r;
> +
> +	do {
> +		if (signal_pending(current))
> +			return -EINTR;
> +
> +		if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu))
> +			return -EIO;
> +
> +		cond_resched();
> +
> +		guard(read_lock)(&kvm->mmu_lock);
> +
> +		r = kvm_tdp_mmu_map(vcpu, &fault);
> +	} while (r == RET_PF_RETRY);
> +
> +	if (r != RET_PF_FIXED)
> +		return -EIO;
> +
> +	/*
> +	 * The caller is responsible for ensuring that no MMU invalidations can
> +	 * occur.  Sanity check that the mapping hasn't been zapped.
> +	 */
> +	if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
> +		cond_resched();
> +
> +		scoped_guard(read_lock, &kvm->mmu_lock) {
> +			if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, fault.addr), kvm))
> +				return -EIO;
> +		}
> +	}
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(kvm_tdp_mmu_map_private_pfn);
> +
>  static void nonpaging_init_context(struct kvm_mmu *context)
>  {
>  	context->page_fault = nonpaging_page_fault;
> @@ -5973,7 +6044,6 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
>  out:
>  	return r;
>  }
> -EXPORT_SYMBOL_GPL(kvm_mmu_load);
>  
>  void kvm_mmu_unload(struct kvm_vcpu *vcpu)
>  {
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7f3d7229b2c1..4f73d5341ebe 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1953,7 +1953,6 @@ bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa)
>  	spte = sptes[leaf];
>  	return is_shadow_present_pte(spte) && is_last_spte(spte, leaf);
>  }
> -EXPORT_SYMBOL_GPL(kvm_tdp_mmu_gpa_is_mapped);
>  
>  /*
>   * Returns the last level spte pointer of the shadow page walk for the given
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index f4d4fd5cc6e8..02142496754f 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -3170,15 +3170,12 @@ struct tdx_gmem_post_populate_arg {
>  static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  				  void __user *src, int order, void *_arg)
>  {
> -	u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
> -	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>  	struct tdx_gmem_post_populate_arg *arg = _arg;
> -	struct kvm_vcpu *vcpu = arg->vcpu;
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	u64 err, entry, level_state;
>  	gpa_t gpa = gfn_to_gpa(gfn);
> -	u8 level = PG_LEVEL_4K;
>  	struct page *src_page;
>  	int ret, i;
> -	u64 err, entry, level_state;
>  
>  	/*
>  	 * Get the source page if it has been faulted in. Return failure if the
> @@ -3190,24 +3187,10 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  	if (ret != 1)
>  		return -ENOMEM;
>  
> -	ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> +	ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
>  	if (ret < 0)
>  		goto out;
>  
> -	/*
> -	 * The private mem cannot be zapped after kvm_tdp_map_page()
> -	 * because all paths are covered by slots_lock and the
> -	 * filemap invalidate lock.  Check that they are indeed enough.
> -	 */
> -	if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
> -		scoped_guard(read_lock, &kvm->mmu_lock) {
> -			if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa), kvm)) {
> -				ret = -EIO;
> -				goto out;
> -			}
> -		}
> -	}
> -
>  	ret = 0;
>  	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
>  			       src_page, &entry, &level_state);
> @@ -3267,7 +3250,6 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
>  	    !vt_is_tdx_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT) - 1))
>  		return -EINVAL;
>  
> -	kvm_mmu_reload(vcpu);
>  	ret = 0;
>  	while (region.nr_pages) {
>  		if (signal_pending(current)) {
> 
> base-commit: 6c7ecd725e503bf2ca69ff52c6cc48bb650b1f11
> --



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-11  1:41     ` Ira Weiny
@ 2025-07-11 14:21       ` Sean Christopherson
  0 siblings, 0 replies; 42+ messages in thread
From: Sean Christopherson @ 2025-07-11 14:21 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Michael Roth, Yan Zhao, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	vannapurve, david, ackerleytng, tabba, chao.p.peng

On Thu, Jul 10, 2025, Ira Weiny wrote:
> Sean Christopherson wrote:
> > On Wed, Jul 09, 2025, Michael Roth wrote:
> > > I don't think this hurts anything in the current code, and I don't
> > > personally see any issue with open-coding the population path if it doesn't
> > > fit TDX very well, but there was some effort put into making
> > > kvm_gmem_populate() usable for both TDX/SNP, and if the real issue isn't the
> > > design of the interface itself, but instead just some inflexibility on the
> > > KVM MMU mapping side, then it seems more robust to address the latter if
> > > possible.
> > > 
> > > Would something like the below be reasonable? 
> > 
> > No, polluting the page fault paths is a non-starter for me.  TDX really shouldn't
> > be synthesizing a page fault when it has the PFN in hand.  And some of the behavior
> > that's desirable for pre-faults looks flat out wrong for TDX.  E.g. returning '0'
> > on RET_PF_WRITE_PROTECTED and RET_PF_SPURIOUS (though maybe spurious is fine?).
> > 
> > I would much rather special case this path, because it absolutely is a special
> > snowflake.  This even eliminates several exports of low level helpers that frankly
> > have no business being used by TDX, e.g. kvm_mmu_reload().
> 
> I'm not quite following what the code below is for.  Is it an addition to
> Yan's patch to eliminate the use of kvm_gmem_populate() from TDX?
> I don't see how this code helps with the lock invalidation so I think we
> still need Yan's patch, correct?

Dunno, I haven't read through Yan's patch, I was just reacting to Mike's proposal.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-10 16:24   ` Sean Christopherson
  2025-07-11  1:41     ` Ira Weiny
@ 2025-07-11  4:36     ` Yan Zhao
  2025-07-11 15:17       ` Michael Roth
  1 sibling, 1 reply; 42+ messages in thread
From: Yan Zhao @ 2025-07-11  4:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Michael Roth, pbonzini, kvm, linux-kernel, rick.p.edgecombe,
	kai.huang, adrian.hunter, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, ira.weiny,
	vannapurve, david, ackerleytng, tabba, chao.p.peng

On Thu, Jul 10, 2025 at 09:24:13AM -0700, Sean Christopherson wrote:
> On Wed, Jul 09, 2025, Michael Roth wrote:
> > On Thu, Jul 03, 2025 at 02:26:41PM +0800, Yan Zhao wrote:
> > > Rather than invoking kvm_gmem_populate(), allow tdx_vcpu_init_mem_region()
> > > to use open code to populate the initial memory region into the mirror page
> > > table, and add the region to S-EPT.
> > > 
> > > Background
> > > ===
> > > Sean initially suggested TDX to populate initial memory region in a 4-step
> > > way [1]. Paolo refactored guest_memfd and introduced kvm_gmem_populate()
> > > interface [2] to help TDX populate init memory region.
> 
> I wouldn't give my suggestion too much weight; I did qualify it with "Crazy idea."
> after all :-)
> 
> > > tdx_vcpu_init_mem_region
> > >     guard(mutex)(&kvm->slots_lock)
> > >     kvm_gmem_populate
> > >         filemap_invalidate_lock(file->f_mapping)
> > >             __kvm_gmem_get_pfn      //1. get private PFN
> > >             post_populate           //tdx_gmem_post_populate
> > >                 get_user_pages_fast //2. get source page
> > >                 kvm_tdp_map_page    //3. map private PFN to mirror root
> > >                 tdh_mem_page_add    //4. add private PFN to S-EPT and copy
> > >                                          source page to it.
> > > 
> > > kvm_gmem_populate() helps TDX to "get private PFN" in step 1. Its file
> > > invalidate lock also helps ensure the private PFN remains valid when
> > > tdh_mem_page_add() is invoked in TDX's post_populate hook.
> > > 
> > > Though TDX does not need the folio prepration code, kvm_gmem_populate()
> > > helps on sharing common code between SEV-SNP and TDX.
> > > 
> > > Problem
> > > ===
> > > (1)
> > > In Michael's series "KVM: gmem: 2MB THP support and preparedness tracking
> > > changes" [4], kvm_gmem_get_pfn() was modified to rely on the filemap
> > > invalidation lock for protecting its preparedness tracking. Similarly, the
> > > in-place conversion version of guest_memfd series by Ackerly also requires
> > > kvm_gmem_get_pfn() to acquire filemap invalidation lock [5].
> > > 
> > > kvm_gmem_get_pfn
> > >     filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> > > 
> > > However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which is
> > > in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on the
> > > filemap invalidation lock.
> > 
> > Bringing the prior discussion over to here: it seems wrong that
> > kvm_gmem_get_pfn() is getting called within the kvm_gmem_populate()
> > chain, because:
> > 
> > 1) kvm_gmem_populate() is specifically passing the gmem PFN down to
> >    tdx_gmem_post_populate(), but we are throwing it away to grab it
> >    again kvm_gmem_get_pfn(), which is then creating these locking issues
> >    that we are trying to work around. If we could simply pass that PFN down
> >    to kvm_tdp_map_page() (or some variant), then we would not trigger any
> >    deadlocks in the first place.
> 
> Yes, doing kvm_mmu_faultin_pfn() in tdx_gmem_post_populate() is a major flaw.
> 
> > 2) kvm_gmem_populate() is intended for pre-boot population of guest
> >    memory, and allows the post_populate callback to handle setting
> >    up the architecture-specific preparation, whereas kvm_gmem_get_pfn()
> >    calls kvm_arch_gmem_prepare(), which is intended to handle post-boot
> >    setup of private memory. Having kvm_gmem_get_pfn() called as part of
> >    kvm_gmem_populate() chain brings things 2 things in conflict with
> >    each other, and TDX seems to be relying on that fact that it doesn't
> >    implement a handler for kvm_arch_gmem_prepare(). 
> > 
> > I don't think this hurts anything in the current code, and I don't
> > personally see any issue with open-coding the population path if it doesn't
> > fit TDX very well, but there was some effort put into making
> > kvm_gmem_populate() usable for both TDX/SNP, and if the real issue isn't the
> > design of the interface itself, but instead just some inflexibility on the
> > KVM MMU mapping side, then it seems more robust to address the latter if
> > possible.
> > 
> > Would something like the below be reasonable? 
> 
> No, polluting the page fault paths is a non-starter for me.  TDX really shouldn't
> be synthesizing a page fault when it has the PFN in hand.  And some of the behavior
> that's desirable for pre-faults looks flat out wrong for TDX.  E.g. returning '0'
> on RET_PF_WRITE_PROTECTED and RET_PF_SPURIOUS (though maybe spurious is fine?).
> 
> I would much rather special case this path, because it absolutely is a special
> snowflake.  This even eliminates several exports of low level helpers that frankly
> have no business being used by TDX, e.g. kvm_mmu_reload().
> 
> ---
>  arch/x86/kvm/mmu.h         |  2 +-
>  arch/x86/kvm/mmu/mmu.c     | 78 ++++++++++++++++++++++++++++++++++++--
>  arch/x86/kvm/mmu/tdp_mmu.c |  1 -
>  arch/x86/kvm/vmx/tdx.c     | 24 ++----------
>  4 files changed, 78 insertions(+), 27 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index b4b6860ab971..9cd7a34333af 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -258,7 +258,7 @@ extern bool tdp_mmu_enabled;
>  #endif
>  
>  bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
> -int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
> +int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn);
>  
>  static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
>  {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6e838cb6c9e1..bc937f8ed5a0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4900,7 +4900,8 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  	return direct_page_fault(vcpu, fault);
>  }
>  
> -int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> +static int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa,
> +				 u64 error_code, u8 *level)
>  {
>  	int r;
>  
> @@ -4942,7 +4943,6 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
>  		return -EIO;
>  	}
>  }
> -EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
>  
>  long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>  				    struct kvm_pre_fault_memory *range)
> @@ -4978,7 +4978,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>  	 * Shadow paging uses GVA for kvm page fault, so restrict to
>  	 * two-dimensional paging.
>  	 */
> -	r = kvm_tdp_map_page(vcpu, range->gpa | direct_bits, error_code, &level);
> +	r = kvm_tdp_prefault_page(vcpu, range->gpa | direct_bits, error_code, &level);
>  	if (r < 0)
>  		return r;
>  
> @@ -4990,6 +4990,77 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
>  	return min(range->size, end - range->gpa);
>  }
>  
> +int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> +{
> +	struct kvm_page_fault fault = {
> +		.addr = gfn_to_gpa(gfn),
> +		.error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS,
> +		.prefetch = true,
> +		.is_tdp = true,
> +		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),
> +
> +		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
> +		.req_level = PG_LEVEL_4K,
kvm_mmu_hugepage_adjust() will replace the PG_LEVEL_4K here to PG_LEVEL_2M,
because the private_max_mapping_level hook is only invoked in
kvm_mmu_faultin_pfn_gmem().

Updating lpage_info can fix it though.

> +		.goal_level = PG_LEVEL_4K,
> +		.is_private = true,
> +
> +		.gfn = gfn,
> +		.slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn),
> +		.pfn = pfn,
> +		.map_writable = true,
> +	};
> +	struct kvm *kvm = vcpu->kvm;
> +	int r;
> +
> +	lockdep_assert_held(&kvm->slots_lock);
> +
> +	if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
> +		return -EIO;
> +
> +	if (kvm_gfn_is_write_tracked(kvm, fault.slot, fault.gfn))
> +		return -EPERM;
> +
> +	r = kvm_mmu_reload(vcpu);
> +	if (r)
> +		return r;
> +
> +	r = mmu_topup_memory_caches(vcpu, false);
> +	if (r)
> +		return r;
> +
> +	do {
> +		if (signal_pending(current))
> +			return -EINTR;
> +
> +		if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu))
> +			return -EIO;
> +
> +		cond_resched();
> +
> +		guard(read_lock)(&kvm->mmu_lock);
> +
> +		r = kvm_tdp_mmu_map(vcpu, &fault);
> +	} while (r == RET_PF_RETRY);
> +
> +	if (r != RET_PF_FIXED)
> +		return -EIO;
> +
> +	/*
> +	 * The caller is responsible for ensuring that no MMU invalidations can
> +	 * occur.  Sanity check that the mapping hasn't been zapped.
> +	 */
> +	if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
> +		cond_resched();
> +
> +		scoped_guard(read_lock, &kvm->mmu_lock) {
> +			if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, fault.addr), kvm))
> +				return -EIO;
> +		}
> +	}
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(kvm_tdp_mmu_map_private_pfn);

Besides, it can't address the 2nd AB-BA lock issue as mentioned in the patch
log:

Problem
===
...
(2)
Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
resulting in the following lock sequence in tdx_vcpu_init_mem_region():
- filemap invalidation lock --> mm->mmap_lock

However, in future code, the shared filemap invalidation lock will be held
in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
- mm->mmap_lock --> filemap invalidation lock

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-11  4:36     ` Yan Zhao
@ 2025-07-11 15:17       ` Michael Roth
  2025-07-11 15:39         ` Sean Christopherson
  2025-07-14  3:20         ` Yan Zhao
  0 siblings, 2 replies; 42+ messages in thread
From: Michael Roth @ 2025-07-11 15:17 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Sean Christopherson, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	ira.weiny, vannapurve, david, ackerleytng, tabba, chao.p.peng

On Fri, Jul 11, 2025 at 12:36:24PM +0800, Yan Zhao wrote:
> On Thu, Jul 10, 2025 at 09:24:13AM -0700, Sean Christopherson wrote:
> > On Wed, Jul 09, 2025, Michael Roth wrote:
> > > On Thu, Jul 03, 2025 at 02:26:41PM +0800, Yan Zhao wrote:
> > > > Rather than invoking kvm_gmem_populate(), allow tdx_vcpu_init_mem_region()
> > > > to use open code to populate the initial memory region into the mirror page
> > > > table, and add the region to S-EPT.
> > > > 
> > > > Background
> > > > ===
> > > > Sean initially suggested TDX to populate initial memory region in a 4-step
> > > > way [1]. Paolo refactored guest_memfd and introduced kvm_gmem_populate()
> > > > interface [2] to help TDX populate init memory region.
> > 
> > I wouldn't give my suggestion too much weight; I did qualify it with "Crazy idea."
> > after all :-)
> > 
> > > > tdx_vcpu_init_mem_region
> > > >     guard(mutex)(&kvm->slots_lock)
> > > >     kvm_gmem_populate
> > > >         filemap_invalidate_lock(file->f_mapping)
> > > >             __kvm_gmem_get_pfn      //1. get private PFN
> > > >             post_populate           //tdx_gmem_post_populate
> > > >                 get_user_pages_fast //2. get source page
> > > >                 kvm_tdp_map_page    //3. map private PFN to mirror root
> > > >                 tdh_mem_page_add    //4. add private PFN to S-EPT and copy
> > > >                                          source page to it.
> > > > 
> > > > kvm_gmem_populate() helps TDX to "get private PFN" in step 1. Its file
> > > > invalidate lock also helps ensure the private PFN remains valid when
> > > > tdh_mem_page_add() is invoked in TDX's post_populate hook.
> > > > 
> > > > Though TDX does not need the folio prepration code, kvm_gmem_populate()
> > > > helps on sharing common code between SEV-SNP and TDX.
> > > > 
> > > > Problem
> > > > ===
> > > > (1)
> > > > In Michael's series "KVM: gmem: 2MB THP support and preparedness tracking
> > > > changes" [4], kvm_gmem_get_pfn() was modified to rely on the filemap
> > > > invalidation lock for protecting its preparedness tracking. Similarly, the
> > > > in-place conversion version of guest_memfd series by Ackerly also requires
> > > > kvm_gmem_get_pfn() to acquire filemap invalidation lock [5].
> > > > 
> > > > kvm_gmem_get_pfn
> > > >     filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> > > > 
> > > > However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which is
> > > > in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on the
> > > > filemap invalidation lock.
> > > 
> > > Bringing the prior discussion over to here: it seems wrong that
> > > kvm_gmem_get_pfn() is getting called within the kvm_gmem_populate()
> > > chain, because:
> > > 
> > > 1) kvm_gmem_populate() is specifically passing the gmem PFN down to
> > >    tdx_gmem_post_populate(), but we are throwing it away to grab it
> > >    again kvm_gmem_get_pfn(), which is then creating these locking issues
> > >    that we are trying to work around. If we could simply pass that PFN down
> > >    to kvm_tdp_map_page() (or some variant), then we would not trigger any
> > >    deadlocks in the first place.
> > 
> > Yes, doing kvm_mmu_faultin_pfn() in tdx_gmem_post_populate() is a major flaw.
> > 
> > > 2) kvm_gmem_populate() is intended for pre-boot population of guest
> > >    memory, and allows the post_populate callback to handle setting
> > >    up the architecture-specific preparation, whereas kvm_gmem_get_pfn()
> > >    calls kvm_arch_gmem_prepare(), which is intended to handle post-boot
> > >    setup of private memory. Having kvm_gmem_get_pfn() called as part of
> > >    kvm_gmem_populate() chain brings things 2 things in conflict with
> > >    each other, and TDX seems to be relying on that fact that it doesn't
> > >    implement a handler for kvm_arch_gmem_prepare(). 
> > > 
> > > I don't think this hurts anything in the current code, and I don't
> > > personally see any issue with open-coding the population path if it doesn't
> > > fit TDX very well, but there was some effort put into making
> > > kvm_gmem_populate() usable for both TDX/SNP, and if the real issue isn't the
> > > design of the interface itself, but instead just some inflexibility on the
> > > KVM MMU mapping side, then it seems more robust to address the latter if
> > > possible.
> > > 
> > > Would something like the below be reasonable? 
> > 
> > No, polluting the page fault paths is a non-starter for me.  TDX really shouldn't
> > be synthesizing a page fault when it has the PFN in hand.  And some of the behavior
> > that's desirable for pre-faults looks flat out wrong for TDX.  E.g. returning '0'
> > on RET_PF_WRITE_PROTECTED and RET_PF_SPURIOUS (though maybe spurious is fine?).
> > 
> > I would much rather special case this path, because it absolutely is a special
> > snowflake.  This even eliminates several exports of low level helpers that frankly
> > have no business being used by TDX, e.g. kvm_mmu_reload().
> > 
> > ---
> >  arch/x86/kvm/mmu.h         |  2 +-
> >  arch/x86/kvm/mmu/mmu.c     | 78 ++++++++++++++++++++++++++++++++++++--
> >  arch/x86/kvm/mmu/tdp_mmu.c |  1 -
> >  arch/x86/kvm/vmx/tdx.c     | 24 ++----------
> >  4 files changed, 78 insertions(+), 27 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index b4b6860ab971..9cd7a34333af 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -258,7 +258,7 @@ extern bool tdp_mmu_enabled;
> >  #endif
> >  
> >  bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
> > -int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
> > +int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn);
> >  
> >  static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
> >  {
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6e838cb6c9e1..bc937f8ed5a0 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4900,7 +4900,8 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  	return direct_page_fault(vcpu, fault);
> >  }
> >  
> > -int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> > +static int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa,
> > +				 u64 error_code, u8 *level)
> >  {
> >  	int r;
> >  
> > @@ -4942,7 +4943,6 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> >  		return -EIO;
> >  	}
> >  }
> > -EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
> >  
> >  long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> >  				    struct kvm_pre_fault_memory *range)
> > @@ -4978,7 +4978,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> >  	 * Shadow paging uses GVA for kvm page fault, so restrict to
> >  	 * two-dimensional paging.
> >  	 */
> > -	r = kvm_tdp_map_page(vcpu, range->gpa | direct_bits, error_code, &level);
> > +	r = kvm_tdp_prefault_page(vcpu, range->gpa | direct_bits, error_code, &level);
> >  	if (r < 0)
> >  		return r;
> >  
> > @@ -4990,6 +4990,77 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> >  	return min(range->size, end - range->gpa);
> >  }
> >  
> > +int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> > +{
> > +	struct kvm_page_fault fault = {
> > +		.addr = gfn_to_gpa(gfn),
> > +		.error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS,
> > +		.prefetch = true,
> > +		.is_tdp = true,
> > +		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),
> > +
> > +		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
> > +		.req_level = PG_LEVEL_4K,
> kvm_mmu_hugepage_adjust() will replace the PG_LEVEL_4K here to PG_LEVEL_2M,
> because the private_max_mapping_level hook is only invoked in
> kvm_mmu_faultin_pfn_gmem().
> 
> Updating lpage_info can fix it though.
> 
> > +		.goal_level = PG_LEVEL_4K,
> > +		.is_private = true,
> > +
> > +		.gfn = gfn,
> > +		.slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn),
> > +		.pfn = pfn,
> > +		.map_writable = true,
> > +	};
> > +	struct kvm *kvm = vcpu->kvm;
> > +	int r;
> > +
> > +	lockdep_assert_held(&kvm->slots_lock);
> > +
> > +	if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
> > +		return -EIO;
> > +
> > +	if (kvm_gfn_is_write_tracked(kvm, fault.slot, fault.gfn))
> > +		return -EPERM;
> > +
> > +	r = kvm_mmu_reload(vcpu);
> > +	if (r)
> > +		return r;
> > +
> > +	r = mmu_topup_memory_caches(vcpu, false);
> > +	if (r)
> > +		return r;
> > +
> > +	do {
> > +		if (signal_pending(current))
> > +			return -EINTR;
> > +
> > +		if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu))
> > +			return -EIO;
> > +
> > +		cond_resched();
> > +
> > +		guard(read_lock)(&kvm->mmu_lock);
> > +
> > +		r = kvm_tdp_mmu_map(vcpu, &fault);
> > +	} while (r == RET_PF_RETRY);
> > +
> > +	if (r != RET_PF_FIXED)
> > +		return -EIO;
> > +
> > +	/*
> > +	 * The caller is responsible for ensuring that no MMU invalidations can
> > +	 * occur.  Sanity check that the mapping hasn't been zapped.
> > +	 */
> > +	if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
> > +		cond_resched();
> > +
> > +		scoped_guard(read_lock, &kvm->mmu_lock) {
> > +			if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, fault.addr), kvm))
> > +				return -EIO;
> > +		}
> > +	}
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_tdp_mmu_map_private_pfn);
> 
> Besides, it can't address the 2nd AB-BA lock issue as mentioned in the patch
> log:
> 
> Problem
> ===
> ...
> (2)
> Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> - filemap invalidation lock --> mm->mmap_lock
> 
> However, in future code, the shared filemap invalidation lock will be held
> in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> - mm->mmap_lock --> filemap invalidation lock

I wouldn't expect kvm_gmem_fault_shared() to trigger for the
KVM_MEMSLOT_SUPPORTS_GMEM_SHARED case (or whatever we end up naming it).
There was some discussion during previous guest_memfd upstream call
(May/June?) about whether to continue using kvm_gmem_populate() (or the
callback you hand it) to handle initializing memory contents before
in-place encryption, verses just expecting that userspace will
initialize the contents directly via mmap() prior to issuing any calls
that trigger kvm_gmem_populate().

I was planning on enforcing that the 'src' parameter to
kvm_gmem_populate() must be NULL for cases where
KVM_MEMSLOT_SUPPORTS_GMEM_SHARED is set, or otherwise it will return
-EINVAL, because:

1) it avoids this awkward path you mentioned where kvm_gmem_fault_shared()
   triggers during kvm_gmem_populate()
2) it makes no sense to have to have to copy anything from 'src' when we
   now support in-place update

For the SNP side, that will require a small API update for
SNP_LAUNCH_UPDATE that mandates that corresponding 'uaddr' argument is
ignored/disallowed in favor of in-place initialization from userspace via
mmap(). Not sure if TDX would need similar API update.

Would that work on the TDX side as well?

Thanks,

Mike

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-11 15:17       ` Michael Roth
@ 2025-07-11 15:39         ` Sean Christopherson
  2025-07-11 16:34           ` Michael Roth
                             ` (2 more replies)
  2025-07-14  3:20         ` Yan Zhao
  1 sibling, 3 replies; 42+ messages in thread
From: Sean Christopherson @ 2025-07-11 15:39 UTC (permalink / raw)
  To: Michael Roth
  Cc: Yan Zhao, pbonzini, kvm, linux-kernel, rick.p.edgecombe,
	kai.huang, adrian.hunter, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, ira.weiny,
	vannapurve, david, ackerleytng, tabba, chao.p.peng

On Fri, Jul 11, 2025, Michael Roth wrote:
> On Fri, Jul 11, 2025 at 12:36:24PM +0800, Yan Zhao wrote:
> > Besides, it can't address the 2nd AB-BA lock issue as mentioned in the patch
> > log:
> > 
> > Problem
> > ===
> > ...
> > (2)
> > Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> > resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> > - filemap invalidation lock --> mm->mmap_lock
> > 
> > However, in future code, the shared filemap invalidation lock will be held
> > in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> > - mm->mmap_lock --> filemap invalidation lock
> 
> I wouldn't expect kvm_gmem_fault_shared() to trigger for the
> KVM_MEMSLOT_SUPPORTS_GMEM_SHARED case (or whatever we end up naming it).

Irrespective of shared faults, I think the API could do with a bit of cleanup
now that TDX has landed, i.e. now that we can see a bit more of the picture.

As is, I'm pretty sure TDX is broken with respect to hugepage support, because
kvm_gmem_populate() marks an entire folio as prepared, but TDX only ever deals
with one page at a time.  So that needs to be changed.  I assume it's already
address in one of the many upcoming series, but it still shows a flaw in the API.

Hoisting the retrieval of the source page outside of filemap_invalidate_lock()
seems pretty straightforward, and would provide consistent ABI for all vendor
flavors.  E.g. as is, non-struct-page memory will work for SNP, but not TDX.  The
obvious downside is that struct-page becomes a requirement for SNP, but that

The below could be tweaked to batch get_user_pages() into an array of pointers,
but given that both SNP and TDX can only operate on one 4KiB page at a time, and
that hugepage support doesn't yet exist, trying to super optimize the hugepage
case straightaway doesn't seem like a pressing concern.

static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
				struct file *file, gfn_t gfn, void __user *src,
				kvm_gmem_populate_cb post_populate, void *opaque)
{
	pgoff_t index = kvm_gmem_get_index(slot, gfn);
	struct page *src_page = NULL;
	bool is_prepared = false;
	struct folio *folio;
	int ret, max_order;
	kvm_pfn_t pfn;

	if (src) {
		ret = get_user_pages((unsigned long)src, 1, 0, &src_page);
		if (ret < 0)
			return ret;
		if (ret != 1)
			return -ENOMEM;
	}

	filemap_invalidate_lock(file->f_mapping);

	if (!kvm_range_has_memory_attributes(kvm, gfn, gfn + 1,
					     KVM_MEMORY_ATTRIBUTE_PRIVATE,
					     KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
		ret = -EINVAL;
		goto out_unlock;
	}

	folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
	if (IS_ERR(folio)) {
		ret = PTR_ERR(folio);
		goto out_unlock;
	}

	folio_unlock(folio);

	if (is_prepared) {
		ret = -EEXIST;
		goto out_put_folio;
	}

	ret = post_populate(kvm, gfn, pfn, src_page, opaque);
	if (!ret)
		kvm_gmem_mark_prepared(folio);

out_put_folio:
	folio_put(folio);
out_unlock:
	filemap_invalidate_unlock(file->f_mapping);

	if (src_page)
		put_page(src_page);
	return ret;
}

long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
		       kvm_gmem_populate_cb post_populate, void *opaque)
{
	struct file *file;
	struct kvm_memory_slot *slot;
	void __user *p;
	int ret = 0;
	long i;

	lockdep_assert_held(&kvm->slots_lock);
	if (npages < 0)
		return -EINVAL;

	slot = gfn_to_memslot(kvm, start_gfn);
	if (!kvm_slot_can_be_private(slot))
		return -EINVAL;

	file = kvm_gmem_get_file(slot);
	if (!file)
		return -EFAULT;

	npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages);
	for (i = 0; i < npages; i ++) {
		if (signal_pending(current)) {
			ret = -EINTR;
			break;
		}

		p = src ? src + i * PAGE_SIZE : NULL;

		ret = __kvm_gmem_populate(kvm, slot, file, start_gfn + i, p,
					  post_populate, opaque);
		if (ret)
			break;
	}

	fput(file);
	return ret && !i ? ret : i;
}

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-11 15:39         ` Sean Christopherson
@ 2025-07-11 16:34           ` Michael Roth
  2025-07-11 18:38             ` Vishal Annapurve
  2025-07-11 20:25             ` Ira Weiny
  2025-07-11 18:46           ` Vishal Annapurve
  2025-07-14  6:15           ` Yan Zhao
  2 siblings, 2 replies; 42+ messages in thread
From: Michael Roth @ 2025-07-11 16:34 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yan Zhao, pbonzini, kvm, linux-kernel, rick.p.edgecombe,
	kai.huang, adrian.hunter, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, ira.weiny,
	vannapurve, david, ackerleytng, tabba, chao.p.peng

On Fri, Jul 11, 2025 at 08:39:59AM -0700, Sean Christopherson wrote:
> On Fri, Jul 11, 2025, Michael Roth wrote:
> > On Fri, Jul 11, 2025 at 12:36:24PM +0800, Yan Zhao wrote:
> > > Besides, it can't address the 2nd AB-BA lock issue as mentioned in the patch
> > > log:
> > > 
> > > Problem
> > > ===
> > > ...
> > > (2)
> > > Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> > > resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> > > - filemap invalidation lock --> mm->mmap_lock
> > > 
> > > However, in future code, the shared filemap invalidation lock will be held
> > > in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> > > - mm->mmap_lock --> filemap invalidation lock
> > 
> > I wouldn't expect kvm_gmem_fault_shared() to trigger for the
> > KVM_MEMSLOT_SUPPORTS_GMEM_SHARED case (or whatever we end up naming it).
> 
> Irrespective of shared faults, I think the API could do with a bit of cleanup
> now that TDX has landed, i.e. now that we can see a bit more of the picture.
> 
> As is, I'm pretty sure TDX is broken with respect to hugepage support, because
> kvm_gmem_populate() marks an entire folio as prepared, but TDX only ever deals

Yes, for the THP-based hugepage patchset the preparation-tracking was
modified so that only the range passed to post_populate() callback will
be marked as prepared, so I think that would have addressed this case
unless there's something more specific to TDX in that regard (otherwise
it seems analogous to SNP considerations).

However, I think the current leaning here is to drop tracking of
prepared/unprepared entirely for in-place conversion / hugepage case. I
posted an RFC patch that does this in prep for in-place conversion support:

  https://lore.kernel.org/kvm/20250613005400.3694904-2-michael.roth@amd.com/

So in that case kvm_arch_gmem_prepare() would always be called via
kvm_gmem_get_pfn() and the architecture-specific code will handle
checking whether additional prep is needed or not. (In that context,
one might even consider removing kvm_arch_gmem_prepare() entirely from
gmem in that case and considering it a KVM/hypervisor MMU hook instead
(which would be more along the lines of TDX, but that's a less-important
topic)).

> with one page at a time.  So that needs to be changed.  I assume it's already
> address in one of the many upcoming series, but it still shows a flaw in the API.
> 
> Hoisting the retrieval of the source page outside of filemap_invalidate_lock()
> seems pretty straightforward, and would provide consistent ABI for all vendor
> flavors.  E.g. as is, non-struct-page memory will work for SNP, but not TDX.  The
> obvious downside is that struct-page becomes a requirement for SNP, but that
> 
> The below could be tweaked to batch get_user_pages() into an array of pointers,
> but given that both SNP and TDX can only operate on one 4KiB page at a time, and
> that hugepage support doesn't yet exist, trying to super optimize the hugepage
> case straightaway doesn't seem like a pressing concern.
> 
> static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
> 				struct file *file, gfn_t gfn, void __user *src,
> 				kvm_gmem_populate_cb post_populate, void *opaque)
> {
> 	pgoff_t index = kvm_gmem_get_index(slot, gfn);
> 	struct page *src_page = NULL;
> 	bool is_prepared = false;
> 	struct folio *folio;
> 	int ret, max_order;
> 	kvm_pfn_t pfn;
> 
> 	if (src) {
> 		ret = get_user_pages((unsigned long)src, 1, 0, &src_page);
> 		if (ret < 0)
> 			return ret;
> 		if (ret != 1)
> 			return -ENOMEM;
> 	}

One tricky part here is that the uAPI currently expects the pages to
have the private attribute set prior to calling kvm_gmem_populate(),
which gets enforced below.

For in-place conversion: the idea is that userspace will convert
private->shared to update in-place, then immediately convert back
shared->private; so that approach would remain compatible with above
behavior. But if we pass a 'src' parameter to kvm_gmem_populate(),
and do a GUP or copy_from_user() on it at any point, regardless if
it is is outside of filemap_invalidate_lock(), then
kvm_gmem_fault_shared() will return -EACCES. The only 2 ways I see
around that are to either a) stop enforcing that pages that get
processed by kvm_gmem_populate() are private for in-place conversion
case, or b) enforce that 'src' is NULL for in-place conversion case.

I think either would handle the ABBA issue.

One nice thing about a) is that we wouldn't have to change the API for
SNP_LAUNCH_UPDATE since 'src' could still get passed to
kvm_gmem_populate(), but it's kind of silly since we are copying the
pages into themselves in that case. And we still need uAPI updates
regardless expected initial shared/private state for pages passed to
SNP_LAUNCH_UPDATE so to me it seems simpler to just never have to deal
with 'src' anymore outside of the legacy cases (but maybe your change
still makes sense to have just in terms of making the sequencing of
locks/etc clearer for the legacy case).

-Mike

> 
> 	filemap_invalidate_lock(file->f_mapping);
> 
> 	if (!kvm_range_has_memory_attributes(kvm, gfn, gfn + 1,
> 					     KVM_MEMORY_ATTRIBUTE_PRIVATE,
> 					     KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> 		ret = -EINVAL;
> 		goto out_unlock;
> 	}
> 
> 	folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> 	if (IS_ERR(folio)) {
> 		ret = PTR_ERR(folio);
> 		goto out_unlock;
> 	}
> 
> 	folio_unlock(folio);
> 
> 	if (is_prepared) {
> 		ret = -EEXIST;
> 		goto out_put_folio;
> 	}
> 
> 	ret = post_populate(kvm, gfn, pfn, src_page, opaque);
> 	if (!ret)
> 		kvm_gmem_mark_prepared(folio);
> 
> out_put_folio:
> 	folio_put(folio);
> out_unlock:
> 	filemap_invalidate_unlock(file->f_mapping);
> 
> 	if (src_page)
> 		put_page(src_page);
> 	return ret;
> }
> 
> long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
> 		       kvm_gmem_populate_cb post_populate, void *opaque)
> {
> 	struct file *file;
> 	struct kvm_memory_slot *slot;
> 	void __user *p;
> 	int ret = 0;
> 	long i;
> 
> 	lockdep_assert_held(&kvm->slots_lock);
> 	if (npages < 0)
> 		return -EINVAL;
> 
> 	slot = gfn_to_memslot(kvm, start_gfn);
> 	if (!kvm_slot_can_be_private(slot))
> 		return -EINVAL;
> 
> 	file = kvm_gmem_get_file(slot);
> 	if (!file)
> 		return -EFAULT;
> 
> 	npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages);
> 	for (i = 0; i < npages; i ++) {
> 		if (signal_pending(current)) {
> 			ret = -EINTR;
> 			break;
> 		}
> 
> 		p = src ? src + i * PAGE_SIZE : NULL;
> 
> 		ret = __kvm_gmem_populate(kvm, slot, file, start_gfn + i, p,
> 					  post_populate, opaque);
> 		if (ret)
> 			break;
> 	}
> 
> 	fput(file);
> 	return ret && !i ? ret : i;
> }
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-11 16:34           ` Michael Roth
@ 2025-07-11 18:38             ` Vishal Annapurve
  2025-07-11 19:49               ` Michael Roth
  2025-07-11 20:25             ` Ira Weiny
  1 sibling, 1 reply; 42+ messages in thread
From: Vishal Annapurve @ 2025-07-11 18:38 UTC (permalink / raw)
  To: Michael Roth
  Cc: Sean Christopherson, Yan Zhao, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	ira.weiny, david, ackerleytng, tabba, chao.p.peng

On Fri, Jul 11, 2025 at 9:37 AM Michael Roth <michael.roth@amd.com> wrote:
>
> >
> > static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
> >                               struct file *file, gfn_t gfn, void __user *src,
> >                               kvm_gmem_populate_cb post_populate, void *opaque)
> > {
> >       pgoff_t index = kvm_gmem_get_index(slot, gfn);
> >       struct page *src_page = NULL;
> >       bool is_prepared = false;
> >       struct folio *folio;
> >       int ret, max_order;
> >       kvm_pfn_t pfn;
> >
> >       if (src) {
> >               ret = get_user_pages((unsigned long)src, 1, 0, &src_page);
> >               if (ret < 0)
> >                       return ret;
> >               if (ret != 1)
> >                       return -ENOMEM;
> >       }
>
> One tricky part here is that the uAPI currently expects the pages to
> have the private attribute set prior to calling kvm_gmem_populate(),
> which gets enforced below.
>
> For in-place conversion: the idea is that userspace will convert
> private->shared to update in-place, then immediately convert back
> shared->private; so that approach would remain compatible with above
> behavior. But if we pass a 'src' parameter to kvm_gmem_populate(),
> and do a GUP or copy_from_user() on it at any point, regardless if
> it is is outside of filemap_invalidate_lock(), then
> kvm_gmem_fault_shared() will return -EACCES.

I think that's a fine way to fail the initial memory population, this
simply means userspace didn't pass the right source address. Why do we
have to work around this error? Userspace should simply pass the
source buffer that is accessible to the host or pass null to indicate
that the target gfn already has the needed contents.

That is, userspace can still bring a separate source buffer even with
in-place conversion available.

> The only 2 ways I see
> around that are to either a) stop enforcing that pages that get
> processed by kvm_gmem_populate() are private for in-place conversion
> case, or b) enforce that 'src' is NULL for in-place conversion case.
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-11 18:38             ` Vishal Annapurve
@ 2025-07-11 19:49               ` Michael Roth
  2025-07-11 20:19                 ` Sean Christopherson
  0 siblings, 1 reply; 42+ messages in thread
From: Michael Roth @ 2025-07-11 19:49 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Sean Christopherson, Yan Zhao, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	ira.weiny, david, ackerleytng, tabba, chao.p.peng

On Fri, Jul 11, 2025 at 11:38:10AM -0700, Vishal Annapurve wrote:
> On Fri, Jul 11, 2025 at 9:37 AM Michael Roth <michael.roth@amd.com> wrote:
> >
> > >
> > > static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
> > >                               struct file *file, gfn_t gfn, void __user *src,
> > >                               kvm_gmem_populate_cb post_populate, void *opaque)
> > > {
> > >       pgoff_t index = kvm_gmem_get_index(slot, gfn);
> > >       struct page *src_page = NULL;
> > >       bool is_prepared = false;
> > >       struct folio *folio;
> > >       int ret, max_order;
> > >       kvm_pfn_t pfn;
> > >
> > >       if (src) {
> > >               ret = get_user_pages((unsigned long)src, 1, 0, &src_page);
> > >               if (ret < 0)
> > >                       return ret;
> > >               if (ret != 1)
> > >                       return -ENOMEM;
> > >       }
> >
> > One tricky part here is that the uAPI currently expects the pages to
> > have the private attribute set prior to calling kvm_gmem_populate(),
> > which gets enforced below.
> >
> > For in-place conversion: the idea is that userspace will convert
> > private->shared to update in-place, then immediately convert back
> > shared->private; so that approach would remain compatible with above
> > behavior. But if we pass a 'src' parameter to kvm_gmem_populate(),
> > and do a GUP or copy_from_user() on it at any point, regardless if
> > it is is outside of filemap_invalidate_lock(), then
> > kvm_gmem_fault_shared() will return -EACCES.
> 
> I think that's a fine way to fail the initial memory population, this
> simply means userspace didn't pass the right source address. Why do we
> have to work around this error? Userspace should simply pass the
> source buffer that is accessible to the host or pass null to indicate
> that the target gfn already has the needed contents.
> 
> That is, userspace can still bring a separate source buffer even with
> in-place conversion available.

I thought there was some agreement that mmap() be the 'blessed'
approach for initializing memory with in-place conversion to help
untangle some of these paths, so it made sense to enforce that in
kvm_gmem_populate() to make it 'official', but with Sean's suggested
rework I suppose we could support both approaches.

-Mike

> 
> > The only 2 ways I see
> > around that are to either a) stop enforcing that pages that get
> > processed by kvm_gmem_populate() are private for in-place conversion
> > case, or b) enforce that 'src' is NULL for in-place conversion case.
> >

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-11 19:49               ` Michael Roth
@ 2025-07-11 20:19                 ` Sean Christopherson
  0 siblings, 0 replies; 42+ messages in thread
From: Sean Christopherson @ 2025-07-11 20:19 UTC (permalink / raw)
  To: Michael Roth
  Cc: Vishal Annapurve, Yan Zhao, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	ira.weiny, david, ackerleytng, tabba, chao.p.peng

On Fri, Jul 11, 2025, Michael Roth wrote:
> On Fri, Jul 11, 2025 at 11:38:10AM -0700, Vishal Annapurve wrote:
> > On Fri, Jul 11, 2025 at 9:37 AM Michael Roth <michael.roth@amd.com> wrote:
> > >
> > > >
> > > > static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > >                               struct file *file, gfn_t gfn, void __user *src,
> > > >                               kvm_gmem_populate_cb post_populate, void *opaque)
> > > > {
> > > >       pgoff_t index = kvm_gmem_get_index(slot, gfn);
> > > >       struct page *src_page = NULL;
> > > >       bool is_prepared = false;
> > > >       struct folio *folio;
> > > >       int ret, max_order;
> > > >       kvm_pfn_t pfn;
> > > >
> > > >       if (src) {
> > > >               ret = get_user_pages((unsigned long)src, 1, 0, &src_page);
> > > >               if (ret < 0)
> > > >                       return ret;
> > > >               if (ret != 1)
> > > >                       return -ENOMEM;
> > > >       }
> > >
> > > One tricky part here is that the uAPI currently expects the pages to
> > > have the private attribute set prior to calling kvm_gmem_populate(),
> > > which gets enforced below.
> > >
> > > For in-place conversion: the idea is that userspace will convert
> > > private->shared to update in-place, then immediately convert back
> > > shared->private; so that approach would remain compatible with above
> > > behavior. But if we pass a 'src' parameter to kvm_gmem_populate(),
> > > and do a GUP or copy_from_user() on it at any point, regardless if
> > > it is is outside of filemap_invalidate_lock(), then
> > > kvm_gmem_fault_shared() will return -EACCES.
> > 
> > I think that's a fine way to fail the initial memory population, this
> > simply means userspace didn't pass the right source address. Why do we
> > have to work around this error? Userspace should simply pass the
> > source buffer that is accessible to the host or pass null to indicate
> > that the target gfn already has the needed contents.
> > 
> > That is, userspace can still bring a separate source buffer even with
> > in-place conversion available.

Yeah.  It might be superfluous to a certain extent, and it should be straight up
disallowed with PRESERVE, but I don't like the idea of taking a hard dependency
on src being NULL.

> I thought there was some agreement that mmap() be the 'blessed'
> approach for initializing memory with in-place conversion to help
> untangle some of these paths, so it made sense to enforce that in
> kvm_gmem_populate() to make it 'official', but with Sean's suggested
> rework I suppose we could support both approaches.

Ya, my preference would be to not rely on subtly making two paths mutually
exclusive in order to avoid deadlock, especially when there are ABI implications.

I'm not dead set against it, e.g. if for some reason we just absolutely need to
disallow a non-NULL src for the that case, but hopefully we can avoid that.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-11 16:34           ` Michael Roth
  2025-07-11 18:38             ` Vishal Annapurve
@ 2025-07-11 20:25             ` Ira Weiny
  2025-07-11 22:56               ` Sean Christopherson
  1 sibling, 1 reply; 42+ messages in thread
From: Ira Weiny @ 2025-07-11 20:25 UTC (permalink / raw)
  To: Michael Roth, Sean Christopherson
  Cc: Yan Zhao, pbonzini, kvm, linux-kernel, rick.p.edgecombe,
	kai.huang, adrian.hunter, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, ira.weiny,
	vannapurve, david, ackerleytng, tabba, chao.p.peng

Michael Roth wrote:
> On Fri, Jul 11, 2025 at 08:39:59AM -0700, Sean Christopherson wrote:
> > On Fri, Jul 11, 2025, Michael Roth wrote:
> > > On Fri, Jul 11, 2025 at 12:36:24PM +0800, Yan Zhao wrote:
> > > > Besides, it can't address the 2nd AB-BA lock issue as mentioned in the patch
> > > > log:
> > > > 
> > > > Problem
> > > > ===
> > > > ...
> > > > (2)
> > > > Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> > > > resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> > > > - filemap invalidation lock --> mm->mmap_lock
> > > > 
> > > > However, in future code, the shared filemap invalidation lock will be held
> > > > in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> > > > - mm->mmap_lock --> filemap invalidation lock
> > > 
> > > I wouldn't expect kvm_gmem_fault_shared() to trigger for the
> > > KVM_MEMSLOT_SUPPORTS_GMEM_SHARED case (or whatever we end up naming it).
> > 
> > Irrespective of shared faults, I think the API could do with a bit of cleanup
> > now that TDX has landed, i.e. now that we can see a bit more of the picture.
> > 
> > As is, I'm pretty sure TDX is broken with respect to hugepage support, because
> > kvm_gmem_populate() marks an entire folio as prepared, but TDX only ever deals
> 
> Yes, for the THP-based hugepage patchset the preparation-tracking was
> modified so that only the range passed to post_populate() callback will
> be marked as prepared, so I think that would have addressed this case
> unless there's something more specific to TDX in that regard (otherwise
> it seems analogous to SNP considerations).
> 
> However, I think the current leaning here is to drop tracking of
> prepared/unprepared entirely for in-place conversion / hugepage case. I
> posted an RFC patch that does this in prep for in-place conversion support:
> 
>   https://lore.kernel.org/kvm/20250613005400.3694904-2-michael.roth@amd.com/
> 
> So in that case kvm_arch_gmem_prepare() would always be called via
> kvm_gmem_get_pfn() and the architecture-specific code will handle
> checking whether additional prep is needed or not. (In that context,
> one might even consider removing kvm_arch_gmem_prepare() entirely from
> gmem in that case and considering it a KVM/hypervisor MMU hook instead
> (which would be more along the lines of TDX, but that's a less-important
> topic)).
> 
> 
> > with one page at a time.  So that needs to be changed.  I assume it's already
> > address in one of the many upcoming series, but it still shows a flaw in the API.
> > 
> > Hoisting the retrieval of the source page outside of filemap_invalidate_lock()
> > seems pretty straightforward, and would provide consistent ABI for all vendor
> > flavors.  E.g. as is, non-struct-page memory will work for SNP, but not TDX.  The
> > obvious downside is that struct-page becomes a requirement for SNP, but that
> > 
> > The below could be tweaked to batch get_user_pages() into an array of pointers,
> > but given that both SNP and TDX can only operate on one 4KiB page at a time, and
> > that hugepage support doesn't yet exist, trying to super optimize the hugepage
> > case straightaway doesn't seem like a pressing concern.
> > 
> > static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
> > 				struct file *file, gfn_t gfn, void __user *src,
> > 				kvm_gmem_populate_cb post_populate, void *opaque)
> > {
> > 	pgoff_t index = kvm_gmem_get_index(slot, gfn);
> > 	struct page *src_page = NULL;
> > 	bool is_prepared = false;
> > 	struct folio *folio;
> > 	int ret, max_order;
> > 	kvm_pfn_t pfn;
> > 
> > 	if (src) {
> > 		ret = get_user_pages((unsigned long)src, 1, 0, &src_page);
> > 		if (ret < 0)
> > 			return ret;
> > 		if (ret != 1)
> > 			return -ENOMEM;
> > 	}
> 
> One tricky part here is that the uAPI currently expects the pages to
> have the private attribute set prior to calling kvm_gmem_populate(),
> which gets enforced below.
> 
> For in-place conversion: the idea is that userspace will convert
> private->shared to update in-place, then immediately convert back
> shared->private;

Why convert from private to shared and back to private?  Userspace which
knows about mmap and supports it should create shared pages, mmap, write
data, then convert to private.

Old userspace will create private and pass in a source pointer for the
initial data as it does today.

Internally, the post_populate() callback only needs to know if the data is
in place or coming from somewhere else (ie src != NULL).

> so that approach would remain compatible with above
> behavior. But if we pass a 'src' parameter to kvm_gmem_populate(),
> and do a GUP or copy_from_user() on it at any point, regardless if
> it is is outside of filemap_invalidate_lock(), then
> kvm_gmem_fault_shared() will return -EACCES. The only 2 ways I see
> around that are to either a) stop enforcing that pages that get
> processed by kvm_gmem_populate() are private for in-place conversion
> case, or b) enforce that 'src' is NULL for in-place conversion case.
> 
> I think either would handle the ABBA issue.
> 
> One nice thing about a) is that we wouldn't have to change the API for
> SNP_LAUNCH_UPDATE since 'src' could still get passed to
> kvm_gmem_populate(), but it's kind of silly since we are copying the
> pages into themselves in that case. And we still need uAPI updates
> regardless expected initial shared/private state for pages passed to
> SNP_LAUNCH_UPDATE so to me it seems simpler to just never have to deal
> with 'src' anymore outside of the legacy cases (but maybe your change
> still makes sense to have just in terms of making the sequencing of
> locks/etc clearer for the legacy case).

Agreed.

Ira

> 
> -Mike
> 
> > 
> > 	filemap_invalidate_lock(file->f_mapping);
> > 
> > 	if (!kvm_range_has_memory_attributes(kvm, gfn, gfn + 1,
> > 					     KVM_MEMORY_ATTRIBUTE_PRIVATE,
> > 					     KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> > 		ret = -EINVAL;
> > 		goto out_unlock;
> > 	}
> > 
> > 	folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> > 	if (IS_ERR(folio)) {
> > 		ret = PTR_ERR(folio);
> > 		goto out_unlock;
> > 	}
> > 
> > 	folio_unlock(folio);
> > 
> > 	if (is_prepared) {
> > 		ret = -EEXIST;
> > 		goto out_put_folio;
> > 	}
> > 
> > 	ret = post_populate(kvm, gfn, pfn, src_page, opaque);
> > 	if (!ret)
> > 		kvm_gmem_mark_prepared(folio);
> > 
> > out_put_folio:
> > 	folio_put(folio);
> > out_unlock:
> > 	filemap_invalidate_unlock(file->f_mapping);
> > 
> > 	if (src_page)
> > 		put_page(src_page);
> > 	return ret;
> > }
> > 
> > long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
> > 		       kvm_gmem_populate_cb post_populate, void *opaque)
> > {
> > 	struct file *file;
> > 	struct kvm_memory_slot *slot;
> > 	void __user *p;
> > 	int ret = 0;
> > 	long i;
> > 
> > 	lockdep_assert_held(&kvm->slots_lock);
> > 	if (npages < 0)
> > 		return -EINVAL;
> > 
> > 	slot = gfn_to_memslot(kvm, start_gfn);
> > 	if (!kvm_slot_can_be_private(slot))
> > 		return -EINVAL;
> > 
> > 	file = kvm_gmem_get_file(slot);
> > 	if (!file)
> > 		return -EFAULT;
> > 
> > 	npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages);
> > 	for (i = 0; i < npages; i ++) {
> > 		if (signal_pending(current)) {
> > 			ret = -EINTR;
> > 			break;
> > 		}
> > 
> > 		p = src ? src + i * PAGE_SIZE : NULL;
> > 
> > 		ret = __kvm_gmem_populate(kvm, slot, file, start_gfn + i, p,
> > 					  post_populate, opaque);
> > 		if (ret)
> > 			break;
> > 	}
> > 
> > 	fput(file);
> > 	return ret && !i ? ret : i;
> > }
> > 



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-11 20:25             ` Ira Weiny
@ 2025-07-11 22:56               ` Sean Christopherson
  2025-07-11 23:04                 ` Vishal Annapurve
  2025-07-14 23:08                 ` Ira Weiny
  0 siblings, 2 replies; 42+ messages in thread
From: Sean Christopherson @ 2025-07-11 22:56 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Michael Roth, Yan Zhao, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	vannapurve, david, ackerleytng, tabba, chao.p.peng

On Fri, Jul 11, 2025, Ira Weiny wrote:
> Michael Roth wrote:
> > For in-place conversion: the idea is that userspace will convert
> > private->shared to update in-place, then immediately convert back
> > shared->private;
> 
> Why convert from private to shared and back to private?  Userspace which
> knows about mmap and supports it should create shared pages, mmap, write
> data, then convert to private.

Dunno if there's a strong usecase for converting to shared *and* populating the
data, but I also don't know that it's worth going out of our way to prevent such
behavior, at least not without a strong reason to do so.  E.g. if it allowed for
a cleaner implementation or better semantics, then by all means.  But I don't
think that's true here?  Though I haven't thought hard about this, so don't
quote me on that. :-)

> Old userspace will create private and pass in a source pointer for the
> initial data as it does today.
> 
> Internally, the post_populate() callback only needs to know if the data is
> in place or coming from somewhere else (ie src != NULL).

I think there will be a third option: data needs to be zeroed, i.e. the !src &&
!PRESERVED case.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-11 22:56               ` Sean Christopherson
@ 2025-07-11 23:04                 ` Vishal Annapurve
  2025-07-14 23:11                   ` Ira Weiny
  2025-07-14 23:08                 ` Ira Weiny
  1 sibling, 1 reply; 42+ messages in thread
From: Vishal Annapurve @ 2025-07-11 23:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ira Weiny, Michael Roth, Yan Zhao, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	david, ackerleytng, tabba, chao.p.peng

On Fri, Jul 11, 2025 at 3:56 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Jul 11, 2025, Ira Weiny wrote:
> > Michael Roth wrote:
> > > For in-place conversion: the idea is that userspace will convert
> > > private->shared to update in-place, then immediately convert back
> > > shared->private;
> >
> > Why convert from private to shared and back to private?  Userspace which
> > knows about mmap and supports it should create shared pages, mmap, write
> > data, then convert to private.
>
> Dunno if there's a strong usecase for converting to shared *and* populating the
> data, but I also don't know that it's worth going out of our way to prevent such
> behavior, at least not without a strong reason to do so.  E.g. if it allowed for
> a cleaner implementation or better semantics, then by all means.  But I don't
> think that's true here?  Though I haven't thought hard about this, so don't
> quote me on that. :-)

If this is a huge page backing, starting as shared will split all the
pages to 4K granularity upon allocation. To avoid splitting, userspace
can start with everything as private when working with hugepages and
then follow convert to shared -> populate -> convert to private as
needed.

>
> > Old userspace will create private and pass in a source pointer for the
> > initial data as it does today.
> >
> > Internally, the post_populate() callback only needs to know if the data is
> > in place or coming from somewhere else (ie src != NULL).
>
> I think there will be a third option: data needs to be zeroed, i.e. the !src &&
> !PRESERVED case.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-11 23:04                 ` Vishal Annapurve
@ 2025-07-14 23:11                   ` Ira Weiny
  2025-07-15  0:41                     ` Vishal Annapurve
  0 siblings, 1 reply; 42+ messages in thread
From: Ira Weiny @ 2025-07-14 23:11 UTC (permalink / raw)
  To: Vishal Annapurve, Sean Christopherson
  Cc: Ira Weiny, Michael Roth, Yan Zhao, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	david, ackerleytng, tabba, chao.p.peng

Vishal Annapurve wrote:
> On Fri, Jul 11, 2025 at 3:56 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Jul 11, 2025, Ira Weiny wrote:
> > > Michael Roth wrote:
> > > > For in-place conversion: the idea is that userspace will convert
> > > > private->shared to update in-place, then immediately convert back
> > > > shared->private;
> > >
> > > Why convert from private to shared and back to private?  Userspace which
> > > knows about mmap and supports it should create shared pages, mmap, write
> > > data, then convert to private.
> >
> > Dunno if there's a strong usecase for converting to shared *and* populating the
> > data, but I also don't know that it's worth going out of our way to prevent such
> > behavior, at least not without a strong reason to do so.  E.g. if it allowed for
> > a cleaner implementation or better semantics, then by all means.  But I don't
> > think that's true here?  Though I haven't thought hard about this, so don't
> > quote me on that. :-)
> 
> If this is a huge page backing, starting as shared will split all the
> pages to 4K granularity upon allocation.

Why?  What is the reason it needs to be split?

> To avoid splitting, userspace
> can start with everything as private when working with hugepages and
> then follow convert to shared -> populate -> convert to private as
> needed.

I'm not saying this could not be done but seems wasteful.

Ira

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-14 23:11                   ` Ira Weiny
@ 2025-07-15  0:41                     ` Vishal Annapurve
  0 siblings, 0 replies; 42+ messages in thread
From: Vishal Annapurve @ 2025-07-15  0:41 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Sean Christopherson, Michael Roth, Yan Zhao, pbonzini, kvm,
	linux-kernel, rick.p.edgecombe, kai.huang, adrian.hunter,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, david, ackerleytng, tabba, chao.p.peng

On Mon, Jul 14, 2025 at 4:10 PM Ira Weiny <ira.weiny@intel.com> wrote:
>
> Vishal Annapurve wrote:
> > On Fri, Jul 11, 2025 at 3:56 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Fri, Jul 11, 2025, Ira Weiny wrote:
> > > > Michael Roth wrote:
> > > > > For in-place conversion: the idea is that userspace will convert
> > > > > private->shared to update in-place, then immediately convert back
> > > > > shared->private;
> > > >
> > > > Why convert from private to shared and back to private?  Userspace which
> > > > knows about mmap and supports it should create shared pages, mmap, write
> > > > data, then convert to private.
> > >
> > > Dunno if there's a strong usecase for converting to shared *and* populating the
> > > data, but I also don't know that it's worth going out of our way to prevent such
> > > behavior, at least not without a strong reason to do so.  E.g. if it allowed for
> > > a cleaner implementation or better semantics, then by all means.  But I don't
> > > think that's true here?  Though I haven't thought hard about this, so don't
> > > quote me on that. :-)
> >
> > If this is a huge page backing, starting as shared will split all the
> > pages to 4K granularity upon allocation.
>
> Why?  What is the reason it needs to be split?

I think you and Sean have similar questions.

This init private-> shared-> fill -> private scheme is for userspace
for the initial guest payload population. Another choice userspace has
is to begin the whole file as shared -> fill -> only needed ranges to
private.

Regarding shared memory ranges for CC VMs, guest_memfd huge page
support [1] simply works by splitting hugepages in 4K chunks for
shared regions to allow core-mm to manage the pages without affecting
rest of the private ranges within a hugepage. The need for splitting
has been discussed in MM alignment calls and LPC 2024[2].

[1] https://lore.kernel.org/lkml/cover.1747264138.git.ackerleytng@google.com/
[2] https://lpc.events/event/18/contributions/1764/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-11 22:56               ` Sean Christopherson
  2025-07-11 23:04                 ` Vishal Annapurve
@ 2025-07-14 23:08                 ` Ira Weiny
  2025-07-14 23:12                   ` Sean Christopherson
  1 sibling, 1 reply; 42+ messages in thread
From: Ira Weiny @ 2025-07-14 23:08 UTC (permalink / raw)
  To: Sean Christopherson, Ira Weiny
  Cc: Michael Roth, Yan Zhao, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	vannapurve, david, ackerleytng, tabba, chao.p.peng

Sean Christopherson wrote:
> On Fri, Jul 11, 2025, Ira Weiny wrote:
> > Michael Roth wrote:
> > > For in-place conversion: the idea is that userspace will convert
> > > private->shared to update in-place, then immediately convert back
> > > shared->private;
> > 
> > Why convert from private to shared and back to private?  Userspace which
> > knows about mmap and supports it should create shared pages, mmap, write
> > data, then convert to private.
> 
> Dunno if there's a strong usecase for converting to shared *and* populating the
> data, but I also don't know that it's worth going out of our way to prevent such
> behavior, at least not without a strong reason to do so.

I'm not proposing to prevent such behavior.  Only arguing that the
private->shared->private path to data population is unlikely to be a
'common' use case.

> E.g. if it allowed for
> a cleaner implementation or better semantics, then by all means.  But I don't
> think that's true here?  Though I haven't thought hard about this, so don't
> quote me on that. :-)

Me neither.  Since I am new to this I am looking at this from a pretty
hight level and it seems to me if the intention is to pass data to the
guest then starting shared is the way to go.  Passing data out, in a Coco
VM, is probably not going to be supported.

I have to think on Vishal's assertion that a shared page needs to be split
on allocation.  That does not make sense to me.

> > Old userspace will create private and pass in a source pointer for the
> > initial data as it does today.
> > 
> > Internally, the post_populate() callback only needs to know if the data is
> > in place or coming from somewhere else (ie src != NULL).
> 
> I think there will be a third option: data needs to be zeroed, i.e. the !src &&
> !PRESERVED case.

Yes, indeed.

Ira

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-14 23:08                 ` Ira Weiny
@ 2025-07-14 23:12                   ` Sean Christopherson
  0 siblings, 0 replies; 42+ messages in thread
From: Sean Christopherson @ 2025-07-14 23:12 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Michael Roth, Yan Zhao, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	vannapurve, david, ackerleytng, tabba, chao.p.peng

On Mon, Jul 14, 2025, Ira Weiny wrote:
> Sean Christopherson wrote:
> > On Fri, Jul 11, 2025, Ira Weiny wrote:
> > > Michael Roth wrote:
> > > > For in-place conversion: the idea is that userspace will convert
> > > > private->shared to update in-place, then immediately convert back
> > > > shared->private;
> > > 
> > > Why convert from private to shared and back to private?  Userspace which
> > > knows about mmap and supports it should create shared pages, mmap, write
> > > data, then convert to private.
> > 
> > Dunno if there's a strong usecase for converting to shared *and* populating the
> > data, but I also don't know that it's worth going out of our way to prevent such
> > behavior, at least not without a strong reason to do so.
> 
> I'm not proposing to prevent such behavior.  Only arguing that the
> private->shared->private path to data population is unlikely to be a
> 'common' use case.
> 
> > E.g. if it allowed for
> > a cleaner implementation or better semantics, then by all means.  But I don't
> > think that's true here?  Though I haven't thought hard about this, so don't
> > quote me on that. :-)
> 
> Me neither.  Since I am new to this I am looking at this from a pretty
> hight level and it seems to me if the intention is to pass data to the
> guest then starting shared is the way to go.  Passing data out, in a Coco
> VM, is probably not going to be supported.

IIUC, passing data out is supported and used by pKVM.  I can see use cases for
things where the output of some processing is a non-trivial amount of data.
E.g. pass in "public" data/inputs, protect the data in-place, process the data
using a super secret model, then finally pass the sanitized, "safe for public
consumption" result back to the untrusted world.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-11 15:39         ` Sean Christopherson
  2025-07-11 16:34           ` Michael Roth
@ 2025-07-11 18:46           ` Vishal Annapurve
  2025-07-12 17:38             ` Vishal Annapurve
  2025-07-14  6:15           ` Yan Zhao
  2 siblings, 1 reply; 42+ messages in thread
From: Vishal Annapurve @ 2025-07-11 18:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Michael Roth, Yan Zhao, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	ira.weiny, david, ackerleytng, tabba, chao.p.peng

On Fri, Jul 11, 2025 at 8:40 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Jul 11, 2025, Michael Roth wrote:
> > On Fri, Jul 11, 2025 at 12:36:24PM +0800, Yan Zhao wrote:
> > > Besides, it can't address the 2nd AB-BA lock issue as mentioned in the patch
> > > log:
> > >
> > > Problem
> > > ===
> > > ...
> > > (2)
> > > Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> > > resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> > > - filemap invalidation lock --> mm->mmap_lock
> > >
> > > However, in future code, the shared filemap invalidation lock will be held
> > > in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> > > - mm->mmap_lock --> filemap invalidation lock
> >
> > I wouldn't expect kvm_gmem_fault_shared() to trigger for the
> > KVM_MEMSLOT_SUPPORTS_GMEM_SHARED case (or whatever we end up naming it).
>
> Irrespective of shared faults, I think the API could do with a bit of cleanup
> now that TDX has landed, i.e. now that we can see a bit more of the picture.
>
> As is, I'm pretty sure TDX is broken with respect to hugepage support, because
> kvm_gmem_populate() marks an entire folio as prepared, but TDX only ever deals
> with one page at a time.  So that needs to be changed.  I assume it's already
> address in one of the many upcoming series, but it still shows a flaw in the API.
>
> Hoisting the retrieval of the source page outside of filemap_invalidate_lock()
> seems pretty straightforward, and would provide consistent ABI for all vendor

Will relying on standard KVM -> guest_memfd interaction i.e.
simulating a second stage fault to get the right target address work
for all vendors i.e. CCA/SNP/TDX? If so, we might not have to maintain
this out of band path of kvm_gmem_populate.

> flavors.  E.g. as is, non-struct-page memory will work for SNP, but not TDX.  The
> obvious downside is that struct-page becomes a requirement for SNP, but that
>

Maybe you had more thought here?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-11 18:46           ` Vishal Annapurve
@ 2025-07-12 17:38             ` Vishal Annapurve
  0 siblings, 0 replies; 42+ messages in thread
From: Vishal Annapurve @ 2025-07-12 17:38 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Michael Roth, Yan Zhao, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	ira.weiny, david, ackerleytng, tabba, chao.p.peng

On Fri, Jul 11, 2025 at 11:46 AM Vishal Annapurve <vannapurve@google.com> wrote:
>
> On Fri, Jul 11, 2025 at 8:40 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Jul 11, 2025, Michael Roth wrote:
> > > On Fri, Jul 11, 2025 at 12:36:24PM +0800, Yan Zhao wrote:
> > > > Besides, it can't address the 2nd AB-BA lock issue as mentioned in the patch
> > > > log:
> > > >
> > > > Problem
> > > > ===
> > > > ...
> > > > (2)
> > > > Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> > > > resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> > > > - filemap invalidation lock --> mm->mmap_lock
> > > >
> > > > However, in future code, the shared filemap invalidation lock will be held
> > > > in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> > > > - mm->mmap_lock --> filemap invalidation lock
> > >
> > > I wouldn't expect kvm_gmem_fault_shared() to trigger for the
> > > KVM_MEMSLOT_SUPPORTS_GMEM_SHARED case (or whatever we end up naming it).
> >
> > Irrespective of shared faults, I think the API could do with a bit of cleanup
> > now that TDX has landed, i.e. now that we can see a bit more of the picture.
> >
> > As is, I'm pretty sure TDX is broken with respect to hugepage support, because
> > kvm_gmem_populate() marks an entire folio as prepared, but TDX only ever deals
> > with one page at a time.  So that needs to be changed.  I assume it's already
> > address in one of the many upcoming series, but it still shows a flaw in the API.
> >
> > Hoisting the retrieval of the source page outside of filemap_invalidate_lock()
> > seems pretty straightforward, and would provide consistent ABI for all vendor
>
> Will relying on standard KVM -> guest_memfd interaction i.e.
> simulating a second stage fault to get the right target address work
> for all vendors i.e. CCA/SNP/TDX? If so, we might not have to maintain
> this out of band path of kvm_gmem_populate.

I think the only different scenario is SNP, where the host must write
initial contents to guest memory.

Will this work for all cases CCA/SNP/TDX during initial memory
population from within KVM:
1) Simulate stage2 fault
2) Take a KVM mmu read lock
3) Check that the needed gpa is mapped in EPT/NPT entries
4) For SNP, if src != null, make the target pfn to be shared, copy
contents and then make the target pfn back to private.
5) For TDX, if src != null, pass the same address for source and
target (likely this works for CCA too)
6) Invoke appropriate memory encryption operations
7) measure contents
8) release the KVM mmu read lock

If this scheme works, ideally we should also not call RMP table
population logic from guest_memfd, but from KVM NPT fault handling
logic directly (a bit of cosmetic change). Ideally any outgoing
interaction from guest_memfd to KVM should be only via invalidation
notifiers.





>
> > flavors.  E.g. as is, non-struct-page memory will work for SNP, but not TDX.  The
> > obvious downside is that struct-page becomes a requirement for SNP, but that
> >
>
> Maybe you had more thought here?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-11 15:39         ` Sean Christopherson
  2025-07-11 16:34           ` Michael Roth
  2025-07-11 18:46           ` Vishal Annapurve
@ 2025-07-14  6:15           ` Yan Zhao
  2025-07-14 15:46             ` Sean Christopherson
  2 siblings, 1 reply; 42+ messages in thread
From: Yan Zhao @ 2025-07-14  6:15 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Michael Roth, pbonzini, kvm, linux-kernel, rick.p.edgecombe,
	kai.huang, adrian.hunter, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, ira.weiny,
	vannapurve, david, ackerleytng, tabba, chao.p.peng

On Fri, Jul 11, 2025 at 08:39:59AM -0700, Sean Christopherson wrote:
> On Fri, Jul 11, 2025, Michael Roth wrote:
> > On Fri, Jul 11, 2025 at 12:36:24PM +0800, Yan Zhao wrote:
> > > Besides, it can't address the 2nd AB-BA lock issue as mentioned in the patch
> > > log:
> > > 
> > > Problem
> > > ===
> > > ...
> > > (2)
> > > Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> > > resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> > > - filemap invalidation lock --> mm->mmap_lock
> > > 
> > > However, in future code, the shared filemap invalidation lock will be held
> > > in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> > > - mm->mmap_lock --> filemap invalidation lock
> > 
> > I wouldn't expect kvm_gmem_fault_shared() to trigger for the
> > KVM_MEMSLOT_SUPPORTS_GMEM_SHARED case (or whatever we end up naming it).
> 
> Irrespective of shared faults, I think the API could do with a bit of cleanup
> now that TDX has landed, i.e. now that we can see a bit more of the picture.
> 
> As is, I'm pretty sure TDX is broken with respect to hugepage support, because
> kvm_gmem_populate() marks an entire folio as prepared, but TDX only ever deals
> with one page at a time.  So that needs to be changed.  I assume it's already
In TDX RFC v1, we deals with multiple pages at a time :)
https://lore.kernel.org/all/20250424030500.32720-1-yan.y.zhao@intel.com/

> address in one of the many upcoming series, but it still shows a flaw in the API.
> 
> Hoisting the retrieval of the source page outside of filemap_invalidate_lock()
> seems pretty straightforward, and would provide consistent ABI for all vendor
> flavors.  E.g. as is, non-struct-page memory will work for SNP, but not TDX.  The
> obvious downside is that struct-page becomes a requirement for SNP, but that
> 
> The below could be tweaked to batch get_user_pages() into an array of pointers,
> but given that both SNP and TDX can only operate on one 4KiB page at a time, and
> that hugepage support doesn't yet exist, trying to super optimize the hugepage
> case straightaway doesn't seem like a pressing concern.

> static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
> 				struct file *file, gfn_t gfn, void __user *src,
> 				kvm_gmem_populate_cb post_populate, void *opaque)
> {
> 	pgoff_t index = kvm_gmem_get_index(slot, gfn);
> 	struct page *src_page = NULL;
> 	bool is_prepared = false;
> 	struct folio *folio;
> 	int ret, max_order;
> 	kvm_pfn_t pfn;
> 
> 	if (src) {
> 		ret = get_user_pages((unsigned long)src, 1, 0, &src_page);
get_user_pages_fast()?

get_user_pages() can't pass the assertion of mmap_assert_locked().

> 		if (ret < 0)
> 			return ret;
> 		if (ret != 1)
> 			return -ENOMEM;
> 	}
> 
> 	filemap_invalidate_lock(file->f_mapping);
> 
> 	if (!kvm_range_has_memory_attributes(kvm, gfn, gfn + 1,
> 					     KVM_MEMORY_ATTRIBUTE_PRIVATE,
> 					     KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
if (kvm_mem_is_private(kvm, gfn)) ? where

static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
{
        struct kvm_memory_slot *slot;

        if (!IS_ENABLED(CONFIG_KVM_GMEM))
                return false;

        slot = gfn_to_memslot(kvm, gfn);
        if (kvm_slot_has_gmem(slot) && kvm_gmem_memslot_supports_shared(slot))
                return kvm_gmem_is_private(slot, gfn);

        return kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
}


> 		ret = -EINVAL;
> 		goto out_unlock;
> 	}
> 
> 	folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
If max_order > 0 is returned, the next invocation of __kvm_gmem_populate() for
GFN+1 will return is_prepared == true.

> 	if (IS_ERR(folio)) {
> 		ret = PTR_ERR(folio);
> 		goto out_unlock;
> 	}
> 
> 	folio_unlock(folio);
> 
> 	if (is_prepared) {
> 		ret = -EEXIST;
> 		goto out_put_folio;
> 	}
So, skip this check of is_prepare?

> 
> 	ret = post_populate(kvm, gfn, pfn, src_page, opaque);
Pass in the slot to post_populate() as well?

TDX may need to invoke hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M)
in tdx_gmem_post_populate() if kvm_tdp_mmu_map_private_pfn() does not check
the hook private_max_mapping_level for max_level as in
https://lore.kernel.org/all/aG_pLUlHdYIZ2luh@google.com.

> 	if (!ret)
> 		kvm_gmem_mark_prepared(folio);

	if (!ret && !is_prepared)
 		kvm_gmem_mark_prepared(folio);

	?

> out_put_folio:
> 	folio_put(folio);
> out_unlock:
> 	filemap_invalidate_unlock(file->f_mapping);
> 
> 	if (src_page)
> 		put_page(src_page);
> 	return ret;
> }
> 
> long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
> 		       kvm_gmem_populate_cb post_populate, void *opaque)
> {
> 	struct file *file;
> 	struct kvm_memory_slot *slot;
> 	void __user *p;
> 	int ret = 0;
> 	long i;
> 
> 	lockdep_assert_held(&kvm->slots_lock);
> 	if (npages < 0)
> 		return -EINVAL;
> 
> 	slot = gfn_to_memslot(kvm, start_gfn);
> 	if (!kvm_slot_can_be_private(slot))
> 		return -EINVAL;
> 
> 	file = kvm_gmem_get_file(slot);
> 	if (!file)
> 		return -EFAULT;
> 
> 	npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages);
> 	for (i = 0; i < npages; i ++) {
> 		if (signal_pending(current)) {
> 			ret = -EINTR;
> 			break;
> 		}
> 
> 		p = src ? src + i * PAGE_SIZE : NULL;
> 
> 		ret = __kvm_gmem_populate(kvm, slot, file, start_gfn + i, p,
> 					  post_populate, opaque);
> 		if (ret)
> 			break;
> 	}
> 
> 	fput(file);
> 	return ret && !i ? ret : i;
> }

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-14  6:15           ` Yan Zhao
@ 2025-07-14 15:46             ` Sean Christopherson
  2025-07-14 16:02               ` David Hildenbrand
  2025-07-15  1:10               ` Yan Zhao
  0 siblings, 2 replies; 42+ messages in thread
From: Sean Christopherson @ 2025-07-14 15:46 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Michael Roth, pbonzini, kvm, linux-kernel, rick.p.edgecombe,
	kai.huang, adrian.hunter, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, ira.weiny,
	vannapurve, david, ackerleytng, tabba, chao.p.peng

On Mon, Jul 14, 2025, Yan Zhao wrote:
> On Fri, Jul 11, 2025 at 08:39:59AM -0700, Sean Christopherson wrote:
> > On Fri, Jul 11, 2025, Michael Roth wrote:
> > > On Fri, Jul 11, 2025 at 12:36:24PM +0800, Yan Zhao wrote:
> > > > Besides, it can't address the 2nd AB-BA lock issue as mentioned in the patch
> > > > log:
> > > > 
> > > > Problem
> > > > ===
> > > > ...
> > > > (2)
> > > > Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> > > > resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> > > > - filemap invalidation lock --> mm->mmap_lock
> > > > 
> > > > However, in future code, the shared filemap invalidation lock will be held
> > > > in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> > > > - mm->mmap_lock --> filemap invalidation lock
> > > 
> > > I wouldn't expect kvm_gmem_fault_shared() to trigger for the
> > > KVM_MEMSLOT_SUPPORTS_GMEM_SHARED case (or whatever we end up naming it).
> > 
> > Irrespective of shared faults, I think the API could do with a bit of cleanup
> > now that TDX has landed, i.e. now that we can see a bit more of the picture.
> > 
> > As is, I'm pretty sure TDX is broken with respect to hugepage support, because
> > kvm_gmem_populate() marks an entire folio as prepared, but TDX only ever deals
> > with one page at a time.  So that needs to be changed.  I assume it's already
> In TDX RFC v1, we deals with multiple pages at a time :)
> https://lore.kernel.org/all/20250424030500.32720-1-yan.y.zhao@intel.com/
> 
> > address in one of the many upcoming series, but it still shows a flaw in the API.
> > 
> > Hoisting the retrieval of the source page outside of filemap_invalidate_lock()
> > seems pretty straightforward, and would provide consistent ABI for all vendor
> > flavors.  E.g. as is, non-struct-page memory will work for SNP, but not TDX.  The
> > obvious downside is that struct-page becomes a requirement for SNP, but that
> > 
> > The below could be tweaked to batch get_user_pages() into an array of pointers,
> > but given that both SNP and TDX can only operate on one 4KiB page at a time, and
> > that hugepage support doesn't yet exist, trying to super optimize the hugepage
> > case straightaway doesn't seem like a pressing concern.
> 
> > static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
> > 				struct file *file, gfn_t gfn, void __user *src,
> > 				kvm_gmem_populate_cb post_populate, void *opaque)
> > {
> > 	pgoff_t index = kvm_gmem_get_index(slot, gfn);
> > 	struct page *src_page = NULL;
> > 	bool is_prepared = false;
> > 	struct folio *folio;
> > 	int ret, max_order;
> > 	kvm_pfn_t pfn;
> > 
> > 	if (src) {
> > 		ret = get_user_pages((unsigned long)src, 1, 0, &src_page);
> get_user_pages_fast()?
> 
> get_user_pages() can't pass the assertion of mmap_assert_locked().

Oh, I forgot get_user_pages() requires mmap_lock to already be held.  I would
prefer to not use a fast variant, so that userspace isn't required to prefault
(and pin?) the source.

So get_user_pages_unlocked()?

> 
> > 		if (ret < 0)
> > 			return ret;
> > 		if (ret != 1)
> > 			return -ENOMEM;
> > 	}
> > 
> > 	filemap_invalidate_lock(file->f_mapping);
> > 
> > 	if (!kvm_range_has_memory_attributes(kvm, gfn, gfn + 1,
> > 					     KVM_MEMORY_ATTRIBUTE_PRIVATE,
> > 					     KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> if (kvm_mem_is_private(kvm, gfn)) ? where
> 
> static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> {
>         struct kvm_memory_slot *slot;
> 
>         if (!IS_ENABLED(CONFIG_KVM_GMEM))
>                 return false;
> 
>         slot = gfn_to_memslot(kvm, gfn);
>         if (kvm_slot_has_gmem(slot) && kvm_gmem_memslot_supports_shared(slot))
>                 return kvm_gmem_is_private(slot, gfn);
> 
>         return kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> }
> 
> 
> > 		ret = -EINVAL;
> > 		goto out_unlock;
> > 	}
> > 
> > 	folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> If max_order > 0 is returned, the next invocation of __kvm_gmem_populate() for
> GFN+1 will return is_prepared == true.

I don't see any reason to try and make the current code truly work with hugepages.
Unless I've misundertood where we stand, the correctness of hugepage support is
going to depend heavily on the implementation for preparedness.  I.e. trying to
make this all work with per-folio granulartiy just isn't possible, no?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-14 15:46             ` Sean Christopherson
@ 2025-07-14 16:02               ` David Hildenbrand
  2025-07-14 16:07                 ` Sean Christopherson
  2025-07-15  1:10               ` Yan Zhao
  1 sibling, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2025-07-14 16:02 UTC (permalink / raw)
  To: Sean Christopherson, Yan Zhao
  Cc: Michael Roth, pbonzini, kvm, linux-kernel, rick.p.edgecombe,
	kai.huang, adrian.hunter, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, ira.weiny,
	vannapurve, ackerleytng, tabba, chao.p.peng

On 14.07.25 17:46, Sean Christopherson wrote:
> On Mon, Jul 14, 2025, Yan Zhao wrote:
>> On Fri, Jul 11, 2025 at 08:39:59AM -0700, Sean Christopherson wrote:
>>> On Fri, Jul 11, 2025, Michael Roth wrote:
>>>> On Fri, Jul 11, 2025 at 12:36:24PM +0800, Yan Zhao wrote:
>>>>> Besides, it can't address the 2nd AB-BA lock issue as mentioned in the patch
>>>>> log:
>>>>>
>>>>> Problem
>>>>> ===
>>>>> ...
>>>>> (2)
>>>>> Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
>>>>> resulting in the following lock sequence in tdx_vcpu_init_mem_region():
>>>>> - filemap invalidation lock --> mm->mmap_lock
>>>>>
>>>>> However, in future code, the shared filemap invalidation lock will be held
>>>>> in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
>>>>> - mm->mmap_lock --> filemap invalidation lock
>>>>
>>>> I wouldn't expect kvm_gmem_fault_shared() to trigger for the
>>>> KVM_MEMSLOT_SUPPORTS_GMEM_SHARED case (or whatever we end up naming it).
>>>
>>> Irrespective of shared faults, I think the API could do with a bit of cleanup
>>> now that TDX has landed, i.e. now that we can see a bit more of the picture.
>>>
>>> As is, I'm pretty sure TDX is broken with respect to hugepage support, because
>>> kvm_gmem_populate() marks an entire folio as prepared, but TDX only ever deals
>>> with one page at a time.  So that needs to be changed.  I assume it's already
>> In TDX RFC v1, we deals with multiple pages at a time :)
>> https://lore.kernel.org/all/20250424030500.32720-1-yan.y.zhao@intel.com/
>>
>>> address in one of the many upcoming series, but it still shows a flaw in the API.
>>>
>>> Hoisting the retrieval of the source page outside of filemap_invalidate_lock()
>>> seems pretty straightforward, and would provide consistent ABI for all vendor
>>> flavors.  E.g. as is, non-struct-page memory will work for SNP, but not TDX.  The
>>> obvious downside is that struct-page becomes a requirement for SNP, but that
>>>
>>> The below could be tweaked to batch get_user_pages() into an array of pointers,
>>> but given that both SNP and TDX can only operate on one 4KiB page at a time, and
>>> that hugepage support doesn't yet exist, trying to super optimize the hugepage
>>> case straightaway doesn't seem like a pressing concern.
>>
>>> static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
>>> 				struct file *file, gfn_t gfn, void __user *src,
>>> 				kvm_gmem_populate_cb post_populate, void *opaque)
>>> {
>>> 	pgoff_t index = kvm_gmem_get_index(slot, gfn);
>>> 	struct page *src_page = NULL;
>>> 	bool is_prepared = false;
>>> 	struct folio *folio;
>>> 	int ret, max_order;
>>> 	kvm_pfn_t pfn;
>>>
>>> 	if (src) {
>>> 		ret = get_user_pages((unsigned long)src, 1, 0, &src_page);
>> get_user_pages_fast()?
>>
>> get_user_pages() can't pass the assertion of mmap_assert_locked().
> 
> Oh, I forgot get_user_pages() requires mmap_lock to already be held.  I would
> prefer to not use a fast variant, so that userspace isn't required to prefault
> (and pin?) the source.
> 
> So get_user_pages_unlocked()?

Yes, but likely we really want get_user_pages_fast(), which will 
fallback to GUP-slow (+take the lock) in case it doesn't find what it 
needs in the page tables.

get_user_pages_fast_only() would be the variant that doesn't fallback to 
GUP-slow.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-14 16:02               ` David Hildenbrand
@ 2025-07-14 16:07                 ` Sean Christopherson
  0 siblings, 0 replies; 42+ messages in thread
From: Sean Christopherson @ 2025-07-14 16:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Yan Zhao, Michael Roth, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	ira.weiny, vannapurve, ackerleytng, tabba, chao.p.peng

On Mon, Jul 14, 2025, David Hildenbrand wrote:
> On 14.07.25 17:46, Sean Christopherson wrote:
> > On Mon, Jul 14, 2025, Yan Zhao wrote:
> > > On Fri, Jul 11, 2025 at 08:39:59AM -0700, Sean Christopherson wrote:
> > > > The below could be tweaked to batch get_user_pages() into an array of pointers,
> > > > but given that both SNP and TDX can only operate on one 4KiB page at a time, and
> > > > that hugepage support doesn't yet exist, trying to super optimize the hugepage
> > > > case straightaway doesn't seem like a pressing concern.
> > > 
> > > > static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > > 				struct file *file, gfn_t gfn, void __user *src,
> > > > 				kvm_gmem_populate_cb post_populate, void *opaque)
> > > > {
> > > > 	pgoff_t index = kvm_gmem_get_index(slot, gfn);
> > > > 	struct page *src_page = NULL;
> > > > 	bool is_prepared = false;
> > > > 	struct folio *folio;
> > > > 	int ret, max_order;
> > > > 	kvm_pfn_t pfn;
> > > > 
> > > > 	if (src) {
> > > > 		ret = get_user_pages((unsigned long)src, 1, 0, &src_page);
> > > get_user_pages_fast()?
> > > 
> > > get_user_pages() can't pass the assertion of mmap_assert_locked().
> > 
> > Oh, I forgot get_user_pages() requires mmap_lock to already be held.  I would
> > prefer to not use a fast variant, so that userspace isn't required to prefault
> > (and pin?) the source.
> > 
> > So get_user_pages_unlocked()?
> 
> Yes, but likely we really want get_user_pages_fast(), which will fallback to
> GUP-slow (+take the lock) in case it doesn't find what it needs in the page
> tables.
> 
> get_user_pages_fast_only() would be the variant that doesn't fallback to
> GUP-slow.

Doh, right, that's indeed what I want.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-14 15:46             ` Sean Christopherson
  2025-07-14 16:02               ` David Hildenbrand
@ 2025-07-15  1:10               ` Yan Zhao
  2025-07-18  9:14                 ` Yan Zhao
  1 sibling, 1 reply; 42+ messages in thread
From: Yan Zhao @ 2025-07-15  1:10 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Michael Roth, pbonzini, kvm, linux-kernel, rick.p.edgecombe,
	kai.huang, adrian.hunter, reinette.chatre, xiaoyao.li,
	tony.lindgren, binbin.wu, dmatlack, isaku.yamahata, ira.weiny,
	vannapurve, david, ackerleytng, tabba, chao.p.peng

On Mon, Jul 14, 2025 at 08:46:59AM -0700, Sean Christopherson wrote:
> > > 	folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> > If max_order > 0 is returned, the next invocation of __kvm_gmem_populate() for
> > GFN+1 will return is_prepared == true.
> 
> I don't see any reason to try and make the current code truly work with hugepages.
> Unless I've misundertood where we stand, the correctness of hugepage support is
Hmm. I thought your stand was to address the AB-BA lock issue which will be
introduced by huge pages, so you moved the get_user_pages() from vendor code to
the common code in guest_memfd :)

> going to depend heavily on the implementation for preparedness.  I.e. trying to
> make this all work with per-folio granulartiy just isn't possible, no?
Ah. I understand now. You mean the right implementation of __kvm_gmem_get_pfn()
should return is_prepared at 4KB granularity rather than per-folio granularity.

So, huge pages still has dependency on the implementation for preparedness.


Will you post code [1][2] to fix non-hugepages first? Or can I pull them to use
as prerequisites for TDX huge page v2?

[1] https://lore.kernel.org/all/aG_pLUlHdYIZ2luh@google.com/
[2] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@google.com/

Thanks!


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-15  1:10               ` Yan Zhao
@ 2025-07-18  9:14                 ` Yan Zhao
  2025-07-18 15:57                   ` Vishal Annapurve
  0 siblings, 1 reply; 42+ messages in thread
From: Yan Zhao @ 2025-07-18  9:14 UTC (permalink / raw)
  To: Sean Christopherson, Michael Roth, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	ira.weiny, vannapurve, david, ackerleytng, tabba, chao.p.peng

On Tue, Jul 15, 2025 at 09:10:42AM +0800, Yan Zhao wrote:
> On Mon, Jul 14, 2025 at 08:46:59AM -0700, Sean Christopherson wrote:
> > > > 	folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> > > If max_order > 0 is returned, the next invocation of __kvm_gmem_populate() for
> > > GFN+1 will return is_prepared == true.
> > 
> > I don't see any reason to try and make the current code truly work with hugepages.
> > Unless I've misundertood where we stand, the correctness of hugepage support is
> Hmm. I thought your stand was to address the AB-BA lock issue which will be
> introduced by huge pages, so you moved the get_user_pages() from vendor code to
> the common code in guest_memfd :)
> 
> > going to depend heavily on the implementation for preparedness.  I.e. trying to
> > make this all work with per-folio granulartiy just isn't possible, no?
> Ah. I understand now. You mean the right implementation of __kvm_gmem_get_pfn()
> should return is_prepared at 4KB granularity rather than per-folio granularity.
> 
> So, huge pages still has dependency on the implementation for preparedness.
Looks with [3], is_prepared will not be checked in kvm_gmem_populate().

> Will you post code [1][2] to fix non-hugepages first? Or can I pull them to use
> as prerequisites for TDX huge page v2?
So, maybe I can use [1][2][3] as the base.

> [1] https://lore.kernel.org/all/aG_pLUlHdYIZ2luh@google.com/
> [2] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@google.com/
[3] https://lore.kernel.org/lkml/20250613005400.3694904-2-michael.roth@amd.com,

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-18  9:14                 ` Yan Zhao
@ 2025-07-18 15:57                   ` Vishal Annapurve
  2025-07-18 18:42                     ` Ira Weiny
  2025-07-28  9:48                     ` Yan Zhao
  0 siblings, 2 replies; 42+ messages in thread
From: Vishal Annapurve @ 2025-07-18 15:57 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Sean Christopherson, Michael Roth, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	ira.weiny, david, ackerleytng, tabba, chao.p.peng

On Fri, Jul 18, 2025 at 2:15 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Tue, Jul 15, 2025 at 09:10:42AM +0800, Yan Zhao wrote:
> > On Mon, Jul 14, 2025 at 08:46:59AM -0700, Sean Christopherson wrote:
> > > > >         folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> > > > If max_order > 0 is returned, the next invocation of __kvm_gmem_populate() for
> > > > GFN+1 will return is_prepared == true.
> > >
> > > I don't see any reason to try and make the current code truly work with hugepages.
> > > Unless I've misundertood where we stand, the correctness of hugepage support is
> > Hmm. I thought your stand was to address the AB-BA lock issue which will be
> > introduced by huge pages, so you moved the get_user_pages() from vendor code to
> > the common code in guest_memfd :)
> >
> > > going to depend heavily on the implementation for preparedness.  I.e. trying to
> > > make this all work with per-folio granulartiy just isn't possible, no?
> > Ah. I understand now. You mean the right implementation of __kvm_gmem_get_pfn()
> > should return is_prepared at 4KB granularity rather than per-folio granularity.
> >
> > So, huge pages still has dependency on the implementation for preparedness.
> Looks with [3], is_prepared will not be checked in kvm_gmem_populate().
>
> > Will you post code [1][2] to fix non-hugepages first? Or can I pull them to use
> > as prerequisites for TDX huge page v2?
> So, maybe I can use [1][2][3] as the base.
>
> > [1] https://lore.kernel.org/all/aG_pLUlHdYIZ2luh@google.com/
> > [2] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@google.com/

IMO, unless there is any objection to [1], it's un-necessary to
maintain kvm_gmem_populate for any arch (even for SNP). All the
initial memory population logic needs is the stable pfn for a given
gfn, which ideally should be available using the standard mechanisms
such as EPT/NPT page table walk within a read KVM mmu lock (This patch
already demonstrates it to be working).

It will be hard to clean-up this logic once we have all the
architectures using this path.

[1] https://lore.kernel.org/lkml/CAGtprH8+x5Z=tPz=NcrQM6Dor2AYBu3jiZdo+Lg4NqAk0pUJ3w@mail.gmail.com/

> [3] https://lore.kernel.org/lkml/20250613005400.3694904-2-michael.roth@amd.com,

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-18 15:57                   ` Vishal Annapurve
@ 2025-07-18 18:42                     ` Ira Weiny
  2025-07-18 18:59                       ` Vishal Annapurve
  2025-07-28  9:48                     ` Yan Zhao
  1 sibling, 1 reply; 42+ messages in thread
From: Ira Weiny @ 2025-07-18 18:42 UTC (permalink / raw)
  To: Vishal Annapurve, Yan Zhao
  Cc: Sean Christopherson, Michael Roth, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	ira.weiny, david, ackerleytng, tabba, chao.p.peng

Vishal Annapurve wrote:
> On Fri, Jul 18, 2025 at 2:15 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Tue, Jul 15, 2025 at 09:10:42AM +0800, Yan Zhao wrote:
> > > On Mon, Jul 14, 2025 at 08:46:59AM -0700, Sean Christopherson wrote:
> > > > > >         folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> > > > > If max_order > 0 is returned, the next invocation of __kvm_gmem_populate() for
> > > > > GFN+1 will return is_prepared == true.
> > > >
> > > > I don't see any reason to try and make the current code truly work with hugepages.
> > > > Unless I've misundertood where we stand, the correctness of hugepage support is
> > > Hmm. I thought your stand was to address the AB-BA lock issue which will be
> > > introduced by huge pages, so you moved the get_user_pages() from vendor code to
> > > the common code in guest_memfd :)
> > >
> > > > going to depend heavily on the implementation for preparedness.  I.e. trying to
> > > > make this all work with per-folio granulartiy just isn't possible, no?
> > > Ah. I understand now. You mean the right implementation of __kvm_gmem_get_pfn()
> > > should return is_prepared at 4KB granularity rather than per-folio granularity.
> > >
> > > So, huge pages still has dependency on the implementation for preparedness.
> > Looks with [3], is_prepared will not be checked in kvm_gmem_populate().
> >
> > > Will you post code [1][2] to fix non-hugepages first? Or can I pull them to use
> > > as prerequisites for TDX huge page v2?
> > So, maybe I can use [1][2][3] as the base.
> >
> > > [1] https://lore.kernel.org/all/aG_pLUlHdYIZ2luh@google.com/
> > > [2] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@google.com/
> 
> IMO, unless there is any objection to [1], it's un-necessary to
> maintain kvm_gmem_populate for any arch (even for SNP). All the
> initial memory population logic needs is the stable pfn for a given
> gfn, which ideally should be available using the standard mechanisms
> such as EPT/NPT page table walk within a read KVM mmu lock (This patch
> already demonstrates it to be working).
> 
> It will be hard to clean-up this logic once we have all the
> architectures using this path.

Did you mean to say 'not hard'?

Ira

> 
> [1] https://lore.kernel.org/lkml/CAGtprH8+x5Z=tPz=NcrQM6Dor2AYBu3jiZdo+Lg4NqAk0pUJ3w@mail.gmail.com/
> 
> > [3] https://lore.kernel.org/lkml/20250613005400.3694904-2-michael.roth@amd.com,



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-18 18:42                     ` Ira Weiny
@ 2025-07-18 18:59                       ` Vishal Annapurve
  2025-07-21 17:46                         ` Ira Weiny
  0 siblings, 1 reply; 42+ messages in thread
From: Vishal Annapurve @ 2025-07-18 18:59 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Yan Zhao, Sean Christopherson, Michael Roth, pbonzini, kvm,
	linux-kernel, rick.p.edgecombe, kai.huang, adrian.hunter,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, david, ackerleytng, tabba, chao.p.peng

On Fri, Jul 18, 2025 at 11:41 AM Ira Weiny <ira.weiny@intel.com> wrote:
>
> Vishal Annapurve wrote:
> > On Fri, Jul 18, 2025 at 2:15 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Tue, Jul 15, 2025 at 09:10:42AM +0800, Yan Zhao wrote:
> > > > On Mon, Jul 14, 2025 at 08:46:59AM -0700, Sean Christopherson wrote:
> > > > > > >         folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> > > > > > If max_order > 0 is returned, the next invocation of __kvm_gmem_populate() for
> > > > > > GFN+1 will return is_prepared == true.
> > > > >
> > > > > I don't see any reason to try and make the current code truly work with hugepages.
> > > > > Unless I've misundertood where we stand, the correctness of hugepage support is
> > > > Hmm. I thought your stand was to address the AB-BA lock issue which will be
> > > > introduced by huge pages, so you moved the get_user_pages() from vendor code to
> > > > the common code in guest_memfd :)
> > > >
> > > > > going to depend heavily on the implementation for preparedness.  I.e. trying to
> > > > > make this all work with per-folio granulartiy just isn't possible, no?
> > > > Ah. I understand now. You mean the right implementation of __kvm_gmem_get_pfn()
> > > > should return is_prepared at 4KB granularity rather than per-folio granularity.
> > > >
> > > > So, huge pages still has dependency on the implementation for preparedness.
> > > Looks with [3], is_prepared will not be checked in kvm_gmem_populate().
> > >
> > > > Will you post code [1][2] to fix non-hugepages first? Or can I pull them to use
> > > > as prerequisites for TDX huge page v2?
> > > So, maybe I can use [1][2][3] as the base.
> > >
> > > > [1] https://lore.kernel.org/all/aG_pLUlHdYIZ2luh@google.com/
> > > > [2] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@google.com/
> >
> > IMO, unless there is any objection to [1], it's un-necessary to
> > maintain kvm_gmem_populate for any arch (even for SNP). All the
> > initial memory population logic needs is the stable pfn for a given
> > gfn, which ideally should be available using the standard mechanisms
> > such as EPT/NPT page table walk within a read KVM mmu lock (This patch
> > already demonstrates it to be working).
> >
> > It will be hard to clean-up this logic once we have all the
> > architectures using this path.
>
> Did you mean to say 'not hard'?

Let me rephrase my sentence:
It will be harder to remove kvm_gmem_populate if we punt it to the future.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-18 18:59                       ` Vishal Annapurve
@ 2025-07-21 17:46                         ` Ira Weiny
  0 siblings, 0 replies; 42+ messages in thread
From: Ira Weiny @ 2025-07-21 17:46 UTC (permalink / raw)
  To: Vishal Annapurve, Ira Weiny
  Cc: Yan Zhao, Sean Christopherson, Michael Roth, pbonzini, kvm,
	linux-kernel, rick.p.edgecombe, kai.huang, adrian.hunter,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, david, ackerleytng, tabba, chao.p.peng

Vishal Annapurve wrote:
> On Fri, Jul 18, 2025 at 11:41 AM Ira Weiny <ira.weiny@intel.com> wrote:
> >
> > Vishal Annapurve wrote:
> > > On Fri, Jul 18, 2025 at 2:15 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > On Tue, Jul 15, 2025 at 09:10:42AM +0800, Yan Zhao wrote:
> > > > > On Mon, Jul 14, 2025 at 08:46:59AM -0700, Sean Christopherson wrote:
> > > > > > > >         folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> > > > > > > If max_order > 0 is returned, the next invocation of __kvm_gmem_populate() for
> > > > > > > GFN+1 will return is_prepared == true.
> > > > > >
> > > > > > I don't see any reason to try and make the current code truly work with hugepages.
> > > > > > Unless I've misundertood where we stand, the correctness of hugepage support is
> > > > > Hmm. I thought your stand was to address the AB-BA lock issue which will be
> > > > > introduced by huge pages, so you moved the get_user_pages() from vendor code to
> > > > > the common code in guest_memfd :)
> > > > >
> > > > > > going to depend heavily on the implementation for preparedness.  I.e. trying to
> > > > > > make this all work with per-folio granulartiy just isn't possible, no?
> > > > > Ah. I understand now. You mean the right implementation of __kvm_gmem_get_pfn()
> > > > > should return is_prepared at 4KB granularity rather than per-folio granularity.
> > > > >
> > > > > So, huge pages still has dependency on the implementation for preparedness.
> > > > Looks with [3], is_prepared will not be checked in kvm_gmem_populate().
> > > >
> > > > > Will you post code [1][2] to fix non-hugepages first? Or can I pull them to use
> > > > > as prerequisites for TDX huge page v2?
> > > > So, maybe I can use [1][2][3] as the base.
> > > >
> > > > > [1] https://lore.kernel.org/all/aG_pLUlHdYIZ2luh@google.com/
> > > > > [2] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@google.com/
> > >
> > > IMO, unless there is any objection to [1], it's un-necessary to
> > > maintain kvm_gmem_populate for any arch (even for SNP). All the
> > > initial memory population logic needs is the stable pfn for a given
> > > gfn, which ideally should be available using the standard mechanisms
> > > such as EPT/NPT page table walk within a read KVM mmu lock (This patch
> > > already demonstrates it to be working).
> > >
> > > It will be hard to clean-up this logic once we have all the
> > > architectures using this path.
> >
> > Did you mean to say 'not hard'?
> 
> Let me rephrase my sentence:
> It will be harder to remove kvm_gmem_populate if we punt it to the future.

Indeed if more folks start using it.

Ira

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-18 15:57                   ` Vishal Annapurve
  2025-07-18 18:42                     ` Ira Weiny
@ 2025-07-28  9:48                     ` Yan Zhao
  2025-07-29  0:45                       ` Vishal Annapurve
  1 sibling, 1 reply; 42+ messages in thread
From: Yan Zhao @ 2025-07-28  9:48 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Sean Christopherson, Michael Roth, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	ira.weiny, david, ackerleytng, tabba, chao.p.peng

On Fri, Jul 18, 2025 at 08:57:10AM -0700, Vishal Annapurve wrote:
> On Fri, Jul 18, 2025 at 2:15 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Tue, Jul 15, 2025 at 09:10:42AM +0800, Yan Zhao wrote:
> > > On Mon, Jul 14, 2025 at 08:46:59AM -0700, Sean Christopherson wrote:
> > > > > >         folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> > > > > If max_order > 0 is returned, the next invocation of __kvm_gmem_populate() for
> > > > > GFN+1 will return is_prepared == true.
> > > >
> > > > I don't see any reason to try and make the current code truly work with hugepages.
> > > > Unless I've misundertood where we stand, the correctness of hugepage support is
> > > Hmm. I thought your stand was to address the AB-BA lock issue which will be
> > > introduced by huge pages, so you moved the get_user_pages() from vendor code to
> > > the common code in guest_memfd :)
> > >
> > > > going to depend heavily on the implementation for preparedness.  I.e. trying to
> > > > make this all work with per-folio granulartiy just isn't possible, no?
> > > Ah. I understand now. You mean the right implementation of __kvm_gmem_get_pfn()
> > > should return is_prepared at 4KB granularity rather than per-folio granularity.
> > >
> > > So, huge pages still has dependency on the implementation for preparedness.
> > Looks with [3], is_prepared will not be checked in kvm_gmem_populate().
> >
> > > Will you post code [1][2] to fix non-hugepages first? Or can I pull them to use
> > > as prerequisites for TDX huge page v2?
> > So, maybe I can use [1][2][3] as the base.
> >
> > > [1] https://lore.kernel.org/all/aG_pLUlHdYIZ2luh@google.com/
> > > [2] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@google.com/

From the PUCK, looks Sean said he'll post [1][2] for 6.18 and Michael will post
[3] soon.

hi, Sean, is this understanding correct?

> IMO, unless there is any objection to [1], it's un-necessary to
> maintain kvm_gmem_populate for any arch (even for SNP). All the
> initial memory population logic needs is the stable pfn for a given
> gfn, which ideally should be available using the standard mechanisms
> such as EPT/NPT page table walk within a read KVM mmu lock (This patch
> already demonstrates it to be working).
> 
> It will be hard to clean-up this logic once we have all the
> architectures using this path.
> 
> [1] https://lore.kernel.org/lkml/CAGtprH8+x5Z=tPz=NcrQM6Dor2AYBu3jiZdo+Lg4NqAk0pUJ3w@mail.gmail.com/
IIUC, the suggestion in the link is to abandon kvm_gmem_populate().
For TDX, it means adopting the approach in this RFC patch, right?

> > [3] https://lore.kernel.org/lkml/20250613005400.3694904-2-michael.roth@amd.com,

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-28  9:48                     ` Yan Zhao
@ 2025-07-29  0:45                       ` Vishal Annapurve
  2025-07-29  1:37                         ` Yan Zhao
  0 siblings, 1 reply; 42+ messages in thread
From: Vishal Annapurve @ 2025-07-29  0:45 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Sean Christopherson, Michael Roth, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	ira.weiny, david, ackerleytng, tabba, chao.p.peng

On Mon, Jul 28, 2025 at 2:49 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Fri, Jul 18, 2025 at 08:57:10AM -0700, Vishal Annapurve wrote:
> > On Fri, Jul 18, 2025 at 2:15 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Tue, Jul 15, 2025 at 09:10:42AM +0800, Yan Zhao wrote:
> > > > On Mon, Jul 14, 2025 at 08:46:59AM -0700, Sean Christopherson wrote:
> > > > > > >         folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> > > > > > If max_order > 0 is returned, the next invocation of __kvm_gmem_populate() for
> > > > > > GFN+1 will return is_prepared == true.
> > > > >
> > > > > I don't see any reason to try and make the current code truly work with hugepages.
> > > > > Unless I've misundertood where we stand, the correctness of hugepage support is
> > > > Hmm. I thought your stand was to address the AB-BA lock issue which will be
> > > > introduced by huge pages, so you moved the get_user_pages() from vendor code to
> > > > the common code in guest_memfd :)
> > > >
> > > > > going to depend heavily on the implementation for preparedness.  I.e. trying to
> > > > > make this all work with per-folio granulartiy just isn't possible, no?
> > > > Ah. I understand now. You mean the right implementation of __kvm_gmem_get_pfn()
> > > > should return is_prepared at 4KB granularity rather than per-folio granularity.
> > > >
> > > > So, huge pages still has dependency on the implementation for preparedness.
> > > Looks with [3], is_prepared will not be checked in kvm_gmem_populate().
> > >
> > > > Will you post code [1][2] to fix non-hugepages first? Or can I pull them to use
> > > > as prerequisites for TDX huge page v2?
> > > So, maybe I can use [1][2][3] as the base.
> > >
> > > > [1] https://lore.kernel.org/all/aG_pLUlHdYIZ2luh@google.com/
> > > > [2] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@google.com/
>
> From the PUCK, looks Sean said he'll post [1][2] for 6.18 and Michael will post
> [3] soon.
>
> hi, Sean, is this understanding correct?
>
> > IMO, unless there is any objection to [1], it's un-necessary to
> > maintain kvm_gmem_populate for any arch (even for SNP). All the
> > initial memory population logic needs is the stable pfn for a given
> > gfn, which ideally should be available using the standard mechanisms
> > such as EPT/NPT page table walk within a read KVM mmu lock (This patch
> > already demonstrates it to be working).
> >
> > It will be hard to clean-up this logic once we have all the
> > architectures using this path.
> >
> > [1] https://lore.kernel.org/lkml/CAGtprH8+x5Z=tPz=NcrQM6Dor2AYBu3jiZdo+Lg4NqAk0pUJ3w@mail.gmail.com/
> IIUC, the suggestion in the link is to abandon kvm_gmem_populate().
> For TDX, it means adopting the approach in this RFC patch, right?

Yes, IMO this RFC is following the right approach as posted.

>
> > > [3] https://lore.kernel.org/lkml/20250613005400.3694904-2-michael.roth@amd.com,

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-29  0:45                       ` Vishal Annapurve
@ 2025-07-29  1:37                         ` Yan Zhao
  2025-07-29 16:33                           ` Ira Weiny
  0 siblings, 1 reply; 42+ messages in thread
From: Yan Zhao @ 2025-07-29  1:37 UTC (permalink / raw)
  To: Vishal Annapurve, ira.weiny
  Cc: Sean Christopherson, Michael Roth, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	david, ackerleytng, tabba, chao.p.peng

On Mon, Jul 28, 2025 at 05:45:35PM -0700, Vishal Annapurve wrote:
> On Mon, Jul 28, 2025 at 2:49 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Fri, Jul 18, 2025 at 08:57:10AM -0700, Vishal Annapurve wrote:
> > > On Fri, Jul 18, 2025 at 2:15 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > On Tue, Jul 15, 2025 at 09:10:42AM +0800, Yan Zhao wrote:
> > > > > On Mon, Jul 14, 2025 at 08:46:59AM -0700, Sean Christopherson wrote:
> > > > > > > >         folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> > > > > > > If max_order > 0 is returned, the next invocation of __kvm_gmem_populate() for
> > > > > > > GFN+1 will return is_prepared == true.
> > > > > >
> > > > > > I don't see any reason to try and make the current code truly work with hugepages.
> > > > > > Unless I've misundertood where we stand, the correctness of hugepage support is
> > > > > Hmm. I thought your stand was to address the AB-BA lock issue which will be
> > > > > introduced by huge pages, so you moved the get_user_pages() from vendor code to
> > > > > the common code in guest_memfd :)
> > > > >
> > > > > > going to depend heavily on the implementation for preparedness.  I.e. trying to
> > > > > > make this all work with per-folio granulartiy just isn't possible, no?
> > > > > Ah. I understand now. You mean the right implementation of __kvm_gmem_get_pfn()
> > > > > should return is_prepared at 4KB granularity rather than per-folio granularity.
> > > > >
> > > > > So, huge pages still has dependency on the implementation for preparedness.
> > > > Looks with [3], is_prepared will not be checked in kvm_gmem_populate().
> > > >
> > > > > Will you post code [1][2] to fix non-hugepages first? Or can I pull them to use
> > > > > as prerequisites for TDX huge page v2?
> > > > So, maybe I can use [1][2][3] as the base.
> > > >
> > > > > [1] https://lore.kernel.org/all/aG_pLUlHdYIZ2luh@google.com/
> > > > > [2] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@google.com/
> >
> > From the PUCK, looks Sean said he'll post [1][2] for 6.18 and Michael will post
> > [3] soon.
> >
> > hi, Sean, is this understanding correct?
> >
> > > IMO, unless there is any objection to [1], it's un-necessary to
> > > maintain kvm_gmem_populate for any arch (even for SNP). All the
> > > initial memory population logic needs is the stable pfn for a given
> > > gfn, which ideally should be available using the standard mechanisms
> > > such as EPT/NPT page table walk within a read KVM mmu lock (This patch
> > > already demonstrates it to be working).
> > >
> > > It will be hard to clean-up this logic once we have all the
> > > architectures using this path.
> > >
> > > [1] https://lore.kernel.org/lkml/CAGtprH8+x5Z=tPz=NcrQM6Dor2AYBu3jiZdo+Lg4NqAk0pUJ3w@mail.gmail.com/
> > IIUC, the suggestion in the link is to abandon kvm_gmem_populate().
> > For TDX, it means adopting the approach in this RFC patch, right?
> 
> Yes, IMO this RFC is following the right approach as posted.
Ira has been investigating this for a while, see if he has any comment.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-29  1:37                         ` Yan Zhao
@ 2025-07-29 16:33                           ` Ira Weiny
  2025-08-05  0:22                             ` Sean Christopherson
  0 siblings, 1 reply; 42+ messages in thread
From: Ira Weiny @ 2025-07-29 16:33 UTC (permalink / raw)
  To: Yan Zhao, Vishal Annapurve, ira.weiny
  Cc: Sean Christopherson, Michael Roth, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	david, ackerleytng, tabba, chao.p.peng

Yan Zhao wrote:
> On Mon, Jul 28, 2025 at 05:45:35PM -0700, Vishal Annapurve wrote:
> > On Mon, Jul 28, 2025 at 2:49 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Fri, Jul 18, 2025 at 08:57:10AM -0700, Vishal Annapurve wrote:
> > > > On Fri, Jul 18, 2025 at 2:15 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >
> > > > > On Tue, Jul 15, 2025 at 09:10:42AM +0800, Yan Zhao wrote:
> > > > > > On Mon, Jul 14, 2025 at 08:46:59AM -0700, Sean Christopherson wrote:
> > > > > > > > >         folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> > > > > > > > If max_order > 0 is returned, the next invocation of __kvm_gmem_populate() for
> > > > > > > > GFN+1 will return is_prepared == true.
> > > > > > >
> > > > > > > I don't see any reason to try and make the current code truly work with hugepages.
> > > > > > > Unless I've misundertood where we stand, the correctness of hugepage support is
> > > > > > Hmm. I thought your stand was to address the AB-BA lock issue which will be
> > > > > > introduced by huge pages, so you moved the get_user_pages() from vendor code to
> > > > > > the common code in guest_memfd :)
> > > > > >
> > > > > > > going to depend heavily on the implementation for preparedness.  I.e. trying to
> > > > > > > make this all work with per-folio granulartiy just isn't possible, no?
> > > > > > Ah. I understand now. You mean the right implementation of __kvm_gmem_get_pfn()
> > > > > > should return is_prepared at 4KB granularity rather than per-folio granularity.
> > > > > >
> > > > > > So, huge pages still has dependency on the implementation for preparedness.
> > > > > Looks with [3], is_prepared will not be checked in kvm_gmem_populate().
> > > > >
> > > > > > Will you post code [1][2] to fix non-hugepages first? Or can I pull them to use
> > > > > > as prerequisites for TDX huge page v2?
> > > > > So, maybe I can use [1][2][3] as the base.
> > > > >
> > > > > > [1] https://lore.kernel.org/all/aG_pLUlHdYIZ2luh@google.com/
> > > > > > [2] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@google.com/
> > >
> > > From the PUCK, looks Sean said he'll post [1][2] for 6.18 and Michael will post
> > > [3] soon.
> > >
> > > hi, Sean, is this understanding correct?
> > >
> > > > IMO, unless there is any objection to [1], it's un-necessary to
> > > > maintain kvm_gmem_populate for any arch (even for SNP). All the
> > > > initial memory population logic needs is the stable pfn for a given
> > > > gfn, which ideally should be available using the standard mechanisms
> > > > such as EPT/NPT page table walk within a read KVM mmu lock (This patch
> > > > already demonstrates it to be working).
> > > >
> > > > It will be hard to clean-up this logic once we have all the
> > > > architectures using this path.
> > > >
> > > > [1] https://lore.kernel.org/lkml/CAGtprH8+x5Z=tPz=NcrQM6Dor2AYBu3jiZdo+Lg4NqAk0pUJ3w@mail.gmail.com/
> > > IIUC, the suggestion in the link is to abandon kvm_gmem_populate().
> > > For TDX, it means adopting the approach in this RFC patch, right?
> > 
> > Yes, IMO this RFC is following the right approach as posted.
> Ira has been investigating this for a while, see if he has any comment.

So far I have not seen any reason to keep kvm_gmem_populate() either.

Sean, did yall post the patch you suggested here and I missed it?

	https://lore.kernel.org/kvm/aG_pLUlHdYIZ2luh@google.com/

Ira

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-29 16:33                           ` Ira Weiny
@ 2025-08-05  0:22                             ` Sean Christopherson
  2025-08-05  1:20                               ` Vishal Annapurve
  0 siblings, 1 reply; 42+ messages in thread
From: Sean Christopherson @ 2025-08-05  0:22 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Yan Zhao, Vishal Annapurve, Michael Roth, pbonzini, kvm,
	linux-kernel, rick.p.edgecombe, kai.huang, adrian.hunter,
	reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu, dmatlack,
	isaku.yamahata, david, ackerleytng, tabba, chao.p.peng

On Tue, Jul 29, 2025, Ira Weiny wrote:
> Yan Zhao wrote:
> > On Mon, Jul 28, 2025 at 05:45:35PM -0700, Vishal Annapurve wrote:
> > > On Mon, Jul 28, 2025 at 2:49 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > On Fri, Jul 18, 2025 at 08:57:10AM -0700, Vishal Annapurve wrote:
> > > > > On Fri, Jul 18, 2025 at 2:15 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > >
> > > > > > On Tue, Jul 15, 2025 at 09:10:42AM +0800, Yan Zhao wrote:
> > > > > > > On Mon, Jul 14, 2025 at 08:46:59AM -0700, Sean Christopherson wrote:
> > > > > > > > > >         folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> > > > > > > > > If max_order > 0 is returned, the next invocation of __kvm_gmem_populate() for
> > > > > > > > > GFN+1 will return is_prepared == true.
> > > > > > > >
> > > > > > > > I don't see any reason to try and make the current code truly work with hugepages.
> > > > > > > > Unless I've misundertood where we stand, the correctness of hugepage support is
> > > > > > > Hmm. I thought your stand was to address the AB-BA lock issue which will be
> > > > > > > introduced by huge pages, so you moved the get_user_pages() from vendor code to
> > > > > > > the common code in guest_memfd :)
> > > > > > >
> > > > > > > > going to depend heavily on the implementation for preparedness.  I.e. trying to
> > > > > > > > make this all work with per-folio granulartiy just isn't possible, no?
> > > > > > > Ah. I understand now. You mean the right implementation of __kvm_gmem_get_pfn()
> > > > > > > should return is_prepared at 4KB granularity rather than per-folio granularity.
> > > > > > >
> > > > > > > So, huge pages still has dependency on the implementation for preparedness.
> > > > > > Looks with [3], is_prepared will not be checked in kvm_gmem_populate().
> > > > > >
> > > > > > > Will you post code [1][2] to fix non-hugepages first? Or can I pull them to use
> > > > > > > as prerequisites for TDX huge page v2?
> > > > > > So, maybe I can use [1][2][3] as the base.
> > > > > >
> > > > > > > [1] https://lore.kernel.org/all/aG_pLUlHdYIZ2luh@google.com/
> > > > > > > [2] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@google.com/
> > > >
> > > > From the PUCK, looks Sean said he'll post [1][2] for 6.18 and Michael will post
> > > > [3] soon.
> > > >
> > > > hi, Sean, is this understanding correct?
> > > >
> > > > > IMO, unless there is any objection to [1], it's un-necessary to
> > > > > maintain kvm_gmem_populate for any arch (even for SNP). All the
> > > > > initial memory population logic needs is the stable pfn for a given
> > > > > gfn, which ideally should be available using the standard mechanisms
> > > > > such as EPT/NPT page table walk within a read KVM mmu lock (This patch
> > > > > already demonstrates it to be working).
> > > > >
> > > > > It will be hard to clean-up this logic once we have all the
> > > > > architectures using this path.
> > > > >
> > > > > [1] https://lore.kernel.org/lkml/CAGtprH8+x5Z=tPz=NcrQM6Dor2AYBu3jiZdo+Lg4NqAk0pUJ3w@mail.gmail.com/
> > > > IIUC, the suggestion in the link is to abandon kvm_gmem_populate().
> > > > For TDX, it means adopting the approach in this RFC patch, right?
> > > Yes, IMO this RFC is following the right approach as posted.

I don't think we want to abandon kvm_gmem_populate().  Unless I'm missing something,
SNP has the same AB-BA problem as TDX.  The copy_from_user() on @src can trigger
a page fault, and resolving the page fault may require taking mm->mmap_lock.

Fundamentally, TDX and SNP are doing the same thing: copying from source to guest
memory.  The only differences are in the mechanics of the copy+encrypt, everything
else is the same.  I.e. I don't expect that we'll find a magic solution that works
well for one and not the other.

I also don't want to end up with wildly different ABI for SNP vs. everything else.
E.g. cond_resched() needs to be called if the to-be-initialzied range is large,
which means dropping mmu_lock between pages, whereas kvm_gmem_populate() can
yield without dropping invalidate_lock, which means that the behavior of populating
guest_memfd memory will be quite different with respect to guest_memfd operations.

Pulling in the RFC text:

: I think the only different scenario is SNP, where the host must write
: initial contents to guest memory.
: 
: Will this work for all cases CCA/SNP/TDX during initial memory
: population from within KVM:
: 1) Simulate stage2 fault
: 2) Take a KVM mmu read lock

Doing all of this under mmu_lock is pretty much a non-starter.

: 3) Check that the needed gpa is mapped in EPT/NPT entries

No, KVM's page tables are not the source of truth.  S-EPT is a special snowflake,
and I'd like to avoid foisting the same requirements on NPT.

: 4) For SNP, if src != null, make the target pfn to be shared, copy
: contents and then make the target pfn back to private.

Copying from userspace under spinlock (rwlock) is illegal, as accessing userspace
memory might_fault() and thus might_sleep().

: 5) For TDX, if src != null, pass the same address for source and
: target (likely this works for CCA too)
: 6) Invoke appropriate memory encryption operations
: 7) measure contents
: 8) release the KVM mmu read lock
: 
: If this scheme works, ideally we should also not call RMP table
: population logic from guest_memfd, but from KVM NPT fault handling
: logic directly (a bit of cosmetic change). 

LOL, that's not a cosmetic change.  It would be a non-trivial ABI change as KVM's
ABI (ignoring S-EPT) is that userspace can delete and recreate memslots at will.

: Ideally any outgoing interaction from guest_memfd to KVM should be only via
  invalidation notifiers.

Why?  It's all KVM code.  I don't see how this is any different than e.g. common
code, arch code, and vendor code all calling into one another.  Artificially
limiting guest_memfd to a super generic interface pretty much defeats the whole
purpose of having KVM provide a backing store.

> > Ira has been investigating this for a while, see if he has any comment.
> 
> So far I have not seen any reason to keep kvm_gmem_populate() either.
> 
> Sean, did yall post the patch you suggested here and I missed it?

No, I have a partially baked patch, but I've been trying to finish up v5 of the
mediated PMU series and haven't had time to focus on this.  Hopefully I'll post
a compile-tested patch later this week.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-08-05  0:22                             ` Sean Christopherson
@ 2025-08-05  1:20                               ` Vishal Annapurve
  2025-08-05 14:30                                 ` Vishal Annapurve
  2025-08-05 19:59                                 ` Sean Christopherson
  0 siblings, 2 replies; 42+ messages in thread
From: Vishal Annapurve @ 2025-08-05  1:20 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ira Weiny, Yan Zhao, Michael Roth, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	david, ackerleytng, tabba, chao.p.peng

On Mon, Aug 4, 2025 at 5:22 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > IIUC, the suggestion in the link is to abandon kvm_gmem_populate().
> > > > > For TDX, it means adopting the approach in this RFC patch, right?
> > > > Yes, IMO this RFC is following the right approach as posted.
>
> I don't think we want to abandon kvm_gmem_populate().  Unless I'm missing something,
> SNP has the same AB-BA problem as TDX.  The copy_from_user() on @src can trigger
> a page fault, and resolving the page fault may require taking mm->mmap_lock.
>
> Fundamentally, TDX and SNP are doing the same thing: copying from source to guest
> memory.  The only differences are in the mechanics of the copy+encrypt, everything
> else is the same.  I.e. I don't expect that we'll find a magic solution that works
> well for one and not the other.
>
> I also don't want to end up with wildly different ABI for SNP vs. everything else.
> E.g. cond_resched() needs to be called if the to-be-initialzied range is large,
> which means dropping mmu_lock between pages, whereas kvm_gmem_populate() can
> yield without dropping invalidate_lock, which means that the behavior of populating
> guest_memfd memory will be quite different with respect to guest_memfd operations.

I would think that TDX/CCA VMs [1] will run into the similar behavior
of needing to simulate stage2 faults i.e. KVM will end up picking up
and dropping mmu_lock for each page anyways at least for these two
platforms.

[1] https://lore.kernel.org/kvm/20250611104844.245235-5-steven.price@arm.com/
(rmi_rtt_create())

>
> Pulling in the RFC text:
>
> : I think the only different scenario is SNP, where the host must write
> : initial contents to guest memory.
> :
> : Will this work for all cases CCA/SNP/TDX during initial memory
> : population from within KVM:
> : 1) Simulate stage2 fault
> : 2) Take a KVM mmu read lock
>
> Doing all of this under mmu_lock is pretty much a non-starter.
>
> : 3) Check that the needed gpa is mapped in EPT/NPT entries
>
> No, KVM's page tables are not the source of truth.  S-EPT is a special snowflake,
> and I'd like to avoid foisting the same requirements on NPT.

I agree this would be a new requirement.

>
> : 4) For SNP, if src != null, make the target pfn to be shared, copy
> : contents and then make the target pfn back to private.
>
> Copying from userspace under spinlock (rwlock) is illegal, as accessing userspace
> memory might_fault() and thus might_sleep().

I would think that a combination of get_user_pages() and
kmap_local_pfn() will prevent this situation of might_fault().

>
> : 5) For TDX, if src != null, pass the same address for source and
> : target (likely this works for CCA too)
> : 6) Invoke appropriate memory encryption operations
> : 7) measure contents
> : 8) release the KVM mmu read lock
> :
> : If this scheme works, ideally we should also not call RMP table
> : population logic from guest_memfd, but from KVM NPT fault handling
> : logic directly (a bit of cosmetic change).
>
> LOL, that's not a cosmetic change.  It would be a non-trivial ABI change as KVM's
> ABI (ignoring S-EPT) is that userspace can delete and recreate memslots at will.

Ack, this is not a cosmetic change once we start thinking about how
memory ownership should be tied to memslots/NPT operations.

>
> : Ideally any outgoing interaction from guest_memfd to KVM should be only via
>   invalidation notifiers.
>
> Why?  It's all KVM code.  I don't see how this is any different than e.g. common
> code, arch code, and vendor code all calling into one another.  Artificially
> limiting guest_memfd to a super generic interface pretty much defeats the whole
> purpose of having KVM provide a backing store.

Inline with what we discussed on another thread, it makes sense to
think of guest_memfd as not a super generic interface, but at least
the one that has a very well defined role of supplying memory to KVM
guests (so to userspace and IOMMU) and taking away the memory when
needed. Memory population in my opinion is best solved either by users
asserting ownership of the memory and writing to it directly or by
using guest_memfd (to be) exposed APIs to populate memory ranges given
a source buffer. IMO kvm_gmem_populate() is doing something different
than both of these options.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-08-05  1:20                               ` Vishal Annapurve
@ 2025-08-05 14:30                                 ` Vishal Annapurve
  2025-08-05 19:59                                 ` Sean Christopherson
  1 sibling, 0 replies; 42+ messages in thread
From: Vishal Annapurve @ 2025-08-05 14:30 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ira Weiny, Yan Zhao, Michael Roth, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	david, ackerleytng, tabba, chao.p.peng

On Mon, Aug 4, 2025 at 6:20 PM Vishal Annapurve <vannapurve@google.com> wrote:
>
> On Mon, Aug 4, 2025 at 5:22 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > > IIUC, the suggestion in the link is to abandon kvm_gmem_populate().
> > > > > > For TDX, it means adopting the approach in this RFC patch, right?
> > > > > Yes, IMO this RFC is following the right approach as posted.
> >
> > I don't think we want to abandon kvm_gmem_populate().  Unless I'm missing something,
> > SNP has the same AB-BA problem as TDX.  The copy_from_user() on @src can trigger
> > a page fault, and resolving the page fault may require taking mm->mmap_lock.
> >
> > Fundamentally, TDX and SNP are doing the same thing: copying from source to guest
> > memory.  The only differences are in the mechanics of the copy+encrypt, everything
> > else is the same.  I.e. I don't expect that we'll find a magic solution that works
> > well for one and not the other.
> >
> > I also don't want to end up with wildly different ABI for SNP vs. everything else.
> > E.g. cond_resched() needs to be called if the to-be-initialzied range is large,
> > which means dropping mmu_lock between pages, whereas kvm_gmem_populate() can
> > yield without dropping invalidate_lock, which means that the behavior of populating
> > guest_memfd memory will be quite different with respect to guest_memfd operations.
>
> I would think that TDX/CCA VMs [1] will run into the similar behavior
> of needing to simulate stage2 faults i.e. KVM will end up picking up
> and dropping mmu_lock for each page anyways at least for these two
> platforms.
>
> [1] https://lore.kernel.org/kvm/20250611104844.245235-5-steven.price@arm.com/
> (rmi_rtt_create())
>
> >
> > Pulling in the RFC text:
> >
> > : I think the only different scenario is SNP, where the host must write
> > : initial contents to guest memory.
> > :
> > : Will this work for all cases CCA/SNP/TDX during initial memory
> > : population from within KVM:
> > : 1) Simulate stage2 fault
> > : 2) Take a KVM mmu read lock
> >
> > Doing all of this under mmu_lock is pretty much a non-starter.

Looking closer at CPU <-> PSP communication which is not implemented
to work within an atomic context, I agree now that this wouldn't work
for SNP VMs.


> >
> > : 3) Check that the needed gpa is mapped in EPT/NPT entries
> >
> > No, KVM's page tables are not the source of truth.  S-EPT is a special snowflake,
> > and I'd like to avoid foisting the same requirements on NPT.
>
> I agree this would be a new requirement.
>
> >
> > : 4) For SNP, if src != null, make the target pfn to be shared, copy
> > : contents and then make the target pfn back to private.
> >
> > Copying from userspace under spinlock (rwlock) is illegal, as accessing userspace
> > memory might_fault() and thus might_sleep().
>
> I would think that a combination of get_user_pages() and
> kmap_local_pfn() will prevent this situation of might_fault().
>
> >
> > : 5) For TDX, if src != null, pass the same address for source and
> > : target (likely this works for CCA too)
> > : 6) Invoke appropriate memory encryption operations
> > : 7) measure contents
> > : 8) release the KVM mmu read lock
> > :
> > : If this scheme works, ideally we should also not call RMP table
> > : population logic from guest_memfd, but from KVM NPT fault handling
> > : logic directly (a bit of cosmetic change).
> >

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-08-05  1:20                               ` Vishal Annapurve
  2025-08-05 14:30                                 ` Vishal Annapurve
@ 2025-08-05 19:59                                 ` Sean Christopherson
  2025-08-06  0:09                                   ` Vishal Annapurve
  1 sibling, 1 reply; 42+ messages in thread
From: Sean Christopherson @ 2025-08-05 19:59 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Ira Weiny, Yan Zhao, Michael Roth, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	david, ackerleytng, tabba, chao.p.peng

On Mon, Aug 04, 2025, Vishal Annapurve wrote:
> On Mon, Aug 4, 2025 at 5:22 PM Sean Christopherson <seanjc@google.com> wrote:
> > : 4) For SNP, if src != null, make the target pfn to be shared, copy
> > : contents and then make the target pfn back to private.
> >
> > Copying from userspace under spinlock (rwlock) is illegal, as accessing userspace
> > memory might_fault() and thus might_sleep().
> 
> I would think that a combination of get_user_pages() and
> kmap_local_pfn() will prevent this situation of might_fault().

Yes, but if SNP is using get_user_pages(), then it looks an awful lot like the
TDX flow, at which point isn't that an argument for keeping populate()?

> Memory population in my opinion is best solved either by users asserting
> ownership of the memory and writing to it directly or by using guest_memfd
> (to be) exposed APIs to populate memory ranges given a source buffer. IMO
> kvm_gmem_populate() is doing something different than both of these options.

In a perfect world, yes, guest_memfd would provide a clean, well-defined API
without needing a complicated dance between vendor code and guest_memfd.  But,
sadly, the world of CoCo is anything but perfect.  It's not KVM's fault that
every vendor came up with a different CoCo architecture.  I.e. we can't "fix"
the underlying issue of SNP and TDX having significantly different ways for
initializing private memory.

What we can do is shift as much code to common KVM as possible, e.g. to minimize
maintenance costs, reduce boilerplate and/or copy+paste code, provide a consistent
ABI, etc.  Those things always need to be balanced against overall complexity, but
IMO providing a vendor callback doesn't add anywhere near enough complexity to
justify open coding the same concept in every vendor implementation.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-08-05 19:59                                 ` Sean Christopherson
@ 2025-08-06  0:09                                   ` Vishal Annapurve
  0 siblings, 0 replies; 42+ messages in thread
From: Vishal Annapurve @ 2025-08-06  0:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ira Weiny, Yan Zhao, Michael Roth, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	david, ackerleytng, tabba, chao.p.peng

On Tue, Aug 5, 2025 at 12:59 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Aug 04, 2025, Vishal Annapurve wrote:
> > On Mon, Aug 4, 2025 at 5:22 PM Sean Christopherson <seanjc@google.com> wrote:
> > > : 4) For SNP, if src != null, make the target pfn to be shared, copy
> > > : contents and then make the target pfn back to private.
> > >
> > > Copying from userspace under spinlock (rwlock) is illegal, as accessing userspace
> > > memory might_fault() and thus might_sleep().
> >
> > I would think that a combination of get_user_pages() and
> > kmap_local_pfn() will prevent this situation of might_fault().
>
> Yes, but if SNP is using get_user_pages(), then it looks an awful lot like the
> TDX flow, at which point isn't that an argument for keeping populate()?

Ack, I agree we can't ditch kvm_gmem_populate() for SNP VMs. I am ok
with using it for TDX/CCA VMs with the fixes discussed in this RFC.

>
> > Memory population in my opinion is best solved either by users asserting
> > ownership of the memory and writing to it directly or by using guest_memfd
> > (to be) exposed APIs to populate memory ranges given a source buffer. IMO
> > kvm_gmem_populate() is doing something different than both of these options.
>
> In a perfect world, yes, guest_memfd would provide a clean, well-defined API
> without needing a complicated dance between vendor code and guest_memfd.  But,
> sadly, the world of CoCo is anything but perfect.  It's not KVM's fault that
> every vendor came up with a different CoCo architecture.  I.e. we can't "fix"
> the underlying issue of SNP and TDX having significantly different ways for
> initializing private memory.
>
> What we can do is shift as much code to common KVM as possible, e.g. to minimize
> maintenance costs, reduce boilerplate and/or copy+paste code, provide a consistent
> ABI, etc.  Those things always need to be balanced against overall complexity, but
> IMO providing a vendor callback doesn't add anywhere near enough complexity to
> justify open coding the same concept in every vendor implementation.

Ack. My goal was to steer this implementation towards reusing existing
KVM synchronization to protect guest memory population within KVM
vendor logic rather than relying on guest_memfd filemap lock to
provide the needed protection here. That being said, I agree that we
can't solve this problem cleanly in a manner that works for all
architectures.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
  2025-07-11 15:17       ` Michael Roth
  2025-07-11 15:39         ` Sean Christopherson
@ 2025-07-14  3:20         ` Yan Zhao
  1 sibling, 0 replies; 42+ messages in thread
From: Yan Zhao @ 2025-07-14  3:20 UTC (permalink / raw)
  To: Michael Roth
  Cc: Sean Christopherson, pbonzini, kvm, linux-kernel,
	rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
	xiaoyao.li, tony.lindgren, binbin.wu, dmatlack, isaku.yamahata,
	ira.weiny, vannapurve, david, ackerleytng, tabba, chao.p.peng

On Fri, Jul 11, 2025 at 10:17:19AM -0500, Michael Roth wrote:
> On Fri, Jul 11, 2025 at 12:36:24PM +0800, Yan Zhao wrote:
> > On Thu, Jul 10, 2025 at 09:24:13AM -0700, Sean Christopherson wrote:
> > > On Wed, Jul 09, 2025, Michael Roth wrote:
> > > > On Thu, Jul 03, 2025 at 02:26:41PM +0800, Yan Zhao wrote:
> > > > > Rather than invoking kvm_gmem_populate(), allow tdx_vcpu_init_mem_region()
> > > > > to use open code to populate the initial memory region into the mirror page
> > > > > table, and add the region to S-EPT.
> > > > > 
> > > > > Background
> > > > > ===
> > > > > Sean initially suggested TDX to populate initial memory region in a 4-step
> > > > > way [1]. Paolo refactored guest_memfd and introduced kvm_gmem_populate()
> > > > > interface [2] to help TDX populate init memory region.
> > > 
> > > I wouldn't give my suggestion too much weight; I did qualify it with "Crazy idea."
> > > after all :-)
> > > 
> > > > > tdx_vcpu_init_mem_region
> > > > >     guard(mutex)(&kvm->slots_lock)
> > > > >     kvm_gmem_populate
> > > > >         filemap_invalidate_lock(file->f_mapping)
> > > > >             __kvm_gmem_get_pfn      //1. get private PFN
> > > > >             post_populate           //tdx_gmem_post_populate
> > > > >                 get_user_pages_fast //2. get source page
> > > > >                 kvm_tdp_map_page    //3. map private PFN to mirror root
> > > > >                 tdh_mem_page_add    //4. add private PFN to S-EPT and copy
> > > > >                                          source page to it.
> > > > > 
> > > > > kvm_gmem_populate() helps TDX to "get private PFN" in step 1. Its file
> > > > > invalidate lock also helps ensure the private PFN remains valid when
> > > > > tdh_mem_page_add() is invoked in TDX's post_populate hook.
> > > > > 
> > > > > Though TDX does not need the folio prepration code, kvm_gmem_populate()
> > > > > helps on sharing common code between SEV-SNP and TDX.
> > > > > 
> > > > > Problem
> > > > > ===
> > > > > (1)
> > > > > In Michael's series "KVM: gmem: 2MB THP support and preparedness tracking
> > > > > changes" [4], kvm_gmem_get_pfn() was modified to rely on the filemap
> > > > > invalidation lock for protecting its preparedness tracking. Similarly, the
> > > > > in-place conversion version of guest_memfd series by Ackerly also requires
> > > > > kvm_gmem_get_pfn() to acquire filemap invalidation lock [5].
> > > > > 
> > > > > kvm_gmem_get_pfn
> > > > >     filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> > > > > 
> > > > > However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which is
> > > > > in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on the
> > > > > filemap invalidation lock.
> > > > 
> > > > Bringing the prior discussion over to here: it seems wrong that
> > > > kvm_gmem_get_pfn() is getting called within the kvm_gmem_populate()
> > > > chain, because:
> > > > 
> > > > 1) kvm_gmem_populate() is specifically passing the gmem PFN down to
> > > >    tdx_gmem_post_populate(), but we are throwing it away to grab it
> > > >    again kvm_gmem_get_pfn(), which is then creating these locking issues
> > > >    that we are trying to work around. If we could simply pass that PFN down
> > > >    to kvm_tdp_map_page() (or some variant), then we would not trigger any
> > > >    deadlocks in the first place.
> > > 
> > > Yes, doing kvm_mmu_faultin_pfn() in tdx_gmem_post_populate() is a major flaw.
> > > 
> > > > 2) kvm_gmem_populate() is intended for pre-boot population of guest
> > > >    memory, and allows the post_populate callback to handle setting
> > > >    up the architecture-specific preparation, whereas kvm_gmem_get_pfn()
> > > >    calls kvm_arch_gmem_prepare(), which is intended to handle post-boot
> > > >    setup of private memory. Having kvm_gmem_get_pfn() called as part of
> > > >    kvm_gmem_populate() chain brings things 2 things in conflict with
> > > >    each other, and TDX seems to be relying on that fact that it doesn't
> > > >    implement a handler for kvm_arch_gmem_prepare(). 
> > > > 
> > > > I don't think this hurts anything in the current code, and I don't
> > > > personally see any issue with open-coding the population path if it doesn't
> > > > fit TDX very well, but there was some effort put into making
> > > > kvm_gmem_populate() usable for both TDX/SNP, and if the real issue isn't the
> > > > design of the interface itself, but instead just some inflexibility on the
> > > > KVM MMU mapping side, then it seems more robust to address the latter if
> > > > possible.
> > > > 
> > > > Would something like the below be reasonable? 
> > > 
> > > No, polluting the page fault paths is a non-starter for me.  TDX really shouldn't
> > > be synthesizing a page fault when it has the PFN in hand.  And some of the behavior
> > > that's desirable for pre-faults looks flat out wrong for TDX.  E.g. returning '0'
> > > on RET_PF_WRITE_PROTECTED and RET_PF_SPURIOUS (though maybe spurious is fine?).
> > > 
> > > I would much rather special case this path, because it absolutely is a special
> > > snowflake.  This even eliminates several exports of low level helpers that frankly
> > > have no business being used by TDX, e.g. kvm_mmu_reload().
> > > 
> > > ---
> > >  arch/x86/kvm/mmu.h         |  2 +-
> > >  arch/x86/kvm/mmu/mmu.c     | 78 ++++++++++++++++++++++++++++++++++++--
> > >  arch/x86/kvm/mmu/tdp_mmu.c |  1 -
> > >  arch/x86/kvm/vmx/tdx.c     | 24 ++----------
> > >  4 files changed, 78 insertions(+), 27 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > > index b4b6860ab971..9cd7a34333af 100644
> > > --- a/arch/x86/kvm/mmu.h
> > > +++ b/arch/x86/kvm/mmu.h
> > > @@ -258,7 +258,7 @@ extern bool tdp_mmu_enabled;
> > >  #endif
> > >  
> > >  bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
> > > -int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
> > > +int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn);
> > >  
> > >  static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
> > >  {
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 6e838cb6c9e1..bc937f8ed5a0 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -4900,7 +4900,8 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > >  	return direct_page_fault(vcpu, fault);
> > >  }
> > >  
> > > -int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
> > > +static int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa,
> > > +				 u64 error_code, u8 *level)
> > >  {
> > >  	int r;
> > >  
> > > @@ -4942,7 +4943,6 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
> > >  		return -EIO;
> > >  	}
> > >  }
> > > -EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
> > >  
> > >  long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> > >  				    struct kvm_pre_fault_memory *range)
> > > @@ -4978,7 +4978,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> > >  	 * Shadow paging uses GVA for kvm page fault, so restrict to
> > >  	 * two-dimensional paging.
> > >  	 */
> > > -	r = kvm_tdp_map_page(vcpu, range->gpa | direct_bits, error_code, &level);
> > > +	r = kvm_tdp_prefault_page(vcpu, range->gpa | direct_bits, error_code, &level);
> > >  	if (r < 0)
> > >  		return r;
> > >  
> > > @@ -4990,6 +4990,77 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> > >  	return min(range->size, end - range->gpa);
> > >  }
> > >  
> > > +int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> > > +{
> > > +	struct kvm_page_fault fault = {
> > > +		.addr = gfn_to_gpa(gfn),
> > > +		.error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS,
> > > +		.prefetch = true,
> > > +		.is_tdp = true,
> > > +		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),
> > > +
> > > +		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
> > > +		.req_level = PG_LEVEL_4K,
> > kvm_mmu_hugepage_adjust() will replace the PG_LEVEL_4K here to PG_LEVEL_2M,
> > because the private_max_mapping_level hook is only invoked in
> > kvm_mmu_faultin_pfn_gmem().
> > 
> > Updating lpage_info can fix it though.
> > 
> > > +		.goal_level = PG_LEVEL_4K,
> > > +		.is_private = true,
> > > +
> > > +		.gfn = gfn,
> > > +		.slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn),
> > > +		.pfn = pfn,
> > > +		.map_writable = true,
> > > +	};
> > > +	struct kvm *kvm = vcpu->kvm;
> > > +	int r;
> > > +
> > > +	lockdep_assert_held(&kvm->slots_lock);
> > > +
> > > +	if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
> > > +		return -EIO;
> > > +
> > > +	if (kvm_gfn_is_write_tracked(kvm, fault.slot, fault.gfn))
> > > +		return -EPERM;
> > > +
> > > +	r = kvm_mmu_reload(vcpu);
> > > +	if (r)
> > > +		return r;
> > > +
> > > +	r = mmu_topup_memory_caches(vcpu, false);
> > > +	if (r)
> > > +		return r;
> > > +
> > > +	do {
> > > +		if (signal_pending(current))
> > > +			return -EINTR;
> > > +
> > > +		if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu))
> > > +			return -EIO;
> > > +
> > > +		cond_resched();
> > > +
> > > +		guard(read_lock)(&kvm->mmu_lock);
> > > +
> > > +		r = kvm_tdp_mmu_map(vcpu, &fault);
> > > +	} while (r == RET_PF_RETRY);
> > > +
> > > +	if (r != RET_PF_FIXED)
> > > +		return -EIO;
> > > +
> > > +	/*
> > > +	 * The caller is responsible for ensuring that no MMU invalidations can
> > > +	 * occur.  Sanity check that the mapping hasn't been zapped.
> > > +	 */
> > > +	if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
> > > +		cond_resched();
> > > +
> > > +		scoped_guard(read_lock, &kvm->mmu_lock) {
> > > +			if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, fault.addr), kvm))
> > > +				return -EIO;
> > > +		}
> > > +	}
> > > +	return 0;
> > > +}
> > > +EXPORT_SYMBOL_GPL(kvm_tdp_mmu_map_private_pfn);
> > 
> > Besides, it can't address the 2nd AB-BA lock issue as mentioned in the patch
> > log:
> > 
> > Problem
> > ===
> > ...
> > (2)
> > Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> > resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> > - filemap invalidation lock --> mm->mmap_lock
> > 
> > However, in future code, the shared filemap invalidation lock will be held
> > in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> > - mm->mmap_lock --> filemap invalidation lock
> 
> I wouldn't expect kvm_gmem_fault_shared() to trigger for the
> KVM_MEMSLOT_SUPPORTS_GMEM_SHARED case (or whatever we end up naming it).

For the AB-BA issue,

kvm_gmem_fault_shared() can be invoked by

(AB1) the guest_memfd (supporting in-place conversion) used for the init memory
      region.

      When the memory is converted from private to shared during TD's runtime,
      i.e., after the init memory region population stage.
     (though it's unusually to convert GFNs populated from initial memory region
     to shared during TD's runtime, it's still a valid for guests to do so).

(AB2) the guest_memfd (supporting in-place conversion) which is not used for the
      init memory region.


The get_user_pages_fast() can be invoked by
(BA1) the guest_memfd (supporting in-place conversion) used for the init memory
      region, when userspace brings a separate buffer.

(BA2) the guest_memfd (not supporting in-place conversion) used for the init
      memory region.

> There was some discussion during previous guest_memfd upstream call
> (May/June?) about whether to continue using kvm_gmem_populate() (or the
> callback you hand it) to handle initializing memory contents before
> in-place encryption, verses just expecting that userspace will
> initialize the contents directly via mmap() prior to issuing any calls
> that trigger kvm_gmem_populate().
> 
> I was planning on enforcing that the 'src' parameter to
> kvm_gmem_populate() must be NULL for cases where
> KVM_MEMSLOT_SUPPORTS_GMEM_SHARED is set, or otherwise it will return
> -EINVAL, because:
> 
> 1) it avoids this awkward path you mentioned where kvm_gmem_fault_shared()
>    triggers during kvm_gmem_populate()
It still can't.
Forcing src to NULL will remove the above (BA1), however, AB-BA lock issue is
still present between (BA2) and (AB1) (AB2).

Compared to fine-grained annotation to further address this issue, I think
Sean's fix [1] is cleaner. Besides, Vishal's concern [2] that "userspace can
still bring a separate source buffer even with in-place conversion available"
is reasonable.

[1] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@google.com/
[2] https://lore.kernel.org/all/CAGtprH9dCCxK=GwVZTUKCeERQGbYD78-t4xDzQprmwtGxDoZXw@mail.gmail.com/

> 2) it makes no sense to have to have to copy anything from 'src' when we
>    now support in-place update
> 
> For the SNP side, that will require a small API update for
> SNP_LAUNCH_UPDATE that mandates that corresponding 'uaddr' argument is
> ignored/disallowed in favor of in-place initialization from userspace via
> mmap(). Not sure if TDX would need similar API update.
> 
> Would that work on the TDX side as well?
I believe the TDX side will support it as well, but I prefer not to rely on it
to resolve the AB-BA issue.

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2025-08-06  0:09 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-03  6:26 [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate() Yan Zhao
2025-07-03 16:51 ` Vishal Annapurve
2025-07-09 23:21 ` Michael Roth
2025-07-10 16:24   ` Sean Christopherson
2025-07-11  1:41     ` Ira Weiny
2025-07-11 14:21       ` Sean Christopherson
2025-07-11  4:36     ` Yan Zhao
2025-07-11 15:17       ` Michael Roth
2025-07-11 15:39         ` Sean Christopherson
2025-07-11 16:34           ` Michael Roth
2025-07-11 18:38             ` Vishal Annapurve
2025-07-11 19:49               ` Michael Roth
2025-07-11 20:19                 ` Sean Christopherson
2025-07-11 20:25             ` Ira Weiny
2025-07-11 22:56               ` Sean Christopherson
2025-07-11 23:04                 ` Vishal Annapurve
2025-07-14 23:11                   ` Ira Weiny
2025-07-15  0:41                     ` Vishal Annapurve
2025-07-14 23:08                 ` Ira Weiny
2025-07-14 23:12                   ` Sean Christopherson
2025-07-11 18:46           ` Vishal Annapurve
2025-07-12 17:38             ` Vishal Annapurve
2025-07-14  6:15           ` Yan Zhao
2025-07-14 15:46             ` Sean Christopherson
2025-07-14 16:02               ` David Hildenbrand
2025-07-14 16:07                 ` Sean Christopherson
2025-07-15  1:10               ` Yan Zhao
2025-07-18  9:14                 ` Yan Zhao
2025-07-18 15:57                   ` Vishal Annapurve
2025-07-18 18:42                     ` Ira Weiny
2025-07-18 18:59                       ` Vishal Annapurve
2025-07-21 17:46                         ` Ira Weiny
2025-07-28  9:48                     ` Yan Zhao
2025-07-29  0:45                       ` Vishal Annapurve
2025-07-29  1:37                         ` Yan Zhao
2025-07-29 16:33                           ` Ira Weiny
2025-08-05  0:22                             ` Sean Christopherson
2025-08-05  1:20                               ` Vishal Annapurve
2025-08-05 14:30                                 ` Vishal Annapurve
2025-08-05 19:59                                 ` Sean Christopherson
2025-08-06  0:09                                   ` Vishal Annapurve
2025-07-14  3:20         ` Yan Zhao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).