From: Vishal Annapurve <vannapurve@google.com>
To: Yan Zhao <yan.y.zhao@intel.com>
Cc: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, rick.p.edgecombe@intel.com,
kai.huang@intel.com, adrian.hunter@intel.com,
reinette.chatre@intel.com, xiaoyao.li@intel.com,
tony.lindgren@intel.com, binbin.wu@linux.intel.com,
dmatlack@google.com, isaku.yamahata@intel.com,
ira.weiny@intel.com, michael.roth@amd.com, david@redhat.com,
ackerleytng@google.com, tabba@google.com, chao.p.peng@intel.com
Subject: Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
Date: Thu, 3 Jul 2025 09:51:28 -0700 [thread overview]
Message-ID: <CAGtprH-Hb3B-sG_0ockS++bP==Zyn2f4dvWpwC73+ksVt7YqJg@mail.gmail.com> (raw)
In-Reply-To: <20250703062641.3247-1-yan.y.zhao@intel.com>
On Wed, Jul 2, 2025 at 11:29 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> Rather than invoking kvm_gmem_populate(), allow tdx_vcpu_init_mem_region()
> to open-code populating the initial memory region into the mirror page
> table and adding the region to the S-EPT.
>
> Background
> ===
> Sean initially suggested that TDX populate the initial memory region in a
> 4-step way [1]. Paolo then refactored guest_memfd and introduced the
> kvm_gmem_populate() interface [2] to help TDX populate the init memory
> region.
>
> tdx_vcpu_init_mem_region
>   guard(mutex)(&kvm->slots_lock)
>   kvm_gmem_populate
>     filemap_invalidate_lock(file->f_mapping)
>     __kvm_gmem_get_pfn       //1. get private PFN
>     post_populate            //tdx_gmem_post_populate
>       get_user_pages_fast    //2. get source page
>       kvm_tdp_map_page       //3. map private PFN to mirror root
>       tdh_mem_page_add       //4. add private PFN to S-EPT and copy
>                              //   source page to it.
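>
> For reference, the core loop of kvm_gmem_populate() looks roughly like the
> sketch below (heavily condensed; exact signatures, folio preparation and
> error handling are elided and differ across kernel versions):
>
>   filemap_invalidate_lock(file->f_mapping);
>   for (i = 0; i < npages; i += (1 << max_order)) {
>       /* 1. resolve the private PFN under the invalidate lock */
>       ret = __kvm_gmem_get_pfn(file, slot, gfn + i, &pfn, &max_order);
>       if (ret)
>           break;
>
>       /* 2.-4. happen in the vendor hook, i.e. tdx_gmem_post_populate() */
>       ret = post_populate(kvm, gfn + i, pfn,
>                           src ? src + i * PAGE_SIZE : NULL,
>                           max_order, opaque);
>       if (ret)
>           break;
>   }
>   filemap_invalidate_unlock(file->f_mapping);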
>
> kvm_gmem_populate() helps TDX "get private PFN" in step 1. Its filemap
> invalidation lock also ensures the private PFN remains valid when
> tdh_mem_page_add() is invoked in TDX's post_populate hook.
>
> Though TDX does not need the folio preparation code, kvm_gmem_populate()
> helps share common code between SEV-SNP and TDX.
>
> Problem
> ===
> (1)
> In Michael's series "KVM: gmem: 2MB THP support and preparedness tracking
> changes" [4], kvm_gmem_get_pfn() was modified to rely on the filemap
> invalidation lock for protecting its preparedness tracking. Similarly, the
> in-place conversion version of the guest_memfd series by Ackerley also
> requires kvm_gmem_get_pfn() to acquire the filemap invalidation lock [5].
>
> kvm_gmem_get_pfn
>   filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
>
> However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which is
> in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on the
> filemap invalidation lock.
>
> (2)
> Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> - filemap invalidation lock --> mm->mmap_lock
>
> However, in future code, the shared filemap invalidation lock will be held
> in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> - mm->mmap_lock --> filemap invalidation lock
>
> This creates an AB-BA deadlock issue.
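>
> Putting the two issues together (a simplified picture, assuming the
> locking changes described in [4], [5] and [6]):
>
>   // (1) self-deadlock on the filemap invalidation lock:
>   kvm_gmem_populate
>     filemap_invalidate_lock                // write lock held
>       tdx_gmem_post_populate
>         kvm_tdp_map_page
>           kvm_gmem_get_pfn
>             filemap_invalidate_lock_shared // blocks behind the write lock above
>
>   // (2) AB-BA between mm->mmap_lock and the filemap invalidation lock:
>   tdx_vcpu_init_mem_region:  filemap invalidation lock --> mm->mmap_lock
>   kvm_gmem_fault_shared:     mm->mmap_lock --> filemap invalidation lock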
>
> These two issues are still present in Michael Roth's code [7], [8].
>
> Proposal
> ===
> To prevent the deadlock and the AB-BA issue, this patch enables TDX to
> populate the initial memory region independently of kvm_gmem_populate().
> The revised sequence in tdx_vcpu_init_mem_region() is as follows:
>
> tdx_vcpu_init_mem_region
>   guard(mutex)(&kvm->slots_lock)
>   tdx_init_mem_populate
>     get_user_pages_fast          //1. get source page
>     kvm_tdp_map_page             //2. map private PFN to mirror root
>     read_lock(&kvm->mmu_lock);
>       kvm_tdp_mmu_gpa_is_mapped  //3. check if the gpa is mapped in the
>                                  //   mirror root and return the mapped
>                                  //   private PFN.
>       tdh_mem_page_add           //4. add private PFN to S-EPT and copy
>                                  //   source page to it
>     read_unlock(&kvm->mmu_lock);
>
> The original step 1 "get private PFN" is now integrated in the new step 3
> "check if the gpa is mapped in the mirror root and return the mapped
> private PFN".
>
> With the protection of slots_lock, the read mmu_lock ensures the private
> PFN added by tdh_mem_page_add() is the same one mapped in the mirror page
> table. Additionally, before the TD state becomes TD_STATE_RUNNABLE, the
> only permitted map level is 4KB, preventing any potential merging or
> splitting in the mirror root under the read mmu_lock.
>
> So, this approach should work for TDX. It still follows the spirit of
> Sean's suggestion [1], where mapping the private PFN to the mirror root and
> adding it to the S-EPT with initial content from the source page are
> executed in separate steps.
>
> Discussions
> ===
> The introduction of kvm_gmem_populate() was intended to make it usable by
> both TDX and SEV-SNP [3], which is why Paolo provided the vendor hook
> post_populate for both.
>
> a) TDX keeps using kvm_gmem_populate().
>    Pros: - keep the status quo
>          - share common code between SEV-SNP and TDX, though TDX does not
>            need to prepare folios.
>    Cons: - we need to explore solutions to the locking issues, e.g. the
>            proposal at [11].
>          - the PFN is faulted in twice for each GFN: once in
>            __kvm_gmem_get_pfn() and again in kvm_gmem_get_pfn().
>
> b) Michael suggested introducing some variant of
>    kvm_tdp_map_page()/kvm_mmu_do_page_fault() to avoid invoking
>    kvm_gmem_get_pfn() in the kvm_gmem_populate() path [10].
>    Pros: - TDX can still invoke kvm_gmem_populate() and share common code
>            with SEV-SNP.
>    Cons: - only TDX needs this variant.
>          - does not fix the 2nd (AB-BA) lock issue.
>
> c) The change in this patch.
>    Pros: - greater flexibility; simplifies the implementation for both
>            SEV-SNP and TDX.
>    Cons: - undermines the purpose of sharing common code;
>            kvm_gmem_populate() may end up usable only by SEV-SNP in the
>            future.
>
> Link: https://lore.kernel.org/kvm/Ze-TJh0BBOWm9spT@google.com [1]
> Link: https://lore.kernel.org/lkml/20240404185034.3184582-10-pbonzini@redhat.com [2]
> Link: https://lore.kernel.org/lkml/20240404185034.3184582-1-pbonzini@redhat.com [3]
> Link: https://lore.kernel.org/lkml/20241212063635.712877-4-michael.roth@amd.com [4]
> Link: https://lore.kernel.org/all/b784326e9ccae6a08388f1bf39db70a2204bdc51.1747264138.git.ackerleytng@google.com [5]
> Link: https://lore.kernel.org/all/20250430165655.605595-9-tabba@google.com [6]
> Link: https://github.com/mdroth/linux/commits/mmap-swprot-v10-snp0-wip2 [7]
> Link: https://lore.kernel.org/kvm/20250613005400.3694904-1-michael.roth@amd.com [8]
> Link: https://lore.kernel.org/lkml/20250613151939.z5ztzrtibr6xatql@amd.com [9]
> Link: https://lore.kernel.org/lkml/20250613180418.bo4vqveigxsq2ouu@amd.com [10]
> Link: https://lore.kernel.org/lkml/aErK25Oo5VJna40z@yzhao56-desk.sh.intel.com [11]
>
> Suggested-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
Thanks Yan for revisiting the initial memory population for TDX VMs.
This implementation seems much cleaner to me.
Acked-by: Vishal Annapurve <vannapurve@google.com>
Tested-by: Vishal Annapurve <vannapurve@google.com>