From: Vishal Annapurve <vannapurve@google.com>
To: Yan Zhao <yan.y.zhao@intel.com>
Cc: pbonzini@redhat.com, seanjc@google.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, rick.p.edgecombe@intel.com,
kai.huang@intel.com, adrian.hunter@intel.com,
reinette.chatre@intel.com, xiaoyao.li@intel.com,
tony.lindgren@intel.com, binbin.wu@linux.intel.com,
dmatlack@google.com, isaku.yamahata@intel.com,
ira.weiny@intel.com, michael.roth@amd.com, david@redhat.com,
ackerleytng@google.com, tabba@google.com, chao.p.peng@intel.com
Subject: Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
Date: Thu, 3 Jul 2025 09:51:28 -0700 [thread overview]
Message-ID: <CAGtprH-Hb3B-sG_0ockS++bP==Zyn2f4dvWpwC73+ksVt7YqJg@mail.gmail.com> (raw)
In-Reply-To: <20250703062641.3247-1-yan.y.zhao@intel.com>
On Wed, Jul 2, 2025 at 11:29 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> Rather than invoking kvm_gmem_populate(), allow tdx_vcpu_init_mem_region()
> to open-code populating the initial memory region into the mirror page
> table and adding the region to the S-EPT.
>
> Background
> ===
> Sean initially suggested that TDX populate the initial memory region in a
> 4-step way [1]. Paolo then refactored guest_memfd and introduced the
> kvm_gmem_populate() interface [2] to help TDX populate the init memory
> region.
>
> tdx_vcpu_init_mem_region
>   guard(mutex)(&kvm->slots_lock)
>   kvm_gmem_populate
>     filemap_invalidate_lock(file->f_mapping)
>     __kvm_gmem_get_pfn       //1. get private PFN
>     post_populate            //tdx_gmem_post_populate
>       get_user_pages_fast    //2. get source page
>       kvm_tdp_map_page       //3. map private PFN to mirror root
>       tdh_mem_page_add       //4. add private PFN to S-EPT and copy
>                              //   source page to it.
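>
> For reference, the core loop of kvm_gmem_populate() looks roughly like the
> sketch below (heavily condensed; exact signatures, folio preparation and
> error handling are elided and differ across kernel versions):
>
>   filemap_invalidate_lock(file->f_mapping);
>   for (i = 0; i < npages; i += (1 << max_order)) {
>       /* 1. resolve the private PFN under the invalidate lock */
>       ret = __kvm_gmem_get_pfn(file, slot, gfn + i, &pfn, &max_order);
>       if (ret)
>           break;
>
>       /* 2.-4. happen in the vendor hook, i.e. tdx_gmem_post_populate() */
>       ret = post_populate(kvm, gfn + i, pfn,
>                           src ? src + i * PAGE_SIZE : NULL,
>                           max_order, opaque);
>       if (ret)
>           break;
>   }
>   filemap_invalidate_unlock(file->f_mapping);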
>
> kvm_gmem_populate() helps TDX "get private PFN" in step 1. Its filemap
> invalidation lock also ensures the private PFN remains valid when
> tdh_mem_page_add() is invoked in TDX's post_populate hook.
>
> Though TDX does not need the folio preparation code, kvm_gmem_populate()
> helps share common code between SEV-SNP and TDX.
>
> Problem
> ===
> (1)
> In Michael's series "KVM: gmem: 2MB THP support and preparedness tracking
> changes" [4], kvm_gmem_get_pfn() was modified to rely on the filemap
> invalidation lock for protecting its preparedness tracking. Similarly, the
> in-place conversion version of the guest_memfd series by Ackerley also
> requires kvm_gmem_get_pfn() to acquire the filemap invalidation lock [5].
>
> kvm_gmem_get_pfn
>   filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
>
> However, since kvm_gmem_get_pfn() is called by kvm_tdp_map_page(), which is
> in turn invoked within kvm_gmem_populate() in TDX, a deadlock occurs on the
> filemap invalidation lock.
>
> (2)
> Moreover, in step 2, get_user_pages_fast() may acquire mm->mmap_lock,
> resulting in the following lock sequence in tdx_vcpu_init_mem_region():
> - filemap invalidation lock --> mm->mmap_lock
>
> However, in future code, the shared filemap invalidation lock will be held
> in kvm_gmem_fault_shared() (see [6]), leading to the lock sequence:
> - mm->mmap_lock --> filemap invalidation lock
>
> This creates an AB-BA deadlock issue.
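>
> Putting the two issues together (a simplified picture, assuming the
> locking changes described in [4], [5] and [6]):
>
>   // (1) self-deadlock on the filemap invalidation lock:
>   kvm_gmem_populate
>     filemap_invalidate_lock                // write lock held
>       tdx_gmem_post_populate
>         kvm_tdp_map_page
>           kvm_gmem_get_pfn
>             filemap_invalidate_lock_shared // blocks behind the write lock above
>
>   // (2) AB-BA between mm->mmap_lock and the filemap invalidation lock:
>   tdx_vcpu_init_mem_region:  filemap invalidation lock --> mm->mmap_lock
>   kvm_gmem_fault_shared:     mm->mmap_lock --> filemap invalidation lock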
>
> These two issues are still present in Michael Roth's code [7], [8].
>
> Proposal
> ===
> To prevent the deadlock and the AB-BA issue, this patch enables TDX to
> populate the initial memory region independently of kvm_gmem_populate().
> The revised sequence in tdx_vcpu_init_mem_region() is as follows:
>
> tdx_vcpu_init_mem_region
>   guard(mutex)(&kvm->slots_lock)
>   tdx_init_mem_populate
>     get_user_pages_fast          //1. get source page
>     kvm_tdp_map_page             //2. map private PFN to mirror root
>     read_lock(&kvm->mmu_lock);
>       kvm_tdp_mmu_gpa_is_mapped  //3. check if the gpa is mapped in the
>                                  //   mirror root and return the mapped
>                                  //   private PFN.
>       tdh_mem_page_add           //4. add private PFN to S-EPT and copy
>                                  //   source page to it
>     read_unlock(&kvm->mmu_lock);
>
> The original step 1 "get private PFN" is now integrated in the new step 3
> "check if the gpa is mapped in the mirror root and return the mapped
> private PFN".
>
> With the protection of slots_lock, the read mmu_lock ensures the private
> PFN added by tdh_mem_page_add() is the same one mapped in the mirror page
> table. Additionally, before the TD state becomes TD_STATE_RUNNABLE, the
> only permitted map level is 4KB, preventing any potential merging or
> splitting in the mirror root under the read mmu_lock.
>
> So, this approach should work for TDX. It still follows the spirit of
> Sean's suggestion [1], where mapping the private PFN to the mirror root and
> adding it to the S-EPT with initial content from the source page are
> executed in separate steps.
>
> Discussions
> ===
> The introduction of kvm_gmem_populate() was intended to make it usable by
> both TDX and SEV-SNP [3], which is why Paolo provided the vendor hook
> post_populate for both.
>
> a) TDX keeps using kvm_gmem_populate().
>    Pros: - keep the status quo
>          - share common code between SEV-SNP and TDX, though TDX does not
>            need to prepare folios.
>    Cons: - we need to explore solutions to the locking issues, e.g. the
>            proposal at [11].
>          - the PFN is faulted in twice for each GFN: once in
>            __kvm_gmem_get_pfn() and again in kvm_gmem_get_pfn().
>
> b) Michael suggested introducing some variant of
>    kvm_tdp_map_page()/kvm_mmu_do_page_fault() to avoid invoking
>    kvm_gmem_get_pfn() in the kvm_gmem_populate() path [10].
>    Pros: - TDX can still invoke kvm_gmem_populate() and share common code
>            with SEV-SNP.
>    Cons: - only TDX needs this variant.
>          - does not fix the 2nd (AB-BA) lock issue.
>
> c) The change in this patch.
>    Pros: - greater flexibility; simplifies the implementation for both
>            SEV-SNP and TDX.
>    Cons: - undermines the purpose of sharing common code;
>            kvm_gmem_populate() may end up usable only by SEV-SNP in the
>            future.
>
> Link: https://lore.kernel.org/kvm/Ze-TJh0BBOWm9spT@google.com [1]
> Link: https://lore.kernel.org/lkml/20240404185034.3184582-10-pbonzini@redhat.com [2]
> Link: https://lore.kernel.org/lkml/20240404185034.3184582-1-pbonzini@redhat.com [3]
> Link: https://lore.kernel.org/lkml/20241212063635.712877-4-michael.roth@amd.com [4]
> Link: https://lore.kernel.org/all/b784326e9ccae6a08388f1bf39db70a2204bdc51.1747264138.git.ackerleytng@google.com [5]
> Link: https://lore.kernel.org/all/20250430165655.605595-9-tabba@google.com [6]
> Link: https://github.com/mdroth/linux/commits/mmap-swprot-v10-snp0-wip2 [7]
> Link: https://lore.kernel.org/kvm/20250613005400.3694904-1-michael.roth@amd.com [8]
> Link: https://lore.kernel.org/lkml/20250613151939.z5ztzrtibr6xatql@amd.com [9]
> Link: https://lore.kernel.org/lkml/20250613180418.bo4vqveigxsq2ouu@amd.com [10]
> Link: https://lore.kernel.org/lkml/aErK25Oo5VJna40z@yzhao56-desk.sh.intel.com [11]
>
> Suggested-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
Thanks Yan for revisiting the initial memory population for TDX VMs.
This implementation seems much cleaner to me.
Acked-by: Vishal Annapurve <vannapurve@google.com>
Tested-by: Vishal Annapurve <vannapurve@google.com>