Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Sean Christopherson <seanjc@google.com>
To: Ira Weiny <ira.weiny@intel.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>,
	Vishal Annapurve <vannapurve@google.com>,
	 Michael Roth <michael.roth@amd.com>,
	pbonzini@redhat.com, kvm@vger.kernel.org,
	 linux-kernel@vger.kernel.org, rick.p.edgecombe@intel.com,
	kai.huang@intel.com,  adrian.hunter@intel.com,
	reinette.chatre@intel.com, xiaoyao.li@intel.com,
	 tony.lindgren@intel.com, binbin.wu@linux.intel.com,
	dmatlack@google.com,  isaku.yamahata@intel.com, david@redhat.com,
	ackerleytng@google.com,  tabba@google.com, chao.p.peng@intel.com
Subject: Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
Date: Mon, 4 Aug 2025 17:22:15 -0700	[thread overview]
Message-ID: <aJFOt64k2EFjaufd@google.com> (raw)
In-Reply-To: <6888f7e4129b9_ec573294fa@iweiny-mobl.notmuch>

On Tue, Jul 29, 2025, Ira Weiny wrote:
> Yan Zhao wrote:
> > On Mon, Jul 28, 2025 at 05:45:35PM -0700, Vishal Annapurve wrote:
> > > On Mon, Jul 28, 2025 at 2:49 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > On Fri, Jul 18, 2025 at 08:57:10AM -0700, Vishal Annapurve wrote:
> > > > > On Fri, Jul 18, 2025 at 2:15 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > >
> > > > > > On Tue, Jul 15, 2025 at 09:10:42AM +0800, Yan Zhao wrote:
> > > > > > > On Mon, Jul 14, 2025 at 08:46:59AM -0700, Sean Christopherson wrote:
> > > > > > > > > >         folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> > > > > > > > > If max_order > 0 is returned, the next invocation of __kvm_gmem_populate() for
> > > > > > > > > GFN+1 will return is_prepared == true.
> > > > > > > >
> > > > > > > > I don't see any reason to try and make the current code truly work with hugepages.
> > > > > > > > Unless I've misundertood where we stand, the correctness of hugepage support is
> > > > > > > Hmm. I thought your stand was to address the AB-BA lock issue which will be
> > > > > > > introduced by huge pages, so you moved the get_user_pages() from vendor code to
> > > > > > > the common code in guest_memfd :)
> > > > > > >
> > > > > > > > going to depend heavily on the implementation for preparedness.  I.e. trying to
> > > > > > > > make this all work with per-folio granulartiy just isn't possible, no?
> > > > > > > Ah. I understand now. You mean the right implementation of __kvm_gmem_get_pfn()
> > > > > > > should return is_prepared at 4KB granularity rather than per-folio granularity.
> > > > > > >
> > > > > > > So, huge pages still has dependency on the implementation for preparedness.
> > > > > > Looks with [3], is_prepared will not be checked in kvm_gmem_populate().
> > > > > >
> > > > > > > Will you post code [1][2] to fix non-hugepages first? Or can I pull them to use
> > > > > > > as prerequisites for TDX huge page v2?
> > > > > > So, maybe I can use [1][2][3] as the base.
> > > > > >
> > > > > > > [1] https://lore.kernel.org/all/aG_pLUlHdYIZ2luh@google.com/
> > > > > > > [2] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@google.com/
> > > >
> > > > From the PUCK, looks Sean said he'll post [1][2] for 6.18 and Michael will post
> > > > [3] soon.
> > > >
> > > > hi, Sean, is this understanding correct?
> > > >
> > > > > IMO, unless there is any objection to [1], it's un-necessary to
> > > > > maintain kvm_gmem_populate for any arch (even for SNP). All the
> > > > > initial memory population logic needs is the stable pfn for a given
> > > > > gfn, which ideally should be available using the standard mechanisms
> > > > > such as EPT/NPT page table walk within a read KVM mmu lock (This patch
> > > > > already demonstrates it to be working).
> > > > >
> > > > > It will be hard to clean-up this logic once we have all the
> > > > > architectures using this path.
> > > > >
> > > > > [1] https://lore.kernel.org/lkml/CAGtprH8+x5Z=tPz=NcrQM6Dor2AYBu3jiZdo+Lg4NqAk0pUJ3w@mail.gmail.com/
> > > > IIUC, the suggestion in the link is to abandon kvm_gmem_populate().
> > > > For TDX, it means adopting the approach in this RFC patch, right?
> > > Yes, IMO this RFC is following the right approach as posted.

I don't think we want to abandon kvm_gmem_populate().  Unless I'm missing something,
SNP has the same AB-BA problem as TDX.  The copy_from_user() on @src can trigger
a page fault, and resolving the page fault may require taking mm->mmap_lock.

Fundamentally, TDX and SNP are doing the same thing: copying from source to guest
memory.  The only differences are in the mechanics of the copy+encrypt, everything
else is the same.  I.e. I don't expect that we'll find a magic solution that works
well for one and not the other.

I also don't want to end up with wildly different ABI for SNP vs. everything else.
E.g. cond_resched() needs to be called if the to-be-initialzied range is large,
which means dropping mmu_lock between pages, whereas kvm_gmem_populate() can
yield without dropping invalidate_lock, which means that the behavior of populating
guest_memfd memory will be quite different with respect to guest_memfd operations.

Pulling in the RFC text:

: I think the only different scenario is SNP, where the host must write
: initial contents to guest memory.
: 
: Will this work for all cases CCA/SNP/TDX during initial memory
: population from within KVM:
: 1) Simulate stage2 fault
: 2) Take a KVM mmu read lock

Doing all of this under mmu_lock is pretty much a non-starter.

: 3) Check that the needed gpa is mapped in EPT/NPT entries

No, KVM's page tables are not the source of truth.  S-EPT is a special snowflake,
and I'd like to avoid foisting the same requirements on NPT.

: 4) For SNP, if src != null, make the target pfn to be shared, copy
: contents and then make the target pfn back to private.

Copying from userspace under spinlock (rwlock) is illegal, as accessing userspace
memory might_fault() and thus might_sleep().

: 5) For TDX, if src != null, pass the same address for source and
: target (likely this works for CCA too)
: 6) Invoke appropriate memory encryption operations
: 7) measure contents
: 8) release the KVM mmu read lock
: 
: If this scheme works, ideally we should also not call RMP table
: population logic from guest_memfd, but from KVM NPT fault handling
: logic directly (a bit of cosmetic change). 

LOL, that's not a cosmetic change.  It would be a non-trivial ABI change as KVM's
ABI (ignoring S-EPT) is that userspace can delete and recreate memslots at will.

: Ideally any outgoing interaction from guest_memfd to KVM should be only via
  invalidation notifiers.

Why?  It's all KVM code.  I don't see how this is any different than e.g. common
code, arch code, and vendor code all calling into one another.  Artificially
limiting guest_memfd to a super generic interface pretty much defeats the whole
purpose of having KVM provide a backing store.

> > Ira has been investigating this for a while, see if he has any comment.
> 
> So far I have not seen any reason to keep kvm_gmem_populate() either.
> 
> Sean, did yall post the patch you suggested here and I missed it?

No, I have a partially baked patch, but I've been trying to finish up v5 of the
mediated PMU series and haven't had time to focus on this.  Hopefully I'll post
a compile-tested patch later this week.

next prev parent reply	other threads:[~2025-08-05  0:22 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-03  6:26 [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate() Yan Zhao
2025-07-03 16:51 ` Vishal Annapurve
2025-07-09 23:21 ` Michael Roth
2025-07-10 16:24   ` Sean Christopherson
2025-07-11  1:41     ` Ira Weiny
2025-07-11 14:21       ` Sean Christopherson
2025-07-11  4:36     ` Yan Zhao
2025-07-11 15:17       ` Michael Roth
2025-07-11 15:39         ` Sean Christopherson
2025-07-11 16:34           ` Michael Roth
2025-07-11 18:38             ` Vishal Annapurve
2025-07-11 19:49               ` Michael Roth
2025-07-11 20:19                 ` Sean Christopherson
2025-07-11 20:25             ` Ira Weiny
2025-07-11 22:56               ` Sean Christopherson
2025-07-11 23:04                 ` Vishal Annapurve
2025-07-14 23:11                   ` Ira Weiny
2025-07-15  0:41                     ` Vishal Annapurve
2025-07-14 23:08                 ` Ira Weiny
2025-07-14 23:12                   ` Sean Christopherson
2025-07-11 18:46           ` Vishal Annapurve
2025-07-12 17:38             ` Vishal Annapurve
2025-07-14  6:15           ` Yan Zhao
2025-07-14 15:46             ` Sean Christopherson
2025-07-14 16:02               ` David Hildenbrand
2025-07-14 16:07                 ` Sean Christopherson
2025-07-15  1:10               ` Yan Zhao
2025-07-18  9:14                 ` Yan Zhao
2025-07-18 15:57                   ` Vishal Annapurve
2025-07-18 18:42                     ` Ira Weiny
2025-07-18 18:59                       ` Vishal Annapurve
2025-07-21 17:46                         ` Ira Weiny
2025-07-28  9:48                     ` Yan Zhao
2025-07-29  0:45                       ` Vishal Annapurve
2025-07-29  1:37                         ` Yan Zhao
2025-07-29 16:33                           ` Ira Weiny
2025-08-05  0:22                             ` Sean Christopherson [this message]
2025-08-05  1:20                               ` Vishal Annapurve
2025-08-05 14:30                                 ` Vishal Annapurve
2025-08-05 19:59                                 ` Sean Christopherson
2025-08-06  0:09                                   ` Vishal Annapurve
2025-10-02 21:01                                     ` Ira Weiny
2025-10-03  0:15                                       ` Sean Christopherson
2025-10-03 15:31                                         ` Ira Weiny
2025-07-14  3:20         ` Yan Zhao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aJFOt64k2EFjaufd@google.com \
    --to=seanjc@google.com \
    --cc=ackerleytng@google.com \
    --cc=adrian.hunter@intel.com \
    --cc=binbin.wu@linux.intel.com \
    --cc=chao.p.peng@intel.com \
    --cc=david@redhat.com \
    --cc=dmatlack@google.com \
    --cc=ira.weiny@intel.com \
    --cc=isaku.yamahata@intel.com \
    --cc=kai.huang@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=michael.roth@amd.com \
    --cc=pbonzini@redhat.com \
    --cc=reinette.chatre@intel.com \
    --cc=rick.p.edgecombe@intel.com \
    --cc=tabba@google.com \
    --cc=tony.lindgren@intel.com \
    --cc=vannapurve@google.com \
    --cc=xiaoyao.li@intel.com \
    --cc=yan.y.zhao@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.