Re: [PATCH v3 04/25] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Sean Christopherson <seanjc@google.com>
To: Yan Zhao <yan.y.zhao@intel.com>
Cc: Marc Zyngier <maz@kernel.org>,
	Oliver Upton <oliver.upton@linux.dev>,
	 Tianrui Zhao <zhaotianrui@loongson.cn>,
	Bibo Mao <maobibo@loongson.cn>,
	 Huacai Chen <chenhuacai@kernel.org>,
	Madhavan Srinivasan <maddy@linux.ibm.com>,
	 Anup Patel <anup@brainfault.org>, Paul Walmsley <pjw@kernel.org>,
	 Palmer Dabbelt <palmer@dabbelt.com>,
	Albert Ou <aou@eecs.berkeley.edu>,
	 Christian Borntraeger <borntraeger@linux.ibm.com>,
	Janosch Frank <frankja@linux.ibm.com>,
	 Claudio Imbrenda <imbrenda@linux.ibm.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	 "Kirill A. Shutemov" <kas@kernel.org>,
	linux-arm-kernel@lists.infradead.org,  kvmarm@lists.linux.dev,
	kvm@vger.kernel.org, loongarch@lists.linux.dev,
	 linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	 kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org,
	 x86@kernel.org, linux-coco@lists.linux.dev,
	linux-kernel@vger.kernel.org,  Ira Weiny <ira.weiny@intel.com>,
	Kai Huang <kai.huang@intel.com>,
	 Michael Roth <michael.roth@amd.com>,
	Vishal Annapurve <vannapurve@google.com>,
	 Rick Edgecombe <rick.p.edgecombe@intel.com>,
	Ackerley Tng <ackerleytng@google.com>,
	 Binbin Wu <binbin.wu@linux.intel.com>
Subject: Re: [PATCH v3 04/25] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU
Date: Wed, 22 Oct 2025 11:12:47 -0700	[thread overview]
Message-ID: <aPken0s-0MfdSd5o@google.com> (raw)
In-Reply-To: <aPiQYBoDlUmrQxEw@yzhao56-desk.sh.intel.com>

On Wed, Oct 22, 2025, Yan Zhao wrote:
> On Tue, Oct 21, 2025 at 09:36:52AM -0700, Sean Christopherson wrote:
> > On Tue, Oct 21, 2025, Yan Zhao wrote:
> > > On Thu, Oct 16, 2025 at 05:32:22PM -0700, Sean Christopherson wrote:
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 18d69d48bc55..ba5cca825a7f 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -5014,6 +5014,65 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> > > >  	return min(range->size, end - range->gpa);
> > > >  }
> > > >  
> > > > +int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> > > > +{
> > > > +	struct kvm_page_fault fault = {
> > > > +		.addr = gfn_to_gpa(gfn),
> > > > +		.error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS,
> > > > +		.prefetch = true,
> > > > +		.is_tdp = true,
> > > > +		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),
> > > > +
> > > > +		.max_level = PG_LEVEL_4K,
> > > > +		.req_level = PG_LEVEL_4K,
> > > > +		.goal_level = PG_LEVEL_4K,
> > > > +		.is_private = true,
> > > > +
> > > > +		.gfn = gfn,
> > > > +		.slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn),
> > > > +		.pfn = pfn,
> > > > +		.map_writable = true,
> > > > +	};
> > > > +	struct kvm *kvm = vcpu->kvm;
> > > > +	int r;
> > > > +
> > > > +	lockdep_assert_held(&kvm->slots_lock);
> > > Do we need to assert that filemap_invalidate_lock() is held as well?
> > 
> > Hrm, a lockdep assertion would be nice to have, but it's obviously not strictly
> > necessary, and I'm not sure it's worth the cost.  To safely assert, KVM would need
> Not sure. Maybe just add a comment?
> But even with kvm_assert_gmem_invalidate_lock_held() and
> lockdep_assert_held(&kvm->slots_lock), it seems that
> kvm_tdp_mmu_map_private_pfn() still can't guarantee that the pfn is not stale.

At some point we have to assume correctness.  E.g. one could also argue that
holding every locking in the universe still doesn't ensure the pfn is fresh,
because theoretically guest_memfd could violate the locking scheme.

Aha!  And to further harden and document this code, this API can be gated on
CONFIG_KVM_GUEST_MEMFD=y, as pointed out by the amazing-as-always test bot:

https://lore.kernel.org/all/202510221928.ikBXHGCf-lkp@intel.com

We could go a step further and gate it on CONFIG_KVM_INTEL_TDX=y, but I don't
like that idea as I think it'd would be a net negative in terms of documenation,
compared to checking CONFIG_KVM_GUEST_MEMFD.  And in general I don't want to set
a precedent of ifdef-ing common x86 based on what vendor code _currently_ needs
an API.

> e.g., if hypothetically those locks were released and re-acquired after getting
> the pfn.
> 
> > to first assert that the file refcount is elevated, e.g. to guard against
> > guest_memfd _really_ screwing up and not grabbing a reference to the underlying
> > file.
> > 
> > E.g. it'd have to be something like this:
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 94d7f32a03b6..5d46b2ac0292 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5014,6 +5014,18 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> >         return min(range->size, end - range->gpa);
> >  }
> >  
> > +static void kvm_assert_gmem_invalidate_lock_held(struct kvm_memory_slot *slot)
> > +{
> > +#ifdef CONFIG_PROVE_LOCKING
> > +       if (WARN_ON_ONCE(!kvm_slot_has_gmem(slot)) ||
> > +           WARN_ON_ONCE(!slot->gmem.file) ||
> > +           WARN_ON_ONCE(!file_count(slot->gmem.file)))
> > +               return;
> > +
> > +       lockdep_assert_held(file_inode(&slot->gmem.file)->i_mapping->invalidate_lock));
> 	  lockdep_assert_held(&file_inode(slot->gmem.file)->i_mapping->invalidate_lock);
> > +#endif
> > +}
> > +
> >  int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> >  {
> >         struct kvm_page_fault fault = {
> > @@ -5038,6 +5050,8 @@ int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
> >  
> >         lockdep_assert_held(&kvm->slots_lock);
> >  
> > +       kvm_assert_gmem_invalidate_lock_held(fault.slot);
> > +
> >         if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
> >                 return -EIO;
> > --
> > 
> > Which I suppose isn't that terrible?
> Is it good if we test is_page_fault_stale()? e.g.,

No, because it can only get false positives, e.g. if an mmu_notifier invalidation
on shared, non-guest_memfd memory.  Though a sanity check would be nice to have;
I believe we can simply do:

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c5734ca5c17d..440fd8f80397 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1273,6 +1273,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
        struct kvm_mmu_page *sp;
        int ret = RET_PF_RETRY;
 
+       KVM_MMU_WARN_ON(!root || root->role.invalid);
+
        kvm_mmu_hugepage_adjust(vcpu, fault);
 
        trace_kvm_mmu_spte_requested(fault);

next prev parent reply	other threads:[~2025-10-22 18:12 UTC|newest]

Thread overview: 94+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-17  0:32 [PATCH v3 00/25] KVM: x86/mmu: TDX post-populate cleanups Sean Christopherson
2025-10-17  0:32 ` [PATCH v3 01/25] KVM: Make support for kvm_arch_vcpu_async_ioctl() mandatory Sean Christopherson
2025-10-17  9:12   ` Claudio Imbrenda
2025-10-17  0:32 ` [PATCH v3 02/25] KVM: Rename kvm_arch_vcpu_async_ioctl() to kvm_arch_vcpu_unlocked_ioctl() Sean Christopherson
2025-10-17  9:13   ` Claudio Imbrenda
2025-10-17  0:32 ` [PATCH v3 03/25] KVM: TDX: Drop PROVE_MMU=y sanity check on to-be-populated mappings Sean Christopherson
2025-10-22  3:15   ` Binbin Wu
2025-10-17  0:32 ` [PATCH v3 04/25] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU Sean Christopherson
2025-10-21  0:10   ` Edgecombe, Rick P
2025-10-21  4:06   ` Yan Zhao
2025-10-21 16:36     ` Sean Christopherson
2025-10-22  8:05       ` Yan Zhao
2025-10-22 18:12         ` Sean Christopherson [this message]
2025-10-23  6:48           ` Yan Zhao
2025-10-22  4:53   ` Yan Zhao
2025-10-30  8:34     ` Yan Zhao
2025-11-04 17:57       ` Sean Christopherson
2025-11-05  7:32         ` Yan Zhao
2025-11-05  7:47           ` Yan Zhao
2025-11-05 15:26             ` Sean Christopherson
2025-10-23 10:28   ` Huang, Kai
2025-10-17  0:32 ` [PATCH v3 05/25] Revert "KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU" Sean Christopherson
2025-10-22  5:56   ` Binbin Wu
2025-10-23 10:30   ` Huang, Kai
2025-10-17  0:32 ` [PATCH v3 06/25] KVM: x86/mmu: Rename kvm_tdp_map_page() to kvm_tdp_page_prefault() Sean Christopherson
2025-10-22  5:57   ` Binbin Wu
2025-10-23 10:38   ` Huang, Kai
2025-10-17  0:32 ` [PATCH v3 07/25] KVM: TDX: Drop superfluous page pinning in S-EPT management Sean Christopherson
2025-10-21  0:10   ` Edgecombe, Rick P
2025-10-17  0:32 ` [PATCH v3 08/25] KVM: TDX: Return -EIO, not -EINVAL, on a KVM_BUG_ON() condition Sean Christopherson
2025-10-17  0:32 ` [PATCH v3 09/25] KVM: TDX: Fold tdx_sept_drop_private_spte() into tdx_sept_remove_private_spte() Sean Christopherson
2025-10-23 10:53   ` Huang, Kai
2025-10-23 14:59     ` Sean Christopherson
2025-10-23 22:20       ` Huang, Kai
2025-10-17  0:32 ` [PATCH v3 10/25] KVM: x86/mmu: Drop the return code from kvm_x86_ops.remove_external_spte() Sean Christopherson
2025-10-22  8:46   ` Yan Zhao
2025-10-22 19:08     ` Sean Christopherson
2025-10-17  0:32 ` [PATCH v3 11/25] KVM: TDX: Avoid a double-KVM_BUG_ON() in tdx_sept_zap_private_spte() Sean Christopherson
2025-10-23 22:21   ` Huang, Kai
2025-10-17  0:32 ` [PATCH v3 12/25] KVM: TDX: Use atomic64_dec_return() instead of a poor equivalent Sean Christopherson
2025-10-17  0:32 ` [PATCH v3 13/25] KVM: TDX: Fold tdx_mem_page_record_premap_cnt() into its sole caller Sean Christopherson
2025-10-23 22:32   ` Huang, Kai
2025-10-24  7:21     ` Huang, Kai
2025-10-24  7:38   ` Binbin Wu
2025-10-24 16:33     ` Sean Christopherson
2025-10-27  9:01       ` Binbin Wu
2025-10-28  0:29         ` Sean Christopherson
2025-10-17  0:32 ` [PATCH v3 14/25] KVM: TDX: Bug the VM if extended the initial measurement fails Sean Christopherson
2025-10-21  0:10   ` Edgecombe, Rick P
2025-10-23 17:27     ` Sean Christopherson
2025-10-23 22:48   ` Huang, Kai
2025-10-24 16:35     ` Sean Christopherson
2025-10-27  9:31       ` Yan Zhao
2025-10-17  0:32 ` [PATCH v3 15/25] KVM: TDX: ADD pages to the TD image while populating mirror EPT entries Sean Christopherson
2025-10-24  7:18   ` Huang, Kai
2025-10-17  0:32 ` [PATCH v3 16/25] KVM: TDX: Fold tdx_sept_zap_private_spte() into tdx_sept_remove_private_spte() Sean Christopherson
2025-10-24  9:53   ` Huang, Kai
2025-10-17  0:32 ` [PATCH v3 17/25] KVM: TDX: Combine KVM_BUG_ON + pr_tdx_error() into TDX_BUG_ON() Sean Christopherson
2025-10-17  0:32 ` [PATCH v3 18/25] KVM: TDX: Derive error argument names from the local variable names Sean Christopherson
2025-10-17  0:32 ` [PATCH v3 19/25] KVM: TDX: Assert that mmu_lock is held for write when removing S-EPT entries Sean Christopherson
2025-10-23  7:37   ` Yan Zhao
2025-10-23 15:14     ` Sean Christopherson
2025-10-24 10:05       ` Yan Zhao
2025-10-17  0:32 ` [PATCH v3 20/25] KVM: TDX: Add macro to retry SEAMCALLs when forcing vCPUs out of guest Sean Christopherson
2025-10-24 10:09   ` Huang, Kai
2025-10-27 19:20     ` Sean Christopherson
2025-10-27 22:00       ` Huang, Kai
2025-10-17  0:32 ` [PATCH v3 21/25] KVM: TDX: Add tdx_get_cmd() helper to get and validate sub-ioctl command Sean Christopherson
2025-10-21  0:12   ` Edgecombe, Rick P
2025-10-24 10:11   ` Huang, Kai
2025-10-17  0:32 ` [PATCH v3 22/25] KVM: TDX: Convert INIT_MEM_REGION and INIT_VCPU to "unlocked" vCPU ioctl Sean Christopherson
2025-10-24 10:36   ` Huang, Kai
2025-10-17  0:32 ` [PATCH v3 23/25] KVM: TDX: Use guard() to acquire kvm->lock in tdx_vm_ioctl() Sean Christopherson
2025-10-21  0:10   ` Edgecombe, Rick P
2025-10-21 16:56     ` Sean Christopherson
2025-10-21 19:03       ` Edgecombe, Rick P
2025-10-24 10:36   ` Huang, Kai
2025-10-17  0:32 ` [PATCH v3 24/25] KVM: TDX: Guard VM state transitions with "all" the locks Sean Christopherson
2025-10-24 10:02   ` Yan Zhao
2025-10-24 16:57     ` Sean Christopherson
2025-10-27  9:26       ` Yan Zhao
2025-10-27 17:46         ` Edgecombe, Rick P
2025-10-27 18:10           ` Sean Christopherson
2025-10-28  0:28             ` [PATCH] KVM: TDX: Take MMU lock around tdh_vp_init() Rick Edgecombe
2025-10-28  5:37               ` Yan Zhao
2025-10-29  6:37               ` Binbin Wu
2025-10-28  1:37           ` [PATCH v3 24/25] KVM: TDX: Guard VM state transitions with "all" the locks Yan Zhao
2025-10-28 17:40             ` Edgecombe, Rick P
2025-10-24 10:53   ` Huang, Kai
2025-10-28  0:28   ` Huang, Kai
2025-10-28  0:37     ` Sean Christopherson
2025-10-28  1:01       ` Huang, Kai
2025-10-17  0:32 ` [PATCH v3 25/25] KVM: TDX: Fix list_add corruption during vcpu_load() Sean Christopherson
2025-10-20  8:50   ` Yan Zhao

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:c5734ca5c17 dfblob:440fd8f8039 )
 OR (
bs:"Re: [PATCH v3 04/25] KVM: x86/mmu: Add dedicated API to map guest_memfd pfn into TDP MMU" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aPken0s-0MfdSd5o@google.com \
    --to=seanjc@google.com \
    --cc=ackerleytng@google.com \
    --cc=anup@brainfault.org \
    --cc=aou@eecs.berkeley.edu \
    --cc=binbin.wu@linux.intel.com \
    --cc=borntraeger@linux.ibm.com \
    --cc=chenhuacai@kernel.org \
    --cc=frankja@linux.ibm.com \
    --cc=imbrenda@linux.ibm.com \
    --cc=ira.weiny@intel.com \
    --cc=kai.huang@intel.com \
    --cc=kas@kernel.org \
    --cc=kvm-riscv@lists.infradead.org \
    --cc=kvm@vger.kernel.org \
    --cc=kvmarm@lists.linux.dev \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-coco@lists.linux.dev \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mips@vger.kernel.org \
    --cc=linux-riscv@lists.infradead.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=loongarch@lists.linux.dev \
    --cc=maddy@linux.ibm.com \
    --cc=maobibo@loongson.cn \
    --cc=maz@kernel.org \
    --cc=michael.roth@amd.com \
    --cc=oliver.upton@linux.dev \
    --cc=palmer@dabbelt.com \
    --cc=pbonzini@redhat.com \
    --cc=pjw@kernel.org \
    --cc=rick.p.edgecombe@intel.com \
    --cc=vannapurve@google.com \
    --cc=x86@kernel.org \
    --cc=yan.y.zhao@intel.com \
    --cc=zhaotianrui@loongson.cn \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).