From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
To: "pbonzini@redhat.com" <pbonzini@redhat.com>,
	"seanjc@google.com" <seanjc@google.com>
Cc: "Huang, Kai" <kai.huang@intel.com>,
	"ackerleytng@google.com" <ackerleytng@google.com>,
	"Annapurve, Vishal" <vannapurve@google.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Zhao, Yan Y" <yan.y.zhao@intel.com>,
	"Weiny, Ira" <ira.weiny@intel.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"michael.roth@amd.com" <michael.roth@amd.com>
Subject: Re: [RFC PATCH v2 13/18] KVM: TDX: ADD pages to the TD image while populating mirror EPT entries
Date: Fri, 29 Aug 2025 23:42:13 +0000	[thread overview]
Message-ID: <17ef5f493d5ef6d76c4dc9ca19f1d4d7fe4c73f9.camel@intel.com> (raw)
In-Reply-To: <20250829000618.351013-14-seanjc@google.com>

On Thu, 2025-08-28 at 17:06 -0700, Sean Christopherson wrote:
> When populating the initial memory image for a TDX guest, ADD pages to the
> TD as part of establishing the mappings in the mirror EPT, as opposed to
> creating the mappings and then doing ADD after the fact.  Doing ADD in the
> S-EPT callbacks eliminates the need to track "premapped" pages, as the
> mirror EPT (M-EPT) and S-EPT are always synchronized, e.g. if ADD fails,
> KVM reverts to the previous M-EPT entry (guaranteed to be !PRESENT).
> 
> Eliminating the hole where the M-EPT can have a mapping that doesn't exist
> in the S-EPT in turn obviates the need to handle errors that are unique to
> encountering a missing S-EPT entry (see tdx_is_sept_zap_err_due_to_premap()).
> 
> Keeping the M-EPT and S-EPT synchronized also eliminates the need to check
> for unconsumed "premap" entries during tdx_td_finalize(), as there simply
> can't be any such entries.  Dropping that check in particular reduces the
> overall cognitive load, as the management of nr_premapped with respect
> to removal of S-EPT is _very_ subtle.  E.g. successful removal of an S-EPT
> entry after it completed ADD doesn't adjust nr_premapped, but it's not
> clear why that's "ok" but having half-baked entries is not (it's not truly
> "ok" in that removing pages from the image will likely prevent the guest
> from booting, but from KVM's perspective it's "ok").
> 
> Doing ADD in the S-EPT path requires passing an argument via a scratch
> field, but the current approach of tracking the number of "premapped"
> pages effectively does the same.  And the "premapped" counter is much more
> dangerous, as it doesn't have a singular lock to protect its usage, since
> nr_premapped can be modified as soon as mmu_lock is dropped, at least in
> theory.  I.e. nr_premapped is guarded by slots_lock, but only for "happy"
> paths.
> 
> Note, this approach was used/tried at various points in TDX development,
> but was ultimately discarded due to a desire to avoid stashing temporary
> state in kvm_tdx.  But as above, KVM ended up with such state anyways,
> and fully committing to using temporary state provides better access
> rules (100% guarded by slots_lock), and makes several edge cases flat out
> impossible.
> 
> Note #2, continue to extend the measurement outside of mmu_lock, as it's
> a slow operation (typically 16 SEAMCALLs per page whose data is included
> in the measurement), and doesn't *need* to be done under mmu_lock, e.g.
> for consistency purposes.  However, MR.EXTEND isn't _that_ slow, e.g.
> ~1ms latency to measure a full page, so if it needs to be done under
> mmu_lock in the future, e.g. because KVM gains a flow that can remove
> S-EPT entries uring KVM_TDX_INIT_MEM_REGION, then extending the
                ^using
> measurement can also be moved into the S-EPT mapping path (again, only if
> absolutely necessary).  P.S. _If_ MR.EXTEND is moved into the S-EPT path,
> take care not to return an error up the stack if TDH_MR_EXTEND fails, as
> removing the M-EPT entry but not the S-EPT entry would result in
> inconsistent state!
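
(Aside, for anyone wondering where the "16 SEAMCALLs per page" figure comes
from: assuming the usual 256-byte TDH.MR.EXTEND chunk size, the arithmetic is

	4096 bytes per 4KiB page / 256 bytes per TDH.MR.EXTEND = 16 SEAMCALLs

which matches the tdh_mr_extend() loop later in the diff.)
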
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

But there are some possibly unintended changes below.

>  arch/x86/kvm/vmx/tdx.c | 116 ++++++++++++++---------------------------
>  arch/x86/kvm/vmx/tdx.h |   8 ++-
>  2 files changed, 46 insertions(+), 78 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index bc92e87a1dbb..00c3dc376690 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1586,6 +1586,32 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
>  	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
>  }
>  
> +static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +			    kvm_pfn_t pfn)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	u64 err, entry, level_state;
> +	gpa_t gpa = gfn_to_gpa(gfn);
> +
> +	lockdep_assert_held(&kvm->slots_lock);
> +
> +	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm) ||
> +	    KVM_BUG_ON(!kvm_tdx->page_add_src, kvm))
> +		return -EIO;
> +
> +	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
> +			       kvm_tdx->page_add_src, &entry, &level_state);
> +	if (unlikely(tdx_operand_busy(err)))
> +		return -EBUSY;
> +
> +	if (KVM_BUG_ON(err, kvm)) {
> +		pr_tdx_error_2(TDH_MEM_PAGE_ADD, err, entry, level_state);
> +		return -EIO;
> +	}
> +
> +	return 0;
> +}
> +
>  static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
>  			    enum pg_level level, kvm_pfn_t pfn)
>  {
> @@ -1627,19 +1653,10 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>  
>  	/*
>  	 * If the TD isn't finalized/runnable, then userspace is initializing
> -	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Increment the number of
> -	 * pages that need to be mapped and initialized via TDH.MEM.PAGE.ADD.
> -	 * KVM_TDX_FINALIZE_VM checks the counter to ensure all mapped pages
> -	 * have been added to the image, to prevent running the TD with a
> -	 * valid mapping in the mirror EPT, but not in the S-EPT.
> +	 * the VM image via KVM_TDX_INIT_MEM_REGION; ADD the page to the TD.
>  	 */
> -	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) {
> -		if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
> -			return -EIO;
> -
> -		atomic64_inc(&kvm_tdx->nr_premapped);
> -		return 0;
> -	}
> +	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE))
> +		return tdx_mem_page_add(kvm, gfn, level, pfn);
>  
>  	return tdx_mem_page_aug(kvm, gfn, level, pfn);
>  }
> @@ -1665,39 +1682,6 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
>  	return 0;
>  }
>  
> -/*
> - * Check if the error returned from a SEPT zap SEAMCALL is due to that a page is
> - * mapped by KVM_TDX_INIT_MEM_REGION without tdh_mem_page_add() being called
> - * successfully.
> - *
> - * Since tdh_mem_sept_add() must have been invoked successfully before a
> - * non-leaf entry present in the mirrored page table, the SEPT ZAP related
> - * SEAMCALLs should not encounter err TDX_EPT_WALK_FAILED. They should instead
> - * find TDX_EPT_ENTRY_STATE_INCORRECT due to an empty leaf entry found in the
> - * SEPT.
> - *
> - * Further check if the returned entry from SEPT walking is with RWX permissions
> - * to filter out anything unexpected.
> - *
> - * Note: @level is pg_level, not the tdx_level. The tdx_level extracted from
> - * level_state returned from a SEAMCALL error is the same as that passed into
> - * the SEAMCALL.
> - */
> -static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err,
> -					     u64 entry, int level)
> -{
> -	if (!err || kvm_tdx->state == TD_STATE_RUNNABLE)
> -		return false;
> -
> -	if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))
> -		return false;
> -
> -	if ((is_last_spte(entry, level) && (entry & VMX_EPT_RWX_MASK)))
> -		return false;
> -
> -	return true;
> -}
> -
>  static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>  				     enum pg_level level, struct page *page)
>  {
> @@ -1717,12 +1701,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>  		err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
>  		tdx_no_vcpus_enter_stop(kvm);
>  	}
> -	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
> -		if (KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm))
> -			return -EIO;
> -
> -		return 0;
> -	}
>  
>  	if (KVM_BUG_ON(err, kvm)) {
>  		pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
> @@ -2827,12 +2805,6 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>  
>  	if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE)
>  		return -EINVAL;
> -	/*
> -	 * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
> -	 * TDH.MEM.PAGE.ADD().
> -	 */
> -	if (atomic64_read(&kvm_tdx->nr_premapped))
> -		return -EINVAL;
>  
>  	cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td);
>  	if (tdx_operand_busy(cmd->hw_error))
> @@ -3116,11 +3088,14 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  {
>  	struct tdx_gmem_post_populate_arg *arg = _arg;
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> -	u64 err, entry, level_state;
>  	gpa_t gpa = gfn_to_gpa(gfn);
> +	u64 err, entry, level_state;

Fine, but why the reorder?

>  	struct page *src_page;
>  	int ret, i;
>  
> +	if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
> +		return -EIO;
> +
>  	/*
>  	 * Get the source page if it has been faulted in. Return failure if the
>  	 * source page has been swapped out or unmapped in primary memory.
> @@ -3131,22 +3106,14 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  	if (ret != 1)
>  		return -ENOMEM;
>  
> +	kvm_tdx->page_add_src = src_page;
>  	ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
> -	if (ret < 0)
> -		goto out;
> +	kvm_tdx->page_add_src = NULL;
>  
> -	ret = 0;
> -	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
> -			       src_page, &entry, &level_state);
> -	if (err) {
> -		ret = unlikely(tdx_operand_busy(err)) ? -EBUSY : -EIO;
> -		goto out;
> -	}
> +	put_page(src_page);
>  
> -	KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm);
> -
> -	if (!(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION))
> -		goto out;
> +	if (ret || !(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION))
> +		return ret;
>  
>  	/*
>  	 * Note, MR.EXTEND can fail if the S-EPT mapping is somehow removed
> @@ -3159,14 +3126,11 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  		err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state);
>  		if (KVM_BUG_ON(err, kvm)) {
>  			pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state);
> -			ret = -EIO;
> -			goto out;
> +			return -EIO;
>  		}
>  	}
>  
> -out:
> -	put_page(src_page);
> -	return ret;
> +	return 0;
>  }
>  
>  static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index ca39a9391db1..1b00adbbaf77 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -36,8 +36,12 @@ struct kvm_tdx {
>  
>  	struct tdx_td td;
>  
> -	/* For KVM_TDX_INIT_MEM_REGION. */
> -	atomic64_t nr_premapped;
> +	/*
> +	 * Scratch pointer used to pass the source page to tdx_mem_page_add.
> +	 * Protected by slots_lock, and non-NULL only when mapping a private
> +	 * pfn via tdx_gmem_post_populate().
> +	 */
> +	struct page *page_add_src;
>  
>  	/*
>  	 * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do

