public inbox for linux-kernel@vger.kernel.org
From: Yan Zhao <yan.y.zhao@intel.com>
To: Sean Christopherson <seanjc@google.com>
Cc: Rick P Edgecombe <rick.p.edgecombe@intel.com>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	Yuan Yao <yuan.yao@intel.com>, Kai Huang <kai.huang@intel.com>,
	"isaku.yamahata@gmail.com" <isaku.yamahata@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"dmatlack@google.com" <dmatlack@google.com>,
	"nik.borisov@suse.com" <nik.borisov@suse.com>
Subject: Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
Date: Fri, 11 Oct 2024 10:06:00 +0800
Message-ID: <ZwiICKN8trllBbZW@yzhao56-desk.sh.intel.com>
In-Reply-To: <ZwgP6nJ-MdDjKEiZ@google.com>

On Thu, Oct 10, 2024 at 10:33:30AM -0700, Sean Christopherson wrote:
> On Thu, Oct 10, 2024, Yan Zhao wrote:
> > On Tue, Oct 08, 2024 at 07:51:13AM -0700, Sean Christopherson wrote:
> > > On Wed, Sep 25, 2024, Yan Zhao wrote:
> > > > On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
> > > > > On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> > > > > > On Fri, Sep 13, 2024, Yan Zhao wrote:
> > > > > > > This is a lock status report of TDX module for current SEAMCALL retry issue
> > > > > > > based on code in TDX module public repo https://github.com/intel/tdx-module.git
> > > > > > > branch TDX_1.5.05.
> > > > > > > 
> > > > > > > TL;DR:
> > > > > > > - tdh_mem_track() can contend with tdh_vp_enter().
> > > > > > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> > > > > > 
> > > > > > The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
> > > > > > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> > > > > > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> > > > > > > trigger the zero-step mitigation if the vCPU that "wins" gets delayed for
> > > > > > whatever reason.
> > > > > > 
> > > > > > > Since FROZEN_SPTE is essentially a bit-spinlock with a reaaaaaly slow slow-path,
> > > > > > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> > > > > > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> > > > > > hits the fault?
> > > > > > 
> > > > > > For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> > > > > > desirable because in many cases, the winning task will install a valid mapping
> > > > > > before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> > > > > > instruction is re-executed.  In the happy case, that provides optimal performance
> > > > > > as KVM doesn't introduce any extra delay/latency.
> > > > > > 
> > > > > > > But for TDX, the math is different as the cost of re-hitting a fault is much,
> > > > > > much higher, especially in light of the zero-step issues.
> > > > > > 
> > > > > > E.g. if the TDP MMU returns a unique error code for the frozen case, and
> > > > > > kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> > > > > > then the TDX EPT violation path can safely retry locally, similar to the do-while
> > > > > > loop in kvm_tdp_map_page().
> > > > > > 
> > > > > > The only part I don't like about this idea is having two "retry" return values,
> > > > > > which creates the potential for bugs due to checking one but not the other.
> > > > > > 
> > > > > > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> > > > > > to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> > > > > > option better even though the out-param is a bit gross, because it makes it more
> > > > > > obvious that the "frozen_spte" is a special case that doesn't need attention for
> > > > > > most paths.
> > > > > Good idea.
> > > > > But could we extend it a bit more to allow TDX's EPT violation handler to also
> > > > > retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?
> > > > I'm asking this because merely avoiding invoking tdh_vp_enter() in vCPUs seeing
> > > > FROZEN_SPTE might not be enough to prevent zero step mitigation.
> > > 
> > > The goal isn't to make it completely impossible for zero-step to fire, it's to
> > > make it so that _if_ zero-step fires, KVM can report the error to userspace without
> > > having to retry, because KVM _knows_ that advancing past the zero-step isn't
> > > something KVM can solve.
> > > 
> > >  : I'm not worried about any performance hit with zero-step, I'm worried about KVM
> > >  : not being able to differentiate between a KVM bug and guest interference.  The
> > >  : goal with a local retry is to make it so that KVM _never_ triggers zero-step,
> > >  : unless there is a bug somewhere.  At that point, if zero-step fires, KVM can
> > >    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >  : report the error to userspace instead of trying to suppress guest activity, and
> > >  : potentially from other KVM tasks too.
> > > 
> > > In other words, for the selftest you crafted, KVM reporting an error to userspace
> > > due to zero-step would be working as intended.  
> > Hmm, but the selftest is an example to show that 6 continuous EPT violations on
> > the same GPA could trigger zero-step.
> > 
> > For an extremely unlucky vCPU, is it still possible to fire zero step when
> > nothing is wrong both in KVM and QEMU?
> > e.g.
> > 
> > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
> > 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total)
> 
> Very technically, this shouldn't be possible.  The only way for there to be
> contention on the leaf SPTE is if some other KVM task installed a SPTE, i.e. the
> 6th attempt should succeed, even if the faulting vCPU wasn't the one to create
> the SPTE.
Hmm, a 7th EPT violation could still occur if the vCPU that sees
try_cmpxchg64() fail returns to the guest faster than the vCPU that
successfully installs the SPTE.
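
To make the counting argument concrete, here is a rough userspace model, not
TDX module code. It assumes, based only on the selftest observation earlier in
this thread, that the mitigation fires once the same GPA has faulted 6 times
in a row. ZERO_STEP_THRESHOLD, struct zero_step_model and both helpers are
hypothetical names for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed threshold, taken from the selftest observation in this thread. */
#define ZERO_STEP_THRESHOLD 6

/* Hypothetical tracker; the real TDX module state is not modeled here. */
struct zero_step_model {
	uint64_t last_gpa;
	unsigned int count;
};

/*
 * Record one EPT violation.  Returns true if the modeled mitigation would
 * fire, i.e. the same GPA has now faulted ZERO_STEP_THRESHOLD times in a row.
 */
static bool model_ept_violation(struct zero_step_model *m, uint64_t gpa)
{
	assert(m);

	if (m->count && m->last_gpa == gpa) {
		m->count++;
	} else {
		m->last_gpa = gpa;
		m->count = 1;
	}
	return m->count >= ZERO_STEP_THRESHOLD;
}

/* Replay n back-to-back violations on one GPA; report whether any fired. */
static bool model_replay(uint64_t gpa, unsigned int n)
{
	struct zero_step_model m = { 0 };
	bool fired = false;

	for (unsigned int i = 0; i < n; i++)
		fired |= model_ept_violation(&m, gpa);
	return fired;
}
```

In this model, the sequence above (one is_private mismatch, five
try_cmpxchg64() failures, plus one extra fault from the vCPU that lost the
race) is 7 violations on one GPA, which is past the assumed threshold even
though nothing is wrong in KVM or QEMU.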

> 
> That said, a few thoughts:
> 
> 1. Where did we end up on the idea of requiring userspace to pre-fault memory?
I didn't follow this question.
Do you want to disallow userspace from pre-faulting memory after TD
finalization, or do you want to suggest that userspace do it?

> 
> 2. The zero-step logic really should have a slightly more conservative threshold.
>    I have a hard time believing that e.g. 10 attempts would create a side channel,
>    but 6 attempts is "fine".
I don't know where the value 6 comes from. :)
We may need to ask.

> 3. This would be a good reason to implement a local retry in kvm_tdp_mmu_map().
>    Yes, I'm being somewhat hypocritical since I'm so against retrying for the
>    S-EPT case, but my objection to retrying for S-EPT is that it _should_ be easy
>    for KVM to guarantee success.
That's reasonable.

But the TDX code still needs to retry on RET_PF_RETRY_FROZEN without
re-entering the guest.

Would it be good for the TDX code to retry whenever it sees RET_PF_RETRY or
RET_PF_RETRY_FROZEN?
We could have tdx_sept_link_private_spt()/tdx_sept_set_private_spte() return
-EBUSY on contention.
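
A minimal userspace sketch of what I mean, not a patch. RET_PF_RETRY_FROZEN is
the hypothetical return code discussed above, and the fake_*() helpers and
handle_ept_violation_local_retry() are stand-ins invented for illustration.
The bound on the number of tries is also an assumption; real code would need
its own exit conditions.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical return codes; RET_PF_RETRY_FROZEN is only a proposal here. */
enum ret_pf {
	RET_PF_FIXED,
	RET_PF_RETRY,
	RET_PF_RETRY_FROZEN,
};

/* Stand-in for S-EPT state: reports contention for the next busy_calls calls. */
struct fake_sept {
	int busy_calls;
};

/* Stand-in for tdx_sept_set_private_spte() returning -EBUSY on contention. */
static int fake_sept_set_private_spte(struct fake_sept *s)
{
	if (s->busy_calls > 0) {
		s->busy_calls--;
		return -EBUSY;
	}
	return 0;
}

/* Stand-in for the fault handler: contention becomes a retry code. */
static enum ret_pf fake_page_fault(struct fake_sept *s)
{
	return fake_sept_set_private_spte(s) == -EBUSY ? RET_PF_RETRY :
							  RET_PF_FIXED;
}

/*
 * Retry on either retry code without "re-entering the guest", giving up
 * after max_tries attempts.  *tries reports how many attempts were made.
 */
static enum ret_pf handle_ept_violation_local_retry(struct fake_sept *s,
						    int max_tries, int *tries)
{
	enum ret_pf ret;

	assert(max_tries > 0);
	*tries = 0;
	do {
		ret = fake_page_fault(s);
		(*tries)++;
	} while ((ret == RET_PF_RETRY || ret == RET_PF_RETRY_FROZEN) &&
		 *tries < max_tries);

	return ret;
}
```

The point is that the caller treats both retry codes identically, so there is
no way to check one and forget the other.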


> 
> E.g. for #3, the below (compile tested only) patch should make it impossible for
> the S-EPT case to fail, as dirty logging isn't (yet) supported and mirror SPTEs
> should never trigger A/D assists, i.e. retry should always succeed.
> 
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 47 ++++++++++++++++++++++++++++++++------
>  1 file changed, 40 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 3b996c1fdaab..e47573a652a9 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1097,6 +1097,18 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
>  static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
>  				   struct kvm_mmu_page *sp, bool shared);
>  
> +static struct kvm_mmu_page *tdp_mmu_realloc_sp(struct kvm_vcpu *vcpu,
> +					       struct kvm_mmu_page *sp)
> +{
> +	if (!sp)
> +		return tdp_mmu_alloc_sp(vcpu);
> +
> +	memset(sp, 0, sizeof(*sp));
> +	memset64(sp->spt, vcpu->arch.mmu_shadow_page_cache.init_value,
> +		 PAGE_SIZE / sizeof(u64));
> +	return sp;
> +}
> +
>  /*
>   * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
>   * page tables and SPTEs to translate the faulting guest physical address.
> @@ -1104,9 +1116,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
>  int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>  	struct kvm_mmu *mmu = vcpu->arch.mmu;
> +	struct kvm_mmu_page *sp = NULL;
>  	struct kvm *kvm = vcpu->kvm;
>  	struct tdp_iter iter;
> -	struct kvm_mmu_page *sp;
>  	int ret = RET_PF_RETRY;
>  
>  	kvm_mmu_hugepage_adjust(vcpu, fault);
> @@ -1116,8 +1128,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  	rcu_read_lock();
>  
>  	tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
> -		int r;
> -
> +		/*
> +		 * Somewhat arbitrarily allow two local retries, e.g. to play
> +		 * nice with the extremely unlikely case that KVM encounters a
> +		 * huge SPTE, an Access-assist, _and_ a subsequent Dirty-assist.
> +		 * Retrying is inexpensive, but if KVM fails to install a SPTE
> +		 * three times, then a fourth attempt is likely futile and it's
> +		 * time to back off.
> +		 */
> +		int r, retry_locally = 2;
> +again:
>  		if (fault->nx_huge_page_workaround_enabled)
>  			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
>  
> @@ -1140,7 +1160,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  		 * The SPTE is either non-present or points to a huge page that
>  		 * needs to be split.
>  		 */
> -		sp = tdp_mmu_alloc_sp(vcpu);
> +		sp = tdp_mmu_realloc_sp(vcpu, sp);
>  		tdp_mmu_init_child_sp(sp, &iter);
>  
>  		sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
> @@ -1151,11 +1171,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  			r = tdp_mmu_link_sp(kvm, &iter, sp, true);
>  
>  		/*
> -		 * Force the guest to retry if installing an upper level SPTE
> -		 * failed, e.g. because a different task modified the SPTE.
> +		 * If installing an upper level SPTE failed, retry the walk
> +		 * locally before forcing the guest to retry.  If the SPTE was
> +		 * modified by a different task, odds are very good the new
> +		 * SPTE is usable as-is.  And if the SPTE was modified by the
> +		 * CPU, e.g. to set A/D bits, then unless KVM gets *extremely*
> +		 * unlucky, the CMPXCHG should succeed the second time around.
>  		 */
>  		if (r) {
> -			tdp_mmu_free_sp(sp);
> +			if (retry_locally--)
> +				goto again;
>  			goto retry;
>  		}
>  
> @@ -1166,6 +1191,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  				track_possible_nx_huge_page(kvm, sp);
>  			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>  		}
> +		sp = NULL;
>  	}
>  
>  	/*
> @@ -1180,6 +1206,13 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  
>  retry:
>  	rcu_read_unlock();
> +
> +	/*
> +	 * Free the previously allocated MMU page if KVM retried locally and
> +	 * ended up not using said page.
> +	 */
> +	if (sp)
> +		tdp_mmu_free_sp(sp);
>  	return ret;
>  }
>  
> 
> base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
> -- 
> 
