public inbox for kvm@vger.kernel.org
* [PATCH 0/2] KVM: Pre-check mmu_notifier retry on x86
@ 2023-08-25  2:07 Sean Christopherson
  2023-08-25  2:07 ` [PATCH 1/2] KVM: Allow calling mmu_invalidate_retry_hva() without holding mmu_lock Sean Christopherson
  2023-08-25  2:07 ` [PATCH 2/2] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing Sean Christopherson
  0 siblings, 2 replies; 8+ messages in thread
From: Sean Christopherson @ 2023-08-25  2:07 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Yan Zhao

Pre-check for an mmu_notifier retry on x86 to avoid contending mmu_lock,
which is quite problematic on preemptible kernels due to the way x86's TDP
MMU reacts to mmu_lock contention.  If mmu_lock contention is detected when
zapping SPTEs for an mmu_notifier invalidation, the TDP MMU drops mmu_lock
and yields.

The idea behind yielding is to let vCPUs that are trying to fault-in memory
make forward progress while the invalidation is ongoing.  This works
because x86 uses the precise(ish) version of retry which checks for hva
overlap.  At least, it works so long as vCPUs are hitting the region that's
being zapped.

Yielding turns out to be really bad when the vCPU is trying to fault-in a
page that *is* covered by the invalidation, because the vCPU ends up
retrying over and over, which puts mmu_lock under constant contention, and
ultimately causes the invalidation to take much longer due to the zapping
task constantly yielding.  And in the worst case scenario, if the
invalidation is finding SPTEs to zap, every yield will trigger a remote
(*cough* VM-wide) TLB flush.

Sean Christopherson (2):
  KVM: Allow calling mmu_invalidate_retry_hva() without holding mmu_lock
  KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is
    changing

 arch/x86/kvm/mmu/mmu.c   |  3 +++
 include/linux/kvm_host.h | 17 ++++++++++++++---
 2 files changed, 17 insertions(+), 3 deletions(-)


base-commit: fff2e47e6c3b8050ca26656693caa857e3a8b740
-- 
2.42.0.rc2.253.gd59a3bf2b4-goog


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/2] KVM: Allow calling mmu_invalidate_retry_hva() without holding mmu_lock
  2023-08-25  2:07 [PATCH 0/2] KVM: Pre-check mmu_notifier retry on x86 Sean Christopherson
@ 2023-08-25  2:07 ` Sean Christopherson
  2023-09-07 23:30   ` Huang, Kai
  2023-08-25  2:07 ` [PATCH 2/2] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing Sean Christopherson
  1 sibling, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2023-08-25  2:07 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Yan Zhao

Allow checking mmu_invalidate_retry_hva() without holding mmu_lock, even
though mmu_lock must be held to guarantee correctness, i.e. to avoid
false negatives.  Dropping the requirement that mmu_lock be held will
allow pre-checking for retry before acquiring mmu_lock, e.g. to avoid
contending mmu_lock when the guest is accessing a range that is being
invalidated by the host.

Contending mmu_lock can have severe negative side effects for x86's TDP
MMU when running on preemptible kernels, as KVM will yield from the
zapping task (holds mmu_lock for write) when there is lock contention,
and yielding after any SPTEs have been zapped requires a VM-scoped TLB
flush.

Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that calling
mmu_invalidate_retry_hva() in a loop won't put KVM into an infinite loop,
e.g. due to caching the in-progress flag and never seeing it go to '0'.

Force a load of mmu_invalidate_seq as well, even though it isn't strictly
necessary to avoid an infinite loop, as doing so improves the probability
that KVM will detect an invalidation that already completed before
acquiring mmu_lock and bailing anyways.

Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
may generate a load into a register instead of doing a direct comparison
(MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
is a few bytes of code and maaaaybe a cycle or three.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/kvm_host.h | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7418e881c21c..7314138ba5f4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1962,18 +1962,29 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
 					   unsigned long mmu_seq,
 					   unsigned long hva)
 {
-	lockdep_assert_held(&kvm->mmu_lock);
 	/*
 	 * If mmu_invalidate_in_progress is non-zero, then the range maintained
 	 * by kvm_mmu_notifier_invalidate_range_start contains all addresses
 	 * that might be being invalidated. Note that it may include some false
 	 * positives, due to shortcuts when handling concurrent invalidations.
+	 *
+	 * Note the lack of memory barriers!  The caller *must* hold mmu_lock
+	 * to avoid false negatives!  Holding mmu_lock is not mandatory though,
+	 * e.g. to allow pre-checking for an in-progress invalidation to avoid
+	 * contending mmu_lock.  Ensure that the in-progress flag and sequence
+	 * counter are always read from memory, so that checking for retry in
+	 * a loop won't result in an infinite retry loop.  Don't force loads
+	 * for start+end, as the key to avoiding infinite retry loops is
+	 * observing the 1=>0 transition of in-progress, i.e. getting false
+	 * negatives (if mmu_lock isn't held) due to stale start+end values is
+	 * acceptable.
 	 */
-	if (unlikely(kvm->mmu_invalidate_in_progress) &&
+	if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
 	    hva >= kvm->mmu_invalidate_range_start &&
 	    hva < kvm->mmu_invalidate_range_end)
 		return 1;
-	if (kvm->mmu_invalidate_seq != mmu_seq)
+
+	if (READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq)
 		return 1;
 	return 0;
 }
-- 
2.42.0.rc2.253.gd59a3bf2b4-goog



* [PATCH 2/2] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing
  2023-08-25  2:07 [PATCH 0/2] KVM: Pre-check mmu_notifier retry on x86 Sean Christopherson
  2023-08-25  2:07 ` [PATCH 1/2] KVM: Allow calling mmu_invalidate_retry_hva() without holding mmu_lock Sean Christopherson
@ 2023-08-25  2:07 ` Sean Christopherson
  2023-09-06 23:03   ` Huang, Kai
  1 sibling, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2023-08-25  2:07 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Yan Zhao

Retry page faults without acquiring mmu_lock if the resolved hva is covered
by an active invalidation.  Contending for mmu_lock is especially
problematic on preemptible kernels as the mmu_notifier invalidation task
will yield mmu_lock (see rwlock_needbreak()), delay the in-progress
invalidation, and ultimately increase the latency of resolving the page
fault.  And in the worst case scenario, yielding will be accompanied by a
remote TLB flush, e.g. if the invalidation covers a large range of memory
and vCPUs are accessing addresses that were already zapped.

Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
iterators to perform more work before yielding, but that wouldn't solve
the lock contention and would negatively affect scenarios where a vCPU is
trying to fault in an address that is NOT covered by the in-progress
invalidation.

Reported-by: Yan Zhao <yan.y.zhao@intel.com>
Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1a5a1e7d1eb7..8e2e07ed1a1b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4334,6 +4334,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	if (unlikely(!fault->slot))
 		return kvm_handle_noslot_fault(vcpu, fault, access);
 
+	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
+		return RET_PF_RETRY;
+
 	return RET_PF_CONTINUE;
 }
 
-- 
2.42.0.rc2.253.gd59a3bf2b4-goog



* Re: [PATCH 2/2] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing
  2023-08-25  2:07 ` [PATCH 2/2] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing Sean Christopherson
@ 2023-09-06 23:03   ` Huang, Kai
  2023-09-07 14:45     ` Sean Christopherson
  0 siblings, 1 reply; 8+ messages in thread
From: Huang, Kai @ 2023-09-06 23:03 UTC (permalink / raw)
  To: pbonzini@redhat.com, Christopherson,, Sean
  Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Zhao, Yan Y

On Thu, 2023-08-24 at 19:07 -0700, Sean Christopherson wrote:
> Retry page faults without acquiring mmu_lock if the resolved hva is covered
> by an active invalidation.  Contending for mmu_lock is especially
> problematic on preemptible kernels as the mmu_notifier invalidation task
> will yield mmu_lock (see rwlock_needbreak()), delay the in-progress
> invalidation, and ultimately increase the latency of resolving the page
> fault.  And in the worst case scenario, yielding will be accompanied by a
> remote TLB flush, e.g. if the invalidation covers a large range of memory
> and vCPUs are accessing addresses that were already zapped.
> 
> Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
> iterators to perform more work before yielding, but that wouldn't solve
> the lock contention and would negatively affect scenarios where a vCPU is
> trying to fault in an address that is NOT covered by the in-progress
> invalidation.
> 
> Reported-by: Yan Zhao <yan.y.zhao@intel.com>
> Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Acked-by: Kai Huang <kai.huang@intel.com>

Nit below ...

> ---
>  arch/x86/kvm/mmu/mmu.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 1a5a1e7d1eb7..8e2e07ed1a1b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4334,6 +4334,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  	if (unlikely(!fault->slot))
>  		return kvm_handle_noslot_fault(vcpu, fault, access);
>  
> +	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
> +		return RET_PF_RETRY;
> +

... Perhaps a comment saying this is to avoid unnecessary MMU lock contention
would be nice.  Otherwise we have is_page_fault_stale() called later within the
MMU lock.  I suppose people only tend to use git blame when they cannot find
the answer in the code :-)
 
>  	return RET_PF_CONTINUE;
>  }
>  

Btw, currently fault->mmu_seq is set in kvm_faultin_pfn(), which happens after
fast_page_fault().  Conceptually, should we move this to even before
fast_page_fault() because I assume the range zapping should also apply to the
cases that fast_page_fault() handles?




* Re: [PATCH 2/2] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing
  2023-09-06 23:03   ` Huang, Kai
@ 2023-09-07 14:45     ` Sean Christopherson
  2023-09-07 22:34       ` Huang, Kai
  0 siblings, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2023-09-07 14:45 UTC (permalink / raw)
  To: Kai Huang
  Cc: pbonzini@redhat.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Yan Y Zhao

On Wed, Sep 06, 2023, Kai Huang wrote:
> On Thu, 2023-08-24 at 19:07 -0700, Sean Christopherson wrote:
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 1a5a1e7d1eb7..8e2e07ed1a1b 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4334,6 +4334,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> >  	if (unlikely(!fault->slot))
> >  		return kvm_handle_noslot_fault(vcpu, fault, access);
> >  
> > +	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
> > +		return RET_PF_RETRY;
> > +
> 
> ... Perhaps a comment saying this is to avoid unnecessary MMU lock contention
> would be nice.  Otherwise we have is_page_fault_stale() called later within the
> MMU lock.  I suppose people only tend to use git blame when they cannot find
> the answer in the code :-)

Agreed, will add.

> >  	return RET_PF_CONTINUE;
> >  }
> >  
> 
> Btw, currently fault->mmu_seq is set in kvm_faultin_pfn(), which happens after
> fast_page_fault().  Conceptually, should we move this to even before
> fast_page_fault() because I assume the range zapping should also apply to the
> cases that fast_page_fault() handles?

Nope, fast_page_fault() doesn't need to "manually" detect invalidated SPTEs because
it only modifies shadow-present SPTEs and does so with an atomic CMPXCHG.  If a
SPTE is zapped by an mmu_notifier event (or anything else), the CMPXCHG will fail
and fast_page_fault() will see the !PRESENT SPTE on the next retry and bail.


* Re: [PATCH 2/2] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing
  2023-09-07 14:45     ` Sean Christopherson
@ 2023-09-07 22:34       ` Huang, Kai
  0 siblings, 0 replies; 8+ messages in thread
From: Huang, Kai @ 2023-09-07 22:34 UTC (permalink / raw)
  To: Christopherson,, Sean
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com,
	linux-kernel@vger.kernel.org, Zhao, Yan Y

On Thu, 2023-09-07 at 07:45 -0700, Sean Christopherson wrote:
> On Wed, Sep 06, 2023, Kai Huang wrote:
> > On Thu, 2023-08-24 at 19:07 -0700, Sean Christopherson wrote:
> > > ---
> > >  arch/x86/kvm/mmu/mmu.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 1a5a1e7d1eb7..8e2e07ed1a1b 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -4334,6 +4334,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > >  	if (unlikely(!fault->slot))
> > >  		return kvm_handle_noslot_fault(vcpu, fault, access);
> > >  
> > > +	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
> > > +		return RET_PF_RETRY;
> > > +
> > 
> > ... Perhaps a comment saying this is to avoid unnecessary MMU lock contention
> > would be nice.  Otherwise we have is_page_fault_stale() called later within the
> > MMU lock.  I suppose people only tend to use git blame when they cannot find
> > the answer in the code :-)
> 
> Agreed, will add.
> 
> > >  	return RET_PF_CONTINUE;
> > >  }
> > >  
> > 
> > Btw, currently fault->mmu_seq is set in kvm_faultin_pfn(), which happens after
> > fast_page_fault().  Conceptually, should we move this to even before
> > fast_page_fault() because I assume the range zapping should also apply to the
> > cases that fast_page_fault() handles?
> 
> Nope, fast_page_fault() doesn't need to "manually" detect invalidated SPTEs because
> it only modifies shadow-present SPTEs and does so with an atomic CMPXCHG.  If a
> SPTE is zapped by an mmu_notifier event (or anything else), the CMPXCHG will fail
> and fast_page_fault() will see the !PRESENT SPTE on the next retry and bail.

Ah yes.  Thanks.



* Re: [PATCH 1/2] KVM: Allow calling mmu_invalidate_retry_hva() without holding mmu_lock
  2023-08-25  2:07 ` [PATCH 1/2] KVM: Allow calling mmu_invalidate_retry_hva() without holding mmu_lock Sean Christopherson
@ 2023-09-07 23:30   ` Huang, Kai
  2023-09-08 17:36     ` Sean Christopherson
  0 siblings, 1 reply; 8+ messages in thread
From: Huang, Kai @ 2023-09-07 23:30 UTC (permalink / raw)
  To: pbonzini@redhat.com, Christopherson,, Sean
  Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Zhao, Yan Y

On Thu, 2023-08-24 at 19:07 -0700, Sean Christopherson wrote:
> Allow checking mmu_invalidate_retry_hva() without holding mmu_lock, even
> though mmu_lock must be held to guarantee correctness, i.e. to avoid
> false negatives.  Dropping the requirement that mmu_lock be held will
> allow pre-checking for retry before acquiring mmu_lock, e.g. to avoid
> contending mmu_lock when the guest is accessing a range that is being
> invalidated by the host.
> 
> Contending mmu_lock can have severe negative side effects for x86's TDP
> MMU when running on preemptible kernels, as KVM will yield from the
> zapping task (holds mmu_lock for write) when there is lock contention,
> and yielding after any SPTEs have been zapped requires a VM-scoped TLB
> flush.
> 
> Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that calling
> mmu_invalidate_retry_hva() in a loop won't put KVM into an infinite loop,
> e.g. due to caching the in-progress flag and never seeing it go to '0'.
> 
> Force a load of mmu_invalidate_seq as well, even though it isn't strictly
> necessary to avoid an infinite loop, as doing so improves the probability
> that KVM will detect an invalidation that already completed before
> acquiring mmu_lock and bailing anyways.

Without the READ_ONCE() on mmu_invalidate_seq, with patch 2 applied and
mmu_invalidate_retry_hva() inlined, IIUC kvm_faultin_pfn() can look like
this:

	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;		<-- (1)
	smp_rmb();

	...
	READ_ONCE(vcpu->kvm->mmu_invalidate_in_progress);
	...

	if (vcpu->kvm->mmu_invalidate_seq != fault->mmu_seq)	<-- (2)
		...

Perhaps a stupid question :-) Will the compiler believe that both
vcpu->kvm->mmu_invalidate_seq and fault->mmu_seq never change, and thus
eliminate the code in (1) and (2)?  Or are all the barriers in between enough
to prevent the compiler from doing such a thing?

> 
> Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
> may generate a load into a register instead of doing a direct comparison
> (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
> is a few bytes of code and maaaaybe a cycle or three.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  include/linux/kvm_host.h | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7418e881c21c..7314138ba5f4 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1962,18 +1962,29 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>  					   unsigned long mmu_seq,
>  					   unsigned long hva)
>  {
> -	lockdep_assert_held(&kvm->mmu_lock);
>  	/*
>  	 * If mmu_invalidate_in_progress is non-zero, then the range maintained
>  	 * by kvm_mmu_notifier_invalidate_range_start contains all addresses
>  	 * that might be being invalidated. Note that it may include some false
>  	 * positives, due to shortcuts when handing concurrent invalidations.
> +	 *
> +	 * Note the lack of memory barriers!  The caller *must* hold mmu_lock
> +	 * to avoid false negatives!  Holding mmu_lock is not mandatory though,
> +	 * e.g. to allow pre-checking for an in-progress invalidation to avoid
> +	 * contending mmu_lock.  Ensure that the in-progress flag and sequence
> +	 * counter are always read from memory, so that checking for retry in
> +	 * a loop won't result in an infinite retry loop.  Don't force loads
> +	 * for start+end, as the key to avoiding infinite retry loops is
> +	 * observing the 1=>0 transition of in-progress, i.e. getting false
> +	 * negatives (if mmu_lock isn't held) due to stale start+end values is
> +	 * acceptable.
>  	 */
> -	if (unlikely(kvm->mmu_invalidate_in_progress) &&
> +	if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
>  	    hva >= kvm->mmu_invalidate_range_start &&
>  	    hva < kvm->mmu_invalidate_range_end)
>  		return 1;
> -	if (kvm->mmu_invalidate_seq != mmu_seq)
> +
> +	if (READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq)
>  		return 1;
>  	return 0;
>  }

I am not sure how mmu_invalidate_retry_hva() can be called in a loop, so it
looks like all of this is theoretical, but the extra cost should be nearly
zero as you said.

Acked-by: Kai Huang <kai.huang@intel.com>

(Or perhaps patch 1/2 can be just merged to one patch)



* Re: [PATCH 1/2] KVM: Allow calling mmu_invalidate_retry_hva() without holding mmu_lock
  2023-09-07 23:30   ` Huang, Kai
@ 2023-09-08 17:36     ` Sean Christopherson
  0 siblings, 0 replies; 8+ messages in thread
From: Sean Christopherson @ 2023-09-08 17:36 UTC (permalink / raw)
  To: Kai Huang
  Cc: pbonzini@redhat.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Yan Y Zhao

On Thu, Sep 07, 2023, Kai Huang wrote:
> On Thu, 2023-08-24 at 19:07 -0700, Sean Christopherson wrote:
> > Allow checking mmu_invalidate_retry_hva() without holding mmu_lock, even
> > though mmu_lock must be held to guarantee correctness, i.e. to avoid
> > false negatives.  Dropping the requirement that mmu_lock be held will
> > allow pre-checking for retry before acquiring mmu_lock, e.g. to avoid
> > contending mmu_lock when the guest is accessing a range that is being
> > invalidated by the host.
> > 
> > Contending mmu_lock can have severe negative side effects for x86's TDP
> > MMU when running on preemptible kernels, as KVM will yield from the
> > zapping task (holds mmu_lock for write) when there is lock contention,
> > and yielding after any SPTEs have been zapped requires a VM-scoped TLB
> > flush.
> > 
> > Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that calling
> > mmu_invalidate_retry_hva() in a loop won't put KVM into an infinite loop,
> > e.g. due to caching the in-progress flag and never seeing it go to '0'.
> > 
> > Force a load of mmu_invalidate_seq as well, even though it isn't strictly
> > necessary to avoid an infinite loop, as doing so improves the probability
> > that KVM will detect an invalidation that already completed before
> > acquiring mmu_lock and bailing anyways.
> 
> Without the READ_ONCE() on mmu_invalidate_seq, with patch 2 and
> mmu_invalidate_retry_hva() inlined IIUC the kvm_faultin_pfn() can look like
> this:
> 
> 	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;		<-- (1)
> 	smp_rmb();
> 
> 	...
> 	READ_ONCE(vcpu->kvm->mmu_invalidate_in_progress);
> 	...
> 
> 	if (vcpu->kvm->mmu_invalidate_seq != fault->mmu_seq)	<-- (2)
> 		...
> 
> Perhaps a stupid question :-) Will the compiler believe that both
> vcpu->kvm->mmu_invalidate_seq and fault->mmu_seq never change, and thus
> eliminate the code in (1) and (2)?  Or are all the barriers in between
> enough to prevent the compiler

Practically speaking, no, there's far too much going on in __kvm_faultin_pfn().

But, KVM _could_ do the freshness check before __kvm_faultin_pfn() since KVM
has the memslot and thus the host virtual address at that point.  I highly doubt
we'll ever do that, but it's possible.  At that point, there'd be no spinlocks
or other barriers to ensure the load+check wouldn't get elided.  That's still
extremely theoretical though.

(1) can't be eliminated because acquiring mmu_lock provides enough barriers to
prevent the compiler from omitting the load, i.e. the compiler can't omit the
comparison that is done inside the critical section.  (2) can theoretically be
optimized away by the compiler (when called before acquiring mmu_lock), though
it's extremely unlikely since there's sooo much going on between the load and
the check.

> > Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
> > may generate a load into a register instead of doing a direct comparison
> > (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
> > is a few bytes of code and maaaaybe a cycle or three.

...

> > -	if (kvm->mmu_invalidate_seq != mmu_seq)
> > +
> > +	if (READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq)
> >  		return 1;
> >  	return 0;
> >  }
> 
> I am not sure how mmu_invalidate_retry_hva() can be called in a loop so looks
> all those should be theoretical thing, but the extra cost should be nearly empty
> as you said.

It's not currently called in a loop, but it wouldn't be wrong for KVM to do
something like the below instead of fully re-entering the guest, in which case
mmu_invalidate_retry_hva() would be called in a loop, just a big loop :-)

And if KVM checked for freshness before resolving the PFN, it'd be possible
to read and check the sequence in the loop without any barriers that would
guarantee a reload (again, very, very theoretically).

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1e340098d034..c7617991e290 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5725,11 +5725,13 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
        }
 
        if (r == RET_PF_INVALID) {
-               r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
-                                         lower_32_bits(error_code), false,
-                                         &emulation_type);
-               if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
-                       return -EIO;
+               do {
+                       r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
+                                                 lower_32_bits(error_code),
+                                                 false, &emulation_type);
+                       if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
+                               return -EIO;
+               } while (r == RET_PF_RETRY);
        }
 
        if (r < 0)



end of thread, other threads:[~2023-09-08 17:37 UTC | newest]

Thread overview: 8+ messages
2023-08-25  2:07 [PATCH 0/2] KVM: Pre-check mmu_notifier retry on x86 Sean Christopherson
2023-08-25  2:07 ` [PATCH 1/2] KVM: Allow calling mmu_invalidate_retry_hva() without holding mmu_lock Sean Christopherson
2023-09-07 23:30   ` Huang, Kai
2023-09-08 17:36     ` Sean Christopherson
2023-08-25  2:07 ` [PATCH 2/2] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing Sean Christopherson
2023-09-06 23:03   ` Huang, Kai
2023-09-07 14:45     ` Sean Christopherson
2023-09-07 22:34       ` Huang, Kai
