public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
From: Sean Christopherson <seanjc@google.com>
To: David Woodhouse <dwmw2@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Mushahid Hussain <hmushi@amazon.co.uk>,
	 Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	 Jim Mattson <jmattson@google.com>,
	Joerg Roedel <joro@8bytes.org>,
	kvm@vger.kernel.org,  linux-kernel@vger.kernel.org,
	Mingwei Zhang <mizhang@google.com>,
	 Maxim Levitsky <mlevitsk@redhat.com>
Subject: Re: [PATCH v3 2/5] KVM: pfncache: Implement range-based invalidation check for hva_to_pfn_retry()
Date: Tue, 15 Oct 2024 17:04:48 -0700	[thread overview]
Message-ID: <Zw8DINUkJJKDByXE@google.com> (raw)
In-Reply-To: <20240821202814.711673-2-dwmw2@infradead.org>

[-- Attachment #1: Type: text/plain, Size: 6230 bytes --]

On Wed, Aug 21, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> The existing retry loop in hva_to_pfn_retry() is extremely pessimistic.
> If there are any concurrent invalidations running, it's effectively just
> a complex busy wait loop because its local mmu_notifier_retry_cache()
> function will always return true.
> 
> Since multiple invalidations can be running in parallel, this can result
> in a situation where hva_to_pfn_retry() just backs off and keep retrying
> for ever, not making any progress.
> 
> Solve this by being a bit more selective about when to retry. In
> addition to the ->needs_invalidation flag added in a previous commit,
> check before calling hva_to_pfn() if there is any pending invalidation
> (i.e. between invalidate_range_start() and invalidate_range_end()) which
> affects the uHVA of the GPC being updated. This catches the case where
> hva_to_pfn() would return a soon-to-be-stale mapping of a range for which
> the invalidate_range_start() hook had already been called before the uHVA
> was even set in the GPC and the ->needs_invalidation flag set.
> 
> An invalidation which *starts* after the lock is dropped in the loop is
> fine because it will clear the ->needs_revalidation flag and also trigger
> a retry.
> 
> This is slightly more complex than it needs to be; moving the
> invalidation to occur in the invalidate_range_end() hook will make life
> simpler, in a subsequent commit.

I'm pretty sure that's still more complex than it needs to be.  And even if the
gpc implementation isn't any more/less complex, diverging from the page fault
side of things makes KVM overall more complex.

Maybe I'm just not seeing some functionality that needs_invalidation provides,
but I still don't understand why it's necessary. 

Omitting context to keep this short, I think the below is what we can end up at;
see attached (very lightly tested) patches for the full idea.  So long as either
the invalidation side marks the gpc invalid, or the refresh side is guaranteed
to check for an invalidation, I don't see why needs_invalidation is required.

If the cache is invalid, or the uhva is changing, then the event doesn't need to
do anything because refresh() will hva_to_pfn_retry() to re-resolve the pfn.  So
at that point, KVM just needs to ensure either hva_to_pfn_retry() observes the
in-progress range, or the event sees valid==true.

Note, the attached patches include the "wait" idea, but I'm not convinced that's
actually a good idea.  I'll respond more to your last patch.

gfn_to_pfn_cache_invalidate_start():

	...

	/*
	 * Ensure that either each cache sees the to-be-invalidated range and
	 * retries if necessary, or that this task sees the cache's valid flag
	 * and invalidates the cache before completing the mmu_notifier event.
	 * Note, gpc->uhva must be set before gpc->valid, and if gpc->uhva is
	 * modified the cache must be re-validated.  Pairs with the smp_mb() in
	 * hva_to_pfn_retry().
	 */
	smp_mb__before_atomic();

	spin_lock(&kvm->gpc_lock);
	list_for_each_entry(gpc, &kvm->gpc_list, list) {
		if (!gpc->valid || gpc->uhva < start || gpc->uhva >= end)
			continue;

		write_lock_irq(&gpc->lock);

		/*
		 * Verify the cache still needs to be invalidated after
		 * acquiring gpc->lock, to avoid an unnecessary invalidation
		 * in the unlikely scenario the cache became valid with a
		 * different userspace virtual address.
		 */
		if (gpc->valid && gpc->uhva >= start && gpc->uhva < end)
			gpc->valid = false;
		write_unlock_irq(&gpc->lock);
	}
	spin_unlock(&kvm->gpc_lock);

hva_to_pfn_retry():

	do {
		/*
		 * Invalidate the cache prior to dropping gpc->lock, the
		 * gpa=>uhva assets have already been updated and so a check()
		 * from a different task may not fail the gpa/uhva/generation
		 * checks, i.e. must observe valid==false.
		 */
		gpc->valid = false;

		write_unlock_irq(&gpc->lock);

		...

		write_lock_irq(&gpc->lock);

		/*
		 * Other tasks must wait for _this_ refresh to complete before
		 * attempting to refresh.
		 */
		WARN_ON_ONCE(gpc->valid);
		gpc->valid = true;

		/*
		 * Ensure valid is visible before checking if retry is needed.
		 * Pairs with the smp_mb__before_atomic() in
		 * gfn_to_pfn_cache_invalidate().
		 */
		smp_mb();
	} while (gpc_invalidate_retry_hva(gpc, mmu_seq));

> @@ -193,6 +174,29 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
>  
>  		write_unlock_irq(&gpc->lock);
>  
> +		/*
> +		 * Invalidation occurs from the invalidate_range_start() hook,
> +		 * which could already have happened before __kvm_gpc_refresh()
> +		 * (or the previous turn around this loop) took gpc->lock().
> +		 * If so, and if the corresponding invalidate_range_end() hook
> +		 * hasn't happened yet, hva_to_pfn() could return a mapping
> +		 * which is about to be stale and which should not be used. So
> +		 * check if there are any currently-running invalidations which
> +		 * affect the uHVA of this GPC, and retry if there are. Any
> +		 * invalidation which starts after gpc->needs_invalidation is
> +		 * set is fine, because it will clear that flag and trigger a
> +		 * retry. And any invalidation which *completes* by having its
> +		 * invalidate_range_end() hook called immediately prior to this
> +		 * check is also fine, because the page tables are guaranteed
> +		 * to have been changted already, so hva_to_pfn() won't return

s/changted/changed

> +		 * a stale mapping in that case anyway.
> +		 */
> +		while (gpc_invalidations_pending(gpc)) {
> +			cond_resched();
> +			write_lock_irq(&gpc->lock);
> +			continue;

Heh, our testcase obviously isn't as robust as we hope it is.  Or maybe only the
"wait" version was tested.  The "continue" here will continue the
while(gpc_invalidations_pending) loop, not the outer while loop.  That will cause
deadlock due to trying acquiring gpc->lock over and over.

> +		}
> +
>  		/*
>  		 * If the previous iteration "failed" due to an mmu_notifier
>  		 * event, release the pfn and unmap the kernel virtual address
> @@ -233,6 +237,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
>  			goto out_error;
>  		}
>  
> +

Spurious whitespace.

>  		write_lock_irq(&gpc->lock);
>  
>  		/*


[-- Attachment #2: 0001-KVM-pfncache-Snapshot-mmu_invalidate_seq-immediately.patch --]
[-- Type: text/x-diff, Size: 1691 bytes --]

From 4e34801acab175913b6fd5b6c7c4aa1350d5e571 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 15 Oct 2024 16:35:15 -0700
Subject: [PATCH 1/5] KVM: pfncache: Snapshot mmu_invalidate_seq immediately
 before hva_to_pfn()

Grab the snapshot of the mmu_notifier sequence counter immediately before
calling hva_to_pfn() when refreching a gpc, as there's no requirement that
the snapshot be taken while holding gpc->lock; the sequence counter is
completely independent of locking.  This will allow waiting on in-progress
invalidations to complete, instead of actively trying to resolve a pfn
that KVM is guaranteed to discard (because either the invalidation will
still be in-progress, or it will have completed and bumped the sequence
counter).

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/pfncache.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index f0039efb9e1e..4afbc1262e3f 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -172,9 +172,6 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 	gpc->valid = false;
 
 	do {
-		mmu_seq = gpc->kvm->mmu_invalidate_seq;
-		smp_rmb();
-
 		write_unlock_irq(&gpc->lock);
 
 		/*
@@ -197,6 +194,9 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 			cond_resched();
 		}
 
+		mmu_seq = gpc->kvm->mmu_invalidate_seq;
+		smp_rmb();
+
 		/* We always request a writeable mapping */
 		new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL);
 		if (is_error_noslot_pfn(new_pfn))

base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
-- 
2.47.0.rc1.288.g06298d1525-goog


[-- Attachment #3: 0002-KVM-pfncache-Wait-for-in-progress-invalidations-to-c.patch --]
[-- Type: text/x-diff, Size: 1607 bytes --]

From b023d3ada9c856cbfe38839068861e5f2e19489f Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 15 Oct 2024 16:33:58 -0700
Subject: [PATCH 2/5] KVM: pfncache: Wait for in-progress invalidations to
 complete during refresh

When refreshing a gpc, wait for in-progress invalidations to complete
before reading the sequence counter and resolving the pfn.  Resolving the
pfn when there is an in-progress invalidation is worse than pointless, as
the pfn is guaranteed to be discarded, and trying to resolve the pfn could
contended for resources, e.g. mmap_lock, and make the invalidation and
thus refresh take longer to complete.

Suggested-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/pfncache.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 4afbc1262e3f..957f739227ab 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -194,6 +194,18 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 			cond_resched();
 		}
 
+		/*
+		 * Wait for in-progress invalidations to complete if the user
+		 * already being invalidated.  Unlike the page fault path, this
+		 * task _must_ complete the refresh, i.e. there's no value in
+		 * trying to race ahead in the hope that a different task makes
+		 * the cache valid.
+		 */
+		while (READ_ONCE(gpc->kvm->mn_active_invalidate_count)) {
+			if (!cond_resched())
+				cpu_relax();
+		}
+
 		mmu_seq = gpc->kvm->mmu_invalidate_seq;
 		smp_rmb();
 
-- 
2.47.0.rc1.288.g06298d1525-goog


[-- Attachment #4: 0003-KVM-pfncache-Implement-range-based-invalidation-chec.patch --]
[-- Type: text/x-diff, Size: 8693 bytes --]

From 98de4aad0c442958b95ca62e0a9ff1a736b71bb7 Mon Sep 17 00:00:00 2001
From: David Woodhouse <dwmw@amazon.co.uk>
Date: Tue, 15 Oct 2024 12:49:06 -0700
Subject: [PATCH 3/5] KVM: pfncache: Implement range-based invalidation check
 for hva_to_pfn_retry()

The existing retry loop in hva_to_pfn_retry() is extremely pessimistic.
If there are any concurrent invalidations running, it's effectively just
a complex busy wait loop because its local mmu_notifier_retry_cache()
function will always return true.  Even worse, caches are forced to wait
even if the invalidations do not affect the cache's virtual address range.

Since multiple invalidations can be running in parallel, this can result
in a situation where hva_to_pfn_retry() just backs off and keeps retrying
forever, not making any progress.

Mitigate the badness by being a bit more selective about when to retry.
Specifically, retry only if an in-progress invalidation range affects the
cache, i.e. implement range-based invalidation for caches, similar to how
KVM imlements range-based invalidation for page faults.

Note, like the existing range-based invalidation, this approach doesn't
completely prevent false positives since the in-progress range never
shrinks.  E.g. if userspace is spamming invalidations *and* simultaneously
invalidates a cache, the cache will still get stuck until the spam stops.
However, that problem already exists with page faults, and precisely
tracking in-progress ranges would add significant complexity.  Defer
trying to address that issue until it actually becomes a problem, which in
all likelihood will never happen.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c      |  7 ++--
 virt/kvm/pfncache.c      | 75 ++++++++++++++++++++++++++++------------
 3 files changed, 58 insertions(+), 26 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index db567d26f7b9..2c0ed735f0f4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -785,6 +785,8 @@ struct kvm {
 	/* For management / invalidation of gfn_to_pfn_caches */
 	spinlock_t gpc_lock;
 	struct list_head gpc_list;
+	u64 mmu_gpc_invalidate_range_start;
+	u64 mmu_gpc_invalidate_range_end;
 
 	/*
 	 * created_vcpus is protected by kvm->lock, and is incremented
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 05cbb2548d99..b9223ecab2ca 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -775,7 +775,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 */
 	spin_lock(&kvm->mn_invalidate_lock);
 	kvm->mn_active_invalidate_count++;
-	spin_unlock(&kvm->mn_invalidate_lock);
 
 	/*
 	 * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e.
@@ -784,10 +783,10 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 * any given time, and the caches themselves can check for hva overlap,
 	 * i.e. don't need to rely on memslot overlap checks for performance.
 	 * Because this runs without holding mmu_lock, the pfn caches must use
-	 * mn_active_invalidate_count (see above) instead of
-	 * mmu_invalidate_in_progress.
+	 * mn_active_invalidate_count instead of mmu_invalidate_in_progress.
 	 */
 	gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end);
+	spin_unlock(&kvm->mn_invalidate_lock);
 
 	/*
 	 * If one or more memslots were found and thus zapped, notify arch code
@@ -1164,6 +1163,8 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 
 	INIT_LIST_HEAD(&kvm->gpc_list);
 	spin_lock_init(&kvm->gpc_lock);
+	kvm->mmu_gpc_invalidate_range_start = KVM_HVA_ERR_BAD;
+	kvm->mmu_gpc_invalidate_range_end = KVM_HVA_ERR_BAD;
 
 	INIT_LIST_HEAD(&kvm->devices);
 	kvm->max_vcpus = KVM_MAX_VCPUS;
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 957f739227ab..3d5bc9bab3d9 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -27,6 +27,27 @@ void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start,
 {
 	struct gfn_to_pfn_cache *gpc;
 
+	lockdep_assert_held(&kvm->mn_invalidate_lock);
+
+	if (likely(kvm->mn_active_invalidate_count == 1)) {
+		kvm->mmu_gpc_invalidate_range_start = start;
+		kvm->mmu_gpc_invalidate_range_end = end;
+	} else {
+		/*
+		 * Fully tracking multiple concurrent ranges has diminishing
+		 * returns. Keep things simple and just find the minimal range
+		 * which includes the current and new ranges. As there won't be
+		 * enough information to subtract a range after its invalidate
+		 * completes, any ranges invalidated concurrently will
+		 * accumulate and persist until all outstanding invalidates
+		 * complete.
+		 */
+		kvm->mmu_gpc_invalidate_range_start =
+			min(kvm->mmu_gpc_invalidate_range_start, start);
+		kvm->mmu_gpc_invalidate_range_end =
+			max(kvm->mmu_gpc_invalidate_range_end, end);
+	}
+
 	spin_lock(&kvm->gpc_lock);
 	list_for_each_entry(gpc, &kvm->gpc_list, list) {
 		read_lock_irq(&gpc->lock);
@@ -74,6 +95,8 @@ bool kvm_gpc_check(struct gfn_to_pfn_cache *gpc, unsigned long len)
 {
 	struct kvm_memslots *slots = kvm_memslots(gpc->kvm);
 
+	lockdep_assert_held(&gpc->lock);
+
 	if (!gpc->active)
 		return false;
 
@@ -124,21 +147,26 @@ static void gpc_unmap(kvm_pfn_t pfn, void *khva)
 #endif
 }
 
-static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_seq)
+static bool gpc_invalidate_in_progress_hva(struct gfn_to_pfn_cache *gpc)
 {
+	struct kvm *kvm = gpc->kvm;
+
 	/*
-	 * mn_active_invalidate_count acts for all intents and purposes
-	 * like mmu_invalidate_in_progress here; but the latter cannot
-	 * be used here because the invalidation of caches in the
-	 * mmu_notifier event occurs _before_ mmu_invalidate_in_progress
-	 * is elevated.
-	 *
-	 * Note, it does not matter that mn_active_invalidate_count
-	 * is not protected by gpc->lock.  It is guaranteed to
-	 * be elevated before the mmu_notifier acquires gpc->lock, and
-	 * isn't dropped until after mmu_invalidate_seq is updated.
+	 * Ensure the count and range are always read from memory so that
+	 * checking for retry in a loop won't get false negatives on the range,
+	 * and won't result in an infinite retry loop.  Note, the range never
+	 * shrinks, only grows, and so the key to avoiding infinite retry loops
+	 * is observing the 1=>0 transition of the count.
 	 */
-	if (kvm->mn_active_invalidate_count)
+	return unlikely(READ_ONCE(kvm->mn_active_invalidate_count)) &&
+	       READ_ONCE(kvm->mmu_gpc_invalidate_range_start) <= gpc->uhva &&
+	       READ_ONCE(kvm->mmu_gpc_invalidate_range_end) > gpc->uhva;
+}
+
+static bool gpc_invalidate_retry_hva(struct gfn_to_pfn_cache *gpc,
+				     unsigned long mmu_seq)
+{
+	if (gpc_invalidate_in_progress_hva(gpc))
 		return true;
 
 	/*
@@ -149,7 +177,7 @@ static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_s
 	 * new (incremented) value of mmu_invalidate_seq is observed.
 	 */
 	smp_rmb();
-	return kvm->mmu_invalidate_seq != mmu_seq;
+	return gpc->kvm->mmu_invalidate_seq != mmu_seq;
 }
 
 static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
@@ -164,14 +192,15 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 
 	lockdep_assert_held_write(&gpc->lock);
 
-	/*
-	 * Invalidate the cache prior to dropping gpc->lock, the gpa=>uhva
-	 * assets have already been updated and so a concurrent check() from a
-	 * different task may not fail the gpa/uhva/generation checks.
-	 */
-	gpc->valid = false;
-
 	do {
+		/*
+		 * Invalidate the cache prior to dropping gpc->lock, the
+		 * gpa=>uhva assets have already been updated and so a check()
+		 * from a different task may not fail the gpa/uhva/generation
+		 * checks, i.e. must observe valid==false.
+		 */
+		gpc->valid = false;
+
 		write_unlock_irq(&gpc->lock);
 
 		/*
@@ -201,7 +230,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 		 * trying to race ahead in the hope that a different task makes
 		 * the cache valid.
 		 */
-		while (READ_ONCE(gpc->kvm->mn_active_invalidate_count)) {
+		while (gpc_invalidate_in_progress_hva(gpc)) {
 			if (!cond_resched())
 				cpu_relax();
 		}
@@ -236,9 +265,9 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 		 * attempting to refresh.
 		 */
 		WARN_ON_ONCE(gpc->valid);
-	} while (mmu_notifier_retry_cache(gpc->kvm, mmu_seq));
+		gpc->valid = true;
+	} while (gpc_invalidate_retry_hva(gpc, mmu_seq));
 
-	gpc->valid = true;
 	gpc->pfn = new_pfn;
 	gpc->khva = new_khva + offset_in_page(gpc->uhva);
 
-- 
2.47.0.rc1.288.g06298d1525-goog


[-- Attachment #5: 0004-KVM-pfncache-Add-lockless-checking-of-cache-invalida.patch --]
[-- Type: text/x-diff, Size: 3837 bytes --]

From 423be552d22d83a49ce4d0ad7865b8f4e3f1eeb0 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 15 Oct 2024 13:57:54 -0700
Subject: [PATCH 4/5] KVM: pfncache: Add lockless checking of cache
 invalidation events

When handling an mmu_notifier invalidation event, perform the initial check
for overlap with a valid gpc_to_pfn_cache without acquiring the cache's
lock, and instead ensure either the invalidation event observes a valid
cache, or the cache observes the invalidation event.

Locklessly checking gpc->valid and gpc->uhva relies on several existing
invariants:

 - Changing gpc->uhva requires performing a refresh()
 - A cache can be made valid only during refresh()
 - Only one task can execute refresh() at a time
 - Tasks must check() a cache after a successful refresh()
 - check() must hold gpc->lock (usually for read)

The existing invariants means that if KVM observes an invalid gpc, or if
the uhva is unstable, then a refresh() is in-progress or will be performed
prior to consuming the new uhva.  And so KVM simply needs to ensure that
refresh() sees the invalidation and retries, or that the invalidation sees
the fully valid gpc.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/pfncache.c | 51 ++++++++++++++++++++++++++-------------------
 1 file changed, 29 insertions(+), 22 deletions(-)

diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 3d5bc9bab3d9..2163bb6b899c 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -48,32 +48,32 @@ void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start,
 			max(kvm->mmu_gpc_invalidate_range_end, end);
 	}
 
+	/*
+	 * Ensure that either each cache sees the to-be-invalidated range and
+	 * retries if necessary, or that this task sees the cache's valid flag
+	 * and invalidates the cache before completing the mmu_notifier event.
+	 * Note, gpc->uhva must be set before gpc->valid, and if gpc->uhva is
+	 * modified the cache must be re-validated.  Pairs with the smp_mb() in
+	 * hva_to_pfn_retry().
+	 */
+	smp_mb__before_atomic();
+
 	spin_lock(&kvm->gpc_lock);
 	list_for_each_entry(gpc, &kvm->gpc_list, list) {
-		read_lock_irq(&gpc->lock);
-
-		/* Only a single page so no need to care about length */
-		if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) &&
-		    gpc->uhva >= start && gpc->uhva < end) {
-			read_unlock_irq(&gpc->lock);
-
-			/*
-			 * There is a small window here where the cache could
-			 * be modified, and invalidation would no longer be
-			 * necessary. Hence check again whether invalidation
-			 * is still necessary once the write lock has been
-			 * acquired.
-			 */
-
-			write_lock_irq(&gpc->lock);
-			if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) &&
-			    gpc->uhva >= start && gpc->uhva < end)
-				gpc->valid = false;
-			write_unlock_irq(&gpc->lock);
+		if (!gpc->valid || gpc->uhva < start || gpc->uhva >= end)
 			continue;
-		}
 
-		read_unlock_irq(&gpc->lock);
+		write_lock_irq(&gpc->lock);
+
+		/*
+		 * Verify the cache still needs to be invalidated after
+		 * acquiring gpc->lock, to avoid an unnecessary invalidation
+		 * in the unlikely scenario the cache became valid with a
+		 * different userspace virtual address.
+		 */
+		if (gpc->valid && gpc->uhva >= start && gpc->uhva < end)
+			gpc->valid = false;
+		write_unlock_irq(&gpc->lock);
 	}
 	spin_unlock(&kvm->gpc_lock);
 }
@@ -266,6 +266,13 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 		 */
 		WARN_ON_ONCE(gpc->valid);
 		gpc->valid = true;
+
+		/*
+		 * Ensure valid is visible before checking if retry is needed.
+		 * Pairs with the smp_mb__before_atomic() in
+		 * gfn_to_pfn_cache_invalidate().
+		 */
+		smp_mb();
 	} while (gpc_invalidate_retry_hva(gpc, mmu_seq));
 
 	gpc->pfn = new_pfn;
-- 
2.47.0.rc1.288.g06298d1525-goog


[-- Attachment #6: 0005-KVM-pfncache-Wait-for-pending-invalidations-instead-.patch --]
[-- Type: text/x-diff, Size: 3904 bytes --]

From 6df3fbb08269acfbf28b2cf2fab12c1a0e7c52f7 Mon Sep 17 00:00:00 2001
From: David Woodhouse <dwmw@amazon.co.uk>
Date: Tue, 15 Oct 2024 14:55:49 -0700
Subject: [PATCH 5/5] KVM: pfncache: Wait for pending invalidations instead of
 spinning

The busy loop in hva_to_pfn_retry() is worse than a normal page fault
retry loop because it spins even while it's waiting for the invalidation
to complete. It isn't just that a page might get faulted out again before
it's actually accessed.

Introduce a wait queue to be woken when kvm->mn_active_invalidate_count
reaches zero, and wait on it if there is any pending invalidation which
affects the GPC being refreshed.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
[sean: massage comment as part of rebasing]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      |  9 ++++++---
 virt/kvm/pfncache.c      | 30 ++++++++++++++++++++++++++----
 3 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2c0ed735f0f4..a9d7b2200b6f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -787,6 +787,7 @@ struct kvm {
 	struct list_head gpc_list;
 	u64 mmu_gpc_invalidate_range_start;
 	u64 mmu_gpc_invalidate_range_end;
+	wait_queue_head_t gpc_invalidate_wq;
 
 	/*
 	 * created_vcpus is protected by kvm->lock, and is incremented
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b9223ecab2ca..3ba6d109a941 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -849,11 +849,13 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 	spin_unlock(&kvm->mn_invalidate_lock);
 
 	/*
-	 * There can only be one waiter, since the wait happens under
-	 * slots_lock.
+	 * There can only be one memslots waiter, since the wait happens under
+	 * slots_lock, but there can be multiple gpc waiters.
 	 */
-	if (wake)
+	if (wake) {
+		wake_up(&kvm->gpc_invalidate_wq);
 		rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
+	}
 }
 
 static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
@@ -1163,6 +1165,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 
 	INIT_LIST_HEAD(&kvm->gpc_list);
 	spin_lock_init(&kvm->gpc_lock);
+	init_waitqueue_head(&kvm->gpc_invalidate_wq);
 	kvm->mmu_gpc_invalidate_range_start = KVM_HVA_ERR_BAD;
 	kvm->mmu_gpc_invalidate_range_end = KVM_HVA_ERR_BAD;
 
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 2163bb6b899c..77cc5633636a 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -180,6 +180,30 @@ static bool gpc_invalidate_retry_hva(struct gfn_to_pfn_cache *gpc,
 	return gpc->kvm->mmu_invalidate_seq != mmu_seq;
 }
 
+static void gpc_wait_for_invalidations(struct gfn_to_pfn_cache *gpc)
+{
+	struct kvm *kvm = gpc->kvm;
+
+	spin_lock(&kvm->mn_invalidate_lock);
+	if (gpc_invalidate_in_progress_hva(gpc)) {
+		DEFINE_WAIT(wait);
+
+		for (;;) {
+			prepare_to_wait(&kvm->gpc_invalidate_wq, &wait,
+					TASK_UNINTERRUPTIBLE);
+
+			if (!gpc_invalidate_in_progress_hva(gpc))
+				break;
+
+			spin_unlock(&kvm->mn_invalidate_lock);
+			schedule();
+			spin_lock(&kvm->mn_invalidate_lock);
+		}
+		finish_wait(&kvm->gpc_invalidate_wq, &wait);
+	}
+	spin_unlock(&kvm->mn_invalidate_lock);
+}
+
 static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 {
 	/* Note, the new page offset may be different than the old! */
@@ -230,10 +254,8 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 		 * trying to race ahead in the hope that a different task makes
 		 * the cache valid.
 		 */
-		while (gpc_invalidate_in_progress_hva(gpc)) {
-			if (!cond_resched())
-				cpu_relax();
-		}
+		while (gpc_invalidate_in_progress_hva(gpc))
+			gpc_wait_for_invalidations(gpc);
 
 		mmu_seq = gpc->kvm->mmu_invalidate_seq;
 		smp_rmb();
-- 
2.47.0.rc1.288.g06298d1525-goog


  reply	other threads:[~2024-10-16  0:04 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-21 20:28 [PATCH v3 1/5] KVM: pfncache: Add needs_invalidation flag to gfn_to_pfn_cache David Woodhouse
2024-08-21 20:28 ` [PATCH v3 2/5] KVM: pfncache: Implement range-based invalidation check for hva_to_pfn_retry() David Woodhouse
2024-10-16  0:04   ` Sean Christopherson [this message]
2024-08-21 20:28 ` [PATCH v3 3/5] KVM: pfncache: Wait for pending invalidations instead of spinning David Woodhouse
2024-08-21 20:28 ` [PATCH v3 4/5] KVM: pfncache: Move invalidation to invalidate_range_end hook David Woodhouse
2024-10-16  0:17   ` Sean Christopherson
2024-08-21 20:28 ` [PATCH v3 5/5] DO NOT MERGE: Hack-a-test to verify gpc invalidation+refresh David Woodhouse
2024-09-19 11:13 ` [PATCH v3 1/5] KVM: pfncache: Add needs_invalidation flag to gfn_to_pfn_cache David Woodhouse

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zw8DINUkJJKDByXE@google.com \
    --to=seanjc@google.com \
    --cc=dwmw2@infradead.org \
    --cc=hmushi@amazon.co.uk \
    --cc=jmattson@google.com \
    --cc=joro@8bytes.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mizhang@google.com \
    --cc=mlevitsk@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=vkuznets@redhat.com \
    --cc=wanpengli@tencent.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox