* [PATCH v2 0/8] KVM: Fix mmu_notifier vs. pfncache vs. pfncache races
@ 2022-04-27 1:39 Sean Christopherson
2022-04-27 1:39 ` [PATCH v2 1/8] Revert "KVM: Do not speculatively mark pfn cache valid to "fix" race" Sean Christopherson
` (7 more replies)
0 siblings, 8 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-04-27 1:39 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, kvm, linux-kernel, David Woodhouse, Mingwei Zhang,
Maxim Levitsky
Fix races between mmu_notifier invalidation and pfncache refresh, and
within the pfncache itself.
The first two patches are reverts of the patches sitting in kvm/queue;
trying to separate and fix the races independently is nigh impossible.
I assume/hope they can be ignored and the original patches dropped.
I verified internal races using the attached hack-a-test. Running the
test against the current implementation fails due to KVM writing the
current GPA into the wrong page.
I don't think the race with the mmu_notifier is technically proven, e.g. I
never encountered a use-after-free even running with KASAN.
Ran with PROVE_LOCKING and DEBUG_ATOMIC_SLEEP, so in theory there shouldn't
be any lurking locking goofs this time...
v2:
- Map the pfn=>khva outside of gpc->lock. [Maxim]
- Fix a page leak.
- Fix more races.
v1:
https://lore.kernel.org/all/20220420004859.3298837-1-seanjc@google.com
Sean Christopherson (8):
Revert "KVM: Do not speculatively mark pfn cache valid to "fix" race"
Revert "KVM: Fix race between mmu_notifier invalidation and pfncache
refresh"
KVM: Drop unused @gpa param from gfn=>pfn cache's __release_gpc()
helper
KVM: Put the extra pfn reference when reusing a pfn in the gpc cache
KVM: Do not incorporate page offset into gfn=>pfn cache user address
KVM: Fix multiple races in gfn=>pfn cache refresh
KVM: Do not pin pages tracked by gfn=>pfn caches
DO NOT MERGE: Hack-a-test to verify gpc invalidation+refresh
arch/x86/kvm/x86.c | 30 ++++
include/linux/kvm_host.h | 2 +
include/linux/kvm_types.h | 1 +
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 2 +
tools/testing/selftests/kvm/gpc_test.c | 217 +++++++++++++++++++++++++
virt/kvm/pfncache.c | 188 +++++++++++++--------
7 files changed, 372 insertions(+), 69 deletions(-)
create mode 100644 tools/testing/selftests/kvm/gpc_test.c
base-commit: 2a39d8b39bffdaf1a4223d0d22f07baee154c8f3
--
2.36.0.rc2.479.g8af0fa9b8e-goog
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v2 1/8] Revert "KVM: Do not speculatively mark pfn cache valid to "fix" race"
2022-04-27 1:39 [PATCH v2 0/8] KVM: Fix mmu_notifier vs. pfncache vs. pfncache races Sean Christopherson
@ 2022-04-27 1:39 ` Sean Christopherson
2022-04-27 1:39 ` [PATCH v2 2/8] Revert "KVM: Fix race between mmu_notifier invalidation and pfncache refresh" Sean Christopherson
` (6 subsequent siblings)
7 siblings, 0 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-04-27 1:39 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, kvm, linux-kernel, David Woodhouse, Mingwei Zhang,
Maxim Levitsky
This reverts commit 55111927df1cd140aa7b7ea3f33f524b87776381.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
virt/kvm/pfncache.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 72eee096a7cd..71c84a43024c 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -81,8 +81,6 @@ bool kvm_gfn_to_pfn_cache_check(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
{
struct kvm_memslots *slots = kvm_memslots(kvm);
- lockdep_assert_held_read(&gpc->lock);
-
if ((gpa & ~PAGE_MASK) + len > PAGE_SIZE)
return false;
@@ -228,6 +226,11 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
if (!old_valid || old_uhva != gpc->uhva) {
void *new_khva = NULL;
+ /* Placeholders for "hva is valid but not yet mapped" */
+ gpc->pfn = KVM_PFN_ERR_FAULT;
+ gpc->khva = NULL;
+ gpc->valid = true;
+
new_pfn = hva_to_pfn_retry(kvm, gpc);
if (is_error_noslot_pfn(new_pfn)) {
ret = -EFAULT;
@@ -256,7 +259,7 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
gpc->pfn = KVM_PFN_ERR_FAULT;
gpc->khva = NULL;
} else {
- gpc->valid = true;
+ /* At this point, gpc->valid may already have been cleared */
gpc->pfn = new_pfn;
gpc->khva = new_khva;
}
--
2.36.0.rc2.479.g8af0fa9b8e-goog
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v2 2/8] Revert "KVM: Fix race between mmu_notifier invalidation and pfncache refresh"
2022-04-27 1:39 [PATCH v2 0/8] KVM: Fix mmu_notifier vs. pfncache vs. pfncache races Sean Christopherson
2022-04-27 1:39 ` [PATCH v2 1/8] Revert "KVM: Do not speculatively mark pfn cache valid to "fix" race" Sean Christopherson
@ 2022-04-27 1:39 ` Sean Christopherson
2022-04-27 1:39 ` [PATCH v2 3/8] KVM: Drop unused @gpa param from gfn=>pfn cache's __release_gpc() helper Sean Christopherson
` (5 subsequent siblings)
7 siblings, 0 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-04-27 1:39 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, kvm, linux-kernel, David Woodhouse, Mingwei Zhang,
Maxim Levitsky
This reverts commit c496097d2c0bdc229f82d72b4b1e55d64974c316.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
virt/kvm/kvm_main.c | 9 ------
virt/kvm/pfncache.c | 70 ++++++++++++++-------------------------------
2 files changed, 21 insertions(+), 58 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0848430f36c6..dfb7dabdbc63 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -705,15 +705,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
kvm->mn_active_invalidate_count++;
spin_unlock(&kvm->mn_invalidate_lock);
- /*
- * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e.
- * before acquiring mmu_lock, to avoid holding mmu_lock while acquiring
- * each cache's lock. There are relatively few caches in existence at
- * any given time, and the caches themselves can check for hva overlap,
- * i.e. don't need to rely on memslot overlap checks for performance.
- * Because this runs without holding mmu_lock, the pfn caches must use
- * mn_active_invalidate_count (see above) instead of mmu_notifier_count.
- */
gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end,
hva_range.may_block);
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 71c84a43024c..dd84676615f1 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -112,63 +112,29 @@ static void __release_gpc(struct kvm *kvm, kvm_pfn_t pfn, void *khva, gpa_t gpa)
}
}
-static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc)
+static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, unsigned long uhva)
{
- bool first_attempt = true;
unsigned long mmu_seq;
kvm_pfn_t new_pfn;
+ int retry;
- lockdep_assert_held_write(&gpc->lock);
-
- for (;;) {
+ do {
mmu_seq = kvm->mmu_notifier_seq;
smp_rmb();
- write_unlock_irq(&gpc->lock);
-
- /* Opportunistically check for resched while the lock isn't held. */
- if (!first_attempt)
- cond_resched();
-
/* We always request a writeable mapping */
- new_pfn = hva_to_pfn(gpc->uhva, false, NULL, true, NULL);
-
- write_lock_irq(&gpc->lock);
-
+ new_pfn = hva_to_pfn(uhva, false, NULL, true, NULL);
if (is_error_noslot_pfn(new_pfn))
break;
- first_attempt = false;
-
- /*
- * Wait for mn_active_invalidate_count, not mmu_notifier_count,
- * to go away, as the invalidation in the mmu_notifier event
- * occurs _before_ mmu_notifier_count is elevated.
- *
- * Note, mn_active_invalidate_count can change at any time as
- * it's not protected by gpc->lock. But, it is guaranteed to
- * be elevated before the mmu_notifier acquires gpc->lock, and
- * isn't dropped until after mmu_notifier_seq is updated. So,
- * this task may get a false positive of sorts, i.e. see an
- * elevated count and wait even though it's technically safe to
- * proceed (becase the mmu_notifier will invalidate the cache
- * _after_ it's refreshed here), but the cache will never be
- * refreshed with stale data, i.e. won't get false negatives.
- */
- if (kvm->mn_active_invalidate_count)
- continue;
-
- /*
- * Ensure mn_active_invalidate_count is read before
- * mmu_notifier_seq. This pairs with the smp_wmb() in
- * mmu_notifier_invalidate_range_end() to guarantee either the
- * old (non-zero) value of mn_active_invalidate_count or the
- * new (incremented) value of mmu_notifier_seq is observed.
- */
- smp_rmb();
- if (kvm->mmu_notifier_seq == mmu_seq)
+ KVM_MMU_READ_LOCK(kvm);
+ retry = mmu_notifier_retry_hva(kvm, mmu_seq, uhva);
+ KVM_MMU_READ_UNLOCK(kvm);
+ if (!retry)
break;
- }
+
+ cond_resched();
+ } while (1);
return new_pfn;
}
@@ -224,6 +190,7 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
* drop the lock and do the HVA to PFN lookup again.
*/
if (!old_valid || old_uhva != gpc->uhva) {
+ unsigned long uhva = gpc->uhva;
void *new_khva = NULL;
/* Placeholders for "hva is valid but not yet mapped" */
@@ -231,10 +198,15 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
gpc->khva = NULL;
gpc->valid = true;
- new_pfn = hva_to_pfn_retry(kvm, gpc);
+ write_unlock_irq(&gpc->lock);
+
+ new_pfn = hva_to_pfn_retry(kvm, uhva);
if (is_error_noslot_pfn(new_pfn)) {
ret = -EFAULT;
- } else if (gpc->usage & KVM_HOST_USES_PFN) {
+ goto map_done;
+ }
+
+ if (gpc->usage & KVM_HOST_USES_PFN) {
if (new_pfn == old_pfn) {
new_khva = old_khva;
old_pfn = KVM_PFN_ERR_FAULT;
@@ -250,10 +222,10 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
new_khva += page_offset;
else
ret = -EFAULT;
- } else {
- /* Nothing more to do, the pfn is consumed only by the guest. */
}
+ map_done:
+ write_lock_irq(&gpc->lock);
if (ret) {
gpc->valid = false;
gpc->pfn = KVM_PFN_ERR_FAULT;
--
2.36.0.rc2.479.g8af0fa9b8e-goog
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v2 3/8] KVM: Drop unused @gpa param from gfn=>pfn cache's __release_gpc() helper
2022-04-27 1:39 [PATCH v2 0/8] KVM: Fix mmu_notifier vs. pfncache vs. pfncache races Sean Christopherson
2022-04-27 1:39 ` [PATCH v2 1/8] Revert "KVM: Do not speculatively mark pfn cache valid to "fix" race" Sean Christopherson
2022-04-27 1:39 ` [PATCH v2 2/8] Revert "KVM: Fix race between mmu_notifier invalidation and pfncache refresh" Sean Christopherson
@ 2022-04-27 1:39 ` Sean Christopherson
2022-04-27 1:40 ` [PATCH v2 4/8] KVM: Put the extra pfn reference when reusing a pfn in the gpc cache Sean Christopherson
` (4 subsequent siblings)
7 siblings, 0 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-04-27 1:39 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, kvm, linux-kernel, David Woodhouse, Mingwei Zhang,
Maxim Levitsky
Drop the @gpa param from __release_gpc() and rename the helper to make it
more obvious that the cache itself is not being released. The helper
will be reused by a future commit to release a pfn+khva combination that
is _never_ associated with the cache, at which point the current name
would go from slightly misleading to blatantly wrong.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
virt/kvm/pfncache.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index dd84676615f1..e05a6a1b8eff 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -95,7 +95,7 @@ bool kvm_gfn_to_pfn_cache_check(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
}
EXPORT_SYMBOL_GPL(kvm_gfn_to_pfn_cache_check);
-static void __release_gpc(struct kvm *kvm, kvm_pfn_t pfn, void *khva, gpa_t gpa)
+static void gpc_release_pfn_and_khva(struct kvm *kvm, kvm_pfn_t pfn, void *khva)
{
/* Unmap the old page if it was mapped before, and release it */
if (!is_error_noslot_pfn(pfn)) {
@@ -146,7 +146,6 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
unsigned long page_offset = gpa & ~PAGE_MASK;
kvm_pfn_t old_pfn, new_pfn;
unsigned long old_uhva;
- gpa_t old_gpa;
void *old_khva;
bool old_valid;
int ret = 0;
@@ -160,7 +159,6 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
write_lock_irq(&gpc->lock);
- old_gpa = gpc->gpa;
old_pfn = gpc->pfn;
old_khva = gpc->khva - offset_in_page(gpc->khva);
old_uhva = gpc->uhva;
@@ -244,7 +242,7 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
out:
write_unlock_irq(&gpc->lock);
- __release_gpc(kvm, old_pfn, old_khva, old_gpa);
+ gpc_release_pfn_and_khva(kvm, old_pfn, old_khva);
return ret;
}
@@ -254,14 +252,12 @@ void kvm_gfn_to_pfn_cache_unmap(struct kvm *kvm, struct gfn_to_pfn_cache *gpc)
{
void *old_khva;
kvm_pfn_t old_pfn;
- gpa_t old_gpa;
write_lock_irq(&gpc->lock);
gpc->valid = false;
old_khva = gpc->khva - offset_in_page(gpc->khva);
- old_gpa = gpc->gpa;
old_pfn = gpc->pfn;
/*
@@ -273,7 +269,7 @@ void kvm_gfn_to_pfn_cache_unmap(struct kvm *kvm, struct gfn_to_pfn_cache *gpc)
write_unlock_irq(&gpc->lock);
- __release_gpc(kvm, old_pfn, old_khva, old_gpa);
+ gpc_release_pfn_and_khva(kvm, old_pfn, old_khva);
}
EXPORT_SYMBOL_GPL(kvm_gfn_to_pfn_cache_unmap);
--
2.36.0.rc2.479.g8af0fa9b8e-goog
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v2 4/8] KVM: Put the extra pfn reference when reusing a pfn in the gpc cache
2022-04-27 1:39 [PATCH v2 0/8] KVM: Fix mmu_notifier vs. pfncache vs. pfncache races Sean Christopherson
` (2 preceding siblings ...)
2022-04-27 1:39 ` [PATCH v2 3/8] KVM: Drop unused @gpa param from gfn=>pfn cache's __release_gpc() helper Sean Christopherson
@ 2022-04-27 1:40 ` Sean Christopherson
2022-04-27 1:40 ` [PATCH v2 5/8] KVM: Do not incorporate page offset into gfn=>pfn cache user address Sean Christopherson
` (3 subsequent siblings)
7 siblings, 0 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-04-27 1:40 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, kvm, linux-kernel, David Woodhouse, Mingwei Zhang,
Maxim Levitsky
Put the struct page reference to the pfn acquired by hva_to_pfn() when the
old and new pfns for a gfn=>pfn cache match. The cache already has a
reference via the old/current pfn, and will only put one reference when
the cache is done with the pfn.
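For illustration only (not from the patch), the problematic flow before this
fix looks roughly like:

	new_pfn = hva_to_pfn(gpc->uhva, false, NULL, true, NULL); /* takes a ref */

	if (new_pfn == old_pfn) {
		/*
		 * The cache already holds a reference to this pfn from the
		 * previous refresh; without an explicit put here, every
		 * refresh that reuses the pfn leaks one page reference.
		 */
		new_khva = old_khva;
	}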
Fixes: 982ed0de4753 ("KVM: Reinstate gfn_to_pfn_cache with invalidation support")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
virt/kvm/pfncache.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index e05a6a1b8eff..40cbe90d52e0 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -206,6 +206,14 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
if (gpc->usage & KVM_HOST_USES_PFN) {
if (new_pfn == old_pfn) {
+ /*
+ * Reuse the existing pfn and khva, but put the
+ * reference acquired by hva_to_pfn_retry(); the
+ * cache still holds a reference to the pfn
+ * from the previous refresh.
+ */
+ gpc_release_pfn_and_khva(kvm, new_pfn, NULL);
+
new_khva = old_khva;
old_pfn = KVM_PFN_ERR_FAULT;
old_khva = NULL;
--
2.36.0.rc2.479.g8af0fa9b8e-goog
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v2 5/8] KVM: Do not incorporate page offset into gfn=>pfn cache user address
2022-04-27 1:39 [PATCH v2 0/8] KVM: Fix mmu_notifier vs. pfncache vs. pfncache races Sean Christopherson
` (3 preceding siblings ...)
2022-04-27 1:40 ` [PATCH v2 4/8] KVM: Put the extra pfn reference when reusing a pfn in the gpc cache Sean Christopherson
@ 2022-04-27 1:40 ` Sean Christopherson
2022-04-27 1:40 ` [PATCH v2 6/8] KVM: Fix multiple races in gfn=>pfn cache refresh Sean Christopherson
` (2 subsequent siblings)
7 siblings, 0 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-04-27 1:40 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, kvm, linux-kernel, David Woodhouse, Mingwei Zhang,
Maxim Levitsky
Don't adjust the userspace address in the gfn=>pfn cache by the page
offset from the gpa. KVM should never use the user address directly, and
all KVM operations that translate a user address to something else
require the user address to be page aligned. Ignoring the offset will
allow the cache to reuse a gfn=>hva translation in the unlikely event
that the page offset of the gpa changes, but the gfn does not.
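Roughly, after this series the offset is applied only when the kernel mapping
is installed (condensed sketch, not a literal quote of the code):

	gfn_t gfn = gpa_to_gfn(gpc->gpa);

	gpc->uhva = gfn_to_hva_memslot(gpc->memslot, gfn);  /* stays page aligned */
	...
	gpc->khva = new_khva + (gpc->gpa & ~PAGE_MASK);     /* offset applied here */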
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
virt/kvm/pfncache.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 40cbe90d52e0..05cb0bcbf662 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -179,8 +179,6 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
ret = -EFAULT;
goto out;
}
-
- gpc->uhva += page_offset;
}
/*
--
2.36.0.rc2.479.g8af0fa9b8e-goog
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v2 6/8] KVM: Fix multiple races in gfn=>pfn cache refresh
2022-04-27 1:39 [PATCH v2 0/8] KVM: Fix mmu_notifier vs. pfncache vs. pfncache races Sean Christopherson
` (4 preceding siblings ...)
2022-04-27 1:40 ` [PATCH v2 5/8] KVM: Do not incorporate page offset into gfn=>pfn cache user address Sean Christopherson
@ 2022-04-27 1:40 ` Sean Christopherson
2022-04-27 14:10 ` Sean Christopherson
` (3 more replies)
2022-04-27 1:40 ` [PATCH v2 7/8] KVM: Do not pin pages tracked by gfn=>pfn caches Sean Christopherson
2022-04-27 1:40 ` [PATCH v2 8/8] DO NOT MERGE: Hack-a-test to verify gpc invalidation+refresh Sean Christopherson
7 siblings, 4 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-04-27 1:40 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, kvm, linux-kernel, David Woodhouse, Mingwei Zhang,
Maxim Levitsky
Rework the gfn=>pfn cache (gpc) refresh logic to address multiple races
between the cache itself, and between the cache and mmu_notifier events.
The existing refresh code attempts to guard against races with the
mmu_notifier by speculatively marking the cache valid, and then marking
it invalid if a mmu_notifier invalidation occurs. That handles the case
where an invalidation occurs between dropping and re-acquiring gpc->lock,
but it doesn't handle the scenario where the cache is refreshed after the
cache was invalidated by the notifier, but before the notifier elevates
mmu_notifier_count. The gpc refresh can't use the "retry" helper as its
invalidation occurs _before_ mmu_notifier_count is elevated and before
mmu_notifier_range_start is set/updated.
  CPU0                                    CPU1
  ----                                    ----

  gfn_to_pfn_cache_invalidate_start()
  |
  -> gpc->valid = false;
                                          kvm_gfn_to_pfn_cache_refresh()
                                          |
                                          |-> gpc->valid = true;

                                          hva_to_pfn_retry()
                                          |
                                          -> acquire kvm->mmu_lock
                                             kvm->mmu_notifier_count == 0
                                             mmu_seq == kvm->mmu_notifier_seq
                                             drop kvm->mmu_lock
                                             return pfn 'X'
  acquire kvm->mmu_lock
  kvm_inc_notifier_count()
  drop kvm->mmu_lock()
  kernel frees pfn 'X'
                                          kvm_gfn_to_pfn_cache_check()
                                          |
                                          |-> gpc->valid == true

                                          caller accesses freed pfn 'X'
Key off of mn_active_invalidate_count to detect that a pfncache refresh
needs to wait for an in-progress mmu_notifier invalidation. While
mn_active_invalidate_count is not guaranteed to be stable, it is
guaranteed to be elevated prior to an invalidation acquiring gpc->lock,
so either the refresh will see an active invalidation and wait, or the
invalidation will run after the refresh completes.
Speculatively marking the cache valid is itself flawed, as a concurrent
kvm_gfn_to_pfn_cache_check() would see a valid cache with stale pfn/khva
values. The KVM Xen use case explicitly allows/wants multiple users;
even though the caches are allocated per vCPU, __kvm_xen_has_interrupt()
can read a different vCPU (or vCPUs). Address this race by invalidating
the cache prior to dropping gpc->lock (this is made possible by fixing
the above mmu_notifier race).
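For context, the consumer pattern being protected looks roughly like this
(sketch of the check/refresh usage based on the existing callers, error
handling trimmed):

	read_lock_irqsave(&gpc->lock, flags);
	while (!kvm_gfn_to_pfn_cache_check(kvm, gpc, gpa, len)) {
		read_unlock_irqrestore(&gpc->lock, flags);

		if (kvm_gfn_to_pfn_cache_refresh(kvm, gpc, gpa, len))
			return;		/* cache is unusable, bail */

		read_lock_irqsave(&gpc->lock, flags);
	}
	/* ... consume gpc->khva while holding gpc->lock for read ... */
	read_unlock_irqrestore(&gpc->lock, flags);

A check() that observes a speculatively-set gpc->valid lets the consumer
above read through a stale pfn/khva.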
Finally, the refresh logic doesn't protect against concurrent refreshes
with different GPAs (which may or may not be a desired use case, but it's
allowed in the code), nor does it protect against a false negative on the
memslot generation. If the first refresh sees a stale memslot generation,
it will refresh the hva and generation before moving on to the hva=>pfn
translation. If it then drops gpc->lock, a different user can come along,
acquire gpc->lock, see that the memslot generation is fresh, and skip
the hva=>pfn update due to the userspace address also matching (because
it too was updated). Address this race by adding an "in-progress" flag
so that the refresh that acquires gpc->lock first runs to completion
before other users can start their refresh.
Complicating all of this is the fact that both the hva=>pfn resolution
and mapping of the kernel address can sleep, i.e. must be done outside
of gpc->lock.
Fix the above races in one fell swoop; trying to fix each individual race
in a sane manner is, for all intents and purposes, impossible.
Fixes: 982ed0de4753 ("KVM: Reinstate gfn_to_pfn_cache with invalidation support")
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Mingwei Zhang <mizhang@google.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
include/linux/kvm_types.h | 1 +
virt/kvm/kvm_main.c | 9 ++
virt/kvm/pfncache.c | 209 +++++++++++++++++++++++++-------------
3 files changed, 148 insertions(+), 71 deletions(-)
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index ac1ebb37a0ff..83dcb97dddf1 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -74,6 +74,7 @@ struct gfn_to_pfn_cache {
enum pfn_cache_usage usage;
bool active;
bool valid;
+ bool refresh_in_progress;
};
#ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dfb7dabdbc63..0848430f36c6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -705,6 +705,15 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
kvm->mn_active_invalidate_count++;
spin_unlock(&kvm->mn_invalidate_lock);
+ /*
+ * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e.
+ * before acquiring mmu_lock, to avoid holding mmu_lock while acquiring
+ * each cache's lock. There are relatively few caches in existence at
+ * any given time, and the caches themselves can check for hva overlap,
+ * i.e. don't need to rely on memslot overlap checks for performance.
+ * Because this runs without holding mmu_lock, the pfn caches must use
+ * mn_active_invalidate_count (see above) instead of mmu_notifier_count.
+ */
gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end,
hva_range.may_block);
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 05cb0bcbf662..b1665d0e6c32 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -112,31 +112,122 @@ static void gpc_release_pfn_and_khva(struct kvm *kvm, kvm_pfn_t pfn, void *khva)
}
}
-static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, unsigned long uhva)
+static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc)
{
+ /* Note, the new page offset may be different than the old! */
+ void *old_khva = gpc->khva - offset_in_page(gpc->khva);
+ kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
+ void *new_khva = NULL;
unsigned long mmu_seq;
- kvm_pfn_t new_pfn;
- int retry;
- do {
+ lockdep_assert_held_write(&gpc->lock);
+
+ /*
+ * Invalidate the cache prior to dropping gpc->lock, gpc->uhva has
+ * already been updated and so a concurrent refresh from a different
+ * task will not detect that gpa/uhva changed.
+ */
+ gpc->valid = false;
+
+ for (;;) {
mmu_seq = kvm->mmu_notifier_seq;
smp_rmb();
+ write_unlock_irq(&gpc->lock);
+
+ /*
+ * If the previous iteration "failed" due to an mmu_notifier
+ * event, release the pfn and unmap the kernel virtual address
+ * from the previous attempt. Unmapping might sleep, so this
+ * needs to be done after dropping the lock. Opportunistically
+ * check for resched while the lock isn't held.
+ */
+ if (new_pfn != KVM_PFN_ERR_FAULT) {
+ /*
+ * Keep the mapping if the previous iteration reused
+ * the existing mapping and didn't create a new one.
+ */
+ if (new_khva == old_khva)
+ new_khva = NULL;
+
+ gpc_release_pfn_and_khva(kvm, new_pfn, new_khva);
+
+ cond_resched();
+ }
+
/* We always request a writeable mapping */
- new_pfn = hva_to_pfn(uhva, false, NULL, true, NULL);
+ new_pfn = hva_to_pfn(gpc->uhva, false, NULL, true, NULL);
if (is_error_noslot_pfn(new_pfn))
- break;
+ goto out_error;
+
+ /*
+ * Obtain a new kernel mapping if KVM itself will access the
+ * pfn. Note, kmap() and memremap() can both sleep, so this
+ * too must be done outside of gpc->lock!
+ */
+ if (gpc->usage & KVM_HOST_USES_PFN) {
+ if (new_pfn == gpc->pfn) {
+ new_khva = old_khva;
+ } else if (pfn_valid(new_pfn)) {
+ new_khva = kmap(pfn_to_page(new_pfn));
+#ifdef CONFIG_HAS_IOMEM
+ } else {
+ new_khva = memremap(pfn_to_hpa(new_pfn), PAGE_SIZE, MEMREMAP_WB);
+#endif
+ }
+ if (!new_khva) {
+ kvm_release_pfn_clean(new_pfn);
+ goto out_error;
+ }
+ }
+
+ write_lock_irq(&gpc->lock);
- KVM_MMU_READ_LOCK(kvm);
- retry = mmu_notifier_retry_hva(kvm, mmu_seq, uhva);
- KVM_MMU_READ_UNLOCK(kvm);
- if (!retry)
+ /*
+ * Other tasks must wait for _this_ refresh to complete before
+ * attempting to refresh.
+ */
+ WARN_ON_ONCE(gpc->valid);
+
+ /*
+ * Wait for mn_active_invalidate_count, not mmu_notifier_count,
+ * to go away, as the invalidation in the mmu_notifier event
+ * occurs _before_ mmu_notifier_count is elevated.
+ *
+ * Note, mn_active_invalidate_count can change at any time as
+ * it's not protected by gpc->lock. But, it is guaranteed to
+ * be elevated before the mmu_notifier acquires gpc->lock, and
+ * isn't dropped until after mmu_notifier_seq is updated. So,
+ * this task may get a false positive of sorts, i.e. see an
+ * elevated count and wait even though it's technically safe to
+ * proceed (because the mmu_notifier will invalidate the cache
+ * _after_ it's refreshed here), but the cache will never be
+ * refreshed with stale data, i.e. won't get false negatives.
+ */
+ if (kvm->mn_active_invalidate_count)
+ continue;
+
+ /*
+ * Ensure mn_active_invalidate_count is read before
+ * mmu_notifier_seq. This pairs with the smp_wmb() in
+ * mmu_notifier_invalidate_range_end() to guarantee either the
+ * old (non-zero) value of mn_active_invalidate_count or the
+ * new (incremented) value of mmu_notifier_seq is observed.
+ */
+ smp_rmb();
+ if (kvm->mmu_notifier_seq == mmu_seq)
break;
+ }
+
+ gpc->valid = true;
+ gpc->pfn = new_pfn;
+ gpc->khva = new_khva + (gpc->gpa & ~PAGE_MASK);
+ return 0;
- cond_resched();
- } while (1);
+out_error:
+ write_lock_irq(&gpc->lock);
- return new_pfn;
+ return -EFAULT;
}
int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
@@ -147,7 +238,6 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
kvm_pfn_t old_pfn, new_pfn;
unsigned long old_uhva;
void *old_khva;
- bool old_valid;
int ret = 0;
/*
@@ -159,10 +249,23 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
write_lock_irq(&gpc->lock);
+ /*
+ * If another task is refreshing the cache, wait for it to complete.
+ * There is no guarantee that concurrent refreshes will see the same
+ * gpa, memslots generation, etc..., so they must be fully serialized.
+ */
+ while (gpc->refresh_in_progress) {
+ write_unlock_irq(&gpc->lock);
+
+ cond_resched();
+
+ write_lock_irq(&gpc->lock);
+ }
+ gpc->refresh_in_progress = true;
+
old_pfn = gpc->pfn;
old_khva = gpc->khva - offset_in_page(gpc->khva);
old_uhva = gpc->uhva;
- old_valid = gpc->valid;
/* If the userspace HVA is invalid, refresh that first */
if (gpc->gpa != gpa || gpc->generation != slots->generation ||
@@ -175,7 +278,6 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
gpc->uhva = gfn_to_hva_memslot(gpc->memslot, gfn);
if (kvm_is_error_hva(gpc->uhva)) {
- gpc->pfn = KVM_PFN_ERR_FAULT;
ret = -EFAULT;
goto out;
}
@@ -185,60 +287,8 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
* If the userspace HVA changed or the PFN was already invalid,
* drop the lock and do the HVA to PFN lookup again.
*/
- if (!old_valid || old_uhva != gpc->uhva) {
- unsigned long uhva = gpc->uhva;
- void *new_khva = NULL;
-
- /* Placeholders for "hva is valid but not yet mapped" */
- gpc->pfn = KVM_PFN_ERR_FAULT;
- gpc->khva = NULL;
- gpc->valid = true;
-
- write_unlock_irq(&gpc->lock);
-
- new_pfn = hva_to_pfn_retry(kvm, uhva);
- if (is_error_noslot_pfn(new_pfn)) {
- ret = -EFAULT;
- goto map_done;
- }
-
- if (gpc->usage & KVM_HOST_USES_PFN) {
- if (new_pfn == old_pfn) {
- /*
- * Reuse the existing pfn and khva, but put the
- * reference acquired by hva_to_pfn_retry(); the
- * cache still holds a reference to the pfn
- * from the previous refresh.
- */
- gpc_release_pfn_and_khva(kvm, new_pfn, NULL);
-
- new_khva = old_khva;
- old_pfn = KVM_PFN_ERR_FAULT;
- old_khva = NULL;
- } else if (pfn_valid(new_pfn)) {
- new_khva = kmap(pfn_to_page(new_pfn));
-#ifdef CONFIG_HAS_IOMEM
- } else {
- new_khva = memremap(pfn_to_hpa(new_pfn), PAGE_SIZE, MEMREMAP_WB);
-#endif
- }
- if (new_khva)
- new_khva += page_offset;
- else
- ret = -EFAULT;
- }
-
- map_done:
- write_lock_irq(&gpc->lock);
- if (ret) {
- gpc->valid = false;
- gpc->pfn = KVM_PFN_ERR_FAULT;
- gpc->khva = NULL;
- } else {
- /* At this point, gpc->valid may already have been cleared */
- gpc->pfn = new_pfn;
- gpc->khva = new_khva;
- }
+ if (!gpc->valid || old_uhva != gpc->uhva) {
+ ret = hva_to_pfn_retry(kvm, gpc);
} else {
/* If the HVA→PFN mapping was already valid, don't unmap it. */
old_pfn = KVM_PFN_ERR_FAULT;
@@ -246,9 +296,26 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
}
out:
+ /*
+ * Invalidate the cache and purge the pfn/khva if the refresh failed.
+ * Some/all of the uhva, gpa, and memslot generation info may still be
+ * valid, leave it as is.
+ */
+ if (ret) {
+ gpc->valid = false;
+ gpc->pfn = KVM_PFN_ERR_FAULT;
+ gpc->khva = NULL;
+ }
+
+ gpc->refresh_in_progress = false;
+
+ /* Snapshot the new pfn before dropping the lock! */
+ new_pfn = gpc->pfn;
+
write_unlock_irq(&gpc->lock);
- gpc_release_pfn_and_khva(kvm, old_pfn, old_khva);
+ if (old_pfn != new_pfn)
+ gpc_release_pfn_and_khva(kvm, old_pfn, old_khva);
return ret;
}
--
2.36.0.rc2.479.g8af0fa9b8e-goog
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v2 7/8] KVM: Do not pin pages tracked by gfn=>pfn caches
2022-04-27 1:39 [PATCH v2 0/8] KVM: Fix mmu_notifier vs. pfncache vs. pfncache races Sean Christopherson
` (5 preceding siblings ...)
2022-04-27 1:40 ` [PATCH v2 6/8] KVM: Fix multiple races in gfn=>pfn cache refresh Sean Christopherson
@ 2022-04-27 1:40 ` Sean Christopherson
2022-04-27 1:40 ` [PATCH v2 8/8] DO NOT MERGE: Hack-a-test to verify gpc invalidation+refresh Sean Christopherson
7 siblings, 0 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-04-27 1:40 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, kvm, linux-kernel, David Woodhouse, Mingwei Zhang,
Maxim Levitsky
Put the reference to any struct page mapped/tracked by a gfn=>pfn cache
upon inserting the pfn into its associated cache, as opposed to putting
the reference only when the cache is done using the pfn. In other words,
don't pin pages while they're in the cache. One of the major roles of
the gfn=>pfn cache is to play nicely with invalidation events, i.e. it
exists in large part so that KVM doesn't rely on pinning pages.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
virt/kvm/pfncache.c | 36 ++++++++++++++++++++----------------
1 file changed, 20 insertions(+), 16 deletions(-)
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index b1665d0e6c32..3cb439b505b4 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -95,20 +95,16 @@ bool kvm_gfn_to_pfn_cache_check(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
}
EXPORT_SYMBOL_GPL(kvm_gfn_to_pfn_cache_check);
-static void gpc_release_pfn_and_khva(struct kvm *kvm, kvm_pfn_t pfn, void *khva)
+static void gpc_unmap_khva(struct kvm *kvm, kvm_pfn_t pfn, void *khva)
{
- /* Unmap the old page if it was mapped before, and release it */
- if (!is_error_noslot_pfn(pfn)) {
- if (khva) {
- if (pfn_valid(pfn))
- kunmap(pfn_to_page(pfn));
+ /* Unmap the old pfn/page if it was mapped before. */
+ if (!is_error_noslot_pfn(pfn) && khva) {
+ if (pfn_valid(pfn))
+ kunmap(pfn_to_page(pfn));
#ifdef CONFIG_HAS_IOMEM
- else
- memunmap(khva);
+ else
+ memunmap(khva);
#endif
- }
-
- kvm_release_pfn(pfn, false);
}
}
@@ -147,10 +143,10 @@ static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc)
* Keep the mapping if the previous iteration reused
* the existing mapping and didn't create a new one.
*/
- if (new_khva == old_khva)
- new_khva = NULL;
+ if (new_khva != old_khva)
+ gpc_unmap_khva(kvm, new_pfn, new_khva);
- gpc_release_pfn_and_khva(kvm, new_pfn, new_khva);
+ kvm_release_pfn_clean(new_pfn);
cond_resched();
}
@@ -222,6 +218,14 @@ static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc)
gpc->valid = true;
gpc->pfn = new_pfn;
gpc->khva = new_khva + (gpc->gpa & ~PAGE_MASK);
+
+ /*
+ * Put the reference to the _new_ pfn. The pfn is now tracked by the
+ * cache and can be safely migrated, swapped, etc... as the cache will
+ * invalidate any mappings in response to relevant mmu_notifier events.
+ */
+ kvm_release_pfn_clean(new_pfn);
+
return 0;
out_error:
@@ -315,7 +319,7 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
write_unlock_irq(&gpc->lock);
if (old_pfn != new_pfn)
- gpc_release_pfn_and_khva(kvm, old_pfn, old_khva);
+ gpc_unmap_khva(kvm, old_pfn, old_khva);
return ret;
}
@@ -342,7 +346,7 @@ void kvm_gfn_to_pfn_cache_unmap(struct kvm *kvm, struct gfn_to_pfn_cache *gpc)
write_unlock_irq(&gpc->lock);
- gpc_release_pfn_and_khva(kvm, old_pfn, old_khva);
+ gpc_unmap_khva(kvm, old_pfn, old_khva);
}
EXPORT_SYMBOL_GPL(kvm_gfn_to_pfn_cache_unmap);
--
2.36.0.rc2.479.g8af0fa9b8e-goog
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v2 8/8] DO NOT MERGE: Hack-a-test to verify gpc invalidation+refresh
2022-04-27 1:39 [PATCH v2 0/8] KVM: Fix mmu_notifier vs. pfncache vs. pfncache races Sean Christopherson
` (6 preceding siblings ...)
2022-04-27 1:40 ` [PATCH v2 7/8] KVM: Do not pin pages tracked by gfn=>pfn caches Sean Christopherson
@ 2022-04-27 1:40 ` Sean Christopherson
2022-04-27 15:17 ` David Woodhouse
7 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2022-04-27 1:40 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, kvm, linux-kernel, David Woodhouse, Mingwei Zhang,
Maxim Levitsky
Add a VM-wide gfn=>pfn cache and a fake MSR to let userspace control the
cache. On writes, reflect the value of the MSR into the backing page of
a gfn=>pfn cache so that userspace can detect if a value was written to
the wrong page, i.e. to a stale mapping.
Spin up 16 vCPUs (arbitrary) to use/refresh the cache, and another thread
to trigger mmu_notifier events and memslot updates.
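The detection idea, condensed (sketch; the full test is below):

	/* guest: every vCPU hammers the fake MSR with its own GPA */
	wrmsr(0xdeadbeefu, this_vcpu_gpa);

	/* host: the write is reflected into the cached mapping for that GPA */
	*(u64 *)gpc->khva = data;

	/* userspace: if a write ever lands through a stale mapping, some other
	 * page ends up holding a foreign GPA value
	 */
	TEST_ASSERT(*hva == 0 || *hva == gpa, ...);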
Not-signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/x86.c | 30 ++++
include/linux/kvm_host.h | 2 +
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 2 +
tools/testing/selftests/kvm/gpc_test.c | 217 +++++++++++++++++++++++++
virt/kvm/pfncache.c | 2 +
6 files changed, 254 insertions(+)
create mode 100644 tools/testing/selftests/kvm/gpc_test.c
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 951d0a78ccda..7afdb7f39821 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3473,6 +3473,20 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return kvm_xen_write_hypercall_page(vcpu, data);
switch (msr) {
+ case 0xdeadbeefu: {
+ struct gfn_to_pfn_cache *gpc = &vcpu->kvm->test_cache;
+ unsigned long flags;
+
+ if (kvm_gfn_to_pfn_cache_refresh(vcpu->kvm, gpc, data, 8))
+ break;
+
+ read_lock_irqsave(&gpc->lock, flags);
+ if (kvm_gfn_to_pfn_cache_check(vcpu->kvm, gpc, data, 8))
+ *(u64 *)(gpc->khva) = data;
+ read_unlock_irqrestore(&gpc->lock, flags);
+ break;
+ }
+
case MSR_AMD64_NB_CFG:
case MSR_IA32_UCODE_WRITE:
case MSR_VM_HSAVE_PA:
@@ -3825,6 +3839,19 @@ static int get_msr_mce(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata, bool host)
int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
switch (msr_info->index) {
+ case 0xdeadbeefu: {
+ struct gfn_to_pfn_cache *gpc = &vcpu->kvm->test_cache;
+ unsigned long flags;
+
+ read_lock_irqsave(&gpc->lock, flags);
+ if (kvm_gfn_to_pfn_cache_check(vcpu->kvm, gpc, gpc->gpa, 8))
+ msr_info->data = gpc->gpa;
+ else
+ msr_info->data = 0xdeadbeefu;
+ read_unlock_irqrestore(&gpc->lock, flags);
+ return 0;
+ }
+
case MSR_IA32_PLATFORM_ID:
case MSR_IA32_EBL_CR_POWERON:
case MSR_IA32_LASTBRANCHFROMIP:
@@ -11794,6 +11821,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
kvm_hv_init_vm(kvm);
kvm_xen_init_vm(kvm);
+ kvm_gfn_to_pfn_cache_init(kvm, &kvm->test_cache, NULL,
+ KVM_HOST_USES_PFN, 0, 0);
+
return static_call(kvm_x86_vm_init)(kvm);
out_page_track:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 252ee4a61b58..88ed76ad8bc7 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -718,6 +718,8 @@ struct kvm {
spinlock_t gpc_lock;
struct list_head gpc_list;
+ struct gfn_to_pfn_cache test_cache;
+
/*
* created_vcpus is protected by kvm->lock, and is incremented
* at the beginning of KVM_CREATE_VCPU. online_vcpus is only
diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
index 56140068b763..0310a57a1a4f 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -70,3 +70,4 @@
/steal_time
/kvm_binary_stats_test
/system_counter_offset_test
+/gpc_test
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index af582d168621..0adc9ac954d1 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -104,6 +104,8 @@ TEST_GEN_PROGS_x86_64 += steal_time
TEST_GEN_PROGS_x86_64 += kvm_binary_stats_test
TEST_GEN_PROGS_x86_64 += system_counter_offset_test
+TEST_GEN_PROGS_x86_64 += gpc_test
+
TEST_GEN_PROGS_aarch64 += aarch64/arch_timer
TEST_GEN_PROGS_aarch64 += aarch64/debug-exceptions
TEST_GEN_PROGS_aarch64 += aarch64/get-reg-list
diff --git a/tools/testing/selftests/kvm/gpc_test.c b/tools/testing/selftests/kvm/gpc_test.c
new file mode 100644
index 000000000000..5c509e7bb4da
--- /dev/null
+++ b/tools/testing/selftests/kvm/gpc_test.c
@@ -0,0 +1,217 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#define _GNU_SOURCE /* for program_invocation_short_name */
+#include <errno.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <signal.h>
+#include <syscall.h>
+#include <sys/ioctl.h>
+#include <sys/sysinfo.h>
+#include <asm/barrier.h>
+#include <linux/atomic.h>
+#include <linux/rseq.h>
+#include <linux/unistd.h>
+
+#include "kvm_util.h"
+#include "processor.h"
+#include "test_util.h"
+
+#define NR_VCPUS 16
+
+#define NR_ITERATIONS 1000
+
+#define PAGE_SIZE 4096
+
+#ifndef MAP_FIXED_NOREPLACE
+#define MAP_FIXED_NOREPLACE 0x100000
+#endif
+
+static const uint64_t gpa_base = (4ull * (1 << 30));
+
+static struct kvm_vm *vm;
+
+static pthread_t memory_thread;
+static pthread_t vcpu_threads[NR_VCPUS];
+
+static bool fight;
+
+static uint64_t per_vcpu_gpa_aligned(int vcpu_id)
+{
+ return gpa_base + (vcpu_id * PAGE_SIZE);
+}
+
+static uint64_t per_vcpu_gpa(int vcpu_id)
+{
+ return per_vcpu_gpa_aligned(vcpu_id) + vcpu_id;
+}
+
+static void guest_code(int vcpu_id)
+{
+ uint64_t this_vcpu_gpa;
+ int i;
+
+ this_vcpu_gpa = per_vcpu_gpa(vcpu_id);
+
+ for (i = 0; i < NR_ITERATIONS; i++)
+ wrmsr(0xdeadbeefu, this_vcpu_gpa);
+ GUEST_SYNC(0);
+}
+
+static void *memory_worker(void *ign)
+{
+ int i, x, r, k;
+ uint64_t *hva;
+ uint64_t gpa;
+ void *mem;
+
+ while (!READ_ONCE(fight))
+ cpu_relax();
+
+ for (k = 0; k < 50; k++) {
+ i = (unsigned int)random() % NR_VCPUS;
+
+ gpa = per_vcpu_gpa_aligned(i);
+ hva = (void *)gpa;
+
+ x = (unsigned int)random() % 5;
+ switch (x) {
+ case 0:
+ r = munmap(hva, PAGE_SIZE);
+ TEST_ASSERT(!r, "Failed to munmap (hva = %lx), errno = %d (%s)",
+ (unsigned long)hva, errno, strerror(errno));
+
+ mem = mmap(hva, PAGE_SIZE, PROT_READ | PROT_WRITE,
+ MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ TEST_ASSERT(mem != MAP_FAILED && mem == hva,
+ "Failed to mmap (hva = %lx), errno = %d (%s)",
+ (unsigned long)hva, errno, strerror(errno));
+ break;
+ case 1:
+ vm_set_user_memory_region(vm, i + 1, KVM_MEM_LOG_DIRTY_PAGES,
+ gpa, PAGE_SIZE, hva);
+ vm_set_user_memory_region(vm, i + 1, 0, gpa, PAGE_SIZE, hva);
+ break;
+ case 2:
+ r = mprotect(hva, PAGE_SIZE, PROT_NONE);
+ TEST_ASSERT(!r, "Failed to mprotect (hva = %lx), errno = %d (%s)",
+ (unsigned long)hva, errno, strerror(errno));
+
+ r = mprotect(hva, PAGE_SIZE, PROT_READ | PROT_WRITE);
+ TEST_ASSERT(!r, "Failed to mprotect (hva = %lx), errno = %d (%s)",
+ (unsigned long)hva, errno, strerror(errno));
+ break;
+ case 3:
+ r = mprotect(hva, PAGE_SIZE, PROT_READ);
+ TEST_ASSERT(!r, "Failed to mprotect (hva = %lx), errno = %d (%s)",
+ (unsigned long)hva, errno, strerror(errno));
+
+ r = mprotect(hva, PAGE_SIZE, PROT_READ | PROT_WRITE);
+ TEST_ASSERT(!r, "Failed to mprotect (hva = %lx), errno = %d (%s)",
+ (unsigned long)hva, errno, strerror(errno));
+ break;
+ case 4:
+ vm_set_user_memory_region(vm, i + 1, 0, gpa, 0, 0);
+ vm_set_user_memory_region(vm, i + 1, 0, gpa, PAGE_SIZE,
+ (void *)per_vcpu_gpa_aligned(NR_VCPUS));
+ vm_set_user_memory_region(vm, i + 1, 0, gpa, 0, 0);
+ vm_set_user_memory_region(vm, i + 1, 0, gpa, PAGE_SIZE, hva);
+ break;
+ }
+ }
+ return NULL;
+}
+
+static void sync_guest(int vcpu_id)
+{
+ struct ucall uc;
+
+ switch (get_ucall(vm, vcpu_id, &uc)) {
+ case UCALL_SYNC:
+ TEST_ASSERT(uc.args[1] == 0,
+ "Unexpected sync ucall, got %lx", uc.args[1]);
+ break;
+ case UCALL_ABORT:
+ TEST_FAIL("%s at %s:%ld\n\tvalues: %#lx, %#lx",
+ (const char *)uc.args[0],
+ __FILE__, uc.args[1], uc.args[2], uc.args[3]);
+ break;
+ default:
+ TEST_FAIL("Unexpected userspace exit, reason = %s\n",
+ exit_reason_str(vcpu_state(vm, vcpu_id)->exit_reason));
+ break;
+ }
+}
+
+static void *vcpu_worker(void *data)
+{
+ int vcpu_id = (unsigned long)data;
+
+ vcpu_args_set(vm, vcpu_id, 1, vcpu_id);
+
+ while (!READ_ONCE(fight))
+ cpu_relax();
+
+ usleep(10);
+
+ vcpu_run(vm, vcpu_id);
+
+ sync_guest(vcpu_id);
+
+ return NULL;
+}
+
+int main(int argc, char *argv[])
+{
+ uint64_t *hva;
+ uint64_t gpa;
+ void *r;
+ int i;
+
+ srandom(time(0));
+
+ vm = vm_create_default_with_vcpus(NR_VCPUS, 0, 0, guest_code, NULL);
+ ucall_init(vm, NULL);
+
+ pthread_create(&memory_thread, NULL, memory_worker, 0);
+
+ for (i = 0; i < NR_VCPUS; i++) {
+ pthread_create(&vcpu_threads[i], NULL, vcpu_worker, (void *)(unsigned long)i);
+
+ gpa = per_vcpu_gpa_aligned(i);
+ hva = (void *)gpa;
+ r = mmap(hva, PAGE_SIZE, PROT_READ | PROT_WRITE,
+ MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ TEST_ASSERT(r != MAP_FAILED, "mmap() '%lx' failed, errno = %d (%s)",
+ gpa, errno, strerror(errno));
+
+ vm_set_user_memory_region(vm, i + 1, 0, gpa, PAGE_SIZE, hva);
+ }
+
+ WRITE_ONCE(fight, true);
+
+ for (i = 0; i < NR_VCPUS; i++)
+ pthread_join(vcpu_threads[i], NULL);
+
+ pthread_join(memory_thread, NULL);
+
+ for (i = 0; i < NR_VCPUS; i++) {
+ gpa = per_vcpu_gpa(i);
+ hva = (void *)gpa;
+
+ TEST_ASSERT(*hva == 0 || *hva == gpa,
+ "Want '0' or '%lx', got '%lx'\n", gpa, *hva);
+ }
+
+ gpa = vcpu_get_msr(vm, 0, 0xdeadbeefu);
+ hva = (void *)gpa;
+ if (gpa != 0xdeadbeefu)
+ TEST_ASSERT(*hva == gpa, "Want '%lx', got '%lx'\n", gpa, *hva);
+
+ kvm_vm_free(vm);
+
+ return 0;
+}
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 3cb439b505b4..7881e6e6d91a 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -372,6 +372,8 @@ int kvm_gfn_to_pfn_cache_init(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
list_add(&gpc->list, &kvm->gpc_list);
spin_unlock(&kvm->gpc_lock);
}
+ if (!len)
+ return -EINVAL;
return kvm_gfn_to_pfn_cache_refresh(kvm, gpc, gpa, len);
}
EXPORT_SYMBOL_GPL(kvm_gfn_to_pfn_cache_init);
--
2.36.0.rc2.479.g8af0fa9b8e-goog
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH v2 6/8] KVM: Fix multiple races in gfn=>pfn cache refresh
2022-04-27 1:40 ` [PATCH v2 6/8] KVM: Fix multiple races in gfn=>pfn cache refresh Sean Christopherson
@ 2022-04-27 14:10 ` Sean Christopherson
2022-04-28 3:39 ` Lai Jiangshan
` (2 subsequent siblings)
3 siblings, 0 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-04-27 14:10 UTC (permalink / raw)
To: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, kvm, linux-kernel, David Woodhouse, Mingwei Zhang,
Maxim Levitsky
On Wed, Apr 27, 2022, Sean Christopherson wrote:
> Finally, the refresh logic doesn't protect against concurrent refreshes
> with different GPAs (which may or may not be a desired use case, but its
> allowed in the code), nor does it protect against a false negative on the
> memslot generation. If the first refresh sees a stale memslot generation,
> it will refresh the hva and generation before moving on to the hva=>pfn
> translation. If it then drops gpc->lock, a different user can come along,
> acquire gpc->lock, see that the memslot generation is fresh, and skip
> the hva=>pfn update due to the userspace address also matching (because
> it too was updated). Address this race by adding an "in-progress" flag
> so that the refresh that acquires gpc->lock first runs to completion
> before other users can start their refresh.
...
> @@ -159,10 +249,23 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
>
> write_lock_irq(&gpc->lock);
>
> + /*
> + * If another task is refreshing the cache, wait for it to complete.
> + * There is no guarantee that concurrent refreshes will see the same
> + * gpa, memslots generation, etc..., so they must be fully serialized.
> + */
> + while (gpc->refresh_in_progress) {
> + write_unlock_irq(&gpc->lock);
> +
> + cond_resched();
> +
> + write_lock_irq(&gpc->lock);
> + }
> + gpc->refresh_in_progress = true;
Adding refresh_in_progress can likely go in a separate patch. I'll plan on doing
that in a v3 unless it proves to be painful.
> @@ -246,9 +296,26 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
> }
>
> out:
> + /*
> + * Invalidate the cache and purge the pfn/khva if the refresh failed.
> + * Some/all of the uhva, gpa, and memslot generation info may still be
> + * valid, leave it as is.
> + */
> + if (ret) {
> + gpc->valid = false;
> + gpc->pfn = KVM_PFN_ERR_FAULT;
> + gpc->khva = NULL;
> + }
> +
> + gpc->refresh_in_progress = false;
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 8/8] DO NOT MERGE: Hack-a-test to verify gpc invalidation+refresh
2022-04-27 1:40 ` [PATCH v2 8/8] DO NOT MERGE: Hack-a-test to verify gpc invalidation+refresh Sean Christopherson
@ 2022-04-27 15:17 ` David Woodhouse
2022-04-27 20:23 ` Sean Christopherson
0 siblings, 1 reply; 27+ messages in thread
From: David Woodhouse @ 2022-04-27 15:17 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
linux-kernel, Mingwei Zhang, Maxim Levitsky
On Wed, 2022-04-27 at 01:40 +0000, Sean Christopherson wrote:
> Add a VM-wide gfn=>pfn cache and a fake MSR to let userspace control the
> cache. On writes, reflect the value of the MSR into the backing page of
> a gfn=>pfn cache so that userspace can detect if a value was written to
> the wrong page, i.e. to a stale mapping.
>
> Spin up 16 vCPUs (arbitrary) to use/refresh the cache, and another thread
> to trigger mmu_notifier events and memslot updates.
Do you need the MSR hack? Can't you exercise this using Xen interrupt
delivery or runstate information and the same kind of thread setup?
Thanks for working on this!
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 8/8] DO NOT MERGE: Hack-a-test to verify gpc invalidation+refresh
2022-04-27 15:17 ` David Woodhouse
@ 2022-04-27 20:23 ` Sean Christopherson
0 siblings, 0 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-04-27 20:23 UTC (permalink / raw)
To: David Woodhouse
Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, kvm, linux-kernel, Mingwei Zhang, Maxim Levitsky
On Wed, Apr 27, 2022, David Woodhouse wrote:
> On Wed, 2022-04-27 at 01:40 +0000, Sean Christopherson wrote:
> > Add a VM-wide gfn=>pfn cache and a fake MSR to let userspace control the
> > cache. On writes, reflect the value of the MSR into the backing page of
> > a gfn=>pfn cache so that userspace can detect if a value was written to
> > the wrong page, i.e. to a stale mapping.
> >
> > Spin up 16 vCPUs (arbitrary) to use/refresh the cache, and another thread
> > to trigger mmu_notifier events and memslot updates.
>
> Do you need the MSR hack? Can't you exercise this using Xen interrupt
> delivery or runstate information and the same kind of thread setup?
Yeah, I assume it's possible, and medium/long term I definitely want to have a
proper test. I went the hack route to get something that could hammer a cache
with minimal chance of a test bug. I only have a rough idea of what the Xen stuff
does.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 6/8] KVM: Fix multiple races in gfn=>pfn cache refresh
2022-04-27 1:40 ` [PATCH v2 6/8] KVM: Fix multiple races in gfn=>pfn cache refresh Sean Christopherson
2022-04-27 14:10 ` Sean Christopherson
@ 2022-04-28 3:39 ` Lai Jiangshan
2022-04-28 14:33 ` Sean Christopherson
2022-05-20 14:49 ` Paolo Bonzini
2024-08-05 11:04 ` David Woodhouse
3 siblings, 1 reply; 27+ messages in thread
From: Lai Jiangshan @ 2022-04-28 3:39 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, kvm, LKML, David Woodhouse, Mingwei Zhang,
Maxim Levitsky
On Wed, Apr 27, 2022 at 7:16 PM Sean Christopherson <seanjc@google.com> wrote:
> @@ -159,10 +249,23 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
>
The following refresh_in_progress code is somewhat like a mutex:
+ mutex_lock(&gpc->refresh_in_progress); // before write_lock_irq(&gpc->lock);
Would a mutex fit the intention better?
Thanks
Lai
> write_lock_irq(&gpc->lock);
>
> + /*
> + * If another task is refreshing the cache, wait for it to complete.
> + * There is no guarantee that concurrent refreshes will see the same
> + * gpa, memslots generation, etc..., so they must be fully serialized.
> + */
> + while (gpc->refresh_in_progress) {
> + write_unlock_irq(&gpc->lock);
> +
> + cond_resched();
> +
> + write_lock_irq(&gpc->lock);
> + }
> + gpc->refresh_in_progress = true;
> +
> old_pfn = gpc->pfn;
> old_khva = gpc->khva - offset_in_page(gpc->khva);
> old_uhva = gpc->uhva;
> - old_valid = gpc->valid;
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 6/8] KVM: Fix multiple races in gfn=>pfn cache refresh
2022-04-28 3:39 ` Lai Jiangshan
@ 2022-04-28 14:33 ` Sean Christopherson
0 siblings, 0 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-04-28 14:33 UTC (permalink / raw)
To: Lai Jiangshan
Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
Joerg Roedel, kvm, LKML, David Woodhouse, Mingwei Zhang,
Maxim Levitsky
On Thu, Apr 28, 2022, Lai Jiangshan wrote:
> On Wed, Apr 27, 2022 at 7:16 PM Sean Christopherson <seanjc@google.com> wrote:
>
> > @@ -159,10 +249,23 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
> >
>
> The following code of refresh_in_progress is somewhat like mutex.
>
> + mutex_lock(&gpc->refresh_in_progress); // before write_lock_irq(&gpc->lock);
>
> Is it fit for the intention?
Yeah, a mutex should work. Not sure why I shied away from a mutex...
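Something like the below, I assume (untested sketch; the field name is made up):

	/* in struct gfn_to_pfn_cache */
	struct mutex refresh_mutex;

	/* in kvm_gfn_to_pfn_cache_refresh(), replacing the busy-wait loop */
	mutex_lock(&gpc->refresh_mutex);
	write_lock_irq(&gpc->lock);
	...
	write_unlock_irq(&gpc->lock);
	mutex_unlock(&gpc->refresh_mutex);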
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 6/8] KVM: Fix multiple races in gfn=>pfn cache refresh
2022-04-27 1:40 ` [PATCH v2 6/8] KVM: Fix multiple races in gfn=>pfn cache refresh Sean Christopherson
2022-04-27 14:10 ` Sean Christopherson
2022-04-28 3:39 ` Lai Jiangshan
@ 2022-05-20 14:49 ` Paolo Bonzini
2022-05-20 14:58 ` Paolo Bonzini
2022-05-20 14:59 ` Sean Christopherson
2024-08-05 11:04 ` David Woodhouse
3 siblings, 2 replies; 27+ messages in thread
From: Paolo Bonzini @ 2022-05-20 14:49 UTC (permalink / raw)
To: Sean Christopherson
Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
linux-kernel, David Woodhouse, Mingwei Zhang, Maxim Levitsky
On 4/27/22 03:40, Sean Christopherson wrote:
> + * Wait for mn_active_invalidate_count, not mmu_notifier_count,
> + * to go away, as the invalidation in the mmu_notifier event
> + * occurs _before_ mmu_notifier_count is elevated.
> + *
> + * Note, mn_active_invalidate_count can change at any time as
> + * it's not protected by gpc->lock. But, it is guaranteed to
> + * be elevated before the mmu_notifier acquires gpc->lock, and
> + * isn't dropped until after mmu_notifier_seq is updated. So,
> + * this task may get a false positive of sorts, i.e. see an
> + * elevated count and wait even though it's technically safe to
> + * proceed (because the mmu_notifier will invalidate the cache
> + * _after_ it's refreshed here), but the cache will never be
> + * refreshed with stale data, i.e. won't get false negatives.
I am all for lavish comments, but I think this is even too detailed. What about:
/*
* mn_active_invalidate_count acts for all intents and purposes
* like mmu_notifier_count here; but we cannot use the latter
* because the invalidation in the mmu_notifier event occurs
* _before_ mmu_notifier_count is elevated.
*
* Note, it does not matter that mn_active_invalidate_count
* is not protected by gpc->lock. It is guaranteed to
* be elevated before the mmu_notifier acquires gpc->lock, and
* isn't dropped until after mmu_notifier_seq is updated.
*/
Paolo
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 6/8] KVM: Fix multiple races in gfn=>pfn cache refresh
2022-05-20 14:49 ` Paolo Bonzini
@ 2022-05-20 14:58 ` Paolo Bonzini
2022-05-20 15:02 ` Sean Christopherson
2022-05-20 14:59 ` Sean Christopherson
1 sibling, 1 reply; 27+ messages in thread
From: Paolo Bonzini @ 2022-05-20 14:58 UTC (permalink / raw)
To: Sean Christopherson
Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
linux-kernel, David Woodhouse, Mingwei Zhang, Maxim Levitsky
On 5/20/22 16:49, Paolo Bonzini wrote:
> On 4/27/22 03:40, Sean Christopherson wrote:
>> + * Wait for mn_active_invalidate_count, not mmu_notifier_count,
>> + * to go away, as the invalidation in the mmu_notifier event
>> + * occurs _before_ mmu_notifier_count is elevated.
>> + *
>> + * Note, mn_active_invalidate_count can change at any time as
>> + * it's not protected by gpc->lock. But, it is guaranteed to
>> + * be elevated before the mmu_notifier acquires gpc->lock, and
>> + * isn't dropped until after mmu_notifier_seq is updated. So,
>> + * this task may get a false positive of sorts, i.e. see an
>> + * elevated count and wait even though it's technically safe to
>> + * proceed (because the mmu_notifier will invalidate the cache
>> + * _after_ it's refreshed here), but the cache will never be
>> + * refreshed with stale data, i.e. won't get false negatives.
>
> I am all for lavish comments, but I think this is even too detailed.
> What about:
And in fact this should be moved to a separate function.
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 50ce7b78b42f..321964ff42e1 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -112,6 +112,36 @@ static void gpc_release_pfn_and_khva(struct kvm *kvm, kvm_pfn_t pfn, void *khva)
}
}
+
+static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_seq)
+{
+ /*
+ * mn_active_invalidate_count acts for all intents and purposes
+ * like mmu_notifier_count here; but we cannot use the latter
+ * because the invalidation in the mmu_notifier event occurs
+ * _before_ mmu_notifier_count is elevated.
+ *
+ * Note, it does not matter that mn_active_invalidate_count
+ * is not protected by gpc->lock. It is guaranteed to
+ * be elevated before the mmu_notifier acquires gpc->lock, and
+ * isn't dropped until after mmu_notifier_seq is updated.
+ */
+ if (kvm->mn_active_invalidate_count)
+ return true;
+
+ /*
+ * Ensure mn_active_invalidate_count is read before
+ * mmu_notifier_seq. This pairs with the smp_wmb() in
+ * mmu_notifier_invalidate_range_end() to guarantee either the
+ * old (non-zero) value of mn_active_invalidate_count or the
+ * new (incremented) value of mmu_notifier_seq is observed.
+ */
+ smp_rmb();
+ if (kvm->mmu_notifier_seq != mmu_seq)
+ return true;
+ return false;
+}
+
static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc)
{
/* Note, the new page offset may be different than the old! */
@@ -129,7 +159,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc)
*/
gpc->valid = false;
- for (;;) {
+ do {
mmu_seq = kvm->mmu_notifier_seq;
smp_rmb();
@@ -188,32 +218,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc)
* attempting to refresh.
*/
WARN_ON_ONCE(gpc->valid);
-
- /*
- * mn_active_invalidate_count acts for all intents and purposes
- * like mmu_notifier_count here; but we cannot use the latter
- * because the invalidation in the mmu_notifier event occurs
- * _before_ mmu_notifier_count is elevated.
- *
- * Note, it does not matter that mn_active_invalidate_count
- * is not protected by gpc->lock. It is guaranteed to
- * be elevated before the mmu_notifier acquires gpc->lock, and
- * isn't dropped until after mmu_notifier_seq is updated.
- */
- if (kvm->mn_active_invalidate_count)
- continue;
-
- /*
- * Ensure mn_active_invalidate_count is read before
- * mmu_notifier_seq. This pairs with the smp_wmb() in
- * mmu_notifier_invalidate_range_end() to guarantee either the
- * old (non-zero) value of mn_active_invalidate_count or the
- * new (incremented) value of mmu_notifier_seq is observed.
- */
- smp_rmb();
- if (kvm->mmu_notifier_seq == mmu_seq)
- break;
- }
+ } while (mmu_notifier_retry_cache(kvm, mmu_seq));
gpc->valid = true;
gpc->pfn = new_pfn;
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH v2 6/8] KVM: Fix multiple races in gfn=>pfn cache refresh
2022-05-20 14:49 ` Paolo Bonzini
2022-05-20 14:58 ` Paolo Bonzini
@ 2022-05-20 14:59 ` Sean Christopherson
1 sibling, 0 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-05-20 14:59 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
linux-kernel, David Woodhouse, Mingwei Zhang, Maxim Levitsky
On Fri, May 20, 2022, Paolo Bonzini wrote:
> On 4/27/22 03:40, Sean Christopherson wrote:
> > + * Wait for mn_active_invalidate_count, not mmu_notifier_count,
> > + * to go away, as the invalidation in the mmu_notifier event
> > + * occurs _before_ mmu_notifier_count is elevated.
> > + *
> > + * Note, mn_active_invalidate_count can change at any time as
> > + * it's not protected by gpc->lock. But, it is guaranteed to
> > + * be elevated before the mmu_notifier acquires gpc->lock, and
> > + * isn't dropped until after mmu_notifier_seq is updated. So,
> > + * this task may get a false positive of sorts, i.e. see an
> > + * elevated count and wait even though it's technically safe to
> > + * proceed (because the mmu_notifier will invalidate the cache
> > + * _after_ it's refreshed here), but the cache will never be
> > + * refreshed with stale data, i.e. won't get false negatives.
>
> I am all for lavish comments, but I think this is even too detailed.
Yeah, the false positive/negative stuff is probably overkill.
> What about:
>
> /*
> * mn_active_invalidate_count acts for all intents and purposes
> * like mmu_notifier_count here; but we cannot use the latter
> * because the invalidation in the mmu_notifier event occurs
> * _before_ mmu_notifier_count is elevated.
Looks good, though I'd prefer to avoid the "we", and explicitly call out that it's
the invalidation of the caches.
/*
* mn_active_invalidate_count acts for all intents and purposes
* like mmu_notifier_count here; but the latter cannot be used
* here because the invalidation of caches in the mmu_notifier
* event occurs _before_ mmu_notifier_count is elevated.
*
* Note, it does not matter that mn_active_invalidate_count
* is not protected by gpc->lock. It is guaranteed to
* be elevated before the mmu_notifier acquires gpc->lock, and
* isn't dropped until after mmu_notifier_seq is updated.
*/
Also, you'll definitely want to look at v3 of this series. I'm 99% certain I didn't
change the comment though :-)
https://lore.kernel.org/all/20220429210025.3293691-1-seanjc@google.com
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 6/8] KVM: Fix multiple races in gfn=>pfn cache refresh
2022-05-20 14:58 ` Paolo Bonzini
@ 2022-05-20 15:02 ` Sean Christopherson
0 siblings, 0 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-05-20 15:02 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
linux-kernel, David Woodhouse, Mingwei Zhang, Maxim Levitsky
On Fri, May 20, 2022, Paolo Bonzini wrote:
> On 5/20/22 16:49, Paolo Bonzini wrote:
> > On 4/27/22 03:40, Sean Christopherson wrote:
> > > + * Wait for mn_active_invalidate_count, not mmu_notifier_count,
> > > + * to go away, as the invalidation in the mmu_notifier event
> > > + * occurs _before_ mmu_notifier_count is elevated.
> > > + *
> > > + * Note, mn_active_invalidate_count can change at any time as
> > > + * it's not protected by gpc->lock. But, it is guaranteed to
> > > + * be elevated before the mmu_notifier acquires gpc->lock, and
> > > + * isn't dropped until after mmu_notifier_seq is updated. So,
> > > + * this task may get a false positive of sorts, i.e. see an
> > > + * elevated count and wait even though it's technically safe to
> > > + * proceed (because the mmu_notifier will invalidate the cache
> > > + * _after_ it's refreshed here), but the cache will never be
> > > + * refreshed with stale data, i.e. won't get false negatives.
> >
> > I am all for lavish comments, but I think this is even too detailed.
> > What about:
>
> And in fact this should be moved to a separate function.
>
> diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
> index 50ce7b78b42f..321964ff42e1 100644
> --- a/virt/kvm/pfncache.c
> +++ b/virt/kvm/pfncache.c
> @@ -112,6 +112,36 @@ static void gpc_release_pfn_and_khva(struct kvm *kvm, kvm_pfn_t pfn, void *khva)
> }
> }
> +
> +static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_seq)
> +{
> + /*
> + * mn_active_invalidate_count acts for all intents and purposes
> + * like mmu_notifier_count here; but we cannot use the latter
> + * because the invalidation in the mmu_notifier event occurs
> + * _before_ mmu_notifier_count is elevated.
> + *
> + * Note, it does not matter that mn_active_invalidate_count
> + * is not protected by gpc->lock. It is guaranteed to
> + * be elevated before the mmu_notifier acquires gpc->lock, and
> + * isn't dropped until after mmu_notifier_seq is updated.
> + */
> + if (kvm->mn_active_invalidate_count)
> + return true;
> +
> + /*
> + * Ensure mn_active_invalidate_count is read before
> + * mmu_notifier_seq. This pairs with the smp_wmb() in
> + * mmu_notifier_invalidate_range_end() to guarantee either the
> + * old (non-zero) value of mn_active_invalidate_count or the
> + * new (incremented) value of mmu_notifier_seq is observed.
> + */
> + smp_rmb();
> + if (kvm->mmu_notifier_seq != mmu_seq)
> + return true;
> + return false;
This can be
return kvm->mmu_notifier_seq != mmu_seq;
Looks good otherwise. It'll probably yield a smaller diff too.
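For reference, folding that simplification in, the helper would end up looking
something like this (untested sketch assembled from the diff above):

static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_seq)
{
	/*
	 * mn_active_invalidate_count acts for all intents and purposes
	 * like mmu_notifier_count here; but the latter cannot be used
	 * because the invalidation of caches in the mmu_notifier event
	 * occurs _before_ mmu_notifier_count is elevated.
	 *
	 * Note, it does not matter that mn_active_invalidate_count is
	 * not protected by gpc->lock.  It is guaranteed to be elevated
	 * before the mmu_notifier acquires gpc->lock, and isn't dropped
	 * until after mmu_notifier_seq is updated.
	 */
	if (kvm->mn_active_invalidate_count)
		return true;

	/*
	 * Ensure mn_active_invalidate_count is read before
	 * mmu_notifier_seq.  This pairs with the smp_wmb() in
	 * mmu_notifier_invalidate_range_end() to guarantee either the
	 * old (non-zero) value of mn_active_invalidate_count or the
	 * new (incremented) value of mmu_notifier_seq is observed.
	 */
	smp_rmb();
	return kvm->mmu_notifier_seq != mmu_seq;
}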
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 6/8] KVM: Fix multiple races in gfn=>pfn cache refresh
2022-04-27 1:40 ` [PATCH v2 6/8] KVM: Fix multiple races in gfn=>pfn cache refresh Sean Christopherson
` (2 preceding siblings ...)
2022-05-20 14:49 ` Paolo Bonzini
@ 2024-08-05 11:04 ` David Woodhouse
2024-08-05 11:08 ` [PATCH] KVM: Move gfn_to_pfn_cache invalidation to invalidate_range_end hook David Woodhouse
3 siblings, 1 reply; 27+ messages in thread
From: David Woodhouse @ 2024-08-05 11:04 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Hussain, Mushahid
Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
linux-kernel, Mingwei Zhang, Maxim Levitsky
[-- Attachment #1: Type: text/plain, Size: 5910 bytes --]
On Wed, 2022-04-27 at 01:40 +0000, Sean Christopherson wrote:
> Rework the gfn=>pfn cache (gpc) refresh logic to address multiple races
> between the cache itself, and between the cache and mmu_notifier events.
>
> The existing refresh code attempts to guard against races with the
> mmu_notifier by speculatively marking the cache valid, and then marking
> it invalid if a mmu_notifier invalidation occurs. That handles the case
> where an invalidation occurs between dropping and re-acquiring gpc->lock,
> but it doesn't handle the scenario where the cache is refreshed after the
> cache was invalidated by the notifier, but before the notifier elevates
> mmu_notifier_count. The gpc refresh can't use the "retry" helper as its
> invalidation occurs _before_ mmu_notifier_count is elevated and before
> mmu_notifier_range_start is set/updated.
>
> CPU0 CPU1
> ---- ----
>
> gfn_to_pfn_cache_invalidate_start()
> |
> -> gpc->valid = false;
> kvm_gfn_to_pfn_cache_refresh()
> |
> |-> gpc->valid = true;
>
> hva_to_pfn_retry()
> |
> -> acquire kvm->mmu_lock
> kvm->mmu_notifier_count == 0
> mmu_seq == kvm->mmu_notifier_seq
> drop kvm->mmu_lock
> return pfn 'X'
> acquire kvm->mmu_lock
> kvm_inc_notifier_count()
> drop kvm->mmu_lock()
> kernel frees pfn 'X'
> kvm_gfn_to_pfn_cache_check()
> |
> |-> gpc->valid == true
>
> caller accesses freed pfn 'X'
>
> Key off of mn_active_invalidate_count to detect that a pfncache refresh
> needs to wait for an in-progress mmu_notifier invalidation. While
> mn_active_invalidate_count is not guaranteed to be stable, it is
> guaranteed to be elevated prior to an invalidation acquiring gpc->lock,
> so either the refresh will see an active invalidation and wait, or the
> invalidation will run after the refresh completes.
>
> Speculatively marking the cache valid is itself flawed, as a concurrent
> kvm_gfn_to_pfn_cache_check() would see a valid cache with stale pfn/khva
> values. The KVM Xen use case explicitly allows/wants multiple users;
> even though the caches are allocated per vCPU, __kvm_xen_has_interrupt()
> can read a different vCPU (or vCPUs). Address this race by invalidating
> the cache prior to dropping gpc->lock (this is made possible by fixing
> the above mmu_notifier race).
>
> Finally, the refresh logic doesn't protect against concurrent refreshes
> with different GPAs (which may or may not be a desired use case, but it's
> allowed in the code), nor does it protect against a false negative on the
> memslot generation. If the first refresh sees a stale memslot generation,
> it will refresh the hva and generation before moving on to the hva=>pfn
> translation. If it then drops gpc->lock, a different user can come along,
> acquire gpc->lock, see that the memslot generation is fresh, and skip
> the hva=>pfn update due to the userspace address also matching (because
> it too was updated). Address this race by adding an "in-progress" flag
> so that the refresh that acquires gpc->lock first runs to completion
> before other users can start their refresh.
>
> Complicating all of this is the fact that both the hva=>pfn resolution
> and mapping of the kernel address can sleep, i.e. must be done outside
> of gpc->lock
>
> Fix the above races in one fell swoop, trying to fix each individual race
> in a sane manner is impossible, for all intents and purposes.
Hm, the problem with this (commit 58cd407ca4c6278) is that it ends up
acting like a *really* unfair read/write lock.
If there are a lot of 'writers' invalidating other HVA ranges, then the
hva_to_pfn_retry() function as the 'reader' will back off for ever and
never make progress. Even while a single other invalidation is active
(->mn_active_invalidate_count is non-zero), hva_to_pfn_retry() will
just spin in an elaborate busy-wait loop, mapping the same page over and
over again.
(In the repro case I'm not entirely sure *why* there are userspace
threads bashing on MADV_DONTNEED and causing this, but 'Don't Do That
Then' is only a partial answer. The kernel's own locking should allow
it to make progress regardless.)
If we call the GPC invalidate function from invalidate_range_end()
instead of _start, can't we do it concurrently without having to check
for active invalidations or mmu_invalidate_seq?
We can then introduce a 'validating' flag, set before the attempt to
hva_to_pfn() in the loop, so that the translation can be shot down
before it's even made. And use *that* as the trigger for the retry loop
so that it only has to retry if its *own* uHVA is messed with.
Patch follows...
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH] KVM: Move gfn_to_pfn_cache invalidation to invalidate_range_end hook
2024-08-05 11:04 ` David Woodhouse
@ 2024-08-05 11:08 ` David Woodhouse
2024-08-06 0:45 ` Sean Christopherson
0 siblings, 1 reply; 27+ messages in thread
From: David Woodhouse @ 2024-08-05 11:08 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Hussain, Mushahid
Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
linux-kernel, Mingwei Zhang, Maxim Levitsky
[-- Attachment #1: Type: text/plain, Size: 9632 bytes --]
From: David Woodhouse <dwmw@amazon.co.uk>
The existing retry loop in hva_to_pfn_retry() is extremely pessimistic.
If there is an invalidation running concurrently, it is effectively just
a complex busy wait loop because its local mmu_notifier_retry_cache()
function will always return true.
It ends up functioning as a very unfair read/write lock. If userspace is
acting as a 'writer', performing many unrelated MM changes, then the
hva_to_pfn_retry() function acting as the 'reader' just backs off and
keeps retrying forever, not making any progress.
Solve this by introducing a separate 'validating' flag to the GPC, so
that it can be marked invalid before it's even mapped. This allows the
invalidation to be moved to the range_end hook, and the retry loop in
hva_to_pfn_retry() can be changed to loop only if its particular uHVA
has been affected.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
I note I'm deleting a big comment in kvm_main.c about doing the
invalidation before acquiring mmu_lock. But we don't hold the lock
in the range_end callback either, do we?
include/linux/kvm_types.h | 1 +
virt/kvm/kvm_main.c | 14 ++------
virt/kvm/kvm_mm.h | 12 +++----
virt/kvm/pfncache.c | 75 +++++++++++++++++++--------------------
4 files changed, 45 insertions(+), 57 deletions(-)
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 827ecc0b7e10..30ed1019cfc6 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -69,6 +69,7 @@ struct gfn_to_pfn_cache {
void *khva;
kvm_pfn_t pfn;
bool active;
+ bool validating;
bool valid;
};
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d0788d0a72cc..ffd6ab4c2a16 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -777,18 +777,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
kvm->mn_active_invalidate_count++;
spin_unlock(&kvm->mn_invalidate_lock);
- /*
- * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e.
- * before acquiring mmu_lock, to avoid holding mmu_lock while acquiring
- * each cache's lock. There are relatively few caches in existence at
- * any given time, and the caches themselves can check for hva overlap,
- * i.e. don't need to rely on memslot overlap checks for performance.
- * Because this runs without holding mmu_lock, the pfn caches must use
- * mn_active_invalidate_count (see above) instead of
- * mmu_invalidate_in_progress.
- */
- gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end);
-
/*
* If one or more memslots were found and thus zapped, notify arch code
* that guest memory has been reclaimed. This needs to be done *after*
@@ -849,6 +837,8 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
wake = !kvm->mn_active_invalidate_count;
spin_unlock(&kvm->mn_invalidate_lock);
+ gfn_to_pfn_cache_invalidate(kvm, range->start, range->end);
+
/*
* There can only be one waiter, since the wait happens under
* slots_lock.
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index 715f19669d01..34e4e67f09f8 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -24,13 +24,13 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
bool *async, bool write_fault, bool *writable);
#ifdef CONFIG_HAVE_KVM_PFNCACHE
-void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
- unsigned long start,
- unsigned long end);
+void gfn_to_pfn_cache_invalidate(struct kvm *kvm,
+ unsigned long start,
+ unsigned long end);
#else
-static inline void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
- unsigned long start,
- unsigned long end)
+static inline void gfn_to_pfn_cache_invalidate(struct kvm *kvm,
+ unsigned long start,
+ unsigned long end)
{
}
#endif /* HAVE_KVM_PFNCACHE */
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index f0039efb9e1e..187b58150ef7 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -20,10 +20,14 @@
#include "kvm_mm.h"
/*
- * MMU notifier 'invalidate_range_start' hook.
+ * MMU notifier 'invalidate_range_end' hook. The hva_to_pfn_retry() function
+ * below may look up a PFN just before it is zapped, and may be mapping it
+ * concurrently (with the GPC lock dropped). By using a separate 'validating'
+ * flag, the invalidation can occur concurrently, causing hva_to_pfn_retry()
+ * to drop its result and retry correctly.
*/
-void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start,
- unsigned long end)
+void gfn_to_pfn_cache_invalidate(struct kvm *kvm, unsigned long start,
+ unsigned long end)
{
struct gfn_to_pfn_cache *gpc;
@@ -32,7 +36,7 @@ void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start,
read_lock_irq(&gpc->lock);
/* Only a single page so no need to care about length */
- if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) &&
+ if (gpc->validating && !is_error_noslot_pfn(gpc->pfn) &&
gpc->uhva >= start && gpc->uhva < end) {
read_unlock_irq(&gpc->lock);
@@ -45,9 +49,11 @@ void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start,
*/
write_lock_irq(&gpc->lock);
- if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) &&
- gpc->uhva >= start && gpc->uhva < end)
+ if (gpc->validating && !is_error_noslot_pfn(gpc->pfn) &&
+ gpc->uhva >= start && gpc->uhva < end) {
+ gpc->validating = false;
gpc->valid = false;
+ }
write_unlock_irq(&gpc->lock);
continue;
}
@@ -93,6 +99,9 @@ bool kvm_gpc_check(struct gfn_to_pfn_cache *gpc, unsigned long len)
if (!gpc->valid)
return false;
+ /* It can never be valid unless it was once validating! */
+ WARN_ON_ONCE(!gpc->validating);
+
return true;
}
@@ -124,41 +133,12 @@ static void gpc_unmap(kvm_pfn_t pfn, void *khva)
#endif
}
-static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_seq)
-{
- /*
- * mn_active_invalidate_count acts for all intents and purposes
- * like mmu_invalidate_in_progress here; but the latter cannot
- * be used here because the invalidation of caches in the
- * mmu_notifier event occurs _before_ mmu_invalidate_in_progress
- * is elevated.
- *
- * Note, it does not matter that mn_active_invalidate_count
- * is not protected by gpc->lock. It is guaranteed to
- * be elevated before the mmu_notifier acquires gpc->lock, and
- * isn't dropped until after mmu_invalidate_seq is updated.
- */
- if (kvm->mn_active_invalidate_count)
- return true;
-
- /*
- * Ensure mn_active_invalidate_count is read before
- * mmu_invalidate_seq. This pairs with the smp_wmb() in
- * mmu_notifier_invalidate_range_end() to guarantee either the
- * old (non-zero) value of mn_active_invalidate_count or the
- * new (incremented) value of mmu_invalidate_seq is observed.
- */
- smp_rmb();
- return kvm->mmu_invalidate_seq != mmu_seq;
-}
-
static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
{
/* Note, the new page offset may be different than the old! */
void *old_khva = (void *)PAGE_ALIGN_DOWN((uintptr_t)gpc->khva);
kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
void *new_khva = NULL;
- unsigned long mmu_seq;
lockdep_assert_held(&gpc->refresh_lock);
@@ -172,8 +152,16 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
gpc->valid = false;
do {
- mmu_seq = gpc->kvm->mmu_invalidate_seq;
- smp_rmb();
+ /*
+ * The translation made by hva_to_pfn() below could be made
+ * invalid as soon as it's mapped. But the uhva is already
+ * known and that's all that gfn_to_pfn_cache_invalidate()
+ * looks at. So set the 'validating' flag to allow the GPC
+ * to be marked invalid from the moment the lock is dropped,
+ * before the corresponding PFN is even found (and, more to
+ * the point, immediately afterwards).
+ */
+ gpc->validating = true;
write_unlock_irq(&gpc->lock);
@@ -224,7 +212,14 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
* attempting to refresh.
*/
WARN_ON_ONCE(gpc->valid);
- } while (mmu_notifier_retry_cache(gpc->kvm, mmu_seq));
+
+ /*
+ * Since gfn_to_pfn_cache_invalidate() is called from the
+ * kvm_mmu_notifier_invalidate_range_end() callback, it can
+ * invalidate the GPC the moment after hva_to_pfn() returned
+ * a valid PFN. If that happens, retry.
+ */
+ } while (!gpc->validating);
gpc->valid = true;
gpc->pfn = new_pfn;
@@ -339,6 +334,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
*/
if (ret) {
gpc->valid = false;
+ gpc->validating = false;
gpc->pfn = KVM_PFN_ERR_FAULT;
gpc->khva = NULL;
}
@@ -383,7 +379,7 @@ void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm)
gpc->pfn = KVM_PFN_ERR_FAULT;
gpc->gpa = INVALID_GPA;
gpc->uhva = KVM_HVA_ERR_BAD;
- gpc->active = gpc->valid = false;
+ gpc->active = gpc->valid = gpc->validating = false;
}
static int __kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long uhva,
@@ -453,6 +449,7 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc)
write_lock_irq(&gpc->lock);
gpc->active = false;
gpc->valid = false;
+ gpc->validating = false;
/*
* Leave the GPA => uHVA cache intact, it's protected by the
--
2.44.0
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH] KVM: Move gfn_to_pfn_cache invalidation to invalidate_range_end hook
2024-08-05 11:08 ` [PATCH] KVM: Move gfn_to_pfn_cache invalidation to invalidate_range_end hook David Woodhouse
@ 2024-08-06 0:45 ` Sean Christopherson
2024-08-06 9:06 ` David Woodhouse
0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2024-08-06 0:45 UTC (permalink / raw)
To: David Woodhouse
Cc: Paolo Bonzini, Mushahid Hussain, Vitaly Kuznetsov, Wanpeng Li,
Jim Mattson, Joerg Roedel, kvm, linux-kernel, Mingwei Zhang,
Maxim Levitsky
On Mon, Aug 05, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> The existing retry loop in hva_to_pfn_retry() is extremely pessimistic.
> If there is an invalidation running concurrently, it is effectively just
> a complex busy wait loop because its local mmu_notifier_retry_cache()
> function will always return true.
>
> It ends up functioning as a very unfair read/write lock. If userspace is
> acting as a 'writer', performing many unrelated MM changes, then the
> hva_to_pfn_retry() function acting as the 'reader' just backs off and
> keeps retrying forever, not making any progress.
>
> Solve this by introducing a separate 'validating' flag to the GPC, so
> that it can be marked invalid before it's even mapped. This allows the
> invalidation to be moved to the range_end hook, and the retry loop in
> hva_to_pfn_retry() can be changed to loop only if its particular uHVA
> has been affected.
I think I'm missing something. How does allowing hva_to_pfn_retry() to complete allow KVM
as a whole to make forward progress? Doesn't getting past hva_to_pfn_retry()
just move the problem to kvm_gpc_check()?
kvm_gpc_refresh() can't be called with gpc->lock held, and nor does it return
with gpc->lock held, so a racing mmu_notifier invalidation can/will acquire
gpc->lock and clear gpc->active, no?
Oh, by "unrelated", you mean _completely_ unrelated? As in, KVM happens to do a
refresh when userspace is blasting MADV_DONTNEED, and gets stuck retrying for
no good reason?
Servicing guest pages faults has the same problem, which is why
mmu_invalidate_retry_gfn() was added. Supporting hva-only GPCs made our lives a
little harder, but not horrifically so (there are ordering differences regardless).
Woefully incomplete, but I think this is the gist of what you want:
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index f0039efb9e1e..1c4c95ab7d0a 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -28,6 +28,26 @@ void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start,
struct gfn_to_pfn_cache *gpc;
spin_lock(&kvm->gpc_lock);
+
+ if (likely(kvm_is_error_hva(kvm->mmu_gpc_invalidate_range_start))) {
+ kvm->mmu_gpc_invalidate_range_start = start;
+ kvm->mmu_gpc_invalidate_range_end = end;
+ } else {
+ /*
+ * Fully tracking multiple concurrent ranges has diminishing
+ * returns. Keep things simple and just find the minimal range
+ * which includes the current and new ranges. As there won't be
+ * enough information to subtract a range after its invalidate
+ * completes, any ranges invalidated concurrently will
+ * accumulate and persist until all outstanding invalidates
+ * complete.
+ */
+ kvm->mmu_gpc_invalidate_range_start =
+ min(kvm->mmu_gpc_invalidate_range_start, start);
+ kvm->mmu_gpc_invalidate_range_end =
+ max(kvm->mmu_gpc_invalidate_range_end, end);
+ }
+
list_for_each_entry(gpc, &kvm->gpc_list, list) {
read_lock_irq(&gpc->lock);
@@ -124,8 +144,11 @@ static void gpc_unmap(kvm_pfn_t pfn, void *khva)
#endif
}
-static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_seq)
+static inline bool mmu_notifier_retry_cache(struct gfn_to_pfn_cache *gpc,
+ unsigned long mmu_seq)
{
+ struct kvm *kvm = gpc->kvm;
+
/*
* mn_active_invalidate_count acts for all intents and purposes
* like mmu_invalidate_in_progress here; but the latter cannot
@@ -138,7 +161,9 @@ static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_s
* be elevated before the mmu_notifier acquires gpc->lock, and
* isn't dropped until after mmu_invalidate_seq is updated.
*/
- if (kvm->mn_active_invalidate_count)
+ if (kvm->mn_active_invalidate_count &&
+ gpc->uhva >= kvm->mmu_gpc_invalidate_range_start &&
+ gpc->uhva < kvm->mmu_gpc_invalidate_range_end)
return true;
/*
@@ -224,7 +249,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
* attempting to refresh.
*/
WARN_ON_ONCE(gpc->valid);
- } while (mmu_notifier_retry_cache(gpc->kvm, mmu_seq));
+ } while (mmu_notifier_retry_cache(gpc, mmu_seq));
gpc->valid = true;
gpc->pfn = new_pfn;
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
> I note I'm deleting a big comment in kvm_main.c about doing the
> invalidation before acquiring mmu_lock. But we don't hold the lock
> in the range_end callback either, do we?
Correct, __kvm_handle_hva_range() acquires and releases mmu_lock. However, the
intent of the comment was to clarify why GPCs are invalidated in
kvm_mmu_notifier_invalidate_range_start(), as opposed to kvm_mmu_invalidate_begin()
which _is_ called under mmu_lock and is also called if and only if KVM has a
relevant memslot. E.g. that's why the comment also talks about memslot overlap
checks.
>
> include/linux/kvm_types.h | 1 +
> virt/kvm/kvm_main.c | 14 ++------
> virt/kvm/kvm_mm.h | 12 +++----
> virt/kvm/pfncache.c | 75 +++++++++++++++++++--------------------
> 4 files changed, 45 insertions(+), 57 deletions(-)
>
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 827ecc0b7e10..30ed1019cfc6 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -69,6 +69,7 @@ struct gfn_to_pfn_cache {
> void *khva;
> kvm_pfn_t pfn;
> bool active;
> + bool validating;
This is a confusing name, partly because KVM usually deals with invalidation
events, but also because it's sticky and stays set long after the act of
validating the GPC is complete.
Something like "needs_invalidation" is the best I can come up with, but I believe
this bikeshed is moot (see above and below).
> bool valid;
> };
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d0788d0a72cc..ffd6ab4c2a16 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -777,18 +777,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> kvm->mn_active_invalidate_count++;
> spin_unlock(&kvm->mn_invalidate_lock);
>
> - /*
> - * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e.
> - * before acquiring mmu_lock, to avoid holding mmu_lock while acquiring
> - * each cache's lock. There are relatively few caches in existence at
> - * any given time, and the caches themselves can check for hva overlap,
> - * i.e. don't need to rely on memslot overlap checks for performance.
> - * Because this runs without holding mmu_lock, the pfn caches must use
> - * mn_active_invalidate_count (see above) instead of
> - * mmu_invalidate_in_progress.
> - */
> - gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end);
> -
> /*
> * If one or more memslots were found and thus zapped, notify arch code
> * that guest memory has been reclaimed. This needs to be done *after*
> @@ -849,6 +837,8 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> wake = !kvm->mn_active_invalidate_count;
> spin_unlock(&kvm->mn_invalidate_lock);
>
> + gfn_to_pfn_cache_invalidate(kvm, range->start, range->end);
We can't do this. The contract with mmu_notifiers is that secondary MMUs must
unmap the hva before returning from invalidate_range_start(), and must not create
new mappings until invalidate_range_end().
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH] KVM: Move gfn_to_pfn_cache invalidation to invalidate_range_end hook
2024-08-06 0:45 ` Sean Christopherson
@ 2024-08-06 9:06 ` David Woodhouse
2024-08-06 14:03 ` Sean Christopherson
0 siblings, 1 reply; 27+ messages in thread
From: David Woodhouse @ 2024-08-06 9:06 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Mushahid Hussain, Vitaly Kuznetsov, Wanpeng Li,
Jim Mattson, Joerg Roedel, kvm, linux-kernel, Mingwei Zhang,
Maxim Levitsky
[-- Attachment #1: Type: text/plain, Size: 3517 bytes --]
On Mon, 2024-08-05 at 17:45 -0700, Sean Christopherson wrote:
> On Mon, Aug 05, 2024, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> >
> > The existing retry loop in hva_to_pfn_retry() is extremely pessimistic.
> > If there is an invalidation running concurrently, it is effectively just
> > a complex busy wait loop because its local mmu_notifier_retry_cache()
> > function will always return true.
> >
> > It ends up functioning as a very unfair read/write lock. If userspace is
> > acting as a 'writer', performing many unrelated MM changes, then the
> > hva_to_pfn_retry() function acting as the 'reader' just backs off and
> > keeps retrying forever, not making any progress.
> >
> > Solve this by introducing a separate 'validating' flag to the GPC, so
> > that it can be marked invalid before it's even mapped. This allows the
> > invalidation to be moved to the range_end hook, and the retry loop in
> > hva_to_pfn_retry() can be changed to loop only if its particular uHVA
> > has been affected.
>
> I think I'm missing something. How does allowing hva_to_pfn_retry() to complete allow KVM
> as a whole to make forward progress? Doesn't getting past hva_to_pfn_retry()
> just move the problem to kvm_gpc_check()?
>
> kvm_gpc_refresh() can't be called with gpc->lock held, and nor does it return
> with gpc->lock held, so a racing mmu_notifier invalidation can/will acquire
> gpc->lock and clear gpc->active, no?
>
> Oh, by "unrelated", you mean _completely_ unrelated? As in, KVM happens to do a
> refresh when userspace is blasting MADV_DONTNEED, and gets stuck retrying for
> no good reason?
Right. And again, I have no idea why userspace is doing that, and we
need to make it stop. But that's not really the point; the kernel
should be able to make progress anyway.
> Servicing guest pages faults has the same problem, which is why
> mmu_invalidate_retry_gfn() was added. Supporting hva-only GPCs made our lives a
> little harder, but not horrifically so (there are ordering differences regardless).
>
> Woefully incomplete, but I think this is the gist of what you want:
Hm, maybe. It does mean that migration occurring all through memory
(indeed, just one at top and bottom of guest memory space) would
perturb GPCs which remain present.
> > @@ -849,6 +837,8 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > wake = !kvm->mn_active_invalidate_count;
> > spin_unlock(&kvm->mn_invalidate_lock);
> >
> > + gfn_to_pfn_cache_invalidate(kvm, range->start, range->end);
>
> We can't do this. The contract with mmu_notifiers is that secondary MMUs must
> unmap the hva before returning from invalidate_range_start(), and must not create
> new mappings until invalidate_range_end().
But in the context of the GPC, it is only "mapped" when the ->valid bit is set.
Even the invalidation callback just clears the valid bit, and that
means nobody is allowed to dereference the ->khva any more. It doesn't
matter that the underlying (stale) PFN is still kmapped.
Can we not apply the same logic to the hva_to_pfn_retry() loop? Yes, it
might kmap a page that gets removed, but it's not actually created a
new mapping if it hasn't set the ->valid bit.
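To make that concrete, this is roughly how existing users already consume a
GPC (a minimal sketch of the pattern, not an exact excerpt from the tree):
->khva is only dereferenced while gpc->lock is held and ->valid is set.

static int gpc_write_sketch(struct gfn_to_pfn_cache *gpc, const void *data,
			    unsigned long len)
{
	read_lock_irq(&gpc->lock);
	while (!kvm_gpc_check(gpc, len)) {
		read_unlock_irq(&gpc->lock);

		/* Refresh must run without gpc->lock held and may sleep. */
		if (kvm_gpc_refresh(gpc, len))
			return -EFAULT;

		read_lock_irq(&gpc->lock);
	}

	/*
	 * Safe: the invalidation path takes gpc->lock for write before
	 * clearing ->valid, so ->khva cannot go stale while the lock is
	 * held and ->valid is set.
	 */
	memcpy(gpc->khva, data, len);
	read_unlock_irq(&gpc->lock);
	return 0;
}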
I don't think this version quite meets the constraints, and I might
need to hook *both* the start and end notifiers, and might not like it
once I get there. But I'll have a go...
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] KVM: Move gfn_to_pfn_cache invalidation to invalidate_range_end hook
2024-08-06 9:06 ` David Woodhouse
@ 2024-08-06 14:03 ` Sean Christopherson
2024-08-06 14:24 ` David Woodhouse
2024-08-22 9:00 ` David Woodhouse
0 siblings, 2 replies; 27+ messages in thread
From: Sean Christopherson @ 2024-08-06 14:03 UTC (permalink / raw)
To: David Woodhouse
Cc: Paolo Bonzini, Mushahid Hussain, Vitaly Kuznetsov, Wanpeng Li,
Jim Mattson, Joerg Roedel, kvm, linux-kernel, Mingwei Zhang,
Maxim Levitsky
On Tue, Aug 06, 2024, David Woodhouse wrote:
> On Mon, 2024-08-05 at 17:45 -0700, Sean Christopherson wrote:
> > On Mon, Aug 05, 2024, David Woodhouse wrote:
> > > From: David Woodhouse <dwmw@amazon.co.uk>
> > Servicing guest pages faults has the same problem, which is why
> > mmu_invalidate_retry_gfn() was added. Supporting hva-only GPCs made our lives a
> > little harder, but not horrifically so (there are ordering differences regardless).
> >
> > Woefully incomplete, but I think this is the gist of what you want:
>
> Hm, maybe. It does mean that migration occurring all through memory
> (indeed, just one at top and bottom of guest memory space) would
> perturb GPCs which remain present.
If that happens with a real world VMM, and it's not a blatant VMM goof, then we
can fix KVM. The stage-2 page fault path hammers the mmu_notifier retry logic
far more than GPCs, so if a range-based check is inadequate for some use case,
then we definitely need to fix both.
In short, I don't see any reason to invent something different for GPCs.
> > > @@ -849,6 +837,8 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > > wake = !kvm->mn_active_invalidate_count;
> > > spin_unlock(&kvm->mn_invalidate_lock);
> > >
> > > + gfn_to_pfn_cache_invalidate(kvm, range->start, range->end);
> >
> > We can't do this. The contract with mmu_notifiers is that secondary MMUs must
> > unmap the hva before returning from invalidate_range_start(), and must not create
> > new mappings until invalidate_range_end().
>
> But in the context of the GPC, it is only "mapped" when the ->valid bit is set.
>
> Even the invalidation callback just clears the valid bit, and that
> means nobody is allowed to dereference the ->khva any more. It doesn't
> matter that the underlying (stale) PFN is still kmapped.
>
> Can we not apply the same logic to the hva_to_pfn_retry() loop? Yes, it
> might kmap a page that gets removed, but it's not actually created a
> new mapping if it hasn't set the ->valid bit.
>
> I don't think this version quite meets the constraints, and I might
> need to hook *both* the start and end notifiers, and might not like it
> once I get there. But I'll have a go...
I'm pretty sure you're going to need the range-based retry logic. KVM can't
safely set gpc->valid until mn_active_invalidate_count reaches zero, so if a GPC
refresh comes along after mn_active_invalidate_count has been elevated, it won't
be able to set gpc->valid until the MADV_DONTNEED storm goes away. Without
range-based tracking, there's no way to know if a previous invalidation was
relevant to the GPC.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] KVM: Move gfn_to_pfn_cache invalidation to invalidate_range_end hook
2024-08-06 14:03 ` Sean Christopherson
@ 2024-08-06 14:24 ` David Woodhouse
2024-08-06 15:57 ` Sean Christopherson
2024-08-22 9:00 ` David Woodhouse
1 sibling, 1 reply; 27+ messages in thread
From: David Woodhouse @ 2024-08-06 14:24 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Mushahid Hussain, Vitaly Kuznetsov, Wanpeng Li,
Jim Mattson, Joerg Roedel, kvm, linux-kernel, Mingwei Zhang,
Maxim Levitsky
[-- Attachment #1: Type: text/plain, Size: 4758 bytes --]
On Tue, 2024-08-06 at 07:03 -0700, Sean Christopherson wrote:
> On Tue, Aug 06, 2024, David Woodhouse wrote:
> > On Mon, 2024-08-05 at 17:45 -0700, Sean Christopherson wrote:
> > > On Mon, Aug 05, 2024, David Woodhouse wrote:
> > > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > Servicing guest pages faults has the same problem, which is why
> > > mmu_invalidate_retry_gfn() was added. Supporting hva-only GPCs made our lives a
> > > little harder, but not horrifically so (there are ordering differences regardless).
> > >
> > > Woefully incomplete, but I think this is the gist of what you want:
> >
> > Hm, maybe. It does mean that migration occurring all through memory
> > (indeed, just one at top and bottom of guest memory space) would
> > perturb GPCs which remain present.
>
> If that happens with a real world VMM, and it's not a blatant VMM goof, then we
> can fix KVM. The stage-2 page fault path hammers the mmu_notifier retry logic
> far more than GPCs, so if a range-based check is inadequate for some use case,
> then we definitely need to fix both.
>
> In short, I don't see any reason to invent something different for GPCs.
>
> > > > @@ -849,6 +837,8 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > > > wake = !kvm->mn_active_invalidate_count;
> > > > spin_unlock(&kvm->mn_invalidate_lock);
> > > >
> > > > + gfn_to_pfn_cache_invalidate(kvm, range->start, range->end);
> > >
> > > We can't do this. The contract with mmu_notifiers is that secondary MMUs must
> > > unmap the hva before returning from invalidate_range_start(), and must not create
> > > new mappings until invalidate_range_end().
Looking at that assertion harder... where is that rule written? It
seems counter-intuitive to me; that isn't how TLBs work. Another CPU
can populate a TLB entry right up to the moment the PTE is actually
*changed* in the page tables, and then the CPU which is
modifying/zapping the PTE needs to perform a remote TLB flush. That
remote TLB flush is analogous to the invalidate_range_end() call,
surely?
I'm fairly sure that's how it works for PASID support too; nothing
prevents the IOMMU+device from populating an IOTLB entry until the PTE
is actually changed in the process page tables.
So why can't we do the same for the GPC?
> > But in the context of the GPC, it is only "mapped" when the ->valid bit is set.
> >
> > Even the invalidation callback just clears the valid bit, and that
> > means nobody is allowed to dereference the ->khva any more. It doesn't
> > matter that the underlying (stale) PFN is still kmapped.
> >
> > Can we not apply the same logic to the hva_to_pfn_retry() loop? Yes, it
> > might kmap a page that gets removed, but it's not actually created a
> > new mapping if it hasn't set the ->valid bit.
> >
> > I don't think this version quite meets the constraints, and I might
> > need to hook *both* the start and end notifiers, and might not like it
> > once I get there. But I'll have a go...
>
> I'm pretty sure you're going to need the range-based retry logic. KVM can't
> safely set gpc->valid until mn_active_invalidate_count reaches zero, so if a GPC
> refresh comes along after mn_active_invalidate_count has been elevated, it won't
> be able to set gpc->valid until the MADV_DONTNEED storm goes away. Without
> range-based tracking, there's no way to know if a previous invalidation was
> relevant to the GPC.
If it is indeed the case that KVM can't just behave like a normal TLB,
and so can't set gpc->valid until mn_active_invalidate_count reaches
zero, it still only needs to *wait* (or spin, maybe). It certainly
doesn't need to keep looping and remapping the same PFN over and over
again, as it does at the moment.
When mn_active_invalidate_count does reach zero, either the young GPC
will have been invalidated by clearing the (to be renamed) ->validating
flag, or it won't have been. If it *has* been invalidated, that's when
hva_to_pfn_retry() needs to go one more time round its full loop.
So it just needs to wait until any pending (relevant) invalidations
have completed, *then* check and potentially loop once more.
And yes, making that *wait* range-based does make some sense, I
suppose. It becomes "wait for gpc->uhva not to be within the range of
kvm->mmu_gpc_invalidate_range_{start,end}."
Except... that range can never shrink *except* when
mn_active_invalidate_count becomes zero, can it? So if we do end up
waiting, the wake condition is *still* just that the count has become
zero. There's already a wakeup in that case, on
kvm->mn_memslots_update_rcuwait. Can I wait on that?
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] KVM: Move gfn_to_pfn_cache invalidation to invalidate_range_end hook
2024-08-06 14:24 ` David Woodhouse
@ 2024-08-06 15:57 ` Sean Christopherson
2024-08-06 16:40 ` David Woodhouse
0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2024-08-06 15:57 UTC (permalink / raw)
To: David Woodhouse
Cc: Paolo Bonzini, Mushahid Hussain, Vitaly Kuznetsov, Wanpeng Li,
Jim Mattson, Joerg Roedel, kvm, linux-kernel, Mingwei Zhang,
Maxim Levitsky
On Tue, Aug 06, 2024, David Woodhouse wrote:
> On Tue, 2024-08-06 at 07:03 -0700, Sean Christopherson wrote:
> > On Tue, Aug 06, 2024, David Woodhouse wrote:
> > > On Mon, 2024-08-05 at 17:45 -0700, Sean Christopherson wrote:
> > > > On Mon, Aug 05, 2024, David Woodhouse wrote:
> > > > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > > Servicing guest pages faults has the same problem, which is why
> > > > mmu_invalidate_retry_gfn() was added. Supporting hva-only GPCs made our lives a
> > > > little harder, but not horrifically so (there are ordering differences regardless).
> > > >
> > > > Woefully incomplete, but I think this is the gist of what you want:
> > >
> > > Hm, maybe. It does mean that migration occurring all through memory
> > > (indeed, just one at top and bottom of guest memory space) would
> > > perturb GPCs which remain present.
> >
> > If that happens with a real world VMM, and it's not a blatant VMM goof, then we
> > can fix KVM. The stage-2 page fault path hammers the mmu_notifier retry logic
> > far more than GPCs, so if a range-based check is inadequate for some use case,
> > then we definitely need to fix both.
> >
> > In short, I don't see any reason to invent something different for GPCs.
> >
> > > > > @@ -849,6 +837,8 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> > > > > wake = !kvm->mn_active_invalidate_count;
> > > > > spin_unlock(&kvm->mn_invalidate_lock);
> > > > >
> > > > > + gfn_to_pfn_cache_invalidate(kvm, range->start, range->end);
> > > >
> > > > We can't do this. The contract with mmu_notifiers is that secondary MMUs must
> > > > unmap the hva before returning from invalidate_range_start(), and must not create
> > > > new mappings until invalidate_range_end().
>
> Looking at that assertion harder... where is that rule written?
The big comment for invalidate_range_{start,end}() in include/linux/mmu_notifier.h.
The relevant snippets are:
* If the subsystem can't guarantee that no additional references are
* taken to the pages in the range, it has to implement the
* invalidate_range() notifier to remove any references taken after
* invalidate_range_start().
* invalidate_range_start() is called when all pages in the
* range are still mapped and have at least a refcount of one.
*
* invalidate_range_end() is called when all pages in the
* range have been unmapped and the pages have been freed by
* the VM.
The last one is key: the pages have already been freed when invalidate_range_end()
is called, and so unmapping at that time would be too late.
> It seems counter-intuitive to me; that isn't how TLBs work. Another CPU can
> populate a TLB entry right up to the moment the PTE is actually *changed* in
> the page tables, and then the CPU which is modifying/zapping the PTE needs to
> perform a remote TLB flush. That remote TLB flush is analogous to the
> invalidate_range_end() call, surely?
KVM's usage isn't about (hardware) TLBs. Ah, and the history is even somewhat
evident in the above comment I referenced. invalidate_range() no longer exists,
it was morphed into arch_invalidate_secondary_tlbs(). For secondary MMUs that
reuse the primary MMU's PTEs, mmu_notifier_arch_invalidate_secondary_tlbs() is
indeed called after the PTEs have been modified.
KVM's usage is different. Because KVM has its own (Secondary) PTEs (commit
1af5a8109904 ("mmu_notifiers: rename invalidate_range notifier") calls them
"software TLBs", but I find that to be a confusing description), zapping on-demand
when the primary PTEs are modified is tricky and ultimately undesirable.
E.g. invoking mmu_notifiers while holding a PTE lock would prevent KVM from
blocking, which can be problematic if KVM needs to zap a large number of SPTEs.
And doing invalidation on-demand for each primary PTE would be suboptimal for
cases where a large VMA range is unmapped/modified, e.g. KVM would get a large
number of invalidation events instead of one big, all-encompassing invalidation.
The obvious downside is what you've run into, where the start+end approach forces
KVM to wait for all in-flight invalidations to go away. But again, in practice
the rudimentary range tracking suffices for all known use cases.
> I'm fairly sure that's how it works for PASID support too; nothing
> prevents the IOMMU+device from populating an IOTLB entry until the PTE
> is actually changed in the process page tables.
>
> So why can't we do the same for the GPC?
>
> > > But in the context of the GPC, it is only "mapped" when the ->valid bit is set.
> > >
> > > Even the invalidation callback just clears the valid bit, and that
> > > means nobody is allowed to dereference the ->khva any more. It doesn't
> > > matter that the underlying (stale) PFN is still kmapped.
> > >
> > > Can we not apply the same logic to the hva_to_pfn_retry() loop? Yes, it
> > > might kmap a page that gets removed, but it's not actually created a
> > > new mapping if it hasn't set the ->valid bit.
> > >
> > > I don't think this version quite meets the constraints, and I might
> > > need to hook *both* the start and end notifiers, and might not like it
> > > once I get there. But I'll have a go...
> >
> > I'm pretty sure you're going to need the range-based retry logic. KVM can't
> > safely set gpc->valid until mn_active_invalidate_count reaches zero, so if a GPC
> > refresh comes along after mn_active_invalidate_count has been elevated, it won't
> > be able to set gpc->valid until the MADV_DONTNEED storm goes away. Without
> > range-based tracking, there's no way to know if a previous invalidation was
> > relevant to the GPC.
>
> If it is indeed the case that KVM can't just behave like a normal TLB,
> and so can't set gpc->valid until mn_active_invalidate_count reaches
> zero, it still only needs to *wait* (or spin, maybe). It certainly
> doesn't need to keep looping and remapping the same PFN over and over
> again, as it does at the moment.
>
> When mn_active_invalidate_count does reach zero, either the young GPC
> will have been invalidated by clearing the (to be renamed) ->validating
> flag, or it won't have been. If it *has* been invalidated, that's when
> hva_to_pfn_retry() needs to go one more time round its full loop.
>
> So it just needs to wait until any pending (relevant) invalidations
> have completed, *then* check and potentially loop once more.
>
> And yes, making that *wait* range-based does make some sense, I
> suppose. It becomes "wait for gpc->uhva not to be within the range of
> kvm->mmu_gpc_invalidate_range_{start,end}."
Yep, exactly. Without range-based tracking, there's no way for KVM to know when
a relevant in-flight invalidation has completed.
> Except... that range can never shrink *except* when
> mn_active_invalidate_count becomes zero, can it?
Not without more sophisticated logic, no. E.g. if KVM supported tracking multiple
distinct ranges, then individual invalidation ranges could be dropped. But to
avoid memory allocations in invalidate_range_start(), KVM would still need to
hardcode the maximum number of in-flight ranges. E.g. even if KVM used a dynamic
container, we'd probably want the container entries to be "allocated" out of a
cache, and that cache would need a maximum capacity.
With a max limit on the number of ranges, KVM would still be forced to combine
ranges if there are too many in-flight invalidations.
So, because tracking a single range has sufficed for all known use cases, and
it's significantly simpler than tracking multiple ranges, AFAIK no one has pursued
a multi-range tracking implementation.
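Purely to illustrate the shape of such a thing (entirely hypothetical, none of
these names exist in KVM), a fixed-capacity tracker still has to fall back to
combining ranges once it overflows:

#define MAX_INFLIGHT_RANGES	4	/* hypothetical, hardcoded capacity */

struct hva_range {
	unsigned long start;
	unsigned long end;
};

struct inflight_ranges {
	struct hva_range r[MAX_INFLIGHT_RANGES];
	int nr;
};

static void track_invalidation(struct inflight_ranges *t,
			       unsigned long start, unsigned long end)
{
	if (t->nr == MAX_INFLIGHT_RANGES) {
		/* Out of slots; combine with the last tracked range. */
		t->r[t->nr - 1].start = min(t->r[t->nr - 1].start, start);
		t->r[t->nr - 1].end = max(t->r[t->nr - 1].end, end);
		return;
	}
	t->r[t->nr].start = start;
	t->r[t->nr].end = end;
	t->nr++;
}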
> So if we do end up waiting, the wake condition is *still* just that the count
> has become zero. There's already a wakeup in that case, on
> kvm->mn_memslots_update_rcuwait. Can I wait on that?
I suspect you're trying to solve a problem that doesn't exist in practice.
hva_to_pfn_retry() already has a cond_resched(), so getting stuck for a long
duration isn't fatal, just suboptimal. And similar to the range-based tracking,
_if_ there's a problem in practice, then it also affects guest page faults. KVM
simply resumes the vCPU and keeps re-faulting until the in-flight invalidation(s)
has gone away.
Not without reworking mn_memslots_update_rcuwait. KVM assumes there is at most
one waiter, as that wait+wake combination is specifically to handle the case where
a _relevant_ in-flight mmu_notifier invalidation needs to block a userspace memslot
deletion. KVM takes mmu_lock in invalidate_range_{start,end}() if and only if
there is an overlapping memslot, and so KVM needs to prevent a memslot from being
deleted between start() and end().
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] KVM: Move gfn_to_pfn_cache invalidation to invalidate_range_end hook
2024-08-06 15:57 ` Sean Christopherson
@ 2024-08-06 16:40 ` David Woodhouse
0 siblings, 0 replies; 27+ messages in thread
From: David Woodhouse @ 2024-08-06 16:40 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Mushahid Hussain, Vitaly Kuznetsov, Wanpeng Li,
Jim Mattson, Joerg Roedel, kvm, linux-kernel, Mingwei Zhang,
Maxim Levitsky
[-- Attachment #1: Type: text/plain, Size: 6298 bytes --]
On Tue, 2024-08-06 at 08:57 -0700, Sean Christopherson wrote:
> The last one is key: the pages have already been freed when invalidate_range_end()
> is called, and so unmapping at that time would be too late.
OK, thanks.
> The obvious downside is what you've run into, where the start+end approach forces
> KVM to wait for all in-flight invalidations to go away. But again, in practice
> the rudimentary range tracking suffices for all known use cases.
Makes sense.
> I suspect you're trying to solve a problem that doesn't exist in practice.
> hva_to_pfn_retry() already has a cond_resched(), so getting stuck for a long
> duration isn't fatal, just suboptimal. And similar to the range-based tracking,
> _if_ there's a problem in practice, then it also affects guest page faults. KVM
> simply resumes the vCPU and keeps re-faulting until the in-flight invalidation(s)
> has gone away.
Yeah. When it's looping on actual page faults it's easy to pretend it
isn't a busy-loop :)
Making it wait is fairly simple; it would be easy to just make it
cond_resched() instead though. Here's what I have so far (incremental).
https://git.infradead.org/?p=users/dwmw2/linux.git;a=commitdiff;h=pfncache-2
I need to revive the torture test you had at the end of your earlier series.
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 689e8be873a7..9b5261e11802 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -770,6 +770,9 @@ struct kvm {
/* For management / invalidation of gfn_to_pfn_caches */
spinlock_t gpc_lock;
struct list_head gpc_list;
+ u64 mmu_gpc_invalidate_range_start;
+ u64 mmu_gpc_invalidate_range_end;
+ wait_queue_head_t gpc_invalidate_wq;
/*
* created_vcpus is protected by kvm->lock, and is incremented
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ffd6ab4c2a16..a243f9f8a9c0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -775,6 +775,24 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
*/
spin_lock(&kvm->mn_invalidate_lock);
kvm->mn_active_invalidate_count++;
+ if (likely(kvm->mn_active_invalidate_count == 1)) {
+ kvm->mmu_gpc_invalidate_range_start = range->start;
+ kvm->mmu_gpc_invalidate_range_end = range->end;
+ } else {
+ /*
+ * Fully tracking multiple concurrent ranges has diminishing
+ * returns. Keep things simple and just find the minimal range
+ * which includes the current and new ranges. As there won't be
+ * enough information to subtract a range after its invalidate
+ * completes, any ranges invalidated concurrently will
+ * accumulate and persist until all outstanding invalidates
+ * complete.
+ */
+ kvm->mmu_gpc_invalidate_range_start =
+ min(kvm->mmu_gpc_invalidate_range_start, range->start);
+ kvm->mmu_gpc_invalidate_range_end =
+ max(kvm->mmu_gpc_invalidate_range_end, range->end);
+ }
spin_unlock(&kvm->mn_invalidate_lock);
/*
@@ -830,21 +848,27 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
__kvm_handle_hva_range(kvm, &hva_range);
+ gfn_to_pfn_cache_invalidate(kvm, range->start, range->end);
+
/* Pairs with the increment in range_start(). */
spin_lock(&kvm->mn_invalidate_lock);
if (!WARN_ON_ONCE(!kvm->mn_active_invalidate_count))
--kvm->mn_active_invalidate_count;
wake = !kvm->mn_active_invalidate_count;
+ if (wake) {
+ kvm->mmu_gpc_invalidate_range_start = KVM_HVA_ERR_BAD;
+ kvm->mmu_gpc_invalidate_range_end = KVM_HVA_ERR_BAD;
+ }
spin_unlock(&kvm->mn_invalidate_lock);
- gfn_to_pfn_cache_invalidate(kvm, range->start, range->end);
-
/*
* There can only be one waiter, since the wait happens under
* slots_lock.
*/
- if (wake)
+ if (wake) {
+ wake_up(&kvm->gpc_invalidate_wq);
rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
+ }
}
static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
@@ -1154,7 +1178,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
INIT_LIST_HEAD(&kvm->gpc_list);
spin_lock_init(&kvm->gpc_lock);
-
+ init_waitqueue_head(&kvm->gpc_invalidate_wq);
+ kvm->mmu_gpc_invalidate_range_start = KVM_HVA_ERR_BAD;
+ kvm->mmu_gpc_invalidate_range_end = KVM_HVA_ERR_BAD;
INIT_LIST_HEAD(&kvm->devices);
kvm->max_vcpus = KVM_MAX_VCPUS;
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 187b58150ef7..dad32ef5ac87 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -133,6 +133,39 @@ static void gpc_unmap(kvm_pfn_t pfn, void *khva)
#endif
}
+static bool gpc_invalidations_pending(struct gfn_to_pfn_cache *gpc)
+{
+ /*
+ * No need for locking on GPC here because these fields are protected
+ * by gpc->refresh_lock.
+ */
+ return unlikely(gpc->kvm->mn_active_invalidate_count) &&
+ (gpc->kvm->mmu_gpc_invalidate_range_start <= gpc->uhva) &&
+ (gpc->kvm->mmu_gpc_invalidate_range_end > gpc->uhva);
+}
+
+static void gpc_wait_for_invalidations(struct gfn_to_pfn_cache *gpc)
+{
+ spin_lock(&gpc->kvm->mn_invalidate_lock);
+ if (gpc_invalidations_pending(gpc)) {
+ DECLARE_WAITQUEUE(wait, current);
+
+ for (;;) {
+ prepare_to_wait(&gpc->kvm->gpc_invalidate_wq, &wait,
+ TASK_UNINTERRUPTIBLE);
+
+ if (!gpc_invalidations_pending(gpc))
+ break;
+
+ spin_unlock(&gpc->kvm->mn_invalidate_lock);
+ schedule();
+ spin_lock(&gpc->kvm->mn_invalidate_lock);
+ }
+ finish_wait(&gpc->kvm->gpc_invalidate_wq, &wait);
+ }
+ spin_unlock(&gpc->kvm->mn_invalidate_lock);
+}
+
static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
{
/* Note, the new page offset may be different than the old! */
@@ -205,6 +238,15 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
goto out_error;
}
+ /*
+ * Avoid populating a GPC with a user address which is already
+ * being invalidated, if the invalidate_range_start() notifier
+ * has already been called. The actual invalidation happens in
+ * the invalidate_range_end() callback, so wait until there are
+ * no active invalidations (for the relevant HVA).
+ */
+ gpc_wait_for_invalidations(gpc);
+
write_lock_irq(&gpc->lock);
/*
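
For contrast, the cond_resched() alternative mentioned at the top could be as
simple as the sketch below. This is hypothetical and not part of the
incremental patch above; it reuses the gpc_invalidations_pending() helper from
the diff and simply polls instead of sleeping on the waitqueue:

	/*
	 * Hypothetical polling variant: yield via cond_resched() until no
	 * in-flight invalidation covers gpc->uhva.  It reads the range
	 * fields without mn_invalidate_lock, so it may spuriously loop one
	 * extra time, which is harmless for an illustration like this.
	 */
	static void gpc_wait_for_invalidations_poll(struct gfn_to_pfn_cache *gpc)
	{
		while (gpc_invalidations_pending(gpc))
			cond_resched();
	}

The waitqueue version in the diff avoids the busy wait, at the cost of the
extra wake_up() in invalidate_range_end() and the gpc_invalidate_wq
bookkeeping.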
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH] KVM: Move gfn_to_pfn_cache invalidation to invalidate_range_end hook
2024-08-06 14:03 ` Sean Christopherson
2024-08-06 14:24 ` David Woodhouse
@ 2024-08-22 9:00 ` David Woodhouse
1 sibling, 0 replies; 27+ messages in thread
From: David Woodhouse @ 2024-08-22 9:00 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Mushahid Hussain, Vitaly Kuznetsov, Wanpeng Li,
Jim Mattson, Joerg Roedel, kvm, linux-kernel, Mingwei Zhang,
Maxim Levitsky
On Tue, 2024-08-06 at 07:03 -0700, Sean Christopherson wrote:
> On Tue, Aug 06, 2024, David Woodhouse wrote:
> > On Mon, 2024-08-05 at 17:45 -0700, Sean Christopherson wrote:
> > > On Mon, Aug 05, 2024, David Woodhouse wrote:
> > > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > Servicing guest pages faults has the same problem, which is why
> > > mmu_invalidate_retry_gfn() was added. Supporting hva-only GPCs made our lives a
> > > little harder, but not horrifically so (there are ordering differences regardless).
> > >
> > > Woefully incomplete, but I think this is the gist of what you want:
> >
> > Hm, maybe. It does mean that migration occurring all through memory
> > (indeed, just one at top and bottom of guest memory space) would
> > perturb GPCs which remain present.
>
> If that happens with a real world VMM, and it's not a blatant VMM goof, then we
> can fix KVM. The stage-2 page fault path hammers the mmu_notifier retry logic
> far more than GPCs, so if a range-based check is inadequate for some use case,
> then we definitely need to fix both.
FWIW this turned out to be because someone used C++ where they
shouldn't have done. Opinion is divided on whether to call that a
'blatant VMM goof'.
But if you use std::vector for something that's modified that many
times a second, and free() implicitly calls madvise(MADV_DONTNEED)
every time, you get what you deserve :)
We should still fix it in the kernel.
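
For illustration only (this example is not from the thread), the pattern in
question boils down to something like the sketch below, with the allocator's
implicit purge written out as an explicit madvise() call. Every MADV_DONTNEED
on the VMM's address space fires an invalidate_range_start()/end() pair in
KVM, which is what ends up perturbing the pfncache:

	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>

	/*
	 * Hypothetical userspace illustration: a hot loop that allocates and
	 * frees a large buffer on every iteration.  An allocator that returns
	 * freed pages to the kernel typically does so with
	 * madvise(MADV_DONTNEED) (made explicit here), and each such call
	 * triggers an mmu_notifier invalidation for that HVA range.
	 */
	int main(void)
	{
		for (int i = 0; i < 100000; i++) {
			size_t len = 1UL << 20;
			char *buf = aligned_alloc(4096, len);

			if (!buf)
				break;

			memset(buf, 0, len);

			/* What free() does implicitly in the scenario above: */
			madvise(buf, len, MADV_DONTNEED);
			free(buf);
		}
		return 0;
	}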
^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads: [~2024-08-22 9:01 UTC | newest]
Thread overview: 27+ messages -- links below jump to the message on this page --
2022-04-27 1:39 [PATCH v2 0/8] KVM: Fix mmu_notifier vs. pfncache vs. pfncache races Sean Christopherson
2022-04-27 1:39 ` [PATCH v2 1/8] Revert "KVM: Do not speculatively mark pfn cache valid to "fix" race" Sean Christopherson
2022-04-27 1:39 ` [PATCH v2 2/8] Revert "KVM: Fix race between mmu_notifier invalidation and pfncache refresh" Sean Christopherson
2022-04-27 1:39 ` [PATCH v2 3/8] KVM: Drop unused @gpa param from gfn=>pfn cache's __release_gpc() helper Sean Christopherson
2022-04-27 1:40 ` [PATCH v2 4/8] KVM: Put the extra pfn reference when reusing a pfn in the gpc cache Sean Christopherson
2022-04-27 1:40 ` [PATCH v2 5/8] KVM: Do not incorporate page offset into gfn=>pfn cache user address Sean Christopherson
2022-04-27 1:40 ` [PATCH v2 6/8] KVM: Fix multiple races in gfn=>pfn cache refresh Sean Christopherson
2022-04-27 14:10 ` Sean Christopherson
2022-04-28 3:39 ` Lai Jiangshan
2022-04-28 14:33 ` Sean Christopherson
2022-05-20 14:49 ` Paolo Bonzini
2022-05-20 14:58 ` Paolo Bonzini
2022-05-20 15:02 ` Sean Christopherson
2022-05-20 14:59 ` Sean Christopherson
2024-08-05 11:04 ` David Woodhouse
2024-08-05 11:08 ` [PATCH] KVM: Move gfn_to_pfn_cache invalidation to invalidate_range_end hook David Woodhouse
2024-08-06 0:45 ` Sean Christopherson
2024-08-06 9:06 ` David Woodhouse
2024-08-06 14:03 ` Sean Christopherson
2024-08-06 14:24 ` David Woodhouse
2024-08-06 15:57 ` Sean Christopherson
2024-08-06 16:40 ` David Woodhouse
2024-08-22 9:00 ` David Woodhouse
2022-04-27 1:40 ` [PATCH v2 7/8] KVM: Do not pin pages tracked by gfn=>pfn caches Sean Christopherson
2022-04-27 1:40 ` [PATCH v2 8/8] DO NOT MERGE: Hack-a-test to verify gpc invalidation+refresh Sean Christopherson
2022-04-27 15:17 ` David Woodhouse
2022-04-27 20:23 ` Sean Christopherson