From: Takahiro Itazuri <itazur@amazon.com>
To: <kvm@vger.kernel.org>, Sean Christopherson <seanjc@google.com>,
"Paolo Bonzini" <pbonzini@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>,
Fuad Tabba <tabba@google.com>,
Brendan Jackman <jackmanb@google.com>,
David Hildenbrand <david@kernel.org>,
David Woodhouse <dwmw2@infradead.org>,
Paul Durrant <pdurrant@amazon.com>,
Nikita Kalyazin <nikita.kalyazin@linux.dev>,
Patrick Roy <patrick.roy@campus.lmu.de>,
Patrick Roy <patrick.roy@linux.dev>,
"Derek Manwaring" <derekmn@amazon.com>,
Alina Cernea <acernea@amazon.com>,
"Michael Zoumboulakis" <zoumboul@amazon.com>,
Takahiro Itazuri <zulinx86@gmail.com>,
Takahiro Itazuri <itazur@amazon.com>
Subject: [RFC PATCH v4 5/7] KVM: pfncache: Invalidate on gmem invalidation and memattr updates
Date: Mon, 20 Apr 2026 15:46:06 +0000
Message-ID: <20260420154720.29012-6-itazur@amazon.com>
In-Reply-To: <20260420154720.29012-1-itazur@amazon.com>

Invalidate pfncaches when a guest_memfd invalidation or a memory
attribute update renders cached PFN resolutions stale. Reuse
mn_active_invalidate_count to synchronize with the existing retry logic
and to preserve ordering against mmu_invalidate_seq.

Invalidation must be performed on HVA ranges so that both GPA-based and
HVA-based pfncaches are covered. Internally, GPA-based pfncaches first
translate the GPA to a memslot/UHVA and then resolve the PFN, whereas
HVA-based pfncaches only resolve the PFN and store no memslot/GPA
context. Technically, HVA-based pfncaches could be made to look up the
corresponding memslot/GPA when activated/refreshed, but that would add
overhead for every pfncache user, whether or not the memory is
guest_memfd-backed. At the time of writing, only Xen uses HVA-based
pfncaches.

Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
---
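Note for reviewers: a condensed sketch of the ordering contract this
patch relies on, distilled from the hunks below. The begin()/end()
helpers are hypothetical stand-ins for the two call sites
(__kvm_gmem_invalidate_start/end() and kvm_vm_set_mem_attributes());
they are not helpers added by this patch.

  static void begin(struct kvm *kvm, unsigned long hva_start,
                    unsigned long hva_end)
  {
          /* Step 1: mark an invalidation as in flight. */
          spin_lock(&kvm->mn_invalidate_lock);
          kvm->mn_active_invalidate_count++;
          spin_unlock(&kvm->mn_invalidate_lock);

          /* Step 2: zap pfncaches covering the affected HVA range. */
          gpc_invalidate_hva_range_start(kvm, hva_start, hva_end);
  }

  static void end(struct kvm *kvm)
  {
          /*
           * Step 3: runs after kvm_mmu_invalidate_end() has bumped
           * mmu_invalidate_seq with an smp_wmb(), so a refresher that
           * misses the elevated count below is guaranteed to observe
           * the new sequence number instead.
           */
          spin_lock(&kvm->mn_invalidate_lock);
          kvm->mn_active_invalidate_count--;
          spin_unlock(&kvm->mn_invalidate_lock);
  }

gpc_to_pfn_retry() keeps looping while gpc_invalidate_retry() sees
either a non-zero mn_active_invalidate_count or a changed
mmu_invalidate_seq, so any PFN resolved inside this window is thrown
away and re-resolved.
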
virt/kvm/guest_memfd.c | 50 ++++++++++++++++++++++++++++++++++++++++++
virt/kvm/kvm_main.c | 47 ++++++++++++++++++++++++++++++++++++++-
virt/kvm/pfncache.c | 21 ++++++++++--------
3 files changed, 108 insertions(+), 10 deletions(-)
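
Also for reference, a worked example of the range math in the
guest_memfd hunk (the values are illustrative, not taken from the
patch): a binding with base_gfn = 0x100, gmem.pgoff = 0x10 and
npages = 0x20 covers file pages [0x10, 0x30). Invalidating file range
[0x08, 0x18) first clamps to the binding and then rebases to GFNs:

  gfn_start = base_gfn + max(pgoff, start) - pgoff
            = 0x100 + max(0x10, 0x08) - 0x10 = 0x100
  gfn_end   = base_gfn + min(pgoff + npages, end) - pgoff
            = 0x100 + min(0x30, 0x18) - 0x10 = 0x108

gfn_to_hva_memslot() then maps gfn_start to hva_start, and the span is
(gfn_end - gfn_start) * PAGE_SIZE bytes, i.e. GFNs [0x100, 0x108)
become one contiguous HVA range for gpc_invalidate_hva_range_start().
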
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 79f34dad0c2f..011fd205ac7e 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -215,6 +215,33 @@ static void __kvm_gmem_invalidate_start(struct gmem_file *f, pgoff_t start,
struct kvm *kvm = f->kvm;
unsigned long index;
+ /*
+ * Prevent pfncaches from being activated / refreshed using stale PFN
+ * resolutions. To invalidate pfncaches _before_ invalidating the
+ * secondary MMUs (i.e. without acquiring mmu_lock), pfncaches must use
+ * mn_active_invalidate_count instead of mmu_invalidate_in_progress.
+ */
+ spin_lock(&kvm->mn_invalidate_lock);
+ kvm->mn_active_invalidate_count++;
+ spin_unlock(&kvm->mn_invalidate_lock);
+
+ /*
+ * Invalidation of pfncaches must be done using an HVA range: pfncaches
+ * can be either GPA-based or HVA-based, and while all pfncaches store a
+ * uhva, HVA-based ones have no gpa/memslot context. Thus, using GFN
+ * ranges would miss invalidating the HVA-based ones.
+ */
+ xa_for_each_range(&f->bindings, index, slot, start, end - 1) {
+ pgoff_t pgoff = slot->gmem.pgoff;
+ gfn_t gfn_start = slot->base_gfn + max(pgoff, start) - pgoff;
+ gfn_t gfn_end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff;
+
+ unsigned long hva_start = gfn_to_hva_memslot(slot, gfn_start);
+ unsigned long hva_end = hva_start + (gfn_end - gfn_start) * PAGE_SIZE;
+
+ gpc_invalidate_hva_range_start(kvm, hva_start, hva_end);
+ }
+
xa_for_each_range(&f->bindings, index, slot, start, end - 1) {
pgoff_t pgoff = slot->gmem.pgoff;
@@ -259,12 +286,35 @@ static void __kvm_gmem_invalidate_end(struct gmem_file *f, pgoff_t start,
pgoff_t end)
{
struct kvm *kvm = f->kvm;
+ bool wake;
if (xa_find(&f->bindings, &start, end - 1, XA_PRESENT)) {
KVM_MMU_LOCK(kvm);
kvm_mmu_invalidate_end(kvm);
KVM_MMU_UNLOCK(kvm);
}
+
+ /*
+ * This must be done after the increment of mmu_invalidate_seq and
+ * smp_wmb() in kvm_mmu_invalidate_end() to guarantee that
+ * gpc_invalidate_retry() observes either the old (non-zero)
+ * mn_active_invalidate_count or the new (incremented) mmu_invalidate_seq.
+ */
+ spin_lock(&kvm->mn_invalidate_lock);
+ if (!WARN_ON_ONCE(!kvm->mn_active_invalidate_count))
+ kvm->mn_active_invalidate_count--;
+ wake = !kvm->mn_active_invalidate_count;
+ spin_unlock(&kvm->mn_invalidate_lock);
+
+ /*
+ * guest_memfd invalidation itself doesn't need to block the active
+ * memslots swap, as binding updates are serialized by
+ * filemap_invalidate_lock().
+ * However, mn_active_invalidate_count is shared with the MMU notifier
+ * path, so the waiter must be woken when mn_active_invalidate_count
+ * drops to zero.
+ */
+ if (wake)
+ rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
}
static void kvm_gmem_invalidate_end(struct inode *inode, pgoff_t start,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d64e70f8e8e3..b6d0a22fee79 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2583,9 +2583,11 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
.on_lock = kvm_mmu_invalidate_end,
.may_block = true,
};
+ struct kvm_memslots *slots = kvm_memslots(kvm);
+ struct kvm_memory_slot *slot;
unsigned long i;
void *entry;
- int r = 0;
+ int r = 0, bkt;
entry = attributes ? xa_mk_value(attributes) : NULL;
@@ -2609,6 +2611,34 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
cond_resched();
}
+ /*
+ * Prevent pfncaches from being activated / refreshed using stale PFN
+ * resolutions. To invalidate pfncaches _before_ invalidating the
+ * secondary MMUs (i.e. without acquiring mmu_lock), pfncaches must use
+ * mn_active_invalidate_count instead of mmu_invalidate_in_progress.
+ */
+ spin_lock(&kvm->mn_invalidate_lock);
+ kvm->mn_active_invalidate_count++;
+ spin_unlock(&kvm->mn_invalidate_lock);
+
+ /*
+ * Invalidation of pfncaches must be done using an HVA range: pfncaches
+ * can be either GPA-based or HVA-based, and while all pfncaches store a
+ * uhva, HVA-based ones have no gpa/memslot context. Thus, using GFN
+ * ranges would miss invalidating the HVA-based ones.
+ */
+ kvm_for_each_memslot(slot, bkt, slots) {
+ gfn_t gfn_start = max(start, slot->base_gfn);
+ gfn_t gfn_end = min(end, slot->base_gfn + slot->npages);
+
+ if (gfn_start < gfn_end) {
+ unsigned long hva_start = gfn_to_hva_memslot(slot, gfn_start);
+ unsigned long hva_end = hva_start + (gfn_end - gfn_start) * PAGE_SIZE;
+
+ gpc_invalidate_hva_range_start(kvm, hva_start, hva_end);
+ }
+ }
+
kvm_handle_gfn_range(kvm, &pre_set_range);
for (i = start; i < end; i++) {
@@ -2620,6 +2650,21 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
kvm_handle_gfn_range(kvm, &post_set_range);
+ /*
+ * This must be done after the increment of mmu_invalidate_seq and
+ * smp_wmb() in kvm_mmu_invalidate_end() to guarantee that
+ * gpc_invalidate_retry() observes either the old (non-zero)
+ * mn_active_invalidate_count or the new (incremented) mmu_invalidate_seq.
+ *
+ * mn_memslots_update_rcuwait does not need to be woken when
+ * mn_active_invalidate_count drops to zero, because the active memslots
+ * swap is also done while holding slots_lock.
+ */
+ spin_lock(&kvm->mn_invalidate_lock);
+ if (!WARN_ON_ONCE(!kvm->mn_active_invalidate_count))
+ kvm->mn_active_invalidate_count--;
+ spin_unlock(&kvm->mn_invalidate_lock);
+
out_unlock:
mutex_unlock(&kvm->slots_lock);
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index bb4ba3a1b3d9..0e7a0f64e14b 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -144,7 +144,7 @@ static void gpc_unmap(kvm_pfn_t pfn, void *khva)
#endif
}
-static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_seq)
+static inline bool gpc_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
{
/*
* mn_active_invalidate_count acts for all intents and purposes
@@ -178,14 +178,17 @@ static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_seq)
*
* The caller holds gpc->refresh_lock, but does not hold gpc->lock nor
* kvm->slots_lock. Reading slot->flags (via kvm_slot_has_gmem() and
- * kvm_memslot_is_gmem_only()) is safe because memslot changes bump
- * slots->generation, which is detected in kvm_gpc_check(), forcing callers
- * to invoke kvm_gpc_refresh().
+ * kvm_memslot_is_gmem_only()) and looking up memory attributes (via
+ * kvm_mem_is_private()) without those locks is safe because:
*
- * Looking up memory attributes (via kvm_mem_is_private()) can race with
- * KVM_SET_MEMORY_ATTRIBUTES, which takes kvm->slots_lock to serialize
- * writers but doesn't exclude lockless readers. Handling that race is deferred
- * to a subsequent commit that wires up pfncache invalidation for gmem events.
+ * - memslot changes bump slots->generation, which is detected in
+ * kvm_gpc_check(), forcing callers to invoke kvm_gpc_refresh().
+ *
+ * - Memory attribute changes and gmem invalidations elevate
+ * mn_active_invalidate_count and bump mmu_invalidate_seq, bracketing the
+ * pfncache invalidation. gpc_invalidate_retry() observes either of these
+ * changes and forces a retry of the refresh loop in gpc_to_pfn_retry(), so
+ * any stale value read here will be re-evaluated.
*/
static inline bool gpc_is_gmem_backed(struct gfn_to_pfn_cache *gpc)
{
@@ -293,7 +296,7 @@ static kvm_pfn_t gpc_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
* attempting to refresh.
*/
WARN_ON_ONCE(gpc->valid);
- } while (mmu_notifier_retry_cache(gpc->kvm, mmu_seq));
+ } while (gpc_invalidate_retry(gpc->kvm, mmu_seq));
gpc->valid = true;
gpc->pfn = new_pfn;
--
2.50.1