From: Takahiro Itazuri <itazur@amazon.com>
To: <kvm@vger.kernel.org>, Sean Christopherson <seanjc@google.com>,
"Paolo Bonzini" <pbonzini@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>,
Fuad Tabba <tabba@google.com>,
Brendan Jackman <jackmanb@google.com>,
David Hildenbrand <david@kernel.org>,
David Woodhouse <dwmw2@infradead.org>,
Paul Durrant <pdurrant@amazon.com>,
Nikita Kalyazin <nikita.kalyazin@linux.dev>,
Patrick Roy <patrick.roy@campus.lmu.de>,
Patrick Roy <patrick.roy@linux.dev>,
"Derek Manwaring" <derekmn@amazon.com>,
Alina Cernea <acernea@amazon.com>,
"Michael Zoumboulakis" <zoumboul@amazon.com>,
Takahiro Itazuri <zulinx86@gmail.com>,
Takahiro Itazuri <itazur@amazon.com>
Subject: [RFC PATCH v4 5/7] KVM: pfncache: Invalidate on gmem invalidation and memattr updates
Date: Mon, 20 Apr 2026 15:46:06 +0000
Message-ID: <20260420154720.29012-6-itazur@amazon.com>
In-Reply-To: <20260420154720.29012-1-itazur@amazon.com>

Invalidate pfncaches when guest_memfd invalidation or memory attribute
updates render cached PFN resolutions stale.

Reuse mn_active_invalidate_count to synchronize with the existing retry
logic and to preserve ordering against mmu_invalidate_seq.

Invalidation needs to be performed using HVA ranges so that both
GPA-based and HVA-based pfncaches are covered. Internally, GPA-based
pfncaches translate the GPA to a memslot/UHVA first and then resolve
the PFN, while HVA-based ones only resolve the PFN and do not store
memslot/GPA context. Technically, HVA-based pfncaches could be made to
look up the corresponding memslot/GPA when activated/refreshed, but
that would add some overhead whether or not the memory is
guest_memfd-backed. At the time of writing, only Xen uses HVA-based
pfncaches.
Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
---
virt/kvm/guest_memfd.c | 50 ++++++++++++++++++++++++++++++++++++++++++
virt/kvm/kvm_main.c | 47 ++++++++++++++++++++++++++++++++++++++-
virt/kvm/pfncache.c | 21 ++++++++++--------
3 files changed, 108 insertions(+), 10 deletions(-)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 79f34dad0c2f..011fd205ac7e 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -215,6 +215,33 @@ static void __kvm_gmem_invalidate_start(struct gmem_file *f, pgoff_t start,
struct kvm *kvm = f->kvm;
unsigned long index;
+ /*
+ * Prevent pfncaches from being activated / refreshed using stale PFN
+ * resolutions. To invalidate pfncaches _before_ invalidating the
+ * secondary MMUs (i.e. without acquiring mmu_lock), pfncaches must use
+ * mn_active_invalidate_count instead of mmu_invalidate_in_progress.
+ */
+ spin_lock(&kvm->mn_invalidate_lock);
+ kvm->mn_active_invalidate_count++;
+ spin_unlock(&kvm->mn_invalidate_lock);
+
+ /*
+ * Invalidation of pfncaches must be done using a HVA range. pfncaches
+ * can be either GPA-based or HVA-based, and all pfncaches store uhva
+ * while HVA-based pfncaches do not have gpa/memslot context. Thus,
+ * using GFN ranges would miss invalidating HVA-based ones.
+ */
+ xa_for_each_range(&f->bindings, index, slot, start, end - 1) {
+ pgoff_t pgoff = slot->gmem.pgoff;
+ gfn_t gfn_start = slot->base_gfn + max(pgoff, start) - pgoff;
+ gfn_t gfn_end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff;
+
+ unsigned long hva_start = gfn_to_hva_memslot(slot, gfn_start);
+ unsigned long hva_end = hva_start + (gfn_end - gfn_start) * PAGE_SIZE;
+
+ gpc_invalidate_hva_range_start(kvm, hva_start, hva_end);
+ }
+
xa_for_each_range(&f->bindings, index, slot, start, end - 1) {
pgoff_t pgoff = slot->gmem.pgoff;
@@ -259,12 +286,35 @@ static void __kvm_gmem_invalidate_end(struct gmem_file *f, pgoff_t start,
pgoff_t end)
{
struct kvm *kvm = f->kvm;
+ bool wake;
if (xa_find(&f->bindings, &start, end - 1, XA_PRESENT)) {
KVM_MMU_LOCK(kvm);
kvm_mmu_invalidate_end(kvm);
KVM_MMU_UNLOCK(kvm);
}
+
+ /*
+ * This must be done after the increment of mmu_invalidate_seq and
+ * smp_wmb() in kvm_mmu_invalidate_end() to guarantee that
+ * gpc_invalidate_retry() observes either the old (non-zero)
+ * mn_active_invalidate_count or the new (incremented) mmu_invalidate_seq.
+ */
+ spin_lock(&kvm->mn_invalidate_lock);
+ if (!WARN_ON_ONCE(!kvm->mn_active_invalidate_count))
+ kvm->mn_active_invalidate_count--;
+ wake = !kvm->mn_active_invalidate_count;
+ spin_unlock(&kvm->mn_invalidate_lock);
+
+ /*
+ * guest_memfd invalidation itself doesn't need to block active memslots
+ * swap as bindings updates are serialized by filemap_invalidate_lock().
+ * However, mn_active_invalidate_count is shared with the MMU notifier
+ * path, so the waiter must be woken when mn_active_invalidate_count
+ * drops to zero.
+ */
+ if (wake)
+ rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
}
static void kvm_gmem_invalidate_end(struct inode *inode, pgoff_t start,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d64e70f8e8e3..b6d0a22fee79 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2583,9 +2583,11 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
.on_lock = kvm_mmu_invalidate_end,
.may_block = true,
};
+ struct kvm_memslots *slots = kvm_memslots(kvm);
+ struct kvm_memory_slot *slot;
unsigned long i;
void *entry;
- int r = 0;
+ int r = 0, bkt;
entry = attributes ? xa_mk_value(attributes) : NULL;
@@ -2609,6 +2611,34 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
cond_resched();
}
+ /*
+ * Prevent pfncaches from being activated / refreshed using stale PFN
+ * resolutions. To invalidate pfncaches _before_ invalidating the
+ * secondary MMUs (i.e. without acquiring mmu_lock), pfncaches must use
+ * mn_active_invalidate_count instead of mmu_invalidate_in_progress.
+ */
+ spin_lock(&kvm->mn_invalidate_lock);
+ kvm->mn_active_invalidate_count++;
+ spin_unlock(&kvm->mn_invalidate_lock);
+
+ /*
+ * Invalidation of pfncaches must be done using a HVA range. pfncaches
+ * can be either GPA-based or HVA-based, and all pfncaches store uhva
+ * while HVA-based pfncaches do not have gpa/memslot info. Thus,
+ * using GFN ranges would miss invalidating HVA-based ones.
+ */
+ kvm_for_each_memslot(slot, bkt, slots) {
+ gfn_t gfn_start = max(start, slot->base_gfn);
+ gfn_t gfn_end = min(end, slot->base_gfn + slot->npages);
+
+ if (gfn_start < gfn_end) {
+ unsigned long hva_start = gfn_to_hva_memslot(slot, gfn_start);
+ unsigned long hva_end = hva_start + (gfn_end - gfn_start) * PAGE_SIZE;
+
+ gpc_invalidate_hva_range_start(kvm, hva_start, hva_end);
+ }
+ }
+
kvm_handle_gfn_range(kvm, &pre_set_range);
for (i = start; i < end; i++) {
@@ -2620,6 +2650,21 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
kvm_handle_gfn_range(kvm, &post_set_range);
+ /*
+ * This must be done after the increment of mmu_invalidate_seq and
+ * smp_wmb() in kvm_mmu_invalidate_end() to guarantee that
+ * gpc_invalidate_retry() observes either the old (non-zero)
+ * mn_active_invalidate_count or the new (incremented) mmu_invalidate_seq.
+ *
+	 * mn_memslots_update_rcuwait does not need to be woken when
+ * mn_active_invalidate_count drops to zero because active memslots swap
+ * is also done while holding slots_lock.
+ */
+ spin_lock(&kvm->mn_invalidate_lock);
+ if (!WARN_ON_ONCE(!kvm->mn_active_invalidate_count))
+ kvm->mn_active_invalidate_count--;
+ spin_unlock(&kvm->mn_invalidate_lock);
+
out_unlock:
mutex_unlock(&kvm->slots_lock);
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index bb4ba3a1b3d9..0e7a0f64e14b 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -144,7 +144,7 @@ static void gpc_unmap(kvm_pfn_t pfn, void *khva)
#endif
}
-static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_seq)
+static inline bool gpc_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
{
/*
* mn_active_invalidate_count acts for all intents and purposes
@@ -178,14 +178,17 @@ static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_s
*
* The caller holds gpc->refresh_lock, but does not hold gpc->lock nor
* kvm->slots_lock. Reading slot->flags (via kvm_slot_has_gmem() and
- * kvm_memslot_is_gmem_only()) is safe because memslot changes bump
- * slots->generation, which is detected in kvm_gpc_check(), forcing callers
- * to invoke kvm_gpc_refresh().
+ * kvm_memslot_is_gmem_only()) and looking up memory attributes (via
+ * kvm_mem_is_private()) without those locks is safe because:
*
- * Looking up memory attributes (via kvm_mem_is_private()) can race with
- * KVM_SET_MEMORY_ATTRIBUTES, which takes kvm->slots_lock to serialize
- * writers but doesn't exclude lockless readers. Handling that race is deferred
- * to a subsequent commit that wires up pfncache invalidation for gmem events.
+ * - memslot changes bump slots->generation, which is detected in
+ * kvm_gpc_check(), forcing callers to invoke kvm_gpc_refresh().
+ *
+ * - Memory attribute changes and gmem invalidations elevate
+ * mn_active_invalidate_count and bump mmu_invalidate_seq, bracketing the
+ * pfncache invalidation. gpc_invalidate_retry() observes either of these
+ * changes and forces a retry of the refresh loop in gpc_to_pfn_retry(), so
+ * any stale value read here will be re-evaluated.
*/
static inline bool gpc_is_gmem_backed(struct gfn_to_pfn_cache *gpc)
{
@@ -293,7 +296,7 @@ static kvm_pfn_t gpc_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
* attempting to refresh.
*/
WARN_ON_ONCE(gpc->valid);
- } while (mmu_notifier_retry_cache(gpc->kvm, mmu_seq));
+ } while (gpc_invalidate_retry(gpc->kvm, mmu_seq));
gpc->valid = true;
gpc->pfn = new_pfn;
--
2.50.1
Thread overview: 8+ messages
2026-04-20 15:46 [RFC PATCH v4 0/7] KVM: pfncache: Add guest_memfd support to pfncache Takahiro Itazuri
2026-04-20 15:46 ` [RFC PATCH v4 1/7] KVM: pfncache: Resolve PFNs via kvm_gmem_get_pfn() for gmem-backed GPAs Takahiro Itazuri
2026-04-20 15:46 ` [RFC PATCH v4 2/7] KVM: pfncache: Obtain KHVA via vmap() for gmem with NO_DIRECT_MAP Takahiro Itazuri
2026-04-20 15:46 ` [RFC PATCH v4 3/7] KVM: Rename invalidate_begin to invalidate_start for consistency Takahiro Itazuri
2026-04-20 15:46 ` [RFC PATCH v4 4/7] KVM: pfncache: Rename invalidate_start() helper Takahiro Itazuri
2026-04-20 15:46 ` Takahiro Itazuri [this message]
2026-04-20 15:46 ` [RFC PATCH v4 6/7] KVM: selftests: Test pfncache with gmem-backed memory Takahiro Itazuri
2026-04-20 15:46 ` [RFC PATCH v4 7/7] KVM: selftests: Test pfncache invalidation for " Takahiro Itazuri