public inbox for kvm@vger.kernel.org
* [RFC PATCH v2 0/7] KVM: pfncache: Add guest_memfd support to pfncache
@ 2026-02-26 13:53 Takahiro Itazuri
  2026-02-26 13:53 ` [RFC PATCH v2 1/7] KVM: x86: Avoid silent kvm-clock activation failures Takahiro Itazuri
                   ` (6 more replies)
  0 siblings, 7 replies; 10+ messages in thread
From: Takahiro Itazuri @ 2026-02-26 13:53 UTC (permalink / raw)
  To: kvm, Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Fuad Tabba, Brendan Jackman, David Hildenbrand,
	David Woodhouse, Paul Durrant, Nikita Kalyazin, Patrick Roy,
	Takahiro Itazuri

[ based on v6.18 with [1] ]

This patch series is a follow-up to RFC v1.  (This is still labelled RFC
since its dependency [1] has not yet been merged.)

=== Problem Statement ===

gfn_to_pfn_cache (a.k.a. pfncache) does not work with guest_memfd.  As
of today, pfncaches resolve PFNs via hva_to_pfn(), which requires a
userspace mapping and relies on GUP.  This does not work for guest_memfd
in the following two ways:

  * guest_memfd created without GUEST_MEMFD_FLAG_MMAP does not have a
    userspace mapping due to the nature of private memory.

  * guest_memfd created with GUEST_MEMFD_FLAG_NO_DIRECT_MAP uses an
    AS_NO_DIRECT_MAP mapping, which is rejected by GUP.

In addition, pfncaches map RAM pages via kmap(), which typically returns
an address derived from the direct map, so kmap() cannot be used for
NO_DIRECT_MAP guest_memfd.  pfncaches require fault-free KHVAs since
they can be used from atomic context; thus, they cannot fall back to
access via a userspace mapping the way KVM does for other accesses to
NO_DIRECT_MAP guest_memfd.

The introduction of guest_memfd support requires new invalidation paths
in addition to the existing MMU notifier path: one from guest_memfd
invalidation and another from memory attribute updates.

=== Core Approach ===

The core part keeps the original approach in RFC v1:

  * Resolve PFNs for guest_memfd-backed GPAs via kvm_gmem_get_pfn()

  * Obtain a fault-free KHVA for NO_DIRECT_MAP pages via vmap()

=== Changes since RFC v1 ===

  * Hook pfncache invalidation into guest_memfd invalidation (punch hole
    / release / error handling) as well as into memory attribute updates
    (switches between shared and private memory).

=== Design Considerations (Feedback Appreciated) ===

To implement the above change, this series tries to reuse as much of the
existing invalidation and retry infrastructure as possible.  The
following points are potential design trade-offs where feedback is
especially welcome:

  * Generalize and reuse the existing mn_active_invalidate_count
    (renamed to active_invalidate_count).  This allows reusing the
    existing pfncache retry logic as-is and enables invalidating
    pfncaches without holding mmu_lock from guest_memfd invalidation
    context.  As a side effect, the active memslots swap is blocked
    while active_invalidate_count > 0.  To avoid this blocking, it would
    be possible to introduce a dedicated gmem_active_invalidate_count in
    struct kvm instead.

  * Although both guest_memfd invalidation and memory attribute update
    are driven by GFN ranges, pfncache invalidation is performed using
    HVA ranges and reuses the existing function.  This is because
    GPA-based pfncaches translate GPA->UHVA->PFN and therefore have
    memslot/GPA info, whereas HVA-based pfncaches resolve PFN directly
    from UHVA and do not store memslot/GPA info.  Using GFN-based
    invalidation would therefore miss HVA-based pfncaches.  Technically,
    it would be possible to refactor HVA-based pfncaches to search for
    and retain the corresponding memslot/GPA at activation / refresh
    time instead of at invalidation time.

  * pfncaches are not dynamically allocated but are statically allocated
    on a per-VM and per-vCPU basis.  For a normal VM (i.e. non-Xen),
    there is one pfncache per vCPU.  For a Xen VM, there is one per-VM
    pfncache and five per-vCPU pfncaches.  Given the maximum of 1024
    vCPUs, a normal VM can have up to 1024 pfncaches, consuming 4 MB of
    virtual address space.  A Xen VM can have up to 5121 pfncaches,
    consuming approximately 20 MB of virtual address space.  Although
    the vmalloc area is limited on 32-bit systems, it is ample on 64-bit
    systems (e.g. 32 TB for 4-level paging and 12800 TB for 5-level
    paging on x86_64).  If virtual address space exhaustion became a
    concern, migration to an mm-local region (forthcoming mermap?)
    could be considered in the future.  Note that vmap() only creates
    virtual mappings to existing pages; it does not allocate new
    physical pages.

  * With this patch series, HVA-based pfncaches always resolve PFNs
    via hva_to_pfn(), and thus activation for NO_DIRECT_MAP guest_memfd
    fails.  It is technically possible to support this scenario, but it
    would require searching the corresponding memslot and GPA from the
    given UHVA in order to determine whether it is backed by
    guest_memfd.  Doing so would add some overhead to the HVA-based
    pfncache activation / refresh paths, regardless of whether the cache
    is guest_memfd-backed.  At the time of writing, only Xen uses
    HVA-based pfncaches.

RFC v1: https://lore.kernel.org/all/20251203144159.6131-1-itazur@amazon.com/

[1]: https://lore.kernel.org/all/20260126164445.11867-1-kalyazin@amazon.com/

Takahiro Itazuri (7):
  KVM: x86: Avoid silent kvm-clock activation failures
  KVM: pfncache: Resolve PFNs via kvm_gmem_get_pfn() for gmem-backed GPAs
  KVM: pfncache: Obtain KHVA via vmap() for gmem with NO_DIRECT_MAP
  KVM: Rename invalidate_begin to invalidate_start for consistency
  KVM: pfncache: Rename invalidate_start() helper
  KVM: Rename mn_* invalidate-related fields to generic ones
  KVM: pfncache: Invalidate on gmem invalidation and memattr updates

 Documentation/virt/kvm/locking.rst |   8 +--
 arch/x86/kvm/mmu/mmu.c             |   2 +-
 arch/x86/kvm/x86.c                 |  18 ++---
 include/linux/kvm_host.h           |  13 ++--
 include/linux/mmu_notifier.h       |   4 +-
 virt/kvm/guest_memfd.c             |  64 +++++++++++++++--
 virt/kvm/kvm_main.c                |  99 +++++++++++++++++++-------
 virt/kvm/kvm_mm.h                  |  12 ++--
 virt/kvm/pfncache.c                | 110 ++++++++++++++++++++---------
 9 files changed, 235 insertions(+), 95 deletions(-)

-- 
2.50.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC PATCH v2 1/7] KVM: x86: Avoid silent kvm-clock activation failures
  2026-02-26 13:53 [RFC PATCH v2 0/7] KVM: pfncache: Add guest_memfd support to pfncache Takahiro Itazuri
@ 2026-02-26 13:53 ` Takahiro Itazuri
  2026-03-05 17:50   ` Sean Christopherson
  2026-02-26 13:53 ` [RFC PATCH v2 2/7] KVM: pfncache: Resolve PFNs via kvm_gmem_get_pfn() for gmem-backed GPAs Takahiro Itazuri
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 10+ messages in thread
From: Takahiro Itazuri @ 2026-02-26 13:53 UTC (permalink / raw)
  To: kvm, Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Fuad Tabba, Brendan Jackman, David Hildenbrand,
	David Woodhouse, Paul Durrant, Nikita Kalyazin, Patrick Roy,
	Takahiro Itazuri

kvm_write_system_time() currently ignores the return value of
kvm_gpc_activate().  As a result, kvm-clock activation can fail
silently, making debugging harder.

Propagate the return value so that the MSR write fails properly instead
of continuing silently.

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
---
 arch/x86/kvm/x86.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a447663d5eff..a729b8419b61 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2438,7 +2438,7 @@ static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock, int sec_hi_o
 	kvm_write_guest(kvm, wall_clock, &version, sizeof(version));
 }
 
-static void kvm_write_system_time(struct kvm_vcpu *vcpu, gpa_t system_time,
+static int kvm_write_system_time(struct kvm_vcpu *vcpu, gpa_t system_time,
 				  bool old_msr, bool host_initiated)
 {
 	struct kvm_arch *ka = &vcpu->kvm->arch;
@@ -2455,12 +2455,12 @@ static void kvm_write_system_time(struct kvm_vcpu *vcpu, gpa_t system_time,
 
 	/* we verify if the enable bit is set... */
 	if (system_time & 1)
-		kvm_gpc_activate(&vcpu->arch.pv_time, system_time & ~1ULL,
-				 sizeof(struct pvclock_vcpu_time_info));
-	else
-		kvm_gpc_deactivate(&vcpu->arch.pv_time);
+		return kvm_gpc_activate(&vcpu->arch.pv_time,
+					system_time & ~1ULL,
+					sizeof(struct pvclock_vcpu_time_info));
 
-	return;
+	kvm_gpc_deactivate(&vcpu->arch.pv_time);
+	return 0;
 }
 
 static uint32_t div_frac(uint32_t dividend, uint32_t divisor)
@@ -4156,13 +4156,15 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (!guest_pv_has(vcpu, KVM_FEATURE_CLOCKSOURCE2))
 			return 1;
 
-		kvm_write_system_time(vcpu, data, false, msr_info->host_initiated);
+		if (kvm_write_system_time(vcpu, data, false, msr_info->host_initiated))
+			return 1;
 		break;
 	case MSR_KVM_SYSTEM_TIME:
 		if (!guest_pv_has(vcpu, KVM_FEATURE_CLOCKSOURCE))
 			return 1;
 
-		kvm_write_system_time(vcpu, data, true,  msr_info->host_initiated);
+		if (kvm_write_system_time(vcpu, data, true,  msr_info->host_initiated))
+			return 1;
 		break;
 	case MSR_KVM_ASYNC_PF_EN:
 		if (!guest_pv_has(vcpu, KVM_FEATURE_ASYNC_PF))
-- 
2.50.1



* [RFC PATCH v2 2/7] KVM: pfncache: Resolve PFNs via kvm_gmem_get_pfn() for gmem-backed GPAs
  2026-02-26 13:53 [RFC PATCH v2 0/7] KVM: pfncache: Add guest_memfd support to pfncache Takahiro Itazuri
  2026-02-26 13:53 ` [RFC PATCH v2 1/7] KVM: x86: Avoid silent kvm-clock activation failures Takahiro Itazuri
@ 2026-02-26 13:53 ` Takahiro Itazuri
  2026-02-26 13:53 ` [RFC PATCH v2 3/7] KVM: pfncache: Obtain KHVA via vmap() for gmem with NO_DIRECT_MAP Takahiro Itazuri
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Takahiro Itazuri @ 2026-02-26 13:53 UTC (permalink / raw)
  To: kvm, Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Fuad Tabba, Brendan Jackman, David Hildenbrand,
	David Woodhouse, Paul Durrant, Nikita Kalyazin, Patrick Roy,
	Takahiro Itazuri

Currently, pfncaches always resolve PFNs via hva_to_pfn(), which
requires a userspace mapping and relies on GUP.  This does not work for
guest_memfd in the following two ways:

  * guest_memfd created without GUEST_MEMFD_FLAG_MMAP does not have a
    userspace mapping for private memory.

  * guest_memfd created with GUEST_MEMFD_FLAG_NO_DIRECT_MAP uses an
    AS_NO_DIRECT_MAP mapping, which is rejected by GUP.

Resolve PFNs via kvm_gmem_get_pfn() for guest_memfd-backed and GPA-based
pfncaches.  Otherwise, fall back to the existing hva_to_pfn().

Note that HVA-based pfncaches always resolve PFNs via hva_to_pfn(), and
thus activation of HVA-based pfncaches for NO_DIRECT_MAP guest_memfd
fails.  Supporting this scenario would be technically possible, but
would require searching the corresponding memslot and GPA from the given
UHVA in order to determine whether it is backed by guest_memfd.  Doing
so would add some overhead to the HVA-based pfncache activation /
refresh paths, regardless of whether the cache is guest_memfd-backed.
At the time of writing, only Xen uses HVA-based pfncaches.

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
---
 virt/kvm/pfncache.c | 45 +++++++++++++++++++++++++++++++++------------
 1 file changed, 33 insertions(+), 12 deletions(-)

diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 728d2c1b488a..100a8e2f114b 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -152,7 +152,36 @@ static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_s
 	return kvm->mmu_invalidate_seq != mmu_seq;
 }
 
-static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
+static inline bool gpc_is_gmem_backed(struct gfn_to_pfn_cache *gpc)
+{
+	/* For HVA-based pfncaches, memslot is NULL */
+	return gpc->memslot && kvm_slot_has_gmem(gpc->memslot) &&
+	       (kvm_memslot_is_gmem_only(gpc->memslot) ||
+		kvm_mem_is_private(gpc->kvm, gpa_to_gfn(gpc->gpa)));
+}
+
+static kvm_pfn_t gpc_to_pfn(struct gfn_to_pfn_cache *gpc, struct page **page)
+{
+	if (gpc_is_gmem_backed(gpc)) {
+		kvm_pfn_t pfn;
+
+		if (kvm_gmem_get_pfn(gpc->kvm, gpc->memslot,
+				     gpa_to_gfn(gpc->gpa), &pfn, page, NULL))
+			return KVM_PFN_ERR_FAULT;
+
+		return pfn;
+	}
+
+	return hva_to_pfn(&(struct kvm_follow_pfn) {
+		.slot = gpc->memslot,
+		.gfn = gpa_to_gfn(gpc->gpa),
+		.flags = FOLL_WRITE,
+		.hva = gpc->uhva,
+		.refcounted_page = page,
+	});
+}
+
+static kvm_pfn_t gpc_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 {
 	/* Note, the new page offset may be different than the old! */
 	void *old_khva = (void *)PAGE_ALIGN_DOWN((uintptr_t)gpc->khva);
@@ -161,14 +190,6 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 	unsigned long mmu_seq;
 	struct page *page;
 
-	struct kvm_follow_pfn kfp = {
-		.slot = gpc->memslot,
-		.gfn = gpa_to_gfn(gpc->gpa),
-		.flags = FOLL_WRITE,
-		.hva = gpc->uhva,
-		.refcounted_page = &page,
-	};
-
 	lockdep_assert_held(&gpc->refresh_lock);
 
 	lockdep_assert_held_write(&gpc->lock);
@@ -206,7 +227,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 			cond_resched();
 		}
 
-		new_pfn = hva_to_pfn(&kfp);
+		new_pfn = gpc_to_pfn(gpc, &page);
 		if (is_error_noslot_pfn(new_pfn))
 			goto out_error;
 
@@ -319,7 +340,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 		}
 	}
 
-	/* Note: the offset must be correct before calling hva_to_pfn_retry() */
+	/* Note: the offset must be correct before calling gpc_to_pfn_retry() */
 	gpc->uhva += page_offset;
 
 	/*
@@ -327,7 +348,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 	 * drop the lock and do the HVA to PFN lookup again.
 	 */
 	if (!gpc->valid || hva_change) {
-		ret = hva_to_pfn_retry(gpc);
+		ret = gpc_to_pfn_retry(gpc);
 	} else {
 		/*
 		 * If the HVA→PFN mapping was already valid, don't unmap it.
-- 
2.50.1



* [RFC PATCH v2 3/7] KVM: pfncache: Obtain KHVA via vmap() for gmem with NO_DIRECT_MAP
  2026-02-26 13:53 [RFC PATCH v2 0/7] KVM: pfncache: Add guest_memfd support to pfncache Takahiro Itazuri
  2026-02-26 13:53 ` [RFC PATCH v2 1/7] KVM: x86: Avoid silent kvm-clock activation failures Takahiro Itazuri
  2026-02-26 13:53 ` [RFC PATCH v2 2/7] KVM: pfncache: Resolve PFNs via kvm_gmem_get_pfn() for gmem-backed GPAs Takahiro Itazuri
@ 2026-02-26 13:53 ` Takahiro Itazuri
  2026-02-26 13:53 ` [RFC PATCH v2 4/7] KVM: Rename invalidate_begin to invalidate_start for consistency Takahiro Itazuri
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Takahiro Itazuri @ 2026-02-26 13:53 UTC (permalink / raw)
  To: kvm, Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Fuad Tabba, Brendan Jackman, David Hildenbrand,
	David Woodhouse, Paul Durrant, Nikita Kalyazin, Patrick Roy,
	Takahiro Itazuri

Currently, pfncaches map RAM pages via kmap(), which typically returns a
kernel address derived from the direct map.  However, guest_memfd
created with GUEST_MEMFD_FLAG_NO_DIRECT_MAP has its pages removed from
the direct map and uses an AS_NO_DIRECT_MAP mapping, so kmap() cannot be
used in this case.

pfncaches can be used from atomic context where page faults cannot be
tolerated.  Therefore, they cannot fall back to access via a userspace
mapping the way KVM does for other accesses to NO_DIRECT_MAP
guest_memfd.

To obtain a fault-free kernel host virtual address (KHVA), use vmap()
for NO_DIRECT_MAP pages.  Since gpc_map() is the sole producer of KHVA
for pfncaches and only vmap() returns a vmalloc address, gpc_unmap()
can reliably pair vunmap() using is_vmalloc_addr().

Although vm_map_ram() could be faster than vmap(), mixing short-lived
and long-lived vm_map_ram() mappings can lead to fragmentation, so
vm_map_ram() is recommended only for short-lived mappings.  Since
pfncaches typically have a lifetime comparable to that of the VM,
vm_map_ram() is deliberately not used here.

pfncaches are not dynamically allocated but are statically allocated on
a per-VM and per-vCPU basis.  For a normal VM (i.e. non-Xen), there is
one pfncache per vCPU.  For a Xen VM, there is one per-VM pfncache and
five per-vCPU pfncaches.  Given the maximum of 1024 vCPUs, a normal VM
can have up to 1024 pfncaches, consuming 4 MB of virtual address space.
A Xen VM can have up to 5121 pfncaches, consuming approximately 20 MB of
virtual address space.  Although the vmalloc area is limited on 32-bit
systems, it is ample on 64-bit systems (e.g. 32 TB for 4-level paging
and 12800 TB for 5-level paging on x86_64).  If virtual address space
exhaustion becomes a concern, migration to an mm-local region (like
forthcoming mermap?) could be considered in the future.  Note that
vmap() and vm_map_ram() only create virtual mappings to existing pages;
they do not allocate new physical pages.

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
---
 virt/kvm/pfncache.c | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 100a8e2f114b..531adc4dcb11 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -16,6 +16,7 @@
 #include <linux/highmem.h>
 #include <linux/module.h>
 #include <linux/errno.h>
+#include <linux/pagemap.h>
 
 #include "kvm_mm.h"
 
@@ -98,8 +99,19 @@ bool kvm_gpc_check(struct gfn_to_pfn_cache *gpc, unsigned long len)
 
 static void *gpc_map(kvm_pfn_t pfn)
 {
-	if (pfn_valid(pfn))
-		return kmap(pfn_to_page(pfn));
+	if (pfn_valid(pfn)) {
+		struct page *page = pfn_to_page(pfn);
+		struct page *head = compound_head(page);
+		struct address_space *mapping = READ_ONCE(head->mapping);
+
+		if (mapping && mapping_no_direct_map(mapping)) {
+			struct page *pages[] = { page };
+
+			return vmap(pages, 1, VM_MAP, PAGE_KERNEL);
+		}
+
+		return kmap(page);
+	}
 
 #ifdef CONFIG_HAS_IOMEM
 	return memremap(pfn_to_hpa(pfn), PAGE_SIZE, MEMREMAP_WB);
@@ -115,7 +127,15 @@ static void gpc_unmap(kvm_pfn_t pfn, void *khva)
 		return;
 
 	if (pfn_valid(pfn)) {
-		kunmap(pfn_to_page(pfn));
+		/*
+		 * For valid PFNs, gpc_map() returns either a kmap() address
+		 * (non-vmalloc) or a vmap() address (vmalloc).
+		 */
+		if (is_vmalloc_addr(khva))
+			vunmap(khva);
+		else
+			kunmap(pfn_to_page(pfn));
+
 		return;
 	}
 
@@ -233,8 +253,11 @@ static kvm_pfn_t gpc_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 
 		/*
 		 * Obtain a new kernel mapping if KVM itself will access the
-		 * pfn.  Note, kmap() and memremap() can both sleep, so this
-		 * too must be done outside of gpc->lock!
+		 * pfn.  Note, kmap(), vmap() and memremap() can all sleep, so
+		 * this too must be done outside of gpc->lock!
+		 * Note that even though gpc->lock is dropped, it's still fine
+		 * to read gpc->pfn and other fields because gpc->refresh_lock
+		 * mutex prevents them from being updated.
 		 */
 		if (new_pfn == gpc->pfn)
 			new_khva = old_khva;
-- 
2.50.1



* [RFC PATCH v2 4/7] KVM: Rename invalidate_begin to invalidate_start for consistency
  2026-02-26 13:53 [RFC PATCH v2 0/7] KVM: pfncache: Add guest_memfd support to pfncache Takahiro Itazuri
                   ` (2 preceding siblings ...)
  2026-02-26 13:53 ` [RFC PATCH v2 3/7] KVM: pfncache: Obtain KHVA via vmap() for gmem with NO_DIRECT_MAP Takahiro Itazuri
@ 2026-02-26 13:53 ` Takahiro Itazuri
  2026-02-26 13:53 ` [RFC PATCH v2 5/7] KVM: pfncache: Rename invalidate_start() helper Takahiro Itazuri
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Takahiro Itazuri @ 2026-02-26 13:53 UTC (permalink / raw)
  To: kvm, Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Fuad Tabba, Brendan Jackman, David Hildenbrand,
	David Woodhouse, Paul Durrant, Nikita Kalyazin, Patrick Roy,
	Takahiro Itazuri

Most MMU-related helpers use a "_start" suffix.  Align with the
prevailing naming convention for consistency across the MMU-related
codebase.

```
$ git grep -E "invalidate(_range)?_start" | wc -l
123

$ git grep -E "invalidate(_range)?_begin" | wc -l
14
```

No functional change intended.

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
---
 arch/x86/kvm/mmu/mmu.c       |  2 +-
 include/linux/kvm_host.h     |  2 +-
 include/linux/mmu_notifier.h |  4 ++--
 virt/kvm/guest_memfd.c       | 14 +++++++-------
 virt/kvm/kvm_main.c          |  6 +++---
 5 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d3e705ac4c6f..e82a357e2219 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6859,7 +6859,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 
 	write_lock(&kvm->mmu_lock);
 
-	kvm_mmu_invalidate_begin(kvm);
+	kvm_mmu_invalidate_start(kvm);
 
 	kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2ea5d2f172f7..618a71894ed1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1566,7 +1566,7 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 #endif
 
-void kvm_mmu_invalidate_begin(struct kvm *kvm);
+void kvm_mmu_invalidate_start(struct kvm *kvm);
 void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
 void kvm_mmu_invalidate_end(struct kvm *kvm);
 bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index d1094c2d5fb6..8ecf36a84e3b 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -134,8 +134,8 @@ struct mmu_notifier_ops {
 	 * Invalidation of multiple concurrent ranges may be
 	 * optionally permitted by the driver. Either way the
 	 * establishment of sptes is forbidden in the range passed to
-	 * invalidate_range_begin/end for the whole duration of the
-	 * invalidate_range_begin/end critical section.
+	 * invalidate_range_start/end for the whole duration of the
+	 * invalidate_range_start/end critical section.
 	 *
 	 * invalidate_range_start() is called when all pages in the
 	 * range are still mapped and have at least a refcount of one.
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 5d6e966d4f32..79f34dad0c2f 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -206,7 +206,7 @@ static enum kvm_gfn_range_filter kvm_gmem_get_invalidate_filter(struct inode *in
 	return KVM_FILTER_PRIVATE;
 }
 
-static void __kvm_gmem_invalidate_begin(struct gmem_file *f, pgoff_t start,
+static void __kvm_gmem_invalidate_start(struct gmem_file *f, pgoff_t start,
 					pgoff_t end,
 					enum kvm_gfn_range_filter attr_filter)
 {
@@ -230,7 +230,7 @@ static void __kvm_gmem_invalidate_begin(struct gmem_file *f, pgoff_t start,
 			found_memslot = true;
 
 			KVM_MMU_LOCK(kvm);
-			kvm_mmu_invalidate_begin(kvm);
+			kvm_mmu_invalidate_start(kvm);
 		}
 
 		flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
@@ -243,7 +243,7 @@ static void __kvm_gmem_invalidate_begin(struct gmem_file *f, pgoff_t start,
 		KVM_MMU_UNLOCK(kvm);
 }
 
-static void kvm_gmem_invalidate_begin(struct inode *inode, pgoff_t start,
+static void kvm_gmem_invalidate_start(struct inode *inode, pgoff_t start,
 				      pgoff_t end)
 {
 	enum kvm_gfn_range_filter attr_filter;
@@ -252,7 +252,7 @@ static void kvm_gmem_invalidate_begin(struct inode *inode, pgoff_t start,
 	attr_filter = kvm_gmem_get_invalidate_filter(inode);
 
 	kvm_gmem_for_each_file(f, inode->i_mapping)
-		__kvm_gmem_invalidate_begin(f, start, end, attr_filter);
+		__kvm_gmem_invalidate_start(f, start, end, attr_filter);
 }
 
 static void __kvm_gmem_invalidate_end(struct gmem_file *f, pgoff_t start,
@@ -287,7 +287,7 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	 */
 	filemap_invalidate_lock(inode->i_mapping);
 
-	kvm_gmem_invalidate_begin(inode, start, end);
+	kvm_gmem_invalidate_start(inode, start, end);
 
 	truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
 
@@ -401,7 +401,7 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
 	 * Zap all SPTEs pointed at by this file.  Do not free the backing
 	 * memory, as its lifetime is associated with the inode, not the file.
 	 */
-	__kvm_gmem_invalidate_begin(f, 0, -1ul,
+	__kvm_gmem_invalidate_start(f, 0, -1ul,
 				    kvm_gmem_get_invalidate_filter(inode));
 	__kvm_gmem_invalidate_end(f, 0, -1ul);
 
@@ -582,7 +582,7 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
 	start = folio->index;
 	end = start + folio_nr_pages(folio);
 
-	kvm_gmem_invalidate_begin(mapping->host, start, end);
+	kvm_gmem_invalidate_start(mapping->host, start, end);
 
 	/*
 	 * Do not truncate the range, what action is taken in response to the
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 60a8b7ca8ab4..5871882ff1db 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -678,7 +678,7 @@ static __always_inline int kvm_age_hva_range_no_flush(struct mmu_notifier *mn,
 	return kvm_age_hva_range(mn, start, end, handler, false);
 }
 
-void kvm_mmu_invalidate_begin(struct kvm *kvm)
+void kvm_mmu_invalidate_start(struct kvm *kvm)
 {
 	lockdep_assert_held_write(&kvm->mmu_lock);
 	/*
@@ -734,7 +734,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		.start		= range->start,
 		.end		= range->end,
 		.handler	= kvm_mmu_unmap_gfn_range,
-		.on_lock	= kvm_mmu_invalidate_begin,
+		.on_lock	= kvm_mmu_invalidate_start,
 		.flush_on_ret	= true,
 		.may_block	= mmu_notifier_range_blockable(range),
 	};
@@ -2571,7 +2571,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 		.end = end,
 		.arg.attributes = attributes,
 		.handler = kvm_pre_set_memory_attributes,
-		.on_lock = kvm_mmu_invalidate_begin,
+		.on_lock = kvm_mmu_invalidate_start,
 		.flush_on_ret = true,
 		.may_block = true,
 	};
-- 
2.50.1



* [RFC PATCH v2 5/7] KVM: pfncache: Rename invalidate_start() helper
  2026-02-26 13:53 [RFC PATCH v2 0/7] KVM: pfncache: Add guest_memfd support to pfncache Takahiro Itazuri
                   ` (3 preceding siblings ...)
  2026-02-26 13:53 ` [RFC PATCH v2 4/7] KVM: Rename invalidate_begin to invalidate_start for consistency Takahiro Itazuri
@ 2026-02-26 13:53 ` Takahiro Itazuri
  2026-02-26 13:53 ` [RFC PATCH v2 6/7] KVM: Rename mn_* invalidate-related fields to generic ones Takahiro Itazuri
  2026-02-26 13:53 ` [RFC PATCH v2 7/7] KVM: pfncache: Invalidate on gmem invalidation and memattr updates Takahiro Itazuri
  6 siblings, 0 replies; 10+ messages in thread
From: Takahiro Itazuri @ 2026-02-26 13:53 UTC (permalink / raw)
  To: kvm, Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Fuad Tabba, Brendan Jackman, David Hildenbrand,
	David Woodhouse, Paul Durrant, Nikita Kalyazin, Patrick Roy,
	Takahiro Itazuri

Rename gfn_to_pfn_cache_invalidate_start() to
gpc_invalidate_hva_range_start() to explicitly indicate that it takes an
HVA range.

No functional changes intended.

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
---
 virt/kvm/kvm_main.c |  2 +-
 virt/kvm/kvm_mm.h   | 12 ++++++------
 virt/kvm/pfncache.c |  4 ++--
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5871882ff1db..d64e70f8e8e3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -763,7 +763,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 * mn_active_invalidate_count (see above) instead of
 	 * mmu_invalidate_in_progress.
 	 */
-	gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end);
+	gpc_invalidate_hva_range_start(kvm, range->start, range->end);
 
 	/*
 	 * If one or more memslots were found and thus zapped, notify arch code
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index 9fcc5d5b7f8d..abd8e7d33ab0 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -56,13 +56,13 @@ struct kvm_follow_pfn {
 kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *kfp);
 
 #ifdef CONFIG_HAVE_KVM_PFNCACHE
-void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
-				       unsigned long start,
-				       unsigned long end);
+void gpc_invalidate_hva_range_start(struct kvm *kvm,
+				    unsigned long start,
+				    unsigned long end);
 #else
-static inline void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
-						     unsigned long start,
-						     unsigned long end)
+static inline void gpc_invalidate_hva_range_start(struct kvm *kvm,
+						  unsigned long start,
+						  unsigned long end)
 {
 }
 #endif /* HAVE_KVM_PFNCACHE */
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 531adc4dcb11..3ff8251727e2 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -23,8 +23,8 @@
 /*
  * MMU notifier 'invalidate_range_start' hook.
  */
-void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start,
-				       unsigned long end)
+void gpc_invalidate_hva_range_start(struct kvm *kvm, unsigned long start,
+				    unsigned long end)
 {
 	struct gfn_to_pfn_cache *gpc;
 
-- 
2.50.1



* [RFC PATCH v2 6/7] KVM: Rename mn_* invalidate-related fields to generic ones
  2026-02-26 13:53 [RFC PATCH v2 0/7] KVM: pfncache: Add guest_memfd support to pfncache Takahiro Itazuri
                   ` (4 preceding siblings ...)
  2026-02-26 13:53 ` [RFC PATCH v2 5/7] KVM: pfncache: Rename invalidate_start() helper Takahiro Itazuri
@ 2026-02-26 13:53 ` Takahiro Itazuri
  2026-02-26 13:53 ` [RFC PATCH v2 7/7] KVM: pfncache: Invalidate on gmem invalidation and memattr updates Takahiro Itazuri
  6 siblings, 0 replies; 10+ messages in thread
From: Takahiro Itazuri @ 2026-02-26 13:53 UTC (permalink / raw)
  To: kvm, Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Fuad Tabba, Brendan Jackman, David Hildenbrand,
	David Woodhouse, Paul Durrant, Nikita Kalyazin, Patrick Roy,
	Takahiro Itazuri

The addition of guest_memfd support to pfncaches introduces additional
sources of pfncache invalidation beyond the MMU notifier path.  The
existing mn_* naming implies that these fields are relevant only to MMU
notifiers, which is no longer true.

No functional changes intended.

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
---
 Documentation/virt/kvm/locking.rst |  8 +++---
 include/linux/kvm_host.h           | 11 ++++---
 virt/kvm/kvm_main.c                | 46 +++++++++++++++---------------
 virt/kvm/pfncache.c                | 28 +++++++++---------
 4 files changed, 47 insertions(+), 46 deletions(-)

diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index ae8bce7fecbe..73679044ce44 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -20,7 +20,7 @@ The acquisition orders for mutexes are as follows:
 - kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
   them together is quite rare.
 
-- kvm->mn_active_invalidate_count ensures that pairs of
+- kvm->active_invalidate_count ensures that pairs of MMU notifier's
   invalidate_range_start() and invalidate_range_end() callbacks
   use the same memslots array.  kvm->slots_lock and kvm->slots_arch_lock
   are taken on the waiting side when modifying memslots, so MMU notifiers
@@ -249,12 +249,12 @@ time it will be set using the Dirty tracking mechanism described above.
 :Comment:	Exists to allow taking cpus_read_lock() while kvm_usage_count is
 		protected, which simplifies the virtualization enabling logic.
 
-``kvm->mn_invalidate_lock``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
+``kvm->invalidate_lock``
+^^^^^^^^^^^^^^^^^^^^^^^^
 
 :Type:          spinlock_t
 :Arch:          any
-:Protects:      mn_active_invalidate_count, mn_memslots_update_rcuwait
+:Protects:      active_invalidate_count, memslots_update_rcuwait
 
 ``kvm_arch::tsc_write_lock``
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 618a71894ed1..7faa83d3d306 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -814,10 +814,13 @@ struct kvm {
 	 */
 	atomic_t nr_memslots_dirty_logging;
 
-	/* Used to wait for completion of MMU notifiers.  */
-	spinlock_t mn_invalidate_lock;
-	unsigned long mn_active_invalidate_count;
-	struct rcuwait mn_memslots_update_rcuwait;
+	/*
+	 * Used by active memslots swap and pfncache refresh to wait for
+	 * invalidation to complete.
+	 */
+	spinlock_t invalidate_lock;
+	unsigned long active_invalidate_count;
+	struct rcuwait memslots_update_rcuwait;
 
 	/* For management / invalidation of gfn_to_pfn_caches */
 	spinlock_t gpc_lock;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d64e70f8e8e3..f51056e971d0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -749,9 +749,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 *
 	 * Pairs with the decrement in range_end().
 	 */
-	spin_lock(&kvm->mn_invalidate_lock);
-	kvm->mn_active_invalidate_count++;
-	spin_unlock(&kvm->mn_invalidate_lock);
+	spin_lock(&kvm->invalidate_lock);
+	kvm->active_invalidate_count++;
+	spin_unlock(&kvm->invalidate_lock);
 
 	/*
 	 * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e.
@@ -760,7 +760,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 * any given time, and the caches themselves can check for hva overlap,
 	 * i.e. don't need to rely on memslot overlap checks for performance.
 	 * Because this runs without holding mmu_lock, the pfn caches must use
-	 * mn_active_invalidate_count (see above) instead of
+	 * active_invalidate_count (see above) instead of
 	 * mmu_invalidate_in_progress.
 	 */
 	gpc_invalidate_hva_range_start(kvm, range->start, range->end);
@@ -819,18 +819,18 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 	kvm_handle_hva_range(kvm, &hva_range);
 
 	/* Pairs with the increment in range_start(). */
-	spin_lock(&kvm->mn_invalidate_lock);
-	if (!WARN_ON_ONCE(!kvm->mn_active_invalidate_count))
-		--kvm->mn_active_invalidate_count;
-	wake = !kvm->mn_active_invalidate_count;
-	spin_unlock(&kvm->mn_invalidate_lock);
+	spin_lock(&kvm->invalidate_lock);
+	if (!WARN_ON_ONCE(!kvm->active_invalidate_count))
+		--kvm->active_invalidate_count;
+	wake = !kvm->active_invalidate_count;
+	spin_unlock(&kvm->invalidate_lock);
 
 	/*
 	 * There can only be one waiter, since the wait happens under
 	 * slots_lock.
 	 */
 	if (wake)
-		rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
+		rcuwait_wake_up(&kvm->memslots_update_rcuwait);
 }
 
 static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
@@ -1131,8 +1131,8 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 	mutex_init(&kvm->irq_lock);
 	mutex_init(&kvm->slots_lock);
 	mutex_init(&kvm->slots_arch_lock);
-	spin_lock_init(&kvm->mn_invalidate_lock);
-	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
+	spin_lock_init(&kvm->invalidate_lock);
+	rcuwait_init(&kvm->memslots_update_rcuwait);
 	xa_init(&kvm->vcpu_array);
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 	xa_init(&kvm->mem_attr_array);
@@ -1299,7 +1299,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	/*
 	 * At this point, pending calls to invalidate_range_start()
 	 * have completed but no more MMU notifiers will run, so
-	 * mn_active_invalidate_count may remain unbalanced.
+	 * active_invalidate_count may remain unbalanced.
 	 * No threads can be waiting in kvm_swap_active_memslots() as the
 	 * last reference on KVM has been dropped, but freeing
 	 * memslots would deadlock without this manual intervention.
@@ -1308,9 +1308,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	 * notifier between a start() and end(), then there shouldn't be any
 	 * in-progress invalidations.
 	 */
-	WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait));
-	if (kvm->mn_active_invalidate_count)
-		kvm->mn_active_invalidate_count = 0;
+	WARN_ON(rcuwait_active(&kvm->memslots_update_rcuwait));
+	if (kvm->active_invalidate_count)
+		kvm->active_invalidate_count = 0;
 	else
 		WARN_ON(kvm->mmu_invalidate_in_progress);
 #else
@@ -1640,17 +1640,17 @@ static void kvm_swap_active_memslots(struct kvm *kvm, int as_id)
 	 * progress, otherwise the locking in invalidate_range_start and
 	 * invalidate_range_end will be unbalanced.
 	 */
-	spin_lock(&kvm->mn_invalidate_lock);
-	prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
-	while (kvm->mn_active_invalidate_count) {
+	spin_lock(&kvm->invalidate_lock);
+	prepare_to_rcuwait(&kvm->memslots_update_rcuwait);
+	while (kvm->active_invalidate_count) {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		spin_unlock(&kvm->mn_invalidate_lock);
+		spin_unlock(&kvm->invalidate_lock);
 		schedule();
-		spin_lock(&kvm->mn_invalidate_lock);
+		spin_lock(&kvm->invalidate_lock);
 	}
-	finish_rcuwait(&kvm->mn_memslots_update_rcuwait);
+	finish_rcuwait(&kvm->memslots_update_rcuwait);
 	rcu_assign_pointer(kvm->memslots[as_id], slots);
-	spin_unlock(&kvm->mn_invalidate_lock);
+	spin_unlock(&kvm->invalidate_lock);
 
 	/*
 	 * Acquired in kvm_set_memslot. Must be released before synchronize
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 3ff8251727e2..2880a36257c2 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -147,26 +147,24 @@ static void gpc_unmap(kvm_pfn_t pfn, void *khva)
 static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_seq)
 {
 	/*
-	 * mn_active_invalidate_count acts for all intents and purposes
-	 * like mmu_invalidate_in_progress here; but the latter cannot
-	 * be used here because the invalidation of caches in the
-	 * mmu_notifier event occurs _before_ mmu_invalidate_in_progress
-	 * is elevated.
+	 * active_invalidate_count acts for all intents and purposes like
+	 * mmu_invalidate_in_progress here; but the latter cannot be used here
+	 * because the invalidation of caches in the mmu_notifier event occurs
+	 * _before_ mmu_invalidate_in_progress is elevated.
 	 *
-	 * Note, it does not matter that mn_active_invalidate_count
-	 * is not protected by gpc->lock.  It is guaranteed to
-	 * be elevated before the mmu_notifier acquires gpc->lock, and
-	 * isn't dropped until after mmu_invalidate_seq is updated.
+	 * Note, it does not matter that active_invalidate_count is not
+	 * protected by gpc->lock.  It is guaranteed to be elevated before the
+	 * mmu_notifier acquires gpc->lock, and isn't dropped until after
+	 * mmu_invalidate_seq is updated.
 	 */
-	if (kvm->mn_active_invalidate_count)
+	if (kvm->active_invalidate_count)
 		return true;
 
 	/*
-	 * Ensure mn_active_invalidate_count is read before
-	 * mmu_invalidate_seq.  This pairs with the smp_wmb() in
-	 * mmu_notifier_invalidate_range_end() to guarantee either the
-	 * old (non-zero) value of mn_active_invalidate_count or the
-	 * new (incremented) value of mmu_invalidate_seq is observed.
+	 * Ensure active_invalidate_count is read before mmu_invalidate_seq.
+	 * This pairs with the smp_wmb() in kvm_mmu_invalidate_end() to
+	 * guarantee either the old (non-zero) value of active_invalidate_count
+	 * or the new (incremented) value of mmu_invalidate_seq is observed.
 	 */
 	smp_rmb();
 	return kvm->mmu_invalidate_seq != mmu_seq;
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH v2 7/7] KVM: pfncache: Invalidate on gmem invalidation and memattr updates
  2026-02-26 13:53 [RFC PATCH v2 0/7] KVM: pfncache: Add guest_memfd support to pfncache Takahiro Itazuri
                   ` (5 preceding siblings ...)
  2026-02-26 13:53 ` [RFC PATCH v2 6/7] KVM: Rename mn_* invalidate-related fields to generic ones Takahiro Itazuri
@ 2026-02-26 13:53 ` Takahiro Itazuri
  6 siblings, 0 replies; 10+ messages in thread
From: Takahiro Itazuri @ 2026-02-26 13:53 UTC (permalink / raw)
  To: kvm, Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Fuad Tabba, Brendan Jackman, David Hildenbrand,
	David Woodhouse, Paul Durrant, Nikita Kalyazin, Patrick Roy,
	Takahiro Itazuri

Invalidate pfncaches when guest_memfd invalidation or memory attribute
updates render cached PFN resolutions stale.

Reuse active_invalidate_count to synchronize with the existing retry
logic and preserve ordering against mmu_invalidate_seq.

Invalidation needs to be performed using HVA ranges so that both
GPA-based and HVA-based pfncaches are covered.  Internally, GPA-based
ones translate a GPA to a memslot/UHVA first and then resolve the PFN,
while HVA-based ones only resolve the PFN and do not store memslot/GPA
context.  Technically, it would be possible to make HVA-based pfncaches
look up the corresponding memslot/GPA when activated / refreshed, but
that would add overhead to a greater or lesser extent, regardless of
whether the cache is guest_memfd-backed.  At the time of writing, only
Xen uses HVA-based pfncaches.

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
---
 virt/kvm/guest_memfd.c | 50 ++++++++++++++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c    | 45 +++++++++++++++++++++++++++++++++++++
 virt/kvm/pfncache.c    |  4 ++--
 3 files changed, 97 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 79f34dad0c2f..eb2f1a7e54dc 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -215,6 +215,33 @@ static void __kvm_gmem_invalidate_start(struct gmem_file *f, pgoff_t start,
 	struct kvm *kvm = f->kvm;
 	unsigned long index;
 
+	/*
+	 * Prevent pfncaches from being activated / refreshed using stale PFN
+	 * resolutions.  To invalidate pfncaches _before_ invalidating the
+	 * secondary MMUs (i.e. without acquiring mmu_lock), pfncaches must use
+	 * active_invalidate_count instead of mmu_invalidate_in_progress.
+	 */
+	spin_lock(&kvm->invalidate_lock);
+	kvm->active_invalidate_count++;
+	spin_unlock(&kvm->invalidate_lock);
+
+	/*
+	 * Invalidation of pfncaches must be done using a HVA range.  pfncaches
+	 * can be either GPA-based or HVA-based, and all pfncaches store uhva
+	 * while HVA-based pfncaches do not have gpa/memslot info.  Thus,
+	 * using GFN ranges would miss invalidating HVA-based ones.
+	 */
+	xa_for_each_range(&f->bindings, index, slot, start, end - 1) {
+		pgoff_t pgoff = slot->gmem.pgoff;
+		gfn_t gfn_start = slot->base_gfn + max(pgoff, start) - pgoff;
+		gfn_t gfn_end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff;
+
+		unsigned long hva_start = gfn_to_hva_memslot(slot, gfn_start);
+		unsigned long hva_end = gfn_to_hva_memslot(slot, gfn_end);
+
+		gpc_invalidate_hva_range_start(kvm, hva_start, hva_end);
+	}
+
 	xa_for_each_range(&f->bindings, index, slot, start, end - 1) {
 		pgoff_t pgoff = slot->gmem.pgoff;
 
@@ -259,12 +286,35 @@ static void __kvm_gmem_invalidate_end(struct gmem_file *f, pgoff_t start,
 				      pgoff_t end)
 {
 	struct kvm *kvm = f->kvm;
+	bool wake;
 
 	if (xa_find(&f->bindings, &start, end - 1, XA_PRESENT)) {
 		KVM_MMU_LOCK(kvm);
 		kvm_mmu_invalidate_end(kvm);
 		KVM_MMU_UNLOCK(kvm);
 	}
+
+	/*
+	 * This must be done after the increment of mmu_invalidate_seq and
+	 * smp_wmb() in kvm_mmu_invalidate_end() to guarantee that
+	 * gpc_invalidate_retry() observes either the old (non-zero)
+	 * active_invalidate_count or the new (incremented) mmu_invalidate_seq.
+	 */
+	spin_lock(&kvm->invalidate_lock);
+	if (!WARN_ON_ONCE(!kvm->active_invalidate_count))
+		kvm->active_invalidate_count--;
+	wake = !kvm->active_invalidate_count;
+	spin_unlock(&kvm->invalidate_lock);
+
+	/*
+	 * guest_memfd invalidation itself doesn't need to block active memslots
+	 * swap as bindings updates are serialized by filemap_invalidate_lock().
+	 * However, active_invalidate_count is shared with the MMU notifier
+	 * path, so the waiter must be woken when active_invalidate_count drops
+	 * to zero.
+	 */
+	if (wake)
+		rcuwait_wake_up(&kvm->memslots_update_rcuwait);
 }
 
 static void kvm_gmem_invalidate_end(struct inode *inode, pgoff_t start,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f51056e971d0..f56b98c85175 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2583,6 +2583,8 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 		.on_lock = kvm_mmu_invalidate_end,
 		.may_block = true,
 	};
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	struct kvm_memory_slot *slot;
 	unsigned long i;
 	void *entry;
 	int r = 0;
@@ -2609,6 +2611,34 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 		cond_resched();
 	}
 
+	/*
+	 * Prevent pfncaches from being activated / refreshed using stale PFN
+	 * resolutions.  To invalidate pfncaches _before_ invalidating the
+	 * secondary MMUs (i.e. without acquiring mmu_lock), pfncaches must use
+	 * active_invalidate_count instead of mmu_invalidate_in_progress.
+	 */
+	spin_lock(&kvm->invalidate_lock);
+	kvm->active_invalidate_count++;
+	spin_unlock(&kvm->invalidate_lock);
+
+	/*
+	 * Invalidation of pfncaches must be done using a HVA range.  pfncaches
+	 * can be either GPA-based or HVA-based, and all pfncaches store uhva
+	 * while HVA-based pfncaches do not have gpa/memslot info.  Thus,
+	 * using GFN ranges would miss invalidating HVA-based ones.
+	 */
+	kvm_for_each_memslot(slot, slots) {
+		gfn_t gfn_start = max(start, slot->base_gfn);
+		gfn_t gfn_end = min(end, slot->base_gfn + slot->npages);
+
+		if (gfn_start < gfn_end) {
+			unsigned long hva_start = gfn_to_hva_memslot(slot, gfn_start);
+			unsigned long hva_end = gfn_to_hva_memslot(slot, gfn_end);
+
+			gpc_invalidate_hva_range_start(kvm, hva_start, hva_end);
+		}
+	}
+
 	kvm_handle_gfn_range(kvm, &pre_set_range);
 
 	for (i = start; i < end; i++) {
@@ -2620,6 +2650,21 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 
 	kvm_handle_gfn_range(kvm, &post_set_range);
 
+	/*
+	 * This must be done after the increment of mmu_invalidate_seq and
+	 * smp_wmb() in kvm_mmu_invalidate_end() to guarantee that
+	 * gpc_invalidate_retry() observes either the old (non-zero)
+	 * active_invalidate_count or the new (incremented) mmu_invalidate_seq.
+	 *
+	 * memslots_update_rcuwait does not need to be woken when
+	 * active_invalidate_count drops to zero because active memslots swap is
+	 * also done while holding slots_lock.
+	 */
+	spin_lock(&kvm->invalidate_lock);
+	if (!WARN_ON_ONCE(!kvm->active_invalidate_count))
+		kvm->active_invalidate_count--;
+	spin_unlock(&kvm->invalidate_lock);
+
 out_unlock:
 	mutex_unlock(&kvm->slots_lock);
 
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 2880a36257c2..2b44da46d2ab 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -144,7 +144,7 @@ static void gpc_unmap(kvm_pfn_t pfn, void *khva)
 #endif
 }
 
-static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_seq)
+static inline bool gpc_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
 {
 	/*
 	 * active_invalidate_count acts for all intents and purposes like
@@ -274,7 +274,7 @@ static kvm_pfn_t gpc_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
 		 * attempting to refresh.
 		 */
 		WARN_ON_ONCE(gpc->valid);
-	} while (mmu_notifier_retry_cache(gpc->kvm, mmu_seq));
+	} while (gpc_invalidate_retry(gpc->kvm, mmu_seq));
 
 	gpc->valid = true;
 	gpc->pfn = new_pfn;
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH v2 1/7] KVM: x86: Avoid silent kvm-clock activation failures
  2026-02-26 13:53 ` [RFC PATCH v2 1/7] KVM: x86: Avoid silent kvm-clock activation failures Takahiro Itazuri
@ 2026-03-05 17:50   ` Sean Christopherson
  2026-03-10  5:58     ` Takahiro Itazuri
  0 siblings, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2026-03-05 17:50 UTC (permalink / raw)
  To: Takahiro Itazuri
  Cc: kvm, Paolo Bonzini, Vitaly Kuznetsov, Fuad Tabba, Brendan Jackman,
	David Hildenbrand, David Woodhouse, Paul Durrant, Nikita Kalyazin,
	Patrick Roy, Takahiro Itazuri

On Thu, Feb 26, 2026, Takahiro Itazuri wrote:
> kvm_write_system_time() previously ignored the return value of
> kvm_gpc_activate().  As a result, kvm-clock activation could fail
> silently, making debugging harder.
> 
> Propagate the return value so that the MSR write fails properly instead
> of continuing silently.

Hrm.  For better or worse, KVM's ABI when it comes to PV stuff is to silently
ignore failures.  I 100% agree it makes debugging painful, but it's unfortunately
also "safer" in many cases, e.g. often results in degraded behavior versus flat
out crashing the guest.

The other wrinkle is that success isn't actually guaranteed, because the actual
writes don't happen until KVM_RUN via kvm_guest_time_update(), i.e. only failing
in _some_ cases creates a weird ABI.

And most importantly, this would be a breaking change in guest- and user-visible
behavior.  So while I agree silently failing is ugly, all things considered I
think it's the least awful choice here :-/

> Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
> ---
>  arch/x86/kvm/x86.c | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a447663d5eff..a729b8419b61 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2438,7 +2438,7 @@ static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock, int sec_hi_o
>  	kvm_write_guest(kvm, wall_clock, &version, sizeof(version));
>  }
>  
> -static void kvm_write_system_time(struct kvm_vcpu *vcpu, gpa_t system_time,
> +static int kvm_write_system_time(struct kvm_vcpu *vcpu, gpa_t system_time,
>  				  bool old_msr, bool host_initiated)
>  {
>  	struct kvm_arch *ka = &vcpu->kvm->arch;
> @@ -2455,12 +2455,12 @@ static void kvm_write_system_time(struct kvm_vcpu *vcpu, gpa_t system_time,
>  
>  	/* we verify if the enable bit is set... */
>  	if (system_time & 1)
> -		kvm_gpc_activate(&vcpu->arch.pv_time, system_time & ~1ULL,
> -				 sizeof(struct pvclock_vcpu_time_info));
> -	else
> -		kvm_gpc_deactivate(&vcpu->arch.pv_time);
> +		return kvm_gpc_activate(&vcpu->arch.pv_time,
> +					system_time & ~1ULL,
> +					sizeof(struct pvclock_vcpu_time_info));
>  
> -	return;
> +	kvm_gpc_deactivate(&vcpu->arch.pv_time);
> +	return 0;
>  }
>  
>  static uint32_t div_frac(uint32_t dividend, uint32_t divisor)
> @@ -4156,13 +4156,15 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		if (!guest_pv_has(vcpu, KVM_FEATURE_CLOCKSOURCE2))
>  			return 1;
>  
> -		kvm_write_system_time(vcpu, data, false, msr_info->host_initiated);
> +		if (kvm_write_system_time(vcpu, data, false, msr_info->host_initiated))
> +			return 1;
>  		break;
>  	case MSR_KVM_SYSTEM_TIME:
>  		if (!guest_pv_has(vcpu, KVM_FEATURE_CLOCKSOURCE))
>  			return 1;
>  
> -		kvm_write_system_time(vcpu, data, true,  msr_info->host_initiated);
> +		if (kvm_write_system_time(vcpu, data, true,  msr_info->host_initiated))
> +			return 1;
>  		break;
>  	case MSR_KVM_ASYNC_PF_EN:
>  		if (!guest_pv_has(vcpu, KVM_FEATURE_ASYNC_PF))
> -- 
> 2.50.1
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Re: [RFC PATCH v2 1/7] KVM: x86: Avoid silent kvm-clock activation failures
  2026-03-05 17:50   ` Sean Christopherson
@ 2026-03-10  5:58     ` Takahiro Itazuri
  0 siblings, 0 replies; 10+ messages in thread
From: Takahiro Itazuri @ 2026-03-10  5:58 UTC (permalink / raw)
  To: seanjc
  Cc: david, dwmw2, itazur, jackmanb, kalyazin, kvm, patrick.roy,
	pbonzini, pdurrant, tabba, vkuznets, zulinx86

On Thu, 5 Mar 2026 09:50:30 -0800, Sean Christopherson wrote:
> On Thu, Feb 26, 2026, Takahiro Itazuri wrote:
> > kvm_write_system_time() previously ignored the return value of
> > kvm_gpc_activate().  As a result, kvm-clock activation could fail
> > silently, making debugging harder.
> > 
> > Propagate the return value so that the MSR write fails properly instead
> > of continuing silently.
> 
> Hrm.  For better or worse, KVM's ABI when it comes to PV stuff is to silently
> ignore failures.  I 100% agree it makes debugging painful, but it's unfortunately
> also "safer" in many cases, e.g. often results in degraded behavior versus flat
> out crashing the guest.
> 
> The other wrinkle is that success isn't actually guaranteed, because the actual
> writes don't happen until KVM_RUN via kvm_guest_time_update(), i.e. only failing
> in _some_ cases creates a weird ABI.
> 
> And most importantly, this would be a breaking change in guest- and user-visible
> behavior.  So while I agree silently failing is ugly, all things considered I
> think it's the least awful choice here :-/

Fair point.  I'll drop this change in the next version since I spotted a
compile error in another patch and I need to resend anyway.


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2026-03-10  5:58 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-26 13:53 [RFC PATCH v2 0/7] KVM: pfncache: Add guest_memfd support to pfncache Takahiro Itazuri
2026-02-26 13:53 ` [RFC PATCH v2 1/7] KVM: x86: Avoid silent kvm-clock activation failures Takahiro Itazuri
2026-03-05 17:50   ` Sean Christopherson
2026-03-10  5:58     ` Takahiro Itazuri
2026-02-26 13:53 ` [RFC PATCH v2 2/7] KVM: pfncache: Resolve PFNs via kvm_gmem_get_pfn() for gmem-backed GPAs Takahiro Itazuri
2026-02-26 13:53 ` [RFC PATCH v2 3/7] KVM: pfncache: Obtain KHVA via vmap() for gmem with NO_DIRECT_MAP Takahiro Itazuri
2026-02-26 13:53 ` [RFC PATCH v2 4/7] KVM: Rename invalidate_begin to invalidate_start for consistency Takahiro Itazuri
2026-02-26 13:53 ` [RFC PATCH v2 5/7] KVM: pfncache: Rename invalidate_start() helper Takahiro Itazuri
2026-02-26 13:53 ` [RFC PATCH v2 6/7] KVM: Rename mn_* invalidate-related fields to generic ones Takahiro Itazuri
2026-02-26 13:53 ` [RFC PATCH v2 7/7] KVM: pfncache: Invalidate on gmem invalidation and memattr updates Takahiro Itazuri

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox