[RFC PATCH] KVM: Ignore MMU notifiers for guest

Kernel KVM virtualization development

* [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
@ 2026-06-15 15:52 Alexandru Elisei
  2026-06-15 16:09 ` sashiko-bot
  2026-06-15 19:07 ` David Hildenbrand
  0 siblings, 2 replies; 5+ messages in thread
From: Alexandru Elisei @ 2026-06-15 15:52 UTC
  To: pbonzini, kvm, linux-kernel, maz, oupton, suzuki.poulose, kvmarm,
	linux-arm-kernel, seanjc, david.hildenbrand, mark.rutland

For guest_memfd-only memslots (kvm_memslot_is_gmem_only() is true), the
memory provider for the virtual machine is the guest_memfd file, not the
userspace mapping. Faults are resolved using the guest_memfd page cache,
and the permissions for the secondary MMU mapping depend exclusively on
the memslot (i.e., whether the memslot is read-only). How userspace happens
to have the memory mmapped at fault time, or even whether it is mapped at
all into userspace, is not taken into consideration.

guest_memfd memory is not evictable, is not movable and there's no backing
storage. Once memory is allocated for an offset in the guest_memfd file, the
offset will not change, and that memory is not freed unless userspace
explicitly punches a hole in the file. As a result, memory reclaim, page
migration, page aging and dirty page tracking for the userspace mapping
serve little purpose.

Despite this, KVM's MMU notifiers still modify the secondary MMU page
tables, similar to ordinary memslots, only for the same memory to be
remapped next time a guest accesses it. Make the disconnect between the
user mapping and the secondary MMU page tables explicit by ignoring the MMU
notifiers for guest_memfd-only memslots.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
The only theoretical instance where the MMU notifiers are invoked for the
userspace mapping of a guest_memfd-only memslot that I was able to find was
automatic NUMA balancing with a non-NULL NUMA policy for the guest_memfd
file. I wasn't able to test it in practice. Also my knowledge of MM is very
limited, so there might be other cases where it happens, or I might be
wrong and today the MMU notifiers are never invoked.

Either way, when and if it happens, having memory unmapped from the
secondary MMU in the case of a guest_memfd-only memslot is at most a
performance issue (it causes unnecessary guest faults), but I wanted to
start a conversation about this because having memory that stays mapped at
stage 2 (unless userspace explicitly unmaps it from the VM) is needed for an
Arm feature (called SPE, Statistical Profiling Extension) that I'm working
to upstream. This patch aims to provide the guarantee that memory won't be
unmapped from the secondary MMU behind the VMM's back, which is what happens
for non-guest_memfd memslots.

 virt/kvm/kvm_main.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 881f92d7a469..8c4158996928 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 			unsigned long hva_start, hva_end;

 			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
+
+			if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
+				continue;
+
 			hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
 			hva_end = min_t(unsigned long, range->end,
 					slot->userspace_addr + (slot->npages << PAGE_SHIFT));

base-commit: 8cd9520d35a6c38db6567e97dd93b1f11f185dc6
-- 
2.54.0
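
For readers following the hunk above, here is a minimal userspace sketch of
the per-memslot step it changes: skip the slot when it is guest_memfd-only,
otherwise clamp the notifier range to the slot's HVA range. It is not kernel
code. struct model_slot, its gmem_only field, model_clamp() and the 4 KiB
page size are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical stand-in for the struct kvm_memory_slot fields used here. */
struct model_slot {
	uint64_t userspace_addr; /* start of the slot's HVA range */
	uint64_t npages;         /* slot size in pages */
	bool gmem_only;          /* models kvm_memslot_is_gmem_only() */
};

/*
 * Returns true and fills [*hva_start, *hva_end) when the notifier range
 * [start, end) has to be processed for this slot. Returns false when the
 * slot is skipped (guest_memfd-only) or does not overlap the range.
 */
static bool model_clamp(const struct model_slot *slot,
			uint64_t start, uint64_t end,
			uint64_t *hva_start, uint64_t *hva_end)
{
	uint64_t slot_end;

	assert(slot && hva_start && hva_end);

	if (slot->gmem_only)
		return false; /* models the "continue" added by the patch */

	slot_end = slot->userspace_addr + (slot->npages << MODEL_PAGE_SHIFT);

	/* Same clamping as the max_t()/min_t() pair in the hunk. */
	*hva_start = start > slot->userspace_addr ? start : slot->userspace_addr;
	*hva_end = end < slot_end ? end : slot_end;
	return *hva_start < *hva_end;
}
```

In this model a skipped slot never reports a range, so nothing downstream of
the clamp runs for it. Whether that is safe for every consumer of the
notifier is a separate question.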

* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-15 15:52 [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots Alexandru Elisei
@ 2026-06-15 16:09 ` sashiko-bot
  2026-06-15 17:47   ` Sean Christopherson
  2026-06-15 19:07 ` David Hildenbrand
  1 sibling, 1 reply; 5+ messages in thread
From: sashiko-bot @ 2026-06-15 16:09 UTC
  To: Alexandru Elisei; +Cc: Marc Zyngier, kvm, kvmarm, Oliver Upton

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Critical] Skipping MMU notifiers for guest_memfd-only memslots breaks the pfncache retry protocol, leading to a host-level Use-After-Free.
--

commit 8278d4b7bdd8bdfe7a6bc39dd2a0418838234428
Author: Alexandru Elisei <alexandru.elisei@arm.com>

KVM: Ignore MMU notifiers for guest_memfd-only memslots

This commit modifies KVM's MMU notifier handling to ignore guest_memfd-only
memslots. The rationale is that guest_memfd memory is not evictable or movable,
making traditional userspace MMU notifier tracking unnecessary.

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
[ ... ]
> @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
>  			unsigned long hva_start, hva_end;
>  
>  			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> +
> +			if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
> +				continue;
> +

[Severity: Critical]
Does this change inadvertently introduce a regression in the pfncache retry
protocol?

Looking at the pfncache framework, it maps guest memory into kernel space and
explicitly drops the page reference after mapping it:

virt/kvm/pfncache.c:hva_to_pfn_retry() {
    ...
    kvm_release_page_clean(page);
    ...
}

It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq)
to invalidate the cache when the page is unmapped by the host.

If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY
but still provides a valid anonymous user mapping as its userspace_addr,
could this regression lead to a use-after-free? By unmapping the anonymous
memory, the host would trigger MMU notifiers, but this new check skips
the memslot. As a result, kvm->mmu_invalidate_seq wouldn't increment, and
KVM might retain a kernel mapping to a freed physical page.

Could this allow the guest to read or write arbitrary host physical memory?
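
The concern described above can be illustrated with a small userspace model
of a sequence-count check. It is a simplified sketch, not the pfncache
implementation: struct model_vm, struct model_cache and the model_*()
helpers are hypothetical, and the only behavior taken from the review is
that a cached mapping is trusted as long as the invalidation sequence has
not moved since it was filled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the per-VM invalidation state. */
struct model_vm {
	uint64_t invalidate_seq; /* models kvm->mmu_invalidate_seq */
};

/* Hypothetical model of a cached pfn that holds no page reference. */
struct model_cache {
	uint64_t seq_at_fill; /* sequence sampled when the pfn was cached */
	bool valid;
};

/* Fill the cache and remember which sequence it was filled under. */
static void model_cache_fill(struct model_cache *c, const struct model_vm *vm)
{
	assert(c && vm);
	c->seq_at_fill = vm->invalidate_seq;
	c->valid = true;
}

/* Notifier path: bumping the sequence is what lets the cache notice. */
static void model_notifier(struct model_vm *vm, bool slot_skipped)
{
	assert(vm);
	if (slot_skipped)
		return; /* early skip: the invalidation is never recorded */
	vm->invalidate_seq++;
}

/* The cache is usable only if no invalidation was seen since the fill. */
static bool model_cache_usable(const struct model_cache *c,
			       const struct model_vm *vm)
{
	assert(c && vm);
	return c->valid && c->seq_at_fill == vm->invalidate_seq;
}
```

With slot_skipped set, model_cache_usable() keeps returning true after the
notifier has run, which is the stale-mapping situation the review asks
about.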

>  			hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
>  			hva_end = min_t(unsigned long, range->end,
>  					slot->userspace_addr + (slot->npages << PAGE_SHIFT));

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260615155244.183044-1-alexandru.elisei@arm.com?part=1

* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-15 16:09 ` sashiko-bot
@ 2026-06-15 17:47   ` Sean Christopherson
  2026-06-15 18:09     ` Sean Christopherson
  0 siblings, 1 reply; 5+ messages in thread
From: Sean Christopherson @ 2026-06-15 17:47 UTC
  To: sashiko-reviews; +Cc: Alexandru Elisei, Marc Zyngier, kvm, kvmarm, Oliver Upton

On Mon, Jun 15, 2026, sashiko-bot@kernel.org wrote:
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> [ ... ]
> > @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
> >  			unsigned long hva_start, hva_end;
> >  
> >  			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> > +
> > +			if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
> > +				continue;
> > +
> 
> [Severity: Critical]
> Does this change inadvertently introduce a regression in the pfncache retry
> protocol?
> 
> Looking at the pfncache framework, it maps guest memory into kernel space and
> explicitly drops the page reference after mapping it:
> 
> virt/kvm/pfncache.c:hva_to_pfn_retry() {
>     ...
>     kvm_release_page_clean(page);
>     ...
> }
> 
> It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq)
> to invalidate the cache when the page is unmapped by the host.
> 
> If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY
> but still provides a valid anonymous user mapping as its userspace_addr,
> could this regression lead to a use-after-free?

Sadly, yes.  To land this, we would need to first teach the gfn_to_pfn_cache code
to be able to pull directly from guest_memfd.  I forget if anyone is working on
that.

> By unmapping the anonymous memory, the host would trigger MMU notifiers, but
> this new check skips the memslot. As a result, kvm->mmu_invalidate_seq
> wouldn't increment, and KVM might retain a kernel mapping to a freed physical
> page.
> 
> Could this allow the guest to read or write arbitrary host physical memory?
> 
> >  			hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
> >  			hva_end = min_t(unsigned long, range->end,
> >  					slot->userspace_addr + (slot->npages << PAGE_SHIFT));
> 
> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/20260615155244.183044-1-alexandru.elisei@arm.com?part=1

* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-15 17:47   ` Sean Christopherson
@ 2026-06-15 18:09     ` Sean Christopherson
  0 siblings, 0 replies; 5+ messages in thread
From: Sean Christopherson @ 2026-06-15 18:09 UTC
  To: sashiko-reviews; +Cc: Alexandru Elisei, Marc Zyngier, kvm, kvmarm, Oliver Upton

On Mon, Jun 15, 2026, Sean Christopherson wrote:
> On Mon, Jun 15, 2026, sashiko-bot@kernel.org wrote:
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > [ ... ]
> > > @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
> > >  			unsigned long hva_start, hva_end;
> > >  
> > >  			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> > > +
> > > +			if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
> > > +				continue;
> > > +
> > 
> > [Severity: Critical]
> > Does this change inadvertently introduce a regression in the pfncache retry
> > protocol?
> > 
> > Looking at the pfncache framework, it maps guest memory into kernel space and
> > explicitly drops the page reference after mapping it:
> > 
> > virt/kvm/pfncache.c:hva_to_pfn_retry() {
> >     ...
> >     kvm_release_page_clean(page);
> >     ...
> > }
> > 
> > It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq)
> > to invalidate the cache when the page is unmapped by the host.
> > 
> > If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY
> > but still provides a valid anonymous user mapping as its userspace_addr,
> > could this regression lead to a use-after-free?
> 
> Sadly, yes.  To land this, we would need to first teach the gfn_to_pfn_cache code
> to be able to pull directly from guest_memfd.  I forget if anyone is working on
> that.

Actually, we just need to ensure the invalidation tracking is updated; the MMU
itself can be left as-is.

Compile tested only, but this?

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 27498e990dff..690ab707816b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -260,6 +260,7 @@ union kvm_mmu_notifier_arg {
 enum kvm_gfn_range_filter {
        KVM_FILTER_SHARED               = BIT(0),
        KVM_FILTER_PRIVATE              = BIT(1),
+       KVM_FILTER_USERSPACE_MAPPINGS   = BIT(2),
 };
 
 struct kvm_gfn_range {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e44c20c04961..84b693de7e35 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -608,7 +608,8 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
                         * HVA-based notifications aren't relevant to private
                         * mappings as they don't have a userspace mapping.
                         */
-                       gfn_range.attr_filter = KVM_FILTER_SHARED;
+                       gfn_range.attr_filter = KVM_FILTER_SHARED |
+                                               KVM_FILTER_USERSPACE_MAPPINGS;
 
                        /*
                         * {gfn(page) | page intersects with [hva_start, hva_end)} =
@@ -715,6 +716,21 @@ void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
 bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
        kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
+
+       /*
+        * When reacting to changes in userspace mappings, don't unmap memslots
+        * that are guest_memfd-only, in which case KVM's MMU mappings are
+        * pulled directly from guest_memfd, i.e. don't depend on the userspace
+        * mappings.
+        *
+        * TODO: Skip gmem-only memslots on mmu_notifier events entirely, once
+        * gfn_to_pfn_cache is also wired up to directly pull from guest_memfd.
+        */
+       if (range->attr_filter & KVM_FILTER_USERSPACE_MAPPINGS &&
+           kvm_slot_has_gmem(range->slot) &&
+           kvm_memslot_is_gmem_only(range->slot))
+               return false;
+
        return kvm_unmap_gfn_range(kvm, range);
 }
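
The ordering in the diff above (record the invalidation first, then decide
whether to unmap) can be sketched in userspace as follows. This is a
simplified model under stated assumptions, not the kernel code: the
MODEL_FILTER_* values mirror the enum in the diff, while struct model_range,
struct model_state and model_unmap_gfn_range() are hypothetical names, and
the two counters stand in for kvm_mmu_invalidate_range_add() and
kvm_unmap_gfn_range().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_FILTER_SHARED             (1u << 0)
#define MODEL_FILTER_PRIVATE            (1u << 1)
#define MODEL_FILTER_USERSPACE_MAPPINGS (1u << 2)

/* Hypothetical stand-in for the struct kvm_gfn_range fields used here. */
struct model_range {
	unsigned int attr_filter;
	bool slot_gmem_only; /* models the two slot checks in the diff */
};

/* Counters standing in for the side effects of the two kernel calls. */
struct model_state {
	uint64_t ranges_tracked; /* models kvm_mmu_invalidate_range_add() */
	uint64_t unmaps;         /* models kvm_unmap_gfn_range() */
};

/* Returns true if the secondary MMU was touched, false if it was left alone. */
static bool model_unmap_gfn_range(struct model_state *st,
				  const struct model_range *range)
{
	assert(st && range);

	/* Tracking is always updated, so sequence-based users still notice. */
	st->ranges_tracked++;

	/* Only the unmap is skipped for userspace-driven events on gmem-only slots. */
	if ((range->attr_filter & MODEL_FILTER_USERSPACE_MAPPINGS) &&
	    range->slot_gmem_only)
		return false;

	st->unmaps++;
	return true;
}
```

The point of the ordering is that the skip happens after the tracking
update, so anything keyed on the invalidation state still sees the event
even though the stage-2 mapping stays in place.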

* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-15 15:52 [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots Alexandru Elisei
  2026-06-15 16:09 ` sashiko-bot
@ 2026-06-15 19:07 ` David Hildenbrand
  1 sibling, 0 replies; 5+ messages in thread
From: David Hildenbrand @ 2026-06-15 19:07 UTC
  To: Alexandru Elisei, pbonzini, kvm, linux-kernel, maz, oupton,
	suzuki.poulose, kvmarm, linux-arm-kernel, seanjc, mark.rutland

On 6/15/26 17:52, Alexandru Elisei wrote:
> For guest_memfd-only memslots (kvm_memslot_is_gmem_only() is true), the
> memory provider for the virtual machine is the guest_memfd file, not the
> userspace mapping. Faults are resolved using the guest_memfd page cache,
> and the permissions for the secondary MMU mapping depend exclusively on
> the memslot (i.e., whether the memslot is read-only). How userspace happens
> to have the memory mmapped at fault time, or even whether it is mapped at
> all into userspace, is not taken into consideration.
> 
> guest_memfd memory is not evictable, is not movable and there's no backing
> storage. Once memory is allocated for an offset in the guest_memfd file, the
> offset will not change, and that memory is not freed unless userspace
> explicitly punches a hole in the file. As a result, memory reclaim, page
> migration, page aging and dirty page tracking for the userspace mapping
> serve little purpose.

I don't think any of that is relevant for the patch at hand?

The thing is: invalidation (truncation, later migration, or any other reason)
is driven through guest_memfd notifications, not through unrelated page tables.

If we don't look up pages for the KVM MMU through the page table, then there is
also no need for MMU notifiers. It's all guest_memfd only.

Or am I missing something?

-- 
Cheers,

David
