[RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots

Linux KVM/arm64 development list

* [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
@ 2026-06-15 15:52 Alexandru Elisei
  2026-06-15 16:09 ` sashiko-bot
  2026-06-15 19:07 ` David Hildenbrand
  0 siblings, 2 replies; 11+ messages in thread
From: Alexandru Elisei @ 2026-06-15 15:52 UTC
  To: pbonzini, kvm, linux-kernel, maz, oupton, suzuki.poulose, kvmarm,
	linux-arm-kernel, seanjc, david.hildenbrand, mark.rutland

For guest_memfd-only memslots (kvm_memslot_is_gmem_only() is true), the
memory provider for the virtual machine is the guest_memfd file, not the
userspace mapping. Faults are resolved using the guest_memfd page cache,
and the permissions for the secondary MMU mapping depend exclusively on
the memslot (i.e., whether the memslot is read-only). How userspace happens
to have the memory mmapped at fault time, or even whether the memory is
mapped into userspace at all, is not taken into consideration.

guest_memfd memory is not evictable, is not movable and there's no backing
storage. Once memory is allocated for an offset in the guest_memfd file, the
offset will not change, and that memory is not freed unless userspace
explicitly punches a hole in the file. As a result, memory reclaim, page
migration, page aging and dirty page tracking for the userspace mapping
serve little purpose.

Despite this, KVM's MMU notifiers still modify the secondary MMU page
tables, similar to ordinary memslots, only for the same memory to be
remapped next time a guest accesses it. Make the disconnect between the
user mapping and the secondary MMU page tables explicit by ignoring the MMU
notifiers for guest_memfd-only memslots.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
The only theoretical instance where the MMU notifiers are invoked for the
userspace mapping of a guest_memfd-only memslot that I was able to find was
automatic NUMA balancing with a non-NULL NUMA policy for the guest_memfd
file. I wasn't able to test it in practice. Also my knowledge of MM is very
limited, so there might be other cases where it happens, or I might be
wrong and today the MMU notifiers are never invoked.

Either way, when and if it happens, having memory unmapped from the
secondary MMU in the case of guest_memfd-only memslots is at most a
performance issue (it causes unnecessary guest faults), but I wanted to
start a conversation about this because having memory that stays mapped at
stage 2 (unless userspace explicitly unmaps it from the VM) is needed for an
Arm feature (called SPE, Statistical Profiling Extension) that I'm working
to upstream. This patch aims to guarantee that memory won't be unmapped
from the secondary MMU behind the VMM's back, as can happen today for
non-guest_memfd memslots.
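A userspace model of what the hunk below does, offered as a hedged sketch
rather than kernel code: `model_slot` and `model_handle_hva_range()` are
hypothetical names, the memslot interval-tree lookup is replaced by a linear
overlap test, and the two booleans stand in for kvm_slot_has_gmem() and
kvm_memslot_is_gmem_only(). It only shows which memslots an MMU notifier
range would still reach with the new check, plus the HVA clamping that
follows it.

```c
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SHIFT 12

/* Hypothetical, heavily simplified stand-in for struct kvm_memory_slot. */
struct model_slot {
	unsigned long userspace_addr;
	unsigned long npages;
	bool has_gmem;   /* models kvm_slot_has_gmem() */
	bool gmem_only;  /* models kvm_memslot_is_gmem_only() */
};

/*
 * Returns how many slots the notifier range [start, end) would be forwarded
 * to once guest_memfd-only slots are skipped. The clamped HVA range of the
 * last handled slot is reported through *hva_start / *hva_end.
 */
static size_t model_handle_hva_range(const struct model_slot *slots, size_t nr,
				     unsigned long start, unsigned long end,
				     unsigned long *hva_start, unsigned long *hva_end)
{
	size_t handled = 0;

	for (size_t i = 0; i < nr; i++) {
		const struct model_slot *slot = &slots[i];
		unsigned long slot_end = slot->userspace_addr +
					 (slot->npages << PAGE_SHIFT);

		/* Stand-in for the interval-tree overlap lookup. */
		if (end <= slot->userspace_addr || start >= slot_end)
			continue;

		/* The check added by the RFC patch. */
		if (slot->has_gmem && slot->gmem_only)
			continue;

		/* Same clamping as the max_t()/min_t() pair in the diff. */
		*hva_start = start > slot->userspace_addr ? start : slot->userspace_addr;
		*hva_end = end < slot_end ? end : slot_end;
		handled++;
	}
	return handled;
}
```

With one guest_memfd-only slot and one ordinary slot, a range covering both
is forwarded only to the ordinary slot; a range covering only the
guest_memfd-only slot is forwarded nowhere, which is exactly what the
review below picks up on.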

 virt/kvm/kvm_main.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 881f92d7a469..8c4158996928 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 			unsigned long hva_start, hva_end;

 			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
+
+			if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
+				continue;
+
 			hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
 			hva_end = min_t(unsigned long, range->end,
 					slot->userspace_addr + (slot->npages << PAGE_SHIFT));

base-commit: 8cd9520d35a6c38db6567e97dd93b1f11f185dc6
-- 
2.54.0

* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-15 15:52 [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots Alexandru Elisei
@ 2026-06-15 16:09 ` sashiko-bot
  2026-06-15 17:47   ` Sean Christopherson
  2026-06-15 19:07 ` David Hildenbrand
  1 sibling, 1 reply; 11+ messages in thread
From: sashiko-bot @ 2026-06-15 16:09 UTC
  To: Alexandru Elisei; +Cc: Marc Zyngier, kvm, kvmarm, Oliver Upton

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Critical] Skipping MMU notifiers for guest_memfd-only memslots breaks the pfncache retry protocol, leading to a host-level Use-After-Free.
--

commit 8278d4b7bdd8bdfe7a6bc39dd2a0418838234428
Author: Alexandru Elisei <alexandru.elisei@arm.com>

KVM: Ignore MMU notifiers for guest_memfd-only memslots

This commit modifies KVM's MMU notifier handling to ignore guest_memfd-only
memslots. The rationale is that guest_memfd memory is not evictable or movable,
making traditional userspace MMU notifier tracking unnecessary.

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
[ ... ]
> @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
>  			unsigned long hva_start, hva_end;
>  
>  			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> +
> +			if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
> +				continue;
> +

[Severity: Critical]
Does this change inadvertently introduce a regression in the pfncache retry
protocol?

Looking at the pfncache framework, it maps guest memory into kernel space and
explicitly drops the page reference after mapping it:

virt/kvm/pfncache.c:hva_to_pfn_retry() {
    ...
    kvm_release_page_clean(page);
    ...
}

It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq)
to invalidate the cache when the page is unmapped by the host.

If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY
but still provides a valid anonymous user mapping as its userspace_addr,
could this regression lead to a use-after-free? By unmapping the anonymous
memory, the host would trigger MMU notifiers, but this new check skips
the memslot. As a result, kvm->mmu_invalidate_seq wouldn't increment, and
KVM might retain a kernel mapping to a freed physical page.

Could this allow the guest to read or write arbitrary host physical memory?

>  			hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
>  			hva_end = min_t(unsigned long, range->end,
>  					slot->userspace_addr + (slot->npages << PAGE_SHIFT));

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260615155244.183044-1-alexandru.elisei@arm.com?part=1

* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-15 16:09 ` sashiko-bot
@ 2026-06-15 17:47   ` Sean Christopherson
  2026-06-15 18:09     ` Sean Christopherson
  2026-06-17 13:07     ` Alexandru Elisei
  0 siblings, 2 replies; 11+ messages in thread
From: Sean Christopherson @ 2026-06-15 17:47 UTC
  To: sashiko-reviews; +Cc: Alexandru Elisei, Marc Zyngier, kvm, kvmarm, Oliver Upton

On Mon, Jun 15, 2026, sashiko-bot@kernel.org wrote:
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> [ ... ]
> > @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
> >  			unsigned long hva_start, hva_end;
> >  
> >  			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> > +
> > +			if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
> > +				continue;
> > +
> 
> [Severity: Critical]
> Does this change inadvertently introduce a regression in the pfncache retry
> protocol?
> 
> Looking at the pfncache framework, it maps guest memory into kernel space and
> explicitly drops the page reference after mapping it:
> 
> virt/kvm/pfncache.c:hva_to_pfn_retry() {
>     ...
>     kvm_release_page_clean(page);
>     ...
> }
> 
> It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq)
> to invalidate the cache when the page is unmapped by the host.
> 
> If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY
> but still provides a valid anonymous user mapping as its userspace_addr,
> could this regression lead to a use-after-free?

Sadly, yes.  To land this, we would need to first teach the gfn_to_pfn_cache code
to be able to pull directly from guest_memfd.  I forget if anyone is working on
that.

> By unmapping the anonymous memory, the host would trigger MMU notifiers, but
> this new check skips the memslot. As a result, kvm->mmu_invalidate_seq
> wouldn't increment, and KVM might retain a kernel mapping to a freed physical
> page.
> 
> Could this allow the guest to read or write arbitrary host physical memory?
> 
> >  			hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
> >  			hva_end = min_t(unsigned long, range->end,
> >  					slot->userspace_addr + (slot->npages << PAGE_SHIFT));
> 
> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/20260615155244.183044-1-alexandru.elisei@arm.com?part=1

* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-15 17:47   ` Sean Christopherson
@ 2026-06-15 18:09     ` Sean Christopherson
  2026-06-17 13:07     ` Alexandru Elisei
  1 sibling, 0 replies; 11+ messages in thread
From: Sean Christopherson @ 2026-06-15 18:09 UTC
  To: sashiko-reviews; +Cc: Alexandru Elisei, Marc Zyngier, kvm, kvmarm, Oliver Upton

On Mon, Jun 15, 2026, Sean Christopherson wrote:
> On Mon, Jun 15, 2026, sashiko-bot@kernel.org wrote:
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > [ ... ]
> > > @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
> > >  			unsigned long hva_start, hva_end;
> > >  
> > >  			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> > > +
> > > +			if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
> > > +				continue;
> > > +
> > 
> > [Severity: Critical]
> > Does this change inadvertently introduce a regression in the pfncache retry
> > protocol?
> > 
> > Looking at the pfncache framework, it maps guest memory into kernel space and
> > explicitly drops the page reference after mapping it:
> > 
> > virt/kvm/pfncache.c:hva_to_pfn_retry() {
> >     ...
> >     kvm_release_page_clean(page);
> >     ...
> > }
> > 
> > It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq)
> > to invalidate the cache when the page is unmapped by the host.
> > 
> > If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY
> > but still provides a valid anonymous user mapping as its userspace_addr,
> > could this regression lead to a use-after-free?
> 
> Sadly, yes.  To land this, we would need to first teach the gfn_to_pfn_cache code
> to be able to pull directly from guest_memfd.  I forget if anyone is working on
> that.

Actually, we just need to ensure the invalidation tracking is updated; the MMU
itself can be left as-is.

Compile tested only, but this?

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 27498e990dff..690ab707816b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -260,6 +260,7 @@ union kvm_mmu_notifier_arg {
 enum kvm_gfn_range_filter {
        KVM_FILTER_SHARED               = BIT(0),
        KVM_FILTER_PRIVATE              = BIT(1),
+       KVM_FILTER_USERSPACE_MAPPINGS   = BIT(2),
 };
 
 struct kvm_gfn_range {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e44c20c04961..84b693de7e35 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -608,7 +608,8 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
                         * HVA-based notifications aren't relevant to private
                         * mappings as they don't have a userspace mapping.
                         */
-                       gfn_range.attr_filter = KVM_FILTER_SHARED;
+                       gfn_range.attr_filter = KVM_FILTER_SHARED |
+                                               KVM_FILTER_USERSPACE_MAPPINGS;
 
                        /*
                         * {gfn(page) | page intersects with [hva_start, hva_end)} =
@@ -715,6 +716,21 @@ void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
 bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
        kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
+
+       /*
+        * When reacting to changes in userspace mappings, don't unmap memslots
+        * that are guest_memfd-only, in which case KVM's MMU mappings are
+        * pulled directly from guest_memfd, i.e. don't depend on the userspace
+        * mappings.
+        *
+        * TODO: Skip gmem-only memslots on mmu_notifier events entirely, once
+        * gfn_to_pfn_cache is also wired up to directly pull from guest_memfd.
+        */
+       if (range->attr_filter & KVM_FILTER_USERSPACE_MAPPINGS &&
+           kvm_slot_has_gmem(range->slot) &&
+           kvm_memslot_is_gmem_only(range->slot))
+               return false;
+
        return kvm_unmap_gfn_range(kvm, range);
 }
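A hedged userspace model of the behaviour the diff above aims for. All names
(`MODEL_FILTER_*`, `model_unmap_gfn_range()`, the counter standing in for
kvm_mmu_invalidate_range_add()) are hypothetical. It assumes, as in the
diff, that only the HVA-based notifier path sets the new
USERSPACE_MAPPINGS filter bit, so invalidations that originate from
guest_memfd itself still zap the secondary MMU.

```c
#include <stdbool.h>

#define MODEL_FILTER_SHARED             (1u << 0)
#define MODEL_FILTER_PRIVATE            (1u << 1)
#define MODEL_FILTER_USERSPACE_MAPPINGS (1u << 2)

struct model_slot {
	bool has_gmem;   /* models kvm_slot_has_gmem() */
	bool gmem_only;  /* models kvm_memslot_is_gmem_only() */
};

struct model_kvm {
	unsigned long invalidate_ranges_added; /* calls to the range_add() stand-in */
	bool secondary_mapping_present;        /* a stage-2 / TDP mapping exists */
};

struct model_gfn_range {
	const struct model_slot *slot;
	unsigned int attr_filter;
};

/* Models kvm_mmu_unmap_gfn_range() after the diff; returns "flush needed". */
static bool model_unmap_gfn_range(struct model_kvm *kvm,
				  const struct model_gfn_range *range)
{
	/* Invalidation tracking is always updated, so retry checks stay correct. */
	kvm->invalidate_ranges_added++;

	/* Userspace-mapping changes don't affect gmem-only secondary mappings. */
	if ((range->attr_filter & MODEL_FILTER_USERSPACE_MAPPINGS) &&
	    range->slot->has_gmem && range->slot->gmem_only)
		return false;

	/* Models kvm_unmap_gfn_range(): zap the secondary MMU mapping. */
	kvm->secondary_mapping_present = false;
	return true;
}
```

The design point the model tries to capture: the memslot is still visited
(so on_lock() runs and mmu_invalidate_seq is bumped), and only the zap
itself is skipped.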

* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-15 17:47   ` Sean Christopherson
  2026-06-15 18:09     ` Sean Christopherson
@ 2026-06-17 13:07     ` Alexandru Elisei
  2026-06-17 21:21       ` Sean Christopherson
  2026-06-17 21:22       ` Sean Christopherson
  1 sibling, 2 replies; 11+ messages in thread
From: Alexandru Elisei @ 2026-06-17 13:07 UTC
  To: Sean Christopherson
  Cc: sashiko-reviews, Marc Zyngier, kvm, kvmarm, Oliver Upton

Hi Sean,

Thanks for the reply. Just to make sure, once the bugs have been ironed out,
you're ok with this idea?

On Mon, Jun 15, 2026 at 10:47:14AM -0700, Sean Christopherson wrote:
> On Mon, Jun 15, 2026, sashiko-bot@kernel.org wrote:
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > [ ... ]
> > > @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
> > >  			unsigned long hva_start, hva_end;
> > >  
> > >  			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> > > +
> > > +			if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
> > > +				continue;
> > > +
> > 
> > [Severity: Critical]
> > Does this change inadvertently introduce a regression in the pfncache retry
> > protocol?
> > 
> > Looking at the pfncache framework, it maps guest memory into kernel space and
> > explicitly drops the page reference after mapping it:
> > 
> > virt/kvm/pfncache.c:hva_to_pfn_retry() {
> >     ...
> >     kvm_release_page_clean(page);
> >     ...
> > }
> > 
> > It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq)
> > to invalidate the cache when the page is unmapped by the host.
> > 
> > If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY
> > but still provides a valid anonymous user mapping as its userspace_addr,
> > could this regression lead to a use-after-free?
> 
> Sadly, yes.  To land this, we would need to first teach the gfn_to_pfn_cache code
> to be able to pull directly from guest_memfd.  I forget if anyone is working on
> that.

I've been trying to wrap my head around this, and I just can't seem to
figure it out.

kvm_mmu_notifier_invalidate_range_start(), before handle_hva_range(), calls
gfn_to_pfn_cache_invalidate_start() for the MMU notifier range, and that
marks all caches that overlap the range as invalid. kvm_gpc_check() returns
false for an invalid cache, so how can the memory still be accessed via the
pfncache?

> 
> > By unmapping the anonymous memory, the host would trigger MMU notifiers, but
> > this new check skips the memslot. As a result, kvm->mmu_invalidate_seq
> > wouldn't increment, and KVM might retain a kernel mapping to a freed physical
> > page.

kvm->mmu_invalidate_seq is incremented in kvm_mmu_invalidate_end(), I don't see
how that is affected by skipping a memslot in handle_hva_range().

> > 
> > Could this allow the guest to read or write arbitrary host physical memory?

The KVM_MEMSLOT_GMEM_ONLY flag is set if the backing guest_memfd has been
created with GUEST_MEMFD_FLAG_MMAP. The documentation for the flag says
that '[..] the fault will always be consumed from guest_memfd, regardless
of whether it is a shared or private fault'.  As far as I can tell, this
means that, absent a fallocate(FALLOC_FL_PUNCH_HOLE) call, the page is
still in the page cache for the guest_memfd file after userspace has
unmapped it, so the guest will not be accessing a freed page.

Thoughts? Am I missing something obvious?

Thanks,
Alex

> > 
> > >  			hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
> > >  			hva_end = min_t(unsigned long, range->end,
> > >  					slot->userspace_addr + (slot->npages << PAGE_SHIFT));
> > 
> > -- 
> > Sashiko AI review · https://sashiko.dev/#/patchset/20260615155244.183044-1-alexandru.elisei@arm.com?part=1

* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-17 13:07     ` Alexandru Elisei
@ 2026-06-17 21:21       ` Sean Christopherson
  2026-06-17 21:22       ` Sean Christopherson
  1 sibling, 0 replies; 11+ messages in thread
From: Sean Christopherson @ 2026-06-17 21:21 UTC
  To: Alexandru Elisei; +Cc: sashiko-reviews, Marc Zyngier, kvm, kvmarm, Oliver Upton

On Wed, Jun 17, 2026, Alexandru Elisei wrote:
> Hi Sean,
> 
> Thanks for the reply. Just to make sure, once the bugs have been ironed out,
> you're ok with this idea?
> 
> On Mon, Jun 15, 2026 at 10:47:14AM -0700, Sean Christopherson wrote:
> > On Mon, Jun 15, 2026, sashiko-bot@kernel.org wrote:
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > [ ... ]
> > > > @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
> > > >  			unsigned long hva_start, hva_end;
> > > >  
> > > >  			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> > > > +
> > > > +			if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
> > > > +				continue;
> > > > +
> > > 
> > > [Severity: Critical]
> > > Does this change inadvertently introduce a regression in the pfncache retry
> > > protocol?
> > > 
> > > Looking at the pfncache framework, it maps guest memory into kernel space and
> > > explicitly drops the page reference after mapping it:
> > > 
> > > virt/kvm/pfncache.c:hva_to_pfn_retry() {
> > >     ...
> > >     kvm_release_page_clean(page);
> > >     ...
> > > }
> > > 
> > > It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq)
> > > to invalidate the cache when the page is unmapped by the host.
> > > 
> > > If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY
> > > but still provides a valid anonymous user mapping as its userspace_addr,
> > > could this regression lead to a use-after-free?
> > 
> > Sadly, yes.  To land this, we would need to first teach the gfn_to_pfn_cache code
> > to be able to pull directly from guest_memfd.  I forget if anyone is working on
> > that.
> 
> I've been trying to wrap my head around this, and I just can't seem to
> figure it out.
> 
> kvm_mmu_notifier_invalidate_range_start(), before handle_hva_range(), calls
> gfn_to_pfn_cache_invalidate_start() for the MMU notifier range, and that
> marks all caches that overlap the range as invalid. kvm_gpc_check() returns
> false for an invalid cache, so how can the memory still be accessed via the
> pfncache?

That just forces gpcs to be refreshed; mmu_notifier_retry_cache() still relies
on mmu_invalidate_seq being bumped to avoid consuming stale state.

> > > By unmapping the anonymous memory, the host would trigger MMU notifiers, but
> > > this new check skips the memslot. As a result, kvm->mmu_invalidate_seq
> > > wouldn't increment, and KVM might retain a kernel mapping to a freed physical
> > > page.
> 
> kvm->mmu_invalidate_seq is incremented in kvm_mmu_invalidate_end(), I don't see
> how that is affected by skipping a memslot in handle_hva_range().

handle_hva_range() only invokes on_lock() if a memslot is found.  By skipping the
memslot entirely, kvm_mmu_invalidate_{start,end}() won't be called and so
mmu_invalidate_seq won't be bumped.
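A minimal userspace model of that argument, as a hedged sketch: the struct
and functions are hypothetical stand-ins for kvm_mmu_invalidate_begin(),
kvm_mmu_invalidate_end(), kvm_handle_hva_range() and
mmu_notifier_retry_cache(), and locking, barriers and the actual pfn lookup
are left out. It assumes the userspace unmap lands while a gfn_to_pfn_cache
refresh is in flight, after the refresh has snapshotted mmu_invalidate_seq
and before it performs the retry check.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of KVM's invalidation state. */
struct model_kvm {
	unsigned long mmu_invalidate_seq;
	long mmu_invalidate_in_progress;
};

/* Models on_lock() for invalidate_range_start (kvm_mmu_invalidate_begin()). */
static void model_invalidate_begin(struct model_kvm *kvm)
{
	kvm->mmu_invalidate_in_progress++;
}

/* Models on_lock() for invalidate_range_end (kvm_mmu_invalidate_end()). */
static void model_invalidate_end(struct model_kvm *kvm)
{
	kvm->mmu_invalidate_seq++;
	kvm->mmu_invalidate_in_progress--;
}

/* Models kvm_handle_hva_range(): on_lock() runs only if a memslot is found. */
static void model_notify(struct model_kvm *kvm, bool slot_found,
			 void (*on_lock)(struct model_kvm *))
{
	if (slot_found)
		on_lock(kvm);
}

/* Models mmu_notifier_retry_cache(): true means "state may be stale, retry". */
static bool model_retry_cache(const struct model_kvm *kvm, unsigned long mmu_seq)
{
	return kvm->mmu_invalidate_in_progress ||
	       kvm->mmu_invalidate_seq != mmu_seq;
}

/*
 * One cache refresh racing with a complete munmap() of the slot's HVA.
 * slot_found == false models the RFC patch skipping a gmem-only memslot.
 * Returns true if the refresh notices the race and retries.
 */
static bool model_refresh_detects_unmap(bool slot_found)
{
	struct model_kvm kvm = { 0 };
	unsigned long mmu_seq = kvm.mmu_invalidate_seq; /* snapshot */

	/* ... pfn resolved via the HVA, page reference dropped ... */

	model_notify(&kvm, slot_found, model_invalidate_begin);
	model_notify(&kvm, slot_found, model_invalidate_end);

	return model_retry_cache(&kvm, mmu_seq);
}
```

In this model, model_refresh_detects_unmap(true) (memslot visited, as
without the patch) reports the race, while model_refresh_detects_unmap(false)
(memslot skipped) lets the refresh keep a pfn that may already be freed.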

> > > Could this allow the guest to read or write arbitrary host physical memory?
> 
> The KVM_MEMSLOT_GMEM_ONLY flag is set if the backing guest_memfd has been
> created with GUEST_MEMFD_FLAG_MMAP. The documentation for the flag says
> that '[..] the fault will always be consumed from guest_memfd, regardless
> of whether it is a shared or private fault'.  As far as I can tell, this
> means that, absent a fallocate(FALLOC_FL_PUNCH_HOLE) call, the page is
> still in the page cache for the guest_memfd file after userspace has
> unmapped it, so the guest will not be accessing a freed page.

KVM_MEMSLOT_GMEM_ONLY is somewhat misleading; it only applies to KVM's MMU.
For other cases where KVM accesses guest memory, KVM still follows the host virtual
address, e.g. so that copy_{to,from}_user() Just Works.  But userspace isn't
strictly *required* to keep the userspace mapping coherent with guest_memfd, nor
is userspace required to make the userspace mapping fully RWX.  And so if
userspace modifies the VMA, KVM needs to react accordingly.

When in-place conversion comes along, KVM will also rely on userspace mappings
being torn down before allowing a SHARED page to become PRIVATE (for all intents
and purposes, we're conceptually treating conversions as free()+re-alloc()).  So
while the page might still be in the page cache, it's effectively been "freed".
So in that case, KVM really does need to ensure it handles mmu_notifier events
correctly to avoid UAF.

* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-17 13:07     ` Alexandru Elisei
  2026-06-17 21:21       ` Sean Christopherson
@ 2026-06-17 21:22       ` Sean Christopherson
  1 sibling, 0 replies; 11+ messages in thread
From: Sean Christopherson @ 2026-06-17 21:22 UTC
  To: Alexandru Elisei; +Cc: sashiko-reviews, Marc Zyngier, kvm, kvmarm, Oliver Upton

On Wed, Jun 17, 2026, Alexandru Elisei wrote:
> Hi Sean,
> 
> Thanks for the reply. Just to make sure, once the bugs have been ironed out,
> you're ok with this idea?

Yes, I'm more than ok with this idea.  The only reason I didn't push to do this
right away is that I completely missed it :-)

* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-15 15:52 [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots Alexandru Elisei
  2026-06-15 16:09 ` sashiko-bot
@ 2026-06-15 19:07 ` David Hildenbrand
  2026-06-17 13:23   ` Alexandru Elisei
  1 sibling, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2026-06-15 19:07 UTC
  To: Alexandru Elisei, pbonzini, kvm, linux-kernel, maz, oupton,
	suzuki.poulose, kvmarm, linux-arm-kernel, seanjc, mark.rutland

On 6/15/26 17:52, Alexandru Elisei wrote:
> For guest_memfd-only memslots (kvm_memslot_is_gmem_only() is true), the
> memory provider for the virtual machine is the guest_memfd file, not the
> userspace mapping. Faults are resolved using the guest_memfd page cache,
> and the permissions for the secondary MMU mapping depend exclusively on
> the memslot (i.e., whether the memslot is read-only). How userspace happens
> to have the memory mmapped at fault time, or even whether the memory is
> mapped into userspace at all, is not taken into consideration.
> 
> guest_memfd memory is not evictable, is not movable and there's no backing
> storage. Once memory is allocated for an offset in the guest_memfd file, the
> offset will not change, and that memory is not freed unless userspace
> explicitly punches a hole in the file. As a result, memory reclaim, page
> migration, page aging and dirty page tracking for the userspace mapping
> serve little purpose.

I don't think any of that is relevant for the patch at hand?

The thing is: invalidation (truncation, later migration, for any other reason)
is driven through guest_memfd notifications, not through unrelated page tables.

If we don't look up pages for the KVM MMU through the page table, then there is
also no need for MMU notifiers. It's all guest_memfd only.

Or am I missing something?

-- 
Cheers,

David

* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-15 19:07 ` David Hildenbrand
@ 2026-06-17 13:23   ` Alexandru Elisei
  2026-06-17 13:41     ` David Hildenbrand
  0 siblings, 1 reply; 11+ messages in thread
From: Alexandru Elisei @ 2026-06-17 13:23 UTC
  To: David Hildenbrand
  Cc: pbonzini, kvm, linux-kernel, maz, oupton, suzuki.poulose, kvmarm,
	linux-arm-kernel, seanjc, mark.rutland

Hi David,

On Mon, Jun 15, 2026 at 09:07:50PM +0200, David Hildenbrand wrote:
> On 6/15/26 17:52, Alexandru Elisei wrote:
> > For guest_memfd-only memslots (kvm_memslot_is_gmem_only() is true), the
> > memory provider for the virtual machine is the guest_memfd file, not the
> > userspace mapping. Faults are resolved using the guest_memfd page cache,
> > and the permissions for the secondary MMU mapping depend exclusively on
> > the memslot (i.e., whether the memslot is read-only). How userspace happens
> > to have the memory mmapped at fault time, or even whether the memory is
> > mapped into userspace at all, is not taken into consideration.
> > 
> > guest_memfd memory is not evictable, is not movable and there's no backing
> > storage. Once memory is allocated for an offset in the guest_memfd file, the
> > offset will not change, and that memory is not freed unless userspace
> > explicitly punches a hole in the file. As a result, memory reclaim, page
> > migration, page aging and dirty page tracking for the userspace mapping
> > serve little purpose.
> 
> I don't think any of that is relevant for the patch at hand?
> 
> The thing is: invalidation (truncation, later migration, for any other reason)
> is driven through guest_memfd notifications, not through unrelated page tables.
> 
> If we don't look up pages for the KVM MMU through the page table, then there is
> also no need for MMU notifiers. It's all guest_memfd only.
> 
> Or am I missing something?

My thinking was that, because guest_memfd is not evictable, there is no need to
do page ageing, which would require that secondary MMU mappings be made old.

The invalidate callbacks are also used when userspace memory is marked read-only
for dirty state tracking. I was trying to explain that, since there is no
backing for the guest_memfd file, the host doesn't need to keep track of dirty state
for the memory, and ignoring the invalidate callbacks is correct for all cases.

I can drop the paragraph entirely, if you think that would make the commit
message clearer.

Thanks,
Alex

* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-17 13:23   ` Alexandru Elisei
@ 2026-06-17 13:41     ` David Hildenbrand
  2026-06-17 13:50       ` Alexandru Elisei
  0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2026-06-17 13:41 UTC
  To: Alexandru Elisei
  Cc: pbonzini, kvm, linux-kernel, maz, oupton, suzuki.poulose, kvmarm,
	linux-arm-kernel, seanjc, mark.rutland

On 6/17/26 15:23, Alexandru Elisei wrote:
> Hi David,
> 
> On Mon, Jun 15, 2026 at 09:07:50PM +0200, David Hildenbrand wrote:
>> On 6/15/26 17:52, Alexandru Elisei wrote:
>>> For guest_memfd-only memslots (kvm_memslot_is_gmem_only() is true), the
>>> memory provider for the virtual machine is the guest_memfd file, not the
>>> userspace mapping. Faults are resolved using the guest_memfd page cache,
>>> and the permissions for the secondary MMU mapping depend exclusively on
>>> the memslot (i.e., whether the memslot is read-only). How userspace happens
>>> to have the memory mmapped at fault time, or even whether the memory is
>>> mapped into userspace at all, is not taken into consideration.
>>>
>>> guest_memfd memory is not evictable, is not movable and there's no backing
>>> storage. Once memory is allocated for an offset in the guest_memfd file, the
>>> offset will not change, and that memory is not freed unless userspace
>>> explicitly punches a hole in the file. As a result, memory reclaim, page
>>> migration, page aging and dirty page tracking for the userspace mapping
>>> serve little purpose.
>>
>> I don't think any of that is relevant for the patch at hand?
>>
>> The thing is: invalidation (truncation, later migration, for any other reason)
>> is driven through guest_memfd notifications, not through unrelated page tables.
>>
>> If we don't look up pages for the KVM MMU through the page table, then there is
>> also no need for MMU notifiers. It's all guest_memfd only.
>>
>> Or am I missing something?
> 
> My thinking was that, because guest_memfd is not evictable, there is no need to
> do page ageing, which would require that secondary MMU mappings be made old.

Not really.

The KVM MMU did not obtain the folios through the page tables, but directly
through guest_memfd. Any aging would, therefore, have to be done through
guest_memfd.

Which we don't support and don't want to support :)

That we happen to have a matching user space range that maps the guest_memfd is
just coincidence from a KVM MMU point of view.

> 
> The invalidate callbacks are also used when userspace memory is marked read-only
> for dirty state tracking. I was trying to explain that, since there is no
> backing for the guest_memfd file, the host doesn't need to keep track of dirty state
> for the memory, and ignoring the invalidate callbacks is correct for all cases.
> 
> I can drop the paragraph entirely, if you think that would make the commit
> message clearer.

I think the real motivation is:

"Mappings in the secondary MMU were established by obtaining folios from
guest_memfd directly, not by looking the folios up through the page tables
through GUP. Consequently, there is no relationship between the page tables and
the secondary MMU: MMU notifiers do not apply."
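To make that distinction concrete, here is a hedged userspace model of the
two fault paths. Everything in it (`model_gmem_file`, `model_user_pt`,
`model_fault_pfn()`, a 4-page address space, pfn 0 meaning "unmapped") is
hypothetical. Note it only models KVM's MMU: as Sean points out elsewhere
in the thread, other KVM accesses to guest memory still follow the host
virtual address.

```c
#include <stdbool.h>

#define PAGE_SHIFT   12
#define MODEL_NPAGES 4

/* Hypothetical model of a guest_memfd page cache: file index -> pfn. */
struct model_gmem_file {
	unsigned long pfn_at_index[MODEL_NPAGES];
};

/* Hypothetical model of the userspace page table: virtual page -> pfn (0 = none). */
struct model_user_pt {
	unsigned long pfn_at_vpage[MODEL_NPAGES];
};

struct model_slot {
	unsigned long base_gfn;
	unsigned long userspace_addr;   /* HVA of the slot's first page */
	unsigned long gmem_pgoff;       /* slot's first index in the guest_memfd file */
	bool gmem_only;
	const struct model_gmem_file *gmem;
};

/* Returns the pfn a secondary MMU fault would map for @gfn, or 0 if none. */
static unsigned long model_fault_pfn(const struct model_slot *slot,
				     const struct model_user_pt *pt,
				     unsigned long gfn)
{
	unsigned long idx = gfn - slot->base_gfn;

	if (slot->gmem_only) {
		/* Taken straight from guest_memfd; page tables are never consulted. */
		unsigned long pgoff = slot->gmem_pgoff + idx;
		return pgoff < MODEL_NPAGES ? slot->gmem->pfn_at_index[pgoff] : 0;
	}

	/* Ordinary slot: gfn -> hva, then a GUP-like walk of the userspace page table. */
	unsigned long vpage = (slot->userspace_addr >> PAGE_SHIFT) + idx;
	return vpage < MODEL_NPAGES ? pt->pfn_at_vpage[vpage] : 0;
}
```

If userspace mmap()s the guest_memfd file and later munmap()s one page, only
the ordinary slot's fault result changes; the guest_memfd-only slot keeps
resolving to the same page-cache folio, since the page tables played no part
in establishing its mapping.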

-- 
Cheers,

David

* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-17 13:41     ` David Hildenbrand
@ 2026-06-17 13:50       ` Alexandru Elisei
  0 siblings, 0 replies; 11+ messages in thread
From: Alexandru Elisei @ 2026-06-17 13:50 UTC
  To: David Hildenbrand
  Cc: pbonzini, kvm, linux-kernel, maz, oupton, suzuki.poulose, kvmarm,
	linux-arm-kernel, seanjc, mark.rutland

Hi David,

On Wed, Jun 17, 2026 at 03:41:41PM +0200, David Hildenbrand wrote:
> On 6/17/26 15:23, Alexandru Elisei wrote:
> > Hi David,
> > 
> > On Mon, Jun 15, 2026 at 09:07:50PM +0200, David Hildenbrand wrote:
> >> On 6/15/26 17:52, Alexandru Elisei wrote:
> >>> For guest_memfd-only memslots (kvm_memslot_is_gmem_only() is true), the
> >>> memory provider for the virtual machine is the guest_memfd file, not the
> >>> userspace mapping. Faults are resolved using the guest_memfd page cache,
> >>> and the permissions for the secondary MMU mapping depend exclusively on
> >>> the memslot (i.e., whether the memslot is read-only). How userspace happens
> >>> to have the memory mmapped at fault time, or even whether the memory is
> >>> mapped into userspace at all, is not taken into consideration.
> >>>
> >>> guest_memfd memory is not evictable, is not movable and there's no backing
> >>> storage. Once memory is allocated for an offset in the guest_memfd file, the
> >>> offset will not change, and that memory is not freed unless userspace
> >>> explicitly punches a hole in the file. As a result, memory reclaim, page
> >>> migration, page aging and dirty page tracking for the userspace mapping
> >>> serve little purpose.
> >>
> >> I don't think any of that is relevant for the patch at hand?
> >>
> >> The thing is: invalidation (truncation, later migration, for any other reason)
> >> is driven through guest_memfd notifications, not through unrelated page tables.
> >>
> >> If we don't look up pages for the KVM MMU through the page table, then there is
> >> also no need for MMU notifiers. It's all guest_memfd only.
> >>
> >> Or am I missing something?
> > 
> > My thinking was that, because guest_memfd is not evictable, there is no need to
> > do page ageing, which would require that secondary MMU mappings be made old.
> 
> Not really.
> 
> The KVM MMU did not obtain the folios through the page tables, but directly
> through guest_memfd. Any aging would, therefore, have to be done through
> guest_memfd.
> 
> Which we don't support and don't want to support :)
> 
> That we happen to have a matching user space range that maps the guest_memfd is
> just coincidence from a KVM MMU point of view.
> 
> > 
> > The invalidate callbacks are also used when userspace memory is marked read-only
> > for dirty state tracking. I was trying to explain that, since there is no
> > backing for the guest_memfd file, the host doesn't need to keep track of dirty state
> > for the memory, and ignoring the invalidate callbacks is correct for all cases.
> > 
> > I can drop the paragraph entirely, if you think that would make the commit
> > message clearer.
> 
> I think the real motivation is:
> 
> "Mappings in the secondary MMU were established by obtaining folios from
> guest_memfd directly, not by looking the folios up through the page tables
> through GUP. Consequently, there is no relationship between the page tables and
> the secondary MMU: MMU notifiers do not apply."

That's much better than my version, thanks!

Alex

end of thread, last message 2026-06-17 21:22 UTC

Thread overview: 11+ messages
2026-06-15 15:52 [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots Alexandru Elisei
2026-06-15 16:09 ` sashiko-bot
2026-06-15 17:47   ` Sean Christopherson
2026-06-15 18:09     ` Sean Christopherson
2026-06-17 13:07     ` Alexandru Elisei
2026-06-17 21:21       ` Sean Christopherson
2026-06-17 21:22       ` Sean Christopherson
2026-06-15 19:07 ` David Hildenbrand
2026-06-17 13:23   ` Alexandru Elisei
2026-06-17 13:41     ` David Hildenbrand
2026-06-17 13:50       ` Alexandru Elisei
