* [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
@ 2026-06-15 15:52 Alexandru Elisei
  2026-06-15 16:09 ` sashiko-bot
  2026-06-15 19:07 ` David Hildenbrand
  0 siblings, 2 replies; 14+ messages in thread
From: Alexandru Elisei @ 2026-06-15 15:52 UTC (permalink / raw)
  To: pbonzini, kvm, linux-kernel, maz, oupton, suzuki.poulose, kvmarm,
	linux-arm-kernel, seanjc, david.hildenbrand, mark.rutland

For guest_memfd-only memslots (kvm_memslot_is_gmem_only() is true), the
memory provider for the virtual machine is the guest_memfd file, not the
userspace mapping. Faults are resolved using the guest_memfd page cache,
and the permissions for the secondary MMU mapping depend exclusively on
the memslot (i.e., if the memslot is read-only). How userspace happens to
have the memory mmapped at fault time, or even if the memory is mapped at
all into userspace, is not taken into consideration.

guest_memfd memory is not evictable, is not movable and there's no backing
storage. Once memory is allocated for an offset in the guest_memfd file, the
offset will not change, and that memory is not freed unless userspace
explicitly punches a hole in the file. As a result, memory reclaim, page
migration, page aging and dirty page tracking for the userspace mapping
serve little purpose.

Despite this, KVM's MMU notifiers still modify the secondary MMU page
tables, similar to ordinary memslots, only for the same memory to be
remapped the next time a guest accesses it.

Make the disconnect between the user mapping and the secondary MMU page
tables explicit by ignoring the MMU notifiers for guest_memfd-only
memslots.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---

The only theoretical instance I was able to find where the MMU notifiers
are invoked for the userspace mapping of a guest_memfd-only memslot was
automatic NUMA balancing with a non-NULL NUMA policy for the guest_memfd
file. I wasn't able to test it in practice.

Also, my knowledge of MM is very limited, so there might be other cases
where it happens, or I might be wrong and today the MMU notifiers are never
invoked.

Either way, when and if it happens, having memory unmapped from the
secondary MMU in the case of a guest_memfd-only memslot is at most a
performance issue (it causes unnecessary guest faults), but I wanted to
start a conversation about this because having memory that stays mapped at
stage 2 (unless userspace explicitly unmaps it from the VM) is needed for
an Arm feature (called SPE, Statistical Profiling Extension) that I'm
working to upstream. This patch aims to provide the guarantee that memory
won't be unmapped from the secondary MMU behind the VMM's back, unlike
non-guest_memfd memslots, where this does happen.

 virt/kvm/kvm_main.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 881f92d7a469..8c4158996928 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 			unsigned long hva_start, hva_end;
 
 			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
+
+			if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
+				continue;
+
 			hva_start = max_t(unsigned long, range->start, slot->userspace_addr);
 			hva_end = min_t(unsigned long, range->end,
 					  slot->userspace_addr + (slot->npages << PAGE_SHIFT));

base-commit: 8cd9520d35a6c38db6567e97dd93b1f11f185dc6
-- 
2.54.0

^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots 2026-06-15 15:52 [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots Alexandru Elisei @ 2026-06-15 16:09 ` sashiko-bot 2026-06-15 17:47 ` Sean Christopherson 2026-06-15 19:07 ` David Hildenbrand 1 sibling, 1 reply; 14+ messages in thread From: sashiko-bot @ 2026-06-15 16:09 UTC (permalink / raw) To: Alexandru Elisei; +Cc: Marc Zyngier, kvm, kvmarm, Oliver Upton Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider: - [Critical] Skipping MMU notifiers for guest_memfd-only memslots breaks the pfncache retry protocol, leading to a host-level Use-After-Free. -- commit 8278d4b7bdd8bdfe7a6bc39dd2a0418838234428 Author: Alexandru Elisei <alexandru.elisei@arm.com> KVM: Ignore MMU notifiers for guest_memfd-only memslots This commit modifies KVM's MMU notifier handling to ignore guest_memfd-only memslots. The rationale is that guest_memfd memory is not evictable or movable, making traditional userspace MMU notifier tracking unnecessary. > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c [ ... ] > @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm, > unsigned long hva_start, hva_end; > > slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]); > + > + if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot)) > + continue; > + [Severity: Critical] Does this change inadvertently introduce a regression in the pfncache retry protocol? Looking at the pfncache framework, it maps guest memory into kernel space and explicitly drops the page reference after mapping it: virt/kvm/pfncache.c:hva_to_pfn_retry() { ... kvm_release_page_clean(page); ... } It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq) to invalidate the cache when the page is unmapped by the host. 
If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY but still provides a valid anonymous user mapping as its userspace_addr, could this regression lead to a use-after-free? By unmapping the anonymous memory, the host would trigger MMU notifiers, but this new check skips the memslot. As a result, kvm->mmu_invalidate_seq wouldn't increment, and KVM might retain a kernel mapping to a freed physical page. Could this allow the guest to read or write arbitrary host physical memory? > hva_start = max_t(unsigned long, range->start, slot->userspace_addr); > hva_end = min_t(unsigned long, range->end, > slot->userspace_addr + (slot->npages << PAGE_SHIFT)); -- Sashiko AI review · https://sashiko.dev/#/patchset/20260615155244.183044-1-alexandru.elisei@arm.com?part=1 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots 2026-06-15 16:09 ` sashiko-bot @ 2026-06-15 17:47 ` Sean Christopherson 2026-06-15 18:09 ` Sean Christopherson ` (2 more replies) 0 siblings, 3 replies; 14+ messages in thread From: Sean Christopherson @ 2026-06-15 17:47 UTC (permalink / raw) To: sashiko-reviews; +Cc: Alexandru Elisei, Marc Zyngier, kvm, kvmarm, Oliver Upton On Mon, Jun 15, 2026, sashiko-bot@kernel.org wrote: > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > --- a/virt/kvm/kvm_main.c > > +++ b/virt/kvm/kvm_main.c > [ ... ] > > @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm, > > unsigned long hva_start, hva_end; > > > > slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]); > > + > > + if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot)) > > + continue; > > + > > [Severity: Critical] > Does this change inadvertently introduce a regression in the pfncache retry > protocol? > > Looking at the pfncache framework, it maps guest memory into kernel space and > explicitly drops the page reference after mapping it: > > virt/kvm/pfncache.c:hva_to_pfn_retry() { > ... > kvm_release_page_clean(page); > ... > } > > It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq) > to invalidate the cache when the page is unmapped by the host. > > If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY > but still provides a valid anonymous user mapping as its userspace_addr, > could this regression lead to a use-after-free? Sadly, yes. To land this, we would need to first teach the gfn_to_pfn_cache code to be able to pull directly from guest_memfd. I forget if anyone is working on that. > By unmapping the anonymous memory, the host would trigger MMU notifiers, but > this new check skips the memslot. 
As a result, kvm->mmu_invalidate_seq > wouldn't increment, and KVM might retain a kernel mapping to a freed physical > page. > > Could this allow the guest to read or write arbitrary host physical memory? > > > hva_start = max_t(unsigned long, range->start, slot->userspace_addr); > > hva_end = min_t(unsigned long, range->end, > > slot->userspace_addr + (slot->npages << PAGE_SHIFT)); > > -- > Sashiko AI review · https://sashiko.dev/#/patchset/20260615155244.183044-1-alexandru.elisei@arm.com?part=1 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots 2026-06-15 17:47 ` Sean Christopherson @ 2026-06-15 18:09 ` Sean Christopherson 2026-06-18 11:09 ` Alexandru Elisei 2026-06-17 13:07 ` Alexandru Elisei 2026-06-18 11:26 ` David Hildenbrand (Arm) 2 siblings, 1 reply; 14+ messages in thread From: Sean Christopherson @ 2026-06-15 18:09 UTC (permalink / raw) To: sashiko-reviews; +Cc: Alexandru Elisei, Marc Zyngier, kvm, kvmarm, Oliver Upton On Mon, Jun 15, 2026, Sean Christopherson wrote: > On Mon, Jun 15, 2026, sashiko-bot@kernel.org wrote: > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > > --- a/virt/kvm/kvm_main.c > > > +++ b/virt/kvm/kvm_main.c > > [ ... ] > > > @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm, > > > unsigned long hva_start, hva_end; > > > > > > slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]); > > > + > > > + if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot)) > > > + continue; > > > + > > > > [Severity: Critical] > > Does this change inadvertently introduce a regression in the pfncache retry > > protocol? > > > > Looking at the pfncache framework, it maps guest memory into kernel space and > > explicitly drops the page reference after mapping it: > > > > virt/kvm/pfncache.c:hva_to_pfn_retry() { > > ... > > kvm_release_page_clean(page); > > ... > > } > > > > It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq) > > to invalidate the cache when the page is unmapped by the host. > > > > If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY > > but still provides a valid anonymous user mapping as its userspace_addr, > > could this regression lead to a use-after-free? > > Sadly, yes. To land this, we would need to first teach the gfn_to_pfn_cache code > to be able to pull directly from guest_memfd. I forget if anyone is working on > that. 
Actually, we just need to ensure the invalidation tracking is updated, the MMU
itself can be left as-is.

Compile tested only, but this?

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 27498e990dff..690ab707816b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -260,6 +260,7 @@ union kvm_mmu_notifier_arg {
 enum kvm_gfn_range_filter {
 	KVM_FILTER_SHARED		= BIT(0),
 	KVM_FILTER_PRIVATE		= BIT(1),
+	KVM_FILTER_USERSPACE_MAPPINGS	= BIT(2),
 };
 
 struct kvm_gfn_range {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e44c20c04961..84b693de7e35 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -608,7 +608,8 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 			 * HVA-based notifications aren't relevant to private
 			 * mappings as they don't have a userspace mapping.
 			 */
-			gfn_range.attr_filter = KVM_FILTER_SHARED;
+			gfn_range.attr_filter = KVM_FILTER_SHARED |
+						KVM_FILTER_USERSPACE_MAPPINGS;
 
 			/*
 			 * {gfn(page) | page intersects with [hva_start, hva_end)} =
@@ -715,6 +716,21 @@ void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
 bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
+
+	/*
+	 * When reacting to changes in userspace mappings, don't unmap memslots
+	 * that are guest_memfd-only, in which case KVM's MMU mappings are
+	 * pulled directly from guest_memfd, i.e. don't depend on the userspace
+	 * mappings.
+	 *
+	 * TODO: Skip gmem-only memslots on mmu_notifier events entirely, once
+	 * gfn_to_pfn_cache is also wired up to directly pull from guest_memfd.
+	 */
+	if (range->attr_filter & KVM_FILTER_USERSPACE_MAPPINGS &&
+	    kvm_slot_has_gmem(range->slot) &&
+	    kvm_memslot_is_gmem_only(range->slot))
+		return false;
+
 	return kvm_unmap_gfn_range(kvm, range);
 }

^ permalink raw reply related [flat|nested] 14+ messages in thread
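The shape of the change proposed in this message can be illustrated with a
small standalone C model. It is a sketch only, not kernel code: the demo_*
types and DEMO_FILTER_* flags are hypothetical stand-ins for struct
kvm_gfn_range, the memslot and the filter enum, and "mapped" is a single
boolean in place of real secondary MMU page tables. What it tries to show is
the ordering: the range is always recorded for invalidation tracking, and only
the unmap is skipped for gmem-only slots when the event originates from a
userspace mapping.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the filter bits discussed in the thread. */
enum demo_filter {
	DEMO_FILTER_SHARED             = 1u << 0,
	DEMO_FILTER_PRIVATE            = 1u << 1,
	DEMO_FILTER_USERSPACE_MAPPINGS = 1u << 2,
};

/* Toy memslot: whether it is gmem-only, and whether it is currently mapped. */
struct demo_slot {
	bool gmem_only;
	bool mapped;
};

/* Toy range: the filter of the event plus the slot it hits. */
struct demo_range {
	unsigned int attr_filter;
	struct demo_slot *slot;
};

/* Toy VM: counts how many ranges were recorded for invalidation tracking. */
struct demo_vm {
	unsigned long ranges_tracked;
};

/* Returns true if a mapping was torn down (i.e. a flush would be needed). */
static bool demo_unmap_gfn_range(struct demo_vm *vm, struct demo_range *range)
{
	bool was_mapped;

	assert(range->slot);

	/* Tracking is updated unconditionally, before any filtering. */
	vm->ranges_tracked++;

	/* Userspace-mapping events leave gmem-only slots mapped. */
	if ((range->attr_filter & DEMO_FILTER_USERSPACE_MAPPINGS) &&
	    range->slot->gmem_only)
		return false;

	was_mapped = range->slot->mapped;
	range->slot->mapped = false;
	return was_mapped;
}
```

In this model a userspace-mapping event on a gmem-only slot bumps
ranges_tracked but leaves the slot mapped, while the same event on an ordinary
slot both records the range and unmaps it.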
* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots 2026-06-15 18:09 ` Sean Christopherson @ 2026-06-18 11:09 ` Alexandru Elisei 0 siblings, 0 replies; 14+ messages in thread From: Alexandru Elisei @ 2026-06-18 11:09 UTC (permalink / raw) To: Sean Christopherson Cc: sashiko-reviews, Marc Zyngier, kvm, kvmarm, Oliver Upton Hi Sean, On Mon, Jun 15, 2026 at 11:09:21AM -0700, Sean Christopherson wrote: > On Mon, Jun 15, 2026, Sean Christopherson wrote: > > On Mon, Jun 15, 2026, sashiko-bot@kernel.org wrote: > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > > > --- a/virt/kvm/kvm_main.c > > > > +++ b/virt/kvm/kvm_main.c > > > [ ... ] > > > > @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm, > > > > unsigned long hva_start, hva_end; > > > > > > > > slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]); > > > > + > > > > + if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot)) > > > > + continue; > > > > + > > > > > > [Severity: Critical] > > > Does this change inadvertently introduce a regression in the pfncache retry > > > protocol? > > > > > > Looking at the pfncache framework, it maps guest memory into kernel space and > > > explicitly drops the page reference after mapping it: > > > > > > virt/kvm/pfncache.c:hva_to_pfn_retry() { > > > ... > > > kvm_release_page_clean(page); > > > ... > > > } > > > > > > It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq) > > > to invalidate the cache when the page is unmapped by the host. > > > > > > If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY > > > but still provides a valid anonymous user mapping as its userspace_addr, > > > could this regression lead to a use-after-free? > > > > Sadly, yes. To land this, we would need to first teach the gfn_to_pfn_cache code > > to be able to pull directly from guest_memfd. I forget if anyone is working on > > that. 
> > Actually, we just need to ensure the invalidation tracking is updated, the MMU > itself can be left as-is. > > Compile tested only, but this? > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 27498e990dff..690ab707816b 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -260,6 +260,7 @@ union kvm_mmu_notifier_arg { > enum kvm_gfn_range_filter { > KVM_FILTER_SHARED = BIT(0), > KVM_FILTER_PRIVATE = BIT(1), > + KVM_FILTER_USERSPACE_MAPPINGS = BIT(2), > }; > > struct kvm_gfn_range { > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index e44c20c04961..84b693de7e35 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -608,7 +608,8 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm, > * HVA-based notifications aren't relevant to private > * mappings as they don't have a userspace mapping. > */ > - gfn_range.attr_filter = KVM_FILTER_SHARED; > + gfn_range.attr_filter = KVM_FILTER_SHARED | > + KVM_FILTER_USERSPACE_MAPPINGS; > > /* > * {gfn(page) | page intersects with [hva_start, hva_end)} = > @@ -715,6 +716,21 @@ void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end) > bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) > { > kvm_mmu_invalidate_range_add(kvm, range->start, range->end); > + > + /* > + * When reacting to changes in userspace mappings, don't unmap memslots > + * that are guest_memfd-only, in which case KVM's MMU mappings are > + * pulled directly from guest_memfd, i.e. don't depend on the userspace > + * mappings. > + * > + * TODO: Skip gmem-only memslots on mmu_notifier events entirely, once > + * gfn_to_pfn_cache is also wired up to directly pull from guest_memfd. 
> + */ > + if (range->attr_filter & KVM_FILTER_USERSPACE_MAPPINGS && > + kvm_slot_has_gmem(range->slot) && > + kvm_memslot_is_gmem_only(range->slot)) > + return false; > + > return kvm_unmap_gfn_range(kvm, range); > } Looks correct to me, this way we also make sure we don't hit the WARN_ON_ONCE() from mmu_invalidate_retry_pfn(). How about the ->{clear_flush,clear,test}_young() MMU notifier callbacks? Shouldn't they receive the same treatment? Thanks, Alex ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots 2026-06-15 17:47 ` Sean Christopherson 2026-06-15 18:09 ` Sean Christopherson @ 2026-06-17 13:07 ` Alexandru Elisei 2026-06-17 21:21 ` Sean Christopherson 2026-06-17 21:22 ` Sean Christopherson 2026-06-18 11:26 ` David Hildenbrand (Arm) 2 siblings, 2 replies; 14+ messages in thread From: Alexandru Elisei @ 2026-06-17 13:07 UTC (permalink / raw) To: Sean Christopherson Cc: sashiko-reviews, Marc Zyngier, kvm, kvmarm, Oliver Upton Hi Sean, Thanks for the reply. Just to make sure, once the bugs have been ironed out, you're ok with this idea? On Mon, Jun 15, 2026 at 10:47:14AM -0700, Sean Christopherson wrote: > On Mon, Jun 15, 2026, sashiko-bot@kernel.org wrote: > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > > --- a/virt/kvm/kvm_main.c > > > +++ b/virt/kvm/kvm_main.c > > [ ... ] > > > @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm, > > > unsigned long hva_start, hva_end; > > > > > > slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]); > > > + > > > + if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot)) > > > + continue; > > > + > > > > [Severity: Critical] > > Does this change inadvertently introduce a regression in the pfncache retry > > protocol? > > > > Looking at the pfncache framework, it maps guest memory into kernel space and > > explicitly drops the page reference after mapping it: > > > > virt/kvm/pfncache.c:hva_to_pfn_retry() { > > ... > > kvm_release_page_clean(page); > > ... > > } > > > > It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq) > > to invalidate the cache when the page is unmapped by the host. > > > > If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY > > but still provides a valid anonymous user mapping as its userspace_addr, > > could this regression lead to a use-after-free? > > Sadly, yes.
To land this, we would need to first teach the gfn_to_pfn_cache code > to be able to pull directly from guest_memfd. I forget if anyone is working on > that. I've been trying to wrap my head around this, and I just can't seem to figure it out. kvm_mmu_notifier_invalidate_range_start(), before handle_hva_range(), calls gfn_to_pfn_cache_invalidate_start() for the MMU notifier range, and that marks all caches that overlap the range as invalid. kvm_gpc_check() returns false for an invalid cache, so how can the memory still be accessed via the pfncache? > > > By unmapping the anonymous memory, the host would trigger MMU notifiers, but > > this new check skips the memslot. As a result, kvm->mmu_invalidate_seq > > wouldn't increment, and KVM might retain a kernel mapping to a freed physical > > page. kvm->mmu_invalidate_seq is incremented in kvm_mmu_invalidate_end(), I don't see how that is affected by skipping a memslot in handle_hva_range(). > > > > Could this allow the guest to read or write arbitrary host physical memory? The KVM_MEMSLOT_GMEM_ONLY flag is set if the backing guest_memfd has been created with GUEST_MEMFD_FLAG_MMAP. The documentation for the flag says that '[..] the fault will always be consumed from guest_memfd, regardless of whether it is a shared or private fault'. As far as I can tell, this means that, absent a fallocate(FALLOC_FL_PUNCH_HOLE) call, the page is still in the page cache for the guest_memfd file after userspace has unmapped it, so the guest will not be accessing a freed page. Thoughts? Am I missing something obvious? Thanks, Alex > > > > > hva_start = max_t(unsigned long, range->start, slot->userspace_addr); > > > hva_end = min_t(unsigned long, range->end, > > > slot->userspace_addr + (slot->npages << PAGE_SHIFT)); > > > > -- > > Sashiko AI review · https://sashiko.dev/#/patchset/20260615155244.183044-1-alexandru.elisei@arm.com?part=1 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots 2026-06-17 13:07 ` Alexandru Elisei @ 2026-06-17 21:21 ` Sean Christopherson 2026-06-18 10:19 ` Alexandru Elisei 2026-06-17 21:22 ` Sean Christopherson 1 sibling, 1 reply; 14+ messages in thread From: Sean Christopherson @ 2026-06-17 21:21 UTC (permalink / raw) To: Alexandru Elisei; +Cc: sashiko-reviews, Marc Zyngier, kvm, kvmarm, Oliver Upton On Wed, Jun 17, 2026, Alexandru Elisei wrote: > Hi Sean, > > Thanks for the reply. Just to make sure, once the bugs have been ironed out, > you're ok with this idea? > > On Mon, Jun 15, 2026 at 10:47:14AM -0700, Sean Christopherson wrote: > > On Mon, Jun 15, 2026, sashiko-bot@kernel.org wrote: > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > > > --- a/virt/kvm/kvm_main.c > > > > +++ b/virt/kvm/kvm_main.c > > > [ ... ] > > > > @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm, > > > > unsigned long hva_start, hva_end; > > > > > > > > slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]); > > > > + > > > > + if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot)) > > > > + continue; > > > > + > > > > > > [Severity: Critical] > > > Does this change inadvertently introduce a regression in the pfncache retry > > > protocol? > > > > > > Looking at the pfncache framework, it maps guest memory into kernel space and > > > explicitly drops the page reference after mapping it: > > > > > > virt/kvm/pfncache.c:hva_to_pfn_retry() { > > > ... > > > kvm_release_page_clean(page); > > > ... > > > } > > > > > > It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq) > > > to invalidate the cache when the page is unmapped by the host. > > > > > > If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY > > > but still provides a valid anonymous user mapping as its userspace_addr, > > > could this regression lead to a use-after-free?
> > > > Sadly, yes. To land this, we would need to first teach the gfn_to_pfn_cache code > > to be able to pull directly from guest_memfd. I forget if anyone is working on > > that. > > I've been trying to wrap my head around this, and I just can't seem to > figure it out. > > kvm_mmu_notifier_invalidate_range_start(), before handle_hva_range(), calls > gfn_to_pfn_cache_invalidate_start() for the MMU notifier range, and that > marks all caches that overlap the range as invalid. kvm_gpc_check() returns > false for an invalid cache, so how can the memory still be accessed via the > pfncache? That just forces gpcs to be refreshed, mmu_notifier_retry_cache() still relies on mmu_invalidate_seq being bumped to avoid consuming stale state. > > > By unmapping the anonymous memory, the host would trigger MMU notifiers, but > > > this new check skips the memslot. As a result, kvm->mmu_invalidate_seq > > > wouldn't increment, and KVM might retain a kernel mapping to a freed physical > > > page. > > kvm->mmu_invalidate_seq is incremented in kvm_mmu_invalidate_end(), I don't see > how that is affected by skipping a memslot in handle_hva_range(). handle_hva_range() only invokes on_lock() if a memslot is found. By skipping the memslot entirely, kvm_mmu_invalidate_{start,end}() won't be called and so mmu_invalidate_seq won't be bumped. > > > Could this allow the guest to read or write arbitrary host physical memory? > > The KVM_MEMSLOT_GMEM_ONLY flag is set if the backing guest_memfd has been > created with GUEST_MEMFD_FLAG_MMAP. The documentation for the flag says > that '[..] the fault will always be consumed from guest_memfd, regardless > of whether it is a shared or private fault'. As far as I can tell, this > means that, absent a fallocate(FALLOC_FL_PUNCH_HOLE) call, the page is > still in the page cache for the guest_memfd file after userspace has > unmapped it, so the guest will not be accessing a freed page. 
KVM_MEMSLOT_GMEM_ONLY is somewhat misleading; it only applies to KVM's MMU.
For other cases where KVM accesses guest memory, KVM still follows the host
virtual address, e.g. so that copy_{to,from}_user() Just Works. But userspace
isn't strictly *required* to keep the userspace mapping coherent with
guest_memfd, nor is userspace required to make the userspace mapping fully RWX.
And so if userspace modifies the VMA, KVM needs to react accordingly.

When in-place conversion comes along, KVM will also rely on userspace mappings
being torn down before allowing a SHARED page to become PRIVATE (for all
intents and purposes, we're conceptually treating conversions as
free()+re-alloc()). So while the page might still be in the page cache, it's
effectively been "freed". So in that case, KVM really does need to ensure it
handles mmu_notifier events correctly to avoid UAF.

^ permalink raw reply [flat|nested] 14+ messages in thread
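For readers following the pfncache part of this discussion, the interaction
described in this message can be modelled in a few lines of userspace C. This
is a sketch under stated assumptions, not kernel code: every toy_* name is
hypothetical, backing_gen stands in for "which page currently sits behind the
HVA", and the skip_slot flag models the early "continue" from the RFC, which
keeps the invalidation begin/end bookkeeping from running at all.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the invalidation bookkeeping kept per VM. */
struct toy_vm {
	unsigned long invalidate_seq;	/* bumped when an invalidation completes */
	int invalidate_in_progress;	/* non-zero between begin and end */
	unsigned long backing_gen;	/* generation of the page behind the HVA */
};

/* Model of one notifier event; skip_slot models skipping the memslot. */
static void toy_invalidate(struct toy_vm *vm, bool skip_slot)
{
	if (!skip_slot)
		vm->invalidate_in_progress++;

	vm->backing_gen++;		/* the host replaces or frees the page */

	if (!skip_slot) {
		vm->invalidate_seq++;
		vm->invalidate_in_progress--;
	}
	assert(vm->invalidate_in_progress >= 0);
}

/* Returns true if a refresher that sampled seq_snapshot must retry. */
static bool toy_retry(const struct toy_vm *vm, unsigned long seq_snapshot)
{
	return vm->invalidate_in_progress || vm->invalidate_seq != seq_snapshot;
}

/*
 * One refresh attempt racing with one notifier event. Returns true if a
 * stale translation would be installed, false if the race was detected.
 */
static bool toy_refresh_installs_stale(struct toy_vm *vm, bool skip_slot)
{
	unsigned long seq = vm->invalidate_seq;		/* sample the sequence */
	unsigned long resolved = vm->backing_gen;	/* look up the page */

	toy_invalidate(vm, skip_slot);			/* racing notifier event */

	if (toy_retry(vm, seq))
		return false;				/* caller would retry */

	return resolved != vm->backing_gen;
}
```

With skip_slot false, the refresher sees the sequence change and retries. With
skip_slot true, nothing changes from the refresher's point of view and the
stale translation is installed, which is the situation the sequence bump is
meant to prevent.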
* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots 2026-06-17 21:21 ` Sean Christopherson @ 2026-06-18 10:19 ` Alexandru Elisei 0 siblings, 0 replies; 14+ messages in thread From: Alexandru Elisei @ 2026-06-18 10:19 UTC (permalink / raw) To: Sean Christopherson Cc: sashiko-reviews, Marc Zyngier, kvm, kvmarm, Oliver Upton Hi Sean, On Wed, Jun 17, 2026 at 02:21:02PM -0700, Sean Christopherson wrote: > On Wed, Jun 17, 2026, Alexandru Elisei wrote: > > Hi Sean, > > > > Thanks for the reply. Just to make sure, once the bugs have been ironed out, > > you're ok with this idea? > > > > On Mon, Jun 15, 2026 at 10:47:14AM -0700, Sean Christopherson wrote: > > > On Mon, Jun 15, 2026, sashiko-bot@kernel.org wrote: > > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > > > > --- a/virt/kvm/kvm_main.c > > > > > +++ b/virt/kvm/kvm_main.c > > > > [ ... ] > > > > > @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm, > > > > > unsigned long hva_start, hva_end; > > > > > > > > > > slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]); > > > > > + > > > > > + if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot)) > > > > > + continue; > > > > > + > > > > > > > > [Severity: Critical] > > > > Does this change inadvertently introduce a regression in the pfncache retry > > > > protocol? > > > > > > > > Looking at the pfncache framework, it maps guest memory into kernel space and > > > > explicitly drops the page reference after mapping it: > > > > > > > > virt/kvm/pfncache.c:hva_to_pfn_retry() { > > > > ... > > > > kvm_release_page_clean(page); > > > > ... > > > > } > > > > > > > > It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq) > > > > to invalidate the cache when the page is unmapped by the host.
> > > > > > > > If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY > > > > but still provides a valid anonymous user mapping as its userspace_addr, > > > > could this regression lead to a use-after-free? > > > > > > Sadly, yes. To land this, we would need to first teach the gfn_to_pfn_cache code > > > to be able to pull directly from guest_memfd. I forget if anyone is working on > > > that. > > > > I've been trying to wrap my head around this, and I just can't seem to > > figure it out. > > > > kvm_mmu_notifier_invalidate_range_start(), before handle_hva_range(), calls > > gfn_to_pfn_cache_invalidate_start() for the MMU notifier range, and that > > marks all caches that overlap the range as invalid. kvm_gpc_check() returns > > false for an invalid cache, so how can the memory still be accessed via the > > pfncache? > > That just forces gpcs to be refreshed, mmu_notifier_retry_cache() still relies > on mmu_invalidate_seq being bumped to avoid consuming stale state. Yes. > > > > > By unmapping the anonymous memory, the host would trigger MMU notifiers, but > > > > this new check skips the memslot. As a result, kvm->mmu_invalidate_seq > > > > wouldn't increment, and KVM might retain a kernel mapping to a freed physical > > > > page. > > > > kvm->mmu_invalidate_seq is incremented in kvm_mmu_invalidate_end(), I don't see > > how that is affected by skipping a memslot in handle_hva_range(). > > handle_hva_range() only invokes on_lock() if a memslot is found. By skipping the > memslot entirely, kvm_mmu_invalidate_{start,end}() won't be called and so > mmu_invalidate_seq won't be bumped. I see it now, for some reason I completely missed the part where kvm_mmu_invalidate_{begin,end}() is called on ->lock() :( I was under the impression that they are called directly from the ->invalidate_range_{start,end}() MMU notifier callbacks. > > > > > Could this allow the guest to read or write arbitrary host physical memory? 
> > > > The KVM_MEMSLOT_GMEM_ONLY flag is set if the backing guest_memfd has been > > created with GUEST_MEMFD_FLAG_MMAP. The documentation for the flag says > > that '[..] the fault will always be consumed from guest_memfd, regardless > > of whether it is a shared or private fault'. As far as I can tell, this > > means that, absent a fallocate(FALLOC_FL_PUNCH_HOLE) call, the page is > > still in the page cache for the guest_memfd file after userspace has > > unmapped it, so the guest will not be accessing a freed page. > > KVM_MEMSLOT_GMEM_ONLY is somewhat misleading; it only applies to KVM's MMU. > For other cases where KVM accesses guest memory, KVM still follows the host virtual > address, e.g. so that copy_{to,from}_user() Just Works. But userspace isn't > strictly *required* to keep the userspace mapping coherent with guest_memfd, nor > is userspace required to make the userspace mapping fully RWX. And so if > userspace modifies the VMA, KVM needs to react accordingly. > > When in-place conversion comes along, KVM will also rely on userspace mappings > being torn down before allowing a SHARED page to become PRIVATE (for all intents > and purposes, we're conceptually treating conversions as free()+re-alloc()). So > while the page might still be in the page cache, it's effectively been "freed". > So in that case, KVM really does need to ensure it handles mmu_notifier events > correctly to avoid UAF. Everything makes more sense now, thanks for your patience in explaining it. Thanks, Alex ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-17 13:07 ` Alexandru Elisei
  2026-06-17 21:21 ` Sean Christopherson
@ 2026-06-17 21:22 ` Sean Christopherson
  1 sibling, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2026-06-17 21:22 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: sashiko-reviews, Marc Zyngier, kvm, kvmarm, Oliver Upton

On Wed, Jun 17, 2026, Alexandru Elisei wrote:
> Hi Sean,
>
> Thanks for the reply. Just to make sure, once the bugs have been ironed out,
> you're ok with this idea?

Yes, I'm more than ok with this idea. The only reason I didn't push to do this
right away is that I completely missed it :-)

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-15 17:47 ` Sean Christopherson
  2026-06-15 18:09 ` Sean Christopherson
  2026-06-17 13:07 ` Alexandru Elisei
@ 2026-06-18 11:26 ` David Hildenbrand (Arm)
  2 siblings, 0 replies; 14+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-18 11:26 UTC (permalink / raw)
To: Sean Christopherson, sashiko-reviews
Cc: Alexandru Elisei, Marc Zyngier, kvm, kvmarm, Oliver Upton

On 6/15/26 19:47, Sean Christopherson wrote:
> On Mon, Jun 15, 2026, sashiko-bot@kernel.org wrote:
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>> [ ... ]
>>> @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
>>>  		unsigned long hva_start, hva_end;
>>>
>>>  		slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
>>> +
>>> +		if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
>>> +			continue;
>>> +
>>
>> [Severity: Critical]
>> Does this change inadvertently introduce a regression in the pfncache retry
>> protocol?
>>
>> Looking at the pfncache framework, it maps guest memory into kernel space and
>> explicitly drops the page reference after mapping it:
>>
>> virt/kvm/pfncache.c:hva_to_pfn_retry() {
>>     ...
>>     kvm_release_page_clean(page);
>>     ...
>> }
>>
>> It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq)
>> to invalidate the cache when the page is unmapped by the host.
>>
>> If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY
>> but still provides a valid anonymous user mapping as its userspace_addr,
>> could this regression lead to a use-after-free?
>
> Sadly, yes. To land this, we would need to first teach the gfn_to_pfn_cache code
> to be able to pull directly from guest_memfd. I forget if anyone is working on
> that.

For completeness, I think this is what you are looking for:

https://lore.kernel.org/r/20260420154720.29012-1-itazur@amazon.com

--
Cheers,

David

^ permalink raw reply [flat|nested] 14+ messages in thread
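[Editorial illustration, not part of the thread.] The concern discussed above can be pictured with a toy userspace model. This is only a sketch under stated assumptions: `toy_slot`, `toy_cache` and `toy_invalidate()` are hypothetical names, and none of this is the kernel's pfncache code. It models one idea only: a cache that holds no page reference depends on an invalidation event to learn that its mapping is stale, so skipping that event for a slot leaves the cache marked valid.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a memslot; only the fields the model needs. */
struct toy_slot {
	unsigned long hva_start;
	unsigned long hva_end;	/* exclusive */
	bool gmem_only;
};

/* Hypothetical cache that holds no page reference, only a validity flag. */
struct toy_cache {
	unsigned long hva;
	bool valid;
};

/*
 * Model of an invalidation of [start, end) visiting one slot. When
 * skip_gmem_only is set, a gmem-only slot is ignored, mirroring the
 * "continue" added by the RFC patch. Returns true if the cache was
 * invalidated.
 */
static bool toy_invalidate(const struct toy_slot *slot, struct toy_cache *cache,
			   unsigned long start, unsigned long end,
			   bool skip_gmem_only)
{
	if (skip_gmem_only && slot->gmem_only)
		return false;

	/* No overlap between the invalidated range and the slot. */
	if (end <= slot->hva_start || start >= slot->hva_end)
		return false;

	if (cache->hva >= start && cache->hva < end) {
		cache->valid = false;
		return true;
	}
	return false;
}
```

In this model, a cache entry inside a gmem-only slot stays valid when the slot is skipped, even though the range covering it was invalidated. That is the situation the review flags, and why the reply says the cache would first need another way to be invalidated.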
* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-15 15:52 [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots Alexandru Elisei
  2026-06-15 16:09 ` sashiko-bot
@ 2026-06-15 19:07 ` David Hildenbrand
  2026-06-17 13:23 ` Alexandru Elisei
  1 sibling, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2026-06-15 19:07 UTC (permalink / raw)
To: Alexandru Elisei, pbonzini, kvm, linux-kernel, maz, oupton,
	suzuki.poulose, kvmarm, linux-arm-kernel, seanjc, mark.rutland

On 6/15/26 17:52, Alexandru Elisei wrote:
> For guest_memfd-only memslots (kvm_memslot_is_gmem_only() is true), the
> memory provider for the virtual machine is the guest_memfd file, not the
> userspace mapping. Faults are resolved using the guest_memfd page cache,
> and the permissions for the secondary MMU mapping depends exclusively on
> the memslot (i.e, if the memslot is read-only). How userspace happens to
> have the memory mmaped at fault time, or even if the memory is mapped at
> all into userspace, is not taken into consideration.
>
> guest_memfd memory is not evictable, is not movable and there's no backing
> storage. Once memory is allocated for an offset in guest_memfd file, the
> offset will not change, and that memory is not freed unless userspace
> explicitly punches a hole in the file. As a result, memory reclaim, page
> migration, page aging and dirty page tracking for the userspace mapping
> serve little purpose.

I don't think any of that is relevant for the patch at hand?

The thing is: invalidation (truncation, later migration, for any other reason)
is driven through guest_memfd notifications, not through unrelated page tables.

If we don't look up pages for the KVM MMU through the page table, then there is
also no need for MMU notifiers. It's all guest_memfd only.

Or am I missing something?

--
Cheers,

David

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-15 19:07 ` David Hildenbrand
@ 2026-06-17 13:23 ` Alexandru Elisei
  2026-06-17 13:41 ` David Hildenbrand
  0 siblings, 1 reply; 14+ messages in thread
From: Alexandru Elisei @ 2026-06-17 13:23 UTC (permalink / raw)
To: David Hildenbrand
Cc: pbonzini, kvm, linux-kernel, maz, oupton, suzuki.poulose, kvmarm,
	linux-arm-kernel, seanjc, mark.rutland

Hi David,

On Mon, Jun 15, 2026 at 09:07:50PM +0200, David Hildenbrand wrote:
> On 6/15/26 17:52, Alexandru Elisei wrote:
> > For guest_memfd-only memslots (kvm_memslot_is_gmem_only() is true), the
> > memory provider for the virtual machine is the guest_memfd file, not the
> > userspace mapping. Faults are resolved using the guest_memfd page cache,
> > and the permissions for the secondary MMU mapping depends exclusively on
> > the memslot (i.e, if the memslot is read-only). How userspace happens to
> > have the memory mmaped at fault time, or even if the memory is mapped at
> > all into userspace, is not taken into consideration.
> >
> > guest_memfd memory is not evictable, is not movable and there's no backing
> > storage. Once memory is allocated for an offset in guest_memfd file, the
> > offset will not change, and that memory is not freed unless userspace
> > explicitly punches a hole in the file. As a result, memory reclaim, page
> > migration, page aging and dirty page tracking for the userspace mapping
> > serve little purpose.
>
> I don't think any of that is relevant for the patch at hand?
>
> The thing is: invalidation (truncation, later migration, for any other reason)
> is driven through guest_memfd notifications, not through unrelated page tables.
>
> If we don't look up pages for the KVM MMU through the page table, then there is
> also no need for MMU notifiers. It's all guest_memfd only.
>
> Or am I missing something?

My thinking was that, because guest_memfd is not evictable, there is no need to
do page ageing, which would require that secondary MMU mappings be made old.

The invalidate callbacks are also used when userspace memory is marked read-only
for dirty state tracking. I was trying to explain that, since there is no
backing for the guest_memfd file, host doesn't need to keep track of dirty state
for the memory, and ignoring the invalidate callbacks is correct for all cases.

I can drop the paragraph entirely, if you think that would make the commit
message clearer.

Thanks,
Alex

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-17 13:23 ` Alexandru Elisei
@ 2026-06-17 13:41 ` David Hildenbrand
  2026-06-17 13:50 ` Alexandru Elisei
  0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2026-06-17 13:41 UTC (permalink / raw)
To: Alexandru Elisei
Cc: pbonzini, kvm, linux-kernel, maz, oupton, suzuki.poulose, kvmarm,
	linux-arm-kernel, seanjc, mark.rutland

On 6/17/26 15:23, Alexandru Elisei wrote:
> Hi David,
>
> On Mon, Jun 15, 2026 at 09:07:50PM +0200, David Hildenbrand wrote:
>> On 6/15/26 17:52, Alexandru Elisei wrote:
>>> For guest_memfd-only memslots (kvm_memslot_is_gmem_only() is true), the
>>> memory provider for the virtual machine is the guest_memfd file, not the
>>> userspace mapping. Faults are resolved using the guest_memfd page cache,
>>> and the permissions for the secondary MMU mapping depends exclusively on
>>> the memslot (i.e, if the memslot is read-only). How userspace happens to
>>> have the memory mmaped at fault time, or even if the memory is mapped at
>>> all into userspace, is not taken into consideration.
>>>
>>> guest_memfd memory is not evictable, is not movable and there's no backing
>>> storage. Once memory is allocated for an offset in guest_memfd file, the
>>> offset will not change, and that memory is not freed unless userspace
>>> explicitly punches a hole in the file. As a result, memory reclaim, page
>>> migration, page aging and dirty page tracking for the userspace mapping
>>> serve little purpose.
>>
>> I don't think any of that is relevant for the patch at hand?
>>
>> The thing is: invalidation (truncation, later migration, for any other reason)
>> is driven through guest_memfd notifications, not through unrelated page tables.
>>
>> If we don't look up pages for the KVM MMU through the page table, then there is
>> also no need for MMU notifiers. It's all guest_memfd only.
>>
>> Or am I missing something?
>
> My thinking was that, because guest_memfd is not evictable, there is no need to
> do page ageing, which would require that secondary MMU mappings be made old.

Not really.

The KVM MMU did not obtain the folios through the page tables, but directly
through guest_memfd. Any aging would, therefore, have to be done through
guest_memfd.

Which we don't support and don't want to support :)

That we happen to have a matching user space range that maps the guest_memfd is
just coincidence from a KVM MMU point of view.

>
> The invalidate callbacks are also used when userspace memory is marked read-only
> for dirty state tracking. I was trying to explain that, since there is no
> backing for the guest_memfd file, host doesn't need to keep track of dirty state
> for the memory, and ignoring the invalidate callbacks is correct for all cases.
>
> I can drop the paragraph entirely, if you think that would make the commit
> message clearer.

I think the real motivation is:

"Mappings in the secondary MMU were established by obtaining folios from
guest_memfd directly, not by looking the folios up through the page tables
through GUP. Consequently, there is no relationship between the page tables and
the secondary MMU: MMU notifiers do not apply."

--
Cheers,

David

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
  2026-06-17 13:41 ` David Hildenbrand
@ 2026-06-17 13:50 ` Alexandru Elisei
  0 siblings, 0 replies; 14+ messages in thread
From: Alexandru Elisei @ 2026-06-17 13:50 UTC (permalink / raw)
To: David Hildenbrand
Cc: pbonzini, kvm, linux-kernel, maz, oupton, suzuki.poulose, kvmarm,
	linux-arm-kernel, seanjc, mark.rutland

Hi David,

On Wed, Jun 17, 2026 at 03:41:41PM +0200, David Hildenbrand wrote:
> On 6/17/26 15:23, Alexandru Elisei wrote:
> > Hi David,
> >
> > On Mon, Jun 15, 2026 at 09:07:50PM +0200, David Hildenbrand wrote:
> >> On 6/15/26 17:52, Alexandru Elisei wrote:
> >>> For guest_memfd-only memslots (kvm_memslot_is_gmem_only() is true), the
> >>> memory provider for the virtual machine is the guest_memfd file, not the
> >>> userspace mapping. Faults are resolved using the guest_memfd page cache,
> >>> and the permissions for the secondary MMU mapping depends exclusively on
> >>> the memslot (i.e, if the memslot is read-only). How userspace happens to
> >>> have the memory mmaped at fault time, or even if the memory is mapped at
> >>> all into userspace, is not taken into consideration.
> >>>
> >>> guest_memfd memory is not evictable, is not movable and there's no backing
> >>> storage. Once memory is allocated for an offset in guest_memfd file, the
> >>> offset will not change, and that memory is not freed unless userspace
> >>> explicitly punches a hole in the file. As a result, memory reclaim, page
> >>> migration, page aging and dirty page tracking for the userspace mapping
> >>> serve little purpose.
> >>
> >> I don't think any of that is relevant for the patch at hand?
> >>
> >> The thing is: invalidation (truncation, later migration, for any other reason)
> >> is driven through guest_memfd notifications, not through unrelated page tables.
> >>
> >> If we don't look up pages for the KVM MMU through the page table, then there is
> >> also no need for MMU notifiers. It's all guest_memfd only.
> >>
> >> Or am I missing something?
> >
> > My thinking was that, because guest_memfd is not evictable, there is no need to
> > do page ageing, which would require that secondary MMU mappings be made old.
>
> Not really.
>
> The KVM MMU did not obtain the folios through the page tables, but directly
> through guest_memfd. Any aging would, therefore, have to be done through
> guest_memfd.
>
> Which we don't support and don't want to support :)
>
> That we happen to have a matching user space range that maps the guest_memfd is
> just coincidence from a KVM MMU point of view.
>
> >
> > The invalidate callbacks are also used when userspace memory is marked read-only
> > for dirty state tracking. I was trying to explain that, since there is no
> > backing for the guest_memfd file, host doesn't need to keep track of dirty state
> > for the memory, and ignoring the invalidate callbacks is correct for all cases.
> >
> > I can drop the paragraph entirely, if you think that would make the commit
> > message clearer.
>
> I think the real motivation is:
>
> "Mappings in the secondary MMU were established by obtaining folios from
> guest_memfd directly, not by looking the folios up through the page tables
> through GUP. Consequently, there is no relationship between the page tables and
> the secondary MMU: MMU notifiers do not apply."

That's much better than my version, thanks!

Alex

^ permalink raw reply [flat|nested] 14+ messages in thread
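[Editorial illustration, not part of the thread.] The point that invalidation is driven through guest_memfd rather than through page tables can be sketched as a translation from a range of file page offsets to a range of guest frame numbers, with no host virtual address involved. This is a minimal sketch under stated assumptions: `toy_gmem_slot`, `toy_gfn_range` and `toy_gmem_to_gfn_range()` are hypothetical names, and the arithmetic is a simplified model rather than KVM's implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical binding of a memslot to a window of a guest_memfd file. */
struct toy_gmem_slot {
	unsigned long base_gfn;	/* first guest frame number of the slot */
	unsigned long pgoff;	/* first file page offset backing the slot */
	unsigned long npages;	/* number of pages in the slot */
};

/* Half-open range of guest frame numbers: [start, end). */
struct toy_gfn_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Translate an invalidated file range [start, end), in page offsets, into
 * the guest frame numbers it covers in the slot. Returns false when the
 * file range does not overlap the slot's window.
 */
static bool toy_gmem_to_gfn_range(const struct toy_gmem_slot *slot,
				  unsigned long start, unsigned long end,
				  struct toy_gfn_range *out)
{
	unsigned long slot_end = slot->pgoff + slot->npages;
	unsigned long lo = start > slot->pgoff ? start : slot->pgoff;
	unsigned long hi = end < slot_end ? end : slot_end;

	if (lo >= hi)
		return false;

	out->start = slot->base_gfn + (lo - slot->pgoff);
	out->end = slot->base_gfn + (hi - slot->pgoff);
	return true;
}
```

The only inputs are the file offsets and the slot's binding to the file, which is the sense in which a userspace mapping of the same memory is incidental from the secondary MMU's point of view.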