Date: Thu, 18 Jun 2026 11:19:56 +0100
From: Alexandru Elisei
To: Sean Christopherson
Cc: sashiko-reviews@lists.linux.dev, Marc Zyngier, kvm@vger.kernel.org,
 kvmarm@lists.linux.dev, Oliver Upton
Subject: Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
References: <20260615155244.183044-1-alexandru.elisei@arm.com>
 <20260615160901.9A1A61F000E9@smtp.kernel.org>
X-Mailing-List: kvmarm@lists.linux.dev

Hi Sean,

On Wed, Jun 17, 2026 at 02:21:02PM -0700, Sean Christopherson wrote:
> On Wed, Jun 17, 2026, Alexandru Elisei wrote:
> > Hi Sean,
> >
> > Thanks for the reply. Just to make sure, once the bugs have been ironed out,
> > you're ok with this idea?
> >
> > On Mon, Jun 15, 2026 at 10:47:14AM -0700, Sean Christopherson wrote:
> > > On Mon, Jun 15, 2026, sashiko-bot@kernel.org wrote:
> > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > > --- a/virt/kvm/kvm_main.c
> > > > > +++ b/virt/kvm/kvm_main.c
> > > > [ ... ]
> > > > > @@ -592,6 +592,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
> > > > >  		unsigned long hva_start, hva_end;
> > > > >
> > > > >  		slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
> > > > > +
> > > > > +		if (kvm_slot_has_gmem(slot) && kvm_memslot_is_gmem_only(slot))
> > > > > +			continue;
> > > > > +
> > > >
> > > > [Severity: Critical]
> > > > Does this change inadvertently introduce a regression in the pfncache retry
> > > > protocol?
> > > >
> > > > Looking at the pfncache framework, it maps guest memory into kernel space and
> > > > explicitly drops the page reference after mapping it:
> > > >
> > > > virt/kvm/pfncache.c:hva_to_pfn_retry() {
> > > > 	...
> > > > 	kvm_release_page_clean(page);
> > > > 	...
> > > > }
> > > >
> > > > It appears to rely entirely on KVM's MMU notifiers (kvm->mmu_invalidate_seq)
> > > > to invalidate the cache when the page is unmapped by the host.
> > > >
> > > > If a VMM defines a guest_memfd-backed memslot with KVM_MEMSLOT_GMEM_ONLY
> > > > but still provides a valid anonymous user mapping as its userspace_addr,
> > > > could this regression lead to a use-after-free?
> > >
> > > Sadly, yes. To land this, we would need to first teach the gfn_to_pfn_cache code
> > > to be able to pull directly from guest_memfd. I forget if anyone is working on
> > > that.
> >
> > I've been trying to wrap my head around this, and I just can't seem to
> > figure it out.
> >
> > kvm_mmu_notifier_invalidate_range_start(), before handle_hva_range(), calls
> > gfn_to_pfn_cache_invalidate_start() for the MMU notifier range, and that
> > marks all caches that overlap the range as invalid. kvm_gpc_check() returns
> > false for an invalid cache, so how can the memory still be accessed via the
> > pfncache?
>
> That just forces gpcs to be refreshed; mmu_notifier_retry_cache() still relies
> on mmu_invalidate_seq being bumped to avoid consuming stale state.

Yes.

>
> > > > By unmapping the anonymous memory, the host would trigger MMU notifiers, but
> > > > this new check skips the memslot. As a result, kvm->mmu_invalidate_seq
> > > > wouldn't increment, and KVM might retain a kernel mapping to a freed physical
> > > > page.
> >
> > kvm->mmu_invalidate_seq is incremented in kvm_mmu_invalidate_end(), I don't see
> > how that is affected by skipping a memslot in handle_hva_range().
>
> handle_hva_range() only invokes on_lock() if a memslot is found. By skipping the
> memslot entirely, kvm_mmu_invalidate_{begin,end}() won't be called and so
> mmu_invalidate_seq won't be bumped.

I see it now. For some reason I completely missed the part where
kvm_mmu_invalidate_{begin,end}() is called via ->on_lock() :( I was under
the impression that they are called directly from the
->invalidate_range_{start,end}() MMU notifier callbacks.

>
> > > > Could this allow the guest to read or write arbitrary host physical memory?
> > The KVM_MEMSLOT_GMEM_ONLY flag is set if the backing guest_memfd has been
> > created with GUEST_MEMFD_FLAG_MMAP. The documentation for the flag says
> > that '[..] the fault will always be consumed from guest_memfd, regardless
> > of whether it is a shared or private fault'. As far as I can tell, this
> > means that, absent a fallocate(FALLOC_FL_PUNCH_HOLE) call, the page is
> > still in the page cache for the guest_memfd file after userspace has
> > unmapped it, so the guest will not be accessing a freed page.
>
> KVM_MEMSLOT_GMEM_ONLY is somewhat misleading, it only applies to KVM's MMU.
> For other cases where KVM accesses guest memory, KVM still follows the host virtual
> address, e.g. so that copy_{to,from}_user() Just Works. But userspace isn't
> strictly *required* to keep the userspace mapping coherent with guest_memfd, nor
> is userspace required to make the userspace mapping fully RWX. And so if
> userspace modifies the VMA, KVM needs to react accordingly.
>
> When in-place conversion comes along, KVM will also rely on userspace mappings
> being torn down before allowing a SHARED page to become PRIVATE (for all intents
> and purposes, we're conceptually treating conversions as free()+re-alloc()). So
> while the page might still be in the page cache, it's effectively been "freed".
> So in that case, KVM really does need to ensure it handles mmu_notifier events
> correctly to avoid UAF.

Everything makes more sense now. Thanks for your patience in explaining it.

Thanks,
Alex
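
P.S. For anyone reading this later, the point about on_lock() can be modelled with a toy userspace sketch. None of this is kernel code: struct toy_slot, struct toy_kvm, toy_on_lock(), toy_handle_hva_range() and toy_cache_is_stale() are hypothetical stand-ins, and the begin/end pair is collapsed into a single sequence bump for brevity. It only illustrates that when every overlapping memslot is skipped, on_lock() never runs, so a sequence-based staleness check has nothing to notice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; these are NOT the kernel structures. */
struct toy_slot {
	unsigned long hva_start, hva_end;	/* half-open range [start, end) */
	bool gmem_only;
};

struct toy_kvm {
	struct toy_slot *slots;
	size_t nr_slots;
	unsigned long mmu_invalidate_seq;
};

/* Simplification: one bump stands in for the begin/end invalidation pair. */
static void toy_on_lock(struct toy_kvm *kvm)
{
	kvm->mmu_invalidate_seq++;
}

/*
 * Walk the slots overlapping [start, end). on_lock() runs at most once,
 * and only if at least one slot is actually handled. Returns whether any
 * slot was handled.
 */
static bool toy_handle_hva_range(struct toy_kvm *kvm, unsigned long start,
				 unsigned long end, bool skip_gmem_only)
{
	bool found = false;

	assert(start < end);

	for (size_t i = 0; i < kvm->nr_slots; i++) {
		struct toy_slot *slot = &kvm->slots[i];

		/* No overlap with the invalidated range. */
		if (slot->hva_end <= start || slot->hva_start >= end)
			continue;

		/* Models the check proposed in the RFC patch. */
		if (skip_gmem_only && slot->gmem_only)
			continue;

		if (!found) {
			found = true;
			toy_on_lock(kvm);
		}
	}

	return found;
}

/* A cached translation is trusted only if the sequence has not moved. */
static bool toy_cache_is_stale(const struct toy_kvm *kvm,
			       unsigned long snapshot_seq)
{
	return kvm->mmu_invalidate_seq != snapshot_seq;
}
```

With a single gmem-only slot and skip_gmem_only set, toy_handle_hva_range() handles nothing and the sequence stays at its snapshot value, so the cache still looks fresh. With the skip disabled, the same range bumps the sequence and the cache is reported stale.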