All of lore.kernel.org
 help / color / mirror / Atom feed
From: Sean Christopherson <seanjc@google.com>
To: Ackerley Tng <ackerleytng@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	 "Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Mike Rapoport <rppt@kernel.org>,
	 Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	 "Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Shuah Khan <shuah@kernel.org>, Jonathan Corbet <corbet@lwn.net>,
	 Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	 rientjes@google.com, rick.p.edgecombe@intel.com,
	yan.y.zhao@intel.com,  fvdl@google.com, jthoughton@google.com,
	vannapurve@google.com,  shivankg@amd.com, michael.roth@amd.com,
	pratyush@kernel.org,  pasha.tatashin@soleen.com,
	kalyazin@amazon.com, tabba@google.com,
	 Vlastimil Babka <vbabka@kernel.org>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	 linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	 linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org
Subject: Re: [PATCH RFC v3 2/4] KVM: guest_memfd: Set release always on guest_memfd mappings
Date: Mon, 9 Mar 2026 18:06:17 -0700	[thread overview]
Message-ID: <aa9uiQ_KBcX7X0My@google.com> (raw)
In-Reply-To: <20260309-gmem-st-blocks-v3-2-815f03d9653e@google.com>

On Mon, Mar 09, 2026, Ackerley Tng wrote:
> Set release always on guest_memfd mappings to enable the use of
> .invalidate_folio, which performs inode accounting for guest_memfd.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  virt/kvm/guest_memfd.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 77219551056a7..8246b9fbcf832 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -607,6 +607,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>  	mapping_set_inaccessible(inode->i_mapping);
>  	/* Unmovable mappings are supposed to be marked unevictable as well. */
>  	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> +	mapping_set_release_always(inode->i_mapping);

*sigh*

So... an internal AI review bot flagged setting AS_RELEASE_ALWAYS as being
potentially problematic, and I started poking around, mostly because I was
curious.  I'm pretty sure the exact scenario painted by the bot isn't possible,
but I do think a similar issue exists in at least truncate_error_folio().  Or at
least, *should* exist, but doesn't because of a different bug.

On memory error, kvm_gmem_error_folio() will get invoked via this code.  Note
the "err != 0" check.  kvm_gmem_error_folio() returns MF_DELAYED, which has an
arbitrary value of '2', and so KVM is always signalling "failure".

		int err = mapping->a_ops->error_remove_folio(mapping, folio);

		if (err != 0)
			pr_info("%#lx: Failed to punch page: %d\n", pfn, err);
		else if (!filemap_release_folio(folio, GFP_NOIO))
			pr_info("%#lx: failed to release buffers\n", pfn);

I _think_ that's bad?  On x86, if I'm following the breadcrubs correctly, we'll
end up in this code in kill_me_maybe()

	pr_err("Memory error not recovered");
	kill_me_now(cb);

and send what I assume is a relatively useless SIGBUS and likely kill the VM.

	struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);

	p->mce_count = 0;
	force_sig(SIGBUS);

But even if that's somehow the "right" behavior, we're doing it purely by
accident.

As for this patch, if we fix that bug by returning 0, then filemap_release_folio()
is definitely reachable by at least one flow, so I think guest_memfd also needs
to implement release_folio()?


    
Full AI bot text:
--
Setting the AS_RELEASE_ALWAYS flag causes folio_needs_release() to return
true. This correctly triggers .invalidate_folio during truncation, but does
it also unintentionally expose guest_memfd folios to eviction via
posix_fadvise(POSIX_FADV_DONTNEED)?

If userspace calls posix_fadvise() on a guest_memfd file, the core mm
calls mapping_evict_folio(). Because folio_needs_release() is true, it
calls filemap_release_folio().

Since guest_memfd does not implement a .release_folio address space
operation, filemap_release_folio() falls back to calling
try_to_free_buffers(). Could this fallback cause a warning?

fs/buffer.c:try_to_free_buffers() {
	...
	/* Misconfigured folio check */
	if (WARN_ON_ONCE(!folio_buffers(folio)))
		return true;
	...
}

Because the guest_memfd folio has no private data, folio_buffers()
is NULL, which will trigger this WARN_ON_ONCE.

Furthermore, try_to_free_buffers() returns true, allowing the folio to be
removed from the page cache. Because this eviction path bypasses
truncate_cleanup_folio(), it never calls .invalidate_folio.

Does this mean inode_sub_bytes() is skipped, leaking the inode block
accounting?

Userspace could potentially trigger the warning and infinitely inflate the
inode's block count with:
    struct kvm_create_guest_memfd args = { .size = 4096 };
    int fd = ioctl(kvm_vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
    fallocate(fd, 0, 0, 4096);
    posix_fadvise(fd, 0, 4096, POSIX_FADV_DONTNEED);
Should guest_memfd implement a .release_folio callback that simply
returns false to prevent these folios from being evicted?
--

  parent reply	other threads:[~2026-03-10  1:06 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-09  9:53 [PATCH RFC v3 0/4] guest_memfd: Track amount of memory allocated on inode Ackerley Tng
2026-03-09  9:53 ` [PATCH RFC v3 1/4] KVM: " Ackerley Tng
2026-03-09 11:50   ` David Hildenbrand (Arm)
2026-03-09 15:45     ` Ackerley Tng
2026-03-09 20:14       ` Sean Christopherson
2026-03-09  9:53 ` [PATCH RFC v3 2/4] KVM: guest_memfd: Set release always on guest_memfd mappings Ackerley Tng
2026-03-09 11:42   ` Jan Kara
2026-03-10  1:06   ` Sean Christopherson [this message]
2026-03-10  9:12     ` Ackerley Tng
2026-03-12 19:00       ` Sean Christopherson
2026-03-09  9:53 ` [PATCH RFC v3 3/4] KVM: selftests: Wrap fstat() to assert success Ackerley Tng
2026-03-09  9:53 ` [PATCH RFC v3 4/4] KVM: selftests: Test that st_blocks is updated on allocation Ackerley Tng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aa9uiQ_KBcX7X0My@google.com \
    --to=seanjc@google.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=ackerleytng@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=brauner@kernel.org \
    --cc=corbet@lwn.net \
    --cc=david@kernel.org \
    --cc=fvdl@google.com \
    --cc=jack@suse.cz \
    --cc=jthoughton@google.com \
    --cc=kalyazin@amazon.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=mhocko@suse.com \
    --cc=michael.roth@amd.com \
    --cc=pasha.tatashin@soleen.com \
    --cc=pbonzini@redhat.com \
    --cc=pratyush@kernel.org \
    --cc=rick.p.edgecombe@intel.com \
    --cc=rientjes@google.com \
    --cc=rppt@kernel.org \
    --cc=shivankg@amd.com \
    --cc=shuah@kernel.org \
    --cc=surenb@google.com \
    --cc=tabba@google.com \
    --cc=vannapurve@google.com \
    --cc=vbabka@kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    --cc=yan.y.zhao@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.