public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Sean Christopherson <seanjc@google.com>
To: Ackerley Tng <ackerleytng@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	 "Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Mike Rapoport <rppt@kernel.org>,
	 Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	 "Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Shuah Khan <shuah@kernel.org>, Jonathan Corbet <corbet@lwn.net>,
	 Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	 rientjes@google.com, rick.p.edgecombe@intel.com,
	yan.y.zhao@intel.com,  fvdl@google.com, jthoughton@google.com,
	vannapurve@google.com,  shivankg@amd.com, michael.roth@amd.com,
	pratyush@kernel.org,  pasha.tatashin@soleen.com,
	kalyazin@amazon.com, tabba@google.com,
	 Vlastimil Babka <vbabka@kernel.org>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	 linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	 linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org
Subject: Re: [PATCH RFC v3 2/4] KVM: guest_memfd: Set release always on guest_memfd mappings
Date: Mon, 9 Mar 2026 18:06:17 -0700	[thread overview]
Message-ID: <aa9uiQ_KBcX7X0My@google.com> (raw)
In-Reply-To: <20260309-gmem-st-blocks-v3-2-815f03d9653e@google.com>

On Mon, Mar 09, 2026, Ackerley Tng wrote:
> Set release always on guest_memfd mappings to enable the use of
> .invalidate_folio, which performs inode accounting for guest_memfd.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  virt/kvm/guest_memfd.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 77219551056a7..8246b9fbcf832 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -607,6 +607,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>  	mapping_set_inaccessible(inode->i_mapping);
>  	/* Unmovable mappings are supposed to be marked unevictable as well. */
>  	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> +	mapping_set_release_always(inode->i_mapping);

*sigh*

So... an internal AI review bot flagged setting AS_RELEASE_ALWAYS as being
potentially problematic, and I started poking around, mostly because I was
curious.  I'm pretty sure the exact scenario painted by the bot isn't possible,
but I do think a similar issue exists in at least truncate_error_folio().  Or at
least, *should* exist, but doesn't because of a different bug.

On memory error, kvm_gmem_error_folio() will get invoked via this code.  Note
the "err != 0" check.  kvm_gmem_error_folio() returns MF_DELAYED, which has an
arbitrary value of '2', and so KVM is always signalling "failure".

		int err = mapping->a_ops->error_remove_folio(mapping, folio);

		if (err != 0)
			pr_info("%#lx: Failed to punch page: %d\n", pfn, err);
		else if (!filemap_release_folio(folio, GFP_NOIO))
			pr_info("%#lx: failed to release buffers\n", pfn);

I _think_ that's bad?  On x86, if I'm following the breadcrubs correctly, we'll
end up in this code in kill_me_maybe()

	pr_err("Memory error not recovered");
	kill_me_now(cb);

and send what I assume is a relatively useless SIGBUS and likely kill the VM.

	struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);

	p->mce_count = 0;
	force_sig(SIGBUS);

But even if that's somehow the "right" behavior, we're doing it purely by
accident.

As for this patch, if we fix that bug by returning 0, then filemap_release_folio()
is definitely reachable by at least one flow, so I think guest_memfd also needs
to implement release_folio()?


    
Full AI bot text:
--
Setting the AS_RELEASE_ALWAYS flag causes folio_needs_release() to return
true. This correctly triggers .invalidate_folio during truncation, but does
it also unintentionally expose guest_memfd folios to eviction via
posix_fadvise(POSIX_FADV_DONTNEED)?

If userspace calls posix_fadvise() on a guest_memfd file, the core mm
calls mapping_evict_folio(). Because folio_needs_release() is true, it
calls filemap_release_folio().

Since guest_memfd does not implement a .release_folio address space
operation, filemap_release_folio() falls back to calling
try_to_free_buffers(). Could this fallback cause a warning?

fs/buffer.c:try_to_free_buffers() {
	...
	/* Misconfigured folio check */
	if (WARN_ON_ONCE(!folio_buffers(folio)))
		return true;
	...
}

Because the guest_memfd folio has no private data, folio_buffers()
is NULL, which will trigger this WARN_ON_ONCE.

Furthermore, try_to_free_buffers() returns true, allowing the folio to be
removed from the page cache. Because this eviction path bypasses
truncate_cleanup_folio(), it never calls .invalidate_folio.

Does this mean inode_sub_bytes() is skipped, leaking the inode block
accounting?

Userspace could potentially trigger the warning and infinitely inflate the
inode's block count with:
    struct kvm_create_guest_memfd args = { .size = 4096 };
    int fd = ioctl(kvm_vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
    fallocate(fd, 0, 0, 4096);
    posix_fadvise(fd, 0, 4096, POSIX_FADV_DONTNEED);
Should guest_memfd implement a .release_folio callback that simply
returns false to prevent these folios from being evicted?
--

  parent reply	other threads:[~2026-03-10  1:06 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-09  9:53 [PATCH RFC v3 0/4] guest_memfd: Track amount of memory allocated on inode Ackerley Tng
2026-03-09  9:53 ` [PATCH RFC v3 1/4] KVM: " Ackerley Tng
2026-03-09 11:50   ` David Hildenbrand (Arm)
2026-03-09 15:45     ` Ackerley Tng
2026-03-09 20:14       ` Sean Christopherson
2026-03-09  9:53 ` [PATCH RFC v3 2/4] KVM: guest_memfd: Set release always on guest_memfd mappings Ackerley Tng
2026-03-09 11:42   ` Jan Kara
2026-03-10  1:06   ` Sean Christopherson [this message]
2026-03-10  9:12     ` Ackerley Tng
2026-03-12 19:00       ` Sean Christopherson
2026-03-09  9:53 ` [PATCH RFC v3 3/4] KVM: selftests: Wrap fstat() to assert success Ackerley Tng
2026-03-09  9:53 ` [PATCH RFC v3 4/4] KVM: selftests: Test that st_blocks is updated on allocation Ackerley Tng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aa9uiQ_KBcX7X0My@google.com \
    --to=seanjc@google.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=ackerleytng@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=brauner@kernel.org \
    --cc=corbet@lwn.net \
    --cc=david@kernel.org \
    --cc=fvdl@google.com \
    --cc=jack@suse.cz \
    --cc=jthoughton@google.com \
    --cc=kalyazin@amazon.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=mhocko@suse.com \
    --cc=michael.roth@amd.com \
    --cc=pasha.tatashin@soleen.com \
    --cc=pbonzini@redhat.com \
    --cc=pratyush@kernel.org \
    --cc=rick.p.edgecombe@intel.com \
    --cc=rientjes@google.com \
    --cc=rppt@kernel.org \
    --cc=shivankg@amd.com \
    --cc=shuah@kernel.org \
    --cc=surenb@google.com \
    --cc=tabba@google.com \
    --cc=vannapurve@google.com \
    --cc=vbabka@kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    --cc=yan.y.zhao@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox