[PATCH RFC v3 0/4] guest_memfd: Track amount of memory allocated on inode

* [PATCH RFC v3 0/4] guest_memfd: Track amount of memory allocated on inode
@ 2026-03-09  9:53 Ackerley Tng
  2026-03-09  9:53 ` [PATCH RFC v3 1/4] KVM: " Ackerley Tng
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Ackerley Tng @ 2026-03-09  9:53 UTC (permalink / raw)
  To: Paolo Bonzini, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Matthew Wilcox (Oracle), Shuah Khan, Jonathan Corbet,
	Alexander Viro, Christian Brauner, Jan Kara, seanjc, rientjes,
	rick.p.edgecombe, yan.y.zhao, fvdl, jthoughton, vannapurve,
	shivankg, michael.roth, pratyush, pasha.tatashin, kalyazin, tabba,
	Vlastimil Babka
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-kselftest,
	linux-doc, Ackerley Tng

Hi,

Currently, guest_memfd doesn't update the inode's i_blocks or i_bytes at
all. Hence, st_blocks in the struct stat populated by a userspace fstat()
call on a guest_memfd is always 0. This patch series makes
guest_memfd track the amount of memory allocated on an inode, which
allows fstat() to report that amount accurately to userspace.

The inode's i_blocks and i_bytes fields are updated when a folio is
associated with or disassociated from the guest_memfd inode, i.e. at
allocation and truncation time respectively.
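
As a rough illustration of the accounting described above, here is a
userspace model of how a byte count could be folded into i_blocks and
i_bytes. It assumes the count is split into whole 512-byte blocks plus a
sub-block remainder; struct model_inode and the model_*() helpers are
hypothetical names for this sketch, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the two inode fields involved; not the kernel struct. */
struct model_inode {
	uint64_t i_blocks;	/* whole 512-byte blocks */
	uint16_t i_bytes;	/* remainder, always < 512 */
};

/* Account newly allocated bytes, e.g. when a folio is added. */
static void model_add_bytes(struct model_inode *inode, uint64_t bytes)
{
	inode->i_blocks += bytes >> 9;
	bytes &= 511;
	inode->i_bytes += bytes;
	if (inode->i_bytes >= 512) {
		inode->i_blocks++;
		inode->i_bytes -= 512;
	}
}

/* Undo the accounting, e.g. when a folio is truncated. */
static void model_sub_bytes(struct model_inode *inode, uint64_t bytes)
{
	inode->i_blocks -= bytes >> 9;
	bytes &= 511;
	if (inode->i_bytes < bytes) {
		inode->i_blocks--;
		inode->i_bytes += 512;
	}
	inode->i_bytes -= bytes;
}

/* st_blocks is reported to userspace in 512-byte units. */
static uint64_t model_st_blocks(const struct model_inode *inode)
{
	return inode->i_blocks;
}
```

In this model, adding one 4096-byte folio raises st_blocks by 8, and
subtracting the same folio at truncation brings it back down by 8.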

RFC v3 uses the .invalidate_folio() callback to update accounting in inode
fields at truncation time, and sets AS_RELEASE_ALWAYS for guest_memfd
mappings to enable .invalidate_folio() for guest_memfd.

RFC v3 series is based on kvm-x86/next.

+ RFC v2: Removed a full custom implementation of .evict_inode for
  guest_memfd in favor of adding .unaccount_folio callback.
  + https://lore.kernel.org/all/20260225-gmem-st-blocks-v2-0-87d7098119a9@google.com/T/
+ RFC v1: https://lore.kernel.org/all/cover.1771826352.git.ackerleytng@google.com/T/

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
Ackerley Tng (4):
      KVM: guest_memfd: Track amount of memory allocated on inode
      KVM: guest_memfd: Set release always on guest_memfd mappings
      KVM: selftests: Wrap fstat() to assert success
      KVM: selftests: Test that st_blocks is updated on allocation

 tools/testing/selftests/kvm/guest_memfd_test.c     | 32 +++++++++++++++-------
 tools/testing/selftests/kvm/include/kvm_syscalls.h |  2 ++
 virt/kvm/guest_memfd.c                             | 15 ++++++++++
 3 files changed, 39 insertions(+), 10 deletions(-)
---
base-commit: 5128b972fb2801ad9aca54d990a75611ab5283a9
change-id: 20260225-gmem-st-blocks-733f35d10211

Best regards,
--
Ackerley Tng <ackerleytng@google.com>


* [PATCH RFC v3 1/4] KVM: guest_memfd: Track amount of memory allocated on inode
  2026-03-09  9:53 [PATCH RFC v3 0/4] guest_memfd: Track amount of memory allocated on inode Ackerley Tng
@ 2026-03-09  9:53 ` Ackerley Tng
  2026-03-09 11:50   ` David Hildenbrand (Arm)
  2026-03-09  9:53 ` [PATCH RFC v3 2/4] KVM: guest_memfd: Set release always on guest_memfd mappings Ackerley Tng
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Ackerley Tng @ 2026-03-09  9:53 UTC (permalink / raw)
  To: Paolo Bonzini, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Matthew Wilcox (Oracle), Shuah Khan, Jonathan Corbet,
	Alexander Viro, Christian Brauner, Jan Kara, seanjc, rientjes,
	rick.p.edgecombe, yan.y.zhao, fvdl, jthoughton, vannapurve,
	shivankg, michael.roth, pratyush, pasha.tatashin, kalyazin, tabba,
	Vlastimil Babka
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-kselftest,
	linux-doc, Ackerley Tng

The guest memfd currently does not update the inode's i_blocks and i_bytes
count when memory is allocated or freed. Hence, st_blocks returned from
fstat() is always 0.

Introduce byte accounting for guest memfd inodes.  When a new folio is
added to the filemap, add the folio's size.  Use the .invalidate_folio()
callback to subtract the folio's size from inode fields when folios are
truncated and removed from the filemap.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 virt/kvm/guest_memfd.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 462c5c5cb602a..77219551056a7 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -136,6 +136,9 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 					 mapping_gfp_mask(inode->i_mapping), policy);
 	mpol_cond_put(policy);
 
+	if (!IS_ERR(folio))
+		inode_add_bytes(inode, folio_size(folio));
+
 	/*
 	 * External interfaces like kvm_gmem_get_pfn() support dealing
 	 * with hugepages to a degree, but internally, guest_memfd currently
@@ -532,10 +535,21 @@ static void kvm_gmem_free_folio(struct folio *folio)
 }
 #endif
 
+static void kvm_gmem_invalidate_folio(struct folio *folio, size_t offset, size_t len)
+{
+	size_t bytes = folio_size(folio);
+
+	WARN_ON_ONCE(offset);
+	WARN_ON_ONCE(len != bytes);
+
+	inode_sub_bytes(folio_inode(folio), bytes);
+}
+
 static const struct address_space_operations kvm_gmem_aops = {
 	.dirty_folio = noop_dirty_folio,
 	.migrate_folio	= kvm_gmem_migrate_folio,
 	.error_remove_folio = kvm_gmem_error_folio,
+	.invalidate_folio = kvm_gmem_invalidate_folio,
 #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
 	.free_folio = kvm_gmem_free_folio,
 #endif

-- 
2.53.0.473.g4a7958ca14-goog


* [PATCH RFC v3 2/4] KVM: guest_memfd: Set release always on guest_memfd mappings
  2026-03-09  9:53 [PATCH RFC v3 0/4] guest_memfd: Track amount of memory allocated on inode Ackerley Tng
  2026-03-09  9:53 ` [PATCH RFC v3 1/4] KVM: " Ackerley Tng
@ 2026-03-09  9:53 ` Ackerley Tng
  2026-03-09 11:42   ` Jan Kara
  2026-03-10  1:06   ` Sean Christopherson
  2026-03-09  9:53 ` [PATCH RFC v3 3/4] KVM: selftests: Wrap fstat() to assert success Ackerley Tng
  2026-03-09  9:53 ` [PATCH RFC v3 4/4] KVM: selftests: Test that st_blocks is updated on allocation Ackerley Tng
  3 siblings, 2 replies; 12+ messages in thread
From: Ackerley Tng @ 2026-03-09  9:53 UTC (permalink / raw)
  To: Paolo Bonzini, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Matthew Wilcox (Oracle), Shuah Khan, Jonathan Corbet,
	Alexander Viro, Christian Brauner, Jan Kara, seanjc, rientjes,
	rick.p.edgecombe, yan.y.zhao, fvdl, jthoughton, vannapurve,
	shivankg, michael.roth, pratyush, pasha.tatashin, kalyazin, tabba,
	Vlastimil Babka
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-kselftest,
	linux-doc, Ackerley Tng

Set release always on guest_memfd mappings to enable the use of
.invalidate_folio, which performs inode accounting for guest_memfd.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 virt/kvm/guest_memfd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 77219551056a7..8246b9fbcf832 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -607,6 +607,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	mapping_set_inaccessible(inode->i_mapping);
 	/* Unmovable mappings are supposed to be marked unevictable as well. */
 	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
+	mapping_set_release_always(inode->i_mapping);
 
 	GMEM_I(inode)->flags = flags;
 

-- 
2.53.0.473.g4a7958ca14-goog


* [PATCH RFC v3 3/4] KVM: selftests: Wrap fstat() to assert success
  2026-03-09  9:53 [PATCH RFC v3 0/4] guest_memfd: Track amount of memory allocated on inode Ackerley Tng
  2026-03-09  9:53 ` [PATCH RFC v3 1/4] KVM: " Ackerley Tng
  2026-03-09  9:53 ` [PATCH RFC v3 2/4] KVM: guest_memfd: Set release always on guest_memfd mappings Ackerley Tng
@ 2026-03-09  9:53 ` Ackerley Tng
  2026-03-09  9:53 ` [PATCH RFC v3 4/4] KVM: selftests: Test that st_blocks is updated on allocation Ackerley Tng
  3 siblings, 0 replies; 12+ messages in thread
From: Ackerley Tng @ 2026-03-09  9:53 UTC (permalink / raw)
  To: Paolo Bonzini, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Matthew Wilcox (Oracle), Shuah Khan, Jonathan Corbet,
	Alexander Viro, Christian Brauner, Jan Kara, seanjc, rientjes,
	rick.p.edgecombe, yan.y.zhao, fvdl, jthoughton, vannapurve,
	shivankg, michael.roth, pratyush, pasha.tatashin, kalyazin, tabba,
	Vlastimil Babka
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-kselftest,
	linux-doc, Ackerley Tng

Extend kvm_syscalls.h to wrap fstat() to assert success. This will be used
in the next patch.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 tools/testing/selftests/kvm/guest_memfd_test.c     | 15 +++++----------
 tools/testing/selftests/kvm/include/kvm_syscalls.h |  2 ++
 2 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index ec7644aae999d..638906298ed73 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -270,10 +270,8 @@ static void test_mmap_not_supported(int fd, size_t total_size)
 static void test_file_size(int fd, size_t total_size)
 {
 	struct stat sb;
-	int ret;
 
-	ret = fstat(fd, &sb);
-	TEST_ASSERT(!ret, "fstat should succeed");
+	kvm_fstat(fd, &sb);
 	TEST_ASSERT_EQ(sb.st_size, total_size);
 	TEST_ASSERT_EQ(sb.st_blksize, page_size);
 }
@@ -361,25 +359,22 @@ static void test_create_guest_memfd_invalid_sizes(struct kvm_vm *vm,
 
 static void test_create_guest_memfd_multiple(struct kvm_vm *vm)
 {
-	int fd1, fd2, ret;
+	int fd1, fd2;
 	struct stat st1, st2;
 
 	fd1 = __vm_create_guest_memfd(vm, page_size, 0);
 	TEST_ASSERT(fd1 != -1, "memfd creation should succeed");
 
-	ret = fstat(fd1, &st1);
-	TEST_ASSERT(ret != -1, "memfd fstat should succeed");
+	kvm_fstat(fd1, &st1);
 	TEST_ASSERT(st1.st_size == page_size, "memfd st_size should match requested size");
 
 	fd2 = __vm_create_guest_memfd(vm, page_size * 2, 0);
 	TEST_ASSERT(fd2 != -1, "memfd creation should succeed");
 
-	ret = fstat(fd2, &st2);
-	TEST_ASSERT(ret != -1, "memfd fstat should succeed");
+	kvm_fstat(fd2, &st2);
 	TEST_ASSERT(st2.st_size == page_size * 2, "second memfd st_size should match requested size");
 
-	ret = fstat(fd1, &st1);
-	TEST_ASSERT(ret != -1, "memfd fstat should succeed");
+	kvm_fstat(fd1, &st1);
 	TEST_ASSERT(st1.st_size == page_size, "first memfd st_size should still match requested size");
 	TEST_ASSERT(st1.st_ino != st2.st_ino, "different memfd should have different inode numbers");
 
diff --git a/tools/testing/selftests/kvm/include/kvm_syscalls.h b/tools/testing/selftests/kvm/include/kvm_syscalls.h
index 843c9904c46f6..2266c06347f5d 100644
--- a/tools/testing/selftests/kvm/include/kvm_syscalls.h
+++ b/tools/testing/selftests/kvm/include/kvm_syscalls.h
@@ -2,6 +2,7 @@
 #ifndef SELFTEST_KVM_SYSCALLS_H
 #define SELFTEST_KVM_SYSCALLS_H
 
+#include <sys/stat.h>
 #include <sys/syscall.h>
 
 #define MAP_ARGS0(m,...)
@@ -78,5 +79,6 @@ __KVM_SYSCALL_DEFINE(close, 1, int, fd);
 __KVM_SYSCALL_DEFINE(fallocate, 4, int, fd, int, mode, loff_t, offset, loff_t, len);
 __KVM_SYSCALL_DEFINE(ftruncate, 2, unsigned int, fd, off_t, length);
 __KVM_SYSCALL_DEFINE(madvise, 3, void *, addr, size_t, length, int, advice);
+__KVM_SYSCALL_DEFINE(fstat, 2, int, fd, struct stat *, buf);
 
 #endif /* SELFTEST_KVM_SYSCALLS_H */

-- 
2.53.0.473.g4a7958ca14-goog


* [PATCH RFC v3 4/4] KVM: selftests: Test that st_blocks is updated on allocation
  2026-03-09  9:53 [PATCH RFC v3 0/4] guest_memfd: Track amount of memory allocated on inode Ackerley Tng
                   ` (2 preceding siblings ...)
  2026-03-09  9:53 ` [PATCH RFC v3 3/4] KVM: selftests: Wrap fstat() to assert success Ackerley Tng
@ 2026-03-09  9:53 ` Ackerley Tng
  3 siblings, 0 replies; 12+ messages in thread
From: Ackerley Tng @ 2026-03-09  9:53 UTC (permalink / raw)
  To: Paolo Bonzini, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Matthew Wilcox (Oracle), Shuah Khan, Jonathan Corbet,
	Alexander Viro, Christian Brauner, Jan Kara, seanjc, rientjes,
	rick.p.edgecombe, yan.y.zhao, fvdl, jthoughton, vannapurve,
	shivankg, michael.roth, pratyush, pasha.tatashin, kalyazin, tabba,
	Vlastimil Babka
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-kselftest,
	linux-doc, Ackerley Tng

The st_blocks field reported by fstat should reflect the number of
allocated 512-byte blocks for the guest memfd file.

Extend the fallocate test to verify that st_blocks is correctly updated
when memory is allocated or deallocated via
fallocate(FALLOC_FL_PUNCH_HOLE).

Add checks after each fallocate call to ensure that st_blocks increases on
allocation, decreases when a hole is punched, and is restored when the hole
is re-allocated. Also verify that st_blocks remains unchanged for failing
fallocate calls.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 tools/testing/selftests/kvm/guest_memfd_test.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index 638906298ed73..3381a556f397d 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -276,41 +276,58 @@ static void test_file_size(int fd, size_t total_size)
 	TEST_ASSERT_EQ(sb.st_blksize, page_size);
 }
 
+static void assert_st_blocks_matches_size(int fd, size_t expected_size)
+{
+	struct stat sb;
+
+	kvm_fstat(fd, &sb);
+	TEST_ASSERT_EQ(sb.st_blocks, expected_size / 512);
+}
+
 static void test_fallocate(int fd, size_t total_size)
 {
 	int ret;
 
 	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, total_size);
 	TEST_ASSERT(!ret, "fallocate with aligned offset and size should succeed");
+	assert_st_blocks_matches_size(fd, total_size);
 
 	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
 			page_size - 1, page_size);
 	TEST_ASSERT(ret, "fallocate with unaligned offset should fail");
+	assert_st_blocks_matches_size(fd, total_size);
 
 	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size, page_size);
 	TEST_ASSERT(ret, "fallocate beginning at total_size should fail");
+	assert_st_blocks_matches_size(fd, total_size);
 
 	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size + page_size, page_size);
 	TEST_ASSERT(ret, "fallocate beginning after total_size should fail");
+	assert_st_blocks_matches_size(fd, total_size);
 
 	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
 			total_size, page_size);
 	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) at total_size should succeed");
+	assert_st_blocks_matches_size(fd, total_size);
 
 	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
 			total_size + page_size, page_size);
 	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) after total_size should succeed");
+	assert_st_blocks_matches_size(fd, total_size);
 
 	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
 			page_size, page_size - 1);
 	TEST_ASSERT(ret, "fallocate with unaligned size should fail");
+	assert_st_blocks_matches_size(fd, total_size);
 
 	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
 			page_size, page_size);
 	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) with aligned offset and size should succeed");
+	assert_st_blocks_matches_size(fd, total_size - page_size);
 
 	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, page_size, page_size);
 	TEST_ASSERT(!ret, "fallocate to restore punched hole should succeed");
+	assert_st_blocks_matches_size(fd, total_size);
 }
 
 static void test_invalid_punch_hole(int fd, size_t total_size)

-- 
2.53.0.473.g4a7958ca14-goog


* Re: [PATCH RFC v3 2/4] KVM: guest_memfd: Set release always on guest_memfd mappings
  2026-03-09  9:53 ` [PATCH RFC v3 2/4] KVM: guest_memfd: Set release always on guest_memfd mappings Ackerley Tng
@ 2026-03-09 11:42   ` Jan Kara
  2026-03-10  1:06   ` Sean Christopherson
  1 sibling, 0 replies; 12+ messages in thread
From: Jan Kara @ 2026-03-09 11:42 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Paolo Bonzini, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Matthew Wilcox (Oracle), Shuah Khan, Jonathan Corbet,
	Alexander Viro, Christian Brauner, Jan Kara, seanjc, rientjes,
	rick.p.edgecombe, yan.y.zhao, fvdl, jthoughton, vannapurve,
	shivankg, michael.roth, pratyush, pasha.tatashin, kalyazin, tabba,
	Vlastimil Babka, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-kselftest, linux-doc

On Mon 09-03-26 09:53:53, Ackerley Tng wrote:
> Set release always on guest_memfd mappings to enable the use of
> .invalidate_folio, which performs inode accounting for guest_memfd.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

I'd fold this into the previous patch because that makes sense only with
this patch in place. Otherwise feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

for the first two patches.

								Honza

> ---
>  virt/kvm/guest_memfd.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 77219551056a7..8246b9fbcf832 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -607,6 +607,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>  	mapping_set_inaccessible(inode->i_mapping);
>  	/* Unmovable mappings are supposed to be marked unevictable as well. */
>  	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> +	mapping_set_release_always(inode->i_mapping);
>  
>  	GMEM_I(inode)->flags = flags;
>  
> 
> -- 
> 2.53.0.473.g4a7958ca14-goog
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH RFC v3 1/4] KVM: guest_memfd: Track amount of memory allocated on inode
  2026-03-09  9:53 ` [PATCH RFC v3 1/4] KVM: " Ackerley Tng
@ 2026-03-09 11:50   ` David Hildenbrand (Arm)
  2026-03-09 15:45     ` Ackerley Tng
  0 siblings, 1 reply; 12+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-09 11:50 UTC (permalink / raw)
  To: Ackerley Tng, Paolo Bonzini, Andrew Morton, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Matthew Wilcox (Oracle), Shuah Khan, Jonathan Corbet,
	Alexander Viro, Christian Brauner, Jan Kara, seanjc, rientjes,
	rick.p.edgecombe, yan.y.zhao, fvdl, jthoughton, vannapurve,
	shivankg, michael.roth, pratyush, pasha.tatashin, kalyazin, tabba,
	Vlastimil Babka
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-kselftest,
	linux-doc

On 3/9/26 10:53, Ackerley Tng wrote:
> The guest memfd currently does not update the inode's i_blocks and i_bytes
> count when memory is allocated or freed. Hence, st_blocks returned from
> fstat() is always 0.
> 
> Introduce byte accounting for guest memfd inodes.  When a new folio is
> added to the filemap, add the folio's size.  Use the .invalidate_folio()
> callback to subtract the folio's size from inode fields when folios are
> truncated and removed from the filemap.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  virt/kvm/guest_memfd.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 462c5c5cb602a..77219551056a7 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -136,6 +136,9 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>  					 mapping_gfp_mask(inode->i_mapping), policy);
>  	mpol_cond_put(policy);
>  
> +	if (!IS_ERR(folio))
> +		inode_add_bytes(inode, folio_size(folio));
> +

Can't we have two concurrent calls to __filemap_get_folio_mpol(), and we
don't really know whether our call allocated the folio or simply found
one (the other caller allocated) in the pagecache?

-- 
Cheers,

David

* Re: [PATCH RFC v3 1/4] KVM: guest_memfd: Track amount of memory allocated on inode
  2026-03-09 11:50   ` David Hildenbrand (Arm)
@ 2026-03-09 15:45     ` Ackerley Tng
  2026-03-09 20:14       ` Sean Christopherson
  0 siblings, 1 reply; 12+ messages in thread
From: Ackerley Tng @ 2026-03-09 15:45 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Paolo Bonzini, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Matthew Wilcox (Oracle),
	Shuah Khan, Jonathan Corbet, Alexander Viro, Christian Brauner,
	Jan Kara, seanjc, rientjes, rick.p.edgecombe, yan.y.zhao, fvdl,
	jthoughton, vannapurve, shivankg, michael.roth, pratyush,
	pasha.tatashin, kalyazin, tabba, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-mm, linux-fsdevel, linux-kselftest,
	linux-doc

"David Hildenbrand (Arm)" <david@kernel.org> writes:

> On 3/9/26 10:53, Ackerley Tng wrote:
>> The guest memfd currently does not update the inode's i_blocks and i_bytes
>> count when memory is allocated or freed. Hence, st_blocks returned from
>> fstat() is always 0.
>>
>> Introduce byte accounting for guest memfd inodes.  When a new folio is
>> added to the filemap, add the folio's size.  Use the .invalidate_folio()
>> callback to subtract the folio's size from inode fields when folios are
>> truncated and removed from the filemap.
>>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> ---
>>  virt/kvm/guest_memfd.c | 14 ++++++++++++++
>>  1 file changed, 14 insertions(+)
>>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 462c5c5cb602a..77219551056a7 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -136,6 +136,9 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>>  					 mapping_gfp_mask(inode->i_mapping), policy);
>>  	mpol_cond_put(policy);
>>
>> +	if (!IS_ERR(folio))
>> +		inode_add_bytes(inode, folio_size(folio));
>> +
>
> Can't we have two concurrent calls to __filemap_get_folio_mpol(), and we
> don't really know whether our call allocated the folio or simply found
> one (the other caller allocated) in the pagecache?
>

Ah that is true. Two threads can get past filemap_lock_folio(), then get
to __filemap_get_folio_mpol(), and then thread 1 will return from
__filemap_get_folio_mpol() with an allocated folio while thread 2
returns with the folio allocated by thread 1. Both threads would end up
incrementing the number of bytes in the inode.
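
The double count described above can be modeled in userspace. This is a
minimal sequential sketch of the outcome of that interleaving, not real
threads and not kernel code; struct model_mapping and model_get_folio()
are hypothetical names, and the 4096-byte folio size is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_FOLIO_SIZE 4096u

/* One-slot stand-in for the page cache at a single index. */
struct model_mapping {
	bool present;		/* a folio is already in the cache */
	size_t naive_bytes;	/* accounted on every successful lookup */
	size_t exact_bytes;	/* accounted only on insertion */
};

/*
 * Find-or-create: returns true if this call inserted the folio, false if it
 * found one that another caller had already inserted.
 */
static bool model_get_folio(struct model_mapping *m)
{
	bool inserted = !m->present;

	m->present = true;

	/* Counting whenever a folio comes back, as in the hunk under review. */
	m->naive_bytes += MODEL_FOLIO_SIZE;

	/* Counting only when this caller actually added the folio. */
	if (inserted)
		m->exact_bytes += MODEL_FOLIO_SIZE;

	return inserted;
}
```

Calling model_get_folio() twice on the same mapping leaves naive_bytes at
twice the folio size while exact_bytes stays at one folio, which is the
discrepancy that accounting at insertion time would avoid.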

Sean, Vlastimil, is this a good argument for open coding, like in RFC v2
[1]? So that guest_memfd can do inode_add_bytes() specifically when the
folio is added to the filemap.

An alternative I can think of is to add a callback that is called from
within __filemap_add_folio(). Would that be preferred?

[1] https://lore.kernel.org/all/20260225-gmem-st-blocks-v2-2-87d7098119a9@google.com/

> --
> Cheers,
>
> David

* Re: [PATCH RFC v3 1/4] KVM: guest_memfd: Track amount of memory allocated on inode
  2026-03-09 15:45     ` Ackerley Tng
@ 2026-03-09 20:14       ` Sean Christopherson
  0 siblings, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2026-03-09 20:14 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: David Hildenbrand (Arm), Paolo Bonzini, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Matthew Wilcox (Oracle),
	Shuah Khan, Jonathan Corbet, Alexander Viro, Christian Brauner,
	Jan Kara, rientjes, rick.p.edgecombe, yan.y.zhao, fvdl,
	jthoughton, vannapurve, shivankg, michael.roth, pratyush,
	pasha.tatashin, kalyazin, tabba, Vlastimil Babka, kvm,
	linux-kernel, linux-mm, linux-fsdevel, linux-kselftest, linux-doc

On Mon, Mar 09, 2026, Ackerley Tng wrote:
> "David Hildenbrand (Arm)" <david@kernel.org> writes:
> 
> > On 3/9/26 10:53, Ackerley Tng wrote:
> >> The guest memfd currently does not update the inode's i_blocks and i_bytes
> >> count when memory is allocated or freed. Hence, st_blocks returned from
> >> fstat() is always 0.
> >>
> >> Introduce byte accounting for guest memfd inodes.  When a new folio is
> >> added to the filemap, add the folio's size.  Use the .invalidate_folio()
> >> callback to subtract the folio's size from inode fields when folios are
> >> truncated and removed from the filemap.
> >>
> >> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> >> ---
> >>  virt/kvm/guest_memfd.c | 14 ++++++++++++++
> >>  1 file changed, 14 insertions(+)
> >>
> >> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> >> index 462c5c5cb602a..77219551056a7 100644
> >> --- a/virt/kvm/guest_memfd.c
> >> +++ b/virt/kvm/guest_memfd.c
> >> @@ -136,6 +136,9 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
> >>  					 mapping_gfp_mask(inode->i_mapping), policy);
> >>  	mpol_cond_put(policy);
> >>
> >> +	if (!IS_ERR(folio))
> >> +		inode_add_bytes(inode, folio_size(folio));
> >> +
> >
> > Can't we have two concurrent calls to __filemap_get_folio_mpol(), and we
> > don't really know whether our call allocated the folio or simply found
> > one (the other caller allocated) in the pagecache?
> >
> 
> Ah that is true. Two threads can get past filemap_lock_folio(), then get
> to __filemap_get_folio_mpol(), and then thread 1 will return from
> __filemap_get_folio_mpol() with an allocated folio while thread 2
> returns with the folio allocated by thread 1. Both threads would end up
> incrementing the number of bytes in the inode.
> 
> Sean, Vlastimil, is this a good argument for open coding, like in RFC v2
> [1]? So that guest_memfd can do inode_add_bytes() specifically when the
> folio is added to the filemap.

Heh, I assumed that was going to be _the_ argument, i.e. I was expecting the answer
to my implicit question of "if this greatly simplifies accounting" to be
"trying to do the right thing while using __filemap_get_folio_mpol() is insane".

> An alternative I can think of is to add a callback that is called from
> within __filemap_add_folio(). Would that be preferred?

Probably not.  Poking around, it definitely seems like guest_memfd is the oddball.
E.g. as David pointed out, even shmem participates in disk quota stuff, and HugeTLB
is its own beast.  In other words, I doubt any "real" filesystem will want to hook
__filemap_add_folio() in this way.

So as I said before, "if this greatly simplifies accounting, then I'm ok with it".
And it sounds like the answer is an emphatic "yes".  And again as I said before,
all I ask at this point is that the refactoring changelog focuses on that point.

P.S. In future versions, please explain _why_ you want to add fstat() support,
i.e. why you want to account allocated bytes/folios.  For folks like me that do
very little userspace programming, and even less filesystems work, fstat() not
working means nothing.  Even if the answer is "because literally every other FS
in Linux works".

* Re: [PATCH RFC v3 2/4] KVM: guest_memfd: Set release always on guest_memfd mappings
  2026-03-09  9:53 ` [PATCH RFC v3 2/4] KVM: guest_memfd: Set release always on guest_memfd mappings Ackerley Tng
  2026-03-09 11:42   ` Jan Kara
@ 2026-03-10  1:06   ` Sean Christopherson
  2026-03-10  9:12     ` Ackerley Tng
  1 sibling, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2026-03-10  1:06 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Paolo Bonzini, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Matthew Wilcox (Oracle), Shuah Khan, Jonathan Corbet,
	Alexander Viro, Christian Brauner, Jan Kara, rientjes,
	rick.p.edgecombe, yan.y.zhao, fvdl, jthoughton, vannapurve,
	shivankg, michael.roth, pratyush, pasha.tatashin, kalyazin, tabba,
	Vlastimil Babka, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-kselftest, linux-doc

On Mon, Mar 09, 2026, Ackerley Tng wrote:
> Set release always on guest_memfd mappings to enable the use of
> .invalidate_folio, which performs inode accounting for guest_memfd.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  virt/kvm/guest_memfd.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 77219551056a7..8246b9fbcf832 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -607,6 +607,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>  	mapping_set_inaccessible(inode->i_mapping);
>  	/* Unmovable mappings are supposed to be marked unevictable as well. */
>  	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> +	mapping_set_release_always(inode->i_mapping);

*sigh*

So... an internal AI review bot flagged setting AS_RELEASE_ALWAYS as being
potentially problematic, and I started poking around, mostly because I was
curious.  I'm pretty sure the exact scenario painted by the bot isn't possible,
but I do think a similar issue exists in at least truncate_error_folio().  Or at
least, *should* exist, but doesn't because of a different bug.

On memory error, kvm_gmem_error_folio() will get invoked via this code.  Note
the "err != 0" check.  kvm_gmem_error_folio() returns MF_DELAYED, which has an
arbitrary value of '2', and so KVM is always signalling "failure".

		int err = mapping->a_ops->error_remove_folio(mapping, folio);

		if (err != 0)
			pr_info("%#lx: Failed to punch page: %d\n", pfn, err);
		else if (!filemap_release_folio(folio, GFP_NOIO))
			pr_info("%#lx: failed to release buffers\n", pfn);

I _think_ that's bad?  On x86, if I'm following the breadcrumbs correctly, we'll
end up in this code in kill_me_maybe()

	pr_err("Memory error not recovered");
	kill_me_now(cb);

and send what I assume is a relatively useless SIGBUS and likely kill the VM.

	struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);

	p->mce_count = 0;
	force_sig(SIGBUS);

But even if that's somehow the "right" behavior, we're doing it purely by
accident.
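
The return-value mismatch described above can be illustrated with a small
userspace model. Only the fact that MF_DELAYED has the value 2 is taken
from this mail; the enum below and model_punch_failed() are hypothetical
stand-ins for this sketch, and the order of the other enumerators is an
assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the result codes; only the value 2 comes from the thread. */
enum model_mf_result {
	MODEL_MF_IGNORED,
	MODEL_MF_FAILED,
	MODEL_MF_DELAYED,	/* == 2 */
	MODEL_MF_RECOVERED,
};

/* Same shape as the "err != 0" test quoted above: anything nonzero is a failure. */
static bool model_punch_failed(int err)
{
	return err != 0;
}
```

With this check, a callback that returns MODEL_MF_DELAYED is reported as a
failure, and only a return of 0 would reach the release path.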

As for this patch, if we fix that bug by returning 0, then filemap_release_folio()
is definitely reachable by at least one flow, so I think guest_memfd also needs
to implement release_folio()?

Full AI bot text:
--
Setting the AS_RELEASE_ALWAYS flag causes folio_needs_release() to return
true. This correctly triggers .invalidate_folio during truncation, but does
it also unintentionally expose guest_memfd folios to eviction via
posix_fadvise(POSIX_FADV_DONTNEED)?

If userspace calls posix_fadvise() on a guest_memfd file, the core mm
calls mapping_evict_folio(). Because folio_needs_release() is true, it
calls filemap_release_folio().

Since guest_memfd does not implement a .release_folio address space
operation, filemap_release_folio() falls back to calling
try_to_free_buffers(). Could this fallback cause a warning?

fs/buffer.c:try_to_free_buffers() {
	...
	/* Misconfigured folio check */
	if (WARN_ON_ONCE(!folio_buffers(folio)))
		return true;
	...
}

Because the guest_memfd folio has no private data, folio_buffers()
is NULL, which will trigger this WARN_ON_ONCE.

Furthermore, try_to_free_buffers() returns true, allowing the folio to be
removed from the page cache. Because this eviction path bypasses
truncate_cleanup_folio(), it never calls .invalidate_folio.

Does this mean inode_sub_bytes() is skipped, leaking the inode block
accounting?

Userspace could potentially trigger the warning and infinitely inflate the
inode's block count with:

    struct kvm_create_guest_memfd args = { .size = 4096 };
    int fd = ioctl(kvm_vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
    fallocate(fd, 0, 0, 4096);
    posix_fadvise(fd, 0, 4096, POSIX_FADV_DONTNEED);

Should guest_memfd implement a .release_folio callback that simply
returns false to prevent these folios from being evicted?
--


* Re: [PATCH RFC v3 2/4] KVM: guest_memfd: Set release always on guest_memfd mappings
  2026-03-10  1:06   ` Sean Christopherson
@ 2026-03-10  9:12     ` Ackerley Tng
  2026-03-12 19:00       ` Sean Christopherson
  0 siblings, 1 reply; 12+ messages in thread
From: Ackerley Tng @ 2026-03-10  9:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Matthew Wilcox (Oracle), Shuah Khan, Jonathan Corbet,
	Alexander Viro, Christian Brauner, Jan Kara, rientjes,
	rick.p.edgecombe, yan.y.zhao, fvdl, jthoughton, vannapurve,
	shivankg, michael.roth, pratyush, pasha.tatashin, kalyazin, tabba,
	Vlastimil Babka, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-kselftest, linux-doc, Lisa Wang, Kalyazin, Nikita

Sean Christopherson <seanjc@google.com> writes:

> On Mon, Mar 09, 2026, Ackerley Tng wrote:
>> Set release always on guest_memfd mappings to enable the use of
>> .invalidate_folio, which performs inode accounting for guest_memfd.
>>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> ---
>>  virt/kvm/guest_memfd.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 77219551056a7..8246b9fbcf832 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -607,6 +607,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>>  	mapping_set_inaccessible(inode->i_mapping);
>>  	/* Unmovable mappings are supposed to be marked unevictable as well. */
>>  	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>> +	mapping_set_release_always(inode->i_mapping);
>
> *sigh*
>
> So... an internal AI review bot flagged setting AS_RELEASE_ALWAYS as being
> potentially problematic, and I started poking around, mostly because I was
> curious.  I'm pretty sure the exact scenario painted by the bot isn't possible,
> but I do think a similar issue exists in at least truncate_error_folio().  Or at
> least, *should* exist, but doesn't because of a different bug.
>
> On memory error, kvm_gmem_error_folio() will get invoked via this code.  Note
> the "err != 0" check.  kvm_gmem_error_folio() returns MF_DELAYED, which has an
> arbitrary value of '2', and so KVM is always signalling "failure".
>
> 		int err = mapping->a_ops->error_remove_folio(mapping, folio);
>
> 		if (err != 0)
> 			pr_info("%#lx: Failed to punch page: %d\n", pfn, err);
> 		else if (!filemap_release_folio(folio, GFP_NOIO))
> 			pr_info("%#lx: failed to release buffers\n", pfn);
>
> I _think_ that's bad?  On x86, if I'm following the breadcrumbs correctly, we'll
> end up in this code in kill_me_maybe():
>
> 	pr_err("Memory error not recovered");
> 	kill_me_now(cb);
>
> and send what I assume is a relatively useless SIGBUS and likely kill the VM.
>
> 	struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
>
> 	p->mce_count = 0;
> 	force_sig(SIGBUS);
>

Glad you agree that's bad :). We reported this earlier and Lisa has been
working on a fix. RFC v1 is here [1].

RFC v2 (Lisa is about to post it) will address David's comments by
aligning shmem to also return MF_DELAYED like guest_memfd, then handling
MF_DELAYED returned from .error_remove_folio() accordingly so that it's
not interpreted as a memory failure handling failure.

[1] https://lore.kernel.org/all/cover.1760551864.git.wyihan@google.com/#r
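
To illustrate why MF_DELAYED is currently read as a failure, here is a toy
userspace model of the "err != 0" check. It is only a sketch: the enum copies
the enumerator order of the kernel's enum mf_result, while both helper names
and the "delayed aware" variant are hypothetical and are not taken from the
posted fix.

```c
#include <assert.h>
#include <stdbool.h>

/* Same enumerator order as the kernel's enum mf_result, so MF_DELAYED is 2. */
enum mf_result { MF_IGNORED, MF_FAILED, MF_DELAYED, MF_RECOVERED };

static_assert(MF_DELAYED == 2, "MF_DELAYED is expected to be 2");

/* Today's check in truncate_error_folio(): any nonzero value is a failure. */
static bool punch_failed_today(int err)
{
	return err != 0;
}

/*
 * Hypothetical variant, NOT the posted fix: treat MF_DELAYED as "will be
 * handled later" instead of as a failed punch.
 */
static bool punch_failed_delayed_aware(int err)
{
	return err != 0 && err != MF_DELAYED;
}
```

With the current check, punch_failed_today(MF_DELAYED) is true, so the
"Failed to punch page" branch is taken and filemap_release_folio() is never
reached. In the variant, MF_DELAYED no longer counts as a failure, while
MF_FAILED still does.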

> But even if that's somehow the "right" behavior, we're doing it purely by
> accident.
>
> As for this patch, if we fix that bug by returning 0, then filemap_release_folio()
> is definitely reachable by at least one flow, so I think guest_memfd also needs
> to implement release_folio()?
>

Is posix_fadvise() the one flow you're talking about?

It indeed calls filemap_release_folio() through mapping_try_invalidate()
-> mapping_evict_folio() -> filemap_release_folio().

From Documentation/filesystems/locking.rst:

  ->release_folio() is called when the MM wants to make a change to the
  folio that would invalidate the filesystem's private data.  For example,
  it may be about to be removed from the address_space or split.  The folio
  is locked and not under writeback.  It may be dirty.  The gfp parameter
  is not usually used for allocation, but rather to indicate what the
  filesystem may do to attempt to free the private data.  The filesystem may
  return false to indicate that the folio's private data cannot be freed.
  If it returns true, it should have already removed the private data from
  the folio.  If a filesystem does not provide a ->release_folio method,
  the pagecache will assume that private data is buffer_heads and call
  try_to_free_buffers().

I could implement .release_folio().

Returning false seems like the easier solution, and is kind of in line
with the documentation above. A guest_memfd folio does not have private
data, so without private data, the private data cannot be freed.

(Took me a while to notice that having private data is not the same
as having something in folio->private, so this doesn't change even after
the direct map removal series lands.)

Returning false is going to break shrink_folio_list(), but that probably
won't affect guest_memfd for now.

Returning false also breaks page_cache_pipe_buf_try_steal(). Does anyone
more familiar with splicing know if that could affect guest_memfd?

Returning true could also work, to indicate that the folio's private
data has been "removed". I'd also have to do inode_sub_bytes() in
.release_folio() then, since in mapping_evict_folio(), remove_mapping()
doesn't call .invalidate_folio().

Then we will have to separately ensure that in truncate_error_folio(),
guest_memfd doesn't double-deduct the folio's size from the inode. This
should be semantically correct though, since IIUC .invalidate_folio() is
when a folio is removed (clean or dirty), but .release_folio() is only
for clean folios. If .error_remove_folio() returns MF_DELAYED, the
truncation didn't happen and so there should be no call to
.release_folio().
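
To make the options above concrete, here is a toy userspace model of the
accounting. It is a sketch and not kernel code: struct toy_inode, the
toy_release policies and every function name are hypothetical, and it says
nothing about whether the eviction path is actually reachable for guest_memfd.
It only shows what happens to the byte count if a folio leaves the page cache
without .invalidate_folio() being called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the inode's i_blocks/i_bytes accounting. */
struct toy_inode {
	size_t bytes;
};

/* Hypothetical policies for what the release step does. */
enum toy_release {
	TOY_FALLBACK_TRUE,	/* no callback: release "succeeds", nothing unaccounted */
	TOY_RETURN_FALSE,	/* callback refuses, folio stays in the page cache */
	TOY_RETURN_TRUE_SUB,	/* callback unaccounts, then allows removal */
};

static void toy_add_bytes(struct toy_inode *inode, size_t bytes)
{
	inode->bytes += bytes;
}

static void toy_sub_bytes(struct toy_inode *inode, size_t bytes)
{
	assert(inode->bytes >= bytes);	/* would catch a double deduction */
	inode->bytes -= bytes;
}

/*
 * Models eviction, where only the release step runs and .invalidate_folio()
 * is never called.  Returns true if the folio left the page cache.
 */
static bool toy_evict(struct toy_inode *inode, size_t folio_size,
		      enum toy_release policy)
{
	switch (policy) {
	case TOY_RETURN_FALSE:
		return false;
	case TOY_RETURN_TRUE_SUB:
		toy_sub_bytes(inode, folio_size);
		return true;
	case TOY_FALLBACK_TRUE:
	default:
		return true;
	}
}
```

After toy_add_bytes(&inode, 4096), evicting with TOY_FALLBACK_TRUE removes the
folio but leaves 4096 bytes accounted, which is the leak. TOY_RETURN_FALSE
keeps both the folio and the count, and TOY_RETURN_TRUE_SUB drops both. The
assert in toy_sub_bytes() is where a double deduction would show up in this
model.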

>
>
> Full AI bot text:
> --
> Setting the AS_RELEASE_ALWAYS flag causes folio_needs_release() to return
> true. This correctly triggers .invalidate_folio during truncation, but does
> it also unintentionally expose guest_memfd folios to eviction via
> posix_fadvise(POSIX_FADV_DONTNEED)?
>
> If userspace calls posix_fadvise() on a guest_memfd file, the core mm
> calls mapping_evict_folio(). Because folio_needs_release() is true, it
> calls filemap_release_folio().
>
> Since guest_memfd does not implement a .release_folio address space
> operation, filemap_release_folio() falls back to calling
> try_to_free_buffers(). Could this fallback cause a warning?
>
> fs/buffer.c:try_to_free_buffers() {
> 	...
> 	/* Misconfigured folio check */
> 	if (WARN_ON_ONCE(!folio_buffers(folio)))
> 		return true;
> 	...
> }
>
> Because the guest_memfd folio has no private data, folio_buffers()
> is NULL, which will trigger this WARN_ON_ONCE.
>
> Furthermore, try_to_free_buffers() returns true, allowing the folio to be
> removed from the page cache. Because this eviction path bypasses
> truncate_cleanup_folio(), it never calls .invalidate_folio.
>
> Does this mean inode_sub_bytes() is skipped, leaking the inode block
> accounting?
>
> Userspace could potentially trigger the warning and infinitely inflate the
> inode's block count with:
>     struct kvm_create_guest_memfd args = { .size = 4096 };
>     int fd = ioctl(kvm_vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
>     fallocate(fd, 0, 0, 4096);
>     posix_fadvise(fd, 0, 4096, POSIX_FADV_DONTNEED);
> Should guest_memfd implement a .release_folio callback that simply
> returns false to prevent these folios from being evicted?
> --


* Re: [PATCH RFC v3 2/4] KVM: guest_memfd: Set release always on guest_memfd mappings
  2026-03-10  9:12     ` Ackerley Tng
@ 2026-03-12 19:00       ` Sean Christopherson
  0 siblings, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2026-03-12 19:00 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Paolo Bonzini, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Matthew Wilcox (Oracle), Shuah Khan, Jonathan Corbet,
	Alexander Viro, Christian Brauner, Jan Kara, rientjes,
	rick.p.edgecombe, yan.y.zhao, fvdl, jthoughton, vannapurve,
	shivankg, michael.roth, pratyush, pasha.tatashin, kalyazin, tabba,
	Vlastimil Babka, kvm, linux-kernel, linux-mm, linux-fsdevel,
	linux-kselftest, linux-doc, Lisa Wang, Nikita Kalyazin

On Tue, Mar 10, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> > On Mon, Mar 09, 2026, Ackerley Tng wrote:
> > But even if that's somehow the "right" behavior, we're doing it purely by
> > accident.
> >
> > As for this patch, if we fix that bug by returning 0, then filemap_release_folio()
> > is definitely reachable by at least one flow, so I think guest_memfd also needs
> > to implement release_folio()?
> >
> 
> Is posix_fadvise() the one flow you're talking about?

No, I'm saying if we fix the memory error case, then filemap_release_folio()
likely becomes reachable.  Though there may be other cases.

> It indeed calls filemap_release_folio() through mapping_try_invalidate()
> -> mapping_evict_folio() -> filemap_release_folio().
> 
> From Documentation/filesystems/locking.rst:
> 
>   ->release_folio() is called when the MM wants to make a change to the
>   folio that would invalidate the filesystem's private data.  For example,
>   it may be about to be removed from the address_space or split.  The folio
>   is locked and not under writeback.  It may be dirty.  The gfp parameter
>   is not usually used for allocation, but rather to indicate what the
>   filesystem may do to attempt to free the private data.  The filesystem may
>   return false to indicate that the folio's private data cannot be freed.
>   If it returns true, it should have already removed the private data from
>   the folio.  If a filesystem does not provide a ->release_folio method,
>   the pagecache will assume that private data is buffer_heads and call
>   try_to_free_buffers().
> 
> I could implement .release_folio().
> 
> Returning false seems like the easier solution, and is kind of in line
> with the documentation above. A guest_memfd folio does not have private
> data, so without private data, the private data cannot be freed.

Eh, not really.  If there's no private data, then freeing it always succeeds.

> (Took me a while to notice that having private data is not the same
> as having something in folio->private, so this doesn't change even after
> the direct map removal series lands.)
> 
> Returning false is going to break shrink_folio_list(), but that probably
> won't affect guest_memfd for now.

Definitely not a problem, I'm very against putting guest_memfd pages on the
kernel's standard LRU lists.

> Returning false also breaks page_cache_pipe_buf_try_steal(). Does anyone
> more familiar with splicing know if that could affect guest_memfd?

AFAICT, also not a problem until KVM supports .splice_read().

> Returning true could also work, to indicate that the folio's private
> data has been "removed". I'd also have to do inode_sub_bytes() in
> .release_folio() then, since in mapping_evict_folio(), remove_mapping()
> doesn't call .invalidate_folio().
> 
> Then we will have to separately ensure that in truncate_error_folio(),
> guest_memfd doesn't double-deduct the folio's size from the inode. This
> should be semantically correct though, since IIUC .invalidate_folio() is
> when a folio is removed (clean or dirty), but .release_folio() is only
> for clean folios. If .error_remove_folio() returns MF_DELAYED, the
> truncation didn't happen and so there should be no call to
> .release_folio().

Before we dive deep into solutions, what's the motivation for making fstat() work?
As I asked in the cover letter:

  P.S. In future versions, please explain _why_ you want to add fstat() support,
  i.e. why you want to account allocated bytes/folios.  For folks like me that do
  very little userspace programming, and even less filesystems work, fstat() not
  working means nothing.  Even if the answer is "because literally every other FS
  in Linux works".


end of thread, other threads:[~2026-03-12 19:00 UTC | newest]

Thread overview: 12+ messages
2026-03-09  9:53 [PATCH RFC v3 0/4] guest_memfd: Track amount of memory allocated on inode Ackerley Tng
2026-03-09  9:53 ` [PATCH RFC v3 1/4] KVM: " Ackerley Tng
2026-03-09 11:50   ` David Hildenbrand (Arm)
2026-03-09 15:45     ` Ackerley Tng
2026-03-09 20:14       ` Sean Christopherson
2026-03-09  9:53 ` [PATCH RFC v3 2/4] KVM: guest_memfd: Set release always on guest_memfd mappings Ackerley Tng
2026-03-09 11:42   ` Jan Kara
2026-03-10  1:06   ` Sean Christopherson
2026-03-10  9:12     ` Ackerley Tng
2026-03-12 19:00       ` Sean Christopherson
2026-03-09  9:53 ` [PATCH RFC v3 3/4] KVM: selftests: Wrap fstat() to assert success Ackerley Tng
2026-03-09  9:53 ` [PATCH RFC v3 4/4] KVM: selftests: Test that st_blocks is updated on allocation Ackerley Tng
