public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v7 0/2] KVM: guest_memfd: use write for population
@ 2025-11-14 15:18 Kalyazin, Nikita
  2025-11-14 15:18 ` [PATCH v7 1/2] KVM: guest_memfd: add generic population via write Kalyazin, Nikita
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Kalyazin, Nikita @ 2025-11-14 15:18 UTC (permalink / raw)
  To: pbonzini@redhat.com, shuah@kernel.org
  Cc: kvm@vger.kernel.org, linux-kselftest@vger.kernel.org,
	linux-kernel@vger.kernel.org, seanjc@google.com, david@kernel.org,
	jthoughton@google.com, ackerleytng@google.com,
	vannapurve@google.com, jackmanb@google.com, patrick.roy@linux.dev,
	Thomson, Jack, Itazuri, Takahiro, Manwaring, Derek, Cali, Marco,
	Kalyazin, Nikita

On systems that support shared guest memory, write() is useful, for
example, for population of the initial image.  Even though the same can
also be achieved via userspace mapping and memcpying from userspace,
write() provides a more performant option because it does not need to
set user page tables and it does not cause a page fault for every page
like memcpy would.  Note that memcpy cannot be accelerated via
MADV_POPULATE_WRITE as it is not supported by guest_memfd and relies on
GUP.

Populating 512MiB of guest_memfd on a x86 machine:
 - via memcpy: 436 ms
 - via write:  202 ms (-54%)

Only PAGE_ALIGNED offset and len are allowed.  Even though non-aligned
writes are technically possible, when in-place conversion support is
implemented [1], the restriction makes handling of mixed shared/private
huge pages simpler.  write() will only be allowed to populate shared
pages.

When direct map removal is implemented [2]
 - write() will not be allowed to access pages that have already
   been removed from direct map
 - on completion, write() will remove the populated pages from
   direct map

While it is technically possible to implement read() syscall on systems
with shared guest memory, it is not supported as there is currently no
use case for it.

[1]
https://lore.kernel.org/kvm/cover.1760731772.git.ackerleytng@google.com
[2]
https://lore.kernel.org/kvm/20250924151101.2225820-1-patrick.roy@campus.lmu.de

Nikita Kalyazin (2):
  KVM: guest_memfd: add generic population via write
  KVM: selftests: update guest_memfd write tests

 Documentation/virt/kvm/api.rst                |  2 +
 include/linux/kvm_host.h                      |  2 +-
 include/uapi/linux/kvm.h                      |  1 +
 .../testing/selftests/kvm/guest_memfd_test.c  | 58 +++++++++++++++++--
 virt/kvm/guest_memfd.c                        | 52 +++++++++++++++++
 5 files changed, 108 insertions(+), 7 deletions(-)


base-commit: 8a4821412cf2c1429fffa07c012dd150f2edf78c
--
2.50.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v7 1/2] KVM: guest_memfd: add generic population via write
  2025-11-14 15:18 [PATCH v7 0/2] KVM: guest_memfd: use write for population Kalyazin, Nikita
@ 2025-11-14 15:18 ` Kalyazin, Nikita
  2026-03-12  0:46   ` Sean Christopherson
  2025-11-14 15:18 ` [PATCH v7 2/2] KVM: selftests: update guest_memfd write tests Kalyazin, Nikita
  2025-11-14 15:23 ` [PATCH v7 0/2] KVM: guest_memfd: use write for population Nikita Kalyazin
  2 siblings, 1 reply; 6+ messages in thread
From: Kalyazin, Nikita @ 2025-11-14 15:18 UTC (permalink / raw)
  To: pbonzini@redhat.com, shuah@kernel.org
  Cc: kvm@vger.kernel.org, linux-kselftest@vger.kernel.org,
	linux-kernel@vger.kernel.org, seanjc@google.com, david@kernel.org,
	jthoughton@google.com, ackerleytng@google.com,
	vannapurve@google.com, jackmanb@google.com, patrick.roy@linux.dev,
	Thomson, Jack, Itazuri, Takahiro, Manwaring, Derek, Cali, Marco,
	Kalyazin, Nikita

From: Nikita Kalyazin <kalyazin@amazon.com>

On systems that support shared guest memory, write() is useful, for
example, for population of the initial image.  Even though the same can
also be achieved via userspace mapping and memcpying from userspace,
write() provides a more performant option because it does not need to
set user page tables and it does not cause a page fault for every page
like memcpy would.  Note that memcpy cannot be accelerated via
MADV_POPULATE_WRITE as it is not supported by guest_memfd and relies on
GUP.

Populating 512MiB of guest_memfd on a x86 machine:
 - via memcpy: 436 ms
 - via write:  202 ms (-54%)

Only PAGE_ALIGNED offset and len are allowed.  Even though non-aligned
writes are technically possible, when in-place conversion support is
implemented [1], the restriction makes handling of mixed shared/private
huge pages simpler.  write() will only be allowed to populate shared
pages.

When direct map removal is implemented [2]
 - write() will not be allowed to access pages that have already been
   removed from direct map
 - on completion, write() will remove the populated pages from direct
   map

While it is technically possible to implement read() syscall on systems
with shared guest memory, it is not supported as there is currently no
use case for it.

[1] https://lore.kernel.org/kvm/cover.1760731772.git.ackerleytng@google.com
[2] https://lore.kernel.org/kvm/20250924151101.2225820-1-patrick.roy@campus.lmu.de

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 Documentation/virt/kvm/api.rst |  2 ++
 include/linux/kvm_host.h       |  2 +-
 include/uapi/linux/kvm.h       |  1 +
 virt/kvm/guest_memfd.c         | 52 ++++++++++++++++++++++++++++++++++
 4 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 57061fa29e6a..9541e95fc2ed 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6448,6 +6448,8 @@ specified via KVM_CREATE_GUEST_MEMFD.  Currently defined flags:
                                without INIT_SHARED will be marked private).
                                Shared memory can be faulted into host userspace
                                page tables. Private memory cannot.
+  GUEST_MEMFD_FLAG_WRITE       Enable using write() on the guest_memfd file
+                               descriptor.
   ============================ ================================================
 
 When the KVM MMU performs a PFN lookup to service a guest fault and the backing
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5bd76cf394fa..5fbf65f49586 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -736,7 +736,7 @@ static inline u64 kvm_gmem_get_supported_flags(struct kvm *kvm)
 	u64 flags = GUEST_MEMFD_FLAG_MMAP;
 
 	if (!kvm || kvm_arch_supports_gmem_init_shared(kvm))
-		flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
+		flags |= GUEST_MEMFD_FLAG_INIT_SHARED | GUEST_MEMFD_FLAG_WRITE;
 
 	return flags;
 }
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 52f6000ab020..5b73d6528f1c 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1601,6 +1601,7 @@ struct kvm_memory_attributes {
 #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
 #define GUEST_MEMFD_FLAG_MMAP		(1ULL << 0)
 #define GUEST_MEMFD_FLAG_INIT_SHARED	(1ULL << 1)
+#define GUEST_MEMFD_FLAG_WRITE		(1ULL << 2)
 
 struct kvm_create_guest_memfd {
 	__u64 size;
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index ffadc5ee8e04..2c71c21b9189 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -411,6 +411,8 @@ static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
 
 static struct file_operations kvm_gmem_fops = {
 	.mmap		= kvm_gmem_mmap,
+	.llseek		= default_llseek,
+	.write_iter     = generic_perform_write,
 	.open		= generic_file_open,
 	.release	= kvm_gmem_release,
 	.fallocate	= kvm_gmem_fallocate,
@@ -421,6 +423,53 @@ void kvm_gmem_init(struct module *module)
 	kvm_gmem_fops.owner = module;
 }
 
+static bool kvm_gmem_supports_write(struct inode *inode)
+{
+	const u64 flags = (u64)inode->i_private;
+
+	return flags & GUEST_MEMFD_FLAG_WRITE;
+}
+
+static int kvm_gmem_write_begin(const struct kiocb *kiocb,
+				struct address_space *mapping,
+				loff_t pos, unsigned int len,
+				struct folio **folio, void **fsdata)
+{
+	struct inode *inode = file_inode(kiocb->ki_filp);
+
+	if (!kvm_gmem_supports_write(inode))
+		return -ENODEV;
+
+	if (pos + len > i_size_read(inode))
+		return -EINVAL;
+
+	if (!IS_ALIGNED(pos, PAGE_SIZE) || !IS_ALIGNED(len, PAGE_SIZE))
+		return -EINVAL;
+
+	*folio = kvm_gmem_get_folio(inode, pos >> PAGE_SHIFT);
+	if (IS_ERR(*folio))
+		return PTR_ERR(*folio);
+
+	return 0;
+}
+
+static int kvm_gmem_write_end(const struct kiocb *kiocb,
+			      struct address_space *mapping,
+			      loff_t pos, unsigned int len,
+			      unsigned int copied,
+			      struct folio *folio, void *fsdata)
+{
+	if (!folio_test_uptodate(folio)) {
+		folio_zero_range(folio, copied, len - copied);
+		folio_mark_uptodate(folio);
+	}
+
+	folio_unlock(folio);
+	folio_put(folio);
+
+	return copied;
+}
+
 static int kvm_gmem_migrate_folio(struct address_space *mapping,
 				  struct folio *dst, struct folio *src,
 				  enum migrate_mode mode)
@@ -469,6 +518,8 @@ static void kvm_gmem_free_folio(struct folio *folio)
 
 static const struct address_space_operations kvm_gmem_aops = {
 	.dirty_folio = noop_dirty_folio,
+	.write_begin = kvm_gmem_write_begin,
+	.write_end = kvm_gmem_write_end,
 	.migrate_folio	= kvm_gmem_migrate_folio,
 	.error_remove_folio = kvm_gmem_error_folio,
 #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
@@ -516,6 +567,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	}
 
 	file->f_flags |= O_LARGEFILE;
+	file->f_mode |= FMODE_LSEEK | FMODE_PWRITE;
 
 	inode = file->f_inode;
 	WARN_ON(file->f_mapping != inode->i_mapping);
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v7 2/2] KVM: selftests: update guest_memfd write tests
  2025-11-14 15:18 [PATCH v7 0/2] KVM: guest_memfd: use write for population Kalyazin, Nikita
  2025-11-14 15:18 ` [PATCH v7 1/2] KVM: guest_memfd: add generic population via write Kalyazin, Nikita
@ 2025-11-14 15:18 ` Kalyazin, Nikita
  2026-03-12  0:55   ` Sean Christopherson
  2025-11-14 15:23 ` [PATCH v7 0/2] KVM: guest_memfd: use write for population Nikita Kalyazin
  2 siblings, 1 reply; 6+ messages in thread
From: Kalyazin, Nikita @ 2025-11-14 15:18 UTC (permalink / raw)
  To: pbonzini@redhat.com, shuah@kernel.org
  Cc: kvm@vger.kernel.org, linux-kselftest@vger.kernel.org,
	linux-kernel@vger.kernel.org, seanjc@google.com, david@kernel.org,
	jthoughton@google.com, ackerleytng@google.com,
	vannapurve@google.com, jackmanb@google.com, patrick.roy@linux.dev,
	Thomson, Jack, Itazuri, Takahiro, Manwaring, Derek, Cali, Marco,
	Kalyazin, Nikita

From: Nikita Kalyazin <kalyazin@amazon.com>

This is to reflect that the write syscall is now implemented for
guest_memfd.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 .../testing/selftests/kvm/guest_memfd_test.c  | 58 +++++++++++++++++--
 1 file changed, 52 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index e7d9aeb418d3..ef3e92e18ee6 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -24,18 +24,57 @@
 
 static size_t page_size;
 
-static void test_file_read_write(int fd, size_t total_size)
+static void test_file_read(int fd, size_t total_size)
 {
 	char buf[64];
 
 	TEST_ASSERT(read(fd, buf, sizeof(buf)) < 0,
 		    "read on a guest_mem fd should fail");
-	TEST_ASSERT(write(fd, buf, sizeof(buf)) < 0,
-		    "write on a guest_mem fd should fail");
 	TEST_ASSERT(pread(fd, buf, sizeof(buf), 0) < 0,
 		    "pread on a guest_mem fd should fail");
-	TEST_ASSERT(pwrite(fd, buf, sizeof(buf), 0) < 0,
-		    "pwrite on a guest_mem fd should fail");
+}
+
+static void test_write_supported(int fd, size_t total_size)
+{
+	size_t page_size = getpagesize();
+	void *buf = NULL;
+	int ret;
+
+	ret = posix_memalign(&buf, page_size, total_size);
+	TEST_ASSERT_EQ(ret, 0);
+
+	ret = pwrite(fd, buf, page_size, total_size);
+	TEST_ASSERT(ret == -1, "writing past the file size on a guest_mem fd should fail");
+	TEST_ASSERT_EQ(errno, EINVAL);
+
+	ret = pwrite(fd, buf, 1, 0);
+	TEST_ASSERT(ret == -1, "writing an unaligned count a guest_mem fd should fail");
+	TEST_ASSERT_EQ(errno, EINVAL);
+
+	ret = pwrite(fd, buf, page_size, 1);
+	TEST_ASSERT(ret == -1, "writing to an unaligned offset a guest_mem fd should fail");
+	TEST_ASSERT_EQ(errno, EINVAL);
+
+	ret = pwrite(fd, buf, page_size, 0);
+	TEST_ASSERT(ret == page_size, "write on a guest_mem fd should succeed");
+
+	free(buf);
+}
+
+static void test_write_not_supported(int fd, size_t total_size)
+{
+	size_t page_size = getpagesize();
+	void *buf = NULL;
+	int ret;
+
+	ret = posix_memalign(&buf, page_size, total_size);
+	TEST_ASSERT_EQ(ret, 0);
+
+	ret = pwrite(fd, buf, page_size, 0);
+	TEST_ASSERT(ret == -1, "write on guest_mem fd should fail");
+	TEST_ASSERT_EQ(errno, ENODEV);
+
+	free(buf);
 }
 
 static void test_mmap_cow(int fd, size_t size)
@@ -267,7 +306,7 @@ static void __test_guest_memfd(struct kvm_vm *vm, uint64_t flags)
 	test_create_guest_memfd_multiple(vm);
 	test_create_guest_memfd_invalid_sizes(vm, flags);
 
-	gmem_test(file_read_write, vm, flags);
+	gmem_test(file_read, vm, flags);
 
 	if (flags & GUEST_MEMFD_FLAG_MMAP) {
 		if (flags & GUEST_MEMFD_FLAG_INIT_SHARED) {
@@ -282,6 +321,11 @@ static void __test_guest_memfd(struct kvm_vm *vm, uint64_t flags)
 		gmem_test(mmap_not_supported, vm, flags);
 	}
 
+	if (flags & GUEST_MEMFD_FLAG_WRITE)
+		gmem_test(write_supported, vm, flags);
+	else
+		gmem_test(write_not_supported, vm, flags);
+
 	gmem_test(file_size, vm, flags);
 	gmem_test(fallocate, vm, flags);
 	gmem_test(invalid_punch_hole, vm, flags);
@@ -299,6 +343,8 @@ static void test_guest_memfd(unsigned long vm_type)
 	flags = vm_check_cap(vm, KVM_CAP_GUEST_MEMFD_FLAGS);
 	if (flags & GUEST_MEMFD_FLAG_MMAP)
 		__test_guest_memfd(vm, GUEST_MEMFD_FLAG_MMAP);
+	if (flags & GUEST_MEMFD_FLAG_WRITE)
+		__test_guest_memfd(vm, GUEST_MEMFD_FLAG_WRITE);
 
 	/* MMAP should always be supported if INIT_SHARED is supported. */
 	if (flags & GUEST_MEMFD_FLAG_INIT_SHARED)
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v7 0/2] KVM: guest_memfd: use write for population
  2025-11-14 15:18 [PATCH v7 0/2] KVM: guest_memfd: use write for population Kalyazin, Nikita
  2025-11-14 15:18 ` [PATCH v7 1/2] KVM: guest_memfd: add generic population via write Kalyazin, Nikita
  2025-11-14 15:18 ` [PATCH v7 2/2] KVM: selftests: update guest_memfd write tests Kalyazin, Nikita
@ 2025-11-14 15:23 ` Nikita Kalyazin
  2 siblings, 0 replies; 6+ messages in thread
From: Nikita Kalyazin @ 2025-11-14 15:23 UTC (permalink / raw)
  To: Kalyazin, Nikita, pbonzini@redhat.com, shuah@kernel.org
  Cc: kvm@vger.kernel.org, linux-kselftest@vger.kernel.org,
	linux-kernel@vger.kernel.org, seanjc@google.com, david@kernel.org,
	jthoughton@google.com, ackerleytng@google.com,
	vannapurve@google.com, jackmanb@google.com, patrick.roy@linux.dev,
	Thomson, Jack, Itazuri, Takahiro, Manwaring, Derek, Cali, Marco



On 14/11/2025 15:18, Kalyazin, Nikita wrote:
> On systems that support shared guest memory, write() is useful, for
> example, for population of the initial image.  Even though the same can
> also be achieved via userspace mapping and memcpying from userspace,
> write() provides a more performant option because it does not need to
> set user page tables and it does not cause a page fault for every page
> like memcpy would.  Note that memcpy cannot be accelerated via
> MADV_POPULATE_WRITE as it is not supported by guest_memfd and relies on
> GUP.
> 
> Populating 512MiB of guest_memfd on a x86 machine:
>   - via memcpy: 436 ms
>   - via write:  202 ms (-54%)
> 
> Only PAGE_ALIGNED offset and len are allowed.  Even though non-aligned
> writes are technically possible, when in-place conversion support is
> implemented [1], the restriction makes handling of mixed shared/private
> huge pages simpler.  write() will only be allowed to populate shared
> pages.
> 
> When direct map removal is implemented [2]
>   - write() will not be allowed to access pages that have already
>     been removed from direct map
>   - on completion, write() will remove the populated pages from
>     direct map
> 
> While it is technically possible to implement read() syscall on systems
> with shared guest memory, it is not supported as there is currently no
> use case for it.
> 
> [1]
> https://lore.kernel.org/kvm/cover.1760731772.git.ackerleytng@google.com
> [2]
> https://lore.kernel.org/kvm/20250924151101.2225820-1-patrick.roy@campus.lmu.de

I failed to include links to previous versions:

v7:
  - Sean: add GUEST_MEMFD_FLAG_WRITE and documentation for it
  - Ackerley: only allow PAGE_ALIGNED offset and len
  - Sean/Ackerley: formatting fixes

v6:
  - https://lore.kernel.org/kvm/20251020161352.69257-1-kalyazin@amazon.com
  - Make write support conditional on mmap support instead of relying on
    the up-to-date flag to decide whether writing to a page is allowed
  - James: Remove dependencies on folio_test_large
  - James: Remove page alignment restriction
  - James: Formatting fixes

v5:
  - https://lore.kernel.org/kvm/20250902111951.58315-1-kalyazin@amazon.com
  - Replace the call to the unexported filemap_remove_folio with
    zeroing the bytes that could not be copied
  - Fix checkpatch findings

v4:
  - https://lore.kernel.org/kvm/20250828153049.3922-1-kalyazin@amazon.com
  - Switch from implementing the write callback to write_iter
  - Remove conditional compilation

v3:
  - https://lore.kernel.org/kvm/20250303130838.28812-1-kalyazin@amazon.com
  - David/Mike D: Only compile support for the write syscall if
    CONFIG_KVM_GMEM_SHARED_MEM (now gone) is enabled.
v2:
  - https://lore.kernel.org/kvm/20241129123929.64790-1-kalyazin@amazon.com
  - Switch from an ioctl to the write syscall to implement population

v1:
  - https://lore.kernel.org/kvm/20241024095429.54052-1-kalyazin@amazon.com

> 
> Nikita Kalyazin (2):
>    KVM: guest_memfd: add generic population via write
>    KVM: selftests: update guest_memfd write tests
> 
>   Documentation/virt/kvm/api.rst                |  2 +
>   include/linux/kvm_host.h                      |  2 +-
>   include/uapi/linux/kvm.h                      |  1 +
>   .../testing/selftests/kvm/guest_memfd_test.c  | 58 +++++++++++++++++--
>   virt/kvm/guest_memfd.c                        | 52 +++++++++++++++++
>   5 files changed, 108 insertions(+), 7 deletions(-)
> 
> 
> base-commit: 8a4821412cf2c1429fffa07c012dd150f2edf78c
> --
> 2.50.1
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v7 1/2] KVM: guest_memfd: add generic population via write
  2025-11-14 15:18 ` [PATCH v7 1/2] KVM: guest_memfd: add generic population via write Kalyazin, Nikita
@ 2026-03-12  0:46   ` Sean Christopherson
  0 siblings, 0 replies; 6+ messages in thread
From: Sean Christopherson @ 2026-03-12  0:46 UTC (permalink / raw)
  To: Nikita Kalyazin
  Cc: pbonzini@redhat.com, shuah@kernel.org, kvm@vger.kernel.org,
	linux-kselftest@vger.kernel.org, linux-kernel@vger.kernel.org,
	david@kernel.org, jthoughton@google.com, ackerleytng@google.com,
	vannapurve@google.com, jackmanb@google.com, patrick.roy@linux.dev,
	Jack Thomson, Takahiro Itazuri, Derek Manwaring, Marco Cali

On Fri, Nov 14, 2025, Nikita Kalyazin wrote:
> ---
>  Documentation/virt/kvm/api.rst |  2 ++
>  include/linux/kvm_host.h       |  2 +-
>  include/uapi/linux/kvm.h       |  1 +
>  virt/kvm/guest_memfd.c         | 52 ++++++++++++++++++++++++++++++++++
>  4 files changed, 56 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 57061fa29e6a..9541e95fc2ed 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6448,6 +6448,8 @@ specified via KVM_CREATE_GUEST_MEMFD.  Currently defined flags:
>                                 without INIT_SHARED will be marked private).
>                                 Shared memory can be faulted into host userspace
>                                 page tables. Private memory cannot.
> +  GUEST_MEMFD_FLAG_WRITE       Enable using write() on the guest_memfd file
> +                               descriptor.

Not the greatest place for it due to limited space, but the page alignment and
shared restrictions should be documented, and this seems to be the best spot.
And whatever we do on a partial copy also needs to be documented.  E.g.

  GUEST_MEMFD_FLAG_WRITE       Enable using write() on the guest_memfd file
                               descriptor.  The start and size of the write
                               must be page aligned, and all pages must be in
                               a SHARED state.  If the full buffer cannot be
                               copied for a given page, <something happens>.

> @@ -421,6 +423,53 @@ void kvm_gmem_init(struct module *module)
>  	kvm_gmem_fops.owner = module;
>  }
>  
> +static bool kvm_gmem_supports_write(struct inode *inode)
> +{
> +	const u64 flags = (u64)inode->i_private;
> +
> +	return flags & GUEST_MEMFD_FLAG_WRITE;
> +}
> +
> +static int kvm_gmem_write_begin(const struct kiocb *kiocb,
> +				struct address_space *mapping,
> +				loff_t pos, unsigned int len,
> +				struct folio **folio, void **fsdata)
> +{
> +	struct inode *inode = file_inode(kiocb->ki_filp);
> +
> +	if (!kvm_gmem_supports_write(inode))

Eh, no need for a helper, especially since flags is now easier to get at:

	if (!(GMEM_I(inode)->flags | GUEST_MEMFD_FLAG_WRITE))
		return -ENODEV;

I also think we should leave ourselves a safety net for in-place conversion, and
WARN if the gmem instance isn't INIT_SHARED:

	if (WARN_ON_ONCE(!(GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED)))
		return -EBUSY;

That will also provide a good place to actually verify the memory is shared once
in-place conversion comes along.

> +		return -ENODEV;
> +
> +	if (pos + len > i_size_read(inode))
> +		return -EINVAL;
> +
> +	if (!IS_ALIGNED(pos, PAGE_SIZE) || !IS_ALIGNED(len, PAGE_SIZE))
> +		return -EINVAL;
> +
> +	*folio = kvm_gmem_get_folio(inode, pos >> PAGE_SHIFT);
> +	if (IS_ERR(*folio))
> +		return PTR_ERR(*folio);
> +
> +	return 0;
> +}
> +
> +static int kvm_gmem_write_end(const struct kiocb *kiocb,
> +			      struct address_space *mapping,
> +			      loff_t pos, unsigned int len,
> +			      unsigned int copied,
> +			      struct folio *folio, void *fsdata)
> +{
> +	if (!folio_test_uptodate(folio)) {
> +		folio_zero_range(folio, copied, len - copied);

Hmm, do we actually want to zero and silently ignore the failure?  Given the
intended use case, silently failing here would be a terrible outcome.  Would it
makes sense to instead do this?

	if (len != copied)
		return -EFAULT;

	if (!folio_test_uptodate(folio))
		folio_mark_uptodate(folio);

	folio_unlock(folio);
	folio_put(folio);

	return copied;

That will cause generic_perform_write() to report -EFAULT if no pages were written,
or IIUC, return the position of the last _full_ page that was written.  Then in
the unlikely scenario userspace wants to retry, they can retry starting at the
page that was partially written.  That seems like what VMMs generally would want,
not silent failure.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v7 2/2] KVM: selftests: update guest_memfd write tests
  2025-11-14 15:18 ` [PATCH v7 2/2] KVM: selftests: update guest_memfd write tests Kalyazin, Nikita
@ 2026-03-12  0:55   ` Sean Christopherson
  0 siblings, 0 replies; 6+ messages in thread
From: Sean Christopherson @ 2026-03-12  0:55 UTC (permalink / raw)
  To: Nikita Kalyazin
  Cc: pbonzini@redhat.com, shuah@kernel.org, kvm@vger.kernel.org,
	linux-kselftest@vger.kernel.org, linux-kernel@vger.kernel.org,
	david@kernel.org, jthoughton@google.com, ackerleytng@google.com,
	vannapurve@google.com, jackmanb@google.com, patrick.roy@linux.dev,
	Jack Thomson, Takahiro Itazuri, Derek Manwaring, Marco Cali

On Fri, Nov 14, 2025, Nikita Kalyazin wrote:
> +static void test_write_supported(int fd, size_t total_size)
> +{
> +	size_t page_size = getpagesize();
> +	void *buf = NULL;
> +	int ret;
> +
> +	ret = posix_memalign(&buf, page_size, total_size);
> +	TEST_ASSERT_EQ(ret, 0);
> +
> +	ret = pwrite(fd, buf, page_size, total_size);
> +	TEST_ASSERT(ret == -1, "writing past the file size on a guest_mem fd should fail");
> +	TEST_ASSERT_EQ(errno, EINVAL);
> +
> +	ret = pwrite(fd, buf, 1, 0);
> +	TEST_ASSERT(ret == -1, "writing an unaligned count a guest_mem fd should fail");
> +	TEST_ASSERT_EQ(errno, EINVAL);
> +
> +	ret = pwrite(fd, buf, page_size, 1);
> +	TEST_ASSERT(ret == -1, "writing to an unaligned offset a guest_mem fd should fail");
> +	TEST_ASSERT_EQ(errno, EINVAL);
> +
> +	ret = pwrite(fd, buf, page_size, 0);
> +	TEST_ASSERT(ret == page_size, "write on a guest_mem fd should succeed");

This write should actually verify a given pattern is written, not just that the
write() returned success.  There should also be testcases for writes that fail,
e.g. on the first page and also on the nth page.

In a perfect world, there would also be testcases for partial writes, but I'm not
entirely sure how to make that happen, e.g. I'm not sure if an munmap() can pull
the rug out from under __copy_from_iter().  If we go with the -EFAULT behavior,
the test can map two pages, then use a separate helper thread to munmap() the pages
after a delay.  Then in the pwrite() thread, verify all bytes are written on success?
I'm not sure that's interesting enought to warrant the development cost and effort
though, e.g. it doesn't seem wildly different than the deterministic "entire page
failed" case.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-03-12  0:55 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-14 15:18 [PATCH v7 0/2] KVM: guest_memfd: use write for population Kalyazin, Nikita
2025-11-14 15:18 ` [PATCH v7 1/2] KVM: guest_memfd: add generic population via write Kalyazin, Nikita
2026-03-12  0:46   ` Sean Christopherson
2025-11-14 15:18 ` [PATCH v7 2/2] KVM: selftests: update guest_memfd write tests Kalyazin, Nikita
2026-03-12  0:55   ` Sean Christopherson
2025-11-14 15:23 ` [PATCH v7 0/2] KVM: guest_memfd: use write for population Nikita Kalyazin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox