* [RFC PATCH v2 00/51] 1G page support for guest_memfd
@ 2025-05-14 23:41 Ackerley Tng
  2025-05-14 23:41 ` [RFC PATCH v2 01/51] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Ackerley Tng
                   ` (55 more replies)
  0 siblings, 56 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:41 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Hello,

This patchset builds upon discussion at LPC 2024 and many guest_memfd
upstream calls to provide 1G page support for guest_memfd by taking
pages from HugeTLB.

This patchset is based on Linux v6.15-rc6, and requires the mmap support
for guest_memfd patchset (Thanks Fuad!) [1].

For ease of testing, this series is also available, stitched together,
at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2

This patchset can be divided into two sections:

(a) Patches from the beginning up to and including "KVM: selftests:
    Update script to map shared memory from guest_memfd" are a modified
    version of "conversion support for guest_memfd", which Fuad is
    managing [2].

(b) Patches after "KVM: selftests: Update script to map shared memory
    from guest_memfd" until the end are the patches that actually bring
    in 1G page support for guest_memfd.

These are the significant differences between (a) and [2]:

+ [2] uses an xarray to track shareability, but I used a maple tree
  because for 1G pages, iterating pagewise to update shareability was
  prohibitively slow even for testing. I chose from among multi-index
  xarrays, interval trees and maple trees [3], and picked maple trees
  (a brief sketch follows this list) because
    + With maple trees I didn't have to compute the correct multi-index
      order or handle edge cases when the converted range wasn't a neat
      power of 2.
    + Updating a range in a maple tree was easier to figure out than
      updating parts of a multi-index xarray.
    + Maple trees had an easier API to use than interval trees.
+ [2] doesn't yet have a conversion ioctl, but I needed it to test 1G
  support end-to-end.
+ (a) removes guest_memfd folios from participation in the LRU, which I
  needed to get conversion selftests to work as expected, since LRU
  participation was adding unexpected refcounts on folios, which was
  blocking conversions.
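
As a rough illustration of the difference (a sketch only; the real
shareability code is in patches 02 and 04 below, and the function below
is made up): with a maple tree, an arbitrary, possibly
non-power-of-2-sized range is updated with a single range store,
instead of one store per 4K index (262144 stores for a 1G range).

    #include <linux/gfp.h>
    #include <linux/maple_tree.h>
    #include <linux/xarray.h>

    /* Illustrative only: set one shareability value over [start, last]. */
    static int example_set_shareability(struct maple_tree *mt, pgoff_t start,
                                        pgoff_t last, unsigned long value)
    {
            /* One store covers the whole range, whatever its alignment. */
            return mtree_store_range(mt, start, last, xa_mk_value(value),
                                     GFP_KERNEL);
    }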

I am sending (a) in emails as well, as opposed to just leaving it on
GitHub, so that we can discuss by commenting inline on emails. If you'd
like to just look at 1G page support, here are some key takeaways from
the first section (a):

+ If GUEST_MEMFD_FLAG_SUPPORT_SHARED is requested during guest_memfd
  creation, guest_memfd will
    + Track shareability (whether an index in the inode is guest-only or
      if the host is allowed to fault memory at a given index).
    + Always be used for guest faults - specifically, kvm_gmem_get_pfn()
      will be used to provide pages for the guest.
    + Always be used by KVM to check private/shared status of a gfn.
+ guest_memfd now has conversion ioctls, allowing ranges to be
  converted to private or shared (see the sketch after this list).
    + Conversion can fail if there are unexpected refcounts on any
      folios in the range.
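
To make the above concrete, here is a sketch of how userspace might
exercise (a). It assumes this series plus the mmap support from [1] are
applied and that gmem_fd was created with
GUEST_MEMFD_FLAG_SUPPORT_SHARED; it uses the ioctl names from patch 04
and is not a verbatim excerpt from the selftests.

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/kvm.h>

    /* Convert a range to shared, then fault it in from the host. */
    static int share_and_touch(int gmem_fd, uint64_t offset, uint64_t size)
    {
            struct kvm_gmem_convert param = { .offset = offset, .size = size };
            uint8_t *mem;

            if (ioctl(gmem_fd, KVM_GMEM_CONVERT_SHARED, &param))
                    return -1;

            mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                       gmem_fd, offset);
            if (mem == MAP_FAILED)
                    return -1;

            memset(mem, 0, size);  /* host faults are allowed while shared */
            return munmap(mem, size);
    }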

Focusing on (b) 1G page support, here's an overview:

1. A bunch of refactoring patches for HugeTLB that isolate the
   allocation of a HugeTLB folio from other HugeTLB concepts, such as
   VMA-level reservations, and from HugeTLBfs-specific concepts, such
   as where memory policy is stored in the VMA or where the subpool is
   stored on the inode.
2. A few patches that add a guestmem_hugetlb allocator within mm/. The
   guestmem_hugetlb allocator is a wrapper around HugeTLB that
   modularizes the memory management functions and handles cleanup, so
   that folio cleanup can happen after the guest_memfd inode (and even
   KVM) goes away (an illustrative sketch follows this list).
3. Some updates to guest_memfd to use the guestmem_hugetlb allocator.
4. Selftests for 1G page support.
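
Purely to illustrate what wrapping HugeTLB as a modular allocator means
here, a hypothetical shape of such an interface is sketched below. This
is not the actual interface added in include/linux/guestmem.h by this
series; all names are made up.

    /* Hypothetical, illustrative allocator hooks for guest_memfd. */
    struct example_guestmem_allocator_ops {
            /* Hand out a (huge) folio backing the given index in the inode. */
            struct folio *(*alloc_folio)(struct inode *inode, pgoff_t index);
            /*
             * Runs on the final folio_put(), possibly after the guest_memfd
             * inode (and even KVM) is gone, so cleanup cannot rely on either.
             */
            void (*free_folio)(struct folio *folio);
    };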

Here are some remaining issues/TODOs:

1. Memory error handling, such as for machine check errors, has not
   been implemented.
2. I've not looked into preparedness of pages; only zeroing has been
   considered.
3. When allocating HugeTLB pages, if two threads allocate indices
   mapping to the same huge page, the utilization in the guest_memfd
   inode's subpool may momentarily exceed the subpool limit (the
   requested size of the inode at guest_memfd creation time), causing
   one of the two threads to get -ENOMEM. Suggestions to solve this are
   appreciated!
4. The max_usage_in_bytes statistic (cgroups v1) for guest_memfd
   HugeTLB pages should be correct, but has not been tested and could
   be wrong.
5. memcg charging (charge_memcg()) for cgroups v2 for guest_memfd
   HugeTLB pages after splitting should be correct, but has not been
   tested and could be wrong.
6. Page cache accounting: when a HugeTLB page is split, guest_memfd
   pages will be counted in both the NR_HUGETLB stat (counted at
   HugeTLB allocation time) and the NR_FILE_PAGES stat (counted when
   split pages are added to the filemap). Is this aligned with what
   people expect?

Here are some optimizations that could be explored in future series:

1. Pages could be split from 1G to 2M first and only split to 4K if
   necessary.
2. Zeroing could be skipped for CoCo VMs if hardware already zeroes the
   pages.

Here's RFC v1 [4] if you're interested in the motivation behind choosing
HugeTLB, or the history of this patch series.

[1] https://lore.kernel.org/all/20250513163438.3942405-11-tabba@google.com/T/
[2] https://lore.kernel.org/all/20250328153133.3504118-1-tabba@google.com/T/
[3] https://lore.kernel.org/all/diqzzfih8q7r.fsf@ackerleytng-ctop.c.googlers.com/
[4] https://lore.kernel.org/all/cover.1726009989.git.ackerleytng@google.com/T/

---

Ackerley Tng (49):
  KVM: guest_memfd: Make guest mem use guest mem inodes instead of
    anonymous inodes
  KVM: guest_memfd: Introduce and use shareability to guard faulting
  KVM: selftests: Update guest_memfd_test for INIT_PRIVATE flag
  KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  KVM: guest_memfd: Skip LRU for guest_memfd folios
  KVM: Query guest_memfd for private/shared status
  KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION
  KVM: selftests: Test flag validity after guest_memfd supports
    conversions
  KVM: selftests: Test faulting with respect to
    GUEST_MEMFD_FLAG_INIT_PRIVATE
  KVM: selftests: Refactor vm_mem_add to be more flexible
  KVM: selftests: Allow cleanup of ucall_pool from host
  KVM: selftests: Test conversion flows for guest_memfd
  KVM: selftests: Add script to exercise private_mem_conversions_test
  KVM: selftests: Update private_mem_conversions_test to mmap
    guest_memfd
  KVM: selftests: Update script to map shared memory from guest_memfd
  mm: hugetlb: Consolidate interpretation of gbl_chg within
    alloc_hugetlb_folio()
  mm: hugetlb: Cleanup interpretation of gbl_chg in
    alloc_hugetlb_folio()
  mm: hugetlb: Cleanup interpretation of map_chg_state within
    alloc_hugetlb_folio()
  mm: hugetlb: Rename alloc_surplus_hugetlb_folio
  mm: mempolicy: Refactor out policy_node_nodemask()
  mm: hugetlb: Inline huge_node() into callers
  mm: hugetlb: Refactor hugetlb allocation functions
  mm: hugetlb: Refactor out hugetlb_alloc_folio()
  mm: hugetlb: Add option to create new subpool without using surplus
  mm: truncate: Expose preparation steps for truncate_inode_pages_final
  mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages()
  mm: Introduce guestmem_hugetlb to support folio_put() handling of
    guestmem pages
  mm: guestmem_hugetlb: Wrap HugeTLB as an allocator for guest_memfd
  mm: truncate: Expose truncate_inode_folio()
  KVM: x86: Set disallow_lpage on base_gfn and guest_memfd pgoff
    misalignment
  KVM: guest_memfd: Support guestmem_hugetlb as custom allocator
  KVM: guest_memfd: Allocate and truncate from custom allocator
  mm: hugetlb: Add functions to add/delete folio from hugetlb lists
  mm: guestmem_hugetlb: Add support for splitting and merging pages
  mm: Convert split_folio() macro to function
  KVM: guest_memfd: Split allocator pages for guest_memfd use
  KVM: guest_memfd: Merge and truncate on fallocate(PUNCH_HOLE)
  KVM: guest_memfd: Update kvm_gmem_mapping_order to account for page
    status
  KVM: Add CAP to indicate support for HugeTLB as custom allocator
  KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd
  KVM: selftests: Update conversion flows test for HugeTLB
  KVM: selftests: Test truncation paths of guest_memfd
  KVM: selftests: Test allocation and conversion of subfolios
  KVM: selftests: Test that guest_memfd usage is reported via hugetlb
  KVM: selftests: Support various types of backing sources for private
    memory
  KVM: selftests: Update test for various private memory backing source
    types
  KVM: selftests: Update private_mem_conversions_test.sh to test with
    HugeTLB pages
  KVM: selftests: Add script to test HugeTLB statistics
  KVM: selftests: Test guest_memfd for accuracy of st_blocks

Elliot Berman (1):
  filemap: Pass address_space mapping to ->free_folio()

Fuad Tabba (1):
  mm: Consolidate freeing of typed folios on final folio_put()

 Documentation/filesystems/locking.rst         |    2 +-
 Documentation/filesystems/vfs.rst             |   15 +-
 Documentation/virt/kvm/api.rst                |    5 +
 arch/arm64/include/asm/kvm_host.h             |    5 -
 arch/x86/include/asm/kvm_host.h               |   10 -
 arch/x86/kvm/x86.c                            |   53 +-
 fs/hugetlbfs/inode.c                          |    2 +-
 fs/nfs/dir.c                                  |    9 +-
 fs/orangefs/inode.c                           |    3 +-
 include/linux/fs.h                            |    2 +-
 include/linux/guestmem.h                      |   23 +
 include/linux/huge_mm.h                       |    6 +-
 include/linux/hugetlb.h                       |   19 +-
 include/linux/kvm_host.h                      |   32 +-
 include/linux/mempolicy.h                     |   11 +-
 include/linux/mm.h                            |    2 +
 include/linux/page-flags.h                    |   32 +
 include/uapi/linux/guestmem.h                 |   29 +
 include/uapi/linux/kvm.h                      |   16 +
 include/uapi/linux/magic.h                    |    1 +
 mm/Kconfig                                    |   13 +
 mm/Makefile                                   |    1 +
 mm/debug.c                                    |    1 +
 mm/filemap.c                                  |   12 +-
 mm/guestmem_hugetlb.c                         |  512 +++++
 mm/guestmem_hugetlb.h                         |    9 +
 mm/hugetlb.c                                  |  488 ++---
 mm/internal.h                                 |    1 -
 mm/memcontrol.c                               |    2 +
 mm/memory.c                                   |    1 +
 mm/mempolicy.c                                |   44 +-
 mm/secretmem.c                                |    3 +-
 mm/swap.c                                     |   32 +-
 mm/truncate.c                                 |   27 +-
 mm/vmscan.c                                   |    4 +-
 tools/testing/selftests/kvm/Makefile.kvm      |    2 +
 .../kvm/guest_memfd_conversions_test.c        |  797 ++++++++
 .../kvm/guest_memfd_hugetlb_reporting_test.c  |  384 ++++
 ...uest_memfd_provide_hugetlb_cgroup_mount.sh |   36 +
 .../testing/selftests/kvm/guest_memfd_test.c  |  293 ++-
 ...memfd_wrap_test_check_hugetlb_reporting.sh |   95 +
 .../testing/selftests/kvm/include/kvm_util.h  |  104 +-
 .../testing/selftests/kvm/include/test_util.h |   20 +-
 .../selftests/kvm/include/ucall_common.h      |    1 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  465 +++--
 tools/testing/selftests/kvm/lib/test_util.c   |  102 +
 .../testing/selftests/kvm/lib/ucall_common.c  |   16 +-
 .../kvm/x86/private_mem_conversions_test.c    |  195 +-
 .../kvm/x86/private_mem_conversions_test.sh   |  100 +
 virt/kvm/Kconfig                              |    5 +
 virt/kvm/guest_memfd.c                        | 1655 ++++++++++++++++-
 virt/kvm/kvm_main.c                           |   14 +-
 virt/kvm/kvm_mm.h                             |    9 +-
 53 files changed, 5080 insertions(+), 640 deletions(-)
 create mode 100644 include/linux/guestmem.h
 create mode 100644 include/uapi/linux/guestmem.h
 create mode 100644 mm/guestmem_hugetlb.c
 create mode 100644 mm/guestmem_hugetlb.h
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_conversions_test.c
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
 create mode 100755 tools/testing/selftests/kvm/guest_memfd_provide_hugetlb_cgroup_mount.sh
 create mode 100755 tools/testing/selftests/kvm/guest_memfd_wrap_test_check_hugetlb_reporting.sh
 create mode 100755 tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh

--
2.49.0.1045.g170613ef41-goog

* [RFC PATCH v2 01/51] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
@ 2025-05-14 23:41 ` Ackerley Tng
  2025-05-14 23:41 ` [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting Ackerley Tng
                   ` (54 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:41 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

guest_memfd's inode represents memory the guest_memfd is
providing. guest_memfd's file represents a struct kvm's view of that
memory.

Using a custom inode allows customization of the inode teardown
process via callbacks. For example, ->evict_inode() allows
customization of the truncation process on file close, and
->destroy_inode() and ->free_inode() allow customization of the inode
freeing process.

Customizing the truncation process allows flexibility in managing
guest_memfd memory, and customizing the inode freeing process allows
proper cleanup of memory metadata stored on the inode.

Memory metadata is more appropriately stored on the inode (as opposed
to the file), since the metadata is for the memory and is not unique
to a specific binding and struct kvm.
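
As a sketch of what this split means in code terms (the helper below is
illustrative and not part of this patch): memory-wide state such as the
guest_memfd flags lives on the inode, while file->private_data holds
the per-VM struct kvm_gmem binding.

    /* Illustrative only: memory metadata is read from the inode. */
    static u64 example_gmem_flags(struct file *file)
    {
            return (u64)(unsigned long)file_inode(file)->i_private;
    }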

Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Change-Id: I5c23bce8fefe492b40b8042ece1e81448752da99
---
 include/uapi/linux/magic.h |   1 +
 virt/kvm/guest_memfd.c     | 134 +++++++++++++++++++++++++++++++------
 virt/kvm/kvm_main.c        |   7 +-
 virt/kvm/kvm_mm.h          |   9 ++-
 4 files changed, 125 insertions(+), 26 deletions(-)

diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index bb575f3ab45e..638ca21b7a90 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -103,5 +103,6 @@
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
 #define PID_FS_MAGIC		0x50494446	/* "PIDF" */
+#define GUEST_MEMFD_MAGIC	0x474d454d	/* "GMEM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index b8e247063b20..239d0f13dcc1 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -1,12 +1,16 @@
 // SPDX-License-Identifier: GPL-2.0
+#include <linux/anon_inodes.h>
 #include <linux/backing-dev.h>
 #include <linux/falloc.h>
+#include <linux/fs.h>
 #include <linux/kvm_host.h>
+#include <linux/pseudo_fs.h>
 #include <linux/pagemap.h>
-#include <linux/anon_inodes.h>
 
 #include "kvm_mm.h"
 
+static struct vfsmount *kvm_gmem_mnt;
+
 struct kvm_gmem {
 	struct kvm *kvm;
 	struct xarray bindings;
@@ -416,9 +420,51 @@ static struct file_operations kvm_gmem_fops = {
 	.fallocate	= kvm_gmem_fallocate,
 };
 
-void kvm_gmem_init(struct module *module)
+static const struct super_operations kvm_gmem_super_operations = {
+	.statfs		= simple_statfs,
+};
+
+static int kvm_gmem_init_fs_context(struct fs_context *fc)
+{
+	struct pseudo_fs_context *ctx;
+
+	if (!init_pseudo(fc, GUEST_MEMFD_MAGIC))
+		return -ENOMEM;
+
+	ctx = fc->fs_private;
+	ctx->ops = &kvm_gmem_super_operations;
+
+	return 0;
+}
+
+static struct file_system_type kvm_gmem_fs = {
+	.name		 = "kvm_guest_memory",
+	.init_fs_context = kvm_gmem_init_fs_context,
+	.kill_sb	 = kill_anon_super,
+};
+
+static int kvm_gmem_init_mount(void)
+{
+	kvm_gmem_mnt = kern_mount(&kvm_gmem_fs);
+
+	if (WARN_ON_ONCE(IS_ERR(kvm_gmem_mnt)))
+		return PTR_ERR(kvm_gmem_mnt);
+
+	kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;
+	return 0;
+}
+
+int kvm_gmem_init(struct module *module)
 {
 	kvm_gmem_fops.owner = module;
+
+	return kvm_gmem_init_mount();
+}
+
+void kvm_gmem_exit(void)
+{
+	kern_unmount(kvm_gmem_mnt);
+	kvm_gmem_mnt = NULL;
 }
 
 static int kvm_gmem_migrate_folio(struct address_space *mapping,
@@ -500,11 +546,71 @@ static const struct inode_operations kvm_gmem_iops = {
 	.setattr	= kvm_gmem_setattr,
 };
 
+static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
+						      loff_t size, u64 flags)
+{
+	struct inode *inode;
+
+	inode = alloc_anon_secure_inode(kvm_gmem_mnt->mnt_sb, name);
+	if (IS_ERR(inode))
+		return inode;
+
+	inode->i_private = (void *)(unsigned long)flags;
+	inode->i_op = &kvm_gmem_iops;
+	inode->i_mapping->a_ops = &kvm_gmem_aops;
+	inode->i_mode |= S_IFREG;
+	inode->i_size = size;
+	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+	mapping_set_inaccessible(inode->i_mapping);
+	/* Unmovable mappings are supposed to be marked unevictable as well. */
+	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
+
+	return inode;
+}
+
+static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
+						  u64 flags)
+{
+	static const char *name = "[kvm-gmem]";
+	struct inode *inode;
+	struct file *file;
+	int err;
+
+	err = -ENOENT;
+	if (!try_module_get(kvm_gmem_fops.owner))
+		goto err;
+
+	inode = kvm_gmem_inode_make_secure_inode(name, size, flags);
+	if (IS_ERR(inode)) {
+		err = PTR_ERR(inode);
+		goto err_put_module;
+	}
+
+	file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR,
+				 &kvm_gmem_fops);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_put_inode;
+	}
+
+	file->f_flags |= O_LARGEFILE;
+	file->private_data = priv;
+
+out:
+	return file;
+
+err_put_inode:
+	iput(inode);
+err_put_module:
+	module_put(kvm_gmem_fops.owner);
+err:
+	file = ERR_PTR(err);
+	goto out;
+}
+
 static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 {
-	const char *anon_name = "[kvm-gmem]";
 	struct kvm_gmem *gmem;
-	struct inode *inode;
 	struct file *file;
 	int fd, err;
 
@@ -518,32 +624,16 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 		goto err_fd;
 	}
 
-	file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem,
-					 O_RDWR, NULL);
+	file = kvm_gmem_inode_create_getfile(gmem, size, flags);
 	if (IS_ERR(file)) {
 		err = PTR_ERR(file);
 		goto err_gmem;
 	}
 
-	file->f_flags |= O_LARGEFILE;
-
-	inode = file->f_inode;
-	WARN_ON(file->f_mapping != inode->i_mapping);
-
-	inode->i_private = (void *)(unsigned long)flags;
-	inode->i_op = &kvm_gmem_iops;
-	inode->i_mapping->a_ops = &kvm_gmem_aops;
-	inode->i_mode |= S_IFREG;
-	inode->i_size = size;
-	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
-	mapping_set_inaccessible(inode->i_mapping);
-	/* Unmovable mappings are supposed to be marked unevictable as well. */
-	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
-
 	kvm_get_kvm(kvm);
 	gmem->kvm = kvm;
 	xa_init(&gmem->bindings);
-	list_add(&gmem->entry, &inode->i_mapping->i_private_list);
+	list_add(&gmem->entry, &file_inode(file)->i_mapping->i_private_list);
 
 	fd_install(fd, file);
 	return fd;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6c75f933bfbe..66dfdafbb3b6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -6419,7 +6419,9 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
 	if (WARN_ON_ONCE(r))
 		goto err_vfio;
 
-	kvm_gmem_init(module);
+	r = kvm_gmem_init(module);
+	if (r)
+		goto err_gmem;
 
 	r = kvm_init_virtualization();
 	if (r)
@@ -6440,6 +6442,8 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
 err_register:
 	kvm_uninit_virtualization();
 err_virt:
+	kvm_gmem_exit();
+err_gmem:
 	kvm_vfio_ops_exit();
 err_vfio:
 	kvm_async_pf_deinit();
@@ -6471,6 +6475,7 @@ void kvm_exit(void)
 	for_each_possible_cpu(cpu)
 		free_cpumask_var(per_cpu(cpu_kick_mask, cpu));
 	kmem_cache_destroy(kvm_vcpu_cache);
+	kvm_gmem_exit();
 	kvm_vfio_ops_exit();
 	kvm_async_pf_deinit();
 	kvm_irqfd_exit();
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index ec311c0d6718..be68c29fc4ab 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -68,17 +68,20 @@ static inline void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
 #endif /* HAVE_KVM_PFNCACHE */
 
 #ifdef CONFIG_KVM_GMEM
-void kvm_gmem_init(struct module *module);
+int kvm_gmem_init(struct module *module);
+void kvm_gmem_exit(void);
 int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args);
 int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
 		  unsigned int fd, loff_t offset);
 void kvm_gmem_unbind(struct kvm_memory_slot *slot);
 #else
-static inline void kvm_gmem_init(struct module *module)
+static inline int kvm_gmem_init(struct module *module)
 {
-
+	return 0;
 }
 
+static inline void kvm_gmem_exit(void) {};
+
 static inline int kvm_gmem_bind(struct kvm *kvm,
 					 struct kvm_memory_slot *slot,
 					 unsigned int fd, loff_t offset)
-- 
2.49.0.1045.g170613ef41-goog


* [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
  2025-05-14 23:41 ` [RFC PATCH v2 01/51] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Ackerley Tng
@ 2025-05-14 23:41 ` Ackerley Tng
  2025-05-27  3:54   ` Yan Zhao
                     ` (3 more replies)
  2025-05-14 23:41 ` [RFC PATCH v2 03/51] KVM: selftests: Update guest_memfd_test for INIT_PRIVATE flag Ackerley Tng
                   ` (53 subsequent siblings)
  55 siblings, 4 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:41 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Track guest_memfd memory's shareability status within the inode as
opposed to the file, since it is a property of the guest_memfd's memory
contents.

Shareability is a property of the memory and is indexed using the
page's index in the inode. Because shareability is the memory's
property, it is stored within guest_memfd instead of within KVM, like
in kvm->mem_attr_array.

KVM_MEMORY_ATTRIBUTE_PRIVATE in kvm->mem_attr_array must still be
retained to allow VMs to only use guest_memfd for private memory and
some other memory for shared memory.

Not all use cases require guest_memfd memory to be shared with the host
when first created. Add a new flag, GUEST_MEMFD_FLAG_INIT_PRIVATE,
which, when set on KVM_CREATE_GUEST_MEMFD, initializes the memory as
private to the guest and therefore not mappable by the
host. Otherwise, memory is shared until explicitly converted to
private.
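
As a sketch, creating a guest_memfd whose memory starts out guest-only
might look like this from userspace (assumes a linux/kvm.h with this
series applied and an already-created VM fd; error handling omitted):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int create_initially_private_gmem(int vm_fd, uint64_t size)
    {
            struct kvm_create_guest_memfd args = {
                    .size  = size,
                    .flags = GUEST_MEMFD_FLAG_SUPPORT_SHARED |
                             GUEST_MEMFD_FLAG_INIT_PRIVATE,
            };

            /* Returns a guest_memfd fd; host faults on its memory are
             * rejected until ranges are converted to shared. */
            return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
    }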

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Change-Id: If03609cbab3ad1564685c85bdba6dcbb6b240c0f
---
 Documentation/virt/kvm/api.rst |   5 ++
 include/uapi/linux/kvm.h       |   2 +
 virt/kvm/guest_memfd.c         | 124 ++++++++++++++++++++++++++++++++-
 3 files changed, 129 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 86f74ce7f12a..f609337ae1c2 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6408,6 +6408,11 @@ belonging to the slot via its userspace_addr.
 The use of GUEST_MEMFD_FLAG_SUPPORT_SHARED will not be allowed for CoCo VMs.
 This is validated when the guest_memfd instance is bound to the VM.
 
+If the capability KVM_CAP_GMEM_CONVERSIONS is supported, then the 'flags' field
+supports GUEST_MEMFD_FLAG_INIT_PRIVATE.  Setting GUEST_MEMFD_FLAG_INIT_PRIVATE
+will initialize the memory for the guest_memfd as guest-only and not faultable
+by the host.
+
 See KVM_SET_USER_MEMORY_REGION2 for additional details.
 
 4.143 KVM_PRE_FAULT_MEMORY
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 4cc824a3a7c9..d7df312479aa 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1567,7 +1567,9 @@ struct kvm_memory_attributes {
 #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
 
 #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
+
 #define GUEST_MEMFD_FLAG_SUPPORT_SHARED	(1UL << 0)
+#define GUEST_MEMFD_FLAG_INIT_PRIVATE	(1UL << 1)
 
 struct kvm_create_guest_memfd {
 	__u64 size;
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 239d0f13dcc1..590932499eba 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,7 @@
 #include <linux/falloc.h>
 #include <linux/fs.h>
 #include <linux/kvm_host.h>
+#include <linux/maple_tree.h>
 #include <linux/pseudo_fs.h>
 #include <linux/pagemap.h>
 
@@ -17,6 +18,24 @@ struct kvm_gmem {
 	struct list_head entry;
 };
 
+struct kvm_gmem_inode_private {
+#ifdef CONFIG_KVM_GMEM_SHARED_MEM
+	struct maple_tree shareability;
+#endif
+};
+
+enum shareability {
+	SHAREABILITY_GUEST = 1,	/* Only the guest can map (fault) folios in this range. */
+	SHAREABILITY_ALL = 2,	/* Both guest and host can fault folios in this range. */
+};
+
+static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index);
+
+static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
+{
+	return inode->i_mapping->i_private_data;
+}
+
 /**
  * folio_file_pfn - like folio_file_page, but return a pfn.
  * @folio: The folio which contains this index.
@@ -29,6 +48,58 @@ static inline kvm_pfn_t folio_file_pfn(struct folio *folio, pgoff_t index)
 	return folio_pfn(folio) + (index & (folio_nr_pages(folio) - 1));
 }
 
+#ifdef CONFIG_KVM_GMEM_SHARED_MEM
+
+static int kvm_gmem_shareability_setup(struct kvm_gmem_inode_private *private,
+				      loff_t size, u64 flags)
+{
+	enum shareability m;
+	pgoff_t last;
+
+	last = (size >> PAGE_SHIFT) - 1;
+	m = flags & GUEST_MEMFD_FLAG_INIT_PRIVATE ? SHAREABILITY_GUEST :
+						    SHAREABILITY_ALL;
+	return mtree_store_range(&private->shareability, 0, last, xa_mk_value(m),
+				 GFP_KERNEL);
+}
+
+static enum shareability kvm_gmem_shareability_get(struct inode *inode,
+						 pgoff_t index)
+{
+	struct maple_tree *mt;
+	void *entry;
+
+	mt = &kvm_gmem_private(inode)->shareability;
+	entry = mtree_load(mt, index);
+	WARN(!entry,
+	     "Shareability should always be defined for all indices in inode.");
+
+	return xa_to_value(entry);
+}
+
+static struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
+{
+	if (kvm_gmem_shareability_get(inode, index) != SHAREABILITY_ALL)
+		return ERR_PTR(-EACCES);
+
+	return kvm_gmem_get_folio(inode, index);
+}
+
+#else
+
+static int kvm_gmem_shareability_setup(struct kvm_gmem_inode_private *private, loff_t size, u64 flags)
+{
+	return 0;
+}
+
+static inline struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
+{
+	WARN_ONCE(1, "Unexpected call to get shared folio.");
+	return NULL;
+}
+
+#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
+
 static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
 				    pgoff_t index, struct folio *folio)
 {
@@ -333,7 +404,7 @@ static vm_fault_t kvm_gmem_fault_shared(struct vm_fault *vmf)
 
 	filemap_invalidate_lock_shared(inode->i_mapping);
 
-	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
+	folio = kvm_gmem_get_shared_folio(inode, vmf->pgoff);
 	if (IS_ERR(folio)) {
 		int err = PTR_ERR(folio);
 
@@ -420,8 +491,33 @@ static struct file_operations kvm_gmem_fops = {
 	.fallocate	= kvm_gmem_fallocate,
 };
 
+static void kvm_gmem_free_inode(struct inode *inode)
+{
+	struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
+
+	kfree(private);
+
+	free_inode_nonrcu(inode);
+}
+
+static void kvm_gmem_destroy_inode(struct inode *inode)
+{
+	struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
+
+#ifdef CONFIG_KVM_GMEM_SHARED_MEM
+	/*
+	 * mtree_destroy() can't be used within rcu callback, hence can't be
+	 * done in ->free_inode().
+	 */
+	if (private)
+		mtree_destroy(&private->shareability);
+#endif
+}
+
 static const struct super_operations kvm_gmem_super_operations = {
 	.statfs		= simple_statfs,
+	.destroy_inode	= kvm_gmem_destroy_inode,
+	.free_inode	= kvm_gmem_free_inode,
 };
 
 static int kvm_gmem_init_fs_context(struct fs_context *fc)
@@ -549,12 +645,26 @@ static const struct inode_operations kvm_gmem_iops = {
 static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 						      loff_t size, u64 flags)
 {
+	struct kvm_gmem_inode_private *private;
 	struct inode *inode;
+	int err;
 
 	inode = alloc_anon_secure_inode(kvm_gmem_mnt->mnt_sb, name);
 	if (IS_ERR(inode))
 		return inode;
 
+	err = -ENOMEM;
+	private = kzalloc(sizeof(*private), GFP_KERNEL);
+	if (!private)
+		goto out;
+
+	mt_init(&private->shareability);
+	inode->i_mapping->i_private_data = private;
+
+	err = kvm_gmem_shareability_setup(private, size, flags);
+	if (err)
+		goto out;
+
 	inode->i_private = (void *)(unsigned long)flags;
 	inode->i_op = &kvm_gmem_iops;
 	inode->i_mapping->a_ops = &kvm_gmem_aops;
@@ -566,6 +676,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
 
 	return inode;
+
+out:
+	iput(inode);
+
+	return ERR_PTR(err);
 }
 
 static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
@@ -654,6 +769,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 	if (kvm_arch_vm_supports_gmem_shared_mem(kvm))
 		valid_flags |= GUEST_MEMFD_FLAG_SUPPORT_SHARED;
 
+	if (flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED)
+		valid_flags |= GUEST_MEMFD_FLAG_INIT_PRIVATE;
+
 	if (flags & ~valid_flags)
 		return -EINVAL;
 
@@ -842,6 +960,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	if (!file)
 		return -EFAULT;
 
+	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
+
 	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
 	if (IS_ERR(folio)) {
 		r = PTR_ERR(folio);
@@ -857,8 +977,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 		*page = folio_file_page(folio, index);
 	else
 		folio_put(folio);
-
 out:
+	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
 	fput(file);
 	return r;
 }
-- 
2.49.0.1045.g170613ef41-goog


* [RFC PATCH v2 03/51] KVM: selftests: Update guest_memfd_test for INIT_PRIVATE flag
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
  2025-05-14 23:41 ` [RFC PATCH v2 01/51] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Ackerley Tng
  2025-05-14 23:41 ` [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting Ackerley Tng
@ 2025-05-14 23:41 ` Ackerley Tng
  2025-05-15 13:49   ` Ira Weiny
  2025-05-14 23:41 ` [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls Ackerley Tng
                   ` (52 subsequent siblings)
  55 siblings, 1 reply; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:41 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Test that GUEST_MEMFD_FLAG_INIT_PRIVATE is only valid when
GUEST_MEMFD_FLAG_SUPPORT_SHARED is set.

Change-Id: I506e236a232047cfaee17bcaed02ee14c8d25bbb
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../testing/selftests/kvm/guest_memfd_test.c  | 36 ++++++++++++-------
 1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index 60aaba5808a5..bf2876cbd711 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -401,13 +401,31 @@ static void test_with_type(unsigned long vm_type, uint64_t guest_memfd_flags,
 	kvm_vm_release(vm);
 }
 
+static void test_vm_with_gmem_flag(struct kvm_vm *vm, uint64_t flag,
+				   bool expect_valid)
+{
+	size_t page_size = getpagesize();
+	int fd;
+
+	fd = __vm_create_guest_memfd(vm, page_size, flag);
+
+	if (expect_valid) {
+		TEST_ASSERT(fd > 0,
+			    "guest_memfd() with flag '0x%lx' should be valid",
+			    flag);
+		close(fd);
+	} else {
+		TEST_ASSERT(fd == -1 && errno == EINVAL,
+			    "guest_memfd() with flag '0x%lx' should fail with EINVAL",
+			    flag);
+	}
+}
+
 static void test_vm_type_gmem_flag_validity(unsigned long vm_type,
 					    uint64_t expected_valid_flags)
 {
-	size_t page_size = getpagesize();
 	struct kvm_vm *vm;
 	uint64_t flag = 0;
-	int fd;
 
 	if (!(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(vm_type)))
 		return;
@@ -415,17 +433,11 @@ static void test_vm_type_gmem_flag_validity(unsigned long vm_type,
 	vm = vm_create_barebones_type(vm_type);
 
 	for (flag = BIT(0); flag; flag <<= 1) {
-		fd = __vm_create_guest_memfd(vm, page_size, flag);
+		test_vm_with_gmem_flag(vm, flag, flag & expected_valid_flags);
 
-		if (flag & expected_valid_flags) {
-			TEST_ASSERT(fd > 0,
-				    "guest_memfd() with flag '0x%lx' should be valid",
-				    flag);
-			close(fd);
-		} else {
-			TEST_ASSERT(fd == -1 && errno == EINVAL,
-				    "guest_memfd() with flag '0x%lx' should fail with EINVAL",
-				    flag);
+		if (flag == GUEST_MEMFD_FLAG_SUPPORT_SHARED) {
+			test_vm_with_gmem_flag(
+				vm, flag | GUEST_MEMFD_FLAG_INIT_PRIVATE, true);
 		}
 	}
 
-- 
2.49.0.1045.g170613ef41-goog


* [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (2 preceding siblings ...)
  2025-05-14 23:41 ` [RFC PATCH v2 03/51] KVM: selftests: Update guest_memfd_test for INIT_PRIVATE flag Ackerley Tng
@ 2025-05-14 23:41 ` Ackerley Tng
  2025-05-15 14:50   ` Ira Weiny
                     ` (2 more replies)
  2025-05-14 23:41 ` [RFC PATCH v2 05/51] KVM: guest_memfd: Skip LRU for guest_memfd folios Ackerley Tng
                   ` (51 subsequent siblings)
  55 siblings, 3 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:41 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

The two new guest_memfd ioctls KVM_GMEM_CONVERT_SHARED and
KVM_GMEM_CONVERT_PRIVATE convert the requested memory ranges to shared
and private respectively.

A guest_memfd ioctl is used because shareability is a property of the
memory, and this property should be modifiable independently of the
attached struct kvm. This allows shareability to be modified even if
the memory is not yet bound using memslots.

For shared to private conversions, if refcounts on any of the folios
within the range are elevated, fail the conversion with -EAGAIN.

At the point of shared to private conversion, all folios in the range
are also unmapped. The filemap_invalidate_lock() is held, so no
faulting can occur. Hence, from that point on, only transient refcounts
can be taken on the folios associated with that guest_memfd.

It is therefore safe to do the conversion from shared to private.

After conversion is complete, refcounts may become elevated, but that
is fine since users of transient refcounts don't actually access
memory.

For private to shared conversions, there are no refcount checks; any
holders of transient refcounts are expected to drop their references
soon. The conversion process will spin, waiting for these transient
refcounts to go away.
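
Userspace usage of the new ioctls might look like this (a sketch only;
gmem_fd is a guest_memfd created with GUEST_MEMFD_FLAG_SUPPORT_SHARED,
and offset/size are page-aligned):

    #include <errno.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int convert_to_private(int gmem_fd, uint64_t offset, uint64_t size)
    {
            /* error_offset must be zero on input; it is filled in on error. */
            struct kvm_gmem_convert param = { .offset = offset, .size = size };
            int ret = ioctl(gmem_fd, KVM_GMEM_CONVERT_PRIVATE, &param);

            if (ret < 0 && errno == EAGAIN)
                    fprintf(stderr, "unexpected refcount at offset 0x%llx\n",
                            (unsigned long long)param.error_offset);
            return ret;
    }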

Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Change-Id: I3546aaf6c1b795de6dc9ba09e816b64934221918
---
 include/uapi/linux/kvm.h |  11 ++
 virt/kvm/guest_memfd.c   | 357 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 366 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d7df312479aa..5b28e17f6f14 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1577,6 +1577,17 @@ struct kvm_create_guest_memfd {
 	__u64 reserved[6];
 };
 
+#define KVM_GMEM_IO 0xAF
+#define KVM_GMEM_CONVERT_SHARED		_IOWR(KVM_GMEM_IO,  0x41, struct kvm_gmem_convert)
+#define KVM_GMEM_CONVERT_PRIVATE	_IOWR(KVM_GMEM_IO,  0x42, struct kvm_gmem_convert)
+
+struct kvm_gmem_convert {
+	__u64 offset;
+	__u64 size;
+	__u64 error_offset;
+	__u64 reserved[5];
+};
+
 #define KVM_PRE_FAULT_MEMORY	_IOWR(KVMIO, 0xd5, struct kvm_pre_fault_memory)
 
 struct kvm_pre_fault_memory {
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 590932499eba..f802116290ce 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -30,6 +30,10 @@ enum shareability {
 };
 
 static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index);
+static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
+				      pgoff_t end);
+static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
+				    pgoff_t end);
 
 static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
 {
@@ -85,6 +89,306 @@ static struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t inde
 	return kvm_gmem_get_folio(inode, index);
 }
 
+/**
+ * kvm_gmem_shareability_store() - Sets shareability to @value for range.
+ *
+ * @mt: the shareability maple tree.
+ * @index: the range begins at this index in the inode.
+ * @nr_pages: number of PAGE_SIZE pages in this range.
+ * @value: the shareability value to set for this range.
+ *
+ * Unlike mtree_store_range(), this function also merges adjacent ranges that
+ * have the same values as an optimization. Assumes that all stores to @mt go
+ * through this function, such that adjacent ranges are always merged.
+ *
+ * Return: 0 on success and negative error otherwise.
+ */
+static int kvm_gmem_shareability_store(struct maple_tree *mt, pgoff_t index,
+				       size_t nr_pages, enum shareability value)
+{
+	MA_STATE(mas, mt, 0, 0);
+	unsigned long start;
+	unsigned long last;
+	void *entry;
+	int ret;
+
+	start = index;
+	last = start + nr_pages - 1;
+
+	mas_lock(&mas);
+
+	/* Try extending range. entry is NULL on overflow/wrap-around. */
+	mas_set_range(&mas, last + 1, last + 1);
+	entry = mas_find(&mas, last + 1);
+	if (entry && xa_to_value(entry) == value)
+		last = mas.last;
+
+	mas_set_range(&mas, start - 1, start - 1);
+	entry = mas_find(&mas, start - 1);
+	if (entry && xa_to_value(entry) == value)
+		start = mas.index;
+
+	mas_set_range(&mas, start, last);
+	ret = mas_store_gfp(&mas, xa_mk_value(value), GFP_KERNEL);
+
+	mas_unlock(&mas);
+
+	return ret;
+}
+
+struct conversion_work {
+	struct list_head list;
+	pgoff_t start;
+	size_t nr_pages;
+};
+
+static int add_to_work_list(struct list_head *list, pgoff_t start, pgoff_t last)
+{
+	struct conversion_work *work;
+
+	work = kzalloc(sizeof(*work), GFP_KERNEL);
+	if (!work)
+		return -ENOMEM;
+
+	work->start = start;
+	work->nr_pages = last + 1 - start;
+
+	list_add_tail(&work->list, list);
+
+	return 0;
+}
+
+static bool kvm_gmem_has_safe_refcount(struct address_space *mapping, pgoff_t start,
+				       size_t nr_pages, pgoff_t *error_index)
+{
+	const int filemap_get_folios_refcount = 1;
+	struct folio_batch fbatch;
+	bool refcount_safe;
+	pgoff_t last;
+	int i;
+
+	last = start + nr_pages - 1;
+	refcount_safe = true;
+
+	folio_batch_init(&fbatch);
+	while (refcount_safe &&
+	       filemap_get_folios(mapping, &start, last, &fbatch)) {
+
+		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
+			int filemap_refcount;
+			int safe_refcount;
+			struct folio *f;
+
+			f = fbatch.folios[i];
+			filemap_refcount = folio_nr_pages(f);
+
+			safe_refcount = filemap_refcount + filemap_get_folios_refcount;
+			if (folio_ref_count(f) != safe_refcount) {
+				refcount_safe = false;
+				*error_index = f->index;
+				break;
+			}
+		}
+
+		folio_batch_release(&fbatch);
+	}
+
+	return refcount_safe;
+}
+
+static int kvm_gmem_shareability_apply(struct inode *inode,
+				       struct conversion_work *work,
+				       enum shareability m)
+{
+	struct maple_tree *mt;
+
+	mt = &kvm_gmem_private(inode)->shareability;
+	return kvm_gmem_shareability_store(mt, work->start, work->nr_pages, m);
+}
+
+static int kvm_gmem_convert_compute_work(struct inode *inode, pgoff_t start,
+					 size_t nr_pages, enum shareability m,
+					 struct list_head *work_list)
+{
+	struct maple_tree *mt;
+	struct ma_state mas;
+	pgoff_t last;
+	void *entry;
+	int ret;
+
+	last = start + nr_pages - 1;
+
+	mt = &kvm_gmem_private(inode)->shareability;
+	ret = 0;
+
+	mas_init(&mas, mt, start);
+
+	rcu_read_lock();
+	mas_for_each(&mas, entry, last) {
+		enum shareability current_m;
+		pgoff_t m_range_index;
+		pgoff_t m_range_last;
+
+		m_range_index = max(mas.index, start);
+		m_range_last = min(mas.last, last);
+
+		current_m = xa_to_value(entry);
+		if (m == current_m)
+			continue;
+
+		mas_pause(&mas);
+		rcu_read_unlock();
+		/* Caller will clean this up on error. */
+		ret = add_to_work_list(work_list, m_range_index, m_range_last);
+		rcu_read_lock();
+		if (ret)
+			break;
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static void kvm_gmem_convert_invalidate_begin(struct inode *inode,
+					      struct conversion_work *work)
+{
+	struct list_head *gmem_list;
+	struct kvm_gmem *gmem;
+	pgoff_t end;
+
+	end = work->start + work->nr_pages;
+
+	gmem_list = &inode->i_mapping->i_private_list;
+	list_for_each_entry(gmem, gmem_list, entry)
+		kvm_gmem_invalidate_begin(gmem, work->start, end);
+}
+
+static void kvm_gmem_convert_invalidate_end(struct inode *inode,
+					    struct conversion_work *work)
+{
+	struct list_head *gmem_list;
+	struct kvm_gmem *gmem;
+	pgoff_t end;
+
+	end = work->start + work->nr_pages;
+
+	gmem_list = &inode->i_mapping->i_private_list;
+	list_for_each_entry(gmem, gmem_list, entry)
+		kvm_gmem_invalidate_end(gmem, work->start, end);
+}
+
+static int kvm_gmem_convert_should_proceed(struct inode *inode,
+					   struct conversion_work *work,
+					   bool to_shared, pgoff_t *error_index)
+{
+	if (!to_shared) {
+		unmap_mapping_pages(inode->i_mapping, work->start,
+				    work->nr_pages, false);
+
+		if (!kvm_gmem_has_safe_refcount(inode->i_mapping, work->start,
+						work->nr_pages, error_index)) {
+			return -EAGAIN;
+		}
+	}
+
+	return 0;
+}
+
+static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
+				  size_t nr_pages, bool shared,
+				  pgoff_t *error_index)
+{
+	struct conversion_work *work, *tmp, *rollback_stop_item;
+	LIST_HEAD(work_list);
+	struct inode *inode;
+	enum shareability m;
+	int ret;
+
+	inode = file_inode(file);
+
+	filemap_invalidate_lock(inode->i_mapping);
+
+	m = shared ? SHAREABILITY_ALL : SHAREABILITY_GUEST;
+	ret = kvm_gmem_convert_compute_work(inode, start, nr_pages, m, &work_list);
+	if (ret || list_empty(&work_list))
+		goto out;
+
+	list_for_each_entry(work, &work_list, list)
+		kvm_gmem_convert_invalidate_begin(inode, work);
+
+	list_for_each_entry(work, &work_list, list) {
+		ret = kvm_gmem_convert_should_proceed(inode, work, shared,
+						      error_index);
+		if (ret)
+			goto invalidate_end;
+	}
+
+	list_for_each_entry(work, &work_list, list) {
+		rollback_stop_item = work;
+		ret = kvm_gmem_shareability_apply(inode, work, m);
+		if (ret)
+			break;
+	}
+
+	if (ret) {
+		m = shared ? SHAREABILITY_GUEST : SHAREABILITY_ALL;
+		list_for_each_entry(work, &work_list, list) {
+			if (work == rollback_stop_item)
+				break;
+
+			WARN_ON(kvm_gmem_shareability_apply(inode, work, m));
+		}
+	}
+
+invalidate_end:
+	list_for_each_entry(work, &work_list, list)
+		kvm_gmem_convert_invalidate_end(inode, work);
+out:
+	filemap_invalidate_unlock(inode->i_mapping);
+
+	list_for_each_entry_safe(work, tmp, &work_list, list) {
+		list_del(&work->list);
+		kfree(work);
+	}
+
+	return ret;
+}
+
+static int kvm_gmem_ioctl_convert_range(struct file *file,
+					struct kvm_gmem_convert *param,
+					bool shared)
+{
+	pgoff_t error_index;
+	size_t nr_pages;
+	pgoff_t start;
+	int ret;
+
+	if (param->error_offset)
+		return -EINVAL;
+
+	if (param->size == 0)
+		return 0;
+
+	if (param->offset + param->size < param->offset ||
+	    param->offset > file_inode(file)->i_size ||
+	    param->offset + param->size > file_inode(file)->i_size)
+		return -EINVAL;
+
+	if (!IS_ALIGNED(param->offset, PAGE_SIZE) ||
+	    !IS_ALIGNED(param->size, PAGE_SIZE))
+		return -EINVAL;
+
+	start = param->offset >> PAGE_SHIFT;
+	nr_pages = param->size >> PAGE_SHIFT;
+
+	ret = kvm_gmem_convert_range(file, start, nr_pages, shared, &error_index);
+	if (ret)
+		param->error_offset = error_index << PAGE_SHIFT;
+
+	return ret;
+}
+
 #else
 
 static int kvm_gmem_shareability_setup(struct kvm_gmem_inode_private *private, loff_t size, u64 flags)
@@ -186,15 +490,26 @@ static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
 	unsigned long index;
 
 	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
+		enum kvm_gfn_range_filter filter;
 		pgoff_t pgoff = slot->gmem.pgoff;
 
+		filter = KVM_FILTER_PRIVATE;
+		if (kvm_gmem_memslot_supports_shared(slot)) {
+			/*
+			 * Unmapping would also cause invalidation, but cannot
+			 * rely on mmu_notifiers to do invalidation via
+			 * unmapping, since memory may not be mapped to
+			 * userspace.
+			 */
+			filter |= KVM_FILTER_SHARED;
+		}
+
 		struct kvm_gfn_range gfn_range = {
 			.start = slot->base_gfn + max(pgoff, start) - pgoff,
 			.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
 			.slot = slot,
 			.may_block = true,
-			/* guest memfd is relevant to only private mappings. */
-			.attr_filter = KVM_FILTER_PRIVATE,
+			.attr_filter = filter,
 		};
 
 		if (!found_memslot) {
@@ -484,11 +799,49 @@ EXPORT_SYMBOL_GPL(kvm_gmem_memslot_supports_shared);
 #define kvm_gmem_mmap NULL
 #endif /* CONFIG_KVM_GMEM_SHARED_MEM */
 
+static long kvm_gmem_ioctl(struct file *file, unsigned int ioctl,
+			   unsigned long arg)
+{
+	void __user *argp;
+	int r;
+
+	argp = (void __user *)arg;
+
+	switch (ioctl) {
+#ifdef CONFIG_KVM_GMEM_SHARED_MEM
+	case KVM_GMEM_CONVERT_SHARED:
+	case KVM_GMEM_CONVERT_PRIVATE: {
+		struct kvm_gmem_convert param;
+		bool to_shared;
+
+		r = -EFAULT;
+		if (copy_from_user(&param, argp, sizeof(param)))
+			goto out;
+
+		to_shared = ioctl == KVM_GMEM_CONVERT_SHARED;
+		r = kvm_gmem_ioctl_convert_range(file, &param, to_shared);
+		if (r) {
+			if (copy_to_user(argp, &param, sizeof(param))) {
+				r = -EFAULT;
+				goto out;
+			}
+		}
+		break;
+	}
+#endif
+	default:
+		r = -ENOTTY;
+	}
+out:
+	return r;
+}
+
 static struct file_operations kvm_gmem_fops = {
 	.mmap		= kvm_gmem_mmap,
 	.open		= generic_file_open,
 	.release	= kvm_gmem_release,
 	.fallocate	= kvm_gmem_fallocate,
+	.unlocked_ioctl	= kvm_gmem_ioctl,
 };
 
 static void kvm_gmem_free_inode(struct inode *inode)
-- 
2.49.0.1045.g170613ef41-goog


* [RFC PATCH v2 05/51] KVM: guest_memfd: Skip LRU for guest_memfd folios
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (3 preceding siblings ...)
  2025-05-14 23:41 ` [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls Ackerley Tng
@ 2025-05-14 23:41 ` Ackerley Tng
  2025-05-28  7:01   ` Binbin Wu
  2025-05-14 23:41 ` [RFC PATCH v2 06/51] KVM: Query guest_memfd for private/shared status Ackerley Tng
                   ` (50 subsequent siblings)
  55 siblings, 1 reply; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:41 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

filemap_add_folio(), called from filemap_grab_folio(), adds the folio
onto some LRU list, which is not necessary for guest_memfd since
guest_memfd folios don't participate in any swapping.

This patch reimplements part of filemap_add_folio() so that allocated
guest_memfd folios are added to the filemap without being put on any
LRU list.

With shared to private conversions dependent on refcounts, avoiding
usage of LRU ensures that LRU lists no longer take any refcounts on
guest_memfd folios and significantly reduces the chance of elevated
refcounts during conversion.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Change-Id: Ia2540d9fc132d46219e6e714fd42bc82a62a27fa
---
 mm/filemap.c           |  1 +
 mm/memcontrol.c        |  2 +
 virt/kvm/guest_memfd.c | 91 ++++++++++++++++++++++++++++++++++++++----
 3 files changed, 86 insertions(+), 8 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 7b90cbeb4a1a..bed7160db214 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -954,6 +954,7 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 	return xas_error(&xas);
 }
 ALLOW_ERROR_INJECTION(__filemap_add_folio, ERRNO);
+EXPORT_SYMBOL_GPL(__filemap_add_folio);
 
 int filemap_add_folio(struct address_space *mapping, struct folio *folio,
 				pgoff_t index, gfp_t gfp)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c96c1f2b9cf5..1def80570738 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4611,6 +4611,7 @@ int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp)
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(__mem_cgroup_charge);
 
 /**
  * mem_cgroup_charge_hugetlb - charge the memcg for a hugetlb folio
@@ -4785,6 +4786,7 @@ void __mem_cgroup_uncharge(struct folio *folio)
 	uncharge_folio(folio, &ug);
 	uncharge_batch(&ug);
 }
+EXPORT_SYMBOL_GPL(__mem_cgroup_uncharge);
 
 void __mem_cgroup_uncharge_folios(struct folio_batch *folios)
 {
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index f802116290ce..6f6c4d298f8f 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -466,6 +466,38 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
 	return r;
 }
 
+static int __kvm_gmem_filemap_add_folio(struct address_space *mapping,
+					struct folio *folio, pgoff_t index)
+{
+	void *shadow = NULL;
+	gfp_t gfp;
+	int ret;
+
+	gfp = mapping_gfp_mask(mapping);
+
+	__folio_set_locked(folio);
+	ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow);
+	__folio_clear_locked(folio);
+
+	return ret;
+}
+
+/*
+ * Adds a folio to the filemap for guest_memfd. Skips adding the folio to any
+ * LRU list.
+ */
+static int kvm_gmem_filemap_add_folio(struct address_space *mapping,
+					     struct folio *folio, pgoff_t index)
+{
+	int ret;
+
+	ret = __kvm_gmem_filemap_add_folio(mapping, folio, index);
+	if (!ret)
+		folio_set_unevictable(folio);
+
+	return ret;
+}
+
 /*
  * Returns a locked folio on success.  The caller is responsible for
  * setting the up-to-date flag before the memory is mapped into the guest.
@@ -477,8 +509,46 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
  */
 static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 {
+	struct folio *folio;
+	gfp_t gfp;
+	int ret;
+
+repeat:
+	folio = filemap_lock_folio(inode->i_mapping, index);
+	if (!IS_ERR(folio))
+		return folio;
+
+	gfp = mapping_gfp_mask(inode->i_mapping);
+
 	/* TODO: Support huge pages. */
-	return filemap_grab_folio(inode->i_mapping, index);
+	folio = filemap_alloc_folio(gfp, 0);
+	if (!folio)
+		return ERR_PTR(-ENOMEM);
+
+	ret = mem_cgroup_charge(folio, NULL, gfp);
+	if (ret) {
+		folio_put(folio);
+		return ERR_PTR(ret);
+	}
+
+	ret = kvm_gmem_filemap_add_folio(inode->i_mapping, folio, index);
+	if (ret) {
+		folio_put(folio);
+
+		/*
+		 * There was a race: two threads tried to install a folio at
+		 * the same index in the filemap. The losing thread should
+		 * free its allocated folio, then lock the folio added to the
+		 * filemap by the winning thread.
+		 */
+		if (ret == -EEXIST)
+			goto repeat;
+
+		return ERR_PTR(ret);
+	}
+
+	__folio_set_locked(folio);
+	return folio;
 }
 
 static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
@@ -956,23 +1026,28 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
 }
 
 #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
+static void kvm_gmem_invalidate(struct folio *folio)
+{
+	kvm_pfn_t pfn = folio_pfn(folio);
+
+	kvm_arch_gmem_invalidate(pfn, pfn + folio_nr_pages(folio));
+}
+#else
+static inline void kvm_gmem_invalidate(struct folio *folio) {}
+#endif
+
 static void kvm_gmem_free_folio(struct folio *folio)
 {
-	struct page *page = folio_page(folio, 0);
-	kvm_pfn_t pfn = page_to_pfn(page);
-	int order = folio_order(folio);
+	folio_clear_unevictable(folio);
 
-	kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
+	kvm_gmem_invalidate(folio);
 }
-#endif
 
 static const struct address_space_operations kvm_gmem_aops = {
 	.dirty_folio = noop_dirty_folio,
 	.migrate_folio	= kvm_gmem_migrate_folio,
 	.error_remove_folio = kvm_gmem_error_folio,
-#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
 	.free_folio = kvm_gmem_free_folio,
-#endif
 };
 
 static int kvm_gmem_getattr(struct mnt_idmap *idmap, const struct path *path,
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 06/51] KVM: Query guest_memfd for private/shared status
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (4 preceding siblings ...)
  2025-05-14 23:41 ` [RFC PATCH v2 05/51] KVM: guest_memfd: Skip LRU for guest_memfd folios Ackerley Tng
@ 2025-05-14 23:41 ` Ackerley Tng
  2025-05-27  3:55   ` Yan Zhao
  2025-05-14 23:41 ` [RFC PATCH v2 07/51] KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION Ackerley Tng
                   ` (49 subsequent siblings)
  55 siblings, 1 reply; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:41 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Query guest_memfd for private/shared status when the guest_memfd
tracks that status itself.

With this patch, CoCo VMs can use guest_memfd for both shared and
private memory. If a CoCo VM chooses to do so, by creating the
guest_memfd with the GUEST_MEMFD_FLAG_SUPPORT_SHARED flag, guest_memfd
provides the private/shared status of the memory instead of
kvm->mem_attr_array.

Change-Id: I8f23d7995c12242aa4e09ccf5ec19360e9c9ed83
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/linux/kvm_host.h | 19 ++++++++++++-------
 virt/kvm/guest_memfd.c   | 22 ++++++++++++++++++++++
 2 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b317392453a5..91279e05e010 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2508,12 +2508,22 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 }
 
 #ifdef CONFIG_KVM_GMEM_SHARED_MEM
+
 bool kvm_gmem_memslot_supports_shared(const struct kvm_memory_slot *slot);
+bool kvm_gmem_is_private(struct kvm_memory_slot *slot, gfn_t gfn);
+
 #else
+
 static inline bool kvm_gmem_memslot_supports_shared(const struct kvm_memory_slot *slot)
 {
 	return false;
 }
+
+static inline bool kvm_gmem_is_private(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+	return false;
+}
+
 #endif /* CONFIG_KVM_GMEM_SHARED_MEM */
 
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
@@ -2544,13 +2554,8 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 		return false;
 
 	slot = gfn_to_memslot(kvm, gfn);
-	if (kvm_slot_has_gmem(slot) && kvm_gmem_memslot_supports_shared(slot)) {
-		/*
-		 * For now, memslots only support in-place shared memory if the
-		 * host is allowed to mmap memory (i.e., non-Coco VMs).
-		 */
-		return false;
-	}
+	if (kvm_slot_has_gmem(slot) && kvm_gmem_memslot_supports_shared(slot))
+		return kvm_gmem_is_private(slot, gfn);
 
 	return kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
 }
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 6f6c4d298f8f..853e989bdcb2 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -865,6 +865,28 @@ bool kvm_gmem_memslot_supports_shared(const struct kvm_memory_slot *slot)
 }
 EXPORT_SYMBOL_GPL(kvm_gmem_memslot_supports_shared);
 
+bool kvm_gmem_is_private(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+	struct inode *inode;
+	struct file *file;
+	pgoff_t index;
+	bool ret;
+
+	file = kvm_gmem_get_file(slot);
+	if (!file)
+		return false;
+
+	index = kvm_gmem_get_index(slot, gfn);
+	inode = file_inode(file);
+
+	filemap_invalidate_lock_shared(inode->i_mapping);
+	ret = kvm_gmem_shareability_get(inode, index) == SHAREABILITY_GUEST;
+	filemap_invalidate_unlock_shared(inode->i_mapping);
+
+	fput(file);
+	return ret;
+}
+
 #else
 #define kvm_gmem_mmap NULL
 #endif /* CONFIG_KVM_GMEM_SHARED_MEM */
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 07/51] KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (5 preceding siblings ...)
  2025-05-14 23:41 ` [RFC PATCH v2 06/51] KVM: Query guest_memfd for private/shared status Ackerley Tng
@ 2025-05-14 23:41 ` Ackerley Tng
  2025-05-14 23:41 ` [RFC PATCH v2 08/51] KVM: selftests: Test flag validity after guest_memfd supports conversions Ackerley Tng
                   ` (48 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:41 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

KVM_CAP_GMEM_CONVERSION indicates that guest_memfd supports
conversion.

With this patch, as long as guest_memfd supports shared memory, it
also supports conversion.

Conversion support brings tracking of private/shared status within
guest_memfd, hence all VM types now support shared memory in
guest_memfd.

Before this patch, CoCo VMs did not support shared memory in
guest_memfd because that would have made private memory accessible to
the host. CoCo VMs now support shared memory because, with
private/shared status tracked in guest_memfd, private memory is never
allowed to be mapped by the host.
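
As a rough sketch (not from this series) of how userspace might probe
for the capability before creating a conversion-capable guest_memfd,
where vm_fd and guest_memfd_size are placeholders, the usual kvm/ioctl
headers are assumed, and error handling is omitted:

  struct kvm_create_guest_memfd args = {
          .size = guest_memfd_size,
          .flags = GUEST_MEMFD_FLAG_SUPPORT_SHARED,
  };
  int gmem_fd = -1;

  /* Only request shared-capable guest_memfd if conversion is supported. */
  if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_GMEM_CONVERSION) > 0)
          gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);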

Change-Id: I057b7bd267dd84a93fdee2e95cceb88cd9dfc647
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 arch/arm64/include/asm/kvm_host.h |  5 -----
 arch/x86/include/asm/kvm_host.h   | 10 ----------
 include/linux/kvm_host.h          | 13 -------------
 include/uapi/linux/kvm.h          |  1 +
 virt/kvm/guest_memfd.c            | 12 ++++--------
 virt/kvm/kvm_main.c               |  3 ++-
 6 files changed, 7 insertions(+), 37 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 2514779f5131..7df673a71ade 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1598,9 +1598,4 @@ static inline bool kvm_arch_supports_gmem(struct kvm *kvm)
 	return IS_ENABLED(CONFIG_KVM_GMEM);
 }
 
-static inline bool kvm_arch_vm_supports_gmem_shared_mem(struct kvm *kvm)
-{
-	return IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM);
-}
-
 #endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f72722949cae..709cc2a7ba66 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2255,18 +2255,8 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 
 #ifdef CONFIG_KVM_GMEM
 #define kvm_arch_supports_gmem(kvm) ((kvm)->arch.supports_gmem)
-
-/*
- * CoCo VMs with hardware support that use guest_memfd only for backing private
- * memory, e.g., TDX, cannot use guest_memfd with userspace mapping enabled.
- */
-#define kvm_arch_vm_supports_gmem_shared_mem(kvm)			\
-	(IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM) &&			\
-	 ((kvm)->arch.vm_type == KVM_X86_SW_PROTECTED_VM ||		\
-	  (kvm)->arch.vm_type == KVM_X86_DEFAULT_VM))
 #else
 #define kvm_arch_supports_gmem(kvm) false
-#define kvm_arch_vm_supports_gmem_shared_mem(kvm) false
 #endif
 
 #define kvm_arch_has_readonly_mem(kvm) (!(kvm)->arch.has_protected_state)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 91279e05e010..d703f291f467 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -729,19 +729,6 @@ static inline bool kvm_arch_supports_gmem(struct kvm *kvm)
 }
 #endif
 
-/*
- * Returns true if this VM supports shared mem in guest_memfd.
- *
- * Arch code must define kvm_arch_vm_supports_gmem_shared_mem if support for
- * guest_memfd is enabled.
- */
-#if !defined(kvm_arch_vm_supports_gmem_shared_mem) && !IS_ENABLED(CONFIG_KVM_GMEM)
-static inline bool kvm_arch_vm_supports_gmem_shared_mem(struct kvm *kvm)
-{
-	return false;
-}
-#endif
-
 #ifndef kvm_arch_has_readonly_mem
 static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
 {
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 5b28e17f6f14..433e184f83ea 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -931,6 +931,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_X86_GUEST_MODE 238
 #define KVM_CAP_ARM_WRITABLE_IMP_ID_REGS 239
 #define KVM_CAP_GMEM_SHARED_MEM 240
+#define KVM_CAP_GMEM_CONVERSION 241
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 853e989bdcb2..8c9c9e54616b 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -1216,7 +1216,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 	u64 flags = args->flags;
 	u64 valid_flags = 0;
 
-	if (kvm_arch_vm_supports_gmem_shared_mem(kvm))
+	if (IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM))
 		valid_flags |= GUEST_MEMFD_FLAG_SUPPORT_SHARED;
 
 	if (flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED)
@@ -1286,13 +1286,9 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
 	    offset + size > i_size_read(inode))
 		goto err;
 
-	if (kvm_gmem_supports_shared(inode)) {
-		if (!kvm_arch_vm_supports_gmem_shared_mem(kvm))
-			goto err;
-
-		if (slot->userspace_addr &&
-		    !kvm_gmem_is_same_range(kvm, slot, file, offset))
-			goto err;
+	if (kvm_gmem_supports_shared(inode) && slot->userspace_addr &&
+	    !kvm_gmem_is_same_range(kvm, slot, file, offset)) {
+		goto err;
 	}
 
 	filemap_invalidate_lock(inode->i_mapping);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 66dfdafbb3b6..92054b1bbd3f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4843,7 +4843,8 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #endif
 #ifdef CONFIG_KVM_GMEM_SHARED_MEM
 	case KVM_CAP_GMEM_SHARED_MEM:
-		return !kvm || kvm_arch_vm_supports_gmem_shared_mem(kvm);
+	case KVM_CAP_GMEM_CONVERSION:
+		return true;
 #endif
 	default:
 		break;
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 08/51] KVM: selftests: Test flag validity after guest_memfd supports conversions
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (6 preceding siblings ...)
  2025-05-14 23:41 ` [RFC PATCH v2 07/51] KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION Ackerley Tng
@ 2025-05-14 23:41 ` Ackerley Tng
  2025-05-14 23:41 ` [RFC PATCH v2 09/51] KVM: selftests: Test faulting with respect to GUEST_MEMFD_FLAG_INIT_PRIVATE Ackerley Tng
                   ` (47 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:41 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

On kernels where guest_memfd does not support conversions, CoCo VMs
must not allow GUEST_MEMFD_FLAG_SUPPORT_SHARED.

Because this is a platform stability requirement for hosts running
CoCo VMs, this is an important test to retain.

Change-Id: I7a42a7d22e96adf17db3dcaedac6b175a36a0eab
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../testing/selftests/kvm/guest_memfd_test.c  | 26 ++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index bf2876cbd711..51d88acdf072 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -435,7 +435,8 @@ static void test_vm_type_gmem_flag_validity(unsigned long vm_type,
 	for (flag = BIT(0); flag; flag <<= 1) {
 		test_vm_with_gmem_flag(vm, flag, flag & expected_valid_flags);
 
-		if (flag == GUEST_MEMFD_FLAG_SUPPORT_SHARED) {
+		if (flag == GUEST_MEMFD_FLAG_SUPPORT_SHARED &&
+		    kvm_has_cap(KVM_CAP_GMEM_CONVERSION)) {
 			test_vm_with_gmem_flag(
 				vm, flag | GUEST_MEMFD_FLAG_INIT_PRIVATE, true);
 		}
@@ -444,7 +445,7 @@ static void test_vm_type_gmem_flag_validity(unsigned long vm_type,
 	kvm_vm_release(vm);
 }
 
-static void test_gmem_flag_validity(void)
+static void test_gmem_flag_validity_without_conversion_cap(void)
 {
 	uint64_t non_coco_vm_valid_flags = 0;
 
@@ -462,11 +463,30 @@ static void test_gmem_flag_validity(void)
 #endif
 }
 
+static void test_gmem_flag_validity(void)
+{
+	/* After conversions are supported, all VM types support shared mem. */
+	uint64_t valid_flags = GUEST_MEMFD_FLAG_SUPPORT_SHARED;
+
+	test_vm_type_gmem_flag_validity(VM_TYPE_DEFAULT, valid_flags);
+
+#ifdef __x86_64__
+	test_vm_type_gmem_flag_validity(KVM_X86_SW_PROTECTED_VM, valid_flags);
+	test_vm_type_gmem_flag_validity(KVM_X86_SEV_VM, valid_flags);
+	test_vm_type_gmem_flag_validity(KVM_X86_SEV_ES_VM, valid_flags);
+	test_vm_type_gmem_flag_validity(KVM_X86_SNP_VM, valid_flags);
+	test_vm_type_gmem_flag_validity(KVM_X86_TDX_VM, valid_flags);
+#endif
+}
+
 int main(int argc, char *argv[])
 {
 	TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
 
-	test_gmem_flag_validity();
+	if (kvm_has_cap(KVM_CAP_GMEM_CONVERSION))
+		test_gmem_flag_validity();
+	else
+		test_gmem_flag_validity_without_conversion_cap();
 
 	test_with_type(VM_TYPE_DEFAULT, 0, false);
 	if (kvm_has_cap(KVM_CAP_GMEM_SHARED_MEM)) {
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 09/51] KVM: selftests: Test faulting with respect to GUEST_MEMFD_FLAG_INIT_PRIVATE
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (7 preceding siblings ...)
  2025-05-14 23:41 ` [RFC PATCH v2 08/51] KVM: selftests: Test flag validity after guest_memfd supports conversions Ackerley Tng
@ 2025-05-14 23:41 ` Ackerley Tng
  2025-05-14 23:41 ` [RFC PATCH v2 10/51] KVM: selftests: Refactor vm_mem_add to be more flexible Ackerley Tng
                   ` (46 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:41 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Test that faulting is denied when guest_memfd's shareability is
initialized as private with GUEST_MEMFD_FLAG_INIT_PRIVATE and allowed
if the flag is not specified.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>

Change-Id: Id93d4683b36fc5a9c924458d26f0525baed26435
---
 .../testing/selftests/kvm/guest_memfd_test.c  | 112 +++++++++++++++---
 1 file changed, 97 insertions(+), 15 deletions(-)

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index 51d88acdf072..1e79382fd830 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -16,6 +16,7 @@
 #include <sys/mman.h>
 #include <sys/types.h>
 #include <sys/stat.h>
+#include <sys/wait.h>
 
 #include "kvm_util.h"
 #include "test_util.h"
@@ -34,7 +35,7 @@ static void test_file_read_write(int fd)
 		    "pwrite on a guest_mem fd should fail");
 }
 
-static void test_mmap_allowed(int fd, size_t page_size, size_t total_size)
+static void test_faulting_allowed(int fd, size_t page_size, size_t total_size)
 {
 	const char val = 0xaa;
 	char *mem;
@@ -65,6 +66,53 @@ static void test_mmap_allowed(int fd, size_t page_size, size_t total_size)
 	TEST_ASSERT(!ret, "munmap should succeed");
 }
 
+static void assert_not_faultable(char *address)
+{
+	pid_t child_pid;
+
+	child_pid = fork();
+	TEST_ASSERT(child_pid != -1, "fork failed");
+
+	if (child_pid == 0) {
+		*address = 'A';
+		TEST_FAIL("Child should have exited with a signal");
+	} else {
+		int status;
+
+		waitpid(child_pid, &status, 0);
+
+		TEST_ASSERT(WIFSIGNALED(status),
+			    "Child should have exited with a signal");
+		TEST_ASSERT_EQ(WTERMSIG(status), SIGBUS);
+	}
+}
+
+static void test_faulting_sigbus(int fd, size_t total_size)
+{
+	char *mem;
+	int ret;
+
+	mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem != MAP_FAILED, "mmaping() guest memory should pass.");
+
+	assert_not_faultable(mem);
+
+	ret = munmap(mem, total_size);
+	TEST_ASSERT(!ret, "munmap should succeed");
+}
+
+static void test_mmap_allowed(int fd, size_t total_size)
+{
+	char *mem;
+	int ret;
+
+	mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem != MAP_FAILED, "mmaping() guest memory should pass.");
+
+	ret = munmap(mem, total_size);
+	TEST_ASSERT(!ret, "munmap should succeed");
+}
+
 static void test_mmap_denied(int fd, size_t page_size, size_t total_size)
 {
 	char *mem;
@@ -364,40 +412,74 @@ static void test_bind_guest_memfd_wrt_userspace_addr(struct kvm_vm *vm)
 	close(fd);
 }
 
-static void test_with_type(unsigned long vm_type, uint64_t guest_memfd_flags,
-			   bool expect_mmap_allowed)
+static void test_guest_memfd_features(struct kvm_vm *vm, size_t page_size,
+				      uint64_t guest_memfd_flags,
+				      bool expect_mmap_allowed,
+				      bool expect_faulting_allowed)
 {
-	struct kvm_vm *vm;
 	size_t total_size;
-	size_t page_size;
 	int fd;
 
-	if (!(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(vm_type)))
-		return;
-
-	page_size = getpagesize();
 	total_size = page_size * 4;
 
-	vm = vm_create_barebones_type(vm_type);
+	if (expect_faulting_allowed)
+		TEST_REQUIRE(expect_mmap_allowed);
 
-	test_create_guest_memfd_multiple(vm);
-	test_bind_guest_memfd_wrt_userspace_addr(vm);
 	test_create_guest_memfd_invalid_sizes(vm, guest_memfd_flags, page_size);
 
 	fd = vm_create_guest_memfd(vm, total_size, guest_memfd_flags);
 
 	test_file_read_write(fd);
 
-	if (expect_mmap_allowed)
-		test_mmap_allowed(fd, page_size, total_size);
-	else
+	if (expect_mmap_allowed) {
+		test_mmap_allowed(fd, total_size);
+
+		if (expect_faulting_allowed)
+			test_faulting_allowed(fd, page_size, total_size);
+		else
+			test_faulting_sigbus(fd, total_size);
+	} else {
 		test_mmap_denied(fd, page_size, total_size);
+	}
 
 	test_file_size(fd, page_size, total_size);
 	test_fallocate(fd, page_size, total_size);
 	test_invalid_punch_hole(fd, page_size, total_size);
 
 	close(fd);
+}
+
+static void test_with_type(unsigned long vm_type, uint64_t guest_memfd_flags,
+			   bool expect_mmap_allowed)
+{
+	struct kvm_vm *vm;
+	size_t page_size;
+
+	if (!(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(vm_type)))
+		return;
+
+	vm = vm_create_barebones_type(vm_type);
+
+	test_create_guest_memfd_multiple(vm);
+	test_bind_guest_memfd_wrt_userspace_addr(vm);
+
+	page_size = getpagesize();
+	if (guest_memfd_flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED) {
+		test_guest_memfd_features(vm, page_size, guest_memfd_flags,
+					  expect_mmap_allowed, true);
+
+		if (kvm_has_cap(KVM_CAP_GMEM_CONVERSION)) {
+			uint64_t flags = guest_memfd_flags |
+					 GUEST_MEMFD_FLAG_INIT_PRIVATE;
+
+			test_guest_memfd_features(vm, page_size, flags,
+						  expect_mmap_allowed, false);
+		}
+	} else {
+		test_guest_memfd_features(vm, page_size, guest_memfd_flags,
+					  expect_mmap_allowed, false);
+	}
+
 	kvm_vm_release(vm);
 }
 
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 10/51] KVM: selftests: Refactor vm_mem_add to be more flexible
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (8 preceding siblings ...)
  2025-05-14 23:41 ` [RFC PATCH v2 09/51] KVM: selftests: Test faulting with respect to GUEST_MEMFD_FLAG_INIT_PRIVATE Ackerley Tng
@ 2025-05-14 23:41 ` Ackerley Tng
  2025-05-14 23:41 ` [RFC PATCH v2 11/51] KVM: selftests: Allow cleanup of ucall_pool from host Ackerley Tng
                   ` (45 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:41 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

enum vm_mem_backing_src_type encodes too many possibilities along
different axes: (1) whether to mmap() from an fd, (2) the mapping
granularity for THP, and (3) the hugetlb mapping size, and it has yet
to be extended to support guest_memfd.

Once guest_memfd supports mmap() and we also want to test mmap()ing
from guest_memfd, the number of combinations makes enumeration in
vm_mem_backing_src_type unwieldy.

This refactor separates out vm_mem_backing_src_type from
userspace_mem_region. For now, vm_mem_backing_src_type remains a
possible way for tests to specify, on the command line, the
combination of backing memory to test.

vm_mem_add() is now the last place where vm_mem_backing_src_type is
interpreted, to

1. Check validity of requested guest_paddr
2. Align mmap_size appropriately based on the mapping's page_size and
   architecture
3. Install memory appropriately according to mapping's page size

mmap()ing an alias seems to be specific to userfaultfd tests and could
be refactored out of struct userspace_mem_region and localized in
userfaultfd tests in future.

This paves the way for replacing vm_mem_backing_src_type with multiple
command line flags that would specify backing memory more
flexibly. Future tests are expected to use vm_mem_region_alloc() to
allocate a struct userspace_mem_region, then use more fundamental
functions like vm_mem_region_mmap(), vm_mem_region_madvise_thp(),
kvm_memfd_create(), vm_create_guest_memfd(), and other functions in
vm_mem_add() to flexibly build up struct userspace_mem_region before
finally adding the region to the vm with vm_mem_region_add().
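
As a rough sketch (not taken from any test in this series) of how a
future test might use these helpers, where TEST_SLOT and TEST_GPA are
placeholder values:

  struct userspace_mem_region *region;
  size_t size = 4 * getpagesize();

  region = vm_mem_region_alloc(vm);

  /* Back the memslot with a plain memfd and map it shared. */
  region->fd = kvm_create_memfd(size, MFD_CLOEXEC);
  vm_mem_region_mmap(region, size, MAP_SHARED, region->fd, 0);
  vm_mem_region_install_memory(region, size, getpagesize());

  region->region.slot = TEST_SLOT;
  region->region.guest_phys_addr = TEST_GPA;
  vm_mem_region_add(vm, region);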

Change-Id: Ibb37af8a1a3bbb6de776426302433c5d9613ee76
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../testing/selftests/kvm/include/kvm_util.h  |  29 +-
 .../testing/selftests/kvm/include/test_util.h |   2 +
 tools/testing/selftests/kvm/lib/kvm_util.c    | 429 +++++++++++-------
 tools/testing/selftests/kvm/lib/test_util.c   |  25 +
 4 files changed, 328 insertions(+), 157 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 373912464fb4..853ab68cff79 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -35,11 +35,26 @@ struct userspace_mem_region {
 	struct sparsebit *protected_phy_pages;
 	int fd;
 	off_t offset;
-	enum vm_mem_backing_src_type backing_src_type;
+	/*
+	 * host_mem is mmap_start aligned upwards to an address suitable for the
+	 * architecture. In most cases, host_mem and mmap_start are the same,
+	 * except for s390x, where the host address must be aligned to 1M (due
+	 * to PGSTEs).
+	 */
+#ifdef __s390x__
+#define S390X_HOST_ADDRESS_ALIGNMENT 0x100000
+#endif
 	void *host_mem;
+	/* host_alias is to mmap_alias as host_mem is to mmap_start */
 	void *host_alias;
 	void *mmap_start;
 	void *mmap_alias;
+	/*
+	 * mmap_size is possibly larger than region.memory_size because in some
+	 * cases, host_mem has to be adjusted upwards (see comment for host_mem
+	 * above). In those cases, mmap_size has to be adjusted upwards so that
+	 * enough memory is available in this memslot.
+	 */
 	size_t mmap_size;
 	struct rb_node gpa_node;
 	struct rb_node hva_node;
@@ -582,6 +597,18 @@ int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flag
 				 uint64_t gpa, uint64_t size, void *hva,
 				 uint32_t guest_memfd, uint64_t guest_memfd_offset);
 
+struct userspace_mem_region *vm_mem_region_alloc(struct kvm_vm *vm);
+void *vm_mem_region_mmap(struct userspace_mem_region *region, size_t length,
+			 int flags, int fd, off_t offset);
+void vm_mem_region_install_memory(struct userspace_mem_region *region,
+				  size_t memslot_size, size_t alignment);
+void vm_mem_region_madvise_thp(struct userspace_mem_region *region, int advice);
+int vm_mem_region_install_guest_memfd(struct userspace_mem_region *region,
+				      int guest_memfd);
+void *vm_mem_region_mmap_alias(struct userspace_mem_region *region, int flags,
+			       size_t alignment);
+void vm_mem_region_add(struct kvm_vm *vm, struct userspace_mem_region *region);
+
 void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	enum vm_mem_backing_src_type src_type,
 	uint64_t guest_paddr, uint32_t slot, uint64_t npages,
diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
index 77d13d7920cb..b4a03784ac4f 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -149,6 +149,8 @@ size_t get_trans_hugepagesz(void);
 size_t get_def_hugetlb_pagesz(void);
 const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i);
 size_t get_backing_src_pagesz(uint32_t i);
+int backing_src_should_madvise(uint32_t i);
+int get_backing_src_madvise_advice(uint32_t i);
 bool is_backing_src_hugetlb(uint32_t i);
 void backing_src_help(const char *flag);
 enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name);
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 815bc45dd8dc..58a3365f479c 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -824,15 +824,12 @@ void kvm_vm_free(struct kvm_vm *vmp)
 	free(vmp);
 }
 
-int kvm_memfd_alloc(size_t size, bool hugepages)
+int kvm_create_memfd(size_t size, unsigned int flags)
 {
-	int memfd_flags = MFD_CLOEXEC;
-	int fd, r;
+	int fd;
+	int r;
 
-	if (hugepages)
-		memfd_flags |= MFD_HUGETLB;
-
-	fd = memfd_create("kvm_selftest", memfd_flags);
+	fd = memfd_create("kvm_selftest", flags);
 	TEST_ASSERT(fd != -1, __KVM_SYSCALL_ERROR("memfd_create()", fd));
 
 	r = ftruncate(fd, size);
@@ -844,6 +841,16 @@ int kvm_memfd_alloc(size_t size, bool hugepages)
 	return fd;
 }
 
+int kvm_memfd_alloc(size_t size, bool hugepages)
+{
+	int memfd_flags = MFD_CLOEXEC;
+
+	if (hugepages)
+		memfd_flags |= MFD_HUGETLB;
+
+	return kvm_create_memfd(size, memfd_flags);
+}
+
 static void vm_userspace_mem_region_gpa_insert(struct rb_root *gpa_tree,
 					       struct userspace_mem_region *region)
 {
@@ -953,185 +960,295 @@ void vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flags
 		    errno, strerror(errno));
 }
 
+/**
+ * Allocates and returns a struct userspace_mem_region.
+ */
+struct userspace_mem_region *vm_mem_region_alloc(struct kvm_vm *vm)
+{
+	struct userspace_mem_region *region;
 
-/* FIXME: This thing needs to be ripped apart and rewritten. */
-void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
-		uint64_t guest_paddr, uint32_t slot, uint64_t npages,
-		uint32_t flags, int guest_memfd, uint64_t guest_memfd_offset)
+	/* Allocate and initialize new mem region structure. */
+	region = calloc(1, sizeof(*region));
+	TEST_ASSERT(region != NULL, "Insufficient Memory");
+
+	region->unused_phy_pages = sparsebit_alloc();
+	if (vm_arch_has_protected_memory(vm))
+		region->protected_phy_pages = sparsebit_alloc();
+
+	region->fd = -1;
+	region->region.guest_memfd = -1;
+
+	return region;
+}
+
+static size_t compute_page_size(int mmap_flags, int madvise_advice)
+{
+	if (mmap_flags & MAP_HUGETLB) {
+		int size_flags = (mmap_flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK;
+
+		if (!size_flags)
+			return get_def_hugetlb_pagesz();
+
+		return 1ULL << size_flags;
+	}
+
+	return madvise_advice == MADV_HUGEPAGE ? get_trans_hugepagesz() : getpagesize();
+}
+
+/**
+ * Calls mmap() with @length, @flags, @fd, @offset for @region.
+ *
+ * Think of this as the struct userspace_mem_region wrapper for the mmap()
+ * syscall.
+ */
+void *vm_mem_region_mmap(struct userspace_mem_region *region, size_t length,
+			 int flags, int fd, off_t offset)
+{
+	void *mem;
+
+	if (flags & MAP_SHARED) {
+		TEST_ASSERT(fd != -1,
+			    "Ensure that fd is provided for shared mappings.");
+		TEST_ASSERT(
+			region->fd == fd || region->region.guest_memfd == fd,
+			"Ensure that fd is opened before mmap, and is either "
+			"set up in region->fd or region->region.guest_memfd.");
+	}
+
+	mem = mmap(NULL, length, PROT_READ | PROT_WRITE, flags, fd, offset);
+	TEST_ASSERT(mem != MAP_FAILED, "Couldn't mmap anonymous memory");
+
+	region->mmap_start = mem;
+	region->mmap_size = length;
+	region->offset = offset;
+
+	return mem;
+}
+
+/**
+ * Installs mmap()ed memory in @region->mmap_start as @region->host_mem,
+ * checking constraints.
+ */
+void vm_mem_region_install_memory(struct userspace_mem_region *region,
+				  size_t memslot_size, size_t alignment)
+{
+	TEST_ASSERT(region->mmap_size >= memslot_size,
+		    "mmap()ed memory insufficient for memslot");
+
+	region->host_mem = align_ptr_up(region->mmap_start, alignment);
+	region->region.userspace_addr = (uint64_t)region->host_mem;
+	region->region.memory_size = memslot_size;
+}
+
+
+/**
+ * Calls madvise with @advice for @region.
+ *
+ * Think of this as the struct userspace_mem_region wrapper for the madvise()
+ * syscall.
+ */
+void vm_mem_region_madvise_thp(struct userspace_mem_region *region, int advice)
 {
 	int ret;
+
+	TEST_ASSERT(
+		region->host_mem && region->mmap_size,
+		"vm_mem_region_madvise_thp() must be called after vm_mem_region_mmap()");
+
+	ret = madvise(region->host_mem, region->mmap_size, advice);
+	TEST_ASSERT(ret == 0, "madvise failed, addr: %p length: 0x%lx",
+		    region->host_mem, region->mmap_size);
+}
+
+/**
+ * Installs guest_memfd by setting it up in @region.
+ *
+ * Returns the guest_memfd that was installed in the @region.
+ */
+int vm_mem_region_install_guest_memfd(struct userspace_mem_region *region,
+				      int guest_memfd)
+{
+	/*
+	 * Install a unique fd for each memslot so that the fd can be closed
+	 * when the region is deleted without needing to track if the fd is
+	 * owned by the framework or by the caller.
+	 */
+	guest_memfd = dup(guest_memfd);
+	TEST_ASSERT(guest_memfd >= 0, __KVM_SYSCALL_ERROR("dup()", guest_memfd));
+	region->region.guest_memfd = guest_memfd;
+
+	return guest_memfd;
+}
+
+/**
+ * Calls mmap() to create an alias for mmap()ed memory at region->host_mem,
+ * exactly the same size the was mmap()ed.
+ *
+ * This is used mainly for userfaultfd tests.
+ */
+void *vm_mem_region_mmap_alias(struct userspace_mem_region *region, int flags,
+			       size_t alignment)
+{
+	region->mmap_alias = mmap(NULL, region->mmap_size,
+				  PROT_READ | PROT_WRITE, flags, region->fd, 0);
+	TEST_ASSERT(region->mmap_alias != MAP_FAILED,
+		    __KVM_SYSCALL_ERROR("mmap()",  (int)(unsigned long)MAP_FAILED));
+
+	region->host_alias = align_ptr_up(region->mmap_alias, alignment);
+
+	return region->host_alias;
+}
+
+static void vm_mem_region_assert_no_duplicate(struct kvm_vm *vm, uint32_t slot,
+					      uint64_t gpa, size_t size)
+{
 	struct userspace_mem_region *region;
-	size_t backing_src_pagesz = get_backing_src_pagesz(src_type);
-	size_t mem_size = npages * vm->page_size;
-	size_t alignment;
-
-	TEST_REQUIRE_SET_USER_MEMORY_REGION2();
-
-	TEST_ASSERT(vm_adjust_num_guest_pages(vm->mode, npages) == npages,
-		"Number of guest pages is not compatible with the host. "
-		"Try npages=%d", vm_adjust_num_guest_pages(vm->mode, npages));
-
-	TEST_ASSERT((guest_paddr % vm->page_size) == 0, "Guest physical "
-		"address not on a page boundary.\n"
-		"  guest_paddr: 0x%lx vm->page_size: 0x%x",
-		guest_paddr, vm->page_size);
-	TEST_ASSERT((((guest_paddr >> vm->page_shift) + npages) - 1)
-		<= vm->max_gfn, "Physical range beyond maximum "
-		"supported physical address,\n"
-		"  guest_paddr: 0x%lx npages: 0x%lx\n"
-		"  vm->max_gfn: 0x%lx vm->page_size: 0x%x",
-		guest_paddr, npages, vm->max_gfn, vm->page_size);
 
 	/*
 	 * Confirm a mem region with an overlapping address doesn't
 	 * already exist.
 	 */
-	region = (struct userspace_mem_region *) userspace_mem_region_find(
-		vm, guest_paddr, (guest_paddr + npages * vm->page_size) - 1);
-	if (region != NULL)
-		TEST_FAIL("overlapping userspace_mem_region already "
-			"exists\n"
-			"  requested guest_paddr: 0x%lx npages: 0x%lx "
-			"page_size: 0x%x\n"
-			"  existing guest_paddr: 0x%lx size: 0x%lx",
-			guest_paddr, npages, vm->page_size,
-			(uint64_t) region->region.guest_phys_addr,
-			(uint64_t) region->region.memory_size);
+	region = userspace_mem_region_find(vm, gpa, gpa + size - 1);
+	if (region != NULL) {
+		TEST_FAIL("overlapping userspace_mem_region already exists\n"
+			  "  requested gpa: 0x%lx size: 0x%lx"
+			  "  existing gpa: 0x%lx size: 0x%lx",
+			  gpa, size,
+			  (uint64_t) region->region.guest_phys_addr,
+			  (uint64_t) region->region.memory_size);
+	}
 
 	/* Confirm no region with the requested slot already exists. */
-	hash_for_each_possible(vm->regions.slot_hash, region, slot_node,
-			       slot) {
+	hash_for_each_possible(vm->regions.slot_hash, region, slot_node, slot) {
 		if (region->region.slot != slot)
 			continue;
 
-		TEST_FAIL("A mem region with the requested slot "
-			"already exists.\n"
-			"  requested slot: %u paddr: 0x%lx npages: 0x%lx\n"
-			"  existing slot: %u paddr: 0x%lx size: 0x%lx",
-			slot, guest_paddr, npages,
-			region->region.slot,
-			(uint64_t) region->region.guest_phys_addr,
-			(uint64_t) region->region.memory_size);
+		TEST_FAIL("A mem region with the requested slot already exists.\n"
+			  "  requested slot: %u paddr: 0x%lx size: 0x%lx\n"
+			  "  existing slot: %u paddr: 0x%lx size: 0x%lx",
+			  slot, gpa, size,
+			  region->region.slot,
+			  (uint64_t) region->region.guest_phys_addr,
+			  (uint64_t) region->region.memory_size);
 	}
+}
 
-	/* Allocate and initialize new mem region structure. */
-	region = calloc(1, sizeof(*region));
-	TEST_ASSERT(region != NULL, "Insufficient Memory");
-	region->mmap_size = mem_size;
+/**
+ * Add a @region to @vm. All necessary fields in region->region should already
+ * be populated.
+ *
+ * Think of this as the struct userspace_mem_region wrapper for the
+ * KVM_SET_USER_MEMORY_REGION2 ioctl.
+ */
+void vm_mem_region_add(struct kvm_vm *vm, struct userspace_mem_region *region)
+{
+	uint64_t npages;
+	uint64_t gpa;
+	int ret;
 
-#ifdef __s390x__
-	/* On s390x, the host address must be aligned to 1M (due to PGSTEs) */
-	alignment = 0x100000;
-#else
-	alignment = 1;
-#endif
+	TEST_REQUIRE_SET_USER_MEMORY_REGION2();
 
-	/*
-	 * When using THP mmap is not guaranteed to returned a hugepage aligned
-	 * address so we have to pad the mmap. Padding is not needed for HugeTLB
-	 * because mmap will always return an address aligned to the HugeTLB
-	 * page size.
-	 */
-	if (src_type == VM_MEM_SRC_ANONYMOUS_THP)
-		alignment = max(backing_src_pagesz, alignment);
+	npages = region->region.memory_size / vm->page_size;
+	TEST_ASSERT(vm_adjust_num_guest_pages(vm->mode, npages) == npages,
+		    "Number of guest pages is not compatible with the host. "
+		    "Try npages=%d", vm_adjust_num_guest_pages(vm->mode, npages));
 
-	TEST_ASSERT_EQ(guest_paddr, align_up(guest_paddr, backing_src_pagesz));
+	gpa = region->region.guest_phys_addr;
+	TEST_ASSERT((gpa % vm->page_size) == 0,
+		    "Guest physical address not on a page boundary.\n"
+		    "  gpa: 0x%lx vm->page_size: 0x%x",
+		    gpa, vm->page_size);
+	TEST_ASSERT((((gpa >> vm->page_shift) + npages) - 1) <= vm->max_gfn,
+		    "Physical range beyond maximum supported physical address,\n"
+		    "  gpa: 0x%lx npages: 0x%lx\n"
+		    "  vm->max_gfn: 0x%lx vm->page_size: 0x%x",
+		    gpa, npages, vm->max_gfn, vm->page_size);
 
-	/* Add enough memory to align up if necessary */
-	if (alignment > 1)
-		region->mmap_size += alignment;
+	vm_mem_region_assert_no_duplicate(vm, region->region.slot, gpa,
+					  region->mmap_size);
 
-	region->fd = -1;
-	if (backing_src_is_shared(src_type))
-		region->fd = kvm_memfd_alloc(region->mmap_size,
-					     src_type == VM_MEM_SRC_SHARED_HUGETLB);
-
-	region->mmap_start = mmap(NULL, region->mmap_size,
-				  PROT_READ | PROT_WRITE,
-				  vm_mem_backing_src_alias(src_type)->flag,
-				  region->fd, 0);
-	TEST_ASSERT(region->mmap_start != MAP_FAILED,
-		    __KVM_SYSCALL_ERROR("mmap()", (int)(unsigned long)MAP_FAILED));
-
-	TEST_ASSERT(!is_backing_src_hugetlb(src_type) ||
-		    region->mmap_start == align_ptr_up(region->mmap_start, backing_src_pagesz),
-		    "mmap_start %p is not aligned to HugeTLB page size 0x%lx",
-		    region->mmap_start, backing_src_pagesz);
-
-	/* Align host address */
-	region->host_mem = align_ptr_up(region->mmap_start, alignment);
-
-	/* As needed perform madvise */
-	if ((src_type == VM_MEM_SRC_ANONYMOUS ||
-	     src_type == VM_MEM_SRC_ANONYMOUS_THP) && thp_configured()) {
-		ret = madvise(region->host_mem, mem_size,
-			      src_type == VM_MEM_SRC_ANONYMOUS ? MADV_NOHUGEPAGE : MADV_HUGEPAGE);
-		TEST_ASSERT(ret == 0, "madvise failed, addr: %p length: 0x%lx src_type: %s",
-			    region->host_mem, mem_size,
-			    vm_mem_backing_src_alias(src_type)->name);
-	}
-
-	region->backing_src_type = src_type;
-
-	if (flags & KVM_MEM_GUEST_MEMFD) {
-		if (guest_memfd < 0) {
-			uint32_t guest_memfd_flags = 0;
-			TEST_ASSERT(!guest_memfd_offset,
-				    "Offset must be zero when creating new guest_memfd");
-			guest_memfd = vm_create_guest_memfd(vm, mem_size, guest_memfd_flags);
-		} else {
-			/*
-			 * Install a unique fd for each memslot so that the fd
-			 * can be closed when the region is deleted without
-			 * needing to track if the fd is owned by the framework
-			 * or by the caller.
-			 */
-			guest_memfd = dup(guest_memfd);
-			TEST_ASSERT(guest_memfd >= 0, __KVM_SYSCALL_ERROR("dup()", guest_memfd));
-		}
-
-		region->region.guest_memfd = guest_memfd;
-		region->region.guest_memfd_offset = guest_memfd_offset;
-	} else {
-		region->region.guest_memfd = -1;
-	}
-
-	region->unused_phy_pages = sparsebit_alloc();
-	if (vm_arch_has_protected_memory(vm))
-		region->protected_phy_pages = sparsebit_alloc();
-	sparsebit_set_num(region->unused_phy_pages,
-		guest_paddr >> vm->page_shift, npages);
-	region->region.slot = slot;
-	region->region.flags = flags;
-	region->region.guest_phys_addr = guest_paddr;
-	region->region.memory_size = npages * vm->page_size;
-	region->region.userspace_addr = (uintptr_t) region->host_mem;
 	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
 	TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
-		"  rc: %i errno: %i\n"
-		"  slot: %u flags: 0x%x\n"
-		"  guest_phys_addr: 0x%lx size: 0x%lx guest_memfd: %d",
-		ret, errno, slot, flags,
-		guest_paddr, (uint64_t) region->region.memory_size,
-		region->region.guest_memfd);
+		    "  rc: %i errno: %i\n"
+		    "  slot: %u flags: 0x%x\n"
+		    "  guest_phys_addr: 0x%lx size: 0x%llx guest_memfd: %d",
+		    ret, errno, region->region.slot, region->region.flags,
+		    gpa, region->region.memory_size,
+		    region->region.guest_memfd);
+
+	sparsebit_set_num(region->unused_phy_pages, gpa >> vm->page_shift, npages);
 
 	/* Add to quick lookup data structures */
 	vm_userspace_mem_region_gpa_insert(&vm->regions.gpa_tree, region);
 	vm_userspace_mem_region_hva_insert(&vm->regions.hva_tree, region);
-	hash_add(vm->regions.slot_hash, &region->slot_node, slot);
+	hash_add(vm->regions.slot_hash, &region->slot_node, region->region.slot);
+}
 
-	/* If shared memory, create an alias. */
-	if (region->fd >= 0) {
-		region->mmap_alias = mmap(NULL, region->mmap_size,
-					  PROT_READ | PROT_WRITE,
-					  vm_mem_backing_src_alias(src_type)->flag,
-					  region->fd, 0);
-		TEST_ASSERT(region->mmap_alias != MAP_FAILED,
-			    __KVM_SYSCALL_ERROR("mmap()",  (int)(unsigned long)MAP_FAILED));
+void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
+		uint64_t guest_paddr, uint32_t slot, uint64_t npages,
+		uint32_t flags, int guest_memfd, uint64_t guest_memfd_offset)
+{
+	struct userspace_mem_region *region;
+	size_t mapping_page_size;
+	size_t memslot_size;
+	int madvise_advice;
+	size_t mmap_size;
+	size_t alignment;
+	int mmap_flags;
+	int memfd;
 
-		/* Align host alias address */
-		region->host_alias = align_ptr_up(region->mmap_alias, alignment);
+	memslot_size = npages * vm->page_size;
+
+	mmap_flags = vm_mem_backing_src_alias(src_type)->flag;
+	madvise_advice = get_backing_src_madvise_advice(src_type);
+	mapping_page_size = compute_page_size(mmap_flags, madvise_advice);
+
+	TEST_ASSERT_EQ(guest_paddr, align_up(guest_paddr, mapping_page_size));
+
+	alignment = mapping_page_size;
+#ifdef __s390x__
+	alignment = max(alignment, S390X_HOST_ADDRESS_ALIGNMENT);
+#endif
+
+	region = vm_mem_region_alloc(vm);
+
+	memfd = -1;
+	if (backing_src_is_shared(src_type)) {
+		unsigned int memfd_flags = MFD_CLOEXEC;
+
+		if (src_type == VM_MEM_SRC_SHARED_HUGETLB)
+			memfd_flags |= MFD_HUGETLB;
+
+		memfd = kvm_create_memfd(memslot_size, memfd_flags);
 	}
+	region->fd = memfd;
+
+	mmap_size = align_up(memslot_size, alignment);
+	vm_mem_region_mmap(region, mmap_size, mmap_flags, memfd, 0);
+	vm_mem_region_install_memory(region, memslot_size, alignment);
+
+	if (backing_src_should_madvise(src_type))
+		vm_mem_region_madvise_thp(region, madvise_advice);
+
+	if (backing_src_is_shared(src_type))
+		vm_mem_region_mmap_alias(region, mmap_flags, alignment);
+
+	if (flags & KVM_MEM_GUEST_MEMFD) {
+		if (guest_memfd < 0) {
+			TEST_ASSERT(
+				guest_memfd_offset == 0,
+				"Offset must be zero when creating new guest_memfd");
+			guest_memfd = vm_create_guest_memfd(vm, memslot_size, 0);
+		}
+
+		vm_mem_region_install_guest_memfd(region, guest_memfd);
+	}
+
+	region->region.slot = slot;
+	region->region.flags = flags;
+	region->region.guest_phys_addr = guest_paddr;
+	region->region.guest_memfd_offset = guest_memfd_offset;
+	vm_mem_region_add(vm, region);
 }
 
 void vm_userspace_mem_region_add(struct kvm_vm *vm,
diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/selftests/kvm/lib/test_util.c
index 8ed0b74ae837..24dc90693afd 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -308,6 +308,31 @@ size_t get_backing_src_pagesz(uint32_t i)
 	}
 }
 
+int backing_src_should_madvise(uint32_t i)
+{
+	switch (i) {
+	case VM_MEM_SRC_ANONYMOUS:
+	case VM_MEM_SRC_SHMEM:
+	case VM_MEM_SRC_ANONYMOUS_THP:
+		return true;
+	default:
+		return false;
+	}
+}
+
+int get_backing_src_madvise_advice(uint32_t i)
+{
+	switch (i) {
+	case VM_MEM_SRC_ANONYMOUS:
+	case VM_MEM_SRC_SHMEM:
+		return MADV_NOHUGEPAGE;
+	case VM_MEM_SRC_ANONYMOUS_THP:
+		return MADV_HUGEPAGE;
+	default:
+		return 0;
+	}
+}
+
 bool is_backing_src_hugetlb(uint32_t i)
 {
 	return !!(vm_mem_backing_src_alias(i)->flag & MAP_HUGETLB);
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 11/51] KVM: selftests: Allow cleanup of ucall_pool from host
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (9 preceding siblings ...)
  2025-05-14 23:41 ` [RFC PATCH v2 10/51] KVM: selftests: Refactor vm_mem_add to be more flexible Ackerley Tng
@ 2025-05-14 23:41 ` Ackerley Tng
  2025-05-14 23:41 ` [RFC PATCH v2 12/51] KVM: selftests: Test conversion flows for guest_memfd Ackerley Tng
                   ` (44 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:41 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Many selftests use GUEST_DONE() to signal the end of guest code, which
is handled in userspace. In most tests, the test exits and there is no
need to clean up the ucall_pool->in_use bitmap.

If there are many guest code functions using GUEST_DONE(), or if guest
code functions are run many times, the ucall_pool->in_use bitmap will
fill up, causing later runs of the same guest code function to fail.

This patch allows ucall_free() to be called from userspace on uc.hva,
which will unset and free the correct struct ucall in the pool,
allowing ucalls to continue being used.
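
As a sketch (not code from this series), a host-side loop that runs
guest code ending in GUEST_DONE() many times could clean up like so:

  struct ucall uc;

  vcpu_run(vcpu);
  if (get_ucall(vcpu, &uc) == UCALL_DONE)
          ucall_free(uc.hva);     /* release the pool slot for reuse */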

Change-Id: I2cb2aeed4b291b1bfb2bece001d09c509cd10446
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../testing/selftests/kvm/include/ucall_common.h |  1 +
 tools/testing/selftests/kvm/lib/ucall_common.c   | 16 ++++++++--------
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/ucall_common.h b/tools/testing/selftests/kvm/include/ucall_common.h
index d9d6581b8d4f..b6b850d0319a 100644
--- a/tools/testing/selftests/kvm/include/ucall_common.h
+++ b/tools/testing/selftests/kvm/include/ucall_common.h
@@ -40,6 +40,7 @@ __printf(5, 6) void ucall_assert(uint64_t cmd, const char *exp,
 				 const char *fmt, ...);
 uint64_t get_ucall(struct kvm_vcpu *vcpu, struct ucall *uc);
 void ucall_init(struct kvm_vm *vm, vm_paddr_t mmio_gpa);
+void ucall_free(struct ucall *uc);
 int ucall_nr_pages_required(uint64_t page_size);
 
 /*
diff --git a/tools/testing/selftests/kvm/lib/ucall_common.c b/tools/testing/selftests/kvm/lib/ucall_common.c
index 42151e571953..9b6865c39ea7 100644
--- a/tools/testing/selftests/kvm/lib/ucall_common.c
+++ b/tools/testing/selftests/kvm/lib/ucall_common.c
@@ -21,24 +21,24 @@ int ucall_nr_pages_required(uint64_t page_size)
 
 /*
  * ucall_pool holds per-VM values (global data is duplicated by each VM), it
- * must not be accessed from host code.
+ * should generally not be accessed from host code, other than via ucall_free()
+ * to clean up after using GUEST_DONE().
  */
 static struct ucall_header *ucall_pool;
 
 void ucall_init(struct kvm_vm *vm, vm_paddr_t mmio_gpa)
 {
-	struct ucall_header *hdr;
 	struct ucall *uc;
 	vm_vaddr_t vaddr;
 	int i;
 
-	vaddr = vm_vaddr_alloc_shared(vm, sizeof(*hdr), KVM_UTIL_MIN_VADDR,
-				      MEM_REGION_DATA);
-	hdr = (struct ucall_header *)addr_gva2hva(vm, vaddr);
-	memset(hdr, 0, sizeof(*hdr));
+	vaddr = vm_vaddr_alloc_shared(vm, sizeof(*ucall_pool),
+				      KVM_UTIL_MIN_VADDR, MEM_REGION_DATA);
+	ucall_pool = (struct ucall_header *)addr_gva2hva(vm, vaddr);
+	memset(ucall_pool, 0, sizeof(*ucall_pool));
 
 	for (i = 0; i < KVM_MAX_VCPUS; ++i) {
-		uc = &hdr->ucalls[i];
+		uc = &ucall_pool->ucalls[i];
 		uc->hva = uc;
 	}
 
@@ -73,7 +73,7 @@ static struct ucall *ucall_alloc(void)
 	return NULL;
 }
 
-static void ucall_free(struct ucall *uc)
+void ucall_free(struct ucall *uc)
 {
 	/* Beware, here be pointer arithmetic.  */
 	clear_bit(uc - ucall_pool->ucalls, ucall_pool->in_use);
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 12/51] KVM: selftests: Test conversion flows for guest_memfd
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (10 preceding siblings ...)
  2025-05-14 23:41 ` [RFC PATCH v2 11/51] KVM: selftests: Allow cleanup of ucall_pool from host Ackerley Tng
@ 2025-05-14 23:41 ` Ackerley Tng
  2025-05-14 23:41 ` [RFC PATCH v2 13/51] KVM: selftests: Add script to exercise private_mem_conversions_test Ackerley Tng
                   ` (43 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:41 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Add minimal guest_memfd tests checking that when memory is marked
shared in a VM, both the host (via an mmap()ed address) and the guest
can read and write to it.

Tests added in this patch use refcounts taken via GUP (requiring
CONFIG_GUP_TEST) to simulate unexpected refcounts on guest_memfd
pages.

Test that unexpected refcounts cause conversions to fail.
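
Roughly, the conversion-failure tests follow the sketch below, where
guest_memfd_convert_private(), shared_mem, gmem_fd and page_size are
placeholder names rather than helpers from this series (pin_pages()
and unpin_pages() are the gup_test wrappers added in this patch):

  /* Take an extra refcount on the shared page via gup_test. */
  pin_pages(shared_mem, page_size);

  /* guest_memfd_convert_private() stands in for the conversion helper. */
  TEST_ASSERT(guest_memfd_convert_private(gmem_fd, 0, page_size),
              "Conversion should fail while the page is pinned");

  unpin_pages();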

Change-Id: I4f8c05aa511bcb9a34921a54fc8315ed89629018
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../kvm/guest_memfd_conversions_test.c        | 589 ++++++++++++++++++
 .../testing/selftests/kvm/include/kvm_util.h  |  74 +++
 3 files changed, 664 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_conversions_test.c

diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index ccf95ed037c3..bc22a5a23c4c 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -131,6 +131,7 @@ TEST_GEN_PROGS_x86 += access_tracking_perf_test
 TEST_GEN_PROGS_x86 += coalesced_io_test
 TEST_GEN_PROGS_x86 += dirty_log_perf_test
 TEST_GEN_PROGS_x86 += guest_memfd_test
+TEST_GEN_PROGS_x86 += guest_memfd_conversions_test
 TEST_GEN_PROGS_x86 += hardware_disable_test
 TEST_GEN_PROGS_x86 += memslot_modification_stress_test
 TEST_GEN_PROGS_x86 += memslot_perf_test
diff --git a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
new file mode 100644
index 000000000000..34eb6c9a37b1
--- /dev/null
+++ b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
@@ -0,0 +1,589 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Test conversion flows for guest_memfd.
+ *
+ * Copyright (c) 2024, Google LLC.
+ */
+#include <linux/kvm.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#include "kvm_util.h"
+#include "processor.h"
+#include "test_util.h"
+#include "ucall_common.h"
+#include "../../../../mm/gup_test.h"
+
+#define GUEST_MEMFD_SHARING_TEST_SLOT 10
+/*
+ * Use a GPA above APIC_DEFAULT_PHYS_BASE to avoid clashing with the APIC's
+ * default base address.
+ */
+#define GUEST_MEMFD_SHARING_TEST_GPA 0x100000000ULL
+#define GUEST_MEMFD_SHARING_TEST_GVA 0x90000000ULL
+
+static int gup_test_fd;
+
+static void pin_pages(void *vaddr, uint64_t size)
+{
+	const struct pin_longterm_test args = {
+		.addr = (uint64_t)vaddr,
+		.size = size,
+		.flags = PIN_LONGTERM_TEST_FLAG_USE_WRITE,
+	};
+
+	gup_test_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
+	TEST_REQUIRE(gup_test_fd > 0);
+
+	TEST_ASSERT_EQ(ioctl(gup_test_fd, PIN_LONGTERM_TEST_START, &args), 0);
+}
+
+static void unpin_pages(void)
+{
+	TEST_ASSERT_EQ(ioctl(gup_test_fd, PIN_LONGTERM_TEST_STOP), 0);
+}
+
+static void guest_check_mem(uint64_t gva, char expected_read_value, char write_value)
+{
+	char *mem = (char *)gva;
+
+	if (expected_read_value != 'X')
+		GUEST_ASSERT_EQ(*mem, expected_read_value);
+
+	if (write_value != 'X')
+		*mem = write_value;
+
+	GUEST_DONE();
+}
+
+static int vcpu_run_handle_basic_ucalls(struct kvm_vcpu *vcpu)
+{
+	struct ucall uc;
+	int rc;
+
+keep_going:
+	do {
+		rc = __vcpu_run(vcpu);
+	} while (rc == -1 && errno == EINTR);
+
+	switch (get_ucall(vcpu, &uc)) {
+	case UCALL_PRINTF:
+		REPORT_GUEST_PRINTF(uc);
+		goto keep_going;
+	case UCALL_ABORT:
+		REPORT_GUEST_ASSERT(uc);
+	}
+
+	return rc;
+}
+
+/**
+ * guest_use_memory() - Assert that guest can use memory at @gva.
+ *
+ * @vcpu: the vcpu to run this test on.
+ * @gva: the virtual address in the guest to try to use.
+ * @expected_read_value: the value that is expected at @gva. Set this to 'X' to
+ *                       skip checking current value.
+ * @write_value: value to write to @gva. Set to 'X' to skip writing a value to
+ *               @gva.
+ * @expected_errno: the expected errno if an error is expected while reading or
+ *                  writing @gva. Set to 0 if no error is expected,
+ *                  otherwise set it to the expected errno. If @expected_errno
+ *                  is set, 'Z' is used instead of @expected_read_value or
+ *                  @write_value.
+ */
+static void guest_use_memory(struct kvm_vcpu *vcpu, uint64_t gva,
+			     char expected_read_value, char write_value,
+			     int expected_errno)
+{
+	struct kvm_regs original_regs;
+	int rc;
+
+	if (expected_errno > 0) {
+		expected_read_value = 'Z';
+		write_value = 'Z';
+	}
+
+	/*
+	 * Back up vCPU state from the first run so that guest_check_mem can be
+	 * run again and again.
+	 */
+	vcpu_regs_get(vcpu, &original_regs);
+
+	vcpu_args_set(vcpu, 3, gva, expected_read_value, write_value);
+	vcpu_arch_set_entry_point(vcpu, guest_check_mem);
+
+	rc = vcpu_run_handle_basic_ucalls(vcpu);
+
+	if (expected_errno) {
+		TEST_ASSERT_EQ(rc, -1);
+		TEST_ASSERT_EQ(errno, expected_errno);
+
+		switch (expected_errno) {
+		case EFAULT:
+			TEST_ASSERT_EQ(vcpu->run->exit_reason, 0);
+			break;
+		case EACCES:
+			TEST_ASSERT_EQ(vcpu->run->exit_reason, KVM_EXIT_MEMORY_FAULT);
+			break;
+		}
+	} else {
+		struct ucall uc;
+
+		TEST_ASSERT_EQ(rc, 0);
+		TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_DONE);
+
+		/*
+		 * UCALL_DONE() uses up one struct ucall slot. To reuse the slot
+		 * in another run of guest_check_mem, free up that slot.
+		 */
+		ucall_free((struct ucall *)uc.hva);
+	}
+
+	vcpu_regs_set(vcpu, &original_regs);
+}
+
+/**
+ * host_use_memory() - Assert that host can fault and use memory at @address.
+ *
+ * @address: the address to be tested.
+ * @expected_read_value: the value expected to be read from @address. Set to 'X'
+ *                       to skip checking current value at @address.
+ * @write_value: the value to write to @address. Set to 'X' to skip writing
+ *               value to @address.
+ */
+static void host_use_memory(char *address, char expected_read_value,
+			    char write_value)
+{
+	if (expected_read_value != 'X')
+		TEST_ASSERT_EQ(*address, expected_read_value);
+
+	if (write_value != 'X')
+		*address = write_value;
+}
+
+static void assert_host_cannot_fault(char *address)
+{
+	pid_t child_pid;
+
+	child_pid = fork();
+	TEST_ASSERT(child_pid != -1, "fork failed");
+
+	if (child_pid == 0) {
+		*address = 'A';
+		TEST_FAIL("Child should have exited with a signal");
+	} else {
+		int status;
+
+		waitpid(child_pid, &status, 0);
+
+		TEST_ASSERT(WIFSIGNALED(status),
+			    "Child should have exited with a signal");
+		TEST_ASSERT_EQ(WTERMSIG(status), SIGBUS);
+	}
+}
+
+static void *add_memslot(struct kvm_vm *vm, size_t memslot_size, int guest_memfd)
+{
+	struct userspace_mem_region *region;
+	void *mem;
+
+	TEST_REQUIRE(guest_memfd > 0);
+
+	region = vm_mem_region_alloc(vm);
+
+	guest_memfd = vm_mem_region_install_guest_memfd(region, guest_memfd);
+	mem = vm_mem_region_mmap(region, memslot_size, MAP_SHARED, guest_memfd, 0);
+	vm_mem_region_install_memory(region, memslot_size, PAGE_SIZE);
+
+	region->region.slot = GUEST_MEMFD_SHARING_TEST_SLOT;
+	region->region.flags = KVM_MEM_GUEST_MEMFD;
+	region->region.guest_phys_addr = GUEST_MEMFD_SHARING_TEST_GPA;
+	region->region.guest_memfd_offset = 0;
+
+	vm_mem_region_add(vm, region);
+
+	return mem;
+}
+
+static struct kvm_vm *setup_test(size_t test_page_size, bool init_private,
+				 struct kvm_vcpu **vcpu, int *guest_memfd,
+				 char **mem)
+{
+	const struct vm_shape shape = {
+		.mode = VM_MODE_DEFAULT,
+		.type = KVM_X86_SW_PROTECTED_VM,
+	};
+	size_t test_nr_pages;
+	struct kvm_vm *vm;
+	uint64_t flags;
+
+	test_nr_pages = test_page_size / PAGE_SIZE;
+	vm = __vm_create_shape_with_one_vcpu(shape, vcpu, test_nr_pages, NULL);
+
+	flags = GUEST_MEMFD_FLAG_SUPPORT_SHARED;
+	if (init_private)
+		flags |= GUEST_MEMFD_FLAG_INIT_PRIVATE;
+
+	*guest_memfd = vm_create_guest_memfd(vm, test_page_size, flags);
+	TEST_ASSERT(*guest_memfd > 0, "guest_memfd creation failed");
+
+	*mem = add_memslot(vm, test_page_size, *guest_memfd);
+
+	virt_map(vm, GUEST_MEMFD_SHARING_TEST_GVA, GUEST_MEMFD_SHARING_TEST_GPA,
+		 test_nr_pages);
+
+	return vm;
+}
+
+static void cleanup_test(size_t guest_memfd_size, struct kvm_vm *vm,
+			 int guest_memfd, char *mem)
+{
+	kvm_vm_free(vm);
+	TEST_ASSERT_EQ(munmap(mem, guest_memfd_size), 0);
+
+	if (guest_memfd > -1)
+		TEST_ASSERT_EQ(close(guest_memfd), 0);
+}
+
+static void test_sharing(void)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	int guest_memfd;
+	char *mem;
+
+	vm = setup_test(PAGE_SIZE, /*init_private=*/false, &vcpu, &guest_memfd, &mem);
+
+	host_use_memory(mem, 'X', 'A');
+	guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA, 'A', 'B', 0);
+
+	/* Toggle private flag of memory attributes and run the test again. */
+	guest_memfd_convert_private(guest_memfd, 0, PAGE_SIZE);
+
+	assert_host_cannot_fault(mem);
+	guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA, 'B', 'C', 0);
+
+	guest_memfd_convert_shared(guest_memfd, 0, PAGE_SIZE);
+
+	host_use_memory(mem, 'C', 'D');
+	guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA, 'D', 'E', 0);
+
+	cleanup_test(PAGE_SIZE, vm, guest_memfd, mem);
+}
+
+static void test_init_mappable_false(void)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	int guest_memfd;
+	char *mem;
+
+	vm = setup_test(PAGE_SIZE, /*init_private=*/true, &vcpu, &guest_memfd, &mem);
+
+	assert_host_cannot_fault(mem);
+	guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA, 'X', 'A', 0);
+
+	guest_memfd_convert_shared(guest_memfd, 0, PAGE_SIZE);
+
+	host_use_memory(mem, 'A', 'B');
+	guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA, 'B', 'C', 0);
+
+	cleanup_test(PAGE_SIZE, vm, guest_memfd, mem);
+}
+
+/*
+ * Test that even if there are no folios yet, conversion requests are recorded
+ * in guest_memfd.
+ */
+static void test_conversion_before_allocation(void)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	int guest_memfd;
+	char *mem;
+
+	vm = setup_test(PAGE_SIZE, /*init_private=*/false, &vcpu, &guest_memfd, &mem);
+
+	guest_memfd_convert_private(guest_memfd, 0, PAGE_SIZE);
+
+	assert_host_cannot_fault(mem);
+	guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA, 'X', 'A', 0);
+
+	guest_memfd_convert_shared(guest_memfd, 0, PAGE_SIZE);
+
+	host_use_memory(mem, 'A', 'B');
+	guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA, 'B', 'C', 0);
+
+	cleanup_test(PAGE_SIZE, vm, guest_memfd, mem);
+}
+
+static void __test_conversion_if_not_all_folios_allocated(int total_nr_pages,
+							  int page_to_fault)
+{
+	const int second_page_to_fault = 8;
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	size_t total_size;
+	int guest_memfd;
+	char *mem;
+	int i;
+
+	total_size = PAGE_SIZE * total_nr_pages;
+	vm = setup_test(total_size, /*init_private=*/false, &vcpu, &guest_memfd, &mem);
+
+	/*
+	 * Fault in 2 of the pages to test filemap range operations (only 1
+	 * page is faulted in when page_to_fault == second_page_to_fault).
+	 */
+	host_use_memory(mem + page_to_fault * PAGE_SIZE, 'X', 'A');
+	host_use_memory(mem + second_page_to_fault * PAGE_SIZE, 'X', 'A');
+
+	guest_memfd_convert_private(guest_memfd, 0, total_size);
+
+	for (i = 0; i < total_nr_pages; ++i) {
+		bool is_faulted;
+		char expected;
+
+		assert_host_cannot_fault(mem + i * PAGE_SIZE);
+
+		is_faulted = i == page_to_fault || i == second_page_to_fault;
+		expected = is_faulted ? 'A' : 'X';
+		guest_use_memory(vcpu,
+				 GUEST_MEMFD_SHARING_TEST_GVA + i * PAGE_SIZE,
+				 expected, 'B', 0);
+	}
+
+	guest_memfd_convert_shared(guest_memfd, 0, total_size);
+
+	for (i = 0; i < total_nr_pages; ++i) {
+		host_use_memory(mem + i * PAGE_SIZE, 'B', 'C');
+		guest_use_memory(vcpu,
+				 GUEST_MEMFD_SHARING_TEST_GVA + i * PAGE_SIZE,
+				 'C', 'D', 0);
+	}
+
+	cleanup_test(total_size, vm, guest_memfd, mem);
+}
+
+static void test_conversion_if_not_all_folios_allocated(void)
+{
+	const int total_nr_pages = 16;
+	int i;
+
+	for (i = 0; i < total_nr_pages; ++i)
+		__test_conversion_if_not_all_folios_allocated(total_nr_pages, i);
+}
+
+static void test_conversions_should_not_affect_surrounding_pages(void)
+{
+	struct kvm_vcpu *vcpu;
+	int page_to_convert;
+	struct kvm_vm *vm;
+	size_t total_size;
+	int guest_memfd;
+	int nr_pages;
+	char *mem;
+	int i;
+
+	page_to_convert = 2;
+	nr_pages = 4;
+	total_size = PAGE_SIZE * nr_pages;
+
+	vm = setup_test(total_size, /*init_private=*/false, &vcpu, &guest_memfd, &mem);
+
+	for (i = 0; i < nr_pages; ++i) {
+		host_use_memory(mem + i * PAGE_SIZE, 'X', 'A');
+		guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA + i * PAGE_SIZE,
+				 'A', 'B', 0);
+	}
+
+	guest_memfd_convert_private(guest_memfd, PAGE_SIZE * page_to_convert, PAGE_SIZE);
+
+	for (i = 0; i < nr_pages; ++i) {
+		char to_check;
+
+		if (i == page_to_convert) {
+			assert_host_cannot_fault(mem + i * PAGE_SIZE);
+			to_check = 'B';
+		} else {
+			host_use_memory(mem + i * PAGE_SIZE, 'B', 'C');
+			to_check = 'C';
+		}
+
+		guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA + i * PAGE_SIZE,
+				 to_check, 'D', 0);
+	}
+
+	guest_memfd_convert_shared(guest_memfd, PAGE_SIZE * page_to_convert, PAGE_SIZE);
+
+	for (i = 0; i < nr_pages; ++i) {
+		host_use_memory(mem + i * PAGE_SIZE, 'D', 'E');
+		guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA + i * PAGE_SIZE,
+				 'E', 'F', 0);
+	}
+
+	cleanup_test(total_size, vm, guest_memfd, mem);
+}
+
+static void __test_conversions_should_fail_if_memory_has_elevated_refcount(
+	int nr_pages, int page_to_convert)
+{
+	struct kvm_vcpu *vcpu;
+	loff_t error_offset;
+	struct kvm_vm *vm;
+	size_t total_size;
+	int guest_memfd;
+	char *mem;
+	int ret;
+	int i;
+
+	total_size = PAGE_SIZE * nr_pages;
+	vm = setup_test(total_size, /*init_private=*/false, &vcpu, &guest_memfd, &mem);
+
+	pin_pages(mem + page_to_convert * PAGE_SIZE, PAGE_SIZE);
+
+	for (i = 0; i < nr_pages; i++) {
+		host_use_memory(mem + i * PAGE_SIZE, 'X', 'A');
+		guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA + i * PAGE_SIZE,
+				 'A', 'B', 0);
+	}
+
+	error_offset = 0;
+	ret = __guest_memfd_convert_private(guest_memfd, page_to_convert * PAGE_SIZE,
+					    PAGE_SIZE, &error_offset);
+	TEST_ASSERT_EQ(ret, -1);
+	TEST_ASSERT_EQ(errno, EAGAIN);
+	TEST_ASSERT_EQ(error_offset, page_to_convert * PAGE_SIZE);
+
+	unpin_pages();
+
+	guest_memfd_convert_private(guest_memfd, page_to_convert * PAGE_SIZE, PAGE_SIZE);
+
+	for (i = 0; i < nr_pages; i++) {
+		char expected;
+
+		if (i == page_to_convert)
+			assert_host_cannot_fault(mem + i * PAGE_SIZE);
+		else
+			host_use_memory(mem + i * PAGE_SIZE, 'B', 'C');
+
+		expected = i == page_to_convert ? 'X' : 'C';
+		guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA + i * PAGE_SIZE,
+				 expected, 'D', 0);
+	}
+
+	guest_memfd_convert_shared(guest_memfd, page_to_convert * PAGE_SIZE, PAGE_SIZE);
+
+	for (i = 0; i < nr_pages; i++) {
+		char expected = i == page_to_convert ? 'X' : 'D';
+
+		host_use_memory(mem + i * PAGE_SIZE, expected, 'E');
+		guest_use_memory(vcpu,
+				 GUEST_MEMFD_SHARING_TEST_GVA + i * PAGE_SIZE,
+				 'E', 'F', 0);
+	}
+
+	cleanup_test(total_size, vm, guest_memfd, mem);
+}
+
+/*
+ * This test depends on CONFIG_GUP_TEST to provide a kernel module that exposes
+ * pin_user_pages() to userspace.
+ */
+static void test_conversions_should_fail_if_memory_has_elevated_refcount(void)
+{
+	int i;
+
+	for (i = 0; i < 4; i++)
+		__test_conversions_should_fail_if_memory_has_elevated_refcount(4, i);
+}
+
+static void test_truncate_should_not_change_mappability(void)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	int guest_memfd;
+	char *mem;
+	int ret;
+
+	vm = setup_test(PAGE_SIZE, /*init_private=*/false, &vcpu, &guest_memfd, &mem);
+
+	host_use_memory(mem, 'X', 'A');
+
+	ret = fallocate(guest_memfd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+			0, PAGE_SIZE);
+	TEST_ASSERT(!ret, "truncating the first page should succeed");
+
+	host_use_memory(mem, 'X', 'A');
+
+	guest_memfd_convert_private(guest_memfd, 0, PAGE_SIZE);
+
+	assert_host_cannot_fault(mem);
+	guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA, 'A', 'A', 0);
+
+	ret = fallocate(guest_memfd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+			0, PAGE_SIZE);
+	TEST_ASSERT(!ret, "truncating the first page should succeed");
+
+	assert_host_cannot_fault(mem);
+	guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA, 'X', 'A', 0);
+
+	cleanup_test(PAGE_SIZE, vm, guest_memfd, mem);
+}
+
+static void test_fault_type_independent_of_mem_attributes(void)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	int guest_memfd;
+	char *mem;
+
+	vm = setup_test(PAGE_SIZE, /*init_private=*/true, &vcpu, &guest_memfd, &mem);
+	vm_mem_set_shared(vm, GUEST_MEMFD_SHARING_TEST_GPA, PAGE_SIZE);
+
+	/*
+	 * kvm->mem_attr_array set to shared, guest_memfd memory initialized as
+	 * private.
+	 */
+
+	/* Host cannot use private memory. */
+	assert_host_cannot_fault(mem);
+
+	/* Guest can fault and use memory. */
+	guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA, 'X', 'A', 0);
+
+	guest_memfd_convert_shared(guest_memfd, 0, PAGE_SIZE);
+	vm_mem_set_private(vm, GUEST_MEMFD_SHARING_TEST_GPA, PAGE_SIZE);
+
+	/* Host can use shared memory. */
+	host_use_memory(mem, 'X', 'A');
+
+	/* Guest can also use shared memory. */
+	guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA, 'X', 'A', 0);
+
+	cleanup_test(PAGE_SIZE, vm, guest_memfd, mem);
+}
+
+int main(int argc, char *argv[])
+{
+	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
+	TEST_REQUIRE(kvm_check_cap(KVM_CAP_GMEM_SHARED_MEM));
+	TEST_REQUIRE(kvm_check_cap(KVM_CAP_GMEM_CONVERSION));
+
+	test_sharing();
+	test_init_mappable_false();
+	test_conversion_before_allocation();
+	test_conversion_if_not_all_folios_allocated();
+	test_conversions_should_not_affect_surrounding_pages();
+	test_truncate_should_not_change_mappability();
+	test_conversions_should_fail_if_memory_has_elevated_refcount();
+	test_fault_type_independent_of_mem_attributes();
+
+	return 0;
+}
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 853ab68cff79..ffe0625f2d71 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -18,11 +18,13 @@
 #include <asm/atomic.h>
 #include <asm/kvm.h>
 
+#include <string.h>
 #include <sys/ioctl.h>
 
 #include "kvm_util_arch.h"
 #include "kvm_util_types.h"
 #include "sparsebit.h"
+#include <sys/types.h>
 
 #define KVM_DEV_PATH "/dev/kvm"
 #define KVM_MAX_VCPUS 512
@@ -426,6 +428,78 @@ static inline void vm_mem_set_shared(struct kvm_vm *vm, uint64_t gpa,
 	vm_set_memory_attributes(vm, gpa, size, 0);
 }
 
+static inline int __guest_memfd_convert_private(int guest_memfd, loff_t offset,
+						size_t size, loff_t *error_offset)
+{
+	int ret;
+
+	struct kvm_gmem_convert param = {
+		.offset = offset,
+		.size = size,
+		.error_offset = 0,
+	};
+
+	ret = ioctl(guest_memfd, KVM_GMEM_CONVERT_PRIVATE, &param);
+	if (ret)
+		*error_offset = param.error_offset;
+
+	return ret;
+}
+
+static inline void guest_memfd_convert_private(int guest_memfd, loff_t offset,
+					       size_t size)
+{
+	loff_t error_offset;
+	int retries;
+	int ret;
+
+	retries = 2;
+	do {
+		error_offset = 0;
+		ret = __guest_memfd_convert_private(guest_memfd, offset, size,
+						    &error_offset);
+	} while (ret == -1 && errno == EAGAIN && --retries > 0);
+
+	TEST_ASSERT(!ret, "Unexpected error %s (%m) at offset 0x%lx",
+		    strerrorname_np(errno), error_offset);
+}
+
+static inline int __guest_memfd_convert_shared(int guest_memfd, loff_t offset,
+					       size_t size, loff_t *error_offset)
+{
+	int ret;
+
+	struct kvm_gmem_convert param = {
+		.offset = offset,
+		.size = size,
+		.error_offset = 0,
+	};
+
+	ret = ioctl(guest_memfd, KVM_GMEM_CONVERT_SHARED, &param);
+	if (ret)
+		*error_offset = param.error_offset;
+
+	return ret;
+}
+
+static inline void guest_memfd_convert_shared(int guest_memfd, loff_t offset,
+					      size_t size)
+{
+	loff_t error_offset;
+	int retries;
+	int ret;
+
+	retries = 2;
+	do {
+		error_offset = 0;
+		ret = __guest_memfd_convert_shared(guest_memfd, offset, size,
+						    &error_offset);
+	} while (ret == -1 && errno == EAGAIN && --retries > 0);
+
+	TEST_ASSERT(!ret, "Unexpected error %s (%m) at offset 0x%lx",
+		    strerrorname_np(errno), error_offset);
+}
+
 void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t gpa, uint64_t size,
 			    bool punch_hole);
 
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 13/51] KVM: selftests: Add script to exercise private_mem_conversions_test
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (11 preceding siblings ...)
  2025-05-14 23:41 ` [RFC PATCH v2 12/51] KVM: selftests: Test conversion flows for guest_memfd Ackerley Tng
@ 2025-05-14 23:41 ` Ackerley Tng
  2025-05-14 23:41 ` [RFC PATCH v2 14/51] KVM: selftests: Update private_mem_conversions_test to mmap guest_memfd Ackerley Tng
                   ` (42 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:41 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Add a wrapper script to make testing different combinations of
private_mem_conversions_test flags easier.
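
For example, assuming the selftests have been built so that the
private_mem_conversions_test binary sits next to the script, a typical
invocation would be:

	cd tools/testing/selftests/kvm
	./x86/private_mem_conversions_test.sh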

Change-Id: I7647e92524baf09eb97e09bdbd95ad57ada44f4b
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../kvm/x86/private_mem_conversions_test.sh   | 82 +++++++++++++++++++
 1 file changed, 82 insertions(+)
 create mode 100755 tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh

diff --git a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh
new file mode 100755
index 000000000000..76efa81114d2
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh
@@ -0,0 +1,82 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Wrapper script which runs different test setups of
+# private_mem_conversions_test.
+#
+# Copyright (C) 2024, Google LLC.
+
+set -e
+
+num_vcpus_to_test=4
+num_memslots_to_test=$num_vcpus_to_test
+
+get_default_hugepage_size_in_kB() {
+	grep "Hugepagesize:" /proc/meminfo | grep -o '[[:digit:]]\+'
+}
+
+# Required pages are based on the test setup (see the computation of memfd_size
+# in test_mem_conversions() in private_mem_conversions_test.c).
+
+# These static requirements are set to the maximum required for
+# num_vcpus_to_test, over all the hugetlb-related tests
+required_num_2m_hugepages=$(( 1024 * num_vcpus_to_test ))
+required_num_1g_hugepages=$(( 2 * num_vcpus_to_test ))
+
+# The other hugetlb sizes are not supported on x86_64
+[ "$(cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages 2>/dev/null || echo 0)" \
+  -ge "$required_num_2m_hugepages" ] && hugepage_2mb_enabled=1
+[ "$(cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages 2>/dev/null || echo 0)" \
+  -ge "$required_num_1g_hugepages" ] && hugepage_1gb_enabled=1
+
+case $(get_default_hugepage_size_in_kB) in
+	2048)
+		hugepage_default_enabled=$hugepage_2mb_enabled
+		;;
+	1048576)
+		hugepage_default_enabled=$hugepage_1gb_enabled
+		;;
+	*)
+		hugepage_default_enabled=""
+		;;
+esac
+
+backing_src_types=( anonymous )
+backing_src_types+=( anonymous_thp )
+[ -n "$hugepage_default_enabled" ] && \
+	backing_src_types+=( anonymous_hugetlb ) || \
+	echo "skipping anonymous_hugetlb backing source type"
+[ -n "$hugepage_2mb_enabled" ] && \
+	backing_src_types+=( anonymous_hugetlb_2mb ) || \
+	echo "skipping anonymous_hugetlb_2mb backing source type"
+[ -n "$hugepage_1gb_enabled" ] && \
+	backing_src_types+=( anonymous_hugetlb_1gb ) || \
+	echo "skipping anonymous_hugetlb_1gb backing source type"
+backing_src_types+=( shmem )
+[ -n "$hugepage_default_enabled" ] && \
+	backing_src_types+=( shared_hugetlb ) || \
+	echo "skipping shared_hugetlb backing source type"
+
+set +e
+
+TEST_EXECUTABLE="$(dirname "$0")/private_mem_conversions_test"
+
+(
+	set -e
+
+	for src_type in "${backing_src_types[@]}"; do
+
+		set -x
+
+		$TEST_EXECUTABLE -s "$src_type" -n $num_vcpus_to_test
+		$TEST_EXECUTABLE -s "$src_type" -n $num_vcpus_to_test -m $num_memslots_to_test
+
+		{ set +x; } 2>/dev/null
+
+		echo
+
+	done
+)
+RET=$?
+
+exit $RET
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 14/51] KVM: selftests: Update private_mem_conversions_test to mmap guest_memfd
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (12 preceding siblings ...)
  2025-05-14 23:41 ` [RFC PATCH v2 13/51] KVM: selftests: Add script to exercise private_mem_conversions_test Ackerley Tng
@ 2025-05-14 23:41 ` Ackerley Tng
  2025-05-14 23:41 ` [RFC PATCH v2 15/51] KVM: selftests: Update script to map shared memory from guest_memfd Ackerley Tng
                   ` (41 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:41 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Update private_mem_conversions_test so that it can use guest_memfd for
both private and shared memory, performing conversions via the
guest_memfd conversion ioctls.

Specify -g to also back shared memory with memory from guest_memfd.
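
An illustrative invocation (flag names are taken from the usage text
below; the backing source type is just an example):

	# Back shared memory with guest_memfd too, using 4 vCPUs and 4 memslots:
	./private_mem_conversions_test -s anonymous -n 4 -m 4 -g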

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Change-Id: Ibc647dc43fbdddac7cc465886bed92c07bbf4f00
---
 .../testing/selftests/kvm/include/kvm_util.h  |   1 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  36 ++++
 .../kvm/x86/private_mem_conversions_test.c    | 163 +++++++++++++++---
 3 files changed, 176 insertions(+), 24 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index ffe0625f2d71..ded65a15abea 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -721,6 +721,7 @@ void *addr_gpa2hva(struct kvm_vm *vm, vm_paddr_t gpa);
 void *addr_gva2hva(struct kvm_vm *vm, vm_vaddr_t gva);
 vm_paddr_t addr_hva2gpa(struct kvm_vm *vm, void *hva);
 void *addr_gpa2alias(struct kvm_vm *vm, vm_paddr_t gpa);
+int addr_gpa2guest_memfd(struct kvm_vm *vm, vm_paddr_t gpa, loff_t *offset);
 
 #ifndef vcpu_arch_put_guest
 #define vcpu_arch_put_guest(mem, val) do { (mem) = (val); } while (0)
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 58a3365f479c..253d0c00e2f0 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1734,6 +1734,42 @@ void *addr_gpa2hva(struct kvm_vm *vm, vm_paddr_t gpa)
 		+ (gpa - region->region.guest_phys_addr));
 }
 
+/*
+ * Address VM Physical to guest_memfd
+ *
+ * Input Args:
+ *   vm - Virtual Machine
+ *   gpa - VM physical address
+ *
+ * Output Args:
+ *   offset - offset in guest_memfd for gpa
+ *
+ * Return:
+ *   guest_memfd backing the memory region that contains gpa
+ *
+ * Locates the memory region containing the VM physical address given by gpa,
+ * within the VM given by vm.  When found, the guest_memfd providing the memory
+ * for the VM physical address and the offset in the file corresponding to the
+ * requested gpa are returned.  A TEST_ASSERT failure occurs if no region
+ * containing gpa exists.
+ */
+int addr_gpa2guest_memfd(struct kvm_vm *vm, vm_paddr_t gpa, loff_t *offset)
+{
+	struct userspace_mem_region *region;
+
+	gpa = vm_untag_gpa(vm, gpa);
+
+	region = userspace_mem_region_find(vm, gpa, gpa);
+	if (!region) {
+		TEST_FAIL("No vm physical memory at 0x%lx", gpa);
+		return -1;
+	}
+
+	*offset = region->region.guest_memfd_offset + gpa - region->region.guest_phys_addr;
+
+	return region->region.guest_memfd;
+}
+
 /*
  * Address Host Virtual to VM Physical
  *
diff --git a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
index 82a8d88b5338..ec20bb7e95c8 100644
--- a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
@@ -11,6 +11,7 @@
 #include <stdlib.h>
 #include <string.h>
 #include <sys/ioctl.h>
+#include <sys/wait.h>
 
 #include <linux/compiler.h>
 #include <linux/kernel.h>
@@ -202,15 +203,19 @@ static void guest_test_explicit_conversion(uint64_t base_gpa, bool do_fallocate)
 		guest_sync_shared(gpa, size, p3, p4);
 		memcmp_g(gpa, p4, size);
 
-		/* Reset the shared memory back to the initial pattern. */
-		memset((void *)gpa, init_p, size);
-
 		/*
 		 * Free (via PUNCH_HOLE) *all* private memory so that the next
 		 * iteration starts from a clean slate, e.g. with respect to
 		 * whether or not there are pages/folios in guest_mem.
 		 */
 		guest_map_shared(base_gpa, PER_CPU_DATA_SIZE, true);
+
+		/*
+		 * Reset the entire block back to the initial pattern. Do this
+		 * after fallocate(PUNCH_HOLE) because hole-punching zeroes
+		 * memory.
+		 */
+		memset((void *)base_gpa, init_p, PER_CPU_DATA_SIZE);
 	}
 }
 
@@ -286,7 +291,8 @@ static void guest_code(uint64_t base_gpa)
 	GUEST_DONE();
 }
 
-static void handle_exit_hypercall(struct kvm_vcpu *vcpu)
+static void handle_exit_hypercall(struct kvm_vcpu *vcpu,
+				  bool back_shared_memory_with_guest_memfd)
 {
 	struct kvm_run *run = vcpu->run;
 	uint64_t gpa = run->hypercall.args[0];
@@ -303,17 +309,81 @@ static void handle_exit_hypercall(struct kvm_vcpu *vcpu)
 	if (do_fallocate)
 		vm_guest_mem_fallocate(vm, gpa, size, map_shared);
 
-	if (set_attributes)
-		vm_set_memory_attributes(vm, gpa, size,
-					 map_shared ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE);
+	if (set_attributes) {
+		if (back_shared_memory_with_guest_memfd) {
+			loff_t offset;
+			int guest_memfd;
+
+			guest_memfd = addr_gpa2guest_memfd(vm, gpa, &offset);
+
+			if (map_shared)
+				guest_memfd_convert_shared(guest_memfd, offset, size);
+			else
+				guest_memfd_convert_private(guest_memfd, offset, size);
+		} else {
+			uint64_t attrs;
+
+			attrs = map_shared ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE;
+			vm_set_memory_attributes(vm, gpa, size, attrs);
+		}
+	}
 	run->hypercall.ret = 0;
 }
 
+static void assert_not_faultable(uint8_t *address)
+{
+	pid_t child_pid;
+
+	child_pid = fork();
+	TEST_ASSERT(child_pid != -1, "fork failed");
+
+	if (child_pid == 0) {
+		*address = 'A';
+		TEST_FAIL("Child should have exited with a signal");
+	} else {
+		int status;
+
+		waitpid(child_pid, &status, 0);
+
+		TEST_ASSERT(WIFSIGNALED(status),
+			    "Child should have exited with a signal");
+		TEST_ASSERT_EQ(WTERMSIG(status), SIGBUS);
+	}
+}
+
+static void add_memslot(struct kvm_vm *vm, uint64_t gpa, uint32_t slot,
+			uint64_t size, int guest_memfd,
+			uint64_t guest_memfd_offset)
+{
+	struct userspace_mem_region *region;
+
+	region = vm_mem_region_alloc(vm);
+
+	guest_memfd = vm_mem_region_install_guest_memfd(region, guest_memfd);
+
+	vm_mem_region_mmap(region, size, MAP_SHARED, guest_memfd, guest_memfd_offset);
+	vm_mem_region_install_memory(region, size, getpagesize());
+
+	region->region.slot = slot;
+	region->region.flags = KVM_MEM_GUEST_MEMFD;
+	region->region.guest_phys_addr = gpa;
+	region->region.guest_memfd_offset = guest_memfd_offset;
+
+	vm_mem_region_add(vm, region);
+}
+
 static bool run_vcpus;
 
-static void *__test_mem_conversions(void *__vcpu)
+struct test_thread_args
 {
-	struct kvm_vcpu *vcpu = __vcpu;
+	struct kvm_vcpu *vcpu;
+	bool back_shared_memory_with_guest_memfd;
+};
+
+static void *__test_mem_conversions(void *params)
+{
+	struct test_thread_args *args = params;
+	struct kvm_vcpu *vcpu = args->vcpu;
 	struct kvm_run *run = vcpu->run;
 	struct kvm_vm *vm = vcpu->vm;
 	struct ucall uc;
@@ -325,7 +395,10 @@ static void *__test_mem_conversions(void *__vcpu)
 		vcpu_run(vcpu);
 
 		if (run->exit_reason == KVM_EXIT_HYPERCALL) {
-			handle_exit_hypercall(vcpu);
+			handle_exit_hypercall(
+				vcpu,
+				args->back_shared_memory_with_guest_memfd);
+
 			continue;
 		}
 
@@ -349,8 +422,18 @@ static void *__test_mem_conversions(void *__vcpu)
 				size_t nr_bytes = min_t(size_t, vm->page_size, size - i);
 				uint8_t *hva = addr_gpa2hva(vm, gpa + i);
 
-				/* In all cases, the host should observe the shared data. */
-				memcmp_h(hva, gpa + i, uc.args[3], nr_bytes);
+				/* Check contents of memory */
+				if (args->back_shared_memory_with_guest_memfd &&
+				    uc.args[0] == SYNC_PRIVATE) {
+					assert_not_faultable(hva);
+				} else {
+					/*
+					 * If shared and private memory use
+					 * separate backing memory, the host
+					 * should always observe shared data.
+					 */
+					memcmp_h(hva, gpa + i, uc.args[3], nr_bytes);
+				}
 
 				/* For shared, write the new pattern to guest memory. */
 				if (uc.args[0] == SYNC_SHARED)
@@ -366,14 +449,16 @@ static void *__test_mem_conversions(void *__vcpu)
 	}
 }
 
-static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t nr_vcpus,
-				 uint32_t nr_memslots)
+static void test_mem_conversions(enum vm_mem_backing_src_type src_type,
+				 uint32_t nr_vcpus, uint32_t nr_memslots,
+				 bool back_shared_memory_with_guest_memfd)
 {
 	/*
 	 * Allocate enough memory so that each vCPU's chunk of memory can be
 	 * naturally aligned with respect to the size of the backing store.
 	 */
 	const size_t alignment = max_t(size_t, SZ_2M, get_backing_src_pagesz(src_type));
+	struct test_thread_args *thread_args[KVM_MAX_VCPUS];
 	const size_t per_cpu_size = align_up(PER_CPU_DATA_SIZE, alignment);
 	const size_t memfd_size = per_cpu_size * nr_vcpus;
 	const size_t slot_size = memfd_size / nr_memslots;
@@ -381,6 +466,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t
 	pthread_t threads[KVM_MAX_VCPUS];
 	struct kvm_vm *vm;
 	int memfd, i, r;
+	uint64_t flags;
 
 	const struct vm_shape shape = {
 		.mode = VM_MODE_DEFAULT,
@@ -394,12 +480,23 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t
 
 	vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE));
 
-	memfd = vm_create_guest_memfd(vm, memfd_size, 0);
+	flags = back_shared_memory_with_guest_memfd ?
+			GUEST_MEMFD_FLAG_SUPPORT_SHARED :
+			0;
+	memfd = vm_create_guest_memfd(vm, memfd_size, flags);
 
-	for (i = 0; i < nr_memslots; i++)
-		vm_mem_add(vm, src_type, BASE_DATA_GPA + slot_size * i,
-			   BASE_DATA_SLOT + i, slot_size / vm->page_size,
-			   KVM_MEM_GUEST_MEMFD, memfd, slot_size * i);
+	for (i = 0; i < nr_memslots; i++) {
+		if (back_shared_memory_with_guest_memfd) {
+			add_memslot(vm, BASE_DATA_GPA + slot_size * i,
+				    BASE_DATA_SLOT + i, slot_size, memfd,
+				    slot_size * i);
+		} else {
+			vm_mem_add(vm, src_type, BASE_DATA_GPA + slot_size * i,
+				   BASE_DATA_SLOT + i,
+				   slot_size / vm->page_size,
+				   KVM_MEM_GUEST_MEMFD, memfd, slot_size * i);
+		}
+	}
 
 	for (i = 0; i < nr_vcpus; i++) {
 		uint64_t gpa =  BASE_DATA_GPA + i * per_cpu_size;
@@ -412,13 +509,23 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t
 		 */
 		virt_map(vm, gpa, gpa, PER_CPU_DATA_SIZE / vm->page_size);
 
-		pthread_create(&threads[i], NULL, __test_mem_conversions, vcpus[i]);
+		thread_args[i] = malloc(sizeof(struct test_thread_args));
+		TEST_ASSERT(thread_args[i] != NULL,
+			    "Could not allocate memory for thread parameters");
+		thread_args[i]->vcpu = vcpus[i];
+		thread_args[i]->back_shared_memory_with_guest_memfd =
+			back_shared_memory_with_guest_memfd;
+
+		pthread_create(&threads[i], NULL, __test_mem_conversions,
+			       (void *)thread_args[i]);
 	}
 
 	WRITE_ONCE(run_vcpus, true);
 
-	for (i = 0; i < nr_vcpus; i++)
+	for (i = 0; i < nr_vcpus; i++) {
 		pthread_join(threads[i], NULL);
+		free(thread_args[i]);
+	}
 
 	kvm_vm_free(vm);
 
@@ -440,7 +547,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t
 static void usage(const char *cmd)
 {
 	puts("");
-	printf("usage: %s [-h] [-m nr_memslots] [-s mem_type] [-n nr_vcpus]\n", cmd);
+	printf("usage: %s [-h] [-g] [-m nr_memslots] [-s mem_type] [-n nr_vcpus]\n", cmd);
 	puts("");
 	backing_src_help("-s");
 	puts("");
@@ -448,18 +555,21 @@ static void usage(const char *cmd)
 	puts("");
 	puts(" -m: specify the number of memslots (default: 1)");
 	puts("");
+	puts(" -g: back shared memory with guest_memfd (default: false)");
+	puts("");
 }
 
 int main(int argc, char *argv[])
 {
 	enum vm_mem_backing_src_type src_type = DEFAULT_VM_MEM_SRC;
+	bool back_shared_memory_with_guest_memfd = false;
 	uint32_t nr_memslots = 1;
 	uint32_t nr_vcpus = 1;
 	int opt;
 
 	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
 
-	while ((opt = getopt(argc, argv, "hm:s:n:")) != -1) {
+	while ((opt = getopt(argc, argv, "hgm:s:n:")) != -1) {
 		switch (opt) {
 		case 's':
 			src_type = parse_backing_src_type(optarg);
@@ -470,6 +580,9 @@ int main(int argc, char *argv[])
 		case 'm':
 			nr_memslots = atoi_positive("nr_memslots", optarg);
 			break;
+		case 'g':
+			back_shared_memory_with_guest_memfd = true;
+			break;
 		case 'h':
 		default:
 			usage(argv[0]);
@@ -477,7 +590,9 @@ int main(int argc, char *argv[])
 		}
 	}
 
-	test_mem_conversions(src_type, nr_vcpus, nr_memslots);
+	test_mem_conversions(src_type, nr_vcpus, nr_memslots,
+			     back_shared_memory_with_guest_memfd);
+
 	return 0;
 }
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 15/51] KVM: selftests: Update script to map shared memory from guest_memfd
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (13 preceding siblings ...)
  2025-05-14 23:41 ` [RFC PATCH v2 14/51] KVM: selftests: Update private_mem_conversions_test to mmap guest_memfd Ackerley Tng
@ 2025-05-14 23:41 ` Ackerley Tng
  2025-05-14 23:41 ` [RFC PATCH v2 16/51] mm: hugetlb: Consolidate interpretation of gbl_chg within alloc_hugetlb_folio() Ackerley Tng
                   ` (40 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:41 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Update the private_mem_conversions_test.sh script to use the -g flag
to also test conversions when both private and shared memory are
mapped from guest_memfd.

Change-Id: I16f8f6e4e5c361bbc4daeb66f15e8165db3d98f7
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../testing/selftests/kvm/x86/private_mem_conversions_test.sh  | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh
index 76efa81114d2..5dda6916e071 100755
--- a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh
+++ b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh
@@ -71,6 +71,9 @@ TEST_EXECUTABLE="$(dirname "$0")/private_mem_conversions_test"
 		$TEST_EXECUTABLE -s "$src_type" -n $num_vcpus_to_test
 		$TEST_EXECUTABLE -s "$src_type" -n $num_vcpus_to_test -m $num_memslots_to_test
 
+		$TEST_EXECUTABLE -s "$src_type" -n $num_vcpus_to_test -g
+		$TEST_EXECUTABLE -s "$src_type" -n $num_vcpus_to_test -m $num_memslots_to_test -g
+
 		{ set +x; } 2>/dev/null
 
 		echo
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 16/51] mm: hugetlb: Consolidate interpretation of gbl_chg within alloc_hugetlb_folio()
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (14 preceding siblings ...)
  2025-05-14 23:41 ` [RFC PATCH v2 15/51] KVM: selftests: Update script to map shared memory from guest_memfd Ackerley Tng
@ 2025-05-14 23:41 ` Ackerley Tng
  2025-05-15  2:09   ` Matthew Wilcox
                     ` (2 more replies)
  2025-05-14 23:41 ` [RFC PATCH v2 17/51] mm: hugetlb: Cleanup interpretation of gbl_chg in alloc_hugetlb_folio() Ackerley Tng
                   ` (39 subsequent siblings)
  55 siblings, 3 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:41 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Previously, gbl_chg was passed from alloc_hugetlb_folio() into
dequeue_hugetlb_folio_vma(), leaking the concept of gbl_chg into
dequeue_hugetlb_folio_vma().

This patch consolidates the interpretation of gbl_chg into
alloc_hugetlb_folio() and renames dequeue_hugetlb_folio_vma() to
dequeue_hugetlb_folio(), so that the latter can focus solely on
dequeuing a folio.
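
Condensed from the diff below, the dequeue at the call site becomes:

	folio = NULL;
	if (!gbl_chg || available_huge_pages(h))
		folio = dequeue_hugetlb_folio(h, vma, addr);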

Change-Id: I31bf48af2400b6e13b44d03c8be22ce1a9092a9c
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 mm/hugetlb.c | 28 +++++++++++-----------------
 1 file changed, 11 insertions(+), 17 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6ea1be71aa42..b843e869496f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1364,9 +1364,9 @@ static unsigned long available_huge_pages(struct hstate *h)
 	return h->free_huge_pages - h->resv_huge_pages;
 }
 
-static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
-				struct vm_area_struct *vma,
-				unsigned long address, long gbl_chg)
+static struct folio *dequeue_hugetlb_folio(struct hstate *h,
+					   struct vm_area_struct *vma,
+					   unsigned long address)
 {
 	struct folio *folio = NULL;
 	struct mempolicy *mpol;
@@ -1374,13 +1374,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 	nodemask_t *nodemask;
 	int nid;
 
-	/*
-	 * gbl_chg==1 means the allocation requires a new page that was not
-	 * reserved before.  Making sure there's at least one free page.
-	 */
-	if (gbl_chg && !available_huge_pages(h))
-		goto err;
-
 	gfp_mask = htlb_alloc_mask(h);
 	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
 
@@ -1398,9 +1391,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 
 	mpol_cond_put(mpol);
 	return folio;
-
-err:
-	return NULL;
 }
 
 /*
@@ -3074,12 +3064,16 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 		goto out_uncharge_cgroup_reservation;
 
 	spin_lock_irq(&hugetlb_lock);
+
 	/*
-	 * glb_chg is passed to indicate whether or not a page must be taken
-	 * from the global free pool (global change).  gbl_chg == 0 indicates
-	 * a reservation exists for the allocation.
+	 * gbl_chg == 0 indicates a reservation exists for the allocation - so
+	 * try dequeuing a page. If there are available_huge_pages(), try using
+	 * them!
 	 */
-	folio = dequeue_hugetlb_folio_vma(h, vma, addr, gbl_chg);
+	folio = NULL;
+	if (!gbl_chg || available_huge_pages(h))
+		folio = dequeue_hugetlb_folio(h, vma, addr);
+
 	if (!folio) {
 		spin_unlock_irq(&hugetlb_lock);
 		folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 17/51] mm: hugetlb: Cleanup interpretation of gbl_chg in alloc_hugetlb_folio()
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (15 preceding siblings ...)
  2025-05-14 23:41 ` [RFC PATCH v2 16/51] mm: hugetlb: Consolidate interpretation of gbl_chg within alloc_hugetlb_folio() Ackerley Tng
@ 2025-05-14 23:41 ` Ackerley Tng
  2025-05-14 23:41 ` [RFC PATCH v2 18/51] mm: hugetlb: Cleanup interpretation of map_chg_state within alloc_hugetlb_folio() Ackerley Tng
                   ` (38 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:41 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

The comment before dequeuing a folio explains that if gbl_chg == 0, a
reservation exists for the allocation.

In addition, if a vma reservation exists, there's no need to get a
reservation from the subpool, and gbl_chg was set to 0.

This patch expresses both of these facts directly in code:
subpool_reservation_exists defaults to false, and if a vma reservation
does not exist, a reservation is sought from the subpool.

Then, the existence of a reservation, whether in the vma or subpool,
is summarized into reservation_exists, which is then used to determine
whether to dequeue a folio.
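
Condensed from the diff below, the reservation checks become:

	subpool_reservation_exists = false;
	if (map_chg) {
		int npages_req = hugepage_subpool_get_pages(spool, 1);

		if (npages_req < 0)
			goto out_end_reservation;

		subpool_reservation_exists = npages_req == 0;
	}
	reservation_exists = !map_chg || subpool_reservation_exists;

	folio = NULL;
	if (reservation_exists || available_huge_pages(h))
		folio = dequeue_hugetlb_folio(h, vma, addr);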

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Change-Id: I52130a0bf9f33e07d320a446cdb3ebfddd9de658
---
 mm/hugetlb.c | 28 ++++++++++++----------------
 1 file changed, 12 insertions(+), 16 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b843e869496f..597f2b9f62b5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2999,8 +2999,10 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 {
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	struct hstate *h = hstate_vma(vma);
+	bool subpool_reservation_exists;
+	bool reservation_exists;
 	struct folio *folio;
-	long retval, gbl_chg;
+	long retval;
 	map_chg_state map_chg;
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg = NULL;
@@ -3036,17 +3038,16 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	 * that the allocation will not exceed the subpool limit.
 	 * Or if it can get one from the pool reservation directly.
 	 */
+	subpool_reservation_exists = false;
 	if (map_chg) {
-		gbl_chg = hugepage_subpool_get_pages(spool, 1);
-		if (gbl_chg < 0)
+		int npages_req = hugepage_subpool_get_pages(spool, 1);
+
+		if (npages_req < 0)
 			goto out_end_reservation;
-	} else {
-		/*
-		 * If we have the vma reservation ready, no need for extra
-		 * global reservation.
-		 */
-		gbl_chg = 0;
+
+		subpool_reservation_exists = npages_req == 0;
 	}
+	reservation_exists = !map_chg || subpool_reservation_exists;
 
 	/*
 	 * If this allocation is not consuming a per-vma reservation,
@@ -3065,13 +3066,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 	spin_lock_irq(&hugetlb_lock);
 
-	/*
-	 * gbl_chg == 0 indicates a reservation exists for the allocation - so
-	 * try dequeuing a page. If there are available_huge_pages(), try using
-	 * them!
-	 */
 	folio = NULL;
-	if (!gbl_chg || available_huge_pages(h))
+	if (reservation_exists || available_huge_pages(h))
 		folio = dequeue_hugetlb_folio(h, vma, addr);
 
 	if (!folio) {
@@ -3089,7 +3085,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	 * Either dequeued or buddy-allocated folio needs to add special
 	 * mark to the folio when it consumes a global reservation.
 	 */
-	if (!gbl_chg) {
+	if (reservation_exists) {
 		folio_set_hugetlb_restore_reserve(folio);
 		h->resv_huge_pages--;
 	}
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 18/51] mm: hugetlb: Cleanup interpretation of map_chg_state within alloc_hugetlb_folio()
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (16 preceding siblings ...)
  2025-05-14 23:41 ` [RFC PATCH v2 17/51] mm: hugetlb: Cleanup interpretation of gbl_chg in alloc_hugetlb_folio() Ackerley Tng
@ 2025-05-14 23:41 ` Ackerley Tng
  2025-07-07 18:08   ` James Houghton
  2025-05-14 23:41 ` [RFC PATCH v2 19/51] mm: hugetlb: Rename alloc_surplus_hugetlb_folio Ackerley Tng
                   ` (37 subsequent siblings)
  55 siblings, 1 reply; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:41 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Interpreting map_chg_state inline, within alloc_hugetlb_folio(),
improves readability.

Previously, cow_from_owner and the result of vma_needs_reservation()
were used to compute a map_chg_state, which was then interpreted within
alloc_hugetlb_folio() to determine whether to:

+ Get a page from the subpool or
+ Charge cgroup reservations or
+ Commit vma reservations or
+ Clean up reservations

This refactoring makes those decisions based solely on whether a vma
reservation exists (vma_reservation_exists). If a vma reservation
exists, the subpool had
already been debited and the cgroup had been charged, hence
alloc_hugetlb_folio() should not double-debit or double-charge. If the
vma reservation can't be used (as in cow_from_owner), then the vma
reservation effectively does not exist and vma_reservation_exists is
set to false.

The conditions for committing reservations or cleaning are also
updated to be paired with the corresponding conditions guarding
reservation creation.
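
Condensed from the diff below, the decision now hinges on whether a vma
reservation exists:

	if (cow_from_owner) {
		/* CoW on top of a reserved page cannot consume the per-vma resv. */
		vma_reservation_exists = false;
	} else {
		int npages_req = vma_needs_reservation(h, vma, addr);

		if (npages_req < 0)
			return ERR_PTR(-ENOMEM);

		vma_reservation_exists = npages_req == 0;
	}

	/* Subpool debiting and cgroup reservation charging pair with this check. */
	charge_cgroup_rsvd = !vma_reservation_exists;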

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Change-Id: I22d72a2cae61fb64dc78e0a870b254811a06a31e
---
 mm/hugetlb.c | 94 ++++++++++++++++++++++------------------------------
 1 file changed, 39 insertions(+), 55 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 597f2b9f62b5..67144af7ab79 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2968,25 +2968,6 @@ void wait_for_freed_hugetlb_folios(void)
 	flush_work(&free_hpage_work);
 }
 
-typedef enum {
-	/*
-	 * For either 0/1: we checked the per-vma resv map, and one resv
-	 * count either can be reused (0), or an extra needed (1).
-	 */
-	MAP_CHG_REUSE = 0,
-	MAP_CHG_NEEDED = 1,
-	/*
-	 * Cannot use per-vma resv count can be used, hence a new resv
-	 * count is enforced.
-	 *
-	 * NOTE: This is mostly identical to MAP_CHG_NEEDED, except
-	 * that currently vma_needs_reservation() has an unwanted side
-	 * effect to either use end() or commit() to complete the
-	 * transaction.	 Hence it needs to differenciate from NEEDED.
-	 */
-	MAP_CHG_ENFORCED = 2,
-} map_chg_state;
-
 /*
  * NOTE! "cow_from_owner" represents a very hacky usage only used in CoW
  * faults of hugetlb private mappings on top of a non-page-cache folio (in
@@ -3000,46 +2981,45 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	struct hstate *h = hstate_vma(vma);
 	bool subpool_reservation_exists;
+	bool vma_reservation_exists;
 	bool reservation_exists;
+	bool charge_cgroup_rsvd;
 	struct folio *folio;
-	long retval;
-	map_chg_state map_chg;
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg = NULL;
 	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
 
 	idx = hstate_index(h);
 
-	/* Whether we need a separate per-vma reservation? */
 	if (cow_from_owner) {
 		/*
 		 * Special case!  Since it's a CoW on top of a reserved
 		 * page, the private resv map doesn't count.  So it cannot
 		 * consume the per-vma resv map even if it's reserved.
 		 */
-		map_chg = MAP_CHG_ENFORCED;
+		vma_reservation_exists = false;
 	} else {
 		/*
 		 * Examine the region/reserve map to determine if the process
-		 * has a reservation for the page to be allocated.  A return
-		 * code of zero indicates a reservation exists (no change).
+		 * has a reservation for the page to be allocated and debit the
+		 * reservation.  If the number of pages required is 0,
+		 * reservation exists.
 		 */
-		retval = vma_needs_reservation(h, vma, addr);
-		if (retval < 0)
+		int npages_req = vma_needs_reservation(h, vma, addr);
+
+		if (npages_req < 0)
 			return ERR_PTR(-ENOMEM);
-		map_chg = retval ? MAP_CHG_NEEDED : MAP_CHG_REUSE;
+
+		vma_reservation_exists = npages_req == 0;
 	}
 
 	/*
-	 * Whether we need a separate global reservation?
-	 *
-	 * Processes that did not create the mapping will have no
-	 * reserves as indicated by the region/reserve map. Check
-	 * that the allocation will not exceed the subpool limit.
-	 * Or if it can get one from the pool reservation directly.
+	 * Debit subpool only if a vma reservation does not exist.  If
+	 * vma_reservation_exists, the vma reservation was either moved from the
+	 * subpool or taken directly from hstate in hugetlb_reserve_pages()
 	 */
 	subpool_reservation_exists = false;
-	if (map_chg) {
+	if (!vma_reservation_exists) {
 		int npages_req = hugepage_subpool_get_pages(spool, 1);
 
 		if (npages_req < 0)
@@ -3047,13 +3027,16 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 		subpool_reservation_exists = npages_req == 0;
 	}
-	reservation_exists = !map_chg || subpool_reservation_exists;
+
+	reservation_exists = vma_reservation_exists || subpool_reservation_exists;
 
 	/*
-	 * If this allocation is not consuming a per-vma reservation,
-	 * charge the hugetlb cgroup now.
+	 * If a vma_reservation_exists, we can skip charging hugetlb
+	 * reservations since that was charged in hugetlb_reserve_pages() when
+	 * the reservation was recorded on the resv_map.
 	 */
-	if (map_chg) {
+	charge_cgroup_rsvd = !vma_reservation_exists;
+	if (charge_cgroup_rsvd) {
 		ret = hugetlb_cgroup_charge_cgroup_rsvd(
 			idx, pages_per_huge_page(h), &h_cg);
 		if (ret)
@@ -3091,10 +3074,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	}
 
 	hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, folio);
-	/* If allocation is not consuming a reservation, also store the
-	 * hugetlb_cgroup pointer on the page.
-	 */
-	if (map_chg) {
+
+	if (charge_cgroup_rsvd) {
 		hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
 						  h_cg, folio);
 	}
@@ -3103,25 +3084,27 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 	hugetlb_set_folio_subpool(folio, spool);
 
-	if (map_chg != MAP_CHG_ENFORCED) {
-		/* commit() is only needed if the map_chg is not enforced */
-		retval = vma_commit_reservation(h, vma, addr);
+	/* If vma accounting wasn't bypassed earlier, follow up with commit. */
+	if (!cow_from_owner) {
+		int ret = vma_commit_reservation(h, vma, addr);
 		/*
-		 * Check for possible race conditions. When it happens..
-		 * The page was added to the reservation map between
-		 * vma_needs_reservation and vma_commit_reservation.
-		 * This indicates a race with hugetlb_reserve_pages.
+		 * If there is a discrepancy in reservation status between the
+		 * time of vma_needs_reservation() and vma_commit_reservation(),
+		 * then there the page must have been added to the reservation
+		 * map between vma_needs_reservation() and
+		 * vma_commit_reservation().
+		 *
 		 * Adjust for the subpool count incremented above AND
 		 * in hugetlb_reserve_pages for the same page.	Also,
 		 * the reservation count added in hugetlb_reserve_pages
 		 * no longer applies.
 		 */
-		if (unlikely(map_chg == MAP_CHG_NEEDED && retval == 0)) {
+		if (unlikely(!vma_reservation_exists && ret == 0)) {
 			long rsv_adjust;
 
 			rsv_adjust = hugepage_subpool_put_pages(spool, 1);
 			hugetlb_acct_memory(h, -rsv_adjust);
-			if (map_chg) {
+			if (charge_cgroup_rsvd) {
 				spin_lock_irq(&hugetlb_lock);
 				hugetlb_cgroup_uncharge_folio_rsvd(
 				    hstate_index(h), pages_per_huge_page(h),
@@ -3149,14 +3132,15 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 out_uncharge_cgroup:
 	hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
 out_uncharge_cgroup_reservation:
-	if (map_chg)
+	if (charge_cgroup_rsvd)
 		hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
 						    h_cg);
 out_subpool_put:
-	if (map_chg)
+	if (!vma_reservation_exists)
 		hugepage_subpool_put_pages(spool, 1);
 out_end_reservation:
-	if (map_chg != MAP_CHG_ENFORCED)
+	/* If vma accounting wasn't bypassed earlier, cleanup. */
+	if (!cow_from_owner)
 		vma_end_reservation(h, vma, addr);
 	return ERR_PTR(-ENOSPC);
 }
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 19/51] mm: hugetlb: Rename alloc_surplus_hugetlb_folio
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (17 preceding siblings ...)
  2025-05-14 23:41 ` [RFC PATCH v2 18/51] mm: hugetlb: Cleanup interpretation of map_chg_state within alloc_hugetlb_folio() Ackerley Tng
@ 2025-05-14 23:41 ` Ackerley Tng
  2025-05-14 23:41 ` [RFC PATCH v2 20/51] mm: mempolicy: Refactor out policy_node_nodemask() Ackerley Tng
                   ` (36 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:41 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Rename alloc_surplus_hugetlb_folio to alloc_surplus_hugetlb_folio_nodemask
and alloc_buddy_hugetlb_folio_with_mpol to alloc_surplus_hugetlb_folio, so
that the naming is aligned with the dequeue_hugetlb_folio vs
dequeue_hugetlb_folio_nodemask pair.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Change-Id: I38982497eb70aeb174c386ed71bb896d85939eae
---
 mm/hugetlb.c | 38 ++++++++++++++++++++------------------
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 67144af7ab79..b822b204e9b3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2236,7 +2236,7 @@ int dissolve_free_hugetlb_folios(unsigned long start_pfn, unsigned long end_pfn)
 /*
  * Allocates a fresh surplus page from the page allocator.
  */
-static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
+static struct folio *alloc_surplus_hugetlb_folio_nodemask(struct hstate *h,
 				gfp_t gfp_mask,	int nid, nodemask_t *nmask)
 {
 	struct folio *folio = NULL;
@@ -2312,9 +2312,9 @@ static struct folio *alloc_migrate_hugetlb_folio(struct hstate *h, gfp_t gfp_mas
 /*
  * Use the VMA's mpolicy to allocate a huge page from the buddy.
  */
-static
-struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
-		struct vm_area_struct *vma, unsigned long addr)
+static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
+						 struct vm_area_struct *vma,
+						 unsigned long addr)
 {
 	struct folio *folio = NULL;
 	struct mempolicy *mpol;
@@ -2326,14 +2326,14 @@ struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
 	if (mpol_is_preferred_many(mpol)) {
 		gfp_t gfp = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
 
-		folio = alloc_surplus_hugetlb_folio(h, gfp, nid, nodemask);
+		folio = alloc_surplus_hugetlb_folio_nodemask(h, gfp, nid, nodemask);
 
 		/* Fallback to all nodes if page==NULL */
 		nodemask = NULL;
 	}
 
 	if (!folio)
-		folio = alloc_surplus_hugetlb_folio(h, gfp_mask, nid, nodemask);
+		folio = alloc_surplus_hugetlb_folio_nodemask(h, gfp_mask, nid, nodemask);
 	mpol_cond_put(mpol);
 	return folio;
 }
@@ -2435,14 +2435,14 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 
 		/* Prioritize current node */
 		if (node_isset(numa_mem_id(), alloc_nodemask))
-			folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h),
+			folio = alloc_surplus_hugetlb_folio_nodemask(h, htlb_alloc_mask(h),
 					numa_mem_id(), NULL);
 
 		if (!folio) {
 			for_each_node_mask(node, alloc_nodemask) {
 				if (node == numa_mem_id())
 					continue;
-				folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h),
+				folio = alloc_surplus_hugetlb_folio_nodemask(h, htlb_alloc_mask(h),
 						node, NULL);
 				if (folio)
 					break;
@@ -3055,7 +3055,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 	if (!folio) {
 		spin_unlock_irq(&hugetlb_lock);
-		folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
+		folio = alloc_surplus_hugetlb_folio(h, vma, addr);
 		if (!folio)
 			goto out_uncharge_cgroup;
 		spin_lock_irq(&hugetlb_lock);
@@ -3868,11 +3868,12 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	 * First take pages out of surplus state.  Then make up the
 	 * remaining difference by allocating fresh huge pages.
 	 *
-	 * We might race with alloc_surplus_hugetlb_folio() here and be unable
-	 * to convert a surplus huge page to a normal huge page. That is
-	 * not critical, though, it just means the overall size of the
-	 * pool might be one hugepage larger than it needs to be, but
-	 * within all the constraints specified by the sysctls.
+	 * We might race with alloc_surplus_hugetlb_folio_nodemask()
+	 * here and be unable to convert a surplus huge page to a normal
+	 * huge page. That is not critical, though, it just means the
+	 * overall size of the pool might be one hugepage larger than it
+	 * needs to be, but within all the constraints specified by the
+	 * sysctls.
 	 */
 	while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
 		if (!adjust_pool_surplus(h, nodes_allowed, -1))
@@ -3930,10 +3931,11 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	 * By placing pages into the surplus state independent of the
 	 * overcommit value, we are allowing the surplus pool size to
 	 * exceed overcommit. There are few sane options here. Since
-	 * alloc_surplus_hugetlb_folio() is checking the global counter,
-	 * though, we'll note that we're not allowed to exceed surplus
-	 * and won't grow the pool anywhere else. Not until one of the
-	 * sysctls are changed, or the surplus pages go out of use.
+	 * alloc_surplus_hugetlb_folio_nodemask() is checking the global
+	 * counter, though, we'll note that we're not allowed to exceed
+	 * surplus and won't grow the pool anywhere else. Not until one
+	 * of the sysctls are changed, or the surplus pages go out of
+	 * use.
 	 *
 	 * min_count is the expected number of persistent pages, we
 	 * shouldn't calculate min_count by using
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 20/51] mm: mempolicy: Refactor out policy_node_nodemask()
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (18 preceding siblings ...)
  2025-05-14 23:41 ` [RFC PATCH v2 19/51] mm: hugetlb: Rename alloc_surplus_hugetlb_folio Ackerley Tng
@ 2025-05-14 23:41 ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 21/51] mm: hugetlb: Inline huge_node() into callers Ackerley Tng
                   ` (35 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:41 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li, kernel test robot,
	Gregory Price

policy_node_nodemask() was refactored out of huge_node().

huge_node() derives the mempolicy lookup order from the vma, which
assumes the hugetlb-specific storage of the hstate information in the
inode. policy_node_nodemask() does not make that assumption, and can be
used more generically.

This refactoring also enforces that nid defaults to the current node
id, which was not previously enforced.

alloc_pages_mpol() is the last remaining direct user of
policy_nodemask(). All its callers begin with nid being the current
node id as well. More refactoring is required to simplify that.
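
For illustration, a minimal sketch of the new calling convention.
example_alloc_folio_nodemask() is a hypothetical helper, not part of
this patch; it assumes the caller already holds a reference on @pol and
obtains the hstate and interleave index elsewhere:

  /* Hypothetical helper sketching the new calling convention. */
  static struct folio *example_alloc_folio_nodemask(struct hstate *h,
                                                    struct mempolicy *pol,
                                                    pgoff_t ilx)
  {
          gfp_t gfp = htlb_alloc_mask(h);
          nodemask_t *nodemask;
          int nid;

          /* Interpret the policy into a preferred nid and optional nodemask. */
          nid = policy_node_nodemask(pol, gfp, ilx, &nodemask);

          /* nodemask is NULL unless the policy is 'bind' or 'prefer-many'. */
          return alloc_hugetlb_folio_nodemask(h, nid, nodemask, gfp, false);
  }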

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409140519.DIQST28c-lkp@intel.com/
Closes: https://lore.kernel.org/oe-kbuild-all/202409140553.G2RGVWNA-lkp@intel.com/
Reviewed-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Change-Id: I5774b27d2e718f4d08b59f8d2fedbb34eda7bac3
---
 include/linux/mempolicy.h |  9 +++++++++
 mm/mempolicy.c            | 33 ++++++++++++++++++++++++++-------
 2 files changed, 35 insertions(+), 7 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index ce9885e0178a..840c576abcfd 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -138,6 +138,8 @@ extern void numa_policy_init(void);
 extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new);
 extern void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new);
 
+extern int policy_node_nodemask(struct mempolicy *mpol, gfp_t gfp_flags,
+				pgoff_t ilx, nodemask_t **nodemask);
 extern int huge_node(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
@@ -251,6 +253,13 @@ static inline void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
 {
 }
 
+static inline int policy_node_nodemask(struct mempolicy *mpol, gfp_t gfp_flags,
+				       pgoff_t ilx, nodemask_t **nodemask)
+{
+	*nodemask = NULL;
+	return 0;
+}
+
 static inline int huge_node(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b28a1e6ae096..7837158ee5a8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1261,7 +1261,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
 
 		h = folio_hstate(src);
 		gfp = htlb_alloc_mask(h);
-		nodemask = policy_nodemask(gfp, pol, ilx, &nid);
+		nid = policy_node_nodemask(pol, gfp, ilx, &nodemask);
 		return alloc_hugetlb_folio_nodemask(h, nid, nodemask, gfp,
 				htlb_allow_alloc_fallback(MR_MEMPOLICY_MBIND));
 	}
@@ -2121,6 +2121,29 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
 	return nodemask;
 }
 
+/**
+ * policy_node_nodemask() - Interpret memory policy to get nodemask and nid.
+ *
+ * @mpol: the memory policy to interpret.
+ * @gfp_flags: gfp flags for this request.
+ * @ilx: interleave index, for use only when MPOL_INTERLEAVE or
+ *       MPOL_WEIGHTED_INTERLEAVE
+ * @nodemask: (output) pointer to nodemask pointer for 'bind' and 'prefer-many'
+ *            policy
+ *
+ * Context: must hold reference on @mpol.
+ * Return: a nid suitable for a page allocation and a pointer. If the effective
+ *         policy is 'bind' or 'prefer-many', returns a pointer to the
+ *         mempolicy's @nodemask for filtering the zonelist.
+ */
+int policy_node_nodemask(struct mempolicy *mpol, gfp_t gfp_flags,
+			 pgoff_t ilx, nodemask_t **nodemask)
+{
+	int nid = numa_node_id();
+	*nodemask = policy_nodemask(gfp_flags, mpol, ilx, &nid);
+	return nid;
+}
+
 #ifdef CONFIG_HUGETLBFS
 /*
  * huge_node(@vma, @addr, @gfp_flags, @mpol)
@@ -2139,12 +2162,9 @@ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
 		struct mempolicy **mpol, nodemask_t **nodemask)
 {
 	pgoff_t ilx;
-	int nid;
 
-	nid = numa_node_id();
 	*mpol = get_vma_policy(vma, addr, hstate_vma(vma)->order, &ilx);
-	*nodemask = policy_nodemask(gfp_flags, *mpol, ilx, &nid);
-	return nid;
+	return policy_node_nodemask(*mpol, gfp_flags, ilx, nodemask);
 }
 
 /*
@@ -2601,8 +2621,7 @@ unsigned long alloc_pages_bulk_mempolicy_noprof(gfp_t gfp,
 		return alloc_pages_bulk_preferred_many(gfp,
 				numa_node_id(), pol, nr_pages, page_array);
 
-	nid = numa_node_id();
-	nodemask = policy_nodemask(gfp, pol, NO_INTERLEAVE_INDEX, &nid);
+	nid = policy_node_nodemask(pol, gfp, NO_INTERLEAVE_INDEX, &nodemask);
 	return alloc_pages_bulk_noprof(gfp, nid, nodemask,
 				       nr_pages, page_array);
 }
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 21/51] mm: hugetlb: Inline huge_node() into callers
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (19 preceding siblings ...)
  2025-05-14 23:41 ` [RFC PATCH v2 20/51] mm: mempolicy: Refactor out policy_node_nodemask() Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 22/51] mm: hugetlb: Refactor hugetlb allocation functions Ackerley Tng
                   ` (34 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

huge_node()'s role was to read struct mempolicy (mpol) from the vma
and also interpret mpol to get node id and nodemask.

huge_node() can be inlined into its callers since two of its three
callers will be refactored in later patches to take and interpret an
mpol without reading it from the vma.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Change-Id: Ic94b2ed916fd4f89b7d2755288a3a2f6a56051f7
---
 include/linux/mempolicy.h | 12 ------------
 mm/hugetlb.c              | 13 ++++++++++---
 mm/mempolicy.c            | 21 ---------------------
 3 files changed, 10 insertions(+), 36 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 840c576abcfd..41fc53605ef0 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -140,9 +140,6 @@ extern void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new);
 
 extern int policy_node_nodemask(struct mempolicy *mpol, gfp_t gfp_flags,
 				pgoff_t ilx, nodemask_t **nodemask);
-extern int huge_node(struct vm_area_struct *vma,
-				unsigned long addr, gfp_t gfp_flags,
-				struct mempolicy **mpol, nodemask_t **nodemask);
 extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
 extern bool mempolicy_in_oom_domain(struct task_struct *tsk,
 				const nodemask_t *mask);
@@ -260,15 +257,6 @@ static inline int policy_node_nodemask(struct mempolicy *mpol, gfp_t gfp_flags,
 	return 0;
 }
 
-static inline int huge_node(struct vm_area_struct *vma,
-				unsigned long addr, gfp_t gfp_flags,
-				struct mempolicy **mpol, nodemask_t **nodemask)
-{
-	*mpol = NULL;
-	*nodemask = NULL;
-	return 0;
-}
-
 static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
 {
 	return false;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b822b204e9b3..5cc261b90e39 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1372,10 +1372,12 @@ static struct folio *dequeue_hugetlb_folio(struct hstate *h,
 	struct mempolicy *mpol;
 	gfp_t gfp_mask;
 	nodemask_t *nodemask;
+	pgoff_t ilx;
 	int nid;
 
 	gfp_mask = htlb_alloc_mask(h);
-	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
+	mpol = get_vma_policy(vma, address, h->order, &ilx);
+	nid = policy_node_nodemask(mpol, gfp_mask, ilx, &nodemask);
 
 	if (mpol_is_preferred_many(mpol)) {
 		folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
@@ -2321,8 +2323,11 @@ static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
 	gfp_t gfp_mask = htlb_alloc_mask(h);
 	int nid;
 	nodemask_t *nodemask;
+	pgoff_t ilx;
+
+	mpol = get_vma_policy(vma, addr, h->order, &ilx);
+	nid = policy_node_nodemask(mpol, gfp_mask, ilx, &nodemask);
 
-	nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
 	if (mpol_is_preferred_many(mpol)) {
 		gfp_t gfp = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
 
@@ -6829,10 +6834,12 @@ static struct folio *alloc_hugetlb_folio_vma(struct hstate *h,
 	nodemask_t *nodemask;
 	struct folio *folio;
 	gfp_t gfp_mask;
+	pgoff_t ilx;
 	int node;
 
 	gfp_mask = htlb_alloc_mask(h);
-	node = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
+	mpol = get_vma_policy(vma, address, h->order, &ilx);
+	node = policy_node_nodemask(mpol, gfp_mask, ilx, &nodemask);
 	/*
 	 * This is used to allocate a temporary hugetlb to hold the copied
 	 * content, which will then be copied again to the final hugetlb
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7837158ee5a8..39d0abc407dc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2145,27 +2145,6 @@ int policy_node_nodemask(struct mempolicy *mpol, gfp_t gfp_flags,
 }
 
 #ifdef CONFIG_HUGETLBFS
-/*
- * huge_node(@vma, @addr, @gfp_flags, @mpol)
- * @vma: virtual memory area whose policy is sought
- * @addr: address in @vma for shared policy lookup and interleave policy
- * @gfp_flags: for requested zone
- * @mpol: pointer to mempolicy pointer for reference counted mempolicy
- * @nodemask: pointer to nodemask pointer for 'bind' and 'prefer-many' policy
- *
- * Returns a nid suitable for a huge page allocation and a pointer
- * to the struct mempolicy for conditional unref after allocation.
- * If the effective policy is 'bind' or 'prefer-many', returns a pointer
- * to the mempolicy's @nodemask for filtering the zonelist.
- */
-int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
-		struct mempolicy **mpol, nodemask_t **nodemask)
-{
-	pgoff_t ilx;
-
-	*mpol = get_vma_policy(vma, addr, hstate_vma(vma)->order, &ilx);
-	return policy_node_nodemask(*mpol, gfp_flags, ilx, nodemask);
-}
 
 /*
  * init_nodemask_of_mempolicy
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 22/51] mm: hugetlb: Refactor hugetlb allocation functions
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (20 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 21/51] mm: hugetlb: Inline huge_node() into callers Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-31 23:45   ` Ira Weiny
  2025-05-14 23:42 ` [RFC PATCH v2 23/51] mm: hugetlb: Refactor out hugetlb_alloc_folio() Ackerley Tng
                   ` (33 subsequent siblings)
  55 siblings, 1 reply; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Refactor dequeue_hugetlb_folio() and alloc_surplus_hugetlb_folio() to
take mpol, nid and nodemask. This decouples allocation of a folio from
a vma.
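
To illustrate the decoupled call pattern, here is a hedged sketch
(example_alloc_decoupled() is a hypothetical wrapper within mm/hugetlb.c,
not part of this patch; the conditions around dequeueing are simplified):

  /* Hypothetical wrapper: policy, nid and nodemask come from the
   * caller instead of being derived from a vma. */
  static struct folio *example_alloc_decoupled(struct hstate *h,
                                               struct mempolicy *mpol,
                                               pgoff_t ilx)
  {
          gfp_t gfp_mask = htlb_alloc_mask(h);
          nodemask_t *nodemask;
          struct folio *folio;
          int nid;

          nid = policy_node_nodemask(mpol, gfp_mask, ilx, &nodemask);

          spin_lock_irq(&hugetlb_lock);
          folio = dequeue_hugetlb_folio(h, gfp_mask, mpol, nid, nodemask);
          spin_unlock_irq(&hugetlb_lock);

          /* Fall back to a surplus (buddy-allocated) folio. */
          if (!folio)
                  folio = alloc_surplus_hugetlb_folio(h, gfp_mask, mpol, nid,
                                                      nodemask);
          return folio;
  }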

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Change-Id: I890fb46fe8c6349383d8cf89befc68a4994eb416
---
 mm/hugetlb.c | 64 ++++++++++++++++++++++++----------------------------
 1 file changed, 30 insertions(+), 34 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5cc261b90e39..29d1a3fb10df 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1364,34 +1364,22 @@ static unsigned long available_huge_pages(struct hstate *h)
 	return h->free_huge_pages - h->resv_huge_pages;
 }
 
-static struct folio *dequeue_hugetlb_folio(struct hstate *h,
-					   struct vm_area_struct *vma,
-					   unsigned long address)
+static struct folio *dequeue_hugetlb_folio(struct hstate *h, gfp_t gfp_mask,
+					   struct mempolicy *mpol,
+					   int nid, nodemask_t *nodemask)
 {
 	struct folio *folio = NULL;
-	struct mempolicy *mpol;
-	gfp_t gfp_mask;
-	nodemask_t *nodemask;
-	pgoff_t ilx;
-	int nid;
-
-	gfp_mask = htlb_alloc_mask(h);
-	mpol = get_vma_policy(vma, address, h->order, &ilx);
-	nid = policy_node_nodemask(mpol, gfp_mask, ilx, &nodemask);
 
 	if (mpol_is_preferred_many(mpol)) {
-		folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
-							nid, nodemask);
+		folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, nid, nodemask);
 
 		/* Fallback to all nodes if page==NULL */
 		nodemask = NULL;
 	}
 
 	if (!folio)
-		folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
-							nid, nodemask);
+		folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, nid, nodemask);
 
-	mpol_cond_put(mpol);
 	return folio;
 }
 
@@ -2312,21 +2300,14 @@ static struct folio *alloc_migrate_hugetlb_folio(struct hstate *h, gfp_t gfp_mas
 }
 
 /*
- * Use the VMA's mpolicy to allocate a huge page from the buddy.
+ * Allocate a huge page from the buddy allocator given memory policy and node information.
  */
 static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
-						 struct vm_area_struct *vma,
-						 unsigned long addr)
+						 gfp_t gfp_mask,
+						 struct mempolicy *mpol,
+						 int nid, nodemask_t *nodemask)
 {
 	struct folio *folio = NULL;
-	struct mempolicy *mpol;
-	gfp_t gfp_mask = htlb_alloc_mask(h);
-	int nid;
-	nodemask_t *nodemask;
-	pgoff_t ilx;
-
-	mpol = get_vma_policy(vma, addr, h->order, &ilx);
-	nid = policy_node_nodemask(mpol, gfp_mask, ilx, &nodemask);
 
 	if (mpol_is_preferred_many(mpol)) {
 		gfp_t gfp = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
@@ -2339,7 +2320,7 @@ static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
 
 	if (!folio)
 		folio = alloc_surplus_hugetlb_folio_nodemask(h, gfp_mask, nid, nodemask);
-	mpol_cond_put(mpol);
+
 	return folio;
 }
 
@@ -2993,6 +2974,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg = NULL;
 	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
+	struct mempolicy *mpol;
+	nodemask_t *nodemask;
+	gfp_t gfp_mask;
+	pgoff_t ilx;
+	int nid;
 
 	idx = hstate_index(h);
 
@@ -3032,7 +3018,6 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 		subpool_reservation_exists = npages_req == 0;
 	}
-
 	reservation_exists = vma_reservation_exists || subpool_reservation_exists;
 
 	/*
@@ -3048,21 +3033,30 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 			goto out_subpool_put;
 	}
 
+	mpol = get_vma_policy(vma, addr, h->order, &ilx);
+
 	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
-	if (ret)
+	if (ret) {
+		mpol_cond_put(mpol);
 		goto out_uncharge_cgroup_reservation;
+	}
+
+	gfp_mask = htlb_alloc_mask(h);
+	nid = policy_node_nodemask(mpol, gfp_mask, ilx, &nodemask);
 
 	spin_lock_irq(&hugetlb_lock);
 
 	folio = NULL;
 	if (reservation_exists || available_huge_pages(h))
-		folio = dequeue_hugetlb_folio(h, vma, addr);
+		folio = dequeue_hugetlb_folio(h, gfp_mask, mpol, nid, nodemask);
 
 	if (!folio) {
 		spin_unlock_irq(&hugetlb_lock);
-		folio = alloc_surplus_hugetlb_folio(h, vma, addr);
-		if (!folio)
+		folio = alloc_surplus_hugetlb_folio(h, gfp_mask, mpol, nid, nodemask);
+		if (!folio) {
+			mpol_cond_put(mpol);
 			goto out_uncharge_cgroup;
+		}
 		spin_lock_irq(&hugetlb_lock);
 		list_add(&folio->lru, &h->hugepage_activelist);
 		folio_ref_unfreeze(folio, 1);
@@ -3087,6 +3081,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 	spin_unlock_irq(&hugetlb_lock);
 
+	mpol_cond_put(mpol);
+
 	hugetlb_set_folio_subpool(folio, spool);
 
 	/* If vma accounting wasn't bypassed earlier, follow up with commit. */
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 23/51] mm: hugetlb: Refactor out hugetlb_alloc_folio()
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (21 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 22/51] mm: hugetlb: Refactor hugetlb allocation functions Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-06-01  0:38   ` Ira Weiny
  2025-05-14 23:42 ` [RFC PATCH v2 24/51] mm: hugetlb: Add option to create new subpool without using surplus Ackerley Tng
                   ` (32 subsequent siblings)
  55 siblings, 1 reply; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Refactor out hugetlb_alloc_folio() from alloc_hugetlb_folio(), which
handles allocation of a folio and cgroup charging.

Other than flags to control charging in the allocation process,
hugetlb_alloc_folio() also has parameters for memory policy.

This refactoring as a whole decouples hugetlb page allocation from
hugetlbfs, where (1) the subpool is stored at the fs mount, (2)
reservations are made during mmap and stored in the vma, (3) the mpol
must be stored at vma->vm_policy, and (4) a vma must be used for
allocation even if the pages are not meant to be used by the host
process.

This decoupling will allow hugetlb_alloc_folio() to be used by
guest_memfd in later patches. In guest_memfd, (1) a subpool is created
per-fd and stored on the inode, (2) no vma-related reservations are
used, and (3) the mpol may not be associated with a vma, since (4)
private pages will not be mappable to userspace and hence have no
associated vmas.

This could hopefully also open hugetlb up as a more generic source of
hugetlb pages that are not bound to hugetlbfs, with the complexities of
userspace/mmap/vma-related reservations contained within hugetlbfs.
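
For illustration only, a rough sketch of how a non-hugetlbfs user such
as guest_memfd might call the new helper. example_gmem_alloc_folio() and
its policy source are hypothetical placeholders; the real guest_memfd
integration appears in later patches:

  /* Hypothetical caller: mpol and ilx come from the caller's own policy
   * handling rather than from a vma; subpool accounting (if any) is
   * also the caller's responsibility. */
  static struct folio *example_gmem_alloc_folio(struct hstate *h,
                                                struct mempolicy *mpol,
                                                pgoff_t ilx)
  {
          struct folio *folio;

          /*
           * No vma-level reservation exists here, so charge the cgroup
           * reservation and do not consume an existing hstate
           * reservation.
           */
          folio = hugetlb_alloc_folio(h, mpol, ilx,
                                      /*charge_cgroup_rsvd=*/true,
                                      /*use_existing_reservation=*/false);
          if (IS_ERR_OR_NULL(folio))
                  return ERR_PTR(-ENOMEM);

          return folio;
  }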

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Change-Id: I60528f246341268acbf0ed5de7752ae2cacbef93
---
 include/linux/hugetlb.h |  12 +++
 mm/hugetlb.c            | 192 ++++++++++++++++++++++------------------
 2 files changed, 118 insertions(+), 86 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8f3ac832ee7f..8ba941d88956 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -698,6 +698,9 @@ bool hugetlb_bootmem_page_zones_valid(int nid, struct huge_bootmem_page *m);
 int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
 int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn);
 void wait_for_freed_hugetlb_folios(void);
+struct folio *hugetlb_alloc_folio(struct hstate *h, struct mempolicy *mpol,
+				  pgoff_t ilx, bool charge_cgroup_rsvd,
+				  bool use_existing_reservation);
 struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 				unsigned long addr, bool cow_from_owner);
 struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
@@ -1099,6 +1102,15 @@ static inline void wait_for_freed_hugetlb_folios(void)
 {
 }
 
+static inline struct folio *hugetlb_alloc_folio(struct hstate *h,
+						struct mempolicy *mpol,
+						pgoff_t ilx,
+						bool charge_cgroup_rsvd,
+						bool use_existing_reservation)
+{
+	return NULL;
+}
+
 static inline struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 					   unsigned long addr,
 					   bool cow_from_owner)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 29d1a3fb10df..5b088fe002a2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2954,6 +2954,101 @@ void wait_for_freed_hugetlb_folios(void)
 	flush_work(&free_hpage_work);
 }
 
+/**
+ * hugetlb_alloc_folio() - Allocates a hugetlb folio.
+ *
+ * @h: struct hstate to allocate from.
+ * @mpol: struct mempolicy to apply for this folio allocation.
+ * @ilx: Interleave index for interpretation of @mpol.
+ * @charge_cgroup_rsvd: Set to true to charge cgroup reservation.
+ * @use_existing_reservation: Set to true if this allocation should use an
+ *                            existing hstate reservation.
+ *
+ * This function handles cgroup and global hstate reservations. VMA-related
+ * reservations and subpool debiting must be handled by the caller if necessary.
+ *
+ * Return: folio on success or negated error otherwise.
+ */
+struct folio *hugetlb_alloc_folio(struct hstate *h, struct mempolicy *mpol,
+				  pgoff_t ilx, bool charge_cgroup_rsvd,
+				  bool use_existing_reservation)
+{
+	unsigned int nr_pages = pages_per_huge_page(h);
+	struct hugetlb_cgroup *h_cg = NULL;
+	struct folio *folio = NULL;
+	nodemask_t *nodemask;
+	gfp_t gfp_mask;
+	int nid;
+	int idx;
+	int ret;
+
+	idx = hstate_index(h);
+
+	if (charge_cgroup_rsvd) {
+		if (hugetlb_cgroup_charge_cgroup_rsvd(idx, nr_pages, &h_cg))
+			goto out;
+	}
+
+	if (hugetlb_cgroup_charge_cgroup(idx, nr_pages, &h_cg))
+		goto out_uncharge_cgroup_reservation;
+
+	gfp_mask = htlb_alloc_mask(h);
+	nid = policy_node_nodemask(mpol, gfp_mask, ilx, &nodemask);
+
+	spin_lock_irq(&hugetlb_lock);
+
+	if (use_existing_reservation || available_huge_pages(h))
+		folio = dequeue_hugetlb_folio(h, gfp_mask, mpol, nid, nodemask);
+
+	if (!folio) {
+		spin_unlock_irq(&hugetlb_lock);
+		folio = alloc_surplus_hugetlb_folio(h, gfp_mask, mpol, nid, nodemask);
+		if (!folio)
+			goto out_uncharge_cgroup;
+		spin_lock_irq(&hugetlb_lock);
+		list_add(&folio->lru, &h->hugepage_activelist);
+		folio_ref_unfreeze(folio, 1);
+		/* Fall through */
+	}
+
+	if (use_existing_reservation) {
+		folio_set_hugetlb_restore_reserve(folio);
+		h->resv_huge_pages--;
+	}
+
+	hugetlb_cgroup_commit_charge(idx, nr_pages, h_cg, folio);
+
+	if (charge_cgroup_rsvd)
+		hugetlb_cgroup_commit_charge_rsvd(idx, nr_pages, h_cg, folio);
+
+	spin_unlock_irq(&hugetlb_lock);
+
+	gfp_mask = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
+	ret = mem_cgroup_charge_hugetlb(folio, gfp_mask);
+	/*
+	 * Unconditionally increment NR_HUGETLB here. If it turns out that
+	 * mem_cgroup_charge_hugetlb failed, then immediately free the page and
+	 * decrement NR_HUGETLB.
+	 */
+	lruvec_stat_mod_folio(folio, NR_HUGETLB, pages_per_huge_page(h));
+
+	if (ret == -ENOMEM) {
+		free_huge_folio(folio);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	return folio;
+
+out_uncharge_cgroup:
+	hugetlb_cgroup_uncharge_cgroup(idx, nr_pages, h_cg);
+out_uncharge_cgroup_reservation:
+	if (charge_cgroup_rsvd)
+		hugetlb_cgroup_uncharge_cgroup_rsvd(idx, nr_pages, h_cg);
+out:
+	folio = ERR_PTR(-ENOSPC);
+	goto out;
+}
+
 /*
  * NOTE! "cow_from_owner" represents a very hacky usage only used in CoW
  * faults of hugetlb private mappings on top of a non-page-cache folio (in
@@ -2971,16 +3066,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	bool reservation_exists;
 	bool charge_cgroup_rsvd;
 	struct folio *folio;
-	int ret, idx;
-	struct hugetlb_cgroup *h_cg = NULL;
-	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
 	struct mempolicy *mpol;
-	nodemask_t *nodemask;
-	gfp_t gfp_mask;
 	pgoff_t ilx;
-	int nid;
-
-	idx = hstate_index(h);
 
 	if (cow_from_owner) {
 		/*
@@ -3020,69 +3107,22 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	}
 	reservation_exists = vma_reservation_exists || subpool_reservation_exists;
 
-	/*
-	 * If a vma_reservation_exists, we can skip charging hugetlb
-	 * reservations since that was charged in hugetlb_reserve_pages() when
-	 * the reservation was recorded on the resv_map.
-	 */
-	charge_cgroup_rsvd = !vma_reservation_exists;
-	if (charge_cgroup_rsvd) {
-		ret = hugetlb_cgroup_charge_cgroup_rsvd(
-			idx, pages_per_huge_page(h), &h_cg);
-		if (ret)
-			goto out_subpool_put;
-	}
-
 	mpol = get_vma_policy(vma, addr, h->order, &ilx);
 
-	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
-	if (ret) {
-		mpol_cond_put(mpol);
-		goto out_uncharge_cgroup_reservation;
-	}
-
-	gfp_mask = htlb_alloc_mask(h);
-	nid = policy_node_nodemask(mpol, gfp_mask, ilx, &nodemask);
-
-	spin_lock_irq(&hugetlb_lock);
-
-	folio = NULL;
-	if (reservation_exists || available_huge_pages(h))
-		folio = dequeue_hugetlb_folio(h, gfp_mask, mpol, nid, nodemask);
-
-	if (!folio) {
-		spin_unlock_irq(&hugetlb_lock);
-		folio = alloc_surplus_hugetlb_folio(h, gfp_mask, mpol, nid, nodemask);
-		if (!folio) {
-			mpol_cond_put(mpol);
-			goto out_uncharge_cgroup;
-		}
-		spin_lock_irq(&hugetlb_lock);
-		list_add(&folio->lru, &h->hugepage_activelist);
-		folio_ref_unfreeze(folio, 1);
-		/* Fall through */
-	}
-
 	/*
-	 * Either dequeued or buddy-allocated folio needs to add special
-	 * mark to the folio when it consumes a global reservation.
+	 * If a vma_reservation_exists, we can skip charging cgroup reservations
+	 * since that was charged during vma reservation. Use a reservation as
+	 * long as it exists.
 	 */
-	if (reservation_exists) {
-		folio_set_hugetlb_restore_reserve(folio);
-		h->resv_huge_pages--;
-	}
-
-	hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, folio);
-
-	if (charge_cgroup_rsvd) {
-		hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
-						  h_cg, folio);
-	}
-
-	spin_unlock_irq(&hugetlb_lock);
+	charge_cgroup_rsvd = !vma_reservation_exists;
+	folio = hugetlb_alloc_folio(h, mpol, ilx, charge_cgroup_rsvd,
+				    reservation_exists);
 
 	mpol_cond_put(mpol);
 
+	if (IS_ERR_OR_NULL(folio))
+		goto out_subpool_put;
+
 	hugetlb_set_folio_subpool(folio, spool);
 
 	/* If vma accounting wasn't bypassed earlier, follow up with commit. */
@@ -3091,9 +3131,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 		/*
 		 * If there is a discrepancy in reservation status between the
 		 * time of vma_needs_reservation() and vma_commit_reservation(),
-		 * then there the page must have been added to the reservation
-		 * map between vma_needs_reservation() and
-		 * vma_commit_reservation().
+		 * then the page must have been added to the reservation map
+		 * between vma_needs_reservation() and vma_commit_reservation().
 		 *
 		 * Adjust for the subpool count incremented above AND
 		 * in hugetlb_reserve_pages for the same page.	Also,
@@ -3115,27 +3154,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 		}
 	}
 
-	ret = mem_cgroup_charge_hugetlb(folio, gfp);
-	/*
-	 * Unconditionally increment NR_HUGETLB here. If it turns out that
-	 * mem_cgroup_charge_hugetlb failed, then immediately free the page and
-	 * decrement NR_HUGETLB.
-	 */
-	lruvec_stat_mod_folio(folio, NR_HUGETLB, pages_per_huge_page(h));
-
-	if (ret == -ENOMEM) {
-		free_huge_folio(folio);
-		return ERR_PTR(-ENOMEM);
-	}
-
 	return folio;
 
-out_uncharge_cgroup:
-	hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
-out_uncharge_cgroup_reservation:
-	if (charge_cgroup_rsvd)
-		hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
-						    h_cg);
 out_subpool_put:
 	if (!vma_reservation_exists)
 		hugepage_subpool_put_pages(spool, 1);
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 24/51] mm: hugetlb: Add option to create new subpool without using surplus
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (22 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 23/51] mm: hugetlb: Refactor out hugetlb_alloc_folio() Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 25/51] mm: truncate: Expose preparation steps for truncate_inode_pages_final Ackerley Tng
                   ` (31 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

hugetlb_acct_memory() today does more than just memory accounting:
when there are insufficient HugeTLB pages, it will attempt to get
surplus pages.

This change adds a flag to disable getting surplus pages if there are
insufficient HugeTLB pages.
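
As a rough sketch (h and nr_pages below are placeholders, not part of
this patch), a user that wants its minimum reservation satisfied only
from already-allocated HugeTLB pages would pass use_surplus=false:

  struct hugepage_subpool *spool;

  /* Fail instead of growing the pool with surplus pages. */
  spool = hugepage_new_subpool(h, nr_pages, nr_pages,
                               /*use_surplus=*/false);
  if (!spool)
          return -ENOMEM;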

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Change-Id: Id79fdeaa236b4fed38fc3c20482b03fff729198f
---
 fs/hugetlbfs/inode.c    |  2 +-
 include/linux/hugetlb.h |  2 +-
 mm/hugetlb.c            | 77 +++++++++++++++++++++++++++++++----------
 3 files changed, 61 insertions(+), 20 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index e4de5425838d..609a88950354 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1424,7 +1424,7 @@ hugetlbfs_fill_super(struct super_block *sb, struct fs_context *fc)
 	if (ctx->max_hpages != -1 || ctx->min_hpages != -1) {
 		sbinfo->spool = hugepage_new_subpool(ctx->hstate,
 						     ctx->max_hpages,
-						     ctx->min_hpages);
+						     ctx->min_hpages, true);
 		if (!sbinfo->spool)
 			goto out_free;
 	}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8ba941d88956..c59264391c33 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -116,7 +116,7 @@ extern int hugetlb_max_hstate __read_mostly;
 	for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++)
 
 struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
-						long min_hpages);
+					      long min_hpages, bool use_surplus);
 void hugepage_put_subpool(struct hugepage_subpool *spool);
 
 void hugetlb_dup_vma_private(struct vm_area_struct *vma);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5b088fe002a2..d22c5a8fd441 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -115,6 +115,7 @@ static int num_fault_mutexes __ro_after_init;
 struct mutex *hugetlb_fault_mutex_table __ro_after_init;
 
 /* Forward declaration */
+static int __hugetlb_acct_memory(struct hstate *h, long delta, bool use_surplus);
 static int hugetlb_acct_memory(struct hstate *h, long delta);
 static void hugetlb_vma_lock_free(struct vm_area_struct *vma);
 static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
@@ -162,7 +163,7 @@ static inline void unlock_or_release_subpool(struct hugepage_subpool *spool,
 }
 
 struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
-						long min_hpages)
+					      long min_hpages, bool use_surplus)
 {
 	struct hugepage_subpool *spool;
 
@@ -176,7 +177,8 @@ struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
 	spool->hstate = h;
 	spool->min_hpages = min_hpages;
 
-	if (min_hpages != -1 && hugetlb_acct_memory(h, min_hpages)) {
+	if (min_hpages != -1 &&
+	    __hugetlb_acct_memory(h, min_hpages, use_surplus)) {
 		kfree(spool);
 		return NULL;
 	}
@@ -2382,35 +2384,64 @@ static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
 	return NULL;
 }
 
-/*
- * Increase the hugetlb pool such that it can accommodate a reservation
- * of size 'delta'.
+/**
+ * hugetlb_hstate_reserve_pages() - Reserve @requested number of hugetlb pages
+ * from hstate @h.
+ *
+ * @h: the hstate to reserve from.
+ * @requested: number of hugetlb pages to reserve.
+ *
+ * If there are insufficient available hugetlb pages, no reservations are made.
+ *
+ * Return: the number of surplus pages required to meet the @requested number of
+ *         hugetlb pages.
  */
-static int gather_surplus_pages(struct hstate *h, long delta)
+static int hugetlb_hstate_reserve_pages(struct hstate *h, long requested)
+	__must_hold(&hugetlb_lock)
+{
+	long needed;
+
+	needed = (h->resv_huge_pages + requested) - h->free_huge_pages;
+	if (needed <= 0) {
+		h->resv_huge_pages += requested;
+		return 0;
+	}
+
+	return needed;
+}
+
+/**
+ * gather_surplus_pages() - Increase the hugetlb pool such that it can
+ * accommodate a reservation of size @requested.
+ *
+ * @h: the hstate in concern.
+ * @requested: The requested number of hugetlb pages.
+ * @needed: The number of hugetlb pages the pool needs to be increased by, based
+ *          on current number of reservations and free hugetlb pages.
+ *
+ * Return: 0 if successful or negative error otherwise.
+ */
+static int gather_surplus_pages(struct hstate *h, long requested, long needed)
 	__must_hold(&hugetlb_lock)
 {
 	LIST_HEAD(surplus_list);
 	struct folio *folio, *tmp;
 	int ret;
 	long i;
-	long needed, allocated;
+	long allocated;
 	bool alloc_ok = true;
 	int node;
 	nodemask_t *mbind_nodemask, alloc_nodemask;
 
+	if (needed == 0)
+		return 0;
+
 	mbind_nodemask = policy_mbind_nodemask(htlb_alloc_mask(h));
 	if (mbind_nodemask)
 		nodes_and(alloc_nodemask, *mbind_nodemask, cpuset_current_mems_allowed);
 	else
 		alloc_nodemask = cpuset_current_mems_allowed;
 
-	lockdep_assert_held(&hugetlb_lock);
-	needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
-	if (needed <= 0) {
-		h->resv_huge_pages += delta;
-		return 0;
-	}
-
 	allocated = 0;
 
 	ret = -ENOMEM;
@@ -2448,7 +2479,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	 * because either resv_huge_pages or free_huge_pages may have changed.
 	 */
 	spin_lock_irq(&hugetlb_lock);
-	needed = (h->resv_huge_pages + delta) -
+	needed = (h->resv_huge_pages + requested) -
 			(h->free_huge_pages + allocated);
 	if (needed > 0) {
 		if (alloc_ok)
@@ -2469,7 +2500,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	 * before they are reserved.
 	 */
 	needed += allocated;
-	h->resv_huge_pages += delta;
+	h->resv_huge_pages += requested;
 	ret = 0;
 
 	/* Free the needed pages to the hugetlb pool */
@@ -5284,7 +5315,7 @@ unsigned long hugetlb_total_pages(void)
 	return nr_total_pages;
 }
 
-static int hugetlb_acct_memory(struct hstate *h, long delta)
+static int __hugetlb_acct_memory(struct hstate *h, long delta, bool use_surplus)
 {
 	int ret = -ENOMEM;
 
@@ -5316,7 +5347,12 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
 	 * above.
 	 */
 	if (delta > 0) {
-		if (gather_surplus_pages(h, delta) < 0)
+		long needed = hugetlb_hstate_reserve_pages(h, delta);
+
+		if (!use_surplus && needed > 0)
+			goto out;
+
+		if (gather_surplus_pages(h, delta, needed) < 0)
 			goto out;
 
 		if (delta > allowed_mems_nr(h)) {
@@ -5334,6 +5370,11 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
 	return ret;
 }
 
+static int hugetlb_acct_memory(struct hstate *h, long delta)
+{
+	return __hugetlb_acct_memory(h, delta, true);
+}
+
 static void hugetlb_vm_op_open(struct vm_area_struct *vma)
 {
 	struct resv_map *resv = vma_resv_map(vma);
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 25/51] mm: truncate: Expose preparation steps for truncate_inode_pages_final
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (23 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 24/51] mm: hugetlb: Add option to create new subpool without using surplus Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 26/51] mm: Consolidate freeing of typed folios on final folio_put() Ackerley Tng
                   ` (30 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

This will allow the preparation steps to be shared by filesystems that
need to implement the final truncation differently.
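
For illustration, a sketch of a hypothetical ->evict_inode
implementation that shares the preparation but supplies its own page
removal (example_evict_inode() and example_remove_folios() are made-up
placeholders):

  static void example_evict_inode(struct inode *inode)
  {
          struct address_space *mapping = inode->i_mapping;

          /* Same lifetime/serialization setup as truncate_inode_pages_final(). */
          truncate_inode_pages_final_prepare(mapping);

          /* Filesystem-specific removal instead of truncate_inode_pages(mapping, 0). */
          example_remove_folios(mapping);

          clear_inode(inode);
  }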

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Change-Id: I83ad5965b8b50283ad930c20c99e3165cb5626c9
---
 include/linux/mm.h |  1 +
 mm/truncate.c      | 26 ++++++++++++++++----------
 2 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf55206935c4..e4e73c231ced 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3514,6 +3514,7 @@ extern unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info);
 extern void truncate_inode_pages(struct address_space *, loff_t);
 extern void truncate_inode_pages_range(struct address_space *,
 				       loff_t lstart, loff_t lend);
+extern void truncate_inode_pages_final_prepare(struct address_space *mapping);
 extern void truncate_inode_pages_final(struct address_space *);
 
 /* generic vm_area_ops exported for stackable file systems */
diff --git a/mm/truncate.c b/mm/truncate.c
index 5d98054094d1..057e4aa73aa9 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -457,16 +457,7 @@ void truncate_inode_pages(struct address_space *mapping, loff_t lstart)
 }
 EXPORT_SYMBOL(truncate_inode_pages);
 
-/**
- * truncate_inode_pages_final - truncate *all* pages before inode dies
- * @mapping: mapping to truncate
- *
- * Called under (and serialized by) inode->i_rwsem.
- *
- * Filesystems have to use this in the .evict_inode path to inform the
- * VM that this is the final truncate and the inode is going away.
- */
-void truncate_inode_pages_final(struct address_space *mapping)
+void truncate_inode_pages_final_prepare(struct address_space *mapping)
 {
 	/*
 	 * Page reclaim can not participate in regular inode lifetime
@@ -487,6 +478,21 @@ void truncate_inode_pages_final(struct address_space *mapping)
 		xa_lock_irq(&mapping->i_pages);
 		xa_unlock_irq(&mapping->i_pages);
 	}
+}
+EXPORT_SYMBOL(truncate_inode_pages_final_prepare);
+
+/**
+ * truncate_inode_pages_final - truncate *all* pages before inode dies
+ * @mapping: mapping to truncate
+ *
+ * Called under (and serialized by) inode->i_rwsem.
+ *
+ * Filesystems have to use this in the .evict_inode path to inform the
+ * VM that this is the final truncate and the inode is going away.
+ */
+void truncate_inode_pages_final(struct address_space *mapping)
+{
+	truncate_inode_pages_final_prepare(mapping);
 
 	truncate_inode_pages(mapping, 0);
 }
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 26/51] mm: Consolidate freeing of typed folios on final folio_put()
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (24 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 25/51] mm: truncate: Expose preparation steps for truncate_inode_pages_final Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 27/51] mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages() Ackerley Tng
                   ` (29 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

From: Fuad Tabba <tabba@google.com>

Some folio types, such as hugetlb, handle freeing their own folios.

The guestmem_hugetlb folio, to be introduced in a later patch,
requires extra handling as part of the freeing process.

As a first step towards that, this patch consolidates the freeing of
folios that have a type. The first user is hugetlb folios; later in
this series, guestmem_hugetlb will become the second user.
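
To sketch the intended extension point (the config option, page type
and callback below are hypothetical placeholders; the real
guestmem_hugetlb hook is added later in this series), a second folio
type would simply gain another case in free_typed_folio():

  static void free_typed_folio(struct folio *folio)
  {
          switch (folio_get_type(folio)) {
  #ifdef CONFIG_HUGETLBFS
          case PGTY_hugetlb:
                  free_huge_folio(folio);
                  return;
  #endif
  #ifdef CONFIG_GUESTMEM_HUGETLB        /* hypothetical */
          case PGTY_guestmem_hugetlb:   /* hypothetical */
                  guestmem_hugetlb_free_folio(folio);
                  return;
  #endif
          default:
                  WARN_ON_ONCE(1);
          }
  }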

Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Change-Id: I881dc58ca89603ddd1e8e1ccca8f5dbfc80c43be
---
 include/linux/page-flags.h | 15 +++++++++++++++
 mm/swap.c                  | 23 ++++++++++++++++++-----
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e6a21b62dcce..9dd60fb8c33f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -985,6 +985,21 @@ static inline bool page_has_type(const struct page *page)
 	return page_mapcount_is_type(data_race(page->page_type));
 }
 
+static inline int page_get_type(const struct page *page)
+{
+	return page->page_type >> 24;
+}
+
+static inline bool folio_has_type(const struct folio *folio)
+{
+	return page_has_type(&folio->page);
+}
+
+static inline int folio_get_type(const struct folio *folio)
+{
+	return page_get_type(&folio->page);
+}
+
 #define FOLIO_TYPE_OPS(lname, fname)					\
 static __always_inline bool folio_test_##fname(const struct folio *folio) \
 {									\
diff --git a/mm/swap.c b/mm/swap.c
index 77b2d5997873..d0a5971787c4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -94,6 +94,19 @@ static void page_cache_release(struct folio *folio)
 		unlock_page_lruvec_irqrestore(lruvec, flags);
 }
 
+static void free_typed_folio(struct folio *folio)
+{
+	switch (folio_get_type(folio)) {
+#ifdef CONFIG_HUGETLBFS
+	case PGTY_hugetlb:
+		free_huge_folio(folio);
+		return;
+#endif
+	default:
+		WARN_ON_ONCE(1);
+	}
+}
+
 void __folio_put(struct folio *folio)
 {
 	if (unlikely(folio_is_zone_device(folio))) {
@@ -101,8 +114,8 @@ void __folio_put(struct folio *folio)
 		return;
 	}
 
-	if (folio_test_hugetlb(folio)) {
-		free_huge_folio(folio);
+	if (unlikely(folio_has_type(folio))) {
+		free_typed_folio(folio);
 		return;
 	}
 
@@ -964,13 +977,13 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
 		if (!folio_ref_sub_and_test(folio, nr_refs))
 			continue;
 
-		/* hugetlb has its own memcg */
-		if (folio_test_hugetlb(folio)) {
+		if (unlikely(folio_has_type(folio))) {
+			/* typed folios have their own memcg, if any */
 			if (lruvec) {
 				unlock_page_lruvec_irqrestore(lruvec, flags);
 				lruvec = NULL;
 			}
-			free_huge_folio(folio);
+			free_typed_folio(folio);
 			continue;
 		}
 		folio_unqueue_deferred_split(folio);
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 27/51] mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages()
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (25 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 26/51] mm: Consolidate freeing of typed folios on final folio_put() Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 28/51] mm: Introduce guestmem_hugetlb to support folio_put() handling of guestmem pages Ackerley Tng
                   ` (28 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

This will allow hugetlb subpools to be used by guestmem_hugetlb.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Change-Id: I909355935f2ab342e65e7bfdc106bedd1dc177c9
---
 include/linux/hugetlb.h | 3 +++
 mm/hugetlb.c            | 6 ++----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c59264391c33..e6b90e72d46d 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -119,6 +119,9 @@ struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
 					      long min_hpages, bool use_surplus);
 void hugepage_put_subpool(struct hugepage_subpool *spool);
 
+long hugepage_subpool_get_pages(struct hugepage_subpool *spool, long delta);
+long hugepage_subpool_put_pages(struct hugepage_subpool *spool, long delta);
+
 void hugetlb_dup_vma_private(struct vm_area_struct *vma);
 void clear_vma_resv_huge_pages(struct vm_area_struct *vma);
 int move_hugetlb_page_tables(struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d22c5a8fd441..816f257680be 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -205,8 +205,7 @@ void hugepage_put_subpool(struct hugepage_subpool *spool)
  * only be different than the passed value (delta) in the case where
  * a subpool minimum size must be maintained.
  */
-static long hugepage_subpool_get_pages(struct hugepage_subpool *spool,
-				      long delta)
+long hugepage_subpool_get_pages(struct hugepage_subpool *spool, long delta)
 {
 	long ret = delta;
 
@@ -250,8 +249,7 @@ static long hugepage_subpool_get_pages(struct hugepage_subpool *spool,
  * The return value may only be different than the passed value (delta)
  * in the case where a subpool minimum size must be maintained.
  */
-static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
-				       long delta)
+long hugepage_subpool_put_pages(struct hugepage_subpool *spool, long delta)
 {
 	long ret = delta;
 	unsigned long flags;
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 28/51] mm: Introduce guestmem_hugetlb to support folio_put() handling of guestmem pages
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (26 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 27/51] mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages() Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 29/51] mm: guestmem_hugetlb: Wrap HugeTLB as an allocator for guest_memfd Ackerley Tng
                   ` (27 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

The PGTY_guestmem_hugetlb page type is introduced so folios can be
marked for further cleanup by guestmem_hugetlb.

guestmem_hugetlb folios can have positive mapcounts, which conflicts
with installing a page type, since the page type shares its storage
with the mapcount. Hence, PGTY_guestmem_hugetlb will only be installed
when a folio is truncated, after the folio has been unmapped and its
mapcount has dropped to 0.
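
A minimal sketch of the intended sequence at truncation time (the
actual call site is added later in this series; the setter below
follows the __folio_set_<name>() pattern generated by
FOLIO_TYPE_OPS()):

	/* Folio has been unmapped and its mapcount is 0, so installing
	 * the page type cannot clash with a live mapcount.
	 */
	__folio_set_guestmem_hugetlb(folio);

	/* The final folio_put() then sees the type and routes the folio
	 * to guestmem_hugetlb for further cleanup.
	 */
	folio_put(folio);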

Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Change-Id: I635f8929e06f73d7899737bd47090b7cbc7222dc
---
 include/linux/page-flags.h | 17 +++++++++++++++++
 mm/Kconfig                 | 10 ++++++++++
 mm/Makefile                |  1 +
 mm/debug.c                 |  1 +
 mm/guestmem_hugetlb.c      | 14 ++++++++++++++
 mm/guestmem_hugetlb.h      |  9 +++++++++
 mm/swap.c                  |  9 +++++++++
 7 files changed, 61 insertions(+)
 create mode 100644 mm/guestmem_hugetlb.c
 create mode 100644 mm/guestmem_hugetlb.h

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9dd60fb8c33f..543f6481ca60 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -965,6 +965,7 @@ enum pagetype {
 	PGTY_zsmalloc		= 0xf6,
 	PGTY_unaccepted		= 0xf7,
 	PGTY_large_kmalloc	= 0xf8,
+	PGTY_guestmem_hugetlb	= 0xf9,
 
 	PGTY_mapcount_underflow = 0xff
 };
@@ -1114,6 +1115,22 @@ FOLIO_TYPE_OPS(hugetlb, hugetlb)
 FOLIO_TEST_FLAG_FALSE(hugetlb)
 #endif
 
+/*
+ * PGTY_guestmem_hugetlb, for now, is used to mark a folio as requiring further
+ * cleanup by the guestmem_hugetlb allocator.  This page type is installed only
+ * at truncation time, by guest_memfd, if further cleanup is required.  It is
+ * safe to install this page type at truncation time because by then mapcount
+ * would be 0.
+ *
+ * The plan is to always set this page type for any folios allocated by
+ * guestmem_hugetlb once typed folios can be mapped to userspace cleanly.
+ */
+#ifdef CONFIG_GUESTMEM_HUGETLB
+FOLIO_TYPE_OPS(guestmem_hugetlb, guestmem_hugetlb)
+#else
+FOLIO_TEST_FLAG_FALSE(guestmem_hugetlb)
+#endif
+
 PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
 
 /*
diff --git a/mm/Kconfig b/mm/Kconfig
index e113f713b493..131adc49f58d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1216,6 +1216,16 @@ config SECRETMEM
 	  memory areas visible only in the context of the owning process and
 	  not mapped to other processes and other kernel page tables.
 
+config GUESTMEM_HUGETLB
+	bool "Enable guestmem_hugetlb allocator for guest_memfd"
+	depends on HUGETLBFS
+	help
+	  Enable this to make HugeTLB folios available to guest_memfd
+	  (KVM virtualization) as backing memory.
+
+	  This feature wraps HugeTLB as a custom allocator that
+	  guest_memfd can use.
+
 config ANON_VMA_NAME
 	bool "Anonymous VMA name support"
 	depends on PROC_FS && ADVISE_SYSCALLS && MMU
diff --git a/mm/Makefile b/mm/Makefile
index e7f6bbf8ae5f..c91c8e8fef71 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -127,6 +127,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
 obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
 obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
 obj-$(CONFIG_SECRETMEM) += secretmem.o
+obj-$(CONFIG_GUESTMEM_HUGETLB) += guestmem_hugetlb.o
 obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
 obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
 obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
diff --git a/mm/debug.c b/mm/debug.c
index db83e381a8ae..439ab128772d 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -56,6 +56,7 @@ static const char *page_type_names[] = {
 	DEF_PAGETYPE_NAME(table),
 	DEF_PAGETYPE_NAME(buddy),
 	DEF_PAGETYPE_NAME(unaccepted),
+	DEF_PAGETYPE_NAME(guestmem_hugetlb),
 };
 
 static const char *page_type_name(unsigned int page_type)
diff --git a/mm/guestmem_hugetlb.c b/mm/guestmem_hugetlb.c
new file mode 100644
index 000000000000..51a724ebcc50
--- /dev/null
+++ b/mm/guestmem_hugetlb.c
@@ -0,0 +1,14 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * guestmem_hugetlb is an allocator for guest_memfd. It wraps HugeTLB
+ * as an allocator for guest_memfd.
+ */
+
+#include <linux/mm_types.h>
+
+#include "guestmem_hugetlb.h"
+
+void guestmem_hugetlb_handle_folio_put(struct folio *folio)
+{
+	WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
+}
diff --git a/mm/guestmem_hugetlb.h b/mm/guestmem_hugetlb.h
new file mode 100644
index 000000000000..5c9452b77252
--- /dev/null
+++ b/mm/guestmem_hugetlb.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MM_GUESTMEM_HUGETLB_H
+#define _LINUX_MM_GUESTMEM_HUGETLB_H
+
+#include <linux/mm_types.h>
+
+void guestmem_hugetlb_handle_folio_put(struct folio *folio);
+
+#endif
diff --git a/mm/swap.c b/mm/swap.c
index d0a5971787c4..2747230ced89 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -40,6 +40,10 @@
 
 #include "internal.h"
 
+#ifdef CONFIG_GUESTMEM_HUGETLB
+#include "guestmem_hugetlb.h"
+#endif
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/pagemap.h>
 
@@ -101,6 +105,11 @@ static void free_typed_folio(struct folio *folio)
 	case PGTY_hugetlb:
 		free_huge_folio(folio);
 		return;
+#endif
+#ifdef CONFIG_GUESTMEM_HUGETLB
+	case PGTY_guestmem_hugetlb:
+		guestmem_hugetlb_handle_folio_put(folio);
+		return;
 #endif
 	default:
 		WARN_ON_ONCE(1);
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 29/51] mm: guestmem_hugetlb: Wrap HugeTLB as an allocator for guest_memfd
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (27 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 28/51] mm: Introduce guestmem_hugetlb to support folio_put() handling of guestmem pages Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-16 14:07   ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 30/51] mm: truncate: Expose truncate_inode_folio() Ackerley Tng
                   ` (26 subsequent siblings)
  55 siblings, 1 reply; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

guestmem_hugetlb is an allocator for guest_memfd. It wraps HugeTLB to
provide huge folios for guest_memfd.

This patch also introduces guestmem_allocator_operations as a set of
operations that allocators for guest_memfd can provide. In a later
patch, guest_memfd will use these operations to manage pages from an
allocator.

The allocator operations are memory-management specific and are placed
in mm/ so key mm-specific functions do not have to be exposed
unnecessarily.
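
As a sketch of how guest_memfd is expected to drive these operations
(the actual call sites are added in later guest_memfd patches in this
series):

	/* At guest_memfd creation: */
	priv = guestmem_hugetlb_ops.inode_setup(size, flags);

	/* When backing an index with memory: */
	folio = guestmem_hugetlb_ops.alloc_folio(priv);
	nr_pages = guestmem_hugetlb_ops.nr_pages_in_folio(priv);

	/* When the inode is torn down: */
	guestmem_hugetlb_ops.inode_teardown(priv, inode_size);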

Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Change-Id: I3cafe111ea7b3c84755d7112ff8f8c541c11136d
---
 include/linux/guestmem.h      |  20 +++++
 include/uapi/linux/guestmem.h |  29 +++++++
 mm/Kconfig                    |   5 +-
 mm/guestmem_hugetlb.c         | 159 ++++++++++++++++++++++++++++++++++
 4 files changed, 212 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/guestmem.h
 create mode 100644 include/uapi/linux/guestmem.h

diff --git a/include/linux/guestmem.h b/include/linux/guestmem.h
new file mode 100644
index 000000000000..4b2d820274d9
--- /dev/null
+++ b/include/linux/guestmem.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_GUESTMEM_H
+#define _LINUX_GUESTMEM_H
+
+#include <linux/fs.h>
+
+struct guestmem_allocator_operations {
+	void *(*inode_setup)(size_t size, u64 flags);
+	void (*inode_teardown)(void *private, size_t inode_size);
+	struct folio *(*alloc_folio)(void *private);
+	/*
+	 * Returns the number of PAGE_SIZE pages in a folio that this guestmem
+	 * allocator provides.
+	 */
+	size_t (*nr_pages_in_folio)(void *priv);
+};
+
+extern const struct guestmem_allocator_operations guestmem_hugetlb_ops;
+
+#endif
diff --git a/include/uapi/linux/guestmem.h b/include/uapi/linux/guestmem.h
new file mode 100644
index 000000000000..2e518682edd5
--- /dev/null
+++ b/include/uapi/linux/guestmem.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_GUESTMEM_H
+#define _UAPI_LINUX_GUESTMEM_H
+
+/*
+ * Huge page size must be explicitly defined when using the guestmem_hugetlb
+ * allocator for guest_memfd.  It is the responsibility of the application to
+ * know which sizes are supported on the running system.  See mmap(2) man page
+ * for details.
+ */
+
+#define GUESTMEM_HUGETLB_FLAG_SHIFT	58
+#define GUESTMEM_HUGETLB_FLAG_MASK	0x3fUL
+
+#define GUESTMEM_HUGETLB_FLAG_16KB	(14UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
+#define GUESTMEM_HUGETLB_FLAG_64KB	(16UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
+#define GUESTMEM_HUGETLB_FLAG_512KB	(19UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
+#define GUESTMEM_HUGETLB_FLAG_1MB	(20UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
+#define GUESTMEM_HUGETLB_FLAG_2MB	(21UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
+#define GUESTMEM_HUGETLB_FLAG_8MB	(23UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
+#define GUESTMEM_HUGETLB_FLAG_16MB	(24UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
+#define GUESTMEM_HUGETLB_FLAG_32MB	(25UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
+#define GUESTMEM_HUGETLB_FLAG_256MB	(28UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
+#define GUESTMEM_HUGETLB_FLAG_512MB	(29UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
+#define GUESTMEM_HUGETLB_FLAG_1GB	(30UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
+#define GUESTMEM_HUGETLB_FLAG_2GB	(31UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
+#define GUESTMEM_HUGETLB_FLAG_16GB	(34UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
+
+#endif /* _UAPI_LINUX_GUESTMEM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 131adc49f58d..bb6e39e37245 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1218,7 +1218,10 @@ config SECRETMEM
 
 config GUESTMEM_HUGETLB
 	bool "Enable guestmem_hugetlb allocator for guest_memfd"
-	depends on HUGETLBFS
+	select GUESTMEM
+	select HUGETLBFS
+	select HUGETLB_PAGE
+	select HUGETLB_PAGE_OPTIMIZE_VMEMMAP
 	help
 	  Enable this to make HugeTLB folios available to guest_memfd
 	  (KVM virtualization) as backing memory.
diff --git a/mm/guestmem_hugetlb.c b/mm/guestmem_hugetlb.c
index 51a724ebcc50..5459ef7eb329 100644
--- a/mm/guestmem_hugetlb.c
+++ b/mm/guestmem_hugetlb.c
@@ -5,6 +5,14 @@
  */
 
 #include <linux/mm_types.h>
+#include <linux/guestmem.h>
+#include <linux/hugetlb.h>
+#include <linux/hugetlb_cgroup.h>
+#include <linux/mempolicy.h>
+#include <linux/mm.h>
+#include <linux/pagemap.h>
+
+#include <uapi/linux/guestmem.h>
 
 #include "guestmem_hugetlb.h"
 
@@ -12,3 +20,154 @@ void guestmem_hugetlb_handle_folio_put(struct folio *folio)
 {
 	WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
 }
+
+struct guestmem_hugetlb_private {
+	struct hstate *h;
+	struct hugepage_subpool *spool;
+	struct hugetlb_cgroup *h_cg_rsvd;
+};
+
+static size_t guestmem_hugetlb_nr_pages_in_folio(void *priv)
+{
+	struct guestmem_hugetlb_private *private = priv;
+
+	return pages_per_huge_page(private->h);
+}
+
+static void *guestmem_hugetlb_setup(size_t size, u64 flags)
+
+{
+	struct guestmem_hugetlb_private *private;
+	struct hugetlb_cgroup *h_cg_rsvd = NULL;
+	struct hugepage_subpool *spool;
+	unsigned long nr_pages;
+	int page_size_log;
+	struct hstate *h;
+	long hpages;
+	int idx;
+	int ret;
+
+	page_size_log = (flags >> GUESTMEM_HUGETLB_FLAG_SHIFT) &
+			GUESTMEM_HUGETLB_FLAG_MASK;
+	h = hstate_sizelog(page_size_log);
+	if (!h)
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * Check against h because page_size_log could be 0 to request default
+	 * HugeTLB page size.
+	 */
+	if (!IS_ALIGNED(size, huge_page_size(h)))
+		return ERR_PTR(-EINVAL);
+
+	private = kzalloc(sizeof(*private), GFP_KERNEL);
+	if (!private)
+		return ERR_PTR(-ENOMEM);
+
+	/* Creating a subpool makes reservations, hence charge for them now. */
+	idx = hstate_index(h);
+	nr_pages = size >> PAGE_SHIFT;
+	ret = hugetlb_cgroup_charge_cgroup_rsvd(idx, nr_pages, &h_cg_rsvd);
+	if (ret)
+		goto err_free;
+
+	hpages = size >> huge_page_shift(h);
+	spool = hugepage_new_subpool(h, hpages, hpages, false);
+	if (!spool)
+		goto err_uncharge;
+
+	private->h = h;
+	private->spool = spool;
+	private->h_cg_rsvd = h_cg_rsvd;
+
+	return private;
+
+err_uncharge:
+	ret = -ENOMEM;
+	hugetlb_cgroup_uncharge_cgroup_rsvd(idx, nr_pages, h_cg_rsvd);
+err_free:
+	kfree(private);
+	return ERR_PTR(ret);
+}
+
+static void guestmem_hugetlb_teardown(void *priv, size_t inode_size)
+{
+	struct guestmem_hugetlb_private *private = priv;
+	unsigned long nr_pages;
+	int idx;
+
+	hugepage_put_subpool(private->spool);
+
+	idx = hstate_index(private->h);
+	nr_pages = inode_size >> PAGE_SHIFT;
+	hugetlb_cgroup_uncharge_cgroup_rsvd(idx, nr_pages, private->h_cg_rsvd);
+
+	kfree(private);
+}
+
+static struct folio *guestmem_hugetlb_alloc_folio(void *priv)
+{
+	struct guestmem_hugetlb_private *private = priv;
+	struct mempolicy *mpol;
+	struct folio *folio;
+	pgoff_t ilx;
+	int ret;
+
+	ret = hugepage_subpool_get_pages(private->spool, 1);
+	if (ret == -ENOMEM) {
+		return ERR_PTR(-ENOMEM);
+	} else if (ret > 0) {
+		/* guest_memfd will not use surplus pages. */
+		goto err_put_pages;
+	}
+
+	/*
+	 * TODO: mempolicy would probably have to be stored on the inode, use
+	 * task policy for now.
+	 */
+	mpol = get_task_policy(current);
+
+	/* TODO: ignore interleaving for now. */
+	ilx = NO_INTERLEAVE_INDEX;
+
+	/*
+	 * charge_cgroup_rsvd is false because we already charged reservations
+	 * when creating the subpool for this
+	 * guest_memfd. use_existing_reservation is true - we're using a
+	 * reservation from the guest_memfd's subpool.
+	 */
+	folio = hugetlb_alloc_folio(private->h, mpol, ilx, false, true);
+	mpol_cond_put(mpol);
+
+	if (IS_ERR_OR_NULL(folio))
+		goto err_put_pages;
+
+	/*
+	 * Clear restore_reserve here so that when this folio is freed,
+	 * free_huge_folio() will always attempt to return the reservation to
+	 * the subpool.  guest_memfd, unlike regular hugetlb, has no resv_map,
+	 * and hence when freeing, the folio needs to be returned to the
+	 * subpool.  guest_memfd does not use surplus hugetlb pages, so in
+	 * free_huge_folio(), returning to subpool will always succeed and the
+	 * hstate reservation will then get restored.
+	 *
+	 * hugetlbfs does this in hugetlb_add_to_page_cache().
+	 */
+	folio_clear_hugetlb_restore_reserve(folio);
+
+	hugetlb_set_folio_subpool(folio, private->spool);
+
+	return folio;
+
+err_put_pages:
+	hugepage_subpool_put_pages(private->spool, 1);
+	return ERR_PTR(-ENOMEM);
+}
+
+const struct guestmem_allocator_operations guestmem_hugetlb_ops = {
+	.inode_setup = guestmem_hugetlb_setup,
+	.inode_teardown = guestmem_hugetlb_teardown,
+	.alloc_folio = guestmem_hugetlb_alloc_folio,
+	.nr_pages_in_folio = guestmem_hugetlb_nr_pages_in_folio,
+};
+EXPORT_SYMBOL_GPL(guestmem_hugetlb_ops);
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 30/51] mm: truncate: Expose truncate_inode_folio()
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (28 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 29/51] mm: guestmem_hugetlb: Wrap HugeTLB as an allocator for guest_memfd Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 31/51] KVM: x86: Set disallow_lpage on base_gfn and guest_memfd pgoff misalignment Ackerley Tng
                   ` (25 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

guest_memfd will use truncate_inode_folio() to remove folios from its
filemap.
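
The intended call pattern in guest_memfd (added later in this series)
is roughly:

	folio_lock(folio);
	truncate_inode_folio(folio->mapping, folio);
	folio_unlock(folio);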

Change-Id: Iab72c6d4138cf19f6efeb38341eabe28ded42fd6
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/linux/mm.h    | 1 +
 mm/guestmem_hugetlb.c | 2 +-
 mm/internal.h         | 1 -
 mm/truncate.c         | 1 +
 4 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e4e73c231ced..74ca6b7d1d43 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2530,6 +2530,7 @@ extern void truncate_pagecache(struct inode *inode, loff_t new);
 extern void truncate_setsize(struct inode *inode, loff_t newsize);
 void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to);
 void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end);
+int truncate_inode_folio(struct address_space *mapping, struct folio *folio);
 int generic_error_remove_folio(struct address_space *mapping,
 		struct folio *folio);
 
diff --git a/mm/guestmem_hugetlb.c b/mm/guestmem_hugetlb.c
index 5459ef7eb329..ec5a188ca2a7 100644
--- a/mm/guestmem_hugetlb.c
+++ b/mm/guestmem_hugetlb.c
@@ -4,12 +4,12 @@
  * as an allocator for guest_memfd.
  */
 
-#include <linux/mm_types.h>
 #include <linux/guestmem.h>
 #include <linux/hugetlb.h>
 #include <linux/hugetlb_cgroup.h>
 #include <linux/mempolicy.h>
 #include <linux/mm.h>
+#include <linux/mm_types.h>
 #include <linux/pagemap.h>
 
 #include <uapi/linux/guestmem.h>
diff --git a/mm/internal.h b/mm/internal.h
index 25a29872c634..a1694f030539 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -448,7 +448,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
 unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
 		pgoff_t end, struct folio_batch *fbatch, pgoff_t *indices);
 void filemap_free_folio(struct address_space *mapping, struct folio *folio);
-int truncate_inode_folio(struct address_space *mapping, struct folio *folio);
 bool truncate_inode_partial_folio(struct folio *folio, loff_t start,
 		loff_t end);
 long mapping_evict_folio(struct address_space *mapping, struct folio *folio);
diff --git a/mm/truncate.c b/mm/truncate.c
index 057e4aa73aa9..4baab1e5d2cf 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -176,6 +176,7 @@ int truncate_inode_folio(struct address_space *mapping, struct folio *folio)
 	filemap_remove_folio(folio);
 	return 0;
 }
+EXPORT_SYMBOL_GPL(truncate_inode_folio);
 
 /*
  * Handle partial folios.  The folio may be entirely within the
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 31/51] KVM: x86: Set disallow_lpage on base_gfn and guest_memfd pgoff misalignment
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (29 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 30/51] mm: truncate: Expose truncate_inode_folio() Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 32/51] KVM: guest_memfd: Support guestmem_hugetlb as custom allocator Ackerley Tng
                   ` (24 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

When slot->base_gfn and userspace_addr are not aligned with respect to
each other, large page support is disabled for the entire memslot.

This patch applies the same logic when slot->base_gfn and gmem.pgoff
are not aligned with respect to each other.
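
As a worked example at the 2M level (512 base pages per huge page): a
slot with base_gfn 0x800 and gmem.pgoff 0x100 has base_gfn ^ pgoff =
0x900, which is not a multiple of 512, so no fixed-level translation
can cover both and 2M mappings are disallowed for the whole slot. With
gmem.pgoff 0x400 the XOR is 0xc00, which is 512-aligned, so 2M mappings
remain allowed.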

Change-Id: Iab21b8995e77beae6dbadc3b623a1e9e07e6dce6
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 arch/x86/kvm/x86.c | 53 ++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 44 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 12433b1e755b..ee0e3420cc17 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12950,6 +12950,46 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages)
 	return 0;
 }
 
+static inline bool kvm_is_level_aligned(u64 value, int level)
+{
+	return IS_ALIGNED(value, KVM_PAGES_PER_HPAGE(level));
+}
+
+static inline bool
+kvm_should_allow_lpage_for_slot(struct kvm_memory_slot *slot, int level)
+{
+	bool gfn_and_userspace_addr_aligned;
+	unsigned long ugfn;
+
+	ugfn = slot->userspace_addr >> PAGE_SHIFT;
+
+	/*
+	 * If addresses are not aligned wrt each other, then large page mapping
+	 * cannot be allowed for the slot since page tables only allow guest to
+	 * host translations to function at fixed levels.
+	 */
+	gfn_and_userspace_addr_aligned =
+		kvm_is_level_aligned(slot->base_gfn ^ ugfn, level);
+
+	/*
+	 * If slot->userspace_addr is 0 (disabled), 0 is always aligned so the
+	 * check is deferred to gmem.pgoff.
+	 */
+	if (!gfn_and_userspace_addr_aligned)
+		return false;
+
+	if (kvm_slot_has_gmem(slot)) {
+		bool gfn_and_gmem_pgoff_aligned;
+
+		gfn_and_gmem_pgoff_aligned = kvm_is_level_aligned(
+			slot->base_gfn ^ slot->gmem.pgoff, level);
+
+		return gfn_and_gmem_pgoff_aligned;
+	}
+
+	return true;
+}
+
 static int kvm_alloc_memslot_metadata(struct kvm *kvm,
 				      struct kvm_memory_slot *slot)
 {
@@ -12971,7 +13011,6 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
 
 	for (i = 1; i < KVM_NR_PAGE_SIZES; ++i) {
 		struct kvm_lpage_info *linfo;
-		unsigned long ugfn;
 		int lpages;
 		int level = i + 1;
 
@@ -12983,16 +13022,12 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
 
 		slot->arch.lpage_info[i - 1] = linfo;
 
-		if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
+		if (!kvm_is_level_aligned(slot->base_gfn, level))
 			linfo[0].disallow_lpage = 1;
-		if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
+		if (!kvm_is_level_aligned(slot->base_gfn + npages, level))
 			linfo[lpages - 1].disallow_lpage = 1;
-		ugfn = slot->userspace_addr >> PAGE_SHIFT;
-		/*
-		 * If the gfn and userspace address are not aligned wrt each
-		 * other, disable large page support for this slot.
-		 */
-		if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) {
+
+		if (!kvm_should_allow_lpage_for_slot(slot, level)) {
 			unsigned long j;
 
 			for (j = 0; j < lpages; ++j)
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 32/51] KVM: guest_memfd: Support guestmem_hugetlb as custom allocator
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (30 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 31/51] KVM: x86: Set disallow_lpage on base_gfn and guest_memfd pgoff misalignment Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-23 10:47   ` Yan Zhao
  2025-08-12  9:13   ` Tony Lindgren
  2025-05-14 23:42 ` [RFC PATCH v2 33/51] KVM: guest_memfd: Allocate and truncate from " Ackerley Tng
                   ` (23 subsequent siblings)
  55 siblings, 2 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

This patch adds support for guestmem_hugetlb as the first custom
allocator in guest_memfd.

If requested at guest_memfd creation time, the custom allocator is used
for inode setup and for cleanup (truncation at inode eviction and
allocator teardown when the inode is freed).
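
A minimal usage sketch from userspace, assuming the flags introduced in
this series (error handling omitted):

	struct kvm_create_guest_memfd args = {
		.size = 4UL << 30,	/* 4 GiB, 1G-aligned */
		.flags = GUEST_MEMFD_FLAG_HUGETLB | GUESTMEM_HUGETLB_FLAG_1GB,
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);

The allocator-specific bits are opaque to guest_memfd:
GUESTMEM_HUGETLB_FLAG_1GB is 30UL << GUESTMEM_HUGETLB_FLAG_SHIFT, and
guestmem_hugetlb recovers page_size_log = 30 (i.e. 1 GiB) from bits
63:58 of the flags.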

Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Change-Id: I1eb9625dc761ecadcc2aa21480cfdfcf9ab7ce67
---
 include/uapi/linux/kvm.h |   1 +
 virt/kvm/Kconfig         |   5 +
 virt/kvm/guest_memfd.c   | 203 +++++++++++++++++++++++++++++++++++++--
 3 files changed, 199 insertions(+), 10 deletions(-)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 433e184f83ea..af486b2e4862 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1571,6 +1571,7 @@ struct kvm_memory_attributes {
 
 #define GUEST_MEMFD_FLAG_SUPPORT_SHARED	(1UL << 0)
 #define GUEST_MEMFD_FLAG_INIT_PRIVATE	(1UL << 1)
+#define GUEST_MEMFD_FLAG_HUGETLB	(1UL << 2)
 
 struct kvm_create_guest_memfd {
 	__u64 size;
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 14ffd9c1d480..ff917bb57371 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -133,3 +133,8 @@ config KVM_GMEM_SHARED_MEM
        select KVM_GMEM
        bool
        prompt "Enables in-place shared memory for guest_memfd"
+
+config KVM_GMEM_HUGETLB
+       select KVM_PRIVATE_MEM
+       depends on GUESTMEM_HUGETLB
+       bool "Enables using a custom allocator with guest_memfd, see CONFIG_GUESTMEM_HUGETLB"
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 8c9c9e54616b..c65d93c5a443 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -3,11 +3,14 @@
 #include <linux/backing-dev.h>
 #include <linux/falloc.h>
 #include <linux/fs.h>
+#include <linux/guestmem.h>
 #include <linux/kvm_host.h>
 #include <linux/maple_tree.h>
 #include <linux/pseudo_fs.h>
 #include <linux/pagemap.h>
 
+#include <uapi/linux/guestmem.h>
+
 #include "kvm_mm.h"
 
 static struct vfsmount *kvm_gmem_mnt;
@@ -22,6 +25,10 @@ struct kvm_gmem_inode_private {
 #ifdef CONFIG_KVM_GMEM_SHARED_MEM
 	struct maple_tree shareability;
 #endif
+#ifdef CONFIG_KVM_GMEM_HUGETLB
+	const struct guestmem_allocator_operations *allocator_ops;
+	void *allocator_private;
+#endif
 };
 
 enum shareability {
@@ -40,6 +47,44 @@ static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
 	return inode->i_mapping->i_private_data;
 }
 
+#ifdef CONFIG_KVM_GMEM_HUGETLB
+
+static const struct guestmem_allocator_operations *
+kvm_gmem_allocator_ops(struct inode *inode)
+{
+	return kvm_gmem_private(inode)->allocator_ops;
+}
+
+static void *kvm_gmem_allocator_private(struct inode *inode)
+{
+	return kvm_gmem_private(inode)->allocator_private;
+}
+
+static bool kvm_gmem_has_custom_allocator(struct inode *inode)
+{
+	return kvm_gmem_allocator_ops(inode) != NULL;
+}
+
+#else
+
+static const struct guestmem_allocator_operations *
+kvm_gmem_allocator_ops(struct inode *inode)
+{
+	return NULL;
+}
+
+static void *kvm_gmem_allocator_private(struct inode *inode)
+{
+	return NULL;
+}
+
+static bool kvm_gmem_has_custom_allocator(struct inode *inode)
+{
+	return false;
+}
+
+#endif
+
 /**
  * folio_file_pfn - like folio_file_page, but return a pfn.
  * @folio: The folio which contains this index.
@@ -510,7 +555,6 @@ static int kvm_gmem_filemap_add_folio(struct address_space *mapping,
 static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 {
 	struct folio *folio;
-	gfp_t gfp;
 	int ret;
 
 repeat:
@@ -518,17 +562,24 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 	if (!IS_ERR(folio))
 		return folio;
 
-	gfp = mapping_gfp_mask(inode->i_mapping);
+	if (kvm_gmem_has_custom_allocator(inode)) {
+		void *p = kvm_gmem_allocator_private(inode);
 
-	/* TODO: Support huge pages. */
-	folio = filemap_alloc_folio(gfp, 0);
-	if (!folio)
-		return ERR_PTR(-ENOMEM);
+		folio = kvm_gmem_allocator_ops(inode)->alloc_folio(p);
+		if (IS_ERR(folio))
+			return folio;
+	} else {
+		gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
 
-	ret = mem_cgroup_charge(folio, NULL, gfp);
-	if (ret) {
-		folio_put(folio);
-		return ERR_PTR(ret);
+		folio = filemap_alloc_folio(gfp, 0);
+		if (!folio)
+			return ERR_PTR(-ENOMEM);
+
+		ret = mem_cgroup_charge(folio, NULL, gfp);
+		if (ret) {
+			folio_put(folio);
+			return ERR_PTR(ret);
+		}
 	}
 
 	ret = kvm_gmem_filemap_add_folio(inode->i_mapping, folio, index);
@@ -611,6 +662,80 @@ static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
 	}
 }
 
+/**
+ * kvm_gmem_truncate_indices() - Truncates all folios beginning at @index for
+ * @nr_pages.
+ *
+ * @mapping: filemap to truncate pages from.
+ * @index: the index in the filemap to begin truncation.
+ * @nr_pages: number of PAGE_SIZE pages to truncate.
+ *
+ * Return: the number of PAGE_SIZE pages that were actually truncated.
+ */
+static long kvm_gmem_truncate_indices(struct address_space *mapping,
+				      pgoff_t index, size_t nr_pages)
+{
+	struct folio_batch fbatch;
+	long truncated;
+	pgoff_t last;
+
+	last = index + nr_pages - 1;
+
+	truncated = 0;
+	folio_batch_init(&fbatch);
+	while (filemap_get_folios(mapping, &index, last, &fbatch)) {
+		unsigned int i;
+
+		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
+			struct folio *f = fbatch.folios[i];
+
+			truncated += folio_nr_pages(f);
+			folio_lock(f);
+			truncate_inode_folio(f->mapping, f);
+			folio_unlock(f);
+		}
+
+		folio_batch_release(&fbatch);
+		cond_resched();
+	}
+
+	return truncated;
+}
+
+/**
+ * kvm_gmem_truncate_inode_aligned_pages() - Removes entire folios from filemap
+ * in @inode.
+ *
+ * @inode: inode to remove folios from.
+ * @index: start of range to be truncated. Must be hugepage aligned.
+ * @nr_pages: number of PAGE_SIZE pages to be iterated over.
+ *
+ * Removes folios beginning at @index for @nr_pages from the filemap in @inode
+ * and updates inode metadata.
+ */
+static void kvm_gmem_truncate_inode_aligned_pages(struct inode *inode,
+						  pgoff_t index,
+						  size_t nr_pages)
+{
+	size_t nr_per_huge_page;
+	long num_freed;
+	pgoff_t idx;
+	void *priv;
+
+	priv = kvm_gmem_allocator_private(inode);
+	nr_per_huge_page = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
+
+	num_freed = 0;
+	for (idx = index; idx < index + nr_pages; idx += nr_per_huge_page) {
+		num_freed += kvm_gmem_truncate_indices(
+			inode->i_mapping, idx, nr_per_huge_page);
+	}
+
+	spin_lock(&inode->i_lock);
+	inode->i_blocks -= (num_freed << PAGE_SHIFT) / 512;
+	spin_unlock(&inode->i_lock);
+}
+
 static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 {
 	struct list_head *gmem_list = &inode->i_mapping->i_private_list;
@@ -940,6 +1065,13 @@ static void kvm_gmem_free_inode(struct inode *inode)
 {
 	struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
 
+	/* private may be NULL if inode creation process had an error. */
+	if (private && kvm_gmem_has_custom_allocator(inode)) {
+		void *p = kvm_gmem_allocator_private(inode);
+
+		kvm_gmem_allocator_ops(inode)->inode_teardown(p, inode->i_size);
+	}
+
 	kfree(private);
 
 	free_inode_nonrcu(inode);
@@ -959,8 +1091,24 @@ static void kvm_gmem_destroy_inode(struct inode *inode)
 #endif
 }
 
+static void kvm_gmem_evict_inode(struct inode *inode)
+{
+	truncate_inode_pages_final_prepare(inode->i_mapping);
+
+	if (kvm_gmem_has_custom_allocator(inode)) {
+		size_t nr_pages = inode->i_size >> PAGE_SHIFT;
+
+		kvm_gmem_truncate_inode_aligned_pages(inode, 0, nr_pages);
+	} else {
+		truncate_inode_pages(inode->i_mapping, 0);
+	}
+
+	clear_inode(inode);
+}
+
 static const struct super_operations kvm_gmem_super_operations = {
 	.statfs		= simple_statfs,
+	.evict_inode	= kvm_gmem_evict_inode,
 	.destroy_inode	= kvm_gmem_destroy_inode,
 	.free_inode	= kvm_gmem_free_inode,
 };
@@ -1062,6 +1210,12 @@ static void kvm_gmem_free_folio(struct folio *folio)
 {
 	folio_clear_unevictable(folio);
 
+	/*
+	 * No-op for a 4K page since PG_uptodate is cleared as part of
+	 * freeing, but may be required for other allocators to reset the page.
+	 */
+	folio_clear_uptodate(folio);
+
 	kvm_gmem_invalidate(folio);
 }
 
@@ -1115,6 +1269,25 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 	if (err)
 		goto out;
 
+#ifdef CONFIG_KVM_GMEM_HUGETLB
+	if (flags & GUEST_MEMFD_FLAG_HUGETLB) {
+		void *allocator_priv;
+		size_t nr_pages;
+
+		allocator_priv = guestmem_hugetlb_ops.inode_setup(size, flags);
+		if (IS_ERR(allocator_priv)) {
+			err = PTR_ERR(allocator_priv);
+			goto out;
+		}
+
+		private->allocator_ops = &guestmem_hugetlb_ops;
+		private->allocator_private = allocator_priv;
+
+		nr_pages = guestmem_hugetlb_ops.nr_pages_in_folio(allocator_priv);
+		inode->i_blkbits = ilog2(nr_pages << PAGE_SHIFT);
+	}
+#endif
+
 	inode->i_private = (void *)(unsigned long)flags;
 	inode->i_op = &kvm_gmem_iops;
 	inode->i_mapping->a_ops = &kvm_gmem_aops;
@@ -1210,6 +1383,10 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	return err;
 }
 
+/* Mask of bits belonging to allocators and are opaque to guest_memfd. */
+#define SUPPORTED_CUSTOM_ALLOCATOR_MASK \
+	(GUESTMEM_HUGETLB_FLAG_MASK << GUESTMEM_HUGETLB_FLAG_SHIFT)
+
 int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 {
 	loff_t size = args->size;
@@ -1222,6 +1399,12 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 	if (flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED)
 		valid_flags |= GUEST_MEMFD_FLAG_INIT_PRIVATE;
 
+	if (IS_ENABLED(CONFIG_KVM_GMEM_HUGETLB) &&
+	    flags & GUEST_MEMFD_FLAG_HUGETLB) {
+		valid_flags |= GUEST_MEMFD_FLAG_HUGETLB |
+			       SUPPORTED_CUSTOM_ALLOCATOR_MASK;
+	}
+
 	if (flags & ~valid_flags)
 		return -EINVAL;
 
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 33/51] KVM: guest_memfd: Allocate and truncate from custom allocator
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (31 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 32/51] KVM: guest_memfd: Support guestmem_hugetlb as custom allocator Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-21 18:05   ` Vishal Annapurve
                     ` (3 more replies)
  2025-05-14 23:42 ` [RFC PATCH v2 34/51] mm: hugetlb: Add functions to add/delete folio from hugetlb lists Ackerley Tng
                   ` (22 subsequent siblings)
  55 siblings, 4 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

If a custom allocator is requested at guest_memfd creation time, pages
from the custom allocator will be used to back guest_memfd.
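
As a worked example with 2M huge pages (and a file at least 7M large):
punching a hole over [1M, 7M) zeroes the partial range [1M, 2M),
truncates the fully-covered huge pages spanning [2M, 6M), and zeroes
the partial range [6M, 7M). Only whole huge pages are removed from the
filemap and freed; partially covered huge pages are merely zeroed.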

Change-Id: I59df960b3273790f42fe5bea54a234f40962eb75
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 mm/memory.c            |   1 +
 virt/kvm/guest_memfd.c | 142 +++++++++++++++++++++++++++++++++++++----
 2 files changed, 132 insertions(+), 11 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index ba3ea0a82f7f..3af45e96913c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7249,6 +7249,7 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
 	else
 		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
 }
+EXPORT_SYMBOL_GPL(folio_zero_user);
 
 static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
 				   unsigned long addr_hint,
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index c65d93c5a443..24d270b9b725 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -478,15 +478,13 @@ static inline void kvm_gmem_mark_prepared(struct folio *folio)
  * leaking host data and the up-to-date flag is set.
  */
 static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
-				  gfn_t gfn, struct folio *folio)
+				  gfn_t gfn, struct folio *folio,
+				  unsigned long addr_hint)
 {
-	unsigned long nr_pages, i;
 	pgoff_t index;
 	int r;
 
-	nr_pages = folio_nr_pages(folio);
-	for (i = 0; i < nr_pages; i++)
-		clear_highpage(folio_page(folio, i));
+	folio_zero_user(folio, addr_hint);
 
 	/*
 	 * Preparing huge folios should always be safe, since it should
@@ -554,7 +552,9 @@ static int kvm_gmem_filemap_add_folio(struct address_space *mapping,
  */
 static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 {
+	size_t allocated_size;
 	struct folio *folio;
+	pgoff_t index_floor;
 	int ret;
 
 repeat:
@@ -581,8 +581,10 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 			return ERR_PTR(ret);
 		}
 	}
+	allocated_size = folio_size(folio);
 
-	ret = kvm_gmem_filemap_add_folio(inode->i_mapping, folio, index);
+	index_floor = round_down(index, folio_nr_pages(folio));
+	ret = kvm_gmem_filemap_add_folio(inode->i_mapping, folio, index_floor);
 	if (ret) {
 		folio_put(folio);
 
@@ -598,7 +600,17 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 		return ERR_PTR(ret);
 	}
 
-	__folio_set_locked(folio);
+	spin_lock(&inode->i_lock);
+	inode->i_blocks += allocated_size / 512;
+	spin_unlock(&inode->i_lock);
+
+	/*
+	 * folio is the one that is allocated, this gets the folio at the
+	 * requested index.
+	 */
+	folio = page_folio(folio_file_page(folio, index));
+	folio_lock(folio);
+
 	return folio;
 }
 
@@ -736,6 +748,92 @@ static void kvm_gmem_truncate_inode_aligned_pages(struct inode *inode,
 	spin_unlock(&inode->i_lock);
 }
 
+/**
+ * kvm_gmem_zero_range() - Zeroes all sub-pages in range [@start, @end).
+ *
+ * @mapping: the filemap to remove this range from.
+ * @start: index in filemap for start of range (inclusive).
+ * @end: index in filemap for end of range (exclusive).
+ *
+ * The pages in range may be split. truncate_inode_pages_range() isn't the right
+ * function because it removes pages from the page cache; this function only
+ * zeroes the pages.
+ */
+static void kvm_gmem_zero_range(struct address_space *mapping,
+				pgoff_t start, pgoff_t end)
+{
+	struct folio_batch fbatch;
+
+	folio_batch_init(&fbatch);
+	while (filemap_get_folios(mapping, &start, end - 1, &fbatch)) {
+		unsigned int i;
+
+		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
+			struct folio *f;
+			size_t nr_bytes;
+
+			f = fbatch.folios[i];
+			nr_bytes = offset_in_folio(f, end << PAGE_SHIFT);
+			if (nr_bytes == 0)
+				nr_bytes = folio_size(f);
+
+			folio_zero_segment(f, 0, nr_bytes);
+		}
+
+		folio_batch_release(&fbatch);
+		cond_resched();
+	}
+}
+
+/**
+ * kvm_gmem_truncate_inode_range() - Truncate pages in range [@lstart, @lend).
+ *
+ * @inode: inode to truncate from.
+ * @lstart: offset in inode for start of range (inclusive).
+ * @lend: offset in inode for end of range (exclusive).
+ *
+ * Removes full (huge)pages from the filemap and zeroes incomplete
+ * (huge)pages. The pages in the range may be split.
+ */
+static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
+					  loff_t lend)
+{
+	pgoff_t full_hpage_start;
+	size_t nr_per_huge_page;
+	pgoff_t full_hpage_end;
+	size_t nr_pages;
+	pgoff_t start;
+	pgoff_t end;
+	void *priv;
+
+	priv = kvm_gmem_allocator_private(inode);
+	nr_per_huge_page = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
+
+	start = lstart >> PAGE_SHIFT;
+	end = min(lend, i_size_read(inode)) >> PAGE_SHIFT;
+
+	full_hpage_start = round_up(start, nr_per_huge_page);
+	full_hpage_end = round_down(end, nr_per_huge_page);
+
+	if (start < full_hpage_start) {
+		pgoff_t zero_end = min(full_hpage_start, end);
+
+		kvm_gmem_zero_range(inode->i_mapping, start, zero_end);
+	}
+
+	if (full_hpage_end > full_hpage_start) {
+		nr_pages = full_hpage_end - full_hpage_start;
+		kvm_gmem_truncate_inode_aligned_pages(inode, full_hpage_start,
+						      nr_pages);
+	}
+
+	if (end > full_hpage_end && end > full_hpage_start) {
+		pgoff_t zero_start = max(full_hpage_end, start);
+
+		kvm_gmem_zero_range(inode->i_mapping, zero_start, end);
+	}
+}
+
 static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 {
 	struct list_head *gmem_list = &inode->i_mapping->i_private_list;
@@ -752,7 +850,12 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	list_for_each_entry(gmem, gmem_list, entry)
 		kvm_gmem_invalidate_begin(gmem, start, end);
 
-	truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
+	if (kvm_gmem_has_custom_allocator(inode)) {
+		kvm_gmem_truncate_inode_range(inode, offset, offset + len);
+	} else {
+		/* Page size is PAGE_SIZE, so use optimized truncation function. */
+		truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
+	}
 
 	list_for_each_entry(gmem, gmem_list, entry)
 		kvm_gmem_invalidate_end(gmem, start, end);
@@ -776,6 +879,16 @@ static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
 
 	start = offset >> PAGE_SHIFT;
 	end = (offset + len) >> PAGE_SHIFT;
+	if (kvm_gmem_has_custom_allocator(inode)) {
+		size_t nr_pages;
+		void *p;
+
+		p = kvm_gmem_allocator_private(inode);
+		nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(p);
+
+		start = round_down(start, nr_pages);
+		end = round_down(end, nr_pages);
+	}
 
 	r = 0;
 	for (index = start; index < end; ) {
@@ -1570,7 +1683,7 @@ static struct folio *__kvm_gmem_get_pfn(struct file *file,
 
 	*pfn = folio_file_pfn(folio, index);
 	if (max_order)
-		*max_order = 0;
+		*max_order = folio_order(folio);
 
 	*is_prepared = folio_test_uptodate(folio);
 	return folio;
@@ -1597,8 +1710,15 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 		goto out;
 	}
 
-	if (!is_prepared)
-		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
+	if (!is_prepared) {
+		/*
+		 * Use the same address as hugetlb for zeroing private pages
+		 * that won't be mapped to userspace anyway.
+		 */
+		unsigned long addr_hint = folio->index << PAGE_SHIFT;
+
+		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio, addr_hint);
+	}
 
 	folio_unlock(folio);
 
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 34/51] mm: hugetlb: Add functions to add/delete folio from hugetlb lists
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (32 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 33/51] KVM: guest_memfd: Allocate and truncate from " Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 35/51] mm: guestmem_hugetlb: Add support for splitting and merging pages Ackerley Tng
                   ` (21 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

These functions are introduced in hugetlb.c so that the private
hugetlb_lock can be accessed.

They will be used for splitting and merging pages in a later patch.
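
A minimal sketch of how a caller is expected to use them (the list head
below is purely illustrative):

	static LIST_HEAD(guestmem_folios);	/* hypothetical list */

	hugetlb_folio_list_add(folio, &guestmem_folios);
	...
	hugetlb_folio_list_del(folio);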

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>

Change-Id: I42f8feda40cbd28e5fd02e54fa58145d847a220e
---
 include/linux/hugetlb.h |  2 ++
 mm/hugetlb.c            | 22 ++++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e6b90e72d46d..e432ccfe3e63 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -156,6 +156,8 @@ bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						vm_flags_t vm_flags);
 long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
 						long freed);
+void hugetlb_folio_list_add(struct folio *folio, struct list_head *list);
+void hugetlb_folio_list_del(struct folio *folio);
 bool folio_isolate_hugetlb(struct folio *folio, struct list_head *list);
 int get_hwpoison_hugetlb_folio(struct folio *folio, bool *hugetlb, bool unpoison);
 int get_huge_page_for_hwpoison(unsigned long pfn, int flags,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 816f257680be..6e326c09c505 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7473,6 +7473,28 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
 	return 0;
 }
 
+void hugetlb_folio_list_add(struct folio *folio, struct list_head *list)
+{
+	/*
+	 * hstate's hugepage_activelist is guarded by hugetlb_lock, hence hold
+	 * hugetlb_lock while modifying folio->lru.
+	 */
+	spin_lock_irq(&hugetlb_lock);
+	list_add(&folio->lru, list);
+	spin_unlock_irq(&hugetlb_lock);
+}
+
+void hugetlb_folio_list_del(struct folio *folio)
+{
+	/*
+	 * hstate's hugepage_activelist is guarded by hugetlb_lock, hence hold
+	 * hugetlb_lock while modifying folio->lru.
+	 */
+	spin_lock_irq(&hugetlb_lock);
+	list_del(&folio->lru);
+	spin_unlock_irq(&hugetlb_lock);
+}
+
 #ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
 static unsigned long page_table_shareable(struct vm_area_struct *svma,
 				struct vm_area_struct *vma,
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 35/51] mm: guestmem_hugetlb: Add support for splitting and merging pages
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (33 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 34/51] mm: hugetlb: Add functions to add/delete folio from hugetlb lists Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 36/51] mm: Convert split_folio() macro to function Ackerley Tng
                   ` (20 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

These functions allow guest_memfd to split and merge HugeTLB pages,
and to clean them up when the pages are freed.

For merging and splitting pages on conversion, guestmem_hugetlb
expects the refcount on the pages to already be 0. The caller must
ensure that.

For conversions, guest_memfd ensures that the refcounts are already 0
by checking that there are no unexpected refcounts, and then freezing
the expected refcounts away. On unexpected refcounts, guest_memfd will
return an error to userspace.

For truncation, on unexpected refcounts, guest_memfd will return an
error to userspace.

For truncation on closing, guest_memfd will just remove its own
refcounts (the filemap refcounts) and mark split pages with
PGTY_guestmem_hugetlb.

The presence of PGTY_guestmem_hugetlb will trigger the folio_put()
callback to handle further cleanup. This cleanup process will merge
pages (with refcount 0, since cleanup is triggered from folio_put())
before returning the pages to HugeTLB.

Because folio_put() could be called from atomic context and the merging
process is long, merging is deferred to a worker thread.
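
A minimal sketch of the caller-side refcount expectation described
above (assuming the caller holds only the filemap's references; the
helper name is illustrative, and the real guest_memfd logic appears in
a later patch):

  static int example_freeze_before_restructure(struct folio *folio)
  {
          int expected = folio_nr_pages(folio);  /* filemap's refcounts */

          /* Unexpected refcounts: caller reports an error to userspace. */
          if (!folio_ref_freeze(folio, expected))
                  return -EAGAIN;

          /* Refcount is now 0, safe to hand to split/merge. */
          return 0;
  }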

Change-Id: Ib04a3236f1e7250fd9af827630c334d40fb09d40
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
---
 include/linux/guestmem.h |   3 +
 mm/guestmem_hugetlb.c    | 349 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 347 insertions(+), 5 deletions(-)

diff --git a/include/linux/guestmem.h b/include/linux/guestmem.h
index 4b2d820274d9..3ee816d1dd34 100644
--- a/include/linux/guestmem.h
+++ b/include/linux/guestmem.h
@@ -8,6 +8,9 @@ struct guestmem_allocator_operations {
 	void *(*inode_setup)(size_t size, u64 flags);
 	void (*inode_teardown)(void *private, size_t inode_size);
 	struct folio *(*alloc_folio)(void *private);
+	int (*split_folio)(struct folio *folio);
+	void (*merge_folio)(struct folio *folio);
+	void (*free_folio)(struct folio *folio);
 	/*
 	 * Returns the number of PAGE_SIZE pages in a page that this guestmem
 	 * allocator provides.
diff --git a/mm/guestmem_hugetlb.c b/mm/guestmem_hugetlb.c
index ec5a188ca2a7..8727598cf18e 100644
--- a/mm/guestmem_hugetlb.c
+++ b/mm/guestmem_hugetlb.c
@@ -11,15 +11,12 @@
 #include <linux/mm.h>
 #include <linux/mm_types.h>
 #include <linux/pagemap.h>
+#include <linux/xarray.h>
 
 #include <uapi/linux/guestmem.h>
 
 #include "guestmem_hugetlb.h"
-
-void guestmem_hugetlb_handle_folio_put(struct folio *folio)
-{
-	WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
-}
+#include "hugetlb_vmemmap.h"
 
 struct guestmem_hugetlb_private {
 	struct hstate *h;
@@ -34,6 +31,339 @@ static size_t guestmem_hugetlb_nr_pages_in_folio(void *priv)
 	return pages_per_huge_page(private->h);
 }
 
+static DEFINE_XARRAY(guestmem_hugetlb_stash);
+
+struct guestmem_hugetlb_metadata {
+	void *_hugetlb_subpool;
+	void *_hugetlb_cgroup;
+	void *_hugetlb_hwpoison;
+	void *private;
+};
+
+struct guestmem_hugetlb_stash_item {
+	struct guestmem_hugetlb_metadata hugetlb_metadata;
+	/* hstate tracks the original size of this folio. */
+	struct hstate *h;
+	/* Count of split pages, individually freed, waiting to be merged. */
+	atomic_t nr_pages_waiting_to_be_merged;
+};
+
+struct workqueue_struct *guestmem_hugetlb_wq __ro_after_init;
+static struct work_struct guestmem_hugetlb_cleanup_work;
+static LLIST_HEAD(guestmem_hugetlb_cleanup_list);
+
+static inline void guestmem_hugetlb_register_folio_put_callback(struct folio *folio)
+{
+	__folio_set_guestmem_hugetlb(folio);
+}
+
+static inline void guestmem_hugetlb_unregister_folio_put_callback(struct folio *folio)
+{
+	__folio_clear_guestmem_hugetlb(folio);
+}
+
+static inline void guestmem_hugetlb_defer_cleanup(struct folio *folio)
+{
+	struct llist_node *node;
+
+	/*
+	 * Reuse the folio->mapping pointer as a struct llist_node, since
+	 * folio->mapping is NULL at this point.
+	 */
+	BUILD_BUG_ON(sizeof(folio->mapping) != sizeof(struct llist_node));
+	node = (struct llist_node *)&folio->mapping;
+
+	/*
+	 * Only queue work if the list was previously empty. Otherwise,
+	 * queue_work() has already been called but the workfn hasn't retrieved
+	 * the list yet.
+	 */
+	if (llist_add(node, &guestmem_hugetlb_cleanup_list))
+		queue_work(guestmem_hugetlb_wq, &guestmem_hugetlb_cleanup_work);
+}
+
+void guestmem_hugetlb_handle_folio_put(struct folio *folio)
+{
+	guestmem_hugetlb_unregister_folio_put_callback(folio);
+
+	/*
+	 * folio_put() can be called in interrupt context, hence do the work
+	 * outside of interrupt context
+	 */
+	guestmem_hugetlb_defer_cleanup(folio);
+}
+
+/*
+ * Stash existing hugetlb metadata. Use this function just before splitting a
+ * hugetlb page.
+ */
+static inline void
+__guestmem_hugetlb_stash_metadata(struct guestmem_hugetlb_metadata *metadata,
+				  struct folio *folio)
+{
+	/*
+	 * (folio->page + 1) doesn't have to be stashed since those fields are
+	 * known on split/reconstruct and will be reinitialized anyway.
+	 */
+
+	/*
+	 * subpool is created for every guest_memfd inode, but the folios will
+	 * outlive the inode, hence we store the subpool here.
+	 */
+	metadata->_hugetlb_subpool = folio->_hugetlb_subpool;
+	/*
+	 * _hugetlb_cgroup has to be stored for freeing
+	 * later. _hugetlb_cgroup_rsvd does not, since it is NULL for
+	 * guest_memfd folios anyway. guest_memfd reservations are handled in
+	 * the inode.
+	 */
+	metadata->_hugetlb_cgroup = folio->_hugetlb_cgroup;
+	metadata->_hugetlb_hwpoison = folio->_hugetlb_hwpoison;
+
+	/*
+	 * HugeTLB flags are stored in folio->private. stash so that ->private
+	 * can be used by core-mm.
+	 */
+	metadata->private = folio->private;
+}
+
+static int guestmem_hugetlb_stash_metadata(struct folio *folio)
+{
+	XA_STATE(xas, &guestmem_hugetlb_stash, 0);
+	struct guestmem_hugetlb_stash_item *stash;
+	void *entry;
+
+	stash = kzalloc(sizeof(*stash), GFP_KERNEL);
+	if (!stash)
+		return -ENOMEM;
+
+	stash->h = folio_hstate(folio);
+	__guestmem_hugetlb_stash_metadata(&stash->hugetlb_metadata, folio);
+
+	xas_set_order(&xas, folio_pfn(folio), folio_order(folio));
+
+	xas_lock(&xas);
+	entry = xas_store(&xas, stash);
+	xas_unlock(&xas);
+
+	if (xa_is_err(entry)) {
+		kfree(stash);
+		return xa_err(entry);
+	}
+
+	return 0;
+}
+
+static inline void
+__guestmem_hugetlb_unstash_metadata(struct guestmem_hugetlb_metadata *metadata,
+				    struct folio *folio)
+{
+	folio->_hugetlb_subpool = metadata->_hugetlb_subpool;
+	folio->_hugetlb_cgroup = metadata->_hugetlb_cgroup;
+	folio->_hugetlb_cgroup_rsvd = NULL;
+	folio->_hugetlb_hwpoison = metadata->_hugetlb_hwpoison;
+
+	folio_change_private(folio, metadata->private);
+}
+
+static int guestmem_hugetlb_unstash_free_metadata(struct folio *folio)
+{
+	struct guestmem_hugetlb_stash_item *stash;
+	unsigned long pfn;
+
+	pfn = folio_pfn(folio);
+
+	stash = xa_erase(&guestmem_hugetlb_stash, pfn);
+	__guestmem_hugetlb_unstash_metadata(&stash->hugetlb_metadata, folio);
+
+	kfree(stash);
+
+	return 0;
+}
+
+/**
+ * guestmem_hugetlb_split_folio() - Split a HugeTLB @folio to PAGE_SIZE pages.
+ *
+ * @folio: The folio to be split.
+ *
+ * Context: Before splitting, the folio must have a refcount of 0. After
+ *          splitting, each split folio has a refcount of 0.
+ * Return: 0 on success and negative error otherwise.
+ */
+static int guestmem_hugetlb_split_folio(struct folio *folio)
+{
+	long orig_nr_pages;
+	int ret;
+	int i;
+
+	if (folio_size(folio) == PAGE_SIZE)
+		return 0;
+
+	orig_nr_pages = folio_nr_pages(folio);
+	ret = guestmem_hugetlb_stash_metadata(folio);
+	if (ret)
+		return ret;
+
+	/*
+	 * hugetlb_vmemmap_restore_folio() has to be called ahead of the rest
+	 * because it checks and page type. This doesn't actually split the
+	 * because it checks the page type. This doesn't actually split the
+	 */
+	ret = hugetlb_vmemmap_restore_folio(folio_hstate(folio), folio);
+	if (ret)
+		goto err;
+
+	/*
+	 * Can clear without lock because this will not race with the folio
+	 * being mapped. folio's page type is overlaid with mapcount and so in
+	 * other cases it's necessary to take hugetlb_lock to prevent races with
+	 * mapcount increasing.
+	 */
+	__folio_clear_hugetlb(folio);
+
+	/*
+	 * Remove the first folio from h->hugepage_activelist since it is no
+	 * longer a HugeTLB page. The other split pages should not be on any
+	 * lists.
+	 */
+	hugetlb_folio_list_del(folio);
+
+	/* Actually split page by undoing prep_compound_page() */
+	__folio_clear_head(folio);
+
+#ifdef NR_PAGES_IN_LARGE_FOLIO
+	/*
+	 * Zero out _nr_pages, otherwise this overlaps with memcg_data,
+	 * resulting in lookups on false memcg_data.  _nr_pages doesn't have to
+	 * be set to 1 because folio_nr_pages() only reads _nr_pages when the
+	 * head flag is set, and returns 1 otherwise.
+	 */
+	folio->_nr_pages = 0;
+#endif
+
+	for (i = 1; i < orig_nr_pages; ++i) {
+		struct page *p = folio_page(folio, i);
+
+		/* Copy flags from the first page to split pages. */
+		p->flags = folio->flags;
+
+		p->mapping = NULL;
+		clear_compound_head(p);
+	}
+
+	return 0;
+
+err:
+	guestmem_hugetlb_unstash_free_metadata(folio);
+
+	return ret;
+}
+
+/**
+ * guestmem_hugetlb_merge_folio() - Merge a HugeTLB folio from the folios
+ * beginning at @first_folio.
+ *
+ * @first_folio: the first folio in a contiguous block of folios to be merged.
+ *
+ * The size of the contiguous block is tracked in guestmem_hugetlb_stash.
+ *
+ * Context: The first folio is checked to have a refcount of 0 before
+ *          reconstruction. After reconstruction, the reconstructed folio has a
+ *          refcount of 0.
+ */
+static void guestmem_hugetlb_merge_folio(struct folio *first_folio)
+{
+	struct guestmem_hugetlb_stash_item *stash;
+	struct hstate *h;
+
+	stash = xa_load(&guestmem_hugetlb_stash, folio_pfn(first_folio));
+	h = stash->h;
+
+	/*
+	 * This is the step that does the merge. prep_compound_page() will write
+	 * to pages 1 and 2 as well, so guestmem_hugetlb_unstash_free_metadata()
+	 * has to come after this.
+	 */
+	prep_compound_page(&first_folio->page, huge_page_order(h));
+
+	WARN_ON(guestmem_hugetlb_unstash_free_metadata(first_folio));
+
+	/*
+	 * prep_compound_page() will set up mapping on tail pages. For
+	 * completeness, clear mapping on head page.
+	 */
+	first_folio->mapping = NULL;
+
+	__folio_set_hugetlb(first_folio);
+
+	hugetlb_folio_list_add(first_folio, &h->hugepage_activelist);
+
+	hugetlb_vmemmap_optimize_folio(h, first_folio);
+}
+
+static struct folio *guestmem_hugetlb_maybe_merge_folio(struct folio *folio)
+{
+	struct guestmem_hugetlb_stash_item *stash;
+	unsigned long first_folio_pfn;
+	struct folio *first_folio;
+	unsigned long pfn;
+	size_t nr_pages;
+
+	pfn = folio_pfn(folio);
+
+	stash = xa_load(&guestmem_hugetlb_stash, pfn);
+	nr_pages = pages_per_huge_page(stash->h);
+	if (atomic_inc_return(&stash->nr_pages_waiting_to_be_merged) < nr_pages)
+		return NULL;
+
+	first_folio_pfn = round_down(pfn, nr_pages);
+	first_folio = pfn_folio(first_folio_pfn);
+
+	guestmem_hugetlb_merge_folio(first_folio);
+
+	return first_folio;
+}
+
+static void guestmem_hugetlb_cleanup_folio(struct folio *folio)
+{
+	struct folio *merged_folio;
+
+	merged_folio = guestmem_hugetlb_maybe_merge_folio(folio);
+	if (merged_folio)
+		__folio_put(merged_folio);
+}
+
+static void guestmem_hugetlb_cleanup_workfn(struct work_struct *work)
+{
+	struct llist_node *node;
+
+	node = llist_del_all(&guestmem_hugetlb_cleanup_list);
+	while (node) {
+		struct folio *folio;
+
+		folio = container_of((struct address_space **)node,
+				     struct folio, mapping);
+
+		node = node->next;
+		folio->mapping = NULL;
+
+		guestmem_hugetlb_cleanup_folio(folio);
+	}
+}
+
+static int __init guestmem_hugetlb_init(void)
+{
+	INIT_WORK(&guestmem_hugetlb_cleanup_work, guestmem_hugetlb_cleanup_workfn);
+
+	guestmem_hugetlb_wq = alloc_workqueue("guestmem_hugetlb",
+					      WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
+	if (!guestmem_hugetlb_wq)
+		return -ENOMEM;
+
+	return 0;
+}
+subsys_initcall(guestmem_hugetlb_init);
+
 static void *guestmem_hugetlb_setup(size_t size, u64 flags)
 
 {
@@ -164,10 +494,19 @@ static struct folio *guestmem_hugetlb_alloc_folio(void *priv)
 	return ERR_PTR(-ENOMEM);
 }
 
+static void guestmem_hugetlb_free_folio(struct folio *folio)
+{
+	if (xa_load(&guestmem_hugetlb_stash, folio_pfn(folio)))
+		guestmem_hugetlb_register_folio_put_callback(folio);
+}
+
 const struct guestmem_allocator_operations guestmem_hugetlb_ops = {
 	.inode_setup = guestmem_hugetlb_setup,
 	.inode_teardown = guestmem_hugetlb_teardown,
 	.alloc_folio = guestmem_hugetlb_alloc_folio,
+	.split_folio = guestmem_hugetlb_split_folio,
+	.merge_folio = guestmem_hugetlb_merge_folio,
+	.free_folio = guestmem_hugetlb_free_folio,
 	.nr_pages_in_folio = guestmem_hugetlb_nr_pages_in_folio,
 };
 EXPORT_SYMBOL_GPL(guestmem_hugetlb_ops);
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 36/51] mm: Convert split_folio() macro to function
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (34 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 35/51] mm: guestmem_hugetlb: Add support for splitting and merging pages Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-21 16:40   ` Edgecombe, Rick P
  2025-05-14 23:42 ` [RFC PATCH v2 37/51] filemap: Pass address_space mapping to ->free_folio() Ackerley Tng
                   ` (19 subsequent siblings)
  55 siblings, 1 reply; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

This prevents the macro from clobbering other functions and function
calls that are also named split_folio(), such as indirect calls through
a struct member (see the illustration below).
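
For illustration, a minimal (hypothetical) example of the clash a
function-like macro causes; the member call mirrors the allocator ops
used elsewhere in this series:

  #define split_folio(f) split_folio_to_list(f, NULL)

  struct example_ops {
          int (*split_folio)(struct folio *folio);
  };

  static int example(struct example_ops *ops, struct folio *folio)
  {
          /* Expands to ops->split_folio_to_list(folio, NULL): build error. */
          return ops->split_folio(folio);
  }

With split_folio() as a static inline function, the member call above
is left alone by the preprocessor.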

Change-Id: I88a66bd876731b272282a42468c3bf8ac008b7cc
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/linux/huge_mm.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e893d546a49f..f392ff49a816 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -99,7 +99,11 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
 #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
 	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
 
-#define split_folio(f) split_folio_to_list(f, NULL)
+int split_folio_to_list(struct folio *folio, struct list_head *list);
+static inline int split_folio(struct folio *folio)
+{
+	return split_folio_to_list(folio, NULL);
+}
 
 #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 #define HPAGE_PMD_SHIFT PMD_SHIFT
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 37/51] filemap: Pass address_space mapping to ->free_folio()
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (35 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 36/51] mm: Convert split_folio() macro to function Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 38/51] KVM: guest_memfd: Split allocator pages for guest_memfd use Ackerley Tng
                   ` (18 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li, Mike Day

From: Elliot Berman <quic_eberman@quicinc.com>

The plan is to be able to support multiple allocators for guest_memfd
folios. To allow each allocator to handle release of a folio from a
guest_memfd filemap, ->free_folio() needs to retrieve allocator
information that is stored on the guest_memfd inode.

->free_folio() shouldn't assume that folio->mapping is still set or
valid, but the mapping is well known to callers of ->free_folio().
Hence, pass the address_space mapping to ->free_folio() so that the
callback can retrieve any information it needs.
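
As a hedged sketch of what the new signature enables (the per-inode
lookup and helper below are illustrative; guest_memfd's actual usage
appears later in this series):

  static void example_free_folio(struct address_space *mapping,
                                 struct folio *folio)
  {
          /*
           * folio->mapping may already be NULL here, so use the mapping
           * argument to reach state stored on the owning inode.
           */
          struct example_inode_private *priv = mapping->host->i_private;

          example_allocator_cleanup(priv, folio);
  }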

Link: https://lore.kernel.org/all/15f665b4-2d33-41ca-ac50-fafe24ade32f@redhat.com/
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Change-Id: I8bac907832a0b2491fa403a6ab72fcef1b4713ee
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Tested-by: Mike Day <michael.day@amd.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 Documentation/filesystems/locking.rst |  2 +-
 Documentation/filesystems/vfs.rst     | 15 +++++++++------
 fs/nfs/dir.c                          |  9 +++++++--
 fs/orangefs/inode.c                   |  3 ++-
 include/linux/fs.h                    |  2 +-
 mm/filemap.c                          |  9 +++++----
 mm/secretmem.c                        |  3 ++-
 mm/vmscan.c                           |  4 ++--
 virt/kvm/guest_memfd.c                |  3 ++-
 9 files changed, 31 insertions(+), 19 deletions(-)

diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index 0ec0bb6eb0fb..c3d7430481ae 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -263,7 +263,7 @@ prototypes::
 	sector_t (*bmap)(struct address_space *, sector_t);
 	void (*invalidate_folio) (struct folio *, size_t start, size_t len);
 	bool (*release_folio)(struct folio *, gfp_t);
-	void (*free_folio)(struct folio *);
+	void (*free_folio)(struct address_space *, struct folio *);
 	int (*direct_IO)(struct kiocb *, struct iov_iter *iter);
 	int (*migrate_folio)(struct address_space *, struct folio *dst,
 			struct folio *src, enum migrate_mode);
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index ae79c30b6c0c..bba1ac848f96 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -833,7 +833,7 @@ cache in your filesystem.  The following members are defined:
 		sector_t (*bmap)(struct address_space *, sector_t);
 		void (*invalidate_folio) (struct folio *, size_t start, size_t len);
 		bool (*release_folio)(struct folio *, gfp_t);
-		void (*free_folio)(struct folio *);
+		void (*free_folio)(struct address_space *, struct folio *);
 		ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
 		int (*migrate_folio)(struct mapping *, struct folio *dst,
 				struct folio *src, enum migrate_mode);
@@ -1011,11 +1011,14 @@ cache in your filesystem.  The following members are defined:
 	clear the uptodate flag if it cannot free private data yet.
 
 ``free_folio``
-	free_folio is called once the folio is no longer visible in the
-	page cache in order to allow the cleanup of any private data.
-	Since it may be called by the memory reclaimer, it should not
-	assume that the original address_space mapping still exists, and
-	it should not block.
+	free_folio is called once the folio is no longer visible in
+	the page cache in order to allow the cleanup of any private
+	data.  Since it may be called by the memory reclaimer, it
+	should not assume that the original address_space mapping
+	still exists at folio->mapping.  Instead, the mapping that the
+	folio used to belong to is passed in so that free_folio can
+	read any information it needs from it.  free_folio should
+	not block.
 
 ``direct_IO``
 	called by the generic read/write routines to perform direct_IO -
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index bd23fc736b39..148433f6d9d4 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -55,7 +55,7 @@ static int nfs_closedir(struct inode *, struct file *);
 static int nfs_readdir(struct file *, struct dir_context *);
 static int nfs_fsync_dir(struct file *, loff_t, loff_t, int);
 static loff_t nfs_llseek_dir(struct file *, loff_t, int);
-static void nfs_readdir_clear_array(struct folio *);
+static void nfs_free_folio(struct address_space *, struct folio *);
 static int nfs_do_create(struct inode *dir, struct dentry *dentry,
 			 umode_t mode, int open_flags);
 
@@ -69,7 +69,7 @@ const struct file_operations nfs_dir_operations = {
 };
 
 const struct address_space_operations nfs_dir_aops = {
-	.free_folio = nfs_readdir_clear_array,
+	.free_folio = nfs_free_folio,
 };
 
 #define NFS_INIT_DTSIZE PAGE_SIZE
@@ -230,6 +230,11 @@ static void nfs_readdir_clear_array(struct folio *folio)
 	kunmap_local(array);
 }
 
+static void nfs_free_folio(struct address_space *mapping, struct folio *folio)
+{
+	nfs_readdir_clear_array(folio);
+}
+
 static void nfs_readdir_folio_reinit_array(struct folio *folio, u64 last_cookie,
 					   u64 change_attr)
 {
diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index 5ac743c6bc2e..884cc5295f3e 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -449,7 +449,8 @@ static bool orangefs_release_folio(struct folio *folio, gfp_t foo)
 	return !folio_test_private(folio);
 }
 
-static void orangefs_free_folio(struct folio *folio)
+static void orangefs_free_folio(struct address_space *mapping,
+				struct folio *folio)
 {
 	kfree(folio_detach_private(folio));
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0fded2e3c661..9862ea92a2af 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -455,7 +455,7 @@ struct address_space_operations {
 	sector_t (*bmap)(struct address_space *, sector_t);
 	void (*invalidate_folio) (struct folio *, size_t offset, size_t len);
 	bool (*release_folio)(struct folio *, gfp_t);
-	void (*free_folio)(struct folio *folio);
+	void (*free_folio)(struct address_space *mapping, struct folio *folio);
 	ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
 	/*
 	 * migrate the contents of a folio to the specified target. If
diff --git a/mm/filemap.c b/mm/filemap.c
index bed7160db214..a02c3d8e00e8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -226,11 +226,11 @@ void __filemap_remove_folio(struct folio *folio, void *shadow)
 
 void filemap_free_folio(struct address_space *mapping, struct folio *folio)
 {
-	void (*free_folio)(struct folio *);
+	void (*free_folio)(struct address_space*, struct folio *);
 
 	free_folio = mapping->a_ops->free_folio;
 	if (free_folio)
-		free_folio(folio);
+		free_folio(mapping, folio);
 
 	folio_put_refs(folio, folio_nr_pages(folio));
 }
@@ -820,7 +820,8 @@ EXPORT_SYMBOL(file_write_and_wait_range);
 void replace_page_cache_folio(struct folio *old, struct folio *new)
 {
 	struct address_space *mapping = old->mapping;
-	void (*free_folio)(struct folio *) = mapping->a_ops->free_folio;
+	void (*free_folio)(struct address_space *, struct folio *) =
+		mapping->a_ops->free_folio;
 	pgoff_t offset = old->index;
 	XA_STATE(xas, &mapping->i_pages, offset);
 
@@ -849,7 +850,7 @@ void replace_page_cache_folio(struct folio *old, struct folio *new)
 		__lruvec_stat_add_folio(new, NR_SHMEM);
 	xas_unlock_irq(&xas);
 	if (free_folio)
-		free_folio(old);
+		free_folio(mapping, old);
 	folio_put(old);
 }
 EXPORT_SYMBOL_GPL(replace_page_cache_folio);
diff --git a/mm/secretmem.c b/mm/secretmem.c
index c0e459e58cb6..178507c1b900 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -152,7 +152,8 @@ static int secretmem_migrate_folio(struct address_space *mapping,
 	return -EBUSY;
 }
 
-static void secretmem_free_folio(struct folio *folio)
+static void secretmem_free_folio(struct address_space *mapping,
+				 struct folio *folio)
 {
 	set_direct_map_default_noflush(&folio->page);
 	folio_zero_segment(folio, 0, folio_size(folio));
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3783e45bfc92..b8add4d0cf18 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -788,7 +788,7 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
 		xa_unlock_irq(&mapping->i_pages);
 		put_swap_folio(folio, swap);
 	} else {
-		void (*free_folio)(struct folio *);
+		void (*free_folio)(struct address_space *, struct folio *);
 
 		free_folio = mapping->a_ops->free_folio;
 		/*
@@ -817,7 +817,7 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
 		spin_unlock(&mapping->host->i_lock);
 
 		if (free_folio)
-			free_folio(folio);
+			free_folio(mapping, folio);
 	}
 
 	return 1;
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 24d270b9b725..c578d0ebe314 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -1319,7 +1319,8 @@ static void kvm_gmem_invalidate(struct folio *folio)
 static inline void kvm_gmem_invalidate(struct folio *folio) {}
 #endif
 
-static void kvm_gmem_free_folio(struct folio *folio)
+static void kvm_gmem_free_folio(struct address_space *mapping,
+				struct folio *folio)
 {
 	folio_clear_unevictable(folio);
 
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 38/51] KVM: guest_memfd: Split allocator pages for guest_memfd use
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (36 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 37/51] filemap: Pass address_space mapping to ->free_folio() Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-22 22:19   ` Edgecombe, Rick P
                     ` (3 more replies)
  2025-05-14 23:42 ` [RFC PATCH v2 39/51] KVM: guest_memfd: Merge and truncate on fallocate(PUNCH_HOLE) Ackerley Tng
                   ` (17 subsequent siblings)
  55 siblings, 4 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

With this patch, newly allocated pages are split into regular 4K
(PAGE_SIZE) pages before being provided to the requester (fallocate()
or KVM).

During a private to shared conversion, folios are split if not already
split.

During a shared to private conversion, folios are merged if not
already merged.

When the folios are removed from the filemap on truncation, the
allocator is given a chance to do any necessary prep for when the
folio is freed.

When a conversion is requested on a subfolio within a hugepage range,
faulting must be prevented on the whole hugepage range for
correctness.

See related discussion at
https://lore.kernel.org/all/Z__AAB_EFxGFEjDR@google.com/T/
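
For example, assuming 2M HugeTLB pages (512 PAGE_SIZE pages per folio),
an invalidation for a conversion that touches only part of a huge page
is widened to cover the whole huge page, matching the rounding done in
this patch:

  pgoff_t start = 700, end = 705;    /* requested range, in 4K pages */
  size_t nr_per_huge_page = 512;     /* 2M / 4K */

  pgoff_t inv_start = round_down(start, nr_per_huge_page);  /* 512 */
  pgoff_t inv_end = round_up(end, nr_per_huge_page);        /* 1024 */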

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Change-Id: Ib5ee22e3dae034c529773048a626ad98d4b10af3
---
 mm/filemap.c           |   2 +
 virt/kvm/guest_memfd.c | 501 +++++++++++++++++++++++++++++++++++++++--
 2 files changed, 483 insertions(+), 20 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index a02c3d8e00e8..a052f8e0c41e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -223,6 +223,7 @@ void __filemap_remove_folio(struct folio *folio, void *shadow)
 	filemap_unaccount_folio(mapping, folio);
 	page_cache_delete(mapping, folio, shadow);
 }
+EXPORT_SYMBOL_GPL(__filemap_remove_folio);
 
 void filemap_free_folio(struct address_space *mapping, struct folio *folio)
 {
@@ -258,6 +259,7 @@ void filemap_remove_folio(struct folio *folio)
 
 	filemap_free_folio(mapping, folio);
 }
+EXPORT_SYMBOL_GPL(filemap_remove_folio);
 
 /*
  * page_cache_delete_batch - delete several folios from page cache
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index c578d0ebe314..cb426c1dfef8 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -41,6 +41,11 @@ static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
 				      pgoff_t end);
 static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
 				    pgoff_t end);
+static int __kvm_gmem_filemap_add_folio(struct address_space *mapping,
+					struct folio *folio, pgoff_t index);
+static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
+						pgoff_t start, size_t nr_pages,
+						bool is_split_operation);
 
 static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
 {
@@ -126,6 +131,31 @@ static enum shareability kvm_gmem_shareability_get(struct inode *inode,
 	return xa_to_value(entry);
 }
 
+static bool kvm_gmem_shareability_in_range(struct inode *inode, pgoff_t start,
+					    size_t nr_pages, enum shareability m)
+{
+	struct maple_tree *mt;
+	pgoff_t last;
+	void *entry;
+
+	mt = &kvm_gmem_private(inode)->shareability;
+
+	last = start + nr_pages - 1;
+	mt_for_each(mt, entry, start, last) {
+		if (xa_to_value(entry) == m)
+			return true;
+	}
+
+	return false;
+}
+
+static inline bool kvm_gmem_has_some_shared(struct inode *inode, pgoff_t start,
+					    size_t nr_pages)
+{
+	return kvm_gmem_shareability_in_range(inode, start, nr_pages,
+					     SHAREABILITY_ALL);
+}
+
 static struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
 {
 	if (kvm_gmem_shareability_get(inode, index) != SHAREABILITY_ALL)
@@ -241,6 +271,105 @@ static bool kvm_gmem_has_safe_refcount(struct address_space *mapping, pgoff_t st
 	return refcount_safe;
 }
 
+static void kvm_gmem_unmap_private(struct kvm_gmem *gmem, pgoff_t start,
+				   pgoff_t end)
+{
+	struct kvm_memory_slot *slot;
+	struct kvm *kvm = gmem->kvm;
+	unsigned long index;
+	bool locked = false;
+	bool flush = false;
+
+	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
+		pgoff_t pgoff = slot->gmem.pgoff;
+
+		struct kvm_gfn_range gfn_range = {
+			.start = slot->base_gfn + max(pgoff, start) - pgoff,
+			.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
+			.slot = slot,
+			.may_block = true,
+			/* This function is only concerned with private mappings. */
+			.attr_filter = KVM_FILTER_PRIVATE,
+		};
+
+		if (!locked) {
+			KVM_MMU_LOCK(kvm);
+			locked = true;
+		}
+
+		flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
+	}
+
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
+
+	if (locked)
+		KVM_MMU_UNLOCK(kvm);
+}
+
+static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
+				      pgoff_t end)
+{
+	struct kvm_memory_slot *slot;
+	struct kvm *kvm = gmem->kvm;
+	unsigned long index;
+	bool found_memslot;
+
+	found_memslot = false;
+	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
+		gfn_t gfn_start;
+		gfn_t gfn_end;
+		pgoff_t pgoff;
+
+		pgoff = slot->gmem.pgoff;
+
+		gfn_start = slot->base_gfn + max(pgoff, start) - pgoff;
+		gfn_end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff;
+
+		if (!found_memslot) {
+			found_memslot = true;
+
+			KVM_MMU_LOCK(kvm);
+			kvm_mmu_invalidate_begin(kvm);
+		}
+
+		kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
+	}
+
+	if (found_memslot)
+		KVM_MMU_UNLOCK(kvm);
+}
+
+static pgoff_t kvm_gmem_compute_invalidate_bound(struct inode *inode,
+						 pgoff_t bound, bool start)
+{
+	size_t nr_pages;
+	void *priv;
+
+	if (!kvm_gmem_has_custom_allocator(inode))
+		return bound;
+
+	priv = kvm_gmem_allocator_private(inode);
+	nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
+
+	if (start)
+		return round_down(bound, nr_pages);
+	else
+		return round_up(bound, nr_pages);
+}
+
+static pgoff_t kvm_gmem_compute_invalidate_start(struct inode *inode,
+						 pgoff_t bound)
+{
+	return kvm_gmem_compute_invalidate_bound(inode, bound, true);
+}
+
+static pgoff_t kvm_gmem_compute_invalidate_end(struct inode *inode,
+					       pgoff_t bound)
+{
+	return kvm_gmem_compute_invalidate_bound(inode, bound, false);
+}
+
 static int kvm_gmem_shareability_apply(struct inode *inode,
 				       struct conversion_work *work,
 				       enum shareability m)
@@ -299,35 +428,53 @@ static void kvm_gmem_convert_invalidate_begin(struct inode *inode,
 					      struct conversion_work *work)
 {
 	struct list_head *gmem_list;
+	pgoff_t invalidate_start;
+	pgoff_t invalidate_end;
 	struct kvm_gmem *gmem;
-	pgoff_t end;
+	pgoff_t work_end;
 
-	end = work->start + work->nr_pages;
+	work_end = work->start + work->nr_pages;
+	invalidate_start = kvm_gmem_compute_invalidate_start(inode, work->start);
+	invalidate_end = kvm_gmem_compute_invalidate_end(inode, work_end);
 
 	gmem_list = &inode->i_mapping->i_private_list;
 	list_for_each_entry(gmem, gmem_list, entry)
-		kvm_gmem_invalidate_begin(gmem, work->start, end);
+		kvm_gmem_invalidate_begin(gmem, invalidate_start, invalidate_end);
 }
 
 static void kvm_gmem_convert_invalidate_end(struct inode *inode,
 					    struct conversion_work *work)
 {
 	struct list_head *gmem_list;
+	pgoff_t invalidate_start;
+	pgoff_t invalidate_end;
 	struct kvm_gmem *gmem;
-	pgoff_t end;
+	pgoff_t work_end;
 
-	end = work->start + work->nr_pages;
+	work_end = work->start + work->nr_pages;
+	invalidate_start = kvm_gmem_compute_invalidate_start(inode, work->start);
+	invalidate_end = kvm_gmem_compute_invalidate_end(inode, work_end);
 
 	gmem_list = &inode->i_mapping->i_private_list;
 	list_for_each_entry(gmem, gmem_list, entry)
-		kvm_gmem_invalidate_end(gmem, work->start, end);
+		kvm_gmem_invalidate_end(gmem, invalidate_start, invalidate_end);
 }
 
 static int kvm_gmem_convert_should_proceed(struct inode *inode,
 					   struct conversion_work *work,
 					   bool to_shared, pgoff_t *error_index)
 {
-	if (!to_shared) {
+	if (to_shared) {
+		struct list_head *gmem_list;
+		struct kvm_gmem *gmem;
+		pgoff_t work_end;
+
+		work_end = work->start + work->nr_pages;
+
+		gmem_list = &inode->i_mapping->i_private_list;
+		list_for_each_entry(gmem, gmem_list, entry)
+			kvm_gmem_unmap_private(gmem, work->start, work_end);
+	} else {
 		unmap_mapping_pages(inode->i_mapping, work->start,
 				    work->nr_pages, false);
 
@@ -340,6 +487,27 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
 	return 0;
 }
 
+static int kvm_gmem_convert_execute_work(struct inode *inode,
+					 struct conversion_work *work,
+					 bool to_shared)
+{
+	enum shareability m;
+	int ret;
+
+	m = to_shared ? SHAREABILITY_ALL : SHAREABILITY_GUEST;
+	ret = kvm_gmem_shareability_apply(inode, work, m);
+	if (ret)
+		return ret;
+	/*
+	 * Apply shareability first so split/merge can operate on new
+	 * shareability state.
+	 */
+	ret = kvm_gmem_restructure_folios_in_range(
+		inode, work->start, work->nr_pages, to_shared);
+
+	return ret;
+}
+
 static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
 				  size_t nr_pages, bool shared,
 				  pgoff_t *error_index)
@@ -371,18 +539,21 @@ static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
 
 	list_for_each_entry(work, &work_list, list) {
 		rollback_stop_item = work;
-		ret = kvm_gmem_shareability_apply(inode, work, m);
+
+		ret = kvm_gmem_convert_execute_work(inode, work, shared);
 		if (ret)
 			break;
 	}
 
 	if (ret) {
-		m = shared ? SHAREABILITY_GUEST : SHAREABILITY_ALL;
 		list_for_each_entry(work, &work_list, list) {
+			int r;
+
+			r = kvm_gmem_convert_execute_work(inode, work, !shared);
+			WARN_ON(r);
+
 			if (work == rollback_stop_item)
 				break;
-
-			WARN_ON(kvm_gmem_shareability_apply(inode, work, m));
 		}
 	}
 
@@ -434,6 +605,277 @@ static int kvm_gmem_ioctl_convert_range(struct file *file,
 	return ret;
 }
 
+#ifdef CONFIG_KVM_GMEM_HUGETLB
+
+static inline void __filemap_remove_folio_for_restructuring(struct folio *folio)
+{
+	struct address_space *mapping = folio->mapping;
+
+	spin_lock(&mapping->host->i_lock);
+	xa_lock_irq(&mapping->i_pages);
+
+	__filemap_remove_folio(folio, NULL);
+
+	xa_unlock_irq(&mapping->i_pages);
+	spin_unlock(&mapping->host->i_lock);
+}
+
+/**
+ * filemap_remove_folio_for_restructuring() - Remove @folio from filemap for
+ * split/merge.
+ *
+ * @folio: the folio to be removed.
+ *
+ * Similar to filemap_remove_folio(), but skips LRU-related calls (meaningless
+ * for guest_memfd), and skips the call to ->free_folio() to maintain folio flags.
+ *
+ * Context: Expects only the filemap's refcounts to be left on the folio. Will
+ *          freeze these refcounts away so that no other users will interfere
+ *          with restructuring.
+ */
+static inline void filemap_remove_folio_for_restructuring(struct folio *folio)
+{
+	int filemap_refcount;
+
+	filemap_refcount = folio_nr_pages(folio);
+	while (!folio_ref_freeze(folio, filemap_refcount)) {
+		/*
+		 * At this point only filemap refcounts are expected, hence okay
+		 * to spin until speculative refcounts go away.
+		 */
+		WARN_ONCE(1, "Spinning on folio=%p refcount=%d", folio, folio_ref_count(folio));
+	}
+
+	folio_lock(folio);
+	__filemap_remove_folio_for_restructuring(folio);
+	folio_unlock(folio);
+}
+
+/**
+ * kvm_gmem_split_folio_in_filemap() - Split @folio within filemap in @inode.
+ *
+ * @inode: inode containing the folio.
+ * @folio: folio to be split.
+ *
+ * Split a folio into folios of size PAGE_SIZE. Will clean up folio from filemap
+ * and add back the split folios.
+ *
+ * Context: Expects that before this call, folio's refcount is just the
+ *          filemap's refcounts. After this function returns, the split folios'
+ *          refcounts will also be filemap's refcounts.
+ * Return: 0 on success or negative error otherwise.
+ */
+static int kvm_gmem_split_folio_in_filemap(struct inode *inode, struct folio *folio)
+{
+	size_t orig_nr_pages;
+	pgoff_t orig_index;
+	size_t i, j;
+	int ret;
+
+	orig_nr_pages = folio_nr_pages(folio);
+	if (orig_nr_pages == 1)
+		return 0;
+
+	orig_index = folio->index;
+
+	filemap_remove_folio_for_restructuring(folio);
+
+	ret = kvm_gmem_allocator_ops(inode)->split_folio(folio);
+	if (ret)
+		goto err;
+
+	for (i = 0; i < orig_nr_pages; ++i) {
+		struct folio *f = page_folio(folio_page(folio, i));
+
+		ret = __kvm_gmem_filemap_add_folio(inode->i_mapping, f,
+						   orig_index + i);
+		if (ret)
+			goto rollback;
+	}
+
+	return ret;
+
+rollback:
+	for (j = 0; j < i; ++j) {
+		struct folio *f = page_folio(folio_page(folio, j));
+
+		filemap_remove_folio_for_restructuring(f);
+	}
+
+	kvm_gmem_allocator_ops(inode)->merge_folio(folio);
+err:
+	WARN_ON(__kvm_gmem_filemap_add_folio(inode->i_mapping, folio, orig_index));
+
+	return ret;
+}
+
+static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
+						      struct folio *folio)
+{
+	size_t to_nr_pages;
+	void *priv;
+
+	if (!kvm_gmem_has_custom_allocator(inode))
+		return 0;
+
+	priv = kvm_gmem_allocator_private(inode);
+	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
+
+	if (kvm_gmem_has_some_shared(inode, folio->index, to_nr_pages))
+		return kvm_gmem_split_folio_in_filemap(inode, folio);
+
+	return 0;
+}
+
+/**
+ * kvm_gmem_merge_folio_in_filemap() - Merge @first_folio within filemap in
+ * @inode.
+ *
+ * @inode: inode containing the folio.
+ * @first_folio: first folio among folios to be merged.
+ *
+ * Will clean up subfolios from filemap and add back the merged folio.
+ *
+ * Context: Expects that before this call, all subfolios only have filemap
+ *          refcounts. After this function returns, the merged folio will only
+ *          have filemap refcounts.
+ * Return: 0 on success or negative error otherwise.
+ */
+static int kvm_gmem_merge_folio_in_filemap(struct inode *inode,
+					   struct folio *first_folio)
+{
+	size_t to_nr_pages;
+	pgoff_t index;
+	void *priv;
+	size_t i;
+	int ret;
+
+	index = first_folio->index;
+
+	priv = kvm_gmem_allocator_private(inode);
+	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
+	if (folio_nr_pages(first_folio) == to_nr_pages)
+		return 0;
+
+	for (i = 0; i < to_nr_pages; ++i) {
+		struct folio *f = page_folio(folio_page(first_folio, i));
+
+		filemap_remove_folio_for_restructuring(f);
+	}
+
+	kvm_gmem_allocator_ops(inode)->merge_folio(first_folio);
+
+	ret = __kvm_gmem_filemap_add_folio(inode->i_mapping, first_folio, index);
+	if (ret)
+		goto err_split;
+
+	return ret;
+
+err_split:
+	WARN_ON(kvm_gmem_allocator_ops(inode)->split_folio(first_folio));
+	for (i = 0; i < to_nr_pages; ++i) {
+		struct folio *f = page_folio(folio_page(first_folio, i));
+
+		WARN_ON(__kvm_gmem_filemap_add_folio(inode->i_mapping, f, index + i));
+	}
+
+	return ret;
+}
+
+static inline int kvm_gmem_try_merge_folio_in_filemap(struct inode *inode,
+						      struct folio *first_folio)
+{
+	size_t to_nr_pages;
+	void *priv;
+
+	priv = kvm_gmem_allocator_private(inode);
+	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
+
+	if (kvm_gmem_has_some_shared(inode, first_folio->index, to_nr_pages))
+		return 0;
+
+	return kvm_gmem_merge_folio_in_filemap(inode, first_folio);
+}
+
+static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
+						pgoff_t start, size_t nr_pages,
+						bool is_split_operation)
+{
+	size_t to_nr_pages;
+	pgoff_t index;
+	pgoff_t end;
+	void *priv;
+	int ret = 0;
+
+	if (!kvm_gmem_has_custom_allocator(inode))
+		return 0;
+
+	end = start + nr_pages;
+
+	/* Round to allocator page size, to check all (huge) pages in range. */
+	priv = kvm_gmem_allocator_private(inode);
+	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
+
+	start = round_down(start, to_nr_pages);
+	end = round_up(end, to_nr_pages);
+
+	for (index = start; index < end; index += to_nr_pages) {
+		struct folio *f;
+
+		f = filemap_get_folio(inode->i_mapping, index);
+		if (IS_ERR(f))
+			continue;
+
+		/* Leave just filemap's refcounts on the folio. */
+		folio_put(f);
+
+		if (is_split_operation)
+			ret = kvm_gmem_split_folio_in_filemap(inode, f);
+		else
+			ret = kvm_gmem_try_merge_folio_in_filemap(inode, f);
+
+		if (ret)
+			goto rollback;
+	}
+	return ret;
+
+rollback:
+	for (index -= to_nr_pages; index >= start; index -= to_nr_pages) {
+		struct folio *f;
+
+		f = filemap_get_folio(inode->i_mapping, index);
+		if (IS_ERR(f))
+			continue;
+
+		/* Leave just filemap's refcounts on the folio. */
+		folio_put(f);
+
+		if (is_split_operation)
+			WARN_ON(kvm_gmem_merge_folio_in_filemap(inode, f));
+		else
+			WARN_ON(kvm_gmem_split_folio_in_filemap(inode, f));
+	}
+
+	return ret;
+}
+
+#else
+
+static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
+						      struct folio *folio)
+{
+	return 0;
+}
+
+static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
+						pgoff_t start, size_t nr_pages,
+						bool is_split_operation)
+{
+	return 0;
+}
+
+#endif
+
 #else
 
 static int kvm_gmem_shareability_setup(struct maple_tree *mt, loff_t size, u64 flags)
@@ -563,11 +1005,16 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 		return folio;
 
 	if (kvm_gmem_has_custom_allocator(inode)) {
-		void *p = kvm_gmem_allocator_private(inode);
+		size_t nr_pages;
+		void *p;
 
+		p = kvm_gmem_allocator_private(inode);
 		folio = kvm_gmem_allocator_ops(inode)->alloc_folio(p);
 		if (IS_ERR(folio))
 			return folio;
+
+		nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(p);
+		index_floor = round_down(index, nr_pages);
 	} else {
 		gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
 
@@ -580,10 +1027,11 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 			folio_put(folio);
 			return ERR_PTR(ret);
 		}
+
+		index_floor = index;
 	}
 	allocated_size = folio_size(folio);
 
-	index_floor = round_down(index, folio_nr_pages(folio));
 	ret = kvm_gmem_filemap_add_folio(inode->i_mapping, folio, index_floor);
 	if (ret) {
 		folio_put(folio);
@@ -600,6 +1048,13 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 		return ERR_PTR(ret);
 	}
 
+	/* Leave just filemap's refcounts on folio. */
+	folio_put(folio);
+
+	ret = kvm_gmem_try_split_folio_in_filemap(inode, folio);
+	if (ret)
+		goto err;
+
 	spin_lock(&inode->i_lock);
 	inode->i_blocks += allocated_size / 512;
 	spin_unlock(&inode->i_lock);
@@ -608,14 +1063,17 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 	 * folio is the one that is allocated, this gets the folio at the
 	 * requested index.
 	 */
-	folio = page_folio(folio_file_page(folio, index));
-	folio_lock(folio);
+	folio = filemap_lock_folio(inode->i_mapping, index);
 
 	return folio;
+
+err:
+	filemap_remove_folio(folio);
+	return ERR_PTR(ret);
 }
 
-static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
-				      pgoff_t end)
+static void kvm_gmem_invalidate_begin_and_zap(struct kvm_gmem *gmem,
+					      pgoff_t start, pgoff_t end)
 {
 	bool flush = false, found_memslot = false;
 	struct kvm_memory_slot *slot;
@@ -848,7 +1306,7 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	filemap_invalidate_lock(inode->i_mapping);
 
 	list_for_each_entry(gmem, gmem_list, entry)
-		kvm_gmem_invalidate_begin(gmem, start, end);
+		kvm_gmem_invalidate_begin_and_zap(gmem, start, end);
 
 	if (kvm_gmem_has_custom_allocator(inode)) {
 		kvm_gmem_truncate_inode_range(inode, offset, offset + len);
@@ -978,7 +1436,7 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
 	 * Zap all SPTEs pointed at by this file.  Do not free the backing
 	 * memory, as its lifetime is associated with the inode, not the file.
 	 */
-	kvm_gmem_invalidate_begin(gmem, 0, -1ul);
+	kvm_gmem_invalidate_begin_and_zap(gmem, 0, -1ul);
 	kvm_gmem_invalidate_end(gmem, 0, -1ul);
 
 	list_del(&gmem->entry);
@@ -1289,7 +1747,7 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
 	end = start + folio_nr_pages(folio);
 
 	list_for_each_entry(gmem, gmem_list, entry)
-		kvm_gmem_invalidate_begin(gmem, start, end);
+		kvm_gmem_invalidate_begin_and_zap(gmem, start, end);
 
 	/*
 	 * Do not truncate the range, what action is taken in response to the
@@ -1330,6 +1788,9 @@ static void kvm_gmem_free_folio(struct address_space *mapping,
 	 */
 	folio_clear_uptodate(folio);
 
+	if (kvm_gmem_has_custom_allocator(mapping->host))
+		kvm_gmem_allocator_ops(mapping->host)->free_folio(folio);
+
 	kvm_gmem_invalidate(folio);
 }
 
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 39/51] KVM: guest_memfd: Merge and truncate on fallocate(PUNCH_HOLE)
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (37 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 38/51] KVM: guest_memfd: Split allocator pages for guest_memfd use Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-28 11:00   ` Yan Zhao
  2025-05-14 23:42 ` [RFC PATCH v2 40/51] KVM: guest_memfd: Update kvm_gmem_mapping_order to account for page status Ackerley Tng
                   ` (16 subsequent siblings)
  55 siblings, 1 reply; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Merge and truncate on fallocate(PUNCH_HOLE); if the file is being
closed, defer merging to the folio_put() callback instead.
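
For reference, a hedged userspace sketch of the operation handled here
(the fd is assumed to be a guest_memfd backed by the HugeTLB allocator;
the flags are the standard fallocate() ones):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <linux/falloc.h>

  /* Punch a hole covering one 2M huge page at offset 0. */
  static int punch_one_hugepage(int gmem_fd)
  {
          return fallocate(gmem_fd,
                           FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                           0, 2UL << 20);
  }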

Change-Id: Iae26987756e70c83f3b121edbc0ed0bc105eec0d
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 virt/kvm/guest_memfd.c | 76 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 68 insertions(+), 8 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index cb426c1dfef8..04b1513c2998 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -859,6 +859,35 @@ static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
 	return ret;
 }
 
+static long kvm_gmem_merge_truncate_indices(struct inode *inode, pgoff_t index,
+					   size_t nr_pages)
+{
+	struct folio *f;
+	pgoff_t unused;
+	long num_freed;
+
+	unmap_mapping_pages(inode->i_mapping, index, nr_pages, false);
+
+	if (!kvm_gmem_has_safe_refcount(inode->i_mapping, index, nr_pages, &unused))
+		return -EAGAIN;
+
+	f = filemap_get_folio(inode->i_mapping, index);
+	if (IS_ERR(f))
+		return 0;
+
+	/* Leave just filemap's refcounts on the folio. */
+	folio_put(f);
+
+	WARN_ON(kvm_gmem_merge_folio_in_filemap(inode, f));
+
+	num_freed = folio_nr_pages(f);
+	folio_lock(f);
+	truncate_inode_folio(inode->i_mapping, f);
+	folio_unlock(f);
+
+	return num_freed;
+}
+
 #else
 
 static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
@@ -874,6 +903,12 @@ static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
 	return 0;
 }
 
+static long kvm_gmem_merge_truncate_indices(struct inode *inode, pgoff_t index,
+					   size_t nr_pages)
+{
+	return 0;
+}
+
 #endif
 
 #else
@@ -1182,8 +1217,10 @@ static long kvm_gmem_truncate_indices(struct address_space *mapping,
  *
  * Removes folios beginning @index for @nr_pages from filemap in @inode, updates
  * inode metadata.
+ *
+ * Return: 0 on success and negative error otherwise.
  */
-static void kvm_gmem_truncate_inode_aligned_pages(struct inode *inode,
+static long kvm_gmem_truncate_inode_aligned_pages(struct inode *inode,
 						  pgoff_t index,
 						  size_t nr_pages)
 {
@@ -1191,19 +1228,34 @@ static void kvm_gmem_truncate_inode_aligned_pages(struct inode *inode,
 	long num_freed;
 	pgoff_t idx;
 	void *priv;
+	long ret;
 
 	priv = kvm_gmem_allocator_private(inode);
 	nr_per_huge_page = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
 
+	ret = 0;
 	num_freed = 0;
 	for (idx = index; idx < index + nr_pages; idx += nr_per_huge_page) {
-		num_freed += kvm_gmem_truncate_indices(
-			inode->i_mapping, idx, nr_per_huge_page);
+		if (mapping_exiting(inode->i_mapping) ||
+		    !kvm_gmem_has_some_shared(inode, idx, nr_per_huge_page)) {
+			num_freed += kvm_gmem_truncate_indices(
+				inode->i_mapping, idx, nr_per_huge_page);
+		} else {
+			ret = kvm_gmem_merge_truncate_indices(inode, idx,
+							      nr_per_huge_page);
+			if (ret < 0)
+				break;
+
+			num_freed += ret;
+			ret = 0;
+		}
 	}
 
 	spin_lock(&inode->i_lock);
 	inode->i_blocks -= (num_freed << PAGE_SHIFT) / 512;
 	spin_unlock(&inode->i_lock);
+
+	return ret;
 }
 
 /**
@@ -1252,8 +1304,10 @@ static void kvm_gmem_zero_range(struct address_space *mapping,
  *
  * Removes full (huge)pages from the filemap and zeroing incomplete
  * (huge)pages. The pages in the range may be split.
+ *
+ * Return: 0 on success and negative error otherwise.
  */
-static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
+static long kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
 					  loff_t lend)
 {
 	pgoff_t full_hpage_start;
@@ -1263,6 +1317,7 @@ static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
 	pgoff_t start;
 	pgoff_t end;
 	void *priv;
+	long ret;
 
 	priv = kvm_gmem_allocator_private(inode);
 	nr_per_huge_page = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
@@ -1279,10 +1334,11 @@ static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
 		kvm_gmem_zero_range(inode->i_mapping, start, zero_end);
 	}
 
+	ret = 0;
 	if (full_hpage_end > full_hpage_start) {
 		nr_pages = full_hpage_end - full_hpage_start;
-		kvm_gmem_truncate_inode_aligned_pages(inode, full_hpage_start,
-						      nr_pages);
+		ret = kvm_gmem_truncate_inode_aligned_pages(
+			inode, full_hpage_start, nr_pages);
 	}
 
 	if (end > full_hpage_end && end > full_hpage_start) {
@@ -1290,6 +1346,8 @@ static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
 
 		kvm_gmem_zero_range(inode->i_mapping, zero_start, end);
 	}
+
+	return ret;
 }
 
 static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
@@ -1298,6 +1356,7 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	pgoff_t start = offset >> PAGE_SHIFT;
 	pgoff_t end = (offset + len) >> PAGE_SHIFT;
 	struct kvm_gmem *gmem;
+	long ret;
 
 	/*
 	 * Bindings must be stable across invalidation to ensure the start+end
@@ -1308,8 +1367,9 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	list_for_each_entry(gmem, gmem_list, entry)
 		kvm_gmem_invalidate_begin_and_zap(gmem, start, end);
 
+	ret = 0;
 	if (kvm_gmem_has_custom_allocator(inode)) {
-		kvm_gmem_truncate_inode_range(inode, offset, offset + len);
+		ret = kvm_gmem_truncate_inode_range(inode, offset, offset + len);
 	} else {
 		/* Page size is PAGE_SIZE, so use optimized truncation function. */
 		truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
@@ -1320,7 +1380,7 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 
 	filemap_invalidate_unlock(inode->i_mapping);
 
-	return 0;
+	return ret;
 }
 
 static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 40/51] KVM: guest_memfd: Update kvm_gmem_mapping_order to account for page status
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (38 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 39/51] KVM: guest_memfd: Merge and truncate on fallocate(PUNCH_HOLE) Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 41/51] KVM: Add CAP to indicate support for HugeTLB as custom allocator Ackerley Tng
                   ` (15 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

kvm_gmem_mapping_order() should return the maximum mapping order for a
gfn if a page were to be faulted in for that gfn.

For inodes that support a custom allocator, the maximum mapping order
should be determined by the custom allocator in conjunction with
guest_memfd.

This patch updates kvm_gmem_mapping_order() to take into account that,
with the guestmem_hugetlb custom allocator, a huge page is split if any
page in its range is shared.
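
As a rough illustration (not part of the patch; assuming 4K base pages),
the values returned by the updated kvm_gmem_mapping_order() come out to:

  no guest_memfd file bound / no custom allocator     -> 0
  any page in the surrounding huge page range shared  -> 0 (folio is split)
  unsplit 2M guestmem_hugetlb folio                   -> ilog2(512)    == 9
  unsplit 1G guestmem_hugetlb folio                   -> ilog2(262144) == 18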

Change-Id: I5c061af6cefdcbd708a4334cd58edc340afcf44e
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 virt/kvm/guest_memfd.c | 72 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 62 insertions(+), 10 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 04b1513c2998..8b5fe1360e58 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -709,19 +709,27 @@ static int kvm_gmem_split_folio_in_filemap(struct inode *inode, struct folio *fo
 	return ret;
 }
 
+static inline bool kvm_gmem_should_split_at_index(struct inode *inode,
+						  pgoff_t index)
+{
+	pgoff_t index_floor;
+	size_t nr_pages;
+	void *priv;
+
+	priv = kvm_gmem_allocator_private(inode);
+	nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
+	index_floor = round_down(index, nr_pages);
+
+	return kvm_gmem_has_some_shared(inode, index_floor, nr_pages);
+}
+
 static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
 						      struct folio *folio)
 {
-	size_t to_nr_pages;
-	void *priv;
-
 	if (!kvm_gmem_has_custom_allocator(inode))
 		return 0;
 
-	priv = kvm_gmem_allocator_private(inode);
-	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_page(priv);
-
-	if (kvm_gmem_has_some_shared(inode, folio->index, to_nr_pages))
+	if (kvm_gmem_should_split_at_index(inode, folio->index))
 		return kvm_gmem_split_folio_in_filemap(inode, folio);
 
 	return 0;
@@ -890,6 +898,12 @@ static long kvm_gmem_merge_truncate_indices(struct inode *inode, pgoff_t index,
 
 #else
 
+static inline bool kvm_gmem_should_split_at_index(struct inode *inode,
+						  pgoff_t index)
+{
+	return false;
+}
+
 static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
 						      struct folio *folio)
 {
@@ -1523,7 +1537,7 @@ static inline struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
 	return get_file_active(&slot->gmem.file);
 }
 
-static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
+static pgoff_t kvm_gmem_get_index(const struct kvm_memory_slot *slot, gfn_t gfn)
 {
 	return gfn - slot->base_gfn + slot->gmem.pgoff;
 }
@@ -2256,14 +2270,52 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
 
 /**
- * Returns the mapping order for this @gfn in @slot.
+ * kvm_gmem_mapping_order() - Get the mapping order for this @gfn in @slot.
+ *
+ * @slot: the memslot that gfn belongs to.
+ * @gfn: the gfn to look up mapping order for.
  *
  * This is equal to max_order that would be returned if kvm_gmem_get_pfn() were
  * called now.
+ *
+ * Return: the mapping order for this @gfn in @slot.
  */
 int kvm_gmem_mapping_order(const struct kvm_memory_slot *slot, gfn_t gfn)
 {
-	return 0;
+	struct inode *inode;
+	struct file *file;
+	int ret;
+
+	file = kvm_gmem_get_file((struct kvm_memory_slot *)slot);
+	if (!file)
+		return 0;
+
+	inode = file_inode(file);
+
+	ret = 0;
+	if (kvm_gmem_has_custom_allocator(inode)) {
+		bool should_split;
+		pgoff_t index;
+
+		index = kvm_gmem_get_index(slot, gfn);
+
+		filemap_invalidate_lock_shared(inode->i_mapping);
+		should_split = kvm_gmem_should_split_at_index(inode, index);
+		filemap_invalidate_unlock_shared(inode->i_mapping);
+
+		if (!should_split) {
+			size_t nr_pages;
+			void *priv;
+
+			priv = kvm_gmem_allocator_private(inode);
+			nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
+
+			ret = ilog2(nr_pages);
+		}
+	}
+
+	fput(file);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(kvm_gmem_mapping_order);
 
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 41/51] KVM: Add CAP to indicate support for HugeTLB as custom allocator
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (39 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 40/51] KVM: guest_memfd: Update kvm_gmem_mapping_order to account for page status Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 42/51] KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd Ackerley Tng
                   ` (14 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

If this CAP returns true, then guestmem_hugetlb can be used as a
custom allocator for guest_memfd.
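
A minimal userspace sketch of how the CAP is meant to be consumed
(illustration only, not part of this patch; GUEST_MEMFD_FLAG_HUGETLB and
the GUESTMEM_HUGETLB_FLAG_* size flags come from earlier patches in this
series):

#include <linux/guestmem.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>

static int create_gmem_maybe_hugetlb(int vm_fd, __u64 size)
{
	struct kvm_create_guest_memfd args = { .size = size };

	/* Only ask for HugeTLB backing if the kernel advertises support. */
	if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_GMEM_HUGETLB) > 0)
		args.flags = GUEST_MEMFD_FLAG_HUGETLB |
			     GUESTMEM_HUGETLB_FLAG_1GB;

	return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
}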

Change-Id: I4edef395b5bd5814b70c81788d87aa94823c35d5
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/uapi/linux/kvm.h | 1 +
 virt/kvm/kvm_main.c      | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index af486b2e4862..5012343dc2c5 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -932,6 +932,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_ARM_WRITABLE_IMP_ID_REGS 239
 #define KVM_CAP_GMEM_SHARED_MEM 240
 #define KVM_CAP_GMEM_CONVERSION 241
+#define KVM_CAP_GMEM_HUGETLB 242
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 92054b1bbd3f..230bcb853712 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4845,6 +4845,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 	case KVM_CAP_GMEM_SHARED_MEM:
 	case KVM_CAP_GMEM_CONVERSION:
 		return true;
+#endif
+#ifdef CONFIG_KVM_GMEM_HUGETLB
+	case KVM_CAP_GMEM_HUGETLB:
+		return true;
 #endif
 	default:
 		break;
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 42/51] KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (40 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 41/51] KVM: Add CAP to indicate support for HugeTLB as custom allocator Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 43/51] KVM: selftests: Update conversion flows test for HugeTLB Ackerley Tng
                   ` (13 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Add tests for 2MB and 1GB page sizes, and update the invalid flags
test for GUEST_MEMFD_FLAG_HUGETLB.

In the tests, touch every page but not every byte of each page, to save
time.

Change-Id: I7d80a12b991a064cfd796e3c6e11f9a95fd16ec1
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../testing/selftests/kvm/guest_memfd_test.c  | 94 +++++++++++++------
 1 file changed, 67 insertions(+), 27 deletions(-)

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index 1e79382fd830..c8acccaa9e1d 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -13,6 +13,8 @@
 
 #include <linux/bitmap.h>
 #include <linux/falloc.h>
+#include <linux/guestmem.h>
+#include <linux/sizes.h>
 #include <sys/mman.h>
 #include <sys/types.h>
 #include <sys/stat.h>
@@ -38,6 +40,7 @@ static void test_file_read_write(int fd)
 static void test_faulting_allowed(int fd, size_t page_size, size_t total_size)
 {
 	const char val = 0xaa;
+	size_t increment;
 	char *mem;
 	size_t i;
 	int ret;
@@ -45,21 +48,25 @@ static void test_faulting_allowed(int fd, size_t page_size, size_t total_size)
 	mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
 	TEST_ASSERT(mem != MAP_FAILED, "mmaping() guest memory should pass.");
 
-	memset(mem, val, total_size);
-	for (i = 0; i < total_size; i++)
+	increment = page_size >> 1;
+
+	for (i = 0; i < total_size; i += increment)
+		mem[i] = val;
+	for (i = 0; i < total_size; i += increment)
 		TEST_ASSERT_EQ(mem[i], val);
 
 	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0,
 			page_size);
 	TEST_ASSERT(!ret, "fallocate the first page should succeed");
 
-	for (i = 0; i < page_size; i++)
+	for (i = 0; i < page_size; i += increment)
 		TEST_ASSERT_EQ(mem[i], 0x00);
-	for (; i < total_size; i++)
+	for (; i < total_size; i += increment)
 		TEST_ASSERT_EQ(mem[i], val);
 
-	memset(mem, val, total_size);
-	for (i = 0; i < total_size; i++)
+	for (i = 0; i < total_size; i += increment)
+		mem[i] = val;
+	for (i = 0; i < total_size; i += increment)
 		TEST_ASSERT_EQ(mem[i], val);
 
 	ret = munmap(mem, total_size);
@@ -209,7 +216,7 @@ static void test_create_guest_memfd_invalid_sizes(struct kvm_vm *vm,
 	size_t size;
 	int fd;
 
-	for (size = 1; size < page_size; size++) {
+	for (size = 1; size < page_size; size += (page_size >> 1)) {
 		fd = __vm_create_guest_memfd(vm, size, guest_memfd_flags);
 		TEST_ASSERT(fd == -1 && errno == EINVAL,
 			    "guest_memfd() with non-page-aligned page size '0x%lx' should fail with EINVAL",
@@ -217,28 +224,33 @@ static void test_create_guest_memfd_invalid_sizes(struct kvm_vm *vm,
 	}
 }
 
-static void test_create_guest_memfd_multiple(struct kvm_vm *vm)
+static void test_create_guest_memfd_multiple(struct kvm_vm *vm,
+					     uint64_t guest_memfd_flags,
+					     size_t page_size)
 {
 	int fd1, fd2, ret;
 	struct stat st1, st2;
 
-	fd1 = __vm_create_guest_memfd(vm, 4096, 0);
+	fd1 = __vm_create_guest_memfd(vm, page_size, guest_memfd_flags);
 	TEST_ASSERT(fd1 != -1, "memfd creation should succeed");
 
 	ret = fstat(fd1, &st1);
 	TEST_ASSERT(ret != -1, "memfd fstat should succeed");
-	TEST_ASSERT(st1.st_size == 4096, "memfd st_size should match requested size");
+	TEST_ASSERT(st1.st_size == page_size, "memfd st_size should match requested size");
 
-	fd2 = __vm_create_guest_memfd(vm, 8192, 0);
+	fd2 = __vm_create_guest_memfd(vm, page_size * 2, guest_memfd_flags);
 	TEST_ASSERT(fd2 != -1, "memfd creation should succeed");
 
 	ret = fstat(fd2, &st2);
 	TEST_ASSERT(ret != -1, "memfd fstat should succeed");
-	TEST_ASSERT(st2.st_size == 8192, "second memfd st_size should match requested size");
+	TEST_ASSERT(st2.st_size == page_size * 2,
+		    "second memfd st_size should match requested size");
+
 
 	ret = fstat(fd1, &st1);
 	TEST_ASSERT(ret != -1, "memfd fstat should succeed");
-	TEST_ASSERT(st1.st_size == 4096, "first memfd st_size should still match requested size");
+	TEST_ASSERT(st1.st_size == page_size,
+		    "first memfd st_size should still match requested size");
 	TEST_ASSERT(st1.st_ino != st2.st_ino, "different memfd should have different inode numbers");
 
 	close(fd2);
@@ -449,21 +461,13 @@ static void test_guest_memfd_features(struct kvm_vm *vm, size_t page_size,
 	close(fd);
 }
 
-static void test_with_type(unsigned long vm_type, uint64_t guest_memfd_flags,
-			   bool expect_mmap_allowed)
+static void test_guest_memfd_features_for_page_size(struct kvm_vm *vm,
+						    uint64_t guest_memfd_flags,
+						    size_t page_size,
+						    bool expect_mmap_allowed)
 {
-	struct kvm_vm *vm;
-	size_t page_size;
+	test_create_guest_memfd_multiple(vm, guest_memfd_flags, page_size);
 
-	if (!(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(vm_type)))
-		return;
-
-	vm = vm_create_barebones_type(vm_type);
-
-	test_create_guest_memfd_multiple(vm);
-	test_bind_guest_memfd_wrt_userspace_addr(vm);
-
-	page_size = getpagesize();
 	if (guest_memfd_flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED) {
 		test_guest_memfd_features(vm, page_size, guest_memfd_flags,
 					  expect_mmap_allowed, true);
@@ -479,6 +483,34 @@ static void test_with_type(unsigned long vm_type, uint64_t guest_memfd_flags,
 		test_guest_memfd_features(vm, page_size, guest_memfd_flags,
 					  expect_mmap_allowed, false);
 	}
+}
+
+static void test_with_type(unsigned long vm_type, uint64_t base_flags,
+			   bool expect_mmap_allowed)
+{
+	struct kvm_vm *vm;
+	uint64_t flags;
+
+	if (!(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(vm_type)))
+		return;
+
+	vm = vm_create_barebones_type(vm_type);
+
+	test_bind_guest_memfd_wrt_userspace_addr(vm);
+
+	printf("Test guest_memfd with 4K pages for vm_type %ld\n", vm_type);
+	test_guest_memfd_features_for_page_size(vm, base_flags, getpagesize(), expect_mmap_allowed);
+	printf("\tPASSED\n");
+
+	printf("Test guest_memfd with 2M pages for vm_type %ld\n", vm_type);
+	flags = base_flags | GUEST_MEMFD_FLAG_HUGETLB | GUESTMEM_HUGETLB_FLAG_2MB;
+	test_guest_memfd_features_for_page_size(vm, flags, SZ_2M, expect_mmap_allowed);
+	printf("\tPASSED\n");
+
+	printf("Test guest_memfd with 1G pages for vm_type %ld\n", vm_type);
+	flags = base_flags | GUEST_MEMFD_FLAG_HUGETLB | GUESTMEM_HUGETLB_FLAG_1GB;
+	test_guest_memfd_features_for_page_size(vm, flags, SZ_1G, expect_mmap_allowed);
+	printf("\tPASSED\n");
 
 	kvm_vm_release(vm);
 }
@@ -486,9 +518,14 @@ static void test_with_type(unsigned long vm_type, uint64_t guest_memfd_flags,
 static void test_vm_with_gmem_flag(struct kvm_vm *vm, uint64_t flag,
 				   bool expect_valid)
 {
-	size_t page_size = getpagesize();
+	size_t page_size;
 	int fd;
 
+	if (flag == GUEST_MEMFD_FLAG_HUGETLB)
+		page_size = get_def_hugetlb_pagesz();
+	else
+		page_size = getpagesize();
+
 	fd = __vm_create_guest_memfd(vm, page_size, flag);
 
 	if (expect_valid) {
@@ -550,6 +587,9 @@ static void test_gmem_flag_validity(void)
 	/* After conversions are supported, all VM types support shared mem. */
 	uint64_t valid_flags = GUEST_MEMFD_FLAG_SUPPORT_SHARED;
 
+	if (kvm_has_cap(KVM_CAP_GMEM_HUGETLB))
+		valid_flags |= GUEST_MEMFD_FLAG_HUGETLB;
+
 	test_vm_type_gmem_flag_validity(VM_TYPE_DEFAULT, valid_flags);
 
 #ifdef __x86_64__
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 43/51] KVM: selftests: Update conversion flows test for HugeTLB
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (41 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 42/51] KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 44/51] KVM: selftests: Test truncation paths of guest_memfd Ackerley Tng
                   ` (12 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

This patch updates the conversion flows test to use
GUEST_MEMFD_FLAG_HUGETLB and adds test runs with 2MB and 1GB page sizes.

Change-Id: If5d93cb776d6bebd504a80bba553bd534e62be38
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../kvm/guest_memfd_conversions_test.c        | 171 ++++++++++--------
 1 file changed, 98 insertions(+), 73 deletions(-)

diff --git a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
index 34eb6c9a37b1..22126454fd6b 100644
--- a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
@@ -5,6 +5,7 @@
  * Copyright (c) 2024, Google LLC.
  */
 #include <linux/kvm.h>
+#include <linux/sizes.h>
 #include <stdio.h>
 #include <string.h>
 #include <sys/mman.h>
@@ -228,6 +229,11 @@ static struct kvm_vm *setup_test(size_t test_page_size, bool init_private,
 	if (init_private)
 		flags |= GUEST_MEMFD_FLAG_INIT_PRIVATE;
 
+	if (test_page_size == SZ_2M)
+		flags |= GUEST_MEMFD_FLAG_HUGETLB | GUESTMEM_HUGETLB_FLAG_2MB;
+	else if (test_page_size == SZ_1G)
+		flags |= GUEST_MEMFD_FLAG_HUGETLB | GUESTMEM_HUGETLB_FLAG_1GB;
+
 	*guest_memfd = vm_create_guest_memfd(vm, test_page_size, flags);
 	TEST_ASSERT(*guest_memfd > 0, "guest_memfd creation failed");
 
@@ -249,79 +255,80 @@ static void cleanup_test(size_t guest_memfd_size, struct kvm_vm *vm,
 		TEST_ASSERT_EQ(close(guest_memfd), 0);
 }
 
-static void test_sharing(void)
+static void test_sharing(size_t test_page_size)
 {
 	struct kvm_vcpu *vcpu;
 	struct kvm_vm *vm;
 	int guest_memfd;
 	char *mem;
 
-	vm = setup_test(PAGE_SIZE, /*init_private=*/false, &vcpu, &guest_memfd, &mem);
+	vm = setup_test(test_page_size, /*init_private=*/false, &vcpu, &guest_memfd, &mem);
 
 	host_use_memory(mem, 'X', 'A');
 	guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA, 'A', 'B', 0);
 
 	/* Toggle private flag of memory attributes and run the test again. */
-	guest_memfd_convert_private(guest_memfd, 0, PAGE_SIZE);
+	guest_memfd_convert_private(guest_memfd, 0, test_page_size);
 
 	assert_host_cannot_fault(mem);
 	guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA, 'B', 'C', 0);
 
-	guest_memfd_convert_shared(guest_memfd, 0, PAGE_SIZE);
+	guest_memfd_convert_shared(guest_memfd, 0, test_page_size);
 
 	host_use_memory(mem, 'C', 'D');
 	guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA, 'D', 'E', 0);
 
-	cleanup_test(PAGE_SIZE, vm, guest_memfd, mem);
+	cleanup_test(test_page_size, vm, guest_memfd, mem);
 }
 
-static void test_init_mappable_false(void)
+static void test_init_mappable_false(size_t test_page_size)
 {
 	struct kvm_vcpu *vcpu;
 	struct kvm_vm *vm;
 	int guest_memfd;
 	char *mem;
 
-	vm = setup_test(PAGE_SIZE, /*init_private=*/true, &vcpu, &guest_memfd, &mem);
+	vm = setup_test(test_page_size, /*init_private=*/true, &vcpu, &guest_memfd, &mem);
 
 	assert_host_cannot_fault(mem);
 	guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA, 'X', 'A', 0);
 
-	guest_memfd_convert_shared(guest_memfd, 0, PAGE_SIZE);
+	guest_memfd_convert_shared(guest_memfd, 0, test_page_size);
 
 	host_use_memory(mem, 'A', 'B');
 	guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA, 'B', 'C', 0);
 
-	cleanup_test(PAGE_SIZE, vm, guest_memfd, mem);
+	cleanup_test(test_page_size, vm, guest_memfd, mem);
 }
 
 /*
  * Test that even if there are no folios yet, conversion requests are recorded
  * in guest_memfd.
  */
-static void test_conversion_before_allocation(void)
+static void test_conversion_before_allocation(size_t test_page_size)
 {
 	struct kvm_vcpu *vcpu;
 	struct kvm_vm *vm;
 	int guest_memfd;
 	char *mem;
 
-	vm = setup_test(PAGE_SIZE, /*init_private=*/false, &vcpu, &guest_memfd, &mem);
+	vm = setup_test(test_page_size, /*init_private=*/false, &vcpu, &guest_memfd, &mem);
 
-	guest_memfd_convert_private(guest_memfd, 0, PAGE_SIZE);
+	guest_memfd_convert_private(guest_memfd, 0, test_page_size);
 
 	assert_host_cannot_fault(mem);
 	guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA, 'X', 'A', 0);
 
-	guest_memfd_convert_shared(guest_memfd, 0, PAGE_SIZE);
+	guest_memfd_convert_shared(guest_memfd, 0, test_page_size);
 
 	host_use_memory(mem, 'A', 'B');
 	guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA, 'B', 'C', 0);
 
-	cleanup_test(PAGE_SIZE, vm, guest_memfd, mem);
+	cleanup_test(test_page_size, vm, guest_memfd, mem);
 }
 
-static void __test_conversion_if_not_all_folios_allocated(int total_nr_pages,
+static void __test_conversion_if_not_all_folios_allocated(size_t test_page_size,
+							  int total_nr_pages,
 							  int page_to_fault)
 {
 	const int second_page_to_fault = 8;
@@ -332,15 +339,15 @@ static void __test_conversion_if_not_all_folios_allocated(int total_nr_pages,
 	char *mem;
 	int i;
 
-	total_size = PAGE_SIZE * total_nr_pages;
+	total_size = test_page_size * total_nr_pages;
 	vm = setup_test(total_size, /*init_private=*/false, &vcpu, &guest_memfd, &mem);
 
 	/*
 	 * Fault 2 of the pages to test filemap range operations except when
 	 * page_to_fault == second_page_to_fault.
 	 */
-	host_use_memory(mem + page_to_fault * PAGE_SIZE, 'X', 'A');
-	host_use_memory(mem + second_page_to_fault * PAGE_SIZE, 'X', 'A');
+	host_use_memory(mem + page_to_fault * test_page_size, 'X', 'A');
+	host_use_memory(mem + second_page_to_fault * test_page_size, 'X', 'A');
 
 	guest_memfd_convert_private(guest_memfd, 0, total_size);
 
@@ -348,37 +355,37 @@ static void __test_conversion_if_not_all_folios_allocated(int total_nr_pages,
 		bool is_faulted;
 		char expected;
 
-		assert_host_cannot_fault(mem + i * PAGE_SIZE);
+		assert_host_cannot_fault(mem + i * test_page_size);
 
 		is_faulted = i == page_to_fault || i == second_page_to_fault;
 		expected = is_faulted ? 'A' : 'X';
 		guest_use_memory(vcpu,
-				 GUEST_MEMFD_SHARING_TEST_GVA + i * PAGE_SIZE,
+				 GUEST_MEMFD_SHARING_TEST_GVA + i * test_page_size,
 				 expected, 'B', 0);
 	}
 
 	guest_memfd_convert_shared(guest_memfd, 0, total_size);
 
 	for (i = 0; i < total_nr_pages; ++i) {
-		host_use_memory(mem + i * PAGE_SIZE, 'B', 'C');
+		host_use_memory(mem + i * test_page_size, 'B', 'C');
 		guest_use_memory(vcpu,
-				 GUEST_MEMFD_SHARING_TEST_GVA + i * PAGE_SIZE,
+				 GUEST_MEMFD_SHARING_TEST_GVA + i * test_page_size,
 				 'C', 'D', 0);
 	}
 
 	cleanup_test(total_size, vm, guest_memfd, mem);
 }
 
-static void test_conversion_if_not_all_folios_allocated(void)
+static void test_conversion_if_not_all_folios_allocated(size_t test_page_size)
 {
 	const int total_nr_pages = 16;
 	int i;
 
 	for (i = 0; i < total_nr_pages; ++i)
-		__test_conversion_if_not_all_folios_allocated(total_nr_pages, i);
+		__test_conversion_if_not_all_folios_allocated(test_page_size, total_nr_pages, i);
 }
 
-static void test_conversions_should_not_affect_surrounding_pages(void)
+static void test_conversions_should_not_affect_surrounding_pages(size_t test_page_size)
 {
 	struct kvm_vcpu *vcpu;
 	int page_to_convert;
@@ -391,40 +398,40 @@ static void test_conversions_should_not_affect_surrounding_pages(void)
 
 	page_to_convert = 2;
 	nr_pages = 4;
-	total_size = PAGE_SIZE * nr_pages;
+	total_size = test_page_size * nr_pages;
 
 	vm = setup_test(total_size, /*init_private=*/false, &vcpu, &guest_memfd, &mem);
 
 	for (i = 0; i < nr_pages; ++i) {
-		host_use_memory(mem + i * PAGE_SIZE, 'X', 'A');
-		guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA + i * PAGE_SIZE,
+		host_use_memory(mem + i * test_page_size, 'X', 'A');
+		guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA + i * test_page_size,
 				 'A', 'B', 0);
 	}
 
-	guest_memfd_convert_private(guest_memfd, PAGE_SIZE * page_to_convert, PAGE_SIZE);
+	guest_memfd_convert_private(guest_memfd, test_page_size * page_to_convert, test_page_size);
 
 
 	for (i = 0; i < nr_pages; ++i) {
 		char to_check;
 
 		if (i == page_to_convert) {
-			assert_host_cannot_fault(mem + i * PAGE_SIZE);
+			assert_host_cannot_fault(mem + i * test_page_size);
 			to_check = 'B';
 		} else {
-			host_use_memory(mem + i * PAGE_SIZE, 'B', 'C');
+			host_use_memory(mem + i * test_page_size, 'B', 'C');
 			to_check = 'C';
 		}
 
-		guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA + i * PAGE_SIZE,
+		guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA + i * test_page_size,
 				 to_check, 'D', 0);
 	}
 
-	guest_memfd_convert_shared(guest_memfd, PAGE_SIZE * page_to_convert, PAGE_SIZE);
+	guest_memfd_convert_shared(guest_memfd, test_page_size * page_to_convert, test_page_size);
 
 
 	for (i = 0; i < nr_pages; ++i) {
-		host_use_memory(mem + i * PAGE_SIZE, 'D', 'E');
-		guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA + i * PAGE_SIZE,
+		host_use_memory(mem + i * test_page_size, 'D', 'E');
+		guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA + i * test_page_size,
 				 'E', 'F', 0);
 	}
 
@@ -432,7 +439,7 @@ static void test_conversions_should_not_affect_surrounding_pages(void)
 }
 
 static void __test_conversions_should_fail_if_memory_has_elevated_refcount(
-	int nr_pages, int page_to_convert)
+	size_t test_page_size, int nr_pages, int page_to_convert)
 {
 	struct kvm_vcpu *vcpu;
 	loff_t error_offset;
@@ -443,50 +450,50 @@ static void __test_conversions_should_fail_if_memory_has_elevated_refcount(
 	int ret;
 	int i;
 
-	total_size = PAGE_SIZE * nr_pages;
+	total_size = test_page_size * nr_pages;
 	vm = setup_test(total_size, /*init_private=*/false, &vcpu, &guest_memfd, &mem);
 
-	pin_pages(mem + page_to_convert * PAGE_SIZE, PAGE_SIZE);
+	pin_pages(mem + page_to_convert * test_page_size, test_page_size);
 
 	for (i = 0; i < nr_pages; i++) {
-		host_use_memory(mem + i * PAGE_SIZE, 'X', 'A');
-		guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA + i * PAGE_SIZE,
+		host_use_memory(mem + i * test_page_size, 'X', 'A');
+		guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA + i * test_page_size,
 				 'A', 'B', 0);
 	}
 
 	error_offset = 0;
-	ret = __guest_memfd_convert_private(guest_memfd, page_to_convert * PAGE_SIZE,
-					    PAGE_SIZE, &error_offset);
+	ret = __guest_memfd_convert_private(guest_memfd, page_to_convert * test_page_size,
+					    test_page_size, &error_offset);
 	TEST_ASSERT_EQ(ret, -1);
 	TEST_ASSERT_EQ(errno, EAGAIN);
-	TEST_ASSERT_EQ(error_offset, page_to_convert * PAGE_SIZE);
+	TEST_ASSERT_EQ(error_offset, page_to_convert * test_page_size);
 
 	unpin_pages();
 
-	guest_memfd_convert_private(guest_memfd, page_to_convert * PAGE_SIZE, PAGE_SIZE);
+	guest_memfd_convert_private(guest_memfd, page_to_convert * test_page_size, test_page_size);
 
 	for (i = 0; i < nr_pages; i++) {
 		char expected;
 
 		if (i == page_to_convert)
-			assert_host_cannot_fault(mem + i * PAGE_SIZE);
+			assert_host_cannot_fault(mem + i * test_page_size);
 		else
-			host_use_memory(mem + i * PAGE_SIZE, 'B', 'C');
+			host_use_memory(mem + i * test_page_size, 'B', 'C');
 
 		expected = i == page_to_convert ? 'X' : 'C';
-		guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA + i * PAGE_SIZE,
+		guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA + i * test_page_size,
 				 expected, 'D', 0);
 	}
 
-	guest_memfd_convert_shared(guest_memfd, page_to_convert * PAGE_SIZE, PAGE_SIZE);
+	guest_memfd_convert_shared(guest_memfd, page_to_convert * test_page_size, test_page_size);
 
 
 	for (i = 0; i < nr_pages; i++) {
 		char expected = i == page_to_convert ? 'X' : 'D';
 
-		host_use_memory(mem + i * PAGE_SIZE, expected, 'E');
+		host_use_memory(mem + i * test_page_size, expected, 'E');
 		guest_use_memory(vcpu,
-				 GUEST_MEMFD_SHARING_TEST_GVA + i * PAGE_SIZE,
+				 GUEST_MEMFD_SHARING_TEST_GVA + i * test_page_size,
 				 'E', 'F', 0);
 	}
 
@@ -496,15 +503,18 @@ static void __test_conversions_should_fail_if_memory_has_elevated_refcount(
  * This test depends on CONFIG_GUP_TEST to provide a kernel module that exposes
  * pin_user_pages() to userspace.
  */
-static void test_conversions_should_fail_if_memory_has_elevated_refcount(void)
+static void test_conversions_should_fail_if_memory_has_elevated_refcount(
+		size_t test_page_size)
 {
 	int i;
 
-	for (i = 0; i < 4; i++)
-		__test_conversions_should_fail_if_memory_has_elevated_refcount(4, i);
+	for (i = 0; i < 4; i++) {
+		__test_conversions_should_fail_if_memory_has_elevated_refcount(
+			test_page_size, 4, i);
+	}
 }
 
-static void test_truncate_should_not_change_mappability(void)
+static void test_truncate_should_not_change_mappability(size_t test_page_size)
 {
 	struct kvm_vcpu *vcpu;
 	struct kvm_vm *vm;
@@ -512,40 +522,40 @@ static void test_truncate_should_not_change_mappability(void)
 	char *mem;
 	int ret;
 
-	vm = setup_test(PAGE_SIZE, /*init_private=*/false, &vcpu, &guest_memfd, &mem);
+	vm = setup_test(test_page_size, /*init_private=*/false, &vcpu, &guest_memfd, &mem);
 
 	host_use_memory(mem, 'X', 'A');
 
 	ret = fallocate(guest_memfd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
-			0, PAGE_SIZE);
+			0, test_page_size);
 	TEST_ASSERT(!ret, "truncating the first page should succeed");
 
 	host_use_memory(mem, 'X', 'A');
 
-	guest_memfd_convert_private(guest_memfd, 0, PAGE_SIZE);
+	guest_memfd_convert_private(guest_memfd, 0, test_page_size);
 
 	assert_host_cannot_fault(mem);
 	guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA, 'A', 'A', 0);
 
 	ret = fallocate(guest_memfd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
-			0, PAGE_SIZE);
+			0, test_page_size);
 	TEST_ASSERT(!ret, "truncating the first page should succeed");
 
 	assert_host_cannot_fault(mem);
 	guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA, 'X', 'A', 0);
 
-	cleanup_test(PAGE_SIZE, vm, guest_memfd, mem);
+	cleanup_test(test_page_size, vm, guest_memfd, mem);
 }
 
-static void test_fault_type_independent_of_mem_attributes(void)
+static void test_fault_type_independent_of_mem_attributes(size_t test_page_size)
 {
 	struct kvm_vcpu *vcpu;
 	struct kvm_vm *vm;
 	int guest_memfd;
 	char *mem;
 
-	vm = setup_test(PAGE_SIZE, /*init_private=*/true, &vcpu, &guest_memfd, &mem);
-	vm_mem_set_shared(vm, GUEST_MEMFD_SHARING_TEST_GPA, PAGE_SIZE);
+	vm = setup_test(test_page_size, /*init_private=*/true, &vcpu, &guest_memfd, &mem);
+	vm_mem_set_shared(vm, GUEST_MEMFD_SHARING_TEST_GPA, test_page_size);
 
 	/*
 	 * kvm->mem_attr_array set to shared, guest_memfd memory initialized as
@@ -558,8 +568,8 @@ static void test_fault_type_independent_of_mem_attributes(void)
 	/* Guest can fault and use memory. */
 	guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA, 'X', 'A', 0);
 
-	guest_memfd_convert_shared(guest_memfd, 0, PAGE_SIZE);
-	vm_mem_set_private(vm, GUEST_MEMFD_SHARING_TEST_GPA, PAGE_SIZE);
+	guest_memfd_convert_shared(guest_memfd, 0, test_page_size);
+	vm_mem_set_private(vm, GUEST_MEMFD_SHARING_TEST_GPA, test_page_size);
 
 	/* Host can use shared memory. */
 	host_use_memory(mem, 'X', 'A');
@@ -567,7 +577,19 @@ static void test_fault_type_independent_of_mem_attributes(void)
 	/* Guest can also use shared memory. */
 	guest_use_memory(vcpu, GUEST_MEMFD_SHARING_TEST_GVA, 'X', 'A', 0);
 
-	cleanup_test(PAGE_SIZE, vm, guest_memfd, mem);
+	cleanup_test(test_page_size, vm, guest_memfd, mem);
+}
+
+static void test_with_size(size_t test_page_size)
+{
+	test_sharing(test_page_size);
+	test_init_mappable_false(test_page_size);
+	test_conversion_before_allocation(test_page_size);
+	test_conversion_if_not_all_folios_allocated(test_page_size);
+	test_conversions_should_not_affect_surrounding_pages(test_page_size);
+	test_truncate_should_not_change_mappability(test_page_size);
+	test_conversions_should_fail_if_memory_has_elevated_refcount(test_page_size);
+	test_fault_type_independent_of_mem_attributes(test_page_size);
 }
 
 int main(int argc, char *argv[])
@@ -576,14 +598,17 @@ int main(int argc, char *argv[])
 	TEST_REQUIRE(kvm_check_cap(KVM_CAP_GMEM_SHARED_MEM));
 	TEST_REQUIRE(kvm_check_cap(KVM_CAP_GMEM_CONVERSION));
 
-	test_sharing();
-	test_init_mappable_false();
-	test_conversion_before_allocation();
-	test_conversion_if_not_all_folios_allocated();
-	test_conversions_should_not_affect_surrounding_pages();
-	test_truncate_should_not_change_mappability();
-	test_conversions_should_fail_if_memory_has_elevated_refcount();
-	test_fault_type_independent_of_mem_attributes();
+	printf("Test guest_memfd with 4K pages\n");
+	test_with_size(PAGE_SIZE);
+	printf("\tPASSED\n");
+
+	printf("Test guest_memfd with 2M pages\n");
+	test_with_size(SZ_2M);
+	printf("\tPASSED\n");
+
+	printf("Test guest_memfd with 1G pages\n");
+	test_with_size(SZ_1G);
+	printf("\tPASSED\n");
 
 	return 0;
 }
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 44/51] KVM: selftests: Test truncation paths of guest_memfd
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (42 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 43/51] KVM: selftests: Update conversion flows test for HugeTLB Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 45/51] KVM: selftests: Test allocation and conversion of subfolios Ackerley Tng
                   ` (11 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

When guest_memfd folios are truncated, any pages that were split have
to be merged back.

For truncation via fallocate(PUNCH_HOLE), userspace gets an error if
there are unexpected refcounts on the folios.

For truncation when the file is closed, the kernel handles the merging
even if there are unexpected refcounts on the folios.

This patch tests the above two scenarios.
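
A rough userspace-level sketch of the first scenario (illustration only;
it mirrors the selftest below, where the unexpected refcount comes from
pin_user_pages() via the GUP test module):

#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>

/*
 * Punching a hole in a hugetlb-backed guest_memfd range fails if folios
 * in the range were split and still carry unexpected refcounts: the
 * merge step cannot complete and fallocate() returns -1 with
 * errno == EAGAIN.
 */
static int gmem_punch_hole(int gmem_fd, off_t offset, off_t len)
{
	int ret;

	ret = fallocate(gmem_fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
			offset, len);
	if (ret && errno == EAGAIN)
		return -EAGAIN;	/* retry after dropping extra references */

	return ret;
}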

Change-Id: I0f0c619763f575605fab8b3c453858960e43ed71
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../kvm/guest_memfd_conversions_test.c        | 95 +++++++++++++++++++
 1 file changed, 95 insertions(+)

diff --git a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
index 22126454fd6b..435f91424d5f 100644
--- a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
@@ -4,6 +4,7 @@
  *
  * Copyright (c) 2024, Google LLC.
  */
+#include <linux/guestmem.h>
 #include <linux/kvm.h>
 #include <linux/sizes.h>
 #include <stdio.h>
@@ -580,6 +581,97 @@ static void test_fault_type_independent_of_mem_attributes(size_t test_page_size)
 	cleanup_test(test_page_size, vm, guest_memfd, mem);
 }
 
+static void test_truncate_shared_while_pinned(size_t test_page_size)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	int guest_memfd;
+	char *mem;
+	int ret;
+
+	vm = setup_test(test_page_size, /*init_private=*/false, &vcpu,
+			&guest_memfd, &mem);
+
+	ret = fallocate(guest_memfd, FALLOC_FL_KEEP_SIZE, 0, test_page_size);
+	TEST_ASSERT(!ret, "fallocate should have succeeded");
+
+	pin_pages(mem, test_page_size);
+
+	ret = fallocate(guest_memfd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+			0, test_page_size);
+	if (test_page_size == PAGE_SIZE) {
+		TEST_ASSERT(!ret, "truncate should have succeeded since there is no need to merge");
+	} else {
+		TEST_ASSERT(ret, "truncate should have failed since pages are pinned");
+		TEST_ASSERT_EQ(errno, EAGAIN);
+	}
+
+	unpin_pages();
+
+	ret = fallocate(guest_memfd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+			0, test_page_size);
+	TEST_ASSERT(!ret, "truncate should succeed now that pages are unpinned");
+
+	cleanup_test(test_page_size, vm, guest_memfd, mem);
+}
+
+static void test_truncate_private(size_t test_page_size)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	int guest_memfd;
+	char *mem;
+	int ret;
+
+	vm = setup_test(test_page_size, /*init_private=*/true, &vcpu,
+			&guest_memfd, &mem);
+
+	ret = fallocate(guest_memfd, FALLOC_FL_KEEP_SIZE, 0, test_page_size);
+	TEST_ASSERT(!ret, "fallocate should have succeeded");
+
+	ret = fallocate(guest_memfd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+			0, test_page_size);
+	TEST_ASSERT(!ret, "truncate should have succeeded since there is no need to merge");
+
+	cleanup_test(test_page_size, vm, guest_memfd, mem);
+}
+
+static void __test_close_with_pinning(size_t test_page_size, bool init_private)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	int guest_memfd;
+	char *mem;
+	int ret;
+
+	vm = setup_test(test_page_size, init_private, &vcpu, &guest_memfd, &mem);
+
+	ret = fallocate(guest_memfd, FALLOC_FL_KEEP_SIZE, 0, test_page_size);
+	TEST_ASSERT(!ret, "fallocate should have succeeded");
+
+	if (!init_private)
+		pin_pages(mem, test_page_size);
+
+	cleanup_test(test_page_size, vm, guest_memfd, mem);
+
+	if (!init_private)
+		unpin_pages();
+
+	/*
+	 * Test this with ./guest_memfd_wrap_test_check_hugetlb_reporting.sh to
+	 * check that the HugeTLB page got merged and returned to HugeTLB.
+	 *
+	 * Sleep here to give kernel worker time to do the merge and return.
+	 */
+	sleep(1);
+}
+
+static void test_close_with_pinning(size_t test_page_size)
+{
+	__test_close_with_pinning(test_page_size, true);
+	__test_close_with_pinning(test_page_size, false);
+}
+
 static void test_with_size(size_t test_page_size)
 {
 	test_sharing(test_page_size);
@@ -590,6 +682,9 @@ static void test_with_size(size_t test_page_size)
 	test_truncate_should_not_change_mappability(test_page_size);
 	test_conversions_should_fail_if_memory_has_elevated_refcount(test_page_size);
 	test_fault_type_independent_of_mem_attributes(test_page_size);
+	test_truncate_shared_while_pinned(test_page_size);
+	test_truncate_private(test_page_size);
+	test_close_with_pinning(test_page_size);
 }
 
 int main(int argc, char *argv[])
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 45/51] KVM: selftests: Test allocation and conversion of subfolios
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (43 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 44/51] KVM: selftests: Test truncation paths of guest_memfd Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 46/51] KVM: selftests: Test that guest_memfd usage is reported via hugetlb Ackerley Tng
                   ` (10 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

This patch adds tests for allocation and conversion of subfolios in a
large folio.

Change-Id: I37035b2c24398e2c83a2ac5a46b4e6ceed2a8b53
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../kvm/guest_memfd_conversions_test.c        | 88 +++++++++++++++++++
 1 file changed, 88 insertions(+)

diff --git a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
index 435f91424d5f..c31d1abd1b93 100644
--- a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
@@ -672,6 +672,92 @@ static void test_close_with_pinning(size_t test_page_size)
 	__test_close_with_pinning(test_page_size, false);
 }
 
+static void test_allocate_subfolios(size_t test_page_size)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	size_t increment;
+	int guest_memfd;
+	size_t nr_pages;
+	char *mem;
+	int i;
+
+	if (test_page_size == PAGE_SIZE)
+		return;
+
+	vm = setup_test(test_page_size, /*init_private=*/false, &vcpu,
+			&guest_memfd, &mem);
+
+	nr_pages = test_page_size / PAGE_SIZE;
+
+	/*
+	 * Loop backwards to check allocation of the correct subfolio within the
+	 * huge folio. If it were allocated wrongly, the second loop would error
+	 * out because one or more of the checks would be wrong.
+	 */
+	increment = nr_pages >> 1;
+	for (i = nr_pages - 1; i >= 0; i -= increment)
+		host_use_memory(mem + i * PAGE_SIZE, 'X', 'A' + i);
+	for (i = nr_pages - 1; i >= 0; i -= increment)
+		host_use_memory(mem + i * PAGE_SIZE, 'A' + i, 'A' + i);
+
+	cleanup_test(test_page_size, vm, guest_memfd, mem);
+}
+
+static void test_convert_subfolios(size_t test_page_size)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	size_t increment;
+	int guest_memfd;
+	size_t nr_pages;
+	int to_convert;
+	char *mem;
+	int i;
+
+	if (test_page_size == PAGE_SIZE)
+		return;
+
+	vm = setup_test(test_page_size, /*init_private=*/true, &vcpu,
+			&guest_memfd, &mem);
+
+	nr_pages = test_page_size / PAGE_SIZE;
+
+	increment = nr_pages >> 1;
+	for (i = 0; i < nr_pages; i += increment) {
+		guest_use_memory(vcpu,
+				 GUEST_MEMFD_SHARING_TEST_GVA + i * PAGE_SIZE,
+				 'X', 'A', 0);
+		assert_host_cannot_fault(mem + i * PAGE_SIZE);
+	}
+
+	to_convert = round_up(nr_pages / 2, increment);
+	guest_memfd_convert_shared(guest_memfd, to_convert * PAGE_SIZE, PAGE_SIZE);
+
+
+	for (i = 0; i < nr_pages; i += increment) {
+		if (i == to_convert)
+			host_use_memory(mem + i * PAGE_SIZE, 'A', 'B');
+		else
+			assert_host_cannot_fault(mem + i * PAGE_SIZE);
+
+		guest_use_memory(vcpu,
+				 GUEST_MEMFD_SHARING_TEST_GVA + i * PAGE_SIZE,
+				 'X', 'B', 0);
+	}
+
+	guest_memfd_convert_private(guest_memfd, to_convert * PAGE_SIZE, PAGE_SIZE);
+
+	for (i = 0; i < nr_pages; i += increment) {
+		guest_use_memory(vcpu,
+				 GUEST_MEMFD_SHARING_TEST_GVA + i * PAGE_SIZE,
+				 'B', 'C', 0);
+		assert_host_cannot_fault(mem + i * PAGE_SIZE);
+	}
+
+	cleanup_test(test_page_size, vm, guest_memfd, mem);
+}
+
 static void test_with_size(size_t test_page_size)
 {
 	test_sharing(test_page_size);
@@ -685,6 +771,8 @@ static void test_with_size(size_t test_page_size)
 	test_truncate_shared_while_pinned(test_page_size);
 	test_truncate_private(test_page_size);
 	test_close_with_pinning(test_page_size);
+	test_allocate_subfolios(test_page_size);
+	test_convert_subfolios(test_page_size);
 }
 
 int main(int argc, char *argv[])
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 46/51] KVM: selftests: Test that guest_memfd usage is reported via hugetlb
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (44 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 45/51] KVM: selftests: Test allocation and conversion of subfolios Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 47/51] KVM: selftests: Support various types of backing sources for private memory Ackerley Tng
                   ` (9 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Using HugeTLB as the huge page allocator for guest_memfd allows reuse
of HugeTLB's reporting mechanism, so HugeTLB stats must be kept
up-to-date throughout the lifecycle of a guest_memfd.

This patch tests that they are.
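
In terms of the assertions added below (a sketch derived from
assert_stats(); "baseline" is the value read before the test runs), the
expected relationships are roughly:

  free_hugepages                     == baseline - num_faulted
  resv_hugepages                     == baseline + num_reserved - num_faulted
  hugetlb.<size>.usage_in_bytes      == baseline + num_faulted  * page_size
  hugetlb.<size>.rsvd.usage_in_bytes == baseline + num_reserved * page_size

with nr_hugepages, surplus and overcommit counts unchanged.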

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Change-Id: Ida3319b1d40c593d8167a03506c7030e67fc746b
---
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../kvm/guest_memfd_hugetlb_reporting_test.c  | 384 ++++++++++++++++++
 ...uest_memfd_provide_hugetlb_cgroup_mount.sh |  36 ++
 3 files changed, 421 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
 create mode 100755 tools/testing/selftests/kvm/guest_memfd_provide_hugetlb_cgroup_mount.sh

diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index bc22a5a23c4c..2ffe6bc95a68 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -132,6 +132,7 @@ TEST_GEN_PROGS_x86 += coalesced_io_test
 TEST_GEN_PROGS_x86 += dirty_log_perf_test
 TEST_GEN_PROGS_x86 += guest_memfd_test
 TEST_GEN_PROGS_x86 += guest_memfd_conversions_test
+TEST_GEN_PROGS_x86 += guest_memfd_hugetlb_reporting_test
 TEST_GEN_PROGS_x86 += hardware_disable_test
 TEST_GEN_PROGS_x86 += memslot_modification_stress_test
 TEST_GEN_PROGS_x86 += memslot_perf_test
diff --git a/tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c b/tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
new file mode 100644
index 000000000000..8ff1dda3e02f
--- /dev/null
+++ b/tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
@@ -0,0 +1,384 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Tests that HugeTLB statistics are correct at various points of the lifecycle
+ * of guest_memfd with 1G page support.
+ *
+ * Providing a HUGETLB_CGROUP_PATH will allow cgroup reservations to be
+ * tested.
+ *
+ * Either use
+ *
+ *   ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_hugetlb_reporting_test
+ *
+ * or provide the mount with
+ *
+ *   export HUGETLB_CGROUP_PATH=/tmp/hugetlb-cgroup
+ *   mount -t cgroup -o hugetlb none $HUGETLB_CGROUP_PATH
+ *   ./guest_memfd_hugetlb_reporting_test
+ *
+ *
+ * Copyright (C) 2025 Google LLC
+ *
+ * Authors:
+ *   Ackerley Tng <ackerleytng@google.com>
+ */
+
+#include <fcntl.h>
+#include <linux/falloc.h>
+#include <linux/guestmem.h>
+#include <linux/kvm.h>
+#include <linux/limits.h>
+#include <linux/memfd.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+
+#include "kvm_util.h"
+#include "test_util.h"
+#include "processor.h"
+
+static unsigned long read_value(const char *file_name)
+{
+	FILE *fp;
+	unsigned long num;
+
+	fp = fopen(file_name, "r");
+	TEST_ASSERT(fp != NULL, "Error opening file %s!\n", file_name);
+
+	TEST_ASSERT_EQ(fscanf(fp, "%lu", &num), 1);
+
+	fclose(fp);
+
+	return num;
+}
+
+enum hugetlb_statistic {
+	FREE_HUGEPAGES,
+	NR_HUGEPAGES,
+	NR_OVERCOMMIT_HUGEPAGES,
+	RESV_HUGEPAGES,
+	SURPLUS_HUGEPAGES,
+	NR_TESTED_HUGETLB_STATISTICS,
+};
+
+enum hugetlb_cgroup_statistic {
+	LIMIT_IN_BYTES,
+	MAX_USAGE_IN_BYTES,
+	USAGE_IN_BYTES,
+	NR_TESTED_HUGETLB_CGROUP_STATISTICS,
+};
+
+enum hugetlb_cgroup_statistic_category {
+	USAGE = 0,
+	RESERVATION,
+	NR_HUGETLB_CGROUP_STATISTIC_CATEGORIES,
+};
+
+static const char *hugetlb_statistics[NR_TESTED_HUGETLB_STATISTICS] = {
+	[FREE_HUGEPAGES] = "free_hugepages",
+	[NR_HUGEPAGES] = "nr_hugepages",
+	[NR_OVERCOMMIT_HUGEPAGES] = "nr_overcommit_hugepages",
+	[RESV_HUGEPAGES] = "resv_hugepages",
+	[SURPLUS_HUGEPAGES] = "surplus_hugepages",
+};
+
+static const char *hugetlb_cgroup_statistics[NR_TESTED_HUGETLB_CGROUP_STATISTICS] = {
+	[LIMIT_IN_BYTES] = "limit_in_bytes",
+	[MAX_USAGE_IN_BYTES] = "max_usage_in_bytes",
+	[USAGE_IN_BYTES] = "usage_in_bytes",
+};
+
+enum test_page_size {
+	TEST_SZ_2M,
+	TEST_SZ_1G,
+	NR_TEST_SIZES,
+};
+
+struct test_param {
+	size_t page_size;
+	int memfd_create_flags;
+	uint64_t guest_memfd_flags;
+	char *hugetlb_size_string;
+	char *hugetlb_cgroup_size_string;
+};
+
+const struct test_param *test_params(enum test_page_size size)
+{
+	static const struct test_param params[] = {
+		[TEST_SZ_2M] = {
+			.page_size = PG_SIZE_2M,
+			.memfd_create_flags = MFD_HUGETLB | MFD_HUGE_2MB,
+			.guest_memfd_flags = GUEST_MEMFD_FLAG_HUGETLB | GUESTMEM_HUGETLB_FLAG_2MB,
+			.hugetlb_size_string = "2048kB",
+			.hugetlb_cgroup_size_string = "2MB",
+		},
+		[TEST_SZ_1G] = {
+			.page_size = PG_SIZE_1G,
+			.memfd_create_flags = MFD_HUGETLB | MFD_HUGE_1GB,
+			.guest_memfd_flags = GUEST_MEMFD_FLAG_HUGETLB | GUESTMEM_HUGETLB_FLAG_1GB,
+			.hugetlb_size_string = "1048576kB",
+			.hugetlb_cgroup_size_string = "1GB",
+		},
+	};
+
+	return &params[size];
+}
+
+static unsigned long read_hugetlb_statistic(enum test_page_size size,
+					    enum hugetlb_statistic statistic)
+{
+	char path[PATH_MAX] = "/sys/kernel/mm/hugepages/hugepages-";
+
+	strcat(path, test_params(size)->hugetlb_size_string);
+	strcat(path, "/");
+	strcat(path, hugetlb_statistics[statistic]);
+
+	return read_value(path);
+}
+
+static unsigned long read_hugetlb_cgroup_statistic(const char *hugetlb_cgroup_path,
+						   enum test_page_size size,
+						   enum hugetlb_cgroup_statistic statistic,
+						   bool reservations)
+{
+	char path[PATH_MAX] = "";
+
+	strcat(path, hugetlb_cgroup_path);
+
+	if (hugetlb_cgroup_path[strlen(hugetlb_cgroup_path) - 1] != '/')
+		strcat(path, "/");
+
+	strcat(path, "hugetlb.");
+	strcat(path, test_params(size)->hugetlb_cgroup_size_string);
+	if (reservations)
+		strcat(path, ".rsvd");
+	strcat(path, ".");
+	strcat(path, hugetlb_cgroup_statistics[statistic]);
+
+	return read_value(path);
+}
+
+static unsigned long hugetlb_baseline[NR_TEST_SIZES]
+				     [NR_TESTED_HUGETLB_STATISTICS];
+
+static unsigned long
+	hugetlb_cgroup_baseline[NR_TEST_SIZES]
+			       [NR_TESTED_HUGETLB_CGROUP_STATISTICS]
+			       [NR_HUGETLB_CGROUP_STATISTIC_CATEGORIES];
+
+
+static void establish_baseline(const char *hugetlb_cgroup_path)
+{
+	const char *p = hugetlb_cgroup_path;
+	int i, j;
+
+	for (i = 0; i < NR_TEST_SIZES; ++i) {
+		for (j = 0; j < NR_TESTED_HUGETLB_STATISTICS; ++j)
+			hugetlb_baseline[i][j] = read_hugetlb_statistic(i, j);
+
+		if (!hugetlb_cgroup_path)
+			continue;
+
+		for (j = 0; j < NR_TESTED_HUGETLB_CGROUP_STATISTICS; ++j) {
+			hugetlb_cgroup_baseline[i][j][USAGE] =
+				read_hugetlb_cgroup_statistic(p, i, j, USAGE);
+			hugetlb_cgroup_baseline[i][j][RESERVATION] =
+				read_hugetlb_cgroup_statistic(p, i, j, RESERVATION);
+		}
+	}
+}
+
+static void assert_stats_at_baseline(const char *hugetlb_cgroup_path)
+{
+	const char *p = hugetlb_cgroup_path;
+
+	/* Enumerate these for easy assertion reading. */
+	TEST_ASSERT_EQ(read_hugetlb_statistic(TEST_SZ_2M, FREE_HUGEPAGES),
+		       hugetlb_baseline[TEST_SZ_2M][FREE_HUGEPAGES]);
+	TEST_ASSERT_EQ(read_hugetlb_statistic(TEST_SZ_2M, NR_HUGEPAGES),
+		       hugetlb_baseline[TEST_SZ_2M][NR_HUGEPAGES]);
+	TEST_ASSERT_EQ(read_hugetlb_statistic(TEST_SZ_2M, NR_OVERCOMMIT_HUGEPAGES),
+		       hugetlb_baseline[TEST_SZ_2M][NR_OVERCOMMIT_HUGEPAGES]);
+	TEST_ASSERT_EQ(read_hugetlb_statistic(TEST_SZ_2M, RESV_HUGEPAGES),
+		       hugetlb_baseline[TEST_SZ_2M][RESV_HUGEPAGES]);
+	TEST_ASSERT_EQ(read_hugetlb_statistic(TEST_SZ_2M, SURPLUS_HUGEPAGES),
+		       hugetlb_baseline[TEST_SZ_2M][SURPLUS_HUGEPAGES]);
+
+	TEST_ASSERT_EQ(read_hugetlb_statistic(TEST_SZ_1G, FREE_HUGEPAGES),
+		       hugetlb_baseline[TEST_SZ_1G][FREE_HUGEPAGES]);
+	TEST_ASSERT_EQ(read_hugetlb_statistic(TEST_SZ_1G, NR_HUGEPAGES),
+		       hugetlb_baseline[TEST_SZ_1G][NR_HUGEPAGES]);
+	TEST_ASSERT_EQ(read_hugetlb_statistic(TEST_SZ_1G, NR_OVERCOMMIT_HUGEPAGES),
+		       hugetlb_baseline[TEST_SZ_1G][NR_OVERCOMMIT_HUGEPAGES]);
+	TEST_ASSERT_EQ(read_hugetlb_statistic(TEST_SZ_1G, RESV_HUGEPAGES),
+		       hugetlb_baseline[TEST_SZ_1G][RESV_HUGEPAGES]);
+	TEST_ASSERT_EQ(read_hugetlb_statistic(TEST_SZ_1G, SURPLUS_HUGEPAGES),
+		       hugetlb_baseline[TEST_SZ_1G][SURPLUS_HUGEPAGES]);
+
+	if (!hugetlb_cgroup_path)
+		return;
+
+	TEST_ASSERT_EQ(
+		read_hugetlb_cgroup_statistic(p, TEST_SZ_2M, LIMIT_IN_BYTES, USAGE),
+		hugetlb_cgroup_baseline[TEST_SZ_2M][LIMIT_IN_BYTES][USAGE]);
+	TEST_ASSERT_EQ(
+		read_hugetlb_cgroup_statistic(p, TEST_SZ_2M, MAX_USAGE_IN_BYTES, USAGE),
+		hugetlb_cgroup_baseline[TEST_SZ_2M][MAX_USAGE_IN_BYTES][USAGE]);
+	TEST_ASSERT_EQ(
+		read_hugetlb_cgroup_statistic(p, TEST_SZ_2M, USAGE_IN_BYTES, USAGE),
+		hugetlb_cgroup_baseline[TEST_SZ_2M][USAGE_IN_BYTES][USAGE]);
+
+	TEST_ASSERT_EQ(
+		read_hugetlb_cgroup_statistic(p, TEST_SZ_1G, LIMIT_IN_BYTES, RESERVATION),
+		hugetlb_cgroup_baseline[TEST_SZ_1G][LIMIT_IN_BYTES][RESERVATION]);
+	TEST_ASSERT_EQ(
+		read_hugetlb_cgroup_statistic(p, TEST_SZ_1G, MAX_USAGE_IN_BYTES, RESERVATION),
+		hugetlb_cgroup_baseline[TEST_SZ_1G][MAX_USAGE_IN_BYTES][RESERVATION]);
+	TEST_ASSERT_EQ(
+		read_hugetlb_cgroup_statistic(p, TEST_SZ_1G, USAGE_IN_BYTES, RESERVATION),
+		hugetlb_cgroup_baseline[TEST_SZ_1G][USAGE_IN_BYTES][RESERVATION]);
+}
+
+static void assert_stats(const char *hugetlb_cgroup_path,
+			 enum test_page_size size, unsigned long num_reserved,
+			 unsigned long num_faulted)
+{
+	size_t pgsz = test_params(size)->page_size;
+	const char *p = hugetlb_cgroup_path;
+
+	TEST_ASSERT_EQ(read_hugetlb_statistic(size, FREE_HUGEPAGES),
+		       hugetlb_baseline[size][FREE_HUGEPAGES] - num_faulted);
+	TEST_ASSERT_EQ(read_hugetlb_statistic(size, NR_HUGEPAGES),
+		       hugetlb_baseline[size][NR_HUGEPAGES]);
+	TEST_ASSERT_EQ(read_hugetlb_statistic(size, NR_OVERCOMMIT_HUGEPAGES),
+		       hugetlb_baseline[size][NR_OVERCOMMIT_HUGEPAGES]);
+	TEST_ASSERT_EQ(read_hugetlb_statistic(size, RESV_HUGEPAGES),
+		       hugetlb_baseline[size][RESV_HUGEPAGES] + num_reserved - num_faulted);
+	TEST_ASSERT_EQ(read_hugetlb_statistic(size, SURPLUS_HUGEPAGES),
+		       hugetlb_baseline[size][SURPLUS_HUGEPAGES]);
+
+	if (!hugetlb_cgroup_path)
+		return;
+
+	TEST_ASSERT_EQ(
+		read_hugetlb_cgroup_statistic(p, size, LIMIT_IN_BYTES, USAGE),
+		hugetlb_cgroup_baseline[size][LIMIT_IN_BYTES][USAGE]);
+	TEST_ASSERT_EQ(
+		read_hugetlb_cgroup_statistic(p, size, MAX_USAGE_IN_BYTES, USAGE),
+		hugetlb_cgroup_baseline[size][MAX_USAGE_IN_BYTES][USAGE]);
+	TEST_ASSERT_EQ(
+		read_hugetlb_cgroup_statistic(p, size, USAGE_IN_BYTES, USAGE),
+		hugetlb_cgroup_baseline[size][USAGE_IN_BYTES][USAGE] + num_faulted * pgsz);
+
+	TEST_ASSERT_EQ(
+		read_hugetlb_cgroup_statistic(p, size, LIMIT_IN_BYTES, RESERVATION),
+		hugetlb_cgroup_baseline[size][LIMIT_IN_BYTES][RESERVATION]);
+	TEST_ASSERT_EQ(
+		read_hugetlb_cgroup_statistic(p, size, MAX_USAGE_IN_BYTES, RESERVATION),
+		hugetlb_cgroup_baseline[size][MAX_USAGE_IN_BYTES][RESERVATION]);
+	TEST_ASSERT_EQ(
+		read_hugetlb_cgroup_statistic(p, size, USAGE_IN_BYTES, RESERVATION),
+		hugetlb_cgroup_baseline[size][USAGE_IN_BYTES][RESERVATION] + num_reserved * pgsz);
+}
+
+/* Use hugetlb behavior as a baseline. guest_memfd should have comparable behavior. */
+static void test_hugetlb_behavior(const char *hugetlb_cgroup_path, enum test_page_size test_size)
+{
+	const struct test_param *param;
+	char *mem;
+	int memfd;
+
+	param = test_params(test_size);
+
+	assert_stats_at_baseline(hugetlb_cgroup_path);
+
+	memfd = memfd_create("guest_memfd_hugetlb_reporting_test",
+			     param->memfd_create_flags);
+
+	assert_stats(hugetlb_cgroup_path, test_size, 0, 0);
+
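+	/* mmap() reserves one hugetlb page but does not fault it in. */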
+	mem = mmap(NULL, param->page_size, PROT_READ | PROT_WRITE,
+		   MAP_SHARED | MAP_HUGETLB, memfd, 0);
+	TEST_ASSERT(mem != MAP_FAILED, "Couldn't mmap()");
+
+	assert_stats(hugetlb_cgroup_path, test_size, 1, 0);
+
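+	/* Writing to the mapping faults the reserved page in. */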
+	*mem = 'A';
+
+	assert_stats(hugetlb_cgroup_path, test_size, 1, 1);
+
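+	/*
+	 * The faulted page stays charged to the memfd after munmap(). The
+	 * madvise() calls below act on the now-unmapped range and should not
+	 * change any hugetlb stats.
+	 */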
+	munmap(mem, param->page_size);
+
+	assert_stats(hugetlb_cgroup_path, test_size, 1, 1);
+
+	madvise(mem, param->page_size, MADV_DONTNEED);
+
+	assert_stats(hugetlb_cgroup_path, test_size, 1, 1);
+
+	madvise(mem, param->page_size, MADV_REMOVE);
+
+	assert_stats(hugetlb_cgroup_path, test_size, 1, 1);
+
+	close(memfd);
+
+	assert_stats_at_baseline(hugetlb_cgroup_path);
+}
+
+static void test_guest_memfd_behavior(const char *hugetlb_cgroup_path,
+				      enum test_page_size test_size)
+{
+	const struct test_param *param;
+	struct kvm_vm *vm;
+	int guest_memfd;
+
+	param = test_params(test_size);
+
+	assert_stats_at_baseline(hugetlb_cgroup_path);
+
+	vm = vm_create_barebones_type(KVM_X86_SW_PROTECTED_VM);
+
+	assert_stats(hugetlb_cgroup_path, test_size, 0, 0);
+
+	guest_memfd = vm_create_guest_memfd(vm, param->page_size,
+					    param->guest_memfd_flags);
+
+	/* fd creation reserves pages. */
+	assert_stats(hugetlb_cgroup_path, test_size, 1, 0);
+
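+	/* fallocate() allocates the page; stats count it as faulted. */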
+	fallocate(guest_memfd, FALLOC_FL_KEEP_SIZE, 0, param->page_size);
+
+	assert_stats(hugetlb_cgroup_path, test_size, 1, 1);
+
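+	/* Punching a hole frees the page but keeps the reservation. */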
+	fallocate(guest_memfd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0,
+		  param->page_size);
+
+	assert_stats(hugetlb_cgroup_path, test_size, 1, 0);
+
+	close(guest_memfd);
+
+	/*
+	 * Wait a little for stats to be updated in rcu callback. resv_hugepages
+	 * is updated on truncation in ->free_inode, and ->free_inode() happens
+	 * in an rcu callback.
+	 */
+	usleep(300 * 1000);
+
+	assert_stats_at_baseline(hugetlb_cgroup_path);
+
+	kvm_vm_free(vm);
+}
+
+int main(int argc, char *argv[])
+{
+	char *hugetlb_cgroup_path;
+
+	hugetlb_cgroup_path = getenv("HUGETLB_CGROUP_PATH");
+
+	establish_baseline(hugetlb_cgroup_path);
+
+	test_hugetlb_behavior(hugetlb_cgroup_path, TEST_SZ_2M);
+	test_hugetlb_behavior(hugetlb_cgroup_path, TEST_SZ_1G);
+
+	test_guest_memfd_behavior(hugetlb_cgroup_path, TEST_SZ_2M);
+	test_guest_memfd_behavior(hugetlb_cgroup_path, TEST_SZ_1G);
+}
diff --git a/tools/testing/selftests/kvm/guest_memfd_provide_hugetlb_cgroup_mount.sh b/tools/testing/selftests/kvm/guest_memfd_provide_hugetlb_cgroup_mount.sh
new file mode 100755
index 000000000000..4180d49771c8
--- /dev/null
+++ b/tools/testing/selftests/kvm/guest_memfd_provide_hugetlb_cgroup_mount.sh
@@ -0,0 +1,36 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Wrapper that runs test, providing a hugetlb cgroup mount in environment
+# variable HUGETLB_CGROUP_PATH
+#
+# Example:
+#   ./guest_memfd_provide_hugetlb_cgroup_mount.sh ./guest_memfd_hugetlb_reporting_test
+#
+# Copyright (C) 2025, Google LLC.
+
+script_dir=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+
+temp_dir=$(mktemp -d /tmp/guest_memfd_hugetlb_reporting_test_XXXXXX)
+if [[ -z "$temp_dir" ]]; then
+  echo "Error: Failed to create temporary directory for hugetlb cgroup mount." >&2
+  exit 1
+fi
+
+delete_temp_dir() {
+  rm -rf $temp_dir
+}
+trap delete_temp_dir EXIT
+
+
+mount -t cgroup -o hugetlb none $temp_dir
+
+
+cleanup() {
+  umount $temp_dir
+  rm -rf $temp_dir
+}
+trap cleanup EXIT
+
+
+HUGETLB_CGROUP_PATH="$temp_dir" "$@"
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 47/51] KVM: selftests: Support various types of backing sources for private memory
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (45 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 46/51] KVM: selftests: Test that guest_memfd usage is reported via hugetlb Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 48/51] KVM: selftests: Update test for various private memory backing source types Ackerley Tng
                   ` (8 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Add support for various types of backing sources for private
memory (private in the sense of confidential computing), similar to
the backing sources available for shared memory.

Change-Id: I683b48c90d74f8cb99e416d26c8fb98331df0bab
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../testing/selftests/kvm/include/test_util.h | 18 ++++-
 tools/testing/selftests/kvm/lib/test_util.c   | 77 +++++++++++++++++++
 2 files changed, 94 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
index b4a03784ac4f..bfd9d9a897e3 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -139,9 +139,19 @@ enum vm_mem_backing_src_type {
 
 struct vm_mem_backing_src_alias {
 	const char *name;
-	uint32_t flag;
+	uint64_t flag;
 };
 
+enum vm_private_mem_backing_src_type {
+	VM_PRIVATE_MEM_SRC_GUEST_MEM,  /* Use default page size */
+	VM_PRIVATE_MEM_SRC_HUGETLB,    /* Use kernel default page size for hugetlb pages */
+	VM_PRIVATE_MEM_SRC_HUGETLB_2MB,
+	VM_PRIVATE_MEM_SRC_HUGETLB_1GB,
+	NUM_PRIVATE_MEM_SRC_TYPES,
+};
+
+#define DEFAULT_VM_PRIVATE_MEM_SRC VM_PRIVATE_MEM_SRC_GUEST_MEM
+
 #define MIN_RUN_DELAY_NS	200000UL
 
 bool thp_configured(void);
@@ -154,6 +164,12 @@ int get_backing_src_madvise_advice(uint32_t i);
 bool is_backing_src_hugetlb(uint32_t i);
 void backing_src_help(const char *flag);
 enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name);
+
+void private_mem_backing_src_help(const char *flag);
+enum vm_private_mem_backing_src_type parse_private_mem_backing_src_type(const char *type_name);
+const struct vm_mem_backing_src_alias *vm_private_mem_backing_src_alias(uint32_t i);
+size_t get_private_mem_backing_src_pagesz(uint32_t i);
+
 long get_run_delay(void);
 
 /*
diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/selftests/kvm/lib/test_util.c
index 24dc90693afd..8c4d6ec44c41 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -15,6 +15,8 @@
 #include <sys/syscall.h>
 #include <linux/mman.h>
 #include "linux/kernel.h"
+#include <linux/kvm.h>
+#include <linux/guestmem.h>
 
 #include "test_util.h"
 
@@ -288,6 +290,34 @@ const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i)
 	return &aliases[i];
 }
 
+const struct vm_mem_backing_src_alias *vm_private_mem_backing_src_alias(uint32_t i)
+{
+	static const struct vm_mem_backing_src_alias aliases[] = {
+		[VM_PRIVATE_MEM_SRC_GUEST_MEM] = {
+			.name = "private_mem_guest_mem",
+			.flag = 0,
+		},
+		[VM_PRIVATE_MEM_SRC_HUGETLB] = {
+			.name = "private_mem_hugetlb",
+			.flag = GUEST_MEMFD_FLAG_HUGETLB,
+		},
+		[VM_PRIVATE_MEM_SRC_HUGETLB_2MB] = {
+			.name = "private_mem_hugetlb_2mb",
+			.flag = GUEST_MEMFD_FLAG_HUGETLB | GUESTMEM_HUGETLB_FLAG_2MB,
+		},
+		[VM_PRIVATE_MEM_SRC_HUGETLB_1GB] = {
+			.name = "private_mem_hugetlb_1gb",
+			.flag = GUEST_MEMFD_FLAG_HUGETLB | GUESTMEM_HUGETLB_FLAG_1GB,
+		},
+	};
+	_Static_assert(ARRAY_SIZE(aliases) == NUM_PRIVATE_MEM_SRC_TYPES,
+		       "Missing new backing private mem src types?");
+
+	TEST_ASSERT(i < NUM_PRIVATE_MEM_SRC_TYPES, "Private mem backing src type ID %d too big", i);
+
+	return &aliases[i];
+}
+
 #define MAP_HUGE_PAGE_SIZE(x) (1ULL << ((x >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK))
 
 size_t get_backing_src_pagesz(uint32_t i)
@@ -333,6 +363,22 @@ int get_backing_src_madvise_advice(uint32_t i)
 	}
 }
 
+size_t get_private_mem_backing_src_pagesz(uint32_t i)
+{
+	switch (i) {
+	case VM_PRIVATE_MEM_SRC_GUEST_MEM:
+		return getpagesize();
+	case VM_PRIVATE_MEM_SRC_HUGETLB:
+		return get_def_hugetlb_pagesz();
+	default: {
+		uint64_t flag = vm_private_mem_backing_src_alias(i)->flag;
+
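+		/* Page size is encoded as log2(size) in the flag bits. */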
+		return 1UL << ((flag >> GUESTMEM_HUGETLB_FLAG_SHIFT) &
+			       GUESTMEM_HUGETLB_FLAG_MASK);
+	}
+	}
+}
+
 bool is_backing_src_hugetlb(uint32_t i)
 {
 	return !!(vm_mem_backing_src_alias(i)->flag & MAP_HUGETLB);
@@ -369,6 +415,37 @@ enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name)
 	return -1;
 }
 
+static void print_available_private_mem_backing_src_types(const char *prefix)
+{
+	int i;
+
+	printf("%sAvailable private mem backing src types:\n", prefix);
+
+	for (i = 0; i < NUM_PRIVATE_MEM_SRC_TYPES; i++)
+		printf("%s    %s\n", prefix, vm_private_mem_backing_src_alias(i)->name);
+}
+
+void private_mem_backing_src_help(const char *flag)
+{
+	printf(" %s: specify the type of memory that should be used to\n"
+	       "     back guest private memory. (default: %s)\n",
+	       flag, vm_private_mem_backing_src_alias(DEFAULT_VM_PRIVATE_MEM_SRC)->name);
+	print_available_private_mem_backing_src_types("     ");
+}
+
+enum vm_private_mem_backing_src_type parse_private_mem_backing_src_type(const char *type_name)
+{
+	int i;
+
+	for (i = 0; i < NUM_PRIVATE_MEM_SRC_TYPES; i++)
+		if (!strcmp(type_name, vm_private_mem_backing_src_alias(i)->name))
+			return i;
+
+	print_available_private_mem_backing_src_types("");
+	TEST_FAIL("Unknown private mem backing src type: %s", type_name);
+	return -1;
+}
+
 long get_run_delay(void)
 {
 	char path[64];
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 48/51] KVM: selftests: Update test for various private memory backing source types
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (46 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 47/51] KVM: selftests: Support various types of backing sources for private memory Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 49/51] KVM: selftests: Update private_mem_conversions_test.sh to test with HugeTLB pages Ackerley Tng
                   ` (7 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Update private_mem_conversions_test to accept a private memory backing
source type, so that the test can exercise HugeTLB support in
guest_memfd.
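
For example, with enough 2M and 1G HugeTLB pages reserved beforehand,
the test could be invoked as follows (the backing source names come
from the previous patch; the vCPU and memslot counts are illustrative):

  ./private_mem_conversions_test -s shmem -p private_mem_hugetlb_2mb -n 2
  ./private_mem_conversions_test -s shmem -p private_mem_hugetlb_1gb -n 2 -m 2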

Change-Id: I50facb166a282f97570591eb331c3f19676b01cc
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../kvm/x86/private_mem_conversions_test.c    | 42 +++++++++++++------
 1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
index ec20bb7e95c8..5a0fd9155ce8 100644
--- a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
@@ -450,21 +450,18 @@ static void *__test_mem_conversions(void *params)
 }
 
 static void test_mem_conversions(enum vm_mem_backing_src_type src_type,
+				 enum vm_private_mem_backing_src_type private_mem_src_type,
 				 uint32_t nr_vcpus, uint32_t nr_memslots,
 				 bool back_shared_memory_with_guest_memfd)
 {
-	/*
-	 * Allocate enough memory so that each vCPU's chunk of memory can be
-	 * naturally aligned with respect to the size of the backing store.
-	 */
-	const size_t alignment = max_t(size_t, SZ_2M, get_backing_src_pagesz(src_type));
 	struct test_thread_args *thread_args[KVM_MAX_VCPUS];
-	const size_t per_cpu_size = align_up(PER_CPU_DATA_SIZE, alignment);
-	const size_t memfd_size = per_cpu_size * nr_vcpus;
-	const size_t slot_size = memfd_size / nr_memslots;
 	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
 	pthread_t threads[KVM_MAX_VCPUS];
+	size_t per_cpu_size;
+	size_t memfd_size;
 	struct kvm_vm *vm;
+	size_t alignment;
+	size_t slot_size;
 	int memfd, i, r;
 	uint64_t flags;
 
@@ -473,6 +470,18 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type,
 		.type = KVM_X86_SW_PROTECTED_VM,
 	};
 
+	/*
+	 * Allocate enough memory so that each vCPU's chunk of memory can be
+	 * naturally aligned with respect to the size of the backing store.
+	 */
+	alignment = max_t(size_t, SZ_2M,
+			  max_t(size_t, get_backing_src_pagesz(src_type),
+				get_private_mem_backing_src_pagesz(
+					private_mem_src_type)));
+	per_cpu_size = align_up(PER_CPU_DATA_SIZE, alignment);
+	memfd_size = per_cpu_size * nr_vcpus;
+	slot_size = memfd_size / nr_memslots;
+
 	TEST_ASSERT(slot_size * nr_memslots == memfd_size,
 		    "The memfd size (0x%lx) needs to be cleanly divisible by the number of memslots (%u)",
 		    memfd_size, nr_memslots);
@@ -483,6 +492,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type,
 	flags = back_shared_memory_with_guest_memfd ?
 			GUEST_MEMFD_FLAG_SUPPORT_SHARED :
 			0;
+	flags |= vm_private_mem_backing_src_alias(private_mem_src_type)->flag;
 	memfd = vm_create_guest_memfd(vm, memfd_size, flags);
 
 	for (i = 0; i < nr_memslots; i++) {
@@ -547,10 +557,13 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type,
 static void usage(const char *cmd)
 {
 	puts("");
-	printf("usage: %s [-h] [-g] [-m nr_memslots] [-s mem_type] [-n nr_vcpus]\n", cmd);
+	printf("usage: %s [-h] [-g] [-m nr_memslots] [-s mem_type] [-p private_mem_type] [-n nr_vcpus]\n",
+	       cmd);
 	puts("");
 	backing_src_help("-s");
 	puts("");
+	private_mem_backing_src_help("-p");
+	puts("");
 	puts(" -n: specify the number of vcpus (default: 1)");
 	puts("");
 	puts(" -m: specify the number of memslots (default: 1)");
@@ -561,6 +574,7 @@ static void usage(const char *cmd)
 
 int main(int argc, char *argv[])
 {
+	enum vm_private_mem_backing_src_type private_mem_src_type = DEFAULT_VM_PRIVATE_MEM_SRC;
 	enum vm_mem_backing_src_type src_type = DEFAULT_VM_MEM_SRC;
 	bool back_shared_memory_with_guest_memfd = false;
 	uint32_t nr_memslots = 1;
@@ -569,11 +583,14 @@ int main(int argc, char *argv[])
 
 	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
 
-	while ((opt = getopt(argc, argv, "hgm:s:n:")) != -1) {
+	while ((opt = getopt(argc, argv, "hgm:s:p:n:")) != -1) {
 		switch (opt) {
 		case 's':
 			src_type = parse_backing_src_type(optarg);
 			break;
+		case 'p':
+			private_mem_src_type = parse_private_mem_backing_src_type(optarg);
+			break;
 		case 'n':
 			nr_vcpus = atoi_positive("nr_vcpus", optarg);
 			break;
@@ -590,9 +607,8 @@ int main(int argc, char *argv[])
 		}
 	}
 
-	test_mem_conversions(src_type, nr_vcpus, nr_memslots,
-			     back_shared_memory_with_guest_memfd);
-
+	test_mem_conversions(src_type, private_mem_src_type, nr_vcpus,
+			     nr_memslots, back_shared_memory_with_guest_memfd);
 
 	return 0;
 }
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 49/51] KVM: selftests: Update private_mem_conversions_test.sh to test with HugeTLB pages
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (47 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 48/51] KVM: selftests: Update test for various private memory backing source types Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-14 23:42 ` [RFC PATCH v2 50/51] KVM: selftests: Add script to test HugeTLB statistics Ackerley Tng
                   ` (6 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Update test script to also test HugeTLB support for guest_memfd.
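
For example, HugeTLB pages can be reserved up front so that none of the
hugetlb-backed private mem source types are skipped (the counts below
are illustrative; 1G pages may need to be reserved at boot time on
systems with fragmented memory):

  echo 16 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
  echo 2 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

  ./private_mem_conversions_test.sh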

Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Change-Id: I7c6cc25d6b86e1e0dc74018f46c7e2796fab6357
---
 .../kvm/x86/private_mem_conversions_test.sh   | 29 ++++++++++++++-----
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh
index 5dda6916e071..0d2c5fa729fd 100755
--- a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh
+++ b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh
@@ -57,6 +57,17 @@ backing_src_types+=( shmem )
 	backing_src_types+=( shared_hugetlb ) || \
 	echo "skipping shared_hugetlb backing source type"
 
+private_mem_backing_src_types=( private_mem_guest_mem )
+[ -n "$hugepage_default_enabled" ] && \
+	private_mem_backing_src_types+=( private_mem_hugetlb ) || \
+	echo "skipping private_mem_hugetlb backing source type"
+[ -n "$hugepage_2mb_enabled" ] && \
+	private_mem_backing_src_types+=( private_mem_hugetlb_2mb ) || \
+	echo "skipping private_mem_hugetlb_2mb backing source type"
+[ -n "$hugepage_1gb_enabled" ] && \
+	private_mem_backing_src_types+=( private_mem_hugetlb_1gb ) || \
+	echo "skipping private_mem_hugetlb_1gb backing source type"
+
 set +e
 
 TEST_EXECUTABLE="$(dirname "$0")/private_mem_conversions_test"
@@ -66,17 +77,21 @@ TEST_EXECUTABLE="$(dirname "$0")/private_mem_conversions_test"
 
 	for src_type in "${backing_src_types[@]}"; do
 
-		set -x
+		for private_mem_src_type in "${private_mem_backing_src_types[@]}"; do
 
-                $TEST_EXECUTABLE -s "$src_type" -n $num_vcpus_to_test
-		$TEST_EXECUTABLE -s "$src_type" -n $num_vcpus_to_test -m $num_memslots_to_test
+			set -x
 
-                $TEST_EXECUTABLE -s "$src_type" -n $num_vcpus_to_test -g
-		$TEST_EXECUTABLE -s "$src_type" -n $num_vcpus_to_test -m $num_memslots_to_test -g
+			$TEST_EXECUTABLE -s "$src_type" -p "$private_mem_src_type" -n $num_vcpus_to_test
+			$TEST_EXECUTABLE -s "$src_type" -p "$private_mem_src_type" -n $num_vcpus_to_test -m $num_memslots_to_test
 
-		{ set +x; } 2>/dev/null
+			$TEST_EXECUTABLE -s "$src_type" -p "$private_mem_src_type" -n $num_vcpus_to_test -g
+			$TEST_EXECUTABLE -s "$src_type" -p "$private_mem_src_type" -n $num_vcpus_to_test -m $num_memslots_to_test -g
 
-		echo
+			{ set +x; } 2>/dev/null
+
+			echo
+
+		done
 
 	done
 )
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 50/51] KVM: selftests: Add script to test HugeTLB statistics
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (48 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 49/51] KVM: selftests: Update private_mem_conversions_test.sh to test with HugeTLB pages Ackerley Tng
@ 2025-05-14 23:42 ` Ackerley Tng
  2025-05-15 18:03 ` [RFC PATCH v2 00/51] 1G page support for guest_memfd Edgecombe, Rick P
                   ` (5 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-14 23:42 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

This script wraps other tests to check that HugeTLB statistics are
restored to what they were before the test was run.

This does not account for HugeTLB statistics updated by other,
non-test processes running in the background while the test runs.
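
For example, combining this with guest_memfd_provide_hugetlb_cgroup_mount.sh
(as also shown in the script header):

  ./guest_memfd_provide_hugetlb_cgroup_mount.sh \
    ./guest_memfd_wrap_test_check_hugetlb_reporting.sh \
    ./guest_memfd_hugetlb_reporting_test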

Change-Id: I1d827656ef215fd85e368f4a3629f306e7f33f18
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 ...memfd_wrap_test_check_hugetlb_reporting.sh | 95 +++++++++++++++++++
 1 file changed, 95 insertions(+)
 create mode 100755 tools/testing/selftests/kvm/guest_memfd_wrap_test_check_hugetlb_reporting.sh

diff --git a/tools/testing/selftests/kvm/guest_memfd_wrap_test_check_hugetlb_reporting.sh b/tools/testing/selftests/kvm/guest_memfd_wrap_test_check_hugetlb_reporting.sh
new file mode 100755
index 000000000000..475ec5c4ce1b
--- /dev/null
+++ b/tools/testing/selftests/kvm/guest_memfd_wrap_test_check_hugetlb_reporting.sh
@@ -0,0 +1,95 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Wrapper that runs test, checking that HugeTLB-related statistics have not
+# changed before and after test.
+#
+# Example:
+#   ./guest_memfd_wrap_test_check_hugetlb_reporting.sh ./guest_memfd_test
+#
+# Example of combining this with ./guest_memfd_provide_hugetlb_cgroup_mount.sh:
+#   ./guest_memfd_provide_hugetlb_cgroup_mount.sh \
+#     ./guest_memfd_wrap_test_check_hugetlb_reporting.sh \
+#     ./guest_memfd_hugetlb_reporting_test
+#
+# Copyright (C) 2025, Google LLC.
+
+declare -A baseline
+
+hugetlb_sizes=(
+  "2048kB"
+  "1048576kB"
+)
+
+statistics=(
+  "free_hugepages"
+  "nr_hugepages"
+  "nr_overcommit_hugepages"
+  "resv_hugepages"
+  "surplus_hugepages"
+)
+
+cgroup_hugetlb_sizes=(
+  "2MB"
+  "1GB"
+)
+
+cgroup_statistics=(
+  "limit_in_bytes"
+  "max_usage_in_bytes"
+  "usage_in_bytes"
+)
+
+establish_statistics_baseline() {
+  for size in "${hugetlb_sizes[@]}"; do
+
+    for statistic in "${statistics[@]}"; do
+
+      local path="/sys/kernel/mm/hugepages/hugepages-${size}/${statistic}"
+      baseline["$path"]=$(cat "$path")
+
+    done
+
+  done
+
+  if [ -n "$HUGETLB_CGROUP_PATH" ]; then
+
+    for size in "${cgroup_hugetlb_sizes[@]}"; do
+
+      for statistic in "${cgroup_statistics[@]}"; do
+
+        local rsvd_path="${HUGETLB_CGROUP_PATH}/hugetlb.${size}.rsvd.${statistic}"
+        local path="${HUGETLB_CGROUP_PATH}/hugetlb.${size}.${statistic}"
+
+        baseline["$rsvd_path"]=$(cat "$rsvd_path")
+        baseline["$path"]=$(cat "$path")
+
+      done
+
+    done
+
+  fi
+}
+
+assert_path_at_baseline() {
+  local path=$1
+
+  current=$(cat "$path")
+  expected=${baseline["$path"]}
+  if [ "$current" != "$expected"  ]; then
+    echo "$path was $current instead of $expected"
+  fi
+}
+
+assert_statistics_at_baseline() {
+  for path in "${!baseline[@]}"; do
+    assert_path_at_baseline $path
+  done
+}
+
+
+establish_statistics_baseline
+
+"$@"
+
+assert_statistics_at_baseline
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 16/51] mm: hugetlb: Consolidate interpretation of gbl_chg within alloc_hugetlb_folio()
  2025-05-14 23:41 ` [RFC PATCH v2 16/51] mm: hugetlb: Consolidate interpretation of gbl_chg within alloc_hugetlb_folio() Ackerley Tng
@ 2025-05-15  2:09   ` Matthew Wilcox
  2025-05-28  8:55   ` Binbin Wu
  2025-07-07 18:27   ` James Houghton
  2 siblings, 0 replies; 231+ messages in thread
From: Matthew Wilcox @ 2025-05-15  2:09 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

On Wed, May 14, 2025 at 04:41:55PM -0700, Ackerley Tng wrote:
> -static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
> -				struct vm_area_struct *vma,
> -				unsigned long address, long gbl_chg)
> +static struct folio *dequeue_hugetlb_folio(struct hstate *h,
> +					   struct vm_area_struct *vma,
> +					   unsigned long address)

Please don't mess with the indentation unless necessary.  Nobody
cares what your personal style preference is.  You're obscuring the
actual changes.


^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 03/51] KVM: selftests: Update guest_memfd_test for INIT_PRIVATE flag
  2025-05-14 23:41 ` [RFC PATCH v2 03/51] KVM: selftests: Update guest_memfd_test for INIT_PRIVATE flag Ackerley Tng
@ 2025-05-15 13:49   ` Ira Weiny
  2025-05-16 17:42     ` Ackerley Tng
  0 siblings, 1 reply; 231+ messages in thread
From: Ira Weiny @ 2025-05-15 13:49 UTC (permalink / raw)
  To: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Ackerley Tng wrote:
> Test that GUEST_MEMFD_FLAG_INIT_PRIVATE is only valid when
> GUEST_MEMFD_FLAG_SUPPORT_SHARED is set.
> 
> Change-Id: I506e236a232047cfaee17bcaed02ee14c8d25bbb
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  .../testing/selftests/kvm/guest_memfd_test.c  | 36 ++++++++++++-------
>  1 file changed, 24 insertions(+), 12 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
> index 60aaba5808a5..bf2876cbd711 100644
> --- a/tools/testing/selftests/kvm/guest_memfd_test.c
> +++ b/tools/testing/selftests/kvm/guest_memfd_test.c
> @@ -401,13 +401,31 @@ static void test_with_type(unsigned long vm_type, uint64_t guest_memfd_flags,
>  	kvm_vm_release(vm);
>  }
>  
> +static void test_vm_with_gmem_flag(struct kvm_vm *vm, uint64_t flag,
> +				   bool expect_valid)
> +{
> +	size_t page_size = getpagesize();
> +	int fd;
> +
> +	fd = __vm_create_guest_memfd(vm, page_size, flag);
> +
> +	if (expect_valid) {
> +		TEST_ASSERT(fd > 0,
> +			    "guest_memfd() with flag '0x%lx' should be valid",
> +			    flag);
> +		close(fd);
> +	} else {
> +		TEST_ASSERT(fd == -1 && errno == EINVAL,
> +			    "guest_memfd() with flag '0x%lx' should fail with EINVAL",
> +			    flag);
> +	}
> +}
> +
>  static void test_vm_type_gmem_flag_validity(unsigned long vm_type,
>  					    uint64_t expected_valid_flags)
>  {
> -	size_t page_size = getpagesize();
>  	struct kvm_vm *vm;
>  	uint64_t flag = 0;
> -	int fd;
>  
>  	if (!(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(vm_type)))
>  		return;
> @@ -415,17 +433,11 @@ static void test_vm_type_gmem_flag_validity(unsigned long vm_type,
>  	vm = vm_create_barebones_type(vm_type);
>  
>  	for (flag = BIT(0); flag; flag <<= 1) {
> -		fd = __vm_create_guest_memfd(vm, page_size, flag);
> +		test_vm_with_gmem_flag(vm, flag, flag & expected_valid_flags);
>  
> -		if (flag & expected_valid_flags) {
> -			TEST_ASSERT(fd > 0,
> -				    "guest_memfd() with flag '0x%lx' should be valid",
> -				    flag);
> -			close(fd);
> -		} else {
> -			TEST_ASSERT(fd == -1 && errno == EINVAL,
> -				    "guest_memfd() with flag '0x%lx' should fail with EINVAL",
> -				    flag);
> +		if (flag == GUEST_MEMFD_FLAG_SUPPORT_SHARED) {
> +			test_vm_with_gmem_flag(
> +				vm, flag | GUEST_MEMFD_FLAG_INIT_PRIVATE, true);

I don't understand the point of this check.  In 2/51 we set 
GUEST_MEMFD_FLAG_INIT_PRIVATE when GUEST_MEMFD_FLAG_SUPPORT_SHARED is set.

When can this check ever fail?

Ira

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-14 23:41 ` [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls Ackerley Tng
@ 2025-05-15 14:50   ` Ira Weiny
  2025-05-16 17:53     ` Ackerley Tng
  2025-05-20  9:22   ` Fuad Tabba
  2025-05-28  3:16   ` Binbin Wu
  2 siblings, 1 reply; 231+ messages in thread
From: Ira Weiny @ 2025-05-15 14:50 UTC (permalink / raw)
  To: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Ackerley Tng wrote:

[snip]

> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> 

[snip]

> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 590932499eba..f802116290ce 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -30,6 +30,10 @@ enum shareability {
>  };
>  
>  static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index);
> +static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
> +				      pgoff_t end);
> +static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
> +				    pgoff_t end);
>  
>  static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
>  {
> @@ -85,6 +89,306 @@ static struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t inde
>  	return kvm_gmem_get_folio(inode, index);
>  }
>  
> +/**
> + * kvm_gmem_shareability_store() - Sets shareability to @value for range.
> + *
> + * @mt: the shareability maple tree.
> + * @index: the range begins at this index in the inode.
> + * @nr_pages: number of PAGE_SIZE pages in this range.
> + * @value: the shareability value to set for this range.
> + *
> + * Unlike mtree_store_range(), this function also merges adjacent ranges that
> + * have the same values as an optimization.

Is this an optimization or something which will be required to convert
from shared back to private and back to a huge page mapping?

If this is purely an optimization it might be best to leave it out for now
to get functionality first.

I have more to review but wanted to ask this.

Ira

[snip]

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (49 preceding siblings ...)
  2025-05-14 23:42 ` [RFC PATCH v2 50/51] KVM: selftests: Add script to test HugeTLB statistics Ackerley Tng
@ 2025-05-15 18:03 ` Edgecombe, Rick P
  2025-05-15 18:42   ` Vishal Annapurve
  2025-05-16  0:22 ` [RFC PATCH v2 51/51] KVM: selftests: Test guest_memfd for accuracy of st_blocks Ackerley Tng
                   ` (4 subsequent siblings)
  55 siblings, 1 reply; 231+ messages in thread
From: Edgecombe, Rick P @ 2025-05-15 18:03 UTC (permalink / raw)
  To: ackerleytng@google.com, kvm@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, x86@kernel.org
  Cc: palmer@dabbelt.com, pvorel@suse.cz, catalin.marinas@arm.com,
	Miao, Jun, Shutemov, Kirill, pdurrant@amazon.co.uk,
	steven.price@arm.com, peterx@redhat.com, vbabka@suse.cz,
	jack@suse.cz, amoorthy@google.com, maz@kernel.org,
	keirf@google.com, vkuznets@redhat.com, quic_eberman@quicinc.com,
	mail@maciej.szmigiero.name, hughd@google.com,
	anthony.yznaga@oracle.com, Wang, Wei W, Du, Fan,
	Wieczor-Retman, Maciej, quic_svaddagi@quicinc.com, Hansen, Dave,
	ajones@ventanamicro.com, paul.walmsley@sifive.com,
	nsaenz@amazon.es, aik@amd.com, usama.arif@bytedance.com,
	quic_mnalajal@quicinc.com, fvdl@google.com, rppt@kernel.org,
	quic_cvanscha@quicinc.com, bfoster@redhat.com,
	willy@infradead.org, anup@brainfault.org, thomas.lendacky@amd.com,
	tabba@google.com, mic@digikod.net, oliver.upton@linux.dev,
	akpm@linux-foundation.org, Zhao, Yan Y, binbin.wu@linux.intel.com,
	muchun.song@linux.dev, Li, Zhiquan1, rientjes@google.com,
	mpe@ellerman.id.au, Aktas, Erdem, david@redhat.com, jgg@ziepe.ca,
	Annapurve, Vishal, Xu, Haibo1, jhubbard@nvidia.com,
	Yamahata, Isaku, jthoughton@google.com, will@kernel.org,
	steven.sistare@oracle.com, jarkko@kernel.org,
	quic_pheragu@quicinc.com, chenhuacai@kernel.org, Huang, Kai,
	shuah@kernel.org, dwmw@amazon.co.uk, pankaj.gupta@amd.com,
	Peng, Chao P, nikunj@amd.com, Graf, Alexander,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Xu, Yilun, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com,
	richard.weiyang@gmail.com, Weiny, Ira, aou@eecs.berkeley.edu,
	Li, Xiaoyao, qperret@google.com, kent.overstreet@linux.dev,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	pgonda@google.com, quic_pderrin@quicinc.com, hch@infradead.org,
	roypat@amazon.co.uk, seanjc@google.com

On Wed, 2025-05-14 at 16:41 -0700, Ackerley Tng wrote:
> Hello,
> 
> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> upstream calls to provide 1G page support for guest_memfd by taking
> pages from HugeTLB.

Do you have any more concrete numbers on benefits of 1GB huge pages for
guestmemfd/coco VMs? I saw in the LPC talk it has the benefits as:
- Increase TLB hit rate and reduce page walks on TLB miss
- Improved IO performance
- Memory savings of ~1.6% from HugeTLB Vmemmap Optimization (HVO)
- Bring guest_memfd to parity with existing VMs that use HugeTLB pages for
backing memory

Do you know how often the 1GB TDP mappings get shattered by shared pages?

Thinking from the TDX perspective, we might have bigger fish to fry than 1.6%
memory savings (for example dynamic PAMT), and the rest of the benefits don't
have numbers. How much are we getting for all the complexity, over say buddy
allocated 2MB pages?

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-05-15 18:03 ` [RFC PATCH v2 00/51] 1G page support for guest_memfd Edgecombe, Rick P
@ 2025-05-15 18:42   ` Vishal Annapurve
  2025-05-15 23:35     ` Edgecombe, Rick P
  0 siblings, 1 reply; 231+ messages in thread
From: Vishal Annapurve @ 2025-05-15 18:42 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: ackerleytng@google.com, kvm@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, x86@kernel.org, palmer@dabbelt.com,
	pvorel@suse.cz, catalin.marinas@arm.com, Miao, Jun,
	Shutemov, Kirill, pdurrant@amazon.co.uk, steven.price@arm.com,
	peterx@redhat.com, vbabka@suse.cz, jack@suse.cz,
	amoorthy@google.com, maz@kernel.org, keirf@google.com,
	vkuznets@redhat.com, quic_eberman@quicinc.com,
	mail@maciej.szmigiero.name, hughd@google.com,
	anthony.yznaga@oracle.com, Wang, Wei W, Du, Fan,
	Wieczor-Retman, Maciej, quic_svaddagi@quicinc.com, Hansen, Dave,
	ajones@ventanamicro.com, paul.walmsley@sifive.com,
	nsaenz@amazon.es, aik@amd.com, usama.arif@bytedance.com,
	quic_mnalajal@quicinc.com, fvdl@google.com, rppt@kernel.org,
	quic_cvanscha@quicinc.com, bfoster@redhat.com,
	willy@infradead.org, anup@brainfault.org, thomas.lendacky@amd.com,
	tabba@google.com, mic@digikod.net, oliver.upton@linux.dev,
	akpm@linux-foundation.org, Zhao, Yan Y, binbin.wu@linux.intel.com,
	muchun.song@linux.dev, Li, Zhiquan1, rientjes@google.com,
	mpe@ellerman.id.au, Aktas, Erdem, david@redhat.com, jgg@ziepe.ca,
	Xu, Haibo1, jhubbard@nvidia.com, Yamahata, Isaku,
	jthoughton@google.com, will@kernel.org, steven.sistare@oracle.com,
	jarkko@kernel.org, quic_pheragu@quicinc.com,
	chenhuacai@kernel.org, Huang, Kai, shuah@kernel.org,
	dwmw@amazon.co.uk, pankaj.gupta@amd.com, Peng, Chao P,
	nikunj@amd.com, Graf, Alexander, viro@zeniv.linux.org.uk,
	pbonzini@redhat.com, yuzenghui@huawei.com, jroedel@suse.de,
	suzuki.poulose@arm.com, jgowans@amazon.com, Xu, Yilun,
	liam.merwick@oracle.com, michael.roth@amd.com,
	quic_tsoni@quicinc.com, richard.weiyang@gmail.com, Weiny, Ira,
	aou@eecs.berkeley.edu, Li, Xiaoyao, qperret@google.com,
	kent.overstreet@linux.dev, dmatlack@google.com,
	james.morse@arm.com, brauner@kernel.org, pgonda@google.com,
	quic_pderrin@quicinc.com, hch@infradead.org, roypat@amazon.co.uk,
	seanjc@google.com

On Thu, May 15, 2025 at 11:03 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Wed, 2025-05-14 at 16:41 -0700, Ackerley Tng wrote:
> > Hello,
> >
> > This patchset builds upon discussion at LPC 2024 and many guest_memfd
> > upstream calls to provide 1G page support for guest_memfd by taking
> > pages from HugeTLB.
>
> Do you have any more concrete numbers on benefits of 1GB huge pages for
> guestmemfd/coco VMs? I saw in the LPC talk it has the benefits as:
> - Increase TLB hit rate and reduce page walks on TLB miss
> - Improved IO performance
> - Memory savings of ~1.6% from HugeTLB Vmemmap Optimization (HVO)
> - Bring guest_memfd to parity with existing VMs that use HugeTLB pages for
> backing memory
>
> Do you know how often the 1GB TDP mappings get shattered by shared pages?
>
> Thinking from the TDX perspective, we might have bigger fish to fry than 1.6%
> memory savings (for example dynamic PAMT), and the rest of the benefits don't
> have numbers. How much are we getting for all the complexity, over say buddy
> allocated 2MB pages?

This series should work for any page sizes backed by hugetlb memory.
Non-CoCo VMs, pKVM and Confidential VMs all need hugepages that are
essential for certain workloads and will emerge as guest_memfd users.
Features like KHO/memory persistence in addition also depend on
hugepage support in guest_memfd.

This series takes strides towards making guest_memfd compatible with
usecases where 1G pages are essential and non-confidential VMs are
already exercising them.

I think the main complexity here lies in supporting in-place
conversion which applies to any huge page size even for buddy
allocated 2MB pages or THP.

This complexity arises because page structs work at a fixed
granularity; the future roadmap towards not having page structs for
guest memory (at least private memory to begin with) should greatly
reduce this complexity.

That being said, DPAMT and huge page EPT mappings for TDX VMs remain
essential and complement this series well for better memory footprint
and overall performance of TDX VMs.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-05-15 18:42   ` Vishal Annapurve
@ 2025-05-15 23:35     ` Edgecombe, Rick P
  2025-05-16  0:57       ` Sean Christopherson
  0 siblings, 1 reply; 231+ messages in thread
From: Edgecombe, Rick P @ 2025-05-15 23:35 UTC (permalink / raw)
  To: Annapurve, Vishal
  Cc: palmer@dabbelt.com, kvm@vger.kernel.org, catalin.marinas@arm.com,
	Miao, Jun, nsaenz@amazon.es, pdurrant@amazon.co.uk,
	vbabka@suse.cz, peterx@redhat.com, x86@kernel.org,
	tabba@google.com, keirf@google.com, quic_svaddagi@quicinc.com,
	amoorthy@google.com, pvorel@suse.cz, quic_eberman@quicinc.com,
	mail@maciej.szmigiero.name, vkuznets@redhat.com,
	anthony.yznaga@oracle.com, Wang, Wei W, jack@suse.cz,
	Wieczor-Retman, Maciej, Zhao, Yan Y, Hansen, Dave,
	ajones@ventanamicro.com, paul.walmsley@sifive.com,
	quic_mnalajal@quicinc.com, aik@amd.com, usama.arif@bytedance.com,
	willy@infradead.org, rppt@kernel.org, bfoster@redhat.com,
	quic_cvanscha@quicinc.com, Du, Fan, fvdl@google.com,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	mic@digikod.net, oliver.upton@linux.dev,
	akpm@linux-foundation.org, steven.price@arm.com,
	muchun.song@linux.dev, binbin.wu@linux.intel.com, Li, Zhiquan1,
	rientjes@google.com, mpe@ellerman.id.au, Aktas, Erdem,
	david@redhat.com, jgg@ziepe.ca, hughd@google.com, Xu, Haibo1,
	jhubbard@nvidia.com, anup@brainfault.org, maz@kernel.org,
	Yamahata, Isaku, jthoughton@google.com, steven.sistare@oracle.com,
	jarkko@kernel.org, quic_pheragu@quicinc.com, Shutemov, Kirill,
	chenhuacai@kernel.org, Huang, Kai, shuah@kernel.org,
	dwmw@amazon.co.uk, pankaj.gupta@amd.com, Peng, Chao P,
	nikunj@amd.com, Graf, Alexander, viro@zeniv.linux.org.uk,
	pbonzini@redhat.com, yuzenghui@huawei.com, jroedel@suse.de,
	suzuki.poulose@arm.com, jgowans@amazon.com, Xu, Yilun,
	liam.merwick@oracle.com, michael.roth@amd.com,
	quic_tsoni@quicinc.com, richard.weiyang@gmail.com, Weiny, Ira,
	aou@eecs.berkeley.edu, Li, Xiaoyao, qperret@google.com,
	kent.overstreet@linux.dev, dmatlack@google.com,
	james.morse@arm.com, brauner@kernel.org, roypat@amazon.co.uk,
	ackerleytng@google.com, linux-fsdevel@vger.kernel.org,
	pgonda@google.com, quic_pderrin@quicinc.com, linux-mm@kvack.org,
	will@kernel.org, seanjc@google.com, hch@infradead.org

On Thu, 2025-05-15 at 11:42 -0700, Vishal Annapurve wrote:
> On Thu, May 15, 2025 at 11:03 AM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> > 
> > On Wed, 2025-05-14 at 16:41 -0700, Ackerley Tng wrote:
> > > Hello,
> > > 
> > > This patchset builds upon discussion at LPC 2024 and many guest_memfd
> > > upstream calls to provide 1G page support for guest_memfd by taking
> > > pages from HugeTLB.
> > 
> > Do you have any more concrete numbers on benefits of 1GB huge pages for
> > guestmemfd/coco VMs? I saw in the LPC talk it has the benefits as:
> > - Increase TLB hit rate and reduce page walks on TLB miss
> > - Improved IO performance
> > - Memory savings of ~1.6% from HugeTLB Vmemmap Optimization (HVO)
> > - Bring guest_memfd to parity with existing VMs that use HugeTLB pages for
> > backing memory
> > 
> > Do you know how often the 1GB TDP mappings get shattered by shared pages?
> > 
> > Thinking from the TDX perspective, we might have bigger fish to fry than 1.6%
> > memory savings (for example dynamic PAMT), and the rest of the benefits don't
> > have numbers. How much are we getting for all the complexity, over say buddy
> > allocated 2MB pages?
> 
> This series should work for any page sizes backed by hugetlb memory.
> Non-CoCo VMs, pKVM and Confidential VMs all need hugepages that are
> essential for certain workloads and will emerge as guest_memfd users.
> Features like KHO/memory persistence in addition also depend on
> hugepage support in guest_memfd.
> 
> This series takes strides towards making guest_memfd compatible with
> usecases where 1G pages are essential and non-confidential VMs are
> already exercising them.
> 
> I think the main complexity here lies in supporting in-place
> conversion which applies to any huge page size even for buddy
> allocated 2MB pages or THP.
> 
> This complexity arises because page structs work at a fixed
> granularity; the future roadmap towards not having page structs for
> guest memory (at least private memory to begin with) should greatly
> reduce this complexity.
> 
> That being said, DPAMT and huge page EPT mappings for TDX VMs remain
> essential and complement this series well for better memory footprint
> and overall performance of TDX VMs.

Hmm, this didn't really answer my questions about the concrete benefits.

I think it would help to include this kind of justification for the 1GB
guestmemfd pages. "essential for certain workloads and will emerge" is a bit
hard to review against...

I think one of the challenges with coco is that it's almost like a sprint to
reimplement virtualization. But enough things are changing at once that not all
of the normal assumptions hold, so it can't copy all the same solutions. The
recent example was that for TDX huge pages we found that normal promotion paths
weren't actually yielding any benefit for surprising TDX specific reasons.

On the TDX side we are also, at least currently, unmapping private pages while
they are mapped shared, so any 1GB pages would get split to 2MB if there are any
shared pages in them. I wonder how many 1GB pages there would be after all the
shared pages are converted. At smaller TD sizes, it could be not much.

So for TDX in isolation, it seems like jumping out too far ahead to effectively
consider the value. But presumably you guys are testing this on SEV or
something? Have you measured any performance improvement? For what kind of
applications? Or is the idea basically to make guestmemfd work like however
Google does guest memory?


^ permalink raw reply	[flat|nested] 231+ messages in thread

* [RFC PATCH v2 51/51] KVM: selftests: Test guest_memfd for accuracy of st_blocks
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (50 preceding siblings ...)
  2025-05-15 18:03 ` [RFC PATCH v2 00/51] 1G page support for guest_memfd Edgecombe, Rick P
@ 2025-05-16  0:22 ` Ackerley Tng
  2025-05-16 19:48 ` [RFC PATCH v2 00/51] 1G page support for guest_memfd Ira Weiny
                   ` (3 subsequent siblings)
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-16  0:22 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li

Test that st_blocks in struct stat (inode->i_blocks) is updated.

Change-Id: I67d814f130671b6b64b575e6a25fd17b1994c640
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../testing/selftests/kvm/guest_memfd_test.c  | 55 ++++++++++++++++---
 1 file changed, 46 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index c8acccaa9e1d..f51cd876d7dc 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -142,41 +142,78 @@ static void test_file_size(int fd, size_t page_size, size_t total_size)
 	TEST_ASSERT_EQ(sb.st_blksize, page_size);
 }
 
-static void test_fallocate(int fd, size_t page_size, size_t total_size)
+static void assert_st_blocks_equals_size(int fd, size_t page_size, size_t expected_size)
 {
+	struct stat sb;
+	int ret;
+
+	/* TODO: st_blocks is not updated for 4K-page guest_memfd. */
+	if (page_size == getpagesize())
+		return;
+
+	ret = fstat(fd, &sb);
+	TEST_ASSERT(!ret, "fstat should succeed");
+	TEST_ASSERT_EQ(sb.st_blocks, expected_size / 512);
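+	/* st_blocks is reported in 512-byte units. */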
+}
+
+static void test_fallocate(int fd, size_t test_page_size, size_t total_size)
+{
+	size_t page_size;
 	int ret;
 
 	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, total_size);
 	TEST_ASSERT(!ret, "fallocate with aligned offset and size should succeed");
+	assert_st_blocks_equals_size(fd, test_page_size, total_size);
 
 	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
-			page_size - 1, page_size);
+			test_page_size - 1, test_page_size);
 	TEST_ASSERT(ret, "fallocate with unaligned offset should fail");
+	assert_st_blocks_equals_size(fd, test_page_size, total_size);
 
-	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size, page_size);
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size, test_page_size);
 	TEST_ASSERT(ret, "fallocate beginning at total_size should fail");
+	assert_st_blocks_equals_size(fd, test_page_size, total_size);
 
-	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size + page_size, page_size);
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size + test_page_size, test_page_size);
 	TEST_ASSERT(ret, "fallocate beginning after total_size should fail");
+	assert_st_blocks_equals_size(fd, test_page_size, total_size);
 
 	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
-			total_size, page_size);
+			total_size, test_page_size);
 	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) at total_size should succeed");
+	assert_st_blocks_equals_size(fd, test_page_size, total_size);
 
 	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
-			total_size + page_size, page_size);
+			total_size + test_page_size, test_page_size);
 	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) after total_size should succeed");
+	assert_st_blocks_equals_size(fd, test_page_size, total_size);
 
 	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
-			page_size, page_size - 1);
+			test_page_size, test_page_size - 1);
 	TEST_ASSERT(ret, "fallocate with unaligned size should fail");
+	assert_st_blocks_equals_size(fd, test_page_size, total_size);
 
 	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
-			page_size, page_size);
+			test_page_size, test_page_size);
 	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) with aligned offset and size should succeed");
+	assert_st_blocks_equals_size(fd, test_page_size, total_size - test_page_size);
 
-	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, page_size, page_size);
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+			test_page_size, test_page_size);
+	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) in a hole should succeed");
+	assert_st_blocks_equals_size(fd, test_page_size, total_size - test_page_size);
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, test_page_size, test_page_size);
 	TEST_ASSERT(!ret, "fallocate to restore punched hole should succeed");
+	assert_st_blocks_equals_size(fd, test_page_size, total_size);
+
+	page_size = getpagesize();
+	if (test_page_size == page_size) {
+		ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+				test_page_size + page_size, page_size);
+		TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) of a subfolio should succeed");
+		assert_st_blocks_equals_size(fd, test_page_size, total_size);
+	}
 }
 
 static void test_invalid_punch_hole(int fd, size_t page_size, size_t total_size)
-- 
2.49.0.1045.g170613ef41-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-05-15 23:35     ` Edgecombe, Rick P
@ 2025-05-16  0:57       ` Sean Christopherson
  2025-05-16  2:12         ` Edgecombe, Rick P
  2025-05-16 13:09         ` Jason Gunthorpe
  0 siblings, 2 replies; 231+ messages in thread
From: Sean Christopherson @ 2025-05-16  0:57 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Vishal Annapurve, palmer@dabbelt.com, kvm@vger.kernel.org,
	catalin.marinas@arm.com, Jun Miao, nsaenz@amazon.es,
	pdurrant@amazon.co.uk, vbabka@suse.cz, peterx@redhat.com,
	x86@kernel.org, tabba@google.com, keirf@google.com,
	quic_svaddagi@quicinc.com, amoorthy@google.com, pvorel@suse.cz,
	quic_eberman@quicinc.com, mail@maciej.szmigiero.name,
	vkuznets@redhat.com, anthony.yznaga@oracle.com, Wei W Wang,
	jack@suse.cz, Maciej Wieczor-Retman, Yan Y Zhao, Dave Hansen,
	ajones@ventanamicro.com, paul.walmsley@sifive.com,
	quic_mnalajal@quicinc.com, aik@amd.com, usama.arif@bytedance.com,
	willy@infradead.org, rppt@kernel.org, bfoster@redhat.com,
	quic_cvanscha@quicinc.com, Fan Du, fvdl@google.com,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	mic@digikod.net, oliver.upton@linux.dev,
	akpm@linux-foundation.org, steven.price@arm.com,
	muchun.song@linux.dev, binbin.wu@linux.intel.com, Zhiquan1 Li,
	rientjes@google.com, mpe@ellerman.id.au, Erdem Aktas,
	david@redhat.com, jgg@ziepe.ca, hughd@google.com, Haibo1 Xu,
	jhubbard@nvidia.com, anup@brainfault.org, maz@kernel.org,
	Isaku Yamahata, jthoughton@google.com, steven.sistare@oracle.com,
	jarkko@kernel.org, quic_pheragu@quicinc.com, Kirill Shutemov,
	chenhuacai@kernel.org, Kai Huang, shuah@kernel.org,
	dwmw@amazon.co.uk, pankaj.gupta@amd.com, Chao Peng,
	nikunj@amd.com, Alexander Graf, viro@zeniv.linux.org.uk,
	pbonzini@redhat.com, yuzenghui@huawei.com, jroedel@suse.de,
	suzuki.poulose@arm.com, jgowans@amazon.com, Yilun Xu,
	liam.merwick@oracle.com, michael.roth@amd.com,
	quic_tsoni@quicinc.com, richard.weiyang@gmail.com, Ira Weiny,
	aou@eecs.berkeley.edu, Xiaoyao Li, qperret@google.com,
	kent.overstreet@linux.dev, dmatlack@google.com,
	james.morse@arm.com, brauner@kernel.org, roypat@amazon.co.uk,
	ackerleytng@google.com, linux-fsdevel@vger.kernel.org,
	pgonda@google.com, quic_pderrin@quicinc.com, linux-mm@kvack.org,
	will@kernel.org, hch@infradead.org

On Thu, May 15, 2025, Rick P Edgecombe wrote:
> On Thu, 2025-05-15 at 11:42 -0700, Vishal Annapurve wrote:
> > On Thu, May 15, 2025 at 11:03 AM Edgecombe, Rick P
> > <rick.p.edgecombe@intel.com> wrote:
> > > 
> > > On Wed, 2025-05-14 at 16:41 -0700, Ackerley Tng wrote:
> > > > Hello,
> > > > 
> > > > This patchset builds upon discussion at LPC 2024 and many guest_memfd
> > > > upstream calls to provide 1G page support for guest_memfd by taking
> > > > pages from HugeTLB.
> > > 
> > > Do you have any more concrete numbers on benefits of 1GB huge pages for
> > > guestmemfd/coco VMs? I saw in the LPC talk it has the benefits as:
> > > - Increase TLB hit rate and reduce page walks on TLB miss
> > > - Improved IO performance
> > > - Memory savings of ~1.6% from HugeTLB Vmemmap Optimization (HVO)
> > > - Bring guest_memfd to parity with existing VMs that use HugeTLB pages for
> > > backing memory
> > > 
> > > Do you know how often the 1GB TDP mappings get shattered by shared pages?
> > > 
> > > Thinking from the TDX perspective, we might have bigger fish to fry than 1.6%
> > > memory savings (for example dynamic PAMT), and the rest of the benefits don't
> > > have numbers. How much are we getting for all the complexity, over say buddy
> > > allocated 2MB pages?

TDX may have bigger fish to fry, but some of us have bigger fish to fry than TDX :-)

> > This series should work for any page sizes backed by hugetlb memory.
> > Non-CoCo VMs, pKVM and Confidential VMs all need hugepages that are
> > essential for certain workloads and will emerge as guest_memfd users.
> > Features like KHO/memory persistence in addition also depend on
> > hugepage support in guest_memfd.
> > 
> > This series takes strides towards making guest_memfd compatible with
> > usecases where 1G pages are essential and non-confidential VMs are
> > already exercising them.
> > 
> > I think the main complexity here lies in supporting in-place
> > conversion which applies to any huge page size even for buddy
> > allocated 2MB pages or THP.
> > 
> > This complexity arises because page structs work at a fixed
> > granularity, future roadmap towards not having page structs for guest
> > memory (at least private memory to begin with) should help towards
> > greatly reducing this complexity.
> > 
> > That being said, DPAMT and huge page EPT mappings for TDX VMs remain
> > essential and complement this series well for better memory footprint
> > and overall performance of TDX VMs.
> 
> Hmm, this didn't really answer my questions about the concrete benefits.
> 
> I think it would help to include this kind of justification for the 1GB
> guestmemfd pages. "essential for certain workloads and will emerge" is a bit
> hard to review against...
> 
> I think one of the challenges with coco is that it's almost like a sprint to
> reimplement virtualization. But enough things are changing at once that not all
> of the normal assumptions hold, so it can't copy all the same solutions. The
> recent example was that for TDX huge pages we found that normal promotion paths
> weren't actually yielding any benefit for surprising TDX specific reasons.
> 
> On the TDX side we are also, at least currently, unmapping private pages while
> they are mapped shared, so any 1GB pages would get split to 2MB if there are any
> shared pages in them. I wonder how many 1GB pages there would be after all the
> shared pages are converted. At smaller TD sizes, it could be not much.

You're conflating two different things.  guest_memfd allocating and managing
1GiB physical pages, and KVM mapping memory into the guest at 1GiB/2MiB
granularity.  Allocating memory in 1GiB chunks is useful even if KVM can only
map memory into the guest using 4KiB pages.

> So for TDX in isolation, it seems like jumping out too far ahead to effectively
> consider the value. But presumably you guys are testing this on SEV or
> something? Have you measured any performance improvement? For what kind of
> applications? Or is the idea to basically to make guestmemfd work like however
> Google does guest memory?

The longer term goal of guest_memfd is to make it suitable for backing all VMs,
hence Vishal's "Non-CoCo VMs" comment.  Yes, some of this is useful for TDX, but
we (and others) want to use guest_memfd for far more than just CoCo VMs.  And
for non-CoCo VMs, 1GiB hugepages are mandatory for various workloads.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-05-16  0:57       ` Sean Christopherson
@ 2025-05-16  2:12         ` Edgecombe, Rick P
  2025-05-16 13:11           ` Vishal Annapurve
  2025-05-16 13:09         ` Jason Gunthorpe
  1 sibling, 1 reply; 231+ messages in thread
From: Edgecombe, Rick P @ 2025-05-16  2:12 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: pvorel@suse.cz, kvm@vger.kernel.org, catalin.marinas@arm.com,
	Miao, Jun, Shutemov, Kirill, pdurrant@amazon.co.uk,
	steven.price@arm.com, peterx@redhat.com, x86@kernel.org,
	amoorthy@google.com, tabba@google.com, quic_svaddagi@quicinc.com,
	maz@kernel.org, vkuznets@redhat.com, quic_eberman@quicinc.com,
	keirf@google.com, hughd@google.com, Annapurve, Vishal,
	mail@maciej.szmigiero.name, palmer@dabbelt.com,
	Wieczor-Retman, Maciej, Zhao, Yan Y, ajones@ventanamicro.com,
	willy@infradead.org, jack@suse.cz, paul.walmsley@sifive.com,
	aik@amd.com, usama.arif@bytedance.com, quic_mnalajal@quicinc.com,
	fvdl@google.com, rppt@kernel.org, quic_cvanscha@quicinc.com,
	nsaenz@amazon.es, vbabka@suse.cz, Du, Fan,
	anthony.yznaga@oracle.com, linux-kernel@vger.kernel.org,
	thomas.lendacky@amd.com, mic@digikod.net, oliver.upton@linux.dev,
	akpm@linux-foundation.org, bfoster@redhat.com,
	binbin.wu@linux.intel.com, muchun.song@linux.dev, Li, Zhiquan1,
	rientjes@google.com, mpe@ellerman.id.au, Aktas, Erdem,
	david@redhat.com, jgg@ziepe.ca, jhubbard@nvidia.com, Xu, Haibo1,
	anup@brainfault.org, Hansen, Dave, Yamahata, Isaku,
	jthoughton@google.com, Wang, Wei W, steven.sistare@oracle.com,
	jarkko@kernel.org, quic_pheragu@quicinc.com,
	chenhuacai@kernel.org, Huang, Kai, shuah@kernel.org,
	dwmw@amazon.co.uk, pankaj.gupta@amd.com, Peng, Chao P,
	nikunj@amd.com, Graf, Alexander, viro@zeniv.linux.org.uk,
	pbonzini@redhat.com, yuzenghui@huawei.com, jroedel@suse.de,
	suzuki.poulose@arm.com, jgowans@amazon.com, Xu, Yilun,
	liam.merwick@oracle.com, michael.roth@amd.com,
	quic_tsoni@quicinc.com, richard.weiyang@gmail.com, Weiny, Ira,
	aou@eecs.berkeley.edu, Li, Xiaoyao, qperret@google.com,
	kent.overstreet@linux.dev, dmatlack@google.com,
	james.morse@arm.com, brauner@kernel.org, ackerleytng@google.com,
	linux-fsdevel@vger.kernel.org, pgonda@google.com,
	quic_pderrin@quicinc.com, roypat@amazon.co.uk, linux-mm@kvack.org,
	will@kernel.org, hch@infradead.org

On Thu, 2025-05-15 at 17:57 -0700, Sean Christopherson wrote:
> > > > Thinking from the TDX perspective, we might have bigger fish to fry than
> > > > 1.6% memory savings (for example dynamic PAMT), and the rest of the
> > > > benefits don't have numbers. How much are we getting for all the
> > > > complexity, over say buddy allocated 2MB pages?
> 
> TDX may have bigger fish to fry, but some of us have bigger fish to fry than
> TDX :-)

Fair enough. But TDX is on the "roadmap". So it helps to say what the target of
this series is.

> 
> > > This series should work for any page sizes backed by hugetlb memory.
> > > Non-CoCo VMs, pKVM and Confidential VMs all need hugepages that are
> > > essential for certain workloads and will emerge as guest_memfd users.
> > > Features like KHO/memory persistence in addition also depend on
> > > hugepage support in guest_memfd.
> > > 
> > > This series takes strides towards making guest_memfd compatible with
> > > usecases where 1G pages are essential and non-confidential VMs are
> > > already exercising them.
> > > 
> > > I think the main complexity here lies in supporting in-place
> > > conversion which applies to any huge page size even for buddy
> > > allocated 2MB pages or THP.
> > > 
> > > This complexity arises because page structs work at a fixed
> > > granularity, future roadmap towards not having page structs for guest
> > > memory (at least private memory to begin with) should help towards
> > > greatly reducing this complexity.
> > > 
> > > That being said, DPAMT and huge page EPT mappings for TDX VMs remain
> > > essential and complement this series well for better memory footprint
> > > and overall performance of TDX VMs.
> > 
> > Hmm, this didn't really answer my questions about the concrete benefits.
> > 
> > I think it would help to include this kind of justification for the 1GB
> > guestmemfd pages. "essential for certain workloads and will emerge" is a bit
> > hard to review against...
> > 
> > I think one of the challenges with coco is that it's almost like a sprint to
> > reimplement virtualization. But enough things are changing at once that not
> > all of the normal assumptions hold, so it can't copy all the same solutions.
> > The recent example was that for TDX huge pages we found that normal
> > promotion paths weren't actually yielding any benefit for surprising TDX
> > specific reasons.
> > 
> > On the TDX side we are also, at least currently, unmapping private pages
> > while they are mapped shared, so any 1GB pages would get split to 2MB if
> > there are any shared pages in them. I wonder how many 1GB pages there would
> > be after all the shared pages are converted. At smaller TD sizes, it could
> > be not much.
> 
> You're conflating two different things.  guest_memfd allocating and managing
> 1GiB physical pages, and KVM mapping memory into the guest at 1GiB/2MiB
> granularity.  Allocating memory in 1GiB chunks is useful even if KVM can only
> map memory into the guest using 4KiB pages.

I'm aware of the 1.6% vmemmap benefits from the LPC talk. Is there more? The
list quoted there was more about guest performance. Or maybe the clever page
table walkers that find contiguous small mappings could benefit guest
performance too? It's the kind of thing I'd like to see at least broadly called
out.

I'm thinking that Google must have a ridiculous amount of learnings about VM
memory management. And this is probably designed around those learnings. But
reviewers can't really evaluate it if they don't know the reasons and tradeoffs
taken. If it's going upstream, I think it should have at least the high level
reasoning explained.

I don't mean to harp on the point so hard, but I didn't expect it to be
controversial either.

> 
> > So for TDX in isolation, it seems like jumping out too far ahead to
> > effectively consider the value. But presumably you guys are testing this on
> > SEV or something? Have you measured any performance improvement? For what
> > kind of applications? Or is the idea to basically to make guestmemfd work
> > like however Google does guest memory?
> 
> The longer term goal of guest_memfd is to make it suitable for backing all
> VMs, hence Vishal's "Non-CoCo VMs" comment.

Oh, I actually wasn't aware of this. Or maybe I remember now. I thought he was
talking about pKVM.

>   Yes, some of this is useful for TDX, but we (and others) want to use
> guest_memfd for far more than just CoCo VMs. 


>  And for non-CoCo VMs, 1GiB hugepages are mandatory for various workloads.

I've heard this a lot. It must be true, but I've never seen the actual numbers.
For a long time people believed 1GB huge pages on the direct map were critical,
but then benchmarking on a contemporary CPU couldn't find much difference
between 2MB and 1GB. I'd expect TDP huge pages to be different than that because
the combined walks are huge, iTLB, etc, but I'd love to see a real number.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-05-16  0:57       ` Sean Christopherson
  2025-05-16  2:12         ` Edgecombe, Rick P
@ 2025-05-16 13:09         ` Jason Gunthorpe
  2025-05-16 17:04           ` Edgecombe, Rick P
  1 sibling, 1 reply; 231+ messages in thread
From: Jason Gunthorpe @ 2025-05-16 13:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Rick P Edgecombe, Vishal Annapurve, palmer@dabbelt.com,
	kvm@vger.kernel.org, catalin.marinas@arm.com, Jun Miao,
	nsaenz@amazon.es, pdurrant@amazon.co.uk, vbabka@suse.cz,
	peterx@redhat.com, x86@kernel.org, tabba@google.com,
	keirf@google.com, quic_svaddagi@quicinc.com, amoorthy@google.com,
	pvorel@suse.cz, quic_eberman@quicinc.com,
	mail@maciej.szmigiero.name, vkuznets@redhat.com,
	anthony.yznaga@oracle.com, Wei W Wang, jack@suse.cz,
	Maciej Wieczor-Retman, Yan Y Zhao, Dave Hansen,
	ajones@ventanamicro.com, paul.walmsley@sifive.com,
	quic_mnalajal@quicinc.com, aik@amd.com, usama.arif@bytedance.com,
	willy@infradead.org, rppt@kernel.org, bfoster@redhat.com,
	quic_cvanscha@quicinc.com, Fan Du, fvdl@google.com,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	mic@digikod.net, oliver.upton@linux.dev,
	akpm@linux-foundation.org, steven.price@arm.com,
	muchun.song@linux.dev, binbin.wu@linux.intel.com, Zhiquan1 Li,
	rientjes@google.com, mpe@ellerman.id.au, Erdem Aktas,
	david@redhat.com, hughd@google.com, Haibo1 Xu,
	jhubbard@nvidia.com, anup@brainfault.org, maz@kernel.org,
	Isaku Yamahata, jthoughton@google.com, steven.sistare@oracle.com,
	jarkko@kernel.org, quic_pheragu@quicinc.com, Kirill Shutemov,
	chenhuacai@kernel.org, Kai Huang, shuah@kernel.org,
	dwmw@amazon.co.uk, pankaj.gupta@amd.com, Chao Peng,
	nikunj@amd.com, Alexander Graf, viro@zeniv.linux.org.uk,
	pbonzini@redhat.com, yuzenghui@huawei.com, jroedel@suse.de,
	suzuki.poulose@arm.com, jgowans@amazon.com, Yilun Xu,
	liam.merwick@oracle.com, michael.roth@amd.com,
	quic_tsoni@quicinc.com, richard.weiyang@gmail.com, Ira Weiny,
	aou@eecs.berkeley.edu, Xiaoyao Li, qperret@google.com,
	kent.overstreet@linux.dev, dmatlack@google.com,
	james.morse@arm.com, brauner@kernel.org, roypat@amazon.co.uk,
	ackerleytng@google.com, linux-fsdevel@vger.kernel.org,
	pgonda@google.com, quic_pderrin@quicinc.com, linux-mm@kvack.org,
	will@kernel.org, hch@infradead.org

On Thu, May 15, 2025 at 05:57:57PM -0700, Sean Christopherson wrote:

> You're conflating two different things.  guest_memfd allocating and managing
> 1GiB physical pages, and KVM mapping memory into the guest at 1GiB/2MiB
> granularity.  Allocating memory in 1GiB chunks is useful even if KVM can only
> map memory into the guest using 4KiB pages.

Even if KVM is limited to 4K the IOMMU might not be - a lot of these
workloads have a heavy IO component and we need the iommu to perform
well too.

Frankly, I don't think there should be objection to making memory more
contiguous. There is a lot of data showing that this always brings wins
somewhere for someone.

> The longer term goal of guest_memfd is to make it suitable for backing all VMs,
> hence Vishal's "Non-CoCo VMs" comment.  Yes, some of this is useful for TDX, but
> we (and others) want to use guest_memfd for far more than just CoCo VMs.  And
> for non-CoCo VMs, 1GiB hugepages are mandatory for various workloads.

Yes, even from an iommu perspective with 2D translation we need to
have the 1G pages from the S2 resident in the IOTLB or performance
falls off a cliff.

Jason

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-05-16  2:12         ` Edgecombe, Rick P
@ 2025-05-16 13:11           ` Vishal Annapurve
  2025-05-16 16:45             ` Edgecombe, Rick P
  2025-05-16 17:45             ` Sean Christopherson
  0 siblings, 2 replies; 231+ messages in thread
From: Vishal Annapurve @ 2025-05-16 13:11 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: seanjc@google.com, pvorel@suse.cz, kvm@vger.kernel.org,
	catalin.marinas@arm.com, Miao, Jun, Shutemov, Kirill,
	pdurrant@amazon.co.uk, steven.price@arm.com, peterx@redhat.com,
	x86@kernel.org, amoorthy@google.com, tabba@google.com,
	quic_svaddagi@quicinc.com, maz@kernel.org, vkuznets@redhat.com,
	quic_eberman@quicinc.com, keirf@google.com, hughd@google.com,
	mail@maciej.szmigiero.name, palmer@dabbelt.com,
	Wieczor-Retman, Maciej, Zhao, Yan Y, ajones@ventanamicro.com,
	willy@infradead.org, jack@suse.cz, paul.walmsley@sifive.com,
	aik@amd.com, usama.arif@bytedance.com, quic_mnalajal@quicinc.com,
	fvdl@google.com, rppt@kernel.org, quic_cvanscha@quicinc.com,
	nsaenz@amazon.es, vbabka@suse.cz, Du, Fan,
	anthony.yznaga@oracle.com, linux-kernel@vger.kernel.org,
	thomas.lendacky@amd.com, mic@digikod.net, oliver.upton@linux.dev,
	akpm@linux-foundation.org, bfoster@redhat.com,
	binbin.wu@linux.intel.com, muchun.song@linux.dev, Li, Zhiquan1,
	rientjes@google.com, mpe@ellerman.id.au, Aktas, Erdem,
	david@redhat.com, jgg@ziepe.ca, jhubbard@nvidia.com, Xu, Haibo1,
	anup@brainfault.org, Hansen, Dave, Yamahata, Isaku,
	jthoughton@google.com, Wang, Wei W, steven.sistare@oracle.com,
	jarkko@kernel.org, quic_pheragu@quicinc.com,
	chenhuacai@kernel.org, Huang, Kai, shuah@kernel.org,
	dwmw@amazon.co.uk, pankaj.gupta@amd.com, Peng, Chao P,
	nikunj@amd.com, Graf, Alexander, viro@zeniv.linux.org.uk,
	pbonzini@redhat.com, yuzenghui@huawei.com, jroedel@suse.de,
	suzuki.poulose@arm.com, jgowans@amazon.com, Xu, Yilun,
	liam.merwick@oracle.com, michael.roth@amd.com,
	quic_tsoni@quicinc.com, richard.weiyang@gmail.com, Weiny, Ira,
	aou@eecs.berkeley.edu, Li, Xiaoyao, qperret@google.com,
	kent.overstreet@linux.dev, dmatlack@google.com,
	james.morse@arm.com, brauner@kernel.org, ackerleytng@google.com,
	linux-fsdevel@vger.kernel.org, pgonda@google.com,
	quic_pderrin@quicinc.com, roypat@amazon.co.uk, linux-mm@kvack.org,
	will@kernel.org, hch@infradead.org

On Thu, May 15, 2025 at 7:12 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Thu, 2025-05-15 at 17:57 -0700, Sean Christopherson wrote:
> > > > > Thinking from the TDX perspective, we might have bigger fish to fry than
> > > > > 1.6% memory savings (for example dynamic PAMT), and the rest of the
> > > > > benefits don't have numbers. How much are we getting for all the
> > > > > complexity, over say buddy allocated 2MB pages?
> >
> > TDX may have bigger fish to fry, but some of us have bigger fish to fry than
> > TDX :-)
>
> Fair enough. But TDX is on the "roadmap". So it helps to say what the target of
> this series is.
>
> >
> > > > This series should work for any page sizes backed by hugetlb memory.
> > > > Non-CoCo VMs, pKVM and Confidential VMs all need hugepages that are
> > > > essential for certain workloads and will emerge as guest_memfd users.
> > > > Features like KHO/memory persistence in addition also depend on
> > > > hugepage support in guest_memfd.
> > > >
> > > > This series takes strides towards making guest_memfd compatible with
> > > > usecases where 1G pages are essential and non-confidential VMs are
> > > > already exercising them.
> > > >
> > > > I think the main complexity here lies in supporting in-place
> > > > conversion which applies to any huge page size even for buddy
> > > > allocated 2MB pages or THP.
> > > >
> > > > This complexity arises because page structs work at a fixed
> > > > granularity, future roadmap towards not having page structs for guest
> > > > memory (at least private memory to begin with) should help towards
> > > > greatly reducing this complexity.
> > > >
> > > > That being said, DPAMT and huge page EPT mappings for TDX VMs remain
> > > > essential and complement this series well for better memory footprint
> > > > and overall performance of TDX VMs.
> > >
> > > Hmm, this didn't really answer my questions about the concrete benefits.
> > >
> > > I think it would help to include this kind of justification for the 1GB
> > > guestmemfd pages. "essential for certain workloads and will emerge" is a bit
> > > hard to review against...
> > >
> > > I think one of the challenges with coco is that it's almost like a sprint to
> > > reimplement virtualization. But enough things are changing at once that not
> > > all of the normal assumptions hold, so it can't copy all the same solutions.
> > > The recent example was that for TDX huge pages we found that normal
> > > promotion paths weren't actually yielding any benefit for surprising TDX
> > > specific reasons.
> > >
> > > On the TDX side we are also, at least currently, unmapping private pages
> > > while they are mapped shared, so any 1GB pages would get split to 2MB if
> > > there are any shared pages in them. I wonder how many 1GB pages there would
> > > be after all the shared pages are converted. At smaller TD sizes, it could
> > > be not much.
> >
> > You're conflating two different things.  guest_memfd allocating and managing
> > 1GiB physical pages, and KVM mapping memory into the guest at 1GiB/2MiB
> > granularity.  Allocating memory in 1GiB chunks is useful even if KVM can only
> > map memory into the guest using 4KiB pages.
>
> I'm aware of the 1.6% vmemmap benefits from the LPC talk. Is there more? The
> list quoted there was more about guest performance. Or maybe the clever page
> table walkers that find contiguous small mappings could benefit guest
> performance too? It's the kind of thing I'd like to see at least broadly called
> out.

The crux of this series really is hugetlb backing support for
guest_memfd and handling CoCo VMs irrespective of the page size as I
suggested earlier, so 2M page sizes will need to handle similar
complexity of in-place conversion.

Google internally uses 1G hugetlb pages to achieve high bandwidth IO,
lower memory footprint using HVO and lower MMU/IOMMU page table memory
footprint among other improvements. These percentages carry a
substantial impact when working at the scale of large fleets of hosts
each carrying significant memory capacity.

guest_memfd hugepage support + hugepage EPT mapping support for TDX
VMs significantly help:
1) ~70% decrease in TDX VM boot up time
2) ~65% decrease in TDX VM shutdown time
3) ~90% decrease in TDX VM PAMT memory overhead
4) Improvement in TDX SEPT memory overhead

And we believe this combination should also help achieve better
performance with TDX connect in future.

Hugetlb huge pages are preferred as they are statically carved out at
boot and so provide much better guarantees of availability. Once the
pages are carved out, any VMs scheduled on such a host will need to
work with the same hugetlb memory sizes. This series attempts to use
hugetlb pages with in-place conversion, avoiding the double allocation
problem that otherwise results in significant memory overheads for
CoCo VMs.
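
(For concreteness, the static carve-out here is just the usual HugeTLB
reservation, e.g. on the kernel command line

    default_hugepagesz=1G hugepagesz=1G hugepages=16

or, while memory is still unfragmented, at runtime via

    echo 16 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

where 16 is only an example count.)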

>
> I'm thinking that Google must have a ridiculous amount of learnings about VM
> memory management. And this is probably designed around those learnings. But
> reviewers can't really evaluate it if they don't know the reasons and tradeoffs
> taken. If it's going upstream, I think it should have at least the high level
> reasoning explained.
>
> I don't mean to harp on the point so hard, but I didn't expect it to be
> controversial either.
>
> >
> > > So for TDX in isolation, it seems like jumping out too far ahead to
> > > effectively consider the value. But presumably you guys are testing this on
> > > SEV or something? Have you measured any performance improvement? For what
> > > kind of applications? Or is the idea to basically to make guestmemfd work
> > > like however Google does guest memory?
> >
> > The longer term goal of guest_memfd is to make it suitable for backing all
> > VMs, hence Vishal's "Non-CoCo VMs" comment.
>
> Oh, I actually wasn't aware of this. Or maybe I remember now. I thought he was
> talking about pKVM.
>
> >   Yes, some of this is useful for TDX, but we (and others) want to use
> > guest_memfd for far more than just CoCo VMs.
>
>
> >  And for non-CoCo VMs, 1GiB hugepages are mandatory for various workloads.
> I've heard this a lot. It must be true, but I've never seen the actual numbers.
> For a long time people believed 1GB huge pages on the direct map were critical,
> but then benchmarking on a contemporary CPU couldn't find much difference
> between 2MB and 1GB. I'd expect TDP huge pages to be different than that because
> the combined walks are huge, iTLB, etc, but I'd love to see a real number.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 29/51] mm: guestmem_hugetlb: Wrap HugeTLB as an allocator for guest_memfd
  2025-05-14 23:42 ` [RFC PATCH v2 29/51] mm: guestmem_hugetlb: Wrap HugeTLB as an allocator for guest_memfd Ackerley Tng
@ 2025-05-16 14:07   ` Ackerley Tng
  2025-05-16 20:33     ` Ackerley Tng
  0 siblings, 1 reply; 231+ messages in thread
From: Ackerley Tng @ 2025-05-16 14:07 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li

Ackerley Tng <ackerleytng@google.com> writes:

> guestmem_hugetlb is an allocator for guest_memfd. It wraps HugeTLB to
> provide huge folios for guest_memfd.
>
> This patch also introduces guestmem_allocator_operations as a set of
> operations that allocators for guest_memfd can provide. In a later
> patch, guest_memfd will use these operations to manage pages from an
> allocator.
>
> The allocator operations are memory-management specific and are placed
> in mm/ so key mm-specific functions do not have to be exposed
> unnecessarily.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>
> Change-Id: I3cafe111ea7b3c84755d7112ff8f8c541c11136d
> ---
>  include/linux/guestmem.h      |  20 +++++
>  include/uapi/linux/guestmem.h |  29 +++++++
>  mm/Kconfig                    |   5 +-
>  mm/guestmem_hugetlb.c         | 159 ++++++++++++++++++++++++++++++++++
>  4 files changed, 212 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/guestmem.h
>  create mode 100644 include/uapi/linux/guestmem.h
>
> <snip>
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 131adc49f58d..bb6e39e37245 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1218,7 +1218,10 @@ config SECRETMEM
>  
>  config GUESTMEM_HUGETLB
>  	bool "Enable guestmem_hugetlb allocator for guest_memfd"
> -	depends on HUGETLBFS
> +	select GUESTMEM
> +	select HUGETLBFS
> +	select HUGETLB_PAGE
> +	select HUGETLB_PAGE_OPTIMIZE_VMEMMAP

My bad. I left out CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=y in
my testing and just found that when it is set, I hit

  BUG_ON(pte_page(ptep_get(pte)) != walk->reuse_page);

with the basic guest_memfd_test when splitting pages on allocation.

I'll follow up with the fix soon.
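
In case anyone else wants to reproduce this before the fix lands: if I
have the knobs right, building with

    CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=y

(or booting an existing kernel with hugetlb_free_vmemmap=on) and then
running the basic guest_memfd_test should hit the splat when a huge
folio is split on allocation.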

Another note about testing: I've been testing in a nested VM for the
development process:

1. Host
2. VM for development
3. Nested VM running kernel being developed
4. Nested nested VMs created during selftests

This series has not yet been tested on a physical host.

>  	help
>  	  Enable this to make HugeTLB folios available to guest_memfd
>  	  (KVM virtualization) as backing memory.
>
> <snip>
>

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-05-16 13:11           ` Vishal Annapurve
@ 2025-05-16 16:45             ` Edgecombe, Rick P
  2025-05-16 17:51               ` Sean Christopherson
  2025-05-16 17:45             ` Sean Christopherson
  1 sibling, 1 reply; 231+ messages in thread
From: Edgecombe, Rick P @ 2025-05-16 16:45 UTC (permalink / raw)
  To: Annapurve, Vishal
  Cc: pvorel@suse.cz, kvm@vger.kernel.org, catalin.marinas@arm.com,
	Miao, Jun, palmer@dabbelt.com, pdurrant@amazon.co.uk,
	vbabka@suse.cz, peterx@redhat.com, x86@kernel.org,
	amoorthy@google.com, jack@suse.cz, maz@kernel.org,
	tabba@google.com, vkuznets@redhat.com, quic_svaddagi@quicinc.com,
	mail@maciej.szmigiero.name, hughd@google.com,
	quic_eberman@quicinc.com, Wang, Wei W, keirf@google.com,
	Wieczor-Retman, Maciej, Zhao, Yan Y, Hansen, Dave,
	ajones@ventanamicro.com, rppt@kernel.org,
	quic_mnalajal@quicinc.com, aik@amd.com, usama.arif@bytedance.com,
	fvdl@google.com, paul.walmsley@sifive.com,
	quic_cvanscha@quicinc.com, nsaenz@amazon.es, willy@infradead.org,
	Du, Fan, anthony.yznaga@oracle.com, linux-kernel@vger.kernel.org,
	thomas.lendacky@amd.com, mic@digikod.net, oliver.upton@linux.dev,
	Shutemov, Kirill, akpm@linux-foundation.org, steven.price@arm.com,
	binbin.wu@linux.intel.com, muchun.song@linux.dev, Li, Zhiquan1,
	rientjes@google.com, mpe@ellerman.id.au, Aktas, Erdem,
	david@redhat.com, jgg@ziepe.ca, bfoster@redhat.com,
	jhubbard@nvidia.com, Xu, Haibo1, anup@brainfault.org,
	Yamahata, Isaku, jthoughton@google.com, will@kernel.org,
	steven.sistare@oracle.com, quic_pheragu@quicinc.com,
	jarkko@kernel.org, chenhuacai@kernel.org, Huang, Kai,
	shuah@kernel.org, dwmw@amazon.co.uk, pankaj.gupta@amd.com,
	Peng, Chao P, nikunj@amd.com, Graf, Alexander,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Xu, Yilun, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com,
	richard.weiyang@gmail.com, Weiny, Ira, aou@eecs.berkeley.edu,
	Li, Xiaoyao, qperret@google.com, kent.overstreet@linux.dev,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	ackerleytng@google.com, linux-fsdevel@vger.kernel.org,
	pgonda@google.com, quic_pderrin@quicinc.com, roypat@amazon.co.uk,
	linux-mm@kvack.org, seanjc@google.com, hch@infradead.org

On Fri, 2025-05-16 at 06:11 -0700, Vishal Annapurve wrote:
> The crux of this series really is hugetlb backing support for
> guest_memfd and handling CoCo VMs irrespective of the page size as I
> suggested earlier, so 2M page sizes will need to handle similar
> complexity of in-place conversion.

I assumed this part was the added 1GB complexity:
 mm/hugetlb.c                                  |  488 ++---

I'll dig into the series and try to understand the point better.

> 
> Google internally uses 1G hugetlb pages to achieve high bandwidth IO,
> lower memory footprint using HVO and lower MMU/IOMMU page table memory
> footprint among other improvements. These percentages carry a
> substantial impact when working at the scale of large fleets of hosts
> each carrying significant memory capacity.

There must have been a lot of measuring involved in that. But the numbers I was
hoping for were about how much *this* series helps upstream.

> 
> guest_memfd hugepage support + hugepage EPT mapping support for TDX
> VMs significantly help:
> 1) ~70% decrease in TDX VM boot up time
> 2) ~65% decrease in TDX VM shutdown time
> 3) ~90% decrease in TDX VM PAMT memory overhead
> 4) Improvement in TDX SEPT memory overhead

Thanks. It is the difference between 4k mappings and 2MB mappings, I guess? Or
are you saying this is the difference between 1GB contiguous pages for TDX with
2MB mappings, and 2MB contiguous pages with TDX 2MB mappings? The 1GB part is
the one I was curious about.

> 
> And we believe this combination should also help achieve better
> performance with TDX connect in future.

Please don't take this query as an objection that the series doesn't help TDX
enough or something like that. If it doesn't help TDX at all (not the case),
that is fine. The objection is only that the specific benefits and tradeoffs
around 1GB pages are not clear in the upstream posting.

> 
> Hugetlb huge pages are preferred as they are statically carved out at
> boot and so provide much better guarantees of availability.
> 

Reserved memory can provide physically contiguous pages more frequently. That
seems not surprising at all, and it's something that could have a number
attached.

>  Once the
> pages are carved out, any VMs scheduled on such a host will need to
> work with the same hugetlb memory sizes. This series attempts to use
> hugetlb pages with in-place conversion, avoiding the double allocation
> problem that otherwise results in significant memory overheads for
> CoCo VMs.

I asked this question assuming there were some measurements for the 1GB part of
this series. It sounds like the reasoning is instead that this is how Google
does things, which is backed by way more benchmarking than kernel patches are
used to getting. So it can just reasonably be assumed to be helpful.

But for upstream code, I'd expect something a bit more concrete than "we
believe" and "substantial impact". It seems like I'm in the minority here
though. So if no one else wants to pressure test the thinking in the usual way,
I guess I'll just have to wonder.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-05-16 13:09         ` Jason Gunthorpe
@ 2025-05-16 17:04           ` Edgecombe, Rick P
  0 siblings, 0 replies; 231+ messages in thread
From: Edgecombe, Rick P @ 2025-05-16 17:04 UTC (permalink / raw)
  To: seanjc@google.com, jgg@ziepe.ca
  Cc: pvorel@suse.cz, kvm@vger.kernel.org, catalin.marinas@arm.com,
	Miao, Jun, Shutemov, Kirill, pdurrant@amazon.co.uk,
	steven.price@arm.com, peterx@redhat.com, palmer@dabbelt.com,
	amoorthy@google.com, tabba@google.com, quic_svaddagi@quicinc.com,
	maz@kernel.org, vkuznets@redhat.com, x86@kernel.org,
	keirf@google.com, hughd@google.com, Annapurve, Vishal,
	mail@maciej.szmigiero.name, jack@suse.cz, Wieczor-Retman, Maciej,
	Zhao, Yan Y, Du, Fan, willy@infradead.org,
	paul.walmsley@sifive.com, nsaenz@amazon.es, aik@amd.com,
	usama.arif@bytedance.com, quic_mnalajal@quicinc.com,
	fvdl@google.com, rppt@kernel.org, quic_cvanscha@quicinc.com,
	Hansen, Dave, vbabka@suse.cz, bfoster@redhat.com,
	quic_eberman@quicinc.com, linux-kernel@vger.kernel.org,
	thomas.lendacky@amd.com, mic@digikod.net, oliver.upton@linux.dev,
	akpm@linux-foundation.org, anthony.yznaga@oracle.com,
	binbin.wu@linux.intel.com, muchun.song@linux.dev, Li, Zhiquan1,
	rientjes@google.com, mpe@ellerman.id.au, Aktas, Erdem,
	david@redhat.com, ajones@ventanamicro.com, jhubbard@nvidia.com,
	Xu, Haibo1, anup@brainfault.org, Yamahata, Isaku,
	jthoughton@google.com, Wang, Wei W, steven.sistare@oracle.com,
	quic_pheragu@quicinc.com, jarkko@kernel.org,
	chenhuacai@kernel.org, Huang, Kai, shuah@kernel.org,
	dwmw@amazon.co.uk, pankaj.gupta@amd.com, Peng, Chao P,
	nikunj@amd.com, Graf, Alexander, viro@zeniv.linux.org.uk,
	pbonzini@redhat.com, yuzenghui@huawei.com, jroedel@suse.de,
	suzuki.poulose@arm.com, jgowans@amazon.com, Xu, Yilun,
	liam.merwick@oracle.com, michael.roth@amd.com,
	quic_tsoni@quicinc.com, richard.weiyang@gmail.com, Weiny, Ira,
	aou@eecs.berkeley.edu, Li, Xiaoyao, qperret@google.com,
	kent.overstreet@linux.dev, dmatlack@google.com,
	james.morse@arm.com, brauner@kernel.org, ackerleytng@google.com,
	linux-fsdevel@vger.kernel.org, pgonda@google.com,
	quic_pderrin@quicinc.com, roypat@amazon.co.uk, linux-mm@kvack.org,
	will@kernel.org, hch@infradead.org

On Fri, 2025-05-16 at 10:09 -0300, Jason Gunthorpe wrote:
> > You're conflating two different things.  guest_memfd allocating and managing
> > 1GiB physical pages, and KVM mapping memory into the guest at 1GiB/2MiB
> > granularity.  Allocating memory in 1GiB chunks is useful even if KVM can
> > only
> > map memory into the guest using 4KiB pages.
> 
> Even if KVM is limited to 4K the IOMMU might not be - a lot of these
> workloads have a heavy IO component and we need the iommu to perform
> well too.

Oh, interesting point.

> 
> Frankly, I don't think there should be objection to making memory more
> contiguous. 

No objections from me to anything except the lack of concrete justification.

> There is a lot of data showing that this always brings wins
> somewhere for someone.

For the direct map huge page benchmarking, they saw that sometimes 1GB pages
helped, but also sometimes 2MB pages helped. That 1GB will help *some* workload
doesn't seem surprising.

> 
> > The longer term goal of guest_memfd is to make it suitable for backing all
> > VMs,
> > hence Vishal's "Non-CoCo VMs" comment.  Yes, some of this is useful for TDX,
> > but
> > we (and others) want to use guest_memfd for far more than just CoCo VMs. 
> > And
> > for non-CoCo VMs, 1GiB hugepages are mandatory for various workloads.
> 
> Yes, even from an iommu perspective with 2D translation we need to
> have the 1G pages from the S2 resident in the IOTLB or performance
> falls off a cliff.

"falls off a cliff" is the level of detail and the direction of hand waving I
have been hearing. But it also seems modern CPUs are quite good at hiding the
cost of walks with caches etc. Like how 5 level paging was made unconditional. I
didn't think about IOTLB though. Thanks for mentioning it.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 03/51] KVM: selftests: Update guest_memfd_test for INIT_PRIVATE flag
  2025-05-15 13:49   ` Ira Weiny
@ 2025-05-16 17:42     ` Ackerley Tng
  2025-05-16 19:31       ` Ira Weiny
  2025-05-27  8:53       ` Binbin Wu
  0 siblings, 2 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-16 17:42 UTC (permalink / raw)
  To: Ira Weiny, kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li

Ira Weiny <ira.weiny@intel.com> writes:

> Ackerley Tng wrote:
>> Test that GUEST_MEMFD_FLAG_INIT_PRIVATE is only valid when
>> GUEST_MEMFD_FLAG_SUPPORT_SHARED is set.
>> 
>> Change-Id: I506e236a232047cfaee17bcaed02ee14c8d25bbb
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> ---
>>  .../testing/selftests/kvm/guest_memfd_test.c  | 36 ++++++++++++-------
>>  1 file changed, 24 insertions(+), 12 deletions(-)
>> 
>> diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
>> index 60aaba5808a5..bf2876cbd711 100644
>> --- a/tools/testing/selftests/kvm/guest_memfd_test.c
>> +++ b/tools/testing/selftests/kvm/guest_memfd_test.c
>> @@ -401,13 +401,31 @@ static void test_with_type(unsigned long vm_type, uint64_t guest_memfd_flags,
>>  	kvm_vm_release(vm);
>>  }
>>  
>> +static void test_vm_with_gmem_flag(struct kvm_vm *vm, uint64_t flag,
>> +				   bool expect_valid)
>> +{
>> +	size_t page_size = getpagesize();
>> +	int fd;
>> +
>> +	fd = __vm_create_guest_memfd(vm, page_size, flag);
>> +
>> +	if (expect_valid) {
>> +		TEST_ASSERT(fd > 0,
>> +			    "guest_memfd() with flag '0x%lx' should be valid",
>> +			    flag);
>> +		close(fd);
>> +	} else {
>> +		TEST_ASSERT(fd == -1 && errno == EINVAL,
>> +			    "guest_memfd() with flag '0x%lx' should fail with EINVAL",
>> +			    flag);
>> +	}
>> +}
>> +
>>  static void test_vm_type_gmem_flag_validity(unsigned long vm_type,
>>  					    uint64_t expected_valid_flags)
>>  {
>> -	size_t page_size = getpagesize();
>>  	struct kvm_vm *vm;
>>  	uint64_t flag = 0;
>> -	int fd;
>>  
>>  	if (!(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(vm_type)))
>>  		return;
>> @@ -415,17 +433,11 @@ static void test_vm_type_gmem_flag_validity(unsigned long vm_type,
>>  	vm = vm_create_barebones_type(vm_type);
>>  
>>  	for (flag = BIT(0); flag; flag <<= 1) {
>> -		fd = __vm_create_guest_memfd(vm, page_size, flag);
>> +		test_vm_with_gmem_flag(vm, flag, flag & expected_valid_flags);
>>  
>> -		if (flag & expected_valid_flags) {
>> -			TEST_ASSERT(fd > 0,
>> -				    "guest_memfd() with flag '0x%lx' should be valid",
>> -				    flag);
>> -			close(fd);
>> -		} else {
>> -			TEST_ASSERT(fd == -1 && errno == EINVAL,
>> -				    "guest_memfd() with flag '0x%lx' should fail with EINVAL",
>> -				    flag);
>> +		if (flag == GUEST_MEMFD_FLAG_SUPPORT_SHARED) {
>> +			test_vm_with_gmem_flag(
>> +				vm, flag | GUEST_MEMFD_FLAG_INIT_PRIVATE, true);
>
> I don't understand the point of this check.  In 2/51 we set 
> GUEST_MEMFD_FLAG_INIT_PRIVATE when GUEST_MEMFD_FLAG_SUPPORT_SHARED is set.
>
> When can this check ever fail?
>
> Ira

In 02/51, GUEST_MEMFD_FLAG_INIT_PRIVATE is not set by default; it is only
added as one of the valid_flags.

The intention is that GUEST_MEMFD_FLAG_INIT_PRIVATE is only valid if
GUEST_MEMFD_FLAG_SUPPORT_SHARED is requested by userspace.
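
In other words, the rule being tested is roughly the below -- a sketch
of the intent only, not the exact code in 02/51:

    if ((flags & GUEST_MEMFD_FLAG_INIT_PRIVATE) &&
        !(flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED))
            return -EINVAL;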

In this test, the earlier part before the if block calls
test_vm_with_gmem_flag() for all valid flags, and that already tests
GUEST_MEMFD_FLAG_SUPPORT_SHARED individually.

Specifically, if GUEST_MEMFD_FLAG_SUPPORT_SHARED is set, this if block
adds a test for when both GUEST_MEMFD_FLAG_SUPPORT_SHARED and
GUEST_MEMFD_FLAG_INIT_PRIVATE are set, and sets expect_valid to true.

This second test doesn't fail; it is meant to check that the kernel
allows the pair of flags to be set. Hope that makes sense.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-05-16 13:11           ` Vishal Annapurve
  2025-05-16 16:45             ` Edgecombe, Rick P
@ 2025-05-16 17:45             ` Sean Christopherson
  1 sibling, 0 replies; 231+ messages in thread
From: Sean Christopherson @ 2025-05-16 17:45 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Rick P Edgecombe, pvorel@suse.cz, kvm@vger.kernel.org,
	catalin.marinas@arm.com, Jun Miao, Kirill Shutemov,
	pdurrant@amazon.co.uk, steven.price@arm.com, peterx@redhat.com,
	x86@kernel.org, amoorthy@google.com, tabba@google.com,
	quic_svaddagi@quicinc.com, maz@kernel.org, vkuznets@redhat.com,
	quic_eberman@quicinc.com, keirf@google.com, hughd@google.com,
	mail@maciej.szmigiero.name, palmer@dabbelt.com,
	Maciej Wieczor-Retman, Yan Y Zhao, ajones@ventanamicro.com,
	willy@infradead.org, jack@suse.cz, paul.walmsley@sifive.com,
	aik@amd.com, usama.arif@bytedance.com, quic_mnalajal@quicinc.com,
	fvdl@google.com, rppt@kernel.org, quic_cvanscha@quicinc.com,
	nsaenz@amazon.es, vbabka@suse.cz, Fan Du,
	anthony.yznaga@oracle.com, linux-kernel@vger.kernel.org,
	thomas.lendacky@amd.com, mic@digikod.net, oliver.upton@linux.dev,
	akpm@linux-foundation.org, bfoster@redhat.com,
	binbin.wu@linux.intel.com, muchun.song@linux.dev, Zhiquan1 Li,
	rientjes@google.com, mpe@ellerman.id.au, Erdem Aktas,
	david@redhat.com, jgg@ziepe.ca, jhubbard@nvidia.com, Haibo1 Xu,
	anup@brainfault.org, Dave Hansen, Isaku Yamahata,
	jthoughton@google.com, Wei W Wang, steven.sistare@oracle.com,
	jarkko@kernel.org, quic_pheragu@quicinc.com,
	chenhuacai@kernel.org, Kai Huang, shuah@kernel.org,
	dwmw@amazon.co.uk, pankaj.gupta@amd.com, Chao P Peng,
	nikunj@amd.com, Alexander Graf, viro@zeniv.linux.org.uk,
	pbonzini@redhat.com, yuzenghui@huawei.com, jroedel@suse.de,
	suzuki.poulose@arm.com, jgowans@amazon.com, Yilun Xu,
	liam.merwick@oracle.com, michael.roth@amd.com,
	quic_tsoni@quicinc.com, richard.weiyang@gmail.com, Ira Weiny,
	aou@eecs.berkeley.edu, Xiaoyao Li, qperret@google.com,
	kent.overstreet@linux.dev, dmatlack@google.com,
	james.morse@arm.com, brauner@kernel.org, ackerleytng@google.com,
	linux-fsdevel@vger.kernel.org, pgonda@google.com,
	quic_pderrin@quicinc.com, roypat@amazon.co.uk, linux-mm@kvack.org,
	will@kernel.org, hch@infradead.org

On Fri, May 16, 2025, Vishal Annapurve wrote:
> On Thu, May 15, 2025 at 7:12 PM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> > On Thu, 2025-05-15 at 17:57 -0700, Sean Christopherson wrote:
> > > You're conflating two different things.  guest_memfd allocating and managing
> > > 1GiB physical pages, and KVM mapping memory into the guest at 1GiB/2MiB
> > > granularity.  Allocating memory in 1GiB chunks is useful even if KVM can only
> > > map memory into the guest using 4KiB pages.
> >
> > I'm aware of the 1.6% vmemmap benefits from the LPC talk. Is there more? The
> > list quoted there was more about guest performance. Or maybe the clever page
> > table walkers that find contiguous small mappings could benefit guest
> > performance too? It's the kind of thing I'd like to see at least broadly called
> > out.
> 
> The crux of this series really is hugetlb backing support for guest_memfd and
> handling CoCo VMs irrespective of the page size as I suggested earlier, so 2M
> page sizes will need to handle similar complexity of in-place conversion.
> 
> Google internally uses 1G hugetlb pages to achieve high bandwidth IO,

E.g. hitting target networking line rates is only possible with 1GiB mappings;
otherwise TLB pressure gets in the way.

> lower memory footprint using HVO and lower MMU/IOMMU page table memory
> footprint among other improvements. These percentages carry a substantial
> impact when working at the scale of large fleets of hosts each carrying
> significant memory capacity.

Yeah, 1.6% might sound small, but over however many bytes of RAM there are in
the fleet, it's a huge (lol) amount of memory saved.

> > >   Yes, some of this is useful for TDX, but we (and others) want to use
> > > guest_memfd for far more than just CoCo VMs.
> >
> >
> > >  And for non-CoCo VMs, 1GiB hugepages are mandatory for various workloads.
> > I've heard this a lot. It must be true, but I've never seen the actual numbers.
> > For a long time people believed 1GB huge pages on the direct map were critical,
> > but then benchmarking on a contemporary CPU couldn't find much difference
> > between 2MB and 1GB. I'd expect TDP huge pages to be different than that because
> > the combined walks are huge, iTLB, etc, but I'd love to see a real number.

The direct map is very, very different from userspace mappings, and thus from
guest mappings.  Software (hopefully) isn't using the direct map to index
multi-TiB databases, or to transfer GiBs of data over the network.  The amount
of memory the kernel is regularly accessing is an order of magnitude or two
smaller than in single-process use cases.

A few examples from a quick search:

http://pvk.ca/Blog/2014/02/18/how-bad-can-1gb-pages-be
https://www.percona.com/blog/benchmark-postgresql-with-linux-hugepages/

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-05-16 16:45             ` Edgecombe, Rick P
@ 2025-05-16 17:51               ` Sean Christopherson
  2025-05-16 19:14                 ` Edgecombe, Rick P
  0 siblings, 1 reply; 231+ messages in thread
From: Sean Christopherson @ 2025-05-16 17:51 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Vishal Annapurve, pvorel@suse.cz, kvm@vger.kernel.org,
	catalin.marinas@arm.com, Jun Miao, palmer@dabbelt.com,
	pdurrant@amazon.co.uk, vbabka@suse.cz, peterx@redhat.com,
	x86@kernel.org, amoorthy@google.com, jack@suse.cz, maz@kernel.org,
	tabba@google.com, vkuznets@redhat.com, quic_svaddagi@quicinc.com,
	mail@maciej.szmigiero.name, hughd@google.com,
	quic_eberman@quicinc.com, Wei W Wang, keirf@google.com,
	Maciej Wieczor-Retman, Yan Y Zhao, Dave Hansen,
	ajones@ventanamicro.com, rppt@kernel.org,
	quic_mnalajal@quicinc.com, aik@amd.com, usama.arif@bytedance.com,
	fvdl@google.com, paul.walmsley@sifive.com,
	quic_cvanscha@quicinc.com, nsaenz@amazon.es, willy@infradead.org,
	Fan Du, anthony.yznaga@oracle.com, linux-kernel@vger.kernel.org,
	thomas.lendacky@amd.com, mic@digikod.net, oliver.upton@linux.dev,
	Kirill Shutemov, akpm@linux-foundation.org, steven.price@arm.com,
	binbin.wu@linux.intel.com, muchun.song@linux.dev, Zhiquan1 Li,
	rientjes@google.com, mpe@ellerman.id.au, Erdem Aktas,
	david@redhat.com, jgg@ziepe.ca, bfoster@redhat.com,
	jhubbard@nvidia.com, Haibo1 Xu, anup@brainfault.org,
	Isaku Yamahata, jthoughton@google.com, will@kernel.org,
	steven.sistare@oracle.com, quic_pheragu@quicinc.com,
	jarkko@kernel.org, chenhuacai@kernel.org, Kai Huang,
	shuah@kernel.org, dwmw@amazon.co.uk, pankaj.gupta@amd.com,
	Chao P Peng, nikunj@amd.com, Alexander Graf,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Yilun Xu, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com,
	richard.weiyang@gmail.com, Ira Weiny, aou@eecs.berkeley.edu,
	Xiaoyao Li, qperret@google.com, kent.overstreet@linux.dev,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	ackerleytng@google.com, linux-fsdevel@vger.kernel.org,
	pgonda@google.com, quic_pderrin@quicinc.com, roypat@amazon.co.uk,
	linux-mm@kvack.org, hch@infradead.org

On Fri, May 16, 2025, Rick P Edgecombe wrote:
> On Fri, 2025-05-16 at 06:11 -0700, Vishal Annapurve wrote:
> > Google internally uses 1G hugetlb pages to achieve high bandwidth IO,
> > lower memory footprint using HVO and lower MMU/IOMMU page table memory
> > footprint among other improvements. These percentages carry a
> > substantial impact when working at the scale of large fleets of hosts
> > each carrying significant memory capacity.
> 
> There must have been a lot of measuring involved in that. But the numbers I was
> hoping for were about how much *this* series helps upstream.

...

> I asked this question assuming there were some measurements for the 1GB part of
> this series. It sounds like the reasoning is instead that this is how Google
> does things, which is backed by way more benchmarking than kernel patches are
> used to getting. So it can just reasonably be assumed to be helpful.
> 
> But for upstream code, I'd expect something a bit more concrete than "we
> believe" and "substantial impact". It seems like I'm in the minority here
> though. So if no one else wants to pressure test the thinking in the usual way,
> I guess I'll just have to wonder.

From my perspective, 1GiB hugepage support in guest_memfd isn't about improving
CoCo performance, it's about achieving feature parity on guest_memfd with respect
to existing backing stores so that it's possible to use guest_memfd to back all
VM shapes in a fleet.

Let's assume there is significant value in backing non-CoCo VMs with 1GiB pages,
unless you want to re-litigate the existence of 1GiB support in HugeTLBFS.

If we assume 1GiB support is mandatory for non-CoCo VMs, then it becomes mandatory
for CoCo VMs as well, because it's the only realistic way to run CoCo VMs and
non-CoCo VMs on a single host.  Mixing 1GiB HugeTLBFS with any other backing store
for VMs simply isn't tenable due to the nature of 1GiB allocations.  E.g. grabbing
sub-1GiB chunks of memory for CoCo VMs quickly fragments memory to the point where
HugeTLBFS can't allocate memory for non-CoCo VMs.

Teaching HugeTLBFS to play nice with TDX and SNP isn't happening, which leaves
adding 1GiB support to guest_memfd as the only way forward.

Any boost to TDX (or SNP) performance is purely a bonus.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-15 14:50   ` Ira Weiny
@ 2025-05-16 17:53     ` Ackerley Tng
  0 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-16 17:53 UTC (permalink / raw)
  To: Ira Weiny, kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li

Ira Weiny <ira.weiny@intel.com> writes:

> Ackerley Tng wrote:
>
> [snip]
>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> 
>
> [snip]
>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 590932499eba..f802116290ce 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -30,6 +30,10 @@ enum shareability {
>>  };
>>  
>>  static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index);
>> +static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
>> +				      pgoff_t end);
>> +static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
>> +				    pgoff_t end);
>>  
>>  static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
>>  {
>> @@ -85,6 +89,306 @@ static struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t inde
>>  	return kvm_gmem_get_folio(inode, index);
>>  }
>>  
>> +/**
>> + * kvm_gmem_shareability_store() - Sets shareability to @value for range.
>> + *
>> + * @mt: the shareability maple tree.
>> + * @index: the range begins at this index in the inode.
>> + * @nr_pages: number of PAGE_SIZE pages in this range.
>> + * @value: the shareability value to set for this range.
>> + *
>> + * Unlike mtree_store_range(), this function also merges adjacent ranges that
>> + * have the same values as an optimization.
>
> Is this an optimization or something which will be required to convert
> from shared back to private and back to a huge page mapping?
>

This is an optimization.

> If this is purely an optimization it might be best to leave it out for now
> to get functionality first.
>

I see this (small) optimization as part of using maple trees.

Fuad's version [1] uses xarrays and has 1 xarray entry per page
offset. I wanted to illustrate that by using maple trees, we can share
just 1 entry for a whole range, and part of that sharing involves
merging adjacent shareability entries that have the same value.
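
To make the merging concrete, here's a minimal sketch of the idea
(illustrative only, not the exact code in this series; the real version
also has to deal with the shareability enum and guest_memfd's locking
rules):

#include <linux/maple_tree.h>
#include <linux/xarray.h>

/*
 * Before storing [index, last] = value, peek at the entries just after
 * and just before the range and widen the store to cover any neighbour
 * that already holds the same value, so adjacent ranges stay merged.
 */
static int store_merged(struct maple_tree *mt, unsigned long index,
			unsigned long last, unsigned long value)
{
	MA_STATE(mas, mt, 0, 0);
	void *entry;
	int ret;

	mas_lock(&mas);

	/* Same value right after the range? Extend 'last'. */
	mas_set_range(&mas, last + 1, last + 1);
	entry = mas_find(&mas, last + 1);
	if (entry && xa_to_value(entry) == value)
		last = mas.last;

	/* Same value right before the range? Extend 'index'. */
	mas_set_range(&mas, index - 1, index - 1);
	entry = mas_find(&mas, index - 1);
	if (entry && xa_to_value(entry) == value)
		index = mas.index;

	/* One store now covers the merged range. */
	mas_set_range(&mas, index, last);
	ret = mas_store_gfp(&mas, xa_mk_value(value), GFP_KERNEL);

	mas_unlock(&mas);
	return ret;
}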

IIUC, these other users of maple trees also do some kind of
expansion/range merging:

+ VMAs in vma_expand() [2]
+ regcache in regcache_maple_write() [3]

> I have more to review but wanted to ask this.
>
> Ira
>
> [snip]

[1] https://lore.kernel.org/all/20250328153133.3504118-4-tabba@google.com/
[2] https://elixir.bootlin.com/linux/v6.14.6/source/mm/vma.c#L1059
[3] https://elixir.bootlin.com/linux/v6.14.6/source/drivers/base/regmap/regcache-maple.c#L38

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-05-16 17:51               ` Sean Christopherson
@ 2025-05-16 19:14                 ` Edgecombe, Rick P
  2025-05-16 20:25                   ` Dave Hansen
  0 siblings, 1 reply; 231+ messages in thread
From: Edgecombe, Rick P @ 2025-05-16 19:14 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: palmer@dabbelt.com, kvm@vger.kernel.org, catalin.marinas@arm.com,
	Miao, Jun, nsaenz@amazon.es, pdurrant@amazon.co.uk,
	vbabka@suse.cz, peterx@redhat.com, x86@kernel.org, jack@suse.cz,
	tabba@google.com, quic_svaddagi@quicinc.com, amoorthy@google.com,
	pvorel@suse.cz, vkuznets@redhat.com, mail@maciej.szmigiero.name,
	Annapurve, Vishal, anthony.yznaga@oracle.com, Wang, Wei W,
	keirf@google.com, Wieczor-Retman, Maciej, Zhao, Yan Y,
	ajones@ventanamicro.com, Hansen, Dave, rppt@kernel.org,
	quic_mnalajal@quicinc.com, aik@amd.com, usama.arif@bytedance.com,
	fvdl@google.com, paul.walmsley@sifive.com, bfoster@redhat.com,
	quic_cvanscha@quicinc.com, willy@infradead.org, Du, Fan,
	quic_eberman@quicinc.com, thomas.lendacky@amd.com,
	linux-kernel@vger.kernel.org, mic@digikod.net,
	oliver.upton@linux.dev, akpm@linux-foundation.org,
	steven.price@arm.com, muchun.song@linux.dev,
	binbin.wu@linux.intel.com, Li, Zhiquan1, rientjes@google.com,
	Aktas, Erdem, mpe@ellerman.id.au, david@redhat.com, jgg@ziepe.ca,
	hughd@google.com, Xu, Haibo1, jhubbard@nvidia.com,
	anup@brainfault.org, maz@kernel.org, Yamahata, Isaku,
	jthoughton@google.com, steven.sistare@oracle.com,
	quic_pheragu@quicinc.com, jarkko@kernel.org, Shutemov, Kirill,
	chenhuacai@kernel.org, Huang, Kai, shuah@kernel.org,
	dwmw@amazon.co.uk, pankaj.gupta@amd.com, Peng, Chao P,
	nikunj@amd.com, Graf, Alexander, viro@zeniv.linux.org.uk,
	pbonzini@redhat.com, yuzenghui@huawei.com, jroedel@suse.de,
	suzuki.poulose@arm.com, jgowans@amazon.com, Xu, Yilun,
	liam.merwick@oracle.com, michael.roth@amd.com,
	quic_tsoni@quicinc.com, richard.weiyang@gmail.com, Weiny, Ira,
	aou@eecs.berkeley.edu, Li, Xiaoyao, qperret@google.com,
	kent.overstreet@linux.dev, dmatlack@google.com,
	james.morse@arm.com, brauner@kernel.org, hch@infradead.org,
	ackerleytng@google.com, linux-fsdevel@vger.kernel.org,
	pgonda@google.com, quic_pderrin@quicinc.com, roypat@amazon.co.uk,
	will@kernel.org, linux-mm@kvack.org

On Fri, 2025-05-16 at 10:51 -0700, Sean Christopherson wrote:
> From my perspective, 1GiB hugepage support in guest_memfd isn't about improving
> CoCo performance, it's about achieving feature parity on guest_memfd with respect
> to existing backing stores so that it's possible to use guest_memfd to back all
> VM shapes in a fleet.
> 
> Let's assume there is significant value in backing non-CoCo VMs with 1GiB pages,
> unless you want to re-litigate the existence of 1GiB support in HugeTLBFS.

I didn't expect to go in that direction when I first asked. But everyone says
the impact is huge, yet no one knows the numbers. That can be a sign of things.

Meanwhile I'm watching patches that make 5-level paging walks unconditional fly
by, because people couldn't find a cost to the extra level of walk. So
re-litigate, no. But I'll probably remain quietly suspicious of the exact
cost/value, at least on the CPU side. I totally missed the IOTLB side at first,
sorry.

> 
> If we assume 1GiB support is mandatory for non-CoCo VMs, then it becomes mandatory
> for CoCo VMs as well, because it's the only realistic way to run CoCo VMs and
> non-CoCo VMs on a single host.  Mixing 1GiB HugeTLBFS with any other backing store
> for VMs simply isn't tenable due to the nature of 1GiB allocations.  E.g. grabbing
> sub-1GiB chunks of memory for CoCo VMs quickly fragments memory to the point where
> HugeTLBFS can't allocate memory for non-CoCo VMs.

It makes sense that there would be a difference in how many huge pages the
non-CoCo guests would get. Where I start to lose you is when you guys talk
about "mandatory" or similar. If you want upstream review, it would help to
have more numbers on the "why" question, at least for us folks outside the
hyperscalers where such things are not as obvious.

> 
> Teaching HugeTLBFS to play nice with TDX and SNP isn't happening, which leaves
> adding 1GiB support to guest_memfd as the only way forward.
> 
> Any boost to TDX (or SNP) performance is purely a bonus.

Most of the bullets in the talk were about mapping sizes AFAICT, so this is the
kind of reasoning I was hoping for. Thanks for elaborating on it, even though
no one has any numbers yet besides the vmemmap savings.



^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 03/51] KVM: selftests: Update guest_memfd_test for INIT_PRIVATE flag
  2025-05-16 17:42     ` Ackerley Tng
@ 2025-05-16 19:31       ` Ira Weiny
  2025-05-27  8:53       ` Binbin Wu
  1 sibling, 0 replies; 231+ messages in thread
From: Ira Weiny @ 2025-05-16 19:31 UTC (permalink / raw)
  To: Ackerley Tng, Ira Weiny, kvm, linux-mm, linux-kernel, x86,
	linux-fsdevel
  Cc: aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li

Ackerley Tng wrote:
> Ira Weiny <ira.weiny@intel.com> writes:
> 
> > Ackerley Tng wrote:
> >> Test that GUEST_MEMFD_FLAG_INIT_PRIVATE is only valid when
> >> GUEST_MEMFD_FLAG_SUPPORT_SHARED is set.
> >> 
> >> Change-Id: I506e236a232047cfaee17bcaed02ee14c8d25bbb
> >> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> >> ---
> >>  .../testing/selftests/kvm/guest_memfd_test.c  | 36 ++++++++++++-------
> >>  1 file changed, 24 insertions(+), 12 deletions(-)
> >> 
> >> diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
> >> index 60aaba5808a5..bf2876cbd711 100644
> >> --- a/tools/testing/selftests/kvm/guest_memfd_test.c
> >> +++ b/tools/testing/selftests/kvm/guest_memfd_test.c
> >> @@ -401,13 +401,31 @@ static void test_with_type(unsigned long vm_type, uint64_t guest_memfd_flags,
> >>  	kvm_vm_release(vm);
> >>  }
> >>  
> >> +static void test_vm_with_gmem_flag(struct kvm_vm *vm, uint64_t flag,
> >> +				   bool expect_valid)
> >> +{
> >> +	size_t page_size = getpagesize();
> >> +	int fd;
> >> +
> >> +	fd = __vm_create_guest_memfd(vm, page_size, flag);
> >> +
> >> +	if (expect_valid) {
> >> +		TEST_ASSERT(fd > 0,
> >> +			    "guest_memfd() with flag '0x%lx' should be valid",
> >> +			    flag);
> >> +		close(fd);
> >> +	} else {
> >> +		TEST_ASSERT(fd == -1 && errno == EINVAL,
> >> +			    "guest_memfd() with flag '0x%lx' should fail with EINVAL",
> >> +			    flag);
> >> +	}
> >> +}
> >> +
> >>  static void test_vm_type_gmem_flag_validity(unsigned long vm_type,
> >>  					    uint64_t expected_valid_flags)
> >>  {
> >> -	size_t page_size = getpagesize();
> >>  	struct kvm_vm *vm;
> >>  	uint64_t flag = 0;
> >> -	int fd;
> >>  
> >>  	if (!(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(vm_type)))
> >>  		return;
> >> @@ -415,17 +433,11 @@ static void test_vm_type_gmem_flag_validity(unsigned long vm_type,
> >>  	vm = vm_create_barebones_type(vm_type);
> >>  
> >>  	for (flag = BIT(0); flag; flag <<= 1) {
> >> -		fd = __vm_create_guest_memfd(vm, page_size, flag);
> >> +		test_vm_with_gmem_flag(vm, flag, flag & expected_valid_flags);
> >>  
> >> -		if (flag & expected_valid_flags) {
> >> -			TEST_ASSERT(fd > 0,
> >> -				    "guest_memfd() with flag '0x%lx' should be valid",
> >> -				    flag);
> >> -			close(fd);
> >> -		} else {
> >> -			TEST_ASSERT(fd == -1 && errno == EINVAL,
> >> -				    "guest_memfd() with flag '0x%lx' should fail with EINVAL",
> >> -				    flag);
> >> +		if (flag == GUEST_MEMFD_FLAG_SUPPORT_SHARED) {
> >> +			test_vm_with_gmem_flag(
> >> +				vm, flag | GUEST_MEMFD_FLAG_INIT_PRIVATE, true);
> >
> > I don't understand the point of this check.  In 2/51 we set 
> > GUEST_MEMFD_FLAG_INIT_PRIVATE when GUEST_MEMFD_FLAG_SUPPORT_SHARED is set.
> >
> > When can this check ever fail?
> >
> > Ira
> 
> In 02/51, GUEST_MEMFD_FLAG_INIT_PRIVATE is not set by default;
> GUEST_MEMFD_FLAG_INIT_PRIVATE is only included as one of the valid_flags.

Ah My mistake I read that too quickly.

Thanks,
Ira

> 
> The intention is that GUEST_MEMFD_FLAG_INIT_PRIVATE is only valid if
> GUEST_MEMFD_FLAG_SUPPORT_SHARED is requested by userspace.
> 
> In this test, the earlier part before the if block calls
> test_vm_with_gmem_flag() with all valid flags, and that already tests
> GUEST_MEMFD_FLAG_SUPPORT_SHARED individually.
> 
> Specifically if GUEST_MEMFD_FLAG_SUPPORT_SHARED is set, this if block
> adds a test for when both GUEST_MEMFD_FLAG_SUPPORT_SHARED and
> GUEST_MEMFD_FLAG_INIT_PRIVATE are set, and sets that expect_valid is
> true.
> 
> This second test doesn't fail; it is meant to check that the kernel
> allows the pair of flags to be set. Hope that makes sense.



^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (51 preceding siblings ...)
  2025-05-16  0:22 ` [RFC PATCH v2 51/51] KVM: selftests: Test guest_memfd for accuracy of st_blocks Ackerley Tng
@ 2025-05-16 19:48 ` Ira Weiny
  2025-05-16 19:59   ` Ira Weiny
  2025-05-16 22:43 ` Ackerley Tng
                   ` (2 subsequent siblings)
  55 siblings, 1 reply; 231+ messages in thread
From: Ira Weiny @ 2025-05-16 19:48 UTC (permalink / raw)
  To: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	afranji
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Ackerley Tng wrote:
> Hello,
> 
> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> upstream calls to provide 1G page support for guest_memfd by taking
> pages from HugeTLB.
> 
> This patchset is based on Linux v6.15-rc6, and requires the mmap support
> for guest_memfd patchset (Thanks Fuad!) [1].

Trying to manage dependencies, I find that Ryan's just-released series [1]
is required to build this set.

[1] https://lore.kernel.org/all/cover.1747368092.git.afranji@google.com/

Specifically this patch:
	https://lore.kernel.org/all/1f42c32fc18d973b8ec97c8be8b7cd921912d42a.1747368092.git.afranji@google.com/

	defines

	alloc_anon_secure_inode()

Am I wrong in that?

> 
> For ease of testing, this series is also available, stitched together,
> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> 

I went digging in your git tree and then found Ryan's set.  So thanks for
the git tree.  :-D

However, it seems this adds another dependency which should be managed in
David's email of dependencies?

Ira

> This patchset can be divided into two sections:
> 
> (a) Patches from the beginning up to and including "KVM: selftests:
>     Update script to map shared memory from guest_memfd" are a modified
>     version of "conversion support for guest_memfd", which Fuad is
>     managing [2].
> 
> (b) Patches after "KVM: selftests: Update script to map shared memory
>     from guest_memfd" till the end are patches that actually bring in 1G
>     page support for guest_memfd.
> 
> These are the significant differences between (a) and [2]:
> 
> + [2] uses an xarray to track sharability, but I used a maple tree
>   because for 1G pages, iterating pagewise to update shareability was
>   prohibitively slow even for testing. I was choosing from among
>   multi-index xarrays, interval trees and maple trees [3], and picked
>   maple trees because
>     + Maple trees were easier to figure out since I didn't have to
>       compute the correct multi-index order and handle edge cases if the
>       converted range wasn't a neat power of 2.
>     + Maple trees were easier to figure out as compared to updating
>       parts of a multi-index xarray.
>     + Maple trees had an easier API to use than interval trees.
> + [2] doesn't yet have a conversion ioctl, but I needed it to test 1G
>   support end-to-end.
> + (a) Removes guest_memfd from participating in LRU, which I needed, to
>   get conversion selftests to work as expected, since participation in
>   LRU was causing some unexpected refcounts on folios which was blocking
>   conversions.
> 
> I am sending (a) in emails as well, as opposed to just leaving it on
> GitHub, so that we can discuss by commenting inline on emails. If you'd
> like to just look at 1G page support, here are some key takeaways from
> the first section (a):
> 
> + If GUEST_MEMFD_FLAG_SUPPORT_SHARED is requested during guest_memfd
>   creation, guest_memfd will
>     + Track shareability (whether an index in the inode is guest-only or
>       if the host is allowed to fault memory at a given index).
>     + Always be used for guest faults - specifically, kvm_gmem_get_pfn()
>       will be used to provide pages for the guest.
>     + Always be used by KVM to check private/shared status of a gfn.
> + guest_memfd now has conversion ioctls, allowing conversion to
>   private/shared
>     + Conversion can fail if there are unexpected refcounts on any
>       folios in the range.
> 
> Focusing on (b) 1G page support, here's an overview:
> 
> 1. A bunch of refactoring patches for HugeTLB that isolates the
>    allocation of a HugeTLB folio from other HugeTLB concepts such as
>    VMA-level reservations, and HugeTLBfs-specific concepts, such as
>    where memory policy is stored in the VMA, or where the subpool is
>    stored on the inode.
> 2. A few patches that add a guestmem_hugetlb allocator within mm/. The
>    guestmem_hugetlb allocator is a wrapper around HugeTLB to modularize
>    the memory management functions, and to cleanly handle cleanup, so
>    that folio cleanup can happen after the guest_memfd inode (and even
>    KVM) goes away.
> 3. Some updates to guest_memfd to use the guestmem_hugetlb allocator.
> 4. Selftests for 1G page support.
> 
> Here are some remaining issues/TODOs:
> 
> 1. Memory error handling such as machine check errors have not been
>    implemented.
> 2. I've not looked into preparedness of pages, only zeroing has been
>    considered.
> 3. When allocating HugeTLB pages, if two threads allocate indices
>    mapping to the same huge page, the utilization in guest_memfd inode's
>    subpool may momentarily go over the subpool limit (the requested size
>    of the inode at guest_memfd creation time), causing one of the two
>    threads to get -ENOMEM. Suggestions to solve this are appreciated!
> 4. max_usage_in_bytes statistic (cgroups v1) for guest_memfd HugeTLB
>    pages should be correct but needs testing and could be wrong.
> 5. memcg charging (charge_memcg()) for cgroups v2 for guest_memfd
>    HugeTLB pages after splitting should be correct but needs testing and
>    could be wrong.
> 6. Page cache accounting: When a hugetlb page is split, guest_memfd will
>    incur page count in both NR_HUGETLB (counted at hugetlb allocation
>    time) and NR_FILE_PAGES stats (counted when split pages are added to
>    the filemap). Is this aligned with what people expect?
> 
> Here are some optimizations that could be explored in future series:
> 
> 1. Pages could be split from 1G to 2M first and only split to 4K if
>    necessary.
> 2. Zeroing could be skipped for Coco VMs if hardware already zeroes the
>    pages.
> 
> Here's RFC v1 [4] if you're interested in the motivation behind choosing
> HugeTLB, or the history of this patch series.
> 
> [1] https://lore.kernel.org/all/20250513163438.3942405-11-tabba@google.com/T/
> [2] https://lore.kernel.org/all/20250328153133.3504118-1-tabba@google.com/T/
> [3] https://lore.kernel.org/all/diqzzfih8q7r.fsf@ackerleytng-ctop.c.googlers.com/
> [4] https://lore.kernel.org/all/cover.1726009989.git.ackerleytng@google.com/T/
> 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-05-16 19:48 ` [RFC PATCH v2 00/51] 1G page support for guest_memfd Ira Weiny
@ 2025-05-16 19:59   ` Ira Weiny
  2025-05-16 20:26     ` Ackerley Tng
  0 siblings, 1 reply; 231+ messages in thread
From: Ira Weiny @ 2025-05-16 19:59 UTC (permalink / raw)
  To: Ira Weiny, Ackerley Tng, kvm, linux-mm, linux-kernel, x86,
	linux-fsdevel, afranji
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Ira Weiny wrote:
> Ackerley Tng wrote:
> > Hello,
> > 
> > This patchset builds upon discussion at LPC 2024 and many guest_memfd
> > upstream calls to provide 1G page support for guest_memfd by taking
> > pages from HugeTLB.
> > 
> > This patchset is based on Linux v6.15-rc6, and requires the mmap support
> > for guest_memfd patchset (Thanks Fuad!) [1].
> 
> Trying to manage dependencies I find that Ryan's just released series[1]
> is required to build this set.
> 
> [1] https://lore.kernel.org/all/cover.1747368092.git.afranji@google.com/
> 
> Specifically this patch:
> 	https://lore.kernel.org/all/1f42c32fc18d973b8ec97c8be8b7cd921912d42a.1747368092.git.afranji@google.com/
> 
> 	defines
> 
> 	alloc_anon_secure_inode()

Perhaps Ryan's set is not required?  Just that patch?

It looks like Ryan's 2/13 is the same as your 1/51 patch?

https://lore.kernel.org/all/754b4898c3362050071f6dd09deb24f3c92a41c3.1747368092.git.afranji@google.com/

I'll pull 1/13 and see where I get.

Ira

> 
> Am I wrong in that?
> 
> > 
> > For ease of testing, this series is also available, stitched together,
> > at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> > 
> 
> I went digging in your git tree and then found Ryan's set.  So thanks for
> the git tree.  :-D
> 
> However, it seems this add another dependency which should be managed in
> David's email of dependencies?
> 
> Ira
> 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-05-16 19:14                 ` Edgecombe, Rick P
@ 2025-05-16 20:25                   ` Dave Hansen
  2025-05-16 21:42                     ` Edgecombe, Rick P
  0 siblings, 1 reply; 231+ messages in thread
From: Dave Hansen @ 2025-05-16 20:25 UTC (permalink / raw)
  To: Edgecombe, Rick P, seanjc@google.com
  Cc: palmer@dabbelt.com, kvm@vger.kernel.org, catalin.marinas@arm.com,
	Miao, Jun, nsaenz@amazon.es, pdurrant@amazon.co.uk,
	vbabka@suse.cz, peterx@redhat.com, x86@kernel.org, jack@suse.cz,
	tabba@google.com, quic_svaddagi@quicinc.com, amoorthy@google.com,
	pvorel@suse.cz, vkuznets@redhat.com, mail@maciej.szmigiero.name,
	Annapurve, Vishal, anthony.yznaga@oracle.com, Wang, Wei W,
	keirf@google.com, Wieczor-Retman, Maciej, Zhao, Yan Y,
	ajones@ventanamicro.com, rppt@kernel.org,
	quic_mnalajal@quicinc.com, aik@amd.com, usama.arif@bytedance.com,
	fvdl@google.com, paul.walmsley@sifive.com, bfoster@redhat.com,
	quic_cvanscha@quicinc.com, willy@infradead.org, Du, Fan,
	quic_eberman@quicinc.com, thomas.lendacky@amd.com,
	linux-kernel@vger.kernel.org, mic@digikod.net,
	oliver.upton@linux.dev, akpm@linux-foundation.org,
	steven.price@arm.com, muchun.song@linux.dev,
	binbin.wu@linux.intel.com, Li, Zhiquan1, rientjes@google.com,
	Aktas, Erdem, mpe@ellerman.id.au, david@redhat.com, jgg@ziepe.ca,
	hughd@google.com, Xu, Haibo1, jhubbard@nvidia.com,
	anup@brainfault.org, maz@kernel.org, Yamahata, Isaku,
	jthoughton@google.com, steven.sistare@oracle.com,
	quic_pheragu@quicinc.com, jarkko@kernel.org, Shutemov, Kirill,
	chenhuacai@kernel.org, Huang, Kai, shuah@kernel.org,
	dwmw@amazon.co.uk, pankaj.gupta@amd.com, Peng, Chao P,
	nikunj@amd.com, Graf, Alexander, viro@zeniv.linux.org.uk,
	pbonzini@redhat.com, yuzenghui@huawei.com, jroedel@suse.de,
	suzuki.poulose@arm.com, jgowans@amazon.com, Xu, Yilun,
	liam.merwick@oracle.com, michael.roth@amd.com,
	quic_tsoni@quicinc.com, richard.weiyang@gmail.com, Weiny, Ira,
	aou@eecs.berkeley.edu, Li, Xiaoyao, qperret@google.com,
	kent.overstreet@linux.dev, dmatlack@google.com,
	james.morse@arm.com, brauner@kernel.org, hch@infradead.org,
	ackerleytng@google.com, linux-fsdevel@vger.kernel.org,
	pgonda@google.com, quic_pderrin@quicinc.com, roypat@amazon.co.uk,
	will@kernel.org, linux-mm@kvack.org

On 5/16/25 12:14, Edgecombe, Rick P wrote:
> Meanwhile I'm watching patches to make 5 level paging walks unconditional fly by
> because people couldn't find a cost to the extra level of walk. So re-litigate,
> no. But I'll probably remain quietly suspicious of the exact cost/value. At
> least on the CPU side, I totally missed the IOTLB side at first, sorry.

It's a little more complicated than just the depth of the worst-case walk.

In practice, many page walks can use the mid-level paging structure
caches because the mappings aren't sparse.

With 5-level paging in particular, userspace doesn't actually change
much at all. Its layout is pretty much the same unless folks are opting
in to the higher (5-level only) address space. So userspace isn't
sparse, at least at the scale of what 5-level paging is capable of.

For the kernel, things are a bit more spread out than they were before.
For instance, the direct map and vmalloc() are in separate p4d pages
when they used to be nestled together in the same half of one pgd.

But, again, they're not *that* sparse. The direct map, for example,
doesn't become more sparse, it just moves to a lower virtual address.
Ditto for vmalloc().  Just because 5-level paging has a massive
vmalloc() area doesn't mean we use it.

Basically, 5-level paging adds a level to the top of the page walk, and
we're really good at caching those when they're not accessed sparsely.

CPUs are not as good at caching the leaf side of the page walk. There
are tricks like AMD's TLB coalescing that help. But, generally, each
walk on the leaf end of the walks eats a TLB entry. Those just don't
cache as well as the top of the tree.

That's why we need to be more maniacal about reducing leaf levels than
the levels toward the root.
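
To put rough numbers on that asymmetry: a single 1GiB leaf mapping covers
what would otherwise take 512 TLB entries at 2MiB or 262,144 entries at
4KiB, while the number of upper-level entries above it stays essentially
the same.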

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-05-16 19:59   ` Ira Weiny
@ 2025-05-16 20:26     ` Ackerley Tng
  0 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-16 20:26 UTC (permalink / raw)
  To: Ira Weiny, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	afranji
  Cc: aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li

Ira Weiny <ira.weiny@intel.com> writes:

> Ira Weiny wrote:
>> Ackerley Tng wrote:
>> > Hello,
>> > 
>> > This patchset builds upon discussion at LPC 2024 and many guest_memfd
>> > upstream calls to provide 1G page support for guest_memfd by taking
>> > pages from HugeTLB.
>> > 
>> > This patchset is based on Linux v6.15-rc6, and requires the mmap support
>> > for guest_memfd patchset (Thanks Fuad!) [1].
>> 
>> Trying to manage dependencies I find that Ryan's just released series[1]
>> is required to build this set.
>> 
>> [1] https://lore.kernel.org/all/cover.1747368092.git.afranji@google.com/
>> 
>> Specifically this patch:
>> 	https://lore.kernel.org/all/1f42c32fc18d973b8ec97c8be8b7cd921912d42a.1747368092.git.afranji@google.com/
>> 
>> 	defines
>> 
>> 	alloc_anon_secure_inode()
>
> Perhaps Ryan's set is not required?  Just that patch?
>
> It looks like Ryan's 2/13 is the same as your 1/51 patch?
>
> https://lore.kernel.org/all/754b4898c3362050071f6dd09deb24f3c92a41c3.1747368092.git.afranji@google.com/
>
> I'll pull 1/13 and see where I get.
>
> Ira
>
>> 
>> Am I wrong in that?
>>

My bad, this patch was missing from this series:

From bd629d1ec6ffb7091a5f996dc7835abed8467f3e Mon Sep 17 00:00:00 2001
Message-ID: <bd629d1ec6ffb7091a5f996dc7835abed8467f3e.1747426836.git.ackerleytng@google.com>
From: Ackerley Tng <ackerleytng@google.com>
Date: Wed, 7 May 2025 07:59:28 -0700
Subject: [RFC PATCH v2 1/1] fs: Refactor to provide function that allocates a
 secure anonymous inode

alloc_anon_secure_inode() returns an inode after running checks in
security_inode_init_security_anon().

Also refactor secretmem's file creation process to use the new
function.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Change-Id: I4eb8622775bc3d544ec695f453ffd747d9490e40
---
 fs/anon_inodes.c   | 22 ++++++++++++++++------
 include/linux/fs.h |  1 +
 mm/secretmem.c     |  9 +--------
 3 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 583ac81669c2..4c3110378647 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -55,17 +55,20 @@ static struct file_system_type anon_inode_fs_type = {
 	.kill_sb	= kill_anon_super,
 };
 
-static struct inode *anon_inode_make_secure_inode(
-	const char *name,
-	const struct inode *context_inode)
+static struct inode *anon_inode_make_secure_inode(struct super_block *s,
+		const char *name, const struct inode *context_inode,
+		bool fs_internal)
 {
 	struct inode *inode;
 	int error;
 
-	inode = alloc_anon_inode(anon_inode_mnt->mnt_sb);
+	inode = alloc_anon_inode(s);
 	if (IS_ERR(inode))
 		return inode;
-	inode->i_flags &= ~S_PRIVATE;
+
+	if (!fs_internal)
+		inode->i_flags &= ~S_PRIVATE;
+
 	error =	security_inode_init_security_anon(inode, &QSTR(name),
 						  context_inode);
 	if (error) {
@@ -75,6 +78,12 @@ static struct inode *anon_inode_make_secure_inode(
 	return inode;
 }
 
+struct inode *alloc_anon_secure_inode(struct super_block *s, const char *name)
+{
+	return anon_inode_make_secure_inode(s, name, NULL, true);
+}
+EXPORT_SYMBOL_GPL(alloc_anon_secure_inode);
+
 static struct file *__anon_inode_getfile(const char *name,
 					 const struct file_operations *fops,
 					 void *priv, int flags,
@@ -88,7 +97,8 @@ static struct file *__anon_inode_getfile(const char *name,
 		return ERR_PTR(-ENOENT);
 
 	if (make_inode) {
-		inode =	anon_inode_make_secure_inode(name, context_inode);
+		inode = anon_inode_make_secure_inode(anon_inode_mnt->mnt_sb,
+						     name, context_inode, false);
 		if (IS_ERR(inode)) {
 			file = ERR_CAST(inode);
 			goto err;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 016b0fe1536e..0fded2e3c661 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3550,6 +3550,7 @@ extern int simple_write_begin(struct file *file, struct address_space *mapping,
 extern const struct address_space_operations ram_aops;
 extern int always_delete_dentry(const struct dentry *);
 extern struct inode *alloc_anon_inode(struct super_block *);
+extern struct inode *alloc_anon_secure_inode(struct super_block *, const char *);
 extern int simple_nosetlease(struct file *, int, struct file_lease **, void **);
 extern const struct dentry_operations simple_dentry_operations;
 
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 1b0a214ee558..c0e459e58cb6 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -195,18 +195,11 @@ static struct file *secretmem_file_create(unsigned long flags)
 	struct file *file;
 	struct inode *inode;
 	const char *anon_name = "[secretmem]";
-	int err;
 
-	inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
+	inode = alloc_anon_secure_inode(secretmem_mnt->mnt_sb, anon_name);
 	if (IS_ERR(inode))
 		return ERR_CAST(inode);
 
-	err = security_inode_init_security_anon(inode, &QSTR(anon_name), NULL);
-	if (err) {
-		file = ERR_PTR(err);
-		goto err_free_inode;
-	}
-
 	file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
 				 O_RDWR, &secretmem_fops);
 	if (IS_ERR(file))
-- 
2.49.0.1101.gccaa498523-goog

>> > 
>> > For ease of testing, this series is also available, stitched together,
>> > at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
>> > 
>> 
>> I went digging in your git tree and then found Ryan's set.  So thanks for
>> the git tree.  :-D

Glad that helped!

>> 
>> However, it seems this add another dependency which should be managed in
>> David's email of dependencies?

This is a good idea. David, do you think these two patches should be
managed as a separate patch series in the email of dependencies?

+ (left out of RFCv2, but is above) "fs: Refactor to provide function that allocates a secure anonymous inode"
+ 01/51 "KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes"

They're being used by a few patch series now.

>> 
>> Ira
>> 

^ permalink raw reply related	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 29/51] mm: guestmem_hugetlb: Wrap HugeTLB as an allocator for guest_memfd
  2025-05-16 14:07   ` Ackerley Tng
@ 2025-05-16 20:33     ` Ackerley Tng
  0 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-16 20:33 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li

Ackerley Tng <ackerleytng@google.com> writes:

> Ackerley Tng <ackerleytng@google.com> writes:
>
>> guestmem_hugetlb is an allocator for guest_memfd. It wraps HugeTLB to
>> provide huge folios for guest_memfd.
>>
>> This patch also introduces guestmem_allocator_operations as a set of
>> operations that allocators for guest_memfd can provide. In a later
>> patch, guest_memfd will use these operations to manage pages from an
>> allocator.
>>
>> The allocator operations are memory-management specific and are placed
>> in mm/ so key mm-specific functions do not have to be exposed
>> unnecessarily.
>>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>>
>> Change-Id: I3cafe111ea7b3c84755d7112ff8f8c541c11136d
>> ---
>>  include/linux/guestmem.h      |  20 +++++
>>  include/uapi/linux/guestmem.h |  29 +++++++
>>  mm/Kconfig                    |   5 +-
>>  mm/guestmem_hugetlb.c         | 159 ++++++++++++++++++++++++++++++++++
>>  4 files changed, 212 insertions(+), 1 deletion(-)
>>  create mode 100644 include/linux/guestmem.h
>>  create mode 100644 include/uapi/linux/guestmem.h
>>
>> <snip>
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 131adc49f58d..bb6e39e37245 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -1218,7 +1218,10 @@ config SECRETMEM
>>  
>>  config GUESTMEM_HUGETLB
>>  	bool "Enable guestmem_hugetlb allocator for guest_memfd"
>> -	depends on HUGETLBFS
>> +	select GUESTMEM
>> +	select HUGETLBFS
>> +	select HUGETLB_PAGE
>> +	select HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>
> My bad. I left out CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=y in
> my testing and just found that when it is set, I hit
>
>   BUG_ON(pte_page(ptep_get(pte)) != walk->reuse_page);
>
> with the basic guest_memfd_test on splitting pages on allocation.
>
> I'll follow up with the fix soon.
>
> Another note about testing: I've been testing in a nested VM for the
> development process:
>
> 1. Host
> 2. VM for development
> 3. Nested VM running kernel being developed
> 4. Nested nested VMs created during selftests
>
> This series has not yet been tested on a physical host.
>
>>  	help
>>  	  Enable this to make HugeTLB folios available to guest_memfd
>>  	  (KVM virtualization) as backing memory.
>>
>> <snip>
>>

Here's the fix for this issue

From 998af6404d4e39920ba42764e7f3815cb9bb9e3d Mon Sep 17 00:00:00 2001
Message-ID: <998af6404d4e39920ba42764e7f3815cb9bb9e3d.1747427489.git.ackerleytng@google.com>
From: Ackerley Tng <ackerleytng@google.com>
Date: Fri, 16 May 2025 13:14:55 -0700
Subject: [RFC PATCH v2 1/1] KVM: guest_memfd: Reorder undoing vmemmap
 optimization and stashing hugetlb folio metadata

Without this patch, when HugeTLB folio metadata was stashed, the
vmemmap_optimized flag, stored in a HugeTLB folio's folio->private, was
stashed as set.

The first split works, but on merging, when the folio metadata is
unstashed, vmemmap_optimized is unstashed as set, making the call to
hugetlb_vmemmap_optimize_folio() skip actually reapplying the
optimization.

On a second split, hugetlb_vmemmap_restore_folio() attempts to undo an
optimization that was never reapplied, hence hitting the BUG().

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 mm/guestmem_hugetlb.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/mm/guestmem_hugetlb.c b/mm/guestmem_hugetlb.c
index 8727598cf18e..2c0192543676 100644
--- a/mm/guestmem_hugetlb.c
+++ b/mm/guestmem_hugetlb.c
@@ -200,16 +200,21 @@ static int guestmem_hugetlb_split_folio(struct folio *folio)
 		return 0;
 
 	orig_nr_pages = folio_nr_pages(folio);
-	ret = guestmem_hugetlb_stash_metadata(folio);
+
+	/*
+	 * hugetlb_vmemmap_restore_folio() has to be called ahead of the rest
+	 * because it checks page type. This doesn't actually split the folio,
+	 * so the first few struct pages are still intact.
+	 */
+	ret = hugetlb_vmemmap_restore_folio(folio_hstate(folio), folio);
 	if (ret)
 		return ret;
 
 	/*
-	 * hugetlb_vmemmap_restore_folio() has to be called ahead of the rest
-	 * because it checks and page type. This doesn't actually split the
-	 * folio, so the first few struct pages are still intact.
+	 * Stash metadata after vmemmap stuff so the outcome of the vmemmap
+	 * restoration is stashed.
 	 */
-	ret = hugetlb_vmemmap_restore_folio(folio_hstate(folio), folio);
+	ret = guestmem_hugetlb_stash_metadata(folio);
 	if (ret)
 		goto err;
 
@@ -254,8 +259,7 @@ static int guestmem_hugetlb_split_folio(struct folio *folio)
 	return 0;
 
 err:
-	guestmem_hugetlb_unstash_free_metadata(folio);
-
+	hugetlb_vmemmap_optimize_folio(folio_hstate(folio), folio);
 	return ret;
 }
 
-- 
2.49.0.1101.gccaa498523-goog


^ permalink raw reply related	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-05-16 20:25                   ` Dave Hansen
@ 2025-05-16 21:42                     ` Edgecombe, Rick P
  0 siblings, 0 replies; 231+ messages in thread
From: Edgecombe, Rick P @ 2025-05-16 21:42 UTC (permalink / raw)
  To: Hansen, Dave, seanjc@google.com
  Cc: pvorel@suse.cz, kvm@vger.kernel.org, catalin.marinas@arm.com,
	Miao, Jun, Shutemov, Kirill, pdurrant@amazon.co.uk,
	vbabka@suse.cz, peterx@redhat.com, x86@kernel.org,
	amoorthy@google.com, tabba@google.com, maz@kernel.org,
	jack@suse.cz, palmer@dabbelt.com, quic_svaddagi@quicinc.com,
	keirf@google.com, Annapurve, Vishal, anthony.yznaga@oracle.com,
	mail@maciej.szmigiero.name, vkuznets@redhat.com,
	Wieczor-Retman, Maciej, Zhao, Yan Y, ajones@ventanamicro.com,
	quic_mnalajal@quicinc.com, rppt@kernel.org, Du, Fan, aik@amd.com,
	usama.arif@bytedance.com, Wang, Wei W, fvdl@google.com,
	paul.walmsley@sifive.com, bfoster@redhat.com,
	quic_cvanscha@quicinc.com, willy@infradead.org,
	steven.price@arm.com, quic_eberman@quicinc.com,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	mic@digikod.net, oliver.upton@linux.dev, nsaenz@amazon.es,
	akpm@linux-foundation.org, binbin.wu@linux.intel.com,
	muchun.song@linux.dev, Li, Zhiquan1, rientjes@google.com,
	Aktas, Erdem, mpe@ellerman.id.au, david@redhat.com, jgg@ziepe.ca,
	hughd@google.com, Xu, Haibo1, jhubbard@nvidia.com,
	anup@brainfault.org, Yamahata, Isaku, jthoughton@google.com,
	steven.sistare@oracle.com, jarkko@kernel.org,
	quic_pheragu@quicinc.com, chenhuacai@kernel.org, Huang, Kai,
	shuah@kernel.org, dwmw@amazon.co.uk, pankaj.gupta@amd.com,
	Peng, Chao P, nikunj@amd.com, Graf, Alexander,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Xu, Yilun, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com,
	richard.weiyang@gmail.com, Weiny, Ira, aou@eecs.berkeley.edu,
	Li, Xiaoyao, qperret@google.com, kent.overstreet@linux.dev,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	ackerleytng@google.com, linux-fsdevel@vger.kernel.org,
	pgonda@google.com, quic_pderrin@quicinc.com, hch@infradead.org,
	roypat@amazon.co.uk, will@kernel.org, linux-mm@kvack.org

On Fri, 2025-05-16 at 13:25 -0700, Dave Hansen wrote:
> It's a little more complicated than just the depth of the worst-case walk.
> 
> In practice, many page walks can use the mid-level paging structure
> caches because the mappings aren't sparse.
> 
> With 5-level paging in particular, userspace doesn't actually change
> much at all. Its layout is pretty much the same unless folks are opting
> in to the higher (5-level only) address space. So userspace isn't
> sparse, at least at the scale of what 5-level paging is capable of.
> 
> For the kernel, things are a bit more spread out than they were before.
> For instance, the direct map and vmalloc() are in separate p4d pages
> when they used to be nestled together in the same half of one pgd.
> 
> But, again, they're not *that* sparse. The direct map, for example,
> doesn't become more sparse, it just moves to a lower virtual address.
> Ditto for vmalloc().  Just because 5-level paging has a massive
> vmalloc() area doesn't mean we use it.
> 
> Basically, 5-level paging adds a level to the top of the page walk, and
> we're really good at caching those when they're not accessed sparsely.
> 
> CPUs are not as good at caching the leaf side of the page walk. There
> are tricks like AMD's TLB coalescing that help. But, generally, each
> walk on the leaf end of the walks eats a TLB entry. Those just don't
> cache as well as the top of the tree.
> 
> That's why we need to be more maniacal about reducing leaf levels than
> the levels toward the root.

Makes sense. For what is easy for the CPU to cache, it can be more about the
address space layout than the length of the walk.

Going off topic from this patchset...

I have a possibly fun related anecdote. A while ago when I was doing the KVM XO
stuff, I was trying to test how much worse the performance was from caches being
forced to deal with the sparser GPA accesses. The test was to modify the guest
to force all the executable GVA mappings to go on the XO alias. I was confused
to find that KVM XO was faster than the normal layout by a small, but consistent
amount. It had me scratching my head. It turned out that the NX huge page
mitigation was able to maintain large pages for the data accesses because all
the executable accesses were moved off of the main GPA alias.

My takeaway was that the real world implementations can interact in surprising
ways, and for at least my ability to reason about it, it's good to verify with a
test when possible.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (52 preceding siblings ...)
  2025-05-16 19:48 ` [RFC PATCH v2 00/51] 1G page support for guest_memfd Ira Weiny
@ 2025-05-16 22:43 ` Ackerley Tng
  2025-06-19  8:13 ` Yan Zhao
  2025-06-26 23:19 ` Ackerley Tng
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-16 22:43 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li

Ackerley Tng <ackerleytng@google.com> writes:

> <snip>
>
> Here are some remaining issues/TODOs:
>
> 1. Memory error handling such as machine check errors have not been
>    implemented.
> 2. I've not looked into preparedness of pages, only zeroing has been
>    considered.
> 3. When allocating HugeTLB pages, if two threads allocate indices
>    mapping to the same huge page, the utilization in guest_memfd inode's
>    subpool may momentarily go over the subpool limit (the requested size
>    of the inode at guest_memfd creation time), causing one of the two
>    threads to get -ENOMEM. Suggestions to solve this are appreciated!
> 4. max_usage_in_bytes statistic (cgroups v1) for guest_memfd HugeTLB
>    pages should be correct but needs testing and could be wrong.
> 5. memcg charging (charge_memcg()) for cgroups v2 for guest_memfd
>    HugeTLB pages after splitting should be correct but needs testing and
>    could be wrong.
> 6. Page cache accounting: When a hugetlb page is split, guest_memfd will
>    incur page count in both NR_HUGETLB (counted at hugetlb allocation
>    time) and NR_FILE_PAGES stats (counted when split pages are added to
>    the filemap). Is this aligned with what people expect?
>

For people who might be testing this series with non-CoCo VMs (heads up,
Patrick and Nikita!), this currently splits the folio as long as any part
of the huge folio has shareability set to shared, which is probably
unnecessary?

IIUC core-mm doesn't support mapping at 1G, but from a cursory reading it
seems like the faulting path that calls kvm_gmem_fault_shared() could
possibly map a 1G page at 4K granularity.

Looks like we might need another flag like
GUEST_MEMFD_FLAG_SUPPORT_CONVERSION, which will gate initialization of
the shareability maple tree/xarray.

If shareability is NULL for the entire hugepage range, then no splitting
will occur.

For Coco VMs, this should be safe, since if this flag is not set,
kvm_gmem_fault_shared() will never be able to fault memory in (the
shareability value will be NULL).
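
As a rough sketch of what that gate could look like at guest_memfd
creation time (illustrative only: GUEST_MEMFD_FLAG_SUPPORT_CONVERSION is
the proposed flag above, and kvm_gmem_shareability_init() /
kvm_gmem_shareability_setup() are placeholder names):

/*
 * Sketch: only set up shareability tracking (and hence any conversion-
 * driven folio splitting) when userspace opts in with the new flag.
 */
static int kvm_gmem_shareability_init(struct kvm_gmem_inode_private *private,
				      loff_t size, u64 flags)
{
	if (!(flags & GUEST_MEMFD_FLAG_SUPPORT_CONVERSION))
		return 0;	/* shareability stays NULL; folios never split */

	return kvm_gmem_shareability_setup(private, size, flags);
}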

> Here are some optimizations that could be explored in future series:
>
> 1. Pages could be split from 1G to 2M first and only split to 4K if
>    necessary.
> 2. Zeroing could be skipped for Coco VMs if hardware already zeroes the
>    pages.
>
> <snip>

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-14 23:41 ` [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls Ackerley Tng
  2025-05-15 14:50   ` Ira Weiny
@ 2025-05-20  9:22   ` Fuad Tabba
  2025-05-20 13:02     ` Vishal Annapurve
  2025-05-28  3:16   ` Binbin Wu
  2 siblings, 1 reply; 231+ messages in thread
From: Fuad Tabba @ 2025-05-20  9:22 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, thomas.lendacky,
	usama.arif, vannapurve, vbabka, viro, vkuznets, wei.w.wang, will,
	willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui, zhiquan1.li

Hi Ackerley,

On Thu, 15 May 2025 at 00:43, Ackerley Tng <ackerleytng@google.com> wrote:
>
> The two new guest_memfd ioctls KVM_GMEM_CONVERT_SHARED and
> KVM_GMEM_CONVERT_PRIVATE convert the requested memory ranges to shared
> and private respectively.

I have a high level question about this particular patch and this
approach for conversion: why do we need IOCTLs to manage conversion
between private and shared?

In the presentations I gave at LPC [1, 2], and in my latest patch
series that performs in-place conversion [3] and the associated (by
now outdated) state diagram [4], I didn't see the need to have a
userspace-facing interface to manage that. KVM has all the information
it needs to handle conversions, which are triggered by the guest. To
me this seems like it adds additional complexity, as well as a user
facing interface that we would need to maintain.

There are various ways we could handle conversion without explicit
interference from userspace. What I had in mind is the following (as
an example, details can vary according to VM type). I will use the
case of conversion from shared to private because that is the more
complicated (interesting) case:

- Guest issues a hypercall to request that a shared folio become private.

- The hypervisor receives the call, and passes it to KVM.

- KVM unmaps the folio from the guest stage-2 (EPT I think in x86
parlance), and unmaps it from the host. The host, however, could still
have references (e.g., GUP).

- KVM exits to the host (hypervisor call exit), with the information
that the folio has been unshared from it.

- A well behaving host would now get rid of all of its references
(e.g., release GUPs), perform a VCPU run, and the guest continues
running as normal. I expect this to be the common case.

But to handle the more interesting situation, let's say that the host
doesn't do it immediately, and for some reason it holds on to some
references to that folio.

- Even if that's the case, the guest can still run *. If the guest
tries to access the folio, KVM detects that access when it tries to
fault it into the guest, sees that the host still has references to
that folio, and exits back to the host with a memory fault exit. At
this point, the VCPU that has tried to fault in that particular folio
cannot continue running as long as it cannot fault in that folio.

- The host tries a VCPU run again, and the above repeats, i.e., KVM
checks the refcount, finds that the host still holds references,
doesn't fault the folio into the guest, and exits back to the host.

- Eventually a well-behaving host releases all its references, and the
following VCPU run is able to fault the page into the guest, and
proceed with running it.

In case the guest is destroyed before that happens, we have the whole
folio_put() callback scenario we had discussed earlier.
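
To make that loop concrete, here is a rough userspace-side sketch of the
flow above (illustrative only: decode_unshare(), drop_host_references()
and handle_other_exit() are placeholders, and the exact exit payloads
depend on the VM type):

#include <sys/ioctl.h>
#include <linux/kvm.h>

static void vcpu_loop(int vcpu_fd, struct kvm_run *run)
{
	__u64 gpa, size;

	for (;;) {
		ioctl(vcpu_fd, KVM_RUN, 0);

		switch (run->exit_reason) {
		case KVM_EXIT_HYPERCALL:
			/* Guest asked to unshare a range; KVM unmapped it. */
			decode_unshare(run, &gpa, &size);
			drop_host_references(gpa, size); /* munmap, unpin, ... */
			break;
		case KVM_EXIT_MEMORY_FAULT:
			/*
			 * KVM refused to fault the folio back in because the
			 * host still holds references; drop what is left and
			 * simply run again.
			 */
			drop_host_references(run->memory_fault.gpa,
					     run->memory_fault.size);
			break;
		default:
			handle_other_exit(run);
		}
	}
}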

In other words, the interface that I had in mind is KVM_RUN exits
(hyp call, memory fault) as well as VCPU run, both of which already
exist and convey the same information. Is there a case where that
isn't enough or is suboptimal?

Thanks,
/fuad

(*) An alternative suggestion was to block the VCPU from running
altogether, regardless of whether it wants to fault the unshared page
immediately, and continually exit to the host until references are
dropped and the conversion can happen.

[1] https://lpc.events/event/17/contributions/1487/
[2] https://lpc.events/event/18/contributions/1758/
[3] https://lore.kernel.org/all/20250328153133.3504118-1-tabba@google.com/
[4] https://lpc.events/event/18/contributions/1758/attachments/1457/3699/Guestmemfd%20folio%20state%20page_type.pdf


> A guest_memfd ioctl is used because shareability is a property of the
> memory, and this property should be modifiable independently of the
> attached struct kvm. This allows shareability to be modified even if
> the memory is not yet bound using memslots.
>
> For shared to private conversions, if refcounts on any of the folios
> within the range are elevated, fail the conversion with -EAGAIN.
>
> At the point of shared to private conversion, all folios in range are
> also unmapped. The filemap_invalidate_lock() is held, so no faulting
> can occur. Hence, from that point on, only transient refcounts can be
> taken on the folios associated with that guest_memfd.
>
> Hence, it is safe to do the conversion from shared to private.
>
> After conversion is complete, refcounts may become elevated, but that
> is fine since users of transient refcounts don't actually access
> memory.
>
> For private to shared conversions, there are no refcount checks. Any
> holders of transient refcounts are expected to drop them soon. The
> conversion process will spin waiting for these transient refcounts to
> go away.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>
> Change-Id: I3546aaf6c1b795de6dc9ba09e816b64934221918
> ---
>  include/uapi/linux/kvm.h |  11 ++
>  virt/kvm/guest_memfd.c   | 357 ++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 366 insertions(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index d7df312479aa..5b28e17f6f14 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1577,6 +1577,17 @@ struct kvm_create_guest_memfd {
>         __u64 reserved[6];
>  };
>
> +#define KVM_GMEM_IO 0xAF
> +#define KVM_GMEM_CONVERT_SHARED                _IOWR(KVM_GMEM_IO,  0x41, struct kvm_gmem_convert)
> +#define KVM_GMEM_CONVERT_PRIVATE       _IOWR(KVM_GMEM_IO,  0x42, struct kvm_gmem_convert)
> +
> +struct kvm_gmem_convert {
> +       __u64 offset;
> +       __u64 size;
> +       __u64 error_offset;
> +       __u64 reserved[5];
> +};
> +
>  #define KVM_PRE_FAULT_MEMORY   _IOWR(KVMIO, 0xd5, struct kvm_pre_fault_memory)
>
>  struct kvm_pre_fault_memory {
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 590932499eba..f802116290ce 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -30,6 +30,10 @@ enum shareability {
>  };
>
>  static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index);
> +static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
> +                                     pgoff_t end);
> +static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
> +                                   pgoff_t end);
>
>  static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
>  {
> @@ -85,6 +89,306 @@ static struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t inde
>         return kvm_gmem_get_folio(inode, index);
>  }
>
> +/**
> + * kvm_gmem_shareability_store() - Sets shareability to @value for range.
> + *
> + * @mt: the shareability maple tree.
> + * @index: the range begins at this index in the inode.
> + * @nr_pages: number of PAGE_SIZE pages in this range.
> + * @value: the shareability value to set for this range.
> + *
> + * Unlike mtree_store_range(), this function also merges adjacent ranges that
> + * have the same values as an optimization. Assumes that all stores to @mt go
> + * through this function, such that adjacent ranges are always merged.
> + *
> + * Return: 0 on success and negative error otherwise.
> + */
> +static int kvm_gmem_shareability_store(struct maple_tree *mt, pgoff_t index,
> +                                      size_t nr_pages, enum shareability value)
> +{
> +       MA_STATE(mas, mt, 0, 0);
> +       unsigned long start;
> +       unsigned long last;
> +       void *entry;
> +       int ret;
> +
> +       start = index;
> +       last = start + nr_pages - 1;
> +
> +       mas_lock(&mas);
> +
> +       /* Try extending range. entry is NULL on overflow/wrap-around. */
> +       mas_set_range(&mas, last + 1, last + 1);
> +       entry = mas_find(&mas, last + 1);
> +       if (entry && xa_to_value(entry) == value)
> +               last = mas.last;
> +
> +       mas_set_range(&mas, start - 1, start - 1);
> +       entry = mas_find(&mas, start - 1);
> +       if (entry && xa_to_value(entry) == value)
> +               start = mas.index;
> +
> +       mas_set_range(&mas, start, last);
> +       ret = mas_store_gfp(&mas, xa_mk_value(value), GFP_KERNEL);
> +
> +       mas_unlock(&mas);
> +
> +       return ret;
> +}
> +
> +struct conversion_work {
> +       struct list_head list;
> +       pgoff_t start;
> +       size_t nr_pages;
> +};
> +
> +static int add_to_work_list(struct list_head *list, pgoff_t start, pgoff_t last)
> +{
> +       struct conversion_work *work;
> +
> +       work = kzalloc(sizeof(*work), GFP_KERNEL);
> +       if (!work)
> +               return -ENOMEM;
> +
> +       work->start = start;
> +       work->nr_pages = last + 1 - start;
> +
> +       list_add_tail(&work->list, list);
> +
> +       return 0;
> +}
> +
> +static bool kvm_gmem_has_safe_refcount(struct address_space *mapping, pgoff_t start,
> +                                      size_t nr_pages, pgoff_t *error_index)
> +{
> +       const int filemap_get_folios_refcount = 1;
> +       struct folio_batch fbatch;
> +       bool refcount_safe;
> +       pgoff_t last;
> +       int i;
> +
> +       last = start + nr_pages - 1;
> +       refcount_safe = true;
> +
> +       folio_batch_init(&fbatch);
> +       while (refcount_safe &&
> +              filemap_get_folios(mapping, &start, last, &fbatch)) {
> +
> +               for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> +                       int filemap_refcount;
> +                       int safe_refcount;
> +                       struct folio *f;
> +
> +                       f = fbatch.folios[i];
> +                       filemap_refcount = folio_nr_pages(f);
> +
> +                       safe_refcount = filemap_refcount + filemap_get_folios_refcount;
> +                       if (folio_ref_count(f) != safe_refcount) {
> +                               refcount_safe = false;
> +                               *error_index = f->index;
> +                               break;
> +                       }
> +               }
> +
> +               folio_batch_release(&fbatch);
> +       }
> +
> +       return refcount_safe;
> +}
> +
> +static int kvm_gmem_shareability_apply(struct inode *inode,
> +                                      struct conversion_work *work,
> +                                      enum shareability m)
> +{
> +       struct maple_tree *mt;
> +
> +       mt = &kvm_gmem_private(inode)->shareability;
> +       return kvm_gmem_shareability_store(mt, work->start, work->nr_pages, m);
> +}
> +
> +static int kvm_gmem_convert_compute_work(struct inode *inode, pgoff_t start,
> +                                        size_t nr_pages, enum shareability m,
> +                                        struct list_head *work_list)
> +{
> +       struct maple_tree *mt;
> +       struct ma_state mas;
> +       pgoff_t last;
> +       void *entry;
> +       int ret;
> +
> +       last = start + nr_pages - 1;
> +
> +       mt = &kvm_gmem_private(inode)->shareability;
> +       ret = 0;
> +
> +       mas_init(&mas, mt, start);
> +
> +       rcu_read_lock();
> +       mas_for_each(&mas, entry, last) {
> +               enum shareability current_m;
> +               pgoff_t m_range_index;
> +               pgoff_t m_range_last;
> +               int ret;
> +
> +               m_range_index = max(mas.index, start);
> +               m_range_last = min(mas.last, last);
> +
> +               current_m = xa_to_value(entry);
> +               if (m == current_m)
> +                       continue;
> +
> +               mas_pause(&mas);
> +               rcu_read_unlock();
> +               /* Caller will clean this up on error. */
> +               ret = add_to_work_list(work_list, m_range_index, m_range_last);
> +               rcu_read_lock();
> +               if (ret)
> +                       break;
> +       }
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +static void kvm_gmem_convert_invalidate_begin(struct inode *inode,
> +                                             struct conversion_work *work)
> +{
> +       struct list_head *gmem_list;
> +       struct kvm_gmem *gmem;
> +       pgoff_t end;
> +
> +       end = work->start + work->nr_pages;
> +
> +       gmem_list = &inode->i_mapping->i_private_list;
> +       list_for_each_entry(gmem, gmem_list, entry)
> +               kvm_gmem_invalidate_begin(gmem, work->start, end);
> +}
> +
> +static void kvm_gmem_convert_invalidate_end(struct inode *inode,
> +                                           struct conversion_work *work)
> +{
> +       struct list_head *gmem_list;
> +       struct kvm_gmem *gmem;
> +       pgoff_t end;
> +
> +       end = work->start + work->nr_pages;
> +
> +       gmem_list = &inode->i_mapping->i_private_list;
> +       list_for_each_entry(gmem, gmem_list, entry)
> +               kvm_gmem_invalidate_end(gmem, work->start, end);
> +}
> +
> +static int kvm_gmem_convert_should_proceed(struct inode *inode,
> +                                          struct conversion_work *work,
> +                                          bool to_shared, pgoff_t *error_index)
> +{
> +       if (!to_shared) {
> +               unmap_mapping_pages(inode->i_mapping, work->start,
> +                                   work->nr_pages, false);
> +
> +               if (!kvm_gmem_has_safe_refcount(inode->i_mapping, work->start,
> +                                               work->nr_pages, error_index)) {
> +                       return -EAGAIN;
> +               }
> +       }
> +
> +       return 0;
> +}
> +
> +static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
> +                                 size_t nr_pages, bool shared,
> +                                 pgoff_t *error_index)
> +{
> +       struct conversion_work *work, *tmp, *rollback_stop_item;
> +       LIST_HEAD(work_list);
> +       struct inode *inode;
> +       enum shareability m;
> +       int ret;
> +
> +       inode = file_inode(file);
> +
> +       filemap_invalidate_lock(inode->i_mapping);
> +
> +       m = shared ? SHAREABILITY_ALL : SHAREABILITY_GUEST;
> +       ret = kvm_gmem_convert_compute_work(inode, start, nr_pages, m, &work_list);
> +       if (ret || list_empty(&work_list))
> +               goto out;
> +
> +       list_for_each_entry(work, &work_list, list)
> +               kvm_gmem_convert_invalidate_begin(inode, work);
> +
> +       list_for_each_entry(work, &work_list, list) {
> +               ret = kvm_gmem_convert_should_proceed(inode, work, shared,
> +                                                     error_index);
> +               if (ret)
> +                       goto invalidate_end;
> +       }
> +
> +       list_for_each_entry(work, &work_list, list) {
> +               rollback_stop_item = work;
> +               ret = kvm_gmem_shareability_apply(inode, work, m);
> +               if (ret)
> +                       break;
> +       }
> +
> +       if (ret) {
> +               m = shared ? SHAREABILITY_GUEST : SHAREABILITY_ALL;
> +               list_for_each_entry(work, &work_list, list) {
> +                       if (work == rollback_stop_item)
> +                               break;
> +
> +                       WARN_ON(kvm_gmem_shareability_apply(inode, work, m));
> +               }
> +       }
> +
> +invalidate_end:
> +       list_for_each_entry(work, &work_list, list)
> +               kvm_gmem_convert_invalidate_end(inode, work);
> +out:
> +       filemap_invalidate_unlock(inode->i_mapping);
> +
> +       list_for_each_entry_safe(work, tmp, &work_list, list) {
> +               list_del(&work->list);
> +               kfree(work);
> +       }
> +
> +       return ret;
> +}
> +
> +static int kvm_gmem_ioctl_convert_range(struct file *file,
> +                                       struct kvm_gmem_convert *param,
> +                                       bool shared)
> +{
> +       pgoff_t error_index;
> +       size_t nr_pages;
> +       pgoff_t start;
> +       int ret;
> +
> +       if (param->error_offset)
> +               return -EINVAL;
> +
> +       if (param->size == 0)
> +               return 0;
> +
> +       if (param->offset + param->size < param->offset ||
> +           param->offset > file_inode(file)->i_size ||
> +           param->offset + param->size > file_inode(file)->i_size)
> +               return -EINVAL;
> +
> +       if (!IS_ALIGNED(param->offset, PAGE_SIZE) ||
> +           !IS_ALIGNED(param->size, PAGE_SIZE))
> +               return -EINVAL;
> +
> +       start = param->offset >> PAGE_SHIFT;
> +       nr_pages = param->size >> PAGE_SHIFT;
> +
> +       ret = kvm_gmem_convert_range(file, start, nr_pages, shared, &error_index);
> +       if (ret)
> +               param->error_offset = error_index << PAGE_SHIFT;
> +
> +       return ret;
> +}
> +
>  #else
>
>  static int kvm_gmem_shareability_setup(struct maple_tree *mt, loff_t size, u64 flags)
> @@ -186,15 +490,26 @@ static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
>         unsigned long index;
>
>         xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
> +               enum kvm_gfn_range_filter filter;
>                 pgoff_t pgoff = slot->gmem.pgoff;
>
> +               filter = KVM_FILTER_PRIVATE;
> +               if (kvm_gmem_memslot_supports_shared(slot)) {
> +                       /*
> +                        * Unmapping would also cause invalidation, but cannot
> +                        * rely on mmu_notifiers to do invalidation via
> +                        * unmapping, since memory may not be mapped to
> +                        * userspace.
> +                        */
> +                       filter |= KVM_FILTER_SHARED;
> +               }
> +
>                 struct kvm_gfn_range gfn_range = {
>                         .start = slot->base_gfn + max(pgoff, start) - pgoff,
>                         .end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
>                         .slot = slot,
>                         .may_block = true,
> -                       /* guest memfd is relevant to only private mappings. */
> -                       .attr_filter = KVM_FILTER_PRIVATE,
> +                       .attr_filter = filter,
>                 };
>
>                 if (!found_memslot) {
> @@ -484,11 +799,49 @@ EXPORT_SYMBOL_GPL(kvm_gmem_memslot_supports_shared);
>  #define kvm_gmem_mmap NULL
>  #endif /* CONFIG_KVM_GMEM_SHARED_MEM */
>
> +static long kvm_gmem_ioctl(struct file *file, unsigned int ioctl,
> +                          unsigned long arg)
> +{
> +       void __user *argp;
> +       int r;
> +
> +       argp = (void __user *)arg;
> +
> +       switch (ioctl) {
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +       case KVM_GMEM_CONVERT_SHARED:
> +       case KVM_GMEM_CONVERT_PRIVATE: {
> +               struct kvm_gmem_convert param;
> +               bool to_shared;
> +
> +               r = -EFAULT;
> +               if (copy_from_user(&param, argp, sizeof(param)))
> +                       goto out;
> +
> +               to_shared = ioctl == KVM_GMEM_CONVERT_SHARED;
> +               r = kvm_gmem_ioctl_convert_range(file, &param, to_shared);
> +               if (r) {
> +                       if (copy_to_user(argp, &param, sizeof(param))) {
> +                               r = -EFAULT;
> +                               goto out;
> +                       }
> +               }
> +               break;
> +       }
> +#endif
> +       default:
> +               r = -ENOTTY;
> +       }
> +out:
> +       return r;
> +}
> +
>  static struct file_operations kvm_gmem_fops = {
>         .mmap           = kvm_gmem_mmap,
>         .open           = generic_file_open,
>         .release        = kvm_gmem_release,
>         .fallocate      = kvm_gmem_fallocate,
> +       .unlocked_ioctl = kvm_gmem_ioctl,
>  };
>
>  static void kvm_gmem_free_inode(struct inode *inode)
> --
> 2.49.0.1045.g170613ef41-goog
>

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-20  9:22   ` Fuad Tabba
@ 2025-05-20 13:02     ` Vishal Annapurve
  2025-05-20 13:44       ` Fuad Tabba
  0 siblings, 1 reply; 231+ messages in thread
From: Vishal Annapurve @ 2025-05-20 13:02 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

On Tue, May 20, 2025 at 2:23 AM Fuad Tabba <tabba@google.com> wrote:
>
> Hi Ackerley,
>
> On Thu, 15 May 2025 at 00:43, Ackerley Tng <ackerleytng@google.com> wrote:
> >
> > The two new guest_memfd ioctls KVM_GMEM_CONVERT_SHARED and
> > KVM_GMEM_CONVERT_PRIVATE convert the requested memory ranges to shared
> > and private respectively.
>
> I have a high level question about this particular patch and this
> approach for conversion: why do we need IOCTLs to manage conversion
> between private and shared?
>
> In the presentations I gave at LPC [1, 2], and in my latest patch
> series that performs in-place conversion [3] and the associated (by
> now outdated) state diagram [4], I didn't see the need to have a
> userspace-facing interface to manage that. KVM has all the information
> it needs to handle conversions, which are triggered by the guest. To
> me this seems like it adds additional complexity, as well as a user
> facing interface that we would need to maintain.
>
> There are various ways we could handle conversion without explicit
> interference from userspace. What I had in mind is the following (as
> an example, details can vary according to VM type). I will use use the
> case of conversion from shared to private because that is the more
> complicated (interesting) case:
>
> - Guest issues a hypercall to request that a shared folio become private.
>
> - The hypervisor receives the call, and passes it to KVM.
>
> - KVM unmaps the folio from the guest stage-2 (EPT I think in x86
> parlance), and unmaps it from the host. The host however, could still
> have references (e.g., GUP).
>
> - KVM exits to the host (hypervisor call exit), with the information
> that the folio has been unshared from it.
>
> - A well behaving host would now get rid of all of its references
> (e.g., release GUPs), perform a VCPU run, and the guest continues
> running as normal. I expect this to be the common case.
>
> But to handle the more interesting situation, let's say that the host
> doesn't do it immediately, and for some reason it holds on to some
> references to that folio.
>
> - Even if that's the case, the guest can still run *. If the guest
> tries to access the folio, KVM detects that access when it tries to
> fault it into the guest, sees that the host still has references to
> that folio, and exits back to the host with a memory fault exit. At
> this point, the VCPU that has tried to fault in that particular folio
> cannot continue running as long as it cannot fault in that folio.

Are you talking about the following scheme?
1) guest_memfd checks shareability on each get_pfn and, if there is a
mismatch, exits to the host.
2) host user space has to guess whether it's a pending refcount or
whether it's an actual mismatch.
3) guest_memfd will maintain a third state
"pending_private_conversion" or equivalent which will transition to
private upon the last refcount drop of each page.
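
For comparison, a purely illustrative sketch of what such a third state
could look like next to the two shareability values used in the patch
above (SHAREABILITY_GUEST_PENDING is a hypothetical name, not part of
the series):

enum shareability {
        SHAREABILITY_GUEST,             /* private: only the guest may fault it in */
        SHAREABILITY_ALL,               /* shared: guest and host may fault it in */
        SHAREABILITY_GUEST_PENDING,     /* hypothetical: flips to GUEST once the
                                         * last extra refcount is dropped */
};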

If conversion is triggered by userspace (in case of pKVM, it will be
triggered from within the KVM (?)):
* Conversion will just fail if there are extra refcounts and userspace
can try to get rid of extra refcounts on the range while it has enough
context without hitting any ambiguity with memory fault exit.
* guest_memfd will not have to deal with this extra state from 3 above
and overall guest_memfd conversion handling becomes relatively
simpler.

Note that for x86 CoCo cases, memory conversion is already triggered
by userspace using a KVM ioctl; this series proposes to use a
guest_memfd ioctl to do the same.
 - Allows not having to keep track of separate shared/private range
information in KVM.
 - Simpler handling of the conversion process done per guest_memfd
rather than for full range.
     - Userspace can handle the rollback as needed, simplifying error
handling in guest_memfd.
 - guest_memfd is single source of truth and notifies the users of
shareability change.
     - e.g. IOMMU, userspace, KVM MMU all can be registered for
getting notifications from guest_memfd directly and will get notified
for invalidation upon shareability attribute updates.
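
As a rough illustration of the userspace flow being proposed here (a
sketch only; it assumes the uapi additions from the patch above, i.e.
KVM_GMEM_CONVERT_PRIVATE and struct kvm_gmem_convert, and
release_range_refs() is a placeholder for however the VMM drops its
IOMMU/GUP/mmap references on the range):

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>  /* with this series' uapi additions */

static int convert_to_private(int gmem_fd, __u64 offset, __u64 size)
{
        struct kvm_gmem_convert param = {
                .offset = offset,
                .size = size,
                /* error_offset must be zero on entry */
        };

        for (;;) {
                if (!ioctl(gmem_fd, KVM_GMEM_CONVERT_PRIVATE, &param))
                        return 0;

                if (errno != EAGAIN)
                        return -errno;

                /*
                 * Elevated refcounts on the folio at param.error_offset:
                 * drop them (IOMMU mappings, GUPs, ...) and retry.
                 * release_range_refs() is a placeholder for that cleanup.
                 */
                release_range_refs(gmem_fd, param.error_offset);
                param.error_offset = 0;
        }
}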

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-20 13:02     ` Vishal Annapurve
@ 2025-05-20 13:44       ` Fuad Tabba
  2025-05-20 14:11         ` Vishal Annapurve
  0 siblings, 1 reply; 231+ messages in thread
From: Fuad Tabba @ 2025-05-20 13:44 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

Hi Vishal,

On Tue, 20 May 2025 at 14:02, Vishal Annapurve <vannapurve@google.com> wrote:
>
> On Tue, May 20, 2025 at 2:23 AM Fuad Tabba <tabba@google.com> wrote:
> >
> > Hi Ackerley,
> >
> > On Thu, 15 May 2025 at 00:43, Ackerley Tng <ackerleytng@google.com> wrote:
> > >
> > > The two new guest_memfd ioctls KVM_GMEM_CONVERT_SHARED and
> > > KVM_GMEM_CONVERT_PRIVATE convert the requested memory ranges to shared
> > > and private respectively.
> >
> > I have a high level question about this particular patch and this
> > approach for conversion: why do we need IOCTLs to manage conversion
> > between private and shared?
> >
> > In the presentations I gave at LPC [1, 2], and in my latest patch
> > series that performs in-place conversion [3] and the associated (by
> > now outdated) state diagram [4], I didn't see the need to have a
> > userspace-facing interface to manage that. KVM has all the information
> > it needs to handle conversions, which are triggered by the guest. To
> > me this seems like it adds additional complexity, as well as a user
> > facing interface that we would need to maintain.
> >
> > There are various ways we could handle conversion without explicit
> > interference from userspace. What I had in mind is the following (as
> > an example, details can vary according to VM type). I will use use the
> > case of conversion from shared to private because that is the more
> > complicated (interesting) case:
> >
> > - Guest issues a hypercall to request that a shared folio become private.
> >
> > - The hypervisor receives the call, and passes it to KVM.
> >
> > - KVM unmaps the folio from the guest stage-2 (EPT I think in x86
> > parlance), and unmaps it from the host. The host however, could still
> > have references (e.g., GUP).
> >
> > - KVM exits to the host (hypervisor call exit), with the information
> > that the folio has been unshared from it.
> >
> > - A well behaving host would now get rid of all of its references
> > (e.g., release GUPs), perform a VCPU run, and the guest continues
> > running as normal. I expect this to be the common case.
> >
> > But to handle the more interesting situation, let's say that the host
> > doesn't do it immediately, and for some reason it holds on to some
> > references to that folio.
> >
> > - Even if that's the case, the guest can still run *. If the guest
> > tries to access the folio, KVM detects that access when it tries to
> > fault it into the guest, sees that the host still has references to
> > that folio, and exits back to the host with a memory fault exit. At
> > this point, the VCPU that has tried to fault in that particular folio
> > cannot continue running as long as it cannot fault in that folio.
>
> Are you talking about the following scheme?
> 1) guest_memfd checks shareability on each get pfn and if there is a
> mismatch exit to the host.

I think we are not really on the same page here (no pun intended :) ).
I'll try to answer your questions anyway...

Which get_pfn? Are you referring to get_pfn when faulting the page
into the guest or into the host?

> 2) host user space has to guess whether it's a pending refcount or
> whether it's an actual mismatch.

No need to guess. VCPU run will let it know exactly why it's exiting.

> 3) guest_memfd will maintain a third state
> "pending_private_conversion" or equivalent which will transition to
> private upon the last refcount drop of each page.
>
> If conversion is triggered by userspace (in case of pKVM, it will be
> triggered from within the KVM (?)):

Why would conversion be triggered by userspace? As far as I know, it's
the guest that triggers the conversion.

> * Conversion will just fail if there are extra refcounts and userspace
> can try to get rid of extra refcounts on the range while it has enough
> context without hitting any ambiguity with memory fault exit.
> * guest_memfd will not have to deal with this extra state from 3 above
> and overall guest_memfd conversion handling becomes relatively
> simpler.

That's not really related. The extra state isn't necessary any more
once we agreed in the previous discussion that we will retry instead.

> Note that for x86 CoCo cases, memory conversion is already triggered
> by userspace using KVM ioctl, this series is proposing to use
> guest_memfd ioctl to do the same.

The reason why for x86 CoCo cases conversion is already triggered by
userspace using KVM ioctl is that it has to, since shared memory and
private memory are two separate pages, and userspace needs to manage
that. Sharing memory in place removes the need for that.

This series isn't using the same ioctl, it's introducing new ones to
perform a task that as far as I can tell so far, KVM can handle by
itself.

>  - Allows not having to keep track of separate shared/private range
> information in KVM.

This patch series is already tracking shared/private range information in KVM.

>  - Simpler handling of the conversion process done per guest_memfd
> rather than for full range.
>      - Userspace can handle the rollback as needed, simplifying error
> handling in guest_memfd.
>  - guest_memfd is single source of truth and notifies the users of
> shareability change.
>      - e.g. IOMMU, userspace, KVM MMU all can be registered for
> getting notifications from guest_memfd directly and will get notified
> for invalidation upon shareability attribute updates.

All of these can still be done without introducing a new ioctl.

Cheers,
/fuad

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-20 13:44       ` Fuad Tabba
@ 2025-05-20 14:11         ` Vishal Annapurve
  2025-05-20 14:33           ` Fuad Tabba
  2025-06-24  8:23           ` Alexey Kardashevskiy
  0 siblings, 2 replies; 231+ messages in thread
From: Vishal Annapurve @ 2025-05-20 14:11 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

On Tue, May 20, 2025 at 6:44 AM Fuad Tabba <tabba@google.com> wrote:
>
> Hi Vishal,
>
> On Tue, 20 May 2025 at 14:02, Vishal Annapurve <vannapurve@google.com> wrote:
> >
> > On Tue, May 20, 2025 at 2:23 AM Fuad Tabba <tabba@google.com> wrote:
> > >
> > > Hi Ackerley,
> > >
> > > On Thu, 15 May 2025 at 00:43, Ackerley Tng <ackerleytng@google.com> wrote:
> > > >
> > > > The two new guest_memfd ioctls KVM_GMEM_CONVERT_SHARED and
> > > > KVM_GMEM_CONVERT_PRIVATE convert the requested memory ranges to shared
> > > > and private respectively.
> > >
> > > I have a high level question about this particular patch and this
> > > approach for conversion: why do we need IOCTLs to manage conversion
> > > between private and shared?
> > >
> > > In the presentations I gave at LPC [1, 2], and in my latest patch
> > > series that performs in-place conversion [3] and the associated (by
> > > now outdated) state diagram [4], I didn't see the need to have a
> > > userspace-facing interface to manage that. KVM has all the information
> > > it needs to handle conversions, which are triggered by the guest. To
> > > me this seems like it adds additional complexity, as well as a user
> > > facing interface that we would need to maintain.
> > >
> > > There are various ways we could handle conversion without explicit
> > > interference from userspace. What I had in mind is the following (as
> > > an example, details can vary according to VM type). I will use use the
> > > case of conversion from shared to private because that is the more
> > > complicated (interesting) case:
> > >
> > > - Guest issues a hypercall to request that a shared folio become private.
> > >
> > > - The hypervisor receives the call, and passes it to KVM.
> > >
> > > - KVM unmaps the folio from the guest stage-2 (EPT I think in x86
> > > parlance), and unmaps it from the host. The host however, could still
> > > have references (e.g., GUP).
> > >
> > > - KVM exits to the host (hypervisor call exit), with the information
> > > that the folio has been unshared from it.
> > >
> > > - A well behaving host would now get rid of all of its references
> > > (e.g., release GUPs), perform a VCPU run, and the guest continues
> > > running as normal. I expect this to be the common case.
> > >
> > > But to handle the more interesting situation, let's say that the host
> > > doesn't do it immediately, and for some reason it holds on to some
> > > references to that folio.
> > >
> > > - Even if that's the case, the guest can still run *. If the guest
> > > tries to access the folio, KVM detects that access when it tries to
> > > fault it into the guest, sees that the host still has references to
> > > that folio, and exits back to the host with a memory fault exit. At
> > > this point, the VCPU that has tried to fault in that particular folio
> > > cannot continue running as long as it cannot fault in that folio.
> >
> > Are you talking about the following scheme?
> > 1) guest_memfd checks shareability on each get pfn and if there is a
> > mismatch exit to the host.
>
> I think we are not really on the same page here (no pun intended :) ).
> I'll try to answer your questions anyway...
>
> Which get_pfn? Are you referring to get_pfn when faulting the page
> into the guest or into the host?

I am referring to guest fault handling in KVM.

>
> > 2) host user space has to guess whether it's a pending refcount or
> > whether it's an actual mismatch.
>
> No need to guess. VCPU run will let it know exactly why it's exiting.
>
> > 3) guest_memfd will maintain a third state
> > "pending_private_conversion" or equivalent which will transition to
> > private upon the last refcount drop of each page.
> >
> > If conversion is triggered by userspace (in case of pKVM, it will be
> > triggered from within the KVM (?)):
>
> Why would conversion be triggered by userspace? As far as I know, it's
> the guest that triggers the conversion.
>
> > * Conversion will just fail if there are extra refcounts and userspace
> > can try to get rid of extra refcounts on the range while it has enough
> > context without hitting any ambiguity with memory fault exit.
> > * guest_memfd will not have to deal with this extra state from 3 above
> > and overall guest_memfd conversion handling becomes relatively
> > simpler.
>
> That's not really related. The extra state isn't necessary any more
> once we agreed in the previous discussion that we will retry instead.

Who is *we* here? Which entity will retry conversion?

>
> > Note that for x86 CoCo cases, memory conversion is already triggered
> > by userspace using KVM ioctl, this series is proposing to use
> > guest_memfd ioctl to do the same.
>
> The reason why for x86 CoCo cases conversion is already triggered by
> userspace using KVM ioctl is that it has to, since shared memory and
> private memory are two separate pages, and userspace needs to manage
> that. Sharing memory in place removes the need for that.

Userspace still needs to clean up memory usage before conversion is
successful. e.g. remove IOMMU mappings for shared to private
conversion. I would think that memory conversion should not succeed
before all existing users let go of the guest_memfd pages for the
range being converted.

In x86 CoCo usecases, userspace can also decide not to allow
conversion in scenarios where ranges are still under active use by
the host and the guest is erroneously trying to take away memory. Both
the SNP and TDX specs allow conversion to fail due to in-use memory.

>
> This series isn't using the same ioctl, it's introducing new ones to
> perform a task that as far as I can tell so far, KVM can handle by
> itself.

I would like to understand this better. How will KVM handle the
conversion process for guest_memfd pages? Can you walk me through an
example sequence for shared to private conversion, specifically around
guest_memfd offset states?

>
> >  - Allows not having to keep track of separate shared/private range
> > information in KVM.
>
> This patch series is already tracking shared/private range information in KVM.
>
> >  - Simpler handling of the conversion process done per guest_memfd
> > rather than for full range.
> >      - Userspace can handle the rollback as needed, simplifying error
> > handling in guest_memfd.
> >  - guest_memfd is single source of truth and notifies the users of
> > shareability change.
> >      - e.g. IOMMU, userspace, KVM MMU all can be registered for
> > getting notifications from guest_memfd directly and will get notified
> > for invalidation upon shareability attribute updates.
>
> All of these can still be done without introducing a new ioctl.
>
> Cheers,
> /fuad

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-20 14:11         ` Vishal Annapurve
@ 2025-05-20 14:33           ` Fuad Tabba
  2025-05-20 16:02             ` Vishal Annapurve
  2025-06-24  8:23           ` Alexey Kardashevskiy
  1 sibling, 1 reply; 231+ messages in thread
From: Fuad Tabba @ 2025-05-20 14:33 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

Hi Vishal,

On Tue, 20 May 2025 at 15:11, Vishal Annapurve <vannapurve@google.com> wrote:
>
> On Tue, May 20, 2025 at 6:44 AM Fuad Tabba <tabba@google.com> wrote:
> >
> > Hi Vishal,
> >
> > On Tue, 20 May 2025 at 14:02, Vishal Annapurve <vannapurve@google.com> wrote:
> > >
> > > On Tue, May 20, 2025 at 2:23 AM Fuad Tabba <tabba@google.com> wrote:
> > > >
> > > > Hi Ackerley,
> > > >
> > > > On Thu, 15 May 2025 at 00:43, Ackerley Tng <ackerleytng@google.com> wrote:
> > > > >
> > > > > The two new guest_memfd ioctls KVM_GMEM_CONVERT_SHARED and
> > > > > KVM_GMEM_CONVERT_PRIVATE convert the requested memory ranges to shared
> > > > > and private respectively.
> > > >
> > > > I have a high level question about this particular patch and this
> > > > approach for conversion: why do we need IOCTLs to manage conversion
> > > > between private and shared?
> > > >
> > > > In the presentations I gave at LPC [1, 2], and in my latest patch
> > > > series that performs in-place conversion [3] and the associated (by
> > > > now outdated) state diagram [4], I didn't see the need to have a
> > > > userspace-facing interface to manage that. KVM has all the information
> > > > it needs to handle conversions, which are triggered by the guest. To
> > > > me this seems like it adds additional complexity, as well as a user
> > > > facing interface that we would need to maintain.
> > > >
> > > > There are various ways we could handle conversion without explicit
> > > > interference from userspace. What I had in mind is the following (as
> > > > an example, details can vary according to VM type). I will use use the
> > > > case of conversion from shared to private because that is the more
> > > > complicated (interesting) case:
> > > >
> > > > - Guest issues a hypercall to request that a shared folio become private.
> > > >
> > > > - The hypervisor receives the call, and passes it to KVM.
> > > >
> > > > - KVM unmaps the folio from the guest stage-2 (EPT I think in x86
> > > > parlance), and unmaps it from the host. The host however, could still
> > > > have references (e.g., GUP).
> > > >
> > > > - KVM exits to the host (hypervisor call exit), with the information
> > > > that the folio has been unshared from it.
> > > >
> > > > - A well behaving host would now get rid of all of its references
> > > > (e.g., release GUPs), perform a VCPU run, and the guest continues
> > > > running as normal. I expect this to be the common case.
> > > >
> > > > But to handle the more interesting situation, let's say that the host
> > > > doesn't do it immediately, and for some reason it holds on to some
> > > > references to that folio.
> > > >
> > > > - Even if that's the case, the guest can still run *. If the guest
> > > > tries to access the folio, KVM detects that access when it tries to
> > > > fault it into the guest, sees that the host still has references to
> > > > that folio, and exits back to the host with a memory fault exit. At
> > > > this point, the VCPU that has tried to fault in that particular folio
> > > > cannot continue running as long as it cannot fault in that folio.
> > >
> > > Are you talking about the following scheme?
> > > 1) guest_memfd checks shareability on each get pfn and if there is a
> > > mismatch exit to the host.
> >
> > I think we are not really on the same page here (no pun intended :) ).
> > I'll try to answer your questions anyway...
> >
> > Which get_pfn? Are you referring to get_pfn when faulting the page
> > into the guest or into the host?
>
> I am referring to guest fault handling in KVM.
>
> >
> > > 2) host user space has to guess whether it's a pending refcount or
> > > whether it's an actual mismatch.
> >
> > No need to guess. VCPU run will let it know exactly why it's exiting.
> >
> > > 3) guest_memfd will maintain a third state
> > > "pending_private_conversion" or equivalent which will transition to
> > > private upon the last refcount drop of each page.
> > >
> > > If conversion is triggered by userspace (in case of pKVM, it will be
> > > triggered from within the KVM (?)):
> >
> > Why would conversion be triggered by userspace? As far as I know, it's
> > the guest that triggers the conversion.
> >
> > > * Conversion will just fail if there are extra refcounts and userspace
> > > can try to get rid of extra refcounts on the range while it has enough
> > > context without hitting any ambiguity with memory fault exit.
> > > * guest_memfd will not have to deal with this extra state from 3 above
> > > and overall guest_memfd conversion handling becomes relatively
> > > simpler.
> >
> > That's not really related. The extra state isn't necessary any more
> > once we agreed in the previous discussion that we will retry instead.
>
> Who is *we* here? Which entity will retry conversion?

Userspace will re-attempt the VCPU run.

> >
> > > Note that for x86 CoCo cases, memory conversion is already triggered
> > > by userspace using KVM ioctl, this series is proposing to use
> > > guest_memfd ioctl to do the same.
> >
> > The reason why for x86 CoCo cases conversion is already triggered by
> > userspace using KVM ioctl is that it has to, since shared memory and
> > private memory are two separate pages, and userspace needs to manage
> > that. Sharing memory in place removes the need for that.
>
> Userspace still needs to clean up memory usage before conversion is
> successful. e.g. remove IOMMU mappings for shared to private
> conversion. I would think that memory conversion should not succeed
> before all existing users let go of the guest_memfd pages for the
> range being converted.

Yes. Userspace will know that it needs to do that on the VCPU exit,
which informs it of the guest's hypervisor request to unshare (convert
from shared to private) the page.

> In x86 CoCo usecases, userspace can also decide to not allow
> conversion for scenarios where ranges are still under active use by
> the host and guest is erroneously trying to take away memory. Both
> SNP/TDX spec allow failure of conversion due to in use memory.

How can the guest erroneously try to take away memory? If the guest
sends a hypervisor request asking for a conversion of memory that
doesn't belong to it, then I would expect the hypervisor to prevent
that.

I don't see how having an IOCTL to trigger the conversion is needed to
allow conversion failure. How is that different from userspace
ignoring or delaying releasing all references it has for the
conversion request?

> >
> > This series isn't using the same ioctl, it's introducing new ones to
> > perform a task that as far as I can tell so far, KVM can handle by
> > itself.
>
> I would like to understand this better. How will KVM handle the
> conversion process for guest_memfd pages? Can you help walk an example
> sequence for shared to private conversion specifically around
> guest_memfd offset states?

To make sure that we are discussing the same scenario, can you do the
same as well please: walk me through an example sequence for shared
to private conversion, specifically around guest_memfd offset states,
with the IOCTLs involved?

Here is an example that I have implemented and tested with pKVM. Note
that there are alternatives; the flow below is architecture- or even
VM-type dependent. None of this is core KVM code and the behaviour
could vary.


Assuming the folio is shared with the host:

Guest sends unshare hypercall to the hypervisor
Hypervisor forwards request to KVM (gmem) (having done due diligence)
KVM (gmem) performs an unmap_folio(), exits to userspace with
KVM_EXIT_UNSHARE and all the information about the folio being
unshared

Case 1:
Userspace removes any remaining references (GUPs, IOMMU Mappings etc...)
Userspace calls vcpu_run(): KVM (gmem) sees that there aren't any
references, sets state to PRIVATE

Case 2 (alternative 1):
Userspace doesn't release its references
Userspace calls vcpu_run(): KVM (gmem) sees that there are still
references, exits back to userspace with KVM_EXIT_UNSHARE

Case 2 (alternative 2):
Userspace doesn't release its references
Userspace calls vcpu_run(): KVM (gmem) sees that there are still
references, unmaps folio from guest, but allows it to run (until it
tries to fault in the folio)
Guest tries to fault in folio that still has reference, KVM does not
allow that (it sees that the folio is shared, and it doesn't fault in
shared folios to confidential guests)
KVM exits back to userspace with KVM_EXIT_UNSHARE

As I mentioned, the alternatives above are _not_ set in core KVM code.
They can vary by architecture or VM type, depending on the policy,
support, etc.
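
To put the cases above into rough code, the host loop might look like
the sketch below. This is illustrative only: KVM_EXIT_UNSHARE is the
exit reason from the pKVM prototype described here, not an upstream
definition, and release_refs() stands in for dropping GUPs, IOMMU
mappings, etc. for the range reported by the exit.

#include <sys/ioctl.h>
#include <linux/kvm.h>  /* plus the prototype's KVM_EXIT_UNSHARE */

static void run_vcpu(int vcpu_fd, struct kvm_run *run)
{
        for (;;) {
                ioctl(vcpu_fd, KVM_RUN, 0);

                if (run->exit_reason != KVM_EXIT_UNSHARE) {
                        /* handle other exit reasons */
                        break;
                }

                /*
                 * Case 1: drop the remaining references for the reported
                 * range, then loop; gmem sees no extra refcounts on the
                 * next run and sets the state to PRIVATE.
                 *
                 * Case 2: if references are kept, the next KVM_RUN (or the
                 * guest's next access to that folio) exits back here with
                 * KVM_EXIT_UNSHARE again.
                 */
                release_refs(run);
        }
}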

Now for your example please on how this would work with IOCTLs :)

Thanks,
/fuad

> >
> > >  - Allows not having to keep track of separate shared/private range
> > > information in KVM.
> >
> > This patch series is already tracking shared/private range information in KVM.
> >
> > >  - Simpler handling of the conversion process done per guest_memfd
> > > rather than for full range.
> > >      - Userspace can handle the rollback as needed, simplifying error
> > > handling in guest_memfd.
> > >  - guest_memfd is single source of truth and notifies the users of
> > > shareability change.
> > >      - e.g. IOMMU, userspace, KVM MMU all can be registered for
> > > getting notifications from guest_memfd directly and will get notified
> > > for invalidation upon shareability attribute updates.
> >
> > All of these can still be done without introducing a new ioctl.
> >
> > Cheers,
> > /fuad

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-20 14:33           ` Fuad Tabba
@ 2025-05-20 16:02             ` Vishal Annapurve
  2025-05-20 18:05               ` Fuad Tabba
  0 siblings, 1 reply; 231+ messages in thread
From: Vishal Annapurve @ 2025-05-20 16:02 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

On Tue, May 20, 2025 at 7:34 AM Fuad Tabba <tabba@google.com> wrote:
>
> Hi Vishal,
>
> On Tue, 20 May 2025 at 15:11, Vishal Annapurve <vannapurve@google.com> wrote:
> >
> > On Tue, May 20, 2025 at 6:44 AM Fuad Tabba <tabba@google.com> wrote:
> > >
> > > Hi Vishal,
> > >
> > > On Tue, 20 May 2025 at 14:02, Vishal Annapurve <vannapurve@google.com> wrote:
> > > >
> > > > On Tue, May 20, 2025 at 2:23 AM Fuad Tabba <tabba@google.com> wrote:
> > > > >
> > > > > Hi Ackerley,
> > > > >
> > > > > On Thu, 15 May 2025 at 00:43, Ackerley Tng <ackerleytng@google.com> wrote:
> > > > > >
> > > > > > The two new guest_memfd ioctls KVM_GMEM_CONVERT_SHARED and
> > > > > > KVM_GMEM_CONVERT_PRIVATE convert the requested memory ranges to shared
> > > > > > and private respectively.
> > > > >
> > > > > I have a high level question about this particular patch and this
> > > > > approach for conversion: why do we need IOCTLs to manage conversion
> > > > > between private and shared?
> > > > >
> > > > > In the presentations I gave at LPC [1, 2], and in my latest patch
> > > > > series that performs in-place conversion [3] and the associated (by
> > > > > now outdated) state diagram [4], I didn't see the need to have a
> > > > > userspace-facing interface to manage that. KVM has all the information
> > > > > it needs to handle conversions, which are triggered by the guest. To
> > > > > me this seems like it adds additional complexity, as well as a user
> > > > > facing interface that we would need to maintain.
> > > > >
> > > > > There are various ways we could handle conversion without explicit
> > > > > interference from userspace. What I had in mind is the following (as
> > > > > an example, details can vary according to VM type). I will use use the
> > > > > case of conversion from shared to private because that is the more
> > > > > complicated (interesting) case:
> > > > >
> > > > > - Guest issues a hypercall to request that a shared folio become private.
> > > > >
> > > > > - The hypervisor receives the call, and passes it to KVM.
> > > > >
> > > > > - KVM unmaps the folio from the guest stage-2 (EPT I think in x86
> > > > > parlance), and unmaps it from the host. The host however, could still
> > > > > have references (e.g., GUP).
> > > > >
> > > > > - KVM exits to the host (hypervisor call exit), with the information
> > > > > that the folio has been unshared from it.
> > > > >
> > > > > - A well behaving host would now get rid of all of its references
> > > > > (e.g., release GUPs), perform a VCPU run, and the guest continues
> > > > > running as normal. I expect this to be the common case.
> > > > >
> > > > > But to handle the more interesting situation, let's say that the host
> > > > > doesn't do it immediately, and for some reason it holds on to some
> > > > > references to that folio.
> > > > >
> > > > > - Even if that's the case, the guest can still run *. If the guest
> > > > > tries to access the folio, KVM detects that access when it tries to
> > > > > fault it into the guest, sees that the host still has references to
> > > > > that folio, and exits back to the host with a memory fault exit. At
> > > > > this point, the VCPU that has tried to fault in that particular folio
> > > > > cannot continue running as long as it cannot fault in that folio.
> > > >
> > > > Are you talking about the following scheme?
> > > > 1) guest_memfd checks shareability on each get pfn and if there is a
> > > > mismatch exit to the host.
> > >
> > > I think we are not really on the same page here (no pun intended :) ).
> > > I'll try to answer your questions anyway...
> > >
> > > Which get_pfn? Are you referring to get_pfn when faulting the page
> > > into the guest or into the host?
> >
> > I am referring to guest fault handling in KVM.
> >
> > >
> > > > 2) host user space has to guess whether it's a pending refcount or
> > > > whether it's an actual mismatch.
> > >
> > > No need to guess. VCPU run will let it know exactly why it's exiting.
> > >
> > > > 3) guest_memfd will maintain a third state
> > > > "pending_private_conversion" or equivalent which will transition to
> > > > private upon the last refcount drop of each page.
> > > >
> > > > If conversion is triggered by userspace (in case of pKVM, it will be
> > > > triggered from within the KVM (?)):
> > >
> > > Why would conversion be triggered by userspace? As far as I know, it's
> > > the guest that triggers the conversion.
> > >
> > > > * Conversion will just fail if there are extra refcounts and userspace
> > > > can try to get rid of extra refcounts on the range while it has enough
> > > > context without hitting any ambiguity with memory fault exit.
> > > > * guest_memfd will not have to deal with this extra state from 3 above
> > > > and overall guest_memfd conversion handling becomes relatively
> > > > simpler.
> > >
> > > That's not really related. The extra state isn't necessary any more
> > > once we agreed in the previous discussion that we will retry instead.
> >
> > Who is *we* here? Which entity will retry conversion?
>
> Userspace will re-attempt the VCPU run.

Then KVM will have to keep track of the ranges that need conversion
across exits. I think it's cleaner to let userspace make the decision
and invoke the conversion, without carrying additional state in KVM
about the guest's request.

>
> > >
> > > > Note that for x86 CoCo cases, memory conversion is already triggered
> > > > by userspace using KVM ioctl, this series is proposing to use
> > > > guest_memfd ioctl to do the same.
> > >
> > > The reason why for x86 CoCo cases conversion is already triggered by
> > > userspace using KVM ioctl is that it has to, since shared memory and
> > > private memory are two separate pages, and userspace needs to manage
> > > that. Sharing memory in place removes the need for that.
> >
> > Userspace still needs to clean up memory usage before conversion is
> > successful. e.g. remove IOMMU mappings for shared to private
> > conversion. I would think that memory conversion should not succeed
> > before all existing users let go of the guest_memfd pages for the
> > range being converted.
>
> Yes. Userspace will know that it needs to do that on the VCPU exit,
> which informs it of the guest's hypervisor request to unshare (convert
> from shared to private) the page.
>
> > In x86 CoCo usecases, userspace can also decide to not allow
> > conversion for scenarios where ranges are still under active use by
> > the host and guest is erroneously trying to take away memory. Both
> > SNP/TDX spec allow failure of conversion due to in use memory.
>
> How can the guest erroneously try to take away memory? If the guest
> sends a hypervisor request asking for a conversion of memory that
> doesn't belong to it, then I would expect the hypervisor to prevent
> that.

Marking a range as private effectively disallows the host from
accessing it, i.e. it takes that memory away from the host.

>
> I don't see how having an IOCTL to trigger the conversion is needed to
> allow conversion failure. How is that different from userspace
> ignoring or delaying releasing all references it has for the
> conversion request?
>
> > >
> > > This series isn't using the same ioctl, it's introducing new ones to
> > > perform a task that as far as I can tell so far, KVM can handle by
> > > itself.
> >
> > I would like to understand this better. How will KVM handle the
> > conversion process for guest_memfd pages? Can you help walk an example
> > sequence for shared to private conversion specifically around
> > guest_memfd offset states?
>
> To make sure that we are discussing the same scenario: can you do the
> same as well please --- walk me through an example sequence for shared
> to private conversion specifically around guest_memfd offset states
> With the IOCTLs involved?
>
> Here is an example that I have implemented and tested with pKVM. Note
> that there are alternatives, the flow below is architecture or even
> vm-type dependent. None of this code is code KVM code and the
> behaviour could vary.
>
>
> Assuming the folio is shared with the host:
>
> Guest sends unshare hypercall to the hypervisor
> Hypervisor forwards request to KVM (gmem) (having done due diligence)
> KVM (gmem) performs an unmap_folio(), exits to userspace with

For the x86 CoCo VM usecases I was talking about, userspace would like
to avoid unmap_mapping_range() on the range before it's safe to
unshare it.

> KVM_EXIT_UNSHARE and all the information about the folio being
> unshared
>
> Case 1:
> Userspace removes any remaining references (GUPs, IOMMU Mappings etc...)
> Userspace calls vcpu_run(): KVM (gmem) sees that there aren't any
> references, sets state to PRIVATE
>
> Case 2 (alternative 1):
> Userspace doesn't release its references
> Userspace calls vcpu_run(): KVM (gmem) sees that there are still
> references, exits back to userspace with KVM_EXIT_UNSHARE
>
> Case 2 (alternative 2):
> Userspace doesn't release its references
> Userspace calls vcpu_run(): KVM (gmem) sees that there are still
> references, unmaps folio from guest, but allows it to run (until it
> tries to fault in the folio)
> Guest tries to fault in folio that still has reference, KVM does not
> allow that (it sees that the folio is shared, and it doesn't fault in
> shared folios to confidential guests)
> KVM exits back to userspace with KVM_EXIT_UNSHARE
>
> As I mentioned, the alternatives above are _not_ set in core KVM code.
> They can vary by architecture or VM type, depending on the policy,
> support, etc..
>
> Now for your example please on how this would work with IOCTLs :)
>
> Thanks,
> /fuad
>
> > >
> > > >  - Allows not having to keep track of separate shared/private range
> > > > information in KVM.
> > >
> > > This patch series is already tracking shared/private range information in KVM.
> > >
> > > >  - Simpler handling of the conversion process done per guest_memfd
> > > > rather than for full range.
> > > >      - Userspace can handle the rollback as needed, simplifying error
> > > > handling in guest_memfd.
> > > >  - guest_memfd is single source of truth and notifies the users of
> > > > shareability change.
> > > >      - e.g. IOMMU, userspace, KVM MMU all can be registered for
> > > > getting notifications from guest_memfd directly and will get notified
> > > > for invalidation upon shareability attribute updates.
> > >
> > > All of these can still be done without introducing a new ioctl.
> > >
> > > Cheers,
> > > /fuad

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-20 16:02             ` Vishal Annapurve
@ 2025-05-20 18:05               ` Fuad Tabba
  2025-05-20 19:40                 ` Ackerley Tng
  0 siblings, 1 reply; 231+ messages in thread
From: Fuad Tabba @ 2025-05-20 18:05 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

Hi Vishal,

On Tue, 20 May 2025 at 17:03, Vishal Annapurve <vannapurve@google.com> wrote:
>
> On Tue, May 20, 2025 at 7:34 AM Fuad Tabba <tabba@google.com> wrote:
> >
> > Hi Vishal,
> >
> > On Tue, 20 May 2025 at 15:11, Vishal Annapurve <vannapurve@google.com> wrote:
> > >
> > > On Tue, May 20, 2025 at 6:44 AM Fuad Tabba <tabba@google.com> wrote:
> > > >
> > > > Hi Vishal,
> > > >
> > > > On Tue, 20 May 2025 at 14:02, Vishal Annapurve <vannapurve@google.com> wrote:
> > > > >
> > > > > On Tue, May 20, 2025 at 2:23 AM Fuad Tabba <tabba@google.com> wrote:
> > > > > >
> > > > > > Hi Ackerley,
> > > > > >
> > > > > > On Thu, 15 May 2025 at 00:43, Ackerley Tng <ackerleytng@google.com> wrote:
> > > > > > >
> > > > > > > The two new guest_memfd ioctls KVM_GMEM_CONVERT_SHARED and
> > > > > > > KVM_GMEM_CONVERT_PRIVATE convert the requested memory ranges to shared
> > > > > > > and private respectively.
> > > > > >
> > > > > > I have a high level question about this particular patch and this
> > > > > > approach for conversion: why do we need IOCTLs to manage conversion
> > > > > > between private and shared?
> > > > > >
> > > > > > In the presentations I gave at LPC [1, 2], and in my latest patch
> > > > > > series that performs in-place conversion [3] and the associated (by
> > > > > > now outdated) state diagram [4], I didn't see the need to have a
> > > > > > userspace-facing interface to manage that. KVM has all the information
> > > > > > it needs to handle conversions, which are triggered by the guest. To
> > > > > > me this seems like it adds additional complexity, as well as a user
> > > > > > facing interface that we would need to maintain.
> > > > > >
> > > > > > There are various ways we could handle conversion without explicit
> > > > > > interference from userspace. What I had in mind is the following (as
> > > > > > an example, details can vary according to VM type). I will use the
> > > > > > case of conversion from shared to private because that is the more
> > > > > > complicated (interesting) case:
> > > > > >
> > > > > > - Guest issues a hypercall to request that a shared folio become private.
> > > > > >
> > > > > > - The hypervisor receives the call, and passes it to KVM.
> > > > > >
> > > > > > - KVM unmaps the folio from the guest stage-2 (EPT I think in x86
> > > > > > parlance), and unmaps it from the host. The host however, could still
> > > > > > have references (e.g., GUP).
> > > > > >
> > > > > > - KVM exits to the host (hypervisor call exit), with the information
> > > > > > that the folio has been unshared from it.
> > > > > >
> > > > > > - A well behaving host would now get rid of all of its references
> > > > > > (e.g., release GUPs), perform a VCPU run, and the guest continues
> > > > > > running as normal. I expect this to be the common case.
> > > > > >
> > > > > > But to handle the more interesting situation, let's say that the host
> > > > > > doesn't do it immediately, and for some reason it holds on to some
> > > > > > references to that folio.
> > > > > >
> > > > > > - Even if that's the case, the guest can still run *. If the guest
> > > > > > tries to access the folio, KVM detects that access when it tries to
> > > > > > fault it into the guest, sees that the host still has references to
> > > > > > that folio, and exits back to the host with a memory fault exit. At
> > > > > > this point, the VCPU that has tried to fault in that particular folio
> > > > > > cannot continue running as long as it cannot fault in that folio.
> > > > >
> > > > > Are you talking about the following scheme?
> > > > > 1) guest_memfd checks shareability on each get pfn and if there is a
> > > > > mismatch exit to the host.
> > > >
> > > > I think we are not really on the same page here (no pun intended :) ).
> > > > I'll try to answer your questions anyway...
> > > >
> > > > Which get_pfn? Are you referring to get_pfn when faulting the page
> > > > into the guest or into the host?
> > >
> > > I am referring to guest fault handling in KVM.
> > >
> > > >
> > > > > 2) host user space has to guess whether it's a pending refcount or
> > > > > whether it's an actual mismatch.
> > > >
> > > > No need to guess. VCPU run will let it know exactly why it's exiting.
> > > >
> > > > > 3) guest_memfd will maintain a third state
> > > > > "pending_private_conversion" or equivalent which will transition to
> > > > > private upon the last refcount drop of each page.
> > > > >
> > > > > If conversion is triggered by userspace (in case of pKVM, it will be
> > > > > triggered from within the KVM (?)):
> > > >
> > > > Why would conversion be triggered by userspace? As far as I know, it's
> > > > the guest that triggers the conversion.
> > > >
> > > > > * Conversion will just fail if there are extra refcounts and userspace
> > > > > can try to get rid of extra refcounts on the range while it has enough
> > > > > context without hitting any ambiguity with memory fault exit.
> > > > > * guest_memfd will not have to deal with this extra state from 3 above
> > > > > and overall guest_memfd conversion handling becomes relatively
> > > > > simpler.
> > > >
> > > > That's not really related. The extra state isn't necessary any more
> > > > once we agreed in the previous discussion that we will retry instead.
> > >
> > > Who is *we* here? Which entity will retry conversion?
> >
> > Userspace will re-attempt the VCPU run.
>
> Then KVM will have to keep track of the ranges that need conversion
> across exits. I think it's cleaner to let userspace make the decision
> and invoke conversion without carrying additional state in KVM about
> guest request.

I disagree. I think it's cleaner not to introduce a user interface,
and just to track the reason for the last exit, along with the
required additional data. KVM is responsible already for handling the
workflow, why delegate this last part to the VMM?
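
(Purely as an illustration of how little state that would be -- this is
not code from any posted series -- the per-vCPU record KVM would need
could look roughly like the sketch below.)

  /*
   * Hypothetical per-vCPU record of a guest-requested conversion that
   * was reported to userspace and has not been applied yet.
   */
  struct gmem_pending_conversion {
          gfn_t gfn;                /* start of the requested range */
          unsigned long nr_pages;
          bool to_private;          /* direction of the requested conversion */
          bool pending;             /* set on exit, cleared once applied */
  };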

> >
> > > >
> > > > > Note that for x86 CoCo cases, memory conversion is already triggered
> > > > > by userspace using KVM ioctl, this series is proposing to use
> > > > > guest_memfd ioctl to do the same.
> > > >
> > > > The reason why for x86 CoCo cases conversion is already triggered by
> > > > userspace using KVM ioctl is that it has to, since shared memory and
> > > > private memory are two separate pages, and userspace needs to manage
> > > > that. Sharing memory in place removes the need for that.
> > >
> > > Userspace still needs to clean up memory usage before conversion is
> > > successful. e.g. remove IOMMU mappings for shared to private
> > > conversion. I would think that memory conversion should not succeed
> > > before all existing users let go of the guest_memfd pages for the
> > > range being converted.
> >
> > Yes. Userspace will know that it needs to do that on the VCPU exit,
> > which informs it of the guest's hypervisor request to unshare (convert
> > from shared to private) the page.
> >
> > > In x86 CoCo usecases, userspace can also decide to not allow
> > > conversion for scenarios where ranges are still under active use by
> > > the host and guest is erroneously trying to take away memory. Both
> > > SNP/TDX spec allow failure of conversion due to in use memory.
> >
> > How can the guest erroneously try to take away memory? If the guest
> > sends a hypervisor request asking for a conversion of memory that
> > doesn't belong to it, then I would expect the hypervisor to prevent
> > that.
>
>> Marking a range as private effectively disallows the host from
>> accessing those ranges -> so it takes away memory.

You said "erroneously" earlier. My question is, how can the guest
*erroneously* try to take away memory? This is the normal flow of
guest/host relations. The memory is the guest's: it decides when to
share it with the host, and it can take it away.

> >
> > I don't see how having an IOCTL to trigger the conversion is needed to
> > allow conversion failure. How is that different from userspace
> > ignoring or delaying releasing all references it has for the
> > conversion request?
> >
> > > >
> > > > This series isn't using the same ioctl, it's introducing new ones to
> > > > perform a task that as far as I can tell so far, KVM can handle by
> > > > itself.
> > >
> > > I would like to understand this better. How will KVM handle the
> > > conversion process for guest_memfd pages? Can you help walk an example
> > > sequence for shared to private conversion specifically around
> > > guest_memfd offset states?
> >
> > To make sure that we are discussing the same scenario: can you do the
> > same as well please --- walk me through an example sequence for shared
> > to private conversion specifically around guest_memfd offset states
> > With the IOCTLs involved?
> >
> > Here is an example that I have implemented and tested with pKVM. Note
> > that there are alternatives, the flow below is architecture or even
> > vm-type dependent. None of this code is core KVM code and the
> > behaviour could vary.
> >
> >
> > Assuming the folio is shared with the host:
> >
> > Guest sends unshare hypercall to the hypervisor
> > Hypervisor forwards request to KVM (gmem) (having done due diligence)
> > KVM (gmem) performs an unmap_folio(), exits to userspace with
>
> For x86 CoCo VM usecases I was talking about, userspace would like to
> avoid unmap_mapping_range() on the range before it's safe to unshare
> the range.

Why? There is no harm in userspace unmapping before the memory is
unshared. I don't see the problem with that.

You still haven't responded to my question from the previous email:
can you please return the favor and walk me through an example
sequence for shared to private conversion specifically around
guest_memfd offset states with the IOCTLs involved? :D

Thanks!
/fuad


> > KVM_EXIT_UNSHARE and all the information about the folio being
> > unshared
> >
> > Case 1:
> > Userspace removes any remaining references (GUPs, IOMMU Mappings etc...)
> > Userspace calls vcpu_run(): KVM (gmem) sees that there aren't any
> > references, sets state to PRIVATE
> >
> > Case 2 (alternative 1):
> > Userspace doesn't release its references
> > Userspace calls vcpu_run(): KVM (gmem) sees that there are still
> > references, exits back to userspace with KVM_EXIT_UNSHARE
> >
> > Case 2 (alternative 2):
> > Userspace doesn't release its references
> > Userspace calls vcpu_run(): KVM (gmem) sees that there are still
> > references, unmaps folio from guest, but allows it to run (until it
> > tries to fault in the folio)
> > Guest tries to fault in folio that still has reference, KVM does not
> > allow that (it sees that the folio is shared, and it doesn't fault in
> > shared folios to confidential guests)
> > KVM exits back to userspace with KVM_EXIT_UNSHARE
> >
> > As I mentioned, the alternatives above are _not_ set in core KVM code.
> > They can vary by architecture or VM type, depending on the policy,
> > support, etc..
> >
> > Now for your example please on how this would work with IOCTLs :)
> >
> > Thanks,
> > /fuad
> >
> > > >
> > > > >  - Allows not having to keep track of separate shared/private range
> > > > > information in KVM.
> > > >
> > > > This patch series is already tracking shared/private range information in KVM.
> > > >
> > > > >  - Simpler handling of the conversion process done per guest_memfd
> > > > > rather than for full range.
> > > > >      - Userspace can handle the rollback as needed, simplifying error
> > > > > handling in guest_memfd.
> > > > >  - guest_memfd is single source of truth and notifies the users of
> > > > > shareability change.
> > > > >      - e.g. IOMMU, userspace, KVM MMU all can be registered for
> > > > > getting notifications from guest_memfd directly and will get notified
> > > > > for invalidation upon shareability attribute updates.
> > > >
> > > > All of these can still be done without introducing a new ioctl.
> > > >
> > > > Cheers,
> > > > /fuad

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-20 18:05               ` Fuad Tabba
@ 2025-05-20 19:40                 ` Ackerley Tng
  2025-05-21 12:36                   ` Fuad Tabba
  0 siblings, 1 reply; 231+ messages in thread
From: Ackerley Tng @ 2025-05-20 19:40 UTC (permalink / raw)
  To: Fuad Tabba, Vishal Annapurve
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, thomas.lendacky,
	usama.arif, vbabka, viro, vkuznets, wei.w.wang, will, willy,
	xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui, zhiquan1.li

Fuad Tabba <tabba@google.com> writes:

Let me try to bridge the gap here beginning with the flow we were
counting on for a shared to private conversion, for TDX:

1. Guest sends unshare hypercall to the hypervisor

2. (For x86 IIUC hypervisor is the same as KVM) KVM forwards the request
   to userspace via a KVM_EXIT_HYPERCALL, with KVM_HC_MAP_GPA_RANGE as
   the hypercall number.

   KVM also records that the guest wanted a shared to private
   conversion, the gpa and size of the request (no change from now, KVM
   already records that information in struct kvm_run) [1]

3. Userspace will do necessary coordination in userspace, then call the
   conversion ioctl, passing the parameters along to the ioctl.

4. Ioctl goes to guest_memfd, guest_memfd unmaps the pages, checks
   refcounts. If there's anything unexpected, error out to userspace. If
   all is well, flip shareability, exit to userspace with success.

5. Userspace calls vcpu_run() again, the handler for
   KVM_HC_MAP_GPA_RANGE will tell the guest that userspace was able to
   fulfill guest request with hypercall.ret set to 0 and then the guest
   will continue.

6. On the next fault guest_memfd will allow the private fault from the
   guest.
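
To make steps 3-5 concrete, here is a minimal userspace sketch.
KVM_EXIT_HYPERCALL, KVM_HC_MAP_GPA_RANGE and the hypercall args are
existing uAPI; struct kvm_gmem_convert and the two conversion ioctls
are the ones proposed in this series, so the exact field layout shown
here (offset/size/error_offset) and the gpa_to_gmem_offset() /
release_shared_users() helpers are illustrative assumptions, not a
definitive implementation:

  #include <errno.h>
  #include <stdbool.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>
  #include <linux/kvm_para.h>

  /* Illustrative layout of the argument to the proposed conversion ioctls. */
  struct kvm_gmem_convert {
          __u64 offset;         /* offset into the guest_memfd */
          __u64 size;
          __u64 error_offset;   /* filled in by guest_memfd on failure */
  };

  /* VMM-specific helpers, assumed to exist elsewhere in the VMM. */
  extern __u64 gpa_to_gmem_offset(__u64 gpa);
  extern void release_shared_users(__u64 gpa, __u64 size);

  /*
   * Called when KVM_RUN returns with exit_reason == KVM_EXIT_HYPERCALL
   * and run->hypercall.nr == KVM_HC_MAP_GPA_RANGE (step 2 above).
   */
  static void handle_map_gpa_range(struct kvm_run *run, int gmem_fd)
  {
          __u64 gpa = run->hypercall.args[0];
          __u64 size = run->hypercall.args[1] * 4096;  /* npages, 4K assumed */
          bool to_private = run->hypercall.args[2] & KVM_MAP_GPA_RANGE_ENCRYPTED;
          struct kvm_gmem_convert conv = {
                  .offset = gpa_to_gmem_offset(gpa),
                  .size = size,
          };
          int r;

          /* Step 3: userspace coordination (drop GUP pins, IOMMU maps, ...) */
          if (to_private)
                  release_shared_users(gpa, size);

          /* Step 4: ask guest_memfd to flip shareability for the range. */
          r = ioctl(gmem_fd, to_private ? KVM_GMEM_CONVERT_PRIVATE :
                                          KVM_GMEM_CONVERT_SHARED, &conv);

          /* Step 5: the next KVM_RUN reports the outcome to the guest. */
          run->hypercall.ret = r ? -errno : 0;
  }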


The flow you're proposing works too, with some changes, but it's
probably okay for x86 to have a slightly different flow anyway: (I
refactored the steps you outlined)

> 1. Guest sends unshare hypercall to the hypervisor

Same

> 2. Hypervisor forwards request to KVM (gmem) (having done due diligence)

For x86 IIUC hypervisor is the same as KVM, so there's no forwarding to KVM.

> 3. KVM (gmem) performs an unmap_folio(), exits to userspace with
>    KVM_EXIT_UNSHARE and all the information about the folio being unshared

The KVM_EXIT_UNSHARE here would correspond to x86's
KVM_HC_MAP_GPA_RANGE.

Unmapping before exiting with KVM_EXIT_UNSHARE here might be a little
premature since userspace may have to do some stuff before permitting
the conversion. For example, the memory may be mapped into another
userspace driver process, which needs to first be stopped.

But that's no issue: as long as we don't flip shareability, if the host
uses the memory, kvm_gmem_fault_shared() will just happen again,
nullifying the unmapping.

We could just shift the unmapping till after vcpu_run() is called
again.

> 4. Userspace will do necessary coordination in userspace, then do
>    vcpu_run()

There's another layer here, at least for x86, as to whether the
coordination was successful. For x86's KVM_HC_MAP_GPA_RANGE, userspace
can indicate a non-zero hypercall.ret for error.

For unsuccessful coordinations, userspace sets hypercall.ret to error
and the vcpu_run() handler doesn't try the conversion. Guest is informed
of hypercall error and guest will figure it out.

> 5. Successful coordination, case 1: vcpu_run() knows the last exit was
>    KVM_EXIT_UNSHARE and will set state to PRIVATE

For case 1, userspace will set hypercall.ret == 0, guest_memfd will do
the conversion, basically calling the same function that the ioctl calls
within guest_memfd.

> 5. Successful coordination, case 2, alternative 1: vcpu_run() knows
>    the last exit was KVM_EXIT_UNSHARE

Exit to userspace with KVM_EXIT_MEMORY_FAULT.

> 5. Successful coordination, case 2, alternative 2: vcpu_run() knows
>    the last exit was KVM_EXIT_UNSHARE

Forward hypercall.ret == 0 to the guest. Since the conversion was not
performed, the next fault will be mismatched and there will be a
KVM_EXIT_MEMORY_FAULT.

> Hi Vishal,
>
> On Tue, 20 May 2025 at 17:03, Vishal Annapurve <vannapurve@google.com> wrote:
>>
>> On Tue, May 20, 2025 at 7:34 AM Fuad Tabba <tabba@google.com> wrote:
>> >
>> > Hi Vishal,
>> >
>> > On Tue, 20 May 2025 at 15:11, Vishal Annapurve <vannapurve@google.com> wrote:
>> > >
>> > > On Tue, May 20, 2025 at 6:44 AM Fuad Tabba <tabba@google.com> wrote:
>> > > >
>> > > > Hi Vishal,
>> > > >
>> > > > On Tue, 20 May 2025 at 14:02, Vishal Annapurve <vannapurve@google.com> wrote:
>> > > > >
>> > > > > On Tue, May 20, 2025 at 2:23 AM Fuad Tabba <tabba@google.com> wrote:
>> > > > > >
>> > > > > > Hi Ackerley,
>> > > > > >
>> > > > > > On Thu, 15 May 2025 at 00:43, Ackerley Tng <ackerleytng@google.com> wrote:
>> > > > > > >
>> > > > > > > The two new guest_memfd ioctls KVM_GMEM_CONVERT_SHARED and
>> > > > > > > KVM_GMEM_CONVERT_PRIVATE convert the requested memory ranges to shared
>> > > > > > > and private respectively.
>> > > > > >
>> > > > > > I have a high level question about this particular patch and this
>> > > > > > approach for conversion: why do we need IOCTLs to manage conversion
>> > > > > > between private and shared?
>> > > > > >
>> > > > > > In the presentations I gave at LPC [1, 2], and in my latest patch
>> > > > > > series that performs in-place conversion [3] and the associated (by
>> > > > > > now outdated) state diagram [4], I didn't see the need to have a
>> > > > > > userspace-facing interface to manage that. KVM has all the information
>> > > > > > it needs to handle conversions, which are triggered by the guest. To
>> > > > > > me this seems like it adds additional complexity, as well as a user
>> > > > > > facing interface that we would need to maintain.
>> > > > > >
>> > > > > > There are various ways we could handle conversion without explicit
>> > > > > > interference from userspace. What I had in mind is the following (as
>> > > > > > an example, details can vary according to VM type). I will use the
>> > > > > > case of conversion from shared to private because that is the more
>> > > > > > complicated (interesting) case:
>> > > > > >
>> > > > > > - Guest issues a hypercall to request that a shared folio become private.
>> > > > > >
>> > > > > > - The hypervisor receives the call, and passes it to KVM.
>> > > > > >
>> > > > > > - KVM unmaps the folio from the guest stage-2 (EPT I think in x86
>> > > > > > parlance), and unmaps it from the host. The host however, could still
>> > > > > > have references (e.g., GUP).
>> > > > > >
>> > > > > > - KVM exits to the host (hypervisor call exit), with the information
>> > > > > > that the folio has been unshared from it.
>> > > > > >
>> > > > > > - A well behaving host would now get rid of all of its references
>> > > > > > (e.g., release GUPs), perform a VCPU run, and the guest continues
>> > > > > > running as normal. I expect this to be the common case.
>> > > > > >
>> > > > > > But to handle the more interesting situation, let's say that the host
>> > > > > > doesn't do it immediately, and for some reason it holds on to some
>> > > > > > references to that folio.
>> > > > > >
>> > > > > > - Even if that's the case, the guest can still run *. If the guest
>> > > > > > tries to access the folio, KVM detects that access when it tries to
>> > > > > > fault it into the guest, sees that the host still has references to
>> > > > > > that folio, and exits back to the host with a memory fault exit. At
>> > > > > > this point, the VCPU that has tried to fault in that particular folio
>> > > > > > cannot continue running as long as it cannot fault in that folio.
>> > > > >
>> > > > > Are you talking about the following scheme?
>> > > > > 1) guest_memfd checks shareability on each get pfn and if there is a
>> > > > > mismatch exit to the host.
>> > > >
>> > > > I think we are not really on the same page here (no pun intended :) ).
>> > > > I'll try to answer your questions anyway...
>> > > >
>> > > > Which get_pfn? Are you referring to get_pfn when faulting the page
>> > > > into the guest or into the host?
>> > >
>> > > I am referring to guest fault handling in KVM.
>> > >
>> > > >
>> > > > > 2) host user space has to guess whether it's a pending refcount or
>> > > > > whether it's an actual mismatch.
>> > > >
>> > > > No need to guess. VCPU run will let it know exactly why it's exiting.
>> > > >
>> > > > > 3) guest_memfd will maintain a third state
>> > > > > "pending_private_conversion" or equivalent which will transition to
>> > > > > private upon the last refcount drop of each page.
>> > > > >
>> > > > > If conversion is triggered by userspace (in case of pKVM, it will be
>> > > > > triggered from within the KVM (?)):
>> > > >
>> > > > Why would conversion be triggered by userspace? As far as I know, it's
>> > > > the guest that triggers the conversion.
>> > > >
>> > > > > * Conversion will just fail if there are extra refcounts and userspace
>> > > > > can try to get rid of extra refcounts on the range while it has enough
>> > > > > context without hitting any ambiguity with memory fault exit.
>> > > > > * guest_memfd will not have to deal with this extra state from 3 above
>> > > > > and overall guest_memfd conversion handling becomes relatively
>> > > > > simpler.
>> > > >
>> > > > That's not really related. The extra state isn't necessary any more
>> > > > once we agreed in the previous discussion that we will retry instead.
>> > >
>> > > Who is *we* here? Which entity will retry conversion?
>> >
>> > Userspace will re-attempt the VCPU run.
>>
>> Then KVM will have to keep track of the ranges that need conversion
>> across exits. I think it's cleaner to let userspace make the decision
>> and invoke conversion without carrying additional state in KVM about
>> guest request.
>
> I disagree. I think it's cleaner not to introduce a user interface,
> and just to track the reason for the last exit, along with the
> required additional data. KVM is responsible already for handling the
> workflow, why delegate this last part to the VMM?
>

I believe Fuad's concern is the complexity of adding and maintaining
another ioctl, as opposed to having vcpu_run() do the conversions.

I think the two options are basically the same in that both are actually
adding some form of user contract, just in different places.

For the ioctl approach, in this RFCv2 I added an error_offset field so
that userspace has a hint of where the conversion had an issue. The
ioctl also returns errors to indicate what went wrong, like -EINVAL, or
-ENOMEM if perhaps splitting the page required memory and there wasn't
any, or the kernel ran out of memory trying to update mappability.

If we want to provide the same level of error information for the
vcpu_run() approach, we should probably add error_offset to
KVM_EXIT_MEMORY_FAULT so that on a conversion failure we could re-exit
to userspace with more information about the error_offset.
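
(For comparison, the first three fields below are the existing
kvm_run.memory_fault layout; error_offset is the hypothetical addition
this would need, mirroring the field in the proposed ioctl argument:)

  /* Hypothetical extension of the memory_fault exit payload. */
  struct {
          __u64 flags;          /* e.g. KVM_MEMORY_EXIT_FLAG_PRIVATE */
          __u64 gpa;
          __u64 size;
          __u64 error_offset;   /* assumed addition: where conversion failed */
  } memory_fault;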


So what we're really comparing is two ways to perform the conversion (1)
via a direct ioctl and (2) via vcpu_run().

I think having a direct ioctl is cleaner because it doesn't involve
vCPUs for a memory operation.

Conceptually, the conversion is a memory operation belonging to memory
in the guest_memfd. Hence, the conversion operation is better addressed
directly to the memory via a direct ioctl.

For this same reason, we didn't want to do the conversion via the
KVM_SET_MEMORY_ATTRIBUTES ioctl. KVM_SET_MEMORY_ATTRIBUTES is an
operation for KVM's view of guest_memfd, which is linked to but not
directly the same as a memory operation.

By having a direct ioctl over using KVM_SET_MEMORY_ATTRIBUTES, we avoid
having a dependency where memslots must first be bound to guest_memfd
for the conversion to work.

When rebooting, the memslots may not yet be bound to the guest_memfd,
but we want to reset the guest_memfd's to private. If we use
KVM_SET_MEMORY_ATTRIBUTES to convert, we'd be forced to first bind, then
convert. If we had a direct ioctl, we don't have this restriction.

If we do the conversion via vcpu_run() we would be forced to handle
conversions only with a vcpu_run() and only the guest can initiate a
conversion.

On a guest boot for TDX, the memory is assumed to be private. If we
gave it memory set as shared, we'd just have a bunch of
KVM_EXIT_MEMORY_FAULTs that slow down boot. Hence on a guest reboot, we
will want to reset the guest memory to private.

We could say the firmware should reset memory to private on guest
reboot, but we can't force all guests to update firmware.
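
(As a concrete illustration of why the direct ioctl helps here, the
VMM's reboot path could reset the whole guest_memfd back to private
without any memslot being bound, using the same assumed argument
layout as in the sketch above:)

  /*
   * Sketch: on emulated reboot, reset the entire guest_memfd to private
   * before reinstalling the firmware payload.  No memslot binding is
   * required because the ioctl is addressed to the guest_memfd itself.
   */
  static int reset_gmem_to_private(int gmem_fd, __u64 gmem_size)
  {
          struct kvm_gmem_convert conv = {
                  .offset = 0,
                  .size = gmem_size,
          };

          return ioctl(gmem_fd, KVM_GMEM_CONVERT_PRIVATE, &conv);
  }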

>> >
>> > > >
>> > > > > Note that for x86 CoCo cases, memory conversion is already triggered
>> > > > > by userspace using KVM ioctl, this series is proposing to use
>> > > > > guest_memfd ioctl to do the same.
>> > > >
>> > > > The reason why for x86 CoCo cases conversion is already triggered by
>> > > > userspace using KVM ioctl is that it has to, since shared memory and
>> > > > private memory are two separate pages, and userspace needs to manage
>> > > > that. Sharing memory in place removes the need for that.
>> > >
>> > > Userspace still needs to clean up memory usage before conversion is
>> > > successful. e.g. remove IOMMU mappings for shared to private
>> > > conversion. I would think that memory conversion should not succeed
>> > > before all existing users let go of the guest_memfd pages for the
>> > > range being converted.
>> >
>> > Yes. Userspace will know that it needs to do that on the VCPU exit,
>> > which informs it of the guest's hypervisor request to unshare (convert
>> > from shared to private) the page.
>> >
>> > > In x86 CoCo usecases, userspace can also decide to not allow
>> > > conversion for scenarios where ranges are still under active use by
>> > > the host and guest is erroneously trying to take away memory. Both
>> > > SNP/TDX spec allow failure of conversion due to in use memory.
>> >
>> > How can the guest erroneously try to take away memory? If the guest
>> > sends a hypervisor request asking for a conversion of memory that
>> > doesn't belong to it, then I would expect the hypervisor to prevent
>> > that.
>>
>> Marking a range as private effectively disallows the host from
>> accessing those ranges -> so it takes away memory.
>
> You said "erroneously" earlier. My question is, how can the guest
> *erroneously* try to take away memory? This is the normal flow of
> guest/host relations. The memory is the guest's: it decides when to
> share it with the host, and it can take it away.
>

See above, it's not really erroneous as long as
kvm_gmem_fault_shared() can still happen, since after unmapping, any
host access will just fault the page again.

>> >
>> > I don't see how having an IOCTL to trigger the conversion is needed to
>> > allow conversion failure. How is that different from userspace
>> > ignoring or delaying releasing all references it has for the
>> > conversion request?
>> >
>> > > >
>> > > > This series isn't using the same ioctl, it's introducing new ones to
>> > > > perform a task that as far as I can tell so far, KVM can handle by
>> > > > itself.
>> > >
>> > > I would like to understand this better. How will KVM handle the
>> > > conversion process for guest_memfd pages? Can you help walk an example
>> > > sequence for shared to private conversion specifically around
>> > > guest_memfd offset states?
>> >
>> > To make sure that we are discussing the same scenario: can you do the
>> > same as well please --- walk me through an example sequence for shared
>> > to private conversion specifically around guest_memfd offset states
>> > With the IOCTLs involved?
>> >
>> > Here is an example that I have implemented and tested with pKVM. Note
>> > that there are alternatives, the flow below is architecture or even
>> > vm-type dependent. None of this code is core KVM code and the
>> > behaviour could vary.
>> >
>> >
>> > Assuming the folio is shared with the host:
>> >
>> > Guest sends unshare hypercall to the hypervisor
>> > Hypervisor forwards request to KVM (gmem) (having done due diligence)
>> > KVM (gmem) performs an unmap_folio(), exits to userspace with
>>
>> For x86 CoCo VM usecases I was talking about, userspace would like to
>> avoid unmap_mapping_range() on the range before it's safe to unshare
>> the range.
>
> Why? There is no harm in userspace unmapping before the memory is
> unshared. I don't see the problem with that.
>

Yes, no harm done, just possible remapping after unmapping.

> You still haven't responded to my question from the previous email:
> can you please return the favor and walk me through an example
> sequence for shared to private conversion specifically around
> guest_memfd offset states with the IOCTLs involved? :D
>

Right at the top :)

> Thanks!
> /fuad
>
>
>> > KVM_EXIT_UNSHARE and all the information about the folio being
>> > unshared
>> >
>> > Case 1:
>> > Userspace removes any remaining references (GUPs, IOMMU Mappings etc...)
>> > Userspace calls vcpu_run(): KVM (gmem) sees that there aren't any
>> > references, sets state to PRIVATE
>> >
>> > Case 2 (alternative 1):
>> > Userspace doesn't release its references
>> > Userspace calls vcpu_run(): KVM (gmem) sees that there are still
>> > references, exits back to userspace with KVM_EXIT_UNSHARE
>> >
>> > Case 2 (alternative 2):
>> > Userspace doesn't release its references
>> > Userspace calls vcpu_run(): KVM (gmem) sees that there are still
>> > references, unmaps folio from guest, but allows it to run (until it
>> > tries to fault in the folio)
>> > Guest tries to fault in folio that still has reference, KVM does not
>> > allow that (it sees that the folio is shared, and it doesn't fault in
>> > shared folios to confidential guests)
>> > KVM exits back to userspace with KVM_EXIT_UNSHARE
>> >
>> > As I mentioned, the alternatives above are _not_ set in core KVM code.
>> > They can vary by architecture or VM type, depending on the policy,
>> > support, etc..
>> >
>> > Now for your example please on how this would work with IOCTLs :)
>> >
>> > Thanks,
>> > /fuad
>> >
>> > > >
>> > > > >  - Allows not having to keep track of separate shared/private range
>> > > > > information in KVM.
>> > > >
>> > > > This patch series is already tracking shared/private range information in KVM.
>> > > >
>> > > > >  - Simpler handling of the conversion process done per guest_memfd
>> > > > > rather than for full range.
>> > > > >      - Userspace can handle the rollback as needed, simplifying error
>> > > > > handling in guest_memfd.
>> > > > >  - guest_memfd is single source of truth and notifies the users of
>> > > > > shareability change.
>> > > > >      - e.g. IOMMU, userspace, KVM MMU all can be registered for
>> > > > > getting notifications from guest_memfd directly and will get notified
>> > > > > for invalidation upon shareability attribute updates.
>> > > >
>> > > > All of these can still be done without introducing a new ioctl.
>> > > >
>> > > > Cheers,
>> > > > /fuad

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-20 19:40                 ` Ackerley Tng
@ 2025-05-21 12:36                   ` Fuad Tabba
  2025-05-21 14:42                     ` Vishal Annapurve
  0 siblings, 1 reply; 231+ messages in thread
From: Fuad Tabba @ 2025-05-21 12:36 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Vishal Annapurve, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

Hi Ackerley,

On Tue, 20 May 2025 at 20:40, Ackerley Tng <ackerleytng@google.com> wrote:
>
> Fuad Tabba <tabba@google.com> writes:
>
> Let me try to bridge the gap here beginning with the flow we were
> counting on for a shared to private conversion, for TDX:
>
> 1. Guest sends unshare hypercall to the hypervisor
>
> 2. (For x86 IIUC hypervisor is the same as KVM) KVM forwards the request
>    to userspace via a KVM_EXIT_HYPERCALL, with KVM_HC_MAP_GPA_RANGE as
>    the hypercall number.
>
>    KVM also records that the guest wanted a shared to private
>    conversion, the gpa and size of the request (no change from now, KVM
>    already records that information in struct kvm_run) [1]
>
> 3. Userspace will do necessary coordination in userspace, then call the
>    conversion ioctl, passing the parameters along to the ioctl.
>
> 4. Ioctl goes to guest_memfd, guest_memfd unmaps the pages, checks
>    refcounts. If there's anything unexpected, error out to userspace. If
>    all is well, flip shareability, exit to userspace with success.
>
> 5. Userspace calls vcpu_run() again, the handler for
>    KVM_HC_MAP_GPA_RANGE will tell the guest that userspace was able to
>    fulfill guest request with hypercall.ret set to 0 and then the guest
>    will continue.
>
> 6. On the next fault guest_memfd will allow the private fault from the
>    guest.
>
>
> The flow you're proposing works too, with some changes, but it's
> probably okay for x86 to have a slightly different flow anyway: (I
> refactored the steps you outlined)
>
> > 1. Guest sends unshare hypercall to the hypervisor
>
> Same
>
> > 2. Hypervisor forwards request to KVM (gmem) (having done due diligence)
>
> For x86 IIUC hypervisor is the same as KVM, so there's no forwarding to KVM.
>
> > 3. KVM (gmem) performs an unmap_folio(), exits to userspace with
> >    KVM_EXIT_UNSHARE and all the information about the folio being unshared
>
> The KVM_EXIT_UNSHARE here would correspond to x86's
> KVM_HC_MAP_GPA_RANGE.
>
> Unmapping before exiting with KVM_EXIT_UNSHARE here might be a little
> premature since userspace may have to do some stuff before permitting
> the conversion. For example, the memory may be mapped into another
> userspace driver process, which needs to first be stopped.
>
> But that's no issue: as long as we don't flip shareability, if the host
> uses the memory, kvm_gmem_fault_shared() will just happen again,
> nullifying the unmapping.
>
> We could just shift the unmapping till after vcpu_run() is called
> again.
>
> > 4. Userspace will do necessary coordination in userspace, then do
> >    vcpu_run()
>
> There's another layer here, at least for x86, as to whether the
> coordination was successful. For x86's KVM_HC_MAP_GPA_RANGE, userspace
> can indicate a non-zero hypercall.ret for error.
>
> For unsuccessful coordinations, userspace sets hypercall.ret to error
> and the vcpu_run() handler doesn't try the conversion. Guest is informed
> of hypercall error and guest will figure it out.
>
> > 5. Successful coordination, case 1: vcpu_run() knows the last exit was
> >    KVM_EXIT_UNSHARE and will set state to PRIVATE
>
> For case 1, userspace will set hypercall.ret == 0, guest_memfd will do
> the conversion, basically calling the same function that the ioctl calls
> within guest_memfd.
>
> > 5. Successful coordination, case 2, alternative 1: vcpu_run() knows
> >    the last exit was KVM_EXIT_UNSHARE
>
> Exit to userspace with KVM_EXIT_MEMORY_FAULT.
>
> > 5. Successful coordination, case 2, alternative 2: vcpu_run() knows
> >    the last exit was KVM_EXIT_UNSHARE
>
> Forward hypercall.ret == 0 to the guest. Since the conversion was not
> performed, the next fault will be mismatched and there will be a
> KVM_EXIT_MEMORY_FAULT.

So far so good. With regard to the flow, in the code that I had, all
the specific details were arm64- and even pKVM-specific. None of it
was baked into core KVM code since, of course, different
architectures, and even different VM types, will vary significantly.
Arm CCA, for example, is closer to TDX than it is to pKVM. Moreover, it
was just a hack to get something reasonable that works, as a proof
of concept.

This is one of the reasons I'm not a fan of having a userspace IOCTL
as an additional required step as part of this protocol. KVM exits
already exist (*), and we need them anyway here. The flow above is
VM-type specific, and since much of it isn't exposed to the user, it's
easy (and likely) to change. Having an IOCTL and adding another step
in the process makes it more difficult to change things later.

(*) Try saying that ten times fast! Note: first word is exit, second
word is exist :)

> > Hi Vishal,
> >
> > On Tue, 20 May 2025 at 17:03, Vishal Annapurve <vannapurve@google.com> wrote:
> >>
> >> On Tue, May 20, 2025 at 7:34 AM Fuad Tabba <tabba@google.com> wrote:
> >> >
> >> > Hi Vishal,
> >> >
> >> > On Tue, 20 May 2025 at 15:11, Vishal Annapurve <vannapurve@google.com> wrote:
> >> > >
> >> > > On Tue, May 20, 2025 at 6:44 AM Fuad Tabba <tabba@google.com> wrote:
> >> > > >
> >> > > > Hi Vishal,
> >> > > >
> >> > > > On Tue, 20 May 2025 at 14:02, Vishal Annapurve <vannapurve@google.com> wrote:
> >> > > > >
> >> > > > > On Tue, May 20, 2025 at 2:23 AM Fuad Tabba <tabba@google.com> wrote:
> >> > > > > >
> >> > > > > > Hi Ackerley,
> >> > > > > >
> >> > > > > > On Thu, 15 May 2025 at 00:43, Ackerley Tng <ackerleytng@google.com> wrote:
> >> > > > > > >
> >> > > > > > > The two new guest_memfd ioctls KVM_GMEM_CONVERT_SHARED and
> >> > > > > > > KVM_GMEM_CONVERT_PRIVATE convert the requested memory ranges to shared
> >> > > > > > > and private respectively.
> >> > > > > >
> >> > > > > > I have a high level question about this particular patch and this
> >> > > > > > approach for conversion: why do we need IOCTLs to manage conversion
> >> > > > > > between private and shared?
> >> > > > > >
> >> > > > > > In the presentations I gave at LPC [1, 2], and in my latest patch
> >> > > > > > series that performs in-place conversion [3] and the associated (by
> >> > > > > > now outdated) state diagram [4], I didn't see the need to have a
> >> > > > > > userspace-facing interface to manage that. KVM has all the information
> >> > > > > > it needs to handle conversions, which are triggered by the guest. To
> >> > > > > > me this seems like it adds additional complexity, as well as a user
> >> > > > > > facing interface that we would need to maintain.
> >> > > > > >
> >> > > > > > There are various ways we could handle conversion without explicit
> >> > > > > > interference from userspace. What I had in mind is the following (as
> >> > > > > > an example, details can vary according to VM type). I will use the
> >> > > > > > case of conversion from shared to private because that is the more
> >> > > > > > complicated (interesting) case:
> >> > > > > >
> >> > > > > > - Guest issues a hypercall to request that a shared folio become private.
> >> > > > > >
> >> > > > > > - The hypervisor receives the call, and passes it to KVM.
> >> > > > > >
> >> > > > > > - KVM unmaps the folio from the guest stage-2 (EPT I think in x86
> >> > > > > > parlance), and unmaps it from the host. The host however, could still
> >> > > > > > have references (e.g., GUP).
> >> > > > > >
> >> > > > > > - KVM exits to the host (hypervisor call exit), with the information
> >> > > > > > that the folio has been unshared from it.
> >> > > > > >
> >> > > > > > - A well behaving host would now get rid of all of its references
> >> > > > > > (e.g., release GUPs), perform a VCPU run, and the guest continues
> >> > > > > > running as normal. I expect this to be the common case.
> >> > > > > >
> >> > > > > > But to handle the more interesting situation, let's say that the host
> >> > > > > > doesn't do it immediately, and for some reason it holds on to some
> >> > > > > > references to that folio.
> >> > > > > >
> >> > > > > > - Even if that's the case, the guest can still run *. If the guest
> >> > > > > > tries to access the folio, KVM detects that access when it tries to
> >> > > > > > fault it into the guest, sees that the host still has references to
> >> > > > > > that folio, and exits back to the host with a memory fault exit. At
> >> > > > > > this point, the VCPU that has tried to fault in that particular folio
> >> > > > > > cannot continue running as long as it cannot fault in that folio.
> >> > > > >
> >> > > > > Are you talking about the following scheme?
> >> > > > > 1) guest_memfd checks shareability on each get pfn and if there is a
> >> > > > > mismatch exit to the host.
> >> > > >
> >> > > > I think we are not really on the same page here (no pun intended :) ).
> >> > > > I'll try to answer your questions anyway...
> >> > > >
> >> > > > Which get_pfn? Are you referring to get_pfn when faulting the page
> >> > > > into the guest or into the host?
> >> > >
> >> > > I am referring to guest fault handling in KVM.
> >> > >
> >> > > >
> >> > > > > 2) host user space has to guess whether it's a pending refcount or
> >> > > > > whether it's an actual mismatch.
> >> > > >
> >> > > > No need to guess. VCPU run will let it know exactly why it's exiting.
> >> > > >
> >> > > > > 3) guest_memfd will maintain a third state
> >> > > > > "pending_private_conversion" or equivalent which will transition to
> >> > > > > private upon the last refcount drop of each page.
> >> > > > >
> >> > > > > If conversion is triggered by userspace (in case of pKVM, it will be
> >> > > > > triggered from within the KVM (?)):
> >> > > >
> >> > > > Why would conversion be triggered by userspace? As far as I know, it's
> >> > > > the guest that triggers the conversion.
> >> > > >
> >> > > > > * Conversion will just fail if there are extra refcounts and userspace
> >> > > > > can try to get rid of extra refcounts on the range while it has enough
> >> > > > > context without hitting any ambiguity with memory fault exit.
> >> > > > > * guest_memfd will not have to deal with this extra state from 3 above
> >> > > > > and overall guest_memfd conversion handling becomes relatively
> >> > > > > simpler.
> >> > > >
> >> > > > That's not really related. The extra state isn't necessary any more
> >> > > > once we agreed in the previous discussion that we will retry instead.
> >> > >
> >> > > Who is *we* here? Which entity will retry conversion?
> >> >
> >> > Userspace will re-attempt the VCPU run.
> >>
> >> Then KVM will have to keep track of the ranges that need conversion
> >> across exits. I think it's cleaner to let userspace make the decision
> >> and invoke conversion without carrying additional state in KVM about
> >> guest request.
> >
> > I disagree. I think it's cleaner not to introduce a user interface,
> > and just to track the reason for the last exit, along with the
> > required additional data. KVM is responsible already for handling the
> > workflow, why delegate this last part to the VMM?
> >
>
> I believe Fuad's concern is the complexity of adding and maintaining
> another ioctl, as opposed to having vcpu_run() do the conversions.
>
> I think the two options are basically the same in that both are actually
> adding some form of user contract, just in different places.
>
> For the ioctl approach, in this RFCv2 I added an error_offset field so
> that userspace has a hint of where the conversion had an issue. The
> ioctl also returns errors to indicate what went wrong, like -EINVAL, or
> -ENOMEM if perhaps splitting the page required memory and there wasn't
> any, or the kernel ran out of memory trying to update mappability.
>
> If we want to provide the same level of error information for the
> vcpu_run() approach, we should probably add error_offset to
> KVM_EXIT_MEMORY_FAULT so that on a conversion failure we could re-exit
> to userspace with more information about the error_offset.
>
>
> So what we're really comparing is two ways to perform the conversion (1)
> via a direct ioctl and (2) via vcpu_run().

That's exactly right.

> I think having a direct ioctl is cleaner because it doesn't involve
> vCPUs for a memory operation.
>
> Conceptually, the conversion is a memory operation belonging to memory
> in the guest_memfd. Hence, the conversion operation is better addressed
> directly to the memory via a direct ioctl.
>
> For this same reason, we didn't want to do the conversion via the
> KVM_SET_MEMORY_ATTRIBUTES ioctl. KVM_SET_MEMORY_ATTRIBUTES is an
> operation for KVM's view of guest_memfd, which is linked to but not
> directly the same as a memory operation.
>
> By having a direct ioctl over using KVM_SET_MEMORY_ATTRIBUTES, we avoid
> having a dependency where memslots must first be bound to guest_memfd
> for the conversion to work.
>
> When rebooting, the memslots may not yet be bound to the guest_memfd,
> but we want to reset the guest_memfd's to private. If we use
> KVM_SET_MEMORY_ATTRIBUTES to convert, we'd be forced to first bind, then
> convert. If we had a direct ioctl, we don't have this restriction.
>
> If we do the conversion via vcpu_run() we would be forced to handle
> conversions only with a vcpu_run() and only the guest can initiate a
> conversion.
>
> On a guest boot for TDX, the memory is assumed to be private. If we
> gave it memory set as shared, we'd just have a bunch of
> KVM_EXIT_MEMORY_FAULTs that slow down boot. Hence on a guest reboot, we
> will want to reset the guest memory to private.
>
> We could say the firmware should reset memory to private on guest
> reboot, but we can't force all guests to update firmware.

Here is where I disagree. I do think that this is the CoCo guest's
responsibility (and by guest I include its firmware) to fix its own
state after a reboot. How would the host even know that a guest is
rebooting if it's a CoCo guest?

Either the host doesn't (or cannot even) know that the guest is
rebooting, in which case I don't see how having an IOCTL would help.
Or somehow the host does know that, i.e., via a hypercall that
indicates that. In which case, we could have it so that for that type
of VM, we would reconvert its pages to private on a reboot.

Additionally, we could introduce range operations for
sharing/unsharing, to avoid having to have an exit for every one.

> >> >
> >> > > >
> >> > > > > Note that for x86 CoCo cases, memory conversion is already triggered
> >> > > > > by userspace using KVM ioctl, this series is proposing to use
> >> > > > > guest_memfd ioctl to do the same.
> >> > > >
> >> > > > The reason why for x86 CoCo cases conversion is already triggered by
> >> > > > userspace using KVM ioctl is that it has to, since shared memory and
> >> > > > private memory are two separate pages, and userspace needs to manage
> >> > > > that. Sharing memory in place removes the need for that.
> >> > >
> >> > > Userspace still needs to clean up memory usage before conversion is
> >> > > successful. e.g. remove IOMMU mappings for shared to private
> >> > > conversion. I would think that memory conversion should not succeed
> >> > > before all existing users let go of the guest_memfd pages for the
> >> > > range being converted.
> >> >
> >> > Yes. Userspace will know that it needs to do that on the VCPU exit,
> >> > which informs it of the guest's hypervisor request to unshare (convert
> >> > from shared to private) the page.
> >> >
> >> > > In x86 CoCo usecases, userspace can also decide to not allow
> >> > > conversion for scenarios where ranges are still under active use by
> >> > > the host and guest is erroneously trying to take away memory. Both
> >> > > SNP/TDX spec allow failure of conversion due to in use memory.
> >> >
> >> > How can the guest erroneously try to take away memory? If the guest
> >> > sends a hypervisor request asking for a conversion of memory that
> >> > doesn't belong to it, then I would expect the hypervisor to prevent
> >> > that.
> >>
> >> Marking a range as private effectively disallows the host from
> >> accessing those ranges -> so it takes away memory.
> >
> > You said "erroneously" earlier. My question is, how can the guest
> > *erroneously* try to take away memory? This is the normal flow of
> > guest/host relations. The memory is the guest's: it decides when to
> > share it with the host, and it can take it away.
> >
>
> See above, it's not really erroneous as long as
> kvm_gmem_fault_shared() can still happen, since after unmapping, any
> host access will just fault the page again.

I was confused by the word "erroneous", as you would expect that for a
CoCo guest, the host wouldn't (or shouldn't) know the intention behind
a CoCo guest's access. I would expect that erroneous guest accesses
would be handled by the hypervisor. But I think we're on the same page
now.

> >> >
> >> > I don't see how having an IOCTL to trigger the conversion is needed to
> >> > allow conversion failure. How is that different from userspace
> >> > ignoring or delaying releasing all references it has for the
> >> > conversion request?
> >> >
> >> > > >
> >> > > > This series isn't using the same ioctl, it's introducing new ones to
> >> > > > perform a task that as far as I can tell so far, KVM can handle by
> >> > > > itself.
> >> > >
> >> > > I would like to understand this better. How will KVM handle the
> >> > > conversion process for guest_memfd pages? Can you help walk an example
> >> > > sequence for shared to private conversion specifically around
> >> > > guest_memfd offset states?
> >> >
> >> > To make sure that we are discussing the same scenario: can you do the
> >> > same as well please --- walk me through an example sequence for shared
> >> > to private conversion specifically around guest_memfd offset states
> >> > With the IOCTLs involved?
> >> >
> >> > Here is an example that I have implemented and tested with pKVM. Note
> >> > that there are alternatives, the flow below is architecture or even
> >> > vm-type dependent. None of this code is core KVM code and the
> >> > behaviour could vary.
> >> >
> >> >
> >> > Assuming the folio is shared with the host:
> >> >
> >> > Guest sends unshare hypercall to the hypervisor
> >> > Hypervisor forwards request to KVM (gmem) (having done due diligence)
> >> > KVM (gmem) performs an unmap_folio(), exits to userspace with
> >>
> >> For x86 CoCo VM usecases I was talking about, userspace would like to
> >> avoid unmap_mapping_range() on the range before it's safe to unshare
> >> the range.
> >
> > Why? There is no harm in userspace unmapping before the memory is
> > unshared. I don't see the problem with that.
> >
>
> Yes, no harm done, just possible remapping after unmapping.
>
> > You still haven't responded to my question from the previous email:
> > can you please return the favor and walk me through an example
> > sequence for shared to private conversion specifically around
> > guest_memfd offset states with the IOCTLs involved? :D
> >
>
> Right at the top :)

Thank you Ackerley!

Cheers,
/fuad

>
> > Thanks!
> > /fuad
> >
> >
> >> > KVM_EXIT_UNSHARE and all the information about the folio being
> >> > unshared
> >> >
> >> > Case 1:
> >> > Userspace removes any remaining references (GUPs, IOMMU Mappings etc...)
> >> > Userspace calls vcpu_run(): KVM (gmem) sees that there aren't any
> >> > references, sets state to PRIVATE
> >> >
> >> > Case 2 (alternative 1):
> >> > Userspace doesn't release its references
> >> > Userspace calls vcpu_run(): KVM (gmem) sees that there are still
> >> > references, exits back to userspace with KVM_EXIT_UNSHARE
> >> >
> >> > Case 2 (alternative 2):
> >> > Userspace doesn't release its references
> >> > Userspace calls vcpu_run(): KVM (gmem) sees that there are still
> >> > references, unmaps folio from guest, but allows it to run (until it
> >> > tries to fault in the folio)
> >> > Guest tries to fault in folio that still has reference, KVM does not
> >> > allow that (it sees that the folio is shared, and it doesn't fault in
> >> > shared folios to confidential guests)
> >> > KVM exits back to userspace with KVM_EXIT_UNSHARE
> >> >
> >> > As I mentioned, the alternatives above are _not_ set in core KVM code.
> >> > They can vary by architecture or VM type, depending on the policy,
> >> > support, etc..
> >> >
> >> > Now for your example please on how this would work with IOCTLs :)
> >> >
> >> > Thanks,
> >> > /fuad
> >> >
> >> > > >
> >> > > > >  - Allows not having to keep track of separate shared/private range
> >> > > > > information in KVM.
> >> > > >
> >> > > > This patch series is already tracking shared/private range information in KVM.
> >> > > >
> >> > > > >  - Simpler handling of the conversion process done per guest_memfd
> >> > > > > rather than for full range.
> >> > > > >      - Userspace can handle the rollback as needed, simplifying error
> >> > > > > handling in guest_memfd.
> >> > > > >  - guest_memfd is single source of truth and notifies the users of
> >> > > > > shareability change.
> >> > > > >      - e.g. IOMMU, userspace, KVM MMU all can be registered for
> >> > > > > getting notifications from guest_memfd directly and will get notified
> >> > > > > for invalidation upon shareability attribute updates.
> >> > > >
> >> > > > All of these can still be done without introducing a new ioctl.
> >> > > >
> >> > > > Cheers,
> >> > > > /fuad

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-21 12:36                   ` Fuad Tabba
@ 2025-05-21 14:42                     ` Vishal Annapurve
  2025-05-21 15:21                       ` Fuad Tabba
  0 siblings, 1 reply; 231+ messages in thread
From: Vishal Annapurve @ 2025-05-21 14:42 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

On Wed, May 21, 2025 at 5:36 AM Fuad Tabba <tabba@google.com> wrote:
> ....
> > When rebooting, the memslots may not yet be bound to the guest_memfd,
> > but we want to reset the guest_memfd's to private. If we use
> > KVM_SET_MEMORY_ATTRIBUTES to convert, we'd be forced to first bind, then
> > convert. If we had a direct ioctl, we don't have this restriction.
> >
> > If we do the conversion via vcpu_run() we would be forced to handle
> > conversions only with a vcpu_run() and only the guest can initiate a
> > conversion.
> >
> > On a guest boot for TDX, the memory is assumed to be private. If we
> > gave it memory set as shared, we'd just have a bunch of
> > KVM_EXIT_MEMORY_FAULTs that slow down boot. Hence on a guest reboot, we
> > will want to reset the guest memory to private.
> >
> > We could say the firmware should reset memory to private on guest
> > reboot, but we can't force all guests to update firmware.
>
> Here is where I disagree. I do think that this is the CoCo guest's
> responsibility (and by guest I include its firmware) to fix its own
> state after a reboot. How would the host even know that a guest is
> rebooting if it's a CoCo guest?

There are a bunch of complexities here. The reboot sequence on x86 can
be triggered in multiple ways that I don't fully understand, but a few
of them involve reading/writing a "reset register" in MMIO/PCI config
space that is emulated by the host userspace directly. The host has to
know when the guest is shutting down to manage its lifecycle.

x86 CoCo VM firmwares don't support warm/soft reboot, and even if they
do in the future, the guest kernel can choose a different reboot
mechanism. So a guest reboot needs to be emulated by always starting
from scratch. This sequence needs the initial guest firmware payload to
be installed into the private ranges of guest_memfd.

>
> Either the host doesn't (or cannot even) know that the guest is
> rebooting, in which case I don't see how having an IOCTL would help.

Host does know that the guest is rebooting.

> Or somehow the host does know that, i.e., via a hypercall that
> indicates that. In which case, we could have it so that for that type
> of VM, we would reconvert its pages to private on a reboot.

This possibly could be solved by resetting the ranges to private when
binding with a memslot of a certain VM type. But then Google also has a
use case to support intrahost migration, where a live VM and its
associated guest_memfd files are bound to a new KVM VM and memslots.

Otherwise, we need an additional contract between userspace/KVM to
intercept/handle guest_memfd range reset.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-21 14:42                     ` Vishal Annapurve
@ 2025-05-21 15:21                       ` Fuad Tabba
  2025-05-21 15:51                         ` Vishal Annapurve
  0 siblings, 1 reply; 231+ messages in thread
From: Fuad Tabba @ 2025-05-21 15:21 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

Hi Vishal,

On Wed, 21 May 2025 at 15:42, Vishal Annapurve <vannapurve@google.com> wrote:
>
> On Wed, May 21, 2025 at 5:36 AM Fuad Tabba <tabba@google.com> wrote:
> > ....
> > > When rebooting, the memslots may not yet be bound to the guest_memfd,
> > > but we want to reset the guest_memfd's to private. If we use
> > > KVM_SET_MEMORY_ATTRIBUTES to convert, we'd be forced to first bind, then
> > > convert. If we had a direct ioctl, we don't have this restriction.
> > >
> > > If we do the conversion via vcpu_run() we would be forced to handle
> > > conversions only with a vcpu_run() and only the guest can initiate a
> > > conversion.
> > >
> > > On a guest boot for TDX, the memory is assumed to be private. If the we
> > > gave it memory set as shared, we'd just have a bunch of
> > > KVM_EXIT_MEMORY_FAULTs that slow down boot. Hence on a guest reboot, we
> > > will want to reset the guest memory to private.
> > >
> > > We could say the firmware should reset memory to private on guest
> > > reboot, but we can't force all guests to update firmware.
> >
> > Here is where I disagree. I do think that this is the CoCo guest's
> > responsibility (and by guest I include its firmware) to fix its own
> > state after a reboot. How would the host even know that a guest is
> > rebooting if it's a CoCo guest?
>
> There are a bunch of complexities here, reboot sequence on x86 can be
> triggered using multiple ways that I don't fully understand, but few
> of them include reading/writing to "reset register" in MMIO/PCI config
> space that are emulated by the host userspace directly. Host has to
> know when the guest is shutting down to manage it's lifecycle.

In that case, I think we need to fully understand these complexities
before adding new IOCTLs. It could be that once we understand these
issues, we find that we don't need these IOCTLs. It's hard to justify
adding an IOCTL for something we don't understand.

> x86 CoCo VM firmwares don't support warm/soft reboot and even if it
> does in future, guest kernel can choose a different reboot mechanism.
> So guest reboot needs to be emulated by always starting from scratch.
> This sequence needs initial guest firmware payload to be installed
> into private ranges of guest_memfd.
>
> >
> > Either the host doesn't (or cannot even) know that the guest is
> > rebooting, in which case I don't see how having an IOCTL would help.
>
> Host does know that the guest is rebooting.

In that case, that (i.e., the host finding out that the guest is
rebooting) could trigger the conversion back to private. No need for
an IOCTL.

> > Or somehow the host does know that, i.e., via a hypercall that
> > indicates that. In which case, we could have it so that for that type
> > of VM, we would reconvert its pages to private on a reboot.
>
> This possibly could be solved by resetting the ranges to private when
> binding with a memslot of certain VM type. But then Google also has a
> usecase to support intrahost migration where a live VM and associated
> guest_memfd files are bound to new KVM VM and memslots.
>
> Otherwise, we need an additional contract between userspace/KVM to
> intercept/handle guest_memfd range reset.

Then this becomes a migration issue to be solved at that point, not a
huge page support issue. If such IOCTLs are needed for migration, it's
too early to add them now.

Cheers,
/fuad

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-21 15:21                       ` Fuad Tabba
@ 2025-05-21 15:51                         ` Vishal Annapurve
  2025-05-21 18:27                           ` Fuad Tabba
  0 siblings, 1 reply; 231+ messages in thread
From: Vishal Annapurve @ 2025-05-21 15:51 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

On Wed, May 21, 2025 at 8:22 AM Fuad Tabba <tabba@google.com> wrote:
>
> Hi Vishal,
>
> On Wed, 21 May 2025 at 15:42, Vishal Annapurve <vannapurve@google.com> wrote:
> >
> > On Wed, May 21, 2025 at 5:36 AM Fuad Tabba <tabba@google.com> wrote:
> > > ....
> > > > When rebooting, the memslots may not yet be bound to the guest_memfd,
> > > > but we want to reset the guest_memfd's to private. If we use
> > > > KVM_SET_MEMORY_ATTRIBUTES to convert, we'd be forced to first bind, then
> > > > convert. If we had a direct ioctl, we don't have this restriction.
> > > >
> > > > If we do the conversion via vcpu_run() we would be forced to handle
> > > > conversions only with a vcpu_run() and only the guest can initiate a
> > > > conversion.
> > > >
> > > > On a guest boot for TDX, the memory is assumed to be private. If the we
> > > > gave it memory set as shared, we'd just have a bunch of
> > > > KVM_EXIT_MEMORY_FAULTs that slow down boot. Hence on a guest reboot, we
> > > > will want to reset the guest memory to private.
> > > >
> > > > We could say the firmware should reset memory to private on guest
> > > > reboot, but we can't force all guests to update firmware.
> > >
> > > Here is where I disagree. I do think that this is the CoCo guest's
> > > responsibility (and by guest I include its firmware) to fix its own
> > > state after a reboot. How would the host even know that a guest is
> > > rebooting if it's a CoCo guest?
> >
> > There are a bunch of complexities here, reboot sequence on x86 can be
> > triggered using multiple ways that I don't fully understand, but few
> > of them include reading/writing to "reset register" in MMIO/PCI config
> > space that are emulated by the host userspace directly. Host has to
> > know when the guest is shutting down to manage it's lifecycle.
>
> In that case, I think we need to fully understand these complexities
> before adding new IOCTLs. It could be that once we understand these
> issues, we find that we don't need these IOCTLs. It's hard to justify
> adding an IOCTL for something we don't understand.
>

I don't understand all the ways an x86 guest can trigger a reboot, but
I do know that the x86 CoCo Linux guest kernel triggers a reset using an
MMIO/PCI config register write that is emulated by host userspace.
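
For concreteness, a rough sketch of how host userspace might catch one
such trigger. The legacy 0xcf9 reset control port is just an example;
struct vmm_vm and begin_emulated_reboot() are hypothetical VMM-side
helpers, not anything from this series:

#include <linux/kvm.h>

struct vmm_vm;                                   /* hypothetical VMM state */
void begin_emulated_reboot(struct vmm_vm *vm);   /* hypothetical helper */

static void handle_io_exit(struct kvm_run *run, struct vmm_vm *vm)
{
        /* Guest kernel wrote the reset control register: emulate reboot by
         * tearing down vCPUs, resetting guest memory state and reloading
         * firmware, then restarting the VM. */
        if (run->io.direction == KVM_EXIT_IO_OUT && run->io.port == 0xcf9)
                begin_emulated_reboot(vm);
}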

> > x86 CoCo VM firmwares don't support warm/soft reboot and even if it
> > does in future, guest kernel can choose a different reboot mechanism.
> > So guest reboot needs to be emulated by always starting from scratch.
> > This sequence needs initial guest firmware payload to be installed
> > into private ranges of guest_memfd.
> >
> > >
> > > Either the host doesn't (or cannot even) know that the guest is
> > > rebooting, in which case I don't see how having an IOCTL would help.
> >
> > Host does know that the guest is rebooting.
>
> In that case, that (i.e., the host finding out that the guest is
> rebooting) could trigger the conversion back to private. No need for
> an IOCTL.

In the reboot scenarios, it's the host userspace finding out that the
guest kernel wants to reboot.

>
> > > Or somehow the host does know that, i.e., via a hypercall that
> > > indicates that. In which case, we could have it so that for that type
> > > of VM, we would reconvert its pages to private on a reboot.
> >
> > This possibly could be solved by resetting the ranges to private when
> > binding with a memslot of certain VM type. But then Google also has a
> > usecase to support intrahost migration where a live VM and associated
> > guest_memfd files are bound to new KVM VM and memslots.
> >
> > Otherwise, we need an additional contract between userspace/KVM to
> > intercept/handle guest_memfd range reset.
>
> Then this becomes a migration issue to be solved then, not a huge page
> support issue. If such IOCTLs are needed for migration, it's too early
> to add them now.

The guest_memfd ioctl is not needed for migration but to change/reset
guest_memfd range attributes. I am saying that the migration use case
can conflict with some of the ways we could reset guest_memfd range
attributes without adding a new IOCTL, since migration closely
resembles the reboot scenario: both can/need to reuse the same guest
memory files, but one needs to preserve guest memory state.

Reiterating my understanding here, the guest memfd ioctl can be used by
host userspace to -
1) Change guest memfd range attributes during memory conversion
     - This can be handled by KVM hypercall exits in theory, as you are
suggesting, but Ackerley and I still think that this is a memory
operation that goes beyond vcpu scope and will involve interaction with
the IOMMU backend as well; it's cleaner to have a separate guest memfd
specific ioctl for this operation as the impact is even beyond KVM.

2) Reset guest memfd range attributes during guest reboot to allow
reusing the same guest memfd files.
    - This helps reset the range state to private as needed, in line
with the initial shared/private configuration chosen at guest memfd
creation.
    - This also helps reconstitute all the huge pages, which may have
gotten split during the runtime of the guest, back to their original
state.
  This is a host-initiated request for guest memfd memory conversion
that we should not be overloading with other KVM interactions, in my
opinion.
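
To make (2) concrete, here is a rough sketch of the reboot-time reset
from host userspace, before any memslots are bound to the new VM. The
ioctl name is the one introduced in this series; the argument struct
below is just a placeholder, not the proposed uAPI:

#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/kvm.h>   /* KVM_GMEM_CONVERT_PRIVATE from this series */

struct gmem_convert_range {        /* placeholder layout */
        __u64 offset;
        __u64 size;
};

static int gmem_reset_private(int gmem_fd, __u64 file_size)
{
        struct gmem_convert_range r = { .offset = 0, .size = file_size };

        /* Reverts shareability of the whole file to the initial (private)
         * state and lets guest_memfd merge folios split while shared. */
        return ioctl(gmem_fd, KVM_GMEM_CONVERT_PRIVATE, &r);
}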

>
> Cheers,
> /fuad

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 36/51] mm: Convert split_folio() macro to function
  2025-05-14 23:42 ` [RFC PATCH v2 36/51] mm: Convert split_folio() macro to function Ackerley Tng
@ 2025-05-21 16:40   ` Edgecombe, Rick P
  0 siblings, 0 replies; 231+ messages in thread
From: Edgecombe, Rick P @ 2025-05-21 16:40 UTC (permalink / raw)
  To: ackerleytng@google.com, kvm@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, x86@kernel.org
  Cc: palmer@dabbelt.com, pvorel@suse.cz, catalin.marinas@arm.com,
	Miao, Jun, Shutemov, Kirill, pdurrant@amazon.co.uk,
	steven.price@arm.com, peterx@redhat.com, vbabka@suse.cz,
	jack@suse.cz, amoorthy@google.com, maz@kernel.org,
	keirf@google.com, vkuznets@redhat.com, quic_eberman@quicinc.com,
	mail@maciej.szmigiero.name, hughd@google.com,
	anthony.yznaga@oracle.com, Wang, Wei W, Du, Fan,
	Wieczor-Retman, Maciej, quic_svaddagi@quicinc.com, Hansen, Dave,
	ajones@ventanamicro.com, paul.walmsley@sifive.com,
	nsaenz@amazon.es, aik@amd.com, usama.arif@bytedance.com,
	quic_mnalajal@quicinc.com, fvdl@google.com, rppt@kernel.org,
	quic_cvanscha@quicinc.com, bfoster@redhat.com,
	willy@infradead.org, anup@brainfault.org, thomas.lendacky@amd.com,
	tabba@google.com, mic@digikod.net, oliver.upton@linux.dev,
	akpm@linux-foundation.org, Zhao, Yan Y, binbin.wu@linux.intel.com,
	muchun.song@linux.dev, Li, Zhiquan1, rientjes@google.com,
	mpe@ellerman.id.au, Aktas, Erdem, david@redhat.com, jgg@ziepe.ca,
	Annapurve, Vishal, Xu, Haibo1, jhubbard@nvidia.com,
	Yamahata, Isaku, jthoughton@google.com, will@kernel.org,
	steven.sistare@oracle.com, jarkko@kernel.org,
	quic_pheragu@quicinc.com, chenhuacai@kernel.org, Huang, Kai,
	shuah@kernel.org, dwmw@amazon.co.uk, pankaj.gupta@amd.com,
	Peng, Chao P, nikunj@amd.com, Graf, Alexander,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Xu, Yilun, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com,
	richard.weiyang@gmail.com, Weiny, Ira, aou@eecs.berkeley.edu,
	Li, Xiaoyao, qperret@google.com, kent.overstreet@linux.dev,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	pgonda@google.com, quic_pderrin@quicinc.com, hch@infradead.org,
	roypat@amazon.co.uk, seanjc@google.com

On Wed, 2025-05-14 at 16:42 -0700, Ackerley Tng wrote:
> +int split_folio_to_list(struct folio *folio, struct list_head *list);

With CONFIG_TRANSPARENT_HUGEPAGE=n, I get:

include/linux/huge_mm.h:569:19: error: static declaration of
‘split_folio_to_list’ follows non-static declaration
  569 | static inline int split_folio_to_list(struct folio *folio, struct
list_head *list)
      |                   ^~~~~~~~~~~~~~~~~~~
include/linux/huge_mm.h:102:5: note: previous declaration of
‘split_folio_to_list’ with type ‘int(struct folio *, struct list_head *)’
  102 | int split_folio_to_list(struct folio *folio, struct list_head *list);


> +static inline int split_folio(struct folio *folio)
> +{
> +	return split_folio_to_list(folio, NULL);
> +}
>  
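
One way to avoid the clash (a rough sketch, not tested against the
actual huge_mm.h layout in this series) is to keep the out-of-line
prototype THP-only, so the !THP static inline stub stays the sole
definition in that configuration:

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
int split_folio_to_list(struct folio *folio, struct list_head *list);
#else
static inline int split_folio_to_list(struct folio *folio,
                                      struct list_head *list)
{
        return 0;
}
#endif

static inline int split_folio(struct folio *folio)
{
        return split_folio_to_list(folio, NULL);
}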


^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 33/51] KVM: guest_memfd: Allocate and truncate from custom allocator
  2025-05-14 23:42 ` [RFC PATCH v2 33/51] KVM: guest_memfd: Allocate and truncate from " Ackerley Tng
@ 2025-05-21 18:05   ` Vishal Annapurve
  2025-05-22 23:12   ` Edgecombe, Rick P
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 231+ messages in thread
From: Vishal Annapurve @ 2025-05-21 18:05 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

On Wed, May 14, 2025 at 4:43 PM Ackerley Tng <ackerleytng@google.com> wrote:
> ...
> +/**
> + * kvm_gmem_zero_range() - Zeroes all sub-pages in range [@start, @end).
> + *
> + * @mapping: the filemap to remove this range from.
> + * @start: index in filemap for start of range (inclusive).
> + * @end: index in filemap for end of range (exclusive).
> + *
> + * The pages in range may be split. truncate_inode_pages_range() isn't the right
> + * function because it removes pages from the page cache; this function only
> + * zeroes the pages.
> + */
> +static void kvm_gmem_zero_range(struct address_space *mapping,
> +                               pgoff_t start, pgoff_t end)
> +{
> +       struct folio_batch fbatch;
> +
> +       folio_batch_init(&fbatch);
> +       while (filemap_get_folios(mapping, &start, end - 1, &fbatch)) {
> +               unsigned int i;
> +
> +               for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> +                       struct folio *f;
> +                       size_t nr_bytes;
> +
> +                       f = fbatch.folios[i];
> +                       nr_bytes = offset_in_folio(f, end << PAGE_SHIFT);
> +                       if (nr_bytes == 0)
> +                               nr_bytes = folio_size(f);
> +
> +                       folio_zero_segment(f, 0, nr_bytes);

folio_zero_segment() takes start and end byte offsets within the
folio. This invocation needs to operate on the part of the folio that
overlaps with [start, end), but instead it always starts from 0 and
ends at an unaligned offset within the folio. This will result in
zeroing more than requested, or less than requested, or both, depending
on the request and folio size.

> +               }
> +
> +               folio_batch_release(&fbatch);
> +               cond_resched();
> +       }
> +}
> +

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-21 15:51                         ` Vishal Annapurve
@ 2025-05-21 18:27                           ` Fuad Tabba
  2025-05-22 14:52                             ` Sean Christopherson
  0 siblings, 1 reply; 231+ messages in thread
From: Fuad Tabba @ 2025-05-21 18:27 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

Hi Vishal,

On Wed, 21 May 2025 at 16:51, Vishal Annapurve <vannapurve@google.com> wrote:
>
> On Wed, May 21, 2025 at 8:22 AM Fuad Tabba <tabba@google.com> wrote:
> >
> > Hi Vishal,
> >
> > On Wed, 21 May 2025 at 15:42, Vishal Annapurve <vannapurve@google.com> wrote:
> > >
> > > On Wed, May 21, 2025 at 5:36 AM Fuad Tabba <tabba@google.com> wrote:
> > > > ....
> > > > > When rebooting, the memslots may not yet be bound to the guest_memfd,
> > > > > but we want to reset the guest_memfd's to private. If we use
> > > > > KVM_SET_MEMORY_ATTRIBUTES to convert, we'd be forced to first bind, then
> > > > > convert. If we had a direct ioctl, we don't have this restriction.
> > > > >
> > > > > If we do the conversion via vcpu_run() we would be forced to handle
> > > > > conversions only with a vcpu_run() and only the guest can initiate a
> > > > > conversion.
> > > > >
> > > > > On a guest boot for TDX, the memory is assumed to be private. If the we
> > > > > gave it memory set as shared, we'd just have a bunch of
> > > > > KVM_EXIT_MEMORY_FAULTs that slow down boot. Hence on a guest reboot, we
> > > > > will want to reset the guest memory to private.
> > > > >
> > > > > We could say the firmware should reset memory to private on guest
> > > > > reboot, but we can't force all guests to update firmware.
> > > >
> > > > Here is where I disagree. I do think that this is the CoCo guest's
> > > > responsibility (and by guest I include its firmware) to fix its own
> > > > state after a reboot. How would the host even know that a guest is
> > > > rebooting if it's a CoCo guest?
> > >
> > > There are a bunch of complexities here, reboot sequence on x86 can be
> > > triggered using multiple ways that I don't fully understand, but few
> > > of them include reading/writing to "reset register" in MMIO/PCI config
> > > space that are emulated by the host userspace directly. Host has to
> > > know when the guest is shutting down to manage it's lifecycle.
> >
> > In that case, I think we need to fully understand these complexities
> > before adding new IOCTLs. It could be that once we understand these
> > issues, we find that we don't need these IOCTLs. It's hard to justify
> > adding an IOCTL for something we don't understand.
> >
>
> I don't understand all the ways x86 guest can trigger reboot but I do
> know that x86 CoCo linux guest kernel triggers reset using MMIO/PCI
> config register write that is emulated by host userspace.
>
> > > x86 CoCo VM firmwares don't support warm/soft reboot and even if it
> > > does in future, guest kernel can choose a different reboot mechanism.
> > > So guest reboot needs to be emulated by always starting from scratch.
> > > This sequence needs initial guest firmware payload to be installed
> > > into private ranges of guest_memfd.
> > >
> > > >
> > > > Either the host doesn't (or cannot even) know that the guest is
> > > > rebooting, in which case I don't see how having an IOCTL would help.
> > >
> > > Host does know that the guest is rebooting.
> >
> > In that case, that (i.e., the host finding out that the guest is
> > rebooting) could trigger the conversion back to private. No need for
> > an IOCTL.
>
> In the reboot scenarios, it's the host userspace finding out that the
> guest kernel wants to reboot.

How does the host userspace find that out? If the host userspace is
capable of finding that out, then surely KVM is also capable of
finding out the same.


> >
> > > > Or somehow the host does know that, i.e., via a hypercall that
> > > > indicates that. In which case, we could have it so that for that type
> > > > of VM, we would reconvert its pages to private on a reboot.
> > >
> > > This possibly could be solved by resetting the ranges to private when
> > > binding with a memslot of certain VM type. But then Google also has a
> > > usecase to support intrahost migration where a live VM and associated
> > > guest_memfd files are bound to new KVM VM and memslots.
> > >
> > > Otherwise, we need an additional contract between userspace/KVM to
> > > intercept/handle guest_memfd range reset.
> >
> > Then this becomes a migration issue to be solved then, not a huge page
> > support issue. If such IOCTLs are needed for migration, it's too early
> > to add them now.
>
> The guest_memfd ioctl is not needed for migration but to change/reset
> guest_memfd range attributes. I am saying that migration usecase can
> conflict with some ways that we can solve resetting guest_memfd range
> attributes without adding a new IOCTL as migration closely resembles
> reboot scenario as both of them can/need reusing the same guest memory
> files but one needs to preserve guest memory state.
>
> Reiterating my understanding here, guest memfd ioctl can be used by
> host userspace to -
> 1) Change guest memfd range attributes during memory conversion
>      - This can be handled by KVM hypercall exits in theory as you are
> suggesting but Ackerley and me are still thinking that this is a
> memory operation that goes beyond vcpu scope and will involve
> interaction with IOMMU backend as well, it's cleaner to have a
> separate guest memfd specific ioctl for this operation as the impact
> is even beyond KVM.

The IOMMU backend needs to know about the sharing/unsharing, not
trigger it. The memory is the guest's. We already have a mechanism for
informing userspace of these kinds of events with KVM exits. This
doesn't justify adding a new IOCTL.

> 2) Reset guest memfd range attributes during guest reboot to allow
> reusing the same guest memfd files.
>     - This helps reset the range state to private as needed inline
> with initial shared/private configuration chosen at the guest memfd
> creation.
>     - This also helps reconstitute all the huge pages back to their
> original state that may have gotten split during the runtime of the
> guest.
>   This is a host initiated request for guest memfd memory conversion
> that we should not be overloading with other KVM interactions in my
> opinion.

Then, we could argue about whether we need a "reset" IOCTL (not that I
am arguing for that). But still, like I said, if the host becomes
aware that the confidential guest is rebooting, then surely KVM can be
made aware.

I wonder if this might be better suited for the biweekly guest_memfd sync.

Cheers,
/fuad
> >
> > Cheers,
> > /fuad

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-21 18:27                           ` Fuad Tabba
@ 2025-05-22 14:52                             ` Sean Christopherson
  2025-05-22 15:07                               ` Fuad Tabba
  0 siblings, 1 reply; 231+ messages in thread
From: Sean Christopherson @ 2025-05-22 14:52 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Vishal Annapurve, Ackerley Tng, kvm, linux-mm, linux-kernel, x86,
	linux-fsdevel, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

On Wed, May 21, 2025, Fuad Tabba wrote:
> On Wed, 21 May 2025 at 16:51, Vishal Annapurve <vannapurve@google.com> wrote:
> > On Wed, May 21, 2025 at 8:22 AM Fuad Tabba <tabba@google.com> wrote:
> > > On Wed, 21 May 2025 at 15:42, Vishal Annapurve <vannapurve@google.com> wrote:
> > > > On Wed, May 21, 2025 at 5:36 AM Fuad Tabba <tabba@google.com> wrote:
> > > > There are a bunch of complexities here, reboot sequence on x86 can be
> > > > triggered using multiple ways that I don't fully understand, but few
> > > > of them include reading/writing to "reset register" in MMIO/PCI config
> > > > space that are emulated by the host userspace directly. Host has to
> > > > know when the guest is shutting down to manage it's lifecycle.
> > >
> > > In that case, I think we need to fully understand these complexities
> > > before adding new IOCTLs. It could be that once we understand these
> > > issues, we find that we don't need these IOCTLs. It's hard to justify
> > > adding an IOCTL for something we don't understand.
> > >
> >
> > I don't understand all the ways x86 guest can trigger reboot but I do
> > know that x86 CoCo linux guest kernel triggers reset using MMIO/PCI
> > config register write that is emulated by host userspace.
> >
> > > > x86 CoCo VM firmwares don't support warm/soft reboot and even if it
> > > > does in future, guest kernel can choose a different reboot mechanism.
> > > > So guest reboot needs to be emulated by always starting from scratch.
> > > > This sequence needs initial guest firmware payload to be installed
> > > > into private ranges of guest_memfd.
> > > >
> > > > >
> > > > > Either the host doesn't (or cannot even) know that the guest is
> > > > > rebooting, in which case I don't see how having an IOCTL would help.
> > > >
> > > > Host does know that the guest is rebooting.
> > >
> > > In that case, that (i.e., the host finding out that the guest is
> > > rebooting) could trigger the conversion back to private. No need for an
> > > IOCTL.
> >
> > In the reboot scenarios, it's the host userspace finding out that the guest
> > kernel wants to reboot.
> 
> How does the host userspace find that out? If the host userspace is capable
> of finding that out, then surely KVM is also capable of finding out the same.

Nope, not on x86.  Well, not without userspace invoking a new ioctl, which would
defeat the purpose of adding these ioctls.

KVM is only responsible for emulating/virtualizing the "CPU".  The chipset, e.g.
the PCI config space, is fully owned by userspace.  KVM doesn't even know whether
or not PCI exists for the VM.  And reboot may be emulated by simply creating a
new KVM instance, i.e. even if KVM was somehow aware of the reboot request, the
change in state would happen in an entirely new struct kvm.

That said, Vishal and Ackerley, this patch is a bit lacking on the documentation
front.  The changelog asserts that:

  A guest_memfd ioctl is used because shareability is a property of the memory,
  and this property should be modifiable independently of the attached struct kvm

but then follows with a very weak and IMO largely irrelevant justification of:

  This allows shareability to be modified even if the memory is not yet bound
  using memslots.

Allowing userspace to change shareability without memslots is one relatively minor
flow in one very specific use case.

The real justification for these ioctls is that fundamentally, shareability for
in-place conversions is a property of a guest_memfd instance and not a struct kvm
instance, and so needs to owned by guest_memfd.

I.e. focus on justifying the change from a design and conceptual perspective,
not from a mechanical perspective of a flow that likely's somewhat unique to our
specific environment.  Y'all are getting deep into the weeds on a random aspect
of x86 platform architecture, instead of focusing on the overall design.

The other issue that's likely making this more confusing than it needs to be is
that this series is actually two completely different series bundled into one,
with very little explanation.  Moving shared vs. private ownership into
guest_memfd isn't a requirement for 1GiB support, it's a requirement for in-place
shared/private conversion in guest_memfd.

For the current guest_memfd implementation, shared vs. private is tracked in the
VM via memory attributes, because a guest_memfd instance is *only* private.  I.e.
shared vs. private is a property of the VM, not of the guest_memfd instance.  But
when in-place conversion support comes along, ownership of that particular
attribute needs to shift to the guest_memfd instance.

I know I gave feedback on an earlier posting about there being too many series flying
around, but shoving two distinct concepts into a single series is not the answer.
My complaints about too much noise wasn't that there were multiple series, it was
that there was very little coordination and lots of chaos.

If you split this series in two, which should be trivial since you've already
organized the patches as a split, then sans the selftests (thank you for those!),
in-place conversion support will be its own (much smaller!) series that can focus
on that specific aspect of the design, and can provide a cover letter that
expounds on the design goals and uAPI.

  KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION
  KVM: Query guest_memfd for private/shared status
  KVM: guest_memfd: Skip LRU for guest_memfd folios
  KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  KVM: guest_memfd: Introduce and use shareability to guard faulting
  KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes

And then you can post the 1GiB series separately.  So long as you provide pointers
to dependencies along with a link to a repo+branch with the kitchen sink, I won't
complain about things being too chaotic :-)

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-22 14:52                             ` Sean Christopherson
@ 2025-05-22 15:07                               ` Fuad Tabba
  2025-05-22 16:26                                 ` Sean Christopherson
  0 siblings, 1 reply; 231+ messages in thread
From: Fuad Tabba @ 2025-05-22 15:07 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vishal Annapurve, Ackerley Tng, kvm, linux-mm, linux-kernel, x86,
	linux-fsdevel, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

Hi Sean,

On Thu, 22 May 2025 at 15:52, Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, May 21, 2025, Fuad Tabba wrote:
> > On Wed, 21 May 2025 at 16:51, Vishal Annapurve <vannapurve@google.com> wrote:
> > > On Wed, May 21, 2025 at 8:22 AM Fuad Tabba <tabba@google.com> wrote:
> > > > On Wed, 21 May 2025 at 15:42, Vishal Annapurve <vannapurve@google.com> wrote:
> > > > > On Wed, May 21, 2025 at 5:36 AM Fuad Tabba <tabba@google.com> wrote:
> > > > > There are a bunch of complexities here, reboot sequence on x86 can be
> > > > > triggered using multiple ways that I don't fully understand, but few
> > > > > of them include reading/writing to "reset register" in MMIO/PCI config
> > > > > space that are emulated by the host userspace directly. Host has to
> > > > > know when the guest is shutting down to manage it's lifecycle.
> > > >
> > > > In that case, I think we need to fully understand these complexities
> > > > before adding new IOCTLs. It could be that once we understand these
> > > > issues, we find that we don't need these IOCTLs. It's hard to justify
> > > > adding an IOCTL for something we don't understand.
> > > >
> > >
> > > I don't understand all the ways x86 guest can trigger reboot but I do
> > > know that x86 CoCo linux guest kernel triggers reset using MMIO/PCI
> > > config register write that is emulated by host userspace.
> > >
> > > > > x86 CoCo VM firmwares don't support warm/soft reboot and even if it
> > > > > does in future, guest kernel can choose a different reboot mechanism.
> > > > > So guest reboot needs to be emulated by always starting from scratch.
> > > > > This sequence needs initial guest firmware payload to be installed
> > > > > into private ranges of guest_memfd.
> > > > >
> > > > > >
> > > > > > Either the host doesn't (or cannot even) know that the guest is
> > > > > > rebooting, in which case I don't see how having an IOCTL would help.
> > > > >
> > > > > Host does know that the guest is rebooting.
> > > >
> > > > In that case, that (i.e., the host finding out that the guest is
> > > > rebooting) could trigger the conversion back to private. No need for an
> > > > IOCTL.
> > >
> > > In the reboot scenarios, it's the host userspace finding out that the guest
> > > kernel wants to reboot.
> >
> > How does the host userspace find that out? If the host userspace is capable
> > of finding that out, then surely KVM is also capable of finding out the same.
>
> Nope, not on x86.  Well, not without userspace invoking a new ioctl, which would
> defeat the purpose of adding these ioctls.
>
> KVM is only responsible for emulating/virtualizing the "CPU".  The chipset, e.g.
> the PCI config space, is fully owned by userspace.  KVM doesn't even know whether
> or not PCI exists for the VM.  And reboot may be emulated by simply creating a
> new KVM instance, i.e. even if KVM was somehow aware of the reboot request, the
> change in state would happen in an entirely new struct kvm.
>
> That said, Vishal and Ackerley, this patch is a bit lacking on the documentation
> front.  The changelog asserts that:
>
>   A guest_memfd ioctl is used because shareability is a property of the memory,
>   and this property should be modifiable independently of the attached struct kvm
>
> but then follows with a very weak and IMO largely irrelevant justification of:
>
>   This allows shareability to be modified even if the memory is not yet bound
>   using memslots.
>
> Allowing userspace to change shareability without memslots is one relatively minor
> flow in one very specific use case.
>
> The real justification for these ioctls is that fundamentally, shareability for
> in-place conversions is a property of a guest_memfd instance and not a struct kvm
> instance, and so needs to owned by guest_memfd.

Thanks for the clarification Sean. I have a couple of followup
questions/comments that you might be able to help with:

From a conceptual point of view, I understand that the in-place
conversion is a property of guest_memfd. But that doesn't necessarily
mean that the interface between kvm <-> guest_memfd is a userspace
IOCTL. We already communicate directly between the two. Other, even
less related subsystems within the kernel also interact without going
through userspace. Why can't we do the same here? I'm not suggesting
it not be owned by guest_memfd, but that we communicate directly.

From a performance point of view, I would expect the common case to be
that when KVM gets an unshare request from the guest, it would be able
to unmap those pages from the (cooperative) host userspace, and return
to the guest. In this scenario, the host userspace wouldn't even
need to be involved. Having a userspace IOCTL as part of this makes
that trip unnecessarily longer for the common case.

Cheers,
/fuad

> I.e. focus on justifying the change from a design and conceptual perspective,
> not from a mechanical perspective of a flow that likely's somewhat unique to our
> specific environment.  Y'all are getting deep into the weeds on a random aspect
> of x86 platform architecture, instead of focusing on the overall design.
>
> The other issue that's likely making this more confusing than it needs to be is
> that this series is actually two completely different series bundled into one,
> with very little explanation.  Moving shared vs. private ownership into
> guest_memfd isn't a requirement for 1GiB support, it's a requirement for in-place
> shared/private conversion in guest_memfd.
>
> For the current guest_memfd implementation, shared vs. private is tracked in the
> VM via memory attributes, because a guest_memfd instance is *only* private.  I.e.
> shared vs. private is a property of the VM, not of the guest_memfd instance.  But
> when in-place conversion support comes along, ownership of that particular
> attribute needs to shift to the guest_memfd instance.
>
> I know I gave feedback on earlier posting about there being too series flying
> around, but shoving two distinct concepts into a single series is not the answer.
> My complaints about too much noise wasn't that there were multiple series, it was
> that there was very little coordination and lots of chaos.
>
> If you split this series in two, which should be trivial since you've already
> organized the patches as a split, then sans the selftests (thank you for those!),
> in-place conversion support will be its own (much smaller!) series that can focus
> on that specific aspect of the design, and can provide a cover letter that
> expounds on the design goals and uAPI.
>
>   KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION
>   KVM: Query guest_memfd for private/shared status
>   KVM: guest_memfd: Skip LRU for guest_memfd folios
>   KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
>   KVM: guest_memfd: Introduce and use shareability to guard faulting
>   KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
>
> And then you can post the 1GiB series separately.  So long as you provide pointers
> to dependencies along with a link to a repo+branch with the kitchen sink, I won't
> complain about things being too chaotic :-)

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-22 15:07                               ` Fuad Tabba
@ 2025-05-22 16:26                                 ` Sean Christopherson
  2025-05-23 10:12                                   ` Fuad Tabba
  0 siblings, 1 reply; 231+ messages in thread
From: Sean Christopherson @ 2025-05-22 16:26 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Vishal Annapurve, Ackerley Tng, kvm, linux-mm, linux-kernel, x86,
	linux-fsdevel, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

On Thu, May 22, 2025, Fuad Tabba wrote:
> On Thu, 22 May 2025 at 15:52, Sean Christopherson <seanjc@google.com> wrote:
> > On Wed, May 21, 2025, Fuad Tabba wrote:
> > > How does the host userspace find that out? If the host userspace is capable
> > > of finding that out, then surely KVM is also capable of finding out the same.
> >
> > Nope, not on x86.  Well, not without userspace invoking a new ioctl, which would
> > defeat the purpose of adding these ioctls.
> >
> > KVM is only responsible for emulating/virtualizing the "CPU".  The chipset, e.g.
> > the PCI config space, is fully owned by userspace.  KVM doesn't even know whether
> > or not PCI exists for the VM.  And reboot may be emulated by simply creating a
> > new KVM instance, i.e. even if KVM was somehow aware of the reboot request, the
> > change in state would happen in an entirely new struct kvm.
> >
> > That said, Vishal and Ackerley, this patch is a bit lacking on the documentation
> > front.  The changelog asserts that:
> >
> >   A guest_memfd ioctl is used because shareability is a property of the memory,
> >   and this property should be modifiable independently of the attached struct kvm
> >
> > but then follows with a very weak and IMO largely irrelevant justification of:
> >
> >   This allows shareability to be modified even if the memory is not yet bound
> >   using memslots.
> >
> > Allowing userspace to change shareability without memslots is one relatively minor
> > flow in one very specific use case.
> >
> > The real justification for these ioctls is that fundamentally, shareability for
> > in-place conversions is a property of a guest_memfd instance and not a struct kvm
> > instance, and so needs to owned by guest_memfd.
> 
> Thanks for the clarification Sean. I have a couple of followup
> questions/comments that you might be able to help with:
> 
> From a conceptual point of view, I understand that the in-place conversion is
> a property of guest_memfd. But that doesn't necessarily mean that the
> interface between kvm <-> guest_memfd is a userspace IOCTL.

kvm and guest_memfd aren't the communication endpoints for in-place conversions,
and more importantly, kvm isn't part of the control plane.  kvm's primary role
(for guest_memfd with in-place conversions) is to manage the page tables to map
memory into the guest.

kvm *may* also explicitly provide a communication channel between the guest and
host, e.g. when conversions are initiated via hypercalls, but in some cases the
communication channel may be created through pre-existing mechanisms, e.g. a
shared memory buffer or emulated I/O (such as the PCI reset case).

  guest => kvm (dumb pipe) => userspace => guest_memfd => kvm (invalidate)

And in other cases, kvm might not be in that part of the picture at all, e.g. if
the userspace VMM provides an interface to the VM owner (which could also be the
user running the VM) to reset the VM, then the flow would look like:

  userspace => guest_memfd => kvm (invalidate)
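
To make the first flow above concrete, a skeleton of the userspace side
(a sketch only: the hypercall exit is one example of the "dumb pipe",
and gmem_convert() is a stand-in for the guest_memfd conversion ioctl
proposed in this series, with hypothetical arguments):

#include <sys/ioctl.h>
#include <linux/kvm.h>

int gmem_convert(int gmem_fd, __u64 gpa, __u64 npages, __u64 attrs); /* stand-in */

static void run_vcpu(int vcpu_fd, struct kvm_run *run, int gmem_fd)
{
        for (;;) {
                if (ioctl(vcpu_fd, KVM_RUN, 0) < 0)
                        break;

                switch (run->exit_reason) {
                case KVM_EXIT_HYPERCALL:
                        /* Guest conversion request relayed by kvm; userspace
                         * decides whether to honor it via guest_memfd. */
                        gmem_convert(gmem_fd, run->hypercall.args[0],
                                     run->hypercall.args[1],
                                     run->hypercall.args[2]);
                        break;
                default:
                        /* MMIO, I/O, shutdown, ... */
                        break;
                }
        }
}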

A decent comparison is vCPUs.  KVM _could_ route all ioctls through the VM, but
that's unpleasant for all parties, as it'd be cumbersome for userspace, and
unnecessarily complex and messy for KVM.  Similarly, routing guest_memfd state
changes through KVM_SET_MEMORY_ATTRIBUTES is awkward from both design and mechanical
perspectives.

Even if we disagree on how ugly/pretty routing conversions through kvm would be,
which I'll allow is subjective, the bigger problem is that bouncing through
KVM_SET_MEMORY_ATTRIBUTES would create an unholy mess of an ABI.

Today, KVM_SET_MEMORY_ATTRIBUTES is handled entirely within kvm, and any changes
take effect irrespective of any memslot bindings.  And that didn't happen by
chance; preserving and enforcing attribute changes independently of memslots was
a key design requirement, precisely because memslots are ephemeral to a certain
extent.

Adding support for in-place guest_memfd conversion will require new ABI, and so
will be a "breaking" change for KVM_SET_MEMORY_ATTRIBUTES no matter what.  E.g.
KVM will need to reject KVM_MEMORY_ATTRIBUTE_PRIVATE for VMs that elect to use
in-place guest_memfd conversions.  But very critically, KVM can crisply enumerate
the lack of KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_CAP_MEMORY_ATTRIBUTES, the
behavior will be very straightforward to document (e.g. CAP X is mutually exclusive
with KVM_MEMORY_ATTRIBUTE_PRIVATE), and it will be opt-in, i.e. won't truly be a
breaking change.
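
For illustration, the enumeration could look like this from userspace
(KVM_CHECK_EXTENSION on KVM_CAP_MEMORY_ATTRIBUTES already returns the
supported attribute mask today; KVM_CAP_GMEM_CONVERSION is the CAP name
used in this series, and the mutual exclusion shown is the proposed
behavior, not something KVM does yet):

#include <assert.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static void check_attr_abi(int vm_fd)
{
        int attrs = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_MEMORY_ATTRIBUTES);

        /* A VM that opts into in-place guest_memfd conversion would not
         * advertise KVM_MEMORY_ATTRIBUTE_PRIVATE, and KVM_SET_MEMORY_ATTRIBUTES
         * with that flag would be rejected for such a VM. */
        if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_GMEM_CONVERSION) > 0)
                assert(!(attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE));
}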

If/when we move shareability to guest_memfd, routing state changes through
KVM_SET_MEMORY_ATTRIBUTES will gain a subtle dependency on userspace having to
create memslots in order for state changes to take effect.  That wrinkle would be
weird and annoying to document, e.g. "if CAP X is enabled, the ioctl ordering is
A => B => C, otherwise the ordering doesn't matter", and would create many more
conundrums:

  - If a memslot needs to exist in order for KVM_SET_MEMORY_ATTRIBUTES to take effect,
    what should happen if that memslot is deleted?
  - If a memslot isn't found, should KVM_SET_MEMORY_ATTRIBUTES fail and report
    an error, or silently do nothing?
  - If KVM_SET_MEMORY_ATTRIBUTES affects multiple memslots that are bound to
    multiple guest_memfd, how does KVM guarantee atomicity?  What happens if one
    guest_memfd conversion succeeds, but a later one fails?

> We already communicate directly between the two. Other, even less related
> subsystems within the kernel also interact without going through userspace.
> Why can't we do the same here? I'm not suggesting it not be owned by
> guest_memfd, but that we communicate directly.

I'm not concerned about kvm communicating with guest_memfd, as you note it's all
KVM.  As above, my concerns are all about KVM's ABI and who owns/controls what.

> From a performance point of view, I would expect the common case to be that
> when KVM gets an unshare request from the guest, it would be able to unmap
> those pages from the (cooperative) host userspace, and return back to the
> guest. In this scenario, the host userspace wouldn't even need to be
> involved.

Hard NAK, at least from an x86 perspective.  Userspace is the sole decision maker
with respect to whether memory is in the shared or private state, full stop.  The guest
can make *requests* to convert memory, but ultimately it's host userspace that
decides whether or not to honor the request.

We've litigated this exact issue multiple times.  All state changes must be
controlled by userspace, because userspace is the only entity that can gracefully
handle exceptions and edge cases, and is the only entity with (almost) full
knowledge of the system.  We can discuss this again if necessary, but I'd much
prefer to not rehash all of those conversations.

> Having a userspace IOCTL as part of this makes that trip unnecessarily longer
> for the common case.

I'm very skeptical that an exit to userspace is going to even be measurable in
terms of the cost to convert memory.  Conversion is going to require multiple
locks, modifications to multiple sets of page tables with all the associated TLB
maintenance, possibly cache maintenance, and probably a few other things I'm
forgetting.  The cost of a few user<=>kernel transitions is likely going to be a
drop in the bucket.

If I'm wrong, and there are flows where the user<=>kernel transitions are the
long pole, then we could certainly explore adding a way for userspace to opt
into a "fast path" conversion.  But it would need to be exactly that, an optional
fast path that can fall back to the "slow" userspace-driven conversion as needed.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 38/51] KVM: guest_memfd: Split allocator pages for guest_memfd use
  2025-05-14 23:42 ` [RFC PATCH v2 38/51] KVM: guest_memfd: Split allocator pages for guest_memfd use Ackerley Tng
@ 2025-05-22 22:19   ` Edgecombe, Rick P
  2025-06-05 17:15     ` Ackerley Tng
                       ` (4 more replies)
  2025-05-27  4:30   ` Yan Zhao
                     ` (2 subsequent siblings)
  3 siblings, 5 replies; 231+ messages in thread
From: Edgecombe, Rick P @ 2025-05-22 22:19 UTC (permalink / raw)
  To: ackerleytng@google.com, kvm@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, x86@kernel.org
  Cc: palmer@dabbelt.com, pvorel@suse.cz, catalin.marinas@arm.com,
	Miao, Jun, Shutemov, Kirill, pdurrant@amazon.co.uk,
	steven.price@arm.com, peterx@redhat.com, vbabka@suse.cz,
	jack@suse.cz, amoorthy@google.com, maz@kernel.org,
	keirf@google.com, vkuznets@redhat.com, quic_eberman@quicinc.com,
	mail@maciej.szmigiero.name, hughd@google.com,
	anthony.yznaga@oracle.com, Wang, Wei W, Du, Fan,
	Wieczor-Retman, Maciej, quic_svaddagi@quicinc.com, Hansen, Dave,
	ajones@ventanamicro.com, paul.walmsley@sifive.com,
	nsaenz@amazon.es, aik@amd.com, usama.arif@bytedance.com,
	quic_mnalajal@quicinc.com, fvdl@google.com, rppt@kernel.org,
	quic_cvanscha@quicinc.com, bfoster@redhat.com,
	willy@infradead.org, anup@brainfault.org, thomas.lendacky@amd.com,
	tabba@google.com, mic@digikod.net, oliver.upton@linux.dev,
	akpm@linux-foundation.org, Zhao, Yan Y, binbin.wu@linux.intel.com,
	muchun.song@linux.dev, Li, Zhiquan1, rientjes@google.com,
	mpe@ellerman.id.au, Aktas, Erdem, david@redhat.com, jgg@ziepe.ca,
	Annapurve, Vishal, Xu, Haibo1, jhubbard@nvidia.com,
	Yamahata, Isaku, jthoughton@google.com, will@kernel.org,
	steven.sistare@oracle.com, jarkko@kernel.org,
	quic_pheragu@quicinc.com, chenhuacai@kernel.org, Huang, Kai,
	shuah@kernel.org, dwmw@amazon.co.uk, pankaj.gupta@amd.com,
	Peng, Chao P, nikunj@amd.com, Graf, Alexander,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Xu, Yilun, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com,
	richard.weiyang@gmail.com, Weiny, Ira, aou@eecs.berkeley.edu,
	Li, Xiaoyao, qperret@google.com, kent.overstreet@linux.dev,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	pgonda@google.com, quic_pderrin@quicinc.com, hch@infradead.org,
	roypat@amazon.co.uk, seanjc@google.com

On Wed, 2025-05-14 at 16:42 -0700, Ackerley Tng wrote:
> +
> +static pgoff_t kvm_gmem_compute_invalidate_bound(struct inode *inode,
> +						 pgoff_t bound, bool start)
> +{
> +	size_t nr_pages;
> +	void *priv;
> +
> +	if (!kvm_gmem_has_custom_allocator(inode))

General comment - It's a bit unfortunate how kvm_gmem_has_custom_allocator() is
checked all over the place across this series. There are only two allocators
after this, right? So one is implemented with callbacks presumably designed to
fit other allocators, and one has special case logic in guest_memfd.c.

Did you consider designing struct guestmem_allocator_operations so that it could
encapsulate the special logic for both the existing and new allocators? If it
didn't work well, could we expect that a next allocator would actually fit
struct guestmem_allocator_operations?

> +		return bound;
> +
> +	priv = kvm_gmem_allocator_private(inode);
> +	nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
> +
> +	if (start)
> +		return round_down(bound, nr_pages);
> +	else
> +		return round_up(bound, nr_pages);
> +}
> +
> +static pgoff_t kvm_gmem_compute_invalidate_start(struct inode *inode,
> +						 pgoff_t bound)
> +{
> +	return kvm_gmem_compute_invalidate_bound(inode, bound, true);
> +}
> +
> +static pgoff_t kvm_gmem_compute_invalidate_end(struct inode *inode,
> +					       pgoff_t bound)
> +{
> +	return kvm_gmem_compute_invalidate_bound(inode, bound, false);
> +}
> +

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 33/51] KVM: guest_memfd: Allocate and truncate from custom allocator
  2025-05-14 23:42 ` [RFC PATCH v2 33/51] KVM: guest_memfd: Allocate and truncate from " Ackerley Tng
  2025-05-21 18:05   ` Vishal Annapurve
@ 2025-05-22 23:12   ` Edgecombe, Rick P
  2025-05-28 10:58   ` Yan Zhao
  2025-06-03  7:43   ` Binbin Wu
  3 siblings, 0 replies; 231+ messages in thread
From: Edgecombe, Rick P @ 2025-05-22 23:12 UTC (permalink / raw)
  To: ackerleytng@google.com, kvm@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, x86@kernel.org
  Cc: palmer@dabbelt.com, pvorel@suse.cz, catalin.marinas@arm.com,
	Miao, Jun, Shutemov, Kirill, pdurrant@amazon.co.uk,
	steven.price@arm.com, peterx@redhat.com, vbabka@suse.cz,
	jack@suse.cz, amoorthy@google.com, maz@kernel.org,
	keirf@google.com, vkuznets@redhat.com, quic_eberman@quicinc.com,
	mail@maciej.szmigiero.name, hughd@google.com,
	anthony.yznaga@oracle.com, Wang, Wei W, Du, Fan,
	Wieczor-Retman, Maciej, quic_svaddagi@quicinc.com, Hansen, Dave,
	ajones@ventanamicro.com, paul.walmsley@sifive.com,
	nsaenz@amazon.es, aik@amd.com, usama.arif@bytedance.com,
	quic_mnalajal@quicinc.com, fvdl@google.com, rppt@kernel.org,
	quic_cvanscha@quicinc.com, bfoster@redhat.com,
	willy@infradead.org, anup@brainfault.org, thomas.lendacky@amd.com,
	tabba@google.com, mic@digikod.net, oliver.upton@linux.dev,
	akpm@linux-foundation.org, Zhao, Yan Y, binbin.wu@linux.intel.com,
	muchun.song@linux.dev, Li, Zhiquan1, rientjes@google.com,
	mpe@ellerman.id.au, Aktas, Erdem, david@redhat.com, jgg@ziepe.ca,
	Annapurve, Vishal, Xu, Haibo1, jhubbard@nvidia.com,
	Yamahata, Isaku, jthoughton@google.com, will@kernel.org,
	steven.sistare@oracle.com, jarkko@kernel.org,
	quic_pheragu@quicinc.com, chenhuacai@kernel.org, Huang, Kai,
	shuah@kernel.org, dwmw@amazon.co.uk, pankaj.gupta@amd.com,
	Peng, Chao P, nikunj@amd.com, Graf, Alexander,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Xu, Yilun, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com,
	richard.weiyang@gmail.com, Weiny, Ira, aou@eecs.berkeley.edu,
	Li, Xiaoyao, qperret@google.com, kent.overstreet@linux.dev,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	pgonda@google.com, quic_pderrin@quicinc.com, hch@infradead.org,
	roypat@amazon.co.uk, seanjc@google.com

On Wed, 2025-05-14 at 16:42 -0700, Ackerley Tng wrote:
> If a custom allocator is requested at guest_memfd creation time, pages
> from the custom allocator will be used to back guest_memfd.
> 
> Change-Id: I59df960b3273790f42fe5bea54a234f40962eb75
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

I know it's an RFC, but for future maturity, these logs are pretty thin across
the series. Only one sentence for 143 lines is way too limited.

> ---
>  mm/memory.c            |   1 +
>  virt/kvm/guest_memfd.c | 142 +++++++++++++++++++++++++++++++++++++----
>  2 files changed, 132 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index ba3ea0a82f7f..3af45e96913c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -7249,6 +7249,7 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>  	else
>  		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>  }
> +EXPORT_SYMBOL_GPL(folio_zero_user);
>  
>  static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
>  				   unsigned long addr_hint,
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index c65d93c5a443..24d270b9b725 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -478,15 +478,13 @@ static inline void kvm_gmem_mark_prepared(struct folio *folio)
>   * leaking host data and the up-to-date flag is set.
>   */
>  static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
> -				  gfn_t gfn, struct folio *folio)
> +				  gfn_t gfn, struct folio *folio,
> +				  unsigned long addr_hint)
>  {
> -	unsigned long nr_pages, i;
>  	pgoff_t index;
>  	int r;
>  
> -	nr_pages = folio_nr_pages(folio);
> -	for (i = 0; i < nr_pages; i++)
> -		clear_highpage(folio_page(folio, i));
> +	folio_zero_user(folio, addr_hint);

This is unrelated cleanup.

>  
>  	/*
>  	 * Preparing huge folios should always be safe, since it should
> @@ -554,7 +552,9 @@ static int kvm_gmem_filemap_add_folio(struct address_space *mapping,
>   */
>  static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>  {
> +	size_t allocated_size;
>  	struct folio *folio;
> +	pgoff_t index_floor;
>  	int ret;
>  
>  repeat:
> @@ -581,8 +581,10 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>  			return ERR_PTR(ret);
>  		}
>  	}
> +	allocated_size = folio_size(folio);
>  
> -	ret = kvm_gmem_filemap_add_folio(inode->i_mapping, folio, index);
> +	index_floor = round_down(index, folio_nr_pages(folio));
> +	ret = kvm_gmem_filemap_add_folio(inode->i_mapping, folio, index_floor);
>  	if (ret) {
>  		folio_put(folio);
>  
> @@ -598,7 +600,17 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>  		return ERR_PTR(ret);
>  	}
>  
> -	__folio_set_locked(folio);
> +	spin_lock(&inode->i_lock);
> +	inode->i_blocks += allocated_size / 512;
> +	spin_unlock(&inode->i_lock);
> +
> +	/*
> +	 * folio is the one that is allocated, this gets the folio at the
> +	 * requested index.
> +	 */
> +	folio = page_folio(folio_file_page(folio, index));
> +	folio_lock(folio);
> +
>  	return folio;
>  }
>  
> @@ -736,6 +748,92 @@ static void kvm_gmem_truncate_inode_aligned_pages(struct inode *inode,
>  	spin_unlock(&inode->i_lock);
>  }
>  
> +/**
> + * kvm_gmem_zero_range() - Zeroes all sub-pages in range [@start, @end).
> + *
> + * @mapping: the filemap to remove this range from.
> + * @start: index in filemap for start of range (inclusive).
> + * @end: index in filemap for end of range (exclusive).
> + *
> + * The pages in range may be split. truncate_inode_pages_range() isn't the right
> + * function because it removes pages from the page cache; this function only
> + * zeroes the pages.
> + */
> +static void kvm_gmem_zero_range(struct address_space *mapping,
> +				pgoff_t start, pgoff_t end)
> +{
> +	struct folio_batch fbatch;
> +
> +	folio_batch_init(&fbatch);
> +	while (filemap_get_folios(mapping, &start, end - 1, &fbatch)) {
> +		unsigned int i;
> +
> +		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> +			struct folio *f;
> +			size_t nr_bytes;
> +
> +			f = fbatch.folios[i];
> +			nr_bytes = offset_in_folio(f, end << PAGE_SHIFT);
> +			if (nr_bytes == 0)
> +				nr_bytes = folio_size(f);
> +
> +			folio_zero_segment(f, 0, nr_bytes);
> +		}
> +
> +		folio_batch_release(&fbatch);
> +		cond_resched();
> +	}
> +}
> +
> +/**
> + * kvm_gmem_truncate_inode_range() - Truncate pages in range [@lstart, @lend).
> + *
> + * @inode: inode to truncate from.
> + * @lstart: offset in inode for start of range (inclusive).
> + * @lend: offset in inode for end of range (exclusive).
> + *
> + * Removes full (huge)pages from the filemap and zeroes incomplete
> + * (huge)pages. The pages in the range may be split.
> + */
> +static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
> +					  loff_t lend)
> +{
> +	pgoff_t full_hpage_start;
> +	size_t nr_per_huge_page;
> +	pgoff_t full_hpage_end;
> +	size_t nr_pages;
> +	pgoff_t start;
> +	pgoff_t end;
> +	void *priv;
> +
> +	priv = kvm_gmem_allocator_private(inode);
> +	nr_per_huge_page = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
> +
> +	start = lstart >> PAGE_SHIFT;
> +	end = min(lend, i_size_read(inode)) >> PAGE_SHIFT;
> +
> +	full_hpage_start = round_up(start, nr_per_huge_page);
> +	full_hpage_end = round_down(end, nr_per_huge_page);

I think it's supposed to zero the start at a byte granularity.
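I.e. presumably only the bytes in [lstart, lend) should be zeroed, not the
whole pages containing them.  Something like this for the head, maybe
(rough sketch, untested, plain filemap/folio helpers; `start` would then
have to round up to the next page boundary):

	if (lstart & ~PAGE_MASK) {
		struct folio *f = filemap_lock_folio(inode->i_mapping,
						     lstart >> PAGE_SHIFT);

		if (!IS_ERR(f)) {
			size_t from = offset_in_folio(f, lstart);
			size_t to = min_t(loff_t, lend,
					  folio_pos(f) + folio_size(f)) - folio_pos(f);

			/* Zero only from lstart to the end of the folio (or lend). */
			folio_zero_segment(f, from, to);
			folio_unlock(f);
			folio_put(f);
		}
	}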

> +
> +	if (start < full_hpage_start) {
> +		pgoff_t zero_end = min(full_hpage_start, end);
> +
> +		kvm_gmem_zero_range(inode->i_mapping, start, zero_end);
> +	}
> +
> +	if (full_hpage_end > full_hpage_start) {
> +		nr_pages = full_hpage_end - full_hpage_start;
> +		kvm_gmem_truncate_inode_aligned_pages(inode, full_hpage_start,
> +						      nr_pages);
> +	}
> +
> +	if (end > full_hpage_end && end > full_hpage_start) {
> +		pgoff_t zero_start = max(full_hpage_end, start);

This is weird. Could it just round up `end`, then check it and use it instead?

> +
> +		kvm_gmem_zero_range(inode->i_mapping, zero_start, end);
> +	}
> +}
> +
>  static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>  {
>  	struct list_head *gmem_list = &inode->i_mapping->i_private_list;
> @@ -752,7 +850,12 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>  	list_for_each_entry(gmem, gmem_list, entry)
>  		kvm_gmem_invalidate_begin(gmem, start, end);
>  
> -	truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
> +	if (kvm_gmem_has_custom_allocator(inode)) {
> +		kvm_gmem_truncate_inode_range(inode, offset, offset + len);
> +	} else {
> +		/* Page size is PAGE_SIZE, so use optimized truncation function. */
> +		truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
> +	}
>  
>  	list_for_each_entry(gmem, gmem_list, entry)
>  		kvm_gmem_invalidate_end(gmem, start, end);
> @@ -776,6 +879,16 @@ static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
>  
>  	start = offset >> PAGE_SHIFT;
>  	end = (offset + len) >> PAGE_SHIFT;
> +	if (kvm_gmem_has_custom_allocator(inode)) {
> +		size_t nr_pages;
> +		void *p;
> +
> +		p = kvm_gmem_allocator_private(inode);
> +		nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(p);
> +
> +		start = round_down(start, nr_pages);
> +		end = round_down(end, nr_pages);
> +	}
>  
>  	r = 0;
>  	for (index = start; index < end; ) {
> @@ -1570,7 +1683,7 @@ static struct folio *__kvm_gmem_get_pfn(struct file *file,
>  
>  	*pfn = folio_file_pfn(folio, index);
>  	if (max_order)
> -		*max_order = 0;
> +		*max_order = folio_order(folio);

You might be able to have a separate patch that makes existing code work with
larger folio sizes. Then add in the custom allocator/truncator bits in another
one.

>  
>  	*is_prepared = folio_test_uptodate(folio);
>  	return folio;
> @@ -1597,8 +1710,15 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>  		goto out;
>  	}
>  
> -	if (!is_prepared)
> -		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> +	if (!is_prepared) {
> +		/*
> +		 * Use the same address as hugetlb for zeroing private pages
> +		 * that won't be mapped to userspace anyway.
> +		 */
> +		unsigned long addr_hint = folio->index << PAGE_SHIFT;

This could use some more explanation.

> +
> +		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio, addr_hint);
> +	}
>  
>  	folio_unlock(folio);
>  


^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-22 16:26                                 ` Sean Christopherson
@ 2025-05-23 10:12                                   ` Fuad Tabba
  0 siblings, 0 replies; 231+ messages in thread
From: Fuad Tabba @ 2025-05-23 10:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vishal Annapurve, Ackerley Tng, kvm, linux-mm, linux-kernel, x86,
	linux-fsdevel, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

Hi Sean,


On Thu, 22 May 2025 at 17:26, Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, May 22, 2025, Fuad Tabba wrote:
> > On Thu, 22 May 2025 at 15:52, Sean Christopherson <seanjc@google.com> wrote:
> > > On Wed, May 21, 2025, Fuad Tabba wrote:
> > > > How does the host userspace find that out? If the host userspace is capable
> > > > of finding that out, then surely KVM is also capable of finding out the same.
> > >
> > > Nope, not on x86.  Well, not without userspace invoking a new ioctl, which would
> > > defeat the purpose of adding these ioctls.
> > >
> > > KVM is only responsible for emulating/virtualizing the "CPU".  The chipset, e.g.
> > > the PCI config space, is fully owned by userspace.  KVM doesn't even know whether
> > > or not PCI exists for the VM.  And reboot may be emulated by simply creating a
> > > new KVM instance, i.e. even if KVM was somehow aware of the reboot request, the
> > > change in state would happen in an entirely new struct kvm.
> > >
> > > That said, Vishal and Ackerley, this patch is a bit lacking on the documentation
> > > front.  The changelog asserts that:
> > >
> > >   A guest_memfd ioctl is used because shareability is a property of the memory,
> > >   and this property should be modifiable independently of the attached struct kvm
> > >
> > > but then follows with a very weak and IMO largely irrelevant justification of:
> > >
> > >   This allows shareability to be modified even if the memory is not yet bound
> > >   using memslots.
> > >
> > > Allowing userspace to change shareability without memslots is one relatively minor
> > > flow in one very specific use case.
> > >
> > > The real justification for these ioctls is that fundamentally, shareability for
> > > in-place conversions is a property of a guest_memfd instance and not a struct kvm
> > > instance, and so needs to owned by guest_memfd.
> >
> > Thanks for the clarification Sean. I have a couple of followup
> > questions/comments that you might be able to help with:
> >
> > From a conceptual point of view, I understand that the in-place conversion is
> > a property of guest_memfd. But that doesn't necessarily mean that the
> > interface between kvm <-> guest_memfd is a userspace IOCTL.
>
> kvm and guest_memfd aren't the communication endpoints for in-place conversions,
> and more importantly, kvm isn't part of the control plane.  kvm's primary role
> (for guest_memfd with in-place conversions) is to manage the page tables to map
> memory into the guest.
>
> kvm *may* also explicitly provide a communication channel between the guest and
> host, e.g. when conversions are initiated via hypercalls, but in some cases the
> communication channel may be created through pre-existing mechanisms, e.g. a
> shared memory buffer or emulated I/O (such as the PCI reset case).
>
>   guest => kvm (dumb pipe) => userspace => guest_memfd => kvm (invalidate)
>
> And in other cases, kvm might not be in that part of the picture at all, e.g. if
> the userspace VMM provides an interface to the VM owner (which could also be the
> user running the VM) to reset the VM, then the flow would look like:
>
>   userspace => guest_memfd => kvm (invalidate)
>
> A decent comparison is vCPUs.  KVM _could_ route all ioctls through the VM, but
> that's unpleasant for all parties, as it'd be cumbersome for userspace, and
> unnecessarily complex and messy for KVM.  Similarly, routing guest_memfd state
> changes through KVM_SET_MEMORY_ATTRIBUTES is awkward from both design and mechanical
> perspectives.
>
> Even if we disagree on how ugly/pretty routing conversions through kvm would be,
> which I'll allow is subjective, the bigger problem is that bouncing through
> KVM_SET_MEMORY_ATTRIBUTES would create an unholy mess of an ABI.
>
> Today, KVM_SET_MEMORY_ATTRIBUTES is handled entirely within kvm, and any changes
> take effect irrespective of any memslot bindings.  And that didn't happen by
> chance; preserving and enforcing attribute changes independently of memslots was
> a key design requirement, precisely because memslots are ephemeral to a certain
> extent.
>
> Adding support for in-place guest_memfd conversion will require new ABI, and so
> will be a "breaking" change for KVM_SET_MEMORY_ATTRIBUTES no matter what.  E.g.
> KVM will need to reject KVM_MEMORY_ATTRIBUTE_PRIVATE for VMs that elect to use
> in-place guest_memfd conversions.  But very critically, KVM can crisply enumerate
> the lack of KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_CAP_MEMORY_ATTRIBUTES, the
> behavior will be very straightforward to document (e.g. CAP X is mutually exclusive
> with KVM_MEMORY_ATTRIBUTE_PRIVATE), and it will be opt-in, i.e. won't truly be a
> breaking change.
>
> If/when we move shareability to guest_memfd, routing state changes through
> KVM_SET_MEMORY_ATTRIBUTES will gain a subtle dependency on userspace having to
> create memslots in order for state changes to take effect.  That wrinkle would be
> weird and annoying to document, e.g. "if CAP X is enabled, the ioctl ordering is
> A => B => C, otherwise the ordering doesn't matter", and would create many more
> conundrums:
>
>   - If a memslot needs to exist in order for KVM_SET_MEMORY_ATTRIBUTES to take effect,
>     what should happen if that memslot is deleted?
>   - If a memslot isn't found, should KVM_SET_MEMORY_ATTRIBUTES fail and report
>     an error, or silently do nothing?
>   - If KVM_SET_MEMORY_ATTRIBUTES affects multiple memslots that are bound to
>     multiple guest_memfd, how does KVM guarantee atomicity?  What happens if one
>     guest_memfd conversion succeeds, but a later one fails?
>
> > We already communicate directly between the two. Other, even less related
> > subsystems within the kernel also interact without going through userspace.
> > Why can't we do the same here? I'm not suggesting it not be owned by
> > guest_memfd, but that we communicate directly.
>
> I'm not concerned about kvm communicating with guest_memfd, as you note it's all
> KVM.  As above, my concerns are all about KVM's ABI and who owns/controls what.
>
> > From a performance point of view, I would expect the common case to be that
> > when KVM gets an unshare request from the guest, it would be able to unmap
> > those pages from the (cooperative) host userspace, and return back to the
> > guest. In this scenario, the host userspace wouldn't even need to be
> > involved.
>
> Hard NAK, at least from an x86 perspective.  Userspace is the sole decision maker
> with respect to what memory is shared vs. private, full stop.  The guest
> can make *requests* to convert memory, but ultimately it's host userspace that
> decides whether or not to honor the request.
>
> We've litigated this exact issue multiple times.  All state changes must be
> controlled by userspace, because userspace is the only entity that can gracefully
> handle exceptions and edge cases, and is the only entity with (almost) full
> knowledge of the system.  We can discuss this again if necessary, but I'd much
> prefer to not rehash all of those conversations.
>
> > Having a userspace IOCTL as part of this makes that trip unnecessarily longer
> > for the common case.
>
> I'm very skeptical that an exit to userspace is going to even be measurable in
> terms of the cost to convert memory.  Conversion is going to require multiple
> locks, modifications to multiple sets of page tables with all the associated TLB
> maintenance, possibly cache maintenance, and probably a few other things I'm
> forgetting.  The cost of a few user<=>kernel transitions is likely going to be a
> drop in the bucket.
>
> If I'm wrong, and there are flows where the user<=>kernel transitions are the
> long pole, then we could certainly explore adding a way for userspace to opt
> into a "fast path" conversion.  But it would need to be exactly that, an optional
> fast path that can fall back to the "slow" userspace-driven conversion as needed.

Thanks for this very thorough explanation. I know that we have
litigated this issue, but not this _exact_ issue. My understanding was
that the main reason for using IOCTLs for memory attributes is that
userspace needs to manage private and shared memory separately,
including allocation and punching holes where necessary.

That said, no need to discuss this again. If it turns out that
user<->kernel transitions are a bottleneck we could look into an
opt-in fast path as you said.

Cheers,
/fuad

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 32/51] KVM: guest_memfd: Support guestmem_hugetlb as custom allocator
  2025-05-14 23:42 ` [RFC PATCH v2 32/51] KVM: guest_memfd: Support guestmem_hugetlb as custom allocator Ackerley Tng
@ 2025-05-23 10:47   ` Yan Zhao
  2025-08-12  9:13   ` Tony Lindgren
  1 sibling, 0 replies; 231+ messages in thread
From: Yan Zhao @ 2025-05-23 10:47 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yilun.xu, yuzenghui,
	zhiquan1.li

On Wed, May 14, 2025 at 04:42:11PM -0700, Ackerley Tng wrote:
>  enum shareability {
> @@ -40,6 +47,44 @@ static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
>  	return inode->i_mapping->i_private_data;
>  }
>  
> +#ifdef CONFIG_KVM_GMEM_HUGETLB
> +
> +static const struct guestmem_allocator_operations *
> +kvm_gmem_allocator_ops(struct inode *inode)
> +{
> +	return kvm_gmem_private(inode)->allocator_ops;
> +}
> +
> +static void *kvm_gmem_allocator_private(struct inode *inode)
> +{
> +	return kvm_gmem_private(inode)->allocator_private;
> +}
> +
> +static bool kvm_gmem_has_custom_allocator(struct inode *inode)
> +{

+       if (!kvm_gmem_private(inode))
+               return false;

> +	return kvm_gmem_allocator_ops(inode) != NULL;
> +}
> +
...

> +static void kvm_gmem_evict_inode(struct inode *inode)
> +{
> +	truncate_inode_pages_final_prepare(inode->i_mapping);
> +
> +	if (kvm_gmem_has_custom_allocator(inode)) {

The i_private_data of the root inode in pseudo fs is NULL.
Without the two lines added above, evicting the root inode during unmount will
cause a NULL pointer dereference.

> +		size_t nr_pages = inode->i_size >> PAGE_SHIFT;
> +
> +		kvm_gmem_truncate_inode_aligned_pages(inode, 0, nr_pages);
> +	} else {
> +		truncate_inode_pages(inode->i_mapping, 0);
> +	}
> +
> +	clear_inode(inode);
> +}
> +
>  static const struct super_operations kvm_gmem_super_operations = {
>  	.statfs		= simple_statfs,
> +	.evict_inode	= kvm_gmem_evict_inode,
>  	.destroy_inode	= kvm_gmem_destroy_inode,
>  	.free_inode	= kvm_gmem_free_inode,
>  };
 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-05-14 23:41 ` [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting Ackerley Tng
@ 2025-05-27  3:54   ` Yan Zhao
  2025-05-29 18:20     ` Ackerley Tng
  2025-05-30  8:53     ` Fuad Tabba
  2025-05-27  8:25   ` Binbin Wu
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 231+ messages in thread
From: Yan Zhao @ 2025-05-27  3:54 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yilun.xu, yuzenghui,
	zhiquan1.li

On Wed, May 14, 2025 at 04:41:41PM -0700, Ackerley Tng wrote:
> Track guest_memfd memory's shareability status within the inode as
> opposed to the file, since it is property of the guest_memfd's memory
> contents.
> 
> Shareability is a property of the memory and is indexed using the
> page's index in the inode. Because shareability is the memory's
> property, it is stored within guest_memfd instead of within KVM, like
> in kvm->mem_attr_array.
> 
> KVM_MEMORY_ATTRIBUTE_PRIVATE in kvm->mem_attr_array must still be
> retained to allow VMs to only use guest_memfd for private memory and
> some other memory for shared memory.
> 
> Not all use cases require guest_memfd() to be shared with the host
> when first created. Add a new flag, GUEST_MEMFD_FLAG_INIT_PRIVATE,
> which when set on KVM_CREATE_GUEST_MEMFD, initializes the memory as
> private to the guest, and therefore not mappable by the
> host. Otherwise, memory is shared until explicitly converted to
> private.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> Co-developed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> Change-Id: If03609cbab3ad1564685c85bdba6dcbb6b240c0f
> ---
>  Documentation/virt/kvm/api.rst |   5 ++
>  include/uapi/linux/kvm.h       |   2 +
>  virt/kvm/guest_memfd.c         | 124 ++++++++++++++++++++++++++++++++-
>  3 files changed, 129 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 86f74ce7f12a..f609337ae1c2 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6408,6 +6408,11 @@ belonging to the slot via its userspace_addr.
>  The use of GUEST_MEMFD_FLAG_SUPPORT_SHARED will not be allowed for CoCo VMs.
>  This is validated when the guest_memfd instance is bound to the VM.
>  
> +If the capability KVM_CAP_GMEM_CONVERSIONS is supported, then the 'flags' field
> +supports GUEST_MEMFD_FLAG_INIT_PRIVATE.  Setting GUEST_MEMFD_FLAG_INIT_PRIVATE
> +will initialize the memory for the guest_memfd as guest-only and not faultable
> +by the host.
> +
>  See KVM_SET_USER_MEMORY_REGION2 for additional details.
>  
>  4.143 KVM_PRE_FAULT_MEMORY
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 4cc824a3a7c9..d7df312479aa 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1567,7 +1567,9 @@ struct kvm_memory_attributes {
>  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
>  
>  #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
> +
>  #define GUEST_MEMFD_FLAG_SUPPORT_SHARED	(1UL << 0)
> +#define GUEST_MEMFD_FLAG_INIT_PRIVATE	(1UL << 1)
>  
>  struct kvm_create_guest_memfd {
>  	__u64 size;
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 239d0f13dcc1..590932499eba 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -4,6 +4,7 @@
>  #include <linux/falloc.h>
>  #include <linux/fs.h>
>  #include <linux/kvm_host.h>
> +#include <linux/maple_tree.h>
>  #include <linux/pseudo_fs.h>
>  #include <linux/pagemap.h>
>  
> @@ -17,6 +18,24 @@ struct kvm_gmem {
>  	struct list_head entry;
>  };
>  
> +struct kvm_gmem_inode_private {
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +	struct maple_tree shareability;
> +#endif
> +};
> +
> +enum shareability {
> +	SHAREABILITY_GUEST = 1,	/* Only the guest can map (fault) folios in this range. */
> +	SHAREABILITY_ALL = 2,	/* Both guest and host can fault folios in this range. */
> +};
> +
> +static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index);
> +
> +static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
> +{
> +	return inode->i_mapping->i_private_data;
> +}
> +
>  /**
>   * folio_file_pfn - like folio_file_page, but return a pfn.
>   * @folio: The folio which contains this index.
> @@ -29,6 +48,58 @@ static inline kvm_pfn_t folio_file_pfn(struct folio *folio, pgoff_t index)
>  	return folio_pfn(folio) + (index & (folio_nr_pages(folio) - 1));
>  }
>  
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +
> +static int kvm_gmem_shareability_setup(struct kvm_gmem_inode_private *private,
> +				      loff_t size, u64 flags)
> +{
> +	enum shareability m;
> +	pgoff_t last;
> +
> +	last = (size >> PAGE_SHIFT) - 1;
> +	m = flags & GUEST_MEMFD_FLAG_INIT_PRIVATE ? SHAREABILITY_GUEST :
> +						    SHAREABILITY_ALL;
> +	return mtree_store_range(&private->shareability, 0, last, xa_mk_value(m),
> +				 GFP_KERNEL);
> +}
> +
> +static enum shareability kvm_gmem_shareability_get(struct inode *inode,
> +						 pgoff_t index)
> +{
> +	struct maple_tree *mt;
> +	void *entry;
> +
> +	mt = &kvm_gmem_private(inode)->shareability;
> +	entry = mtree_load(mt, index);
> +	WARN(!entry,
> +	     "Shareability should always be defined for all indices in inode.");
I noticed that in [1], the kvm_gmem_mmap() does not check the range.
So, the WARN() here can be hit when userspace mmap()s an area larger than the
inode size and accesses the resulting out-of-range HVA.

Maybe limit the mmap() range?

@@ -1609,6 +1620,10 @@ static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
        if (!kvm_gmem_supports_shared(file_inode(file)))
                return -ENODEV;

+       if (vma->vm_end - vma->vm_start + (vma->vm_pgoff << PAGE_SHIFT) > i_size_read(file_inode(file)))
+               return -EINVAL;
+
        if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) !=
            (VM_SHARED | VM_MAYSHARE)) {
                return -EINVAL;

[1] https://lore.kernel.org/all/20250513163438.3942405-8-tabba@google.com/

> +	return xa_to_value(entry);
> +}
> +
> +static struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
> +{
> +	if (kvm_gmem_shareability_get(inode, index) != SHAREABILITY_ALL)
> +		return ERR_PTR(-EACCES);
> +
> +	return kvm_gmem_get_folio(inode, index);
> +}
> +
> +#else
> +
> +static int kvm_gmem_shareability_setup(struct maple_tree *mt, loff_t size, u64 flags)
> +{
> +	return 0;
> +}
> +
> +static inline struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
> +{
> +	WARN_ONCE(1, "Unexpected call to get shared folio.");
> +	return NULL;
> +}
> +
> +#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> +
>  static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
>  				    pgoff_t index, struct folio *folio)
>  {
> @@ -333,7 +404,7 @@ static vm_fault_t kvm_gmem_fault_shared(struct vm_fault *vmf)
>  
>  	filemap_invalidate_lock_shared(inode->i_mapping);
>  
> -	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> +	folio = kvm_gmem_get_shared_folio(inode, vmf->pgoff);
>  	if (IS_ERR(folio)) {
>  		int err = PTR_ERR(folio);
>  
> @@ -420,8 +491,33 @@ static struct file_operations kvm_gmem_fops = {
>  	.fallocate	= kvm_gmem_fallocate,
>  };
>  
> +static void kvm_gmem_free_inode(struct inode *inode)
> +{
> +	struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
> +
> +	kfree(private);
> +
> +	free_inode_nonrcu(inode);
> +}
> +
> +static void kvm_gmem_destroy_inode(struct inode *inode)
> +{
> +	struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
> +
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +	/*
> +	 * mtree_destroy() can't be used within rcu callback, hence can't be
> +	 * done in ->free_inode().
> +	 */
> +	if (private)
> +		mtree_destroy(&private->shareability);
> +#endif
> +}
> +
>  static const struct super_operations kvm_gmem_super_operations = {
>  	.statfs		= simple_statfs,
> +	.destroy_inode	= kvm_gmem_destroy_inode,
> +	.free_inode	= kvm_gmem_free_inode,
>  };
>  
>  static int kvm_gmem_init_fs_context(struct fs_context *fc)
> @@ -549,12 +645,26 @@ static const struct inode_operations kvm_gmem_iops = {
>  static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>  						      loff_t size, u64 flags)
>  {
> +	struct kvm_gmem_inode_private *private;
>  	struct inode *inode;
> +	int err;
>  
>  	inode = alloc_anon_secure_inode(kvm_gmem_mnt->mnt_sb, name);
>  	if (IS_ERR(inode))
>  		return inode;
>  
> +	err = -ENOMEM;
> +	private = kzalloc(sizeof(*private), GFP_KERNEL);
> +	if (!private)
> +		goto out;
> +
> +	mt_init(&private->shareability);
Wrap the mt_init() inside "#ifdef CONFIG_KVM_GMEM_SHARED_MEM" ?
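i.e. something like (sketch only):

#ifdef CONFIG_KVM_GMEM_SHARED_MEM
	mt_init(&private->shareability);
#endif

since the shareability field itself is only defined under that config.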

> +	inode->i_mapping->i_private_data = private;
> +
> +	err = kvm_gmem_shareability_setup(private, size, flags);
> +	if (err)
> +		goto out;
> +
>  	inode->i_private = (void *)(unsigned long)flags;
>  	inode->i_op = &kvm_gmem_iops;
>  	inode->i_mapping->a_ops = &kvm_gmem_aops;
> @@ -566,6 +676,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>  	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>  
>  	return inode;
> +
> +out:
> +	iput(inode);
> +
> +	return ERR_PTR(err);
>  }
>  
>  static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
> @@ -654,6 +769,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>  	if (kvm_arch_vm_supports_gmem_shared_mem(kvm))
>  		valid_flags |= GUEST_MEMFD_FLAG_SUPPORT_SHARED;
>  
> +	if (flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED)
> +		valid_flags |= GUEST_MEMFD_FLAG_INIT_PRIVATE;
> +
>  	if (flags & ~valid_flags)
>  		return -EINVAL;
>  
> @@ -842,6 +960,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>  	if (!file)
>  		return -EFAULT;
>  
> +	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> +
>  	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
>  	if (IS_ERR(folio)) {
>  		r = PTR_ERR(folio);
> @@ -857,8 +977,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>  		*page = folio_file_page(folio, index);
>  	else
>  		folio_put(folio);
> -
>  out:
> +	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
>  	fput(file);
>  	return r;
>  }
> -- 
> 2.49.0.1045.g170613ef41-goog
> 
> 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 06/51] KVM: Query guest_memfd for private/shared status
  2025-05-14 23:41 ` [RFC PATCH v2 06/51] KVM: Query guest_memfd for private/shared status Ackerley Tng
@ 2025-05-27  3:55   ` Yan Zhao
  2025-05-28  8:08     ` Binbin Wu
  0 siblings, 1 reply; 231+ messages in thread
From: Yan Zhao @ 2025-05-27  3:55 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yilun.xu, yuzenghui,
	zhiquan1.li

On Wed, May 14, 2025 at 04:41:45PM -0700, Ackerley Tng wrote:
> Query guest_memfd for private/shared status if those guest_memfds
> track private/shared status.
> 
> With this patch, Coco VMs can use guest_memfd for both shared and
> private memory. If Coco VMs choose to use guest_memfd for both
> shared and private memory, by creating guest_memfd with the
> GUEST_MEMFD_FLAG_SUPPORT_SHARED flag, guest_memfd will be used to
> provide the private/shared status of the memory, instead of
> kvm->mem_attr_array.
> 
> Change-Id: I8f23d7995c12242aa4e09ccf5ec19360e9c9ed83
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  include/linux/kvm_host.h | 19 ++++++++++++-------
>  virt/kvm/guest_memfd.c   | 22 ++++++++++++++++++++++
>  2 files changed, 34 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index b317392453a5..91279e05e010 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2508,12 +2508,22 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
>  }
>  
>  #ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +
>  bool kvm_gmem_memslot_supports_shared(const struct kvm_memory_slot *slot);
> +bool kvm_gmem_is_private(struct kvm_memory_slot *slot, gfn_t gfn);
> +
>  #else
> +
>  static inline bool kvm_gmem_memslot_supports_shared(const struct kvm_memory_slot *slot)
>  {
>  	return false;
>  }
> +
> +static inline bool kvm_gmem_is_private(struct kvm_memory_slot *slot, gfn_t gfn)
> +{
> +	return false;
> +}
> +
>  #endif /* CONFIG_KVM_GMEM_SHARED_MEM */
>  
>  #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> @@ -2544,13 +2554,8 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>  		return false;
>  
>  	slot = gfn_to_memslot(kvm, gfn);
> -	if (kvm_slot_has_gmem(slot) && kvm_gmem_memslot_supports_shared(slot)) {
> -		/*
> -		 * For now, memslots only support in-place shared memory if the
> -		 * host is allowed to mmap memory (i.e., non-Coco VMs).
> -		 */
> -		return false;
> -	}
> +	if (kvm_slot_has_gmem(slot) && kvm_gmem_memslot_supports_shared(slot))
> +		return kvm_gmem_is_private(slot, gfn);
When userspace gets an exit with reason KVM_EXIT_MEMORY_FAULT, it looks like it needs to
update both KVM memory attribute and gmem shareability, via two separate ioctls?
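Roughly, as I read it (sketch only; the conversion ioctl's argument struct
below is a placeholder, not the actual ABI from this series):

	struct kvm_memory_attributes attrs = {
		.address    = gpa,
		.size       = size,
		.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
	};

	ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);

	/* ...plus, separately, the guest_memfd-level conversion: */
	ioctl(gmem_fd, KVM_GMEM_CONVERT_PRIVATE, &convert_args);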


>  	return kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;

>  }
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 6f6c4d298f8f..853e989bdcb2 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -865,6 +865,28 @@ bool kvm_gmem_memslot_supports_shared(const struct kvm_memory_slot *slot)
>  }
>  EXPORT_SYMBOL_GPL(kvm_gmem_memslot_supports_shared);
>  
> +bool kvm_gmem_is_private(struct kvm_memory_slot *slot, gfn_t gfn)
> +{
> +	struct inode *inode;
> +	struct file *file;
> +	pgoff_t index;
> +	bool ret;
> +
> +	file = kvm_gmem_get_file(slot);
> +	if (!file)
> +		return false;
> +
> +	index = kvm_gmem_get_index(slot, gfn);
> +	inode = file_inode(file);
> +
> +	filemap_invalidate_lock_shared(inode->i_mapping);
> +	ret = kvm_gmem_shareability_get(inode, index) == SHAREABILITY_GUEST;
> +	filemap_invalidate_unlock_shared(inode->i_mapping);
> +
> +	fput(file);
> +	return ret;
> +}
> +
>  #else
>  #define kvm_gmem_mmap NULL
>  #endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> -- 
> 2.49.0.1045.g170613ef41-goog
> 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 38/51] KVM: guest_memfd: Split allocator pages for guest_memfd use
  2025-05-14 23:42 ` [RFC PATCH v2 38/51] KVM: guest_memfd: Split allocator pages for guest_memfd use Ackerley Tng
  2025-05-22 22:19   ` Edgecombe, Rick P
@ 2025-05-27  4:30   ` Yan Zhao
  2025-05-27  4:38     ` Yan Zhao
  2025-06-05 17:50     ` Ackerley Tng
  2025-05-27  8:45   ` Yan Zhao
  2025-06-05  5:24   ` Binbin Wu
  3 siblings, 2 replies; 231+ messages in thread
From: Yan Zhao @ 2025-05-27  4:30 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yilun.xu, yuzenghui,
	zhiquan1.li

On Wed, May 14, 2025 at 04:42:17PM -0700, Ackerley Tng wrote:
> In this patch, newly allocated pages are split to 4K regular pages
> before providing them to the requester (fallocate() or KVM).
> 
> During a private to shared conversion, folios are split if not already
> split.
> 
> During a shared to private conversion, folios are merged if not
> already merged.
> 
> When the folios are removed from the filemap on truncation, the
> allocator is given a chance to do any necessary prep for when the
> folio is freed.
> 
> When a conversion is requested on a subfolio within a hugepage range,
> faulting must be prevented on the whole hugepage range for
> correctness.
> 
> See related discussion at
> https://lore.kernel.org/all/Z__AAB_EFxGFEjDR@google.com/T/
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> Change-Id: Ib5ee22e3dae034c529773048a626ad98d4b10af3
> ---
>  mm/filemap.c           |   2 +
>  virt/kvm/guest_memfd.c | 501 +++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 483 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a02c3d8e00e8..a052f8e0c41e 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -223,6 +223,7 @@ void __filemap_remove_folio(struct folio *folio, void *shadow)
>  	filemap_unaccount_folio(mapping, folio);
>  	page_cache_delete(mapping, folio, shadow);
>  }
> +EXPORT_SYMBOL_GPL(__filemap_remove_folio);
>  
>  void filemap_free_folio(struct address_space *mapping, struct folio *folio)
>  {
> @@ -258,6 +259,7 @@ void filemap_remove_folio(struct folio *folio)
>  
>  	filemap_free_folio(mapping, folio);
>  }
> +EXPORT_SYMBOL_GPL(filemap_remove_folio);
>  
>  /*
>   * page_cache_delete_batch - delete several folios from page cache
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index c578d0ebe314..cb426c1dfef8 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -41,6 +41,11 @@ static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
>  				      pgoff_t end);
>  static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
>  				    pgoff_t end);
> +static int __kvm_gmem_filemap_add_folio(struct address_space *mapping,
> +					struct folio *folio, pgoff_t index);
> +static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
> +						pgoff_t start, size_t nr_pages,
> +						bool is_split_operation);
>  
>  static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
>  {
> @@ -126,6 +131,31 @@ static enum shareability kvm_gmem_shareability_get(struct inode *inode,
>  	return xa_to_value(entry);
>  }
>  
> +static bool kvm_gmem_shareability_in_range(struct inode *inode, pgoff_t start,
> +					    size_t nr_pages, enum shareability m)
> +{
> +	struct maple_tree *mt;
> +	pgoff_t last;
> +	void *entry;
> +
> +	mt = &kvm_gmem_private(inode)->shareability;
> +
> +	last = start + nr_pages - 1;
> +	mt_for_each(mt, entry, start, last) {
> +		if (xa_to_value(entry) == m)
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +static inline bool kvm_gmem_has_some_shared(struct inode *inode, pgoff_t start,
> +					    size_t nr_pages)
> +{
> +	return kvm_gmem_shareability_in_range(inode, start, nr_pages,
> +					     SHAREABILITY_ALL);
> +}
> +
>  static struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
>  {
>  	if (kvm_gmem_shareability_get(inode, index) != SHAREABILITY_ALL)
> @@ -241,6 +271,105 @@ static bool kvm_gmem_has_safe_refcount(struct address_space *mapping, pgoff_t st
>  	return refcount_safe;
>  }
>  
> +static void kvm_gmem_unmap_private(struct kvm_gmem *gmem, pgoff_t start,
> +				   pgoff_t end)
> +{
> +	struct kvm_memory_slot *slot;
> +	struct kvm *kvm = gmem->kvm;
> +	unsigned long index;
> +	bool locked = false;
> +	bool flush = false;
> +
> +	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
> +		pgoff_t pgoff = slot->gmem.pgoff;
> +
> +		struct kvm_gfn_range gfn_range = {
> +			.start = slot->base_gfn + max(pgoff, start) - pgoff,
> +			.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
> +			.slot = slot,
> +			.may_block = true,
> +			/* This function is only concerned with private mappings. */
> +			.attr_filter = KVM_FILTER_PRIVATE,
> +		};
> +
> +		if (!locked) {
> +			KVM_MMU_LOCK(kvm);
> +			locked = true;
> +		}
> +
> +		flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
> +	}
> +
> +	if (flush)
> +		kvm_flush_remote_tlbs(kvm);
> +
> +	if (locked)
> +		KVM_MMU_UNLOCK(kvm);
> +}
> +
> +static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
> +				      pgoff_t end)
> +{
> +	struct kvm_memory_slot *slot;
> +	struct kvm *kvm = gmem->kvm;
> +	unsigned long index;
> +	bool found_memslot;
> +
> +	found_memslot = false;
> +	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
> +		gfn_t gfn_start;
> +		gfn_t gfn_end;
> +		pgoff_t pgoff;
> +
> +		pgoff = slot->gmem.pgoff;
> +
> +		gfn_start = slot->base_gfn + max(pgoff, start) - pgoff;
> +		gfn_end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff;
> +		if (!found_memslot) {
> +			found_memslot = true;
> +
> +			KVM_MMU_LOCK(kvm);
> +			kvm_mmu_invalidate_begin(kvm);
> +		}
> +
> +		kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
> +	}
> +
> +	if (found_memslot)
> +		KVM_MMU_UNLOCK(kvm);
> +}
> +
> +static pgoff_t kvm_gmem_compute_invalidate_bound(struct inode *inode,
> +						 pgoff_t bound, bool start)
> +{
> +	size_t nr_pages;
> +	void *priv;
> +
> +	if (!kvm_gmem_has_custom_allocator(inode))
> +		return bound;
> +
> +	priv = kvm_gmem_allocator_private(inode);
> +	nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
> +
> +	if (start)
> +		return round_down(bound, nr_pages);
> +	else
> +		return round_up(bound, nr_pages);
> +}
> +
> +static pgoff_t kvm_gmem_compute_invalidate_start(struct inode *inode,
> +						 pgoff_t bound)
> +{
> +	return kvm_gmem_compute_invalidate_bound(inode, bound, true);
> +}
> +
> +static pgoff_t kvm_gmem_compute_invalidate_end(struct inode *inode,
> +					       pgoff_t bound)
> +{
> +	return kvm_gmem_compute_invalidate_bound(inode, bound, false);
> +}
> +
>  static int kvm_gmem_shareability_apply(struct inode *inode,
>  				       struct conversion_work *work,
>  				       enum shareability m)
> @@ -299,35 +428,53 @@ static void kvm_gmem_convert_invalidate_begin(struct inode *inode,
>  					      struct conversion_work *work)
>  {
>  	struct list_head *gmem_list;
> +	pgoff_t invalidate_start;
> +	pgoff_t invalidate_end;
>  	struct kvm_gmem *gmem;
> -	pgoff_t end;
> +	pgoff_t work_end;
>  
> -	end = work->start + work->nr_pages;
> +	work_end = work->start + work->nr_pages;
> +	invalidate_start = kvm_gmem_compute_invalidate_start(inode, work->start);
> +	invalidate_end = kvm_gmem_compute_invalidate_end(inode, work_end);
Could we just notify the exact gfn range and let KVM adjust the invalidate
range?

Then kvm_gmem_invalidate_begin() can ask KVM to do EPT splitting before any
kvm_mmu_unmap_gfn_range() is performed.


>  	gmem_list = &inode->i_mapping->i_private_list;
>  	list_for_each_entry(gmem, gmem_list, entry)
> -		kvm_gmem_invalidate_begin(gmem, work->start, end);
> +		kvm_gmem_invalidate_begin(gmem, invalidate_start, invalidate_end);
>  }
>  
>  static void kvm_gmem_convert_invalidate_end(struct inode *inode,
>  					    struct conversion_work *work)
>  {
>  	struct list_head *gmem_list;
> +	pgoff_t invalidate_start;
> +	pgoff_t invalidate_end;
>  	struct kvm_gmem *gmem;
> -	pgoff_t end;
> +	pgoff_t work_end;
>  
> -	end = work->start + work->nr_pages;
> +	work_end = work->start + work->nr_pages;
> +	invalidate_start = kvm_gmem_compute_invalidate_start(inode, work->start);
> +	invalidate_end = kvm_gmem_compute_invalidate_end(inode, work_end);
>  
>  	gmem_list = &inode->i_mapping->i_private_list;
>  	list_for_each_entry(gmem, gmem_list, entry)
> -		kvm_gmem_invalidate_end(gmem, work->start, end);
> +		kvm_gmem_invalidate_end(gmem, invalidate_start, invalidate_end);
>  }
>  
>  static int kvm_gmem_convert_should_proceed(struct inode *inode,
>  					   struct conversion_work *work,
>  					   bool to_shared, pgoff_t *error_index)
>  {
> -	if (!to_shared) {
> +	if (to_shared) {
> +		struct list_head *gmem_list;
> +		struct kvm_gmem *gmem;
> +		pgoff_t work_end;
> +
> +		work_end = work->start + work->nr_pages;
> +
> +		gmem_list = &inode->i_mapping->i_private_list;
> +		list_for_each_entry(gmem, gmem_list, entry)
> +			kvm_gmem_unmap_private(gmem, work->start, work_end);
> +	} else {
>  		unmap_mapping_pages(inode->i_mapping, work->start,
>  				    work->nr_pages, false);
>  
> @@ -340,6 +487,27 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
>  	return 0;
>  }
>  
> +static int kvm_gmem_convert_execute_work(struct inode *inode,
> +					 struct conversion_work *work,
> +					 bool to_shared)
> +{
> +	enum shareability m;
> +	int ret;
> +
> +	m = to_shared ? SHAREABILITY_ALL : SHAREABILITY_GUEST;
> +	ret = kvm_gmem_shareability_apply(inode, work, m);
> +	if (ret)
> +		return ret;
> +	/*
> +	 * Apply shareability first so split/merge can operate on new
> +	 * shareability state.
> +	 */
> +	ret = kvm_gmem_restructure_folios_in_range(
> +		inode, work->start, work->nr_pages, to_shared);
> +
> +	return ret;
> +}
> +
>  static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
>  				  size_t nr_pages, bool shared,
>  				  pgoff_t *error_index)
> @@ -371,18 +539,21 @@ static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
>  
>  	list_for_each_entry(work, &work_list, list) {
>  		rollback_stop_item = work;
> -		ret = kvm_gmem_shareability_apply(inode, work, m);
> +
> +		ret = kvm_gmem_convert_execute_work(inode, work, shared);
>  		if (ret)
>  			break;
>  	}
>  
>  	if (ret) {
> -		m = shared ? SHAREABILITY_GUEST : SHAREABILITY_ALL;
>  		list_for_each_entry(work, &work_list, list) {
> +			int r;
> +
> +			r = kvm_gmem_convert_execute_work(inode, work, !shared);
> +			WARN_ON(r);
> +
>  			if (work == rollback_stop_item)
>  				break;
> -
> -			WARN_ON(kvm_gmem_shareability_apply(inode, work, m));
>  		}
>  	}
>  
> @@ -434,6 +605,277 @@ static int kvm_gmem_ioctl_convert_range(struct file *file,
>  	return ret;
>  }
>  
> +#ifdef CONFIG_KVM_GMEM_HUGETLB
> +
> +static inline void __filemap_remove_folio_for_restructuring(struct folio *folio)
> +{
> +	struct address_space *mapping = folio->mapping;
> +
> +	spin_lock(&mapping->host->i_lock);
> +	xa_lock_irq(&mapping->i_pages);
> +
> +	__filemap_remove_folio(folio, NULL);
> +
> +	xa_unlock_irq(&mapping->i_pages);
> +	spin_unlock(&mapping->host->i_lock);
> +}
> +
> +/**
> + * filemap_remove_folio_for_restructuring() - Remove @folio from filemap for
> + * split/merge.
> + *
> + * @folio: the folio to be removed.
> + *
> + * Similar to filemap_remove_folio(), but skips LRU-related calls (meaningless
> + * for guest_memfd), and skips call to ->free_folio() to maintain folio flags.
> + *
> + * Context: Expects only the filemap's refcounts to be left on the folio. Will
> + *          freeze these refcounts away so that no other users will interfere
> + *          with restructuring.
> + */
> +static inline void filemap_remove_folio_for_restructuring(struct folio *folio)
> +{
> +	int filemap_refcount;
> +
> +	filemap_refcount = folio_nr_pages(folio);
> +	while (!folio_ref_freeze(folio, filemap_refcount)) {
> +		/*
> +		 * At this point only filemap refcounts are expected, hence okay
> +		 * to spin until speculative refcounts go away.
> +		 */
> +		WARN_ONCE(1, "Spinning on folio=%p refcount=%d", folio, folio_ref_count(folio));
> +	}
> +
> +	folio_lock(folio);
> +	__filemap_remove_folio_for_restructuring(folio);
> +	folio_unlock(folio);
> +}
> +
> +/**
> + * kvm_gmem_split_folio_in_filemap() - Split @folio within filemap in @inode.
> + *
> + * @inode: inode containing the folio.
> + * @folio: folio to be split.
> + *
> + * Split a folio into folios of size PAGE_SIZE. Will clean up folio from filemap
> + * and add back the split folios.
> + *
> + * Context: Expects that before this call, folio's refcount is just the
> + *          filemap's refcounts. After this function returns, the split folios'
> + *          refcounts will also be filemap's refcounts.
> + * Return: 0 on success or negative error otherwise.
> + */
> +static int kvm_gmem_split_folio_in_filemap(struct inode *inode, struct folio *folio)
> +{
> +	size_t orig_nr_pages;
> +	pgoff_t orig_index;
> +	size_t i, j;
> +	int ret;
> +
> +	orig_nr_pages = folio_nr_pages(folio);
> +	if (orig_nr_pages == 1)
> +		return 0;
> +
> +	orig_index = folio->index;
> +
> +	filemap_remove_folio_for_restructuring(folio);
> +
> +	ret = kvm_gmem_allocator_ops(inode)->split_folio(folio);
> +	if (ret)
> +		goto err;
> +
> +	for (i = 0; i < orig_nr_pages; ++i) {
> +		struct folio *f = page_folio(folio_page(folio, i));
> +
> +		ret = __kvm_gmem_filemap_add_folio(inode->i_mapping, f,
> +						   orig_index + i);
> +		if (ret)
> +			goto rollback;
> +	}
> +
> +	return ret;
> +
> +rollback:
> +	for (j = 0; j < i; ++j) {
> +		struct folio *f = page_folio(folio_page(folio, j));
> +
> +		filemap_remove_folio_for_restructuring(f);
> +	}
> +
> +	kvm_gmem_allocator_ops(inode)->merge_folio(folio);
> +err:
> +	WARN_ON(__kvm_gmem_filemap_add_folio(inode->i_mapping, folio, orig_index));
> +
> +	return ret;
> +}
> +
> +static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
> +						      struct folio *folio)
> +{
> +	size_t to_nr_pages;
> +	void *priv;
> +
> +	if (!kvm_gmem_has_custom_allocator(inode))
> +		return 0;
> +
> +	priv = kvm_gmem_allocator_private(inode);
> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_page(priv);
> +
> +	if (kvm_gmem_has_some_shared(inode, folio->index, to_nr_pages))
If the guest_memfd is configured with GUESTMEM_HUGETLB_FLAG_1GB, it seems that
whenever there's a shared page within a 1GB range, the folio will always be
split all the way down into 4KB folios (i.e. a single shared page turns one 1GB
folio into 262144 PAGE_SIZE folios). Is that desirable?

> +		return kvm_gmem_split_folio_in_filemap(inode, folio);
> +
> +	return 0;
> +}
> +
> +/**
> + * kvm_gmem_merge_folio_in_filemap() - Merge @first_folio within filemap in
> + * @inode.
> + *
> + * @inode: inode containing the folio.
> + * @first_folio: first folio among folios to be merged.
> + *
> + * Will clean up subfolios from filemap and add back the merged folio.
> + *
> + * Context: Expects that before this call, all subfolios only have filemap
> + *          refcounts. After this function returns, the merged folio will only
> + *          have filemap refcounts.
> + * Return: 0 on success or negative error otherwise.
> + */
> +static int kvm_gmem_merge_folio_in_filemap(struct inode *inode,
> +					   struct folio *first_folio)
> +{
> +	size_t to_nr_pages;
> +	pgoff_t index;
> +	void *priv;
> +	size_t i;
> +	int ret;
> +
> +	index = first_folio->index;
> +
> +	priv = kvm_gmem_allocator_private(inode);
> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
> +	if (folio_nr_pages(first_folio) == to_nr_pages)
> +		return 0;
> +
> +	for (i = 0; i < to_nr_pages; ++i) {
> +		struct folio *f = page_folio(folio_page(first_folio, i));
> +
> +		filemap_remove_folio_for_restructuring(f);
> +	}
> +
> +	kvm_gmem_allocator_ops(inode)->merge_folio(first_folio);
> +
> +	ret = __kvm_gmem_filemap_add_folio(inode->i_mapping, first_folio, index);
> +	if (ret)
> +		goto err_split;
> +
> +	return ret;
> +
> +err_split:
> +	WARN_ON(kvm_gmem_allocator_ops(inode)->split_folio(first_folio));
> +	for (i = 0; i < to_nr_pages; ++i) {
> +		struct folio *f = page_folio(folio_page(first_folio, i));
> +
> +		WARN_ON(__kvm_gmem_filemap_add_folio(inode->i_mapping, f, index + i));
> +	}
> +
> +	return ret;
> +}
> +
> +static inline int kvm_gmem_try_merge_folio_in_filemap(struct inode *inode,
> +						      struct folio *first_folio)
> +{
> +	size_t to_nr_pages;
> +	void *priv;
> +
> +	priv = kvm_gmem_allocator_private(inode);
> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
> +
> +	if (kvm_gmem_has_some_shared(inode, first_folio->index, to_nr_pages))
> +		return 0;
> +
> +	return kvm_gmem_merge_folio_in_filemap(inode, first_folio);
> +}
> +
> +static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
> +						pgoff_t start, size_t nr_pages,
> +						bool is_split_operation)
> +{
> +	size_t to_nr_pages;
> +	pgoff_t index;
> +	pgoff_t end;
> +	void *priv;
> +	int ret;
> +
> +	if (!kvm_gmem_has_custom_allocator(inode))
> +		return 0;
> +
> +	end = start + nr_pages;
> +
> +	/* Round to allocator page size, to check all (huge) pages in range. */
> +	priv = kvm_gmem_allocator_private(inode);
> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
> +
> +	start = round_down(start, to_nr_pages);
> +	end = round_up(end, to_nr_pages);
> +
> +	for (index = start; index < end; index += to_nr_pages) {
> +		struct folio *f;
> +
> +		f = filemap_get_folio(inode->i_mapping, index);
> +		if (IS_ERR(f))
> +			continue;
> +
> +		/* Leave just filemap's refcounts on the folio. */
> +		folio_put(f);
> +
> +		if (is_split_operation)
> +			ret = kvm_gmem_split_folio_in_filemap(inode, f);
The split operation is performed after kvm_gmem_unmap_private() within
kvm_gmem_convert_should_proceed(), right?

So, it seems that it's not necessary for TDX to avoid holding private page
references, as TDX must have released the page refs after
kvm_gmem_unmap_private() (except when there's a TDX module or KVM bug).

> +		else
> +			ret = kvm_gmem_try_merge_folio_in_filemap(inode, f);
> +
> +		if (ret)
> +			goto rollback;
> +	}
> +	return ret;
> +
> +rollback:
> +	for (index -= to_nr_pages; index >= start; index -= to_nr_pages) {
> +		struct folio *f;
> +
> +		f = filemap_get_folio(inode->i_mapping, index);
> +		if (IS_ERR(f))
> +			continue;
> +
> +		/* Leave just filemap's refcounts on the folio. */
> +		folio_put(f);
> +
> +		if (is_split_operation)
> +			WARN_ON(kvm_gmem_merge_folio_in_filemap(inode, f));
> +		else
> +			WARN_ON(kvm_gmem_split_folio_in_filemap(inode, f));
> +	}
> +
> +	return ret;
> +}
> +
> +#else
> +
> +static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
> +						      struct folio *folio)
> +{
> +	return 0;
> +}
> +
> +static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
> +						pgoff_t start, size_t nr_pages,
> +						bool is_split_operation)
> +{
> +	return 0;
> +}
> +
> +#endif
> +
>  #else
>  
>  static int kvm_gmem_shareability_setup(struct maple_tree *mt, loff_t size, u64 flags)
> @@ -563,11 +1005,16 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>  		return folio;
>  
>  	if (kvm_gmem_has_custom_allocator(inode)) {
> -		void *p = kvm_gmem_allocator_private(inode);
> +		size_t nr_pages;
> +		void *p;
>  
> +		p = kvm_gmem_allocator_private(inode);
>  		folio = kvm_gmem_allocator_ops(inode)->alloc_folio(p);
>  		if (IS_ERR(folio))
>  			return folio;
> +
> +		nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(p);
> +		index_floor = round_down(index, nr_pages);
>  	} else {
>  		gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
>  
> @@ -580,10 +1027,11 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>  			folio_put(folio);
>  			return ERR_PTR(ret);
>  		}
> +
> +		index_floor = index;
>  	}
>  	allocated_size = folio_size(folio);
>  
> -	index_floor = round_down(index, folio_nr_pages(folio));
>  	ret = kvm_gmem_filemap_add_folio(inode->i_mapping, folio, index_floor);
>  	if (ret) {
>  		folio_put(folio);
> @@ -600,6 +1048,13 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>  		return ERR_PTR(ret);
>  	}
>  
> +	/* Leave just filemap's refcounts on folio. */
> +	folio_put(folio);
> +
> +	ret = kvm_gmem_try_split_folio_in_filemap(inode, folio);
> +	if (ret)
> +		goto err;
> +
>  	spin_lock(&inode->i_lock);
>  	inode->i_blocks += allocated_size / 512;
>  	spin_unlock(&inode->i_lock);
> @@ -608,14 +1063,17 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>  	 * folio is the one that is allocated, this gets the folio at the
>  	 * requested index.
>  	 */
> -	folio = page_folio(folio_file_page(folio, index));
> -	folio_lock(folio);
> +	folio = filemap_lock_folio(inode->i_mapping, index);
>  
>  	return folio;
> +
> +err:
> +	filemap_remove_folio(folio);
> +	return ERR_PTR(ret);
>  }
>  
> -static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
> -				      pgoff_t end)
> +static void kvm_gmem_invalidate_begin_and_zap(struct kvm_gmem *gmem,
> +					      pgoff_t start, pgoff_t end)
>  {
>  	bool flush = false, found_memslot = false;
>  	struct kvm_memory_slot *slot;
> @@ -848,7 +1306,7 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>  	filemap_invalidate_lock(inode->i_mapping);
>  
>  	list_for_each_entry(gmem, gmem_list, entry)
> -		kvm_gmem_invalidate_begin(gmem, start, end);
> +		kvm_gmem_invalidate_begin_and_zap(gmem, start, end);
>  
>  	if (kvm_gmem_has_custom_allocator(inode)) {
>  		kvm_gmem_truncate_inode_range(inode, offset, offset + len);
> @@ -978,7 +1436,7 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
>  	 * Zap all SPTEs pointed at by this file.  Do not free the backing
>  	 * memory, as its lifetime is associated with the inode, not the file.
>  	 */
> -	kvm_gmem_invalidate_begin(gmem, 0, -1ul);
> +	kvm_gmem_invalidate_begin_and_zap(gmem, 0, -1ul);
>  	kvm_gmem_invalidate_end(gmem, 0, -1ul);
>  
>  	list_del(&gmem->entry);
> @@ -1289,7 +1747,7 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
>  	end = start + folio_nr_pages(folio);
>  
>  	list_for_each_entry(gmem, gmem_list, entry)
> -		kvm_gmem_invalidate_begin(gmem, start, end);
> +		kvm_gmem_invalidate_begin_and_zap(gmem, start, end);
>  
>  	/*
>  	 * Do not truncate the range, what action is taken in response to the
> @@ -1330,6 +1788,9 @@ static void kvm_gmem_free_folio(struct address_space *mapping,
>  	 */
>  	folio_clear_uptodate(folio);
>  
> +	if (kvm_gmem_has_custom_allocator(mapping->host))
> +		kvm_gmem_allocator_ops(mapping->host)->free_folio(folio);
> +
>  	kvm_gmem_invalidate(folio);
>  }
>  
> -- 
> 2.49.0.1045.g170613ef41-goog
> 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 38/51] KVM: guest_memfd: Split allocator pages for guest_memfd use
  2025-05-27  4:30   ` Yan Zhao
@ 2025-05-27  4:38     ` Yan Zhao
  2025-06-05 17:50     ` Ackerley Tng
  1 sibling, 0 replies; 231+ messages in thread
From: Yan Zhao @ 2025-05-27  4:38 UTC (permalink / raw)
  To: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yilun.xu, yuzenghui,
	zhiquan1.li

> > +static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
> > +						pgoff_t start, size_t nr_pages,
> > +						bool is_split_operation)
> > +{
> > +	size_t to_nr_pages;
> > +	pgoff_t index;
> > +	pgoff_t end;
> > +	void *priv;
> > +	int ret;
> > +
> > +	if (!kvm_gmem_has_custom_allocator(inode))
> > +		return 0;
> > +
> > +	end = start + nr_pages;
> > +
> > +	/* Round to allocator page size, to check all (huge) pages in range. */
> > +	priv = kvm_gmem_allocator_private(inode);
> > +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
> > +
> > +	start = round_down(start, to_nr_pages);
> > +	end = round_up(end, to_nr_pages);
> > +
> > +	for (index = start; index < end; index += to_nr_pages) {
> > +		struct folio *f;
> > +
> > +		f = filemap_get_folio(inode->i_mapping, index);
> > +		if (IS_ERR(f))
> > +			continue;
> > +
> > +		/* Leave just filemap's refcounts on the folio. */
> > +		folio_put(f);
> > +
> > +		if (is_split_operation)
> > +			ret = kvm_gmem_split_folio_in_filemap(inode, f);
> The split operation is performed after kvm_gmem_unmap_private() within
> kvm_gmem_convert_should_proceed(), right?
> 
> So, it seems that it's not necessary for TDX to avoid holding private page
> references, as TDX must have released the page refs after
> kvm_gmem_unmap_private() (except when there's a TDX module or KVM bug).
Oops. Please ignore this one.
The unmap does not necessarily cover the entire folio range (e.g. a conversion
may only touch part of a 1G folio), so the split still requires that TDX not
hold refcounts on the folio's pages.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-05-14 23:41 ` [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting Ackerley Tng
  2025-05-27  3:54   ` Yan Zhao
@ 2025-05-27  8:25   ` Binbin Wu
  2025-05-27  8:43     ` Binbin Wu
  2025-05-29 18:26     ` Ackerley Tng
  2025-05-29  5:42   ` Michael Roth
  2025-08-01  0:01   ` Yan Zhao
  3 siblings, 2 replies; 231+ messages in thread
From: Binbin Wu @ 2025-05-27  8:25 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko, jgg,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li



On 5/15/2025 7:41 AM, Ackerley Tng wrote:
> Track guest_memfd memory's shareability status within the inode as
> opposed to the file, since it is property of the guest_memfd's memory
> contents.
>
> Shareability is a property of the memory and is indexed using the
> page's index in the inode. Because shareability is the memory's
> property, it is stored within guest_memfd instead of within KVM, like
> in kvm->mem_attr_array.
>
> KVM_MEMORY_ATTRIBUTE_PRIVATE in kvm->mem_attr_array must still be
> retained to allow VMs to only use guest_memfd for private memory and
> some other memory for shared memory.
>
> Not all use cases require guest_memfd() to be shared with the host
> when first created. Add a new flag, GUEST_MEMFD_FLAG_INIT_PRIVATE,
> which when set on KVM_CREATE_GUEST_MEMFD, initializes the memory as
> private to the guest, and therefore not mappable by the
> host. Otherwise, memory is shared until explicitly converted to
> private.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> Co-developed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> Change-Id: If03609cbab3ad1564685c85bdba6dcbb6b240c0f
> ---
>   Documentation/virt/kvm/api.rst |   5 ++
>   include/uapi/linux/kvm.h       |   2 +
>   virt/kvm/guest_memfd.c         | 124 ++++++++++++++++++++++++++++++++-
>   3 files changed, 129 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 86f74ce7f12a..f609337ae1c2 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6408,6 +6408,11 @@ belonging to the slot via its userspace_addr.
>   The use of GUEST_MEMFD_FLAG_SUPPORT_SHARED will not be allowed for CoCo VMs.
>   This is validated when the guest_memfd instance is bound to the VM.
>   
> +If the capability KVM_CAP_GMEM_CONVERSIONS is supported, then the 'flags' field
> +supports GUEST_MEMFD_FLAG_INIT_PRIVATE.

It seems that the sentence is stale?
Didn't find the definition of KVM_CAP_GMEM_CONVERSIONS.

> Setting GUEST_MEMFD_FLAG_INIT_PRIVATE
> +will initialize the memory for the guest_memfd as guest-only and not faultable
> +by the host.
> +
[...]
>   
>   static int kvm_gmem_init_fs_context(struct fs_context *fc)
> @@ -549,12 +645,26 @@ static const struct inode_operations kvm_gmem_iops = {
>   static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>   						      loff_t size, u64 flags)
>   {
> +	struct kvm_gmem_inode_private *private;
>   	struct inode *inode;
> +	int err;
>   
>   	inode = alloc_anon_secure_inode(kvm_gmem_mnt->mnt_sb, name);
>   	if (IS_ERR(inode))
>   		return inode;
>   
> +	err = -ENOMEM;
> +	private = kzalloc(sizeof(*private), GFP_KERNEL);
> +	if (!private)
> +		goto out;
> +
> +	mt_init(&private->shareability);

shareability is defined only when CONFIG_KVM_GMEM_SHARED_MEM is enabled, so this
initialization should only be done when CONFIG_KVM_GMEM_SHARED_MEM is enabled.
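
A minimal sketch of what I mean, just guarding the init (alternatively, if
kvm_gmem_shareability_setup() has a !CONFIG_KVM_GMEM_SHARED_MEM stub, the
mt_init() could move in there instead):

#ifdef CONFIG_KVM_GMEM_SHARED_MEM
	mt_init(&private->shareability);
#endif
	inode->i_mapping->i_private_data = private;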


> +	inode->i_mapping->i_private_data = private;
> +
> +	err = kvm_gmem_shareability_setup(private, size, flags);
> +	if (err)
> +		goto out;
> +
>   	inode->i_private = (void *)(unsigned long)flags;
>   	inode->i_op = &kvm_gmem_iops;
>   	inode->i_mapping->a_ops = &kvm_gmem_aops;
> @@ -566,6 +676,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>   	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>   
>   	return inode;
> +
> +out:
> +	iput(inode);
> +
> +	return ERR_PTR(err);
>   }
>   
>
[...]

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-05-27  8:25   ` Binbin Wu
@ 2025-05-27  8:43     ` Binbin Wu
  2025-05-29 18:26     ` Ackerley Tng
  1 sibling, 0 replies; 231+ messages in thread
From: Binbin Wu @ 2025-05-27  8:43 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko, jgg,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li



On 5/27/2025 4:25 PM, Binbin Wu wrote:
>
>
> On 5/15/2025 7:41 AM, Ackerley Tng wrote:
>> Track guest_memfd memory's shareability status within the inode as
>> opposed to the file, since it is property of the guest_memfd's memory
>> contents.
>>
>> Shareability is a property of the memory and is indexed using the
>> page's index in the inode. Because shareability is the memory's
>> property, it is stored within guest_memfd instead of within KVM, like
>> in kvm->mem_attr_array.
>>
>> KVM_MEMORY_ATTRIBUTE_PRIVATE in kvm->mem_attr_array must still be
>> retained to allow VMs to only use guest_memfd for private memory and
>> some other memory for shared memory.
>>
>> Not all use cases require guest_memfd() to be shared with the host
>> when first created. Add a new flag, GUEST_MEMFD_FLAG_INIT_PRIVATE,
>> which when set on KVM_CREATE_GUEST_MEMFD, initializes the memory as
>> private to the guest, and therefore not mappable by the
>> host. Otherwise, memory is shared until explicitly converted to
>> private.
>>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
>> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
>> Co-developed-by: Fuad Tabba <tabba@google.com>
>> Signed-off-by: Fuad Tabba <tabba@google.com>
>> Change-Id: If03609cbab3ad1564685c85bdba6dcbb6b240c0f
>> ---
>>   Documentation/virt/kvm/api.rst |   5 ++
>>   include/uapi/linux/kvm.h       |   2 +
>>   virt/kvm/guest_memfd.c         | 124 ++++++++++++++++++++++++++++++++-
>>   3 files changed, 129 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 86f74ce7f12a..f609337ae1c2 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -6408,6 +6408,11 @@ belonging to the slot via its userspace_addr.
>>   The use of GUEST_MEMFD_FLAG_SUPPORT_SHARED will not be allowed for CoCo VMs.
>>   This is validated when the guest_memfd instance is bound to the VM.
>>   +If the capability KVM_CAP_GMEM_CONVERSIONS is supported, then the 'flags' field
>> +supports GUEST_MEMFD_FLAG_INIT_PRIVATE.
>
> It seems that the sentence is stale?
> Didn't find the definition of KVM_CAP_GMEM_CONVERSIONS.
Aha! It's a typo; it should be KVM_CAP_GMEM_CONVERSION.



>
>> Setting GUEST_MEMFD_FLAG_INIT_PRIVATE
>> +will initialize the memory for the guest_memfd as guest-only and not faultable
>> +by the host.
>> +
> [...]
>>     static int kvm_gmem_init_fs_context(struct fs_context *fc)
>> @@ -549,12 +645,26 @@ static const struct inode_operations kvm_gmem_iops = {
>>   static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>>                                 loff_t size, u64 flags)
>>   {
>> +    struct kvm_gmem_inode_private *private;
>>       struct inode *inode;
>> +    int err;
>>         inode = alloc_anon_secure_inode(kvm_gmem_mnt->mnt_sb, name);
>>       if (IS_ERR(inode))
>>           return inode;
>>   +    err = -ENOMEM;
>> +    private = kzalloc(sizeof(*private), GFP_KERNEL);
>> +    if (!private)
>> +        goto out;
>> +
>> +    mt_init(&private->shareability);
>
> shareability is defined only when CONFIG_KVM_GMEM_SHARED_MEM is enabled, so this initialization should only be done when CONFIG_KVM_GMEM_SHARED_MEM is enabled.
>
>
>> + inode->i_mapping->i_private_data = private;
>> +
>> +    err = kvm_gmem_shareability_setup(private, size, flags);
>> +    if (err)
>> +        goto out;
>> +
>>       inode->i_private = (void *)(unsigned long)flags;
>>       inode->i_op = &kvm_gmem_iops;
>>       inode->i_mapping->a_ops = &kvm_gmem_aops;
>> @@ -566,6 +676,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>>       WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>>         return inode;
>> +
>> +out:
>> +    iput(inode);
>> +
>> +    return ERR_PTR(err);
>>   }
>>
> [...]
>


^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 38/51] KVM: guest_memfd: Split allocator pages for guest_memfd use
  2025-05-14 23:42 ` [RFC PATCH v2 38/51] KVM: guest_memfd: Split allocator pages for guest_memfd use Ackerley Tng
  2025-05-22 22:19   ` Edgecombe, Rick P
  2025-05-27  4:30   ` Yan Zhao
@ 2025-05-27  8:45   ` Yan Zhao
  2025-06-05 19:10     ` Ackerley Tng
  2025-06-05  5:24   ` Binbin Wu
  3 siblings, 1 reply; 231+ messages in thread
From: Yan Zhao @ 2025-05-27  8:45 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yilun.xu, yuzenghui,
	zhiquan1.li

On Wed, May 14, 2025 at 04:42:17PM -0700, Ackerley Tng wrote:
> +static int kvm_gmem_convert_execute_work(struct inode *inode,
> +					 struct conversion_work *work,
> +					 bool to_shared)
> +{
> +	enum shareability m;
> +	int ret;
> +
> +	m = to_shared ? SHAREABILITY_ALL : SHAREABILITY_GUEST;
> +	ret = kvm_gmem_shareability_apply(inode, work, m);
> +	if (ret)
> +		return ret;
> +	/*
> +	 * Apply shareability first so split/merge can operate on new
> +	 * shareability state.
> +	 */
> +	ret = kvm_gmem_restructure_folios_in_range(
> +		inode, work->start, work->nr_pages, to_shared);
> +
> +	return ret;
> +}
> +
>  static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
>  				  size_t nr_pages, bool shared,
>  				  pgoff_t *error_index)
> @@ -371,18 +539,21 @@ static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
>  
>  	list_for_each_entry(work, &work_list, list) {
>  		rollback_stop_item = work;
> -		ret = kvm_gmem_shareability_apply(inode, work, m);
> +
> +		ret = kvm_gmem_convert_execute_work(inode, work, shared);
>  		if (ret)
>  			break;
>  	}
>  
>  	if (ret) {
> -		m = shared ? SHAREABILITY_GUEST : SHAREABILITY_ALL;
>  		list_for_each_entry(work, &work_list, list) {
> +			int r;
> +
> +			r = kvm_gmem_convert_execute_work(inode, work, !shared);
> +			WARN_ON(r);
> +
>  			if (work == rollback_stop_item)
>  				break;
> -
> -			WARN_ON(kvm_gmem_shareability_apply(inode, work, m));
Could kvm_gmem_shareability_apply() fail here?

>  		}
>  	}
>  
> @@ -434,6 +605,277 @@ static int kvm_gmem_ioctl_convert_range(struct file *file,
>  	return ret;
>  }
>  
> +#ifdef CONFIG_KVM_GMEM_HUGETLB
> +
> +static inline void __filemap_remove_folio_for_restructuring(struct folio *folio)
> +{
> +	struct address_space *mapping = folio->mapping;
> +
> +	spin_lock(&mapping->host->i_lock);
> +	xa_lock_irq(&mapping->i_pages);
> +
> +	__filemap_remove_folio(folio, NULL);
> +
> +	xa_unlock_irq(&mapping->i_pages);
> +	spin_unlock(&mapping->host->i_lock);
> +}
> +
> +/**
> + * filemap_remove_folio_for_restructuring() - Remove @folio from filemap for
> + * split/merge.
> + *
> + * @folio: the folio to be removed.
> + *
> + * Similar to filemap_remove_folio(), but skips LRU-related calls (meaningless
> + * for guest_memfd), and skips call to ->free_folio() to maintain folio flags.
> + *
> + * Context: Expects only the filemap's refcounts to be left on the folio. Will
> + *          freeze these refcounts away so that no other users will interfere
> + *          with restructuring.
> + */
> +static inline void filemap_remove_folio_for_restructuring(struct folio *folio)
> +{
> +	int filemap_refcount;
> +
> +	filemap_refcount = folio_nr_pages(folio);
> +	while (!folio_ref_freeze(folio, filemap_refcount)) {
> +		/*
> +		 * At this point only filemap refcounts are expected, hence okay
> +		 * to spin until speculative refcounts go away.
> +		 */
> +		WARN_ONCE(1, "Spinning on folio=%p refcount=%d", folio, folio_ref_count(folio));
> +	}
> +
> +	folio_lock(folio);
> +	__filemap_remove_folio_for_restructuring(folio);
> +	folio_unlock(folio);
> +}
> +
> +/**
> + * kvm_gmem_split_folio_in_filemap() - Split @folio within filemap in @inode.
> + *
> + * @inode: inode containing the folio.
> + * @folio: folio to be split.
> + *
> + * Split a folio into folios of size PAGE_SIZE. Will clean up folio from filemap
> + * and add back the split folios.
> + *
> + * Context: Expects that before this call, folio's refcount is just the
> + *          filemap's refcounts. After this function returns, the split folios'
> + *          refcounts will also be filemap's refcounts.
> + * Return: 0 on success or negative error otherwise.
> + */
> +static int kvm_gmem_split_folio_in_filemap(struct inode *inode, struct folio *folio)
> +{
> +	size_t orig_nr_pages;
> +	pgoff_t orig_index;
> +	size_t i, j;
> +	int ret;
> +
> +	orig_nr_pages = folio_nr_pages(folio);
> +	if (orig_nr_pages == 1)
> +		return 0;
> +
> +	orig_index = folio->index;
> +
> +	filemap_remove_folio_for_restructuring(folio);
> +
> +	ret = kvm_gmem_allocator_ops(inode)->split_folio(folio);
> +	if (ret)
> +		goto err;
> +
> +	for (i = 0; i < orig_nr_pages; ++i) {
> +		struct folio *f = page_folio(folio_page(folio, i));
> +
> +		ret = __kvm_gmem_filemap_add_folio(inode->i_mapping, f,
> +						   orig_index + i);
Why does the failure of __kvm_gmem_filemap_add_folio() here lead to rollback,    
while the failure of the one under rollback only triggers WARN_ON()?

> +		if (ret)
> +			goto rollback;
> +	}
> +
> +	return ret;
> +
> +rollback:
> +	for (j = 0; j < i; ++j) {
> +		struct folio *f = page_folio(folio_page(folio, j));
> +
> +		filemap_remove_folio_for_restructuring(f);
> +	}
> +
> +	kvm_gmem_allocator_ops(inode)->merge_folio(folio);
> +err:
> +	WARN_ON(__kvm_gmem_filemap_add_folio(inode->i_mapping, folio, orig_index));
> +
> +	return ret;
> +}
> +
> +static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
> +						      struct folio *folio)
> +{
> +	size_t to_nr_pages;
> +	void *priv;
> +
> +	if (!kvm_gmem_has_custom_allocator(inode))
> +		return 0;
> +
> +	priv = kvm_gmem_allocator_private(inode);
> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_page(priv);
> +
> +	if (kvm_gmem_has_some_shared(inode, folio->index, to_nr_pages))
> +		return kvm_gmem_split_folio_in_filemap(inode, folio);
> +
> +	return 0;
> +}
> +
> +/**
> + * kvm_gmem_merge_folio_in_filemap() - Merge @first_folio within filemap in
> + * @inode.
> + *
> + * @inode: inode containing the folio.
> + * @first_folio: first folio among folios to be merged.
> + *
> + * Will clean up subfolios from filemap and add back the merged folio.
> + *
> + * Context: Expects that before this call, all subfolios only have filemap
> + *          refcounts. After this function returns, the merged folio will only
> + *          have filemap refcounts.
> + * Return: 0 on success or negative error otherwise.
> + */
> +static int kvm_gmem_merge_folio_in_filemap(struct inode *inode,
> +					   struct folio *first_folio)
> +{
> +	size_t to_nr_pages;
> +	pgoff_t index;
> +	void *priv;
> +	size_t i;
> +	int ret;
> +
> +	index = first_folio->index;
> +
> +	priv = kvm_gmem_allocator_private(inode);
> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
> +	if (folio_nr_pages(first_folio) == to_nr_pages)
> +		return 0;
> +
> +	for (i = 0; i < to_nr_pages; ++i) {
> +		struct folio *f = page_folio(folio_page(first_folio, i));
> +
> +		filemap_remove_folio_for_restructuring(f);
> +	}
> +
> +	kvm_gmem_allocator_ops(inode)->merge_folio(first_folio);
> +
> +	ret = __kvm_gmem_filemap_add_folio(inode->i_mapping, first_folio, index);
> +	if (ret)
> +		goto err_split;
> +
> +	return ret;
> +
> +err_split:
> +	WARN_ON(kvm_gmem_allocator_ops(inode)->split_folio(first_folio));
guestmem_hugetlb_split_folio() can fail here, e.g.:
after the stash is freed by guestmem_hugetlb_unstash_free_metadata() in
guestmem_hugetlb_merge_folio(), the stash allocation in
guestmem_hugetlb_stash_metadata(), called from guestmem_hugetlb_split_folio(),
can fail with -ENOMEM.


> +	for (i = 0; i < to_nr_pages; ++i) {
> +		struct folio *f = page_folio(folio_page(first_folio, i));
> +
> +		WARN_ON(__kvm_gmem_filemap_add_folio(inode->i_mapping, f, index + i));
> +	}
> +
> +	return ret;
> +}
> +
> +static inline int kvm_gmem_try_merge_folio_in_filemap(struct inode *inode,
> +						      struct folio *first_folio)
> +{
> +	size_t to_nr_pages;
> +	void *priv;
> +
> +	priv = kvm_gmem_allocator_private(inode);
> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
> +
> +	if (kvm_gmem_has_some_shared(inode, first_folio->index, to_nr_pages))
> +		return 0;
> +
> +	return kvm_gmem_merge_folio_in_filemap(inode, first_folio);
> +}
> +
> +static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
> +						pgoff_t start, size_t nr_pages,
> +						bool is_split_operation)
> +{
> +	size_t to_nr_pages;
> +	pgoff_t index;
> +	pgoff_t end;
> +	void *priv;
> +	int ret;
> +
> +	if (!kvm_gmem_has_custom_allocator(inode))
> +		return 0;
> +
> +	end = start + nr_pages;
> +
> +	/* Round to allocator page size, to check all (huge) pages in range. */
> +	priv = kvm_gmem_allocator_private(inode);
> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
> +
> +	start = round_down(start, to_nr_pages);
> +	end = round_up(end, to_nr_pages);
> +
> +	for (index = start; index < end; index += to_nr_pages) {
> +		struct folio *f;
> +
> +		f = filemap_get_folio(inode->i_mapping, index);
> +		if (IS_ERR(f))
> +			continue;
> +
> +		/* Leave just filemap's refcounts on the folio. */
> +		folio_put(f);
> +
> +		if (is_split_operation)
> +			ret = kvm_gmem_split_folio_in_filemap(inode, f);
Should this be kvm_gmem_try_split_folio_in_filemap()?

> +		else
> +			ret = kvm_gmem_try_merge_folio_in_filemap(inode, f);
> +
> +		if (ret)
> +			goto rollback;
> +	}
> +	return ret;
> +
> +rollback:
> +	for (index -= to_nr_pages; index >= start; index -= to_nr_pages) {
> +		struct folio *f;
> +
> +		f = filemap_get_folio(inode->i_mapping, index);
> +		if (IS_ERR(f))
> +			continue;
> +
> +		/* Leave just filemap's refcounts on the folio. */
> +		folio_put(f);
> +
> +		if (is_split_operation)
> +			WARN_ON(kvm_gmem_merge_folio_in_filemap(inode, f));
> +		else
> +			WARN_ON(kvm_gmem_split_folio_in_filemap(inode, f));
Is it safe to just leave WARN_ON()s in the rollback case?

Besides, aren't the kvm_gmem_merge_folio_in_filemap() and
kvm_gmem_split_folio_in_filemap() calls here duplicated by the
kvm_gmem_split_folio_in_filemap() and kvm_gmem_try_merge_folio_in_filemap()
calls made via the subsequent "r = kvm_gmem_convert_execute_work(inode, work,
!shared)" rollback?

> +	}
> +
> +	return ret;
> +}
> +
> +#else
> +
> +static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
> +						      struct folio *folio)
> +{
> +	return 0;
> +}
> +
> +static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
> +						pgoff_t start, size_t nr_pages,
> +						bool is_split_operation)
> +{
> +	return 0;
> +}
> +
> +#endif
> +
 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 03/51] KVM: selftests: Update guest_memfd_test for INIT_PRIVATE flag
  2025-05-16 17:42     ` Ackerley Tng
  2025-05-16 19:31       ` Ira Weiny
@ 2025-05-27  8:53       ` Binbin Wu
  2025-05-30 19:59         ` Ackerley Tng
  1 sibling, 1 reply; 231+ messages in thread
From: Binbin Wu @ 2025-05-27  8:53 UTC (permalink / raw)
  To: Ackerley Tng, Ira Weiny
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans,
	jhubbard, jroedel, jthoughton, jun.miao, kai.huang, keirf,
	kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li



On 5/17/2025 1:42 AM, Ackerley Tng wrote:
> Ira Weiny <ira.weiny@intel.com> writes:
>
>> Ackerley Tng wrote:
>>> Test that GUEST_MEMFD_FLAG_INIT_PRIVATE is only valid when
>>> GUEST_MEMFD_FLAG_SUPPORT_SHARED is set.
>>>
>>> Change-Id: I506e236a232047cfaee17bcaed02ee14c8d25bbb
>>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>>> ---
>>>   .../testing/selftests/kvm/guest_memfd_test.c  | 36 ++++++++++++-------
>>>   1 file changed, 24 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
>>> index 60aaba5808a5..bf2876cbd711 100644
>>> --- a/tools/testing/selftests/kvm/guest_memfd_test.c
>>> +++ b/tools/testing/selftests/kvm/guest_memfd_test.c
>>> @@ -401,13 +401,31 @@ static void test_with_type(unsigned long vm_type, uint64_t guest_memfd_flags,
>>>   	kvm_vm_release(vm);
>>>   }
>>>   
>>> +static void test_vm_with_gmem_flag(struct kvm_vm *vm, uint64_t flag,
>>> +				   bool expect_valid)
>>> +{
>>> +	size_t page_size = getpagesize();
>>> +	int fd;
>>> +
>>> +	fd = __vm_create_guest_memfd(vm, page_size, flag);
>>> +
>>> +	if (expect_valid) {
>>> +		TEST_ASSERT(fd > 0,
>>> +			    "guest_memfd() with flag '0x%lx' should be valid",
>>> +			    flag);
>>> +		close(fd);
>>> +	} else {
>>> +		TEST_ASSERT(fd == -1 && errno == EINVAL,
>>> +			    "guest_memfd() with flag '0x%lx' should fail with EINVAL",
>>> +			    flag);
>>> +	}
>>> +}
>>> +
>>>   static void test_vm_type_gmem_flag_validity(unsigned long vm_type,
>>>   					    uint64_t expected_valid_flags)
>>>   {
>>> -	size_t page_size = getpagesize();
>>>   	struct kvm_vm *vm;
>>>   	uint64_t flag = 0;
>>> -	int fd;
>>>   
>>>   	if (!(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(vm_type)))
>>>   		return;
>>> @@ -415,17 +433,11 @@ static void test_vm_type_gmem_flag_validity(unsigned long vm_type,
>>>   	vm = vm_create_barebones_type(vm_type);
>>>   
>>>   	for (flag = BIT(0); flag; flag <<= 1) {
>>> -		fd = __vm_create_guest_memfd(vm, page_size, flag);
>>> +		test_vm_with_gmem_flag(vm, flag, flag & expected_valid_flags);
>>>   
>>> -		if (flag & expected_valid_flags) {
>>> -			TEST_ASSERT(fd > 0,
>>> -				    "guest_memfd() with flag '0x%lx' should be valid",
>>> -				    flag);
>>> -			close(fd);
>>> -		} else {
>>> -			TEST_ASSERT(fd == -1 && errno == EINVAL,
>>> -				    "guest_memfd() with flag '0x%lx' should fail with EINVAL",
>>> -				    flag);
>>> +		if (flag == GUEST_MEMFD_FLAG_SUPPORT_SHARED) {
>>> +			test_vm_with_gmem_flag(
>>> +				vm, flag | GUEST_MEMFD_FLAG_INIT_PRIVATE, true);
>> I don't understand the point of this check.  In 2/51 we set
>> GUEST_MEMFD_FLAG_INIT_PRIVATE when GUEST_MEMFD_FLAG_SUPPORT_SHARED is set.
>>
>> When can this check ever fail?
>>
>> Ira
> In 02/51, GUEST_MEMFD_FLAG_INIT_PRIVATE is not set by default;
> GUEST_MEMFD_FLAG_INIT_PRIVATE is set as one of the valid_flags.
>
> The intention is that GUEST_MEMFD_FLAG_INIT_PRIVATE is only valid if
> GUEST_MEMFD_FLAG_SUPPORT_SHARED is requested by userspace.
>
> In this test, the earlier part before the if block calls
> test_vm_with_gmem_flag() with all the valid flags, and that already tests
> GUEST_MEMFD_FLAG_SUPPORT_SHARED individually.
>
> Specifically, if GUEST_MEMFD_FLAG_SUPPORT_SHARED is set, this if block
> adds a test for when both GUEST_MEMFD_FLAG_SUPPORT_SHARED and
> GUEST_MEMFD_FLAG_INIT_PRIVATE are set, passing expect_valid as true.
Maybe it's clearer to move this case out of the loop?
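
Untested sketch of that restructuring, reusing the helpers from the diff above
(the flag combination is then checked exactly once, after the per-flag loop):

	for (flag = BIT(0); flag; flag <<= 1)
		test_vm_with_gmem_flag(vm, flag, flag & expected_valid_flags);

	/* SHARED alone is covered by the loop; check the combination once. */
	test_vm_with_gmem_flag(vm,
			       GUEST_MEMFD_FLAG_SUPPORT_SHARED |
			       GUEST_MEMFD_FLAG_INIT_PRIVATE,
			       true);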


>
> This second test doesn't fail; it is meant to check that the kernel
> allows the pair of flags to be set. Hope that makes sense.


^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-14 23:41 ` [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls Ackerley Tng
  2025-05-15 14:50   ` Ira Weiny
  2025-05-20  9:22   ` Fuad Tabba
@ 2025-05-28  3:16   ` Binbin Wu
  2025-05-30 20:10     ` Ackerley Tng
  2 siblings, 1 reply; 231+ messages in thread
From: Binbin Wu @ 2025-05-28  3:16 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko, jgg,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li



On 5/15/2025 7:41 AM, Ackerley Tng wrote:

[...]
> +
> +static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
> +				  size_t nr_pages, bool shared,
> +				  pgoff_t *error_index)
> +{
> +	struct conversion_work *work, *tmp, *rollback_stop_item;
> +	LIST_HEAD(work_list);
> +	struct inode *inode;
> +	enum shareability m;
> +	int ret;
> +
> +	inode = file_inode(file);
> +
> +	filemap_invalidate_lock(inode->i_mapping);
> +
> +	m = shared ? SHAREABILITY_ALL : SHAREABILITY_GUEST;
> +	ret = kvm_gmem_convert_compute_work(inode, start, nr_pages, m, &work_list);
> +	if (ret || list_empty(&work_list))
> +		goto out;
> +
> +	list_for_each_entry(work, &work_list, list)
> +		kvm_gmem_convert_invalidate_begin(inode, work);
> +
> +	list_for_each_entry(work, &work_list, list) {
> +		ret = kvm_gmem_convert_should_proceed(inode, work, shared,
> +						      error_index);

Since kvm_gmem_invalidate_begin() now also handles shared memory,
kvm_gmem_convert_invalidate_begin() will zap the mappings.
The shared mappings could therefore be zapped by
kvm_gmem_convert_invalidate_begin() even when
kvm_gmem_convert_should_proceed() returns an error.
The sequence is a bit confusing to me, at least in this patch so far.

> +		if (ret)
> +			goto invalidate_end;
> +	}
> +
> +	list_for_each_entry(work, &work_list, list) {
> +		rollback_stop_item = work;
> +		ret = kvm_gmem_shareability_apply(inode, work, m);
> +		if (ret)
> +			break;
> +	}
> +
> +	if (ret) {
> +		m = shared ? SHAREABILITY_GUEST : SHAREABILITY_ALL;
> +		list_for_each_entry(work, &work_list, list) {
> +			if (work == rollback_stop_item)
> +				break;
> +
> +			WARN_ON(kvm_gmem_shareability_apply(inode, work, m));
> +		}
> +	}
> +
> +invalidate_end:
> +	list_for_each_entry(work, &work_list, list)
> +		kvm_gmem_convert_invalidate_end(inode, work);
> +out:
> +	filemap_invalidate_unlock(inode->i_mapping);
> +
> +	list_for_each_entry_safe(work, tmp, &work_list, list) {
> +		list_del(&work->list);
> +		kfree(work);
> +	}
> +
> +	return ret;
> +}
> +
[...]
> @@ -186,15 +490,26 @@ static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
>   	unsigned long index;
>   
>   	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
> +		enum kvm_gfn_range_filter filter;
>   		pgoff_t pgoff = slot->gmem.pgoff;
>   
> +		filter = KVM_FILTER_PRIVATE;
> +		if (kvm_gmem_memslot_supports_shared(slot)) {
> +			/*
> +			 * Unmapping would also cause invalidation, but cannot
> +			 * rely on mmu_notifiers to do invalidation via
> +			 * unmapping, since memory may not be mapped to
> +			 * userspace.
> +			 */
> +			filter |= KVM_FILTER_SHARED;
> +		}
> +
>   		struct kvm_gfn_range gfn_range = {
>   			.start = slot->base_gfn + max(pgoff, start) - pgoff,
>   			.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
>   			.slot = slot,
>   			.may_block = true,
> -			/* guest memfd is relevant to only private mappings. */
> -			.attr_filter = KVM_FILTER_PRIVATE,
> +			.attr_filter = filter,
>   		};
>   
>   		if (!found_memslot) {
> @@ -484,11 +799,49 @@ EXPORT_SYMBOL_GPL(kvm_gmem_memslot_supports_shared);
>   #define kvm_gmem_mmap NULL
>   #endif /* CONFIG_KVM_GMEM_SHARED_MEM */
>   
[...]

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 05/51] KVM: guest_memfd: Skip LRU for guest_memfd folios
  2025-05-14 23:41 ` [RFC PATCH v2 05/51] KVM: guest_memfd: Skip LRU for guest_memfd folios Ackerley Tng
@ 2025-05-28  7:01   ` Binbin Wu
  2025-05-30 20:32     ` Ackerley Tng
  0 siblings, 1 reply; 231+ messages in thread
From: Binbin Wu @ 2025-05-28  7:01 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko, jgg,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li



On 5/15/2025 7:41 AM, Ackerley Tng wrote:
> filemap_add_folio(), called from filemap_grab_folio(), adds the folio
> onto some LRU list, which is not necessary for guest_memfd since
> guest_memfd folios don't participate in any swapping.
>
> This patch reimplements part of filemap_add_folio() to avoid adding
> allocated guest_memfd folios to the filemap.

filemap -> LRU list? (The folios are still added to the filemap; it's the
LRU that's skipped.)

>
> With shared to private conversions dependent on refcounts, avoiding
> usage of LRU ensures that LRU lists no longer take any refcounts on
> guest_memfd folios and significantly reduces the chance of elevated
> refcounts during conversion.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Change-Id: Ia2540d9fc132d46219e6e714fd42bc82a62a27fa
> ---
>   mm/filemap.c           |  1 +
>   mm/memcontrol.c        |  2 +
>   virt/kvm/guest_memfd.c | 91 ++++++++++++++++++++++++++++++++++++++----
>   3 files changed, 86 insertions(+), 8 deletions(-)
>
[...]
>   /*
>    * Returns a locked folio on success.  The caller is responsible for
>    * setting the up-to-date flag before the memory is mapped into the guest.
> @@ -477,8 +509,46 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
>    */
>   static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>   {
> +	struct folio *folio;
> +	gfp_t gfp;
> +	int ret;
> +
> +repeat:
> +	folio = filemap_lock_folio(inode->i_mapping, index);
> +	if (!IS_ERR(folio))
> +		return folio;
> +
> +	gfp = mapping_gfp_mask(inode->i_mapping);
> +
>   	/* TODO: Support huge pages. */
> -	return filemap_grab_folio(inode->i_mapping, index);
> +	folio = filemap_alloc_folio(gfp, 0);
> +	if (!folio)
> +		return ERR_PTR(-ENOMEM);
> +
> +	ret = mem_cgroup_charge(folio, NULL, gfp);
> +	if (ret) {
> +		folio_put(folio);
> +		return ERR_PTR(ret);
> +	}
> +
> +	ret = kvm_gmem_filemap_add_folio(inode->i_mapping, folio, index);
> +	if (ret) {
> +		folio_put(folio);
> +
> +		/*
> +		 * There was a race, two threads tried to get a folio indexing
> +		 * to the same location in the filemap. The losing thread should
> +		 * free the allocated folio, then lock the folio added to the
> +		 * filemap by the winning thread.

How about changing
"then lock the folio added to the filemap by the winning thread"
to
"the winning thread locks the folio added to the filemap"?

> +		 */
> +		if (ret == -EEXIST)
> +			goto repeat;
> +
> +		return ERR_PTR(ret);
> +	}
> +
> +	__folio_set_locked(folio);
> +	return folio;
>   }
>   
>   static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
> @@ -956,23 +1026,28 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
>   }
>   
>   #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> +static void kvm_gmem_invalidate(struct folio *folio)
> +{
> +	kvm_pfn_t pfn = folio_pfn(folio);
> +
> +	kvm_arch_gmem_invalidate(pfn, pfn + folio_nr_pages(folio));
> +}
> +#else
> +static inline void kvm_gmem_invalidate(struct folio *folio) {}

No need to tag a local static function with "inline".
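
I.e., a sketch of the suggested declaration (same empty body, minus the tag):

	static void kvm_gmem_invalidate(struct folio *folio) {}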

> +#endif
> +
[...]

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 06/51] KVM: Query guest_memfd for private/shared status
  2025-05-27  3:55   ` Yan Zhao
@ 2025-05-28  8:08     ` Binbin Wu
  2025-05-28  9:55       ` Yan Zhao
  0 siblings, 1 reply; 231+ messages in thread
From: Binbin Wu @ 2025-05-28  8:08 UTC (permalink / raw)
  To: Yan Zhao, Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko, jgg,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yilun.xu, yuzenghui,
	zhiquan1.li



On 5/27/2025 11:55 AM, Yan Zhao wrote:
> On Wed, May 14, 2025 at 04:41:45PM -0700, Ackerley Tng wrote:
>> Query guest_memfd for private/shared status if those guest_memfds
>> track private/shared status.
>>
>> With this patch, Coco VMs can use guest_memfd for both shared and
>> private memory. If Coco VMs choose to use guest_memfd for both
>> shared and private memory, by creating guest_memfd with the
>> GUEST_MEMFD_FLAG_SUPPORT_SHARED flag, guest_memfd will be used to
>> provide the private/shared status of the memory, instead of
>> kvm->mem_attr_array.
>>
>> Change-Id: I8f23d7995c12242aa4e09ccf5ec19360e9c9ed83
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> ---
>>   include/linux/kvm_host.h | 19 ++++++++++++-------
>>   virt/kvm/guest_memfd.c   | 22 ++++++++++++++++++++++
>>   2 files changed, 34 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index b317392453a5..91279e05e010 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -2508,12 +2508,22 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
>>   }
>>   
>>   #ifdef CONFIG_KVM_GMEM_SHARED_MEM
>> +
>>   bool kvm_gmem_memslot_supports_shared(const struct kvm_memory_slot *slot);
>> +bool kvm_gmem_is_private(struct kvm_memory_slot *slot, gfn_t gfn);
>> +
>>   #else
>> +
>>   static inline bool kvm_gmem_memslot_supports_shared(const struct kvm_memory_slot *slot)
>>   {
>>   	return false;
>>   }
>> +
>> +static inline bool kvm_gmem_is_private(struct kvm_memory_slot *slot, gfn_t gfn)
>> +{
>> +	return false;
>> +}
>> +
>>   #endif /* CONFIG_KVM_GMEM_SHARED_MEM */
>>   
>>   #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
>> @@ -2544,13 +2554,8 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>>   		return false;
>>   
>>   	slot = gfn_to_memslot(kvm, gfn);
>> -	if (kvm_slot_has_gmem(slot) && kvm_gmem_memslot_supports_shared(slot)) {
>> -		/*
>> -		 * For now, memslots only support in-place shared memory if the
>> -		 * host is allowed to mmap memory (i.e., non-Coco VMs).
>> -		 */
>> -		return false;
>> -	}
>> +	if (kvm_slot_has_gmem(slot) && kvm_gmem_memslot_supports_shared(slot))
>> +		return kvm_gmem_is_private(slot, gfn);
> When userspace gets an exit reason KVM_EXIT_MEMORY_FAULT, looks it needs to
> update both KVM memory attribute and gmem shareability, via two separate ioctls?
IIUC, when userspace sets the GUEST_MEMFD_FLAG_SUPPORT_SHARED flag to create
the guest_memfd, the memory attribute check goes through guest_memfd, and the
information in kvm->mem_attr_array is not used.

So if userspace sets GUEST_MEMFD_FLAG_SUPPORT_SHARED, it uses
KVM_GMEM_CONVERT_SHARED/PRIVATE to update gmem shareability.
If userspace doesn't set GUEST_MEMFD_FLAG_SUPPORT_SHARED, it still uses
KVM_SET_MEMORY_ATTRIBUTES to update KVM's memory attribute tracking.
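
A rough userspace-side sketch of the two paths (illustrative only; gmem_fd,
vm_fd, gpa and size are placeholders, and the conversion ioctl's argument
struct is abbreviated here, see patch 04 for the actual layout):

	if (gmem_flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED) {
		/* Shareability is tracked by guest_memfd itself. */
		ioctl(gmem_fd, KVM_GMEM_CONVERT_PRIVATE, &convert_args);
	} else {
		/* Legacy path: kvm->mem_attr_array, updated via the VM fd. */
		struct kvm_memory_attributes attrs = {
			.address = gpa,
			.size = size,
			.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
		};

		ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
	}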


>
>
>>   	return kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
>>   }



^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 16/51] mm: hugetlb: Consolidate interpretation of gbl_chg within alloc_hugetlb_folio()
  2025-05-14 23:41 ` [RFC PATCH v2 16/51] mm: hugetlb: Consolidate interpretation of gbl_chg within alloc_hugetlb_folio() Ackerley Tng
  2025-05-15  2:09   ` Matthew Wilcox
@ 2025-05-28  8:55   ` Binbin Wu
  2025-07-07 18:27   ` James Houghton
  2 siblings, 0 replies; 231+ messages in thread
From: Binbin Wu @ 2025-05-28  8:55 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko, jgg,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li



On 5/15/2025 7:41 AM, Ackerley Tng wrote:
> Previously, gbl_chg was passed from alloc_hugetlb_folio() into
> dequeue_hugetlb_folio_vma(), leaking the concept of gbl_chg into
> dequeue_hugetlb_folio_vma().
>
> This patch consolidates the interpretation of gbl_chg into
> alloc_hugetlb_folio(), also renaming dequeue_hugetlb_folio_vma() to
> dequeue_hugetlb_folio() so dequeue_hugetlb_folio() can just focus on
> dequeuing a folio.
>
> Change-Id: I31bf48af2400b6e13b44d03c8be22ce1a9092a9c
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>   mm/hugetlb.c | 28 +++++++++++-----------------
>   1 file changed, 11 insertions(+), 17 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6ea1be71aa42..b843e869496f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1364,9 +1364,9 @@ static unsigned long available_huge_pages(struct hstate *h)
>   	return h->free_huge_pages - h->resv_huge_pages;
>   }
>   
> -static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
> -				struct vm_area_struct *vma,
> -				unsigned long address, long gbl_chg)
> +static struct folio *dequeue_hugetlb_folio(struct hstate *h,
> +					   struct vm_area_struct *vma,
> +					   unsigned long address)

The rename doesn't seem to be needed in this patch, since the function still
takes vma and uses it. It may be better to move the rename to a later patch.

>   {
>   	struct folio *folio = NULL;
>   	struct mempolicy *mpol;
> @@ -1374,13 +1374,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
>   	nodemask_t *nodemask;
>   	int nid;
>   
> -	/*
> -	 * gbl_chg==1 means the allocation requires a new page that was not
> -	 * reserved before.  Making sure there's at least one free page.
> -	 */
> -	if (gbl_chg && !available_huge_pages(h))
> -		goto err;
> -
>   	gfp_mask = htlb_alloc_mask(h);
>   	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
>   
> @@ -1398,9 +1391,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
>   
>   	mpol_cond_put(mpol);
>   	return folio;
> -
> -err:
> -	return NULL;
>   }
>   
>   /*
> @@ -3074,12 +3064,16 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>   		goto out_uncharge_cgroup_reservation;
>   
>   	spin_lock_irq(&hugetlb_lock);
> +
>   	/*
> -	 * glb_chg is passed to indicate whether or not a page must be taken
> -	 * from the global free pool (global change).  gbl_chg == 0 indicates
> -	 * a reservation exists for the allocation.
> +	 * gbl_chg == 0 indicates a reservation exists for the allocation - so
> +	 * try dequeuing a page. If there are available_huge_pages(), try using
> +	 * them!
>   	 */
> -	folio = dequeue_hugetlb_folio_vma(h, vma, addr, gbl_chg);
> +	folio = NULL;
> +	if (!gbl_chg || available_huge_pages(h))
> +		folio = dequeue_hugetlb_folio(h, vma, addr);
> +
>   	if (!folio) {
>   		spin_unlock_irq(&hugetlb_lock);
>   		folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);


^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 06/51] KVM: Query guest_memfd for private/shared status
  2025-05-28  8:08     ` Binbin Wu
@ 2025-05-28  9:55       ` Yan Zhao
  0 siblings, 0 replies; 231+ messages in thread
From: Yan Zhao @ 2025-05-28  9:55 UTC (permalink / raw)
  To: Binbin Wu
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yilun.xu, yuzenghui,
	zhiquan1.li

On Wed, May 28, 2025 at 04:08:34PM +0800, Binbin Wu wrote:
> 
> 
> On 5/27/2025 11:55 AM, Yan Zhao wrote:
> > On Wed, May 14, 2025 at 04:41:45PM -0700, Ackerley Tng wrote:
> > > Query guest_memfd for private/shared status if those guest_memfds
> > > track private/shared status.
> > > 
> > > With this patch, Coco VMs can use guest_memfd for both shared and
> > > private memory. If Coco VMs choose to use guest_memfd for both
> > > shared and private memory, by creating guest_memfd with the
> > > GUEST_MEMFD_FLAG_SUPPORT_SHARED flag, guest_memfd will be used to
> > > provide the private/shared status of the memory, instead of
> > > kvm->mem_attr_array.
> > > 
> > > Change-Id: I8f23d7995c12242aa4e09ccf5ec19360e9c9ed83
> > > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > > ---
> > >   include/linux/kvm_host.h | 19 ++++++++++++-------
> > >   virt/kvm/guest_memfd.c   | 22 ++++++++++++++++++++++
> > >   2 files changed, 34 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index b317392453a5..91279e05e010 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -2508,12 +2508,22 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> > >   }
> > >   #ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > > +
> > >   bool kvm_gmem_memslot_supports_shared(const struct kvm_memory_slot *slot);
> > > +bool kvm_gmem_is_private(struct kvm_memory_slot *slot, gfn_t gfn);
> > > +
> > >   #else
> > > +
> > >   static inline bool kvm_gmem_memslot_supports_shared(const struct kvm_memory_slot *slot)
> > >   {
> > >   	return false;
> > >   }
> > > +
> > > +static inline bool kvm_gmem_is_private(struct kvm_memory_slot *slot, gfn_t gfn)
> > > +{
> > > +	return false;
> > > +}
> > > +
> > >   #endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> > >   #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> > > @@ -2544,13 +2554,8 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > >   		return false;
> > >   	slot = gfn_to_memslot(kvm, gfn);
> > > -	if (kvm_slot_has_gmem(slot) && kvm_gmem_memslot_supports_shared(slot)) {
> > > -		/*
> > > -		 * For now, memslots only support in-place shared memory if the
> > > -		 * host is allowed to mmap memory (i.e., non-Coco VMs).
> > > -		 */
> > > -		return false;
> > > -	}
> > > +	if (kvm_slot_has_gmem(slot) && kvm_gmem_memslot_supports_shared(slot))
> > > +		return kvm_gmem_is_private(slot, gfn);
> > When userspace gets an exit reason KVM_EXIT_MEMORY_FAULT, looks it needs to
> > update both KVM memory attribute and gmem shareability, via two separate ioctls?
> IIUC, when userspace sets flag GUEST_MEMFD_FLAG_SUPPORT_SHARED to create the
> guest_memfd, the check for memory attribute will go through the guest_memfd way,
> the information in kvm->mem_attr_array will not be used.
> 
> So if userspace sets GUEST_MEMFD_FLAG_SUPPORT_SHARED, it uses
> KVM_GMEM_CONVERT_SHARED/PRIVATE to update gmem shareability.
> If userspace doesn't set GUEST_MEMFD_FLAG_SUPPORT_SHARED, it still uses
> KVM_SET_MEMORY_ATTRIBUTES to update KVM memory attribute tracking.
Ok, so userspace needs to look up the memory region and its guest_memfd to
choose the right ioctl.
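
To illustrate that selection, a rough userspace sketch (the conversion ioctl's
argument struct isn't shown in this patch, so convert_args is just a
placeholder, and the sketch assumes the conversion ioctl is issued on the
guest_memfd fd):

if (gmem_flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED) {
	/* Shareability is tracked by guest_memfd: convert via the gmem fd. */
	ioctl(gmem_fd, KVM_GMEM_CONVERT_PRIVATE, &convert_args);
} else {
	/* Legacy tracking in kvm->mem_attr_array: update it on the VM fd. */
	struct kvm_memory_attributes attrs = {
		.address    = gpa,
		.size       = size,
		.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
	};

	ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}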

For slots whose guest_memfd was created with GUEST_MEMFD_FLAG_SUPPORT_SHARED,
the KVM_LPAGE_MIXED_FLAG bit in lpage_info can no longer reflect the true
state, and an incorrect value there may also prevent KVM from installing a
huge page.

> > 
> > 
> > >   	return kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> > >   }
> 
> 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 33/51] KVM: guest_memfd: Allocate and truncate from custom allocator
  2025-05-14 23:42 ` [RFC PATCH v2 33/51] KVM: guest_memfd: Allocate and truncate from " Ackerley Tng
  2025-05-21 18:05   ` Vishal Annapurve
  2025-05-22 23:12   ` Edgecombe, Rick P
@ 2025-05-28 10:58   ` Yan Zhao
  2025-06-03  7:43   ` Binbin Wu
  3 siblings, 0 replies; 231+ messages in thread
From: Yan Zhao @ 2025-05-28 10:58 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yilun.xu, yuzenghui,
	zhiquan1.li

On Wed, May 14, 2025 at 04:42:12PM -0700, Ackerley Tng wrote:
> If a custom allocator is requested at guest_memfd creation time, pages
> from the custom allocator will be used to back guest_memfd.
> 
> Change-Id: I59df960b3273790f42fe5bea54a234f40962eb75
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  mm/memory.c            |   1 +
>  virt/kvm/guest_memfd.c | 142 +++++++++++++++++++++++++++++++++++++----
>  2 files changed, 132 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index ba3ea0a82f7f..3af45e96913c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -7249,6 +7249,7 @@ void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>  	else
>  		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
>  }
> +EXPORT_SYMBOL_GPL(folio_zero_user);
>  
>  static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
>  				   unsigned long addr_hint,
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index c65d93c5a443..24d270b9b725 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -478,15 +478,13 @@ static inline void kvm_gmem_mark_prepared(struct folio *folio)
>   * leaking host data and the up-to-date flag is set.
>   */
>  static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
> -				  gfn_t gfn, struct folio *folio)
> +				  gfn_t gfn, struct folio *folio,
> +				  unsigned long addr_hint)
>  {
> -	unsigned long nr_pages, i;
>  	pgoff_t index;
>  	int r;
>  
> -	nr_pages = folio_nr_pages(folio);
> -	for (i = 0; i < nr_pages; i++)
> -		clear_highpage(folio_page(folio, i));
> +	folio_zero_user(folio, addr_hint);
>  
>  	/*
>  	 * Preparing huge folios should always be safe, since it should
> @@ -554,7 +552,9 @@ static int kvm_gmem_filemap_add_folio(struct address_space *mapping,
>   */
>  static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>  {
> +	size_t allocated_size;
>  	struct folio *folio;
> +	pgoff_t index_floor;
>  	int ret;
>  
>  repeat:
> @@ -581,8 +581,10 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>  			return ERR_PTR(ret);
>  		}
>  	}
> +	allocated_size = folio_size(folio);
>  
> -	ret = kvm_gmem_filemap_add_folio(inode->i_mapping, folio, index);
> +	index_floor = round_down(index, folio_nr_pages(folio));
> +	ret = kvm_gmem_filemap_add_folio(inode->i_mapping, folio, index_floor);
>  	if (ret) {
>  		folio_put(folio);
>  
> @@ -598,7 +600,17 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>  		return ERR_PTR(ret);
>  	}
>  
> -	__folio_set_locked(folio);
> +	spin_lock(&inode->i_lock);
> +	inode->i_blocks += allocated_size / 512;
> +	spin_unlock(&inode->i_lock);
> +
> +	/*
> +	 * folio is the one that is allocated, this gets the folio at the
> +	 * requested index.
> +	 */
> +	folio = page_folio(folio_file_page(folio, index));
> +	folio_lock(folio);
> +
>  	return folio;
>  }
>  
> @@ -736,6 +748,92 @@ static void kvm_gmem_truncate_inode_aligned_pages(struct inode *inode,
>  	spin_unlock(&inode->i_lock);
>  }
>  
> +/**
> + * kvm_gmem_zero_range() - Zeroes all sub-pages in range [@start, @end).
> + *
> + * @mapping: the filemap to remove this range from.
> + * @start: index in filemap for start of range (inclusive).
> + * @end: index in filemap for end of range (exclusive).
> + *
> + * The pages in range may be split. truncate_inode_pages_range() isn't the right
> + * function because it removes pages from the page cache; this function only
> + * zeroes the pages.
> + */
> +static void kvm_gmem_zero_range(struct address_space *mapping,
> +				pgoff_t start, pgoff_t end)
> +{
> +	struct folio_batch fbatch;
> +
> +	folio_batch_init(&fbatch);
> +	while (filemap_get_folios(mapping, &start, end - 1, &fbatch)) {
> +		unsigned int i;
> +
> +		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> +			struct folio *f;
> +			size_t nr_bytes;
> +
> +			f = fbatch.folios[i];
> +			nr_bytes = offset_in_folio(f, end << PAGE_SHIFT);
> +			if (nr_bytes == 0)
> +				nr_bytes = folio_size(f);
> +
Is folio_lock() required here?

> +			folio_zero_segment(f, 0, nr_bytes);
> +		}
> +
> +		folio_batch_release(&fbatch);
> +		cond_resched();
> +	}
> +}
> +
> +/**
> + * kvm_gmem_truncate_inode_range() - Truncate pages in range [@lstart, @lend).
> + *
> + * @inode: inode to truncate from.
> + * @lstart: offset in inode for start of range (inclusive).
> + * @lend: offset in inode for end of range (exclusive).
> + *
> + * Removes full (huge)pages from the filemap and zeroing incomplete
> + * (huge)pages. The pages in the range may be split.
> + */
> +static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
> +					  loff_t lend)
> +{
> +	pgoff_t full_hpage_start;
> +	size_t nr_per_huge_page;
> +	pgoff_t full_hpage_end;
> +	size_t nr_pages;
> +	pgoff_t start;
> +	pgoff_t end;
> +	void *priv;
> +
> +	priv = kvm_gmem_allocator_private(inode);
> +	nr_per_huge_page = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
> +
> +	start = lstart >> PAGE_SHIFT;
> +	end = min(lend, i_size_read(inode)) >> PAGE_SHIFT;
> +
> +	full_hpage_start = round_up(start, nr_per_huge_page);
> +	full_hpage_end = round_down(end, nr_per_huge_page);
> +
> +	if (start < full_hpage_start) {
> +		pgoff_t zero_end = min(full_hpage_start, end);
> +
> +		kvm_gmem_zero_range(inode->i_mapping, start, zero_end);
> +	}
> +
> +	if (full_hpage_end > full_hpage_start) {
> +		nr_pages = full_hpage_end - full_hpage_start;
> +		kvm_gmem_truncate_inode_aligned_pages(inode, full_hpage_start,
> +						      nr_pages);
> +	}
> +
> +	if (end > full_hpage_end && end > full_hpage_start) {
> +		pgoff_t zero_start = max(full_hpage_end, start);
> +
> +		kvm_gmem_zero_range(inode->i_mapping, zero_start, end);
> +	}
> +}
> +
>  static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>  {
>  	struct list_head *gmem_list = &inode->i_mapping->i_private_list;
> @@ -752,7 +850,12 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>  	list_for_each_entry(gmem, gmem_list, entry)
>  		kvm_gmem_invalidate_begin(gmem, start, end);
kvm_gmem_punch_hole() can be triggered by a userspace madvise() operation, e.g.,

mem = mmap(NULL, test_page_size, PROT_READ | PROT_WRITE, MAP_SHARED, guest_memfd, 0);
madvise(mem, test_page_size, MADV_REMOVE);

Since the mmap'ed VA covers only shared memory, it seems that madvise() on a
VA range should only affect shared memory. However, kvm_gmem_punch_hole()
indiscriminately truncates or zeroes any memory in the range.



> -	truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
> +	if (kvm_gmem_has_custom_allocator(inode)) {
> +		kvm_gmem_truncate_inode_range(inode, offset, offset + len);
> +	} else {
> +		/* Page size is PAGE_SIZE, so use optimized truncation function. */
> +		truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
> +	}
>  
>  	list_for_each_entry(gmem, gmem_list, entry)
>  		kvm_gmem_invalidate_end(gmem, start, end);
> @@ -776,6 +879,16 @@ static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
>  
>  	start = offset >> PAGE_SHIFT;
>  	end = (offset + len) >> PAGE_SHIFT;
> +	if (kvm_gmem_has_custom_allocator(inode)) {
> +		size_t nr_pages;
> +		void *p;
> +
> +		p = kvm_gmem_allocator_private(inode);
> +		nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(p);
> +
> +		start = round_down(start, nr_pages);
> +		end = round_down(end, nr_pages);
> +	}
>  
>  	r = 0;
>  	for (index = start; index < end; ) {
> @@ -1570,7 +1683,7 @@ static struct folio *__kvm_gmem_get_pfn(struct file *file,
>  
>  	*pfn = folio_file_pfn(folio, index);
>  	if (max_order)
> -		*max_order = 0;
> +		*max_order = folio_order(folio);
>  
>  	*is_prepared = folio_test_uptodate(folio);
>  	return folio;
> @@ -1597,8 +1710,15 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>  		goto out;
>  	}
>  
> -	if (!is_prepared)
> -		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> +	if (!is_prepared) {
> +		/*
> +		 * Use the same address as hugetlb for zeroing private pages
> +		 * that won't be mapped to userspace anyway.
> +		 */
> +		unsigned long addr_hint = folio->index << PAGE_SHIFT;
> +
> +		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio, addr_hint);
> +	}
>  
>  	folio_unlock(folio);
>  
> -- 
> 2.49.0.1045.g170613ef41-goog
> 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 39/51] KVM: guest_memfd: Merge and truncate on fallocate(PUNCH_HOLE)
  2025-05-14 23:42 ` [RFC PATCH v2 39/51] KVM: guest_memfd: Merge and truncate on fallocate(PUNCH_HOLE) Ackerley Tng
@ 2025-05-28 11:00   ` Yan Zhao
  2025-05-28 16:39     ` Ackerley Tng
  0 siblings, 1 reply; 231+ messages in thread
From: Yan Zhao @ 2025-05-28 11:00 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yilun.xu, yuzenghui,
	zhiquan1.li

On Wed, May 14, 2025 at 04:42:18PM -0700, Ackerley Tng wrote:
> Merge and truncate on fallocate(PUNCH_HOLE), but if the file is being
> closed, defer merging to folio_put() callback.
> 
> Change-Id: Iae26987756e70c83f3b121edbc0ed0bc105eec0d
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  virt/kvm/guest_memfd.c | 76 +++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 68 insertions(+), 8 deletions(-)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index cb426c1dfef8..04b1513c2998 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -859,6 +859,35 @@ static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
>  	return ret;
>  }
>  
> +static long kvm_gmem_merge_truncate_indices(struct inode *inode, pgoff_t index,
> +					   size_t nr_pages)
> +{
> +	struct folio *f;
> +	pgoff_t unused;
> +	long num_freed;
> +
> +	unmap_mapping_pages(inode->i_mapping, index, nr_pages, false);
> +
> +	if (!kvm_gmem_has_safe_refcount(inode->i_mapping, index, nr_pages, &unused))
Why is kvm_gmem_has_safe_refcount() checked here, but not in
kvm_gmem_zero_range() within kvm_gmem_truncate_inode_range() in patch 33?

> +		return -EAGAIN;
> +

Rather than merging the folios, could we simply call kvm_gmem_truncate_indices()
instead?

num_freed = kvm_gmem_truncate_indices(inode->i_mapping, index, nr_pages);
return num_freed;

> +	f = filemap_get_folio(inode->i_mapping, index);
> +	if (IS_ERR(f))
> +		return 0;
> +
> +	/* Leave just filemap's refcounts on the folio. */
> +	folio_put(f);
> +
> +	WARN_ON(kvm_gmem_merge_folio_in_filemap(inode, f));
> +
> +	num_freed = folio_nr_pages(f);
> +	folio_lock(f);
> +	truncate_inode_folio(inode->i_mapping, f);
> +	folio_unlock(f);
> +
> +	return num_freed;
> +}
> +
>  #else
>  
>  static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
> @@ -874,6 +903,12 @@ static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
>  	return 0;
>  }
>  
> +static long kvm_gmem_merge_truncate_indices(struct inode *inode, pgoff_t index,
> +					   size_t nr_pages)
> +{
> +	return 0;
> +}
> +
>  #endif
>  
>  #else
> @@ -1182,8 +1217,10 @@ static long kvm_gmem_truncate_indices(struct address_space *mapping,
>   *
>   * Removes folios beginning @index for @nr_pages from filemap in @inode, updates
>   * inode metadata.
> + *
> + * Return: 0 on success and negative error otherwise.
>   */
> -static void kvm_gmem_truncate_inode_aligned_pages(struct inode *inode,
> +static long kvm_gmem_truncate_inode_aligned_pages(struct inode *inode,
>  						  pgoff_t index,
>  						  size_t nr_pages)
>  {
> @@ -1191,19 +1228,34 @@ static void kvm_gmem_truncate_inode_aligned_pages(struct inode *inode,
>  	long num_freed;
>  	pgoff_t idx;
>  	void *priv;
> +	long ret;
>  
>  	priv = kvm_gmem_allocator_private(inode);
>  	nr_per_huge_page = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
>  
> +	ret = 0;
>  	num_freed = 0;
>  	for (idx = index; idx < index + nr_pages; idx += nr_per_huge_page) {
> -		num_freed += kvm_gmem_truncate_indices(
> -			inode->i_mapping, idx, nr_per_huge_page);
> +		if (mapping_exiting(inode->i_mapping) ||
> +		    !kvm_gmem_has_some_shared(inode, idx, nr_per_huge_page)) {
> +			num_freed += kvm_gmem_truncate_indices(
> +				inode->i_mapping, idx, nr_per_huge_page);
> +		} else {
> +			ret = kvm_gmem_merge_truncate_indices(inode, idx,
> +							      nr_per_huge_page);
> +			if (ret < 0)
> +				break;
> +
> +			num_freed += ret;
> +			ret = 0;
> +		}
>  	}
>  
>  	spin_lock(&inode->i_lock);
>  	inode->i_blocks -= (num_freed << PAGE_SHIFT) / 512;
>  	spin_unlock(&inode->i_lock);
> +
> +	return ret;
>  }
>  
>  /**
> @@ -1252,8 +1304,10 @@ static void kvm_gmem_zero_range(struct address_space *mapping,
>   *
>   * Removes full (huge)pages from the filemap and zeroing incomplete
>   * (huge)pages. The pages in the range may be split.
> + *
> + * Return: 0 on success and negative error otherwise.
>   */
> -static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
> +static long kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
>  					  loff_t lend)
>  {
>  	pgoff_t full_hpage_start;
> @@ -1263,6 +1317,7 @@ static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
>  	pgoff_t start;
>  	pgoff_t end;
>  	void *priv;
> +	long ret;
>  
>  	priv = kvm_gmem_allocator_private(inode);
>  	nr_per_huge_page = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
> @@ -1279,10 +1334,11 @@ static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
>  		kvm_gmem_zero_range(inode->i_mapping, start, zero_end);
>  	}
>  
> +	ret = 0;
>  	if (full_hpage_end > full_hpage_start) {
>  		nr_pages = full_hpage_end - full_hpage_start;
> -		kvm_gmem_truncate_inode_aligned_pages(inode, full_hpage_start,
> -						      nr_pages);
> +		ret = kvm_gmem_truncate_inode_aligned_pages(
> +			inode, full_hpage_start, nr_pages);
>  	}
>  
>  	if (end > full_hpage_end && end > full_hpage_start) {
> @@ -1290,6 +1346,8 @@ static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
>  
>  		kvm_gmem_zero_range(inode->i_mapping, zero_start, end);
>  	}
> +
> +	return ret;
>  }
>  
>  static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> @@ -1298,6 +1356,7 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>  	pgoff_t start = offset >> PAGE_SHIFT;
>  	pgoff_t end = (offset + len) >> PAGE_SHIFT;
>  	struct kvm_gmem *gmem;
> +	long ret;
>  
>  	/*
>  	 * Bindings must be stable across invalidation to ensure the start+end
> @@ -1308,8 +1367,9 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>  	list_for_each_entry(gmem, gmem_list, entry)
>  		kvm_gmem_invalidate_begin_and_zap(gmem, start, end);
>  
> +	ret = 0;
>  	if (kvm_gmem_has_custom_allocator(inode)) {
> -		kvm_gmem_truncate_inode_range(inode, offset, offset + len);
> +		ret = kvm_gmem_truncate_inode_range(inode, offset, offset + len);
>  	} else {
>  		/* Page size is PAGE_SIZE, so use optimized truncation function. */
>  		truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
> @@ -1320,7 +1380,7 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>  
>  	filemap_invalidate_unlock(inode->i_mapping);
>  
> -	return 0;
> +	return ret;
>  }
>  
>  static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
> -- 
> 2.49.0.1045.g170613ef41-goog
> 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 39/51] KVM: guest_memfd: Merge and truncate on fallocate(PUNCH_HOLE)
  2025-05-28 11:00   ` Yan Zhao
@ 2025-05-28 16:39     ` Ackerley Tng
  2025-05-29  3:26       ` Yan Zhao
  0 siblings, 1 reply; 231+ messages in thread
From: Ackerley Tng @ 2025-05-28 16:39 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yilun.xu, yuzenghui,
	zhiquan1.li

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Wed, May 14, 2025 at 04:42:18PM -0700, Ackerley Tng wrote:
>> Merge and truncate on fallocate(PUNCH_HOLE), but if the file is being
>> closed, defer merging to folio_put() callback.
>> 
>> Change-Id: Iae26987756e70c83f3b121edbc0ed0bc105eec0d
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> ---
>>  virt/kvm/guest_memfd.c | 76 +++++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 68 insertions(+), 8 deletions(-)
>> 
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index cb426c1dfef8..04b1513c2998 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -859,6 +859,35 @@ static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
>>  	return ret;
>>  }
>>  
>> +static long kvm_gmem_merge_truncate_indices(struct inode *inode, pgoff_t index,
>> +					   size_t nr_pages)
>> +{
>> +	struct folio *f;
>> +	pgoff_t unused;
>> +	long num_freed;
>> +
>> +	unmap_mapping_pages(inode->i_mapping, index, nr_pages, false);
>> +
>> +	if (!kvm_gmem_has_safe_refcount(inode->i_mapping, index, nr_pages, &unused))

Yan, thank you for your reviews!

> Why is kvm_gmem_has_safe_refcount() checked here, but not in
> kvm_gmem_zero_range() within kvm_gmem_truncate_inode_range() in patch 33?
>

The contract for guest_memfd with HugeTLB pages is that if holes are
punched in any ranges less than a full huge page, no pages are removed
from the filemap. Those ranges are only zeroed.

In kvm_gmem_zero_range(), we never remove any folios, and so there is no
need to merge. If there's no need to merge, then we don't need to check
for a safe refcount, and can just proceed to zero.
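
As a rough userspace illustration of that contract (assuming a guest_memfd
backed by 2M pages; the offsets/sizes are only for the example):

/* Only zeroes the first 4K; the 2M folio stays in the filemap. */
fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, 0x1000);

/* Covers a whole aligned huge page, so the folio is removed and freed. */
fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, 0x200000);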

kvm_gmem_merge_truncate_indices() is only used during hole punching and
not when the file is closed. Hole punch vs file closure is checked using
mapping_exiting(inode->i_mapping).

During a hole punch, we will only allow truncation if there are no
unexpected refcounts on any subpages, hence this
kvm_gmem_has_safe_refcount() check.

>> +		return -EAGAIN;
>> +
>
> Rather than merging the folios, could we simply call kvm_gmem_truncate_indices()
> instead?
>
> num_freed = kvm_gmem_truncate_indices(inode->i_mapping, index, nr_pages);
> return num_freed;
>

We could do this too, but then that would be deferring the huge page
merging to the folio_put() callback and eventually the kernel worker
thread.

My goal here is to avoid deferring merging and freeing as much as possible,
so that most of the page/memory operations are synchronous, because
synchronous operations are more predictable.

As an example of improving predictability, in one of the selftests, I do
a hole punch and then try to allocate again. Because the merging and
freeing of the HugeTLB page sometimes takes too long, the allocation
sometimes fails: the guest_memfd's subpool hadn't yet received the freed
page back. With a synchronous truncation, the truncation may take
longer, but the selftest predictably passes.
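
Roughly, the selftest scenario is (sketch only; gmem_fd and hugepage_size are
placeholders):

/* Punch a full, aligned huge page... */
fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
	  0, hugepage_size);

/*
 * ...then immediately allocate it again. With deferred merging/freeing the
 * subpool may not have gotten the page back yet and this can fail; with
 * synchronous truncation it reliably succeeds.
 */
ret = fallocate(gmem_fd, FALLOC_FL_KEEP_SIZE, 0, hugepage_size);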

>> +	f = filemap_get_folio(inode->i_mapping, index);
>> +	if (IS_ERR(f))
>> +		return 0;
>> +
>> +	/* Leave just filemap's refcounts on the folio. */
>> +	folio_put(f);
>> +
>> +	WARN_ON(kvm_gmem_merge_folio_in_filemap(inode, f));
>> +
>> +	num_freed = folio_nr_pages(f);
>> +	folio_lock(f);
>> +	truncate_inode_folio(inode->i_mapping, f);
>> +	folio_unlock(f);
>> +
>> +	return num_freed;
>> +}
>> +
>>  #else
>>  
>>  static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
>> @@ -874,6 +903,12 @@ static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
>>  	return 0;
>>  }
>>  
>> +static long kvm_gmem_merge_truncate_indices(struct inode *inode, pgoff_t index,
>> +					   size_t nr_pages)
>> +{
>> +	return 0;
>> +}
>> +
>>  #endif
>>  
>>  #else
>> @@ -1182,8 +1217,10 @@ static long kvm_gmem_truncate_indices(struct address_space *mapping,
>>   *
>>   * Removes folios beginning @index for @nr_pages from filemap in @inode, updates
>>   * inode metadata.
>> + *
>> + * Return: 0 on success and negative error otherwise.
>>   */
>> -static void kvm_gmem_truncate_inode_aligned_pages(struct inode *inode,
>> +static long kvm_gmem_truncate_inode_aligned_pages(struct inode *inode,
>>  						  pgoff_t index,
>>  						  size_t nr_pages)
>>  {
>> @@ -1191,19 +1228,34 @@ static void kvm_gmem_truncate_inode_aligned_pages(struct inode *inode,
>>  	long num_freed;
>>  	pgoff_t idx;
>>  	void *priv;
>> +	long ret;
>>  
>>  	priv = kvm_gmem_allocator_private(inode);
>>  	nr_per_huge_page = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
>>  
>> +	ret = 0;
>>  	num_freed = 0;
>>  	for (idx = index; idx < index + nr_pages; idx += nr_per_huge_page) {
>> -		num_freed += kvm_gmem_truncate_indices(
>> -			inode->i_mapping, idx, nr_per_huge_page);
>> +		if (mapping_exiting(inode->i_mapping) ||
>> +		    !kvm_gmem_has_some_shared(inode, idx, nr_per_huge_page)) {
>> +			num_freed += kvm_gmem_truncate_indices(
>> +				inode->i_mapping, idx, nr_per_huge_page);
>> +		} else {
>> +			ret = kvm_gmem_merge_truncate_indices(inode, idx,
>> +							      nr_per_huge_page);
>> +			if (ret < 0)
>> +				break;
>> +
>> +			num_freed += ret;
>> +			ret = 0;
>> +		}
>>  	}
>>  
>>  	spin_lock(&inode->i_lock);
>>  	inode->i_blocks -= (num_freed << PAGE_SHIFT) / 512;
>>  	spin_unlock(&inode->i_lock);
>> +
>> +	return ret;
>>  }
>>  
>>  /**
>> @@ -1252,8 +1304,10 @@ static void kvm_gmem_zero_range(struct address_space *mapping,
>>   *
>>   * Removes full (huge)pages from the filemap and zeroing incomplete
>>   * (huge)pages. The pages in the range may be split.
>> + *
>> + * Return: 0 on success and negative error otherwise.
>>   */
>> -static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
>> +static long kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
>>  					  loff_t lend)
>>  {
>>  	pgoff_t full_hpage_start;
>> @@ -1263,6 +1317,7 @@ static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
>>  	pgoff_t start;
>>  	pgoff_t end;
>>  	void *priv;
>> +	long ret;
>>  
>>  	priv = kvm_gmem_allocator_private(inode);
>>  	nr_per_huge_page = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
>> @@ -1279,10 +1334,11 @@ static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
>>  		kvm_gmem_zero_range(inode->i_mapping, start, zero_end);
>>  	}
>>  
>> +	ret = 0;
>>  	if (full_hpage_end > full_hpage_start) {
>>  		nr_pages = full_hpage_end - full_hpage_start;
>> -		kvm_gmem_truncate_inode_aligned_pages(inode, full_hpage_start,
>> -						      nr_pages);
>> +		ret = kvm_gmem_truncate_inode_aligned_pages(
>> +			inode, full_hpage_start, nr_pages);
>>  	}
>>  
>>  	if (end > full_hpage_end && end > full_hpage_start) {
>> @@ -1290,6 +1346,8 @@ static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
>>  
>>  		kvm_gmem_zero_range(inode->i_mapping, zero_start, end);
>>  	}
>> +
>> +	return ret;
>>  }
>>  
>>  static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>> @@ -1298,6 +1356,7 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>>  	pgoff_t start = offset >> PAGE_SHIFT;
>>  	pgoff_t end = (offset + len) >> PAGE_SHIFT;
>>  	struct kvm_gmem *gmem;
>> +	long ret;
>>  
>>  	/*
>>  	 * Bindings must be stable across invalidation to ensure the start+end
>> @@ -1308,8 +1367,9 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>>  	list_for_each_entry(gmem, gmem_list, entry)
>>  		kvm_gmem_invalidate_begin_and_zap(gmem, start, end);
>>  
>> +	ret = 0;
>>  	if (kvm_gmem_has_custom_allocator(inode)) {
>> -		kvm_gmem_truncate_inode_range(inode, offset, offset + len);
>> +		ret = kvm_gmem_truncate_inode_range(inode, offset, offset + len);
>>  	} else {
>>  		/* Page size is PAGE_SIZE, so use optimized truncation function. */
>>  		truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
>> @@ -1320,7 +1380,7 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>>  
>>  	filemap_invalidate_unlock(inode->i_mapping);
>>  
>> -	return 0;
>> +	return ret;
>>  }
>>  
>>  static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
>> -- 
>> 2.49.0.1045.g170613ef41-goog
>> 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 39/51] KVM: guest_memfd: Merge and truncate on fallocate(PUNCH_HOLE)
  2025-05-28 16:39     ` Ackerley Tng
@ 2025-05-29  3:26       ` Yan Zhao
  0 siblings, 0 replies; 231+ messages in thread
From: Yan Zhao @ 2025-05-29  3:26 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yilun.xu, yuzenghui,
	zhiquan1.li

On Wed, May 28, 2025 at 09:39:35AM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Wed, May 14, 2025 at 04:42:18PM -0700, Ackerley Tng wrote:
> >> Merge and truncate on fallocate(PUNCH_HOLE), but if the file is being
> >> closed, defer merging to folio_put() callback.
> >> 
> >> Change-Id: Iae26987756e70c83f3b121edbc0ed0bc105eec0d
> >> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> >> ---
> >>  virt/kvm/guest_memfd.c | 76 +++++++++++++++++++++++++++++++++++++-----
> >>  1 file changed, 68 insertions(+), 8 deletions(-)
> >> 
> >> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> >> index cb426c1dfef8..04b1513c2998 100644
> >> --- a/virt/kvm/guest_memfd.c
> >> +++ b/virt/kvm/guest_memfd.c
> >> @@ -859,6 +859,35 @@ static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
> >>  	return ret;
> >>  }
> >>  
> >> +static long kvm_gmem_merge_truncate_indices(struct inode *inode, pgoff_t index,
> >> +					   size_t nr_pages)
> >> +{
> >> +	struct folio *f;
> >> +	pgoff_t unused;
> >> +	long num_freed;
> >> +
> >> +	unmap_mapping_pages(inode->i_mapping, index, nr_pages, false);
> >> +
> >> +	if (!kvm_gmem_has_safe_refcount(inode->i_mapping, index, nr_pages, &unused))
> 
> Yan, thank you for your reviews!
> 
> > Why is kvm_gmem_has_safe_refcount() checked here, but not in
> > kvm_gmem_zero_range() within kvm_gmem_truncate_inode_range() in patch 33?
> >
> 
> The contract for guest_memfd with HugeTLB pages is that if holes are
> punched in any ranges less than a full huge page, no pages are removed
> from the filemap. Those ranges are only zeroed.
> 
> In kvm_gmem_zero_range(), we never remove any folios, and so there is no
> need to merge. If there's no need to merge, then we don't need to check
> for a safe refcount, and can just proceed to zero.
However, if there are still extra refcounts on a shared page, its contents
will be zeroed out regardless.

> 
> kvm_gmem_merge_truncate_indices() is only used during hole punching and
> not when the file is closed. Hole punch vs file closure is checked using
> mapping_exiting(inode->i_mapping).
> 
> During a hole punch, we will only allow truncation if there are no
> unexpected refcounts on any subpages, hence this
> kvm_gmem_has_safe_refcount() check.
Hmm, I couldn't find a similar refcount check in hugetlbfs_punch_hole().
Did I overlook it?

So, why does guest_memfd require this check when punching a hole?

> >> +		return -EAGAIN;
> >> +
> >
> > Rather than merging the folios, could we simply call kvm_gmem_truncate_indices()
> > instead?
> >
> > num_freed = kvm_gmem_truncate_indices(inode->i_mapping, index, nr_pages);
> > return num_freed;
> >
> 
> We could do this too, but then that would be deferring the huge page
> merging to the folio_put() callback and eventually the kernel worker
> thread.
A benefit of deferring the huge page merging to folio_put() is that the
__kvm_gmem_filemap_add_folio() call for the merged folio can be avoided. That
call can fail, and it is unnecessary for a hole punch since the folio will be
removed from the filemap immediately in truncate_inode_folio().


> My goal here is to try to not to defer merging and freeing as much as
> possible so that most of the page/memory operations are
> synchronous, because synchronous operations are more predictable.
> 
> As an example of improving predictability, in one of the selftests, I do
> a hole punch and then try to allocate again. Because the merging and
> freeing of the HugeTLB page sometimes takes too long, the allocation
> sometimes fails: the guest_memfd's subpool hadn't yet received the freed
> page back. With a synchronous truncation, the truncation may take
> longer, but the selftest predictably passes.
Maybe check whether guestmem_hugetlb_handle_folio_put() is invoked in
interrupt context, and, if not, invoke guestmem_hugetlb_cleanup_folio()
synchronously?
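
Something along these lines (sketch only; the work item name is made up):

/* In guestmem_hugetlb_handle_folio_put(), roughly: */
if (in_interrupt())
	queue_work(system_unbound_wq, &folio_cleanup_work);	/* deferred, as today */
else
	guestmem_hugetlb_cleanup_folio(folio);			/* synchronous */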


> >> +	f = filemap_get_folio(inode->i_mapping, index);
> >> +	if (IS_ERR(f))
> >> +		return 0;
> >> +
> >> +	/* Leave just filemap's refcounts on the folio. */
> >> +	folio_put(f);
> >> +
> >> +	WARN_ON(kvm_gmem_merge_folio_in_filemap(inode, f));
> >> +
> >> +	num_freed = folio_nr_pages(f);
> >> +	folio_lock(f);
> >> +	truncate_inode_folio(inode->i_mapping, f);
> >> +	folio_unlock(f);
> >> +
> >> +	return num_freed;
> >> +}
> >> +
> >>  #else
> >>  
> >>  static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
> >> @@ -874,6 +903,12 @@ static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
> >>  	return 0;
> >>  }
> >>  
> >> +static long kvm_gmem_merge_truncate_indices(struct inode *inode, pgoff_t index,
> >> +					   size_t nr_pages)
> >> +{
> >> +	return 0;
> >> +}
> >> +
> >>  #endif
> >>  
> >>  #else
> >> @@ -1182,8 +1217,10 @@ static long kvm_gmem_truncate_indices(struct address_space *mapping,
> >>   *
> >>   * Removes folios beginning @index for @nr_pages from filemap in @inode, updates
> >>   * inode metadata.
> >> + *
> >> + * Return: 0 on success and negative error otherwise.
> >>   */
> >> -static void kvm_gmem_truncate_inode_aligned_pages(struct inode *inode,
> >> +static long kvm_gmem_truncate_inode_aligned_pages(struct inode *inode,
> >>  						  pgoff_t index,
> >>  						  size_t nr_pages)
> >>  {
> >> @@ -1191,19 +1228,34 @@ static void kvm_gmem_truncate_inode_aligned_pages(struct inode *inode,
> >>  	long num_freed;
> >>  	pgoff_t idx;
> >>  	void *priv;
> >> +	long ret;
> >>  
> >>  	priv = kvm_gmem_allocator_private(inode);
> >>  	nr_per_huge_page = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
> >>  
> >> +	ret = 0;
> >>  	num_freed = 0;
> >>  	for (idx = index; idx < index + nr_pages; idx += nr_per_huge_page) {
> >> -		num_freed += kvm_gmem_truncate_indices(
> >> -			inode->i_mapping, idx, nr_per_huge_page);
> >> +		if (mapping_exiting(inode->i_mapping) ||
> >> +		    !kvm_gmem_has_some_shared(inode, idx, nr_per_huge_page)) {
> >> +			num_freed += kvm_gmem_truncate_indices(
> >> +				inode->i_mapping, idx, nr_per_huge_page);
> >> +		} else {
> >> +			ret = kvm_gmem_merge_truncate_indices(inode, idx,
> >> +							      nr_per_huge_page);
> >> +			if (ret < 0)
> >> +				break;
> >> +
> >> +			num_freed += ret;
> >> +			ret = 0;
> >> +		}
> >>  	}
> >>  
> >>  	spin_lock(&inode->i_lock);
> >>  	inode->i_blocks -= (num_freed << PAGE_SHIFT) / 512;
> >>  	spin_unlock(&inode->i_lock);
> >> +
> >> +	return ret;
> >>  }
> >>  
> >>  /**
> >> @@ -1252,8 +1304,10 @@ static void kvm_gmem_zero_range(struct address_space *mapping,
> >>   *
> >>   * Removes full (huge)pages from the filemap and zeroing incomplete
> >>   * (huge)pages. The pages in the range may be split.
> >> + *
> >> + * Return: 0 on success and negative error otherwise.
> >>   */
> >> -static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
> >> +static long kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
> >>  					  loff_t lend)
> >>  {
> >>  	pgoff_t full_hpage_start;
> >> @@ -1263,6 +1317,7 @@ static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
> >>  	pgoff_t start;
> >>  	pgoff_t end;
> >>  	void *priv;
> >> +	long ret;
> >>  
> >>  	priv = kvm_gmem_allocator_private(inode);
> >>  	nr_per_huge_page = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
> >> @@ -1279,10 +1334,11 @@ static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
> >>  		kvm_gmem_zero_range(inode->i_mapping, start, zero_end);
> >>  	}
> >>  
> >> +	ret = 0;
> >>  	if (full_hpage_end > full_hpage_start) {
> >>  		nr_pages = full_hpage_end - full_hpage_start;
> >> -		kvm_gmem_truncate_inode_aligned_pages(inode, full_hpage_start,
> >> -						      nr_pages);
> >> +		ret = kvm_gmem_truncate_inode_aligned_pages(
> >> +			inode, full_hpage_start, nr_pages);
> >>  	}
> >>  
> >>  	if (end > full_hpage_end && end > full_hpage_start) {
> >> @@ -1290,6 +1346,8 @@ static void kvm_gmem_truncate_inode_range(struct inode *inode, loff_t lstart,
> >>  
> >>  		kvm_gmem_zero_range(inode->i_mapping, zero_start, end);
> >>  	}
> >> +
> >> +	return ret;
> >>  }
> >>  
> >>  static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> >> @@ -1298,6 +1356,7 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> >>  	pgoff_t start = offset >> PAGE_SHIFT;
> >>  	pgoff_t end = (offset + len) >> PAGE_SHIFT;
> >>  	struct kvm_gmem *gmem;
> >> +	long ret;
> >>  
> >>  	/*
> >>  	 * Bindings must be stable across invalidation to ensure the start+end
> >> @@ -1308,8 +1367,9 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> >>  	list_for_each_entry(gmem, gmem_list, entry)
> >>  		kvm_gmem_invalidate_begin_and_zap(gmem, start, end);
> >>  
> >> +	ret = 0;
> >>  	if (kvm_gmem_has_custom_allocator(inode)) {
> >> -		kvm_gmem_truncate_inode_range(inode, offset, offset + len);
> >> +		ret = kvm_gmem_truncate_inode_range(inode, offset, offset + len);
> >>  	} else {
> >>  		/* Page size is PAGE_SIZE, so use optimized truncation function. */
> >>  		truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
> >> @@ -1320,7 +1380,7 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> >>  
> >>  	filemap_invalidate_unlock(inode->i_mapping);
> >>  
> >> -	return 0;
> >> +	return ret;
> >>  }
> >>  
> >>  static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
> >> -- 
> >> 2.49.0.1045.g170613ef41-goog
> >> 
> 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-05-14 23:41 ` [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting Ackerley Tng
  2025-05-27  3:54   ` Yan Zhao
  2025-05-27  8:25   ` Binbin Wu
@ 2025-05-29  5:42   ` Michael Roth
  2025-06-11 21:51     ` Ackerley Tng
  2025-06-11 22:10     ` Ackerley Tng
  2025-08-01  0:01   ` Yan Zhao
  3 siblings, 2 replies; 231+ messages in thread
From: Michael Roth @ 2025-05-29  5:42 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, mpe, muchun.song, nikunj,
	nsaenz, oliver.upton, palmer, pankaj.gupta, paul.walmsley,
	pbonzini, pdurrant, peterx, pgonda, pvorel, qperret,
	quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li

On Wed, May 14, 2025 at 04:41:41PM -0700, Ackerley Tng wrote:
> Track guest_memfd memory's shareability status within the inode as
> opposed to the file, since it is property of the guest_memfd's memory
> contents.
> 
> Shareability is a property of the memory and is indexed using the
> page's index in the inode. Because shareability is the memory's
> property, it is stored within guest_memfd instead of within KVM, like
> in kvm->mem_attr_array.
> 
> KVM_MEMORY_ATTRIBUTE_PRIVATE in kvm->mem_attr_array must still be
> retained to allow VMs to only use guest_memfd for private memory and
> some other memory for shared memory.
> 
> Not all use cases require guest_memfd() to be shared with the host
> when first created. Add a new flag, GUEST_MEMFD_FLAG_INIT_PRIVATE,
> which when set on KVM_CREATE_GUEST_MEMFD, initializes the memory as
> private to the guest, and therefore not mappable by the
> host. Otherwise, memory is shared until explicitly converted to
> private.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> Co-developed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> Change-Id: If03609cbab3ad1564685c85bdba6dcbb6b240c0f
> ---
>  Documentation/virt/kvm/api.rst |   5 ++
>  include/uapi/linux/kvm.h       |   2 +
>  virt/kvm/guest_memfd.c         | 124 ++++++++++++++++++++++++++++++++-
>  3 files changed, 129 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 86f74ce7f12a..f609337ae1c2 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6408,6 +6408,11 @@ belonging to the slot via its userspace_addr.
>  The use of GUEST_MEMFD_FLAG_SUPPORT_SHARED will not be allowed for CoCo VMs.
>  This is validated when the guest_memfd instance is bound to the VM.
>  
> +If the capability KVM_CAP_GMEM_CONVERSIONS is supported, then the 'flags' field
> +supports GUEST_MEMFD_FLAG_INIT_PRIVATE.  Setting GUEST_MEMFD_FLAG_INIT_PRIVATE
> +will initialize the memory for the guest_memfd as guest-only and not faultable
> +by the host.
> +

KVM_CAP_GMEM_CONVERSION doesn't get introduced until later, so it seems
like this flag should be deferred until that patch is in place. Is it
really needed at that point though? Userspace would be able to set the
initial state via KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls.

The mtree contents seem to get stored in the same manner in either case, so
performance-wise only the overhead of a few userspace<->kernel switches
would be saved. Are there any other reasons?

Otherwise, maybe just settle on SHARED as a documented default (since at
least non-CoCo VMs would be able to reliably benefit) and let
CoCo/GUEST_MEMFD_FLAG_SUPPORT_SHARED VMs set PRIVATE at whatever
granularity makes sense for the architecture/guest configuration.

>  See KVM_SET_USER_MEMORY_REGION2 for additional details.
>  
>  4.143 KVM_PRE_FAULT_MEMORY
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 4cc824a3a7c9..d7df312479aa 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1567,7 +1567,9 @@ struct kvm_memory_attributes {
>  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
>  
>  #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
> +
>  #define GUEST_MEMFD_FLAG_SUPPORT_SHARED	(1UL << 0)
> +#define GUEST_MEMFD_FLAG_INIT_PRIVATE	(1UL << 1)
>  
>  struct kvm_create_guest_memfd {
>  	__u64 size;
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 239d0f13dcc1..590932499eba 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -4,6 +4,7 @@
>  #include <linux/falloc.h>
>  #include <linux/fs.h>
>  #include <linux/kvm_host.h>
> +#include <linux/maple_tree.h>
>  #include <linux/pseudo_fs.h>
>  #include <linux/pagemap.h>
>  
> @@ -17,6 +18,24 @@ struct kvm_gmem {
>  	struct list_head entry;
>  };
>  
> +struct kvm_gmem_inode_private {
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +	struct maple_tree shareability;
> +#endif
> +};
> +
> +enum shareability {
> +	SHAREABILITY_GUEST = 1,	/* Only the guest can map (fault) folios in this range. */
> +	SHAREABILITY_ALL = 2,	/* Both guest and host can fault folios in this range. */
> +};
> +
> +static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index);
> +
> +static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
> +{
> +	return inode->i_mapping->i_private_data;
> +}
> +
>  /**
>   * folio_file_pfn - like folio_file_page, but return a pfn.
>   * @folio: The folio which contains this index.
> @@ -29,6 +48,58 @@ static inline kvm_pfn_t folio_file_pfn(struct folio *folio, pgoff_t index)
>  	return folio_pfn(folio) + (index & (folio_nr_pages(folio) - 1));
>  }
>  
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +
> +static int kvm_gmem_shareability_setup(struct kvm_gmem_inode_private *private,
> +				      loff_t size, u64 flags)
> +{
> +	enum shareability m;
> +	pgoff_t last;
> +
> +	last = (size >> PAGE_SHIFT) - 1;
> +	m = flags & GUEST_MEMFD_FLAG_INIT_PRIVATE ? SHAREABILITY_GUEST :
> +						    SHAREABILITY_ALL;
> +	return mtree_store_range(&private->shareability, 0, last, xa_mk_value(m),
> +				 GFP_KERNEL);

One really nice thing about using a maple tree is that it should get rid
of a fairly significant startup delay for SNP/TDX when the entire xarray gets
initialized with private attribute entries via KVM_SET_MEMORY_ATTRIBUTES
(which is the current QEMU default behavior).

I'd originally advocated for sticking with the xarray implementation Fuad was
using until we'd determined we really need it for HugeTLB support, but I'm
sort of thinking it's already justified just based on the above.

Maybe it would make sense for KVM memory attributes too?
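
(For comparison, a rough sketch of the difference:)

/* Maple tree: a single entry covers the whole range. */
mtree_store_range(&mt, 0, nr_pages - 1,
		  xa_mk_value(SHAREABILITY_GUEST), GFP_KERNEL);

/* Plain xarray: one store per page index. */
for (index = 0; index < nr_pages; index++)
	xa_store(&xa, index, xa_mk_value(SHAREABILITY_GUEST), GFP_KERNEL);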

> +}
> +
> +static enum shareability kvm_gmem_shareability_get(struct inode *inode,
> +						 pgoff_t index)
> +{
> +	struct maple_tree *mt;
> +	void *entry;
> +
> +	mt = &kvm_gmem_private(inode)->shareability;
> +	entry = mtree_load(mt, index);
> +	WARN(!entry,
> +	     "Shareability should always be defined for all indices in inode.");
> +
> +	return xa_to_value(entry);
> +}
> +
> +static struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
> +{
> +	if (kvm_gmem_shareability_get(inode, index) != SHAREABILITY_ALL)
> +		return ERR_PTR(-EACCES);
> +
> +	return kvm_gmem_get_folio(inode, index);
> +}
> +
> +#else
> +
> +static int kvm_gmem_shareability_setup(struct maple_tree *mt, loff_t size, u64 flags)
> +{
> +	return 0;
> +}
> +
> +static inline struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
> +{
> +	WARN_ONCE("Unexpected call to get shared folio.")
> +	return NULL;
> +}
> +
> +#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> +
>  static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
>  				    pgoff_t index, struct folio *folio)
>  {
> @@ -333,7 +404,7 @@ static vm_fault_t kvm_gmem_fault_shared(struct vm_fault *vmf)
>  
>  	filemap_invalidate_lock_shared(inode->i_mapping);
>  
> -	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> +	folio = kvm_gmem_get_shared_folio(inode, vmf->pgoff);
>  	if (IS_ERR(folio)) {
>  		int err = PTR_ERR(folio);
>  
> @@ -420,8 +491,33 @@ static struct file_operations kvm_gmem_fops = {
>  	.fallocate	= kvm_gmem_fallocate,
>  };
>  
> +static void kvm_gmem_free_inode(struct inode *inode)
> +{
> +	struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
> +
> +	kfree(private);
> +
> +	free_inode_nonrcu(inode);
> +}
> +
> +static void kvm_gmem_destroy_inode(struct inode *inode)
> +{
> +	struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
> +
> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> +	/*
> +	 * mtree_destroy() can't be used within rcu callback, hence can't be
> +	 * done in ->free_inode().
> +	 */
> +	if (private)
> +		mtree_destroy(&private->shareability);
> +#endif
> +}
> +
>  static const struct super_operations kvm_gmem_super_operations = {
>  	.statfs		= simple_statfs,
> +	.destroy_inode	= kvm_gmem_destroy_inode,
> +	.free_inode	= kvm_gmem_free_inode,
>  };
>  
>  static int kvm_gmem_init_fs_context(struct fs_context *fc)
> @@ -549,12 +645,26 @@ static const struct inode_operations kvm_gmem_iops = {
>  static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>  						      loff_t size, u64 flags)
>  {
> +	struct kvm_gmem_inode_private *private;
>  	struct inode *inode;
> +	int err;
>  
>  	inode = alloc_anon_secure_inode(kvm_gmem_mnt->mnt_sb, name);
>  	if (IS_ERR(inode))
>  		return inode;
>  
> +	err = -ENOMEM;
> +	private = kzalloc(sizeof(*private), GFP_KERNEL);
> +	if (!private)
> +		goto out;
> +
> +	mt_init(&private->shareability);
> +	inode->i_mapping->i_private_data = private;
> +
> +	err = kvm_gmem_shareability_setup(private, size, flags);
> +	if (err)
> +		goto out;
> +
>  	inode->i_private = (void *)(unsigned long)flags;
>  	inode->i_op = &kvm_gmem_iops;
>  	inode->i_mapping->a_ops = &kvm_gmem_aops;
> @@ -566,6 +676,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>  	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>  
>  	return inode;
> +
> +out:
> +	iput(inode);
> +
> +	return ERR_PTR(err);
>  }
>  
>  static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
> @@ -654,6 +769,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>  	if (kvm_arch_vm_supports_gmem_shared_mem(kvm))
>  		valid_flags |= GUEST_MEMFD_FLAG_SUPPORT_SHARED;
>  
> +	if (flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED)
> +		valid_flags |= GUEST_MEMFD_FLAG_INIT_PRIVATE;
> +
>  	if (flags & ~valid_flags)
>  		return -EINVAL;
>  
> @@ -842,6 +960,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>  	if (!file)
>  		return -EFAULT;
>  
> +	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> +

I like the idea of using a write-lock/read-lock to protect write/read access
to shareability state (though maybe not necessarily re-using filemap's
invalidate lock); it's simple and still allows concurrent faulting in of gmem
pages. One issue on the SNP side (which also came up in one of the gmem calls)
is that if we introduce support for tracking preparedness as discussed (e.g.
via a new SHAREABILITY_GUEST_PREPARED state), the
SHAREABILITY_GUEST->SHAREABILITY_GUEST_PREPARED transition would occur at
fault time, and so would need to take the write-lock and no longer allow for
concurrent fault-handling.

I was originally planning on introducing a new rw_semaphore with similar
semantics to the rw_lock that Fuad previously had in his restricted mmap
series[1] (and similar semantics to the filemap invalidate lock here). The main
difference, to handle setting SHAREABILITY_GUEST_PREPARED within fault paths,
was that in the case of a folio being present for an index, the folio lock would
also need to be held in order to update the shareability state. Because
of that, fault paths (which will basically always either have or allocate a
folio) can rely on the folio lock to guard shareability state in a more
granular way and so can avoid a global write lock.

They would still need to hold the read lock to access the tree, however.
More specifically, any path that could allocate a folio needs to take the
read lock so there isn't a TOCTOU situation where shareability is being
updated for an index for which no folio has been allocated, but just
afterward a folio gets faulted in/allocated while the shareability update
is still in progress, under the assumption that there was no folio around
that needed locking.
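
A rough sketch of that fault-path locking (names like shareability_rwsem,
shareability_get()/shareability_set() and SHAREABILITY_GUEST_PREPARED are
placeholders for illustration, not code from either series):

	down_read(&private->shareability_rwsem);   /* concurrent faults still allowed */
	folio = kvm_gmem_get_folio(inode, index);  /* returned locked */
	if (shareability_get(inode, index) == SHAREABILITY_GUEST)
		/* the folio lock, not a write lock, guards this per-index update */
		shareability_set(inode, index, SHAREABILITY_GUEST_PREPARED);
	folio_unlock(folio);
	up_read(&private->shareability_rwsem);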

I had a branch with in-place conversion support for SNP[2] that added this
lock reworking on top of Fuad's series along with preparation tracking,
but I'm now planning to rebase that on top of the patches from this
series that Sean mentioned[3] earlier:

  KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION
  KVM: Query guest_memfd for private/shared status
  KVM: guest_memfd: Skip LRU for guest_memfd folios
  KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  KVM: guest_memfd: Introduce and use shareability to guard faulting
  KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes

but figured I'd mention it here in case there are other things to consider on
the locking front.

Definitely agree with Sean though that it would be nice to start identifying a
common base of patches for the in-place conversion enablement for SNP, TDX, and
pKVM so the APIs/interfaces for hugepages can be handled separately.

-Mike

[1] https://lore.kernel.org/kvm/20250328153133.3504118-1-tabba@google.com/
[2] https://github.com/mdroth/linux/commits/mmap-swprot-v10-snp0-wip2/
[3] https://lore.kernel.org/kvm/aC86OsU2HSFZkJP6@google.com/

>  	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
>  	if (IS_ERR(folio)) {
>  		r = PTR_ERR(folio);
> @@ -857,8 +977,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>  		*page = folio_file_page(folio, index);
>  	else
>  		folio_put(folio);
> -
>  out:
> +	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
>  	fput(file);
>  	return r;
>  }
> -- 
> 2.49.0.1045.g170613ef41-goog
> 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-05-27  3:54   ` Yan Zhao
@ 2025-05-29 18:20     ` Ackerley Tng
  2025-05-30  8:53     ` Fuad Tabba
  1 sibling, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-29 18:20 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yilun.xu, yuzenghui,
	zhiquan1.li

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Wed, May 14, 2025 at 04:41:41PM -0700, Ackerley Tng wrote:
>> Track guest_memfd memory's shareability status within the inode as
>> opposed to the file, since it is property of the guest_memfd's memory
>> contents.
>> 
>> Shareability is a property of the memory and is indexed using the
>> page's index in the inode. Because shareability is the memory's
>> property, it is stored within guest_memfd instead of within KVM, like
>> in kvm->mem_attr_array.
>> 
>> KVM_MEMORY_ATTRIBUTE_PRIVATE in kvm->mem_attr_array must still be
>> retained to allow VMs to only use guest_memfd for private memory and
>> some other memory for shared memory.
>> 
>> Not all use cases require guest_memfd() to be shared with the host
>> when first created. Add a new flag, GUEST_MEMFD_FLAG_INIT_PRIVATE,
>> which when set on KVM_CREATE_GUEST_MEMFD, initializes the memory as
>> private to the guest, and therefore not mappable by the
>> host. Otherwise, memory is shared until explicitly converted to
>> private.
>> 
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
>> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
>> Co-developed-by: Fuad Tabba <tabba@google.com>
>> Signed-off-by: Fuad Tabba <tabba@google.com>
>> Change-Id: If03609cbab3ad1564685c85bdba6dcbb6b240c0f
>> ---
>>  Documentation/virt/kvm/api.rst |   5 ++
>>  include/uapi/linux/kvm.h       |   2 +
>>  virt/kvm/guest_memfd.c         | 124 ++++++++++++++++++++++++++++++++-
>>  3 files changed, 129 insertions(+), 2 deletions(-)
>> 
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 86f74ce7f12a..f609337ae1c2 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -6408,6 +6408,11 @@ belonging to the slot via its userspace_addr.
>>  The use of GUEST_MEMFD_FLAG_SUPPORT_SHARED will not be allowed for CoCo VMs.
>>  This is validated when the guest_memfd instance is bound to the VM.
>>  
>> +If the capability KVM_CAP_GMEM_CONVERSIONS is supported, then the 'flags' field
>> +supports GUEST_MEMFD_FLAG_INIT_PRIVATE.  Setting GUEST_MEMFD_FLAG_INIT_PRIVATE
>> +will initialize the memory for the guest_memfd as guest-only and not faultable
>> +by the host.
>> +
>>  See KVM_SET_USER_MEMORY_REGION2 for additional details.
>>  
>>  4.143 KVM_PRE_FAULT_MEMORY
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 4cc824a3a7c9..d7df312479aa 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1567,7 +1567,9 @@ struct kvm_memory_attributes {
>>  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
>>  
>>  #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
>> +
>>  #define GUEST_MEMFD_FLAG_SUPPORT_SHARED	(1UL << 0)
>> +#define GUEST_MEMFD_FLAG_INIT_PRIVATE	(1UL << 1)
>>  
>>  struct kvm_create_guest_memfd {
>>  	__u64 size;
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 239d0f13dcc1..590932499eba 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -4,6 +4,7 @@
>>  #include <linux/falloc.h>
>>  #include <linux/fs.h>
>>  #include <linux/kvm_host.h>
>> +#include <linux/maple_tree.h>
>>  #include <linux/pseudo_fs.h>
>>  #include <linux/pagemap.h>
>>  
>> @@ -17,6 +18,24 @@ struct kvm_gmem {
>>  	struct list_head entry;
>>  };
>>  
>> +struct kvm_gmem_inode_private {
>> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
>> +	struct maple_tree shareability;
>> +#endif
>> +};
>> +
>> +enum shareability {
>> +	SHAREABILITY_GUEST = 1,	/* Only the guest can map (fault) folios in this range. */
>> +	SHAREABILITY_ALL = 2,	/* Both guest and host can fault folios in this range. */
>> +};
>> +
>> +static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index);
>> +
>> +static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
>> +{
>> +	return inode->i_mapping->i_private_data;
>> +}
>> +
>>  /**
>>   * folio_file_pfn - like folio_file_page, but return a pfn.
>>   * @folio: The folio which contains this index.
>> @@ -29,6 +48,58 @@ static inline kvm_pfn_t folio_file_pfn(struct folio *folio, pgoff_t index)
>>  	return folio_pfn(folio) + (index & (folio_nr_pages(folio) - 1));
>>  }
>>  
>> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
>> +
>> +static int kvm_gmem_shareability_setup(struct kvm_gmem_inode_private *private,
>> +				      loff_t size, u64 flags)
>> +{
>> +	enum shareability m;
>> +	pgoff_t last;
>> +
>> +	last = (size >> PAGE_SHIFT) - 1;
>> +	m = flags & GUEST_MEMFD_FLAG_INIT_PRIVATE ? SHAREABILITY_GUEST :
>> +						    SHAREABILITY_ALL;
>> +	return mtree_store_range(&private->shareability, 0, last, xa_mk_value(m),
>> +				 GFP_KERNEL);
>> +}
>> +
>> +static enum shareability kvm_gmem_shareability_get(struct inode *inode,
>> +						 pgoff_t index)
>> +{
>> +	struct maple_tree *mt;
>> +	void *entry;
>> +
>> +	mt = &kvm_gmem_private(inode)->shareability;
>> +	entry = mtree_load(mt, index);
>> +	WARN(!entry,
>> +	     "Shareability should always be defined for all indices in inode.");
> I noticed that in [1], the kvm_gmem_mmap() does not check the range.
> So, the WARN() here can be hit when userspace mmap() an area larger than the
> inode size and accesses the out of band HVA.
>
> Maybe limit the mmap() range?
>
> @@ -1609,6 +1620,10 @@ static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
>         if (!kvm_gmem_supports_shared(file_inode(file)))
>                 return -ENODEV;
>
> +       if (vma->vm_end - vma->vm_start + (vma->vm_pgoff << PAGE_SHIFT) > i_size_read(file_inode(file)))
> +               return -EINVAL;
> +
>         if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) !=
>             (VM_SHARED | VM_MAYSHARE)) {
>                 return -EINVAL;
>
> [1] https://lore.kernel.org/all/20250513163438.3942405-8-tabba@google.com/
>

This is a good idea, thanks! I think it would also be good to include
this in the guest_memfd mmap base series that Fuad is working on [1],
maybe in v11.

[1] https://lore.kernel.org/all/20250527180245.1413463-1-tabba@google.com/

>> +	return xa_to_value(entry);
>> +}
>> +
>> +static struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
>> +{
>> +	if (kvm_gmem_shareability_get(inode, index) != SHAREABILITY_ALL)
>> +		return ERR_PTR(-EACCES);
>> +
>> +	return kvm_gmem_get_folio(inode, index);
>> +}
>> +
>> +#else
>> +
>> +static int kvm_gmem_shareability_setup(struct maple_tree *mt, loff_t size, u64 flags)
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
>> +{
>> +	WARN_ONCE("Unexpected call to get shared folio.")
>> +	return NULL;
>> +}
>> +
>> +#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
>> +
>>  static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
>>  				    pgoff_t index, struct folio *folio)
>>  {
>> @@ -333,7 +404,7 @@ static vm_fault_t kvm_gmem_fault_shared(struct vm_fault *vmf)
>>  
>>  	filemap_invalidate_lock_shared(inode->i_mapping);
>>  
>> -	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
>> +	folio = kvm_gmem_get_shared_folio(inode, vmf->pgoff);
>>  	if (IS_ERR(folio)) {
>>  		int err = PTR_ERR(folio);
>>  
>> @@ -420,8 +491,33 @@ static struct file_operations kvm_gmem_fops = {
>>  	.fallocate	= kvm_gmem_fallocate,
>>  };
>>  
>> +static void kvm_gmem_free_inode(struct inode *inode)
>> +{
>> +	struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
>> +
>> +	kfree(private);
>> +
>> +	free_inode_nonrcu(inode);
>> +}
>> +
>> +static void kvm_gmem_destroy_inode(struct inode *inode)
>> +{
>> +	struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
>> +
>> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
>> +	/*
>> +	 * mtree_destroy() can't be used within rcu callback, hence can't be
>> +	 * done in ->free_inode().
>> +	 */
>> +	if (private)
>> +		mtree_destroy(&private->shareability);
>> +#endif
>> +}
>> +
>>  static const struct super_operations kvm_gmem_super_operations = {
>>  	.statfs		= simple_statfs,
>> +	.destroy_inode	= kvm_gmem_destroy_inode,
>> +	.free_inode	= kvm_gmem_free_inode,
>>  };
>>  
>>  static int kvm_gmem_init_fs_context(struct fs_context *fc)
>> @@ -549,12 +645,26 @@ static const struct inode_operations kvm_gmem_iops = {
>>  static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>>  						      loff_t size, u64 flags)
>>  {
>> +	struct kvm_gmem_inode_private *private;
>>  	struct inode *inode;
>> +	int err;
>>  
>>  	inode = alloc_anon_secure_inode(kvm_gmem_mnt->mnt_sb, name);
>>  	if (IS_ERR(inode))
>>  		return inode;
>>  
>> +	err = -ENOMEM;
>> +	private = kzalloc(sizeof(*private), GFP_KERNEL);
>> +	if (!private)
>> +		goto out;
>> +
>> +	mt_init(&private->shareability);
> Wrap the mt_init() inside "#ifdef CONFIG_KVM_GMEM_SHARED_MEM" ?
>

Will fix this in the next revision. Will also update this to only
initialize shareability if (flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED).

>> +	inode->i_mapping->i_private_data = private;
>> +
>> +	err = kvm_gmem_shareability_setup(private, size, flags);
>> +	if (err)
>> +		goto out;
>> +
>>  	inode->i_private = (void *)(unsigned long)flags;
>>  	inode->i_op = &kvm_gmem_iops;
>>  	inode->i_mapping->a_ops = &kvm_gmem_aops;
>> @@ -566,6 +676,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>>  	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>>  
>>  	return inode;
>> +
>> +out:
>> +	iput(inode);
>> +
>> +	return ERR_PTR(err);
>>  }
>>  
>>  static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
>> @@ -654,6 +769,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>>  	if (kvm_arch_vm_supports_gmem_shared_mem(kvm))
>>  		valid_flags |= GUEST_MEMFD_FLAG_SUPPORT_SHARED;
>>  
>> +	if (flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED)
>> +		valid_flags |= GUEST_MEMFD_FLAG_INIT_PRIVATE;
>> +
>>  	if (flags & ~valid_flags)
>>  		return -EINVAL;
>>  
>> @@ -842,6 +960,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>  	if (!file)
>>  		return -EFAULT;
>>  
>> +	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
>> +
>>  	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
>>  	if (IS_ERR(folio)) {
>>  		r = PTR_ERR(folio);
>> @@ -857,8 +977,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>  		*page = folio_file_page(folio, index);
>>  	else
>>  		folio_put(folio);
>> -
>>  out:
>> +	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
>>  	fput(file);
>>  	return r;
>>  }
>> -- 
>> 2.49.0.1045.g170613ef41-goog
>> 
>> 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-05-27  8:25   ` Binbin Wu
  2025-05-27  8:43     ` Binbin Wu
@ 2025-05-29 18:26     ` Ackerley Tng
  2025-05-29 20:37       ` Ackerley Tng
  1 sibling, 1 reply; 231+ messages in thread
From: Ackerley Tng @ 2025-05-29 18:26 UTC (permalink / raw)
  To: Binbin Wu
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko, jgg,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li

Binbin Wu <binbin.wu@linux.intel.com> writes:

> On 5/15/2025 7:41 AM, Ackerley Tng wrote:
>> Track guest_memfd memory's shareability status within the inode as
>> opposed to the file, since it is property of the guest_memfd's memory
>> contents.
>>
>> Shareability is a property of the memory and is indexed using the
>> page's index in the inode. Because shareability is the memory's
>> property, it is stored within guest_memfd instead of within KVM, like
>> in kvm->mem_attr_array.
>>
>> KVM_MEMORY_ATTRIBUTE_PRIVATE in kvm->mem_attr_array must still be
>> retained to allow VMs to only use guest_memfd for private memory and
>> some other memory for shared memory.
>>
>> Not all use cases require guest_memfd() to be shared with the host
>> when first created. Add a new flag, GUEST_MEMFD_FLAG_INIT_PRIVATE,
>> which when set on KVM_CREATE_GUEST_MEMFD, initializes the memory as
>> private to the guest, and therefore not mappable by the
>> host. Otherwise, memory is shared until explicitly converted to
>> private.
>>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
>> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
>> Co-developed-by: Fuad Tabba <tabba@google.com>
>> Signed-off-by: Fuad Tabba <tabba@google.com>
>> Change-Id: If03609cbab3ad1564685c85bdba6dcbb6b240c0f
>> ---
>>   Documentation/virt/kvm/api.rst |   5 ++
>>   include/uapi/linux/kvm.h       |   2 +
>>   virt/kvm/guest_memfd.c         | 124 ++++++++++++++++++++++++++++++++-
>>   3 files changed, 129 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 86f74ce7f12a..f609337ae1c2 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -6408,6 +6408,11 @@ belonging to the slot via its userspace_addr.
>>   The use of GUEST_MEMFD_FLAG_SUPPORT_SHARED will not be allowed for CoCo VMs.
>>   This is validated when the guest_memfd instance is bound to the VM.
>>   
>> +If the capability KVM_CAP_GMEM_CONVERSIONS is supported, then the 'flags' field
>> +supports GUEST_MEMFD_FLAG_INIT_PRIVATE.
>
> It seems that the sentence is stale?
> Didn't find the definition of KVM_CAP_GMEM_CONVERSIONS.
>

Thanks. This should read

If the capability KVM_CAP_GMEM_SHARED_MEM is supported, and
GUEST_MEMFD_FLAG_SUPPORT_SHARED is specified, then the 'flags' field
supports GUEST_MEMFD_FLAG_INIT_PRIVATE.

>> Setting GUEST_MEMFD_FLAG_INIT_PRIVATE
>> +will initialize the memory for the guest_memfd as guest-only and not faultable
>> +by the host.
>> +
> [...]
>>   
>>   static int kvm_gmem_init_fs_context(struct fs_context *fc)
>> @@ -549,12 +645,26 @@ static const struct inode_operations kvm_gmem_iops = {
>>   static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>>   						      loff_t size, u64 flags)
>>   {
>> +	struct kvm_gmem_inode_private *private;
>>   	struct inode *inode;
>> +	int err;
>>   
>>   	inode = alloc_anon_secure_inode(kvm_gmem_mnt->mnt_sb, name);
>>   	if (IS_ERR(inode))
>>   		return inode;
>>   
>> +	err = -ENOMEM;
>> +	private = kzalloc(sizeof(*private), GFP_KERNEL);
>> +	if (!private)
>> +		goto out;
>> +
>> +	mt_init(&private->shareability);
>
> shareability is defined only when CONFIG_KVM_GMEM_SHARED_MEM enabled, should be done within CONFIG_KVM_GMEM_SHARED_MEM .
>
>

Yes, thank you! Will also update this to only initialize shareability if
(flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED).
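
Roughly something like this (sketch of the planned change, not final code):

	#ifdef CONFIG_KVM_GMEM_SHARED_MEM
		if (flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED) {
			mt_init(&private->shareability);
			err = kvm_gmem_shareability_setup(private, size, flags);
			if (err)
				goto out;
		}
	#endif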

>> +	inode->i_mapping->i_private_data = private;
>> +
>> +	err = kvm_gmem_shareability_setup(private, size, flags);
>> +	if (err)
>> +		goto out;
>> +
>>   	inode->i_private = (void *)(unsigned long)flags;
>>   	inode->i_op = &kvm_gmem_iops;
>>   	inode->i_mapping->a_ops = &kvm_gmem_aops;
>> @@ -566,6 +676,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>>   	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>>   
>>   	return inode;
>> +
>> +out:
>> +	iput(inode);
>> +
>> +	return ERR_PTR(err);
>>   }
>>   
>>
> [...]

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-05-29 18:26     ` Ackerley Tng
@ 2025-05-29 20:37       ` Ackerley Tng
  0 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-29 20:37 UTC (permalink / raw)
  To: Binbin Wu
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko, jgg,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, vannapurve, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

Ackerley Tng <ackerleytng@google.com> writes:

> Binbin Wu <binbin.wu@linux.intel.com> writes:
>
>> On 5/15/2025 7:41 AM, Ackerley Tng wrote:
>>> Track guest_memfd memory's shareability status within the inode as
>>> opposed to the file, since it is property of the guest_memfd's memory
>>> contents.
>>>
>>> Shareability is a property of the memory and is indexed using the
>>> page's index in the inode. Because shareability is the memory's
>>> property, it is stored within guest_memfd instead of within KVM, like
>>> in kvm->mem_attr_array.
>>>
>>> KVM_MEMORY_ATTRIBUTE_PRIVATE in kvm->mem_attr_array must still be
>>> retained to allow VMs to only use guest_memfd for private memory and
>>> some other memory for shared memory.
>>>
>>> Not all use cases require guest_memfd() to be shared with the host
>>> when first created. Add a new flag, GUEST_MEMFD_FLAG_INIT_PRIVATE,
>>> which when set on KVM_CREATE_GUEST_MEMFD, initializes the memory as
>>> private to the guest, and therefore not mappable by the
>>> host. Otherwise, memory is shared until explicitly converted to
>>> private.
>>>
>>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>>> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
>>> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
>>> Co-developed-by: Fuad Tabba <tabba@google.com>
>>> Signed-off-by: Fuad Tabba <tabba@google.com>
>>> Change-Id: If03609cbab3ad1564685c85bdba6dcbb6b240c0f
>>> ---
>>>   Documentation/virt/kvm/api.rst |   5 ++
>>>   include/uapi/linux/kvm.h       |   2 +
>>>   virt/kvm/guest_memfd.c         | 124 ++++++++++++++++++++++++++++++++-
>>>   3 files changed, 129 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>>> index 86f74ce7f12a..f609337ae1c2 100644
>>> --- a/Documentation/virt/kvm/api.rst
>>> +++ b/Documentation/virt/kvm/api.rst
>>> @@ -6408,6 +6408,11 @@ belonging to the slot via its userspace_addr.
>>>   The use of GUEST_MEMFD_FLAG_SUPPORT_SHARED will not be allowed for CoCo VMs.
>>>   This is validated when the guest_memfd instance is bound to the VM.
>>>   
>>> +If the capability KVM_CAP_GMEM_CONVERSIONS is supported, then the 'flags' field
>>> +supports GUEST_MEMFD_FLAG_INIT_PRIVATE.
>>
>> It seems that the sentence is stale?
>> Didn't find the definition of KVM_CAP_GMEM_CONVERSIONS.
>>
>
> Thanks. This should read
>
> If the capability KVM_CAP_GMEM_SHARED_MEM is supported, and
> GUEST_MEMFD_FLAG_SUPPORT_SHARED is specified, then the 'flags' field
> supports GUEST_MEMFD_FLAG_INIT_PRIVATE.
>

My bad, saw your other email. Fixing the above:

If the capability KVM_CAP_GMEM_CONVERSION is supported, and
GUEST_MEMFD_FLAG_SUPPORT_SHARED is specified, then the 'flags' field
supports GUEST_MEMFD_FLAG_INIT_PRIVATE.

>>> Setting GUEST_MEMFD_FLAG_INIT_PRIVATE
>>> +will initialize the memory for the guest_memfd as guest-only and not faultable
>>> +by the host.
>>> +
>> [...]
>>>   
>>>   static int kvm_gmem_init_fs_context(struct fs_context *fc)
>>> @@ -549,12 +645,26 @@ static const struct inode_operations kvm_gmem_iops = {
>>>   static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>>>   						      loff_t size, u64 flags)
>>>   {
>>> +	struct kvm_gmem_inode_private *private;
>>>   	struct inode *inode;
>>> +	int err;
>>>   
>>>   	inode = alloc_anon_secure_inode(kvm_gmem_mnt->mnt_sb, name);
>>>   	if (IS_ERR(inode))
>>>   		return inode;
>>>   
>>> +	err = -ENOMEM;
>>> +	private = kzalloc(sizeof(*private), GFP_KERNEL);
>>> +	if (!private)
>>> +		goto out;
>>> +
>>> +	mt_init(&private->shareability);
>>
>> shareability is defined only when CONFIG_KVM_GMEM_SHARED_MEM enabled, should be done within CONFIG_KVM_GMEM_SHARED_MEM .
>>
>>
>
> Yes, thank you! Will also update this to only initialize shareability if
> (flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED).
>
>>> +	inode->i_mapping->i_private_data = private;
>>> +
>>> +	err = kvm_gmem_shareability_setup(private, size, flags);
>>> +	if (err)
>>> +		goto out;
>>> +
>>>   	inode->i_private = (void *)(unsigned long)flags;
>>>   	inode->i_op = &kvm_gmem_iops;
>>>   	inode->i_mapping->a_ops = &kvm_gmem_aops;
>>> @@ -566,6 +676,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>>>   	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>>>   
>>>   	return inode;
>>> +
>>> +out:
>>> +	iput(inode);
>>> +
>>> +	return ERR_PTR(err);
>>>   }
>>>   
>>>
>> [...]

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-05-27  3:54   ` Yan Zhao
  2025-05-29 18:20     ` Ackerley Tng
@ 2025-05-30  8:53     ` Fuad Tabba
  2025-05-30 18:32       ` Ackerley Tng
  1 sibling, 1 reply; 231+ messages in thread
From: Fuad Tabba @ 2025-05-30  8:53 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yilun.xu, yuzenghui,
	zhiquan1.li

Hi,

.. snip..

> I noticed that in [1], the kvm_gmem_mmap() does not check the range.
> So, the WARN() here can be hit when userspace mmap() an area larger than the
> inode size and accesses the out of band HVA.
>
> Maybe limit the mmap() range?
>
> @@ -1609,6 +1620,10 @@ static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
>         if (!kvm_gmem_supports_shared(file_inode(file)))
>                 return -ENODEV;
>
> +       if (vma->vm_end - vma->vm_start + (vma->vm_pgoff << PAGE_SHIFT) > i_size_read(file_inode(file)))
> +               return -EINVAL;
> +
>         if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) !=
>             (VM_SHARED | VM_MAYSHARE)) {
>                 return -EINVAL;
>
> [1] https://lore.kernel.org/all/20250513163438.3942405-8-tabba@google.com/

I don't think we want to do that, for a couple of reasons. We catch
such invalid accesses on faulting, and, by analogy, afaict, neither
secretmem nor memfd performs a similar check on mmap (nor do
memory-mapped files in general).

There are also valid reasons why a user would want to deliberately
mmap more memory than the backing store, knowing that it's only going
to fault what it's going to use, e.g., alignment.
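
For instance (illustrative userspace sketch only, names are made up), a VMM
might round its mapping up to a hugepage-aligned length even though the file
is smaller, and simply never touch the tail:

	/* File is `size` bytes; map it rounded up to a 2M multiple. */
	size_t map_len = (size + (2UL << 20) - 1) & ~((2UL << 20) - 1);
	void *p = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
		       MAP_SHARED, gmem_fd, 0);
	/* Only the first `size` bytes are ever faulted in. */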

Cheers,
/fuad


> > +     return xa_to_value(entry);
> > +}
> > +
> > +static struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
> > +{
> > +     if (kvm_gmem_shareability_get(inode, index) != SHAREABILITY_ALL)
> > +             return ERR_PTR(-EACCES);
> > +
> > +     return kvm_gmem_get_folio(inode, index);
> > +}
> > +
> > +#else
> > +
> > +static int kvm_gmem_shareability_setup(struct maple_tree *mt, loff_t size, u64 flags)
> > +{
> > +     return 0;
> > +}
> > +
> > +static inline struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
> > +{
> > +     WARN_ONCE("Unexpected call to get shared folio.")
> > +     return NULL;
> > +}
> > +
> > +#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> > +
> >  static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
> >                                   pgoff_t index, struct folio *folio)
> >  {
> > @@ -333,7 +404,7 @@ static vm_fault_t kvm_gmem_fault_shared(struct vm_fault *vmf)
> >
> >       filemap_invalidate_lock_shared(inode->i_mapping);
> >
> > -     folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> > +     folio = kvm_gmem_get_shared_folio(inode, vmf->pgoff);
> >       if (IS_ERR(folio)) {
> >               int err = PTR_ERR(folio);
> >
> > @@ -420,8 +491,33 @@ static struct file_operations kvm_gmem_fops = {
> >       .fallocate      = kvm_gmem_fallocate,
> >  };
> >
> > +static void kvm_gmem_free_inode(struct inode *inode)
> > +{
> > +     struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
> > +
> > +     kfree(private);
> > +
> > +     free_inode_nonrcu(inode);
> > +}
> > +
> > +static void kvm_gmem_destroy_inode(struct inode *inode)
> > +{
> > +     struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
> > +
> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > +     /*
> > +      * mtree_destroy() can't be used within rcu callback, hence can't be
> > +      * done in ->free_inode().
> > +      */
> > +     if (private)
> > +             mtree_destroy(&private->shareability);
> > +#endif
> > +}
> > +
> >  static const struct super_operations kvm_gmem_super_operations = {
> >       .statfs         = simple_statfs,
> > +     .destroy_inode  = kvm_gmem_destroy_inode,
> > +     .free_inode     = kvm_gmem_free_inode,
> >  };
> >
> >  static int kvm_gmem_init_fs_context(struct fs_context *fc)
> > @@ -549,12 +645,26 @@ static const struct inode_operations kvm_gmem_iops = {
> >  static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
> >                                                     loff_t size, u64 flags)
> >  {
> > +     struct kvm_gmem_inode_private *private;
> >       struct inode *inode;
> > +     int err;
> >
> >       inode = alloc_anon_secure_inode(kvm_gmem_mnt->mnt_sb, name);
> >       if (IS_ERR(inode))
> >               return inode;
> >
> > +     err = -ENOMEM;
> > +     private = kzalloc(sizeof(*private), GFP_KERNEL);
> > +     if (!private)
> > +             goto out;
> > +
> > +     mt_init(&private->shareability);
> Wrap the mt_init() inside "#ifdef CONFIG_KVM_GMEM_SHARED_MEM" ?
>
> > +     inode->i_mapping->i_private_data = private;
> > +
> > +     err = kvm_gmem_shareability_setup(private, size, flags);
> > +     if (err)
> > +             goto out;
> > +
> >       inode->i_private = (void *)(unsigned long)flags;
> >       inode->i_op = &kvm_gmem_iops;
> >       inode->i_mapping->a_ops = &kvm_gmem_aops;
> > @@ -566,6 +676,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
> >       WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> >
> >       return inode;
> > +
> > +out:
> > +     iput(inode);
> > +
> > +     return ERR_PTR(err);
> >  }
> >
> >  static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
> > @@ -654,6 +769,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> >       if (kvm_arch_vm_supports_gmem_shared_mem(kvm))
> >               valid_flags |= GUEST_MEMFD_FLAG_SUPPORT_SHARED;
> >
> > +     if (flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED)
> > +             valid_flags |= GUEST_MEMFD_FLAG_INIT_PRIVATE;
> > +
> >       if (flags & ~valid_flags)
> >               return -EINVAL;
> >
> > @@ -842,6 +960,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> >       if (!file)
> >               return -EFAULT;
> >
> > +     filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> > +
> >       folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
> >       if (IS_ERR(folio)) {
> >               r = PTR_ERR(folio);
> > @@ -857,8 +977,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> >               *page = folio_file_page(folio, index);
> >       else
> >               folio_put(folio);
> > -
> >  out:
> > +     filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
> >       fput(file);
> >       return r;
> >  }
> > --
> > 2.49.0.1045.g170613ef41-goog
> >
> >

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-05-30  8:53     ` Fuad Tabba
@ 2025-05-30 18:32       ` Ackerley Tng
  2025-06-02  9:43         ` Fuad Tabba
  0 siblings, 1 reply; 231+ messages in thread
From: Ackerley Tng @ 2025-05-30 18:32 UTC (permalink / raw)
  To: Fuad Tabba, Yan Zhao
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, thomas.lendacky,
	usama.arif, vannapurve, vbabka, viro, vkuznets, wei.w.wang, will,
	willy, xiaoyao.li, yilun.xu, yuzenghui, zhiquan1.li

Fuad Tabba <tabba@google.com> writes:

> Hi,
>
> .. snip..
>
>> I noticed that in [1], the kvm_gmem_mmap() does not check the range.
>> So, the WARN() here can be hit when userspace mmap() an area larger than the
>> inode size and accesses the out of band HVA.
>>
>> Maybe limit the mmap() range?
>>
>> @@ -1609,6 +1620,10 @@ static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
>>         if (!kvm_gmem_supports_shared(file_inode(file)))
>>                 return -ENODEV;
>>
>> +       if (vma->vm_end - vma->vm_start + (vma->vm_pgoff << PAGE_SHIFT) > i_size_read(file_inode(file)))
>> +               return -EINVAL;
>> +
>>         if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) !=
>>             (VM_SHARED | VM_MAYSHARE)) {
>>                 return -EINVAL;
>>
>> [1] https://lore.kernel.org/all/20250513163438.3942405-8-tabba@google.com/
>
> I don't think we want to do that for a couple of reasons. We catch
> such invalid accesses on faulting, and, by analogy, afaikt, neither
> secretmem nor memfd perform a similar check on mmap (nor do
> memory-mapped files in general).
>
> There are also valid reasons why a user would want to deliberately
> mmap more memory than the backing store, knowing that it's only going
> to fault what it's going to use, e.g., alignment.
>

This is a good point.

I think there's no check against the inode size on faulting now though?
v10's [1] kvm_gmem_fault_shared() calls kvm_gmem_get_folio()
straightaway.

We should add a check like [2] to kvm_gmem_fault_shared().

[1] https://lore.kernel.org/all/20250513163438.3942405-8-tabba@google.com/
[2] https://github.com/torvalds/linux/blob/8477ab143069c6b05d6da4a8184ded8b969240f5/mm/filemap.c#L3373
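
i.e. roughly (sketch only, modelled on the filemap_fault() size check in [2]):

	static vm_fault_t kvm_gmem_fault_shared(struct vm_fault *vmf)
	{
		struct inode *inode = file_inode(vmf->vma->vm_file);
		pgoff_t max_idx;
		...
		max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
		if (unlikely(vmf->pgoff >= max_idx))
			return VM_FAULT_SIGBUS;
		...
	}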

> Cheers,
> /fuad
>
>
>> > +     return xa_to_value(entry);
>> > +}
>> > +
>> > +static struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
>> > +{
>> > +     if (kvm_gmem_shareability_get(inode, index) != SHAREABILITY_ALL)
>> > +             return ERR_PTR(-EACCES);
>> > +
>> > +     return kvm_gmem_get_folio(inode, index);
>> > +}
>> > +
>> > +#else
>> > +
>> > +static int kvm_gmem_shareability_setup(struct maple_tree *mt, loff_t size, u64 flags)
>> > +{
>> > +     return 0;
>> > +}
>> > +
>> > +static inline struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
>> > +{
>> > +     WARN_ONCE("Unexpected call to get shared folio.")
>> > +     return NULL;
>> > +}
>> > +
>> > +#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
>> > +
>> >  static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
>> >                                   pgoff_t index, struct folio *folio)
>> >  {
>> > @@ -333,7 +404,7 @@ static vm_fault_t kvm_gmem_fault_shared(struct vm_fault *vmf)
>> >
>> >       filemap_invalidate_lock_shared(inode->i_mapping);
>> >
>> > -     folio = kvm_gmem_get_folio(inode, vmf->pgoff);
>> > +     folio = kvm_gmem_get_shared_folio(inode, vmf->pgoff);
>> >       if (IS_ERR(folio)) {
>> >               int err = PTR_ERR(folio);
>> >
>> > @@ -420,8 +491,33 @@ static struct file_operations kvm_gmem_fops = {
>> >       .fallocate      = kvm_gmem_fallocate,
>> >  };
>> >
>> > +static void kvm_gmem_free_inode(struct inode *inode)
>> > +{
>> > +     struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
>> > +
>> > +     kfree(private);
>> > +
>> > +     free_inode_nonrcu(inode);
>> > +}
>> > +
>> > +static void kvm_gmem_destroy_inode(struct inode *inode)
>> > +{
>> > +     struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
>> > +
>> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
>> > +     /*
>> > +      * mtree_destroy() can't be used within rcu callback, hence can't be
>> > +      * done in ->free_inode().
>> > +      */
>> > +     if (private)
>> > +             mtree_destroy(&private->shareability);
>> > +#endif
>> > +}
>> > +
>> >  static const struct super_operations kvm_gmem_super_operations = {
>> >       .statfs         = simple_statfs,
>> > +     .destroy_inode  = kvm_gmem_destroy_inode,
>> > +     .free_inode     = kvm_gmem_free_inode,
>> >  };
>> >
>> >  static int kvm_gmem_init_fs_context(struct fs_context *fc)
>> > @@ -549,12 +645,26 @@ static const struct inode_operations kvm_gmem_iops = {
>> >  static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>> >                                                     loff_t size, u64 flags)
>> >  {
>> > +     struct kvm_gmem_inode_private *private;
>> >       struct inode *inode;
>> > +     int err;
>> >
>> >       inode = alloc_anon_secure_inode(kvm_gmem_mnt->mnt_sb, name);
>> >       if (IS_ERR(inode))
>> >               return inode;
>> >
>> > +     err = -ENOMEM;
>> > +     private = kzalloc(sizeof(*private), GFP_KERNEL);
>> > +     if (!private)
>> > +             goto out;
>> > +
>> > +     mt_init(&private->shareability);
>> Wrap the mt_init() inside "#ifdef CONFIG_KVM_GMEM_SHARED_MEM" ?
>>
>> > +     inode->i_mapping->i_private_data = private;
>> > +
>> > +     err = kvm_gmem_shareability_setup(private, size, flags);
>> > +     if (err)
>> > +             goto out;
>> > +
>> >       inode->i_private = (void *)(unsigned long)flags;
>> >       inode->i_op = &kvm_gmem_iops;
>> >       inode->i_mapping->a_ops = &kvm_gmem_aops;
>> > @@ -566,6 +676,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>> >       WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>> >
>> >       return inode;
>> > +
>> > +out:
>> > +     iput(inode);
>> > +
>> > +     return ERR_PTR(err);
>> >  }
>> >
>> >  static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
>> > @@ -654,6 +769,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>> >       if (kvm_arch_vm_supports_gmem_shared_mem(kvm))
>> >               valid_flags |= GUEST_MEMFD_FLAG_SUPPORT_SHARED;
>> >
>> > +     if (flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED)
>> > +             valid_flags |= GUEST_MEMFD_FLAG_INIT_PRIVATE;
>> > +
>> >       if (flags & ~valid_flags)
>> >               return -EINVAL;
>> >
>> > @@ -842,6 +960,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>> >       if (!file)
>> >               return -EFAULT;
>> >
>> > +     filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
>> > +
>> >       folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
>> >       if (IS_ERR(folio)) {
>> >               r = PTR_ERR(folio);
>> > @@ -857,8 +977,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>> >               *page = folio_file_page(folio, index);
>> >       else
>> >               folio_put(folio);
>> > -
>> >  out:
>> > +     filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
>> >       fput(file);
>> >       return r;
>> >  }
>> > --
>> > 2.49.0.1045.g170613ef41-goog
>> >
>> >

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 03/51] KVM: selftests: Update guest_memfd_test for INIT_PRIVATE flag
  2025-05-27  8:53       ` Binbin Wu
@ 2025-05-30 19:59         ` Ackerley Tng
  0 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-30 19:59 UTC (permalink / raw)
  To: Binbin Wu, Ira Weiny
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans,
	jhubbard, jroedel, jthoughton, jun.miao, kai.huang, keirf,
	kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li

Binbin Wu <binbin.wu@linux.intel.com> writes:

> On 5/17/2025 1:42 AM, Ackerley Tng wrote:
>> Ira Weiny <ira.weiny@intel.com> writes:
>>
>>> Ackerley Tng wrote:
>>>> Test that GUEST_MEMFD_FLAG_INIT_PRIVATE is only valid when
>>>> GUEST_MEMFD_FLAG_SUPPORT_SHARED is set.
>>>>
>>>> Change-Id: I506e236a232047cfaee17bcaed02ee14c8d25bbb
>>>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>>>> ---
>>>>   .../testing/selftests/kvm/guest_memfd_test.c  | 36 ++++++++++++-------
>>>>   1 file changed, 24 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
>>>> index 60aaba5808a5..bf2876cbd711 100644
>>>> --- a/tools/testing/selftests/kvm/guest_memfd_test.c
>>>> +++ b/tools/testing/selftests/kvm/guest_memfd_test.c
>>>> @@ -401,13 +401,31 @@ static void test_with_type(unsigned long vm_type, uint64_t guest_memfd_flags,
>>>>   	kvm_vm_release(vm);
>>>>   }
>>>>   
>>>> +static void test_vm_with_gmem_flag(struct kvm_vm *vm, uint64_t flag,
>>>> +				   bool expect_valid)
>>>> +{
>>>> +	size_t page_size = getpagesize();
>>>> +	int fd;
>>>> +
>>>> +	fd = __vm_create_guest_memfd(vm, page_size, flag);
>>>> +
>>>> +	if (expect_valid) {
>>>> +		TEST_ASSERT(fd > 0,
>>>> +			    "guest_memfd() with flag '0x%lx' should be valid",
>>>> +			    flag);
>>>> +		close(fd);
>>>> +	} else {
>>>> +		TEST_ASSERT(fd == -1 && errno == EINVAL,
>>>> +			    "guest_memfd() with flag '0x%lx' should fail with EINVAL",
>>>> +			    flag);
>>>> +	}
>>>> +}
>>>> +
>>>>   static void test_vm_type_gmem_flag_validity(unsigned long vm_type,
>>>>   					    uint64_t expected_valid_flags)
>>>>   {
>>>> -	size_t page_size = getpagesize();
>>>>   	struct kvm_vm *vm;
>>>>   	uint64_t flag = 0;
>>>> -	int fd;
>>>>   
>>>>   	if (!(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(vm_type)))
>>>>   		return;
>>>> @@ -415,17 +433,11 @@ static void test_vm_type_gmem_flag_validity(unsigned long vm_type,
>>>>   	vm = vm_create_barebones_type(vm_type);
>>>>   
>>>>   	for (flag = BIT(0); flag; flag <<= 1) {
>>>> -		fd = __vm_create_guest_memfd(vm, page_size, flag);
>>>> +		test_vm_with_gmem_flag(vm, flag, flag & expected_valid_flags);
>>>>   
>>>> -		if (flag & expected_valid_flags) {
>>>> -			TEST_ASSERT(fd > 0,
>>>> -				    "guest_memfd() with flag '0x%lx' should be valid",
>>>> -				    flag);
>>>> -			close(fd);
>>>> -		} else {
>>>> -			TEST_ASSERT(fd == -1 && errno == EINVAL,
>>>> -				    "guest_memfd() with flag '0x%lx' should fail with EINVAL",
>>>> -				    flag);
>>>> +		if (flag == GUEST_MEMFD_FLAG_SUPPORT_SHARED) {
>>>> +			test_vm_with_gmem_flag(
>>>> +				vm, flag | GUEST_MEMFD_FLAG_INIT_PRIVATE, true);
>>> I don't understand the point of this check.  In 2/51 we set
>>> GUEST_MEMFD_FLAG_INIT_PRIVATE when GUEST_MEMFD_FLAG_SUPPORT_SHARED is set.
>>>
>>> When can this check ever fail?
>>>
>>> Ira
>> In 02/51, GUEST_MEMFD_FLAG_INIT_PRIVATE is not set by default,
>> GUEST_MEMFD_FLAG_INIT_PRIVATE is set as one of the valid_flags.
>>
>> The intention is that GUEST_MEMFD_FLAG_INIT_PRIVATE is only valid if
>> GUEST_MEMFD_FLAG_SUPPORT_SHARED is requested by userspace.
>>
>> In this test, the earlier part before the if block calls
>> test_vm_with_gmem_flag() all valid flags, and that already tests
>> GUEST_MEMFD_FLAG_SUPPORT_SHARED individually.
>>
>> Specifically if GUEST_MEMFD_FLAG_SUPPORT_SHARED is set, this if block
>> adds a test for when both GUEST_MEMFD_FLAG_SUPPORT_SHARED and
>> GUEST_MEMFD_FLAG_INIT_PRIVATE are set, and sets that expect_valid is
>> true.
> Maybe it's more clear to move this case out of the loop?
>

Will try that in the next revision. Thanks!
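
Something along these lines (untested sketch):

	for (flag = BIT(0); flag; flag <<= 1)
		test_vm_with_gmem_flag(vm, flag, flag & expected_valid_flags);

	/* INIT_PRIVATE is only accepted together with SUPPORT_SHARED. */
	if (expected_valid_flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED)
		test_vm_with_gmem_flag(vm,
				       GUEST_MEMFD_FLAG_SUPPORT_SHARED |
				       GUEST_MEMFD_FLAG_INIT_PRIVATE,
				       true);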

>>
>> This second test doesn't fail, it is meant to check that the kernel
>> allows the pair of flags to be set. Hope that makes sense.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-28  3:16   ` Binbin Wu
@ 2025-05-30 20:10     ` Ackerley Tng
  2025-06-03  0:54       ` Binbin Wu
  0 siblings, 1 reply; 231+ messages in thread
From: Ackerley Tng @ 2025-05-30 20:10 UTC (permalink / raw)
  To: Binbin Wu
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko, jgg,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, vannapurve, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

Binbin Wu <binbin.wu@linux.intel.com> writes:

> On 5/15/2025 7:41 AM, Ackerley Tng wrote:
>
> [...]
>> +
>> +static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
>> +				  size_t nr_pages, bool shared,
>> +				  pgoff_t *error_index)
>> +{
>> +	struct conversion_work *work, *tmp, *rollback_stop_item;
>> +	LIST_HEAD(work_list);
>> +	struct inode *inode;
>> +	enum shareability m;
>> +	int ret;
>> +
>> +	inode = file_inode(file);
>> +
>> +	filemap_invalidate_lock(inode->i_mapping);
>> +
>> +	m = shared ? SHAREABILITY_ALL : SHAREABILITY_GUEST;
>> +	ret = kvm_gmem_convert_compute_work(inode, start, nr_pages, m, &work_list);
>> +	if (ret || list_empty(&work_list))
>> +		goto out;
>> +
>> +	list_for_each_entry(work, &work_list, list)
>> +		kvm_gmem_convert_invalidate_begin(inode, work);
>> +
>> +	list_for_each_entry(work, &work_list, list) {
>> +		ret = kvm_gmem_convert_should_proceed(inode, work, shared,
>> +						      error_index);
>
> Since kvm_gmem_invalidate_begin() begins to handle shared memory,
> kvm_gmem_convert_invalidate_begin() will zap the table.
> The shared mapping could be zapped in kvm_gmem_convert_invalidate_begin() even
> when kvm_gmem_convert_should_proceed() returns error.
> The sequence is a bit confusing to me, at least in this patch so far.
>

It is true that zapping of pages from the guest page table will happen
before we figure out whether conversion is allowed.

For a shared-to-private conversion, we will definitely unmap from the
host before checking if conversion is allowed; there's no choice there,
since conversion is only allowed if there are no unexpected refcounts,
and the way to drop the expected refcounts held by host mappings is to
unmap from the host.

Since we're unmapping before checking if conversion is allowed, I
thought it would be fine to also zap from guest page tables before
checking if conversion is allowed.

Conversion is not meant to happen very often, and even if memory is
unmapped or zapped, the next access will fault the page back in anyway,
so there is a performance impact but not a functionality impact.
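
To summarize the ordering in this patch for a shared-to-private request
(rough outline; folio_has_unexpected_refcounts() and the error value are
placeholders for illustration):

	filemap_invalidate_lock(inode->i_mapping);
	kvm_gmem_convert_invalidate_begin(...);   /* zap guest page tables */
	/* unmap from the host, dropping the refcounts host mappings held */
	if (folio_has_unexpected_refcounts(folio))
		ret = -EAGAIN;                    /* refuse; pages just get refaulted later */
	else
		ret = kvm_gmem_shareability_apply(inode, work, SHAREABILITY_GUEST);
	kvm_gmem_convert_invalidate_end(...);
	filemap_invalidate_unlock(inode->i_mapping);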

Hope that helps. Is it still odd to zap before checking if conversion
should proceed?

>> +		if (ret)
>> +			goto invalidate_end;
>> +	}
>> +
>> +	list_for_each_entry(work, &work_list, list) {
>> +		rollback_stop_item = work;
>> +		ret = kvm_gmem_shareability_apply(inode, work, m);
>> +		if (ret)
>> +			break;
>> +	}
>> +
>> +	if (ret) {
>> +		m = shared ? SHAREABILITY_GUEST : SHAREABILITY_ALL;
>> +		list_for_each_entry(work, &work_list, list) {
>> +			if (work == rollback_stop_item)
>> +				break;
>> +
>> +			WARN_ON(kvm_gmem_shareability_apply(inode, work, m));
>> +		}
>> +	}
>> +
>> +invalidate_end:
>> +	list_for_each_entry(work, &work_list, list)
>> +		kvm_gmem_convert_invalidate_end(inode, work);
>> +out:
>> +	filemap_invalidate_unlock(inode->i_mapping);
>> +
>> +	list_for_each_entry_safe(work, tmp, &work_list, list) {
>> +		list_del(&work->list);
>> +		kfree(work);
>> +	}
>> +
>> +	return ret;
>> +}
>> +
> [...]
>> @@ -186,15 +490,26 @@ static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
>>   	unsigned long index;
>>   
>>   	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
>> +		enum kvm_gfn_range_filter filter;
>>   		pgoff_t pgoff = slot->gmem.pgoff;
>>   
>> +		filter = KVM_FILTER_PRIVATE;
>> +		if (kvm_gmem_memslot_supports_shared(slot)) {
>> +			/*
>> +			 * Unmapping would also cause invalidation, but cannot
>> +			 * rely on mmu_notifiers to do invalidation via
>> +			 * unmapping, since memory may not be mapped to
>> +			 * userspace.
>> +			 */
>> +			filter |= KVM_FILTER_SHARED;
>> +		}
>> +
>>   		struct kvm_gfn_range gfn_range = {
>>   			.start = slot->base_gfn + max(pgoff, start) - pgoff,
>>   			.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
>>   			.slot = slot,
>>   			.may_block = true,
>> -			/* guest memfd is relevant to only private mappings. */
>> -			.attr_filter = KVM_FILTER_PRIVATE,
>> +			.attr_filter = filter,
>>   		};
>>   
>>   		if (!found_memslot) {
>> @@ -484,11 +799,49 @@ EXPORT_SYMBOL_GPL(kvm_gmem_memslot_supports_shared);
>>   #define kvm_gmem_mmap NULL
>>   #endif /* CONFIG_KVM_GMEM_SHARED_MEM */
>>   
> [...]

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 05/51] KVM: guest_memfd: Skip LRU for guest_memfd folios
  2025-05-28  7:01   ` Binbin Wu
@ 2025-05-30 20:32     ` Ackerley Tng
  0 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-05-30 20:32 UTC (permalink / raw)
  To: Binbin Wu
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko, jgg,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, vannapurve, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

Binbin Wu <binbin.wu@linux.intel.com> writes:

> On 5/15/2025 7:41 AM, Ackerley Tng wrote:
>> filemap_add_folio(), called from filemap_grab_folio(), adds the folio
>> onto some LRU list, which is not necessary for guest_memfd since
>> guest_memfd folios don't participate in any swapping.
>>
>> This patch reimplements part of filemap_add_folio() to avoid adding
>> allocated guest_memfd folios to the filemap.
>
> filemap -> LRU list?
>

Yes, thank you. Will fix this in the next revision.

>>
>> With shared to private conversions dependent on refcounts, avoiding
>> usage of LRU ensures that LRU lists no longer take any refcounts on
>> guest_memfd folios and significantly reduces the chance of elevated
>> refcounts during conversion.
>>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> Change-Id: Ia2540d9fc132d46219e6e714fd42bc82a62a27fa
>> ---
>>   mm/filemap.c           |  1 +
>>   mm/memcontrol.c        |  2 +
>>   virt/kvm/guest_memfd.c | 91 ++++++++++++++++++++++++++++++++++++++----
>>   3 files changed, 86 insertions(+), 8 deletions(-)
>>
> [...]
>>   /*
>>    * Returns a locked folio on success.  The caller is responsible for
>>    * setting the up-to-date flag before the memory is mapped into the guest.
>> @@ -477,8 +509,46 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
>>    */
>>   static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>>   {
>> +	struct folio *folio;
>> +	gfp_t gfp;
>> +	int ret;
>> +
>> +repeat:
>> +	folio = filemap_lock_folio(inode->i_mapping, index);
>> +	if (!IS_ERR(folio))
>> +		return folio;
>> +
>> +	gfp = mapping_gfp_mask(inode->i_mapping);
>> +
>>   	/* TODO: Support huge pages. */
>> -	return filemap_grab_folio(inode->i_mapping, index);
>> +	folio = filemap_alloc_folio(gfp, 0);
>> +	if (!folio)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	ret = mem_cgroup_charge(folio, NULL, gfp);
>> +	if (ret) {
>> +		folio_put(folio);
>> +		return ERR_PTR(ret);
>> +	}
>> +
>> +	ret = kvm_gmem_filemap_add_folio(inode->i_mapping, folio, index);
>> +	if (ret) {
>> +		folio_put(folio);
>> +
>> +		/*
>> +		 * There was a race, two threads tried to get a folio indexing
>> +		 * to the same location in the filemap. The losing thread should
>> +		 * free the allocated folio, then lock the folio added to the
>> +		 * filemap by the winning thread.
>
> How about changing
> “then lock the folio added to the filemap by the winning thread”
> to
> "the winning thread locks the folio added to the filemap"?
>

How about:

There was a race: threads raced to install a folio at the same index in
the filemap. The winning thread allocated and locked the folio at the
requested index. The losing threads should free their extra allocated
folio, then wait to lock the folio allocated (and locked) by the winning
thread.

>> +		 */
>> +		if (ret == -EEXIST)
>> +			goto repeat;
>> +
>> +		return ERR_PTR(ret);
>> +	}
>> +
>> +	__folio_set_locked(folio);
>> +	return folio;
>>   }
>>   
>>   static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
>> @@ -956,23 +1026,28 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
>>   }
>>   
>>   #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
>> +static void kvm_gmem_invalidate(struct folio *folio)
>> +{
>> +	kvm_pfn_t pfn = folio_pfn(folio);
>> +
>> +	kvm_arch_gmem_invalidate(pfn, pfn + folio_nr_pages(folio));
>> +}
>> +#else
>> +static inline void kvm_gmem_invalidate(struct folio *folio) {}
>
> No need to tag a local static function with "inline".
>

Will fix in the next revision.

>> +#endif
>> +
> [...]

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 22/51] mm: hugetlb: Refactor hugetlb allocation functions
  2025-05-14 23:42 ` [RFC PATCH v2 22/51] mm: hugetlb: Refactor hugetlb allocation functions Ackerley Tng
@ 2025-05-31 23:45   ` Ira Weiny
  2025-06-13 22:03     ` Ackerley Tng
  0 siblings, 1 reply; 231+ messages in thread
From: Ira Weiny @ 2025-05-31 23:45 UTC (permalink / raw)
  To: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Ackerley Tng wrote:
> Refactor dequeue_hugetlb_folio() and alloc_surplus_hugetlb_folio() to
> take mpol, nid and nodemask. This decouples allocation of a folio from
> a vma.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Change-Id: I890fb46fe8c6349383d8cf89befc68a4994eb416
> ---
>  mm/hugetlb.c | 64 ++++++++++++++++++++++++----------------------------
>  1 file changed, 30 insertions(+), 34 deletions(-)
> 

[snip]

>  
> @@ -2993,6 +2974,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  	int ret, idx;
>  	struct hugetlb_cgroup *h_cg = NULL;
>  	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
> +	struct mempolicy *mpol;
> +	nodemask_t *nodemask;
> +	gfp_t gfp_mask;
> +	pgoff_t ilx;
> +	int nid;
>  
>  	idx = hstate_index(h);
>  
> @@ -3032,7 +3018,6 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  
>  		subpool_reservation_exists = npages_req == 0;
>  	}
> -
>  	reservation_exists = vma_reservation_exists || subpool_reservation_exists;
>  
>  	/*
> @@ -3048,21 +3033,30 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  			goto out_subpool_put;
>  	}
>  
> +	mpol = get_vma_policy(vma, addr, h->order, &ilx);

Why does the memory policy need to be acquired here instead of after the
cgroup charge?  AFAICT this is not needed, and moving it would eliminate
at least one of the error-path puts.

> +
>  	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
> -	if (ret)
> +	if (ret) {
> +		mpol_cond_put(mpol);
                ^^^^
		here

All that said, I think the use of some of the new cleanup macros could
really help a lot of this code.
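
Something along these lines, perhaps (rough, untested sketch; the
"mpol_put" free class name is made up here, building on linux/cleanup.h):

DEFINE_FREE(mpol_put, struct mempolicy *, if (_T) mpol_cond_put(_T))

	/* ... then in alloc_hugetlb_folio(): */
	struct mempolicy *mpol __free(mpol_put) =
			get_vma_policy(vma, addr, h->order, &ilx);

	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
	if (ret)
		goto out_subpool_put;	/* no explicit mpol_cond_put() needed */

With that, the mempolicy reference is dropped automatically whenever the
function returns, so none of the error paths have to care about it.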

What do folks in this area of the kernel think of those?

Ira

[snip]

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 23/51] mm: hugetlb: Refactor out hugetlb_alloc_folio()
  2025-05-14 23:42 ` [RFC PATCH v2 23/51] mm: hugetlb: Refactor out hugetlb_alloc_folio() Ackerley Tng
@ 2025-06-01  0:38   ` Ira Weiny
  2025-06-13 22:07     ` Ackerley Tng
  0 siblings, 1 reply; 231+ messages in thread
From: Ira Weiny @ 2025-06-01  0:38 UTC (permalink / raw)
  To: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: ackerleytng, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vannapurve, vbabka, viro,
	vkuznets, wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao,
	yilun.xu, yuzenghui, zhiquan1.li

Ackerley Tng wrote:
> Refactor out hugetlb_alloc_folio() from alloc_hugetlb_folio(), which
> handles allocation of a folio and cgroup charging.
> 
> Besides flags to control charging during allocation,
> hugetlb_alloc_folio() also takes parameters for memory policy.
> 
> This refactoring as a whole decouples hugetlb page allocation from
> hugetlbfs, where (1) the subpool is stored at the fs mount, (2)
> reservations are made during mmap and stored in the vma, (3) mpol
> must be stored at vma->vm_policy, and (4) a vma must be used for
> allocation even if the pages are not meant to be used by a host
> process.
> 
> This decoupling will allow hugetlb_alloc_folio() to be used by
> guest_memfd in later patches. In guest_memfd, (1) a subpool is created
> per-fd and is stored on the inode, (2) no vma-related reservations are
> used, (3) mpol may not be associated with a vma, since (4) for private
> pages, the pages will not be mappable to userspace and hence have no
> associated vmas.
> 
> This could hopefully also open hugetlb up as a more generic source of
> hugetlb pages that are not bound to hugetlbfs, with the complexities
> of userspace/mmap/vma-related reservations confined to hugetlbfs.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Change-Id: I60528f246341268acbf0ed5de7752ae2cacbef93
> ---
>  include/linux/hugetlb.h |  12 +++
>  mm/hugetlb.c            | 192 ++++++++++++++++++++++------------------
>  2 files changed, 118 insertions(+), 86 deletions(-)
> 

[snip]

>  
> +/**
> + * hugetlb_alloc_folio() - Allocates a hugetlb folio.
> + *
> + * @h: struct hstate to allocate from.
> + * @mpol: struct mempolicy to apply for this folio allocation.
> + * @ilx: Interleave index for interpretation of @mpol.
> + * @charge_cgroup_rsvd: Set to true to charge cgroup reservation.
> + * @use_existing_reservation: Set to true if this allocation should use an
> + *                            existing hstate reservation.
> + *
> + * This function handles cgroup and global hstate reservations. VMA-related
> + * reservations and subpool debiting must be handled by the caller if necessary.
> + *
> + * Return: folio on success or negated error otherwise.
> + */
> +struct folio *hugetlb_alloc_folio(struct hstate *h, struct mempolicy *mpol,
> +				  pgoff_t ilx, bool charge_cgroup_rsvd,
> +				  bool use_existing_reservation)
> +{
> +	unsigned int nr_pages = pages_per_huge_page(h);
> +	struct hugetlb_cgroup *h_cg = NULL;
> +	struct folio *folio = NULL;
> +	nodemask_t *nodemask;
> +	gfp_t gfp_mask;
> +	int nid;
> +	int idx;
> +	int ret;
> +
> +	idx = hstate_index(h);
> +
> +	if (charge_cgroup_rsvd) {
> +		if (hugetlb_cgroup_charge_cgroup_rsvd(idx, nr_pages, &h_cg))
> +			goto out;

Why not just return here?
			return ERR_PTR(-ENOSPC);

> +	}
> +
> +	if (hugetlb_cgroup_charge_cgroup(idx, nr_pages, &h_cg))
> +		goto out_uncharge_cgroup_reservation;
> +
> +	gfp_mask = htlb_alloc_mask(h);
> +	nid = policy_node_nodemask(mpol, gfp_mask, ilx, &nodemask);
> +
> +	spin_lock_irq(&hugetlb_lock);
> +
> +	if (use_existing_reservation || available_huge_pages(h))
> +		folio = dequeue_hugetlb_folio(h, gfp_mask, mpol, nid, nodemask);
> +
> +	if (!folio) {
> +		spin_unlock_irq(&hugetlb_lock);
> +		folio = alloc_surplus_hugetlb_folio(h, gfp_mask, mpol, nid, nodemask);
> +		if (!folio)
> +			goto out_uncharge_cgroup;
> +		spin_lock_irq(&hugetlb_lock);
> +		list_add(&folio->lru, &h->hugepage_activelist);
> +		folio_ref_unfreeze(folio, 1);
> +		/* Fall through */
> +	}
> +
> +	if (use_existing_reservation) {
> +		folio_set_hugetlb_restore_reserve(folio);
> +		h->resv_huge_pages--;
> +	}
> +
> +	hugetlb_cgroup_commit_charge(idx, nr_pages, h_cg, folio);
> +
> +	if (charge_cgroup_rsvd)
> +		hugetlb_cgroup_commit_charge_rsvd(idx, nr_pages, h_cg, folio);
> +
> +	spin_unlock_irq(&hugetlb_lock);
> +
> +	gfp_mask = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
> +	ret = mem_cgroup_charge_hugetlb(folio, gfp_mask);
> +	/*
> +	 * Unconditionally increment NR_HUGETLB here. If it turns out that
> +	 * mem_cgroup_charge_hugetlb failed, then immediately free the page and
> +	 * decrement NR_HUGETLB.
> +	 */
> +	lruvec_stat_mod_folio(folio, NR_HUGETLB, pages_per_huge_page(h));
> +
> +	if (ret == -ENOMEM) {
> +		free_huge_folio(folio);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	return folio;
> +
> +out_uncharge_cgroup:
> +	hugetlb_cgroup_uncharge_cgroup(idx, nr_pages, h_cg);
> +out_uncharge_cgroup_reservation:
> +	if (charge_cgroup_rsvd)
> +		hugetlb_cgroup_uncharge_cgroup_rsvd(idx, nr_pages, h_cg);

I find the direct copy of the unwind logic from alloc_hugetlb_folio()
cumbersome and it seems like a good opportunity to clean it up.

> +out:
> +	folio = ERR_PTR(-ENOSPC);
> +	goto out;

Endless loop?
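Presumably that was meant to be "return folio;" (or simply
"return ERR_PTR(-ENOSPC);").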

Ira

[snip]

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-05-30 18:32       ` Ackerley Tng
@ 2025-06-02  9:43         ` Fuad Tabba
  0 siblings, 0 replies; 231+ messages in thread
From: Fuad Tabba @ 2025-06-02  9:43 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Yan Zhao, kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik,
	ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yilun.xu, yuzenghui,
	zhiquan1.li

Hi Ackerley,

On Fri, 30 May 2025 at 19:32, Ackerley Tng <ackerleytng@google.com> wrote:
>
> Fuad Tabba <tabba@google.com> writes:
>
> > Hi,
> >
> > .. snip..
> >
> >> I noticed that in [1], the kvm_gmem_mmap() does not check the range.
> >> So, the WARN() here can be hit when userspace mmap() an area larger than the
> >> inode size and accesses the out of band HVA.
> >>
> >> Maybe limit the mmap() range?
> >>
> >> @@ -1609,6 +1620,10 @@ static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
> >>         if (!kvm_gmem_supports_shared(file_inode(file)))
> >>                 return -ENODEV;
> >>
> >> +       if (vma->vm_end - vma->vm_start + (vma->vm_pgoff << PAGE_SHIFT) > i_size_read(file_inode(file)))
> >> +               return -EINVAL;
> >> +
> >>         if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) !=
> >>             (VM_SHARED | VM_MAYSHARE)) {
> >>                 return -EINVAL;
> >>
> >> [1] https://lore.kernel.org/all/20250513163438.3942405-8-tabba@google.com/
> >
> > I don't think we want to do that for a couple of reasons. We catch
> > such invalid accesses on faulting, and, by analogy, afaikt, neither
> > secretmem nor memfd perform a similar check on mmap (nor do
> > memory-mapped files in general).
> >
> > There are also valid reasons why a user would want to deliberately
> > mmap more memory than the backing store, knowing that it's only going
> > to fault what it's going to use, e.g., alignment.
> >
>
> This is a good point.
>
> I think there's no check against the inode size on faulting now though?
> v10's [1] kvm_gmem_fault_shared() calls kvm_gmem_get_folio()
> straightaway.
>
> We should add a check like [2] to kvm_gmem_fault_shared().

Yes! I mistakenly thought that kvm_gmem_get_folio() had such a check;
I just verified that it doesn't. I have added the check, as well as a
new selftest, to make sure we don't miss it in the future.
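
Roughly what I have in mind, mirroring the i_size check in
filemap_fault() (untested sketch):

	/* in kvm_gmem_fault_shared(), before looking up the folio */
	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
		return VM_FAULT_SIGBUS;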

Thanks!
/fuad

> [1] https://lore.kernel.org/all/20250513163438.3942405-8-tabba@google.com/
> [2] https://github.com/torvalds/linux/blob/8477ab143069c6b05d6da4a8184ded8b969240f5/mm/filemap.c#L3373
>
> > Cheers,
> > /fuad
> >
> >
> >> > +     return xa_to_value(entry);
> >> > +}
> >> > +
> >> > +static struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
> >> > +{
> >> > +     if (kvm_gmem_shareability_get(inode, index) != SHAREABILITY_ALL)
> >> > +             return ERR_PTR(-EACCES);
> >> > +
> >> > +     return kvm_gmem_get_folio(inode, index);
> >> > +}
> >> > +
> >> > +#else
> >> > +
> >> > +static int kvm_gmem_shareability_setup(struct maple_tree *mt, loff_t size, u64 flags)
> >> > +{
> >> > +     return 0;
> >> > +}
> >> > +
> >> > +static inline struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
> >> > +{
> >> > +     WARN_ONCE("Unexpected call to get shared folio.")
> >> > +     return NULL;
> >> > +}
> >> > +
> >> > +#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> >> > +
> >> >  static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
> >> >                                   pgoff_t index, struct folio *folio)
> >> >  {
> >> > @@ -333,7 +404,7 @@ static vm_fault_t kvm_gmem_fault_shared(struct vm_fault *vmf)
> >> >
> >> >       filemap_invalidate_lock_shared(inode->i_mapping);
> >> >
> >> > -     folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> >> > +     folio = kvm_gmem_get_shared_folio(inode, vmf->pgoff);
> >> >       if (IS_ERR(folio)) {
> >> >               int err = PTR_ERR(folio);
> >> >
> >> > @@ -420,8 +491,33 @@ static struct file_operations kvm_gmem_fops = {
> >> >       .fallocate      = kvm_gmem_fallocate,
> >> >  };
> >> >
> >> > +static void kvm_gmem_free_inode(struct inode *inode)
> >> > +{
> >> > +     struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
> >> > +
> >> > +     kfree(private);
> >> > +
> >> > +     free_inode_nonrcu(inode);
> >> > +}
> >> > +
> >> > +static void kvm_gmem_destroy_inode(struct inode *inode)
> >> > +{
> >> > +     struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
> >> > +
> >> > +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> >> > +     /*
> >> > +      * mtree_destroy() can't be used within rcu callback, hence can't be
> >> > +      * done in ->free_inode().
> >> > +      */
> >> > +     if (private)
> >> > +             mtree_destroy(&private->shareability);
> >> > +#endif
> >> > +}
> >> > +
> >> >  static const struct super_operations kvm_gmem_super_operations = {
> >> >       .statfs         = simple_statfs,
> >> > +     .destroy_inode  = kvm_gmem_destroy_inode,
> >> > +     .free_inode     = kvm_gmem_free_inode,
> >> >  };
> >> >
> >> >  static int kvm_gmem_init_fs_context(struct fs_context *fc)
> >> > @@ -549,12 +645,26 @@ static const struct inode_operations kvm_gmem_iops = {
> >> >  static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
> >> >                                                     loff_t size, u64 flags)
> >> >  {
> >> > +     struct kvm_gmem_inode_private *private;
> >> >       struct inode *inode;
> >> > +     int err;
> >> >
> >> >       inode = alloc_anon_secure_inode(kvm_gmem_mnt->mnt_sb, name);
> >> >       if (IS_ERR(inode))
> >> >               return inode;
> >> >
> >> > +     err = -ENOMEM;
> >> > +     private = kzalloc(sizeof(*private), GFP_KERNEL);
> >> > +     if (!private)
> >> > +             goto out;
> >> > +
> >> > +     mt_init(&private->shareability);
> >> Wrap the mt_init() inside "#ifdef CONFIG_KVM_GMEM_SHARED_MEM" ?
> >>
> >> > +     inode->i_mapping->i_private_data = private;
> >> > +
> >> > +     err = kvm_gmem_shareability_setup(private, size, flags);
> >> > +     if (err)
> >> > +             goto out;
> >> > +
> >> >       inode->i_private = (void *)(unsigned long)flags;
> >> >       inode->i_op = &kvm_gmem_iops;
> >> >       inode->i_mapping->a_ops = &kvm_gmem_aops;
> >> > @@ -566,6 +676,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
> >> >       WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> >> >
> >> >       return inode;
> >> > +
> >> > +out:
> >> > +     iput(inode);
> >> > +
> >> > +     return ERR_PTR(err);
> >> >  }
> >> >
> >> >  static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
> >> > @@ -654,6 +769,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> >> >       if (kvm_arch_vm_supports_gmem_shared_mem(kvm))
> >> >               valid_flags |= GUEST_MEMFD_FLAG_SUPPORT_SHARED;
> >> >
> >> > +     if (flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED)
> >> > +             valid_flags |= GUEST_MEMFD_FLAG_INIT_PRIVATE;
> >> > +
> >> >       if (flags & ~valid_flags)
> >> >               return -EINVAL;
> >> >
> >> > @@ -842,6 +960,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> >> >       if (!file)
> >> >               return -EFAULT;
> >> >
> >> > +     filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> >> > +
> >> >       folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
> >> >       if (IS_ERR(folio)) {
> >> >               r = PTR_ERR(folio);
> >> > @@ -857,8 +977,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> >> >               *page = folio_file_page(folio, index);
> >> >       else
> >> >               folio_put(folio);
> >> > -
> >> >  out:
> >> > +     filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
> >> >       fput(file);
> >> >       return r;
> >> >  }
> >> > --
> >> > 2.49.0.1045.g170613ef41-goog
> >> >
> >> >

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-30 20:10     ` Ackerley Tng
@ 2025-06-03  0:54       ` Binbin Wu
  0 siblings, 0 replies; 231+ messages in thread
From: Binbin Wu @ 2025-06-03  0:54 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko, jgg,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, vannapurve, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li



On 5/31/2025 4:10 AM, Ackerley Tng wrote:
> Binbin Wu <binbin.wu@linux.intel.com> writes:
>
>> On 5/15/2025 7:41 AM, Ackerley Tng wrote:
>>
>> [...]
>>> +
>>> +static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
>>> +				  size_t nr_pages, bool shared,
>>> +				  pgoff_t *error_index)
>>> +{
>>> +	struct conversion_work *work, *tmp, *rollback_stop_item;
>>> +	LIST_HEAD(work_list);
>>> +	struct inode *inode;
>>> +	enum shareability m;
>>> +	int ret;
>>> +
>>> +	inode = file_inode(file);
>>> +
>>> +	filemap_invalidate_lock(inode->i_mapping);
>>> +
>>> +	m = shared ? SHAREABILITY_ALL : SHAREABILITY_GUEST;
>>> +	ret = kvm_gmem_convert_compute_work(inode, start, nr_pages, m, &work_list);
>>> +	if (ret || list_empty(&work_list))
>>> +		goto out;
>>> +
>>> +	list_for_each_entry(work, &work_list, list)
>>> +		kvm_gmem_convert_invalidate_begin(inode, work);
>>> +
>>> +	list_for_each_entry(work, &work_list, list) {
>>> +		ret = kvm_gmem_convert_should_proceed(inode, work, shared,
>>> +						      error_index);
>> Since kvm_gmem_invalidate_begin() begins to handle shared memory,
>> kvm_gmem_convert_invalidate_begin() will zap the table.
>> The shared mapping could be zapped in kvm_gmem_convert_invalidate_begin() even
>> when kvm_gmem_convert_should_proceed() returns an error.
>> The sequence is a bit confusing to me, at least in this patch so far.
>>
> It is true that zapping of pages from the guest page table will happen
> before we figure out whether conversion is allowed.
>
> For a shared-to-private conversion, we will definitely unmap from the
> host before checking if conversion is allowed, and there's no choice
> there since conversion is allowed if there are no unexpected refcounts,
> and the way to eliminate expected refcounts is to unmap from the host.
>
> Since we're unmapping before checking if conversion is allowed, I
> thought it would be fine to also zap from guest page tables before
> checking if conversion is allowed.
>
> Conversion is not meant to happen very regularly, and even if it is
> unmapped or zapped, the next access will fault in the page anyway, so
> there is a performance but not a functionality impact.
Yes, it's OK for shared mapping.

>
> Hope that helps.

It helped, thanks!

> Is it still odd to zap before checking if conversion
> should proceed?
>
>>> +		if (ret)
>>> +			goto invalidate_end;
>>> +	}
>>> +
>>> +	list_for_each_entry(work, &work_list, list) {
>>> +		rollback_stop_item = work;
>>> +		ret = kvm_gmem_shareability_apply(inode, work, m);
>>> +		if (ret)
>>> +			break;
>>> +	}
>>> +
>>> +	if (ret) {
>>> +		m = shared ? SHAREABILITY_GUEST : SHAREABILITY_ALL;
>>> +		list_for_each_entry(work, &work_list, list) {
>>> +			if (work == rollback_stop_item)
>>> +				break;
>>> +
>>> +			WARN_ON(kvm_gmem_shareability_apply(inode, work, m));
>>> +		}
>>> +	}
>>> +
>>> +invalidate_end:
>>> +	list_for_each_entry(work, &work_list, list)
>>> +		kvm_gmem_convert_invalidate_end(inode, work);
>>> +out:
>>> +	filemap_invalidate_unlock(inode->i_mapping);
>>> +
>>> +	list_for_each_entry_safe(work, tmp, &work_list, list) {
>>> +		list_del(&work->list);
>>> +		kfree(work);
>>> +	}
>>> +
>>> +	return ret;
>>> +}
>>> +
>> [...]
>>> @@ -186,15 +490,26 @@ static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
>>>    	unsigned long index;
>>>    
>>>    	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
>>> +		enum kvm_gfn_range_filter filter;
>>>    		pgoff_t pgoff = slot->gmem.pgoff;
>>>    
>>> +		filter = KVM_FILTER_PRIVATE;
>>> +		if (kvm_gmem_memslot_supports_shared(slot)) {
>>> +			/*
>>> +			 * Unmapping would also cause invalidation, but cannot
>>> +			 * rely on mmu_notifiers to do invalidation via
>>> +			 * unmapping, since memory may not be mapped to
>>> +			 * userspace.
>>> +			 */
>>> +			filter |= KVM_FILTER_SHARED;
>>> +		}
>>> +
>>>    		struct kvm_gfn_range gfn_range = {
>>>    			.start = slot->base_gfn + max(pgoff, start) - pgoff,
>>>    			.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
>>>    			.slot = slot,
>>>    			.may_block = true,
>>> -			/* guest memfd is relevant to only private mappings. */
>>> -			.attr_filter = KVM_FILTER_PRIVATE,
>>> +			.attr_filter = filter,
>>>    		};
>>>    
>>>    		if (!found_memslot) {
>>> @@ -484,11 +799,49 @@ EXPORT_SYMBOL_GPL(kvm_gmem_memslot_supports_shared);
>>>    #define kvm_gmem_mmap NULL
>>>    #endif /* CONFIG_KVM_GMEM_SHARED_MEM */
>>>    
>> [...]


^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 33/51] KVM: guest_memfd: Allocate and truncate from custom allocator
  2025-05-14 23:42 ` [RFC PATCH v2 33/51] KVM: guest_memfd: Allocate and truncate from " Ackerley Tng
                     ` (2 preceding siblings ...)
  2025-05-28 10:58   ` Yan Zhao
@ 2025-06-03  7:43   ` Binbin Wu
  2025-07-16 22:13     ` Ackerley Tng
  3 siblings, 1 reply; 231+ messages in thread
From: Binbin Wu @ 2025-06-03  7:43 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko, jgg,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li



On 5/15/2025 7:42 AM, Ackerley Tng wrote:
[...]
>   
>   	list_for_each_entry(gmem, gmem_list, entry)
>   		kvm_gmem_invalidate_end(gmem, start, end);
> @@ -776,6 +879,16 @@ static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
>   
>   	start = offset >> PAGE_SHIFT;
>   	end = (offset + len) >> PAGE_SHIFT;
> +	if (kvm_gmem_has_custom_allocator(inode)) {
> +		size_t nr_pages;
> +		void *p;
> +
> +		p = kvm_gmem_allocator_private(inode);
> +		nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(p);
> +
> +		start = round_down(start, nr_pages);
> +		end = round_down(end, nr_pages);
This looks odd here.
Should the end be round_up()?

> +	}
>   
>   	r = 0;
>   	for (index = start; index < end; ) {
>
[...]

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 38/51] KVM: guest_memfd: Split allocator pages for guest_memfd use
  2025-05-14 23:42 ` [RFC PATCH v2 38/51] KVM: guest_memfd: Split allocator pages for guest_memfd use Ackerley Tng
                     ` (2 preceding siblings ...)
  2025-05-27  8:45   ` Yan Zhao
@ 2025-06-05  5:24   ` Binbin Wu
  2025-06-05 19:16     ` Ackerley Tng
  3 siblings, 1 reply; 231+ messages in thread
From: Binbin Wu @ 2025-06-05  5:24 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko, jgg,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li



On 5/15/2025 7:42 AM, Ackerley Tng wrote:
[...]
> +
> +static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
> +						      struct folio *folio)
> +{
> +	size_t to_nr_pages;
> +	void *priv;
> +
> +	if (!kvm_gmem_has_custom_allocator(inode))
> +		return 0;
> +
> +	priv = kvm_gmem_allocator_private(inode);
> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_page(priv);
> +
> +	if (kvm_gmem_has_some_shared(inode, folio->index, to_nr_pages))

What about a huge page whose attribute is shared?

> +		return kvm_gmem_split_folio_in_filemap(inode, folio);
> +
> +	return 0;
> +}
> +
[...]
>   
>   static int kvm_gmem_shareability_setup(struct maple_tree *mt, loff_t size, u64 flags)
> @@ -563,11 +1005,16 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>   		return folio;
>   
>   	if (kvm_gmem_has_custom_allocator(inode)) {
> -		void *p = kvm_gmem_allocator_private(inode);
> +		size_t nr_pages;
> +		void *p;
>   
> +		p = kvm_gmem_allocator_private(inode);
>   		folio = kvm_gmem_allocator_ops(inode)->alloc_folio(p);
>   		if (IS_ERR(folio))
>   			return folio;
> +
> +		nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(p);
> +		index_floor = round_down(index, nr_pages);
>   	} else {
>   		gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
>   
> @@ -580,10 +1027,11 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>   			folio_put(folio);
>   			return ERR_PTR(ret);
>   		}
> +
> +		index_floor = index;
>   	}
>   	allocated_size = folio_size(folio);
>   
> -	index_floor = round_down(index, folio_nr_pages(folio));
>   	ret = kvm_gmem_filemap_add_folio(inode->i_mapping, folio, index_floor);
>   	if (ret) {
>   		folio_put(folio);
> @@ -600,6 +1048,13 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>   		return ERR_PTR(ret);
>   	}
>   
> +	/* Leave just filemap's refcounts on folio. */
> +	folio_put(folio);
> +
> +	ret = kvm_gmem_try_split_folio_in_filemap(inode, folio);

When !CONFIG_KVM_GMEM_SHARED_MEM, kvm_gmem_try_split_folio_in_filemap() is
undefined.
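
A no-op stub for that configuration would presumably be needed,
something like (sketch):

#else
static int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
					       struct folio *folio)
{
	return 0;
}
#endif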

> +	if (ret)
> +		goto err;
> +
>   	spin_lock(&inode->i_lock);
>   	inode->i_blocks += allocated_size / 512;
>   	spin_unlock(&inode->i_lock);
>
[...]

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 38/51] KVM: guest_memfd: Split allocator pages for guest_memfd use
  2025-05-22 22:19   ` Edgecombe, Rick P
@ 2025-06-05 17:15     ` Ackerley Tng
  2025-06-05 17:53       ` Edgecombe, Rick P
  2025-06-05 17:15     ` Ackerley Tng
                       ` (3 subsequent siblings)
  4 siblings, 1 reply; 231+ messages in thread
From: Ackerley Tng @ 2025-06-05 17:15 UTC (permalink / raw)
  To: Edgecombe, Rick P, kvm@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, x86@kernel.org
  Cc: palmer@dabbelt.com, pvorel@suse.cz, catalin.marinas@arm.com,
	Miao, Jun, Shutemov, Kirill, pdurrant@amazon.co.uk,
	steven.price@arm.com, peterx@redhat.com, vbabka@suse.cz,
	jack@suse.cz, amoorthy@google.com, maz@kernel.org,
	keirf@google.com, vkuznets@redhat.com, quic_eberman@quicinc.com,
	mail@maciej.szmigiero.name, hughd@google.com,
	anthony.yznaga@oracle.com, Wang, Wei W, Du, Fan,
	Wieczor-Retman, Maciej, quic_svaddagi@quicinc.com, Hansen, Dave,
	ajones@ventanamicro.com, paul.walmsley@sifive.com,
	nsaenz@amazon.es, aik@amd.com, usama.arif@bytedance.com,
	quic_mnalajal@quicinc.com, fvdl@google.com, rppt@kernel.org,
	quic_cvanscha@quicinc.com, bfoster@redhat.com,
	willy@infradead.org, anup@brainfault.org, thomas.lendacky@amd.com,
	tabba@google.com, mic@digikod.net, oliver.upton@linux.dev,
	akpm@linux-foundation.org, Zhao, Yan Y, binbin.wu@linux.intel.com,
	muchun.song@linux.dev, Li, Zhiquan1, rientjes@google.com,
	mpe@ellerman.id.au, Aktas, Erdem, david@redhat.com, jgg@ziepe.ca,
	Annapurve, Vishal, Xu, Haibo1, jhubbard@nvidia.com,
	Yamahata, Isaku, jthoughton@google.com, will@kernel.org,
	steven.sistare@oracle.com, jarkko@kernel.org,
	quic_pheragu@quicinc.com, chenhuacai@kernel.org, Huang, Kai,
	shuah@kernel.org, dwmw@amazon.co.uk, pankaj.gupta@amd.com,
	Peng, Chao P, nikunj@amd.com, Graf, Alexander,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Xu, Yilun, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com,
	richard.weiyang@gmail.com, Weiny, Ira, aou@eecs.berkeley.edu,
	Li, Xiaoyao, qperret@google.com, kent.overstreet@linux.dev,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	pgonda@google.com, quic_pderrin@quicinc.com, hch@infradead.org,
	roypat@amazon.co.uk, seanjc@google.com

"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes:

> On Wed, 2025-05-14 at 16:42 -0700, Ackerley Tng wrote:
>> +
>> +static pgoff_t kvm_gmem_compute_invalidate_bound(struct inode *inode,
>> +						 pgoff_t bound, bool start)
>> +{
>> +	size_t nr_pages;
>> +	void *priv;
>> +
>> +	if (!kvm_gmem_has_custom_allocator(inode))
>
> General comment - It's a bit unfortunate how kvm_gmem_has_custom_allocator() is
> checked all over the place across this series. There are only two allocators
> after this, right? So one is implemented with callbacks presumably designed to
> fit other allocators, and one has special case logic in guest_memfd.c.
>
> Did you consider designing struct guestmem_allocator_operations so that it could
> encapsulate the special logic for both the existing and new
> allocators?

I did, yes. I believe it is definitely possible to make standard 4K
pages become another allocator too.

I would love to clean this up. Not sure if that will be a new series
after this one, or part of this one though.
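
Conceptually, the existing 4K path could be expressed through the same
ops too, something like the very rough sketch below. (I'm only going by
the ops and signatures implied by the call sites in this series, and
ignoring whatever other ops the interface requires.)

static struct folio *guestmem_4k_alloc_folio(void *priv)
{
	struct folio *folio = folio_alloc(GFP_HIGHUSER, 0);

	return folio ? folio : ERR_PTR(-ENOMEM);
}

static size_t guestmem_4k_nr_pages_in_folio(void *priv)
{
	return 1;
}

static const struct guestmem_allocator_operations guestmem_4k_ops = {
	.alloc_folio		= guestmem_4k_alloc_folio,
	.nr_pages_in_folio	= guestmem_4k_nr_pages_in_folio,
	.nr_pages_in_page	= guestmem_4k_nr_pages_in_folio,
};

With something like that, the kvm_gmem_has_custom_allocator() special
cases could collapse into always going through the ops.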

> If it
> didn't work well, could we expect that a next allocator would actually fit
> struct guestmem_allocator_operations?
>

This was definitely designed to support allocators beyond
guestmem_hugetlb, though I won't promise that it will be a perfect fit
for future allocators. Since this interface is internal to the kernel,
it can be updated as future allocators come along.

>> +		return bound;
>> +
>> +	priv = kvm_gmem_allocator_private(inode);
>> +	nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
>> +
>> +	if (start)
>> +		return round_down(bound, nr_pages);
>> +	else
>> +		return round_up(bound, nr_pages);
>> +}
>> +
>> +static pgoff_t kvm_gmem_compute_invalidate_start(struct inode *inode,
>> +						 pgoff_t bound)
>> +{
>> +	return kvm_gmem_compute_invalidate_bound(inode, bound, true);
>> +}
>> +
>> +static pgoff_t kvm_gmem_compute_invalidate_end(struct inode *inode,
>> +					       pgoff_t bound)
>> +{
>> +	return kvm_gmem_compute_invalidate_bound(inode, bound, false);
>> +}
>> +

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 38/51] KVM: guest_memfd: Split allocator pages for guest_memfd use
  2025-05-27  4:30   ` Yan Zhao
  2025-05-27  4:38     ` Yan Zhao
@ 2025-06-05 17:50     ` Ackerley Tng
  1 sibling, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-06-05 17:50 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yilun.xu, yuzenghui,
	zhiquan1.li

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Wed, May 14, 2025 at 04:42:17PM -0700, Ackerley Tng wrote:

[...]

>> +static pgoff_t kvm_gmem_compute_invalidate_bound(struct inode *inode,
>> +						 pgoff_t bound, bool start)
>> +{
>> +	size_t nr_pages;
>> +	void *priv;
>> +
>> +	if (!kvm_gmem_has_custom_allocator(inode))
>> +		return bound;
>> +
>> +	priv = kvm_gmem_allocator_private(inode);
>> +	nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
>> +
>> +	if (start)
>> +		return round_down(bound, nr_pages);
>> +	else
>> +		return round_up(bound, nr_pages);
>> +}
>> +
>> +static pgoff_t kvm_gmem_compute_invalidate_start(struct inode *inode,
>> +						 pgoff_t bound)
>> +{
>> +	return kvm_gmem_compute_invalidate_bound(inode, bound, true);
>> +}
>> +
>> +static pgoff_t kvm_gmem_compute_invalidate_end(struct inode *inode,
>> +					       pgoff_t bound)
>> +{
>> +	return kvm_gmem_compute_invalidate_bound(inode, bound, false);
>> +}
>> +
>>  static int kvm_gmem_shareability_apply(struct inode *inode,
>>  				       struct conversion_work *work,
>>  				       enum shareability m)
>> @@ -299,35 +428,53 @@ static void kvm_gmem_convert_invalidate_begin(struct inode *inode,
>>  					      struct conversion_work *work)
>>  {
>>  	struct list_head *gmem_list;
>> +	pgoff_t invalidate_start;
>> +	pgoff_t invalidate_end;
>>  	struct kvm_gmem *gmem;
>> -	pgoff_t end;
>> +	pgoff_t work_end;
>>  
>> -	end = work->start + work->nr_pages;
>> +	work_end = work->start + work->nr_pages;
>> +	invalidate_start = kvm_gmem_compute_invalidate_start(inode, work->start);
>> +	invalidate_end = kvm_gmem_compute_invalidate_end(inode, work_end);

The invalidation range is broadened to cover whole huge pages to take
care of the race [1] reported for the conversion flow that uses the
KVM_SET_MEMORY_ATTRIBUTES ioctl, so I repeated the same broadening for
this guest_memfd conversion ioctl.
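As a concrete example, with 1G folios of 262144 4K pages each, a
conversion request for pgoffs [262400, 262912) gets invalidated as
[262144, 524288).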

> Could we just notify the exact gfn range and let KVM adjust the invalidate
> range?
>

How do we get KVM to adjust the invalidate range?

> Then kvm_gmem_invalidate_begin() can asks KVM to do EPT splitting before any
> kvm_mmu_unmap_gfn_range() is performed.
>
>

In this snapshot of my WIP combining this HugeTLB support with TDX huge
page EPT support [2], I was thinking of combining EPT splitting with
the unmap, and leaving the invalidation as a separate step. (See
kvm_gmem_unmap_private().) I did it this way so that the EPT splitting
range is the unmapping range, and only the invalidation range is
broadened.

What do you think of that?
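
As a concrete illustration of that split (exact unmap/EPT-split range,
broadened invalidation range), here is a tiny standalone sketch. The numbers
are assumptions picked for the example (262144 4K pages per 1G folio, a
request covering pgoffs [300, 812)), and ROUND_DOWN()/ROUND_UP() just mirror
the kernel's round_down()/round_up():

#include <stdio.h>

#define ROUND_DOWN(x, a)	((x) / (a) * (a))
#define ROUND_UP(x, a)		(ROUND_DOWN((x) + (a) - 1, (a)))

int main(void)
{
	unsigned long nr_per_folio = 262144;	/* 4K pages in a 1G folio (assumed) */
	unsigned long start = 300, nr_pages = 512;	/* assumed conversion request */
	unsigned long end = start + nr_pages;

	/* The unmap (and, in the WIP, the EPT split) stays exact ... */
	printf("unmap      [%lu, %lu)\n", start, end);

	/* ... while only the invalidation is widened to folio boundaries. */
	printf("invalidate [%lu, %lu)\n",
	       ROUND_DOWN(start, nr_per_folio),
	       ROUND_UP(end, nr_per_folio));

	return 0;
}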

>>  	gmem_list = &inode->i_mapping->i_private_list;
>>  	list_for_each_entry(gmem, gmem_list, entry)
>> -		kvm_gmem_invalidate_begin(gmem, work->start, end);
>> +		kvm_gmem_invalidate_begin(gmem, invalidate_start, invalidate_end);
>>  }
>>  
>>  static void kvm_gmem_convert_invalidate_end(struct inode *inode,
>>  					    struct conversion_work *work)
>>  {
>>  	struct list_head *gmem_list;
>> +	pgoff_t invalidate_start;
>> +	pgoff_t invalidate_end;
>>  	struct kvm_gmem *gmem;
>> -	pgoff_t end;
>> +	pgoff_t work_end;
>>  
>> -	end = work->start + work->nr_pages;
>> +	work_end = work->start + work->nr_pages;
>> +	invalidate_start = kvm_gmem_compute_invalidate_start(inode, work->start);
>> +	invalidate_end = kvm_gmem_compute_invalidate_end(inode, work_end);
>>  
>>  	gmem_list = &inode->i_mapping->i_private_list;
>>  	list_for_each_entry(gmem, gmem_list, entry)
>> -		kvm_gmem_invalidate_end(gmem, work->start, end);
>> +		kvm_gmem_invalidate_end(gmem, invalidate_start, invalidate_end);
>>  }
>>  
>>  static int kvm_gmem_convert_should_proceed(struct inode *inode,
>>  					   struct conversion_work *work,
>>  					   bool to_shared, pgoff_t *error_index)
>>  {
>> -	if (!to_shared) {
>> +	if (to_shared) {
>> +		struct list_head *gmem_list;
>> +		struct kvm_gmem *gmem;
>> +		pgoff_t work_end;
>> +
>> +		work_end = work->start + work->nr_pages;
>> +
>> +		gmem_list = &inode->i_mapping->i_private_list;
>> +		list_for_each_entry(gmem, gmem_list, entry)
>> +			kvm_gmem_unmap_private(gmem, work->start, work_end);
>> +	} else {
>>  		unmap_mapping_pages(inode->i_mapping, work->start,
>>  				    work->nr_pages, false);
>>  
>> @@ -340,6 +487,27 @@ static int kvm_gmem_convert_should_proceed(struct inode *inode,
>>  	return 0;
>>  }
>>  
>> +static int kvm_gmem_convert_execute_work(struct inode *inode,
>> +					 struct conversion_work *work,
>> +					 bool to_shared)
>> +{
>> +	enum shareability m;
>> +	int ret;
>> +
>> +	m = to_shared ? SHAREABILITY_ALL : SHAREABILITY_GUEST;
>> +	ret = kvm_gmem_shareability_apply(inode, work, m);
>> +	if (ret)
>> +		return ret;
>> +	/*
>> +	 * Apply shareability first so split/merge can operate on new
>> +	 * shareability state.
>> +	 */
>> +	ret = kvm_gmem_restructure_folios_in_range(
>> +		inode, work->start, work->nr_pages, to_shared);
>> +
>> +	return ret;
>> +}
>> +
>>  static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
>>  				  size_t nr_pages, bool shared,
>>  				  pgoff_t *error_index)
>> @@ -371,18 +539,21 @@ static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
>>  
>>  	list_for_each_entry(work, &work_list, list) {
>>  		rollback_stop_item = work;
>> -		ret = kvm_gmem_shareability_apply(inode, work, m);
>> +
>> +		ret = kvm_gmem_convert_execute_work(inode, work, shared);
>>  		if (ret)
>>  			break;
>>  	}
>>  
>>  	if (ret) {
>> -		m = shared ? SHAREABILITY_GUEST : SHAREABILITY_ALL;
>>  		list_for_each_entry(work, &work_list, list) {
>> +			int r;
>> +
>> +			r = kvm_gmem_convert_execute_work(inode, work, !shared);
>> +			WARN_ON(r);
>> +
>>  			if (work == rollback_stop_item)
>>  				break;
>> -
>> -			WARN_ON(kvm_gmem_shareability_apply(inode, work, m));
>>  		}
>>  	}
>>  
>> @@ -434,6 +605,277 @@ static int kvm_gmem_ioctl_convert_range(struct file *file,
>>  	return ret;
>>  }
>>  
>> +#ifdef CONFIG_KVM_GMEM_HUGETLB
>> +
>> +static inline void __filemap_remove_folio_for_restructuring(struct folio *folio)
>> +{
>> +	struct address_space *mapping = folio->mapping;
>> +
>> +	spin_lock(&mapping->host->i_lock);
>> +	xa_lock_irq(&mapping->i_pages);
>> +
>> +	__filemap_remove_folio(folio, NULL);
>> +
>> +	xa_unlock_irq(&mapping->i_pages);
>> +	spin_unlock(&mapping->host->i_lock);
>> +}
>> +
>> +/**
>> + * filemap_remove_folio_for_restructuring() - Remove @folio from filemap for
>> + * split/merge.
>> + *
>> + * @folio: the folio to be removed.
>> + *
>> + * Similar to filemap_remove_folio(), but skips LRU-related calls (meaningless
>> + * for guest_memfd), and skips call to ->free_folio() to maintain folio flags.
>> + *
>> + * Context: Expects only the filemap's refcounts to be left on the folio. Will
>> + *          freeze these refcounts away so that no other users will interfere
>> + *          with restructuring.
>> + */
>> +static inline void filemap_remove_folio_for_restructuring(struct folio *folio)
>> +{
>> +	int filemap_refcount;
>> +
>> +	filemap_refcount = folio_nr_pages(folio);
>> +	while (!folio_ref_freeze(folio, filemap_refcount)) {
>> +		/*
>> +		 * At this point only filemap refcounts are expected, hence okay
>> +		 * to spin until speculative refcounts go away.
>> +		 */
>> +		WARN_ONCE(1, "Spinning on folio=%p refcount=%d", folio, folio_ref_count(folio));
>> +	}
>> +
>> +	folio_lock(folio);
>> +	__filemap_remove_folio_for_restructuring(folio);
>> +	folio_unlock(folio);
>> +}
>> +
>> +/**
>> + * kvm_gmem_split_folio_in_filemap() - Split @folio within filemap in @inode.
>> + *
>> + * @inode: inode containing the folio.
>> + * @folio: folio to be split.
>> + *
>> + * Split a folio into folios of size PAGE_SIZE. Will clean up folio from filemap
>> + * and add back the split folios.
>> + *
>> + * Context: Expects that before this call, folio's refcount is just the
>> + *          filemap's refcounts. After this function returns, the split folios'
>> + *          refcounts will also be filemap's refcounts.
>> + * Return: 0 on success or negative error otherwise.
>> + */
>> +static int kvm_gmem_split_folio_in_filemap(struct inode *inode, struct folio *folio)
>> +{
>> +	size_t orig_nr_pages;
>> +	pgoff_t orig_index;
>> +	size_t i, j;
>> +	int ret;
>> +
>> +	orig_nr_pages = folio_nr_pages(folio);
>> +	if (orig_nr_pages == 1)
>> +		return 0;
>> +
>> +	orig_index = folio->index;
>> +
>> +	filemap_remove_folio_for_restructuring(folio);
>> +
>> +	ret = kvm_gmem_allocator_ops(inode)->split_folio(folio);
>> +	if (ret)
>> +		goto err;
>> +
>> +	for (i = 0; i < orig_nr_pages; ++i) {
>> +		struct folio *f = page_folio(folio_page(folio, i));
>> +
>> +		ret = __kvm_gmem_filemap_add_folio(inode->i_mapping, f,
>> +						   orig_index + i);
>> +		if (ret)
>> +			goto rollback;
>> +	}
>> +
>> +	return ret;
>> +
>> +rollback:
>> +	for (j = 0; j < i; ++j) {
>> +		struct folio *f = page_folio(folio_page(folio, j));
>> +
>> +		filemap_remove_folio_for_restructuring(f);
>> +	}
>> +
>> +	kvm_gmem_allocator_ops(inode)->merge_folio(folio);
>> +err:
>> +	WARN_ON(__kvm_gmem_filemap_add_folio(inode->i_mapping, folio, orig_index));
>> +
>> +	return ret;
>> +}
>> +
>> +static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
>> +						      struct folio *folio)
>> +{
>> +	size_t to_nr_pages;
>> +	void *priv;
>> +
>> +	if (!kvm_gmem_has_custom_allocator(inode))
>> +		return 0;
>> +
>> +	priv = kvm_gmem_allocator_private(inode);
>> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_page(priv);
>> +
>> +	if (kvm_gmem_has_some_shared(inode, folio->index, to_nr_pages))
> If the guest_memfd is configured with GUESTMEM_HUGETLB_FLAG_1GB, it seems that
> whenever there's a shared page within a 1GB range, the folio will always be
> split into 4KB folios. Is it good?
>

It is not the best, but okay as an initial step.

We want to work on splitting 1G to 2M (keeping as many 2M pages as possible)
and only then to 4K. I believe the agreement with the community is that the
1G->2M->4K splitting is an optimization for the patch series after this
one.

>> +		return kvm_gmem_split_folio_in_filemap(inode, folio);
>> +
>> +	return 0;
>> +}
>> +
>> +/**
>> + * kvm_gmem_merge_folio_in_filemap() - Merge @first_folio within filemap in
>> + * @inode.
>> + *
>> + * @inode: inode containing the folio.
>> + * @first_folio: first folio among folios to be merged.
>> + *
>> + * Will clean up subfolios from filemap and add back the merged folio.
>> + *
>> + * Context: Expects that before this call, all subfolios only have filemap
>> + *          refcounts. After this function returns, the merged folio will only
>> + *          have filemap refcounts.
>> + * Return: 0 on success or negative error otherwise.
>> + */
>> +static int kvm_gmem_merge_folio_in_filemap(struct inode *inode,
>> +					   struct folio *first_folio)
>> +{
>> +	size_t to_nr_pages;
>> +	pgoff_t index;
>> +	void *priv;
>> +	size_t i;
>> +	int ret;
>> +
>> +	index = first_folio->index;
>> +
>> +	priv = kvm_gmem_allocator_private(inode);
>> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
>> +	if (folio_nr_pages(first_folio) == to_nr_pages)
>> +		return 0;
>> +
>> +	for (i = 0; i < to_nr_pages; ++i) {
>> +		struct folio *f = page_folio(folio_page(first_folio, i));
>> +
>> +		filemap_remove_folio_for_restructuring(f);
>> +	}
>> +
>> +	kvm_gmem_allocator_ops(inode)->merge_folio(first_folio);
>> +
>> +	ret = __kvm_gmem_filemap_add_folio(inode->i_mapping, first_folio, index);
>> +	if (ret)
>> +		goto err_split;
>> +
>> +	return ret;
>> +
>> +err_split:
>> +	WARN_ON(kvm_gmem_allocator_ops(inode)->split_folio(first_folio));
>> +	for (i = 0; i < to_nr_pages; ++i) {
>> +		struct folio *f = page_folio(folio_page(first_folio, i));
>> +
>> +		WARN_ON(__kvm_gmem_filemap_add_folio(inode->i_mapping, f, index + i));
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static inline int kvm_gmem_try_merge_folio_in_filemap(struct inode *inode,
>> +						      struct folio *first_folio)
>> +{
>> +	size_t to_nr_pages;
>> +	void *priv;
>> +
>> +	priv = kvm_gmem_allocator_private(inode);
>> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
>> +
>> +	if (kvm_gmem_has_some_shared(inode, first_folio->index, to_nr_pages))
>> +		return 0;
>> +
>> +	return kvm_gmem_merge_folio_in_filemap(inode, first_folio);
>> +}
>> +
>> +static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
>> +						pgoff_t start, size_t nr_pages,
>> +						bool is_split_operation)
>> +{
>> +	size_t to_nr_pages;
>> +	pgoff_t index;
>> +	pgoff_t end;
>> +	void *priv;
>> +	int ret;
>> +
>> +	if (!kvm_gmem_has_custom_allocator(inode))
>> +		return 0;
>> +
>> +	end = start + nr_pages;
>> +
>> +	/* Round to allocator page size, to check all (huge) pages in range. */
>> +	priv = kvm_gmem_allocator_private(inode);
>> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
>> +
>> +	start = round_down(start, to_nr_pages);
>> +	end = round_up(end, to_nr_pages);
>> +
>> +	for (index = start; index < end; index += to_nr_pages) {
>> +		struct folio *f;
>> +
>> +		f = filemap_get_folio(inode->i_mapping, index);
>> +		if (IS_ERR(f))
>> +			continue;
>> +
>> +		/* Leave just filemap's refcounts on the folio. */
>> +		folio_put(f);
>> +
>> +		if (is_split_operation)
>> +			ret = kvm_gmem_split_folio_in_filemap(inode, f);
> The split operation is performed after kvm_gmem_unmap_private() within
> kvm_gmem_convert_should_proceed(), right?
>
> So, it seems that that it's not necessary for TDX to avoid holding private page
> references, as TDX must have released the page refs after
> kvm_gmem_unmap_private() (except when there's TDX module or KVM bug).
>

I agree with your assessment in the follow-up email.

We don't want to unmap more than the requested conversion range, to avoid
extra churn. If TDX holds refcounts on mapped pages, the subpages that
are still mapped will contribute to the refcount of the huge page, and
we can't split a page that has extra refcounts because we don't know how
the refcounts are distributed over the subpages.

I guess technically, if the refcounts were evenly divisible across nr_pages,
we could still split, but if we have a 1G page where only some of the
subpages are mapped into TDX EPTs, then we would have a refcount that we
don't know how to divide out.

>> +		else
>> +			ret = kvm_gmem_try_merge_folio_in_filemap(inode, f);
>> +
>> +		if (ret)
>> +			goto rollback;
>> +	}
>> +	return ret;
>> +
>> +rollback:
>> +	for (index -= to_nr_pages; index >= start; index -= to_nr_pages) {
>> +		struct folio *f;
>> +
>> +		f = filemap_get_folio(inode->i_mapping, index);
>> +		if (IS_ERR(f))
>> +			continue;
>> +
>> +		/* Leave just filemap's refcounts on the folio. */
>> +		folio_put(f);
>> +
>> +		if (is_split_operation)
>> +			WARN_ON(kvm_gmem_merge_folio_in_filemap(inode, f));
>> +		else
>> +			WARN_ON(kvm_gmem_split_folio_in_filemap(inode, f));
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +#else
>> +

[...]

[1] https://lore.kernel.org/all/Z__AAB_EFxGFEjDR@google.com/
[2] https://github.com/googleprodkernel/linux-cc/commits/wip-tdx-gmem-conversions-hugetlb-2mept/


^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 38/51] KVM: guest_memfd: Split allocator pages for guest_memfd use
  2025-06-05 17:15     ` Ackerley Tng
@ 2025-06-05 17:53       ` Edgecombe, Rick P
  0 siblings, 0 replies; 231+ messages in thread
From: Edgecombe, Rick P @ 2025-06-05 17:53 UTC (permalink / raw)
  To: ackerleytng@google.com, kvm@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, x86@kernel.org
  Cc: pvorel@suse.cz, palmer@dabbelt.com, catalin.marinas@arm.com,
	Miao, Jun, Shutemov, Kirill, pdurrant@amazon.co.uk,
	vbabka@suse.cz, nsaenz@amazon.es, peterx@redhat.com,
	keirf@google.com, amoorthy@google.com, quic_svaddagi@quicinc.com,
	jack@suse.cz, vkuznets@redhat.com, maz@kernel.org,
	mail@maciej.szmigiero.name, hughd@google.com, Annapurve, Vishal,
	Wang, Wei W, tabba@google.com, Wieczor-Retman, Maciej,
	Zhao, Yan Y, ajones@ventanamicro.com, willy@infradead.org,
	paul.walmsley@sifive.com, quic_mnalajal@quicinc.com, aik@amd.com,
	usama.arif@bytedance.com, Hansen, Dave, fvdl@google.com,
	rppt@kernel.org, bfoster@redhat.com, quic_cvanscha@quicinc.com,
	Du, Fan, anthony.yznaga@oracle.com, thomas.lendacky@amd.com,
	anup@brainfault.org, mic@digikod.net, oliver.upton@linux.dev,
	akpm@linux-foundation.org, quic_eberman@quicinc.com,
	muchun.song@linux.dev, binbin.wu@linux.intel.com, Li, Zhiquan1,
	rientjes@google.com, Aktas, Erdem, mpe@ellerman.id.au,
	david@redhat.com, jgg@ziepe.ca, steven.price@arm.com,
	jhubbard@nvidia.com, Xu, Haibo1, Yamahata, Isaku,
	jthoughton@google.com, steven.sistare@oracle.com,
	quic_pheragu@quicinc.com, jarkko@kernel.org,
	chenhuacai@kernel.org, Huang, Kai, shuah@kernel.org,
	dwmw@amazon.co.uk, pankaj.gupta@amd.com, Peng, Chao P,
	nikunj@amd.com, Graf, Alexander, viro@zeniv.linux.org.uk,
	pbonzini@redhat.com, yuzenghui@huawei.com, jroedel@suse.de,
	suzuki.poulose@arm.com, jgowans@amazon.com, Xu, Yilun,
	liam.merwick@oracle.com, michael.roth@amd.com,
	quic_tsoni@quicinc.com, richard.weiyang@gmail.com, Weiny, Ira,
	aou@eecs.berkeley.edu, Li, Xiaoyao, qperret@google.com,
	kent.overstreet@linux.dev, dmatlack@google.com,
	james.morse@arm.com, brauner@kernel.org, pgonda@google.com,
	quic_pderrin@quicinc.com, hch@infradead.org, will@kernel.org,
	seanjc@google.com, roypat@amazon.co.uk

On Thu, 2025-06-05 at 10:15 -0700, Ackerley Tng wrote:
> > > > > > > > Did you consider designing struct guestmem_allocator_operations
> > > > > > > > so that it could encapsulate the special logic for both the
> > > > > > > > existing and new allocators?
> > > > 
> > > > I did, yes. I believe it is definitely possible to make standard 4K
> > > > pages become another allocator too.
> > > > 
> > > > I would love to clean this up. Not sure if that will be a new series
> > > > after this one, or part of this one though.

Usually new work should handle the refactor first, then build on top of it. The
code today bolts on the new thing in a messy way and leaves the cleanup for
later.

To also make review more expedient, a better order could be:
1. Add allocator callbacks one at a time (or at whatever granularity is
possible), moving the 4k allocator to callbacks at the same time. Basically a
code move. Don't factor out common code between the planned allocators. Will be
dirt simple to review.
2. Introduce the changes to hugetlbfs, explaining why each will be used by
guest_memfd.
3. Add the hugetlbfs/1GB custom allocator to the guest_memfd code, a callback
at a time. Have any necessary factoring of 4k page allocator bits out of the
callback implementations come in a separate preceding patch. Explain the
commonality.

What do you think?

Also, for (2), do you think you could move some of these pure cleanup patches
out of the series to go up ahead of time? And for any hugetlb changes that 1GB
guest_memfd depends on, could you explain why in the commit log? I'm not clear
on what is required and what is opportunistic cleanup.


> > > > 
> > > > > > > > If it didn't work well, could we expect that a next allocator
> > > > > > > > would actually fit struct guestmem_allocator_operations?
> > > > > > > > 
> > > > 
> > > > This was definitely designed to support allocators beyond
> > > > guestmem_hugetlb, though I won't promise that it will be a perfect fit
> > > > for future allocators. This is internal to the kernel and this interface
> > > > can be updated for future allocators though.

Makes sense. This was probing whether the interface didn't fit the 4k
allocator. It makes sense to have the interface target the existing two
allocators, and not speculative future ones.
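
(For readers following the thread: the call sites quoted across this series
suggest an ops table roughly like the sketch below. It is reconstructed purely
from the callback names used in the quoted code (alloc_folio, split_folio,
merge_folio, nr_pages_in_folio, nr_pages_in_page); the exact signatures,
return types and any other fields are assumptions, since the real definition
lives in a patch not quoted here.)

#include <stddef.h>

/* Reconstructed sketch, not the actual definition from the series. */
struct folio;

struct guestmem_allocator_operations {
	/* Allocate one allocator-sized folio (e.g. a 1G HugeTLB folio). */
	struct folio *(*alloc_folio)(void *priv);
	/* Split a folio into PAGE_SIZE folios; can fail (e.g. -ENOMEM). */
	int (*split_folio)(struct folio *folio);
	/* Undo a split, restoring the allocator-sized folio (return type assumed). */
	void (*merge_folio)(struct folio *folio);
	/* Number of 4K pages in an allocator-sized folio. */
	size_t (*nr_pages_in_folio)(void *priv);
	/* Number of 4K pages in the unit used when checking shared ranges. */
	size_t (*nr_pages_in_page)(void *priv);
};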

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 38/51] KVM: guest_memfd: Split allocator pages for guest_memfd use
  2025-05-27  8:45   ` Yan Zhao
@ 2025-06-05 19:10     ` Ackerley Tng
  2025-06-16 11:15       ` Yan Zhao
  0 siblings, 1 reply; 231+ messages in thread
From: Ackerley Tng @ 2025-06-05 19:10 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yilun.xu, yuzenghui,
	zhiquan1.li

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Wed, May 14, 2025 at 04:42:17PM -0700, Ackerley Tng wrote:
>> +static int kvm_gmem_convert_execute_work(struct inode *inode,
>> +					 struct conversion_work *work,
>> +					 bool to_shared)
>> +{
>> +	enum shareability m;
>> +	int ret;
>> +
>> +	m = to_shared ? SHAREABILITY_ALL : SHAREABILITY_GUEST;
>> +	ret = kvm_gmem_shareability_apply(inode, work, m);
>> +	if (ret)
>> +		return ret;
>> +	/*
>> +	 * Apply shareability first so split/merge can operate on new
>> +	 * shareability state.
>> +	 */
>> +	ret = kvm_gmem_restructure_folios_in_range(
>> +		inode, work->start, work->nr_pages, to_shared);
>> +
>> +	return ret;
>> +}
>> +

Hi Yan,

Thanks for your thorough reviews and your alternative suggestion in the
other discussion at [1]! I'll try to bring the conversion-related parts
of that discussion over here.

>>  static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
>>  				  size_t nr_pages, bool shared,
>>  				  pgoff_t *error_index)

The guiding principles I was using for the conversion ioctls are

* Have the shareability updates and any necessary page restructuring
  (aka splitting/merging) either fully complete or not done at all by the
  time the conversion ioctl returns.
* Any unmapping (from host or guest page tables) will not be re-mapped
  on errors.
* Rollback undoes changes if the conversion failed, and in those cases any
  errors are turned into WARNings.

The rationale is that we want page sizes to be in sync with shareability
so that any faults after the (successful or failed) conversion will not
wrongly map in a larger page than allowed and cause any host crashes.

We considered 3 places where the memory can be mapped for conversions:

1. Host page tables
2. Guest page tables
3. IOMMU page tables

Unmapping from host page tables is the simplest case. We unmap any
shared ranges from the host page tables. Any accesses after the failed
conversion would just fault the memory back in and proceed as usual.

guest_memfd memory is not unmapped from IOMMUs in conversions. This case
is handled because IOMMU mappings hold refcounts: after unmapping from
the host, we check for unexpected refcounts and fail the conversion if
any are found.

We also unmap from guest page tables. Considering failed conversions, if
the pages are shared, we're good since the next time the guest accesses
the page, the page will be faulted in as before.

If the pages are private, on the next guest access, the pages will be
faulted in again as well. This is fine for software-protected VMs IIUC.

For TDX (and SNP), IIUC the memory would have been cleared, and the
memory would also need to be re-accepted. I was thinking that this is by
design, since when a TDX guest requests a conversion it knows that the
contents are not to be used again.

The userspace VMM is obligated to keep retrying the conversion, and if it
gives up, the userspace VMM should inform the guest that the conversion
failed. The guest should handle conversion failures too and not assume
that conversion always succeeds.

Putting TDX aside for a moment, so far, there are a few ways this
conversion could fail:

a. Unexpected refcounts. Userspace should clear up the unexpected
   refcounts and report failure to the guest if it can't for whatever
   reason.
b. ENOMEM, because (i) we ran out of memory updating the shareability
   maple_tree, or (ii) splitting involves allocating more memory for
   struct pages and we ran out of memory there. In this case the
   userspace VMM gets -ENOMEM and can make more memory available and
   then retry, or, if it can't, also report the failure to the guest
   (see the sketch after this list).
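
To make the retry obligation concrete, here is a minimal userspace-style
sketch of how a VMM might drive a shared conversion. This is only an
illustration: KVM_GMEM_CONVERT_SHARED comes from this series, but its request
number and argument layout (struct gmem_convert below) are placeholders I
invented, and the errno handling shown is an assumption, not what the series
actually returns:

#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

#ifndef KVM_GMEM_CONVERT_SHARED
#define KVM_GMEM_CONVERT_SHARED 0	/* placeholder; real value is in the series' UAPI */
#endif

struct gmem_convert {			/* hypothetical argument layout */
	uint64_t offset;		/* byte offset into the guest_memfd */
	uint64_t size;			/* number of bytes to convert */
	uint64_t error_offset;		/* filled in by the kernel on failure */
};

/* Returns 0 on success, -1 if the guest must be told the conversion failed. */
static int vmm_convert_to_shared(int gmem_fd, uint64_t offset, uint64_t size)
{
	struct gmem_convert arg = { .offset = offset, .size = size };
	int retries = 3;

	while (retries--) {
		if (ioctl(gmem_fd, KVM_GMEM_CONVERT_SHARED, &arg) == 0)
			return 0;

		if (errno == ENOMEM) {
			/* Case (b): free up host memory somehow, then retry. */
			continue;
		}

		/*
		 * Case (a): unexpected refcounts.  Drop any pins the VMM
		 * itself holds on the range (vhost, VFIO, ...) and retry.
		 */
	}

	/* Give up: report the conversion failure back to the guest. */
	return -1;
}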

TDX introduces TDX-specific conversion failures (see discussion at
[1]), which this series doesn't handle, but I think we still have a line
of sight to handle new errors.

In the other thread [1], I was proposing to have guest_memfd decide what
to do on errors, but I think that might be baking more TDX-specific
details into guest_memfd/KVM, and perhaps this is better:

We could return the errors to userspace and let userspace determine what
to do. For retryable errors (as determined by userspace), it should do
what it needs to do, and retry. For errors like TDX being unable to
reclaim the memory, it could tell guest_memfd to leak that memory.

If userspace gives up, it should report conversion failure to the guest
if userspace thinks the guest can continue (to a clean shutdown or
otherwise). If something terrible happened during conversion, then
userspace might have to exit itself or shutdown the host.

In [2], for TDX-specific conversion failures, you proposed prepping to
eliminate errors and exiting early on failure, then actually
unmapping. I think that could work too.

I'm a little concerned that prepping could be complicated, since the
nature of a conversion depends on the current state of shareability, and
there's a lot to prepare: everything from counting the memory required
for maple_tree allocation (and for merging ranges in the maple_tree) to
counting the number of pages required for undoing the vmemmap
optimization in the case of splitting...

And even after doing all the prep to eliminate errors, the unmapping
could still fail in TDX-specific cases anyway, which would still need to
be handled.

Hence I'm hoping you'll consider letting TDX-specific failures be built
in and handled alongside other failures by getting help from the
userspace VMM, and in the worst case by letting the guest know the
conversion failed.

I also appreciate comments or suggestions from anyone else!

[1] https://lore.kernel.org/all/diqzfrgfp95d.fsf@ackerleytng-ctop.c.googlers.com/
[2] https://lore.kernel.org/all/aEEEJbTzlncbRaRA@yzhao56-desk.sh.intel.com/

>> @@ -371,18 +539,21 @@ static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
>>  
>>  	list_for_each_entry(work, &work_list, list) {
>>  		rollback_stop_item = work;
>> -		ret = kvm_gmem_shareability_apply(inode, work, m);
>> +
>> +		ret = kvm_gmem_convert_execute_work(inode, work, shared);
>>  		if (ret)
>>  			break;
>>  	}
>>  
>>  	if (ret) {
>> -		m = shared ? SHAREABILITY_GUEST : SHAREABILITY_ALL;
>>  		list_for_each_entry(work, &work_list, list) {
>> +			int r;
>> +
>> +			r = kvm_gmem_convert_execute_work(inode, work, !shared);
>> +			WARN_ON(r);
>> +
>>  			if (work == rollback_stop_item)
>>  				break;
>> -
>> -			WARN_ON(kvm_gmem_shareability_apply(inode, work, m));
> Could kvm_gmem_shareability_apply() fail here?
>

Yes, it could. If shareability cannot be updated, then we probably ran
out of memory. The userspace VMM will probably have gotten -ENOMEM from
some earlier ret and should handle that accordingly.

On -ENOMEM in a rollback, the host is in a very tough spot anyway, and a
clean guest shutdown may be the only way out, hence this is a WARN and
not returned to userspace.

>>  		}
>>  	}
>>  
>> @@ -434,6 +605,277 @@ static int kvm_gmem_ioctl_convert_range(struct file *file,
>>  	return ret;
>>  }
>>  
>> +#ifdef CONFIG_KVM_GMEM_HUGETLB
>> +
>> +static inline void __filemap_remove_folio_for_restructuring(struct folio *folio)
>> +{
>> +	struct address_space *mapping = folio->mapping;
>> +
>> +	spin_lock(&mapping->host->i_lock);
>> +	xa_lock_irq(&mapping->i_pages);
>> +
>> +	__filemap_remove_folio(folio, NULL);
>> +
>> +	xa_unlock_irq(&mapping->i_pages);
>> +	spin_unlock(&mapping->host->i_lock);
>> +}
>> +
>> +/**
>> + * filemap_remove_folio_for_restructuring() - Remove @folio from filemap for
>> + * split/merge.
>> + *
>> + * @folio: the folio to be removed.
>> + *
>> + * Similar to filemap_remove_folio(), but skips LRU-related calls (meaningless
>> + * for guest_memfd), and skips call to ->free_folio() to maintain folio flags.
>> + *
>> + * Context: Expects only the filemap's refcounts to be left on the folio. Will
>> + *          freeze these refcounts away so that no other users will interfere
>> + *          with restructuring.
>> + */
>> +static inline void filemap_remove_folio_for_restructuring(struct folio *folio)
>> +{
>> +	int filemap_refcount;
>> +
>> +	filemap_refcount = folio_nr_pages(folio);
>> +	while (!folio_ref_freeze(folio, filemap_refcount)) {
>> +		/*
>> +		 * At this point only filemap refcounts are expected, hence okay
>> +		 * to spin until speculative refcounts go away.
>> +		 */
>> +		WARN_ONCE(1, "Spinning on folio=%p refcount=%d", folio, folio_ref_count(folio));
>> +	}
>> +
>> +	folio_lock(folio);
>> +	__filemap_remove_folio_for_restructuring(folio);
>> +	folio_unlock(folio);
>> +}
>> +
>> +/**
>> + * kvm_gmem_split_folio_in_filemap() - Split @folio within filemap in @inode.
>> + *
>> + * @inode: inode containing the folio.
>> + * @folio: folio to be split.
>> + *
>> + * Split a folio into folios of size PAGE_SIZE. Will clean up folio from filemap
>> + * and add back the split folios.
>> + *
>> + * Context: Expects that before this call, folio's refcount is just the
>> + *          filemap's refcounts. After this function returns, the split folios'
>> + *          refcounts will also be filemap's refcounts.
>> + * Return: 0 on success or negative error otherwise.
>> + */
>> +static int kvm_gmem_split_folio_in_filemap(struct inode *inode, struct folio *folio)
>> +{
>> +	size_t orig_nr_pages;
>> +	pgoff_t orig_index;
>> +	size_t i, j;
>> +	int ret;
>> +
>> +	orig_nr_pages = folio_nr_pages(folio);
>> +	if (orig_nr_pages == 1)
>> +		return 0;
>> +
>> +	orig_index = folio->index;
>> +
>> +	filemap_remove_folio_for_restructuring(folio);
>> +
>> +	ret = kvm_gmem_allocator_ops(inode)->split_folio(folio);
>> +	if (ret)
>> +		goto err;
>> +
>> +	for (i = 0; i < orig_nr_pages; ++i) {
>> +		struct folio *f = page_folio(folio_page(folio, i));
>> +
>> +		ret = __kvm_gmem_filemap_add_folio(inode->i_mapping, f,
>> +						   orig_index + i);
> Why does the failure of __kvm_gmem_filemap_add_folio() here lead to rollback,    
> while the failure of the one under rollback only triggers WARN_ON()?
>

Mostly because I don't really have a choice on rollback. On rollback we
try to restore the merged folio back into the filemap, and if we can't,
the host is probably in rather bad shape in terms of memory availability
and there may not be many options left for the userspace VMM.

Perhaps the different possible errors from
__kvm_gmem_filemap_add_folio() in the two places should be handled
differently. Do you have any suggestions on that?

>> +		if (ret)
>> +			goto rollback;
>> +	}
>> +
>> +	return ret;
>> +
>> +rollback:
>> +	for (j = 0; j < i; ++j) {
>> +		struct folio *f = page_folio(folio_page(folio, j));
>> +
>> +		filemap_remove_folio_for_restructuring(f);
>> +	}
>> +
>> +	kvm_gmem_allocator_ops(inode)->merge_folio(folio);
>> +err:
>> +	WARN_ON(__kvm_gmem_filemap_add_folio(inode->i_mapping, folio, orig_index));
>> +
>> +	return ret;
>> +}
>> +
>> +static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
>> +						      struct folio *folio)
>> +{
>> +	size_t to_nr_pages;
>> +	void *priv;
>> +
>> +	if (!kvm_gmem_has_custom_allocator(inode))
>> +		return 0;
>> +
>> +	priv = kvm_gmem_allocator_private(inode);
>> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_page(priv);
>> +
>> +	if (kvm_gmem_has_some_shared(inode, folio->index, to_nr_pages))
>> +		return kvm_gmem_split_folio_in_filemap(inode, folio);
>> +
>> +	return 0;
>> +}
>> +
>> +/**
>> + * kvm_gmem_merge_folio_in_filemap() - Merge @first_folio within filemap in
>> + * @inode.
>> + *
>> + * @inode: inode containing the folio.
>> + * @first_folio: first folio among folios to be merged.
>> + *
>> + * Will clean up subfolios from filemap and add back the merged folio.
>> + *
>> + * Context: Expects that before this call, all subfolios only have filemap
>> + *          refcounts. After this function returns, the merged folio will only
>> + *          have filemap refcounts.
>> + * Return: 0 on success or negative error otherwise.
>> + */
>> +static int kvm_gmem_merge_folio_in_filemap(struct inode *inode,
>> +					   struct folio *first_folio)
>> +{
>> +	size_t to_nr_pages;
>> +	pgoff_t index;
>> +	void *priv;
>> +	size_t i;
>> +	int ret;
>> +
>> +	index = first_folio->index;
>> +
>> +	priv = kvm_gmem_allocator_private(inode);
>> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
>> +	if (folio_nr_pages(first_folio) == to_nr_pages)
>> +		return 0;
>> +
>> +	for (i = 0; i < to_nr_pages; ++i) {
>> +		struct folio *f = page_folio(folio_page(first_folio, i));
>> +
>> +		filemap_remove_folio_for_restructuring(f);
>> +	}
>> +
>> +	kvm_gmem_allocator_ops(inode)->merge_folio(first_folio);
>> +
>> +	ret = __kvm_gmem_filemap_add_folio(inode->i_mapping, first_folio, index);
>> +	if (ret)
>> +		goto err_split;
>> +
>> +	return ret;
>> +
>> +err_split:
>> +	WARN_ON(kvm_gmem_allocator_ops(inode)->split_folio(first_folio));
> guestmem_hugetlb_split_folio() is possible to fail. e.g.
> After the stash is freed by guestmem_hugetlb_unstash_free_metadata() in
> guestmem_hugetlb_merge_folio(), it's possible to get -ENOMEM for the stash
> allocation in guestmem_hugetlb_stash_metadata() in
> guestmem_hugetlb_split_folio().
>
>

Yes. This is also on the error path. In line with all the other error
and rollback paths, I don't really have other options at this point:
on error, we have probably run out of memory, so I try my best to
restore the original state and otherwise give up with a WARN.

>> +	for (i = 0; i < to_nr_pages; ++i) {
>> +		struct folio *f = page_folio(folio_page(first_folio, i));
>> +
>> +		WARN_ON(__kvm_gmem_filemap_add_folio(inode->i_mapping, f, index + i));
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static inline int kvm_gmem_try_merge_folio_in_filemap(struct inode *inode,
>> +						      struct folio *first_folio)
>> +{
>> +	size_t to_nr_pages;
>> +	void *priv;
>> +
>> +	priv = kvm_gmem_allocator_private(inode);
>> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
>> +
>> +	if (kvm_gmem_has_some_shared(inode, first_folio->index, to_nr_pages))
>> +		return 0;
>> +
>> +	return kvm_gmem_merge_folio_in_filemap(inode, first_folio);
>> +}
>> +
>> +static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
>> +						pgoff_t start, size_t nr_pages,
>> +						bool is_split_operation)
>> +{
>> +	size_t to_nr_pages;
>> +	pgoff_t index;
>> +	pgoff_t end;
>> +	void *priv;
>> +	int ret;
>> +
>> +	if (!kvm_gmem_has_custom_allocator(inode))
>> +		return 0;
>> +
>> +	end = start + nr_pages;
>> +
>> +	/* Round to allocator page size, to check all (huge) pages in range. */
>> +	priv = kvm_gmem_allocator_private(inode);
>> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
>> +
>> +	start = round_down(start, to_nr_pages);
>> +	end = round_up(end, to_nr_pages);
>> +
>> +	for (index = start; index < end; index += to_nr_pages) {
>> +		struct folio *f;
>> +
>> +		f = filemap_get_folio(inode->i_mapping, index);
>> +		if (IS_ERR(f))
>> +			continue;
>> +
>> +		/* Leave just filemap's refcounts on the folio. */
>> +		folio_put(f);
>> +
>> +		if (is_split_operation)
>> +			ret = kvm_gmem_split_folio_in_filemap(inode, f);
> kvm_gmem_try_split_folio_in_filemap()?
>

Here we know for sure that this was a private-to-shared
conversion. Hence, we know that there are at least some shared parts in
this huge page and we can skip checking that. 

>> +		else
>> +			ret = kvm_gmem_try_merge_folio_in_filemap(inode, f);
>> +

For merge, we don't know if the entire huge page might perhaps contain
some other shared subpages, hence we "try" to merge by first checking
against shareability to find shared subpages.

>> +		if (ret)
>> +			goto rollback;
>> +	}
>> +	return ret;
>> +
>> +rollback:
>> +	for (index -= to_nr_pages; index >= start; index -= to_nr_pages) {

Note to self: the first index -= to_nr_pages was meant to skip the index
that caused the failure, but this could cause an underflow if index = 0
when entering rollback. Need to fix this in the next revision.

>> +		struct folio *f;
>> +
>> +		f = filemap_get_folio(inode->i_mapping, index);
>> +		if (IS_ERR(f))
>> +			continue;
>> +
>> +		/* Leave just filemap's refcounts on the folio. */
>> +		folio_put(f);
>> +
>> +		if (is_split_operation)
>> +			WARN_ON(kvm_gmem_merge_folio_in_filemap(inode, f));
>> +		else
>> +			WARN_ON(kvm_gmem_split_folio_in_filemap(inode, f));
> Is it safe to just leave WARN_ON()s in the rollback case?
>

Same as above. I don't think we have much of a choice.

> Besides, are the kvm_gmem_merge_folio_in_filemap() and
> kvm_gmem_split_folio_in_filemap() here duplicated with the
> kvm_gmem_split_folio_in_filemap() and kvm_gmem_try_merge_folio_in_filemap() in
> the following "r = kvm_gmem_convert_execute_work(inode, work, !shared)"?
>

This handles the case where some pages in the range [start, start +
nr_pages) were split and the failure was halfway through. I could call
kvm_gmem_convert_execute_work() with !shared but that would go over all
the folios again from the start.

>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +#else
>> +
>> +static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
>> +						      struct folio *folio)
>> +{
>> +	return 0;
>> +}
>> +
>> +static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
>> +						pgoff_t start, size_t nr_pages,
>> +						bool is_split_operation)
>> +{
>> +	return 0;
>> +}
>> +
>> +#endif
>> +
>  

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 38/51] KVM: guest_memfd: Split allocator pages for guest_memfd use
  2025-06-05  5:24   ` Binbin Wu
@ 2025-06-05 19:16     ` Ackerley Tng
  0 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-06-05 19:16 UTC (permalink / raw)
  To: Binbin Wu
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko, jgg,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li

Binbin Wu <binbin.wu@linux.intel.com> writes:

> On 5/15/2025 7:42 AM, Ackerley Tng wrote:
> [...]
>> +
>> +static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
>> +						      struct folio *folio)
>> +{
>> +	size_t to_nr_pages;
>> +	void *priv;
>> +
>> +	if (!kvm_gmem_has_custom_allocator(inode))
>> +		return 0;
>> +
>> +	priv = kvm_gmem_allocator_private(inode);
>> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_page(priv);
>> +
>> +	if (kvm_gmem_has_some_shared(inode, folio->index, to_nr_pages))
>
> What if a huge page whose attribute is shared?
>

This checks if there are any shared pages in the range [folio->index,
folio->index + to_nr_pages), so if the entire huge page is shared this
function should also return true.

folio->index is the start of the merged huge page, and to_nr_pages is
the number of pages in the merged huge page, so this should be querying
exactly the entire huge page.

Note to self: rename kvm_gmem_has_some_shared() to
kvm_gmem_has_any_shared() in the next revision.

Hope I answered your question! Let me know if I misunderstood your question.

>> +		return kvm_gmem_split_folio_in_filemap(inode, folio);
>> +
>> +	return 0;
>> +}
>> +
> [...]
>>   
>>   static int kvm_gmem_shareability_setup(struct maple_tree *mt, loff_t size, u64 flags)
>> @@ -563,11 +1005,16 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>>   		return folio;
>>   
>>   	if (kvm_gmem_has_custom_allocator(inode)) {
>> -		void *p = kvm_gmem_allocator_private(inode);
>> +		size_t nr_pages;
>> +		void *p;
>>   
>> +		p = kvm_gmem_allocator_private(inode);
>>   		folio = kvm_gmem_allocator_ops(inode)->alloc_folio(p);
>>   		if (IS_ERR(folio))
>>   			return folio;
>> +
>> +		nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(p);
>> +		index_floor = round_down(index, nr_pages);
>>   	} else {
>>   		gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
>>   
>> @@ -580,10 +1027,11 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>>   			folio_put(folio);
>>   			return ERR_PTR(ret);
>>   		}
>> +
>> +		index_floor = index;
>>   	}
>>   	allocated_size = folio_size(folio);
>>   
>> -	index_floor = round_down(index, folio_nr_pages(folio));
>>   	ret = kvm_gmem_filemap_add_folio(inode->i_mapping, folio, index_floor);
>>   	if (ret) {
>>   		folio_put(folio);
>> @@ -600,6 +1048,13 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>>   		return ERR_PTR(ret);
>>   	}
>>   
>> +	/* Leave just filemap's refcounts on folio. */
>> +	folio_put(folio);
>> +
>> +	ret = kvm_gmem_try_split_folio_in_filemap(inode, folio);
>
> When !CONFIG_KVM_GMEM_SHARED_MEM, kvm_gmem_try_split_folio_in_filemap() is
> undefined.
>

Will fix this in the next revision. Thanks!

>> +	if (ret)
>> +		goto err;
>> +
>>   	spin_lock(&inode->i_lock);
>>   	inode->i_blocks += allocated_size / 512;
>>   	spin_unlock(&inode->i_lock);
>>
> [...]

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-05-29  5:42   ` Michael Roth
@ 2025-06-11 21:51     ` Ackerley Tng
  2025-07-02 23:25       ` Michael Roth
  2025-06-11 22:10     ` Ackerley Tng
  1 sibling, 1 reply; 231+ messages in thread
From: Ackerley Tng @ 2025-06-11 21:51 UTC (permalink / raw)
  To: Michael Roth
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, mpe, muchun.song, nikunj,
	nsaenz, oliver.upton, palmer, pankaj.gupta, paul.walmsley,
	pbonzini, pdurrant, peterx, pgonda, pvorel, qperret,
	quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li

Michael Roth <michael.roth@amd.com> writes:

> On Wed, May 14, 2025 at 04:41:41PM -0700, Ackerley Tng wrote:
>> Track guest_memfd memory's shareability status within the inode as
>> opposed to the file, since it is property of the guest_memfd's memory
>> contents.
>> 
>> Shareability is a property of the memory and is indexed using the
>> page's index in the inode. Because shareability is the memory's
>> property, it is stored within guest_memfd instead of within KVM, like
>> in kvm->mem_attr_array.
>> 
>> KVM_MEMORY_ATTRIBUTE_PRIVATE in kvm->mem_attr_array must still be
>> retained to allow VMs to only use guest_memfd for private memory and
>> some other memory for shared memory.
>> 
>> Not all use cases require guest_memfd() to be shared with the host
>> when first created. Add a new flag, GUEST_MEMFD_FLAG_INIT_PRIVATE,
>> which when set on KVM_CREATE_GUEST_MEMFD, initializes the memory as
>> private to the guest, and therefore not mappable by the
>> host. Otherwise, memory is shared until explicitly converted to
>> private.
>> 
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
>> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
>> Co-developed-by: Fuad Tabba <tabba@google.com>
>> Signed-off-by: Fuad Tabba <tabba@google.com>
>> Change-Id: If03609cbab3ad1564685c85bdba6dcbb6b240c0f
>> ---
>>  Documentation/virt/kvm/api.rst |   5 ++
>>  include/uapi/linux/kvm.h       |   2 +
>>  virt/kvm/guest_memfd.c         | 124 ++++++++++++++++++++++++++++++++-
>>  3 files changed, 129 insertions(+), 2 deletions(-)
>> 
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 86f74ce7f12a..f609337ae1c2 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -6408,6 +6408,11 @@ belonging to the slot via its userspace_addr.
>>  The use of GUEST_MEMFD_FLAG_SUPPORT_SHARED will not be allowed for CoCo VMs.
>>  This is validated when the guest_memfd instance is bound to the VM.
>>  
>> +If the capability KVM_CAP_GMEM_CONVERSIONS is supported, then the 'flags' field
>> +supports GUEST_MEMFD_FLAG_INIT_PRIVATE.  Setting GUEST_MEMFD_FLAG_INIT_PRIVATE
>> +will initialize the memory for the guest_memfd as guest-only and not faultable
>> +by the host.
>> +
>
> KVM_CAP_GMEM_CONVERSION doesn't get introduced until later, so it seems
> like this flag should be deferred until that patch is in place. Is it
> really needed at that point though? Userspace would be able to set the
> initial state via KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls.
>

I can move this change to the later patch. Thanks! Will fix in the next
revision.

> The mtree contents seems to get stored in the same manner in either case so
> performance-wise only the overhead of a few userspace<->kernel switches
> would be saved. Are there any other reasons?
>
> Otherwise, maybe just settle on SHARED as a documented default (since at
> least non-CoCo VMs would be able to reliably benefit) and let
> CoCo/GUEST_MEMFD_FLAG_SUPPORT_SHARED VMs set PRIVATE at whatever
> granularity makes sense for the architecture/guest configuration.
>

Because shared pages are split once any memory is allocated, having a
way to INIT_PRIVATE could avoid a split followed by a merge on
conversion to private. I feel that is enough value to justify this flag;
what do you think? (A minimal usage sketch follows below.)

I guess we could also have userspace be careful not to do any allocation
before converting.
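
For what it's worth, here is a minimal sketch of what using the flag could
look like from userspace. vm_fd is assumed to be an existing KVM VM fd,
struct kvm_create_guest_memfd and KVM_CREATE_GUEST_MEMFD are the existing
UAPI, and the GUEST_MEMFD_FLAG_* values are the ones defined in this series
(not upstream), hence the fallback defines:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#ifndef GUEST_MEMFD_FLAG_SUPPORT_SHARED
#define GUEST_MEMFD_FLAG_SUPPORT_SHARED	(1UL << 0)
#define GUEST_MEMFD_FLAG_INIT_PRIVATE	(1UL << 1)
#endif

static int create_gmem_init_private(int vm_fd, uint64_t size)
{
	struct kvm_create_guest_memfd gmem = {
		.size = size,
		/*
		 * Start out fully private so that no folios get split for
		 * shared faults (and later re-merged) before the guest
		 * explicitly converts anything to shared.
		 */
		.flags = GUEST_MEMFD_FLAG_SUPPORT_SHARED |
			 GUEST_MEMFD_FLAG_INIT_PRIVATE,
	};

	return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
}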

>>  See KVM_SET_USER_MEMORY_REGION2 for additional details.
>>  
>>  4.143 KVM_PRE_FAULT_MEMORY
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 4cc824a3a7c9..d7df312479aa 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1567,7 +1567,9 @@ struct kvm_memory_attributes {
>>  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
>>  
>>  #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
>> +
>>  #define GUEST_MEMFD_FLAG_SUPPORT_SHARED	(1UL << 0)
>> +#define GUEST_MEMFD_FLAG_INIT_PRIVATE	(1UL << 1)
>>  
>>  struct kvm_create_guest_memfd {
>>  	__u64 size;
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 239d0f13dcc1..590932499eba 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -4,6 +4,7 @@
>>  #include <linux/falloc.h>
>>  #include <linux/fs.h>
>>  #include <linux/kvm_host.h>
>> +#include <linux/maple_tree.h>
>>  #include <linux/pseudo_fs.h>
>>  #include <linux/pagemap.h>
>>  
>> @@ -17,6 +18,24 @@ struct kvm_gmem {
>>  	struct list_head entry;
>>  };
>>  
>> +struct kvm_gmem_inode_private {
>> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
>> +	struct maple_tree shareability;
>> +#endif
>> +};
>> +
>> +enum shareability {
>> +	SHAREABILITY_GUEST = 1,	/* Only the guest can map (fault) folios in this range. */
>> +	SHAREABILITY_ALL = 2,	/* Both guest and host can fault folios in this range. */
>> +};
>> +
>> +static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index);
>> +
>> +static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
>> +{
>> +	return inode->i_mapping->i_private_data;
>> +}
>> +
>>  /**
>>   * folio_file_pfn - like folio_file_page, but return a pfn.
>>   * @folio: The folio which contains this index.
>> @@ -29,6 +48,58 @@ static inline kvm_pfn_t folio_file_pfn(struct folio *folio, pgoff_t index)
>>  	return folio_pfn(folio) + (index & (folio_nr_pages(folio) - 1));
>>  }
>>  
>> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
>> +
>> +static int kvm_gmem_shareability_setup(struct kvm_gmem_inode_private *private,
>> +				      loff_t size, u64 flags)
>> +{
>> +	enum shareability m;
>> +	pgoff_t last;
>> +
>> +	last = (size >> PAGE_SHIFT) - 1;
>> +	m = flags & GUEST_MEMFD_FLAG_INIT_PRIVATE ? SHAREABILITY_GUEST :
>> +						    SHAREABILITY_ALL;
>> +	return mtree_store_range(&private->shareability, 0, last, xa_mk_value(m),
>> +				 GFP_KERNEL);
>
> One really nice thing about using a maple tree is that it should get rid
> of a fairly significant startup delay for SNP/TDX when the entire xarray gets
> initialized with private attribute entries via KVM_SET_MEMORY_ATTRIBUTES
> (which is the current QEMU default behavior).
>
> I'd originally advocated for sticking with the xarray implementation Fuad was
> using until we'd determined we really need it for HugeTLB support, but I'm
> sort of thinking it's already justified just based on the above.
>
> Maybe it would make sense for KVM memory attributes too?
>
>> +}
>> +
>> +static enum shareability kvm_gmem_shareability_get(struct inode *inode,
>> +						 pgoff_t index)
>> +{
>> +	struct maple_tree *mt;
>> +	void *entry;
>> +
>> +	mt = &kvm_gmem_private(inode)->shareability;
>> +	entry = mtree_load(mt, index);
>> +	WARN(!entry,
>> +	     "Shareability should always be defined for all indices in inode.");
>> +
>> +	return xa_to_value(entry);
>> +}
>> +
>> +static struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
>> +{
>> +	if (kvm_gmem_shareability_get(inode, index) != SHAREABILITY_ALL)
>> +		return ERR_PTR(-EACCES);
>> +
>> +	return kvm_gmem_get_folio(inode, index);
>> +}
>> +
>> +#else
>> +
>> +static int kvm_gmem_shareability_setup(struct maple_tree *mt, loff_t size, u64 flags)
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
>> +{
>> +	WARN_ONCE("Unexpected call to get shared folio.")
>> +	return NULL;
>> +}
>> +
>> +#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
>> +
>>  static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
>>  				    pgoff_t index, struct folio *folio)
>>  {
>> @@ -333,7 +404,7 @@ static vm_fault_t kvm_gmem_fault_shared(struct vm_fault *vmf)
>>  
>>  	filemap_invalidate_lock_shared(inode->i_mapping);
>>  
>> -	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
>> +	folio = kvm_gmem_get_shared_folio(inode, vmf->pgoff);
>>  	if (IS_ERR(folio)) {
>>  		int err = PTR_ERR(folio);
>>  
>> @@ -420,8 +491,33 @@ static struct file_operations kvm_gmem_fops = {
>>  	.fallocate	= kvm_gmem_fallocate,
>>  };
>>  
>> +static void kvm_gmem_free_inode(struct inode *inode)
>> +{
>> +	struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
>> +
>> +	kfree(private);
>> +
>> +	free_inode_nonrcu(inode);
>> +}
>> +
>> +static void kvm_gmem_destroy_inode(struct inode *inode)
>> +{
>> +	struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
>> +
>> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
>> +	/*
>> +	 * mtree_destroy() can't be used within rcu callback, hence can't be
>> +	 * done in ->free_inode().
>> +	 */
>> +	if (private)
>> +		mtree_destroy(&private->shareability);
>> +#endif
>> +}
>> +
>>  static const struct super_operations kvm_gmem_super_operations = {
>>  	.statfs		= simple_statfs,
>> +	.destroy_inode	= kvm_gmem_destroy_inode,
>> +	.free_inode	= kvm_gmem_free_inode,
>>  };
>>  
>>  static int kvm_gmem_init_fs_context(struct fs_context *fc)
>> @@ -549,12 +645,26 @@ static const struct inode_operations kvm_gmem_iops = {
>>  static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>>  						      loff_t size, u64 flags)
>>  {
>> +	struct kvm_gmem_inode_private *private;
>>  	struct inode *inode;
>> +	int err;
>>  
>>  	inode = alloc_anon_secure_inode(kvm_gmem_mnt->mnt_sb, name);
>>  	if (IS_ERR(inode))
>>  		return inode;
>>  
>> +	err = -ENOMEM;
>> +	private = kzalloc(sizeof(*private), GFP_KERNEL);
>> +	if (!private)
>> +		goto out;
>> +
>> +	mt_init(&private->shareability);
>> +	inode->i_mapping->i_private_data = private;
>> +
>> +	err = kvm_gmem_shareability_setup(private, size, flags);
>> +	if (err)
>> +		goto out;
>> +
>>  	inode->i_private = (void *)(unsigned long)flags;
>>  	inode->i_op = &kvm_gmem_iops;
>>  	inode->i_mapping->a_ops = &kvm_gmem_aops;
>> @@ -566,6 +676,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>>  	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>>  
>>  	return inode;
>> +
>> +out:
>> +	iput(inode);
>> +
>> +	return ERR_PTR(err);
>>  }
>>  
>>  static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
>> @@ -654,6 +769,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>>  	if (kvm_arch_vm_supports_gmem_shared_mem(kvm))
>>  		valid_flags |= GUEST_MEMFD_FLAG_SUPPORT_SHARED;
>>  
>> +	if (flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED)
>> +		valid_flags |= GUEST_MEMFD_FLAG_INIT_PRIVATE;
>> +
>>  	if (flags & ~valid_flags)
>>  		return -EINVAL;
>>  
>> @@ -842,6 +960,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>  	if (!file)
>>  		return -EFAULT;
>>  
>> +	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
>> +
>
> I like the idea of using a write-lock/read-lock to protect write/read access
> to shareability state (though maybe not necessarily re-using filemap's
> invalidate lock), it's simple and still allows concurrent faulting in of gmem
> pages. One issue on the SNP side (which also came up in one of the gmem calls)
> is if we introduce support for tracking preparedness as discussed (e.g. via a
> new SHAREABILITY_GUEST_PREPARED state) the
> SHAREABILITY_GUEST->SHAREABILITY_GUEST_PREPARED transition would occur at
> fault-time, and so would need to take the write-lock and no longer allow for
> concurrent fault-handling.
>
> I was originally planning on introducing a new rw_semaphore with similar
> semantics to the rw_lock that Fuad previously had in his restricted mmap
> series[1] (and simiar semantics to filemap invalidate lock here). The main
> difference, to handle setting SHAREABILITY_GUEST_PREPARED within fault paths,
> was that in the case of a folio being present for an index, the folio lock would
> also need to be held in order to update the shareability state. Because
> of that, fault paths (which will always either have or allocate folio
> basically) can rely on the folio lock to guard shareability state in a more
> granular way and so can avoid a global write lock.
>
> They would still need to hold the read lock to access the tree however.
> Or more specifically, any paths that could allocate a folio need to take
> a read lock so there isn't a TOCTOU situation where shareability is
> being updated for an index for which a folio hasn't been allocated, but
> then just afterward the folio gets faulted in/allocated while the
> shareability state is already being updated which the understand that
> there was no folio around that needed locking.
>
> I had a branch with in-place conversion support for SNP[2] that added this
> lock reworking on top of Fuad's series along with preparation tracking,
> but I'm now planning to rebase that on top of the patches from this
> series that Sean mentioned[3] earlier:
>
>   KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION
>   KVM: Query guest_memfd for private/shared status
>   KVM: guest_memfd: Skip LRU for guest_memfd folios
>   KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
>   KVM: guest_memfd: Introduce and use shareability to guard faulting
>   KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
>
> but figured I'd mention it here in case there are other things to consider on
> the locking front.
>
> Definitely agree with Sean though that it would be nice to start identifying a
> common base of patches for the in-place conversion enablement for SNP, TDX, and
> pKVM so the APIs/interfaces for hugepages can be handled separately.
>
> -Mike
>
> [1] https://lore.kernel.org/kvm/20250328153133.3504118-1-tabba@google.com/
> [2] https://github.com/mdroth/linux/commits/mmap-swprot-v10-snp0-wip2/
> [3] https://lore.kernel.org/kvm/aC86OsU2HSFZkJP6@google.com/
>
>>  	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
>>  	if (IS_ERR(folio)) {
>>  		r = PTR_ERR(folio);
>> @@ -857,8 +977,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>  		*page = folio_file_page(folio, index);
>>  	else
>>  		folio_put(folio);
>> -
>>  out:
>> +	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
>>  	fput(file);
>>  	return r;
>>  }
>> -- 
>> 2.49.0.1045.g170613ef41-goog
>> 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-05-29  5:42   ` Michael Roth
  2025-06-11 21:51     ` Ackerley Tng
@ 2025-06-11 22:10     ` Ackerley Tng
  1 sibling, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-06-11 22:10 UTC (permalink / raw)
  To: Michael Roth
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, mpe, muchun.song, nikunj,
	nsaenz, oliver.upton, palmer, pankaj.gupta, paul.walmsley,
	pbonzini, pdurrant, peterx, pgonda, pvorel, qperret,
	quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li

Michael Roth <michael.roth@amd.com> writes:

> On Wed, May 14, 2025 at 04:41:41PM -0700, Ackerley Tng wrote:

I missed responding to the latter two comments!

[...]

>> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
>> +
>> +static int kvm_gmem_shareability_setup(struct kvm_gmem_inode_private *private,
>> +				      loff_t size, u64 flags)
>> +{
>> +	enum shareability m;
>> +	pgoff_t last;
>> +
>> +	last = (size >> PAGE_SHIFT) - 1;
>> +	m = flags & GUEST_MEMFD_FLAG_INIT_PRIVATE ? SHAREABILITY_GUEST :
>> +						    SHAREABILITY_ALL;
>> +	return mtree_store_range(&private->shareability, 0, last, xa_mk_value(m),
>> +				 GFP_KERNEL);
>
> One really nice thing about using a maple tree is that it should get rid
> of a fairly significant startup delay for SNP/TDX when the entire xarray gets
> initialized with private attribute entries via KVM_SET_MEMORY_ATTRIBUTES
> (which is the current QEMU default behavior).
>
> I'd originally advocated for sticking with the xarray implementation Fuad was
> using until we'd determined we really need it for HugeTLB support, but I'm
> sort of thinking it's already justified just based on the above.
>

We discussed this at the guest_memfd upstream call, and I believe the
current position is to go with maple_trees. Thanks for bringing this up!

> Maybe it would make sense for KVM memory attributes too?
>

I think so, but I haven't had the chance to work on that.
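
(Illustration only, not from the series: a rough sketch of why a single range
store scales better than per-index stores for the "everything starts private"
case described above. xa, mt, index and last are placeholder locals; the entry
encoding follows the quoted kvm_gmem_shareability_setup().)

	/* xarray: one store per 4K index, i.e. millions of iterations for a
	 * large guest when KVM_SET_MEMORY_ATTRIBUTES marks everything private. */
	for (index = 0; index <= last; index++)
		xa_store(&xa, index, xa_mk_value(SHAREABILITY_GUEST), GFP_KERNEL);

	/* maple tree: a single range store covers the whole file. */
	mtree_store_range(&mt, 0, last, xa_mk_value(SHAREABILITY_GUEST), GFP_KERNEL);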

>> +}
>> +

[...]

>> @@ -842,6 +960,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>  	if (!file)
>>  		return -EFAULT;
>>  
>> +	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
>> +

In this RFC, the filemap_invalidate_lock() was basically used to
serialize everything that could modify shareability.
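
For reference, the pattern in this RFC is roughly the following (sketch only;
the exclusive-lock side lives in the conversion path, which is not in the
quoted hunk):

	/* Conversion (modifies shareability): exclusive lock. */
	filemap_invalidate_lock(inode->i_mapping);
	/* ... update the shareability maple tree, split/merge folios ... */
	filemap_invalidate_unlock(inode->i_mapping);

	/* Fault paths such as kvm_gmem_get_pfn(): shared lock only. */
	filemap_invalidate_lock_shared(inode->i_mapping);
	/* ... read shareability, fault the folio in ... */
	filemap_invalidate_unlock_shared(inode->i_mapping);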

>
> I like the idea of using a write-lock/read-lock to protect write/read access
> to shareability state (though maybe not necessarily re-using filemap's
> invalidate lock), it's simple and still allows concurrent faulting in of gmem
> pages. One issue on the SNP side (which also came up in one of the gmem calls)
> is if we introduce support for tracking preparedness as discussed (e.g. via a
> new SHAREABILITY_GUEST_PREPARED state) the
> SHAREABILITY_GUEST->SHAREABILITY_GUEST_PREPARED transition would occur at
> fault-time, and so would need to take the write-lock and no longer allow for
> concurrent fault-handling.
>
> I was originally planning on introducing a new rw_semaphore with similar
> semantics to the rw_lock that Fuad previously had in his restricted mmap
> series[1] (and simiar semantics to filemap invalidate lock here). The main
> difference, to handle setting SHAREABILITY_GUEST_PREPARED within fault paths,
> was that in the case of a folio being present for an index, the folio lock would
> also need to be held in order to update the shareability state. Because
> of that, fault paths (which will always either have or allocate folio
> basically) can rely on the folio lock to guard shareability state in a more
> granular way and so can avoid a global write lock.
>
> They would still need to hold the read lock to access the tree however.
> Or more specifically, any paths that could allocate a folio need to take
> a read lock so there isn't a TOCTOU situation where shareability is
> being updated for an index for which a folio hasn't been allocated, but
> then just afterward the folio gets faulted in/allocated while the
> shareability state is already being updated with the understanding that
> there was no folio around that needed locking.
>
> I had a branch with in-place conversion support for SNP[2] that added this
> lock reworking on top of Fuad's series along with preparation tracking,
> but I'm now planning to rebase that on top of the patches from this
> series that Sean mentioned[3] earlier:
>
>   KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION
>   KVM: Query guest_memfd for private/shared status
>   KVM: guest_memfd: Skip LRU for guest_memfd folios
>   KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
>   KVM: guest_memfd: Introduce and use shareability to guard faulting
>   KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
>
> but figured I'd mention it here in case there are other things to consider on
> the locking front.
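
(For illustration, a minimal sketch of the fault-path locking Mike describes
above, assuming a per-inode rw_semaphore named shareability_lock and the
SHAREABILITY_GUEST_PREPARED state from this discussion; the shareability_get()
and shareability_set() helpers are invented for the sketch and are not in the
posted series.)

	/* Fault path: shared lock keeps concurrent faults possible; the folio
	 * lock guards the per-index state transition. */
	down_read(&private->shareability_lock);
	folio = kvm_gmem_get_folio(inode, index);  /* has or allocates the folio */
	folio_lock(folio);
	if (shareability_get(private, index) == SHAREABILITY_GUEST)
		shareability_set(private, index, SHAREABILITY_GUEST_PREPARED);
	folio_unlock(folio);
	up_read(&private->shareability_lock);

	/* Conversion path: exclusive lock serializes against all faults. */
	down_write(&private->shareability_lock);
	/* ... update shareability for the range, restructure folios ... */
	up_write(&private->shareability_lock);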

We discussed this a little at the last guest_memfd call: I'll summarize
the question I raised during the call here in text. :)

Today in guest_memfd the "prepared" and "zeroed" concepts are tracked
with the folio's uptodate flag.

Preparation is only used by SNP today and TDX does the somewhat
equivalent "preparation" at time of mapping into the guest page table.

Can we do SNP's preparation at some other point in time and not let the
"prepared" state be handled by guest_memfd at all?

This might simplify locking too, so preparedness would be locked
whenever SNP needs to, independently of shareability tracking.

Also, this might simplify the routines that use kvm_gmem_populate(),
perhaps remove the need for kvm_gmem_populate()? The current callers are
basically using kvm_gmem_populate() to allocate pages, so why not call
kvm_gmem_get_folio() to do the allocation?

Another tangential point: it's hard to use the uptodate flag for
tracking preparedness, since when there are huge pages, the uptodate
flag can only indicate if the entire folio is prepared, but a user of
the memory might only have part of the folio prepared.
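
(A two-line illustration of that granularity problem, based on the behavior
described above:)

	folio_mark_uptodate(folio);	/* marks the whole (possibly 1G) folio */
	/* there is no per-subpage form of this flag, so "only this 2M range of
	 * the 1G folio is prepared" cannot be recorded here */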

>
> Definitely agree with Sean though that it would be nice to start identifying a
> common base of patches for the in-place conversion enablement for SNP, TDX, and
> pKVM so the APIs/interfaces for hugepages can be handled separately.
>
> -Mike
>
> [1] https://lore.kernel.org/kvm/20250328153133.3504118-1-tabba@google.com/
> [2] https://github.com/mdroth/linux/commits/mmap-swprot-v10-snp0-wip2/
> [3] https://lore.kernel.org/kvm/aC86OsU2HSFZkJP6@google.com/
>
>>  	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
>>  	if (IS_ERR(folio)) {
>>  		r = PTR_ERR(folio);
>> @@ -857,8 +977,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>  		*page = folio_file_page(folio, index);
>>  	else
>>  		folio_put(folio);
>> -
>>  out:
>> +	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
>>  	fput(file);
>>  	return r;
>>  }
>> -- 
>> 2.49.0.1045.g170613ef41-goog
>> 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 22/51] mm: hugetlb: Refactor hugetlb allocation functions
  2025-05-31 23:45   ` Ira Weiny
@ 2025-06-13 22:03     ` Ackerley Tng
  0 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-06-13 22:03 UTC (permalink / raw)
  To: Ira Weiny
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li

Ira Weiny <ira.weiny@intel.com> writes:

> Ackerley Tng wrote:
>> Refactor dequeue_hugetlb_folio() and alloc_surplus_hugetlb_folio() to
>> take mpol, nid and nodemask. This decouples allocation of a folio from
>> a vma.
>> 
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> Change-Id: I890fb46fe8c6349383d8cf89befc68a4994eb416
>> ---
>>  mm/hugetlb.c | 64 ++++++++++++++++++++++++----------------------------
>>  1 file changed, 30 insertions(+), 34 deletions(-)
>> 
>
> [snip]
>
>>  
>> @@ -2993,6 +2974,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>>  	int ret, idx;
>>  	struct hugetlb_cgroup *h_cg = NULL;
>>  	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
>> +	struct mempolicy *mpol;
>> +	nodemask_t *nodemask;
>> +	gfp_t gfp_mask;
>> +	pgoff_t ilx;
>> +	int nid;
>>  
>>  	idx = hstate_index(h);
>>  
>> @@ -3032,7 +3018,6 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>>  
>>  		subpool_reservation_exists = npages_req == 0;
>>  	}
>> -
>>  	reservation_exists = vma_reservation_exists || subpool_reservation_exists;
>>  
>>  	/*
>> @@ -3048,21 +3033,30 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>>  			goto out_subpool_put;
>>  	}
>>  
>> +	mpol = get_vma_policy(vma, addr, h->order, &ilx);
>
> Why does the memory policy need to be acquired here instead of after the
> cgroup charge?  AFAICT this is not needed and would at least eliminate 1
> of the error conditions puts.
>

I was hoping that by taking this early, the upcoming refactoring out of
hugetlb_alloc_folio() would look like a nice, clean removal of the middle
of this function, leaving the acquisition of the mpol and the
mpol_cond_put() in place.

In the next revision I'm splitting up the refactoring in this patch
further so if this is still an issue in some number of revisions' time,
I can fix this.

>> +
>>  	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
>> -	if (ret)
>> +	if (ret) {
>> +		mpol_cond_put(mpol);
>                 ^^^^
> 		here
>
> All that said I think the use of some new cleanup macros could really help
> a lot of this code.
>

I'm happy to try that out...

> What do folks in this area of the kernel think of those?
>

not sure though.
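
(In case it helps that discussion, a sketch of what the <linux/cleanup.h>
helpers could look like here; the DEFINE_FREE() wrapper below is hypothetical,
mm does not define one for mempolicy today.)

	/* hypothetical: auto-drop the policy reference on scope exit */
	DEFINE_FREE(mpol_put, struct mempolicy *, if (_T) mpol_cond_put(_T))

	/* inside alloc_hugetlb_folio(): */
	struct mempolicy *mpol __free(mpol_put) =
		get_vma_policy(vma, addr, h->order, &ilx);

	if (hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg))
		return ERR_PTR(-ENOSPC);  /* mpol reference dropped automatically */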

> Ira
>
> [snip]

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 23/51] mm: hugetlb: Refactor out hugetlb_alloc_folio()
  2025-06-01  0:38   ` Ira Weiny
@ 2025-06-13 22:07     ` Ackerley Tng
  0 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-06-13 22:07 UTC (permalink / raw)
  To: Ira Weiny
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li

Ira Weiny <ira.weiny@intel.com> writes:

> Ackerley Tng wrote:
>> Refactor out hugetlb_alloc_folio() from alloc_hugetlb_folio(), which
>> handles allocation of a folio and cgroup charging.
>>
>> Other than flags to control charging in the allocation process,
>> hugetlb_alloc_folio() also has parameters for memory policy.
>>
>> This refactoring as a whole decouples the hugetlb page allocation from
>> hugetlbfs, (1) where the subpool is stored at the fs mount, (2)
>> reservations are made during mmap and stored in the vma, and (3) mpol
>> must be stored at vma->vm_policy, and (4) a vma must be used for allocation
>> even if the pages are not meant to be used by the host process.
>>
>> This decoupling will allow hugetlb_alloc_folio() to be used by
>> guest_memfd in later patches. In guest_memfd, (1) a subpool is created
>> per-fd and is stored on the inode, (2) no vma-related reservations are
>> used, (3) mpol may not be associated with a vma, since (4) for private
>> pages, the pages will not be mappable to userspace and hence have no
>> associated vmas.
>>
>> This could hopefully also open hugetlb up as a more generic source of
>> hugetlb pages that are not bound to hugetlbfs, with the complexities
>> of userspace/mmap/vma-related reservations contained just to
>> hugetlbfs.
>>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> Change-Id: I60528f246341268acbf0ed5de7752ae2cacbef93
>> ---
>>  include/linux/hugetlb.h |  12 +++
>>  mm/hugetlb.c            | 192 ++++++++++++++++++++++------------------
>>  2 files changed, 118 insertions(+), 86 deletions(-)
>>
>
> [snip]
>
>>
>> +/**
>> + * hugetlb_alloc_folio() - Allocates a hugetlb folio.
>> + *
>> + * @h: struct hstate to allocate from.
>> + * @mpol: struct mempolicy to apply for this folio allocation.
>> + * @ilx: Interleave index for interpretation of @mpol.
>> + * @charge_cgroup_rsvd: Set to true to charge cgroup reservation.
>> + * @use_existing_reservation: Set to true if this allocation should use an
>> + *                            existing hstate reservation.
>> + *
>> + * This function handles cgroup and global hstate reservations. VMA-related
>> + * reservations and subpool debiting must be handled by the caller if necessary.
>> + *
>> + * Return: folio on success or negated error otherwise.
>> + */
>> +struct folio *hugetlb_alloc_folio(struct hstate *h, struct mempolicy *mpol,
>> +				  pgoff_t ilx, bool charge_cgroup_rsvd,
>> +				  bool use_existing_reservation)
>> +{
>> +	unsigned int nr_pages = pages_per_huge_page(h);
>> +	struct hugetlb_cgroup *h_cg = NULL;
>> +	struct folio *folio = NULL;
>> +	nodemask_t *nodemask;
>> +	gfp_t gfp_mask;
>> +	int nid;
>> +	int idx;
>> +	int ret;
>> +
>> +	idx = hstate_index(h);
>> +
>> +	if (charge_cgroup_rsvd) {
>> +		if (hugetlb_cgroup_charge_cgroup_rsvd(idx, nr_pages, &h_cg))
>> +			goto out;
>
> Why not just return here?
> 			return ERR_PTR(-ENOSPC);
>

I wanted to consistently exit the function on errors at the same place,
and also make this refactoring look like I just took the middle of
alloc_hugetlb_folio() out as much as possible.

>> +	}
>> +
>> +	if (hugetlb_cgroup_charge_cgroup(idx, nr_pages, &h_cg))
>> +		goto out_uncharge_cgroup_reservation;
>> +
>> +	gfp_mask = htlb_alloc_mask(h);
>> +	nid = policy_node_nodemask(mpol, gfp_mask, ilx, &nodemask);
>> +
>> +	spin_lock_irq(&hugetlb_lock);
>> +
>> +	if (use_existing_reservation || available_huge_pages(h))
>> +		folio = dequeue_hugetlb_folio(h, gfp_mask, mpol, nid, nodemask);
>> +
>> +	if (!folio) {
>> +		spin_unlock_irq(&hugetlb_lock);
>> +		folio = alloc_surplus_hugetlb_folio(h, gfp_mask, mpol, nid, nodemask);
>> +		if (!folio)
>> +			goto out_uncharge_cgroup;
>> +		spin_lock_irq(&hugetlb_lock);
>> +		list_add(&folio->lru, &h->hugepage_activelist);
>> +		folio_ref_unfreeze(folio, 1);
>> +		/* Fall through */
>> +	}
>> +
>> +	if (use_existing_reservation) {
>> +		folio_set_hugetlb_restore_reserve(folio);
>> +		h->resv_huge_pages--;
>> +	}
>> +
>> +	hugetlb_cgroup_commit_charge(idx, nr_pages, h_cg, folio);
>> +
>> +	if (charge_cgroup_rsvd)
>> +		hugetlb_cgroup_commit_charge_rsvd(idx, nr_pages, h_cg, folio);
>> +
>> +	spin_unlock_irq(&hugetlb_lock);
>> +
>> +	gfp_mask = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
>> +	ret = mem_cgroup_charge_hugetlb(folio, gfp_mask);
>> +	/*
>> +	 * Unconditionally increment NR_HUGETLB here. If it turns out that
>> +	 * mem_cgroup_charge_hugetlb failed, then immediately free the page and
>> +	 * decrement NR_HUGETLB.
>> +	 */
>> +	lruvec_stat_mod_folio(folio, NR_HUGETLB, pages_per_huge_page(h));
>> +
>> +	if (ret == -ENOMEM) {
>> +		free_huge_folio(folio);
>> +		return ERR_PTR(-ENOMEM);
>> +	}
>> +
>> +	return folio;
>> +
>> +out_uncharge_cgroup:
>> +	hugetlb_cgroup_uncharge_cgroup(idx, nr_pages, h_cg);
>> +out_uncharge_cgroup_reservation:
>> +	if (charge_cgroup_rsvd)
>> +		hugetlb_cgroup_uncharge_cgroup_rsvd(idx, nr_pages, h_cg);
>
> I find the direct copy of the unwind logic from alloc_hugetlb_folio()
> cumbersome and it seems like a good opportunity to clean it up.
>

I really wanted to make this refactoring look like I just took the
middle of alloc_hugetlb_folio() out as much as possible, to make it
obvious and understandable. I think the cleanup can be a separate patch
(series?)

>> +out:
>> +	folio = ERR_PTR(-ENOSPC);
>> +	goto out;
>
> Endless loop?
>

Thanks, this should have been

return folio;
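
i.e. the tail of the function would read:

out:
	folio = ERR_PTR(-ENOSPC);
	return folio;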

> Ira
>
> [snip]

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 38/51] KVM: guest_memfd: Split allocator pages for guest_memfd use
  2025-06-05 19:10     ` Ackerley Tng
@ 2025-06-16 11:15       ` Yan Zhao
  0 siblings, 0 replies; 231+ messages in thread
From: Yan Zhao @ 2025-06-16 11:15 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yilun.xu, yuzenghui,
	zhiquan1.li

On Thu, Jun 05, 2025 at 12:10:08PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Wed, May 14, 2025 at 04:42:17PM -0700, Ackerley Tng wrote:
> >> +static int kvm_gmem_convert_execute_work(struct inode *inode,
> >> +					 struct conversion_work *work,
> >> +					 bool to_shared)
> >> +{
> >> +	enum shareability m;
> >> +	int ret;
> >> +
> >> +	m = to_shared ? SHAREABILITY_ALL : SHAREABILITY_GUEST;
> >> +	ret = kvm_gmem_shareability_apply(inode, work, m);
> >> +	if (ret)
> >> +		return ret;
> >> +	/*
> >> +	 * Apply shareability first so split/merge can operate on new
> >> +	 * shareability state.
> >> +	 */
> >> +	ret = kvm_gmem_restructure_folios_in_range(
> >> +		inode, work->start, work->nr_pages, to_shared);
> >> +
> >> +	return ret;
> >> +}
> >> +
> 
> Hi Yan,
> 
> Thanks for your thorough reviews and your alternative suggestion in the
> other discussion at [1]! I'll try to bring the conversion-related parts
> of that discussion over here.
> 
> >>  static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
> >>  				  size_t nr_pages, bool shared,
> >>  				  pgoff_t *error_index)
> 
> The guiding principles I was using for the conversion ioctls are:
> 
> * Have the shareability updates and any necessary page restructuring
>   (aka splitting/merging) either fully complete or not done at all by the
>   time the conversion ioctl returns.
> * Anything unmapped (from host or guest page tables) will not be re-mapped
>   on error.
> * Rollback undoes changes if conversion failed, and in those cases any
>   errors are turned into WARNings.
> 
> The rationale is that we want page sizes to be in sync with shareability
> so that any faults after the (successful or failed) conversion will not
> wrongly map in a larger page than allowed and cause any host crashes.
> 
> We considered 3 places where the memory can be mapped for conversions:
> 
> 1. Host page tables
> 2. Guest page tables
> 3. IOMMU page tables
> 
> Unmapping from host page tables is the simplest case. We unmap any
> shared ranges from the host page tables. Any accesses after the failed
> conversion would just fault the memory back in and proceed as usual.
> 
> guest_memfd memory is not unmapped from IOMMUs in conversions. This case
> is handled because IOMMU mappings hold refcounts. After unmapping from
> the host, we check for unexpected refcounts and fail if there are
> unexpected refcounts.
> 
> We also unmap from guest page tables. Considering failed conversions, if
> the pages are shared, we're good since the next time the guest accesses
> the page, the page will be faulted in as before.
> 
> If the pages are private, on the next guest access, the pages will be
> faulted in again as well. This is fine for software-protected VMs IIUC.
> 
> For TDX (and SNP) IIUC the memory would have been cleared, and the
> memory would also need to be re-accepted. I was thinking that this is by
> design, since when a TDX guest requests a conversion it knows that the
> contents are not to be used again.
This is not guaranteed.

On private-to-shared conversion failure, the guest may leak the page or release
the page. If the guest chooses the latter (e.g. in kvmclock_init_mem(),
kvm_arch_ptp_init()), the page is regarded as private by the guest OS.
Re-acceptance then will not happen before the guest accesses it.

So, it's better for host to keep the original SEPT if private-to-shared
conversion fails.


 
> The userspace VMM is obligated to keep trying to convert, and if it gives
> up, the userspace VMM should inform the guest that the conversion
> failed. The guest should handle conversion failures too and not assume
> that conversion always succeeds.
I don't think relying on userspace to keep retrying the conversion endlessly is
a good design.

> 
> Putting TDX aside for a moment, so far, there are a few ways this
> conversion could fail:
> 
> a. Unexpected refcounts. Userspace should clear up the unexpected
>    refcounts and report failure to the guest if it can't for whatever
>    reason.
This is acceptable. Unmapping shared mappings in the primary MMU or shared EPT
is harmless.

> b. ENOMEM because (i) we ran out of memory updating the shareability
>    maple_tree or (ii) splitting involves allocating more memory
>    for struct pages and we ran out of memory there. In this case the
>    userspace VMM gets -ENOMEM and can make more memory available and
>    then retry, or if it can't, also report failure to the guest.
This is unacceptable. Why not reserve the memory before deciding to start
the real conversion? If -ENOMEM is returned before executing the conversion, we
don't need to handle the restore error, which is impossible to handle
gracefully.


> TDX introduces TDX-specific conversion failures (see discussion at
> [1]), which this series doesn't handle, but I think we still have a line
> of sight to handle new errors.
The errors can be divided into two categories:
(1) errors due to kernel bugs
(2) errors that could occur in normal or bad conditions (e.g. the -ENOMEM)

We can't handle (1), so BUG_ON or leaking memory is allowed.
However, we should try to avoid (2) especially in the rollback path.


> In the other thread [1], I was proposing to have guest_memfd decide what
> to do on errors, but I think that might be baking more TDX-specific
> details into guest_memfd/KVM, and perhaps this is better:
The TDX-specific errors in the unmapping path is of category (1).
So, we hope to resolve it by BUG_ON and leaking the memory.

The other conversion error for TDX is for splitting memory. We hope to
do the splitting before executing the real conversion.

Please check the proposal details at
https://lore.kernel.org/all/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com.

> We could return the errors to userspace and let userspace determine what
> to do. For retryable errors (as determined by userspace), it should do
> what it needs to do, and retry. For errors like TDX being unable to
> reclaim the memory, it could tell guest_memfd to leak that memory.
> 
> If userspace gives up, it should report conversion failure to the guest
> if userspace thinks the guest can continue (to a clean shutdown or
> otherwise). If something terrible happened during conversion, then
> userspace might have to exit itself or shutdown the host.
> 
> In [2], for TDX-specific conversion failures, you proposed prepping to
> eliminate errors and exiting early on failure, then actually
> unmapping. I think that could work too.
> 
> I'm a little concerned that prepping could be complicated, since the
> nature of conversion depends on the current state of shareability, and
> there's a lot to prepare, everything from counting memory required for
> maple_tree allocation (and merging ranges in the maple_tree) to
> counting the number of pages required for undoing vmemmap optimization
> in the case of splitting...
> 
> And even after doing all the prep to eliminate errors, the unmapping
> could fail in TDX-specific cases anyway, which still needs to be
> handled.
> 
> Hence I'm hoping you'll consider to let TDX-specific failures be
> built-in and handled alongside other failures by getting help from the
> userspace VMM, and in the worst case letting the guest know the
> conversion failed.
> 
> I also appreciate comments or suggestions from anyone else!
> 
> [1] https://lore.kernel.org/all/diqzfrgfp95d.fsf@ackerleytng-ctop.c.googlers.com/
> [2] https://lore.kernel.org/all/aEEEJbTzlncbRaRA@yzhao56-desk.sh.intel.com/
> 
> >> @@ -371,18 +539,21 @@ static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
> >>  
> >>  	list_for_each_entry(work, &work_list, list) {
> >>  		rollback_stop_item = work;
> >> -		ret = kvm_gmem_shareability_apply(inode, work, m);
> >> +
> >> +		ret = kvm_gmem_convert_execute_work(inode, work, shared);
> >>  		if (ret)
> >>  			break;
> >>  	}
> >>  
> >>  	if (ret) {
> >> -		m = shared ? SHAREABILITY_GUEST : SHAREABILITY_ALL;
> >>  		list_for_each_entry(work, &work_list, list) {
> >> +			int r;
> >> +
> >> +			r = kvm_gmem_convert_execute_work(inode, work, !shared);
> >> +			WARN_ON(r);
> >> +
> >>  			if (work == rollback_stop_item)
> >>  				break;
> >> -
> >> -			WARN_ON(kvm_gmem_shareability_apply(inode, work, m));
> > Could kvm_gmem_shareability_apply() fail here?
> >
> 
> Yes, it could. If shareability cannot be updated, then we probably ran
> out of memory. Userspace VMM will probably get -ENOMEM set on some
> earlier ret and should handle that accordingly.
> 
> On -ENOMEM in a rollback, the host is in a very tough spot anyway, and a
> clean guest shutdown may be the only way out, hence this is a WARN and
> not returned to userspace.
> 
> >>  		}
> >>  	}
> >>  
> >> @@ -434,6 +605,277 @@ static int kvm_gmem_ioctl_convert_range(struct file *file,
> >>  	return ret;
> >>  }
> >>  
> >> +#ifdef CONFIG_KVM_GMEM_HUGETLB
> >> +
> >> +static inline void __filemap_remove_folio_for_restructuring(struct folio *folio)
> >> +{
> >> +	struct address_space *mapping = folio->mapping;
> >> +
> >> +	spin_lock(&mapping->host->i_lock);
> >> +	xa_lock_irq(&mapping->i_pages);
> >> +
> >> +	__filemap_remove_folio(folio, NULL);
> >> +
> >> +	xa_unlock_irq(&mapping->i_pages);
> >> +	spin_unlock(&mapping->host->i_lock);
> >> +}
> >> +
> >> +/**
> >> + * filemap_remove_folio_for_restructuring() - Remove @folio from filemap for
> >> + * split/merge.
> >> + *
> >> + * @folio: the folio to be removed.
> >> + *
> >> + * Similar to filemap_remove_folio(), but skips LRU-related calls (meaningless
> >> + * for guest_memfd), and skips call to ->free_folio() to maintain folio flags.
> >> + *
> >> + * Context: Expects only the filemap's refcounts to be left on the folio. Will
> >> + *          freeze these refcounts away so that no other users will interfere
> >> + *          with restructuring.
> >> + */
> >> +static inline void filemap_remove_folio_for_restructuring(struct folio *folio)
> >> +{
> >> +	int filemap_refcount;
> >> +
> >> +	filemap_refcount = folio_nr_pages(folio);
> >> +	while (!folio_ref_freeze(folio, filemap_refcount)) {
> >> +		/*
> >> +		 * At this point only filemap refcounts are expected, hence okay
> >> +		 * to spin until speculative refcounts go away.
> >> +		 */
> >> +		WARN_ONCE(1, "Spinning on folio=%p refcount=%d", folio, folio_ref_count(folio));
> >> +	}
> >> +
> >> +	folio_lock(folio);
> >> +	__filemap_remove_folio_for_restructuring(folio);
> >> +	folio_unlock(folio);
> >> +}
> >> +
> >> +/**
> >> + * kvm_gmem_split_folio_in_filemap() - Split @folio within filemap in @inode.
> >> + *
> >> + * @inode: inode containing the folio.
> >> + * @folio: folio to be split.
> >> + *
> >> + * Split a folio into folios of size PAGE_SIZE. Will clean up folio from filemap
> >> + * and add back the split folios.
> >> + *
> >> + * Context: Expects that before this call, folio's refcount is just the
> >> + *          filemap's refcounts. After this function returns, the split folios'
> >> + *          refcounts will also be filemap's refcounts.
> >> + * Return: 0 on success or negative error otherwise.
> >> + */
> >> +static int kvm_gmem_split_folio_in_filemap(struct inode *inode, struct folio *folio)
> >> +{
> >> +	size_t orig_nr_pages;
> >> +	pgoff_t orig_index;
> >> +	size_t i, j;
> >> +	int ret;
> >> +
> >> +	orig_nr_pages = folio_nr_pages(folio);
> >> +	if (orig_nr_pages == 1)
> >> +		return 0;
> >> +
> >> +	orig_index = folio->index;
> >> +
> >> +	filemap_remove_folio_for_restructuring(folio);
> >> +
> >> +	ret = kvm_gmem_allocator_ops(inode)->split_folio(folio);
> >> +	if (ret)
> >> +		goto err;
> >> +
> >> +	for (i = 0; i < orig_nr_pages; ++i) {
> >> +		struct folio *f = page_folio(folio_page(folio, i));
> >> +
> >> +		ret = __kvm_gmem_filemap_add_folio(inode->i_mapping, f,
> >> +						   orig_index + i);
> > Why does the failure of __kvm_gmem_filemap_add_folio() here lead to rollback,    
> > while the failure of the one under rollback only triggers WARN_ON()?
> >
> 
> Mostly because I don't really have a choice on rollback. On rollback we
> try to restore the merged folio back into the filemap, and if we
> can't, the host is probably in rather bad shape in terms of memory
> availability and there may not be many options for the userspace VMM.
Out of memory is not a good excuse for a rollback error.

> Perhaps the different possible errors from
> __kvm_gmem_filemap_add_folio() in both should be handled differently. Do
> you have any suggestions on that?
Maybe reserving the memory or ruling out other factors that could lead to
conversion failure before executing the conversion?
Then BUG_ON if the failure is caused by a bug.

> >> +		if (ret)
> >> +			goto rollback;
> >> +	}
> >> +
> >> +	return ret;
> >> +
> >> +rollback:
> >> +	for (j = 0; j < i; ++j) {
> >> +		struct folio *f = page_folio(folio_page(folio, j));
> >> +
> >> +		filemap_remove_folio_for_restructuring(f);
> >> +	}
> >> +
> >> +	kvm_gmem_allocator_ops(inode)->merge_folio(folio);
> >> +err:
> >> +	WARN_ON(__kvm_gmem_filemap_add_folio(inode->i_mapping, folio, orig_index));
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
> >> +						      struct folio *folio)
> >> +{
> >> +	size_t to_nr_pages;
> >> +	void *priv;
> >> +
> >> +	if (!kvm_gmem_has_custom_allocator(inode))
> >> +		return 0;
> >> +
> >> +	priv = kvm_gmem_allocator_private(inode);
> >> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_page(priv);
> >> +
> >> +	if (kvm_gmem_has_some_shared(inode, folio->index, to_nr_pages))
> >> +		return kvm_gmem_split_folio_in_filemap(inode, folio);
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +/**
> >> + * kvm_gmem_merge_folio_in_filemap() - Merge @first_folio within filemap in
> >> + * @inode.
> >> + *
> >> + * @inode: inode containing the folio.
> >> + * @first_folio: first folio among folios to be merged.
> >> + *
> >> + * Will clean up subfolios from filemap and add back the merged folio.
> >> + *
> >> + * Context: Expects that before this call, all subfolios only have filemap
> >> + *          refcounts. After this function returns, the merged folio will only
> >> + *          have filemap refcounts.
> >> + * Return: 0 on success or negative error otherwise.
> >> + */
> >> +static int kvm_gmem_merge_folio_in_filemap(struct inode *inode,
> >> +					   struct folio *first_folio)
> >> +{
> >> +	size_t to_nr_pages;
> >> +	pgoff_t index;
> >> +	void *priv;
> >> +	size_t i;
> >> +	int ret;
> >> +
> >> +	index = first_folio->index;
> >> +
> >> +	priv = kvm_gmem_allocator_private(inode);
> >> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
> >> +	if (folio_nr_pages(first_folio) == to_nr_pages)
> >> +		return 0;
> >> +
> >> +	for (i = 0; i < to_nr_pages; ++i) {
> >> +		struct folio *f = page_folio(folio_page(first_folio, i));
> >> +
> >> +		filemap_remove_folio_for_restructuring(f);
> >> +	}
> >> +
> >> +	kvm_gmem_allocator_ops(inode)->merge_folio(first_folio);
> >> +
> >> +	ret = __kvm_gmem_filemap_add_folio(inode->i_mapping, first_folio, index);
> >> +	if (ret)
> >> +		goto err_split;
> >> +
> >> +	return ret;
> >> +
> >> +err_split:
> >> +	WARN_ON(kvm_gmem_allocator_ops(inode)->split_folio(first_folio));
> > guestmem_hugetlb_split_folio() is possible to fail. e.g.
> > After the stash is freed by guestmem_hugetlb_unstash_free_metadata() in
> > guestmem_hugetlb_merge_folio(), it's possible to get -ENOMEM for the stash
> > allocation in guestmem_hugetlb_stash_metadata() in
> > guestmem_hugetlb_split_folio().
> >
> >
> 
> Yes. This is also on the error path. In line with all the other error
> and rollback paths, I don't really have other options at this point,
> since on error, I probably ran out of memory, so I try my best to
> restore the original state but give up with a WARN otherwise.
> 
> >> +	for (i = 0; i < to_nr_pages; ++i) {
> >> +		struct folio *f = page_folio(folio_page(first_folio, i));
> >> +
> >> +		WARN_ON(__kvm_gmem_filemap_add_folio(inode->i_mapping, f, index + i));
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static inline int kvm_gmem_try_merge_folio_in_filemap(struct inode *inode,
> >> +						      struct folio *first_folio)
> >> +{
> >> +	size_t to_nr_pages;
> >> +	void *priv;
> >> +
> >> +	priv = kvm_gmem_allocator_private(inode);
> >> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
> >> +
> >> +	if (kvm_gmem_has_some_shared(inode, first_folio->index, to_nr_pages))
> >> +		return 0;
> >> +
> >> +	return kvm_gmem_merge_folio_in_filemap(inode, first_folio);
> >> +}
> >> +
> >> +static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
> >> +						pgoff_t start, size_t nr_pages,
> >> +						bool is_split_operation)
> >> +{
> >> +	size_t to_nr_pages;
> >> +	pgoff_t index;
> >> +	pgoff_t end;
> >> +	void *priv;
> >> +	int ret;
> >> +
> >> +	if (!kvm_gmem_has_custom_allocator(inode))
> >> +		return 0;
> >> +
> >> +	end = start + nr_pages;
> >> +
> >> +	/* Round to allocator page size, to check all (huge) pages in range. */
> >> +	priv = kvm_gmem_allocator_private(inode);
> >> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
> >> +
> >> +	start = round_down(start, to_nr_pages);
> >> +	end = round_up(end, to_nr_pages);
> >> +
> >> +	for (index = start; index < end; index += to_nr_pages) {
> >> +		struct folio *f;
> >> +
> >> +		f = filemap_get_folio(inode->i_mapping, index);
> >> +		if (IS_ERR(f))
> >> +			continue;
> >> +
> >> +		/* Leave just filemap's refcounts on the folio. */
> >> +		folio_put(f);
> >> +
> >> +		if (is_split_operation)
> >> +			ret = kvm_gmem_split_folio_in_filemap(inode, f);
> > kvm_gmem_try_split_folio_in_filemap()?
> >
> 
> Here we know for sure that this was a private-to-shared
> conversion. Hence, we know that there are at least some shared parts in
> this huge page and we can skip checking that. 
Ok.

> >> +		else
> >> +			ret = kvm_gmem_try_merge_folio_in_filemap(inode, f);
> >> +
> 
> For merge, we don't know whether the huge page might still contain some
> other shared subpages, hence we "try" to merge by first checking against
> shareability to find shared subpages.
Makes sense.


> >> +		if (ret)
> >> +			goto rollback;
> >> +	}
> >> +	return ret;
> >> +
> >> +rollback:
> >> +	for (index -= to_nr_pages; index >= start; index -= to_nr_pages) {
> 
> Note to self: the first index -= to_nr_pages was meant to skip the index
> that caused the failure, but this could cause an underflow if index = 0
> when entering rollback. Need to fix this in the next revision.
Yes :)
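
(One possible shape for the fix, sketched against the quoted code: record the
failing index and walk forward from start up to, but excluding, it, which also
avoids the unsigned wrap-around.)

	pgoff_t failed = index;

	for (index = start; index < failed; index += to_nr_pages) {
		struct folio *f = filemap_get_folio(inode->i_mapping, index);

		if (IS_ERR(f))
			continue;

		/* Leave just filemap's refcounts on the folio. */
		folio_put(f);

		if (is_split_operation)
			WARN_ON(kvm_gmem_merge_folio_in_filemap(inode, f));
		else
			WARN_ON(kvm_gmem_split_folio_in_filemap(inode, f));
	}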

> >> +		struct folio *f;
> >> +
> >> +		f = filemap_get_folio(inode->i_mapping, index);
> >> +		if (IS_ERR(f))
> >> +			continue;
> >> +
> >> +		/* Leave just filemap's refcounts on the folio. */
> >> +		folio_put(f);
> >> +
> >> +		if (is_split_operation)
> >> +			WARN_ON(kvm_gmem_merge_folio_in_filemap(inode, f));
> >> +		else
> >> +			WARN_ON(kvm_gmem_split_folio_in_filemap(inode, f));
> > Is it safe to just leave WARN_ON()s in the rollback case?
> >
> 
> Same as above. I don't think we have much of a choice.
> 
> > Besides, are the kvm_gmem_merge_folio_in_filemap() and
> > kvm_gmem_split_folio_in_filemap() here duplicated with the
> > kvm_gmem_split_folio_in_filemap() and kvm_gmem_try_merge_folio_in_filemap() in
> > the following "r = kvm_gmem_convert_execute_work(inode, work, !shared)"?
> >
> 
> This handles the case where some pages in the range [start, start +
> nr_pages) were split and the failure was halfway through. I could call
> kvm_gmem_convert_execute_work() with !shared but that would go over all
> the folios again from the start.
> 
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +#else
> >> +
> >> +static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
> >> +						      struct folio *folio)
> >> +{
> >> +	return 0;
> >> +}
> >> +
> >> +static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
> >> +						pgoff_t start, size_t nr_pages,
> >> +						bool is_split_operation)
> >> +{
> >> +	return 0;
> >> +}
> >> +
> >> +#endif
> >> +
> >  

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (53 preceding siblings ...)
  2025-05-16 22:43 ` Ackerley Tng
@ 2025-06-19  8:13 ` Yan Zhao
  2025-06-19  8:59   ` Xiaoyao Li
  2025-06-26 23:19 ` Ackerley Tng
  55 siblings, 1 reply; 231+ messages in thread
From: Yan Zhao @ 2025-06-19  8:13 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yilun.xu, yuzenghui,
	zhiquan1.li

On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
> Hello,
> 
> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> upstream calls to provide 1G page support for guest_memfd by taking
> pages from HugeTLB.
> 
> This patchset is based on Linux v6.15-rc6, and requires the mmap support
> for guest_memfd patchset (Thanks Fuad!) [1].
> 
> For ease of testing, this series is also available, stitched together,
> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
 
Just to record a found issue -- not one that must be fixed.

In TDX, the initial memory region is added as private memory during TD's build
time, with its initial content copied from source pages in shared memory.
The copy operation requires simultaneous access to both shared source memory
and private target memory.

Therefore, userspace cannot store the initial content in shared memory at the
mmap-ed VA of a guest_memfd that performs in-place conversion between shared and
private memory. This is because the guest_memfd will first unmap a PFN in shared
page tables and then check for any extra refcount held for the shared PFN before
converting it to private.

Currently, we have tested the initial memory region using the in-place
conversion version of guest_memfd as the backend, by modifying QEMU to add an
extra anonymous backend to hold the source initial content in shared memory.
The extra anonymous backend is freed after finishing adding the initial memory
region.

This issue is benign for TDX, as the initial memory region can also utilize the
traditional guest_memfd, which only allows 4KB mappings. This is acceptable for
now, as the initial memory region typically involves a small amount of memory,
and we may not enable huge pages for ranges covered by the initial memory region
in the near future.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-06-19  8:13 ` Yan Zhao
@ 2025-06-19  8:59   ` Xiaoyao Li
  2025-06-19  9:18     ` Xiaoyao Li
  2025-06-29 18:28     ` Vishal Annapurve
  0 siblings, 2 replies; 231+ messages in thread
From: Xiaoyao Li @ 2025-06-19  8:59 UTC (permalink / raw)
  To: Yan Zhao, Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, yilun.xu, yuzenghui, zhiquan1.li

On 6/19/2025 4:13 PM, Yan Zhao wrote:
> On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
>> Hello,
>>
>> This patchset builds upon discussion at LPC 2024 and many guest_memfd
>> upstream calls to provide 1G page support for guest_memfd by taking
>> pages from HugeTLB.
>>
>> This patchset is based on Linux v6.15-rc6, and requires the mmap support
>> for guest_memfd patchset (Thanks Fuad!) [1].
>>
>> For ease of testing, this series is also available, stitched together,
>> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
>   
> Just to record a found issue -- not one that must be fixed.
> 
> In TDX, the initial memory region is added as private memory during TD's build
> time, with its initial content copied from source pages in shared memory.
> The copy operation requires simultaneous access to both shared source memory
> and private target memory.
> 
> Therefore, userspace cannot store the initial content in shared memory at the
> mmap-ed VA of a guest_memfd that performs in-place conversion between shared and
> private memory. This is because the guest_memfd will first unmap a PFN in shared
> page tables and then check for any extra refcount held for the shared PFN before
> converting it to private.

I have an idea.

If I understand correctly, KVM_GMEM_CONVERT_PRIVATE with in-place
conversion unmaps the PFN in shared page tables while keeping the content
of the page unchanged, right?

So KVM_GMEM_CONVERT_PRIVATE can actually be used to initialize private
memory for the non-CoCo case: userspace first mmap()s it, ensures it's
shared, and writes the initial content to it; after that, userspace
converts it to private with KVM_GMEM_CONVERT_PRIVATE.

For the CoCo case, like TDX, it can hook into KVM_GMEM_CONVERT_PRIVATE if it
wants the private memory to be initialized with initial content, and just do
an in-place TDH.PAGE.ADD in the hook.

> Currently, we tested the initial memory region using the in-place conversion
> version of guest_memfd as backend by modifying QEMU to add an extra anonymous
> backend to hold the source initial content in shared memory. The extra anonymous
> backend is freed after finishing ading the initial memory region.
> 
> This issue is benign for TDX, as the initial memory region can also utilize the
> traditional guest_memfd, which only allows 4KB mappings. This is acceptable for
> now, as the initial memory region typically involves a small amount of memory,
> and we may not enable huge pages for ranges covered by the initial memory region
> in the near future.


^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-06-19  8:59   ` Xiaoyao Li
@ 2025-06-19  9:18     ` Xiaoyao Li
  2025-06-19  9:28       ` Yan Zhao
  2025-06-29 18:28     ` Vishal Annapurve
  1 sibling, 1 reply; 231+ messages in thread
From: Xiaoyao Li @ 2025-06-19  9:18 UTC (permalink / raw)
  To: Yan Zhao, Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, yilun.xu, yuzenghui, zhiquan1.li

On 6/19/2025 4:59 PM, Xiaoyao Li wrote:
> On 6/19/2025 4:13 PM, Yan Zhao wrote:
>> On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
>>> Hello,
>>>
>>> This patchset builds upon discussion at LPC 2024 and many guest_memfd
>>> upstream calls to provide 1G page support for guest_memfd by taking
>>> pages from HugeTLB.
>>>
>>> This patchset is based on Linux v6.15-rc6, and requires the mmap support
>>> for guest_memfd patchset (Thanks Fuad!) [1].
>>>
>>> For ease of testing, this series is also available, stitched together,
>>> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page- 
>>> support-rfc-v2
>> Just to record a found issue -- not one that must be fixed.
>>
>> In TDX, the initial memory region is added as private memory during 
>> TD's build
>> time, with its initial content copied from source pages in shared memory.
>> The copy operation requires simultaneous access to both shared source 
>> memory
>> and private target memory.
>>
>> Therefore, userspace cannot store the initial content in shared memory 
>> at the
>> mmap-ed VA of a guest_memfd that performs in-place conversion between 
>> shared and
>> private memory. This is because the guest_memfd will first unmap a PFN 
>> in shared
>> page tables and then check for any extra refcount held for the shared 
>> PFN before
>> converting it to private.
> 
> I have an idea.
> 
> If I understand correctly, the KVM_GMEM_CONVERT_PRIVATE of in-place 
> conversion unmap the PFN in shared page tables while keeping the content 
> of the page unchanged, right?
> 
> So KVM_GMEM_CONVERT_PRIVATE can be used to initialize the private memory 
> actually for non-CoCo case actually, that userspace first mmap() it and 
> ensure it's shared and writes the initial content to it, after it 
> userspace convert it to private with KVM_GMEM_CONVERT_PRIVATE.
> 
> For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it 
> wants the private memory to be initialized with initial content, and 
> just do in-place TDH.PAGE.ADD in the hook.

And maybe add a new flag to KVM_GMEM_CONVERT_PRIVATE for userspace to
explicitly request that the page range be converted to private with its
content retained, so that TDX can identify which cases need the in-place
TDH.PAGE.ADD.
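
(Purely to make that proposal concrete, a hypothetical userspace sketch; the
struct layout, field names and flag name below are invented and not part of
the posted series.)

	struct kvm_gmem_convert arg = {
		.offset = region_offset,
		.size   = region_size,
		.flags  = KVM_GMEM_CONVERT_FLAG_RETAIN_CONTENT,	/* invented */
	};

	/* write the initial contents through the shared mmap first, then: */
	ioctl(gmem_fd, KVM_GMEM_CONVERT_PRIVATE, &arg);
	/* TDX's hook would see the flag and do the in-place TDH.PAGE.ADD */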


^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-06-19  9:18     ` Xiaoyao Li
@ 2025-06-19  9:28       ` Yan Zhao
  2025-06-19  9:45         ` Xiaoyao Li
  0 siblings, 1 reply; 231+ messages in thread
From: Yan Zhao @ 2025-06-19  9:28 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, yilun.xu, yuzenghui, zhiquan1.li

On Thu, Jun 19, 2025 at 05:18:44PM +0800, Xiaoyao Li wrote:
> On 6/19/2025 4:59 PM, Xiaoyao Li wrote:
> > On 6/19/2025 4:13 PM, Yan Zhao wrote:
> > > On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
> > > > Hello,
> > > > 
> > > > This patchset builds upon discussion at LPC 2024 and many guest_memfd
> > > > upstream calls to provide 1G page support for guest_memfd by taking
> > > > pages from HugeTLB.
> > > > 
> > > > This patchset is based on Linux v6.15-rc6, and requires the mmap support
> > > > for guest_memfd patchset (Thanks Fuad!) [1].
> > > > 
> > > > For ease of testing, this series is also available, stitched together,
> > > > at
> > > > https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-
> > > > support-rfc-v2
> > > Just to record a found issue -- not one that must be fixed.
> > > 
> > > In TDX, the initial memory region is added as private memory during
> > > TD's build
> > > time, with its initial content copied from source pages in shared memory.
> > > The copy operation requires simultaneous access to both shared
> > > source memory
> > > and private target memory.
> > > 
> > > Therefore, userspace cannot store the initial content in shared
> > > memory at the
> > > mmap-ed VA of a guest_memfd that performs in-place conversion
> > > between shared and
> > > private memory. This is because the guest_memfd will first unmap a
> > > PFN in shared
> > > page tables and then check for any extra refcount held for the
> > > shared PFN before
> > > converting it to private.
> > 
> > I have an idea.
> > 
> > If I understand correctly, the KVM_GMEM_CONVERT_PRIVATE of in-place
> > conversion unmap the PFN in shared page tables while keeping the content
> > of the page unchanged, right?
However, whenever there's a GUP in TDX to get the source page, there will be an
extra page refcount.

> > So KVM_GMEM_CONVERT_PRIVATE can be used to initialize the private memory
> > actually for non-CoCo case actually, that userspace first mmap() it and
> > ensure it's shared and writes the initial content to it, after it
> > userspace convert it to private with KVM_GMEM_CONVERT_PRIVATE.
The conversion request here will therefore be declined.


> > For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it
> > wants the private memory to be initialized with initial content, and
> > just do in-place TDH.PAGE.ADD in the hook.
> 
> And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
> explicitly request that the page range is converted to private and the
> content needs to be retained. So that TDX can identify which case needs to
> call in-place TDH.PAGE.ADD.
> 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-06-19  9:28       ` Yan Zhao
@ 2025-06-19  9:45         ` Xiaoyao Li
  2025-06-19  9:49           ` Xiaoyao Li
  0 siblings, 1 reply; 231+ messages in thread
From: Xiaoyao Li @ 2025-06-19  9:45 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, yilun.xu, yuzenghui, zhiquan1.li

On 6/19/2025 5:28 PM, Yan Zhao wrote:
> On Thu, Jun 19, 2025 at 05:18:44PM +0800, Xiaoyao Li wrote:
>> On 6/19/2025 4:59 PM, Xiaoyao Li wrote:
>>> On 6/19/2025 4:13 PM, Yan Zhao wrote:
>>>> On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
>>>>> Hello,
>>>>>
>>>>> This patchset builds upon discussion at LPC 2024 and many guest_memfd
>>>>> upstream calls to provide 1G page support for guest_memfd by taking
>>>>> pages from HugeTLB.
>>>>>
>>>>> This patchset is based on Linux v6.15-rc6, and requires the mmap support
>>>>> for guest_memfd patchset (Thanks Fuad!) [1].
>>>>>
>>>>> For ease of testing, this series is also available, stitched together,
>>>>> at
>>>>> https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-
>>>>> support-rfc-v2
>>>> Just to record a found issue -- not one that must be fixed.
>>>>
>>>> In TDX, the initial memory region is added as private memory during
>>>> TD's build
>>>> time, with its initial content copied from source pages in shared memory.
>>>> The copy operation requires simultaneous access to both shared
>>>> source memory
>>>> and private target memory.
>>>>
>>>> Therefore, userspace cannot store the initial content in shared
>>>> memory at the
>>>> mmap-ed VA of a guest_memfd that performs in-place conversion
>>>> between shared and
>>>> private memory. This is because the guest_memfd will first unmap a
>>>> PFN in shared
>>>> page tables and then check for any extra refcount held for the
>>>> shared PFN before
>>>> converting it to private.
>>>
>>> I have an idea.
>>>
>>> If I understand correctly, the KVM_GMEM_CONVERT_PRIVATE of in-place
>>> conversion unmap the PFN in shared page tables while keeping the content
>>> of the page unchanged, right?
> However, whenever there's a GUP in TDX to get the source page, there will be an
> extra page refcount.

The GUP in TDX happens after the gmem converts the page to private.

From TDX's point of view, the physical page has already been converted 
to private and it contains the initial content. But the content is not 
usable by TDX until TDX performs the in-place PAGE.ADD.

>>> So KVM_GMEM_CONVERT_PRIVATE can be used to initialize the private memory
>>> actually for non-CoCo case actually, that userspace first mmap() it and
>>> ensure it's shared and writes the initial content to it, after it
>>> userspace convert it to private with KVM_GMEM_CONVERT_PRIVATE.
> The conversion request here will be declined therefore.
> 
> 
>>> For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it
>>> wants the private memory to be initialized with initial content, and
>>> just do in-place TDH.PAGE.ADD in the hook.
>>
>> And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
>> explicitly request that the page range is converted to private and the
>> content needs to be retained. So that TDX can identify which case needs to
>> call in-place TDH.PAGE.ADD.
>>


^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-06-19  9:45         ` Xiaoyao Li
@ 2025-06-19  9:49           ` Xiaoyao Li
  0 siblings, 0 replies; 231+ messages in thread
From: Xiaoyao Li @ 2025-06-19  9:49 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, yilun.xu, yuzenghui, zhiquan1.li

On 6/19/2025 5:45 PM, Xiaoyao Li wrote:
> On 6/19/2025 5:28 PM, Yan Zhao wrote:
>> On Thu, Jun 19, 2025 at 05:18:44PM +0800, Xiaoyao Li wrote:
>>> On 6/19/2025 4:59 PM, Xiaoyao Li wrote:
>>>> On 6/19/2025 4:13 PM, Yan Zhao wrote:
>>>>> On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
>>>>>> Hello,
>>>>>>
>>>>>> This patchset builds upon discussion at LPC 2024 and many guest_memfd
>>>>>> upstream calls to provide 1G page support for guest_memfd by taking
>>>>>> pages from HugeTLB.
>>>>>>
>>>>>> This patchset is based on Linux v6.15-rc6, and requires the mmap 
>>>>>> support
>>>>>> for guest_memfd patchset (Thanks Fuad!) [1].
>>>>>>
>>>>>> For ease of testing, this series is also available, stitched 
>>>>>> together,
>>>>>> at
>>>>>> https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-
>>>>>> support-rfc-v2
>>>>> Just to record a found issue -- not one that must be fixed.
>>>>>
>>>>> In TDX, the initial memory region is added as private memory during
>>>>> TD's build
>>>>> time, with its initial content copied from source pages in shared 
>>>>> memory.
>>>>> The copy operation requires simultaneous access to both shared
>>>>> source memory
>>>>> and private target memory.
>>>>>
>>>>> Therefore, userspace cannot store the initial content in shared
>>>>> memory at the
>>>>> mmap-ed VA of a guest_memfd that performs in-place conversion
>>>>> between shared and
>>>>> private memory. This is because the guest_memfd will first unmap a
>>>>> PFN in shared
>>>>> page tables and then check for any extra refcount held for the
>>>>> shared PFN before
>>>>> converting it to private.
>>>>
>>>> I have an idea.
>>>>
>>>> If I understand correctly, the KVM_GMEM_CONVERT_PRIVATE of in-place
>>>> conversion unmap the PFN in shared page tables while keeping the 
>>>> content
>>>> of the page unchanged, right?
>> However, whenever there's a GUP in TDX to get the source page, there 
>> will be an
>> extra page refcount.
> 
> The GUP in TDX happens after the gmem converts the page to private.

Maybe it's not GUP, since the page has been unmapped from userspace? 
(Sorry, I'm not familiar with the terminology.)

> In the view of TDX, the physical page is converted to private already 
> and it contains the initial content. But the content is not usable for 
> TDX until TDX calls in-place PAGE.ADD
> 
>>>> So KVM_GMEM_CONVERT_PRIVATE can be used to initialize the private 
>>>> memory
>>>> actually for non-CoCo case actually, that userspace first mmap() it and
>>>> ensure it's shared and writes the initial content to it, after it
>>>> userspace convert it to private with KVM_GMEM_CONVERT_PRIVATE.
>> The conversion request here will be declined therefore.
>>
>>
>>>> For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it
>>>> wants the private memory to be initialized with initial content, and
>>>> just do in-place TDH.PAGE.ADD in the hook.
>>>
>>> And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
>>> explicitly request that the page range is converted to private and the
>>> content needs to be retained. So that TDX can identify which case 
>>> needs to
>>> call in-place TDH.PAGE.ADD.
>>>
> 
> 


^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-05-20 14:11         ` Vishal Annapurve
  2025-05-20 14:33           ` Fuad Tabba
@ 2025-06-24  8:23           ` Alexey Kardashevskiy
  2025-06-24 13:08             ` Jason Gunthorpe
  1 sibling, 1 reply; 231+ messages in thread
From: Alexey Kardashevskiy @ 2025-06-24  8:23 UTC (permalink / raw)
  To: Vishal Annapurve, Fuad Tabba, Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, ajones, akpm,
	amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko, jgg,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, thomas.lendacky,
	usama.arif, vbabka, viro, vkuznets, wei.w.wang, will, willy,
	xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui, zhiquan1.li

On 21/5/25 00:11, Vishal Annapurve wrote:
> On Tue, May 20, 2025 at 6:44 AM Fuad Tabba <tabba@google.com> wrote:
>>
>> Hi Vishal,
>>
>> On Tue, 20 May 2025 at 14:02, Vishal Annapurve <vannapurve@google.com> wrote:
>>>
>>> On Tue, May 20, 2025 at 2:23 AM Fuad Tabba <tabba@google.com> wrote:
>>>>
>>>> Hi Ackerley,
>>>>
>>>> On Thu, 15 May 2025 at 00:43, Ackerley Tng <ackerleytng@google.com> wrote:
>>>>>
>>>>> The two new guest_memfd ioctls KVM_GMEM_CONVERT_SHARED and
>>>>> KVM_GMEM_CONVERT_PRIVATE convert the requested memory ranges to shared
>>>>> and private respectively.
>>>>
>>>> I have a high level question about this particular patch and this
>>>> approach for conversion: why do we need IOCTLs to manage conversion
>>>> between private and shared?
>>>>
>>>> In the presentations I gave at LPC [1, 2], and in my latest patch
>>>> series that performs in-place conversion [3] and the associated (by
>>>> now outdated) state diagram [4], I didn't see the need to have a
>>>> userspace-facing interface to manage that. KVM has all the information
>>>> it needs to handle conversions, which are triggered by the guest. To
>>>> me this seems like it adds additional complexity, as well as a user
>>>> facing interface that we would need to maintain.
>>>>
>>>> There are various ways we could handle conversion without explicit
>>>> interference from userspace. What I had in mind is the following (as
>>>> an example, details can vary according to VM type). I will use use the
>>>> case of conversion from shared to private because that is the more
>>>> complicated (interesting) case:
>>>>
>>>> - Guest issues a hypercall to request that a shared folio become private.
>>>>
>>>> - The hypervisor receives the call, and passes it to KVM.
>>>>
>>>> - KVM unmaps the folio from the guest stage-2 (EPT I think in x86
>>>> parlance), and unmaps it from the host. The host however, could still
>>>> have references (e.g., GUP).
>>>>
>>>> - KVM exits to the host (hypervisor call exit), with the information
>>>> that the folio has been unshared from it.
>>>>
>>>> - A well behaving host would now get rid of all of its references
>>>> (e.g., release GUPs), perform a VCPU run, and the guest continues
>>>> running as normal. I expect this to be the common case.
>>>>
>>>> But to handle the more interesting situation, let's say that the host
>>>> doesn't do it immediately, and for some reason it holds on to some
>>>> references to that folio.
>>>>
>>>> - Even if that's the case, the guest can still run *. If the guest
>>>> tries to access the folio, KVM detects that access when it tries to
>>>> fault it into the guest, sees that the host still has references to
>>>> that folio, and exits back to the host with a memory fault exit. At
>>>> this point, the VCPU that has tried to fault in that particular folio
>>>> cannot continue running as long as it cannot fault in that folio.
>>>
>>> Are you talking about the following scheme?
>>> 1) guest_memfd checks shareability on each get pfn and if there is a
>>> mismatch exit to the host.
>>
>> I think we are not really on the same page here (no pun intended :) ).
>> I'll try to answer your questions anyway...
>>
>> Which get_pfn? Are you referring to get_pfn when faulting the page
>> into the guest or into the host?
> 
> I am referring to guest fault handling in KVM.
> 
>>
>>> 2) host user space has to guess whether it's a pending refcount or
>>> whether it's an actual mismatch.
>>
>> No need to guess. VCPU run will let it know exactly why it's exiting.
>>
>>> 3) guest_memfd will maintain a third state
>>> "pending_private_conversion" or equivalent which will transition to
>>> private upon the last refcount drop of each page.
>>>
>>> If conversion is triggered by userspace (in case of pKVM, it will be
>>> triggered from within the KVM (?)):
>>
>> Why would conversion be triggered by userspace? As far as I know, it's
>> the guest that triggers the conversion.
>>
>>> * Conversion will just fail if there are extra refcounts and userspace
>>> can try to get rid of extra refcounts on the range while it has enough
>>> context without hitting any ambiguity with memory fault exit.
>>> * guest_memfd will not have to deal with this extra state from 3 above
>>> and overall guest_memfd conversion handling becomes relatively
>>> simpler.
>>
>> That's not really related. The extra state isn't necessary any more
>> once we agreed in the previous discussion that we will retry instead.
> 
> Who is *we* here? Which entity will retry conversion?
> 
>>
>>> Note that for x86 CoCo cases, memory conversion is already triggered
>>> by userspace using KVM ioctl, this series is proposing to use
>>> guest_memfd ioctl to do the same.
>>
>> The reason why for x86 CoCo cases conversion is already triggered by
>> userspace using KVM ioctl is that it has to, since shared memory and
>> private memory are two separate pages, and userspace needs to manage
>> that. Sharing memory in place removes the need for that.
> 
> Userspace still needs to clean up memory usage before conversion is
> successful. e.g. remove IOMMU mappings for shared to private
> conversion. I would think that memory conversion should not succeed
> before all existing users let go of the guest_memfd pages for the
> range being converted.


Ah, about that. Actually, IOMMU mappings can remain the same in a case like my TSM+VFIO RFC based on Fuad's older patches, here in particular:

https://lore.kernel.org/r/20250218111017.491719-13-aik@amd.com

which works nicely - map it once and forget about it.

Now, I am rebasing my RFC on top of this patchset and it fails in kvm_gmem_has_safe_refcount(), as the IOMMU holds references to all these folios in my RFC.

So what is the expected sequence here? Userspace unmaps a DMA page and maps it back right away, all from userspace? The end result will be exactly the same, which seems useless. And the IOMMU TLB is going to be flushed on a page conversion anyway (the RMPUPDATE instruction does that). All this is about AMD's x86, though.

For now (and for fun^wexperiment) I disabled kvm_gmem_has_safe_refcount() (04/51 adds it) and it seems to have no effect until the memfd is closed - folios_put_refs() crashes in list_del(&folio->lru). I wonder now what direction to go from here.

My TSM+VFIO RFC uses the hardware's ability to DMA to/from a CoCo VM (== an AMD SEV-SNP VM); both private and shared DMA at the same time are going to be allowed. Thanks,



> In x86 CoCo usecases, userspace can also decide to not allow
> conversion for scenarios where ranges are still under active use by
> the host and guest is erroneously trying to take away memory. Both
> SNP/TDX spec allow failure of conversion due to in use memory.
> 
>>
>> This series isn't using the same ioctl, it's introducing new ones to
>> perform a task that as far as I can tell so far, KVM can handle by
>> itself.
> 
> I would like to understand this better. How will KVM handle the
> conversion process for guest_memfd pages? Can you help walk an example
> sequence for shared to private conversion specifically around
> guest_memfd offset states?
> 
>>
>>>   - Allows not having to keep track of separate shared/private range
>>> information in KVM.
>>
>> This patch series is already tracking shared/private range information in KVM.
>>
>>>   - Simpler handling of the conversion process done per guest_memfd
>>> rather than for full range.
>>>       - Userspace can handle the rollback as needed, simplifying error
>>> handling in guest_memfd.
>>>   - guest_memfd is single source of truth and notifies the users of
>>> shareability change.
>>>       - e.g. IOMMU, userspace, KVM MMU all can be registered for
>>> getting notifications from guest_memfd directly and will get notified
>>> for invalidation upon shareability attribute updates.
>>
>> All of these can still be done without introducing a new ioctl.
>>
>> Cheers,
>> /fuad

-- 
Alexey


^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-06-24  8:23           ` Alexey Kardashevskiy
@ 2025-06-24 13:08             ` Jason Gunthorpe
  2025-06-24 14:10               ` Vishal Annapurve
  0 siblings, 1 reply; 231+ messages in thread
From: Jason Gunthorpe @ 2025-06-24 13:08 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Vishal Annapurve, Fuad Tabba, Ackerley Tng, kvm, linux-mm,
	linux-kernel, x86, linux-fsdevel, ajones, akpm, amoorthy,
	anthony.yznaga, anup, aou, bfoster, binbin.wu, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, thomas.lendacky,
	usama.arif, vbabka, viro, vkuznets, wei.w.wang, will, willy,
	xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui, zhiquan1.li

On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:

> Now, I am rebasing my RFC on top of this patchset and it fails in
> kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
> folios in my RFC.
> 
> So what is the expected sequence here? The userspace unmaps a DMA
> page and maps it back right away, all from the userspace? The end
> result will be the exactly same which seems useless. And IOMMU TLB
> is going to be flushed on a page conversion anyway (the RMPUPDATE
> instruction does that). All this is about AMD's x86 though.

The iommu should not be using the VMA to manage the mapping. It should
be directly linked to the guestmemfd in some way that does not disturb
its operations. I imagine there would be some kind of invalidation
callback directly to the iommu.

Presumably that invalidation call back can include a reason for the
invalidation (addr change, shared/private conversion, etc)

I'm not sure how we will figure out which case is which but guestmemfd
should allow the iommu to plug in either invalidation scheme..

Probably invalidation should be a global to the FD thing, I imagine
that once invalidation is established the iommu will not be
incrementing page refcounts.
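
To make the shape of this concrete, here is a minimal sketch of what a
per-FD registration with a reason code could look like -- every name
below is hypothetical, nothing like this exists today:

/* Hypothetical sketch only -- none of these symbols exist. */
enum gmem_invalidate_reason {
	GMEM_INVALIDATE_TRUNCATE,
	GMEM_INVALIDATE_CONVERT_TO_PRIVATE,
	GMEM_INVALIDATE_CONVERT_TO_SHARED,
};

struct gmem_invalidate_ops {
	/* Called for [start, end) offsets within the guest_memfd. */
	void (*invalidate)(void *priv, pgoff_t start, pgoff_t end,
			   enum gmem_invalidate_reason reason);
};

/*
 * Registration is global to the FD; once this is in place the IOMMU
 * side would stop taking page refcounts.
 */
int gmem_register_invalidate(struct file *gmem_file,
			     const struct gmem_invalidate_ops *ops,
			     void *priv);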

Jason

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-06-24 13:08             ` Jason Gunthorpe
@ 2025-06-24 14:10               ` Vishal Annapurve
  2025-06-27  4:49                 ` Alexey Kardashevskiy
  2025-07-02  8:35                 ` Yan Zhao
  0 siblings, 2 replies; 231+ messages in thread
From: Vishal Annapurve @ 2025-06-24 14:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alexey Kardashevskiy, Fuad Tabba, Ackerley Tng, kvm, linux-mm,
	linux-kernel, x86, linux-fsdevel, ajones, akpm, amoorthy,
	anthony.yznaga, anup, aou, bfoster, binbin.wu, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, thomas.lendacky,
	usama.arif, vbabka, viro, vkuznets, wei.w.wang, will, willy,
	xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui, zhiquan1.li

On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
>
> > Now, I am rebasing my RFC on top of this patchset and it fails in
> > kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
> > folios in my RFC.
> >
> > So what is the expected sequence here? The userspace unmaps a DMA
> > page and maps it back right away, all from the userspace? The end
> > result will be the exactly same which seems useless. And IOMMU TLB

 As Jason described, ideally the IOMMU, just like KVM, should just:
1) Directly rely on guest_memfd for pinning -> no page refcounts taken
by IOMMU stack
2) Directly query pfns from guest_memfd for both shared/private ranges
3) Implement an invalidation callback that guest_memfd can invoke on
conversions.

Current flow:
Private to Shared conversion via kvm_gmem_convert_range() -
    1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
on each bound memslot overlapping with the range
         -> KVM has the concept of invalidation_begin() and end(),
which effectively ensures that between these function calls, no new
EPT/NPT entries can be added for the range.
     2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
actually unmaps the KVM SEPT/NPT entries.
     3) guest_memfd invokes kvm_gmem_execute_work() which updates the
shareability and then splits the folios if needed

Shared to private conversion via kvm_gmem_convert_range() -
    1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
on each bound memslot overlapping with the range
     2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
actually unmaps the host mappings, which in turn unmaps the KVM non-secure
EPT/NPT entries.
     3) guest_memfd invokes kvm_gmem_execute_work() which updates the
shareability and then merges the folios if needed.

============================

For IOMMU, could something like below work?

* A new UAPI to bind IOMMU FDs with guest_memfd ranges
* VFIO_DMA_MAP/UNMAP operations modified to directly fetch pfns from
guest_memfd ranges using kvm_gmem_get_pfn()
    -> KVM invokes kvm_gmem_is_private() to check the range's
shareability; the IOMMU could use the same, or we could add an API in
gmem that takes in an access type and checks the shareability before
returning the pfn.
* IOMMU stack exposes an invalidation callback that can be invoked by
guest_memfd.

Private to Shared conversion via kvm_gmem_convert_range() -
    1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
on each bound memslot overlapping with the range
     2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
actually unmaps the KVM SEPT/NPT entries.
           -> guest_memfd invokes IOMMU invalidation callback to zap
the secure IOMMU entries.
     3) guest_memfd invokes kvm_gmem_execute_work() which updates the
shareability and then splits the folios if needed
     4) Userspace invokes IOMMU map operation to map the ranges in
non-secure IOMMU.

Shared to private conversion via kvm_gmem_convert_range() -
    1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
on each bound memslot overlapping with the range
     2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
actually unmaps the host mappings, which in turn unmaps the KVM non-secure
EPT/NPT entries.
         -> guest_memfd invokes IOMMU invalidation callback to zap the
non-secure IOMMU entries.
     3) guest_memfd invokes kvm_gmem_execute_work() which updates the
shareability and then merges the folios if needed.
     4) Userspace invokes IOMMU map operation to map the ranges in secure IOMMU.

There should be a way to block external IOMMU pagetable updates while
guest_memfd is performing a conversion, e.g. something like
kvm_invalidate_begin()/end().
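
To illustrate the bracketing, a very rough sketch (all helper names
here are made up for illustration, this is not code from this series):

/* Hypothetical: fence KVM/IOMMU updates around a conversion. */
static int gmem_do_convert(struct inode *inode, pgoff_t start,
			   pgoff_t end, bool to_private)
{
	int ret;

	/* Block new KVM and IOMMU mappings for the range. */
	gmem_invalidate_begin(inode, start, end);

	/*
	 * Zap existing mappings: secure stage-2 for private->shared,
	 * host + non-secure stage-2 for shared->private, and call the
	 * IOMMU invalidation callback mentioned above. Bail out if
	 * unexpected refcounts remain.
	 */
	ret = gmem_unmap_and_check_refcounts(inode, start, end, to_private);
	if (!ret)
		ret = gmem_update_shareability(inode, start, end, to_private);

	gmem_invalidate_end(inode, start, end);
	return ret;
}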

> > is going to be flushed on a page conversion anyway (the RMPUPDATE
> > instruction does that). All this is about AMD's x86 though.
>
> The iommu should not be using the VMA to manage the mapping. It should

+1.

> be directly linked to the guestmemfd in some way that does not disturb
> its operations. I imagine there would be some kind of invalidation
> callback directly to the iommu.
>
> Presumably that invalidation call back can include a reason for the
> invalidation (addr change, shared/private conversion, etc)
>
> I'm not sure how we will figure out which case is which but guestmemfd
> should allow the iommu to plug in either invalidation scheme..
>
> Probably invalidation should be a global to the FD thing, I imagine
> that once invalidation is established the iommu will not be
> incrementing page refcounts.

+1.

>
> Jason

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
                   ` (54 preceding siblings ...)
  2025-06-19  8:13 ` Yan Zhao
@ 2025-06-26 23:19 ` Ackerley Tng
  55 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-06-26 23:19 UTC (permalink / raw)
  To: kvm, linux-mm, linux-kernel, x86, linux-fsdevel
  Cc: aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li

Ackerley Tng <ackerleytng@google.com> writes:

> Hello,
>
> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> upstream calls to provide 1G page support for guest_memfd by taking
> pages from HugeTLB.
>
> [...]

At the guest_memfd upstream call today (2025-06-26), we talked about
when to merge folios with respect to conversions.

Just want to call out that in this RFCv2, we managed to get conversions
working with merges happening as soon as possible.

"As soon as possible" means merges happen as long as shareability is all
private (or all meaningless) within an aligned hugepage range. We try to
merge after every conversion request and on truncation. On truncation,
shareability becomes meaningless.

On explicit truncation (e.g. fallocate(PUNCH_HOLE)), truncation can fail
if there are unexpected refcounts (because we can't merge with
unexpected refcounts). Explicit truncation will succeed only if the
refcounts are as expected, and the merge is performed before the folio
is finally removed from the filemap.

On truncation caused by file close or inode release, guest_memfd may not
hold the last refcount on the folio. Only in this case do we defer
merging to the folio_put() callback, and because that callback can be
called from atomic context, the merge is further deferred to a kernel
worker.
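
(The deferral itself is just the usual workqueue pattern; a simplified
sketch, not the exact code in this series, with gmem_merge_folio()
being a made-up helper name:)

struct gmem_deferred_merge {
	struct work_struct work;
	struct folio *folio;
};

static void gmem_deferred_merge_fn(struct work_struct *work)
{
	struct gmem_deferred_merge *dm =
		container_of(work, struct gmem_deferred_merge, work);

	gmem_merge_folio(dm->folio);	/* may sleep; hypothetical helper */
	kfree(dm);
}

/* Called from the folio_put() callback, possibly in atomic context. */
static void gmem_defer_merge(struct folio *folio)
{
	struct gmem_deferred_merge *dm = kmalloc(sizeof(*dm), GFP_ATOMIC);

	if (!dm)
		return;		/* error handling elided in this sketch */
	dm->folio = folio;
	INIT_WORK(&dm->work, gmem_deferred_merge_fn);
	schedule_work(&dm->work);
}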

Deferment of merging is already minimized so that most of the
restructuring is synchronous with some userspace-initiated action
(conversion or explicit truncation). The only deferred merge is when the
file is closed, and in that case there's no way to reject/fail this file
close.

(There are possible optimizations here - Yan suggested [1] checking if
the folio_put() was called from interrupt context - I have not tried
implementing that yet)


I did propose an explicit guest_memfd merge ioctl, but since RFCv2
works, I was thinking to have the merge ioctl be a separate
optimization/project/patch series if it turns out that merging
as-soon-as-possible is an inefficient strategy, or if some VM use cases
prefer to have an explicit merge ioctl.


During the call, Michael also brought up that SNP adds some constraints
with respect to guest accepting pages/levels.

Could you please expand on that? Suppose for an SNP guest,

1. Guest accepted a page at 2M level
2. Guest converts a 4K sub page to shared
3. guest_memfd requests unmapping of the guest-requested 4K range
   (the rest of the 2M remains mapped into stage 2 page tables)
4. guest_memfd splits the huge page into 4K pages (the converted 4K page
   is set to SHAREABILITY_ALL, the rest of the 2M is still SHAREABILITY_GUEST)

Can the SNP guest continue to use the rest of the 2M page or must it
re-accept all the pages at 4K?

And for the reverse:

1. Guest accepted a 2M range at 4K granularity
2. guest_memfd merges the full 2M range to a single 2M page

Must the SNP guest re-accept at 2M for the guest to continue
functioning, or will the SNP guest continue to work (just with poorer
performance than if the memory was accepted at 2M)?

[1] https://lore.kernel.org/all/aDfT35EsYP%2FByf7Z@yzhao56-desk.sh.intel.com/

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-06-24 14:10               ` Vishal Annapurve
@ 2025-06-27  4:49                 ` Alexey Kardashevskiy
  2025-06-27 15:17                   ` Vishal Annapurve
  2025-07-02  8:35                 ` Yan Zhao
  1 sibling, 1 reply; 231+ messages in thread
From: Alexey Kardashevskiy @ 2025-06-27  4:49 UTC (permalink / raw)
  To: Vishal Annapurve, Jason Gunthorpe
  Cc: Fuad Tabba, Ackerley Tng, kvm, linux-mm, linux-kernel, x86,
	linux-fsdevel, ajones, akpm, amoorthy, anthony.yznaga, anup, aou,
	bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li



On 25/6/25 00:10, Vishal Annapurve wrote:
> On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>
>> On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
>>
>>> Now, I am rebasing my RFC on top of this patchset and it fails in
>>> kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
>>> folios in my RFC.
>>>
>>> So what is the expected sequence here? The userspace unmaps a DMA
>>> page and maps it back right away, all from the userspace? The end
>>> result will be the exactly same which seems useless. And IOMMU TLB
> 
>   As Jason described, ideally IOMMU just like KVM, should just:
> 1) Directly rely on guest_memfd for pinning -> no page refcounts taken
> by IOMMU stack
> 2) Directly query pfns from guest_memfd for both shared/private ranges
> 3) Implement an invalidation callback that guest_memfd can invoke on
> conversions.
> 
> Current flow:
> Private to Shared conversion via kvm_gmem_convert_range() -
>      1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> on each bound memslot overlapping with the range
>           -> KVM has the concept of invalidation_begin() and end(),
> which effectively ensures that between these function calls, no new
> EPT/NPT entries can be added for the range.
>       2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> actually unmaps the KVM SEPT/NPT entries.
>       3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> shareability and then splits the folios if needed
> 
> Shared to private conversion via kvm_gmem_convert_range() -
>      1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> on each bound memslot overlapping with the range
>       2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> actually unmaps the host mappings which will unmap the KVM non-seucure
> EPT/NPT entries.
>       3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> shareability and then merges the folios if needed.
> 
> ============================
> 
> For IOMMU, could something like below work?
> 
> * A new UAPI to bind IOMMU FDs with guest_memfd ranges

Done that.

> * VFIO_DMA_MAP/UNMAP operations modified to directly fetch pfns from
> guest_memfd ranges using kvm_gmem_get_pfn()

This API IMHO should drop the confusing kvm_ prefix.

>      -> kvm invokes kvm_gmem_is_private() to check for the range
> shareability, IOMMU could use the same or we could add an API in gmem
> that takes in access type and checks the shareability before returning
> the pfn.

Right now I cut-and-pasted kvm_gmem_get_folio() (which is essentially filemap_lock_folio()/filemap_alloc_folio()/__filemap_add_folio()) to avoid new links between iommufd.ko and kvm.ko. It is probably unavoidable, though.


> * IOMMU stack exposes an invalidation callback that can be invoked by
> guest_memfd.
> 
> Private to Shared conversion via kvm_gmem_convert_range() -
>      1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> on each bound memslot overlapping with the range
>       2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> actually unmaps the KVM SEPT/NPT entries.
>             -> guest_memfd invokes IOMMU invalidation callback to zap
> the secure IOMMU entries.
>       3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> shareability and then splits the folios if needed
>       4) Userspace invokes IOMMU map operation to map the ranges in
> non-secure IOMMU.
> 
> Shared to private conversion via kvm_gmem_convert_range() -
>      1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> on each bound memslot overlapping with the range
>       2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> actually unmaps the host mappings which will unmap the KVM non-seucure
> EPT/NPT entries.
>           -> guest_memfd invokes IOMMU invalidation callback to zap the
> non-secure IOMMU entries.
>       3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> shareability and then merges the folios if needed.
>       4) Userspace invokes IOMMU map operation to map the ranges in secure IOMMU.


Alright (although this zap+map is not necessary on the AMD hw).


> There should be a way to block external IOMMU pagetable updates while
> guest_memfd is performing conversion e.g. something like
> kvm_invalidate_begin()/end().
> 
>>> is going to be flushed on a page conversion anyway (the RMPUPDATE
>>> instruction does that). All this is about AMD's x86 though.
>>
>> The iommu should not be using the VMA to manage the mapping. It should
> 
> +1.

Yeah, I'm already not doing this, because I physically cannot map gmemfd's memory in the IOMMU via the VMA (which allocates memory via gup(), so the wrong memory is mapped in the IOMMU). Thanks,


>> be directly linked to the guestmemfd in some way that does not disturb
>> its operations. I imagine there would be some kind of invalidation
>> callback directly to the iommu.
>>
>> Presumably that invalidation call back can include a reason for the
>> invalidation (addr change, shared/private conversion, etc)
>>
>> I'm not sure how we will figure out which case is which but guestmemfd
>> should allow the iommu to plug in either invalidation scheme..
>>
>> Probably invalidation should be a global to the FD thing, I imagine
>> that once invalidation is established the iommu will not be
>> incrementing page refcounts.
> 
> +1.

Alright. Thanks for the comments.

> 
>>
>> Jason

-- 
Alexey


^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-06-27  4:49                 ` Alexey Kardashevskiy
@ 2025-06-27 15:17                   ` Vishal Annapurve
  2025-06-30  0:19                     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 231+ messages in thread
From: Vishal Annapurve @ 2025-06-27 15:17 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Jason Gunthorpe, Fuad Tabba, Ackerley Tng, kvm, linux-mm,
	linux-kernel, x86, linux-fsdevel, ajones, akpm, amoorthy,
	anthony.yznaga, anup, aou, bfoster, binbin.wu, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, thomas.lendacky,
	usama.arif, vbabka, viro, vkuznets, wei.w.wang, will, willy,
	xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui, zhiquan1.li

On Thu, Jun 26, 2025 at 9:50 PM Alexey Kardashevskiy <aik@amd.com> wrote:
>
>
>
> On 25/6/25 00:10, Vishal Annapurve wrote:
> > On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >>
> >> On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
> >>
> >>> Now, I am rebasing my RFC on top of this patchset and it fails in
> >>> kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
> >>> folios in my RFC.
> >>>
> >>> So what is the expected sequence here? The userspace unmaps a DMA
> >>> page and maps it back right away, all from the userspace? The end
> >>> result will be the exactly same which seems useless. And IOMMU TLB
> >
> >   As Jason described, ideally IOMMU just like KVM, should just:
> > 1) Directly rely on guest_memfd for pinning -> no page refcounts taken
> > by IOMMU stack
> > 2) Directly query pfns from guest_memfd for both shared/private ranges
> > 3) Implement an invalidation callback that guest_memfd can invoke on
> > conversions.

Conversions and truncations both.

> >
> > Current flow:
> > Private to Shared conversion via kvm_gmem_convert_range() -
> >      1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> > on each bound memslot overlapping with the range
> >           -> KVM has the concept of invalidation_begin() and end(),
> > which effectively ensures that between these function calls, no new
> > EPT/NPT entries can be added for the range.
> >       2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> > actually unmaps the KVM SEPT/NPT entries.
> >       3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> > shareability and then splits the folios if needed
> >
> > Shared to private conversion via kvm_gmem_convert_range() -
> >      1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> > on each bound memslot overlapping with the range
> >       2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> > actually unmaps the host mappings which will unmap the KVM non-seucure
> > EPT/NPT entries.
> >       3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> > shareability and then merges the folios if needed.
> >
> > ============================
> >
> > For IOMMU, could something like below work?
> >
> > * A new UAPI to bind IOMMU FDs with guest_memfd ranges
>
> Done that.
>
> > * VFIO_DMA_MAP/UNMAP operations modified to directly fetch pfns from
> > guest_memfd ranges using kvm_gmem_get_pfn()
>
> This API imho should drop the confusing kvm_ prefix.
>
> >      -> kvm invokes kvm_gmem_is_private() to check for the range
> > shareability, IOMMU could use the same or we could add an API in gmem
> > that takes in access type and checks the shareability before returning
> > the pfn.
>
> Right now I cutnpasted kvm_gmem_get_folio() (which essentially is filemap_lock_folio()/filemap_alloc_folio()/__filemap_add_folio()) to avoid new links between iommufd.ko and kvm.ko. It is probably unavoidable though.

I don't think that's the way to avoid links between iommufd.ko and
kvm.ko. A cleaner way is probably to have the gmem logic built in and
allow runtime registration of invalidation callbacks from the KVM/IOMMU
backends. Need to think about this more.

>
>
> > * IOMMU stack exposes an invalidation callback that can be invoked by
> > guest_memfd.
> >
> > Private to Shared conversion via kvm_gmem_convert_range() -
> >      1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> > on each bound memslot overlapping with the range
> >       2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> > actually unmaps the KVM SEPT/NPT entries.
> >             -> guest_memfd invokes IOMMU invalidation callback to zap
> > the secure IOMMU entries.
> >       3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> > shareability and then splits the folios if needed
> >       4) Userspace invokes IOMMU map operation to map the ranges in
> > non-secure IOMMU.
> >
> > Shared to private conversion via kvm_gmem_convert_range() -
> >      1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> > on each bound memslot overlapping with the range
> >       2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> > actually unmaps the host mappings which will unmap the KVM non-seucure
> > EPT/NPT entries.
> >           -> guest_memfd invokes IOMMU invalidation callback to zap the
> > non-secure IOMMU entries.
> >       3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> > shareability and then merges the folios if needed.
> >       4) Userspace invokes IOMMU map operation to map the ranges in secure IOMMU.
>
>
> Alright (although this zap+map is not necessary on the AMD hw).

IMO guest_memfd ideally should not directly interact with or cater to
arch-specific needs; it should implement a mechanism that works for all
archs. KVM/IOMMU implement the invalidation callbacks and have all the
architecture-specific knowledge to make the right decisions.

>
>
> > There should be a way to block external IOMMU pagetable updates while
> > guest_memfd is performing conversion e.g. something like
> > kvm_invalidate_begin()/end().
> >
> >>> is going to be flushed on a page conversion anyway (the RMPUPDATE
> >>> instruction does that). All this is about AMD's x86 though.
> >>
> >> The iommu should not be using the VMA to manage the mapping. It should
> >
> > +1.
>
> Yeah, not doing this already, because I physically cannot map gmemfd's memory in IOMMU via VMA (which allocates memory via gup() so wrong memory is mapped in IOMMU). Thanks,
>
>
> >> be directly linked to the guestmemfd in some way that does not disturb
> >> its operations. I imagine there would be some kind of invalidation
> >> callback directly to the iommu.
> >>
> >> Presumably that invalidation call back can include a reason for the
> >> invalidation (addr change, shared/private conversion, etc)
> >>
> >> I'm not sure how we will figure out which case is which but guestmemfd
> >> should allow the iommu to plug in either invalidation scheme..
> >>
> >> Probably invalidation should be a global to the FD thing, I imagine
> >> that once invalidation is established the iommu will not be
> >> incrementing page refcounts.
> >
> > +1.
>
> Alright. Thanks for the comments.
>
> >
> >>
> >> Jason
>
> --
> Alexey
>

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-06-19  8:59   ` Xiaoyao Li
  2025-06-19  9:18     ` Xiaoyao Li
@ 2025-06-29 18:28     ` Vishal Annapurve
  2025-06-30  3:14       ` Yan Zhao
  1 sibling, 1 reply; 231+ messages in thread
From: Vishal Annapurve @ 2025-06-29 18:28 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Yan Zhao, Ackerley Tng, kvm, linux-mm, linux-kernel, x86,
	linux-fsdevel, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, yilun.xu, yuzenghui, zhiquan1.li

On Thu, Jun 19, 2025 at 1:59 AM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>
> On 6/19/2025 4:13 PM, Yan Zhao wrote:
> > On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
> >> Hello,
> >>
> >> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> >> upstream calls to provide 1G page support for guest_memfd by taking
> >> pages from HugeTLB.
> >>
> >> This patchset is based on Linux v6.15-rc6, and requires the mmap support
> >> for guest_memfd patchset (Thanks Fuad!) [1].
> >>
> >> For ease of testing, this series is also available, stitched together,
> >> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> >
> > Just to record a found issue -- not one that must be fixed.
> >
> > In TDX, the initial memory region is added as private memory during TD's build
> > time, with its initial content copied from source pages in shared memory.
> > The copy operation requires simultaneous access to both shared source memory
> > and private target memory.
> >
> > Therefore, userspace cannot store the initial content in shared memory at the
> > mmap-ed VA of a guest_memfd that performs in-place conversion between shared and
> > private memory. This is because the guest_memfd will first unmap a PFN in shared
> > page tables and then check for any extra refcount held for the shared PFN before
> > converting it to private.
>
> I have an idea.
>
> If I understand correctly, the KVM_GMEM_CONVERT_PRIVATE of in-place
> conversion unmap the PFN in shared page tables while keeping the content
> of the page unchanged, right?

That's correct.

>
> So KVM_GMEM_CONVERT_PRIVATE can be used to initialize the private memory
> actually for non-CoCo case actually, that userspace first mmap() it and
> ensure it's shared and writes the initial content to it, after it
> userspace convert it to private with KVM_GMEM_CONVERT_PRIVATE.

I think by non-CoCo VMs that care about private memory you mean pKVM.
Yes, initial memory regions can start as shared, which userspace can
populate and then convert to private.

>
> For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it
> wants the private memory to be initialized with initial content, and
> just do in-place TDH.PAGE.ADD in the hook.

I think this scheme will be cleaner:
1) Userspace marks the guest_memfd ranges corresponding to initial
payload as shared.
2) Userspace mmaps and populates the ranges.
3) Userspace converts those guest_memfd ranges to private.
4) For both SNP and TDX, userspace continues to invoke corresponding
initial payload preparation operations via existing KVM ioctls e.g.
KVM_SEV_SNP_LAUNCH_UPDATE/KVM_TDX_INIT_MEM_REGION.
   - SNP/TDX KVM logic fetches the right pfns for the target gfns
using the normal paths supported by KVM and passes those pfns directly
to the right trusted module to initialize the "encrypted" memory
contents.
       - Avoiding any GUP or memcpy from source addresses.

i.e. for TDX VMs, KVM_TDX_INIT_MEM_REGION still does the in-place TDH.PAGE.ADD.

Since we need to support VMs that will/won't use in-place conversion,
I think operations like KVM_TDX_INIT_MEM_REGION can introduce explicit
flags to allow userspace to indicate whether to assume in-place
conversion or not. Maybe
kvm_tdx_init_mem_region.source_addr/kvm_sev_snp_launch_update.uaddr
can be null in the scenarios where in-place conversion is used.
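
Roughly, from userspace (illustrative only -- the conversion ioctl and
its argument layout come from this RFC and may change, so those calls
are left as comments):

#include <string.h>
#include <sys/mman.h>

/* Illustrative flow for loading the initial payload in place. */
static void load_initial_payload(int gmem_fd, off_t offset, size_t len,
				 const void *payload)
{
	/* 1) The range starts as (or is converted to) shared:
	 *    ioctl(gmem_fd, KVM_GMEM_CONVERT_SHARED, &range);
	 */

	/* 2) Populate it through the shared mapping (error handling elided). */
	void *va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			gmem_fd, offset);
	memcpy(va, payload, len);
	munmap(va, len);

	/* 3) Convert the range to private in place:
	 *    ioctl(gmem_fd, KVM_GMEM_CONVERT_PRIVATE, &range);
	 */

	/* 4) Have KVM measure/ADD it in place via the existing ioctls
	 *    (KVM_TDX_INIT_MEM_REGION / KVM_SEV_SNP_LAUNCH_UPDATE), with
	 *    no separate source buffer, e.g. source_addr/uaddr == 0.
	 */
}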

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-06-27 15:17                   ` Vishal Annapurve
@ 2025-06-30  0:19                     ` Alexey Kardashevskiy
  2025-06-30 14:19                       ` Vishal Annapurve
  0 siblings, 1 reply; 231+ messages in thread
From: Alexey Kardashevskiy @ 2025-06-30  0:19 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Jason Gunthorpe, Fuad Tabba, Ackerley Tng, kvm, linux-mm,
	linux-kernel, x86, linux-fsdevel, ajones, akpm, amoorthy,
	anthony.yznaga, anup, aou, bfoster, binbin.wu, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, thomas.lendacky,
	usama.arif, vbabka, viro, vkuznets, wei.w.wang, will, willy,
	xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui, zhiquan1.li



On 28/6/25 01:17, Vishal Annapurve wrote:
> On Thu, Jun 26, 2025 at 9:50 PM Alexey Kardashevskiy <aik@amd.com> wrote:
>>
>>
>>
>> On 25/6/25 00:10, Vishal Annapurve wrote:
>>> On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>>>
>>>> On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
>>>>
>>>>> Now, I am rebasing my RFC on top of this patchset and it fails in
>>>>> kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
>>>>> folios in my RFC.
>>>>>
>>>>> So what is the expected sequence here? The userspace unmaps a DMA
>>>>> page and maps it back right away, all from the userspace? The end
>>>>> result will be the exactly same which seems useless. And IOMMU TLB
>>>
>>>    As Jason described, ideally IOMMU just like KVM, should just:
>>> 1) Directly rely on guest_memfd for pinning -> no page refcounts taken
>>> by IOMMU stack
>>> 2) Directly query pfns from guest_memfd for both shared/private ranges
>>> 3) Implement an invalidation callback that guest_memfd can invoke on
>>> conversions.
> 
> Conversions and truncations both.
> 
>>>
>>> Current flow:
>>> Private to Shared conversion via kvm_gmem_convert_range() -
>>>       1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
>>> on each bound memslot overlapping with the range
>>>            -> KVM has the concept of invalidation_begin() and end(),
>>> which effectively ensures that between these function calls, no new
>>> EPT/NPT entries can be added for the range.
>>>        2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
>>> actually unmaps the KVM SEPT/NPT entries.
>>>        3) guest_memfd invokes kvm_gmem_execute_work() which updates the
>>> shareability and then splits the folios if needed
>>>
>>> Shared to private conversion via kvm_gmem_convert_range() -
>>>       1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
>>> on each bound memslot overlapping with the range
>>>        2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
>>> actually unmaps the host mappings which will unmap the KVM non-seucure
>>> EPT/NPT entries.
>>>        3) guest_memfd invokes kvm_gmem_execute_work() which updates the
>>> shareability and then merges the folios if needed.
>>>
>>> ============================
>>>
>>> For IOMMU, could something like below work?
>>>
>>> * A new UAPI to bind IOMMU FDs with guest_memfd ranges
>>
>> Done that.
>>
>>> * VFIO_DMA_MAP/UNMAP operations modified to directly fetch pfns from
>>> guest_memfd ranges using kvm_gmem_get_pfn()
>>
>> This API imho should drop the confusing kvm_ prefix.
>>
>>>       -> kvm invokes kvm_gmem_is_private() to check for the range
>>> shareability, IOMMU could use the same or we could add an API in gmem
>>> that takes in access type and checks the shareability before returning
>>> the pfn.
>>
>> Right now I cutnpasted kvm_gmem_get_folio() (which essentially is filemap_lock_folio()/filemap_alloc_folio()/__filemap_add_folio()) to avoid new links between iommufd.ko and kvm.ko. It is probably unavoidable though.
> 
> I don't think that's the way to avoid links between iommufd.ko and
> kvm.ko. Cleaner way probably is to have gmem logic built-in and allow
> runtime registration of invalidation callbacks from KVM/IOMMU
> backends. Need to think about this more.

Yeah, otherwise iommufd.ko will have to install a hook in guest_memfd (== kvm.ko) at run time, so more of the beloved symbol_get() :)

> 
>>
>>
>>> * IOMMU stack exposes an invalidation callback that can be invoked by
>>> guest_memfd.
>>>
>>> Private to Shared conversion via kvm_gmem_convert_range() -
>>>       1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
>>> on each bound memslot overlapping with the range
>>>        2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
>>> actually unmaps the KVM SEPT/NPT entries.
>>>              -> guest_memfd invokes IOMMU invalidation callback to zap
>>> the secure IOMMU entries.
>>>        3) guest_memfd invokes kvm_gmem_execute_work() which updates the
>>> shareability and then splits the folios if needed
>>>        4) Userspace invokes IOMMU map operation to map the ranges in
>>> non-secure IOMMU.
>>>
>>> Shared to private conversion via kvm_gmem_convert_range() -
>>>       1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
>>> on each bound memslot overlapping with the range
>>>        2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
>>> actually unmaps the host mappings which will unmap the KVM non-seucure
>>> EPT/NPT entries.
>>>            -> guest_memfd invokes IOMMU invalidation callback to zap the
>>> non-secure IOMMU entries.
>>>        3) guest_memfd invokes kvm_gmem_execute_work() which updates the
>>> shareability and then merges the folios if needed.
>>>        4) Userspace invokes IOMMU map operation to map the ranges in secure IOMMU.
>>
>>
>> Alright (although this zap+map is not necessary on the AMD hw).
> 
> IMO guest_memfd ideally should not directly interact or cater to arch
> specific needs, it should implement a mechanism that works for all
> archs. KVM/IOMMU implement invalidation callbacks and have all the
> architecture specific knowledge to take the right decisions.


Every page conversion will go through:

kvm-amd.ko -1-> guest_memfd (kvm.ko) -2-> iommufd.ko -3-> amd-iommu (built-in).

Which one decides that the IOMMU does not need (un)mapping? It has got to be (1), but then it needs to propagate the decision to amd-iommu (and we do not have (3) at the moment in that path).

Or do we just always do unmap+map (and trigger unwanted huge page smashing)? All of this is doable and none of it particularly horrible; I'm trying to see where the consensus is now. Thanks,


>>
>>> There should be a way to block external IOMMU pagetable updates while
>>> guest_memfd is performing conversion e.g. something like
>>> kvm_invalidate_begin()/end().
>>>
>>>>> is going to be flushed on a page conversion anyway (the RMPUPDATE
>>>>> instruction does that). All this is about AMD's x86 though.
>>>>
>>>> The iommu should not be using the VMA to manage the mapping. It should
>>>
>>> +1.
>>
>> Yeah, not doing this already, because I physically cannot map gmemfd's memory in IOMMU via VMA (which allocates memory via gup() so wrong memory is mapped in IOMMU). Thanks,
>>
>>
>>>> be directly linked to the guestmemfd in some way that does not disturb
>>>> its operations. I imagine there would be some kind of invalidation
>>>> callback directly to the iommu.
>>>>
>>>> Presumably that invalidation call back can include a reason for the
>>>> invalidation (addr change, shared/private conversion, etc)
>>>>
>>>> I'm not sure how we will figure out which case is which but guestmemfd
>>>> should allow the iommu to plug in either invalidation scheme..
>>>>
>>>> Probably invalidation should be a global to the FD thing, I imagine
>>>> that once invalidation is established the iommu will not be
>>>> incrementing page refcounts.
>>>
>>> +1.
>>
>> Alright. Thanks for the comments.
>>
>>>
>>>>
>>>> Jason
>>
>> --
>> Alexey
>>

-- 
Alexey


^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-06-29 18:28     ` Vishal Annapurve
@ 2025-06-30  3:14       ` Yan Zhao
  2025-06-30 14:14         ` Vishal Annapurve
  0 siblings, 1 reply; 231+ messages in thread
From: Yan Zhao @ 2025-06-30  3:14 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Xiaoyao Li, Ackerley Tng, kvm, linux-mm, linux-kernel, x86,
	linux-fsdevel, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, yilun.xu, yuzenghui, zhiquan1.li

On Sun, Jun 29, 2025 at 11:28:22AM -0700, Vishal Annapurve wrote:
> On Thu, Jun 19, 2025 at 1:59 AM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> >
> > On 6/19/2025 4:13 PM, Yan Zhao wrote:
> > > On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
> > >> Hello,
> > >>
> > >> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> > >> upstream calls to provide 1G page support for guest_memfd by taking
> > >> pages from HugeTLB.
> > >>
> > >> This patchset is based on Linux v6.15-rc6, and requires the mmap support
> > >> for guest_memfd patchset (Thanks Fuad!) [1].
> > >>
> > >> For ease of testing, this series is also available, stitched together,
> > >> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> > >
> > > Just to record a found issue -- not one that must be fixed.
> > >
> > > In TDX, the initial memory region is added as private memory during TD's build
> > > time, with its initial content copied from source pages in shared memory.
> > > The copy operation requires simultaneous access to both shared source memory
> > > and private target memory.
> > >
> > > Therefore, userspace cannot store the initial content in shared memory at the
> > > mmap-ed VA of a guest_memfd that performs in-place conversion between shared and
> > > private memory. This is because the guest_memfd will first unmap a PFN in shared
> > > page tables and then check for any extra refcount held for the shared PFN before
> > > converting it to private.
> >
> > I have an idea.
> >
> > If I understand correctly, the KVM_GMEM_CONVERT_PRIVATE of in-place
> > conversion unmap the PFN in shared page tables while keeping the content
> > of the page unchanged, right?
> 
> That's correct.
> 
> >
> > So KVM_GMEM_CONVERT_PRIVATE can actually be used to initialize the private
> > memory for the non-CoCo case: userspace first mmap()s it, ensures it's
> > shared, writes the initial content to it, and afterwards converts it to
> > private with KVM_GMEM_CONVERT_PRIVATE.
> 
> I think you mean pKVM by non-coco VMs that care about private memory.
> Yes, initial memory regions can start as shared which userspace can
> populate and then convert the ranges to private.
> 
> >
> > For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it
> > wants the private memory to be initialized with initial content, and
> > just do in-place TDH.PAGE.ADD in the hook.
> 
> I think this scheme will be cleaner:
> 1) Userspace marks the guest_memfd ranges corresponding to initial
> payload as shared.
> 2) Userspace mmaps and populates the ranges.
> 3) Userspace converts those guest_memfd ranges to private.
> 4) For both SNP and TDX, userspace continues to invoke corresponding
> initial payload preparation operations via existing KVM ioctls e.g.
> KVM_SEV_SNP_LAUNCH_UPDATE/KVM_TDX_INIT_MEM_REGION.
>    - SNP/TDX KVM logic fetches the right pfns for the target gfns
> using the normal paths supported by KVM and passes those pfns directly
> to the right trusted module to initialize the "encrypted" memory
> contents.
>        - Avoiding any GUP or memcpy from source addresses.
One caveat:

when TDX populates the mirror root, kvm_gmem_get_pfn() is invoked.
Then kvm_gmem_prepare_folio() is further invoked to zero the folio.

> i.e. for TDX VMs, KVM_TDX_INIT_MEM_REGION still does the in-place TDH.PAGE.ADD.
So, by this point, the pages would no longer contain the original content?

> Since we need to support VMs that will/won't use in-place conversion,
> I think operations like KVM_TDX_INIT_MEM_REGION can introduce explicit
> flags to allow userspace to indicate whether to assume in-place
> conversion or not. Maybe
> kvm_tdx_init_mem_region.source_addr/kvm_sev_snp_launch_update.uaddr
> can be null in the scenarios where in-place conversion is used.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-06-30  3:14       ` Yan Zhao
@ 2025-06-30 14:14         ` Vishal Annapurve
  2025-07-01  5:23           ` Yan Zhao
  0 siblings, 1 reply; 231+ messages in thread
From: Vishal Annapurve @ 2025-06-30 14:14 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Xiaoyao Li, Ackerley Tng, kvm, linux-mm, linux-kernel, x86,
	linux-fsdevel, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, yilun.xu, yuzenghui, zhiquan1.li

On Sun, Jun 29, 2025 at 8:17 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Sun, Jun 29, 2025 at 11:28:22AM -0700, Vishal Annapurve wrote:
> > On Thu, Jun 19, 2025 at 1:59 AM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> > >
> > > On 6/19/2025 4:13 PM, Yan Zhao wrote:
> > > > On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
> > > >> Hello,
> > > >>
> > > >> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> > > >> upstream calls to provide 1G page support for guest_memfd by taking
> > > >> pages from HugeTLB.
> > > >>
> > > >> This patchset is based on Linux v6.15-rc6, and requires the mmap support
> > > >> for guest_memfd patchset (Thanks Fuad!) [1].
> > > >>
> > > >> For ease of testing, this series is also available, stitched together,
> > > >> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> > > >
> > > > Just to record a found issue -- not one that must be fixed.
> > > >
> > > > In TDX, the initial memory region is added as private memory during TD's build
> > > > time, with its initial content copied from source pages in shared memory.
> > > > The copy operation requires simultaneous access to both shared source memory
> > > > and private target memory.
> > > >
> > > > Therefore, userspace cannot store the initial content in shared memory at the
> > > > mmap-ed VA of a guest_memfd that performs in-place conversion between shared and
> > > > private memory. This is because the guest_memfd will first unmap a PFN in shared
> > > > page tables and then check for any extra refcount held for the shared PFN before
> > > > converting it to private.
> > >
> > > I have an idea.
> > >
> > > If I understand correctly, the KVM_GMEM_CONVERT_PRIVATE of in-place
> > > conversion unmap the PFN in shared page tables while keeping the content
> > > of the page unchanged, right?
> >
> > That's correct.
> >
> > >
> > > So KVM_GMEM_CONVERT_PRIVATE can actually be used to initialize the private
> > > memory for the non-CoCo case: userspace first mmap()s it, ensures it's
> > > shared, writes the initial content to it, and afterwards converts it to
> > > private with KVM_GMEM_CONVERT_PRIVATE.
> >
> > I think you mean pKVM by non-coco VMs that care about private memory.
> > Yes, initial memory regions can start as shared which userspace can
> > populate and then convert the ranges to private.
> >
> > >
> > > For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it
> > > wants the private memory to be initialized with initial content, and
> > > just do in-place TDH.PAGE.ADD in the hook.
> >
> > I think this scheme will be cleaner:
> > 1) Userspace marks the guest_memfd ranges corresponding to initial
> > payload as shared.
> > 2) Userspace mmaps and populates the ranges.
> > 3) Userspace converts those guest_memfd ranges to private.
> > 4) For both SNP and TDX, userspace continues to invoke corresponding
> > initial payload preparation operations via existing KVM ioctls e.g.
> > KVM_SEV_SNP_LAUNCH_UPDATE/KVM_TDX_INIT_MEM_REGION.
> >    - SNP/TDX KVM logic fetches the right pfns for the target gfns
> > using the normal paths supported by KVM and passes those pfns directly
> > to the right trusted module to initialize the "encrypted" memory
> > contents.
> >        - Avoiding any GUP or memcpy from source addresses.
> One caveat:
>
> when TDX populates the mirror root, kvm_gmem_get_pfn() is invoked.
> Then kvm_gmem_prepare_folio() is further invoked to zero the folio.

Given that confidential VMs have their own way of initializing private
memory, I think zeroing makes sense for only shared memory ranges.
i.e. something like below:
1) Don't zero at allocation time.
2) If faulting in a shared page and it's not uptodate, then zero the
page and set the page as uptodate.
3) Clear uptodate flag on private to shared conversion.
4) For faults on private ranges, don't zero the memory.

There might be some other considerations here e.g. pKVM needs
non-destructive conversion operation, which might need a way to enable
zeroing at allocation time only.

On a TDX specific note, IIUC, KVM TDX logic doesn't need to clear
pages on future platforms [1].

[1] https://lore.kernel.org/lkml/6de76911-5007-4170-bf74-e1d045c68465@intel.com/
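
A rough sketch of rules 1)-4), assuming the uptodate flag is used as the
"already zeroed" marker. The helper names below are hypothetical and are
only meant to illustrate the proposal, not code from this series:

static struct folio *kvm_gmem_fault_shared(struct inode *inode, pgoff_t index)
{
        /* 1) no zeroing at allocation time */
        struct folio *folio = kvm_gmem_get_folio(inode, index);

        if (IS_ERR(folio))
                return folio;

        /* 2) zero a shared page lazily, exactly once, at fault time */
        if (!folio_test_uptodate(folio)) {
                folio_zero_range(folio, 0, folio_size(folio));
                folio_mark_uptodate(folio);
        }

        return folio;
}

static void kvm_gmem_mark_range_shared(struct folio *folio)
{
        /* 3) private -> shared conversion forces re-zeroing on the next fault */
        folio_clear_uptodate(folio);
}

/* 4) private faults skip zeroing; TDX/SNP own initialization of that memory. */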

>
> > i.e. for TDX VMs, KVM_TDX_INIT_MEM_REGION still does the in-place TDH.PAGE.ADD.
> So, by this point, the pages would no longer contain the original content?
>

Pages should contain the original content. Michael is already
experimenting with similar logic [2] for SNP.

[2] https://lore.kernel.org/lkml/20250613005400.3694904-6-michael.roth@amd.com/

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-06-30  0:19                     ` Alexey Kardashevskiy
@ 2025-06-30 14:19                       ` Vishal Annapurve
  2025-07-10  6:57                         ` Alexey Kardashevskiy
  0 siblings, 1 reply; 231+ messages in thread
From: Vishal Annapurve @ 2025-06-30 14:19 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Jason Gunthorpe, Fuad Tabba, Ackerley Tng, kvm, linux-mm,
	linux-kernel, x86, linux-fsdevel, ajones, akpm, amoorthy,
	anthony.yznaga, anup, aou, bfoster, binbin.wu, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, thomas.lendacky,
	usama.arif, vbabka, viro, vkuznets, wei.w.wang, will, willy,
	xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui, zhiquan1.li

On Sun, Jun 29, 2025 at 5:19 PM Alexey Kardashevskiy <aik@amd.com> wrote:
> ...
> >>> ============================
> >>>
> >>> For IOMMU, could something like below work?
> >>>
> >>> * A new UAPI to bind IOMMU FDs with guest_memfd ranges
> >>
> >> Done that.
> >>
> >>> * VFIO_DMA_MAP/UNMAP operations modified to directly fetch pfns from
> >>> guest_memfd ranges using kvm_gmem_get_pfn()
> >>
> >> This API imho should drop the confusing kvm_ prefix.
> >>
> >>>       -> kvm invokes kvm_gmem_is_private() to check for the range
> >>> shareability, IOMMU could use the same or we could add an API in gmem
> >>> that takes in access type and checks the shareability before returning
> >>> the pfn.
> >>
>> Right now I cut-and-pasted kvm_gmem_get_folio() (which is essentially filemap_lock_folio()/filemap_alloc_folio()/__filemap_add_folio()) to avoid new links between iommufd.ko and kvm.ko. It is probably unavoidable though.
> >
> > I don't think that's the way to avoid links between iommufd.ko and
> > kvm.ko. Cleaner way probably is to have gmem logic built-in and allow
> > runtime registration of invalidation callbacks from KVM/IOMMU
> > backends. Need to think about this more.
>
> Yeah, otherwise iommufd.ko will have to install a hook in guest_memfd (==kvm.ko) at run time, so more of the beloved symbol_get() :)
>
> >
> >>
> >>
> >>> * IOMMU stack exposes an invalidation callback that can be invoked by
> >>> guest_memfd.
> >>>
> >>> Private to Shared conversion via kvm_gmem_convert_range() -
> >>>       1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> >>> on each bound memslot overlapping with the range
> >>>        2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> >>> actually unmaps the KVM SEPT/NPT entries.
> >>>              -> guest_memfd invokes IOMMU invalidation callback to zap
> >>> the secure IOMMU entries.
> >>>        3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> >>> shareability and then splits the folios if needed
> >>>        4) Userspace invokes IOMMU map operation to map the ranges in
> >>> non-secure IOMMU.
> >>>
> >>> Shared to private conversion via kvm_gmem_convert_range() -
> >>>       1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> >>> on each bound memslot overlapping with the range
> >>>        2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> >>> actually unmaps the host mappings which will unmap the KVM non-secure
> >>> EPT/NPT entries.
> >>>            -> guest_memfd invokes IOMMU invalidation callback to zap the
> >>> non-secure IOMMU entries.
> >>>        3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> >>> shareability and then merges the folios if needed.
> >>>        4) Userspace invokes IOMMU map operation to map the ranges in secure IOMMU.
> >>
> >>
> >> Alright (although this zap+map is not necessary on the AMD hw).
> >
> > IMO guest_memfd ideally should not directly interact or cater to arch
> > specific needs, it should implement a mechanism that works for all
> > archs. KVM/IOMMU implement invalidation callbacks and have all the
> > architecture specific knowledge to take the right decisions.
>
>
> Every page conversion will go through:
>
> kvm-amd.ko -1-> guest_memfd (kvm.ko) -2-> iommufd.ko -3-> amd-iommu (built-in).
>
> Which one decides that the IOMMU does not need (un)mapping? It has to be (1), but then it needs to propagate the decision to amd-iommu (and we do not have (3) in that path at the moment).

If there is a need, guest_memfd can support two different callbacks:
1) Conversion notifier/callback invoked by guest_memfd during
conversion handling.
2) Invalidation notifier/callback invoked by guest_memfd during truncation.

Iommufd/KVM can handle the conversion callback/notifier as per the needs
of the underlying architecture, e.g. do the unmapping for TDX Connect vs.
skip the unmapping for SEV Trusted I/O.

Invalidation callback/notifier will need to be handled by unmapping page tables.
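
As a purely illustrative sketch of that split (the structure and the
registration helper below are made up for this discussion, not an existing
API in the series), a backend could register both and decide per event what
to do:

struct gmem_notifier_ops {
        /*
         * Called on shared<->private conversion. A backend may keep its
         * mappings intact here (e.g. hardware that does not need the
         * zap+map dance) or unmap (e.g. TDX Connect).
         */
        void (*convert)(struct inode *inode, pgoff_t start, pgoff_t nr,
                        bool to_private, void *priv);
        /* Called on truncation/hole punching; must always unmap. */
        void (*invalidate)(struct inode *inode, pgoff_t start, pgoff_t nr,
                           void *priv);
};

/* Hypothetical registration entry point for KVM or iommufd backends. */
int gmem_register_notifier(struct inode *inode,
                           const struct gmem_notifier_ops *ops, void *priv);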

>
> Or do we just always do unmap+map (and trigger unwanted huge page smashing)? All of it is doable and neither option is particularly horrible; I'm trying to see where the consensus is now. Thanks,
>

I assume when you say huge page smashing, it means huge page NPT
mapping getting split.

AFAIR, based on discussion with Michael during guest_memfd calls,
stage2 NPT entries need to be of the same granularity as RMP tables
for AMD SNP guests. i.e. huge page NPT mappings need to be smashed on
the KVM side during conversion. So today guest_memfd sends
invalidation notification to KVM for both conversion and truncation.
Doesn't the same constraint for keeping IOMMU page tables at the same
granularity as RMP tables hold for trusted IO?

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-06-30 14:14         ` Vishal Annapurve
@ 2025-07-01  5:23           ` Yan Zhao
  2025-07-01 19:48             ` Vishal Annapurve
  0 siblings, 1 reply; 231+ messages in thread
From: Yan Zhao @ 2025-07-01  5:23 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Xiaoyao Li, Ackerley Tng, kvm, linux-mm, linux-kernel, x86,
	linux-fsdevel, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, yilun.xu, yuzenghui, zhiquan1.li

On Mon, Jun 30, 2025 at 07:14:07AM -0700, Vishal Annapurve wrote:
> On Sun, Jun 29, 2025 at 8:17 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Sun, Jun 29, 2025 at 11:28:22AM -0700, Vishal Annapurve wrote:
> > > On Thu, Jun 19, 2025 at 1:59 AM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> > > >
> > > > On 6/19/2025 4:13 PM, Yan Zhao wrote:
> > > > > On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
> > > > >> Hello,
> > > > >>
> > > > >> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> > > > >> upstream calls to provide 1G page support for guest_memfd by taking
> > > > >> pages from HugeTLB.
> > > > >>
> > > > >> This patchset is based on Linux v6.15-rc6, and requires the mmap support
> > > > >> for guest_memfd patchset (Thanks Fuad!) [1].
> > > > >>
> > > > >> For ease of testing, this series is also available, stitched together,
> > > > >> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> > > > >
> > > > > Just to record a found issue -- not one that must be fixed.
> > > > >
> > > > > In TDX, the initial memory region is added as private memory during TD's build
> > > > > time, with its initial content copied from source pages in shared memory.
> > > > > The copy operation requires simultaneous access to both shared source memory
> > > > > and private target memory.
> > > > >
> > > > > Therefore, userspace cannot store the initial content in shared memory at the
> > > > > mmap-ed VA of a guest_memfd that performs in-place conversion between shared and
> > > > > private memory. This is because the guest_memfd will first unmap a PFN in shared
> > > > > page tables and then check for any extra refcount held for the shared PFN before
> > > > > converting it to private.
> > > >
> > > > I have an idea.
> > > >
> > > > If I understand correctly, the KVM_GMEM_CONVERT_PRIVATE of in-place
> > > > conversion unmap the PFN in shared page tables while keeping the content
> > > > of the page unchanged, right?
> > >
> > > That's correct.
> > >
> > > >
> > > > So KVM_GMEM_CONVERT_PRIVATE can actually be used to initialize the private
> > > > memory for the non-CoCo case: userspace first mmap()s it, ensures it's
> > > > shared, writes the initial content to it, and afterwards converts it to
> > > > private with KVM_GMEM_CONVERT_PRIVATE.
> > >
> > > I think you mean pKVM by non-coco VMs that care about private memory.
> > > Yes, initial memory regions can start as shared which userspace can
> > > populate and then convert the ranges to private.
> > >
> > > >
> > > > For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it
> > > > wants the private memory to be initialized with initial content, and
> > > > just do in-place TDH.PAGE.ADD in the hook.
> > >
> > > I think this scheme will be cleaner:
> > > 1) Userspace marks the guest_memfd ranges corresponding to initial
> > > payload as shared.
> > > 2) Userspace mmaps and populates the ranges.
> > > 3) Userspace converts those guest_memfd ranges to private.
> > > 4) For both SNP and TDX, userspace continues to invoke corresponding
> > > initial payload preparation operations via existing KVM ioctls e.g.
> > > KVM_SEV_SNP_LAUNCH_UPDATE/KVM_TDX_INIT_MEM_REGION.
> > >    - SNP/TDX KVM logic fetches the right pfns for the target gfns
> > > using the normal paths supported by KVM and passes those pfns directly
> > > to the right trusted module to initialize the "encrypted" memory
> > > contents.
> > >        - Avoiding any GUP or memcpy from source addresses.
> > One caveat:
> >
> > when TDX populates the mirror root, kvm_gmem_get_pfn() is invoked.
> > Then kvm_gmem_prepare_folio() is further invoked to zero the folio.
> 
> Given that confidential VMs have their own way of initializing private
> memory, I think zeroing makes sense for only shared memory ranges.
> i.e. something like below:
> 1) Don't zero at allocation time.
> > 2) If faulting in a shared page and it's not uptodate, then zero the
> page and set the page as uptodate.
> 3) Clear uptodate flag on private to shared conversion.
> 4) For faults on private ranges, don't zero the memory.
> 
> There might be some other considerations here e.g. pKVM needs
> non-destructive conversion operation, which might need a way to enable
> zeroing at allocation time only.
> 
> On a TDX specific note, IIUC, KVM TDX logic doesn't need to clear
> pages on future platforms [1].
Yes, TDX does not need to clear pages on private page allocation.
But current kvm_gmem_prepare_folio() clears private pages in the common path
for both TDX and SEV-SNP.

I just wanted to point out that it's a kind of obstacle that needs to be removed
to implement the proposed approach.


> [1] https://lore.kernel.org/lkml/6de76911-5007-4170-bf74-e1d045c68465@intel.com/
> 
> >
> > > i.e. for TDX VMs, KVM_TDX_INIT_MEM_REGION still does the in-place TDH.PAGE.ADD.
> > So, by this point, the pages would no longer contain the original content?
> >
> 
> Pages should contain the original content. Michael is already
> experimenting with similar logic [2] for SNP.
> 
> [2] https://lore.kernel.org/lkml/20250613005400.3694904-6-michael.roth@amd.com/

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-01  5:23           ` Yan Zhao
@ 2025-07-01 19:48             ` Vishal Annapurve
  2025-07-07 23:25               ` Sean Christopherson
  0 siblings, 1 reply; 231+ messages in thread
From: Vishal Annapurve @ 2025-07-01 19:48 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Xiaoyao Li, Ackerley Tng, kvm, linux-mm, linux-kernel, x86,
	linux-fsdevel, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, seanjc, shuah, steven.price, steven.sistare, suzuki.poulose,
	tabba, thomas.lendacky, usama.arif, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, yilun.xu, yuzenghui, zhiquan1.li

On Mon, Jun 30, 2025 at 10:26 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Mon, Jun 30, 2025 at 07:14:07AM -0700, Vishal Annapurve wrote:
> > On Sun, Jun 29, 2025 at 8:17 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Sun, Jun 29, 2025 at 11:28:22AM -0700, Vishal Annapurve wrote:
> > > > On Thu, Jun 19, 2025 at 1:59 AM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> > > > >
> > > > > On 6/19/2025 4:13 PM, Yan Zhao wrote:
> > > > > > On Wed, May 14, 2025 at 04:41:39PM -0700, Ackerley Tng wrote:
> > > > > >> Hello,
> > > > > >>
> > > > > >> This patchset builds upon discussion at LPC 2024 and many guest_memfd
> > > > > >> upstream calls to provide 1G page support for guest_memfd by taking
> > > > > >> pages from HugeTLB.
> > > > > >>
> > > > > >> This patchset is based on Linux v6.15-rc6, and requires the mmap support
> > > > > >> for guest_memfd patchset (Thanks Fuad!) [1].
> > > > > >>
> > > > > >> For ease of testing, this series is also available, stitched together,
> > > > > >> at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
> > > > > >
> > > > > > Just to record a found issue -- not one that must be fixed.
> > > > > >
> > > > > > In TDX, the initial memory region is added as private memory during TD's build
> > > > > > time, with its initial content copied from source pages in shared memory.
> > > > > > The copy operation requires simultaneous access to both shared source memory
> > > > > > and private target memory.
> > > > > >
> > > > > > Therefore, userspace cannot store the initial content in shared memory at the
> > > > > > mmap-ed VA of a guest_memfd that performs in-place conversion between shared and
> > > > > > private memory. This is because the guest_memfd will first unmap a PFN in shared
> > > > > > page tables and then check for any extra refcount held for the shared PFN before
> > > > > > converting it to private.
> > > > >
> > > > > I have an idea.
> > > > >
> > > > > If I understand correctly, the KVM_GMEM_CONVERT_PRIVATE of in-place
> > > > > conversion unmap the PFN in shared page tables while keeping the content
> > > > > of the page unchanged, right?
> > > >
> > > > That's correct.
> > > >
> > > > >
> > > > > So KVM_GMEM_CONVERT_PRIVATE can actually be used to initialize the private
> > > > > memory for the non-CoCo case: userspace first mmap()s it, ensures it's
> > > > > shared, writes the initial content to it, and afterwards converts it to
> > > > > private with KVM_GMEM_CONVERT_PRIVATE.
> > > >
> > > > I think you mean pKVM by non-coco VMs that care about private memory.
> > > > Yes, initial memory regions can start as shared which userspace can
> > > > populate and then convert the ranges to private.
> > > >
> > > > >
> > > > > For CoCo case, like TDX, it can hook to KVM_GMEM_CONVERT_PRIVATE if it
> > > > > wants the private memory to be initialized with initial content, and
> > > > > just do in-place TDH.PAGE.ADD in the hook.
> > > >
> > > > I think this scheme will be cleaner:
> > > > 1) Userspace marks the guest_memfd ranges corresponding to initial
> > > > payload as shared.
> > > > 2) Userspace mmaps and populates the ranges.
> > > > 3) Userspace converts those guest_memfd ranges to private.
> > > > 4) For both SNP and TDX, userspace continues to invoke corresponding
> > > > initial payload preparation operations via existing KVM ioctls e.g.
> > > > KVM_SEV_SNP_LAUNCH_UPDATE/KVM_TDX_INIT_MEM_REGION.
> > > >    - SNP/TDX KVM logic fetches the right pfns for the target gfns
> > > > using the normal paths supported by KVM and passes those pfns directly
> > > > to the right trusted module to initialize the "encrypted" memory
> > > > contents.
> > > >        - Avoiding any GUP or memcpy from source addresses.
> > > One caveat:
> > >
> > > when TDX populates the mirror root, kvm_gmem_get_pfn() is invoked.
> > > Then kvm_gmem_prepare_folio() is further invoked to zero the folio.
> >
> > Given that confidential VMs have their own way of initializing private
> > memory, I think zeroing makes sense for only shared memory ranges.
> > i.e. something like below:
> > 1) Don't zero at allocation time.
> > > 2) If faulting in a shared page and it's not uptodate, then zero the
> > page and set the page as uptodate.
> > 3) Clear uptodate flag on private to shared conversion.
> > 4) For faults on private ranges, don't zero the memory.
> >
> > There might be some other considerations here e.g. pKVM needs
> > non-destructive conversion operation, which might need a way to enable
> > zeroing at allocation time only.
> >
> > On a TDX specific note, IIUC, KVM TDX logic doesn't need to clear
> > pages on future platforms [1].
> Yes, TDX does not need to clear pages on private page allocation.
> But current kvm_gmem_prepare_folio() clears private pages in the common path
> for both TDX and SEV-SNP.
>
> I just wanted to point out that it's a kind of obstacle that needs to be removed
> to implement the proposed approach.
>

The proposed approach will work with 4K pages without any additional
changes. For huge pages, it's easy to prototype this approach by just
disabling the zeroing logic in guest_memfd on faulting and instead always
zeroing on allocation.

I would be curious to understand if we need zeroing on conversion for
Confidential VMs. If not, then the simple rule of zeroing on
allocation only will work for all use cases.
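
A minimal sketch of that prototype, assuming the fault-time zeroing from this
series can simply be compiled out and the allocation path zeroes instead (the
helper name is hypothetical):

static struct folio *kvm_gmem_alloc_huge_folio(struct inode *inode,
                                               pgoff_t index)
{
        struct folio *folio = kvm_gmem_get_folio(inode, index);

        /* Zero exactly once, when the huge folio is first allocated... */
        if (!IS_ERR(folio) && !folio_test_uptodate(folio)) {
                folio_zero_range(folio, 0, folio_size(folio));
                folio_mark_uptodate(folio);
        }

        /* ...and never again on faults or on shared<->private conversion. */
        return folio;
}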

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-06-24 14:10               ` Vishal Annapurve
  2025-06-27  4:49                 ` Alexey Kardashevskiy
@ 2025-07-02  8:35                 ` Yan Zhao
  2025-07-02 13:54                   ` Vishal Annapurve
  2025-07-16 22:22                   ` Ackerley Tng
  1 sibling, 2 replies; 231+ messages in thread
From: Yan Zhao @ 2025-07-02  8:35 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Jason Gunthorpe, Alexey Kardashevskiy, Fuad Tabba, Ackerley Tng,
	kvm, linux-mm, linux-kernel, x86, linux-fsdevel, ajones, akpm,
	amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, thomas.lendacky,
	usama.arif, vbabka, viro, vkuznets, wei.w.wang, will, willy,
	xiaoyao.li, yilun.xu, yuzenghui, zhiquan1.li

On Tue, Jun 24, 2025 at 07:10:38AM -0700, Vishal Annapurve wrote:
> On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
> >
> > > Now, I am rebasing my RFC on top of this patchset and it fails in
> > > kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
> > > folios in my RFC.
> > >
> > > So what is the expected sequence here? The userspace unmaps a DMA
> > > page and maps it back right away, all from the userspace? The end
> > > result will be the exactly same which seems useless. And IOMMU TLB
> 
>  As Jason described, ideally IOMMU just like KVM, should just:
> 1) Directly rely on guest_memfd for pinning -> no page refcounts taken
> by IOMMU stack
In TDX Connect, the TDX module and TDs do not trust the VMM. So, it's up to
the TDs to inform the TDX module which pages they use for DMA purposes.
So, if a page is regarded as pinned by TDs for DMA, the TDX module will fail the
unmap of the pages from S-EPT.

If IOMMU side does not increase refcount, IMHO, some way to indicate that
certain PFNs are used by TDs for DMA is still required, so guest_memfd can
reject the request before attempting the actual unmap.
Otherwise, the unmap of TD-DMA-pinned pages will fail.

When this kind of unmap failure happens, it also doesn't help for the host to
retry the unmap without the TD unpinning the pages first.


> 2) Directly query pfns from guest_memfd for both shared/private ranges
> 3) Implement an invalidation callback that guest_memfd can invoke on
> conversions.
> 
> Current flow:
> Private to Shared conversion via kvm_gmem_convert_range() -
>     1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> on each bound memslot overlapping with the range
>          -> KVM has the concept of invalidation_begin() and end(),
> which effectively ensures that between these function calls, no new
> EPT/NPT entries can be added for the range.
>      2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> actually unmaps the KVM SEPT/NPT entries.
>      3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> shareability and then splits the folios if needed
> 
> Shared to private conversion via kvm_gmem_convert_range() -
>     1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> on each bound memslot overlapping with the range
>      2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> > actually unmaps the host mappings which will unmap the KVM non-secure
> EPT/NPT entries.
>      3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> shareability and then merges the folios if needed.
> 
> ============================
> 
> For IOMMU, could something like below work?
> 
> * A new UAPI to bind IOMMU FDs with guest_memfd ranges
> * VFIO_DMA_MAP/UNMAP operations modified to directly fetch pfns from
> guest_memfd ranges using kvm_gmem_get_pfn()
>     -> kvm invokes kvm_gmem_is_private() to check for the range
> shareability, IOMMU could use the same or we could add an API in gmem
> that takes in access type and checks the shareability before returning
> the pfn.
> * IOMMU stack exposes an invalidation callback that can be invoked by
> guest_memfd.
> 
> Private to Shared conversion via kvm_gmem_convert_range() -
>     1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> on each bound memslot overlapping with the range
>      2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> actually unmaps the KVM SEPT/NPT entries.
>            -> guest_memfd invokes IOMMU invalidation callback to zap
> the secure IOMMU entries.
If guest_memfd could determine whether a page is used for DMA purposes before
attempting the actual unmap, it could reject and fail the conversion earlier,
thereby keeping IOMMU/S-EPT mappings intact.

This could prevent the conversion from partially failing.

>      3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> shareability and then splits the folios if needed
>      4) Userspace invokes IOMMU map operation to map the ranges in
> non-secure IOMMU.
> 
> Shared to private conversion via kvm_gmem_convert_range() -
>     1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> on each bound memslot overlapping with the range
>      2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> > actually unmaps the host mappings which will unmap the KVM non-secure
> EPT/NPT entries.
>          -> guest_memfd invokes IOMMU invalidation callback to zap the
> non-secure IOMMU entries.
>      3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> shareability and then merges the folios if needed.
>      4) Userspace invokes IOMMU map operation to map the ranges in secure IOMMU.
> 
> There should be a way to block external IOMMU pagetable updates while
> guest_memfd is performing conversion e.g. something like
> kvm_invalidate_begin()/end().
> 
> > > is going to be flushed on a page conversion anyway (the RMPUPDATE
> > > instruction does that). All this is about AMD's x86 though.
> >
> > The iommu should not be using the VMA to manage the mapping. It should
> 
> +1.
> 
> > be directly linked to the guestmemfd in some way that does not disturb
> > its operations. I imagine there would be some kind of invalidation
> > callback directly to the iommu.
> >
> > Presumably that invalidation call back can include a reason for the
> > invalidation (addr change, shared/private conversion, etc)
> >
> > I'm not sure how we will figure out which case is which but guestmemfd
> > should allow the iommu to plug in either invalidation scheme..
> >
> > Probably invalidation should be a global to the FD thing, I imagine
> > that once invalidation is established the iommu will not be
> > incrementing page refcounts.
> 
> +1.
> 
> >
> > Jason
> 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-07-02  8:35                 ` Yan Zhao
@ 2025-07-02 13:54                   ` Vishal Annapurve
  2025-07-02 14:13                     ` Jason Gunthorpe
  2025-07-16 22:22                   ` Ackerley Tng
  1 sibling, 1 reply; 231+ messages in thread
From: Vishal Annapurve @ 2025-07-02 13:54 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Jason Gunthorpe, Alexey Kardashevskiy, Fuad Tabba, Ackerley Tng,
	kvm, linux-mm, linux-kernel, x86, linux-fsdevel, ajones, akpm,
	amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, thomas.lendacky,
	usama.arif, vbabka, viro, vkuznets, wei.w.wang, will, willy,
	xiaoyao.li, yilun.xu, yuzenghui, zhiquan1.li

On Wed, Jul 2, 2025 at 1:38 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Tue, Jun 24, 2025 at 07:10:38AM -0700, Vishal Annapurve wrote:
> > On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
> > >
> > > > Now, I am rebasing my RFC on top of this patchset and it fails in
> > > > kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
> > > > folios in my RFC.
> > > >
> > > > So what is the expected sequence here? The userspace unmaps a DMA
> > > > page and maps it back right away, all from the userspace? The end
> > > > result will be the exactly same which seems useless. And IOMMU TLB
> >
> >  As Jason described, ideally IOMMU just like KVM, should just:
> > 1) Directly rely on guest_memfd for pinning -> no page refcounts taken
> > by IOMMU stack
> In TDX Connect, the TDX module and TDs do not trust the VMM. So, it's up to
> the TDs to inform the TDX module which pages they use for DMA purposes.
> So, if a page is regarded as pinned by TDs for DMA, the TDX module will fail the
> unmap of the pages from S-EPT.
>
> If IOMMU side does not increase refcount, IMHO, some way to indicate that
> certain PFNs are used by TDs for DMA is still required, so guest_memfd can
> reject the request before attempting the actual unmap.

So it looks like guest_memfd will need an interface with KVM/IOMMU
backends to check if unmapping can succeed. And if unmapping still
fails, there should be a way for KVM/IOMMU backends to kill the TD and
any TDIs bound to that TD.
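
As a rough sketch of what such an interface could look like (entirely
hypothetical names, only illustrating the "check before unmap, escalate on
failure" idea):

struct gmem_unmap_ops {
        /*
         * Asked before guest_memfd commits to a conversion: return 0 if the
         * backend (KVM S-EPT, secure IOMMU) can release the range, -EBUSY
         * if e.g. the TD has the pages pinned for DMA.
         */
        int (*can_unmap)(struct inode *inode, pgoff_t start, pgoff_t nr,
                         void *priv);
        /*
         * Invoked if an unmap still fails after the check passed, so the
         * backend can tear down the TD and any TDIs bound to it.
         */
        void (*unmap_failed)(struct inode *inode, pgoff_t start, pgoff_t nr,
                             void *priv);
};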

> Otherwise, the unmap of TD-DMA-pinned pages will fail.
>
> When this kind of unmap failure happens, it also doesn't help for the host to
> retry the unmap without the TD unpinning the pages first.
>

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-07-02 13:54                   ` Vishal Annapurve
@ 2025-07-02 14:13                     ` Jason Gunthorpe
  2025-07-02 14:32                       ` Vishal Annapurve
  0 siblings, 1 reply; 231+ messages in thread
From: Jason Gunthorpe @ 2025-07-02 14:13 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Yan Zhao, Alexey Kardashevskiy, Fuad Tabba, Ackerley Tng, kvm,
	linux-mm, linux-kernel, x86, linux-fsdevel, ajones, akpm,
	amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, thomas.lendacky,
	usama.arif, vbabka, viro, vkuznets, wei.w.wang, will, willy,
	xiaoyao.li, yilun.xu, yuzenghui, zhiquan1.li

On Wed, Jul 02, 2025 at 06:54:10AM -0700, Vishal Annapurve wrote:
> On Wed, Jul 2, 2025 at 1:38 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >
> > On Tue, Jun 24, 2025 at 07:10:38AM -0700, Vishal Annapurve wrote:
> > > On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > >
> > > > On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
> > > >
> > > > > Now, I am rebasing my RFC on top of this patchset and it fails in
> > > > > kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
> > > > > folios in my RFC.
> > > > >
> > > > > So what is the expected sequence here? The userspace unmaps a DMA
> > > > > page and maps it back right away, all from the userspace? The end
> > > > > result will be the exactly same which seems useless. And IOMMU TLB
> > >
> > >  As Jason described, ideally IOMMU just like KVM, should just:
> > > 1) Directly rely on guest_memfd for pinning -> no page refcounts taken
> > > by IOMMU stack
> > In TDX Connect, the TDX module and TDs do not trust the VMM. So, it's up to
> > the TDs to inform the TDX module which pages they use for DMA purposes.
> > So, if a page is regarded as pinned by TDs for DMA, the TDX module will fail the
> > unmap of the pages from S-EPT.

I don't see this as having much to do with iommufd.

iommufd will somehow support the T=1 iommu inside the TDX module but
it won't have an IOAS for it since the VMM does not control the
translation.

The discussion here is for the T=0 iommu which is controlled by
iommufd and does have an IOAS. It should be populated with all the
shared pages from the guestmemfd.

> > If IOMMU side does not increase refcount, IMHO, some way to indicate that
> > certain PFNs are used by TDs for DMA is still required, so guest_memfd can
> > reject the request before attempting the actual unmap.

This has to be dealt with between the TDX module and KVM. When KVM
gives pages to become secure, it may not be able to get them back.

This problem has nothing to do with iommufd.

But generally I expect that the T=1 iommu follows the S-EPT entirely
and there is no notion of pages "locked for dma". If DMA is ongoing
and a page is made non-secure then the DMA fails.

Obviously in a mode where there is a vPCI device we will need all the
pages to be pinned in the guestmemfd to prevent any kind of
migrations. Only shared/private conversions should change the page
around.

Maybe this needs to be an integral functionality in guestmemfd?

Jason

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-07-02 14:13                     ` Jason Gunthorpe
@ 2025-07-02 14:32                       ` Vishal Annapurve
  2025-07-10 10:50                         ` Xu Yilun
  0 siblings, 1 reply; 231+ messages in thread
From: Vishal Annapurve @ 2025-07-02 14:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yan Zhao, Alexey Kardashevskiy, Fuad Tabba, Ackerley Tng, kvm,
	linux-mm, linux-kernel, x86, linux-fsdevel, ajones, akpm,
	amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, thomas.lendacky,
	usama.arif, vbabka, viro, vkuznets, wei.w.wang, will, willy,
	xiaoyao.li, yilun.xu, yuzenghui, zhiquan1.li

On Wed, Jul 2, 2025 at 7:13 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Jul 02, 2025 at 06:54:10AM -0700, Vishal Annapurve wrote:
> > On Wed, Jul 2, 2025 at 1:38 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >
> > > On Tue, Jun 24, 2025 at 07:10:38AM -0700, Vishal Annapurve wrote:
> > > > On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > >
> > > > > On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
> > > > >
> > > > > > Now, I am rebasing my RFC on top of this patchset and it fails in
> > > > > > kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
> > > > > > folios in my RFC.
> > > > > >
> > > > > > So what is the expected sequence here? The userspace unmaps a DMA
> > > > > > page and maps it back right away, all from the userspace? The end
> > > > > > result will be the exactly same which seems useless. And IOMMU TLB
> > > >
> > > >  As Jason described, ideally IOMMU just like KVM, should just:
> > > > 1) Directly rely on guest_memfd for pinning -> no page refcounts taken
> > > > by IOMMU stack
> > > In TDX Connect, the TDX module and TDs do not trust the VMM. So, it's up to
> > > the TDs to inform the TDX module which pages they use for DMA purposes.
> > > So, if a page is regarded as pinned by TDs for DMA, the TDX module will fail the
> > > unmap of the pages from S-EPT.
>
> I don't see this as having much to do with iommufd.
>
> iommufd will somehow support the T=1 iommu inside the TDX module but
> it won't have an IOAS for it since the VMM does not control the
> translation.
>
> The discussion here is for the T=0 iommu which is controlled by
> iommufd and does have an IOAS. It should be populated with all the
> shared pages from the guestmemfd.
>
> > > If IOMMU side does not increase refcount, IMHO, some way to indicate that
> > > certain PFNs are used by TDs for DMA is still required, so guest_memfd can
> > > reject the request before attempting the actual unmap.
>
> This has to be dealt with between the TDX module and KVM. When KVM
> gives pages to become secure, it may not be able to get them back.
>
> This problem has nothing to do with iommufd.
>
> But generally I expect that the T=1 iommu follows the S-EPT entirely
> and there is no notion of pages "locked for dma". If DMA is ongoing
> and a page is made non-secure then the DMA fails.
>
> Obviously in a mode where there is a vPCI device we will need all the
> pages to be pinned in the guestmemfd to prevent any kind of
> migrations. Only shared/private conversions should change the page
> around.

Yes, guest_memfd ensures that all the faulted-in pages (irrespective
of shared or private ranges) are not migratable. We already have a
similar restriction with CPU accesses to encrypted memory ranges that
need arch specific protocols to migrate memory contents.

>
> Maybe this needs to be an integral functionality in guestmemfd?
>
> Jason

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-06-11 21:51     ` Ackerley Tng
@ 2025-07-02 23:25       ` Michael Roth
  2025-07-03  0:46         ` Vishal Annapurve
  0 siblings, 1 reply; 231+ messages in thread
From: Michael Roth @ 2025-07-02 23:25 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, mpe, muchun.song, nikunj,
	nsaenz, oliver.upton, palmer, pankaj.gupta, paul.walmsley,
	pbonzini, pdurrant, peterx, pgonda, pvorel, qperret,
	quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li

On Wed, Jun 11, 2025 at 02:51:38PM -0700, Ackerley Tng wrote:
> Michael Roth <michael.roth@amd.com> writes:
> 
> > On Wed, May 14, 2025 at 04:41:41PM -0700, Ackerley Tng wrote:
> >> Track guest_memfd memory's shareability status within the inode as
> >> opposed to the file, since it is property of the guest_memfd's memory
> >> contents.
> >> 
> >> Shareability is a property of the memory and is indexed using the
> >> page's index in the inode. Because shareability is the memory's
> >> property, it is stored within guest_memfd instead of within KVM, like
> >> in kvm->mem_attr_array.
> >> 
> >> KVM_MEMORY_ATTRIBUTE_PRIVATE in kvm->mem_attr_array must still be
> >> retained to allow VMs to only use guest_memfd for private memory and
> >> some other memory for shared memory.
> >> 
> >> Not all use cases require guest_memfd() to be shared with the host
> >> when first created. Add a new flag, GUEST_MEMFD_FLAG_INIT_PRIVATE,
> >> which when set on KVM_CREATE_GUEST_MEMFD, initializes the memory as
> >> private to the guest, and therefore not mappable by the
> >> host. Otherwise, memory is shared until explicitly converted to
> >> private.
> >> 
> >> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> >> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> >> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> >> Co-developed-by: Fuad Tabba <tabba@google.com>
> >> Signed-off-by: Fuad Tabba <tabba@google.com>
> >> Change-Id: If03609cbab3ad1564685c85bdba6dcbb6b240c0f
> >> ---
> >>  Documentation/virt/kvm/api.rst |   5 ++
> >>  include/uapi/linux/kvm.h       |   2 +
> >>  virt/kvm/guest_memfd.c         | 124 ++++++++++++++++++++++++++++++++-
> >>  3 files changed, 129 insertions(+), 2 deletions(-)
> >> 
> >> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> >> index 86f74ce7f12a..f609337ae1c2 100644
> >> --- a/Documentation/virt/kvm/api.rst
> >> +++ b/Documentation/virt/kvm/api.rst
> >> @@ -6408,6 +6408,11 @@ belonging to the slot via its userspace_addr.
> >>  The use of GUEST_MEMFD_FLAG_SUPPORT_SHARED will not be allowed for CoCo VMs.
> >>  This is validated when the guest_memfd instance is bound to the VM.
> >>  
> >> +If the capability KVM_CAP_GMEM_CONVERSIONS is supported, then the 'flags' field
> >> +supports GUEST_MEMFD_FLAG_INIT_PRIVATE.  Setting GUEST_MEMFD_FLAG_INIT_PRIVATE
> >> +will initialize the memory for the guest_memfd as guest-only and not faultable
> >> +by the host.
> >> +
> >
> > KVM_CAP_GMEM_CONVERSION doesn't get introduced until later, so it seems
> > like this flag should be deferred until that patch is in place. Is it
> > really needed at that point though? Userspace would be able to set the
> > initial state via KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls.
> >
> 
> I can move this change to the later patch. Thanks! Will fix in the next
> revision.
> 
> > The mtree contents seems to get stored in the same manner in either case so
> > performance-wise only the overhead of a few userspace<->kernel switches
> > would be saved. Are there any other reasons?
> >
> > Otherwise, maybe just settle on SHARED as a documented default (since at
> > least non-CoCo VMs would be able to reliably benefit) and let
> > CoCo/GUEST_MEMFD_FLAG_SUPPORT_SHARED VMs set PRIVATE at whatever
> > granularity makes sense for the architecture/guest configuration.
> >
> 
> Because shared pages are split once any memory is allocated, having a
> way to INIT_PRIVATE could avoid the split and then merge on
> conversion. I feel that is enough value to have this config flag, what
> do you think?
> 
> I guess we could also have userspace be careful not to do any allocation
> before converting.

I assume we do want to support things like preallocating guest memory, so
I'm not sure this approach is feasible for avoiding splits.

But I feel like we might be working around a deeper issue here, which is
that we are pre-emptively splitting anything that *could* be mapped into
userspace (i.e. allocated+shared/mixed), rather than splitting when
necessary.

I know that was the plan laid out in the guest_memfd calls, but I've run
into a couple instances that have me thinking we should revisit this.

1) Some of the recent guest_memfd discussion seems to be gravitating towards
   having userspace populate/initialize the guest memory payload prior to boot
   via mmap()'ing the shared guest_memfd pages, so things work the same as
   they would for initializing a normal VM's memory payload (rather than
   relying on back-channels in the kernel to copy user data into guest_memfd
   pages).

   When you do this though, for an SNP guest at least, that memory
   acceptance is done in chunks of 4MB (with accept_memory=lazy), and
   because that will put each 1GB page into an allocated+mixed state,
   we end up splitting every 1GB page down to 4K and the guest can't even
   accept/PVALIDATE it at 2MB at that point, even if userspace doesn't touch
   anything in the range. At some point the guest will convert/accept
   the entire range, at which point we could merge, but for SNP we'd
   need guest cooperation to actually use a larger granularity in stage2
   page tables at that point since RMP entries are effectively all split
   to 4K.

   I understand the intent is to default to private where this wouldn't
   be an issue, and we could punt to userspace to deal with it, but it
   feels like an artificial restriction to place on userspace. And if we
   do want to allow/expect guest_memfd contents to be initialized pre-boot
   just like normal memory, then userspace would need to jump through
   some hoops:

   - if defaulting to private: add hooks to convert each range that's being
     modified to a shared state prior to writing to it
   - if defaulting to shared: initialize memory in-place, then convert
     everything else to private to avoid unnecessarily splitting folios
     at run-time

   It feels like implementation details are bleeding out into the API
   to some degree here (e.g. we'd probably at least need to document
   this so users know how to take proper advantage of hugepage support).

2) There are some use-cases for HugeTLB + CoCo that have come to my
   attention recently that put a lot of weight on still being able to
   maximize mapping/hugepage size when accessing shared mem from userspace,
   e.g. for certain DPDK workloads that accessed shared guest buffers
   from host userspace. We don't really have a story for this, and I
   wouldn't expect us to at this stage, but I think it ties into #1 so
   might be worth considering in that context.
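
To make the second hoop above concrete, the "default to shared" flow
would look roughly like the sketch below. Sketch only:
gmem_convert_private() stands in for whatever the conversion ioctl from
this series ends up looking like, and gmem_fd/payload_off/payload_len/
gmem_size are placeholders.

  void *p = mmap(NULL, payload_len, PROT_READ | PROT_WRITE,
                 MAP_SHARED, gmem_fd, payload_off);
  memcpy(p, payload, payload_len);        /* initialize in place */
  munmap(p, payload_len);

  /* Convert everything that wasn't populated back to private. */
  gmem_convert_private(gmem_fd, 0, payload_off);
  gmem_convert_private(gmem_fd, payload_off + payload_len,
                       gmem_size - (payload_off + payload_len));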

I'm still fine with the current approach as a starting point, but I'm
wondering if improving both #1/#2 might not be so bad and maybe even
give us some more flexibility (for instance, Sean had mentioned leaving
open the option of tracking more than just shareability/mappability, and
if there is split/merge logic associated with those transitions then
re-scanning each of these attributes for a 1G range seems like it could
benefit from some sort of intermediate data structure to help determine
things like what mapping granularity is available for guest/userspace
for a particular range).

One approach I was thinking of was that we introduce a data structure
similar to KVM's memslot->arch.lpage_info[], where we store information
about what 1G/2M ranges are shared/private/mixed, and then instead of
splitting ahead of time we just record that state into this data
structure (using the same write lock as with the
shareability/mappability state), and then at *fault* time we split the
folio if our lpage_info-like data structure says the range is mixed.
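
Roughly the kind of thing I mean, as a sketch only with all names made
up for illustration (KVM's kvm_lpage_info just keeps a disallow count
per large-page range; this would instead summarize shared/private/mixed
per 2M/1G range within the inode):

  enum gmem_range_state {
          GMEM_RANGE_PRIVATE,
          GMEM_RANGE_SHARED,
          GMEM_RANGE_MIXED,
  };

  struct gmem_lpage_info {
          enum gmem_range_state *state_2m; /* one entry per 2M range */
          enum gmem_range_state *state_1g; /* one entry per 1G range */
  };

  /*
   * Fault paths consult this instead of splitting eagerly at
   * conversion time: split the backing folio only down to the largest
   * order that isn't mixed, and only when something actually faults.
   */
  static unsigned int gmem_max_map_order(struct gmem_lpage_info *info,
                                         pgoff_t index)
  {
          if (info->state_1g[index >> (PUD_SHIFT - PAGE_SHIFT)] != GMEM_RANGE_MIXED)
                  return PUD_SHIFT - PAGE_SHIFT;
          if (info->state_2m[index >> (PMD_SHIFT - PAGE_SHIFT)] != GMEM_RANGE_MIXED)
                  return PMD_SHIFT - PAGE_SHIFT;
          return 0;
  }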

Then, if the guest converts a 2M/4M range to private while lazily accepting
(for instance), we can still keep the folio intact as 1GB, but mark
the 1G range in the lpage_info-like data structure as mixed so that we
still inform KVM/etc. they need to map it as 2MB or lower in stage2
page tables. In that case, even at guest fault-time, we can leave the
folio unsplit until userspace tries to touch it (though in most cases
it never will and we can keep most of the guest's 1G intact for the
duration of its lifetime).

On the userspace side, another nice thing is that if we see the 1G range
is in a mixed state but the 2M range is all-shared, we can still leave
the folio at 2M,
and I think the refcount'ing logic would still work for the most part,
which makes #2 a bit easier to implement as well.

And of course, we wouldn't need the INIT_PRIVATE then since we are only
splitting when necessary.

But I guess this all comes down to how much extra pain there is in
tracking a 1G folio that's been split into a mix of 2MB/4K regions,
but I think we'd get a lot more mileage out of getting that working and
just completely stripping out all of the merging logic for initial
implementation (other than at cleanup time), so maybe complexity-wise
it balances out a bit?

Thanks,

Mike

> 
> >>  See KVM_SET_USER_MEMORY_REGION2 for additional details.
> >>  
> >>  4.143 KVM_PRE_FAULT_MEMORY
> >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> index 4cc824a3a7c9..d7df312479aa 100644
> >> --- a/include/uapi/linux/kvm.h
> >> +++ b/include/uapi/linux/kvm.h
> >> @@ -1567,7 +1567,9 @@ struct kvm_memory_attributes {
> >>  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> >>  
> >>  #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
> >> +
> >>  #define GUEST_MEMFD_FLAG_SUPPORT_SHARED	(1UL << 0)
> >> +#define GUEST_MEMFD_FLAG_INIT_PRIVATE	(1UL << 1)
> >>  
> >>  struct kvm_create_guest_memfd {
> >>  	__u64 size;
> >> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> >> index 239d0f13dcc1..590932499eba 100644
> >> --- a/virt/kvm/guest_memfd.c
> >> +++ b/virt/kvm/guest_memfd.c
> >> @@ -4,6 +4,7 @@
> >>  #include <linux/falloc.h>
> >>  #include <linux/fs.h>
> >>  #include <linux/kvm_host.h>
> >> +#include <linux/maple_tree.h>
> >>  #include <linux/pseudo_fs.h>
> >>  #include <linux/pagemap.h>
> >>  
> >> @@ -17,6 +18,24 @@ struct kvm_gmem {
> >>  	struct list_head entry;
> >>  };
> >>  
> >> +struct kvm_gmem_inode_private {
> >> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> >> +	struct maple_tree shareability;
> >> +#endif
> >> +};
> >> +
> >> +enum shareability {
> >> +	SHAREABILITY_GUEST = 1,	/* Only the guest can map (fault) folios in this range. */
> >> +	SHAREABILITY_ALL = 2,	/* Both guest and host can fault folios in this range. */
> >> +};
> >> +
> >> +static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index);
> >> +
> >> +static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
> >> +{
> >> +	return inode->i_mapping->i_private_data;
> >> +}
> >> +
> >>  /**
> >>   * folio_file_pfn - like folio_file_page, but return a pfn.
> >>   * @folio: The folio which contains this index.
> >> @@ -29,6 +48,58 @@ static inline kvm_pfn_t folio_file_pfn(struct folio *folio, pgoff_t index)
> >>  	return folio_pfn(folio) + (index & (folio_nr_pages(folio) - 1));
> >>  }
> >>  
> >> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> >> +
> >> +static int kvm_gmem_shareability_setup(struct kvm_gmem_inode_private *private,
> >> +				      loff_t size, u64 flags)
> >> +{
> >> +	enum shareability m;
> >> +	pgoff_t last;
> >> +
> >> +	last = (size >> PAGE_SHIFT) - 1;
> >> +	m = flags & GUEST_MEMFD_FLAG_INIT_PRIVATE ? SHAREABILITY_GUEST :
> >> +						    SHAREABILITY_ALL;
> >> +	return mtree_store_range(&private->shareability, 0, last, xa_mk_value(m),
> >> +				 GFP_KERNEL);
> >
> > One really nice thing about using a maple tree is that it should get rid
> > of a fairly significant startup delay for SNP/TDX when the entire xarray gets
> > initialized with private attribute entries via KVM_SET_MEMORY_ATTRIBUTES
> > (which is the current QEMU default behavior).
> >
> > I'd originally advocated for sticking with the xarray implementation Fuad was
> > using until we'd determined we really need it for HugeTLB support, but I'm
> > sort of thinking it's already justified just based on the above.
> >
> > Maybe it would make sense for KVM memory attributes too?
> >
> >> +}
> >> +
> >> +static enum shareability kvm_gmem_shareability_get(struct inode *inode,
> >> +						 pgoff_t index)
> >> +{
> >> +	struct maple_tree *mt;
> >> +	void *entry;
> >> +
> >> +	mt = &kvm_gmem_private(inode)->shareability;
> >> +	entry = mtree_load(mt, index);
> >> +	WARN(!entry,
> >> +	     "Shareability should always be defined for all indices in inode.");
> >> +
> >> +	return xa_to_value(entry);
> >> +}
> >> +
> >> +static struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
> >> +{
> >> +	if (kvm_gmem_shareability_get(inode, index) != SHAREABILITY_ALL)
> >> +		return ERR_PTR(-EACCES);
> >> +
> >> +	return kvm_gmem_get_folio(inode, index);
> >> +}
> >> +
> >> +#else
> >> +
> >> +static int kvm_gmem_shareability_setup(struct maple_tree *mt, loff_t size, u64 flags)
> >> +{
> >> +	return 0;
> >> +}
> >> +
> >> +static inline struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
> >> +{
> >> +	WARN_ONCE("Unexpected call to get shared folio.")
> >> +	return NULL;
> >> +}
> >> +
> >> +#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> >> +
> >>  static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
> >>  				    pgoff_t index, struct folio *folio)
> >>  {
> >> @@ -333,7 +404,7 @@ static vm_fault_t kvm_gmem_fault_shared(struct vm_fault *vmf)
> >>  
> >>  	filemap_invalidate_lock_shared(inode->i_mapping);
> >>  
> >> -	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> >> +	folio = kvm_gmem_get_shared_folio(inode, vmf->pgoff);
> >>  	if (IS_ERR(folio)) {
> >>  		int err = PTR_ERR(folio);
> >>  
> >> @@ -420,8 +491,33 @@ static struct file_operations kvm_gmem_fops = {
> >>  	.fallocate	= kvm_gmem_fallocate,
> >>  };
> >>  
> >> +static void kvm_gmem_free_inode(struct inode *inode)
> >> +{
> >> +	struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
> >> +
> >> +	kfree(private);
> >> +
> >> +	free_inode_nonrcu(inode);
> >> +}
> >> +
> >> +static void kvm_gmem_destroy_inode(struct inode *inode)
> >> +{
> >> +	struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
> >> +
> >> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> >> +	/*
> >> +	 * mtree_destroy() can't be used within rcu callback, hence can't be
> >> +	 * done in ->free_inode().
> >> +	 */
> >> +	if (private)
> >> +		mtree_destroy(&private->shareability);
> >> +#endif
> >> +}
> >> +
> >>  static const struct super_operations kvm_gmem_super_operations = {
> >>  	.statfs		= simple_statfs,
> >> +	.destroy_inode	= kvm_gmem_destroy_inode,
> >> +	.free_inode	= kvm_gmem_free_inode,
> >>  };
> >>  
> >>  static int kvm_gmem_init_fs_context(struct fs_context *fc)
> >> @@ -549,12 +645,26 @@ static const struct inode_operations kvm_gmem_iops = {
> >>  static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
> >>  						      loff_t size, u64 flags)
> >>  {
> >> +	struct kvm_gmem_inode_private *private;
> >>  	struct inode *inode;
> >> +	int err;
> >>  
> >>  	inode = alloc_anon_secure_inode(kvm_gmem_mnt->mnt_sb, name);
> >>  	if (IS_ERR(inode))
> >>  		return inode;
> >>  
> >> +	err = -ENOMEM;
> >> +	private = kzalloc(sizeof(*private), GFP_KERNEL);
> >> +	if (!private)
> >> +		goto out;
> >> +
> >> +	mt_init(&private->shareability);
> >> +	inode->i_mapping->i_private_data = private;
> >> +
> >> +	err = kvm_gmem_shareability_setup(private, size, flags);
> >> +	if (err)
> >> +		goto out;
> >> +
> >>  	inode->i_private = (void *)(unsigned long)flags;
> >>  	inode->i_op = &kvm_gmem_iops;
> >>  	inode->i_mapping->a_ops = &kvm_gmem_aops;
> >> @@ -566,6 +676,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
> >>  	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> >>  
> >>  	return inode;
> >> +
> >> +out:
> >> +	iput(inode);
> >> +
> >> +	return ERR_PTR(err);
> >>  }
> >>  
> >>  static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
> >> @@ -654,6 +769,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> >>  	if (kvm_arch_vm_supports_gmem_shared_mem(kvm))
> >>  		valid_flags |= GUEST_MEMFD_FLAG_SUPPORT_SHARED;
> >>  
> >> +	if (flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED)
> >> +		valid_flags |= GUEST_MEMFD_FLAG_INIT_PRIVATE;
> >> +
> >>  	if (flags & ~valid_flags)
> >>  		return -EINVAL;
> >>  
> >> @@ -842,6 +960,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> >>  	if (!file)
> >>  		return -EFAULT;
> >>  
> >> +	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> >> +
> >
> > I like the idea of using a write-lock/read-lock to protect write/read access
> > to shareability state (though maybe not necessarily re-using filemap's
> > invalidate lock), it's simple and still allows concurrent faulting in of gmem
> > pages. One issue on the SNP side (which also came up in one of the gmem calls)
> > is if we introduce support for tracking preparedness as discussed (e.g. via a
> > new SHAREABILITY_GUEST_PREPARED state) the
> > SHAREABILITY_GUEST->SHAREABILITY_GUEST_PREPARED transition would occur at
> > fault-time, and so would need to take the write-lock and no longer allow for
> > concurrent fault-handling.
> >
> > I was originally planning on introducing a new rw_semaphore with similar
> > semantics to the rw_lock that Fuad previously had in his restricted mmap
> > series[1] (and similar semantics to the filemap invalidate lock here). The main
> > difference, to handle setting SHAREABILITY_GUEST_PREPARED within fault paths,
> > was that in the case of a folio being present for an index, the folio lock would
> > also need to be held in order to update the shareability state. Because
> > of that, fault paths (which will always either have or allocate a folio,
> > basically) can rely on the folio lock to guard shareability state in a more
> > granular way and so can avoid a global write lock.
> >
> > They would still need to hold the read lock to access the tree however.
> > Or more specifically, any paths that could allocate a folio need to take
> > a read lock so there isn't a TOCTOU situation where shareability is
> > being updated for an index for which a folio hasn't been allocated, but
> > then just afterward the folio gets faulted in/allocated while the
> > shareability state is already being updated, with the understanding that
> > there was no folio around that needed locking.
> >
> > I had a branch with in-place conversion support for SNP[2] that added this
> > lock reworking on top of Fuad's series along with preparation tracking,
> > but I'm now planning to rebase that on top of the patches from this
> > series that Sean mentioned[3] earlier:
> >
> >   KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION
> >   KVM: Query guest_memfd for private/shared status
> >   KVM: guest_memfd: Skip LRU for guest_memfd folios
> >   KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
> >   KVM: guest_memfd: Introduce and use shareability to guard faulting
> >   KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
> >
> > but figured I'd mention it here in case there are other things to consider on
> > the locking front.
> >
> > Definitely agree with Sean though that it would be nice to start identifying a
> > common base of patches for the in-place conversion enablement for SNP, TDX, and
> > pKVM so the APIs/interfaces for hugepages can be handled separately.
> >
> > -Mike
> >
> > [1] https://lore.kernel.org/kvm/20250328153133.3504118-1-tabba@google.com/
> > [2] https://github.com/mdroth/linux/commits/mmap-swprot-v10-snp0-wip2/
> > [3] https://lore.kernel.org/kvm/aC86OsU2HSFZkJP6@google.com/
> >
> >>  	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
> >>  	if (IS_ERR(folio)) {
> >>  		r = PTR_ERR(folio);
> >> @@ -857,8 +977,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> >>  		*page = folio_file_page(folio, index);
> >>  	else
> >>  		folio_put(folio);
> >> -
> >>  out:
> >> +	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
> >>  	fput(file);
> >>  	return r;
> >>  }
> >> -- 
> >> 2.49.0.1045.g170613ef41-goog
> >> 
> 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-07-02 23:25       ` Michael Roth
@ 2025-07-03  0:46         ` Vishal Annapurve
  2025-07-03  0:52           ` Vishal Annapurve
  2025-07-03  4:12           ` Michael Roth
  0 siblings, 2 replies; 231+ messages in thread
From: Vishal Annapurve @ 2025-07-03  0:46 UTC (permalink / raw)
  To: Michael Roth
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

On Wed, Jul 2, 2025 at 4:25 PM Michael Roth <michael.roth@amd.com> wrote:
>
> On Wed, Jun 11, 2025 at 02:51:38PM -0700, Ackerley Tng wrote:
> > Michael Roth <michael.roth@amd.com> writes:
> >
> > > On Wed, May 14, 2025 at 04:41:41PM -0700, Ackerley Tng wrote:
> > >> Track guest_memfd memory's shareability status within the inode as
> > >> opposed to the file, since it is property of the guest_memfd's memory
> > >> contents.
> > >>
> > >> Shareability is a property of the memory and is indexed using the
> > >> page's index in the inode. Because shareability is the memory's
> > >> property, it is stored within guest_memfd instead of within KVM, like
> > >> in kvm->mem_attr_array.
> > >>
> > >> KVM_MEMORY_ATTRIBUTE_PRIVATE in kvm->mem_attr_array must still be
> > >> retained to allow VMs to only use guest_memfd for private memory and
> > >> some other memory for shared memory.
> > >>
> > >> Not all use cases require guest_memfd() to be shared with the host
> > >> when first created. Add a new flag, GUEST_MEMFD_FLAG_INIT_PRIVATE,
> > >> which when set on KVM_CREATE_GUEST_MEMFD, initializes the memory as
> > >> private to the guest, and therefore not mappable by the
> > >> host. Otherwise, memory is shared until explicitly converted to
> > >> private.
> > >>
> > >> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > >> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> > >> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> > >> Co-developed-by: Fuad Tabba <tabba@google.com>
> > >> Signed-off-by: Fuad Tabba <tabba@google.com>
> > >> Change-Id: If03609cbab3ad1564685c85bdba6dcbb6b240c0f
> > >> ---
> > >>  Documentation/virt/kvm/api.rst |   5 ++
> > >>  include/uapi/linux/kvm.h       |   2 +
> > >>  virt/kvm/guest_memfd.c         | 124 ++++++++++++++++++++++++++++++++-
> > >>  3 files changed, 129 insertions(+), 2 deletions(-)
> > >>
> > >> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > >> index 86f74ce7f12a..f609337ae1c2 100644
> > >> --- a/Documentation/virt/kvm/api.rst
> > >> +++ b/Documentation/virt/kvm/api.rst
> > >> @@ -6408,6 +6408,11 @@ belonging to the slot via its userspace_addr.
> > >>  The use of GUEST_MEMFD_FLAG_SUPPORT_SHARED will not be allowed for CoCo VMs.
> > >>  This is validated when the guest_memfd instance is bound to the VM.
> > >>
> > >> +If the capability KVM_CAP_GMEM_CONVERSIONS is supported, then the 'flags' field
> > >> +supports GUEST_MEMFD_FLAG_INIT_PRIVATE.  Setting GUEST_MEMFD_FLAG_INIT_PRIVATE
> > >> +will initialize the memory for the guest_memfd as guest-only and not faultable
> > >> +by the host.
> > >> +
> > >
> > > KVM_CAP_GMEM_CONVERSION doesn't get introduced until later, so it seems
> > > like this flag should be deferred until that patch is in place. Is it
> > > really needed at that point though? Userspace would be able to set the
> > > initial state via KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls.
> > >
> >
> > I can move this change to the later patch. Thanks! Will fix in the next
> > revision.
> >
> > > The mtree contents seems to get stored in the same manner in either case so
> > > performance-wise only the overhead of a few userspace<->kernel switches
> > > would be saved. Are there any other reasons?
> > >
> > > Otherwise, maybe just settle on SHARED as a documented default (since at
> > > least non-CoCo VMs would be able to reliably benefit) and let
> > > CoCo/GUEST_MEMFD_FLAG_SUPPORT_SHARED VMs set PRIVATE at whatever
> > > granularity makes sense for the architecture/guest configuration.
> > >
> >
> > Because shared pages are split once any memory is allocated, having a
> > way to INIT_PRIVATE could avoid the split and then merge on
> > conversion. I feel that is enough value to have this config flag, what
> > do you think?
> >
> > I guess we could also have userspace be careful not to do any allocation
> > before converting.
>
> I assume we do want to support things like preallocating guest memory so
> not sure this approach is feasible to avoid splits.
>
> But I feel like we might be working around a deeper issue here, which is
> that we are pre-emptively splitting anything that *could* be mapped into
> userspace (i.e. allocated+shared/mixed), rather than splitting when
> necessary.
>
> I know that was the plan laid out in the guest_memfd calls, but I've run
> into a couple instances that have me thinking we should revisit this.
>
> 1) Some of the recent guest_memfd seems to be gravitating towards having
>    userspace populate/initialize guest memory payload prior to boot via
>    mmap()'ing the shared guest_memfd pages so things work the same as
>    they would for initialized normal VM memory payload (rather than
>    relying on back-channels in the kernel to copy user data into guest_memfd
>    pages).
>
>    When you do this though, for an SNP guest at least, that memory
>    acceptance is done in chunks of 4MB (with accept_memory=lazy), and
>    because that will put each 1GB page into an allocated+mixed state,

I would like your help in understanding why we need to start
guest_memfd ranges as shared for SNP guests. guest_memfd ranges being
private should simply mean that those ranges are not faultable by
userspace.

Will the following work?
1) Userspace starts all guest_memfd ranges as private.
2) During early guest boot the guest starts issuing PSC requests for
converting memory from shared to private
    -> KVM forwards this request to userspace
    -> Userspace checks that the pages are already private and simply
does nothing (rough sketch below).
3) PVALIDATE from the guest on that memory will result in a guest_memfd
offset query, which will cause the RMP table entries to actually get
populated.
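
The userspace side of step 2 would be a handful of lines, roughly as
sketched below. This assumes KVM keeps forwarding the guest's PSC
requests as KVM_HC_MAP_GPA_RANGE hypercall exits (as I understand the
current SNP support does), run is the vcpu's struct kvm_run, and
gmem_range_is_private()/gmem_convert() are placeholders for the
query/convert interfaces being discussed in this series.

  if (run->exit_reason == KVM_EXIT_HYPERCALL &&
      run->hypercall.nr == KVM_HC_MAP_GPA_RANGE) {
          __u64 gpa = run->hypercall.args[0];
          __u64 nr_pages = run->hypercall.args[1];
          bool to_private = run->hypercall.args[2] & KVM_MAP_GPA_RANGE_ENCRYPTED;

          if (to_private && gmem_range_is_private(gpa, nr_pages)) {
                  /* Already private in guest_memfd: nothing to do. */
          } else {
                  gmem_convert(gpa, nr_pages, to_private);
          }
          run->hypercall.ret = 0;
  }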

>    we end up splitting every 1GB to 4K and the guest can't even
>    accept/PVALIDATE it 2MB at that point even if userspace doesn't touch
>    anything in the range. As some point the guest will convert/accept
>    the entire range, at which point we could merge, but for SNP we'd
>    need guest cooperation to actually use a higher-granularity in stage2
>    page tables at that point since RMP entries are effectively all split
>    to 4K.
>
>    I understand the intent is to default to private where this wouldn't
>    be an issue, and we could punt to userspace to deal with it, but it
>    feels like an artificial restriction to place on userspace. And if we
>    do want to allow/expect guest_memfd contents to be initialized pre-boot
>    just like normal memory, then userspace would need to jump through
>    some hoops:
>
>    - if defaulting to private: add hooks to convert each range that's being
>      modified to a shared state prior to writing to it

Why is that a problem?

>    - if defaulting to shared: initialize memory in-place, then covert
>      everything else to private to avoid unecessarily splitting folios
>      at run-time
>
>    It feels like implementations details are bleeding out into the API
>    to some degree here (e.g. we'd probably at least need to document
>    this so users know how to take proper advantage of hugepage support).

Does it make sense to always keep the default behavior as INIT_PRIVATE
for SNP VMs, even without using hugepages?

>
> 2) There are some use-cases for HugeTLB + CoCo that have come to my
>    attention recently that put a lot of weight on still being able to
>    maximize mapping/hugepage size when accessing shared mem from userspace,
>    e.g. for certain DPDK workloads that accessed shared guest buffers
>    from host userspace. We don't really have a story for this, and I
>    wouldn't expect us to at this stage, but I think it ties into #1 so
>    might be worth considering in that context.

The major problem I see here is that if anything in the kernel does a
GUP on shared memory ranges (which is very likely to happen), it would
be difficult to get those users to let go of the whole hugepage before
it can be split safely.

Another problem is that guest_memfd today doesn't support management of
large userspace page table mappings; this can turn out to be
significant work, judging by the hugetlb page table management
logic.

>
> I'm still fine with the current approach as a starting point, but I'm
> wondering if improving both #1/#2 might not be so bad and maybe even
> give us some more flexibility (for instance, Sean had mentioned leaving
> open the option of tracking more than just shareability/mappability, and
> if there is split/merge logic associated with those transitions then
> re-scanning each of these attributes for a 1G range seems like it could
> benefit from some sort of intermediate data structure to help determine
> things like what mapping granularity is available for guest/userspace
> for a particular range.
>
> One approach I was thinking of was that we introduce a data structure
> similar to KVM's memslot->arch.lpage_info() where we store information
> about what 1G/2M ranges are shared/private/mixed, and then instead of
> splitting ahead of time we just record that state into this data
> structure (using the same write lock as with the
> shareability/mappability state), and then at *fault* time we split the
> folio if our lpage_info-like data structure says the range is mixed.
>
> Then, if guest converts a 2M/4M range to private while lazilly-accepting
> (for instance), we can still keep the folio intact as 1GB, but mark
> the 1G range in the lpage_info-like data structure as mixed so that we
> still inform KVM/etc. they need to map it as 2MB or lower in stage2
> page tables. In that case, even at guest fault-time, we can leave the
> folio unsplit until userspace tries to touch it (though in most cases
> it never will and we can keep most of the guest's 1G intact for the
> duration of its lifetime).
>
> On the userspace side, another nice thing there is if we see 1G is in a
> mixed state, but 2M is all-shared, then we can still leave the folio as 2M,
> and I think the refcount'ing logic would still work for the most part,
> which makes #2 a bit easier to implement as well.
>
> And of course, we wouldn't need the INIT_PRIVATE then since we are only
> splitting when necessary.
>
> But I guess this all comes down to how much extra pain there is in
> tracking a 1G folio that's been split into a mixed of 2MB/4K regions,
> but I think we'd get a lot more mileage out of getting that working and
> just completely stripping out all of the merging logic for initial
> implementation (other than at cleanup time), so maybe complexity-wise
> it balances out a bit?
>
> Thanks,
>
> Mike
>
> >
> > >>  See KVM_SET_USER_MEMORY_REGION2 for additional details.
> > >>
> > >>  4.143 KVM_PRE_FAULT_MEMORY
> > >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > >> index 4cc824a3a7c9..d7df312479aa 100644
> > >> --- a/include/uapi/linux/kvm.h
> > >> +++ b/include/uapi/linux/kvm.h
> > >> @@ -1567,7 +1567,9 @@ struct kvm_memory_attributes {
> > >>  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> > >>
> > >>  #define KVM_CREATE_GUEST_MEMFD    _IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
> > >> +
> > >>  #define GUEST_MEMFD_FLAG_SUPPORT_SHARED   (1UL << 0)
> > >> +#define GUEST_MEMFD_FLAG_INIT_PRIVATE     (1UL << 1)
> > >>
> > >>  struct kvm_create_guest_memfd {
> > >>    __u64 size;
> > >> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > >> index 239d0f13dcc1..590932499eba 100644
> > >> --- a/virt/kvm/guest_memfd.c
> > >> +++ b/virt/kvm/guest_memfd.c
> > >> @@ -4,6 +4,7 @@
> > >>  #include <linux/falloc.h>
> > >>  #include <linux/fs.h>
> > >>  #include <linux/kvm_host.h>
> > >> +#include <linux/maple_tree.h>
> > >>  #include <linux/pseudo_fs.h>
> > >>  #include <linux/pagemap.h>
> > >>
> > >> @@ -17,6 +18,24 @@ struct kvm_gmem {
> > >>    struct list_head entry;
> > >>  };
> > >>
> > >> +struct kvm_gmem_inode_private {
> > >> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > >> +  struct maple_tree shareability;
> > >> +#endif
> > >> +};
> > >> +
> > >> +enum shareability {
> > >> +  SHAREABILITY_GUEST = 1, /* Only the guest can map (fault) folios in this range. */
> > >> +  SHAREABILITY_ALL = 2,   /* Both guest and host can fault folios in this range. */
> > >> +};
> > >> +
> > >> +static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index);
> > >> +
> > >> +static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
> > >> +{
> > >> +  return inode->i_mapping->i_private_data;
> > >> +}
> > >> +
> > >>  /**
> > >>   * folio_file_pfn - like folio_file_page, but return a pfn.
> > >>   * @folio: The folio which contains this index.
> > >> @@ -29,6 +48,58 @@ static inline kvm_pfn_t folio_file_pfn(struct folio *folio, pgoff_t index)
> > >>    return folio_pfn(folio) + (index & (folio_nr_pages(folio) - 1));
> > >>  }
> > >>
> > >> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > >> +
> > >> +static int kvm_gmem_shareability_setup(struct kvm_gmem_inode_private *private,
> > >> +                                loff_t size, u64 flags)
> > >> +{
> > >> +  enum shareability m;
> > >> +  pgoff_t last;
> > >> +
> > >> +  last = (size >> PAGE_SHIFT) - 1;
> > >> +  m = flags & GUEST_MEMFD_FLAG_INIT_PRIVATE ? SHAREABILITY_GUEST :
> > >> +                                              SHAREABILITY_ALL;
> > >> +  return mtree_store_range(&private->shareability, 0, last, xa_mk_value(m),
> > >> +                           GFP_KERNEL);
> > >
> > > One really nice thing about using a maple tree is that it should get rid
> > > of a fairly significant startup delay for SNP/TDX when the entire xarray gets
> > > initialized with private attribute entries via KVM_SET_MEMORY_ATTRIBUTES
> > > (which is the current QEMU default behavior).
> > >
> > > I'd originally advocated for sticking with the xarray implementation Fuad was
> > > using until we'd determined we really need it for HugeTLB support, but I'm
> > > sort of thinking it's already justified just based on the above.
> > >
> > > Maybe it would make sense for KVM memory attributes too?
> > >
> > >> +}
> > >> +
> > >> +static enum shareability kvm_gmem_shareability_get(struct inode *inode,
> > >> +                                           pgoff_t index)
> > >> +{
> > >> +  struct maple_tree *mt;
> > >> +  void *entry;
> > >> +
> > >> +  mt = &kvm_gmem_private(inode)->shareability;
> > >> +  entry = mtree_load(mt, index);
> > >> +  WARN(!entry,
> > >> +       "Shareability should always be defined for all indices in inode.");
> > >> +
> > >> +  return xa_to_value(entry);
> > >> +}
> > >> +
> > >> +static struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
> > >> +{
> > >> +  if (kvm_gmem_shareability_get(inode, index) != SHAREABILITY_ALL)
> > >> +          return ERR_PTR(-EACCES);
> > >> +
> > >> +  return kvm_gmem_get_folio(inode, index);
> > >> +}
> > >> +
> > >> +#else
> > >> +
> > >> +static int kvm_gmem_shareability_setup(struct maple_tree *mt, loff_t size, u64 flags)
> > >> +{
> > >> +  return 0;
> > >> +}
> > >> +
> > >> +static inline struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
> > >> +{
> > >> +  WARN_ONCE("Unexpected call to get shared folio.")
> > >> +  return NULL;
> > >> +}
> > >> +
> > >> +#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> > >> +
> > >>  static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
> > >>                                pgoff_t index, struct folio *folio)
> > >>  {
> > >> @@ -333,7 +404,7 @@ static vm_fault_t kvm_gmem_fault_shared(struct vm_fault *vmf)
> > >>
> > >>    filemap_invalidate_lock_shared(inode->i_mapping);
> > >>
> > >> -  folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> > >> +  folio = kvm_gmem_get_shared_folio(inode, vmf->pgoff);
> > >>    if (IS_ERR(folio)) {
> > >>            int err = PTR_ERR(folio);
> > >>
> > >> @@ -420,8 +491,33 @@ static struct file_operations kvm_gmem_fops = {
> > >>    .fallocate      = kvm_gmem_fallocate,
> > >>  };
> > >>
> > >> +static void kvm_gmem_free_inode(struct inode *inode)
> > >> +{
> > >> +  struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
> > >> +
> > >> +  kfree(private);
> > >> +
> > >> +  free_inode_nonrcu(inode);
> > >> +}
> > >> +
> > >> +static void kvm_gmem_destroy_inode(struct inode *inode)
> > >> +{
> > >> +  struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
> > >> +
> > >> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > >> +  /*
> > >> +   * mtree_destroy() can't be used within rcu callback, hence can't be
> > >> +   * done in ->free_inode().
> > >> +   */
> > >> +  if (private)
> > >> +          mtree_destroy(&private->shareability);
> > >> +#endif
> > >> +}
> > >> +
> > >>  static const struct super_operations kvm_gmem_super_operations = {
> > >>    .statfs         = simple_statfs,
> > >> +  .destroy_inode  = kvm_gmem_destroy_inode,
> > >> +  .free_inode     = kvm_gmem_free_inode,
> > >>  };
> > >>
> > >>  static int kvm_gmem_init_fs_context(struct fs_context *fc)
> > >> @@ -549,12 +645,26 @@ static const struct inode_operations kvm_gmem_iops = {
> > >>  static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
> > >>                                                  loff_t size, u64 flags)
> > >>  {
> > >> +  struct kvm_gmem_inode_private *private;
> > >>    struct inode *inode;
> > >> +  int err;
> > >>
> > >>    inode = alloc_anon_secure_inode(kvm_gmem_mnt->mnt_sb, name);
> > >>    if (IS_ERR(inode))
> > >>            return inode;
> > >>
> > >> +  err = -ENOMEM;
> > >> +  private = kzalloc(sizeof(*private), GFP_KERNEL);
> > >> +  if (!private)
> > >> +          goto out;
> > >> +
> > >> +  mt_init(&private->shareability);
> > >> +  inode->i_mapping->i_private_data = private;
> > >> +
> > >> +  err = kvm_gmem_shareability_setup(private, size, flags);
> > >> +  if (err)
> > >> +          goto out;
> > >> +
> > >>    inode->i_private = (void *)(unsigned long)flags;
> > >>    inode->i_op = &kvm_gmem_iops;
> > >>    inode->i_mapping->a_ops = &kvm_gmem_aops;
> > >> @@ -566,6 +676,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
> > >>    WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> > >>
> > >>    return inode;
> > >> +
> > >> +out:
> > >> +  iput(inode);
> > >> +
> > >> +  return ERR_PTR(err);
> > >>  }
> > >>
> > >>  static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
> > >> @@ -654,6 +769,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> > >>    if (kvm_arch_vm_supports_gmem_shared_mem(kvm))
> > >>            valid_flags |= GUEST_MEMFD_FLAG_SUPPORT_SHARED;
> > >>
> > >> +  if (flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED)
> > >> +          valid_flags |= GUEST_MEMFD_FLAG_INIT_PRIVATE;
> > >> +
> > >>    if (flags & ~valid_flags)
> > >>            return -EINVAL;
> > >>
> > >> @@ -842,6 +960,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > >>    if (!file)
> > >>            return -EFAULT;
> > >>
> > >> +  filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> > >> +
> > >
> > > I like the idea of using a write-lock/read-lock to protect write/read access
> > > to shareability state (though maybe not necessarily re-using filemap's
> > > invalidate lock), it's simple and still allows concurrent faulting in of gmem
> > > pages. One issue on the SNP side (which also came up in one of the gmem calls)
> > > is if we introduce support for tracking preparedness as discussed (e.g. via a
> > > new SHAREABILITY_GUEST_PREPARED state) the
> > > SHAREABILITY_GUEST->SHAREABILITY_GUEST_PREPARED transition would occur at
> > > fault-time, and so would need to take the write-lock and no longer allow for
> > > concurrent fault-handling.
> > >
> > > I was originally planning on introducing a new rw_semaphore with similar
> > > semantics to the rw_lock that Fuad previously had in his restricted mmap
> > > series[1] (and simiar semantics to filemap invalidate lock here). The main
> > > difference, to handle setting SHAREABILITY_GUEST_PREPARED within fault paths,
> > > was that in the case of a folio being present for an index, the folio lock would
> > > also need to be held in order to update the shareability state. Because
> > > of that, fault paths (which will always either have or allocate folio
> > > basically) can rely on the folio lock to guard shareability state in a more
> > > granular way and so can avoid a global write lock.
> > >
> > > They would still need to hold the read lock to access the tree however.
> > > Or more specifically, any paths that could allocate a folio need to take
> > > a read lock so there isn't a TOCTOU situation where shareability is
> > > being updated for an index for which a folio hasn't been allocated, but
> > > then just afterward the folio gets faulted in/allocated while the
> > > shareability state is already being updated, with the understanding that
> > > there was no folio around that needed locking.
> > >
> > > I had a branch with in-place conversion support for SNP[2] that added this
> > > lock reworking on top of Fuad's series along with preparation tracking,
> > > but I'm now planning to rebase that on top of the patches from this
> > > series that Sean mentioned[3] earlier:
> > >
> > >   KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION
> > >   KVM: Query guest_memfd for private/shared status
> > >   KVM: guest_memfd: Skip LRU for guest_memfd folios
> > >   KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
> > >   KVM: guest_memfd: Introduce and use shareability to guard faulting
> > >   KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
> > >
> > > but figured I'd mention it here in case there are other things to consider on
> > > the locking front.
> > >
> > > Definitely agree with Sean though that it would be nice to start identifying a
> > > common base of patches for the in-place conversion enablement for SNP, TDX, and
> > > pKVM so the APIs/interfaces for hugepages can be handled separately.
> > >
> > > -Mike
> > >
> > > [1] https://lore.kernel.org/kvm/20250328153133.3504118-1-tabba@google.com/
> > > [2] https://github.com/mdroth/linux/commits/mmap-swprot-v10-snp0-wip2/
> > > [3] https://lore.kernel.org/kvm/aC86OsU2HSFZkJP6@google.com/
> > >
> > >>    folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
> > >>    if (IS_ERR(folio)) {
> > >>            r = PTR_ERR(folio);
> > >> @@ -857,8 +977,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > >>            *page = folio_file_page(folio, index);
> > >>    else
> > >>            folio_put(folio);
> > >> -
> > >>  out:
> > >> +  filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
> > >>    fput(file);
> > >>    return r;
> > >>  }
> > >> --
> > >> 2.49.0.1045.g170613ef41-goog
> > >>
> >

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-07-03  0:46         ` Vishal Annapurve
@ 2025-07-03  0:52           ` Vishal Annapurve
  2025-07-03  4:12           ` Michael Roth
  1 sibling, 0 replies; 231+ messages in thread
From: Vishal Annapurve @ 2025-07-03  0:52 UTC (permalink / raw)
  To: Michael Roth
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

On Wed, Jul 2, 2025 at 5:46 PM Vishal Annapurve <vannapurve@google.com> wrote:
> ...
> >
> > 2) There are some use-cases for HugeTLB + CoCo that have come to my
> >    attention recently that put a lot of weight on still being able to
> >    maximize mapping/hugepage size when accessing shared mem from userspace,
> >    e.g. for certain DPDK workloads that accessed shared guest buffers
> >    from host userspace. We don't really have a story for this, and I
> >    wouldn't expect us to at this stage, but I think it ties into #1 so
> >    might be worth considering in that context.
>
> Major problem I see here is that if anything in the kernel does a GUP
> on shared memory ranges (which is very likely to happen), it would be
> difficult to get them to let go of the whole hugepage before it can be
> split safely.

The scenario I was alluding to here is the guest trying to convert a
subpage of a shared range backed by a hugepage to private.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-07-03  0:46         ` Vishal Annapurve
  2025-07-03  0:52           ` Vishal Annapurve
@ 2025-07-03  4:12           ` Michael Roth
  2025-07-03  5:10             ` Vishal Annapurve
  2025-08-12  8:23             ` Fuad Tabba
  1 sibling, 2 replies; 231+ messages in thread
From: Michael Roth @ 2025-07-03  4:12 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

On Wed, Jul 02, 2025 at 05:46:23PM -0700, Vishal Annapurve wrote:
> On Wed, Jul 2, 2025 at 4:25 PM Michael Roth <michael.roth@amd.com> wrote:
> >
> > On Wed, Jun 11, 2025 at 02:51:38PM -0700, Ackerley Tng wrote:
> > > Michael Roth <michael.roth@amd.com> writes:
> > >
> > > > On Wed, May 14, 2025 at 04:41:41PM -0700, Ackerley Tng wrote:
> > > >> Track guest_memfd memory's shareability status within the inode as
> > > >> opposed to the file, since it is property of the guest_memfd's memory
> > > >> contents.
> > > >>
> > > >> Shareability is a property of the memory and is indexed using the
> > > >> page's index in the inode. Because shareability is the memory's
> > > >> property, it is stored within guest_memfd instead of within KVM, like
> > > >> in kvm->mem_attr_array.
> > > >>
> > > >> KVM_MEMORY_ATTRIBUTE_PRIVATE in kvm->mem_attr_array must still be
> > > >> retained to allow VMs to only use guest_memfd for private memory and
> > > >> some other memory for shared memory.
> > > >>
> > > >> Not all use cases require guest_memfd() to be shared with the host
> > > >> when first created. Add a new flag, GUEST_MEMFD_FLAG_INIT_PRIVATE,
> > > >> which when set on KVM_CREATE_GUEST_MEMFD, initializes the memory as
> > > >> private to the guest, and therefore not mappable by the
> > > >> host. Otherwise, memory is shared until explicitly converted to
> > > >> private.
> > > >>
> > > >> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > > >> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> > > >> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> > > >> Co-developed-by: Fuad Tabba <tabba@google.com>
> > > >> Signed-off-by: Fuad Tabba <tabba@google.com>
> > > >> Change-Id: If03609cbab3ad1564685c85bdba6dcbb6b240c0f
> > > >> ---
> > > >>  Documentation/virt/kvm/api.rst |   5 ++
> > > >>  include/uapi/linux/kvm.h       |   2 +
> > > >>  virt/kvm/guest_memfd.c         | 124 ++++++++++++++++++++++++++++++++-
> > > >>  3 files changed, 129 insertions(+), 2 deletions(-)
> > > >>
> > > >> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > >> index 86f74ce7f12a..f609337ae1c2 100644
> > > >> --- a/Documentation/virt/kvm/api.rst
> > > >> +++ b/Documentation/virt/kvm/api.rst
> > > >> @@ -6408,6 +6408,11 @@ belonging to the slot via its userspace_addr.
> > > >>  The use of GUEST_MEMFD_FLAG_SUPPORT_SHARED will not be allowed for CoCo VMs.
> > > >>  This is validated when the guest_memfd instance is bound to the VM.
> > > >>
> > > >> +If the capability KVM_CAP_GMEM_CONVERSIONS is supported, then the 'flags' field
> > > >> +supports GUEST_MEMFD_FLAG_INIT_PRIVATE.  Setting GUEST_MEMFD_FLAG_INIT_PRIVATE
> > > >> +will initialize the memory for the guest_memfd as guest-only and not faultable
> > > >> +by the host.
> > > >> +
> > > >
> > > > KVM_CAP_GMEM_CONVERSION doesn't get introduced until later, so it seems
> > > > like this flag should be deferred until that patch is in place. Is it
> > > > really needed at that point though? Userspace would be able to set the
> > > > initial state via KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls.
> > > >
> > >
> > > I can move this change to the later patch. Thanks! Will fix in the next
> > > revision.
> > >
> > > > The mtree contents seems to get stored in the same manner in either case so
> > > > performance-wise only the overhead of a few userspace<->kernel switches
> > > > would be saved. Are there any other reasons?
> > > >
> > > > Otherwise, maybe just settle on SHARED as a documented default (since at
> > > > least non-CoCo VMs would be able to reliably benefit) and let
> > > > CoCo/GUEST_MEMFD_FLAG_SUPPORT_SHARED VMs set PRIVATE at whatever
> > > > granularity makes sense for the architecture/guest configuration.
> > > >
> > >
> > > Because shared pages are split once any memory is allocated, having a
> > > way to INIT_PRIVATE could avoid the split and then merge on
> > > conversion. I feel that is enough value to have this config flag, what
> > > do you think?
> > >
> > > I guess we could also have userspace be careful not to do any allocation
> > > before converting.

(Re-visiting this with the assumption that we *don't* intend to use mmap() to
populate memory (in which case you can pretty much ignore my previous
response))

I'm still not sure where the INIT_PRIVATE flag comes into play. For SNP,
userspace already defaults to marking everything private pretty close to
guest_memfd creation time, so the potential for allocations to occur
in-between seems small, but worth confirming.

But I know in the past there was a desire to ensure TDX/SNP could
support pre-allocating guest_memfd memory (and even pre-faulting via
KVM_PRE_FAULT_MEMORY), but I think that could still work, right? The
fallocate() handling could still avoid the split if the whole hugepage
is private, though there is a bit more potential for that fallocate()
to happen before userspace does the "manual" shared->private
conversion. I'll double-check on that aspect, but otherwise, is there
still any other need for it?
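
(For reference, the race-free ordering the flag would buy us is roughly
the sketch below; size/flags are the existing kvm_create_guest_memfd
fields, the GUEST_MEMFD_FLAG_* values are the ones proposed in this
patch, and vm_fd/gmem_size are placeholders.)

  struct kvm_create_guest_memfd args = {
          .size  = gmem_size,
          .flags = GUEST_MEMFD_FLAG_SUPPORT_SHARED |
                   GUEST_MEMFD_FLAG_INIT_PRIVATE,
  };
  int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);

  /*
   * Preallocate up front; the whole range is still private here, so
   * no hugepage needs to be split on allocation.
   */
  fallocate(gmem_fd, 0, 0, gmem_size);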

> >
> > I assume we do want to support things like preallocating guest memory so
> > not sure this approach is feasible to avoid splits.
> >
> > But I feel like we might be working around a deeper issue here, which is
> > that we are pre-emptively splitting anything that *could* be mapped into
> > userspace (i.e. allocated+shared/mixed), rather than splitting when
> > necessary.
> >
> > I know that was the plan laid out in the guest_memfd calls, but I've run
> > into a couple instances that have me thinking we should revisit this.
> >
> > 1) Some of the recent guest_memfd seems to be gravitating towards having
> >    userspace populate/initialize guest memory payload prior to boot via
> >    mmap()'ing the shared guest_memfd pages so things work the same as
> >    they would for initialized normal VM memory payload (rather than
> >    relying on back-channels in the kernel to copy user data into guest_memfd
> >    pages).
> >
> >    When you do this though, for an SNP guest at least, that memory
> >    acceptance is done in chunks of 4MB (with accept_memory=lazy), and
> >    because that will put each 1GB page into an allocated+mixed state,
> 
> I would like your help in understanding why we need to start
> guest_memfd ranges as shared for SNP guests. guest_memfd ranges being
> private simply should mean that certain ranges are not faultable by
> the userspace.

It's seeming like I probably misremembered, but I thought there was a
discussion on a guest_memfd call a month (or so?) ago about whether to
continue to use backchannels to populate guest_memfd pages prior to
launch. It was in the context of whether to keep using kvm_gmem_populate()
for populating guest_memfd pages by copying them in from a separate
userspace buffer vs. simply populating them directly from userspace.
I thought we were leaning toward the latter since it was simpler
all-around, which is great for SNP since that is already how it populates
memory: by writing to it from userspace, which kvm_gmem_populate() then
copies into guest_memfd pages. With shared gmem support, we would just
skip the latter step in the kernel, rather than needing changes to how
userspace handles things in that regard. But maybe that was just wishful
thinking :)
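
Concretely, that flow would have looked something like the sketch
below, where gmem_convert_private() and snp_launch_update() are only
placeholders for the conversion ioctl from this series and for the
existing launch-measurement step:

  void *dst = mmap(NULL, blob_len, PROT_READ | PROT_WRITE, MAP_SHARED,
                   gmem_fd, blob_off);
  memcpy(dst, blob, blob_len);            /* write the payload in place */
  munmap(dst, blob_len);

  gmem_convert_private(gmem_fd, blob_off, blob_len);   /* placeholder */
  snp_launch_update(vm_fd, blob_gpa, blob_len);        /* placeholder */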

But you raise some very compelling points on why this might not be a
good idea even if that was how that discussion went.

> 
> Will following work?
> 1) Userspace starts all guest_memfd ranges as private.
> 2) During early guest boot it starts issuing PSC requests for
> converting memory from shared to private
>     -> KVM forwards this request to userspace
>     -> Userspace checks that the pages are already private and simply
> does nothing.
> 3) Pvalidate from guest on that memory will result in guest_memfd
> offset query which will cause the RMP table entries to actually get
> populated.

That would work, but there will need to be changes in userspace, which
currently populates memory pre-boot for SNP just like it does for normal
VMs. We will instead need to copy that data into separate buffers, and
pass those in as the buffer hva instead of the shared hva corresponding
to that GPA.

But that seems reasonable if it avoids so many other problems.

> 
> >    we end up splitting every 1GB to 4K and the guest can't even
> >    accept/PVALIDATE it 2MB at that point even if userspace doesn't touch
> >    anything in the range. As some point the guest will convert/accept
> >    the entire range, at which point we could merge, but for SNP we'd
> >    need guest cooperation to actually use a higher-granularity in stage2
> >    page tables at that point since RMP entries are effectively all split
> >    to 4K.
> >
> >    I understand the intent is to default to private where this wouldn't
> >    be an issue, and we could punt to userspace to deal with it, but it
> >    feels like an artificial restriction to place on userspace. And if we
> >    do want to allow/expect guest_memfd contents to be initialized pre-boot
> >    just like normal memory, then userspace would need to jump through
> >    some hoops:
> >
> >    - if defaulting to private: add hooks to convert each range that's being
> >      modified to a shared state prior to writing to it
> 
> Why is that a problem?

These were only problems if we went the above-mentioned way of
populating memory pre-boot via mmap() instead of other backchannels. If
we don't do that, then both these things cease to be problems. Sounds good
to me. :)

> 
> >    - if defaulting to shared: initialize memory in-place, then covert
> >      everything else to private to avoid unecessarily splitting folios
> >      at run-time
> >
> >    It feels like implementations details are bleeding out into the API
> >    to some degree here (e.g. we'd probably at least need to document
> >    this so users know how to take proper advantage of hugepage support).
> 
> Does it make sense to keep the default behavior as INIT_PRIVATE for
> SNP VMs always even without using hugepages?

Yes!

Though, revisiting discussion around INIT_PRIVATE (without the baggage
of potentially relying on mmap() to populate memory), I'm still not sure why
it's needed. I responded in the context of Ackerley's initial reply
above.

> 
> >
> > 2) There are some use-cases for HugeTLB + CoCo that have come to my
> >    attention recently that put a lot of weight on still being able to
> >    maximize mapping/hugepage size when accessing shared mem from userspace,
> >    e.g. for certain DPDK workloads that accessed shared guest buffers
> >    from host userspace. We don't really have a story for this, and I
> >    wouldn't expect us to at this stage, but I think it ties into #1 so
> >    might be worth considering in that context.
> 
> Major problem I see here is that if anything in the kernel does a GUP
> on shared memory ranges (which is very likely to happen), it would be
> difficult to get them to let go of the whole hugepage before it can be
> split safely.
> 
> Another problem is guest_memfd today doesn't support management of
> large user space page table mappings, this can turnout to be
> significant work to do referring to hugetlb pagetable management
> logic.

Yeah, that was more line-of-sight into what might be possible by going
this route, but the refcount'ing issue above is a showstopper as always.
I'd somehow convinced myself that supporting fine-grained splitting
worked around it, but you still have no idea which page you need to avoid
converting, and fancy splitting doesn't get you past that. More wishful
thinking. =\

Thanks,

Mike

> 
> >
> > I'm still fine with the current approach as a starting point, but I'm
> > wondering if improving both #1/#2 might not be so bad and maybe even
> > give us some more flexibility (for instance, Sean had mentioned leaving
> > open the option of tracking more than just shareability/mappability, and
> > if there is split/merge logic associated with those transitions then
> > re-scanning each of these attributes for a 1G range seems like it could
> > benefit from some sort of intermediate data structure to help determine
> > things like what mapping granularity is available for guest/userspace
> > for a particular range.
> >
> > One approach I was thinking of was that we introduce a data structure
> > similar to KVM's memslot->arch.lpage_info() where we store information
> > about what 1G/2M ranges are shared/private/mixed, and then instead of
> > splitting ahead of time we just record that state into this data
> > structure (using the same write lock as with the
> > shareability/mappability state), and then at *fault* time we split the
> > folio if our lpage_info-like data structure says the range is mixed.
> >
> > Then, if guest converts a 2M/4M range to private while lazilly-accepting
> > (for instance), we can still keep the folio intact as 1GB, but mark
> > the 1G range in the lpage_info-like data structure as mixed so that we
> > still inform KVM/etc. they need to map it as 2MB or lower in stage2
> > page tables. In that case, even at guest fault-time, we can leave the
> > folio unsplit until userspace tries to touch it (though in most cases
> > it never will and we can keep most of the guest's 1G intact for the
> > duration of its lifetime).
> >
> > On the userspace side, another nice thing there is if we see 1G is in a
> > mixed state, but 2M is all-shared, then we can still leave the folio as 2M,
> > and I think the refcount'ing logic would still work for the most part,
> > which makes #2 a bit easier to implement as well.
> >
> > And of course, we wouldn't need the INIT_PRIVATE then since we are only
> > splitting when necessary.
> >
> > But I guess this all comes down to how much extra pain there is in
> > tracking a 1G folio that's been split into a mixed of 2MB/4K regions,
> > but I think we'd get a lot more mileage out of getting that working and
> > just completely stripping out all of the merging logic for initial
> > implementation (other than at cleanup time), so maybe complexity-wise
> > it balances out a bit?
> >
> > Thanks,
> >
> > Mike
> >
> > >
> > > >>  See KVM_SET_USER_MEMORY_REGION2 for additional details.
> > > >>
> > > >>  4.143 KVM_PRE_FAULT_MEMORY
> > > >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > >> index 4cc824a3a7c9..d7df312479aa 100644
> > > >> --- a/include/uapi/linux/kvm.h
> > > >> +++ b/include/uapi/linux/kvm.h
> > > >> @@ -1567,7 +1567,9 @@ struct kvm_memory_attributes {
> > > >>  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> > > >>
> > > >>  #define KVM_CREATE_GUEST_MEMFD    _IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
> > > >> +
> > > >>  #define GUEST_MEMFD_FLAG_SUPPORT_SHARED   (1UL << 0)
> > > >> +#define GUEST_MEMFD_FLAG_INIT_PRIVATE     (1UL << 1)
> > > >>
> > > >>  struct kvm_create_guest_memfd {
> > > >>    __u64 size;
> > > >> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > > >> index 239d0f13dcc1..590932499eba 100644
> > > >> --- a/virt/kvm/guest_memfd.c
> > > >> +++ b/virt/kvm/guest_memfd.c
> > > >> @@ -4,6 +4,7 @@
> > > >>  #include <linux/falloc.h>
> > > >>  #include <linux/fs.h>
> > > >>  #include <linux/kvm_host.h>
> > > >> +#include <linux/maple_tree.h>
> > > >>  #include <linux/pseudo_fs.h>
> > > >>  #include <linux/pagemap.h>
> > > >>
> > > >> @@ -17,6 +18,24 @@ struct kvm_gmem {
> > > >>    struct list_head entry;
> > > >>  };
> > > >>
> > > >> +struct kvm_gmem_inode_private {
> > > >> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > > >> +  struct maple_tree shareability;
> > > >> +#endif
> > > >> +};
> > > >> +
> > > >> +enum shareability {
> > > >> +  SHAREABILITY_GUEST = 1, /* Only the guest can map (fault) folios in this range. */
> > > >> +  SHAREABILITY_ALL = 2,   /* Both guest and host can fault folios in this range. */
> > > >> +};
> > > >> +
> > > >> +static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index);
> > > >> +
> > > >> +static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
> > > >> +{
> > > >> +  return inode->i_mapping->i_private_data;
> > > >> +}
> > > >> +
> > > >>  /**
> > > >>   * folio_file_pfn - like folio_file_page, but return a pfn.
> > > >>   * @folio: The folio which contains this index.
> > > >> @@ -29,6 +48,58 @@ static inline kvm_pfn_t folio_file_pfn(struct folio *folio, pgoff_t index)
> > > >>    return folio_pfn(folio) + (index & (folio_nr_pages(folio) - 1));
> > > >>  }
> > > >>
> > > >> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > > >> +
> > > >> +static int kvm_gmem_shareability_setup(struct kvm_gmem_inode_private *private,
> > > >> +                                loff_t size, u64 flags)
> > > >> +{
> > > >> +  enum shareability m;
> > > >> +  pgoff_t last;
> > > >> +
> > > >> +  last = (size >> PAGE_SHIFT) - 1;
> > > >> +  m = flags & GUEST_MEMFD_FLAG_INIT_PRIVATE ? SHAREABILITY_GUEST :
> > > >> +                                              SHAREABILITY_ALL;
> > > >> +  return mtree_store_range(&private->shareability, 0, last, xa_mk_value(m),
> > > >> +                           GFP_KERNEL);
> > > >
> > > > One really nice thing about using a maple tree is that it should get rid
> > > > of a fairly significant startup delay for SNP/TDX when the entire xarray gets
> > > > initialized with private attribute entries via KVM_SET_MEMORY_ATTRIBUTES
> > > > (which is the current QEMU default behavior).
> > > >
> > > > I'd originally advocated for sticking with the xarray implementation Fuad was
> > > > using until we'd determined we really need it for HugeTLB support, but I'm
> > > > sort of thinking it's already justified just based on the above.
> > > >
> > > > Maybe it would make sense for KVM memory attributes too?
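
For a rough picture of why the startup cost differs so much, here is a
fragment-style sketch (SHAREABILITY_GUEST is the enum value from the patch
above; the helper names are illustrative only):

#include <linux/maple_tree.h>
#include <linux/xarray.h>

/* Maple tree: a single store covers the whole file range. */
static int init_private_mtree(struct maple_tree *mt, unsigned long nr_pages)
{
        return mtree_store_range(mt, 0, nr_pages - 1,
                                 xa_mk_value(SHAREABILITY_GUEST), GFP_KERNEL);
}

/*
 * Xarray without multi-index entries: one store per 4K index, i.e.
 * ~262144 stores just to mark a single 1G range private.
 */
static int init_private_xarray(struct xarray *xa, unsigned long nr_pages)
{
        unsigned long index;
        int err;

        for (index = 0; index < nr_pages; index++) {
                err = xa_err(xa_store(xa, index,
                                      xa_mk_value(SHAREABILITY_GUEST),
                                      GFP_KERNEL));
                if (err)
                        return err;
        }
        return 0;
}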
> > > >
> > > >> +}
> > > >> +
> > > >> +static enum shareability kvm_gmem_shareability_get(struct inode *inode,
> > > >> +                                           pgoff_t index)
> > > >> +{
> > > >> +  struct maple_tree *mt;
> > > >> +  void *entry;
> > > >> +
> > > >> +  mt = &kvm_gmem_private(inode)->shareability;
> > > >> +  entry = mtree_load(mt, index);
> > > >> +  WARN(!entry,
> > > >> +       "Shareability should always be defined for all indices in inode.");
> > > >> +
> > > >> +  return xa_to_value(entry);
> > > >> +}
> > > >> +
> > > >> +static struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
> > > >> +{
> > > >> +  if (kvm_gmem_shareability_get(inode, index) != SHAREABILITY_ALL)
> > > >> +          return ERR_PTR(-EACCES);
> > > >> +
> > > >> +  return kvm_gmem_get_folio(inode, index);
> > > >> +}
> > > >> +
> > > >> +#else
> > > >> +
> > > >> +static int kvm_gmem_shareability_setup(struct maple_tree *mt, loff_t size, u64 flags)
> > > >> +{
> > > >> +  return 0;
> > > >> +}
> > > >> +
> > > >> +static inline struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
> > > >> +{
> > > >> +  WARN_ONCE("Unexpected call to get shared folio.")
> > > >> +  return NULL;
> > > >> +}
> > > >> +
> > > >> +#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> > > >> +
> > > >>  static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > >>                                pgoff_t index, struct folio *folio)
> > > >>  {
> > > >> @@ -333,7 +404,7 @@ static vm_fault_t kvm_gmem_fault_shared(struct vm_fault *vmf)
> > > >>
> > > >>    filemap_invalidate_lock_shared(inode->i_mapping);
> > > >>
> > > >> -  folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> > > >> +  folio = kvm_gmem_get_shared_folio(inode, vmf->pgoff);
> > > >>    if (IS_ERR(folio)) {
> > > >>            int err = PTR_ERR(folio);
> > > >>
> > > >> @@ -420,8 +491,33 @@ static struct file_operations kvm_gmem_fops = {
> > > >>    .fallocate      = kvm_gmem_fallocate,
> > > >>  };
> > > >>
> > > >> +static void kvm_gmem_free_inode(struct inode *inode)
> > > >> +{
> > > >> +  struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
> > > >> +
> > > >> +  kfree(private);
> > > >> +
> > > >> +  free_inode_nonrcu(inode);
> > > >> +}
> > > >> +
> > > >> +static void kvm_gmem_destroy_inode(struct inode *inode)
> > > >> +{
> > > >> +  struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
> > > >> +
> > > >> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > > >> +  /*
> > > >> +   * mtree_destroy() can't be used within rcu callback, hence can't be
> > > >> +   * done in ->free_inode().
> > > >> +   */
> > > >> +  if (private)
> > > >> +          mtree_destroy(&private->shareability);
> > > >> +#endif
> > > >> +}
> > > >> +
> > > >>  static const struct super_operations kvm_gmem_super_operations = {
> > > >>    .statfs         = simple_statfs,
> > > >> +  .destroy_inode  = kvm_gmem_destroy_inode,
> > > >> +  .free_inode     = kvm_gmem_free_inode,
> > > >>  };
> > > >>
> > > >>  static int kvm_gmem_init_fs_context(struct fs_context *fc)
> > > >> @@ -549,12 +645,26 @@ static const struct inode_operations kvm_gmem_iops = {
> > > >>  static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
> > > >>                                                  loff_t size, u64 flags)
> > > >>  {
> > > >> +  struct kvm_gmem_inode_private *private;
> > > >>    struct inode *inode;
> > > >> +  int err;
> > > >>
> > > >>    inode = alloc_anon_secure_inode(kvm_gmem_mnt->mnt_sb, name);
> > > >>    if (IS_ERR(inode))
> > > >>            return inode;
> > > >>
> > > >> +  err = -ENOMEM;
> > > >> +  private = kzalloc(sizeof(*private), GFP_KERNEL);
> > > >> +  if (!private)
> > > >> +          goto out;
> > > >> +
> > > >> +  mt_init(&private->shareability);
> > > >> +  inode->i_mapping->i_private_data = private;
> > > >> +
> > > >> +  err = kvm_gmem_shareability_setup(private, size, flags);
> > > >> +  if (err)
> > > >> +          goto out;
> > > >> +
> > > >>    inode->i_private = (void *)(unsigned long)flags;
> > > >>    inode->i_op = &kvm_gmem_iops;
> > > >>    inode->i_mapping->a_ops = &kvm_gmem_aops;
> > > >> @@ -566,6 +676,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
> > > >>    WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> > > >>
> > > >>    return inode;
> > > >> +
> > > >> +out:
> > > >> +  iput(inode);
> > > >> +
> > > >> +  return ERR_PTR(err);
> > > >>  }
> > > >>
> > > >>  static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
> > > >> @@ -654,6 +769,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> > > >>    if (kvm_arch_vm_supports_gmem_shared_mem(kvm))
> > > >>            valid_flags |= GUEST_MEMFD_FLAG_SUPPORT_SHARED;
> > > >>
> > > >> +  if (flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED)
> > > >> +          valid_flags |= GUEST_MEMFD_FLAG_INIT_PRIVATE;
> > > >> +
> > > >>    if (flags & ~valid_flags)
> > > >>            return -EINVAL;
> > > >>
> > > >> @@ -842,6 +960,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > >>    if (!file)
> > > >>            return -EFAULT;
> > > >>
> > > >> +  filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> > > >> +
> > > >
> > > > I like the idea of using a write-lock/read-lock to protect write/read access
> > > > to shareability state (though maybe not necessarily re-using filemap's
> > > > invalidate lock); it's simple and still allows concurrent faulting in of gmem
> > > > pages. One issue on the SNP side (which also came up in one of the gmem calls)
> > > > is that if we introduce support for tracking preparedness as discussed (e.g. via a
> > > > new SHAREABILITY_GUEST_PREPARED state) the
> > > > SHAREABILITY_GUEST->SHAREABILITY_GUEST_PREPARED transition would occur at
> > > > fault-time, and so would need to take the write-lock and no longer allow for
> > > > concurrent fault-handling.
> > > >
> > > > I was originally planning on introducing a new rw_semaphore with similar
> > > > semantics to the rw_lock that Fuad previously had in his restricted mmap
> > > > series[1] (and similar semantics to the filemap invalidate lock here). The main
> > > > difference, to handle setting SHAREABILITY_GUEST_PREPARED within fault paths,
> > > > was that in the case of a folio being present for an index, the folio lock would
> > > > also need to be held in order to update the shareability state. Because
> > > > of that, fault paths (which will basically always either have or allocate
> > > > a folio) can rely on the folio lock to guard shareability state in a more
> > > > granular way and so can avoid a global write lock.
> > > >
> > > > They would still need to hold the read lock to access the tree however.
> > > > Or more specifically, any paths that could allocate a folio need to take
> > > > a read lock so there isn't a TOCTOU situation where shareability is
> > > > being updated for an index for which a folio hasn't been allocated, but
> > > > then just afterward the folio gets faulted in/allocated while the
> > > > shareability state is still being updated with the understanding that
> > > > there was no folio around that needed locking.
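
Roughly, the locking rules being described could look like the following
pseudo-kernel sketch (the rw_semaphore, the SHAREABILITY_GUEST_PREPARED state
and the gmem_*() helpers are hypothetical, not code from this series):

static struct rw_semaphore shareability_rwsem;          /* hypothetical */

/* Conversion path: exclusive on the shareability tree, plus the folio
 * lock if a folio already exists for the index being updated. */
static void convert_index(struct inode *inode, pgoff_t index, int new_state)
{
        struct folio *folio;

        down_write(&shareability_rwsem);
        folio = gmem_lookup_folio(inode, index);        /* hypothetical */
        if (folio)
                folio_lock(folio);
        gmem_set_state(inode, index, new_state);        /* hypothetical */
        if (folio) {
                folio_unlock(folio);
                folio_put(folio);
        }
        up_write(&shareability_rwsem);
}

/* Fault path: the read lock keeps the tree stable, and the folio lock is
 * enough to make the per-index SHAREABILITY_GUEST ->
 * SHAREABILITY_GUEST_PREPARED transition without the global write lock. */
static void fault_prepare(struct inode *inode, pgoff_t index,
                          struct folio *folio)
{
        down_read(&shareability_rwsem);
        folio_lock(folio);
        gmem_set_state(inode, index, SHAREABILITY_GUEST_PREPARED);
        folio_unlock(folio);
        up_read(&shareability_rwsem);
}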
> > > >
> > > > I had a branch with in-place conversion support for SNP[2] that added this
> > > > lock reworking on top of Fuad's series along with preparation tracking,
> > > > but I'm now planning to rebase that on top of the patches from this
> > > > series that Sean mentioned[3] earlier:
> > > >
> > > >   KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION
> > > >   KVM: Query guest_memfd for private/shared status
> > > >   KVM: guest_memfd: Skip LRU for guest_memfd folios
> > > >   KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
> > > >   KVM: guest_memfd: Introduce and use shareability to guard faulting
> > > >   KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
> > > >
> > > > but figured I'd mention it here in case there are other things to consider on
> > > > the locking front.
> > > >
> > > > Definitely agree with Sean though that it would be nice to start identifying a
> > > > common base of patches for the in-place conversion enablement for SNP, TDX, and
> > > > pKVM so the APIs/interfaces for hugepages can be handled separately.
> > > >
> > > > -Mike
> > > >
> > > > [1] https://lore.kernel.org/kvm/20250328153133.3504118-1-tabba@google.com/
> > > > [2] https://github.com/mdroth/linux/commits/mmap-swprot-v10-snp0-wip2/
> > > > [3] https://lore.kernel.org/kvm/aC86OsU2HSFZkJP6@google.com/
> > > >
> > > >>    folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
> > > >>    if (IS_ERR(folio)) {
> > > >>            r = PTR_ERR(folio);
> > > >> @@ -857,8 +977,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > >>            *page = folio_file_page(folio, index);
> > > >>    else
> > > >>            folio_put(folio);
> > > >> -
> > > >>  out:
> > > >> +  filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
> > > >>    fput(file);
> > > >>    return r;
> > > >>  }
> > > >> --
> > > >> 2.49.0.1045.g170613ef41-goog
> > > >>
> > >

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-07-03  4:12           ` Michael Roth
@ 2025-07-03  5:10             ` Vishal Annapurve
  2025-07-03 20:39               ` Michael Roth
  2025-08-12  8:23             ` Fuad Tabba
  1 sibling, 1 reply; 231+ messages in thread
From: Vishal Annapurve @ 2025-07-03  5:10 UTC (permalink / raw)
  To: Michael Roth
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

On Wed, Jul 2, 2025 at 9:12 PM Michael Roth <michael.roth@amd.com> wrote:
>
> On Wed, Jul 02, 2025 at 05:46:23PM -0700, Vishal Annapurve wrote:
> > On Wed, Jul 2, 2025 at 4:25 PM Michael Roth <michael.roth@amd.com> wrote:
> > >
> > > On Wed, Jun 11, 2025 at 02:51:38PM -0700, Ackerley Tng wrote:
> > > > Michael Roth <michael.roth@amd.com> writes:
> > > >
> > > > > On Wed, May 14, 2025 at 04:41:41PM -0700, Ackerley Tng wrote:
> > > > >> Track guest_memfd memory's shareability status within the inode as
> > > > >> opposed to the file, since it is property of the guest_memfd's memory
> > > > >> contents.
> > > > >>
> > > > >> Shareability is a property of the memory and is indexed using the
> > > > >> page's index in the inode. Because shareability is the memory's
> > > > >> property, it is stored within guest_memfd instead of within KVM, like
> > > > >> in kvm->mem_attr_array.
> > > > >>
> > > > >> KVM_MEMORY_ATTRIBUTE_PRIVATE in kvm->mem_attr_array must still be
> > > > >> retained to allow VMs to only use guest_memfd for private memory and
> > > > >> some other memory for shared memory.
> > > > >>
> > > > >> Not all use cases require guest_memfd() to be shared with the host
> > > > >> when first created. Add a new flag, GUEST_MEMFD_FLAG_INIT_PRIVATE,
> > > > >> which when set on KVM_CREATE_GUEST_MEMFD, initializes the memory as
> > > > >> private to the guest, and therefore not mappable by the
> > > > >> host. Otherwise, memory is shared until explicitly converted to
> > > > >> private.
> > > > >>
> > > > >> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > > > >> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> > > > >> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> > > > >> Co-developed-by: Fuad Tabba <tabba@google.com>
> > > > >> Signed-off-by: Fuad Tabba <tabba@google.com>
> > > > >> Change-Id: If03609cbab3ad1564685c85bdba6dcbb6b240c0f
> > > > >> ---
> > > > >>  Documentation/virt/kvm/api.rst |   5 ++
> > > > >>  include/uapi/linux/kvm.h       |   2 +
> > > > >>  virt/kvm/guest_memfd.c         | 124 ++++++++++++++++++++++++++++++++-
> > > > >>  3 files changed, 129 insertions(+), 2 deletions(-)
> > > > >>
> > > > >> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > > >> index 86f74ce7f12a..f609337ae1c2 100644
> > > > >> --- a/Documentation/virt/kvm/api.rst
> > > > >> +++ b/Documentation/virt/kvm/api.rst
> > > > >> @@ -6408,6 +6408,11 @@ belonging to the slot via its userspace_addr.
> > > > >>  The use of GUEST_MEMFD_FLAG_SUPPORT_SHARED will not be allowed for CoCo VMs.
> > > > >>  This is validated when the guest_memfd instance is bound to the VM.
> > > > >>
> > > > >> +If the capability KVM_CAP_GMEM_CONVERSIONS is supported, then the 'flags' field
> > > > >> +supports GUEST_MEMFD_FLAG_INIT_PRIVATE.  Setting GUEST_MEMFD_FLAG_INIT_PRIVATE
> > > > >> +will initialize the memory for the guest_memfd as guest-only and not faultable
> > > > >> +by the host.
> > > > >> +
> > > > >
> > > > > KVM_CAP_GMEM_CONVERSION doesn't get introduced until later, so it seems
> > > > > like this flag should be deferred until that patch is in place. Is it
> > > > > really needed at that point though? Userspace would be able to set the
> > > > > initial state via KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls.
> > > > >
> > > >
> > > > I can move this change to the later patch. Thanks! Will fix in the next
> > > > revision.
> > > >
> > > > > The mtree contents seems to get stored in the same manner in either case so
> > > > > performance-wise only the overhead of a few userspace<->kernel switches
> > > > > would be saved. Are there any other reasons?
> > > > >
> > > > > Otherwise, maybe just settle on SHARED as a documented default (since at
> > > > > least non-CoCo VMs would be able to reliably benefit) and let
> > > > > CoCo/GUEST_MEMFD_FLAG_SUPPORT_SHARED VMs set PRIVATE at whatever
> > > > > granularity makes sense for the architecture/guest configuration.
> > > > >
> > > >
> > > > Because shared pages are split once any memory is allocated, having a
> > > > way to INIT_PRIVATE could avoid the split and then merge on
> > > > conversion. I feel that is enough value to have this config flag, what
> > > > do you think?
> > > >
> > > > I guess we could also have userspace be careful not to do any allocation
> > > > before converting.
>
> (Re-visiting this with the assumption that we *don't* intend to use mmap() to
> populate memory (in which case you can pretty much ignore my previous
> response))

I am assuming in-place conversion with huge page backing for the
discussion below.

Looks like there are three scenarios/usecases we are discussing here:
1) Pre-allocating guest_memfd file offsets
   - Userspace can use fallocate to do this for hugepages by keeping
the file ranges marked private.
2) Prefaulting guest EPT/NPT entries
3) Populating initial guest payload into guest_memfd memory
   - Userspace can mark certain ranges as shared, populate the
contents and convert the ranges back to private. So mmap will come in
handy here.

>
> I'm still not sure where the INIT_PRIVATE flag comes into play. For SNP,
> userspace already defaults to marking everything private pretty close to
> guest_memfd creation time, so the potential for allocations to occur
> in-between seems small, but worth confirming.

Ok, I am not much worried about whether the INIT_PRIVATE flag gets
supported or not, but more about the default setting that different
CVMs start with. To me, it looks like all CVMs should start as
everything private by default and if there is a way to bake that
configuration during guest_memfd creation time that would be good to
have instead of doing "create and convert" operations and there is a
fairly low cost to support this flag.

>
> But I know in the past there was a desire to ensure TDX/SNP could
> support pre-allocating guest_memfd memory (and even pre-faulting via
> KVM_PRE_FAULT_MEMORY), but I think that could still work right? The
> fallocate() handling could still avoid the split if the whole hugepage
> is private, though there is a bit more potential for that fallocate()
> to happen before userspace does the "manually" shared->private
> conversion. I'll double-check on that aspect, but otherwise, is there
> still any other need for it?

This usecase of being able to preallocate should still work with
in-place conversion assuming all ranges are private before
pre-population.

>
> > >
> > > I assume we do want to support things like preallocating guest memory so
> > > not sure this approach is feasible to avoid splits.
> > >
> > > But I feel like we might be working around a deeper issue here, which is
> > > that we are pre-emptively splitting anything that *could* be mapped into
> > > userspace (i.e. allocated+shared/mixed), rather than splitting when
> > > necessary.
> > >
> > > I know that was the plan laid out in the guest_memfd calls, but I've run
> > > into a couple instances that have me thinking we should revisit this.
> > >
> > > 1) Some of the recent guest_memfd discussion seems to be gravitating towards having
> > >    userspace populate/initialize guest memory payload prior to boot via
> > >    mmap()'ing the shared guest_memfd pages so things work the same as
> > >    they would for initialized normal VM memory payload (rather than
> > >    relying on back-channels in the kernel to copy user data into guest_memfd
> > >    pages).
> > >
> > >    When you do this though, for an SNP guest at least, that memory
> > >    acceptance is done in chunks of 4MB (with accept_memory=lazy), and
> > >    because that will put each 1GB page into an allocated+mixed state,
> >
> > I would like your help in understanding why we need to start
> > guest_memfd ranges as shared for SNP guests. guest_memfd ranges being
> > private simply should mean that certain ranges are not faultable by
> > the userspace.
>
> It's seeming like I probably misremembered, but I thought there was a
> discussion on guest_memfd call a month (or so?) ago about whether to
> continue to use backchannels to populate guest_memfd pages prior to
> launch. It was in the context of whether to keep using kvm_gmem_populate()
> for populating guest_memfd pages by copying them in from separate
> userspace buffer vs. simply populating them directly from userspace.
> I thought we were leaning on the latter since it was simpler all-around,
> which is great for SNP since that is already how it populates memory: by
> writing to it from userspace, which kvm_gmem_populate() then copies into
> guest_memfd pages. With shared gmem support, we just skip the latter now
> > in the kernel rather than needing changes to how userspace handles things in
> that regard. But maybe that was just wishful thinking :)

You remember it correctly and that's how userspace should pre-populate
guest memory contents with in-place conversion support available.
Userspace can simply do the following scheme as an example:
1) Create guest_memfd with the INIT_PRIVATE flag or if we decide to
not go that way, create a guest_memfd file and set all ranges as
private.
2) Preallocate the guest_memfd ranges.
3) Convert the needed ranges to shared, populate the initial guest
payload and then convert those ranges back to private.

The important point here is that guest_memfd ranges can be marked as
private before pre-allocating guest_memfd ranges; a rough sketch of this
flow follows.
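
Something like the following (the INIT_PRIVATE flag and mmap support are from
these RFC series, and the KVM_GMEM_CONVERT_* ioctl names and their argument
struct below are assumptions for illustration only, not a settled UAPI):

#define _GNU_SOURCE
#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>

struct gmem_convert {           /* hypothetical ioctl argument */
        __u64 offset;
        __u64 size;
};

static int populate_payload(int vm_fd, __u64 gmem_size,
                            __u64 payload_off, const void *payload,
                            size_t payload_len)
{
        struct kvm_create_guest_memfd args = {
                .size  = gmem_size,
                /* RFC-only flags, not in upstream linux/kvm.h yet. */
                .flags = GUEST_MEMFD_FLAG_SUPPORT_SHARED |
                         GUEST_MEMFD_FLAG_INIT_PRIVATE, /* 1) start all-private */
        };
        struct gmem_convert r = { .offset = payload_off, .size = payload_len };
        void *map;
        int gmem_fd;

        gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
        if (gmem_fd < 0)
                return -1;

        /* 2) Preallocate while everything is still private: no split. */
        if (fallocate(gmem_fd, 0, 0, gmem_size))
                return -1;

        /* 3) Convert the payload range to shared, fill it in place
         *    (payload_off assumed page-aligned), then convert it back. */
        if (ioctl(gmem_fd, KVM_GMEM_CONVERT_SHARED, &r))
                return -1;

        map = mmap(NULL, payload_len, PROT_READ | PROT_WRITE, MAP_SHARED,
                   gmem_fd, payload_off);
        if (map == MAP_FAILED)
                return -1;
        memcpy(map, payload, payload_len);
        munmap(map, payload_len);

        return ioctl(gmem_fd, KVM_GMEM_CONVERT_PRIVATE, &r);
}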

>
> But you raise some very compelling points on why this might not be a
> good idea even if that was how that discussion went.
>
> >
> > Will following work?
> > 1) Userspace starts all guest_memfd ranges as private.
> > 2) During early guest boot it starts issuing PSC requests for
> > converting memory from shared to private
> >     -> KVM forwards this request to userspace
> >     -> Userspace checks that the pages are already private and simply
> > does nothing.
> > 3) Pvalidate from guest on that memory will result in guest_memfd
> > offset query which will cause the RMP table entries to actually get
> > populated.
>
> That would work, but there will need to be changes on userspace to deal
> with how SNP populates memory pre-boot just like normal VMs do. We will
> instead need to copy that data into separate buffers, and pass those in
> as the buffer hva instead of the shared hva corresponding to that GPA.

Initial guest memory payload generally carries a much smaller
footprint so I ignored that detail in the above sequence. As I said
above, userspace should be able to use guest_memfd ranges to directly
populate contents by converting those ranges to shared.

>
> But that seems reasonable if it avoids so many other problems.
>
> >
> > >    we end up splitting every 1GB to 4K and the guest can't even
> > >    accept/PVALIDATE it 2MB at that point even if userspace doesn't touch
> > >    anything in the range. At some point the guest will convert/accept
> > >    the entire range, at which point we could merge, but for SNP we'd
> > >    need guest cooperation to actually use a higher granularity in stage2
> > >    page tables at that point since RMP entries are effectively all split
> > >    to 4K.
> > >
> > >    I understand the intent is to default to private where this wouldn't
> > >    be an issue, and we could punt to userspace to deal with it, but it
> > >    feels like an artificial restriction to place on userspace. And if we
> > >    do want to allow/expect guest_memfd contents to be initialized pre-boot
> > >    just like normal memory, then userspace would need to jump through
> > >    some hoops:
> > >
> > >    - if defaulting to private: add hooks to convert each range that's being
> > >      modified to a shared state prior to writing to it
> >
> > Why is that a problem?
>
> These were only problems if we went the above-mentioned way of
> populating memory pre-boot via mmap() instead of other backchannels. If
> we don't do that, then both these things cease to be problems. Sounds good
> to me. :)

I think there wouldn't be a problem even if we pre-populated memory
pre-boot via mmap(). Using mmap() seems a preferable option to me.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-07-03  5:10             ` Vishal Annapurve
@ 2025-07-03 20:39               ` Michael Roth
  2025-07-07 14:55                 ` Vishal Annapurve
  0 siblings, 1 reply; 231+ messages in thread
From: Michael Roth @ 2025-07-03 20:39 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

On Wed, Jul 02, 2025 at 10:10:36PM -0700, Vishal Annapurve wrote:
> On Wed, Jul 2, 2025 at 9:12 PM Michael Roth <michael.roth@amd.com> wrote:
> >
> > On Wed, Jul 02, 2025 at 05:46:23PM -0700, Vishal Annapurve wrote:
> > > On Wed, Jul 2, 2025 at 4:25 PM Michael Roth <michael.roth@amd.com> wrote:
> > > >
> > > > On Wed, Jun 11, 2025 at 02:51:38PM -0700, Ackerley Tng wrote:
> > > > > Michael Roth <michael.roth@amd.com> writes:
> > > > >
> > > > > > On Wed, May 14, 2025 at 04:41:41PM -0700, Ackerley Tng wrote:
> > > > > >> Track guest_memfd memory's shareability status within the inode as
> > > > > >> opposed to the file, since it is property of the guest_memfd's memory
> > > > > >> contents.
> > > > > >>
> > > > > >> Shareability is a property of the memory and is indexed using the
> > > > > >> page's index in the inode. Because shareability is the memory's
> > > > > >> property, it is stored within guest_memfd instead of within KVM, like
> > > > > >> in kvm->mem_attr_array.
> > > > > >>
> > > > > >> KVM_MEMORY_ATTRIBUTE_PRIVATE in kvm->mem_attr_array must still be
> > > > > >> retained to allow VMs to only use guest_memfd for private memory and
> > > > > >> some other memory for shared memory.
> > > > > >>
> > > > > >> Not all use cases require guest_memfd() to be shared with the host
> > > > > >> when first created. Add a new flag, GUEST_MEMFD_FLAG_INIT_PRIVATE,
> > > > > >> which when set on KVM_CREATE_GUEST_MEMFD, initializes the memory as
> > > > > >> private to the guest, and therefore not mappable by the
> > > > > >> host. Otherwise, memory is shared until explicitly converted to
> > > > > >> private.
> > > > > >>
> > > > > >> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > > > > >> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> > > > > >> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> > > > > >> Co-developed-by: Fuad Tabba <tabba@google.com>
> > > > > >> Signed-off-by: Fuad Tabba <tabba@google.com>
> > > > > >> Change-Id: If03609cbab3ad1564685c85bdba6dcbb6b240c0f
> > > > > >> ---
> > > > > >>  Documentation/virt/kvm/api.rst |   5 ++
> > > > > >>  include/uapi/linux/kvm.h       |   2 +
> > > > > >>  virt/kvm/guest_memfd.c         | 124 ++++++++++++++++++++++++++++++++-
> > > > > >>  3 files changed, 129 insertions(+), 2 deletions(-)
> > > > > >>
> > > > > >> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > > > >> index 86f74ce7f12a..f609337ae1c2 100644
> > > > > >> --- a/Documentation/virt/kvm/api.rst
> > > > > >> +++ b/Documentation/virt/kvm/api.rst
> > > > > >> @@ -6408,6 +6408,11 @@ belonging to the slot via its userspace_addr.
> > > > > >>  The use of GUEST_MEMFD_FLAG_SUPPORT_SHARED will not be allowed for CoCo VMs.
> > > > > >>  This is validated when the guest_memfd instance is bound to the VM.
> > > > > >>
> > > > > >> +If the capability KVM_CAP_GMEM_CONVERSIONS is supported, then the 'flags' field
> > > > > >> +supports GUEST_MEMFD_FLAG_INIT_PRIVATE.  Setting GUEST_MEMFD_FLAG_INIT_PRIVATE
> > > > > >> +will initialize the memory for the guest_memfd as guest-only and not faultable
> > > > > >> +by the host.
> > > > > >> +
> > > > > >
> > > > > > KVM_CAP_GMEM_CONVERSION doesn't get introduced until later, so it seems
> > > > > > like this flag should be deferred until that patch is in place. Is it
> > > > > > really needed at that point though? Userspace would be able to set the
> > > > > > initial state via KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls.
> > > > > >
> > > > >
> > > > > I can move this change to the later patch. Thanks! Will fix in the next
> > > > > revision.
> > > > >
> > > > > > The mtree contents seems to get stored in the same manner in either case so
> > > > > > performance-wise only the overhead of a few userspace<->kernel switches
> > > > > > would be saved. Are there any other reasons?
> > > > > >
> > > > > > Otherwise, maybe just settle on SHARED as a documented default (since at
> > > > > > least non-CoCo VMs would be able to reliably benefit) and let
> > > > > > CoCo/GUEST_MEMFD_FLAG_SUPPORT_SHARED VMs set PRIVATE at whatever
> > > > > > granularity makes sense for the architecture/guest configuration.
> > > > > >
> > > > >
> > > > > Because shared pages are split once any memory is allocated, having a
> > > > > way to INIT_PRIVATE could avoid the split and then merge on
> > > > > conversion. I feel that is enough value to have this config flag, what
> > > > > do you think?
> > > > >
> > > > > I guess we could also have userspace be careful not to do any allocation
> > > > > before converting.
> >
> > (Re-visiting this with the assumption that we *don't* intend to use mmap() to
> > populate memory (in which case you can pretty much ignore my previous
> > response))
> 
> I am assuming in-place conversion with huge page backing for the
> discussion below.
> 
> Looks like there are three scenarios/usecases we are discussing here:
> 1) Pre-allocating guest_memfd file offsets
>    - Userspace can use fallocate to do this for hugepages by keeping
> the file ranges marked private.
> 2) Prefaulting guest EPT/NPT entries
> 3) Populating initial guest payload into guest_memfd memory
>    - Userspace can mark certain ranges as shared, populate the
> contents and convert the ranges back to private. So mmap will come in
> handy here.
> 
> >
> > I'm still not sure where the INIT_PRIVATE flag comes into play. For SNP,
> > userspace already defaults to marking everything private pretty close to
> > guest_memfd creation time, so the potential for allocations to occur
> > in-between seems small, but worth confirming.
> 
> Ok, I am not much worried about whether the INIT_PRIVATE flag gets
> supported or not, but more about the default setting that different
> CVMs start with. To me, it looks like all CVMs should start as
> everything private by default and if there is a way to bake that
> configuration during guest_memfd creation time that would be good to
> have instead of doing "create and convert" operations and there is a
> fairly low cost to support this flag.
> 
> >
> > But I know in the past there was a desire to ensure TDX/SNP could
> > support pre-allocating guest_memfd memory (and even pre-faulting via
> > KVM_PRE_FAULT_MEMORY), but I think that could still work right? The
> > fallocate() handling could still avoid the split if the whole hugepage
> > is private, though there is a bit more potential for that fallocate()
> > to happen before userspace does the "manually" shared->private
> > conversion. I'll double-check on that aspect, but otherwise, is there
> > still any other need for it?
> 
> This usecase of being able to preallocate should still work with
> in-place conversion assuming all ranges are private before
> pre-population.

Ok, I think I was missing that the merge logic here will then restore it
to 1GB before the guest starts, so the folio isn't permanently split if
we do the mmap(), and that gives us more flexibility in how we can use
it.

I was thinking we needed to avoid the split from the start by avoiding
paths like mmap() which might trigger the split. I was trying to avoid
any merge->unsplit logic in the THP case (or unsplit in general), in
which case we'd get permanent splits via the mmap() approach, but for
2MB that's probably not a big deal.

> 
> >
> > > >
> > > > I assume we do want to support things like preallocating guest memory, so
> > > > I'm not sure this approach is feasible for avoiding splits.
> > > >
> > > > But I feel like we might be working around a deeper issue here, which is
> > > > that we are pre-emptively splitting anything that *could* be mapped into
> > > > userspace (i.e. allocated+shared/mixed), rather than splitting when
> > > > necessary.
> > > >
> > > > I know that was the plan laid out in the guest_memfd calls, but I've run
> > > > into a couple instances that have me thinking we should revisit this.
> > > >
> > > > 1) Some of the recent guest_memfd discussion seems to be gravitating towards having
> > > >    userspace populate/initialize guest memory payload prior to boot via
> > > >    mmap()'ing the shared guest_memfd pages so things work the same as
> > > >    they would for initialized normal VM memory payload (rather than
> > > >    relying on back-channels in the kernel to copy user data into guest_memfd
> > > >    pages).
> > > >
> > > >    When you do this though, for an SNP guest at least, that memory
> > > >    acceptance is done in chunks of 4MB (with accept_memory=lazy), and
> > > >    because that will put each 1GB page into an allocated+mixed state,
> > >
> > > I would like your help in understanding why we need to start
> > > guest_memfd ranges as shared for SNP guests. guest_memfd ranges being
> > > private simply should mean that certain ranges are not faultable by
> > > the userspace.
> >
> > It's seeming like I probably misremembered, but I thought there was a
> > discussion on guest_memfd call a month (or so?) ago about whether to
> > continue to use backchannels to populate guest_memfd pages prior to
> > launch. It was in the context of whether to keep using kvm_gmem_populate()
> > for populating guest_memfd pages by copying them in from separate
> > userspace buffer vs. simply populating them directly from userspace.
> > I thought we were leaning on the latter since it was simpler all-around,
> > which is great for SNP since that is already how it populates memory: by
> > writing to it from userspace, which kvm_gmem_populate() then copies into
> > guest_memfd pages. With shared gmem support, we just skip the latter now
> > > in the kernel rather than needing changes to how userspace handles things in
> > that regard. But maybe that was just wishful thinking :)
> 
> You remember it correctly and that's how userspace should pre-populate
> guest memory contents with in-place conversion support available.
> Userspace can simply do the following scheme as an example:
> 1) Create guest_memfd with the INIT_PRIVATE flag or if we decide to
> not go that way, create a guest_memfd file and set all ranges as
> private.
> 2) Preallocate the guest_memfd ranges.
> 3) Convert the needed ranges to shared, populate the initial guest
> payload and then convert those ranges back to private.
> 
> Important point here is that guest_memfd ranges can be marked as
> private before pre-allocating guest_memfd ranges.

Got it, and then the merge logic triggers so you get the 1GB back before
guest launch. That seems reasonable. I was only thinking of the merge
logic in the context of a running guest and it didn't seem all that useful
in that regard, but it makes perfect sense for the above sort of scenario.

Thanks,

Mike

> 
> >
> > But you raise some very compelling points on why this might not be a
> > good idea even if that was how that discussion went.
> >
> > >
> > > Will following work?
> > > 1) Userspace starts all guest_memfd ranges as private.
> > > 2) During early guest boot it starts issuing PSC requests for
> > > converting memory from shared to private
> > >     -> KVM forwards this request to userspace
> > >     -> Userspace checks that the pages are already private and simply
> > > does nothing.
> > > 3) Pvalidate from guest on that memory will result in guest_memfd
> > > offset query which will cause the RMP table entries to actually get
> > > populated.
> >
> > That would work, but there will need to be changes on userspace to deal
> > with how SNP populates memory pre-boot just like normal VMs do. We will
> > instead need to copy that data into separate buffers, and pass those in
> > as the buffer hva instead of the shared hva corresponding to that GPA.
> 
> Initial guest memory payload generally carries a much smaller
> footprint so I ignored that detail in the above sequence. As I said
> above, userspace should be able to use guest_memfd ranges to directly
> populate contents by converting those ranges to shared.
> 
> >
> > But that seems reasonable if it avoids so many other problems.
> >
> > >
> > > >    we end up splitting every 1GB to 4K and the guest can't even
> > > >    accept/PVALIDATE it 2MB at that point even if userspace doesn't touch
> > > >    anything in the range. At some point the guest will convert/accept
> > > >    the entire range, at which point we could merge, but for SNP we'd
> > > >    need guest cooperation to actually use a higher granularity in stage2
> > > >    page tables at that point since RMP entries are effectively all split
> > > >    to 4K.
> > > >
> > > >    I understand the intent is to default to private where this wouldn't
> > > >    be an issue, and we could punt to userspace to deal with it, but it
> > > >    feels like an artificial restriction to place on userspace. And if we
> > > >    do want to allow/expect guest_memfd contents to be initialized pre-boot
> > > >    just like normal memory, then userspace would need to jump through
> > > >    some hoops:
> > > >
> > > >    - if defaulting to private: add hooks to convert each range that's being
> > > >      modified to a shared state prior to writing to it
> > >
> > > Why is that a problem?
> >
> > These were only problems if we went the above-mentioned way of
> > populating memory pre-boot via mmap() instead of other backchannels. If
> > we don't do that, then both these things cease to be problems. Sounds good
> > to me. :)
> 
> I think there wouldn't be a problem even if we pre-populated memory
> pre-boot via mmap(). Using mmap() seems a preferable option to me.
> 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-07-03 20:39               ` Michael Roth
@ 2025-07-07 14:55                 ` Vishal Annapurve
  2025-07-12  0:10                   ` Michael Roth
  0 siblings, 1 reply; 231+ messages in thread
From: Vishal Annapurve @ 2025-07-07 14:55 UTC (permalink / raw)
  To: Michael Roth
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

On Thu, Jul 3, 2025 at 1:41 PM Michael Roth <michael.roth@amd.com> wrote:
> > > > > >
> > > > > > Because shared pages are split once any memory is allocated, having a
> > > > > > way to INIT_PRIVATE could avoid the split and then merge on
> > > > > > conversion. I feel that is enough value to have this config flag, what
> > > > > > do you think?
> > > > > >
> > > > > > I guess we could also have userspace be careful not to do any allocation
> > > > > > before converting.
> > >
> > > (Re-visiting this with the assumption that we *don't* intend to use mmap() to
> > > populate memory (in which case you can pretty much ignore my previous
> > > response))
> >
> > I am assuming in-place conversion with huge page backing for the
> > discussion below.
> >
> > Looks like there are three scenarios/usecases we are discussing here:
> > 1) Pre-allocating guest_memfd file offsets
> >    - Userspace can use fallocate to do this for hugepages by keeping
> > the file ranges marked private.
> > 2) Prefaulting guest EPT/NPT entries
> > 3) Populating initial guest payload into guest_memfd memory
> >    - Userspace can mark certain ranges as shared, populate the
> > contents and convert the ranges back to private. So mmap will come in
> > handy here.
> >
> > >
> > > I'm still not sure where the INIT_PRIVATE flag comes into play. For SNP,
> > > userspace already defaults to marking everything private pretty close to
> > > guest_memfd creation time, so the potential for allocations to occur
> > > in-between seems small, but worth confirming.
> >
> > Ok, I am not much worried about whether the INIT_PRIVATE flag gets
> > supported or not, but more about the default setting that different
> > CVMs start with. To me, it looks like all CVMs should start as
> > everything private by default and if there is a way to bake that
> > configuration during guest_memfd creation time that would be good to
> > have instead of doing "create and convert" operations and there is a
> > fairly low cost to support this flag.
> >
> > >
> > > But I know in the past there was a desire to ensure TDX/SNP could
> > > support pre-allocating guest_memfd memory (and even pre-faulting via
> > > KVM_PRE_FAULT_MEMORY), but I think that could still work right? The
> > > fallocate() handling could still avoid the split if the whole hugepage
> > > is private, though there is a bit more potential for that fallocate()
> > > to happen before userspace does the "manually" shared->private
> > > conversion. I'll double-check on that aspect, but otherwise, is there
> > > still any other need for it?
> >
> > This usecase of being able to preallocate should still work with
> > in-place conversion assuming all ranges are private before
> > pre-population.
>
> Ok, I think I was missing that the merge logic here will then restore it
> to 1GB before the guest starts, so the folio isn't permanently split if
> we do the mmap() and that gives us more flexibility on how we can use
> it.
>
> I was thinking we needed to avoid the split from the start by avoiding
> paths like mmap() which might trigger the split. I was trying to avoid
> any merge->unsplit logic in the THP case (or unsplit in general), in
> which case we'd get permanent splits via the mmap() approach, but for
> 2MB that's probably not a big deal.

After initial payload population, during its runtime the guest can cause
different hugepages to get split, and they can remain split even after
the guest converts them back to private. For THP there may not be much
benefit in merging those pages together, especially if NPT/EPT entries
can't be promoted back to a hugepage mapping and there is no memory
penalty since THP doesn't use HVO.

Wishful thinking on my part: It would be great to figure out a way to
promote these pagetable entries without relying on the guest, if
possible with ABI updates, as I think the host should have some
control over EPT/NPT granularities even for Confidential VMs. Along
similar lines, it would be great to have "page struct"-less memory
working for Confidential VMs, which should greatly reduce the toil of
merge/split operations and render conversions into mostly pagetable
manipulations.

That being said, memory split and merge seem to be relatively
lightweight for THP (with no memory allocation/freeing), and reusing
the memory files after a reboot of the guest VM will require pages to
be merged to start with a clean slate. One option is to always merge
as early as possible; a second option is to invent a new UAPI to do it
on demand.

For 1G pages, even if we go with 1G -> 2M -> 4K split stages, page
splits result in higher memory usage when HVO is in use, so it becomes
useful to merge them back as early as possible as the guest proceeds
to convert subranges of different hugepages over its lifetime. Merging
pages as early as possible also allows reuse of the memory files across
the next reboot without having to invent a new UAPI.
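
For a rough sense of the memory cost in question (simple arithmetic, not a
figure from this thread): a 1G page split down to 4K needs 1G / 4K = 262144
struct pages, and at 64 bytes each that is 16 MiB of vmemmap per 1G page,
most of which HVO keeps freed for as long as the page remains a single
unsplit HugeTLB folio.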

Caveats with "merge as early as possible":
- Shared to private conversions will be slower for hugetlb pages.
   * Counter argument: These conversions are already slow as we need
safe refcounts to reach on the ranges getting converted.
- If guests convert a particular range often then extra merge/split
operations will result in overhead.
   * Counter argument: Since conversions are anyways slow, it's
beneficial for guests to avoid such a scenario and keep back and forth
conversions as less frequent as possible.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 18/51] mm: hugetlb: Cleanup interpretation of map_chg_state within alloc_hugetlb_folio()
  2025-05-14 23:41 ` [RFC PATCH v2 18/51] mm: hugetlb: Cleanup interpretation of map_chg_state within alloc_hugetlb_folio() Ackerley Tng
@ 2025-07-07 18:08   ` James Houghton
  0 siblings, 0 replies; 231+ messages in thread
From: James Houghton @ 2025-07-07 18:08 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jun.miao, kai.huang, keirf,
	kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li

On Wed, May 14, 2025 at 4:43 PM Ackerley Tng <ackerleytng@google.com> wrote:
>
> Interpreting map_chg_state inline, within alloc_hugetlb_folio(),
> improves readability.
>
> Instead of having cow_from_owner and the result of
> vma_needs_reservation() compute a map_chg_state, and then interpreting
> map_chg_state within alloc_hugetlb_folio() to determine whether to
>
> + Get a page from the subpool or
> + Charge cgroup reservations or
> + Commit vma reservations or
> + Clean up reservations
>
> This refactoring makes those decisions just based on whether a
> vma_reservation_exists. If a vma_reservation_exists, the subpool had
> already been debited and the cgroup had been charged, hence
> alloc_hugetlb_folio() should not double-debit or double-charge. If the
> vma reservation can't be used (as in cow_from_owner), then the vma
> reservation effectively does not exist and vma_reservation_exists is
> set to false.
>
> The conditions for committing reservations or cleaning are also
> updated to be paired with the corresponding conditions guarding
> reservation creation.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Change-Id: I22d72a2cae61fb64dc78e0a870b254811a06a31e

Hi Ackerley,

Can you help me better understand how useful the refactors in this and
the preceding patch are for the series as a whole?

It seems like you and Peter had two different, but mostly equivalent,
directions with how this code should be refactored[1]. Do you gain
much by replacing Peter's refactoring strategy? If it's mostly a
stylistic thing, maybe it would be better to remove these patches just
to get the number of patches to review down.

The logic in these two patches looks good to me, and I think I do
slightly prefer your approach. But if we could drop these patches
(i.e., mail them separately), that's probably better.

[1]: https://lore.kernel.org/linux-mm/20250107204002.2683356-5-peterx@redhat.com/

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 16/51] mm: hugetlb: Consolidate interpretation of gbl_chg within alloc_hugetlb_folio()
  2025-05-14 23:41 ` [RFC PATCH v2 16/51] mm: hugetlb: Consolidate interpretation of gbl_chg within alloc_hugetlb_folio() Ackerley Tng
  2025-05-15  2:09   ` Matthew Wilcox
  2025-05-28  8:55   ` Binbin Wu
@ 2025-07-07 18:27   ` James Houghton
  2 siblings, 0 replies; 231+ messages in thread
From: James Houghton @ 2025-07-07 18:27 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jun.miao, kai.huang, keirf,
	kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li

On Wed, May 14, 2025 at 4:43 PM Ackerley Tng <ackerleytng@google.com> wrote:
>
> Previously, gbl_chg was passed from alloc_hugetlb_folio() into
> dequeue_hugetlb_folio_vma(), leaking the concept of gbl_chg into
> dequeue_hugetlb_folio_vma().
>
> This patch consolidates the interpretation of gbl_chg into
> alloc_hugetlb_folio(), also renaming dequeue_hugetlb_folio_vma() to
> dequeue_hugetlb_folio() so dequeue_hugetlb_folio() can just focus on
> dequeuing a folio.
>
> Change-Id: I31bf48af2400b6e13b44d03c8be22ce1a9092a9c
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

I think I agree with Binbin[1] to either put the rename of
dequeue_hugetlb_folio{_vma => }() in its own patch or drop it
entirely.

I think the rename would 100% make sense if all of the
dequeue_hugetlb_folio*() functions were called from
dequeue_hugetlb_folio_vma() (i.e., after this patch,
dequeue_hugetlb_folio() was always the entry point to dequeue a
folio), but in fact dequeue_hugetlb_folio_nodemask() is not always
called from dequeue_hugetlb_folio_vma().

I don't feel strongly at all; either way the name is not confusing. So
feel free to add:

Reviewed-by: James Houghton <jthoughton@google.com>

[1]: https://lore.kernel.org/all/ad77da83-0e6e-47a1-abe7-8cfdfce8b254@linux.intel.com/

> ---
>  mm/hugetlb.c | 28 +++++++++++-----------------
>  1 file changed, 11 insertions(+), 17 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6ea1be71aa42..b843e869496f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1364,9 +1364,9 @@ static unsigned long available_huge_pages(struct hstate *h)
>         return h->free_huge_pages - h->resv_huge_pages;
>  }
>
> -static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
> -                               struct vm_area_struct *vma,
> -                               unsigned long address, long gbl_chg)
> +static struct folio *dequeue_hugetlb_folio(struct hstate *h,
> +                                          struct vm_area_struct *vma,
> +                                          unsigned long address)
>  {
>         struct folio *folio = NULL;
>         struct mempolicy *mpol;
> @@ -1374,13 +1374,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
>         nodemask_t *nodemask;
>         int nid;
>
> -       /*
> -        * gbl_chg==1 means the allocation requires a new page that was not
> -        * reserved before.  Making sure there's at least one free page.
> -        */
> -       if (gbl_chg && !available_huge_pages(h))
> -               goto err;
> -
>         gfp_mask = htlb_alloc_mask(h);
>         nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
>
> @@ -1398,9 +1391,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
>
>         mpol_cond_put(mpol);
>         return folio;
> -
> -err:
> -       return NULL;
>  }
>
>  /*
> @@ -3074,12 +3064,16 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>                 goto out_uncharge_cgroup_reservation;
>
>         spin_lock_irq(&hugetlb_lock);
> +
>         /*
> -        * glb_chg is passed to indicate whether or not a page must be taken
> -        * from the global free pool (global change).  gbl_chg == 0 indicates
> -        * a reservation exists for the allocation.
> +        * gbl_chg == 0 indicates a reservation exists for the allocation - so
> +        * try dequeuing a page. If there are available_huge_pages(), try using
> +        * them!
>          */
> -       folio = dequeue_hugetlb_folio_vma(h, vma, addr, gbl_chg);
> +       folio = NULL;
> +       if (!gbl_chg || available_huge_pages(h))
> +               folio = dequeue_hugetlb_folio(h, vma, addr);
> +
>         if (!folio) {
>                 spin_unlock_irq(&hugetlb_lock);
>                 folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
> --
> 2.49.0.1045.g170613ef41-goog
>

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-01 19:48             ` Vishal Annapurve
@ 2025-07-07 23:25               ` Sean Christopherson
  2025-07-08  0:14                 ` Vishal Annapurve
  0 siblings, 1 reply; 231+ messages in thread
From: Sean Christopherson @ 2025-07-07 23:25 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Yan Zhao, Xiaoyao Li, Ackerley Tng, kvm, linux-mm, linux-kernel,
	x86, linux-fsdevel, aik, ajones, akpm, amoorthy, anthony.yznaga,
	anup, aou, bfoster, binbin.wu, brauner, catalin.marinas,
	chao.p.peng, chenhuacai, dave.hansen, david, dmatlack, dwmw,
	erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, shuah, steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, yilun.xu, yuzenghui, zhiquan1.li

On Tue, Jul 01, 2025, Vishal Annapurve wrote:
> I would be curious to understand if we need zeroing on conversion for
> Confidential VMs. If not, then the simple rule of zeroing on
> allocation only will work for all usecases.

Unless I'm misunderstanding what you're asking, pKVM very specifically does NOT want
zeroing on conversion, because one of its use cases is in-place conversion, e.g.
to fill a shared buffer and then convert it to private so that the buffer can be
processed in the TEE.

Some architectures, e.g. SNP and TDX, may effectively require zeroing on conversion,
but that's essentially a property of the architecture, i.e. an arch/vendor specific
detail.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-07 23:25               ` Sean Christopherson
@ 2025-07-08  0:14                 ` Vishal Annapurve
  2025-07-08  1:08                   ` Edgecombe, Rick P
  0 siblings, 1 reply; 231+ messages in thread
From: Vishal Annapurve @ 2025-07-08  0:14 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yan Zhao, Xiaoyao Li, Ackerley Tng, kvm, linux-mm, linux-kernel,
	x86, linux-fsdevel, aik, ajones, akpm, amoorthy, anthony.yznaga,
	anup, aou, bfoster, binbin.wu, brauner, catalin.marinas,
	chao.p.peng, chenhuacai, dave.hansen, david, dmatlack, dwmw,
	erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, michael.roth, mpe, muchun.song, nikunj, nsaenz, oliver.upton,
	palmer, pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx,
	pgonda, pvorel, qperret, quic_cvanscha, quic_eberman,
	quic_mnalajal, quic_pderrin, quic_pheragu, quic_svaddagi,
	quic_tsoni, richard.weiyang, rick.p.edgecombe, rientjes, roypat,
	rppt, shuah, steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, yilun.xu, yuzenghui, zhiquan1.li

On Mon, Jul 7, 2025 at 4:25 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jul 01, 2025, Vishal Annapurve wrote:
> > I would be curious to understand if we need zeroing on conversion for
> > Confidential VMs. If not, then the simple rule of zeroing on
> > allocation only will work for all usecases.
>
> Unless I'm misunderstanding what your asking, pKVM very specific does NOT want
> zeroing on conversion, because one of its use cases is in-place conversion, e.g.
> to fill a shared buffer and then convert it to private so that the buffer can be
> processed in the TEE.

Yeah, that makes sense. So a "just zero on allocation" (and no more
zeroing during conversion) policy will work for pKVM.

>
> Some architectures, e.g. SNP and TDX, may effectively require zeroing on conversion,
> but that's essentially a property of the architecture, i.e. an arch/vendor specific
> detail.

The conversion operation is a unique capability supported by guest_memfd
files, so my intention in bringing up zeroing was to better understand
the need and clarify the role of guest_memfd in handling zeroing
during conversion.

Not sure if I am misinterpreting you, but treating "zeroing during
conversion" as the responsibility of arch/vendor specific
implementation outside of guest_memfd sounds good to me.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-08  0:14                 ` Vishal Annapurve
@ 2025-07-08  1:08                   ` Edgecombe, Rick P
  2025-07-08 14:20                     ` Sean Christopherson
  0 siblings, 1 reply; 231+ messages in thread
From: Edgecombe, Rick P @ 2025-07-08  1:08 UTC (permalink / raw)
  To: Annapurve, Vishal, seanjc@google.com
  Cc: palmer@dabbelt.com, kvm@vger.kernel.org, catalin.marinas@arm.com,
	Miao, Jun, nsaenz@amazon.es, Shutemov, Kirill,
	pdurrant@amazon.co.uk, peterx@redhat.com, x86@kernel.org,
	amoorthy@google.com, jack@suse.cz, maz@kernel.org,
	keirf@google.com, pvorel@suse.cz, anthony.yznaga@oracle.com,
	mail@maciej.szmigiero.name, hughd@google.com,
	quic_eberman@quicinc.com, Wang, Wei W, Du, Fan,
	Wieczor-Retman, Maciej, Zhao, Yan Y, ajones@ventanamicro.com,
	Hansen, Dave, paul.walmsley@sifive.com, quic_mnalajal@quicinc.com,
	aik@amd.com, steven.price@arm.com, vkuznets@redhat.com,
	fvdl@google.com, rppt@kernel.org, bfoster@redhat.com,
	quic_cvanscha@quicinc.com, vbabka@suse.cz, anup@brainfault.org,
	linux-kernel@vger.kernel.org, tabba@google.com, mic@digikod.net,
	oliver.upton@linux.dev, akpm@linux-foundation.org,
	usama.arif@bytedance.com, thomas.lendacky@amd.com,
	muchun.song@linux.dev, binbin.wu@linux.intel.com, Li, Zhiquan1,
	rientjes@google.com, Aktas, Erdem, mpe@ellerman.id.au,
	david@redhat.com, jgg@ziepe.ca, willy@infradead.org, Xu, Haibo1,
	jhubbard@nvidia.com, quic_svaddagi@quicinc.com, Yamahata, Isaku,
	jthoughton@google.com, steven.sistare@oracle.com,
	jarkko@kernel.org, quic_pheragu@quicinc.com,
	chenhuacai@kernel.org, Huang, Kai, shuah@kernel.org,
	dwmw@amazon.co.uk, Peng, Chao P, pankaj.gupta@amd.com,
	Graf, Alexander, nikunj@amd.com, viro@zeniv.linux.org.uk,
	pbonzini@redhat.com, yuzenghui@huawei.com, jroedel@suse.de,
	suzuki.poulose@arm.com, jgowans@amazon.com, Xu, Yilun,
	liam.merwick@oracle.com, michael.roth@amd.com,
	quic_tsoni@quicinc.com, Li, Xiaoyao, aou@eecs.berkeley.edu,
	Weiny, Ira, richard.weiyang@gmail.com, kent.overstreet@linux.dev,
	qperret@google.com, dmatlack@google.com, james.morse@arm.com,
	brauner@kernel.org, linux-fsdevel@vger.kernel.org,
	ackerleytng@google.com, pgonda@google.com,
	quic_pderrin@quicinc.com, hch@infradead.org, linux-mm@kvack.org,
	will@kernel.org, roypat@amazon.co.uk

On Mon, 2025-07-07 at 17:14 -0700, Vishal Annapurve wrote:
> > 
> > Some architectures, e.g. SNP and TDX, may effectively require zeroing on
> > conversion,
> > but that's essentially a property of the architecture, i.e. an arch/vendor
> > specific
> > detail.
> 
> Conversion operation is a unique capability supported by guest_memfd
> files so my intention of bringing up zeroing was to better understand
> the need and clarify the role of guest_memfd in handling zeroing
> during conversion.
> 
> Not sure if I am misinterpreting you, but treating "zeroing during
> conversion" as the responsibility of arch/vendor specific
> implementation outside of guest_memfd sounds good to me.

For TDX, if we don't zero on conversion from private->shared, we will be dependent
on the behavior of the CPU when reading memory with keyid 0, which was previously
encrypted and has some protection bits set. I don't *think* the behavior is
architectural. So it might be prudent to either make it so, or zero it in the
kernel, in order to not turn non-architectural behavior into userspace ABI.

Up the thread Vishal says we need to support operations that use in-place
conversion (overloaded term now I think, btw). Why exactly is pKVM using
private/shared conversion for this private data provisioning? Instead of a
special provisioning operation like the others? (Xiaoyao's suggestion)



^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-08  1:08                   ` Edgecombe, Rick P
@ 2025-07-08 14:20                     ` Sean Christopherson
  2025-07-08 14:52                       ` Edgecombe, Rick P
  0 siblings, 1 reply; 231+ messages in thread
From: Sean Christopherson @ 2025-07-08 14:20 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Vishal Annapurve, palmer@dabbelt.com, kvm@vger.kernel.org,
	catalin.marinas@arm.com, Jun Miao, nsaenz@amazon.es,
	Kirill Shutemov, pdurrant@amazon.co.uk, peterx@redhat.com,
	x86@kernel.org, amoorthy@google.com, jack@suse.cz, maz@kernel.org,
	keirf@google.com, pvorel@suse.cz, anthony.yznaga@oracle.com,
	mail@maciej.szmigiero.name, hughd@google.com,
	quic_eberman@quicinc.com, Wei W Wang, Fan Du,
	Wieczor-Retman, Maciej, Yan Y Zhao, ajones@ventanamicro.com,
	Dave Hansen, paul.walmsley@sifive.com, quic_mnalajal@quicinc.com,
	aik@amd.com, steven.price@arm.com, vkuznets@redhat.com,
	fvdl@google.com, rppt@kernel.org, bfoster@redhat.com,
	quic_cvanscha@quicinc.com, vbabka@suse.cz, anup@brainfault.org,
	linux-kernel@vger.kernel.org, tabba@google.com, mic@digikod.net,
	oliver.upton@linux.dev, akpm@linux-foundation.org,
	usama.arif@bytedance.com, thomas.lendacky@amd.com,
	muchun.song@linux.dev, binbin.wu@linux.intel.com, Zhiquan1 Li,
	rientjes@google.com, Erdem Aktas, mpe@ellerman.id.au,
	david@redhat.com, jgg@ziepe.ca, willy@infradead.org, Haibo1 Xu,
	jhubbard@nvidia.com, quic_svaddagi@quicinc.com, Isaku Yamahata,
	jthoughton@google.com, steven.sistare@oracle.com,
	jarkko@kernel.org, quic_pheragu@quicinc.com,
	chenhuacai@kernel.org, Kai Huang, shuah@kernel.org,
	dwmw@amazon.co.uk, Chao P Peng, pankaj.gupta@amd.com,
	Alexander Graf, nikunj@amd.com, viro@zeniv.linux.org.uk,
	pbonzini@redhat.com, yuzenghui@huawei.com, jroedel@suse.de,
	suzuki.poulose@arm.com, jgowans@amazon.com, Yilun Xu,
	liam.merwick@oracle.com, michael.roth@amd.com,
	quic_tsoni@quicinc.com, Xiaoyao Li, aou@eecs.berkeley.edu,
	Ira Weiny, richard.weiyang@gmail.com, kent.overstreet@linux.dev,
	qperret@google.com, dmatlack@google.com, james.morse@arm.com,
	brauner@kernel.org, linux-fsdevel@vger.kernel.org,
	ackerleytng@google.com, pgonda@google.com,
	quic_pderrin@quicinc.com, hch@infradead.org, linux-mm@kvack.org,
	will@kernel.org, roypat@amazon.co.uk

On Tue, Jul 08, 2025, Rick P Edgecombe wrote:
> On Mon, 2025-07-07 at 17:14 -0700, Vishal Annapurve wrote:
> > > 
> > > Some architectures, e.g. SNP and TDX, may effectively require zeroing on
> > > conversion,
> > > but that's essentially a property of the architecture, i.e. an arch/vendor
> > > specific
> > > detail.
> > 
> > Conversion operation is a unique capability supported by guest_memfd
> > files so my intention of bringing up zeroing was to better understand
> > the need and clarify the role of guest_memfd in handling zeroing
> > during conversion.
> > 
> > Not sure if I am misinterpreting you, but treating "zeroing during
> > conversion" as the responsibility of arch/vendor specific
> > implementation outside of guest_memfd sounds good to me.
> 
> For TDX if we don't zero on conversion from private->shared we will be dependent
> on behavior of the CPU when reading memory with keyid 0, which was previously
> encrypted and has some protection bits set. I don't *think* the behavior is
> architectural. So it might be prudent to either make it so, or zero it in the
> kernel in order to not make non-architectual behavior into userspace ABI.

Ya, by "vendor specific", I was also lumping in cases where the kernel would need
to zero memory in order to not end up with effectively undefined behavior.

> Up the thread Vishal says we need to support operations that use in-place
> conversion (overloaded term now I think, btw). Why exactly is pKVM using
> private/shared conversion for this private data provisioning?

Because it's literally converting memory from shared to private?  And IIUC, it's
not a one-time provisioning, e.g. memory can go:

  shared => fill => private => consume => shared => fill => private => consume

> Instead of a special provisioning operation like the others? (Xiaoyao's
> suggestion)

Are you referring to this suggestion?

 : And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
 : explicitly request that the page range is converted to private and the
 : content needs to be retained. So that TDX can identify which case needs
 : to call in-place TDH.PAGE.ADD.

If so, I agree with that idea, e.g. add a PRESERVE flag or whatever.  That way
userspace has explicit control over what happens to the data during conversion,
and KVM can reject unsupported conversions, e.g. PRESERVE is only allowed for
shared => private and only for select VM types.
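
As a rough sketch of that validation (hypothetical names, not existing uAPI),
it could boil down to something like:

#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag; nothing with this name exists in KVM's uAPI today. */
#define KVM_GMEM_CONVERT_FLAG_PRESERVE  (1ULL << 0)

static int gmem_validate_convert_flags(bool to_private, bool vm_allows_preserve,
                                       uint64_t flags)
{
        if (flags & ~KVM_GMEM_CONVERT_FLAG_PRESERVE)
                return -EINVAL;

        /* PRESERVE only for shared => private, and only on opted-in VM types. */
        if ((flags & KVM_GMEM_CONVERT_FLAG_PRESERVE) &&
            (!to_private || !vm_allows_preserve))
                return -EINVAL;

        return 0;
}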

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-08 14:20                     ` Sean Christopherson
@ 2025-07-08 14:52                       ` Edgecombe, Rick P
  2025-07-08 15:07                         ` Vishal Annapurve
  0 siblings, 1 reply; 231+ messages in thread
From: Edgecombe, Rick P @ 2025-07-08 14:52 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: pvorel@suse.cz, kvm@vger.kernel.org, catalin.marinas@arm.com,
	Miao, Jun, Shutemov, Kirill, pdurrant@amazon.co.uk,
	vbabka@suse.cz, peterx@redhat.com, x86@kernel.org,
	amoorthy@google.com, jack@suse.cz, quic_svaddagi@quicinc.com,
	keirf@google.com, palmer@dabbelt.com, vkuznets@redhat.com,
	mail@maciej.szmigiero.name, Annapurve, Vishal,
	anthony.yznaga@oracle.com, Wang, Wei W, tabba@google.com,
	Wieczor-Retman, Maciej, Zhao, Yan Y, ajones@ventanamicro.com,
	willy@infradead.org, rppt@kernel.org, quic_mnalajal@quicinc.com,
	aik@amd.com, usama.arif@bytedance.com, Hansen, Dave,
	fvdl@google.com, paul.walmsley@sifive.com, bfoster@redhat.com,
	nsaenz@amazon.es, anup@brainfault.org, quic_eberman@quicinc.com,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	mic@digikod.net, oliver.upton@linux.dev,
	akpm@linux-foundation.org, quic_cvanscha@quicinc.com,
	steven.price@arm.com, binbin.wu@linux.intel.com, hughd@google.com,
	Li, Zhiquan1, rientjes@google.com, mpe@ellerman.id.au,
	Aktas, Erdem, david@redhat.com, jgg@ziepe.ca, jhubbard@nvidia.com,
	Xu, Haibo1, Du, Fan, maz@kernel.org, muchun.song@linux.dev,
	Yamahata, Isaku, jthoughton@google.com, steven.sistare@oracle.com,
	quic_pheragu@quicinc.com, jarkko@kernel.org,
	chenhuacai@kernel.org, Huang, Kai, shuah@kernel.org,
	dwmw@amazon.co.uk, Peng, Chao P, pankaj.gupta@amd.com,
	Graf, Alexander, nikunj@amd.com, viro@zeniv.linux.org.uk,
	pbonzini@redhat.com, yuzenghui@huawei.com, jroedel@suse.de,
	suzuki.poulose@arm.com, jgowans@amazon.com, Xu, Yilun,
	liam.merwick@oracle.com, michael.roth@amd.com,
	quic_tsoni@quicinc.com, Li, Xiaoyao, aou@eecs.berkeley.edu,
	Weiny, Ira, richard.weiyang@gmail.com, kent.overstreet@linux.dev,
	qperret@google.com, dmatlack@google.com, james.morse@arm.com,
	brauner@kernel.org, linux-fsdevel@vger.kernel.org,
	ackerleytng@google.com, pgonda@google.com,
	quic_pderrin@quicinc.com, hch@infradead.org, linux-mm@kvack.org,
	will@kernel.org, roypat@amazon.co.uk

On Tue, 2025-07-08 at 07:20 -0700, Sean Christopherson wrote:
> > For TDX if we don't zero on conversion from private->shared we will be
> > dependent
> > on behavior of the CPU when reading memory with keyid 0, which was
> > previously
> > encrypted and has some protection bits set. I don't *think* the behavior is
> > architectural. So it might be prudent to either make it so, or zero it in
> > the
> > kernel in order to not make non-architectual behavior into userspace ABI.
> 
> Ya, by "vendor specific", I was also lumping in cases where the kernel would
> need to zero memory in order to not end up with effectively undefined
> behavior.

Yea, more of an answer to Vishal's question about whether CC VMs need zeroing. And
the answer is sort of yes, even though TDX doesn't require it. But we actually
don't want to zero memory when reclaiming it. So TDX KVM code needs to know
that the operation is a to-shared conversion and not another type of private
zap, e.g. via a callback from gmem, or maybe more simply a kernel-internal flag
set in gmem so that it knows it should zero the memory.
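
Just to illustrate (hypothetical names, no such interface exists in the
series), that signal could be as simple as:

#include <stdbool.h>

/* gmem tells the arch code why private memory is being zapped, so TDX can
 * zero on a to-shared conversion but skip zeroing on reclaim/truncation. */
enum gmem_private_zap_reason {
        GMEM_ZAP_CONVERT_TO_SHARED,     /* contents must not leak to shared */
        GMEM_ZAP_RECLAIM,               /* page is being freed, no zeroing needed */
        GMEM_ZAP_TRUNCATE,              /* hole punch / truncation */
};

static bool tdx_should_zero_on_private_zap(enum gmem_private_zap_reason reason)
{
        return reason == GMEM_ZAP_CONVERT_TO_SHARED;
}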

> 
> > Up the thread Vishal says we need to support operations that use in-place
> > conversion (overloaded term now I think, btw). Why exactly is pKVM using
> > private/shared conversion for this private data provisioning?
> 
> Because it's literally converting memory from shared to private?  And IICU,
> it's
> not a one-time provisioning, e.g. memory can go:
> 
>   shared => fill => private => consume => shared => fill => private => consume
> 
> > Instead of a special provisioning operation like the others? (Xiaoyao's
> > suggestion)
> 
> Are you referring to this suggestion?

Yea, in general to make it a specific, content-preserving operation.

> 
>  : And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
>  : explicitly request that the page range is converted to private and the
>  : content needs to be retained. So that TDX can identify which case needs
>  : to call in-place TDH.PAGE.ADD.
> 
> If so, I agree with that idea, e.g. add a PRESERVE flag or whatever.  That way
> userspace has explicit control over what happens to the data during
> conversion,
> and KVM can reject unsupported conversions, e.g. PRESERVE is only allowed for
> shared => private and only for select VM types.

Ok, we should POC how it works with TDX.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-08 14:52                       ` Edgecombe, Rick P
@ 2025-07-08 15:07                         ` Vishal Annapurve
  2025-07-08 15:31                           ` Edgecombe, Rick P
  2025-07-08 15:38                           ` Sean Christopherson
  0 siblings, 2 replies; 231+ messages in thread
From: Vishal Annapurve @ 2025-07-08 15:07 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: seanjc@google.com, pvorel@suse.cz, kvm@vger.kernel.org,
	catalin.marinas@arm.com, Miao, Jun, Shutemov, Kirill,
	pdurrant@amazon.co.uk, vbabka@suse.cz, peterx@redhat.com,
	x86@kernel.org, amoorthy@google.com, jack@suse.cz,
	quic_svaddagi@quicinc.com, keirf@google.com, palmer@dabbelt.com,
	vkuznets@redhat.com, mail@maciej.szmigiero.name,
	anthony.yznaga@oracle.com, Wang, Wei W, tabba@google.com,
	Wieczor-Retman, Maciej, Zhao, Yan Y, ajones@ventanamicro.com,
	willy@infradead.org, rppt@kernel.org, quic_mnalajal@quicinc.com,
	aik@amd.com, usama.arif@bytedance.com, Hansen, Dave,
	fvdl@google.com, paul.walmsley@sifive.com, bfoster@redhat.com,
	nsaenz@amazon.es, anup@brainfault.org, quic_eberman@quicinc.com,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	mic@digikod.net, oliver.upton@linux.dev,
	akpm@linux-foundation.org, quic_cvanscha@quicinc.com,
	steven.price@arm.com, binbin.wu@linux.intel.com, hughd@google.com,
	Li, Zhiquan1, rientjes@google.com, mpe@ellerman.id.au,
	Aktas, Erdem, david@redhat.com, jgg@ziepe.ca, jhubbard@nvidia.com,
	Xu, Haibo1, Du, Fan, maz@kernel.org, muchun.song@linux.dev,
	Yamahata, Isaku, jthoughton@google.com, steven.sistare@oracle.com,
	quic_pheragu@quicinc.com, jarkko@kernel.org,
	chenhuacai@kernel.org, Huang, Kai, shuah@kernel.org,
	dwmw@amazon.co.uk, Peng, Chao P, pankaj.gupta@amd.com,
	Graf, Alexander, nikunj@amd.com, viro@zeniv.linux.org.uk,
	pbonzini@redhat.com, yuzenghui@huawei.com, jroedel@suse.de,
	suzuki.poulose@arm.com, jgowans@amazon.com, Xu, Yilun,
	liam.merwick@oracle.com, michael.roth@amd.com,
	quic_tsoni@quicinc.com, Li, Xiaoyao, aou@eecs.berkeley.edu,
	Weiny, Ira, richard.weiyang@gmail.com, kent.overstreet@linux.dev,
	qperret@google.com, dmatlack@google.com, james.morse@arm.com,
	brauner@kernel.org, linux-fsdevel@vger.kernel.org,
	ackerleytng@google.com, pgonda@google.com,
	quic_pderrin@quicinc.com, hch@infradead.org, linux-mm@kvack.org,
	will@kernel.org, roypat@amazon.co.uk

On Tue, Jul 8, 2025 at 7:52 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2025-07-08 at 07:20 -0700, Sean Christopherson wrote:
> > > For TDX if we don't zero on conversion from private->shared we will be
> > > dependent
> > > on behavior of the CPU when reading memory with keyid 0, which was
> > > previously
> > > encrypted and has some protection bits set. I don't *think* the behavior is
> > > architectural. So it might be prudent to either make it so, or zero it in
> > > the
> > > kernel in order to not make non-architectual behavior into userspace ABI.
> >
> > Ya, by "vendor specific", I was also lumping in cases where the kernel would
> > need to zero memory in order to not end up with effectively undefined
> > behavior.
>
> Yea, more of an answer to Vishal's question about if CC VMs need zeroing. And
> the answer is sort of yes, even though TDX doesn't require it. But we actually
> don't want to zero memory when reclaiming memory. So TDX KVM code needs to know
> that the operation is a to-shared conversion and not another type of private
> zap. Like a callback from gmem, or maybe more simply a kernel internal flag to
> set in gmem such that it knows it should zero it.

If the answer is that "always zero on private to shared conversions"
for all CC VMs, then does the scheme outlined in [1] make sense for
handling the private -> shared conversions? For pKVM, there can be a
VM type check to avoid the zeroing during conversions and instead just
zero on allocations. This allows delaying zeroing until the fault time
for CC VMs and can be done in guest_memfd centrally. We will need more
inputs from the SEV side for this discussion.

[1] https://lore.kernel.org/lkml/CAGtprH-83EOz8rrUjE+O8m7nUDjt=THyXx=kfft1xQry65mtQg@mail.gmail.com/
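
To make the scheme concrete, a minimal sketch of the central policy
(hypothetical names, assuming a per-VM-type "preserves contents across
conversion" property):

#include <stdbool.h>

struct gmem_vm_policy {
        bool preserve_on_conversion;    /* pKVM-style in-place conversion */
};

/* Folios are always zeroed once, at allocation/fault time. */
static bool gmem_zero_on_allocation(void)
{
        return true;
}

/* Private => shared: zero centrally in guest_memfd, unless the VM type
 * relies on preserving contents (pKVM). */
static bool gmem_zero_on_convert_to_shared(const struct gmem_vm_policy *p)
{
        return !p->preserve_on_conversion;
}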

>
> >
> > > Up the thread Vishal says we need to support operations that use in-place
> > > conversion (overloaded term now I think, btw). Why exactly is pKVM using
> > > private/shared conversion for this private data provisioning?
> >
> > Because it's literally converting memory from shared to private?  And IICU,
> > it's
> > not a one-time provisioning, e.g. memory can go:
> >
> >   shared => fill => private => consume => shared => fill => private => consume
> >
> > > Instead of a special provisioning operation like the others? (Xiaoyao's
> > > suggestion)
> >
> > Are you referring to this suggestion?
>
> Yea, in general to make it a specific operation preserving operation.
>
> >
> >  : And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
> >  : explicitly request that the page range is converted to private and the
> >  : content needs to be retained. So that TDX can identify which case needs
> >  : to call in-place TDH.PAGE.ADD.
> >
> > If so, I agree with that idea, e.g. add a PRESERVE flag or whatever.  That way
> > userspace has explicit control over what happens to the data during
> > conversion,
> > and KVM can reject unsupported conversions, e.g. PRESERVE is only allowed for
> > shared => private and only for select VM types.
>
> Ok, we should POC how it works with TDX.

I don't think we need a flag to preserve memory as I mentioned in [2]. IIUC,
1) Conversions are always content-preserving for pKVM.
2) Shared to private conversions are always content-preserving for all
VMs as far as guest_memfd is concerned.
3) Private to shared conversions are not content-preserving for CC VMs
as far as guest_memfd is concerned, subject to more discussions.

[2] https://lore.kernel.org/lkml/CAGtprH-Kzn2kOGZ4JuNtUT53Hugw64M-_XMmhz_gCiDS6BAFtQ@mail.gmail.com/

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-08 15:07                         ` Vishal Annapurve
@ 2025-07-08 15:31                           ` Edgecombe, Rick P
  2025-07-08 17:16                             ` Vishal Annapurve
  2025-07-08 15:38                           ` Sean Christopherson
  1 sibling, 1 reply; 231+ messages in thread
From: Edgecombe, Rick P @ 2025-07-08 15:31 UTC (permalink / raw)
  To: Annapurve, Vishal
  Cc: pvorel@suse.cz, kvm@vger.kernel.org, catalin.marinas@arm.com,
	Miao, Jun, palmer@dabbelt.com, pdurrant@amazon.co.uk,
	steven.price@arm.com, peterx@redhat.com, x86@kernel.org,
	amoorthy@google.com, tabba@google.com, quic_svaddagi@quicinc.com,
	jack@suse.cz, vkuznets@redhat.com, quic_eberman@quicinc.com,
	keirf@google.com, mail@maciej.szmigiero.name,
	anthony.yznaga@oracle.com, Wang, Wei W, rppt@kernel.org,
	Wieczor-Retman, Maciej, Zhao, Yan Y, ajones@ventanamicro.com,
	Hansen, Dave, paul.walmsley@sifive.com, quic_mnalajal@quicinc.com,
	aik@amd.com, usama.arif@bytedance.com, fvdl@google.com,
	quic_cvanscha@quicinc.com, Shutemov, Kirill, vbabka@suse.cz,
	anup@brainfault.org, thomas.lendacky@amd.com,
	linux-kernel@vger.kernel.org, mic@digikod.net,
	oliver.upton@linux.dev, Du, Fan, akpm@linux-foundation.org,
	muchun.song@linux.dev, binbin.wu@linux.intel.com, Li, Zhiquan1,
	rientjes@google.com, mpe@ellerman.id.au, Aktas, Erdem,
	david@redhat.com, jgg@ziepe.ca, willy@infradead.org,
	hughd@google.com, Xu, Haibo1, jhubbard@nvidia.com, maz@kernel.org,
	Yamahata, Isaku, jthoughton@google.com, will@kernel.org,
	steven.sistare@oracle.com, jarkko@kernel.org,
	quic_pheragu@quicinc.com, nsaenz@amazon.es, chenhuacai@kernel.org,
	Huang, Kai, shuah@kernel.org, bfoster@redhat.com,
	dwmw@amazon.co.uk, Peng, Chao P, pankaj.gupta@amd.com,
	Graf, Alexander, nikunj@amd.com, viro@zeniv.linux.org.uk,
	pbonzini@redhat.com, yuzenghui@huawei.com, jroedel@suse.de,
	suzuki.poulose@arm.com, jgowans@amazon.com, Xu, Yilun,
	liam.merwick@oracle.com, michael.roth@amd.com,
	quic_tsoni@quicinc.com, Li, Xiaoyao, aou@eecs.berkeley.edu,
	Weiny, Ira, richard.weiyang@gmail.com, kent.overstreet@linux.dev,
	qperret@google.com, dmatlack@google.com, james.morse@arm.com,
	brauner@kernel.org, linux-fsdevel@vger.kernel.org,
	ackerleytng@google.com, pgonda@google.com,
	quic_pderrin@quicinc.com, hch@infradead.org, linux-mm@kvack.org,
	seanjc@google.com, roypat@amazon.co.uk

On Tue, 2025-07-08 at 08:07 -0700, Vishal Annapurve wrote:
> On Tue, Jul 8, 2025 at 7:52 AM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> > 
> > On Tue, 2025-07-08 at 07:20 -0700, Sean Christopherson wrote:
> > > > For TDX if we don't zero on conversion from private->shared we will be
> > > > dependent
> > > > on behavior of the CPU when reading memory with keyid 0, which was
> > > > previously
> > > > encrypted and has some protection bits set. I don't *think* the behavior is
> > > > architectural. So it might be prudent to either make it so, or zero it in
> > > > the
> > > > kernel in order to not make non-architectual behavior into userspace ABI.
> > > 
> > > Ya, by "vendor specific", I was also lumping in cases where the kernel would
> > > need to zero memory in order to not end up with effectively undefined
> > > behavior.
> > 
> > Yea, more of an answer to Vishal's question about if CC VMs need zeroing. And
> > the answer is sort of yes, even though TDX doesn't require it. But we actually
> > don't want to zero memory when reclaiming memory. So TDX KVM code needs to know
> > that the operation is a to-shared conversion and not another type of private
> > zap. Like a callback from gmem, or maybe more simply a kernel internal flag to
> > set in gmem such that it knows it should zero it.
> 
> If the answer is that "always zero on private to shared conversions"
> for all CC VMs, then does the scheme outlined in [1] make sense for
> handling the private -> shared conversions? For pKVM, there can be a
> VM type check to avoid the zeroing during conversions and instead just
> zero on allocations. This allows delaying zeroing until the fault time
> for CC VMs and can be done in guest_memfd centrally. We will need more
> inputs from the SEV side for this discussion.
> 
> [1] https://lore.kernel.org/lkml/CAGtprH-83EOz8rrUjE+O8m7nUDjt=THyXx=kfft1xQry65mtQg@mail.gmail.com/

It's nice that we don't double-zero (since the TDX module will do it too) for
private allocation/mapping. Seems ok to me.

> 
> > 
> > > 
> > > > Up the thread Vishal says we need to support operations that use in-place
> > > > conversion (overloaded term now I think, btw). Why exactly is pKVM using
> > > > private/shared conversion for this private data provisioning?
> > > 
> > > Because it's literally converting memory from shared to private?  And IICU,
> > > it's
> > > not a one-time provisioning, e.g. memory can go:
> > > 
> > >   shared => fill => private => consume => shared => fill => private => consume
> > > 
> > > > Instead of a special provisioning operation like the others? (Xiaoyao's
> > > > suggestion)
> > > 
> > > Are you referring to this suggestion?
> > 
> > Yea, in general to make it a specific operation preserving operation.
> > 
> > > 
> > >  : And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
> > >  : explicitly request that the page range is converted to private and the
> > >  : content needs to be retained. So that TDX can identify which case needs
> > >  : to call in-place TDH.PAGE.ADD.
> > > 
> > > If so, I agree with that idea, e.g. add a PRESERVE flag or whatever.  That way
> > > userspace has explicit control over what happens to the data during
> > > conversion,
> > > and KVM can reject unsupported conversions, e.g. PRESERVE is only allowed for
> > > shared => private and only for select VM types.
> > 
> > Ok, we should POC how it works with TDX.
> 
> I don't think we need a flag to preserve memory as I mentioned in [2]. IIUC,
> 1) Conversions are always content-preserving for pKVM.
> 2) Shared to private conversions are always content-preserving for all
> VMs as far as guest_memfd is concerned.
> 3) Private to shared conversions are not content-preserving for CC VMs
> as far as guest_memfd is concerned, subject to more discussions.
> 
> [2] https://lore.kernel.org/lkml/CAGtprH-Kzn2kOGZ4JuNtUT53Hugw64M-_XMmhz_gCiDS6BAFtQ@mail.gmail.com/

Right, I read that. I still don't see why pKVM needs to do normal private/shared
conversion for data provisioning, as opposed to a dedicated operation/flag that
makes it a special case.

I'm trying to suggest there could be a benefit to making all gmem VM types
behave the same. If conversions are always content-preserving for pKVM, why
can't userspace always use the operation that says "preserve content", instead
of changing the behavior of the common operations?

So for all VM types, the user ABI would be:
private->shared          - Always zeros the page
shared->private          - Always destructive
shared->private (w/flag) - Always preserves data, or returns an error if not possible


Do you see a problem?


^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-08 15:07                         ` Vishal Annapurve
  2025-07-08 15:31                           ` Edgecombe, Rick P
@ 2025-07-08 15:38                           ` Sean Christopherson
  2025-07-08 16:22                             ` Fuad Tabba
  1 sibling, 1 reply; 231+ messages in thread
From: Sean Christopherson @ 2025-07-08 15:38 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Rick P Edgecombe, pvorel@suse.cz, kvm@vger.kernel.org,
	catalin.marinas@arm.com, Jun Miao, Kirill Shutemov,
	pdurrant@amazon.co.uk, vbabka@suse.cz, peterx@redhat.com,
	x86@kernel.org, amoorthy@google.com, jack@suse.cz,
	quic_svaddagi@quicinc.com, keirf@google.com, palmer@dabbelt.com,
	vkuznets@redhat.com, mail@maciej.szmigiero.name,
	anthony.yznaga@oracle.com, Wei W Wang, tabba@google.com,
	Wieczor-Retman, Maciej, Yan Y Zhao, ajones@ventanamicro.com,
	willy@infradead.org, rppt@kernel.org, quic_mnalajal@quicinc.com,
	aik@amd.com, usama.arif@bytedance.com, Dave Hansen,
	fvdl@google.com, paul.walmsley@sifive.com, bfoster@redhat.com,
	nsaenz@amazon.es, anup@brainfault.org, quic_eberman@quicinc.com,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	mic@digikod.net, oliver.upton@linux.dev,
	akpm@linux-foundation.org, quic_cvanscha@quicinc.com,
	steven.price@arm.com, binbin.wu@linux.intel.com, hughd@google.com,
	Zhiquan1 Li, rientjes@google.com, mpe@ellerman.id.au, Erdem Aktas,
	david@redhat.com, jgg@ziepe.ca, jhubbard@nvidia.com, Haibo1 Xu,
	Fan Du, maz@kernel.org, muchun.song@linux.dev, Isaku Yamahata,
	jthoughton@google.com, steven.sistare@oracle.com,
	quic_pheragu@quicinc.com, jarkko@kernel.org,
	chenhuacai@kernel.org, Kai Huang, shuah@kernel.org,
	dwmw@amazon.co.uk, Chao P Peng, pankaj.gupta@amd.com,
	Alexander Graf, nikunj@amd.com, viro@zeniv.linux.org.uk,
	pbonzini@redhat.com, yuzenghui@huawei.com, jroedel@suse.de,
	suzuki.poulose@arm.com, jgowans@amazon.com, Yilun Xu,
	liam.merwick@oracle.com, michael.roth@amd.com,
	quic_tsoni@quicinc.com, Xiaoyao Li, aou@eecs.berkeley.edu,
	Ira Weiny, richard.weiyang@gmail.com, kent.overstreet@linux.dev,
	qperret@google.com, dmatlack@google.com, james.morse@arm.com,
	brauner@kernel.org, linux-fsdevel@vger.kernel.org,
	ackerleytng@google.com, pgonda@google.com,
	quic_pderrin@quicinc.com, hch@infradead.org, linux-mm@kvack.org,
	will@kernel.org, roypat@amazon.co.uk

On Tue, Jul 08, 2025, Vishal Annapurve wrote:
> On Tue, Jul 8, 2025 at 7:52 AM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> >
> > On Tue, 2025-07-08 at 07:20 -0700, Sean Christopherson wrote:
> > > > For TDX if we don't zero on conversion from private->shared we will be
> > > > dependent
> > > > on behavior of the CPU when reading memory with keyid 0, which was
> > > > previously
> > > > encrypted and has some protection bits set. I don't *think* the behavior is
> > > > architectural. So it might be prudent to either make it so, or zero it in
> > > > the
> > > > kernel in order to not make non-architectual behavior into userspace ABI.
> > >
> > > Ya, by "vendor specific", I was also lumping in cases where the kernel would
> > > need to zero memory in order to not end up with effectively undefined
> > > behavior.
> >
> > Yea, more of an answer to Vishal's question about if CC VMs need zeroing. And
> > the answer is sort of yes, even though TDX doesn't require it. But we actually
> > don't want to zero memory when reclaiming memory. So TDX KVM code needs to know
> > that the operation is a to-shared conversion and not another type of private
> > zap. Like a callback from gmem, or maybe more simply a kernel internal flag to
> > set in gmem such that it knows it should zero it.
> 
> If the answer is that "always zero on private to shared conversions"
> for all CC VMs,

pKVM VMs *are* CoCo VMs.  Just because pKVM doesn't rely on third party firmware
to provide confidentiality and integrity doesn't make it any less of a CoCo VM.

> > >  : And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
> > >  : explicitly request that the page range is converted to private and the
> > >  : content needs to be retained. So that TDX can identify which case needs
> > >  : to call in-place TDH.PAGE.ADD.
> > >
> > > If so, I agree with that idea, e.g. add a PRESERVE flag or whatever.  That way
> > > userspace has explicit control over what happens to the data during
> > > conversion,
> > > and KVM can reject unsupported conversions, e.g. PRESERVE is only allowed for
> > > shared => private and only for select VM types.
> >
> > Ok, we should POC how it works with TDX.
> 
> I don't think we need a flag to preserve memory as I mentioned in [2]. IIUC,
> 1) Conversions are always content-preserving for pKVM.

No?  Preserving contents on private => shared is a security vulnerability waiting
to happen.

> 2) Shared to private conversions are always content-preserving for all
> VMs as far as guest_memfd is concerned.

There is no "as far as guest_memfd is concerned".  Userspace doesn't care whether
code lives in guest_memfd.c versus arch/xxx/kvm, the only thing that matters is
the behavior that userspace sees.  I don't want to end up with userspace ABI that
is vendor/VM specific.

> 3) Private to shared conversions are not content-preserving for CC VMs
> as far as guest_memfd is concerned, subject to more discussions.
> 
> [2] https://lore.kernel.org/lkml/CAGtprH-Kzn2kOGZ4JuNtUT53Hugw64M-_XMmhz_gCiDS6BAFtQ@mail.gmail.com/

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-08 15:38                           ` Sean Christopherson
@ 2025-07-08 16:22                             ` Fuad Tabba
  2025-07-08 17:25                               ` Sean Christopherson
  0 siblings, 1 reply; 231+ messages in thread
From: Fuad Tabba @ 2025-07-08 16:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vishal Annapurve, Rick P Edgecombe, pvorel@suse.cz,
	kvm@vger.kernel.org, catalin.marinas@arm.com, Jun Miao,
	Kirill Shutemov, pdurrant@amazon.co.uk, vbabka@suse.cz,
	peterx@redhat.com, x86@kernel.org, amoorthy@google.com,
	jack@suse.cz, quic_svaddagi@quicinc.com, keirf@google.com,
	palmer@dabbelt.com, vkuznets@redhat.com,
	mail@maciej.szmigiero.name, anthony.yznaga@oracle.com, Wei W Wang,
	Wieczor-Retman, Maciej, Yan Y Zhao, ajones@ventanamicro.com,
	willy@infradead.org, rppt@kernel.org, quic_mnalajal@quicinc.com,
	aik@amd.com, usama.arif@bytedance.com, Dave Hansen,
	fvdl@google.com, paul.walmsley@sifive.com, bfoster@redhat.com,
	nsaenz@amazon.es, anup@brainfault.org, quic_eberman@quicinc.com,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	mic@digikod.net, oliver.upton@linux.dev,
	akpm@linux-foundation.org, quic_cvanscha@quicinc.com,
	steven.price@arm.com, binbin.wu@linux.intel.com, hughd@google.com,
	Zhiquan1 Li, rientjes@google.com, mpe@ellerman.id.au, Erdem Aktas,
	david@redhat.com, jgg@ziepe.ca, jhubbard@nvidia.com, Haibo1 Xu,
	Fan Du, maz@kernel.org, muchun.song@linux.dev, Isaku Yamahata,
	jthoughton@google.com, steven.sistare@oracle.com,
	quic_pheragu@quicinc.com, jarkko@kernel.org,
	chenhuacai@kernel.org, Kai Huang, shuah@kernel.org,
	dwmw@amazon.co.uk, Chao P Peng, pankaj.gupta@amd.com,
	Alexander Graf, nikunj@amd.com, viro@zeniv.linux.org.uk,
	pbonzini@redhat.com, yuzenghui@huawei.com, jroedel@suse.de,
	suzuki.poulose@arm.com, jgowans@amazon.com, Yilun Xu,
	liam.merwick@oracle.com, michael.roth@amd.com,
	quic_tsoni@quicinc.com, Xiaoyao Li, aou@eecs.berkeley.edu,
	Ira Weiny, richard.weiyang@gmail.com, kent.overstreet@linux.dev,
	qperret@google.com, dmatlack@google.com, james.morse@arm.com,
	brauner@kernel.org, linux-fsdevel@vger.kernel.org,
	ackerleytng@google.com, pgonda@google.com,
	quic_pderrin@quicinc.com, hch@infradead.org, linux-mm@kvack.org,
	will@kernel.org, roypat@amazon.co.uk

Hi Sean,

On Tue, 8 Jul 2025 at 16:39, Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jul 08, 2025, Vishal Annapurve wrote:
> > On Tue, Jul 8, 2025 at 7:52 AM Edgecombe, Rick P
> > <rick.p.edgecombe@intel.com> wrote:
> > >
> > > On Tue, 2025-07-08 at 07:20 -0700, Sean Christopherson wrote:
> > > > > For TDX if we don't zero on conversion from private->shared we will be
> > > > > dependent
> > > > > on behavior of the CPU when reading memory with keyid 0, which was
> > > > > previously
> > > > > encrypted and has some protection bits set. I don't *think* the behavior is
> > > > > architectural. So it might be prudent to either make it so, or zero it in
> > > > > the
> > > > > kernel in order to not make non-architectual behavior into userspace ABI.
> > > >
> > > > Ya, by "vendor specific", I was also lumping in cases where the kernel would
> > > > need to zero memory in order to not end up with effectively undefined
> > > > behavior.
> > >
> > > Yea, more of an answer to Vishal's question about if CC VMs need zeroing. And
> > > the answer is sort of yes, even though TDX doesn't require it. But we actually
> > > don't want to zero memory when reclaiming memory. So TDX KVM code needs to know
> > > that the operation is a to-shared conversion and not another type of private
> > > zap. Like a callback from gmem, or maybe more simply a kernel internal flag to
> > > set in gmem such that it knows it should zero it.
> >
> > If the answer is that "always zero on private to shared conversions"
> > for all CC VMs,
>
> pKVM VMs *are* CoCo VMs.  Just because pKVM doesn't rely on third party firmware
> to provide confidentiality and integrity doesn't make it any less of a CoCo VM.



> > > >  : And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
> > > >  : explicitly request that the page range is converted to private and the
> > > >  : content needs to be retained. So that TDX can identify which case needs
> > > >  : to call in-place TDH.PAGE.ADD.
> > > >
> > > > If so, I agree with that idea, e.g. add a PRESERVE flag or whatever.  That way
> > > > userspace has explicit control over what happens to the data during
> > > > conversion,
> > > > and KVM can reject unsupported conversions, e.g. PRESERVE is only allowed for
> > > > shared => private and only for select VM types.
> > >
> > > Ok, we should POC how it works with TDX.
> >
> > I don't think we need a flag to preserve memory as I mentioned in [2]. IIUC,
> > 1) Conversions are always content-preserving for pKVM.
>
> No?  Perserving contents on private => shared is a security vulnerability waiting
> to happen.

Actually, it is one of the requirements for pKVM, as well as its current
behavior. We would like to preserve contents both ways, private <=>
shared, since that is required by some of the potential use cases (e.g., a
guest handling video encoding/decoding).

To make it clear, I'm talking about explicit sharing from the guest,
not relinquishing memory back to the host. In the case of
relinquishing (and guest teardown), relinquished memory is poisoned
(zeroed) in pKVM.

Cheers,
/fuad

> > 2) Shared to private conversions are always content-preserving for all
> > VMs as far as guest_memfd is concerned.
>
> There is no "as far as guest_memfd is concerned".  Userspace doesn't care whether
> code lives in guest_memfd.c versus arch/xxx/kvm, the only thing that matters is
> the behavior that userspace sees.  I don't want to end up with userspace ABI that
> is vendor/VM specific.
>
> > 3) Private to shared conversions are not content-preserving for CC VMs
> > as far as guest_memfd is concerned, subject to more discussions.
> >
> > [2] https://lore.kernel.org/lkml/CAGtprH-Kzn2kOGZ4JuNtUT53Hugw64M-_XMmhz_gCiDS6BAFtQ@mail.gmail.com/

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-08 15:31                           ` Edgecombe, Rick P
@ 2025-07-08 17:16                             ` Vishal Annapurve
  2025-07-08 17:39                               ` Edgecombe, Rick P
  0 siblings, 1 reply; 231+ messages in thread
From: Vishal Annapurve @ 2025-07-08 17:16 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: pvorel@suse.cz, kvm@vger.kernel.org, catalin.marinas@arm.com,
	Miao, Jun, palmer@dabbelt.com, pdurrant@amazon.co.uk,
	steven.price@arm.com, peterx@redhat.com, x86@kernel.org,
	amoorthy@google.com, tabba@google.com, quic_svaddagi@quicinc.com,
	jack@suse.cz, vkuznets@redhat.com, quic_eberman@quicinc.com,
	keirf@google.com, mail@maciej.szmigiero.name,
	anthony.yznaga@oracle.com, Wang, Wei W, rppt@kernel.org,
	Wieczor-Retman, Maciej, Zhao, Yan Y, ajones@ventanamicro.com,
	Hansen, Dave, paul.walmsley@sifive.com, quic_mnalajal@quicinc.com,
	aik@amd.com, usama.arif@bytedance.com, fvdl@google.com,
	quic_cvanscha@quicinc.com, Shutemov, Kirill, vbabka@suse.cz,
	anup@brainfault.org, thomas.lendacky@amd.com,
	linux-kernel@vger.kernel.org, mic@digikod.net,
	oliver.upton@linux.dev, Du, Fan, akpm@linux-foundation.org,
	muchun.song@linux.dev, binbin.wu@linux.intel.com, Li, Zhiquan1,
	rientjes@google.com, mpe@ellerman.id.au, Aktas, Erdem,
	david@redhat.com, jgg@ziepe.ca, willy@infradead.org,
	hughd@google.com, Xu, Haibo1, jhubbard@nvidia.com, maz@kernel.org,
	Yamahata, Isaku, jthoughton@google.com, will@kernel.org,
	steven.sistare@oracle.com, jarkko@kernel.org,
	quic_pheragu@quicinc.com, nsaenz@amazon.es, chenhuacai@kernel.org,
	Huang, Kai, shuah@kernel.org, bfoster@redhat.com,
	dwmw@amazon.co.uk, Peng, Chao P, pankaj.gupta@amd.com,
	Graf, Alexander, nikunj@amd.com, viro@zeniv.linux.org.uk,
	pbonzini@redhat.com, yuzenghui@huawei.com, jroedel@suse.de,
	suzuki.poulose@arm.com, jgowans@amazon.com, Xu, Yilun,
	liam.merwick@oracle.com, michael.roth@amd.com,
	quic_tsoni@quicinc.com, Li, Xiaoyao, aou@eecs.berkeley.edu,
	Weiny, Ira, richard.weiyang@gmail.com, kent.overstreet@linux.dev,
	qperret@google.com, dmatlack@google.com, james.morse@arm.com,
	brauner@kernel.org, linux-fsdevel@vger.kernel.org,
	ackerleytng@google.com, pgonda@google.com,
	quic_pderrin@quicinc.com, hch@infradead.org, linux-mm@kvack.org,
	seanjc@google.com, roypat@amazon.co.uk

On Tue, Jul 8, 2025 at 8:31 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2025-07-08 at 08:07 -0700, Vishal Annapurve wrote:
> > On Tue, Jul 8, 2025 at 7:52 AM Edgecombe, Rick P
> > <rick.p.edgecombe@intel.com> wrote:
> > >
> > > On Tue, 2025-07-08 at 07:20 -0700, Sean Christopherson wrote:
> > > > > For TDX if we don't zero on conversion from private->shared we will be
> > > > > dependent
> > > > > on behavior of the CPU when reading memory with keyid 0, which was
> > > > > previously
> > > > > encrypted and has some protection bits set. I don't *think* the behavior is
> > > > > architectural. So it might be prudent to either make it so, or zero it in
> > > > > the
> > > > > kernel in order to not make non-architectual behavior into userspace ABI.
> > > >
> > > > Ya, by "vendor specific", I was also lumping in cases where the kernel would
> > > > need to zero memory in order to not end up with effectively undefined
> > > > behavior.
> > >
> > > Yea, more of an answer to Vishal's question about if CC VMs need zeroing. And
> > > the answer is sort of yes, even though TDX doesn't require it. But we actually
> > > don't want to zero memory when reclaiming memory. So TDX KVM code needs to know
> > > that the operation is a to-shared conversion and not another type of private
> > > zap. Like a callback from gmem, or maybe more simply a kernel internal flag to
> > > set in gmem such that it knows it should zero it.
> >
> > If the answer is that "always zero on private to shared conversions"
> > for all CC VMs, then does the scheme outlined in [1] make sense for
> > handling the private -> shared conversions? For pKVM, there can be a
> > VM type check to avoid the zeroing during conversions and instead just
> > zero on allocations. This allows delaying zeroing until the fault time
> > for CC VMs and can be done in guest_memfd centrally. We will need more
> > inputs from the SEV side for this discussion.
> >
> > [1] https://lore.kernel.org/lkml/CAGtprH-83EOz8rrUjE+O8m7nUDjt=THyXx=kfft1xQry65mtQg@mail.gmail.com/
>
> It's nice that we don't double zero (since TDX module will do it too) for
> private allocation/mapping. Seems ok to me.
>
> >
> > >
> > > >
> > > > > Up the thread Vishal says we need to support operations that use in-place
> > > > > conversion (overloaded term now I think, btw). Why exactly is pKVM using
> > > > > private/shared conversion for this private data provisioning?
> > > >
> > > > Because it's literally converting memory from shared to private?  And IICU,
> > > > it's
> > > > not a one-time provisioning, e.g. memory can go:
> > > >
> > > >   shared => fill => private => consume => shared => fill => private => consume
> > > >
> > > > > Instead of a special provisioning operation like the others? (Xiaoyao's
> > > > > suggestion)
> > > >
> > > > Are you referring to this suggestion?
> > >
> > > Yea, in general to make it a specific operation preserving operation.
> > >
> > > >
> > > >  : And maybe a new flag for KVM_GMEM_CONVERT_PRIVATE for user space to
> > > >  : explicitly request that the page range is converted to private and the
> > > >  : content needs to be retained. So that TDX can identify which case needs
> > > >  : to call in-place TDH.PAGE.ADD.
> > > >
> > > > If so, I agree with that idea, e.g. add a PRESERVE flag or whatever.  That way
> > > > userspace has explicit control over what happens to the data during
> > > > conversion,
> > > > and KVM can reject unsupported conversions, e.g. PRESERVE is only allowed for
> > > > shared => private and only for select VM types.
> > >
> > > Ok, we should POC how it works with TDX.
> >
> > I don't think we need a flag to preserve memory as I mentioned in [2]. IIUC,
> > 1) Conversions are always content-preserving for pKVM.
> > 2) Shared to private conversions are always content-preserving for all
> > VMs as far as guest_memfd is concerned.
> > 3) Private to shared conversions are not content-preserving for CC VMs
> > as far as guest_memfd is concerned, subject to more discussions.
> >
> > [2] https://lore.kernel.org/lkml/CAGtprH-Kzn2kOGZ4JuNtUT53Hugw64M-_XMmhz_gCiDS6BAFtQ@mail.gmail.com/
>
> Right, I read that. I still don't see why pKVM needs to do normal private/shared
> conversion for data provisioning. Vs a dedicated operation/flag to make it a
> special case.

It's dictated by pKVM use cases: memory contents need to be preserved
for every conversion, not just for initial payload population.

>
> I'm trying to suggest there could be a benefit to making all gmem VM types
> behave the same. If conversions are always content preserving for pKVM, why
> can't userspace  always use the operation that says preserve content? Vs
> changing the behavior of the common operations?

I don't see a benefit in userspace passing a flag that's effectively the
default for the VM type (assuming pKVM will use a special VM type).
Common operations in guest_memfd will need to check either the
userspace-passed flag or the VM type, so there is no major change in the
guest_memfd implementation for either mechanism.

>
> So for all VM types, the user ABI would be:
> private->shared          - Always zero's page
> shared->private          - Always destructive
> shared->private (w/flag) - Always preserves data or return error if not possible
>
>
> Do you see a problem?
>

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-08 16:22                             ` Fuad Tabba
@ 2025-07-08 17:25                               ` Sean Christopherson
  2025-07-08 18:37                                 ` Fuad Tabba
  0 siblings, 1 reply; 231+ messages in thread
From: Sean Christopherson @ 2025-07-08 17:25 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Vishal Annapurve, Rick P Edgecombe, pvorel@suse.cz,
	kvm@vger.kernel.org, catalin.marinas@arm.com, Jun Miao,
	Kirill Shutemov, pdurrant@amazon.co.uk, vbabka@suse.cz,
	peterx@redhat.com, x86@kernel.org, amoorthy@google.com,
	jack@suse.cz, quic_svaddagi@quicinc.com, keirf@google.com,
	palmer@dabbelt.com, vkuznets@redhat.com,
	mail@maciej.szmigiero.name, anthony.yznaga@oracle.com, Wei W Wang,
	Wieczor-Retman, Maciej, Yan Y Zhao, ajones@ventanamicro.com,
	willy@infradead.org, rppt@kernel.org, quic_mnalajal@quicinc.com,
	aik@amd.com, usama.arif@bytedance.com, Dave Hansen,
	fvdl@google.com, paul.walmsley@sifive.com, bfoster@redhat.com,
	nsaenz@amazon.es, anup@brainfault.org, quic_eberman@quicinc.com,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	mic@digikod.net, oliver.upton@linux.dev,
	akpm@linux-foundation.org, quic_cvanscha@quicinc.com,
	steven.price@arm.com, binbin.wu@linux.intel.com, hughd@google.com,
	Zhiquan1 Li, rientjes@google.com, mpe@ellerman.id.au, Erdem Aktas,
	david@redhat.com, jgg@ziepe.ca, jhubbard@nvidia.com, Haibo1 Xu,
	Fan Du, maz@kernel.org, muchun.song@linux.dev, Isaku Yamahata,
	jthoughton@google.com, steven.sistare@oracle.com,
	quic_pheragu@quicinc.com, jarkko@kernel.org,
	chenhuacai@kernel.org, Kai Huang, shuah@kernel.org,
	dwmw@amazon.co.uk, Chao P Peng, pankaj.gupta@amd.com,
	Alexander Graf, nikunj@amd.com, viro@zeniv.linux.org.uk,
	pbonzini@redhat.com, yuzenghui@huawei.com, jroedel@suse.de,
	suzuki.poulose@arm.com, jgowans@amazon.com, Yilun Xu,
	liam.merwick@oracle.com, michael.roth@amd.com,
	quic_tsoni@quicinc.com, Xiaoyao Li, aou@eecs.berkeley.edu,
	Ira Weiny, richard.weiyang@gmail.com, kent.overstreet@linux.dev,
	qperret@google.com, dmatlack@google.com, james.morse@arm.com,
	brauner@kernel.org, linux-fsdevel@vger.kernel.org,
	ackerleytng@google.com, pgonda@google.com,
	quic_pderrin@quicinc.com, hch@infradead.org, linux-mm@kvack.org,
	will@kernel.org, roypat@amazon.co.uk

On Tue, Jul 08, 2025, Fuad Tabba wrote:
> > > I don't think we need a flag to preserve memory as I mentioned in [2]. IIUC,
> > > 1) Conversions are always content-preserving for pKVM.
> >
> > No?  Perserving contents on private => shared is a security vulnerability waiting
> > to happen.
> 
> Actually it is one of the requirements for pKVM as well as its current
> behavior. We would like to preserve contents both ways, private <=>
> shared, since it is required by some of the potential use cases (e.g.,
> guest handling video encoding/decoding).
> 
> To make it clear, I'm talking about explicit sharing from the guest,
> not relinquishing memory back to the host. In the case of
> relinquishing (and guest teardown), relinquished memory is poisoned
> (zeroed) in pKVM.

I forget, what does the "explicit sharing" flow look like?  E.g. how/when does pKVM
know it's ok to convert memory from private to shared?  I think we'd still want
to make data preservation optional, e.g. to avoid potential leakage with setups
where memory is private by default, but a flag in KVM's uAPI might not be a good
fit since whether or not to preserve data is more of a guest decision (or at least
needs to be ok'd by the guest).

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-08 17:16                             ` Vishal Annapurve
@ 2025-07-08 17:39                               ` Edgecombe, Rick P
  2025-07-08 18:03                                 ` Sean Christopherson
  0 siblings, 1 reply; 231+ messages in thread
From: Edgecombe, Rick P @ 2025-07-08 17:39 UTC (permalink / raw)
  To: Annapurve, Vishal
  Cc: pvorel@suse.cz, kvm@vger.kernel.org, catalin.marinas@arm.com,
	Miao, Jun, nsaenz@amazon.es, Shutemov, Kirill,
	pdurrant@amazon.co.uk, peterx@redhat.com, x86@kernel.org,
	tabba@google.com, amoorthy@google.com, quic_svaddagi@quicinc.com,
	jack@suse.cz, vkuznets@redhat.com, quic_eberman@quicinc.com,
	keirf@google.com, mail@maciej.szmigiero.name,
	anthony.yznaga@oracle.com, Wang, Wei W, palmer@dabbelt.com,
	Wieczor-Retman, Maciej, Zhao, Yan Y, ajones@ventanamicro.com,
	willy@infradead.org, paul.walmsley@sifive.com, Hansen, Dave,
	aik@amd.com, usama.arif@bytedance.com, quic_mnalajal@quicinc.com,
	fvdl@google.com, rppt@kernel.org, quic_cvanscha@quicinc.com,
	maz@kernel.org, vbabka@suse.cz, anup@brainfault.org,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	mic@digikod.net, oliver.upton@linux.dev, Du, Fan,
	akpm@linux-foundation.org, steven.price@arm.com,
	muchun.song@linux.dev, binbin.wu@linux.intel.com, Li, Zhiquan1,
	rientjes@google.com, mpe@ellerman.id.au, Aktas, Erdem,
	david@redhat.com, jgg@ziepe.ca, hughd@google.com,
	jhubbard@nvidia.com, Xu, Haibo1, Yamahata, Isaku,
	jthoughton@google.com, steven.sistare@oracle.com,
	quic_pheragu@quicinc.com, jarkko@kernel.org,
	chenhuacai@kernel.org, Huang, Kai, shuah@kernel.org,
	bfoster@redhat.com, dwmw@amazon.co.uk, Peng, Chao P,
	pankaj.gupta@amd.com, Graf, Alexander, nikunj@amd.com,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Xu, Yilun, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com, Li, Xiaoyao,
	aou@eecs.berkeley.edu, Weiny, Ira, richard.weiyang@gmail.com,
	kent.overstreet@linux.dev, qperret@google.com,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	roypat@amazon.co.uk, linux-fsdevel@vger.kernel.org,
	ackerleytng@google.com, pgonda@google.com,
	quic_pderrin@quicinc.com, hch@infradead.org, will@kernel.org,
	seanjc@google.com, linux-mm@kvack.org

On Tue, 2025-07-08 at 10:16 -0700, Vishal Annapurve wrote:
> > Right, I read that. I still don't see why pKVM needs to do normal
> > private/shared
> > conversion for data provisioning. Vs a dedicated operation/flag to make it a
> > special case.
> 
> It's dictated by pKVM usecases, memory contents need to be preserved
> for every conversion not just for initial payload population.

We are weighing pros/cons between:
 - Unifying this uABI across all gmemfd VM types
 - Userspace for one VM type passing a flag for its special non-shared use case

I don't see how passing a flag or not is dictated by pKVM use case.

P.S. This doesn't really impact TDX, I think, except that TDX development needs
to work in the code without bumping into anything. So I'm just wishing to work
in code with fewer conditionals.

> 
> > 
> > I'm trying to suggest there could be a benefit to making all gmem VM types
> > behave the same. If conversions are always content preserving for pKVM, why
> > can't userspace  always use the operation that says preserve content? Vs
> > changing the behavior of the common operations?
> 
> I don't see a benefit of userspace passing a flag that's kind of
> default for the VM type (assuming pKVM will use a special VM type).

The benefit is that we don't need to have special VM default behavior for
gmemfd. Think about if some day (very hypothetical and made up) we want to add a
mode for TDX that adds new private data to a running guest (with special accept
on the guest side or something). Then we might want to add a flag to override
the default destructive behavior. Then maybe pKVM wants to add a "don't
preserve" operation and it adds a second flag to not destroy. Now gmemfd has
lots of VM-specific flags. The point of this example is to show how a unified uABI
can be helpful.

> Common operations in guest_memfd will need to either check for the
> userspace passed flag or the VM type, so no major change in
> guest_memfd implementation for either mechanism.

While we discuss ABI, we should allow ourselves to think ahead. So, is a gmemfd
fd tied to a VM? I think there is interest in de-coupling it? Is the VM type
sticky?

It seems the more they are separate, the better it will be to not have VM-aware
behavior living in gmem.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-08 17:39                               ` Edgecombe, Rick P
@ 2025-07-08 18:03                                 ` Sean Christopherson
  2025-07-08 18:13                                   ` Edgecombe, Rick P
  2025-07-08 19:28                                   ` Vishal Annapurve
  0 siblings, 2 replies; 231+ messages in thread
From: Sean Christopherson @ 2025-07-08 18:03 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Vishal Annapurve, pvorel@suse.cz, kvm@vger.kernel.org,
	catalin.marinas@arm.com, Jun Miao, nsaenz@amazon.es,
	Kirill Shutemov, pdurrant@amazon.co.uk, peterx@redhat.com,
	x86@kernel.org, tabba@google.com, amoorthy@google.com,
	quic_svaddagi@quicinc.com, jack@suse.cz, vkuznets@redhat.com,
	quic_eberman@quicinc.com, keirf@google.com,
	mail@maciej.szmigiero.name, anthony.yznaga@oracle.com, Wei W Wang,
	palmer@dabbelt.com, Wieczor-Retman, Maciej, Yan Y Zhao,
	ajones@ventanamicro.com, willy@infradead.org,
	paul.walmsley@sifive.com, Dave Hansen, aik@amd.com,
	usama.arif@bytedance.com, quic_mnalajal@quicinc.com,
	fvdl@google.com, rppt@kernel.org, quic_cvanscha@quicinc.com,
	maz@kernel.org, vbabka@suse.cz, anup@brainfault.org,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	mic@digikod.net, oliver.upton@linux.dev, Fan Du,
	akpm@linux-foundation.org, steven.price@arm.com,
	muchun.song@linux.dev, binbin.wu@linux.intel.com, Zhiquan1 Li,
	rientjes@google.com, mpe@ellerman.id.au, Erdem Aktas,
	david@redhat.com, jgg@ziepe.ca, hughd@google.com,
	jhubbard@nvidia.com, Haibo1 Xu, Isaku Yamahata,
	jthoughton@google.com, steven.sistare@oracle.com,
	quic_pheragu@quicinc.com, jarkko@kernel.org,
	chenhuacai@kernel.org, Kai Huang, shuah@kernel.org,
	bfoster@redhat.com, dwmw@amazon.co.uk, Chao P Peng,
	pankaj.gupta@amd.com, Alexander Graf, nikunj@amd.com,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Yilun Xu, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com, Xiaoyao Li,
	aou@eecs.berkeley.edu, Ira Weiny, richard.weiyang@gmail.com,
	kent.overstreet@linux.dev, qperret@google.com,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	roypat@amazon.co.uk, linux-fsdevel@vger.kernel.org,
	ackerleytng@google.com, pgonda@google.com,
	quic_pderrin@quicinc.com, hch@infradead.org, will@kernel.org,
	linux-mm@kvack.org

On Tue, Jul 08, 2025, Rick P Edgecombe wrote:
> On Tue, 2025-07-08 at 10:16 -0700, Vishal Annapurve wrote:
> > > Right, I read that. I still don't see why pKVM needs to do normal
> > > private/shared
> > > conversion for data provisioning. Vs a dedicated operation/flag to make it a
> > > special case.
> > 
> > It's dictated by pKVM usecases, memory contents need to be preserved
> > for every conversion not just for initial payload population.
> 
> We are weighing pros/cons between:
>  - Unifying this uABI across all gmemfd VM types
>  - Userspace for one VM type passing a flag for it's special non-shared use case
> 
> I don't see how passing a flag or not is dictated by pKVM use case.

Yep.  Baking the behavior of a single usecase into the kernel's ABI is rarely a
good idea.  Just because pKVM's current usecases always want contents to be
preserved doesn't mean that pKVM will never change.

As a general rule, KVM should push policy to userspace whenever possible.

> P.S. This doesn't really impact TDX I think. Except that TDX development needs
> to work in the code without bumping anything. So just wishing to work in code
> with less conditionals.
> 
> > 
> > > 
> > > I'm trying to suggest there could be a benefit to making all gmem VM types
> > > behave the same. If conversions are always content preserving for pKVM, why
> > > can't userspace  always use the operation that says preserve content? Vs
> > > changing the behavior of the common operations?
> > 
> > I don't see a benefit of userspace passing a flag that's kind of
> > default for the VM type (assuming pKVM will use a special VM type).
> 
> The benefit is that we don't need to have special VM default behavior for
> gmemfd. Think about if some day (very hypothetical and made up) we want to add a
> mode for TDX that adds new private data to a running guest (with special accept
> on the guest side or something). Then we might want to add a flag to override
> the default destructive behavior. Then maybe pKVM wants to add a "don't
> preserve" operation and it adds a second flag to not destroy. Now gmemfd has
> lots of VM specific flags. The point of this example is to show how unified uABI
> can he helpful.

Yep again. Pivoting on the VM type would be completely inflexible.  If pKVM gains
a usecase that wants to zero memory on conversions, we're hosed.  If SNP or TDX
gains the ability to preserve data on conversions, we're hosed.

The VM type may restrict what is possible, but (a) that should be abstracted,
e.g. by defining the allowed flags during guest_memfd creation, and (b) the
capabilities of the guest_memfd instance need to be communicated to userspace.
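
Purely for illustration, something along these lines (none of the names or
structs below exist; they're placeholders for "flags fixed and advertised at
creation, checked generically at conversion time"):

  #include <linux/types.h>
  #include <errno.h>

  #define GMEM_CAP_PRESERVE (1ULL << 0)   /* hypothetical capability/flag bit */

  struct gmem_create_caps {
          __u64 supported_flags;          /* reported back to userspace at creation */
  };

  struct gmem_convert {
          __u64 offset;
          __u64 size;
          __u64 flags;                    /* e.g. the hypothetical GMEM_CAP_PRESERVE */
  };

  static int gmem_convert_check(__u64 supported_flags, __u64 requested_flags)
  {
          /* Generic check against the advertised caps, no VM-type conditionals. */
          if (requested_flags & ~supported_flags)
                  return -EINVAL;
          return 0;
  }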
 
> > Common operations in guest_memfd will need to either check for the
> > userspace passed flag or the VM type, so no major change in
> > guest_memfd implementation for either mechanism.
> 
> While we discuss ABI, we should allow ourselves to think ahead. So, is a gmemfd
> fd tied to a VM?

Yes.

> I think there is interest in de-coupling it?

No?  Even if we get to a point where multiple distinct VMs can bind to a single
guest_memfd, e.g. for inter-VM shared memory, there will still need to be a sole
owner of the memory.  AFAICT, fully decoupling guest_memfd from a VM would add
non-trivial complexity for zero practical benefit.

> Is the VM type sticky?
> 
> It seems the more they are separate, the better it will be to not have VM-aware
> behavior living in gmem.

Ya.  A guest_memfd instance may have capabilities/features that are restricted
and/or defined based on the properties of the owning VM, but we should do our
best to make guest_memfd itself blissfully unaware of the VM type.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-08 18:03                                 ` Sean Christopherson
@ 2025-07-08 18:13                                   ` Edgecombe, Rick P
  2025-07-08 18:55                                     ` Sean Christopherson
  2025-07-08 19:28                                   ` Vishal Annapurve
  1 sibling, 1 reply; 231+ messages in thread
From: Edgecombe, Rick P @ 2025-07-08 18:13 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: pvorel@suse.cz, kvm@vger.kernel.org, catalin.marinas@arm.com,
	Miao, Jun, palmer@dabbelt.com, pdurrant@amazon.co.uk,
	vbabka@suse.cz, peterx@redhat.com, x86@kernel.org,
	amoorthy@google.com, tabba@google.com, quic_svaddagi@quicinc.com,
	maz@kernel.org, vkuznets@redhat.com, anthony.yznaga@oracle.com,
	mail@maciej.szmigiero.name, Annapurve, Vishal,
	quic_eberman@quicinc.com, Wang, Wei W, Du, Fan,
	Wieczor-Retman, Maciej, Zhao, Yan Y, ajones@ventanamicro.com,
	Hansen, Dave, paul.walmsley@sifive.com, quic_mnalajal@quicinc.com,
	aik@amd.com, usama.arif@bytedance.com, fvdl@google.com,
	jack@suse.cz, quic_cvanscha@quicinc.com, Shutemov, Kirill,
	willy@infradead.org, steven.price@arm.com, anup@brainfault.org,
	thomas.lendacky@amd.com, keirf@google.com, mic@digikod.net,
	linux-kernel@vger.kernel.org, nsaenz@amazon.es,
	akpm@linux-foundation.org, oliver.upton@linux.dev,
	binbin.wu@linux.intel.com, muchun.song@linux.dev, Li, Zhiquan1,
	rientjes@google.com, Aktas, Erdem, mpe@ellerman.id.au,
	david@redhat.com, jgg@ziepe.ca, hughd@google.com,
	jhubbard@nvidia.com, Xu, Haibo1, Yamahata, Isaku,
	jthoughton@google.com, rppt@kernel.org, steven.sistare@oracle.com,
	jarkko@kernel.org, quic_pheragu@quicinc.com,
	chenhuacai@kernel.org, Huang, Kai, shuah@kernel.org,
	bfoster@redhat.com, dwmw@amazon.co.uk, Peng, Chao P,
	pankaj.gupta@amd.com, Graf, Alexander, nikunj@amd.com,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Xu, Yilun, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com, Li, Xiaoyao,
	aou@eecs.berkeley.edu, Weiny, Ira, richard.weiyang@gmail.com,
	kent.overstreet@linux.dev, qperret@google.com,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	linux-fsdevel@vger.kernel.org, ackerleytng@google.com,
	pgonda@google.com, quic_pderrin@quicinc.com, roypat@amazon.co.uk,
	hch@infradead.org, will@kernel.org, linux-mm@kvack.org

On Tue, 2025-07-08 at 11:03 -0700, Sean Christopherson wrote:
> > I think there is interest in de-coupling it?
> 
> No?

I'm talking about the intra-host migration/reboot optimization stuff. And not
doing a good job of explaining it, sorry.

>   Even if we get to a point where multiple distinct VMs can bind to a single
> guest_memfd, e.g. for inter-VM shared memory, there will still need to be a
> sole
> owner of the memory.  AFAICT, fully decoupling guest_memfd from a VM would add
> non-trivial complexity for zero practical benefit.

I'm talking about moving a gmem fd between different VMs or something using
KVM_LINK_GUEST_MEMFD [0]. Not advocating to try to support it. But trying to
feel out where the concepts are headed. It kind of allows gmem fds (or just
their source memory?) to live beyond a VM lifecycle.

[0] https://lore.kernel.org/all/cover.1747368092.git.afranji@google.com/
https://lore.kernel.org/kvm/cover.1749672978.git.afranji@google.com/

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-08 17:25                               ` Sean Christopherson
@ 2025-07-08 18:37                                 ` Fuad Tabba
  2025-07-16 23:06                                   ` Ackerley Tng
  0 siblings, 1 reply; 231+ messages in thread
From: Fuad Tabba @ 2025-07-08 18:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vishal Annapurve, Rick P Edgecombe, pvorel@suse.cz,
	kvm@vger.kernel.org, catalin.marinas@arm.com, Jun Miao,
	Kirill Shutemov, pdurrant@amazon.co.uk, vbabka@suse.cz,
	peterx@redhat.com, x86@kernel.org, amoorthy@google.com,
	jack@suse.cz, quic_svaddagi@quicinc.com, keirf@google.com,
	palmer@dabbelt.com, vkuznets@redhat.com,
	mail@maciej.szmigiero.name, anthony.yznaga@oracle.com, Wei W Wang,
	Wieczor-Retman, Maciej, Yan Y Zhao, ajones@ventanamicro.com,
	willy@infradead.org, rppt@kernel.org, quic_mnalajal@quicinc.com,
	aik@amd.com, usama.arif@bytedance.com, Dave Hansen,
	fvdl@google.com, paul.walmsley@sifive.com, bfoster@redhat.com,
	nsaenz@amazon.es, anup@brainfault.org, quic_eberman@quicinc.com,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	mic@digikod.net, oliver.upton@linux.dev,
	akpm@linux-foundation.org, quic_cvanscha@quicinc.com,
	steven.price@arm.com, binbin.wu@linux.intel.com, hughd@google.com,
	Zhiquan1 Li, rientjes@google.com, mpe@ellerman.id.au, Erdem Aktas,
	david@redhat.com, jgg@ziepe.ca, jhubbard@nvidia.com, Haibo1 Xu,
	Fan Du, maz@kernel.org, muchun.song@linux.dev, Isaku Yamahata,
	jthoughton@google.com, steven.sistare@oracle.com,
	quic_pheragu@quicinc.com, jarkko@kernel.org,
	chenhuacai@kernel.org, Kai Huang, shuah@kernel.org,
	dwmw@amazon.co.uk, Chao P Peng, pankaj.gupta@amd.com,
	Alexander Graf, nikunj@amd.com, viro@zeniv.linux.org.uk,
	pbonzini@redhat.com, yuzenghui@huawei.com, jroedel@suse.de,
	suzuki.poulose@arm.com, jgowans@amazon.com, Yilun Xu,
	liam.merwick@oracle.com, michael.roth@amd.com,
	quic_tsoni@quicinc.com, Xiaoyao Li, aou@eecs.berkeley.edu,
	Ira Weiny, richard.weiyang@gmail.com, kent.overstreet@linux.dev,
	qperret@google.com, dmatlack@google.com, james.morse@arm.com,
	brauner@kernel.org, linux-fsdevel@vger.kernel.org,
	ackerleytng@google.com, pgonda@google.com,
	quic_pderrin@quicinc.com, hch@infradead.org, linux-mm@kvack.org,
	will@kernel.org, roypat@amazon.co.uk

On Tue, 8 Jul 2025 at 18:25, Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jul 08, 2025, Fuad Tabba wrote:
> > > > I don't think we need a flag to preserve memory as I mentioned in [2]. IIUC,
> > > > 1) Conversions are always content-preserving for pKVM.
> > >
> > > No?  Perserving contents on private => shared is a security vulnerability waiting
> > > to happen.
> >
> > Actually it is one of the requirements for pKVM as well as its current
> > behavior. We would like to preserve contents both ways, private <=>
> > shared, since it is required by some of the potential use cases (e.g.,
> > guest handling video encoding/decoding).
> >
> > To make it clear, I'm talking about explicit sharing from the guest,
> > not relinquishing memory back to the host. In the case of
> > relinquishing (and guest teardown), relinquished memory is poisoned
> > (zeroed) in pKVM.
>
> I forget, what's the "explicit sharing" flow look like?  E.g. how/when does pKVM
> know it's ok to convert memory from private to shared?  I think we'd still want
> to make data preservation optional, e.g. to avoid potential leakage with setups
> where memory is private by default, but a flag in KVM's uAPI might not be a good
> fit since whether or not to preserve data is more of a guest decision (or at least
> needs to be ok'd by the guest).

In pKVM all sharing and unsharing is triggered by the guest via
hypercalls. The host cannot unshare. That said, making data
preservation optional works for pKVM and is a good idea, for the
reasons that you've mentioned.

Cheers,
/fuad

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-08 18:13                                   ` Edgecombe, Rick P
@ 2025-07-08 18:55                                     ` Sean Christopherson
  2025-07-08 21:23                                       ` Edgecombe, Rick P
  2025-07-09 14:28                                       ` Vishal Annapurve
  0 siblings, 2 replies; 231+ messages in thread
From: Sean Christopherson @ 2025-07-08 18:55 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: pvorel@suse.cz, kvm@vger.kernel.org, catalin.marinas@arm.com,
	Jun Miao, palmer@dabbelt.com, pdurrant@amazon.co.uk,
	vbabka@suse.cz, peterx@redhat.com, x86@kernel.org,
	amoorthy@google.com, tabba@google.com, quic_svaddagi@quicinc.com,
	maz@kernel.org, vkuznets@redhat.com, anthony.yznaga@oracle.com,
	mail@maciej.szmigiero.name, Vishal Annapurve,
	quic_eberman@quicinc.com, Wei W Wang, Fan Du,
	Wieczor-Retman, Maciej, Yan Y Zhao, ajones@ventanamicro.com,
	Dave Hansen, paul.walmsley@sifive.com, quic_mnalajal@quicinc.com,
	aik@amd.com, usama.arif@bytedance.com, fvdl@google.com,
	jack@suse.cz, quic_cvanscha@quicinc.com, Kirill Shutemov,
	willy@infradead.org, steven.price@arm.com, anup@brainfault.org,
	thomas.lendacky@amd.com, keirf@google.com, mic@digikod.net,
	linux-kernel@vger.kernel.org, nsaenz@amazon.es,
	akpm@linux-foundation.org, oliver.upton@linux.dev,
	binbin.wu@linux.intel.com, muchun.song@linux.dev, Zhiquan1 Li,
	rientjes@google.com, Erdem Aktas, mpe@ellerman.id.au,
	david@redhat.com, jgg@ziepe.ca, hughd@google.com,
	jhubbard@nvidia.com, Haibo1 Xu, Isaku Yamahata,
	jthoughton@google.com, rppt@kernel.org, steven.sistare@oracle.com,
	jarkko@kernel.org, quic_pheragu@quicinc.com,
	chenhuacai@kernel.org, Kai Huang, shuah@kernel.org,
	bfoster@redhat.com, dwmw@amazon.co.uk, Chao P Peng,
	pankaj.gupta@amd.com, Alexander Graf, nikunj@amd.com,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Yilun Xu, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com, Xiaoyao Li,
	aou@eecs.berkeley.edu, Ira Weiny, richard.weiyang@gmail.com,
	kent.overstreet@linux.dev, qperret@google.com,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	linux-fsdevel@vger.kernel.org, ackerleytng@google.com,
	pgonda@google.com, quic_pderrin@quicinc.com, roypat@amazon.co.uk,
	hch@infradead.org, will@kernel.org, linux-mm@kvack.org

On Tue, Jul 08, 2025, Rick P Edgecombe wrote:
> On Tue, 2025-07-08 at 11:03 -0700, Sean Christopherson wrote:
> > > I think there is interest in de-coupling it?
> > 
> > No?
> 
> I'm talking about the intra-host migration/reboot optimization stuff. And not
> doing a good job, sorry.
> 
> >   Even if we get to a point where multiple distinct VMs can bind to a single
> > guest_memfd, e.g. for inter-VM shared memory, there will still need to be a
> > sole
> > owner of the memory.  AFAICT, fully decoupling guest_memfd from a VM would add
> > non-trivial complexity for zero practical benefit.
> 
> I'm talking about moving a gmem fd between different VMs or something using
> KVM_LINK_GUEST_MEMFD [0]. Not advocating to try to support it. But trying to
> feel out where the concepts are headed. It kind of allows gmem fds (or just
> their source memory?) to live beyond a VM lifecycle.

I think the answer is that we want to let guest_memfd live beyond the "struct kvm"
instance, but not beyond the Virtual Machine.  From a past discussion on this topic[*].

 : No go.  Because again, the inode (physical memory) is coupled to the virtual machine
 : as a thing, not to a "struct kvm".  Or more concretely, the inode is coupled to an
 : ASID or an HKID, and there can be multiple "struct kvm" objects associated with a
 : single ASID.  And at some point in the future, I suspect we'll have multiple KVM
 : objects per HKID too.
 : 
 : The current SEV use case is for the migration helper, where two KVM objects share
 : a single ASID (the "real" VM and the helper).  I suspect TDX will end up with
 : similar behavior where helper "VMs" can use the HKID of the "real" VM.  For KVM,
 : that means multiple struct kvm objects being associated with a single HKID.
 : 
 : To prevent use-after-free, KVM "just" needs to ensure the helper instances can't
 : outlive the real instance, i.e. can't use the HKID/ASID after the owning virtual
 : machine has been destroyed.
 : 
 : To put it differently, "struct kvm" is a KVM software construct that _usually_,
 : but not always, is associated 1:1 with a virtual machine.
 : 
 : And FWIW, stashing the pointer without holding a reference would not be a complete
 : solution, because it couldn't guard against KVM reusing a pointer.  E.g. if a
 : struct kvm was unbound and then freed, KVM could reuse the same memory for a new
 : struct kvm, with a different ASID/HKID, and get a false negative on the rebinding
 : check.

Exactly what that will look like in code is TBD, but the concept/logic holds up.

[*] https://lore.kernel.org/all/ZOO782YGRY0YMuPu@google.com

> [0] https://lore.kernel.org/all/cover.1747368092.git.afranji@google.com/
> https://lore.kernel.org/kvm/cover.1749672978.git.afranji@google.com/

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-08 18:03                                 ` Sean Christopherson
  2025-07-08 18:13                                   ` Edgecombe, Rick P
@ 2025-07-08 19:28                                   ` Vishal Annapurve
  2025-07-08 19:58                                     ` Sean Christopherson
  1 sibling, 1 reply; 231+ messages in thread
From: Vishal Annapurve @ 2025-07-08 19:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Rick P Edgecombe, pvorel@suse.cz, kvm@vger.kernel.org,
	catalin.marinas@arm.com, Jun Miao, nsaenz@amazon.es,
	Kirill Shutemov, pdurrant@amazon.co.uk, peterx@redhat.com,
	x86@kernel.org, tabba@google.com, amoorthy@google.com,
	quic_svaddagi@quicinc.com, jack@suse.cz, vkuznets@redhat.com,
	quic_eberman@quicinc.com, keirf@google.com,
	mail@maciej.szmigiero.name, anthony.yznaga@oracle.com, Wei W Wang,
	palmer@dabbelt.com, Wieczor-Retman, Maciej, Yan Y Zhao,
	ajones@ventanamicro.com, willy@infradead.org,
	paul.walmsley@sifive.com, Dave Hansen, aik@amd.com,
	usama.arif@bytedance.com, quic_mnalajal@quicinc.com,
	fvdl@google.com, rppt@kernel.org, quic_cvanscha@quicinc.com,
	maz@kernel.org, vbabka@suse.cz, anup@brainfault.org,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	mic@digikod.net, oliver.upton@linux.dev, Fan Du,
	akpm@linux-foundation.org, steven.price@arm.com,
	muchun.song@linux.dev, binbin.wu@linux.intel.com, Zhiquan1 Li,
	rientjes@google.com, mpe@ellerman.id.au, Erdem Aktas,
	david@redhat.com, jgg@ziepe.ca, hughd@google.com,
	jhubbard@nvidia.com, Haibo1 Xu, Isaku Yamahata,
	jthoughton@google.com, steven.sistare@oracle.com,
	quic_pheragu@quicinc.com, jarkko@kernel.org,
	chenhuacai@kernel.org, Kai Huang, shuah@kernel.org,
	bfoster@redhat.com, dwmw@amazon.co.uk, Chao P Peng,
	pankaj.gupta@amd.com, Alexander Graf, nikunj@amd.com,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Yilun Xu, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com, Xiaoyao Li,
	aou@eecs.berkeley.edu, Ira Weiny, richard.weiyang@gmail.com,
	kent.overstreet@linux.dev, qperret@google.com,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	roypat@amazon.co.uk, linux-fsdevel@vger.kernel.org,
	ackerleytng@google.com, pgonda@google.com,
	quic_pderrin@quicinc.com, hch@infradead.org, will@kernel.org,
	linux-mm@kvack.org

On Tue, Jul 8, 2025 at 11:03 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jul 08, 2025, Rick P Edgecombe wrote:
> > On Tue, 2025-07-08 at 10:16 -0700, Vishal Annapurve wrote:
> > > > Right, I read that. I still don't see why pKVM needs to do normal
> > > > private/shared
> > > > conversion for data provisioning. Vs a dedicated operation/flag to make it a
> > > > special case.
> > >
> > > It's dictated by pKVM usecases, memory contents need to be preserved
> > > for every conversion not just for initial payload population.
> >
> > We are weighing pros/cons between:
> >  - Unifying this uABI across all gmemfd VM types
> >  - Userspace for one VM type passing a flag for it's special non-shared use case
> >
> > I don't see how passing a flag or not is dictated by pKVM use case.
>
> Yep.  Baking the behavior of a single usecase into the kernel's ABI is rarely a
> good idea.  Just because pKVM's current usecases always wants contents to be
> preserved doesn't mean that pKVM will never change.
>
> As a general rule, KVM should push policy to userspace whenever possible.
>
> > P.S. This doesn't really impact TDX I think. Except that TDX development needs
> > to work in the code without bumping anything. So just wishing to work in code
> > with less conditionals.
> >
> > >
> > > >
> > > > I'm trying to suggest there could be a benefit to making all gmem VM types
> > > > behave the same. If conversions are always content preserving for pKVM, why
> > > > can't userspace  always use the operation that says preserve content? Vs
> > > > changing the behavior of the common operations?
> > >
> > > I don't see a benefit of userspace passing a flag that's kind of
> > > default for the VM type (assuming pKVM will use a special VM type).
> >
> > The benefit is that we don't need to have special VM default behavior for
> > gmemfd. Think about if some day (very hypothetical and made up) we want to add a
> > mode for TDX that adds new private data to a running guest (with special accept
> > on the guest side or something). Then we might want to add a flag to override
> > the default destructive behavior. Then maybe pKVM wants to add a "don't
> > preserve" operation and it adds a second flag to not destroy. Now gmemfd has
> > lots of VM specific flags. The point of this example is to show how unified uABI
> > can he helpful.
>
> Yep again. Pivoting on the VM type would be completely inflexible.  If pKVM gains
> a usecase that wants to zero memory on conversions, we're hosed.  If SNP or TDX
> gains the ability to preserve data on conversions, we're hosed.
>
> The VM type may restrict what is possible, but (a) that should be abstracted,
> e.g. by defining the allowed flags during guest_memfd creation, and (b) the
> capabilities of the guest_memfd instance need to be communicated to userspace.

Ok, I concur with this: It's beneficial to keep a unified ABI that
allows guest_memfd to make runtime decisions without relying on VM
type as far as possible.

Few points that seem important here:
1) Userspace can and should be able to only dictate if memory contents
need to be preserved on shared to private conversion.
   -> For SNP/TDX VMs:
        * Only usecase for preserving contents is initial memory
population, which can be achieved by:
               -  Userspace converting the ranges to shared,
populating the contents, converting them back to private and then
calling SNP/TDX specific existing ABI functions.
        * For runtime conversions, guest_memfd can't ensure memory
contents are preserved during shared to private conversions as the
architectures don't support that behavior.
        * So IMO, this "preserve" flag doesn't make sense for SNP/TDX
VMs, even if we add this flag, today guest_memfd should effectively
mark this unsupported based on the backing architecture support.
2) For pKVM, if userspace wants to specify a "preserve" flag then this
flag can be allowed based on the known capabilities of the backing
architecture.

So this topic is still orthogonal to "zeroing on private to shared conversion".

>
> > > Common operations in guest_memfd will need to either check for the
> > > userspace passed flag or the VM type, so no major change in
> > > guest_memfd implementation for either mechanism.
> >
> > While we discuss ABI, we should allow ourselves to think ahead. So, is a gmemfd
> > fd tied to a VM?
>
> Yes.
>
> > I think there is interest in de-coupling it?
>
> No?  Even if we get to a point where multiple distinct VMs can bind to a single
> guest_memfd, e.g. for inter-VM shared memory, there will still need to be a sole
> owner of the memory.  AFAICT, fully decoupling guest_memfd from a VM would add
> non-trivial complexity for zero practical benefit.
>
> > Is the VM type sticky?
> >
> > It seems the more they are separate, the better it will be to not have VM-aware
> > behavior living in gmem.
>
> Ya.  A guest_memfd instance may have capabilities/features that are restricted
> and/or defined based on the properties of the owning VM, but we should do our
> best to make guest_memfd itself blissly unaware of the VM type.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-08 19:28                                   ` Vishal Annapurve
@ 2025-07-08 19:58                                     ` Sean Christopherson
  2025-07-08 22:54                                       ` Vishal Annapurve
  0 siblings, 1 reply; 231+ messages in thread
From: Sean Christopherson @ 2025-07-08 19:58 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Rick P Edgecombe, pvorel@suse.cz, kvm@vger.kernel.org,
	catalin.marinas@arm.com, Jun Miao, nsaenz@amazon.es,
	Kirill Shutemov, pdurrant@amazon.co.uk, peterx@redhat.com,
	x86@kernel.org, tabba@google.com, amoorthy@google.com,
	quic_svaddagi@quicinc.com, jack@suse.cz, vkuznets@redhat.com,
	quic_eberman@quicinc.com, keirf@google.com,
	mail@maciej.szmigiero.name, anthony.yznaga@oracle.com, Wei W Wang,
	palmer@dabbelt.com, Wieczor-Retman, Maciej, Yan Y Zhao,
	ajones@ventanamicro.com, willy@infradead.org,
	paul.walmsley@sifive.com, Dave Hansen, aik@amd.com,
	usama.arif@bytedance.com, quic_mnalajal@quicinc.com,
	fvdl@google.com, rppt@kernel.org, quic_cvanscha@quicinc.com,
	maz@kernel.org, vbabka@suse.cz, anup@brainfault.org,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	mic@digikod.net, oliver.upton@linux.dev, Fan Du,
	akpm@linux-foundation.org, steven.price@arm.com,
	muchun.song@linux.dev, binbin.wu@linux.intel.com, Zhiquan1 Li,
	rientjes@google.com, mpe@ellerman.id.au, Erdem Aktas,
	david@redhat.com, jgg@ziepe.ca, hughd@google.com,
	jhubbard@nvidia.com, Haibo1 Xu, Isaku Yamahata,
	jthoughton@google.com, steven.sistare@oracle.com,
	quic_pheragu@quicinc.com, jarkko@kernel.org,
	chenhuacai@kernel.org, Kai Huang, shuah@kernel.org,
	bfoster@redhat.com, dwmw@amazon.co.uk, Chao P Peng,
	pankaj.gupta@amd.com, Alexander Graf, nikunj@amd.com,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Yilun Xu, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com, Xiaoyao Li,
	aou@eecs.berkeley.edu, Ira Weiny, richard.weiyang@gmail.com,
	kent.overstreet@linux.dev, qperret@google.com,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	roypat@amazon.co.uk, linux-fsdevel@vger.kernel.org,
	ackerleytng@google.com, pgonda@google.com,
	quic_pderrin@quicinc.com, hch@infradead.org, will@kernel.org,
	linux-mm@kvack.org

On Tue, Jul 08, 2025, Vishal Annapurve wrote:
> On Tue, Jul 8, 2025 at 11:03 AM Sean Christopherson <seanjc@google.com> wrote:
> Few points that seem important here:
> 1) Userspace can and should be able to only dictate if memory contents
> need to be preserved on shared to private conversion.

No, I was wrong, pKVM has use cases where it's desirable to preserve data on
private => shared conversions.

Side topic, if you're going to use fancy indentation, align the indentation so
it's actually readable.

>   -> For SNP/TDX VMs:
>        * Only usecase for preserving contents is initial memory
>          population, which can be achieved by:
>               -  Userspace converting the ranges to shared, populating the contents,
>                  converting them back to private and then calling SNP/TDX specific
>                  existing ABI functions.
>        * For runtime conversions, guest_memfd can't ensure memory contents are
>          preserved during shared to private conversions as the architectures
>          don't support that behavior.
>        * So IMO, this "preserve" flag doesn't make sense for SNP/TDX VMs, even

It makes sense, it's just not supported by the architecture *at runtime*.  Case
in point, *something* needs to allow preserving data prior to launching the VM.
If we want to go with the PRIVATE => SHARED => FILL => PRIVATE approach for TDX
and SNP, then we'll probably want to allow PRESERVE only until the VM image is
finalized.
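
Roughly something like this (hypothetical, not existing KVM code; the names and
parameters are made up purely to illustrate "PRESERVE only until finalization"):

  #include <errno.h>
  #include <stdbool.h>

  /*
   * PRESERVE is honored before the initial image is finalized (the
   * PRIVATE => SHARED => FILL => PRIVATE flow), and afterwards only if
   * the architecture can actually preserve contents across a runtime
   * conversion.
   */
  static int gmem_check_preserve(bool image_finalized,
                                 bool arch_can_preserve_at_runtime,
                                 bool preserve_requested)
  {
          if (!preserve_requested)
                  return 0;
          if (!image_finalized)
                  return 0;
          return arch_can_preserve_at_runtime ? 0 : -EINVAL;
  }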

>          if we add this flag, today guest_memfd should effectively mark this
>          unsupported based on the backing architecture support.
>
> 2) For pKVM, if userspace wants to specify a "preserve" flag then this

There is no "For pKVM".  We are defining uAPI for guest_memfd.  I.e. this statement
holds true for all implementations: PRESERVE is allowed based on the capabilities
of the architecture.

> So this topic is still orthogonal to "zeroing on private to shared conversion".

As above, no.  pKVM might not expose PRESERVE to _userspace_ since all current
conversions are initiated by the guest, but for guest_memfd itself, this is all
one and the same.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-08 18:55                                     ` Sean Christopherson
@ 2025-07-08 21:23                                       ` Edgecombe, Rick P
  2025-07-09 14:28                                       ` Vishal Annapurve
  1 sibling, 0 replies; 231+ messages in thread
From: Edgecombe, Rick P @ 2025-07-08 21:23 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: pvorel@suse.cz, kvm@vger.kernel.org, catalin.marinas@arm.com,
	Miao, Jun, palmer@dabbelt.com, pdurrant@amazon.co.uk,
	vbabka@suse.cz, peterx@redhat.com, x86@kernel.org,
	amoorthy@google.com, tabba@google.com, quic_svaddagi@quicinc.com,
	maz@kernel.org, vkuznets@redhat.com, Shutemov, Kirill,
	jack@suse.cz, hughd@google.com, Annapurve, Vishal,
	mail@maciej.szmigiero.name, rppt@kernel.org,
	Wieczor-Retman, Maciej, Zhao, Yan Y, ajones@ventanamicro.com,
	willy@infradead.org, anthony.yznaga@oracle.com, Hansen, Dave,
	aik@amd.com, usama.arif@bytedance.com, quic_mnalajal@quicinc.com,
	fvdl@google.com, keirf@google.com, quic_cvanscha@quicinc.com,
	nsaenz@amazon.es, Wang, Wei W, anup@brainfault.org,
	quic_eberman@quicinc.com, thomas.lendacky@amd.com,
	linux-kernel@vger.kernel.org, mic@digikod.net,
	paul.walmsley@sifive.com, Du, Fan, akpm@linux-foundation.org,
	oliver.upton@linux.dev, muchun.song@linux.dev,
	binbin.wu@linux.intel.com, Li, Zhiquan1, rientjes@google.com,
	Aktas, Erdem, mpe@ellerman.id.au, david@redhat.com, jgg@ziepe.ca,
	steven.price@arm.com, bfoster@redhat.com, jhubbard@nvidia.com,
	Xu, Haibo1, Yamahata, Isaku, jthoughton@google.com,
	steven.sistare@oracle.com, jarkko@kernel.org,
	quic_pheragu@quicinc.com, chenhuacai@kernel.org, Huang, Kai,
	shuah@kernel.org, dwmw@amazon.co.uk, Peng, Chao P,
	pankaj.gupta@amd.com, Graf, Alexander, nikunj@amd.com,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Xu, Yilun, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com, Li, Xiaoyao,
	aou@eecs.berkeley.edu, Weiny, Ira, richard.weiyang@gmail.com,
	kent.overstreet@linux.dev, qperret@google.com,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	linux-fsdevel@vger.kernel.org, ackerleytng@google.com,
	pgonda@google.com, quic_pderrin@quicinc.com, roypat@amazon.co.uk,
	hch@infradead.org, will@kernel.org, linux-mm@kvack.org

On Tue, 2025-07-08 at 11:55 -0700, Sean Christopherson wrote:
> I think the answer is that we want to let guest_memfd live beyond the "struct kvm"
> instance, but not beyond the Virtual Machine.  From a past discussion on this topic[*].
> 
> 
[snip]
> Exactly what that will look like in code is TBD, but the concept/logic holds up.
> 
> [*] https://lore.kernel.org/all/ZOO782YGRY0YMuPu@google.com

Thanks for digging this up. Makes sense. One gmemfd per VM, but 
struct kvm != a VM.



^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-08 19:58                                     ` Sean Christopherson
@ 2025-07-08 22:54                                       ` Vishal Annapurve
  0 siblings, 0 replies; 231+ messages in thread
From: Vishal Annapurve @ 2025-07-08 22:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Rick P Edgecombe, pvorel@suse.cz, kvm@vger.kernel.org,
	catalin.marinas@arm.com, Jun Miao, nsaenz@amazon.es,
	Kirill Shutemov, pdurrant@amazon.co.uk, peterx@redhat.com,
	x86@kernel.org, tabba@google.com, amoorthy@google.com,
	quic_svaddagi@quicinc.com, jack@suse.cz, vkuznets@redhat.com,
	quic_eberman@quicinc.com, keirf@google.com,
	mail@maciej.szmigiero.name, anthony.yznaga@oracle.com, Wei W Wang,
	palmer@dabbelt.com, Wieczor-Retman, Maciej, Yan Y Zhao,
	ajones@ventanamicro.com, willy@infradead.org,
	paul.walmsley@sifive.com, Dave Hansen, aik@amd.com,
	usama.arif@bytedance.com, quic_mnalajal@quicinc.com,
	fvdl@google.com, rppt@kernel.org, quic_cvanscha@quicinc.com,
	maz@kernel.org, vbabka@suse.cz, anup@brainfault.org,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	mic@digikod.net, oliver.upton@linux.dev, Fan Du,
	akpm@linux-foundation.org, steven.price@arm.com,
	muchun.song@linux.dev, binbin.wu@linux.intel.com, Zhiquan1 Li,
	rientjes@google.com, mpe@ellerman.id.au, Erdem Aktas,
	david@redhat.com, jgg@ziepe.ca, hughd@google.com,
	jhubbard@nvidia.com, Haibo1 Xu, Isaku Yamahata,
	jthoughton@google.com, steven.sistare@oracle.com,
	quic_pheragu@quicinc.com, jarkko@kernel.org,
	chenhuacai@kernel.org, Kai Huang, shuah@kernel.org,
	bfoster@redhat.com, dwmw@amazon.co.uk, Chao P Peng,
	pankaj.gupta@amd.com, Alexander Graf, nikunj@amd.com,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Yilun Xu, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com, Xiaoyao Li,
	aou@eecs.berkeley.edu, Ira Weiny, richard.weiyang@gmail.com,
	kent.overstreet@linux.dev, qperret@google.com,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	roypat@amazon.co.uk, linux-fsdevel@vger.kernel.org,
	ackerleytng@google.com, pgonda@google.com,
	quic_pderrin@quicinc.com, hch@infradead.org, will@kernel.org,
	linux-mm@kvack.org

On Tue, Jul 8, 2025 at 12:59 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jul 08, 2025, Vishal Annapurve wrote:
> > On Tue, Jul 8, 2025 at 11:03 AM Sean Christopherson <seanjc@google.com> wrote:
> > Few points that seem important here:
> > 1) Userspace can and should be able to only dictate if memory contents
> > need to be preserved on shared to private conversion.
>
> No, I was wrong, pKVM has use cases where it's desirable to preserve data on
> private => shared conversions.
>
> Side topic, if you're going to use fancy indentation, align the indentation so
> it's actually readable.
>
> >   -> For SNP/TDX VMs:
> >        * Only usecase for preserving contents is initial memory
> >          population, which can be achieved by:
> >               -  Userspace converting the ranges to shared, populating the contents,
> >                  converting them back to private and then calling SNP/TDX specific
> >                  existing ABI functions.
> >        * For runtime conversions, guest_memfd can't ensure memory contents are
> >          preserved during shared to private conversions as the architectures
> >          don't support that behavior.
> >        * So IMO, this "preserve" flag doesn't make sense for SNP/TDX VMs, even
>
> It makes sense, it's just not supported by the architecture *at runtime*.  Case
> in point, *something* needs to allow preserving data prior to launching the VM.
> If we want to go with the PRIVATE => SHARED => FILL => PRIVATE approach for TDX
> and SNP, then we'll probably want to allow PRESERVE only until the VM image is
> finalized.

Maybe we can simplify the story a bit here for today, how about:
1) For shared to private conversions:
       * Is it safe to say that the conversion itself is always
         content-preserving, and it's up to the architecture what it
         does with memory contents on the private faults?
             - During initial memory setup, userspace can control how
               private memory will be faulted in via architecture-supported
               ABI operations.
             - After initial memory setup, userspace can't control how
               private memory will be faulted in.
2) For private to shared conversions:
       * The architecture decides what should be done with the memory
         on shared faults.
             - guest_memfd can query the architecture whether to zero
               the memory or not (see the rough sketch below).

-> guest_memfd will only take on the responsibility of zeroing if
   needed by the architecture on shared faults.
-> The architecture is responsible for the behavior on private faults.

In future, if there is a usecase for controlling the runtime behavior
of private faults, the architecture can expose additional ABI that
userspace can use after initiating a guest_memfd conversion.
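
A minimal sketch of that query, purely illustrative (no such hook exists today;
the names are placeholders):

  #include <stdbool.h>
  #include <stddef.h>
  #include <string.h>

  /* Placeholder for whatever per-instance state guest_memfd would keep. */
  struct gmem_instance {
          bool arch_wants_zeroing_on_shared_fault;    /* answered once by KVM/arch */
  };

  static void gmem_prepare_shared_folio(struct gmem_instance *gmem,
                                        void *folio_addr, size_t size)
  {
          /* Zero only if the architecture asked for it, otherwise preserve. */
          if (gmem->arch_wants_zeroing_on_shared_fault)
                  memset(folio_addr, 0, size);
  }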

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-08 18:55                                     ` Sean Christopherson
  2025-07-08 21:23                                       ` Edgecombe, Rick P
@ 2025-07-09 14:28                                       ` Vishal Annapurve
  2025-07-09 15:00                                         ` Sean Christopherson
  2025-07-09 15:17                                         ` Edgecombe, Rick P
  1 sibling, 2 replies; 231+ messages in thread
From: Vishal Annapurve @ 2025-07-09 14:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Rick P Edgecombe, pvorel@suse.cz, kvm@vger.kernel.org,
	catalin.marinas@arm.com, Jun Miao, palmer@dabbelt.com,
	pdurrant@amazon.co.uk, vbabka@suse.cz, peterx@redhat.com,
	x86@kernel.org, amoorthy@google.com, tabba@google.com,
	quic_svaddagi@quicinc.com, maz@kernel.org, vkuznets@redhat.com,
	anthony.yznaga@oracle.com, mail@maciej.szmigiero.name,
	quic_eberman@quicinc.com, Wei W Wang, Fan Du,
	Wieczor-Retman, Maciej, Yan Y Zhao, ajones@ventanamicro.com,
	Dave Hansen, paul.walmsley@sifive.com, quic_mnalajal@quicinc.com,
	aik@amd.com, usama.arif@bytedance.com, fvdl@google.com,
	jack@suse.cz, quic_cvanscha@quicinc.com, Kirill Shutemov,
	willy@infradead.org, steven.price@arm.com, anup@brainfault.org,
	thomas.lendacky@amd.com, keirf@google.com, mic@digikod.net,
	linux-kernel@vger.kernel.org, nsaenz@amazon.es,
	akpm@linux-foundation.org, oliver.upton@linux.dev,
	binbin.wu@linux.intel.com, muchun.song@linux.dev, Zhiquan1 Li,
	rientjes@google.com, Erdem Aktas, mpe@ellerman.id.au,
	david@redhat.com, jgg@ziepe.ca, hughd@google.com,
	jhubbard@nvidia.com, Haibo1 Xu, Isaku Yamahata,
	jthoughton@google.com, rppt@kernel.org, steven.sistare@oracle.com,
	jarkko@kernel.org, quic_pheragu@quicinc.com,
	chenhuacai@kernel.org, Kai Huang, shuah@kernel.org,
	bfoster@redhat.com, dwmw@amazon.co.uk, Chao P Peng,
	pankaj.gupta@amd.com, Alexander Graf, nikunj@amd.com,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Yilun Xu, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com, Xiaoyao Li,
	aou@eecs.berkeley.edu, Ira Weiny, richard.weiyang@gmail.com,
	kent.overstreet@linux.dev, qperret@google.com,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	linux-fsdevel@vger.kernel.org, ackerleytng@google.com,
	pgonda@google.com, quic_pderrin@quicinc.com, roypat@amazon.co.uk,
	hch@infradead.org, will@kernel.org, linux-mm@kvack.org

On Tue, Jul 8, 2025 at 11:55 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jul 08, 2025, Rick P Edgecombe wrote:
> > On Tue, 2025-07-08 at 11:03 -0700, Sean Christopherson wrote:
> > > > I think there is interest in de-coupling it?
> > >
> > > No?
> >
> > I'm talking about the intra-host migration/reboot optimization stuff. And not
> > doing a good job, sorry.
> >
> > >   Even if we get to a point where multiple distinct VMs can bind to a single
> > > guest_memfd, e.g. for inter-VM shared memory, there will still need to be a
> > > sole
> > > owner of the memory.  AFAICT, fully decoupling guest_memfd from a VM would add
> > > non-trivial complexity for zero practical benefit.
> >
> > I'm talking about moving a gmem fd between different VMs or something using
> > KVM_LINK_GUEST_MEMFD [0]. Not advocating to try to support it. But trying to
> > feel out where the concepts are headed. It kind of allows gmem fds (or just
> > their source memory?) to live beyond a VM lifecycle.
>
> I think the answer is that we want to let guest_memfd live beyond the "struct kvm"
> instance, but not beyond the Virtual Machine.  From a past discussion on this topic[*].
>
>  : No go.  Because again, the inode (physical memory) is coupled to the virtual machine
>  : as a thing, not to a "struct kvm".  Or more concretely, the inode is coupled to an
>  : ASID or an HKID, and there can be multiple "struct kvm" objects associated with a
>  : single ASID.  And at some point in the future, I suspect we'll have multiple KVM
>  : objects per HKID too.
>  :
>  : The current SEV use case is for the migration helper, where two KVM objects share
>  : a single ASID (the "real" VM and the helper).  I suspect TDX will end up with
>  : similar behavior where helper "VMs" can use the HKID of the "real" VM.  For KVM,
>  : that means multiple struct kvm objects being associated with a single HKID.
>  :
>  : To prevent use-after-free, KVM "just" needs to ensure the helper instances can't
>  : outlive the real instance, i.e. can't use the HKID/ASID after the owning virtual
>  : machine has been destroyed.
>  :
>  : To put it differently, "struct kvm" is a KVM software construct that _usually_,
>  : but not always, is associated 1:1 with a virtual machine.
>  :
>  : And FWIW, stashing the pointer without holding a reference would not be a complete
>  : solution, because it couldn't guard against KVM reusing a pointer.  E.g. if a
>  : struct kvm was unbound and then freed, KVM could reuse the same memory for a new
>  : struct kvm, with a different ASID/HKID, and get a false negative on the rebinding
>  : check.
>
> Exactly what that will look like in code is TBD, but the concept/logic holds up.

I think we can simplify the role of guest_memfd in line with discussion [1]:
1) guest_memfd is a memory provider for userspace, KVM, IOMMU.
         - It allows fallocate to populate/deallocate memory
2) guest_memfd supports the notion of private/shared faults.
3) guest_memfd supports memory access control:
         - It allows shared faults from userspace, KVM, IOMMU
         - It allows private faults from KVM, IOMMU
4) guest_memfd supports changing access control on its ranges between
shared/private.
         - It notifies the users to invalidate their mappings for the
ranges getting converted/truncated.

Responsibilities that ideally should not be taken up by guest_memfd:
1) guest_memfd cannot initiate pre-faulting on behalf of its users.
2) guest_memfd should not be directly communicating with the
underlying architecture layers.
         - All communication should go via KVM/IOMMU.
3) KVM should ideally associate the lifetime of backing
pagetables/protection tables/RMP tables with the lifetime of the
binding of memslots with guest_memfd.
         - Today KVM SNP logic ties RMP table entry lifetimes with how
long the folios are mapped in guest_memfd, which I think should be
revisited.

Some very early thoughts on how guest_memfd could be laid out for the long term:
1) guest_memfd code ideally should be built-in to the kernel.
2) guest_memfd instances should still be created using KVM IOCTLs that
carry specific capabilities/restrictions for its users based on the
backing VM/arch.
3) Any outgoing communication from guest_memfd to its users like
userspace/KVM/IOMMU should be via notifiers to invalidate, similar to
how MMU notifiers work (see the rough sketch at the end of this mail).
4) KVM and IOMMU can implement intermediate layers to handle
interaction with guest_memfd.
     - e.g. there could be a layer within kvm that handles:
             - creating guest_memfd files and associating a
kvm_gmem_context with those files.
             - memslot binding
                       - kvm_gmem_context will be used to bind kvm
memslots with the context ranges.
             - invalidate notifier handling
                        - kvm_gmem_context will be used to intercept
guest_memfd callbacks and
                          translate them to the right GPA ranges.
             - linking
                        - kvm_gmem_context can be linked to different
KVM instances.

This line of thinking can allow cleaner separation between
guest_memfd/KVM/IOMMU [2].

[1] https://lore.kernel.org/lkml/CAGtprH-+gPN8J_RaEit=M_ErHWTmFHeCipC6viT6PHhG3ELg6A@mail.gmail.com/#t
[2] https://lore.kernel.org/lkml/31beeed3-b1be-439b-8a5b-db8c06dadc30@amd.com/
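
To make the notifier idea in 3) above a bit more concrete, a very rough sketch
(nothing below exists; gmem_notifier and kvm_gmem_context are placeholder names)
of an MMU-notifier-style callback that KVM or an IOMMU driver could register
with guest_memfd:

  /*
   * Hypothetical, for illustration only: guest_memfd tells its users
   * (KVM, IOMMU) to drop mappings for a range that is being converted
   * or truncated.
   */
  struct gmem_notifier;

  struct gmem_notifier_ops {
          /* Called before [start, end) changes shareability or is truncated. */
          void (*invalidate_range)(struct gmem_notifier *nb,
                                   unsigned long long start,
                                   unsigned long long end);
  };

  struct gmem_notifier {
          const struct gmem_notifier_ops *ops;
          void *owner;    /* e.g. a hypothetical kvm_gmem_context or an IOMMU domain */
  };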



>
> [*] https://lore.kernel.org/all/ZOO782YGRY0YMuPu@google.com
>
> > [0] https://lore.kernel.org/all/cover.1747368092.git.afranji@google.com/
> > https://lore.kernel.org/kvm/cover.1749672978.git.afranji@google.com/

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-09 14:28                                       ` Vishal Annapurve
@ 2025-07-09 15:00                                         ` Sean Christopherson
  2025-07-10  1:30                                           ` Vishal Annapurve
  2025-07-09 15:17                                         ` Edgecombe, Rick P
  1 sibling, 1 reply; 231+ messages in thread
From: Sean Christopherson @ 2025-07-09 15:00 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Rick P Edgecombe, pvorel@suse.cz, kvm@vger.kernel.org,
	catalin.marinas@arm.com, Jun Miao, palmer@dabbelt.com,
	pdurrant@amazon.co.uk, vbabka@suse.cz, peterx@redhat.com,
	x86@kernel.org, amoorthy@google.com, tabba@google.com,
	quic_svaddagi@quicinc.com, maz@kernel.org, vkuznets@redhat.com,
	anthony.yznaga@oracle.com, mail@maciej.szmigiero.name,
	quic_eberman@quicinc.com, Wei W Wang, Fan Du,
	Wieczor-Retman, Maciej, Yan Y Zhao, ajones@ventanamicro.com,
	Dave Hansen, paul.walmsley@sifive.com, quic_mnalajal@quicinc.com,
	aik@amd.com, usama.arif@bytedance.com, fvdl@google.com,
	jack@suse.cz, quic_cvanscha@quicinc.com, Kirill Shutemov,
	willy@infradead.org, steven.price@arm.com, anup@brainfault.org,
	thomas.lendacky@amd.com, keirf@google.com, mic@digikod.net,
	linux-kernel@vger.kernel.org, nsaenz@amazon.es,
	akpm@linux-foundation.org, oliver.upton@linux.dev,
	binbin.wu@linux.intel.com, muchun.song@linux.dev, Zhiquan1 Li,
	rientjes@google.com, Erdem Aktas, mpe@ellerman.id.au,
	david@redhat.com, jgg@ziepe.ca, hughd@google.com,
	jhubbard@nvidia.com, Haibo1 Xu, Isaku Yamahata,
	jthoughton@google.com, rppt@kernel.org, steven.sistare@oracle.com,
	jarkko@kernel.org, quic_pheragu@quicinc.com,
	chenhuacai@kernel.org, Kai Huang, shuah@kernel.org,
	bfoster@redhat.com, dwmw@amazon.co.uk, Chao P Peng,
	pankaj.gupta@amd.com, Alexander Graf, nikunj@amd.com,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Yilun Xu, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com, Xiaoyao Li,
	aou@eecs.berkeley.edu, Ira Weiny, richard.weiyang@gmail.com,
	kent.overstreet@linux.dev, qperret@google.com,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	linux-fsdevel@vger.kernel.org, ackerleytng@google.com,
	pgonda@google.com, quic_pderrin@quicinc.com, roypat@amazon.co.uk,
	hch@infradead.org, will@kernel.org, linux-mm@kvack.org

On Wed, Jul 09, 2025, Vishal Annapurve wrote:
> I think we can simplify the role of guest_memfd in line with discussion [1]:

I genuinely don't understand what you're trying to "simplify".  We need to define
an ABI that is flexible and robust, but beyond that most of these guidelines boil
down to "don't write bad code".

> 1) guest_memfd is a memory provider for userspace, KVM, IOMMU.

No, guest_memfd is a memory provider for KVM guests.  That memory *might* be
mapped by userspace and/or into IOMMU page tables in order out of functional
necessity, but guest_memfd exists solely to serve memory to KVM guests, full stop.

> 3) KVM should ideally associate the lifetime of backing
> pagetables/protection tables/RMP tables with the lifetime of the
> binding of memslots with guest_memfd.

Again, please align your indentation.

>          - Today KVM SNP logic ties RMP table entry lifetimes with how
>            long the folios are mapped in guest_memfd, which I think should be
>            revisited.

Why?  Memslots are ephemeral per-"struct kvm" mappings.  RMP entries and guest_memfd
inodes are tied to the Virtual Machine, not to the "struct kvm" instance.

> Some very early thoughts on how guest_memfd could be laid out for the long term:
> 1) guest_memfd code ideally should be built-in to the kernel.

Why?  How is this at all relevant?  If we need to bake some parts of guest_memfd
into the kernel in order to avoid nasty exports and/or ordering dependencies, then
we can do so.  But that is 100% an implementation detail and in no way a design
goal.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-09 14:28                                       ` Vishal Annapurve
  2025-07-09 15:00                                         ` Sean Christopherson
@ 2025-07-09 15:17                                         ` Edgecombe, Rick P
  2025-07-10  3:39                                           ` Vishal Annapurve
  1 sibling, 1 reply; 231+ messages in thread
From: Edgecombe, Rick P @ 2025-07-09 15:17 UTC (permalink / raw)
  To: Annapurve, Vishal, seanjc@google.com
  Cc: pvorel@suse.cz, kvm@vger.kernel.org, catalin.marinas@arm.com,
	Miao, Jun, palmer@dabbelt.com, pdurrant@amazon.co.uk,
	vbabka@suse.cz, peterx@redhat.com, x86@kernel.org,
	amoorthy@google.com, tabba@google.com, maz@kernel.org,
	quic_svaddagi@quicinc.com, vkuznets@redhat.com,
	anthony.yznaga@oracle.com, jack@suse.cz,
	mail@maciej.szmigiero.name, quic_eberman@quicinc.com, Wang, Wei W,
	keirf@google.com, Wieczor-Retman, Maciej, Zhao, Yan Y,
	ajones@ventanamicro.com, willy@infradead.org,
	paul.walmsley@sifive.com, Hansen, Dave, aik@amd.com,
	usama.arif@bytedance.com, quic_mnalajal@quicinc.com,
	fvdl@google.com, rppt@kernel.org, quic_cvanscha@quicinc.com,
	nsaenz@amazon.es, anup@brainfault.org, thomas.lendacky@amd.com,
	linux-kernel@vger.kernel.org, mic@digikod.net,
	oliver.upton@linux.dev, Du, Fan, akpm@linux-foundation.org,
	steven.price@arm.com, binbin.wu@linux.intel.com,
	muchun.song@linux.dev, Li, Zhiquan1, rientjes@google.com,
	Aktas, Erdem, mpe@ellerman.id.au, david@redhat.com, jgg@ziepe.ca,
	hughd@google.com, jhubbard@nvidia.com, Xu, Haibo1,
	Yamahata, Isaku, jthoughton@google.com, steven.sistare@oracle.com,
	quic_pheragu@quicinc.com, jarkko@kernel.org, Shutemov, Kirill,
	chenhuacai@kernel.org, Huang, Kai, shuah@kernel.org,
	bfoster@redhat.com, dwmw@amazon.co.uk, Peng, Chao P,
	pankaj.gupta@amd.com, Graf, Alexander, nikunj@amd.com,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Xu, Yilun, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com, Li, Xiaoyao,
	aou@eecs.berkeley.edu, Weiny, Ira, richard.weiyang@gmail.com,
	kent.overstreet@linux.dev, qperret@google.com,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	linux-fsdevel@vger.kernel.org, ackerleytng@google.com,
	pgonda@google.com, quic_pderrin@quicinc.com, roypat@amazon.co.uk,
	hch@infradead.org, will@kernel.org, linux-mm@kvack.org

On Wed, 2025-07-09 at 07:28 -0700, Vishal Annapurve wrote:
> I think we can simplify the role of guest_memfd in line with discussion [1]:
> 1) guest_memfd is a memory provider for userspace, KVM, IOMMU.
>          - It allows fallocate to populate/deallocate memory
> 2) guest_memfd supports the notion of private/shared faults.
> 3) guest_memfd supports memory access control:
>          - It allows shared faults from userspace, KVM, IOMMU
>          - It allows private faults from KVM, IOMMU
> 4) guest_memfd supports changing access control on its ranges between
> shared/private.
>          - It notifies the users to invalidate their mappings for the
> ranges getting converted/truncated.

KVM needs to know if a GFN is private/shared. I think guest_memfd is also
intended to now be the repository for this information, right? Besides
invalidations, it needs to be queryable.
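
For illustration, the kind of query I mean might look something like this
(hypothetical, not existing code; a real implementation would consult whatever
per-range state gmem keeps):

  #include <stdbool.h>

  enum gmem_shareability {
          GMEM_PRIVATE,
          GMEM_SHARED,
  };

  /* Stand-in for the real per-range lookup, fixed default for illustration. */
  static enum gmem_shareability gmem_get_shareability(unsigned long long offset)
  {
          (void)offset;
          return GMEM_PRIVATE;
  }

  static bool gmem_offset_is_private(unsigned long long offset)
  {
          return gmem_get_shareability(offset) == GMEM_PRIVATE;
  }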

> 
> Responsibilities that ideally should not be taken up by guest_memfd:
> 1) guest_memfd can not initiate pre-faulting on behalf of it's users.
> 2) guest_memfd should not be directly communicating with the
> underlying architecture layers.
>          - All communication should go via KVM/IOMMU.

Maybe stronger, there should be generic gmem behaviors. Not any special
if (vm_type == tdx) type logic. 

> 3) KVM should ideally associate the lifetime of backing
> pagetables/protection tables/RMP tables with the lifetime of the
> binding of memslots with guest_memfd.
>          - Today KVM SNP logic ties RMP table entry lifetimes with how
> long the folios are mapped in guest_memfd, which I think should be
> revisited.

I don't understand the problem. KVM needs to respond to user accessible
invalidations, but how long it keeps other resources around could be useful for
various optimizations. Like deferring work to a work queue or something.

I think it would help to just target the Ackerley series goals. We should get
that code into shape and this kind of stuff will fall out of it.

> 
> Some very early thoughts on how guest_memfd could be laid out for the long term:
> 1) guest_memfd code ideally should be built-in to the kernel.
> 2) guest_memfd instances should still be created using KVM IOCTLs that
> carry specific capabilities/restrictions for its users based on the
> backing VM/arch.
> 3) Any outgoing communication from guest_memfd to its users like
> userspace/KVM/IOMMU should be via notifiers to invalidate similar to
> how MMU notifiers work.
> 4) KVM and IOMMU can implement intermediate layers to handle
> interaction with guest_memfd.
>      - e.g. there could be a layer within kvm that handles:
>              - creating guest_memfd files and associating a
> kvm_gmem_context with those files.
>              - memslot binding
>                        - kvm_gmem_context will be used to bind kvm
> memslots with the context ranges.
>              - invalidate notifier handling
>                         - kvm_gmem_context will be used to intercept
> guest_memfd callbacks and
>                           translate them to the right GPA ranges.
>              - linking
>                         - kvm_gmem_context can be linked to different
> KVM instances.

We can probably look at the code to decide these.

> 
> This line of thinking can allow cleaner separation between
> guest_memfd/KVM/IOMMU [2].
> 
> [1] https://lore.kernel.org/lkml/CAGtprH-+gPN8J_RaEit=M_ErHWTmFHeCipC6viT6PHhG3ELg6A@mail.gmail.com/#t
> [2] https://lore.kernel.org/lkml/31beeed3-b1be-439b-8a5b-db8c06dadc30@amd.com/
> 
> 
> 
> > 
> > [*] https://lore.kernel.org/all/ZOO782YGRY0YMuPu@google.com
> > 
> > > [0] https://lore.kernel.org/all/cover.1747368092.git.afranji@google.com/
> > > https://lore.kernel.org/kvm/cover.1749672978.git.afranji@google.com/


^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-09 15:00                                         ` Sean Christopherson
@ 2025-07-10  1:30                                           ` Vishal Annapurve
  2025-07-10 23:33                                             ` Sean Christopherson
  2025-07-11 21:18                                             ` Vishal Annapurve
  0 siblings, 2 replies; 231+ messages in thread
From: Vishal Annapurve @ 2025-07-10  1:30 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Rick P Edgecombe, pvorel@suse.cz, kvm@vger.kernel.org,
	catalin.marinas@arm.com, Jun Miao, palmer@dabbelt.com,
	pdurrant@amazon.co.uk, vbabka@suse.cz, peterx@redhat.com,
	x86@kernel.org, amoorthy@google.com, tabba@google.com,
	quic_svaddagi@quicinc.com, maz@kernel.org, vkuznets@redhat.com,
	anthony.yznaga@oracle.com, mail@maciej.szmigiero.name,
	quic_eberman@quicinc.com, Wei W Wang, Fan Du,
	Wieczor-Retman, Maciej, Yan Y Zhao, ajones@ventanamicro.com,
	Dave Hansen, paul.walmsley@sifive.com, quic_mnalajal@quicinc.com,
	aik@amd.com, usama.arif@bytedance.com, fvdl@google.com,
	jack@suse.cz, quic_cvanscha@quicinc.com, Kirill Shutemov,
	willy@infradead.org, steven.price@arm.com, anup@brainfault.org,
	thomas.lendacky@amd.com, keirf@google.com, mic@digikod.net,
	linux-kernel@vger.kernel.org, nsaenz@amazon.es,
	akpm@linux-foundation.org, oliver.upton@linux.dev,
	binbin.wu@linux.intel.com, muchun.song@linux.dev, Zhiquan1 Li,
	rientjes@google.com, Erdem Aktas, mpe@ellerman.id.au,
	david@redhat.com, jgg@ziepe.ca, hughd@google.com,
	jhubbard@nvidia.com, Haibo1 Xu, Isaku Yamahata,
	jthoughton@google.com, rppt@kernel.org, steven.sistare@oracle.com,
	jarkko@kernel.org, quic_pheragu@quicinc.com,
	chenhuacai@kernel.org, Kai Huang, shuah@kernel.org,
	bfoster@redhat.com, dwmw@amazon.co.uk, Chao P Peng,
	pankaj.gupta@amd.com, Alexander Graf, nikunj@amd.com,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Yilun Xu, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com, Xiaoyao Li,
	aou@eecs.berkeley.edu, Ira Weiny, richard.weiyang@gmail.com,
	kent.overstreet@linux.dev, qperret@google.com,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	linux-fsdevel@vger.kernel.org, ackerleytng@google.com,
	pgonda@google.com, quic_pderrin@quicinc.com, roypat@amazon.co.uk,
	hch@infradead.org, will@kernel.org, linux-mm@kvack.org

On Wed, Jul 9, 2025 at 8:00 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Jul 09, 2025, Vishal Annapurve wrote:
> > I think we can simplify the role of guest_memfd in line with discussion [1]:
>
> I genuinely don't understand what you're trying to "simplify".  We need to define
> an ABI that is flexible and robust, but beyond that most of these guidelines boil
> down to "don't write bad code".

My goal for bringing this discussion up is to see if we can better
define the role of guest_memfd and how it interacts with other layers,
as I see some scenarios that can be improved like kvm_gmem_populate[1]
where guest_memfd is trying to fault in pages on behalf of KVM.

[1] https://lore.kernel.org/lkml/20250703062641.3247-1-yan.y.zhao@intel.com/

>
> > 1) guest_memfd is a memory provider for userspace, KVM, IOMMU.
>
> No, guest_memfd is a memory provider for KVM guests.  That memory *might* be
> mapped by userspace and/or into IOMMU page tables in order out of functional
> necessity, but guest_memfd exists solely to serve memory to KVM guests, full stop.

I look at this as: guest_memfd should serve memory to KVM guests and to
other users while following some KVM/arch-related guidelines, e.g. for CC
VMs, guest_memfd can handle certain behaviors differently.

>
> > 3) KVM should ideally associate the lifetime of backing
> > pagetables/protection tables/RMP tables with the lifetime of the
> > binding of memslots with guest_memfd.
>
> Again, please align your indentation.
>
> >          - Today KVM SNP logic ties RMP table entry lifetimes with how
> >            long the folios are mapped in guest_memfd, which I think should be
> >            revisited.
>
> Why?  Memslots are ephemeral per-"struct kvm" mappings.  RMP entries and guest_memfd
> inodes are tied to the Virtual Machine, not to the "struct kvm" instance.

IIUC guest_memfd can only be accessed through the window of memslots,
and if there are no memslots I don't see the reason for memory still
being associated with the "virtual machine". That's likely because I
have yet to completely wrap my head around 'guest_memfd inodes are tied
to the Virtual Machine, not to the "struct kvm" instance'; I need to
spend more time on this one.

>
> > Some very early thoughts on how guest_memfd could be laid out for the long term:
> > 1) guest_memfd code ideally should be built-in to the kernel.
>
> Why?  How is this at all relevant?  If we need to bake some parts of guest_memfd
> into the kernel in order to avoid nasty exports and/or ordering dependencies, then
> we can do so.  But that is 100% an implementation detail and in no way a design
> goal.

I agree, this is an implementation detail and we need real code to
discuss this better.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-09 15:17                                         ` Edgecombe, Rick P
@ 2025-07-10  3:39                                           ` Vishal Annapurve
  0 siblings, 0 replies; 231+ messages in thread
From: Vishal Annapurve @ 2025-07-10  3:39 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: seanjc@google.com, pvorel@suse.cz, kvm@vger.kernel.org,
	catalin.marinas@arm.com, Miao, Jun, palmer@dabbelt.com,
	pdurrant@amazon.co.uk, vbabka@suse.cz, peterx@redhat.com,
	x86@kernel.org, amoorthy@google.com, tabba@google.com,
	maz@kernel.org, quic_svaddagi@quicinc.com, vkuznets@redhat.com,
	anthony.yznaga@oracle.com, jack@suse.cz,
	mail@maciej.szmigiero.name, quic_eberman@quicinc.com, Wang, Wei W,
	keirf@google.com, Wieczor-Retman, Maciej, Zhao, Yan Y,
	ajones@ventanamicro.com, willy@infradead.org,
	paul.walmsley@sifive.com, Hansen, Dave, aik@amd.com,
	usama.arif@bytedance.com, quic_mnalajal@quicinc.com,
	fvdl@google.com, rppt@kernel.org, quic_cvanscha@quicinc.com,
	nsaenz@amazon.es, anup@brainfault.org, thomas.lendacky@amd.com,
	linux-kernel@vger.kernel.org, mic@digikod.net,
	oliver.upton@linux.dev, Du, Fan, akpm@linux-foundation.org,
	steven.price@arm.com, binbin.wu@linux.intel.com,
	muchun.song@linux.dev, Li, Zhiquan1, rientjes@google.com,
	Aktas, Erdem, mpe@ellerman.id.au, david@redhat.com, jgg@ziepe.ca,
	hughd@google.com, jhubbard@nvidia.com, Xu, Haibo1,
	Yamahata, Isaku, jthoughton@google.com, steven.sistare@oracle.com,
	quic_pheragu@quicinc.com, jarkko@kernel.org, Shutemov, Kirill,
	chenhuacai@kernel.org, Huang, Kai, shuah@kernel.org,
	bfoster@redhat.com, dwmw@amazon.co.uk, Peng, Chao P,
	pankaj.gupta@amd.com, Graf, Alexander, nikunj@amd.com,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Xu, Yilun, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com, Li, Xiaoyao,
	aou@eecs.berkeley.edu, Weiny, Ira, richard.weiyang@gmail.com,
	kent.overstreet@linux.dev, qperret@google.com,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	linux-fsdevel@vger.kernel.org, ackerleytng@google.com,
	pgonda@google.com, quic_pderrin@quicinc.com, roypat@amazon.co.uk,
	hch@infradead.org, will@kernel.org, linux-mm@kvack.org

On Wed, Jul 9, 2025 at 8:17 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Wed, 2025-07-09 at 07:28 -0700, Vishal Annapurve wrote:
> > I think we can simplify the role of guest_memfd in line with discussion [1]:
> > 1) guest_memfd is a memory provider for userspace, KVM, IOMMU.
> >          - It allows fallocate to populate/deallocate memory
> > 2) guest_memfd supports the notion of private/shared faults.
> > 3) guest_memfd supports memory access control:
> >          - It allows shared faults from userspace, KVM, IOMMU
> >          - It allows private faults from KVM, IOMMU
> > 4) guest_memfd supports changing access control on its ranges between
> > shared/private.
> >          - It notifies the users to invalidate their mappings for the
> > ranges getting converted/truncated.
>
> KVM needs to know if a GFN is private/shared. I think it is also intended to now
> be a repository for this information, right? Besides invalidations, it needs to
> be queryable.

Yeah, that interface can be added as well. Though, if possible, KVM
could just directly pass the fault type to guest_memfd and have it
return an error if the fault type doesn't match the permission.
Additionally, KVM does query the mapping order for a certain pfn/gfn,
which will need to be supported as well.
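
Something like the below is roughly the shape I have in mind (the names
are hypothetical, just to make the idea concrete -- not code from this
series):

enum gmem_fault_kind {
	GMEM_FAULT_SHARED,
	GMEM_FAULT_PRIVATE,
};

/*
 * Return the pfn backing @index if the range's shareability matches
 * @kind, along with the maximum mapping order usable for it; return
 * -EACCES if the fault kind doesn't match the current shareability.
 */
int gmem_get_pfn(struct file *file, pgoff_t index,
		 enum gmem_fault_kind kind, kvm_pfn_t *pfn, int *max_order);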

>
> >
> > Responsibilities that ideally should not be taken up by guest_memfd:
> > 1) guest_memfd cannot initiate pre-faulting on behalf of its users.
> > 2) guest_memfd should not be directly communicating with the
> > underlying architecture layers.
> >          - All communication should go via KVM/IOMMU.
>
> Maybe stronger, there should be generic gmem behaviors. Not any special
> if (vm_type == tdx) type logic.
>
> > 3) KVM should ideally associate the lifetime of backing
> > pagetables/protection tables/RMP tables with the lifetime of the
> > binding of memslots with guest_memfd.
> >          - Today KVM SNP logic ties RMP table entry lifetimes with how
> > long the folios are mapped in guest_memfd, which I think should be
> > revisited.
>
> I don't understand the problem. KVM needs to respond to user accessible
> invalidations, but how long it keeps other resources around could be useful for
> various optimizations. Like deferring work to a work queue or something.

I don't think it could be deferred to a work queue, as the RMP table
entries will need to be removed synchronously once the last reference
on the guest_memfd drops, unless the memory itself is kept around after
filemap eviction. I can see benefits of this approach for handling
scenarios like intra-host migration.

>
> I think it would help to just target the Ackerley series goals. We should get
> that code into shape and this kind of stuff will fall out of it.
>
> >
> > Some very early thoughts on how guest_memfd could be laid out for the long term:
> > 1) guest_memfd code ideally should be built-in to the kernel.
> > 2) guest_memfd instances should still be created using KVM IOCTLs that
> > carry specific capabilities/restrictions for its users based on the
> > backing VM/arch.
> > 3) Any outgoing communication from guest_memfd to its users like
> > userspace/KVM/IOMMU should be via notifiers to invalidate similar to
> > how MMU notifiers work.
> > 4) KVM and IOMMU can implement intermediate layers to handle
> > interaction with guest_memfd.
> >      - e.g. there could be a layer within kvm that handles:
> >              - creating guest_memfd files and associating a
> > kvm_gmem_context with those files.
> >              - memslot binding
> >                        - kvm_gmem_context will be used to bind kvm
> > memslots with the context ranges.
> >              - invalidate notifier handling
> >                         - kvm_gmem_context will be used to intercept
> > guest_memfd callbacks and
> >                           translate them to the right GPA ranges.
> >              - linking
> >                         - kvm_gmem_context can be linked to different
> > KVM instances.
>
> We can probably look at the code to decide these.
>

Agree.
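
To make that slightly more concrete, here is the rough shape of the
intermediate layer I'm imagining (all names hypothetical, nothing here
is actual code from the series):

/* Callbacks guest_memfd would invoke on its users, MMU-notifier style. */
struct gmem_notifier_ops {
	void (*invalidate)(struct gmem_notifier *n, pgoff_t start, pgoff_t end);
	void (*convert)(struct gmem_notifier *n, pgoff_t start, pgoff_t end,
			bool to_private);
};

struct gmem_notifier {
	const struct gmem_notifier_ops *ops;
	struct list_head list;		/* on the guest_memfd inode's notifier list */
};

/* KVM-internal wrapper that intercepts the callbacks above. */
struct kvm_gmem_context {
	struct kvm *kvm;		/* instance this context is linked to */
	struct inode *gmem_inode;	/* backing guest_memfd inode */
	struct list_head bindings;	/* memslot <-> file-offset bindings */
	struct gmem_notifier notifier;	/* registered with guest_memfd */
};

The context would translate the (start, end) file offsets into GPA
ranges via its memslot bindings before calling into the KVM MMU, and
the same inode could be linked to more than one context for intra-host
migration.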

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-06-30 14:19                       ` Vishal Annapurve
@ 2025-07-10  6:57                         ` Alexey Kardashevskiy
  2025-07-10 17:58                           ` Jason Gunthorpe
  0 siblings, 1 reply; 231+ messages in thread
From: Alexey Kardashevskiy @ 2025-07-10  6:57 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Jason Gunthorpe, Fuad Tabba, Ackerley Tng, kvm, linux-mm,
	linux-kernel, x86, linux-fsdevel, ajones, akpm, amoorthy,
	anthony.yznaga, anup, aou, bfoster, binbin.wu, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, thomas.lendacky,
	usama.arif, vbabka, viro, vkuznets, wei.w.wang, will, willy,
	xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui, zhiquan1.li



On 1/7/25 00:19, Vishal Annapurve wrote:
> On Sun, Jun 29, 2025 at 5:19 PM Alexey Kardashevskiy <aik@amd.com> wrote:
>> ...
>>>>> ============================
>>>>>
>>>>> For IOMMU, could something like below work?
>>>>>
>>>>> * A new UAPI to bind IOMMU FDs with guest_memfd ranges
>>>>
>>>> Done that.
>>>>
>>>>> * VFIO_DMA_MAP/UNMAP operations modified to directly fetch pfns from
>>>>> guest_memfd ranges using kvm_gmem_get_pfn()
>>>>
>>>> This API imho should drop the confusing kvm_ prefix.
>>>>
>>>>>        -> kvm invokes kvm_gmem_is_private() to check for the range
>>>>> shareability, IOMMU could use the same or we could add an API in gmem
>>>>> that takes in access type and checks the shareability before returning
>>>>> the pfn.
>>>>
>>>> Right now I cut-n-pasted kvm_gmem_get_folio() (which essentially is filemap_lock_folio()/filemap_alloc_folio()/__filemap_add_folio()) to avoid new links between iommufd.ko and kvm.ko. It is probably unavoidable though.
>>>
>>> I don't think that's the way to avoid links between iommufd.ko and
>>> kvm.ko. Cleaner way probably is to have gmem logic built-in and allow
>>> runtime registration of invalidation callbacks from KVM/IOMMU
>>> backends. Need to think about this more.
>>
>> Yeah, otherwise iommufd.ko will have to install a hook in guest_memfd (==kvm.ko) in run time so more beloved symbol_get() :)
>>
>>>
>>>>
>>>>
>>>>> * IOMMU stack exposes an invalidation callback that can be invoked by
>>>>> guest_memfd.
>>>>>
>>>>> Private to Shared conversion via kvm_gmem_convert_range() -
>>>>>        1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
>>>>> on each bound memslot overlapping with the range
>>>>>         2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
>>>>> actually unmaps the KVM SEPT/NPT entries.
>>>>>               -> guest_memfd invokes IOMMU invalidation callback to zap
>>>>> the secure IOMMU entries.
>>>>>         3) guest_memfd invokes kvm_gmem_execute_work() which updates the
>>>>> shareability and then splits the folios if needed
>>>>>         4) Userspace invokes IOMMU map operation to map the ranges in
>>>>> non-secure IOMMU.
>>>>>
>>>>> Shared to private conversion via kvm_gmem_convert_range() -
>>>>>        1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
>>>>> on each bound memslot overlapping with the range
>>>>>         2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
>>>>> actually unmaps the host mappings which will unmap the KVM non-secure
>>>>> EPT/NPT entries.
>>>>>             -> guest_memfd invokes IOMMU invalidation callback to zap the
>>>>> non-secure IOMMU entries.
>>>>>         3) guest_memfd invokes kvm_gmem_execute_work() which updates the
>>>>> shareability and then merges the folios if needed.
>>>>>         4) Userspace invokes IOMMU map operation to map the ranges in secure IOMMU.
>>>>
>>>>
>>>> Alright (although this zap+map is not necessary on the AMD hw).
>>>
>>> IMO guest_memfd ideally should not directly interact or cater to arch
>>> specific needs, it should implement a mechanism that works for all
>>> archs. KVM/IOMMU implement invalidation callbacks and have all the
>>> architecture specific knowledge to take the right decisions.
>>
>>
>> Every page conversion will go through:
>>
>> kvm-amd.ko -1-> guest_memfd (kvm.ko) -2-> iommufd.ko -3-> amd-iommu (build-in).
>>
>> Which one decides on IOMMU not needing (un)mapping? Got to be (1) but then it need to propagate the decision to amd-iommu (and we do not have (3) at the moment in that path).
> 
> If there is a need, guest_memfd can support two different callbacks:
> 1) Conversion notifier/callback invoked by guest_memfd during
> conversion handling.
> 2) Invalidation notifier/callback invoked by guest_memfd during truncation.
> 
> Iommufd/kvm can handle conversion callback/notifier as per the needs
> of underlying architecture. e.g. for TDX connect do the unmapping vs
> for SEV Trusted IO skip the unmapping.
> 
> Invalidation callback/notifier will need to be handled by unmapping page tables.
> 
>>
>> Or do we just always do unmap+map (and trigger unwanted huge page smashing)? All of it is doable and neither option is particularly horrible; I'm trying to see where the consensus is now. Thanks,
>>
> 
> I assume when you say huge page smashing, it means huge page NPT
> mapping getting split.
> 
> AFAIR, based on discussion with Michael during guest_memfd calls,
> stage2 NPT entries need to be of the same granularity as RMP tables
> for AMD SNP guests. i.e. huge page NPT mappings need to be smashed on
> the KVM side during conversion. So today guest_memfd sends
> invalidation notification to KVM for both conversion and truncation.
> Doesn't the same constraint for keeping IOMMU page tables at the same
> granularity as RMP tables hold for trusted IO?


Currently I handle this from KVM with a hack to get the IOPDE from the AMD IOMMU, so both the 2MB RMP entry and the IOPDE entries are smashed in one go by one of the many firmwares running on EPYC; atm this is too hacky to be posted even as an RFC. This likely needs to move to IOMMUFD then (via some callbacks), which could call the AMD IOMMU driver, which then would call that firmware (called "TMPM"; it is not the PSP, which is the "TSM"), probably. Thanks,



-- 
Alexey


^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-07-02 14:32                       ` Vishal Annapurve
@ 2025-07-10 10:50                         ` Xu Yilun
  2025-07-10 17:54                           ` Jason Gunthorpe
  0 siblings, 1 reply; 231+ messages in thread
From: Xu Yilun @ 2025-07-10 10:50 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Jason Gunthorpe, Yan Zhao, Alexey Kardashevskiy, Fuad Tabba,
	Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yilun.xu, yuzenghui, zhiquan1.li

On Wed, Jul 02, 2025 at 07:32:36AM -0700, Vishal Annapurve wrote:
> On Wed, Jul 2, 2025 at 7:13 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Wed, Jul 02, 2025 at 06:54:10AM -0700, Vishal Annapurve wrote:
> > > On Wed, Jul 2, 2025 at 1:38 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >
> > > > On Tue, Jun 24, 2025 at 07:10:38AM -0700, Vishal Annapurve wrote:
> > > > > On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > > >
> > > > > > On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
> > > > > >
> > > > > > > Now, I am rebasing my RFC on top of this patchset and it fails in
> > > > > > > kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
> > > > > > > folios in my RFC.
> > > > > > >
> > > > > > > So what is the expected sequence here? The userspace unmaps a DMA
> > > > > > > page and maps it back right away, all from the userspace? The end
> > > > > > > result will be the exactly same which seems useless. And IOMMU TLB
> > > > >
> > > > >  As Jason described, ideally IOMMU just like KVM, should just:
> > > > > 1) Directly rely on guest_memfd for pinning -> no page refcounts taken
> > > > > by IOMMU stack
> > > > In TDX Connect, the TDX module and TDs do not trust the VMM. So it's the TDs that inform
> > > > the TDX module about which pages they use for DMA purposes.
> > > > So, if a page is regarded as pinned by a TD for DMA, the TDX module will fail the
> > > > unmap of the page from the S-EPT.
> >
> > I don't see this as having much to do with iommufd.
> >
> > iommufd will somehow support the T=1 iommu inside the TDX module but
> > it won't have an IOAS for it since the VMM does not control the
> > translation.

I partially agree with this.

This is still the DMA silent-drop issue for security.  The HW (also
applicable to AMD/ARM) screams out if the trusted DMA path (the IOMMU
mapping, or an access control table like the RMP) is changed outside
the TD's expectation. So from the HW POV, it is an IOMMU problem.

For SW, if we don't blame the IOMMU, maybe we should rephrase it as:
gmemfd can't invalidate private pages unless the TD agrees.

> >
> > The discussion here is for the T=0 iommu which is controlled by
> > iommufd and does have an IOAS. It should be populated with all the
> > shared pages from the guestmemfd.
> >
> > > > If IOMMU side does not increase refcount, IMHO, some way to indicate that
> > > > certain PFNs are used by TDs for DMA is still required, so guest_memfd can
> > > > reject the request before attempting the actual unmap.
> >
> > This has to be dealt with between the TDX module and KVM. When KVM
> > gives pages to become secure it may not be able to get them back..

Just to be clear: with in-place conversion, it is not KVM that gives
pages to become secure, it is gmemfd. Or maybe you mean gmemfd is part of KVM.

https://lore.kernel.org/all/aC86OsU2HSFZkJP6@google.com/

> >
> > This problem has nothing to do with iommufd.
> >
> > But generally I expect that the T=1 iommu follows the S-EPT entirely
> > and there is no notion of pages "locked for dma". If DMA is ongoing
> > and a page is made non-secure then the DMA fails.
> >
> > Obviously in a mode where there is a vPCI device we will need all the
> > pages to be pinned in the guestmemfd to prevent any kind of
> > migrations. Only shared/private conversions should change the page
> > around.

Only *guest permitted* conversions should change the page, i.e. only
when the VMM is handling the KVM_HC_MAP_GPA_RANGE hypercall. I'm not
sure whether we could just let QEMU ensure this or whether
KVM/guest_memfd should ensure it.
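
To spell out the flow I mean, the userspace side looks roughly like
this (needs <linux/kvm.h> and <linux/kvm_para.h>; convert_range() is a
stand-in for however the VMM drives the conversion, e.g. the
KVM_GMEM_CONVERT_* ioctls from this series):

static void handle_map_gpa_range(struct kvm_run *run)
{
	if (run->exit_reason != KVM_EXIT_HYPERCALL ||
	    run->hypercall.nr != KVM_HC_MAP_GPA_RANGE)
		return;

	__u64 gpa = run->hypercall.args[0];
	__u64 npages = run->hypercall.args[1];
	bool to_private = run->hypercall.args[2] & KVM_MAP_GPA_RANGE_ENCRYPTED;

	/*
	 * This is the only point where userspace requests a conversion,
	 * i.e. the change is always guest-permitted.
	 */
	convert_range(gpa, npages, to_private);
	run->hypercall.ret = 0;
}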

Thanks,
Yilun

> 
> Yes, guest_memfd ensures that all the faulted-in pages (irrespective
> of shared or private ranges) are not migratable. We already have a
> similar restriction with CPU accesses to encrypted memory ranges that
> need arch specific protocols to migrate memory contents.
> 
> >
> > Maybe this needs to be an integral functionality in guestmemfd?
> >
> > Jason
> 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-07-10 10:50                         ` Xu Yilun
@ 2025-07-10 17:54                           ` Jason Gunthorpe
  2025-07-11  4:31                             ` Xu Yilun
  0 siblings, 1 reply; 231+ messages in thread
From: Jason Gunthorpe @ 2025-07-10 17:54 UTC (permalink / raw)
  To: Xu Yilun
  Cc: Vishal Annapurve, Yan Zhao, Alexey Kardashevskiy, Fuad Tabba,
	Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yilun.xu, yuzenghui, zhiquan1.li

On Thu, Jul 10, 2025 at 06:50:09PM +0800, Xu Yilun wrote:
> On Wed, Jul 02, 2025 at 07:32:36AM -0700, Vishal Annapurve wrote:
> > On Wed, Jul 2, 2025 at 7:13 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Wed, Jul 02, 2025 at 06:54:10AM -0700, Vishal Annapurve wrote:
> > > > On Wed, Jul 2, 2025 at 1:38 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >
> > > > > On Tue, Jun 24, 2025 at 07:10:38AM -0700, Vishal Annapurve wrote:
> > > > > > On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > > > >
> > > > > > > On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
> > > > > > >
> > > > > > > > Now, I am rebasing my RFC on top of this patchset and it fails in
> > > > > > > > kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
> > > > > > > > folios in my RFC.
> > > > > > > >
> > > > > > > > So what is the expected sequence here? The userspace unmaps a DMA
> > > > > > > > page and maps it back right away, all from the userspace? The end
> > > > > > > > result will be the exactly same which seems useless. And IOMMU TLB
> > > > > >
> > > > > >  As Jason described, ideally IOMMU just like KVM, should just:
> > > > > > 1) Directly rely on guest_memfd for pinning -> no page refcounts taken
> > > > > > by IOMMU stack
> > > > > In TDX connect, TDX module and TDs do not trust VMM. So, it's the TDs to inform
> > > > > TDX module about which pages are used by it for DMAs purposes.
> > > > > So, if a page is regarded as pinned by TDs for DMA, the TDX module will fail the
> > > > > unmap of the pages from S-EPT.
> > >
> > > I don't see this as having much to do with iommufd.
> > >
> > > iommufd will somehow support the T=1 iommu inside the TDX module but
> > > it won't have an IOAS for it since the VMM does not control the
> > > translation.
> 
> I partially agree with this.
> 
> This is still the DMA Silent drop issue for security.  The HW (Also
> applicable to AMD/ARM) screams out if the trusted DMA path (IOMMU
> mapping, or access control table like RMP) is changed out of TD's
> expectation. So from HW POV, it is the iommu problem.

I thought the basic idea was that the secure world would sanity check
what the insecure world is doing and, if it is not OK, blow up. So if
the DMA fails because the untrusted world revoked sharability when it
should not have, then this is correct and expected?

> For SW, if we don't blame iommu, maybe we rephrase as gmemfd can't
> invalidate private pages unless TD agrees.

I think you mean guestmemfd in the kernel cannot autonomously change
'something' unless explicitly instructed to by userspace.

The expectation is the userspace will only give such instructions
based on the VM telling it to do a shared/private change.

If userspace gives an instruction that was not agreed with the guest
then the secure world can police the error and blow up.
 
> Just to be clear. With In-place conversion, it is not KVM gives pages
> to become secure, it is gmemfd. Or maybe you mean gmemfd is part of KVM.

Yeah, I mean part of.

> > > Obviously in a mode where there is a vPCI device we will need all the
> > > pages to be pinned in the guestmemfd to prevent any kind of
> > > migrations. Only shared/private conversions should change the page
> > > around.
> 
> Only *guest permitted* conversion should change the page. I.e only when
> VMM is dealing with the KVM_HC_MAP_GPA_RANGE hypercall. Not sure if we
> could just let QEMU ensure this or KVM/guestmemfd should ensure this.

I think it should not be part of the kernel, no need. From a kernel
perspective userspace has requested a shared/private conversion and if
it wasn't agreed with the VM then it will explode.

Jason

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-07-10  6:57                         ` Alexey Kardashevskiy
@ 2025-07-10 17:58                           ` Jason Gunthorpe
  0 siblings, 0 replies; 231+ messages in thread
From: Jason Gunthorpe @ 2025-07-10 17:58 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Vishal Annapurve, Fuad Tabba, Ackerley Tng, kvm, linux-mm,
	linux-kernel, x86, linux-fsdevel, ajones, akpm, amoorthy,
	anthony.yznaga, anup, aou, bfoster, binbin.wu, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, thomas.lendacky,
	usama.arif, vbabka, viro, vkuznets, wei.w.wang, will, willy,
	xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui, zhiquan1.li

On Thu, Jul 10, 2025 at 04:57:25PM +1000, Alexey Kardashevskiy wrote:

> Currently I handle this from the KVM with a hack to get IOPDE from
> AMD IOMMU so both 2MB RMP entry and IOPDE entries are smashed in one
> go in one of many firmwares running on EPYC, and atm this is too
> hacky to be posted even as an RFC. This likely needs to move to
> IOMMUFD then (via some callbacks) which could call AMD IOMMU which
> then would call that firmware (called "TMPM" and it is not the PSP
> which is "TSM), probably. Thanks,

Wasn't the issue with the iommu that it needed to have a PTE break
whenever the shared/private changed in the RMP? Because the HW can't
handle an IOPTE that crosses more than one RMP entry? Or do I
misunderstand the problem?

If this is the problem I was expecting the page table code that
translates the guest memfd into the iommu PTEs would respect the
shared/private conversion boundaries and break up the PTEs
automatically.

I had thought there were three versions of how to copy from guest
memfd into the IOPTEs:
 - HW must never have a private physaddr in an IOPTE
 - HW must have IOPTEs entirely private or shared
 - HW handles everything and IOPTEs should be maximally sized
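
FWIW, this is the way I'd picture iommufd encoding those three variants
(hypothetical, not an existing interface):

enum iommufd_gmem_private_model {
	IOMMU_GMEM_PRIVATE_FORBIDDEN,	/* never put a private physaddr in an IOPTE */
	IOMMU_GMEM_PRIVATE_UNIFORM,	/* an IOPTE must be entirely private or shared */
	IOMMU_GMEM_PRIVATE_ANY,		/* HW copes, build maximally sized IOPTEs */
};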

Is this right? Is AMD #2?

Jason

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-10  1:30                                           ` Vishal Annapurve
@ 2025-07-10 23:33                                             ` Sean Christopherson
  2025-07-11 21:18                                             ` Vishal Annapurve
  1 sibling, 0 replies; 231+ messages in thread
From: Sean Christopherson @ 2025-07-10 23:33 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Rick P Edgecombe, pvorel@suse.cz, kvm@vger.kernel.org,
	catalin.marinas@arm.com, Jun Miao, palmer@dabbelt.com,
	pdurrant@amazon.co.uk, vbabka@suse.cz, peterx@redhat.com,
	x86@kernel.org, amoorthy@google.com, tabba@google.com,
	quic_svaddagi@quicinc.com, maz@kernel.org, vkuznets@redhat.com,
	anthony.yznaga@oracle.com, mail@maciej.szmigiero.name,
	quic_eberman@quicinc.com, Wei W Wang, Fan Du,
	Wieczor-Retman, Maciej, Yan Y Zhao, ajones@ventanamicro.com,
	Dave Hansen, paul.walmsley@sifive.com, quic_mnalajal@quicinc.com,
	aik@amd.com, usama.arif@bytedance.com, fvdl@google.com,
	jack@suse.cz, quic_cvanscha@quicinc.com, Kirill Shutemov,
	willy@infradead.org, steven.price@arm.com, anup@brainfault.org,
	thomas.lendacky@amd.com, keirf@google.com, mic@digikod.net,
	linux-kernel@vger.kernel.org, nsaenz@amazon.es,
	akpm@linux-foundation.org, oliver.upton@linux.dev,
	binbin.wu@linux.intel.com, muchun.song@linux.dev, Zhiquan1 Li,
	rientjes@google.com, Erdem Aktas, mpe@ellerman.id.au,
	david@redhat.com, jgg@ziepe.ca, hughd@google.com,
	jhubbard@nvidia.com, Haibo1 Xu, Isaku Yamahata,
	jthoughton@google.com, rppt@kernel.org, steven.sistare@oracle.com,
	jarkko@kernel.org, quic_pheragu@quicinc.com,
	chenhuacai@kernel.org, Kai Huang, shuah@kernel.org,
	bfoster@redhat.com, dwmw@amazon.co.uk, Chao P Peng,
	pankaj.gupta@amd.com, Alexander Graf, nikunj@amd.com,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Yilun Xu, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com, Xiaoyao Li,
	aou@eecs.berkeley.edu, Ira Weiny, richard.weiyang@gmail.com,
	kent.overstreet@linux.dev, qperret@google.com,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	linux-fsdevel@vger.kernel.org, ackerleytng@google.com,
	pgonda@google.com, quic_pderrin@quicinc.com, roypat@amazon.co.uk,
	hch@infradead.org, will@kernel.org, linux-mm@kvack.org

On Wed, Jul 09, 2025, Vishal Annapurve wrote:
> On Wed, Jul 9, 2025 at 8:00 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Jul 09, 2025, Vishal Annapurve wrote:
> > > I think we can simplify the role of guest_memfd in line with discussion [1]:
> >
> > I genuinely don't understand what you're trying to "simplify".  We need to define
> > an ABI that is flexible and robust, but beyond that most of these guidelines boil
> > down to "don't write bad code".
> 
> My goal for bringing this discussion up is to see if we can better
> define the role of guest_memfd and how it interacts with other layers,
> as I see some scenarios that can be improved like kvm_gmem_populate[1]
> where guest_memfd is trying to fault in pages on behalf of KVM.

Ah, gotcha.  From my perspective, it's all just KVM, which is why I'm not feeling
the same sense of urgency to formally define anything.  We want to encapsulate
code, have separation of concerns, etc., but I don't see that as being anything
unique or special to guest_memfd.  We try to achieve the same for all major areas
of KVM, though obviously with mixed results :-)

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-07-10 17:54                           ` Jason Gunthorpe
@ 2025-07-11  4:31                             ` Xu Yilun
  2025-07-11  9:33                               ` Xu Yilun
  0 siblings, 1 reply; 231+ messages in thread
From: Xu Yilun @ 2025-07-11  4:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Vishal Annapurve, Yan Zhao, Alexey Kardashevskiy, Fuad Tabba,
	Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yilun.xu, yuzenghui, zhiquan1.li

On Thu, Jul 10, 2025 at 02:54:49PM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 10, 2025 at 06:50:09PM +0800, Xu Yilun wrote:
> > On Wed, Jul 02, 2025 at 07:32:36AM -0700, Vishal Annapurve wrote:
> > > On Wed, Jul 2, 2025 at 7:13 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > >
> > > > On Wed, Jul 02, 2025 at 06:54:10AM -0700, Vishal Annapurve wrote:
> > > > > On Wed, Jul 2, 2025 at 1:38 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > >
> > > > > > On Tue, Jun 24, 2025 at 07:10:38AM -0700, Vishal Annapurve wrote:
> > > > > > > On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > > > > >
> > > > > > > > On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
> > > > > > > >
> > > > > > > > > Now, I am rebasing my RFC on top of this patchset and it fails in
> > > > > > > > > kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
> > > > > > > > > folios in my RFC.
> > > > > > > > >
> > > > > > > > > So what is the expected sequence here? The userspace unmaps a DMA
> > > > > > > > > page and maps it back right away, all from the userspace? The end
> > > > > > > > > result will be the exactly same which seems useless. And IOMMU TLB
> > > > > > >
> > > > > > >  As Jason described, ideally IOMMU just like KVM, should just:
> > > > > > > 1) Directly rely on guest_memfd for pinning -> no page refcounts taken
> > > > > > > by IOMMU stack
> > > > > > In TDX connect, TDX module and TDs do not trust VMM. So, it's the TDs to inform
> > > > > > TDX module about which pages are used by it for DMAs purposes.
> > > > > > So, if a page is regarded as pinned by TDs for DMA, the TDX module will fail the
> > > > > > unmap of the pages from S-EPT.
> > > >
> > > > I don't see this as having much to do with iommufd.
> > > >
> > > > iommufd will somehow support the T=1 iommu inside the TDX module but
> > > > it won't have an IOAS for it since the VMM does not control the
> > > > translation.
> > 
> > I partially agree with this.
> > 
> > This is still the DMA Silent drop issue for security.  The HW (Also
> > applicable to AMD/ARM) screams out if the trusted DMA path (IOMMU
> > mapping, or access control table like RMP) is changed out of TD's
> > expectation. So from HW POV, it is the iommu problem.
> 
> I thought the basic idea was that the secure world would sanity check
> what the insecure is doing and if it is not OK then it blows up. So if

Yes. The secure world checks. But it leaves an unexpected change on the
CPU path alone, because CPU access is synchronous and the VM just pends
on the fault, so there is no security concern. DMA, however, is
asynchronous, so the secure world must blow up.

> the DMA fails because the untrusted world revoked sharability when it
> should not have then this is correct and expected?

OK. From the secure world's POV the failure is correct & expected.

> 
> > For SW, if we don't blame iommu, maybe we rephrase as gmemfd can't
> > invalidate private pages unless TD agrees.
> 
> I think you mean guestmemfd in the kernel cannot autonomously change
> 'something' unless instructed to explicitly by userspace.
> 
> The expectation is the userspace will only give such instructions
> based on the VM telling it to do a shared/private change.
> 
> If userspace gives an instruction that was not agreed with the guest
> then the secure world can police the error and blow up.

Yes.

>  
> > Just to be clear. With In-place conversion, it is not KVM gives pages
> > to become secure, it is gmemfd. Or maybe you mean gmemfd is part of KVM.
> 
> Yeah, I mean part of.
> 
> > > > Obviously in a mode where there is a vPCI device we will need all the
> > > > pages to be pinned in the guestmemfd to prevent any kind of
> > > > migrations. Only shared/private conversions should change the page
> > > > around.
> > 
> > Only *guest permitted* conversion should change the page. I.e only when
> > VMM is dealing with the KVM_HC_MAP_GPA_RANGE hypercall. Not sure if we
> > could just let QEMU ensure this or KVM/guestmemfd should ensure this.
> 
> I think it should not be part of the kernel, no need. From a kernel
> perspective userspace has requested a shared/private conversion and if
> it wasn't agreed with the VM then it will explode.

I'm OK with it now. It's simple if we don't try to recover from the
explosion, although I see that the post-explosion processing in the
kernel is complex and I'm not sure how it will advance.

Thanks,
Yilun

> 
> Jason

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-07-11  4:31                             ` Xu Yilun
@ 2025-07-11  9:33                               ` Xu Yilun
  0 siblings, 0 replies; 231+ messages in thread
From: Xu Yilun @ 2025-07-11  9:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Vishal Annapurve, Yan Zhao, Alexey Kardashevskiy, Fuad Tabba,
	Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yilun.xu, yuzenghui, zhiquan1.li

> > > 
> > > Only *guest permitted* conversion should change the page. I.e only when
> > > VMM is dealing with the KVM_HC_MAP_GPA_RANGE hypercall. Not sure if we
> > > could just let QEMU ensure this or KVM/guestmemfd should ensure this.
> > 
> > I think it should not be part of the kernel, no need. From a kernel
> > perspective userspace has requested a shared/private conversion and if
> > it wasn't agreed with the VM then it will explode.
> 
> I'm OK with it now. It's simple if we don't try to recover from the
> explosion, although I see that the post-explosion processing in the
> kernel is complex and I'm not sure how it will advance.

I see the discussion in another thread about a similar issue, where a
TDX Module BUG makes the S-EPT unmap impossible and just hits
KVM_BUG_ON(). But this conversion issue is a little different; usually
it's not decent to panic because of a userspace request. So we may need
further error handling, or a KVM/gmemfd kAPI to disallow/allow
conversion and prevent more complex errors.
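
Something along these lines is what I mean by such a kAPI (purely a
sketch of the shape, not a concrete proposal):

/*
 * Called by KVM (or arch code) when the TD state makes conversion of
 * the range unsafe, and again once it is safe; guest_memfd would then
 * fail KVM_GMEM_CONVERT_SHARED/PRIVATE on a blocked range with -EBUSY
 * instead of letting the secure world blow up later.
 */
int kvm_gmem_block_conversion(struct inode *inode, pgoff_t start,
			      pgoff_t end, bool block);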

Thanks,
Yilun

> 
> Thanks,
> Yilun
> 
> > 
> > Jason
> 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-10  1:30                                           ` Vishal Annapurve
  2025-07-10 23:33                                             ` Sean Christopherson
@ 2025-07-11 21:18                                             ` Vishal Annapurve
  2025-07-12 17:33                                               ` Vishal Annapurve
  1 sibling, 1 reply; 231+ messages in thread
From: Vishal Annapurve @ 2025-07-11 21:18 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Rick P Edgecombe, pvorel@suse.cz, kvm@vger.kernel.org,
	catalin.marinas@arm.com, Jun Miao, palmer@dabbelt.com,
	pdurrant@amazon.co.uk, vbabka@suse.cz, peterx@redhat.com,
	x86@kernel.org, amoorthy@google.com, tabba@google.com,
	quic_svaddagi@quicinc.com, maz@kernel.org, vkuznets@redhat.com,
	anthony.yznaga@oracle.com, mail@maciej.szmigiero.name,
	quic_eberman@quicinc.com, Wei W Wang, Fan Du,
	Wieczor-Retman, Maciej, Yan Y Zhao, ajones@ventanamicro.com,
	Dave Hansen, paul.walmsley@sifive.com, quic_mnalajal@quicinc.com,
	aik@amd.com, usama.arif@bytedance.com, fvdl@google.com,
	jack@suse.cz, quic_cvanscha@quicinc.com, Kirill Shutemov,
	willy@infradead.org, steven.price@arm.com, anup@brainfault.org,
	thomas.lendacky@amd.com, keirf@google.com, mic@digikod.net,
	linux-kernel@vger.kernel.org, nsaenz@amazon.es,
	akpm@linux-foundation.org, oliver.upton@linux.dev,
	binbin.wu@linux.intel.com, muchun.song@linux.dev, Zhiquan1 Li,
	rientjes@google.com, Erdem Aktas, mpe@ellerman.id.au,
	david@redhat.com, jgg@ziepe.ca, hughd@google.com,
	jhubbard@nvidia.com, Haibo1 Xu, Isaku Yamahata,
	jthoughton@google.com, rppt@kernel.org, steven.sistare@oracle.com,
	jarkko@kernel.org, quic_pheragu@quicinc.com,
	chenhuacai@kernel.org, Kai Huang, shuah@kernel.org,
	bfoster@redhat.com, dwmw@amazon.co.uk, Chao P Peng,
	pankaj.gupta@amd.com, Alexander Graf, nikunj@amd.com,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Yilun Xu, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com, Xiaoyao Li,
	aou@eecs.berkeley.edu, Ira Weiny, richard.weiyang@gmail.com,
	kent.overstreet@linux.dev, qperret@google.com,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	linux-fsdevel@vger.kernel.org, ackerleytng@google.com,
	pgonda@google.com, quic_pderrin@quicinc.com, roypat@amazon.co.uk,
	hch@infradead.org, will@kernel.org, linux-mm@kvack.org

On Wed, Jul 9, 2025 at 6:30 PM Vishal Annapurve <vannapurve@google.com> wrote:
> > > 3) KVM should ideally associate the lifetime of backing
> > > pagetables/protection tables/RMP tables with the lifetime of the
> > > binding of memslots with guest_memfd.
> >
> > Again, please align your indentation.
> >
> > >          - Today KVM SNP logic ties RMP table entry lifetimes with how
> > >            long the folios are mapped in guest_memfd, which I think should be
> > >            revisited.
> >
> > Why?  Memslots are ephemeral per-"struct kvm" mappings.  RMP entries and guest_memfd
> > inodes are tied to the Virtual Machine, not to the "struct kvm" instance.
>
> IIUC guest_memfd can only be accessed through the window of memslots
> and if there are no memslots I don't see the reason for memory still
> being associated with "virtual machine". Likely because I am yet to
> completely wrap my head around 'guest_memfd inodes are tied to the
> Virtual Machine, not to the "struct kvm" instance', I need to spend
> more time on this one.
>

I see the benefits of tying inodes to the virtual machine and
different guest_memfd files to different KVM instances. This allows us
to exercise intra-host migration usecases for TDX/SNP. But I think
this model doesn't allow us to reuse guest_memfd files for SNP VMs
during reboot.

Reboot scenario assuming reuse of existing guest_memfd inode for the
next instance:
1) Create a VM
2) Create guest_memfd files that pin KVM instance
3) Create memslots
4) Start the VM
5) For reboot/shutdown, execute VM-specific termination (e.g.
KVM_TDX_TERMINATE_VM)
6) if allowed, delete the memslots
7) Create a new VM instance
8) Link the existing guest_memfd files to the new VM -> which creates
new files for the same inode.
9) Close the existing guest_memfd files and the existing VM
10) Jump to step 3
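
In userspace terms, steps 7-9 would look roughly like the below; the
struct and the linking ioctl name are placeholders for whatever
interface ends up existing, error handling omitted:

	int new_vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, vm_type);

	/* Step 8: new file, same inode, now tied to the new VM. */
	struct kvm_link_guest_memfd link = { .fd = old_gmem_fd, .flags = 0 };
	int new_gmem_fd = ioctl(new_vm_fd, KVM_LINK_GUEST_MEMFD, &link);

	/* Step 9: the old file and the old VM can now go away. */
	close(old_gmem_fd);
	close(old_vm_fd);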

The difference between SNP and TDX is that TDX memory ownership is
limited to the duration the pages are mapped in the second-stage
secure EPT tables, whereas SNP/RMP memory ownership lasts beyond
memslots and effectively remains until folios are punched out of the
guest_memfd filemap. IIUC CCA might follow suit with SNP in this
regard, with the pfns populated in GPT entries.

I don't have a sense of how critical this problem could be, but it
would mean that on every reboot all large memory allocations have to be
let go and reallocated. For 1G support, we will be freeing guest_memfd
pages using a background thread, which may add some delays in being
able to free up the memory in time.

Instead, if we did this:
1) Support creating guest_memfd files for a certain VM type that
allows KVM to dictate the behavior of the guest_memfd.
2) Tie the lifetime of KVM SNP/TDX memory ownership to the binding of
guest_memfd with memslots.
    - Each binding will increase a refcount on both the guest_memfd
file and KVM, so neither can go away while the binding exists.
3) For SNP/CCA, pfns are invalidated from RMP/GPT tables during unbind
operations, while for TDX, KVM will invalidate secure EPT entries.

This can allow us to decouple the memory lifecycle from the VM
lifecycle and match the behavior of non-confidential VMs, where memory
can outlast VMs. Though this approach will mean a change in the
intra-host migration implementation, as we would no longer need to
differentiate guest_memfd files and inodes.
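
A very rough sketch of what (2)/(3) above could look like on the KVM
side, just to illustrate the lifetimes (the binding structure and the
arch hook are hypothetical):

struct kvm_gmem_binding {
	struct kvm *kvm;
	struct file *file;		/* guest_memfd file */
	struct inode *inode;
	pgoff_t pgoff;
	unsigned long nr_pages;
	struct list_head list;
};

static void kvm_gmem_unbind(struct kvm_gmem_binding *b)
{
	/*
	 * Arch hook: tear down RMP/GPT entries (SNP/CCA) or zap secure
	 * EPT entries (TDX) for the bound range before the binding and
	 * its refcounts go away.
	 */
	kvm_arch_gmem_invalidate_binding(b->inode, b->pgoff, b->nr_pages);

	list_del(&b->list);
	fput(b->file);			/* drop the file ref taken at bind time */
	kvm_put_kvm(b->kvm);		/* drop the KVM ref taken at bind time */
	kfree(b);
}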

That being said, I might be missing something here and I don't have
any data to back the criticality of this usecase for SNP and possibly
CCA VMs.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-07-07 14:55                 ` Vishal Annapurve
@ 2025-07-12  0:10                   ` Michael Roth
  2025-07-12 17:53                     ` Vishal Annapurve
  0 siblings, 1 reply; 231+ messages in thread
From: Michael Roth @ 2025-07-12  0:10 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

On Mon, Jul 07, 2025 at 07:55:01AM -0700, Vishal Annapurve wrote:
> On Thu, Jul 3, 2025 at 1:41 PM Michael Roth <michael.roth@amd.com> wrote:
> > > > > > >
> > > > > > > Because shared pages are split once any memory is allocated, having a
> > > > > > > way to INIT_PRIVATE could avoid the split and then merge on
> > > > > > > conversion. I feel that is enough value to have this config flag, what
> > > > > > > do you think?
> > > > > > >
> > > > > > > I guess we could also have userspace be careful not to do any allocation
> > > > > > > before converting.
> > > >
> > > > (Re-visiting this with the assumption that we *don't* intend to use mmap() to
> > > > populate memory (in which case you can pretty much ignore my previous
> > > > response))
> > >
> > > I am assuming in-place conversion with huge page backing for the
> > > discussion below.
> > >
> > > Looks like there are three scenarios/usecases we are discussing here:
> > > 1) Pre-allocating guest_memfd file offsets
> > >    - Userspace can use fallocate to do this for hugepages by keeping
> > > the file ranges marked private.
> > > 2) Prefaulting guest EPT/NPT entries
> > > 3) Populating initial guest payload into guest_memfd memory
> > >    - Userspace can mark certain ranges as shared, populate the
> > > contents and convert the ranges back to private. So mmap will come in
> > > handy here.
> > >
> > > >
> > > > I'm still not sure where the INIT_PRIVATE flag comes into play. For SNP,
> > > > userspace already defaults to marking everything private pretty close to
> > > > guest_memfd creation time, so the potential for allocations to occur
> > > > in-between seems small, but worth confirming.
> > >
> > > Ok, I am not much worried about whether the INIT_PRIVATE flag gets
> > > supported or not, but more about the default setting that different
> > > CVMs start with. To me, it looks like all CVMs should start as
> > > everything private by default and if there is a way to bake that
> > > configuration during guest_memfd creation time that would be good to
> > > have instead of doing "create and convert" operations and there is a
> > > fairly low cost to support this flag.
> > >
> > > >
> > > > But I know in the past there was a desire to ensure TDX/SNP could
> > > > support pre-allocating guest_memfd memory (and even pre-faulting via
> > > > KVM_PRE_FAULT_MEMORY), but I think that could still work right? The
> > > > fallocate() handling could still avoid the split if the whole hugepage
> > > > is private, though there is a bit more potential for that fallocate()
> > > > to happen before userspace does the "manually" shared->private
> > > > conversion. I'll double-check on that aspect, but otherwise, is there
> > > > still any other need for it?
> > >
> > > This usecase of being able to preallocate should still work with
> > > in-place conversion assuming all ranges are private before
> > > pre-population.
> >
> > Ok, I think I was missing that the merge logic here will then restore it
> > to 1GB before the guest starts, so the folio isn't permanently split if
> > we do the mmap() and that gives us more flexibility on how we can use
> > it.
> >
> > I was thinking we needed to avoid the split from the start by avoiding
> > paths like mmap() which might trigger the split. I was trying to avoid
> > any merge->unsplit logic in the THP case (or unsplit in general), in
> > which case we'd get permanent splits via the mmap() approach, but for
> > 2MB that's probably not a big deal.
> 
> After initial payload population, during its runtime guest can cause
> different hugepages to get split which can remain split even after
> guest converts them back to private. For THP there may not be much
> benefit of merging those pages together specially if NPT/EPT entries
> can't be promoted back to hugepage mapping and there is no memory
> penalty as THP doesn't use HVO.
> 
> Wishful thinking on my part: It would be great to figure out a way to
> promote these pagetable entries without relying on the guest, if
> possible with ABI updates, as I think the host should have some
> control over EPT/NPT granularities even for Confidential VMs. Along

I'm not sure how much it would buy us. For example, for a 2MB hugetlb
SNP guest boot with 16GB of memory I see 622 2MB hugepages getting
split, but only about 30 or so of those get merged back to 2MB folios
during guest run-time. These are presumably the set of 2MB regions we
could promote back up, but it's not much given that we wouldn't expect
that value to grow proportionally for larger guests: it's really
separate things like the number of vCPUs (for shared GHCB pages), number
of virtio buffers, etc. that end up determining the upper bound on how
many pages might get split due to 4K private->shared conversion, and
these wouldn't vary all that much from guest to guest outside of maybe
vCPU count.

For 1GB hugetlb I see about 6 1GB pages get split, and only 2 get merged
during run-time and would be candidates for promotion.

This could be greatly improved from the guest side by using
higher-order allocations to create pools of shared memory that could
then be used to reduce the number of splits caused by doing
private->shared conversions on random ranges of malloc'd memory,
and this could be done even without special promotion support on the
host for pretty much the entirety of guest memory. The idea there would
be to just make optimized guests avoid the splits completely, rather
than relying on the limited subset that hardware can optimize without
guest cooperation.
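
For illustration, a minimal guest-side sketch of that pooling idea
might look like the following (assuming an x86 CoCo guest where
set_memory_decrypted() is the conversion path; the pool helper itself
is hypothetical and not part of this series):

/*
 * Minimal sketch of a guest-side shared pool: carve out one large
 * shared chunk at init time and sub-allocate from it, instead of
 * converting random 4K pages of malloc'd memory.
 */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/set_memory.h>

#define SHARED_POOL_ORDER	9	/* one physically contiguous, 2MB-aligned chunk */

static struct page *shared_pool_init(void)
{
	struct page *pool = alloc_pages(GFP_KERNEL, SHARED_POOL_ORDER);

	if (!pool)
		return NULL;

	/* One 2MB-aligned conversion instead of many scattered 4K ones. */
	if (set_memory_decrypted((unsigned long)page_address(pool),
				 1 << SHARED_POOL_ORDER)) {
		__free_pages(pool, SHARED_POOL_ORDER);
		return NULL;
	}

	/* Drivers would then sub-allocate shared buffers from this pool. */
	return pool;
}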

> the similar lines, it would be great to have "page struct"-less memory
> working for Confidential VMs, which should greatly reduce the toil
> with merge/split operations and will render the conversions mostly to
> be pagetable manipulations.

FWIW, I did some profiling of split/merge vs. overall conversion time
(by that I mean all cycles spent within kvm_gmem_convert_execute_work()),
and while split/merge does take quite a few more cycles than your
average conversion operation (~100x more), the total cycles spent
splitting/merging ended up being about 7% of the total cycles spent
handling conversions (1043938460 cycles in this case).

For 1GB, a split/merge takes >1000x more than a normal conversion
operation (46475980 cycles vs 320 in this sample), but it's probably
still not too bad vs. the overall conversion path, and as mentioned
above it only happens about 6 times for a 16GB SNP guest, so I don't
think split/merge overhead is a huge deal for current guests,
especially if we work toward optimizing guest-side usage of shared
memory in the future. (There is potential for this to crater
performance for a very poorly-optimized guest; however, I think the
guest should bear some burden for that sort of thing: e.g. flipping
the same page back and forth between shared/private vs. caching it for
continued usage as a shared page in the guest driver path isn't
something we should put too much effort into optimizing.)
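
As a quick back-of-the-envelope check on the figures quoted above
(purely illustrative, standalone userspace C using only the numbers
from this message):

/* Back-of-the-envelope check of the figures quoted above. */
#include <stdio.h>

int main(void)
{
	unsigned long long split_1g = 46475980ULL;   /* one 1GB split/merge, cycles */
	unsigned long long conv     = 320ULL;        /* one ordinary conversion op */
	unsigned long long total_2m = 1043938460ULL; /* all conversion cycles, 2MB case */

	/* ">1000x" is conservative: this prints roughly 145237. */
	printf("1GB split/merge vs conversion: %llux\n", split_1g / conv);

	/* ~7%% of the 2MB-case total is roughly 73M cycles of split/merge. */
	printf("~7%% of total: %llu cycles\n", total_2m * 7 / 100);

	return 0;
}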

> 
> That being said, memory split and merge seem to be relatively
> lightweight for THP (with no memory allocation/freeing) and reusing
> the memory files after reboot of the guest VM will require pages to be
> merged to start with a clean slate. One possible option is to always
> merge as early as possible, second option is to invent a new UAPI to
> do it on demand.
> 
> For 1G pages, even if we go with 1G -> 2M -> 4K split stages, page
> splits result in higher memory usage with HVO around and it becomes
> useful to merge them back as early as possible as guest proceeds to
> convert subranges of different hugepages over its lifetime. Merging
> pages as early as possible also allows reusing of memory files during
> the next reboot without having to invent a new UAPI.
> 
> Caveats with "merge as early as possible":
> - Shared to private conversions will be slower for hugetlb pages.
>    * Counter argument: These conversions are already slow as we need
> safe refcounts to reach on the ranges getting converted.
> - If guests convert a particular range often then extra merge/split
> operations will result in overhead.
>    * Counter argument: Since conversions are anyways slow, it's
> beneficial for guests to avoid such a scenario and keep back and forth
> conversions as less frequent as possible.

Fair enough. I'm not seeing any major reason not to do things this way,
as the overhead doesn't seem to be very significant for the common case.

(even though, as noted above, the number of hugetlb pages we actually
end up merging at guest run-time seems to be fairly small, but maybe
there are scenarios where this will have a bigger impact, and it
certainly helps to have it there for the pre-boot merges.)

-Mike

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-11 21:18                                             ` Vishal Annapurve
@ 2025-07-12 17:33                                               ` Vishal Annapurve
  0 siblings, 0 replies; 231+ messages in thread
From: Vishal Annapurve @ 2025-07-12 17:33 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Rick P Edgecombe, pvorel@suse.cz, kvm@vger.kernel.org,
	catalin.marinas@arm.com, Jun Miao, palmer@dabbelt.com,
	pdurrant@amazon.co.uk, vbabka@suse.cz, peterx@redhat.com,
	x86@kernel.org, amoorthy@google.com, tabba@google.com,
	quic_svaddagi@quicinc.com, maz@kernel.org, vkuznets@redhat.com,
	anthony.yznaga@oracle.com, mail@maciej.szmigiero.name,
	quic_eberman@quicinc.com, Wei W Wang, Fan Du,
	Wieczor-Retman, Maciej, Yan Y Zhao, ajones@ventanamicro.com,
	Dave Hansen, paul.walmsley@sifive.com, quic_mnalajal@quicinc.com,
	aik@amd.com, usama.arif@bytedance.com, fvdl@google.com,
	jack@suse.cz, quic_cvanscha@quicinc.com, Kirill Shutemov,
	willy@infradead.org, steven.price@arm.com, anup@brainfault.org,
	thomas.lendacky@amd.com, keirf@google.com, mic@digikod.net,
	linux-kernel@vger.kernel.org, nsaenz@amazon.es,
	akpm@linux-foundation.org, oliver.upton@linux.dev,
	binbin.wu@linux.intel.com, muchun.song@linux.dev, Zhiquan1 Li,
	rientjes@google.com, Erdem Aktas, mpe@ellerman.id.au,
	david@redhat.com, jgg@ziepe.ca, hughd@google.com,
	jhubbard@nvidia.com, Haibo1 Xu, Isaku Yamahata,
	jthoughton@google.com, rppt@kernel.org, steven.sistare@oracle.com,
	jarkko@kernel.org, quic_pheragu@quicinc.com,
	chenhuacai@kernel.org, Kai Huang, shuah@kernel.org,
	bfoster@redhat.com, dwmw@amazon.co.uk, Chao P Peng,
	pankaj.gupta@amd.com, Alexander Graf, nikunj@amd.com,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Yilun Xu, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com, Xiaoyao Li,
	aou@eecs.berkeley.edu, Ira Weiny, richard.weiyang@gmail.com,
	kent.overstreet@linux.dev, qperret@google.com,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	linux-fsdevel@vger.kernel.org, ackerleytng@google.com,
	pgonda@google.com, quic_pderrin@quicinc.com, roypat@amazon.co.uk,
	hch@infradead.org, will@kernel.org, linux-mm@kvack.org

On Fri, Jul 11, 2025 at 2:18 PM Vishal Annapurve <vannapurve@google.com> wrote:
>
> On Wed, Jul 9, 2025 at 6:30 PM Vishal Annapurve <vannapurve@google.com> wrote:
> > > > 3) KVM should ideally associate the lifetime of backing
> > > > pagetables/protection tables/RMP tables with the lifetime of the
> > > > binding of memslots with guest_memfd.
> > >
> > > Again, please align your indentation.
> > >
> > > >          - Today KVM SNP logic ties RMP table entry lifetimes with how
> > > >            long the folios are mapped in guest_memfd, which I think should be
> > > >            revisited.
> > >
> > > Why?  Memslots are ephemeral per-"struct kvm" mappings.  RMP entries and guest_memfd
> > > inodes are tied to the Virtual Machine, not to the "struct kvm" instance.
> >
> > IIUC guest_memfd can only be accessed through the window of memslots
> > and if there are no memslots I don't see the reason for memory still
> > being associated with "virtual machine". Likely because I am yet to
> > completely wrap my head around 'guest_memfd inodes are tied to the
> > Virtual Machine, not to the "struct kvm" instance', I need to spend
> > more time on this one.
> >
>
> I see the benefits of tying inodes to the virtual machine and
> different guest_memfd files to different KVM instances. This allows us
> to exercise intra-host migration usecases for TDX/SNP. But I think
> this model doesn't allow us to reuse guest_memfd files for SNP VMs
> during reboot.
>
> Reboot scenario assuming reuse of existing guest_memfd inode for the
> next instance:
> 1) Create a VM
> 2) Create guest_memfd files that pin KVM instance
> 3) Create memslots
> 4) Start the VM
> 5) For reboot/shutdown, Execute VM specific Termination (e.g.
> KVM_TDX_TERMINATE_VM)
> 6) if allowed, delete the memslots
> 7) Create a new VM instance
> 8) Link the existing guest_memfd files to the new VM -> which creates
> new files for the same inode.
> 9) Close the existing guest_memfd files and the existing VM
> 10) Jump to step 3
>
> The difference between SNP and TDX is that TDX memory ownership is
> limited to the duration the pages are mapped in the second stage
> secure EPT tables, whereas SNP/RMP memory ownership lasts beyond
> memslots and effectively remains till folios are punched out from
> guest_memfd filemap. IIUC CCA might follow the suite of SNP in this
> regard with the pfns populated in GPT entries.
>
> I don't have a sense of how critical this problem could be, but this
> would mean for every reboot all large memory allocations will have to
> let go and need to be reallocated. For 1G support, we will be freeing
> guest_memfd pages using a background thread which may add some delays
> in being able to free up the memory in time.
>
> Instead if we did this:
> 1) Support creating guest_memfd files for a certain VM type that
> allows KVM to dictate the behavior of the guest_memfd.
> 2) Tie lifetime of KVM SNP/TDX memory ownership with guest_memfd and
> memslot bindings
>     - Each binding will increase a refcount on both guest_memfd file
> and KVM, so both can't go away while the binding exists.

I think that if we can ensure that any guest_memfd-initiated
interaction with KVM is only for invalidation, is based on a binding,
and happens under filemap_invalidate_lock, then there is no need to
pin KVM on each binding: binding/unbinding would also be protected by
filemap_invalidate_lock, so KVM can't go away during an invalidation.
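
A rough sketch of what that could look like (the function name is made
up; kvm_gmem_invalidate_begin() and the mapping's i_private_list are as
used elsewhere in this series, but the details here are assumptions):

static void kvm_gmem_invalidate_range(struct inode *inode,
				      pgoff_t start, pgoff_t end)
{
	struct kvm_gmem *gmem;

	filemap_invalidate_lock(inode->i_mapping);
	/*
	 * Bind/unbind would also take this lock, so every binding (and
	 * therefore every KVM reachable from one) stays alive for the
	 * duration of the invalidation without an extra reference.
	 */
	list_for_each_entry(gmem, &inode->i_mapping->i_private_list, entry)
		kvm_gmem_invalidate_begin(gmem, start, end);
	filemap_invalidate_unlock(inode->i_mapping);
}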



> 3) For SNP/CCA, pfns are invalidated from RMP/GPT tables during unbind
> operations while for TDX, KVM will invalidate secure EPT entries.
>
> This can allow us to decouple memory lifecycle from VM lifecycle and
> match the behavior with non-confidential VMs where memory can outlast
> VMs. Though this approach will mean change in intrahost migration
> implementation as we don't need to differentiate guest_memfd files and
> inodes.
>
> That being said, I might be missing something here and I don't have
> any data to back the criticality of this usecase for SNP and possibly
> CCA VMs.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-07-12  0:10                   ` Michael Roth
@ 2025-07-12 17:53                     ` Vishal Annapurve
  0 siblings, 0 replies; 231+ messages in thread
From: Vishal Annapurve @ 2025-07-12 17:53 UTC (permalink / raw)
  To: Michael Roth
  Cc: Ackerley Tng, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	aik, ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgg, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

On Fri, Jul 11, 2025 at 5:11 PM Michael Roth <michael.roth@amd.com> wrote:
> >
> > Wishful thinking on my part: It would be great to figure out a way to
> > promote these pagetable entries without relying on the guest, if
> > possible with ABI updates, as I think the host should have some
> > control over EPT/NPT granularities even for Confidential VMs. Along
>
> I'm not sure how much it would buy us. For example, for a 2MB hugetlb
> SNP guest boot with 16GB of memory I see 622 2MB hugepages getting
> split, but only about 30 or so of those get merged back to 2MB folios
> during guest run-time. These are presumably the set of 2MB regions we
> could promote back up, but it's not much given that we wouldn't expect
> that value to grow proportionally for larger guests: it's really
> separate things like the number of vCPUs (for shared GHCB pages), number
> of virtio buffers, etc. that end up determining the upper bound on how
> many pages might get split due to 4K private->shared conversion, and
> these would vary all that much from get to get outside maybe vCPU
> count.
>
> For 1GB hugetlb I see about 6 1GB pages get split, and only 2 get merged
> during run-time and would be candidates for promotion.
>

Thanks for the great analysis here. I think we will need to repeat
such analysis for other scenarios such as usage with accelerators.

> This could be greatly improved from the guest side by using
> higher-order allocations to create pools of shared memory that could
> then be used to reduce the number of splits caused by doing
> private->shared conversions on random ranges of malloc'd memory,
> and this could be done even without special promotion support on the
> host for pretty much the entirety of guest memory. The idea there would
> be to just making optimized guests avoid the splits completely, rather
> than relying on the limited subset that hardware can optimize without
> guest cooperation.

Yes, it would be great to improve the situation from the guest side.
I tried that with a rough draft [1]; the conclusion there was that we
need to set aside "enough" guest memory as CMA to make all DMA go
through 2M-aligned buffers. It's hard to figure out how much is
"enough", but we could start somewhere. That being said, the host
still has to manage memory this way by splitting/merging at runtime,
because I don't think it's possible to enforce that all conversions
happen at 2M (or any at 1G) granularity. So it's also very likely that
even if guests do a significant chunk of conversions at hugepage
granularity, the host still needs to split pages all the way down to
4K for all shared regions, unless we can bake another restriction into
the conversion ABI that guests may only convert back to private the
same ranges that were previously converted to shared.

[1] https://lore.kernel.org/lkml/20240112055251.36101-1-vannapurve@google.com/

>
> > the similar lines, it would be great to have "page struct"-less memory
> > working for Confidential VMs, which should greatly reduce the toil
> > with merge/split operations and will render the conversions mostly to
> > be pagetable manipulations.
>
> FWIW, I did some profiling of split/merge vs. overall conversion time
> (by that I mean all cycles spent within kvm_gmem_convert_execute_work()),
> and while split/merge does take quite a few more cycles than your
> average conversion operation (~100x more), the total cycles spent
> splitting/merging ended up being about 7% of the total cycles spent
> handling conversions (1043938460 cycles in this case).
>
> For 1GB, a split/merge take >1000x more than a normal conversion
> operation (46475980 cycles vs 320 in this sample), but it's probably
> still not too bad vs the overall conversion path, and as mentioned above
> it only happens about 6x for 16GB SNP guest so I don't think split/merge
> overhead is a huge deal for current guests, especially if we work toward
> optimizing guest-side usage of shared memory in the future. (There is
> potential for this to crater performance for a very poorly-optimized
> guest however but I think the guest should bear some burden for that
> sort of thing: e.g. flipping the same page back-and-forth between
> shared/private vs. caching it for continued usage as shared page in the
> guest driver path isn't something we should put too much effort into
> optimizing.)
>

As per past discussions, guest_memfd private pages are managed solely
by guest_memfd. We don't need, and effectively don't want, the kernel
to manage guest private memory. So in theory we can get rid of page
structs for private pages as well, allocating page structs only for
shared memory on conversion and deallocating them on conversion back
to private.

And once we have base core-mm allocators that hand out raw pfns to
start with, we won't even need shared memory ranges to be backed by
page structs.

A few hurdles we need to cross:
1) Invent a new filemap equivalent that maps guest_memfd offsets to
pfns (see the sketch after this list)
2) Modify TDX EPT management to work with pfns and not page structs
3) Modify generic KVM NPT/EPT management logic to work with pfns and
not rely on page structs
4) Modify memory error/hwpoison handling to route all memory errors on
such pfns to guest_memfd.
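
To make hurdle (1) concrete, a purely hypothetical sketch of such a
"filemap equivalent" could be something as simple as an xarray keyed by
file offset (none of these names exist in the series):

struct gmem_pfn_map {
	struct xarray offset_to_pfn;	/* index: file offset >> PAGE_SHIFT */
};

static void gmem_pfn_map_init(struct gmem_pfn_map *map)
{
	xa_init(&map->offset_to_pfn);
}

static int gmem_pfn_map_store(struct gmem_pfn_map *map, pgoff_t index,
			      unsigned long pfn)
{
	/* Store the pfn as a tagged value; no page struct involved. */
	return xa_err(xa_store(&map->offset_to_pfn, index,
			       xa_mk_value(pfn), GFP_KERNEL));
}

static unsigned long gmem_pfn_map_lookup(struct gmem_pfn_map *map,
					 pgoff_t index)
{
	void *entry = xa_load(&map->offset_to_pfn, index);

	return xa_is_value(entry) ? xa_to_value(entry) : 0;
}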

I believe there are obvious benefits (reduced complexity, reduced
memory footprint, etc.) if we go this route, and we are very likely to
go this route for future use cases even if we decide to live with
conversion costs today.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 33/51] KVM: guest_memfd: Allocate and truncate from custom allocator
  2025-06-03  7:43   ` Binbin Wu
@ 2025-07-16 22:13     ` Ackerley Tng
  0 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-07-16 22:13 UTC (permalink / raw)
  To: Binbin Wu
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko, jgg,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li

Binbin Wu <binbin.wu@linux.intel.com> writes:

> On 5/15/2025 7:42 AM, Ackerley Tng wrote:
> [...]
>>   
>>   	list_for_each_entry(gmem, gmem_list, entry)
>>   		kvm_gmem_invalidate_end(gmem, start, end);
>> @@ -776,6 +879,16 @@ static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
>>   
>>   	start = offset >> PAGE_SHIFT;
>>   	end = (offset + len) >> PAGE_SHIFT;
>> +	if (kvm_gmem_has_custom_allocator(inode)) {
>> +		size_t nr_pages;
>> +		void *p;
>> +
>> +		p = kvm_gmem_allocator_private(inode);
>> +		nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(p);
>> +
>> +		start = round_down(start, nr_pages);
>> +		end = round_down(end, nr_pages);
> It's weird here.
> Should the end be round_up()?
>

Thanks, you're right.

I believe the current consensus is that fallocate() will only be
permitted for offsets and lengths that are aligned not only to
PAGE_SIZE but also to the allocator page size.

In a future revision I'll check against the allocator page size
earlier on, before this function gets called, so this rounding will
probably go away.
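
For illustration, that earlier check might look roughly like this,
reusing the allocator helpers visible in this patch (placement and
error handling are assumptions):

static bool kvm_gmem_allocator_aligned(struct inode *inode,
				       loff_t offset, loff_t len)
{
	size_t nr_pages;
	void *p;

	if (!kvm_gmem_has_custom_allocator(inode))
		return true;

	p = kvm_gmem_allocator_private(inode);
	nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(p);

	/* Reject fallocate() requests not aligned to the allocator page size. */
	return IS_ALIGNED(offset, (loff_t)nr_pages << PAGE_SHIFT) &&
	       IS_ALIGNED(len, (loff_t)nr_pages << PAGE_SHIFT);
}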

>> +	}
>>   
>>   	r = 0;
>>   	for (index = start; index < end; ) {
>>
> [...]

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-07-02  8:35                 ` Yan Zhao
  2025-07-02 13:54                   ` Vishal Annapurve
@ 2025-07-16 22:22                   ` Ackerley Tng
  2025-07-17  9:32                     ` Xu Yilun
  1 sibling, 1 reply; 231+ messages in thread
From: Ackerley Tng @ 2025-07-16 22:22 UTC (permalink / raw)
  To: Yan Zhao, Vishal Annapurve
  Cc: Jason Gunthorpe, Alexey Kardashevskiy, Fuad Tabba, kvm, linux-mm,
	linux-kernel, x86, linux-fsdevel, ajones, akpm, amoorthy,
	anthony.yznaga, anup, aou, bfoster, binbin.wu, brauner,
	catalin.marinas, chao.p.peng, chenhuacai, dave.hansen, david,
	dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu, hch,
	hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, thomas.lendacky,
	usama.arif, vbabka, viro, vkuznets, wei.w.wang, will, willy,
	xiaoyao.li, yilun.xu, yuzenghui, zhiquan1.li

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Tue, Jun 24, 2025 at 07:10:38AM -0700, Vishal Annapurve wrote:
>> On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> >
>> > On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
>> >
>> > > Now, I am rebasing my RFC on top of this patchset and it fails in
>> > > kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
>> > > folios in my RFC.
>> > >
>> > > So what is the expected sequence here? The userspace unmaps a DMA
>> > > page and maps it back right away, all from the userspace? The end
>> > > result will be the exactly same which seems useless. And IOMMU TLB
>> 
>>  As Jason described, ideally IOMMU just like KVM, should just:
>> 1) Directly rely on guest_memfd for pinning -> no page refcounts taken
>> by IOMMU stack
> In TDX connect, TDX module and TDs do not trust VMM. So, it's the TDs to inform
> TDX module about which pages are used by it for DMAs purposes.
> So, if a page is regarded as pinned by TDs for DMA, the TDX module will fail the
> unmap of the pages from S-EPT.
>
> If IOMMU side does not increase refcount, IMHO, some way to indicate that
> certain PFNs are used by TDs for DMA is still required, so guest_memfd can
> reject the request before attempting the actual unmap.
> Otherwise, the unmap of TD-DMA-pinned pages will fail.
>
> Upon this kind of unmapping failure, it also doesn't help for host to retry
> unmapping without unpinning from TD.
>
>

Yan, Yilun, would it work if, on conversion,

1. guest_memfd notifies IOMMU that a conversion is about to happen for a
   PFN range
2. IOMMU forwards the notification to TDX code in the kernel
3. TDX code in kernel tells TDX module to stop thinking of any PFNs in
   the range as pinned for DMA?

If the above is possible then by the time we get to unmapping from
S-EPTs, TDX module would already consider the PFNs in the range "not
pinned for DMA".

>> 2) Directly query pfns from guest_memfd for both shared/private ranges
>> 3) Implement an invalidation callback that guest_memfd can invoke on
>> conversions.
>> 
>> Current flow:
>> Private to Shared conversion via kvm_gmem_convert_range() -
>>     1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
>> on each bound memslot overlapping with the range
>>          -> KVM has the concept of invalidation_begin() and end(),
>> which effectively ensures that between these function calls, no new
>> EPT/NPT entries can be added for the range.
>>      2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
>> actually unmaps the KVM SEPT/NPT entries.
>>      3) guest_memfd invokes kvm_gmem_execute_work() which updates the
>> shareability and then splits the folios if needed
>> 
>> Shared to private conversion via kvm_gmem_convert_range() -
>>     1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
>> on each bound memslot overlapping with the range
>>      2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
>> actually unmaps the host mappings which will unmap the KVM non-seucure
>> EPT/NPT entries.
>>      3) guest_memfd invokes kvm_gmem_execute_work() which updates the
>> shareability and then merges the folios if needed.
>> 
>> ============================
>> 
>> For IOMMU, could something like below work?
>> 
>> * A new UAPI to bind IOMMU FDs with guest_memfd ranges
>> * VFIO_DMA_MAP/UNMAP operations modified to directly fetch pfns from
>> guest_memfd ranges using kvm_gmem_get_pfn()
>>     -> kvm invokes kvm_gmem_is_private() to check for the range
>> shareability, IOMMU could use the same or we could add an API in gmem
>> that takes in access type and checks the shareability before returning
>> the pfn.
>> * IOMMU stack exposes an invalidation callback that can be invoked by
>> guest_memfd.
>> 
>> Private to Shared conversion via kvm_gmem_convert_range() -
>>     1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
>> on each bound memslot overlapping with the range
>>      2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
>> actually unmaps the KVM SEPT/NPT entries.
>>            -> guest_memfd invokes IOMMU invalidation callback to zap
>> the secure IOMMU entries.
> If guest_memfd could determine if a page is used by DMA purposes before
> attempting the actual unmaps, it could reject and fail the conversion earlier,
> thereby keeping IOMMU/S-EPT mappings intact.
>
> This could prevent the conversion from partially failing.
>

If the above suggestion works, then instead of checking if pages are
allowed to be unmapped, guest_memfd will just force everyone to unmap.

>>      3) guest_memfd invokes kvm_gmem_execute_work() which updates the
>> shareability and then splits the folios if needed
>>      4) Userspace invokes IOMMU map operation to map the ranges in
>> non-secure IOMMU.
>> 
>> Shared to private conversion via kvm_gmem_convert_range() -
>>     1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
>> on each bound memslot overlapping with the range
>>      2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
>> actually unmaps the host mappings which will unmap the KVM non-seucure
>> EPT/NPT entries.
>>          -> guest_memfd invokes IOMMU invalidation callback to zap the
>> non-secure IOMMU entries.
>>      3) guest_memfd invokes kvm_gmem_execute_work() which updates the
>> shareability and then merges the folios if needed.
>>      4) Userspace invokes IOMMU map operation to map the ranges in secure IOMMU.
>> 
>> There should be a way to block external IOMMU pagetable updates while
>> guest_memfd is performing conversion e.g. something like
>> kvm_invalidate_begin()/end().
>> 
>> > > is going to be flushed on a page conversion anyway (the RMPUPDATE
>> > > instruction does that). All this is about AMD's x86 though.
>> >
>> > The iommu should not be using the VMA to manage the mapping. It should
>> 
>> +1.
>> 
>> > be directly linked to the guestmemfd in some way that does not disturb
>> > its operations. I imagine there would be some kind of invalidation
>> > callback directly to the iommu.
>> >
>> > Presumably that invalidation call back can include a reason for the
>> > invalidation (addr change, shared/private conversion, etc)
>> >
>> > I'm not sure how we will figure out which case is which but guestmemfd
>> > should allow the iommu to plug in either invalidation scheme..
>> >
>> > Probably invalidation should be a global to the FD thing, I imagine
>> > that once invalidation is established the iommu will not be
>> > incrementing page refcounts.
>> 
>> +1.
>> 
>> >
>> > Jason
>> 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd
  2025-07-08 18:37                                 ` Fuad Tabba
@ 2025-07-16 23:06                                   ` Ackerley Tng
  0 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-07-16 23:06 UTC (permalink / raw)
  To: Fuad Tabba, Sean Christopherson
  Cc: Vishal Annapurve, Rick P Edgecombe, pvorel@suse.cz,
	kvm@vger.kernel.org, catalin.marinas@arm.com, Jun Miao,
	Kirill Shutemov, pdurrant@amazon.co.uk, vbabka@suse.cz,
	peterx@redhat.com, x86@kernel.org, amoorthy@google.com,
	jack@suse.cz, quic_svaddagi@quicinc.com, keirf@google.com,
	palmer@dabbelt.com, vkuznets@redhat.com,
	mail@maciej.szmigiero.name, anthony.yznaga@oracle.com, Wei W Wang,
	Wieczor-Retman, Maciej, Yan Y Zhao, ajones@ventanamicro.com,
	willy@infradead.org, rppt@kernel.org, quic_mnalajal@quicinc.com,
	aik@amd.com, usama.arif@bytedance.com, Dave Hansen,
	fvdl@google.com, paul.walmsley@sifive.com, bfoster@redhat.com,
	nsaenz@amazon.es, anup@brainfault.org, quic_eberman@quicinc.com,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	mic@digikod.net, oliver.upton@linux.dev,
	akpm@linux-foundation.org, quic_cvanscha@quicinc.com,
	steven.price@arm.com, binbin.wu@linux.intel.com, hughd@google.com,
	Zhiquan1 Li, rientjes@google.com, mpe@ellerman.id.au, Erdem Aktas,
	david@redhat.com, jgg@ziepe.ca, jhubbard@nvidia.com, Haibo1 Xu,
	Fan Du, maz@kernel.org, muchun.song@linux.dev, Isaku Yamahata,
	jthoughton@google.com, steven.sistare@oracle.com,
	quic_pheragu@quicinc.com, jarkko@kernel.org,
	chenhuacai@kernel.org, Kai Huang, shuah@kernel.org,
	dwmw@amazon.co.uk, Chao P Peng, pankaj.gupta@amd.com,
	Alexander Graf, nikunj@amd.com, viro@zeniv.linux.org.uk,
	pbonzini@redhat.com, yuzenghui@huawei.com, jroedel@suse.de,
	suzuki.poulose@arm.com, jgowans@amazon.com, Yilun Xu,
	liam.merwick@oracle.com, michael.roth@amd.com,
	quic_tsoni@quicinc.com, Xiaoyao Li, aou@eecs.berkeley.edu,
	Ira Weiny, richard.weiyang@gmail.com, kent.overstreet@linux.dev,
	qperret@google.com, dmatlack@google.com, james.morse@arm.com,
	brauner@kernel.org, linux-fsdevel@vger.kernel.org,
	pgonda@google.com, quic_pderrin@quicinc.com, hch@infradead.org,
	linux-mm@kvack.org, will@kernel.org, roypat@amazon.co.uk

Fuad Tabba <tabba@google.com> writes:

> On Tue, 8 Jul 2025 at 18:25, Sean Christopherson <seanjc@google.com> wrote:
>>
>> On Tue, Jul 08, 2025, Fuad Tabba wrote:
>> > > > I don't think we need a flag to preserve memory as I mentioned in [2]. IIUC,
>> > > > 1) Conversions are always content-preserving for pKVM.
>> > >
>> > > No?  Perserving contents on private => shared is a security vulnerability waiting
>> > > to happen.
>> >
>> > Actually it is one of the requirements for pKVM as well as its current
>> > behavior. We would like to preserve contents both ways, private <=>
>> > shared, since it is required by some of the potential use cases (e.g.,
>> > guest handling video encoding/decoding).
>> >
>> > To make it clear, I'm talking about explicit sharing from the guest,
>> > not relinquishing memory back to the host. In the case of
>> > relinquishing (and guest teardown), relinquished memory is poisoned
>> > (zeroed) in pKVM.
>>
>> I forget, what's the "explicit sharing" flow look like?  E.g. how/when does pKVM
>> know it's ok to convert memory from private to shared?  I think we'd still want
>> to make data preservation optional, e.g. to avoid potential leakage with setups
>> where memory is private by default, but a flag in KVM's uAPI might not be a good
>> fit since whether or not to preserve data is more of a guest decision (or at least
>> needs to be ok'd by the guest).
>
> In pKVM all sharing and unsharing is triggered by the guest via
> hypercalls. The host cannot unshare.

In pKVM's case, would the conversion ioctl be disabled completely, or
would the ioctl be allowed, but conversion would always check with
pKVM to see if the guest had previously requested an unshare?

> That said, making data
> preservation optional works for pKVM and is a good idea, for the
> reasons that you've mentioned.
>
> Cheers,
> /fuad

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-07-16 22:22                   ` Ackerley Tng
@ 2025-07-17  9:32                     ` Xu Yilun
  2025-07-17 16:56                       ` Ackerley Tng
  0 siblings, 1 reply; 231+ messages in thread
From: Xu Yilun @ 2025-07-17  9:32 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Yan Zhao, Vishal Annapurve, Jason Gunthorpe, Alexey Kardashevskiy,
	Fuad Tabba, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yilun.xu, yuzenghui, zhiquan1.li

On Wed, Jul 16, 2025 at 03:22:06PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Tue, Jun 24, 2025 at 07:10:38AM -0700, Vishal Annapurve wrote:
> >> On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >> >
> >> > On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
> >> >
> >> > > Now, I am rebasing my RFC on top of this patchset and it fails in
> >> > > kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
> >> > > folios in my RFC.
> >> > >
> >> > > So what is the expected sequence here? The userspace unmaps a DMA
> >> > > page and maps it back right away, all from the userspace? The end
> >> > > result will be the exactly same which seems useless. And IOMMU TLB
> >> 
> >>  As Jason described, ideally IOMMU just like KVM, should just:
> >> 1) Directly rely on guest_memfd for pinning -> no page refcounts taken
> >> by IOMMU stack
> > In TDX connect, TDX module and TDs do not trust VMM. So, it's the TDs to inform
> > TDX module about which pages are used by it for DMAs purposes.
> > So, if a page is regarded as pinned by TDs for DMA, the TDX module will fail the
> > unmap of the pages from S-EPT.
> >
> > If IOMMU side does not increase refcount, IMHO, some way to indicate that
> > certain PFNs are used by TDs for DMA is still required, so guest_memfd can
> > reject the request before attempting the actual unmap.
> > Otherwise, the unmap of TD-DMA-pinned pages will fail.
> >
> > Upon this kind of unmapping failure, it also doesn't help for host to retry
> > unmapping without unpinning from TD.
> >
> >
> 
> Yan, Yilun, would it work if, on conversion,
> 
> 1. guest_memfd notifies IOMMU that a conversion is about to happen for a
>    PFN range

It is the guest fw call that releases the pinning. By the time the VMM
gets the conversion request, the page is already physically unpinned.
So I agree with Jason that the pinning doesn't have to reach the IOMMU
from a SW POV.

> 2. IOMMU forwards the notification to TDX code in the kernel
> 3. TDX code in kernel tells TDX module to stop thinking of any PFNs in
>    the range as pinned for DMA?

The TDX host can't stop the pinning. Actually, this mechanism exists
to prevent the host from unpinning/unmapping the DMA against the
guest's expectations.

Thanks,
Yilun

> 
> If the above is possible then by the time we get to unmapping from
> S-EPTs, TDX module would already consider the PFNs in the range "not
> pinned for DMA".

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-07-17  9:32                     ` Xu Yilun
@ 2025-07-17 16:56                       ` Ackerley Tng
  2025-07-18  2:48                         ` Xu Yilun
  0 siblings, 1 reply; 231+ messages in thread
From: Ackerley Tng @ 2025-07-17 16:56 UTC (permalink / raw)
  To: Xu Yilun
  Cc: Yan Zhao, Vishal Annapurve, Jason Gunthorpe, Alexey Kardashevskiy,
	Fuad Tabba, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yilun.xu, yuzenghui, zhiquan1.li

Xu Yilun <yilun.xu@linux.intel.com> writes:

> On Wed, Jul 16, 2025 at 03:22:06PM -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@intel.com> writes:
>> 
>> > On Tue, Jun 24, 2025 at 07:10:38AM -0700, Vishal Annapurve wrote:
>> >> On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> >> >
>> >> > On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
>> >> >
>> >> > > Now, I am rebasing my RFC on top of this patchset and it fails in
>> >> > > kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
>> >> > > folios in my RFC.
>> >> > >
>> >> > > So what is the expected sequence here? The userspace unmaps a DMA
>> >> > > page and maps it back right away, all from the userspace? The end
>> >> > > result will be the exactly same which seems useless. And IOMMU TLB
>> >> 
>> >>  As Jason described, ideally IOMMU just like KVM, should just:
>> >> 1) Directly rely on guest_memfd for pinning -> no page refcounts taken
>> >> by IOMMU stack
>> > In TDX connect, TDX module and TDs do not trust VMM. So, it's the TDs to inform
>> > TDX module about which pages are used by it for DMAs purposes.
>> > So, if a page is regarded as pinned by TDs for DMA, the TDX module will fail the
>> > unmap of the pages from S-EPT.
>> >
>> > If IOMMU side does not increase refcount, IMHO, some way to indicate that
>> > certain PFNs are used by TDs for DMA is still required, so guest_memfd can
>> > reject the request before attempting the actual unmap.
>> > Otherwise, the unmap of TD-DMA-pinned pages will fail.
>> >
>> > Upon this kind of unmapping failure, it also doesn't help for host to retry
>> > unmapping without unpinning from TD.
>> >
>> >
>> 
>> Yan, Yilun, would it work if, on conversion,
>> 
>> 1. guest_memfd notifies IOMMU that a conversion is about to happen for a
>>    PFN range
>
> It is the Guest fw call to release the pinning.

I see, thanks for explaining.

> By the time VMM get the
> conversion requirement, the page is already physically unpinned. So I
> agree with Jason the pinning doesn't have to reach to iommu from SW POV.
>

If by the time KVM gets the conversion request, the page is unpinned,
then we're all good, right?

When guest_memfd gets the conversion request, as part of conversion
handling it will request to zap the page from stage-2 page tables. TDX
module would see that the page is unpinned and the unmapping will
proceed fine. Is that understanding correct?

>> 2. IOMMU forwards the notification to TDX code in the kernel
>> 3. TDX code in kernel tells TDX module to stop thinking of any PFNs in
>>    the range as pinned for DMA?
>
> TDX host can't stop the pinning. Actually this mechanism is to prevent
> host from unpin/unmap the DMA out of Guest expectation.
>

On this note, I'd also like to check something else. Putting TDX connect
and IOMMUs aside, if the host unmaps a guest private page today without
the guest requesting it, the unmapping will work and the guest will be
broken, right?

> Thanks,
> Yilun
>
>> 
>> If the above is possible then by the time we get to unmapping from
>> S-EPTs, TDX module would already consider the PFNs in the range "not
>> pinned for DMA".


^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-07-17 16:56                       ` Ackerley Tng
@ 2025-07-18  2:48                         ` Xu Yilun
  2025-07-18 14:15                           ` Jason Gunthorpe
  2025-07-18 15:13                           ` Ira Weiny
  0 siblings, 2 replies; 231+ messages in thread
From: Xu Yilun @ 2025-07-18  2:48 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Yan Zhao, Vishal Annapurve, Jason Gunthorpe, Alexey Kardashevskiy,
	Fuad Tabba, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yilun.xu, yuzenghui, zhiquan1.li

On Thu, Jul 17, 2025 at 09:56:01AM -0700, Ackerley Tng wrote:
> Xu Yilun <yilun.xu@linux.intel.com> writes:
> 
> > On Wed, Jul 16, 2025 at 03:22:06PM -0700, Ackerley Tng wrote:
> >> Yan Zhao <yan.y.zhao@intel.com> writes:
> >> 
> >> > On Tue, Jun 24, 2025 at 07:10:38AM -0700, Vishal Annapurve wrote:
> >> >> On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >> >> >
> >> >> > On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
> >> >> >
> >> >> > > Now, I am rebasing my RFC on top of this patchset and it fails in
> >> >> > > kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
> >> >> > > folios in my RFC.
> >> >> > >
> >> >> > > So what is the expected sequence here? The userspace unmaps a DMA
> >> >> > > page and maps it back right away, all from the userspace? The end
> >> >> > > result will be the exactly same which seems useless. And IOMMU TLB
> >> >> 
> >> >>  As Jason described, ideally IOMMU just like KVM, should just:
> >> >> 1) Directly rely on guest_memfd for pinning -> no page refcounts taken
> >> >> by IOMMU stack
> >> > In TDX connect, TDX module and TDs do not trust VMM. So, it's the TDs to inform
> >> > TDX module about which pages are used by it for DMAs purposes.
> >> > So, if a page is regarded as pinned by TDs for DMA, the TDX module will fail the
> >> > unmap of the pages from S-EPT.
> >> >
> >> > If IOMMU side does not increase refcount, IMHO, some way to indicate that
> >> > certain PFNs are used by TDs for DMA is still required, so guest_memfd can
> >> > reject the request before attempting the actual unmap.
> >> > Otherwise, the unmap of TD-DMA-pinned pages will fail.
> >> >
> >> > Upon this kind of unmapping failure, it also doesn't help for host to retry
> >> > unmapping without unpinning from TD.
> >> >
> >> >
> >> 
> >> Yan, Yilun, would it work if, on conversion,
> >> 
> >> 1. guest_memfd notifies IOMMU that a conversion is about to happen for a
> >>    PFN range
> >
> > It is the Guest fw call to release the pinning.
> 
> I see, thanks for explaining.
> 
> > By the time VMM get the
> > conversion requirement, the page is already physically unpinned. So I
> > agree with Jason the pinning doesn't have to reach to iommu from SW POV.
> >
> 
> If by the time KVM gets the conversion request, the page is unpinned,
> then we're all good, right?

Yes, unless the guest fails to unpin the page first by mistake. The
guest would invoke the fw call tdg.mem.page.release to unpin the page
before KVM_HC_MAP_GPA_RANGE.

> 
> When guest_memfd gets the conversion request, as part of conversion
> handling it will request to zap the page from stage-2 page tables. TDX
> module would see that the page is unpinned and the unmapping will
> proceed fine. Is that understanding correct?

Yes, again unless the guest doesn't unpin.

> 
> >> 2. IOMMU forwards the notification to TDX code in the kernel
> >> 3. TDX code in kernel tells TDX module to stop thinking of any PFNs in
> >>    the range as pinned for DMA?
> >
> > TDX host can't stop the pinning. Actually this mechanism is to prevent
> > host from unpin/unmap the DMA out of Guest expectation.
> >
> 
> On this note, I'd also like to check something else. Putting TDX connect
> and IOMMUs aside, if the host unmaps a guest private page today without
> the guest requesting it, the unmapping will work and the guest will be
> broken, right?

Correct. The unmapping will work, but the guest can't continue anymore.

Thanks,
Yilun

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-07-18  2:48                         ` Xu Yilun
@ 2025-07-18 14:15                           ` Jason Gunthorpe
  2025-07-21 14:18                             ` Xu Yilun
  2025-07-18 15:13                           ` Ira Weiny
  1 sibling, 1 reply; 231+ messages in thread
From: Jason Gunthorpe @ 2025-07-18 14:15 UTC (permalink / raw)
  To: Xu Yilun
  Cc: Ackerley Tng, Yan Zhao, Vishal Annapurve, Alexey Kardashevskiy,
	Fuad Tabba, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yilun.xu, yuzenghui, zhiquan1.li

On Fri, Jul 18, 2025 at 10:48:55AM +0800, Xu Yilun wrote:
> > If by the time KVM gets the conversion request, the page is unpinned,
> > then we're all good, right?
> 
> Yes, unless guest doesn't unpin the page first by mistake. Guest would
> invoke a fw call tdg.mem.page.release to unpin the page before
> KVM_HC_MAP_GPA_RANGE.

What does guest pinning mean?

Jason

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-07-18  2:48                         ` Xu Yilun
  2025-07-18 14:15                           ` Jason Gunthorpe
@ 2025-07-18 15:13                           ` Ira Weiny
  2025-07-21  9:58                             ` Xu Yilun
  1 sibling, 1 reply; 231+ messages in thread
From: Ira Weiny @ 2025-07-18 15:13 UTC (permalink / raw)
  To: Xu Yilun, Ackerley Tng
  Cc: Yan Zhao, Vishal Annapurve, Jason Gunthorpe, Alexey Kardashevskiy,
	Fuad Tabba, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yilun.xu, yuzenghui, zhiquan1.li

Xu Yilun wrote:
> On Thu, Jul 17, 2025 at 09:56:01AM -0700, Ackerley Tng wrote:
> > Xu Yilun <yilun.xu@linux.intel.com> writes:
> > 
> > > On Wed, Jul 16, 2025 at 03:22:06PM -0700, Ackerley Tng wrote:
> > >> Yan Zhao <yan.y.zhao@intel.com> writes:
> > >> 
> > >> > On Tue, Jun 24, 2025 at 07:10:38AM -0700, Vishal Annapurve wrote:
> > >> >> On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >> >> >
> > >> >> > On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
> > >> >> >
> > >> >> > > Now, I am rebasing my RFC on top of this patchset and it fails in
> > >> >> > > kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
> > >> >> > > folios in my RFC.
> > >> >> > >
> > >> >> > > So what is the expected sequence here? The userspace unmaps a DMA
> > >> >> > > page and maps it back right away, all from the userspace? The end
> > >> >> > > result will be the exactly same which seems useless. And IOMMU TLB
> > >> >> 
> > >> >>  As Jason described, ideally IOMMU just like KVM, should just:
> > >> >> 1) Directly rely on guest_memfd for pinning -> no page refcounts taken
> > >> >> by IOMMU stack
> > >> > In TDX connect, TDX module and TDs do not trust VMM. So, it's the TDs to inform
> > >> > TDX module about which pages are used by it for DMAs purposes.
> > >> > So, if a page is regarded as pinned by TDs for DMA, the TDX module will fail the
> > >> > unmap of the pages from S-EPT.
> > >> >
> > >> > If IOMMU side does not increase refcount, IMHO, some way to indicate that
> > >> > certain PFNs are used by TDs for DMA is still required, so guest_memfd can
> > >> > reject the request before attempting the actual unmap.
> > >> > Otherwise, the unmap of TD-DMA-pinned pages will fail.
> > >> >
> > >> > Upon this kind of unmapping failure, it also doesn't help for host to retry
> > >> > unmapping without unpinning from TD.
> > >> >
> > >> >
> > >> 
> > >> Yan, Yilun, would it work if, on conversion,
> > >> 
> > >> 1. guest_memfd notifies IOMMU that a conversion is about to happen for a
> > >>    PFN range
> > >
> > > It is the Guest fw call to release the pinning.
> > 
> > I see, thanks for explaining.
> > 
> > > By the time VMM get the
> > > conversion requirement, the page is already physically unpinned. So I
> > > agree with Jason the pinning doesn't have to reach to iommu from SW POV.
> > >
> > 
> > If by the time KVM gets the conversion request, the page is unpinned,
> > then we're all good, right?
> 
> Yes, unless guest doesn't unpin the page first by mistake.

Or maliciously?  :-(

My initial response to this was that this is a bug and we don't need to be
concerned with it.  However, can't this be a DOS from one TD to crash the
system if the host uses the private page for something else and the
machine #MC's?

Ira

> Guest would
> invoke a fw call tdg.mem.page.release to unpin the page before
> KVM_HC_MAP_GPA_RANGE.
> 
> > 
> > When guest_memfd gets the conversion request, as part of conversion
> > handling it will request to zap the page from stage-2 page tables. TDX
> > module would see that the page is unpinned and the unmapping will
> > proceed fine. Is that understanding correct?
> 
> Yes, again unless guess doesn't unpin.
> 
> > 
> > >> 2. IOMMU forwards the notification to TDX code in the kernel
> > >> 3. TDX code in kernel tells TDX module to stop thinking of any PFNs in
> > >>    the range as pinned for DMA?
> > >
> > > TDX host can't stop the pinning. Actually this mechanism is to prevent
> > > host from unpin/unmap the DMA out of Guest expectation.
> > >
> > 
> > On this note, I'd also like to check something else. Putting TDX connect
> > and IOMMUs aside, if the host unmaps a guest private page today without
> > the guest requesting it, the unmapping will work and the guest will be
> > broken, right?
> 
> Correct. The unmapping will work, the guest can't continue anymore.
> 
> Thanks,
> Yilun



^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-07-18 15:13                           ` Ira Weiny
@ 2025-07-21  9:58                             ` Xu Yilun
  2025-07-22 18:17                               ` Ackerley Tng
  0 siblings, 1 reply; 231+ messages in thread
From: Xu Yilun @ 2025-07-21  9:58 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Ackerley Tng, Yan Zhao, Vishal Annapurve, Jason Gunthorpe,
	Alexey Kardashevskiy, Fuad Tabba, kvm, linux-mm, linux-kernel,
	x86, linux-fsdevel, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, isaku.yamahata, jack,
	james.morse, jarkko, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yilun.xu, yuzenghui, zhiquan1.li

> > > >> Yan, Yilun, would it work if, on conversion,
> > > >> 
> > > >> 1. guest_memfd notifies IOMMU that a conversion is about to happen for a
> > > >>    PFN range
> > > >
> > > > It is the Guest fw call to release the pinning.
> > > 
> > > I see, thanks for explaining.
> > > 
> > > > By the time VMM get the
> > > > conversion requirement, the page is already physically unpinned. So I
> > > > agree with Jason the pinning doesn't have to reach to iommu from SW POV.
> > > >
> > > 
> > > If by the time KVM gets the conversion request, the page is unpinned,
> > > then we're all good, right?
> > 
> > Yes, unless guest doesn't unpin the page first by mistake.
> 
> Or maliciously?  :-(

Yes.

> 
> My initial response to this was that this is a bug and we don't need to be
> concerned with it.  However, can't this be a DOS from one TD to crash the
> system if the host uses the private page for something else and the
> machine #MC's?

I think we already do something to prevent vCPUs from executing and
then destroy the VM, so there is no further TD access. But I assume the
concern is that a TD could just leak a lot of resources, and we are
investigating whether the host can reclaim them.

Thanks,
Yilun

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-07-18 14:15                           ` Jason Gunthorpe
@ 2025-07-21 14:18                             ` Xu Yilun
  0 siblings, 0 replies; 231+ messages in thread
From: Xu Yilun @ 2025-07-21 14:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Ackerley Tng, Yan Zhao, Vishal Annapurve, Alexey Kardashevskiy,
	Fuad Tabba, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, ira.weiny, isaku.yamahata, jack,
	james.morse, jarkko, jgowans, jhubbard, jroedel, jthoughton,
	jun.miao, kai.huang, keirf, kent.overstreet, kirill.shutemov,
	liam.merwick, maciej.wieczor-retman, mail, maz, mic, michael.roth,
	mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yilun.xu, yuzenghui, zhiquan1.li

On Fri, Jul 18, 2025 at 11:15:59AM -0300, Jason Gunthorpe wrote:
> On Fri, Jul 18, 2025 at 10:48:55AM +0800, Xu Yilun wrote:
> > > If by the time KVM gets the conversion request, the page is unpinned,
> > > then we're all good, right?
> > 
> > Yes, unless guest doesn't unpin the page first by mistake. Guest would
> > invoke a fw call tdg.mem.page.release to unpin the page before
> > KVM_HC_MAP_GPA_RANGE.
> 
> What does guest pinning mean?

TDX firmware provides a mode in which the host can't block the S-EPT
mapping after the TD accepts the mapping. The guest 'pins' the private
mapping (KVM & IOMMU).

The TD should explicitly unaccept the page via tdg.mem.page.release;
only then can the host successfully block/unmap the S-EPT. This is
necessary for shared <-> private conversion.

When TDX Connect is enabled, this mode is enforced.
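
Concretely, the guest-side ordering for a private -> shared conversion
under this mode would be roughly as below. The wrapper and flag names
are illustrative only (they stand in for the TDCALL and the
KVM_HC_MAP_GPA_RANGE hypercall above), not existing kernel symbols:

/* Illustrative sketch only, not real code from the TDX guest support. */
static int td_guest_convert_to_shared(u64 gpa, u64 nr_pages)
{
	int ret;

	/* 1. Drop the guest 'pin' so the host may block/unmap the S-EPT. */
	ret = tdg_mem_page_release(gpa, nr_pages);
	if (ret)
		return ret;

	/* 2. Only then ask the host to convert the range to shared. */
	return kvm_hc_map_gpa_range(gpa, nr_pages, MAP_GPA_SHARED);
}

If the guest skips step 1, the host's later attempt to block/unmap the
S-EPT fails, which is the case discussed above.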

Thanks,
Yilun

> 
> Jason
> 

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-07-21  9:58                             ` Xu Yilun
@ 2025-07-22 18:17                               ` Ackerley Tng
  2025-07-22 19:25                                 ` Edgecombe, Rick P
  0 siblings, 1 reply; 231+ messages in thread
From: Ackerley Tng @ 2025-07-22 18:17 UTC (permalink / raw)
  To: Xu Yilun, Ira Weiny
  Cc: Yan Zhao, Vishal Annapurve, Jason Gunthorpe, Alexey Kardashevskiy,
	Fuad Tabba, kvm, linux-mm, linux-kernel, x86, linux-fsdevel,
	ajones, akpm, amoorthy, anthony.yznaga, anup, aou, bfoster,
	binbin.wu, brauner, catalin.marinas, chao.p.peng, chenhuacai,
	dave.hansen, david, dmatlack, dwmw, erdemaktas, fan.du, fvdl,
	graf, haibo1.xu, hch, hughd, isaku.yamahata, jack, james.morse,
	jarkko, jgowans, jhubbard, jroedel, jthoughton, jun.miao,
	kai.huang, keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, thomas.lendacky,
	usama.arif, vbabka, viro, vkuznets, wei.w.wang, will, willy,
	xiaoyao.li, yilun.xu, yuzenghui, zhiquan1.li

Xu Yilun <yilun.xu@linux.intel.com> writes:

>> > > >> Yan, Yilun, would it work if, on conversion,
>> > > >> 
>> > > >> 1. guest_memfd notifies IOMMU that a conversion is about to happen for a
>> > > >>    PFN range
>> > > >
>> > > > It is the Guest fw call to release the pinning.
>> > > 
>> > > I see, thanks for explaining.
>> > > 
>> > > > By the time the VMM gets the
>> > > > conversion request, the page is already physically unpinned. So I
>> > > > agree with Jason that the pinning doesn't have to reach the IOMMU from a SW POV.
>> > > >
>> > > 
>> > > If by the time KVM gets the conversion request, the page is unpinned,
>> > > then we're all good, right?
>> > 
>> > Yes, unless guest doesn't unpin the page first by mistake.
>> 
>> Or maliciously?  :-(
>
> Yes.
>
>> 
>> My initial response to this was that this is a bug and we don't need to be
>> concerned with it.  However, can't this be a DOS from one TD to crash the
>> system if the host uses the private page for something else and the
>> machine #MC's?
>
> I think we already do something to prevent vCPUs from executing and
> then destroy the VM, so there is no further TD access. But I assume the
> concern is that a TD could just leak a lot of resources, and we are
> investigating whether the host can reclaim them.
>
> Thanks,
> Yilun

Sounds like a malicious guest could skip unpinning private memory, and
guest_memfd's unmap will fail, leading to a KVM_BUG_ON() as Yan/Rick
suggested here [1].

Actually it seems like a legacy guest would also lead to unmap failures
and the KVM_BUG_ON(), since when TDX connect is enabled, the pinning
mode is enforced, even for non-IO private pages?

I hope your team's investigations find a good way for the host to
reclaim memory, at least from dead TDs! Otherwise this would be an open
hole for guests to leak a host's memory.

Circling back to the original topic [2], it sounds like we're okay with
the IOMMU *not* taking any refcounts on pages and instead relying on
guest_memfd to keep the page around on behalf of the VM?

[1] https://lore.kernel.org/all/diqzcya13x2j.fsf@ackerleytng-ctop.c.googlers.com/
[2] https://lore.kernel.org/all/CAGtprH_qh8sEY3s-JucW3n1Wvoq7jdVZDDokvG5HzPf0HV2=pg@mail.gmail.com/

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  2025-07-22 18:17                               ` Ackerley Tng
@ 2025-07-22 19:25                                 ` Edgecombe, Rick P
  0 siblings, 0 replies; 231+ messages in thread
From: Edgecombe, Rick P @ 2025-07-22 19:25 UTC (permalink / raw)
  To: ackerleytng@google.com, yilun.xu@linux.intel.com, Weiny, Ira
  Cc: palmer@dabbelt.com, kvm@vger.kernel.org, catalin.marinas@arm.com,
	Miao, Jun, nsaenz@amazon.es, kirill.shutemov@intel.com,
	pdurrant@amazon.co.uk, peterx@redhat.com, x86@kernel.org,
	amoorthy@google.com, jack@suse.cz, maz@kernel.org,
	tabba@google.com, pvorel@suse.cz, anthony.yznaga@oracle.com,
	keirf@google.com, Annapurve, Vishal, hughd@google.com,
	mail@maciej.szmigiero.name, Du, Fan, Wieczor-Retman, Maciej,
	Zhao, Yan Y, ajones@ventanamicro.com, Hansen, Dave,
	paul.walmsley@sifive.com, quic_mnalajal@quicinc.com, aik@amd.com,
	steven.price@arm.com, vkuznets@redhat.com, fvdl@google.com,
	rppt@kernel.org, bfoster@redhat.com, quic_cvanscha@quicinc.com,
	vbabka@suse.cz, anup@brainfault.org, quic_eberman@quicinc.com,
	linux-kernel@vger.kernel.org, thomas.lendacky@amd.com,
	mic@digikod.net, oliver.upton@linux.dev,
	akpm@linux-foundation.org, usama.arif@bytedance.com,
	binbin.wu@linux.intel.com, muchun.song@linux.dev, Li, Zhiquan1,
	rientjes@google.com, Aktas, Erdem, mpe@ellerman.id.au,
	david@redhat.com, jgg@ziepe.ca, willy@infradead.org, Xu, Haibo1,
	jhubbard@nvidia.com, quic_svaddagi@quicinc.com, Yamahata, Isaku,
	jthoughton@google.com, will@kernel.org, Wang, Wei W,
	steven.sistare@oracle.com, jarkko@kernel.org,
	quic_pheragu@quicinc.com, chenhuacai@kernel.org, Huang, Kai,
	shuah@kernel.org, dwmw@amazon.co.uk, Peng, Chao P,
	pankaj.gupta@amd.com, nikunj@amd.com, Graf, Alexander,
	viro@zeniv.linux.org.uk, pbonzini@redhat.com,
	yuzenghui@huawei.com, jroedel@suse.de, suzuki.poulose@arm.com,
	jgowans@amazon.com, Xu, Yilun, liam.merwick@oracle.com,
	michael.roth@amd.com, quic_tsoni@quicinc.com,
	richard.weiyang@gmail.com, aou@eecs.berkeley.edu, Li, Xiaoyao,
	kent.overstreet@linux.dev, qperret@google.com,
	dmatlack@google.com, james.morse@arm.com, brauner@kernel.org,
	linux-fsdevel@vger.kernel.org, pgonda@google.com,
	quic_pderrin@quicinc.com, hch@infradead.org, linux-mm@kvack.org,
	seanjc@google.com, roypat@amazon.co.uk

On Tue, 2025-07-22 at 11:17 -0700, Ackerley Tng wrote:
> Sounds like a malicious guest could skip unpinning private memory, and
> guest_memfd's unmap will fail, leading to a KVM_BUG_ON() as Yan/Rick
> suggested here [1].
> 
> Actually it seems like a legacy guest would also lead to unmap failures
> and the KVM_BUG_ON(), since when TDX connect is enabled, the pinning
> mode is enforced, even for non-IO private pages?
> 
> I hope your team's investigations find a good way for the host to
> reclaim memory, at least from dead TDs! Otherwise this would be an open
> hole for guests to leak a host's memory.
> 
> Circling back to the original topic [2], it sounds like we're okay with
> the IOMMU *not* taking any refcounts on pages and instead relying on
> guest_memfd to keep the page around on behalf of the VM?
> 
> [1] https://lore.kernel.org/all/diqzcya13x2j.fsf@ackerleytng-ctop.c.googlers.com/
> [2] https://lore.kernel.org/all/CAGtprH_qh8sEY3s-JucW3n1Wvoq7jdVZDDokvG5HzPf0HV2=pg@mail.gmail.com/

Djbw, Yilun and I had a chat yesterday. We'll investigate a way to have an
operation that can't fail and will allow total cleanup and reclaim for the TD's
resources, as well as a per-TDX module scoped version. 

If host userspace or the guest kernel does something wrong, the guest can be
destroyed in the normal VM case. So we can try to use these operations as a way
to save host kernel complexity for cases like that. But if an error condition
might come up in normal cases (i.e. rare races, non-bugs) we need to look to
other error handling solutions.

We were planning to investigate first and then share back to the list. It
probably deserves broader consideration beyond folks still reading deep down in
this thread.

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-05-14 23:41 ` [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting Ackerley Tng
                     ` (2 preceding siblings ...)
  2025-05-29  5:42   ` Michael Roth
@ 2025-08-01  0:01   ` Yan Zhao
  2025-08-14 21:35     ` Ackerley Tng
  3 siblings, 1 reply; 231+ messages in thread
From: Yan Zhao @ 2025-08-01  0:01 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yilun.xu, yuzenghui,
	zhiquan1.li

On Wed, May 14, 2025 at 04:41:41PM -0700, Ackerley Tng wrote:
> +static enum shareability kvm_gmem_shareability_get(struct inode *inode,
> +						 pgoff_t index)
> +{
> +	struct maple_tree *mt;
> +	void *entry;
> +
> +	mt = &kvm_gmem_private(inode)->shareability;
> +	entry = mtree_load(mt, index);
> +	WARN(!entry,
> +	     "Shareability should always be defined for all indices in inode.");
> +
> +	return xa_to_value(entry);
> +}
> +
Hi Ackerley,

Not sure if it's a known issue. Just want to let you know in case you're unaware.

During a test that repeatedly launches/destroys TDs, I encountered a warning
from kvm_gmem_shareability_get() (see the attached log at the bottom).
The reproduction rate is about 1 in every 20-100 TD launches.


After some analysis, I found that the warning was produced by
kvm_gmem_shareability_get() when it's called from kvm_gmem_is_private(), which
is not protected by any locks.

I can get rid of the warning by either fix 1 or fix 2 below.
(I prefer fix 1 though :))
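
For reference (if I understand the maple tree code correctly), fix 1 is
enough because mtree_load() already runs under rcu_read_lock()
internally; the tree just has to be initialized in RCU mode for such
lockless readers to be safe against concurrent mtree_store_range()
writers. A standalone sketch, reusing enum shareability/SHAREABILITY_ALL
from patch 02:

#include <linux/maple_tree.h>
#include <linux/xarray.h>

static struct maple_tree shareability;

static int shareability_init(unsigned long nr_pages)
{
	/* RCU mode is what makes lockless readers legal. */
	mt_init_flags(&shareability, MT_FLAGS_USE_RCU);
	return mtree_store_range(&shareability, 0, nr_pages - 1,
				 xa_mk_value(SHAREABILITY_ALL), GFP_KERNEL);
}

static enum shareability shareability_read(pgoff_t index)
{
	/*
	 * mtree_load() takes rcu_read_lock() itself; with
	 * MT_FLAGS_USE_RCU set on the tree this should not observe a
	 * half-updated tree and return a spurious NULL, which is what
	 * trips the WARN above.
	 */
	return xa_to_value(mtree_load(&shareability, index));
}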

fix 1:

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index e78fbebf4f53..136d46c5b2ab 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -2024,7 +2024,7 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,

 #ifdef CONFIG_KVM_GMEM_SHARED_MEM
        if (flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED) {
-               mt_init(&private->shareability);
+               mt_init_flags(&private->shareability, MT_FLAGS_USE_RCU);

                err = kvm_gmem_shareability_setup(private, size, flags);
                if (err)


fix 2:
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index e78fbebf4f53..9a4518104d56 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -171,7 +171,9 @@ static enum shareability kvm_gmem_shareability_get(struct inode *inode,
        void *entry;

        mt = &kvm_gmem_private(inode)->shareability;
+       mtree_lock(mt);
        entry = mtree_load(mt, index);
+       mtree_unlock(mt);
        WARN(!entry,
             "Shareability should always be defined for all indices in inode.");


Thanks
Yan

[  845.253021] ------------[ cut here ]------------
[  845.259236] Shareability should always be defined for all indices in inode.
[  845.259273] WARNING: CPU: 148 PID: 3775 at arch/x86/kvm/../../../virt/kvm/guest_memfd.c:175 kvm_gmem_shareability_get.isra.0+0x39/0x50 [kvm]
[  845.283330] Modules linked in: kvm_intel i2c_i801 idxd i2c_smbus i2c_ismt kvm irqbypass nls_iso8859_1 nls_cp437 squashfs ghash_clmulni_intel hid_generic aesni_intel
[  845.300914] CPU: 148 UID: 0 PID: 3775 Comm: qemu-system-x86 Tainted: G S                  6.16.0-rc6-upstream+ #520 PREEMPT(voluntary)  49e4d0c13b52dd8fe7006bbbb80b018c4576ab2d
[  845.319631] Tainted: [S]=CPU_OUT_OF_SPEC
[  845.324956] Hardware name: Intel Corporation ArcherCity/ArcherCity, BIOS EGSDCRB1.SYS.0101.D29.2303301937 03/30/2023
[  845.337749] RIP: 0010:kvm_gmem_shareability_get.isra.0+0x39/0x50 [kvm]
[  845.346085] Code: bf 48 02 00 00 e8 a7 d4 08 d1 48 85 c0 74 09 c9 48 d1 e8 c3 cc cc cc cc 48 89 45 f8 90 48 c7 c7 a0 56 5c c0 e8 68 3c b5 cf 90 <0f> 0b 90 90 48 8b 45 f8 c9 48 d1 e8 c3 cc cc cc cc 66 0f 1f 44 00
[  845.368227] RSP: 0018:ff29e9c2e336baa0 EFLAGS: 00010282
[  845.375038] RAX: 0000000000000000 RBX: 00000000001825d4 RCX: 0000000000000000
[  845.384020] RDX: 0000000000000002 RSI: 0000000000000001 RDI: 00000000ffffffff
[  845.392966] RBP: ff29e9c2e336baa8 R08: 0000000000000000 R09: 0000000000000000
[  845.401912] R10: 0000000000000001 R11: 0000000000000000 R12: ff1236f76e067a80
[  845.410878] R13: ff1236f76e0ecc00 R14: 0000000000000000 R15: ff1236f783af8000
[  845.419850] FS:  00007f8b863fc6c0(0000) GS:ff12370458883000(0000) knlGS:0000000000000000
[  845.429915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  845.437304] CR2: 0000000000000000 CR3: 00000003e9989005 CR4: 0000000000773ef0
[  845.446265] PKRU: 55555554
[  845.450224] Call Trace:
[  845.453887]  <TASK>
[  845.457161]  kvm_gmem_is_private+0x4b/0x70 [kvm 6f655eadf3c2ae71b90b04a3d4ef5b799600c3f8]
[  845.467348]  kvm_mmu_faultin_pfn+0x14a/0x360 [kvm 6f655eadf3c2ae71b90b04a3d4ef5b799600c3f8]
[  845.477740]  kvm_tdp_page_fault+0x97/0xf0 [kvm 6f655eadf3c2ae71b90b04a3d4ef5b799600c3f8]
[  845.487843]  kvm_mmu_do_page_fault+0x23d/0x290 [kvm 6f655eadf3c2ae71b90b04a3d4ef5b799600c3f8]
[  845.505524]  ? __this_cpu_preempt_check+0x13/0x20
[  845.515349]  kvm_mmu_page_fault+0x8c/0x3d0 [kvm 6f655eadf3c2ae71b90b04a3d4ef5b799600c3f8]
[  845.529136]  tdx_handle_ept_violation+0x16a/0x310 [kvm_intel 1efe846cc4054cc289d319f1912cf040ec0ca0e6]
[  845.547760]  tdx_handle_exit+0x44f/0x540 [kvm_intel 1efe846cc4054cc289d319f1912cf040ec0ca0e6]
[  845.565647]  ? lock_acquire+0x52/0x70
[  845.574284]  ? vcpu_enter_guest+0x452/0x11d0 [kvm 6f655eadf3c2ae71b90b04a3d4ef5b799600c3f8]
[  845.591886]  vt_handle_exit+0x25/0x30 [kvm_intel 1efe846cc4054cc289d319f1912cf040ec0ca0e6]
[  845.609407]  vcpu_enter_guest+0x4b1/0x11d0 [kvm 6f655eadf3c2ae71b90b04a3d4ef5b799600c3f8]
[  845.623253]  ? kvm_apic_local_deliver+0x8a/0xe0 [kvm 6f655eadf3c2ae71b90b04a3d4ef5b799600c3f8]
[  845.641247]  vcpu_run+0x4d/0x280 [kvm 6f655eadf3c2ae71b90b04a3d4ef5b799600c3f8]
[  845.654096]  ? vcpu_run+0x4d/0x280 [kvm 6f655eadf3c2ae71b90b04a3d4ef5b799600c3f8]
[  845.667165]  kvm_arch_vcpu_ioctl_run+0x544/0x890 [kvm 6f655eadf3c2ae71b90b04a3d4ef5b799600c3f8]
[  845.685231]  kvm_vcpu_ioctl+0x143/0x7c0 [kvm 6f655eadf3c2ae71b90b04a3d4ef5b799600c3f8]
[  845.698810]  ? __fget_files+0xc2/0x1b0
[  845.707633]  ? __this_cpu_preempt_check+0x13/0x20
[  845.717555]  ? __fget_files+0xcc/0x1b0
[  845.726405]  __x64_sys_ioctl+0x9a/0xf0
[  845.735241]  ? __this_cpu_preempt_check+0x13/0x20
[  845.745163]  x64_sys_call+0x1054/0x20c0
[  845.754043]  do_syscall_64+0xc3/0x470
[  845.762701]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  845.772906] RIP: 0033:0x7f8d9c124ded
[  845.781398] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
[  845.814651] RSP: 002b:00007f8b863f7cd0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  845.827882] RAX: ffffffffffffffda RBX: 00007f8b863fccdc RCX: 00007f8d9c124ded
[  845.840591] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000001e
[  845.853201] RBP: 00007f8b863f7d20 R08: 0000000000000000 R09: 0000000000000000
[  845.865776] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8b863fc6c0
[  845.878246] R13: ffffffffffffdbf0 R14: 0000000000000007 R15: 00007ffedb593c00
[  845.890732]  </TASK>
[  845.897565] irq event stamp: 859157
[  845.905815] hardirqs last  enabled at (859171): [<ffffffff902447d3>] __up_console_sem+0x63/0x90
[  845.923321] hardirqs last disabled at (859184): [<ffffffff902447b8>] __up_console_sem+0x48/0x90
[  845.940892] softirqs last  enabled at (859126): [<ffffffff90194ef8>] handle_softirqs+0x358/0x4b0
[  845.958654] softirqs last disabled at (859207): [<ffffffff901951cf>] __irq_exit_rcu+0xef/0x170
[  845.976232] ---[ end trace 0000000000000000 ]---



^ permalink raw reply related	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-07-03  4:12           ` Michael Roth
  2025-07-03  5:10             ` Vishal Annapurve
@ 2025-08-12  8:23             ` Fuad Tabba
  2025-08-13 17:11               ` Ira Weiny
  1 sibling, 1 reply; 231+ messages in thread
From: Fuad Tabba @ 2025-08-12  8:23 UTC (permalink / raw)
  To: Michael Roth
  Cc: Vishal Annapurve, Ackerley Tng, kvm, linux-mm, linux-kernel, x86,
	linux-fsdevel, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

Hi,

On Thu, 3 Jul 2025 at 05:12, Michael Roth <michael.roth@amd.com> wrote:
>
> On Wed, Jul 02, 2025 at 05:46:23PM -0700, Vishal Annapurve wrote:
> > On Wed, Jul 2, 2025 at 4:25 PM Michael Roth <michael.roth@amd.com> wrote:
> > >
> > > On Wed, Jun 11, 2025 at 02:51:38PM -0700, Ackerley Tng wrote:
> > > > Michael Roth <michael.roth@amd.com> writes:
> > > >
> > > > > On Wed, May 14, 2025 at 04:41:41PM -0700, Ackerley Tng wrote:
> > > > >> Track guest_memfd memory's shareability status within the inode as
> > > > >> opposed to the file, since it is property of the guest_memfd's memory
> > > > >> contents.
> > > > >>
> > > > >> Shareability is a property of the memory and is indexed using the
> > > > >> page's index in the inode. Because shareability is the memory's
> > > > >> property, it is stored within guest_memfd instead of within KVM, like
> > > > >> in kvm->mem_attr_array.
> > > > >>
> > > > >> KVM_MEMORY_ATTRIBUTE_PRIVATE in kvm->mem_attr_array must still be
> > > > >> retained to allow VMs to only use guest_memfd for private memory and
> > > > >> some other memory for shared memory.
> > > > >>
> > > > >> Not all use cases require guest_memfd() to be shared with the host
> > > > >> when first created. Add a new flag, GUEST_MEMFD_FLAG_INIT_PRIVATE,
> > > > >> which when set on KVM_CREATE_GUEST_MEMFD, initializes the memory as
> > > > >> private to the guest, and therefore not mappable by the
> > > > >> host. Otherwise, memory is shared until explicitly converted to
> > > > >> private.
> > > > >>
> > > > >> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > > > >> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> > > > >> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> > > > >> Co-developed-by: Fuad Tabba <tabba@google.com>
> > > > >> Signed-off-by: Fuad Tabba <tabba@google.com>
> > > > >> Change-Id: If03609cbab3ad1564685c85bdba6dcbb6b240c0f
> > > > >> ---
> > > > >>  Documentation/virt/kvm/api.rst |   5 ++
> > > > >>  include/uapi/linux/kvm.h       |   2 +
> > > > >>  virt/kvm/guest_memfd.c         | 124 ++++++++++++++++++++++++++++++++-
> > > > >>  3 files changed, 129 insertions(+), 2 deletions(-)
> > > > >>
> > > > >> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > > >> index 86f74ce7f12a..f609337ae1c2 100644
> > > > >> --- a/Documentation/virt/kvm/api.rst
> > > > >> +++ b/Documentation/virt/kvm/api.rst
> > > > >> @@ -6408,6 +6408,11 @@ belonging to the slot via its userspace_addr.
> > > > >>  The use of GUEST_MEMFD_FLAG_SUPPORT_SHARED will not be allowed for CoCo VMs.
> > > > >>  This is validated when the guest_memfd instance is bound to the VM.
> > > > >>
> > > > >> +If the capability KVM_CAP_GMEM_CONVERSIONS is supported, then the 'flags' field
> > > > >> +supports GUEST_MEMFD_FLAG_INIT_PRIVATE.  Setting GUEST_MEMFD_FLAG_INIT_PRIVATE
> > > > >> +will initialize the memory for the guest_memfd as guest-only and not faultable
> > > > >> +by the host.
> > > > >> +
> > > > >
> > > > > KVM_CAP_GMEM_CONVERSION doesn't get introduced until later, so it seems
> > > > > like this flag should be deferred until that patch is in place. Is it
> > > > > really needed at that point though? Userspace would be able to set the
> > > > > initial state via KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls.
> > > > >
> > > >
> > > > I can move this change to the later patch. Thanks! Will fix in the next
> > > > revision.
> > > >
> > > > > The mtree contents seems to get stored in the same manner in either case so
> > > > > performance-wise only the overhead of a few userspace<->kernel switches
> > > > > would be saved. Are there any other reasons?
> > > > >
> > > > > Otherwise, maybe just settle on SHARED as a documented default (since at
> > > > > least non-CoCo VMs would be able to reliably benefit) and let
> > > > > CoCo/GUEST_MEMFD_FLAG_SUPPORT_SHARED VMs set PRIVATE at whatever
> > > > > granularity makes sense for the architecture/guest configuration.
> > > > >
> > > >
> > > > Because shared pages are split once any memory is allocated, having a
> > > > way to INIT_PRIVATE could avoid the split and then merge on
> > > > conversion. I feel that is enough value to have this config flag, what
> > > > do you think?
> > > >
> > > > I guess we could also have userspace be careful not to do any allocation
> > > > before converting.
>
> (Re-visiting this with the assumption that we *don't* intend to use mmap() to
> populate memory (in which case you can pretty much ignore my previous
> response))
>
> I'm still not sure where the INIT_PRIVATE flag comes into play. For SNP,
> userspace already defaults to marking everything private pretty close to
> guest_memfd creation time, so the potential for allocations to occur
> in-between seems small, but worth confirming.
>
> But I know in the past there was a desire to ensure TDX/SNP could
> support pre-allocating guest_memfd memory (and even pre-faulting via
> KVM_PRE_FAULT_MEMORY), but I think that could still work right? The
> fallocate() handling could still avoid the split if the whole hugepage
> is private, though there is a bit more potential for that fallocate()
> to happen before userspace does the "manually" shared->private
> conversion. I'll double-check on that aspect, but otherwise, is there
> still any other need for it?

It's not just about performance. I think that the need is more a
matter of having a consistent API with the hypervisors guest_memfd is
going to support. Memory in guest_memfd is shared by default, but in
pKVM for example, it's private by default. Therefore, it would be good
to have a way to ensure that all guest_memfd allocations can be made
private from the get-go.
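
For instance, with the flags proposed in this series, a VMM that wants
private-by-default memory could create the guest_memfd as below
(userspace sketch, error handling omitted; the GUEST_MEMFD_* flags and
the conversion ioctl are from this RFC, not from upstream headers):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* vm_fd: from KVM_CREATE_VM; size: guest_memfd size in bytes. */
static int create_private_gmem(int vm_fd, uint64_t size)
{
	struct kvm_create_guest_memfd args = {
		.size  = size,
		.flags = GUEST_MEMFD_FLAG_SUPPORT_SHARED |
			 GUEST_MEMFD_FLAG_INIT_PRIVATE,
	};

	/*
	 * Every offset starts out SHAREABILITY_GUEST (private, not
	 * host-faultable); individual ranges are made shared later via
	 * the KVM_GMEM_CONVERT_SHARED ioctl.
	 */
	return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
}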

Cheers,
/fuad

> > >
> > > I assume we do want to support things like preallocating guest memory so
> > > not sure this approach is feasible to avoid splits.
> > >
> > > But I feel like we might be working around a deeper issue here, which is
> > > that we are pre-emptively splitting anything that *could* be mapped into
> > > userspace (i.e. allocated+shared/mixed), rather than splitting when
> > > necessary.
> > >
> > > I know that was the plan laid out in the guest_memfd calls, but I've run
> > > into a couple instances that have me thinking we should revisit this.
> > >
> > > 1) Some of the recent guest_memfd seems to be gravitating towards having
> > >    userspace populate/initialize guest memory payload prior to boot via
> > >    mmap()'ing the shared guest_memfd pages so things work the same as
> > >    they would for initialized normal VM memory payload (rather than
> > >    relying on back-channels in the kernel to user data into guest_memfd
> > >    pages).
> > >
> > >    When you do this though, for an SNP guest at least, that memory
> > >    acceptance is done in chunks of 4MB (with accept_memory=lazy), and
> > >    because that will put each 1GB page into an allocated+mixed state,
> >
> > I would like your help in understanding why we need to start
> > guest_memfd ranges as shared for SNP guests. guest_memfd ranges being
> > private simply should mean that certain ranges are not faultable by
> > the userspace.
>
> It's seeming like I probably misremembered, but I thought there was a
> discussion on guest_memfd call a month (or so?) ago about whether to
> continue to use backchannels to populate guest_memfd pages prior to
> launch. It was in the context of whether to keep using kvm_gmem_populate()
> for populating guest_memfd pages by copying them in from separate
> userspace buffer vs. simply populating them directly from userspace.
> I thought we were leaning on the latter since it was simpler all-around,
> which is great for SNP since that is already how it populates memory: by
> writing to it from userspace, which kvm_gmem_populate() then copies into
> guest_memfd pages. With shared gmem support, we just skip the latter now
> in the kernel rather than needing changes to how userspace handles things in
> that regard. But maybe that was just wishful thinking :)
>
> But you raise some very compelling points on why this might not be a
> good idea even if that was how that discussion went.
>
> >
> > Will following work?
> > 1) Userspace starts all guest_memfd ranges as private.
> > 2) During early guest boot it starts issuing PSC requests for
> > converting memory from shared to private
> >     -> KVM forwards this request to userspace
> >     -> Userspace checks that the pages are already private and simply
> > does nothing.
> > 3) Pvalidate from guest on that memory will result in guest_memfd
> > offset query which will cause the RMP table entries to actually get
> > populated.
>
> That would work, but there will need to be changes on userspace to deal
> with how SNP populates memory pre-boot just like normal VMs do. We will
> instead need to copy that data into separate buffers, and pass those in
> as the buffer hva instead of the shared hva corresponding to that GPA.
>
> But that seems reasonable if it avoids so many other problems.
>
> >
> > >    we end up splitting every 1GB to 4K and the guest can't even
> > >    accept/PVALIDATE it 2MB at that point even if userspace doesn't touch
> > >    anything in the range. At some point the guest will convert/accept
> > >    the entire range, at which point we could merge, but for SNP we'd
> > >    need guest cooperation to actually use a higher-granularity in stage2
> > >    page tables at that point since RMP entries are effectively all split
> > >    to 4K.
> > >
> > >    I understand the intent is to default to private where this wouldn't
> > >    be an issue, and we could punt to userspace to deal with it, but it
> > >    feels like an artificial restriction to place on userspace. And if we
> > >    do want to allow/expect guest_memfd contents to be initialized pre-boot
> > >    just like normal memory, then userspace would need to jump through
> > >    some hoops:
> > >
> > >    - if defaulting to private: add hooks to convert each range that's being
> > >      modified to a shared state prior to writing to it
> >
> > Why is that a problem?
>
> These were only problems if we went the above-mentioned way of
> populating memory pre-boot via mmap() instead of other backchannels. If
> we don't do that, then both these things cease to be problems. Sounds good
> to me. :)
>
> >
> > >    - if defaulting to shared: initialize memory in-place, then covert
> > >      everything else to private to avoid unecessarily splitting folios
> > >      at run-time
> > >
> > >    It feels like implementations details are bleeding out into the API
> > >    to some degree here (e.g. we'd probably at least need to document
> > >    this so users know how to take proper advantage of hugepage support).
> >
> > Does it make sense to keep the default behavior as INIT_PRIVATE for
> > SNP VMs always even without using hugepages?
>
> Yes!
>
> Though, revisiting discussion around INIT_PRIVATE (without the baggage
> of potentially relying on mmap() to populate memory), I'm still not sure why
> it's needed. I responded in the context of Ackerley's initial reply
> above.
>
> >
> > >
> > > 2) There are some use-cases for HugeTLB + CoCo that have come to my
> > >    attention recently that put a lot of weight on still being able to
> > >    maximize mapping/hugepage size when accessing shared mem from userspace,
> > >    e.g. for certain DPDK workloads that accessed shared guest buffers
> > >    from host userspace. We don't really have a story for this, and I
> > >    wouldn't expect us to at this stage, but I think it ties into #1 so
> > >    might be worth considering in that context.
> >
> > Major problem I see here is that if anything in the kernel does a GUP
> > on shared memory ranges (which is very likely to happen), it would be
> > difficult to get them to let go of the whole hugepage before it can be
> > split safely.
> >
> > Another problem is that guest_memfd today doesn't support management
> > of large userspace page table mappings; this can turn out to be
> > significant work, going by hugetlb's page table management logic.
>
> Yah that was more line-of-sight that might be possible by going this
> route, but the refcount'ing issue above is a showstopper as always. I'd
> somehow convinced myself that supporting fine-grained splitting somehow
> worked around it, but you still have no idea what page you need to avoid
> converting and fancy splitting doesn't get you past that. More wishful
> thinking. =\
>
> Thanks,
>
> Mike
>
> >
> > >
> > > I'm still fine with the current approach as a starting point, but I'm
> > > wondering if improving both #1/#2 might not be so bad and maybe even
> > > give us some more flexibility (for instance, Sean had mentioned leaving
> > > open the option of tracking more than just shareability/mappability, and
> > > if there is split/merge logic associated with those transitions then
> > > re-scanning each of these attributes for a 1G range seems like it could
> > > benefit from some sort of intermediate data structure to help determine
> > > things like what mapping granularity is available for guest/userspace
> > > for a particular range.
> > >
> > > One approach I was thinking of was that we introduce a data structure
> > > similar to KVM's memslot->arch.lpage_info() where we store information
> > > about what 1G/2M ranges are shared/private/mixed, and then instead of
> > > splitting ahead of time we just record that state into this data
> > > structure (using the same write lock as with the
> > > shareability/mappability state), and then at *fault* time we split the
> > > folio if our lpage_info-like data structure says the range is mixed.
> > >
> > > Then, if the guest converts a 2M/4M range to private while lazily-accepting
> > > (for instance), we can still keep the folio intact as 1GB, but mark
> > > the 1G range in the lpage_info-like data structure as mixed so that we
> > > still inform KVM/etc. they need to map it as 2MB or lower in stage2
> > > page tables. In that case, even at guest fault-time, we can leave the
> > > folio unsplit until userspace tries to touch it (though in most cases
> > > it never will and we can keep most of the guest's 1G intact for the
> > > duration of its lifetime).
> > >
> > > On the userspace side, another nice thing there is if we see 1G is in a
> > > mixed state, but 2M is all-shared, then we can still leave the folio as 2M,
> > > and I think the refcount'ing logic would still work for the most part,
> > > which makes #2 a bit easier to implement as well.
> > >
> > > And of course, we wouldn't need the INIT_PRIVATE then since we are only
> > > splitting when necessary.
> > >
> > > But I guess this all comes down to how much extra pain there is in
> > > tracking a 1G folio that's been split into a mix of 2MB/4K regions,
> > > but I think we'd get a lot more mileage out of getting that working and
> > > just completely stripping out all of the merging logic for initial
> > > implementation (other than at cleanup time), so maybe complexity-wise
> > > it balances out a bit?
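
For concreteness, the lpage_info-like tracking described above could
look roughly like the sketch below; every name in it is made up for
illustration and nothing here is from the series:

/*
 * Illustration only: per-level "mixed" tracking for guest_memfd,
 * loosely modelled on KVM's memslot->arch.lpage_info.  A region is
 * mixed when it contains both shared and private sub-ranges; faults
 * then map at a lower level and may split the folio lazily.
 */
struct gmem_lpage_info {
	unsigned long nr_mixed;	/* #sub-ranges with mixed shareability */
};

struct gmem_hpage_tracking {
	struct gmem_lpage_info *info_2m;	/* one entry per 2M region */
	struct gmem_lpage_info *info_1g;	/* one entry per 1G region */
};

/* Largest mapping level usable for @index (a 4K page index into the inode). */
static int gmem_max_mapping_level(struct gmem_hpage_tracking *t, pgoff_t index)
{
	if (!t->info_1g[index >> (30 - PAGE_SHIFT)].nr_mixed)
		return PG_LEVEL_1G;
	if (!t->info_2m[index >> (21 - PAGE_SHIFT)].nr_mixed)
		return PG_LEVEL_2M;
	return PG_LEVEL_4K;
}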
> > >
> > > Thanks,
> > >
> > > Mike
> > >
> > > >
> > > > >>  See KVM_SET_USER_MEMORY_REGION2 for additional details.
> > > > >>
> > > > >>  4.143 KVM_PRE_FAULT_MEMORY
> > > > >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > > > >> index 4cc824a3a7c9..d7df312479aa 100644
> > > > >> --- a/include/uapi/linux/kvm.h
> > > > >> +++ b/include/uapi/linux/kvm.h
> > > > >> @@ -1567,7 +1567,9 @@ struct kvm_memory_attributes {
> > > > >>  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> > > > >>
> > > > >>  #define KVM_CREATE_GUEST_MEMFD    _IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
> > > > >> +
> > > > >>  #define GUEST_MEMFD_FLAG_SUPPORT_SHARED   (1UL << 0)
> > > > >> +#define GUEST_MEMFD_FLAG_INIT_PRIVATE     (1UL << 1)
> > > > >>
> > > > >>  struct kvm_create_guest_memfd {
> > > > >>    __u64 size;
> > > > >> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > > > >> index 239d0f13dcc1..590932499eba 100644
> > > > >> --- a/virt/kvm/guest_memfd.c
> > > > >> +++ b/virt/kvm/guest_memfd.c
> > > > >> @@ -4,6 +4,7 @@
> > > > >>  #include <linux/falloc.h>
> > > > >>  #include <linux/fs.h>
> > > > >>  #include <linux/kvm_host.h>
> > > > >> +#include <linux/maple_tree.h>
> > > > >>  #include <linux/pseudo_fs.h>
> > > > >>  #include <linux/pagemap.h>
> > > > >>
> > > > >> @@ -17,6 +18,24 @@ struct kvm_gmem {
> > > > >>    struct list_head entry;
> > > > >>  };
> > > > >>
> > > > >> +struct kvm_gmem_inode_private {
> > > > >> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > > > >> +  struct maple_tree shareability;
> > > > >> +#endif
> > > > >> +};
> > > > >> +
> > > > >> +enum shareability {
> > > > >> +  SHAREABILITY_GUEST = 1, /* Only the guest can map (fault) folios in this range. */
> > > > >> +  SHAREABILITY_ALL = 2,   /* Both guest and host can fault folios in this range. */
> > > > >> +};
> > > > >> +
> > > > >> +static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index);
> > > > >> +
> > > > >> +static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
> > > > >> +{
> > > > >> +  return inode->i_mapping->i_private_data;
> > > > >> +}
> > > > >> +
> > > > >>  /**
> > > > >>   * folio_file_pfn - like folio_file_page, but return a pfn.
> > > > >>   * @folio: The folio which contains this index.
> > > > >> @@ -29,6 +48,58 @@ static inline kvm_pfn_t folio_file_pfn(struct folio *folio, pgoff_t index)
> > > > >>    return folio_pfn(folio) + (index & (folio_nr_pages(folio) - 1));
> > > > >>  }
> > > > >>
> > > > >> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > > > >> +
> > > > >> +static int kvm_gmem_shareability_setup(struct kvm_gmem_inode_private *private,
> > > > >> +                                loff_t size, u64 flags)
> > > > >> +{
> > > > >> +  enum shareability m;
> > > > >> +  pgoff_t last;
> > > > >> +
> > > > >> +  last = (size >> PAGE_SHIFT) - 1;
> > > > >> +  m = flags & GUEST_MEMFD_FLAG_INIT_PRIVATE ? SHAREABILITY_GUEST :
> > > > >> +                                              SHAREABILITY_ALL;
> > > > >> +  return mtree_store_range(&private->shareability, 0, last, xa_mk_value(m),
> > > > >> +                           GFP_KERNEL);
> > > > >
> > > > > One really nice thing about using a maple tree is that it should get rid
> > > > > of a fairly significant startup delay for SNP/TDX when the entire xarray gets
> > > > > initialized with private attribute entries via KVM_SET_MEMORY_ATTRIBUTES
> > > > > (which is the current QEMU default behavior).
> > > > >
> > > > > I'd originally advocated for sticking with the xarray implementation Fuad was
> > > > > using until we'd determined we really need it for HugeTLB support, but I'm
> > > > > sort of thinking it's already justified just based on the above.
> > > > >
> > > > > Maybe it would make sense for KVM memory attributes too?
> > > > >
> > > > >> +}
> > > > >> +
> > > > >> +static enum shareability kvm_gmem_shareability_get(struct inode *inode,
> > > > >> +                                           pgoff_t index)
> > > > >> +{
> > > > >> +  struct maple_tree *mt;
> > > > >> +  void *entry;
> > > > >> +
> > > > >> +  mt = &kvm_gmem_private(inode)->shareability;
> > > > >> +  entry = mtree_load(mt, index);
> > > > >> +  WARN(!entry,
> > > > >> +       "Shareability should always be defined for all indices in inode.");
> > > > >> +
> > > > >> +  return xa_to_value(entry);
> > > > >> +}
> > > > >> +
> > > > >> +static struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
> > > > >> +{
> > > > >> +  if (kvm_gmem_shareability_get(inode, index) != SHAREABILITY_ALL)
> > > > >> +          return ERR_PTR(-EACCES);
> > > > >> +
> > > > >> +  return kvm_gmem_get_folio(inode, index);
> > > > >> +}
> > > > >> +
> > > > >> +#else
> > > > >> +
> > > > >> +static int kvm_gmem_shareability_setup(struct maple_tree *mt, loff_t size, u64 flags)
> > > > >> +{
> > > > >> +  return 0;
> > > > >> +}
> > > > >> +
> > > > >> +static inline struct folio *kvm_gmem_get_shared_folio(struct inode *inode, pgoff_t index)
> > > > >> +{
> > > > >> +  WARN_ONCE("Unexpected call to get shared folio.")
> > > > >> +  return NULL;
> > > > >> +}
> > > > >> +
> > > > >> +#endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> > > > >> +
> > > > >>  static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > > >>                                pgoff_t index, struct folio *folio)
> > > > >>  {
> > > > >> @@ -333,7 +404,7 @@ static vm_fault_t kvm_gmem_fault_shared(struct vm_fault *vmf)
> > > > >>
> > > > >>    filemap_invalidate_lock_shared(inode->i_mapping);
> > > > >>
> > > > >> -  folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> > > > >> +  folio = kvm_gmem_get_shared_folio(inode, vmf->pgoff);
> > > > >>    if (IS_ERR(folio)) {
> > > > >>            int err = PTR_ERR(folio);
> > > > >>
> > > > >> @@ -420,8 +491,33 @@ static struct file_operations kvm_gmem_fops = {
> > > > >>    .fallocate      = kvm_gmem_fallocate,
> > > > >>  };
> > > > >>
> > > > >> +static void kvm_gmem_free_inode(struct inode *inode)
> > > > >> +{
> > > > >> +  struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
> > > > >> +
> > > > >> +  kfree(private);
> > > > >> +
> > > > >> +  free_inode_nonrcu(inode);
> > > > >> +}
> > > > >> +
> > > > >> +static void kvm_gmem_destroy_inode(struct inode *inode)
> > > > >> +{
> > > > >> +  struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
> > > > >> +
> > > > >> +#ifdef CONFIG_KVM_GMEM_SHARED_MEM
> > > > >> +  /*
> > > > >> +   * mtree_destroy() can't be used within rcu callback, hence can't be
> > > > >> +   * done in ->free_inode().
> > > > >> +   */
> > > > >> +  if (private)
> > > > >> +          mtree_destroy(&private->shareability);
> > > > >> +#endif
> > > > >> +}
> > > > >> +
> > > > >>  static const struct super_operations kvm_gmem_super_operations = {
> > > > >>    .statfs         = simple_statfs,
> > > > >> +  .destroy_inode  = kvm_gmem_destroy_inode,
> > > > >> +  .free_inode     = kvm_gmem_free_inode,
> > > > >>  };
> > > > >>
> > > > >>  static int kvm_gmem_init_fs_context(struct fs_context *fc)
> > > > >> @@ -549,12 +645,26 @@ static const struct inode_operations kvm_gmem_iops = {
> > > > >>  static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
> > > > >>                                                  loff_t size, u64 flags)
> > > > >>  {
> > > > >> +  struct kvm_gmem_inode_private *private;
> > > > >>    struct inode *inode;
> > > > >> +  int err;
> > > > >>
> > > > >>    inode = alloc_anon_secure_inode(kvm_gmem_mnt->mnt_sb, name);
> > > > >>    if (IS_ERR(inode))
> > > > >>            return inode;
> > > > >>
> > > > >> +  err = -ENOMEM;
> > > > >> +  private = kzalloc(sizeof(*private), GFP_KERNEL);
> > > > >> +  if (!private)
> > > > >> +          goto out;
> > > > >> +
> > > > >> +  mt_init(&private->shareability);
> > > > >> +  inode->i_mapping->i_private_data = private;
> > > > >> +
> > > > >> +  err = kvm_gmem_shareability_setup(private, size, flags);
> > > > >> +  if (err)
> > > > >> +          goto out;
> > > > >> +
> > > > >>    inode->i_private = (void *)(unsigned long)flags;
> > > > >>    inode->i_op = &kvm_gmem_iops;
> > > > >>    inode->i_mapping->a_ops = &kvm_gmem_aops;
> > > > >> @@ -566,6 +676,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
> > > > >>    WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> > > > >>
> > > > >>    return inode;
> > > > >> +
> > > > >> +out:
> > > > >> +  iput(inode);
> > > > >> +
> > > > >> +  return ERR_PTR(err);
> > > > >>  }
> > > > >>
> > > > >>  static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
> > > > >> @@ -654,6 +769,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> > > > >>    if (kvm_arch_vm_supports_gmem_shared_mem(kvm))
> > > > >>            valid_flags |= GUEST_MEMFD_FLAG_SUPPORT_SHARED;
> > > > >>
> > > > >> +  if (flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED)
> > > > >> +          valid_flags |= GUEST_MEMFD_FLAG_INIT_PRIVATE;
> > > > >> +
> > > > >>    if (flags & ~valid_flags)
> > > > >>            return -EINVAL;
> > > > >>
> > > > >> @@ -842,6 +960,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > > >>    if (!file)
> > > > >>            return -EFAULT;
> > > > >>
> > > > >> +  filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> > > > >> +
> > > > >
> > > > > I like the idea of using a write-lock/read-lock to protect write/read access
> > > > > to shareability state (though maybe not necessarily re-using filemap's
> > > > > invalidate lock), it's simple and still allows concurrent faulting in of gmem
> > > > > pages. One issue on the SNP side (which also came up in one of the gmem calls)
> > > > > is if we introduce support for tracking preparedness as discussed (e.g. via a
> > > > > new SHAREABILITY_GUEST_PREPARED state) the
> > > > > SHAREABILITY_GUEST->SHAREABILITY_GUEST_PREPARED transition would occur at
> > > > > fault-time, and so would need to take the write-lock and no longer allow for
> > > > > concurrent fault-handling.
> > > > >
> > > > > I was originally planning on introducing a new rw_semaphore with similar
> > > > > semantics to the rw_lock that Fuad previously had in his restricted mmap
> > > > > series[1] (and simiar semantics to filemap invalidate lock here). The main
> > > > > difference, to handle setting SHAREABILITY_GUEST_PREPARED within fault paths,
> > > > > was that in the case of a folio being present for an index, the folio lock would
> > > > > also need to be held in order to update the shareability state. Because
> > > > > of that, fault paths (which will always either have or allocate folio
> > > > > basically) can rely on the folio lock to guard shareability state in a more
> > > > > granular way and so can avoid a global write lock.
> > > > >
> > > > > They would still need to hold the read lock to access the tree however.
> > > > > Or more specifically, any paths that could allocate a folio need to take
> > > > > a read lock so there isn't a TOCTOU situation where shareability is
> > > > > being updated for an index for which a folio hasn't been allocated, but
> > > > > then just afterward the folio gets faulted in/allocated while the
> > > > > shareability state is already being updated, with the understanding that
> > > > > there was no folio around that needed locking.
> > > > >
> > > > > I had a branch with in-place conversion support for SNP[2] that added this
> > > > > lock reworking on top of Fuad's series along with preparation tracking,
> > > > > but I'm now planning to rebase that on top of the patches from this
> > > > > series that Sean mentioned[3] earlier:
> > > > >
> > > > >   KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION
> > > > >   KVM: Query guest_memfd for private/shared status
> > > > >   KVM: guest_memfd: Skip LRU for guest_memfd folios
> > > > >   KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
> > > > >   KVM: guest_memfd: Introduce and use shareability to guard faulting
> > > > >   KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
> > > > >
> > > > > but figured I'd mention it here in case there are other things to consider on
> > > > > the locking front.
> > > > >
> > > > > Definitely agree with Sean though that it would be nice to start identifying a
> > > > > common base of patches for the in-place conversion enablement for SNP, TDX, and
> > > > > pKVM so the APIs/interfaces for hugepages can be handled separately.
> > > > >
> > > > > -Mike
> > > > >
> > > > > [1] https://lore.kernel.org/kvm/20250328153133.3504118-1-tabba@google.com/
> > > > > [2] https://github.com/mdroth/linux/commits/mmap-swprot-v10-snp0-wip2/
> > > > > [3] https://lore.kernel.org/kvm/aC86OsU2HSFZkJP6@google.com/
> > > > >
> > > > >>    folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
> > > > >>    if (IS_ERR(folio)) {
> > > > >>            r = PTR_ERR(folio);
> > > > >> @@ -857,8 +977,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > > >>            *page = folio_file_page(folio, index);
> > > > >>    else
> > > > >>            folio_put(folio);
> > > > >> -
> > > > >>  out:
> > > > >> +  filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
> > > > >>    fput(file);
> > > > >>    return r;
> > > > >>  }
> > > > >> --
> > > > >> 2.49.0.1045.g170613ef41-goog
> > > > >>
> > > >

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 32/51] KVM: guest_memfd: Support guestmem_hugetlb as custom allocator
  2025-05-14 23:42 ` [RFC PATCH v2 32/51] KVM: guest_memfd: Support guestmem_hugetlb as custom allocator Ackerley Tng
  2025-05-23 10:47   ` Yan Zhao
@ 2025-08-12  9:13   ` Tony Lindgren
  1 sibling, 0 replies; 231+ messages in thread
From: Tony Lindgren @ 2025-08-12  9:13 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yan.y.zhao, yilun.xu,
	yuzenghui, zhiquan1.li

Hi,

On Wed, May 14, 2025 at 04:42:11PM -0700, Ackerley Tng wrote:
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -133,3 +133,8 @@ config KVM_GMEM_SHARED_MEM
>         select KVM_GMEM
>         bool
>         prompt "Enables in-place shared memory for guest_memfd"
> +
> +config KVM_GMEM_HUGETLB
> +       select KVM_PRIVATE_MEM
> +       depends on GUESTMEM_HUGETLB
> +       bool "Enables using a custom allocator with guest_memfd, see CONFIG_GUESTMEM_HUGETLB"

For v3, this needs s/KVM_PRIVATE_MEM/KVM_GMEM/ assuming the patches are
based on "KVM: Rename CONFIG_KVM_PRIVATE_MEM to CONFIG_KVM_GMEM".

It's also probably a good idea to run some make randconfig builds on the
series to check the various Kconfig changes and inline functions for
build errors.

Regards,

Tony

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-08-12  8:23             ` Fuad Tabba
@ 2025-08-13 17:11               ` Ira Weiny
  0 siblings, 0 replies; 231+ messages in thread
From: Ira Weiny @ 2025-08-13 17:11 UTC (permalink / raw)
  To: Fuad Tabba, Michael Roth
  Cc: Vishal Annapurve, Ackerley Tng, kvm, linux-mm, linux-kernel, x86,
	linux-fsdevel, aik, ajones, akpm, amoorthy, anthony.yznaga, anup,
	aou, bfoster, binbin.wu, brauner, catalin.marinas, chao.p.peng,
	chenhuacai, dave.hansen, david, dmatlack, dwmw, erdemaktas,
	fan.du, fvdl, graf, haibo1.xu, hch, hughd, ira.weiny,
	isaku.yamahata, jack, james.morse, jarkko, jgg, jgowans, jhubbard,
	jroedel, jthoughton, jun.miao, kai.huang, keirf, kent.overstreet,
	kirill.shutemov, liam.merwick, maciej.wieczor-retman, mail, maz,
	mic, mpe, muchun.song, nikunj, nsaenz, oliver.upton, palmer,
	pankaj.gupta, paul.walmsley, pbonzini, pdurrant, peterx, pgonda,
	pvorel, qperret, quic_cvanscha, quic_eberman, quic_mnalajal,
	quic_pderrin, quic_pheragu, quic_svaddagi, quic_tsoni,
	richard.weiyang, rick.p.edgecombe, rientjes, roypat, rppt, seanjc,
	shuah, steven.price, steven.sistare, suzuki.poulose,
	thomas.lendacky, usama.arif, vbabka, viro, vkuznets, wei.w.wang,
	will, willy, xiaoyao.li, yan.y.zhao, yilun.xu, yuzenghui,
	zhiquan1.li

Fuad Tabba wrote:
> Hi,
> 
> On Thu, 3 Jul 2025 at 05:12, Michael Roth <michael.roth@amd.com> wrote:
> >
> > On Wed, Jul 02, 2025 at 05:46:23PM -0700, Vishal Annapurve wrote:
> > > On Wed, Jul 2, 2025 at 4:25 PM Michael Roth <michael.roth@amd.com> wrote:
> > > >
> > > > On Wed, Jun 11, 2025 at 02:51:38PM -0700, Ackerley Tng wrote:
> > > > > Michael Roth <michael.roth@amd.com> writes:
> > > > >
> > > > > > On Wed, May 14, 2025 at 04:41:41PM -0700, Ackerley Tng wrote:

[snip]

> > > > > > The mtree contents seems to get stored in the same manner in either case so
> > > > > > performance-wise only the overhead of a few userspace<->kernel switches
> > > > > > would be saved. Are there any other reasons?
> > > > > >
> > > > > > Otherwise, maybe just settle on SHARED as a documented default (since at
> > > > > > least non-CoCo VMs would be able to reliably benefit) and let
> > > > > > CoCo/GUEST_MEMFD_FLAG_SUPPORT_SHARED VMs set PRIVATE at whatever
> > > > > > granularity makes sense for the architecture/guest configuration.
> > > > > >
> > > > >
> > > > > Because shared pages are split once any memory is allocated, having a
> > > > > way to INIT_PRIVATE could avoid the split and then merge on
> > > > > conversion. I feel that is enough value to have this config flag, what
> > > > > do you think?
> > > > >
> > > > > I guess we could also have userspace be careful not to do any allocation
> > > > > before converting.
> >
> > (Re-visiting this with the assumption that we *don't* intend to use mmap() to
> > populate memory (in which case you can pretty much ignore my previous
> > response))
> >
> > I'm still not sure where the INIT_PRIVATE flag comes into play. For SNP,
> > userspace already defaults to marking everything private pretty close to
> > guest_memfd creation time, so the potential for allocations to occur
> > in-between seems small, but worth confirming.
> >
> > But I know in the past there was a desire to ensure TDX/SNP could
> > support pre-allocating guest_memfd memory (and even pre-faulting via
> > KVM_PRE_FAULT_MEMORY), but I think that could still work right? The
> > fallocate() handling could still avoid the split if the whole hugepage
> > is private, though there is a bit more potential for that fallocate()
> > to happen before userspace does the "manually" shared->private
> > conversion. I'll double-check on that aspect, but otherwise, is there
> > still any other need for it?
> 
> It's not just about performance. I think that the need is more a
> matter of having a consistent API with the hypervisors guest_memfd is
> going to support. Memory in guest_memfd is shared by default, but in
> pKVM for example, it's private by default. Therefore, it would be good
  ^^^^^^^^^^^^^^^^
And CoCo VMs as well, right?

Ira

> to have a way to ensure that all guest_memfd allocations can be made
> private from the get-go.
> 
> Cheers,
> /fuad
> 

[snip]

^ permalink raw reply	[flat|nested] 231+ messages in thread

* Re: [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting
  2025-08-01  0:01   ` Yan Zhao
@ 2025-08-14 21:35     ` Ackerley Tng
  0 siblings, 0 replies; 231+ messages in thread
From: Ackerley Tng @ 2025-08-14 21:35 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-mm, linux-kernel, x86, linux-fsdevel, aik, ajones,
	akpm, amoorthy, anthony.yznaga, anup, aou, bfoster, binbin.wu,
	brauner, catalin.marinas, chao.p.peng, chenhuacai, dave.hansen,
	david, dmatlack, dwmw, erdemaktas, fan.du, fvdl, graf, haibo1.xu,
	hch, hughd, ira.weiny, isaku.yamahata, jack, james.morse, jarkko,
	jgg, jgowans, jhubbard, jroedel, jthoughton, jun.miao, kai.huang,
	keirf, kent.overstreet, kirill.shutemov, liam.merwick,
	maciej.wieczor-retman, mail, maz, mic, michael.roth, mpe,
	muchun.song, nikunj, nsaenz, oliver.upton, palmer, pankaj.gupta,
	paul.walmsley, pbonzini, pdurrant, peterx, pgonda, pvorel,
	qperret, quic_cvanscha, quic_eberman, quic_mnalajal, quic_pderrin,
	quic_pheragu, quic_svaddagi, quic_tsoni, richard.weiyang,
	rick.p.edgecombe, rientjes, roypat, rppt, seanjc, shuah,
	steven.price, steven.sistare, suzuki.poulose, tabba,
	thomas.lendacky, usama.arif, vannapurve, vbabka, viro, vkuznets,
	wei.w.wang, will, willy, xiaoyao.li, yilun.xu, yuzenghui,
	zhiquan1.li

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Wed, May 14, 2025 at 04:41:41PM -0700, Ackerley Tng wrote:
>> +static enum shareability kvm_gmem_shareability_get(struct inode *inode,
>> +						 pgoff_t index)
>> +{
>> +	struct maple_tree *mt;
>> +	void *entry;
>> +
>> +	mt = &kvm_gmem_private(inode)->shareability;
>> +	entry = mtree_load(mt, index);
>> +	WARN(!entry,
>> +	     "Shareability should always be defined for all indices in inode.");
>> +
>> +	return xa_to_value(entry);
>> +}
>> +
> Hi Ackerley,
>
> Not sure if it's a known issue. Just want to let you know in case you're unaware.
>

Thanks for informing me, and thanks for the analysis :)

> During a test that repeatedly launches/destroys TDs, I encountered a warning
> from kvm_gmem_shareability_get() (see the attached log at the bottom).
> The reproduction rate is about 1 in every 20-100 TD launches.
>
> After some analysis, I found that the warning was produced by
> kvm_gmem_shareability_get() when it's called from kvm_gmem_is_private(), which
> is not protected by any locks.
>
> I can get rid of the warning by either fix 1 or fix 2 below.
> (I prefer fix 1 though :))
>
> fix 1:
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index e78fbebf4f53..136d46c5b2ab 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -2024,7 +2024,7 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>
>  #ifdef CONFIG_KVM_GMEM_SHARED_MEM
>         if (flags & GUEST_MEMFD_FLAG_SUPPORT_SHARED) {
> -               mt_init(&private->shareability);
> +               mt_init_flags(&private->shareability, MT_FLAGS_USE_RCU);
>
>                 err = kvm_gmem_shareability_setup(private, size, flags);
>                 if (err)
>

I'm not sure which version of the conversion patch series you're using,
but in the version I'm preparing, I'm using
filemap_invalidate_lock_shared() to guard shareability reads.
filemap_invalidate_lock() is held during shareability updates, so I
think this issue should be fixed there.
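
Roughly, the read side would look something like the sketch below. This
is only a sketch of the locking scheme, not the actual code in the
upcoming series; kvm_gmem_private() and enum shareability are as defined
in this patch, and callers that already hold the invalidate lock may be
handled differently.

static enum shareability kvm_gmem_shareability_get(struct inode *inode,
						   pgoff_t index)
{
	struct address_space *mapping = inode->i_mapping;
	struct maple_tree *mt;
	void *entry;

	mt = &kvm_gmem_private(inode)->shareability;

	/* Conversions update shareability under filemap_invalidate_lock(). */
	filemap_invalidate_lock_shared(mapping);
	entry = mtree_load(mt, index);
	filemap_invalidate_unlock_shared(mapping);

	WARN(!entry,
	     "Shareability should always be defined for all indices in inode.");

	return xa_to_value(entry);
}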

Please let me know if you're still seeing this issue in the next series
(coming soon). Thank you!

>
> fix 2:
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index e78fbebf4f53..9a4518104d56 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -171,7 +171,9 @@ static enum shareability kvm_gmem_shareability_get(struct inode *inode,
>         void *entry;
>
>         mt = &kvm_gmem_private(inode)->shareability;
> +       mtree_lock(mt);
>         entry = mtree_load(mt, index);
> +       mtree_unlock(mt);
>         WARN(!entry,
>              "Shareability should always be defined for all indices in inode.");
>
>
> Thanks
> Yan
>
> 
> [...snip...]
> 


end of thread (newest message: 2025-08-14 21:35 UTC)

Thread overview: 231+ messages (download: mbox.gz / follow: Atom feed)
2025-05-14 23:41 [RFC PATCH v2 00/51] 1G page support for guest_memfd Ackerley Tng
2025-05-14 23:41 ` [RFC PATCH v2 01/51] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Ackerley Tng
2025-05-14 23:41 ` [RFC PATCH v2 02/51] KVM: guest_memfd: Introduce and use shareability to guard faulting Ackerley Tng
2025-05-27  3:54   ` Yan Zhao
2025-05-29 18:20     ` Ackerley Tng
2025-05-30  8:53     ` Fuad Tabba
2025-05-30 18:32       ` Ackerley Tng
2025-06-02  9:43         ` Fuad Tabba
2025-05-27  8:25   ` Binbin Wu
2025-05-27  8:43     ` Binbin Wu
2025-05-29 18:26     ` Ackerley Tng
2025-05-29 20:37       ` Ackerley Tng
2025-05-29  5:42   ` Michael Roth
2025-06-11 21:51     ` Ackerley Tng
2025-07-02 23:25       ` Michael Roth
2025-07-03  0:46         ` Vishal Annapurve
2025-07-03  0:52           ` Vishal Annapurve
2025-07-03  4:12           ` Michael Roth
2025-07-03  5:10             ` Vishal Annapurve
2025-07-03 20:39               ` Michael Roth
2025-07-07 14:55                 ` Vishal Annapurve
2025-07-12  0:10                   ` Michael Roth
2025-07-12 17:53                     ` Vishal Annapurve
2025-08-12  8:23             ` Fuad Tabba
2025-08-13 17:11               ` Ira Weiny
2025-06-11 22:10     ` Ackerley Tng
2025-08-01  0:01   ` Yan Zhao
2025-08-14 21:35     ` Ackerley Tng
2025-05-14 23:41 ` [RFC PATCH v2 03/51] KVM: selftests: Update guest_memfd_test for INIT_PRIVATE flag Ackerley Tng
2025-05-15 13:49   ` Ira Weiny
2025-05-16 17:42     ` Ackerley Tng
2025-05-16 19:31       ` Ira Weiny
2025-05-27  8:53       ` Binbin Wu
2025-05-30 19:59         ` Ackerley Tng
2025-05-14 23:41 ` [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls Ackerley Tng
2025-05-15 14:50   ` Ira Weiny
2025-05-16 17:53     ` Ackerley Tng
2025-05-20  9:22   ` Fuad Tabba
2025-05-20 13:02     ` Vishal Annapurve
2025-05-20 13:44       ` Fuad Tabba
2025-05-20 14:11         ` Vishal Annapurve
2025-05-20 14:33           ` Fuad Tabba
2025-05-20 16:02             ` Vishal Annapurve
2025-05-20 18:05               ` Fuad Tabba
2025-05-20 19:40                 ` Ackerley Tng
2025-05-21 12:36                   ` Fuad Tabba
2025-05-21 14:42                     ` Vishal Annapurve
2025-05-21 15:21                       ` Fuad Tabba
2025-05-21 15:51                         ` Vishal Annapurve
2025-05-21 18:27                           ` Fuad Tabba
2025-05-22 14:52                             ` Sean Christopherson
2025-05-22 15:07                               ` Fuad Tabba
2025-05-22 16:26                                 ` Sean Christopherson
2025-05-23 10:12                                   ` Fuad Tabba
2025-06-24  8:23           ` Alexey Kardashevskiy
2025-06-24 13:08             ` Jason Gunthorpe
2025-06-24 14:10               ` Vishal Annapurve
2025-06-27  4:49                 ` Alexey Kardashevskiy
2025-06-27 15:17                   ` Vishal Annapurve
2025-06-30  0:19                     ` Alexey Kardashevskiy
2025-06-30 14:19                       ` Vishal Annapurve
2025-07-10  6:57                         ` Alexey Kardashevskiy
2025-07-10 17:58                           ` Jason Gunthorpe
2025-07-02  8:35                 ` Yan Zhao
2025-07-02 13:54                   ` Vishal Annapurve
2025-07-02 14:13                     ` Jason Gunthorpe
2025-07-02 14:32                       ` Vishal Annapurve
2025-07-10 10:50                         ` Xu Yilun
2025-07-10 17:54                           ` Jason Gunthorpe
2025-07-11  4:31                             ` Xu Yilun
2025-07-11  9:33                               ` Xu Yilun
2025-07-16 22:22                   ` Ackerley Tng
2025-07-17  9:32                     ` Xu Yilun
2025-07-17 16:56                       ` Ackerley Tng
2025-07-18  2:48                         ` Xu Yilun
2025-07-18 14:15                           ` Jason Gunthorpe
2025-07-21 14:18                             ` Xu Yilun
2025-07-18 15:13                           ` Ira Weiny
2025-07-21  9:58                             ` Xu Yilun
2025-07-22 18:17                               ` Ackerley Tng
2025-07-22 19:25                                 ` Edgecombe, Rick P
2025-05-28  3:16   ` Binbin Wu
2025-05-30 20:10     ` Ackerley Tng
2025-06-03  0:54       ` Binbin Wu
2025-05-14 23:41 ` [RFC PATCH v2 05/51] KVM: guest_memfd: Skip LRU for guest_memfd folios Ackerley Tng
2025-05-28  7:01   ` Binbin Wu
2025-05-30 20:32     ` Ackerley Tng
2025-05-14 23:41 ` [RFC PATCH v2 06/51] KVM: Query guest_memfd for private/shared status Ackerley Tng
2025-05-27  3:55   ` Yan Zhao
2025-05-28  8:08     ` Binbin Wu
2025-05-28  9:55       ` Yan Zhao
2025-05-14 23:41 ` [RFC PATCH v2 07/51] KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION Ackerley Tng
2025-05-14 23:41 ` [RFC PATCH v2 08/51] KVM: selftests: Test flag validity after guest_memfd supports conversions Ackerley Tng
2025-05-14 23:41 ` [RFC PATCH v2 09/51] KVM: selftests: Test faulting with respect to GUEST_MEMFD_FLAG_INIT_PRIVATE Ackerley Tng
2025-05-14 23:41 ` [RFC PATCH v2 10/51] KVM: selftests: Refactor vm_mem_add to be more flexible Ackerley Tng
2025-05-14 23:41 ` [RFC PATCH v2 11/51] KVM: selftests: Allow cleanup of ucall_pool from host Ackerley Tng
2025-05-14 23:41 ` [RFC PATCH v2 12/51] KVM: selftests: Test conversion flows for guest_memfd Ackerley Tng
2025-05-14 23:41 ` [RFC PATCH v2 13/51] KVM: selftests: Add script to exercise private_mem_conversions_test Ackerley Tng
2025-05-14 23:41 ` [RFC PATCH v2 14/51] KVM: selftests: Update private_mem_conversions_test to mmap guest_memfd Ackerley Tng
2025-05-14 23:41 ` [RFC PATCH v2 15/51] KVM: selftests: Update script to map shared memory from guest_memfd Ackerley Tng
2025-05-14 23:41 ` [RFC PATCH v2 16/51] mm: hugetlb: Consolidate interpretation of gbl_chg within alloc_hugetlb_folio() Ackerley Tng
2025-05-15  2:09   ` Matthew Wilcox
2025-05-28  8:55   ` Binbin Wu
2025-07-07 18:27   ` James Houghton
2025-05-14 23:41 ` [RFC PATCH v2 17/51] mm: hugetlb: Cleanup interpretation of gbl_chg in alloc_hugetlb_folio() Ackerley Tng
2025-05-14 23:41 ` [RFC PATCH v2 18/51] mm: hugetlb: Cleanup interpretation of map_chg_state within alloc_hugetlb_folio() Ackerley Tng
2025-07-07 18:08   ` James Houghton
2025-05-14 23:41 ` [RFC PATCH v2 19/51] mm: hugetlb: Rename alloc_surplus_hugetlb_folio Ackerley Tng
2025-05-14 23:41 ` [RFC PATCH v2 20/51] mm: mempolicy: Refactor out policy_node_nodemask() Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 21/51] mm: hugetlb: Inline huge_node() into callers Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 22/51] mm: hugetlb: Refactor hugetlb allocation functions Ackerley Tng
2025-05-31 23:45   ` Ira Weiny
2025-06-13 22:03     ` Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 23/51] mm: hugetlb: Refactor out hugetlb_alloc_folio() Ackerley Tng
2025-06-01  0:38   ` Ira Weiny
2025-06-13 22:07     ` Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 24/51] mm: hugetlb: Add option to create new subpool without using surplus Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 25/51] mm: truncate: Expose preparation steps for truncate_inode_pages_final Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 26/51] mm: Consolidate freeing of typed folios on final folio_put() Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 27/51] mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages() Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 28/51] mm: Introduce guestmem_hugetlb to support folio_put() handling of guestmem pages Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 29/51] mm: guestmem_hugetlb: Wrap HugeTLB as an allocator for guest_memfd Ackerley Tng
2025-05-16 14:07   ` Ackerley Tng
2025-05-16 20:33     ` Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 30/51] mm: truncate: Expose truncate_inode_folio() Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 31/51] KVM: x86: Set disallow_lpage on base_gfn and guest_memfd pgoff misalignment Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 32/51] KVM: guest_memfd: Support guestmem_hugetlb as custom allocator Ackerley Tng
2025-05-23 10:47   ` Yan Zhao
2025-08-12  9:13   ` Tony Lindgren
2025-05-14 23:42 ` [RFC PATCH v2 33/51] KVM: guest_memfd: Allocate and truncate from " Ackerley Tng
2025-05-21 18:05   ` Vishal Annapurve
2025-05-22 23:12   ` Edgecombe, Rick P
2025-05-28 10:58   ` Yan Zhao
2025-06-03  7:43   ` Binbin Wu
2025-07-16 22:13     ` Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 34/51] mm: hugetlb: Add functions to add/delete folio from hugetlb lists Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 35/51] mm: guestmem_hugetlb: Add support for splitting and merging pages Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 36/51] mm: Convert split_folio() macro to function Ackerley Tng
2025-05-21 16:40   ` Edgecombe, Rick P
2025-05-14 23:42 ` [RFC PATCH v2 37/51] filemap: Pass address_space mapping to ->free_folio() Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 38/51] KVM: guest_memfd: Split allocator pages for guest_memfd use Ackerley Tng
2025-05-22 22:19   ` Edgecombe, Rick P
2025-06-05 17:15     ` Ackerley Tng
2025-06-05 17:53       ` Edgecombe, Rick P
2025-06-05 17:15     ` Ackerley Tng
2025-06-05 17:16     ` Ackerley Tng
2025-06-05 17:16     ` Ackerley Tng
2025-06-05 17:16     ` Ackerley Tng
2025-05-27  4:30   ` Yan Zhao
2025-05-27  4:38     ` Yan Zhao
2025-06-05 17:50     ` Ackerley Tng
2025-05-27  8:45   ` Yan Zhao
2025-06-05 19:10     ` Ackerley Tng
2025-06-16 11:15       ` Yan Zhao
2025-06-05  5:24   ` Binbin Wu
2025-06-05 19:16     ` Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 39/51] KVM: guest_memfd: Merge and truncate on fallocate(PUNCH_HOLE) Ackerley Tng
2025-05-28 11:00   ` Yan Zhao
2025-05-28 16:39     ` Ackerley Tng
2025-05-29  3:26       ` Yan Zhao
2025-05-14 23:42 ` [RFC PATCH v2 40/51] KVM: guest_memfd: Update kvm_gmem_mapping_order to account for page status Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 41/51] KVM: Add CAP to indicate support for HugeTLB as custom allocator Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 42/51] KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 43/51] KVM: selftests: Update conversion flows test for HugeTLB Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 44/51] KVM: selftests: Test truncation paths of guest_memfd Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 45/51] KVM: selftests: Test allocation and conversion of subfolios Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 46/51] KVM: selftests: Test that guest_memfd usage is reported via hugetlb Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 47/51] KVM: selftests: Support various types of backing sources for private memory Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 48/51] KVM: selftests: Update test for various private memory backing source types Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 49/51] KVM: selftests: Update private_mem_conversions_test.sh to test with HugeTLB pages Ackerley Tng
2025-05-14 23:42 ` [RFC PATCH v2 50/51] KVM: selftests: Add script to test HugeTLB statistics Ackerley Tng
2025-05-15 18:03 ` [RFC PATCH v2 00/51] 1G page support for guest_memfd Edgecombe, Rick P
2025-05-15 18:42   ` Vishal Annapurve
2025-05-15 23:35     ` Edgecombe, Rick P
2025-05-16  0:57       ` Sean Christopherson
2025-05-16  2:12         ` Edgecombe, Rick P
2025-05-16 13:11           ` Vishal Annapurve
2025-05-16 16:45             ` Edgecombe, Rick P
2025-05-16 17:51               ` Sean Christopherson
2025-05-16 19:14                 ` Edgecombe, Rick P
2025-05-16 20:25                   ` Dave Hansen
2025-05-16 21:42                     ` Edgecombe, Rick P
2025-05-16 17:45             ` Sean Christopherson
2025-05-16 13:09         ` Jason Gunthorpe
2025-05-16 17:04           ` Edgecombe, Rick P
2025-05-16  0:22 ` [RFC PATCH v2 51/51] KVM: selftests: Test guest_memfd for accuracy of st_blocks Ackerley Tng
2025-05-16 19:48 ` [RFC PATCH v2 00/51] 1G page support for guest_memfd Ira Weiny
2025-05-16 19:59   ` Ira Weiny
2025-05-16 20:26     ` Ackerley Tng
2025-05-16 22:43 ` Ackerley Tng
2025-06-19  8:13 ` Yan Zhao
2025-06-19  8:59   ` Xiaoyao Li
2025-06-19  9:18     ` Xiaoyao Li
2025-06-19  9:28       ` Yan Zhao
2025-06-19  9:45         ` Xiaoyao Li
2025-06-19  9:49           ` Xiaoyao Li
2025-06-29 18:28     ` Vishal Annapurve
2025-06-30  3:14       ` Yan Zhao
2025-06-30 14:14         ` Vishal Annapurve
2025-07-01  5:23           ` Yan Zhao
2025-07-01 19:48             ` Vishal Annapurve
2025-07-07 23:25               ` Sean Christopherson
2025-07-08  0:14                 ` Vishal Annapurve
2025-07-08  1:08                   ` Edgecombe, Rick P
2025-07-08 14:20                     ` Sean Christopherson
2025-07-08 14:52                       ` Edgecombe, Rick P
2025-07-08 15:07                         ` Vishal Annapurve
2025-07-08 15:31                           ` Edgecombe, Rick P
2025-07-08 17:16                             ` Vishal Annapurve
2025-07-08 17:39                               ` Edgecombe, Rick P
2025-07-08 18:03                                 ` Sean Christopherson
2025-07-08 18:13                                   ` Edgecombe, Rick P
2025-07-08 18:55                                     ` Sean Christopherson
2025-07-08 21:23                                       ` Edgecombe, Rick P
2025-07-09 14:28                                       ` Vishal Annapurve
2025-07-09 15:00                                         ` Sean Christopherson
2025-07-10  1:30                                           ` Vishal Annapurve
2025-07-10 23:33                                             ` Sean Christopherson
2025-07-11 21:18                                             ` Vishal Annapurve
2025-07-12 17:33                                               ` Vishal Annapurve
2025-07-09 15:17                                         ` Edgecombe, Rick P
2025-07-10  3:39                                           ` Vishal Annapurve
2025-07-08 19:28                                   ` Vishal Annapurve
2025-07-08 19:58                                     ` Sean Christopherson
2025-07-08 22:54                                       ` Vishal Annapurve
2025-07-08 15:38                           ` Sean Christopherson
2025-07-08 16:22                             ` Fuad Tabba
2025-07-08 17:25                               ` Sean Christopherson
2025-07-08 18:37                                 ` Fuad Tabba
2025-07-16 23:06                                   ` Ackerley Tng
2025-06-26 23:19 ` Ackerley Tng
