[RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest

linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd
@ 2024-11-05 16:45 Shivank Garg
  2024-11-05 16:45 ` [RFC PATCH 1/4] mm: Add mempolicy support to the filemap layer Shivank Garg
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Shivank Garg @ 2024-11-05 16:45 UTC (permalink / raw)
  To: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
	linux-mm, linux-api, linux-arch, kvm
  Cc: chao.gao, pgonda, thomas.lendacky, seanjc, luto, tglx, mingo, bp,
	dave.hansen, willy, arnd, pbonzini, kees, shivankg, bharata,
	nikunj, michael.day, Neeraj.Upadhyay

This patch series introduces fbind() syscall to support NUMA memory
policies for KVM guest_memfd, allowing VMMs to configure memory placement
for guest memory. This addresses the current limitation where guest_memfd
allocations ignore NUMA policies, potentially impacting performance of
memory-locality-sensitive workloads.

Currently, while mbind() enables NUMA policy support for userspace
applications, it cannot be used with guest_memfd as the memory isn't mapped
to userspace. This results in guest memory being allocated randomly across
host NUMA nodes, even when specific policies and node preferences are
specified in QEMU commands.

The fbind() syscall is particularly useful for SEV-SNP guests.

Following suggestions from LPC and review comment [1], I switched from an
IOCTL-based approach [2] to fbind() [3]. The fbind() approach is preferred
as it provides a generic NUMA policy configuration working with any fd,
rather than being tied to guest-memfd.

Testing:
QEMU tree- https://github.com/AMDESE/qemu/tree/guest_memfd_fbind_NUMA
Based upon
Github tree- https://github.com/AMDESE/linux/tree/snp-host-latest
Branch: snp-host-latest commit: 85ef1ac

Example command to run a SEV-SNP guest bound to NUMA Node 0 of the host:
$ qemu-system-x86_64 \
   -enable-kvm \
  ...
   -machine memory-encryption=sev0,vmport=off \
   -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1 \
   -numa node,nodeid=0,memdev=ram0,cpus=0-15 \
   -object memory-backend-memfd,id=ram0,policy=bind,host-nodes=0,size=1024M,share=true,prealloc=false

[1]: https://lore.kernel.org/linux-mm/ZvEga7srKhympQBt@intel.com/
[2]: https://lore.kernel.org/linux-mm/20240919094438.10987-1-shivankg@amd.com
[3]: https://lore.kernel.org/kvm/ZOjpIL0SFH+E3Dj4@google.com/


Shivank Garg (3):
  Introduce fbind syscall
  KVM: guest_memfd: Pass file pointer instead of inode in guest_memfd
    APIs
  KVM: guest_memfd: Enforce NUMA mempolicy if available

Shivansh Dhiman (1):
  mm: Add mempolicy support to the filemap layer

 arch/x86/entry/syscalls/syscall_32.tbl |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 include/linux/fs.h                     |  3 ++
 include/linux/mempolicy.h              |  3 ++
 include/linux/pagemap.h                | 40 ++++++++++++++++
 include/linux/syscalls.h               |  3 ++
 include/uapi/asm-generic/unistd.h      |  5 +-
 kernel/sys_ni.c                        |  1 +
 mm/Makefile                            |  2 +-
 mm/fbind.c                             | 49 +++++++++++++++++++
 mm/filemap.c                           | 30 ++++++++++--
 mm/mempolicy.c                         | 57 ++++++++++++++++++++++
 virt/kvm/guest_memfd.c                 | 66 +++++++++++++++++++++-----
 13 files changed, 242 insertions(+), 19 deletions(-)
 create mode 100644 mm/fbind.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC PATCH 1/4] mm: Add mempolicy support to the filemap layer
  2024-11-05 16:45 [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd Shivank Garg
@ 2024-11-05 16:45 ` Shivank Garg
  2024-11-05 16:55 ` [RFC PATCH 2/4] Introduce fbind syscall Shivank Garg
  2024-11-05 18:55 ` [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd Matthew Wilcox
  2 siblings, 0 replies; 12+ messages in thread
From: Shivank Garg @ 2024-11-05 16:45 UTC (permalink / raw)
  To: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
	linux-mm, linux-api, linux-arch, kvm
  Cc: chao.gao, pgonda, thomas.lendacky, seanjc, luto, tglx, mingo, bp,
	dave.hansen, willy, arnd, pbonzini, kees, shivankg, bharata,
	nikunj, michael.day, Neeraj.Upadhyay, Shivansh Dhiman

From: Shivansh Dhiman <shivansh.dhiman@amd.com>

Introduce mempolicy support to the filemap. Add filemap_grab_folio_mpol,
filemap_alloc_folio_mpol_noprof() and __filemap_get_folio_mpol() APIs that
take mempolicy struct as an argument.

The API is required by VMs using KVM guest-memfd memory backends for NUMA
mempolicy aware allocations.

Signed-off-by: Shivansh Dhiman <shivansh.dhiman@amd.com>
Signed-off-by: Shivank Garg <shivankg@amd.com>
---
 include/linux/pagemap.h | 40 ++++++++++++++++++++++++++++++++++++++++
 mm/filemap.c            | 30 +++++++++++++++++++++++++-----
 2 files changed, 65 insertions(+), 5 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d9c7edb6422b..b05b696f310b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -564,15 +564,25 @@ static inline void *detach_page_private(struct page *page)
 
 #ifdef CONFIG_NUMA
 struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order);
+struct folio *filemap_alloc_folio_mpol_noprof(gfp_t gfp, unsigned int order,
+						struct mempolicy *mpol);
 #else
 static inline struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
 {
 	return folio_alloc_noprof(gfp, order);
 }
+static inline struct folio *filemap_alloc_folio_mpol_noprof(gfp_t gfp,
+						unsigned int order,
+						struct mempolicy *mpol)
+{
+	return filemap_alloc_folio_noprof(gfp, order);
+}
 #endif
 
 #define filemap_alloc_folio(...)				\
 	alloc_hooks(filemap_alloc_folio_noprof(__VA_ARGS__))
+#define filemap_alloc_folio_mpol(...)				\
+	alloc_hooks(filemap_alloc_folio_mpol_noprof(__VA_ARGS__))
 
 static inline struct page *__page_cache_alloc(gfp_t gfp)
 {
@@ -652,6 +662,8 @@ static inline fgf_t fgf_set_order(size_t size)
 void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		fgf_t fgp_flags, gfp_t gfp);
+struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
+		pgoff_t index, fgf_t fgp_flags, gfp_t gfp, struct mempolicy *mpol);
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
 		fgf_t fgp_flags, gfp_t gfp);
 
@@ -710,6 +722,34 @@ static inline struct folio *filemap_grab_folio(struct address_space *mapping,
 			mapping_gfp_mask(mapping));
 }
 
+/**
+ * filemap_grab_folio_mpol - grab a folio from the page cache
+ * @mapping: The address space to search
+ * @index: The page index
+ * @mpol: The mempolicy to apply
+ *
+ * Same as filemap_grab_folio(), except that it allocates the folio using
+ * given memory policy.
+ *
+ * Return: A found or created folio. ERR_PTR(-ENOMEM) if no folio is found
+ * and failed to create a folio.
+ */
+#ifdef CONFIG_NUMA
+static inline struct folio *filemap_grab_folio_mpol(struct address_space *mapping,
+					pgoff_t index, struct mempolicy *mpol)
+{
+	return __filemap_get_folio_mpol(mapping, index,
+			FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
+			mapping_gfp_mask(mapping), mpol);
+}
+#else
+static inline struct folio *filemap_grab_folio_mpol(struct address_space *mapping,
+					pgoff_t index, struct mempolicy *mpol)
+{
+	return filemap_grab_folio(mapping, index);
+}
+#endif /* CONFIG_NUMA */
+
 /**
  * find_get_page - find and get a page reference
  * @mapping: the address_space to search
diff --git a/mm/filemap.c b/mm/filemap.c
index d62150418b91..a870a05296c8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -990,8 +990,13 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
 EXPORT_SYMBOL_GPL(filemap_add_folio);
 
 #ifdef CONFIG_NUMA
-struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
+struct folio *filemap_alloc_folio_mpol_noprof(gfp_t gfp, unsigned int order,
+			struct mempolicy *mpol)
 {
+	if (mpol)
+		return folio_alloc_mpol_noprof(gfp, order, mpol,
+				NO_INTERLEAVE_INDEX, numa_node_id());
+
 	int n;
 	struct folio *folio;
 
@@ -1007,6 +1012,12 @@ struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
 	}
 	return folio_alloc_noprof(gfp, order);
 }
+EXPORT_SYMBOL(filemap_alloc_folio_mpol_noprof);
+
+struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
+{
+	return filemap_alloc_folio_mpol_noprof(gfp, order, NULL);
+}
 EXPORT_SYMBOL(filemap_alloc_folio_noprof);
 #endif
 
@@ -1861,11 +1872,12 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
 }
 
 /**
- * __filemap_get_folio - Find and get a reference to a folio.
+ * __filemap_get_folio_mpol - Find and get a reference to a folio.
  * @mapping: The address_space to search.
  * @index: The page index.
  * @fgp_flags: %FGP flags modify how the folio is returned.
  * @gfp: Memory allocation flags to use if %FGP_CREAT is specified.
+ * @mpol: The mempolicy to apply.
  *
  * Looks up the page cache entry at @mapping & @index.
  *
@@ -1876,8 +1888,8 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
  *
  * Return: The found folio or an ERR_PTR() otherwise.
  */
-struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
-		fgf_t fgp_flags, gfp_t gfp)
+struct folio *__filemap_get_folio_mpol(struct address_space *mapping, pgoff_t index,
+		fgf_t fgp_flags, gfp_t gfp, struct mempolicy *mpol)
 {
 	struct folio *folio;
 
@@ -1947,7 +1959,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			err = -ENOMEM;
 			if (order > 0)
 				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
-			folio = filemap_alloc_folio(alloc_gfp, order);
+			folio = filemap_alloc_folio_mpol(alloc_gfp, order, mpol);
 			if (!folio)
 				continue;
 
@@ -1978,6 +1990,14 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		return ERR_PTR(-ENOENT);
 	return folio;
 }
+EXPORT_SYMBOL(__filemap_get_folio_mpol);
+
+struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
+		fgf_t fgp_flags, gfp_t gfp)
+{
+	return __filemap_get_folio_mpol(mapping, index,
+			fgp_flags, gfp, NULL);
+}
 EXPORT_SYMBOL(__filemap_get_folio);
 
 static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH 2/4] Introduce fbind syscall
  2024-11-05 16:45 [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd Shivank Garg
  2024-11-05 16:45 ` [RFC PATCH 1/4] mm: Add mempolicy support to the filemap layer Shivank Garg
@ 2024-11-05 16:55 ` Shivank Garg
  2024-11-05 16:55   ` [RFC PATCH 3/4] KVM: guest_memfd: Pass file pointer instead of inode in guest_memfd APIs Shivank Garg
  2024-11-05 16:55   ` [RFC PATCH 4/4] KVM: guest_memfd: Enforce NUMA mempolicy if available Shivank Garg
  2024-11-05 18:55 ` [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd Matthew Wilcox
  2 siblings, 2 replies; 12+ messages in thread
From: Shivank Garg @ 2024-11-05 16:55 UTC (permalink / raw)
  To: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
	linux-mm, linux-api, linux-arch, kvm
  Cc: chao.gao, pgonda, thomas.lendacky, seanjc, luto, tglx, mingo, bp,
	dave.hansen, willy, arnd, pbonzini, kees, shivankg, bharata,
	nikunj, michael.day, Neeraj.Upadhyay, Shivansh Dhiman

The new fbind syscall sets the NUMA memory policy for file-backed memory
and has following signature:

long fbind(unsigned int fd, unsigned long mode,
                const unsigned long nodemask[(.maxnode + ULONG_WIDTH - 1)
                                               / ULONG_WIDTH],
                unsigned long maxnode, unsigned int flags);

fbind behaves similar to mbind except that it takes file descriptor as
input instead of address ranges.

TODO:
1. Support fbind syscall on all architectures.
2. Expand commit msg and add documentation.
3. clean-up the code.

[Shivansh: add create_mpol_from_args()]
Signed-off-by: Shivansh Dhiman <shivansh.dhiman@amd.com>
Signed-off-by: Shivank Garg <shivankg@amd.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 include/linux/fs.h                     |  3 ++
 include/linux/mempolicy.h              |  3 ++
 include/linux/syscalls.h               |  3 ++
 include/uapi/asm-generic/unistd.h      |  5 ++-
 kernel/sys_ni.c                        |  1 +
 mm/Makefile                            |  2 +-
 mm/fbind.c                             | 49 +++++++++++++++++++++++
 mm/mempolicy.c                         | 55 ++++++++++++++++++++++++++
 10 files changed, 121 insertions(+), 2 deletions(-)
 create mode 100644 mm/fbind.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 534c74b14fab..0660ce6d08d8 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -468,3 +468,4 @@
 460	i386	lsm_set_self_attr	sys_lsm_set_self_attr
 461	i386	lsm_list_modules	sys_lsm_list_modules
 462	i386	mseal 			sys_mseal
+463	i386	fbind			sys_fbind
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 7093ee21c0d1..9794347cc2e6 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -386,6 +386,7 @@
 460	common	lsm_set_self_attr	sys_lsm_set_self_attr
 461	common	lsm_list_modules	sys_lsm_list_modules
 462 	common  mseal			sys_mseal
+463	common	fbind			sys_fbind
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd34b5755c0b..42042b62bdcd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2058,6 +2058,9 @@ struct file_operations {
 				   struct file *file_out, loff_t pos_out,
 				   loff_t len, unsigned int remap_flags);
 	int (*fadvise)(struct file *, loff_t, loff_t, int);
+#ifdef CONFIG_NUMA
+	int (*set_policy)(struct file *, struct mempolicy *);
+#endif
 	int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
 	int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
 				unsigned int poll_flags);
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 1add16f21612..b9023f6246a7 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -299,4 +299,7 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
 }
 
 #endif /* CONFIG_NUMA */
+struct mempolicy *create_mpol_from_args(unsigned char mode,
+					const unsigned long __user *nmask,
+					unsigned short maxnode);
 #endif
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 4bcf6754738d..2dc686921b9f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -502,6 +502,9 @@ asmlinkage long sys_readlinkat(int dfd, const char __user *path, char __user *bu
 asmlinkage long sys_newfstatat(int dfd, const char __user *filename,
 			       struct stat __user *statbuf, int flag);
 asmlinkage long sys_newfstat(unsigned int fd, struct stat __user *statbuf);
+asmlinkage long sys_fbind(unsigned int fd, unsigned long mode,
+			const unsigned long __user *nmask,
+			unsigned long maxnode, unsigned int flags);
 #if defined(__ARCH_WANT_STAT64) || defined(__ARCH_WANT_COMPAT_STAT64)
 asmlinkage long sys_fstat64(unsigned long fd, struct stat64 __user *statbuf);
 asmlinkage long sys_fstatat64(int dfd, const char __user *filename,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 5bf6148cac2b..550730f36dae 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -841,8 +841,11 @@ __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)
 #define __NR_mseal 462
 __SYSCALL(__NR_mseal, sys_mseal)
 
+#define __NR_fbind 463
+__SYSCALL(__NR_fbind, sys_fbind)
+
 #undef __NR_syscalls
-#define __NR_syscalls 463
+#define __NR_syscalls 464
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index c00a86931f8c..f57350e581f6 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -195,6 +195,7 @@ COND_SYSCALL(move_pages);
 COND_SYSCALL(set_mempolicy_home_node);
 COND_SYSCALL(cachestat);
 COND_SYSCALL(mseal);
+COND_SYSCALL(fbind);
 
 COND_SYSCALL(perf_event_open);
 COND_SYSCALL(accept4);
diff --git a/mm/Makefile b/mm/Makefile
index d2915f8c9dc0..ba339ddc0be2 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -79,7 +79,7 @@ obj-$(CONFIG_ZSWAP)	+= zswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
 obj-$(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP)	+= hugetlb_vmemmap.o
-obj-$(CONFIG_NUMA) 	+= mempolicy.o
+obj-$(CONFIG_NUMA) 	+= mempolicy.o fbind.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
diff --git a/mm/fbind.c b/mm/fbind.c
new file mode 100644
index 000000000000..85ec7d13345c
--- /dev/null
+++ b/mm/fbind.c
@@ -0,0 +1,49 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  Implement fbind() syscall.
+ *
+ *  Copyright (c) 2024 AMD
+ *
+ *  Author: Shivank Garg <shivankg@amd.com>
+ */
+
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/mempolicy.h>
+#include <linux/syscalls.h>
+
+static long do_fbind(unsigned int fd, unsigned long mode,
+		const unsigned long __user *nmask,
+		unsigned long maxnode, unsigned int flags)
+{
+	struct mempolicy *mpol;
+	struct fd f;
+	int ret;
+
+	f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+
+	mpol = create_mpol_from_args(mode, nmask, maxnode);
+	if (IS_ERR_OR_NULL(mpol)) {
+		ret = PTR_ERR(mpol);
+		goto out_putf;
+	}
+
+	if (f.file->f_op->set_policy)
+		ret = f.file->f_op->set_policy(f.file, mpol);
+	else
+		ret = -EOPNOTSUPP;
+
+	mpol_put(mpol);
+out_putf:
+	fdput(f);
+	return ret;
+}
+
+SYSCALL_DEFINE5(fbind, unsigned int, fd, unsigned long, mode,
+		const unsigned long __user *, nmask,
+		unsigned long, maxnode, unsigned int, flags)
+{
+	return do_fbind(fd, mode, nmask, maxnode, flags);
+}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b858e22b259d..3a697080ecad 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -3557,3 +3557,58 @@ static int __init mempolicy_sysfs_init(void)
 
 late_initcall(mempolicy_sysfs_init);
 #endif /* CONFIG_SYSFS */
+
+/**
+ * create_mpol_from_args - create a mempolicy structure from args
+ * @mode:  NUMA memory policy mode
+ * @nmask:  bitmask of NUMA nodes
+ * @maxnode:  number of bits in the nodes bitmask
+ *
+ * Create a mempolicy from given nodemask and memory policy such as
+ * default, preferred, interleave or bind.
+ *
+ * Return: error encoded in a pointer or memory policy on success.
+ */
+struct mempolicy *create_mpol_from_args(unsigned char mode,
+					const unsigned long __user *nmask,
+					unsigned short maxnode)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned short mode_flags;
+	struct mempolicy *mpol;
+	nodemask_t nodes;
+	int lmode = mode;
+	int err = -ENOMEM;
+
+	err = sanitize_mpol_flags(&lmode, &mode_flags);
+	if (err)
+		return ERR_PTR(err);
+
+	err = get_nodes(&nodes, nmask, maxnode);
+	if (err)
+		return ERR_PTR(err);
+
+	mpol = mpol_new(mode, mode_flags, &nodes);
+	if (IS_ERR_OR_NULL(mpol))
+		return mpol;
+
+	NODEMASK_SCRATCH(scratch);
+	if (!scratch) {
+		err = -ENOMEM;
+		goto err_out;
+	}
+
+	mmap_write_lock(mm);
+	err = mpol_set_nodemask(mpol, &nodes, scratch);
+	mmap_write_unlock(mm);
+	NODEMASK_SCRATCH_FREE(scratch);
+
+	if (err)
+		goto err_out;
+
+	return mpol;
+
+err_out:
+	mpol_put(mpol);
+	return ERR_PTR(err);
+}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH 3/4] KVM: guest_memfd: Pass file pointer instead of inode in guest_memfd APIs
  2024-11-05 16:55 ` [RFC PATCH 2/4] Introduce fbind syscall Shivank Garg
@ 2024-11-05 16:55   ` Shivank Garg
  2024-11-05 16:55   ` [RFC PATCH 4/4] KVM: guest_memfd: Enforce NUMA mempolicy if available Shivank Garg
  1 sibling, 0 replies; 12+ messages in thread
From: Shivank Garg @ 2024-11-05 16:55 UTC (permalink / raw)
  To: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
	linux-mm, linux-api, linux-arch, kvm
  Cc: chao.gao, pgonda, thomas.lendacky, seanjc, luto, tglx, mingo, bp,
	dave.hansen, willy, arnd, pbonzini, kees, shivankg, bharata,
	nikunj, michael.day, Neeraj.Upadhyay

Change the KVM guest_memfd APIs to pass file pointers instead of
inodes in the folio allocation functions. This is preparatory patch
for adding NUMA support to guest memory allocations.
The functional behavior remains unchanged.

Signed-off-by: Shivank Garg <shivankg@amd.com>
---
 virt/kvm/guest_memfd.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index e930014b4bdc..2c6fcf7c3ec9 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -91,7 +91,7 @@ static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index,
 {
 	pgoff_t npages = 1UL << order;
 	pgoff_t huge_index = round_down(index, npages);
-	struct address_space *mapping  = inode->i_mapping;
+	struct address_space *mapping = inode->i_mapping;
 	gfp_t gfp = mapping_gfp_mask(mapping) | __GFP_NOWARN;
 	loff_t size = i_size_read(inode);
 	struct folio *folio;
@@ -125,16 +125,16 @@ static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index,
  * Ignore accessed, referenced, and dirty flags.  The memory is
  * unevictable and there is no storage to write back to.
  */
-static struct folio *__kvm_gmem_get_folio(struct inode *inode, pgoff_t index,
+static struct folio *__kvm_gmem_get_folio(struct file *file, pgoff_t index,
 					  bool allow_huge)
 {
 	struct folio *folio = NULL;
 
 	if (gmem_2m_enabled && allow_huge)
-		folio = kvm_gmem_get_huge_folio(inode, index, PMD_ORDER);
+		folio = kvm_gmem_get_huge_folio(file_inode(file), index, PMD_ORDER);
 
 	if (!folio)
-		folio = filemap_grab_folio(inode->i_mapping, index);
+		folio = filemap_grab_folio(file_inode(file)->i_mapping, index);
 
 	pr_debug("%s: allocate folio with PFN %lx order %d\n",
 		 __func__, folio_pfn(folio), folio_order(folio));
@@ -150,9 +150,9 @@ static struct folio *__kvm_gmem_get_folio(struct inode *inode, pgoff_t index,
  * Ignore accessed, referenced, and dirty flags.  The memory is
  * unevictable and there is no storage to write back to.
  */
-static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
+static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index)
 {
-	return __kvm_gmem_get_folio(inode, index, true);
+	return __kvm_gmem_get_folio(file, index, true);
 }
 
 static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
@@ -228,8 +228,9 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	return 0;
 }
 
-static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
+static long kvm_gmem_allocate(struct file *file, loff_t offset, loff_t len)
 {
+	struct inode *inode = file_inode(file);
 	struct address_space *mapping = inode->i_mapping;
 	pgoff_t start, index, end;
 	int r;
@@ -252,7 +253,7 @@ static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
 			break;
 		}
 
-		folio = kvm_gmem_get_folio(inode, index);
+		folio = kvm_gmem_get_folio(file, index);
 		if (IS_ERR(folio)) {
 			r = PTR_ERR(folio);
 			break;
@@ -292,7 +293,7 @@ static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
 	if (mode & FALLOC_FL_PUNCH_HOLE)
 		ret = kvm_gmem_punch_hole(file_inode(file), offset, len);
 	else
-		ret = kvm_gmem_allocate(file_inode(file), offset, len);
+		ret = kvm_gmem_allocate(file, offset, len);
 
 	if (!ret)
 		file_modified(file);
@@ -626,7 +627,7 @@ __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
 		return ERR_PTR(-EIO);
 	}
 
-	folio = __kvm_gmem_get_folio(file_inode(file), index, allow_huge);
+	folio = __kvm_gmem_get_folio(file, index, allow_huge);
 	if (IS_ERR(folio))
 		return folio;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [RFC PATCH 4/4] KVM: guest_memfd: Enforce NUMA mempolicy if available
  2024-11-05 16:55 ` [RFC PATCH 2/4] Introduce fbind syscall Shivank Garg
  2024-11-05 16:55   ` [RFC PATCH 3/4] KVM: guest_memfd: Pass file pointer instead of inode in guest_memfd APIs Shivank Garg
@ 2024-11-05 16:55   ` Shivank Garg
  1 sibling, 0 replies; 12+ messages in thread
From: Shivank Garg @ 2024-11-05 16:55 UTC (permalink / raw)
  To: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
	linux-mm, linux-api, linux-arch, kvm
  Cc: chao.gao, pgonda, thomas.lendacky, seanjc, luto, tglx, mingo, bp,
	dave.hansen, willy, arnd, pbonzini, kees, shivankg, bharata,
	nikunj, michael.day, Neeraj.Upadhyay

Enforce memory policy on guest-memfd to provide proper NUMA support.
Previously, guest-memfd allocations were following local NUMA node id
in absence of process mempolicy, resulting in random memory allocation.
Moreover, it cannot use mbind() since memory isn't mapped to userspace.

To support NUMA policies, call fbind() syscall from VMM to store
mempolicy as f_policy in struct kvm_gmem of guest_memfd. The f_policy
is retrieved to pass in filemap_grab_folio_mpol() to ensure that
allocations follow the specified memory policy.

Signed-off-by: Shivank Garg <shivankg@amd.com>
---
 mm/mempolicy.c         |  2 ++
 virt/kvm/guest_memfd.c | 49 ++++++++++++++++++++++++++++++++++++++----
 2 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 3a697080ecad..af2e1ef4dae7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -347,6 +347,7 @@ void __mpol_put(struct mempolicy *pol)
 		return;
 	kmem_cache_free(policy_cache, pol);
 }
+EXPORT_SYMBOL(__mpol_put);
 
 static void mpol_rebind_default(struct mempolicy *pol, const nodemask_t *nodes)
 {
@@ -2599,6 +2600,7 @@ struct mempolicy *__mpol_dup(struct mempolicy *old)
 	atomic_set(&new->refcnt, 1);
 	return new;
 }
+EXPORT_SYMBOL(__mpol_dup);
 
 /* Slow path of a mempolicy comparison */
 bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 2c6fcf7c3ec9..0237bda4382c 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,7 @@
 #include <linux/kvm_host.h>
 #include <linux/pagemap.h>
 #include <linux/anon_inodes.h>
+#include <linux/mempolicy.h>
 
 #include "kvm_mm.h"
 
@@ -11,6 +12,7 @@ struct kvm_gmem {
 	struct kvm *kvm;
 	struct xarray bindings;
 	struct list_head entry;
+	struct mempolicy *f_policy;
 };
 
 /**
@@ -87,7 +89,8 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
 }
 
 static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index,
-					     unsigned int order)
+					     unsigned int order,
+					     struct mempolicy *policy)
 {
 	pgoff_t npages = 1UL << order;
 	pgoff_t huge_index = round_down(index, npages);
@@ -104,7 +107,7 @@ static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index,
 				   (loff_t)(huge_index + npages - 1) << PAGE_SHIFT))
 		return NULL;
 
-	folio = filemap_alloc_folio(gfp, order);
+	folio = filemap_alloc_folio_mpol(gfp, order, policy);
 	if (!folio)
 		return NULL;
 
@@ -129,12 +132,26 @@ static struct folio *__kvm_gmem_get_folio(struct file *file, pgoff_t index,
 					  bool allow_huge)
 {
 	struct folio *folio = NULL;
+	struct kvm_gmem *gmem = file->private_data;
+	struct mempolicy *policy = NULL;
+
+	/*
+	 * RCU lock is required to prevent any race condition with set_policy().
+	 */
+	if (IS_ENABLED(CONFIG_NUMA)) {
+		rcu_read_lock();
+		policy = READ_ONCE(gmem->f_policy);
+		mpol_get(policy);
+		rcu_read_unlock();
+	}
 
 	if (gmem_2m_enabled && allow_huge)
-		folio = kvm_gmem_get_huge_folio(file_inode(file), index, PMD_ORDER);
+		folio = kvm_gmem_get_huge_folio(file_inode(file), index, PMD_ORDER, policy);
 
 	if (!folio)
-		folio = filemap_grab_folio(file_inode(file)->i_mapping, index);
+		folio = filemap_grab_folio_mpol(file_inode(file)->i_mapping, index, policy);
+
+	mpol_put(policy);
 
 	pr_debug("%s: allocate folio with PFN %lx order %d\n",
 		 __func__, folio_pfn(folio), folio_order(folio));
@@ -338,6 +355,7 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
 	mutex_unlock(&kvm->slots_lock);
 
 	xa_destroy(&gmem->bindings);
+	mpol_put(gmem->f_policy);
 	kfree(gmem);
 
 	kvm_put_kvm(kvm);
@@ -356,10 +374,32 @@ static inline struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
 	return get_file_active(&slot->gmem.file);
 }
 
+#ifdef CONFIG_NUMA
+static int kvm_gmem_set_policy(struct file *file, struct mempolicy *mpol)
+{
+	struct mempolicy *old, *new;
+	struct kvm_gmem *gmem = file->private_data;
+
+	new = mpol_dup(mpol);
+	if (IS_ERR(new))
+		return PTR_ERR(new);
+
+	old = gmem->f_policy;
+	WRITE_ONCE(gmem->f_policy, new);
+	synchronize_rcu();
+	mpol_put(old);
+
+	return 0;
+}
+#endif
+
 static struct file_operations kvm_gmem_fops = {
 	.open		= generic_file_open,
 	.release	= kvm_gmem_release,
 	.fallocate	= kvm_gmem_fallocate,
+#ifdef CONFIG_NUMA
+	.set_policy	= kvm_gmem_set_policy,
+#endif
 };
 
 void kvm_gmem_init(struct module *module)
@@ -489,6 +529,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 
 	kvm_get_kvm(kvm);
 	gmem->kvm = kvm;
+	gmem->f_policy = NULL;
 	xa_init(&gmem->bindings);
 	list_add(&gmem->entry, &inode->i_mapping->i_private_list);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd
  2024-11-05 16:45 [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd Shivank Garg
  2024-11-05 16:45 ` [RFC PATCH 1/4] mm: Add mempolicy support to the filemap layer Shivank Garg
  2024-11-05 16:55 ` [RFC PATCH 2/4] Introduce fbind syscall Shivank Garg
@ 2024-11-05 18:55 ` Matthew Wilcox
  2024-11-07  8:54   ` Shivank Garg
  2 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2024-11-05 18:55 UTC (permalink / raw)
  To: Shivank Garg
  Cc: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
	linux-mm, linux-api, linux-arch, kvm, chao.gao, pgonda,
	thomas.lendacky, seanjc, luto, tglx, mingo, bp, dave.hansen, arnd,
	pbonzini, kees, bharata, nikunj, michael.day, Neeraj.Upadhyay

On Tue, Nov 05, 2024 at 04:45:45PM +0000, Shivank Garg wrote:
> This patch series introduces fbind() syscall to support NUMA memory
> policies for KVM guest_memfd, allowing VMMs to configure memory placement
> for guest memory. This addresses the current limitation where guest_memfd
> allocations ignore NUMA policies, potentially impacting performance of
> memory-locality-sensitive workloads.

Why does guest_memfd ignore numa policies?  The pagecache doesn't,
eg in vma_alloc_folio_noprof().

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd
  2024-11-05 18:55 ` [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd Matthew Wilcox
@ 2024-11-07  8:54   ` Shivank Garg
  2024-11-07 15:10     ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Shivank Garg @ 2024-11-07  8:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
	linux-mm, linux-api, linux-arch, kvm, chao.gao, pgonda,
	thomas.lendacky, seanjc, luto, tglx, mingo, bp, dave.hansen, arnd,
	pbonzini, kees, bharata, nikunj, michael.day, Neeraj.Upadhyay,
	linux-coco

Hi Matthew,

On 11/6/2024 12:25 AM, Matthew Wilcox wrote:
> On Tue, Nov 05, 2024 at 04:45:45PM +0000, Shivank Garg wrote:
>> This patch series introduces fbind() syscall to support NUMA memory
>> policies for KVM guest_memfd, allowing VMMs to configure memory placement
>> for guest memory. This addresses the current limitation where guest_memfd
>> allocations ignore NUMA policies, potentially impacting performance of
>> memory-locality-sensitive workloads.
> 
> Why does guest_memfd ignore numa policies?  The pagecache doesn't,
> eg in vma_alloc_folio_noprof().
guest_memfd doesn't have VMAs and hence can't store policy information in
VMA and use vma_alloc_folio_noprof() that fetches mpol from VMA.

The folio allocation path from guest_memfd typically looks like this...

kvm_gmem_get_folio
  filemap_grab_folio
    __filemap_get_folio
      filemap_alloc_folio
        __folio_alloc_node_noprof
          -> goes to the buddy allocator

Hence, I am trying to have a version of filemap_alloc_folio() that takes an mpol.


Thanks,
Shivank

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd
  2024-11-07  8:54   ` Shivank Garg
@ 2024-11-07 15:10     ` Matthew Wilcox
  2024-11-08  9:21       ` Shivank Garg
  2024-11-08 17:31       ` Paolo Bonzini
  0 siblings, 2 replies; 12+ messages in thread
From: Matthew Wilcox @ 2024-11-07 15:10 UTC (permalink / raw)
  To: Shivank Garg
  Cc: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
	linux-mm, linux-api, linux-arch, kvm, chao.gao, pgonda,
	thomas.lendacky, seanjc, luto, tglx, mingo, bp, dave.hansen, arnd,
	pbonzini, kees, bharata, nikunj, michael.day, Neeraj.Upadhyay,
	linux-coco

On Thu, Nov 07, 2024 at 02:24:20PM +0530, Shivank Garg wrote:
> The folio allocation path from guest_memfd typically looks like this...
> 
> kvm_gmem_get_folio
>   filemap_grab_folio
>     __filemap_get_folio
>       filemap_alloc_folio
>         __folio_alloc_node_noprof
>           -> goes to the buddy allocator
> 
> Hence, I am trying to have a version of filemap_alloc_folio() that takes an mpol.

It only takes that path if cpuset_do_page_mem_spread() is true.  Is the
real problem that you're trying to solve that cpusets are being used
incorrectly?

Backing up, it seems like you want to make a change to the page cache,
you've had a long discussion with people who aren't the page cache
maintainer, and you all understand the pros and cons of everything,
and here you are dumping a solution on me without talking to me, even
though I was at Plumbers, you didn't find me to tell me I needed to go
to your talk.

So you haven't explained a damned thing to me, and I'm annoyed at you.
Do better.  Starting with your cover letter.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd
  2024-11-07 15:10     ` Matthew Wilcox
@ 2024-11-08  9:21       ` Shivank Garg
  2024-11-08 17:31       ` Paolo Bonzini
  1 sibling, 0 replies; 12+ messages in thread
From: Shivank Garg @ 2024-11-08  9:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
	linux-mm, linux-api, linux-arch, kvm, chao.gao, pgonda,
	thomas.lendacky, seanjc, luto, tglx, mingo, bp, dave.hansen, arnd,
	pbonzini, kees, bharata, nikunj, michael.day, Neeraj.Upadhyay,
	linux-coco



On 11/7/2024 8:40 PM, Matthew Wilcox wrote:
> On Thu, Nov 07, 2024 at 02:24:20PM +0530, Shivank Garg wrote:
>> The folio allocation path from guest_memfd typically looks like this...
>>
>> kvm_gmem_get_folio
>>   filemap_grab_folio
>>     __filemap_get_folio
>>       filemap_alloc_folio
>>         __folio_alloc_node_noprof
>>           -> goes to the buddy allocator
>>
>> Hence, I am trying to have a version of filemap_alloc_folio() that takes an mpol.
> 
> It only takes that path if cpuset_do_page_mem_spread() is true.  Is the
> real problem that you're trying to solve that cpusets are being used
> incorrectly?
> 
> Backing up, it seems like you want to make a change to the page cache,
> you've had a long discussion with people who aren't the page cache
> maintainer, and you all understand the pros and cons of everything,
> and here you are dumping a solution on me without talking to me, even
> though I was at Plumbers, you didn't find me to tell me I needed to go
> to your talk.
> 
> So you haven't explained a damned thing to me, and I'm annoyed at you.
> Do better.  Starting with your cover letter.

Hi Matthew,

I apologize for any misunderstanding and not providing adequate context.

To clarify:
- You may recall this work from its earlier iteration as an
  IOCTL-based approach, where you provided valuable review comments [1].
- I was not physically present at LPC. The discussion happened through
  the mailing list [2] and lobby discussion with my colleagues who visited
  Vienna.
- Based on feedback, particularly regarding the suggestion to consider
  fbind() as a more generic solution, we shifted to the current approach.

I posted this as *RFC* specifically to gather feedback on the feasibility of
this approach and to ensure I'm heading in the right direction.

Would you be willing to help me understand:
1. What additional information would be helpful to you and other reviewers?
2. How cpusets can be used correctly to fix this? (your point on
   cpuset_do_page_mem_spread() is interesting and I'll investigate it more
   thoroughly to understand).

I'll work on improving the cover letter to better explain the problem space
and proposed solution.

Thank you for the valuable feedback.

[1] https://lore.kernel.org/linux-mm/ZuimLtrpv1dXczf5@casper.infradead.org
[2] https://lore.kernel.org/linux-mm/ZvEga7srKhympQBt@intel.com

Best regards,
Shivank

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd
  2024-11-07 15:10     ` Matthew Wilcox
  2024-11-08  9:21       ` Shivank Garg
@ 2024-11-08 17:31       ` Paolo Bonzini
  2024-11-11 11:02         ` Vlastimil Babka
  1 sibling, 1 reply; 12+ messages in thread
From: Paolo Bonzini @ 2024-11-08 17:31 UTC (permalink / raw)
  To: Matthew Wilcox, Shivank Garg
  Cc: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
	linux-mm, linux-api, linux-arch, kvm, chao.gao, pgonda,
	thomas.lendacky, seanjc, luto, tglx, mingo, bp, dave.hansen, arnd,
	kees, bharata, nikunj, michael.day, Neeraj.Upadhyay, linux-coco,
	Linux API

On 11/7/24 16:10, Matthew Wilcox wrote:
> On Thu, Nov 07, 2024 at 02:24:20PM +0530, Shivank Garg wrote:
>> The folio allocation path from guest_memfd typically looks like this...
>>
>> kvm_gmem_get_folio
>>    filemap_grab_folio
>>      __filemap_get_folio
>>        filemap_alloc_folio
>>          __folio_alloc_node_noprof
>>            -> goes to the buddy allocator
>>
>> Hence, I am trying to have a version of filemap_alloc_folio() that takes an mpol.
> 
> It only takes that path if cpuset_do_page_mem_spread() is true.  Is the
> real problem that you're trying to solve that cpusets are being used
> incorrectly?

If it's false it's not very different, it goes to alloc_pages_noprof(). 
Then it respects the process's policy, but the policy is not 
customizable without mucking with state that is global to the process.

Taking a step back: the problem is that a VM can be configured to have 
multiple guest-side NUMA nodes, each of which will pick memory from the 
right NUMA node in the host.  Without a per-file operation it's not 
possible to do this on guest_memfd.  The discussion was whether to use 
ioctl() or a new system call.  The discussion ended with the idea of 
posting a *proposal* asking for *comments* as to whether the system call 
would be useful in general beyond KVM.

Commenting on the system call itself I am not sure I like the 
file_operations entry, though I understand that it's the simplest way to 
implement this in an RFC series.  It's a bit surprising that fbind() is 
a total no-op for everything except KVM's guest_memfd.

Maybe whatever you pass to fbind() could be stored in the struct file *, 
and used as the default when creating VMAs; as if every mmap() was 
followed by an mbind(), except that it also does the right thing with 
MAP_POPULATE for example.  Or maybe that's a horrible idea?

Adding linux-api to get input; original thread is at
https://lore.kernel.org/kvm/20241105164549.154700-1-shivankg@amd.com/.

Paolo

> Backing up, it seems like you want to make a change to the page cache,
> you've had a long discussion with people who aren't the page cache
> maintainer, and you all understand the pros and cons of everything,
> and here you are dumping a solution on me without talking to me, even
> though I was at Plumbers, you didn't find me to tell me I needed to go
> to your talk.
> 
> So you haven't explained a damned thing to me, and I'm annoyed at you.
> Do better.  Starting with your cover letter.
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd
  2024-11-08 17:31       ` Paolo Bonzini
@ 2024-11-11 11:02         ` Vlastimil Babka
  2024-11-11 22:14           ` David Hildenbrand
  0 siblings, 1 reply; 12+ messages in thread
From: Vlastimil Babka @ 2024-11-11 11:02 UTC (permalink / raw)
  To: Paolo Bonzini, Matthew Wilcox, Shivank Garg
  Cc: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
	linux-mm, linux-api, linux-arch, kvm, chao.gao, pgonda,
	thomas.lendacky, seanjc, luto, tglx, mingo, bp, dave.hansen, arnd,
	kees, bharata, nikunj, michael.day, Neeraj.Upadhyay, linux-coco

On 11/8/24 18:31, Paolo Bonzini wrote:
> On 11/7/24 16:10, Matthew Wilcox wrote:
>> On Thu, Nov 07, 2024 at 02:24:20PM +0530, Shivank Garg wrote:
>>> The folio allocation path from guest_memfd typically looks like this...
>>>
>>> kvm_gmem_get_folio
>>>    filemap_grab_folio
>>>      __filemap_get_folio
>>>        filemap_alloc_folio
>>>          __folio_alloc_node_noprof
>>>            -> goes to the buddy allocator
>>>
>>> Hence, I am trying to have a version of filemap_alloc_folio() that takes an mpol.
>> 
>> It only takes that path if cpuset_do_page_mem_spread() is true.  Is the
>> real problem that you're trying to solve that cpusets are being used
>> incorrectly?
> 
> If it's false it's not very different, it goes to alloc_pages_noprof(). 
> Then it respects the process's policy, but the policy is not 
> customizable without mucking with state that is global to the process.
> 
> Taking a step back: the problem is that a VM can be configured to have 
> multiple guest-side NUMA nodes, each of which will pick memory from the 
> right NUMA node in the host.  Without a per-file operation it's not 
> possible to do this on guest_memfd.  The discussion was whether to use 
> ioctl() or a new system call.  The discussion ended with the idea of 
> posting a *proposal* asking for *comments* as to whether the system call 
> would be useful in general beyond KVM.
> 
> Commenting on the system call itself I am not sure I like the 
> file_operations entry, though I understand that it's the simplest way to 
> implement this in an RFC series.  It's a bit surprising that fbind() is 
> a total no-op for everything except KVM's guest_memfd.
> 
> Maybe whatever you pass to fbind() could be stored in the struct file *, 
> and used as the default when creating VMAs; as if every mmap() was 
> followed by an mbind(), except that it also does the right thing with 
> MAP_POPULATE for example.  Or maybe that's a horrible idea?

mbind() manpage has this:

       The  specified  policy  will  be  ignored  for  any MAP_SHARED
mappings in the specified memory range.  Rather the pages will be allocated
according to the memory policy of the thread that caused the page to be
allocated. Again, this may not be the thread that called mbind().

So that seems like we're not very keen on having one user of a file set a
policy that would affect other users of the file?

Now the next paragraph of the manpage says that shmem is different, and
guest_memfd is more like shmem than a regular file.

My conclusion from that is that fbind() might be too broad and we don't want
this for actual filesystem-backed files? And if it's limited to guest_memfd,
it shouldn't be an fbind()?

> Adding linux-api to get input; original thread is at
> https://lore.kernel.org/kvm/20241105164549.154700-1-shivankg@amd.com/.
> 
> Paolo
> 
>> Backing up, it seems like you want to make a change to the page cache,
>> you've had a long discussion with people who aren't the page cache
>> maintainer, and you all understand the pros and cons of everything,
>> and here you are dumping a solution on me without talking to me, even
>> though I was at Plumbers, you didn't find me to tell me I needed to go
>> to your talk.
>> 
>> So you haven't explained a damned thing to me, and I'm annoyed at you.
>> Do better.  Starting with your cover letter.
>> 
> 
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd
  2024-11-11 11:02         ` Vlastimil Babka
@ 2024-11-11 22:14           ` David Hildenbrand
  0 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2024-11-11 22:14 UTC (permalink / raw)
  To: Vlastimil Babka, Paolo Bonzini, Matthew Wilcox, Shivank Garg
  Cc: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
	linux-mm, linux-api, linux-arch, kvm, chao.gao, pgonda,
	thomas.lendacky, seanjc, luto, tglx, mingo, bp, dave.hansen, arnd,
	kees, bharata, nikunj, michael.day, Neeraj.Upadhyay, linux-coco

On 11.11.24 12:02, Vlastimil Babka wrote:
> On 11/8/24 18:31, Paolo Bonzini wrote:
>> On 11/7/24 16:10, Matthew Wilcox wrote:
>>> On Thu, Nov 07, 2024 at 02:24:20PM +0530, Shivank Garg wrote:
>>>> The folio allocation path from guest_memfd typically looks like this...
>>>>
>>>> kvm_gmem_get_folio
>>>>     filemap_grab_folio
>>>>       __filemap_get_folio
>>>>         filemap_alloc_folio
>>>>           __folio_alloc_node_noprof
>>>>             -> goes to the buddy allocator
>>>>
>>>> Hence, I am trying to have a version of filemap_alloc_folio() that takes an mpol.
>>>
>>> It only takes that path if cpuset_do_page_mem_spread() is true.  Is the
>>> real problem that you're trying to solve that cpusets are being used
>>> incorrectly?
>>
>> If it's false it's not very different, it goes to alloc_pages_noprof().
>> Then it respects the process's policy, but the policy is not
>> customizable without mucking with state that is global to the process.
>>
>> Taking a step back: the problem is that a VM can be configured to have
>> multiple guest-side NUMA nodes, each of which will pick memory from the
>> right NUMA node in the host.  Without a per-file operation it's not
>> possible to do this on guest_memfd.  The discussion was whether to use
>> ioctl() or a new system call.  The discussion ended with the idea of
>> posting a *proposal* asking for *comments* as to whether the system call
>> would be useful in general beyond KVM.
>>
>> Commenting on the system call itself I am not sure I like the
>> file_operations entry, though I understand that it's the simplest way to
>> implement this in an RFC series.  It's a bit surprising that fbind() is
>> a total no-op for everything except KVM's guest_memfd.
>>
>> Maybe whatever you pass to fbind() could be stored in the struct file *,
>> and used as the default when creating VMAs; as if every mmap() was
>> followed by an mbind(), except that it also does the right thing with
>> MAP_POPULATE for example.  Or maybe that's a horrible idea?
> 
> mbind() manpage has this:
> 
>         The  specified  policy  will  be  ignored  for  any MAP_SHARED
> mappings in the specified memory range.  Rather the pages will be allocated
> according to the memory policy of the thread that caused the page to be
> allocated. Again, this may not be the thread that called mbind().

I recall discussing that a couple of times in the context of QEMU. I 
have some faint recollection that the manpage is a bit imprecise:

IIRC, hugetlb also ends up using the VMA policy for MAP_SHARED mappings 
during faults (huge_node()->get_vma_policy()) -- but in contrast to 
shmem, it doesn't end up becoming the "shared" policy for the file, used 
when accessed through other VMAs.

> 
> So that seems like we're not very keen on having one user of a file set a
> policy that would affect other users of the file?

For VMs in QEMU we really want to configure the policy once in the main 
process and have all other processes (e.g., vhost-user) not worry about 
that when they mmap() guest memory.

With shmem this works by "shared policy" design (below). For hugetlb, we 
rely on the fact that mbind()+MADV_POPULATE_WRITE allows us to 
preallocate NUMA-aware. So with hugetlb we really preallocate all guest 
RAM to guarantee the NUMA placement.

It would not be the worst idea to have a clean interface to configure 
file-range policies instead of having this weird shmem mbind() behavior 
and the hugetlb hack.

Having that said, other filesystem are rarely used for backing VMs, at 
least in combination with NUMA. So nobody really cared that much for now.

Maybe fbind() would primarily only be useful for in-memory filesystems 
(shmem/hugetlb/guest_memfd).

> 
> Now the next paragraph of the manpage says that shmem is different, and
> guest_memfd is more like shmem than a regular file.
> 
> My conclusion from that is that fbind() might be too broad and we don't want
> this for actual filesystem-backed files? And if it's limited to guest_memfd,
> it shouldn't be an fbind()?

I was just once again diving into how mbind() on shmem is handled. And 
in fact, mbind() will call vma->set_policy() to update the per 
file-range policy. I wonder why we didn't do the same for hugetlb ... 
but of course, hugetlb must be special in any possible way.

Not saying it's the best idea, but as we are talking about mmap support 
of guest_memfd (only allowing to fault in shared/faultable pages), one 
*could* look into implementing mbind()+vma->set_policy() for guest_memfd 
similar to how shmem handles it.

It would require a (temporary) dummy VMA in the worst case (all private 
memory).

It sounds a bit weird, though, to require a VMA to configure this, 
though. But at least it's similar to what shmem does ...

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-11-11 22:14 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-05 16:45 [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd Shivank Garg
2024-11-05 16:45 ` [RFC PATCH 1/4] mm: Add mempolicy support to the filemap layer Shivank Garg
2024-11-05 16:55 ` [RFC PATCH 2/4] Introduce fbind syscall Shivank Garg
2024-11-05 16:55   ` [RFC PATCH 3/4] KVM: guest_memfd: Pass file pointer instead of inode in guest_memfd APIs Shivank Garg
2024-11-05 16:55   ` [RFC PATCH 4/4] KVM: guest_memfd: Enforce NUMA mempolicy if available Shivank Garg
2024-11-05 18:55 ` [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd Matthew Wilcox
2024-11-07  8:54   ` Shivank Garg
2024-11-07 15:10     ` Matthew Wilcox
2024-11-08  9:21       ` Shivank Garg
2024-11-08 17:31       ` Paolo Bonzini
2024-11-11 11:02         ` Vlastimil Babka
2024-11-11 22:14           ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).