* [RFC PATCH 1/4] mm: Add mempolicy support to the filemap layer
From: Shivank Garg @ 2024-11-05 16:45 UTC
To: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
linux-mm, linux-api, linux-arch, kvm
Cc: chao.gao, pgonda, thomas.lendacky, seanjc, luto, tglx, mingo, bp,
dave.hansen, willy, arnd, pbonzini, kees, shivankg, bharata,
nikunj, michael.day, Neeraj.Upadhyay, Shivansh Dhiman
From: Shivansh Dhiman <shivansh.dhiman@amd.com>
Introduce mempolicy support to the filemap layer. Add the
filemap_grab_folio_mpol(), filemap_alloc_folio_mpol_noprof() and
__filemap_get_folio_mpol() APIs, which take a mempolicy struct as an
argument.
These APIs are needed by VMs using KVM guest-memfd memory backends for
NUMA mempolicy-aware allocations.
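For illustration, a caller that already holds a reference on a mempolicy
would use the new API roughly as follows (a minimal sketch; mapping, index
and mpol stand in for the caller's own state):

	struct folio *folio;

	folio = filemap_grab_folio_mpol(mapping, index, mpol);
	if (IS_ERR(folio))
		return PTR_ERR(folio);
	/* Folio is returned locked; folio_unlock()/folio_put() when done. */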
Signed-off-by: Shivansh Dhiman <shivansh.dhiman@amd.com>
Signed-off-by: Shivank Garg <shivankg@amd.com>
---
include/linux/pagemap.h | 40 ++++++++++++++++++++++++++++++++++++++++
mm/filemap.c | 30 +++++++++++++++++++++++++-----
2 files changed, 65 insertions(+), 5 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d9c7edb6422b..b05b696f310b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -564,15 +564,25 @@ static inline void *detach_page_private(struct page *page)
#ifdef CONFIG_NUMA
struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order);
+struct folio *filemap_alloc_folio_mpol_noprof(gfp_t gfp, unsigned int order,
+ struct mempolicy *mpol);
#else
static inline struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
{
return folio_alloc_noprof(gfp, order);
}
+static inline struct folio *filemap_alloc_folio_mpol_noprof(gfp_t gfp,
+ unsigned int order,
+ struct mempolicy *mpol)
+{
+ return filemap_alloc_folio_noprof(gfp, order);
+}
#endif
#define filemap_alloc_folio(...) \
alloc_hooks(filemap_alloc_folio_noprof(__VA_ARGS__))
+#define filemap_alloc_folio_mpol(...) \
+ alloc_hooks(filemap_alloc_folio_mpol_noprof(__VA_ARGS__))
static inline struct page *__page_cache_alloc(gfp_t gfp)
{
@@ -652,6 +662,8 @@ static inline fgf_t fgf_set_order(size_t size)
void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
fgf_t fgp_flags, gfp_t gfp);
+struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
+ pgoff_t index, fgf_t fgp_flags, gfp_t gfp, struct mempolicy *mpol);
struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
fgf_t fgp_flags, gfp_t gfp);
@@ -710,6 +722,34 @@ static inline struct folio *filemap_grab_folio(struct address_space *mapping,
mapping_gfp_mask(mapping));
}
+/**
+ * filemap_grab_folio_mpol - grab a folio from the page cache
+ * @mapping: The address space to search
+ * @index: The page index
+ * @mpol: The mempolicy to apply, or NULL for the default behaviour
+ *
+ * Same as filemap_grab_folio(), except that it allocates the folio using
+ * the given memory policy.
+ *
+ * Return: A found or created folio. ERR_PTR(-ENOMEM) if no folio is found
+ * and one could not be created.
+ */
+#ifdef CONFIG_NUMA
+static inline struct folio *filemap_grab_folio_mpol(struct address_space *mapping,
+ pgoff_t index, struct mempolicy *mpol)
+{
+ return __filemap_get_folio_mpol(mapping, index,
+ FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
+ mapping_gfp_mask(mapping), mpol);
+}
+#else
+static inline struct folio *filemap_grab_folio_mpol(struct address_space *mapping,
+ pgoff_t index, struct mempolicy *mpol)
+{
+ return filemap_grab_folio(mapping, index);
+}
+#endif /* CONFIG_NUMA */
+
/**
* find_get_page - find and get a page reference
* @mapping: the address_space to search
diff --git a/mm/filemap.c b/mm/filemap.c
index d62150418b91..a870a05296c8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -990,8 +990,13 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
EXPORT_SYMBOL_GPL(filemap_add_folio);
#ifdef CONFIG_NUMA
-struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
+struct folio *filemap_alloc_folio_mpol_noprof(gfp_t gfp, unsigned int order,
+ struct mempolicy *mpol)
{
int n;
struct folio *folio;
+
+ if (mpol)
+ return folio_alloc_mpol_noprof(gfp, order, mpol,
+ NO_INTERLEAVE_INDEX, numa_node_id());
@@ -1007,6 +1012,12 @@ struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
}
return folio_alloc_noprof(gfp, order);
}
+EXPORT_SYMBOL(filemap_alloc_folio_mpol_noprof);
+
+struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
+{
+ return filemap_alloc_folio_mpol_noprof(gfp, order, NULL);
+}
EXPORT_SYMBOL(filemap_alloc_folio_noprof);
#endif
@@ -1861,11 +1872,12 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
}
/**
- * __filemap_get_folio - Find and get a reference to a folio.
+ * __filemap_get_folio_mpol - Find and get a reference to a folio.
* @mapping: The address_space to search.
* @index: The page index.
* @fgp_flags: %FGP flags modify how the folio is returned.
* @gfp: Memory allocation flags to use if %FGP_CREAT is specified.
+ * @mpol: The mempolicy to apply, or NULL for the default behaviour.
*
* Looks up the page cache entry at @mapping & @index.
*
@@ -1876,8 +1888,8 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
*
* Return: The found folio or an ERR_PTR() otherwise.
*/
-struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
- fgf_t fgp_flags, gfp_t gfp)
+struct folio *__filemap_get_folio_mpol(struct address_space *mapping, pgoff_t index,
+ fgf_t fgp_flags, gfp_t gfp, struct mempolicy *mpol)
{
struct folio *folio;
@@ -1947,7 +1959,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
err = -ENOMEM;
if (order > 0)
alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
- folio = filemap_alloc_folio(alloc_gfp, order);
+ folio = filemap_alloc_folio_mpol(alloc_gfp, order, mpol);
if (!folio)
continue;
@@ -1978,6 +1990,14 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
return ERR_PTR(-ENOENT);
return folio;
}
+EXPORT_SYMBOL(__filemap_get_folio_mpol);
+
+struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
+ fgf_t fgp_flags, gfp_t gfp)
+{
+ return __filemap_get_folio_mpol(mapping, index,
+ fgp_flags, gfp, NULL);
+}
EXPORT_SYMBOL(__filemap_get_folio);
static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max,
--
2.34.1
* [RFC PATCH 2/4] Introduce fbind syscall
From: Shivank Garg @ 2024-11-05 16:55 UTC
To: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
linux-mm, linux-api, linux-arch, kvm
Cc: chao.gao, pgonda, thomas.lendacky, seanjc, luto, tglx, mingo, bp,
dave.hansen, willy, arnd, pbonzini, kees, shivankg, bharata,
nikunj, michael.day, Neeraj.Upadhyay, Shivansh Dhiman
The new fbind() syscall sets the NUMA memory policy for file-backed memory
and has the following signature:

long fbind(unsigned int fd, unsigned long mode,
           const unsigned long nodemask[(.maxnode + ULONG_WIDTH - 1)
                                        / ULONG_WIDTH],
           unsigned long maxnode, unsigned int flags);

fbind() behaves like mbind(), except that it takes a file descriptor
instead of an address range.
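A minimal userspace sketch of the intended usage (illustrative only: there
is no libc wrapper, the syscall number is the x86-64 one added by this
patch, and memfd is assumed to be an open file descriptor):

	#include <stdio.h>      /* perror() */
	#include <unistd.h>     /* syscall() */
	#include <numaif.h>     /* MPOL_BIND */

	#define __NR_fbind 463  /* x86-64 number from this series */

	unsigned long nodemask = 1UL << 1;  /* allocate from node 1 */

	if (syscall(__NR_fbind, memfd, MPOL_BIND, &nodemask,
		    sizeof(nodemask) * 8, 0) < 0)
		perror("fbind");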
TODO:
1. Support fbind syscall on all architectures.
2. Expand the commit message and add documentation.
3. Clean up the code.
[Shivansh: add create_mpol_from_args()]
Signed-off-by: Shivansh Dhiman <shivansh.dhiman@amd.com>
Signed-off-by: Shivank Garg <shivankg@amd.com>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/fs.h | 3 ++
include/linux/mempolicy.h | 3 ++
include/linux/syscalls.h | 3 ++
include/uapi/asm-generic/unistd.h | 5 ++-
kernel/sys_ni.c | 1 +
mm/Makefile | 2 +-
mm/fbind.c | 49 +++++++++++++++++++++++
mm/mempolicy.c | 55 ++++++++++++++++++++++++++
10 files changed, 121 insertions(+), 2 deletions(-)
create mode 100644 mm/fbind.c
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 534c74b14fab..0660ce6d08d8 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -468,3 +468,4 @@
460 i386 lsm_set_self_attr sys_lsm_set_self_attr
461 i386 lsm_list_modules sys_lsm_list_modules
462 i386 mseal sys_mseal
+463 i386 fbind sys_fbind
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 7093ee21c0d1..9794347cc2e6 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -386,6 +386,7 @@
460 common lsm_set_self_attr sys_lsm_set_self_attr
461 common lsm_list_modules sys_lsm_list_modules
462 common mseal sys_mseal
+463 common fbind sys_fbind
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd34b5755c0b..42042b62bdcd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2058,6 +2058,9 @@ struct file_operations {
struct file *file_out, loff_t pos_out,
loff_t len, unsigned int remap_flags);
int (*fadvise)(struct file *, loff_t, loff_t, int);
+#ifdef CONFIG_NUMA
+ int (*set_policy)(struct file *, struct mempolicy *);
+#endif
int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
unsigned int poll_flags);
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 1add16f21612..b9023f6246a7 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -299,4 +299,7 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
}
#endif /* CONFIG_NUMA */
+struct mempolicy *create_mpol_from_args(unsigned long mode,
+ const unsigned long __user *nmask,
+ unsigned long maxnode);
#endif
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 4bcf6754738d..2dc686921b9f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -502,6 +502,9 @@ asmlinkage long sys_readlinkat(int dfd, const char __user *path, char __user *bu
asmlinkage long sys_newfstatat(int dfd, const char __user *filename,
struct stat __user *statbuf, int flag);
asmlinkage long sys_newfstat(unsigned int fd, struct stat __user *statbuf);
+asmlinkage long sys_fbind(unsigned int fd, unsigned long mode,
+ const unsigned long __user *nmask,
+ unsigned long maxnode, unsigned int flags);
#if defined(__ARCH_WANT_STAT64) || defined(__ARCH_WANT_COMPAT_STAT64)
asmlinkage long sys_fstat64(unsigned long fd, struct stat64 __user *statbuf);
asmlinkage long sys_fstatat64(int dfd, const char __user *filename,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 5bf6148cac2b..550730f36dae 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -841,8 +841,11 @@ __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)
#define __NR_mseal 462
__SYSCALL(__NR_mseal, sys_mseal)
+#define __NR_fbind 463
+__SYSCALL(__NR_fbind, sys_fbind)
+
#undef __NR_syscalls
-#define __NR_syscalls 463
+#define __NR_syscalls 464
/*
* 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index c00a86931f8c..f57350e581f6 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -195,6 +195,7 @@ COND_SYSCALL(move_pages);
COND_SYSCALL(set_mempolicy_home_node);
COND_SYSCALL(cachestat);
COND_SYSCALL(mseal);
+COND_SYSCALL(fbind);
COND_SYSCALL(perf_event_open);
COND_SYSCALL(accept4);
diff --git a/mm/Makefile b/mm/Makefile
index d2915f8c9dc0..ba339ddc0be2 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -79,7 +79,7 @@ obj-$(CONFIG_ZSWAP) += zswap.o
obj-$(CONFIG_HAS_DMA) += dmapool.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
obj-$(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) += hugetlb_vmemmap.o
-obj-$(CONFIG_NUMA) += mempolicy.o
+obj-$(CONFIG_NUMA) += mempolicy.o fbind.o
obj-$(CONFIG_SPARSEMEM) += sparse.o
obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
diff --git a/mm/fbind.c b/mm/fbind.c
new file mode 100644
index 000000000000..85ec7d13345c
--- /dev/null
+++ b/mm/fbind.c
@@ -0,0 +1,49 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Implement fbind() syscall.
+ *
+ * Copyright (c) 2024 AMD
+ *
+ * Author: Shivank Garg <shivankg@amd.com>
+ */
+
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/mempolicy.h>
+#include <linux/syscalls.h>
+
+static long do_fbind(unsigned int fd, unsigned long mode,
+ const unsigned long __user *nmask,
+ unsigned long maxnode, unsigned int flags)
+{
+ struct mempolicy *mpol;
+ struct fd f;
+ int ret;
+
+
+ /* No flags are currently defined; reject unknown ones up front. */
+ if (flags)
+ return -EINVAL;
+
+ if (!f.file)
+ return -EBADF;
+
+ mpol = create_mpol_from_args(mode, nmask, maxnode);
+ if (IS_ERR_OR_NULL(mpol)) {
+ /* A NULL policy (MPOL_DEFAULT) must not be mistaken for success. */
+ ret = mpol ? PTR_ERR(mpol) : -EINVAL;
+ goto out_putf;
+ }
+
+ if (f.file->f_op->set_policy)
+ ret = f.file->f_op->set_policy(f.file, mpol);
+ else
+ ret = -EOPNOTSUPP;
+
+ mpol_put(mpol);
+out_putf:
+ fdput(f);
+ return ret;
+}
+
+SYSCALL_DEFINE5(fbind, unsigned int, fd, unsigned long, mode,
+ const unsigned long __user *, nmask,
+ unsigned long, maxnode, unsigned int, flags)
+{
+ return do_fbind(fd, mode, nmask, maxnode, flags);
+}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b858e22b259d..3a697080ecad 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -3557,3 +3557,58 @@ static int __init mempolicy_sysfs_init(void)
late_initcall(mempolicy_sysfs_init);
#endif /* CONFIG_SYSFS */
+
+/**
+ * create_mpol_from_args - create a mempolicy structure from args
+ * @mode: NUMA memory policy mode
+ * @nmask: bitmask of NUMA nodes
+ * @maxnode: number of bits in the nodes bitmask
+ *
+ * Create a mempolicy from the given nodemask and memory policy mode
+ * (default, preferred, interleave or bind).
+ *
+ * Return: the new mempolicy on success, or an error encoded in a pointer.
+ */
+struct mempolicy *create_mpol_from_args(unsigned long mode,
+ const unsigned long __user *nmask,
+ unsigned long maxnode)
+{
+ struct mm_struct *mm = current->mm;
+ unsigned short mode_flags;
+ struct mempolicy *mpol;
+ nodemask_t nodes;
+ int lmode = mode;
+ int err;
+
+ err = sanitize_mpol_flags(&lmode, &mode_flags);
+ if (err)
+ return ERR_PTR(err);
+
+ err = get_nodes(&nodes, nmask, maxnode);
+ if (err)
+ return ERR_PTR(err);
+
+ /* Use the sanitized mode, not the caller-supplied one. */
+ mpol = mpol_new(lmode, mode_flags, &nodes);
+ if (IS_ERR_OR_NULL(mpol))
+ return mpol;
+
+ {
+ NODEMASK_SCRATCH(scratch);
+
+ if (!scratch) {
+ err = -ENOMEM;
+ goto err_out;
+ }
+
+ mmap_write_lock(mm);
+ err = mpol_set_nodemask(mpol, &nodes, scratch);
+ mmap_write_unlock(mm);
+ NODEMASK_SCRATCH_FREE(scratch);
+ }
+
+ if (err)
+ goto err_out;
+
+ return mpol;
+
+err_out:
+ mpol_put(mpol);
+ return ERR_PTR(err);
+}
--
2.34.1
* [RFC PATCH 3/4] KVM: guest_memfd: Pass file pointer instead of inode in guest_memfd APIs
From: Shivank Garg @ 2024-11-05 16:55 UTC
To: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
linux-mm, linux-api, linux-arch, kvm
Cc: chao.gao, pgonda, thomas.lendacky, seanjc, luto, tglx, mingo, bp,
dave.hansen, willy, arnd, pbonzini, kees, shivankg, bharata,
nikunj, michael.day, Neeraj.Upadhyay
Change the KVM guest_memfd APIs to pass file pointers instead of
inodes in the folio allocation functions. This is a preparatory patch
for adding NUMA support to guest memory allocations.
The functional behavior remains unchanged.
Signed-off-by: Shivank Garg <shivankg@amd.com>
---
virt/kvm/guest_memfd.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index e930014b4bdc..2c6fcf7c3ec9 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -91,7 +91,7 @@ static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index,
{
pgoff_t npages = 1UL << order;
pgoff_t huge_index = round_down(index, npages);
- struct address_space *mapping = inode->i_mapping;
+ struct address_space *mapping = inode->i_mapping;
gfp_t gfp = mapping_gfp_mask(mapping) | __GFP_NOWARN;
loff_t size = i_size_read(inode);
struct folio *folio;
@@ -125,16 +125,16 @@ static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index,
* Ignore accessed, referenced, and dirty flags. The memory is
* unevictable and there is no storage to write back to.
*/
-static struct folio *__kvm_gmem_get_folio(struct inode *inode, pgoff_t index,
+static struct folio *__kvm_gmem_get_folio(struct file *file, pgoff_t index,
bool allow_huge)
{
struct folio *folio = NULL;
if (gmem_2m_enabled && allow_huge)
- folio = kvm_gmem_get_huge_folio(inode, index, PMD_ORDER);
+ folio = kvm_gmem_get_huge_folio(file_inode(file), index, PMD_ORDER);
if (!folio)
- folio = filemap_grab_folio(inode->i_mapping, index);
+ folio = filemap_grab_folio(file_inode(file)->i_mapping, index);
pr_debug("%s: allocate folio with PFN %lx order %d\n",
__func__, folio_pfn(folio), folio_order(folio));
@@ -150,9 +150,9 @@ static struct folio *__kvm_gmem_get_folio(struct inode *inode, pgoff_t index,
* Ignore accessed, referenced, and dirty flags. The memory is
* unevictable and there is no storage to write back to.
*/
-static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
+static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index)
{
- return __kvm_gmem_get_folio(inode, index, true);
+ return __kvm_gmem_get_folio(file, index, true);
}
static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
@@ -228,8 +228,9 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
return 0;
}
-static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
+static long kvm_gmem_allocate(struct file *file, loff_t offset, loff_t len)
{
+ struct inode *inode = file_inode(file);
struct address_space *mapping = inode->i_mapping;
pgoff_t start, index, end;
int r;
@@ -252,7 +253,7 @@ static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
break;
}
- folio = kvm_gmem_get_folio(inode, index);
+ folio = kvm_gmem_get_folio(file, index);
if (IS_ERR(folio)) {
r = PTR_ERR(folio);
break;
@@ -292,7 +293,7 @@ static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
if (mode & FALLOC_FL_PUNCH_HOLE)
ret = kvm_gmem_punch_hole(file_inode(file), offset, len);
else
- ret = kvm_gmem_allocate(file_inode(file), offset, len);
+ ret = kvm_gmem_allocate(file, offset, len);
if (!ret)
file_modified(file);
@@ -626,7 +627,7 @@ __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
return ERR_PTR(-EIO);
}
- folio = __kvm_gmem_get_folio(file_inode(file), index, allow_huge);
+ folio = __kvm_gmem_get_folio(file, index, allow_huge);
if (IS_ERR(folio))
return folio;
--
2.34.1
* [RFC PATCH 4/4] KVM: guest_memfd: Enforce NUMA mempolicy if available
From: Shivank Garg @ 2024-11-05 16:55 UTC
To: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
linux-mm, linux-api, linux-arch, kvm
Cc: chao.gao, pgonda, thomas.lendacky, seanjc, luto, tglx, mingo, bp,
dave.hansen, willy, arnd, pbonzini, kees, shivankg, bharata,
nikunj, michael.day, Neeraj.Upadhyay
Enforce memory policy on guest-memfd to provide proper NUMA support.
Previously, guest-memfd allocations fell back to the local NUMA node in
the absence of a process mempolicy, giving the VMM no control over
placement. mbind() cannot be used either, because the memory is never
mapped into userspace.
To support NUMA policies, the VMM calls fbind() to store the mempolicy
as f_policy in guest_memfd's struct kvm_gmem. The stored policy is then
passed to filemap_grab_folio_mpol() so that allocations follow it.
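For example, a VMM that wants each guest node backed from a specific host
node could do roughly the following per guest_memfd (a sketch; gmem_fd is
assumed to come from KVM_CREATE_GUEST_MEMFD and host_node from the VM
configuration):

	unsigned long mask = 1UL << host_node;

	/* Bind this guest_memfd's future allocations to one host node. */
	if (syscall(__NR_fbind, gmem_fd, MPOL_BIND, &mask,
		    sizeof(mask) * 8, 0) < 0)
		perror("fbind(guest_memfd)");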
Signed-off-by: Shivank Garg <shivankg@amd.com>
---
mm/mempolicy.c | 2 ++
virt/kvm/guest_memfd.c | 49 ++++++++++++++++++++++++++++++++++++++----
2 files changed, 47 insertions(+), 4 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 3a697080ecad..af2e1ef4dae7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -347,6 +347,7 @@ void __mpol_put(struct mempolicy *pol)
return;
kmem_cache_free(policy_cache, pol);
}
+EXPORT_SYMBOL(__mpol_put);
static void mpol_rebind_default(struct mempolicy *pol, const nodemask_t *nodes)
{
@@ -2599,6 +2600,7 @@ struct mempolicy *__mpol_dup(struct mempolicy *old)
atomic_set(&new->refcnt, 1);
return new;
}
+EXPORT_SYMBOL(__mpol_dup);
/* Slow path of a mempolicy comparison */
bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 2c6fcf7c3ec9..0237bda4382c 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,7 @@
#include <linux/kvm_host.h>
#include <linux/pagemap.h>
#include <linux/anon_inodes.h>
+#include <linux/mempolicy.h>
#include "kvm_mm.h"
@@ -11,6 +12,7 @@ struct kvm_gmem {
struct kvm *kvm;
struct xarray bindings;
struct list_head entry;
+ struct mempolicy *f_policy;
};
/**
@@ -87,7 +89,8 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
}
static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index,
- unsigned int order)
+ unsigned int order,
+ struct mempolicy *policy)
{
pgoff_t npages = 1UL << order;
pgoff_t huge_index = round_down(index, npages);
@@ -104,7 +107,7 @@ static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t index,
(loff_t)(huge_index + npages - 1) << PAGE_SHIFT))
return NULL;
- folio = filemap_alloc_folio(gfp, order);
+ folio = filemap_alloc_folio_mpol(gfp, order, policy);
if (!folio)
return NULL;
@@ -129,12 +132,26 @@ static struct folio *__kvm_gmem_get_folio(struct file *file, pgoff_t index,
bool allow_huge)
{
struct folio *folio = NULL;
+ struct kvm_gmem *gmem = file->private_data;
+ struct mempolicy *policy = NULL;
+
+ /*
+ * Take a reference on the policy under the RCU read lock so that a
+ * concurrent set_policy() cannot free it underneath us.
+ */
+ if (IS_ENABLED(CONFIG_NUMA)) {
+ rcu_read_lock();
+ policy = READ_ONCE(gmem->f_policy);
+ mpol_get(policy);
+ rcu_read_unlock();
+ }
if (gmem_2m_enabled && allow_huge)
- folio = kvm_gmem_get_huge_folio(file_inode(file), index, PMD_ORDER);
+ folio = kvm_gmem_get_huge_folio(file_inode(file), index, PMD_ORDER, policy);
if (!folio)
- folio = filemap_grab_folio(file_inode(file)->i_mapping, index);
+ folio = filemap_grab_folio_mpol(file_inode(file)->i_mapping, index, policy);
+
+ mpol_put(policy);
pr_debug("%s: allocate folio with PFN %lx order %d\n",
__func__, folio_pfn(folio), folio_order(folio));
@@ -338,6 +355,7 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
mutex_unlock(&kvm->slots_lock);
xa_destroy(&gmem->bindings);
+ mpol_put(gmem->f_policy);
kfree(gmem);
kvm_put_kvm(kvm);
@@ -356,10 +374,32 @@ static inline struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
return get_file_active(&slot->gmem.file);
}
+#ifdef CONFIG_NUMA
+static int kvm_gmem_set_policy(struct file *file, struct mempolicy *mpol)
+{
+ struct mempolicy *old, *new;
+ struct kvm_gmem *gmem = file->private_data;
+
+ new = mpol_dup(mpol);
+ if (IS_ERR(new))
+ return PTR_ERR(new);
+
+ old = gmem->f_policy;
+ WRITE_ONCE(gmem->f_policy, new);
+ synchronize_rcu();
+ mpol_put(old);
+
+ return 0;
+}
+#endif
+
static struct file_operations kvm_gmem_fops = {
.open = generic_file_open,
.release = kvm_gmem_release,
.fallocate = kvm_gmem_fallocate,
+#ifdef CONFIG_NUMA
+ .set_policy = kvm_gmem_set_policy,
+#endif
};
void kvm_gmem_init(struct module *module)
@@ -489,6 +529,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
kvm_get_kvm(kvm);
gmem->kvm = kvm;
+ gmem->f_policy = NULL;
xa_init(&gmem->bindings);
list_add(&gmem->entry, &inode->i_mapping->i_private_list);
--
2.34.1
* Re: [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd
From: Matthew Wilcox @ 2024-11-05 18:55 UTC
To: Shivank Garg
Cc: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
linux-mm, linux-api, linux-arch, kvm, chao.gao, pgonda,
thomas.lendacky, seanjc, luto, tglx, mingo, bp, dave.hansen, arnd,
pbonzini, kees, bharata, nikunj, michael.day, Neeraj.Upadhyay
On Tue, Nov 05, 2024 at 04:45:45PM +0000, Shivank Garg wrote:
> This patch series introduces fbind() syscall to support NUMA memory
> policies for KVM guest_memfd, allowing VMMs to configure memory placement
> for guest memory. This addresses the current limitation where guest_memfd
> allocations ignore NUMA policies, potentially impacting performance of
> memory-locality-sensitive workloads.
Why does guest_memfd ignore numa policies? The pagecache doesn't,
eg in vma_alloc_folio_noprof().
* Re: [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd
From: Shivank Garg @ 2024-11-07 8:54 UTC
To: Matthew Wilcox
Cc: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
linux-mm, linux-api, linux-arch, kvm, chao.gao, pgonda,
thomas.lendacky, seanjc, luto, tglx, mingo, bp, dave.hansen, arnd,
pbonzini, kees, bharata, nikunj, michael.day, Neeraj.Upadhyay,
linux-coco
Hi Matthew,
On 11/6/2024 12:25 AM, Matthew Wilcox wrote:
> On Tue, Nov 05, 2024 at 04:45:45PM +0000, Shivank Garg wrote:
>> This patch series introduces fbind() syscall to support NUMA memory
>> policies for KVM guest_memfd, allowing VMMs to configure memory placement
>> for guest memory. This addresses the current limitation where guest_memfd
>> allocations ignore NUMA policies, potentially impacting performance of
>> memory-locality-sensitive workloads.
>
> Why does guest_memfd ignore numa policies? The pagecache doesn't,
> eg in vma_alloc_folio_noprof().
guest_memfd doesn't have VMAs, so it can't store policy information in a
VMA or use vma_alloc_folio_noprof(), which fetches the mpol from the VMA.
The folio allocation path from guest_memfd typically looks like this...
kvm_gmem_get_folio
filemap_grab_folio
__filemap_get_folio
filemap_alloc_folio
__folio_alloc_node_noprof
-> goes to the buddy allocator
Hence, I am trying to have a version of filemap_alloc_folio() that takes an mpol.
Thanks,
Shivank
* Re: [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd
From: Matthew Wilcox @ 2024-11-07 15:10 UTC
To: Shivank Garg
Cc: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
linux-mm, linux-api, linux-arch, kvm, chao.gao, pgonda,
thomas.lendacky, seanjc, luto, tglx, mingo, bp, dave.hansen, arnd,
pbonzini, kees, bharata, nikunj, michael.day, Neeraj.Upadhyay,
linux-coco
On Thu, Nov 07, 2024 at 02:24:20PM +0530, Shivank Garg wrote:
> The folio allocation path from guest_memfd typically looks like this...
>
> kvm_gmem_get_folio
> filemap_grab_folio
> __filemap_get_folio
> filemap_alloc_folio
> __folio_alloc_node_noprof
> -> goes to the buddy allocator
>
> Hence, I am trying to have a version of filemap_alloc_folio() that takes an mpol.
It only takes that path if cpuset_do_page_mem_spread() is true. Is the
real problem that you're trying to solve that cpusets are being used
incorrectly?
Backing up, it seems like you want to make a change to the page cache,
you've had a long discussion with people who aren't the page cache
maintainer, and you all understand the pros and cons of everything,
and here you are dumping a solution on me without talking to me, even
though I was at Plumbers, you didn't find me to tell me I needed to go
to your talk.
So you haven't explained a damned thing to me, and I'm annoyed at you.
Do better. Starting with your cover letter.
* Re: [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd
From: Shivank Garg @ 2024-11-08 9:21 UTC
To: Matthew Wilcox
Cc: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
linux-mm, linux-api, linux-arch, kvm, chao.gao, pgonda,
thomas.lendacky, seanjc, luto, tglx, mingo, bp, dave.hansen, arnd,
pbonzini, kees, bharata, nikunj, michael.day, Neeraj.Upadhyay,
linux-coco
On 11/7/2024 8:40 PM, Matthew Wilcox wrote:
> On Thu, Nov 07, 2024 at 02:24:20PM +0530, Shivank Garg wrote:
>> The folio allocation path from guest_memfd typically looks like this...
>>
>> kvm_gmem_get_folio
>> filemap_grab_folio
>> __filemap_get_folio
>> filemap_alloc_folio
>> __folio_alloc_node_noprof
>> -> goes to the buddy allocator
>>
>> Hence, I am trying to have a version of filemap_alloc_folio() that takes an mpol.
>
> It only takes that path if cpuset_do_page_mem_spread() is true. Is the
> real problem that you're trying to solve that cpusets are being used
> incorrectly?
>
> Backing up, it seems like you want to make a change to the page cache,
> you've had a long discussion with people who aren't the page cache
> maintainer, and you all understand the pros and cons of everything,
> and here you are dumping a solution on me without talking to me, even
> though I was at Plumbers, you didn't find me to tell me I needed to go
> to your talk.
>
> So you haven't explained a damned thing to me, and I'm annoyed at you.
> Do better. Starting with your cover letter.
Hi Matthew,
I apologize for the misunderstanding and for not providing adequate context.
To clarify:
- You may recall this work from its earlier iteration as an
IOCTL-based approach, where you provided valuable review comments [1].
- I was not physically present at LPC. The discussion happened through
the mailing list [2] and lobby discussion with my colleagues who visited
Vienna.
- Based on feedback, particularly regarding the suggestion to consider
fbind() as a more generic solution, we shifted to the current approach.
I posted this as *RFC* specifically to gather feedback on the feasibility of
this approach and to ensure I'm heading in the right direction.
Would you be willing to help me understand:
1. What additional information would be helpful to you and other reviewers?
2. How can cpusets be used correctly to fix this? (Your point about
cpuset_do_page_mem_spread() is interesting and I'll investigate it more
thoroughly.)
I'll work on improving the cover letter to better explain the problem space
and proposed solution.
Thank you for the valuable feedback.
[1] https://lore.kernel.org/linux-mm/ZuimLtrpv1dXczf5@casper.infradead.org
[2] https://lore.kernel.org/linux-mm/ZvEga7srKhympQBt@intel.com
Best regards,
Shivank
* Re: [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd
From: Paolo Bonzini @ 2024-11-08 17:31 UTC
To: Matthew Wilcox, Shivank Garg
Cc: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
linux-mm, linux-api, linux-arch, kvm, chao.gao, pgonda,
thomas.lendacky, seanjc, luto, tglx, mingo, bp, dave.hansen, arnd,
kees, bharata, nikunj, michael.day, Neeraj.Upadhyay, linux-coco,
Linux API
On 11/7/24 16:10, Matthew Wilcox wrote:
> On Thu, Nov 07, 2024 at 02:24:20PM +0530, Shivank Garg wrote:
>> The folio allocation path from guest_memfd typically looks like this...
>>
>> kvm_gmem_get_folio
>> filemap_grab_folio
>> __filemap_get_folio
>> filemap_alloc_folio
>> __folio_alloc_node_noprof
>> -> goes to the buddy allocator
>>
>> Hence, I am trying to have a version of filemap_alloc_folio() that takes an mpol.
>
> It only takes that path if cpuset_do_page_mem_spread() is true. Is the
> real problem that you're trying to solve that cpusets are being used
> incorrectly?
If it's false it's not very different, it goes to alloc_pages_noprof().
Then it respects the process's policy, but the policy is not
customizable without mucking with state that is global to the process.
Taking a step back: the problem is that a VM can be configured to have
multiple guest-side NUMA nodes, each of which will pick memory from the
right NUMA node in the host. Without a per-file operation it's not
possible to do this on guest_memfd. The discussion was whether to use
ioctl() or a new system call. The discussion ended with the idea of
posting a *proposal* asking for *comments* as to whether the system call
would be useful in general beyond KVM.
Commenting on the system call itself, I am not sure I like the
file_operations entry, though I understand that it's the simplest way to
implement this in an RFC series. It's a bit surprising that fbind() is
a total no-op for everything except KVM's guest_memfd.
Maybe whatever you pass to fbind() could be stored in the struct file *,
and used as the default when creating VMAs; as if every mmap() was
followed by an mbind(), except that it also does the right thing with
MAP_POPULATE for example. Or maybe that's a horrible idea?
Adding linux-api to get input; original thread is at
https://lore.kernel.org/kvm/20241105164549.154700-1-shivankg@amd.com/.
Paolo
> Backing up, it seems like you want to make a change to the page cache,
> you've had a long discussion with people who aren't the page cache
> maintainer, and you all understand the pros and cons of everything,
> and here you are dumping a solution on me without talking to me, even
> though I was at Plumbers, you didn't find me to tell me I needed to go
> to your talk.
>
> So you haven't explained a damned thing to me, and I'm annoyed at you.
> Do better. Starting with your cover letter.
>
* Re: [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd
From: Vlastimil Babka @ 2024-11-11 11:02 UTC
To: Paolo Bonzini, Matthew Wilcox, Shivank Garg
Cc: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
linux-mm, linux-api, linux-arch, kvm, chao.gao, pgonda,
thomas.lendacky, seanjc, luto, tglx, mingo, bp, dave.hansen, arnd,
kees, bharata, nikunj, michael.day, Neeraj.Upadhyay, linux-coco
On 11/8/24 18:31, Paolo Bonzini wrote:
> On 11/7/24 16:10, Matthew Wilcox wrote:
>> On Thu, Nov 07, 2024 at 02:24:20PM +0530, Shivank Garg wrote:
>>> The folio allocation path from guest_memfd typically looks like this...
>>>
>>> kvm_gmem_get_folio
>>> filemap_grab_folio
>>> __filemap_get_folio
>>> filemap_alloc_folio
>>> __folio_alloc_node_noprof
>>> -> goes to the buddy allocator
>>>
>>> Hence, I am trying to have a version of filemap_alloc_folio() that takes an mpol.
>>
>> It only takes that path if cpuset_do_page_mem_spread() is true. Is the
>> real problem that you're trying to solve that cpusets are being used
>> incorrectly?
>
> If it's false it's not very different, it goes to alloc_pages_noprof().
> Then it respects the process's policy, but the policy is not
> customizable without mucking with state that is global to the process.
>
> Taking a step back: the problem is that a VM can be configured to have
> multiple guest-side NUMA nodes, each of which will pick memory from the
> right NUMA node in the host. Without a per-file operation it's not
> possible to do this on guest_memfd. The discussion was whether to use
> ioctl() or a new system call. The discussion ended with the idea of
> posting a *proposal* asking for *comments* as to whether the system call
> would be useful in general beyond KVM.
>
> Commenting on the system call itself I am not sure I like the
> file_operations entry, though I understand that it's the simplest way to
> implement this in an RFC series. It's a bit surprising that fbind() is
> a total no-op for everything except KVM's guest_memfd.
>
> Maybe whatever you pass to fbind() could be stored in the struct file *,
> and used as the default when creating VMAs; as if every mmap() was
> followed by an mbind(), except that it also does the right thing with
> MAP_POPULATE for example. Or maybe that's a horrible idea?
mbind() manpage has this:
The specified policy will be ignored for any MAP_SHARED
mappings in the specified memory range. Rather the pages will be allocated
according to the memory policy of the thread that caused the page to be
allocated. Again, this may not be the thread that called mbind().
So that seems like we're not very keen on having one user of a file set a
policy that would affect other users of the file?
Now the next paragraph of the manpage says that shmem is different, and
guest_memfd is more like shmem than a regular file.
My conclusion from that is that fbind() might be too broad and we don't want
this for actual filesystem-backed files? And if it's limited to guest_memfd,
it shouldn't be an fbind()?
> Adding linux-api to get input; original thread is at
> https://lore.kernel.org/kvm/20241105164549.154700-1-shivankg@amd.com/.
>
> Paolo
>
>> Backing up, it seems like you want to make a change to the page cache,
>> you've had a long discussion with people who aren't the page cache
>> maintainer, and you all understand the pros and cons of everything,
>> and here you are dumping a solution on me without talking to me, even
>> though I was at Plumbers, you didn't find me to tell me I needed to go
>> to your talk.
>>
>> So you haven't explained a damned thing to me, and I'm annoyed at you.
>> Do better. Starting with your cover letter.
>>
>
>
* Re: [RFC PATCH 0/4] Add fbind() and NUMA mempolicy support for KVM guest_memfd
From: David Hildenbrand @ 2024-11-11 22:14 UTC
To: Vlastimil Babka, Paolo Bonzini, Matthew Wilcox, Shivank Garg
Cc: x86, viro, brauner, jack, akpm, linux-kernel, linux-fsdevel,
linux-mm, linux-api, linux-arch, kvm, chao.gao, pgonda,
thomas.lendacky, seanjc, luto, tglx, mingo, bp, dave.hansen, arnd,
kees, bharata, nikunj, michael.day, Neeraj.Upadhyay, linux-coco
On 11.11.24 12:02, Vlastimil Babka wrote:
> On 11/8/24 18:31, Paolo Bonzini wrote:
>> On 11/7/24 16:10, Matthew Wilcox wrote:
>>> On Thu, Nov 07, 2024 at 02:24:20PM +0530, Shivank Garg wrote:
>>>> The folio allocation path from guest_memfd typically looks like this...
>>>>
>>>> kvm_gmem_get_folio
>>>> filemap_grab_folio
>>>> __filemap_get_folio
>>>> filemap_alloc_folio
>>>> __folio_alloc_node_noprof
>>>> -> goes to the buddy allocator
>>>>
>>>> Hence, I am trying to have a version of filemap_alloc_folio() that takes an mpol.
>>>
>>> It only takes that path if cpuset_do_page_mem_spread() is true. Is the
>>> real problem that you're trying to solve that cpusets are being used
>>> incorrectly?
>>
>> If it's false it's not very different, it goes to alloc_pages_noprof().
>> Then it respects the process's policy, but the policy is not
>> customizable without mucking with state that is global to the process.
>>
>> Taking a step back: the problem is that a VM can be configured to have
>> multiple guest-side NUMA nodes, each of which will pick memory from the
>> right NUMA node in the host. Without a per-file operation it's not
>> possible to do this on guest_memfd. The discussion was whether to use
>> ioctl() or a new system call. The discussion ended with the idea of
>> posting a *proposal* asking for *comments* as to whether the system call
>> would be useful in general beyond KVM.
>>
>> Commenting on the system call itself I am not sure I like the
>> file_operations entry, though I understand that it's the simplest way to
>> implement this in an RFC series. It's a bit surprising that fbind() is
>> a total no-op for everything except KVM's guest_memfd.
>>
>> Maybe whatever you pass to fbind() could be stored in the struct file *,
>> and used as the default when creating VMAs; as if every mmap() was
>> followed by an mbind(), except that it also does the right thing with
>> MAP_POPULATE for example. Or maybe that's a horrible idea?
>
> mbind() manpage has this:
>
> The specified policy will be ignored for any MAP_SHARED
> mappings in the specified memory range. Rather the pages will be allocated
> according to the memory policy of the thread that caused the page to be
> allocated. Again, this may not be the thread that called mbind().
I recall discussing that a couple of times in the context of QEMU. I
have some faint recollection that the manpage is a bit imprecise:
IIRC, hugetlb also ends up using the VMA policy for MAP_SHARED mappings
during faults (huge_node()->get_vma_policy()) -- but in contrast to
shmem, it doesn't end up becoming the "shared" policy for the file, used
when accessed through other VMAs.
>
> So that seems like we're not very keen on having one user of a file set a
> policy that would affect other users of the file?
For VMs in QEMU we really want to configure the policy once in the main
process and have all other processes (e.g., vhost-user) not worry about
that when they mmap() guest memory.
With shmem this works by "shared policy" design (below). For hugetlb, we
rely on the fact that mbind()+MADV_POPULATE_WRITE allows us to
preallocate NUMA-aware. So with hugetlb we really preallocate all guest
RAM to guarantee the NUMA placement.
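For context, that hugetlb preallocation trick looks roughly like this from
userspace (a sketch; hugetlb_fd, len, nodemask and maxnode are whatever the
VMM already has at hand):

	void *ptr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			 hugetlb_fd, 0);

	/* Set the VMA policy before anything is faulted in ... */
	mbind(ptr, len, MPOL_BIND, &nodemask, maxnode, 0);

	/* ... then fault it all in now, so pages land on the bound node. */
	madvise(ptr, len, MADV_POPULATE_WRITE);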
It would not be the worst idea to have a clean interface to configure
file-range policies instead of having this weird shmem mbind() behavior
and the hugetlb hack.
Having that said, other filesystem are rarely used for backing VMs, at
least in combination with NUMA. So nobody really cared that much for now.
Maybe fbind() would primarily only be useful for in-memory filesystems
(shmem/hugetlb/guest_memfd).
>
> Now the next paragraph of the manpage says that shmem is different, and
> guest_memfd is more like shmem than a regular file.
>
> My conclusion from that is that fbind() might be too broad and we don't want
> this for actual filesystem-backed files? And if it's limited to guest_memfd,
> it shouldn't be an fbind()?
I was just once again diving into how mbind() on shmem is handled. And
in fact, mbind() will call vma->set_policy() to update the per
file-range policy. I wonder why we didn't do the same for hugetlb ...
but of course, hugetlb must be special in any possible way.
Not saying it's the best idea, but as we are talking about mmap support
of guest_memfd (only allowing to fault in shared/faultable pages), one
*could* look into implementing mbind()+vma->set_policy() for guest_memfd
similar to how shmem handles it.
It would require a (temporary) dummy VMA in the worst case (all private
memory).
It sounds a bit weird, though, to require a VMA to configure this. But
at least it's similar to what shmem does ...
--
Cheers,
David / dhildenb