[PATCH v2 0/4] mm/userfaultfd: modulize memory types

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/4] mm/userfaultfd: modulize memory types
@ 2025-06-27 15:46 Peter Xu
  2025-06-27 15:46 ` [PATCH v2 1/4] mm: Introduce vm_uffd_ops API Peter Xu
                   ` (4 more replies)
  0 siblings, 5 replies; 42+ messages in thread
From: Peter Xu @ 2025-06-27 15:46 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Vlastimil Babka, Suren Baghdasaryan, Muchun Song, Mike Rapoport,
	Lorenzo Stoakes, Hugh Dickins, Andrew Morton, James Houghton,
	peterx, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

[based on latest akpm/mm-new of June 27th, commit 9be7387ae43f]

v2 changelog:
- Patch 1
  - update English in commit log [David]
  - move vm_uffd_ops definition to userfaultfd_k.h [Mike]
- Patch 4
  - fix sparse warning on bitwise type conversions [syzbot]
  - Commit message updates on explanation of vma_can_userfault check [James]

v1: https://lore.kernel.org/r/20250620190342.1780170-1-peterx@redhat.com

This series is an alternative proposal of what Nikita proposed here on the
initial three patches:

  https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com

This is not yet relevant to any guest-memfd support, but paving way for it.
Here, the major goal is to make kernel modules be able to opt-in with any
form of userfaultfd supports, like guest-memfd.  This alternative option
should hopefully be cleaner, and avoid leaking userfault details into
vm_ops.fault().

It also means this series does not depend on anything.  It's a pure
refactoring of userfaultfd internals to provide a generic API, so that
other types of files, especially RAM based, can support userfaultfd without
touching mm/ at all.

To achieve it, this series introduced a file operation called vm_uffd_ops.
The ops needs to be provided when a file type supports any of userfaultfd.

With that, I moved both hugetlbfs and shmem over.

Hugetlbfs is still very special that it will only use partial of the
vm_uffd_ops API, due to similar reason why hugetlb_vm_op_fault() has a
BUG() and so far hard-coded into core mm.  But this should still be better,
because at least hugetlbfs is still always involved in feature probing
(e.g. where it used to not support ZEROPAGE and we have a hard-coded line
to fail that, and some more).  Meanwhile after this series, shmem will be
completely converted to the new vm_uffd_ops API; the final vm_uffd_ops for
shmem looks like this:

static const vm_uffd_ops shmem_uffd_ops = {
	.uffd_features	= 	__VM_UFFD_FLAGS,
	.uffd_ioctls	= 	BIT(_UFFDIO_COPY) |
				BIT(_UFFDIO_ZEROPAGE) |
				BIT(_UFFDIO_WRITEPROTECT) |
				BIT(_UFFDIO_CONTINUE) |
				BIT(_UFFDIO_POISON),
	.uffd_get_folio	=	shmem_uffd_get_folio,
	.uffd_copy	=	shmem_mfill_atomic_pte,
};

As I mentioned in one of my reply to Nikita, I don't like the current
interface of uffd_copy(), but this will be the minimum change version of
such API to support complete extrenal-module-ready userfaultfd.  Here, very
minimal change will be needed from shmem side to support that.

Meanwhile, the vm_uffd_ops is also not the only place one will need to
provide to support userfaultfd.  Normally vm_ops.fault() will also need to
be updated, but that's a generic function and it'll play together with the
new vm_uffd_ops to make everything fly.

No functional change expected at all after the whole series applied.  There
might be some slightly stricter check on uffd ops here and there in the
last patch, but that really shouldn't stand out anywhere to anyone.

For testing: besides the cross-compilation tests, I did also try with
uffd-stress in a VM to measure any perf difference before/after the change;
The static call becomes a pointer now.  I really cannot measure anything
different, which is more or less expected.

Comments welcomed, thanks.

Peter Xu (4):
  mm: Introduce vm_uffd_ops API
  mm/shmem: Support vm_uffd_ops API
  mm/hugetlb: Support vm_uffd_ops API
  mm: Apply vm_uffd_ops API to core mm

 include/linux/mm.h            |   9 +++
 include/linux/shmem_fs.h      |  14 -----
 include/linux/userfaultfd_k.h |  98 +++++++++++++++++++----------
 mm/hugetlb.c                  |  19 ++++++
 mm/shmem.c                    |  28 ++++++++-
 mm/userfaultfd.c              | 115 +++++++++++++++++++++++++---------
 6 files changed, 207 insertions(+), 76 deletions(-)

-- 
2.49.0

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-06-27 15:46 [PATCH v2 0/4] mm/userfaultfd: modulize memory types Peter Xu
@ 2025-06-27 15:46 ` Peter Xu
  2025-06-29  8:50   ` Mike Rapoport
  2025-06-30 10:15   ` Lorenzo Stoakes
  2025-06-27 15:46 ` [PATCH v2 2/4] mm/shmem: Support " Peter Xu
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 42+ messages in thread
From: Peter Xu @ 2025-06-27 15:46 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Vlastimil Babka, Suren Baghdasaryan, Muchun Song, Mike Rapoport,
	Lorenzo Stoakes, Hugh Dickins, Andrew Morton, James Houghton,
	peterx, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

Introduce a generic userfaultfd API for vm_operations_struct, so that one
vma, especially when as a module, can support userfaults without modifying
the core files.  More importantly, when the module can be compiled out of
the kernel.

So, instead of having core mm referencing modules that may not ever exist,
we need to have modules opt-in on core mm hooks instead.

After this API applied, if a module wants to support userfaultfd, the
module should only need to touch its own file and properly define
vm_uffd_ops, instead of changing anything in core mm.

Note that such API will not work for anonymous. Core mm will process
anonymous memory separately for userfault operations like before.

This patch only introduces the API alone so that we can start to move
existing users over but without breaking them.

Currently the uffd_copy() API is almost designed to be the simplistic with
minimum mm changes to move over to the API.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h            |  9 ++++++
 include/linux/userfaultfd_k.h | 52 +++++++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef40f68c1183..6a5447bd43fd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -576,6 +576,8 @@ struct vm_fault {
 					 */
 };
 
+struct vm_uffd_ops;
+
 /*
  * These are the virtual MM functions - opening of an area, closing and
  * unmapping it (needed to keep files on disk up-to-date etc), pointer
@@ -653,6 +655,13 @@ struct vm_operations_struct {
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,
 					  unsigned long addr);
+#ifdef CONFIG_USERFAULTFD
+	/*
+	 * Userfaultfd related ops.  Modules need to define this to support
+	 * userfaultfd.
+	 */
+	const struct vm_uffd_ops *userfaultfd_ops;
+#endif
 };
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index df85330bcfa6..c9a093c4502b 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -92,6 +92,58 @@ enum mfill_atomic_mode {
 	NR_MFILL_ATOMIC_MODES,
 };
 
+/* VMA userfaultfd operations */
+struct vm_uffd_ops {
+	/**
+	 * @uffd_features: features supported in bitmask.
+	 *
+	 * When the ops is defined, the driver must set non-zero features
+	 * to be a subset (or all) of: VM_UFFD_MISSING|WP|MINOR.
+	 */
+	unsigned long uffd_features;
+	/**
+	 * @uffd_ioctls: ioctls supported in bitmask.
+	 *
+	 * Userfaultfd ioctls supported by the module.  Below will always
+	 * be supported by default whenever a module provides vm_uffd_ops:
+	 *
+	 *   _UFFDIO_API, _UFFDIO_REGISTER, _UFFDIO_UNREGISTER, _UFFDIO_WAKE
+	 *
+	 * The module needs to provide all the rest optionally supported
+	 * ioctls.  For example, when VM_UFFD_MISSING was supported,
+	 * _UFFDIO_COPY must be supported as ioctl, while _UFFDIO_ZEROPAGE
+	 * is optional.
+	 */
+	unsigned long uffd_ioctls;
+	/**
+	 * uffd_get_folio: Handler to resolve UFFDIO_CONTINUE request.
+	 *
+	 * @inode: the inode for folio lookup
+	 * @pgoff: the pgoff of the folio
+	 * @folio: returned folio pointer
+	 *
+	 * Return: zero if succeeded, negative for errors.
+	 */
+	int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
+			      struct folio **folio);
+	/**
+	 * uffd_copy: Handler to resolve UFFDIO_COPY|ZEROPAGE request.
+	 *
+	 * @dst_pmd: target pmd to resolve page fault
+	 * @dst_vma: target vma
+	 * @dst_addr: target virtual address
+	 * @src_addr: source address to copy from
+	 * @flags: userfaultfd request flags
+	 * @foliop: previously allocated folio
+	 *
+	 * Return: zero if succeeded, negative for errors.
+	 */
+	int (*uffd_copy)(pmd_t *dst_pmd, struct vm_area_struct *dst_vma,
+			 unsigned long dst_addr, unsigned long src_addr,
+			 uffd_flags_t flags, struct folio **foliop);
+};
+typedef struct vm_uffd_ops vm_uffd_ops;
+
 #define MFILL_ATOMIC_MODE_BITS (const_ilog2(NR_MFILL_ATOMIC_MODES - 1) + 1)
 #define MFILL_ATOMIC_BIT(nr) BIT(MFILL_ATOMIC_MODE_BITS + (nr))
 #define MFILL_ATOMIC_FLAG(nr) ((__force uffd_flags_t) MFILL_ATOMIC_BIT(nr))
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 2/4] mm/shmem: Support vm_uffd_ops API
  2025-06-27 15:46 [PATCH v2 0/4] mm/userfaultfd: modulize memory types Peter Xu
  2025-06-27 15:46 ` [PATCH v2 1/4] mm: Introduce vm_uffd_ops API Peter Xu
@ 2025-06-27 15:46 ` Peter Xu
  2025-06-29  8:51   ` Mike Rapoport
  2025-06-27 15:46 ` [PATCH v2 3/4] mm/hugetlb: " Peter Xu
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2025-06-27 15:46 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Vlastimil Babka, Suren Baghdasaryan, Muchun Song, Mike Rapoport,
	Lorenzo Stoakes, Hugh Dickins, Andrew Morton, James Houghton,
	peterx, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

Add support for the new vm_uffd_ops API for shmem.  Note that this only
introduces the support, the API is not yet used by core mm.

Due to the tailored uffd_copy() API, shmem is extremely easy to support it
by reusing the existing mfill function.

It only needs a separate uffd_get_folio() definition but that's oneliner.

Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/shmem.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index 2b19965d27df..9a8b8dd4709b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3151,6 +3151,13 @@ static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap,
 #endif /* CONFIG_TMPFS_QUOTA */
 
 #ifdef CONFIG_USERFAULTFD
+
+static int shmem_uffd_get_folio(struct inode *inode, pgoff_t pgoff,
+				struct folio **folio)
+{
+	return shmem_get_folio(inode, pgoff, 0, folio, SGP_NOALLOC);
+}
+
 int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
 			   struct vm_area_struct *dst_vma,
 			   unsigned long dst_addr,
@@ -5194,6 +5201,19 @@ static int shmem_error_remove_folio(struct address_space *mapping,
 	return 0;
 }
 
+#ifdef CONFIG_USERFAULTFD
+static const vm_uffd_ops shmem_uffd_ops = {
+	.uffd_features	= 	__VM_UFFD_FLAGS,
+	.uffd_ioctls	= 	BIT(_UFFDIO_COPY) |
+				BIT(_UFFDIO_ZEROPAGE) |
+				BIT(_UFFDIO_WRITEPROTECT) |
+				BIT(_UFFDIO_CONTINUE) |
+				BIT(_UFFDIO_POISON),
+	.uffd_get_folio	=	shmem_uffd_get_folio,
+	.uffd_copy	=	shmem_mfill_atomic_pte,
+};
+#endif
+
 static const struct address_space_operations shmem_aops = {
 	.dirty_folio	= noop_dirty_folio,
 #ifdef CONFIG_TMPFS
@@ -5296,6 +5316,9 @@ static const struct vm_operations_struct shmem_vm_ops = {
 	.set_policy     = shmem_set_policy,
 	.get_policy     = shmem_get_policy,
 #endif
+#ifdef CONFIG_USERFAULTFD
+	.userfaultfd_ops = &shmem_uffd_ops,
+#endif
 };
 
 static const struct vm_operations_struct shmem_anon_vm_ops = {
@@ -5305,6 +5328,9 @@ static const struct vm_operations_struct shmem_anon_vm_ops = {
 	.set_policy     = shmem_set_policy,
 	.get_policy     = shmem_get_policy,
 #endif
+#ifdef CONFIG_USERFAULTFD
+	.userfaultfd_ops = &shmem_uffd_ops,
+#endif
 };
 
 int shmem_init_fs_context(struct fs_context *fc)
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 3/4] mm/hugetlb: Support vm_uffd_ops API
  2025-06-27 15:46 [PATCH v2 0/4] mm/userfaultfd: modulize memory types Peter Xu
  2025-06-27 15:46 ` [PATCH v2 1/4] mm: Introduce vm_uffd_ops API Peter Xu
  2025-06-27 15:46 ` [PATCH v2 2/4] mm/shmem: Support " Peter Xu
@ 2025-06-27 15:46 ` Peter Xu
  2025-06-29  8:52   ` Mike Rapoport
  2025-06-27 15:46 ` [PATCH v2 4/4] mm: Apply vm_uffd_ops API to core mm Peter Xu
  2025-06-30 10:29 ` [PATCH v2 0/4] mm/userfaultfd: modulize memory types Lorenzo Stoakes
  4 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2025-06-27 15:46 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Vlastimil Babka, Suren Baghdasaryan, Muchun Song, Mike Rapoport,
	Lorenzo Stoakes, Hugh Dickins, Andrew Morton, James Houghton,
	peterx, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

Add support for the new vm_uffd_ops API for hugetlb.  Note that this only
introduces the support, the API is not yet used by core mm.

Due to legacy reasons, it's still not trivial to move hugetlb completely to
the API (like shmem).  But it will still use uffd_features and uffd_ioctls
properly on the API because that's pretty general.

Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 11d5668ff6e7..ccd2be152d36 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5457,6 +5457,22 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
 	return 0;
 }
 
+#ifdef CONFIG_USERFAULTFD
+static const vm_uffd_ops hugetlb_uffd_ops = {
+	.uffd_features	= 	__VM_UFFD_FLAGS,
+	/* _UFFDIO_ZEROPAGE not supported */
+	.uffd_ioctls	= 	BIT(_UFFDIO_COPY) |
+				BIT(_UFFDIO_WRITEPROTECT) |
+				BIT(_UFFDIO_CONTINUE) |
+				BIT(_UFFDIO_POISON),
+	/*
+	 * Hugetlbfs still has its own hard-coded handler in userfaultfd,
+	 * due to limitations similar to vm_operations_struct.fault().
+	 * TODO: generalize it to use the API functions.
+	 */
+};
+#endif
+
 /*
  * When a new function is introduced to vm_operations_struct and added
  * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops.
@@ -5470,6 +5486,9 @@ const struct vm_operations_struct hugetlb_vm_ops = {
 	.close = hugetlb_vm_op_close,
 	.may_split = hugetlb_vm_op_split,
 	.pagesize = hugetlb_vm_op_pagesize,
+#ifdef CONFIG_USERFAULTFD
+	.userfaultfd_ops = &hugetlb_uffd_ops,
+#endif
 };
 
 static pte_t make_huge_pte(struct vm_area_struct *vma, struct folio *folio,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 4/4] mm: Apply vm_uffd_ops API to core mm
  2025-06-27 15:46 [PATCH v2 0/4] mm/userfaultfd: modulize memory types Peter Xu
                   ` (2 preceding siblings ...)
  2025-06-27 15:46 ` [PATCH v2 3/4] mm/hugetlb: " Peter Xu
@ 2025-06-27 15:46 ` Peter Xu
  2025-06-29  8:55   ` Mike Rapoport
  2025-06-30 10:29 ` [PATCH v2 0/4] mm/userfaultfd: modulize memory types Lorenzo Stoakes
  4 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2025-06-27 15:46 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Vlastimil Babka, Suren Baghdasaryan, Muchun Song, Mike Rapoport,
	Lorenzo Stoakes, Hugh Dickins, Andrew Morton, James Houghton,
	peterx, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

This patch completely moves the old userfaultfd core to use the new
vm_uffd_ops API.  After this change, existing file systems will start to
use the new API for userfault operations.

When at it, moving vma_can_userfault() into mm/userfaultfd.c instead,
because it's getting too big.  It's only used in slow paths so it shouldn't
be an issue.  Move the pte marker check before wp_async, which might be
more intuitive because wp_async depends on pte markers.  That shouldn't
cause any functional change though because only one check would take effect
depending on whether pte marker was selected in config.

This will also remove quite some hard-coded checks for either shmem or
hugetlbfs.  Now all the old checks should still work but with vm_uffd_ops.

Note that anonymous memory will still need to be processed separately
because it doesn't have vm_ops at all.

Reviewed-by: James Houghton <jthoughton@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/shmem_fs.h      |  14 -----
 include/linux/userfaultfd_k.h |  46 ++++----------
 mm/shmem.c                    |   2 +-
 mm/userfaultfd.c              | 115 +++++++++++++++++++++++++---------
 4 files changed, 101 insertions(+), 76 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 6d0f9c599ff7..2f5b7b295cf6 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -195,20 +195,6 @@ static inline pgoff_t shmem_fallocend(struct inode *inode, pgoff_t eof)
 extern bool shmem_charge(struct inode *inode, long pages);
 extern void shmem_uncharge(struct inode *inode, long pages);
 
-#ifdef CONFIG_USERFAULTFD
-#ifdef CONFIG_SHMEM
-extern int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
-				  struct vm_area_struct *dst_vma,
-				  unsigned long dst_addr,
-				  unsigned long src_addr,
-				  uffd_flags_t flags,
-				  struct folio **foliop);
-#else /* !CONFIG_SHMEM */
-#define shmem_mfill_atomic_pte(dst_pmd, dst_vma, dst_addr, \
-			       src_addr, flags, foliop) ({ BUG(); 0; })
-#endif /* CONFIG_SHMEM */
-#endif /* CONFIG_USERFAULTFD */
-
 /*
  * Used space is stored as unsigned 64-bit value in bytes but
  * quota core supports only signed 64-bit values so use that
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index c9a093c4502b..1aa9b246fb84 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -149,9 +149,14 @@ typedef struct vm_uffd_ops vm_uffd_ops;
 #define MFILL_ATOMIC_FLAG(nr) ((__force uffd_flags_t) MFILL_ATOMIC_BIT(nr))
 #define MFILL_ATOMIC_MODE_MASK ((__force uffd_flags_t) (MFILL_ATOMIC_BIT(0) - 1))
 
+static inline enum mfill_atomic_mode uffd_flags_get_mode(uffd_flags_t flags)
+{
+	return (__force enum mfill_atomic_mode)(flags & MFILL_ATOMIC_MODE_MASK);
+}
+
 static inline bool uffd_flags_mode_is(uffd_flags_t flags, enum mfill_atomic_mode expected)
 {
-	return (flags & MFILL_ATOMIC_MODE_MASK) == ((__force uffd_flags_t) expected);
+	return uffd_flags_get_mode(flags) == expected;
 }
 
 static inline uffd_flags_t uffd_flags_set_mode(uffd_flags_t flags, enum mfill_atomic_mode mode)
@@ -260,41 +265,16 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 	return vma->vm_flags & __VM_UFFD_FLAGS;
 }
 
-static inline bool vma_can_userfault(struct vm_area_struct *vma,
-				     vm_flags_t vm_flags,
-				     bool wp_async)
+static inline const vm_uffd_ops *vma_get_uffd_ops(struct vm_area_struct *vma)
 {
-	vm_flags &= __VM_UFFD_FLAGS;
-
-	if (vma->vm_flags & VM_DROPPABLE)
-		return false;
-
-	if ((vm_flags & VM_UFFD_MINOR) &&
-	    (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
-		return false;
-
-	/*
-	 * If wp async enabled, and WP is the only mode enabled, allow any
-	 * memory type.
-	 */
-	if (wp_async && (vm_flags == VM_UFFD_WP))
-		return true;
-
-#ifndef CONFIG_PTE_MARKER_UFFD_WP
-	/*
-	 * If user requested uffd-wp but not enabled pte markers for
-	 * uffd-wp, then shmem & hugetlbfs are not supported but only
-	 * anonymous.
-	 */
-	if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
-		return false;
-#endif
-
-	/* By default, allow any of anon|shmem|hugetlb */
-	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
-	    vma_is_shmem(vma);
+	if (vma->vm_ops && vma->vm_ops->userfaultfd_ops)
+		return vma->vm_ops->userfaultfd_ops;
+	return NULL;
 }
 
+bool vma_can_userfault(struct vm_area_struct *vma,
+		       unsigned long vm_flags, bool wp_async);
+
 static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
 {
 	struct userfaultfd_ctx *uffd_ctx = vma->vm_userfaultfd_ctx.ctx;
diff --git a/mm/shmem.c b/mm/shmem.c
index 9a8b8dd4709b..fc85002dd8af 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3158,7 +3158,7 @@ static int shmem_uffd_get_folio(struct inode *inode, pgoff_t pgoff,
 	return shmem_get_folio(inode, pgoff, 0, folio, SGP_NOALLOC);
 }
 
-int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
+static int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
 			   struct vm_area_struct *dst_vma,
 			   unsigned long dst_addr,
 			   unsigned long src_addr,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index cbed91b09640..52d7d5f144b8 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -14,12 +14,48 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/mmu_notifier.h>
 #include <linux/hugetlb.h>
-#include <linux/shmem_fs.h>
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
 #include "internal.h"
 #include "swap.h"
 
+bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
+		       bool wp_async)
+{
+	unsigned long supported;
+
+	if (vma->vm_flags & VM_DROPPABLE)
+		return false;
+
+	vm_flags &= __VM_UFFD_FLAGS;
+
+#ifndef CONFIG_PTE_MARKER_UFFD_WP
+	/*
+	 * If user requested uffd-wp but not enabled pte markers for
+	 * uffd-wp, then any file system (like shmem or hugetlbfs) are not
+	 * supported but only anonymous.
+	 */
+	if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
+		return false;
+#endif
+	/*
+	 * If wp async enabled, and WP is the only mode enabled, allow any
+	 * memory type.
+	 */
+	if (wp_async && (vm_flags == VM_UFFD_WP))
+		return true;
+
+	if (vma_is_anonymous(vma))
+		/* Anonymous has no page cache, MINOR not supported */
+		supported = VM_UFFD_MISSING | VM_UFFD_WP;
+	else if (vma_get_uffd_ops(vma))
+		supported = vma_get_uffd_ops(vma)->uffd_features;
+	else
+		return false;
+
+	return !(vm_flags & (~supported));
+}
+
 static __always_inline
 bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
 {
@@ -384,11 +420,15 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
 {
 	struct inode *inode = file_inode(dst_vma->vm_file);
 	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
+	const vm_uffd_ops *uffd_ops = vma_get_uffd_ops(dst_vma);
 	struct folio *folio;
 	struct page *page;
 	int ret;
 
-	ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
+	if (WARN_ON_ONCE(!uffd_ops || !uffd_ops->uffd_get_folio))
+		return -EINVAL;
+
+	ret = uffd_ops->uffd_get_folio(inode, pgoff, &folio);
 	/* Our caller expects us to return -EFAULT if we failed to find folio */
 	if (ret == -ENOENT)
 		ret = -EFAULT;
@@ -504,18 +544,6 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 	u32 hash;
 	struct address_space *mapping;
 
-	/*
-	 * There is no default zero huge page for all huge page sizes as
-	 * supported by hugetlb.  A PMD_SIZE huge pages may exist as used
-	 * by THP.  Since we can not reliably insert a zero page, this
-	 * feature is not supported.
-	 */
-	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) {
-		up_read(&ctx->map_changing_lock);
-		uffd_mfill_unlock(dst_vma);
-		return -EINVAL;
-	}
-
 	src_addr = src_start;
 	dst_addr = dst_start;
 	copied = 0;
@@ -686,14 +714,55 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
 			err = mfill_atomic_pte_zeropage(dst_pmd,
 						 dst_vma, dst_addr);
 	} else {
-		err = shmem_mfill_atomic_pte(dst_pmd, dst_vma,
-					     dst_addr, src_addr,
-					     flags, foliop);
+		const vm_uffd_ops *uffd_ops = vma_get_uffd_ops(dst_vma);
+
+		if (WARN_ON_ONCE(!uffd_ops || !uffd_ops->uffd_copy)) {
+			err = -EINVAL;
+		} else {
+			err = uffd_ops->uffd_copy(dst_pmd, dst_vma,
+						  dst_addr, src_addr,
+						  flags, foliop);
+		}
 	}
 
 	return err;
 }
 
+static inline bool
+vma_uffd_ops_supported(struct vm_area_struct *vma, uffd_flags_t flags)
+{
+	enum mfill_atomic_mode mode = uffd_flags_get_mode(flags);
+	const vm_uffd_ops *uffd_ops;
+	unsigned long uffd_ioctls;
+
+	if ((flags & MFILL_ATOMIC_WP) && !(vma->vm_flags & VM_UFFD_WP))
+		return false;
+
+	/* Anonymous supports everything except CONTINUE */
+	if (vma_is_anonymous(vma))
+		return mode != MFILL_ATOMIC_CONTINUE;
+
+	uffd_ops = vma_get_uffd_ops(vma);
+	if (!uffd_ops)
+		return false;
+
+	uffd_ioctls = uffd_ops->uffd_ioctls;
+	switch (mode) {
+	case MFILL_ATOMIC_COPY:
+		return uffd_ioctls & BIT(_UFFDIO_COPY);
+	case MFILL_ATOMIC_ZEROPAGE:
+		return uffd_ioctls & BIT(_UFFDIO_ZEROPAGE);
+	case MFILL_ATOMIC_CONTINUE:
+		if (!(vma->vm_flags & VM_SHARED))
+			return false;
+		return uffd_ioctls & BIT(_UFFDIO_CONTINUE);
+	case MFILL_ATOMIC_POISON:
+		return uffd_ioctls & BIT(_UFFDIO_POISON);
+	default:
+		return false;
+	}
+}
+
 static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 					    unsigned long dst_start,
 					    unsigned long src_start,
@@ -752,11 +821,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 	    dst_vma->vm_flags & VM_SHARED))
 		goto out_unlock;
 
-	/*
-	 * validate 'mode' now that we know the dst_vma: don't allow
-	 * a wrprotect copy if the userfaultfd didn't register as WP.
-	 */
-	if ((flags & MFILL_ATOMIC_WP) && !(dst_vma->vm_flags & VM_UFFD_WP))
+	if (!vma_uffd_ops_supported(dst_vma, flags))
 		goto out_unlock;
 
 	/*
@@ -766,12 +831,6 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 		return  mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
 					     src_start, len, flags);
 
-	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
-		goto out_unlock;
-	if (!vma_is_shmem(dst_vma) &&
-	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
-		goto out_unlock;
-
 	while (src_addr < src_start + len) {
 		pmd_t dst_pmdval;
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-06-27 15:46 ` [PATCH v2 1/4] mm: Introduce vm_uffd_ops API Peter Xu
@ 2025-06-29  8:50   ` Mike Rapoport
  2025-07-02 19:30     ` Peter Xu
  2025-06-30 10:15   ` Lorenzo Stoakes
  1 sibling, 1 reply; 42+ messages in thread
From: Mike Rapoport @ 2025-06-29  8:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Vlastimil Babka, Suren Baghdasaryan,
	Muchun Song, Lorenzo Stoakes, Hugh Dickins, Andrew Morton,
	James Houghton, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

Hi Peter,

On Fri, Jun 27, 2025 at 11:46:52AM -0400, Peter Xu wrote:
> Introduce a generic userfaultfd API for vm_operations_struct, so that one
> vma, especially when as a module, can support userfaults without modifying
> the core files.  More importantly, when the module can be compiled out of
> the kernel.
> 
> So, instead of having core mm referencing modules that may not ever exist,
> we need to have modules opt-in on core mm hooks instead.
> 
> After this API applied, if a module wants to support userfaultfd, the
> module should only need to touch its own file and properly define
> vm_uffd_ops, instead of changing anything in core mm.

I liked the changelog update you proposed in v1 thread. I took liberty to
slightly update it and here's what I've got:

  Currently, most of the userfaultfd features are implemented directly in the
  core mm.  It will invoke VMA specific functions whenever necessary.  So far
  it is fine because it almost only interacts with shmem and hugetlbfs.

  Introduce a generic userfaultfd API extension for vm_operations_struct,
  so that any code that implements vm_operations_struct (including kernel
  modules that can be compiled separately from the kernel core) can support
  userfaults without modifying the core files.

  With this API applied, if a module wants to support userfaultfd, the
  module should only need to properly define vm_uffd_ops and hook it to
  vm_operations_struct, instead of changing anything in core mm.

> Note that such API will not work for anonymous. Core mm will process
> anonymous memory separately for userfault operations like before.

Maybe:

  This API will not work for anonymous memory. Handling of userfault
  operations for anonymous memory remains unchanged in core mm.
 
> This patch only introduces the API alone so that we can start to move
> existing users over but without breaking them.

Please use imperative mood, e.g.
 
  Only introduce the new API so that ...
 
> Currently the uffd_copy() API is almost designed to be the simplistic with
> minimum mm changes to move over to the API.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/mm.h            |  9 ++++++
>  include/linux/userfaultfd_k.h | 52 +++++++++++++++++++++++++++++++++++
>  2 files changed, 61 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ef40f68c1183..6a5447bd43fd 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -576,6 +576,8 @@ struct vm_fault {
>  					 */
>  };
>  
> +struct vm_uffd_ops;
> +
>  /*
>   * These are the virtual MM functions - opening of an area, closing and
>   * unmapping it (needed to keep files on disk up-to-date etc), pointer
> @@ -653,6 +655,13 @@ struct vm_operations_struct {
>  	 */
>  	struct page *(*find_special_page)(struct vm_area_struct *vma,
>  					  unsigned long addr);
> +#ifdef CONFIG_USERFAULTFD
> +	/*
> +	 * Userfaultfd related ops.  Modules need to define this to support
> +	 * userfaultfd.
> +	 */
> +	const struct vm_uffd_ops *userfaultfd_ops;
> +#endif
>  };
>  
>  #ifdef CONFIG_NUMA_BALANCING
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index df85330bcfa6..c9a093c4502b 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -92,6 +92,58 @@ enum mfill_atomic_mode {
>  	NR_MFILL_ATOMIC_MODES,
>  };
>  
> +/* VMA userfaultfd operations */
> +struct vm_uffd_ops {
> +	/**
> +	 * @uffd_features: features supported in bitmask.
> +	 *
> +	 * When the ops is defined, the driver must set non-zero features
> +	 * to be a subset (or all) of: VM_UFFD_MISSING|WP|MINOR.
> +	 */
> +	unsigned long uffd_features;
> +	/**
> +	 * @uffd_ioctls: ioctls supported in bitmask.
> +	 *
> +	 * Userfaultfd ioctls supported by the module.  Below will always
> +	 * be supported by default whenever a module provides vm_uffd_ops:
> +	 *
> +	 *   _UFFDIO_API, _UFFDIO_REGISTER, _UFFDIO_UNREGISTER, _UFFDIO_WAKE
> +	 *
> +	 * The module needs to provide all the rest optionally supported
> +	 * ioctls.  For example, when VM_UFFD_MISSING was supported,
> +	 * _UFFDIO_COPY must be supported as ioctl, while _UFFDIO_ZEROPAGE
> +	 * is optional.
> +	 */
> +	unsigned long uffd_ioctls;
> +	/**
> +	 * uffd_get_folio: Handler to resolve UFFDIO_CONTINUE request.
> +	 *
> +	 * @inode: the inode for folio lookup
> +	 * @pgoff: the pgoff of the folio
> +	 * @folio: returned folio pointer
> +	 *
> +	 * Return: zero if succeeded, negative for errors.
> +	 */
> +	int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
> +			      struct folio **folio);
> +	/**
> +	 * uffd_copy: Handler to resolve UFFDIO_COPY|ZEROPAGE request.
> +	 *
> +	 * @dst_pmd: target pmd to resolve page fault
> +	 * @dst_vma: target vma
> +	 * @dst_addr: target virtual address
> +	 * @src_addr: source address to copy from
> +	 * @flags: userfaultfd request flags
> +	 * @foliop: previously allocated folio
> +	 *
> +	 * Return: zero if succeeded, negative for errors.
> +	 */
> +	int (*uffd_copy)(pmd_t *dst_pmd, struct vm_area_struct *dst_vma,
> +			 unsigned long dst_addr, unsigned long src_addr,
> +			 uffd_flags_t flags, struct folio **foliop);
> +};
> +typedef struct vm_uffd_ops vm_uffd_ops;

Either use vm_uffd_ops_t for the typedef or drop the typedef entirely. My
preference is for the second option.

> +
>  #define MFILL_ATOMIC_MODE_BITS (const_ilog2(NR_MFILL_ATOMIC_MODES - 1) + 1)
>  #define MFILL_ATOMIC_BIT(nr) BIT(MFILL_ATOMIC_MODE_BITS + (nr))
>  #define MFILL_ATOMIC_FLAG(nr) ((__force uffd_flags_t) MFILL_ATOMIC_BIT(nr))
> -- 
> 2.49.0
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 2/4] mm/shmem: Support vm_uffd_ops API
  2025-06-27 15:46 ` [PATCH v2 2/4] mm/shmem: Support " Peter Xu
@ 2025-06-29  8:51   ` Mike Rapoport
  0 siblings, 0 replies; 42+ messages in thread
From: Mike Rapoport @ 2025-06-29  8:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Vlastimil Babka, Suren Baghdasaryan,
	Muchun Song, Lorenzo Stoakes, Hugh Dickins, Andrew Morton,
	James Houghton, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Fri, Jun 27, 2025 at 11:46:53AM -0400, Peter Xu wrote:
> Add support for the new vm_uffd_ops API for shmem.  Note that this only
> introduces the support, the API is not yet used by core mm.
> 
> Due to the tailored uffd_copy() API, shmem is extremely easy to support it
> by reusing the existing mfill function.
> 
> It only needs a separate uffd_get_folio() definition but that's oneliner.
> 
> Cc: Hugh Dickins <hughd@google.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

> ---
>  mm/shmem.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 2b19965d27df..9a8b8dd4709b 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -3151,6 +3151,13 @@ static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap,
>  #endif /* CONFIG_TMPFS_QUOTA */
>  
>  #ifdef CONFIG_USERFAULTFD
> +
> +static int shmem_uffd_get_folio(struct inode *inode, pgoff_t pgoff,
> +				struct folio **folio)
> +{
> +	return shmem_get_folio(inode, pgoff, 0, folio, SGP_NOALLOC);
> +}
> +
>  int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
>  			   struct vm_area_struct *dst_vma,
>  			   unsigned long dst_addr,
> @@ -5194,6 +5201,19 @@ static int shmem_error_remove_folio(struct address_space *mapping,
>  	return 0;
>  }
>  
> +#ifdef CONFIG_USERFAULTFD
> +static const vm_uffd_ops shmem_uffd_ops = {
> +	.uffd_features	= 	__VM_UFFD_FLAGS,
> +	.uffd_ioctls	= 	BIT(_UFFDIO_COPY) |
> +				BIT(_UFFDIO_ZEROPAGE) |
> +				BIT(_UFFDIO_WRITEPROTECT) |
> +				BIT(_UFFDIO_CONTINUE) |
> +				BIT(_UFFDIO_POISON),
> +	.uffd_get_folio	=	shmem_uffd_get_folio,
> +	.uffd_copy	=	shmem_mfill_atomic_pte,
> +};
> +#endif
> +
>  static const struct address_space_operations shmem_aops = {
>  	.dirty_folio	= noop_dirty_folio,
>  #ifdef CONFIG_TMPFS
> @@ -5296,6 +5316,9 @@ static const struct vm_operations_struct shmem_vm_ops = {
>  	.set_policy     = shmem_set_policy,
>  	.get_policy     = shmem_get_policy,
>  #endif
> +#ifdef CONFIG_USERFAULTFD
> +	.userfaultfd_ops = &shmem_uffd_ops,
> +#endif
>  };
>  
>  static const struct vm_operations_struct shmem_anon_vm_ops = {
> @@ -5305,6 +5328,9 @@ static const struct vm_operations_struct shmem_anon_vm_ops = {
>  	.set_policy     = shmem_set_policy,
>  	.get_policy     = shmem_get_policy,
>  #endif
> +#ifdef CONFIG_USERFAULTFD
> +	.userfaultfd_ops = &shmem_uffd_ops,
> +#endif
>  };
>  
>  int shmem_init_fs_context(struct fs_context *fc)
> -- 
> 2.49.0
> 
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 3/4] mm/hugetlb: Support vm_uffd_ops API
  2025-06-27 15:46 ` [PATCH v2 3/4] mm/hugetlb: " Peter Xu
@ 2025-06-29  8:52   ` Mike Rapoport
  0 siblings, 0 replies; 42+ messages in thread
From: Mike Rapoport @ 2025-06-29  8:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Vlastimil Babka, Suren Baghdasaryan,
	Muchun Song, Lorenzo Stoakes, Hugh Dickins, Andrew Morton,
	James Houghton, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Fri, Jun 27, 2025 at 11:46:54AM -0400, Peter Xu wrote:
> Add support for the new vm_uffd_ops API for hugetlb.  Note that this only
> introduces the support, the API is not yet used by core mm.
> 
> Due to legacy reasons, it's still not trivial to move hugetlb completely to
> the API (like shmem).  But it will still use uffd_features and uffd_ioctls
> properly on the API because that's pretty general.
> 
> Cc: Muchun Song <muchun.song@linux.dev>
> Cc: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

> ---
>  mm/hugetlb.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 11d5668ff6e7..ccd2be152d36 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5457,6 +5457,22 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_USERFAULTFD
> +static const vm_uffd_ops hugetlb_uffd_ops = {
> +	.uffd_features	= 	__VM_UFFD_FLAGS,
> +	/* _UFFDIO_ZEROPAGE not supported */
> +	.uffd_ioctls	= 	BIT(_UFFDIO_COPY) |
> +				BIT(_UFFDIO_WRITEPROTECT) |
> +				BIT(_UFFDIO_CONTINUE) |
> +				BIT(_UFFDIO_POISON),
> +	/*
> +	 * Hugetlbfs still has its own hard-coded handler in userfaultfd,
> +	 * due to limitations similar to vm_operations_struct.fault().
> +	 * TODO: generalize it to use the API functions.
> +	 */
> +};
> +#endif
> +
>  /*
>   * When a new function is introduced to vm_operations_struct and added
>   * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops.
> @@ -5470,6 +5486,9 @@ const struct vm_operations_struct hugetlb_vm_ops = {
>  	.close = hugetlb_vm_op_close,
>  	.may_split = hugetlb_vm_op_split,
>  	.pagesize = hugetlb_vm_op_pagesize,
> +#ifdef CONFIG_USERFAULTFD
> +	.userfaultfd_ops = &hugetlb_uffd_ops,
> +#endif
>  };
>  
>  static pte_t make_huge_pte(struct vm_area_struct *vma, struct folio *folio,
> -- 
> 2.49.0
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 4/4] mm: Apply vm_uffd_ops API to core mm
  2025-06-27 15:46 ` [PATCH v2 4/4] mm: Apply vm_uffd_ops API to core mm Peter Xu
@ 2025-06-29  8:55   ` Mike Rapoport
  2025-07-02 20:38     ` Peter Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Mike Rapoport @ 2025-06-29  8:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Vlastimil Babka, Suren Baghdasaryan,
	Muchun Song, Lorenzo Stoakes, Hugh Dickins, Andrew Morton,
	James Houghton, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Fri, Jun 27, 2025 at 11:46:55AM -0400, Peter Xu wrote:
> This patch completely moves the old userfaultfd core to use the new
> vm_uffd_ops API.  After this change, existing file systems will start to
> use the new API for userfault operations.

Maybe:

  Move userfaultfd core to use new vm_uffd_ops API. After this change file
  systems that implement vm_operations_struct can start using new API for
  userfaultfd operations.
 
> When at it, moving vma_can_userfault() into mm/userfaultfd.c instead,
> because it's getting too big.  It's only used in slow paths so it shouldn't
> be an issue.  Move the pte marker check before wp_async, which might be
> more intuitive because wp_async depends on pte markers.  That shouldn't
> cause any functional change though because only one check would take effect
> depending on whether pte marker was selected in config.
> 
> This will also remove quite some hard-coded checks for either shmem or
> hugetlbfs.  Now all the old checks should still work but with vm_uffd_ops.
> 
> Note that anonymous memory will still need to be processed separately
> because it doesn't have vm_ops at all.
> 
> Reviewed-by: James Houghton <jthoughton@google.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

> ---
>  include/linux/shmem_fs.h      |  14 -----
>  include/linux/userfaultfd_k.h |  46 ++++----------
>  mm/shmem.c                    |   2 +-
>  mm/userfaultfd.c              | 115 +++++++++++++++++++++++++---------
>  4 files changed, 101 insertions(+), 76 deletions(-)
> 
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 6d0f9c599ff7..2f5b7b295cf6 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -195,20 +195,6 @@ static inline pgoff_t shmem_fallocend(struct inode *inode, pgoff_t eof)
>  extern bool shmem_charge(struct inode *inode, long pages);
>  extern void shmem_uncharge(struct inode *inode, long pages);
>  
> -#ifdef CONFIG_USERFAULTFD
> -#ifdef CONFIG_SHMEM
> -extern int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
> -				  struct vm_area_struct *dst_vma,
> -				  unsigned long dst_addr,
> -				  unsigned long src_addr,
> -				  uffd_flags_t flags,
> -				  struct folio **foliop);
> -#else /* !CONFIG_SHMEM */
> -#define shmem_mfill_atomic_pte(dst_pmd, dst_vma, dst_addr, \
> -			       src_addr, flags, foliop) ({ BUG(); 0; })
> -#endif /* CONFIG_SHMEM */
> -#endif /* CONFIG_USERFAULTFD */
> -
>  /*
>   * Used space is stored as unsigned 64-bit value in bytes but
>   * quota core supports only signed 64-bit values so use that
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index c9a093c4502b..1aa9b246fb84 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -149,9 +149,14 @@ typedef struct vm_uffd_ops vm_uffd_ops;
>  #define MFILL_ATOMIC_FLAG(nr) ((__force uffd_flags_t) MFILL_ATOMIC_BIT(nr))
>  #define MFILL_ATOMIC_MODE_MASK ((__force uffd_flags_t) (MFILL_ATOMIC_BIT(0) - 1))
>  
> +static inline enum mfill_atomic_mode uffd_flags_get_mode(uffd_flags_t flags)
> +{
> +	return (__force enum mfill_atomic_mode)(flags & MFILL_ATOMIC_MODE_MASK);
> +}
> +
>  static inline bool uffd_flags_mode_is(uffd_flags_t flags, enum mfill_atomic_mode expected)
>  {
> -	return (flags & MFILL_ATOMIC_MODE_MASK) == ((__force uffd_flags_t) expected);
> +	return uffd_flags_get_mode(flags) == expected;
>  }
>  
>  static inline uffd_flags_t uffd_flags_set_mode(uffd_flags_t flags, enum mfill_atomic_mode mode)
> @@ -260,41 +265,16 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
>  	return vma->vm_flags & __VM_UFFD_FLAGS;
>  }
>  
> -static inline bool vma_can_userfault(struct vm_area_struct *vma,
> -				     vm_flags_t vm_flags,
> -				     bool wp_async)
> +static inline const vm_uffd_ops *vma_get_uffd_ops(struct vm_area_struct *vma)
>  {
> -	vm_flags &= __VM_UFFD_FLAGS;
> -
> -	if (vma->vm_flags & VM_DROPPABLE)
> -		return false;
> -
> -	if ((vm_flags & VM_UFFD_MINOR) &&
> -	    (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
> -		return false;
> -
> -	/*
> -	 * If wp async enabled, and WP is the only mode enabled, allow any
> -	 * memory type.
> -	 */
> -	if (wp_async && (vm_flags == VM_UFFD_WP))
> -		return true;
> -
> -#ifndef CONFIG_PTE_MARKER_UFFD_WP
> -	/*
> -	 * If user requested uffd-wp but not enabled pte markers for
> -	 * uffd-wp, then shmem & hugetlbfs are not supported but only
> -	 * anonymous.
> -	 */
> -	if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
> -		return false;
> -#endif
> -
> -	/* By default, allow any of anon|shmem|hugetlb */
> -	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
> -	    vma_is_shmem(vma);
> +	if (vma->vm_ops && vma->vm_ops->userfaultfd_ops)
> +		return vma->vm_ops->userfaultfd_ops;
> +	return NULL;
>  }
>  
> +bool vma_can_userfault(struct vm_area_struct *vma,
> +		       unsigned long vm_flags, bool wp_async);
> +
>  static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
>  {
>  	struct userfaultfd_ctx *uffd_ctx = vma->vm_userfaultfd_ctx.ctx;
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 9a8b8dd4709b..fc85002dd8af 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -3158,7 +3158,7 @@ static int shmem_uffd_get_folio(struct inode *inode, pgoff_t pgoff,
>  	return shmem_get_folio(inode, pgoff, 0, folio, SGP_NOALLOC);
>  }
>  
> -int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
> +static int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
>  			   struct vm_area_struct *dst_vma,
>  			   unsigned long dst_addr,
>  			   unsigned long src_addr,
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index cbed91b09640..52d7d5f144b8 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -14,12 +14,48 @@
>  #include <linux/userfaultfd_k.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/hugetlb.h>
> -#include <linux/shmem_fs.h>
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
>  #include "internal.h"
>  #include "swap.h"
>  
> +bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
> +		       bool wp_async)
> +{
> +	unsigned long supported;
> +
> +	if (vma->vm_flags & VM_DROPPABLE)
> +		return false;
> +
> +	vm_flags &= __VM_UFFD_FLAGS;
> +
> +#ifndef CONFIG_PTE_MARKER_UFFD_WP
> +	/*
> +	 * If user requested uffd-wp but not enabled pte markers for
> +	 * uffd-wp, then any file system (like shmem or hugetlbfs) are not
> +	 * supported but only anonymous.
> +	 */
> +	if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
> +		return false;
> +#endif
> +	/*
> +	 * If wp async enabled, and WP is the only mode enabled, allow any
> +	 * memory type.
> +	 */
> +	if (wp_async && (vm_flags == VM_UFFD_WP))
> +		return true;
> +
> +	if (vma_is_anonymous(vma))
> +		/* Anonymous has no page cache, MINOR not supported */
> +		supported = VM_UFFD_MISSING | VM_UFFD_WP;
> +	else if (vma_get_uffd_ops(vma))
> +		supported = vma_get_uffd_ops(vma)->uffd_features;
> +	else
> +		return false;
> +
> +	return !(vm_flags & (~supported));
> +}
> +
>  static __always_inline
>  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
>  {
> @@ -384,11 +420,15 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
>  {
>  	struct inode *inode = file_inode(dst_vma->vm_file);
>  	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
> +	const vm_uffd_ops *uffd_ops = vma_get_uffd_ops(dst_vma);
>  	struct folio *folio;
>  	struct page *page;
>  	int ret;
>  
> -	ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
> +	if (WARN_ON_ONCE(!uffd_ops || !uffd_ops->uffd_get_folio))
> +		return -EINVAL;
> +
> +	ret = uffd_ops->uffd_get_folio(inode, pgoff, &folio);
>  	/* Our caller expects us to return -EFAULT if we failed to find folio */
>  	if (ret == -ENOENT)
>  		ret = -EFAULT;
> @@ -504,18 +544,6 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
>  	u32 hash;
>  	struct address_space *mapping;
>  
> -	/*
> -	 * There is no default zero huge page for all huge page sizes as
> -	 * supported by hugetlb.  A PMD_SIZE huge pages may exist as used
> -	 * by THP.  Since we can not reliably insert a zero page, this
> -	 * feature is not supported.
> -	 */
> -	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) {
> -		up_read(&ctx->map_changing_lock);
> -		uffd_mfill_unlock(dst_vma);
> -		return -EINVAL;
> -	}
> -
>  	src_addr = src_start;
>  	dst_addr = dst_start;
>  	copied = 0;
> @@ -686,14 +714,55 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
>  			err = mfill_atomic_pte_zeropage(dst_pmd,
>  						 dst_vma, dst_addr);
>  	} else {
> -		err = shmem_mfill_atomic_pte(dst_pmd, dst_vma,
> -					     dst_addr, src_addr,
> -					     flags, foliop);
> +		const vm_uffd_ops *uffd_ops = vma_get_uffd_ops(dst_vma);
> +
> +		if (WARN_ON_ONCE(!uffd_ops || !uffd_ops->uffd_copy)) {
> +			err = -EINVAL;
> +		} else {
> +			err = uffd_ops->uffd_copy(dst_pmd, dst_vma,
> +						  dst_addr, src_addr,
> +						  flags, foliop);
> +		}
>  	}
>  
>  	return err;
>  }
>  
> +static inline bool
> +vma_uffd_ops_supported(struct vm_area_struct *vma, uffd_flags_t flags)
> +{
> +	enum mfill_atomic_mode mode = uffd_flags_get_mode(flags);
> +	const vm_uffd_ops *uffd_ops;
> +	unsigned long uffd_ioctls;
> +
> +	if ((flags & MFILL_ATOMIC_WP) && !(vma->vm_flags & VM_UFFD_WP))
> +		return false;
> +
> +	/* Anonymous supports everything except CONTINUE */
> +	if (vma_is_anonymous(vma))
> +		return mode != MFILL_ATOMIC_CONTINUE;
> +
> +	uffd_ops = vma_get_uffd_ops(vma);
> +	if (!uffd_ops)
> +		return false;
> +
> +	uffd_ioctls = uffd_ops->uffd_ioctls;
> +	switch (mode) {
> +	case MFILL_ATOMIC_COPY:
> +		return uffd_ioctls & BIT(_UFFDIO_COPY);
> +	case MFILL_ATOMIC_ZEROPAGE:
> +		return uffd_ioctls & BIT(_UFFDIO_ZEROPAGE);
> +	case MFILL_ATOMIC_CONTINUE:
> +		if (!(vma->vm_flags & VM_SHARED))
> +			return false;
> +		return uffd_ioctls & BIT(_UFFDIO_CONTINUE);
> +	case MFILL_ATOMIC_POISON:
> +		return uffd_ioctls & BIT(_UFFDIO_POISON);
> +	default:
> +		return false;
> +	}
> +}
> +
>  static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
>  					    unsigned long dst_start,
>  					    unsigned long src_start,
> @@ -752,11 +821,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
>  	    dst_vma->vm_flags & VM_SHARED))
>  		goto out_unlock;
>  
> -	/*
> -	 * validate 'mode' now that we know the dst_vma: don't allow
> -	 * a wrprotect copy if the userfaultfd didn't register as WP.
> -	 */
> -	if ((flags & MFILL_ATOMIC_WP) && !(dst_vma->vm_flags & VM_UFFD_WP))
> +	if (!vma_uffd_ops_supported(dst_vma, flags))
>  		goto out_unlock;
>  
>  	/*
> @@ -766,12 +831,6 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
>  		return  mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
>  					     src_start, len, flags);
>  
> -	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
> -		goto out_unlock;
> -	if (!vma_is_shmem(dst_vma) &&
> -	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> -		goto out_unlock;
> -
>  	while (src_addr < src_start + len) {
>  		pmd_t dst_pmdval;
>  
> -- 
> 2.49.0
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-06-27 15:46 ` [PATCH v2 1/4] mm: Introduce vm_uffd_ops API Peter Xu
  2025-06-29  8:50   ` Mike Rapoport
@ 2025-06-30 10:15   ` Lorenzo Stoakes
  2025-07-01 17:04     ` Suren Baghdasaryan
  1 sibling, 1 reply; 42+ messages in thread
From: Lorenzo Stoakes @ 2025-06-30 10:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Vlastimil Babka, Suren Baghdasaryan,
	Muchun Song, Mike Rapoport, Hugh Dickins, Andrew Morton,
	James Houghton, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Fri, Jun 27, 2025 at 11:46:52AM -0400, Peter Xu wrote:
> Introduce a generic userfaultfd API for vm_operations_struct, so that one
> vma, especially when as a module, can support userfaults without modifying
> the core files.  More importantly, when the module can be compiled out of
> the kernel.

I'm not sure I understand this. One VMA 'especially when as a module'?

This whole sentence is really unclear.

So it seems to me that you want to be able to allow more than just:

- anonymous memory
- shmem
- hugetlb

To support userfaultfd correct?

>
> So, instead of having core mm referencing modules that may not ever exist,
> we need to have modules opt-in on core mm hooks instead.

OK this sounds reasonable.

>
> After this API applied, if a module wants to support userfaultfd, the
> module should only need to touch its own file and properly define
> vm_uffd_ops, instead of changing anything in core mm.

Yes this also sounds reasonable, _as long as_ :) the module authors are
aware of any conditions that might exist in the VMA that might be odd or
confusing.

For instance, if locks need to be held etc.

I am generally slightly wary of us giving VMAs in hooks because people do
crazy things with them that

>
> Note that such API will not work for anonymous. Core mm will process
> anonymous memory separately for userfault operations like before.

Right.

>
> This patch only introduces the API alone so that we can start to move
> existing users over but without breaking them.

Sounds sensible.

>
> Currently the uffd_copy() API is almost designed to be the simplistic with
> minimum mm changes to move over to the API.

A good approach.

>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/mm.h            |  9 ++++++
>  include/linux/userfaultfd_k.h | 52 +++++++++++++++++++++++++++++++++++
>  2 files changed, 61 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ef40f68c1183..6a5447bd43fd 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -576,6 +576,8 @@ struct vm_fault {
>  					 */
>  };
>
> +struct vm_uffd_ops;
> +
>  /*
>   * These are the virtual MM functions - opening of an area, closing and
>   * unmapping it (needed to keep files on disk up-to-date etc), pointer
> @@ -653,6 +655,13 @@ struct vm_operations_struct {
>  	 */
>  	struct page *(*find_special_page)(struct vm_area_struct *vma,
>  					  unsigned long addr);
> +#ifdef CONFIG_USERFAULTFD
> +	/*
> +	 * Userfaultfd related ops.  Modules need to define this to support
> +	 * userfaultfd.
> +	 */
> +	const struct vm_uffd_ops *userfaultfd_ops;
> +#endif
>  };

This shouldn't go in vm_ops like this. You're basically changing a
fundamental convention here - that _ops structs define handlers, now you
can have somehow nested ones?

It seems more appropriate to have something like:

struct vm_uffd_ops *(*get_uffd_ops)(struct vm_area_struct *vma);

This then matches e.g. mempolicy's get_policy handler.

>
>  #ifdef CONFIG_NUMA_BALANCING
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index df85330bcfa6..c9a093c4502b 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -92,6 +92,58 @@ enum mfill_atomic_mode {
>  	NR_MFILL_ATOMIC_MODES,
>  };
>
> +/* VMA userfaultfd operations */

Are we sure this should be operating at the VMA level?

I mean are the operations going to be operating on VMAs or VM fault
structs? If not, then this surely belongs to the file operations no?

> +struct vm_uffd_ops {

I'd comment on the naming, but 'vm_operations_struct' is so bad that it
seems we have no sensible convention anyway so this is fine :P

> +	/**
> +	 * @uffd_features: features supported in bitmask.
> +	 *
> +	 * When the ops is defined, the driver must set non-zero features

I don't think the 'when the ops are defined' bit is necessray here, you're
commenting on the ops here, you can safely assume they're defined.

So I'd just say 'must be a subset (or all) of: VM_UFFD_MISSING|WP|MINOR.'

> +	 * to be a subset (or all) of: VM_UFFD_MISSING|WP|MINOR.
> +	 */
> +	unsigned long uffd_features;
> +	/**
> +	 * @uffd_ioctls: ioctls supported in bitmask.
> +	 *
> +	 * Userfaultfd ioctls supported by the module.  Below will always
> +	 * be supported by default whenever a module provides vm_uffd_ops:
> +	 *
> +	 *   _UFFDIO_API, _UFFDIO_REGISTER, _UFFDIO_UNREGISTER, _UFFDIO_WAKE
> +	 *
> +	 * The module needs to provide all the rest optionally supported
> +	 * ioctls.  For example, when VM_UFFD_MISSING was supported,
> +	 * _UFFDIO_COPY must be supported as ioctl, while _UFFDIO_ZEROPAGE
> +	 * is optional.

I'm not sure why we need this field? Isn't this implied by whether handlers
are set or NULL?

> +	 */
> +	unsigned long uffd_ioctls;
> +	/**
> +	 * uffd_get_folio: Handler to resolve UFFDIO_CONTINUE request.
> +	 *
> +	 * @inode: the inode for folio lookup
> +	 * @pgoff: the pgoff of the folio
> +	 * @folio: returned folio pointer
> +	 *
> +	 * Return: zero if succeeded, negative for errors.
> +	 */
> +	int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
> +			      struct folio **folio);

Not sure uufd_ prefix is needed in a struct that's obviously about uffd
already? I'd strip that.

Also this name is really confusing, I think it should contain continue in
some form, such as 'handle_continue()'?

This really feels like it shouldn't be a VMA function as you're dealing
with inode, pgoff, and folio and not VMA at all?


> +	/**
> +	 * uffd_copy: Handler to resolve UFFDIO_COPY|ZEROPAGE request.
> +	 *
> +	 * @dst_pmd: target pmd to resolve page fault
> +	 * @dst_vma: target vma
> +	 * @dst_addr: target virtual address
> +	 * @src_addr: source address to copy from
> +	 * @flags: userfaultfd request flags
> +	 * @foliop: previously allocated folio
> +	 *
> +	 * Return: zero if succeeded, negative for errors.

Can you please ensure you put details as to VMA lock state here. Uffd has
some very tricky handling around stuff like this.

> +	 */
> +	int (*uffd_copy)(pmd_t *dst_pmd, struct vm_area_struct *dst_vma,
> +			 unsigned long dst_addr, unsigned long src_addr,
> +			 uffd_flags_t flags, struct folio **foliop);

Do we not need a uffd_ctx parameter here?

It seems like we're assuming a _lot_ of mm understanding in the underlying
driver here.

I'm not sure it's really normal to be handing around page table state and
folios etc. to a driver like this, this is really... worrying to me.

This feels like you're trying to put mm functionality outside of mm?

What if we change how we handle page tables in future, or the locking
changes? We might not know to go and update x, y, z driver.

This is concerning.

> +};
> +typedef struct vm_uffd_ops vm_uffd_ops;
> +
>  #define MFILL_ATOMIC_MODE_BITS (const_ilog2(NR_MFILL_ATOMIC_MODES - 1) + 1)
>  #define MFILL_ATOMIC_BIT(nr) BIT(MFILL_ATOMIC_MODE_BITS + (nr))
>  #define MFILL_ATOMIC_FLAG(nr) ((__force uffd_flags_t) MFILL_ATOMIC_BIT(nr))
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 0/4] mm/userfaultfd: modulize memory types
  2025-06-27 15:46 [PATCH v2 0/4] mm/userfaultfd: modulize memory types Peter Xu
                   ` (3 preceding siblings ...)
  2025-06-27 15:46 ` [PATCH v2 4/4] mm: Apply vm_uffd_ops API to core mm Peter Xu
@ 2025-06-30 10:29 ` Lorenzo Stoakes
  2025-07-01  0:15   ` Andrew Morton
  2025-07-02 20:36   ` Peter Xu
  4 siblings, 2 replies; 42+ messages in thread
From: Lorenzo Stoakes @ 2025-06-30 10:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Vlastimil Babka, Suren Baghdasaryan,
	Muchun Song, Mike Rapoport, Hugh Dickins, Andrew Morton,
	James Houghton, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Fri, Jun 27, 2025 at 11:46:51AM -0400, Peter Xu wrote:
> [based on latest akpm/mm-new of June 27th, commit 9be7387ae43f]
>
> v2 changelog:
> - Patch 1
>   - update English in commit log [David]
>   - move vm_uffd_ops definition to userfaultfd_k.h [Mike]
> - Patch 4
>   - fix sparse warning on bitwise type conversions [syzbot]
>   - Commit message updates on explanation of vma_can_userfault check [James]
>
> v1: https://lore.kernel.org/r/20250620190342.1780170-1-peterx@redhat.com
>
> This series is an alternative proposal of what Nikita proposed here on the
> initial three patches:
>
>   https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com
>
> This is not yet relevant to any guest-memfd support, but paving way for it.
> Here, the major goal is to make kernel modules be able to opt-in with any
> form of userfaultfd supports, like guest-memfd.  This alternative option
> should hopefully be cleaner, and avoid leaking userfault details into
> vm_ops.fault().
>
> It also means this series does not depend on anything.  It's a pure
> refactoring of userfaultfd internals to provide a generic API, so that
> other types of files, especially RAM based, can support userfaultfd without
> touching mm/ at all.

I'm very concerned that this change will simply move core mm functionality out
of mm and into drivers where it can bitrot and cause subtle bugs?

You're proposing providing stuff like page table state and asking for a folio
back from a driver etc.

I absolutely am not in favour of us providing core mm internals like this to
drivers, and I don't want to see us having to EXPORT() mm internals just to make
module-ised uffd code work (I mean I just will flat out refuse to do that).

I think we need to think _very_ carefully about how we do this.

I also feel like this series is at a really basic level and you've not fully
determined what API calls you need.

I agree that it's sensible to be incremental, but I feel like you sort of need
to somewhat prove the case that you can jump from 'incremental version where we
only support code in mm/' to supporting arbitrary file system code that might be
modules.

Because otherwise you're basically _guessing_ that you can do this, possibly, in
the future and maybe it's just not the right approach but that's not clear yet?

>
> To achieve it, this series introduced a file operation called vm_uffd_ops.
> The ops needs to be provided when a file type supports any of userfaultfd.
>
> With that, I moved both hugetlbfs and shmem over.

Well as you say below hugetlbfs is sort of a stub implementation, I wonder
whether we'd need quite a bit more to make that work.

One thing I'd _really_ like to avoid is us having to add a bunch of hook points
into core mm code just for uffd that then call out to some driver.

We've encountered such a total nightmare with .mmap() for instance in the past
(including stuff that resulted in security issues) because we - simply cannot
assume anything - about what the hook implementor might do with the passed
parameters.

This is really really problematic.

I also absolutely hate the:

if (uffd)
	do_something_weird();

Pattern, so hopefully this won't proliferate that.

>
> Hugetlbfs is still very special that it will only use partial of the
> vm_uffd_ops API, due to similar reason why hugetlb_vm_op_fault() has a
> BUG() and so far hard-coded into core mm.  But this should still be better,
> because at least hugetlbfs is still always involved in feature probing
> (e.g. where it used to not support ZEROPAGE and we have a hard-coded line
> to fail that, and some more).  Meanwhile after this series, shmem will be
> completely converted to the new vm_uffd_ops API; the final vm_uffd_ops for
> shmem looks like this:
>
> static const vm_uffd_ops shmem_uffd_ops = {
> 	.uffd_features	= 	__VM_UFFD_FLAGS,
> 	.uffd_ioctls	= 	BIT(_UFFDIO_COPY) |
> 				BIT(_UFFDIO_ZEROPAGE) |
> 				BIT(_UFFDIO_WRITEPROTECT) |
> 				BIT(_UFFDIO_CONTINUE) |
> 				BIT(_UFFDIO_POISON),
> 	.uffd_get_folio	=	shmem_uffd_get_folio,
> 	.uffd_copy	=	shmem_mfill_atomic_pte,
> };
>
> As I mentioned in one of my reply to Nikita, I don't like the current
> interface of uffd_copy(), but this will be the minimum change version of
> such API to support complete extrenal-module-ready userfaultfd.  Here, very
> minimal change will be needed from shmem side to support that.

Right, maybe a better version of this interface might address some of my
concerns... :)

>
> Meanwhile, the vm_uffd_ops is also not the only place one will need to
> provide to support userfaultfd.  Normally vm_ops.fault() will also need to
> be updated, but that's a generic function and it'll play together with the
> new vm_uffd_ops to make everything fly.
>
> No functional change expected at all after the whole series applied.  There
> might be some slightly stricter check on uffd ops here and there in the
> last patch, but that really shouldn't stand out anywhere to anyone.
>
> For testing: besides the cross-compilation tests, I did also try with
> uffd-stress in a VM to measure any perf difference before/after the change;
> The static call becomes a pointer now.  I really cannot measure anything
> different, which is more or less expected.
>
> Comments welcomed, thanks.
>
> Peter Xu (4):
>   mm: Introduce vm_uffd_ops API
>   mm/shmem: Support vm_uffd_ops API
>   mm/hugetlb: Support vm_uffd_ops API
>   mm: Apply vm_uffd_ops API to core mm
>
>  include/linux/mm.h            |   9 +++
>  include/linux/shmem_fs.h      |  14 -----
>  include/linux/userfaultfd_k.h |  98 +++++++++++++++++++----------
>  mm/hugetlb.c                  |  19 ++++++
>  mm/shmem.c                    |  28 ++++++++-
>  mm/userfaultfd.c              | 115 +++++++++++++++++++++++++---------
>  6 files changed, 207 insertions(+), 76 deletions(-)
>
> --
> 2.49.0
>

Sorry to be critical, I just want to make sure we're not setting ourselves up
for trouble here.

I _very much_ support efforts to make uffd more generalised, and ideally to find
a way to separate out shmem and hugetlbfs implementation bits, so I support the
intent _fully_.

I just want to make sure we do it in a safe way :)

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 0/4] mm/userfaultfd: modulize memory types
  2025-06-30 10:29 ` [PATCH v2 0/4] mm/userfaultfd: modulize memory types Lorenzo Stoakes
@ 2025-07-01  0:15   ` Andrew Morton
  2025-07-02 20:36   ` Peter Xu
  1 sibling, 0 replies; 42+ messages in thread
From: Andrew Morton @ 2025-07-01  0:15 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Peter Xu, linux-mm, linux-kernel, Vlastimil Babka,
	Suren Baghdasaryan, Muchun Song, Mike Rapoport, Hugh Dickins,
	James Houghton, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Mon, 30 Jun 2025 11:29:30 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> > It also means this series does not depend on anything.  It's a pure
> > refactoring of userfaultfd internals to provide a generic API, so that
> > other types of files, especially RAM based, can support userfaultfd without
> > touching mm/ at all.
> 
> I'm very concerned that this change will simply move core mm functionality out
> of mm and into drivers where it can bitrot and cause subtle bugs?
> 
> You're proposing providing stuff like page table state and asking for a folio
> back from a driver etc.
> 
> I absolutely am not in favour of us providing core mm internals like this to
> drivers, and I don't want to see us having to EXPORT() mm internals just to make
> module-ised uffd code work (I mean I just will flat out refuse to do that).
> 
> I think we need to think _very_ carefully about how we do this.
> 
> I also feel like this series is at a really basic level and you've not fully
> determined what API calls you need.
> 
> I agree that it's sensible to be incremental, but I feel like you sort of need
> to somewhat prove the case that you can jump from 'incremental version where we
> only support code in mm/' to supporting arbitrary file system code that might be
> modules.
> 
> Because otherwise you're basically _guessing_ that you can do this, possibly, in
> the future and maybe it's just not the right approach but that's not clear yet?

Thanks, this is pretty fundamental stuff so I'll push this series back
into mm-new while we think about it.

Soon, please - I don't want people to be basing new work on something
which might go away,


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-06-30 10:15   ` Lorenzo Stoakes
@ 2025-07-01 17:04     ` Suren Baghdasaryan
  2025-07-02 15:40       ` Liam R. Howlett
                         ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Suren Baghdasaryan @ 2025-07-01 17:04 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Peter Xu, linux-mm, linux-kernel, Vlastimil Babka, Muchun Song,
	Mike Rapoport, Hugh Dickins, Andrew Morton, James Houghton,
	Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Mon, Jun 30, 2025 at 3:16 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Fri, Jun 27, 2025 at 11:46:52AM -0400, Peter Xu wrote:
> > Introduce a generic userfaultfd API for vm_operations_struct, so that one
> > vma, especially when as a module, can support userfaults without modifying
> > the core files.  More importantly, when the module can be compiled out of
> > the kernel.
>
> I'm not sure I understand this. One VMA 'especially when as a module'?
>
> This whole sentence is really unclear.
>
> So it seems to me that you want to be able to allow more than just:
>
> - anonymous memory
> - shmem
> - hugetlb
>
> To support userfaultfd correct?
>
> >
> > So, instead of having core mm referencing modules that may not ever exist,
> > we need to have modules opt-in on core mm hooks instead.
>
> OK this sounds reasonable.
>
> >
> > After this API applied, if a module wants to support userfaultfd, the
> > module should only need to touch its own file and properly define
> > vm_uffd_ops, instead of changing anything in core mm.
>
> Yes this also sounds reasonable, _as long as_ :) the module authors are
> aware of any conditions that might exist in the VMA that might be odd or
> confusing.
>
> For instance, if locks need to be held etc.
>
> I am generally slightly wary of us giving VMAs in hooks because people do
> crazy things with them that
>
> >
> > Note that such API will not work for anonymous. Core mm will process
> > anonymous memory separately for userfault operations like before.
>
> Right.
>
> >
> > This patch only introduces the API alone so that we can start to move
> > existing users over but without breaking them.
>
> Sounds sensible.
>
> >
> > Currently the uffd_copy() API is almost designed to be the simplistic with
> > minimum mm changes to move over to the API.
>
> A good approach.
>
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  include/linux/mm.h            |  9 ++++++
> >  include/linux/userfaultfd_k.h | 52 +++++++++++++++++++++++++++++++++++
> >  2 files changed, 61 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index ef40f68c1183..6a5447bd43fd 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -576,6 +576,8 @@ struct vm_fault {
> >                                        */
> >  };
> >
> > +struct vm_uffd_ops;
> > +
> >  /*
> >   * These are the virtual MM functions - opening of an area, closing and
> >   * unmapping it (needed to keep files on disk up-to-date etc), pointer
> > @@ -653,6 +655,13 @@ struct vm_operations_struct {
> >        */
> >       struct page *(*find_special_page)(struct vm_area_struct *vma,
> >                                         unsigned long addr);
> > +#ifdef CONFIG_USERFAULTFD
> > +     /*
> > +      * Userfaultfd related ops.  Modules need to define this to support
> > +      * userfaultfd.
> > +      */
> > +     const struct vm_uffd_ops *userfaultfd_ops;
> > +#endif
> >  };
>
> This shouldn't go in vm_ops like this. You're basically changing a
> fundamental convention here - that _ops structs define handlers, now you
> can have somehow nested ones?
>
> It seems more appropriate to have something like:
>
> struct vm_uffd_ops *(*get_uffd_ops)(struct vm_area_struct *vma);
>
> This then matches e.g. mempolicy's get_policy handler.
>
> >
> >  #ifdef CONFIG_NUMA_BALANCING
> > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > index df85330bcfa6..c9a093c4502b 100644
> > --- a/include/linux/userfaultfd_k.h
> > +++ b/include/linux/userfaultfd_k.h
> > @@ -92,6 +92,58 @@ enum mfill_atomic_mode {
> >       NR_MFILL_ATOMIC_MODES,
> >  };
> >
> > +/* VMA userfaultfd operations */
>
> Are we sure this should be operating at the VMA level?
>
> I mean are the operations going to be operating on VMAs or VM fault
> structs? If not, then this surely belongs to the file operations no?
>
> > +struct vm_uffd_ops {
>
> I'd comment on the naming, but 'vm_operations_struct' is so bad that it
> seems we have no sensible convention anyway so this is fine :P
>
> > +     /**
> > +      * @uffd_features: features supported in bitmask.
> > +      *
> > +      * When the ops is defined, the driver must set non-zero features
>
> I don't think the 'when the ops are defined' bit is necessray here, you're
> commenting on the ops here, you can safely assume they're defined.
>
> So I'd just say 'must be a subset (or all) of: VM_UFFD_MISSING|WP|MINOR.'
>
> > +      * to be a subset (or all) of: VM_UFFD_MISSING|WP|MINOR.
> > +      */
> > +     unsigned long uffd_features;
> > +     /**
> > +      * @uffd_ioctls: ioctls supported in bitmask.
> > +      *
> > +      * Userfaultfd ioctls supported by the module.  Below will always
> > +      * be supported by default whenever a module provides vm_uffd_ops:
> > +      *
> > +      *   _UFFDIO_API, _UFFDIO_REGISTER, _UFFDIO_UNREGISTER, _UFFDIO_WAKE
> > +      *
> > +      * The module needs to provide all the rest optionally supported
> > +      * ioctls.  For example, when VM_UFFD_MISSING was supported,
> > +      * _UFFDIO_COPY must be supported as ioctl, while _UFFDIO_ZEROPAGE
> > +      * is optional.
>
> I'm not sure why we need this field? Isn't this implied by whether handlers
> are set or NULL?
>
> > +      */
> > +     unsigned long uffd_ioctls;
> > +     /**
> > +      * uffd_get_folio: Handler to resolve UFFDIO_CONTINUE request.
> > +      *
> > +      * @inode: the inode for folio lookup
> > +      * @pgoff: the pgoff of the folio
> > +      * @folio: returned folio pointer
> > +      *
> > +      * Return: zero if succeeded, negative for errors.
> > +      */
> > +     int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
> > +                           struct folio **folio);
>
> Not sure uufd_ prefix is needed in a struct that's obviously about uffd
> already? I'd strip that.
>
> Also this name is really confusing, I think it should contain continue in
> some form, such as 'handle_continue()'?
>
> This really feels like it shouldn't be a VMA function as you're dealing
> with inode, pgoff, and folio and not VMA at all?
>
>
> > +     /**
> > +      * uffd_copy: Handler to resolve UFFDIO_COPY|ZEROPAGE request.
> > +      *
> > +      * @dst_pmd: target pmd to resolve page fault
> > +      * @dst_vma: target vma
> > +      * @dst_addr: target virtual address
> > +      * @src_addr: source address to copy from
> > +      * @flags: userfaultfd request flags
> > +      * @foliop: previously allocated folio
> > +      *
> > +      * Return: zero if succeeded, negative for errors.
>
> Can you please ensure you put details as to VMA lock state here. Uffd has
> some very tricky handling around stuff like this.
>
> > +      */
> > +     int (*uffd_copy)(pmd_t *dst_pmd, struct vm_area_struct *dst_vma,
> > +                      unsigned long dst_addr, unsigned long src_addr,
> > +                      uffd_flags_t flags, struct folio **foliop);
>
> Do we not need a uffd_ctx parameter here?
>
> It seems like we're assuming a _lot_ of mm understanding in the underlying
> driver here.
>
> I'm not sure it's really normal to be handing around page table state and
> folios etc. to a driver like this, this is really... worrying to me.
>
> This feels like you're trying to put mm functionality outside of mm?

To second that, two things stick out for me here:
1. uffd_copy and uffd_get_folio seem to be at different abstraction
levels. uffd_copy is almost the entire copy operation for VM_SHARED
VMAs while uffd_get_folio is a small part of the continue operation.
2. shmem_mfill_atomic_pte which becomes uffd_copy for shmem in the
last patch is quite a complex function which itself calls some IMO
pretty internal functions like mfill_atomic_install_pte(). Expecting
modules to implement such functionality seems like a stretch to me but
maybe this is for some specialized modules which are written by mm
experts only?

>
> What if we change how we handle page tables in future, or the locking
> changes? We might not know to go and update x, y, z driver.
>
> This is concerning.
>
> > +};
> > +typedef struct vm_uffd_ops vm_uffd_ops;
> > +
> >  #define MFILL_ATOMIC_MODE_BITS (const_ilog2(NR_MFILL_ATOMIC_MODES - 1) + 1)
> >  #define MFILL_ATOMIC_BIT(nr) BIT(MFILL_ATOMIC_MODE_BITS + (nr))
> >  #define MFILL_ATOMIC_FLAG(nr) ((__force uffd_flags_t) MFILL_ATOMIC_BIT(nr))
> > --
> > 2.49.0
> >

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-01 17:04     ` Suren Baghdasaryan
@ 2025-07-02 15:40       ` Liam R. Howlett
  2025-07-02 15:56       ` Lorenzo Stoakes
  2025-07-02 18:16       ` Mike Rapoport
  2 siblings, 0 replies; 42+ messages in thread
From: Liam R. Howlett @ 2025-07-02 15:40 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Lorenzo Stoakes, Peter Xu, linux-mm, linux-kernel,
	Vlastimil Babka, Muchun Song, Mike Rapoport, Hugh Dickins,
	Andrew Morton, James Houghton, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

* Suren Baghdasaryan <surenb@google.com> [250701 13:04]:
...

> > > +     /**
> > > +      * uffd_copy: Handler to resolve UFFDIO_COPY|ZEROPAGE request.
> > > +      *
> > > +      * @dst_pmd: target pmd to resolve page fault
> > > +      * @dst_vma: target vma
> > > +      * @dst_addr: target virtual address
> > > +      * @src_addr: source address to copy from
> > > +      * @flags: userfaultfd request flags
> > > +      * @foliop: previously allocated folio
> > > +      *
> > > +      * Return: zero if succeeded, negative for errors.
> >
> > Can you please ensure you put details as to VMA lock state here. Uffd has
> > some very tricky handling around stuff like this.
> >
> > > +      */
> > > +     int (*uffd_copy)(pmd_t *dst_pmd, struct vm_area_struct *dst_vma,
> > > +                      unsigned long dst_addr, unsigned long src_addr,
> > > +                      uffd_flags_t flags, struct folio **foliop);
> >
> > Do we not need a uffd_ctx parameter here?
> >
> > It seems like we're assuming a _lot_ of mm understanding in the underlying
> > driver here.
> >
> > I'm not sure it's really normal to be handing around page table state and
> > folios etc. to a driver like this, this is really... worrying to me.
> >
> > This feels like you're trying to put mm functionality outside of mm?
> 
> To second that, two things stick out for me here:
> 1. uffd_copy and uffd_get_folio seem to be at different abstraction
> levels. uffd_copy is almost the entire copy operation for VM_SHARED
> VMAs while uffd_get_folio is a small part of the continue operation.
> 2. shmem_mfill_atomic_pte which becomes uffd_copy for shmem in the
> last patch is quite a complex function which itself calls some IMO
> pretty internal functions like mfill_atomic_install_pte(). Expecting
> modules to implement such functionality seems like a stretch to me

Yes.

I don't think this is a good idea to expose, since there is no way to
restrict it to a specific implementations.

We used to pass out a vma to a driver for updating the vma flags, and
even that proved to be too permissive and it ended very poorly.  We had
drivers saving the pointer then dropping the lock, we had drivers
changing things under us.  Then there was the fallout in the mm for
trying to deal with what may have happened - and the failure scenarios
of the dealing with it didn't work out.

What this is doing is leading down a path to allow such things at an
even lower level.  I think this is too flexible and opens us up to
unintentional abuse.

That is to say, I don't think we should expose this outside of mm for
the benefit of everyone.

> but
> maybe this is for some specialized modules which are written by mm
> experts only?
> 

Even with 'experts' (do any of us really know what we're doing,
anyways? ;) we may get into a situation where a company may write
themselves into a corner, depending on specific functionality that's
hard to re-implement and a nightmare to keep working. I don't want to
bind.. er.. to a specific example, but I believe we can all think of at
least one.

Is there more information you can share as to why you want to expose
this functionality?  Maybe we can find another way that does not expose
the internals and accomplishes what you want?

Thanks,
Liam

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-01 17:04     ` Suren Baghdasaryan
  2025-07-02 15:40       ` Liam R. Howlett
@ 2025-07-02 15:56       ` Lorenzo Stoakes
  2025-07-02 17:08         ` Nikita Kalyazin
  2025-07-02 20:24         ` Peter Xu
  2025-07-02 18:16       ` Mike Rapoport
  2 siblings, 2 replies; 42+ messages in thread
From: Lorenzo Stoakes @ 2025-07-02 15:56 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Peter Xu, linux-mm, linux-kernel, Vlastimil Babka, Muchun Song,
	Mike Rapoport, Hugh Dickins, Andrew Morton, James Houghton,
	Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Tue, Jul 01, 2025 at 10:04:28AM -0700, Suren Baghdasaryan wrote:
> On Mon, Jun 30, 2025 at 3:16 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > This feels like you're trying to put mm functionality outside of mm?
>
> To second that, two things stick out for me here:
> 1. uffd_copy and uffd_get_folio seem to be at different abstraction
> levels. uffd_copy is almost the entire copy operation for VM_SHARED
> VMAs while uffd_get_folio is a small part of the continue operation.
> 2. shmem_mfill_atomic_pte which becomes uffd_copy for shmem in the
> last patch is quite a complex function which itself calls some IMO
> pretty internal functions like mfill_atomic_install_pte(). Expecting
> modules to implement such functionality seems like a stretch to me but
> maybe this is for some specialized modules which are written by mm
> experts only?

To echo what Liam said - I don't think we can truly rely on expertise here
(we make enough mistakes in core mm for that to be a dubious proposition
even tere :) and even if experts were involved, having core mm
functionality outside of core mm carries significant risk - we are
constantly changing things, including assumptions around sensitive topics
such as locking (think VMA locking) - having code elsewhere significantly
increases the risk of missing things.

I am also absolutely, to be frank, not going to accept us EXPORT()'ing
anything core.

Page table manipulation really must rely in core mm and arch code only, it
is easily some of the most subtle, confusing and dangerous code in mm (I
have spent subtantial hours banging my head against it recently), and again
- subject to constant change.

But to come back to Liam's comments and to reiterate what I was referring
to earlier, even permitting drivers to have access to VMAs is _highly_
problematic and has resulted in very real bugs and subtle issues that took
many hours, much stress + gnashing of teeth to adress.

The very thing of:

xxx
<hand off sensitive mutable state X, Y, Z to driver>
yyy

Means that between xxx and yyy we can make literally no assumptions about
what just happened to all handed off state. A single instance of this has
caused mayhem, if we did this in such a way as to affect the _many_ uffd
hooks we could have a realy serious problem.

So - what seems really positive about this series is the _generalisation_
and _abstraction_ of uffd functionality.

That is something I appreciate and I think uffd sorely needs, in fact if we
could find a way to not need to do:

if (some_uffd_predicate())
	some_uffd_specific_fn();

That'd be incredible.

So I think the answer here is to do something like this, and to keep all
the mm-specific code in core mm.

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-02 15:56       ` Lorenzo Stoakes
@ 2025-07-02 17:08         ` Nikita Kalyazin
  2025-07-02 17:39           ` Mike Rapoport
  2025-07-02 21:24           ` Liam R. Howlett
  2025-07-02 20:24         ` Peter Xu
  1 sibling, 2 replies; 42+ messages in thread
From: Nikita Kalyazin @ 2025-07-02 17:08 UTC (permalink / raw)
  To: Lorenzo Stoakes, Suren Baghdasaryan
  Cc: Peter Xu, linux-mm, linux-kernel, Vlastimil Babka, Muchun Song,
	Mike Rapoport, Hugh Dickins, Andrew Morton, James Houghton,
	Liam R . Howlett, Michal Hocko, David Hildenbrand,
	Andrea Arcangeli, Oscar Salvador, Axel Rasmussen, Ujwal Kundur



On 02/07/2025 16:56, Lorenzo Stoakes wrote:
> On Tue, Jul 01, 2025 at 10:04:28AM -0700, Suren Baghdasaryan wrote:
>> On Mon, Jun 30, 2025 at 3:16 AM Lorenzo Stoakes
>> <lorenzo.stoakes@oracle.com> wrote:
>>> This feels like you're trying to put mm functionality outside of mm?
>>
>> To second that, two things stick out for me here:
>> 1. uffd_copy and uffd_get_folio seem to be at different abstraction
>> levels. uffd_copy is almost the entire copy operation for VM_SHARED
>> VMAs while uffd_get_folio is a small part of the continue operation.
>> 2. shmem_mfill_atomic_pte which becomes uffd_copy for shmem in the
>> last patch is quite a complex function which itself calls some IMO
>> pretty internal functions like mfill_atomic_install_pte(). Expecting
>> modules to implement such functionality seems like a stretch to me but
>> maybe this is for some specialized modules which are written by mm
>> experts only?
> 
> To echo what Liam said - I don't think we can truly rely on expertise here
> (we make enough mistakes in core mm for that to be a dubious proposition
> even tere :) and even if experts were involved, having core mm
> functionality outside of core mm carries significant risk - we are
> constantly changing things, including assumptions around sensitive topics
> such as locking (think VMA locking) - having code elsewhere significantly
> increases the risk of missing things.
> 
> I am also absolutely, to be frank, not going to accept us EXPORT()'ing
> anything core.
> 
> Page table manipulation really must rely in core mm and arch code only, it
> is easily some of the most subtle, confusing and dangerous code in mm (I
> have spent subtantial hours banging my head against it recently), and again
> - subject to constant change.
> 
> But to come back to Liam's comments and to reiterate what I was referring
> to earlier, even permitting drivers to have access to VMAs is _highly_
> problematic and has resulted in very real bugs and subtle issues that took
> many hours, much stress + gnashing of teeth to adress.

The main target of this change is the implementation of UFFD for 
KVM/guest_memfd (examples: [1], [2]) to avoid bringing KVM-specific code 
into the mm codebase.  We usually mean KVM by the "drivers" in this 
context, and it is already somewhat "knowledgeable" of the mm.  I don't 
think there are existing use cases for other drivers to implement this 
at the moment.

Although I can't see new exports in this series, there is now a way to 
limit exports to particular modules [3].  Would it help if we only do it 
for KVM initially (if/when actually needed)?

[1] 
https://lore.kernel.org/all/114133f5-0282-463d-9d65-3143aa658806@amazon.com/
[2] 
https://lore.kernel.org/all/7666ee96-6f09-4dc1-8cb2-002a2d2a29cf@amazon.com/
[3] 
https://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git/commit/?h=kbuild&id=707f853d7fa3ce323a6875487890c213e34d81a0

Thanks,
Nikita

> 
> The very thing of:
> 
> xxx
> <hand off sensitive mutable state X, Y, Z to driver>
> yyy
> 
> Means that between xxx and yyy we can make literally no assumptions about
> what just happened to all handed off state. A single instance of this has
> caused mayhem, if we did this in such a way as to affect the _many_ uffd
> hooks we could have a realy serious problem.
> 
> So - what seems really positive about this series is the _generalisation_
> and _abstraction_ of uffd functionality.
> 
> That is something I appreciate and I think uffd sorely needs, in fact if we
> could find a way to not need to do:
> 
> if (some_uffd_predicate())
>          some_uffd_specific_fn();
> 
> That'd be incredible.
> 
> So I think the answer here is to do something like this, and to keep all
> the mm-specific code in core mm.
> 
> Thanks, Lorenzo


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-02 17:08         ` Nikita Kalyazin
@ 2025-07-02 17:39           ` Mike Rapoport
  2025-07-02 19:46             ` Peter Xu
  2025-07-02 21:24           ` Liam R. Howlett
  1 sibling, 1 reply; 42+ messages in thread
From: Mike Rapoport @ 2025-07-02 17:39 UTC (permalink / raw)
  To: Nikita Kalyazin
  Cc: Lorenzo Stoakes, Suren Baghdasaryan, Peter Xu, linux-mm,
	linux-kernel, Vlastimil Babka, Muchun Song, Hugh Dickins,
	Andrew Morton, James Houghton, Liam R . Howlett, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Wed, Jul 02, 2025 at 06:08:44PM +0100, Nikita Kalyazin wrote:
> 
> 
> On 02/07/2025 16:56, Lorenzo Stoakes wrote:
> > On Tue, Jul 01, 2025 at 10:04:28AM -0700, Suren Baghdasaryan wrote:
> > > On Mon, Jun 30, 2025 at 3:16 AM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > > This feels like you're trying to put mm functionality outside of mm?
> > > 
> > > To second that, two things stick out for me here:
> > > 1. uffd_copy and uffd_get_folio seem to be at different abstraction
> > > levels. uffd_copy is almost the entire copy operation for VM_SHARED
> > > VMAs while uffd_get_folio is a small part of the continue operation.
> > > 2. shmem_mfill_atomic_pte which becomes uffd_copy for shmem in the
> > > last patch is quite a complex function which itself calls some IMO
> > > pretty internal functions like mfill_atomic_install_pte(). Expecting
> > > modules to implement such functionality seems like a stretch to me but
> > > maybe this is for some specialized modules which are written by mm
> > > experts only?
> > 
> > To echo what Liam said - I don't think we can truly rely on expertise here
> > (we make enough mistakes in core mm for that to be a dubious proposition
> > even tere :) and even if experts were involved, having core mm
> > functionality outside of core mm carries significant risk - we are
> > constantly changing things, including assumptions around sensitive topics
> > such as locking (think VMA locking) - having code elsewhere significantly
> > increases the risk of missing things.
> > 
> > I am also absolutely, to be frank, not going to accept us EXPORT()'ing
> > anything core.
> > 
> > Page table manipulation really must rely in core mm and arch code only, it
> > is easily some of the most subtle, confusing and dangerous code in mm (I
> > have spent subtantial hours banging my head against it recently), and again
> > - subject to constant change.
> > 
> > But to come back to Liam's comments and to reiterate what I was referring
> > to earlier, even permitting drivers to have access to VMAs is _highly_
> > problematic and has resulted in very real bugs and subtle issues that took
> > many hours, much stress + gnashing of teeth to adress.
> 
> The main target of this change is the implementation of UFFD for
> KVM/guest_memfd (examples: [1], [2]) to avoid bringing KVM-specific code
> into the mm codebase.  We usually mean KVM by the "drivers" in this context,
> and it is already somewhat "knowledgeable" of the mm.  I don't think there
> are existing use cases for other drivers to implement this at the moment.
> 
> Although I can't see new exports in this series, there is now a way to limit
> exports to particular modules [3].  Would it help if we only do it for KVM
> initially (if/when actually needed)?

There were talks about pulling out guest_memfd core into mm, but I don't
remember patches about it. If parts of guest_memfd were already in mm/ that
would make easier to export uffd ops to it.
 
> [1]
> https://lore.kernel.org/all/114133f5-0282-463d-9d65-3143aa658806@amazon.com/
> [2]
> https://lore.kernel.org/all/7666ee96-6f09-4dc1-8cb2-002a2d2a29cf@amazon.com/
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git/commit/?h=kbuild&id=707f853d7fa3ce323a6875487890c213e34d81a0
> 
> Thanks,
> Nikita
> 
> > 
> > The very thing of:
> > 
> > xxx
> > <hand off sensitive mutable state X, Y, Z to driver>
> > yyy
> > 
> > Means that between xxx and yyy we can make literally no assumptions about
> > what just happened to all handed off state. A single instance of this has
> > caused mayhem, if we did this in such a way as to affect the _many_ uffd
> > hooks we could have a realy serious problem.
> > 
> > So - what seems really positive about this series is the _generalisation_
> > and _abstraction_ of uffd functionality.
> > 
> > That is something I appreciate and I think uffd sorely needs, in fact if we
> > could find a way to not need to do:
> > 
> > if (some_uffd_predicate())
> >          some_uffd_specific_fn();
> > 
> > That'd be incredible.
> > 
> > So I think the answer here is to do something like this, and to keep all
> > the mm-specific code in core mm.
> > 
> > Thanks, Lorenzo
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-01 17:04     ` Suren Baghdasaryan
  2025-07-02 15:40       ` Liam R. Howlett
  2025-07-02 15:56       ` Lorenzo Stoakes
@ 2025-07-02 18:16       ` Mike Rapoport
  2025-07-02 20:22         ` Peter Xu
  2 siblings, 1 reply; 42+ messages in thread
From: Mike Rapoport @ 2025-07-02 18:16 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Lorenzo Stoakes, Peter Xu, linux-mm, linux-kernel,
	Vlastimil Babka, Muchun Song, Hugh Dickins, Andrew Morton,
	James Houghton, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Tue, Jul 01, 2025 at 10:04:28AM -0700, Suren Baghdasaryan wrote:
> On Mon, Jun 30, 2025 at 3:16 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > It seems like we're assuming a _lot_ of mm understanding in the underlying
> > driver here.
> >
> > I'm not sure it's really normal to be handing around page table state and
> > folios etc. to a driver like this, this is really... worrying to me.
> >
> > This feels like you're trying to put mm functionality outside of mm?
> 
> To second that, two things stick out for me here:
> 1. uffd_copy and uffd_get_folio seem to be at different abstraction
> levels. uffd_copy is almost the entire copy operation for VM_SHARED
> VMAs while uffd_get_folio is a small part of the continue operation.
> 2. shmem_mfill_atomic_pte which becomes uffd_copy for shmem in the
> last patch is quite a complex function which itself calls some IMO
> pretty internal functions like mfill_atomic_install_pte(). Expecting
> modules to implement such functionality seems like a stretch to me but
> maybe this is for some specialized modules which are written by mm
> experts only?

Largely shmem_mfill_atomic_pte() differs from anonymous memory version
(mfill_atomic_pte_copy()) by the way the allocated folio is accounted and
whether it's added to the page cache. So instead of uffd_copy(...) we might
add

	int (*folio_alloc)(struct vm_area_struct *vma, unsigned long dst_addr);
	void (*folio_release)(struct vm_area_struct *vma, struct folio *folio);

and then use them in mfill_atomic_pte_copy():

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..6bad0dd70d3d 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -247,8 +247,11 @@ static int mfill_atomic_pte_copy(pmd_t *dst_pmd,
 
 	if (!*foliop) {
 		ret = -ENOMEM;
-		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, dst_vma,
-					dst_addr);
+		if (uffd_ops(dst_vma) && uffd_ops(dst_vma)->folio_alloc)
+			folio = uffd_ops(dst_vma)->folio_alloc(dst_vma, dst_addr);
+		else
+			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, dst_vma,
+						dst_addr);
 		if (!folio)
 			goto out;
 
@@ -307,6 +310,8 @@ static int mfill_atomic_pte_copy(pmd_t *dst_pmd,
 	return ret;
 out_release:
 	folio_put(folio);
+	if (uffd_ops(dst_vma) && uffd_ops(dst_vma)->folio_release)
+		uffd_ops(dst_vma)->folio_release(dst_vma, folio);
 	goto out;
 }
 

-- 
Sincerely yours,
Mike.

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-06-29  8:50   ` Mike Rapoport
@ 2025-07-02 19:30     ` Peter Xu
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2025-07-02 19:30 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, linux-kernel, Vlastimil Babka, Suren Baghdasaryan,
	Muchun Song, Lorenzo Stoakes, Hugh Dickins, Andrew Morton,
	James Houghton, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Sun, Jun 29, 2025 at 11:50:11AM +0300, Mike Rapoport wrote:
> Hi Peter,

Hi, Mike,

> 
> On Fri, Jun 27, 2025 at 11:46:52AM -0400, Peter Xu wrote:
> > Introduce a generic userfaultfd API for vm_operations_struct, so that one
> > vma, especially when as a module, can support userfaults without modifying
> > the core files.  More importantly, when the module can be compiled out of
> > the kernel.
> > 
> > So, instead of having core mm referencing modules that may not ever exist,
> > we need to have modules opt-in on core mm hooks instead.
> > 
> > After this API applied, if a module wants to support userfaultfd, the
> > module should only need to touch its own file and properly define
> > vm_uffd_ops, instead of changing anything in core mm.
> 
> I liked the changelog update you proposed in v1 thread. I took liberty to

It's definitely hard to satisfy all reviewers on one version of commit
message..

> slightly update it and here's what I've got:
> 
>   Currently, most of the userfaultfd features are implemented directly in the
>   core mm.  It will invoke VMA specific functions whenever necessary.  So far
>   it is fine because it almost only interacts with shmem and hugetlbfs.
> 
>   Introduce a generic userfaultfd API extension for vm_operations_struct,
>   so that any code that implements vm_operations_struct (including kernel
>   modules that can be compiled separately from the kernel core) can support
>   userfaults without modifying the core files.
> 
>   With this API applied, if a module wants to support userfaultfd, the
>   module should only need to properly define vm_uffd_ops and hook it to
>   vm_operations_struct, instead of changing anything in core mm.

Thanks, I very much appreciate explicit suggestions on the wordings.
Personally I like it and the rest suggestions, I'll use it when repost, but
I'll also wait for others if anyone has other things to say.

> 
> > Note that such API will not work for anonymous. Core mm will process
> > anonymous memory separately for userfault operations like before.
> 
> Maybe:
> 
>   This API will not work for anonymous memory. Handling of userfault
>   operations for anonymous memory remains unchanged in core mm.
>  
> > This patch only introduces the API alone so that we can start to move
> > existing users over but without breaking them.
> 
> Please use imperative mood, e.g.
>  
>   Only introduce the new API so that ...
>  
> > Currently the uffd_copy() API is almost designed to be the simplistic with
> > minimum mm changes to move over to the API.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  include/linux/mm.h            |  9 ++++++
> >  include/linux/userfaultfd_k.h | 52 +++++++++++++++++++++++++++++++++++
> >  2 files changed, 61 insertions(+)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index ef40f68c1183..6a5447bd43fd 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -576,6 +576,8 @@ struct vm_fault {
> >  					 */
> >  };
> >  
> > +struct vm_uffd_ops;
> > +
> >  /*
> >   * These are the virtual MM functions - opening of an area, closing and
> >   * unmapping it (needed to keep files on disk up-to-date etc), pointer
> > @@ -653,6 +655,13 @@ struct vm_operations_struct {
> >  	 */
> >  	struct page *(*find_special_page)(struct vm_area_struct *vma,
> >  					  unsigned long addr);
> > +#ifdef CONFIG_USERFAULTFD
> > +	/*
> > +	 * Userfaultfd related ops.  Modules need to define this to support
> > +	 * userfaultfd.
> > +	 */
> > +	const struct vm_uffd_ops *userfaultfd_ops;
> > +#endif
> >  };
> >  
> >  #ifdef CONFIG_NUMA_BALANCING
> > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > index df85330bcfa6..c9a093c4502b 100644
> > --- a/include/linux/userfaultfd_k.h
> > +++ b/include/linux/userfaultfd_k.h
> > @@ -92,6 +92,58 @@ enum mfill_atomic_mode {
> >  	NR_MFILL_ATOMIC_MODES,
> >  };
> >  
> > +/* VMA userfaultfd operations */
> > +struct vm_uffd_ops {
> > +	/**
> > +	 * @uffd_features: features supported in bitmask.
> > +	 *
> > +	 * When the ops is defined, the driver must set non-zero features
> > +	 * to be a subset (or all) of: VM_UFFD_MISSING|WP|MINOR.
> > +	 */
> > +	unsigned long uffd_features;
> > +	/**
> > +	 * @uffd_ioctls: ioctls supported in bitmask.
> > +	 *
> > +	 * Userfaultfd ioctls supported by the module.  Below will always
> > +	 * be supported by default whenever a module provides vm_uffd_ops:
> > +	 *
> > +	 *   _UFFDIO_API, _UFFDIO_REGISTER, _UFFDIO_UNREGISTER, _UFFDIO_WAKE
> > +	 *
> > +	 * The module needs to provide all the rest optionally supported
> > +	 * ioctls.  For example, when VM_UFFD_MISSING was supported,
> > +	 * _UFFDIO_COPY must be supported as ioctl, while _UFFDIO_ZEROPAGE
> > +	 * is optional.
> > +	 */
> > +	unsigned long uffd_ioctls;
> > +	/**
> > +	 * uffd_get_folio: Handler to resolve UFFDIO_CONTINUE request.
> > +	 *
> > +	 * @inode: the inode for folio lookup
> > +	 * @pgoff: the pgoff of the folio
> > +	 * @folio: returned folio pointer
> > +	 *
> > +	 * Return: zero if succeeded, negative for errors.
> > +	 */
> > +	int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
> > +			      struct folio **folio);
> > +	/**
> > +	 * uffd_copy: Handler to resolve UFFDIO_COPY|ZEROPAGE request.
> > +	 *
> > +	 * @dst_pmd: target pmd to resolve page fault
> > +	 * @dst_vma: target vma
> > +	 * @dst_addr: target virtual address
> > +	 * @src_addr: source address to copy from
> > +	 * @flags: userfaultfd request flags
> > +	 * @foliop: previously allocated folio
> > +	 *
> > +	 * Return: zero if succeeded, negative for errors.
> > +	 */
> > +	int (*uffd_copy)(pmd_t *dst_pmd, struct vm_area_struct *dst_vma,
> > +			 unsigned long dst_addr, unsigned long src_addr,
> > +			 uffd_flags_t flags, struct folio **foliop);
> > +};
> > +typedef struct vm_uffd_ops vm_uffd_ops;
> 
> Either use vm_uffd_ops_t for the typedef or drop the typedef entirely. My
> preference is for the second option.

Andrew helped me to fix some hidden spaces which I appreciated, then I
found checkpatch warns on this one too besides the spaces fixed in mm-new.
I do not know why checkpatch doesn't like typedefs even if typedefs are
massively used in Linux..

I think I'll simply stick with not using typedefs.

Thanks,

> 
> > +
> >  #define MFILL_ATOMIC_MODE_BITS (const_ilog2(NR_MFILL_ATOMIC_MODES - 1) + 1)
> >  #define MFILL_ATOMIC_BIT(nr) BIT(MFILL_ATOMIC_MODE_BITS + (nr))
> >  #define MFILL_ATOMIC_FLAG(nr) ((__force uffd_flags_t) MFILL_ATOMIC_BIT(nr))
> > -- 
> > 2.49.0
> > 
> 
> -- 
> Sincerely yours,
> Mike.
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-02 17:39           ` Mike Rapoport
@ 2025-07-02 19:46             ` Peter Xu
  2025-07-03 17:48               ` Mike Rapoport
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2025-07-02 19:46 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Nikita Kalyazin, Lorenzo Stoakes, Suren Baghdasaryan, linux-mm,
	linux-kernel, Vlastimil Babka, Muchun Song, Hugh Dickins,
	Andrew Morton, James Houghton, Liam R . Howlett, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Wed, Jul 02, 2025 at 08:39:32PM +0300, Mike Rapoport wrote:

[...]

> > The main target of this change is the implementation of UFFD for
> > KVM/guest_memfd (examples: [1], [2]) to avoid bringing KVM-specific code
> > into the mm codebase.  We usually mean KVM by the "drivers" in this context,
> > and it is already somewhat "knowledgeable" of the mm.  I don't think there
> > are existing use cases for other drivers to implement this at the moment.
> > 
> > Although I can't see new exports in this series, there is now a way to limit
> > exports to particular modules [3].  Would it help if we only do it for KVM
> > initially (if/when actually needed)?
> 
> There were talks about pulling out guest_memfd core into mm, but I don't
> remember patches about it. If parts of guest_memfd were already in mm/ that
> would make easier to export uffd ops to it.

Do we have a link to such discussion?  I'm also curious whether that idea
was acknowledged by KVM maintainers.

Having an abstraction layer for userfaultfd memtypes within mm always makes
some sense on its own to me, so as to remove separate random checks over
either shmem or hugetlb.  E.g. after the series applied, we can drop the
shmem header in userfaultfd code, which should also be a step forward.

Thanks,

>  
> > [1]
> > https://lore.kernel.org/all/114133f5-0282-463d-9d65-3143aa658806@amazon.com/
> > [2]
> > https://lore.kernel.org/all/7666ee96-6f09-4dc1-8cb2-002a2d2a29cf@amazon.com/
> > [3] https://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git/commit/?h=kbuild&id=707f853d7fa3ce323a6875487890c213e34d81a0

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-02 18:16       ` Mike Rapoport
@ 2025-07-02 20:22         ` Peter Xu
  2025-07-03 15:01           ` Suren Baghdasaryan
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2025-07-02 20:22 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Suren Baghdasaryan, Lorenzo Stoakes, linux-mm, linux-kernel,
	Vlastimil Babka, Muchun Song, Hugh Dickins, Andrew Morton,
	James Houghton, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Wed, Jul 02, 2025 at 09:16:31PM +0300, Mike Rapoport wrote:
> On Tue, Jul 01, 2025 at 10:04:28AM -0700, Suren Baghdasaryan wrote:
> > On Mon, Jun 30, 2025 at 3:16 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > It seems like we're assuming a _lot_ of mm understanding in the underlying
> > > driver here.
> > >
> > > I'm not sure it's really normal to be handing around page table state and
> > > folios etc. to a driver like this, this is really... worrying to me.
> > >
> > > This feels like you're trying to put mm functionality outside of mm?
> > 
> > To second that, two things stick out for me here:
> > 1. uffd_copy and uffd_get_folio seem to be at different abstraction
> > levels. uffd_copy is almost the entire copy operation for VM_SHARED
> > VMAs while uffd_get_folio is a small part of the continue operation.
> > 2. shmem_mfill_atomic_pte which becomes uffd_copy for shmem in the
> > last patch is quite a complex function which itself calls some IMO
> > pretty internal functions like mfill_atomic_install_pte(). Expecting
> > modules to implement such functionality seems like a stretch to me but
> > maybe this is for some specialized modules which are written by mm
> > experts only?
> 
> Largely shmem_mfill_atomic_pte() differs from anonymous memory version
> (mfill_atomic_pte_copy()) by the way the allocated folio is accounted and
> whether it's added to the page cache. So instead of uffd_copy(...) we might
> add
> 
> 	int (*folio_alloc)(struct vm_area_struct *vma, unsigned long dst_addr);
> 	void (*folio_release)(struct vm_area_struct *vma, struct folio *folio);

Thanks for digging into this, Mike.  It's just that IMHO it may not be
enough..

I actually tried to think about a more complicated API, but more I thought
of that, more it looked like an overkill.  I can list something here to
show why I chose the simplest solution with uffd_copy() as of now.

Firstly, see shmem_inode_acct_blocks() at the entrance: that's shmem
accounting we need to do before allocations, and with/without allocations.
That accounting can't be put into folio_alloc() yet even if we'll have one,
because we could have the folio allocated when reaching here (that is, when
*foliop != NULL).  That was a very delicated decision of us to do shmem
accounting separately in 2016:

https://lore.kernel.org/all/20161216144821.5183-37-aarcange@redhat.com/

Then, there's also the complexity on how the page cache should be managed
for any mem type.  For shmem, folio was only injected right before the
pgtable installations.  We can't inject it when folio_alloc() because then
others can fault-in without data populated. It means we at least need one
more API to do page cache injections for the folio just got allocated from
folio_alloc().

We also may want to have different treatment on how the folio flags are
setup.  It may not always happen in folio_alloc().  E.g. for shmem right
now we do this right before the page cache injections:

	VM_BUG_ON(folio_test_locked(folio));
	VM_BUG_ON(folio_test_swapbacked(folio));
	__folio_set_locked(folio);
	__folio_set_swapbacked(folio);
	__folio_mark_uptodate(folio);

We may not want to do exactly the same for all the rest mem types.  E.g. we
likely don't want to set swapbacked for guest-memfd folios.  We may need
one more API to do it.

Then if to think about failure path, when we have the question above on
shmem acct issue, we may want to have yet another post_failure hook doing
shmem_inode_unacct_blocks() properly for shmem.. maybe we don't need that
for guest-memfd, but we still need that for shmem to properly unacct when
folio allocation succeeded, but copy_from_user failed somehow.

Then the question is, do we really need all these fuss almost for nothing?

Note that if we want, IIUC we can still change this in the future. The
initial isolation like this series would still be good to land earlier; we
don't even plan to support MISSING for guest-memfd in the near future, but
only MINOR mode for now.  We don't necessarily need to work out MISSING
immediately to move on useful features landing Linux.  Even if we'll have
it for guest-memfd, it shouldn't be a challenge to maintain both.  This is
just not a contract we need to sign forever yet.

Hope above clarifies a bit on why I chose the simplest solution as of
now. I also don't like this API, as I mentioned in the cover letter. It's
really a trade-off I made at least for now the best I can come up with.

Said that, if any of us has better solution, please shoot.  I'm always open
to better alternatives.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-02 15:56       ` Lorenzo Stoakes
  2025-07-02 17:08         ` Nikita Kalyazin
@ 2025-07-02 20:24         ` Peter Xu
  2025-07-03 16:32           ` Lorenzo Stoakes
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Xu @ 2025-07-02 20:24 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Suren Baghdasaryan, linux-mm, linux-kernel, Vlastimil Babka,
	Muchun Song, Mike Rapoport, Hugh Dickins, Andrew Morton,
	James Houghton, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Wed, Jul 02, 2025 at 04:56:04PM +0100, Lorenzo Stoakes wrote:
> I am also absolutely, to be frank, not going to accept us EXPORT()'ing
> anything core.

Could you be explicit on what you're absolutely against?  Is that the
vm_uffd_ops (or any form of it) being exported, or anything else?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 0/4] mm/userfaultfd: modulize memory types
  2025-06-30 10:29 ` [PATCH v2 0/4] mm/userfaultfd: modulize memory types Lorenzo Stoakes
  2025-07-01  0:15   ` Andrew Morton
@ 2025-07-02 20:36   ` Peter Xu
  2025-07-03 15:55     ` Lorenzo Stoakes
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Xu @ 2025-07-02 20:36 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, Vlastimil Babka, Suren Baghdasaryan,
	Muchun Song, Mike Rapoport, Hugh Dickins, Andrew Morton,
	James Houghton, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Mon, Jun 30, 2025 at 11:29:30AM +0100, Lorenzo Stoakes wrote:
> On Fri, Jun 27, 2025 at 11:46:51AM -0400, Peter Xu wrote:
> > [based on latest akpm/mm-new of June 27th, commit 9be7387ae43f]
> >
> > v2 changelog:
> > - Patch 1
> >   - update English in commit log [David]
> >   - move vm_uffd_ops definition to userfaultfd_k.h [Mike]
> > - Patch 4
> >   - fix sparse warning on bitwise type conversions [syzbot]
> >   - Commit message updates on explanation of vma_can_userfault check [James]
> >
> > v1: https://lore.kernel.org/r/20250620190342.1780170-1-peterx@redhat.com
> >
> > This series is an alternative proposal of what Nikita proposed here on the
> > initial three patches:
> >
> >   https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com
> >
> > This is not yet relevant to any guest-memfd support, but paving way for it.
> > Here, the major goal is to make kernel modules be able to opt-in with any
> > form of userfaultfd supports, like guest-memfd.  This alternative option
> > should hopefully be cleaner, and avoid leaking userfault details into
> > vm_ops.fault().
> >
> > It also means this series does not depend on anything.  It's a pure
> > refactoring of userfaultfd internals to provide a generic API, so that
> > other types of files, especially RAM based, can support userfaultfd without
> > touching mm/ at all.
> 
> I'm very concerned that this change will simply move core mm functionality out
> of mm and into drivers where it can bitrot and cause subtle bugs?
> 
> You're proposing providing stuff like page table state and asking for a folio
> back from a driver etc.
> 
> I absolutely am not in favour of us providing core mm internals like this to
> drivers, and I don't want to see us having to EXPORT() mm internals just to make
> module-ised uffd code work (I mean I just will flat out refuse to do that).
> 
> I think we need to think _very_ carefully about how we do this.
> 
> I also feel like this series is at a really basic level and you've not fully
> determined what API calls you need.

See:

https://lore.kernel.org/all/aGWVIjmmsmskA4bp@x1.local/#t

> 
> I agree that it's sensible to be incremental, but I feel like you sort of need
> to somewhat prove the case that you can jump from 'incremental version where we
> only support code in mm/' to supporting arbitrary file system code that might be
> modules.
> 
> Because otherwise you're basically _guessing_ that you can do this, possibly, in
> the future and maybe it's just not the right approach but that's not clear yet?

Did you follow up with the discussions in v1?  I copied you too.

https://lore.kernel.org/r/114133f5-0282-463d-9d65-3143aa658806@amazon.com

Would Nikita's work help here?  Could you explain what are you asking for
to prove that this works for us?

> 
> >
> > To achieve it, this series introduced a file operation called vm_uffd_ops.
> > The ops needs to be provided when a file type supports any of userfaultfd.
> >
> > With that, I moved both hugetlbfs and shmem over.
> 
> Well as you say below hugetlbfs is sort of a stub implementation, I wonder
> whether we'd need quite a bit more to make that work.
> 
> One thing I'd _really_ like to avoid is us having to add a bunch of hook points
> into core mm code just for uffd that then call out to some driver.
> 
> We've encountered such a total nightmare with .mmap() for instance in the past
> (including stuff that resulted in security issues) because we - simply cannot
> assume anything - about what the hook implementor might do with the passed
> parameters.
> 
> This is really really problematic.
> 
> I also absolutely hate the:
> 
> if (uffd)
> 	do_something_weird();
> 
> Pattern, so hopefully this won't proliferate that.
> 
> >
> > Hugetlbfs is still very special that it will only use partial of the
> > vm_uffd_ops API, due to similar reason why hugetlb_vm_op_fault() has a
> > BUG() and so far hard-coded into core mm.  But this should still be better,
> > because at least hugetlbfs is still always involved in feature probing
> > (e.g. where it used to not support ZEROPAGE and we have a hard-coded line
> > to fail that, and some more).  Meanwhile after this series, shmem will be
> > completely converted to the new vm_uffd_ops API; the final vm_uffd_ops for
> > shmem looks like this:
> >
> > static const vm_uffd_ops shmem_uffd_ops = {
> > 	.uffd_features	= 	__VM_UFFD_FLAGS,
> > 	.uffd_ioctls	= 	BIT(_UFFDIO_COPY) |
> > 				BIT(_UFFDIO_ZEROPAGE) |
> > 				BIT(_UFFDIO_WRITEPROTECT) |
> > 				BIT(_UFFDIO_CONTINUE) |
> > 				BIT(_UFFDIO_POISON),
> > 	.uffd_get_folio	=	shmem_uffd_get_folio,
> > 	.uffd_copy	=	shmem_mfill_atomic_pte,
> > };
> >
> > As I mentioned in one of my reply to Nikita, I don't like the current
> > interface of uffd_copy(), but this will be the minimum change version of
> > such API to support complete extrenal-module-ready userfaultfd.  Here, very
> > minimal change will be needed from shmem side to support that.
> 
> Right, maybe a better version of this interface might address some of my
> concerns... :)
> 
> >
> > Meanwhile, the vm_uffd_ops is also not the only place one will need to
> > provide to support userfaultfd.  Normally vm_ops.fault() will also need to
> > be updated, but that's a generic function and it'll play together with the
> > new vm_uffd_ops to make everything fly.
> >
> > No functional change expected at all after the whole series applied.  There
> > might be some slightly stricter check on uffd ops here and there in the
> > last patch, but that really shouldn't stand out anywhere to anyone.
> >
> > For testing: besides the cross-compilation tests, I did also try with
> > uffd-stress in a VM to measure any perf difference before/after the change;
> > The static call becomes a pointer now.  I really cannot measure anything
> > different, which is more or less expected.
> >
> > Comments welcomed, thanks.
> >
> > Peter Xu (4):
> >   mm: Introduce vm_uffd_ops API
> >   mm/shmem: Support vm_uffd_ops API
> >   mm/hugetlb: Support vm_uffd_ops API
> >   mm: Apply vm_uffd_ops API to core mm
> >
> >  include/linux/mm.h            |   9 +++
> >  include/linux/shmem_fs.h      |  14 -----
> >  include/linux/userfaultfd_k.h |  98 +++++++++++++++++++----------
> >  mm/hugetlb.c                  |  19 ++++++
> >  mm/shmem.c                    |  28 ++++++++-
> >  mm/userfaultfd.c              | 115 +++++++++++++++++++++++++---------
> >  6 files changed, 207 insertions(+), 76 deletions(-)
> >
> > --
> > 2.49.0
> >
> 
> Sorry to be critical, I just want to make sure we're not setting ourselves up
> for trouble here.
> 
> I _very much_ support efforts to make uffd more generalised, and ideally to find
> a way to separate out shmem and hugetlbfs implementation bits, so I support the
> intent _fully_.
> 
> I just want to make sure we do it in a safe way :)

Any explicit suggestions (besides objections)?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 4/4] mm: Apply vm_uffd_ops API to core mm
  2025-06-29  8:55   ` Mike Rapoport
@ 2025-07-02 20:38     ` Peter Xu
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Xu @ 2025-07-02 20:38 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, linux-kernel, Vlastimil Babka, Suren Baghdasaryan,
	Muchun Song, Lorenzo Stoakes, Hugh Dickins, Andrew Morton,
	James Houghton, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Sun, Jun 29, 2025 at 11:55:27AM +0300, Mike Rapoport wrote:
> On Fri, Jun 27, 2025 at 11:46:55AM -0400, Peter Xu wrote:
> > This patch completely moves the old userfaultfd core to use the new
> > vm_uffd_ops API.  After this change, existing file systems will start to
> > use the new API for userfault operations.
> 
> Maybe:
> 
>   Move userfaultfd core to use new vm_uffd_ops API. After this change file
>   systems that implement vm_operations_struct can start using new API for
>   userfaultfd operations.

Sure.

>  
> > When at it, moving vma_can_userfault() into mm/userfaultfd.c instead,
> > because it's getting too big.  It's only used in slow paths so it shouldn't
> > be an issue.  Move the pte marker check before wp_async, which might be
> > more intuitive because wp_async depends on pte markers.  That shouldn't
> > cause any functional change though because only one check would take effect
> > depending on whether pte marker was selected in config.
> > 
> > This will also remove quite some hard-coded checks for either shmem or
> > hugetlbfs.  Now all the old checks should still work but with vm_uffd_ops.
> > 
> > Note that anonymous memory will still need to be processed separately
> > because it doesn't have vm_ops at all.
> > 
> > Reviewed-by: James Houghton <jthoughton@google.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

I'll take this for now, thanks.  We can finish the discussion in patch 1 to
see whether we need to refine the API.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-02 17:08         ` Nikita Kalyazin
  2025-07-02 17:39           ` Mike Rapoport
@ 2025-07-02 21:24           ` Liam R. Howlett
  2025-07-02 21:36             ` Peter Xu
  1 sibling, 1 reply; 42+ messages in thread
From: Liam R. Howlett @ 2025-07-02 21:24 UTC (permalink / raw)
  To: Nikita Kalyazin
  Cc: Lorenzo Stoakes, Suren Baghdasaryan, Peter Xu, linux-mm,
	linux-kernel, Vlastimil Babka, Muchun Song, Mike Rapoport,
	Hugh Dickins, Andrew Morton, James Houghton, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

* Nikita Kalyazin <kalyazin@amazon.com> [250702 13:08]:
> 
> 
> On 02/07/2025 16:56, Lorenzo Stoakes wrote:
> > On Tue, Jul 01, 2025 at 10:04:28AM -0700, Suren Baghdasaryan wrote:
> > > On Mon, Jun 30, 2025 at 3:16 AM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > > This feels like you're trying to put mm functionality outside of mm?
> > > 
> > > To second that, two things stick out for me here:
> > > 1. uffd_copy and uffd_get_folio seem to be at different abstraction
> > > levels. uffd_copy is almost the entire copy operation for VM_SHARED
> > > VMAs while uffd_get_folio is a small part of the continue operation.
> > > 2. shmem_mfill_atomic_pte which becomes uffd_copy for shmem in the
> > > last patch is quite a complex function which itself calls some IMO
> > > pretty internal functions like mfill_atomic_install_pte(). Expecting
> > > modules to implement such functionality seems like a stretch to me but
> > > maybe this is for some specialized modules which are written by mm
> > > experts only?
> > 
> > To echo what Liam said - I don't think we can truly rely on expertise here
> > (we make enough mistakes in core mm for that to be a dubious proposition
> > even tere :) and even if experts were involved, having core mm
> > functionality outside of core mm carries significant risk - we are
> > constantly changing things, including assumptions around sensitive topics
> > such as locking (think VMA locking) - having code elsewhere significantly
> > increases the risk of missing things.
> > 
> > I am also absolutely, to be frank, not going to accept us EXPORT()'ing
> > anything core.
> > 
> > Page table manipulation really must rely in core mm and arch code only, it
> > is easily some of the most subtle, confusing and dangerous code in mm (I
> > have spent subtantial hours banging my head against it recently), and again
> > - subject to constant change.
> > 
> > But to come back to Liam's comments and to reiterate what I was referring
> > to earlier, even permitting drivers to have access to VMAs is _highly_
> > problematic and has resulted in very real bugs and subtle issues that took
> > many hours, much stress + gnashing of teeth to adress.
> 
> The main target of this change is the implementation of UFFD for
> KVM/guest_memfd (examples: [1], [2]) to avoid bringing KVM-specific code
> into the mm codebase.  We usually mean KVM by the "drivers" in this context,
> and it is already somewhat "knowledgeable" of the mm.  I don't think there
> are existing use cases for other drivers to implement this at the moment.
> 
> Although I can't see new exports in this series, there is now a way to limit
> exports to particular modules [3].  Would it help if we only do it for KVM
> initially (if/when actually needed)?
> 
> [1]
> https://lore.kernel.org/all/114133f5-0282-463d-9d65-3143aa658806@amazon.com/
> [2]
> https://lore.kernel.org/all/7666ee96-6f09-4dc1-8cb2-002a2d2a29cf@amazon.com/
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git/commit/?h=kbuild&id=707f853d7fa3ce323a6875487890c213e34d81a0
> 

That's because the entry point is from a function pointer, so [3] won't
help at all.

It is recreating the situation that existed for the vma through the
vm_ops in mmap, but for uffd.  And at a lower level (page tables).  I do not
want to relive that experience.

We are not doing this.  It is for the benefit of everyone that we are
not doing this.

We need to find another way.

Regards,
Liam


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-02 21:24           ` Liam R. Howlett
@ 2025-07-02 21:36             ` Peter Xu
  2025-07-03  2:00               ` Liam R. Howlett
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2025-07-02 21:36 UTC (permalink / raw)
  To: Liam R. Howlett, Nikita Kalyazin, Lorenzo Stoakes,
	Suren Baghdasaryan, linux-mm, linux-kernel, Vlastimil Babka,
	Muchun Song, Mike Rapoport, Hugh Dickins, Andrew Morton,
	James Houghton, Michal Hocko, David Hildenbrand, Andrea Arcangeli,
	Oscar Salvador, Axel Rasmussen, Ujwal Kundur

On Wed, Jul 02, 2025 at 05:24:02PM -0400, Liam R. Howlett wrote:
> That's because the entry point is from a function pointer, so [3] won't
> help at all.
> 
> It is recreating the situation that existed for the vma through the
> vm_ops in mmap, but for uffd.  And at a lower level (page tables).  I do not
> want to relive that experience.
> 
> We are not doing this.  It is for the benefit of everyone that we are
> not doing this.

Is the vma issue about "allowing vma->vm_flags to be modified anywhere"
issue?  Or is there a pointer to the issue being discussed if not?

> 
> We need to find another way.

Could you suggest something?  The minimum goal is to allow guest-memfd
support minor faults.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-02 21:36             ` Peter Xu
@ 2025-07-03  2:00               ` Liam R. Howlett
  2025-07-03 15:24                 ` Peter Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Liam R. Howlett @ 2025-07-03  2:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: Nikita Kalyazin, Lorenzo Stoakes, Suren Baghdasaryan, linux-mm,
	linux-kernel, Vlastimil Babka, Muchun Song, Mike Rapoport,
	Hugh Dickins, Andrew Morton, James Houghton, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

* Peter Xu <peterx@redhat.com> [250702 17:36]:
> On Wed, Jul 02, 2025 at 05:24:02PM -0400, Liam R. Howlett wrote:
> > That's because the entry point is from a function pointer, so [3] won't
> > help at all.
> > 
> > It is recreating the situation that existed for the vma through the
> > vm_ops in mmap, but for uffd.  And at a lower level (page tables).  I do not
> > want to relive that experience.
> > 
> > We are not doing this.  It is for the benefit of everyone that we are
> > not doing this.
> 
> Is the vma issue about "allowing vma->vm_flags to be modified anywhere"
> issue?  Or is there a pointer to the issue being discussed if not?

The issue is passing pointers of structs that are protected by locks or
ref counters into modules to do what they please.

vma->vm_flags was an example of where we learned how wrong this can go.

There is also the concern of the state of the folio on return from the
callback.  The error handling gets messy quick.

Now, imagine we have something that gets a folio, but then we find a
solution for contention of a lock or ref count (whatever is next), but
it doesn't work because the mm code has been bleeding into random
modules and we have no clue what that module is supposed to be doing, or
we can't make the necessary change because this module will break
userspace, or cause a performance decrease, or any other random thing
that we cannot work around without rewriting (probably suboptimally)
something we don't maintain.

Again, these are examples of how this can go bad but not an exhaustive
list by any means.

So the issue is with allowing modules to play with the folio and page
tables on their own.

If this is outside the mm, we probably won't even be Cc'ed on modules
that use it.

And do we want to be Cc'ed on modules that want to use it?

We will most likely be Cc'ed or emailed directly on the resulting memory
leak/security issue that results in what should be mm code.  It'll be a
Saturday because it always is.. :)

Even the example use code had a potential ref leak that you found [1].

> > 
> > We need to find another way.
> 
> Could you suggest something?  The minimum goal is to allow guest-memfd
> support minor faults.

Mike brought up another idea, that seems worth looking into.

Thanks,
Liam

[1] https://lore.kernel.org/all/7455220c-e35b-4509-b7c3-a78fde5b12d5@amazon.com/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-02 20:22         ` Peter Xu
@ 2025-07-03 15:01           ` Suren Baghdasaryan
  2025-07-03 15:45             ` Peter Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Suren Baghdasaryan @ 2025-07-03 15:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Rapoport, Lorenzo Stoakes, linux-mm, linux-kernel,
	Vlastimil Babka, Muchun Song, Hugh Dickins, Andrew Morton,
	James Houghton, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Wed, Jul 2, 2025 at 1:23 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Jul 02, 2025 at 09:16:31PM +0300, Mike Rapoport wrote:
> > On Tue, Jul 01, 2025 at 10:04:28AM -0700, Suren Baghdasaryan wrote:
> > > On Mon, Jun 30, 2025 at 3:16 AM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > >
> > > > It seems like we're assuming a _lot_ of mm understanding in the underlying
> > > > driver here.
> > > >
> > > > I'm not sure it's really normal to be handing around page table state and
> > > > folios etc. to a driver like this, this is really... worrying to me.
> > > >
> > > > This feels like you're trying to put mm functionality outside of mm?
> > >
> > > To second that, two things stick out for me here:
> > > 1. uffd_copy and uffd_get_folio seem to be at different abstraction
> > > levels. uffd_copy is almost the entire copy operation for VM_SHARED
> > > VMAs while uffd_get_folio is a small part of the continue operation.
> > > 2. shmem_mfill_atomic_pte which becomes uffd_copy for shmem in the
> > > last patch is quite a complex function which itself calls some IMO
> > > pretty internal functions like mfill_atomic_install_pte(). Expecting
> > > modules to implement such functionality seems like a stretch to me but
> > > maybe this is for some specialized modules which are written by mm
> > > experts only?
> >
> > Largely shmem_mfill_atomic_pte() differs from anonymous memory version
> > (mfill_atomic_pte_copy()) by the way the allocated folio is accounted and
> > whether it's added to the page cache. So instead of uffd_copy(...) we might
> > add
> >
> >       int (*folio_alloc)(struct vm_area_struct *vma, unsigned long dst_addr);
> >       void (*folio_release)(struct vm_area_struct *vma, struct folio *folio);
>
> Thanks for digging into this, Mike.  It's just that IMHO it may not be
> enough..
>
> I actually tried to think about a more complicated API, but more I thought
> of that, more it looked like an overkill.  I can list something here to
> show why I chose the simplest solution with uffd_copy() as of now.

TBH below does not sound like an overkill to me for keeping mm parts
to itself without over-exposing them to modules.

>
> Firstly, see shmem_inode_acct_blocks() at the entrance: that's shmem
> accounting we need to do before allocations, and with/without allocations.

Ok, this results in an additional folio_prealloc() hook.

> That accounting can't be put into folio_alloc() yet even if we'll have one,
> because we could have the folio allocated when reaching here (that is, when
> *foliop != NULL).  That was a very delicated decision of us to do shmem
> accounting separately in 2016:
>
> https://lore.kernel.org/all/20161216144821.5183-37-aarcange@redhat.com/
>
> Then, there's also the complexity on how the page cache should be managed
> for any mem type.  For shmem, folio was only injected right before the
> pgtable installations.  We can't inject it when folio_alloc() because then
> others can fault-in without data populated. It means we at least need one
> more API to do page cache injections for the folio just got allocated from
> folio_alloc().

folio_add_to_cache() hook?

>
> We also may want to have different treatment on how the folio flags are
> setup.  It may not always happen in folio_alloc().  E.g. for shmem right
> now we do this right before the page cache injections:
>
>         VM_BUG_ON(folio_test_locked(folio));
>         VM_BUG_ON(folio_test_swapbacked(folio));
>         __folio_set_locked(folio);
>         __folio_set_swapbacked(folio);
>         __folio_mark_uptodate(folio);
>
> We may not want to do exactly the same for all the rest mem types.  E.g. we
> likely don't want to set swapbacked for guest-memfd folios.  We may need
> one more API to do it.

Can we do that inside folio_add_to_cache() hook before doing the injection?

>
> Then if to think about failure path, when we have the question above on
> shmem acct issue, we may want to have yet another post_failure hook doing
> shmem_inode_unacct_blocks() properly for shmem.. maybe we don't need that
> for guest-memfd, but we still need that for shmem to properly unacct when
> folio allocation succeeded, but copy_from_user failed somehow.

Failure handling hook.

>
> Then the question is, do we really need all these fuss almost for nothing?

If that helps to keep modules simpler and mm details contained inside
the core kernel I think it is worth doing. I imagine if the shmem was
a module then implementing the current API would require exporting
functions like mfill_atomic_install_pte(). That seems like
over-exposure to me. And if we can avoid all the locking and
refcounting that Liam mentioned that would definitely be worth it
IMHO.

>
> Note that if we want, IIUC we can still change this in the future. The
> initial isolation like this series would still be good to land earlier; we
> don't even plan to support MISSING for guest-memfd in the near future, but
> only MINOR mode for now.  We don't necessarily need to work out MISSING
> immediately to move on useful features landing Linux.  Even if we'll have
> it for guest-memfd, it shouldn't be a challenge to maintain both.  This is
> just not a contract we need to sign forever yet.
>
> Hope above clarifies a bit on why I chose the simplest solution as of
> now. I also don't like this API, as I mentioned in the cover letter. It's
> really a trade-off I made at least for now the best I can come up with.
>
> Said that, if any of us has better solution, please shoot.  I'm always open
> to better alternatives.

I didn't explore this code as deep as you have done and therefore
might not see all the pitfalls but looks like you already considered
an alternative which does sound better to me. What are the drawbacks
that I might be missing?
Thanks,
Suren.

>
> Thanks,
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-03  2:00               ` Liam R. Howlett
@ 2025-07-03 15:24                 ` Peter Xu
  2025-07-03 16:15                   ` Lorenzo Stoakes
  2025-07-03 17:39                   ` Liam R. Howlett
  0 siblings, 2 replies; 42+ messages in thread
From: Peter Xu @ 2025-07-03 15:24 UTC (permalink / raw)
  To: Liam R. Howlett, Nikita Kalyazin, Lorenzo Stoakes,
	Suren Baghdasaryan, linux-mm, linux-kernel, Vlastimil Babka,
	Muchun Song, Mike Rapoport, Hugh Dickins, Andrew Morton,
	James Houghton, Michal Hocko, David Hildenbrand, Andrea Arcangeli,
	Oscar Salvador, Axel Rasmussen, Ujwal Kundur

On Wed, Jul 02, 2025 at 10:00:51PM -0400, Liam R. Howlett wrote:
> * Peter Xu <peterx@redhat.com> [250702 17:36]:
> > On Wed, Jul 02, 2025 at 05:24:02PM -0400, Liam R. Howlett wrote:
> > > That's because the entry point is from a function pointer, so [3] won't
> > > help at all.
> > > 
> > > It is recreating the situation that existed for the vma through the
> > > vm_ops in mmap, but for uffd.  And at a lower level (page tables).  I do not
> > > want to relive that experience.
> > > 
> > > We are not doing this.  It is for the benefit of everyone that we are
> > > not doing this.
> > 
> > Is the vma issue about "allowing vma->vm_flags to be modified anywhere"
> > issue?  Or is there a pointer to the issue being discussed if not?
> 
> The issue is passing pointers of structs that are protected by locks or
> ref counters into modules to do what they please.
> 
> vma->vm_flags was an example of where we learned how wrong this can go.
> 
> There is also the concern of the state of the folio on return from the
> callback.  The error handling gets messy quick.
> 
> Now, imagine we have something that gets a folio, but then we find a
> solution for contention of a lock or ref count (whatever is next), but
> it doesn't work because the mm code has been bleeding into random
> modules and we have no clue what that module is supposed to be doing, or
> we can't make the necessary change because this module will break
> userspace, or cause a performance decrease, or any other random thing
> that we cannot work around without rewriting (probably suboptimally)
> something we don't maintain.
> 
> Again, these are examples of how this can go bad but not an exhaustive
> list by any means.
> 
> So the issue is with allowing modules to play with the folio and page
> tables on their own.

I understand the concern, however IMHO that's really why mm can be hard and
important at the same time..

We definitely have driver code manipulating pgtables.  We also have folios
or pages that can be directly accessible from drivers.

After all mm is the core function provider for those and there needs to be
some API accessing them from outside.

I agree some protection would be nice, like what Suren did with the
vm_flags using __private, even though it's unfortunate it only works with
sparse not a warn/error when compiling, as vm_flags is not a pointer.
OTOH, forbid exposing anything might be an overkill, IMHO.  It stops mm
from growing in healthy ways.

> 
> If this is outside the mm, we probably won't even be Cc'ed on modules
> that use it.
> 
> And do we want to be Cc'ed on modules that want to use it?

For this specific case, I'm happy to be copied if guest-memfd will start to
support userfaultfd, because obviously I also work with the kvm community.
It'll be the same if not, as I'm list as an userfaultfd reviewer.

But when it's in the modules, it should really be the modules job.  It's ok
too when it's an API then mm people do not get copied.  It looks fine to me.

> 
> We will most likely be Cc'ed or emailed directly on the resulting memory
> leak/security issue that results in what should be mm code.  It'll be a
> Saturday because it always is.. :)

True, it's just unavoidable IMHO, and after triaged then the module owner
needs to figure out how to fix it, not a mm developer, if the bug only
happens with the module.

It's the same when a module allocated a folio/page and randomly update its
flags.  It may also crash core mm later.  We can have more protections all
over the places but I don't see an easy way to completely separate core mm
from modules.

> 
> Even the example use code had a potential ref leak that you found [1].

That's totally ok.  I appreciate Nikita's help completely and never thought
it as an issue.  IMHO the leak is not a big deal in verifying the API.

> 
> > > 
> > > We need to find another way.
> > 
> > Could you suggest something?  The minimum goal is to allow guest-memfd
> > support minor faults.
> 
> Mike brought up another idea, that seems worth looking into.

I replied to Mike already before we extended this thread.  Feel free to
chime in with any suggestions on top.  So far this series is still almost
the best I can think of.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-03 15:01           ` Suren Baghdasaryan
@ 2025-07-03 15:45             ` Peter Xu
  2025-07-03 16:01               ` Lorenzo Stoakes
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2025-07-03 15:45 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Mike Rapoport, Lorenzo Stoakes, linux-mm, linux-kernel,
	Vlastimil Babka, Muchun Song, Hugh Dickins, Andrew Morton,
	James Houghton, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Thu, Jul 03, 2025 at 08:01:12AM -0700, Suren Baghdasaryan wrote:
> On Wed, Jul 2, 2025 at 1:23 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Jul 02, 2025 at 09:16:31PM +0300, Mike Rapoport wrote:
> > > On Tue, Jul 01, 2025 at 10:04:28AM -0700, Suren Baghdasaryan wrote:
> > > > On Mon, Jun 30, 2025 at 3:16 AM Lorenzo Stoakes
> > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > >
> > > > > It seems like we're assuming a _lot_ of mm understanding in the underlying
> > > > > driver here.
> > > > >
> > > > > I'm not sure it's really normal to be handing around page table state and
> > > > > folios etc. to a driver like this, this is really... worrying to me.
> > > > >
> > > > > This feels like you're trying to put mm functionality outside of mm?
> > > >
> > > > To second that, two things stick out for me here:
> > > > 1. uffd_copy and uffd_get_folio seem to be at different abstraction
> > > > levels. uffd_copy is almost the entire copy operation for VM_SHARED
> > > > VMAs while uffd_get_folio is a small part of the continue operation.
> > > > 2. shmem_mfill_atomic_pte which becomes uffd_copy for shmem in the
> > > > last patch is quite a complex function which itself calls some IMO
> > > > pretty internal functions like mfill_atomic_install_pte(). Expecting
> > > > modules to implement such functionality seems like a stretch to me but
> > > > maybe this is for some specialized modules which are written by mm
> > > > experts only?
> > >
> > > Largely shmem_mfill_atomic_pte() differs from anonymous memory version
> > > (mfill_atomic_pte_copy()) by the way the allocated folio is accounted and
> > > whether it's added to the page cache. So instead of uffd_copy(...) we might
> > > add
> > >
> > >       int (*folio_alloc)(struct vm_area_struct *vma, unsigned long dst_addr);
> > >       void (*folio_release)(struct vm_area_struct *vma, struct folio *folio);
> >
> > Thanks for digging into this, Mike.  It's just that IMHO it may not be
> > enough..
> >
> > I actually tried to think about a more complicated API, but more I thought
> > of that, more it looked like an overkill.  I can list something here to
> > show why I chose the simplest solution with uffd_copy() as of now.
> 
> TBH below does not sound like an overkill to me for keeping mm parts
> to itself without over-exposing them to modules.
> 
> >
> > Firstly, see shmem_inode_acct_blocks() at the entrance: that's shmem
> > accounting we need to do before allocations, and with/without allocations.
> 
> Ok, this results in an additional folio_prealloc() hook.
> 
> > That accounting can't be put into folio_alloc() yet even if we'll have one,
> > because we could have the folio allocated when reaching here (that is, when
> > *foliop != NULL).  That was a very delicated decision of us to do shmem
> > accounting separately in 2016:
> >
> > https://lore.kernel.org/all/20161216144821.5183-37-aarcange@redhat.com/
> >
> > Then, there's also the complexity on how the page cache should be managed
> > for any mem type.  For shmem, folio was only injected right before the
> > pgtable installations.  We can't inject it when folio_alloc() because then
> > others can fault-in without data populated. It means we at least need one
> > more API to do page cache injections for the folio just got allocated from
> > folio_alloc().
> 
> folio_add_to_cache() hook?
> 
> >
> > We also may want to have different treatment on how the folio flags are
> > setup.  It may not always happen in folio_alloc().  E.g. for shmem right
> > now we do this right before the page cache injections:
> >
> >         VM_BUG_ON(folio_test_locked(folio));
> >         VM_BUG_ON(folio_test_swapbacked(folio));
> >         __folio_set_locked(folio);
> >         __folio_set_swapbacked(folio);
> >         __folio_mark_uptodate(folio);
> >
> > We may not want to do exactly the same for all the rest mem types.  E.g. we
> > likely don't want to set swapbacked for guest-memfd folios.  We may need
> > one more API to do it.
> 
> Can we do that inside folio_add_to_cache() hook before doing the injection?

The folio can be freed outside this function by userfaultfd callers, so I
wonder if it'll crash the kernel if mm sees a folio locked while being freed.

> 
> >
> > Then if to think about failure path, when we have the question above on
> > shmem acct issue, we may want to have yet another post_failure hook doing
> > shmem_inode_unacct_blocks() properly for shmem.. maybe we don't need that
> > for guest-memfd, but we still need that for shmem to properly unacct when
> > folio allocation succeeded, but copy_from_user failed somehow.
> 
> Failure handling hook.
> 
> >
> > Then the question is, do we really need all these fuss almost for nothing?
> 
> If that helps to keep modules simpler and mm details contained inside
> the core kernel I think it is worth doing. I imagine if the shmem was
> a module then implementing the current API would require exporting
> functions like mfill_atomic_install_pte(). That seems like
> over-exposure to me. And if we can avoid all the locking and
> refcounting that Liam mentioned that would definitely be worth it
> IMHO.
> 
> >
> > Note that if we want, IIUC we can still change this in the future. The
> > initial isolation like this series would still be good to land earlier; we
> > don't even plan to support MISSING for guest-memfd in the near future, but
> > only MINOR mode for now.  We don't necessarily need to work out MISSING
> > immediately to move on useful features landing Linux.  Even if we'll have
> > it for guest-memfd, it shouldn't be a challenge to maintain both.  This is
> > just not a contract we need to sign forever yet.

[1]

> >
> > Hope above clarifies a bit on why I chose the simplest solution as of
> > now. I also don't like this API, as I mentioned in the cover letter. It's
> > really a trade-off I made at least for now the best I can come up with.
> >
> > Said that, if any of us has better solution, please shoot.  I'm always open
> > to better alternatives.
> 
> I didn't explore this code as deep as you have done and therefore
> might not see all the pitfalls but looks like you already considered
> an alternative which does sound better to me. What are the drawbacks
> that I might be missing?

It'll be a complex API from another angle, that we'll need 5-6 APIs just to
implement uffd missing mode.

Meanwhile it'll also already start to introduce those function jumps into
shmem even if one would be enough to decouple it.

As I mentioned above, I'm also not against doing it, but IMHO that does not
need to be done right now [1], as currently it looks to me guest-memfd may
not really need missing mode traps as of now, and I am actually not sure
whether it will.  We may not want to introduce 5-6 APIs with no further
user.  We can also always discuss a proper API when the demand arrives.

If uffd_copy is so concerning to people, there's also another alternative
which is to make vm_uffd_ops only support traps except MISSING.  uffd_copy
is only about missing traps. Then we can drop uffd_copy API, but the API
itself is then broken on its own.

I still feel like we're over-worried on this.  For OOT drivers I never
cared breaking anything.  For in-tree ones, this discussion can really
happen when there's a need to.  And at that time we can also have a look at
the impl using uffd_copy(), maybe it'll not be that bad either.  It seems
to me we don't necessarily need to figure this out right now.  IMHO we can
do it two-steps in worst case.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 0/4] mm/userfaultfd: modulize memory types
  2025-07-02 20:36   ` Peter Xu
@ 2025-07-03 15:55     ` Lorenzo Stoakes
  2025-07-03 16:26       ` Peter Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Lorenzo Stoakes @ 2025-07-03 15:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Vlastimil Babka, Suren Baghdasaryan,
	Muchun Song, Mike Rapoport, Hugh Dickins, Andrew Morton,
	James Houghton, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

Peter you've ignored a large part of what I've said here, I'm not sure this is
productive.

On Wed, Jul 02, 2025 at 04:36:31PM -0400, Peter Xu wrote:
> On Mon, Jun 30, 2025 at 11:29:30AM +0100, Lorenzo Stoakes wrote:
> > On Fri, Jun 27, 2025 at 11:46:51AM -0400, Peter Xu wrote:
> > > [based on latest akpm/mm-new of June 27th, commit 9be7387ae43f]
> > >
> > > v2 changelog:
> > > - Patch 1
> > >   - update English in commit log [David]
> > >   - move vm_uffd_ops definition to userfaultfd_k.h [Mike]
> > > - Patch 4
> > >   - fix sparse warning on bitwise type conversions [syzbot]
> > >   - Commit message updates on explanation of vma_can_userfault check [James]
> > >
> > > v1: https://lore.kernel.org/r/20250620190342.1780170-1-peterx@redhat.com
> > >
> > > This series is an alternative proposal of what Nikita proposed here on the
> > > initial three patches:
> > >
> > >   https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com
> > >
> > > This is not yet relevant to any guest-memfd support, but paving way for it.
> > > Here, the major goal is to make kernel modules be able to opt-in with any
> > > form of userfaultfd supports, like guest-memfd.  This alternative option
> > > should hopefully be cleaner, and avoid leaking userfault details into
> > > vm_ops.fault().
> > >
> > > It also means this series does not depend on anything.  It's a pure
> > > refactoring of userfaultfd internals to provide a generic API, so that
> > > other types of files, especially RAM based, can support userfaultfd without
> > > touching mm/ at all.
> >
> > I'm very concerned that this change will simply move core mm functionality out
> > of mm and into drivers where it can bitrot and cause subtle bugs?
> >
> > You're proposing providing stuff like page table state and asking for a folio
> > back from a driver etc.
> >
> > I absolutely am not in favour of us providing core mm internals like this to
> > drivers, and I don't want to see us having to EXPORT() mm internals just to make
> > module-ised uffd code work (I mean I just will flat out refuse to do that).
> >
> > I think we need to think _very_ carefully about how we do this.
> >
> > I also feel like this series is at a really basic level and you've not fully
> > determined what API calls you need.
>
> See:
>
> https://lore.kernel.org/all/aGWVIjmmsmskA4bp@x1.local/#t

OK so it seems the point you're making there is - there's a lot of complexity -
and you'd rather not think about it for now?

My concern is that in _actuality_ you'll need to do a _lot_ more and expose a
_lot_ more mm internals to achieve what you need to.

I am happy for the API to be internal-to-mm-only.

My concerns are really simple:

- exposing page table manipulation outside of mm is something I just cannot
  accept us doing for reasons I already mentioned and also Liam

- we should absolutely minimise how much we do expose

- we mustn't have hooks that don't allow us to make sensible decisions in core
  mm code.

I think overall I'm just not in favour of us having uffd functionality in
modules, I think it's an abstraction mismatch.

Now if you instead had something like:

enum uffd_minor_fault_handler_type {
	UFFD_MINOR_FAULT_DEFAULT_HANDLER,
	UFFD_MINOR_FAULT_FOO_HANDLER,
	UFFD_MINOR_FAULT_BAR_HANDLER,
	etc.
};

And the module could _choose_ which should be used then that's far far better.

Really - hermetically seal the sensitive parts but allow module choice.

You could find ways to do this in a more sophisticated way too by e.g. having a
driver callback that provides a config struct that sets things up.

If we're going 'simple first' and can 'change what we want' why not do it like
this to start?

>
> >
> > I agree that it's sensible to be incremental, but I feel like you sort of need
> > to somewhat prove the case that you can jump from 'incremental version where we
> > only support code in mm/' to supporting arbitrary file system code that might be
> > modules.
> >
> > Because otherwise you're basically _guessing_ that you can do this, possibly, in
> > the future and maybe it's just not the right approach but that's not clear yet?
>
> Did you follow up with the discussions in v1?  I copied you too.
>
> https://lore.kernel.org/r/114133f5-0282-463d-9d65-3143aa658806@amazon.com
>
> Would Nikita's work help here?  Could you explain what are you asking for
> to prove that this works for us?

Yeah thanks this seems like it actually implements useful functionality. The
point is the API seems currently to be a stub so doesn't really give us much to
go on as to what might ultimately be required.

You say yourself in the series that you'll likely need to expand things and you
already have todos to this effect.

>
> >
> > >
> > > To achieve it, this series introduced a file operation called vm_uffd_ops.
> > > The ops needs to be provided when a file type supports any of userfaultfd.
> > >
> > > With that, I moved both hugetlbfs and shmem over.
> >
> > Well as you say below hugetlbfs is sort of a stub implementation, I wonder
> > whether we'd need quite a bit more to make that work.
> >
> > One thing I'd _really_ like to avoid is us having to add a bunch of hook points
> > into core mm code just for uffd that then call out to some driver.
> >
> > We've encountered such a total nightmare with .mmap() for instance in the past
> > (including stuff that resulted in security issues) because we - simply cannot
> > assume anything - about what the hook implementor might do with the passed
> > parameters.
> >
> > This is really really problematic.
> >
> > I also absolutely hate the:
> >
> > if (uffd)
> > 	do_something_weird();
> >
> > Pattern, so hopefully this won't proliferate that.

^ you didn't respond to this.

> >
> > >
> > > Hugetlbfs is still very special that it will only use partial of the
> > > vm_uffd_ops API, due to similar reason why hugetlb_vm_op_fault() has a
> > > BUG() and so far hard-coded into core mm.  But this should still be better,
> > > because at least hugetlbfs is still always involved in feature probing
> > > (e.g. where it used to not support ZEROPAGE and we have a hard-coded line
> > > to fail that, and some more).  Meanwhile after this series, shmem will be
> > > completely converted to the new vm_uffd_ops API; the final vm_uffd_ops for
> > > shmem looks like this:
> > >
> > > static const vm_uffd_ops shmem_uffd_ops = {
> > > 	.uffd_features	= 	__VM_UFFD_FLAGS,
> > > 	.uffd_ioctls	= 	BIT(_UFFDIO_COPY) |
> > > 				BIT(_UFFDIO_ZEROPAGE) |
> > > 				BIT(_UFFDIO_WRITEPROTECT) |
> > > 				BIT(_UFFDIO_CONTINUE) |
> > > 				BIT(_UFFDIO_POISON),
> > > 	.uffd_get_folio	=	shmem_uffd_get_folio,
> > > 	.uffd_copy	=	shmem_mfill_atomic_pte,
> > > };
> > >
> > > As I mentioned in one of my reply to Nikita, I don't like the current
> > > interface of uffd_copy(), but this will be the minimum change version of
> > > such API to support complete extrenal-module-ready userfaultfd.  Here, very
> > > minimal change will be needed from shmem side to support that.
> >
> > Right, maybe a better version of this interface might address some of my
> > concerns... :)
> >
> > >
> > > Meanwhile, the vm_uffd_ops is also not the only place one will need to
> > > provide to support userfaultfd.  Normally vm_ops.fault() will also need to
> > > be updated, but that's a generic function and it'll play together with the
> > > new vm_uffd_ops to make everything fly.
> > >
> > > No functional change expected at all after the whole series applied.  There
> > > might be some slightly stricter check on uffd ops here and there in the
> > > last patch, but that really shouldn't stand out anywhere to anyone.
> > >
> > > For testing: besides the cross-compilation tests, I did also try with
> > > uffd-stress in a VM to measure any perf difference before/after the change;
> > > The static call becomes a pointer now.  I really cannot measure anything
> > > different, which is more or less expected.
> > >
> > > Comments welcomed, thanks.
> > >
> > > Peter Xu (4):
> > >   mm: Introduce vm_uffd_ops API
> > >   mm/shmem: Support vm_uffd_ops API
> > >   mm/hugetlb: Support vm_uffd_ops API
> > >   mm: Apply vm_uffd_ops API to core mm
> > >
> > >  include/linux/mm.h            |   9 +++
> > >  include/linux/shmem_fs.h      |  14 -----
> > >  include/linux/userfaultfd_k.h |  98 +++++++++++++++++++----------
> > >  mm/hugetlb.c                  |  19 ++++++
> > >  mm/shmem.c                    |  28 ++++++++-
> > >  mm/userfaultfd.c              | 115 +++++++++++++++++++++++++---------
> > >  6 files changed, 207 insertions(+), 76 deletions(-)
> > >
> > > --
> > > 2.49.0
> > >
> >
> > Sorry to be critical, I just want to make sure we're not setting ourselves up
> > for trouble here.
> >
> > I _very much_ support efforts to make uffd more generalised, and ideally to find
> > a way to separate out shmem and hugetlbfs implementation bits, so I support the
> > intent _fully_.
> >
> > I just want to make sure we do it in a safe way :)
>
> Any explicit suggestions (besides objections)?

I gave some above.

I'm fundamentally opposed to us exposing page table manipulation or really any
state subject to sensitive locks to modules.

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-03 15:45             ` Peter Xu
@ 2025-07-03 16:01               ` Lorenzo Stoakes
  0 siblings, 0 replies; 42+ messages in thread
From: Lorenzo Stoakes @ 2025-07-03 16:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: Suren Baghdasaryan, Mike Rapoport, linux-mm, linux-kernel,
	Vlastimil Babka, Muchun Song, Hugh Dickins, Andrew Morton,
	James Houghton, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Thu, Jul 03, 2025 at 11:45:40AM -0400, Peter Xu wrote:
> I still feel like we're over-worried on this.  For OOT drivers I never
> cared breaking anything.  For in-tree ones, this discussion can really
> happen when there's a need to.  And at that time we can also have a look at
> the impl using uffd_copy(), maybe it'll not be that bad either.  It seems
> to me we don't necessarily need to figure this out right now.  IMHO we can
> do it two-steps in worst case.

This kind of approach in relation to f_op->mmap() is precisely how I ended
up being cc'd on an embargoed report of a critical zero-day security flaw.

So I think if anything there is under-worrying going on here.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-03 15:24                 ` Peter Xu
@ 2025-07-03 16:15                   ` Lorenzo Stoakes
  2025-07-03 17:39                   ` Liam R. Howlett
  1 sibling, 0 replies; 42+ messages in thread
From: Lorenzo Stoakes @ 2025-07-03 16:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: Liam R. Howlett, Nikita Kalyazin, Suren Baghdasaryan, linux-mm,
	linux-kernel, Vlastimil Babka, Muchun Song, Mike Rapoport,
	Hugh Dickins, Andrew Morton, James Houghton, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Thu, Jul 03, 2025 at 11:24:21AM -0400, Peter Xu wrote:
> On Wed, Jul 02, 2025 at 10:00:51PM -0400, Liam R. Howlett wrote:
> > * Peter Xu <peterx@redhat.com> [250702 17:36]:
> > > On Wed, Jul 02, 2025 at 05:24:02PM -0400, Liam R. Howlett wrote:
> > > > That's because the entry point is from a function pointer, so [3] won't
> > > > help at all.
> > > >
> > > > It is recreating the situation that existed for the vma through the
> > > > vm_ops in mmap, but for uffd.  And at a lower level (page tables).  I do not
> > > > want to relive that experience.
> > > >
> > > > We are not doing this.  It is for the benefit of everyone that we are
> > > > not doing this.
> > >
> > > Is the vma issue about "allowing vma->vm_flags to be modified anywhere"
> > > issue?  Or is there a pointer to the issue being discussed if not?
> >
> > The issue is passing pointers of structs that are protected by locks or
> > ref counters into modules to do what they please.
> >
> > vma->vm_flags was an example of where we learned how wrong this can go.
> >
> > There is also the concern of the state of the folio on return from the
> > callback.  The error handling gets messy quick.
> >
> > Now, imagine we have something that gets a folio, but then we find a
> > solution for contention of a lock or ref count (whatever is next), but
> > it doesn't work because the mm code has been bleeding into random
> > modules and we have no clue what that module is supposed to be doing, or
> > we can't make the necessary change because this module will break
> > userspace, or cause a performance decrease, or any other random thing
> > that we cannot work around without rewriting (probably suboptimally)
> > something we don't maintain.
> >
> > Again, these are examples of how this can go bad but not an exhaustive
> > list by any means.
> >
> > So the issue is with allowing modules to play with the folio and page
> > tables on their own.
>
> I understand the concern, however IMHO that's really why mm can be hard and
> important at the same time..

I feel like you're dismissing the concerns altogether honestly.

>
> We definitely have driver code manipulating pgtables.  We also have folios
> or pages that can be directly accessible from drivers.

'There's already bad stuff going on' is not an argument for doing more.

>
> After all mm is the core function provider for those and there needs to be
> some API accessing them from outside.

The point being made here is what are internals and what are not.

We don't expose internals, we do expose a carefully controlled interface that
minimises risk of things being broken.

>
> I agree some protection would be nice, like what Suren did with the
> vm_flags using __private, even though it's unfortunate it only works with
> sparse not a warn/error when compiling, as vm_flags is not a pointer.
> OTOH, forbid exposing anything might be an overkill, IMHO.  It stops mm
> from growing in healthy ways.

I'm not sure what is healthy about no longer being able to make assumptions
about what code does because hooks are being invoked with drivers doing
_whatever they like_.

Part of the purpose of review is avoiding making decisions that cause
problems down the line.

>
> >
> > If this is outside the mm, we probably won't even be Cc'ed on modules
> > that use it.
> >
> > And do we want to be Cc'ed on modules that want to use it?
>
> For this specific case, I'm happy to be copied if guest-memfd will start to
> support userfaultfd, because obviously I also work with the kvm community.
> It'll be the same if not, as I'm list as an userfaultfd reviewer.
>
> But when it's in the modules, it should really be the modules job.  It's ok
> too when it's an API then mm people do not get copied.  It looks fine to me.

This is ridiculous, we expose mm internals to modules, and then no longer
have to care when they break things, or a subtle thing changes in mm?

On the one hand you argue that in-tree drivers can be 'monitored' +
therefore it's fine, but on the other hand you say it doesn't matter what
they do and it's nothing to do with us so we shouldn't care about being
infomred of changes?

These two positions are contradcitory and neither are good.

>
> >
> > We will most likely be Cc'ed or emailed directly on the resulting memory
> > leak/security issue that results in what should be mm code.  It'll be a
> > Saturday because it always is.. :)
>
> True, it's just unavoidable IMHO, and after triaged then the module owner
> needs to figure out how to fix it, not a mm developer, if the bug only
> happens with the module.

Except it impacts mm as a whole.

And it is avoidable, we can just _not do this_.

>
> It's the same when a module allocated a folio/page and randomly update its
> flags.  It may also crash core mm later.  We can have more protections all
> over the places but I don't see an easy way to completely separate core mm
> from modules.

Yes if modules absolutely abuse things then it's problematic, but the issue
is that mm has a whole host of extremely subtle considerations.

I documented a lot of these at
https://kernel.org/doc/html/latest/mm/process_addrs.html

These details change quite often, including some extremely sensitive and
delicate details around locking.

Do yo really think module authors are going to correctly implement all of
this?

It's very obvious these are internal implementation details.

>
> >
> > Even the example use code had a potential ref leak that you found [1].
>
> That's totally ok.  I appreciate Nikita's help completely and never thought
> it as an issue.  IMHO the leak is not a big deal in verifying the API.

I think it's quite telling as to your under-worrying :) about this stuff to
think that that's not a problem.

You guys have already made subtle errors in implementing the simplest
possible use of the API.

This just makes the point for us I think?

>
> >
> > > >
> > > > We need to find another way.
> > >
> > > Could you suggest something?  The minimum goal is to allow guest-memfd
> > > support minor faults.
> >
> > Mike brought up another idea, that seems worth looking into.
>
> I replied to Mike already before we extended this thread.  Feel free to
> chime in with any suggestions on top.  So far this series is still almost
> the best I can think of.

I mean if you don't want to consider alternatives, maybe this series simply
isn't viable?

I made suggestions in the other thread re: a very _soft_ interface where we
implement things in core mm but _configure_ them in modules as an
altenrative.

That would avoid exposure of mm internals.

I really don't think a series that exposes anything even vaguely sensitive
is viable imo.

>
> Thanks,
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 0/4] mm/userfaultfd: modulize memory types
  2025-07-03 15:55     ` Lorenzo Stoakes
@ 2025-07-03 16:26       ` Peter Xu
  2025-07-03 16:44         ` Lorenzo Stoakes
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2025-07-03 16:26 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, Vlastimil Babka, Suren Baghdasaryan,
	Muchun Song, Mike Rapoport, Hugh Dickins, Andrew Morton,
	James Houghton, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Thu, Jul 03, 2025 at 04:55:01PM +0100, Lorenzo Stoakes wrote:
> Peter you've ignored a large part of what I've said here, I'm not sure this is
> productive.
> 
> On Wed, Jul 02, 2025 at 04:36:31PM -0400, Peter Xu wrote:
> > On Mon, Jun 30, 2025 at 11:29:30AM +0100, Lorenzo Stoakes wrote:
> > > On Fri, Jun 27, 2025 at 11:46:51AM -0400, Peter Xu wrote:
> > > > [based on latest akpm/mm-new of June 27th, commit 9be7387ae43f]
> > > >
> > > > v2 changelog:
> > > > - Patch 1
> > > >   - update English in commit log [David]
> > > >   - move vm_uffd_ops definition to userfaultfd_k.h [Mike]
> > > > - Patch 4
> > > >   - fix sparse warning on bitwise type conversions [syzbot]
> > > >   - Commit message updates on explanation of vma_can_userfault check [James]
> > > >
> > > > v1: https://lore.kernel.org/r/20250620190342.1780170-1-peterx@redhat.com
> > > >
> > > > This series is an alternative proposal of what Nikita proposed here on the
> > > > initial three patches:
> > > >
> > > >   https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com
> > > >
> > > > This is not yet relevant to any guest-memfd support, but paving way for it.
> > > > Here, the major goal is to make kernel modules be able to opt-in with any
> > > > form of userfaultfd supports, like guest-memfd.  This alternative option
> > > > should hopefully be cleaner, and avoid leaking userfault details into
> > > > vm_ops.fault().
> > > >
> > > > It also means this series does not depend on anything.  It's a pure
> > > > refactoring of userfaultfd internals to provide a generic API, so that
> > > > other types of files, especially RAM based, can support userfaultfd without
> > > > touching mm/ at all.
> > >
> > > I'm very concerned that this change will simply move core mm functionality out
> > > of mm and into drivers where it can bitrot and cause subtle bugs?
> > >
> > > You're proposing providing stuff like page table state and asking for a folio
> > > back from a driver etc.
> > >
> > > I absolutely am not in favour of us providing core mm internals like this to
> > > drivers, and I don't want to see us having to EXPORT() mm internals just to make
> > > module-ised uffd code work (I mean I just will flat out refuse to do that).
> > >
> > > I think we need to think _very_ carefully about how we do this.
> > >
> > > I also feel like this series is at a really basic level and you've not fully
> > > determined what API calls you need.
> >
> > See:
> >
> > https://lore.kernel.org/all/aGWVIjmmsmskA4bp@x1.local/#t
> 
> OK so it seems the point you're making there is - there's a lot of complexity -
> and you'd rather not think about it for now?

This is not a fair judgement.  I think I have provided my thought process
and how I made the decision to come up with this series.

> 
> My concern is that in _actuality_ you'll need to do a _lot_ more and expose a
> _lot_ more mm internals to achieve what you need to.
> 
> I am happy for the API to be internal-to-mm-only.
> 
> My concerns are really simple:
> 
> - exposing page table manipulation outside of mm is something I just cannot
>   accept us doing for reasons I already mentioned and also Liam
> 
> - we should absolutely minimise how much we do expose
> 
> - we mustn't have hooks that don't allow us to make sensible decisions in core
>   mm code.
> 
> I think overall I'm just not in favour of us having uffd functionality in
> modules, I think it's an abstraction mismatch.
> 
> Now if you instead had something like:
> 
> enum uffd_minor_fault_handler_type {
> 	UFFD_MINOR_FAULT_DEFAULT_HANDLER,
> 	UFFD_MINOR_FAULT_FOO_HANDLER,
> 	UFFD_MINOR_FAULT_BAR_HANDLER,
> 	etc.
> };
> 
> And the module could _choose_ which should be used then that's far far better.
> 
> Really - hermetically seal the sensitive parts but allow module choice.
> 
> You could find ways to do this in a more sophisticated way too by e.g. having a
> driver callback that provides a config struct that sets things up.
> 
> If we're going 'simple first' and can 'change what we want' why not do it like
> this to start?

Could you help to define UFFD_MINOR_FAULT_FOO_HANDLER for guest-memfd, and
how that handler would be able to hook to the guest-memfd driver?  The
kernel needs to build with/without CONFIG_KVM.

What about MISSING?  Do you plan to support missing in the proposal you
mentioned?

> 
> >
> > >
> > > I agree that it's sensible to be incremental, but I feel like you sort of need
> > > to somewhat prove the case that you can jump from 'incremental version where we
> > > only support code in mm/' to supporting arbitrary file system code that might be
> > > modules.
> > >
> > > Because otherwise you're basically _guessing_ that you can do this, possibly, in
> > > the future and maybe it's just not the right approach but that's not clear yet?
> >
> > Did you follow up with the discussions in v1?  I copied you too.
> >
> > https://lore.kernel.org/r/114133f5-0282-463d-9d65-3143aa658806@amazon.com
> >
> > Would Nikita's work help here?  Could you explain what are you asking for
> > to prove that this works for us?
> 
> Yeah thanks this seems like it actually implements useful functionality. The
> point is the API seems currently to be a stub so doesn't really give us much to
> go on as to what might ultimately be required.
> 
> You say yourself in the series that you'll likely need to expand things and you
> already have todos to this effect.
> 
> >
> > >
> > > >
> > > > To achieve it, this series introduced a file operation called vm_uffd_ops.
> > > > The ops needs to be provided when a file type supports any of userfaultfd.
> > > >
> > > > With that, I moved both hugetlbfs and shmem over.
> > >
> > > Well as you say below hugetlbfs is sort of a stub implementation, I wonder
> > > whether we'd need quite a bit more to make that work.
> > >
> > > One thing I'd _really_ like to avoid is us having to add a bunch of hook points
> > > into core mm code just for uffd that then call out to some driver.
> > >
> > > We've encountered such a total nightmare with .mmap() for instance in the past
> > > (including stuff that resulted in security issues) because we - simply cannot
> > > assume anything - about what the hook implementor might do with the passed
> > > parameters.
> > >
> > > This is really really problematic.
> > >
> > > I also absolutely hate the:
> > >
> > > if (uffd)
> > > 	do_something_weird();
> > >
> > > Pattern, so hopefully this won't proliferate that.
> 
> ^ you didn't respond to this.

Can you elaborate?  I don't understand how this is attached to this series.
AFAIU, the series is trying to remove some "if()s", not adding.

> 
> > >
> > > >
> > > > Hugetlbfs is still very special that it will only use partial of the
> > > > vm_uffd_ops API, due to similar reason why hugetlb_vm_op_fault() has a
> > > > BUG() and so far hard-coded into core mm.  But this should still be better,
> > > > because at least hugetlbfs is still always involved in feature probing
> > > > (e.g. where it used to not support ZEROPAGE and we have a hard-coded line
> > > > to fail that, and some more).  Meanwhile after this series, shmem will be
> > > > completely converted to the new vm_uffd_ops API; the final vm_uffd_ops for
> > > > shmem looks like this:
> > > >
> > > > static const vm_uffd_ops shmem_uffd_ops = {
> > > > 	.uffd_features	= 	__VM_UFFD_FLAGS,
> > > > 	.uffd_ioctls	= 	BIT(_UFFDIO_COPY) |
> > > > 				BIT(_UFFDIO_ZEROPAGE) |
> > > > 				BIT(_UFFDIO_WRITEPROTECT) |
> > > > 				BIT(_UFFDIO_CONTINUE) |
> > > > 				BIT(_UFFDIO_POISON),
> > > > 	.uffd_get_folio	=	shmem_uffd_get_folio,
> > > > 	.uffd_copy	=	shmem_mfill_atomic_pte,
> > > > };
> > > >
> > > > As I mentioned in one of my reply to Nikita, I don't like the current
> > > > interface of uffd_copy(), but this will be the minimum change version of
> > > > such API to support complete extrenal-module-ready userfaultfd.  Here, very
> > > > minimal change will be needed from shmem side to support that.
> > >
> > > Right, maybe a better version of this interface might address some of my
> > > concerns... :)
> > >
> > > >
> > > > Meanwhile, the vm_uffd_ops is also not the only place one will need to
> > > > provide to support userfaultfd.  Normally vm_ops.fault() will also need to
> > > > be updated, but that's a generic function and it'll play together with the
> > > > new vm_uffd_ops to make everything fly.
> > > >
> > > > No functional change expected at all after the whole series applied.  There
> > > > might be some slightly stricter check on uffd ops here and there in the
> > > > last patch, but that really shouldn't stand out anywhere to anyone.
> > > >
> > > > For testing: besides the cross-compilation tests, I did also try with
> > > > uffd-stress in a VM to measure any perf difference before/after the change;
> > > > The static call becomes a pointer now.  I really cannot measure anything
> > > > different, which is more or less expected.
> > > >
> > > > Comments welcomed, thanks.
> > > >
> > > > Peter Xu (4):
> > > >   mm: Introduce vm_uffd_ops API
> > > >   mm/shmem: Support vm_uffd_ops API
> > > >   mm/hugetlb: Support vm_uffd_ops API
> > > >   mm: Apply vm_uffd_ops API to core mm
> > > >
> > > >  include/linux/mm.h            |   9 +++
> > > >  include/linux/shmem_fs.h      |  14 -----
> > > >  include/linux/userfaultfd_k.h |  98 +++++++++++++++++++----------
> > > >  mm/hugetlb.c                  |  19 ++++++
> > > >  mm/shmem.c                    |  28 ++++++++-
> > > >  mm/userfaultfd.c              | 115 +++++++++++++++++++++++++---------
> > > >  6 files changed, 207 insertions(+), 76 deletions(-)
> > > >
> > > > --
> > > > 2.49.0
> > > >
> > >
> > > Sorry to be critical, I just want to make sure we're not setting ourselves up
> > > for trouble here.
> > >
> > > I _very much_ support efforts to make uffd more generalised, and ideally to find
> > > a way to separate out shmem and hugetlbfs implementation bits, so I support the
> > > intent _fully_.
> > >
> > > I just want to make sure we do it in a safe way :)
> >
> > Any explicit suggestions (besides objections)?
> 
> I gave some above.
> 
> I'm fundamentally opposed to us exposing page table manipulation or really any
> state subject to sensitive locks to modules.
> 
> Cheers, Lorenzo
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-02 20:24         ` Peter Xu
@ 2025-07-03 16:32           ` Lorenzo Stoakes
  0 siblings, 0 replies; 42+ messages in thread
From: Lorenzo Stoakes @ 2025-07-03 16:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: Suren Baghdasaryan, linux-mm, linux-kernel, Vlastimil Babka,
	Muchun Song, Mike Rapoport, Hugh Dickins, Andrew Morton,
	James Houghton, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Wed, Jul 02, 2025 at 04:24:25PM -0400, Peter Xu wrote:
> On Wed, Jul 02, 2025 at 04:56:04PM +0100, Lorenzo Stoakes wrote:
> > I am also absolutely, to be frank, not going to accept us EXPORT()'ing
> > anything core.
>
> Could you be explicit on what you're absolutely against?  Is that the
> vm_uffd_ops (or any form of it) being exported, or anything else?

page table manipulation
core mm internals

In fact I think we need to minimise as much exposure as posible.

>
> Thanks,
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 0/4] mm/userfaultfd: modulize memory types
  2025-07-03 16:26       ` Peter Xu
@ 2025-07-03 16:44         ` Lorenzo Stoakes
  0 siblings, 0 replies; 42+ messages in thread
From: Lorenzo Stoakes @ 2025-07-03 16:44 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Vlastimil Babka, Suren Baghdasaryan,
	Muchun Song, Mike Rapoport, Hugh Dickins, Andrew Morton,
	James Houghton, Liam R . Howlett, Nikita Kalyazin, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Thu, Jul 03, 2025 at 12:26:17PM -0400, Peter Xu wrote:
> On Thu, Jul 03, 2025 at 04:55:01PM +0100, Lorenzo Stoakes wrote:
> > > > I'm very concerned that this change will simply move core mm functionality out
> > > > of mm and into drivers where it can bitrot and cause subtle bugs?
> > > >
> > > > You're proposing providing stuff like page table state and asking for a folio
> > > > back from a driver etc.
> > > >
> > > > I absolutely am not in favour of us providing core mm internals like this to
> > > > drivers, and I don't want to see us having to EXPORT() mm internals just to make
> > > > module-ised uffd code work (I mean I just will flat out refuse to do that).
> > > >
> > > > I think we need to think _very_ carefully about how we do this.
> > > >
> > > > I also feel like this series is at a really basic level and you've not fully
> > > > determined what API calls you need.
> > >
> > > See:
> > >
> > > https://lore.kernel.org/all/aGWVIjmmsmskA4bp@x1.local/#t
> >
> > OK so it seems the point you're making there is - there's a lot of complexity -
> > and you'd rather not think about it for now?
>
> This is not a fair judgement.  I think I have provided my thought process
> and how I made the decision to come up with this series.

Sorry Peter I am not trying to suggest you've not thought this through, that's
not it at all.

Maybe I'm phrasing this badly.

What I mean to say is - you've implemented effectively a minimum viable
interface, and my concern is that it doesn't fully express what you're going to
need to actually implement this in reality.

And my interpretation of what you say in the patches is that you are basically
stating this. But happy to be corrected!

>
> >
> > My concern is that in _actuality_ you'll need to do a _lot_ more and expose a
> > _lot_ more mm internals to achieve what you need to.
> >
> > I am happy for the API to be internal-to-mm-only.
> >
> > My concerns are really simple:
> >
> > - exposing page table manipulation outside of mm is something I just cannot
> >   accept us doing for reasons I already mentioned and also Liam
> >
> > - we should absolutely minimise how much we do expose
> >
> > - we mustn't have hooks that don't allow us to make sensible decisions in core
> >   mm code.
> >
> > I think overall I'm just not in favour of us having uffd functionality in
> > modules, I think it's an abstraction mismatch.
> >
> > Now if you instead had something like:
> >
> > enum uffd_minor_fault_handler_type {
> > 	UFFD_MINOR_FAULT_DEFAULT_HANDLER,
> > 	UFFD_MINOR_FAULT_FOO_HANDLER,
> > 	UFFD_MINOR_FAULT_BAR_HANDLER,
> > 	etc.
> > };
> >
> > And the module could _choose_ which should be used then that's far far better.
> >
> > Really - hermetically seal the sensitive parts but allow module choice.
> >
> > You could find ways to do this in a more sophisticated way too by e.g. having a
> > driver callback that provides a config struct that sets things up.
> >
> > If we're going 'simple first' and can 'change what we want' why not do it like
> > this to start?
>
> Could you help to define UFFD_MINOR_FAULT_FOO_HANDLER for guest-memfd, and
> how that handler would be able to hook to the guest-memfd driver?  The
> kernel needs to build with/without CONFIG_KVM.
>
> What about MISSING?  Do you plan to support missing in the proposal you
> mentioned?

I'm simply providing a vague hand wavey notion of something that doesn't expose
mm internals.

I would have thought you could figure out ways of doing this kind of thing, or
providing some minimal and safely controlled hook for the different modes?

It seems odd that on the one hand you're ok with providing something imcomplete
- but which exposes mm internals here - but then require an entirely complete
solution for an alternative proposal.

Exposing mm internals to drivers is just a no-go.

On the other hand, providing internals for -mm code only- is fine. If the
purpose of the series was changed to expose a simplified interface, that could
then _call into_ mm code that used an internal-mm API that'd be a way forward
also.

> > > > Well as you say below hugetlbfs is sort of a stub implementation, I wonder
> > > > whether we'd need quite a bit more to make that work.
> > > >
> > > > One thing I'd _really_ like to avoid is us having to add a bunch of hook points
> > > > into core mm code just for uffd that then call out to some driver.
> > > >
> > > > We've encountered such a total nightmare with .mmap() for instance in the past
> > > > (including stuff that resulted in security issues) because we - simply cannot
> > > > assume anything - about what the hook implementor might do with the passed
> > > > parameters.
> > > >
> > > > This is really really problematic.
> > > >
> > > > I also absolutely hate the:
> > > >
> > > > if (uffd)
> > > > 	do_something_weird();
> > > >
> > > > Pattern, so hopefully this won't proliferate that.
> >
> > ^ you didn't respond to this.
>
> Can you elaborate?  I don't understand how this is attached to this series.
> AFAIU, the series is trying to remove some "if()s", not adding.

What I'm saying is that is a _very good_ thing!

I was just raising the point that us doing anything to address that is
positive. But it can't be at the cost of the issues raised.

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-03 15:24                 ` Peter Xu
  2025-07-03 16:15                   ` Lorenzo Stoakes
@ 2025-07-03 17:39                   ` Liam R. Howlett
  1 sibling, 0 replies; 42+ messages in thread
From: Liam R. Howlett @ 2025-07-03 17:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: Nikita Kalyazin, Lorenzo Stoakes, Suren Baghdasaryan, linux-mm,
	linux-kernel, Vlastimil Babka, Muchun Song, Mike Rapoport,
	Hugh Dickins, Andrew Morton, James Houghton, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

* Peter Xu <peterx@redhat.com> [250703 11:24]:
> On Wed, Jul 02, 2025 at 10:00:51PM -0400, Liam R. Howlett wrote:
> > * Peter Xu <peterx@redhat.com> [250702 17:36]:
> > > On Wed, Jul 02, 2025 at 05:24:02PM -0400, Liam R. Howlett wrote:
> > > > That's because the entry point is from a function pointer, so [3] won't
> > > > help at all.
> > > > 
> > > > It is recreating the situation that existed for the vma through the
> > > > vm_ops in mmap, but for uffd.  And at a lower level (page tables).  I do not
> > > > want to relive that experience.
> > > > 
> > > > We are not doing this.  It is for the benefit of everyone that we are
> > > > not doing this.
> > > 
> > > Is the vma issue about "allowing vma->vm_flags to be modified anywhere"
> > > issue?  Or is there a pointer to the issue being discussed if not?
> > 
> > The issue is passing pointers of structs that are protected by locks or
> > ref counters into modules to do what they please.
> > 
> > vma->vm_flags was an example of where we learned how wrong this can go.
> > 
> > There is also the concern of the state of the folio on return from the
> > callback.  The error handling gets messy quick.
> > 
> > Now, imagine we have something that gets a folio, but then we find a
> > solution for contention of a lock or ref count (whatever is next), but
> > it doesn't work because the mm code has been bleeding into random
> > modules and we have no clue what that module is supposed to be doing, or
> > we can't make the necessary change because this module will break
> > userspace, or cause a performance decrease, or any other random thing
> > that we cannot work around without rewriting (probably suboptimally)
> > something we don't maintain.
> > 
> > Again, these are examples of how this can go bad but not an exhaustive
> > list by any means.
> > 
> > So the issue is with allowing modules to play with the folio and page
> > tables on their own.
> 
> I understand the concern, however IMHO that's really why mm can be hard and
> important at the same time..

I'm glad you understand, but I think you do not understand the severity
of the concern and the repercussions.

These patches, as they exist today, are not going to be accepted.

I am not okay with it going in, and I don't see many saying differently.

I am not okay with you pushing for it any longer.

Several people have told you it is not a good idea, the people who have
to deal with the fallout of this when it goes bad - and it _will_ go
bad.

You have been given ample reasons why, technical reasons as well as real
examples of failures - and security bugs - that have resulted from the
exact same pattern in the exact same structure where you are reproducing
the patter.

Please stop trying to push this plan.  It is a waste of time and energy.

We need a new plan.

> 
> We definitely have driver code manipulating pgtables.  We also have folios
> or pages that can be directly accessible from drivers.
> 
> After all mm is the core function provider for those and there needs to be
> some API accessing them from outside.

But that's not what you are doing, you are handing out pointers and
expecting the modules to do the core function.

And the core function that was suggested in the example user already had
a ref counting issue.  Bugs happen, but that sort of thing is going to
be difficult to find and it won't be the driver author hunting it down,
potentially under an embargo.  And good luck validating what was done on
return from the module.

> 
> I agree some protection would be nice, like what Suren did with the
> vm_flags using __private, even though it's unfortunate it only works with
> sparse not a warn/error when compiling, as vm_flags is not a pointer.
> OTOH, forbid exposing anything might be an overkill, IMHO.  It stops mm
> from growing in healthy ways.

This isn't healthy, it is repeating the errors of the past and expecting
a different result.

This isn't about balance or forbidding any exposure, it's about finding
a less bug-prone way of getting your feature to work.

I'm not saying it's wrong TO do this. I'm saying these patches are the
wrong way OF doing this - from experience of dealing with the same
pattern.

Anyways, I am not spending more time fighting about this.  Let me know
when you come up with an alternative approach.

Regards,
Liam

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-02 19:46             ` Peter Xu
@ 2025-07-03 17:48               ` Mike Rapoport
  2025-07-04  9:34                 ` David Hildenbrand
  0 siblings, 1 reply; 42+ messages in thread
From: Mike Rapoport @ 2025-07-03 17:48 UTC (permalink / raw)
  To: Peter Xu
  Cc: Nikita Kalyazin, Lorenzo Stoakes, Suren Baghdasaryan, linux-mm,
	linux-kernel, Vlastimil Babka, Muchun Song, Hugh Dickins,
	Andrew Morton, James Houghton, Liam R . Howlett, Michal Hocko,
	David Hildenbrand, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Wed, Jul 02, 2025 at 03:46:57PM -0400, Peter Xu wrote:
> On Wed, Jul 02, 2025 at 08:39:32PM +0300, Mike Rapoport wrote:
> 
> [...]
> 
> > > The main target of this change is the implementation of UFFD for
> > > KVM/guest_memfd (examples: [1], [2]) to avoid bringing KVM-specific code
> > > into the mm codebase.  We usually mean KVM by the "drivers" in this context,
> > > and it is already somewhat "knowledgeable" of the mm.  I don't think there
> > > are existing use cases for other drivers to implement this at the moment.
> > > 
> > > Although I can't see new exports in this series, there is now a way to limit
> > > exports to particular modules [3].  Would it help if we only do it for KVM
> > > initially (if/when actually needed)?
> > 
> > There were talks about pulling out guest_memfd core into mm, but I don't
> > remember patches about it. If parts of guest_memfd were already in mm/ that
> > would make easier to export uffd ops to it.
> 
> Do we have a link to such discussion?  I'm also curious whether that idea
> was acknowledged by KVM maintainers.

AFAIR it was discussed at one of David's guest_memfd calls
 
> Having an abstraction layer for userfaultfd memtypes within mm always makes
> some sense on its own to me, so as to remove separate random checks over
> either shmem or hugetlb.  E.g. after the series applied, we can drop the
> shmem header in userfaultfd code, which should also be a step forward.
> 
> Thanks,
> 
> >  
> > > [1]
> > > https://lore.kernel.org/all/114133f5-0282-463d-9d65-3143aa658806@amazon.com/
> > > [2]
> > > https://lore.kernel.org/all/7666ee96-6f09-4dc1-8cb2-002a2d2a29cf@amazon.com/
> > > [3] https://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git/commit/?h=kbuild&id=707f853d7fa3ce323a6875487890c213e34d81a0
> 
> -- 
> Peter Xu
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-03 17:48               ` Mike Rapoport
@ 2025-07-04  9:34                 ` David Hildenbrand
  2025-07-04 14:59                   ` Peter Xu
  0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2025-07-04  9:34 UTC (permalink / raw)
  To: Mike Rapoport, Peter Xu
  Cc: Nikita Kalyazin, Lorenzo Stoakes, Suren Baghdasaryan, linux-mm,
	linux-kernel, Vlastimil Babka, Muchun Song, Hugh Dickins,
	Andrew Morton, James Houghton, Liam R . Howlett, Michal Hocko,
	Andrea Arcangeli, Oscar Salvador, Axel Rasmussen, Ujwal Kundur

On 03.07.25 19:48, Mike Rapoport wrote:
> On Wed, Jul 02, 2025 at 03:46:57PM -0400, Peter Xu wrote:
>> On Wed, Jul 02, 2025 at 08:39:32PM +0300, Mike Rapoport wrote:
>>
>> [...]
>>
>>>> The main target of this change is the implementation of UFFD for
>>>> KVM/guest_memfd (examples: [1], [2]) to avoid bringing KVM-specific code
>>>> into the mm codebase.  We usually mean KVM by the "drivers" in this context,
>>>> and it is already somewhat "knowledgeable" of the mm.  I don't think there
>>>> are existing use cases for other drivers to implement this at the moment.
>>>>
>>>> Although I can't see new exports in this series, there is now a way to limit
>>>> exports to particular modules [3].  Would it help if we only do it for KVM
>>>> initially (if/when actually needed)?
>>>
>>> There were talks about pulling out guest_memfd core into mm, but I don't
>>> remember patches about it. If parts of guest_memfd were already in mm/ that
>>> would make easier to export uffd ops to it.
>>
>> Do we have a link to such discussion?  I'm also curious whether that idea
>> was acknowledged by KVM maintainers.
> 
> AFAIR it was discussed at one of David's guest_memfd calls

While it was discussed in the call a couple of times in different 
context (guest_memfd as a library / guest_memfd shim), I think we 
already discussed it back at LPC last year.

One of the main reasons for doing that is supporting guest_memfd in 
other hypervisors -- the gunyah hypervisor in the kernel wants to make 
use of it as well.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-04  9:34                 ` David Hildenbrand
@ 2025-07-04 14:59                   ` Peter Xu
  2025-07-04 19:39                     ` Liam R. Howlett
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2025-07-04 14:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Rapoport, Nikita Kalyazin, Lorenzo Stoakes,
	Suren Baghdasaryan, linux-mm, linux-kernel, Vlastimil Babka,
	Muchun Song, Hugh Dickins, Andrew Morton, James Houghton,
	Liam R . Howlett, Michal Hocko, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

On Fri, Jul 04, 2025 at 11:34:15AM +0200, David Hildenbrand wrote:
> On 03.07.25 19:48, Mike Rapoport wrote:
> > On Wed, Jul 02, 2025 at 03:46:57PM -0400, Peter Xu wrote:
> > > On Wed, Jul 02, 2025 at 08:39:32PM +0300, Mike Rapoport wrote:
> > > 
> > > [...]
> > > 
> > > > > The main target of this change is the implementation of UFFD for
> > > > > KVM/guest_memfd (examples: [1], [2]) to avoid bringing KVM-specific code
> > > > > into the mm codebase.  We usually mean KVM by the "drivers" in this context,
> > > > > and it is already somewhat "knowledgeable" of the mm.  I don't think there
> > > > > are existing use cases for other drivers to implement this at the moment.
> > > > > 
> > > > > Although I can't see new exports in this series, there is now a way to limit
> > > > > exports to particular modules [3].  Would it help if we only do it for KVM
> > > > > initially (if/when actually needed)?
> > > > 
> > > > There were talks about pulling out guest_memfd core into mm, but I don't
> > > > remember patches about it. If parts of guest_memfd were already in mm/ that
> > > > would make easier to export uffd ops to it.
> > > 
> > > Do we have a link to such discussion?  I'm also curious whether that idea
> > > was acknowledged by KVM maintainers.
> > 
> > AFAIR it was discussed at one of David's guest_memfd calls
> 
> While it was discussed in the call a couple of times in different context
> (guest_memfd as a library / guest_memfd shim), I think we already discussed
> it back at LPC last year.
> 
> One of the main reasons for doing that is supporting guest_memfd in other
> hypervisors -- the gunyah hypervisor in the kernel wants to make use of it
> as well.

I see, thanks for the info. I found the series, it's here:

https://lore.kernel.org/all/20241113-guestmem-library-v3-0-71fdee85676b@quicinc.com/

Here, the question is whether do we still want to keep explicit calls to
shmem, hugetlbfs and in the future, guest-memfd.  The library-ize of
guest-memfd doesn't change a huge lot on answering this question, IIUC.

It definitely reduces the use of mfill_atomic_install_pte() so that we
don't need to export it.

However if we want to generalize userfaultfd capability for a type of
memory, we will still need something like the vm_uffd_ops hook to report
such information.  It means drivers can still overwrite these, with/without
an exported mfill_atomic_install_pte() functions.  I'm not sure whether
that eases the concern.

So to me, generalizing the mem type looks helpful with/without moving
guest-memfd under mm/.

We do have the option to keep hard-code guest-memfd like shmem or
hugetlbfs. This is still "doable", but this likely means guest-memfd
support for userfaultfd needs to be done after that work.  I did quickly
check the status of gunyah hypervisor [1,2,3], I found that all of the
efforts are not yet continued in 2025.  The hypervisor last update was Jan
2024 with a batch push [1].

I still prefer generalizing uffd capabilities using the ops.  That makes
guest-memfd support on MINOR not be blocked and it should be able to be
done concurrently v.s. guest-memfd library.  If guest-memfd library idea
didn't move on, it's non-issue either.

I've considered dropping uffd_copy() and MISSING support for vm_uffd_ops if
I'm going to repost - that looks like the only thing that people are
against with, even though that is not my preference, as that'll make the
API half-broken on its own.  Said that, I still prefer this against
hard-code and/or use CONFIG_GUESTMEM in userfaultfd code.

I'll wait for a few more days to see whether there's comment on above plan
to drop uffd_copy().

Thanks,

[1] https://github.com/quic/gunyah-hypervisor/tree/develop/hyp
[2] https://lore.kernel.org/all/20240516143356.1739402-1-quic_svaddagi@quicinc.com/
[3] https://lore.kernel.org/lkml/20240222-gunyah-v17-0-1e9da6763d38@quicinc.com

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-04 14:59                   ` Peter Xu
@ 2025-07-04 19:39                     ` Liam R. Howlett
  2025-09-01 16:01                       ` Nikita Kalyazin
  0 siblings, 1 reply; 42+ messages in thread
From: Liam R. Howlett @ 2025-07-04 19:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, Mike Rapoport, Nikita Kalyazin,
	Lorenzo Stoakes, Suren Baghdasaryan, linux-mm, linux-kernel,
	Vlastimil Babka, Muchun Song, Hugh Dickins, Andrew Morton,
	James Houghton, Michal Hocko, Andrea Arcangeli, Oscar Salvador,
	Axel Rasmussen, Ujwal Kundur

* Peter Xu <peterx@redhat.com> [250704 11:00]:
> On Fri, Jul 04, 2025 at 11:34:15AM +0200, David Hildenbrand wrote:
> > On 03.07.25 19:48, Mike Rapoport wrote:
> > > On Wed, Jul 02, 2025 at 03:46:57PM -0400, Peter Xu wrote:
> > > > On Wed, Jul 02, 2025 at 08:39:32PM +0300, Mike Rapoport wrote:
> > > > 
> > > > [...]
> > > > 
> > > > > > The main target of this change is the implementation of UFFD for
> > > > > > KVM/guest_memfd (examples: [1], [2]) to avoid bringing KVM-specific code
> > > > > > into the mm codebase.  We usually mean KVM by the "drivers" in this context,
> > > > > > and it is already somewhat "knowledgeable" of the mm.  I don't think there
> > > > > > are existing use cases for other drivers to implement this at the moment.
> > > > > > 
> > > > > > Although I can't see new exports in this series, there is now a way to limit
> > > > > > exports to particular modules [3].  Would it help if we only do it for KVM
> > > > > > initially (if/when actually needed)?
> > > > > 
> > > > > There were talks about pulling out guest_memfd core into mm, but I don't
> > > > > remember patches about it. If parts of guest_memfd were already in mm/ that
> > > > > would make easier to export uffd ops to it.
> > > > 
> > > > Do we have a link to such discussion?  I'm also curious whether that idea
> > > > was acknowledged by KVM maintainers.
> > > 
> > > AFAIR it was discussed at one of David's guest_memfd calls
> > 
> > While it was discussed in the call a couple of times in different context
> > (guest_memfd as a library / guest_memfd shim), I think we already discussed
> > it back at LPC last year.
> > 
> > One of the main reasons for doing that is supporting guest_memfd in other
> > hypervisors -- the gunyah hypervisor in the kernel wants to make use of it
> > as well.
> 
> I see, thanks for the info. I found the series, it's here:
> 
> https://lore.kernel.org/all/20241113-guestmem-library-v3-0-71fdee85676b@quicinc.com/
> 
> Here, the question is whether do we still want to keep explicit calls to
> shmem, hugetlbfs and in the future, guest-memfd.  The library-ize of
> guest-memfd doesn't change a huge lot on answering this question, IIUC.

Can you explore moving hugetlb_mfill_atomic_pte and
shmem_mfill_atomic_pte into mm/userfaultfd.c and generalizing them to
use the same code?

That is, split the similar blocks into functions and reduce duplication.

These are under the UFFD config option and are pretty similar.  This
will also limit the bleeding of mfill_atomic_mode out of uffd.

If you look at what the code does in userfaultfd.c, you can see that
some refactoring is necessary for other reasons:

mfill_atomic() calls mfill_atomic_hugetlb(), or it enters a while
(src_addr < src_start + len) to call mfill_atomic_pte().. which might
call shmem_mfill_atomic_pte() in mm/shmem.c

mfill_atomic_hugetlb() calls, in a while (src_addr < src_start + len)
loop and calls hugetlb_mfill_atomic_pte() in mm/hugetlb.c

The shmem call already depends on the vma flags.. which it still does in
your patch 4 here.  So you've replaced this:

if (!(dst_vma->vm_flags & VM_SHARED)) {
...
} else {
        shmem_mfill_atomic_pte()
}

With...

if (!(dst_vma->vm_flags & VM_SHARED)) {
...
} else {
...
        uffd_ops->uffd_copy()
}

So, really, what needs to happen first is userfaultfd needs to be
sorted.

There's no point of creating a vm_ops_uffd if it will just serve as
replacing the call locations of the functions like this, as it has done
nothing to simplify the logic.

> However if we want to generalize userfaultfd capability for a type of
> memory, we will still need something like the vm_uffd_ops hook to report
> such information.  It means drivers can still overwrite these, with/without
> an exported mfill_atomic_install_pte() functions.  I'm not sure whether
> that eases the concern.

If we work through the duplication and reduction where possible, the
path forward may be easier to see.

> 
> So to me, generalizing the mem type looks helpful with/without moving
> guest-memfd under mm/.

Yes, it should decrease the duplication across hugetlb.c and shmem.c,
but I think that userfaultfd is the place to start.

> 
> We do have the option to keep hard-code guest-memfd like shmem or
> hugetlbfs. This is still "doable", but this likely means guest-memfd
> support for userfaultfd needs to be done after that work.  I did quickly
> check the status of gunyah hypervisor [1,2,3], I found that all of the
> efforts are not yet continued in 2025.  The hypervisor last update was Jan
> 2024 with a batch push [1].
> 
> I still prefer generalizing uffd capabilities using the ops.  That makes
> guest-memfd support on MINOR not be blocked and it should be able to be
> done concurrently v.s. guest-memfd library.  If guest-memfd library idea
> didn't move on, it's non-issue either.
> 
> I've considered dropping uffd_copy() and MISSING support for vm_uffd_ops if
> I'm going to repost - that looks like the only thing that people are
> against with, even though that is not my preference, as that'll make the
> API half-broken on its own.

The generalisation you did does not generalize much, as I pointed out
above, and so it seems less impactful than it could be.

These patches also do not explore what this means for guest_memfd.  So
it is not clear that the expected behaviour will serve the need.

You sent a link to an example user.  Can you please keep this work
together in the patch set so that we know it'll work for your use case
and allows us an easier way to pull down this work so we can examine it.

Alternatively, you can reduce and combine the memory types without
exposing the changes externally, if they stand on their own.  But I
don't think anyone is going to accept using a vm_ops change where a
direct function call could be used.

> Said that, I still prefer this against
> hard-code and/or use CONFIG_GUESTMEM in userfaultfd code.

It also does nothing with regards to the CONFIG_USERFAULTFD in other
areas.  My hope is that you could pull out the common code and make the
CONFIG_ sections smaller.

And, by the way, it isn't the fact that we're going to copy something
(or mfill_atomic it, I guess?) but the fact that we're handing out the
pointer for something else to do that.  Please handle this manipulation
in the core mm code without handing out pointers to folios, or page
tables.

You could do this with the address being passed in and figure out the
type, or even a function pointer that you specifically pass in an enum
of the type (I think this is what Lorenzo was suggesting somewhere),
maybe with multiple flags for actions and fallback (MFILL|ZERO for
example).

Thanks,
Liam

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
  2025-07-04 19:39                     ` Liam R. Howlett
@ 2025-09-01 16:01                       ` Nikita Kalyazin
  0 siblings, 0 replies; 42+ messages in thread
From: Nikita Kalyazin @ 2025-09-01 16:01 UTC (permalink / raw)
  To: Liam R. Howlett, Lorenzo Stoakes
  Cc: Peter Xu, David Hildenbrand, Mike Rapoport, Suren Baghdasaryan,
	linux-mm, linux-kernel, Vlastimil Babka, Muchun Song,
	Hugh Dickins, Andrew Morton, James Houghton, Michal Hocko,
	Andrea Arcangeli, Oscar Salvador, Axel Rasmussen, Ujwal Kundur



On 04/07/2025 20:39, Liam R. Howlett wrote:
> * Peter Xu <peterx@redhat.com> [250704 11:00]:
>> On Fri, Jul 04, 2025 at 11:34:15AM +0200, David Hildenbrand wrote:
>>> On 03.07.25 19:48, Mike Rapoport wrote:
>>>> On Wed, Jul 02, 2025 at 03:46:57PM -0400, Peter Xu wrote:
>>>>> On Wed, Jul 02, 2025 at 08:39:32PM +0300, Mike Rapoport wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>>>> The main target of this change is the implementation of UFFD for
>>>>>>> KVM/guest_memfd (examples: [1], [2]) to avoid bringing KVM-specific code
>>>>>>> into the mm codebase.  We usually mean KVM by the "drivers" in this context,
>>>>>>> and it is already somewhat "knowledgeable" of the mm.  I don't think there
>>>>>>> are existing use cases for other drivers to implement this at the moment.
>>>>>>>
>>>>>>> Although I can't see new exports in this series, there is now a way to limit
>>>>>>> exports to particular modules [3].  Would it help if we only do it for KVM
>>>>>>> initially (if/when actually needed)?
>>>>>>
>>>>>> There were talks about pulling out guest_memfd core into mm, but I don't
>>>>>> remember patches about it. If parts of guest_memfd were already in mm/ that
>>>>>> would make easier to export uffd ops to it.
>>>>>
>>>>> Do we have a link to such discussion?  I'm also curious whether that idea
>>>>> was acknowledged by KVM maintainers.
>>>>
>>>> AFAIR it was discussed at one of David's guest_memfd calls
>>>
>>> While it was discussed in the call a couple of times in different context
>>> (guest_memfd as a library / guest_memfd shim), I think we already discussed
>>> it back at LPC last year.
>>>
>>> One of the main reasons for doing that is supporting guest_memfd in other
>>> hypervisors -- the gunyah hypervisor in the kernel wants to make use of it
>>> as well.
>>
>> I see, thanks for the info. I found the series, it's here:
>>
>> https://lore.kernel.org/all/20241113-guestmem-library-v3-0-71fdee85676b@quicinc.com/
>>
>> Here, the question is whether do we still want to keep explicit calls to
>> shmem, hugetlbfs and in the future, guest-memfd.  The library-ize of
>> guest-memfd doesn't change a huge lot on answering this question, IIUC.
> 
> Can you explore moving hugetlb_mfill_atomic_pte and
> shmem_mfill_atomic_pte into mm/userfaultfd.c and generalizing them to
> use the same code?
> 
> That is, split the similar blocks into functions and reduce duplication.
> 
> These are under the UFFD config option and are pretty similar.  This
> will also limit the bleeding of mfill_atomic_mode out of uffd.
> 
> 
> 
> If you look at what the code does in userfaultfd.c, you can see that
> some refactoring is necessary for other reasons:
> 
> mfill_atomic() calls mfill_atomic_hugetlb(), or it enters a while
> (src_addr < src_start + len) to call mfill_atomic_pte().. which might
> call shmem_mfill_atomic_pte() in mm/shmem.c
> 
> mfill_atomic_hugetlb() calls, in a while (src_addr < src_start + len)
> loop and calls hugetlb_mfill_atomic_pte() in mm/hugetlb.c
> 
> The shmem call already depends on the vma flags.. which it still does in
> your patch 4 here.  So you've replaced this:
> 
> if (!(dst_vma->vm_flags & VM_SHARED)) {
> ...
> } else {
>          shmem_mfill_atomic_pte()
> }
> 
> With...
> 
> if (!(dst_vma->vm_flags & VM_SHARED)) {
> ...
> } else {
> ...
>          uffd_ops->uffd_copy()
> }
> 
> So, really, what needs to happen first is userfaultfd needs to be
> sorted.
> 
> There's no point of creating a vm_ops_uffd if it will just serve as
> replacing the call locations of the functions like this, as it has done
> nothing to simplify the logic.
> 
>> However if we want to generalize userfaultfd capability for a type of
>> memory, we will still need something like the vm_uffd_ops hook to report
>> such information.  It means drivers can still overwrite these, with/without
>> an exported mfill_atomic_install_pte() functions.  I'm not sure whether
>> that eases the concern.
> 
> If we work through the duplication and reduction where possible, the
> path forward may be easier to see.
> 
>>
>> So to me, generalizing the mem type looks helpful with/without moving
>> guest-memfd under mm/.
> 
> Yes, it should decrease the duplication across hugetlb.c and shmem.c,
> but I think that userfaultfd is the place to start.
> 
>>
>> We do have the option to keep hard-code guest-memfd like shmem or
>> hugetlbfs. This is still "doable", but this likely means guest-memfd
>> support for userfaultfd needs to be done after that work.  I did quickly
>> check the status of gunyah hypervisor [1,2,3], I found that all of the
>> efforts are not yet continued in 2025.  The hypervisor last update was Jan
>> 2024 with a batch push [1].
>>
>> I still prefer generalizing uffd capabilities using the ops.  That makes
>> guest-memfd support on MINOR not be blocked and it should be able to be
>> done concurrently v.s. guest-memfd library.  If guest-memfd library idea
>> didn't move on, it's non-issue either.
>>
>> I've considered dropping uffd_copy() and MISSING support for vm_uffd_ops if
>> I'm going to repost - that looks like the only thing that people are
>> against with, even though that is not my preference, as that'll make the
>> API half-broken on its own.
> 
> The generalisation you did does not generalize much, as I pointed out
> above, and so it seems less impactful than it could be.
> 
> These patches also do not explore what this means for guest_memfd.  So
> it is not clear that the expected behaviour will serve the need.
> 
> You sent a link to an example user.  Can you please keep this work
> together in the patch set so that we know it'll work for your use case
> and allows us an easier way to pull down this work so we can examine it.

Hi Liam, Lorenzo,

With mmap support in guest_memfd being recently accepted and merged into 
kvm/next [1], UFFDIO_CONTINUE support in guest_memfd becomes a real use 
case.

 From what I understand, it is the API for UFFDIO_COPY (ie the 
.uffd_copy callback) proposed by Peter that raises the safery issues, 
while the generalisation of the checks (.uffd_features, .uffd_ioctls) 
and .uffd_get_folio needed for UFFDIO_CONTINUE do not introduce such 
concerns.  In order to unblock the userfaultfd support in guest_memfd, 
would it be acceptable to start with implementing 
.uffd_get_folio/UFFDIO_CONTINUE only, leaving the callback for 
UFFDIO_COPY for later when we have an idea about a safer API and a clear 
use case for that?

If that's the case, what would be the best way to demonstrate the client 
code?  Would using kvm/next as the base for the aggregated series (new 
mm API + client code in kvm/guest_memfd) work?

Thanks,
Nikita

[1]: https://lore.kernel.org/kvm/20250729225455.670324-1-seanjc@google.com/
> 
> Alternatively, you can reduce and combine the memory types without
> exposing the changes externally, if they stand on their own.  But I
> don't think anyone is going to accept using a vm_ops change where a
> direct function call could be used.
> 
>> Said that, I still prefer this against
>> hard-code and/or use CONFIG_GUESTMEM in userfaultfd code.
> 
> It also does nothing with regards to the CONFIG_USERFAULTFD in other
> areas.  My hope is that you could pull out the common code and make the
> CONFIG_ sections smaller.
> 
> And, by the way, it isn't the fact that we're going to copy something
> (or mfill_atomic it, I guess?) but the fact that we're handing out the
> pointer for something else to do that.  Please handle this manipulation
> in the core mm code without handing out pointers to folios, or page
> tables.
> 
> You could do this with the address being passed in and figure out the
> type, or even a function pointer that you specifically pass in an enum
> of the type (I think this is what Lorenzo was suggesting somewhere),
> maybe with multiple flags for actions and fallback (MFILL|ZERO for
> example).
> 
> Thanks,
> Liam
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2025-09-01 16:01 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-27 15:46 [PATCH v2 0/4] mm/userfaultfd: modulize memory types Peter Xu
2025-06-27 15:46 ` [PATCH v2 1/4] mm: Introduce vm_uffd_ops API Peter Xu
2025-06-29  8:50   ` Mike Rapoport
2025-07-02 19:30     ` Peter Xu
2025-06-30 10:15   ` Lorenzo Stoakes
2025-07-01 17:04     ` Suren Baghdasaryan
2025-07-02 15:40       ` Liam R. Howlett
2025-07-02 15:56       ` Lorenzo Stoakes
2025-07-02 17:08         ` Nikita Kalyazin
2025-07-02 17:39           ` Mike Rapoport
2025-07-02 19:46             ` Peter Xu
2025-07-03 17:48               ` Mike Rapoport
2025-07-04  9:34                 ` David Hildenbrand
2025-07-04 14:59                   ` Peter Xu
2025-07-04 19:39                     ` Liam R. Howlett
2025-09-01 16:01                       ` Nikita Kalyazin
2025-07-02 21:24           ` Liam R. Howlett
2025-07-02 21:36             ` Peter Xu
2025-07-03  2:00               ` Liam R. Howlett
2025-07-03 15:24                 ` Peter Xu
2025-07-03 16:15                   ` Lorenzo Stoakes
2025-07-03 17:39                   ` Liam R. Howlett
2025-07-02 20:24         ` Peter Xu
2025-07-03 16:32           ` Lorenzo Stoakes
2025-07-02 18:16       ` Mike Rapoport
2025-07-02 20:22         ` Peter Xu
2025-07-03 15:01           ` Suren Baghdasaryan
2025-07-03 15:45             ` Peter Xu
2025-07-03 16:01               ` Lorenzo Stoakes
2025-06-27 15:46 ` [PATCH v2 2/4] mm/shmem: Support " Peter Xu
2025-06-29  8:51   ` Mike Rapoport
2025-06-27 15:46 ` [PATCH v2 3/4] mm/hugetlb: " Peter Xu
2025-06-29  8:52   ` Mike Rapoport
2025-06-27 15:46 ` [PATCH v2 4/4] mm: Apply vm_uffd_ops API to core mm Peter Xu
2025-06-29  8:55   ` Mike Rapoport
2025-07-02 20:38     ` Peter Xu
2025-06-30 10:29 ` [PATCH v2 0/4] mm/userfaultfd: modulize memory types Lorenzo Stoakes
2025-07-01  0:15   ` Andrew Morton
2025-07-02 20:36   ` Peter Xu
2025-07-03 15:55     ` Lorenzo Stoakes
2025-07-03 16:26       ` Peter Xu
2025-07-03 16:44         ` Lorenzo Stoakes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).