[PATCH 0/4] mm/userfaultfd: modulize memory types

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/4] mm/userfaultfd: modulize memory types
@ 2025-06-20 19:03 Peter Xu
  2025-06-20 19:03 ` [PATCH 1/4] mm: Introduce vm_uffd_ops API Peter Xu
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Peter Xu @ 2025-06-20 19:03 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nikita Kalyazin, peterx, Hugh Dickins, Oscar Salvador,
	Michal Hocko, David Hildenbrand, Muchun Song, Andrea Arcangeli,
	Ujwal Kundur, Suren Baghdasaryan, Andrew Morton, Vlastimil Babka,
	Liam R . Howlett, James Houghton, Mike Rapoport, Lorenzo Stoakes,
	Axel Rasmussen

[based on akpm/mm-new]

This series is an alternative proposal of what Nikita proposed here on the
initial three patches:

  https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com

This is not yet relevant to any guest-memfd support, but paving way for it.
Here, the major goal is to make kernel modules be able to opt-in with any
form of userfaultfd supports, like guest-memfd.  This alternative option
should hopefully be cleaner, and avoid leaking userfault details into
vm_ops.fault().

It also means this series does not depend on anything.  It's a pure
refactoring of userfaultfd internals to provide a generic API, so that
other types of files, especially RAM based, can support userfaultfd without
touching mm/ at all.

To achieve it, this series introduced a file operation called vm_uffd_ops.
The ops needs to be provided when a file type supports any of userfaultfd.

With that, I moved both hugetlbfs and shmem over.

Hugetlbfs is still very special that it will only use partial of the
vm_uffd_ops API, due to similar reason why hugetlb_vm_op_fault() has a
BUG() and so far hard-coded into core mm.  But this should still be better,
because at least hugetlbfs is still always involved in feature probing
(e.g. where it used to not support ZEROPAGE and we have a hard-coded line
to fail that, and some more).  Meanwhile after this series, shmem will be
completely converted to the new vm_uffd_ops API; the final vm_uffd_ops for
shmem looks like this:

static const vm_uffd_ops shmem_uffd_ops = {
	.uffd_features	= 	__VM_UFFD_FLAGS,
	.uffd_ioctls	= 	BIT(_UFFDIO_COPY) |
				BIT(_UFFDIO_ZEROPAGE) |
				BIT(_UFFDIO_WRITEPROTECT) |
				BIT(_UFFDIO_CONTINUE) |
				BIT(_UFFDIO_POISON),
	.uffd_get_folio	=	shmem_uffd_get_folio,
	.uffd_copy	=	shmem_mfill_atomic_pte,
};

As I mentioned in one of my reply to Nikita, I don't like the current
interface of uffd_copy(), but this will be the minimum change version of
such API to support complete extrenal-module-ready userfaultfd.  Here, very
minimal change will be needed from shmem side to support that.

Meanwhile, the vm_uffd_ops is also not the only place one will need to
provide to support userfaultfd.  Normally vm_ops.fault() will also need to
be updated, but that's a generic function and it'll play together with the
new vm_uffd_ops to make everything fly.

No functional change expected at all after the whole series applied.  There
might be some slightly stricter check on uffd ops here and there in the
last patch, but that really shouldn't stand out anywhere to anyone.

For testing: besides the cross-compilation tests, I did also try with
uffd-stress in a VM to measure any perf difference before/after the change;
The static call becomes a pointer now.  I really cannot measure anything
different, which is more or less expected.

Comments welcomed, thanks.

Peter Xu (4):
  mm: Introduce vm_uffd_ops API
  mm/shmem: Support vm_uffd_ops API
  mm/hugetlb: Support vm_uffd_ops API
  mm: Apply vm_uffd_ops API to core mm

 include/linux/mm.h            |  71 +++++++++++++++++++++
 include/linux/shmem_fs.h      |  14 -----
 include/linux/userfaultfd_k.h |  58 ++++-------------
 mm/hugetlb.c                  |  19 ++++++
 mm/shmem.c                    |  28 ++++++++-
 mm/userfaultfd.c              | 115 +++++++++++++++++++++++++---------
 6 files changed, 217 insertions(+), 88 deletions(-)

-- 
2.49.0

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/4] mm: Introduce vm_uffd_ops API
  2025-06-20 19:03 [PATCH 0/4] mm/userfaultfd: modulize memory types Peter Xu
@ 2025-06-20 19:03 ` Peter Xu
  2025-06-22  7:28   ` Mike Rapoport
  2025-06-23  8:25   ` David Hildenbrand
  2025-06-20 19:03 ` [PATCH 2/4] mm/shmem: Support " Peter Xu
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 24+ messages in thread
From: Peter Xu @ 2025-06-20 19:03 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nikita Kalyazin, peterx, Hugh Dickins, Oscar Salvador,
	Michal Hocko, David Hildenbrand, Muchun Song, Andrea Arcangeli,
	Ujwal Kundur, Suren Baghdasaryan, Andrew Morton, Vlastimil Babka,
	Liam R . Howlett, James Houghton, Mike Rapoport, Lorenzo Stoakes,
	Axel Rasmussen

Introduce a generic userfaultfd API for vm_operations_struct, so that one
vma, especially when as a module, can support userfaults without modifying
the core files.  More importantly, when the module can be compiled out of
the kernel.

So, instead of having core mm referencing modules that may not ever exist,
we need to have modules opt-in on core mm hooks instead.

After this API applied, if a module wants to support userfaultfd, the
module should only need to touch its own file and properly define
vm_uffd_ops, instead of changing anything in core mm.

Note that such API will not work for anonymous. Core mm will process
anonymous memory separately for userfault operations like before.

This patch only introduces the API alone so that we can start to move
existing users over but without breaking them.

Currently the uffd_copy() API is almost designed to be the simplistic with
minimum mm changes to move over to the API.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h            | 71 +++++++++++++++++++++++++++++++++++
 include/linux/userfaultfd_k.h | 12 ------
 2 files changed, 71 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 98a606908307..8dfd83f01d3d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -576,6 +576,70 @@ struct vm_fault {
 					 */
 };
 
+#ifdef CONFIG_USERFAULTFD
+/* A combined operation mode + behavior flags. */
+typedef unsigned int __bitwise uffd_flags_t;
+
+enum mfill_atomic_mode {
+	MFILL_ATOMIC_COPY,
+	MFILL_ATOMIC_ZEROPAGE,
+	MFILL_ATOMIC_CONTINUE,
+	MFILL_ATOMIC_POISON,
+	NR_MFILL_ATOMIC_MODES,
+};
+
+/* VMA userfaultfd operations */
+typedef struct {
+	/**
+	 * @uffd_features: features supported in bitmask.
+	 *
+	 * When the ops is defined, the driver must set non-zero features
+	 * to be a subset (or all) of: VM_UFFD_MISSING|WP|MINOR.
+	 */
+	unsigned long uffd_features;
+	/**
+	 * @uffd_ioctls: ioctls supported in bitmask.
+	 *
+	 * Userfaultfd ioctls supported by the module.  Below will always
+	 * be supported by default whenever a module provides vm_uffd_ops:
+	 *
+	 *   _UFFDIO_API, _UFFDIO_REGISTER, _UFFDIO_UNREGISTER, _UFFDIO_WAKE
+	 *
+	 * The module needs to provide all the rest optionally supported
+	 * ioctls.  For example, when VM_UFFD_MISSING was supported,
+	 * _UFFDIO_COPY must be supported as ioctl, while _UFFDIO_ZEROPAGE
+	 * is optional.
+	 */
+	unsigned long uffd_ioctls;
+	/**
+	 * uffd_get_folio: Handler to resolve UFFDIO_CONTINUE request.
+	 *
+	 * @inode: the inode for folio lookup
+	 * @pgoff: the pgoff of the folio
+	 * @folio: returned folio pointer
+	 *
+	 * Return: zero if succeeded, negative for errors.
+	 */
+	int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
+			      struct folio **folio);
+	/**
+	 * uffd_copy: Handler to resolve UFFDIO_COPY|ZEROPAGE request.
+	 *
+	 * @dst_pmd: target pmd to resolve page fault
+	 * @dst_vma: target vma
+	 * @dst_addr: target virtual address
+	 * @src_addr: source address to copy from
+	 * @flags: userfaultfd request flags
+	 * @foliop: previously allocated folio
+	 *
+	 * Return: zero if succeeded, negative for errors.
+	 */
+	int (*uffd_copy)(pmd_t *dst_pmd, struct vm_area_struct *dst_vma,
+			 unsigned long dst_addr, unsigned long src_addr,
+			 uffd_flags_t flags, struct folio **foliop);
+} vm_uffd_ops;
+#endif
+
 /*
  * These are the virtual MM functions - opening of an area, closing and
  * unmapping it (needed to keep files on disk up-to-date etc), pointer
@@ -653,6 +717,13 @@ struct vm_operations_struct {
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,
 					  unsigned long addr);
+#ifdef CONFIG_USERFAULTFD
+	/*
+	 * Userfaultfd related ops.  Modules need to define this to support
+	 * userfaultfd.
+	 */
+	const vm_uffd_ops *userfaultfd_ops;
+#endif
 };
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index ccad58602846..e79c724b3b95 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -80,18 +80,6 @@ struct userfaultfd_ctx {
 
 extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
 
-/* A combined operation mode + behavior flags. */
-typedef unsigned int __bitwise uffd_flags_t;
-
-/* Mutually exclusive modes of operation. */
-enum mfill_atomic_mode {
-	MFILL_ATOMIC_COPY,
-	MFILL_ATOMIC_ZEROPAGE,
-	MFILL_ATOMIC_CONTINUE,
-	MFILL_ATOMIC_POISON,
-	NR_MFILL_ATOMIC_MODES,
-};
-
 #define MFILL_ATOMIC_MODE_BITS (const_ilog2(NR_MFILL_ATOMIC_MODES - 1) + 1)
 #define MFILL_ATOMIC_BIT(nr) BIT(MFILL_ATOMIC_MODE_BITS + (nr))
 #define MFILL_ATOMIC_FLAG(nr) ((__force uffd_flags_t) MFILL_ATOMIC_BIT(nr))
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/4] mm/shmem: Support vm_uffd_ops API
  2025-06-20 19:03 [PATCH 0/4] mm/userfaultfd: modulize memory types Peter Xu
  2025-06-20 19:03 ` [PATCH 1/4] mm: Introduce vm_uffd_ops API Peter Xu
@ 2025-06-20 19:03 ` Peter Xu
  2025-06-20 19:03 ` [PATCH 3/4] mm/hugetlb: " Peter Xu
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: Peter Xu @ 2025-06-20 19:03 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nikita Kalyazin, peterx, Hugh Dickins, Oscar Salvador,
	Michal Hocko, David Hildenbrand, Muchun Song, Andrea Arcangeli,
	Ujwal Kundur, Suren Baghdasaryan, Andrew Morton, Vlastimil Babka,
	Liam R . Howlett, James Houghton, Mike Rapoport, Lorenzo Stoakes,
	Axel Rasmussen

Add support for the new vm_uffd_ops API for shmem.  Note that this only
introduces the support, the API is not yet used by core mm.

Due to the tailored uffd_copy() API, shmem is extremely easy to support it
by reusing the existing mfill function.

It only needs a separate uffd_get_folio() definition but that's oneliner.

Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/shmem.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index 0bc30dafad90..bd0a29000318 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3151,6 +3151,13 @@ static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap,
 #endif /* CONFIG_TMPFS_QUOTA */
 
 #ifdef CONFIG_USERFAULTFD
+
+static int shmem_uffd_get_folio(struct inode *inode, pgoff_t pgoff,
+				struct folio **folio)
+{
+	return shmem_get_folio(inode, pgoff, 0, folio, SGP_NOALLOC);
+}
+
 int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
 			   struct vm_area_struct *dst_vma,
 			   unsigned long dst_addr,
@@ -5194,6 +5201,19 @@ static int shmem_error_remove_folio(struct address_space *mapping,
 	return 0;
 }
 
+#ifdef CONFIG_USERFAULTFD
+static const vm_uffd_ops shmem_uffd_ops = {
+	.uffd_features	= 	__VM_UFFD_FLAGS,
+	.uffd_ioctls	= 	BIT(_UFFDIO_COPY) |
+				BIT(_UFFDIO_ZEROPAGE) |
+				BIT(_UFFDIO_WRITEPROTECT) |
+				BIT(_UFFDIO_CONTINUE) |
+				BIT(_UFFDIO_POISON),
+	.uffd_get_folio	=	shmem_uffd_get_folio,
+	.uffd_copy	=	shmem_mfill_atomic_pte,
+};
+#endif
+
 static const struct address_space_operations shmem_aops = {
 	.dirty_folio	= noop_dirty_folio,
 #ifdef CONFIG_TMPFS
@@ -5296,6 +5316,9 @@ static const struct vm_operations_struct shmem_vm_ops = {
 	.set_policy     = shmem_set_policy,
 	.get_policy     = shmem_get_policy,
 #endif
+#ifdef CONFIG_USERFAULTFD
+	.userfaultfd_ops = &shmem_uffd_ops,
+#endif
 };
 
 static const struct vm_operations_struct shmem_anon_vm_ops = {
@@ -5305,6 +5328,9 @@ static const struct vm_operations_struct shmem_anon_vm_ops = {
 	.set_policy     = shmem_set_policy,
 	.get_policy     = shmem_get_policy,
 #endif
+#ifdef CONFIG_USERFAULTFD
+	.userfaultfd_ops = &shmem_uffd_ops,
+#endif
 };
 
 int shmem_init_fs_context(struct fs_context *fc)
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/4] mm/hugetlb: Support vm_uffd_ops API
  2025-06-20 19:03 [PATCH 0/4] mm/userfaultfd: modulize memory types Peter Xu
  2025-06-20 19:03 ` [PATCH 1/4] mm: Introduce vm_uffd_ops API Peter Xu
  2025-06-20 19:03 ` [PATCH 2/4] mm/shmem: Support " Peter Xu
@ 2025-06-20 19:03 ` Peter Xu
  2025-06-20 19:03 ` [PATCH 4/4] mm: Apply vm_uffd_ops API to core mm Peter Xu
  2025-06-25 16:56 ` [PATCH 0/4] mm/userfaultfd: modulize memory types Nikita Kalyazin
  4 siblings, 0 replies; 24+ messages in thread
From: Peter Xu @ 2025-06-20 19:03 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nikita Kalyazin, peterx, Hugh Dickins, Oscar Salvador,
	Michal Hocko, David Hildenbrand, Muchun Song, Andrea Arcangeli,
	Ujwal Kundur, Suren Baghdasaryan, Andrew Morton, Vlastimil Babka,
	Liam R . Howlett, James Houghton, Mike Rapoport, Lorenzo Stoakes,
	Axel Rasmussen

Add support for the new vm_uffd_ops API for hugetlb.  Note that this only
introduces the support, the API is not yet used by core mm.

Due to legacy reasons, it's still not trivial to move hugetlb completely to
the API (like shmem).  But it will still use uffd_features and uffd_ioctls
properly on the API because that's pretty general.

Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3d61ec17c15a..b9e473fab871 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5459,6 +5459,22 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
 	return 0;
 }
 
+#ifdef CONFIG_USERFAULTFD
+static const vm_uffd_ops hugetlb_uffd_ops = {
+	.uffd_features	= 	__VM_UFFD_FLAGS,
+	/* _UFFDIO_ZEROPAGE not supported */
+	.uffd_ioctls	= 	BIT(_UFFDIO_COPY) |
+				BIT(_UFFDIO_WRITEPROTECT) |
+				BIT(_UFFDIO_CONTINUE) |
+				BIT(_UFFDIO_POISON),
+	/*
+	 * Hugetlbfs still has its own hard-coded handler in userfaultfd,
+	 * due to limitations similar to vm_operations_struct.fault().
+	 * TODO: generalize it to use the API functions.
+	 */
+};
+#endif
+
 /*
  * When a new function is introduced to vm_operations_struct and added
  * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops.
@@ -5472,6 +5488,9 @@ const struct vm_operations_struct hugetlb_vm_ops = {
 	.close = hugetlb_vm_op_close,
 	.may_split = hugetlb_vm_op_split,
 	.pagesize = hugetlb_vm_op_pagesize,
+#ifdef CONFIG_USERFAULTFD
+	.userfaultfd_ops = &hugetlb_uffd_ops,
+#endif
 };
 
 static pte_t make_huge_pte(struct vm_area_struct *vma, struct folio *folio,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/4] mm: Apply vm_uffd_ops API to core mm
  2025-06-20 19:03 [PATCH 0/4] mm/userfaultfd: modulize memory types Peter Xu
                   ` (2 preceding siblings ...)
  2025-06-20 19:03 ` [PATCH 3/4] mm/hugetlb: " Peter Xu
@ 2025-06-20 19:03 ` Peter Xu
  2025-06-22 19:09   ` kernel test robot
  2025-06-25 20:31   ` James Houghton
  2025-06-25 16:56 ` [PATCH 0/4] mm/userfaultfd: modulize memory types Nikita Kalyazin
  4 siblings, 2 replies; 24+ messages in thread
From: Peter Xu @ 2025-06-20 19:03 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Nikita Kalyazin, peterx, Hugh Dickins, Oscar Salvador,
	Michal Hocko, David Hildenbrand, Muchun Song, Andrea Arcangeli,
	Ujwal Kundur, Suren Baghdasaryan, Andrew Morton, Vlastimil Babka,
	Liam R . Howlett, James Houghton, Mike Rapoport, Lorenzo Stoakes,
	Axel Rasmussen

This patch completely moves the old userfaultfd core to use the new
vm_uffd_ops API.  After this change, existing file systems will start to
use the new API for userfault operations.

When at it, moving vma_can_userfault() into mm/userfaultfd.c instead,
because it's getting too big.  It's only used in slow paths so it shouldn't
be an issue.

This will also remove quite some hard-coded checks for either shmem or
hugetlbfs.  Now all the old checks should still work but with vm_uffd_ops.

Note that anonymous memory will still need to be processed separately
because it doesn't have vm_ops at all.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/shmem_fs.h      |  14 -----
 include/linux/userfaultfd_k.h |  46 ++++----------
 mm/shmem.c                    |   2 +-
 mm/userfaultfd.c              | 115 +++++++++++++++++++++++++---------
 4 files changed, 101 insertions(+), 76 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 6d0f9c599ff7..2f5b7b295cf6 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -195,20 +195,6 @@ static inline pgoff_t shmem_fallocend(struct inode *inode, pgoff_t eof)
 extern bool shmem_charge(struct inode *inode, long pages);
 extern void shmem_uncharge(struct inode *inode, long pages);
 
-#ifdef CONFIG_USERFAULTFD
-#ifdef CONFIG_SHMEM
-extern int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
-				  struct vm_area_struct *dst_vma,
-				  unsigned long dst_addr,
-				  unsigned long src_addr,
-				  uffd_flags_t flags,
-				  struct folio **foliop);
-#else /* !CONFIG_SHMEM */
-#define shmem_mfill_atomic_pte(dst_pmd, dst_vma, dst_addr, \
-			       src_addr, flags, foliop) ({ BUG(); 0; })
-#endif /* CONFIG_SHMEM */
-#endif /* CONFIG_USERFAULTFD */
-
 /*
  * Used space is stored as unsigned 64-bit value in bytes but
  * quota core supports only signed 64-bit values so use that
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index e79c724b3b95..4e56ad423a4a 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -85,9 +85,14 @@ extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
 #define MFILL_ATOMIC_FLAG(nr) ((__force uffd_flags_t) MFILL_ATOMIC_BIT(nr))
 #define MFILL_ATOMIC_MODE_MASK ((__force uffd_flags_t) (MFILL_ATOMIC_BIT(0) - 1))
 
+static inline enum mfill_atomic_mode uffd_flags_get_mode(uffd_flags_t flags)
+{
+	return (enum mfill_atomic_mode)(flags & MFILL_ATOMIC_MODE_MASK);
+}
+
 static inline bool uffd_flags_mode_is(uffd_flags_t flags, enum mfill_atomic_mode expected)
 {
-	return (flags & MFILL_ATOMIC_MODE_MASK) == ((__force uffd_flags_t) expected);
+	return uffd_flags_get_mode(flags) == expected;
 }
 
 static inline uffd_flags_t uffd_flags_set_mode(uffd_flags_t flags, enum mfill_atomic_mode mode)
@@ -196,41 +201,16 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 	return vma->vm_flags & __VM_UFFD_FLAGS;
 }
 
-static inline bool vma_can_userfault(struct vm_area_struct *vma,
-				     unsigned long vm_flags,
-				     bool wp_async)
+static inline const vm_uffd_ops *vma_get_uffd_ops(struct vm_area_struct *vma)
 {
-	vm_flags &= __VM_UFFD_FLAGS;
-
-	if (vma->vm_flags & VM_DROPPABLE)
-		return false;
-
-	if ((vm_flags & VM_UFFD_MINOR) &&
-	    (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
-		return false;
-
-	/*
-	 * If wp async enabled, and WP is the only mode enabled, allow any
-	 * memory type.
-	 */
-	if (wp_async && (vm_flags == VM_UFFD_WP))
-		return true;
-
-#ifndef CONFIG_PTE_MARKER_UFFD_WP
-	/*
-	 * If user requested uffd-wp but not enabled pte markers for
-	 * uffd-wp, then shmem & hugetlbfs are not supported but only
-	 * anonymous.
-	 */
-	if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
-		return false;
-#endif
-
-	/* By default, allow any of anon|shmem|hugetlb */
-	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
-	    vma_is_shmem(vma);
+	if (vma->vm_ops && vma->vm_ops->userfaultfd_ops)
+		return vma->vm_ops->userfaultfd_ops;
+	return NULL;
 }
 
+bool vma_can_userfault(struct vm_area_struct *vma,
+		       unsigned long vm_flags, bool wp_async);
+
 static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
 {
 	struct userfaultfd_ctx *uffd_ctx = vma->vm_userfaultfd_ctx.ctx;
diff --git a/mm/shmem.c b/mm/shmem.c
index bd0a29000318..4d71fc7be358 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3158,7 +3158,7 @@ static int shmem_uffd_get_folio(struct inode *inode, pgoff_t pgoff,
 	return shmem_get_folio(inode, pgoff, 0, folio, SGP_NOALLOC);
 }
 
-int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
+static int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
 			   struct vm_area_struct *dst_vma,
 			   unsigned long dst_addr,
 			   unsigned long src_addr,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 879505c6996f..61783ff2d335 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -14,12 +14,48 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/mmu_notifier.h>
 #include <linux/hugetlb.h>
-#include <linux/shmem_fs.h>
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
 #include "internal.h"
 #include "swap.h"
 
+bool vma_can_userfault(struct vm_area_struct *vma,
+		       unsigned long vm_flags, bool wp_async)
+{
+	unsigned long supported;
+
+	if (vma->vm_flags & VM_DROPPABLE)
+		return false;
+
+	vm_flags &= __VM_UFFD_FLAGS;
+
+#ifndef CONFIG_PTE_MARKER_UFFD_WP
+	/*
+	 * If user requested uffd-wp but not enabled pte markers for
+	 * uffd-wp, then any file system (like shmem or hugetlbfs) are not
+	 * supported but only anonymous.
+	 */
+	if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
+		return false;
+#endif
+	/*
+	 * If wp async enabled, and WP is the only mode enabled, allow any
+	 * memory type.
+	 */
+	if (wp_async && (vm_flags == VM_UFFD_WP))
+		return true;
+
+	if (vma_is_anonymous(vma))
+		/* Anonymous has no page cache, MINOR not supported */
+		supported = VM_UFFD_MISSING | VM_UFFD_WP;
+	else if (vma_get_uffd_ops(vma))
+		supported = vma_get_uffd_ops(vma)->uffd_features;
+	else
+		return false;
+
+	return !(vm_flags & (~supported));
+}
+
 static __always_inline
 bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
 {
@@ -384,11 +420,15 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
 {
 	struct inode *inode = file_inode(dst_vma->vm_file);
 	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
+	const vm_uffd_ops *uffd_ops = vma_get_uffd_ops(dst_vma);
 	struct folio *folio;
 	struct page *page;
 	int ret;
 
-	ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
+	if (WARN_ON_ONCE(!uffd_ops || !uffd_ops->uffd_get_folio))
+		return -EINVAL;
+
+	ret = uffd_ops->uffd_get_folio(inode, pgoff, &folio);
 	/* Our caller expects us to return -EFAULT if we failed to find folio */
 	if (ret == -ENOENT)
 		ret = -EFAULT;
@@ -504,18 +544,6 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 	u32 hash;
 	struct address_space *mapping;
 
-	/*
-	 * There is no default zero huge page for all huge page sizes as
-	 * supported by hugetlb.  A PMD_SIZE huge pages may exist as used
-	 * by THP.  Since we can not reliably insert a zero page, this
-	 * feature is not supported.
-	 */
-	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) {
-		up_read(&ctx->map_changing_lock);
-		uffd_mfill_unlock(dst_vma);
-		return -EINVAL;
-	}
-
 	src_addr = src_start;
 	dst_addr = dst_start;
 	copied = 0;
@@ -686,14 +714,55 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
 			err = mfill_atomic_pte_zeropage(dst_pmd,
 						 dst_vma, dst_addr);
 	} else {
-		err = shmem_mfill_atomic_pte(dst_pmd, dst_vma,
-					     dst_addr, src_addr,
-					     flags, foliop);
+		const vm_uffd_ops *uffd_ops = vma_get_uffd_ops(dst_vma);
+
+		if (WARN_ON_ONCE(!uffd_ops || !uffd_ops->uffd_copy)) {
+			err = -EINVAL;
+		} else {
+			err = uffd_ops->uffd_copy(dst_pmd, dst_vma,
+						  dst_addr, src_addr,
+						  flags, foliop);
+		}
 	}
 
 	return err;
 }
 
+static inline bool
+vma_uffd_ops_supported(struct vm_area_struct *vma, uffd_flags_t flags)
+{
+	enum mfill_atomic_mode mode = uffd_flags_get_mode(flags);
+	const vm_uffd_ops *uffd_ops;
+	unsigned long uffd_ioctls;
+
+	if ((flags & MFILL_ATOMIC_WP) && !(vma->vm_flags & VM_UFFD_WP))
+		return false;
+
+	/* Anonymous supports everything except CONTINUE */
+	if (vma_is_anonymous(vma))
+		return mode != MFILL_ATOMIC_CONTINUE;
+
+	uffd_ops = vma_get_uffd_ops(vma);
+	if (!uffd_ops)
+		return false;
+
+	uffd_ioctls = uffd_ops->uffd_ioctls;
+	switch (mode) {
+	case MFILL_ATOMIC_COPY:
+		return uffd_ioctls & BIT(_UFFDIO_COPY);
+	case MFILL_ATOMIC_ZEROPAGE:
+		return uffd_ioctls & BIT(_UFFDIO_ZEROPAGE);
+	case MFILL_ATOMIC_CONTINUE:
+		if (!(vma->vm_flags & VM_SHARED))
+			return false;
+		return uffd_ioctls & BIT(_UFFDIO_CONTINUE);
+	case MFILL_ATOMIC_POISON:
+		return uffd_ioctls & BIT(_UFFDIO_POISON);
+	default:
+		return false;
+	}
+}
+
 static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 					    unsigned long dst_start,
 					    unsigned long src_start,
@@ -752,11 +821,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 	    dst_vma->vm_flags & VM_SHARED))
 		goto out_unlock;
 
-	/*
-	 * validate 'mode' now that we know the dst_vma: don't allow
-	 * a wrprotect copy if the userfaultfd didn't register as WP.
-	 */
-	if ((flags & MFILL_ATOMIC_WP) && !(dst_vma->vm_flags & VM_UFFD_WP))
+	if (!vma_uffd_ops_supported(dst_vma, flags))
 		goto out_unlock;
 
 	/*
@@ -766,12 +831,6 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 		return  mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
 					     src_start, len, flags);
 
-	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
-		goto out_unlock;
-	if (!vma_is_shmem(dst_vma) &&
-	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
-		goto out_unlock;
-
 	while (src_addr < src_start + len) {
 		pmd_t dst_pmdval;
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/4] mm: Introduce vm_uffd_ops API
  2025-06-20 19:03 ` [PATCH 1/4] mm: Introduce vm_uffd_ops API Peter Xu
@ 2025-06-22  7:28   ` Mike Rapoport
  2025-06-23 13:36     ` Peter Xu
  2025-06-23  8:25   ` David Hildenbrand
  1 sibling, 1 reply; 24+ messages in thread
From: Mike Rapoport @ 2025-06-22  7:28 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Nikita Kalyazin, Hugh Dickins,
	Oscar Salvador, Michal Hocko, David Hildenbrand, Muchun Song,
	Andrea Arcangeli, Ujwal Kundur, Suren Baghdasaryan, Andrew Morton,
	Vlastimil Babka, Liam R . Howlett, James Houghton,
	Lorenzo Stoakes, Axel Rasmussen

Hi Peter,

On Fri, Jun 20, 2025 at 03:03:39PM -0400, Peter Xu wrote:
> Introduce a generic userfaultfd API for vm_operations_struct, so that one
> vma, especially when as a module, can support userfaults without modifying
> the core files.  More importantly, when the module can be compiled out of
> the kernel.
> 
> So, instead of having core mm referencing modules that may not ever exist,
> we need to have modules opt-in on core mm hooks instead.
> 
> After this API applied, if a module wants to support userfaultfd, the
> module should only need to touch its own file and properly define
> vm_uffd_ops, instead of changing anything in core mm.
> 
> Note that such API will not work for anonymous. Core mm will process
> anonymous memory separately for userfault operations like before.
> 
> This patch only introduces the API alone so that we can start to move
> existing users over but without breaking them.
> 
> Currently the uffd_copy() API is almost designed to be the simplistic with
> minimum mm changes to move over to the API.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/mm.h            | 71 +++++++++++++++++++++++++++++++++++
>  include/linux/userfaultfd_k.h | 12 ------
>  2 files changed, 71 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 98a606908307..8dfd83f01d3d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -576,6 +576,70 @@ struct vm_fault {
>  					 */
>  };
>  
> +#ifdef CONFIG_USERFAULTFD
> +/* A combined operation mode + behavior flags. */
> +typedef unsigned int __bitwise uffd_flags_t;
> +
> +enum mfill_atomic_mode {
> +	MFILL_ATOMIC_COPY,
> +	MFILL_ATOMIC_ZEROPAGE,
> +	MFILL_ATOMIC_CONTINUE,
> +	MFILL_ATOMIC_POISON,
> +	NR_MFILL_ATOMIC_MODES,
> +};
> +
> +/* VMA userfaultfd operations */
> +typedef struct {
> +	/**
> +	 * @uffd_features: features supported in bitmask.
> +	 *
> +	 * When the ops is defined, the driver must set non-zero features
> +	 * to be a subset (or all) of: VM_UFFD_MISSING|WP|MINOR.
> +	 */
> +	unsigned long uffd_features;
> +	/**
> +	 * @uffd_ioctls: ioctls supported in bitmask.
> +	 *
> +	 * Userfaultfd ioctls supported by the module.  Below will always
> +	 * be supported by default whenever a module provides vm_uffd_ops:
> +	 *
> +	 *   _UFFDIO_API, _UFFDIO_REGISTER, _UFFDIO_UNREGISTER, _UFFDIO_WAKE
> +	 *
> +	 * The module needs to provide all the rest optionally supported
> +	 * ioctls.  For example, when VM_UFFD_MISSING was supported,
> +	 * _UFFDIO_COPY must be supported as ioctl, while _UFFDIO_ZEROPAGE
> +	 * is optional.
> +	 */
> +	unsigned long uffd_ioctls;
> +	/**
> +	 * uffd_get_folio: Handler to resolve UFFDIO_CONTINUE request.
> +	 *
> +	 * @inode: the inode for folio lookup
> +	 * @pgoff: the pgoff of the folio
> +	 * @folio: returned folio pointer
> +	 *
> +	 * Return: zero if succeeded, negative for errors.
> +	 */
> +	int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
> +			      struct folio **folio);
> +	/**
> +	 * uffd_copy: Handler to resolve UFFDIO_COPY|ZEROPAGE request.
> +	 *
> +	 * @dst_pmd: target pmd to resolve page fault
> +	 * @dst_vma: target vma
> +	 * @dst_addr: target virtual address
> +	 * @src_addr: source address to copy from
> +	 * @flags: userfaultfd request flags
> +	 * @foliop: previously allocated folio
> +	 *
> +	 * Return: zero if succeeded, negative for errors.
> +	 */
> +	int (*uffd_copy)(pmd_t *dst_pmd, struct vm_area_struct *dst_vma,
> +			 unsigned long dst_addr, unsigned long src_addr,
> +			 uffd_flags_t flags, struct folio **foliop);
> +} vm_uffd_ops;
> +#endif

Can't we define vm_uffd_ops in userfaultfd_k.h?

A forward declaration in mm.h should suffice and modules that want to use
uffd can include userfaultfd_k.h.

> +
>  /*
>   * These are the virtual MM functions - opening of an area, closing and
>   * unmapping it (needed to keep files on disk up-to-date etc), pointer
> @@ -653,6 +717,13 @@ struct vm_operations_struct {
>  	 */
>  	struct page *(*find_special_page)(struct vm_area_struct *vma,
>  					  unsigned long addr);
> +#ifdef CONFIG_USERFAULTFD
> +	/*
> +	 * Userfaultfd related ops.  Modules need to define this to support
> +	 * userfaultfd.
> +	 */
> +	const vm_uffd_ops *userfaultfd_ops;
> +#endif
>  };
>  
>  #ifdef CONFIG_NUMA_BALANCING
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index ccad58602846..e79c724b3b95 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -80,18 +80,6 @@ struct userfaultfd_ctx {
>  
>  extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
>  
> -/* A combined operation mode + behavior flags. */
> -typedef unsigned int __bitwise uffd_flags_t;
> -
> -/* Mutually exclusive modes of operation. */
> -enum mfill_atomic_mode {
> -	MFILL_ATOMIC_COPY,
> -	MFILL_ATOMIC_ZEROPAGE,
> -	MFILL_ATOMIC_CONTINUE,
> -	MFILL_ATOMIC_POISON,
> -	NR_MFILL_ATOMIC_MODES,
> -};
> -
>  #define MFILL_ATOMIC_MODE_BITS (const_ilog2(NR_MFILL_ATOMIC_MODES - 1) + 1)
>  #define MFILL_ATOMIC_BIT(nr) BIT(MFILL_ATOMIC_MODE_BITS + (nr))
>  #define MFILL_ATOMIC_FLAG(nr) ((__force uffd_flags_t) MFILL_ATOMIC_BIT(nr))
> -- 
> 2.49.0
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] mm: Apply vm_uffd_ops API to core mm
  2025-06-20 19:03 ` [PATCH 4/4] mm: Apply vm_uffd_ops API to core mm Peter Xu
@ 2025-06-22 19:09   ` kernel test robot
  2025-06-23 18:12     ` Peter Xu
  2025-06-25 20:31   ` James Houghton
  1 sibling, 1 reply; 24+ messages in thread
From: kernel test robot @ 2025-06-22 19:09 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: oe-kbuild-all, Nikita Kalyazin, peterx, Hugh Dickins,
	Oscar Salvador, Michal Hocko, David Hildenbrand, Muchun Song,
	Andrea Arcangeli, Ujwal Kundur, Suren Baghdasaryan, Andrew Morton,
	Linux Memory Management List, Vlastimil Babka, Liam R . Howlett,
	James Houghton, Mike Rapoport, Lorenzo Stoakes, Axel Rasmussen

Hi Peter,

kernel test robot noticed the following build warnings:

[auto build test WARNING on next-20250620]
[cannot apply to akpm-mm/mm-everything linus/master v6.16-rc2 v6.16-rc1 v6.15 v6.16-rc2]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Peter-Xu/mm-Introduce-vm_uffd_ops-API/20250621-030557
base:   next-20250620
patch link:    https://lore.kernel.org/r/20250620190342.1780170-5-peterx%40redhat.com
patch subject: [PATCH 4/4] mm: Apply vm_uffd_ops API to core mm
config: i386-randconfig-061-20250622 (https://download.01.org/0day-ci/archive/20250623/202506230216.JVgQj2Si-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250623/202506230216.JVgQj2Si-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506230216.JVgQj2Si-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   mm/shmem.c: note: in included file (through include/linux/shmem_fs.h):
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
--
   mm/hugetlb.c:669:12: sparse: sparse: context imbalance in 'allocate_file_region_entries' - wrong count at exit
   mm/hugetlb.c:740:13: sparse: sparse: context imbalance in 'region_add' - wrong count at exit
   mm/hugetlb.c:807:13: sparse: sparse: context imbalance in 'region_chg' - wrong count at exit
   mm/hugetlb.c:5798:20: sparse: sparse: context imbalance in 'move_huge_pte' - different lock contexts for basic block
   mm/hugetlb.c: note: in included file:
   include/linux/mm.h:1391:22: sparse: sparse: context imbalance in 'hugetlb_wp' - unexpected unlock
   mm/hugetlb.c: note: in included file (through include/linux/hugetlb.h, include/linux/migrate.h):
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
--
   mm/userfaultfd.c:270:9: sparse: sparse: context imbalance in 'mfill_atomic_install_pte' - different lock contexts for basic block
   mm/userfaultfd.c:412:9: sparse: sparse: context imbalance in 'mfill_atomic_pte_zeropage' - different lock contexts for basic block
   mm/userfaultfd.c:498:9: sparse: sparse: context imbalance in 'mfill_atomic_pte_poison' - different lock contexts for basic block
   mm/userfaultfd.c: note: in included file:
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
>> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
   mm/userfaultfd.c: note: in included file (through include/linux/rbtree.h, include/linux/mm_types.h, include/linux/mmzone.h, ...):
   include/linux/rcupdate.h:871:25: sparse: sparse: context imbalance in 'move_pages_pte' - unexpected unlock

vim +90 include/linux/userfaultfd_k.h

    87	
    88	static inline enum mfill_atomic_mode uffd_flags_get_mode(uffd_flags_t flags)
    89	{
  > 90		return (enum mfill_atomic_mode)(flags & MFILL_ATOMIC_MODE_MASK);
    91	}
    92	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/4] mm: Introduce vm_uffd_ops API
  2025-06-20 19:03 ` [PATCH 1/4] mm: Introduce vm_uffd_ops API Peter Xu
  2025-06-22  7:28   ` Mike Rapoport
@ 2025-06-23  8:25   ` David Hildenbrand
  2025-06-23 13:59     ` Peter Xu
  1 sibling, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2025-06-23  8:25 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Nikita Kalyazin, Hugh Dickins, Oscar Salvador, Michal Hocko,
	Muchun Song, Andrea Arcangeli, Ujwal Kundur, Suren Baghdasaryan,
	Andrew Morton, Vlastimil Babka, Liam R . Howlett, James Houghton,
	Mike Rapoport, Lorenzo Stoakes, Axel Rasmussen

On 20.06.25 21:03, Peter Xu wrote:

Hi Peter,

> Introduce a generic userfaultfd API for vm_operations_struct, so that one
> vma, especially when as a module, can support userfaults without modifying

The sentence is confusing ("vma ... as a module").

Did you mean something like ".. so that a vma that is backed by a 
special-purpose in-memory filesystem like shmem or hugetlb can support 
userfaultfd without modifying the uffd core; this is required when the 
in-memory filesystem is built as a module."

> the core files.  More importantly, when the module can be compiled out of
> the kernel.
> 
> So, instead of having core mm referencing modules that may not ever exist,
> we need to have modules opt-in on core mm hooks instead.
> 
> After this API applied, if a module wants to support userfaultfd, the
> module should only need to touch its own file and properly define
> vm_uffd_ops, instead of changing anything in core mm.

Talking about modules that much is a bit confusing. I think this is more 
about cleanly supporting in-memory filesystems, without the need to 
special-case each and every one of them; can be viewed a cleanup 
independent of the module requirement from guest_memfd.

> 
> Note that such API will not work for anonymous. Core mm will process
> anonymous memory separately for userfault operations like before.
> 
> This patch only introduces the API alone so that we can start to move
> existing users over but without breaking them.
> 
> Currently the uffd_copy() API is almost designed to be the simplistic with
> minimum mm changes to move over to the API.
> 

Is there a way to move part of the actual implementation (how this is 
all wired up) from patch #4 into this patch, to then only remove the old 
shmem/hugetlb hooks (that are effectively unused) in patch #4?

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/4] mm: Introduce vm_uffd_ops API
  2025-06-22  7:28   ` Mike Rapoport
@ 2025-06-23 13:36     ` Peter Xu
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Xu @ 2025-06-23 13:36 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-mm, linux-kernel, Nikita Kalyazin, Hugh Dickins,
	Oscar Salvador, Michal Hocko, David Hildenbrand, Muchun Song,
	Andrea Arcangeli, Ujwal Kundur, Suren Baghdasaryan, Andrew Morton,
	Vlastimil Babka, Liam R . Howlett, James Houghton,
	Lorenzo Stoakes, Axel Rasmussen

On Sun, Jun 22, 2025 at 10:28:04AM +0300, Mike Rapoport wrote:
> > +} vm_uffd_ops;
> > +#endif
> 
> Can't we define vm_uffd_ops in userfaultfd_k.h?
> 
> A forward declaration in mm.h should suffice and modules that want to use
> uffd can include userfaultfd_k.h.

Good point, I'll do that, thanks!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/4] mm: Introduce vm_uffd_ops API
  2025-06-23  8:25   ` David Hildenbrand
@ 2025-06-23 13:59     ` Peter Xu
  2025-06-23 16:50       ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2025-06-23 13:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Nikita Kalyazin, Hugh Dickins,
	Oscar Salvador, Michal Hocko, Muchun Song, Andrea Arcangeli,
	Ujwal Kundur, Suren Baghdasaryan, Andrew Morton, Vlastimil Babka,
	Liam R . Howlett, James Houghton, Mike Rapoport, Lorenzo Stoakes,
	Axel Rasmussen

On Mon, Jun 23, 2025 at 10:25:33AM +0200, David Hildenbrand wrote:
> On 20.06.25 21:03, Peter Xu wrote:
> 
> Hi Peter,

Hey David,

> 
> > Introduce a generic userfaultfd API for vm_operations_struct, so that one
> > vma, especially when as a module, can support userfaults without modifying
> 
> The sentence is confusing ("vma ... as a module").
> 
> Did you mean something like ".. so that a vma that is backed by a
> special-purpose in-memory filesystem like shmem or hugetlb can support
> userfaultfd without modifying the uffd core; this is required when the
> in-memory filesystem is built as a module."

I wanted to avoid mentioning of "in-memory file systems" here.

How about an updated commit like this?

  Currently, most of the userfaultfd features are implemented directly in the
  core mm.  It will invoke VMA specific functions whenever necessary.  So far
  it is fine because it almost only interacts with shmem and hugetlbfs.

  This patch introduces a generic userfaultfd API for vm_operations_struct,
  so that any type of file (including kernel modules that can be compiled
  separately from the kernel core) can support userfaults without modifying
  the core files.

  After this API applied, if a module wants to support userfaultfd, the
  module should only need to touch its own file and properly define
  vm_uffd_ops, instead of changing anything in core mm.

  ...

> 
> > the core files.  More importantly, when the module can be compiled out of
> > the kernel.
> > 
> > So, instead of having core mm referencing modules that may not ever exist,
> > we need to have modules opt-in on core mm hooks instead.
> > 
> > After this API applied, if a module wants to support userfaultfd, the
> > module should only need to touch its own file and properly define
> > vm_uffd_ops, instead of changing anything in core mm.
> 
> Talking about modules that much is a bit confusing. I think this is more
> about cleanly supporting in-memory filesystems, without the need to
> special-case each and every one of them; can be viewed a cleanup independent
> of the module requirement from guest_memfd.

Yes.  But if we don't need to support kernel modules actually we don't need
this.. IMHO it's so far really about cleanly support kernel modules, which
can even be out-of-tree (though that's not my purpose of the change..).

Please help check if above updated commit message would be better.

> 
> > 
> > Note that such API will not work for anonymous. Core mm will process
> > anonymous memory separately for userfault operations like before.
> > 
> > This patch only introduces the API alone so that we can start to move
> > existing users over but without breaking them.
> > 
> > Currently the uffd_copy() API is almost designed to be the simplistic with
> > minimum mm changes to move over to the API.
> > 
> 
> Is there a way to move part of the actual implementation (how this is all
> wired up) from patch #4 into this patch, to then only remove the old
> shmem/hugetlb hooks (that are effectively unused) in patch #4?

Not much I really removed on the hooks, but I was trying to reuse almost
existing functions.  Here hugetlb is almost untouched on hooks, then I
reused the shmem existing function for uffd_copy() rather than removing it
(I did need to remove the definition in the shmem header though becuse it's
not needed to be exported).

The major thing got removed in patch 4 was some random checks over uffd ops
and vma flags.  I intentionally made them all in patch 4 to make review
possible.  Otherwise it can be slightly awkward to reason what got removed
without knowing what is protecting those checks.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/4] mm: Introduce vm_uffd_ops API
  2025-06-23 13:59     ` Peter Xu
@ 2025-06-23 16:50       ` David Hildenbrand
  2025-06-23 17:20         ` Peter Xu
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2025-06-23 16:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Nikita Kalyazin, Hugh Dickins,
	Oscar Salvador, Michal Hocko, Muchun Song, Andrea Arcangeli,
	Ujwal Kundur, Suren Baghdasaryan, Andrew Morton, Vlastimil Babka,
	Liam R . Howlett, James Houghton, Mike Rapoport, Lorenzo Stoakes,
	Axel Rasmussen

On 23.06.25 15:59, Peter Xu wrote:
> On Mon, Jun 23, 2025 at 10:25:33AM +0200, David Hildenbrand wrote:
>> On 20.06.25 21:03, Peter Xu wrote:
>>
>> Hi Peter,
> 
> Hey David,
> 
>>
>>> Introduce a generic userfaultfd API for vm_operations_struct, so that one
>>> vma, especially when as a module, can support userfaults without modifying
>>
>> The sentence is confusing ("vma ... as a module").
>>
>> Did you mean something like ".. so that a vma that is backed by a
>> special-purpose in-memory filesystem like shmem or hugetlb can support
>> userfaultfd without modifying the uffd core; this is required when the
>> in-memory filesystem is built as a module."
> 
> I wanted to avoid mentioning of "in-memory file systems" here.

I thought one of the challenges of supporting guest_memfd on anything 
that is not a special in-memory file system is also related to how the 
pagecache handles readahead.

So ...

> 
> How about an updated commit like this?
> 
>    Currently, most of the userfaultfd features are implemented directly in the
>    core mm.  It will invoke VMA specific functions whenever necessary.  So far
>    it is fine because it almost only interacts with shmem and hugetlbfs.
> 
>    This patch introduces a generic userfaultfd API for vm_operations_struct,
>    so that any type of file (including kernel modules that can be compiled
>    separately from the kernel core) can support userfaults without modifying
>    the core files.

.... is it really "any file" ? I doubt it, but you likely have a better 
idea on how it all could just work with "any file".

> 
>    After this API applied, if a module wants to support userfaultfd, the
>    module should only need to touch its own file and properly define
>    vm_uffd_ops, instead of changing anything in core mm.
> 
>    ...

Talking about files and modules is still confusing I'm afraid. It's 
really a special-purpose file (really, not any ordinary files on 
ordinary filesystems), no?

> 
>>
>>> the core files.  More importantly, when the module can be compiled out of
>>> the kernel.
>>>
>>> So, instead of having core mm referencing modules that may not ever exist,
>>> we need to have modules opt-in on core mm hooks instead.
>>>
>>> After this API applied, if a module wants to support userfaultfd, the
>>> module should only need to touch its own file and properly define
>>> vm_uffd_ops, instead of changing anything in core mm.
>>
>> Talking about modules that much is a bit confusing. I think this is more
>> about cleanly supporting in-memory filesystems, without the need to
>> special-case each and every one of them; can be viewed a cleanup independent
>> of the module requirement from guest_memfd.
> 
> Yes.  But if we don't need to support kernel modules actually we don't need
> this.. IMHO it's so far really about cleanly support kernel modules, which
> can even be out-of-tree (though that's not my purpose of the change..).
> 
> Please help check if above updated commit message would be better.

I agree that another special-purpose file (like implemented by 
guest_memfd) would need that. But if we could get rid of 
"hugetlb"/"shmem" special-casing in userfaultfd, it would be a rasonable 
independent cleanup.

But I can spot in patch #3 now:

"Hugetlbfs still has its own hard-coded handler in userfaultfd, due to 
limitations similar to vm_operations_struct.fault(). TODO: generalize it 
to use the API function."

I would have hoped that we clean that up in one go instead.

> 
>>
>>>
>>> Note that such API will not work for anonymous. Core mm will process
>>> anonymous memory separately for userfault operations like before.
>>>
>>> This patch only introduces the API alone so that we can start to move
>>> existing users over but without breaking them.
>>>
>>> Currently the uffd_copy() API is almost designed to be the simplistic with
>>> minimum mm changes to move over to the API.
>>>
>>
>> Is there a way to move part of the actual implementation (how this is all
>> wired up) from patch #4 into this patch, to then only remove the old
>> shmem/hugetlb hooks (that are effectively unused) in patch #4?
> 
> Not much I really removed on the hooks, but I was trying to reuse almost
> existing functions.  Here hugetlb is almost untouched on hooks, then I
> reused the shmem existing function for uffd_copy() rather than removing it
> (I did need to remove the definition in the shmem header though becuse it's
> not needed to be exported).
> 
> The major thing got removed in patch 4 was some random checks over uffd ops
> and vma flags.  I intentionally made them all in patch 4 to make review
> possible.  Otherwise it can be slightly awkward to reason what got removed
> without knowing what is protecting those checks.

Agreed. It's a shame the new API is not a proper replacement for hugetlb 
special casing just yet ...

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/4] mm: Introduce vm_uffd_ops API
  2025-06-23 16:50       ` David Hildenbrand
@ 2025-06-23 17:20         ` Peter Xu
  2025-06-23 17:25           ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2025-06-23 17:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Nikita Kalyazin, Hugh Dickins,
	Oscar Salvador, Michal Hocko, Muchun Song, Andrea Arcangeli,
	Ujwal Kundur, Suren Baghdasaryan, Andrew Morton, Vlastimil Babka,
	Liam R . Howlett, James Houghton, Mike Rapoport, Lorenzo Stoakes,
	Axel Rasmussen

On Mon, Jun 23, 2025 at 06:50:42PM +0200, David Hildenbrand wrote:
> On 23.06.25 15:59, Peter Xu wrote:
> > On Mon, Jun 23, 2025 at 10:25:33AM +0200, David Hildenbrand wrote:
> > > On 20.06.25 21:03, Peter Xu wrote:
> > > 
> > > Hi Peter,
> > 
> > Hey David,
> > 
> > > 
> > > > Introduce a generic userfaultfd API for vm_operations_struct, so that one
> > > > vma, especially when as a module, can support userfaults without modifying
> > > 
> > > The sentence is confusing ("vma ... as a module").
> > > 
> > > Did you mean something like ".. so that a vma that is backed by a
> > > special-purpose in-memory filesystem like shmem or hugetlb can support
> > > userfaultfd without modifying the uffd core; this is required when the
> > > in-memory filesystem is built as a module."
> > 
> > I wanted to avoid mentioning of "in-memory file systems" here.
> 
> I thought one of the challenges of supporting guest_memfd on anything that
> is not a special in-memory file system is also related to how the pagecache
> handles readahead.
> 
> So ...

See uffd_disable_fault_around(). We should make sure no such happens into
pgtables when some special type of file is suppoorted, if it ever happens,
besides shmem.  IIUC readahead on page caches are fine for non-MISSING
traps.  So a file can support MINOR, for example, but then it'll also need
to make sure all those aspected are well considered.

> 
> > 
> > How about an updated commit like this?
> > 
> >    Currently, most of the userfaultfd features are implemented directly in the
> >    core mm.  It will invoke VMA specific functions whenever necessary.  So far
> >    it is fine because it almost only interacts with shmem and hugetlbfs.
> > 
> >    This patch introduces a generic userfaultfd API for vm_operations_struct,
> >    so that any type of file (including kernel modules that can be compiled
> >    separately from the kernel core) can support userfaults without modifying
> >    the core files.
> 
> .... is it really "any file" ? I doubt it, but you likely have a better idea
> on how it all could just work with "any file".
> 
> > 
> >    After this API applied, if a module wants to support userfaultfd, the
> >    module should only need to touch its own file and properly define
> >    vm_uffd_ops, instead of changing anything in core mm.
> > 
> >    ...
> 
> Talking about files and modules is still confusing I'm afraid. It's really a
> special-purpose file (really, not any ordinary files on ordinary
> filesystems), no?

One major reason I wanted to avoid the term "in-memory" is that we already
support most of the files on WP_ASYNC, so emphasizing on in-memory might be
misleading, even though WP_ASYNC isn't much taken into the picture of the
vm_uffd_ops being proposed.

The other thing is, besides the original form of userfaultfd (which is the
MISSING traps), almost all the rest (sync-wp, continue, poison, maybe even
MOVE but that's still more special) should be at least logically doable on
most of the files like WP_ASYNC.  When proposing this API, I wanted to make
it as generic as possible when people reading about it.  Hope that makes
sense.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/4] mm: Introduce vm_uffd_ops API
  2025-06-23 17:20         ` Peter Xu
@ 2025-06-23 17:25           ` David Hildenbrand
  2025-06-23 17:56             ` Peter Xu
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2025-06-23 17:25 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Nikita Kalyazin, Hugh Dickins,
	Oscar Salvador, Michal Hocko, Muchun Song, Andrea Arcangeli,
	Ujwal Kundur, Suren Baghdasaryan, Andrew Morton, Vlastimil Babka,
	Liam R . Howlett, James Houghton, Mike Rapoport, Lorenzo Stoakes,
	Axel Rasmussen

>>>
>>> How about an updated commit like this?
>>>
>>>     Currently, most of the userfaultfd features are implemented directly in the
>>>     core mm.  It will invoke VMA specific functions whenever necessary.  So far
>>>     it is fine because it almost only interacts with shmem and hugetlbfs.
>>>
>>>     This patch introduces a generic userfaultfd API for vm_operations_struct,
>>>     so that any type of file (including kernel modules that can be compiled
>>>     separately from the kernel core) can support userfaults without modifying
>>>     the core files.
>>
>> .... is it really "any file" ? I doubt it, but you likely have a better idea
>> on how it all could just work with "any file".
>>
>>>
>>>     After this API applied, if a module wants to support userfaultfd, the
>>>     module should only need to touch its own file and properly define
>>>     vm_uffd_ops, instead of changing anything in core mm.
>>>
>>>     ...
>>
>> Talking about files and modules is still confusing I'm afraid. It's really a
>> special-purpose file (really, not any ordinary files on ordinary
>> filesystems), no?
> 
> One major reason I wanted to avoid the term "in-memory" is that we already
> support most of the files on WP_ASYNC, so emphasizing on in-memory might be
> misleading, even though WP_ASYNC isn't much taken into the picture of the
> vm_uffd_ops being proposed.

Oh, yes, agreed on WP_ASYNC. But they would not be using the vma_ops 
thingy, right?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/4] mm: Introduce vm_uffd_ops API
  2025-06-23 17:25           ` David Hildenbrand
@ 2025-06-23 17:56             ` Peter Xu
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Xu @ 2025-06-23 17:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, linux-kernel, Nikita Kalyazin, Hugh Dickins,
	Oscar Salvador, Michal Hocko, Muchun Song, Andrea Arcangeli,
	Ujwal Kundur, Suren Baghdasaryan, Andrew Morton, Vlastimil Babka,
	Liam R . Howlett, James Houghton, Mike Rapoport, Lorenzo Stoakes,
	Axel Rasmussen

On Mon, Jun 23, 2025 at 07:25:11PM +0200, David Hildenbrand wrote:
> Oh, yes, agreed on WP_ASYNC. But they would not be using the vma_ops thingy,
> right?

Yes, currently WP_ASYNC bypassed that. So if something declares WP in the
uffd_features in the new API it'll be only for sync (and it really should
almost start working for most files too like what async did..).

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] mm: Apply vm_uffd_ops API to core mm
  2025-06-22 19:09   ` kernel test robot
@ 2025-06-23 18:12     ` Peter Xu
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Xu @ 2025-06-23 18:12 UTC (permalink / raw)
  To: kernel test robot
  Cc: linux-mm, linux-kernel, oe-kbuild-all, Nikita Kalyazin,
	Hugh Dickins, Oscar Salvador, Michal Hocko, David Hildenbrand,
	Muchun Song, Andrea Arcangeli, Ujwal Kundur, Suren Baghdasaryan,
	Andrew Morton, Vlastimil Babka, Liam R . Howlett, James Houghton,
	Mike Rapoport, Lorenzo Stoakes, Axel Rasmussen

On Mon, Jun 23, 2025 at 03:09:13AM +0800, kernel test robot wrote:
> Hi Peter,
> 
> kernel test robot noticed the following build warnings:
> 
> [auto build test WARNING on next-20250620]
> [cannot apply to akpm-mm/mm-everything linus/master v6.16-rc2 v6.16-rc1 v6.15 v6.16-rc2]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Peter-Xu/mm-Introduce-vm_uffd_ops-API/20250621-030557
> base:   next-20250620
> patch link:    https://lore.kernel.org/r/20250620190342.1780170-5-peterx%40redhat.com
> patch subject: [PATCH 4/4] mm: Apply vm_uffd_ops API to core mm
> config: i386-randconfig-061-20250622 (https://download.01.org/0day-ci/archive/20250623/202506230216.JVgQj2Si-lkp@intel.com/config)
> compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250623/202506230216.JVgQj2Si-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202506230216.JVgQj2Si-lkp@intel.com/
> 
> sparse warnings: (new ones prefixed by >>)
>    mm/shmem.c: note: in included file (through include/linux/shmem_fs.h):
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
> --
>    mm/hugetlb.c:669:12: sparse: sparse: context imbalance in 'allocate_file_region_entries' - wrong count at exit
>    mm/hugetlb.c:740:13: sparse: sparse: context imbalance in 'region_add' - wrong count at exit
>    mm/hugetlb.c:807:13: sparse: sparse: context imbalance in 'region_chg' - wrong count at exit
>    mm/hugetlb.c:5798:20: sparse: sparse: context imbalance in 'move_huge_pte' - different lock contexts for basic block
>    mm/hugetlb.c: note: in included file:
>    include/linux/mm.h:1391:22: sparse: sparse: context imbalance in 'hugetlb_wp' - unexpected unlock
>    mm/hugetlb.c: note: in included file (through include/linux/hugetlb.h, include/linux/migrate.h):
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
> --
>    mm/userfaultfd.c:270:9: sparse: sparse: context imbalance in 'mfill_atomic_install_pte' - different lock contexts for basic block
>    mm/userfaultfd.c:412:9: sparse: sparse: context imbalance in 'mfill_atomic_pte_zeropage' - different lock contexts for basic block
>    mm/userfaultfd.c:498:9: sparse: sparse: context imbalance in 'mfill_atomic_pte_poison' - different lock contexts for basic block
>    mm/userfaultfd.c: note: in included file:
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
> >> include/linux/userfaultfd_k.h:90:17: sparse: sparse: cast from restricted uffd_flags_t
>    mm/userfaultfd.c: note: in included file (through include/linux/rbtree.h, include/linux/mm_types.h, include/linux/mmzone.h, ...):
>    include/linux/rcupdate.h:871:25: sparse: sparse: context imbalance in 'move_pages_pte' - unexpected unlock
> 
> vim +90 include/linux/userfaultfd_k.h
> 
>     87	
>     88	static inline enum mfill_atomic_mode uffd_flags_get_mode(uffd_flags_t flags)
>     89	{
>   > 90		return (enum mfill_atomic_mode)(flags & MFILL_ATOMIC_MODE_MASK);
>     91	}
>     92	

Sigh.. I thought the cast would helped to tell sparse on this.  Looks like
I need __force..

I'll squash below change when repost:

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 6064f9542d5b..6c5ca68204dd 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -87,7 +87,7 @@ extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);

 static inline enum mfill_atomic_mode uffd_flags_get_mode(uffd_flags_t flags)
 {
-       return (enum mfill_atomic_mode)(flags & MFILL_ATOMIC_MODE_MASK);
+       return (__force enum mfill_atomic_mode)(flags & MFILL_ATOMIC_MODE_MASK);
 }

 static inline bool uffd_flags_mode_is(uffd_flags_t flags, enum mfill_atomic_mode expected)

-- 
Peter Xu


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/4] mm/userfaultfd: modulize memory types
  2025-06-20 19:03 [PATCH 0/4] mm/userfaultfd: modulize memory types Peter Xu
                   ` (3 preceding siblings ...)
  2025-06-20 19:03 ` [PATCH 4/4] mm: Apply vm_uffd_ops API to core mm Peter Xu
@ 2025-06-25 16:56 ` Nikita Kalyazin
  2025-06-25 20:17   ` Peter Xu
  4 siblings, 1 reply; 24+ messages in thread
From: Nikita Kalyazin @ 2025-06-25 16:56 UTC (permalink / raw)
  To: Peter Xu, linux-mm, linux-kernel
  Cc: Hugh Dickins, Oscar Salvador, Michal Hocko, David Hildenbrand,
	Muchun Song, Andrea Arcangeli, Ujwal Kundur, Suren Baghdasaryan,
	Andrew Morton, Vlastimil Babka, Liam R . Howlett, James Houghton,
	Mike Rapoport, Lorenzo Stoakes, Axel Rasmussen



On 20/06/2025 20:03, Peter Xu wrote:
> [based on akpm/mm-new]
> 
> This series is an alternative proposal of what Nikita proposed here on the
> initial three patches:
> 
>    https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com
> 
> This is not yet relevant to any guest-memfd support, but paving way for it.

Hi Peter,

Thanks for posting this.  I confirmed that minor fault handling was 
working for guest_memfd based on this series and looked simple (a draft 
based on mmap support in guest_memfd v7 [1]):

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 5abb6d52a375..6ddc73419724 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -5,6 +5,9 @@
  #include <linux/pagemap.h>
  #include <linux/anon_inodes.h>
  #include <linux/set_memory.h>
+#ifdef CONFIG_USERFAULTFD
+#include <linux/userfaultfd_k.h>
+#endif

  #include "kvm_mm.h"

@@ -396,6 +399,14 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
  		kvm_gmem_mark_prepared(folio);
  	}

+#ifdef CONFIG_USERFAULTFD
+	if (userfaultfd_minor(vmf->vma)) {
+		folio_unlock(folio);
+		filemap_invalidate_unlock_shared(inode->i_mapping);
+		return handle_userfault(vmf, VM_UFFD_MINOR);
+	}
+#endif
+
  	vmf->page = folio_file_page(folio, vmf->pgoff);

  out_folio:
@@ -410,8 +421,39 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
  	return ret;
  }

+#ifdef CONFIG_USERFAULTFD
+static int kvm_gmem_uffd_get_folio(struct inode *inode, pgoff_t pgoff,
+				struct folio **foliop)
+{
+	struct folio *folio;
+	folio = kvm_gmem_get_folio(inode, pgoff);
+
+	if (IS_ERR(folio)) {
+		*foliop = NULL;
+		return PTR_ERR(folio);
+	}
+
+	if (!folio_test_uptodate(folio)) {
+		clear_highpage(folio_page(folio, 0));
+		kvm_gmem_mark_prepared(folio);
+	}
+
+	*foliop = folio;
+	return 0;
+}
+
+static const vm_uffd_ops kvm_gmem_uffd_ops = {
+	.uffd_features	= 	VM_UFFD_MINOR,
+	.uffd_ioctls	= 	BIT(_UFFDIO_CONTINUE),
+	.uffd_get_folio	=	kvm_gmem_uffd_get_folio,
+};
+#endif
+
  static const struct vm_operations_struct kvm_gmem_vm_ops = {
  	.fault = kvm_gmem_fault,
+#ifdef CONFIG_USERFAULTFD
+	.userfaultfd_ops = &kvm_gmem_uffd_ops,
+#endif
  };

  static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)

[1]: https://lore.kernel.org/kvm/20250318161823.4005529-1-tabba@google.com/

> Here, the major goal is to make kernel modules be able to opt-in with any
> form of userfaultfd supports, like guest-memfd.  This alternative option
> should hopefully be cleaner, and avoid leaking userfault details into
> vm_ops.fault().
> 
> It also means this series does not depend on anything.  It's a pure
> refactoring of userfaultfd internals to provide a generic API, so that
> other types of files, especially RAM based, can support userfaultfd without
> touching mm/ at all.
> 
> To achieve it, this series introduced a file operation called vm_uffd_ops.
> The ops needs to be provided when a file type supports any of userfaultfd.
> 
> With that, I moved both hugetlbfs and shmem over.
> 
> Hugetlbfs is still very special that it will only use partial of the
> vm_uffd_ops API, due to similar reason why hugetlb_vm_op_fault() has a
> BUG() and so far hard-coded into core mm.  But this should still be better,
> because at least hugetlbfs is still always involved in feature probing
> (e.g. where it used to not support ZEROPAGE and we have a hard-coded line
> to fail that, and some more).  Meanwhile after this series, shmem will be
> completely converted to the new vm_uffd_ops API; the final vm_uffd_ops for
> shmem looks like this:
> 
> static const vm_uffd_ops shmem_uffd_ops = {
>          .uffd_features  =       __VM_UFFD_FLAGS,
>          .uffd_ioctls    =       BIT(_UFFDIO_COPY) |
>                                  BIT(_UFFDIO_ZEROPAGE) |
>                                  BIT(_UFFDIO_WRITEPROTECT) |
>                                  BIT(_UFFDIO_CONTINUE) |
>                                  BIT(_UFFDIO_POISON),
>          .uffd_get_folio =       shmem_uffd_get_folio,
>          .uffd_copy      =       shmem_mfill_atomic_pte,
> };
> 
> As I mentioned in one of my reply to Nikita, I don't like the current
> interface of uffd_copy(), but this will be the minimum change version of
> such API to support complete extrenal-module-ready userfaultfd.  Here, very
> minimal change will be needed from shmem side to support that.
> 
> Meanwhile, the vm_uffd_ops is also not the only place one will need to
> provide to support userfaultfd.  Normally vm_ops.fault() will also need to
> be updated, but that's a generic function and it'll play together with the
> new vm_uffd_ops to make everything fly.
> 
> No functional change expected at all after the whole series applied.  There
> might be some slightly stricter check on uffd ops here and there in the
> last patch, but that really shouldn't stand out anywhere to anyone.
> 
> For testing: besides the cross-compilation tests, I did also try with
> uffd-stress in a VM to measure any perf difference before/after the change;
> The static call becomes a pointer now.  I really cannot measure anything
> different, which is more or less expected.
> 
> Comments welcomed, thanks.
> 
> Peter Xu (4):
>    mm: Introduce vm_uffd_ops API
>    mm/shmem: Support vm_uffd_ops API
>    mm/hugetlb: Support vm_uffd_ops API
>    mm: Apply vm_uffd_ops API to core mm
> 
>   include/linux/mm.h            |  71 +++++++++++++++++++++
>   include/linux/shmem_fs.h      |  14 -----
>   include/linux/userfaultfd_k.h |  58 ++++-------------
>   mm/hugetlb.c                  |  19 ++++++
>   mm/shmem.c                    |  28 ++++++++-
>   mm/userfaultfd.c              | 115 +++++++++++++++++++++++++---------
>   6 files changed, 217 insertions(+), 88 deletions(-)
> 
> --
> 2.49.0
> 


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/4] mm/userfaultfd: modulize memory types
  2025-06-25 16:56 ` [PATCH 0/4] mm/userfaultfd: modulize memory types Nikita Kalyazin
@ 2025-06-25 20:17   ` Peter Xu
  2025-06-26 16:09     ` Nikita Kalyazin
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2025-06-25 20:17 UTC (permalink / raw)
  To: Nikita Kalyazin
  Cc: linux-mm, linux-kernel, Hugh Dickins, Oscar Salvador,
	Michal Hocko, David Hildenbrand, Muchun Song, Andrea Arcangeli,
	Ujwal Kundur, Suren Baghdasaryan, Andrew Morton, Vlastimil Babka,
	Liam R . Howlett, James Houghton, Mike Rapoport, Lorenzo Stoakes,
	Axel Rasmussen

On Wed, Jun 25, 2025 at 05:56:23PM +0100, Nikita Kalyazin wrote:
> 
> 
> On 20/06/2025 20:03, Peter Xu wrote:
> > [based on akpm/mm-new]
> > 
> > This series is an alternative proposal of what Nikita proposed here on the
> > initial three patches:
> > 
> >    https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com
> > 
> > This is not yet relevant to any guest-memfd support, but paving way for it.
> 
> Hi Peter,

Hi, Nikita,

> 
> Thanks for posting this.  I confirmed that minor fault handling was working
> for guest_memfd based on this series and looked simple (a draft based on
> mmap support in guest_memfd v7 [1]):

Thanks for the quick spin, glad to know it works. Some trivial things to
mention below..

> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 5abb6d52a375..6ddc73419724 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -5,6 +5,9 @@
>  #include <linux/pagemap.h>
>  #include <linux/anon_inodes.h>
>  #include <linux/set_memory.h>
> +#ifdef CONFIG_USERFAULTFD

This ifdef not needed, userfaultfd_k.h has taken care of all cases.

> +#include <linux/userfaultfd_k.h>
> +#endif
> 
>  #include "kvm_mm.h"
> 
> @@ -396,6 +399,14 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>  		kvm_gmem_mark_prepared(folio);
>  	}
> 
> +#ifdef CONFIG_USERFAULTFD

Same here.  userfaultfd_minor() is always defined.

I'll wait for a few more days for reviewers, and likely send v2 before next
week.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] mm: Apply vm_uffd_ops API to core mm
  2025-06-20 19:03 ` [PATCH 4/4] mm: Apply vm_uffd_ops API to core mm Peter Xu
  2025-06-22 19:09   ` kernel test robot
@ 2025-06-25 20:31   ` James Houghton
  2025-06-25 21:21     ` Peter Xu
  1 sibling, 1 reply; 24+ messages in thread
From: James Houghton @ 2025-06-25 20:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Nikita Kalyazin, Hugh Dickins,
	Oscar Salvador, Michal Hocko, David Hildenbrand, Muchun Song,
	Andrea Arcangeli, Ujwal Kundur, Suren Baghdasaryan, Andrew Morton,
	Vlastimil Babka, Liam R . Howlett, Mike Rapoport, Lorenzo Stoakes,
	Axel Rasmussen

On Fri, Jun 20, 2025 at 12:04 PM Peter Xu <peterx@redhat.com> wrote:
>
> This patch completely moves the old userfaultfd core to use the new
> vm_uffd_ops API.  After this change, existing file systems will start to
> use the new API for userfault operations.
>
> When at it, moving vma_can_userfault() into mm/userfaultfd.c instead,
> because it's getting too big.  It's only used in slow paths so it shouldn't
> be an issue.
>
> This will also remove quite some hard-coded checks for either shmem or
> hugetlbfs.  Now all the old checks should still work but with vm_uffd_ops.
>
> Note that anonymous memory will still need to be processed separately
> because it doesn't have vm_ops at all.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/shmem_fs.h      |  14 -----
>  include/linux/userfaultfd_k.h |  46 ++++----------
>  mm/shmem.c                    |   2 +-
>  mm/userfaultfd.c              | 115 +++++++++++++++++++++++++---------
>  4 files changed, 101 insertions(+), 76 deletions(-)
>
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 6d0f9c599ff7..2f5b7b295cf6 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -195,20 +195,6 @@ static inline pgoff_t shmem_fallocend(struct inode *inode, pgoff_t eof)
>  extern bool shmem_charge(struct inode *inode, long pages);
>  extern void shmem_uncharge(struct inode *inode, long pages);
>
> -#ifdef CONFIG_USERFAULTFD
> -#ifdef CONFIG_SHMEM
> -extern int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
> -                                 struct vm_area_struct *dst_vma,
> -                                 unsigned long dst_addr,
> -                                 unsigned long src_addr,
> -                                 uffd_flags_t flags,
> -                                 struct folio **foliop);
> -#else /* !CONFIG_SHMEM */
> -#define shmem_mfill_atomic_pte(dst_pmd, dst_vma, dst_addr, \
> -                              src_addr, flags, foliop) ({ BUG(); 0; })
> -#endif /* CONFIG_SHMEM */
> -#endif /* CONFIG_USERFAULTFD */
> -
>  /*
>   * Used space is stored as unsigned 64-bit value in bytes but
>   * quota core supports only signed 64-bit values so use that
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index e79c724b3b95..4e56ad423a4a 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -85,9 +85,14 @@ extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
>  #define MFILL_ATOMIC_FLAG(nr) ((__force uffd_flags_t) MFILL_ATOMIC_BIT(nr))
>  #define MFILL_ATOMIC_MODE_MASK ((__force uffd_flags_t) (MFILL_ATOMIC_BIT(0) - 1))
>
> +static inline enum mfill_atomic_mode uffd_flags_get_mode(uffd_flags_t flags)
> +{
> +       return (enum mfill_atomic_mode)(flags & MFILL_ATOMIC_MODE_MASK);
> +}
> +
>  static inline bool uffd_flags_mode_is(uffd_flags_t flags, enum mfill_atomic_mode expected)
>  {
> -       return (flags & MFILL_ATOMIC_MODE_MASK) == ((__force uffd_flags_t) expected);
> +       return uffd_flags_get_mode(flags) == expected;
>  }
>
>  static inline uffd_flags_t uffd_flags_set_mode(uffd_flags_t flags, enum mfill_atomic_mode mode)
> @@ -196,41 +201,16 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
>         return vma->vm_flags & __VM_UFFD_FLAGS;
>  }
>
> -static inline bool vma_can_userfault(struct vm_area_struct *vma,
> -                                    unsigned long vm_flags,
> -                                    bool wp_async)
> +static inline const vm_uffd_ops *vma_get_uffd_ops(struct vm_area_struct *vma)
>  {
> -       vm_flags &= __VM_UFFD_FLAGS;
> -
> -       if (vma->vm_flags & VM_DROPPABLE)
> -               return false;
> -
> -       if ((vm_flags & VM_UFFD_MINOR) &&
> -           (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
> -               return false;
> -
> -       /*
> -        * If wp async enabled, and WP is the only mode enabled, allow any
> -        * memory type.
> -        */
> -       if (wp_async && (vm_flags == VM_UFFD_WP))
> -               return true;
> -
> -#ifndef CONFIG_PTE_MARKER_UFFD_WP
> -       /*
> -        * If user requested uffd-wp but not enabled pte markers for
> -        * uffd-wp, then shmem & hugetlbfs are not supported but only
> -        * anonymous.
> -        */
> -       if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
> -               return false;
> -#endif

Hi Peter,

Thanks for this cleanup!

It looks like the above two checks, the wp-async one and the PTE
marker check, have been reordered in this patch. Does this result in a
functional difference?

The rest of this series looks fine to me. :)

> -
> -       /* By default, allow any of anon|shmem|hugetlb */
> -       return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
> -           vma_is_shmem(vma);
> +       if (vma->vm_ops && vma->vm_ops->userfaultfd_ops)
> +               return vma->vm_ops->userfaultfd_ops;
> +       return NULL;
>  }
>
> +bool vma_can_userfault(struct vm_area_struct *vma,
> +                      unsigned long vm_flags, bool wp_async);
> +
>  static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
>  {
>         struct userfaultfd_ctx *uffd_ctx = vma->vm_userfaultfd_ctx.ctx;
> diff --git a/mm/shmem.c b/mm/shmem.c
> index bd0a29000318..4d71fc7be358 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -3158,7 +3158,7 @@ static int shmem_uffd_get_folio(struct inode *inode, pgoff_t pgoff,
>         return shmem_get_folio(inode, pgoff, 0, folio, SGP_NOALLOC);
>  }
>
> -int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
> +static int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
>                            struct vm_area_struct *dst_vma,
>                            unsigned long dst_addr,
>                            unsigned long src_addr,
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 879505c6996f..61783ff2d335 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -14,12 +14,48 @@
>  #include <linux/userfaultfd_k.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/hugetlb.h>
> -#include <linux/shmem_fs.h>
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
>  #include "internal.h"
>  #include "swap.h"
>
> +bool vma_can_userfault(struct vm_area_struct *vma,
> +                      unsigned long vm_flags, bool wp_async)
> +{
> +       unsigned long supported;
> +
> +       if (vma->vm_flags & VM_DROPPABLE)
> +               return false;
> +
> +       vm_flags &= __VM_UFFD_FLAGS;
> +
> +#ifndef CONFIG_PTE_MARKER_UFFD_WP
> +       /*
> +        * If user requested uffd-wp but not enabled pte markers for
> +        * uffd-wp, then any file system (like shmem or hugetlbfs) are not
> +        * supported but only anonymous.
> +        */
> +       if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
> +               return false;
> +#endif
> +       /*
> +        * If wp async enabled, and WP is the only mode enabled, allow any
> +        * memory type.
> +        */
> +       if (wp_async && (vm_flags == VM_UFFD_WP))
> +               return true;
> +
> +       if (vma_is_anonymous(vma))
> +               /* Anonymous has no page cache, MINOR not supported */
> +               supported = VM_UFFD_MISSING | VM_UFFD_WP;
> +       else if (vma_get_uffd_ops(vma))
> +               supported = vma_get_uffd_ops(vma)->uffd_features;
> +       else
> +               return false;
> +
> +       return !(vm_flags & (~supported));
> +}
> +
>  static __always_inline
>  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
>  {
> @@ -384,11 +420,15 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
>  {
>         struct inode *inode = file_inode(dst_vma->vm_file);
>         pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
> +       const vm_uffd_ops *uffd_ops = vma_get_uffd_ops(dst_vma);
>         struct folio *folio;
>         struct page *page;
>         int ret;
>
> -       ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
> +       if (WARN_ON_ONCE(!uffd_ops || !uffd_ops->uffd_get_folio))
> +               return -EINVAL;
> +
> +       ret = uffd_ops->uffd_get_folio(inode, pgoff, &folio);
>         /* Our caller expects us to return -EFAULT if we failed to find folio */
>         if (ret == -ENOENT)
>                 ret = -EFAULT;
> @@ -504,18 +544,6 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
>         u32 hash;
>         struct address_space *mapping;
>
> -       /*
> -        * There is no default zero huge page for all huge page sizes as
> -        * supported by hugetlb.  A PMD_SIZE huge pages may exist as used
> -        * by THP.  Since we can not reliably insert a zero page, this
> -        * feature is not supported.
> -        */
> -       if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) {
> -               up_read(&ctx->map_changing_lock);
> -               uffd_mfill_unlock(dst_vma);
> -               return -EINVAL;
> -       }
> -
>         src_addr = src_start;
>         dst_addr = dst_start;
>         copied = 0;
> @@ -686,14 +714,55 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
>                         err = mfill_atomic_pte_zeropage(dst_pmd,
>                                                  dst_vma, dst_addr);
>         } else {
> -               err = shmem_mfill_atomic_pte(dst_pmd, dst_vma,
> -                                            dst_addr, src_addr,
> -                                            flags, foliop);
> +               const vm_uffd_ops *uffd_ops = vma_get_uffd_ops(dst_vma);
> +
> +               if (WARN_ON_ONCE(!uffd_ops || !uffd_ops->uffd_copy)) {
> +                       err = -EINVAL;
> +               } else {
> +                       err = uffd_ops->uffd_copy(dst_pmd, dst_vma,
> +                                                 dst_addr, src_addr,
> +                                                 flags, foliop);
> +               }
>         }
>
>         return err;
>  }
>
> +static inline bool
> +vma_uffd_ops_supported(struct vm_area_struct *vma, uffd_flags_t flags)
> +{
> +       enum mfill_atomic_mode mode = uffd_flags_get_mode(flags);
> +       const vm_uffd_ops *uffd_ops;
> +       unsigned long uffd_ioctls;
> +
> +       if ((flags & MFILL_ATOMIC_WP) && !(vma->vm_flags & VM_UFFD_WP))
> +               return false;
> +
> +       /* Anonymous supports everything except CONTINUE */
> +       if (vma_is_anonymous(vma))
> +               return mode != MFILL_ATOMIC_CONTINUE;
> +
> +       uffd_ops = vma_get_uffd_ops(vma);
> +       if (!uffd_ops)
> +               return false;
> +
> +       uffd_ioctls = uffd_ops->uffd_ioctls;
> +       switch (mode) {
> +       case MFILL_ATOMIC_COPY:
> +               return uffd_ioctls & BIT(_UFFDIO_COPY);
> +       case MFILL_ATOMIC_ZEROPAGE:
> +               return uffd_ioctls & BIT(_UFFDIO_ZEROPAGE);
> +       case MFILL_ATOMIC_CONTINUE:
> +               if (!(vma->vm_flags & VM_SHARED))
> +                       return false;
> +               return uffd_ioctls & BIT(_UFFDIO_CONTINUE);
> +       case MFILL_ATOMIC_POISON:
> +               return uffd_ioctls & BIT(_UFFDIO_POISON);
> +       default:
> +               return false;
> +       }
> +}
> +
>  static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
>                                             unsigned long dst_start,
>                                             unsigned long src_start,
> @@ -752,11 +821,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
>             dst_vma->vm_flags & VM_SHARED))
>                 goto out_unlock;
>
> -       /*
> -        * validate 'mode' now that we know the dst_vma: don't allow
> -        * a wrprotect copy if the userfaultfd didn't register as WP.
> -        */
> -       if ((flags & MFILL_ATOMIC_WP) && !(dst_vma->vm_flags & VM_UFFD_WP))
> +       if (!vma_uffd_ops_supported(dst_vma, flags))
>                 goto out_unlock;
>
>         /*
> @@ -766,12 +831,6 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
>                 return  mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
>                                              src_start, len, flags);
>
> -       if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
> -               goto out_unlock;
> -       if (!vma_is_shmem(dst_vma) &&
> -           uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> -               goto out_unlock;
> -
>         while (src_addr < src_start + len) {
>                 pmd_t dst_pmdval;
>
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] mm: Apply vm_uffd_ops API to core mm
  2025-06-25 20:31   ` James Houghton
@ 2025-06-25 21:21     ` Peter Xu
  2025-06-25 21:52       ` James Houghton
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2025-06-25 21:21 UTC (permalink / raw)
  To: James Houghton
  Cc: linux-mm, linux-kernel, Nikita Kalyazin, Hugh Dickins,
	Oscar Salvador, Michal Hocko, David Hildenbrand, Muchun Song,
	Andrea Arcangeli, Ujwal Kundur, Suren Baghdasaryan, Andrew Morton,
	Vlastimil Babka, Liam R . Howlett, Mike Rapoport, Lorenzo Stoakes,
	Axel Rasmussen

On Wed, Jun 25, 2025 at 01:31:49PM -0700, James Houghton wrote:
> > -static inline bool vma_can_userfault(struct vm_area_struct *vma,
> > -                                    unsigned long vm_flags,
> > -                                    bool wp_async)
> > +static inline const vm_uffd_ops *vma_get_uffd_ops(struct vm_area_struct *vma)
> >  {
> > -       vm_flags &= __VM_UFFD_FLAGS;
> > -
> > -       if (vma->vm_flags & VM_DROPPABLE)
> > -               return false;
> > -
> > -       if ((vm_flags & VM_UFFD_MINOR) &&
> > -           (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
> > -               return false;
> > -
> > -       /*
> > -        * If wp async enabled, and WP is the only mode enabled, allow any
> > -        * memory type.
> > -        */
> > -       if (wp_async && (vm_flags == VM_UFFD_WP))
> > -               return true;
> > -
> > -#ifndef CONFIG_PTE_MARKER_UFFD_WP
> > -       /*
> > -        * If user requested uffd-wp but not enabled pte markers for
> > -        * uffd-wp, then shmem & hugetlbfs are not supported but only
> > -        * anonymous.
> > -        */
> > -       if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
> > -               return false;
> > -#endif
> 
> Hi Peter,
> 
> Thanks for this cleanup!
> 
> It looks like the above two checks, the wp-async one and the PTE
> marker check, have been reordered in this patch. Does this result in a
> functional difference?
> 
> The rest of this series looks fine to me. :)

Thanks for the very careful review, James!

Yes that's a small tweak I did when moving. I don't expect to have any
functional change.  Maybe I should at least mention that in the commit log.

Here I did the movement because fundamentally wp_async depends on the pte
markers, so it may be slightly more intuitive to check pte markers first,
rejecting any form of file wr-protect traps.  Otherwise it may looks like
we could return the true for wp_async==true too early.  In reality IIUC it
can't happen.

For example, currently userfaultfd_api() has:

#ifndef CONFIG_PTE_MARKER_UFFD_WP
	uffdio_api.features &= ~UFFD_FEATURE_WP_HUGETLBFS_SHMEM;
	uffdio_api.features &= ~UFFD_FEATURE_WP_UNPOPULATED;
	uffdio_api.features &= ~UFFD_FEATURE_WP_ASYNC;
#endif

So when wp_async can be true above, pte markers must be compiled..  IOW,
above code clip should work identically with below lines:

    #ifdef CONFIG_PTE_MARKER_UFFD_WP
       if (wp_async && (vm_flags == VM_UFFD_WP))
            return true;
    #endif
    
    #ifndef CONFIG_PTE_MARKER_UFFD_WP
        if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
            return false;
    #endif

Then it means both chunks of code cannot be compiled together.  The order
shouldn't matter.

But maybe I should just move it back as before, to save the explain and
confusions.  Let me know if you have any preference.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] mm: Apply vm_uffd_ops API to core mm
  2025-06-25 21:21     ` Peter Xu
@ 2025-06-25 21:52       ` James Houghton
  0 siblings, 0 replies; 24+ messages in thread
From: James Houghton @ 2025-06-25 21:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Nikita Kalyazin, Hugh Dickins,
	Oscar Salvador, Michal Hocko, David Hildenbrand, Muchun Song,
	Andrea Arcangeli, Ujwal Kundur, Suren Baghdasaryan, Andrew Morton,
	Vlastimil Babka, Liam R . Howlett, Mike Rapoport, Lorenzo Stoakes,
	Axel Rasmussen

On Wed, Jun 25, 2025 at 2:21 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Jun 25, 2025 at 01:31:49PM -0700, James Houghton wrote:
> > > -static inline bool vma_can_userfault(struct vm_area_struct *vma,
> > > -                                    unsigned long vm_flags,
> > > -                                    bool wp_async)
> > > +static inline const vm_uffd_ops *vma_get_uffd_ops(struct vm_area_struct *vma)
> > >  {
> > > -       vm_flags &= __VM_UFFD_FLAGS;
> > > -
> > > -       if (vma->vm_flags & VM_DROPPABLE)
> > > -               return false;
> > > -
> > > -       if ((vm_flags & VM_UFFD_MINOR) &&
> > > -           (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
> > > -               return false;
> > > -
> > > -       /*
> > > -        * If wp async enabled, and WP is the only mode enabled, allow any
> > > -        * memory type.
> > > -        */
> > > -       if (wp_async && (vm_flags == VM_UFFD_WP))
> > > -               return true;
> > > -
> > > -#ifndef CONFIG_PTE_MARKER_UFFD_WP
> > > -       /*
> > > -        * If user requested uffd-wp but not enabled pte markers for
> > > -        * uffd-wp, then shmem & hugetlbfs are not supported but only
> > > -        * anonymous.
> > > -        */
> > > -       if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
> > > -               return false;
> > > -#endif
> >
> > Hi Peter,
> >
> > Thanks for this cleanup!
> >
> > It looks like the above two checks, the wp-async one and the PTE
> > marker check, have been reordered in this patch. Does this result in a
> > functional difference?
> >
> > The rest of this series looks fine to me. :)
>
> Thanks for the very careful review, James!
>
> Yes that's a small tweak I did when moving. I don't expect to have any
> functional change.  Maybe I should at least mention that in the commit log.

Yeah if you could leave a small explanation, like what you have below,
in the commit log, that would be good. :)

>
> Here I did the movement because fundamentally wp_async depends on the pte
> markers, so it may be slightly more intuitive to check pte markers first,
> rejecting any form of file wr-protect traps.  Otherwise it may looks like
> we could return the true for wp_async==true too early.  In reality IIUC it
> can't happen.
>
> For example, currently userfaultfd_api() has:
>
> #ifndef CONFIG_PTE_MARKER_UFFD_WP
>         uffdio_api.features &= ~UFFD_FEATURE_WP_HUGETLBFS_SHMEM;
>         uffdio_api.features &= ~UFFD_FEATURE_WP_UNPOPULATED;
>         uffdio_api.features &= ~UFFD_FEATURE_WP_ASYNC;
> #endif

Ah, I see, thanks!

> So when wp_async can be true above, pte markers must be compiled..  IOW,
> above code clip should work identically with below lines:
>
>     #ifdef CONFIG_PTE_MARKER_UFFD_WP
>        if (wp_async && (vm_flags == VM_UFFD_WP))
>             return true;
>     #endif
>
>     #ifndef CONFIG_PTE_MARKER_UFFD_WP
>         if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
>             return false;
>     #endif
>
> Then it means both chunks of code cannot be compiled together.  The order
> shouldn't matter.
>
> But maybe I should just move it back as before, to save the explain and
> confusions.  Let me know if you have any preference.

Feel free to leave this patch as it is, that's fine. Thanks for the
explanation. :)

If you'd like, feel free to add:

Reviewed-by: James Houghton <jthoughton@google.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/4] mm/userfaultfd: modulize memory types
  2025-06-25 20:17   ` Peter Xu
@ 2025-06-26 16:09     ` Nikita Kalyazin
  2025-06-27 13:51       ` Peter Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Nikita Kalyazin @ 2025-06-26 16:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Oscar Salvador,
	Michal Hocko, David Hildenbrand, Muchun Song, Andrea Arcangeli,
	Ujwal Kundur, Suren Baghdasaryan, Andrew Morton, Vlastimil Babka,
	Liam R . Howlett, James Houghton, Mike Rapoport, Lorenzo Stoakes,
	Axel Rasmussen



On 25/06/2025 21:17, Peter Xu wrote:
> On Wed, Jun 25, 2025 at 05:56:23PM +0100, Nikita Kalyazin wrote:
>>
>>
>> On 20/06/2025 20:03, Peter Xu wrote:
>>> [based on akpm/mm-new]
>>>
>>> This series is an alternative proposal of what Nikita proposed here on the
>>> initial three patches:
>>>
>>>     https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com
>>>
>>> This is not yet relevant to any guest-memfd support, but paving way for it.
>>
>> Hi Peter,
> 
> Hi, Nikita,
> 
>>
>> Thanks for posting this.  I confirmed that minor fault handling was working
>> for guest_memfd based on this series and looked simple (a draft based on
>> mmap support in guest_memfd v7 [1]):
> 
> Thanks for the quick spin, glad to know it works. Some trivial things to
> mention below..

Following up, I drafted UFFDIO_COPY support for guest_memfd to confirm 
it works as well:

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 8c44e4b9f5f8..b5458a22fff4 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -349,12 +349,19 @@ static bool kvm_gmem_offset_is_shared(struct file 
*file, pgoff_t index)

  static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
  {
+	struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
  	struct inode *inode = file_inode(vmf->vma->vm_file);
  	struct folio *folio;
  	vm_fault_t ret = VM_FAULT_LOCKED;

  	filemap_invalidate_lock_shared(inode->i_mapping);

+	folio = filemap_get_entry(inode->i_mapping, vmf->pgoff);
+	if (!folio && vma && userfaultfd_missing(vma)) {
+		filemap_invalidate_unlock_shared(inode->i_mapping);
+		return handle_userfault(vmf, VM_UFFD_MISSING);
+	}
+
  	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
  	if (IS_ERR(folio)) {
  		int err = PTR_ERR(folio);
@@ -438,10 +445,57 @@ static int kvm_gmem_uffd_get_folio(struct inode 
*inode, pgoff_t pgoff,
  	return 0;
  }

+static int kvm_gmem_mfill_atomic_pte(pmd_t *dst_pmd,
+			   struct vm_area_struct *dst_vma,
+			   unsigned long dst_addr,
+			   unsigned long src_addr,
+			   uffd_flags_t flags,
+			   struct folio **foliop)
+{
+	struct inode *inode = file_inode(dst_vma->vm_file);
+	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
+	struct folio *folio;
+	int ret;
+
+	folio = kvm_gmem_get_folio(inode, pgoff);
+	if (IS_ERR(folio)) {
+		ret = PTR_ERR(folio);
+		goto out;
+	}
+
+	folio_unlock(folio);
+
+	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) {
+		void *vaddr = kmap_local_folio(folio, 0);
+		ret = copy_from_user(vaddr, (const void __user *)src_addr, PAGE_SIZE);
+		kunmap_local(vaddr);
+		if (unlikely(ret)) {
+			*foliop = folio;
+			ret = -ENOENT;
+			goto out;
+		}
+	} else {		/* ZEROPAGE */
+		clear_user_highpage(&folio->page, dst_addr);
+	}
+
+	kvm_gmem_mark_prepared(folio);
+
+	ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
+				       &folio->page, true, flags);
+
+	if (ret)
+		folio_put(folio);
+out:
+	return ret;
+}
+
  static const vm_uffd_ops kvm_gmem_uffd_ops = {
-	.uffd_features	= 	VM_UFFD_MINOR,
-	.uffd_ioctls	= 	BIT(_UFFDIO_CONTINUE),
+	.uffd_features	= 	VM_UFFD_MISSING | VM_UFFD_MINOR,
+	.uffd_ioctls	= 	BIT(_UFFDIO_COPY) |
+				BIT(_UFFDIO_ZEROPAGE) |
+				BIT(_UFFDIO_CONTINUE),
  	.uffd_get_folio	=	kvm_gmem_uffd_get_folio,
+	.uffd_copy	=	kvm_gmem_mfill_atomic_pte,
  };
  #endif

> 
>>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 5abb6d52a375..6ddc73419724 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -5,6 +5,9 @@
>>   #include <linux/pagemap.h>
>>   #include <linux/anon_inodes.h>
>>   #include <linux/set_memory.h>
>> +#ifdef CONFIG_USERFAULTFD
> 
> This ifdef not needed, userfaultfd_k.h has taken care of all cases.

Good to know, thanks.

>> +#include <linux/userfaultfd_k.h>
>> +#endif
>>
>>   #include "kvm_mm.h"
>>
>> @@ -396,6 +399,14 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>>                kvm_gmem_mark_prepared(folio);
>>        }
>>
>> +#ifdef CONFIG_USERFAULTFD
> 
> Same here.  userfaultfd_minor() is always defined.

Thank you.

> I'll wait for a few more days for reviewers, and likely send v2 before next
> week.
> 
> Thanks,
> 
> --
> Peter Xu
> 


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/4] mm/userfaultfd: modulize memory types
  2025-06-26 16:09     ` Nikita Kalyazin
@ 2025-06-27 13:51       ` Peter Xu
  2025-06-27 16:59         ` Nikita Kalyazin
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2025-06-27 13:51 UTC (permalink / raw)
  To: Nikita Kalyazin
  Cc: linux-mm, linux-kernel, Hugh Dickins, Oscar Salvador,
	Michal Hocko, David Hildenbrand, Muchun Song, Andrea Arcangeli,
	Ujwal Kundur, Suren Baghdasaryan, Andrew Morton, Vlastimil Babka,
	Liam R . Howlett, James Houghton, Mike Rapoport, Lorenzo Stoakes,
	Axel Rasmussen

On Thu, Jun 26, 2025 at 05:09:47PM +0100, Nikita Kalyazin wrote:
> 
> 
> On 25/06/2025 21:17, Peter Xu wrote:
> > On Wed, Jun 25, 2025 at 05:56:23PM +0100, Nikita Kalyazin wrote:
> > > 
> > > 
> > > On 20/06/2025 20:03, Peter Xu wrote:
> > > > [based on akpm/mm-new]
> > > > 
> > > > This series is an alternative proposal of what Nikita proposed here on the
> > > > initial three patches:
> > > > 
> > > >     https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com
> > > > 
> > > > This is not yet relevant to any guest-memfd support, but paving way for it.
> > > 
> > > Hi Peter,
> > 
> > Hi, Nikita,
> > 
> > > 
> > > Thanks for posting this.  I confirmed that minor fault handling was working
> > > for guest_memfd based on this series and looked simple (a draft based on
> > > mmap support in guest_memfd v7 [1]):
> > 
> > Thanks for the quick spin, glad to know it works. Some trivial things to
> > mention below..
> 
> Following up, I drafted UFFDIO_COPY support for guest_memfd to confirm it
> works as well:

Appreciated.

Since at it, I'll comment quickly below.

> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 8c44e4b9f5f8..b5458a22fff4 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -349,12 +349,19 @@ static bool kvm_gmem_offset_is_shared(struct file
> *file, pgoff_t index)
> 
>  static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>  {
> +	struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
>  	struct inode *inode = file_inode(vmf->vma->vm_file);
>  	struct folio *folio;
>  	vm_fault_t ret = VM_FAULT_LOCKED;
> 
>  	filemap_invalidate_lock_shared(inode->i_mapping);
> 
> +	folio = filemap_get_entry(inode->i_mapping, vmf->pgoff);
> +	if (!folio && vma && userfaultfd_missing(vma)) {
> +		filemap_invalidate_unlock_shared(inode->i_mapping);
> +		return handle_userfault(vmf, VM_UFFD_MISSING);
> +	}

Likely a possible refcount leak when folio != NULL here.

> +
>  	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
>  	if (IS_ERR(folio)) {
>  		int err = PTR_ERR(folio);
> @@ -438,10 +445,57 @@ static int kvm_gmem_uffd_get_folio(struct inode
> *inode, pgoff_t pgoff,
>  	return 0;
>  }
> 
> +static int kvm_gmem_mfill_atomic_pte(pmd_t *dst_pmd,
> +			   struct vm_area_struct *dst_vma,
> +			   unsigned long dst_addr,
> +			   unsigned long src_addr,
> +			   uffd_flags_t flags,
> +			   struct folio **foliop)
> +{
> +	struct inode *inode = file_inode(dst_vma->vm_file);
> +	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
> +	struct folio *folio;
> +	int ret;
> +
> +	folio = kvm_gmem_get_folio(inode, pgoff);
> +	if (IS_ERR(folio)) {
> +		ret = PTR_ERR(folio);
> +		goto out;
> +	}
> +
> +	folio_unlock(folio);
> +
> +	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) {
> +		void *vaddr = kmap_local_folio(folio, 0);
> +		ret = copy_from_user(vaddr, (const void __user *)src_addr, PAGE_SIZE);
> +		kunmap_local(vaddr);
> +		if (unlikely(ret)) {
> +			*foliop = folio;
> +			ret = -ENOENT;
> +			goto out;
> +		}
> +	} else {		/* ZEROPAGE */
> +		clear_user_highpage(&folio->page, dst_addr);
> +	}
> +
> +	kvm_gmem_mark_prepared(folio);

Since Faud's series hasn't yet landed, so I'm almost looking at the current
code base with an imagination of what might happen.

In general, missing trapping for guest-memfd could start to be slightly
trickier.  So far IIUC guest-memfd cache pool needs to be populated only by
a prior fallocate() syscall, not during fault.  So I suppose we will need
to use uptodate bit to mark folio ready, like what's done here.

If so, we may want to make sure in fault path any !uptodate fault will get
trapped for missing too, even if it sounds not strictly a "cache miss"
... so slightly confusing but sounds necessary.

Meanwhile, I'm not 100% sure how it goes especially if taking CoCo into
account, because CoCo needs to prepare the pages, so mark uptodate may not
be enough?  I don't know well on the CoCo side to tell.  Otherwise we'll at
least need to restrict MISSING traps to only happen on fully shared
guest-memfds.

OTOH, MINOR should be much easier to be done for guest-memfd, not only
because the code to support that would be very minimum which is definitely
lovely, but also because it's still pretty common idea to monitor pgtable
entries, and it should logically even apply to CoCo: in a fault(), we need
to check whether the guest-memfd folio is "shared" and/or "faultable"
first; it should already fail the fault() if it's a private folio.  Then if
it's visible (aka, "faultable") to HVA namespace, then it's legal to trap a
MINOR too.  For !CoCo it'll always trap as it's always faultable.

MINOR also makes more sense to be used in the future with 1G postcopy
support on top of gmem, because that's almost the only way to go.  Looks
like we've made up our mind to reuse Hugetlb pages for gmem which sounds
good, then Hugetlb pages are in 1G granule in allocations, and we can't
easily do 4K miss trapping on one 1G huge page.  MINOR is simpler but
actually more powerful from that POV.

To summarize, I think after this we can do MINOR before MISSING for
guest-memfd if MINOR already works for you.  We can leave MISSING until we
know how we would use it.

Thanks,

> +
> +	ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
> +				       &folio->page, true, flags);
> +
> +	if (ret)
> +		folio_put(folio);
> +out:
> +	return ret;
> +}
> +
>  static const vm_uffd_ops kvm_gmem_uffd_ops = {
> -	.uffd_features	= 	VM_UFFD_MINOR,
> -	.uffd_ioctls	= 	BIT(_UFFDIO_CONTINUE),
> +	.uffd_features	= 	VM_UFFD_MISSING | VM_UFFD_MINOR,
> +	.uffd_ioctls	= 	BIT(_UFFDIO_COPY) |
> +				BIT(_UFFDIO_ZEROPAGE) |
> +				BIT(_UFFDIO_CONTINUE),
>  	.uffd_get_folio	=	kvm_gmem_uffd_get_folio,
> +	.uffd_copy	=	kvm_gmem_mfill_atomic_pte,
>  };
>  #endif
> 
> > 
> > > 
> > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > > index 5abb6d52a375..6ddc73419724 100644
> > > --- a/virt/kvm/guest_memfd.c
> > > +++ b/virt/kvm/guest_memfd.c
> > > @@ -5,6 +5,9 @@
> > >   #include <linux/pagemap.h>
> > >   #include <linux/anon_inodes.h>
> > >   #include <linux/set_memory.h>
> > > +#ifdef CONFIG_USERFAULTFD
> > 
> > This ifdef not needed, userfaultfd_k.h has taken care of all cases.
> 
> Good to know, thanks.
> 
> > > +#include <linux/userfaultfd_k.h>
> > > +#endif
> > > 
> > >   #include "kvm_mm.h"
> > > 
> > > @@ -396,6 +399,14 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> > >                kvm_gmem_mark_prepared(folio);
> > >        }
> > > 
> > > +#ifdef CONFIG_USERFAULTFD
> > 
> > Same here.  userfaultfd_minor() is always defined.
> 
> Thank you.
> 
> > I'll wait for a few more days for reviewers, and likely send v2 before next
> > week.
> > 
> > Thanks,
> > 
> > --
> > Peter Xu
> > 
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/4] mm/userfaultfd: modulize memory types
  2025-06-27 13:51       ` Peter Xu
@ 2025-06-27 16:59         ` Nikita Kalyazin
  2025-06-27 18:46           ` Peter Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Nikita Kalyazin @ 2025-06-27 16:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Hugh Dickins, Oscar Salvador,
	Michal Hocko, David Hildenbrand, Muchun Song, Andrea Arcangeli,
	Ujwal Kundur, Suren Baghdasaryan, Andrew Morton, Vlastimil Babka,
	Liam R . Howlett, James Houghton, Mike Rapoport, Lorenzo Stoakes,
	Axel Rasmussen



On 27/06/2025 14:51, Peter Xu wrote:
> On Thu, Jun 26, 2025 at 05:09:47PM +0100, Nikita Kalyazin wrote:
>>
>>
>> On 25/06/2025 21:17, Peter Xu wrote:
>>> On Wed, Jun 25, 2025 at 05:56:23PM +0100, Nikita Kalyazin wrote:
>>>>
>>>>
>>>> On 20/06/2025 20:03, Peter Xu wrote:
>>>>> [based on akpm/mm-new]
>>>>>
>>>>> This series is an alternative proposal of what Nikita proposed here on the
>>>>> initial three patches:
>>>>>
>>>>>      https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com
>>>>>
>>>>> This is not yet relevant to any guest-memfd support, but paving way for it.
>>>>
>>>> Hi Peter,
>>>
>>> Hi, Nikita,
>>>
>>>>
>>>> Thanks for posting this.  I confirmed that minor fault handling was working
>>>> for guest_memfd based on this series and looked simple (a draft based on
>>>> mmap support in guest_memfd v7 [1]):
>>>
>>> Thanks for the quick spin, glad to know it works. Some trivial things to
>>> mention below..
>>
>> Following up, I drafted UFFDIO_COPY support for guest_memfd to confirm it
>> works as well:
> 
> Appreciated.
> 
> Since at it, I'll comment quickly below.
> 
>>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 8c44e4b9f5f8..b5458a22fff4 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -349,12 +349,19 @@ static bool kvm_gmem_offset_is_shared(struct file
>> *file, pgoff_t index)
>>
>>   static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>>   {
>> +     struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
>>        struct inode *inode = file_inode(vmf->vma->vm_file);
>>        struct folio *folio;
>>        vm_fault_t ret = VM_FAULT_LOCKED;
>>
>>        filemap_invalidate_lock_shared(inode->i_mapping);
>>
>> +     folio = filemap_get_entry(inode->i_mapping, vmf->pgoff);
>> +     if (!folio && vma && userfaultfd_missing(vma)) {
>> +             filemap_invalidate_unlock_shared(inode->i_mapping);
>> +             return handle_userfault(vmf, VM_UFFD_MISSING);
>> +     }
> 
> Likely a possible refcount leak when folio != NULL here.

Thank you.  I was only aiming to cover the happy case for know.  I will 
keep it in mind for the future.
>> +
>>        folio = kvm_gmem_get_folio(inode, vmf->pgoff);
>>        if (IS_ERR(folio)) {
>>                int err = PTR_ERR(folio);
>> @@ -438,10 +445,57 @@ static int kvm_gmem_uffd_get_folio(struct inode
>> *inode, pgoff_t pgoff,
>>        return 0;
>>   }
>>
>> +static int kvm_gmem_mfill_atomic_pte(pmd_t *dst_pmd,
>> +                        struct vm_area_struct *dst_vma,
>> +                        unsigned long dst_addr,
>> +                        unsigned long src_addr,
>> +                        uffd_flags_t flags,
>> +                        struct folio **foliop)
>> +{
>> +     struct inode *inode = file_inode(dst_vma->vm_file);
>> +     pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
>> +     struct folio *folio;
>> +     int ret;
>> +
>> +     folio = kvm_gmem_get_folio(inode, pgoff);
>> +     if (IS_ERR(folio)) {
>> +             ret = PTR_ERR(folio);
>> +             goto out;
>> +     }
>> +
>> +     folio_unlock(folio);
>> +
>> +     if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) {
>> +             void *vaddr = kmap_local_folio(folio, 0);
>> +             ret = copy_from_user(vaddr, (const void __user *)src_addr, PAGE_SIZE);
>> +             kunmap_local(vaddr);
>> +             if (unlikely(ret)) {
>> +                     *foliop = folio;
>> +                     ret = -ENOENT;
>> +                     goto out;
>> +             }
>> +     } else {                /* ZEROPAGE */
>> +             clear_user_highpage(&folio->page, dst_addr);
>> +     }
>> +
>> +     kvm_gmem_mark_prepared(folio);
> 
> Since Faud's series hasn't yet landed, so I'm almost looking at the current
> code base with an imagination of what might happen.
> 
> In general, missing trapping for guest-memfd could start to be slightly
> trickier.  So far IIUC guest-memfd cache pool needs to be populated only by
> a prior fallocate() syscall, not during fault.  So I suppose we will need
> to use uptodate bit to mark folio ready, like what's done here.

I don't think I'm familiar with the fallocate() requirement in 
guest_memfd.  Fuad's v12 [1] (although I think it has been like that 
from the beginning) calls kvm_gmem_get_folio() that populates pagecache 
in the fault handler (kvm_gmem_fault_shared()).  SEV [2] and TDX [3] 
seem to use kvm_gmem_populate() for both allocation and preparation.

[1] 
https://lore.kernel.org/kvm/20250611133330.1514028-1-tabba@google.com/T/#m15b53a741e4f328e61f995a01afb9c4682ffe611
[2] 
https://elixir.bootlin.com/linux/v6.16-rc3/source/arch/x86/kvm/svm/sev.c#L2331
[3] 
https://elixir.bootlin.com/linux/v6.16-rc3/source/arch/x86/kvm/vmx/tdx.c#L3236

> 
> If so, we may want to make sure in fault path any !uptodate fault will get
> trapped for missing too, even if it sounds not strictly a "cache miss"
> ... so slightly confusing but sounds necessary.
> 
> Meanwhile, I'm not 100% sure how it goes especially if taking CoCo into
> account, because CoCo needs to prepare the pages, so mark uptodate may not
> be enough?  I don't know well on the CoCo side to tell.  Otherwise we'll at
> least need to restrict MISSING traps to only happen on fully shared
> guest-memfds.

I am not fluent in CoCo either, but I thought CoCo needed to do 
preparation for private pages only, while UFFD shouldn't be dealing with 
them so issuing MISSING only on shared looks sensible to me.
> OTOH, MINOR should be much easier to be done for guest-memfd, not only
> because the code to support that would be very minimum which is definitely
> lovely, but also because it's still pretty common idea to monitor pgtable
> entries, and it should logically even apply to CoCo: in a fault(), we need
> to check whether the guest-memfd folio is "shared" and/or "faultable"
> first; it should already fail the fault() if it's a private folio.  Then if
> it's visible (aka, "faultable") to HVA namespace, then it's legal to trap a
> MINOR too.  For !CoCo it'll always trap as it's always faultable. > MINOR also makes more sense to be used in the future with 1G postcopy
> support on top of gmem, because that's almost the only way to go.  Looks
> like we've made up our mind to reuse Hugetlb pages for gmem which sounds
> good, then Hugetlb pages are in 1G granule in allocations, and we can't
> easily do 4K miss trapping on one 1G huge page.  MINOR is simpler but
> actually more powerful from that POV.
> 
> To summarize, I think after this we can do MINOR before MISSING for
> guest-memfd if MINOR already works for you.  We can leave MISSING until we
> know how we would use it.

Starting with MINOR sounds good to me.

> 
> Thanks,
> 
>> +
>> +     ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
>> +                                    &folio->page, true, flags);
>> +
>> +     if (ret)
>> +             folio_put(folio);
>> +out:
>> +     return ret;
>> +}
>> +
>>   static const vm_uffd_ops kvm_gmem_uffd_ops = {
>> -     .uffd_features  =       VM_UFFD_MINOR,
>> -     .uffd_ioctls    =       BIT(_UFFDIO_CONTINUE),
>> +     .uffd_features  =       VM_UFFD_MISSING | VM_UFFD_MINOR,
>> +     .uffd_ioctls    =       BIT(_UFFDIO_COPY) |
>> +                             BIT(_UFFDIO_ZEROPAGE) |
>> +                             BIT(_UFFDIO_CONTINUE),
>>        .uffd_get_folio =       kvm_gmem_uffd_get_folio,
>> +     .uffd_copy      =       kvm_gmem_mfill_atomic_pte,
>>   };
>>   #endif
>>
>>>
>>>>
>>>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>>>> index 5abb6d52a375..6ddc73419724 100644
>>>> --- a/virt/kvm/guest_memfd.c
>>>> +++ b/virt/kvm/guest_memfd.c
>>>> @@ -5,6 +5,9 @@
>>>>    #include <linux/pagemap.h>
>>>>    #include <linux/anon_inodes.h>
>>>>    #include <linux/set_memory.h>
>>>> +#ifdef CONFIG_USERFAULTFD
>>>
>>> This ifdef not needed, userfaultfd_k.h has taken care of all cases.
>>
>> Good to know, thanks.
>>
>>>> +#include <linux/userfaultfd_k.h>
>>>> +#endif
>>>>
>>>>    #include "kvm_mm.h"
>>>>
>>>> @@ -396,6 +399,14 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>>>>                 kvm_gmem_mark_prepared(folio);
>>>>         }
>>>>
>>>> +#ifdef CONFIG_USERFAULTFD
>>>
>>> Same here.  userfaultfd_minor() is always defined.
>>
>> Thank you.
>>
>>> I'll wait for a few more days for reviewers, and likely send v2 before next
>>> week.
>>>
>>> Thanks,
>>>
>>> --
>>> Peter Xu
>>>
>>
> 
> --
> Peter Xu
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/4] mm/userfaultfd: modulize memory types
  2025-06-27 16:59         ` Nikita Kalyazin
@ 2025-06-27 18:46           ` Peter Xu
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Xu @ 2025-06-27 18:46 UTC (permalink / raw)
  To: Nikita Kalyazin
  Cc: linux-mm, linux-kernel, Hugh Dickins, Oscar Salvador,
	Michal Hocko, David Hildenbrand, Muchun Song, Andrea Arcangeli,
	Ujwal Kundur, Suren Baghdasaryan, Andrew Morton, Vlastimil Babka,
	Liam R . Howlett, James Houghton, Mike Rapoport, Lorenzo Stoakes,
	Axel Rasmussen

On Fri, Jun 27, 2025 at 05:59:49PM +0100, Nikita Kalyazin wrote:
> 
> 
> On 27/06/2025 14:51, Peter Xu wrote:
> > On Thu, Jun 26, 2025 at 05:09:47PM +0100, Nikita Kalyazin wrote:
> > > 
> > > 
> > > On 25/06/2025 21:17, Peter Xu wrote:
> > > > On Wed, Jun 25, 2025 at 05:56:23PM +0100, Nikita Kalyazin wrote:
> > > > > 
> > > > > 
> > > > > On 20/06/2025 20:03, Peter Xu wrote:
> > > > > > [based on akpm/mm-new]
> > > > > > 
> > > > > > This series is an alternative proposal of what Nikita proposed here on the
> > > > > > initial three patches:
> > > > > > 
> > > > > >      https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com
> > > > > > 
> > > > > > This is not yet relevant to any guest-memfd support, but paving way for it.
> > > > > 
> > > > > Hi Peter,
> > > > 
> > > > Hi, Nikita,
> > > > 
> > > > > 
> > > > > Thanks for posting this.  I confirmed that minor fault handling was working
> > > > > for guest_memfd based on this series and looked simple (a draft based on
> > > > > mmap support in guest_memfd v7 [1]):
> > > > 
> > > > Thanks for the quick spin, glad to know it works. Some trivial things to
> > > > mention below..
> > > 
> > > Following up, I drafted UFFDIO_COPY support for guest_memfd to confirm it
> > > works as well:
> > 
> > Appreciated.
> > 
> > Since at it, I'll comment quickly below.
> > 
> > > 
> > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > > index 8c44e4b9f5f8..b5458a22fff4 100644
> > > --- a/virt/kvm/guest_memfd.c
> > > +++ b/virt/kvm/guest_memfd.c
> > > @@ -349,12 +349,19 @@ static bool kvm_gmem_offset_is_shared(struct file
> > > *file, pgoff_t index)
> > > 
> > >   static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> > >   {
> > > +     struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
> > >        struct inode *inode = file_inode(vmf->vma->vm_file);
> > >        struct folio *folio;
> > >        vm_fault_t ret = VM_FAULT_LOCKED;
> > > 
> > >        filemap_invalidate_lock_shared(inode->i_mapping);
> > > 
> > > +     folio = filemap_get_entry(inode->i_mapping, vmf->pgoff);
> > > +     if (!folio && vma && userfaultfd_missing(vma)) {
> > > +             filemap_invalidate_unlock_shared(inode->i_mapping);
> > > +             return handle_userfault(vmf, VM_UFFD_MISSING);
> > > +     }
> > 
> > Likely a possible refcount leak when folio != NULL here.
> 
> Thank you.  I was only aiming to cover the happy case for know.  I will keep
> it in mind for the future.

Yep that's good enough, thanks.  It's really something I'd comment
passingly, it's definitely reassuring to know the happy case works.

> > > +
> > >        folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> > >        if (IS_ERR(folio)) {
> > >                int err = PTR_ERR(folio);
> > > @@ -438,10 +445,57 @@ static int kvm_gmem_uffd_get_folio(struct inode
> > > *inode, pgoff_t pgoff,
> > >        return 0;
> > >   }
> > > 
> > > +static int kvm_gmem_mfill_atomic_pte(pmd_t *dst_pmd,
> > > +                        struct vm_area_struct *dst_vma,
> > > +                        unsigned long dst_addr,
> > > +                        unsigned long src_addr,
> > > +                        uffd_flags_t flags,
> > > +                        struct folio **foliop)
> > > +{
> > > +     struct inode *inode = file_inode(dst_vma->vm_file);
> > > +     pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
> > > +     struct folio *folio;
> > > +     int ret;
> > > +
> > > +     folio = kvm_gmem_get_folio(inode, pgoff);
> > > +     if (IS_ERR(folio)) {
> > > +             ret = PTR_ERR(folio);
> > > +             goto out;
> > > +     }
> > > +
> > > +     folio_unlock(folio);
> > > +
> > > +     if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) {
> > > +             void *vaddr = kmap_local_folio(folio, 0);
> > > +             ret = copy_from_user(vaddr, (const void __user *)src_addr, PAGE_SIZE);
> > > +             kunmap_local(vaddr);
> > > +             if (unlikely(ret)) {
> > > +                     *foliop = folio;
> > > +                     ret = -ENOENT;
> > > +                     goto out;
> > > +             }
> > > +     } else {                /* ZEROPAGE */
> > > +             clear_user_highpage(&folio->page, dst_addr);
> > > +     }
> > > +
> > > +     kvm_gmem_mark_prepared(folio);
> > 
> > Since Faud's series hasn't yet landed, so I'm almost looking at the current
> > code base with an imagination of what might happen.
> > 
> > In general, missing trapping for guest-memfd could start to be slightly
> > trickier.  So far IIUC guest-memfd cache pool needs to be populated only by
> > a prior fallocate() syscall, not during fault.  So I suppose we will need
> > to use uptodate bit to mark folio ready, like what's done here.
> 
> I don't think I'm familiar with the fallocate() requirement in guest_memfd.
> Fuad's v12 [1] (although I think it has been like that from the beginning)
> calls kvm_gmem_get_folio() that populates pagecache in the fault handler
> (kvm_gmem_fault_shared()).  SEV [2] and TDX [3] seem to use
> kvm_gmem_populate() for both allocation and preparation.

I actually didn't notice fault() uses kvm_gmem_get_folio(), which has
FGP_CREAT indeed.

I checked Ackerley's latest 1G patchset, which also did the same that
kvm_gmem_get_folio() will invoke the custom allocator to allocate 1G pages
even during a fault().

Not sure whether it's intentional though, for example, if the tests in
userspace always does fallocate() then the code should run the same, and
FGP_CREAT will just never be used.

Thanks for pointing this out.  I definitely didn't notice this trivial
detail before. Looks like it's not a major issue, if the folio can be
dynamically allocated, then MISSING mode (if/when it'll be supported) can
capture both "!folio" and "folio && !uptodate" cases here as missing.

> 
> [1] https://lore.kernel.org/kvm/20250611133330.1514028-1-tabba@google.com/T/#m15b53a741e4f328e61f995a01afb9c4682ffe611
> [2] https://elixir.bootlin.com/linux/v6.16-rc3/source/arch/x86/kvm/svm/sev.c#L2331
> [3] https://elixir.bootlin.com/linux/v6.16-rc3/source/arch/x86/kvm/vmx/tdx.c#L3236

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2025-06-27 18:46 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-20 19:03 [PATCH 0/4] mm/userfaultfd: modulize memory types Peter Xu
2025-06-20 19:03 ` [PATCH 1/4] mm: Introduce vm_uffd_ops API Peter Xu
2025-06-22  7:28   ` Mike Rapoport
2025-06-23 13:36     ` Peter Xu
2025-06-23  8:25   ` David Hildenbrand
2025-06-23 13:59     ` Peter Xu
2025-06-23 16:50       ` David Hildenbrand
2025-06-23 17:20         ` Peter Xu
2025-06-23 17:25           ` David Hildenbrand
2025-06-23 17:56             ` Peter Xu
2025-06-20 19:03 ` [PATCH 2/4] mm/shmem: Support " Peter Xu
2025-06-20 19:03 ` [PATCH 3/4] mm/hugetlb: " Peter Xu
2025-06-20 19:03 ` [PATCH 4/4] mm: Apply vm_uffd_ops API to core mm Peter Xu
2025-06-22 19:09   ` kernel test robot
2025-06-23 18:12     ` Peter Xu
2025-06-25 20:31   ` James Houghton
2025-06-25 21:21     ` Peter Xu
2025-06-25 21:52       ` James Houghton
2025-06-25 16:56 ` [PATCH 0/4] mm/userfaultfd: modulize memory types Nikita Kalyazin
2025-06-25 20:17   ` Peter Xu
2025-06-26 16:09     ` Nikita Kalyazin
2025-06-27 13:51       ` Peter Xu
2025-06-27 16:59         ` Nikita Kalyazin
2025-06-27 18:46           ` Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).