* [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor
@ 2025-04-04 15:43 Nikita Kalyazin
  2025-04-04 15:43 ` [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs Nikita Kalyazin
                   ` (7 more replies)
  0 siblings, 8 replies; 28+ messages in thread
From: Nikita Kalyazin @ 2025-04-04 15:43 UTC (permalink / raw)
  To: akpm, pbonzini, shuah, viro, brauner, muchun.song, hughd
  Cc: kvm, linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	lorenzo.stoakes, Liam.Howlett, jannh, ryan.roberts, david,
	jthoughton, peterx, graf, jgowans, roypat, derekmn, nsaenz,
	xmarcalx, kalyazin

This series is built on top of Fuad's v7 "mapping guest_memfd backed
memory at the host" [1].

With James's KVM userfault [2], it is possible to handle stage-2 faults
in guest_memfd in userspace.  However, KVM itself also triggers faults
in guest_memfd in some cases, for example: PV interfaces like kvmclock,
PV EOI and page table walking code when fetching the MMIO instruction on
x86.  It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
that KVM would be accessing those pages via userspace page tables.  In
order for such faults to be handled in userspace, guest_memfd needs to
support userfaultfd.
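
For illustration, here is a condensed sketch (not part of the series) of
how userspace resolves such a fault, mirroring the selftest added in
patch 6/6.  It assumes the guest_memfd has already been mmap()ed at
"mem" from file offset 0 and registered with "uffd" in
UFFDIO_REGISTER_MODE_MINOR mode; error handling is trimmed:

#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/userfaultfd.h>

static int resolve_minor_fault(int uffd, int gmem_fd, char *mem,
			       const char *src, size_t page_size)
{
	struct uffd_msg msg;
	struct uffdio_continue cont;
	unsigned long addr;
	char *mem_nofault;

	/* Wait for KVM (or a vCPU thread) to hit a minor fault. */
	if (read(uffd, &msg, sizeof(msg)) != (ssize_t)sizeof(msg) ||
	    msg.event != UFFD_EVENT_PAGEFAULT)
		return -1;
	addr = msg.arg.pagefault.address & ~(page_size - 1);

	/* Populate the page cache via a second, non-registered mapping. */
	mem_nofault = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
			   MAP_SHARED, gmem_fd, addr - (unsigned long)mem);
	if (mem_nofault == MAP_FAILED)
		return -1;
	memcpy(mem_nofault, src, page_size);
	munmap(mem_nofault, page_size);

	/* Install the PTE in the registered mapping and wake the faulter. */
	cont.range.start = addr;
	cont.range.len = page_size;
	cont.mode = 0;
	return ioctl(uffd, UFFDIO_CONTINUE, &cont);
}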

Changes since v2 [4]:
 - James: Fix sgp type when calling shmem_get_folio_gfp
 - James: Improved vm_ops->fault() error handling
 - James: Add and make use of the can_userfault() VMA operation
 - James: Add UFFD_FEATURE_MINOR_GUEST_MEMFD feature flag
 - James: Fix typos and add more checks in the test

Nikita

[1] https://lore.kernel.org/kvm/20250318161823.4005529-1-tabba@google.com/T/
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com/T/
[3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?tab=t.0#heading=h.w1126rgli5e3
[4] https://lore.kernel.org/kvm/20250402160721.97596-1-kalyazin@amazon.com/T/

Nikita Kalyazin (6):
  mm: userfaultfd: generic continue for non hugetlbfs
  mm: provide can_userfault vma operation
  mm: userfaultfd: use can_userfault vma operation
  KVM: guest_memfd: add support for userfaultfd minor
  mm: userfaultfd: add UFFD_FEATURE_MINOR_GUEST_MEMFD
  KVM: selftests: test userfaultfd minor for guest_memfd

 fs/userfaultfd.c                              |  3 +-
 include/linux/mm.h                            |  5 +
 include/linux/mm_types.h                      |  4 +
 include/linux/userfaultfd_k.h                 | 10 +-
 include/uapi/linux/userfaultfd.h              |  8 +-
 mm/hugetlb.c                                  |  9 +-
 mm/shmem.c                                    | 17 +++-
 mm/userfaultfd.c                              | 47 ++++++---
 .../testing/selftests/kvm/guest_memfd_test.c  | 99 +++++++++++++++++++
 virt/kvm/guest_memfd.c                        | 10 ++
 10 files changed, 188 insertions(+), 24 deletions(-)


base-commit: 3cc51efc17a2c41a480eed36b31c1773936717e0
-- 
2.47.1



* [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs
  2025-04-04 15:43 [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor Nikita Kalyazin
@ 2025-04-04 15:43 ` Nikita Kalyazin
  2025-06-10 22:22   ` Peter Xu
  2025-04-04 15:43 ` [PATCH v3 2/6] mm: provide can_userfault vma operation Nikita Kalyazin
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Nikita Kalyazin @ 2025-04-04 15:43 UTC (permalink / raw)
  To: akpm, pbonzini, shuah, viro, brauner, muchun.song, hughd
  Cc: kvm, linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	lorenzo.stoakes, Liam.Howlett, jannh, ryan.roberts, david,
	jthoughton, peterx, graf, jgowans, roypat, derekmn, nsaenz,
	xmarcalx, kalyazin

Remove shmem-specific code from UFFDIO_CONTINUE implementation for
non-huge pages by calling vm_ops->fault().  A new VMF flag,
FAULT_FLAG_USERFAULT_CONTINUE, is introduced to avoid a recursive call to
handle_userfault().

Suggested-by: James Houghton <jthoughton@google.com>
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 include/linux/mm_types.h |  4 ++++
 mm/hugetlb.c             |  2 +-
 mm/shmem.c               |  9 ++++++---
 mm/userfaultfd.c         | 37 +++++++++++++++++++++++++++----------
 4 files changed, 38 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0234f14f2aa6..2f26ee9742bf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1429,6 +1429,9 @@ enum tlb_flush_reason {
  * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
  *                        We should only access orig_pte if this flag set.
  * @FAULT_FLAG_VMA_LOCK: The fault is handled under VMA lock.
+ * @FAULT_FLAG_USERFAULT_CONTINUE: The fault handler must not call userfaultfd
+ *                                 minor handler as it is being called by the
+ *                                 userfaultfd code itself.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -1467,6 +1470,7 @@ enum fault_flag {
 	FAULT_FLAG_UNSHARE =		1 << 10,
 	FAULT_FLAG_ORIG_PTE_VALID =	1 << 11,
 	FAULT_FLAG_VMA_LOCK =		1 << 12,
+	FAULT_FLAG_USERFAULT_CONTINUE = 1 << 13,
 };
 
 typedef unsigned int __bitwise zap_flags_t;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 97930d44d460..c004cfdcd4e2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6228,7 +6228,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 		}
 
 		/* Check for page in userfault range. */
-		if (userfaultfd_minor(vma)) {
+		if (userfaultfd_minor(vma) && !(vmf->flags & FAULT_FLAG_USERFAULT_CONTINUE)) {
 			folio_unlock(folio);
 			folio_put(folio);
 			/* See comment in userfaultfd_missing() block above */
diff --git a/mm/shmem.c b/mm/shmem.c
index 1ede0800e846..b4159303fe59 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2467,7 +2467,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 	fault_mm = vma ? vma->vm_mm : NULL;
 
 	folio = filemap_get_entry(inode->i_mapping, index);
-	if (folio && vma && userfaultfd_minor(vma)) {
+	if (folio && vma && userfaultfd_minor(vma) &&
+	    !(vmf->flags & FAULT_FLAG_USERFAULT_CONTINUE)) {
 		if (!xa_is_value(folio))
 			folio_put(folio);
 		*fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
@@ -2727,6 +2728,8 @@ static vm_fault_t shmem_falloc_wait(struct vm_fault *vmf, struct inode *inode)
 static vm_fault_t shmem_fault(struct vm_fault *vmf)
 {
 	struct inode *inode = file_inode(vmf->vma->vm_file);
+	enum sgp_type sgp = vmf->flags & FAULT_FLAG_USERFAULT_CONTINUE ?
+	    SGP_NOALLOC : SGP_CACHE;
 	gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
 	struct folio *folio = NULL;
 	vm_fault_t ret = 0;
@@ -2743,8 +2746,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 	}
 
 	WARN_ON_ONCE(vmf->page != NULL);
-	err = shmem_get_folio_gfp(inode, vmf->pgoff, 0, &folio, SGP_CACHE,
-				  gfp, vmf, &ret);
+	err = shmem_get_folio_gfp(inode, vmf->pgoff, 0, &folio, sgp, gfp, vmf,
+				  &ret);
 	if (err)
 		return vmf_error(err);
 	if (folio) {
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d06453fa8aba..4b3dbc7dac64 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -380,30 +380,47 @@ static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
 	return ret;
 }
 
-/* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
+/* Handles UFFDIO_CONTINUE for all VMAs */
 static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
 				     struct vm_area_struct *dst_vma,
 				     unsigned long dst_addr,
 				     uffd_flags_t flags)
 {
-	struct inode *inode = file_inode(dst_vma->vm_file);
-	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
 	struct folio *folio;
 	struct page *page;
 	int ret;
+	struct vm_fault vmf = {
+		.vma = dst_vma,
+		.address = dst_addr,
+		.flags = FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE |
+		    FAULT_FLAG_USERFAULT_CONTINUE,
+		.pte = NULL,
+		.page = NULL,
+		.pgoff = linear_page_index(dst_vma, dst_addr),
+	};
+
+	if (!dst_vma->vm_ops || !dst_vma->vm_ops->fault)
+		return -EINVAL;
 
-	ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
-	/* Our caller expects us to return -EFAULT if we failed to find folio */
-	if (ret == -ENOENT)
+retry:
+	ret = dst_vma->vm_ops->fault(&vmf);
+	if (ret & VM_FAULT_ERROR) {
 		ret = -EFAULT;
-	if (ret)
 		goto out;
-	if (!folio) {
-		ret = -EFAULT;
+	}
+
+	if (ret & VM_FAULT_NOPAGE) {
+		ret = -EAGAIN;
 		goto out;
 	}
 
-	page = folio_file_page(folio, pgoff);
+	if (ret & VM_FAULT_RETRY)
+		goto retry;
+
+	page = vmf.page;
+	folio = page_folio(page);
+	BUG_ON(!folio);
+
 	if (PageHWPoison(page)) {
 		ret = -EIO;
 		goto out_release;
-- 
2.47.1



* [PATCH v3 2/6] mm: provide can_userfault vma operation
  2025-04-04 15:43 [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor Nikita Kalyazin
  2025-04-04 15:43 ` [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs Nikita Kalyazin
@ 2025-04-04 15:43 ` Nikita Kalyazin
  2025-04-04 15:43 ` [PATCH v3 3/6] mm: userfaultfd: use " Nikita Kalyazin
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Nikita Kalyazin @ 2025-04-04 15:43 UTC (permalink / raw)
  To: akpm, pbonzini, shuah, viro, brauner, muchun.song, hughd
  Cc: kvm, linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	lorenzo.stoakes, Liam.Howlett, jannh, ryan.roberts, david,
	jthoughton, peterx, graf, jgowans, roypat, derekmn, nsaenz,
	xmarcalx, kalyazin

The new operation allows the userfaultfd code to be decoupled from
dependencies on specific VMA types, namely shmem and hugetlb.  The
vm_flags bitmap argument is processed with "any" logic: if the VMA type
supports any of the flags set, the operation returns true.  This avoids
multiple calls when checking for __VM_UFFD_FLAGS.
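
As an illustration only (not part of this patch), a hypothetical
filesystem that supports the missing and minor modes but not
userfault-wp could implement the "any" semantics like this:

static bool foo_can_userfault(struct vm_area_struct *vma,
			      unsigned long vm_flags)
{
	/*
	 * "Any" logic: return true if at least one of the requested
	 * userfault modes is supported by this VMA type.
	 */
	return vm_flags & (VM_UFFD_MISSING | VM_UFFD_MINOR);
}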

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 include/linux/mm.h | 5 +++++
 mm/hugetlb.c       | 7 +++++++
 mm/shmem.c         | 8 ++++++++
 3 files changed, 20 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8483e09aeb2c..488d721d8542 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -680,6 +680,11 @@ struct vm_operations_struct {
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,
 					  unsigned long addr);
+	/*
+	 * True if the VMA supports userfault at least for one of the vm_flags
+	 */
+	bool (*can_userfault)(struct vm_area_struct *vma,
+			      unsigned long vm_flags);
 };
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c004cfdcd4e2..f3901c11e1fd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5143,6 +5143,12 @@ static unsigned long hugetlb_vm_op_pagesize(struct vm_area_struct *vma)
 	return huge_page_size(hstate_vma(vma));
 }
 
+static bool hugetlb_vm_op_can_userfault(struct vm_area_struct *vma,
+					unsigned long vm_flags)
+{
+	return true;
+}
+
 /*
  * We cannot handle pagefaults against hugetlb pages at all.  They cause
  * handle_mm_fault() to try to instantiate regular-sized pages in the
@@ -5168,6 +5174,7 @@ const struct vm_operations_struct hugetlb_vm_ops = {
 	.close = hugetlb_vm_op_close,
 	.may_split = hugetlb_vm_op_split,
 	.pagesize = hugetlb_vm_op_pagesize,
+	.can_userfault = hugetlb_vm_op_can_userfault,
 };
 
 static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
diff --git a/mm/shmem.c b/mm/shmem.c
index b4159303fe59..0b9e19abd1e9 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2891,6 +2891,12 @@ static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
 	return mpol_shared_policy_lookup(&SHMEM_I(inode)->policy, index);
 }
 
+static bool shmem_can_userfault(struct vm_area_struct *vma,
+				unsigned long vm_flags)
+{
+	return true;
+}
+
 static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info,
 			pgoff_t index, unsigned int order, pgoff_t *ilx)
 {
@@ -5309,6 +5315,7 @@ static const struct vm_operations_struct shmem_vm_ops = {
 	.set_policy     = shmem_set_policy,
 	.get_policy     = shmem_get_policy,
 #endif
+	.can_userfault  = shmem_can_userfault,
 };
 
 static const struct vm_operations_struct shmem_anon_vm_ops = {
@@ -5318,6 +5325,7 @@ static const struct vm_operations_struct shmem_anon_vm_ops = {
 	.set_policy     = shmem_set_policy,
 	.get_policy     = shmem_get_policy,
 #endif
+	.can_userfault  = shmem_can_userfault,
 };
 
 int shmem_init_fs_context(struct fs_context *fc)
-- 
2.47.1



* [PATCH v3 3/6] mm: userfaultfd: use can_userfault vma operation
  2025-04-04 15:43 [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor Nikita Kalyazin
  2025-04-04 15:43 ` [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs Nikita Kalyazin
  2025-04-04 15:43 ` [PATCH v3 2/6] mm: provide can_userfault vma operation Nikita Kalyazin
@ 2025-04-04 15:43 ` Nikita Kalyazin
  2025-04-04 15:43 ` [PATCH v3 4/6] KVM: guest_memfd: add support for userfaultfd minor Nikita Kalyazin
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Nikita Kalyazin @ 2025-04-04 15:43 UTC (permalink / raw)
  To: akpm, pbonzini, shuah, viro, brauner, muchun.song, hughd
  Cc: kvm, linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	lorenzo.stoakes, Liam.Howlett, jannh, ryan.roberts, david,
	jthoughton, peterx, graf, jgowans, roypat, derekmn, nsaenz,
	xmarcalx, kalyazin

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 include/linux/userfaultfd_k.h | 13 ++++++-------
 mm/userfaultfd.c              | 10 +++++++---
 2 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 75342022d144..64551e8a55fb 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -221,8 +221,8 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
 	if (vm_flags & VM_DROPPABLE)
 		return false;
 
-	if ((vm_flags & VM_UFFD_MINOR) &&
-	    (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
+	if (!vma->vm_ops->can_userfault ||
+	    !vma->vm_ops->can_userfault(vma, VM_UFFD_MINOR))
 		return false;
 
 	/*
@@ -235,16 +235,15 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
 #ifndef CONFIG_PTE_MARKER_UFFD_WP
 	/*
 	 * If user requested uffd-wp but not enabled pte markers for
-	 * uffd-wp, then shmem & hugetlbfs are not supported but only
-	 * anonymous.
+	 * uffd-wp, then only anonymous is supported.
 	 */
 	if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
 		return false;
 #endif
 
-	/* By default, allow any of anon|shmem|hugetlb */
-	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
-	    vma_is_shmem(vma);
+	return vma_is_anonymous(vma) ||
+	    (vma->vm_ops->can_userfault &&
+	     vma->vm_ops->can_userfault(vma, vm_flags));
 }
 
 static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 4b3dbc7dac64..0aa82c968e16 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -728,6 +728,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 	unsigned long src_addr, dst_addr;
 	long copied;
 	struct folio *folio;
+	bool can_userfault;
 
 	/*
 	 * Sanitize the command parameters:
@@ -787,10 +788,13 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 		return  mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
 					     src_start, len, flags);
 
-	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
+	can_userfault = dst_vma->vm_ops->can_userfault &&
+	    dst_vma->vm_ops->can_userfault(dst_vma, __VM_UFFD_FLAGS);
+
+	if (!vma_is_anonymous(dst_vma) && !can_userfault)
 		goto out_unlock;
-	if (!vma_is_shmem(dst_vma) &&
-	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+
+	if (!can_userfault && uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
 		goto out_unlock;
 
 	while (src_addr < src_start + len) {
-- 
2.47.1



* [PATCH v3 4/6] KVM: guest_memfd: add support for userfaultfd minor
  2025-04-04 15:43 [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor Nikita Kalyazin
                   ` (2 preceding siblings ...)
  2025-04-04 15:43 ` [PATCH v3 3/6] mm: userfaultfd: use " Nikita Kalyazin
@ 2025-04-04 15:43 ` Nikita Kalyazin
  2025-06-10 22:25   ` Peter Xu
  2025-04-04 15:43 ` [PATCH v3 5/6] mm: userfaultfd: add UFFD_FEATURE_MINOR_GUEST_MEMFD Nikita Kalyazin
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Nikita Kalyazin @ 2025-04-04 15:43 UTC (permalink / raw)
  To: akpm, pbonzini, shuah, viro, brauner, muchun.song, hughd
  Cc: kvm, linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	lorenzo.stoakes, Liam.Howlett, jannh, ryan.roberts, david,
	jthoughton, peterx, graf, jgowans, roypat, derekmn, nsaenz,
	xmarcalx, kalyazin

Add support for sending a pagefault event if userfaultfd is registered.
Only the minor pagefault event is currently supported.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 virt/kvm/guest_memfd.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index fbf89e643add..096d89e7282d 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,9 @@
 #include <linux/kvm_host.h>
 #include <linux/pagemap.h>
 #include <linux/anon_inodes.h>
+#ifdef CONFIG_KVM_PRIVATE_MEM
+#include <linux/userfaultfd_k.h>
+#endif /* CONFIG_KVM_PRIVATE_MEM */
 
 #include "kvm_mm.h"
 
@@ -380,6 +383,13 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
 		kvm_gmem_mark_prepared(folio);
 	}
 
+	if (userfaultfd_minor(vmf->vma) &&
+	    !(vmf->flags & FAULT_FLAG_USERFAULT_CONTINUE)) {
+		folio_unlock(folio);
+		filemap_invalidate_unlock_shared(inode->i_mapping);
+		return handle_userfault(vmf, VM_UFFD_MINOR);
+	}
+
 	vmf->page = folio_file_page(folio, vmf->pgoff);
 
 out_folio:
-- 
2.47.1



* [PATCH v3 5/6] mm: userfaultfd: add UFFD_FEATURE_MINOR_GUEST_MEMFD
  2025-04-04 15:43 [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor Nikita Kalyazin
                   ` (3 preceding siblings ...)
  2025-04-04 15:43 ` [PATCH v3 4/6] KVM: guest_memfd: add support for userfaultfd minor Nikita Kalyazin
@ 2025-04-04 15:43 ` Nikita Kalyazin
  2025-04-04 15:43 ` [PATCH v3 6/6] KVM: selftests: test userfaultfd minor for guest_memfd Nikita Kalyazin
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Nikita Kalyazin @ 2025-04-04 15:43 UTC (permalink / raw)
  To: akpm, pbonzini, shuah, viro, brauner, muchun.song, hughd
  Cc: kvm, linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	lorenzo.stoakes, Liam.Howlett, jannh, ryan.roberts, david,
	jthoughton, peterx, graf, jgowans, roypat, derekmn, nsaenz,
	xmarcalx, kalyazin

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 fs/userfaultfd.c                 | 3 ++-
 include/uapi/linux/userfaultfd.h | 8 +++++++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 97c4d71115d8..32152bfa462a 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1954,7 +1954,8 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 	uffdio_api.features = UFFD_API_FEATURES;
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
 	uffdio_api.features &=
-		~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
+		~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM |
+		UFFD_FEATURE_MINOR_GUEST_MEMFD);
 #endif
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 	uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 2841e4ea8f2c..ed688797eba7 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -42,7 +42,8 @@
 			   UFFD_FEATURE_WP_UNPOPULATED |	\
 			   UFFD_FEATURE_POISON |		\
 			   UFFD_FEATURE_WP_ASYNC |		\
-			   UFFD_FEATURE_MOVE)
+			   UFFD_FEATURE_MOVE |			\
+			   UFFD_FEATURE_MINOR_GUEST_MEMFD)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -230,6 +231,10 @@ struct uffdio_api {
 	 *
 	 * UFFD_FEATURE_MOVE indicates that the kernel supports moving an
 	 * existing page contents from userspace.
+	 *
+	 * UFFD_FEATURE_MINOR_GUEST_MEMFD indicates the same support as
+	 * UFFD_FEATURE_MINOR_HUGETLBFS, but for guest_memfd-backed pages
+	 * instead.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -248,6 +253,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_POISON			(1<<14)
 #define UFFD_FEATURE_WP_ASYNC			(1<<15)
 #define UFFD_FEATURE_MOVE			(1<<16)
+#define UFFD_FEATURE_MINOR_GUEST_MEMFD		(1<<17)
 	__u64 features;
 
 	__u64 ioctls;
-- 
2.47.1



* [PATCH v3 6/6] KVM: selftests: test userfaultfd minor for guest_memfd
  2025-04-04 15:43 [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor Nikita Kalyazin
                   ` (4 preceding siblings ...)
  2025-04-04 15:43 ` [PATCH v3 5/6] mm: userfaultfd: add UFFD_FEATURE_MINOR_GUEST_MEMFD Nikita Kalyazin
@ 2025-04-04 15:43 ` Nikita Kalyazin
  2025-04-04 16:33 ` [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor Lorenzo Stoakes
  2025-04-04 17:12 ` Liam R. Howlett
  7 siblings, 0 replies; 28+ messages in thread
From: Nikita Kalyazin @ 2025-04-04 15:43 UTC (permalink / raw)
  To: akpm, pbonzini, shuah, viro, brauner, muchun.song, hughd
  Cc: kvm, linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	lorenzo.stoakes, Liam.Howlett, jannh, ryan.roberts, david,
	jthoughton, peterx, graf, jgowans, roypat, derekmn, nsaenz,
	xmarcalx, kalyazin

The test demonstrates that a minor userfaultfd event in guest_memfd can
be resolved via a memcpy followed by a UFFDIO_CONTINUE ioctl.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 .../testing/selftests/kvm/guest_memfd_test.c  | 99 +++++++++++++++++++
 1 file changed, 99 insertions(+)

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index 38c501e49e0e..9a06c2486218 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -10,12 +10,16 @@
 #include <errno.h>
 #include <stdio.h>
 #include <fcntl.h>
+#include <pthread.h>
 
 #include <linux/bitmap.h>
 #include <linux/falloc.h>
+#include <linux/userfaultfd.h>
 #include <sys/mman.h>
 #include <sys/types.h>
 #include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
 
 #include "kvm_util.h"
 #include "test_util.h"
@@ -206,6 +210,98 @@ static void test_create_guest_memfd_multiple(struct kvm_vm *vm)
 	close(fd1);
 }
 
+struct fault_args {
+	char *addr;
+	volatile char value;
+};
+
+static void *fault_thread_fn(void *arg)
+{
+	struct fault_args *args = arg;
+
+	/* Trigger page fault */
+	args->value = *args->addr;
+	return NULL;
+}
+
+static void test_uffd_minor(int fd, size_t page_size, size_t total_size)
+{
+	struct uffdio_register uffd_reg;
+	struct uffdio_continue uffd_cont;
+	struct uffd_msg msg;
+	struct fault_args args;
+	pthread_t fault_thread;
+	void *mem, *mem_nofault, *buf = NULL;
+	int uffd, ret;
+	off_t offset = page_size;
+	void *fault_addr;
+
+	ret = posix_memalign(&buf, page_size, total_size);
+	TEST_ASSERT_EQ(ret, 0);
+
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
+	TEST_ASSERT(uffd != -1, "userfaultfd creation should succeed");
+
+	struct uffdio_api uffdio_api = {
+		.api = UFFD_API,
+		.features = UFFD_FEATURE_MINOR_GUEST_MEMFD,
+	};
+	ret = ioctl(uffd, UFFDIO_API, &uffdio_api);
+	TEST_ASSERT(ret != -1, "ioctl(UFFDIO_API) should succeed");
+
+	mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem != MAP_FAILED, "mmap should succeed");
+
+	mem_nofault = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem_nofault != MAP_FAILED, "mmap should succeed");
+
+	uffd_reg.range.start = (unsigned long)mem;
+	uffd_reg.range.len = total_size;
+	uffd_reg.mode = UFFDIO_REGISTER_MODE_MINOR;
+	ret = ioctl(uffd, UFFDIO_REGISTER, &uffd_reg);
+	TEST_ASSERT(ret != -1, "ioctl(UFFDIO_REGISTER) should succeed");
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+			offset, page_size);
+	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) should succeed");
+
+	fault_addr = mem + offset;
+	args.addr = fault_addr;
+
+	ret = pthread_create(&fault_thread, NULL, fault_thread_fn, &args);
+	TEST_ASSERT(ret == 0, "pthread_create should succeed");
+
+	ret = read(uffd, &msg, sizeof(msg));
+	TEST_ASSERT(ret != -1, "read from userfaultfd should succeed");
+	TEST_ASSERT(msg.event == UFFD_EVENT_PAGEFAULT, "event type should be pagefault");
+	TEST_ASSERT((void *)(msg.arg.pagefault.address & ~(page_size - 1)) == fault_addr,
+		    "pagefault should occur at expected address");
+
+	memcpy(mem_nofault + offset, buf + offset, page_size);
+
+	uffd_cont.range.start = (unsigned long)fault_addr;
+	uffd_cont.range.len = page_size;
+	uffd_cont.mode = 0;
+	ret = ioctl(uffd, UFFDIO_CONTINUE, &uffd_cont);
+	TEST_ASSERT(ret != -1, "ioctl(UFFDIO_CONTINUE) should succeed");
+
+	TEST_ASSERT(args.value == *(char *)(mem_nofault + offset),
+		    "memory should contain the value that was copied");
+	TEST_ASSERT(args.value == *(char *)(mem + offset),
+		    "no further fault is expected");
+
+	ret = pthread_join(fault_thread, NULL);
+	TEST_ASSERT(ret == 0, "pthread_join should succeed");
+
+	ret = munmap(mem_nofault, total_size);
+	TEST_ASSERT(!ret, "munmap should succeed");
+
+	ret = munmap(mem, total_size);
+	TEST_ASSERT(!ret, "munmap should succeed");
+	free(buf);
+	close(uffd);
+}
+
 unsigned long get_shared_type(void)
 {
 #ifdef __x86_64__
@@ -244,6 +340,9 @@ void test_vm_type(unsigned long type, bool is_shared)
 	test_fallocate(fd, page_size, total_size);
 	test_invalid_punch_hole(fd, page_size, total_size);
 
+	if (is_shared)
+		test_uffd_minor(fd, page_size, total_size);
+
 	close(fd);
 	kvm_vm_release(vm);
 }
-- 
2.47.1



* Re: [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor
  2025-04-04 15:43 [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor Nikita Kalyazin
                   ` (5 preceding siblings ...)
  2025-04-04 15:43 ` [PATCH v3 6/6] KVM: selftests: test userfaultfd minor for guest_memfd Nikita Kalyazin
@ 2025-04-04 16:33 ` Lorenzo Stoakes
  2025-04-04 16:56   ` Nikita Kalyazin
  2025-04-04 17:12 ` Liam R. Howlett
  7 siblings, 1 reply; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-04-04 16:33 UTC (permalink / raw)
  To: Nikita Kalyazin
  Cc: akpm, pbonzini, shuah, viro, brauner, muchun.song, hughd, kvm,
	linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	Liam.Howlett, jannh, ryan.roberts, david, jthoughton, peterx,
	graf, jgowans, roypat, derekmn, nsaenz, xmarcalx

On Fri, Apr 04, 2025 at 03:43:46PM +0000, Nikita Kalyazin wrote:
> This series is built on top of the Fuad's v7 "mapping guest_memfd backed
> memory at the host" [1].

Hm if this is based on an unmerged series this seems quite speculative and
should maybe be an RFC? I mean that series at least still seems quite under
discussion/experiencing issues?

Maybe worth RFC'ing until that one settles down first to avoid complexity
in review/application to tree?

Thanks!

>
> With James's KVM userfault [2], it is possible to handle stage-2 faults
> in guest_memfd in userspace.  However, KVM itself also triggers faults
> in guest_memfd in some cases, for example: PV interfaces like kvmclock,
> PV EOI and page table walking code when fetching the MMIO instruction on
> x86.  It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
> that KVM would be accessing those pages via userspace page tables.  In
> order for such faults to be handled in userspace, guest_memfd needs to
> support userfaultfd.
>
> Changes since v2 [4]:
>  - James: Fix sgp type when calling shmem_get_folio_gfp
>  - James: Improved vm_ops->fault() error handling
>  - James: Add and make use of the can_userfault() VMA operation
>  - James: Add UFFD_FEATURE_MINOR_GUEST_MEMFD feature flag
>  - James: Fix typos and add more checks in the test
>
> Nikita
>
> [1] https://lore.kernel.org/kvm/20250318161823.4005529-1-tabba@google.com/T/
> [2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com/T/
> [3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?tab=t.0#heading=h.w1126rgli5e3
> [4] https://lore.kernel.org/kvm/20250402160721.97596-1-kalyazin@amazon.com/T/
>
> Nikita Kalyazin (6):
>   mm: userfaultfd: generic continue for non hugetlbfs
>   mm: provide can_userfault vma operation
>   mm: userfaultfd: use can_userfault vma operation
>   KVM: guest_memfd: add support for userfaultfd minor
>   mm: userfaultfd: add UFFD_FEATURE_MINOR_GUEST_MEMFD
>   KVM: selftests: test userfaultfd minor for guest_memfd
>
>  fs/userfaultfd.c                              |  3 +-
>  include/linux/mm.h                            |  5 +
>  include/linux/mm_types.h                      |  4 +
>  include/linux/userfaultfd_k.h                 | 10 +-
>  include/uapi/linux/userfaultfd.h              |  8 +-
>  mm/hugetlb.c                                  |  9 +-
>  mm/shmem.c                                    | 17 +++-
>  mm/userfaultfd.c                              | 47 ++++++---
>  .../testing/selftests/kvm/guest_memfd_test.c  | 99 +++++++++++++++++++
>  virt/kvm/guest_memfd.c                        | 10 ++
>  10 files changed, 188 insertions(+), 24 deletions(-)
>
>
> base-commit: 3cc51efc17a2c41a480eed36b31c1773936717e0
> --
> 2.47.1
>


* Re: [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor
  2025-04-04 16:33 ` [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor Lorenzo Stoakes
@ 2025-04-04 16:56   ` Nikita Kalyazin
  2025-04-04 16:59     ` Lorenzo Stoakes
  0 siblings, 1 reply; 28+ messages in thread
From: Nikita Kalyazin @ 2025-04-04 16:56 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, pbonzini, shuah, viro, brauner, muchun.song, hughd, kvm,
	linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	Liam.Howlett, jannh, ryan.roberts, david, jthoughton, peterx,
	graf, jgowans, roypat, derekmn, nsaenz, xmarcalx



On 04/04/2025 17:33, Lorenzo Stoakes wrote:
> On Fri, Apr 04, 2025 at 03:43:46PM +0000, Nikita Kalyazin wrote:
>> This series is built on top of the Fuad's v7 "mapping guest_memfd backed
>> memory at the host" [1].
> 
> Hm if this is based on an unmerged series this seems quite speculative and
> should maybe be an RFC? I mean that series at least still seems quite under
> discussion/experiencing issues?
> 
> Maybe worth RFC'ing until that one settles down first to avoid complexity
> in review/application to tree?

Hi,

I dropped the RFC tag because I saw similar examples before, but I'm 
happy to bring it back next time if the dependency is not merged by then.

> 
> Thanks!

Thanks!

> 
>>
>> With James's KVM userfault [2], it is possible to handle stage-2 faults
>> in guest_memfd in userspace.  However, KVM itself also triggers faults
>> in guest_memfd in some cases, for example: PV interfaces like kvmclock,
>> PV EOI and page table walking code when fetching the MMIO instruction on
>> x86.  It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
>> that KVM would be accessing those pages via userspace page tables.  In
>> order for such faults to be handled in userspace, guest_memfd needs to
>> support userfaultfd.
>>
>> Changes since v2 [4]:
>>   - James: Fix sgp type when calling shmem_get_folio_gfp
>>   - James: Improved vm_ops->fault() error handling
>>   - James: Add and make use of the can_userfault() VMA operation
>>   - James: Add UFFD_FEATURE_MINOR_GUEST_MEMFD feature flag
>>   - James: Fix typos and add more checks in the test
>>
>> Nikita
>>
>> [1] https://lore.kernel.org/kvm/20250318161823.4005529-1-tabba@google.com/T/
>> [2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com/T/
>> [3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?tab=t.0#heading=h.w1126rgli5e3
>> [4] https://lore.kernel.org/kvm/20250402160721.97596-1-kalyazin@amazon.com/T/
>>
>> Nikita Kalyazin (6):
>>    mm: userfaultfd: generic continue for non hugetlbfs
>>    mm: provide can_userfault vma operation
>>    mm: userfaultfd: use can_userfault vma operation
>>    KVM: guest_memfd: add support for userfaultfd minor
>>    mm: userfaultfd: add UFFD_FEATURE_MINOR_GUEST_MEMFD
>>    KVM: selftests: test userfaultfd minor for guest_memfd
>>
>>   fs/userfaultfd.c                              |  3 +-
>>   include/linux/mm.h                            |  5 +
>>   include/linux/mm_types.h                      |  4 +
>>   include/linux/userfaultfd_k.h                 | 10 +-
>>   include/uapi/linux/userfaultfd.h              |  8 +-
>>   mm/hugetlb.c                                  |  9 +-
>>   mm/shmem.c                                    | 17 +++-
>>   mm/userfaultfd.c                              | 47 ++++++---
>>   .../testing/selftests/kvm/guest_memfd_test.c  | 99 +++++++++++++++++++
>>   virt/kvm/guest_memfd.c                        | 10 ++
>>   10 files changed, 188 insertions(+), 24 deletions(-)
>>
>>
>> base-commit: 3cc51efc17a2c41a480eed36b31c1773936717e0
>> --
>> 2.47.1
>>



* Re: [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor
  2025-04-04 16:56   ` Nikita Kalyazin
@ 2025-04-04 16:59     ` Lorenzo Stoakes
  0 siblings, 0 replies; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-04-04 16:59 UTC (permalink / raw)
  To: Nikita Kalyazin
  Cc: akpm, pbonzini, shuah, viro, brauner, muchun.song, hughd, kvm,
	linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	Liam.Howlett, jannh, ryan.roberts, david, jthoughton, peterx,
	graf, jgowans, roypat, derekmn, nsaenz, xmarcalx

On Fri, Apr 04, 2025 at 05:56:58PM +0100, Nikita Kalyazin wrote:
>
>
> On 04/04/2025 17:33, Lorenzo Stoakes wrote:
> > On Fri, Apr 04, 2025 at 03:43:46PM +0000, Nikita Kalyazin wrote:
> > > This series is built on top of the Fuad's v7 "mapping guest_memfd backed
> > > memory at the host" [1].
> >
> > Hm if this is based on an unmerged series this seems quite speculative and
> > should maybe be an RFC? I mean that series at least still seems quite under
> > discussion/experiencing issues?
> >
> > Maybe worth RFC'ing until that one settles down first to avoid complexity
> > in review/application to tree?
>
> Hi,
>
> I dropped the RFC tag because I saw similar examples before, but I'm happy
> to bring it back next time if the dependency is not merged until then.

Yeah really sorry to be a pain haha, I realise this particular situation is
a bit unclear, but I think just for the sake of getting our ducks in a row
and ensuring things are settled on the baseline (and it's sort of a fairly
big baseline), it'd be best to bring it back!

>
> >
> > Thanks!
>
> Thanks!

Cheers!

>
> >
> > >
> > > With James's KVM userfault [2], it is possible to handle stage-2 faults
> > > in guest_memfd in userspace.  However, KVM itself also triggers faults
> > > in guest_memfd in some cases, for example: PV interfaces like kvmclock,
> > > PV EOI and page table walking code when fetching the MMIO instruction on
> > > x86.  It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
> > > that KVM would be accessing those pages via userspace page tables.  In
> > > order for such faults to be handled in userspace, guest_memfd needs to
> > > support userfaultfd.
> > >
> > > Changes since v2 [4]:
> > >   - James: Fix sgp type when calling shmem_get_folio_gfp
> > >   - James: Improved vm_ops->fault() error handling
> > >   - James: Add and make use of the can_userfault() VMA operation
> > >   - James: Add UFFD_FEATURE_MINOR_GUEST_MEMFD feature flag
> > >   - James: Fix typos and add more checks in the test
> > >
> > > Nikita
> > >
> > > [1] https://lore.kernel.org/kvm/20250318161823.4005529-1-tabba@google.com/T/
> > > [2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com/T/
> > > [3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?tab=t.0#heading=h.w1126rgli5e3
> > > [4] https://lore.kernel.org/kvm/20250402160721.97596-1-kalyazin@amazon.com/T/
> > >
> > > Nikita Kalyazin (6):
> > >    mm: userfaultfd: generic continue for non hugetlbfs
> > >    mm: provide can_userfault vma operation
> > >    mm: userfaultfd: use can_userfault vma operation
> > >    KVM: guest_memfd: add support for userfaultfd minor
> > >    mm: userfaultfd: add UFFD_FEATURE_MINOR_GUEST_MEMFD
> > >    KVM: selftests: test userfaultfd minor for guest_memfd
> > >
> > >   fs/userfaultfd.c                              |  3 +-
> > >   include/linux/mm.h                            |  5 +
> > >   include/linux/mm_types.h                      |  4 +
> > >   include/linux/userfaultfd_k.h                 | 10 +-
> > >   include/uapi/linux/userfaultfd.h              |  8 +-
> > >   mm/hugetlb.c                                  |  9 +-
> > >   mm/shmem.c                                    | 17 +++-
> > >   mm/userfaultfd.c                              | 47 ++++++---
> > >   .../testing/selftests/kvm/guest_memfd_test.c  | 99 +++++++++++++++++++
> > >   virt/kvm/guest_memfd.c                        | 10 ++
> > >   10 files changed, 188 insertions(+), 24 deletions(-)
> > >
> > >
> > > base-commit: 3cc51efc17a2c41a480eed36b31c1773936717e0
> > > --
> > > 2.47.1
> > >
>


* Re: [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor
  2025-04-04 15:43 [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor Nikita Kalyazin
                   ` (6 preceding siblings ...)
  2025-04-04 16:33 ` [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor Lorenzo Stoakes
@ 2025-04-04 17:12 ` Liam R. Howlett
  2025-04-07 11:04   ` Nikita Kalyazin
  7 siblings, 1 reply; 28+ messages in thread
From: Liam R. Howlett @ 2025-04-04 17:12 UTC (permalink / raw)
  To: Nikita Kalyazin, Ackerley Tng, Vishal Annapurve, Fuad Tabba
  Cc: akpm, pbonzini, shuah, viro, brauner, muchun.song, hughd, kvm,
	linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	lorenzo.stoakes, jannh, ryan.roberts, david, jthoughton, peterx,
	graf, jgowans, roypat, derekmn, nsaenz, xmarcalx

+To authors of v7 series referenced in [1]

* Nikita Kalyazin <kalyazin@amazon.com> [250404 11:44]:
> This series is built on top of the Fuad's v7 "mapping guest_memfd backed
> memory at the host" [1].

I didn't see their addresses in the to/cc, so I added them to my
response as I reference the v7 patch set below.

> 
> With James's KVM userfault [2], it is possible to handle stage-2 faults
> in guest_memfd in userspace.  However, KVM itself also triggers faults
> in guest_memfd in some cases, for example: PV interfaces like kvmclock,
> PV EOI and page table walking code when fetching the MMIO instruction on
> x86.  It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
> that KVM would be accessing those pages via userspace page tables.

Thanks for being open about the technical call, but it would be better
to capture the reasons and not the call date.  I explain why in the
linking section as well.

>In
> order for such faults to be handled in userspace, guest_memfd needs to
> support userfaultfd.
> 
> Changes since v2 [4]:
>  - James: Fix sgp type when calling shmem_get_folio_gfp
>  - James: Improved vm_ops->fault() error handling
>  - James: Add and make use of the can_userfault() VMA operation
>  - James: Add UFFD_FEATURE_MINOR_GUEST_MEMFD feature flag
>  - James: Fix typos and add more checks in the test
> 
> Nikita

Please slow down...

This patch is at v3, the v7 patch that you are building off has lockdep
issues [1] reported by one of the authors, and (sorry for sounding harsh
about the v7 of that patch) the cover letter reads a bit more like an
RFC than a set ready to go into linux-mm.

Maybe the lockdep issue is just a patch ordering thing or removed in a
later patch set, but that's not mentioned in the discovery email?

What exactly is the goal here and the path forward for the rest of us
trying to build on this once it's in mm-new/mm-unstable?

Note that mm-unstable is shared with a lot of other people through
linux-next, and we are really trying to stop breaking stuff on them.

Obviously v7 cannot go in until it works with lockdep - otherwise none
of us can use lockdep which is not okay.

Also, I am concerned about the amount of testing in the v7 and v3 patch
sets that did not bring up a lockdep issue.

> 
> [1] https://lore.kernel.org/kvm/20250318161823.4005529-1-tabba@google.com/T/
> [2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com/T/
> [3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?tab=t.0#heading=h.w1126rgli5e3

If there is anything we need to know about the decisions in the call and
that document, can you please pull it into this change log?

I don't think anyone can ensure google will not rename docs to some
other office theme tomorrow - as they famously ditch basically every
name and application.

Also, most of the community does not want to go to a 17 page (and
growing) spreadsheet to hunt down the facts when there is an acceptable
and ideal place to document them in git.  It's another barrier of entry
on reviewing your code as well.

But please, don't take this suggestion as carte blanche for copying a
conversation from the doc, just give us the technical reasons for your
decisions as briefly as possible.


> [4] https://lore.kernel.org/kvm/20250402160721.97596-1-kalyazin@amazon.com/T/

[1]. https://lore.kernel.org/all/diqz1puanquh.fsf@ackerleytng-ctop.c.googlers.com/

Thanks,
Liam


* Re: [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor
  2025-04-04 17:12 ` Liam R. Howlett
@ 2025-04-07 11:04   ` Nikita Kalyazin
  2025-04-07 13:40     ` Liam R. Howlett
  0 siblings, 1 reply; 28+ messages in thread
From: Nikita Kalyazin @ 2025-04-07 11:04 UTC (permalink / raw)
  To: Liam R. Howlett, Ackerley Tng, Vishal Annapurve, Fuad Tabba, akpm,
	pbonzini, shuah, viro, brauner, muchun.song, hughd, kvm,
	linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	lorenzo.stoakes, jannh, ryan.roberts, david, jthoughton, peterx,
	graf, jgowans, roypat, derekmn, nsaenz, xmarcalx



On 04/04/2025 18:12, Liam R. Howlett wrote:
> +To authors of v7 series referenced in [1]
> 
> * Nikita Kalyazin <kalyazin@amazon.com> [250404 11:44]:
>> This series is built on top of the Fuad's v7 "mapping guest_memfd backed
>> memory at the host" [1].
> 
> I didn't see their addresses in the to/cc, so I added them to my
> response as I reference the v7 patch set below.

Hi Liam,

Thanks for the feedback and for extending the list.

> 
>>
>> With James's KVM userfault [2], it is possible to handle stage-2 faults
>> in guest_memfd in userspace.  However, KVM itself also triggers faults
>> in guest_memfd in some cases, for example: PV interfaces like kvmclock,
>> PV EOI and page table walking code when fetching the MMIO instruction on
>> x86.  It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
>> that KVM would be accessing those pages via userspace page tables.
> 
> Thanks for being open about the technical call, but it would be better
> to capture the reasons and not the call date.  I explain why in the
> linking section as well.

Thanks for bringing that up.  The document mostly contains the decision 
itself.  The main alternative considered previously was temporarily 
reintroducing the pages to the direct map whenever a KVM-internal 
access is required.  That came with significant complexity in 
guaranteeing correctness in all cases [1].  Since the memslot structure 
already contains a guest memory pointer supplied by userspace, KVM can 
use it directly when in the VMM or vCPU context.  I will add this to 
the cover letter for the next version.
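
For reference, resolving a gfn through the memslot's userspace mapping 
is essentially what KVM's existing __gfn_to_hva_memslot() helper does; 
a simplified sketch (speculation hardening omitted, not code from this 
series):

/*
 * Conceptual fragment: resolve a gfn to a host virtual address via
 * the memslot's userspace mapping, so that a KVM-internal access
 * goes through the userspace page tables and can therefore be
 * handled by userfaultfd.
 */
static unsigned long gfn_to_hva_sketch(const struct kvm_memory_slot *slot,
				       gfn_t gfn)
{
	return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
}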

[1] 
https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/T/#m4f367c52bbad0f0ba7fb07ca347c7b37258a73e5

> 
>> In
>> order for such faults to be handled in userspace, guest_memfd needs to
>> support userfaultfd.
>>
>> Changes since v2 [4]:
>>   - James: Fix sgp type when calling shmem_get_folio_gfp
>>   - James: Improved vm_ops->fault() error handling
>>   - James: Add and make use of the can_userfault() VMA operation
>>   - James: Add UFFD_FEATURE_MINOR_GUEST_MEMFD feature flag
>>   - James: Fix typos and add more checks in the test
>>
>> Nikita
> 
> Please slow down...
> 
> This patch is at v3, the v7 patch that you are building off has lockdep
> issues [1] reported by one of the authors, and (sorry for sounding harsh
> about the v7 of that patch) the cover letter reads a bit more like an
> RFC than a set ready to go into linux-mm.

AFAIK the lockdep issue was reported on a v7 of a different change.
I'm basing my series on [2] ("KVM: Mapping guest_memfd backed memory at 
the host for software protected VMs"), while the issue was reported on 
[2] ("KVM: Restricted mapping of guest_memfd at the host and arm64 
support"), which is also built on top of [2].  Please correct me if I'm 
missing something.

The key feature that is required by my series is the ability to mmap 
guest_memfd when the VM type allows.  My understanding is that no-one is 
opposed to that as of now, which is why I assumed it was safe to build 
on top of it.

[2] https://lore.kernel.org/kvm/20250318161823.4005529-1-tabba@google.com/T/
[3] 
https://lore.kernel.org/all/diqz1puanquh.fsf@ackerleytng-ctop.c.googlers.com/T/

> 
> Maybe the lockdep issue is just a patch ordering thing or removed in a
> later patch set, but that's not mentioned in the discovery email?
> 
> What exactly is the goal here and the path forward for the rest of us
> trying to build on this once it's in mm-new/mm-unstable?
> 
> Note that mm-unstable is shared with a lot of other people through
> linux-next, and we are really trying to stop breaking stuff on them.
> 
> Obviously v7 cannot go in until it works with lockdep - otherwise none
> of us can use lockdep which is not okay.
> 
> Also, I am concerned about the amount of testing in the v7 and v3 patch
> sets that did not bring up a lockdep issue..
> 
>>
>> [1] https://lore.kernel.org/kvm/20250318161823.4005529-1-tabba@google.com/T/
>> [2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com/T/
>> [3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?tab=t.0#heading=h.w1126rgli5e3
> 
> If there is anything we need to know about the decisions in the call and
> that document, can you please pull it into this change log?
> 
> I don't think anyone can ensure google will not rename docs to some
> other office theme tomorrow - as they famously ditch basically every
> name and application.
> 
> Also, most of the community does not want to go to a 17 page (and
> growing) spreadsheet to hunt down the facts when there is an acceptable
> and ideal place to document them in git.  It's another barrier of entry
> on reviewing your code as well.
> 
> But please, don't take this suggestion as carte blanche for copying a
> conversation from the doc, just give us the technical reasons for your
> decisions as briefly as possible.
> 
> 
>> [4] https://lore.kernel.org/kvm/20250402160721.97596-1-kalyazin@amazon.com/T/
> 
> [1]. https://lore.kernel.org/all/diqz1puanquh.fsf@ackerleytng-ctop.c.googlers.com/
> 
> Thanks,
> Liam



* Re: [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor
  2025-04-07 11:04   ` Nikita Kalyazin
@ 2025-04-07 13:40     ` Liam R. Howlett
  2025-04-07 14:04       ` Nikita Kalyazin
  0 siblings, 1 reply; 28+ messages in thread
From: Liam R. Howlett @ 2025-04-07 13:40 UTC (permalink / raw)
  To: Nikita Kalyazin
  Cc: Ackerley Tng, Vishal Annapurve, Fuad Tabba, akpm, pbonzini, shuah,
	viro, brauner, muchun.song, hughd, kvm, linux-kselftest,
	linux-kernel, linux-mm, linux-fsdevel, jack, lorenzo.stoakes,
	jannh, ryan.roberts, david, jthoughton, peterx, graf, jgowans,
	roypat, derekmn, nsaenz, xmarcalx

* Nikita Kalyazin <kalyazin@amazon.com> [250407 07:04]:
> 
> 
> On 04/04/2025 18:12, Liam R. Howlett wrote:
> > +To authors of v7 series referenced in [1]
> > 
> > * Nikita Kalyazin <kalyazin@amazon.com> [250404 11:44]:
> > > This series is built on top of the Fuad's v7 "mapping guest_memfd backed
> > > memory at the host" [1].
> > 
> > I didn't see their addresses in the to/cc, so I added them to my
> > response as I reference the v7 patch set below.
> 
> Hi Liam,
> 
> Thanks for the feedback and for extending the list.
> 
> > 
> > > 
> > > With James's KVM userfault [2], it is possible to handle stage-2 faults
> > > in guest_memfd in userspace.  However, KVM itself also triggers faults
> > > in guest_memfd in some cases, for example: PV interfaces like kvmclock,
> > > PV EOI and page table walking code when fetching the MMIO instruction on
> > > x86.  It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
> > > that KVM would be accessing those pages via userspace page tables.
> > 
> > Thanks for being open about the technical call, but it would be better
> > to capture the reasons and not the call date.  I explain why in the
> > linking section as well.
> 
> Thanks for bringing that up.  The document mostly contains the decision
> itself.  The main alternative considered previously was a temporary
> reintroduction of the pages to the direct map whenever a KVM-internal access
> is required.  It was coming with a significant complexity of guaranteeing
> correctness in all cases [1].  Since the memslot structure already contains
> a guest memory pointer supplied by the userspace, KVM can use it directly
> when in the VMM or vCPU context.  I will add this in the cover for the next
> version.

Thank you.

> 
> [1] https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/T/#m4f367c52bbad0f0ba7fb07ca347c7b37258a73e5
> 
> > 
> > > In
> > > order for such faults to be handled in userspace, guest_memfd needs to
> > > support userfaultfd.
> > > 
> > > Changes since v2 [4]:
> > >   - James: Fix sgp type when calling shmem_get_folio_gfp
> > >   - James: Improved vm_ops->fault() error handling
> > >   - James: Add and make use of the can_userfault() VMA operation
> > >   - James: Add UFFD_FEATURE_MINOR_GUEST_MEMFD feature flag
> > >   - James: Fix typos and add more checks in the test
> > > 
> > > Nikita
> > 
> > Please slow down...
> > 
> > This patch is at v3, the v7 patch that you are building off has lockdep
> > issues [1] reported by one of the authors, and (sorry for sounding harsh
> > about the v7 of that patch) the cover letter reads a bit more like an
> > RFC than a set ready to go into linux-mm.
> 
> AFAIK the lockdep issue was reported on a v7 of a different change.
> I'm basing my series on [2] ("KVM: Mapping guest_memfd backed memory at the
> host for software protected VMs"), while the issue was reported on [2]
> ("KVM: Restricted mapping of guest_memfd at the host and arm64 support"),
> which is also built on top of [2].  Please correct me if I'm missing
> something.

I think you messed up the numbering in your statement above.

I believe you are making the point that I messed up which patches depend
on what and your code does not depend on faulty locking, which appears
to be the case.

There are a few issues with the required patch set?

> 
> The key feature that is required by my series is the ability to mmap
> guest_memfd when the VM type allows.  My understanding is no-one is opposed
> to that as of now, that's why I assumed it's safe to build on top of that.
> 
> [2] https://lore.kernel.org/kvm/20250318161823.4005529-1-tabba@google.com/T/
> [3] https://lore.kernel.org/all/diqz1puanquh.fsf@ackerleytng-ctop.c.googlers.com/T/

All of this is extremely confusing because the onus of figuring out what
the final code will look like is put on the reviewer.  As it is, we have
issues with people not doing enough review of the code (due to limited
time).  One way to get reviews is to make the barrier of entry as low as
possible.

I spent Friday going down a rabbit hole of patches referring to each
other as dependencies and I gave up.  It looks like I mistook one set of
patches as required vs them requiring the same in-flight ones as your
patches.

I am struggling to see how we can adequately support all of you given
the way the patches are sent out in batches with dependencies - it is
just too time consuming to sort out.

Thank you,
Liam



* Re: [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor
  2025-04-07 13:40     ` Liam R. Howlett
@ 2025-04-07 14:04       ` Nikita Kalyazin
  2025-04-07 14:24         ` Liam R. Howlett
  0 siblings, 1 reply; 28+ messages in thread
From: Nikita Kalyazin @ 2025-04-07 14:04 UTC (permalink / raw)
  To: Liam R. Howlett, Ackerley Tng, Vishal Annapurve, Fuad Tabba, akpm,
	pbonzini, shuah, viro, brauner, muchun.song, hughd, kvm,
	linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	lorenzo.stoakes, jannh, ryan.roberts, david, jthoughton, peterx,
	graf, jgowans, roypat, derekmn, nsaenz, xmarcalx



On 07/04/2025 14:40, Liam R. Howlett wrote:
> * Nikita Kalyazin <kalyazin@amazon.com> [250407 07:04]:
>>
>>
>> On 04/04/2025 18:12, Liam R. Howlett wrote:
>>> +To authors of v7 series referenced in [1]
>>>
>>> * Nikita Kalyazin <kalyazin@amazon.com> [250404 11:44]:
>>>> This series is built on top of the Fuad's v7 "mapping guest_memfd backed
>>>> memory at the host" [1].
>>>
>>> I didn't see their addresses in the to/cc, so I added them to my
>>> response as I reference the v7 patch set below.
>>
>> Hi Liam,
>>
>> Thanks for the feedback and for extending the list.
>>
>>>
>>>>
>>>> With James's KVM userfault [2], it is possible to handle stage-2 faults
>>>> in guest_memfd in userspace.  However, KVM itself also triggers faults
>>>> in guest_memfd in some cases, for example: PV interfaces like kvmclock,
>>>> PV EOI and page table walking code when fetching the MMIO instruction on
>>>> x86.  It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
>>>> that KVM would be accessing those pages via userspace page tables.
>>>
>>> Thanks for being open about the technical call, but it would be better
>>> to capture the reasons and not the call date.  I explain why in the
>>> linking section as well.
>>
>> Thanks for bringing that up.  The document mostly contains the decision
>> itself.  The main alternative considered previously was a temporary
>> reintroduction of the pages to the direct map whenever a KVM-internal access
>> is required.  It was coming with a significant complexity of guaranteeing
>> correctness in all cases [1].  Since the memslot structure already contains
>> a guest memory pointer supplied by the userspace, KVM can use it directly
>> when in the VMM or vCPU context.  I will add this in the cover for the next
>> version.
> 
> Thank you.
> 
>>
>> [1] https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/T/#m4f367c52bbad0f0ba7fb07ca347c7b37258a73e5
>>
>>>
>>>> In
>>>> order for such faults to be handled in userspace, guest_memfd needs to
>>>> support userfaultfd.
>>>>
>>>> Changes since v2 [4]:
>>>>    - James: Fix sgp type when calling shmem_get_folio_gfp
>>>>    - James: Improved vm_ops->fault() error handling
>>>>    - James: Add and make use of the can_userfault() VMA operation
>>>>    - James: Add UFFD_FEATURE_MINOR_GUEST_MEMFD feature flag
>>>>    - James: Fix typos and add more checks in the test
>>>>
>>>> Nikita
>>>
>>> Please slow down...
>>>
>>> This patch is at v3, the v7 patch that you are building off has lockdep
>>> issues [1] reported by one of the authors, and (sorry for sounding harsh
>>> about the v7 of that patch) the cover letter reads a bit more like an
>>> RFC than a set ready to go into linux-mm.
>>
>> AFAIK the lockdep issue was reported on a v7 of a different change.
>> I'm basing my series on [2] ("KVM: Mapping guest_memfd backed memory at the
>> host for software protected VMs"), while the issue was reported on [2]
>> ("KVM: Restricted mapping of guest_memfd at the host and arm64 support"),
>> which is also built on top of [2].  Please correct me if I'm missing
>> something.
> 
> I think you messed up the numbering in your statement above.

I did, in an attempt to make it "even more clear" :) Sorry about that, 
glad you got the intention.

> 
> I believe you are making the point that I messed up which patches depend
> on what and your code does not depend on faulty locking, which appears
> to be the case.
> 
> There are a few issues with the required patch set?

There are indeed, but not in the part this series depends on, as far as 
I can see.

> 
>>
>> The key feature that is required by my series is the ability to mmap
>> guest_memfd when the VM type allows.  My understanding is no-one is opposed
>> to that as of now, that's why I assumed it's safe to build on top of that.
>>
>> [2] https://lore.kernel.org/kvm/20250318161823.4005529-1-tabba@google.com/T/
>> [3] https://lore.kernel.org/all/diqz1puanquh.fsf@ackerleytng-ctop.c.googlers.com/T/
> 
> All of this is extremely confusing because the onus of figuring out what
> the final code will look like is put on the reviewer.  As it is, we have
> issues with people not doing enough review of the code (due to limited
> time).  One way to get reviews is to make the barrier of entry as low as
> possible.
> 
> I spent Friday going down a rabbit hole of patches referring to each
> other as dependencies and I gave up.  It looks like I mistook one set of
> patches as required vs them requiring the same in-flight ones as your
> patches.
> 
> I am struggling to see how we can adequately support all of you given
> the way the patches are sent out in batches with dependencies - it is
> just too time consuming to sort out.

I'm happy to do whatever I can to make the review easier.  I suppose the 
extreme case is to wait for the dependencies to get accepted, 
effectively serialising submissions, but that slows the process down 
significantly.  For example, I received very good feedback on v1 and v2 
of this series and was able to address it instead of waiting for the 
dependency.  Would including the required patches directly in the series 
help?  My only concern is in that case the same patch will be submitted 
multiple times (as a part of every depending series), but if it's 
better, I'll be doing that instead.

> 
> Thank you,
> Liam
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor
  2025-04-07 14:04       ` Nikita Kalyazin
@ 2025-04-07 14:24         ` Liam R. Howlett
  2025-04-07 14:46           ` David Hildenbrand
  0 siblings, 1 reply; 28+ messages in thread
From: Liam R. Howlett @ 2025-04-07 14:24 UTC (permalink / raw)
  To: Nikita Kalyazin
  Cc: Ackerley Tng, Vishal Annapurve, Fuad Tabba, akpm, pbonzini, shuah,
	viro, brauner, muchun.song, hughd, kvm, linux-kselftest,
	linux-kernel, linux-mm, linux-fsdevel, jack, lorenzo.stoakes,
	jannh, ryan.roberts, david, jthoughton, peterx, graf, jgowans,
	roypat, derekmn, nsaenz, xmarcalx

* Nikita Kalyazin <kalyazin@amazon.com> [250407 10:05]:
> 

...

> > 
> > All of this is extremely confusing because the onus of figuring out what
> > the final code will look like is put on the reviewer.  As it is, we have
> > issues with people not doing enough review of the code (due to limited
> > time).  One way to get reviews is to make the barrier of entry as low as
> > possible.
> > 
> > I spent Friday going down a rabbit hole of patches referring to each
> > other as dependencies and I gave up.  It looks like I mistook one set of
> > patches as required vs them requiring the same in-flight ones as your
> > patches.
> > 
> > I am struggling to see how we can adequately support all of you given
> > the way the patches are sent out in batches with dependencies - it is
> > just too time consuming to sort out.
> 
> I'm happy to do whatever I can to make the review easier.  I suppose the
> extreme case is to wait for the dependencies to get accepted, effectively
> serialising submissions, but that slows the process down significantly.  For
> example, I received very good feedback on v1 and v2 of this series and was
> able to address it instead of waiting for the dependency.  Would including
> the required patches directly in the series help?  My only concern is in
> that case the same patch will be submitted multiple times (as a part of
> every depending series), but if it's better, I'll be doing that instead.

Don't resend patches that someone else is upstreaming, that'll cause
other problems.

Three methods come to mind:

1. As you stated, wait for the dependencies to land.  This will mean
what you are working against is well tested and won't change (and you
won't have to re-spin due to an unstable base).

2. Combine them into a bigger patch set.  I can then pull one patch set
and look at the parts of interest to the mm side.

3. Provide a git repo with the necessary changes together.

I think 2 and 3 together should be used for the guest_memfd patches.
Someone needs to be managing these to send upstream.  See the discussion
in another patch set on guest_memfd here [1].

As this is not based on fully upstream patches, this should be marked as
RFC, imo.

Thanks,
Liam

[1]. https://lore.kernel.org/all/aizia2elwspxcmfrjote5h7k5wdw2stp42slytkl5visrjvzwi@jj3lwuudiyjk/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor
  2025-04-07 14:24         ` Liam R. Howlett
@ 2025-04-07 14:46           ` David Hildenbrand
  2025-04-07 15:14             ` Lorenzo Stoakes
  2025-04-08  8:20             ` Christian Brauner
  0 siblings, 2 replies; 28+ messages in thread
From: David Hildenbrand @ 2025-04-07 14:46 UTC (permalink / raw)
  To: Liam R. Howlett, Nikita Kalyazin, Ackerley Tng, Vishal Annapurve,
	Fuad Tabba, akpm, pbonzini, shuah, viro, brauner, muchun.song,
	hughd, kvm, linux-kselftest, linux-kernel, linux-mm,
	linux-fsdevel, jack, lorenzo.stoakes, jannh, ryan.roberts,
	jthoughton, peterx, graf, jgowans, roypat, derekmn, nsaenz,
	xmarcalx

On 07.04.25 16:24, Liam R. Howlett wrote:
> * Nikita Kalyazin <kalyazin@amazon.com> [250407 10:05]:
>>
> 
> ...
> 
>>>
>>> All of this is extremely confusing because the onus of figuring out what
>>> the final code will look like is put on the reviewer.  As it is, we have
>>> issues with people not doing enough review of the code (due to limited
>>> time).  One way to get reviews is to make the barrier of entry as low as
>>> possible.
>>>
>>> I spent Friday going down a rabbit hole of patches referring to each
>>> other as dependencies and I gave up.  It looks like I mistook one set of
>>> patches as required vs them requiring the same in-flight ones as your
>>> patches.
>>>
>>> I am struggling to see how we can adequately support all of you given
>>> the way the patches are sent out in batches with dependencies - it is
>>> just too time consuming to sort out.
>>
>> I'm happy to do whatever I can to make the review easier.  I suppose the
>> extreme case is to wait for the dependencies to get accepted, effectively
>> serialising submissions, but that slows the process down significantly.  For
>> example, I received very good feedback on v1 and v2 of this series and was
>> able to address it instead of waiting for the dependency.  Would including
>> the required patches directly in the series help?  My only concern is in
>> that case the same patch will be submitted multiple times (as a part of
>> every depending series), but if it's better, I'll be doing that instead.
> 
> Don't resend patches that someone else is upstreaming, that'll cause
> other problems.
> 
> Three methods come to mind:
> 
> 1. As you stated, wait for the dependencies to land.  This is will mean
> what you are working against is well tested and won't change (and you
> won't have to re-spin due to an unstable base).
> 
> 2. Combine them into a bigger patch set.  I can then pull one patch set
> and look at the parts of interest to the mm side.
> 
> 3. Provide a git repo with the necessary changes together.
> 
> I think 2 and 3 together should be used for the guest_memfd patches.
> Someone needs to be managing these to send upstream.  See the discussion
> in another patch set on guest_memfd here [1].

The issue is that most extensions are fairly independent from each
other, except that they all build on top of Fuad's mmap support.

Sending all together as one thing might not be the best option.

Once basic mmap support is upstream, some of the extensions (e.g., 
directmap removal) can go in next.

So until that is upstream, I agree that tagging the stuff that builds up 
on that is the right thing to do, and providing git trees is another 
very good idea.

I'll prioritize getting Fuad's mmap stuff reviewed. (I keep saying that, 
I know)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor
  2025-04-07 14:46           ` David Hildenbrand
@ 2025-04-07 15:14             ` Lorenzo Stoakes
  2025-04-07 15:26               ` David Hildenbrand
  2025-04-08  8:20             ` Christian Brauner
  1 sibling, 1 reply; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-04-07 15:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Liam R. Howlett, Nikita Kalyazin, Ackerley Tng, Vishal Annapurve,
	Fuad Tabba, akpm, pbonzini, shuah, viro, brauner, muchun.song,
	hughd, kvm, linux-kselftest, linux-kernel, linux-mm,
	linux-fsdevel, jack, jannh, ryan.roberts, jthoughton, peterx,
	graf, jgowans, roypat, derekmn, nsaenz, xmarcalx

On Mon, Apr 07, 2025 at 04:46:48PM +0200, David Hildenbrand wrote:
> On 07.04.25 16:24, Liam R. Howlett wrote:
> > * Nikita Kalyazin <kalyazin@amazon.com> [250407 10:05]:
> > >
> >
> > ...
> >
> > > >
> > > > All of this is extremely confusing because the onus of figuring out what
> > > > the final code will look like is put on the reviewer.  As it is, we have
> > > > issues with people not doing enough review of the code (due to limited
> > > > time).  One way to get reviews is to make the barrier of entry as low as
> > > > possible.
> > > >
> > > > I spent Friday going down a rabbit hole of patches referring to each
> > > > other as dependencies and I gave up.  It looks like I mistook one set of
> > > > patches as required vs them requiring the same in-flight ones as your
> > > > patches.
> > > >
> > > > I am struggling to see how we can adequately support all of you given
> > > > the way the patches are sent out in batches with dependencies - it is
> > > > just too time consuming to sort out.
> > >
> > > I'm happy to do whatever I can to make the review easier.  I suppose the
> > > extreme case is to wait for the dependencies to get accepted, effectively
> > > serialising submissions, but that slows the process down significantly.  For
> > > example, I received very good feedback on v1 and v2 of this series and was
> > > able to address it instead of waiting for the dependency.  Would including
> > > the required patches directly in the series help?  My only concern is in
> > > that case the same patch will be submitted multiple times (as a part of
> > > every depending series), but if it's better, I'll be doing that instead.
> >
> > Don't resend patches that someone else is upstreaming, that'll cause
> > other problems.
> >
> > Three methods come to mind:
> >
> > 1. As you stated, wait for the dependencies to land.  This is will mean
> > what you are working against is well tested and won't change (and you
> > won't have to re-spin due to an unstable base).
> >
> > 2. Combine them into a bigger patch set.  I can then pull one patch set
> > and look at the parts of interest to the mm side.
> >
> > 3. Provide a git repo with the necessary changes together.
> >
> > I think 2 and 3 together should be used for the guest_memfd patches.
> > Someone needs to be managing these to send upstream.  See the discussion
> > in another patch set on guest_memfd here [1].
>
> The issue is that most extensions are fairly independent from each other,
> except that they built up on Fuad's mmap support,
>
> Sending all together as one thing might not be the best option.
>
> Once basic mmap support is upstream, some of the extensions (e.g., directmap
> removal) can go in next.
>
> So until that is upstream, I agree that tagging the stuff that builds up on
> that is the right thing to do, and providing git trees is another very good
> idea.
>
> I'll prioritize getting Fuad's mmap stuff reviewed. (I keep saying that, I
> know)

Which series is this? Sorry maybe lost track of this one.

>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor
  2025-04-07 15:14             ` Lorenzo Stoakes
@ 2025-04-07 15:26               ` David Hildenbrand
  0 siblings, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2025-04-07 15:26 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Liam R. Howlett, Nikita Kalyazin, Ackerley Tng, Vishal Annapurve,
	Fuad Tabba, akpm, pbonzini, shuah, viro, brauner, muchun.song,
	hughd, kvm, linux-kselftest, linux-kernel, linux-mm,
	linux-fsdevel, jack, jannh, ryan.roberts, jthoughton, peterx,
	graf, jgowans, roypat, derekmn, nsaenz, xmarcalx

On 07.04.25 17:14, Lorenzo Stoakes wrote:
> On Mon, Apr 07, 2025 at 04:46:48PM +0200, David Hildenbrand wrote:
>> On 07.04.25 16:24, Liam R. Howlett wrote:
>>> * Nikita Kalyazin <kalyazin@amazon.com> [250407 10:05]:
>>>>
>>>
>>> ...
>>>
>>>>>
>>>>> All of this is extremely confusing because the onus of figuring out what
>>>>> the final code will look like is put on the reviewer.  As it is, we have
>>>>> issues with people not doing enough review of the code (due to limited
>>>>> time).  One way to get reviews is to make the barrier of entry as low as
>>>>> possible.
>>>>>
>>>>> I spent Friday going down a rabbit hole of patches referring to each
>>>>> other as dependencies and I gave up.  It looks like I mistook one set of
>>>>> patches as required vs them requiring the same in-flight ones as your
>>>>> patches.
>>>>>
>>>>> I am struggling to see how we can adequately support all of you given
>>>>> the way the patches are sent out in batches with dependencies - it is
>>>>> just too time consuming to sort out.
>>>>
>>>> I'm happy to do whatever I can to make the review easier.  I suppose the
>>>> extreme case is to wait for the dependencies to get accepted, effectively
>>>> serialising submissions, but that slows the process down significantly.  For
>>>> example, I received very good feedback on v1 and v2 of this series and was
>>>> able to address it instead of waiting for the dependency.  Would including
>>>> the required patches directly in the series help?  My only concern is in
>>>> that case the same patch will be submitted multiple times (as a part of
>>>> every depending series), but if it's better, I'll be doing that instead.
>>>
>>> Don't resend patches that someone else is upstreaming, that'll cause
>>> other problems.
>>>
>>> Three methods come to mind:
>>>
>>> 1. As you stated, wait for the dependencies to land.  This is will mean
>>> what you are working against is well tested and won't change (and you
>>> won't have to re-spin due to an unstable base).
>>>
>>> 2. Combine them into a bigger patch set.  I can then pull one patch set
>>> and look at the parts of interest to the mm side.
>>>
>>> 3. Provide a git repo with the necessary changes together.
>>>
>>> I think 2 and 3 together should be used for the guest_memfd patches.
>>> Someone needs to be managing these to send upstream.  See the discussion
>>> in another patch set on guest_memfd here [1].
>>
>> The issue is that most extensions are fairly independent from each other,
>> except that they built up on Fuad's mmap support,
>>
>> Sending all together as one thing might not be the best option.
>>
>> Once basic mmap support is upstream, some of the extensions (e.g., directmap
>> removal) can go in next.
>>
>> So until that is upstream, I agree that tagging the stuff that builds up on
>> that is the right thing to do, and providing git trees is another very good
>> idea.
>>
>> I'll prioritize getting Fuad's mmap stuff reviewed. (I keep saying that, I
>> know)
> 
> Which series is this? Sorry maybe lost track of this one.

Heh, not your fault :)

The most important one for basic mmap support is "KVM: Mapping 
guest_memfd backed memory at the host for software protected VMs" [1]. 
Some stuff (e.g., direct map removal) should be able to make progress 
once that landed.

I do expect the MM-specific patch in there ("mm: Consolidate freeing of 
typed folios on final folio_put()") to not be included as part of that work.

[I shared the feedback from the LSF/MM session in the upstream 
guest_memfd call, and we decided to minimize the usage of the 
folio_put() callback to where absolutely required; that will simplify 
things and avoid issues as pointed out by Willy, which is great]

The next important one will be "[PATCH v7 0/7] KVM: Restricted mapping 
of guest_memfd at the host and arm64 support" [2], but I similarly 
expect a simplification as we try moving away from folio_put() for the 
"shared <-> private" page conversion case.


So I expect a v8 of [1] (and that also [2] needs to be updated).

@Fuad, please let me know if I am wrong.

[1] 
https://lore.kernel.org/all/20250318161823.4005529-1-tabba@google.com/T/#u
[2] https://lore.kernel.org/all/20250328153133.3504118-1-tabba@google.com/

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor
  2025-04-07 14:46           ` David Hildenbrand
  2025-04-07 15:14             ` Lorenzo Stoakes
@ 2025-04-08  8:20             ` Christian Brauner
  2025-04-08 13:15               ` Ackerley Tng
  1 sibling, 1 reply; 28+ messages in thread
From: Christian Brauner @ 2025-04-08  8:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Liam R. Howlett, Nikita Kalyazin, Ackerley Tng, Vishal Annapurve,
	Fuad Tabba, akpm, pbonzini, shuah, viro, muchun.song, hughd, kvm,
	linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	lorenzo.stoakes, jannh, ryan.roberts, jthoughton, peterx, graf,
	jgowans, roypat, derekmn, nsaenz, xmarcalx

On Mon, Apr 07, 2025 at 04:46:48PM +0200, David Hildenbrand wrote:
> On 07.04.25 16:24, Liam R. Howlett wrote:
> > * Nikita Kalyazin <kalyazin@amazon.com> [250407 10:05]:
> > > 
> > 
> > ...
> > 
> > > > 
> > > > All of this is extremely confusing because the onus of figuring out what
> > > > the final code will look like is put on the reviewer.  As it is, we have
> > > > issues with people not doing enough review of the code (due to limited
> > > > time).  One way to get reviews is to make the barrier of entry as low as
> > > > possible.
> > > > 
> > > > I spent Friday going down a rabbit hole of patches referring to each
> > > > other as dependencies and I gave up.  It looks like I mistook one set of
> > > > patches as required vs them requiring the same in-flight ones as your
> > > > patches.
> > > > 
> > > > I am struggling to see how we can adequately support all of you given
> > > > the way the patches are sent out in batches with dependencies - it is
> > > > just too time consuming to sort out.
> > > 
> > > I'm happy to do whatever I can to make the review easier.  I suppose the
> > > extreme case is to wait for the dependencies to get accepted, effectively
> > > serialising submissions, but that slows the process down significantly.  For
> > > example, I received very good feedback on v1 and v2 of this series and was
> > > able to address it instead of waiting for the dependency.  Would including
> > > the required patches directly in the series help?  My only concern is in
> > > that case the same patch will be submitted multiple times (as a part of
> > > every depending series), but if it's better, I'll be doing that instead.
> > 
> > Don't resend patches that someone else is upstreaming, that'll cause
> > other problems.
> > 
> > Three methods come to mind:
> > 
> > 1. As you stated, wait for the dependencies to land.  This is will mean
> > what you are working against is well tested and won't change (and you
> > won't have to re-spin due to an unstable base).
> > 
> > 2. Combine them into a bigger patch set.  I can then pull one patch set
> > and look at the parts of interest to the mm side.
> > 
> > 3. Provide a git repo with the necessary changes together.
> > 
> > I think 2 and 3 together should be used for the guest_memfd patches.
> > Someone needs to be managing these to send upstream.  See the discussion
> > in another patch set on guest_memfd here [1].
> 
> The issue is that most extensions are fairly independent from each other,
> except that they built up on Fuad's mmap support,
> 
> Sending all together as one thing might not be the best option.
> 
> Once basic mmap support is upstream, some of the extensions (e.g., directmap
> removal) can go in next.
> 
> So until that is upstream, I agree that tagging the stuff that builds up on
> that is the right thing to do, and providing git trees is another very good
> idea.
> 
> I'll prioritize getting Fuad's mmap stuff reviewed. (I keep saying that, I
> know)

Fwiw, b4 allows you to specify dependencies, so you can b4 shazam/am and it
will pull in all prerequisite patches:

b4 prep --edit-deps           Edit the series dependencies in your defined $EDITOR (or core.editor)
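
A minimal illustration of the resulting workflow (the lore URL below is a
placeholder, not a real series):

    # contributor: declare the prerequisite series for the one being prepared
    b4 prep --edit-deps

    # reviewer: apply the series together with its declared prerequisites
    b4 shazam https://lore.kernel.org/all/<message-id-of-this-series>/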

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor
  2025-04-08  8:20             ` Christian Brauner
@ 2025-04-08 13:15               ` Ackerley Tng
  0 siblings, 0 replies; 28+ messages in thread
From: Ackerley Tng @ 2025-04-08 13:15 UTC (permalink / raw)
  To: Christian Brauner, David Hildenbrand
  Cc: Liam R. Howlett, Nikita Kalyazin, Vishal Annapurve, Fuad Tabba,
	akpm, pbonzini, shuah, viro, muchun.song, hughd, kvm,
	linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	lorenzo.stoakes, jannh, ryan.roberts, jthoughton, peterx, graf,
	jgowans, roypat, derekmn, nsaenz, xmarcalx

Christian Brauner <brauner@kernel.org> writes:

> On Mon, Apr 07, 2025 at 04:46:48PM +0200, David Hildenbrand wrote:
>
> <snip>
>
> Fwiw, b4 allows to specify dependencies so you can b4 shazam/am and it
> will pull in all prerequisite patches:
>
> b4 prep --edit-deps           Edit the series dependencies in your defined $EDITOR (or core.editor)

Thank you for this tip! On this note, what are some good CONFIGs people
always enable during development?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs
  2025-04-04 15:43 ` [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs Nikita Kalyazin
@ 2025-06-10 22:22   ` Peter Xu
  2025-06-11 12:09     ` Nikita Kalyazin
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Xu @ 2025-06-10 22:22 UTC (permalink / raw)
  To: Nikita Kalyazin
  Cc: akpm, pbonzini, shuah, viro, brauner, muchun.song, hughd, kvm,
	linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	lorenzo.stoakes, Liam.Howlett, jannh, ryan.roberts, david,
	jthoughton, graf, jgowans, roypat, derekmn, nsaenz, xmarcalx

On Fri, Apr 04, 2025 at 03:43:47PM +0000, Nikita Kalyazin wrote:
> Remove shmem-specific code from UFFDIO_CONTINUE implementation for
> non-huge pages by calling vm_ops->fault().  A new VMF flag,
> FAULT_FLAG_USERFAULT_CONTINUE, is introduced to avoid recursive call to
> handle_userfault().

It's not yet clear why this needs to be generalized out of the blue.

Some mention of the guest_memfd use case might help other reviewers, as
would some mention of the need to introduce userfaultfd support in kernel
modules.

> 
> Suggested-by: James Houghton <jthoughton@google.com>
> Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
> ---
>  include/linux/mm_types.h |  4 ++++
>  mm/hugetlb.c             |  2 +-
>  mm/shmem.c               |  9 ++++++---
>  mm/userfaultfd.c         | 37 +++++++++++++++++++++++++++----------
>  4 files changed, 38 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 0234f14f2aa6..2f26ee9742bf 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1429,6 +1429,9 @@ enum tlb_flush_reason {
>   * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
>   *                        We should only access orig_pte if this flag set.
>   * @FAULT_FLAG_VMA_LOCK: The fault is handled under VMA lock.
> + * @FAULT_FLAG_USERFAULT_CONTINUE: The fault handler must not call userfaultfd
> + *                                 minor handler as it is being called by the
> + *                                 userfaultfd code itself.

We probably shouldn't leak the "CONTINUE" concept to mm core if possible,
as it's not easy to follow without the userfault minor context.  It might
be better to use generic terms like NO_USERFAULT.

Said that, I wonder if we'll need to add a vm_ops anyway in the latter
patch, whether we can also avoid reusing fault() but instead resolve the
page faults using the vm_ops hook too.  That might be helpful because then
we can avoid this new FAULT_FLAG_* that is totally not useful to
non-userfault users, meanwhile we also don't need to hand-cook the vm_fault
struct below just to suit the current fault() interfacing.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 4/6] KVM: guest_memfd: add support for userfaultfd minor
  2025-04-04 15:43 ` [PATCH v3 4/6] KVM: guest_memfd: add support for userfaultfd minor Nikita Kalyazin
@ 2025-06-10 22:25   ` Peter Xu
  2025-06-11 12:09     ` Nikita Kalyazin
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Xu @ 2025-06-10 22:25 UTC (permalink / raw)
  To: Nikita Kalyazin
  Cc: akpm, pbonzini, shuah, viro, brauner, muchun.song, hughd, kvm,
	linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	lorenzo.stoakes, Liam.Howlett, jannh, ryan.roberts, david,
	jthoughton, graf, jgowans, roypat, derekmn, nsaenz, xmarcalx

On Fri, Apr 04, 2025 at 03:43:50PM +0000, Nikita Kalyazin wrote:
> Add support for sending a pagefault event if userfaultfd is registered.
> Only page minor event is currently supported.
> 
> Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
> ---
>  virt/kvm/guest_memfd.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index fbf89e643add..096d89e7282d 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -4,6 +4,9 @@
>  #include <linux/kvm_host.h>
>  #include <linux/pagemap.h>
>  #include <linux/anon_inodes.h>
> +#ifdef CONFIG_KVM_PRIVATE_MEM
> +#include <linux/userfaultfd_k.h>
> +#endif /* CONFIG_KVM_PRIVATE_MEM */
>  
>  #include "kvm_mm.h"
>  
> @@ -380,6 +383,13 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>  		kvm_gmem_mark_prepared(folio);
>  	}
>  
> +	if (userfaultfd_minor(vmf->vma) &&
> +	    !(vmf->flags & FAULT_FLAG_USERFAULT_CONTINUE)) {
> +		folio_unlock(folio);
> +		filemap_invalidate_unlock_shared(inode->i_mapping);
> +		return handle_userfault(vmf, VM_UFFD_MINOR);
> +	}
> +

Hmm, does guest-memfd (when with your current approach) at least needs to
define the new can_userfault() hook?

Meanwhile, we have some hard-coded lines so far, like:

mfill_atomic():
	if (!vma_is_shmem(dst_vma) &&
	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
		goto out_unlock;

I thought it would fail guest-memfd already on a CONTINUE request, and it
doesn't seem to be touched yet in this series.

I'm not yet sure how the test worked out without hitting things like it.
Highly likely I missed something.  Some explanations would be welcomed.. 

Thanks,

>  	vmf->page = folio_file_page(folio, vmf->pgoff);
>  
>  out_folio:
> -- 
> 2.47.1
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs
  2025-06-10 22:22   ` Peter Xu
@ 2025-06-11 12:09     ` Nikita Kalyazin
  2025-06-11 12:56       ` Peter Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Nikita Kalyazin @ 2025-06-11 12:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: akpm, pbonzini, shuah, viro, brauner, muchun.song, hughd, kvm,
	linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	lorenzo.stoakes, Liam.Howlett, jannh, ryan.roberts, david,
	jthoughton, graf, jgowans, roypat, derekmn, nsaenz, xmarcalx



On 10/06/2025 23:22, Peter Xu wrote:
> On Fri, Apr 04, 2025 at 03:43:47PM +0000, Nikita Kalyazin wrote:
>> Remove shmem-specific code from UFFDIO_CONTINUE implementation for
>> non-huge pages by calling vm_ops->fault().  A new VMF flag,
>> FAULT_FLAG_USERFAULT_CONTINUE, is introduced to avoid recursive call to
>> handle_userfault().
> 
> It's not clear yet on why this is needed to be generalized out of the blue.
> 
> Some mentioning of guest_memfd use case might help for other reviewers, or
> some mention of the need to introduce userfaultfd support in kernel
> modules.

Hi Peter,

Sounds fair, thank you.

>>
>> Suggested-by: James Houghton <jthoughton@google.com>
>> Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
>> ---
>>   include/linux/mm_types.h |  4 ++++
>>   mm/hugetlb.c             |  2 +-
>>   mm/shmem.c               |  9 ++++++---
>>   mm/userfaultfd.c         | 37 +++++++++++++++++++++++++++----------
>>   4 files changed, 38 insertions(+), 14 deletions(-)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 0234f14f2aa6..2f26ee9742bf 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -1429,6 +1429,9 @@ enum tlb_flush_reason {
>>    * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
>>    *                        We should only access orig_pte if this flag set.
>>    * @FAULT_FLAG_VMA_LOCK: The fault is handled under VMA lock.
>> + * @FAULT_FLAG_USERFAULT_CONTINUE: The fault handler must not call userfaultfd
>> + *                                 minor handler as it is being called by the
>> + *                                 userfaultfd code itself.
> 
> We probably shouldn't leak the "CONTINUE" concept to mm core if possible,
> as it's not easy to follow when without userfault minor context.  It might
> be better to use generic terms like NO_USERFAULT.

Yes, I agree, can name it more generically.

> Said that, I wonder if we'll need to add a vm_ops anyway in the latter
> patch, whether we can also avoid reusing fault() but instead resolve the
> page faults using the vm_ops hook too.  That might be helpful because then
> we can avoid this new FAULT_FLAG_* that is totally not useful to
> non-userfault users, meanwhile we also don't need to hand-cook the vm_fault
> struct below just to suite the current fault() interfacing.

I'm not sure I fully understand that.  Calling fault() op helps us reuse 
the FS specifics when resolving the fault.  I get that the new op can 
imply the userfault flag so the flag doesn't need to be exposed to mm, 
but doing so will bring duplication of the logic within FSes between 
this new op and the fault(), unless we attempt to factor common parts 
out.  For example, for shmem_get_folio_gfp(), we would still need to 
find a way to suppress the call to handle_userfault() when 
shmem_get_folio_gfp() is called from the new op.  Is that what you're 
proposing?

> 
> Thanks,
> 
> --
> Peter Xu
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 4/6] KVM: guest_memfd: add support for userfaultfd minor
  2025-06-10 22:25   ` Peter Xu
@ 2025-06-11 12:09     ` Nikita Kalyazin
  0 siblings, 0 replies; 28+ messages in thread
From: Nikita Kalyazin @ 2025-06-11 12:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: akpm, pbonzini, shuah, viro, brauner, muchun.song, hughd, kvm,
	linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	lorenzo.stoakes, Liam.Howlett, jannh, ryan.roberts, david,
	jthoughton, graf, jgowans, roypat, derekmn, nsaenz, xmarcalx



On 10/06/2025 23:25, Peter Xu wrote:
> On Fri, Apr 04, 2025 at 03:43:50PM +0000, Nikita Kalyazin wrote:
>> Add support for sending a pagefault event if userfaultfd is registered.
>> Only page minor event is currently supported.
>>
>> Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
>> ---
>>   virt/kvm/guest_memfd.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index fbf89e643add..096d89e7282d 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -4,6 +4,9 @@
>>   #include <linux/kvm_host.h>
>>   #include <linux/pagemap.h>
>>   #include <linux/anon_inodes.h>
>> +#ifdef CONFIG_KVM_PRIVATE_MEM
>> +#include <linux/userfaultfd_k.h>
>> +#endif /* CONFIG_KVM_PRIVATE_MEM */
>>
>>   #include "kvm_mm.h"
>>
>> @@ -380,6 +383,13 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>>                kvm_gmem_mark_prepared(folio);
>>        }
>>
>> +     if (userfaultfd_minor(vmf->vma) &&
>> +         !(vmf->flags & FAULT_FLAG_USERFAULT_CONTINUE)) {
>> +             folio_unlock(folio);
>> +             filemap_invalidate_unlock_shared(inode->i_mapping);
>> +             return handle_userfault(vmf, VM_UFFD_MINOR);
>> +     }
>> +
> 
> Hmm, does guest-memfd (when with your current approach) at least needs to
> define the new can_userfault() hook?
> 
> Meanwhile, we have some hard-coded lines so far, like:
> 
> mfill_atomic():
>          if (!vma_is_shmem(dst_vma) &&
>              uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
>                  goto out_unlock;
> 
> I thought it would fail guest-memfd already on a CONTINUE request, and it
> doesn't seem to be touched yet in this series.
> 
> I'm not yet sure how the test worked out without hitting things like it.
> Highly likely I missed something.  Some explanations would be welcomed..

Yes, I realised that I'd failed to post this part soon after I sent the 
series, but I refrained from sending a new version because the upstream 
consensus was to review/merge the mmap support in guest_memfd [1] before 
continuing to build on top of it.  This is the missed part I planned to 
include in the next version.  Sorry for the confusion.

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 64551e8a55fb..080437fa7eab 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -221,8 +221,10 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
  	if (vm_flags & VM_DROPPABLE)
  		return false;

-	if (!vma->vm_ops->can_userfault ||
-	    !vma->vm_ops->can_userfault(vma, VM_UFFD_MINOR))
+	if ((vm_flags & VM_UFFD_MINOR) &&
+	     (!vma->vm_ops ||
+	      !vma->vm_ops->can_userfault ||
+	      !vma->vm_ops->can_userfault(vma, VM_UFFD_MINOR)))
  		return false;

  	/*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0aa82c968e16..638360a78561 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -788,7 +788,9 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
  		return  mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
  					     src_start, len, flags);

-	can_userfault = dst_vma->vm_ops->can_userfault &&
+	can_userfault =
+	    dst_vma->vm_ops &&
+	    dst_vma->vm_ops->can_userfault &&
  	    dst_vma->vm_ops->can_userfault(dst_vma, __VM_UFFD_FLAGS);

  	if (!vma_is_anonymous(dst_vma) && !can_userfault)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 91ee5dd91c31..202b12dc4b6f 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -420,8 +420,15 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
  	return ret;
  }

+static bool kvm_gmem_can_userfault(struct vm_area_struct *vma,
+				   unsigned long vm_flags)
+{
+	return vm_flags & VM_UFFD_MINOR;
+}
+
  static const struct vm_operations_struct kvm_gmem_vm_ops = {
-	.fault = kvm_gmem_fault,
+	.fault         = kvm_gmem_fault,
+	.can_userfault = kvm_gmem_can_userfault,
  };

  static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)


[1] https://lore.kernel.org/kvm/20250605153800.557144-1-tabba@google.com/

> 
> Thanks,
> 
>>        vmf->page = folio_file_page(folio, vmf->pgoff);
>>
>>   out_folio:
>> --
>> 2.47.1
>>
> 
> --
> Peter Xu
> 


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs
  2025-06-11 12:09     ` Nikita Kalyazin
@ 2025-06-11 12:56       ` Peter Xu
  2025-06-20 12:00         ` Nikita Kalyazin
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Xu @ 2025-06-11 12:56 UTC (permalink / raw)
  To: Nikita Kalyazin
  Cc: akpm, pbonzini, shuah, viro, brauner, muchun.song, hughd, kvm,
	linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	lorenzo.stoakes, Liam.Howlett, jannh, ryan.roberts, david,
	jthoughton, graf, jgowans, roypat, derekmn, nsaenz, xmarcalx

On Wed, Jun 11, 2025 at 01:09:32PM +0100, Nikita Kalyazin wrote:
> 
> 
> On 10/06/2025 23:22, Peter Xu wrote:
> > On Fri, Apr 04, 2025 at 03:43:47PM +0000, Nikita Kalyazin wrote:
> > > Remove shmem-specific code from UFFDIO_CONTINUE implementation for
> > > non-huge pages by calling vm_ops->fault().  A new VMF flag,
> > > FAULT_FLAG_USERFAULT_CONTINUE, is introduced to avoid recursive call to
> > > handle_userfault().
> > 
> > It's not clear yet on why this is needed to be generalized out of the blue.
> > 
> > Some mentioning of guest_memfd use case might help for other reviewers, or
> > some mention of the need to introduce userfaultfd support in kernel
> > modules.
> 
> Hi Peter,
> 
> Sounds fair, thank you.
> 
> > > 
> > > Suggested-by: James Houghton <jthoughton@google.com>
> > > Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
> > > ---
> > >   include/linux/mm_types.h |  4 ++++
> > >   mm/hugetlb.c             |  2 +-
> > >   mm/shmem.c               |  9 ++++++---
> > >   mm/userfaultfd.c         | 37 +++++++++++++++++++++++++++----------
> > >   4 files changed, 38 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 0234f14f2aa6..2f26ee9742bf 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -1429,6 +1429,9 @@ enum tlb_flush_reason {
> > >    * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
> > >    *                        We should only access orig_pte if this flag set.
> > >    * @FAULT_FLAG_VMA_LOCK: The fault is handled under VMA lock.
> > > + * @FAULT_FLAG_USERFAULT_CONTINUE: The fault handler must not call userfaultfd
> > > + *                                 minor handler as it is being called by the
> > > + *                                 userfaultfd code itself.
> > 
> > We probably shouldn't leak the "CONTINUE" concept to mm core if possible,
> > as it's not easy to follow when without userfault minor context.  It might
> > be better to use generic terms like NO_USERFAULT.
> 
> Yes, I agree, can name it more generically.
> 
> > Said that, I wonder if we'll need to add a vm_ops anyway in the latter
> > patch, whether we can also avoid reusing fault() but instead resolve the
> > page faults using the vm_ops hook too.  That might be helpful because then
> > we can avoid this new FAULT_FLAG_* that is totally not useful to
> > non-userfault users, meanwhile we also don't need to hand-cook the vm_fault
> > struct below just to suite the current fault() interfacing.
> 
> I'm not sure I fully understand that.  Calling fault() op helps us reuse the
> FS specifics when resolving the fault.  I get that the new op can imply the
> userfault flag so the flag doesn't need to be exposed to mm, but doing so
> will bring duplication of the logic within FSes between this new op and the
> fault(), unless we attempt to factor common parts out.  For example, for
> shmem_get_folio_gfp(), we would still need to find a way to suppress the
> call to handle_userfault() when shmem_get_folio_gfp() is called from the new
> op.  Is that what you're proposing?

Yes it is what I was proposing.  shmem_get_folio_gfp() always has that
handling when vmf==NULL, then vma==NULL and userfault will be skipped.
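
For reference, this is roughly the relevant part of shmem_get_folio_gfp()
(paraphrased rather than quoted exactly); with vmf == NULL, vma is NULL and
the handle_userfault() branch is never taken:

	struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
	...
	folio = filemap_get_entry(inode->i_mapping, index);
	if (folio && vma && userfaultfd_minor(vma)) {
		if (!xa_is_value(folio))
			folio_put(folio);
		*fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
		return 0;
	}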

So what I was thinking is one vm_ops.userfaultfd_request(req), where req
can be:

  (1) UFFD_REQ_GET_SUPPORTED: this should, for existing RAM-FSes return
      both MISSING/WP/MINOR.  Here WP should mean sync-wp tracking, async
      was so far by default almost supported everywhere except
      VM_DROPPABLE. For guest-memfd in the future, we can return MINOR only
      as of now (even if I think it shouldn't be hard to support the rest
      two..).

  (2) UFFD_REQ_FAULT_RESOLVE: this should play the fault() role but well
      defined to suite userfault's need on fault resolutions.  It likely
      doesn't need vmf as the parameter, but likely (when anon isn't taking
      into account, after all anon have vm_ops==NULL..) the inode and
      offsets, perhaps some flag would be needed to identify MISSING or
      MINOR faults, for example.

Maybe some more.

I was even thinking about whether we could merge hugetlb into the picture
too and generalize its fault resolutions.  Hugetlb was always special; maybe
this is a chance to make it generalized too, but it doesn't need to happen in
one shot even if it could work.  We could start with shmem.

So this does sound slightly involved, and I'm not yet 100% sure this
will work, but likely.  If you want, I can take a stab at this this week or
next just to see whether it'll work in general.  I also don't expect this
to depend on guest-memfd at all - it can stand alone as a refactoring making
userfault module-ready.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs
  2025-06-11 12:56       ` Peter Xu
@ 2025-06-20 12:00         ` Nikita Kalyazin
  2025-06-20 15:21           ` Peter Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Nikita Kalyazin @ 2025-06-20 12:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: akpm, pbonzini, shuah, viro, brauner, muchun.song, hughd, kvm,
	linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	lorenzo.stoakes, Liam.Howlett, jannh, ryan.roberts, david,
	jthoughton, graf, jgowans, roypat, derekmn, nsaenz, xmarcalx

On 11/06/2025 13:56, Peter Xu wrote:
> On Wed, Jun 11, 2025 at 01:09:32PM +0100, Nikita Kalyazin wrote:
>>
>>
>> On 10/06/2025 23:22, Peter Xu wrote:
>>> On Fri, Apr 04, 2025 at 03:43:47PM +0000, Nikita Kalyazin wrote:
>>>> Remove shmem-specific code from UFFDIO_CONTINUE implementation for
>>>> non-huge pages by calling vm_ops->fault().  A new VMF flag,
>>>> FAULT_FLAG_USERFAULT_CONTINUE, is introduced to avoid recursive call to
>>>> handle_userfault().
>>>
>>> It's not clear yet on why this is needed to be generalized out of the blue.
>>>
>>> Some mentioning of guest_memfd use case might help for other reviewers, or
>>> some mention of the need to introduce userfaultfd support in kernel
>>> modules.
>>
>> Hi Peter,
>>
>> Sounds fair, thank you.
>>
>>>>
>>>> Suggested-by: James Houghton <jthoughton@google.com>
>>>> Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
>>>> ---
>>>>    include/linux/mm_types.h |  4 ++++
>>>>    mm/hugetlb.c             |  2 +-
>>>>    mm/shmem.c               |  9 ++++++---
>>>>    mm/userfaultfd.c         | 37 +++++++++++++++++++++++++++----------
>>>>    4 files changed, 38 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>> index 0234f14f2aa6..2f26ee9742bf 100644
>>>> --- a/include/linux/mm_types.h
>>>> +++ b/include/linux/mm_types.h
>>>> @@ -1429,6 +1429,9 @@ enum tlb_flush_reason {
>>>>     * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
>>>>     *                        We should only access orig_pte if this flag set.
>>>>     * @FAULT_FLAG_VMA_LOCK: The fault is handled under VMA lock.
>>>> + * @FAULT_FLAG_USERFAULT_CONTINUE: The fault handler must not call userfaultfd
>>>> + *                                 minor handler as it is being called by the
>>>> + *                                 userfaultfd code itself.
>>>
>>> We probably shouldn't leak the "CONTINUE" concept to mm core if possible,
>>> as it's not easy to follow when without userfault minor context.  It might
>>> be better to use generic terms like NO_USERFAULT.
>>
>> Yes, I agree, can name it more generically.
>>
>>> Said that, I wonder if we'll need to add a vm_ops anyway in the latter
>>> patch, whether we can also avoid reusing fault() but instead resolve the
>>> page faults using the vm_ops hook too.  That might be helpful because then
>>> we can avoid this new FAULT_FLAG_* that is totally not useful to
>>> non-userfault users, meanwhile we also don't need to hand-cook the vm_fault
>>> struct below just to suite the current fault() interfacing.
>>
>> I'm not sure I fully understand that.  Calling fault() op helps us reuse the
>> FS specifics when resolving the fault.  I get that the new op can imply the
>> userfault flag so the flag doesn't need to be exposed to mm, but doing so
>> will bring duplication of the logic within FSes between this new op and the
>> fault(), unless we attempt to factor common parts out.  For example, for
>> shmem_get_folio_gfp(), we would still need to find a way to suppress the
>> call to handle_userfault() when shmem_get_folio_gfp() is called from the new
>> op.  Is that what you're proposing?
> 
> Yes it is what I was proposing.  shmem_get_folio_gfp() always has that
> handling when vmf==NULL, then vma==NULL and userfault will be skipped.
> 
> So what I was thinking is one vm_ops.userfaultfd_request(req), where req
> can be:
> 
>    (1) UFFD_REQ_GET_SUPPORTED: this should, for existing RAM-FSes return
>        both MISSING/WP/MINOR.  Here WP should mean sync-wp tracking, async
>        was so far by default almost supported everywhere except
>        VM_DROPPABLE. For guest-memfd in the future, we can return MINOR only
>        as of now (even if I think it shouldn't be hard to support the rest
>        two..).
> 
>    (2) UFFD_REQ_FAULT_RESOLVE: this should play the fault() role but well
>        defined to suite userfault's need on fault resolutions.  It likely
>        doesn't need vmf as the parameter, but likely (when anon isn't taking
>        into account, after all anon have vm_ops==NULL..) the inode and
>        offsets, perhaps some flag would be needed to identify MISSING or
>        MINOR faults, for example.
> 
> Maybe some more.
> 
> I was even thinking whether we could merge hugetlb into the picture too on
> generalize its fault resolutions.  Hugetlb was always special, maye this is
> a chance too to make it generalized, but it doesn't need to happen in one
> shot even if it could work.  We could start with shmem.
> 
> So this does sound like slightly involved, and I'm not yet 100% sure this
> will work, but likely.  If you want, I can take a stab at this this week or
> next just to see whether it'll work in general.  I also don't expect this
> to depend on guest-memfd at all - it can be alone a refactoring making
> userfault module-ready.

Thanks for explaining that.  I played a bit with it myself and it 
appears to be working for the MISSING mode for both shmem and 
guest_memfd.  Attaching my sketch below.  Please let me know if this is 
how you see it.

I found that arguments and return values are significantly different 
between the two request types, which may be a bit confusing, although we 
do not expect many callers of those.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8483e09aeb2c..eb30b23b24d3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -603,6 +603,16 @@ struct vm_fault {
  					 */
  };

+#ifdef CONFIG_USERFAULTFD
+/*
+ * Used by userfaultfd_request().
+ */
+enum uffd_req {
+	UFFD_REQ_GET_SUPPORTED, /* query supported userfaultfd modes */
+	UFFD_REQ_FAULT_RESOLVE, /* request fault resolution */
+};
+#endif
+
  /*
   * These are the virtual MM functions - opening of an area, closing and
   * unmapping it (needed to keep files on disk up-to-date etc), pointer
@@ -680,6 +690,22 @@ struct vm_operations_struct {
  	 */
  	struct page *(*find_special_page)(struct vm_area_struct *vma,
  					  unsigned long addr);
+
+#ifdef CONFIG_USERFAULTFD
+	/*
+	 * Called by the userfaultfd code to query supported modes or request
+	 * fault resolution.
+	 * If called with req UFFD_REQ_GET_SUPPORTED, it returns a bitmask
+	 * of modes as in struct uffdio_register.  No other arguments are
+	 * used.
+	 * If called with req UFFD_REQ_FAULT_RESOLVE, it resolves the fault
+	 * using the mode specified in the mode argument. The inode, pgoff and
+	 * foliop arguments must be set accordingly.
+	 */
+	int (*userfaultfd_request)(enum uffd_req req, int mode,
+				   struct inode *inode, pgoff_t pgoff,
+				   struct folio **foliop);
+#endif
  };

  #ifdef CONFIG_NUMA_BALANCING
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 75342022d144..1cabb925da0e 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -222,7 +222,11 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
  		return false;

  	if ((vm_flags & VM_UFFD_MINOR) &&
-	    (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
+	    (!is_vm_hugetlb_page(vma) &&
+	     (!vma->vm_ops || !vma->vm_ops->userfaultfd_request ||
+	      !(vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0,
+						 NULL, 0, NULL) &
+	        UFFDIO_REGISTER_MODE_MINOR))))
  		return false;

  	/*
@@ -243,8 +247,11 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
  #endif

  	/* By default, allow any of anon|shmem|hugetlb */
-	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
-	    vma_is_shmem(vma);
+	return vma_is_anonymous(vma) ||
+	    is_vm_hugetlb_page(vma) ||
+	    (vma->vm_ops->userfaultfd_request &&
+	     vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0, NULL,
+					      0, NULL));
  }

  static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
diff --git a/mm/shmem.c b/mm/shmem.c
index 1ede0800e846..a5b5c4131dcf 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -5203,6 +5203,40 @@ static int shmem_error_remove_folio(struct address_space *mapping,
  	return 0;
  }

+#ifdef CONFIG_USERFAULTFD
+static int shmem_userfaultfd_request(enum uffd_req req, int mode,
+				     struct inode *inode, pgoff_t pgoff,
+				     struct folio **foliop)
+{
+	int ret;
+
+	switch (req) {
+	case UFFD_REQ_GET_SUPPORTED:
+		ret =
+		    UFFDIO_REGISTER_MODE_MISSING |
+		    UFFDIO_REGISTER_MODE_WP |
+		    UFFDIO_REGISTER_MODE_MINOR;
+		break;
+	case UFFD_REQ_FAULT_RESOLVE:
+		ret = shmem_get_folio(inode, pgoff, 0, foliop, SGP_NOALLOC);
+		if (ret == -ENOENT)
+			ret = -EFAULT;
+		if (ret)
+			break;
+		if (!*foliop) {
+			ret = -EFAULT;
+			break;
+		}
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+#endif
+
  static const struct address_space_operations shmem_aops = {
  	.writepage	= shmem_writepage,
  	.dirty_folio	= noop_dirty_folio,
@@ -5306,6 +5340,9 @@ static const struct vm_operations_struct shmem_vm_ops = {
  	.set_policy     = shmem_set_policy,
  	.get_policy     = shmem_get_policy,
  #endif
+#ifdef CONFIG_USERFAULTFD
+	.userfaultfd_request = shmem_userfaultfd_request,
+#endif
  };

  static const struct vm_operations_struct shmem_anon_vm_ops = {
@@ -5315,6 +5352,9 @@ static const struct vm_operations_struct shmem_anon_vm_ops = {
  	.set_policy     = shmem_set_policy,
  	.get_policy     = shmem_get_policy,
  #endif
+#ifdef CONFIG_USERFAULTFD
+	.userfaultfd_request = shmem_userfaultfd_request,
+#endif
  };

  int shmem_init_fs_context(struct fs_context *fc)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d06453fa8aba..efc150bf5691 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -392,16 +392,18 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
  	struct page *page;
  	int ret;

-	ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
-	/* Our caller expects us to return -EFAULT if we failed to find folio */
-	if (ret == -ENOENT)
-		ret = -EFAULT;
+	if (!dst_vma->vm_ops->userfaultfd_request ||
+	    !(dst_vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0,
+						   NULL, 0, NULL) &
+	      UFFDIO_REGISTER_MODE_MINOR)) {
+		return -EFAULT;
+	}
+
+	ret = dst_vma->vm_ops->userfaultfd_request(UFFD_REQ_FAULT_RESOLVE,
+						   UFFDIO_REGISTER_MODE_MINOR,
+						   inode, pgoff, &folio);
  	if (ret)
  		goto out;
-	if (!folio) {
-		ret = -EFAULT;
-		goto out;
-	}

  	page = folio_file_page(folio, pgoff);
  	if (PageHWPoison(page)) {
@@ -770,10 +772,10 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
  		return  mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
  					     src_start, len, flags);

-	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
-		goto out_unlock;
-	if (!vma_is_shmem(dst_vma) &&
-	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+	if (!vma_is_anonymous(dst_vma) &&
+            (!dst_vma->vm_ops->userfaultfd_request ||
+	     (!dst_vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0,
+						    NULL, 0, NULL))))
  		goto out_unlock;

  	while (src_addr < src_start + len) {

> 
> Thanks,
> 
> --
> Peter Xu
> 


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs
  2025-06-20 12:00         ` Nikita Kalyazin
@ 2025-06-20 15:21           ` Peter Xu
  2025-06-20 16:51             ` Nikita Kalyazin
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Xu @ 2025-06-20 15:21 UTC (permalink / raw)
  To: Nikita Kalyazin
  Cc: akpm, pbonzini, shuah, viro, brauner, muchun.song, hughd, kvm,
	linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	lorenzo.stoakes, Liam.Howlett, jannh, ryan.roberts, david,
	jthoughton, graf, jgowans, roypat, derekmn, nsaenz, xmarcalx

Hi, Nikita,

On Fri, Jun 20, 2025 at 01:00:24PM +0100, Nikita Kalyazin wrote:
> Thanks for explaining that.  I played a bit with it myself and it appears to
> be working for the MISSING mode for both shmem and guest_memfd.  Attaching

[1]

> my sketch below.  Please let me know if this is how you see it.
> 
> I found that arguments and return values are significantly different between
> the two request types, which may be a bit confusing, although we do not
> expect many callers of those.

Indeed.  Actually, since I hadn't yet received your reply, I gave it a shot
early this week, and I found the same thing: it would be nice to keep the
type checks everywhere.  It'll also be awkward if we want to add
MISSING into the picture with the current req() interface (btw, IIUC you
meant MINOR above, not MISSING).

Please have a look at what I came up with.  I haven't yet had a chance to
post it, but it does compile fine and passes the smoke tests here.  Feel
free to take it over if you think that makes sense to you, or I can also
post it officially after some more tests.

So, ultimately I introduced a vm_uffd_ops to keep all the type checks.  I
don't think I like the uffd_copy() interfacing too much, but it should
still be the minimum changeset I can think of to generalize shmem as a
userfault user / module, and finally drop "linux/shmem_fs.h" inclusion in
the last patch.

It's also unfortunate that hugetlb won't be able to use this part of the API
right away, for the same reason that hugetlb's fault() is defined to BUG()
and the hugetlb path is hard-coded in handle_mm_fault().  However, it'll at
least be able to use the rest of the API just fine, which generalizes some of
the hugetlb checks.
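
For context, the hugetlb special-casing referred to looks roughly like this
today (paraphrased from mm/hugetlb.c and mm/memory.c, not exact quotes):

/* mm/hugetlb.c: the ->fault() hook is never meant to be reached */
static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
{
	BUG();
	return 0;
}

/* mm/memory.c: handle_mm_fault() hard-codes the hugetlb path instead */
	if (unlikely(is_vm_hugetlb_page(vma)))
		ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
	else
		ret = __handle_mm_fault(vma, address, flags);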

The shmem definition looks like this:

static const vm_uffd_ops shmem_uffd_ops = {
	.uffd_features	= 	__VM_UFFD_FLAGS,
	.uffd_ioctls	= 	BIT(_UFFDIO_COPY) |
				BIT(_UFFDIO_ZEROPAGE) |
				BIT(_UFFDIO_WRITEPROTECT) |
				BIT(_UFFDIO_CONTINUE) |
				BIT(_UFFDIO_POISON),
	.uffd_get_folio	=	shmem_uffd_get_folio,
	.uffd_copy	=	shmem_mfill_atomic_pte,
};

Then guest-memfd can set (1) VM_UFFD_MINOR, (2) _UFFDIO_CONTINUE and
provide uffd_get_folio() for supporting MINOR.
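
For illustration only, the guest-memfd side could then boil down to something
roughly like the below (a sketch; kvm_gmem_uffd_get_folio() is a hypothetical
helper that would do the guest_memfd filemap lookup, analogous to
shmem_uffd_get_folio() above):

static const vm_uffd_ops kvm_gmem_uffd_ops = {
	.uffd_features	= VM_UFFD_MINOR,
	.uffd_ioctls	= BIT(_UFFDIO_CONTINUE),
	.uffd_get_folio	= kvm_gmem_uffd_get_folio,
};

static const struct vm_operations_struct kvm_gmem_vm_ops = {
	.fault		 = kvm_gmem_fault,
	.userfaultfd_ops = &kvm_gmem_uffd_ops,
};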

Let me know what you think.

Thanks,

===8<===
From ca500177de122d32194f8bf4589faceeaaae2c0c Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Thu, 12 Jun 2025 11:51:58 -0400
Subject: [PATCH 1/4] mm: Introduce vm_uffd_ops API

Introduce a generic userfaultfd API for vm_operations_struct, so that a vma,
especially one provided by a module, can support userfaults without modifying
the core files.  More importantly, this works even when the module can be
compiled out of the kernel.

So, instead of having core mm reference modules that may not ever exist, we
need to have modules opt in to core mm hooks.

After this API is applied, a module that wants to support userfaultfd
should only need to touch its own file and properly define vm_uffd_ops,
instead of changing anything in core mm.

Note that this API will not work for anonymous memory.  Core mm will keep
processing anonymous memory separately for userfault operations, as before.

This patch only introduces the API alone so that we can start to move
existing users over but without breaking them.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h            | 71 +++++++++++++++++++++++++++++++++++
 include/linux/userfaultfd_k.h | 12 ------
 2 files changed, 71 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 98a606908307..8dfd83f01d3d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -576,6 +576,70 @@ struct vm_fault {
 					 */
 };
 
+#ifdef CONFIG_USERFAULTFD
+/* A combined operation mode + behavior flags. */
+typedef unsigned int __bitwise uffd_flags_t;
+
+enum mfill_atomic_mode {
+	MFILL_ATOMIC_COPY,
+	MFILL_ATOMIC_ZEROPAGE,
+	MFILL_ATOMIC_CONTINUE,
+	MFILL_ATOMIC_POISON,
+	NR_MFILL_ATOMIC_MODES,
+};
+
+/* VMA userfaultfd operations */
+typedef struct {
+	/**
+	 * @uffd_features: features supported in bitmask.
+	 *
+	 * When the ops is defined, the driver must set non-zero features
+	 * to be a subset (or all) of: VM_UFFD_MISSING|WP|MINOR.
+	 */
+	unsigned long uffd_features;
+	/**
+	 * @uffd_ioctls: ioctls supported in bitmask.
+	 *
+	 * Userfaultfd ioctls supported by the module.  Below will always
+	 * be supported by default whenever a module provides vm_uffd_ops:
+	 *
+	 *   _UFFDIO_API, _UFFDIO_REGISTER, _UFFDIO_UNREGISTER, _UFFDIO_WAKE
+	 *
+	 * The module needs to provide the rest of the optionally supported
+	 * ioctls.  For example, when VM_UFFD_MISSING is supported,
+	 * _UFFDIO_COPY must be supported as an ioctl, while _UFFDIO_ZEROPAGE
+	 * is optional.
+	 */
+	unsigned long uffd_ioctls;
+	/**
+	 * uffd_get_folio: Handler to resolve UFFDIO_CONTINUE request.
+	 *
+	 * @inode: the inode for folio lookup
+	 * @pgoff: the pgoff of the folio
+	 * @folio: returned folio pointer
+	 *
+	 * Return: zero if succeeded, negative for errors.
+	 */
+	int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
+			      struct folio **folio);
+	/**
+	 * uffd_copy: Handler to resolve UFFDIO_COPY|ZEROPAGE request.
+	 *
+	 * @dst_pmd: target pmd to resolve page fault
+	 * @dst_vma: target vma
+	 * @dst_addr: target virtual address
+	 * @src_addr: source address to copy from
+	 * @flags: userfaultfd request flags
+	 * @foliop: previously allocated folio
+	 *
+	 * Return: zero if succeeded, negative for errors.
+	 */
+	int (*uffd_copy)(pmd_t *dst_pmd, struct vm_area_struct *dst_vma,
+			 unsigned long dst_addr, unsigned long src_addr,
+			 uffd_flags_t flags, struct folio **foliop);
+} vm_uffd_ops;
+#endif
+
 /*
  * These are the virtual MM functions - opening of an area, closing and
  * unmapping it (needed to keep files on disk up-to-date etc), pointer
@@ -653,6 +717,13 @@ struct vm_operations_struct {
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,
 					  unsigned long addr);
+#ifdef CONFIG_USERFAULTFD
+	/*
+	 * Userfaultfd related ops.  Modules need to define this to support
+	 * userfaultfd.
+	 */
+	const vm_uffd_ops *userfaultfd_ops;
+#endif
 };
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index ccad58602846..e79c724b3b95 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -80,18 +80,6 @@ struct userfaultfd_ctx {
 
 extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
 
-/* A combined operation mode + behavior flags. */
-typedef unsigned int __bitwise uffd_flags_t;
-
-/* Mutually exclusive modes of operation. */
-enum mfill_atomic_mode {
-	MFILL_ATOMIC_COPY,
-	MFILL_ATOMIC_ZEROPAGE,
-	MFILL_ATOMIC_CONTINUE,
-	MFILL_ATOMIC_POISON,
-	NR_MFILL_ATOMIC_MODES,
-};
-
 #define MFILL_ATOMIC_MODE_BITS (const_ilog2(NR_MFILL_ATOMIC_MODES - 1) + 1)
 #define MFILL_ATOMIC_BIT(nr) BIT(MFILL_ATOMIC_MODE_BITS + (nr))
 #define MFILL_ATOMIC_FLAG(nr) ((__force uffd_flags_t) MFILL_ATOMIC_BIT(nr))
-- 
2.49.0


From a7094b86d3308e91ac7ab785b7d71ae6cc4739f4 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Wed, 11 Jun 2025 10:18:23 -0400
Subject: [PATCH 2/4] mm/shmem: Support vm_uffd_ops API

Add support for the new vm_uffd_ops API for shmem.  Note that this only
introduces the support; the API is not yet used by core mm.

Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/shmem.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index 0bc30dafad90..bd0a29000318 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3151,6 +3151,13 @@ static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap,
 #endif /* CONFIG_TMPFS_QUOTA */
 
 #ifdef CONFIG_USERFAULTFD
+
+static int shmem_uffd_get_folio(struct inode *inode, pgoff_t pgoff,
+				struct folio **folio)
+{
+	return shmem_get_folio(inode, pgoff, 0, folio, SGP_NOALLOC);
+}
+
 int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
 			   struct vm_area_struct *dst_vma,
 			   unsigned long dst_addr,
@@ -5194,6 +5201,19 @@ static int shmem_error_remove_folio(struct address_space *mapping,
 	return 0;
 }
 
+#ifdef CONFIG_USERFAULTFD
+static const vm_uffd_ops shmem_uffd_ops = {
+	.uffd_features	= 	__VM_UFFD_FLAGS,
+	.uffd_ioctls	= 	BIT(_UFFDIO_COPY) |
+				BIT(_UFFDIO_ZEROPAGE) |
+				BIT(_UFFDIO_WRITEPROTECT) |
+				BIT(_UFFDIO_CONTINUE) |
+				BIT(_UFFDIO_POISON),
+	.uffd_get_folio	=	shmem_uffd_get_folio,
+	.uffd_copy	=	shmem_mfill_atomic_pte,
+};
+#endif
+
 static const struct address_space_operations shmem_aops = {
 	.dirty_folio	= noop_dirty_folio,
 #ifdef CONFIG_TMPFS
@@ -5296,6 +5316,9 @@ static const struct vm_operations_struct shmem_vm_ops = {
 	.set_policy     = shmem_set_policy,
 	.get_policy     = shmem_get_policy,
 #endif
+#ifdef CONFIG_USERFAULTFD
+	.userfaultfd_ops = &shmem_uffd_ops,
+#endif
 };
 
 static const struct vm_operations_struct shmem_anon_vm_ops = {
@@ -5305,6 +5328,9 @@ static const struct vm_operations_struct shmem_anon_vm_ops = {
 	.set_policy     = shmem_set_policy,
 	.get_policy     = shmem_get_policy,
 #endif
+#ifdef CONFIG_USERFAULTFD
+	.userfaultfd_ops = &shmem_uffd_ops,
+#endif
 };
 
 int shmem_init_fs_context(struct fs_context *fc)
-- 
2.49.0


From fab8f8312982619ca80299d6cf35d5661cf61911 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Wed, 11 Jun 2025 10:18:40 -0400
Subject: [PATCH 3/4] mm/hugetlb: Support vm_uffd_ops API

Add support for the new vm_uffd_ops API for hugetlb.  Note that this only
introduces the support; the API is not yet used by core mm.

Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3d61ec17c15a..b9e473fab871 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5459,6 +5459,22 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
 	return 0;
 }
 
+#ifdef CONFIG_USERFAULTFD
+static const vm_uffd_ops hugetlb_uffd_ops = {
+	.uffd_features	= 	__VM_UFFD_FLAGS,
+	/* _UFFDIO_ZEROPAGE not supported */
+	.uffd_ioctls	= 	BIT(_UFFDIO_COPY) |
+				BIT(_UFFDIO_WRITEPROTECT) |
+				BIT(_UFFDIO_CONTINUE) |
+				BIT(_UFFDIO_POISON),
+	/*
+	 * Hugetlbfs still has its own hard-coded handler in userfaultfd,
+	 * due to limitations similar to vm_operations_struct.fault().
+	 * TODO: generalize it to use the API functions.
+	 */
+};
+#endif
+
 /*
  * When a new function is introduced to vm_operations_struct and added
  * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops.
@@ -5472,6 +5488,9 @@ const struct vm_operations_struct hugetlb_vm_ops = {
 	.close = hugetlb_vm_op_close,
 	.may_split = hugetlb_vm_op_split,
 	.pagesize = hugetlb_vm_op_pagesize,
+#ifdef CONFIG_USERFAULTFD
+	.userfaultfd_ops = &hugetlb_uffd_ops,
+#endif
 };
 
 static pte_t make_huge_pte(struct vm_area_struct *vma, struct folio *folio,
-- 
2.49.0


From de6ac50b189dc16b2d5759f67b32d528a6c9ccde Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Thu, 12 Jun 2025 11:55:08 -0400
Subject: [PATCH 4/4] mm: Apply vm_uffd_ops API to core mm

This patch moves the old userfaultfd core over to the new vm_uffd_ops
API.  After this change, existing file systems will start to use the new
API for userfault operations.

While at it, move vma_can_userfault() into mm/userfaultfd.c, because it's
getting too big to stay inline.  It's only called in slow paths, so this
shouldn't be an issue.

This also removes quite a few hard-coded checks for shmem and hugetlbfs.
All the old checks should still work, but now via vm_uffd_ops.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/shmem_fs.h      |  14 -----
 include/linux/userfaultfd_k.h |  46 ++++----------
 mm/shmem.c                    |   2 +-
 mm/userfaultfd.c              | 115 +++++++++++++++++++++++++---------
 4 files changed, 101 insertions(+), 76 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 6d0f9c599ff7..2f5b7b295cf6 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -195,20 +195,6 @@ static inline pgoff_t shmem_fallocend(struct inode *inode, pgoff_t eof)
 extern bool shmem_charge(struct inode *inode, long pages);
 extern void shmem_uncharge(struct inode *inode, long pages);
 
-#ifdef CONFIG_USERFAULTFD
-#ifdef CONFIG_SHMEM
-extern int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
-				  struct vm_area_struct *dst_vma,
-				  unsigned long dst_addr,
-				  unsigned long src_addr,
-				  uffd_flags_t flags,
-				  struct folio **foliop);
-#else /* !CONFIG_SHMEM */
-#define shmem_mfill_atomic_pte(dst_pmd, dst_vma, dst_addr, \
-			       src_addr, flags, foliop) ({ BUG(); 0; })
-#endif /* CONFIG_SHMEM */
-#endif /* CONFIG_USERFAULTFD */
-
 /*
  * Used space is stored as unsigned 64-bit value in bytes but
  * quota core supports only signed 64-bit values so use that
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index e79c724b3b95..4e56ad423a4a 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -85,9 +85,14 @@ extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
 #define MFILL_ATOMIC_FLAG(nr) ((__force uffd_flags_t) MFILL_ATOMIC_BIT(nr))
 #define MFILL_ATOMIC_MODE_MASK ((__force uffd_flags_t) (MFILL_ATOMIC_BIT(0) - 1))
 
+static inline enum mfill_atomic_mode uffd_flags_get_mode(uffd_flags_t flags)
+{
+	return (enum mfill_atomic_mode)(flags & MFILL_ATOMIC_MODE_MASK);
+}
+
 static inline bool uffd_flags_mode_is(uffd_flags_t flags, enum mfill_atomic_mode expected)
 {
-	return (flags & MFILL_ATOMIC_MODE_MASK) == ((__force uffd_flags_t) expected);
+	return uffd_flags_get_mode(flags) == expected;
 }
 
 static inline uffd_flags_t uffd_flags_set_mode(uffd_flags_t flags, enum mfill_atomic_mode mode)
@@ -196,41 +201,16 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 	return vma->vm_flags & __VM_UFFD_FLAGS;
 }
 
-static inline bool vma_can_userfault(struct vm_area_struct *vma,
-				     unsigned long vm_flags,
-				     bool wp_async)
+static inline const vm_uffd_ops *vma_get_uffd_ops(struct vm_area_struct *vma)
 {
-	vm_flags &= __VM_UFFD_FLAGS;
-
-	if (vma->vm_flags & VM_DROPPABLE)
-		return false;
-
-	if ((vm_flags & VM_UFFD_MINOR) &&
-	    (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
-		return false;
-
-	/*
-	 * If wp async enabled, and WP is the only mode enabled, allow any
-	 * memory type.
-	 */
-	if (wp_async && (vm_flags == VM_UFFD_WP))
-		return true;
-
-#ifndef CONFIG_PTE_MARKER_UFFD_WP
-	/*
-	 * If user requested uffd-wp but not enabled pte markers for
-	 * uffd-wp, then shmem & hugetlbfs are not supported but only
-	 * anonymous.
-	 */
-	if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
-		return false;
-#endif
-
-	/* By default, allow any of anon|shmem|hugetlb */
-	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
-	    vma_is_shmem(vma);
+	if (vma->vm_ops && vma->vm_ops->userfaultfd_ops)
+		return vma->vm_ops->userfaultfd_ops;
+	return NULL;
 }
 
+bool vma_can_userfault(struct vm_area_struct *vma,
+		       unsigned long vm_flags, bool wp_async);
+
 static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
 {
 	struct userfaultfd_ctx *uffd_ctx = vma->vm_userfaultfd_ctx.ctx;
diff --git a/mm/shmem.c b/mm/shmem.c
index bd0a29000318..4d71fc7be358 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3158,7 +3158,7 @@ static int shmem_uffd_get_folio(struct inode *inode, pgoff_t pgoff,
 	return shmem_get_folio(inode, pgoff, 0, folio, SGP_NOALLOC);
 }
 
-int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
+static int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
 			   struct vm_area_struct *dst_vma,
 			   unsigned long dst_addr,
 			   unsigned long src_addr,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 879505c6996f..61783ff2d335 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -14,12 +14,48 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/mmu_notifier.h>
 #include <linux/hugetlb.h>
-#include <linux/shmem_fs.h>
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
 #include "internal.h"
 #include "swap.h"
 
+bool vma_can_userfault(struct vm_area_struct *vma,
+		       unsigned long vm_flags, bool wp_async)
+{
+	unsigned long supported;
+
+	if (vma->vm_flags & VM_DROPPABLE)
+		return false;
+
+	vm_flags &= __VM_UFFD_FLAGS;
+
+#ifndef CONFIG_PTE_MARKER_UFFD_WP
+	/*
+	 * If the user requested uffd-wp but did not enable pte markers for
+	 * uffd-wp, then file systems (like shmem or hugetlbfs) are not
+	 * supported; only anonymous memory is.
+	 */
+	if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
+		return false;
+#endif
+	/*
+	 * If wp async enabled, and WP is the only mode enabled, allow any
+	 * memory type.
+	 */
+	if (wp_async && (vm_flags == VM_UFFD_WP))
+		return true;
+
+	if (vma_is_anonymous(vma))
+		/* Anonymous has no page cache, MINOR not supported */
+		supported = VM_UFFD_MISSING | VM_UFFD_WP;
+	else if (vma_get_uffd_ops(vma))
+		supported = vma_get_uffd_ops(vma)->uffd_features;
+	else
+		return false;
+
+	return !(vm_flags & (~supported));
+}
+
 static __always_inline
 bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
 {
@@ -384,11 +420,15 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
 {
 	struct inode *inode = file_inode(dst_vma->vm_file);
 	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
+	const vm_uffd_ops *uffd_ops = vma_get_uffd_ops(dst_vma);
 	struct folio *folio;
 	struct page *page;
 	int ret;
 
-	ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
+	if (WARN_ON_ONCE(!uffd_ops || !uffd_ops->uffd_get_folio))
+		return -EINVAL;
+
+	ret = uffd_ops->uffd_get_folio(inode, pgoff, &folio);
 	/* Our caller expects us to return -EFAULT if we failed to find folio */
 	if (ret == -ENOENT)
 		ret = -EFAULT;
@@ -504,18 +544,6 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 	u32 hash;
 	struct address_space *mapping;
 
-	/*
-	 * There is no default zero huge page for all huge page sizes as
-	 * supported by hugetlb.  A PMD_SIZE huge pages may exist as used
-	 * by THP.  Since we can not reliably insert a zero page, this
-	 * feature is not supported.
-	 */
-	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) {
-		up_read(&ctx->map_changing_lock);
-		uffd_mfill_unlock(dst_vma);
-		return -EINVAL;
-	}
-
 	src_addr = src_start;
 	dst_addr = dst_start;
 	copied = 0;
@@ -686,14 +714,55 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
 			err = mfill_atomic_pte_zeropage(dst_pmd,
 						 dst_vma, dst_addr);
 	} else {
-		err = shmem_mfill_atomic_pte(dst_pmd, dst_vma,
-					     dst_addr, src_addr,
-					     flags, foliop);
+		const vm_uffd_ops *uffd_ops = vma_get_uffd_ops(dst_vma);
+
+		if (WARN_ON_ONCE(!uffd_ops || !uffd_ops->uffd_copy)) {
+			err = -EINVAL;
+		} else {
+			err = uffd_ops->uffd_copy(dst_pmd, dst_vma,
+						  dst_addr, src_addr,
+						  flags, foliop);
+		}
 	}
 
 	return err;
 }
 
+static inline bool
+vma_uffd_ops_supported(struct vm_area_struct *vma, uffd_flags_t flags)
+{
+	enum mfill_atomic_mode mode = uffd_flags_get_mode(flags);
+	const vm_uffd_ops *uffd_ops;
+	unsigned long uffd_ioctls;
+
+	if ((flags & MFILL_ATOMIC_WP) && !(vma->vm_flags & VM_UFFD_WP))
+		return false;
+
+	/* Anonymous supports everything except CONTINUE */
+	if (vma_is_anonymous(vma))
+		return mode != MFILL_ATOMIC_CONTINUE;
+
+	uffd_ops = vma_get_uffd_ops(vma);
+	if (!uffd_ops)
+		return false;
+
+	uffd_ioctls = uffd_ops->uffd_ioctls;
+	switch (mode) {
+	case MFILL_ATOMIC_COPY:
+		return uffd_ioctls & BIT(_UFFDIO_COPY);
+	case MFILL_ATOMIC_ZEROPAGE:
+		return uffd_ioctls & BIT(_UFFDIO_ZEROPAGE);
+	case MFILL_ATOMIC_CONTINUE:
+		if (!(vma->vm_flags & VM_SHARED))
+			return false;
+		return uffd_ioctls & BIT(_UFFDIO_CONTINUE);
+	case MFILL_ATOMIC_POISON:
+		return uffd_ioctls & BIT(_UFFDIO_POISON);
+	default:
+		return false;
+	}
+}
+
 static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 					    unsigned long dst_start,
 					    unsigned long src_start,
@@ -752,11 +821,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 	    dst_vma->vm_flags & VM_SHARED))
 		goto out_unlock;
 
-	/*
-	 * validate 'mode' now that we know the dst_vma: don't allow
-	 * a wrprotect copy if the userfaultfd didn't register as WP.
-	 */
-	if ((flags & MFILL_ATOMIC_WP) && !(dst_vma->vm_flags & VM_UFFD_WP))
+	if (!vma_uffd_ops_supported(dst_vma, flags))
 		goto out_unlock;
 
 	/*
@@ -766,12 +831,6 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 		return  mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
 					     src_start, len, flags);
 
-	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
-		goto out_unlock;
-	if (!vma_is_shmem(dst_vma) &&
-	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
-		goto out_unlock;
-
 	while (src_addr < src_start + len) {
 		pmd_t dst_pmdval;
 
-- 
2.49.0


-- 
Peter Xu


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs
  2025-06-20 15:21           ` Peter Xu
@ 2025-06-20 16:51             ` Nikita Kalyazin
  0 siblings, 0 replies; 28+ messages in thread
From: Nikita Kalyazin @ 2025-06-20 16:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: akpm, pbonzini, shuah, viro, brauner, muchun.song, hughd, kvm,
	linux-kselftest, linux-kernel, linux-mm, linux-fsdevel, jack,
	lorenzo.stoakes, Liam.Howlett, jannh, ryan.roberts, david,
	jthoughton, graf, jgowans, roypat, derekmn, nsaenz, xmarcalx



On 20/06/2025 16:21, Peter Xu wrote:
> Hi, Nikita,
> 
> On Fri, Jun 20, 2025 at 01:00:24PM +0100, Nikita Kalyazin wrote:
>> Thanks for explaining that.  I played a bit with it myself and it appears to
>> be working for the MISSING mode for both shmem and guest_memfd.  Attaching
> 
> [1]
> 
>> my sketch below.  Please let me know if this is how you see it.
>>
>> I found that arguments and return values are significantly different between
>> the two request types, which may be a bit confusing, although we do not
>> expect many callers of those.
> 
> Indeed.  Actually, since I didn't yet get your reply, I gave it a shot
> early this week, and I found the same thing: it'll be nice to keep the
> type checks all over the place.  It'll also be awkward if we want to add
> MISSING into the picture with the current req() interface (btw, IIUC you
> meant MINOR above, not MISSING).

I did indeed, sorry for the typo.

> Please have a look at what I came up with.  I didn't yet get a chance to
> post it, but it did compile all fine and pass the smoke tests here.  Feel
> free to take it over if you think that makes sense to you, or I can also
> post it officially after some more tests.

This looks much cleaner.  Please go ahead and post it.  I will make use
of the interface in guest_memfd and run some tests in the meantime.
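
For reference, the guest_memfd hook I have in mind is roughly the
following.  This is only a sketch: kvm_gmem_get_folio() stands in for
whatever folio lookup guest_memfd ends up exposing, and the exact
signature and error handling may differ:

static int kvm_gmem_uffd_get_folio(struct inode *inode, pgoff_t pgoff,
				   struct folio **folio)
{
	/* Look up the folio already present in the guest_memfd page cache */
	struct folio *f = kvm_gmem_get_folio(inode, pgoff);

	if (IS_ERR(f))
		return PTR_ERR(f);

	*folio = f;
	return 0;
}

It would then be wired up via .uffd_get_folio in a guest_memfd
vm_uffd_ops, together with VM_UFFD_MINOR and BIT(_UFFDIO_CONTINUE), as
you described above.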

> [... rest of the quoted patches snipped ...]


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2025-06-20 16:51 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-04 15:43 [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor Nikita Kalyazin
2025-04-04 15:43 ` [PATCH v3 1/6] mm: userfaultfd: generic continue for non hugetlbfs Nikita Kalyazin
2025-06-10 22:22   ` Peter Xu
2025-06-11 12:09     ` Nikita Kalyazin
2025-06-11 12:56       ` Peter Xu
2025-06-20 12:00         ` Nikita Kalyazin
2025-06-20 15:21           ` Peter Xu
2025-06-20 16:51             ` Nikita Kalyazin
2025-04-04 15:43 ` [PATCH v3 2/6] mm: provide can_userfault vma operation Nikita Kalyazin
2025-04-04 15:43 ` [PATCH v3 3/6] mm: userfaultfd: use " Nikita Kalyazin
2025-04-04 15:43 ` [PATCH v3 4/6] KVM: guest_memfd: add support for userfaultfd minor Nikita Kalyazin
2025-06-10 22:25   ` Peter Xu
2025-06-11 12:09     ` Nikita Kalyazin
2025-04-04 15:43 ` [PATCH v3 5/6] mm: userfaultfd: add UFFD_FEATURE_MINOR_GUEST_MEMFD Nikita Kalyazin
2025-04-04 15:43 ` [PATCH v3 6/6] KVM: selftests: test userfaultfd minor for guest_memfd Nikita Kalyazin
2025-04-04 16:33 ` [PATCH v3 0/6] KVM: guest_memfd: support for uffd minor Lorenzo Stoakes
2025-04-04 16:56   ` Nikita Kalyazin
2025-04-04 16:59     ` Lorenzo Stoakes
2025-04-04 17:12 ` Liam R. Howlett
2025-04-07 11:04   ` Nikita Kalyazin
2025-04-07 13:40     ` Liam R. Howlett
2025-04-07 14:04       ` Nikita Kalyazin
2025-04-07 14:24         ` Liam R. Howlett
2025-04-07 14:46           ` David Hildenbrand
2025-04-07 15:14             ` Lorenzo Stoakes
2025-04-07 15:26               ` David Hildenbrand
2025-04-08  8:20             ` Christian Brauner
2025-04-08 13:15               ` Ackerley Tng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).