* [PATCH v4 0/4] mm/userfaultfd: modulize memory types
@ 2025-10-14 23:14 Peter Xu
2025-10-14 23:14 ` [PATCH v4 1/4] mm: Introduce vm_uffd_ops API Peter Xu
` (4 more replies)
0 siblings, 5 replies; 37+ messages in thread
From: Peter Xu @ 2025-10-14 23:14 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka,
Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes,
David Hildenbrand, Hugh Dickins, Michal Hocko, Ujwal Kundur,
Liam R . Howlett, peterx, Oscar Salvador, Suren Baghdasaryan,
Andrea Arcangeli
[based on latest akpm/mm-new of Oct 14th, commit 36c6c5ce1b275]
v4:
- Some cleanups within vma_can_userfault() [David]
- Rename uffd_get_folio() to minor_get_folio() [David]
- Remove uffd_features in vm_uffd_ops, deduce it from supported ioctls [David]
v1: https://lore.kernel.org/r/20250620190342.1780170-1-peterx@redhat.com
v2: https://lore.kernel.org/r/20250627154655.2085903-1-peterx@redhat.com
v3: https://lore.kernel.org/r/20250926211650.525109-1-peterx@redhat.com
This series is an alternative proposal of what Nikita proposed here on the
initial three patches:
https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com
This is not yet relevant to any guest-memfd support, but paving way for it.
Here, the major goal is to make kernel modules be able to opt-in with any
form of userfaultfd supports, like guest-memfd. This alternative option
should hopefully be cleaner, and avoid leaking userfault details into
vm_ops.fault().
It also means this series does not depend on anything. It's a pure
refactoring of userfaultfd internals to provide a generic API, so that
other types of files, especially RAM based, can support userfaultfd without
touching mm/ at all.
To achieve it, this series introduced a file operation called vm_uffd_ops.
The ops needs to be provided when a file type supports any of userfaultfd.
With that, I moved both hugetlbfs and shmem over, whenever possible. So
far due to concerns on exposing an uffd_copy() API, the MISSING faults are
still separately processed and can only be done within mm/. Hugetlbfs kept
its special paths untouched.
An example of shmem uffd_ops:
static const struct vm_uffd_ops shmem_uffd_ops = {
.supported_ioctls = BIT(_UFFDIO_COPY) |
BIT(_UFFDIO_ZEROPAGE) |
BIT(_UFFDIO_WRITEPROTECT) |
BIT(_UFFDIO_CONTINUE) |
BIT(_UFFDIO_POISON),
.minor_get_folio = shmem_uffd_get_folio,
};
To show another sample, this is the patch that Nikita posted to implement
minor fault for guest-memfd (on top of older versions of this series):
https://lore.kernel.org/all/114133f5-0282-463d-9d65-3143aa658806@amazon.com/
No functional change expected at all after the whole series applied. There
might be some slightly stricter check on uffd ops here and there in the
last patch, but that really shouldn't stand out anywhere to anyone.
For testing: besides the cross-compilation tests, I did also try with
uffd-stress in a VM to measure any perf difference before/after the change;
The static call becomes a pointer now. I really cannot measure anything
different, which is more or less expected.
Comments welcomed, thanks.
Peter Xu (4):
mm: Introduce vm_uffd_ops API
mm/shmem: Support vm_uffd_ops API
mm/hugetlb: Support vm_uffd_ops API
mm: Apply vm_uffd_ops API to core mm
include/linux/mm.h | 9 +++
include/linux/userfaultfd_k.h | 75 +++++++++++----------
mm/hugetlb.c | 18 +++++
mm/shmem.c | 24 +++++++
mm/userfaultfd.c | 120 +++++++++++++++++++++++++++-------
5 files changed, 189 insertions(+), 57 deletions(-)
--
2.50.1
^ permalink raw reply [flat|nested] 37+ messages in thread* [PATCH v4 1/4] mm: Introduce vm_uffd_ops API 2025-10-14 23:14 [PATCH v4 0/4] mm/userfaultfd: modulize memory types Peter Xu @ 2025-10-14 23:14 ` Peter Xu 2025-10-20 14:18 ` David Hildenbrand 2025-10-14 23:14 ` [PATCH v4 2/4] mm/shmem: Support " Peter Xu ` (3 subsequent siblings) 4 siblings, 1 reply; 37+ messages in thread From: Peter Xu @ 2025-10-14 23:14 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, David Hildenbrand, Hugh Dickins, Michal Hocko, Ujwal Kundur, Liam R . Howlett, peterx, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli Currently, most of the userfaultfd features are implemented directly in the core mm. It will invoke VMA specific functions whenever necessary. So far it is fine because it almost only interacts with shmem and hugetlbfs. Introduce a generic userfaultfd API extension for vm_operations_struct, so that any code that implements vm_operations_struct (including kernel modules that can be compiled separately from the kernel core) can support userfaults without modifying the core files. With this API applied, if a module wants to support userfaultfd, the module should only need to properly define vm_uffd_ops and hook it to vm_operations_struct, instead of changing anything in core mm. This API will not work for anonymous memory. Handling of userfault operations for anonymous memory remains unchanged in core mm. Due to a security concern while reviewing older versions of this series [1], uffd_copy() will be temprorarily removed. IOW, MISSING-capable memory types can only be hard-coded and implemented in mm/. It would also affect UFFDIO_COPY and UFFDIO_ZEROPAGE. Other functions should still be able to be provided from vm_uffd_ops. Introduces the API only so that existing userfaultfd users can be moved over without breaking them. [1] https://lore.kernel.org/all/20250627154655.2085903-1-peterx@redhat.com/ Signed-off-by: Peter Xu <peterx@redhat.com> --- include/linux/mm.h | 9 +++++++++ include/linux/userfaultfd_k.h | 29 +++++++++++++++++++++++++++++ 2 files changed, 38 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 5c01c4b59ca67..011962130c148 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -620,6 +620,8 @@ struct vm_fault { */ }; +struct vm_uffd_ops; + /* * These are the virtual MM functions - opening of an area, closing and * unmapping it (needed to keep files on disk up-to-date etc), pointer @@ -705,6 +707,13 @@ struct vm_operations_struct { struct page *(*find_normal_page)(struct vm_area_struct *vma, unsigned long addr); #endif /* CONFIG_FIND_NORMAL_PAGE */ +#ifdef CONFIG_USERFAULTFD + /* + * Userfaultfd related ops. Modules need to define this to support + * userfaultfd. + */ + const struct vm_uffd_ops *userfaultfd_ops; +#endif }; #ifdef CONFIG_NUMA_BALANCING diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index c0e716aec26aa..b5b4f3f174b32 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -92,6 +92,35 @@ enum mfill_atomic_mode { NR_MFILL_ATOMIC_MODES, }; +/* VMA userfaultfd operations */ +struct vm_uffd_ops { + /** + * @supported_ioctls: userfaultfd ioctls supported in bitmask. + * + * Userfaultfd ioctls supported by the module. Below will always + * be supported by default whenever a module provides vm_uffd_ops: + * + * _UFFDIO_API, _UFFDIO_REGISTER, _UFFDIO_UNREGISTER, _UFFDIO_WAKE + * + * The module needs to provide all the rest optionally supported + * ioctls as a bitmask. For example, a module needs to set the bit + * BIT(_UFFDIO_CONTINUE) to support userfaultfd minor faults. + */ + unsigned long supported_ioctls; + /** + * minor_get_folio: Handler to resolve UFFDIO_CONTINUE request. + * Must be specified if _UFFDIO_CONTINUE is set. + * + * @inode: the inode for folio lookup + * @pgoff: the pgoff of the folio + * @folio: returned folio pointer + * + * Return: zero if succeeded, negative for errors. + */ + int (*minor_get_folio)(struct inode *inode, pgoff_t pgoff, + struct folio **folio); +}; + #define MFILL_ATOMIC_MODE_BITS (const_ilog2(NR_MFILL_ATOMIC_MODES - 1) + 1) #define MFILL_ATOMIC_BIT(nr) BIT(MFILL_ATOMIC_MODE_BITS + (nr)) #define MFILL_ATOMIC_FLAG(nr) ((__force uffd_flags_t) MFILL_ATOMIC_BIT(nr)) -- 2.50.1 ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v4 1/4] mm: Introduce vm_uffd_ops API 2025-10-14 23:14 ` [PATCH v4 1/4] mm: Introduce vm_uffd_ops API Peter Xu @ 2025-10-20 14:18 ` David Hildenbrand 0 siblings, 0 replies; 37+ messages in thread From: David Hildenbrand @ 2025-10-20 14:18 UTC (permalink / raw) To: Peter Xu, linux-kernel, linux-mm Cc: Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, Hugh Dickins, Michal Hocko, Ujwal Kundur, Liam R . Howlett, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli On 15.10.25 01:14, Peter Xu wrote: > Currently, most of the userfaultfd features are implemented directly in the > core mm. It will invoke VMA specific functions whenever necessary. So far > it is fine because it almost only interacts with shmem and hugetlbfs. > > Introduce a generic userfaultfd API extension for vm_operations_struct, > so that any code that implements vm_operations_struct (including kernel > modules that can be compiled separately from the kernel core) can support > userfaults without modifying the core files. > > With this API applied, if a module wants to support userfaultfd, the > module should only need to properly define vm_uffd_ops and hook it to > vm_operations_struct, instead of changing anything in core mm. > > This API will not work for anonymous memory. Handling of userfault > operations for anonymous memory remains unchanged in core mm. > > Due to a security concern while reviewing older versions of this series > [1], uffd_copy() will be temprorarily removed. IOW, MISSING-capable memory > types can only be hard-coded and implemented in mm/. It would also affect > UFFDIO_COPY and UFFDIO_ZEROPAGE. Other functions should still be able to > be provided from vm_uffd_ops. > > Introduces the API only so that existing userfaultfd users can be moved > over without breaking them. > > [1] https://lore.kernel.org/all/20250627154655.2085903-1-peterx@redhat.com/ > > Signed-off-by: Peter Xu <peterx@redhat.com> > --- Acked-by: David Hildenbrand <david@redhat.com> -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v4 2/4] mm/shmem: Support vm_uffd_ops API 2025-10-14 23:14 [PATCH v4 0/4] mm/userfaultfd: modulize memory types Peter Xu 2025-10-14 23:14 ` [PATCH v4 1/4] mm: Introduce vm_uffd_ops API Peter Xu @ 2025-10-14 23:14 ` Peter Xu 2025-10-20 14:18 ` David Hildenbrand 2025-10-14 23:15 ` [PATCH v4 3/4] mm/hugetlb: " Peter Xu ` (2 subsequent siblings) 4 siblings, 1 reply; 37+ messages in thread From: Peter Xu @ 2025-10-14 23:14 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, David Hildenbrand, Hugh Dickins, Michal Hocko, Ujwal Kundur, Liam R . Howlett, peterx, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli Add support for the new vm_uffd_ops API for shmem. Note that this only introduces the support, the API is not yet used by core mm. It only needs a separate minor_get_folio() definition but that's oneliner. Due to the limitation of the current vm_uffd_ops on MISSING mode support, the shmem UFFDIO_COPY/ZEROPAGE process are still hard-coded in mm/. Cc: Hugh Dickins <hughd@google.com> Acked-by: Mike Rapoport <rppt@kernel.org> Signed-off-by: Peter Xu <peterx@redhat.com> --- mm/shmem.c | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/mm/shmem.c b/mm/shmem.c index b50ce7dbc84a0..0be112689f9e5 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3146,6 +3146,13 @@ static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap, #endif /* CONFIG_TMPFS_QUOTA */ #ifdef CONFIG_USERFAULTFD + +static int shmem_uffd_get_folio(struct inode *inode, pgoff_t pgoff, + struct folio **folio) +{ + return shmem_get_folio(inode, pgoff, 0, folio, SGP_NOALLOC); +} + int shmem_mfill_atomic_pte(pmd_t *dst_pmd, struct vm_area_struct *dst_vma, unsigned long dst_addr, @@ -5189,6 +5196,17 @@ static int shmem_error_remove_folio(struct address_space *mapping, return 0; } +#ifdef CONFIG_USERFAULTFD +static const struct vm_uffd_ops shmem_uffd_ops = { + .supported_ioctls = BIT(_UFFDIO_COPY) | + BIT(_UFFDIO_ZEROPAGE) | + BIT(_UFFDIO_WRITEPROTECT) | + BIT(_UFFDIO_CONTINUE) | + BIT(_UFFDIO_POISON), + .minor_get_folio = shmem_uffd_get_folio, +}; +#endif + static const struct address_space_operations shmem_aops = { .dirty_folio = noop_dirty_folio, #ifdef CONFIG_TMPFS @@ -5291,6 +5309,9 @@ static const struct vm_operations_struct shmem_vm_ops = { .set_policy = shmem_set_policy, .get_policy = shmem_get_policy, #endif +#ifdef CONFIG_USERFAULTFD + .userfaultfd_ops = &shmem_uffd_ops, +#endif }; static const struct vm_operations_struct shmem_anon_vm_ops = { @@ -5300,6 +5321,9 @@ static const struct vm_operations_struct shmem_anon_vm_ops = { .set_policy = shmem_set_policy, .get_policy = shmem_get_policy, #endif +#ifdef CONFIG_USERFAULTFD + .userfaultfd_ops = &shmem_uffd_ops, +#endif }; int shmem_init_fs_context(struct fs_context *fc) -- 2.50.1 ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v4 2/4] mm/shmem: Support vm_uffd_ops API 2025-10-14 23:14 ` [PATCH v4 2/4] mm/shmem: Support " Peter Xu @ 2025-10-20 14:18 ` David Hildenbrand 0 siblings, 0 replies; 37+ messages in thread From: David Hildenbrand @ 2025-10-20 14:18 UTC (permalink / raw) To: Peter Xu, linux-kernel, linux-mm Cc: Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, Hugh Dickins, Michal Hocko, Ujwal Kundur, Liam R . Howlett, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli On 15.10.25 01:14, Peter Xu wrote: > Add support for the new vm_uffd_ops API for shmem. Note that this only > introduces the support, the API is not yet used by core mm. > > It only needs a separate minor_get_folio() definition but that's oneliner. > > Due to the limitation of the current vm_uffd_ops on MISSING mode support, > the shmem UFFDIO_COPY/ZEROPAGE process are still hard-coded in mm/. > > Cc: Hugh Dickins <hughd@google.com> > Acked-by: Mike Rapoport <rppt@kernel.org> > Signed-off-by: Peter Xu <peterx@redhat.com> > --- Acked-by: David Hildenbrand <david@redhat.com> -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v4 3/4] mm/hugetlb: Support vm_uffd_ops API 2025-10-14 23:14 [PATCH v4 0/4] mm/userfaultfd: modulize memory types Peter Xu 2025-10-14 23:14 ` [PATCH v4 1/4] mm: Introduce vm_uffd_ops API Peter Xu 2025-10-14 23:14 ` [PATCH v4 2/4] mm/shmem: Support " Peter Xu @ 2025-10-14 23:15 ` Peter Xu 2025-10-20 14:19 ` David Hildenbrand 2025-10-14 23:15 ` [PATCH v4 4/4] mm: Apply vm_uffd_ops API to core mm Peter Xu 2025-10-20 13:34 ` [PATCH v4 0/4] mm/userfaultfd: modulize memory types David Hildenbrand 4 siblings, 1 reply; 37+ messages in thread From: Peter Xu @ 2025-10-14 23:15 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, David Hildenbrand, Hugh Dickins, Michal Hocko, Ujwal Kundur, Liam R . Howlett, peterx, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli Add support for the new vm_uffd_ops API for hugetlb. Note that this only introduces the support, the API is not yet used by core mm. Due to legacy reasons, it's still not trivial to move hugetlb completely to the API. But it will still use supported_ioctls properly on the API with all the implementations still hard coded in mm/. Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Acked-by: Mike Rapoport <rppt@kernel.org> Signed-off-by: Peter Xu <peterx@redhat.com> --- mm/hugetlb.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 657b9facba280..ad98876d338e5 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5503,6 +5503,21 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf) return 0; } +#ifdef CONFIG_USERFAULTFD +static const struct vm_uffd_ops hugetlb_uffd_ops = { + /* _UFFDIO_ZEROPAGE not supported */ + .supported_ioctls = BIT(_UFFDIO_COPY) | + BIT(_UFFDIO_WRITEPROTECT) | + BIT(_UFFDIO_CONTINUE) | + BIT(_UFFDIO_POISON), + /* + * Hugetlbfs still has its own hard-coded handler in userfaultfd, + * due to limitations similar to vm_operations_struct.fault(). + * TODO: generalize it to use the API functions. + */ +}; +#endif + /* * When a new function is introduced to vm_operations_struct and added * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops. @@ -5516,6 +5531,9 @@ const struct vm_operations_struct hugetlb_vm_ops = { .close = hugetlb_vm_op_close, .may_split = hugetlb_vm_op_split, .pagesize = hugetlb_vm_op_pagesize, +#ifdef CONFIG_USERFAULTFD + .userfaultfd_ops = &hugetlb_uffd_ops, +#endif }; static pte_t make_huge_pte(struct vm_area_struct *vma, struct folio *folio, -- 2.50.1 ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v4 3/4] mm/hugetlb: Support vm_uffd_ops API 2025-10-14 23:15 ` [PATCH v4 3/4] mm/hugetlb: " Peter Xu @ 2025-10-20 14:19 ` David Hildenbrand 0 siblings, 0 replies; 37+ messages in thread From: David Hildenbrand @ 2025-10-20 14:19 UTC (permalink / raw) To: Peter Xu, linux-kernel, linux-mm Cc: Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, Hugh Dickins, Michal Hocko, Ujwal Kundur, Liam R . Howlett, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli On 15.10.25 01:15, Peter Xu wrote: > Add support for the new vm_uffd_ops API for hugetlb. Note that this only > introduces the support, the API is not yet used by core mm. > > Due to legacy reasons, it's still not trivial to move hugetlb completely to > the API. But it will still use supported_ioctls properly on the API with > all the implementations still hard coded in mm/. > > Cc: Muchun Song <muchun.song@linux.dev> > Cc: Oscar Salvador <osalvador@suse.de> > Acked-by: Mike Rapoport <rppt@kernel.org> > Signed-off-by: Peter Xu <peterx@redhat.com> > --- Acked-by: David Hildenbrand <david@redhat.com> -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v4 4/4] mm: Apply vm_uffd_ops API to core mm 2025-10-14 23:14 [PATCH v4 0/4] mm/userfaultfd: modulize memory types Peter Xu ` (2 preceding siblings ...) 2025-10-14 23:15 ` [PATCH v4 3/4] mm/hugetlb: " Peter Xu @ 2025-10-14 23:15 ` Peter Xu 2025-10-20 13:34 ` [PATCH v4 0/4] mm/userfaultfd: modulize memory types David Hildenbrand 4 siblings, 0 replies; 37+ messages in thread From: Peter Xu @ 2025-10-14 23:15 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, David Hildenbrand, Hugh Dickins, Michal Hocko, Ujwal Kundur, Liam R . Howlett, peterx, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli Move userfaultfd core to use new vm_uffd_ops API. After this change, file systems that implement vm_operations_struct can start using new API for userfaultfd operations. When at it, moving vma_can_userfault() into mm/userfaultfd.c instead, because it's getting too big. It's only used in slow paths so it shouldn't be an issue. Move the pte marker check before wp_async, which might be more intuitive because wp_async depends on pte markers. That shouldn't cause any functional change though because only one check would take effect depending on whether pte marker was selected in config. This will also remove quite some hard-coded checks for either shmem or hugetlbfs. Now all the old checks should still work but with vm_uffd_ops. Note that anonymous memory will still need to be processed separately because it doesn't have vm_ops at all. Reviewed-by: James Houghton <jthoughton@google.com> Acked-by: Mike Rapoport <rppt@kernel.org> Signed-off-by: Peter Xu <peterx@redhat.com> --- include/linux/userfaultfd_k.h | 46 ++++--------- mm/userfaultfd.c | 120 +++++++++++++++++++++++++++------- 2 files changed, 109 insertions(+), 57 deletions(-) diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index b5b4f3f174b32..983efaecfdcb0 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -126,9 +126,14 @@ struct vm_uffd_ops { #define MFILL_ATOMIC_FLAG(nr) ((__force uffd_flags_t) MFILL_ATOMIC_BIT(nr)) #define MFILL_ATOMIC_MODE_MASK ((__force uffd_flags_t) (MFILL_ATOMIC_BIT(0) - 1)) +static inline enum mfill_atomic_mode uffd_flags_get_mode(uffd_flags_t flags) +{ + return (__force enum mfill_atomic_mode)(flags & MFILL_ATOMIC_MODE_MASK); +} + static inline bool uffd_flags_mode_is(uffd_flags_t flags, enum mfill_atomic_mode expected) { - return (flags & MFILL_ATOMIC_MODE_MASK) == ((__force uffd_flags_t) expected); + return uffd_flags_get_mode(flags) == expected; } static inline uffd_flags_t uffd_flags_set_mode(uffd_flags_t flags, enum mfill_atomic_mode mode) @@ -237,41 +242,16 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma) return vma->vm_flags & __VM_UFFD_FLAGS; } -static inline bool vma_can_userfault(struct vm_area_struct *vma, - vm_flags_t vm_flags, - bool wp_async) +static inline const struct vm_uffd_ops *vma_get_uffd_ops(struct vm_area_struct *vma) { - vm_flags &= __VM_UFFD_FLAGS; - - if (vma->vm_flags & VM_DROPPABLE) - return false; - - if ((vm_flags & VM_UFFD_MINOR) && - (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma))) - return false; - - /* - * If wp async enabled, and WP is the only mode enabled, allow any - * memory type. - */ - if (wp_async && (vm_flags == VM_UFFD_WP)) - return true; - -#ifndef CONFIG_PTE_MARKER_UFFD_WP - /* - * If user requested uffd-wp but not enabled pte markers for - * uffd-wp, then shmem & hugetlbfs are not supported but only - * anonymous. - */ - if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma)) - return false; -#endif - - /* By default, allow any of anon|shmem|hugetlb */ - return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) || - vma_is_shmem(vma); + if (vma->vm_ops && vma->vm_ops->userfaultfd_ops) + return vma->vm_ops->userfaultfd_ops; + return NULL; } +bool vma_can_userfault(struct vm_area_struct *vma, + unsigned long vm_flags, bool wp_async); + static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma) { struct userfaultfd_ctx *uffd_ctx = vma->vm_userfaultfd_ctx.ctx; diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 00122f42718cc..b50db51e75d5b 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -20,6 +20,61 @@ #include "internal.h" #include "swap.h" +static unsigned long uffd_get_supported_modes(struct vm_area_struct *vma) +{ + const struct vm_uffd_ops *uffd_ops = vma_get_uffd_ops(vma); + unsigned long ioctls, supported = 0; + + if (!uffd_ops) + return 0; + + ioctls = uffd_ops->supported_ioctls; + + if (ioctls & (BIT(_UFFDIO_COPY) | BIT(_UFFDIO_ZEROPAGE))) + supported |= VM_UFFD_MISSING; + if (ioctls & BIT(_UFFDIO_CONTINUE)) + supported |= VM_UFFD_MINOR; + if (ioctls & BIT(_UFFDIO_WRITEPROTECT)) + supported |= VM_UFFD_WP; + + return supported; +} + +bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags, + bool wp_async) +{ + unsigned long supported = 0; + + if (vma->vm_flags & VM_DROPPABLE) + return false; + + vm_flags &= __VM_UFFD_FLAGS; + + /* + * If user requested uffd-wp but not enabled pte markers for + * uffd-wp, then any file system (like shmem or hugetlbfs) are not + * supported but only anonymous. + */ + if (!IS_ENABLED(CONFIG_PTE_MARKER_UFFD_WP) && + ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))) + return false; + + /* + * If wp async enabled, and WP is the only mode enabled, allow any + * memory type. + */ + if (wp_async && (vm_flags == VM_UFFD_WP)) + return true; + + if (vma_is_anonymous(vma)) + /* Anonymous has no page cache, MINOR not supported */ + supported = VM_UFFD_MISSING | VM_UFFD_WP; + else if (vma_get_uffd_ops(vma)) + supported = uffd_get_supported_modes(vma); + + return !(vm_flags & ~supported); +} + static __always_inline bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end) { @@ -382,13 +437,17 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd, unsigned long dst_addr, uffd_flags_t flags) { + const struct vm_uffd_ops *uffd_ops = vma_get_uffd_ops(dst_vma); struct inode *inode = file_inode(dst_vma->vm_file); pgoff_t pgoff = linear_page_index(dst_vma, dst_addr); struct folio *folio; struct page *page; int ret; - ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC); + if (WARN_ON_ONCE(!uffd_ops || !uffd_ops->minor_get_folio)) + return -EINVAL; + + ret = uffd_ops->minor_get_folio(inode, pgoff, &folio); /* Our caller expects us to return -EFAULT if we failed to find folio */ if (ret == -ENOENT) ret = -EFAULT; @@ -504,18 +563,6 @@ static __always_inline ssize_t mfill_atomic_hugetlb( u32 hash; struct address_space *mapping; - /* - * There is no default zero huge page for all huge page sizes as - * supported by hugetlb. A PMD_SIZE huge pages may exist as used - * by THP. Since we can not reliably insert a zero page, this - * feature is not supported. - */ - if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) { - up_read(&ctx->map_changing_lock); - uffd_mfill_unlock(dst_vma); - return -EINVAL; - } - src_addr = src_start; dst_addr = dst_start; copied = 0; @@ -694,6 +741,41 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd, return err; } +static inline bool +vma_uffd_ops_supported(struct vm_area_struct *vma, uffd_flags_t flags) +{ + enum mfill_atomic_mode mode = uffd_flags_get_mode(flags); + const struct vm_uffd_ops *uffd_ops; + unsigned long supported_ioctls; + + if ((flags & MFILL_ATOMIC_WP) && !(vma->vm_flags & VM_UFFD_WP)) + return false; + + /* Anonymous supports everything except CONTINUE */ + if (vma_is_anonymous(vma)) + return mode != MFILL_ATOMIC_CONTINUE; + + uffd_ops = vma_get_uffd_ops(vma); + if (!uffd_ops) + return false; + + supported_ioctls = uffd_ops->supported_ioctls; + switch (mode) { + case MFILL_ATOMIC_COPY: + return supported_ioctls & BIT(_UFFDIO_COPY); + case MFILL_ATOMIC_ZEROPAGE: + return supported_ioctls & BIT(_UFFDIO_ZEROPAGE); + case MFILL_ATOMIC_CONTINUE: + if (!(vma->vm_flags & VM_SHARED)) + return false; + return supported_ioctls & BIT(_UFFDIO_CONTINUE); + case MFILL_ATOMIC_POISON: + return supported_ioctls & BIT(_UFFDIO_POISON); + default: + return false; + } +} + static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, unsigned long dst_start, unsigned long src_start, @@ -752,11 +834,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, dst_vma->vm_flags & VM_SHARED)) goto out_unlock; - /* - * validate 'mode' now that we know the dst_vma: don't allow - * a wrprotect copy if the userfaultfd didn't register as WP. - */ - if ((flags & MFILL_ATOMIC_WP) && !(dst_vma->vm_flags & VM_UFFD_WP)) + if (!vma_uffd_ops_supported(dst_vma, flags)) goto out_unlock; /* @@ -766,12 +844,6 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, return mfill_atomic_hugetlb(ctx, dst_vma, dst_start, src_start, len, flags); - if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma)) - goto out_unlock; - if (!vma_is_shmem(dst_vma) && - uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) - goto out_unlock; - while (src_addr < src_start + len) { pmd_t dst_pmdval; -- 2.50.1 ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-10-14 23:14 [PATCH v4 0/4] mm/userfaultfd: modulize memory types Peter Xu ` (3 preceding siblings ...) 2025-10-14 23:15 ` [PATCH v4 4/4] mm: Apply vm_uffd_ops API to core mm Peter Xu @ 2025-10-20 13:34 ` David Hildenbrand 2025-10-20 14:12 ` Peter Xu 4 siblings, 1 reply; 37+ messages in thread From: David Hildenbrand @ 2025-10-20 13:34 UTC (permalink / raw) To: Peter Xu, linux-kernel, linux-mm Cc: Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, Hugh Dickins, Michal Hocko, Ujwal Kundur, Liam R . Howlett, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli On 15.10.25 01:14, Peter Xu wrote: > [based on latest akpm/mm-new of Oct 14th, commit 36c6c5ce1b275] > > v4: > - Some cleanups within vma_can_userfault() [David] > - Rename uffd_get_folio() to minor_get_folio() [David] > - Remove uffd_features in vm_uffd_ops, deduce it from supported ioctls [David] > > v1: https://lore.kernel.org/r/20250620190342.1780170-1-peterx@redhat.com > v2: https://lore.kernel.org/r/20250627154655.2085903-1-peterx@redhat.com > v3: https://lore.kernel.org/r/20250926211650.525109-1-peterx@redhat.com > > This series is an alternative proposal of what Nikita proposed here on the > initial three patches: > > https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com > > This is not yet relevant to any guest-memfd support, but paving way for it. > Here, the major goal is to make kernel modules be able to opt-in with any > form of userfaultfd supports, like guest-memfd. This alternative option > should hopefully be cleaner, and avoid leaking userfault details into > vm_ops.fault(). > > It also means this series does not depend on anything. It's a pure > refactoring of userfaultfd internals to provide a generic API, so that > other types of files, especially RAM based, can support userfaultfd without > touching mm/ at all. > > To achieve it, this series introduced a file operation called vm_uffd_ops. > The ops needs to be provided when a file type supports any of userfaultfd. > > With that, I moved both hugetlbfs and shmem over, whenever possible. So > far due to concerns on exposing an uffd_copy() API, the MISSING faults are > still separately processed and can only be done within mm/. Hugetlbfs kept > its special paths untouched. > > An example of shmem uffd_ops: > > static const struct vm_uffd_ops shmem_uffd_ops = { > .supported_ioctls = BIT(_UFFDIO_COPY) | > BIT(_UFFDIO_ZEROPAGE) | > BIT(_UFFDIO_WRITEPROTECT) | > BIT(_UFFDIO_CONTINUE) | > BIT(_UFFDIO_POISON), > .minor_get_folio = shmem_uffd_get_folio, > }; This looks better than the previous version to me. Long term the goal should be to move all hugetlb/shmem specific stuff out of mm/hugetlb.c and of course, we won't be adding any new ones to mm/userfaultfd.c I agree with Liam that a better interface could be providing default handlers for the separate ioctls [1], but there is always the option to evolve this interface into something like that later. [1] https://lkml.kernel.org/r/frnos5jtmlqvzpcrredcoummuzvllweku5dgp5ii5in6epwnw5@anu4dqsz6shy -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-10-20 13:34 ` [PATCH v4 0/4] mm/userfaultfd: modulize memory types David Hildenbrand @ 2025-10-20 14:12 ` Peter Xu 2025-10-21 15:51 ` Liam R. Howlett 0 siblings, 1 reply; 37+ messages in thread From: Peter Xu @ 2025-10-20 14:12 UTC (permalink / raw) To: David Hildenbrand Cc: linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, Hugh Dickins, Michal Hocko, Ujwal Kundur, Liam R . Howlett, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli On Mon, Oct 20, 2025 at 03:34:47PM +0200, David Hildenbrand wrote: > On 15.10.25 01:14, Peter Xu wrote: > > [based on latest akpm/mm-new of Oct 14th, commit 36c6c5ce1b275] > > > > v4: > > - Some cleanups within vma_can_userfault() [David] > > - Rename uffd_get_folio() to minor_get_folio() [David] > > - Remove uffd_features in vm_uffd_ops, deduce it from supported ioctls [David] > > > > v1: https://lore.kernel.org/r/20250620190342.1780170-1-peterx@redhat.com > > v2: https://lore.kernel.org/r/20250627154655.2085903-1-peterx@redhat.com > > v3: https://lore.kernel.org/r/20250926211650.525109-1-peterx@redhat.com > > > > This series is an alternative proposal of what Nikita proposed here on the > > initial three patches: > > > > https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com > > > > This is not yet relevant to any guest-memfd support, but paving way for it. > > Here, the major goal is to make kernel modules be able to opt-in with any > > form of userfaultfd supports, like guest-memfd. This alternative option > > should hopefully be cleaner, and avoid leaking userfault details into > > vm_ops.fault(). > > > > It also means this series does not depend on anything. It's a pure > > refactoring of userfaultfd internals to provide a generic API, so that > > other types of files, especially RAM based, can support userfaultfd without > > touching mm/ at all. > > > > To achieve it, this series introduced a file operation called vm_uffd_ops. > > The ops needs to be provided when a file type supports any of userfaultfd. > > > > With that, I moved both hugetlbfs and shmem over, whenever possible. So > > far due to concerns on exposing an uffd_copy() API, the MISSING faults are > > still separately processed and can only be done within mm/. Hugetlbfs kept > > its special paths untouched. > > > > An example of shmem uffd_ops: > > > > static const struct vm_uffd_ops shmem_uffd_ops = { > > .supported_ioctls = BIT(_UFFDIO_COPY) | > > BIT(_UFFDIO_ZEROPAGE) | > > BIT(_UFFDIO_WRITEPROTECT) | > > BIT(_UFFDIO_CONTINUE) | > > BIT(_UFFDIO_POISON), > > .minor_get_folio = shmem_uffd_get_folio, > > }; > > This looks better than the previous version to me. > > Long term the goal should be to move all hugetlb/shmem specific stuff out of > mm/hugetlb.c and of course, we won't be adding any new ones to > mm/userfaultfd.c > > I agree with Liam that a better interface could be providing default > handlers for the separate ioctls [1], but there is always the option to > evolve this interface into something like that later. Thanks for accepting this current form. > > > [1] https://lkml.kernel.org/r/frnos5jtmlqvzpcrredcoummuzvllweku5dgp5ii5in6epwnw5@anu4dqsz6shy I have replied to that, here: https://lore.kernel.org/all/aOVEDii4HPB6outm@x1.local/ If we ignore hugetlbfs, most of the hooks may not be needed, as explained. If we introduce hooks only for hugetlbfs, IMHO it's going backwards. When we want to get rid of hugetlbfs paths, we will have something more to get rid of.. We can wait until some use cases to appear, that may help us to better understand what might be more suitable to be abstracted. Especially, if there will be any consumer for missing faults. However if guest-memfd has it optional, my expectation so far is there'll be no real demand on that for the foreseeable future. Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-10-20 14:12 ` Peter Xu @ 2025-10-21 15:51 ` Liam R. Howlett 2025-10-21 16:28 ` Peter Xu 0 siblings, 1 reply; 37+ messages in thread From: Liam R. Howlett @ 2025-10-21 15:51 UTC (permalink / raw) To: Peter Xu Cc: David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli * Peter Xu <peterx@redhat.com> [251020 10:12]: > On Mon, Oct 20, 2025 at 03:34:47PM +0200, David Hildenbrand wrote: > > On 15.10.25 01:14, Peter Xu wrote: > > > [based on latest akpm/mm-new of Oct 14th, commit 36c6c5ce1b275] > > > > > > v4: > > > - Some cleanups within vma_can_userfault() [David] > > > - Rename uffd_get_folio() to minor_get_folio() [David] > > > - Remove uffd_features in vm_uffd_ops, deduce it from supported ioctls [David] > > > > > > v1: https://lore.kernel.org/r/20250620190342.1780170-1-peterx@redhat.com > > > v2: https://lore.kernel.org/r/20250627154655.2085903-1-peterx@redhat.com > > > v3: https://lore.kernel.org/r/20250926211650.525109-1-peterx@redhat.com > > > > > > This series is an alternative proposal of what Nikita proposed here on the > > > initial three patches: > > > > > > https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com > > > > > > This is not yet relevant to any guest-memfd support, but paving way for it. > > > Here, the major goal is to make kernel modules be able to opt-in with any > > > form of userfaultfd supports, like guest-memfd. This alternative option > > > should hopefully be cleaner, and avoid leaking userfault details into > > > vm_ops.fault(). > > > > > > It also means this series does not depend on anything. It's a pure > > > refactoring of userfaultfd internals to provide a generic API, so that > > > other types of files, especially RAM based, can support userfaultfd without > > > touching mm/ at all. > > > > > > To achieve it, this series introduced a file operation called vm_uffd_ops. > > > The ops needs to be provided when a file type supports any of userfaultfd. > > > > > > With that, I moved both hugetlbfs and shmem over, whenever possible. So > > > far due to concerns on exposing an uffd_copy() API, the MISSING faults are > > > still separately processed and can only be done within mm/. Hugetlbfs kept > > > its special paths untouched. > > > > > > An example of shmem uffd_ops: > > > > > > static const struct vm_uffd_ops shmem_uffd_ops = { > > > .supported_ioctls = BIT(_UFFDIO_COPY) | > > > BIT(_UFFDIO_ZEROPAGE) | > > > BIT(_UFFDIO_WRITEPROTECT) | > > > BIT(_UFFDIO_CONTINUE) | > > > BIT(_UFFDIO_POISON), > > > .minor_get_folio = shmem_uffd_get_folio, > > > }; I think you forgot to add the link to the guest_memfd implementation [1] to your cover letter. > > > > This looks better than the previous version to me. > > > > Long term the goal should be to move all hugetlb/shmem specific stuff out of > > mm/hugetlb.c and of course, we won't be adding any new ones to > > mm/userfaultfd.c > > > > I agree with Liam that a better interface could be providing default > > handlers for the separate ioctls [1], but there is always the option to > > evolve this interface into something like that later. > > Thanks for accepting this current form. > > > > > > > [1] https://lkml.kernel.org/r/frnos5jtmlqvzpcrredcoummuzvllweku5dgp5ii5in6epwnw5@anu4dqsz6shy > > I have replied to that, here: > > https://lore.kernel.org/all/aOVEDii4HPB6outm@x1.local/ > > If we ignore hugetlbfs, most of the hooks may not be needed, as explained. Those were examples. Hooks allow for all the memory type checking to go away in the code, which allows for more readable code and less operations per call. > > If we introduce hooks only for hugetlbfs, IMHO it's going backwards. When > we want to get rid of hugetlbfs paths, we will have something more to get > rid of.. This is just wrong. It is far easier to remove one function pointer than go through all the code and remove the checks for hugetlbfs. Are you thinking the hooks will just point to the generic function? This is the only way I can see your statement making sense. That's not the idea I'm trying to communicate. The idea is that you split the functions into parts that everyone does and special parts, then call them in the correct sequence for each type. New types need new special parts while using the generic code for the majority of the work. In this way, the memory types are modularized into function pointers that all use common code without adding complexity. In fact, knowing implicitly which context from call path means we don't need to check the types and should be able to reduce the complexity. Then adding a new memory type will call almost all the same functions except for special areas. Removing old memory types would me removing the special areas only - and maybe a function pointer if they are the only user. The current patch set does not modularizing memory, it is creating a middleware level where we have to parse a value to figure out what to do. These patches DO expose a method for memory types to be coded in a kernel module, which is fundamentally different than modularizing the memory types. Different enough to be glossed over on a ML by looking at the subject alone. Yes, one value is better than two values, but no magic values is ideal. Is it a significant amount of work to remove the magic value by fragmenting the code into memory type specific function pointers? IOW, instead of decoding the value to figure out where to route calls, just expose the calls directly in the function pointer layer that you are creating? What is the minimum amount of function pointers to get the guest_memfd to work without this value being parsed? [1]. https://lore.kernel.org/all/114133f5-0282-463d-9d65-3143aa658806@amazon.com/ Regards, Liam ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-10-21 15:51 ` Liam R. Howlett @ 2025-10-21 16:28 ` Peter Xu 2025-10-30 17:13 ` Liam R. Howlett 0 siblings, 1 reply; 37+ messages in thread From: Peter Xu @ 2025-10-21 16:28 UTC (permalink / raw) To: Liam R. Howlett, David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli On Tue, Oct 21, 2025 at 11:51:33AM -0400, Liam R. Howlett wrote: > * Peter Xu <peterx@redhat.com> [251020 10:12]: > > On Mon, Oct 20, 2025 at 03:34:47PM +0200, David Hildenbrand wrote: > > > On 15.10.25 01:14, Peter Xu wrote: > > > > [based on latest akpm/mm-new of Oct 14th, commit 36c6c5ce1b275] > > > > > > > > v4: > > > > - Some cleanups within vma_can_userfault() [David] > > > > - Rename uffd_get_folio() to minor_get_folio() [David] > > > > - Remove uffd_features in vm_uffd_ops, deduce it from supported ioctls [David] > > > > > > > > v1: https://lore.kernel.org/r/20250620190342.1780170-1-peterx@redhat.com > > > > v2: https://lore.kernel.org/r/20250627154655.2085903-1-peterx@redhat.com > > > > v3: https://lore.kernel.org/r/20250926211650.525109-1-peterx@redhat.com > > > > > > > > This series is an alternative proposal of what Nikita proposed here on the > > > > initial three patches: > > > > > > > > https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com > > > > > > > > This is not yet relevant to any guest-memfd support, but paving way for it. > > > > Here, the major goal is to make kernel modules be able to opt-in with any > > > > form of userfaultfd supports, like guest-memfd. This alternative option > > > > should hopefully be cleaner, and avoid leaking userfault details into > > > > vm_ops.fault(). > > > > > > > > It also means this series does not depend on anything. It's a pure > > > > refactoring of userfaultfd internals to provide a generic API, so that > > > > other types of files, especially RAM based, can support userfaultfd without > > > > touching mm/ at all. > > > > > > > > To achieve it, this series introduced a file operation called vm_uffd_ops. > > > > The ops needs to be provided when a file type supports any of userfaultfd. > > > > > > > > With that, I moved both hugetlbfs and shmem over, whenever possible. So > > > > far due to concerns on exposing an uffd_copy() API, the MISSING faults are > > > > still separately processed and can only be done within mm/. Hugetlbfs kept > > > > its special paths untouched. > > > > > > > > An example of shmem uffd_ops: > > > > > > > > static const struct vm_uffd_ops shmem_uffd_ops = { > > > > .supported_ioctls = BIT(_UFFDIO_COPY) | > > > > BIT(_UFFDIO_ZEROPAGE) | > > > > BIT(_UFFDIO_WRITEPROTECT) | > > > > BIT(_UFFDIO_CONTINUE) | > > > > BIT(_UFFDIO_POISON), > > > > .minor_get_folio = shmem_uffd_get_folio, > > > > }; > > I think you forgot to add the link to the guest_memfd implementation [1] > to your cover letter. I didn't. https://lore.kernel.org/all/20251014231501.2301398-1-peterx@redhat.com/ To show another sample, this is the patch that Nikita posted to implement minor fault for guest-memfd (on top of older versions of this series): https://lore.kernel.org/all/114133f5-0282-463d-9d65-3143aa658806@amazon.com/ > > > > > > > This looks better than the previous version to me. > > > > > > Long term the goal should be to move all hugetlb/shmem specific stuff out of > > > mm/hugetlb.c and of course, we won't be adding any new ones to > > > mm/userfaultfd.c > > > > > > I agree with Liam that a better interface could be providing default > > > handlers for the separate ioctls [1], but there is always the option to > > > evolve this interface into something like that later. > > > > Thanks for accepting this current form. > > > > > > > > > > > [1] https://lkml.kernel.org/r/frnos5jtmlqvzpcrredcoummuzvllweku5dgp5ii5in6epwnw5@anu4dqsz6shy > > > > I have replied to that, here: > > > > https://lore.kernel.org/all/aOVEDii4HPB6outm@x1.local/ > > > > If we ignore hugetlbfs, most of the hooks may not be needed, as explained. > > Those were examples. > > Hooks allow for all the memory type checking to go away in the code, > which allows for more readable code and less operations per call. > > > > > If we introduce hooks only for hugetlbfs, IMHO it's going backwards. When > > we want to get rid of hugetlbfs paths, we will have something more to get > > rid of.. > > This is just wrong. > > It is far easier to remove one function pointer than go through all the > code and remove the checks for hugetlbfs. > > Are you thinking the hooks will just point to the generic function? > This is the only way I can see your statement making sense. That's not > the idea I'm trying to communicate. > > The idea is that you split the functions into parts that everyone does > and special parts, then call them in the correct sequence for each type. > New types need new special parts while using the generic code for the > majority of the work. > > In this way, the memory types are modularized into function pointers > that all use common code without adding complexity. In fact, knowing > implicitly which context from call path means we don't need to check the > types and should be able to reduce the complexity. > > Then adding a new memory type will call almost all the same functions > except for special areas. > > Removing old memory types would me removing the special areas only - and > maybe a function pointer if they are the only user. > > The current patch set does not modularizing memory, it is creating a > middleware level where we have to parse a value to figure out what to > do. > > These patches DO expose a method for memory types to be coded in a > kernel module, which is fundamentally different than modularizing the > memory types. Different enough to be glossed over on a ML by looking at > the subject alone. > > Yes, one value is better than two values, but no magic values is ideal. > > Is it a significant amount of work to remove the magic value by > fragmenting the code into memory type specific function pointers? > > IOW, instead of decoding the value to figure out where to route calls, > just expose the calls directly in the function pointer layer that you > are creating? What is the minimum amount of function pointers to get > the guest_memfd to work without this value being parsed? > > [1]. https://lore.kernel.org/all/114133f5-0282-463d-9d65-3143aa658806@amazon.com/ I don't know what you're looking for. I think I got most acks from userfaultfd developers whoever were active in the past few years, ever since v1... Then, we got some concern on uffd_copy() API being complicated, it's fine, I dropped it. We got some other concern on having a function returning folio pointer. We talked it all through, luckily, even if I do not know what really happened. Now, I really don't know what you're suggesting here. Can you send some patches and show us the code, help everyone to support guest-memfd minor fault, please? -- Peter Xu ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-10-21 16:28 ` Peter Xu @ 2025-10-30 17:13 ` Liam R. Howlett 2025-10-30 18:00 ` Nikita Kalyazin ` (3 more replies) 0 siblings, 4 replies; 37+ messages in thread From: Liam R. Howlett @ 2025-10-30 17:13 UTC (permalink / raw) To: Peter Xu Cc: David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli * Peter Xu <peterx@redhat.com> [251021 12:28]: ... > Can you send some patches and show us the code, help everyone to support > guest-memfd minor fault, please? Patches are here: https://git.infradead.org/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/modularized_mem This is actually modularized memory types. That means there is no hugetlb.h or shmem.h included in mm/userfaultfd.c code. uffd_flag_t has been removed. This was turning into a middleware and it is not necessary. Neither is supported_ioctls. hugetlb now uses the same functions as every other memory type, including anon memory. Any memory type can change functionality without adding instructions or flags or anything to some other code. This code passes uffd-unit-test and uffd-wp-mremap (skipped the swap tests). guest-memfd can implement whatever it needs to (or use others implementations), like shmem_uffd_ops here: static const struct vm_uffd_ops shmem_uffd_ops = { .copy = shmem_mfill_atomic_pte_copy, .zeropage = shmem_mfill_atomic_pte_zeropage, .cont = shmem_mfill_atomic_pte_continue, .poison = mfill_atomic_pte_poison, .writeprotect = uffd_writeprotect, .is_dst_valid = shmem_is_dst_valid, .increment = mfill_size, .failed_do_unlock = uffd_failed_do_unlock, .page_shift = uffd_page_shift, .complete_register = uffd_complete_register, }; Where guest-memfd needs to write the one function: guest_memfd_pte_continue(), from what I understand. Obviously some of the shmem_ functions would need to be added to a header, or such. And most of that can come from shmem_mfill_atomic_pte_continue(), from what I understand. This is about 40 lines of code, but may require exposing some shmem functions to keep the code that compact. So we don't need to expose getting a folio to a module, or decode any special flags or whatever. We just call the function that needs to be called on the vma that is found. If anyone has tests I can use for guest-memfd and instructions on guest-memfd setup, I'll just write it instead of expanding the userfaultfd middleware. Thanks, Liam ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-10-30 17:13 ` Liam R. Howlett @ 2025-10-30 18:00 ` Nikita Kalyazin 2025-10-30 19:07 ` Peter Xu ` (2 subsequent siblings) 3 siblings, 0 replies; 37+ messages in thread From: Nikita Kalyazin @ 2025-10-30 18:00 UTC (permalink / raw) To: Liam R. Howlett, Peter Xu, David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli On 30/10/2025 17:13, Liam R. Howlett wrote: > * Peter Xu <peterx@redhat.com> [251021 12:28]: > > ... > >> Can you send some patches and show us the code, help everyone to support >> guest-memfd minor fault, please? > > Patches are here: > > https://git.infradead.org/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/modularized_mem > > This is actually modularized memory types. That means there is no > hugetlb.h or shmem.h included in mm/userfaultfd.c code. > > uffd_flag_t has been removed. This was turning into a middleware and > it is not necessary. Neither is supported_ioctls. > > hugetlb now uses the same functions as every other memory type, > including anon memory. > > Any memory type can change functionality without adding instructions or > flags or anything to some other code. > > This code passes uffd-unit-test and uffd-wp-mremap (skipped the swap > tests). > > guest-memfd can implement whatever it needs to (or use others > implementations), like shmem_uffd_ops here: > > static const struct vm_uffd_ops shmem_uffd_ops = { > .copy = shmem_mfill_atomic_pte_copy, > .zeropage = shmem_mfill_atomic_pte_zeropage, > .cont = shmem_mfill_atomic_pte_continue, > .poison = mfill_atomic_pte_poison, > .writeprotect = uffd_writeprotect, > .is_dst_valid = shmem_is_dst_valid, > .increment = mfill_size, > .failed_do_unlock = uffd_failed_do_unlock, > .page_shift = uffd_page_shift, > .complete_register = uffd_complete_register, > }; > > Where guest-memfd needs to write the one function: > guest_memfd_pte_continue(), from what I understand. > > Obviously some of the shmem_ functions would need to be added to a > header, or such. > > And most of that can come from shmem_mfill_atomic_pte_continue(), from > what I understand. This is about 40 lines of code, but may require > exposing some shmem functions to keep the code that compact. > > So we don't need to expose getting a folio to a module, or decode any > special flags or whatever. We just call the function that needs to be > called on the vma that is found. > > If anyone has tests I can use for guest-memfd and instructions on > guest-memfd setup, I'll just write it instead of expanding the > userfaultfd middleware. I used to use [1] as a test. The test function can be called from inside if (flags & GUEST_MEMFD_FLAG_MMAP) { test_mmap_supported(fd, page_size, total_size); test_fault_overflow(fd, page_size, total_size); } else { in your tree (tools/testing/selftests/kvm/guest_memfd_test.c). CONFIG_KVM_GUEST_MEMFD=y should be sufficient to enable guest_memfd support. [1] https://lore.kernel.org/kvm/20250404154352.23078-7-kalyazin@amazon.com/ > > Thanks, > Liam > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-10-30 17:13 ` Liam R. Howlett 2025-10-30 18:00 ` Nikita Kalyazin @ 2025-10-30 19:07 ` Peter Xu 2025-10-30 19:55 ` Peter Xu 2025-10-30 20:24 ` Liam R. Howlett 2025-11-03 16:11 ` Mike Rapoport 2025-11-05 21:23 ` David Hildenbrand 3 siblings, 2 replies; 37+ messages in thread From: Peter Xu @ 2025-10-30 19:07 UTC (permalink / raw) To: Liam R. Howlett, David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli On Thu, Oct 30, 2025 at 01:13:24PM -0400, Liam R. Howlett wrote: > * Peter Xu <peterx@redhat.com> [251021 12:28]: > > ... > > > Can you send some patches and show us the code, help everyone to support > > guest-memfd minor fault, please? > > Patches are here: > > https://git.infradead.org/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/modularized_mem Great! Finally we have something solid to discuss on top. Yes, I'm extremely happy to see whatever code there is, I'm happy to review it. I'm happy to see it rolling. If it is better, we can adopt it. > > This is actually modularized memory types. That means there is no > hugetlb.h or shmem.h included in mm/userfaultfd.c code. > > uffd_flag_t has been removed. This was turning into a middleware and > it is not necessary. Neither is supported_ioctls. > > hugetlb now uses the same functions as every other memory type, > including anon memory. > > Any memory type can change functionality without adding instructions or > flags or anything to some other code. > > This code passes uffd-unit-test and uffd-wp-mremap (skipped the swap > tests). > > guest-memfd can implement whatever it needs to (or use others > implementations), like shmem_uffd_ops here: I didn't look at the patches in details, however I already commented previously on a similar comment you left. Since you have solid code this time, let me ask one by one clearly this time inline: > > static const struct vm_uffd_ops shmem_uffd_ops = { > .copy = shmem_mfill_atomic_pte_copy, This is optional, if you did the convertion it's perfect (though it's buggy right now, more below). Yes, UFFDIO_COPY might be a good fit for a global API like this, however that's the least useful, as I mentioned, I do not expect a new user.. It means, what you did on this may not grow any user. The whole change may not be useful to anyone.. Then I see what you introduced as the API: struct vm_uffd_ops { int (*copy)(struct uffd_info *info); int (*zeropage)(struct uffd_info *info); int (*cont)(struct uffd_info *info); int (*poison)(struct uffd_info *info); int (*writeprotect)(struct uffd_info *info); /* Required features below */ ssize_t (*is_dst_valid)(struct vm_area_struct *dst_vma, unsigned long dst_start, unsigned long len); unsigned long (*increment)(struct vm_area_struct *vma); ssize_t (*failed_do_unlock)(struct uffd_info *info); unsigned int (*page_shift)(struct vm_area_struct *src_vma); void (*complete_register)(struct vm_area_struct *vma); }; The copy() interface (and most of the rest) takes something called uffd_info*. Then it looks like this: struct uffd_info { unsigned long dst_addr; /* Target address */ unsigned long src_addr; /* Source address */ unsigned long len; /* Total length to copy */ unsigned long src_last; /* starting src_addr + len */ unsigned long dst_last; /* starting dst_addr + len */ struct folio *foliop; /* folio pointer for retry */ struct userfaultfd_ctx *ctx; /* The userfaultfd context */ struct vm_area_struct *dst_vma; /* Target vma */ unsigned long increment; /* Size of each operation */ bool wp; /* Operation is requesting write protection */ const struct vm_uffd_ops *uffd_ops; /* The vma uffd_ops pointer */ int (*op)(struct uffd_info *info); /* The operation to perform on the dst */ long copied; }; You went almost mad when I introduced uffd_copy() in v1. It might be because there used to have pmd* and something around pgtables. However I'll still need to question whether this is a better and easier to adopt interface if a mem type wants to opt-in any uffd features. So are you happy yourself now with above complicated struct that memtype needs to implement and support? > .zeropage = shmem_mfill_atomic_pte_zeropage, Why a memory type needs to provide a separate hook to inject a zeropage? It should almost always be the same of UFFDIO_COPY except copying. > .cont = shmem_mfill_atomic_pte_continue, It's OK to have it. However said that, what we really need is "fetching a cache folio". I'll need to check how you exposed the userfaultfd helpers so that it will support mem types to opt-in this. To me, this is really an overkill. Shmem impl: static int shmem_mfill_atomic_pte_continue(struct uffd_info *info) { struct vm_area_struct *dst_vma = info->dst_vma; struct inode *inode = file_inode(dst_vma->vm_file); pgoff_t pgoff = linear_page_index(dst_vma, info->dst_addr); pmd_t *dst_pmd; struct folio *folio; struct page *page; int ret; ret = uffd_get_dst_pmd(dst_vma, info->dst_addr, &dst_pmd); if (ret) return ret; ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC); /* Our caller expects us to return -EFAULT if we failed to find folio */ if (ret == -ENOENT) ret = -EFAULT; if (ret) goto out; if (!folio) { ret = -EFAULT; goto out; } page = folio_file_page(folio, pgoff); if (PageHWPoison(page)) { ret = -EIO; goto out_release; } ret = mfill_atomic_install_pte(dst_pmd, dst_vma, info->dst_addr, page, false, info->wp); if (ret) goto out_release; folio_unlock(folio); ret = 0; out: return ret; out_release: folio_unlock(folio); folio_put(folio); goto out; } So are you sure this is better than the oneliner? In your new API, the driver also needs to operate on pmd*. Is it a concern to you? Maybe you don't think it's a concern now, even if you used to think uffd_copy() has concerns exposing pmd* pointers? The current series I proposed is not only simpler, but only expose folio*. That's at least safer at least from your theory, is that right? > .poison = mfill_atomic_pte_poison, Why this is needed if UFFDIO_POISON should exactly do the same thing for each memory type? > .writeprotect = uffd_writeprotect, Same question. Wr-protect over a pagecache folio should really behave the same. Why do you need to introduce it? After all, even in your branch you reused change_protection(), where the hugetlb special operations reside. I don't see much point on why it's needed. > .is_dst_valid = shmem_is_dst_valid, It's definitely not obvious what this is for. Looks like it's trying to verify vma validity. However then I see shmem impl has this: static ssize_t shmem_is_dst_valid(struct vm_area_struct *dst_vma, unsigned long dst_start, unsigned long len) { if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma)) return -EINVAL; return 0; } Why shmem allows anon vma? > .increment = mfill_size, This is almost only useful for hugetlbfs, same to page_shift and complete_register. It's ok if you want to clean hugetlbfs. But if we want to fetch the granule of the folios inside a vma, IMHO we should make it a vma attribute, not something special to userfaultfd. > .failed_do_unlock = uffd_failed_do_unlock, You'd better at least change the name of it. It so far unlocks some misterious locks then try to copy the folio without mmap/vma lock. inline ssize_t uffd_failed_do_unlock(struct uffd_info *info) { ssize_t err; void *kaddr; up_read(&info->ctx->map_changing_lock); uffd_mfill_unlock(info->dst_vma); VM_WARN_ON_ONCE(!info->foliop); kaddr = kmap_local_folio(info->foliop, 0); err = copy_from_user(kaddr, (const void __user *) info->src_addr, PAGE_SIZE); kunmap_local(kaddr); if (unlikely(err)) return -EFAULT; flush_dcache_folio(info->foliop); return 0; } How do the mem type know what locks to unlock if they do not lock it first themselves? > .page_shift = uffd_page_shift, > .complete_register = uffd_complete_register, Hugetlbfs specific hooks. It's OK if you prefer rewritting code for hugetlbfs. I don't have objections. > }; > > Where guest-memfd needs to write the one function: > guest_memfd_pte_continue(), from what I understand. You did mention what is required in your new API: /* Required features below */ ssize_t (*is_dst_valid)(struct vm_area_struct *dst_vma, unsigned long dst_start, unsigned long len); unsigned long (*increment)(struct vm_area_struct *vma); ssize_t (*failed_do_unlock)(struct uffd_info *info); unsigned int (*page_shift)(struct vm_area_struct *src_vma); void (*complete_register)(struct vm_area_struct *vma); So guest-memfd needs to implement these 6 APIs to support minor fault. > > Obviously some of the shmem_ functions would need to be added to a > header, or such. > > And most of that can come from shmem_mfill_atomic_pte_continue(), from > what I understand. This is about 40 lines of code, but may require > exposing some shmem functions to keep the code that compact. > > So we don't need to expose getting a folio to a module, or decode any > special flags or whatever. We just call the function that needs to be I didn't reply before, but I don't think supported_ioctls is special flags or magic values. They're simply flags showing what the mem type supports on the ioctls. One can set it wrong, but I don't think what you proposed with above 6 APIs would be easier to get right. Any module can also make things wrong in any of above. UFFDIO_CONTINUE itself is definitely getting much more complicated, it used to only set a flag in supported_ioctls, provide a oneliner to fetch a cache. Now it needs all above 6 APIs, one of them taking uffd_info* which further contains 10+ fields to process. > called on the vma that is found. > > If anyone has tests I can use for guest-memfd and instructions on > guest-memfd setup, I'll just write it instead of expanding the > userfaultfd middleware. Personally, this whole thing is an overkill to me: $ git diff --stat 63f84ba525ea04ef376eac851efce2f82dd05f21..HEAD fs/userfaultfd.c | 14 +-- include/linux/hugetlb.h | 21 ---- include/linux/mm.h | 11 ++ include/linux/shmem_fs.h | 14 --- include/linux/userfaultfd_k.h | 108 ++++++++++------ mm/hugetlb.c | 359 +++++++++++++++++++++++++++++++++++++++++++++-------- mm/shmem.c | 245 ++++++++++++++++++++++++------------ mm/userfaultfd.c | 869 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------------------------------------------- 8 files changed, 962 insertions(+), 679 deletions(-) I can wait for some second opinions, but if you are confident this will be welcomed by others, I suggest you prepare a minimum changeset preparing support for guest-memfd minor fault, then post it formally upstream. Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-10-30 19:07 ` Peter Xu @ 2025-10-30 19:55 ` Peter Xu 2025-10-30 20:23 ` Lorenzo Stoakes 2025-10-30 20:52 ` Liam R. Howlett 2025-10-30 20:24 ` Liam R. Howlett 1 sibling, 2 replies; 37+ messages in thread From: Peter Xu @ 2025-10-30 19:55 UTC (permalink / raw) To: Liam R. Howlett, David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli On Thu, Oct 30, 2025 at 03:07:18PM -0400, Peter Xu wrote: > > Patches are here: > > > > https://git.infradead.org/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/modularized_mem > > Great! Finally we have something solid to discuss on top. > > Yes, I'm extremely happy to see whatever code there is, I'm happy to review > it. I'm happy to see it rolling. If it is better, we can adopt it. So here is a summary of why I think my proposal is better: - Much less code I think this is crystal clear.. I'm pasting once more in this summary email on what your proposal touches: fs/userfaultfd.c | 14 +-- include/linux/hugetlb.h | 21 ---- include/linux/mm.h | 11 ++ include/linux/shmem_fs.h | 14 --- include/linux/userfaultfd_k.h | 108 ++++++++++------ mm/hugetlb.c | 359 +++++++++++++++++++++++++++++++++++++++++++++-------- mm/shmem.c | 245 ++++++++++++++++++++++++------------ mm/userfaultfd.c | 869 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------------------------------------------- 8 files changed, 962 insertions(+), 679 deletions(-) - Much less future code The new proposal needs at least 6 APIs to implement even minor fault.. One of the API needs to be implemented with uffd_info* which further includes 10+ fields to process. It means we'll have a bunch of duplicated code in the future if new things pop up, so it's not only about what we merge. - Much less exported functions to modules My solution, after exposing vm_uffd_ops, doesn't need to export any function. Your solution needs to export a lot of new functions to modules. I didn't pay a lot of attention but the list should at least include these 10 functions: void uffd_complete_register(struct vm_area_struct *vma); unsigned int uffd_page_shift(struct vm_area_struct *vma); int uffd_writeprotect(struct uffd_info *info); ssize_t uffd_failed_do_unlock(struct uffd_info *info); int uffd_atomic_pte_copy(struct folio *folio, unsigned long src_addr); unsigned long mfill_size(struct vm_area_struct *vma) int mfill_atomic_pte_poison(struct uffd_info *info); int mfill_atomic_pte_copy(struct uffd_info *info); int mfill_atomic_pte_zeropage(struct uffd_info *info); ssize_t uffd_get_dst_pmd(struct vm_area_struct *dst_vma, unsigned long dst_addr,pmd_t **dst_pmd); It's simply unnecessary. - Less error prone At least to support minor fault, my solution only needs one hook fetching page cache, then set the CONTINUE ioctl in the supported_ioctls. - Safer Your code allows to operate on pmd* in a module??? That's too risky and mm can explode! Isn't it? - Do not build new codes on top of hugetlbfs AFAICT, more than half of your solution's API is trying to service hugetlbfs. IMHO that's the wrong way to go. I explained to you multiple times. We should either keep hugetlbfs alone, or having hugetlbfs adopt mm APIs instead. We shouldn't build any new code only trying to service hugetlbfsv1 but nobody else. We shouldn't introduce new mm API only to service hugetlbfs. - Much less risk of breaking things I'm pretty sure my code introduce zero or very little bug, if there's one, I'll fix it, but really, likely not, because the changes are straightforward. Your changes are huge. I would not be surprised you break things here and there. I hope at least you will be around fixing them when it happens, even if we're not sure the benefits of most of the changes. -- Peter Xu ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-10-30 19:55 ` Peter Xu @ 2025-10-30 20:23 ` Lorenzo Stoakes 2025-10-30 21:13 ` Peter Xu 2025-10-30 20:52 ` Liam R. Howlett 1 sibling, 1 reply; 37+ messages in thread From: Lorenzo Stoakes @ 2025-10-30 20:23 UTC (permalink / raw) To: Peter Xu Cc: Liam R. Howlett, David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli, conduct +cc CoC Peter, I'm sorry but your reply here is completely out of line. I know tensions can run high sometimes, but this is a _good faith_ effort to try to find a way forward. Please take a step back and show some respect for the fact that Liam has put VERY significant effort in preparing this after you _repeatedly_ asked him to show him code. I am starting to worry that your approach here is to bat off criticism by trying to wear reviewers down and that's really not a good thing. Again, this is _good faith_. Nobody is trying to unreasonably push back on these changes, we are just trying to find the best solution possible. Comments like: 'Your code allows to operate on pmd* in a module??? That's too risky and mm can explode! Isn't it?' and 'that's the wrong way to go. I explained to you multiple times.' and 'I'm pretty sure my code introduce zero or very little bug, if there's one, I'll fix it, but really, likely not, because the changes are straightforward.' vs. 'Your changes are huge. I would not be surprised you break things here and there. I hope at least you will be around fixing them when it happens, even if we're not sure the benefits of most of the changes.' are just _entirely_ unhelpful and really unacceptable. I have an extremely heavy workload at the moment anyway, but honestly interactions like this have seriously put me off being involved in this review personally. Do we really want this to be how review in mm or the kernel is? Is that really the culture we want to have here? Thanks, Lorenzo On Thu, Oct 30, 2025 at 03:55:44PM -0400, Peter Xu wrote: > On Thu, Oct 30, 2025 at 03:07:18PM -0400, Peter Xu wrote: > > > Patches are here: > > > > > > https://git.infradead.org/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/modularized_mem > > > > Great! Finally we have something solid to discuss on top. > > > > Yes, I'm extremely happy to see whatever code there is, I'm happy to review > > it. I'm happy to see it rolling. If it is better, we can adopt it. > > So here is a summary of why I think my proposal is better: > > - Much less code > > I think this is crystal clear.. I'm pasting once more in this summary > email on what your proposal touches: > > fs/userfaultfd.c | 14 +-- > include/linux/hugetlb.h | 21 ---- > include/linux/mm.h | 11 ++ > include/linux/shmem_fs.h | 14 --- > include/linux/userfaultfd_k.h | 108 ++++++++++------ > mm/hugetlb.c | 359 +++++++++++++++++++++++++++++++++++++++++++++-------- > mm/shmem.c | 245 ++++++++++++++++++++++++------------ > mm/userfaultfd.c | 869 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------------------------------------------- > 8 files changed, 962 insertions(+), 679 deletions(-) > > - Much less future code > > The new proposal needs at least 6 APIs to implement even minor fault.. > One of the API needs to be implemented with uffd_info* which further > includes 10+ fields to process. It means we'll have a bunch of > duplicated code in the future if new things pop up, so it's not only > about what we merge. > > - Much less exported functions to modules > > My solution, after exposing vm_uffd_ops, doesn't need to export any > function. > > Your solution needs to export a lot of new functions to modules. I > didn't pay a lot of attention but the list should at least include these > 10 functions: > > void uffd_complete_register(struct vm_area_struct *vma); > unsigned int uffd_page_shift(struct vm_area_struct *vma); > int uffd_writeprotect(struct uffd_info *info); > ssize_t uffd_failed_do_unlock(struct uffd_info *info); > int uffd_atomic_pte_copy(struct folio *folio, unsigned long src_addr); > unsigned long mfill_size(struct vm_area_struct *vma) > int mfill_atomic_pte_poison(struct uffd_info *info); > int mfill_atomic_pte_copy(struct uffd_info *info); > int mfill_atomic_pte_zeropage(struct uffd_info *info); > ssize_t uffd_get_dst_pmd(struct vm_area_struct *dst_vma, unsigned long dst_addr,pmd_t **dst_pmd); > > It's simply unnecessary. > > - Less error prone > > At least to support minor fault, my solution only needs one hook fetching > page cache, then set the CONTINUE ioctl in the supported_ioctls. > > - Safer > > Your code allows to operate on pmd* in a module??? That's too risky and > mm can explode! Isn't it? > > - Do not build new codes on top of hugetlbfs > > AFAICT, more than half of your solution's API is trying to service > hugetlbfs. IMHO that's the wrong way to go. I explained to you multiple > times. We should either keep hugetlbfs alone, or having hugetlbfs adopt > mm APIs instead. We shouldn't build any new code only trying to service > hugetlbfsv1 but nobody else. We shouldn't introduce new mm API only to > service hugetlbfs. > > - Much less risk of breaking things > > I'm pretty sure my code introduce zero or very little bug, if there's > one, I'll fix it, but really, likely not, because the changes are > straightforward. > > Your changes are huge. I would not be surprised you break things here > and there. I hope at least you will be around fixing them when it > happens, even if we're not sure the benefits of most of the changes. > > -- > Peter Xu > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-10-30 20:23 ` Lorenzo Stoakes @ 2025-10-30 21:13 ` Peter Xu 2025-10-30 21:27 ` Peter 2025-11-03 20:01 ` David Hildenbrand (Red Hat) 0 siblings, 2 replies; 37+ messages in thread From: Peter Xu @ 2025-10-30 21:13 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Liam R. Howlett, David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli, conduct On Thu, Oct 30, 2025 at 08:23:03PM +0000, Lorenzo Stoakes wrote: > +cc CoC > > Peter, > > I'm sorry but your reply here is completely out of line. > > I know tensions can run high sometimes, but this is a _good faith_ effort to try > to find a way forward. > > Please take a step back and show some respect for the fact that Liam has put > VERY significant effort in preparing this after you _repeatedly_ asked him to > show him code. > > I am starting to worry that your approach here is to bat off criticism by trying > to wear reviewers down and that's really not a good thing. > > Again, this is _good faith_. Nobody is trying to unreasonably push back on these > changes, we are just trying to find the best solution possible. > > Comments like: > > 'Your code allows to operate on pmd* in a module??? That's too risky and mm can > explode! Isn't it?' > > and 'that's the wrong way to go. I explained to you multiple times.' > > and 'I'm pretty sure my code introduce zero or very little bug, if there's one, I'll > fix it, but really, likely not, because the changes are straightforward.' > > vs. 'Your changes are huge. I would not be surprised you break things here and > there. I hope at least you will be around fixing them when it happens, even if > we're not sure the benefits of most of the changes.' > > are just _entirely_ unhelpful and really unacceptable. > > I have an extremely heavy workload at the moment anyway, but honestly > interactions like this have seriously put me off being involved in this review > personally. > > Do we really want this to be how review in mm or the kernel is? > > Is that really the culture we want to have here? Gosh.. Seriously? I'm ok if this needs to be audited. I have all the previous discussions in the cover letter as links. -- Peter Xu ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-10-30 21:13 ` Peter Xu @ 2025-10-30 21:27 ` Peter 2025-11-03 20:01 ` David Hildenbrand (Red Hat) 1 sibling, 0 replies; 37+ messages in thread From: Peter @ 2025-10-30 21:27 UTC (permalink / raw) To: Peter Xu Cc: Lorenzo Stoakes, Liam R. Howlett, David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli, conduct thank you On Thu, 30 Oct 2025 at 21:24, Peter Xu <peterx@redhat.com> wrote: > > On Thu, Oct 30, 2025 at 08:23:03PM +0000, Lorenzo Stoakes wrote: > > +cc CoC > > > > Peter, > > > > I'm sorry but your reply here is completely out of line. > > > > I know tensions can run high sometimes, but this is a _good faith_ effort to try > > to find a way forward. > > > > Please take a step back and show some respect for the fact that Liam has put > > VERY significant effort in preparing this after you _repeatedly_ asked him to > > show him code. > > > > I am starting to worry that your approach here is to bat off criticism by trying > > to wear reviewers down and that's really not a good thing. > > > > Again, this is _good faith_. Nobody is trying to unreasonably push back on these > > changes, we are just trying to find the best solution possible. > > > > Comments like: > > > > 'Your code allows to operate on pmd* in a module??? That's too risky and mm can > > explode! Isn't it?' > > > > and 'that's the wrong way to go. I explained to you multiple times.' > > > > and 'I'm pretty sure my code introduce zero or very little bug, if there's one, I'll > > fix it, but really, likely not, because the changes are straightforward.' > > > > vs. 'Your changes are huge. I would not be surprised you break things here and > > there. I hope at least you will be around fixing them when it happens, even if > > we're not sure the benefits of most of the changes.' > > > > are just _entirely_ unhelpful and really unacceptable. > > > > I have an extremely heavy workload at the moment anyway, but honestly > > interactions like this have seriously put me off being involved in this review > > personally. > > > > Do we really want this to be how review in mm or the kernel is? > > > > Is that really the culture we want to have here? > > Gosh.. Seriously? > > I'm ok if this needs to be audited. I have all the previous discussions in > the cover letter as links. > > -- > Peter Xu > > ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-10-30 21:13 ` Peter Xu 2025-10-30 21:27 ` Peter @ 2025-11-03 20:01 ` David Hildenbrand (Red Hat) 2025-11-03 20:46 ` Peter Xu 1 sibling, 1 reply; 37+ messages in thread From: David Hildenbrand (Red Hat) @ 2025-11-03 20:01 UTC (permalink / raw) To: Peter Xu, Lorenzo Stoakes Cc: Liam R. Howlett, David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli, conduct >> I have an extremely heavy workload at the moment anyway, but honestly >> interactions like this have seriously put me off being involved in this review >> personally. >> >> Do we really want this to be how review in mm or the kernel is? >> >> Is that really the culture we want to have here? > > Gosh.. Seriously? > > I'm ok if this needs to be audited. I have all the previous discussions in > the cover letter as links. I'm late to the party (or whatever this here is called. ah right, drama), and I haven't yet dug through all the emails and certainly not through all the of involved code changes. Peter, I was a bit surprised by your messages here indeed, not what I expected. The "Your code allows to operate on pmd* in a module??? That's too risky and mm can explode! Isn't it?" definitely was absolutely unnecessary ... or telling Liam that "he want almost mad". Again, not what I would have expected from you, and I would assume that you had a bad day and would at least apologize now that some time passed. I understand that you were upset by the previous feedback on the earlier series. There were some heated arguments in the last discussions, but most of them were based on misunderstandings. I would have thought that once they were resolved that we could continue focusing on discussing the technical details again. From what I can see you asked for actual code and when Liam came back with some code that looks like *a lot of work* to me. He really seems to care (which I highly appreciate) and went the extra mile to show us how the uffd code could evolve. We've all (well okay, some of us) been crying for some proper userfaultfd cleanups for years. So is there a way we can move forward with this without thinking in binary? Is there some middle-ground to be had? Can some reworks come on top of your series? Can so reworks be integrated in this series? I agree that what Liam proposes here is on the larger side, and probably does a lot of things in a single rework. That doesn't mean that we couldn't move into that direction in smaller steps. (I really don't think we should be thinking in terms of a CoC war like: show them what I did and I will show them what they did. We are all working on the same bigger goal here after all ...) -- Cheers David ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-11-03 20:01 ` David Hildenbrand (Red Hat) @ 2025-11-03 20:46 ` Peter Xu 2025-11-03 21:27 ` David Hildenbrand (Red Hat) 0 siblings, 1 reply; 37+ messages in thread From: Peter Xu @ 2025-11-03 20:46 UTC (permalink / raw) To: David Hildenbrand (Red Hat) Cc: Lorenzo Stoakes, Liam R. Howlett, David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli, conduct On Mon, Nov 03, 2025 at 09:01:02PM +0100, David Hildenbrand (Red Hat) wrote: > > > I have an extremely heavy workload at the moment anyway, but honestly > > > interactions like this have seriously put me off being involved in this review > > > personally. > > > > > > Do we really want this to be how review in mm or the kernel is? > > > > > > Is that really the culture we want to have here? > > > > Gosh.. Seriously? > > > > I'm ok if this needs to be audited. I have all the previous discussions in > > the cover letter as links. > > I'm late to the party (or whatever this here is called. ah right, drama), > and I haven't yet dug through all the emails and certainly not through all > the of involved code changes. > > Peter, I was a bit surprised by your messages here indeed, not what I > expected. > > The "Your code allows to operate on pmd* in a module??? That's too risky and > mm can explode! Isn't it?" definitely was absolutely unnecessary ... or > telling Liam that "he want almost mad". It was a joke! uffd_copy() API was NACKed because of this. Now the new proposal introduced it. I made a joke saying Liam allows that to happen in his branch, but forbid mine. I thought it was super clear to identify. > > Again, not what I would have expected from you, and I would assume that you > had a bad day and would at least apologize now that some time passed. Sorry, no. I won't apologize for that. I was not fair treated, and now I think it's fair I at least make a joke. David, you're leaving, and I'm totally dissappointed that at this point of time, you ask me to apologize instead. I thought it was obvious a joke, because I never thought having pmd* in a function in a module is not OK. I always thought it was fine, Linux is not a micro kernel. It's just fine. It is what happening in Linux right now. It is so obvious. In case it was not clear, I hope I make it clear now. If I'm going to formally NACK Liam's series, I won't use this as one of the real reasons. I just hide it in some of others that are real reasons. However if to be fair, when this reason is removed, this series should also remove the "highlight" that it removed shmem.h header, because my v1 also did that when with uffd_copy(). > > I understand that you were upset by the previous feedback on the earlier > series. > > There were some heated arguments in the last discussions, but most of them > were based on misunderstandings. I would have thought that once they were > resolved that we could continue focusing on discussing the technical details > again. > > From what I can see you asked for actual code and when Liam came back with > some code that looks like *a lot of work* to me. It's Liam who stood out strongly pushing back what he at least used to be not familiar with. This was, IMHO, rude. It's ok to keep silent on some patchset that one isn't familiar. It's ok to ask questions. It's not ok to strongly push back without being extremely familiar with the code. He might be more familiar now, I wish he is. But it's Liam's decision to work on the code. We're adults, we do what we should do, not what we asked to do. If we do what we asked to do, we should have our reasons. My ask was trying to make Liam see that what he proposed is over engineering the whole thing. I was pretty sure of that, he wasn't. I explained to him multiple times on why it was an overkill, he doesn't agree. It's fine for him to disagree, it's Liam's right. Then it's also fine for me to ask him code it up to notice himself, if I can't persuade him. That's the only way for him to persuade me instead. I sincerely wished that works out. As I said, then I'll properly review it, and then we build whatever we need on top. I'm totally fine. However it didn't go like that, the API is exactly what I pictured. I prefer my proposal. That's what I did: showing the difference when there're two proposals, and ask for a second opinion. It's not fair to put that on top of me to blame. He's trying to justify he's correct. It has nothing to do with me. He can stop pushing back anytime. He can keep proposing what he works on. It's his decision, not mine. > > He really seems to care (which I highly appreciate) and went the extra mile > to show us how the uffd code could evolve. > > We've all (well okay, some of us) been crying for some proper userfaultfd > cleanups for years. > > So is there a way we can move forward with this without thinking in binary? > Is there some middle-ground to be had? Can some reworks come on top of your > series? Can so reworks be integrated in this series? > > I agree that what Liam proposes here is on the larger side, and probably > does a lot of things in a single rework. That doesn't mean that we couldn't > move into that direction in smaller steps. > > (I really don't think we should be thinking in terms of a CoC war like: show > them what I did and I will show them what they did. We are all working on > the same bigger goal here after all ...) We've got some second opinion from Mike, please read it first. David, you're co-maintaining mm with Andrew. I think it's fair indeed you provide how things should go together with Andrew. It's fair you and Andrew whoever would like to make a decision on how to move forward. I'm fine on whatever decision you want to make. I think I tried my best trying to make gmem work as simple as possible. For the CoC report, I wish someone from CoC would also review it. Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-11-03 20:46 ` Peter Xu @ 2025-11-03 21:27 ` David Hildenbrand (Red Hat) 2025-11-03 22:49 ` Peter Xu 2025-11-04 7:21 ` Mike Rapoport 0 siblings, 2 replies; 37+ messages in thread From: David Hildenbrand (Red Hat) @ 2025-11-03 21:27 UTC (permalink / raw) To: Peter Xu Cc: Lorenzo Stoakes, Liam R. Howlett, David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli, conduct On 03.11.25 21:46, Peter Xu wrote: > On Mon, Nov 03, 2025 at 09:01:02PM +0100, David Hildenbrand (Red Hat) wrote: >>>> I have an extremely heavy workload at the moment anyway, but honestly >>>> interactions like this have seriously put me off being involved in this review >>>> personally. >>>> >>>> Do we really want this to be how review in mm or the kernel is? >>>> >>>> Is that really the culture we want to have here? >>> >>> Gosh.. Seriously? >>> >>> I'm ok if this needs to be audited. I have all the previous discussions in >>> the cover letter as links. >> >> I'm late to the party (or whatever this here is called. ah right, drama), >> and I haven't yet dug through all the emails and certainly not through all >> the of involved code changes. >> >> Peter, I was a bit surprised by your messages here indeed, not what I >> expected. >> >> The "Your code allows to operate on pmd* in a module??? That's too risky and >> mm can explode! Isn't it?" definitely was absolutely unnecessary ... or >> telling Liam that "he want almost mad". > > It was a joke! > > uffd_copy() API was NACKed because of this. Now the new proposal > introduced it. I made a joke saying Liam allows that to happen in his > branch, but forbid mine. > > I thought it was super clear to identify. Text is a very bad medium for that, especially given the previous discussions that were rather heated. So it's good that you clarify that -- I am not sure how many people got that it was a joke TBH. I understood the reference to previous discussions but to me it sounded rather dismissive in the context of this discussion. > >> >> Again, not what I would have expected from you, and I would assume that you >> had a bad day and would at least apologize now that some time passed. > > Sorry, no. I won't apologize for that. I was not fair treated, and now I > think it's fair I at least make a joke. Peter, if you would tell me that I am going mad I would not be able to understand that as a joke -- unless maybe if you add plenty of :) . :) > > David, you're leaving, and I'm totally dissappointed that at this point of > time, you ask me to apologize instead. I'll be right here, working for the community as I always do. So please read my message as if nothing in that regard happened. I don't want you to feel bad here, I want us to find a solution without more of this drama. Because that's what we have here, unfortunately :( > > I thought it was obvious a joke, because I never thought having pmd* in a > function in a module is not OK. Unfortunately it was not clear. > > I always thought it was fine, Linux is not a micro kernel. It's just fine. > It is what happening in Linux right now. It is so obvious. In case it was > not clear, I hope I make it clear now. If I'm going to formally NACK > Liam's series, I won't use this as one of the real reasons. I just hide it > in some of others that are real reasons. However if to be fair, when this > reason is removed, this series should also remove the "highlight" that it > removed shmem.h header, because my v1 also did that when with uffd_copy(). > >> >> I understand that you were upset by the previous feedback on the earlier >> series. >> >> There were some heated arguments in the last discussions, but most of them >> were based on misunderstandings. I would have thought that once they were >> resolved that we could continue focusing on discussing the technical details >> again. >> >> From what I can see you asked for actual code and when Liam came back with >> some code that looks like *a lot of work* to me. > > It's Liam who stood out strongly pushing back what he at least used to be > not familiar with. This was, IMHO, rude. It's ok to keep silent on some > patchset that one isn't familiar. It's ok to ask questions. It's not ok > to strongly push back without being extremely familiar with the code. /me am I a rude person? :( ;) The previous discussions on this were not ideal, because there were misunderstandings, yes. Liam has a lot of background on VMA handling, so I think getting is input is actually pretty valuable. > > He might be more familiar now, I wish he is. But it's Liam's decision to > work on the code. Right, Liam took the time to actually implement what he envisionsed. I assume it was a great learning experience for him. Shame that this drama here seems to make him want to stop using that experience in the future. > > We're adults, we do what we should do, not what we asked to do. If we do > what we asked to do, we should have our reasons. > > My ask was trying to make Liam see that what he proposed is over > engineering the whole thing. I was pretty sure of that, he wasn't. I > explained to him multiple times on why it was an overkill, he doesn't > agree. It's fine for him to disagree, it's Liam's right. Then it's also > fine for me to ask him code it up to notice himself, if I can't persuade > him. That's the only way for him to persuade me instead. Well, he noticed that we can apparently cleanup userfaultfd quite heavily. :) And maybe that's the main problem here: Liam talks about general uffd cleanups while you are focused on supporting guest_memfd minor mode "as simple as possible" (as you write below). I acked your series for a reason: I think it is good enough to implement that (as simple as possible), but I also have the feeling that we can do much better in general. > > I sincerely wished that works out. As I said, then I'll properly review > it, and then we build whatever we need on top. I'm totally fine. However > it didn't go like that, the API is exactly what I pictured. I prefer my > proposal. That's what I did: showing the difference when there're two > proposals, and ask for a second opinion. > > It's not fair to put that on top of me to blame. He's trying to justify > he's correct. It has nothing to do with me. He can stop pushing back > anytime. He can keep proposing what he works on. It's his decision, not > mine. I would prefer if we can come to a conclusion instead of having people stop pushing back and walking away. I assume positive intend here from both sides. > >> >> He really seems to care (which I highly appreciate) and went the extra mile >> to show us how the uffd code could evolve. >> >> We've all (well okay, some of us) been crying for some proper userfaultfd >> cleanups for years. >> >> So is there a way we can move forward with this without thinking in binary? >> Is there some middle-ground to be had? Can some reworks come on top of your >> series? Can so reworks be integrated in this series? >> >> I agree that what Liam proposes here is on the larger side, and probably >> does a lot of things in a single rework. That doesn't mean that we couldn't >> move into that direction in smaller steps. >> >> (I really don't think we should be thinking in terms of a CoC war like: show >> them what I did and I will show them what they did. We are all working on >> the same bigger goal here after all ...) > > We've got some second opinion from Mike, please read it first. I read it, and I will have to look into some more details. But what I could read from Mikes reply is that there could be a discussion continuing where we would find a middle ground. Well, if I can motivate Liam to keep working on userfaultfd at all. David, > you're co-maintaining mm with Andrew. I think it's fair indeed you provide > how things should go together with Andrew. It's fair you and Andrew > whoever would like to make a decision on how to move forward. I'm fine on > whatever decision you want to make. Unfortuantely (or fortunately?) I am not officially maintaining userfaultfd. And Andrew is not involved enough I am afraid to make a decision. Of course, I *could* make a decision, but that would likely involve that we continue the discussion without this drama. But do people want that? -- Cheers David ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-11-03 21:27 ` David Hildenbrand (Red Hat) @ 2025-11-03 22:49 ` Peter Xu 2025-11-04 7:10 ` Lorenzo Stoakes 2025-11-04 14:18 ` David Hildenbrand (Red Hat) 2025-11-04 7:21 ` Mike Rapoport 1 sibling, 2 replies; 37+ messages in thread From: Peter Xu @ 2025-11-03 22:49 UTC (permalink / raw) To: David Hildenbrand (Red Hat) Cc: Lorenzo Stoakes, Liam R. Howlett, David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli, conduct On Mon, Nov 03, 2025 at 10:27:05PM +0100, David Hildenbrand (Red Hat) wrote: > On 03.11.25 21:46, Peter Xu wrote: > > On Mon, Nov 03, 2025 at 09:01:02PM +0100, David Hildenbrand (Red Hat) wrote: > > > > > I have an extremely heavy workload at the moment anyway, but honestly > > > > > interactions like this have seriously put me off being involved in this review > > > > > personally. > > > > > > > > > > Do we really want this to be how review in mm or the kernel is? > > > > > > > > > > Is that really the culture we want to have here? > > > > > > > > Gosh.. Seriously? > > > > > > > > I'm ok if this needs to be audited. I have all the previous discussions in > > > > the cover letter as links. > > > > > > I'm late to the party (or whatever this here is called. ah right, drama), > > > and I haven't yet dug through all the emails and certainly not through all > > > the of involved code changes. > > > > > > Peter, I was a bit surprised by your messages here indeed, not what I > > > expected. > > > > > > The "Your code allows to operate on pmd* in a module??? That's too risky and > > > mm can explode! Isn't it?" definitely was absolutely unnecessary ... or > > > telling Liam that "he want almost mad". > > > > It was a joke! > > > > uffd_copy() API was NACKed because of this. Now the new proposal > > introduced it. I made a joke saying Liam allows that to happen in his > > branch, but forbid mine. > > > > I thought it was super clear to identify. > > Text is a very bad medium for that, especially given the previous > discussions that were rather heated. > > So it's good that you clarify that -- I am not sure how many people got that > it was a joke TBH. > > I understood the reference to previous discussions but to me it sounded > rather dismissive in the context of this discussion. > > > > > > > > > Again, not what I would have expected from you, and I would assume that you > > > had a bad day and would at least apologize now that some time passed. > > > > Sorry, no. I won't apologize for that. I was not fair treated, and now I > > think it's fair I at least make a joke. > > Peter, if you would tell me that I am going mad I would not be able to > understand that as a joke -- unless maybe if you add plenty of :) . :) If it's a problematic use of the word "mad", it could be my English not my attitude. I can easily say I'm mad at something when I'm not satisfied. I admit I'm not a native speaker. But then, if you think "mad" is a bad word, how about: https://lore.kernel.org/all/6odeeo7bgxgq4v6y3jercrriqyreynuelofrw6k6roh7ws5vy2@wyvx7uiztb5y/ I'm happy to address changes, but I'm not happy to accept more middleware and "it's not part of the patch set" to address any problem as you push more trash into an already horrible code base. I didn't raise CoC report for that. I was still trying to be technical on the whole discussion as much as I can do best. Hence, my reply was: > I'm happy to address changes, but I'm not happy to accept more > middleware and "it's not part of the patch set" to address any problem > as you push more trash into an already horrible code base. > > We need to fix things too. > > So I'm fixing it. Let's wait for a 2nd opinion on the approaches. As I said, I'm OK if everyone likes your solution and if I'm the only one NACKing it. If we can support guest-memfd finally whoever adds that, it's not so bad. Is "trash" a better word than "mad"? > > > > > David, you're leaving, and I'm totally dissappointed that at this point of > > time, you ask me to apologize instead. > > I'll be right here, working for the community as I always do. So please read > my message as if nothing in that regard happened. > > I don't want you to feel bad here, I want us to find a solution without more > of this drama. > > Because that's what we have here, unfortunately :( > > > > > I thought it was obvious a joke, because I never thought having pmd* in a > > function in a module is not OK. > > Unfortunately it was not clear. > > > > > I always thought it was fine, Linux is not a micro kernel. It's just fine. > > It is what happening in Linux right now. It is so obvious. In case it was > > not clear, I hope I make it clear now. If I'm going to formally NACK > > Liam's series, I won't use this as one of the real reasons. I just hide it > > in some of others that are real reasons. However if to be fair, when this > > reason is removed, this series should also remove the "highlight" that it > > removed shmem.h header, because my v1 also did that when with uffd_copy(). > > > > > > > > I understand that you were upset by the previous feedback on the earlier > > > series. > > > > > > There were some heated arguments in the last discussions, but most of them > > > were based on misunderstandings. I would have thought that once they were > > > resolved that we could continue focusing on discussing the technical details > > > again. > > > > > > From what I can see you asked for actual code and when Liam came back with > > > some code that looks like *a lot of work* to me. > > > > It's Liam who stood out strongly pushing back what he at least used to be > > not familiar with. This was, IMHO, rude. It's ok to keep silent on some > > patchset that one isn't familiar. It's ok to ask questions. It's not ok > > to strongly push back without being extremely familiar with the code. > > /me am I a rude person? :( ;) Frankly, I would trust that if you strongly NACK a series, then you should have good knowledge of the code base and the series you disagree. If you didn't have enough knowledge and NACK some patchset without really knowing much better than the author, yes, IMHO I think it's rude too. If I did it, it's the same. I will be the one to be rude. It has nothing to do with who's doing it. Like if I strongly push back whatever change in maple tree, I'll be rude. I never did, and I'll never do that. > > The previous discussions on this were not ideal, because there were > misunderstandings, yes. Liam has a lot of background on VMA handling, so I > think getting is input is actually pretty valuable. There is a line. I can't tell how to draw a line, but there is. > > > > > He might be more familiar now, I wish he is. But it's Liam's decision to > > work on the code. > > Right, Liam took the time to actually implement what he envisionsed. I > assume it was a great learning experience for him. > > Shame that this drama here seems to make him want to stop using that > experience in the future. > > > > > We're adults, we do what we should do, not what we asked to do. If we do > > what we asked to do, we should have our reasons. > > > > My ask was trying to make Liam see that what he proposed is over > > engineering the whole thing. I was pretty sure of that, he wasn't. I > > explained to him multiple times on why it was an overkill, he doesn't > > agree. It's fine for him to disagree, it's Liam's right. Then it's also > > fine for me to ask him code it up to notice himself, if I can't persuade > > him. That's the only way for him to persuade me instead. > > Well, he noticed that we can apparently cleanup userfaultfd quite heavily. > :) > > And maybe that's the main problem here: Liam talks about general uffd > cleanups while you are focused on supporting guest_memfd minor mode "as > simple as possible" (as you write below). > > I acked your series for a reason: I think it is good enough to implement > that (as simple as possible), but I also have the feeling that we can do > much better in general. "feeling" is not a good reason to block a series from landing. If you, or Liam, or anyone, has good proposal already, we can always consider it. The thing is I don't easily see a good proposal so far, it's non-trivial. Meanwhile, the current v1-v4 I posted should be simply enough that even if one day someone wants to clean it up we can revert relevant changes and apply the cleanup idea on top, because the changeset needed to do the cleanup on top of v1-v4 of mine will be trivial. It doesn't need to be blocked. I mentioned too that I think userfaultfd code isn't the cleanest, for example, here: https://lore.kernel.org/all/aQPX859LbBg5FmE8@x1.local/ On Thu, Oct 30, 2025 at 04:24:46PM -0400, Liam R. Howlett wrote: > Right, so the existing code a huge mess today and you won't fix > anything, ever. IMHO fix is the wrong word. Cleanup it is. I agree the current userfaultfd code isn't extremely easy to follow. So I agree cleanups might help. Liam explained his "vision" on how to cleanup. I explained why it won't work, starting from: https://lore.kernel.org/all/20250926211650.525109-1-peterx@redhat.com At that time, the proposal was still: struct vm_uffd_ops hugetlb_uffd_ops = { .missing = hugetlb_handle_userfault, .write_protect = mwriteprotect_range, .minor = hugetlb_handle_userfault_minor, .mfill_atomic = hugetlb_mfill_atomic_pte, .mfill_atomic_continue = ... .mfill_zeropage = ... .mfill_poison = ... .mfill_copy = NULL, /* For example */ }; Obviously, whoever proposed above hasn't looked at handle_userfaultfd() at all. That's also why I stopped commenting at that time, because it means who proposed it actually doesn't know the code well yet, and I don't necessarily need to comment on each line. I explained from high level on why it's an overkill at that time. It's fine to propose something being familiar or not with it, but again it's not fine to strongly pushback one series with such a proposal, and without the familiarity of the code base. > > > > > I sincerely wished that works out. As I said, then I'll properly review > > it, and then we build whatever we need on top. I'm totally fine. However > > it didn't go like that, the API is exactly what I pictured. I prefer my > > proposal. That's what I did: showing the difference when there're two > > proposals, and ask for a second opinion. > > > > It's not fair to put that on top of me to blame. He's trying to justify > > he's correct. It has nothing to do with me. He can stop pushing back > > anytime. He can keep proposing what he works on. It's his decision, not > > mine. > > I would prefer if we can come to a conclusion instead of having people stop > pushing back and walking away. Obviously people can walk away from either side. It's not always that who push back things can walk away. I'm nobody. I don't mean I'll be walking away and I'm comparing that to Liam's walking away. Liam is doing very well on maple trees (I didn't follow at all, but I'd trust that), while I'm pretty sure Linux will run as smooth if I walked away. However IMHO this is not a reason at all to allow anyone randomly NACK on anything without good reasons, and without good knowledge base of what the patchset is touching. Not to mention this is also not the 1st time it got strong NACK. v2 was almost NACKed because I introduced a function that returns a folio*. > > I assume positive intend here from both sides. > > > > > > > > > He really seems to care (which I highly appreciate) and went the extra mile > > > to show us how the uffd code could evolve. > > > > > > We've all (well okay, some of us) been crying for some proper userfaultfd > > > cleanups for years. > > > > > > So is there a way we can move forward with this without thinking in binary? > > > Is there some middle-ground to be had? Can some reworks come on top of your > > > series? Can so reworks be integrated in this series? > > > > > > I agree that what Liam proposes here is on the larger side, and probably > > > does a lot of things in a single rework. That doesn't mean that we couldn't > > > move into that direction in smaller steps. > > > > > > (I really don't think we should be thinking in terms of a CoC war like: show > > > them what I did and I will show them what they did. We are all working on > > > the same bigger goal here after all ...) > > > > We've got some second opinion from Mike, please read it first. > > I read it, and I will have to look into some more details. But what I could > read from Mikes reply is that there could be a discussion continuing where > we would find a middle ground. > > Well, if I can motivate Liam to keep working on userfaultfd at all. > > David, > > you're co-maintaining mm with Andrew. I think it's fair indeed you provide > > how things should go together with Andrew. It's fair you and Andrew > > whoever would like to make a decision on how to move forward. I'm fine on > > whatever decision you want to make. > > Unfortuantely (or fortunately?) I am not officially maintaining userfaultfd. > And Andrew is not involved enough I am afraid to make a decision. > > Of course, I *could* make a decision, but that would likely involve that we > continue the discussion without this drama. But do people want that? If you get my whole point, I was sincerely trying to collect 2nd opinions. I can paste my reply once more here, for my attitude: https://lore.kernel.org/all/aQPX859LbBg5FmE8@x1.local/ If you are talking about drama, just to mention I didn't raise a CoC report even if my code was evaluated as "trash". IMHO, whoever reads these discussions likely wasted some part of one's time. I don't want to waste time for whoever is going to audit this whole thing. I left my opinion in maybe 1 hour after Liam shared his branch, that included having lunch. I can glimpse ~1000 LOC as fast almost because what Liam proposed matched almost like what I can imagine. Mike shared his opinion today. You can definitely share yours after taking time to read about it. We stuck here for months. So many things happened. It's definitely not a problem if we take this slow. Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-11-03 22:49 ` Peter Xu @ 2025-11-04 7:10 ` Lorenzo Stoakes 2025-11-04 14:18 ` David Hildenbrand (Red Hat) 1 sibling, 0 replies; 37+ messages in thread From: Lorenzo Stoakes @ 2025-11-04 7:10 UTC (permalink / raw) To: Peter Xu Cc: David Hildenbrand (Red Hat), Liam R. Howlett, David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli, conduct Peter, I feel like this is getting entirely out of hand. You have three mm maintainers telling you that you've crossed the line, with respect - might I suggest that's an indication that you in fact have? This is perhaps the result of some level of miscommunication resulting in your being defensive due to you misinterpreting good faith technical feedback based on genuine concerns as some form of attack. They are not, and have never been. A couple instances of exasperated reference to UFFD code in need of cleaning up (something you admit to yourself) does not amount to this, sorry. I find your comments about 'familiarity' deeply disturbing - UFFD is hooked into all of mm including VMA logic, page table logic, in fact a lot of the issue with the code is that it's coupled in this way - experienced mm maintainers' opinons are _entirely_ valid as as result. But in any case - you ought to engage on the technical points. A blanket dismissal is wholly inappropriate. If somebody's 'unfamiliarity' with parts of the code base are an issue, then you should very trivially be able to push back on the points raised right? The issue here is non-technical, having experienced it myself, and as exemplified in this thread - people provide valid criticism, you respond by essentially refusing to engage on the points raised. The net result is exactly what's happened here - Liam has (entirely understandably) decided to no longer engage with the thread. I've not been engaging on this review either - It's not that I'm bowled over by the technical arguments, it's that it's not worth the effort sitting down and assessing the technical merits because what's the point? You don't accept technical criticism and I don't review/maintain this code so am not obliged to. Your desire for a '2nd opinion' seems to be more so desiring that people agree with you. You have received a 2nd opinion. You didn't like it so now you want another. Your series will likely now land with review comments unaddressed having silenced criticism through non-technical means. Rather than all of this nonsense, we could simply have had a technical debate - with strong feelings on both sides perhaps - but constrained to the technical issues at hand. Instead we are now going to end up merging code where criticism has been silenced. Again I question if this is how things should work in mm or the kernel. Perhaps something to discuss at the next LSF/MM. Regards, Lorenzo ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-11-03 22:49 ` Peter Xu 2025-11-04 7:10 ` Lorenzo Stoakes @ 2025-11-04 14:18 ` David Hildenbrand (Red Hat) 1 sibling, 0 replies; 37+ messages in thread From: David Hildenbrand (Red Hat) @ 2025-11-04 14:18 UTC (permalink / raw) To: Peter Xu Cc: Lorenzo Stoakes, Liam R. Howlett, linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli, conduct >>>> Again, not what I would have expected from you, and I would assume that you >>>> had a bad day and would at least apologize now that some time passed. >>> >>> Sorry, no. I won't apologize for that. I was not fair treated, and now I >>> think it's fair I at least make a joke. >> >> Peter, if you would tell me that I am going mad I would not be able to >> understand that as a joke -- unless maybe if you add plenty of :) . :) > > If it's a problematic use of the word "mad", it could be my English not my > attitude. I can easily say I'm mad at something when I'm not satisfied. > I admit I'm not a native speaker. I'll note that there is a difference between "going mad" and "getting mad". Many of us are not native-speakers, so this stuff can happen. We can explain that it was not our intention and move on. > > But then, if you think "mad" is a bad word, how about: The issue is when it gets personal, not the word itself. I.e., calling code mad/trash vs. calling people mad/trash. Could be that some userfaultfd code might be mad trash :) > > https://lore.kernel.org/all/6odeeo7bgxgq4v6y3jercrriqyreynuelofrw6k6roh7ws5vy2@wyvx7uiztb5y/ > > I'm happy to address changes, but I'm not happy to accept more > middleware and "it's not part of the patch set" to address any > problem as you push more trash into an already horrible code base. > > I didn't raise CoC report for that. I don't think this is CoC material, but I am sure no expert on that. > I was still trying to be technical on > the whole discussion as much as I can do best. Hence, my reply was: > > > I'm happy to address changes, but I'm not happy to accept more > > middleware and "it's not part of the patch set" to address any problem > > as you push more trash into an already horrible code base. > > > > We need to fix things too. > > > > So I'm fixing it. > > Let's wait for a 2nd opinion on the approaches. > > As I said, I'm OK if everyone likes your solution and if I'm the only one > NACKing it. If we can support guest-memfd finally whoever adds that, it's > not so bad. > > Is "trash" a better word than "mad"? See above. [...] >>> >>> It's Liam who stood out strongly pushing back what he at least used to be >>> not familiar with. This was, IMHO, rude. It's ok to keep silent on some >>> patchset that one isn't familiar. It's ok to ask questions. It's not ok >>> to strongly push back without being extremely familiar with the code. >> >> /me am I a rude person? :( ;) > > Frankly, I would trust that if you strongly NACK a series, then you should > have good knowledge of the code base and the series you disagree. Right, I didn't, I pushed back against some things in there to reduce the scope. Even if it would go upstream as is I could live with it. I don't really maintain the VMA pieces though, that this interacts with. So maintainers that maintain related pieces (e.g., VMA) might have a stronger opinion on the interface than me. > > If you didn't have enough knowledge and NACK some patchset without really > knowing much better than the author, yes, IMHO I think it's rude too. I'm usually careful with NACKs, I barely use them :) > > If I did it, it's the same. I will be the one to be rude. It has nothing > to do with who's doing it. > > Like if I strongly push back whatever change in maple tree, I'll be rude. > I never did, and I'll never do that. Well, assume you maintain VMA handling and there are changes to the maple tree that would affect the VMA handling, sure you could push back against some interface changes without being an expert on the maple tree implementation. That's how people push back against some rust changes without being an expert on rust ... Anyhow, I agree that jumping in into something completely unrelated could be considered rude. I don't think this is the case with Liam and Lorenzo here, though. But that's just my personal take. [...] >> I acked your series for a reason: I think it is good enough to implement >> that (as simple as possible), but I also have the feeling that we can do >> much better in general. > > "feeling" is not a good reason to block a series from landing. Right, but I didn't block it because of that. I merely pointed at the raised alternative. > If you, or > Liam, or anyone, has good proposal already, we can always consider it. Yeah, that's what Liam ended up doing. Spending a lot of time to have a technical discussion to see how we can move forward. > > The thing is I don't easily see a good proposal so far, it's non-trivial. > > Meanwhile, the current v1-v4 I posted should be simply enough that even if > one day someone wants to clean it up we can revert relevant changes and > apply the cleanup idea on top, because the changeset needed to do the > cleanup on top of v1-v4 of mine will be trivial. It doesn't need to be > blocked. > > I mentioned too that I think userfaultfd code isn't the cleanest, for > example, here: > > https://lore.kernel.org/all/aQPX859LbBg5FmE8@x1.local/ > > On Thu, Oct 30, 2025 at 04:24:46PM -0400, Liam R. Howlett wrote: > > Right, so the existing code a huge mess today and you won't fix > > anything, ever. > > IMHO fix is the wrong word. Cleanup it is. I agree the current > userfaultfd code isn't extremely easy to follow. > > So I agree cleanups might help. Good that we all seem to agree on that at least. > > Liam explained his "vision" on how to cleanup. I explained why it won't > work, starting from: > > https://lore.kernel.org/all/20250926211650.525109-1-peterx@redhat.com > > At that time, the proposal was still: > > struct vm_uffd_ops hugetlb_uffd_ops = { > .missing = hugetlb_handle_userfault, > .write_protect = mwriteprotect_range, > .minor = hugetlb_handle_userfault_minor, > > .mfill_atomic = hugetlb_mfill_atomic_pte, > .mfill_atomic_continue = ... > .mfill_zeropage = ... > .mfill_poison = ... > .mfill_copy = NULL, /* For example */ > }; > > Obviously, whoever proposed above hasn't looked at handle_userfaultfd() at > all. That's also why I stopped commenting at that time, because it means > who proposed it actually doesn't know the code well yet, and I don't > necessarily need to comment on each line. I explained from high level on > why it's an overkill at that time. > > It's fine to propose something being familiar or not with it, but again > it's not fine to strongly pushback one series with such a proposal, and > without the familiarity of the code base. Again, see my reasoning above regarding interfaces. Don't get me wrong, I understand both sides here and I am personally also not a friend of hard pushback / NACKs, which is why I usually try to phrase stuff differently. > >> >>> >>> I sincerely wished that works out. As I said, then I'll properly review >>> it, and then we build whatever we need on top. I'm totally fine. However >>> it didn't go like that, the API is exactly what I pictured. I prefer my >>> proposal. That's what I did: showing the difference when there're two >>> proposals, and ask for a second opinion. >>> >>> It's not fair to put that on top of me to blame. He's trying to justify >>> he's correct. It has nothing to do with me. He can stop pushing back >>> anytime. He can keep proposing what he works on. It's his decision, not >>> mine. >> >> I would prefer if we can come to a conclusion instead of having people stop >> pushing back and walking away. > > Obviously people can walk away from either side. It's not always that who > push back things can walk away. > > I'm nobody. I don't mean I'll be walking away and I'm comparing that to > Liam's walking away. Liam is doing very well on maple trees (I didn't > follow at all, but I'd trust that), while I'm pretty sure Linux will run as > smooth if I walked away. Of course, I wouldn't say "nobody". But yeah, pretty much all of use could leave the kernel today and Linux would keep on running fine (at least for a while :) ). > However IMHO this is not a reason at all to allow > anyone randomly NACK on anything without good reasons, and without good > knowledge base of what the patchset is touching. > > Not to mention this is also not the 1st time it got strong NACK. v2 was > almost NACKed because I introduced a function that returns a folio*. Yeah, I hate how this patch set evolved. I guess most of us do. [...] >> >> Of course, I *could* make a decision, but that would likely involve that we >> continue the discussion without this drama. But do people want that? > > If you get my whole point, I was sincerely trying to collect 2nd opinions. > > I can paste my reply once more here, for my attitude: > > https://lore.kernel.org/all/aQPX859LbBg5FmE8@x1.local/ That was the mail that was IMHO absolutely appropriately phrased. > > If you are talking about drama, just to mention I didn't raise a CoC report > even if my code was evaluated as "trash". IMHO, whoever reads these > discussions likely wasted some part of one's time. I don't want to waste > time for whoever is going to audit this whole thing. Again, calling code trash is not really CoC material but I might be wrong. > > I left my opinion in maybe 1 hour after Liam shared his branch, that > included having lunch. I can glimpse ~1000 LOC as fast almost because what > Liam proposed matched almost like what I can imagine. > > Mike shared his opinion today. > > You can definitely share yours after taking time to read about it. > > We stuck here for months. So many things happened. It's definitely not a > problem if we take this slow. I hope we can de-escalate now and try to move forward while assuming positive intend and maybe slapping in a bunch of :) to highlight jokes ;) -- Cheers David ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-11-03 21:27 ` David Hildenbrand (Red Hat) 2025-11-03 22:49 ` Peter Xu @ 2025-11-04 7:21 ` Mike Rapoport 2025-11-04 12:23 ` David Hildenbrand (Red Hat) 2025-11-06 16:32 ` Liam R. Howlett 1 sibling, 2 replies; 37+ messages in thread From: Mike Rapoport @ 2025-11-04 7:21 UTC (permalink / raw) To: David Hildenbrand (Red Hat) Cc: Peter Xu, Lorenzo Stoakes, Liam R. Howlett, David Hildenbrand, linux-kernel, linux-mm, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli, conduct On Mon, Nov 03, 2025 at 10:27:05PM +0100, David Hildenbrand (Red Hat) wrote: > > And maybe that's the main problem here: Liam talks about general uffd > cleanups while you are focused on supporting guest_memfd minor mode "as > simple as possible" (as you write below). Hijacking for the technical part for a moment ;-) It seems that "as simple as possible" can even avoid data members in struct vm_uffd_ops, e.g something along these lines: diff --git a/include/linux/mm.h b/include/linux/mm.h index d16b33bacc32..840986780cb5 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -605,6 +605,8 @@ struct vm_fault { */ }; +struct vm_uffd_ops; + /* * These are the virtual MM functions - opening of an area, closing and * unmapping it (needed to keep files on disk up-to-date etc), pointer @@ -690,6 +692,9 @@ struct vm_operations_struct { struct page *(*find_normal_page)(struct vm_area_struct *vma, unsigned long addr); #endif /* CONFIG_FIND_NORMAL_PAGE */ +#ifdef CONFIG_USERFAULTFD + const struct vm_uffd_ops *uffd_ops; +#endif }; #ifdef CONFIG_NUMA_BALANCING diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index c0e716aec26a..aac7ac616636 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -111,6 +111,11 @@ static inline uffd_flags_t uffd_flags_set_mode(uffd_flags_t flags, enum mfill_at /* Flags controlling behavior. These behavior changes are mode-independent. */ #define MFILL_ATOMIC_WP MFILL_ATOMIC_FLAG(0) +struct vm_uffd_ops { + int (*minor_get_folio)(struct inode *inode, pgoff_t pgoff, + struct folio **folio); +}; + extern int mfill_atomic_install_pte(pmd_t *dst_pmd, struct vm_area_struct *dst_vma, unsigned long dst_addr, struct page *page, diff --git a/mm/shmem.c b/mm/shmem.c index b9081b817d28..b4318ad3bdf9 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3260,6 +3260,17 @@ int shmem_mfill_atomic_pte(pmd_t *dst_pmd, shmem_inode_unacct_blocks(inode, 1); return ret; } + +static int shmem_uffd_minor_get_folio(struct inode *inode, pgoff_t pgoff, + struct folio **folio) +{ + return shmem_get_folio(inode, pgoff, 0, folio, SGP_NOALLOC); +} + +static const struct vm_uffd_ops shmem_uffd_ops = { + .minor_get_folio = shmem_uffd_minor_get_folio, +}; + #endif /* CONFIG_USERFAULTFD */ #ifdef CONFIG_TMPFS @@ -5292,6 +5303,9 @@ static const struct vm_operations_struct shmem_vm_ops = { .set_policy = shmem_set_policy, .get_policy = shmem_get_policy, #endif +#ifdef CONFIG_USERFAULTFD + .uffd_ops = &shmem_uffd_ops, +#endif }; static const struct vm_operations_struct shmem_anon_vm_ops = { @@ -5301,6 +5315,9 @@ static const struct vm_operations_struct shmem_anon_vm_ops = { .set_policy = shmem_set_policy, .get_policy = shmem_get_policy, #endif +#ifdef CONFIG_USERFAULTFD + .uffd_ops = &shmem_uffd_ops, +#endif }; int shmem_init_fs_context(struct fs_context *fc) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index af61b95c89e4..6b30a8f39f4d 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -20,6 +20,20 @@ #include "internal.h" #include "swap.h" +static const struct vm_uffd_ops anon_uffd_ops = { +}; + +static inline const struct vm_uffd_ops *vma_get_uffd_ops(struct vm_area_struct *vma) +{ + if (vma->vm_ops && vma->vm_ops->uffd_ops) + return vma->vm_ops->uffd_ops; + + if (vma_is_anonymous(vma)) + return &anon_uffd_ops; + + return NULL; +} + static __always_inline bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end) { @@ -382,13 +396,14 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd, unsigned long dst_addr, uffd_flags_t flags) { + const struct vm_uffd_ops *uffd_ops = vma_get_uffd_ops(dst_vma); struct inode *inode = file_inode(dst_vma->vm_file); pgoff_t pgoff = linear_page_index(dst_vma, dst_addr); struct folio *folio; struct page *page; int ret; - ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC); + ret = uffd_ops->minor_get_folio(inode, pgoff, &folio); /* Our caller expects us to return -EFAULT if we failed to find folio */ if (ret == -ENOENT) ret = -EFAULT; @@ -707,6 +722,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, unsigned long src_addr, dst_addr; long copied; struct folio *folio; + const struct vm_uffd_ops *uffd_ops; /* * Sanitize the command parameters: @@ -766,10 +782,11 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, return mfill_atomic_hugetlb(ctx, dst_vma, dst_start, src_start, len, flags); - if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma)) + uffd_ops = vma_get_uffd_ops(dst_vma); + if (!uffd_ops) goto out_unlock; - if (!vma_is_shmem(dst_vma) && - uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE) && + !uffd_ops->minor_get_folio) goto out_unlock; while (src_addr < src_start + len) { -- Sincerely yours, Mike. ^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-11-04 7:21 ` Mike Rapoport @ 2025-11-04 12:23 ` David Hildenbrand (Red Hat) 2025-11-06 16:32 ` Liam R. Howlett 1 sibling, 0 replies; 37+ messages in thread From: David Hildenbrand (Red Hat) @ 2025-11-04 12:23 UTC (permalink / raw) To: Mike Rapoport Cc: Peter Xu, Lorenzo Stoakes, Liam R. Howlett, linux-kernel, linux-mm, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli, conduct On 04.11.25 08:21, Mike Rapoport wrote: > On Mon, Nov 03, 2025 at 10:27:05PM +0100, David Hildenbrand (Red Hat) wrote: >> >> And maybe that's the main problem here: Liam talks about general uffd >> cleanups while you are focused on supporting guest_memfd minor mode "as >> simple as possible" (as you write below). > > Hijacking for the technical part for a moment ;-) What?! That's crazy :P > > It seems that "as simple as possible" can even avoid data members in struct > vm_uffd_ops, e.g something along these lines: Right, that's certainly even more "minimal". In the end I agree with previous assessment that the original series is not completely a "modulize memory types". If we want to make it easier for guest_memfd to support MFILL_ATOMIC_CONTINUE, then maybe we should focus on that, and only that, for the time being to make progress. -- Cheers David ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-11-04 7:21 ` Mike Rapoport 2025-11-04 12:23 ` David Hildenbrand (Red Hat) @ 2025-11-06 16:32 ` Liam R. Howlett 1 sibling, 0 replies; 37+ messages in thread From: Liam R. Howlett @ 2025-11-06 16:32 UTC (permalink / raw) To: Mike Rapoport Cc: David Hildenbrand (Red Hat), Peter Xu, Lorenzo Stoakes, David Hildenbrand, linux-kernel, linux-mm, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli, conduct * Mike Rapoport <rppt@kernel.org> [251104 02:22]: > On Mon, Nov 03, 2025 at 10:27:05PM +0100, David Hildenbrand (Red Hat) wrote: > > > > And maybe that's the main problem here: Liam talks about general uffd > > cleanups while you are focused on supporting guest_memfd minor mode "as > > simple as possible" (as you write below). > > Hijacking for the technical part for a moment ;-) > > It seems that "as simple as possible" can even avoid data members in struct > vm_uffd_ops, e.g something along these lines: I like this because it removes the flag. If we don't want to return the folio, we could modify the mfill_atomic_pte_continue() to __mfill_atomic_pte_continue() which takes a function pointer and have the callers pass a different get_folio() by memory type. Each memory type (anon, shmem, and guest_memfd) would have a small stub that would be set in the vm_ops. It also looks similar to vma_get_uffd_ops() in 1fa9377e57eb1 ("mm/userfaultfd: Introduce userfaultfd ops and use it for destination validation") [1]. But I always returned a uffd ops, which passes all uffd testing. When would your NULL uffd ops be hit? That is, when would uffd_ops not be set and not be anon? [1]. https://git.infradead.org/?p=users/jedix/linux-maple.git;a=blobdiff;f=mm/userfaultfd.c;h=e2570e72242e5a350508f785119c5dee4d8176c1;hp=e8341a45e7e8d239c64f460afeb5b2b8b29ed853;hb=1fa9377e57eb16d7fa579ea7f8eb832164d209ac;hpb=2166e91882eb195677717ac2f8fbfc58171196ce Thanks, Liam > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index d16b33bacc32..840986780cb5 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -605,6 +605,8 @@ struct vm_fault { > */ > }; > > +struct vm_uffd_ops; > + > /* > * These are the virtual MM functions - opening of an area, closing and > * unmapping it (needed to keep files on disk up-to-date etc), pointer > @@ -690,6 +692,9 @@ struct vm_operations_struct { > struct page *(*find_normal_page)(struct vm_area_struct *vma, > unsigned long addr); > #endif /* CONFIG_FIND_NORMAL_PAGE */ > +#ifdef CONFIG_USERFAULTFD > + const struct vm_uffd_ops *uffd_ops; > +#endif > }; > > #ifdef CONFIG_NUMA_BALANCING > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h > index c0e716aec26a..aac7ac616636 100644 > --- a/include/linux/userfaultfd_k.h > +++ b/include/linux/userfaultfd_k.h > @@ -111,6 +111,11 @@ static inline uffd_flags_t uffd_flags_set_mode(uffd_flags_t flags, enum mfill_at > /* Flags controlling behavior. These behavior changes are mode-independent. */ > #define MFILL_ATOMIC_WP MFILL_ATOMIC_FLAG(0) > > +struct vm_uffd_ops { > + int (*minor_get_folio)(struct inode *inode, pgoff_t pgoff, > + struct folio **folio); > +}; > + > extern int mfill_atomic_install_pte(pmd_t *dst_pmd, > struct vm_area_struct *dst_vma, > unsigned long dst_addr, struct page *page, > diff --git a/mm/shmem.c b/mm/shmem.c > index b9081b817d28..b4318ad3bdf9 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -3260,6 +3260,17 @@ int shmem_mfill_atomic_pte(pmd_t *dst_pmd, > shmem_inode_unacct_blocks(inode, 1); > return ret; > } > + > +static int shmem_uffd_minor_get_folio(struct inode *inode, pgoff_t pgoff, > + struct folio **folio) > +{ > + return shmem_get_folio(inode, pgoff, 0, folio, SGP_NOALLOC); > +} > + > +static const struct vm_uffd_ops shmem_uffd_ops = { > + .minor_get_folio = shmem_uffd_minor_get_folio, > +}; > + > #endif /* CONFIG_USERFAULTFD */ > > #ifdef CONFIG_TMPFS > @@ -5292,6 +5303,9 @@ static const struct vm_operations_struct shmem_vm_ops = { > .set_policy = shmem_set_policy, > .get_policy = shmem_get_policy, > #endif > +#ifdef CONFIG_USERFAULTFD > + .uffd_ops = &shmem_uffd_ops, > +#endif > }; > > static const struct vm_operations_struct shmem_anon_vm_ops = { > @@ -5301,6 +5315,9 @@ static const struct vm_operations_struct shmem_anon_vm_ops = { > .set_policy = shmem_set_policy, > .get_policy = shmem_get_policy, > #endif > +#ifdef CONFIG_USERFAULTFD > + .uffd_ops = &shmem_uffd_ops, > +#endif > }; > > int shmem_init_fs_context(struct fs_context *fc) > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > index af61b95c89e4..6b30a8f39f4d 100644 > --- a/mm/userfaultfd.c > +++ b/mm/userfaultfd.c > @@ -20,6 +20,20 @@ > #include "internal.h" > #include "swap.h" > > +static const struct vm_uffd_ops anon_uffd_ops = { > +}; > + > +static inline const struct vm_uffd_ops *vma_get_uffd_ops(struct vm_area_struct *vma) > +{ > + if (vma->vm_ops && vma->vm_ops->uffd_ops) > + return vma->vm_ops->uffd_ops; > + > + if (vma_is_anonymous(vma)) > + return &anon_uffd_ops; > + > + return NULL; > +} > + > static __always_inline > bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end) > { > @@ -382,13 +396,14 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd, > unsigned long dst_addr, > uffd_flags_t flags) > { > + const struct vm_uffd_ops *uffd_ops = vma_get_uffd_ops(dst_vma); > struct inode *inode = file_inode(dst_vma->vm_file); > pgoff_t pgoff = linear_page_index(dst_vma, dst_addr); > struct folio *folio; > struct page *page; > int ret; > > - ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC); > + ret = uffd_ops->minor_get_folio(inode, pgoff, &folio); > /* Our caller expects us to return -EFAULT if we failed to find folio */ > if (ret == -ENOENT) > ret = -EFAULT; > @@ -707,6 +722,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, > unsigned long src_addr, dst_addr; > long copied; > struct folio *folio; > + const struct vm_uffd_ops *uffd_ops; > > /* > * Sanitize the command parameters: > @@ -766,10 +782,11 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, > return mfill_atomic_hugetlb(ctx, dst_vma, dst_start, > src_start, len, flags); > > - if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma)) > + uffd_ops = vma_get_uffd_ops(dst_vma); > + if (!uffd_ops) > goto out_unlock; > - if (!vma_is_shmem(dst_vma) && > - uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) > + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE) && > + !uffd_ops->minor_get_folio) > goto out_unlock; > > while (src_addr < src_start + len) { > > -- > Sincerely yours, > Mike. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-10-30 19:55 ` Peter Xu 2025-10-30 20:23 ` Lorenzo Stoakes @ 2025-10-30 20:52 ` Liam R. Howlett 2025-10-30 21:33 ` Peter Xu 1 sibling, 1 reply; 37+ messages in thread From: Liam R. Howlett @ 2025-10-30 20:52 UTC (permalink / raw) To: Peter Xu Cc: David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli * Peter Xu <peterx@redhat.com> [251030 15:56]: > On Thu, Oct 30, 2025 at 03:07:18PM -0400, Peter Xu wrote: > > > Patches are here: > > > > > > https://git.infradead.org/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/modularized_mem > > > > Great! Finally we have something solid to discuss on top. > > > > Yes, I'm extremely happy to see whatever code there is, I'm happy to review > > it. I'm happy to see it rolling. If it is better, we can adopt it. > > So here is a summary of why I think my proposal is better: > > - Much less code > > I think this is crystal clear.. I'm pasting once more in this summary > email on what your proposal touches: > > fs/userfaultfd.c | 14 +-- > include/linux/hugetlb.h | 21 ---- > include/linux/mm.h | 11 ++ > include/linux/shmem_fs.h | 14 --- > include/linux/userfaultfd_k.h | 108 ++++++++++------ > mm/hugetlb.c | 359 +++++++++++++++++++++++++++++++++++++++++++++-------- > mm/shmem.c | 245 ++++++++++++++++++++++++------------ > mm/userfaultfd.c | 869 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------------------------------------------- > 8 files changed, 962 insertions(+), 679 deletions(-) Including the following highlights: -#include <linux/hugetlb.h> and -typedef unsigned int __bitwise uffd_flags_t; > > - Much less future code > > The new proposal needs at least 6 APIs to implement even minor fault.. > One of the API needs to be implemented with uffd_info* which further > includes 10+ fields to process. It means we'll have a bunch of > duplicated code in the future if new things pop up, so it's not only > about what we merge. You can reuse existing functions if there is no change. > > - Much less exported functions to modules I haven't exported anything. You asked for code and I provided it. This doesn't do what guest_memfd needs as it is. This is all clean up you wouldn't do. > > My solution, after exposing vm_uffd_ops, doesn't need to export any > function. > > Your solution needs to export a lot of new functions to modules. I > didn't pay a lot of attention but the list should at least include these > 10 functions: > > void uffd_complete_register(struct vm_area_struct *vma); > unsigned int uffd_page_shift(struct vm_area_struct *vma); > int uffd_writeprotect(struct uffd_info *info); > ssize_t uffd_failed_do_unlock(struct uffd_info *info); > int uffd_atomic_pte_copy(struct folio *folio, unsigned long src_addr); > unsigned long mfill_size(struct vm_area_struct *vma) > int mfill_atomic_pte_poison(struct uffd_info *info); > int mfill_atomic_pte_copy(struct uffd_info *info); > int mfill_atomic_pte_zeropage(struct uffd_info *info); > ssize_t uffd_get_dst_pmd(struct vm_area_struct *dst_vma, unsigned long dst_addr,pmd_t **dst_pmd); > > It's simply unnecessary. Maybe we don't export any of them. Maybe there's another function pointer that could be checked or overwritten? We can do that without a flag or setting or wahtever the name you used for your flag was. > > - Less error prone > > At least to support minor fault, my solution only needs one hook fetching > page cache, then set the CONTINUE ioctl in the supported_ioctls. Your code just adds more junk to uffd, and fails to modularize anything beyond getting a folio. And you only support certain types and places. > > - Safer > > Your code allows to operate on pmd* in a module??? That's too risky and > mm can explode! Isn't it? Again. I didn't export anything. > > - Do not build new codes on top of hugetlbfs > > AFAICT, more than half of your solution's API is trying to service > hugetlbfs. IMHO that's the wrong way to go. I explained to you multiple > times. We should either keep hugetlbfs alone, or having hugetlbfs adopt > mm APIs instead. We shouldn't build any new code only trying to service > hugetlbfsv1 but nobody else. We shouldn't introduce new mm API only to > service hugetlbfs. Ignoring hugetlb exists is a problem. I have removed the hugetlb from being included in uffd while you have left it in its own loop. This doesn't build new things on hugetlb, it moves the code for hugetlb out of mm/userfaultfd.c - how is it building on hugetlb? Believe it or not, hugetlb is a memory type. Certainly smaller changes are inherently less bug prone. I honestly think all of what I have here needs to be done, regardless of memfd support. I cannot believe that things were allowed to be pushed this far. I do not think they should be allowed to go further. > > - Much less risk of breaking things > > I'm pretty sure my code introduce zero or very little bug, if there's > one, I'll fix it, but really, likely not, because the changes are > straightforward. > > Your changes are huge. I would not be surprised you break things here > and there. I hope at least you will be around fixing them when it > happens, even if we're not sure the benefits of most of the changes. I have always been prompt in fixing my issues and have taken responsibility for anything I've written. I maintain the maple tree and other areas of mm. I have no plans of leaving Linux and I hope not to die. I can maintain mm/userfaultfd.c, if that helps. I didn't feel like I knew the area enough before, but I'm learning a lot doing this. Thanks, Liam ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-10-30 20:52 ` Liam R. Howlett @ 2025-10-30 21:33 ` Peter Xu 0 siblings, 0 replies; 37+ messages in thread From: Peter Xu @ 2025-10-30 21:33 UTC (permalink / raw) To: Liam R. Howlett, David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli On Thu, Oct 30, 2025 at 04:52:26PM -0400, Liam R. Howlett wrote: > * Peter Xu <peterx@redhat.com> [251030 15:56]: > > On Thu, Oct 30, 2025 at 03:07:18PM -0400, Peter Xu wrote: > > > > Patches are here: > > > > > > > > https://git.infradead.org/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/modularized_mem > > > > > > Great! Finally we have something solid to discuss on top. > > > > > > Yes, I'm extremely happy to see whatever code there is, I'm happy to review > > > it. I'm happy to see it rolling. If it is better, we can adopt it. > > > > So here is a summary of why I think my proposal is better: > > > > - Much less code > > > > I think this is crystal clear.. I'm pasting once more in this summary > > email on what your proposal touches: > > > > fs/userfaultfd.c | 14 +-- > > include/linux/hugetlb.h | 21 ---- > > include/linux/mm.h | 11 ++ > > include/linux/shmem_fs.h | 14 --- > > include/linux/userfaultfd_k.h | 108 ++++++++++------ > > mm/hugetlb.c | 359 +++++++++++++++++++++++++++++++++++++++++++++-------- > > mm/shmem.c | 245 ++++++++++++++++++++++++------------ > > mm/userfaultfd.c | 869 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------------------------------------------- > > 8 files changed, 962 insertions(+), 679 deletions(-) > > Including the following highlights: > -#include <linux/hugetlb.h> > > and > > -typedef unsigned int __bitwise uffd_flags_t; So you spent ~1000 LOC for these as highlights.. It's ok, I'll wait for a second opinion. > > > > > > - Much less future code > > > > The new proposal needs at least 6 APIs to implement even minor fault.. > > One of the API needs to be implemented with uffd_info* which further > > includes 10+ fields to process. It means we'll have a bunch of > > duplicated code in the future if new things pop up, so it's not only > > about what we merge. > > You can reuse existing functions if there is no change. > > > > > - Much less exported functions to modules > > I haven't exported anything. You asked for code and I provided it. > This doesn't do what guest_memfd needs as it is. This is all clean up > you wouldn't do. > > > > > My solution, after exposing vm_uffd_ops, doesn't need to export any > > function. > > > > Your solution needs to export a lot of new functions to modules. I > > didn't pay a lot of attention but the list should at least include these > > 10 functions: > > > > void uffd_complete_register(struct vm_area_struct *vma); > > unsigned int uffd_page_shift(struct vm_area_struct *vma); > > int uffd_writeprotect(struct uffd_info *info); > > ssize_t uffd_failed_do_unlock(struct uffd_info *info); > > int uffd_atomic_pte_copy(struct folio *folio, unsigned long src_addr); > > unsigned long mfill_size(struct vm_area_struct *vma) > > int mfill_atomic_pte_poison(struct uffd_info *info); > > int mfill_atomic_pte_copy(struct uffd_info *info); > > int mfill_atomic_pte_zeropage(struct uffd_info *info); > > ssize_t uffd_get_dst_pmd(struct vm_area_struct *dst_vma, unsigned long dst_addr,pmd_t **dst_pmd); > > > > It's simply unnecessary. > > Maybe we don't export any of them. Maybe there's another function > pointer that could be checked or overwritten? We can do that without a > flag or setting or wahtever the name you used for your flag was. You need to export them when guest-memfd will be involved, am I right? > > > > > - Less error prone > > > > At least to support minor fault, my solution only needs one hook fetching > > page cache, then set the CONTINUE ioctl in the supported_ioctls. > > Your code just adds more junk to uffd, and fails to modularize anything > beyond getting a folio. And you only support certain types and places. So would CoC accepts "junk" in this context? > > > > > - Safer > > > > Your code allows to operate on pmd* in a module??? That's too risky and > > mm can explode! Isn't it? > > Again. I didn't export anything. > > > > > - Do not build new codes on top of hugetlbfs > > > > AFAICT, more than half of your solution's API is trying to service > > hugetlbfs. IMHO that's the wrong way to go. I explained to you multiple > > times. We should either keep hugetlbfs alone, or having hugetlbfs adopt > > mm APIs instead. We shouldn't build any new code only trying to service > > hugetlbfsv1 but nobody else. We shouldn't introduce new mm API only to > > service hugetlbfs. > > Ignoring hugetlb exists is a problem. I have removed the hugetlb from > being included in uffd while you have left it in its own loop. This > doesn't build new things on hugetlb, it moves the code for hugetlb out > of mm/userfaultfd.c - how is it building on hugetlb? The APIs you introduced is building for hugetlb. If without hugetlb, more than half of the API is not needed. > > Believe it or not, hugetlb is a memory type. > > Certainly smaller changes are inherently less bug prone. I honestly > think all of what I have here needs to be done, regardless of memfd > support. I cannot believe that things were allowed to be pushed this > far. I do not think they should be allowed to go further. > > > > > - Much less risk of breaking things > > > > I'm pretty sure my code introduce zero or very little bug, if there's > > one, I'll fix it, but really, likely not, because the changes are > > straightforward. > > > > Your changes are huge. I would not be surprised you break things here > > and there. I hope at least you will be around fixing them when it > > happens, even if we're not sure the benefits of most of the changes. > > I have always been prompt in fixing my issues and have taken > responsibility for anything I've written. I maintain the maple tree and > other areas of mm. I have no plans of leaving Linux and I hope not to > die. > > I can maintain mm/userfaultfd.c, if that helps. I didn't feel like I > knew the area enough before, but I'm learning a lot doing this. I'm not maintainer of userfaultfd, I'm a reviewer. I was almost just trying to help, in reality that is also true that obviously I don't make decisions. I definitely think you can at least propose add yourself as a reviewer if you like to start looking after userfaultfds, or even M if Andrew likes it. You're already listed as core mm R so I don't see much difference if it's only a R addition. -- Peter Xu ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-10-30 19:07 ` Peter Xu 2025-10-30 19:55 ` Peter Xu @ 2025-10-30 20:24 ` Liam R. Howlett 2025-10-30 21:26 ` Peter Xu 1 sibling, 1 reply; 37+ messages in thread From: Liam R. Howlett @ 2025-10-30 20:24 UTC (permalink / raw) To: Peter Xu Cc: David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli * Peter Xu <peterx@redhat.com> [251030 15:07]: > On Thu, Oct 30, 2025 at 01:13:24PM -0400, Liam R. Howlett wrote: > > * Peter Xu <peterx@redhat.com> [251021 12:28]: > > > > ... > > > > > Can you send some patches and show us the code, help everyone to support > > > guest-memfd minor fault, please? > > > > Patches are here: > > > > https://git.infradead.org/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/modularized_mem > > Great! Finally we have something solid to discuss on top. > > Yes, I'm extremely happy to see whatever code there is, I'm happy to review > it. I'm happy to see it rolling. If it is better, we can adopt it. > > > > > This is actually modularized memory types. That means there is no > > hugetlb.h or shmem.h included in mm/userfaultfd.c code. > > > > uffd_flag_t has been removed. This was turning into a middleware and > > it is not necessary. Neither is supported_ioctls. > > > > hugetlb now uses the same functions as every other memory type, > > including anon memory. > > > > Any memory type can change functionality without adding instructions or > > flags or anything to some other code. > > > > This code passes uffd-unit-test and uffd-wp-mremap (skipped the swap > > tests). > > > > guest-memfd can implement whatever it needs to (or use others > > implementations), like shmem_uffd_ops here: > > I didn't look at the patches in details, however I already commented > previously on a similar comment you left. Since you have solid code this > time, let me ask one by one clearly this time inline: > > > > > static const struct vm_uffd_ops shmem_uffd_ops = { > > .copy = shmem_mfill_atomic_pte_copy, > > This is optional, if you did the convertion it's perfect (though it's buggy > right now, more below). Yes, UFFDIO_COPY might be a good fit for a global > API like this, however that's the least useful, as I mentioned, I do not > expect a new user.. > > It means, what you did on this may not grow any user. The whole change may > not be useful to anyone.. We currently have two users, that's why it's here. It is useful because we don't have hugetlb in the code anymore. We don't even have shmem code in the userfaultfd code. > > Then I see what you introduced as the API: > > struct vm_uffd_ops { > int (*copy)(struct uffd_info *info); > int (*zeropage)(struct uffd_info *info); > int (*cont)(struct uffd_info *info); > int (*poison)(struct uffd_info *info); > int (*writeprotect)(struct uffd_info *info); > /* Required features below */ > ssize_t (*is_dst_valid)(struct vm_area_struct *dst_vma, > unsigned long dst_start, unsigned long len); > unsigned long (*increment)(struct vm_area_struct *vma); > ssize_t (*failed_do_unlock)(struct uffd_info *info); > unsigned int (*page_shift)(struct vm_area_struct *src_vma); > void (*complete_register)(struct vm_area_struct *vma); > }; > > The copy() interface (and most of the rest) takes something called > uffd_info*. Then it looks like this: > > struct uffd_info { > unsigned long dst_addr; /* Target address */ > unsigned long src_addr; /* Source address */ > unsigned long len; /* Total length to copy */ > unsigned long src_last; /* starting src_addr + len */ > unsigned long dst_last; /* starting dst_addr + len */ > struct folio *foliop; /* folio pointer for retry */ > struct userfaultfd_ctx *ctx; /* The userfaultfd context */ > > struct vm_area_struct *dst_vma; /* Target vma */ > unsigned long increment; /* Size of each operation */ > bool wp; /* Operation is requesting write protection */ > const struct vm_uffd_ops *uffd_ops; /* The vma uffd_ops pointer */ > int (*op)(struct uffd_info *info); /* The operation to perform on the dst */ > long copied; > }; > > You went almost mad when I introduced uffd_copy() in v1. It might be > because there used to have pmd* and something around pgtables. However > I'll still need to question whether this is a better and easier to adopt > interface if a mem type wants to opt-in any uffd features. > > So are you happy yourself now with above complicated struct that memtype > needs to implement and support? The entire thing is crazy, really. I don't think modules should be doing any of this. It's nuts. How can you register for a memory type and not know how to deal with the memory type? It makes no sense in the first place. The only reason that this is approaching acceptable is that memfd already does the things it needs to do with it. So we're trying to hand something back to someone who already made the damn thing, but has no way of finding it themselves. Otherwise, this is stupid. > > > .zeropage = shmem_mfill_atomic_pte_zeropage, > > Why a memory type needs to provide a separate hook to inject a zeropage? > It should almost always be the same of UFFDIO_COPY except copying. This also has two users. > > > .cont = shmem_mfill_atomic_pte_continue, > > It's OK to have it. However said that, what we really need is "fetching a > cache folio". I'll need to check how you exposed the userfaultfd helpers > so that it will support mem types to opt-in this. To me, this is really an > overkill. This is literally doing what your subject said: modularize memory types. > > Shmem impl: > > static int shmem_mfill_atomic_pte_continue(struct uffd_info *info) > { > struct vm_area_struct *dst_vma = info->dst_vma; > struct inode *inode = file_inode(dst_vma->vm_file); > pgoff_t pgoff = linear_page_index(dst_vma, info->dst_addr); > pmd_t *dst_pmd; > struct folio *folio; > struct page *page; > int ret; > > ret = uffd_get_dst_pmd(dst_vma, info->dst_addr, &dst_pmd); > if (ret) > return ret; > > ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC); > /* Our caller expects us to return -EFAULT if we failed to find folio */ > if (ret == -ENOENT) > ret = -EFAULT; > if (ret) > goto out; > if (!folio) { > ret = -EFAULT; > goto out; > } > > page = folio_file_page(folio, pgoff); > if (PageHWPoison(page)) { > ret = -EIO; > goto out_release; > } > > ret = mfill_atomic_install_pte(dst_pmd, dst_vma, info->dst_addr, > page, false, info->wp); > if (ret) > goto out_release; > > folio_unlock(folio); > ret = 0; > out: > return ret; > out_release: > folio_unlock(folio); > folio_put(folio); > goto out; > } > > So are you sure this is better than the oneliner? > > In your new API, the driver also needs to operate on pmd*. Is it a concern > to you? Maybe you don't think it's a concern now, even if you used to > think uffd_copy() has concerns exposing pmd* pointers? > > The current series I proposed is not only simpler, but only expose folio*. > That's at least safer at least from your theory, is that right? I disagree, it's adding yet-another-mode to the code instead of function pointers. When do we stop bolting on and start fixing existing issues? This change stands on its own without guest_memfd support. Yours is an anti-pattern. > > > .poison = mfill_atomic_pte_poison, > > Why this is needed if UFFDIO_POISON should exactly do the same thing for > each memory type? Yet, it doesn't so there's a function pointer for it. > > > .writeprotect = uffd_writeprotect, > > Same question. Wr-protect over a pagecache folio should really behave the > same. Why do you need to introduce it? hugetlb is different. > > After all, even in your branch you reused change_protection(), where the > hugetlb special operations reside. I don't see much point on why it's > needed. hugetlb uses hugetlb_change_protection(). > > > .is_dst_valid = shmem_is_dst_valid, > > It's definitely not obvious what this is for. is_dst_valid checks to see if the destination is valid. I can add a comment if you want. > > Looks like it's trying to verify vma validity. However then I see shmem > impl has this: > > static ssize_t shmem_is_dst_valid(struct vm_area_struct *dst_vma, > unsigned long dst_start, unsigned long len) > { > if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma)) > return -EINVAL; > > return 0; > } > > Why shmem allows anon vma? We can drop the !vma_is_shmem() check from here, but I just moved them. This is an existing check today. > > > .increment = mfill_size, > > This is almost only useful for hugetlbfs, same to page_shift and > complete_register. Correct. > > It's ok if you want to clean hugetlbfs. But if we want to fetch the > granule of the folios inside a vma, IMHO we should make it a vma attribute, > not something special to userfaultfd. If there is a user outside of uffd, then we can move it. Until then, let's not pollute outside of mm/userfaultfd.c please. > > > .failed_do_unlock = uffd_failed_do_unlock, > > You'd better at least change the name of it. It so far unlocks some > misterious locks then try to copy the folio without mmap/vma lock. > > inline ssize_t > uffd_failed_do_unlock(struct uffd_info *info) > { > ssize_t err; > void *kaddr; > > up_read(&info->ctx->map_changing_lock); > uffd_mfill_unlock(info->dst_vma); > VM_WARN_ON_ONCE(!info->foliop); > > kaddr = kmap_local_folio(info->foliop, 0); > err = copy_from_user(kaddr, (const void __user *) info->src_addr, > PAGE_SIZE); > kunmap_local(kaddr); > if (unlikely(err)) > return -EFAULT; > > flush_dcache_folio(info->foliop); > return 0; > } > > How do the mem type know what locks to unlock if they do not lock it first > themselves? We can change this if you want to move the lock to the internal code, but the failed exit due to the folio mess that copy makes required this block of code before. hugetlb has a different block of code. > > > .page_shift = uffd_page_shift, > > .complete_register = uffd_complete_register, > > Hugetlbfs specific hooks. It's OK if you prefer rewritting code for > hugetlbfs. I don't have objections. Yes, hugetlb is a memory type and so it is also modularized. > > > }; > > > > Where guest-memfd needs to write the one function: > > guest_memfd_pte_continue(), from what I understand. > > You did mention what is required in your new API: > > /* Required features below */ > ssize_t (*is_dst_valid)(struct vm_area_struct *dst_vma, > unsigned long dst_start, unsigned long len); > unsigned long (*increment)(struct vm_area_struct *vma); > ssize_t (*failed_do_unlock)(struct uffd_info *info); > unsigned int (*page_shift)(struct vm_area_struct *src_vma); > void (*complete_register)(struct vm_area_struct *vma); > > So guest-memfd needs to implement these 6 APIs to support minor fault. Or, you know, call the existing functions. > > > > > Obviously some of the shmem_ functions would need to be added to a > > header, or such. > > > > And most of that can come from shmem_mfill_atomic_pte_continue(), from > > what I understand. This is about 40 lines of code, but may require > > exposing some shmem functions to keep the code that compact. > > > > So we don't need to expose getting a folio to a module, or decode any > > special flags or whatever. We just call the function that needs to be > > I didn't reply before, but I don't think supported_ioctls is special flags > or magic values. They're simply flags showing what the mem type supports > on the ioctls. One can set it wrong, but I don't think what you proposed > with above 6 APIs would be easier to get right. Any module can also make > things wrong in any of above. > > UFFDIO_CONTINUE itself is definitely getting much more complicated, it used > to only set a flag in supported_ioctls, provide a oneliner to fetch a > cache. Now it needs all above 6 APIs, one of them taking uffd_info* which > further contains 10+ fields to process. > > > called on the vma that is found. > > > > If anyone has tests I can use for guest-memfd and instructions on > > guest-memfd setup, I'll just write it instead of expanding the > > userfaultfd middleware. > > Personally, this whole thing is an overkill to me: Absolutely. And it needed to be done. You are welcome. > > $ git diff --stat 63f84ba525ea04ef376eac851efce2f82dd05f21..HEAD > fs/userfaultfd.c | 14 +-- > include/linux/hugetlb.h | 21 ---- > include/linux/mm.h | 11 ++ > include/linux/shmem_fs.h | 14 --- > include/linux/userfaultfd_k.h | 108 ++++++++++------ > mm/hugetlb.c | 359 +++++++++++++++++++++++++++++++++++++++++++++-------- > mm/shmem.c | 245 ++++++++++++++++++++++++------------ > mm/userfaultfd.c | 869 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------------------------------------------- > 8 files changed, 962 insertions(+), 679 deletions(-) Right, so the existing code a huge mess today and you won't fix anything, ever. We currently have a uffd_flags_t that sets wp or not then passes it through to set another part of the flag to pass it through again. This is stupid. And then it gets more stupid. We then check that flag exactly once for the second argument in a global function and NEVER LOOK AT THAT PART AGAIN. So we now have a type to contain a boolean, but we keep passing through the uffd_flags_t, so we can see if wp is set or not. And you're fine with it. In fact, let's add another one. I mean, we tried for two but people didn't like two so lets whine and whine and whine until people get frustrated and let you push a second one of those gems into the code you REFUSED TO MAINTAIN. This is what owning a problem looks like: I removed the uufd_flags_t, because it's stupid. I removed all the hugetlb checks because I modularized the memory types. I'm happy to address changes, but I'm not happy to accept more middleware and "it's not part of the patch set" to address any problem as you push more trash into an already horrible code base. We need to fix things too. So I'm fixing it. Liam ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-10-30 20:24 ` Liam R. Howlett @ 2025-10-30 21:26 ` Peter Xu 0 siblings, 0 replies; 37+ messages in thread From: Peter Xu @ 2025-10-30 21:26 UTC (permalink / raw) To: Liam R. Howlett, David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli Let me reply in short. On Thu, Oct 30, 2025 at 04:24:46PM -0400, Liam R. Howlett wrote: > Right, so the existing code a huge mess today and you won't fix > anything, ever. IMHO fix is the wrong word. Cleanup it is. I agree the current userfaultfd code isn't extremely easy to follow. > > We currently have a uffd_flags_t that sets wp or not then passes it > through to set another part of the flag to pass it through again. This > is stupid. I feel like you're applying very strongly personal preference to this piece of code. I won't say that is stupid. Axel introduced it, and I think it's still an improvement comparing to before. > > And then it gets more stupid. We then check that flag exactly once for > the second argument in a global function and NEVER LOOK AT THAT PART > AGAIN. So we now have a type to contain a boolean, but we keep passing > through the uffd_flags_t, so we can see if wp is set or not. > > And you're fine with it. In fact, let's add another one. I mean, we > tried for two but people didn't like two so lets whine and whine and > whine until people get frustrated and let you push a second one of those > gems into the code you REFUSED TO MAINTAIN. What I refuse to maintain? I don't really follow. > > This is what owning a problem looks like: I removed the uufd_flags_t, > because it's stupid. I removed all the hugetlb checks because I > modularized the memory types. > > I'm happy to address changes, but I'm not happy to accept more > middleware and "it's not part of the patch set" to address any problem > as you push more trash into an already horrible code base. > > We need to fix things too. > > So I'm fixing it. Let's wait for a 2nd opinion on the approaches. As I said, I'm OK if everyone likes your solution and if I'm the only one NACKing it. If we can support guest-memfd finally whoever adds that, it's not so bad. -- Peter Xu ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-10-30 17:13 ` Liam R. Howlett 2025-10-30 18:00 ` Nikita Kalyazin 2025-10-30 19:07 ` Peter Xu @ 2025-11-03 16:11 ` Mike Rapoport 2025-11-03 18:43 ` Liam R. Howlett 2025-11-05 21:23 ` David Hildenbrand 3 siblings, 1 reply; 37+ messages in thread From: Mike Rapoport @ 2025-11-03 16:11 UTC (permalink / raw) To: Liam R. Howlett, Peter Xu, David Hildenbrand, linux-kernel, linux-mm, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli Hi Liam, On Thu, Oct 30, 2025 at 01:13:24PM -0400, Liam R. Howlett wrote: > * Peter Xu <peterx@redhat.com> [251021 12:28]: > > ... > > > Can you send some patches and show us the code, help everyone to support > > guest-memfd minor fault, please? > > Patches are here: > > https://git.infradead.org/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/modularized_mem It's really cool you picked up the gauntlet and invested this effort into refactoring of uffd! I agree that userfault code begs for cleanups after the sh^W stuff has been piling over and over, but ... > This is actually modularized memory types. That means there is no > hugetlb.h or shmem.h included in mm/userfaultfd.c code. > > uffd_flag_t has been removed. This was turning into a middleware and > it is not necessary. Neither is supported_ioctls. > > hugetlb now uses the same functions as every other memory type, > including anon memory. > > Any memory type can change functionality without adding instructions or > flags or anything to some other code. > > This code passes uffd-unit-test and uffd-wp-mremap (skipped the swap > tests). > > guest-memfd can implement whatever it needs to (or use others > implementations), like shmem_uffd_ops here: > > static const struct vm_uffd_ops shmem_uffd_ops = { > .copy = shmem_mfill_atomic_pte_copy, > .zeropage = shmem_mfill_atomic_pte_zeropage, > .cont = shmem_mfill_atomic_pte_continue, > .poison = mfill_atomic_pte_poison, > .writeprotect = uffd_writeprotect, > .is_dst_valid = shmem_is_dst_valid, > .increment = mfill_size, > .failed_do_unlock = uffd_failed_do_unlock, > .page_shift = uffd_page_shift, > .complete_register = uffd_complete_register, > }; ... I don't think it's the right level of abstraction to add as uffd_ops to vmap_ops. As I see it, we have two levels where things are different: hugetlb vs others at the very core of mfill_atomic() and then how different pte-based types implement a single page operations, i.e copy/zeropage/continue. So to separate hugetlb code from userfault we need something like ->get_parent_pagetable() ->pagesize() ->mfill_atomic_page() and apparently something like your complete_register() and maybe is_dst_valid(). But to provide hooks for shmem, anon and potentially guest_memfd() we should be looking at callbacks to get a folio to populate a PTE, either for copy or continue, and Peter's ->minor_get_folio() seems to me the right level of abstraction to expose at vm_uffd_ops. I believe we can extract ->get_folio() and ->put_folio() from shmem_mfill_atomic_pte() and call them from mfill_atomic_pte_copy(). > Where guest-memfd needs to write the one function: > guest_memfd_pte_continue(), from what I understand. > > Obviously some of the shmem_ functions would need to be added to a > header, or such. > > And most of that can come from shmem_mfill_atomic_pte_continue(), from > what I understand. This is about 40 lines of code, but may require > exposing some shmem functions to keep the code that compact. This seems to me an overkill to implement MFILL_ATOMIC_CONTINUE for guest_memfd(). I think it should be able to register a callback to provide a singe folio at a given file offset if that folio is in the guest_memfd's page cache. No reason for guest_memfd to start re-implementing locking, acquiring of PMD and updating the page table, even if it only needs to call functions from userfaultfd > So we don't need to expose getting a folio to a module, or decode any > special flags or whatever. We just call the function that needs to be > called on the vma that is found. Agree about exposing flags to a module and about limiting API to functions only. > Thanks, > Liam > -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-11-03 16:11 ` Mike Rapoport @ 2025-11-03 18:43 ` Liam R. Howlett 0 siblings, 0 replies; 37+ messages in thread From: Liam R. Howlett @ 2025-11-03 18:43 UTC (permalink / raw) To: Mike Rapoport Cc: Peter Xu, David Hildenbrand, linux-kernel, linux-mm, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli * Mike Rapoport <rppt@kernel.org> [251103 11:11]: > Hi Liam, > > On Thu, Oct 30, 2025 at 01:13:24PM -0400, Liam R. Howlett wrote: > > * Peter Xu <peterx@redhat.com> [251021 12:28]: > > > > ... > > > > > Can you send some patches and show us the code, help everyone to support > > > guest-memfd minor fault, please? > > > > Patches are here: > > > > https://git.infradead.org/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/modularized_mem > > It's really cool you picked up the gauntlet and invested this effort into > refactoring of uffd! Thanks. I have found that a few places really needed attention to avoid expanding this code into more middleware. The existing code has bundled all operations through one loop and now ever operation needs to check for special cases even when they cannot be triggered. And that loop is duplicated for hugetlb, so if there is an issue then we already have to fix the code in 2 places, and we also have to check unnecessary error conditions for 3 out of 4 operations, in both loops. This seems like the worst situation of both ideas - multiple call paths and unnecessary checks in loops. > > I agree that userfault code begs for cleanups after the sh^W stuff has been > piling over and over, but ... > > > This is actually modularized memory types. That means there is no > > hugetlb.h or shmem.h included in mm/userfaultfd.c code. > > > > uffd_flag_t has been removed. This was turning into a middleware and > > it is not necessary. Neither is supported_ioctls. > > > > hugetlb now uses the same functions as every other memory type, > > including anon memory. > > > > Any memory type can change functionality without adding instructions or > > flags or anything to some other code. > > > > This code passes uffd-unit-test and uffd-wp-mremap (skipped the swap > > tests). > > > > guest-memfd can implement whatever it needs to (or use others > > implementations), like shmem_uffd_ops here: > > > > static const struct vm_uffd_ops shmem_uffd_ops = { > > .copy = shmem_mfill_atomic_pte_copy, > > .zeropage = shmem_mfill_atomic_pte_zeropage, > > .cont = shmem_mfill_atomic_pte_continue, > > .poison = mfill_atomic_pte_poison, > > .writeprotect = uffd_writeprotect, > > .is_dst_valid = shmem_is_dst_valid, > > .increment = mfill_size, > > .failed_do_unlock = uffd_failed_do_unlock, > > .page_shift = uffd_page_shift, > > .complete_register = uffd_complete_register, > > }; > > ... I don't think it's the right level of abstraction to add as uffd_ops to > vmap_ops. > > As I see it, we have two levels where things are different: hugetlb vs > others at the very core of mfill_atomic() and then how different pte-based > types implement a single page operations, i.e copy/zeropage/continue. > > So to separate hugetlb code from userfault we need something like > > ->get_parent_pagetable() > ->pagesize() > ->mfill_atomic_page() > > and apparently something like your complete_register() and maybe > is_dst_valid(). > > But to provide hooks for shmem, anon and potentially guest_memfd() we > should be looking at callbacks to get a folio to populate a PTE, either for > copy or continue, and Peter's ->minor_get_folio() seems to me the right > level of abstraction to expose at vm_uffd_ops. This was entirely a clean up, nothing has been exported. When adding the guest_memfd type, I want to export the smallest set of these that we need. I've come up with a few ways of doing that so far, and that's why I was asking how to test the guest_memefd use case. Both ideas involve splitting these ops into two: external and internal. We can either tier the ops in each other or add both directly to the vma struct. I'm leaning towards embedding a second op into uffd_ops. We can default to the common cases and allow the hugetlb code to set its own (it could never be a module anyways..). As in the example code I've sent out, you can see we can fall back to defaults if !vma->vm_ops || !vma->vm_ops->userfaultfd for anon vmas. We can use this same method. The other thing about going this far was that, by the time I reached the point of having the targed mfill_atomic operations modularized, it was very close to being able to drop shmem.h and hugetlb.h from the code entirely, so a few of these exist to completely decouple the code. I'm not sure that it is necessary, but I was trying to show how to completely modularize the memory types, specifically since I was asked how. > > I believe we can extract ->get_folio() and ->put_folio() from > shmem_mfill_atomic_pte() and call them from mfill_atomic_pte_copy(). > > > Where guest-memfd needs to write the one function: > > guest_memfd_pte_continue(), from what I understand. > > > > Obviously some of the shmem_ functions would need to be added to a > > header, or such. > > > > And most of that can come from shmem_mfill_atomic_pte_continue(), from > > what I understand. This is about 40 lines of code, but may require > > exposing some shmem functions to keep the code that compact. > > This seems to me an overkill to implement MFILL_ATOMIC_CONTINUE for > guest_memfd(). > I think it should be able to register a callback to provide a singe folio > at a given file offset if that folio is in the guest_memfd's page cache. > No reason for guest_memfd to start re-implementing locking, acquiring of > PMD and updating the page table, even if it only needs to call functions > from userfaultfd There are certainly ways around this as well; a guest_memfd and shmem function that passes the pointer to the core function comes to mind. That would avoid the safety concerns about handing out the folio. We could also have a single function pointer that is exported and only used in shmem functions when !null or fall back to the default. It's not unsolvable and it could built on the fact that this is less middleware and more of an interface. > > > So we don't need to expose getting a folio to a module, or decode any > > special flags or whatever. We just call the function that needs to be > > called on the vma that is found. > > Agree about exposing flags to a module and about limiting API to functions > only. Unfortunately, the middleware and flags will continue to grow. Given the tone of the response I received on my clean up work after being asked, I will not be continuing to work on this. Regards, Liam ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-10-30 17:13 ` Liam R. Howlett ` (2 preceding siblings ...) 2025-11-03 16:11 ` Mike Rapoport @ 2025-11-05 21:23 ` David Hildenbrand 2025-11-06 16:16 ` Liam R. Howlett 3 siblings, 1 reply; 37+ messages in thread From: David Hildenbrand @ 2025-11-05 21:23 UTC (permalink / raw) To: Liam R. Howlett, Peter Xu, linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli On 30.10.25 18:13, Liam R. Howlett wrote: > * Peter Xu <peterx@redhat.com> [251021 12:28]: > > ... > >> Can you send some patches and show us the code, help everyone to support >> guest-memfd minor fault, please? > > Patches are here: > Hi Liam, thanks for showing us what userfaultfd could look like when refactored according to your idea. I think most of the userfaultfd core code is easier to get in your tree. > https://git.infradead.org/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/modularized_mem > > This is actually modularized memory types. That means there is no > hugetlb.h or shmem.h included in mm/userfaultfd.c code. Yeah, I think there is this confusion of "modulize memory types" and "support minor fault in guest_memfd". So I see what you did here as trying to see how far we could go to remove any traces of shmem/hugetlb from userfaultd core. So I'll comment based on that and rather see it as a bigger, more extreme rework. > > uffd_flag_t has been removed. This was turning into a middleware and > it is not necessary. Neither is supported_ioctls. I assume you mean the entries that were proposed in Peters series, not something that is upstream. > > hugetlb now uses the same functions as every other memory type, > including anon memory. > > Any memory type can change functionality without adding instructions or > flags or anything to some other code. > > This code passes uffd-unit-test and uffd-wp-mremap (skipped the swap > tests). > > guest-memfd can implement whatever it needs to (or use others > implementations), like shmem_uffd_ops here: There is obviously some downside to be had with this approach (some of which Mike raised), regarding the interface to "memory types" implementing this, but I'll discuss that later. > > static const struct vm_uffd_ops shmem_uffd_ops = { > .copy = shmem_mfill_atomic_pte_copy, > .zeropage = shmem_mfill_atomic_pte_zeropage, > .cont = shmem_mfill_atomic_pte_continue, > .poison = mfill_atomic_pte_poison, > .writeprotect = uffd_writeprotect, > .is_dst_valid = shmem_is_dst_valid, > .increment = mfill_size, See below, I wonder if that could be performed by the callbacks invoked as part of the prior calls to mfill_loop() etc. > .failed_do_unlock = uffd_failed_do_unlock, That one is a bit unfortunate (read: ugly :) ). failed_do_unlock() is only called from mfill_copy_loop(). Where we perform a prior info.uffd_ops->copy. After calling err = info->op(info); Couldn't that callback just deal with the -ENOENT case? So in case of increment/failed_do_unlock, maybe we could find a way to just let the ->copy etc communicate/perform that directly. > .page_shift = uffd_page_shift, Fortunately, this is not required. The only user in move_present_ptes() moves *real* PTEs, and nothing else (no hugetlb PTEs that are PMDs etc. in disguise). > .complete_register = uffd_complete_register, > }; > So, the design is to callback into the memory-type handler, which will then use exported uffd functionality to get the job done. This nicely abstracts hugetlb handling, but could mean that any code implementing this interface has to built up on exported uffd functionality (not judging, just saying). As we're using the callbacks as an indication whether features are supported, we cannot easily leave them unset to fallback to the default handling. Of course, we could use some placeholder, magic UFFD_DEFAULT_HANDLER keyword to just use the uffd_* stuff without exporting them. So NULL would mean "not supported" and "UFFD_DEFAULT_HANDLER" would mean "no special handling needed". Not sure how often that would be the case, though. For shmem it would probably only be the poison callback, for others, I am not sure. > Where guest-memfd needs to write the one function: > guest_memfd_pte_continue(), from what I understand. It would be interesting to see how that one would look like. I'd assume fairly similar to shmem_mfill_atomic_pte_continue()? Interesting question would be, how to avoid the code duplication there. (as a side note, I wonder if we would want to call most of these uffd helper uffd_*) I'll have to think about some of this some more. In particular, alternatives to at least get all the shmem logic cleanly out of there and maybe only have a handful callback into hugetlb. IOW, not completely make everything fit the "odd case" and rather focus on the "normal cases" when designing this vm_ops interface here. Not sure if that makes sense, just wanted to raise it. -- Cheers David ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-11-05 21:23 ` David Hildenbrand @ 2025-11-06 16:16 ` Liam R. Howlett 2025-11-07 10:16 ` David Hildenbrand (Red Hat) 0 siblings, 1 reply; 37+ messages in thread From: Liam R. Howlett @ 2025-11-06 16:16 UTC (permalink / raw) To: David Hildenbrand Cc: Peter Xu, linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli * David Hildenbrand <dhildenb@redhat.com> [251105 16:23]: > On 30.10.25 18:13, Liam R. Howlett wrote: > > * Peter Xu <peterx@redhat.com> [251021 12:28]: > > > > ... > > > > > Can you send some patches and show us the code, help everyone to support > > > guest-memfd minor fault, please? > > > > Patches are here: > > > > Hi Liam, > > thanks for showing us what userfaultfd could look like when refactored > according to your idea. I think most of the userfaultfd core code is easier > to get in your tree. > > > https://git.infradead.org/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/modularized_mem > > > > This is actually modularized memory types. That means there is no > > hugetlb.h or shmem.h included in mm/userfaultfd.c code. > > Yeah, I think there is this confusion of "modulize memory types" and > "support minor fault in guest_memfd". > > So I see what you did here as trying to see how far we could go to remove > any traces of shmem/hugetlb from userfaultd core. Right, yes. This is an example of what we can do to completely modularize the memory types. If this were the reality, it might be possible to create a guest_memfd type that has the necessary changes and only export what is required for the module to execute - that is, avoid exporting memory functions and also avoid having the bulk of guest_memfd remain where it is today. > > So I'll comment based on that and rather see it as a bigger, more extreme > rework. Sounds good, I was hoping the patches could be a starting point for a conversation. > > > > > uffd_flag_t has been removed. This was turning into a middleware and > > it is not necessary. Neither is supported_ioctls. > > I assume you mean the entries that were proposed in Peters series, not > something that is upstream. No. This is upstream today. uffd_flags_t is used for two purposes today: 1. deciding which operation to call and 2. pass through if the request includes wp. In fact, some of the callers just create and send the flag only within the mm/userfaultfd.c code because wp doesn't make sense. Once dispatched to the operation code, the flag is only ever used to check for wp. If the code was structured to use the call path to know what underlying operation, then the flag can be reduced to a boolean - which is what I ended up doing. > > > > > hugetlb now uses the same functions as every other memory type, > > including anon memory. > > > > Any memory type can change functionality without adding instructions or > > flags or anything to some other code. > > > > This code passes uffd-unit-test and uffd-wp-mremap (skipped the swap > > tests). > > > > guest-memfd can implement whatever it needs to (or use others > > implementations), like shmem_uffd_ops here: > > There is obviously some downside to be had with this approach (some of which > Mike raised), regarding the interface to "memory types" implementing this, > but I'll discuss that later. > > > > > static const struct vm_uffd_ops shmem_uffd_ops = { > > .copy = shmem_mfill_atomic_pte_copy, > > .zeropage = shmem_mfill_atomic_pte_zeropage, > > .cont = shmem_mfill_atomic_pte_continue, > > .poison = mfill_atomic_pte_poison, > > .writeprotect = uffd_writeprotect, > > .is_dst_valid = shmem_is_dst_valid, > > .increment = mfill_size, > > See below, I wonder if that could be performed by the callbacks invoked as > part of the prior calls to mfill_loop() etc. > > > .failed_do_unlock = uffd_failed_do_unlock, > > That one is a bit unfortunate (read: ugly :) ). Agreed. > > failed_do_unlock() is only called from mfill_copy_loop(). Where we perform a > prior info.uffd_ops->copy. > > > After calling err = info->op(info); > > Couldn't that callback just deal with the -ENOENT case? > > So in case of increment/failed_do_unlock, maybe we could find a way to just > let the ->copy etc communicate/perform that directly. The failure case is only detected after getting a folio, but will need to 'retry' (copy is the only one that does a retry). Retry gets the destination vma, where the vm_ops comes from. This is why you need to return to the loop. So it's not that simple to moving it into the function. In upstream today, the return -ENOENT can only happen for copy (in fact others mask it out..), but every operation checks for -ENOENT since they are all using the same code block. All of this code is more complicated because we have to find the vma before we know what variation of the operation we need. Annoyingly, this is decided per-mfill_size even though the vma doesn't change, currently in mfill_atomic_pte(). I think we could find a way to do what you are suggesting, and I think it's easier and less risky if the logical operations are not being dispatched based on uffd_flag_t. > > > .page_shift = uffd_page_shift, > > Fortunately, this is not required. The only user in move_present_ptes() > moves *real* PTEs, and nothing else (no hugetlb PTEs that are PMDs etc. in > disguise). The hugetlb code had a different value, so I did extract it when I Iunited mfill_atomic() and mfill_atomic_hugetlb(). I am sure there are other changes that could be removed as well, but to logically follow the changes through each step it seemed easier to extract everything that was different into its own function pointer. > > > .complete_register = uffd_complete_register, > > }; > > > > So, the design is to callback into the memory-type handler, which will then > use exported uffd functionality to get the job done. > > This nicely abstracts hugetlb handling, but could mean that any code > implementing this interface has to built up on exported uffd functionality > (not judging, just saying). > > As we're using the callbacks as an indication whether features are > supported, we cannot easily leave them unset to fallback to the default > handling. > > Of course, we could use some placeholder, magic UFFD_DEFAULT_HANDLER keyword > to just use the uffd_* stuff without exporting them. > > So NULL would mean "not supported" and "UFFD_DEFAULT_HANDLER" would mean "no > special handling needed". > > Not sure how often that would be the case, though. For shmem it would > probably only be the poison callback, for others, I am not sure. There are certainly a lot of this we would not want to export. My initial thought was to create two function pointers: one for operations that can be replaced, and one for basic functions that always have a default. We could do this with two function pointers, either tiered or at the same level. Most of this is to do with hugetlb having its own code branch into its own loop. We could even create an op that is returned that only lives in mm/userfaultfd.c and has two variants: hugetlb and not_hugetlb. This would indeed need the hugetlb.h again, but I'm pretty sure that removing the header is 'too big of a change' anyways. > > > Where guest-memfd needs to write the one function: > > guest_memfd_pte_continue(), from what I understand. > > It would be interesting to see how that one would look like. > > I'd assume fairly similar to shmem_mfill_atomic_pte_continue()? > > Interesting question would be, how to avoid the code duplication there. Yes, this is where I was going here. I was going to try and finish this off by creating that one function. That, and reducing the vm_ops to a more sensible size (as mentioned above). shmem_mfill_atomic_pte_continue() could be cut up into function segments to avoid the duplication. Or we could make a wrapper that accepts a function pointer.. there are certainly ways we can mitigate duplication. Really, what is happening here is that the code was written to try and use common code. Then hugetlb came in and duplicated everything anyways, so we lost what we were gaining by creating a dispatcher/middleware in the end. Then handling errors complicated it further. The code has also bee __alway_inline'ed, so the assembly duplicates the code anyways. It's really an attempt to avoid updating multiple functions when issues arise. But now we have error checks that don't need to happen and they are running in a loop... also hugetlb has its own loop. And returning -ENOENT has a special meaning so we have to be careful of that too. I don't think the compliers are smart enough to know that the retry loop can be removed for 3/4 cases, so the assembly is probably sub-optimal. > > (as a side note, I wonder if we would want to call most of these uffd helper > uffd_*) Yeah, sure. Does it matter for static inlines? Naming is hard and I think shmem_mfill_atomic_pte_continue() is a dumb name as well.. it's too short really :) > > > I'll have to think about some of this some more. In particular, alternatives > to at least get all the shmem logic cleanly out of there and maybe only have > a handful callback into hugetlb. > > IOW, not completely make everything fit the "odd case" and rather focus on > the "normal cases" when designing this vm_ops interface here. > > Not sure if that makes sense, just wanted to raise it. Thanks for looking. I think there is some use to having this example and that's why I was asking what it might look like. I certainly went overboard in the last few commits to remove hugetlb all together, but at that point it was so close.. I think there might be value uniting both hugetlb and the normal code path, even if the operations call signatures are aligned and we just use a pointer to a struct within the "while (src_addr < src_start + len)" loop that exists today. The removal of the uffd_flags_t is also something that might be worth exploring. Cheers, Liam ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types 2025-11-06 16:16 ` Liam R. Howlett @ 2025-11-07 10:16 ` David Hildenbrand (Red Hat) 0 siblings, 0 replies; 37+ messages in thread From: David Hildenbrand (Red Hat) @ 2025-11-07 10:16 UTC (permalink / raw) To: Liam R. Howlett, Peter Xu, linux-kernel, linux-mm, Mike Rapoport, Muchun Song, Nikita Kalyazin, Vlastimil Babka, Axel Rasmussen, Andrew Morton, James Houghton, Lorenzo Stoakes, Hugh Dickins, Michal Hocko, Ujwal Kundur, Oscar Salvador, Suren Baghdasaryan, Andrea Arcangeli [wondering how my mail client decides to use random mail aliases at this point. The kernel.org change seems to confuse something :) ] >> >>> >>> uffd_flag_t has been removed. This was turning into a middleware and >>> it is not necessary. Neither is supported_ioctls. >> >> I assume you mean the entries that were proposed in Peters series, not >> something that is upstream. > > No. This is upstream today. Ah, you mean *uffd_flags_t*. I was confused there for a second when grepping the codebase. Yeah, not sad to see that go ;) > uffd_flags_t is used for two purposes > today: 1. deciding which operation to call and 2. pass through if the > request includes wp. In fact, some of the callers just create and send > the flag only within the mm/userfaultfd.c code because wp doesn't make > sense. Once dispatched to the operation code, the flag is only ever > used to check for wp. > > If the code was structured to use the call path to know what underlying > operation, then the flag can be reduced to a boolean - which is what I > ended up doing. > [...] >> >> After calling err = info->op(info); >> >> Couldn't that callback just deal with the -ENOENT case? >> >> So in case of increment/failed_do_unlock, maybe we could find a way to just >> let the ->copy etc communicate/perform that directly. > > The failure case is only detected after getting a folio, but will need > to 'retry' (copy is the only one that does a retry). Retry gets the > destination vma, where the vm_ops comes from. This is why you need to > return to the loop. So it's not that simple to moving it into the > function. In mfill_copy_loop() we have err = info->op(info); cond_resched(); if (unlikely(err == -ENOENT)) { err = info->uffd_ops->failed_do_unlock(info); if (unlikely(err)) return err; /* Unlocked already */ return -ENOENT; } else { VM_WARN_ON_ONCE(info->foliop); } if (!err) { uffd_info_inc(info); if (fatal_signal_pending(current)) err = -EINTR; } Just to be clear, I was thinking about moving the failed_do_unlock() handling on -ENOENT into the info->op(). And the inc as well. (different) Return values could indicate what we have or don't have to do. > > In upstream today, the return -ENOENT can only happen for copy (in fact > others mask it out..), but every operation checks for -ENOENT since they > are all using the same code block. > > All of this code is more complicated because we have to find the vma > before we know what variation of the operation we need. Annoyingly, > this is decided per-mfill_size even though the vma doesn't change, > currently in mfill_atomic_pte(). > > I think we could find a way to do what you are suggesting, and I think > it's easier and less risky if the logical operations are not being > dispatched based on uffd_flag_t. Yeah :) > >> >>> .page_shift = uffd_page_shift, >> >> Fortunately, this is not required. The only user in move_present_ptes() >> moves *real* PTEs, and nothing else (no hugetlb PTEs that are PMDs etc. in >> disguise). > > The hugetlb code had a different value, so I did extract it when I > Iunited mfill_atomic() and mfill_atomic_hugetlb(). I am sure there are > other changes that could be removed as well, but to logically follow the > changes through each step it seemed easier to extract everything that > was different into its own function pointer. Let me elaborate to see if I am missing something. page_shift() is only invoked from move_present_ptes(). move_present_ptes() works on individual PAGE_SIZE PTEs. hugetlb does not support UFFDIO_MOVE, see how validate_move_areas() rejects VM_HUGETLB. Also, move_present_ptes() wouldn't ever do anything on large folios, see move_present_ptes() where we have a if (folio_test_large(src_folio) || ... err = -EBUSY; goto out; } So I think the page_shift() callback can simply be dropped? > >> >>> .complete_register = uffd_complete_register, >>> }; >>> >> >> So, the design is to callback into the memory-type handler, which will then >> use exported uffd functionality to get the job done. >> >> This nicely abstracts hugetlb handling, but could mean that any code >> implementing this interface has to built up on exported uffd functionality >> (not judging, just saying). >> >> As we're using the callbacks as an indication whether features are >> supported, we cannot easily leave them unset to fallback to the default >> handling. >> >> Of course, we could use some placeholder, magic UFFD_DEFAULT_HANDLER keyword >> to just use the uffd_* stuff without exporting them. >> >> So NULL would mean "not supported" and "UFFD_DEFAULT_HANDLER" would mean "no >> special handling needed". >> >> Not sure how often that would be the case, though. For shmem it would >> probably only be the poison callback, for others, I am not sure. > > There are certainly a lot of this we would not want to export. My > initial thought was to create two function pointers: one for operations > that can be replaced, and one for basic functions that always have a > default. We could do this with two function pointers, either tiered or > at the same level. > > Most of this is to do with hugetlb having its own code branch into its > own loop. We could even create an op that is returned that only lives > in mm/userfaultfd.c and has two variants: hugetlb and not_hugetlb. This > would indeed need the hugetlb.h again, but I'm pretty sure that removing > the header is 'too big of a change' anyways. Yes, I think leaving hugetlb be the only special thing around would be a sensible thing to do. But I would expect shmem+anon etc. to be completely modularizable (is that a word?). Having a high-level API draft of that could be very valuable. > > >> >>> Where guest-memfd needs to write the one function: >>> guest_memfd_pte_continue(), from what I understand. >> >> It would be interesting to see how that one would look like. >> >> I'd assume fairly similar to shmem_mfill_atomic_pte_continue()? >> >> Interesting question would be, how to avoid the code duplication there. > > Yes, this is where I was going here. I was going to try and finish this > off by creating that one function. That, and reducing the vm_ops to a > more sensible size (as mentioned above). > > shmem_mfill_atomic_pte_continue() could be cut up into function segments > to avoid the duplication. Or we could make a wrapper that accepts a > function pointer.. there are certainly ways we can mitigate duplication. > Seeing a prototype of that would be nice. > Really, what is happening here is that the code was written to try and > use common code. Then hugetlb came in and duplicated everything > anyways, so we lost what we were gaining by creating a > dispatcher/middleware in the end. Then handling errors complicated it > further. > > The code has also bee __alway_inline'ed, so the assembly duplicates the > code anyways. It's really an attempt to avoid updating multiple > functions when issues arise. But now we have error checks that don't > need to happen and they are running in a loop... also hugetlb has its > own loop. And returning -ENOENT has a special meaning so we have to be > careful of that too. > > I don't think the compliers are smart enough to know that the retry > loop can be removed for 3/4 cases, so the assembly is probably > sub-optimal. > >> >> (as a side note, I wonder if we would want to call most of these uffd helper >> uffd_*) > > Yeah, sure. Does it matter for static inlines? Naming is hard and I > think shmem_mfill_atomic_pte_continue() is a dumb name as well.. it's > too short really :) I think it just makes it clearer that this is some common uffd functionality we are using. Tells you immediately when reading the code what's common and what's hand-crafted. Agreed that the names are ... suboptimal. > >> >> >> I'll have to think about some of this some more. In particular, alternatives >> to at least get all the shmem logic cleanly out of there and maybe only have >> a handful callback into hugetlb. >> >> IOW, not completely make everything fit the "odd case" and rather focus on >> the "normal cases" when designing this vm_ops interface here. >> >> Not sure if that makes sense, just wanted to raise it. > > Thanks for looking. I think there is some use to having this example > and that's why I was asking what it might look like. I certainly went > overboard in the last few commits to remove hugetlb all together, but at > that point it was so close.. Right. > > I think there might be value uniting both hugetlb and the normal code > path, even if the operations call signatures are aligned and we just use > a pointer to a struct within the "while (src_addr < src_start + len)" > loop that exists today. > Right, what would be valuable is still leaving hugetlb be special, but minimizing the degree to which mm/userfaultfd.c would have to care / treat it specially. > The removal of the uffd_flags_t is also something that might be worth > exploring. Agreed. -- Cheers David ^ permalink raw reply [flat|nested] 37+ messages in thread
end of thread, other threads:[~2025-11-07 10:16 UTC | newest] Thread overview: 37+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-10-14 23:14 [PATCH v4 0/4] mm/userfaultfd: modulize memory types Peter Xu 2025-10-14 23:14 ` [PATCH v4 1/4] mm: Introduce vm_uffd_ops API Peter Xu 2025-10-20 14:18 ` David Hildenbrand 2025-10-14 23:14 ` [PATCH v4 2/4] mm/shmem: Support " Peter Xu 2025-10-20 14:18 ` David Hildenbrand 2025-10-14 23:15 ` [PATCH v4 3/4] mm/hugetlb: " Peter Xu 2025-10-20 14:19 ` David Hildenbrand 2025-10-14 23:15 ` [PATCH v4 4/4] mm: Apply vm_uffd_ops API to core mm Peter Xu 2025-10-20 13:34 ` [PATCH v4 0/4] mm/userfaultfd: modulize memory types David Hildenbrand 2025-10-20 14:12 ` Peter Xu 2025-10-21 15:51 ` Liam R. Howlett 2025-10-21 16:28 ` Peter Xu 2025-10-30 17:13 ` Liam R. Howlett 2025-10-30 18:00 ` Nikita Kalyazin 2025-10-30 19:07 ` Peter Xu 2025-10-30 19:55 ` Peter Xu 2025-10-30 20:23 ` Lorenzo Stoakes 2025-10-30 21:13 ` Peter Xu 2025-10-30 21:27 ` Peter 2025-11-03 20:01 ` David Hildenbrand (Red Hat) 2025-11-03 20:46 ` Peter Xu 2025-11-03 21:27 ` David Hildenbrand (Red Hat) 2025-11-03 22:49 ` Peter Xu 2025-11-04 7:10 ` Lorenzo Stoakes 2025-11-04 14:18 ` David Hildenbrand (Red Hat) 2025-11-04 7:21 ` Mike Rapoport 2025-11-04 12:23 ` David Hildenbrand (Red Hat) 2025-11-06 16:32 ` Liam R. Howlett 2025-10-30 20:52 ` Liam R. Howlett 2025-10-30 21:33 ` Peter Xu 2025-10-30 20:24 ` Liam R. Howlett 2025-10-30 21:26 ` Peter Xu 2025-11-03 16:11 ` Mike Rapoport 2025-11-03 18:43 ` Liam R. Howlett 2025-11-05 21:23 ` David Hildenbrand 2025-11-06 16:16 ` Liam R. Howlett 2025-11-07 10:16 ` David Hildenbrand (Red Hat)
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).