* [RFC PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook @ 2025-05-01 17:25 Lorenzo Stoakes 2025-05-01 17:25 ` [RFC PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback Lorenzo Stoakes ` (3 more replies) 0 siblings, 4 replies; 9+ messages in thread From: Lorenzo Stoakes @ 2025-05-01 17:25 UTC (permalink / raw) To: Andrew Morton Cc: David Hildenbrand, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato, linux-fsdevel, linux-kernel, linux-mm, Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox During the mmap() of a file-backed mapping, we invoke the underlying driver file's mmap() callback in order to perform driver/file system initialisation of the underlying VMA. This has been a source of issues in the past, including a significant security concern relating to unwinding of error state discovered by Jann Horn, as fixed in commit 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour") which performed the recent, significant, rework of mmap() as a whole. However, one fly in the ointment has remained - drivers have a great deal of freedom in the .mmap() hook to manipulate VMA state (as well as page table state). This can be problematic, as we can no longer reason sensibly about VMA state once the call is complete (the ability to do - anything - here does rather interfere with that). In addition, callers may choose to do odd or unusual things which might interfere with subsequent steps in the mmap() process, and they may do so and then raise an error, requiring very careful unwinding of state about which we can make no assumptions. Rather than providing such an open-ended interface, this series provides an alternative, far more restrictive one - we expose a whitelist of fields which can be adjusted by the driver, along with immutable state upon which the driver can make such decisions: struct vm_area_desc { /* Immutable state. */ struct mm_struct *mm; unsigned long start; unsigned long end; /* Mutable fields. Populated with initial state. */ pgoff_t pgoff; struct file *file; vm_flags_t vm_flags; pgprot_t page_prot; /* Write-only fields. */ const struct vm_operations_struct *vm_ops; void *private_data; }; The mmap logic then updates the state used to either merge with a VMA or establish a new VMA based upon these fields. This is achieved via the new file hook .mmap_prepare(), which is, importantly, invoked very early on in the mmap() process. If an error arises, we can very simply abort the operation with very little unwinding of state required. The existing logic contains another, related, peccadillo - since the .mmap() callback might do anything, it may also cause a previously unmergeable VMA to become mergeable with adjacent VMAs. Right now the logic will retry a merge like this only if the driver changes VMA flags, and changes them in such a way that a merge might succeed (that is, the flags are not 'special' - they do not contain any of the flags specified in VM_SPECIAL). This has also been the source of a great deal of pain - it's hard to reason about an .mmap() callback that might do - anything - but it's also hard to reason about setting up a VMA and writing to the maple tree, only to do it again utilising a great deal of shared state. Since .mmap_prepare() sets fields before the first merge is even attempted, the use of this callback obviates the need for this retry merge logic. A driver may only specify .mmap_prepare() or the deprecated .mmap() callback.
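To illustrate the shape of the new hook, a minimal conversion of a hypothetical driver might look as follows - a sketch only, in which foo_mmap(), foo_mmap_prepare(), foo_vm_ops and foo_max_size() are illustrative names rather than anything taken from this series:

/* Old-style hook: receives a part-constructed VMA and may mutate it freely. */
static int foo_mmap(struct file *file, struct vm_area_struct *vma)
{
	if (vma->vm_end - vma->vm_start > foo_max_size(file))
		return -EINVAL;

	vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP);
	vma->vm_ops = &foo_vm_ops;
	return 0;
}

/* New-style hook: operates on the whitelisted descriptor, before any VMA is written. */
static int foo_mmap_prepare(struct vm_area_desc *desc)
{
	if (desc->end - desc->start > foo_max_size(desc->file))
		return -EINVAL;

	desc->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
	desc->vm_ops = &foo_vm_ops;
	return 0;
}

static const struct file_operations foo_fops = {
	.mmap_prepare	= foo_mmap_prepare,	/* replaces .mmap */
};

Note that the descriptor variant assigns plain values to desc->vm_flags and desc->vm_ops - no VMA flag-setting helpers are required, as the mmap logic itself writes the results back to whichever VMA is ultimately merged or created.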
In future we may add further callbacks beyond .mmap_prepare() to facilitate all use cases as we convert drivers. In researching this change, I examined every .mmap() callback, and discovered only a very few that set VMA state in such a way that (a) the VMA flags changed and (b) the resulting VMA would be mergeable. In the majority of cases, it turns out that drivers are mapping kernel memory and thus ultimately set VM_PFNMAP, VM_MIXEDMAP, or other unmergeable VM_SPECIAL flags. Of those that remain I identified a number of cases which are only applicable in DAX, setting the VM_HUGEPAGE flag: * dax_mmap() * erofs_file_mmap() * ext4_file_mmap() * xfs_file_mmap() For the removal of this retry merge to impact users, each of these cases would require a user to mmap() files using DAX, in parts, immediately adjacent to one another. This is a very unlikely use case and so it does not appear to be worthwhile to adjust this functionality accordingly. We can, however, very quickly do so if needed by simply adding an .mmap_prepare() callback to these as required (see the sketch following this mail). There are two further non-DAX cases I identified: * orangefs_file_mmap() - Clears VM_RAND_READ if set, replacing with VM_SEQ_READ. * usb_stream_hwdep_mmap() - Sets VM_DONTDUMP. Both of these cases again seem very unlikely to be mmap()'d immediately adjacent to one another in a fashion that would result in a merge. Finally, we are left with a viable case: * secretmem_mmap() - Sets VM_LOCKED, VM_DONTDUMP. This is viable enough that the mm selftests trigger the logic as a matter of course. Therefore, this series replaces the secretmem_mmap() hook with secretmem_mmap_prepare(). RFC v2: * Renamed .mmap_proto() to .mmap_prepare() as per David. * Made .mmap_prepare(), .mmap() mutually exclusive. * Updated call_mmap() to bail out if .mmap_prepare() callback present as per Jann. * Renamed vma_proto to vm_area_desc as per Mike. * Added accidentally missing page_prot assignment/read from vm_area_desc. * Renamed vm_area_desc->flags to vm_flags as per Liam. * Added [__]call_mmap_prepare() for consistency with call_mmap(). * Added file_has_xxx_hook() helpers. * Renamed file_has_valid_mmap() to file_has_valid_mmap_hooks() and check that the hooks are mutually exclusive. RFC v1: https://lore.kernel.org/all/cover.1746040540.git.lorenzo.stoakes@oracle.com/ Lorenzo Stoakes (3): mm: introduce new .mmap_prepare() file callback mm: secretmem: convert to .mmap_prepare() hook mm/vma: remove mmap() retry merge include/linux/fs.h | 38 +++++++++++++++ include/linux/mm_types.h | 24 ++++++++++ mm/memory.c | 3 +- mm/mmap.c | 2 +- mm/secretmem.c | 14 +++--- mm/vma.c | 82 ++++++++++++++++++++++++++------ tools/testing/vma/vma_internal.h | 79 ++++++++++++++++++++++++++++-- 7 files changed, 214 insertions(+), 28 deletions(-) -- 2.49.0 ^ permalink raw reply [flat|nested] 9+ messages in thread
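A sketch of what such a DAX-flavoured conversion might look like, should it prove necessary - foo_fs naming is purely illustrative and this is not part of the series:

static int foo_fs_file_mmap_prepare(struct vm_area_desc *desc)
{
	/*
	 * Set the flag that previously made the VMA a candidate for the
	 * retry merge, but now before the first merge is even attempted.
	 */
	if (IS_DAX(file_inode(desc->file)))
		desc->vm_flags |= VM_HUGEPAGE;

	desc->vm_ops = &foo_fs_file_vm_ops;
	return 0;
}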
* [RFC PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback 2025-05-01 17:25 [RFC PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook Lorenzo Stoakes @ 2025-05-01 17:25 ` Lorenzo Stoakes 2025-05-01 17:25 ` [RFC PATCH v2 2/3] mm: secretmem: convert to .mmap_prepare() hook Lorenzo Stoakes ` (2 subsequent siblings) 3 siblings, 0 replies; 9+ messages in thread From: Lorenzo Stoakes @ 2025-05-01 17:25 UTC (permalink / raw) To: Andrew Morton Cc: David Hildenbrand, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato, linux-fsdevel, linux-kernel, linux-mm, Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox Provide a means by which drivers can specify which fields of those permitted to be changed should be altered prior to mmap()'ing a range (which may either result from a merge or from mapping an entirely new VMA). Doing so is substantially safer than the existing .mmap() callback, which provides unrestricted access to the part-constructed VMA and permits drivers and file systems to do 'creative' things which make it hard to reason about the state of the VMA after the function returns. The existing .mmap() callback's freedom has caused a great deal of issues, especially in error handling, as unwinding the mmap() state has proven to be non-trivial and caused significant issues in the past, for instance those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour"). It also necessitates a second attempt at merge once the .mmap() callback has completed, which has caused issues in the past, is awkward, adds overhead and is difficult to reason about. The .mmap_prepare() callback eliminates this requirement, as we can update fields prior to even attempting the first merge. It is safer, as we heavily restrict what can actually be modified, and being invoked very early in the mmap() process, error handling can be performed safely with very little unwinding of state required. The .mmap_prepare() and deprecated .mmap() callbacks are mutually exclusive, so we permit only one of them to be specified. Update vma userland test stubs to account for changes. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> --- include/linux/fs.h | 38 +++++++++++++++ include/linux/mm_types.h | 24 ++++++++++ mm/memory.c | 3 +- mm/mmap.c | 2 +- mm/vma.c | 70 +++++++++++++++++++++++++++- tools/testing/vma/vma_internal.h | 79 ++++++++++++++++++++++++++++++-- 6 files changed, 208 insertions(+), 8 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 016b0fe1536e..d6c5a703a215 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2169,6 +2169,7 @@ struct file_operations { int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags); int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *, unsigned int poll_flags); + int (*mmap_prepare)(struct vm_area_desc *); } __randomize_layout; /* Supports async buffered reads */ @@ -2238,11 +2239,48 @@ struct inode_operations { struct offset_ctx *(*get_offset_ctx)(struct inode *inode); } ____cacheline_aligned; +static inline bool file_has_deprecated_mmap_hook(struct file *file) +{ + return file->f_op->mmap; +} + +static inline bool file_has_mmap_prepare_hook(struct file *file) +{ + return file->f_op->mmap_prepare; +} + +/* Did the driver provide valid mmap hook configuration?
*/ +static inline bool file_has_valid_mmap_hooks(struct file *file) +{ + bool has_mmap = file_has_deprecated_mmap_hook(file); + bool has_mmap_prepare = file_has_mmap_prepare_hook(file); + + /* Hooks are mutually exclusive. */ + if (has_mmap && has_mmap_prepare) + return false; + + /* But at least one must be specified. */ + if (!has_mmap && !has_mmap_prepare) + return false; + + return true; +} + static inline int call_mmap(struct file *file, struct vm_area_struct *vma) { + /* If the driver specifies .mmap_prepare() this call is invalid. */ + if (file_has_mmap_prepare_hook(file)) + return -EINVAL; + return file->f_op->mmap(file, vma); } +static inline int __call_mmap_prepare(struct file *file, + struct vm_area_desc *desc) +{ + return file->f_op->mmap_prepare(desc); +} + extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *); extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *); extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *, diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index e76bade9ebb1..15808cad2bc1 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -763,6 +763,30 @@ struct vma_numab_state { int prev_scan_seq; }; +/* + * Describes a VMA that is about to be mmap()'ed. Drivers may choose to + * manipulate mutable fields which will cause those fields to be updated in the + * resultant VMA. + * + * Helper functions are not required for manipulating any field. + */ +struct vm_area_desc { + /* Immutable state. */ + struct mm_struct *mm; + unsigned long start; + unsigned long end; + + /* Mutable fields. Populated with initial state. */ + pgoff_t pgoff; + struct file *file; + vm_flags_t vm_flags; + pgprot_t page_prot; + + /* Write-only fields. */ + const struct vm_operations_struct *vm_ops; + void *private_data; +}; + /* * This struct describes a virtual memory area. There is one of these * per VM-area/task. A VM area is any part of the process virtual memory diff --git a/mm/memory.c b/mm/memory.c index 68c1d962d0ad..99af83434e7c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -527,10 +527,11 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr, dump_page(page, "bad pte"); pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n", (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index); - pr_alert("file:%pD fault:%ps mmap:%ps read_folio:%ps\n", + pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n", vma->vm_file, vma->vm_ops ? vma->vm_ops->fault : NULL, vma->vm_file ? vma->vm_file->f_op->mmap : NULL, + vma->vm_file ? vma->vm_file->f_op->mmap_prepare : NULL, mapping ? mapping->a_ops->read_folio : NULL); dump_stack(); add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); diff --git a/mm/mmap.c b/mm/mmap.c index 81dd962a1cfc..50f902c08341 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -475,7 +475,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, vm_flags &= ~VM_MAYEXEC; } - if (!file->f_op->mmap) + if (!file_has_valid_mmap_hooks(file)) return -ENODEV; if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP)) return -EINVAL; diff --git a/mm/vma.c b/mm/vma.c index 1f2634b29568..acd5b98fe087 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -17,6 +17,11 @@ struct mmap_state { unsigned long pglen; unsigned long flags; struct file *file; + pgprot_t page_prot; + + /* User-defined fields, perhaps updated by .mmap_prepare(). 
*/ + const struct vm_operations_struct *vm_ops; + void *vm_private_data; unsigned long charged; bool retry_merge; @@ -40,6 +45,7 @@ struct mmap_state { .pglen = PHYS_PFN(len_), \ .flags = flags_, \ .file = file_, \ + .page_prot = vm_get_page_prot(flags_), \ } #define VMG_MMAP_STATE(name, map_, vma_) \ @@ -2385,6 +2391,10 @@ static int __mmap_new_file_vma(struct mmap_state *map, int error; vma->vm_file = get_file(map->file); + + if (!file_has_deprecated_mmap_hook(map->file)) + return 0; + error = mmap_file(vma->vm_file, vma); if (error) { fput(vma->vm_file); @@ -2441,7 +2451,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap) vma_iter_config(vmi, map->addr, map->end); vma_set_range(vma, map->addr, map->end, map->pgoff); vm_flags_init(vma, map->flags); - vma->vm_page_prot = vm_get_page_prot(map->flags); + vma->vm_page_prot = map->page_prot; if (vma_iter_prealloc(vmi, vma)) { error = -ENOMEM; @@ -2528,6 +2538,58 @@ static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma) vma_set_page_prot(vma); } +/* + * Invoke the f_op->mmap_prepare() callback for a file-backed mapping that + * specifies it. + * + * This is called prior to any merge attempt, and updates whitelisted fields + * that are permitted to be updated by the caller. + * + * All but user-defined fields will be pre-populated with original values. + * + * Returns 0 on success, or an error code otherwise. + */ +static int call_mmap_prepare(struct mmap_state *map) +{ + int err; + struct vm_area_desc desc = { + .mm = map->mm, + .start = map->addr, + .end = map->end, + + .pgoff = map->pgoff, + .file = map->file, + .vm_flags = map->flags, + .page_prot = map->page_prot, + }; + + VM_WARN_ON(!file_has_valid_mmap_hooks(map->file)); + + /* Invoke the hook. */ + err = __call_mmap_prepare(map->file, &desc); + if (err) + return err; + + /* Update fields permitted to be changed. */ + map->pgoff = desc.pgoff; + map->file = desc.file; + map->flags = desc.vm_flags; + map->page_prot = desc.page_prot; + /* User-defined fields. */ + map->vm_ops = desc.vm_ops; + map->vm_private_data = desc.private_data; + + return 0; +} + +static void set_vma_user_defined_fields(struct vm_area_struct *vma, + struct mmap_state *map) +{ + if (map->vm_ops) + vma->vm_ops = map->vm_ops; + vma->vm_private_data = map->vm_private_data; +} + static unsigned long __mmap_region(struct file *file, unsigned long addr, unsigned long len, vm_flags_t vm_flags, unsigned long pgoff, struct list_head *uf) @@ -2535,10 +2597,13 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr, struct mm_struct *mm = current->mm; struct vm_area_struct *vma = NULL; int error; + bool have_mmap_prepare = file && file_has_mmap_prepare_hook(file); VMA_ITERATOR(vmi, mm, addr); MMAP_STATE(map, mm, &vmi, addr, len, pgoff, vm_flags, file); error = __mmap_prepare(&map, uf); + if (!error && have_mmap_prepare) + error = call_mmap_prepare(&map); if (error) goto abort_munmap; @@ -2556,6 +2621,9 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr, goto unacct_error; } + if (have_mmap_prepare) + set_vma_user_defined_fields(vma, &map); + /* If flags changed, we might be able to merge, so try again. 
*/ if (map.retry_merge) { struct vm_area_struct *merged; diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h index 198abe66de5a..a2cc54e9ed36 100644 --- a/tools/testing/vma/vma_internal.h +++ b/tools/testing/vma/vma_internal.h @@ -253,8 +253,40 @@ struct mm_struct { unsigned long flags; /* Must use atomic bitops to access */ }; +struct vm_area_struct; + +/* + * Describes a VMA that is about to be mmap()'ed. Drivers may choose to + * manipulate mutable fields which will cause those fields to be updated in the + * resultant VMA. + * + * Helper functions are not required for manipulating any field. + */ +struct vm_area_desc { + /* Immutable state. */ + struct mm_struct *mm; + unsigned long start; + unsigned long end; + + /* Mutable fields. Populated with initial state. */ + pgoff_t pgoff; + struct file *file; + vm_flags_t vm_flags; + pgprot_t page_prot; + + /* Write-only fields. */ + const struct vm_operations_struct *vm_ops; + void *private_data; +}; + +struct file_operations { + int (*mmap)(struct file *, struct vm_area_struct *); + int (*mmap_prepare)(struct vm_area_desc *); +}; + struct file { struct address_space *f_mapping; + const struct file_operations *f_op; }; #define VMA_LOCK_OFFSET 0x40000000 @@ -1125,11 +1157,6 @@ static inline void vm_flags_clear(struct vm_area_struct *vma, vma->__vm_flags &= ~flags; } -static inline int call_mmap(struct file *, struct vm_area_struct *) -{ - return 0; -} - static inline int shmem_zero_setup(struct vm_area_struct *) { return 0; @@ -1405,4 +1432,46 @@ static inline void free_anon_vma_name(struct vm_area_struct *vma) (void)vma; } +static inline bool file_has_deprecated_mmap_hook(struct file *file) +{ + return file->f_op->mmap; +} + +static inline bool file_has_mmap_prepare_hook(struct file *file) +{ + return file->f_op->mmap_prepare; +} + +/* Did the driver provide valid mmap hook configuration? */ +static inline bool file_has_valid_mmap_hooks(struct file *file) +{ + bool has_mmap = file_has_deprecated_mmap_hook(file); + bool has_mmap_prepare = file_has_mmap_prepare_hook(file); + + /* Hooks are mutually exclusive. */ + if (has_mmap && has_mmap_prepare) + return false; + + /* But at least one must be specified. */ + if (!has_mmap && !has_mmap_prepare) + return false; + + return true; +} + +static inline int call_mmap(struct file *file, struct vm_area_struct *vma) +{ + /* If the driver specifies .mmap_prepare() this call is invalid. */ + if (file_has_mmap_prepare_hook(file)) + return -EINVAL; + + return file->f_op->mmap(file, vma); +} + +static inline int __call_mmap_prepare(struct file *file, + struct vm_area_desc *desc) +{ + return file->f_op->mmap_prepare(desc); +} + #endif /* __MM_VMA_INTERNAL_H */ -- 2.49.0 ^ permalink raw reply related [flat|nested] 9+ messages in thread
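As a quick illustration of the rule the file_has_*_hook() helpers above enforce, a test in the spirit of the existing userland stubs might look like the following sketch (dummy_mmap(), dummy_mmap_prepare() and the ASSERT_*() macros are assumed to be provided by the test harness, and are not part of this patch):

static int dummy_mmap(struct file *file, struct vm_area_struct *vma)
{
	return 0;
}

static int dummy_mmap_prepare(struct vm_area_desc *desc)
{
	return 0;
}

static const struct file_operations both_fops = {
	.mmap		= dummy_mmap,
	.mmap_prepare	= dummy_mmap_prepare,
};

static const struct file_operations prepare_fops = {
	.mmap_prepare	= dummy_mmap_prepare,
};

static bool test_mmap_hook_exclusivity(void)
{
	struct file both = { .f_op = &both_fops };
	struct file prep = { .f_op = &prepare_fops };

	/* Specifying both hooks is an invalid configuration... */
	ASSERT_FALSE(file_has_valid_mmap_hooks(&both));
	/* ...while .mmap_prepare() alone is fine... */
	ASSERT_TRUE(file_has_valid_mmap_hooks(&prep));
	/* ...and the deprecated call_mmap() path must reject it. */
	ASSERT_EQ(call_mmap(&prep, NULL), -EINVAL);

	return true;
}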
* [RFC PATCH v2 2/3] mm: secretmem: convert to .mmap_prepare() hook 2025-05-01 17:25 [RFC PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook Lorenzo Stoakes 2025-05-01 17:25 ` [RFC PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback Lorenzo Stoakes @ 2025-05-01 17:25 ` Lorenzo Stoakes 2025-05-07 6:09 ` Mike Rapoport 2025-05-01 17:25 ` [RFC PATCH v2 3/3] mm/vma: remove mmap() retry merge Lorenzo Stoakes 2025-05-02 12:20 ` [RFC PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook Jan Kara 3 siblings, 1 reply; 9+ messages in thread From: Lorenzo Stoakes @ 2025-05-01 17:25 UTC (permalink / raw) To: Andrew Morton Cc: David Hildenbrand, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato, linux-fsdevel, linux-kernel, linux-mm, Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox Secretmem has a simple .mmap() hook which is easily converted to the new .mmap_prepare() callback. Importantly, it's a rare instance of a driver that manipulates a VMA which is mergeable (that is, not a VM_SPECIAL mapping) while also adjusting VMA flags which may affect mergeability, meaning the retry merge logic might impact whether or not the VMA is merged. By using .mmap_prepare() there's no longer any need to retry the merge later, as we can simply set the correct flags from the start. This change therefore allows us to remove the retry merge logic in a subsequent commit. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> --- mm/secretmem.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/mm/secretmem.c b/mm/secretmem.c index 1b0a214ee558..f98cf3654974 100644 --- a/mm/secretmem.c +++ b/mm/secretmem.c @@ -120,18 +120,18 @@ static int secretmem_release(struct inode *inode, struct file *file) return 0; } -static int secretmem_mmap(struct file *file, struct vm_area_struct *vma) +static int secretmem_mmap_prepare(struct vm_area_desc *desc) { - unsigned long len = vma->vm_end - vma->vm_start; + unsigned long len = desc->end - desc->start; - if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0) + if ((desc->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0) return -EINVAL; - if (!mlock_future_ok(vma->vm_mm, vma->vm_flags | VM_LOCKED, len)) + if (!mlock_future_ok(desc->mm, desc->vm_flags | VM_LOCKED, len)) return -EAGAIN; - vm_flags_set(vma, VM_LOCKED | VM_DONTDUMP); - vma->vm_ops = &secretmem_vm_ops; + desc->vm_flags |= VM_LOCKED | VM_DONTDUMP; + desc->vm_ops = &secretmem_vm_ops; return 0; } @@ -143,7 +143,7 @@ bool vma_is_secretmem(struct vm_area_struct *vma) static const struct file_operations secretmem_fops = { .release = secretmem_release, - .mmap = secretmem_mmap, + .mmap_prepare = secretmem_mmap_prepare, }; static int secretmem_migrate_folio(struct address_space *mapping, -- 2.49.0 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [RFC PATCH v2 2/3] mm: secretmem: convert to .mmap_prepare() hook 2025-05-01 17:25 ` [RFC PATCH v2 2/3] mm: secretmem: convert to .mmap_prepare() hook Lorenzo Stoakes @ 2025-05-07 6:09 ` Mike Rapoport 0 siblings, 0 replies; 9+ messages in thread From: Mike Rapoport @ 2025-05-07 6:09 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Andrew Morton, David Hildenbrand, Liam R . Howlett, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato, linux-fsdevel, linux-kernel, linux-mm, Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox On Thu, May 01, 2025 at 06:25:28PM +0100, Lorenzo Stoakes wrote: > Secretmem has a simple .mmap() hook which is easily converted to the new > .mmap_prepare() callback. > > Importantly, it's a rare instance of an driver that manipulates a VMA which > is mergeable (that is, not a VM_SPECIAL mapping) while also adjusting VMA > flags which may adjust mergeability, meaning the retry merge logic might > impact whether or not the VMA is merged. > > By using .mmap_prepare() there's no longer any need to retry the merge > later as we can simply set the correct flags from the start. > > This change therefore allows us to remove the retry merge logic in a > subsequent commit. > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> > --- > mm/secretmem.c | 14 +++++++------- > 1 file changed, 7 insertions(+), 7 deletions(-) > > diff --git a/mm/secretmem.c b/mm/secretmem.c > index 1b0a214ee558..f98cf3654974 100644 > --- a/mm/secretmem.c > +++ b/mm/secretmem.c > @@ -120,18 +120,18 @@ static int secretmem_release(struct inode *inode, struct file *file) > return 0; > } > > -static int secretmem_mmap(struct file *file, struct vm_area_struct *vma) > +static int secretmem_mmap_prepare(struct vm_area_desc *desc) > { > - unsigned long len = vma->vm_end - vma->vm_start; > + unsigned long len = desc->end - desc->start; > > - if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0) > + if ((desc->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0) > return -EINVAL; > > - if (!mlock_future_ok(vma->vm_mm, vma->vm_flags | VM_LOCKED, len)) > + if (!mlock_future_ok(desc->mm, desc->vm_flags | VM_LOCKED, len)) > return -EAGAIN; > > - vm_flags_set(vma, VM_LOCKED | VM_DONTDUMP); > - vma->vm_ops = &secretmem_vm_ops; > + desc->vm_flags |= VM_LOCKED | VM_DONTDUMP; > + desc->vm_ops = &secretmem_vm_ops; > > return 0; > } > @@ -143,7 +143,7 @@ bool vma_is_secretmem(struct vm_area_struct *vma) > > static const struct file_operations secretmem_fops = { > .release = secretmem_release, > - .mmap = secretmem_mmap, > + .mmap_prepare = secretmem_mmap_prepare, > }; > > static int secretmem_migrate_folio(struct address_space *mapping, > -- > 2.49.0 > -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 9+ messages in thread
* [RFC PATCH v2 3/3] mm/vma: remove mmap() retry merge 2025-05-01 17:25 [RFC PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook Lorenzo Stoakes 2025-05-01 17:25 ` [RFC PATCH v2 1/3] mm: introduce new .mmap_prepare() file callback Lorenzo Stoakes 2025-05-01 17:25 ` [RFC PATCH v2 2/3] mm: secretmem: convert to .mmap_prepare() hook Lorenzo Stoakes @ 2025-05-01 17:25 ` Lorenzo Stoakes 2025-05-02 12:20 ` [RFC PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook Jan Kara 3 siblings, 0 replies; 9+ messages in thread From: Lorenzo Stoakes @ 2025-05-01 17:25 UTC (permalink / raw) To: Andrew Morton Cc: David Hildenbrand, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato, linux-fsdevel, linux-kernel, linux-mm, Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox We have now introduced a mechanism that obviates the need for a reattempted merge - the .mmap_prepare() file hook - so eliminate this functionality altogether. The retry merge logic has been the cause of a great deal of complexity in the past and has required careful manoeuvring of code to ensure its continued and correct functionality. It has also recently been involved in an issue surrounding maple tree state, which again points to its problematic nature. We make it much easier to reason about mmap() logic by eliminating this and simply writing a VMA once. This also opens the door to future optimisation and improvement in the mmap() logic. For any device or file system which encounters unwanted VMA fragmentation as a result of this change (that is, one which has not implemented an .mmap_prepare() hook), the issue is easily resolved by implementing one. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> --- mm/vma.c | 14 -------------- 1 file changed, 14 deletions(-) diff --git a/mm/vma.c b/mm/vma.c index acd5b98fe087..95696eb44365 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -24,7 +24,6 @@ struct mmap_state { void *vm_private_data; unsigned long charged; - bool retry_merge; struct vm_area_struct *prev; struct vm_area_struct *next; @@ -2417,8 +2416,6 @@ static int __mmap_new_file_vma(struct mmap_state *map, !(map->flags & VM_MAYWRITE) && (vma->vm_flags & VM_MAYWRITE)); - /* If the flags change (and are mergeable), let's retry later. */ - map->retry_merge = vma->vm_flags != map->flags && !(vma->vm_flags & VM_SPECIAL); map->flags = vma->vm_flags; return 0; @@ -2624,17 +2621,6 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr, if (have_mmap_prepare) set_vma_user_defined_fields(vma, &map); - /* If flags changed, we might be able to merge, so try again. */ - if (map.retry_merge) { - struct vm_area_struct *merged; - VMG_MMAP_STATE(vmg, &map, vma); - - vma_iter_config(map.vmi, map.addr, map.end); - merged = vma_merge_existing_range(&vmg); - if (merged) - vma = merged; - } - __mmap_complete(&map, vma); return addr; -- 2.49.0 ^ permalink raw reply related [flat|nested] 9+ messages in thread
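To see the case this removal relies upon, consider the secretmem scenario the mm selftests exercise: two adjacent mappings of the same secretmem file, whose flags only become compatible once VM_LOCKED | VM_DONTDUMP have been applied. A rough userspace sketch - error handling elided, and assuming the memfd_secret() syscall is available and RLIMIT_MEMLOCK permits the mapping:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	int fd = syscall(SYS_memfd_secret, 0);
	char *a;

	ftruncate(fd, 2 * page);

	/* Reserve two pages, then punch out and re-map the second page... */
	a = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	munmap(a + page, page);
	mmap(a + page, page, PROT_READ | PROT_WRITE,
	     MAP_SHARED | MAP_FIXED, fd, page);

	/*
	 * With .mmap_prepare(), VM_LOCKED | VM_DONTDUMP are applied before
	 * the first merge attempt, so /proc/self/maps should show a single
	 * two-page VMA here, with no retry merge having been needed.
	 */
	pause();
	return 0;
}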
* Re: [RFC PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook 2025-05-01 17:25 [RFC PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook Lorenzo Stoakes ` (2 preceding siblings ...) 2025-05-01 17:25 ` [RFC PATCH v2 3/3] mm/vma: remove mmap() retry merge Lorenzo Stoakes @ 2025-05-02 12:20 ` Jan Kara 2025-05-02 12:59 ` Lorenzo Stoakes 3 siblings, 1 reply; 9+ messages in thread From: Jan Kara @ 2025-05-02 12:20 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Andrew Morton, David Hildenbrand, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato, linux-fsdevel, linux-kernel, linux-mm, Alexander Viro, Christian Brauner, Jan Kara, Matthew Wilcox On Thu 01-05-25 18:25:26, Lorenzo Stoakes wrote: > During the mmap() of a file-backed mapping, we invoke the underlying driver > file's mmap() callback in order to perform driver/file system > initialisation of the underlying VMA. > > This has been a source of issues in the past, including a significant > security concern relating to unwinding of error state discovered by Jann > Horn, as fixed in commit 5de195060b2e ("mm: resolve faulty mmap_region() > error path behaviour") which performed the recent, significant, rework of > mmap() as a whole. > > However, one fly in the ointment has remained - drivers have a great > deal of freedom in the .mmap() hook to manipulate VMA state (as well as > page table state). > > This can be problematic, as we can no longer reason sensibly about VMA > state once the call is complete (the ability to do - anything - here does > rather interfere with that). > > In addition, callers may choose to do odd or unusual things which might > interfere with subsequent steps in the mmap() process, and they may do so and > then raise an error, requiring very careful unwinding of state about which > we can make no assumptions. > > Rather than providing such an open-ended interface, this series provides an > alternative, far more restrictive one - we expose a whitelist of fields > which can be adjusted by the driver, along with immutable state upon which > the driver can make such decisions: > > struct vm_area_desc { > /* Immutable state. */ > struct mm_struct *mm; > unsigned long start; > unsigned long end; > > /* Mutable fields. Populated with initial state. */ > pgoff_t pgoff; > struct file *file; > vm_flags_t vm_flags; > pgprot_t page_prot; > > /* Write-only fields. */ > const struct vm_operations_struct *vm_ops; > void *private_data; > }; > > The mmap logic then updates the state used to either merge with a VMA or > establish a new VMA based upon these fields. > > This is achieved via the new file hook .mmap_prepare(), which is, importantly, > invoked very early on in the mmap() process. > > If an error arises, we can very simply abort the operation with very little > unwinding of state required. Looks sensible. So is there a plan to transform existing .mmap hooks to .mmap_prepare hooks? I agree that for most filesystems this should be just an easy 1:1 replacement and AFAIU this would be preferred? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook 2025-05-02 12:20 ` [RFC PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook Jan Kara @ 2025-05-02 12:59 ` Lorenzo Stoakes 2025-05-05 13:37 ` Christian Brauner 0 siblings, 1 reply; 9+ messages in thread From: Lorenzo Stoakes @ 2025-05-02 12:59 UTC (permalink / raw) To: Jan Kara Cc: Andrew Morton, David Hildenbrand, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato, linux-fsdevel, linux-kernel, linux-mm, Alexander Viro, Christian Brauner, Matthew Wilcox On Fri, May 02, 2025 at 02:20:38PM +0200, Jan Kara wrote: > On Thu 01-05-25 18:25:26, Lorenzo Stoakes wrote: > > During the mmap() of a file-backed mapping, we invoke the underlying driver > > file's mmap() callback in order to perform driver/file system > > initialisation of the underlying VMA. > > > > This has been a source of issues in the past, including a significant > > security concern relating to unwinding of error state discovered by Jann > > Horn, as fixed in commit 5de195060b2e ("mm: resolve faulty mmap_region() > > error path behaviour") which performed the recent, significant, rework of > > mmap() as a whole. > > > > However, we have had a fly in the ointment remain - drivers have a great > > deal of freedom in the .mmap() hook to manipulate VMA state (as well as > > page table state). > > > > This can be problematic, as we can no longer reason sensibly about VMA > > state once the call is complete (the ability to do - anything - here does > > rather interfere with that). > > > > In addition, callers may choose to do odd or unusual things which might > > interfere with subsequent steps in the mmap() process, and it may do so and > > then raise an error, requiring very careful unwinding of state about which > > we can make no assumptions. > > > > Rather than providing such an open-ended interface, this series provides an > > alternative, far more restrictive one - we expose a whitelist of fields > > which can be adjusted by the driver, along with immutable state upon which > > the driver can make such decisions: > > > > struct vm_area_desc { > > /* Immutable state. */ > > struct mm_struct *mm; > > unsigned long start; > > unsigned long end; > > > > /* Mutable fields. Populated with initial state. */ > > pgoff_t pgoff; > > struct file *file; > > vm_flags_t vm_flags; > > pgprot_t page_prot; > > > > /* Write-only fields. */ > > const struct vm_operations_struct *vm_ops; > > void *private_data; > > }; > > > > The mmap logic then updates the state used to either merge with a VMA or > > establish a new VMA based upon this logic. > > > > This is achieved via new file hook .mmap_prepare(), which is, importantly, > > invoked very early on in the mmap() process. > > > > If an error arises, we can very simply abort the operation with very little > > unwinding of state required. > > Looks sensible. So is there a plan to transform existing .mmap hooks to > .mmap_prepare hooks? I agree that for most filesystems this should be just > easy 1:1 replacement and AFAIU this would be prefered? Thanks! Yeah the intent is to convert _all_ callers away from .mmap() so we can lock down what drivers are doing and be able to (relatively) safely make assumptions about what's going on in mmap logic. As David points out, we may need to add new callbacks to account for other requirements by drivers but .mmap_prepare() forms the foundation. 
Great to hear about filesystems - having done a big scan through .mmap() hooks I am pretty convinced _most_ can be pretty easily replaced. There are a few tricky things though, most notably things that 'pass through' .mmap(), as raised by (other) Jan[n]. For now we fudge that case by just disallowing any driver that doesn't implement .mmap(), though I think in future we can temporarily work around this by having a wrapper function that takes a driver with an .mmap_prepare() callback, populates a vm_area_desc object from the vma, then 'does the right thing' in setting vma fields once complete (see the sketch following this mail). Once the conversion is complete we can drop this. So overall I think this is very workable, and it's very helpful that we can do this incrementally over a period of time, given the number of .mmap() callbacks that need changing :) I am happy to dedicate my time to doing this (as this is near and dear to my heart) though will of course encourage anybody else willing to help with the effort :) I think it'll lead to a more stable kernel overall. > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR Cheers, Lorenzo ^ permalink raw reply [flat|nested] 9+ messages in thread
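The wrapper described above might look something like the following sketch - compat_vma_mmap_prepare() is an illustrative name, and this helper is not part of this series. It simply populates a vm_area_desc from the already-constructed VMA, invokes the target's .mmap_prepare(), and writes the whitelisted fields back:

/*
 * Sketch: bridge a file that provides .mmap_prepare() for callers that
 * still invoke the deprecated .mmap() path on an already-constructed
 * (but not yet inserted) VMA. Illustrative only.
 */
static int compat_vma_mmap_prepare(struct file *file,
				   struct vm_area_struct *vma)
{
	struct vm_area_desc desc = {
		.mm		= vma->vm_mm,
		.start		= vma->vm_start,
		.end		= vma->vm_end,

		.pgoff		= vma->vm_pgoff,
		.file		= vma->vm_file,
		.vm_flags	= vma->vm_flags,
		.page_prot	= vma->vm_page_prot,
	};
	int err;

	err = file->f_op->mmap_prepare(&desc);
	if (err)
		return err;

	/* Write the whitelisted updates back to the detached VMA. */
	vma->vm_pgoff = desc.pgoff;
	vma->vm_file = desc.file;
	vm_flags_init(vma, desc.vm_flags);
	vma->vm_page_prot = desc.page_prot;
	if (desc.vm_ops)
		vma->vm_ops = desc.vm_ops;
	vma->vm_private_data = desc.private_data;

	return 0;
}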
* Re: [RFC PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook 2025-05-02 12:59 ` Lorenzo Stoakes @ 2025-05-05 13:37 ` Christian Brauner 2025-05-06 9:25 ` Lorenzo Stoakes 0 siblings, 1 reply; 9+ messages in thread From: Christian Brauner @ 2025-05-05 13:37 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Jan Kara, Andrew Morton, David Hildenbrand, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato, linux-fsdevel, linux-kernel, linux-mm, Alexander Viro, Matthew Wilcox On Fri, May 02, 2025 at 01:59:49PM +0100, Lorenzo Stoakes wrote: > On Fri, May 02, 2025 at 02:20:38PM +0200, Jan Kara wrote: > > On Thu 01-05-25 18:25:26, Lorenzo Stoakes wrote: > > > During the mmap() of a file-backed mapping, we invoke the underlying driver > > > file's mmap() callback in order to perform driver/file system > > > initialisation of the underlying VMA. > > > > > > This has been a source of issues in the past, including a significant > > > security concern relating to unwinding of error state discovered by Jann > > > Horn, as fixed in commit 5de195060b2e ("mm: resolve faulty mmap_region() > > > error path behaviour") which performed the recent, significant, rework of > > > mmap() as a whole. > > > > > > However, we have had a fly in the ointment remain - drivers have a great > > > deal of freedom in the .mmap() hook to manipulate VMA state (as well as > > > page table state). > > > > > > This can be problematic, as we can no longer reason sensibly about VMA > > > state once the call is complete (the ability to do - anything - here does > > > rather interfere with that). > > > > > > In addition, callers may choose to do odd or unusual things which might > > > interfere with subsequent steps in the mmap() process, and it may do so and > > > then raise an error, requiring very careful unwinding of state about which > > > we can make no assumptions. > > > > > > Rather than providing such an open-ended interface, this series provides an > > > alternative, far more restrictive one - we expose a whitelist of fields > > > which can be adjusted by the driver, along with immutable state upon which > > > the driver can make such decisions: > > > > > > struct vm_area_desc { > > > /* Immutable state. */ > > > struct mm_struct *mm; > > > unsigned long start; > > > unsigned long end; > > > > > > /* Mutable fields. Populated with initial state. */ > > > pgoff_t pgoff; > > > struct file *file; > > > vm_flags_t vm_flags; > > > pgprot_t page_prot; > > > > > > /* Write-only fields. */ > > > const struct vm_operations_struct *vm_ops; > > > void *private_data; > > > }; > > > > > > The mmap logic then updates the state used to either merge with a VMA or > > > establish a new VMA based upon this logic. > > > > > > This is achieved via new file hook .mmap_prepare(), which is, importantly, > > > invoked very early on in the mmap() process. > > > > > > If an error arises, we can very simply abort the operation with very little > > > unwinding of state required. > > > > Looks sensible. So is there a plan to transform existing .mmap hooks to > > .mmap_prepare hooks? I agree that for most filesystems this should be just > > easy 1:1 replacement and AFAIU this would be prefered? > > Thanks! > > Yeah the intent is to convert _all_ callers away from .mmap() so we can > lock down what drivers are doing and be able to (relatively) safely make > assumptions about what's going on in mmap logic. 
> > As David points out, we may need to add new callbacks to account for other The plural is a little worrying, let's please aim to minimize the number of new methods we need for this. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH v2 0/3] eliminate mmap() retry merge, add .mmap_prepare hook 2025-05-05 13:37 ` Christian Brauner @ 2025-05-06 9:25 ` Lorenzo Stoakes 0 siblings, 0 replies; 9+ messages in thread From: Lorenzo Stoakes @ 2025-05-06 9:25 UTC (permalink / raw) To: Christian Brauner Cc: Jan Kara, Andrew Morton, David Hildenbrand, Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato, linux-fsdevel, linux-kernel, linux-mm, Alexander Viro, Matthew Wilcox On Mon, May 05, 2025 at 03:37:39PM +0200, Christian Brauner wrote: > On Fri, May 02, 2025 at 01:59:49PM +0100, Lorenzo Stoakes wrote: > > On Fri, May 02, 2025 at 02:20:38PM +0200, Jan Kara wrote: > > > On Thu 01-05-25 18:25:26, Lorenzo Stoakes wrote: > > > > During the mmap() of a file-backed mapping, we invoke the underlying driver > > > > file's mmap() callback in order to perform driver/file system > > > > initialisation of the underlying VMA. > > > > > > > > This has been a source of issues in the past, including a significant > > > > security concern relating to unwinding of error state discovered by Jann > > > > Horn, as fixed in commit 5de195060b2e ("mm: resolve faulty mmap_region() > > > > error path behaviour") which performed the recent, significant, rework of > > > > mmap() as a whole. > > > > > > > > However, we have had a fly in the ointment remain - drivers have a great > > > > deal of freedom in the .mmap() hook to manipulate VMA state (as well as > > > > page table state). > > > > > > > > This can be problematic, as we can no longer reason sensibly about VMA > > > > state once the call is complete (the ability to do - anything - here does > > > > rather interfere with that). > > > > > > > > In addition, callers may choose to do odd or unusual things which might > > > > interfere with subsequent steps in the mmap() process, and it may do so and > > > > then raise an error, requiring very careful unwinding of state about which > > > > we can make no assumptions. > > > > > > > > Rather than providing such an open-ended interface, this series provides an > > > > alternative, far more restrictive one - we expose a whitelist of fields > > > > which can be adjusted by the driver, along with immutable state upon which > > > > the driver can make such decisions: > > > > > > > > struct vm_area_desc { > > > > /* Immutable state. */ > > > > struct mm_struct *mm; > > > > unsigned long start; > > > > unsigned long end; > > > > > > > > /* Mutable fields. Populated with initial state. */ > > > > pgoff_t pgoff; > > > > struct file *file; > > > > vm_flags_t vm_flags; > > > > pgprot_t page_prot; > > > > > > > > /* Write-only fields. */ > > > > const struct vm_operations_struct *vm_ops; > > > > void *private_data; > > > > }; > > > > > > > > The mmap logic then updates the state used to either merge with a VMA or > > > > establish a new VMA based upon this logic. > > > > > > > > This is achieved via new file hook .mmap_prepare(), which is, importantly, > > > > invoked very early on in the mmap() process. > > > > > > > > If an error arises, we can very simply abort the operation with very little > > > > unwinding of state required. > > > > > > Looks sensible. So is there a plan to transform existing .mmap hooks to > > > .mmap_prepare hooks? I agree that for most filesystems this should be just > > > easy 1:1 replacement and AFAIU this would be prefered? > > > > Thanks! 
> > Yeah the intent is to convert _all_ callers away from .mmap() so we can > > lock down what drivers are doing and be able to (relatively) safely make > > assumptions about what's going on in mmap logic. > > > > As David points out, we may need to add new callbacks to account for other > > The plural is a little worrying, let's please aim to minimize the number of > new methods we need for this. Ack. My intent is a maximum of one more, but to try to avoid it at all costs. ^ permalink raw reply [flat|nested] 9+ messages in thread