All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/3] eliminate mmap() retry merge, add .mmap_proto hook
@ 2025-04-30 19:59 ` Lorenzo Stoakes
  0 siblings, 0 replies; 23+ messages in thread
From: Lorenzo Stoakes @ 2025-04-30 19:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan --cc=Michal Hocko

During the mmap() of a file-backed mapping, we invoke the underlying
driver's f_op->mmap() callback in order to perform driver/file system
initialisation of the underlying VMA.

This has been a source of issues in the past, including a significant
security concern relating to unwinding of error state discovered by Jann
Horn, as fixed in commit 5de195060b2e ("mm: resolve faulty mmap_region()
error path behaviour") which performed the recent, significant, rework of
mmap() as a whole.

However, we have had a fly in the ointment remain - drivers have a great
deal of freedom in the .mmap() hook to manipulate VMA state (as well as
page table state).

This can be problematic, as we can no longer reason sensibly about VMA
state once the call is complete (the ability to do - anything - here does
rather interfere with that).

In addition, callers may choose to do odd or unusual things which might
interfere with subsequent steps in the mmap() process, and it may do so and
then raise an error, requiring very careful unwinding of state about which
we can make no assumptions.

Rather than providing such an open-ended interface, this series provides an
alternative, far more restrictive one - we expose a whitelist of fields
which can be adjusted by the driver, along with immutable state upon which
the driver can make such decisions:

struct vma_proto {
	/* Immutable state. */
	struct mm_struct *mm;
	unsigned long start;
	unsigned long end;

	/* Mutable fields. Populated with initial state. */
	pgoff_t pgoff;
	struct file *file;
	vm_flags_t flags;
	pgprot_t page_prot;

	/* Write-only fields. */
	const struct vm_operations_struct *vm_ops;
	void *private_data;
};

The mmap logic then updates the state used to either merge with a VMA or
establish a new VMA based upon this logic.

This is achieved via new f_op hook .vma_proto(), which is, importantly,
invoked very early on in the mmap() process.

If an error arises, we can very simply abort the operation with very little
unwinding of state required.

The existing logic contains another, related, peccadillo - since the
.mmap() callback might do anything, it may also cause a previously
unmergeable VMA to become mergeable with adjacent VMAs.

Right now the logic will retry a merge like this only if the driver changes
VMA flags, and changes them in such a way that a merge might succeed (that
is, the flags are not 'special', that is do not contain any of the flags
specified in VM_SPECIAL).

This has been the source of a great deal of pain for a while - it is hard
to reason about an .mmap() callback that might do - anything - but it's
also hard to reason about setting up a VMA and writing to the maple tree,
only to do it again utilising a great deal of shared state.

Since .mmap_proto() sets fields before the first merge is even attempted,
the use of this callback obviates the need for this retry merge logic.

A driver can specify either .mmap_proto(), .mmap() or both. This provides
maximum flexibility.

In researching this change, I examined every .mmap() callback, and
discovered only a very few that set VMA state in such a way that a. the VMA
flags changed and b. this would be mergeable.

In the majority of cases, it turns out that drivers are mapping kernel
memory and thus ultimately set VM_PFNMAP, VM_MIXEDMAP, or other unmergeable
VM_SPECIAL flags.

Of those that remain I identified a number of cases which are only
applicable in DAX, setting the VM_HUGEPAGE flag:

* dax_mmap()
* erofs_file_mmap()
* ext4_file_mmap()
* xfs_file_mmap()

For this remerge to not occur and to impact users, each of these cases
would require a user to mmap() files using DAX, in parts, immediately
adjacent to one another.

This is a very unlikely usecase and so it does not appear to be worthwhile
to adjust this functionality accordingly.

We can, however, very quickly do so if needed by simply adding an
.mmap_proto() hook to these as required.

There are two further non-DAX cases I idenitfied:

* orangefs_file_mmap() - Clears VM_RAND_READ if set, replacing with
  VM_SEQ_READ.
* usb_stream_hwdep_mmap() - Sets VM_DONTDUMP.

Both of these cases again seem very unlikely to be mmap()'d immediately
adjacent to one another in a fashion that would result in a merge.

Finally, we are left with a viable case:

* secretmem_mmap() - Set VM_LOCKED, VM_DONTDUMP.

This is viable enough that the mm selftests trigger the logic as a matter
of course. Therefore, this series replace the .secretmem_mmap() hook with
.secret_mmap_proto().

Lorenzo Stoakes (3):
  mm: introduce new .mmap_proto() f_op callback
  mm: secretmem: convert to .mmap_proto() hook
  mm/vma: remove mmap() retry merge

 include/linux/fs.h               |  7 +++
 include/linux/mm_types.h         | 24 ++++++++
 mm/memory.c                      |  3 +-
 mm/mmap.c                        |  2 +-
 mm/secretmem.c                   | 14 ++---
 mm/vma.c                         | 99 +++++++++++++++++++++++++++-----
 tools/testing/vma/vma_internal.h | 38 ++++++++++++
 7 files changed, 164 insertions(+), 23 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RFC PATCH 1/3] mm: introduce new .mmap_proto() f_op callback
@ 2025-05-05 15:59 kernel test robot
  0 siblings, 0 replies; 23+ messages in thread
From: kernel test robot @ 2025-05-05 15:59 UTC (permalink / raw)
  To: oe-kbuild; +Cc: lkp, Dan Carpenter

BCC: lkp@intel.com
CC: oe-kbuild-all@lists.linux.dev
In-Reply-To: <f1bf4b452cc10281ef831c5e38ce16f09923f8c5.1746040540.git.lorenzo.stoakes@oracle.com>
References: <f1bf4b452cc10281ef831c5e38ce16f09923f8c5.1746040540.git.lorenzo.stoakes@oracle.com>
TO: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Hi Lorenzo,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on next-20250505]
[cannot apply to brauner-vfs/vfs.all linus/master v6.15-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Lorenzo-Stoakes/mm-introduce-new-mmap_proto-f_op-callback/20250501-035712
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/f1bf4b452cc10281ef831c5e38ce16f09923f8c5.1746040540.git.lorenzo.stoakes%40oracle.com
patch subject: [RFC PATCH 1/3] mm: introduce new .mmap_proto() f_op callback
:::::: branch date: 5 days ago
:::::: commit date: 5 days ago
config: x86_64-randconfig-161-20250505 (https://download.01.org/0day-ci/archive/20250505/202505052353.V1RLZ4v9-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <error27@gmail.com>
| Closes: https://lore.kernel.org/r/202505052353.V1RLZ4v9-lkp@intel.com/

smatch warnings:
mm/vma.c:2642 __mmap_region() error: we previously assumed 'vma' could be null (see line 2635)

vim +/vma +2642 mm/vma.c

d5b6d999086329 Lorenzo Stoakes 2025-04-30  2609  
f8d4a6cabb74f8 Lorenzo Stoakes 2025-01-02  2610  static unsigned long __mmap_region(struct file *file, unsigned long addr,
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2611  		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2612  		struct list_head *uf)
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2613  {
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2614  	struct mm_struct *mm = current->mm;
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2615  	struct vm_area_struct *vma = NULL;
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2616  	int error;
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2617  	VMA_ITERATOR(vmi, mm, addr);
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2618  	MMAP_STATE(map, mm, &vmi, addr, len, pgoff, vm_flags, file);
d5b6d999086329 Lorenzo Stoakes 2025-04-30  2619  	bool have_proto = have_mmap_proto_hook(&map);
52956b0d7fb92e Lorenzo Stoakes 2024-10-25  2620  
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2621  	error = __mmap_prepare(&map, uf);
d5b6d999086329 Lorenzo Stoakes 2025-04-30  2622  	if (!error && have_proto)
d5b6d999086329 Lorenzo Stoakes 2025-04-30  2623  		error = call_proto(&map);
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2624  	if (error)
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2625  		goto abort_munmap;
52956b0d7fb92e Lorenzo Stoakes 2024-10-25  2626  
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2627  	/* Attempt to merge with adjacent VMAs... */
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2628  	if (map.prev || map.next) {
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2629  		VMG_MMAP_STATE(vmg, &map, /* vma = */ NULL);
52956b0d7fb92e Lorenzo Stoakes 2024-10-25  2630  
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2631  		vma = vma_merge_new_range(&vmg);
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2632  	}
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2633  
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2634  	/* ...but if we can't, allocate a new VMA. */
0d11630cc50a62 Lorenzo Stoakes 2024-10-25 @2635  	if (!vma) {
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2636  		error = __mmap_new_vma(&map, &vma);
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2637  		if (error)
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2638  			goto unacct_error;
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2639  	}
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2640  
d5b6d999086329 Lorenzo Stoakes 2025-04-30  2641  	if (have_proto)
d5b6d999086329 Lorenzo Stoakes 2025-04-30 @2642  		set_vma_user_defined_fields(vma, &map);
d5b6d999086329 Lorenzo Stoakes 2025-04-30  2643  
5ac87a885aecb3 Lorenzo Stoakes 2024-10-25  2644  	/* If flags changed, we might be able to merge, so try again. */
5ac87a885aecb3 Lorenzo Stoakes 2024-10-25  2645  	if (map.retry_merge) {
42c4e4b20d9c46 Lorenzo Stoakes 2024-12-06  2646  		struct vm_area_struct *merged;
5ac87a885aecb3 Lorenzo Stoakes 2024-10-25  2647  		VMG_MMAP_STATE(vmg, &map, vma);
5ac87a885aecb3 Lorenzo Stoakes 2024-10-25  2648  
5ac87a885aecb3 Lorenzo Stoakes 2024-10-25  2649  		vma_iter_config(map.vmi, map.addr, map.end);
42c4e4b20d9c46 Lorenzo Stoakes 2024-12-06  2650  		merged = vma_merge_existing_range(&vmg);
42c4e4b20d9c46 Lorenzo Stoakes 2024-12-06  2651  		if (merged)
42c4e4b20d9c46 Lorenzo Stoakes 2024-12-06  2652  			vma = merged;
5ac87a885aecb3 Lorenzo Stoakes 2024-10-25  2653  	}
5ac87a885aecb3 Lorenzo Stoakes 2024-10-25  2654  
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2655  	__mmap_complete(&map, vma);
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2656  
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2657  	return addr;
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2658  
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2659  	/* Accounting was done by __mmap_prepare(). */
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2660  unacct_error:
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2661  	if (map.charged)
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2662  		vm_unacct_memory(map.charged);
52956b0d7fb92e Lorenzo Stoakes 2024-10-25  2663  abort_munmap:
0d11630cc50a62 Lorenzo Stoakes 2024-10-25  2664  	vms_abort_munmap_vmas(&map.vms, &map.mas_detach);
52956b0d7fb92e Lorenzo Stoakes 2024-10-25  2665  	return error;
52956b0d7fb92e Lorenzo Stoakes 2024-10-25  2666  }
7d344babac9984 Lorenzo Stoakes 2024-12-03  2667  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2025-05-06 10:02 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-30 19:54 [RFC PATCH 0/3] eliminate mmap() retry merge, add .mmap_proto hook Lorenzo Stoakes
2025-04-30 19:59 ` Lorenzo Stoakes
2025-04-30 19:54 ` [RFC PATCH 1/3] mm: introduce new .mmap_proto() f_op callback Lorenzo Stoakes
2025-04-30 19:59   ` Lorenzo Stoakes
2025-04-30 21:44   ` Jann Horn
2025-05-01 10:37     ` Lorenzo Stoakes
2025-04-30 21:58   ` David Hildenbrand
2025-05-01 10:23     ` Lorenzo Stoakes
2025-05-01 12:17       ` Mike Rapoport
2025-05-01 13:00         ` Lorenzo Stoakes
2025-05-01 13:51     ` Liam R. Howlett
2025-05-01 13:57       ` Lorenzo Stoakes
2025-05-05 13:29     ` Christian Brauner
2025-05-06 10:01       ` Lorenzo Stoakes
2025-04-30 19:54 ` [RFC PATCH 2/3] mm: secretmem: convert to .mmap_proto() hook Lorenzo Stoakes
2025-04-30 19:59   ` Lorenzo Stoakes
2025-04-30 19:54 ` [RFC PATCH 3/3] mm/vma: remove mmap() retry merge Lorenzo Stoakes
2025-04-30 19:59   ` Lorenzo Stoakes
2025-04-30 19:58 ` [RFC PATCH 0/3] eliminate mmap() retry merge, add .mmap_proto hook Lorenzo Stoakes
2025-04-30 21:29 ` Jann Horn
2025-05-01 10:27   ` Lorenzo Stoakes
2025-05-01 14:33     ` Lorenzo Stoakes
  -- strict thread matches above, loose matches on Subject: below --
2025-05-05 15:59 [RFC PATCH 1/3] mm: introduce new .mmap_proto() f_op callback kernel test robot

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.