* [PATCH v2 00/20] Add support for shared PTEs across processes
@ 2025-04-04  2:18 Anthony Yznaga
  2025-04-04  2:18 ` [PATCH v2 01/20] mm: Add msharefs filesystem Anthony Yznaga
                   ` (19 more replies)
  0 siblings, 20 replies; 33+ messages in thread
From: Anthony Yznaga @ 2025-04-04  2:18 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

Memory pages shared between processes require separate page table
entries (PTEs) in each process. Each of these PTEs consumes some
memory, and as long as the number of mappings being maintained is
small, the space consumed by page tables is not objectionable. When
very few memory pages are shared between processes, the number of
PTEs to maintain is mostly bounded by the number of pages of memory
on the system. As the number of shared pages and the number of times
they are shared goes up, the amount of memory consumed by page
tables becomes significant. This issue does not apply to threads:
any number of threads can share the same pages inside a process
while sharing the same PTEs. Extending this model to pages shared
across processes eliminates the issue for cross-process sharing as
well.

Some field deployments commonly see memory pages shared across
thousands of processes. On x86_64, each page requires a PTE that is
8 bytes long, which is very small compared to the 4K page size.
When 2000 processes map the same page in their address space, each
one of them requires 8 bytes for its PTE, and together that adds up
to 16K of memory just to hold the PTEs for one 4K page. On a
database server with a 300GB SGA, a system crash was seen from an
out-of-memory condition when 1500+ clients tried to share this SGA
even though the system had 512GB of memory. On this server, the
worst case of all 1500 processes mapping every page of the SGA
would have required 878GB+ just for the PTEs (roughly 300GB / 4K
pages x 8 bytes per PTE x 1500 processes). If these PTEs could be
shared, a substantial amount of memory would be saved.

This patch series implements a mechanism that allows userspace
processes to opt into sharing PTEs. It adds a new in-memory
filesystem - msharefs. A file created on msharefs represents a
shared region where all processes mapping that region will map
objects within it with shared PTEs. When the file is created,
a new host mm struct is created to hold the shared page tables
and vmas for objects later mapped into the shared region. This
host mm struct is associated with the file and not with a task.
When a process mmaps the shared region, the vm flag VM_MSHARE
is added to the vma. On page fault the vma is checked for the
presence of the VM_MSHARE flag. If the flag is found, the host mm
is searched for a vma that covers the fault address, and fault
handling continues using that host vma, which establishes PTEs in
the host mm. Fault handling in a shared region also links the
shared page table into the process page table if the shared page
table already exists.
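
A rough sketch of that lookup, for illustration only (the names are
taken from elsewhere in this series; the actual code lives in the
arch-specific fault path added by later patches, and details such as
the address translation may differ):

	if (vma_is_mshare(vma)) {
		struct mshare_data *m_data = vma->vm_private_data;
		struct mm_struct *host_mm = m_data->mm;
		unsigned long host_addr;

		/* translate the fault address into the mshare range */
		host_addr = m_data->start + (address - vma->vm_start);

		mmap_read_lock_nested(host_mm, SINGLE_DEPTH_NESTING);
		vma = find_vma(host_mm, host_addr);
		/* fault handling then proceeds against the host vma/mm */
	}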

Ioctls are used to map and unmap objects in the shared region and
to (eventually) perform other operations on the shared objects such
as changing protections.
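
The argument to MSHAREFS_CREATE_MAPPING is a struct mshare_create.
The layout below is inferred from the examples in this cover letter
(field order and widths are approximate); the authoritative
definition is in include/uapi/linux/msharefs.h added by this series:

	struct mshare_create {
		__u64 region_offset;	/* byte offset into the mshare region */
		__u64 size;		/* size of the mapping in bytes */
		__u64 offset;		/* offset into the backing file */
		__u32 prot;		/* PROT_* protection flags */
		__u32 flags;		/* MAP_* flags */
		__s32 fd;		/* backing fd, or -1 with MAP_ANONYMOUS */
	};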

API
===

The steps to use this feature are:

1. Mount msharefs on /sys/fs/mshare -
        mount -t msharefs msharefs /sys/fs/mshare

2. mshare regions have alignment and size requirements. The start
   address of a region must be aligned to a fixed boundary, and the
   region size must be a multiple of that same value. The value can
   be obtained by reading the file /sys/fs/mshare/mshare_info, which
   returns a number in text format.
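
   For example, a process could read this value and round a desired
   region size up to it (a sketch; error handling is omitted and
   desired_size is just a placeholder):

        char buf[64];
        ssize_t n;
        unsigned long align, size;

        fd = open("/sys/fs/mshare/mshare_info", O_RDONLY);
        n = read(fd, buf, sizeof(buf) - 1);
        buf[n] = '\0';
        align = strtoul(buf, NULL, 10);
        close(fd);

        size = (desired_size + align - 1) & ~(align - 1);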

3. For the process creating an mshare region:
        a. Create a file on /sys/fs/mshare, for example -
                fd = open("/sys/fs/mshare/shareme",
                                O_RDWR|O_CREAT|O_EXCL, 0600);

        b. Establish the size of the region
                ftruncate(fd, BUFFER_SIZE);

        c. Map some memory in the region
                struct mshare_create mcreate;

                mcreate.region_offset = 0;
                mcreate.size = BUFFER_SIZE;
                mcreate.offset = 0;
                mcreate.prot = PROT_READ | PROT_WRITE;
                mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
                mcreate.fd = -1;

                ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate)

        d. Map the mshare region into the process
                mmap((void *)TB(2), BUFFER_SIZE, PROT_READ | PROT_WRITE,
                        MAP_FIXED | MAP_SHARED, fd, 0);

        e. Read from and write to the mshare region normally.

4. For processes attaching an mshare region:
        a. Open the file on msharefs, for example -
                fd = open("/sys/fs/mshare/shareme", O_RDWR);

        b. Get information about mshare'd region from the file:
                struct stat sb;

                fstat(fd, &sb);
                mshare_size = sb.st_size;

        c. Map the mshare'd region into the process
                mmap((void *)TB(2), mshare_size, PROT_READ | PROT_WRITE,
                        MAP_FIXED | MAP_SHARED, fd, 0);

5. To delete the mshare region -
                unlink("/sys/fs/mshare/shareme");



Example Code
============

A snippet of the code that a donor process would run looks like this:

-----------------
        struct mshare_create mcreate;

        /* read the required alignment/size multiple */
        fd = open("/sys/fs/mshare/mshare_info", O_RDONLY);
        read(fd, req, 128);
        alignsize = atoi(req);
        close(fd);

        /* create the mshare region and establish its size */
        fd = open("/sys/fs/mshare/shareme", O_RDWR|O_CREAT|O_EXCL, 0600);
        start = alignsize * 4;
        size = alignsize * 2;

        ftruncate(fd, size);

        /* map anonymous shared memory at the start of the region */
        mcreate.region_offset = 0;
        mcreate.size = size;
        mcreate.offset = 0;
        mcreate.prot = PROT_READ | PROT_WRITE;
        mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
        mcreate.fd = -1;
        ret = ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate);
        if (ret < 0)
                perror("ERROR: MSHAREFS_CREATE_MAPPING");

        /* map the mshare region into this process */
        addr = mmap((void *)start, size, PROT_READ | PROT_WRITE,
                        MAP_FIXED | MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
                perror("ERROR: mmap failed");

        strncpy(addr, "Some random shared text",
                        sizeof("Some random shared text"));
-----------------

A snippet of the code that a consumer process would execute looks like this:

-----------------
        /* O_RDWR is needed to map the region with PROT_WRITE */
        fd = open("/sys/fs/mshare/shareme", O_RDWR);

        /* the region size is reported as the file size */
        fstat(fd, &sb);
        size = sb.st_size;

        if (!size)
                perror("ERROR: mshare region not init'd");

        addr = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
                perror("ERROR: mmap failed");

        printf("Guest mmap at %p:\n", addr);
        printf("%s\n", addr);
        printf("\nDone\n");

-----------------
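
Both snippets assume roughly the following declarations (a sketch
added for completeness; struct mshare_create and the MSHAREFS_*
ioctl numbers come from the uapi header added by this series):

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <linux/msharefs.h>     /* struct mshare_create, MSHAREFS_* */

        int fd, ret;
        char req[128] = { 0 };          /* zero-filled so atoi() sees a terminated string */
        unsigned long alignsize, start, size;
        char *addr;
        struct stat sb;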

v2:
  - Based on mm-unstable as of 2025-04-03 (8ff02705ba8f)
  - Set mshare size via fallocate or ftruncate instead of MSHAREFS_SET_SIZE.
    Removed MSHAREFS_SET_SIZE/MSHAREFS_GET_SIZE ioctls. Use stat to get size.
    (David H)
  - Remove spinlock from mshare_data. Initializing the size is protected by
    the inode lock.
  - Support mapping a single mshare region at different virtual addresses.
  - Support system selection of the start address when mmap'ing an mshare
    region.
  - Changed MSHAREFS_CREATE_MAPPING and MSHAREFS_UNMAP to use a byte offset
    to specify the start of a mapping.
  - Updated documentation.

v1:
  (https://lore.kernel.org/linux-mm/20250124235454.84587-1-anthony.yznaga@oracle.com/)
  - Based on mm-unstable mm-hotfixes-stable-2025-01-16-21-11
  - Use mshare size instead of start address to check if mshare region
    has been initialized.
  - Share page tables at PUD level instead of PGD.
  - Rename vma_is_shared() to vma_is_mshare() (James H / David H)
  - Introduce and use mmap_read_lock_nested() (Kirill)
  - Use an mmu notifier to flush all TLBs when updating shared pagetable
    mappings. (Dave Hansen)
  - Move logic for finding the shared vma to use to handle a fault from
    handle_mm_fault() to do_user_addr_fault() because the arch-specific
    fault handling checks vma flags for access permissions.
  - Add CONFIG_MSHARE / ARCH_SUPPORTS_MSHARE
  - Add msharefs_get_unmapped_area()
  - Implemented vm_ops->unmap_page_range (Kirill)
  - Update free_pgtables/free_pgd_range to free process pagetable levels
    but not shared pagetable levels.
  - A first take at cgroup support

RFC v2 -> v3:
  - Now based on 6.11-rc5
  - Addressed many comments from v2.
  - Simplified filesystem code. Removed refcounting of the
    shared mm_struct allocated for an mshare file. The mm_struct
    and the pagetables and mappings it contains are freed when
    the inode is evicted.
  - Switched to an ioctl-based interface. Ioctls implemented
    are used to set and get the start address and size of an
    mshare region and to map objects into an mshare region
    (only anon shared memory is supported in this series).
  - Updated example code

[1] v2: https://lore.kernel.org/linux-mm/cover.1656531090.git.khalid.aziz@oracle.com/

RFC v1 -> v2:
  - Eliminated mshare and mshare_unlink system calls and
    replaced API with standard mmap and unlink (Based upon
    v1 patch discussions and LSF/MM discussions)
  - All fd based API (based upon feedback and suggestions from
    Andy Lutomirski, Eric Biederman, Kirill and others)
  - Added a file /sys/fs/mshare/mshare_info to provide
    alignment and size requirement info (based upon feedback
    from Dave Hansen, Mark Hemment and discussions at LSF/MM)
  - Addressed TODOs in v1
  - Added support for directories in msharefs
  - Added locks around any time vma is touched (Dave Hansen)
  - Eliminated the need to point vm_mm in original vmas to the
    newly synthesized mshare mm
  - Ensured mmap_read_unlock is called for correct mm in
    handle_mm_fault (Dave Hansen)

Anthony Yznaga (13):
  mm/mshare: allocate an mm_struct for msharefs files
  mm/mshare: add ways to set the size of an mshare region
  mm/mshare: flush all TLBs when updating PTEs in an mshare range
  sched/numa: do not scan msharefs vmas
  mm: add mmap_read_lock_killable_nested()
  mm: add and use unmap_page_range vm_ops hook
  x86/mm: enable page table sharing
  mm: create __do_mmap() to take an mm_struct * arg
  mm: pass the mm in vma_munmap_struct
  mm/mshare: Add an ioctl for unmapping objects in an mshare region
  mm/mshare: provide a way to identify an mm as an mshare host mm
  mm/mshare: get memcg from current->mm instead of mshare mm
  mm/mshare: associate a mem cgroup with an mshare file

Khalid Aziz (7):
  mm: Add msharefs filesystem
  mm/mshare: pre-populate msharefs with information file
  mm/mshare: make msharefs writable and support directories
  mm/mshare: Add a vma flag to indicate an mshare region
  mm/mshare: Add mmap support
  mm/mshare: prepare for page table sharing support
  mm/mshare: Add an ioctl for mapping objects in an mshare region

 Documentation/filesystems/index.rst           |   1 +
 Documentation/filesystems/msharefs.rst        |  96 +++
 .../userspace-api/ioctl/ioctl-number.rst      |   1 +
 arch/Kconfig                                  |   3 +
 arch/x86/Kconfig                              |   1 +
 arch/x86/mm/fault.c                           |  48 +-
 include/linux/memcontrol.h                    |   3 +
 include/linux/mm.h                            |  56 ++
 include/linux/mm_types.h                      |   2 +
 include/linux/mmap_lock.h                     |   7 +
 include/trace/events/mmflags.h                |   7 +
 include/uapi/linux/magic.h                    |   1 +
 include/uapi/linux/msharefs.h                 |  38 +
 ipc/shm.c                                     |  17 +
 kernel/sched/fair.c                           |   3 +-
 mm/Kconfig                                    |   9 +
 mm/Makefile                                   |   4 +
 mm/hugetlb.c                                  |  25 +
 mm/memcontrol.c                               |   3 +-
 mm/memory.c                                   |  81 +-
 mm/mmap.c                                     |  10 +-
 mm/mshare.c                                   | 785 ++++++++++++++++++
 mm/vma.c                                      |  22 +-
 mm/vma.h                                      |   3 +-
 24 files changed, 1176 insertions(+), 50 deletions(-)
 create mode 100644 Documentation/filesystems/msharefs.rst
 create mode 100644 include/uapi/linux/msharefs.h
 create mode 100644 mm/mshare.c

-- 
2.43.5



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v2 01/20] mm: Add msharefs filesystem
  2025-04-04  2:18 [PATCH v2 00/20] Add support for shared PTEs across processes Anthony Yznaga
@ 2025-04-04  2:18 ` Anthony Yznaga
  2025-04-04  2:18 ` [PATCH v2 02/20] mm/mshare: pre-populate msharefs with information file Anthony Yznaga
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Anthony Yznaga @ 2025-04-04  2:18 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

From: Khalid Aziz <khalid@kernel.org>

Add a pseudo filesystem that contains files and page table sharing
information that enables processes to share page table entries.
This patch adds the basic filesystem that can be mounted, a
CONFIG_MSHARE option to enable the feature, and documentation.

Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 Documentation/filesystems/index.rst    |  1 +
 Documentation/filesystems/msharefs.rst | 96 +++++++++++++++++++++++++
 include/uapi/linux/magic.h             |  1 +
 mm/Kconfig                             |  9 +++
 mm/Makefile                            |  4 ++
 mm/mshare.c                            | 97 ++++++++++++++++++++++++++
 6 files changed, 208 insertions(+)
 create mode 100644 Documentation/filesystems/msharefs.rst
 create mode 100644 mm/mshare.c

diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index a9cf8e950b15..573d7a05dbca 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -101,6 +101,7 @@ Documentation for filesystem implementations.
    fuse-io-uring
    inotify
    isofs
+   msharefs
    nilfs2
    nfs/index
    ntfs3
diff --git a/Documentation/filesystems/msharefs.rst b/Documentation/filesystems/msharefs.rst
new file mode 100644
index 000000000000..3e5b7d531821
--- /dev/null
+++ b/Documentation/filesystems/msharefs.rst
@@ -0,0 +1,96 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================================
+Msharefs - A filesystem to support shared page tables
+=====================================================
+
+What is msharefs?
+-----------------
+
+msharefs is a pseudo filesystem that allows multiple processes to
+share page table entries for shared pages. To enable support for
+msharefs the kernel must be compiled with CONFIG_MSHARE set.
+
+msharefs is typically mounted like this::
+
+	mount -t msharefs none /sys/fs/mshare
+
+A file created on msharefs creates a new shared region where all
+processes mapping that region will map it using shared page table
+entries. Once the size of the region has been established via
+ftruncate() or fallocate(), the region can be mapped into processes
+and ioctls used to map and unmap objects within it. Note that an
+msharefs file is a control file and accessing mapped objects within
+a shared region through read or write of the file is not permitted.
+
+How to use mshare
+-----------------
+
+Here are the basic steps for using mshare:
+
+  1. Mount msharefs on /sys/fs/mshare::
+
+	mount -t msharefs msharefs /sys/fs/mshare
+
+  2. mshare regions have alignment and size requirements. The start
+     address of a region must be aligned to a fixed boundary, and the
+     region size must be a multiple of that same value. The value can
+     be obtained by reading the file ``/sys/fs/mshare/mshare_info``,
+     which returns a number in text format. mshare regions must be
+     aligned to this boundary and sized as a multiple of this value.
+
+  3. For the process creating an mshare region:
+
+    a. Create a file on /sys/fs/mshare, for example::
+
+        fd = open("/sys/fs/mshare/shareme",
+                        O_RDWR|O_CREAT|O_EXCL, 0600);
+
+    b. Establish the size of the region::
+
+        fallocate(fd, 0, 0, BUF_SIZE);
+
+      or::
+
+        ftruncate(fd, BUF_SIZE);
+
+    c. Map some memory in the region::
+
+	struct mshare_create mcreate;
+
+	mcreate.region_offset = 0;
+	mcreate.size = BUF_SIZE;
+	mcreate.offset = 0;
+	mcreate.prot = PROT_READ | PROT_WRITE;
+	mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
+	mcreate.fd = -1;
+
+	ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate);
+
+    d. Map the mshare region into the process::
+
+	mmap(NULL, BUF_SIZE,
+		PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+
+    e. Read from and write to the mshare region normally.
+
+
+  4. For processes attaching an mshare region:
+
+    a. Open the msharefs file, for example::
+
+	fd = open("/sys/fs/mshare/shareme", O_RDWR);
+
+    b. Get the size of the mshare region from the file::
+
+        fstat(fd, &sb);
+        mshare_size = sb.st_size;
+
+    c. Map the mshare region into the process::
+
+	mmap(NULL, mshare_size,
+		PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+
+  5. To delete the mshare region::
+
+		unlink("/sys/fs/mshare/shareme");
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index bb575f3ab45e..e53dd6063cba 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -103,5 +103,6 @@
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
 #define PID_FS_MAGIC		0x50494446	/* "PIDF" */
+#define MSHARE_MAGIC		0x4d534852	/* "MSHR" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/mm/Kconfig b/mm/Kconfig
index d3fb3762887b..e6c90db83d01 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1342,6 +1342,15 @@ config PT_RECLAIM
 
 	  Note: now only empty user PTE page table pages will be reclaimed.
 
+config MSHARE
+	bool "Mshare"
+	depends on MMU
+	help
+	  Enable msharefs: A ram-based filesystem that allows multiple
+	  processes to share page table entries for shared pages. A file
+	  created on msharefs represents a shared region where all processes
+	  mapping that region will map objects within it with shared PTEs.
+	  Ioctls are used to configure and map objects into the shared region.
 
 source "mm/damon/Kconfig"
 
diff --git a/mm/Makefile b/mm/Makefile
index e7f6bbf8ae5f..b32aad62589b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -48,6 +48,10 @@ ifdef CONFIG_64BIT
 mmu-$(CONFIG_MMU)	+= mseal.o
 endif
 
+ifdef CONFIG_MSHARE
+mmu-$(CONFIG_MMU)	+= mshare.o
+endif
+
 obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   maccess.o page-writeback.o folio-compat.o \
 			   readahead.o swap.o truncate.o vmscan.o shrinker.o \
diff --git a/mm/mshare.c b/mm/mshare.c
new file mode 100644
index 000000000000..f703af49ec81
--- /dev/null
+++ b/mm/mshare.c
@@ -0,0 +1,97 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Enable cooperating processes to share page table between
+ * them to reduce the extra memory consumed by multiple copies
+ * of page tables.
+ *
+ * This code adds an in-memory filesystem - msharefs.
+ * msharefs is used to manage page table sharing
+ *
+ *
+ * Copyright (C) 2024 Oracle Corp. All rights reserved.
+ * Author:	Khalid Aziz <khalid@kernel.org>
+ *
+ */
+
+#include <linux/fs.h>
+#include <linux/fs_context.h>
+#include <uapi/linux/magic.h>
+
+static const struct file_operations msharefs_file_operations = {
+	.open			= simple_open,
+};
+
+static const struct super_operations mshare_s_ops = {
+	.statfs		= simple_statfs,
+};
+
+static int
+msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+	struct inode *inode;
+
+	sb->s_blocksize		= PAGE_SIZE;
+	sb->s_blocksize_bits	= PAGE_SHIFT;
+	sb->s_maxbytes		= MAX_LFS_FILESIZE;
+	sb->s_magic		= MSHARE_MAGIC;
+	sb->s_op		= &mshare_s_ops;
+	sb->s_time_gran		= 1;
+
+	inode = new_inode(sb);
+	if (!inode)
+		return -ENOMEM;
+
+	inode->i_ino = 1;
+	inode->i_mode = S_IFDIR | 0777;
+	simple_inode_init_ts(inode);
+	inode->i_op = &simple_dir_inode_operations;
+	inode->i_fop = &simple_dir_operations;
+	set_nlink(inode, 2);
+
+	sb->s_root = d_make_root(inode);
+	if (!sb->s_root)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int
+msharefs_get_tree(struct fs_context *fc)
+{
+	return get_tree_nodev(fc, msharefs_fill_super);
+}
+
+static const struct fs_context_operations msharefs_context_ops = {
+	.get_tree	= msharefs_get_tree,
+};
+
+static int
+mshare_init_fs_context(struct fs_context *fc)
+{
+	fc->ops = &msharefs_context_ops;
+	return 0;
+}
+
+static struct file_system_type mshare_fs = {
+	.name			= "msharefs",
+	.init_fs_context	= mshare_init_fs_context,
+	.kill_sb		= kill_litter_super,
+};
+
+static int __init
+mshare_init(void)
+{
+	int ret;
+
+	ret = sysfs_create_mount_point(fs_kobj, "mshare");
+	if (ret)
+		return ret;
+
+	ret = register_filesystem(&mshare_fs);
+	if (ret)
+		sysfs_remove_mount_point(fs_kobj, "mshare");
+
+	return ret;
+}
+
+core_initcall(mshare_init);
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 02/20] mm/mshare: pre-populate msharefs with information file
  2025-04-04  2:18 [PATCH v2 00/20] Add support for shared PTEs across processes Anthony Yznaga
  2025-04-04  2:18 ` [PATCH v2 01/20] mm: Add msharefs filesystem Anthony Yznaga
@ 2025-04-04  2:18 ` Anthony Yznaga
  2025-04-04  2:18 ` [PATCH v2 03/20] mm/mshare: make msharefs writable and support directories Anthony Yznaga
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Anthony Yznaga @ 2025-04-04  2:18 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

From: Khalid Aziz <khalid@kernel.org>

Users of mshare need to know the size and alignment requirement
for shared regions. Pre-populate msharefs with a file, mshare_info,
that provides this information. For now, pagetable sharing is
hardcoded to be at the PUD level.

Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/mshare.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 75 insertions(+), 2 deletions(-)

diff --git a/mm/mshare.c b/mm/mshare.c
index f703af49ec81..d666471bc94b 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -17,18 +17,74 @@
 #include <linux/fs_context.h>
 #include <uapi/linux/magic.h>
 
+const unsigned long mshare_align = P4D_SIZE;
+
 static const struct file_operations msharefs_file_operations = {
 	.open			= simple_open,
 };
 
+struct msharefs_info {
+	struct dentry *info_dentry;
+};
+
+static ssize_t
+mshare_info_read(struct file *file, char __user *buf, size_t nbytes,
+		loff_t *ppos)
+{
+	char s[80];
+
+	sprintf(s, "%ld\n", mshare_align);
+	return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
+}
+
+static const struct file_operations mshare_info_ops = {
+	.read	= mshare_info_read,
+	.llseek	= noop_llseek,
+};
+
 static const struct super_operations mshare_s_ops = {
 	.statfs		= simple_statfs,
 };
 
+static int
+msharefs_create_mshare_info(struct super_block *sb)
+{
+	struct msharefs_info *info = sb->s_fs_info;
+	struct dentry *root = sb->s_root;
+	struct dentry *dentry;
+	struct inode *inode;
+	int ret;
+
+	ret = -ENOMEM;
+	inode = new_inode(sb);
+	if (!inode)
+		goto out;
+
+	inode->i_ino = 2;
+	simple_inode_init_ts(inode);
+	inode_init_owner(&nop_mnt_idmap, inode, NULL, S_IFREG | 0444);
+	inode->i_fop = &mshare_info_ops;
+
+	dentry = d_alloc_name(root, "mshare_info");
+	if (!dentry)
+		goto out;
+
+	info->info_dentry = dentry;
+	d_add(dentry, inode);
+
+	return 0;
+out:
+	iput(inode);
+
+	return ret;
+}
+
 static int
 msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
 {
+	struct msharefs_info *info;
 	struct inode *inode;
+	int ret;
 
 	sb->s_blocksize		= PAGE_SIZE;
 	sb->s_blocksize_bits	= PAGE_SHIFT;
@@ -37,6 +93,12 @@ msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
 	sb->s_op		= &mshare_s_ops;
 	sb->s_time_gran		= 1;
 
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info)
+		return -ENOMEM;
+
+	sb->s_fs_info = info;
+
 	inode = new_inode(sb);
 	if (!inode)
 		return -ENOMEM;
@@ -52,7 +114,9 @@ msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
 	if (!sb->s_root)
 		return -ENOMEM;
 
-	return 0;
+	ret = msharefs_create_mshare_info(sb);
+
+	return ret;
 }
 
 static int
@@ -72,10 +136,19 @@ mshare_init_fs_context(struct fs_context *fc)
 	return 0;
 }
 
+static void
+msharefs_kill_super(struct super_block *sb)
+{
+	struct msharefs_info *info = sb->s_fs_info;
+
+	kfree(info);
+	kill_litter_super(sb);
+}
+
 static struct file_system_type mshare_fs = {
 	.name			= "msharefs",
 	.init_fs_context	= mshare_init_fs_context,
-	.kill_sb		= kill_litter_super,
+	.kill_sb		= msharefs_kill_super,
 };
 
 static int __init
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 03/20] mm/mshare: make msharefs writable and support directories
  2025-04-04  2:18 [PATCH v2 00/20] Add support for shared PTEs across processes Anthony Yznaga
  2025-04-04  2:18 ` [PATCH v2 01/20] mm: Add msharefs filesystem Anthony Yznaga
  2025-04-04  2:18 ` [PATCH v2 02/20] mm/mshare: pre-populate msharefs with information file Anthony Yznaga
@ 2025-04-04  2:18 ` Anthony Yznaga
  2025-04-04  2:18 ` [PATCH v2 04/20] mm/mshare: allocate an mm_struct for msharefs files Anthony Yznaga
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Anthony Yznaga @ 2025-04-04  2:18 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

From: Khalid Aziz <khalid@kernel.org>

Make the msharefs filesystem writable and allow creating directories
to support better access control over mshare regions defined in
msharefs.

Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/mshare.c | 116 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 115 insertions(+), 1 deletion(-)

diff --git a/mm/mshare.c b/mm/mshare.c
index d666471bc94b..5d9e25da0244 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -19,14 +19,128 @@
 
 const unsigned long mshare_align = P4D_SIZE;
 
+static const struct inode_operations msharefs_dir_inode_ops;
+static const struct inode_operations msharefs_file_inode_ops;
+
 static const struct file_operations msharefs_file_operations = {
 	.open			= simple_open,
 };
 
+static struct inode
+*msharefs_get_inode(struct mnt_idmap *idmap, struct super_block *sb,
+			const struct inode *dir, umode_t mode)
+{
+	struct inode *inode = new_inode(sb);
+
+	if (!inode)
+		return ERR_PTR(-ENOMEM);
+
+	inode->i_ino = get_next_ino();
+	inode_init_owner(&nop_mnt_idmap, inode, dir, mode);
+	simple_inode_init_ts(inode);
+
+	switch (mode & S_IFMT) {
+	case S_IFREG:
+		inode->i_op = &msharefs_file_inode_ops;
+		inode->i_fop = &msharefs_file_operations;
+		break;
+	case S_IFDIR:
+		inode->i_op = &msharefs_dir_inode_ops;
+		inode->i_fop = &simple_dir_operations;
+		inc_nlink(inode);
+		break;
+	default:
+		discard_new_inode(inode);
+		return ERR_PTR(-EINVAL);
+	}
+
+	return inode;
+}
+
+static int
+msharefs_mknod(struct mnt_idmap *idmap, struct inode *dir,
+		struct dentry *dentry, umode_t mode)
+{
+	struct inode *inode;
+
+	inode = msharefs_get_inode(idmap, dir->i_sb, dir, mode);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	d_instantiate(dentry, inode);
+	dget(dentry);
+	inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir));
+
+	return 0;
+}
+
+static int
+msharefs_create(struct mnt_idmap *idmap, struct inode *dir,
+		struct dentry *dentry, umode_t mode, bool excl)
+{
+	return msharefs_mknod(idmap, dir, dentry, mode | S_IFREG);
+}
+
+static struct dentry *
+msharefs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
+		struct dentry *dentry, umode_t mode)
+{
+	int ret = msharefs_mknod(idmap, dir, dentry, mode | S_IFDIR);
+
+	if (!ret)
+		inc_nlink(dir);
+	return ERR_PTR(ret);
+}
+
 struct msharefs_info {
 	struct dentry *info_dentry;
 };
 
+static inline bool
+is_msharefs_info_file(const struct dentry *dentry)
+{
+	struct msharefs_info *info = dentry->d_sb->s_fs_info;
+
+	return info->info_dentry == dentry;
+}
+
+static int
+msharefs_rename(struct mnt_idmap *idmap,
+		struct inode *old_dir, struct dentry *old_dentry,
+		struct inode *new_dir, struct dentry *new_dentry,
+		unsigned int flags)
+{
+	if (is_msharefs_info_file(old_dentry) ||
+	    is_msharefs_info_file(new_dentry))
+		return -EPERM;
+
+	return simple_rename(idmap, old_dir, old_dentry, new_dir,
+			     new_dentry, flags);
+}
+
+static int
+msharefs_unlink(struct inode *dir, struct dentry *dentry)
+{
+	if (is_msharefs_info_file(dentry))
+		return -EPERM;
+
+	return simple_unlink(dir, dentry);
+}
+
+static const struct inode_operations msharefs_file_inode_ops = {
+	.setattr	= simple_setattr,
+};
+
+static const struct inode_operations msharefs_dir_inode_ops = {
+	.create		= msharefs_create,
+	.lookup		= simple_lookup,
+	.link		= simple_link,
+	.unlink		= msharefs_unlink,
+	.mkdir		= msharefs_mkdir,
+	.rmdir		= simple_rmdir,
+	.rename		= msharefs_rename,
+};
+
 static ssize_t
 mshare_info_read(struct file *file, char __user *buf, size_t nbytes,
 		loff_t *ppos)
@@ -106,7 +220,7 @@ msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
 	inode->i_ino = 1;
 	inode->i_mode = S_IFDIR | 0777;
 	simple_inode_init_ts(inode);
-	inode->i_op = &simple_dir_inode_operations;
+	inode->i_op = &msharefs_dir_inode_ops;
 	inode->i_fop = &simple_dir_operations;
 	set_nlink(inode, 2);
 
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 04/20] mm/mshare: allocate an mm_struct for msharefs files
  2025-04-04  2:18 [PATCH v2 00/20] Add support for shared PTEs across processes Anthony Yznaga
                   ` (2 preceding siblings ...)
  2025-04-04  2:18 ` [PATCH v2 03/20] mm/mshare: make msharefs writable and support directories Anthony Yznaga
@ 2025-04-04  2:18 ` Anthony Yznaga
  2025-04-04  2:18 ` [PATCH v2 05/20] mm/mshare: add ways to set the size of an mshare region Anthony Yznaga
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Anthony Yznaga @ 2025-04-04  2:18 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

When a new file is created under msharefs, allocate a new mm_struct
to be associated with it for the lifetime of the file.
The mm_struct will hold the VMAs and pagetables for the mshare region
the file represents.

Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/mshare.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)

diff --git a/mm/mshare.c b/mm/mshare.c
index 5d9e25da0244..551d12cb9fdb 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -19,6 +19,10 @@
 
 const unsigned long mshare_align = P4D_SIZE;
 
+struct mshare_data {
+	struct mm_struct *mm;
+};
+
 static const struct inode_operations msharefs_dir_inode_ops;
 static const struct inode_operations msharefs_file_inode_ops;
 
@@ -26,11 +30,51 @@ static const struct file_operations msharefs_file_operations = {
 	.open			= simple_open,
 };
 
+static int
+msharefs_fill_mm(struct inode *inode)
+{
+	struct mm_struct *mm;
+	struct mshare_data *m_data = NULL;
+	int ret = 0;
+
+	mm = mm_alloc();
+	if (!mm) {
+		ret = -ENOMEM;
+		goto err_free;
+	}
+
+	mm->mmap_base = mm->task_size = 0;
+
+	m_data = kzalloc(sizeof(*m_data), GFP_KERNEL);
+	if (!m_data) {
+		ret = -ENOMEM;
+		goto err_free;
+	}
+	m_data->mm = mm;
+	inode->i_private = m_data;
+
+	return 0;
+
+err_free:
+	if (mm)
+		mmput(mm);
+	kfree(m_data);
+	return ret;
+}
+
+static void
+msharefs_delmm(struct mshare_data *m_data)
+{
+	mmput(m_data->mm);
+	kfree(m_data);
+}
+
 static struct inode
 *msharefs_get_inode(struct mnt_idmap *idmap, struct super_block *sb,
 			const struct inode *dir, umode_t mode)
 {
 	struct inode *inode = new_inode(sb);
+	int ret;
 
 	if (!inode)
 		return ERR_PTR(-ENOMEM);
@@ -43,6 +87,11 @@ static struct inode
 	case S_IFREG:
 		inode->i_op = &msharefs_file_inode_ops;
 		inode->i_fop = &msharefs_file_operations;
+		ret = msharefs_fill_mm(inode);
+		if (ret) {
+			discard_new_inode(inode);
+			inode = ERR_PTR(ret);
+		}
 		break;
 	case S_IFDIR:
 		inode->i_op = &msharefs_dir_inode_ops;
@@ -141,6 +190,16 @@ static const struct inode_operations msharefs_dir_inode_ops = {
 	.rename		= msharefs_rename,
 };
 
+static void
+mshare_evict_inode(struct inode *inode)
+{
+	struct mshare_data *m_data = inode->i_private;
+
+	if (m_data)
+		msharefs_delmm(m_data);
+	clear_inode(inode);
+}
+
 static ssize_t
 mshare_info_read(struct file *file, char __user *buf, size_t nbytes,
 		loff_t *ppos)
@@ -158,6 +217,7 @@ static const struct file_operations mshare_info_ops = {
 
 static const struct super_operations mshare_s_ops = {
 	.statfs		= simple_statfs,
+	.evict_inode	= mshare_evict_inode,
 };
 
 static int
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 05/20] mm/mshare: add ways to set the size of an mshare region
  2025-04-04  2:18 [PATCH v2 00/20] Add support for shared PTEs across processes Anthony Yznaga
                   ` (3 preceding siblings ...)
  2025-04-04  2:18 ` [PATCH v2 04/20] mm/mshare: allocate an mm_struct for msharefs files Anthony Yznaga
@ 2025-04-04  2:18 ` Anthony Yznaga
  2025-04-04  2:18 ` [PATCH v2 06/20] mm/mshare: Add a vma flag to indicate " Anthony Yznaga
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Anthony Yznaga @ 2025-04-04  2:18 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

Add file and inode operations to allow the size of an mshare region
to be set via fallocate() or ftruncate().

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/mshare.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 81 insertions(+), 1 deletion(-)

diff --git a/mm/mshare.c b/mm/mshare.c
index 551d12cb9fdb..5eed18bc3b51 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -16,18 +16,72 @@
 #include <linux/fs.h>
 #include <linux/fs_context.h>
 #include <uapi/linux/magic.h>
+#include <linux/falloc.h>
 
 const unsigned long mshare_align = P4D_SIZE;
 
+#define MSHARE_INITIALIZED	0x1
+
 struct mshare_data {
 	struct mm_struct *mm;
+	unsigned long size;
+	unsigned long flags;
 };
 
+static int msharefs_set_size(struct mshare_data *m_data, unsigned long size)
+{
+	int error = -EINVAL;
+
+	if (test_bit(MSHARE_INITIALIZED, &m_data->flags))
+		goto out;
+
+	if (m_data->size || (size & (mshare_align - 1)))
+		goto out;
+
+	m_data->mm->task_size = m_data->size = size;
+
+	set_bit(MSHARE_INITIALIZED, &m_data->flags);
+	error = 0;
+out:
+	return error;
+}
+
+static long msharefs_fallocate(struct file *file, int mode, loff_t offset,
+				loff_t len)
+{
+	struct inode *inode = file_inode(file);
+	struct mshare_data *m_data = inode->i_private;
+	int error;
+
+	if (mode != FALLOC_FL_ALLOCATE_RANGE)
+		return -EOPNOTSUPP;
+
+	if (offset)
+		return -EINVAL;
+
+	inode_lock(inode);
+
+	error = inode_newsize_ok(inode, len);
+	if (error)
+		goto out;
+
+	error = msharefs_set_size(m_data, len);
+	if (error)
+		goto out;
+
+	i_size_write(inode, len);
+out:
+	inode_unlock(inode);
+
+	return error;
+}
+
 static const struct inode_operations msharefs_dir_inode_ops;
 static const struct inode_operations msharefs_file_inode_ops;
 
 static const struct file_operations msharefs_file_operations = {
 	.open			= simple_open,
+	.fallocate		= msharefs_fallocate,
 };
 
 static int
@@ -123,6 +177,32 @@ msharefs_mknod(struct mnt_idmap *idmap, struct inode *dir,
 	return 0;
 }
 
+static int msharefs_setattr(struct mnt_idmap *idmap,
+			    struct dentry *dentry, struct iattr *attr)
+{
+	struct inode *inode = d_inode(dentry);
+	struct mshare_data *m_data = inode->i_private;
+	unsigned int ia_valid = attr->ia_valid;
+	int error;
+
+	error = setattr_prepare(idmap, dentry, attr);
+	if (error)
+		return error;
+
+	if (ia_valid & ATTR_SIZE) {
+		loff_t newsize = attr->ia_size;
+
+		error = msharefs_set_size(m_data, newsize);
+		if (error)
+			return error;
+
+		i_size_write(inode, newsize);
+	}
+
+	setattr_copy(idmap, inode, attr);
+	return 0;
+}
+
 static int
 msharefs_create(struct mnt_idmap *idmap, struct inode *dir,
 		struct dentry *dentry, umode_t mode, bool excl)
@@ -177,7 +257,7 @@ msharefs_unlink(struct inode *dir, struct dentry *dentry)
 }
 
 static const struct inode_operations msharefs_file_inode_ops = {
-	.setattr	= simple_setattr,
+	.setattr	= msharefs_setattr,
 };
 
 static const struct inode_operations msharefs_dir_inode_ops = {
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 06/20] mm/mshare: Add a vma flag to indicate an mshare region
  2025-04-04  2:18 [PATCH v2 00/20] Add support for shared PTEs across processes Anthony Yznaga
                   ` (4 preceding siblings ...)
  2025-04-04  2:18 ` [PATCH v2 05/20] mm/mshare: add ways to set the size of an mshare region Anthony Yznaga
@ 2025-04-04  2:18 ` Anthony Yznaga
  2025-04-04  2:18 ` [PATCH v2 07/20] mm/mshare: Add mmap support Anthony Yznaga
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Anthony Yznaga @ 2025-04-04  2:18 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

From: Khalid Aziz <khalid@kernel.org>

An mshare region contains zero or more actual vmas that map objects
in the mshare range with shared page tables.

Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/mm.h             | 19 +++++++++++++++++++
 include/trace/events/mmflags.h |  7 +++++++
 2 files changed, 26 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 778f5de6a12e..f2f9d15213ab 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -414,6 +414,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_DROPPABLE		VM_NONE
 #endif
 
+#ifdef CONFIG_MSHARE
+#define VM_MSHARE_BIT		41
+#define VM_MSHARE		BIT(VM_MSHARE_BIT)
+#else
+#define VM_MSHARE		VM_NONE
+#endif
+
 #ifdef CONFIG_64BIT
 /* VM is sealed, in vm_flags */
 #define VM_SEALED	_BITUL(63)
@@ -1161,6 +1168,18 @@ static inline bool vma_is_anon_shmem(struct vm_area_struct *vma) { return false;
 
 int vma_is_stack_for_current(struct vm_area_struct *vma);
 
+#ifdef CONFIG_MSHARE
+static inline bool vma_is_mshare(const struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_MSHARE;
+}
+#else
+static inline bool vma_is_mshare(const struct vm_area_struct *vma)
+{
+	return false;
+}
+#endif
+
 /* flush_tlb_range() takes a vma, not a mm, and can care about flags */
 #define TLB_FLUSH_VMA(mm,flags) { .vm_mm = (mm), .vm_flags = (flags) }
 
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 15aae955a10b..02ebd354ed55 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -202,6 +202,12 @@ IF_HAVE_PG_ARCH_3(arch_3)
 # define IF_HAVE_VM_DROPPABLE(flag, name)
 #endif
 
+#ifdef CONFIG_MSHARE
+# define IF_HAVE_VM_MSHARE(flag, name) {flag, name},
+#else
+# define IF_HAVE_VM_MSHARE(flag, name)
+#endif
+
 #define __def_vmaflag_names						\
 	{VM_READ,			"read"		},		\
 	{VM_WRITE,			"write"		},		\
@@ -235,6 +241,7 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,	"softdirty"	)		\
 	{VM_HUGEPAGE,			"hugepage"	},		\
 	{VM_NOHUGEPAGE,			"nohugepage"	},		\
 IF_HAVE_VM_DROPPABLE(VM_DROPPABLE,	"droppable"	)		\
+IF_HAVE_VM_MSHARE(VM_MSHARE,		"mshare"	)		\
 	{VM_MERGEABLE,			"mergeable"	}		\
 
 #define show_vma_flags(flags)						\
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 07/20] mm/mshare: Add mmap support
  2025-04-04  2:18 [PATCH v2 00/20] Add support for shared PTEs across processes Anthony Yznaga
                   ` (5 preceding siblings ...)
  2025-04-04  2:18 ` [PATCH v2 06/20] mm/mshare: Add a vma flag to indicate " Anthony Yznaga
@ 2025-04-04  2:18 ` Anthony Yznaga
  2025-04-04  2:18 ` [PATCH v2 08/20] mm/mshare: flush all TLBs when updating PTEs in an mshare range Anthony Yznaga
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Anthony Yznaga @ 2025-04-04  2:18 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

From: Khalid Aziz <khalid@kernel.org>

Add support for mapping an mshare region into a process after the
region has been established in msharefs. Disallow operations that
could split the resulting msharefs vma such as partial unmaps and
protection changes. Fault handling, mapping, unmapping, and
protection changes for objects mapped into an mshare region will
be done using the shared vmas created for them in the host mm. This
functionality will be added in later patches.

Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/mshare.c | 133 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 132 insertions(+), 1 deletion(-)

diff --git a/mm/mshare.c b/mm/mshare.c
index 5eed18bc3b51..6bdbcfa8deea 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -15,19 +15,146 @@
 
 #include <linux/fs.h>
 #include <linux/fs_context.h>
+#include <linux/mman.h>
 #include <uapi/linux/magic.h>
 #include <linux/falloc.h>
 
 const unsigned long mshare_align = P4D_SIZE;
+const unsigned long mshare_base = mshare_align;
 
 #define MSHARE_INITIALIZED	0x1
 
 struct mshare_data {
 	struct mm_struct *mm;
+	unsigned long start;
 	unsigned long size;
 	unsigned long flags;
 };
 
+static int mshare_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
+{
+	return -EINVAL;
+}
+
+static int mshare_vm_op_mprotect(struct vm_area_struct *vma, unsigned long start,
+				 unsigned long end, unsigned long newflags)
+{
+	return -EINVAL;
+}
+
+static const struct vm_operations_struct msharefs_vm_ops = {
+	.may_split = mshare_vm_op_split,
+	.mprotect = mshare_vm_op_mprotect,
+};
+
+/*
+ * msharefs_mmap() - mmap an mshare region
+ */
+static int
+msharefs_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct mshare_data *m_data = file->private_data;
+
+	vma->vm_private_data = m_data;
+	vm_flags_set(vma, VM_MSHARE | VM_DONTEXPAND);
+	vma->vm_ops = &msharefs_vm_ops;
+
+	return 0;
+}
+
+static unsigned long
+msharefs_get_unmapped_area_bottomup(struct file *file, unsigned long addr,
+		unsigned long len, unsigned long pgoff, unsigned long flags)
+{
+	struct vm_unmapped_area_info info = {};
+
+	info.length = len;
+	info.low_limit = current->mm->mmap_base;
+	info.high_limit = arch_get_mmap_end(addr, len, flags);
+	info.align_mask = PAGE_MASK & (mshare_align - 1);
+	return vm_unmapped_area(&info);
+}
+
+static unsigned long
+msharefs_get_unmapped_area_topdown(struct file *file, unsigned long addr,
+		unsigned long len, unsigned long pgoff, unsigned long flags)
+{
+	struct vm_unmapped_area_info info = {};
+
+	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
+	info.length = len;
+	info.low_limit = PAGE_SIZE;
+	info.high_limit = arch_get_mmap_base(addr, current->mm->mmap_base);
+	info.align_mask = PAGE_MASK & (mshare_align - 1);
+	addr = vm_unmapped_area(&info);
+
+	/*
+	 * A failed mmap() very likely causes application failure,
+	 * so fall back to the bottom-up function here. This scenario
+	 * can happen with large stack limits and large mmap()
+	 * allocations.
+	 */
+	if (unlikely(offset_in_page(addr))) {
+		VM_BUG_ON(addr != -ENOMEM);
+		info.flags = 0;
+		info.low_limit = current->mm->mmap_base;
+		info.high_limit = arch_get_mmap_end(addr, len, flags);
+		addr = vm_unmapped_area(&info);
+	}
+
+	return addr;
+}
+
+static unsigned long
+msharefs_get_unmapped_area(struct file *file, unsigned long addr,
+		unsigned long len, unsigned long pgoff, unsigned long flags)
+{
+	struct mshare_data *m_data = file->private_data;
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma, *prev;
+	unsigned long mshare_start, mshare_size;
+	const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
+
+	mmap_assert_write_locked(mm);
+
+	if ((flags & MAP_TYPE) == MAP_PRIVATE)
+		return -EINVAL;
+
+	if (!test_bit(MSHARE_INITIALIZED, &m_data->flags))
+		return -EINVAL;
+
+	mshare_start = m_data->start;
+	mshare_size = m_data->size;
+
+	if (len != mshare_size)
+		return -EINVAL;
+
+	if (len > mmap_end - mmap_min_addr)
+		return -ENOMEM;
+
+	if (flags & MAP_FIXED) {
+		if (!IS_ALIGNED(addr, mshare_align))
+			return -EINVAL;
+		return addr;
+	}
+
+	if (addr) {
+		addr = ALIGN(addr, mshare_align);
+		vma = find_vma_prev(mm, addr, &prev);
+		if (mmap_end - len >= addr && addr >= mmap_min_addr &&
+		    (!vma || addr + len <= vm_start_gap(vma)) &&
+		    (!prev || addr >= vm_end_gap(prev)))
+			return addr;
+	}
+
+	if (!test_bit(MMF_TOPDOWN, &mm->flags))
+		return msharefs_get_unmapped_area_bottomup(file, addr, len,
+				pgoff, flags);
+	else
+		return msharefs_get_unmapped_area_topdown(file, addr, len,
+				pgoff, flags);
+}
+
 static int msharefs_set_size(struct mshare_data *m_data, unsigned long size)
 {
 	int error = -EINVAL;
@@ -81,6 +208,8 @@ static const struct inode_operations msharefs_file_inode_ops;
 
 static const struct file_operations msharefs_file_operations = {
 	.open			= simple_open,
+	.mmap			= msharefs_mmap,
+	.get_unmapped_area	= msharefs_get_unmapped_area,
 	.fallocate		= msharefs_fallocate,
 };
 
@@ -97,7 +226,8 @@ msharefs_fill_mm(struct inode *inode)
 		goto err_free;
 	}
 
-	mm->mmap_base = mm->task_size = 0;
+	mm->mmap_base = mshare_base;
+	mm->task_size = 0;
 
 	m_data = kzalloc(sizeof(*m_data), GFP_KERNEL);
 	if (!m_data) {
@@ -105,6 +235,7 @@ msharefs_fill_mm(struct inode *inode)
 		goto err_free;
 	}
 	m_data->mm = mm;
+	m_data->start = mshare_base;
 	inode->i_private = m_data;
 
 	return 0;
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 08/20] mm/mshare: flush all TLBs when updating PTEs in an mshare range
  2025-04-04  2:18 [PATCH v2 00/20] Add support for shared PTEs across processes Anthony Yznaga
                   ` (6 preceding siblings ...)
  2025-04-04  2:18 ` [PATCH v2 07/20] mm/mshare: Add mmap support Anthony Yznaga
@ 2025-04-04  2:18 ` Anthony Yznaga
  2025-05-30 14:41   ` Jann Horn
  2025-04-04  2:18 ` [PATCH v2 09/20] sched/numa: do not scan msharefs vmas Anthony Yznaga
                   ` (11 subsequent siblings)
  19 siblings, 1 reply; 33+ messages in thread
From: Anthony Yznaga @ 2025-04-04  2:18 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

Unlike the mm of a task, an mshare host mm is not updated on context
switch. In particular this means that mm_cpumask is never updated
which results in TLB flushes for updates to mshare PTEs only being
done on the local CPU. To ensure entries are flushed for non-local
TLBs, set up an mmu notifier on the mshare mm and use the
.arch_invalidate_secondary_tlbs callback to flush all TLBs.
arch_invalidate_secondary_tlbs guarantees that TLB entries will be
flushed before pages are freed when unmapping pages in an mshare region.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/mshare.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/mm/mshare.c b/mm/mshare.c
index 6bdbcfa8deea..792d86c61042 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -16,8 +16,10 @@
 #include <linux/fs.h>
 #include <linux/fs_context.h>
 #include <linux/mman.h>
+#include <linux/mmu_notifier.h>
 #include <uapi/linux/magic.h>
 #include <linux/falloc.h>
+#include <asm/tlbflush.h>
 
 const unsigned long mshare_align = P4D_SIZE;
 const unsigned long mshare_base = mshare_align;
@@ -29,6 +31,17 @@ struct mshare_data {
 	unsigned long start;
 	unsigned long size;
 	unsigned long flags;
+	struct mmu_notifier mn;
+};
+
+static void mshare_invalidate_tlbs(struct mmu_notifier *mn, struct mm_struct *mm,
+				   unsigned long start, unsigned long end)
+{
+	flush_tlb_all();
+}
+
+static const struct mmu_notifier_ops mshare_mmu_ops = {
+	.arch_invalidate_secondary_tlbs = mshare_invalidate_tlbs,
 };
 
 static int mshare_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
@@ -237,6 +250,10 @@ msharefs_fill_mm(struct inode *inode)
 	m_data->mm = mm;
 	m_data->start = mshare_base;
 	inode->i_private = m_data;
+	m_data->mn.ops = &mshare_mmu_ops;
+	ret = mmu_notifier_register(&m_data->mn, mm);
+	if (ret)
+		goto err_free;
 
 	return 0;
 
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 09/20] sched/numa: do not scan msharefs vmas
  2025-04-04  2:18 [PATCH v2 00/20] Add support for shared PTEs across processes Anthony Yznaga
                   ` (7 preceding siblings ...)
  2025-04-04  2:18 ` [PATCH v2 08/20] mm/mshare: flush all TLBs when updating PTEs in an mshare range Anthony Yznaga
@ 2025-04-04  2:18 ` Anthony Yznaga
  2025-04-04  2:18 ` [PATCH v2 10/20] mm: add mmap_read_lock_killable_nested() Anthony Yznaga
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Anthony Yznaga @ 2025-04-04  2:18 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

Scanning an msharefs vma results in changes to the shared page
table, but the TLB flushes would incorrectly go only to the process
that owns the vma.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e43993a4e580..6e1649a28551 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3386,7 +3386,8 @@ static void task_numa_work(struct callback_head *work)
 
 	for (; vma; vma = vma_next(&vmi)) {
 		if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
-			is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
+			is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP) ||
+			vma_is_mshare(vma)) {
 			trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_UNSUITABLE);
 			continue;
 		}
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 10/20] mm: add mmap_read_lock_killable_nested()
  2025-04-04  2:18 [PATCH v2 00/20] Add support for shared PTEs across processes Anthony Yznaga
                   ` (8 preceding siblings ...)
  2025-04-04  2:18 ` [PATCH v2 09/20] sched/numa: do not scan msharefs vmas Anthony Yznaga
@ 2025-04-04  2:18 ` Anthony Yznaga
  2025-04-04  2:18 ` [PATCH v2 11/20] mm: add and use unmap_page_range vm_ops hook Anthony Yznaga
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Anthony Yznaga @ 2025-04-04  2:18 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

This will be used to support mshare functionality where the read
lock on an mshare host mm is taken while holding the lock on a
process mm.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/mmap_lock.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 4706c6769902..a14d74822846 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -185,6 +185,13 @@ static inline void mmap_read_lock(struct mm_struct *mm)
 	__mmap_lock_trace_acquire_returned(mm, false, true);
 }
 
+static inline void mmap_read_lock_nested(struct mm_struct *mm, int subclass)
+{
+	__mmap_lock_trace_start_locking(mm, false);
+	down_read_nested(&mm->mmap_lock, subclass);
+	__mmap_lock_trace_acquire_returned(mm, false, true);
+}
+
 static inline int mmap_read_lock_killable(struct mm_struct *mm)
 {
 	int ret;
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 11/20] mm: add and use unmap_page_range vm_ops hook
  2025-04-04  2:18 [PATCH v2 00/20] Add support for shared PTEs across processes Anthony Yznaga
                   ` (9 preceding siblings ...)
  2025-04-04  2:18 ` [PATCH v2 10/20] mm: add mmap_read_lock_killable_nested() Anthony Yznaga
@ 2025-04-04  2:18 ` Anthony Yznaga
  2025-04-04  2:18 ` [PATCH v2 12/20] mm/mshare: prepare for page table sharing support Anthony Yznaga
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Anthony Yznaga @ 2025-04-04  2:18 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

Special handling is needed when unmapping a hugetlb vma and will
be needed when unmapping an msharefs vma once support is added for
handling faults in an mshare region.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/mm.h | 10 ++++++++++
 ipc/shm.c          | 17 +++++++++++++++++
 mm/hugetlb.c       | 25 +++++++++++++++++++++++++
 mm/memory.c        | 36 +++++++++++++-----------------------
 4 files changed, 65 insertions(+), 23 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f2f9d15213ab..d6cac2cca4da 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -40,6 +40,7 @@ struct anon_vma_chain;
 struct user_struct;
 struct pt_regs;
 struct folio_batch;
+struct zap_details;
 
 void arch_mm_preinit(void);
 void mm_core_init(void);
@@ -661,8 +662,17 @@ struct vm_operations_struct {
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,
 					  unsigned long addr);
+	void (*unmap_page_range)(struct mmu_gather *tlb,
+				 struct vm_area_struct *vma,
+				 unsigned long addr, unsigned long end,
+				 struct zap_details *details);
 };
 
+void __unmap_page_range(struct mmu_gather *tlb,
+			struct vm_area_struct *vma,
+			unsigned long addr, unsigned long end,
+			struct zap_details *details);
+
 #ifdef CONFIG_NUMA_BALANCING
 static inline void vma_numab_state_init(struct vm_area_struct *vma)
 {
diff --git a/ipc/shm.c b/ipc/shm.c
index 99564c870084..cadd551e60b9 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -585,6 +585,22 @@ static struct mempolicy *shm_get_policy(struct vm_area_struct *vma,
 }
 #endif
 
+static void shm_unmap_page_range(struct mmu_gather *tlb,
+				 struct vm_area_struct *vma,
+				 unsigned long addr, unsigned long end,
+				 struct zap_details *details)
+{
+	struct file *file = vma->vm_file;
+	struct shm_file_data *sfd = shm_file_data(file);
+
+	if (sfd->vm_ops->unmap_page_range) {
+		sfd->vm_ops->unmap_page_range(tlb, vma, addr, end, details);
+		return;
+	}
+
+	__unmap_page_range(tlb, vma, addr, end, details);
+}
+
 static int shm_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct shm_file_data *sfd = shm_file_data(file);
@@ -685,6 +701,7 @@ static const struct vm_operations_struct shm_vm_ops = {
 	.set_policy = shm_set_policy,
 	.get_policy = shm_get_policy,
 #endif
+	.unmap_page_range = shm_unmap_page_range,
 };
 
 /**
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 39f92aad7bd1..e4a516f458a0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5426,6 +5426,30 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
 	return 0;
 }
 
+static void hugetlb_vm_op_unmap_page_range(struct mmu_gather *tlb,
+				struct vm_area_struct *vma,
+				unsigned long addr, unsigned long end,
+				struct zap_details *details)
+{
+	zap_flags_t zap_flags = details ?  details->zap_flags : 0;
+
+	/*
+	 * It is undesirable to test vma->vm_file as it
+	 * should be non-null for valid hugetlb area.
+	 * However, vm_file will be NULL in the error
+	 * cleanup path of mmap_region. When
+	 * hugetlbfs ->mmap method fails,
+	 * mmap_region() nullifies vma->vm_file
+	 * before calling this function to clean up.
+	 * Since no pte has actually been setup, it is
+	 * safe to do nothing in this case.
+	 */
+	if (!vma->vm_file)
+		return;
+
+	__unmap_hugepage_range(tlb, vma, addr, end, NULL, zap_flags);
+}
+
 /*
  * When a new function is introduced to vm_operations_struct and added
  * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops.
@@ -5439,6 +5463,7 @@ const struct vm_operations_struct hugetlb_vm_ops = {
 	.close = hugetlb_vm_op_close,
 	.may_split = hugetlb_vm_op_split,
 	.pagesize = hugetlb_vm_op_pagesize,
+	.unmap_page_range = hugetlb_vm_op_unmap_page_range,
 };
 
 static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
diff --git a/mm/memory.c b/mm/memory.c
index 6ea3551eb2df..db558fe43088 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1876,7 +1876,7 @@ static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
 	return addr;
 }
 
-void unmap_page_range(struct mmu_gather *tlb,
+void __unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end,
 			     struct zap_details *details)
@@ -1896,6 +1896,16 @@ void unmap_page_range(struct mmu_gather *tlb,
 	tlb_end_vma(tlb, vma);
 }
 
+void unmap_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end,
+			     struct zap_details *details)
+{
+	if (vma->vm_ops && vma->vm_ops->unmap_page_range)
+		vma->vm_ops->unmap_page_range(tlb, vma, addr, end, details);
+	else
+		__unmap_page_range(tlb, vma, addr, end, details);
+}
 
 static void unmap_single_vma(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, unsigned long start_addr,
@@ -1917,28 +1927,8 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 	if (unlikely(vma->vm_flags & VM_PFNMAP))
 		untrack_pfn(vma, 0, 0, mm_wr_locked);
 
-	if (start != end) {
-		if (unlikely(is_vm_hugetlb_page(vma))) {
-			/*
-			 * It is undesirable to test vma->vm_file as it
-			 * should be non-null for valid hugetlb area.
-			 * However, vm_file will be NULL in the error
-			 * cleanup path of mmap_region. When
-			 * hugetlbfs ->mmap method fails,
-			 * mmap_region() nullifies vma->vm_file
-			 * before calling this function to clean up.
-			 * Since no pte has actually been setup, it is
-			 * safe to do nothing in this case.
-			 */
-			if (vma->vm_file) {
-				zap_flags_t zap_flags = details ?
-				    details->zap_flags : 0;
-				__unmap_hugepage_range(tlb, vma, start, end,
-							     NULL, zap_flags);
-			}
-		} else
-			unmap_page_range(tlb, vma, start, end, details);
-	}
+	if (start != end)
+		unmap_page_range(tlb, vma, start, end, details);
 }
 
 /**
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 12/20] mm/mshare: prepare for page table sharing support
  2025-04-04  2:18 [PATCH v2 00/20] Add support for shared PTEs across processes Anthony Yznaga
                   ` (10 preceding siblings ...)
  2025-04-04  2:18 ` [PATCH v2 11/20] mm: add and use unmap_page_range vm_ops hook Anthony Yznaga
@ 2025-04-04  2:18 ` Anthony Yznaga
  2025-05-30 14:56   ` Jann Horn
  2025-04-04  2:18 ` [PATCH v2 13/20] x86/mm: enable page table sharing Anthony Yznaga
                   ` (7 subsequent siblings)
  19 siblings, 1 reply; 33+ messages in thread
From: Anthony Yznaga @ 2025-04-04  2:18 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

From: Khalid Aziz <khalid@kernel.org>

In preparation for enabling the handling of page faults in an mshare
region, provide a way to link an mshare shared page table into a process
page table and to find the actual vma needed to handle a page fault.
Modify the unmap path to ensure that shared page tables in mshare regions
are unlinked from the process but kept intact when a process exits or an
mshare region is explicitly unmapped.
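
The fault handling enabled later relies on a fixed offset between a
process mapping of an mshare region and the region's placement in the
host mm. Roughly, find_shared_vma() below does the following (a sketch
with names loosely based on the patch; illustrative only):

	/* translate the faulting address into the host mm */
	host_addr = addr - guest_vma->vm_start + host_mm->mmap_base;

	/*
	 * If the process page table does not yet point at the shared
	 * page tables, link the host entry in and let the fault be
	 * retried; otherwise return the host vma covering host_addr
	 * so fault handling continues against the host mm.
	 */
	if (!p4d_same(*guest_p4d, *host_p4d))
		set_p4d(guest_p4d, *host_p4d);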

Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/mm.h |  6 +++++
 mm/memory.c        | 45 +++++++++++++++++++++++++++------
 mm/mshare.c        | 62 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 105 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d6cac2cca4da..f06be2f20c20 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1179,11 +1179,17 @@ static inline bool vma_is_anon_shmem(struct vm_area_struct *vma) { return false;
 int vma_is_stack_for_current(struct vm_area_struct *vma);
 
 #ifdef CONFIG_MSHARE
+vm_fault_t find_shared_vma(struct vm_area_struct **vma, unsigned long *addrp);
 static inline bool vma_is_mshare(const struct vm_area_struct *vma)
 {
 	return vma->vm_flags & VM_MSHARE;
 }
 #else
+static inline vm_fault_t find_shared_vma(struct vm_area_struct **vma, unsigned long *addrp)
+{
+	WARN_ON_ONCE(1);
+	return VM_FAULT_SIGBUS;
+}
 static inline bool vma_is_mshare(const struct vm_area_struct *vma)
 {
 	return false;
diff --git a/mm/memory.c b/mm/memory.c
index db558fe43088..68422b606819 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -247,7 +247,8 @@ static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 
 static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
 				unsigned long addr, unsigned long end,
-				unsigned long floor, unsigned long ceiling)
+				unsigned long floor, unsigned long ceiling,
+				bool shared_pud)
 {
 	p4d_t *p4d;
 	unsigned long next;
@@ -259,7 +260,10 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
 		next = p4d_addr_end(addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
-		free_pud_range(tlb, p4d, addr, next, floor, ceiling);
+		if (unlikely(shared_pud))
+			p4d_clear(p4d);
+		else
+			free_pud_range(tlb, p4d, addr, next, floor, ceiling);
 	} while (p4d++, addr = next, addr != end);
 
 	start &= PGDIR_MASK;
@@ -281,9 +285,10 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
 /*
  * This function frees user-level page tables of a process.
  */
-void free_pgd_range(struct mmu_gather *tlb,
+static void __free_pgd_range(struct mmu_gather *tlb,
 			unsigned long addr, unsigned long end,
-			unsigned long floor, unsigned long ceiling)
+			unsigned long floor, unsigned long ceiling,
+			bool shared_pud)
 {
 	pgd_t *pgd;
 	unsigned long next;
@@ -339,10 +344,17 @@ void free_pgd_range(struct mmu_gather *tlb,
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		free_p4d_range(tlb, pgd, addr, next, floor, ceiling);
+		free_p4d_range(tlb, pgd, addr, next, floor, ceiling, shared_pud);
 	} while (pgd++, addr = next, addr != end);
 }
 
+void free_pgd_range(struct mmu_gather *tlb,
+			unsigned long addr, unsigned long end,
+			unsigned long floor, unsigned long ceiling)
+{
+	__free_pgd_range(tlb, addr, end, floor, ceiling, false);
+}
+
 void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 		   struct vm_area_struct *vma, unsigned long floor,
 		   unsigned long ceiling, bool mm_wr_locked)
@@ -379,9 +391,12 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 
 			/*
 			 * Optimization: gather nearby vmas into one call down
+			 *
+			 * Do not free the shared page tables of an mshare region.
 			 */
 			while (next && next->vm_start <= vma->vm_end + PMD_SIZE
-			       && !is_vm_hugetlb_page(next)) {
+			       && !is_vm_hugetlb_page(next)
+			       && !vma_is_mshare(next)) {
 				vma = next;
 				next = mas_find(mas, ceiling - 1);
 				if (unlikely(xa_is_zero(next)))
@@ -392,9 +407,11 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 				unlink_file_vma_batch_add(&vb, vma);
 			}
 			unlink_file_vma_batch_final(&vb);
-			free_pgd_range(tlb, addr, vma->vm_end,
-				floor, next ? next->vm_start : ceiling);
+			__free_pgd_range(tlb, addr, vma->vm_end,
+				floor, next ? next->vm_start : ceiling,
+				vma_is_mshare(vma));
 		}
+
 		vma = next;
 	} while (vma);
 }
@@ -1884,6 +1901,13 @@ void __unmap_page_range(struct mmu_gather *tlb,
 	pgd_t *pgd;
 	unsigned long next;
 
+	/*
+	 * Do not unmap vmas that share page tables through an
+	 * mshare region.
+	 */
+	if (vma_is_mshare(vma))
+		return;
+
 	BUG_ON(addr >= end);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
@@ -6275,6 +6299,11 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	if (ret)
 		goto out;
 
+	if (unlikely(vma_is_mshare(vma))) {
+		WARN_ON_ONCE(1);
+		return VM_FAULT_SIGBUS;
+	}
+
 	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
 					    flags & FAULT_FLAG_INSTRUCTION,
 					    flags & FAULT_FLAG_REMOTE)) {
diff --git a/mm/mshare.c b/mm/mshare.c
index 792d86c61042..4ddaa0d41070 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -44,6 +44,56 @@ static const struct mmu_notifier_ops mshare_mmu_ops = {
 	.arch_invalidate_secondary_tlbs = mshare_invalidate_tlbs,
 };
 
+static p4d_t *walk_to_p4d(struct mm_struct *mm, unsigned long addr)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+
+	pgd = pgd_offset(mm, addr);
+	p4d = p4d_alloc(mm, pgd, addr);
+	if (!p4d)
+		return NULL;
+
+	return p4d;
+}
+
+/* Returns holding the host mm's lock for read.  Caller must release. */
+vm_fault_t
+find_shared_vma(struct vm_area_struct **vmap, unsigned long *addrp)
+{
+	struct vm_area_struct *vma, *guest = *vmap;
+	struct mshare_data *m_data = guest->vm_private_data;
+	struct mm_struct *host_mm = m_data->mm;
+	unsigned long host_addr;
+	p4d_t *p4d, *guest_p4d;
+
+	mmap_read_lock_nested(host_mm, SINGLE_DEPTH_NESTING);
+	host_addr = *addrp - guest->vm_start + host_mm->mmap_base;
+	p4d = walk_to_p4d(host_mm, host_addr);
+	guest_p4d = walk_to_p4d(guest->vm_mm, *addrp);
+	if (!p4d_same(*guest_p4d, *p4d)) {
+		set_p4d(guest_p4d, *p4d);
+		mmap_read_unlock(host_mm);
+		return VM_FAULT_NOPAGE;
+	}
+
+	*addrp = host_addr;
+	vma = find_vma(host_mm, host_addr);
+
+	/* XXX: expand stack? */
+	if (vma && vma->vm_start > host_addr)
+		vma = NULL;
+
+	*vmap = vma;
+
+	/*
+	 * release host mm lock unless a matching vma is found
+	 */
+	if (!vma)
+		mmap_read_unlock(host_mm);
+	return 0;
+}
+
 static int mshare_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
 {
 	return -EINVAL;
@@ -55,9 +105,21 @@ static int mshare_vm_op_mprotect(struct vm_area_struct *vma, unsigned long start
 	return -EINVAL;
 }
 
+static void mshare_vm_op_unmap_page_range(struct mmu_gather *tlb,
+				struct vm_area_struct *vma,
+				unsigned long addr, unsigned long end,
+				struct zap_details *details)
+{
+	/*
+	 * The msharefs vma is being unmapped. Do not unmap pages in the
+	 * mshare region itself.
+	 */
+}
+
 static const struct vm_operations_struct msharefs_vm_ops = {
 	.may_split = mshare_vm_op_split,
 	.mprotect = mshare_vm_op_mprotect,
+	.unmap_page_range = mshare_vm_op_unmap_page_range,
 };
 
 /*
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 13/20] x86/mm: enable page table sharing
  2025-04-04  2:18 [PATCH v2 00/20] Add support for shared PTEs across processes Anthony Yznaga
                   ` (11 preceding siblings ...)
  2025-04-04  2:18 ` [PATCH v2 12/20] mm/mshare: prepare for page table sharing support Anthony Yznaga
@ 2025-04-04  2:18 ` Anthony Yznaga
  2025-08-12 13:46   ` Yongting Lin
  2025-04-04  2:18 ` [PATCH v2 14/20] mm: create __do_mmap() to take an mm_struct * arg Anthony Yznaga
                   ` (6 subsequent siblings)
  19 siblings, 1 reply; 33+ messages in thread
From: Anthony Yznaga @ 2025-04-04  2:18 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

Enable x86 support for handling page faults in an mshare region by
redirecting page faults to operate on the mshare mm_struct and vmas
contained in it.

Some permission checks are done using vma flags in architecture-specific
fault handling code, so the actual vma needed to complete the handling
is acquired before calling handle_mm_fault(). Because of this, an
ARCH_SUPPORTS_MSHARE config option is added.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 arch/Kconfig        |  3 +++
 arch/x86/Kconfig    |  1 +
 arch/x86/mm/fault.c | 37 ++++++++++++++++++++++++++++++++++++-
 mm/Kconfig          |  2 +-
 4 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 9f6eb09ef12d..2e000fefe9b3 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1652,6 +1652,9 @@ config HAVE_ARCH_PFN_VALID
 config ARCH_SUPPORTS_DEBUG_PAGEALLOC
 	bool
 
+config ARCH_SUPPORTS_MSHARE
+	bool
+
 config ARCH_SUPPORTS_PAGE_TABLE_CHECK
 	bool
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1502fd0c3c06..1f1779decb44 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -125,6 +125,7 @@ config X86
 	select ARCH_SUPPORTS_ACPI
 	select ARCH_SUPPORTS_ATOMIC_RMW
 	select ARCH_SUPPORTS_DEBUG_PAGEALLOC
+	select ARCH_SUPPORTS_MSHARE		if X86_64
 	select ARCH_SUPPORTS_PAGE_TABLE_CHECK	if X86_64
 	select ARCH_SUPPORTS_NUMA_BALANCING	if X86_64
 	select ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP	if NR_CPUS <= 4096
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 296d294142c8..49659d2f9316 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1216,6 +1216,8 @@ void do_user_addr_fault(struct pt_regs *regs,
 	struct mm_struct *mm;
 	vm_fault_t fault;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
+	bool is_shared_vma;
+	unsigned long addr;
 
 	tsk = current;
 	mm = tsk->mm;
@@ -1329,6 +1331,12 @@ void do_user_addr_fault(struct pt_regs *regs,
 	if (!vma)
 		goto lock_mmap;
 
+	/* mshare does not support per-VMA locks yet */
+	if (vma_is_mshare(vma)) {
+		vma_end_read(vma);
+		goto lock_mmap;
+	}
+
 	if (unlikely(access_error(error_code, vma))) {
 		bad_area_access_error(regs, error_code, address, NULL, vma);
 		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
@@ -1357,17 +1365,38 @@ void do_user_addr_fault(struct pt_regs *regs,
 lock_mmap:
 
 retry:
+	addr = address;
+	is_shared_vma = false;
 	vma = lock_mm_and_find_vma(mm, address, regs);
 	if (unlikely(!vma)) {
 		bad_area_nosemaphore(regs, error_code, address);
 		return;
 	}
 
+	if (unlikely(vma_is_mshare(vma))) {
+		fault = find_shared_vma(&vma, &addr);
+
+		if (fault) {
+			mmap_read_unlock(mm);
+			goto done;
+		}
+
+		if (!vma) {
+			mmap_read_unlock(mm);
+			bad_area_nosemaphore(regs, error_code, address);
+			return;
+		}
+
+		is_shared_vma = true;
+	}
+
 	/*
 	 * Ok, we have a good vm_area for this memory access, so
 	 * we can handle it..
 	 */
 	if (unlikely(access_error(error_code, vma))) {
+		if (unlikely(is_shared_vma))
+			mmap_read_unlock(vma->vm_mm);
 		bad_area_access_error(regs, error_code, address, mm, vma);
 		return;
 	}
@@ -1385,7 +1414,11 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * userland). The return to userland is identified whenever
 	 * FAULT_FLAG_USER|FAULT_FLAG_KILLABLE are both set in flags.
 	 */
-	fault = handle_mm_fault(vma, address, flags, regs);
+	fault = handle_mm_fault(vma, addr, flags, regs);
+
+	if (unlikely(is_shared_vma) && ((fault & VM_FAULT_COMPLETED) ||
+	    (fault & VM_FAULT_RETRY) || fault_signal_pending(fault, regs)))
+		mmap_read_unlock(mm);
 
 	if (fault_signal_pending(fault, regs)) {
 		/*
@@ -1413,6 +1446,8 @@ void do_user_addr_fault(struct pt_regs *regs,
 		goto retry;
 	}
 
+	if (unlikely(is_shared_vma))
+		mmap_read_unlock(vma->vm_mm);
 	mmap_read_unlock(mm);
 done:
 	if (likely(!(fault & VM_FAULT_ERROR)))
diff --git a/mm/Kconfig b/mm/Kconfig
index e6c90db83d01..8a5a159457f2 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1344,7 +1344,7 @@ config PT_RECLAIM
 
 config MSHARE
 	bool "Mshare"
-	depends on MMU
+	depends on MMU && ARCH_SUPPORTS_MSHARE
 	help
 	  Enable msharefs: A ram-based filesystem that allows multiple
 	  processes to share page table entries for shared pages. A file
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 14/20] mm: create __do_mmap() to take an mm_struct * arg
  2025-04-04  2:18 [PATCH v2 00/20] Add support for shared PTEs across processes Anthony Yznaga
                   ` (12 preceding siblings ...)
  2025-04-04  2:18 ` [PATCH v2 13/20] x86/mm: enable page table sharing Anthony Yznaga
@ 2025-04-04  2:18 ` Anthony Yznaga
  2025-04-04  2:18 ` [PATCH v2 15/20] mm: pass the mm in vma_munmap_struct Anthony Yznaga
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Anthony Yznaga @ 2025-04-04  2:18 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

In preparation for mapping objects into an mshare region, create
__do_mmap() to allow mapping into a specified mm. There are no
functional changes otherwise.
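
A sketch of the intended use against a foreign mm (a hypothetical
helper for illustration only; the real caller is added by the
MSHAREFS_CREATE_MAPPING patch later in this series):

static unsigned long example_map_into(struct mm_struct *host_mm,
				      unsigned long start, unsigned long len)
{
	unsigned long populate = 0;
	unsigned long addr;

	if (mmap_write_lock_killable(host_mm))
		return -EINTR;
	addr = __do_mmap(NULL, start, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_FIXED, 0, 0, &populate,
			 NULL, host_mm);
	mmap_write_unlock(host_mm);

	return addr;
}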

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/mm.h | 16 ++++++++++++++++
 mm/mmap.c          | 10 +++++-----
 mm/vma.c           | 12 ++++++------
 mm/vma.h           |  2 +-
 4 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f06be2f20c20..9e64deae3d64 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3480,10 +3480,26 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 	return __get_unmapped_area(file, addr, len, pgoff, flags, 0);
 }
 
+#ifdef CONFIG_MMU
+unsigned long __do_mmap(struct file *file, unsigned long addr,
+	unsigned long len, unsigned long prot, unsigned long flags,
+	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
+	struct list_head *uf, struct mm_struct *mm);
+static inline unsigned long do_mmap(struct file *file, unsigned long addr,
+	unsigned long len, unsigned long prot, unsigned long flags,
+	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
+	struct list_head *uf)
+{
+	return __do_mmap(file, addr, len, prot, flags, vm_flags, pgoff,
+			 populate, uf, current->mm);
+}
+#else
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
 	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
 	struct list_head *uf);
+#endif
+
 extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 			 unsigned long start, size_t len, struct list_head *uf,
 			 bool unlock);
diff --git a/mm/mmap.c b/mm/mmap.c
index bd210aaf7ebd..0cc187a60a0f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -278,7 +278,7 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
 }
 
 /**
- * do_mmap() - Perform a userland memory mapping into the current process
+ * __do_mmap() - Perform a userland memory mapping into the current process
  * address space of length @len with protection bits @prot, mmap flags @flags
  * (from which VMA flags will be inferred), and any additional VMA flags to
  * apply @vm_flags. If this is a file-backed mapping then the file is specified
@@ -330,17 +330,17 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
  * @uf: An optional pointer to a list head to track userfaultfd unmap events
  * should unmapping events arise. If provided, it is up to the caller to manage
  * this.
+ * @mm: The mm_struct
  *
  * Returns: Either an error, or the address at which the requested mapping has
  * been performed.
  */
-unsigned long do_mmap(struct file *file, unsigned long addr,
+unsigned long __do_mmap(struct file *file, unsigned long addr,
 			unsigned long len, unsigned long prot,
 			unsigned long flags, vm_flags_t vm_flags,
 			unsigned long pgoff, unsigned long *populate,
-			struct list_head *uf)
+			struct list_head *uf, struct mm_struct *mm)
 {
-	struct mm_struct *mm = current->mm;
 	int pkey = 0;
 
 	*populate = 0;
@@ -558,7 +558,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
-	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
+	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, mm);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
diff --git a/mm/vma.c b/mm/vma.c
index 5cdc5612bfc1..9069b42edab6 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2452,9 +2452,8 @@ static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
 
 static unsigned long __mmap_region(struct file *file, unsigned long addr,
 		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-		struct list_head *uf)
+		struct list_head *uf, struct mm_struct *mm)
 {
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = NULL;
 	int error;
 	VMA_ITERATOR(vmi, mm, addr);
@@ -2521,18 +2520,19 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
  * the virtual page offset in memory of the anonymous mapping.
  * @uf: Optionally, a pointer to a list head used for tracking userfaultfd unmap
  * events.
+ * @mm: The mm struct
  *
  * Returns: Either an error, or the address at which the requested mapping has
  * been performed.
  */
 unsigned long mmap_region(struct file *file, unsigned long addr,
 			  unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-			  struct list_head *uf)
+			  struct list_head *uf, struct mm_struct *mm)
 {
 	unsigned long ret;
 	bool writable_file_mapping = false;
 
-	mmap_assert_write_locked(current->mm);
+	mmap_assert_write_locked(mm);
 
 	/* Check to see if MDWE is applicable. */
 	if (map_deny_write_exec(vm_flags, vm_flags))
@@ -2551,13 +2551,13 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		writable_file_mapping = true;
 	}
 
-	ret = __mmap_region(file, addr, len, vm_flags, pgoff, uf);
+	ret = __mmap_region(file, addr, len, vm_flags, pgoff, uf, mm);
 
 	/* Clear our write mapping regardless of error. */
 	if (writable_file_mapping)
 		mapping_unmap_writable(file->f_mapping);
 
-	validate_mm(current->mm);
+	validate_mm(mm);
 	return ret;
 }
 
diff --git a/mm/vma.h b/mm/vma.h
index 7356ca5a22d3..a6db191e65cf 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -292,7 +292,7 @@ void mm_drop_all_locks(struct mm_struct *mm);
 
 unsigned long mmap_region(struct file *file, unsigned long addr,
 		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-		struct list_head *uf);
+		struct list_head *uf, struct mm_struct *mm);
 
 int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *brkvma,
 		 unsigned long addr, unsigned long request, unsigned long flags);
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 15/20] mm: pass the mm in vma_munmap_struct
  2025-04-04  2:18 [PATCH v2 00/20] Add support for shared PTEs across processes Anthony Yznaga
                   ` (13 preceding siblings ...)
  2025-04-04  2:18 ` [PATCH v2 14/20] mm: create __do_mmap() to take an mm_struct * arg Anthony Yznaga
@ 2025-04-04  2:18 ` Anthony Yznaga
  2025-04-04  2:18 ` [PATCH v2 16/20] mm/mshare: Add an ioctl for mapping objects in an mshare region Anthony Yznaga
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Anthony Yznaga @ 2025-04-04  2:18 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

Store the mm in struct vma_munmap_struct rather than assuming
current->mm so that unmap can also operate on an mshare host mm.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/vma.c | 10 ++++++----
 mm/vma.h |  1 +
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/mm/vma.c b/mm/vma.c
index 9069b42edab6..c56f773c06c0 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -1188,7 +1188,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 	struct vm_area_struct *vma;
 	struct mm_struct *mm;
 
-	mm = current->mm;
+	mm = vms->mm;
 	mm->map_count -= vms->vma_count;
 	mm->locked_vm -= vms->locked_vm;
 	if (vms->unlock)
@@ -1396,13 +1396,15 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
  * @start: The aligned start address to munmap
  * @end: The aligned end address to munmap
  * @uf: The userfaultfd list_head
+ * @mm: The mm struct
  * @unlock: Unlock after the operation.  Only unlocked on success
  */
 static void init_vma_munmap(struct vma_munmap_struct *vms,
 		struct vma_iterator *vmi, struct vm_area_struct *vma,
 		unsigned long start, unsigned long end, struct list_head *uf,
-		bool unlock)
+		struct mm_struct *mm, bool unlock)
 {
+	vms->mm = mm;
 	vms->vmi = vmi;
 	vms->vma = vma;
 	if (vma) {
@@ -1446,7 +1448,7 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	struct vma_munmap_struct vms;
 	int error;
 
-	init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
+	init_vma_munmap(&vms, vmi, vma, start, end, uf, mm, unlock);
 	error = vms_gather_munmap_vmas(&vms, &mas_detach);
 	if (error)
 		goto gather_failed;
@@ -2247,7 +2249,7 @@ static int __mmap_prepare(struct mmap_state *map, struct list_head *uf)
 
 	/* Find the first overlapping VMA and initialise unmap state. */
 	vms->vma = vma_find(vmi, map->end);
-	init_vma_munmap(vms, vmi, vms->vma, map->addr, map->end, uf,
+	init_vma_munmap(vms, vmi, vms->vma, map->addr, map->end, uf, map->mm,
 			/* unlock = */ false);
 
 	/* OK, we have overlapping VMAs - prepare to unmap them. */
diff --git a/mm/vma.h b/mm/vma.h
index a6db191e65cf..572a11274114 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -49,6 +49,7 @@ struct vma_munmap_struct {
 	unsigned long exec_vm;
 	unsigned long stack_vm;
 	unsigned long data_vm;
+	struct mm_struct *mm;
 };
 
 enum vma_merge_state {
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 16/20] mm/mshare: Add an ioctl for mapping objects in an mshare region
  2025-04-04  2:18 [PATCH v2 00/20] Add support for shared PTEs across processes Anthony Yznaga
                   ` (14 preceding siblings ...)
  2025-04-04  2:18 ` [PATCH v2 15/20] mm: pass the mm in vma_munmap_struct Anthony Yznaga
@ 2025-04-04  2:18 ` Anthony Yznaga
  2025-04-04  2:18 ` [PATCH v2 17/20] mm/mshare: Add an ioctl for unmapping " Anthony Yznaga
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Anthony Yznaga @ 2025-04-04  2:18 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

From: Khalid Aziz <khalid@kernel.org>

Reserve a range of ioctls for msharefs and add an ioctl for mapping
objects within an mshare region. The arguments are the same as mmap()
except that the start of the mapping is specified as an offset into
the mshare region instead of as an address. For now, system-selected
addresses are disallowed, so MAP_FIXED must be specified. Only shared
anonymous memory is supported initially.
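
A rough userspace sketch of the call (illustrative only; it assumes
mshare_fd is an open descriptor for an msharefs file that has already
been created and sized):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/msharefs.h>

int map_anon_into_region(int mshare_fd, __u64 region_offset, __u64 size)
{
	struct mshare_create mcreate = {
		.region_offset	= region_offset,
		.size		= size,
		.offset		= 0,
		.prot		= PROT_READ | PROT_WRITE,
		/* MAP_SHARED and MAP_FIXED are required for now */
		.flags		= MAP_SHARED | MAP_FIXED | MAP_ANONYMOUS,
		.fd		= (__u32)-1,	/* only anonymous memory */
	};

	return ioctl(mshare_fd, MSHAREFS_CREATE_MAPPING, &mcreate);
}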

Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 .../userspace-api/ioctl/ioctl-number.rst      |  1 +
 include/uapi/linux/msharefs.h                 | 31 ++++++++
 mm/mshare.c                                   | 74 +++++++++++++++++++
 3 files changed, 106 insertions(+)
 create mode 100644 include/uapi/linux/msharefs.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 3d1cd7ad9d67..250dd58ebdba 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -306,6 +306,7 @@ Code  Seq#    Include File                                           Comments
 'v'   20-27  arch/powerpc/include/uapi/asm/vas-api.h		     VAS API
 'v'   C0-FF  linux/meye.h                                            conflict!
 'w'   all                                                            CERN SCI driver
+'x'   00-1F  linux/msharefs.h                                        msharefs filesystem
 'y'   00-1F                                                          packet based user level communications
                                                                      <mailto:zapman@interlan.net>
 'z'   00-3F                                                          CAN bus card conflict!
diff --git a/include/uapi/linux/msharefs.h b/include/uapi/linux/msharefs.h
new file mode 100644
index 000000000000..ad129beeef62
--- /dev/null
+++ b/include/uapi/linux/msharefs.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * msharefs defines a memory region that is shared across processes.
+ * ioctl is used on files created under msharefs to set various
+ * attributes on these shared memory regions
+ *
+ *
+ * Copyright (C) 2024 Oracle Corp. All rights reserved.
+ * Author:	Khalid Aziz <khalid@kernel.org>
+ */
+
+#ifndef _UAPI_LINUX_MSHAREFS_H
+#define _UAPI_LINUX_MSHAREFS_H
+
+#include <linux/ioctl.h>
+#include <linux/types.h>
+
+/*
+ * msharefs specific ioctl commands
+ */
+#define MSHAREFS_CREATE_MAPPING	_IOW('x', 0,  struct mshare_create)
+
+struct mshare_create {
+	__u64 region_offset;
+	__u64 size;
+	__u64 offset;
+	__u32 prot;
+	__u32 flags;
+	__u32 fd;
+};
+#endif
diff --git a/mm/mshare.c b/mm/mshare.c
index 4ddaa0d41070..be0aaa894963 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -10,6 +10,7 @@
  *
  * Copyright (C) 2024 Oracle Corp. All rights reserved.
  * Author:	Khalid Aziz <khalid@kernel.org>
+ * Author:	Matthew Wilcox <willy@infradead.org>
  *
  */
 
@@ -18,6 +19,7 @@
 #include <linux/mman.h>
 #include <linux/mmu_notifier.h>
 #include <uapi/linux/magic.h>
+#include <uapi/linux/msharefs.h>
 #include <linux/falloc.h>
 #include <asm/tlbflush.h>
 
@@ -230,6 +232,77 @@ msharefs_get_unmapped_area(struct file *file, unsigned long addr,
 				pgoff, flags);
 }
 
+static long
+msharefs_create_mapping(struct mshare_data *m_data, struct mshare_create *mcreate)
+{
+	struct mm_struct *host_mm = m_data->mm;
+	unsigned long mshare_start, mshare_end;
+	unsigned long region_offset = mcreate->region_offset;
+	unsigned long size = mcreate->size;
+	unsigned int fd = mcreate->fd;
+	int flags = mcreate->flags;
+	int prot = mcreate->prot;
+	unsigned long populate = 0;
+	unsigned long mapped_addr;
+	unsigned long addr;
+	vm_flags_t vm_flags = 0;
+	int error = -EINVAL;
+
+	mshare_start = m_data->start;
+	mshare_end = mshare_start + m_data->size;
+	addr = mshare_start + region_offset;
+
+	if ((addr < mshare_start) || (addr >= mshare_end) ||
+	    (addr + size > mshare_end))
+		goto out;
+
+	/*
+	 * Only anonymous shared memory at fixed addresses is allowed for now.
+	 */
+	if ((flags & (MAP_SHARED | MAP_FIXED)) != (MAP_SHARED | MAP_FIXED))
+		goto out;
+	if (fd != -1)
+		goto out;
+
+	if (mmap_write_lock_killable(host_mm)) {
+		error = -EINTR;
+		goto out;
+	}
+
+	error = 0;
+	mapped_addr = __do_mmap(NULL, addr, size, prot, flags, vm_flags,
+				0, &populate, NULL, host_mm);
+
+	if (IS_ERR_VALUE(mapped_addr))
+		error = (long)mapped_addr;
+
+	mmap_write_unlock(host_mm);
+out:
+	return error;
+}
+
+static long
+msharefs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+{
+	struct mshare_data *m_data = filp->private_data;
+	struct mshare_create mcreate;
+
+	switch (cmd) {
+	case MSHAREFS_CREATE_MAPPING:
+		if (copy_from_user(&mcreate, (struct mshare_create __user *)arg,
+			sizeof(mcreate)))
+			return -EFAULT;
+
+		if (!test_bit(MSHARE_INITIALIZED, &m_data->flags))
+			return -EINVAL;
+
+		return msharefs_create_mapping(m_data, &mcreate);
+
+	default:
+		return -ENOTTY;
+	}
+}
+
 static int msharefs_set_size(struct mshare_data *m_data, unsigned long size)
 {
 	int error = -EINVAL;
@@ -285,6 +358,7 @@ static const struct file_operations msharefs_file_operations = {
 	.open			= simple_open,
 	.mmap			= msharefs_mmap,
 	.get_unmapped_area	= msharefs_get_unmapped_area,
+	.unlocked_ioctl		= msharefs_ioctl,
 	.fallocate		= msharefs_fallocate,
 };
 
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 17/20] mm/mshare: Add an ioctl for unmapping objects in an mshare region
  2025-04-04  2:18 [PATCH v2 00/20] Add support for shared PTEs across processes Anthony Yznaga
                   ` (15 preceding siblings ...)
  2025-04-04  2:18 ` [PATCH v2 16/20] mm/mshare: Add an ioctl for mapping objects in an mshare region Anthony Yznaga
@ 2025-04-04  2:18 ` Anthony Yznaga
  2025-04-04  2:19 ` [PATCH v2 18/20] mm/mshare: provide a way to identify an mm as an mshare host mm Anthony Yznaga
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Anthony Yznaga @ 2025-04-04  2:18 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

The arguments are the same as munmap() except that the start of the
mapping is specified as an offset into the mshare region instead of
as an address.
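
A rough userspace sketch, matching the mapping example in the previous
patch (illustrative only):

#include <sys/ioctl.h>
#include <linux/msharefs.h>

int unmap_from_region(int mshare_fd, __u64 region_offset, __u64 size)
{
	struct mshare_unmap arg = {
		.region_offset	= region_offset,
		.size		= size,
	};

	return ioctl(mshare_fd, MSHAREFS_UNMAP, &arg);
}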

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/uapi/linux/msharefs.h |  7 ++++++
 mm/mshare.c                   | 40 +++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/include/uapi/linux/msharefs.h b/include/uapi/linux/msharefs.h
index ad129beeef62..fb0235d1e384 100644
--- a/include/uapi/linux/msharefs.h
+++ b/include/uapi/linux/msharefs.h
@@ -19,6 +19,7 @@
  * msharefs specific ioctl commands
  */
 #define MSHAREFS_CREATE_MAPPING	_IOW('x', 0,  struct mshare_create)
+#define MSHAREFS_UNMAP		_IOW('x', 1,  struct mshare_unmap)
 
 struct mshare_create {
 	__u64 region_offset;
@@ -28,4 +29,10 @@ struct mshare_create {
 	__u32 flags;
 	__u32 fd;
 };
+
+struct mshare_unmap {
+	__u64 region_offset;
+	__u64 size;
+};
+
 #endif
diff --git a/mm/mshare.c b/mm/mshare.c
index be0aaa894963..a6106f6264cb 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -281,11 +281,41 @@ msharefs_create_mapping(struct mshare_data *m_data, struct mshare_create *mcreat
 	return error;
 }
 
+static long
+msharefs_unmap(struct mshare_data *m_data, struct mshare_unmap *munmap)
+{
+	struct mm_struct *host_mm = m_data->mm;
+	unsigned long mshare_start, mshare_end, mshare_size;
+	unsigned long region_offset = munmap->region_offset;
+	unsigned long size = munmap->size;
+	unsigned long addr;
+	int error;
+
+	mshare_start = m_data->start;
+	mshare_size = m_data->size;
+	mshare_end = mshare_start + mshare_size;
+	addr = mshare_start + region_offset;
+
+	if ((size > mshare_size) || (region_offset >= mshare_size) ||
+	    (addr + size > mshare_end))
+		return -EINVAL;
+
+	if (mmap_write_lock_killable(host_mm))
+		return -EINTR;
+
+	error = do_munmap(host_mm, addr, size, NULL);
+
+	mmap_write_unlock(host_mm);
+
+	return error;
+}
+
 static long
 msharefs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
 	struct mshare_data *m_data = filp->private_data;
 	struct mshare_create mcreate;
+	struct mshare_unmap munmap;
 
 	switch (cmd) {
 	case MSHAREFS_CREATE_MAPPING:
@@ -298,6 +328,16 @@ msharefs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 
 		return msharefs_create_mapping(m_data, &mcreate);
 
+	case MSHAREFS_UNMAP:
+		if (copy_from_user(&munmap, (struct mshare_unmap __user *)arg,
+			sizeof(munmap)))
+			return -EFAULT;
+
+		if (!test_bit(MSHARE_INITIALIZED, &m_data->flags))
+			return -EINVAL;
+
+		return msharefs_unmap(m_data, &munmap);
+
 	default:
 		return -ENOTTY;
 	}
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 18/20] mm/mshare: provide a way to identify an mm as an mshare host mm
  2025-04-04  2:18 [PATCH v2 00/20] Add support for shared PTEs across processes Anthony Yznaga
                   ` (16 preceding siblings ...)
  2025-04-04  2:18 ` [PATCH v2 17/20] mm/mshare: Add an ioctl for unmapping " Anthony Yznaga
@ 2025-04-04  2:19 ` Anthony Yznaga
  2025-04-04  2:19 ` [PATCH v2 19/20] mm/mshare: get memcg from current->mm instead of mshare mm Anthony Yznaga
  2025-04-04  2:19 ` [PATCH v2 20/20] mm/mshare: associate a mem cgroup with an mshare file Anthony Yznaga
  19 siblings, 0 replies; 33+ messages in thread
From: Anthony Yznaga @ 2025-04-04  2:19 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

Add a new mm flag, MMF_MSHARE, to identify an mm as an mshare host mm.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/mm_types.h | 2 ++
 mm/mshare.c              | 1 +
 2 files changed, 3 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 56d07edd01f9..392605b23c62 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1741,6 +1741,8 @@ enum {
 #define MMF_TOPDOWN		31	/* mm searches top down by default */
 #define MMF_TOPDOWN_MASK	(1 << MMF_TOPDOWN)
 
+#define MMF_MSHARE		32	/* mm is an mshare host mm */
+
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
 				 MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
 				 MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
diff --git a/mm/mshare.c b/mm/mshare.c
index a6106f6264cb..0a75bd3928fc 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -415,6 +415,7 @@ msharefs_fill_mm(struct inode *inode)
 		goto err_free;
 	}
 
+	set_bit(MMF_MSHARE, &mm->flags);
 	mm->mmap_base = mshare_base;
 	mm->task_size = 0;
 
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 19/20] mm/mshare: get memcg from current->mm instead of mshare mm
  2025-04-04  2:18 [PATCH v2 00/20] Add support for shared PTEs across processes Anthony Yznaga
                   ` (17 preceding siblings ...)
  2025-04-04  2:19 ` [PATCH v2 18/20] mm/mshare: provide a way to identify an mm as an mshare host mm Anthony Yznaga
@ 2025-04-04  2:19 ` Anthony Yznaga
  2025-04-04  2:19 ` [PATCH v2 20/20] mm/mshare: associate a mem cgroup with an mshare file Anthony Yznaga
  19 siblings, 0 replies; 33+ messages in thread
From: Anthony Yznaga @ 2025-04-04  2:19 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

Because handle_mm_fault() may operate on a vma from an mshare host mm,
the mm passed to the cgroup functions count_memcg_events_mm() and
get_mem_cgroup_from_mm() may be an mshare host mm. These functions find
a memcg by dereferencing mm->owner which is set when an mm is allocated.
Since the task that created an mshare file may exit before the file is
deleted, use current->mm instead to find the memcg to update or charge
to.
This may not be the right solution but is hopefully a good starting
point. If charging should always go to a single memcg associated with
the mshare file, perhaps active_memcg could be used.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/memcontrol.h | 3 +++
 mm/memcontrol.c            | 3 ++-
 mm/mshare.c                | 3 +++
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 53364526d877..0d7a8787c876 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -983,6 +983,9 @@ static inline void count_memcg_events_mm(struct mm_struct *mm,
 	if (mem_cgroup_disabled())
 		return;
 
+	if (test_bit(MMF_MSHARE, &mm->flags))
+		mm = current->mm;
+
 	rcu_read_lock();
 	memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
 	if (likely(memcg))
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c96c1f2b9cf5..42465e523caa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -945,7 +945,8 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
 		mm = current->mm;
 		if (unlikely(!mm))
 			return root_mem_cgroup;
-	}
+	} else if (test_bit(MMF_MSHARE, &mm->flags))
+		mm = current->mm;
 
 	rcu_read_lock();
 	do {
diff --git a/mm/mshare.c b/mm/mshare.c
index 0a75bd3928fc..276fb825cc9a 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -432,6 +432,9 @@ msharefs_fill_mm(struct inode *inode)
 	if (ret)
 		goto err_free;
 
+#ifdef CONFIG_MEMCG
+	mm->owner = NULL;
+#endif
 	return 0;
 
 err_free:
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 20/20] mm/mshare: associate a mem cgroup with an mshare file
  2025-04-04  2:18 [PATCH v2 00/20] Add support for shared PTEs across processes Anthony Yznaga
                   ` (18 preceding siblings ...)
  2025-04-04  2:19 ` [PATCH v2 19/20] mm/mshare: get memcg from current->mm instead of mshare mm Anthony Yznaga
@ 2025-04-04  2:19 ` Anthony Yznaga
  19 siblings, 0 replies; 33+ messages in thread
From: Anthony Yznaga @ 2025-04-04  2:19 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

This patch shows one approach to associating a specific mem cgroup with
an mshare file and was inspired by code in mem_cgroup_sk_alloc().
Essentially when a process creates an mshare region, a reference is
taken on the mem cgroup that the process belongs to and a pointer to
the memcg is saved. At fault time set_active_memcg() is used to
temporarily enable charging of __GFP_ACCOUNT allocations to the saved
memcg. This does consolidate pagetable charges to a single memcg, but
there are issues to address such as how to handle the case where the
memcg is deleted but becomes a hidden, zombie memcg because the
mshare file has a reference to it.
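
The charging at fault time uses the usual set_active_memcg() bracket so
that __GFP_ACCOUNT allocations (such as the shared page tables) are
charged to the saved memcg rather than to the faulting task. Condensed
into a hypothetical helper for illustration (the actual change is
open-coded in do_user_addr_fault() below):

static vm_fault_t mshare_fault_charged(struct mem_cgroup *mshare_memcg,
				       struct vm_area_struct *vma,
				       unsigned long addr,
				       unsigned int flags,
				       struct pt_regs *regs)
{
	struct mem_cgroup *old_memcg;
	vm_fault_t fault;

	old_memcg = set_active_memcg(mshare_memcg);
	fault = handle_mm_fault(vma, addr, flags, regs);
	set_active_memcg(old_memcg);

	return fault;
}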

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 arch/x86/mm/fault.c | 11 +++++++++++
 include/linux/mm.h  |  5 +++++
 mm/mshare.c         | 33 +++++++++++++++++++++++++++++++++
 3 files changed, 49 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 49659d2f9316..f79186b76ffe 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -20,6 +20,7 @@
 #include <linux/mm_types.h>
 #include <linux/mm.h>			/* find_and_lock_vma() */
 #include <linux/vmalloc.h>
+#include <linux/memcontrol.h>
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
@@ -1218,6 +1219,8 @@ void do_user_addr_fault(struct pt_regs *regs,
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 	bool is_shared_vma;
 	unsigned long addr;
+	struct mem_cgroup *mshare_memcg;
+	struct mem_cgroup *memcg;
 
 	tsk = current;
 	mm = tsk->mm;
@@ -1374,6 +1377,8 @@ void do_user_addr_fault(struct pt_regs *regs,
 	}
 
 	if (unlikely(vma_is_mshare(vma))) {
+		mshare_memcg = get_mshare_memcg(vma);
+
 		fault = find_shared_vma(&vma, &addr);
 
 		if (fault) {
@@ -1401,6 +1406,9 @@ void do_user_addr_fault(struct pt_regs *regs,
 		return;
 	}
 
+	if (is_shared_vma && mshare_memcg)
+		memcg = set_active_memcg(mshare_memcg);
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
@@ -1416,6 +1424,9 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 */
 	fault = handle_mm_fault(vma, addr, flags, regs);
 
+	if (is_shared_vma && mshare_memcg)
+		set_active_memcg(memcg);
+
 	if (unlikely(is_shared_vma) && ((fault & VM_FAULT_COMPLETED) ||
 	    (fault & VM_FAULT_RETRY) || fault_signal_pending(fault, regs)))
 		mmap_read_unlock(mm);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9e64deae3d64..e848c29eafe4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1179,12 +1179,17 @@ static inline bool vma_is_anon_shmem(struct vm_area_struct *vma) { return false;
 int vma_is_stack_for_current(struct vm_area_struct *vma);
 
 #ifdef CONFIG_MSHARE
+struct mem_cgroup *get_mshare_memcg(struct vm_area_struct *vma);
 vm_fault_t find_shared_vma(struct vm_area_struct **vma, unsigned long *addrp);
 static inline bool vma_is_mshare(const struct vm_area_struct *vma)
 {
 	return vma->vm_flags & VM_MSHARE;
 }
 #else
+static inline struct mem_cgroup *get_mshare_memcg(struct vm_area_struct *vma)
+{
+	return NULL;
+}
 static inline vm_fault_t find_shared_vma(struct vm_area_struct **vma, unsigned long *addrp)
 {
 	WARN_ON_ONCE(1);
diff --git a/mm/mshare.c b/mm/mshare.c
index 276fb825cc9a..509b1ae8ce72 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -16,6 +16,7 @@
 
 #include <linux/fs.h>
 #include <linux/fs_context.h>
+#include <linux/memcontrol.h>
 #include <linux/mman.h>
 #include <linux/mmu_notifier.h>
 #include <uapi/linux/magic.h>
@@ -34,8 +35,22 @@ struct mshare_data {
 	unsigned long size;
 	unsigned long flags;
 	struct mmu_notifier mn;
+#ifdef CONFIG_MEMCG
+	struct mem_cgroup *memcg;
+#endif
 };
 
+struct mem_cgroup *get_mshare_memcg(struct vm_area_struct *vma)
+{
+	struct mshare_data *m_data = vma->vm_private_data;
+
+#ifdef CONFIG_MEMCG
+	return m_data->memcg;
+#else
+	return NULL;
+#endif
+}
+
 static void mshare_invalidate_tlbs(struct mmu_notifier *mn, struct mm_struct *mm,
 				   unsigned long start, unsigned long end)
 {
@@ -408,6 +423,9 @@ msharefs_fill_mm(struct inode *inode)
 	struct mm_struct *mm;
 	struct mshare_data *m_data = NULL;
 	int ret = 0;
+#ifdef CONFIG_MEMCG
+	struct mem_cgroup *memcg;
+#endif
 
 	mm = mm_alloc();
 	if (!mm) {
@@ -434,6 +452,17 @@ msharefs_fill_mm(struct inode *inode)
 
 #ifdef CONFIG_MEMCG
 	mm->owner = NULL;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (mem_cgroup_is_root(memcg))
+		goto out;
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		goto out;
+	if (css_tryget(&memcg->css))
+		m_data->memcg = memcg;
+out:
+	rcu_read_unlock();
 #endif
 	return 0;
 
@@ -447,6 +476,10 @@ msharefs_fill_mm(struct inode *inode)
 static void
 msharefs_delmm(struct mshare_data *m_data)
 {
+#ifdef CONFIG_MEMCG
+	if (m_data->memcg)
+		css_put(&m_data->memcg->css);
+#endif
 	mmput(m_data->mm);
 	kfree(m_data);
 }
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 08/20] mm/mshare: flush all TLBs when updating PTEs in an mshare range
  2025-04-04  2:18 ` [PATCH v2 08/20] mm/mshare: flush all TLBs when updating PTEs in an mshare range Anthony Yznaga
@ 2025-05-30 14:41   ` Jann Horn
  2025-05-30 16:29     ` Anthony Yznaga
  0 siblings, 1 reply; 33+ messages in thread
From: Jann Horn @ 2025-05-30 14:41 UTC (permalink / raw)
  To: Anthony Yznaga
  Cc: akpm, willy, markhemm, viro, david, khalid, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz

On Fri, Apr 4, 2025 at 4:18 AM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
> Unlike the mm of a task, an mshare host mm is not updated on context
> switch. In particular this means that mm_cpumask is never updated
> which results in TLB flushes for updates to mshare PTEs only being
> done on the local CPU. To ensure entries are flushed for non-local
> TLBs, set up an mmu notifier on the mshare mm and use the
> .arch_invalidate_secondary_tlbs callback to flush all TLBs.
> arch_invalidate_secondary_tlbs guarantees that TLB entries will be
> flushed before pages are freed when unmapping pages in an mshare region.

Thanks for working on this, I think this is a really nice feature.

An issue that I think this series doesn't address is:
There could be mmu_notifiers (for things like KVM or SVA IOMMU) that
want to be notified on changes to an mshare VMA; if those are not
invoked, we could get UAF of page contents. So either we propagate MMU
notifier invocations in the host mm into the mshare regions that use
it, or we'd have to somehow prevent a process from using MMU notifiers
and mshare at the same time.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 12/20] mm/mshare: prepare for page table sharing support
  2025-04-04  2:18 ` [PATCH v2 12/20] mm/mshare: prepare for page table sharing support Anthony Yznaga
@ 2025-05-30 14:56   ` Jann Horn
  2025-05-30 16:41     ` Anthony Yznaga
  0 siblings, 1 reply; 33+ messages in thread
From: Jann Horn @ 2025-05-30 14:56 UTC (permalink / raw)
  To: Anthony Yznaga
  Cc: akpm, willy, markhemm, viro, david, khalid, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz, Lorenzo Stoakes,
	Liam Howlett

On Fri, Apr 4, 2025 at 4:18 AM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
> In preparation for enabling the handling of page faults in an mshare
> region provide a way to link an mshare shared page table to a process
> page table and otherwise find the actual vma in order to handle a page
> fault. Modify the unmap path to ensure that page tables in mshare regions
> are unlinked and kept intact when a process exits or an mshare region
> is explicitly unmapped.
>
> Signed-off-by: Khalid Aziz <khalid@kernel.org>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
[...]
> diff --git a/mm/memory.c b/mm/memory.c
> index db558fe43088..68422b606819 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
[...]
> @@ -259,7 +260,10 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
>                 next = p4d_addr_end(addr, end);
>                 if (p4d_none_or_clear_bad(p4d))
>                         continue;
> -               free_pud_range(tlb, p4d, addr, next, floor, ceiling);
> +               if (unlikely(shared_pud))
> +                       p4d_clear(p4d);
> +               else
> +                       free_pud_range(tlb, p4d, addr, next, floor, ceiling);
>         } while (p4d++, addr = next, addr != end);
>
>         start &= PGDIR_MASK;
[...]
> +static void mshare_vm_op_unmap_page_range(struct mmu_gather *tlb,
> +                               struct vm_area_struct *vma,
> +                               unsigned long addr, unsigned long end,
> +                               struct zap_details *details)
> +{
> +       /*
> +        * The msharefs vma is being unmapped. Do not unmap pages in the
> +        * mshare region itself.
> +        */
> +}

Unmapping a VMA has three major phases:

1. unlinking the VMA from the VMA tree
2. removing the VMA contents
3. removing unneeded page tables

The MM subsystem broadly assumes that after phase 2, no stuff is
mapped in the region anymore and therefore changes to the backing file
don't need to TLB-flush this VMA anymore, and unlinks the mapping from
rmaps and such. If munmap() of an mshare region only removes the
mapping of shared page tables in step 3, as implemented here, that
means things like TLB flushes won't be able to discover all
currently-existing mshare mappings of a host MM through rmap walks.

I think it would make more sense to remove the links to shared page
tables in step 2 (meaning in mshare_vm_op_unmap_page_range), just like
hugetlb does, and not modify free_pgtables().

>  static const struct vm_operations_struct msharefs_vm_ops = {
>         .may_split = mshare_vm_op_split,
>         .mprotect = mshare_vm_op_mprotect,
> +       .unmap_page_range = mshare_vm_op_unmap_page_range,
>  };
>
>  /*
> --
> 2.43.5
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 08/20] mm/mshare: flush all TLBs when updating PTEs in an mshare range
  2025-05-30 14:41   ` Jann Horn
@ 2025-05-30 16:29     ` Anthony Yznaga
  2025-05-30 17:46       ` Jann Horn
  0 siblings, 1 reply; 33+ messages in thread
From: Anthony Yznaga @ 2025-05-30 16:29 UTC (permalink / raw)
  To: Jann Horn
  Cc: akpm, willy, markhemm, viro, david, khalid, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz



On 5/30/25 7:41 AM, Jann Horn wrote:
> On Fri, Apr 4, 2025 at 4:18 AM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>> Unlike the mm of a task, an mshare host mm is not updated on context
>> switch. In particular this means that mm_cpumask is never updated
>> which results in TLB flushes for updates to mshare PTEs only being
>> done on the local CPU. To ensure entries are flushed for non-local
>> TLBs, set up an mmu notifier on the mshare mm and use the
>> .arch_invalidate_secondary_tlbs callback to flush all TLBs.
>> arch_invalidate_secondary_tlbs guarantees that TLB entries will be
>> flushed before pages are freed when unmapping pages in an mshare region.
> 
> Thanks for working on this, I think this is a really nice feature.
> 
> An issue that I think this series doesn't address is:
> There could be mmu_notifiers (for things like KVM or SVA IOMMU) that
> want to be notified on changes to an mshare VMA; if those are not
> invoked, we could get UAF of page contents. So either we propagate MMU
> notifier invocations in the host mm into the mshare regions that use
> it, or we'd have to somehow prevent a process from using MMU notifiers
> and mshare at the same time.

Thanks, Jann. I've noted this as an issue. Ultimately I think the
notifier calls will need to be propagated. It's going to be tricky, but
I have some ideas.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 12/20] mm/mshare: prepare for page table sharing support
  2025-05-30 14:56   ` Jann Horn
@ 2025-05-30 16:41     ` Anthony Yznaga
  2025-06-02 15:26       ` Jann Horn
  0 siblings, 1 reply; 33+ messages in thread
From: Anthony Yznaga @ 2025-05-30 16:41 UTC (permalink / raw)
  To: Jann Horn
  Cc: akpm, willy, markhemm, viro, david, khalid, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz, Lorenzo Stoakes,
	Liam Howlett



On 5/30/25 7:56 AM, Jann Horn wrote:
> On Fri, Apr 4, 2025 at 4:18 AM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>> In preparation for enabling the handling of page faults in an mshare
>> region provide a way to link an mshare shared page table to a process
>> page table and otherwise find the actual vma in order to handle a page
>> fault. Modify the unmap path to ensure that page tables in mshare regions
>> are unlinked and kept intact when a process exits or an mshare region
>> is explicitly unmapped.
>>
>> Signed-off-by: Khalid Aziz <khalid@kernel.org>
>> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
> [...]
>> diff --git a/mm/memory.c b/mm/memory.c
>> index db558fe43088..68422b606819 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
> [...]
>> @@ -259,7 +260,10 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
>>                  next = p4d_addr_end(addr, end);
>>                  if (p4d_none_or_clear_bad(p4d))
>>                          continue;
>> -               free_pud_range(tlb, p4d, addr, next, floor, ceiling);
>> +               if (unlikely(shared_pud))
>> +                       p4d_clear(p4d);
>> +               else
>> +                       free_pud_range(tlb, p4d, addr, next, floor, ceiling);
>>          } while (p4d++, addr = next, addr != end);
>>
>>          start &= PGDIR_MASK;
> [...]
>> +static void mshare_vm_op_unmap_page_range(struct mmu_gather *tlb,
>> +                               struct vm_area_struct *vma,
>> +                               unsigned long addr, unsigned long end,
>> +                               struct zap_details *details)
>> +{
>> +       /*
>> +        * The msharefs vma is being unmapped. Do not unmap pages in the
>> +        * mshare region itself.
>> +        */
>> +}
> 
> Unmapping a VMA has three major phases:
> 
> 1. unlinking the VMA from the VMA tree
> 2. removing the VMA contents
> 3. removing unneeded page tables
> 
> The MM subsystem broadly assumes that after phase 2, no stuff is
> mapped in the region anymore and therefore changes to the backing file
> don't need to TLB-flush this VMA anymore, and unlinks the mapping from
> rmaps and such. If munmap() of an mshare region only removes the
> mapping of shared page tables in step 3, as implemented here, that
> means things like TLB flushes won't be able to discover all
> currently-existing mshare mappings of a host MM through rmap walks.
> 
> I think it would make more sense to remove the links to shared page
> tables in step 2 (meaning in mshare_vm_op_unmap_page_range), just like
> hugetlb does, and not modify free_pgtables().

That makes sense. I'll make this change.

Thanks!
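
A rough sketch of that direction, purely for illustration (this is not the
updated patch; TLB flush bookkeeping and locking are left out, and it
assumes the shared tables are linked at the p4d level, as in the hunk
quoted above):

static void mshare_vm_op_unmap_page_range(struct mmu_gather *tlb,
				struct vm_area_struct *vma,
				unsigned long addr, unsigned long end,
				struct zap_details *details)
{
	struct mm_struct *mm = vma->vm_mm;
	unsigned long next;
	pgd_t *pgd;
	p4d_t *p4d;

	/*
	 * Unlink the shared page tables while the VMA contents are being
	 * removed, similar to hugetlb, instead of deferring the unlink to
	 * free_pgtables(). Only the link (the process p4d entry pointing
	 * at the mshare-owned table) is cleared; the shared page tables
	 * themselves stay intact.
	 */
	for (; addr < end; addr = next) {
		next = p4d_addr_end(addr, end);
		pgd = pgd_offset(mm, addr);
		if (pgd_none_or_clear_bad(pgd))
			continue;
		p4d = p4d_offset(pgd, addr);
		if (p4d_none(*p4d))
			continue;
		p4d_clear(p4d);
	}
}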

> 
>>   static const struct vm_operations_struct msharefs_vm_ops = {
>>          .may_split = mshare_vm_op_split,
>>          .mprotect = mshare_vm_op_mprotect,
>> +       .unmap_page_range = mshare_vm_op_unmap_page_range,
>>   };
>>
>>   /*
>> --
>> 2.43.5
>>



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 08/20] mm/mshare: flush all TLBs when updating PTEs in an mshare range
  2025-05-30 16:29     ` Anthony Yznaga
@ 2025-05-30 17:46       ` Jann Horn
  2025-05-30 22:47         ` Anthony Yznaga
  0 siblings, 1 reply; 33+ messages in thread
From: Jann Horn @ 2025-05-30 17:46 UTC (permalink / raw)
  To: Anthony Yznaga
  Cc: akpm, willy, markhemm, viro, david, khalid, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz

On Fri, May 30, 2025 at 6:30 PM Anthony Yznaga
<anthony.yznaga@oracle.com> wrote:
> On 5/30/25 7:41 AM, Jann Horn wrote:
> > On Fri, Apr 4, 2025 at 4:18 AM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
> >> Unlike the mm of a task, an mshare host mm is not updated on context
> >> switch. In particular this means that mm_cpumask is never updated
> >> which results in TLB flushes for updates to mshare PTEs only being
> >> done on the local CPU. To ensure entries are flushed for non-local
> >> TLBs, set up an mmu notifier on the mshare mm and use the
> >> .arch_invalidate_secondary_tlbs callback to flush all TLBs.
> >> arch_invalidate_secondary_tlbs guarantees that TLB entries will be
> >> flushed before pages are freed when unmapping pages in an mshare region.
> >
> > Thanks for working on this, I think this is a really nice feature.
> >
> > An issue that I think this series doesn't address is:
> > There could be mmu_notifiers (for things like KVM or SVA IOMMU) that
> > want to be notified on changes to an mshare VMA; if those are not
> > invoked, we could get UAF of page contents. So either we propagate MMU
> > notifier invocations in the host mm into the mshare regions that use
> > it, or we'd have to somehow prevent a process from using MMU notifiers
> > and mshare at the same time.
>
> Thanks, Jann. I've noted this as an issue. Ultimately I think the
> notifiers calls will need to be propagated. It's going to be tricky, but
> I have some ideas.

Very naively I think you could basically register your own notifier on
the host mm that has notifier callbacks vaguely like this that walk
the rmap of the mshare file and invoke nested mmu notifiers on each
VMA that maps the file, basically like unmap_mapping_pages() except
that you replace unmap_mapping_range_vma() with a notifier invocation?

static int mshare_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
    const struct mmu_notifier_range *range)
{
  struct vm_area_struct *vma;
  pgoff_t first_index, last_index;

  if (range->end < host_mm->mmap_base)
    return 0;
  first_index = (max(range->start, host_mm->mmap_base) -
                 host_mm->mmap_base) / PAGE_SIZE;
  last_index = (range->end - host_mm->mmap_base) / PAGE_SIZE;
  i_mmap_lock_read(mapping);
  vma_interval_tree_foreach(vma, &mapping->i_mmap, first_index, last_index) {
    struct mmu_notifier_range nested_range;

    [... same math as in unmap_mapping_range_tree ...]
    mmu_notifier_range_init(&nested_range, range->event, vma->vm_mm,
                            nested_start, nested_end);
    mmu_notifier_invalidate_range_start(&nested_range);
  }
  i_mmap_unlock_read(mapping);
}

And ensure that when mm_take_all_locks() encounters an mshare VMA, it
basically recursively does mm_take_all_locks() on the mshare host mm?

I think that might be enough to make it work, and the rest beyond that
would be optimizations?
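
(As an aside, the mapping and host_mm used in the sketch would presumably
live in some per-msharefs-file state that embeds the notifier, e.g. a
made-up structure along these lines, recovered in the callback via
container_of():)

struct mshare_notifier_data {
	struct mm_struct	*host_mm;	/* mm holding the shared page tables */
	struct address_space	*mapping;	/* msharefs file's i_mmap tree */
	struct mmu_notifier	notifier;	/* registered on host_mm */
};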


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 08/20] mm/mshare: flush all TLBs when updating PTEs in an mshare range
  2025-05-30 17:46       ` Jann Horn
@ 2025-05-30 22:47         ` Anthony Yznaga
  0 siblings, 0 replies; 33+ messages in thread
From: Anthony Yznaga @ 2025-05-30 22:47 UTC (permalink / raw)
  To: Jann Horn
  Cc: akpm, willy, markhemm, viro, david, khalid, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz



On 5/30/25 10:46 AM, Jann Horn wrote:
> On Fri, May 30, 2025 at 6:30 PM Anthony Yznaga
> <anthony.yznaga@oracle.com> wrote:
>> On 5/30/25 7:41 AM, Jann Horn wrote:
>>> On Fri, Apr 4, 2025 at 4:18 AM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>>>> Unlike the mm of a task, an mshare host mm is not updated on context
>>>> switch. In particular this means that mm_cpumask is never updated
>>>> which results in TLB flushes for updates to mshare PTEs only being
>>>> done on the local CPU. To ensure entries are flushed for non-local
>>>> TLBs, set up an mmu notifier on the mshare mm and use the
>>>> .arch_invalidate_secondary_tlbs callback to flush all TLBs.
>>>> arch_invalidate_secondary_tlbs guarantees that TLB entries will be
>>>> flushed before pages are freed when unmapping pages in an mshare region.
>>>
>>> Thanks for working on this, I think this is a really nice feature.
>>>
>>> An issue that I think this series doesn't address is:
>>> There could be mmu_notifiers (for things like KVM or SVA IOMMU) that
>>> want to be notified on changes to an mshare VMA; if those are not
>>> invoked, we could get UAF of page contents. So either we propagate MMU
>>> notifier invocations in the host mm into the mshare regions that use
>>> it, or we'd have to somehow prevent a process from using MMU notifiers
>>> and mshare at the same time.
>>
>> Thanks, Jann. I've noted this as an issue. Ultimately I think the
>> notifiers calls will need to be propagated. It's going to be tricky, but
>> I have some ideas.
> 
> Very naively I think you could basically register your own notifier on
> the host mm that has notifier callbacks vaguely like this that walk
> the rmap of the mshare file and invoke nested mmu notifiers on each
> VMA that maps the file, basically like unmap_mapping_pages() except
> that you replace unmap_mapping_range_vma() with a notifier invocation?
> 
> static int mshare_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>      const struct mmu_notifier_range *range)
> {
>    struct vm_area_struct *vma;
>    pgoff_t first_index, last_index;
> 
>    if (range->end < host_mm->mmap_base)
>      return 0;
>    first_index = (max(range->start, host_mm->mmap_base) -
> host_mm->mmap_base) / PAGE_SIZE;
>    last_index = (range->end - host_mm->mmap_base) / PAGE_SIZE;
>    i_mmap_lock_read(mapping);
>    vma_interval_tree_foreach(vma, &mapping->i_mmap, first_index, last_index) {
>      struct mmu_notifier_range nested_range;
> 
>      [... same math as in unmap_mapping_range_tree ...]
>      mmu_notifier_range_init(&nested_range, range->event, vma->vm_mm,
> nested_start, nested_end);
>      mmu_notifier_invalidate_range_start(&nested_range);
>    }
>    i_mmap_unlock_read(mapping);
> }
> 
> And ensure that when mm_take_all_locks() encounters an mshare VMA, it
> basically recursively does mm_take_all_locks() on the mshare host mm?
> 
> I think that might be enough to make it work, and the rest beyond that
> would be optimizations?

I figured the vma interval tree would need to be walked. I hadn't 
considered mm_take_all_locks(), though. This is definitely a good 
starting point. Thanks for this!



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 12/20] mm/mshare: prepare for page table sharing support
  2025-05-30 16:41     ` Anthony Yznaga
@ 2025-06-02 15:26       ` Jann Horn
  2025-06-02 22:02         ` Anthony Yznaga
  0 siblings, 1 reply; 33+ messages in thread
From: Jann Horn @ 2025-06-02 15:26 UTC (permalink / raw)
  To: Anthony Yznaga
  Cc: akpm, willy, markhemm, viro, david, khalid, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz, Lorenzo Stoakes,
	Liam Howlett

On Fri, May 30, 2025 at 6:42 PM Anthony Yznaga
<anthony.yznaga@oracle.com> wrote:
> On 5/30/25 7:56 AM, Jann Horn wrote:
> > On Fri, Apr 4, 2025 at 4:18 AM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
> >> In preparation for enabling the handling of page faults in an mshare
> >> region provide a way to link an mshare shared page table to a process
> >> page table and otherwise find the actual vma in order to handle a page
> >> fault. Modify the unmap path to ensure that page tables in mshare regions
> >> are unlinked and kept intact when a process exits or an mshare region
> >> is explicitly unmapped.
> >>
> >> Signed-off-by: Khalid Aziz <khalid@kernel.org>
> >> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> >> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
> > [...]
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index db558fe43088..68422b606819 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> > [...]
> >> @@ -259,7 +260,10 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
> >>                  next = p4d_addr_end(addr, end);
> >>                  if (p4d_none_or_clear_bad(p4d))
> >>                          continue;
> >> -               free_pud_range(tlb, p4d, addr, next, floor, ceiling);
> >> +               if (unlikely(shared_pud))
> >> +                       p4d_clear(p4d);
> >> +               else
> >> +                       free_pud_range(tlb, p4d, addr, next, floor, ceiling);
> >>          } while (p4d++, addr = next, addr != end);
> >>
> >>          start &= PGDIR_MASK;
> > [...]
> >> +static void mshare_vm_op_unmap_page_range(struct mmu_gather *tlb,
> >> +                               struct vm_area_struct *vma,
> >> +                               unsigned long addr, unsigned long end,
> >> +                               struct zap_details *details)
> >> +{
> >> +       /*
> >> +        * The msharefs vma is being unmapped. Do not unmap pages in the
> >> +        * mshare region itself.
> >> +        */
> >> +}
> >
> > Unmapping a VMA has three major phases:
> >
> > 1. unlinking the VMA from the VMA tree
> > 2. removing the VMA contents
> > 3. removing unneeded page tables
> >
> > The MM subsystem broadly assumes that after phase 2, no stuff is
> > mapped in the region anymore and therefore changes to the backing file
> > don't need to TLB-flush this VMA anymore, and unlinks the mapping from
> > rmaps and such. If munmap() of an mshare region only removes the
> > mapping of shared page tables in step 3, as implemented here, that
> > means things like TLB flushes won't be able to discover all
> > currently-existing mshare mappings of a host MM through rmap walks.
> >
> > I think it would make more sense to remove the links to shared page
> > tables in step 2 (meaning in mshare_vm_op_unmap_page_range), just like
> > hugetlb does, and not modify free_pgtables().
>
> That makes sense. I'll make this change.

Related: I think there needs to be a strategy for preventing walking
of mshare host page tables through an mshare VMA by codepaths relying
on MM/VMA locks, because those locks won't have an effect on the
underlying host MM. For example, I think the only reason fork() is
safe with your proposal is that copy_page_range() skips shared VMAs,
and I think non-fast get_user_pages() could maybe hit use-after-free
of page tables or such?

I guess the only clean strategy for that is to ensure that all
locking-based page table walking code does a check for "is this an
mshare VMA?" and, if yes, either bails immediately or takes extra
locks on the host MM (which could get messy).


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 12/20] mm/mshare: prepare for page table sharing support
  2025-06-02 15:26       ` Jann Horn
@ 2025-06-02 22:02         ` Anthony Yznaga
  0 siblings, 0 replies; 33+ messages in thread
From: Anthony Yznaga @ 2025-06-02 22:02 UTC (permalink / raw)
  To: Jann Horn
  Cc: akpm, willy, markhemm, viro, david, khalid, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz, Lorenzo Stoakes,
	Liam Howlett



On 6/2/25 8:26 AM, Jann Horn wrote:
> On Fri, May 30, 2025 at 6:42 PM Anthony Yznaga
> <anthony.yznaga@oracle.com> wrote:
>> On 5/30/25 7:56 AM, Jann Horn wrote:
>>> On Fri, Apr 4, 2025 at 4:18 AM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>>>> In preparation for enabling the handling of page faults in an mshare
>>>> region provide a way to link an mshare shared page table to a process
>>>> page table and otherwise find the actual vma in order to handle a page
>>>> fault. Modify the unmap path to ensure that page tables in mshare regions
>>>> are unlinked and kept intact when a process exits or an mshare region
>>>> is explicitly unmapped.
>>>>
>>>> Signed-off-by: Khalid Aziz <khalid@kernel.org>
>>>> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>>>> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
>>> [...]
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index db558fe43088..68422b606819 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>> [...]
>>>> @@ -259,7 +260,10 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
>>>>                   next = p4d_addr_end(addr, end);
>>>>                   if (p4d_none_or_clear_bad(p4d))
>>>>                           continue;
>>>> -               free_pud_range(tlb, p4d, addr, next, floor, ceiling);
>>>> +               if (unlikely(shared_pud))
>>>> +                       p4d_clear(p4d);
>>>> +               else
>>>> +                       free_pud_range(tlb, p4d, addr, next, floor, ceiling);
>>>>           } while (p4d++, addr = next, addr != end);
>>>>
>>>>           start &= PGDIR_MASK;
>>> [...]
>>>> +static void mshare_vm_op_unmap_page_range(struct mmu_gather *tlb,
>>>> +                               struct vm_area_struct *vma,
>>>> +                               unsigned long addr, unsigned long end,
>>>> +                               struct zap_details *details)
>>>> +{
>>>> +       /*
>>>> +        * The msharefs vma is being unmapped. Do not unmap pages in the
>>>> +        * mshare region itself.
>>>> +        */
>>>> +}
>>>
>>> Unmapping a VMA has three major phases:
>>>
>>> 1. unlinking the VMA from the VMA tree
>>> 2. removing the VMA contents
>>> 3. removing unneeded page tables
>>>
>>> The MM subsystem broadly assumes that after phase 2, no stuff is
>>> mapped in the region anymore and therefore changes to the backing file
>>> don't need to TLB-flush this VMA anymore, and unlinks the mapping from
>>> rmaps and such. If munmap() of an mshare region only removes the
>>> mapping of shared page tables in step 3, as implemented here, that
>>> means things like TLB flushes won't be able to discover all
>>> currently-existing mshare mappings of a host MM through rmap walks.
>>>
>>> I think it would make more sense to remove the links to shared page
>>> tables in step 2 (meaning in mshare_vm_op_unmap_page_range), just like
>>> hugetlb does, and not modify free_pgtables().
>>
>> That makes sense. I'll make this change.
> 
> Related: I think there needs to be a strategy for preventing walking
> of mshare host page tables through an mshare VMA by codepaths relying
> on MM/VMA locks, because those locks won't have an effect on the
> underlying host MM. For example, I think the only reason fork() is
> safe with your proposal is that copy_page_range() skips shared VMAs,
> and I think non-fast get_user_pages() could maybe hit use-after-free
> of page tables or such?
> 
> I guess the only clean strategy for that is to ensure that all
> locking-based page table walking code does a check for "is this an
> mshare VMA?" and, if yes, either bails immediately or takes extra
> locks on the host MM (which could get messy).

Thanks. Yes, I need to audit all VMA / page table scanning. This series 
already has a patch to avoid scanning mshare VMAs for numa migration, 
but more issues are lurking.
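
To make the last point concrete, the kind of check being described would
look roughly like this at each locking-based walker (illustrative only;
vma_is_mshare() is the helper the series already adds, the rest is made
up):

/*
 * Page tables under an mshare VMA belong to the mshare host mm and are
 * not protected by the caller's mmap/VMA locks, so a walker relying on
 * those locks must either bail out or take the host mm's locks instead.
 */
static bool mshare_walk_allowed(struct vm_area_struct *vma)
{
	return !vma_is_mshare(vma);
}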


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 13/20] x86/mm: enable page table sharing
  2025-04-04  2:18 ` [PATCH v2 13/20] x86/mm: enable page table sharing Anthony Yznaga
@ 2025-08-12 13:46   ` Yongting Lin
  2025-08-12 17:12     ` Anthony Yznaga
  0 siblings, 1 reply; 33+ messages in thread
From: Yongting Lin @ 2025-08-12 13:46 UTC (permalink / raw)
  To: anthony.yznaga
  Cc: akpm, andreyknvl, arnd, brauner, catalin.marinas, dave.hansen,
	david, ebiederm, khalid, linux-arch, linux-kernel, linux-mm, luto,
	markhemm, maz, mhiramat, neilb, pcc, rostedt, vasily.averin, viro,
	willy, xhao

Hi,

On 4/4/25 10:18 AM, Anthony Yznaga wrote:
> Enable x86 support for handling page faults in an mshare region by
> redirecting page faults to operate on the mshare mm_struct and vmas
> contained in it.
> Some permissions checks are done using vma flags in architecture-specific
> fault handling code so the actual vma needed to complete the handling
> is acquired before calling handle_mm_fault(). Because of this an
> ARCH_SUPPORTS_MSHARE config option is added.
>
> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
> ---
>  arch/Kconfig        |  3 +++
>  arch/x86/Kconfig    |  1 +
>  arch/x86/mm/fault.c | 37 ++++++++++++++++++++++++++++++++++++-
>  mm/Kconfig          |  2 +-
>  4 files changed, 41 insertions(+), 2 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 9f6eb09ef12d..2e000fefe9b3 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -1652,6 +1652,9 @@ config HAVE_ARCH_PFN_VALID
>  config ARCH_SUPPORTS_DEBUG_PAGEALLOC
>  	bool
>  
> +config ARCH_SUPPORTS_MSHARE
> +	bool
> +
>  config ARCH_SUPPORTS_PAGE_TABLE_CHECK
>  	bool
>  
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 1502fd0c3c06..1f1779decb44 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -125,6 +125,7 @@ config X86
>  	select ARCH_SUPPORTS_ACPI
>  	select ARCH_SUPPORTS_ATOMIC_RMW
>  	select ARCH_SUPPORTS_DEBUG_PAGEALLOC
> +	select ARCH_SUPPORTS_MSHARE		if X86_64
>  	select ARCH_SUPPORTS_PAGE_TABLE_CHECK	if X86_64
>  	select ARCH_SUPPORTS_NUMA_BALANCING	if X86_64
>  	select ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP	if NR_CPUS <= 4096
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 296d294142c8..49659d2f9316 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1216,6 +1216,8 @@ void do_user_addr_fault(struct pt_regs *regs,
>  	struct mm_struct *mm;
>  	vm_fault_t fault;
>  	unsigned int flags = FAULT_FLAG_DEFAULT;
> +	bool is_shared_vma;
> +	unsigned long addr;
>  
>  	tsk = current;
>  	mm = tsk->mm;
> @@ -1329,6 +1331,12 @@ void do_user_addr_fault(struct pt_regs *regs,
>  	if (!vma)
>  		goto lock_mmap;
>  
> +	/* mshare does not support per-VMA locks yet */
> +	if (vma_is_mshare(vma)) {
> +		vma_end_read(vma);
> +		goto lock_mmap;
> +	}
> +
>  	if (unlikely(access_error(error_code, vma))) {
>  		bad_area_access_error(regs, error_code, address, NULL, vma);
>  		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
> @@ -1357,17 +1365,38 @@ void do_user_addr_fault(struct pt_regs *regs,
>  lock_mmap:
>  
>  retry:
> +	addr = address;
> +	is_shared_vma = false;
>  	vma = lock_mm_and_find_vma(mm, address, regs);
>  	if (unlikely(!vma)) {
>  		bad_area_nosemaphore(regs, error_code, address);
>  		return;
>  	}
>  
> +	if (unlikely(vma_is_mshare(vma))) {
> +		fault = find_shared_vma(&vma, &addr);
> +
> +		if (fault) {
> +			mmap_read_unlock(mm);
> +			goto done;
> +		}
> +
> +		if (!vma) {
> +			mmap_read_unlock(mm);
> +			bad_area_nosemaphore(regs, error_code, address);
> +			return;
> +		}
> +
> +		is_shared_vma = true;
> +	}
> +
>  	/*
>  	 * Ok, we have a good vm_area for this memory access, so
>  	 * we can handle it..
>  	 */
>  	if (unlikely(access_error(error_code, vma))) {
> +		if (unlikely(is_shared_vma))
> +			mmap_read_unlock(vma->vm_mm);
>  		bad_area_access_error(regs, error_code, address, mm, vma);
>  		return;
>  	}
> @@ -1385,7 +1414,11 @@ void do_user_addr_fault(struct pt_regs *regs,
>  	 * userland). The return to userland is identified whenever
>  	 * FAULT_FLAG_USER|FAULT_FLAG_KILLABLE are both set in flags.
>  	 */
> -	fault = handle_mm_fault(vma, address, flags, regs);
> +	fault = handle_mm_fault(vma, addr, flags, regs);
> +
> +	if (unlikely(is_shared_vma) && ((fault & VM_FAULT_COMPLETED) ||
> +	    (fault & VM_FAULT_RETRY) || fault_signal_pending(fault, regs)))
> +		mmap_read_unlock(mm);

I was backporting the mshare patches to the 5.15 kernel and running some
basic tests when I found a potential issue.

Reaching this point means find_shared_vma() has executed successfully
and host_mm->mmap_lock has been taken.

When the returned fault has VM_FAULT_COMPLETED or VM_FAULT_RETRY set, or
fault_signal_pending(fault, regs) is true, there is no later chance to
release the locks of both mm and host_mm (i.e. vma->vm_mm) in the code
that follows.

As a result, vma->vm_mm's mmap_lock needs to be released as well.

So it is supposed to be like below:

-	fault = handle_mm_fault(vma, address, flags, regs);
+	fault = handle_mm_fault(vma, addr, flags, regs);
+
+	if (unlikely(is_shared_vma) && ((fault & VM_FAULT_COMPLETED) ||
+	    (fault & VM_FAULT_RETRY) || fault_signal_pending(fault, regs))) {
+		mmap_read_unlock(vma->vm_mm);
+		mmap_read_unlock(mm);
+	}

>  
>  	if (fault_signal_pending(fault, regs)) {
>  		/*
> @@ -1413,6 +1446,8 @@ void do_user_addr_fault(struct pt_regs *regs,
>  		goto retry;
>  	}
>  
> +	if (unlikely(is_shared_vma))
> +		mmap_read_unlock(vma->vm_mm);
>  	mmap_read_unlock(mm);
>  done:
>  	if (likely(!(fault & VM_FAULT_ERROR)))
> diff --git a/mm/Kconfig b/mm/Kconfig
> index e6c90db83d01..8a5a159457f2 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1344,7 +1344,7 @@ config PT_RECLAIM
>  
>  config MSHARE
>  	bool "Mshare"
> -	depends on MMU
> +	depends on MMU && ARCH_SUPPORTS_MSHARE
>  	help
>  	  Enable msharefs: A ram-based filesystem that allows multiple
>  	  processes to share page table entries for shared pages. A file

Yongting Lin.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 13/20] x86/mm: enable page table sharing
  2025-08-12 13:46   ` Yongting Lin
@ 2025-08-12 17:12     ` Anthony Yznaga
  2025-08-18  9:44       ` Yongting Lin
  0 siblings, 1 reply; 33+ messages in thread
From: Anthony Yznaga @ 2025-08-12 17:12 UTC (permalink / raw)
  To: Yongting Lin
  Cc: akpm, andreyknvl, arnd, brauner, catalin.marinas, dave.hansen,
	david, ebiederm, khalid, linux-arch, linux-kernel, linux-mm, luto,
	markhemm, maz, mhiramat, neilb, pcc, rostedt, vasily.averin, viro,
	willy, xhao



On 8/12/25 6:46 AM, Yongting Lin wrote:
> Hi,
> 
> On 4/4/25 10:18 AM, Anthony Yznaga wrote:
>> Enable x86 support for handling page faults in an mshare region by
>> redirecting page faults to operate on the mshare mm_struct and vmas
>> contained in it.
>> Some permissions checks are done using vma flags in architecture-specific
>> fault handling code so the actual vma needed to complete the handling
>> is acquired before calling handle_mm_fault(). Because of this an
>> ARCH_SUPPORTS_MSHARE config option is added.
>>
>> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
>> ---
>>   arch/Kconfig        |  3 +++
>>   arch/x86/Kconfig    |  1 +
>>   arch/x86/mm/fault.c | 37 ++++++++++++++++++++++++++++++++++++-
>>   mm/Kconfig          |  2 +-
>>   4 files changed, 41 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/Kconfig b/arch/Kconfig
>> index 9f6eb09ef12d..2e000fefe9b3 100644
>> --- a/arch/Kconfig
>> +++ b/arch/Kconfig
>> @@ -1652,6 +1652,9 @@ config HAVE_ARCH_PFN_VALID
>>   config ARCH_SUPPORTS_DEBUG_PAGEALLOC
>>   	bool
>>   
>> +config ARCH_SUPPORTS_MSHARE
>> +	bool
>> +
>>   config ARCH_SUPPORTS_PAGE_TABLE_CHECK
>>   	bool
>>   
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 1502fd0c3c06..1f1779decb44 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -125,6 +125,7 @@ config X86
>>   	select ARCH_SUPPORTS_ACPI
>>   	select ARCH_SUPPORTS_ATOMIC_RMW
>>   	select ARCH_SUPPORTS_DEBUG_PAGEALLOC
>> +	select ARCH_SUPPORTS_MSHARE		if X86_64
>>   	select ARCH_SUPPORTS_PAGE_TABLE_CHECK	if X86_64
>>   	select ARCH_SUPPORTS_NUMA_BALANCING	if X86_64
>>   	select ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP	if NR_CPUS <= 4096
>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>> index 296d294142c8..49659d2f9316 100644
>> --- a/arch/x86/mm/fault.c
>> +++ b/arch/x86/mm/fault.c
>> @@ -1216,6 +1216,8 @@ void do_user_addr_fault(struct pt_regs *regs,
>>   	struct mm_struct *mm;
>>   	vm_fault_t fault;
>>   	unsigned int flags = FAULT_FLAG_DEFAULT;
>> +	bool is_shared_vma;
>> +	unsigned long addr;
>>   
>>   	tsk = current;
>>   	mm = tsk->mm;
>> @@ -1329,6 +1331,12 @@ void do_user_addr_fault(struct pt_regs *regs,
>>   	if (!vma)
>>   		goto lock_mmap;
>>   
>> +	/* mshare does not support per-VMA locks yet */
>> +	if (vma_is_mshare(vma)) {
>> +		vma_end_read(vma);
>> +		goto lock_mmap;
>> +	}
>> +
>>   	if (unlikely(access_error(error_code, vma))) {
>>   		bad_area_access_error(regs, error_code, address, NULL, vma);
>>   		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
>> @@ -1357,17 +1365,38 @@ void do_user_addr_fault(struct pt_regs *regs,
>>   lock_mmap:
>>   
>>   retry:
>> +	addr = address;
>> +	is_shared_vma = false;
>>   	vma = lock_mm_and_find_vma(mm, address, regs);
>>   	if (unlikely(!vma)) {
>>   		bad_area_nosemaphore(regs, error_code, address);
>>   		return;
>>   	}
>>   
>> +	if (unlikely(vma_is_mshare(vma))) {
>> +		fault = find_shared_vma(&vma, &addr);
>> +
>> +		if (fault) {
>> +			mmap_read_unlock(mm);
>> +			goto done;
>> +		}
>> +
>> +		if (!vma) {
>> +			mmap_read_unlock(mm);
>> +			bad_area_nosemaphore(regs, error_code, address);
>> +			return;
>> +		}
>> +
>> +		is_shared_vma = true;
>> +	}
>> +
>>   	/*
>>   	 * Ok, we have a good vm_area for this memory access, so
>>   	 * we can handle it..
>>   	 */
>>   	if (unlikely(access_error(error_code, vma))) {
>> +		if (unlikely(is_shared_vma))
>> +			mmap_read_unlock(vma->vm_mm);
>>   		bad_area_access_error(regs, error_code, address, mm, vma);
>>   		return;
>>   	}
>> @@ -1385,7 +1414,11 @@ void do_user_addr_fault(struct pt_regs *regs,
>>   	 * userland). The return to userland is identified whenever
>>   	 * FAULT_FLAG_USER|FAULT_FLAG_KILLABLE are both set in flags.
>>   	 */
>> -	fault = handle_mm_fault(vma, address, flags, regs);
>> +	fault = handle_mm_fault(vma, addr, flags, regs);
>> +
>> +	if (unlikely(is_shared_vma) && ((fault & VM_FAULT_COMPLETED) ||
>> +	    (fault & VM_FAULT_RETRY) || fault_signal_pending(fault, regs)))
>> +		mmap_read_unlock(mm);
> 
> I was backporting these patches of mshare to 5.15 kernel and trying to do some
> basic tests. Then found a potential issue.
> 
> Reaching here means find_shared_vma function has been executed successfully
> and host_mm->mmap_lock has got locked.
> 
> When returned fault variable has VM_FAULT_COMPLETED or VM_FAULT_RETRY flags,
> or fault_signal_pending(fault, regs) takes true, there is not chance to release
> locks of both mm and host_mm(i.e. vma->vm_mm) in the following Snippet of Code.

If VM_FAULT_COMPLETED or VM_FAULT_RETRY are returned then the 
host_mm->mmap_lock will already have been released. See 
fault_dirty_shared_page(), filemap_fault(), and other callers of 
maybe_unlock_mmap_for_io(). I will add a comment to help make this 
clearer. I also realized that fault_signal_pending() does not need to be 
called here because it can only return true if VM_FAULT_RETRY is set so 
I'll change that, too.

The checks in a 5.15 kernel will be different. You probably want 
something like:

         if (unlikely(is_shared_vma) && (((fault & VM_FAULT_RETRY) &&
             (flags & FAULT_FLAG_ALLOW_RETRY)) ||
             fault_signal_pending(fault, regs)))
                 mmap_read_unlock(mm);

Anthony

> 
> As a result, needs to release vma->vm_mm.mmap_lock as well.
> 
> So it is supposed to be like below:
> 
> -	fault = handle_mm_fault(vma, address, flags, regs);
> +	fault = handle_mm_fault(vma, addr, flags, regs);
> +
> +	if (unlikely(is_shared_vma) && ((fault & VM_FAULT_COMPLETED) ||
> +	    (fault & VM_FAULT_RETRY) || fault_signal_pending(fault, regs))) {
> +		mmap_read_unlock(vma->vm_mm);
> +		mmap_read_unlock(mm);
> +	}
> 
>>   
>>   	if (fault_signal_pending(fault, regs)) {
>>   		/*
>> @@ -1413,6 +1446,8 @@ void do_user_addr_fault(struct pt_regs *regs,
>>   		goto retry;
>>   	}
>>   
>> +	if (unlikely(is_shared_vma))
>> +		mmap_read_unlock(vma->vm_mm);
>>   	mmap_read_unlock(mm);
>>   done:
>>   	if (likely(!(fault & VM_FAULT_ERROR)))
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index e6c90db83d01..8a5a159457f2 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -1344,7 +1344,7 @@ config PT_RECLAIM
>>   
>>   config MSHARE
>>   	bool "Mshare"
>> -	depends on MMU
>> +	depends on MMU && ARCH_SUPPORTS_MSHARE
>>   	help
>>   	  Enable msharefs: A ram-based filesystem that allows multiple
>>   	  processes to share page table entries for shared pages. A file
> 
> Yongting Lin.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 13/20] x86/mm: enable page table sharing
  2025-08-12 17:12     ` Anthony Yznaga
@ 2025-08-18  9:44       ` Yongting Lin
  2025-08-20  1:32         ` Anthony Yznaga
  0 siblings, 1 reply; 33+ messages in thread
From: Yongting Lin @ 2025-08-18  9:44 UTC (permalink / raw)
  To: anthony.yznaga, shuah, skhan
  Cc: akpm, andreyknvl, arnd, brauner, catalin.marinas, dave.hansen,
	david, ebiederm, khalid, linux-arch, linux-kernel, linux-mm,
	linyongting, luto, markhemm, maz, mhiramat, neilb, pcc, rostedt,
	vasily.averin, viro, willy, xhao, linux-kselftest, libo.gcs85,
	yuanzhu

Thank you, Anthony!

Yep, I checked the comments in the arch/x86/mm/fault.c file, which say the
same as your advice in the previous email.


I changed my code in the 5.15 kernel as below:

       if (unlikely(is_shared_vma) && ((fault & VM_FAULT_RETRY) &&
           (flags & FAULT_FLAG_ALLOW_RETRY) || fault_signal_pending(fault, regs)))
               mmap_read_unlock(mm);

BTW: I wrote some selftests in my GitHub repository which exercise the
basic functionality of mshare, and I will write more complex cases to
cover new functionality or defects found in mshare. For example, once you
support using an mshare VMA with KVM (addressing the issue Jann Horn
pointed out), I will add extra test cases to verify correctness for that
scenario.

From Jann Horn's review:
https://lore.kernel.org/all/CAG48ez3cUZf+xOtP6UkkS2-CmOeo+3K5pvny0AFL_XBkHh5q_g@mail.gmail.com/

Currently, my selftests are in my GitHub repository, and you can retrieve
them as below:

    git remote add yongting-mshare-selftests https://github.com/ivanalgo/linux-kernel-develop/
    git fetch yongting-mshare-selftests dev-mshare-v2-selftest-v1
    git cherry-pick a64f2ff6497d13c09badc0fc68c44d9995bc2fef

At this stage, I am not sure what is the best way to proceed:
- Should I send them as part of your next version (v3)?
- Or should I post them separately as [RFC PATCH] for early review?

Please let me know your preference; any suggestion is welcome.
I am happy to rebase and resend in the format that works best for
the community.

Thanks
Yongting

> Anthony
>
>>
>> As a result, needs to release vma->vm_mm.mmap_lock as well.
>>
>> So it is supposed to be like below:
>>
>> -    fault = handle_mm_fault(vma, address, flags, regs);
>> +    fault = handle_mm_fault(vma, addr, flags, regs);
>> +
>> +    if (unlikely(is_shared_vma) && ((fault & VM_FAULT_COMPLETED) ||
>> +        (fault & VM_FAULT_RETRY) || fault_signal_pending(fault, regs))) {
>> +        mmap_read_unlock(vma->vm_mm);
>> +        mmap_read_unlock(mm);
>> +    }
>>
>>>         if (fault_signal_pending(fault, regs)) {
>>>           /*
>>> @@ -1413,6 +1446,8 @@ void do_user_addr_fault(struct pt_regs *regs,
>>>           goto retry;
>>>       }
>>>   +    if (unlikely(is_shared_vma))
>>> +        mmap_read_unlock(vma->vm_mm);
>>>       mmap_read_unlock(mm);
>>>   done:
>>>       if (likely(!(fault & VM_FAULT_ERROR)))
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index e6c90db83d01..8a5a159457f2 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -1344,7 +1344,7 @@ config PT_RECLAIM
>>>     config MSHARE
>>>       bool "Mshare"
>>> -    depends on MMU
>>> +    depends on MMU && ARCH_SUPPORTS_MSHARE
>>>       help
>>>         Enable msharefs: A ram-based filesystem that allows multiple
>>>         processes to share page table entries for shared pages. A file 
>>
>> Yongting Lin. 
>
>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 13/20] x86/mm: enable page table sharing
  2025-08-18  9:44       ` Yongting Lin
@ 2025-08-20  1:32         ` Anthony Yznaga
  0 siblings, 0 replies; 33+ messages in thread
From: Anthony Yznaga @ 2025-08-20  1:32 UTC (permalink / raw)
  To: Yongting Lin, shuah, skhan
  Cc: akpm, andreyknvl, arnd, brauner, catalin.marinas, dave.hansen,
	david, ebiederm, khalid, linux-arch, linux-kernel, linux-mm, luto,
	markhemm, maz, mhiramat, neilb, pcc, rostedt, vasily.averin, viro,
	willy, xhao, linux-kselftest, libo.gcs85, yuanzhu

Hi Yongting,

On 8/18/25 2:44 AM, Yongting Lin wrote:
> Thank you! Anthony.
> 
> Yep, I checked the comments in arch/mm/x86/fault.c file which says as your
> advices in previous email.
> 
> 
> I changed my code in kernel 5.5 as below:
> 
>         if (unlikely(is_shared_vma) && ((fault & VM_FAULT_RETRY) &&
>             (flags & FAULT_FLAG_ALLOW_RETRY) || fault_signal_pending(fault, regs)))
>                 mmap_read_unlock(mm);
> 
> BTW: I wrote some selftests in my github repostory, which perform
> the basic function of mshare, and I will write some complicated cases
> to support the new functions or defect found in mshare. For example,
> once you support mshare as a VMA in KVM (just as the defeat viewed by
> Jann Horn), I will add extra test cases to verify its correctiness for
> this scenario.

This is great! I'll take a look at them in more detail. I just sent out 
an updated series.

> 
>  From Jann Horn's review:
> https://lore.kernel.org/all/CAG48ez3cUZf+xOtP6UkkS2-CmOeo+3K5pvny0AFL_XBkHh5q_g@mail.gmail.com/

My new series does not yet have support for mmu notifiers. It's 
something I'm working on, but there are key issues to overcome. One is 
that I need to update the implementation of mm_take_all_locks() to also 
carefully take all locks in any mapped mshare regions. The other is that 
passing through mmu notifier calls for arch_invalidate_secondary_tlbs 
callbacks is especially tricky because the callback is not allowed to 
sleep due to holding a ptl spin lock.

> 
> Currently, I put my selftest in my github repostory, and you could retrieve it
> as below:
> 
>      git remote add yongting-mshare-selftests https://github.com/ivanalgo/linux-kernel-develop/
>      git fetch yongting-mshare-selftests dev-mshare-v2-selftest-v1
>      git cherry-pick a64f2ff6497d13c09badc0fc68c44d9995bc2fef
> 
> At this stage, I am not sure what is the best way to proceed:
> - Should I send them as part of your next version (v3)?
> - Or should I post them separately as [RFC PATCH] for early review?
> 
> Please let me know your preference and any sugestion is welcome.
> I am happy to rebase and resend in the format that works best for
> the community.

I may have more feedback once I take a look. I suggest starting by
updating them to work with the series I just sent out.

Thanks,
Anthony
  >
> Thanks
> Yongting
> 
>> Anthony
>>
>>>
>>> As a result, needs to release vma->vm_mm.mmap_lock as well.
>>>
>>> So it is supposed to be like below:
>>>
>>> -    fault = handle_mm_fault(vma, address, flags, regs);
>>> +    fault = handle_mm_fault(vma, addr, flags, regs);
>>> +
>>> +    if (unlikely(is_shared_vma) && ((fault & VM_FAULT_COMPLETED) ||
>>> +        (fault & VM_FAULT_RETRY) || fault_signal_pending(fault, regs))) {
>>> +        mmap_read_unlock(vma->vm_mm);
>>> +        mmap_read_unlock(mm);
>>> +    }
>>>
>>>>          if (fault_signal_pending(fault, regs)) {
>>>>            /*
>>>> @@ -1413,6 +1446,8 @@ void do_user_addr_fault(struct pt_regs *regs,
>>>>            goto retry;
>>>>        }
>>>>    +    if (unlikely(is_shared_vma))
>>>> +        mmap_read_unlock(vma->vm_mm);
>>>>        mmap_read_unlock(mm);
>>>>    done:
>>>>        if (likely(!(fault & VM_FAULT_ERROR)))
>>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>>> index e6c90db83d01..8a5a159457f2 100644
>>>> --- a/mm/Kconfig
>>>> +++ b/mm/Kconfig
>>>> @@ -1344,7 +1344,7 @@ config PT_RECLAIM
>>>>      config MSHARE
>>>>        bool "Mshare"
>>>> -    depends on MMU
>>>> +    depends on MMU && ARCH_SUPPORTS_MSHARE
>>>>        help
>>>>          Enable msharefs: A ram-based filesystem that allows multiple
>>>>          processes to share page table entries for shared pages. A file
>>>
>>> Yongting Lin.
>>
>>



^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2025-08-20  1:33 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-04  2:18 [PATCH v2 00/20] Add support for shared PTEs across processes Anthony Yznaga
2025-04-04  2:18 ` [PATCH v2 01/20] mm: Add msharefs filesystem Anthony Yznaga
2025-04-04  2:18 ` [PATCH v2 02/20] mm/mshare: pre-populate msharefs with information file Anthony Yznaga
2025-04-04  2:18 ` [PATCH v2 03/20] mm/mshare: make msharefs writable and support directories Anthony Yznaga
2025-04-04  2:18 ` [PATCH v2 04/20] mm/mshare: allocate an mm_struct for msharefs files Anthony Yznaga
2025-04-04  2:18 ` [PATCH v2 05/20] mm/mshare: add ways to set the size of an mshare region Anthony Yznaga
2025-04-04  2:18 ` [PATCH v2 06/20] mm/mshare: Add a vma flag to indicate " Anthony Yznaga
2025-04-04  2:18 ` [PATCH v2 07/20] mm/mshare: Add mmap support Anthony Yznaga
2025-04-04  2:18 ` [PATCH v2 08/20] mm/mshare: flush all TLBs when updating PTEs in an mshare range Anthony Yznaga
2025-05-30 14:41   ` Jann Horn
2025-05-30 16:29     ` Anthony Yznaga
2025-05-30 17:46       ` Jann Horn
2025-05-30 22:47         ` Anthony Yznaga
2025-04-04  2:18 ` [PATCH v2 09/20] sched/numa: do not scan msharefs vmas Anthony Yznaga
2025-04-04  2:18 ` [PATCH v2 10/20] mm: add mmap_read_lock_killable_nested() Anthony Yznaga
2025-04-04  2:18 ` [PATCH v2 11/20] mm: add and use unmap_page_range vm_ops hook Anthony Yznaga
2025-04-04  2:18 ` [PATCH v2 12/20] mm/mshare: prepare for page table sharing support Anthony Yznaga
2025-05-30 14:56   ` Jann Horn
2025-05-30 16:41     ` Anthony Yznaga
2025-06-02 15:26       ` Jann Horn
2025-06-02 22:02         ` Anthony Yznaga
2025-04-04  2:18 ` [PATCH v2 13/20] x86/mm: enable page table sharing Anthony Yznaga
2025-08-12 13:46   ` Yongting Lin
2025-08-12 17:12     ` Anthony Yznaga
2025-08-18  9:44       ` Yongting Lin
2025-08-20  1:32         ` Anthony Yznaga
2025-04-04  2:18 ` [PATCH v2 14/20] mm: create __do_mmap() to take an mm_struct * arg Anthony Yznaga
2025-04-04  2:18 ` [PATCH v2 15/20] mm: pass the mm in vma_munmap_struct Anthony Yznaga
2025-04-04  2:18 ` [PATCH v2 16/20] mm/mshare: Add an ioctl for mapping objects in an mshare region Anthony Yznaga
2025-04-04  2:18 ` [PATCH v2 17/20] mm/mshare: Add an ioctl for unmapping " Anthony Yznaga
2025-04-04  2:19 ` [PATCH v2 18/20] mm/mshare: provide a way to identify an mm as an mshare host mm Anthony Yznaga
2025-04-04  2:19 ` [PATCH v2 19/20] mm/mshare: get memcg from current->mm instead of mshare mm Anthony Yznaga
2025-04-04  2:19 ` [PATCH v2 20/20] mm/mshare: associate a mem cgroup with an mshare file Anthony Yznaga

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).