linux-arch.vger.kernel.org archive mirror
* [RFC PATCH v3 00/10] Add support for shared PTEs across processes
@ 2024-09-03 23:22 Anthony Yznaga
  2024-09-03 23:22 ` [RFC PATCH v3 01/10] mm: Add msharefs filesystem Anthony Yznaga
                   ` (12 more replies)
  0 siblings, 13 replies; 38+ messages in thread
From: Anthony Yznaga @ 2024-09-03 23:22 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

[Note: I have taken on the mshare work from Khalid. This series
is a continuation of the series last sent out in 2022[1] and not
of the later ptshare series.]

Memory pages shared between processes require page table entries
(PTEs) in each process. Each of these PTEs consumes some memory, and
as long as the number of mappings being maintained is small, the
space consumed by page tables is not objectionable. When very few
memory pages are shared between processes, the number of PTEs to
maintain is mostly constrained by the number of pages of memory on
the system. As the number of shared pages and the number of times
pages are shared go up, the amount of memory consumed by page tables
starts to become significant. This issue does not apply to threads:
any number of threads can share the same pages inside a process
while sharing the same PTEs. Extending this model to sharing pages
across processes eliminates the issue for cross-process sharing as
well.

Field deployments commonly see memory pages shared across thousands
of processes. On x86_64, each page requires a PTE that is 8 bytes
long, which is very small compared to the 4K page size. When 2000
processes map the same page in their address space, each one of them
requires 8 bytes for its PTE, and together that adds up to 16K of
memory just to hold the PTEs for one 4K page. On a database server
with a 300GB SGA, a system crash due to an out-of-memory condition
was seen when 1500+ clients tried to share this SGA even though the
system had 512GB of memory. On this server, the worst-case scenario
of all 1500 processes mapping every page of the SGA would have
required 878GB+ just for the PTEs. If these PTEs could be shared, a
substantial amount of memory could be saved.
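
For reference, the worst-case figure follows directly from the
numbers above:

        300GB / 4KB page size     = ~78.6 million pages in the SGA
        78.6M pages * 8 bytes     = ~600MB of PTEs per process
        600MB * 1500 processes    = ~879GB of PTEs in total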

This patch series implements a mechanism that allows userspace
processes to opt into sharing PTEs. It adds a new in-memory
filesystem - msharefs. A file created on msharefs represents a
shared region where all processes mapping that region will map
objects within it with shared PTEs. When the file is created,
a new host mm struct is created to hold the shared page tables
and vmas for objects later mapped into the shared region. This
host mm struct is associated with the file and not with a task.
When a process mmap's the shared region, a vm flag VM_SHARED_PT
is added to the vma. On page fault the vma is checked for the
presence of the VM_SHARED_PT flag. If found, the host mm is
searched for a vma that covers the fault address. Fault handling
then continues using that host vma which establishes PTEs in the
host mm. Fault handling in a shared region also links the shared
page table to the process page table if the shared page table
already exists.
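
In rough outline, the fault-path redirection added in patch 08 of
this series looks like this (condensed from that patch):

        /* in handle_mm_fault() */
        if (unlikely(vma_is_shared(vma))) {
                /* switch to the host vma and translate the address */
                ret = find_shared_vma(&vma, &address);
                if (ret)
                        return ret;
                if (!vma)
                        return VM_FAULT_SIGSEGV;
                shared = true;
        }

find_shared_vma() takes the host mm's mmap lock for read, links the
host pgd entry into the faulting process's page tables if the two
differ, and otherwise returns the host vma so that fault handling
continues against the host mm.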

Ioctls are used to set/get the start address and size of the host
mm, to map objects into the shared region, and to (eventually)
perform other operations on the shared objects such as changing
protections.
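
For reference, the ioctl commands and their argument structures, as
defined in include/uapi/linux/msharefs.h by patches 05 and 10, are:

        #define MSHAREFS_GET_SIZE       _IOR('x', 0,  struct mshare_info)
        #define MSHAREFS_SET_SIZE       _IOW('x', 1,  struct mshare_info)
        #define MSHAREFS_CREATE_MAPPING _IOW('x', 2,  struct mshare_create)

        struct mshare_info {
                __u64 start;
                __u64 size;
        };

        struct mshare_create {
                __u64 addr;
                __u64 size;
                __u64 offset;
                __u32 prot;
                __u32 flags;
                __u32 fd;
        };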

One major issue to address for this series to function correctly
is how to ensure proper TLB flushing when a page in a shared
region is unmapped. For example, since the rmaps for pages in a
shared region map back to host vmas which point to a host mm, TLB
flushes won't be directed to the CPUs the sharing processes have
run on. I am by no means an expert in this area. One idea is to
install a mmu_notifier on the host mm that can gather the necessary
data and do flushes similar to the batch flushing.
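
As a rough illustration of that idea only (this is not part of the
series, and the names below are made up), a notifier registered
against the host mm might look like the sketch below, with the
overly conservative flush_tlb_all() eventually replaced by targeted,
batched flushes:

        /* Hypothetical sketch, not in this series. */
        static void mshare_invalidate_range_end(struct mmu_notifier *mn,
                                const struct mmu_notifier_range *range)
        {
                /*
                 * The host mm's cpumask does not track the CPUs that
                 * the sharing processes ran on, so flush everywhere.
                 */
                flush_tlb_all();
        }

        static const struct mmu_notifier_ops mshare_mmu_notifier_ops = {
                .invalidate_range_end   = mshare_invalidate_range_end,
        };

        /* registered once per mshare file against the host mm, e.g.: */
        /* m_data->notifier.ops = &mshare_mmu_notifier_ops;            */
        /* mmu_notifier_register(&m_data->notifier, m_data->mm);       */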


API
===

mshare does not introduce a new API. It instead uses existing APIs
to implement page table sharing. The steps to use this feature are:

1. Mount msharefs on /sys/fs/mshare -
        mount -t msharefs msharefs /sys/fs/mshare

2. mshare regions have alignment and size requirements. The start
   address of a region must be aligned to a fixed boundary, and the
   size must be a multiple of that same value. This value can be
   obtained by reading the file /sys/fs/mshare/mshare_info, which
   returns a number in text format.

3. For the process creating an mshare region:
        a. Create a file on /sys/fs/mshare, for example -
                fd = open("/sys/fs/mshare/shareme",
                                O_RDWR|O_CREAT|O_EXCL, 0600);

        b. Establish the starting address and size of the region
                struct mshare_info minfo;

                minfo.start = TB(2);
                minfo.size = BUFFER_SIZE;
                ioctl(fd, MSHAREFS_SET_SIZE, &minfo)

        c. Map some memory in the region
                struct mshare_create mcreate;

                mcreate.addr = TB(2);
                mcreate.size = BUFFER_SIZE;
                mcreate.offset = 0;
                mcreate.prot = PROT_READ | PROT_WRITE;
                mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
                mcreate.fd = -1;

                ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate)

        d. Map the mshare region into the process
                mmap((void *)TB(2), BUFFER_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);

        e. Write to and read from the mshare region normally.

4. For processes attaching an mshare region:
        a. Open the file on msharefs, for example -
                fd = open("/sys/fs/mshare/shareme", O_RDWR);

        b. Get information about mshare'd region from the file:
                struct mshare_info minfo;

                ioctl(fd, MSHAREFS_GET_SIZE, &minfo);
        
        c. Map the mshare'd region into the process
                mmap((void *)minfo.start, minfo.size,
                        PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

5. To delete the mshare region -
                unlink("/sys/fs/mshare/shareme");



Example Code
============

A snippet of the code a donor process would run looks like this:

-----------------
        struct mshare_info minfo;
        struct mshare_create mcreate;
        unsigned long alignsize, start, size;
        char req[128] = { 0 };
        char *addr;
        int fd, ret;

        fd = open("/sys/fs/mshare/mshare_info", O_RDONLY);
        read(fd, req, sizeof(req) - 1);
        alignsize = atol(req);
        close(fd);

        fd = open("/sys/fs/mshare/shareme", O_RDWR|O_CREAT|O_EXCL, 0600);
        start = alignsize * 4;
        size = alignsize * 2;

        minfo.start = start;
        minfo.size = size;
        ret = ioctl(fd, MSHAREFS_SET_SIZE, &minfo);
        if (ret < 0)
                perror("ERROR: MSHAREFS_SET_SIZE");

        mcreate.addr = start;
        mcreate.size = size;
        mcreate.offset = 0;
        mcreate.prot = PROT_READ | PROT_WRITE;
        mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
        mcreate.fd = -1;
        ret = ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate);
        if (ret < 0)
                perror("ERROR: MSHAREFS_CREATE_MAPPING");

        addr = mmap((void *)start, size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
                perror("ERROR: mmap failed");

        strncpy(addr, "Some random shared text",
                        sizeof("Some random shared text"));
-----------------


A snippet of the code a consumer process would execute looks like this:

-----------------
        struct mshare_info minfo;
        char *addr;
        int fd, ret;

        fd = open("/sys/fs/mshare/shareme", O_RDWR);

        ret = ioctl(fd, MSHAREFS_GET_SIZE, &minfo);
        if (ret < 0)
                perror("ERROR: MSHAREFS_GET_SIZE");
        if (!minfo.start)
                perror("ERROR: mshare region not init'd");

        addr = mmap((void *)minfo.start, minfo.size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
                perror("ERROR: mmap failed");

        printf("Guest mmap at %p:\n", addr);
        printf("%s\n", addr);
        printf("\nDone\n");

-----------------


v2 -> v3:
        - Now based on 6.11-rc5
        - Addressed many comments from v2.
        - Simplified filesystem code. Removed refcounting of the
          shared mm_struct allocated for an mshare file. The mm_struct
          and the pagetables and mappings it contains are freed when
          the inode is evicted.
        - Switched to an ioctl-based interface. Ioctls implemented
          are used to set and get the start address and size of an
          mshare region and to map objects into an mshare region
          (only anon shared memory is supported in this series).
        - Updated example code

[1] v2: https://lore.kernel.org/linux-mm/cover.1656531090.git.khalid.aziz@oracle.com/


v1 -> v2:
        - Eliminated mshare and mshare_unlink system calls and
          replaced API with standard mmap and unlink (Based upon
          v1 patch discussions and LSF/MM discussions)
        - All fd based API (based upon feedback and suggestions from
          Andy Lutomirski, Eric Biederman, Kirill and others)
        - Added a file /sys/fs/mshare/mshare_info to provide
          alignment and size requirement info (based upon feedback
          from Dave Hansen, Mark Hemment and discussions at LSF/MM)
        - Addressed TODOs in v1
        - Added support for directories in msharefs
        - Added locks around any time vma is touched (Dave Hansen)
        - Eliminated the need to point vm_mm in original vmas to the
          newly synthesized mshare mm
        - Ensured mmap_read_unlock is called for correct mm in
          handle_mm_fault (Dave Hansen)


Anthony Yznaga (3):
  mm/mshare: allocate an mm_struct for msharefs files
  mm: create __do_mmap() to take an mm_struct * arg
  mshare: add MSHAREFS_CREATE_MAPPING

Khalid Aziz (7):
  mm: Add msharefs filesystem
  mm/mshare: pre-populate msharefs with information file
  mm/mshare: make msharefs writable and support directories
  mm/mshare: Add ioctl support
  mm/mshare: Add vm flag for shared PTEs
  mm/mshare: Add mmap support
  mm/mshare: Add basic page table sharing support

 Documentation/filesystems/msharefs.rst        | 107 ++++
 .../userspace-api/ioctl/ioctl-number.rst      |   1 +
 include/linux/mm.h                            |  25 +-
 include/trace/events/mmflags.h                |   3 +
 include/uapi/linux/magic.h                    |   1 +
 include/uapi/linux/msharefs.h                 |  38 ++
 mm/Makefile                                   |   2 +-
 mm/internal.h                                 |   6 +
 mm/memory.c                                   |  62 +-
 mm/mmap.c                                     |  12 +-
 mm/mshare.c                                   | 582 ++++++++++++++++++
 11 files changed, 827 insertions(+), 12 deletions(-)
 create mode 100644 Documentation/filesystems/msharefs.rst
 create mode 100644 include/uapi/linux/msharefs.h
 create mode 100644 mm/mshare.c

-- 
2.43.5



* [RFC PATCH v3 01/10] mm: Add msharefs filesystem
  2024-09-03 23:22 [RFC PATCH v3 00/10] Add support for shared PTEs across processes Anthony Yznaga
@ 2024-09-03 23:22 ` Anthony Yznaga
  2024-09-03 23:22 ` [RFC PATCH v3 02/10] mm/mshare: pre-populate msharefs with information file Anthony Yznaga
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Anthony Yznaga @ 2024-09-03 23:22 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

From: Khalid Aziz <khalid@kernel.org>

Add a ram-based filesystem that contains page table sharing
information and files that enable processes to share page tables.
This patch adds the basic filesystem that can be mounted.

Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 Documentation/filesystems/msharefs.rst | 107 +++++++++++++++++++++++++
 include/uapi/linux/magic.h             |   1 +
 mm/Makefile                            |   2 +-
 mm/mshare.c                            |  97 ++++++++++++++++++++++
 4 files changed, 206 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/filesystems/msharefs.rst
 create mode 100644 mm/mshare.c

diff --git a/Documentation/filesystems/msharefs.rst b/Documentation/filesystems/msharefs.rst
new file mode 100644
index 000000000000..727cda663b0b
--- /dev/null
+++ b/Documentation/filesystems/msharefs.rst
@@ -0,0 +1,107 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================================
+msharefs - a filesystem to support shared page tables
+=====================================================
+
+msharefs is a ram-based filesystem that allows multiple processes to
+share page table entries for shared pages.
+
+msharefs is typically mounted like this::
+
+	mount -t msharefs none /sys/fs/mshare
+
+A file created on msharefs creates a new shared region where all
+processes mapping that region will map it using shared page table
+entries. ioctls are used to initialize or retrieve the start address
+and size of a shared region and to map objects in the shared
+region. It is important to note that an msharefs file is a control
+file for the shared region and does not contain the contents
+of the region itself.
+
+Here are the basic steps for using mshare::
+
+1. Mount msharefs on /sys/fs/mshare::
+
+	mount -t msharefs msharefs /sys/fs/mshare
+
+2. mshare regions have alignment and size requirements. Start
+   address for the region must be aligned to an address boundary and
+   be a multiple of fixed size. This alignment and size requirement
+   can be obtained by reading the file ``/sys/fs/mshare/mshare_info``
+   which returns a number in text format. mshare regions must be
+   aligned to this boundary and be a multiple of this size.
+
+3. For the process creating an mshare region::
+
+a. Create a file on /sys/fs/mshare, for example:
+
+.. code-block:: c
+
+	fd = open("/sys/fs/mshare/shareme",
+			O_RDWR|O_CREAT|O_EXCL, 0600);
+
+b. Establish the starting address and size of the region:
+
+.. code-block:: c
+
+	struct mshare_info minfo;
+
+	minfo.start = TB(2);
+	minfo.size = BUFFER_SIZE;
+	ioctl(fd, MSHAREFS_SET_SIZE, &minfo);
+
+c. Map some memory in the region:
+
+.. code-block:: c
+
+	struct mshare_create mcreate;
+
+	mcreate.addr = TB(2);
+	mcreate.size = BUFFER_SIZE;
+	mcreate.offset = 0;
+	mcreate.prot = PROT_READ | PROT_WRITE;
+	mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
+	mcreate.fd = -1;
+
+	ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate);
+
+d. Map the mshare region into the process:
+
+.. code-block:: c
+
+	mmap((void *)TB(2), BUF_SIZE,
+		PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+
+e. Write and read to mshared region normally.
+
+
+4. For processes attaching an mshare region::
+
+a. Open the file on msharefs, for example:
+
+.. code-block:: c
+
+	fd = open("/sys/fs/mshare/shareme", O_RDWR);
+
+b. Get information about mshare'd region from the file:
+
+.. code-block:: c
+
+	struct mshare_info minfo;
+
+	ioctl(fd, MSHAREFS_GET_SIZE, &minfo);
+
+c. Map the mshare'd region into the process:
+
+.. code-block:: c
+
+	mmap(minfo.start, minfo.size,
+		PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+
+5. To delete the mshare region:
+
+.. code-block:: c
+
+		unlink("/sys/fs/mshare/shareme");
+
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index bb575f3ab45e..e53dd6063cba 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -103,5 +103,6 @@
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
 #define PID_FS_MAGIC		0x50494446	/* "PIDF" */
+#define MSHARE_MAGIC		0x4d534852	/* "MSHR" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/mm/Makefile b/mm/Makefile
index d2915f8c9dc0..956137de9e1f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -45,7 +45,7 @@ mmu-$(CONFIG_MMU)	+= process_vm_access.o
 endif
 
 ifdef CONFIG_64BIT
-mmu-$(CONFIG_MMU)	+= mseal.o
+mmu-$(CONFIG_MMU)	+= mseal.o mshare.o
 endif
 
 obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
diff --git a/mm/mshare.c b/mm/mshare.c
new file mode 100644
index 000000000000..d5d8e2530e5a
--- /dev/null
+++ b/mm/mshare.c
@@ -0,0 +1,97 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Enable cooperating processes to share page table between
+ * them to reduce the extra memory consumed by multiple copies
+ * of page tables.
+ *
+ * This code adds an in-memory filesystem - msharefs.
+ * msharefs is used to manage page table sharing
+ *
+ *
+ * Copyright (C) 2024 Oracle Corp. All rights reserved.
+ * Author:	Khalid Aziz <khalid@kernel.org>
+ *
+ */
+
+#include <linux/fs.h>
+#include <linux/fs_context.h>
+#include <uapi/linux/magic.h>
+
+static const struct file_operations msharefs_file_operations = {
+	.open		= simple_open,
+	.llseek		= no_llseek,
+};
+
+static const struct super_operations mshare_s_ops = {
+	.statfs		= simple_statfs,
+};
+
+static int
+msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+	struct inode *inode;
+
+	sb->s_blocksize		= PAGE_SIZE;
+	sb->s_blocksize_bits	= PAGE_SHIFT;
+	sb->s_magic		= MSHARE_MAGIC;
+	sb->s_op		= &mshare_s_ops;
+	sb->s_time_gran		= 1;
+
+	inode = new_inode(sb);
+	if (!inode)
+		return -ENOMEM;
+
+	inode->i_ino = 1;
+	inode->i_mode = S_IFDIR | 0777;
+	simple_inode_init_ts(inode);
+	inode->i_op = &simple_dir_inode_operations;
+	inode->i_fop = &simple_dir_operations;
+	set_nlink(inode, 2);
+
+	sb->s_root = d_make_root(inode);
+	if (!sb->s_root)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int
+msharefs_get_tree(struct fs_context *fc)
+{
+	return get_tree_nodev(fc, msharefs_fill_super);
+}
+
+static const struct fs_context_operations msharefs_context_ops = {
+	.get_tree	= msharefs_get_tree,
+};
+
+static int
+mshare_init_fs_context(struct fs_context *fc)
+{
+	fc->ops = &msharefs_context_ops;
+	return 0;
+}
+
+static struct file_system_type mshare_fs = {
+	.name			= "msharefs",
+	.init_fs_context	= mshare_init_fs_context,
+	.kill_sb		= kill_litter_super,
+};
+
+static int __init
+mshare_init(void)
+{
+	int ret;
+
+	ret = sysfs_create_mount_point(fs_kobj, "mshare");
+	if (ret)
+		return ret;
+
+	ret = register_filesystem(&mshare_fs);
+	if (ret)
+		sysfs_remove_mount_point(fs_kobj, "mshare");
+
+	return ret;
+}
+
+core_initcall(mshare_init);
-- 
2.43.5



* [RFC PATCH v3 02/10] mm/mshare: pre-populate msharefs with information file
  2024-09-03 23:22 [RFC PATCH v3 00/10] Add support for shared PTEs across processes Anthony Yznaga
  2024-09-03 23:22 ` [RFC PATCH v3 01/10] mm: Add msharefs filesystem Anthony Yznaga
@ 2024-09-03 23:22 ` Anthony Yznaga
  2024-09-03 23:22 ` [RFC PATCH v3 03/10] mm/mshare: make msharefs writable and support directories Anthony Yznaga
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Anthony Yznaga @ 2024-09-03 23:22 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

From: Khalid Aziz <khalid@kernel.org>

Users of the mshare feature for sharing page tables need to know the
size and alignment requirement for shared regions. Pre-populate
msharefs with a file, mshare_info, that provides this information.

Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/mshare.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 73 insertions(+), 2 deletions(-)

diff --git a/mm/mshare.c b/mm/mshare.c
index d5d8e2530e5a..6562000545e1 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -17,19 +17,73 @@
 #include <linux/fs_context.h>
 #include <uapi/linux/magic.h>
 
+struct msharefs_info {
+	struct dentry *info_dentry;
+};
+
 static const struct file_operations msharefs_file_operations = {
 	.open		= simple_open,
 	.llseek		= no_llseek,
 };
 
+static ssize_t
+mshare_info_read(struct file *file, char __user *buf, size_t nbytes,
+		loff_t *ppos)
+{
+	char s[80];
+
+	sprintf(s, "%ld\n", PGDIR_SIZE);
+	return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
+}
+
+static const struct file_operations mshare_info_ops = {
+	.read	= mshare_info_read,
+	.llseek	= noop_llseek,
+};
+
 static const struct super_operations mshare_s_ops = {
 	.statfs		= simple_statfs,
 };
 
+static int
+msharefs_create_mshare_info(struct super_block *sb)
+{
+	struct msharefs_info *info = sb->s_fs_info;
+	struct dentry *root = sb->s_root;
+	struct dentry *dentry;
+	struct inode *inode;
+	int ret;
+
+	ret = -ENOMEM;
+	inode = new_inode(sb);
+	if (!inode)
+		goto out;
+
+	inode->i_ino = 2;
+	simple_inode_init_ts(inode);
+	inode_init_owner(&nop_mnt_idmap, inode, NULL, S_IFREG | 0444);
+	inode->i_fop = &mshare_info_ops;
+
+	dentry = d_alloc_name(root, "mshare_info");
+	if (!dentry)
+		goto out;
+
+	info->info_dentry = dentry;
+	d_add(dentry, inode);
+
+	return 0;
+out:
+	iput(inode);
+
+	return ret;
+}
+
 static int
 msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
 {
+	struct msharefs_info *info;
 	struct inode *inode;
+	int ret;
 
 	sb->s_blocksize		= PAGE_SIZE;
 	sb->s_blocksize_bits	= PAGE_SHIFT;
@@ -37,6 +91,12 @@ msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
 	sb->s_op		= &mshare_s_ops;
 	sb->s_time_gran		= 1;
 
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info)
+		return -ENOMEM;
+
+	sb->s_fs_info = info;
+
 	inode = new_inode(sb);
 	if (!inode)
 		return -ENOMEM;
@@ -52,7 +112,9 @@ msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
 	if (!sb->s_root)
 		return -ENOMEM;
 
-	return 0;
+	ret = msharefs_create_mshare_info(sb);
+
+	return ret;
 }
 
 static int
@@ -72,10 +134,19 @@ mshare_init_fs_context(struct fs_context *fc)
 	return 0;
 }
 
+static void
+msharefs_kill_super(struct super_block *sb)
+{
+	struct msharefs_info *info = sb->s_fs_info;
+
+	kfree(info);
+	kill_litter_super(sb);
+}
+
 static struct file_system_type mshare_fs = {
 	.name			= "msharefs",
 	.init_fs_context	= mshare_init_fs_context,
-	.kill_sb		= kill_litter_super,
+	.kill_sb		= msharefs_kill_super,
 };
 
 static int __init
-- 
2.43.5



* [RFC PATCH v3 03/10] mm/mshare: make msharefs writable and support directories
  2024-09-03 23:22 [RFC PATCH v3 00/10] Add support for shared PTEs across processes Anthony Yznaga
  2024-09-03 23:22 ` [RFC PATCH v3 01/10] mm: Add msharefs filesystem Anthony Yznaga
  2024-09-03 23:22 ` [RFC PATCH v3 02/10] mm/mshare: pre-populate msharefs with information file Anthony Yznaga
@ 2024-09-03 23:22 ` Anthony Yznaga
  2024-09-03 23:22 ` [RFC PATCH v3 04/10] mm/mshare: allocate an mm_struct for msharefs files Anthony Yznaga
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Anthony Yznaga @ 2024-09-03 23:22 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

From: Khalid Aziz <khalid@kernel.org>

Make the msharefs filesystem writable and allow creating directories
to support better access control for mshare'd regions defined in
msharefs.

Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/mshare.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 116 insertions(+), 1 deletion(-)

diff --git a/mm/mshare.c b/mm/mshare.c
index 6562000545e1..f718b8934cdf 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -21,11 +21,126 @@ struct msharefs_info {
 	struct dentry *info_dentry;
 };
 
+static const struct inode_operations msharefs_dir_inode_ops;
+static const struct inode_operations msharefs_file_inode_ops;
+
 static const struct file_operations msharefs_file_operations = {
 	.open		= simple_open,
 	.llseek		= no_llseek,
 };
 
+static struct inode
+*msharefs_get_inode(struct mnt_idmap *idmap, struct super_block *sb,
+			const struct inode *dir, umode_t mode)
+{
+	struct inode *inode = new_inode(sb);
+
+	if (!inode)
+		return ERR_PTR(-ENOMEM);
+
+	inode->i_ino = get_next_ino();
+	inode_init_owner(&nop_mnt_idmap, inode, dir, mode);
+	simple_inode_init_ts(inode);
+
+	switch (mode & S_IFMT) {
+	case S_IFREG:
+		inode->i_op = &msharefs_file_inode_ops;
+		inode->i_fop = &msharefs_file_operations;
+		break;
+	case S_IFDIR:
+		inode->i_op = &msharefs_dir_inode_ops;
+		inode->i_fop = &simple_dir_operations;
+		inc_nlink(inode);
+		break;
+	default:
+		discard_new_inode(inode);
+		return ERR_PTR(-EINVAL);
+	}
+
+	return inode;
+}
+
+static int
+msharefs_mknod(struct mnt_idmap *idmap, struct inode *dir,
+		struct dentry *dentry, umode_t mode)
+{
+	struct inode *inode;
+
+	inode = msharefs_get_inode(idmap, dir->i_sb, dir, mode);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	d_instantiate(dentry, inode);
+	dget(dentry);
+	inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir));
+
+	return 0;
+}
+
+static int
+msharefs_create(struct mnt_idmap *idmap, struct inode *dir,
+		struct dentry *dentry, umode_t mode, bool excl)
+{
+	return msharefs_mknod(idmap, dir, dentry, mode | S_IFREG);
+}
+
+static int
+msharefs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
+		struct dentry *dentry, umode_t mode)
+{
+	int ret = msharefs_mknod(idmap, dir, dentry, mode | S_IFDIR);
+
+	if (!ret)
+		inc_nlink(dir);
+	return ret;
+}
+
+static inline bool
+is_msharefs_info_file(const struct dentry *dentry)
+{
+	struct msharefs_info *info = dentry->d_sb->s_fs_info;
+
+	return info->info_dentry == dentry;
+}
+
+static int
+msharefs_rename(struct mnt_idmap *idmap,
+		struct inode *old_dir, struct dentry *old_dentry,
+		struct inode *new_dir, struct dentry *new_dentry,
+		unsigned int flags)
+{
+	if (is_msharefs_info_file(old_dentry) ||
+	    is_msharefs_info_file(new_dentry))
+		return -EPERM;
+
+	return simple_rename(idmap, old_dir, old_dentry, new_dir,
+			     new_dentry, flags);
+}
+
+static int
+msharefs_unlink(struct inode *dir, struct dentry *dentry)
+{
+	if (is_msharefs_info_file(dentry))
+		return -EPERM;
+
+	return simple_unlink(dir, dentry);
+}
+
+static const struct inode_operations msharefs_file_inode_ops = {
+	.setattr	= simple_setattr,
+	.getattr	= simple_getattr,
+};
+
+static const struct inode_operations msharefs_dir_inode_ops = {
+	.create		= msharefs_create,
+	.lookup		= simple_lookup,
+	.link		= simple_link,
+	.unlink		= msharefs_unlink,
+	.mkdir		= msharefs_mkdir,
+	.rmdir		= simple_rmdir,
+	.rename		= msharefs_rename,
+};
+
 static ssize_t
 mshare_info_read(struct file *file, char __user *buf, size_t nbytes,
 		loff_t *ppos)
@@ -104,7 +219,7 @@ msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
 	inode->i_ino = 1;
 	inode->i_mode = S_IFDIR | 0777;
 	simple_inode_init_ts(inode);
-	inode->i_op = &simple_dir_inode_operations;
+	inode->i_op = &msharefs_dir_inode_ops;
 	inode->i_fop = &simple_dir_operations;
 	set_nlink(inode, 2);
 
-- 
2.43.5



* [RFC PATCH v3 04/10] mm/mshare: allocate an mm_struct for msharefs files
  2024-09-03 23:22 [RFC PATCH v3 00/10] Add support for shared PTEs across processes Anthony Yznaga
                   ` (2 preceding siblings ...)
  2024-09-03 23:22 ` [RFC PATCH v3 03/10] mm/mshare: make msharefs writable and support directories Anthony Yznaga
@ 2024-09-03 23:22 ` Anthony Yznaga
  2024-09-03 23:22 ` [RFC PATCH v3 05/10] mm/mshare: Add ioctl support Anthony Yznaga
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Anthony Yznaga @ 2024-09-03 23:22 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

When a new file is created under msharefs, allocate a new mm_struct
to be associated with it for the lifetime of the file. The mm_struct
will hold the VMAs and page tables for the mshare region the file
represents.

Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/mshare.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)

diff --git a/mm/mshare.c b/mm/mshare.c
index f718b8934cdf..a37849e724e1 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -17,6 +17,10 @@
 #include <linux/fs_context.h>
 #include <uapi/linux/magic.h>
 
+struct mshare_data {
+	struct mm_struct *mm;
+};
+
 struct msharefs_info {
 	struct dentry *info_dentry;
 };
@@ -29,11 +33,51 @@ static const struct file_operations msharefs_file_operations = {
 	.llseek		= no_llseek,
 };
 
+static int
+msharefs_fill_mm(struct inode *inode)
+{
+	struct mm_struct *mm;
+	struct mshare_data *m_data = NULL;
+	int ret = 0;
+
+	mm = mm_alloc();
+	if (!mm) {
+		ret = -ENOMEM;
+		goto err_free;
+	}
+
+	mm->mmap_base = mm->task_size = 0;
+
+	m_data = kzalloc(sizeof(*m_data), GFP_KERNEL);
+	if (!m_data) {
+		ret = -ENOMEM;
+		goto err_free;
+	}
+	m_data->mm = mm;
+	inode->i_private = m_data;
+
+	return 0;
+
+err_free:
+	if (mm)
+		mmput(mm);
+	kfree(m_data);
+	return ret;
+}
+
+static void
+msharefs_delmm(struct mshare_data *m_data)
+{
+	mmput(m_data->mm);
+	kfree(m_data);
+}
+
 static struct inode
 *msharefs_get_inode(struct mnt_idmap *idmap, struct super_block *sb,
 			const struct inode *dir, umode_t mode)
 {
 	struct inode *inode = new_inode(sb);
+	int ret;
 
 	if (!inode)
 		return ERR_PTR(-ENOMEM);
@@ -46,6 +90,11 @@ static struct inode
 	case S_IFREG:
 		inode->i_op = &msharefs_file_inode_ops;
 		inode->i_fop = &msharefs_file_operations;
+		ret = msharefs_fill_mm(inode);
+		if (ret) {
+			discard_new_inode(inode);
+			inode = ERR_PTR(ret);
+		}
 		break;
 	case S_IFDIR:
 		inode->i_op = &msharefs_dir_inode_ops;
@@ -141,6 +190,16 @@ static const struct inode_operations msharefs_dir_inode_ops = {
 	.rename		= msharefs_rename,
 };
 
+static void
+mshare_evict_inode(struct inode *inode)
+{
+	struct mshare_data *m_data = inode->i_private;
+
+	if (m_data)
+		msharefs_delmm(m_data);
+	clear_inode(inode);
+}
+
 static ssize_t
 mshare_info_read(struct file *file, char __user *buf, size_t nbytes,
 		loff_t *ppos)
@@ -158,6 +217,7 @@ static const struct file_operations mshare_info_ops = {
 
 static const struct super_operations mshare_s_ops = {
 	.statfs		= simple_statfs,
+	.evict_inode	= mshare_evict_inode,
 };
 
 static int
-- 
2.43.5



* [RFC PATCH v3 05/10] mm/mshare: Add ioctl support
  2024-09-03 23:22 [RFC PATCH v3 00/10] Add support for shared PTEs across processes Anthony Yznaga
                   ` (3 preceding siblings ...)
  2024-09-03 23:22 ` [RFC PATCH v3 04/10] mm/mshare: allocate an mm_struct for msharefs files Anthony Yznaga
@ 2024-09-03 23:22 ` Anthony Yznaga
  2024-10-14 20:08   ` Jann Horn
  2024-09-03 23:22 ` [RFC PATCH v3 06/10] mm/mshare: Add vm flag for shared PTEs Anthony Yznaga
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Anthony Yznaga @ 2024-09-03 23:22 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

From: Khalid Aziz <khalid@kernel.org>

Reserve a range of ioctls for msharefs and add the first two ioctls
to get and set the start address and size of an mshare region.

Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 .../userspace-api/ioctl/ioctl-number.rst      |  1 +
 include/uapi/linux/msharefs.h                 | 29 ++++++++
 mm/mshare.c                                   | 72 +++++++++++++++++++
 3 files changed, 102 insertions(+)
 create mode 100644 include/uapi/linux/msharefs.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index e91c0376ee59..d2fa6b117f66 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -301,6 +301,7 @@ Code  Seq#    Include File                                           Comments
 'v'   20-27  arch/powerpc/include/uapi/asm/vas-api.h		     VAS API
 'v'   C0-FF  linux/meye.h                                            conflict!
 'w'   all                                                            CERN SCI driver
+'x'   00-1F  linux/msharefs.h                                        msharefs filesystem
 'y'   00-1F                                                          packet based user level communications
                                                                      <mailto:zapman@interlan.net>
 'z'   00-3F                                                          CAN bus card conflict!
diff --git a/include/uapi/linux/msharefs.h b/include/uapi/linux/msharefs.h
new file mode 100644
index 000000000000..c7b509c7e093
--- /dev/null
+++ b/include/uapi/linux/msharefs.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * msharefs defines a memory region that is shared across processes.
+ * ioctl is used on files created under msharefs to set various
+ * attributes on these shared memory regions
+ *
+ *
+ * Copyright (C) 2024 Oracle Corp. All rights reserved.
+ * Author:	Khalid Aziz <khalid@kernel.org>
+ */
+
+#ifndef _UAPI_LINUX_MSHAREFS_H
+#define _UAPI_LINUX_MSHAREFS_H
+
+#include <linux/ioctl.h>
+#include <linux/types.h>
+
+/*
+ * msharefs specific ioctl commands
+ */
+#define MSHAREFS_GET_SIZE	_IOR('x', 0,  struct mshare_info)
+#define MSHAREFS_SET_SIZE	_IOW('x', 1,  struct mshare_info)
+
+struct mshare_info {
+	__u64 start;
+	__u64 size;
+};
+
+#endif
diff --git a/mm/mshare.c b/mm/mshare.c
index a37849e724e1..af46eb76d2bc 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -10,15 +10,20 @@
  *
  * Copyright (C) 2024 Oracle Corp. All rights reserved.
  * Author:	Khalid Aziz <khalid@kernel.org>
+ * Author:	Matthew Wilcox <willy@infradead.org>
  *
  */
 
 #include <linux/fs.h>
 #include <linux/fs_context.h>
+#include <linux/spinlock_types.h>
 #include <uapi/linux/magic.h>
+#include <uapi/linux/msharefs.h>
 
 struct mshare_data {
 	struct mm_struct *mm;
+	spinlock_t m_lock;
+	struct mshare_info minfo;
 };
 
 struct msharefs_info {
@@ -28,8 +33,74 @@ struct msharefs_info {
 static const struct inode_operations msharefs_dir_inode_ops;
 static const struct inode_operations msharefs_file_inode_ops;
 
+static long
+msharefs_set_size(struct mm_struct *mm, struct mshare_data *m_data,
+			struct mshare_info *minfo)
+{
+	unsigned long end = minfo->start + minfo->size;
+
+	/*
+	 * Validate alignment for start address, and size
+	 */
+	if ((minfo->start | end) & (PGDIR_SIZE - 1)) {
+		spin_unlock(&m_data->m_lock);
+		return -EINVAL;
+	}
+
+	mm->mmap_base = minfo->start;
+	mm->task_size = minfo->size;
+	if (!mm->task_size)
+		mm->task_size--;
+
+	m_data->minfo.start = mm->mmap_base;
+	m_data->minfo.size = mm->task_size;
+	spin_unlock(&m_data->m_lock);
+
+	return 0;
+}
+
+static long
+msharefs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+{
+	struct mshare_data *m_data = filp->private_data;
+	struct mm_struct *mm = m_data->mm;
+	struct mshare_info minfo;
+
+	switch (cmd) {
+	case MSHAREFS_GET_SIZE:
+		spin_lock(&m_data->m_lock);
+		minfo = m_data->minfo;
+		spin_unlock(&m_data->m_lock);
+
+		if (copy_to_user((void __user *)arg, &minfo, sizeof(minfo)))
+			return -EFAULT;
+
+		return 0;
+
+	case MSHAREFS_SET_SIZE:
+		if (copy_from_user(&minfo, (struct mshare_info __user *)arg,
+			sizeof(minfo)))
+			return -EFAULT;
+
+		/*
+		 * If this mshare region has been set up once already, bail out
+		 */
+		spin_lock(&m_data->m_lock);
+		if (m_data->minfo.start != 0) {
+			spin_unlock(&m_data->m_lock);
+			return -EINVAL;
+		}
+
+		return msharefs_set_size(mm, m_data, &minfo);
+
+	default:
+		return -ENOTTY;
+	}
+}
+
 static const struct file_operations msharefs_file_operations = {
 	.open		= simple_open,
+	.unlocked_ioctl	= msharefs_ioctl,
 	.llseek		= no_llseek,
 };
 
@@ -54,6 +125,7 @@ msharefs_fill_mm(struct inode *inode)
 		goto err_free;
 	}
 	m_data->mm = mm;
+	spin_lock_init(&m_data->m_lock);
 	inode->i_private = m_data;
 
 	return 0;
-- 
2.43.5



* [RFC PATCH v3 06/10] mm/mshare: Add vm flag for shared PTEs
  2024-09-03 23:22 [RFC PATCH v3 00/10] Add support for shared PTEs across processes Anthony Yznaga
                   ` (4 preceding siblings ...)
  2024-09-03 23:22 ` [RFC PATCH v3 05/10] mm/mshare: Add ioctl support Anthony Yznaga
@ 2024-09-03 23:22 ` Anthony Yznaga
  2024-09-03 23:40   ` James Houghton
  2024-09-03 23:22 ` [RFC PATCH v3 07/10] mm/mshare: Add mmap support Anthony Yznaga
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Anthony Yznaga @ 2024-09-03 23:22 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

From: Khalid Aziz <khalid@kernel.org>

Add a bit to vm_flags to indicate a vma shares PTEs with others. Add
a function to determine if a vma shares PTEs by checking this flag.
This is to be used to find the shared page table entries on page fault
for vmas sharing PTEs.

Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/mm.h             | 7 +++++++
 include/trace/events/mmflags.h | 3 +++
 mm/internal.h                  | 5 +++++
 3 files changed, 15 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6549d0979b28..3aa0b3322284 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -413,6 +413,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_DROPPABLE		VM_NONE
 #endif
 
+#ifdef CONFIG_64BIT
+#define VM_SHARED_PT_BIT	41
+#define VM_SHARED_PT		BIT(VM_SHARED_PT_BIT)
+#else
+#define VM_SHARED_PT		VM_NONE
+#endif
+
 #ifdef CONFIG_64BIT
 /* VM is sealed, in vm_flags */
 #define VM_SEALED	_BITUL(63)
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index b63d211bd141..e1ae1e60d086 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -167,8 +167,10 @@ IF_HAVE_PG_ARCH_X(arch_3)
 
 #ifdef CONFIG_64BIT
 # define IF_HAVE_VM_DROPPABLE(flag, name) {flag, name},
+# define IF_HAVE_VM_SHARED_PT(flag, name) {flag, name},
 #else
 # define IF_HAVE_VM_DROPPABLE(flag, name)
+# define IF_HAVE_VM_SHARED_PT(flag, name)
 #endif
 
 #define __def_vmaflag_names						\
@@ -204,6 +206,7 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,	"softdirty"	)		\
 	{VM_HUGEPAGE,			"hugepage"	},		\
 	{VM_NOHUGEPAGE,			"nohugepage"	},		\
 IF_HAVE_VM_DROPPABLE(VM_DROPPABLE,	"droppable"	)		\
+IF_HAVE_VM_SHARED_PT(VM_SHARED_PT,	"sharedpt"	)		\
 	{VM_MERGEABLE,			"mergeable"	}		\
 
 #define show_vma_flags(flags)						\
diff --git a/mm/internal.h b/mm/internal.h
index b4d86436565b..8005d5956b6e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1578,4 +1578,9 @@ void unlink_file_vma_batch_init(struct unlink_vma_file_batch *);
 void unlink_file_vma_batch_add(struct unlink_vma_file_batch *, struct vm_area_struct *);
 void unlink_file_vma_batch_final(struct unlink_vma_file_batch *);
 
+static inline bool vma_is_shared(const struct vm_area_struct *vma)
+{
+	return VM_SHARED_PT && (vma->vm_flags & VM_SHARED_PT);
+}
+
 #endif	/* __MM_INTERNAL_H */
-- 
2.43.5



* [RFC PATCH v3 07/10] mm/mshare: Add mmap support
  2024-09-03 23:22 [RFC PATCH v3 00/10] Add support for shared PTEs across processes Anthony Yznaga
                   ` (5 preceding siblings ...)
  2024-09-03 23:22 ` [RFC PATCH v3 06/10] mm/mshare: Add vm flag for shared PTEs Anthony Yznaga
@ 2024-09-03 23:22 ` Anthony Yznaga
  2024-09-03 23:22 ` [RFC PATCH v3 08/10] mm/mshare: Add basic page table sharing support Anthony Yznaga
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Anthony Yznaga @ 2024-09-03 23:22 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

From: Khalid Aziz <khalid@kernel.org>

Add support for mapping an mshare region into a process after the
region has been established in msharefs. For now, disallow partial
unmaps of the region by disallowing splitting of an mshare VMA.
The functionality for mapping an object into an mshare region
will be added in later patches.

Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/mshare.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/mm/mshare.c b/mm/mshare.c
index af46eb76d2bc..f3f6ed9c3761 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -33,6 +33,63 @@ struct msharefs_info {
 static const struct inode_operations msharefs_dir_inode_ops;
 static const struct inode_operations msharefs_file_inode_ops;
 
+/*
+ * Disallow partial unmaps of an mshare region for now. Unmapping at
+ * boundaries aligned to the level page tables are shared at could
+ * be allowed in the future.
+ */ 
+static int mshare_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
+{
+	return -EINVAL;
+}
+
+static const struct vm_operations_struct msharefs_vm_ops = {
+	.may_split = mshare_vm_op_split,
+};
+
+/*
+ * msharefs_mmap() - mmap an mshare region
+ */
+static int
+msharefs_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct mshare_data *m_data = file->private_data;
+	unsigned long mshare_start, mshare_end;
+	int err = -EINVAL;
+
+	spin_lock(&m_data->m_lock);
+	mshare_start = m_data->minfo.start;
+	mshare_end = mshare_start + m_data->minfo.size;
+	spin_unlock(&m_data->m_lock);
+
+	/*
+	 * Make sure start and end of this mshare region has
+	 * been established already
+	 */
+	if (mshare_start == 0)
+		goto err_out;
+
+	/*
+	 * Verify alignment and size multiple
+	 */
+	if ((vma->vm_start | vma->vm_end) & (PGDIR_SIZE - 1))
+		goto err_out;
+
+	/*
+	 * Verify this mapping does not extend outside of mshare region
+	 */
+	if (vma->vm_start < mshare_start || vma->vm_end > mshare_end)
+		goto err_out;
+
+	err = 0;
+	vma->vm_private_data = m_data;
+	vm_flags_set(vma, VM_SHARED_PT);
+	vma->vm_ops = &msharefs_vm_ops;
+
+err_out:
+	return err;
+}
+
 static long
 msharefs_set_size(struct mm_struct *mm, struct mshare_data *m_data,
 			struct mshare_info *minfo)
@@ -100,6 +157,7 @@ msharefs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 
 static const struct file_operations msharefs_file_operations = {
 	.open		= simple_open,
+	.mmap		= msharefs_mmap,
 	.unlocked_ioctl	= msharefs_ioctl,
 	.llseek		= no_llseek,
 };
-- 
2.43.5



* [RFC PATCH v3 08/10] mm/mshare: Add basic page table sharing support
  2024-09-03 23:22 [RFC PATCH v3 00/10] Add support for shared PTEs across processes Anthony Yznaga
                   ` (6 preceding siblings ...)
  2024-09-03 23:22 ` [RFC PATCH v3 07/10] mm/mshare: Add mmap support Anthony Yznaga
@ 2024-09-03 23:22 ` Anthony Yznaga
  2024-10-07  8:41   ` Kirill A. Shutemov
  2024-09-03 23:22 ` [RFC PATCH v3 09/10] mm: create __do_mmap() to take an mm_struct * arg Anthony Yznaga
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Anthony Yznaga @ 2024-09-03 23:22 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

From: Khalid Aziz <khalid@kernel.org>

Add support for handling page faults in an mshare region by
redirecting the faults to operate on the mshare mm_struct and
vmas contained in it and to link the page tables of the faulting
process with the shared page tables in the mshare mm.
Modify the unmap path to ensure that page tables in mshare regions
are kept intact when a process exits. Note that they are also
kept intact and not unlinked from a process when an mshare region
is explicitly unmapped, which is a bug to be addressed.

Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 mm/internal.h |  1 +
 mm/memory.c   | 62 ++++++++++++++++++++++++++++++++++++++++++++++++---
 mm/mshare.c   | 38 +++++++++++++++++++++++++++++++
 3 files changed, 98 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 8005d5956b6e..8ac224d96806 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1578,6 +1578,7 @@ void unlink_file_vma_batch_init(struct unlink_vma_file_batch *);
 void unlink_file_vma_batch_add(struct unlink_vma_file_batch *, struct vm_area_struct *);
 void unlink_file_vma_batch_final(struct unlink_vma_file_batch *);
 
+extern vm_fault_t find_shared_vma(struct vm_area_struct **vma, unsigned long *addrp);
 static inline bool vma_is_shared(const struct vm_area_struct *vma)
 {
 	return VM_SHARED_PT && (vma->vm_flags & VM_SHARED_PT);
diff --git a/mm/memory.c b/mm/memory.c
index 3c01d68065be..f526aef71a61 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -387,11 +387,15 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 			vma_start_write(vma);
 		unlink_anon_vmas(vma);
 
+		/*
+		 * There is no page table to be freed for vmas that
+		 * are mapped in mshare regions
+		 */
 		if (is_vm_hugetlb_page(vma)) {
 			unlink_file_vma(vma);
 			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
 				floor, next ? next->vm_start : ceiling);
-		} else {
+		} else if (!vma_is_shared(vma)) {
 			unlink_file_vma_batch_init(&vb);
 			unlink_file_vma_batch_add(&vb, vma);
 
@@ -399,7 +403,8 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 			 * Optimization: gather nearby vmas into one call down
 			 */
 			while (next && next->vm_start <= vma->vm_end + PMD_SIZE
-			       && !is_vm_hugetlb_page(next)) {
+			       && !is_vm_hugetlb_page(next)
+			       && !vma_is_shared(next)) {
 				vma = next;
 				next = mas_find(mas, ceiling - 1);
 				if (unlikely(xa_is_zero(next)))
@@ -412,7 +417,9 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 			unlink_file_vma_batch_final(&vb);
 			free_pgd_range(tlb, addr, vma->vm_end,
 				floor, next ? next->vm_start : ceiling);
-		}
+		} else
+			unlink_file_vma(vma);
+
 		vma = next;
 	} while (vma);
 }
@@ -1797,6 +1804,13 @@ void unmap_page_range(struct mmu_gather *tlb,
 	pgd_t *pgd;
 	unsigned long next;
 
+	/*
+	 * No need to unmap vmas that share page table through
+	 * mshare region
+	 */
+	 if (vma_is_shared(vma))
+		return;
+
 	BUG_ON(addr >= end);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
@@ -5801,6 +5815,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	struct mm_struct *mm = vma->vm_mm;
 	vm_fault_t ret;
 	bool is_droppable;
+	bool shared = false;
 
 	__set_current_state(TASK_RUNNING);
 
@@ -5808,6 +5823,21 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	if (ret)
 		goto out;
 
+	if (unlikely(vma_is_shared(vma))) {
+		/* XXX mshare does not support per-VMA locks yet so fallback to mm lock */
+		if (flags & FAULT_FLAG_VMA_LOCK) {
+			vma_end_read(vma);
+			return VM_FAULT_RETRY;
+		}
+
+		ret = find_shared_vma(&vma, &address);
+		if (ret)
+			return ret;
+		if (!vma)
+			return VM_FAULT_SIGSEGV;
+		shared = true;
+	}
+
 	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
 					    flags & FAULT_FLAG_INSTRUCTION,
 					    flags & FAULT_FLAG_REMOTE)) {
@@ -5843,6 +5873,32 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	if (is_droppable)
 		ret &= ~VM_FAULT_OOM;
 
+	/*
+	 * Release the read lock on the shared mm of a shared VMA unless
+	 * unless the lock has already been released.
+	 * The mmap lock will already have been released if __handle_mm_fault
+	 * returns VM_FAULT_COMPLETED or if it returns VM_FAULT_RETRY and
+	 * the flags FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT are
+	 * _not_ both set.
+	 * If the lock was released earlier, release the lock on the task's
+	 * mm now to keep lock state consistent.
+	 */
+	if (shared) {
+		int release_mmlock = 1;
+
+		if ((ret & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)) == 0) {
+			mmap_read_unlock(vma->vm_mm);
+			release_mmlock = 0;
+		} else if ((flags & FAULT_FLAG_ALLOW_RETRY) &&
+				(flags & FAULT_FLAG_RETRY_NOWAIT)) {
+			mmap_read_unlock(vma->vm_mm);
+			release_mmlock = 0;
+		}
+
+		if (release_mmlock)
+			mmap_read_unlock(mm);
+	}
+
 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_exit_user_fault();
 		/*
diff --git a/mm/mshare.c b/mm/mshare.c
index f3f6ed9c3761..8f47c8d6e6a4 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -19,6 +19,7 @@
 #include <linux/spinlock_types.h>
 #include <uapi/linux/magic.h>
 #include <uapi/linux/msharefs.h>
+#include "internal.h"
 
 struct mshare_data {
 	struct mm_struct *mm;
@@ -33,6 +34,43 @@ struct msharefs_info {
 static const struct inode_operations msharefs_dir_inode_ops;
 static const struct inode_operations msharefs_file_inode_ops;
 
+/* Returns holding the host mm's lock for read.  Caller must release. */
+vm_fault_t
+find_shared_vma(struct vm_area_struct **vmap, unsigned long *addrp)
+{
+	struct vm_area_struct *vma, *guest = *vmap;
+	struct mshare_data *m_data = guest->vm_private_data;
+	struct mm_struct *host_mm = m_data->mm;
+	unsigned long host_addr;
+	pgd_t *pgd, *guest_pgd;
+
+	mmap_read_lock(host_mm);
+	host_addr = *addrp - guest->vm_start + host_mm->mmap_base;
+	pgd = pgd_offset(host_mm, host_addr);
+	guest_pgd = pgd_offset(guest->vm_mm, *addrp);
+	if (!pgd_same(*guest_pgd, *pgd)) {
+		set_pgd(guest_pgd, *pgd);
+		mmap_read_unlock(host_mm);
+		return VM_FAULT_NOPAGE;
+	}
+
+	*addrp = host_addr;
+	vma = find_vma(host_mm, host_addr);
+
+	/* XXX: expand stack? */
+	if (vma && vma->vm_start > host_addr)
+		vma = NULL;
+
+	*vmap = vma;
+
+	/*
+	 * release host mm lock unless a matching vma is found
+	 */
+	if (!vma)
+		mmap_read_unlock(host_mm);
+	return 0;
+}
+
 /*
  * Disallow partial unmaps of an mshare region for now. Unmapping at
  * boundaries aligned to the level page tables are shared at could
-- 
2.43.5



* [RFC PATCH v3 09/10] mm: create __do_mmap() to take an mm_struct * arg
  2024-09-03 23:22 [RFC PATCH v3 00/10] Add support for shared PTEs across processes Anthony Yznaga
                   ` (7 preceding siblings ...)
  2024-09-03 23:22 ` [RFC PATCH v3 08/10] mm/mshare: Add basic page table sharing support Anthony Yznaga
@ 2024-09-03 23:22 ` Anthony Yznaga
  2024-10-07  8:44   ` Kirill A. Shutemov
  2024-09-03 23:22 ` [RFC PATCH v3 10/10] mshare: add MSHAREFS_CREATE_MAPPING Anthony Yznaga
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Anthony Yznaga @ 2024-09-03 23:22 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

In preparation for mapping objects into an mshare region, create
__do_mmap() to allow mapping into a specified mm. There are no
functional changes otherwise.

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/linux/mm.h | 18 +++++++++++++++++-
 mm/mmap.c          | 12 +++++-------
 2 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3aa0b3322284..a9afbda73cb0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3409,11 +3409,27 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-	struct list_head *uf);
+	struct list_head *uf, struct mm_struct *mm);
+#ifdef CONFIG_MMU
+extern unsigned long __do_mmap(struct file *file, unsigned long addr,
+	unsigned long len, unsigned long prot, unsigned long flags,
+	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
+	struct list_head *uf, struct mm_struct *mm);
+static inline unsigned long do_mmap(struct file *file, unsigned long addr,
+	unsigned long len, unsigned long prot, unsigned long flags,
+	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
+	struct list_head *uf)
+{
+	return __do_mmap(file, addr, len, prot, flags, vm_flags, pgoff,
+			 populate, uf, current->mm);
+}
+#else
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
 	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
 	struct list_head *uf);
+#endif
+
 extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 			 unsigned long start, size_t len, struct list_head *uf,
 			 bool unlock);
diff --git a/mm/mmap.c b/mm/mmap.c
index d0dfc85b209b..4112f18e7302 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1250,15 +1250,14 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
 }
 
 /*
- * The caller must write-lock current->mm->mmap_lock.
+ * The caller must write-lock mm->mmap_lock.
  */
-unsigned long do_mmap(struct file *file, unsigned long addr,
+unsigned long __do_mmap(struct file *file, unsigned long addr,
 			unsigned long len, unsigned long prot,
 			unsigned long flags, vm_flags_t vm_flags,
 			unsigned long pgoff, unsigned long *populate,
-			struct list_head *uf)
+			struct list_head *uf, struct mm_struct *mm)
 {
-	struct mm_struct *mm = current->mm;
 	int pkey = 0;
 
 	*populate = 0;
@@ -1465,7 +1464,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
-	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
+	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, mm);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
@@ -2848,9 +2847,8 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 
 unsigned long mmap_region(struct file *file, unsigned long addr,
 		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-		struct list_head *uf)
+		struct list_head *uf, struct mm_struct *mm)
 {
-	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = NULL;
 	struct vm_area_struct *next, *prev, *merge;
 	pgoff_t pglen = len >> PAGE_SHIFT;
-- 
2.43.5



* [RFC PATCH v3 10/10] mshare: add MSHAREFS_CREATE_MAPPING
  2024-09-03 23:22 [RFC PATCH v3 00/10] Add support for shared PTEs across processes Anthony Yznaga
                   ` (8 preceding siblings ...)
  2024-09-03 23:22 ` [RFC PATCH v3 09/10] mm: create __do_mmap() to take an mm_struct * arg Anthony Yznaga
@ 2024-09-03 23:22 ` Anthony Yznaga
  2024-10-02 17:35 ` [RFC PATCH v3 00/10] Add support for shared PTEs across processes Dave Hansen
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 38+ messages in thread
From: Anthony Yznaga @ 2024-09-03 23:22 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, khalid
  Cc: anthony.yznaga, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

Add an ioctl for mapping objects within an mshare region. The
arguments are the same as mmap() although only shared anonymous
memory with some restrictions is supported initially.
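
For illustration, userspace usage might look roughly like the sketch below.
This is not part of the patch: the msharefs mount point and file name are
made up, and it assumes the region's start/size have already been
established (e.g. via MSHAREFS_SET_SIZE):

	#include <fcntl.h>
	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <linux/msharefs.h>

	int create_anon_mshare_mapping(void)
	{
		/* Hypothetical mount point/file; error cleanup trimmed for brevity. */
		int fd = open("/mnt/mshare/region0", O_RDWR | O_CREAT, 0600);
		struct mshare_info info;
		struct mshare_create mcreate = { 0 };

		if (fd < 0)
			return -1;

		/* The region's start/size must have been set up already. */
		if (ioctl(fd, MSHAREFS_GET_SIZE, &info) < 0)
			return -1;

		mcreate.addr  = info.start;		/* must lie within the region */
		mcreate.size  = info.size;
		mcreate.prot  = PROT_READ | PROT_WRITE;
		mcreate.flags = MAP_SHARED | MAP_FIXED;	/* required for now */
		mcreate.fd    = (uint32_t)-1;		/* anonymous memory only */

		return ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate);
	}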

Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
 include/uapi/linux/msharefs.h |  9 +++++
 mm/mshare.c                   | 71 +++++++++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+)

diff --git a/include/uapi/linux/msharefs.h b/include/uapi/linux/msharefs.h
index c7b509c7e093..fea0afdf000d 100644
--- a/include/uapi/linux/msharefs.h
+++ b/include/uapi/linux/msharefs.h
@@ -20,10 +20,19 @@
  */
 #define MSHAREFS_GET_SIZE	_IOR('x', 0,  struct mshare_info)
 #define MSHAREFS_SET_SIZE	_IOW('x', 1,  struct mshare_info)
+#define MSHAREFS_CREATE_MAPPING	_IOW('x', 2,  struct mshare_create)
 
 struct mshare_info {
 	__u64 start;
 	__u64 size;
 };
 
+struct mshare_create {
+	__u64 addr;
+	__u64 size;
+	__u64 offset;
+	__u32 prot;
+	__u32 flags;
+	__u32 fd;
+};
 #endif
diff --git a/mm/mshare.c b/mm/mshare.c
index 8f47c8d6e6a4..7b89bf7f5ffc 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -16,6 +16,7 @@
 
 #include <linux/fs.h>
 #include <linux/fs_context.h>
+#include <linux/mman.h>
 #include <linux/spinlock_types.h>
 #include <uapi/linux/magic.h>
 #include <uapi/linux/msharefs.h>
@@ -154,12 +155,65 @@ msharefs_set_size(struct mm_struct *mm, struct mshare_data *m_data,
 	return 0;
 }
 
+static long
+msharefs_create_mapping(struct mm_struct *mm, struct mshare_data *m_data,
+		        struct mshare_create *mcreate)
+{
+	unsigned long mshare_start, mshare_end;
+	unsigned long mapped_addr;
+	unsigned long populate = 0;
+	unsigned long addr = mcreate->addr;
+	unsigned long size = mcreate->size;
+	unsigned int fd = mcreate->fd;
+	int prot = mcreate->prot;
+	int flags = mcreate->flags;
+	vm_flags_t vm_flags;
+	int err = -EINVAL;
+
+	mshare_start = m_data->minfo.start;
+	mshare_end = mshare_start + m_data->minfo.size;
+
+	if ((addr < mshare_start) || (addr >= mshare_end) ||
+	    (addr + size > mshare_end))
+		goto out;
+
+	/*
+	 * XXX Keep things simple initially and only allow the mapping of
+	 * anonymous shared memory at fixed addresses without unmapping.
+	 */
+	if ((flags & (MAP_SHARED | MAP_FIXED)) != (MAP_SHARED | MAP_FIXED))
+		goto out;
+
+	if (fd != -1)
+		goto out;
+
+	flags |= MAP_FIXED_NOREPLACE;
+	vm_flags = VM_NOHUGEPAGE;
+
+	if (mmap_write_lock_killable(mm)) {
+		err = -EINTR;
+		goto out;
+	}
+
+	err = 0;
+	mapped_addr = __do_mmap(NULL, addr, size, prot, flags, vm_flags,
+				0, &populate, NULL, mm);
+
+	if (IS_ERR_VALUE(mapped_addr))
+		err = (long)mapped_addr;
+
+	mmap_write_unlock(mm);
+out:
+	return err;
+}
+
 static long
 msharefs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
 	struct mshare_data *m_data = filp->private_data;
 	struct mm_struct *mm = m_data->mm;
 	struct mshare_info minfo;
+	struct mshare_create mcreate;
 
 	switch (cmd) {
 	case MSHAREFS_GET_SIZE:
@@ -188,6 +242,23 @@ msharefs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 
 		return msharefs_set_size(mm, m_data, &minfo);
 
+	case MSHAREFS_CREATE_MAPPING:
+		if (copy_from_user(&mcreate, (struct mshare_create __user *)arg,
+			sizeof(mcreate)))
+			return -EFAULT;
+
+		/*
+		 * validate mshare region
+		 */
+		spin_lock(&m_data->m_lock);
+		if (m_data->minfo.start == 0) {
+			spin_unlock(&m_data->m_lock);
+			return -EINVAL;
+		}
+		spin_unlock(&m_data->m_lock);
+
+		return msharefs_create_mapping(mm, m_data, &mcreate);
+
 	default:
 		return -ENOTTY;
 	}
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 06/10] mm/mshare: Add vm flag for shared PTEs
  2024-09-03 23:22 ` [RFC PATCH v3 06/10] mm/mshare: Add vm flag for shared PTEs Anthony Yznaga
@ 2024-09-03 23:40   ` James Houghton
  2024-09-03 23:58     ` Anthony Yznaga
  2024-10-07 10:24     ` David Hildenbrand
  0 siblings, 2 replies; 38+ messages in thread
From: James Houghton @ 2024-09-03 23:40 UTC (permalink / raw)
  To: Anthony Yznaga
  Cc: akpm, willy, markhemm, viro, david, khalid, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz

On Tue, Sep 3, 2024 at 4:23 PM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>
> From: Khalid Aziz <khalid@kernel.org>
>
> Add a bit to vm_flags to indicate a vma shares PTEs with others. Add
> a function to determine if a vma shares PTEs by checking this flag.
> This is to be used to find the shared page table entries on page fault
> for vmas sharing PTEs.
>
> Signed-off-by: Khalid Aziz <khalid@kernel.org>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
> ---
>  include/linux/mm.h             | 7 +++++++
>  include/trace/events/mmflags.h | 3 +++
>  mm/internal.h                  | 5 +++++
>  3 files changed, 15 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6549d0979b28..3aa0b3322284 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -413,6 +413,13 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_DROPPABLE           VM_NONE
>  #endif
>
> +#ifdef CONFIG_64BIT
> +#define VM_SHARED_PT_BIT       41
> +#define VM_SHARED_PT           BIT(VM_SHARED_PT_BIT)
> +#else
> +#define VM_SHARED_PT           VM_NONE
> +#endif
> +
>  #ifdef CONFIG_64BIT
>  /* VM is sealed, in vm_flags */
>  #define VM_SEALED      _BITUL(63)
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index b63d211bd141..e1ae1e60d086 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -167,8 +167,10 @@ IF_HAVE_PG_ARCH_X(arch_3)
>
>  #ifdef CONFIG_64BIT
>  # define IF_HAVE_VM_DROPPABLE(flag, name) {flag, name},
> +# define IF_HAVE_VM_SHARED_PT(flag, name) {flag, name},
>  #else
>  # define IF_HAVE_VM_DROPPABLE(flag, name)
> +# define IF_HAVE_VM_SHARED_PT(flag, name)
>  #endif
>
>  #define __def_vmaflag_names                                            \
> @@ -204,6 +206,7 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,  "softdirty"     )               \
>         {VM_HUGEPAGE,                   "hugepage"      },              \
>         {VM_NOHUGEPAGE,                 "nohugepage"    },              \
>  IF_HAVE_VM_DROPPABLE(VM_DROPPABLE,     "droppable"     )               \
> +IF_HAVE_VM_SHARED_PT(VM_SHARED_PT,     "sharedpt"      )               \
>         {VM_MERGEABLE,                  "mergeable"     }               \
>
>  #define show_vma_flags(flags)                                          \
> diff --git a/mm/internal.h b/mm/internal.h
> index b4d86436565b..8005d5956b6e 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1578,4 +1578,9 @@ void unlink_file_vma_batch_init(struct unlink_vma_file_batch *);
>  void unlink_file_vma_batch_add(struct unlink_vma_file_batch *, struct vm_area_struct *);
>  void unlink_file_vma_batch_final(struct unlink_vma_file_batch *);
>

Hi Anthony,

I'm really excited to see this series on the mailing list again! :) I
won't have time to review this series in too much detail, but I hope
something like it gets merged eventually.

> +static inline bool vma_is_shared(const struct vm_area_struct *vma)
> +{
> +       return VM_SHARED_PT && (vma->vm_flags & VM_SHARED_PT);
> +}

Tiny comment - I find vma_is_shared() to be a bit of a confusing name,
especially given how vma_is_shared_maywrite() is defined. (Sorry if
this has already been discussed before.)

How about vma_is_shared_pt()?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 06/10] mm/mshare: Add vm flag for shared PTEs
  2024-09-03 23:40   ` James Houghton
@ 2024-09-03 23:58     ` Anthony Yznaga
  2024-10-07 10:24     ` David Hildenbrand
  1 sibling, 0 replies; 38+ messages in thread
From: Anthony Yznaga @ 2024-09-03 23:58 UTC (permalink / raw)
  To: James Houghton
  Cc: akpm, willy, markhemm, viro, david, khalid, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz


On 9/3/24 4:40 PM, James Houghton wrote:
> On Tue, Sep 3, 2024 at 4:23 PM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>> From: Khalid Aziz <khalid@kernel.org>
>>
>> Add a bit to vm_flags to indicate a vma shares PTEs with others. Add
>> a function to determine if a vma shares PTEs by checking this flag.
>> This is to be used to find the shared page table entries on page fault
>> for vmas sharing PTEs.
>>
>> Signed-off-by: Khalid Aziz <khalid@kernel.org>
>> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
>> ---
>>   include/linux/mm.h             | 7 +++++++
>>   include/trace/events/mmflags.h | 3 +++
>>   mm/internal.h                  | 5 +++++
>>   3 files changed, 15 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 6549d0979b28..3aa0b3322284 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -413,6 +413,13 @@ extern unsigned int kobjsize(const void *objp);
>>   #define VM_DROPPABLE           VM_NONE
>>   #endif
>>
>> +#ifdef CONFIG_64BIT
>> +#define VM_SHARED_PT_BIT       41
>> +#define VM_SHARED_PT           BIT(VM_SHARED_PT_BIT)
>> +#else
>> +#define VM_SHARED_PT           VM_NONE
>> +#endif
>> +
>>   #ifdef CONFIG_64BIT
>>   /* VM is sealed, in vm_flags */
>>   #define VM_SEALED      _BITUL(63)
>> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
>> index b63d211bd141..e1ae1e60d086 100644
>> --- a/include/trace/events/mmflags.h
>> +++ b/include/trace/events/mmflags.h
>> @@ -167,8 +167,10 @@ IF_HAVE_PG_ARCH_X(arch_3)
>>
>>   #ifdef CONFIG_64BIT
>>   # define IF_HAVE_VM_DROPPABLE(flag, name) {flag, name},
>> +# define IF_HAVE_VM_SHARED_PT(flag, name) {flag, name},
>>   #else
>>   # define IF_HAVE_VM_DROPPABLE(flag, name)
>> +# define IF_HAVE_VM_SHARED_PT(flag, name)
>>   #endif
>>
>>   #define __def_vmaflag_names                                            \
>> @@ -204,6 +206,7 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,  "softdirty"     )               \
>>          {VM_HUGEPAGE,                   "hugepage"      },              \
>>          {VM_NOHUGEPAGE,                 "nohugepage"    },              \
>>   IF_HAVE_VM_DROPPABLE(VM_DROPPABLE,     "droppable"     )               \
>> +IF_HAVE_VM_SHARED_PT(VM_SHARED_PT,     "sharedpt"      )               \
>>          {VM_MERGEABLE,                  "mergeable"     }               \
>>
>>   #define show_vma_flags(flags)                                          \
>> diff --git a/mm/internal.h b/mm/internal.h
>> index b4d86436565b..8005d5956b6e 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -1578,4 +1578,9 @@ void unlink_file_vma_batch_init(struct unlink_vma_file_batch *);
>>   void unlink_file_vma_batch_add(struct unlink_vma_file_batch *, struct vm_area_struct *);
>>   void unlink_file_vma_batch_final(struct unlink_vma_file_batch *);
>>
Hi James,
> Hi Anthony,
>
> I'm really excited to see this series on the mailing list again! :) I
> won't have time to review this series in too much detail, but I hope
> something like it gets merged eventually.
Me, too. :-)
>
>> +static inline bool vma_is_shared(const struct vm_area_struct *vma)
>> +{
>> +       return VM_SHARED_PT && (vma->vm_flags & VM_SHARED_PT);
>> +}
> Tiny comment - I find vma_is_shared() to be a bit of a confusing name,
> especially given how vma_is_shared_maywrite() is defined. (Sorry if
> this has already been discussed before.)
>
> How about vma_is_shared_pt()?

Good point. I'll make this change. Thanks for taking a look.


Anthony



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 00/10] Add support for shared PTEs across processes
  2024-09-03 23:22 [RFC PATCH v3 00/10] Add support for shared PTEs across processes Anthony Yznaga
                   ` (9 preceding siblings ...)
  2024-09-03 23:22 ` [RFC PATCH v3 10/10] mshare: add MSHAREFS_CREATE_MAPPING Anthony Yznaga
@ 2024-10-02 17:35 ` Dave Hansen
  2024-10-02 19:30   ` Anthony Yznaga
                     ` (2 more replies)
  2024-10-07  9:01 ` Kirill A. Shutemov
  2024-10-14 20:07 ` Jann Horn
  12 siblings, 3 replies; 38+ messages in thread
From: Dave Hansen @ 2024-10-02 17:35 UTC (permalink / raw)
  To: Anthony Yznaga, akpm, willy, markhemm, viro, david, khalid
  Cc: andreyknvl, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz, David Rientjes

We were just chatting about this on David Rientjes's MM alignment call.
I thought I'd try to give a little brain dump.

Let's start by thinking about KVM and secondary MMUs.  KVM has a primary
mm: the QEMU (or whatever) process mm.  The virtualization (EPT/NPT)
tables get entries that effectively mirror the primary mm page tables
and constitute a secondary MMU.  If the primary page tables change,
mmu_notifiers ensure that the changes get reflected into the
virtualization tables and also that the virtualization paging structure
caches are flushed.

msharefs is doing something very similar.  But, in the msharefs case,
the secondary MMUs are actually normal CPU MMUs.  The page tables are
normal old page tables and the caches are the normal old TLB.  That's
what makes it so confusing: we have lots of infrastructure for dealing
with that "stuff" (CPU page tables and TLB), but msharefs has
short-circuited the infrastructure and it doesn't work any more.

Basically, I think it makes a lot of sense to check what KVM (or another
mmu_notifier user) is doing and make sure that msharefs is following its
lead.  For instance, KVM _should_ have the exact same "page free"
flushing issue where it gets the MMU notifier call but the page may
still be in the secondary MMU.  I _think_ KVM fixes it with an extra
page refcount that it takes when it first walks the primary page tables.

But the short of it is that the msharefs host mm represents a "secondary
MMU".  I don't think it is really that special of an MMU other than the
fact that it has an mm_struct.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 00/10] Add support for shared PTEs across processes
  2024-10-02 17:35 ` [RFC PATCH v3 00/10] Add support for shared PTEs across processes Dave Hansen
@ 2024-10-02 19:30   ` Anthony Yznaga
  2024-10-02 23:11     ` Dave Hansen
  2024-10-07  8:44   ` David Hildenbrand
  2024-10-07  8:48   ` David Hildenbrand
  2 siblings, 1 reply; 38+ messages in thread
From: Anthony Yznaga @ 2024-10-02 19:30 UTC (permalink / raw)
  To: Dave Hansen, akpm, willy, markhemm, viro, david, khalid
  Cc: andreyknvl, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz, David Rientjes


On 10/2/24 10:35 AM, Dave Hansen wrote:
> We were just chatting about this on David Rientjes's MM alignment call.
> I thought I'd try to give a little brain dump.
>
> Let's start by thinking about KVM and secondary MMUs.  KVM has a primary
> mm: the QEMU (or whatever) process mm.  The virtualization (EPT/NPT)
> tables get entries that effectively mirror the primary mm page tables
> and constitute a secondary MMU.  If the primary page tables change,
> mmu_notifiers ensure that the changes get reflected into the
> virtualization tables and also that the virtualization paging structure
> caches are flushed.
>
> msharefs is doing something very similar.  But, in the msharefs case,
> the secondary MMUs are actually normal CPU MMUs.  The page tables are
> normal old page tables and the caches are the normal old TLB.  That's
> what makes it so confusing: we have lots of infrastructure for dealing
> with that "stuff" (CPU page tables and TLB), but msharefs has
> short-circuited the infrastructure and it doesn't work any more.
>
> Basically, I think it makes a lot of sense to check what KVM (or another
> mmu_notifier user) is doing and make sure that msharefs is following its
> lead.  For instance, KVM _should_ have the exact same "page free"
> flushing issue where it gets the MMU notifier call but the page may
> still be in the secondary MMU.  I _think_ KVM fixes it with an extra
> page refcount that it takes when it first walks the primary page tables.
>
> But the short of it is that the msharefs host mm represents a "secondary
> MMU".  I don't think it is really that special of an MMU other than the
> fact that it has an mm_struct.


Thanks, Dave. This is helpful. I'll look at what other mmu notifier 
users are doing. This does align with the comments in mmu_notifier.h 
regarding invalidate_range_start/end.


Anthony



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 00/10] Add support for shared PTEs across processes
  2024-10-02 19:30   ` Anthony Yznaga
@ 2024-10-02 23:11     ` Dave Hansen
  2024-10-03  0:24       ` Anthony Yznaga
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2024-10-02 23:11 UTC (permalink / raw)
  To: Anthony Yznaga, akpm, willy, markhemm, viro, david, khalid
  Cc: andreyknvl, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz, David Rientjes

About TLB flushing...

The quick and dirty thing to do is just flush_tlb_all() after you remove
the PTE from the host mm.  That will surely work everywhere and it's as
dirt simple as you get.  Honestly, it might even be cheaper than the
alternative.

Also, I don't think PCIDs actually complicate the problem at all.  We
basically do remote mm TLB flushes using two mechanisms:

	1. If the mm is loaded, use INVLPG and friends to zap the TLB
	2. Bump mm->context.tlb_gen so that the next time it _gets_
	   loaded, the TLB is flushed.

flush_tlb_func() really only cares about #1 since if the mm isn't
loaded, it'll get flushed anyway at the next context switch.

The alternatives I can think of:

Make flush_tlb_mm_range(host_mm) work somehow.  You'd need to somehow
keep mm_cpumask(host_mm) up to date and also make do something to
flush_tlb_func() to tell it that 'loaded_mm' isn't relevant and it
should flush regardless.

The other way is to use the msharefs's inode ->i_mmap to find all the
VMAs mapping the file, and find all *their* mm's:

	for each vma in inode->i_mmap
		mm = vma->vm_mm
		flush_tlb_mm_range(<vma range here>)

But that might be even worse than flush_tlb_all() because it might end
up sending more than one IPI per CPU.

You can fix _that_ by keeping a single cpumask that you build up:

	mask = 0
	for each vma in inode->i_mmap
		mm = vma->vm_mm
		mask |= mm_cpumask(mm)

	flush_tlb_multi(mask, info);

Unfortunately, 'info->mm' needs to be more than one mm, so you probably
still need a new flush_tlb_func() flush type to tell it to ignore
'info->mm' and flush anyway.
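
A rough, untested sketch of that accumulation (illustrative only; the final
flush is left as flush_tlb_all() because the real thing still needs that new
flush type):

	static void mshare_flush_other_mms(struct address_space *mapping)
	{
		struct vm_area_struct *vma;
		cpumask_var_t mask;

		if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
			flush_tlb_all();	/* simple fallback */
			return;
		}

		i_mmap_lock_read(mapping);
		vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, ULONG_MAX)
			cpumask_or(mask, mask, mm_cpumask(vma->vm_mm));
		i_mmap_unlock_read(mapping);

		/*
		 * Would be flush_tlb_multi(mask, info) with a flush type
		 * that ignores 'info->mm'; punt to a full flush here.
		 */
		flush_tlb_all();

		free_cpumask_var(mask);
	}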

After all that, I kinda like flush_tlb_all(). ;)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 00/10] Add support for shared PTEs across processes
  2024-10-02 23:11     ` Dave Hansen
@ 2024-10-03  0:24       ` Anthony Yznaga
  0 siblings, 0 replies; 38+ messages in thread
From: Anthony Yznaga @ 2024-10-03  0:24 UTC (permalink / raw)
  To: Dave Hansen, akpm, willy, markhemm, viro, david, khalid
  Cc: andreyknvl, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz, David Rientjes


On 10/2/24 4:11 PM, Dave Hansen wrote:
> About TLB flushing...
>
> The quick and dirty thing to do is just flush_tlb_all() after you remove
> the PTE from the host mm.  That will surely work everywhere and it's as
> dirt simple as you get.  Honestly, it might even be cheaper than the
> alternative.

I think this is a good place to start from. My concern is that unnecessary 
flushes will potentially impact unrelated loads. Performance testing as 
things progress can help determine if a more involved approach is needed.

>
> Also, I don't think PCIDs actually complicate the problem at all.  We
> basically do remote mm TLB flushes using two mechanisms:
>
> 	1. If the mm is loaded, use INVLPG and friends to zap the TLB
> 	2. Bump mm->context.tlb_gen so that the next time it _gets_
> 	   loaded, the TLB is flushed.
>
> flush_tlb_func() really only cares about #1 since if the mm isn't
> loaded, it'll get flushed anyway at the next context switch.
>
> The alternatives I can think of:
>
> Make flush_tlb_mm_range(host_mm) work somehow.  You'd need to somehow
> keep mm_cpumask(host_mm) up to date and also do something to
> flush_tlb_func() to tell it that 'loaded_mm' isn't relevant and it
> should flush regardless.
>
> The other way is to use the msharefs's inode ->i_mmap to find all the
> VMAs mapping the file, and find all *their* mm's:
>
> 	for each vma in inode->i_mmap
> 		mm = vma->vm_mm
> 		flush_tlb_mm_range(<vma range here>)
>
> But that might be even worse than flush_tlb_all() because it might end
> up sending more than one IPI per CPU.
>
> You can fix _that_ by keeping a single cpumask that you build up:
>
> 	mask = 0
> 	for each vma in inode->i_mmap
> 		mm = vma->vm_mm
> 		mask |= mm_cpumask(mm)
>
> 	flush_tlb_multi(mask, info);
>
> Unfortunately, 'info->mm' needs to be more than one mm, so you probably
> still need a new flush_tlb_func() flush type to tell it to ignore
> 'info->mm' and flush anyway.

What about something like arch_tlbbatch_flush() which sets info->mm to 
NULL and uses a batch cpumask? That seems perfect though there would be 
a bit more work needed to ensure things work on other architectures. 
flush_tlb_all() it is then, for now. :-)

>
> After all that, I kinda like flush_tlb_all(). ;)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 08/10] mm/mshare: Add basic page table sharing support
  2024-09-03 23:22 ` [RFC PATCH v3 08/10] mm/mshare: Add basic page table sharing support Anthony Yznaga
@ 2024-10-07  8:41   ` Kirill A. Shutemov
  2024-10-07 17:45     ` Anthony Yznaga
  0 siblings, 1 reply; 38+ messages in thread
From: Kirill A. Shutemov @ 2024-10-07  8:41 UTC (permalink / raw)
  To: Anthony Yznaga
  Cc: akpm, willy, markhemm, viro, david, khalid, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz

On Tue, Sep 03, 2024 at 04:22:39PM -0700, Anthony Yznaga wrote:
> From: Khalid Aziz <khalid@kernel.org>
> 
> Add support for handling page faults in an mshare region by
> redirecting the faults to operate on the mshare mm_struct and
> vmas contained in it and to link the page tables of the faulting
> process with the shared page tables in the mshare mm.
> Modify the unmap path to ensure that page tables in mshare regions
> are kept intact when a process exits. Note that they are also
> kept intact and not unlinked from a process when an mshare region
> is explicitly unmapped, which is a bug to be addressed.
> 
> Signed-off-by: Khalid Aziz <khalid@kernel.org>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
> ---
>  mm/internal.h |  1 +
>  mm/memory.c   | 62 ++++++++++++++++++++++++++++++++++++++++++++++++---
>  mm/mshare.c   | 38 +++++++++++++++++++++++++++++++
>  3 files changed, 98 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 8005d5956b6e..8ac224d96806 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1578,6 +1578,7 @@ void unlink_file_vma_batch_init(struct unlink_vma_file_batch *);
>  void unlink_file_vma_batch_add(struct unlink_vma_file_batch *, struct vm_area_struct *);
>  void unlink_file_vma_batch_final(struct unlink_vma_file_batch *);
>  
> +extern vm_fault_t find_shared_vma(struct vm_area_struct **vma, unsigned long *addrp);
>  static inline bool vma_is_shared(const struct vm_area_struct *vma)
>  {
>  	return VM_SHARED_PT && (vma->vm_flags & VM_SHARED_PT);
> diff --git a/mm/memory.c b/mm/memory.c
> index 3c01d68065be..f526aef71a61 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -387,11 +387,15 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
>  			vma_start_write(vma);
>  		unlink_anon_vmas(vma);
>  
> +		/*
> +		 * There is no page table to be freed for vmas that
> +		 * are mapped in mshare regions
> +		 */
>  		if (is_vm_hugetlb_page(vma)) {
>  			unlink_file_vma(vma);
>  			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
>  				floor, next ? next->vm_start : ceiling);
> -		} else {
> +		} else if (!vma_is_shared(vma)) {
>  			unlink_file_vma_batch_init(&vb);
>  			unlink_file_vma_batch_add(&vb, vma);
>  
> @@ -399,7 +403,8 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
>  			 * Optimization: gather nearby vmas into one call down
>  			 */
>  			while (next && next->vm_start <= vma->vm_end + PMD_SIZE
> -			       && !is_vm_hugetlb_page(next)) {
> +			       && !is_vm_hugetlb_page(next)
> +			       && !vma_is_shared(next)) {
>  				vma = next;
>  				next = mas_find(mas, ceiling - 1);
>  				if (unlikely(xa_is_zero(next)))
> @@ -412,7 +417,9 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
>  			unlink_file_vma_batch_final(&vb);
>  			free_pgd_range(tlb, addr, vma->vm_end,
>  				floor, next ? next->vm_start : ceiling);
> -		}
> +		} else
> +			unlink_file_vma(vma);
> +
>  		vma = next;

I would rather have a vma->vm_ops->free_pgtables() hook that would be
non-NULL for mshare and hugetlb VMAs.
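
Roughly something like this in free_pgtables() (just a sketch; the hook
name, placement and signature are made up):

	if (vma->vm_ops && vma->vm_ops->free_pgtables) {
		/* mshare/hugetlb decide what (if anything) to free */
		vma->vm_ops->free_pgtables(tlb, vma, addr, floor,
					   next ? next->vm_start : ceiling);
	} else {
		/* normal VMAs keep using free_pgd_range() */
		free_pgd_range(tlb, addr, vma->vm_end, floor,
			       next ? next->vm_start : ceiling);
	}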

>  	} while (vma);
>  }
> @@ -1797,6 +1804,13 @@ void unmap_page_range(struct mmu_gather *tlb,
>  	pgd_t *pgd;
>  	unsigned long next;
>  
> +	/*
> +	 * No need to unmap vmas that share page table through
> +	 * mshare region
> +	 */
> +	 if (vma_is_shared(vma))
> +		return;
> +

Ditto. vma->vm_ops->unmap_page_range().

>  	BUG_ON(addr >= end);
>  	tlb_start_vma(tlb, vma);
>  	pgd = pgd_offset(vma->vm_mm, addr);
> @@ -5801,6 +5815,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  	struct mm_struct *mm = vma->vm_mm;
>  	vm_fault_t ret;
>  	bool is_droppable;
> +	bool shared = false;
>  
>  	__set_current_state(TASK_RUNNING);
>  
> @@ -5808,6 +5823,21 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  	if (ret)
>  		goto out;
>  
> +	if (unlikely(vma_is_shared(vma))) {
> +		/* XXX mshare does not support per-VMA locks yet so fallback to mm lock */
> +		if (flags & FAULT_FLAG_VMA_LOCK) {
> +			vma_end_read(vma);
> +			return VM_FAULT_RETRY;
> +		}
> +
> +		ret = find_shared_vma(&vma, &address);
> +		if (ret)
> +			return ret;
> +		if (!vma)
> +			return VM_FAULT_SIGSEGV;
> +		shared = true;

Do we need to update the 'mm' variable here?

It is going to be used to account the fault below. Not sure which mm has
to account such faults.

> +	}
> +
>  	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>  					    flags & FAULT_FLAG_INSTRUCTION,
>  					    flags & FAULT_FLAG_REMOTE)) {
> @@ -5843,6 +5873,32 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  	if (is_droppable)
>  		ret &= ~VM_FAULT_OOM;
>  
> +	/*
> +	 * Release the read lock on the shared mm of a shared VMA unless
> +	 * unless the lock has already been released.
> +	 * The mmap lock will already have been released if __handle_mm_fault
> +	 * returns VM_FAULT_COMPLETED or if it returns VM_FAULT_RETRY and
> +	 * the flags FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT are
> +	 * _not_ both set.
> +	 * If the lock was released earlier, release the lock on the task's
> +	 * mm now to keep lock state consistent.
> +	 */
> +	if (shared) {
> +		int release_mmlock = 1;
> +
> +		if ((ret & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)) == 0) {
> +			mmap_read_unlock(vma->vm_mm);
> +			release_mmlock = 0;
> +		} else if ((flags & FAULT_FLAG_ALLOW_RETRY) &&
> +				(flags & FAULT_FLAG_RETRY_NOWAIT)) {
> +			mmap_read_unlock(vma->vm_mm);
> +			release_mmlock = 0;
> +		}
> +
> +		if (release_mmlock)
> +			mmap_read_unlock(mm);
> +	}
> +
>  	if (flags & FAULT_FLAG_USER) {
>  		mem_cgroup_exit_user_fault();
>  		/*
> diff --git a/mm/mshare.c b/mm/mshare.c
> index f3f6ed9c3761..8f47c8d6e6a4 100644
> --- a/mm/mshare.c
> +++ b/mm/mshare.c
> @@ -19,6 +19,7 @@
>  #include <linux/spinlock_types.h>
>  #include <uapi/linux/magic.h>
>  #include <uapi/linux/msharefs.h>
> +#include "internal.h"
>  
>  struct mshare_data {
>  	struct mm_struct *mm;
> @@ -33,6 +34,43 @@ struct msharefs_info {
>  static const struct inode_operations msharefs_dir_inode_ops;
>  static const struct inode_operations msharefs_file_inode_ops;
>  
> +/* Returns holding the host mm's lock for read.  Caller must release. */
> +vm_fault_t
> +find_shared_vma(struct vm_area_struct **vmap, unsigned long *addrp)
> +{
> +	struct vm_area_struct *vma, *guest = *vmap;
> +	struct mshare_data *m_data = guest->vm_private_data;
> +	struct mm_struct *host_mm = m_data->mm;
> +	unsigned long host_addr;
> +	pgd_t *pgd, *guest_pgd;
> +
> +	mmap_read_lock(host_mm);

Hm. So we have current->mm locked here, right? So this is a nested mmap
lock. Have you tested it under lockdep? I expected it to complain.

> +	host_addr = *addrp - guest->vm_start + host_mm->mmap_base;
> +	pgd = pgd_offset(host_mm, host_addr);
> +	guest_pgd = pgd_offset(guest->vm_mm, *addrp);
> +	if (!pgd_same(*guest_pgd, *pgd)) {
> +		set_pgd(guest_pgd, *pgd);
> +		mmap_read_unlock(host_mm);
> +		return VM_FAULT_NOPAGE;
> +	}
> +
> +	*addrp = host_addr;
> +	vma = find_vma(host_mm, host_addr);
> +
> +	/* XXX: expand stack? */
> +	if (vma && vma->vm_start > host_addr)
> +		vma = NULL;
> +
> +	*vmap = vma;
> +
> +	/*
> +	 * release host mm lock unless a matching vma is found
> +	 */
> +	if (!vma)
> +		mmap_read_unlock(host_mm);
> +	return 0;
> +}
> +
>  /*
>   * Disallow partial unmaps of an mshare region for now. Unmapping at
>   * boundaries aligned to the level page tables are shared at could
> -- 
> 2.43.5
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 09/10] mm: create __do_mmap() to take an mm_struct * arg
  2024-09-03 23:22 ` [RFC PATCH v3 09/10] mm: create __do_mmap() to take an mm_struct * arg Anthony Yznaga
@ 2024-10-07  8:44   ` Kirill A. Shutemov
  2024-10-07 17:46     ` Anthony Yznaga
  0 siblings, 1 reply; 38+ messages in thread
From: Kirill A. Shutemov @ 2024-10-07  8:44 UTC (permalink / raw)
  To: Anthony Yznaga
  Cc: akpm, willy, markhemm, viro, david, khalid, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz

On Tue, Sep 03, 2024 at 04:22:40PM -0700, Anthony Yznaga wrote:
> In preparation for mapping objects into an mshare region, create
> __do_mmap() to allow mapping into a specified mm. There are no
> functional changes otherwise.
> 
> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
> ---
>  include/linux/mm.h | 18 +++++++++++++++++-
>  mm/mmap.c          | 12 +++++-------
>  2 files changed, 22 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 3aa0b3322284..a9afbda73cb0 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3409,11 +3409,27 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
>  
>  extern unsigned long mmap_region(struct file *file, unsigned long addr,
>  	unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> -	struct list_head *uf);
> +	struct list_head *uf, struct mm_struct *mm);
> +#ifdef CONFIG_MMU
> +extern unsigned long __do_mmap(struct file *file, unsigned long addr,
> +	unsigned long len, unsigned long prot, unsigned long flags,
> +	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
> +	struct list_head *uf, struct mm_struct *mm);
> +static inline unsigned long do_mmap(struct file *file, unsigned long addr,
> +	unsigned long len, unsigned long prot, unsigned long flags,
> +	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
> +	struct list_head *uf)
> +{
> +	return __do_mmap(file, addr, len, prot, flags, vm_flags, pgoff,
> +			 populate, uf, current->mm);
> +}
> +#else
>  extern unsigned long do_mmap(struct file *file, unsigned long addr,
>  	unsigned long len, unsigned long prot, unsigned long flags,
>  	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
>  	struct list_head *uf);
> +#endif
> +
>  extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
>  			 unsigned long start, size_t len, struct list_head *uf,
>  			 bool unlock);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d0dfc85b209b..4112f18e7302 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1250,15 +1250,14 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
>  }
>  
>  /*
> - * The caller must write-lock current->mm->mmap_lock.
> + * The caller must write-lock mm->mmap_lock.
>   */
> -unsigned long do_mmap(struct file *file, unsigned long addr,
> +unsigned long __do_mmap(struct file *file, unsigned long addr,
>  			unsigned long len, unsigned long prot,
>  			unsigned long flags, vm_flags_t vm_flags,
>  			unsigned long pgoff, unsigned long *populate,
> -			struct list_head *uf)
> +			struct list_head *uf, struct mm_struct *mm)

The argument list is getting too long. At some point we need to have a struct to
pass them around.
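
Something like this, purely as an illustration (struct name and layout
invented here):

	struct mmap_args {
		struct file		*file;
		unsigned long		addr;
		unsigned long		len;
		unsigned long		prot;
		unsigned long		flags;
		vm_flags_t		vm_flags;
		unsigned long		pgoff;
		unsigned long		*populate;
		struct list_head	*uf;
		struct mm_struct	*mm;
	};

	unsigned long __do_mmap(struct mmap_args *args);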

>  {
> -	struct mm_struct *mm = current->mm;
>  	int pkey = 0;
>  
>  	*populate = 0;
> @@ -1465,7 +1464,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  			vm_flags |= VM_NORESERVE;
>  	}
>  
> -	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
> +	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, mm);
>  	if (!IS_ERR_VALUE(addr) &&
>  	    ((vm_flags & VM_LOCKED) ||
>  	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
> @@ -2848,9 +2847,8 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>  
>  unsigned long mmap_region(struct file *file, unsigned long addr,
>  		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> -		struct list_head *uf)
> +		struct list_head *uf, struct mm_struct *mm)
>  {
> -	struct mm_struct *mm = current->mm;
>  	struct vm_area_struct *vma = NULL;
>  	struct vm_area_struct *next, *prev, *merge;
>  	pgoff_t pglen = len >> PAGE_SHIFT;
> -- 
> 2.43.5
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 00/10] Add support for shared PTEs across processes
  2024-10-02 17:35 ` [RFC PATCH v3 00/10] Add support for shared PTEs across processes Dave Hansen
  2024-10-02 19:30   ` Anthony Yznaga
@ 2024-10-07  8:44   ` David Hildenbrand
  2024-10-07 15:58     ` Dave Hansen
  2024-10-07  8:48   ` David Hildenbrand
  2 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2024-10-07  8:44 UTC (permalink / raw)
  To: Dave Hansen, Anthony Yznaga, akpm, willy, markhemm, viro, khalid
  Cc: andreyknvl, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz, David Rientjes

On 02.10.24 19:35, Dave Hansen wrote:
> We were just chatting about this on David Rientjes's MM alignment call.

Unfortunately I was not able to attend this time, my body decided it's a 
good idea to stay in bed for a couple of days.

> I thought I'd try to give a little brain dump.
> 
> Let's start by thinking about KVM and secondary MMUs.  KVM has a primary
> mm: the QEMU (or whatever) process mm.  The virtualization (EPT/NPT)
> tables get entries that effectively mirror the primary mm page tables
> and constitute a secondary MMU.  If the primary page tables change,
> mmu_notifiers ensure that the changes get reflected into the
> virtualization tables and also that the virtualization paging structure
> caches are flushed.
> 
> msharefs is doing something very similar.  But, in the msharefs case,
> the secondary MMUs are actually normal CPU MMUs.  The page tables are
> normal old page tables and the caches are the normal old TLB.  That's
> what makes it so confusing: we have lots of infrastructure for dealing
> with that "stuff" (CPU page tables and TLB), but msharefs has
> short-circuited the infrastructure and it doesn't work any more.

It's quite different IMHO, to a degree that I believe they are different 
beasts:

Secondary MMUs:
* "Belongs" to same MM context and the primary MMU (process page tables)
* Maintains separate tables/PTEs, in completely separate page table
   hierarchy
* Notifiers make sure the secondary structure stays in sync (update
   PTEs, flush TLB)

mshare:
* Possibly mapped by many different MMs. Likely nothing stops us from
  having one MM map multiple different mshare fds.
* Updating the PTEs directly affects all other MM page table structures
   (and possibly any secondary MMUs! scary)


I better not think about the complexity of secondary MMUs + mshare (e.g., 
KVM with mshare in guest memory): MMU notifiers for all MMs must be 
called ...


> 
> Basically, I think it makes a lot of sense to check what KVM (or another
> mmu_notifier user) is doing and make sure that msharefs is following its
> lead.  For instance, KVM _should_ have the exact same "page free"
> flushing issue where it gets the MMU notifier call but the page may
> still be in the secondary MMU.  I _think_ KVM fixes it with an extra
> page refcount that it takes when it first walks the primary page tables.
> 
> But the short of it is that the msharefs host mm represents a "secondary
> MMU".  I don't think it is really that special of an MMU other than the
> fact that it has an mm_struct.

Not sure I agree ... IMHO these are two orthogonal things. Unless we 
want MMU notifiers to "update" MM primary MMUs (there is not really 
anything to update ...), but not sure if that is what we are looking for.

What you note about TLB flushing in the other mail makes sense, not sure 
how this interacts with any secondary MMUs ....

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 00/10] Add support for shared PTEs across processes
  2024-10-02 17:35 ` [RFC PATCH v3 00/10] Add support for shared PTEs across processes Dave Hansen
  2024-10-02 19:30   ` Anthony Yznaga
  2024-10-07  8:44   ` David Hildenbrand
@ 2024-10-07  8:48   ` David Hildenbrand
  2 siblings, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2024-10-07  8:48 UTC (permalink / raw)
  To: Dave Hansen, Anthony Yznaga, akpm, willy, markhemm, viro, khalid
  Cc: andreyknvl, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz, David Rientjes

On 02.10.24 19:35, Dave Hansen wrote:
> We were just chatting about this on David Rientjes's MM alignment call.
> I thought I'd try to give a little brain dump.
> 
> Let's start by thinking about KVM and secondary MMUs.  KVM has a primary
> mm: the QEMU (or whatever) process mm.  The virtualization (EPT/NPT)
> tables get entries that effectively mirror the primary mm page tables
> and constitute a secondary MMU.  If the primary page tables change,
> mmu_notifiers ensure that the changes get reflected into the
> virtualization tables and also that the virtualization paging structure
> caches are flushed.
> 
> msharefs is doing something very similar.  But, in the msharefs case,
> the secondary MMUs are actually normal CPU MMUs.  The page tables are
> normal old page tables and the caches are the normal old TLB.  That's
> what makes it so confusing: we have lots of infrastructure for dealing
> with that "stuff" (CPU page tables and TLB), but msharefs has
> short-circuited the infrastructure and it doesn't work any more.
> 
> Basically, I think it makes a lot of sense to check what KVM (or another
> mmu_notifier user) is doing and make sure that msharefs is following its
> lead.  For instance, KVM _should_ have the exact same "page free"
> flushing issue where it gets the MMU notifier call but the page may
> still be in the secondary MMU.  I _think_ KVM fixes it with an extra
> page refcount that it takes when it first walks the primary page tables.

Forgot to comment on this (brain still recovering ...).

KVM only grabs a temporary reference, and drops that reference once the 
secondary MMU PTE was updated (present PTE installed). The notifiers on 
primary MMU changes (e.g., unmap) take care of any TLB invalidation 
before the primary MMU lets go of the page and the refcount is dropped.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 00/10] Add support for shared PTEs across processes
  2024-09-03 23:22 [RFC PATCH v3 00/10] Add support for shared PTEs across processes Anthony Yznaga
                   ` (10 preceding siblings ...)
  2024-10-02 17:35 ` [RFC PATCH v3 00/10] Add support for shared PTEs across processes Dave Hansen
@ 2024-10-07  9:01 ` Kirill A. Shutemov
  2024-10-07 19:23   ` Anthony Yznaga
  2024-10-14 20:07 ` Jann Horn
  12 siblings, 1 reply; 38+ messages in thread
From: Kirill A. Shutemov @ 2024-10-07  9:01 UTC (permalink / raw)
  To: Anthony Yznaga
  Cc: akpm, willy, markhemm, viro, david, khalid, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz

On Tue, Sep 03, 2024 at 04:22:31PM -0700, Anthony Yznaga wrote:
> This patch series implements a mechanism that allows userspace
> processes to opt into sharing PTEs. It adds a new in-memory
> filesystem - msharefs. A file created on msharefs represents a
> shared region where all processes mapping that region will map
> objects within it with shared PTEs. When the file is created,
> a new host mm struct is created to hold the shared page tables
> and vmas for objects later mapped into the shared region. This
> host mm struct is associated with the file and not with a task.

A taskless mm_struct can be problematic. Like, we don't have access to its
counters because it is not represented in /proc. For instance, there's no
way to check its smaps.

Also, I *think* it is immune to the oom-killer because the oom-killer looks
for a victim task, not an mm.
I hope it is not an intended feature :P

> When a process mmap's the shared region, a vm flag VM_SHARED_PT
> is added to the vma. On page fault the vma is checked for the
> presence of the VM_SHARED_PT flag.

I think it is the wrong approach.

Instead of spraying VM_SHARED_PT checks across core-mm, we need to add
generic hooks that can be used by mshare and hugetlb. And remove the
is_vm_hugetlb_page() check from core-mm along the way.

BTW, the is_vm_hugetlb_page() callsites seem to be a good indicator of
where mshare has to do something differently. I feel you are missing a lot
of such cases.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 06/10] mm/mshare: Add vm flag for shared PTEs
  2024-09-03 23:40   ` James Houghton
  2024-09-03 23:58     ` Anthony Yznaga
@ 2024-10-07 10:24     ` David Hildenbrand
  2024-10-07 23:03       ` Anthony Yznaga
  1 sibling, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2024-10-07 10:24 UTC (permalink / raw)
  To: James Houghton, Anthony Yznaga
  Cc: akpm, willy, markhemm, viro, khalid, andreyknvl, dave.hansen,
	luto, brauner, arnd, ebiederm, catalin.marinas, linux-arch,
	linux-kernel, linux-mm, mhiramat, rostedt, vasily.averin, xhao,
	pcc, neilb, maz

On 04.09.24 01:40, James Houghton wrote:
> On Tue, Sep 3, 2024 at 4:23 PM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>>
>> From: Khalid Aziz <khalid@kernel.org>
>>
>> Add a bit to vm_flags to indicate a vma shares PTEs with others. Add
>> a function to determine if a vma shares PTEs by checking this flag.
>> This is to be used to find the shared page table entries on page fault
>> for vmas sharing PTEs.
>>
>> Signed-off-by: Khalid Aziz <khalid@kernel.org>
>> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
>> ---
>>   include/linux/mm.h             | 7 +++++++
>>   include/trace/events/mmflags.h | 3 +++
>>   mm/internal.h                  | 5 +++++
>>   3 files changed, 15 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 6549d0979b28..3aa0b3322284 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -413,6 +413,13 @@ extern unsigned int kobjsize(const void *objp);
>>   #define VM_DROPPABLE           VM_NONE
>>   #endif
>>
>> +#ifdef CONFIG_64BIT
>> +#define VM_SHARED_PT_BIT       41
>> +#define VM_SHARED_PT           BIT(VM_SHARED_PT_BIT)
>> +#else
>> +#define VM_SHARED_PT           VM_NONE
>> +#endif
>> +
>>   #ifdef CONFIG_64BIT
>>   /* VM is sealed, in vm_flags */
>>   #define VM_SEALED      _BITUL(63)
>> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
>> index b63d211bd141..e1ae1e60d086 100644
>> --- a/include/trace/events/mmflags.h
>> +++ b/include/trace/events/mmflags.h
>> @@ -167,8 +167,10 @@ IF_HAVE_PG_ARCH_X(arch_3)
>>
>>   #ifdef CONFIG_64BIT
>>   # define IF_HAVE_VM_DROPPABLE(flag, name) {flag, name},
>> +# define IF_HAVE_VM_SHARED_PT(flag, name) {flag, name},
>>   #else
>>   # define IF_HAVE_VM_DROPPABLE(flag, name)
>> +# define IF_HAVE_VM_SHARED_PT(flag, name)
>>   #endif
>>
>>   #define __def_vmaflag_names                                            \
>> @@ -204,6 +206,7 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,  "softdirty"     )               \
>>          {VM_HUGEPAGE,                   "hugepage"      },              \
>>          {VM_NOHUGEPAGE,                 "nohugepage"    },              \
>>   IF_HAVE_VM_DROPPABLE(VM_DROPPABLE,     "droppable"     )               \
>> +IF_HAVE_VM_SHARED_PT(VM_SHARED_PT,     "sharedpt"      )               \
>>          {VM_MERGEABLE,                  "mergeable"     }               \
>>
>>   #define show_vma_flags(flags)                                          \
>> diff --git a/mm/internal.h b/mm/internal.h
>> index b4d86436565b..8005d5956b6e 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -1578,4 +1578,9 @@ void unlink_file_vma_batch_init(struct unlink_vma_file_batch *);
>>   void unlink_file_vma_batch_add(struct unlink_vma_file_batch *, struct vm_area_struct *);
>>   void unlink_file_vma_batch_final(struct unlink_vma_file_batch *);
>>
> 
> Hi Anthony,
> 
> I'm really excited to see this series on the mailing list again! :) I
> won't have time to review this series in too much detail, but I hope
> something like it gets merged eventually.
> 
>> +static inline bool vma_is_shared(const struct vm_area_struct *vma)
>> +{
>> +       return VM_SHARED_PT && (vma->vm_flags & VM_SHARED_PT);
>> +}
> 
> Tiny comment - I find vma_is_shared() to be a bit of a confusing name,
> especially given how vma_is_shared_maywrite() is defined. (Sorry if
> this has already been discussed before.)
> 
> How about vma_is_shared_pt()?

vma_is_mshare() ? ;)

The whole "shared PT / shared PTE" is a bit misleading IMHO and a bit 
too dominant in the series. Yes, we're sharing PTEs/page tables, but the 
main point is that a single mshare VMA might cover multiple different 
VMAs (in a different process).

I would describe mshare VMAs as being something that shares page tables 
with another MM, BUT, also that the VMA is a container and what exactly 
the *actual* VMAs in there are (including holes), only the owner knows.

E.g., is_vm_hugetlb_page() might be *false* for an mshare VMA, but there 
might be hugetlb folios mapped into the page tables, described by a 
is_vm_hugetlb_page() VMA in the owner MM.

So again, it's not just "sharing page tables".

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 00/10] Add support for shared PTEs across processes
  2024-10-07  8:44   ` David Hildenbrand
@ 2024-10-07 15:58     ` Dave Hansen
  2024-10-07 16:27       ` David Hildenbrand
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2024-10-07 15:58 UTC (permalink / raw)
  To: David Hildenbrand, Anthony Yznaga, akpm, willy, markhemm, viro,
	khalid
  Cc: andreyknvl, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz, David Rientjes

On 10/7/24 01:44, David Hildenbrand wrote:
> On 02.10.24 19:35, Dave Hansen wrote:
>> We were just chatting about this on David Rientjes's MM alignment call.
> 
> Unfortunately I was not able to attend this time, my body decided it's a
> good idea to stay in bed for a couple of days.
> 
>> I thought I'd try to give a little brain dump.
>>
>> Let's start by thinking about KVM and secondary MMUs.  KVM has a primary
>> mm: the QEMU (or whatever) process mm.  The virtualization (EPT/NPT)
>> tables get entries that effectively mirror the primary mm page tables
>> and constitute a secondary MMU.  If the primary page tables change,
>> mmu_notifiers ensure that the changes get reflected into the
>> virtualization tables and also that the virtualization paging structure
>> caches are flushed.
>>
>> msharefs is doing something very similar.  But, in the msharefs case,
>> the secondary MMUs are actually normal CPU MMUs.  The page tables are
>> normal old page tables and the caches are the normal old TLB.  That's
>> what makes it so confusing: we have lots of infrastructure for dealing
>> with that "stuff" (CPU page tables and TLB), but msharefs has
>> short-circuited the infrastructure and it doesn't work any more.
> 
> It's quite different IMHO, to a degree that I believe they are different
> beasts:
> 
> Secondary MMUs:
> * "Belongs" to same MM context and the primary MMU (process page tables)

I think you're speaking to the ratio here.  For each secondary MMU, I
think you're saying that there's one and only one mm_struct.  Is that right?

> * Maintains separate tables/PTEs, in completely separate page table
>   hierarchy

This is the case for KVM and the VMX/SVM MMUs, but it's not generally
true about hardware.  IOMMUs can walk x86 page tables and populate the
IOTLB from the _same_ page table hierarchy as the CPU.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 00/10] Add support for shared PTEs across processes
  2024-10-07 15:58     ` Dave Hansen
@ 2024-10-07 16:27       ` David Hildenbrand
  2024-10-07 16:45         ` Sean Christopherson
  0 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2024-10-07 16:27 UTC (permalink / raw)
  To: Dave Hansen, Anthony Yznaga, akpm, willy, markhemm, viro, khalid
  Cc: andreyknvl, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz, David Rientjes

On 07.10.24 17:58, Dave Hansen wrote:
> On 10/7/24 01:44, David Hildenbrand wrote:
>> On 02.10.24 19:35, Dave Hansen wrote:
>>> We were just chatting about this on David Rientjes's MM alignment call.
>>
>> Unfortunately I was not able to attend this time, my body decided it's a
>> good idea to stay in bed for a couple of days.
>>
>>> I thought I'd try to give a little brain dump.
>>>
>>> Let's start by thinking about KVM and secondary MMUs.  KVM has a primary
>>> mm: the QEMU (or whatever) process mm.  The virtualization (EPT/NPT)
>>> tables get entries that effectively mirror the primary mm page tables
>>> and constitute a secondary MMU.  If the primary page tables change,
>>> mmu_notifiers ensure that the changes get reflected into the
>>> virtualization tables and also that the virtualization paging structure
>>> caches are flushed.
>>>
>>> msharefs is doing something very similar.  But, in the msharefs case,
>>> the secondary MMUs are actually normal CPU MMUs.  The page tables are
>>> normal old page tables and the caches are the normal old TLB.  That's
>>> what makes it so confusing: we have lots of infrastructure for dealing
>>> with that "stuff" (CPU page tables and TLB), but msharefs has
>>> short-circuited the infrastructure and it doesn't work any more.
>>
>> It's quite different IMHO, to a degree that I believe they are different
>> beasts:
>>
>> Secondary MMUs:
>> * "Belongs" to same MM context and the primary MMU (process page tables)
> 
> I think you're speaking to the ratio here.  For each secondary MMU, I
> think you're saying that there's one and only one mm_struct.  Is that right?

Yes, that is my understanding (at least with KVM). It's a secondary MMU 
derived from exactly one primary MMU (MM context -> page table hierarchy).

The sophisticated ( :) ) notifier mechanism when updating the primary 
MMU will result in keeping the secondary MMU in sync (of course, what to 
sync, and how to, depends in KVM on the memory slots that define how the 
guest physical memory layout is derived from the process virtual address 
space).

> 
>> * Maintains separate tables/PTEs, in completely separate page table
>>    hierarchy
> 
> This is the case for KVM and the VMX/SVM MMUs, but it's not generally
> true about hardware.  IOMMUs can walk x86 page tables and populate the
> IOTLB from the _same_ page table hierarchy as the CPU.

Yes, of course.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 00/10] Add support for shared PTEs across processes
  2024-10-07 16:27       ` David Hildenbrand
@ 2024-10-07 16:45         ` Sean Christopherson
  2024-10-08  1:37           ` Anthony Yznaga
  0 siblings, 1 reply; 38+ messages in thread
From: Sean Christopherson @ 2024-10-07 16:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dave Hansen, Anthony Yznaga, akpm, willy, markhemm, viro, khalid,
	andreyknvl, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz, David Rientjes

On Mon, Oct 07, 2024, David Hildenbrand wrote:
> On 07.10.24 17:58, Dave Hansen wrote:
> > On 10/7/24 01:44, David Hildenbrand wrote:
> > > On 02.10.24 19:35, Dave Hansen wrote:
> > > > We were just chatting about this on David Rientjes's MM alignment call.
> > > 
> > > Unfortunately I was not able to attend this time, my body decided it's a
> > > good idea to stay in bed for a couple of days.
> > > 
> > > > I thought I'd try to give a little brain dump.
> > > > 
> > > > Let's start by thinking about KVM and secondary MMUs.  KVM has a primary
> > > > mm: the QEMU (or whatever) process mm.  The virtualization (EPT/NPT)
> > > > tables get entries that effectively mirror the primary mm page tables
> > > > and constitute a secondary MMU.  If the primary page tables change,
> > > > mmu_notifiers ensure that the changes get reflected into the
> > > > virtualization tables and also that the virtualization paging structure
> > > > caches are flushed.
> > > > 
> > > > msharefs is doing something very similar.  But, in the msharefs case,
> > > > the secondary MMUs are actually normal CPU MMUs.  The page tables are
> > > > normal old page tables and the caches are the normal old TLB.  That's
> > > > what makes it so confusing: we have lots of infrastructure for dealing
> > > > with that "stuff" (CPU page tables and TLB), but msharefs has
> > > > short-circuited the infrastructure and it doesn't work any more.
> > > 
> > > It's quite different IMHO, to a degree that I believe they are different
> > > beasts:
> > > 
> > > Secondary MMUs:
> > > * "Belongs" to same MM context and the primary MMU (process page tables)
> > 
> > I think you're speaking to the ratio here.  For each secondary MMU, I
> > think you're saying that there's one and only one mm_struct.  Is that right?
> 
> Yes, that is my understanding (at least with KVM). It's a secondary MMU
> derived from exactly one primary MMU (MM context -> page table hierarchy).

I don't think the ratio is what's important.  I think the important takeaway is
that the secondary MMU is explicitly tied to the primary MMU that it is tracking.
This is enforced in code, as the list of mmu_notifiers is stored in mm_struct.

The 1:1 ratio probably holds true today, e.g. for KVM, each VM is associated with
exactly one mm_struct.  But fundamentally, nothing would prevent a secondary MMU
that manages a so called software TLB from tracking multiple primary MMUs.

E.g. it wouldn't be all that hard to implement in KVM (a bit crazy, but not hard),
because KVM's memslots disallow gfn aliases, i.e. each index into KVM's secondary
MMU would be associated with at most one VMA and thus mm_struct.

Pulling Dave's earlier comment in:

 : But the short of it is that the msharefs host mm represents a "secondary
 : MMU".  I don't think it is really that special of an MMU other than the
 : fact that it has an mm_struct.

and David's (so. many. Davids):

 : I better not think about the complexity of secondary MMUs + mshare (e.g.,
 : KVM with mshare in guest memory): MMU notifiers for all MMs must be
 : called ...

mshare() is unique because it creates the possibility of chained "secondary" MMUs.
I.e. the fact that it has an mm_struct makes it *very* special, IMO.

> > > * Maintains separate tables/PTEs, in completely separate page table
> > >    hierarchy
> > 
> > This is the case for KVM and the VMX/SVM MMUs, but it's not generally
> > true about hardware.  IOMMUs can walk x86 page tables and populate the
> > IOTLB from the _same_ page table hierarchy as the CPU.
> 
> Yes, of course.

Yeah, the recent rework of invalidate_range() => arch_invalidate_secondary_tlbs()
sums things up nicely:

commit 1af5a8109904b7f00828e7f9f63f5695b42f8215
Author:     Alistair Popple <apopple@nvidia.com>
AuthorDate: Tue Jul 25 23:42:07 2023 +1000
Commit:     Andrew Morton <akpm@linux-foundation.org>
CommitDate: Fri Aug 18 10:12:41 2023 -0700

    mmu_notifiers: rename invalidate_range notifier
    
    There are two main use cases for mmu notifiers.  One is by KVM which uses
    mmu_notifier_invalidate_range_start()/end() to manage a software TLB.
    
    The other is to manage hardware TLBs which need to use the
    invalidate_range() callback because HW can establish new TLB entries at
    any time.  Hence using start/end() can lead to memory corruption as these
    callbacks happen too soon/late during page unmap.
    
    mmu notifier users should therefore either use the start()/end() callbacks
    or the invalidate_range() callbacks.  To make this usage clearer rename
    the invalidate_range() callback to arch_invalidate_secondary_tlbs() and
    update documention.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 08/10] mm/mshare: Add basic page table sharing support
  2024-10-07  8:41   ` Kirill A. Shutemov
@ 2024-10-07 17:45     ` Anthony Yznaga
  0 siblings, 0 replies; 38+ messages in thread
From: Anthony Yznaga @ 2024-10-07 17:45 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, willy, markhemm, viro, david, khalid, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz


On 10/7/24 1:41 AM, Kirill A. Shutemov wrote:
> On Tue, Sep 03, 2024 at 04:22:39PM -0700, Anthony Yznaga wrote:
>> From: Khalid Aziz <khalid@kernel.org>
>>
>> Add support for handling page faults in an mshare region by
>> redirecting the faults to operate on the mshare mm_struct and
>> vmas contained in it and to link the page tables of the faulting
>> process with the shared page tables in the mshare mm.
>> Modify the unmap path to ensure that page tables in mshare regions
>> are kept intact when a process exits. Note that they are also
>> kept intact and not unlinked from a process when an mshare region
>> is explicitly unmapped which is a bug to be addressed.
>>
>> Signed-off-by: Khalid Aziz <khalid@kernel.org>
>> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
>> ---
>>   mm/internal.h |  1 +
>>   mm/memory.c   | 62 ++++++++++++++++++++++++++++++++++++++++++++++++---
>>   mm/mshare.c   | 38 +++++++++++++++++++++++++++++++
>>   3 files changed, 98 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 8005d5956b6e..8ac224d96806 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -1578,6 +1578,7 @@ void unlink_file_vma_batch_init(struct unlink_vma_file_batch *);
>>   void unlink_file_vma_batch_add(struct unlink_vma_file_batch *, struct vm_area_struct *);
>>   void unlink_file_vma_batch_final(struct unlink_vma_file_batch *);
>>   
>> +extern vm_fault_t find_shared_vma(struct vm_area_struct **vma, unsigned long *addrp);
>>   static inline bool vma_is_shared(const struct vm_area_struct *vma)
>>   {
>>   	return VM_SHARED_PT && (vma->vm_flags & VM_SHARED_PT);
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 3c01d68065be..f526aef71a61 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -387,11 +387,15 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
>>   			vma_start_write(vma);
>>   		unlink_anon_vmas(vma);
>>   
>> +		/*
>> +		 * There is no page table to be freed for vmas that
>> +		 * are mapped in mshare regions
>> +		 */
>>   		if (is_vm_hugetlb_page(vma)) {
>>   			unlink_file_vma(vma);
>>   			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
>>   				floor, next ? next->vm_start : ceiling);
>> -		} else {
>> +		} else if (!vma_is_shared(vma)) {
>>   			unlink_file_vma_batch_init(&vb);
>>   			unlink_file_vma_batch_add(&vb, vma);
>>   
>> @@ -399,7 +403,8 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
>>   			 * Optimization: gather nearby vmas into one call down
>>   			 */
>>   			while (next && next->vm_start <= vma->vm_end + PMD_SIZE
>> -			       && !is_vm_hugetlb_page(next)) {
>> +			       && !is_vm_hugetlb_page(next)
>> +			       && !vma_is_shared(next)) {
>>   				vma = next;
>>   				next = mas_find(mas, ceiling - 1);
>>   				if (unlikely(xa_is_zero(next)))
>> @@ -412,7 +417,9 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
>>   			unlink_file_vma_batch_final(&vb);
>>   			free_pgd_range(tlb, addr, vma->vm_end,
>>   				floor, next ? next->vm_start : ceiling);
>> -		}
>> +		} else
>> +			unlink_file_vma(vma);
>> +
>>   		vma = next;
> I would rather have vma->vm_ops->free_pgtables() hook that would be defined
> to non-NULL for mshared and hugetlb VMAs

That's a good idea. I'll do that.
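
Roughly along these lines, I think (sketch only; no such hook exists yet and
the signature is provisional):

/*
 * Sketch of a possible vm_ops->free_pgtables() hook; names and
 * signature are provisional.  free_pgtables() would do something like
 *
 *	if (vma->vm_ops && vma->vm_ops->free_pgtables)
 *		vma->vm_ops->free_pgtables(tlb, vma, floor, ceiling);
 *	else
 *		... existing default path ...
 *
 * instead of open-coding the hugetlb and mshare special cases.
 */
static void mshare_free_pgtables(struct mmu_gather *tlb,
				 struct vm_area_struct *vma,
				 unsigned long floor, unsigned long ceiling)
{
	/* Page tables belong to the host mm; only drop the rmap linkage. */
	unlink_file_vma(vma);
}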


>
>>   	} while (vma);
>>   }
>> @@ -1797,6 +1804,13 @@ void unmap_page_range(struct mmu_gather *tlb,
>>   	pgd_t *pgd;
>>   	unsigned long next;
>>   
>> +	/*
>> +	 * No need to unmap vmas that share page table through
>> +	 * mshare region
>> +	 */
>> +	 if (vma_is_shared(vma))
>> +		return;
>> +
> Ditto. vma->vm_ops->unmap_page_range().

Okay, I can do that here, too.


>
>>   	BUG_ON(addr >= end);
>>   	tlb_start_vma(tlb, vma);
>>   	pgd = pgd_offset(vma->vm_mm, addr);
>> @@ -5801,6 +5815,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>   	struct mm_struct *mm = vma->vm_mm;
>>   	vm_fault_t ret;
>>   	bool is_droppable;
>> +	bool shared = false;
>>   
>>   	__set_current_state(TASK_RUNNING);
>>   
>> @@ -5808,6 +5823,21 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>   	if (ret)
>>   		goto out;
>>   
>> +	if (unlikely(vma_is_shared(vma))) {
>> +		/* XXX mshare does not support per-VMA locks yet so fallback to mm lock */
>> +		if (flags & FAULT_FLAG_VMA_LOCK) {
>> +			vma_end_read(vma);
>> +			return VM_FAULT_RETRY;
>> +		}
>> +
>> +		ret = find_shared_vma(&vma, &address);
>> +		if (ret)
>> +			return ret;
>> +		if (!vma)
>> +			return VM_FAULT_SIGSEGV;
>> +		shared = true;
> Do we need to update 'mm' variable here?
>
> It is going to be used to account the fault below. Not sure which mm has
> to account such faults.

The accounting won't work right for memcg accounting, and there's a bug 
here. The mshare mm is allocated via mm_alloc() which will initialize 
mm->owner to current. As long as that task is around, 
count_memcg_event_mm() will go through it to get a memcg to account to. 
But if the task has exited, the mshare mm owner is no longer valid but 
will still be used. I will just clear the owner for now.
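
Roughly (sketch only; the helper name is made up):

/* Sketch of the stopgap described above; mshare_alloc_mm() is made up. */
static struct mm_struct *mshare_alloc_mm(void)
{
	struct mm_struct *mm = mm_alloc();

	if (!mm)
		return NULL;
#ifdef CONFIG_MEMCG
	/*
	 * mm_alloc() set mm->owner to current; clear it so memcg fault
	 * accounting never dereferences a task that may have exited.
	 */
	mm->owner = NULL;
#endif
	return mm;
}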


And a quick note on calling find_shared_vma() here. I found I needed to 
move the call earlier to do_user_addr_fault() because there are 
permission checks there that check vma flags, and they need to be done 
against the vmas in the mshare mm.


>
>> +	}
>> +
>>   	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>>   					    flags & FAULT_FLAG_INSTRUCTION,
>>   					    flags & FAULT_FLAG_REMOTE)) {
>> @@ -5843,6 +5873,32 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>   	if (is_droppable)
>>   		ret &= ~VM_FAULT_OOM;
>>   
>> +	/*
>> +	 * Release the read lock on the shared mm of a shared VMA unless
>> +	 * unless the lock has already been released.
>> +	 * The mmap lock will already have been released if __handle_mm_fault
>> +	 * returns VM_FAULT_COMPLETED or if it returns VM_FAULT_RETRY and
>> +	 * the flags FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT are
>> +	 * _not_ both set.
>> +	 * If the lock was released earlier, release the lock on the task's
>> +	 * mm now to keep lock state consistent.
>> +	 */
>> +	if (shared) {
>> +		int release_mmlock = 1;
>> +
>> +		if ((ret & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)) == 0) {
>> +			mmap_read_unlock(vma->vm_mm);
>> +			release_mmlock = 0;
>> +		} else if ((flags & FAULT_FLAG_ALLOW_RETRY) &&
>> +				(flags & FAULT_FLAG_RETRY_NOWAIT)) {
>> +			mmap_read_unlock(vma->vm_mm);
>> +			release_mmlock = 0;
>> +		}
>> +
>> +		if (release_mmlock)
>> +			mmap_read_unlock(mm);
>> +	}
>> +
>>   	if (flags & FAULT_FLAG_USER) {
>>   		mem_cgroup_exit_user_fault();
>>   		/*
>> diff --git a/mm/mshare.c b/mm/mshare.c
>> index f3f6ed9c3761..8f47c8d6e6a4 100644
>> --- a/mm/mshare.c
>> +++ b/mm/mshare.c
>> @@ -19,6 +19,7 @@
>>   #include <linux/spinlock_types.h>
>>   #include <uapi/linux/magic.h>
>>   #include <uapi/linux/msharefs.h>
>> +#include "internal.h"
>>   
>>   struct mshare_data {
>>   	struct mm_struct *mm;
>> @@ -33,6 +34,43 @@ struct msharefs_info {
>>   static const struct inode_operations msharefs_dir_inode_ops;
>>   static const struct inode_operations msharefs_file_inode_ops;
>>   
>> +/* Returns holding the host mm's lock for read.  Caller must release. */
>> +vm_fault_t
>> +find_shared_vma(struct vm_area_struct **vmap, unsigned long *addrp)
>> +{
>> +	struct vm_area_struct *vma, *guest = *vmap;
>> +	struct mshare_data *m_data = guest->vm_private_data;
>> +	struct mm_struct *host_mm = m_data->mm;
>> +	unsigned long host_addr;
>> +	pgd_t *pgd, *guest_pgd;
>> +
>> +	mmap_read_lock(host_mm);
> Hm. So we have current->mm locked here, right? So this is nested mmap
> lock. Have you tested it under lockdep? I expected it to complain.

Yes, it complains. I have patches to introduce and use 
mmap_read_lock_nested().
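
For reference, it's essentially the read-side analogue of
mmap_write_lock_nested() (sketch of the not-yet-posted helper):

/* Sketch of the helper; mirrors mmap_write_lock_nested() in mmap_lock.h. */
static inline void mmap_read_lock_nested(struct mm_struct *mm, int subclass)
{
	__mmap_lock_trace_start_locking(mm, false);
	down_read_nested(&mm->mmap_lock, subclass);
	__mmap_lock_trace_acquire_returned(mm, false, true);
}

/* find_shared_vma() would then take the host mm lock with, e.g.: */
/*	mmap_read_lock_nested(host_mm, SINGLE_DEPTH_NESTING);	*/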


Thank you for the feedback.


Anthony


>
>> +	host_addr = *addrp - guest->vm_start + host_mm->mmap_base;
>> +	pgd = pgd_offset(host_mm, host_addr);
>> +	guest_pgd = pgd_offset(guest->vm_mm, *addrp);
>> +	if (!pgd_same(*guest_pgd, *pgd)) {
>> +		set_pgd(guest_pgd, *pgd);
>> +		mmap_read_unlock(host_mm);
>> +		return VM_FAULT_NOPAGE;
>> +	}
>> +
>> +	*addrp = host_addr;
>> +	vma = find_vma(host_mm, host_addr);
>> +
>> +	/* XXX: expand stack? */
>> +	if (vma && vma->vm_start > host_addr)
>> +		vma = NULL;
>> +
>> +	*vmap = vma;
>> +
>> +	/*
>> +	 * release host mm lock unless a matching vma is found
>> +	 */
>> +	if (!vma)
>> +		mmap_read_unlock(host_mm);
>> +	return 0;
>> +}
>> +
>>   /*
>>    * Disallow partial unmaps of an mshare region for now. Unmapping at
>>    * boundaries aligned to the level page tables are shared at could
>> -- 
>> 2.43.5
>>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 09/10] mm: create __do_mmap() to take an mm_struct * arg
  2024-10-07  8:44   ` Kirill A. Shutemov
@ 2024-10-07 17:46     ` Anthony Yznaga
  0 siblings, 0 replies; 38+ messages in thread
From: Anthony Yznaga @ 2024-10-07 17:46 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, willy, markhemm, viro, david, khalid, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz


On 10/7/24 1:44 AM, Kirill A. Shutemov wrote:
> On Tue, Sep 03, 2024 at 04:22:40PM -0700, Anthony Yznaga wrote:
>> In preparation for mapping objects into an mshare region, create
>> __do_mmap() to allow mapping into a specified mm. There are no
>> functional changes otherwise.
>>
>> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
>> ---
>>   include/linux/mm.h | 18 +++++++++++++++++-
>>   mm/mmap.c          | 12 +++++-------
>>   2 files changed, 22 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 3aa0b3322284..a9afbda73cb0 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3409,11 +3409,27 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
>>   
>>   extern unsigned long mmap_region(struct file *file, unsigned long addr,
>>   	unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
>> -	struct list_head *uf);
>> +	struct list_head *uf, struct mm_struct *mm);
>> +#ifdef CONFIG_MMU
>> +extern unsigned long __do_mmap(struct file *file, unsigned long addr,
>> +	unsigned long len, unsigned long prot, unsigned long flags,
>> +	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
>> +	struct list_head *uf, struct mm_struct *mm);
>> +static inline unsigned long do_mmap(struct file *file, unsigned long addr,
>> +	unsigned long len, unsigned long prot, unsigned long flags,
>> +	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
>> +	struct list_head *uf)
>> +{
>> +	return __do_mmap(file, addr, len, prot, flags, vm_flags, pgoff,
>> +			 populate, uf, current->mm);
>> +}
>> +#else
>>   extern unsigned long do_mmap(struct file *file, unsigned long addr,
>>   	unsigned long len, unsigned long prot, unsigned long flags,
>>   	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
>>   	struct list_head *uf);
>> +#endif
>> +
>>   extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
>>   			 unsigned long start, size_t len, struct list_head *uf,
>>   			 bool unlock);
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index d0dfc85b209b..4112f18e7302 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -1250,15 +1250,14 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
>>   }
>>   
>>   /*
>> - * The caller must write-lock current->mm->mmap_lock.
>> + * The caller must write-lock mm->mmap_lock.
>>    */
>> -unsigned long do_mmap(struct file *file, unsigned long addr,
>> +unsigned long __do_mmap(struct file *file, unsigned long addr,
>>   			unsigned long len, unsigned long prot,
>>   			unsigned long flags, vm_flags_t vm_flags,
>>   			unsigned long pgoff, unsigned long *populate,
>> -			struct list_head *uf)
>> +			struct list_head *uf, struct mm_struct *mm)
> Argument list getting too long. At some point we need to have a struct to
> pass them around.

I'll look into doing that.
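
Something along these lines, perhaps (sketch only; the struct and field
names are made up):

/* Sketch only; the struct name and layout are made up. */
struct mmap_args {
	struct file		*file;
	unsigned long		addr;
	unsigned long		len;
	unsigned long		prot;
	unsigned long		flags;
	vm_flags_t		vm_flags;
	unsigned long		pgoff;
	unsigned long		*populate;
	struct list_head	*uf;
	struct mm_struct	*mm;	/* current->mm for do_mmap() callers */
};

unsigned long __do_mmap(const struct mmap_args *args);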


Anthony

>
>>   {
>> -	struct mm_struct *mm = current->mm;
>>   	int pkey = 0;
>>   
>>   	*populate = 0;
>> @@ -1465,7 +1464,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>>   			vm_flags |= VM_NORESERVE;
>>   	}
>>   
>> -	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
>> +	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, mm);
>>   	if (!IS_ERR_VALUE(addr) &&
>>   	    ((vm_flags & VM_LOCKED) ||
>>   	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
>> @@ -2848,9 +2847,8 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>>   
>>   unsigned long mmap_region(struct file *file, unsigned long addr,
>>   		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
>> -		struct list_head *uf)
>> +		struct list_head *uf, struct mm_struct *mm)
>>   {
>> -	struct mm_struct *mm = current->mm;
>>   	struct vm_area_struct *vma = NULL;
>>   	struct vm_area_struct *next, *prev, *merge;
>>   	pgoff_t pglen = len >> PAGE_SHIFT;
>> -- 
>> 2.43.5
>>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 00/10] Add support for shared PTEs across processes
  2024-10-07  9:01 ` Kirill A. Shutemov
@ 2024-10-07 19:23   ` Anthony Yznaga
  2024-10-07 19:41     ` David Hildenbrand
  0 siblings, 1 reply; 38+ messages in thread
From: Anthony Yznaga @ 2024-10-07 19:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, willy, markhemm, viro, david, khalid, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz


On 10/7/24 2:01 AM, Kirill A. Shutemov wrote:
> On Tue, Sep 03, 2024 at 04:22:31PM -0700, Anthony Yznaga wrote:
>> This patch series implements a mechanism that allows userspace
>> processes to opt into sharing PTEs. It adds a new in-memory
>> filesystem - msharefs. A file created on msharefs represents a
>> shared region where all processes mapping that region will map
>> objects within it with shared PTEs. When the file is created,
>> a new host mm struct is created to hold the shared page tables
>> and vmas for objects later mapped into the shared region. This
>> host mm struct is associated with the file and not with a task.
> Taskless mm_struct can be problematic. Like, we don't have access to it's
> counters because it is not represented in /proc. For instance, there's no
> way to check its smaps.

Definitely needs exposure in /proc. One of the things I'm looking into 
is the feasibility of showing the mappings in maps/smaps/etc..


>
> Also, I *think* it is immune to oom-killer because oom-killer looks for a
> victim task, not mm.
> I hope it is not an intended feature :P

oom-killer would have to kill all sharers of an mshare region before the 
mshare region itself could be freed, but I'm not sure that oom-killer 
would be the one to free the region. An mshare region is essentially a 
shared memory object not unlike a tmpfs or hugetlb file. I think some 
higher level intelligence would have to be involved to release it if 
appropriate when under oom conditions.


>
>> When a process mmap's the shared region, a vm flag VM_SHARED_PT
>> is added to the vma. On page fault the vma is checked for the
>> presence of the VM_SHARED_PT flag.
> I think it is wrong approach.
>
> Instead of spaying VM_SHARED_PT checks across core-mm, we need to add a
> generic hooks that can be used by mshare and hugetlb. And remove
> is_vm_hugetlb_page() check from core-mm along the way.
>
> BTW, is_vm_hugetlb_page() callsites seem to be the indicator to check if
> mshare has to do something differently there. I feel you miss a lot of
> such cases.

Good point about is_vm_hugetlb_page(). I'll review the callsites (there 
are only ~60 of them :-).


Thanks,

Anthony


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 00/10] Add support for shared PTEs across processes
  2024-10-07 19:23   ` Anthony Yznaga
@ 2024-10-07 19:41     ` David Hildenbrand
  2024-10-07 19:46       ` Anthony Yznaga
  0 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2024-10-07 19:41 UTC (permalink / raw)
  To: Anthony Yznaga, Kirill A. Shutemov
  Cc: akpm, willy, markhemm, viro, khalid, andreyknvl, dave.hansen,
	luto, brauner, arnd, ebiederm, catalin.marinas, linux-arch,
	linux-kernel, linux-mm, mhiramat, rostedt, vasily.averin, xhao,
	pcc, neilb, maz

On 07.10.24 21:23, Anthony Yznaga wrote:
> 
> On 10/7/24 2:01 AM, Kirill A. Shutemov wrote:
>> On Tue, Sep 03, 2024 at 04:22:31PM -0700, Anthony Yznaga wrote:
>>> This patch series implements a mechanism that allows userspace
>>> processes to opt into sharing PTEs. It adds a new in-memory
>>> filesystem - msharefs. A file created on msharefs represents a
>>> shared region where all processes mapping that region will map
>>> objects within it with shared PTEs. When the file is created,
>>> a new host mm struct is created to hold the shared page tables
>>> and vmas for objects later mapped into the shared region. This
>>> host mm struct is associated with the file and not with a task.
>> Taskless mm_struct can be problematic. Like, we don't have access to it's
>> counters because it is not represented in /proc. For instance, there's no
>> way to check its smaps.
> 
> Definitely needs exposure in /proc. One of the things I'm looking into
> is the feasibility of showing the mappings in maps/smaps/etc..
> 
> 
>>
>> Also, I *think* it is immune to oom-killer because oom-killer looks for a
>> victim task, not mm.
>> I hope it is not an intended feature :P
> 
> oom-killer would have to kill all sharers of an mshare region before the
> mshare region itself could be freed, but I'm not sure that oom-killer
> would be the one to free the region. An mshare region is essentially a
> shared memory object not unlike a tmpfs or hugetlb file. I think some
> higher level intelligence would have to be involved to release it if
> appropriate when under oom conditions.


I thought we discussed that at LSF/MM last year and the conclusion was 
that a process is needed (OOM kill, cgroup handling, ...), and it must 
remain running. Once it stops running, we can force-unmap all pages etc.

Did you look at the summary/(recording if available) of that, or am I 
remembering things wrongly or was there actually such a discussion?

I know, it's problematic that this feature switched owners, ...


-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 00/10] Add support for shared PTEs across processes
  2024-10-07 19:41     ` David Hildenbrand
@ 2024-10-07 19:46       ` Anthony Yznaga
  0 siblings, 0 replies; 38+ messages in thread
From: Anthony Yznaga @ 2024-10-07 19:46 UTC (permalink / raw)
  To: David Hildenbrand, Kirill A. Shutemov
  Cc: akpm, willy, markhemm, viro, khalid, andreyknvl, dave.hansen,
	luto, brauner, arnd, ebiederm, catalin.marinas, linux-arch,
	linux-kernel, linux-mm, mhiramat, rostedt, vasily.averin, xhao,
	pcc, neilb, maz


On 10/7/24 12:41 PM, David Hildenbrand wrote:
> On 07.10.24 21:23, Anthony Yznaga wrote:
>>
>> On 10/7/24 2:01 AM, Kirill A. Shutemov wrote:
>>> On Tue, Sep 03, 2024 at 04:22:31PM -0700, Anthony Yznaga wrote:
>>>> This patch series implements a mechanism that allows userspace
>>>> processes to opt into sharing PTEs. It adds a new in-memory
>>>> filesystem - msharefs. A file created on msharefs represents a
>>>> shared region where all processes mapping that region will map
>>>> objects within it with shared PTEs. When the file is created,
>>>> a new host mm struct is created to hold the shared page tables
>>>> and vmas for objects later mapped into the shared region. This
>>>> host mm struct is associated with the file and not with a task.
>>> Taskless mm_struct can be problematic. Like, we don't have access to 
>>> it's
>>> counters because it is not represented in /proc. For instance, 
>>> there's no
>>> way to check its smaps.
>>
>> Definitely needs exposure in /proc. One of the things I'm looking into
>> is the feasibility of showing the mappings in maps/smaps/etc..
>>
>>
>>>
>>> Also, I *think* it is immune to oom-killer because oom-killer looks 
>>> for a
>>> victim task, not mm.
>>> I hope it is not an intended feature :P
>>
>> oom-killer would have to kill all sharers of an mshare region before the
>> mshare region itself could be freed, but I'm not sure that oom-killer
>> would be the one to free the region. An mshare region is essentially a
>> shared memory object not unlike a tmpfs or hugetlb file. I think some
>> higher level intelligence would have to be involved to release it if
>> appropriate when under oom conditions.
>
>
> I thought we discussed that at LSF/MM last year and the conclusion was 
> that a process is needed (OOM kill, cgroup handling, ...), and it must 
> remain running. Once it stops running, we can force-unmap all pages etc.
>
> Did you look at the summary/(recording if available) of that, or am I 
> remembering things wrongly or was there actually such a discussion?

You're likely right. I'll review the discussion.


Anthony


>
> I know, it's problematic that this feature switched owners, ...

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 06/10] mm/mshare: Add vm flag for shared PTEs
  2024-10-07 10:24     ` David Hildenbrand
@ 2024-10-07 23:03       ` Anthony Yznaga
  0 siblings, 0 replies; 38+ messages in thread
From: Anthony Yznaga @ 2024-10-07 23:03 UTC (permalink / raw)
  To: David Hildenbrand, James Houghton
  Cc: akpm, willy, markhemm, viro, khalid, andreyknvl, dave.hansen,
	luto, brauner, arnd, ebiederm, catalin.marinas, linux-arch,
	linux-kernel, linux-mm, mhiramat, rostedt, vasily.averin, xhao,
	pcc, neilb, maz


On 10/7/24 3:24 AM, David Hildenbrand wrote:
> On 04.09.24 01:40, James Houghton wrote:
>> On Tue, Sep 3, 2024 at 4:23 PM Anthony Yznaga 
>> <anthony.yznaga@oracle.com> wrote:
>>>
>>> From: Khalid Aziz <khalid@kernel.org>
>>>
>>> Add a bit to vm_flags to indicate a vma shares PTEs with others. Add
>>> a function to determine if a vma shares PTEs by checking this flag.
>>> This is to be used to find the shared page table entries on page fault
>>> for vmas sharing PTEs.
>>>
>>> Signed-off-by: Khalid Aziz <khalid@kernel.org>
>>> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>>> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
>>> ---
>>>   include/linux/mm.h             | 7 +++++++
>>>   include/trace/events/mmflags.h | 3 +++
>>>   mm/internal.h                  | 5 +++++
>>>   3 files changed, 15 insertions(+)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 6549d0979b28..3aa0b3322284 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -413,6 +413,13 @@ extern unsigned int kobjsize(const void *objp);
>>>   #define VM_DROPPABLE           VM_NONE
>>>   #endif
>>>
>>> +#ifdef CONFIG_64BIT
>>> +#define VM_SHARED_PT_BIT       41
>>> +#define VM_SHARED_PT           BIT(VM_SHARED_PT_BIT)
>>> +#else
>>> +#define VM_SHARED_PT           VM_NONE
>>> +#endif
>>> +
>>>   #ifdef CONFIG_64BIT
>>>   /* VM is sealed, in vm_flags */
>>>   #define VM_SEALED      _BITUL(63)
>>> diff --git a/include/trace/events/mmflags.h 
>>> b/include/trace/events/mmflags.h
>>> index b63d211bd141..e1ae1e60d086 100644
>>> --- a/include/trace/events/mmflags.h
>>> +++ b/include/trace/events/mmflags.h
>>> @@ -167,8 +167,10 @@ IF_HAVE_PG_ARCH_X(arch_3)
>>>
>>>   #ifdef CONFIG_64BIT
>>>   # define IF_HAVE_VM_DROPPABLE(flag, name) {flag, name},
>>> +# define IF_HAVE_VM_SHARED_PT(flag, name) {flag, name},
>>>   #else
>>>   # define IF_HAVE_VM_DROPPABLE(flag, name)
>>> +# define IF_HAVE_VM_SHARED_PT(flag, name)
>>>   #endif
>>>
>>>   #define __def_vmaflag_names \
>>> @@ -204,6 +206,7 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, 
>>> "softdirty"     )               \
>>>          {VM_HUGEPAGE,                   "hugepage" },              \
>>>          {VM_NOHUGEPAGE,                 "nohugepage" },              \
>>>   IF_HAVE_VM_DROPPABLE(VM_DROPPABLE,     "droppable" )               \
>>> +IF_HAVE_VM_SHARED_PT(VM_SHARED_PT,     "sharedpt" )               \
>>>          {VM_MERGEABLE,                  "mergeable" }               \
>>>
>>>   #define show_vma_flags(flags) \
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index b4d86436565b..8005d5956b6e 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -1578,4 +1578,9 @@ void unlink_file_vma_batch_init(struct 
>>> unlink_vma_file_batch *);
>>>   void unlink_file_vma_batch_add(struct unlink_vma_file_batch *, 
>>> struct vm_area_struct *);
>>>   void unlink_file_vma_batch_final(struct unlink_vma_file_batch *);
>>>
>>
>> Hi Anthony,
>>
>> I'm really excited to see this series on the mailing list again! :) I
>> won't have time to review this series in too much detail, but I hope
>> something like it gets merged eventually.
>>
>>> +static inline bool vma_is_shared(const struct vm_area_struct *vma)
>>> +{
>>> +       return VM_SHARED_PT && (vma->vm_flags & VM_SHARED_PT);
>>> +}
>>
>> Tiny comment - I find vma_is_shared() to be a bit of a confusing name,
>> especially given how vma_is_shared_maywrite() is defined. (Sorry if
>> this has already been discussed before.)
>>
>> How about vma_is_shared_pt()?
>
> vma_is_mshare() ? ;)

vma_is_vmas()? :-D


>
> The whole "shared PT / shared PTE" is a bit misleading IMHO and a bit 
> too dominant in the series. Yes, we're sharing PTEs/page tables, but 
> the main point is that a single mshare VMA might cover multiple 
> different VMAs (in a different process).
>
> I would describe mshare VMAs as being something that shares page 
> tables with another MM, BUT, also that the VMA is a container and what 
> exactly the *actual* VMAs in there are (including holes), only the 
> owner knows.
>
> E.g., is_vm_hugetlb_page() might be *false* for an mshare VMA, but 
> there might be hugetlb folios mapped into the page tables, described 
> by a is_vm_hugetlb_page() VMA in the owner MM.
>
> So again, it's not just "sharing page tables".

Understood. I'm okay with something like vma_is_mshare() or some other 
shorthand for a "container" VMA. And I recognize that I need to identify 
which code paths need to be enlightened to container VMAs and which 
should expect to be operating on a real VMA or don't care.
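
For concreteness, the rename would look something like this (name still open
to debate; semantics unchanged from the patch above):

/* Same test as vma_is_shared() in the patch; only the name changes. */
static inline bool vma_is_mshare(const struct vm_area_struct *vma)
{
	/*
	 * True for the "container" VMA mapping an mshare region.  The
	 * VMAs actually describing the objects inside the region live
	 * in the host mm and are only known to the owner.
	 */
	return VM_SHARED_PT && (vma->vm_flags & VM_SHARED_PT);
}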


Anthony


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 00/10] Add support for shared PTEs across processes
  2024-10-07 16:45         ` Sean Christopherson
@ 2024-10-08  1:37           ` Anthony Yznaga
  0 siblings, 0 replies; 38+ messages in thread
From: Anthony Yznaga @ 2024-10-08  1:37 UTC (permalink / raw)
  To: Sean Christopherson, David Hildenbrand
  Cc: Dave Hansen, akpm, willy, markhemm, viro, khalid, andreyknvl,
	luto, brauner, arnd, ebiederm, catalin.marinas, linux-arch,
	linux-kernel, linux-mm, mhiramat, rostedt, vasily.averin, xhao,
	pcc, neilb, maz, David Rientjes


On 10/7/24 9:45 AM, Sean Christopherson wrote:
> On Mon, Oct 07, 2024, David Hildenbrand wrote:
>> On 07.10.24 17:58, Dave Hansen wrote:
>>> On 10/7/24 01:44, David Hildenbrand wrote:
>>>> On 02.10.24 19:35, Dave Hansen wrote:
>>>>> We were just chatting about this on David Rientjes's MM alignment call.
>>>> Unfortunately I was not able to attend this time, my body decided it's a
>>>> good idea to stay in bed for a couple of days.
>>>>
>>>>> I thought I'd try to give a little brain
>>>>>
>>>>> Let's start by thinking about KVM and secondary MMUs.  KVM has a primary
>>>>> mm: the QEMU (or whatever) process mm.  The virtualization (EPT/NPT)
>>>>> tables get entries that effectively mirror the primary mm page tables
>>>>> and constitute a secondary MMU.  If the primary page tables change,
>>>>> mmu_notifiers ensure that the changes get reflected into the
>>>>> virtualization tables and also that the virtualization paging structure
>>>>> caches are flushed.
>>>>>
>>>>> msharefs is doing something very similar.  But, in the msharefs case,
>>>>> the secondary MMUs are actually normal CPU MMUs.  The page tables are
>>>>> normal old page tables and the caches are the normal old TLB.  That's
>>>>> what makes it so confusing: we have lots of infrastructure for dealing
>>>>> with that "stuff" (CPU page tables and TLB), but msharefs has
>>>>> short-circuited the infrastructure and it doesn't work any more.
>>>> It's quite different IMHO, to a degree that I believe they are different
>>>> beasts:
>>>>
>>>> Secondary MMUs:
>>>> * "Belongs" to same MM context and the primary MMU (process page tables)
>>> I think you're speaking to the ratio here.  For each secondary MMU, I
>>> think you're saying that there's one and only one mm_struct.  Is that right?
>> Yes, that is my understanding (at least with KVM). It's a secondary MMU
>> derived from exactly one primary MMU (MM context -> page table hierarchy).
> I don't think the ratio is what's important.  I think the important takeaway is
> that the secondary MMU is explicitly tied to the primary MMU that it is tracking.
> This is enforced in code, as the list of mmu_notifiers is stored in mm_struct.
>
> The 1:1 ratio probably holds true today, e.g. for KVM, each VM is associated with
> exactly one mm_struct.  But fundamentally, nothing would prevent a secondary MMU
> that manages a so called software TLB from tracking multiple primary MMUs.
>
> E.g. it wouldn't be all that hard to implement in KVM (a bit crazy, but not hard),
> because KVM's memslots disallow gfn aliases, i.e. each index into KVM's secondary
> MMU would be associated with at most one VMA and thus mm_struct.
>
> Pulling Dave's earlier comment in:
>
>   : But the short of it is that the msharefs host mm represents a "secondary
>   : MMU".  I don't think it is really that special of an MMU other than the
>   : fact that it has an mm_struct.
>
> and David's (so. many. Davids):
>
>   : I better not think about the complexity of secondary MMUs + mshare (e.g.,
>   : KVM with mshare in guest memory): MMU notifiers for all MMs must be
>   : called ...
>
> mshare() is unique because it creates the possibility of chained "secondary" MMUs.
> I.e. the fact that it has an mm_struct makes it *very* special, IMO.

This is definitely a gap with the current mshare implementation.
Mapping memory that relies on an mmu notifier in an mshare region will
result in the notifier callbacks never being called. On the surface it
seems like the mshare mm needs notifiers that go through every mm that
has mapped the mshare region and call its notifiers.


>
>>>> * Maintains separate tables/PTEs, in completely separate page table
>>>>     hierarchy
>>> This is the case for KVM and the VMX/SVM MMUs, but it's not generally
>>> true about hardware.  IOMMUs can walk x86 page tables and populate the
>>> IOTLB from the _same_ page table hierarchy as the CPU.
>> Yes, of course.
> Yeah, the recent rework of invalidate_range() => arch_invalidate_secondary_tlbs()
> sums things up nicely:
>
> commit 1af5a8109904b7f00828e7f9f63f5695b42f8215
> Author:     Alistair Popple <apopple@nvidia.com>
> AuthorDate: Tue Jul 25 23:42:07 2023 +1000
> Commit:     Andrew Morton <akpm@linux-foundation.org>
> CommitDate: Fri Aug 18 10:12:41 2023 -0700
>
>      mmu_notifiers: rename invalidate_range notifier
>      
>      There are two main use cases for mmu notifiers.  One is by KVM which uses
>      mmu_notifier_invalidate_range_start()/end() to manage a software TLB.
>      
>      The other is to manage hardware TLBs which need to use the
>      invalidate_range() callback because HW can establish new TLB entries at
>      any time.  Hence using start/end() can lead to memory corruption as these
>      callbacks happen too soon/late during page unmap.
>      
>      mmu notifier users should therefore either use the start()/end() callbacks
>      or the invalidate_range() callbacks.  To make this usage clearer rename
>      the invalidate_range() callback to arch_invalidate_secondary_tlbs() and
>      update documention.

I believe if I implemented an arch_invalidate_secondary_tlbs notifier
that flushed all TLBs, that would solve the problem of ensuring TLBs
are flushed before pages being unmapped from an mshare region are
freed. However, it seems like there would potentially be a lot more
collateral damage from flushing everything since the flushes would
happen more often.
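
Roughly what I have in mind (very much a sketch; the mshare_* names are made
up and registration is hand-waved):

#include <linux/mmu_notifier.h>
#include <asm/tlbflush.h>

/*
 * Sketch: a notifier registered on the mshare host mm.  The host mm's
 * cpumask doesn't track the CPUs of the sharing processes, so flush
 * everything; correct but potentially expensive.
 */
static void mshare_invalidate_secondary_tlbs(struct mmu_notifier *mn,
					     struct mm_struct *mm,
					     unsigned long start,
					     unsigned long end)
{
	flush_tlb_all();
}

static const struct mmu_notifier_ops mshare_mn_ops = {
	.arch_invalidate_secondary_tlbs = mshare_invalidate_secondary_tlbs,
};

/* e.g. registered once when the msharefs file is set up: */
/*	mmu_notifier_register(&m_data->mn, host_mm);	*/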


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 00/10] Add support for shared PTEs across processes
  2024-09-03 23:22 [RFC PATCH v3 00/10] Add support for shared PTEs across processes Anthony Yznaga
                   ` (11 preceding siblings ...)
  2024-10-07  9:01 ` Kirill A. Shutemov
@ 2024-10-14 20:07 ` Jann Horn
  2024-10-16  0:59   ` Anthony Yznaga
  12 siblings, 1 reply; 38+ messages in thread
From: Jann Horn @ 2024-10-14 20:07 UTC (permalink / raw)
  To: Anthony Yznaga
  Cc: akpm, willy, markhemm, viro, david, khalid, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz

On Wed, Sep 4, 2024 at 1:22 AM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
> One major issue to address for this series to function correctly
> is how to ensure proper TLB flushing when a page in a shared
> region is unmapped. For example, since the rmaps for pages in a
> shared region map back to host vmas which point to a host mm, TLB
> flushes won't be directed to the CPUs the sharing processes have
> run on. I am by no means an expert in this area. One idea is to
> install a mmu_notifier on the host mm that can gather the necessary
> data and do flushes similar to the batch flushing.

The mmu_notifier API has two ways you can use it:

First, there is the classic mode, where before you start modifying
PTEs in some range, you remove mirrored PTEs from some other context,
and until you're done with your PTE modification, you don't allow
creation of new mirrored PTEs. This is intended for cases where
individual PTE entries are copied over to some other context (such as
EPT tables for virtualization). When I last looked at that code, it
looked fine, and this is what KVM uses. But it probably doesn't match
your usecase, since you wouldn't want removal of a single page to
cause the entire page table containing it to be temporarily unmapped
from the processes that use it?

Second, there is a newer mode for IOMMUv2 stuff (using the
mmu_notifier_ops::invalidate_range callback), where the idea is that
you have secondary MMUs that share the normal page tables, and so you
basically send them invalidations at the same time you invalidate the
primary MMU for the process. I think that's the right fit for this
usecase; however, last I looked, this code was extremely broken (see
https://lore.kernel.org/lkml/CAG48ez2NQKVbv=yG_fq_jtZjf8Q=+Wy54FxcFrK_OujFg5BwSQ@mail.gmail.com/
for context). Unless that's changed in the meantime, I think someone
would have to fix that code before it can be relied on for new
usecases.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 05/10] mm/mshare: Add ioctl support
  2024-09-03 23:22 ` [RFC PATCH v3 05/10] mm/mshare: Add ioctl support Anthony Yznaga
@ 2024-10-14 20:08   ` Jann Horn
  2024-10-16  0:49     ` Anthony Yznaga
  0 siblings, 1 reply; 38+ messages in thread
From: Jann Horn @ 2024-10-14 20:08 UTC (permalink / raw)
  To: Anthony Yznaga
  Cc: akpm, willy, markhemm, viro, david, khalid, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz

On Wed, Sep 4, 2024 at 1:22 AM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
> Reserve a range of ioctls for msharefs and add the first two ioctls
> to get and set the start address and size of an mshare region.
[...]
> +static long
> +msharefs_set_size(struct mm_struct *mm, struct mshare_data *m_data,
> +                       struct mshare_info *minfo)
> +{
> +       unsigned long end = minfo->start + minfo->size;
> +
> +       /*
> +        * Validate alignment for start address, and size
> +        */
> +       if ((minfo->start | end) & (PGDIR_SIZE - 1)) {
> +               spin_unlock(&m_data->m_lock);
> +               return -EINVAL;
> +       }
> +
> +       mm->mmap_base = minfo->start;
> +       mm->task_size = minfo->size;
> +       if (!mm->task_size)
> +               mm->task_size--;
> +
> +       m_data->minfo.start = mm->mmap_base;
> +       m_data->minfo.size = mm->task_size;
> +       spin_unlock(&m_data->m_lock);
> +
> +       return 0;
> +}
> +
> +static long
> +msharefs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> +{
> +       struct mshare_data *m_data = filp->private_data;
> +       struct mm_struct *mm = m_data->mm;
> +       struct mshare_info minfo;
> +
> +       switch (cmd) {
> +       case MSHAREFS_GET_SIZE:
> +               spin_lock(&m_data->m_lock);
> +               minfo = m_data->minfo;
> +               spin_unlock(&m_data->m_lock);
> +
> +               if (copy_to_user((void __user *)arg, &minfo, sizeof(minfo)))
> +                       return -EFAULT;
> +
> +               return 0;
> +
> +       case MSHAREFS_SET_SIZE:
> +               if (copy_from_user(&minfo, (struct mshare_info __user *)arg,
> +                       sizeof(minfo)))
> +                       return -EFAULT;
> +
> +               /*
> +                * If this mshare region has been set up once already, bail out
> +                */
> +               spin_lock(&m_data->m_lock);
> +               if (m_data->minfo.start != 0) {

Is there actually anything that prevents msharefs_set_size() from
setting up m_data with ->minfo.start==0, so that a second
MSHAREFS_SET_SIZE invocation will succeed? It would probably be more
reliable to have a separate flag for "has this thing been set up yet".


> +                       spin_unlock(&m_data->m_lock);
> +                       return -EINVAL;
> +               }
> +
> +               return msharefs_set_size(mm, m_data, &minfo);
> +
> +       default:
> +               return -ENOTTY;
> +       }
> +}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 05/10] mm/mshare: Add ioctl support
  2024-10-14 20:08   ` Jann Horn
@ 2024-10-16  0:49     ` Anthony Yznaga
  0 siblings, 0 replies; 38+ messages in thread
From: Anthony Yznaga @ 2024-10-16  0:49 UTC (permalink / raw)
  To: Jann Horn
  Cc: akpm, willy, markhemm, viro, david, khalid, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz


On 10/14/24 1:08 PM, Jann Horn wrote:
> On Wed, Sep 4, 2024 at 1:22 AM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>> Reserve a range of ioctls for msharefs and add the first two ioctls
>> to get and set the start address and size of an mshare region.
> [...]
>> +static long
>> +msharefs_set_size(struct mm_struct *mm, struct mshare_data *m_data,
>> +                       struct mshare_info *minfo)
>> +{
>> +       unsigned long end = minfo->start + minfo->size;
>> +
>> +       /*
>> +        * Validate alignment for start address, and size
>> +        */
>> +       if ((minfo->start | end) & (PGDIR_SIZE - 1)) {
>> +               spin_unlock(&m_data->m_lock);
>> +               return -EINVAL;
>> +       }
>> +
>> +       mm->mmap_base = minfo->start;
>> +       mm->task_size = minfo->size;
>> +       if (!mm->task_size)
>> +               mm->task_size--;
>> +
>> +       m_data->minfo.start = mm->mmap_base;
>> +       m_data->minfo.size = mm->task_size;
>> +       spin_unlock(&m_data->m_lock);
>> +
>> +       return 0;
>> +}
>> +
>> +static long
>> +msharefs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
>> +{
>> +       struct mshare_data *m_data = filp->private_data;
>> +       struct mm_struct *mm = m_data->mm;
>> +       struct mshare_info minfo;
>> +
>> +       switch (cmd) {
>> +       case MSHAREFS_GET_SIZE:
>> +               spin_lock(&m_data->m_lock);
>> +               minfo = m_data->minfo;
>> +               spin_unlock(&m_data->m_lock);
>> +
>> +               if (copy_to_user((void __user *)arg, &minfo, sizeof(minfo)))
>> +                       return -EFAULT;
>> +
>> +               return 0;
>> +
>> +       case MSHAREFS_SET_SIZE:
>> +               if (copy_from_user(&minfo, (struct mshare_info __user *)arg,
>> +                       sizeof(minfo)))
>> +                       return -EFAULT;
>> +
>> +               /*
>> +                * If this mshare region has been set up once already, bail out
>> +                */
>> +               spin_lock(&m_data->m_lock);
>> +               if (m_data->minfo.start != 0) {
> Is there actually anything that prevents msharefs_set_size() from
> setting up m_data with ->minfo.start==0, so that a second
> MSHAREFS_SET_SIZE invocation will succeed? It would probably be more
> reliable to have a separate flag for "has this thing been set up yet".

Thanks for pointing this out. Yes, this is problematic. A start address 
of 0 generally won't work because mmap() will fail unless there are 
sufficient privileges (cap_mmap_addr will return -EPERM). I already have
changes to use the size to indicate initialization, but it may make 
sense to have flags.
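
E.g. something like this (sketch; the init_done field and the -EBUSY choice
are provisional):

/* Sketch; layout approximates the posted struct, init_done is new. */
struct mshare_data {
	struct mm_struct	*mm;
	spinlock_t		m_lock;
	struct mshare_info	minfo;
	bool			init_done;	/* set once SET_SIZE succeeds */
};

static long msharefs_set_size_once(struct mshare_data *m_data,
				   struct mshare_info *minfo)
{
	spin_lock(&m_data->m_lock);
	if (m_data->init_done) {
		spin_unlock(&m_data->m_lock);
		return -EBUSY;
	}
	/* ... alignment checks and host mm setup as in msharefs_set_size() ... */
	m_data->minfo = *minfo;
	m_data->init_done = true;
	spin_unlock(&m_data->m_lock);
	return 0;
}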


Anthony

>
>
>> +                       spin_unlock(&m_data->m_lock);
>> +                       return -EINVAL;
>> +               }
>> +
>> +               return msharefs_set_size(mm, m_data, &minfo);
>> +
>> +       default:
>> +               return -ENOTTY;
>> +       }
>> +}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 00/10] Add support for shared PTEs across processes
  2024-10-14 20:07 ` Jann Horn
@ 2024-10-16  0:59   ` Anthony Yznaga
  2024-10-16 13:25     ` Jann Horn
  0 siblings, 1 reply; 38+ messages in thread
From: Anthony Yznaga @ 2024-10-16  0:59 UTC (permalink / raw)
  To: Jann Horn
  Cc: akpm, willy, markhemm, viro, david, khalid, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz


On 10/14/24 1:07 PM, Jann Horn wrote:
> On Wed, Sep 4, 2024 at 1:22 AM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>> One major issue to address for this series to function correctly
>> is how to ensure proper TLB flushing when a page in a shared
>> region is unmapped. For example, since the rmaps for pages in a
>> shared region map back to host vmas which point to a host mm, TLB
>> flushes won't be directed to the CPUs the sharing processes have
>> run on. I am by no means an expert in this area. One idea is to
>> install a mmu_notifier on the host mm that can gather the necessary
>> data and do flushes similar to the batch flushing.
> The mmu_notifier API has two ways you can use it:
>
> First, there is the classic mode, where before you start modifying
> PTEs in some range, you remove mirrored PTEs from some other context,
> and until you're done with your PTE modification, you don't allow
> creation of new mirrored PTEs. This is intended for cases where
> individual PTE entries are copied over to some other context (such as
> EPT tables for virtualization). When I last looked at that code, it
> looked fine, and this is what KVM uses. But it probably doesn't match
> your usecase, since you wouldn't want removal of a single page to
> cause the entire page table containing it to be temporarily unmapped
> from the processes that use it?

No, definitely don't want to do that. :-)


>
> Second, there is a newer mode for IOMMUv2 stuff (using the
> mmu_notifier_ops::invalidate_range callback), where the idea is that
> you have secondary MMUs that share the normal page tables, and so you
> basically send them invalidations at the same time you invalidate the
> primary MMU for the process. I think that's the right fit for this
> usecase; however, last I looked, this code was extremely broken (see
> https://lore.kernel.org/lkml/CAG48ez2NQKVbv=yG_fq_jtZjf8Q=+Wy54FxcFrK_OujFg5BwSQ@mail.gmail.com/
> for context). Unless that's changed in the meantime, I think someone
> would have to fix that code before it can be relied on for new
> usecases.

Thank you for this background! Looks like there have since been some 
changes to the mmu notifiers, and the invalidate_range callback became 
arch_invalidate_secondary_tlbs. I'm currently looking into using it to 
flush all TLBs.


Anthony


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v3 00/10] Add support for shared PTEs across processes
  2024-10-16  0:59   ` Anthony Yznaga
@ 2024-10-16 13:25     ` Jann Horn
  0 siblings, 0 replies; 38+ messages in thread
From: Jann Horn @ 2024-10-16 13:25 UTC (permalink / raw)
  To: Anthony Yznaga
  Cc: akpm, willy, markhemm, viro, david, khalid, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz

On Wed, Oct 16, 2024 at 2:59 AM Anthony Yznaga
<anthony.yznaga@oracle.com> wrote:
> On 10/14/24 1:07 PM, Jann Horn wrote:
> > Second, there is a newer mode for IOMMUv2 stuff (using the
> > mmu_notifier_ops::invalidate_range callback), where the idea is that
> > you have secondary MMUs that share the normal page tables, and so you
> > basically send them invalidations at the same time you invalidate the
> > primary MMU for the process. I think that's the right fit for this
> > usecase; however, last I looked, this code was extremely broken (see
> > https://lore.kernel.org/lkml/CAG48ez2NQKVbv=yG_fq_jtZjf8Q=+Wy54FxcFrK_OujFg5BwSQ@mail.gmail.com/
> > for context). Unless that's changed in the meantime, I think someone
> > would have to fix that code before it can be relied on for new
> > usecases.
>
> Thank you for this background! Looks like there have since been some
> changes to the mmu notifiers, and the invalidate_range callback became
> arch_invalidate_secondary_tlbs. I'm currently looking into using it to
> flush all TLBs.

Ah, nice, that looks much better now. With the caveat that, from what
I can tell, the notifiers only work on x86/arm64/powerpc - I guess
maybe that infrastructure should be gated on a HAVE_... arch config
flag...

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread

Thread overview: 38+ messages
2024-09-03 23:22 [RFC PATCH v3 00/10] Add support for shared PTEs across processes Anthony Yznaga
2024-09-03 23:22 ` [RFC PATCH v3 01/10] mm: Add msharefs filesystem Anthony Yznaga
2024-09-03 23:22 ` [RFC PATCH v3 02/10] mm/mshare: pre-populate msharefs with information file Anthony Yznaga
2024-09-03 23:22 ` [RFC PATCH v3 03/10] mm/mshare: make msharefs writable and support directories Anthony Yznaga
2024-09-03 23:22 ` [RFC PATCH v3 04/10] mm/mshare: allocate an mm_struct for msharefs files Anthony Yznaga
2024-09-03 23:22 ` [RFC PATCH v3 05/10] mm/mshare: Add ioctl support Anthony Yznaga
2024-10-14 20:08   ` Jann Horn
2024-10-16  0:49     ` Anthony Yznaga
2024-09-03 23:22 ` [RFC PATCH v3 06/10] mm/mshare: Add vm flag for shared PTEs Anthony Yznaga
2024-09-03 23:40   ` James Houghton
2024-09-03 23:58     ` Anthony Yznaga
2024-10-07 10:24     ` David Hildenbrand
2024-10-07 23:03       ` Anthony Yznaga
2024-09-03 23:22 ` [RFC PATCH v3 07/10] mm/mshare: Add mmap support Anthony Yznaga
2024-09-03 23:22 ` [RFC PATCH v3 08/10] mm/mshare: Add basic page table sharing support Anthony Yznaga
2024-10-07  8:41   ` Kirill A. Shutemov
2024-10-07 17:45     ` Anthony Yznaga
2024-09-03 23:22 ` [RFC PATCH v3 09/10] mm: create __do_mmap() to take an mm_struct * arg Anthony Yznaga
2024-10-07  8:44   ` Kirill A. Shutemov
2024-10-07 17:46     ` Anthony Yznaga
2024-09-03 23:22 ` [RFC PATCH v3 10/10] mshare: add MSHAREFS_CREATE_MAPPING Anthony Yznaga
2024-10-02 17:35 ` [RFC PATCH v3 00/10] Add support for shared PTEs across processes Dave Hansen
2024-10-02 19:30   ` Anthony Yznaga
2024-10-02 23:11     ` Dave Hansen
2024-10-03  0:24       ` Anthony Yznaga
2024-10-07  8:44   ` David Hildenbrand
2024-10-07 15:58     ` Dave Hansen
2024-10-07 16:27       ` David Hildenbrand
2024-10-07 16:45         ` Sean Christopherson
2024-10-08  1:37           ` Anthony Yznaga
2024-10-07  8:48   ` David Hildenbrand
2024-10-07  9:01 ` Kirill A. Shutemov
2024-10-07 19:23   ` Anthony Yznaga
2024-10-07 19:41     ` David Hildenbrand
2024-10-07 19:46       ` Anthony Yznaga
2024-10-14 20:07 ` Jann Horn
2024-10-16  0:59   ` Anthony Yznaga
2024-10-16 13:25     ` Jann Horn
