* [PATCH 01/20] mm: Add msharefs filesystem
2025-01-24 23:54 [PATCH 00/20] Add support for shared PTEs across processes Anthony Yznaga
@ 2025-01-24 23:54 ` Anthony Yznaga
2025-01-25 3:13 ` Randy Dunlap
2025-02-04 1:52 ` Bagas Sanjaya
2025-01-24 23:54 ` [PATCH 02/20] mm/mshare: pre-populate msharefs with information file Anthony Yznaga
` (22 subsequent siblings)
23 siblings, 2 replies; 37+ messages in thread
From: Anthony Yznaga @ 2025-01-24 23:54 UTC (permalink / raw)
To: akpm, willy, markhemm, viro, david, khalid
Cc: anthony.yznaga, jthoughton, corbet, dave.hansen, kirill, luto,
brauner, arnd, ebiederm, catalin.marinas, mingo, peterz,
liam.howlett, lorenzo.stoakes, vbabka, jannh, hannes, mhocko,
roman.gushchin, shakeel.butt, muchun.song, tglx, cgroups, x86,
linux-doc, linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
vasily.averin, xhao, pcc, neilb, maz
From: Khalid Aziz <khalid@kernel.org>
Add a ram-based filesystem that contains page table sharing
information and files that enable processes to share page tables.
This patch adds the basic filesystem that can be mounted and
a CONFIG_MSHARE option for compiling support into the kernel.
Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
Documentation/filesystems/msharefs.rst | 107 +++++++++++++++++++++++++
include/uapi/linux/magic.h | 1 +
mm/Kconfig | 9 +++
mm/Makefile | 4 +
mm/mshare.c | 96 ++++++++++++++++++++++
5 files changed, 217 insertions(+)
create mode 100644 Documentation/filesystems/msharefs.rst
create mode 100644 mm/mshare.c
diff --git a/Documentation/filesystems/msharefs.rst b/Documentation/filesystems/msharefs.rst
new file mode 100644
index 000000000000..c3c7168aa18f
--- /dev/null
+++ b/Documentation/filesystems/msharefs.rst
@@ -0,0 +1,107 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================================
+msharefs - a filesystem to support shared page tables
+=====================================================
+
+msharefs is a ram-based filesystem that allows multiple processes to
+share page table entries for shared pages. To enable support for
+msharefs, the kernel must be compiled with CONFIG_MSHARE set.
+
+msharefs is typically mounted like this::
+
+ mount -t msharefs none /sys/fs/mshare
+
+Creating a file on msharefs creates a new shared region where all
+processes mapping that region will map it using shared page table
+entries. ioctls are used to initialize or retrieve the start address
+and size of a shared region and to map objects into the shared
+region. Note that an msharefs file is a control file for the shared
+region; it does not itself contain the contents of the region.
+
+Here are the basic steps for using mshare::
+
+1. Mount msharefs on /sys/fs/mshare::
+
+ mount -t msharefs msharefs /sys/fs/mshare
+
+2. mshare regions have alignment and size requirements. The start
+   address of a region must be aligned to a fixed boundary, and the
+   size of the region must be a multiple of that same value. The
+   value can be obtained by reading the file
+   ``/sys/fs/mshare/mshare_info``, which returns a number in text
+   format.
+
+3. For the process creating an mshare region::
+
+a. Create a file on /sys/fs/mshare, for example:
+
+.. code-block:: c
+
+ fd = open("/sys/fs/mshare/shareme",
+ O_RDWR|O_CREAT|O_EXCL, 0600);
+
+b. Establish the starting address and size of the region:
+
+.. code-block:: c
+
+ struct mshare_info minfo;
+
+ minfo.start = TB(2);
+ minfo.size = BUFFER_SIZE;
+ ioctl(fd, MSHAREFS_SET_SIZE, &minfo);
+
+c. Map some memory in the region:
+
+.. code-block:: c
+
+ struct mshare_create mcreate;
+
+ mcreate.addr = TB(2);
+ mcreate.size = BUFFER_SIZE;
+ mcreate.offset = 0;
+ mcreate.prot = PROT_READ | PROT_WRITE;
+ mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
+ mcreate.fd = -1;
+
+ ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate);
+
+d. Map the mshare region into the process:
+
+.. code-block:: c
+
+   mmap((void *)TB(2), BUFFER_SIZE,
+        PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+
+e. Write to and read from the mshare region normally.
+
+
+4. For processes attaching an mshare region::
+
+a. Open the file on msharefs, for example:
+
+.. code-block:: c
+
+ fd = open("/sys/fs/mshare/shareme", O_RDWR);
+
+b. Get information about the mshare'd region from the file:
+
+.. code-block:: c
+
+ struct mshare_info minfo;
+
+ ioctl(fd, MSHAREFS_GET_SIZE, &minfo);
+
+c. Map the mshare'd region into the process:
+
+.. code-block:: c
+
+   mmap((void *)minfo.start, minfo.size,
+        PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+
+5. To delete the mshare region:
+
+.. code-block:: c
+
+ unlink("/sys/fs/mshare/shareme");
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index bb575f3ab45e..e53dd6063cba 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -103,5 +103,6 @@
#define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
#define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
#define PID_FS_MAGIC 0x50494446 /* "PIDF" */
+#define MSHARE_MAGIC 0x4d534852 /* "MSHR" */
#endif /* __LINUX_MAGIC_H__ */
diff --git a/mm/Kconfig b/mm/Kconfig
index 1b501db06417..ba3dbe31f86a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1358,6 +1358,15 @@ config PT_RECLAIM
Note: now only empty user PTE page table pages will be reclaimed.
+config MSHARE
+ bool "Mshare"
+ depends on MMU
+ help
+ Enable msharefs: A ram-based filesystem that allows multiple
+ processes to share page table entries for shared pages. A file
+ created on msharefs represents a shared region where all processes
+ mapping that region will map objects within it with shared PTEs.
+ Ioctls are used to configure and map objects into the shared region
source "mm/damon/Kconfig"
diff --git a/mm/Makefile b/mm/Makefile
index 850386a67b3e..68bc967863f9 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -48,6 +48,10 @@ ifdef CONFIG_64BIT
mmu-$(CONFIG_MMU) += mseal.o
endif
+ifdef CONFIG_MSHARE
+mmu-$(CONFIG_MMU) += mshare.o
+endif
+
obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
maccess.o page-writeback.o folio-compat.o \
readahead.o swap.o truncate.o vmscan.o shrinker.o \
diff --git a/mm/mshare.c b/mm/mshare.c
new file mode 100644
index 000000000000..49d32e0c20d2
--- /dev/null
+++ b/mm/mshare.c
@@ -0,0 +1,96 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Enable cooperating processes to share page table between
+ * them to reduce the extra memory consumed by multiple copies
+ * of page tables.
+ *
+ * This code adds an in-memory filesystem - msharefs.
+ * msharefs is used to manage page table sharing
+ *
+ *
+ * Copyright (C) 2024 Oracle Corp. All rights reserved.
+ * Author: Khalid Aziz <khalid@kernel.org>
+ *
+ */
+
+#include <linux/fs.h>
+#include <linux/fs_context.h>
+#include <uapi/linux/magic.h>
+
+static const struct file_operations msharefs_file_operations = {
+ .open = simple_open,
+};
+
+static const struct super_operations mshare_s_ops = {
+ .statfs = simple_statfs,
+};
+
+static int
+msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+ struct inode *inode;
+
+ sb->s_blocksize = PAGE_SIZE;
+ sb->s_blocksize_bits = PAGE_SHIFT;
+ sb->s_magic = MSHARE_MAGIC;
+ sb->s_op = &mshare_s_ops;
+ sb->s_time_gran = 1;
+
+ inode = new_inode(sb);
+ if (!inode)
+ return -ENOMEM;
+
+ inode->i_ino = 1;
+ inode->i_mode = S_IFDIR | 0777;
+ simple_inode_init_ts(inode);
+ inode->i_op = &simple_dir_inode_operations;
+ inode->i_fop = &simple_dir_operations;
+ set_nlink(inode, 2);
+
+ sb->s_root = d_make_root(inode);
+ if (!sb->s_root)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static int
+msharefs_get_tree(struct fs_context *fc)
+{
+ return get_tree_nodev(fc, msharefs_fill_super);
+}
+
+static const struct fs_context_operations msharefs_context_ops = {
+ .get_tree = msharefs_get_tree,
+};
+
+static int
+mshare_init_fs_context(struct fs_context *fc)
+{
+ fc->ops = &msharefs_context_ops;
+ return 0;
+}
+
+static struct file_system_type mshare_fs = {
+ .name = "msharefs",
+ .init_fs_context = mshare_init_fs_context,
+ .kill_sb = kill_litter_super,
+};
+
+static int __init
+mshare_init(void)
+{
+ int ret;
+
+ ret = sysfs_create_mount_point(fs_kobj, "mshare");
+ if (ret)
+ return ret;
+
+ ret = register_filesystem(&mshare_fs);
+ if (ret)
+ sysfs_remove_mount_point(fs_kobj, "mshare");
+
+ return ret;
+}
+
+core_initcall(mshare_init);
--
2.43.5
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH 01/20] mm: Add msharefs filesystem
2025-01-24 23:54 ` [PATCH 01/20] mm: Add msharefs filesystem Anthony Yznaga
@ 2025-01-25 3:13 ` Randy Dunlap
2025-01-25 20:05 ` Anthony Yznaga
2025-02-04 1:52 ` Bagas Sanjaya
1 sibling, 1 reply; 37+ messages in thread
From: Randy Dunlap @ 2025-01-25 3:13 UTC (permalink / raw)
To: Anthony Yznaga, akpm, willy, markhemm, viro, david, khalid
Cc: jthoughton, corbet, dave.hansen, kirill, luto, brauner, arnd,
ebiederm, catalin.marinas, mingo, peterz, liam.howlett,
lorenzo.stoakes, vbabka, jannh, hannes, mhocko, roman.gushchin,
shakeel.butt, muchun.song, tglx, cgroups, x86, linux-doc,
linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
vasily.averin, xhao, pcc, neilb, maz
Just nits:
On 1/24/25 3:54 PM, Anthony Yznaga wrote:
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 1b501db06417..ba3dbe31f86a 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1358,6 +1358,15 @@ config PT_RECLAIM
>
> Note: now only empty user PTE page table pages will be reclaimed.
>
> +config MSHARE
> + bool "Mshare"
> + depends on MMU
> + help
> + Enable msharefs: A ram-based filesystem that allows multiple
RAM-based
> + processes to share page table entries for shared pages. A file
> + created on msharefs represents a shared region where all processes
> + mapping that region will map objects within it with shared PTEs.
> + Ioctls are used to configure and map objects into the shared region
End the sentence above with a period.
>
> source "mm/damon/Kconfig"
--
~Randy
* Re: [PATCH 01/20] mm: Add msharefs filesystem
2025-01-25 3:13 ` Randy Dunlap
@ 2025-01-25 20:05 ` Anthony Yznaga
2025-01-25 21:10 ` Matthew Wilcox
0 siblings, 1 reply; 37+ messages in thread
From: Anthony Yznaga @ 2025-01-25 20:05 UTC (permalink / raw)
To: Randy Dunlap, akpm, willy, markhemm, viro, david, khalid
Cc: jthoughton, corbet, dave.hansen, kirill, luto, brauner, arnd,
ebiederm, catalin.marinas, mingo, peterz, liam.howlett,
lorenzo.stoakes, vbabka, jannh, hannes, mhocko, roman.gushchin,
shakeel.butt, muchun.song, tglx, cgroups, x86, linux-doc,
linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
vasily.averin, xhao, pcc, neilb, maz
On 1/24/25 7:13 PM, Randy Dunlap wrote:
> Just nits:
>
>
> On 1/24/25 3:54 PM, Anthony Yznaga wrote:
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 1b501db06417..ba3dbe31f86a 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -1358,6 +1358,15 @@ config PT_RECLAIM
>>
>> Note: now only empty user PTE page table pages will be reclaimed.
>>
>> +config MSHARE
>> + bool "Mshare"
>> + depends on MMU
>> + help
>> + Enable msharefs: A ram-based filesystem that allows multiple
> RAM-based
>
>> + processes to share page table entries for shared pages. A file
>> + created on msharefs represents a shared region where all processes
>> + mapping that region will map objects within it with shared PTEs.
>> + Ioctls are used to configure and map objects into the shared region
> End the sentence above with a period.
Thanks, Randy. Appreciate the comments.
Anthony
>
>>
>> source "mm/damon/Kconfig"
* Re: [PATCH 01/20] mm: Add msharefs filesystem
2025-01-25 20:05 ` Anthony Yznaga
@ 2025-01-25 21:10 ` Matthew Wilcox
2025-01-27 17:01 ` Anthony Yznaga
0 siblings, 1 reply; 37+ messages in thread
From: Matthew Wilcox @ 2025-01-25 21:10 UTC (permalink / raw)
To: Anthony Yznaga
Cc: Randy Dunlap, akpm, markhemm, viro, david, khalid, jthoughton,
corbet, dave.hansen, kirill, luto, brauner, arnd, ebiederm,
catalin.marinas, mingo, peterz, liam.howlett, lorenzo.stoakes,
vbabka, jannh, hannes, mhocko, roman.gushchin, shakeel.butt,
muchun.song, tglx, cgroups, x86, linux-doc, linux-arch,
linux-kernel, linux-mm, mhiramat, rostedt, vasily.averin, xhao,
pcc, neilb, maz
On Sat, Jan 25, 2025 at 12:05:47PM -0800, Anthony Yznaga wrote:
>
> On 1/24/25 7:13 PM, Randy Dunlap wrote:
> > Just nits:
> >
> >
> > On 1/24/25 3:54 PM, Anthony Yznaga wrote:
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > index 1b501db06417..ba3dbe31f86a 100644
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -1358,6 +1358,15 @@ config PT_RECLAIM
> > > Note: now only empty user PTE page table pages will be reclaimed.
> > > +config MSHARE
> > > + bool "Mshare"
> > > + depends on MMU
> > > + help
> > > + Enable msharefs: A ram-based filesystem that allows multiple
> > RAM-based
But it's not a ram-based filesystem. It's a pseudo-filesystem like
procfs. It doesn't have any memory of its own.
* Re: [PATCH 01/20] mm: Add msharefs filesystem
2025-01-25 21:10 ` Matthew Wilcox
@ 2025-01-27 17:01 ` Anthony Yznaga
0 siblings, 0 replies; 37+ messages in thread
From: Anthony Yznaga @ 2025-01-27 17:01 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Randy Dunlap, akpm, markhemm, viro, david, khalid, jthoughton,
corbet, dave.hansen, kirill, luto, brauner, arnd, ebiederm,
catalin.marinas, mingo, peterz, liam.howlett, lorenzo.stoakes,
vbabka, jannh, hannes, mhocko, roman.gushchin, shakeel.butt,
muchun.song, tglx, cgroups, x86, linux-doc, linux-arch,
linux-kernel, linux-mm, mhiramat, rostedt, vasily.averin, xhao,
pcc, neilb, maz
On 1/25/25 1:10 PM, Matthew Wilcox wrote:
> On Sat, Jan 25, 2025 at 12:05:47PM -0800, Anthony Yznaga wrote:
>> On 1/24/25 7:13 PM, Randy Dunlap wrote:
>>> Just nits:
>>>
>>>
>>> On 1/24/25 3:54 PM, Anthony Yznaga wrote:
>>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>>> index 1b501db06417..ba3dbe31f86a 100644
>>>> --- a/mm/Kconfig
>>>> +++ b/mm/Kconfig
>>>> @@ -1358,6 +1358,15 @@ config PT_RECLAIM
>>>> Note: now only empty user PTE page table pages will be reclaimed.
>>>> +config MSHARE
>>>> + bool "Mshare"
>>>> + depends on MMU
>>>> + help
>>>> + Enable msharefs: A ram-based filesystem that allows multiple
>>> RAM-based
> But it's not a ram-based filesystem. It's a pseudo-filesystem like
> procfs. It doesn't have any memory of its own.
Right. I'll clear that up.
Anthony
* Re: [PATCH 01/20] mm: Add msharefs filesystem
2025-01-24 23:54 ` [PATCH 01/20] mm: Add msharefs filesystem Anthony Yznaga
2025-01-25 3:13 ` Randy Dunlap
@ 2025-02-04 1:52 ` Bagas Sanjaya
2025-02-04 16:41 ` Anthony Yznaga
1 sibling, 1 reply; 37+ messages in thread
From: Bagas Sanjaya @ 2025-02-04 1:52 UTC (permalink / raw)
To: Anthony Yznaga, akpm, willy, markhemm, viro, david, khalid
Cc: jthoughton, corbet, dave.hansen, kirill, luto, brauner, arnd,
ebiederm, catalin.marinas, mingo, peterz, liam.howlett,
lorenzo.stoakes, vbabka, jannh, hannes, mhocko, roman.gushchin,
shakeel.butt, muchun.song, tglx, cgroups, x86, linux-doc,
linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
vasily.averin, xhao, pcc, neilb, maz
On Fri, Jan 24, 2025 at 03:54:35PM -0800, Anthony Yznaga wrote:
> diff --git a/Documentation/filesystems/msharefs.rst b/Documentation/filesystems/msharefs.rst
> [...]
> +3. For the process creating an mshare region::
> +
> +a. Create a file on /sys/fs/mshare, for example:
Should the creating mshare region sublist be nested list?
> [...]
Sphinx reports htmldocs warnings:
Documentation/filesystems/msharefs.rst:25: WARNING: Literal block expected; none found. [docutils]
Documentation/filesystems/msharefs.rst:38: WARNING: Literal block expected; none found. [docutils]
Documentation/filesystems/msharefs.rst:82: WARNING: Literal block expected; none found. [docutils]
Thanks.
--
An old man doll... just what I always wanted! - Clara
* Re: [PATCH 01/20] mm: Add msharefs filesystem
2025-02-04 1:52 ` Bagas Sanjaya
@ 2025-02-04 16:41 ` Anthony Yznaga
0 siblings, 0 replies; 37+ messages in thread
From: Anthony Yznaga @ 2025-02-04 16:41 UTC (permalink / raw)
To: Bagas Sanjaya, akpm, willy, markhemm, viro, david, khalid
Cc: jthoughton, corbet, dave.hansen, kirill, luto, brauner, arnd,
ebiederm, catalin.marinas, mingo, peterz, liam.howlett,
lorenzo.stoakes, vbabka, jannh, hannes, mhocko, roman.gushchin,
shakeel.butt, muchun.song, tglx, cgroups, x86, linux-doc,
linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
vasily.averin, xhao, pcc, neilb, maz
On 2/3/25 5:52 PM, Bagas Sanjaya wrote:
> On Fri, Jan 24, 2025 at 03:54:35PM -0800, Anthony Yznaga wrote:
>> diff --git a/Documentation/filesystems/msharefs.rst b/Documentation/filesystems/msharefs.rst
>> [...]
>> +3. For the process creating an mshare region::
>> +
>> +a. Create a file on /sys/fs/mshare, for example:
> Should the creating mshare region sublist be nested list?
Can you expand on that? Do you mean create an mshare region as a
directory and populate it with files representing the mappings that are
created in the region?
>> [...]
> Sphinx reports htmldocs warnings:
>
> Documentation/filesystems/msharefs.rst:25: WARNING: Literal block expected; none found. [docutils]
> Documentation/filesystems/msharefs.rst:38: WARNING: Literal block expected; none found. [docutils]
> Documentation/filesystems/msharefs.rst:82: WARNING: Literal block expected; none found. [docutils]
Thanks. Will fix this.
Anthony
>
> Thanks.
>
* [PATCH 02/20] mm/mshare: pre-populate msharefs with information file
2025-01-24 23:54 [PATCH 00/20] Add support for shared PTEs across processes Anthony Yznaga
2025-01-24 23:54 ` [PATCH 01/20] mm: Add msharefs filesystem Anthony Yznaga
@ 2025-01-24 23:54 ` Anthony Yznaga
2025-01-24 23:54 ` [PATCH 03/20] mm/mshare: make msharefs writable and support directories Anthony Yznaga
` (21 subsequent siblings)
23 siblings, 0 replies; 37+ messages in thread
From: Anthony Yznaga @ 2025-01-24 23:54 UTC (permalink / raw)
To: akpm, willy, markhemm, viro, david, khalid
Cc: anthony.yznaga, jthoughton, corbet, dave.hansen, kirill, luto,
brauner, arnd, ebiederm, catalin.marinas, mingo, peterz,
liam.howlett, lorenzo.stoakes, vbabka, jannh, hannes, mhocko,
roman.gushchin, shakeel.butt, muchun.song, tglx, cgroups, x86,
linux-doc, linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
vasily.averin, xhao, pcc, neilb, maz
From: Khalid Aziz <khalid@kernel.org>
Users of mshare need to know the size and alignment requirements
for shared regions. Pre-populate msharefs with a file, mshare_info,
that provides this information. For now, page table sharing is
hardcoded to be at the PUD level.
Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
mm/mshare.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 75 insertions(+), 2 deletions(-)
diff --git a/mm/mshare.c b/mm/mshare.c
index 49d32e0c20d2..6d3760d1af8e 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -17,18 +17,74 @@
#include <linux/fs_context.h>
#include <uapi/linux/magic.h>
+const unsigned long mshare_align = P4D_SIZE;
+
static const struct file_operations msharefs_file_operations = {
.open = simple_open,
};
+struct msharefs_info {
+ struct dentry *info_dentry;
+};
+
+static ssize_t
+mshare_info_read(struct file *file, char __user *buf, size_t nbytes,
+ loff_t *ppos)
+{
+ char s[80];
+
+	sprintf(s, "%lu\n", mshare_align);
+ return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
+}
+
+static const struct file_operations mshare_info_ops = {
+ .read = mshare_info_read,
+ .llseek = noop_llseek,
+};
+
static const struct super_operations mshare_s_ops = {
.statfs = simple_statfs,
};
+static int
+msharefs_create_mshare_info(struct super_block *sb)
+{
+ struct msharefs_info *info = sb->s_fs_info;
+ struct dentry *root = sb->s_root;
+ struct dentry *dentry;
+ struct inode *inode;
+ int ret;
+
+ ret = -ENOMEM;
+ inode = new_inode(sb);
+ if (!inode)
+ goto out;
+
+ inode->i_ino = 2;
+ simple_inode_init_ts(inode);
+ inode_init_owner(&nop_mnt_idmap, inode, NULL, S_IFREG | 0444);
+ inode->i_fop = &mshare_info_ops;
+
+ dentry = d_alloc_name(root, "mshare_info");
+ if (!dentry)
+ goto out;
+
+ info->info_dentry = dentry;
+ d_add(dentry, inode);
+
+ return 0;
+out:
+ iput(inode);
+
+ return ret;
+}
+
static int
msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
{
+ struct msharefs_info *info;
struct inode *inode;
+ int ret;
sb->s_blocksize = PAGE_SIZE;
sb->s_blocksize_bits = PAGE_SHIFT;
@@ -36,6 +92,12 @@ msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
sb->s_op = &mshare_s_ops;
sb->s_time_gran = 1;
+ info = kzalloc(sizeof(*info), GFP_KERNEL);
+ if (!info)
+ return -ENOMEM;
+
+ sb->s_fs_info = info;
+
inode = new_inode(sb);
if (!inode)
return -ENOMEM;
@@ -51,7 +113,9 @@ msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
if (!sb->s_root)
return -ENOMEM;
- return 0;
+ ret = msharefs_create_mshare_info(sb);
+
+ return ret;
}
static int
@@ -71,10 +135,19 @@ mshare_init_fs_context(struct fs_context *fc)
return 0;
}
+static void
+msharefs_kill_super(struct super_block *sb)
+{
+ struct msharefs_info *info = sb->s_fs_info;
+
+ kfree(info);
+ kill_litter_super(sb);
+}
+
static struct file_system_type mshare_fs = {
.name = "msharefs",
.init_fs_context = mshare_init_fs_context,
- .kill_sb = kill_litter_super,
+ .kill_sb = msharefs_kill_super,
};
static int __init
--
2.43.5
* [PATCH 03/20] mm/mshare: make msharefs writable and support directories
2025-01-24 23:54 [PATCH 00/20] Add support for shared PTEs across processes Anthony Yznaga
2025-01-24 23:54 ` [PATCH 01/20] mm: Add msharefs filesystem Anthony Yznaga
2025-01-24 23:54 ` [PATCH 02/20] mm/mshare: pre-populate msharefs with information file Anthony Yznaga
@ 2025-01-24 23:54 ` Anthony Yznaga
2025-01-24 23:54 ` [PATCH 04/20] mm/mshare: allocate an mm_struct for msharefs files Anthony Yznaga
` (20 subsequent siblings)
23 siblings, 0 replies; 37+ messages in thread
From: Anthony Yznaga @ 2025-01-24 23:54 UTC (permalink / raw)
To: akpm, willy, markhemm, viro, david, khalid
Cc: anthony.yznaga, jthoughton, corbet, dave.hansen, kirill, luto,
brauner, arnd, ebiederm, catalin.marinas, mingo, peterz,
liam.howlett, lorenzo.stoakes, vbabka, jannh, hannes, mhocko,
roman.gushchin, shakeel.butt, muchun.song, tglx, cgroups, x86,
linux-doc, linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
vasily.averin, xhao, pcc, neilb, maz
From: Khalid Aziz <khalid@kernel.org>
Make the msharefs filesystem writable and allow directories to be
created to support better access control to mshare'd regions
defined in msharefs.
Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
mm/mshare.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 116 insertions(+), 1 deletion(-)
diff --git a/mm/mshare.c b/mm/mshare.c
index 6d3760d1af8e..b755346da827 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -19,14 +19,129 @@
const unsigned long mshare_align = P4D_SIZE;
+static const struct inode_operations msharefs_dir_inode_ops;
+static const struct inode_operations msharefs_file_inode_ops;
+
static const struct file_operations msharefs_file_operations = {
.open = simple_open,
};
+static struct inode
+*msharefs_get_inode(struct mnt_idmap *idmap, struct super_block *sb,
+ const struct inode *dir, umode_t mode)
+{
+ struct inode *inode = new_inode(sb);
+
+ if (!inode)
+ return ERR_PTR(-ENOMEM);
+
+ inode->i_ino = get_next_ino();
+ inode_init_owner(&nop_mnt_idmap, inode, dir, mode);
+ simple_inode_init_ts(inode);
+
+ switch (mode & S_IFMT) {
+ case S_IFREG:
+ inode->i_op = &msharefs_file_inode_ops;
+ inode->i_fop = &msharefs_file_operations;
+ break;
+ case S_IFDIR:
+ inode->i_op = &msharefs_dir_inode_ops;
+ inode->i_fop = &simple_dir_operations;
+ inc_nlink(inode);
+ break;
+ default:
+ discard_new_inode(inode);
+ return ERR_PTR(-EINVAL);
+ }
+
+ return inode;
+}
+
+static int
+msharefs_mknod(struct mnt_idmap *idmap, struct inode *dir,
+ struct dentry *dentry, umode_t mode)
+{
+ struct inode *inode;
+
+ inode = msharefs_get_inode(idmap, dir->i_sb, dir, mode);
+ if (IS_ERR(inode))
+ return PTR_ERR(inode);
+
+ d_instantiate(dentry, inode);
+ dget(dentry);
+ inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir));
+
+ return 0;
+}
+
+static int
+msharefs_create(struct mnt_idmap *idmap, struct inode *dir,
+ struct dentry *dentry, umode_t mode, bool excl)
+{
+ return msharefs_mknod(idmap, dir, dentry, mode | S_IFREG);
+}
+
+static int
+msharefs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
+ struct dentry *dentry, umode_t mode)
+{
+ int ret = msharefs_mknod(idmap, dir, dentry, mode | S_IFDIR);
+
+ if (!ret)
+ inc_nlink(dir);
+ return ret;
+}
+
struct msharefs_info {
struct dentry *info_dentry;
};
+static inline bool
+is_msharefs_info_file(const struct dentry *dentry)
+{
+ struct msharefs_info *info = dentry->d_sb->s_fs_info;
+
+ return info->info_dentry == dentry;
+}
+
+static int
+msharefs_rename(struct mnt_idmap *idmap,
+ struct inode *old_dir, struct dentry *old_dentry,
+ struct inode *new_dir, struct dentry *new_dentry,
+ unsigned int flags)
+{
+ if (is_msharefs_info_file(old_dentry) ||
+ is_msharefs_info_file(new_dentry))
+ return -EPERM;
+
+ return simple_rename(idmap, old_dir, old_dentry, new_dir,
+ new_dentry, flags);
+}
+
+static int
+msharefs_unlink(struct inode *dir, struct dentry *dentry)
+{
+ if (is_msharefs_info_file(dentry))
+ return -EPERM;
+
+ return simple_unlink(dir, dentry);
+}
+
+static const struct inode_operations msharefs_file_inode_ops = {
+ .setattr = simple_setattr,
+ .getattr = simple_getattr,
+};
+
+static const struct inode_operations msharefs_dir_inode_ops = {
+ .create = msharefs_create,
+ .lookup = simple_lookup,
+ .link = simple_link,
+ .unlink = msharefs_unlink,
+ .mkdir = msharefs_mkdir,
+ .rmdir = simple_rmdir,
+ .rename = msharefs_rename,
+};
+
static ssize_t
mshare_info_read(struct file *file, char __user *buf, size_t nbytes,
loff_t *ppos)
@@ -105,7 +220,7 @@ msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
inode->i_ino = 1;
inode->i_mode = S_IFDIR | 0777;
simple_inode_init_ts(inode);
- inode->i_op = &simple_dir_inode_operations;
+ inode->i_op = &msharefs_dir_inode_ops;
inode->i_fop = &simple_dir_operations;
set_nlink(inode, 2);
--
2.43.5
* [PATCH 04/20] mm/mshare: allocate an mm_struct for msharefs files
From: Anthony Yznaga @ 2025-01-24 23:54 UTC (permalink / raw)
When a new file is created under msharefs, allocate a new mm_struct
to be associated with it for the lifetime of the file.
The mm_struct will hold the VMAs and pagetables for the mshare region
the file represents.
Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
mm/mshare.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 60 insertions(+)
diff --git a/mm/mshare.c b/mm/mshare.c
index b755346da827..060292fb6a00 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -19,6 +19,10 @@
const unsigned long mshare_align = P4D_SIZE;
+struct mshare_data {
+ struct mm_struct *mm;
+};
+
static const struct inode_operations msharefs_dir_inode_ops;
static const struct inode_operations msharefs_file_inode_ops;
@@ -26,11 +30,51 @@ static const struct file_operations msharefs_file_operations = {
.open = simple_open,
};
+static int
+msharefs_fill_mm(struct inode *inode)
+{
+ struct mm_struct *mm;
+ struct mshare_data *m_data = NULL;
+ int ret = 0;
+
+ mm = mm_alloc();
+ if (!mm) {
+ ret = -ENOMEM;
+ goto err_free;
+ }
+
+ mm->mmap_base = mm->task_size = 0;
+
+ m_data = kzalloc(sizeof(*m_data), GFP_KERNEL);
+ if (!m_data) {
+ ret = -ENOMEM;
+ goto err_free;
+ }
+ m_data->mm = mm;
+ inode->i_private = m_data;
+
+ return 0;
+
+err_free:
+ if (mm)
+ mmput(mm);
+ kfree(m_data);
+ return ret;
+}
+
+static void
+msharefs_delmm(struct mshare_data *m_data)
+{
+ mmput(m_data->mm);
+ kfree(m_data);
+}
+
static struct inode
*msharefs_get_inode(struct mnt_idmap *idmap, struct super_block *sb,
const struct inode *dir, umode_t mode)
{
struct inode *inode = new_inode(sb);
+ int ret;
if (!inode)
return ERR_PTR(-ENOMEM);
@@ -43,6 +87,11 @@ static struct inode
case S_IFREG:
inode->i_op = &msharefs_file_inode_ops;
inode->i_fop = &msharefs_file_operations;
+ ret = msharefs_fill_mm(inode);
+ if (ret) {
+ discard_new_inode(inode);
+ inode = ERR_PTR(ret);
+ }
break;
case S_IFDIR:
inode->i_op = &msharefs_dir_inode_ops;
@@ -142,6 +191,16 @@ static const struct inode_operations msharefs_dir_inode_ops = {
.rename = msharefs_rename,
};
+static void
+mshare_evict_inode(struct inode *inode)
+{
+ struct mshare_data *m_data = inode->i_private;
+
+ if (m_data)
+ msharefs_delmm(m_data);
+ clear_inode(inode);
+}
+
static ssize_t
mshare_info_read(struct file *file, char __user *buf, size_t nbytes,
loff_t *ppos)
@@ -159,6 +218,7 @@ static const struct file_operations mshare_info_ops = {
static const struct super_operations mshare_s_ops = {
.statfs = simple_statfs,
+ .evict_inode = mshare_evict_inode,
};
static int
--
2.43.5
* [PATCH 05/20] mm/mshare: Add ioctl support
From: Anthony Yznaga @ 2025-01-24 23:54 UTC (permalink / raw)
From: Khalid Aziz <khalid@kernel.org>
Reserve a range of ioctls for msharefs and add the first two ioctls
to get and set the start address and size of an mshare region.
Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
.../userspace-api/ioctl/ioctl-number.rst | 1 +
include/uapi/linux/msharefs.h | 29 ++++++++
mm/mshare.c | 68 +++++++++++++++++++
3 files changed, 98 insertions(+)
create mode 100644 include/uapi/linux/msharefs.h
diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 243f1f1b554a..aa22b5412e4d 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -303,6 +303,7 @@ Code Seq# Include File Comments
'v' 20-27 arch/powerpc/include/uapi/asm/vas-api.h VAS API
'v' C0-FF linux/meye.h conflict!
'w' all CERN SCI driver
+'x' 00-1F linux/msharefs.h msharefs filesystem
'y' 00-1F packet based user level communications
<mailto:zapman@interlan.net>
'z' 00-3F CAN bus card conflict!
diff --git a/include/uapi/linux/msharefs.h b/include/uapi/linux/msharefs.h
new file mode 100644
index 000000000000..c7b509c7e093
--- /dev/null
+++ b/include/uapi/linux/msharefs.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * msharefs defines a memory region that is shared across processes.
+ * ioctl is used on files created under msharefs to set various
+ * attributes on these shared memory regions
+ *
+ *
+ * Copyright (C) 2024 Oracle Corp. All rights reserved.
+ * Author: Khalid Aziz <khalid@kernel.org>
+ */
+
+#ifndef _UAPI_LINUX_MSHAREFS_H
+#define _UAPI_LINUX_MSHAREFS_H
+
+#include <linux/ioctl.h>
+#include <linux/types.h>
+
+/*
+ * msharefs specific ioctl commands
+ */
+#define MSHAREFS_GET_SIZE _IOR('x', 0, struct mshare_info)
+#define MSHAREFS_SET_SIZE _IOW('x', 1, struct mshare_info)
+
+struct mshare_info {
+ __u64 start;
+ __u64 size;
+};
+
+#endif
diff --git a/mm/mshare.c b/mm/mshare.c
index 060292fb6a00..056cb5a82547 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -10,24 +10,91 @@
*
* Copyright (C) 2024 Oracle Corp. All rights reserved.
* Author: Khalid Aziz <khalid@kernel.org>
+ * Author: Matthew Wilcox <willy@infradead.org>
*
*/
#include <linux/fs.h>
#include <linux/fs_context.h>
+#include <linux/spinlock_types.h>
#include <uapi/linux/magic.h>
+#include <uapi/linux/msharefs.h>
const unsigned long mshare_align = P4D_SIZE;
struct mshare_data {
struct mm_struct *mm;
+ spinlock_t m_lock;
+ struct mshare_info minfo;
};
+static long
+msharefs_set_size(struct mm_struct *host_mm, struct mshare_data *m_data,
+ struct mshare_info *minfo)
+{
+ /*
+ * Validate alignment for start address and size
+ */
+ if (!minfo->size || ((minfo->start | minfo->size) & (mshare_align - 1))) {
+ spin_unlock(&m_data->m_lock);
+ return -EINVAL;
+ }
+
+ host_mm->mmap_base = minfo->start;
+ host_mm->task_size = minfo->size;
+
+ m_data->minfo.start = host_mm->mmap_base;
+ m_data->minfo.size = host_mm->task_size;
+ spin_unlock(&m_data->m_lock);
+
+ return 0;
+}
+
+static long
+msharefs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+{
+ struct mshare_data *m_data = filp->private_data;
+ struct mm_struct *host_mm = m_data->mm;
+ struct mshare_info minfo;
+
+ switch (cmd) {
+ case MSHAREFS_GET_SIZE:
+ spin_lock(&m_data->m_lock);
+ minfo = m_data->minfo;
+ spin_unlock(&m_data->m_lock);
+
+ if (copy_to_user((void __user *)arg, &minfo, sizeof(minfo)))
+ return -EFAULT;
+
+ return 0;
+
+ case MSHAREFS_SET_SIZE:
+ if (copy_from_user(&minfo, (struct mshare_info __user *)arg,
+ sizeof(minfo)))
+ return -EFAULT;
+
+ /*
+ * If this mshare region has been set up once already, bail out
+ */
+ spin_lock(&m_data->m_lock);
+ if (m_data->minfo.size != 0) {
+ spin_unlock(&m_data->m_lock);
+ return -EINVAL;
+ }
+
+ return msharefs_set_size(host_mm, m_data, &minfo);
+
+ default:
+ return -ENOTTY;
+ }
+}
+
static const struct inode_operations msharefs_dir_inode_ops;
static const struct inode_operations msharefs_file_inode_ops;
static const struct file_operations msharefs_file_operations = {
.open = simple_open,
+ .unlocked_ioctl = msharefs_ioctl,
};
static int
@@ -51,6 +118,7 @@ msharefs_fill_mm(struct inode *inode)
goto err_free;
}
m_data->mm = mm;
+ spin_lock_init(&m_data->m_lock);
inode->i_private = m_data;
return 0;
--
2.43.5
* [PATCH 06/20] mm/mshare: Add a vma flag to indicate an mshare region
From: Anthony Yznaga @ 2025-01-24 23:54 UTC (permalink / raw)
From: Khalid Aziz <khalid@kernel.org>
An mshare region contains zero or more actual vmas that map objects
in the mshare range with shared page tables.
Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
include/linux/mm.h | 19 +++++++++++++++++++
include/trace/events/mmflags.h | 7 +++++++
2 files changed, 26 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8483e09aeb2c..bca7aee40f4d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -440,6 +440,13 @@ extern unsigned int kobjsize(const void *objp);
#define VM_DROPPABLE VM_NONE
#endif
+#ifdef CONFIG_MSHARE
+#define VM_MSHARE_BIT 41
+#define VM_MSHARE BIT(VM_MSHARE_BIT)
+#else
+#define VM_MSHARE VM_NONE
+#endif
+
#ifdef CONFIG_64BIT
/* VM is sealed, in vm_flags */
#define VM_SEALED _BITUL(63)
@@ -1092,6 +1099,18 @@ static inline bool vma_is_anon_shmem(struct vm_area_struct *vma) { return false;
int vma_is_stack_for_current(struct vm_area_struct *vma);
+#ifdef CONFIG_MSHARE
+static inline bool vma_is_mshare(const struct vm_area_struct *vma)
+{
+ return vma->vm_flags & VM_MSHARE;
+}
+#else
+static inline bool vma_is_mshare(const struct vm_area_struct *vma)
+{
+ return false;
+}
+#endif
+
/* flush_tlb_range() takes a vma, not a mm, and can care about flags */
#define TLB_FLUSH_VMA(mm,flags) { .vm_mm = (mm), .vm_flags = (flags) }
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 3bc8656c8359..0c7d50ab56cd 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -160,6 +160,12 @@ IF_HAVE_PG_ARCH_3(arch_3)
# define IF_HAVE_VM_DROPPABLE(flag, name)
#endif
+#ifdef CONFIG_MSHARE
+# define IF_HAVE_VM_MSHARE(flag, name) {flag, name},
+#else
+# define IF_HAVE_VM_MSHARE(flag, name)
+#endif
+
#define __def_vmaflag_names \
{VM_READ, "read" }, \
{VM_WRITE, "write" }, \
@@ -193,6 +199,7 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" ) \
{VM_HUGEPAGE, "hugepage" }, \
{VM_NOHUGEPAGE, "nohugepage" }, \
IF_HAVE_VM_DROPPABLE(VM_DROPPABLE, "droppable" ) \
+IF_HAVE_VM_MSHARE(VM_MSHARE, "mshare" ) \
{VM_MERGEABLE, "mergeable" } \
#define show_vma_flags(flags) \
--
2.43.5
* [PATCH 07/20] mm/mshare: Add mmap support
From: Anthony Yznaga @ 2025-01-24 23:54 UTC (permalink / raw)
From: Khalid Aziz <khalid@kernel.org>
Add support for mapping an mshare region into a process after the
region has been established in msharefs. Disallow operations that
could split the resulting msharefs vma, such as partial unmaps and
protection changes. Fault handling, mapping, unmapping, and
protection changes for objects mapped into an mshare region will
be done using the shared vmas created for them in the host mm. This
functionality will be added in later patches.
Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
mm/mshare.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 71 insertions(+)
diff --git a/mm/mshare.c b/mm/mshare.c
index 056cb5a82547..529a90fe1602 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -16,6 +16,7 @@
#include <linux/fs.h>
#include <linux/fs_context.h>
+#include <linux/mman.h>
#include <linux/spinlock_types.h>
#include <uapi/linux/magic.h>
#include <uapi/linux/msharefs.h>
@@ -28,6 +29,74 @@ struct mshare_data {
struct mshare_info minfo;
};
+static int mshare_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
+{
+ return -EINVAL;
+}
+
+static int mshare_vm_op_mprotect(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, unsigned long newflags)
+{
+ return -EINVAL;
+}
+
+static const struct vm_operations_struct msharefs_vm_ops = {
+ .may_split = mshare_vm_op_split,
+ .mprotect = mshare_vm_op_mprotect,
+};
+
+/*
+ * msharefs_mmap() - mmap an mshare region
+ */
+static int
+msharefs_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct mshare_data *m_data = file->private_data;
+
+ vma->vm_private_data = m_data;
+ vm_flags_set(vma, VM_MSHARE | VM_DONTEXPAND);
+ vma->vm_ops = &msharefs_vm_ops;
+
+ return 0;
+}
+
+static unsigned long
+msharefs_get_unmapped_area(struct file *file, unsigned long addr,
+ unsigned long len, unsigned long pgoff, unsigned long flags)
+{
+ struct mshare_data *m_data = file->private_data;
+ struct mm_struct *mm = current->mm;
+ unsigned long mshare_start, mshare_size;
+ const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
+
+ mmap_assert_write_locked(mm);
+
+ if ((flags & MAP_TYPE) == MAP_PRIVATE)
+ return -EINVAL;
+
+ spin_lock(&m_data->m_lock);
+ mshare_start = m_data->minfo.start;
+ mshare_size = m_data->minfo.size;
+ spin_unlock(&m_data->m_lock);
+
+ if ((mshare_size == 0) || (len != mshare_size))
+ return -EINVAL;
+
+ if (len > mmap_end - mmap_min_addr)
+ return -ENOMEM;
+
+ if (addr && (addr != mshare_start))
+ return -EINVAL;
+
+ if (flags & MAP_FIXED)
+ return addr;
+
+ if (find_vma_intersection(mm, mshare_start, mshare_start + mshare_size))
+ return -EEXIST;
+
+ return mshare_start;
+}
+
static long
msharefs_set_size(struct mm_struct *host_mm, struct mshare_data *m_data,
struct mshare_info *minfo)
@@ -94,6 +163,8 @@ static const struct inode_operations msharefs_file_inode_ops;
static const struct file_operations msharefs_file_operations = {
.open = simple_open,
+ .mmap = msharefs_mmap,
+ .get_unmapped_area = msharefs_get_unmapped_area,
.unlocked_ioctl = msharefs_ioctl,
};
--
2.43.5
* [PATCH 08/20] mm/mshare: flush all TLBs when updating PTEs in an mshare range
From: Anthony Yznaga @ 2025-01-24 23:54 UTC (permalink / raw)
Unlike the mm of a task, an mshare host mm is not updated on context
switch. In particular, this means that mm_cpumask is never updated,
which results in TLB flushes for updates to mshare PTEs being done
only on the local CPU. To ensure entries are flushed from non-local
TLBs, set up an mmu notifier on the mshare mm and use the
.arch_invalidate_secondary_tlbs callback to flush all TLBs.
arch_invalidate_secondary_tlbs guarantees that TLB entries will be
flushed before pages are freed when unmapping pages in an mshare region.
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
mm/mshare.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/mm/mshare.c b/mm/mshare.c
index 529a90fe1602..8dca4199dd01 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -17,9 +17,11 @@
#include <linux/fs.h>
#include <linux/fs_context.h>
#include <linux/mman.h>
+#include <linux/mmu_notifier.h>
#include <linux/spinlock_types.h>
#include <uapi/linux/magic.h>
#include <uapi/linux/msharefs.h>
+#include <asm/tlbflush.h>
const unsigned long mshare_align = P4D_SIZE;
@@ -27,6 +29,17 @@ struct mshare_data {
struct mm_struct *mm;
spinlock_t m_lock;
struct mshare_info minfo;
+ struct mmu_notifier mn;
+};
+
+static void mshare_invalidate_tlbs(struct mmu_notifier *mn, struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ flush_tlb_all();
+}
+
+static const struct mmu_notifier_ops mshare_mmu_ops = {
+ .arch_invalidate_secondary_tlbs = mshare_invalidate_tlbs,
};
static int mshare_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
@@ -191,6 +204,10 @@ msharefs_fill_mm(struct inode *inode)
m_data->mm = mm;
spin_lock_init(&m_data->m_lock);
inode->i_private = m_data;
+ m_data->mn.ops = &mshare_mmu_ops;
+ ret = mmu_notifier_register(&m_data->mn, mm);
+ if (ret)
+ goto err_free;
return 0;
--
2.43.5
* [PATCH 09/20] sched/numa: do not scan msharefs vmas
From: Anthony Yznaga @ 2025-01-24 23:54 UTC (permalink / raw)
Scanning an msharefs vma results in changes to the shared page table,
but the TLB flushes only go to the process that has the vma mapped.
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
kernel/sched/fair.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e9ca38512de..e9aa1e35f40e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3374,7 +3374,8 @@ static void task_numa_work(struct callback_head *work)
for (; vma; vma = vma_next(&vmi)) {
if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
- is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
+ is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP) ||
+ vma_is_mshare(vma)) {
trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_UNSUITABLE);
continue;
}
--
2.43.5
* [PATCH 10/20] mm: add mmap_read_lock_killable_nested()
From: Anthony Yznaga @ 2025-01-24 23:54 UTC (permalink / raw)
This will be used to support mshare functionality, where the read
lock on an mshare host mm is taken while holding the lock on a
process mm.
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
include/linux/mmap_lock.h | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 45a21faa3ff6..4671b4435d2a 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -191,6 +191,13 @@ static inline void mmap_read_lock(struct mm_struct *mm)
__mmap_lock_trace_acquire_returned(mm, false, true);
}
+static inline void mmap_read_lock_nested(struct mm_struct *mm, int subclass)
+{
+ __mmap_lock_trace_start_locking(mm, false);
+ down_read_nested(&mm->mmap_lock, subclass);
+ __mmap_lock_trace_acquire_returned(mm, false, true);
+}
+
static inline int mmap_read_lock_killable(struct mm_struct *mm)
{
int ret;
--
2.43.5
* [PATCH 11/20] mm: add and use unmap_page_range vm_ops hook
From: Anthony Yznaga @ 2025-01-24 23:54 UTC (permalink / raw)
Special handling is needed when unmapping a hugetlb vma and will
be needed when unmapping an msharefs vma once support is added for
handling faults in an mshare region.
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
include/linux/mm.h | 10 ++++++++++
ipc/shm.c | 17 +++++++++++++++++
mm/hugetlb.c | 25 +++++++++++++++++++++++++
mm/memory.c | 36 +++++++++++++-----------------------
4 files changed, 65 insertions(+), 23 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bca7aee40f4d..1314af11596d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -39,6 +39,7 @@ struct anon_vma_chain;
struct user_struct;
struct pt_regs;
struct folio_batch;
+struct zap_details;
extern int sysctl_page_lock_unfairness;
@@ -687,8 +688,17 @@ struct vm_operations_struct {
*/
struct page *(*find_special_page)(struct vm_area_struct *vma,
unsigned long addr);
+ void (*unmap_page_range)(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end,
+ struct zap_details *details);
};
+void __unmap_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end,
+ struct zap_details *details);
+
#ifdef CONFIG_NUMA_BALANCING
static inline void vma_numab_state_init(struct vm_area_struct *vma)
{
diff --git a/ipc/shm.c b/ipc/shm.c
index 99564c870084..cadd551e60b9 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -585,6 +585,22 @@ static struct mempolicy *shm_get_policy(struct vm_area_struct *vma,
}
#endif
+static void shm_unmap_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end,
+ struct zap_details *details)
+{
+ struct file *file = vma->vm_file;
+ struct shm_file_data *sfd = shm_file_data(file);
+
+ if (sfd->vm_ops->unmap_page_range) {
+ sfd->vm_ops->unmap_page_range(tlb, vma, addr, end, details);
+ return;
+ }
+
+ __unmap_page_range(tlb, vma, addr, end, details);
+}
+
static int shm_mmap(struct file *file, struct vm_area_struct *vma)
{
struct shm_file_data *sfd = shm_file_data(file);
@@ -685,6 +701,7 @@ static const struct vm_operations_struct shm_vm_ops = {
.set_policy = shm_set_policy,
.get_policy = shm_get_policy,
#endif
+ .unmap_page_range = shm_unmap_page_range,
};
/**
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 87761b042ed0..ac3ef62a3dc4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5147,6 +5147,30 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
return 0;
}
+static void hugetlb_vm_op_unmap_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end,
+ struct zap_details *details)
+{
+ zap_flags_t zap_flags = details ? details->zap_flags : 0;
+
+ /*
+ * It is undesirable to test vma->vm_file as it
+ * should be non-null for valid hugetlb area.
+ * However, vm_file will be NULL in the error
+ * cleanup path of mmap_region. When
+ * hugetlbfs ->mmap method fails,
+ * mmap_region() nullifies vma->vm_file
+ * before calling this function to clean up.
+ * Since no pte has actually been setup, it is
+ * safe to do nothing in this case.
+ */
+ if (!vma->vm_file)
+ return;
+
+ __unmap_hugepage_range(tlb, vma, addr, end, NULL, zap_flags);
+}
+
/*
* When a new function is introduced to vm_operations_struct and added
* to hugetlb_vm_ops, please consider adding the function to shm_vm_ops.
@@ -5160,6 +5184,7 @@ const struct vm_operations_struct hugetlb_vm_ops = {
.close = hugetlb_vm_op_close,
.may_split = hugetlb_vm_op_split,
.pagesize = hugetlb_vm_op_pagesize,
+ .unmap_page_range = hugetlb_vm_op_unmap_page_range,
};
static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
diff --git a/mm/memory.c b/mm/memory.c
index 2a20e3810534..20bafbb10ea7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1875,7 +1875,7 @@ static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
return addr;
}
-void unmap_page_range(struct mmu_gather *tlb,
+void __unmap_page_range(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long addr, unsigned long end,
struct zap_details *details)
@@ -1895,6 +1895,16 @@ void unmap_page_range(struct mmu_gather *tlb,
tlb_end_vma(tlb, vma);
}
+void unmap_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end,
+ struct zap_details *details)
+{
+ if (vma->vm_ops && vma->vm_ops->unmap_page_range)
+ vma->vm_ops->unmap_page_range(tlb, vma, addr, end, details);
+ else
+ __unmap_page_range(tlb, vma, addr, end, details);
+}
static void unmap_single_vma(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start_addr,
@@ -1916,28 +1926,8 @@ static void unmap_single_vma(struct mmu_gather *tlb,
if (unlikely(vma->vm_flags & VM_PFNMAP))
untrack_pfn(vma, 0, 0, mm_wr_locked);
- if (start != end) {
- if (unlikely(is_vm_hugetlb_page(vma))) {
- /*
- * It is undesirable to test vma->vm_file as it
- * should be non-null for valid hugetlb area.
- * However, vm_file will be NULL in the error
- * cleanup path of mmap_region. When
- * hugetlbfs ->mmap method fails,
- * mmap_region() nullifies vma->vm_file
- * before calling this function to clean up.
- * Since no pte has actually been setup, it is
- * safe to do nothing in this case.
- */
- if (vma->vm_file) {
- zap_flags_t zap_flags = details ?
- details->zap_flags : 0;
- __unmap_hugepage_range(tlb, vma, start, end,
- NULL, zap_flags);
- }
- } else
- unmap_page_range(tlb, vma, start, end, details);
- }
+ if (start != end)
+ unmap_page_range(tlb, vma, start, end, details);
}
/**
--
2.43.5
* [PATCH 12/20] mm/mshare: prepare for page table sharing support
From: Anthony Yznaga @ 2025-01-24 23:54 UTC (permalink / raw)
From: Khalid Aziz <khalid@kernel.org>
In preparation for enabling the handling of page faults in an mshare
region, provide a way to link an mshare shared page table into a
process page table and otherwise find the actual vma in order to
handle a page fault. Modify the unmap path to ensure that page tables
in mshare regions are unlinked and kept intact when a process exits
or an mshare region is explicitly unmapped.
Signed-off-by: Khalid Aziz <khalid@kernel.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
include/linux/mm.h | 6 +++++
mm/memory.c | 38 ++++++++++++++++++++++------
mm/mshare.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 98 insertions(+), 8 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1314af11596d..9889c4757f45 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1110,11 +1110,17 @@ static inline bool vma_is_anon_shmem(struct vm_area_struct *vma) { return false;
int vma_is_stack_for_current(struct vm_area_struct *vma);
#ifdef CONFIG_MSHARE
+vm_fault_t find_shared_vma(struct vm_area_struct **vma, unsigned long *addrp);
static inline bool vma_is_mshare(const struct vm_area_struct *vma)
{
return vma->vm_flags & VM_MSHARE;
}
#else
+static inline vm_fault_t find_shared_vma(struct vm_area_struct **vma, unsigned long *addrp)
+{
+ WARN_ON_ONCE(1);
+ return VM_FAULT_SIGBUS;
+}
static inline bool vma_is_mshare(const struct vm_area_struct *vma)
{
return false;
diff --git a/mm/memory.c b/mm/memory.c
index 20bafbb10ea7..9374bb184a5f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -263,7 +263,8 @@ static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
unsigned long addr, unsigned long end,
- unsigned long floor, unsigned long ceiling)
+ unsigned long floor, unsigned long ceiling,
+ bool shared_pud)
{
p4d_t *p4d;
unsigned long next;
@@ -275,7 +276,10 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
next = p4d_addr_end(addr, end);
if (p4d_none_or_clear_bad(p4d))
continue;
- free_pud_range(tlb, p4d, addr, next, floor, ceiling);
+ if (unlikely(shared_pud))
+ p4d_clear(p4d);
+ else
+ free_pud_range(tlb, p4d, addr, next, floor, ceiling);
} while (p4d++, addr = next, addr != end);
start &= PGDIR_MASK;
@@ -297,9 +301,10 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
/*
* This function frees user-level page tables of a process.
*/
-void free_pgd_range(struct mmu_gather *tlb,
+static void __free_pgd_range(struct mmu_gather *tlb,
unsigned long addr, unsigned long end,
- unsigned long floor, unsigned long ceiling)
+ unsigned long floor, unsigned long ceiling,
+ bool shared_pud)
{
pgd_t *pgd;
unsigned long next;
@@ -355,10 +360,17 @@ void free_pgd_range(struct mmu_gather *tlb,
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- free_p4d_range(tlb, pgd, addr, next, floor, ceiling);
+ free_p4d_range(tlb, pgd, addr, next, floor, ceiling, shared_pud);
} while (pgd++, addr = next, addr != end);
}
+void free_pgd_range(struct mmu_gather *tlb,
+ unsigned long addr, unsigned long end,
+ unsigned long floor, unsigned long ceiling)
+{
+ __free_pgd_range(tlb, addr, end, floor, ceiling, false);
+}
+
void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
struct vm_area_struct *vma, unsigned long floor,
unsigned long ceiling, bool mm_wr_locked)
@@ -395,9 +407,12 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
/*
* Optimization: gather nearby vmas into one call down
+ *
+ * Do not free the shared page tables of an mshare region.
*/
while (next && next->vm_start <= vma->vm_end + PMD_SIZE
- && !is_vm_hugetlb_page(next)) {
+ && !is_vm_hugetlb_page(next)
+ && !vma_is_mshare(next)) {
vma = next;
next = mas_find(mas, ceiling - 1);
if (unlikely(xa_is_zero(next)))
@@ -408,9 +423,11 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
unlink_file_vma_batch_add(&vb, vma);
}
unlink_file_vma_batch_final(&vb);
- free_pgd_range(tlb, addr, vma->vm_end,
- floor, next ? next->vm_start : ceiling);
+ __free_pgd_range(tlb, addr, vma->vm_end,
+ floor, next ? next->vm_start : ceiling,
+ vma_is_mshare(vma));
}
+
vma = next;
} while (vma);
}
@@ -6148,6 +6165,11 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
if (ret)
goto out;
+ if (unlikely(vma_is_mshare(vma))) {
+ WARN_ON_ONCE(1);
+ return VM_FAULT_SIGBUS;
+ }
+
if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
flags & FAULT_FLAG_INSTRUCTION,
flags & FAULT_FLAG_REMOTE)) {
diff --git a/mm/mshare.c b/mm/mshare.c
index 8dca4199dd01..9ada1544aeb1 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -42,6 +42,56 @@ static const struct mmu_notifier_ops mshare_mmu_ops = {
.arch_invalidate_secondary_tlbs = mshare_invalidate_tlbs,
};
+static p4d_t *walk_to_p4d(struct mm_struct *mm, unsigned long addr)
+{
+ pgd_t *pgd;
+ p4d_t *p4d;
+
+ pgd = pgd_offset(mm, addr);
+ p4d = p4d_alloc(mm, pgd, addr);
+ if (!p4d)
+ return NULL;
+
+ return p4d;
+}
+
+/* Returns holding the host mm's lock for read. Caller must release. */
+vm_fault_t
+find_shared_vma(struct vm_area_struct **vmap, unsigned long *addrp)
+{
+ struct vm_area_struct *vma, *guest = *vmap;
+ struct mshare_data *m_data = guest->vm_private_data;
+ struct mm_struct *host_mm = m_data->mm;
+ unsigned long host_addr;
+ p4d_t *p4d, *guest_p4d;
+
+ mmap_read_lock_nested(host_mm, SINGLE_DEPTH_NESTING);
+ host_addr = *addrp - guest->vm_start + host_mm->mmap_base;
+ p4d = walk_to_p4d(host_mm, host_addr);
+ guest_p4d = walk_to_p4d(guest->vm_mm, *addrp);
+ if (!p4d_same(*guest_p4d, *p4d)) {
+ set_p4d(guest_p4d, *p4d);
+ mmap_read_unlock(host_mm);
+ return VM_FAULT_NOPAGE;
+ }
+
+ *addrp = host_addr;
+ vma = find_vma(host_mm, host_addr);
+
+ /* XXX: expand stack? */
+ if (vma && vma->vm_start > host_addr)
+ vma = NULL;
+
+ *vmap = vma;
+
+ /*
+ * release host mm lock unless a matching vma is found
+ */
+ if (!vma)
+ mmap_read_unlock(host_mm);
+ return 0;
+}
+
static int mshare_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
{
return -EINVAL;
@@ -53,9 +103,21 @@ static int mshare_vm_op_mprotect(struct vm_area_struct *vma, unsigned long start
return -EINVAL;
}
+static void mshare_vm_op_unmap_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end,
+ struct zap_details *details)
+{
+ /*
+ * The msharefs vma is being unmapped. Do not unmap pages in the
+ * mshare region itself.
+ */
+}
+
static const struct vm_operations_struct msharefs_vm_ops = {
.may_split = mshare_vm_op_split,
.mprotect = mshare_vm_op_mprotect,
+ .unmap_page_range = mshare_vm_op_unmap_page_range,
};
/*
--
2.43.5
* [PATCH 13/20] x86/mm: enable page table sharing
From: Anthony Yznaga @ 2025-01-24 23:54 UTC (permalink / raw)
Enable x86 support for handling page faults in an mshare region by
redirecting page faults to operate on the mshare mm_struct and the vmas
contained in it.
Some permission checks are done using vma flags in architecture-specific
fault handling code, so the actual vma needed to complete the handling
is acquired before calling handle_mm_fault(). Because of this, an
ARCH_SUPPORTS_MSHARE config option is added.
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
arch/Kconfig | 3 +++
arch/x86/Kconfig | 1 +
arch/x86/mm/fault.c | 37 ++++++++++++++++++++++++++++++++++++-
mm/Kconfig | 2 +-
4 files changed, 41 insertions(+), 2 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 6682b2a53e34..32474cdcb882 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1640,6 +1640,9 @@ config HAVE_ARCH_PFN_VALID
config ARCH_SUPPORTS_DEBUG_PAGEALLOC
bool
+config ARCH_SUPPORTS_MSHARE
+ bool
+
config ARCH_SUPPORTS_PAGE_TABLE_CHECK
bool
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2e1a3e4386de..453a39098dfa 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -120,6 +120,7 @@ config X86
select ARCH_SUPPORTS_ACPI
select ARCH_SUPPORTS_ATOMIC_RMW
select ARCH_SUPPORTS_DEBUG_PAGEALLOC
+ select ARCH_SUPPORTS_MSHARE if X86_64
select ARCH_SUPPORTS_PAGE_TABLE_CHECK if X86_64
select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
select ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP if NR_CPUS <= 4096
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e6c469b323cc..4b55ade61a01 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1217,6 +1217,8 @@ void do_user_addr_fault(struct pt_regs *regs,
struct mm_struct *mm;
vm_fault_t fault;
unsigned int flags = FAULT_FLAG_DEFAULT;
+ bool is_shared_vma;
+ unsigned long addr;
tsk = current;
mm = tsk->mm;
@@ -1330,6 +1332,12 @@ void do_user_addr_fault(struct pt_regs *regs,
if (!vma)
goto lock_mmap;
+ /* mshare does not support per-VMA locks yet */
+ if (vma_is_mshare(vma)) {
+ vma_end_read(vma);
+ goto lock_mmap;
+ }
+
if (unlikely(access_error(error_code, vma))) {
bad_area_access_error(regs, error_code, address, NULL, vma);
count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
@@ -1358,17 +1366,38 @@ void do_user_addr_fault(struct pt_regs *regs,
lock_mmap:
retry:
+ addr = address;
+ is_shared_vma = false;
vma = lock_mm_and_find_vma(mm, address, regs);
if (unlikely(!vma)) {
bad_area_nosemaphore(regs, error_code, address);
return;
}
+ if (unlikely(vma_is_mshare(vma))) {
+ fault = find_shared_vma(&vma, &addr);
+
+ if (fault) {
+ mmap_read_unlock(mm);
+ goto done;
+ }
+
+ if (!vma) {
+ mmap_read_unlock(mm);
+ bad_area_nosemaphore(regs, error_code, address);
+ return;
+ }
+
+ is_shared_vma = true;
+ }
+
/*
* Ok, we have a good vm_area for this memory access, so
* we can handle it..
*/
if (unlikely(access_error(error_code, vma))) {
+ if (unlikely(is_shared_vma))
+ mmap_read_unlock(vma->vm_mm);
bad_area_access_error(regs, error_code, address, mm, vma);
return;
}
@@ -1386,7 +1415,11 @@ void do_user_addr_fault(struct pt_regs *regs,
* userland). The return to userland is identified whenever
* FAULT_FLAG_USER|FAULT_FLAG_KILLABLE are both set in flags.
*/
- fault = handle_mm_fault(vma, address, flags, regs);
+ fault = handle_mm_fault(vma, addr, flags, regs);
+
+ if (unlikely(is_shared_vma) && ((fault & VM_FAULT_COMPLETED) ||
+ (fault & VM_FAULT_RETRY) || fault_signal_pending(fault, regs)))
+ mmap_read_unlock(mm);
if (fault_signal_pending(fault, regs)) {
/*
@@ -1414,6 +1447,8 @@ void do_user_addr_fault(struct pt_regs *regs,
goto retry;
}
+ if (unlikely(is_shared_vma))
+ mmap_read_unlock(vma->vm_mm);
mmap_read_unlock(mm);
done:
if (likely(!(fault & VM_FAULT_ERROR)))
diff --git a/mm/Kconfig b/mm/Kconfig
index ba3dbe31f86a..4fc056bb5643 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1360,7 +1360,7 @@ config PT_RECLAIM
config MSHARE
bool "Mshare"
- depends on MMU
+ depends on MMU && ARCH_SUPPORTS_MSHARE
help
Enable msharefs: A ram-based filesystem that allows multiple
processes to share page table entries for shared pages. A file
--
2.43.5
* [PATCH 14/20] mm: create __do_mmap() to take an mm_struct * arg
From: Anthony Yznaga @ 2025-01-24 23:54 UTC (permalink / raw)
In preparation for mapping objects into an mshare region, create
__do_mmap() to allow mapping into a specified mm. There are no
functional changes otherwise.
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
include/linux/mm.h | 16 ++++++++++++++++
mm/mmap.c | 7 +++----
mm/vma.c | 15 +++++++--------
mm/vma.h | 2 +-
4 files changed, 27 insertions(+), 13 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9889c4757f45..80429d1a6ae4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3398,10 +3398,26 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
return __get_unmapped_area(file, addr, len, pgoff, flags, 0);
}
+#ifdef CONFIG_MMU
+unsigned long __do_mmap(struct file *file, unsigned long addr,
+ unsigned long len, unsigned long prot, unsigned long flags,
+ vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
+ struct list_head *uf, struct mm_struct *mm);
+static inline unsigned long do_mmap(struct file *file, unsigned long addr,
+ unsigned long len, unsigned long prot, unsigned long flags,
+ vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
+ struct list_head *uf)
+{
+ return __do_mmap(file, addr, len, prot, flags, vm_flags, pgoff,
+ populate, uf, current->mm);
+}
+#else
extern unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot, unsigned long flags,
vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
struct list_head *uf);
+#endif
+
extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
unsigned long start, size_t len, struct list_head *uf,
bool unlock);
diff --git a/mm/mmap.c b/mm/mmap.c
index cda01071c7b1..2d327b148bfc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -334,13 +334,12 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
* Returns: Either an error, or the address at which the requested mapping has
* been performed.
*/
-unsigned long do_mmap(struct file *file, unsigned long addr,
+unsigned long __do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flags, vm_flags_t vm_flags,
unsigned long pgoff, unsigned long *populate,
- struct list_head *uf)
+ struct list_head *uf, struct mm_struct *mm)
{
- struct mm_struct *mm = current->mm;
int pkey = 0;
*populate = 0;
@@ -558,7 +557,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
vm_flags |= VM_NORESERVE;
}
- addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
+ addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, mm);
if (!IS_ERR_VALUE(addr) &&
((vm_flags & VM_LOCKED) ||
(flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
diff --git a/mm/vma.c b/mm/vma.c
index af1d549b179c..28942701e301 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2433,9 +2433,8 @@ static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
static unsigned long __mmap_region(struct file *file, unsigned long addr,
unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
- struct list_head *uf)
+ struct list_head *uf, struct mm_struct *mm)
{
- struct mm_struct *mm = current->mm;
struct vm_area_struct *vma = NULL;
int error;
VMA_ITERATOR(vmi, mm, addr);
@@ -2485,13 +2484,13 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
/**
* mmap_region() - Actually perform the userland mapping of a VMA into
- * current->mm with known, aligned and overflow-checked @addr and @len, and
+ * mm with known, aligned and overflow-checked @addr and @len, and
* correctly determined VMA flags @vm_flags and page offset @pgoff.
*
* This is an internal memory management function, and should not be used
* directly.
*
- * The caller must write-lock current->mm->mmap_lock.
+ * The caller must write-lock mm->mmap_lock.
*
* @file: If a file-backed mapping, a pointer to the struct file describing the
* file to be mapped, otherwise NULL.
@@ -2508,12 +2507,12 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
*/
unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
- struct list_head *uf)
+ struct list_head *uf, struct mm_struct *mm)
{
unsigned long ret;
bool writable_file_mapping = false;
- mmap_assert_write_locked(current->mm);
+ mmap_assert_write_locked(mm);
/* Check to see if MDWE is applicable. */
if (map_deny_write_exec(vm_flags, vm_flags))
@@ -2532,13 +2531,13 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
writable_file_mapping = true;
}
- ret = __mmap_region(file, addr, len, vm_flags, pgoff, uf);
+ ret = __mmap_region(file, addr, len, vm_flags, pgoff, uf, mm);
/* Clear our write mapping regardless of error. */
if (writable_file_mapping)
mapping_unmap_writable(file->f_mapping);
- validate_mm(current->mm);
+ validate_mm(mm);
return ret;
}
diff --git a/mm/vma.h b/mm/vma.h
index a2e8710b8c47..e704f56577f3 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -243,7 +243,7 @@ void mm_drop_all_locks(struct mm_struct *mm);
unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
- struct list_head *uf);
+ struct list_head *uf, struct mm_struct *mm);
int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *brkvma,
unsigned long addr, unsigned long request, unsigned long flags);
--
2.43.5
* [PATCH 15/20] mm: pass the mm in vma_munmap_struct
From: Anthony Yznaga @ 2025-01-24 23:54 UTC (permalink / raw)
Allow unmap to work with an mshare host mm.
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
mm/vma.c | 10 ++++++----
mm/vma.h | 1 +
2 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/mm/vma.c b/mm/vma.c
index 28942701e301..60a37a9eb15e 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -1174,7 +1174,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
struct vm_area_struct *vma;
struct mm_struct *mm;
- mm = current->mm;
+ mm = vms->mm;
mm->map_count -= vms->vma_count;
mm->locked_vm -= vms->locked_vm;
if (vms->unlock)
@@ -1382,13 +1382,15 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
* @start: The aligned start address to munmap
* @end: The aligned end address to munmap
* @uf: The userfaultfd list_head
+ * @mm: The mm struct
* @unlock: Unlock after the operation. Only unlocked on success
*/
static void init_vma_munmap(struct vma_munmap_struct *vms,
struct vma_iterator *vmi, struct vm_area_struct *vma,
unsigned long start, unsigned long end, struct list_head *uf,
- bool unlock)
+ struct mm_struct *mm, bool unlock)
{
+ vms->mm = mm;
vms->vmi = vmi;
vms->vma = vma;
if (vma) {
@@ -1432,7 +1434,7 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
struct vma_munmap_struct vms;
int error;
- init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
+ init_vma_munmap(&vms, vmi, vma, start, end, uf, mm, unlock);
error = vms_gather_munmap_vmas(&vms, &mas_detach);
if (error)
goto gather_failed;
@@ -2229,7 +2231,7 @@ static int __mmap_prepare(struct mmap_state *map, struct list_head *uf)
/* Find the first overlapping VMA and initialise unmap state. */
vms->vma = vma_find(vmi, map->end);
- init_vma_munmap(vms, vmi, vms->vma, map->addr, map->end, uf,
+ init_vma_munmap(vms, vmi, vms->vma, map->addr, map->end, uf, map->mm,
/* unlock = */ false);
/* OK, we have overlapping VMAs - prepare to unmap them. */
diff --git a/mm/vma.h b/mm/vma.h
index e704f56577f3..03d69321312d 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -49,6 +49,7 @@ struct vma_munmap_struct {
unsigned long exec_vm;
unsigned long stack_vm;
unsigned long data_vm;
+ struct mm_struct *mm;
};
enum vma_merge_state {
--
2.43.5
* [PATCH 16/20] mshare: add MSHAREFS_CREATE_MAPPING
From: Anthony Yznaga @ 2025-01-24 23:54 UTC (permalink / raw)
Add an ioctl for mapping objects within an mshare region. The
arguments are the same as mmap(). Only shared anonymous memory
mapped with MAP_FIXED is supported initially.
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
include/uapi/linux/msharefs.h | 9 +++++
mm/mshare.c | 65 +++++++++++++++++++++++++++++++++++
2 files changed, 74 insertions(+)
diff --git a/include/uapi/linux/msharefs.h b/include/uapi/linux/msharefs.h
index c7b509c7e093..fea0afdf000d 100644
--- a/include/uapi/linux/msharefs.h
+++ b/include/uapi/linux/msharefs.h
@@ -20,10 +20,19 @@
*/
#define MSHAREFS_GET_SIZE _IOR('x', 0, struct mshare_info)
#define MSHAREFS_SET_SIZE _IOW('x', 1, struct mshare_info)
+#define MSHAREFS_CREATE_MAPPING _IOW('x', 2, struct mshare_create)
struct mshare_info {
__u64 start;
__u64 size;
};
+struct mshare_create {
+ __u64 addr;
+ __u64 size;
+ __u64 offset;
+ __u32 prot;
+ __u32 flags;
+ __u32 fd;
+};
#endif
diff --git a/mm/mshare.c b/mm/mshare.c
index 9ada1544aeb1..d70f10210b46 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -194,12 +194,60 @@ msharefs_set_size(struct mm_struct *host_mm, struct mshare_data *m_data,
return 0;
}
+static long
+msharefs_create_mapping(struct mm_struct *host_mm, struct mshare_data *m_data,
+ struct mshare_create *mcreate)
+{
+ unsigned long mshare_start, mshare_end;
+ unsigned long mapped_addr;
+ unsigned long populate = 0;
+ unsigned long addr = mcreate->addr;
+ unsigned long size = mcreate->size;
+ unsigned int fd = mcreate->fd;
+ int prot = mcreate->prot;
+ int flags = mcreate->flags;
+ vm_flags_t vm_flags;
+ int err = -EINVAL;
+
+ mshare_start = m_data->minfo.start;
+ mshare_end = mshare_start + m_data->minfo.size;
+
+ if ((addr < mshare_start) || (addr >= mshare_end) ||
+ (addr + size > mshare_end))
+ goto out;
+
+ /*
+ * Only anonymous shared memory at fixed addresses is allowed for now.
+ */
+ if ((flags & (MAP_SHARED | MAP_FIXED)) != (MAP_SHARED | MAP_FIXED))
+ goto out;
+ if (fd != -1)
+ goto out;
+
+ if (mmap_write_lock_killable(host_mm)) {
+ err = -EINTR;
+ goto out;
+ }
+
+ err = 0;
+ mapped_addr = __do_mmap(NULL, addr, size, prot, flags, vm_flags,
+ 0, &populate, NULL, host_mm);
+
+ if (IS_ERR_VALUE(mapped_addr))
+ err = (long)mapped_addr;
+
+ mmap_write_unlock(host_mm);
+out:
+ return err;
+}
+
static long
msharefs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
struct mshare_data *m_data = filp->private_data;
struct mm_struct *host_mm = m_data->mm;
struct mshare_info minfo;
+ struct mshare_create mcreate;
switch (cmd) {
case MSHAREFS_GET_SIZE:
@@ -228,6 +276,23 @@ msharefs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
return msharefs_set_size(host_mm, m_data, &minfo);
+ case MSHAREFS_CREATE_MAPPING:
+ if (copy_from_user(&mcreate, (struct mshare_create __user *)arg,
+ sizeof(mcreate)))
+ return -EFAULT;
+
+ /*
+ * validate mshare region
+ */
+ spin_lock(&m_data->m_lock);
+ if (m_data->minfo.size == 0) {
+ spin_unlock(&m_data->m_lock);
+ return -EINVAL;
+ }
+ spin_unlock(&m_data->m_lock);
+
+ return msharefs_create_mapping(host_mm, m_data, &mcreate);
+
default:
return -ENOTTY;
}
--
2.43.5
* [PATCH 17/20] mshare: add MSHAREFS_UNMAP
From: Anthony Yznaga @ 2025-01-24 23:54 UTC (permalink / raw)
Add an ioctl for unmapping objects in an mshare region.
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
include/uapi/linux/msharefs.h | 7 ++++++
mm/mshare.c | 44 +++++++++++++++++++++++++++++++++++
2 files changed, 51 insertions(+)
diff --git a/include/uapi/linux/msharefs.h b/include/uapi/linux/msharefs.h
index fea0afdf000d..f7af1f2b5ee7 100644
--- a/include/uapi/linux/msharefs.h
+++ b/include/uapi/linux/msharefs.h
@@ -21,6 +21,7 @@
#define MSHAREFS_GET_SIZE _IOR('x', 0, struct mshare_info)
#define MSHAREFS_SET_SIZE _IOW('x', 1, struct mshare_info)
#define MSHAREFS_CREATE_MAPPING _IOW('x', 2, struct mshare_create)
+#define MSHAREFS_UNMAP _IOW('x', 3, struct mshare_unmap)
struct mshare_info {
__u64 start;
@@ -35,4 +36,10 @@ struct mshare_create {
__u32 flags;
__u32 fd;
};
+
+struct mshare_unmap {
+ __u64 addr;
+ __u64 size;
+};
+
#endif
diff --git a/mm/mshare.c b/mm/mshare.c
index d70f10210b46..8f53b8132895 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -241,6 +241,32 @@ msharefs_create_mapping(struct mm_struct *host_mm, struct mshare_data *m_data,
return err;
}
+static long
+msharefs_unmap(struct mm_struct *host_mm, struct mshare_data *m_data,
+ struct mshare_unmap *m_unmap)
+{
+ unsigned long mshare_start, mshare_end;
+ unsigned long addr = m_unmap->addr;
+ unsigned long size = m_unmap->size;
+ int err;
+
+ mshare_start = m_data->minfo.start;
+ mshare_end = mshare_start + m_data->minfo.size;
+
+ if ((addr < mshare_start) || (addr >= mshare_end) ||
+ (addr + size > mshare_end))
+ return -EINVAL;
+
+ if (mmap_write_lock_killable(host_mm))
+ return -EINTR;
+
+ err = do_munmap(host_mm, addr, size, NULL);
+
+ mmap_write_unlock(host_mm);
+
+ return err;
+}
+
static long
msharefs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
@@ -248,6 +274,7 @@ msharefs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
struct mm_struct *host_mm = m_data->mm;
struct mshare_info minfo;
struct mshare_create mcreate;
+ struct mshare_unmap m_unmap;
switch (cmd) {
case MSHAREFS_GET_SIZE:
@@ -293,6 +320,23 @@ msharefs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
return msharefs_create_mapping(host_mm, m_data, &mcreate);
+ case MSHAREFS_UNMAP:
+ if (copy_from_user(&m_unmap, (struct mshare_unmap __user *)arg,
+ sizeof(m_unmap)))
+ return -EFAULT;
+
+ /*
+ * validate mshare region
+ */
+ spin_lock(&m_data->m_lock);
+ if (m_data->minfo.size == 0) {
+ spin_unlock(&m_data->m_lock);
+ return -EINVAL;
+ }
+ spin_unlock(&m_data->m_lock);
+
+ return msharefs_unmap(host_mm, m_data, &m_unmap);
+
default:
return -ENOTTY;
}
--
2.43.5
* [PATCH 18/20] mm/mshare: provide a way to identify an mm as an mshare host mm
From: Anthony Yznaga @ 2025-01-24 23:54 UTC (permalink / raw)
Add new mm flag, MMF_MSHARE.
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
include/linux/mm_types.h | 2 ++
mm/mshare.c | 1 +
2 files changed, 3 insertions(+)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5f1b2dc788e2..dfbeb50e4c9b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1642,6 +1642,8 @@ enum {
#define MMF_TOPDOWN 31 /* mm searches top down by default */
#define MMF_TOPDOWN_MASK (1 << MMF_TOPDOWN)
+#define MMF_MSHARE 32 /* mm is an mshare host mm */
+
#define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
diff --git a/mm/mshare.c b/mm/mshare.c
index 8f53b8132895..4c3f6c2410d6 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -365,6 +365,7 @@ msharefs_fill_mm(struct inode *inode)
goto err_free;
}
+ set_bit(MMF_MSHARE, &mm->flags);
mm->mmap_base = mm->task_size = 0;
m_data = kzalloc(sizeof(*m_data), GFP_KERNEL);
--
2.43.5
* [PATCH 19/20] mm/mshare: get memcg from current->mm instead of mshare mm
From: Anthony Yznaga @ 2025-01-24 23:54 UTC (permalink / raw)
Because handle_mm_fault() may operate on a vma from an mshare host mm,
the mm passed to the cgroup functions count_memcg_events_mm() and
get_mem_cgroup_from_mm() may be an mshare host mm. These functions find
a memcg by dereferencing mm->owner, which is set when an mm is allocated.
Since the task that created an mshare file may exit before the file is
deleted, use current->mm instead to find the memcg to update or charge
to.
This may not be the right solution but is hopefully a good starting
point. If charging should always go to a single memcg associated with
the mshare file, perhaps active_memcg could be used.
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
include/linux/memcontrol.h | 3 +++
mm/memcontrol.c | 3 ++-
mm/mshare.c | 3 +++
3 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6e74b8254d9b..e458ca80e833 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -987,6 +987,9 @@ static inline void count_memcg_events_mm(struct mm_struct *mm,
if (mem_cgroup_disabled())
return;
+ if (test_bit(MMF_MSHARE, &mm->flags))
+ mm = current->mm;
+
rcu_read_lock();
memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
if (likely(memcg))
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 46f8b372d212..ba6267615ee6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -938,7 +938,8 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
mm = current->mm;
if (unlikely(!mm))
return root_mem_cgroup;
- }
+ } else if (test_bit(MMF_MSHARE, &mm->flags))
+ mm = current->mm;
rcu_read_lock();
do {
diff --git a/mm/mshare.c b/mm/mshare.c
index 4c3f6c2410d6..5cc416cfd78c 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -381,6 +381,9 @@ msharefs_fill_mm(struct inode *inode)
if (ret)
goto err_free;
+#ifdef CONFIG_MEMCG
+ mm->owner = NULL;
+#endif
return 0;
err_free:
--
2.43.5
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH 20/20] mm/mshare: associate a mem cgroup with an mshare file
2025-01-24 23:54 [PATCH 00/20] Add support for shared PTEs across processes Anthony Yznaga
` (18 preceding siblings ...)
2025-01-24 23:54 ` [PATCH 19/20] mm/mshare: get memcg from current->mm instead of mshare mm Anthony Yznaga
@ 2025-01-24 23:54 ` Anthony Yznaga
2025-01-27 22:33 ` [PATCH 00/20] Add support for shared PTEs across processes Andrew Morton
` (3 subsequent siblings)
23 siblings, 0 replies; 37+ messages in thread
From: Anthony Yznaga @ 2025-01-24 23:54 UTC (permalink / raw)
To: akpm, willy, markhemm, viro, david, khalid
Cc: anthony.yznaga, jthoughton, corbet, dave.hansen, kirill, luto,
brauner, arnd, ebiederm, catalin.marinas, mingo, peterz,
liam.howlett, lorenzo.stoakes, vbabka, jannh, hannes, mhocko,
roman.gushchin, shakeel.butt, muchun.song, tglx, cgroups, x86,
linux-doc, linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
vasily.averin, xhao, pcc, neilb, maz
This patch shows one approach to associating a specific mem cgroup
with an mshare file and was inspired by code in mem_cgroup_sk_alloc().
Essentially when a process creates an mshare region, a reference is
taken on the mem cgroup that the process belongs to and a pointer to
the memcg is saved. At fault time set_active_memcg() is used to
temporarily enable charging of __GFP_ACCOUNT allocations to the saved
memcg. This does consolidate pagetable charges to a single memcg, but
there are issues to address such as how to handle the case where the
memcg is deleted but becomes a hidden, zombie memcg because the mshare
file has a reference to it.
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
---
arch/x86/mm/fault.c | 11 +++++++++++
include/linux/mm.h | 5 +++++
mm/mshare.c | 33 +++++++++++++++++++++++++++++++++
3 files changed, 49 insertions(+)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 4b55ade61a01..1b50417f68ad 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -21,6 +21,7 @@
#include <linux/mm_types.h>
#include <linux/mm.h> /* find_and_lock_vma() */
#include <linux/vmalloc.h>
+#include <linux/memcontrol.h>
#include <asm/cpufeature.h> /* boot_cpu_has, ... */
#include <asm/traps.h> /* dotraplinkage, ... */
@@ -1219,6 +1220,8 @@ void do_user_addr_fault(struct pt_regs *regs,
unsigned int flags = FAULT_FLAG_DEFAULT;
bool is_shared_vma;
unsigned long addr;
+ struct mem_cgroup *mshare_memcg;
+ struct mem_cgroup *memcg;
tsk = current;
mm = tsk->mm;
@@ -1375,6 +1378,8 @@ void do_user_addr_fault(struct pt_regs *regs,
}
if (unlikely(vma_is_mshare(vma))) {
+ mshare_memcg = get_mshare_memcg(vma);
+
fault = find_shared_vma(&vma, &addr);
if (fault) {
@@ -1402,6 +1407,9 @@ void do_user_addr_fault(struct pt_regs *regs,
return;
}
+ if (is_shared_vma && mshare_memcg)
+ memcg = set_active_memcg(mshare_memcg);
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
@@ -1417,6 +1425,9 @@ void do_user_addr_fault(struct pt_regs *regs,
*/
fault = handle_mm_fault(vma, addr, flags, regs);
+ if (is_shared_vma && mshare_memcg)
+ set_active_memcg(memcg);
+
if (unlikely(is_shared_vma) && ((fault & VM_FAULT_COMPLETED) ||
(fault & VM_FAULT_RETRY) || fault_signal_pending(fault, regs)))
mmap_read_unlock(mm);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80429d1a6ae4..eaa304d22a9d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1110,12 +1110,17 @@ static inline bool vma_is_anon_shmem(struct vm_area_struct *vma) { return false;
int vma_is_stack_for_current(struct vm_area_struct *vma);
#ifdef CONFIG_MSHARE
+struct mem_cgroup *get_mshare_memcg(struct vm_area_struct *vma);
vm_fault_t find_shared_vma(struct vm_area_struct **vma, unsigned long *addrp);
static inline bool vma_is_mshare(const struct vm_area_struct *vma)
{
return vma->vm_flags & VM_MSHARE;
}
#else
+static inline struct mem_cgroup *get_mshare_memcg(struct vm_area_struct *vma)
+{
+ return NULL;
+}
static inline vm_fault_t find_shared_vma(struct vm_area_struct **vma, unsigned long *addrp)
{
WARN_ON_ONCE(1);
diff --git a/mm/mshare.c b/mm/mshare.c
index 5cc416cfd78c..a56e56c90aaa 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -16,6 +16,7 @@
#include <linux/fs.h>
#include <linux/fs_context.h>
+#include <linux/memcontrol.h>
#include <linux/mman.h>
#include <linux/mmu_notifier.h>
#include <linux/spinlock_types.h>
@@ -30,8 +31,22 @@ struct mshare_data {
spinlock_t m_lock;
struct mshare_info minfo;
struct mmu_notifier mn;
+#ifdef CONFIG_MEMCG
+ struct mem_cgroup *memcg;
+#endif
};
+struct mem_cgroup *get_mshare_memcg(struct vm_area_struct *vma)
+{
+ struct mshare_data *m_data = vma->vm_private_data;
+
+#ifdef CONFIG_MEMCG
+ return m_data->memcg;
+#else
+ return NULL;
+#endif
+}
+
static void mshare_invalidate_tlbs(struct mmu_notifier *mn, struct mm_struct *mm,
unsigned long start, unsigned long end)
{
@@ -358,6 +373,9 @@ msharefs_fill_mm(struct inode *inode)
struct mm_struct *mm;
struct mshare_data *m_data = NULL;
int ret = 0;
+#ifdef CONFIG_MEMCG
+ struct mem_cgroup *memcg;
+#endif
mm = mm_alloc();
if (!mm) {
@@ -383,6 +401,17 @@ msharefs_fill_mm(struct inode *inode)
#ifdef CONFIG_MEMCG
mm->owner = NULL;
+
+ rcu_read_lock();
+ memcg = mem_cgroup_from_task(current);
+ if (mem_cgroup_is_root(memcg))
+ goto out;
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ goto out;
+ if (css_tryget(&memcg->css))
+ m_data->memcg = memcg;
+out:
+ rcu_read_unlock();
#endif
return 0;
@@ -396,6 +425,10 @@ msharefs_fill_mm(struct inode *inode)
static void
msharefs_delmm(struct mshare_data *m_data)
{
+#ifdef CONFIG_MEMCG
+ if (m_data->memcg)
+ css_put(&m_data->memcg->css);
+#endif
mmput(m_data->mm);
kfree(m_data);
}
--
2.43.5
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH 00/20] Add support for shared PTEs across processes
2025-01-24 23:54 [PATCH 00/20] Add support for shared PTEs across processes Anthony Yznaga
` (19 preceding siblings ...)
2025-01-24 23:54 ` [PATCH 20/20] mm/mshare: associate a mem cgroup with an mshare file Anthony Yznaga
@ 2025-01-27 22:33 ` Andrew Morton
2025-01-27 23:59 ` Anthony Yznaga
2025-01-28 9:21 ` David Hildenbrand
2025-01-28 7:11 ` Bagas Sanjaya
` (2 subsequent siblings)
23 siblings, 2 replies; 37+ messages in thread
From: Andrew Morton @ 2025-01-27 22:33 UTC (permalink / raw)
To: Anthony Yznaga
Cc: willy, markhemm, viro, david, khalid, jthoughton, corbet,
dave.hansen, kirill, luto, brauner, arnd, ebiederm,
catalin.marinas, mingo, peterz, liam.howlett, lorenzo.stoakes,
vbabka, jannh, hannes, mhocko, roman.gushchin, shakeel.butt,
muchun.song, tglx, cgroups, x86, linux-doc, linux-arch,
linux-kernel, linux-mm, mhiramat, rostedt, vasily.averin, xhao,
pcc, neilb, maz
On Fri, 24 Jan 2025 15:54:34 -0800 Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
> Memory pages shared between processes require page table entries
> (PTEs) for each process. Each of these PTEs consumes some
> memory, and as long as the number of mappings being maintained
> is small enough, the space consumed by page tables is not
> objectionable. When very few memory pages are shared between
> processes, the number of PTEs to maintain is mostly constrained by
> the number of pages of memory on the system. As the number of shared
> pages and the number of times pages are shared go up, the amount of
> memory consumed by page tables starts to become significant. This
> issue does not apply to threads: any number of threads can share the
> same pages inside a process while sharing the same PTEs. Extending
> this same model to sharing pages across processes can eliminate the
> issue for cross-process sharing as well.
>
> ...
>
> API
> ===
>
> mshare does not introduce a new API. It instead uses existing APIs
> to implement page table sharing. The steps to use this feature are:
>
> 1. Mount msharefs on /sys/fs/mshare -
> mount -t msharefs msharefs /sys/fs/mshare
>
> 2. mshare regions have alignment and size requirements. Start
> address for the region must be aligned to an address boundary and
> be a multiple of fixed size. This alignment and size requirement
> can be obtained by reading the file /sys/fs/mshare/mshare_info
> which returns a number in text format. mshare regions must be
> aligned to this boundary and be a multiple of this size.
>
> 3. For the process creating an mshare region:
> a. Create a file on /sys/fs/mshare, for example -
> fd = open("/sys/fs/mshare/shareme",
> O_RDWR|O_CREAT|O_EXCL, 0600);
>
> b. Establish the starting address and size of the region
> struct mshare_info minfo;
>
> minfo.start = TB(2);
> minfo.size = BUFFER_SIZE;
> ioctl(fd, MSHAREFS_SET_SIZE, &minfo)
>
> c. Map some memory in the region
> struct mshare_create mcreate;
>
> mcreate.addr = TB(2);
> mcreate.size = BUFFER_SIZE;
> mcreate.offset = 0;
> mcreate.prot = PROT_READ | PROT_WRITE;
> mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
> mcreate.fd = -1;
>
> ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate)
I'm not really understanding why step a exists. It's basically an
mmap() so why can't this be done within step d?
> d. Map the mshare region into the process
> mmap((void *)TB(2), BUF_SIZE, PROT_READ | PROT_WRITE,
> MAP_SHARED, fd, 0);
>
> e. Write and read to mshared region normally.
>
> 4. For processes attaching an mshare region:
> a. Open the file on msharefs, for example -
> fd = open("/sys/fs/mshare/shareme", O_RDWR);
>
> b. Get information about mshare'd region from the file:
> struct mshare_info minfo;
>
> ioctl(fd, MSHAREFS_GET_SIZE, &minfo);
>
> c. Map the mshare'd region into the process
> mmap(minfo.start, minfo.size,
> PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>
> 5. To delete the mshare region -
> unlink("/sys/fs/mshare/shareme");
>
The userspace interface is the thing we should initially consider. I'm
having ancient memories of hugetlbfs. Over time it was seen that
hugetlbfs was too standalone and huge pages became more (and more (and
more (and more))) integrated into regular MM code. Can we expect a
similar evolution with pte-shared memory and if so, is this the correct
interface to be starting out with?
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 00/20] Add support for shared PTEs across processes
2025-01-27 22:33 ` [PATCH 00/20] Add support for shared PTEs across processes Andrew Morton
@ 2025-01-27 23:59 ` Anthony Yznaga
2025-01-28 9:21 ` David Hildenbrand
1 sibling, 0 replies; 37+ messages in thread
From: Anthony Yznaga @ 2025-01-27 23:59 UTC (permalink / raw)
To: Andrew Morton
Cc: willy, markhemm, viro, david, khalid, jthoughton, corbet,
dave.hansen, kirill, luto, brauner, arnd, ebiederm,
catalin.marinas, mingo, peterz, liam.howlett, lorenzo.stoakes,
vbabka, jannh, hannes, mhocko, roman.gushchin, shakeel.butt,
muchun.song, tglx, cgroups, x86, linux-doc, linux-arch,
linux-kernel, linux-mm, mhiramat, rostedt, vasily.averin, xhao,
pcc, neilb, maz
On 1/27/25 2:33 PM, Andrew Morton wrote:
> On Fri, 24 Jan 2025 15:54:34 -0800 Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>
>> Memory pages shared between processes require page table entries
>> (PTEs) for each process. Each of these PTEs consumes some
>> memory, and as long as the number of mappings being maintained
>> is small enough, the space consumed by page tables is not
>> objectionable. When very few memory pages are shared between
>> processes, the number of PTEs to maintain is mostly constrained by
>> the number of pages of memory on the system. As the number of shared
>> pages and the number of times pages are shared go up, the amount of
>> memory consumed by page tables starts to become significant. This
>> issue does not apply to threads: any number of threads can share the
>> same pages inside a process while sharing the same PTEs. Extending
>> this same model to sharing pages across processes can eliminate the
>> issue for cross-process sharing as well.
>>
>> ...
>>
>> API
>> ===
>>
>> mshare does not introduce a new API. It instead uses existing APIs
>> to implement page table sharing. The steps to use this feature are:
>>
>> 1. Mount msharefs on /sys/fs/mshare -
>> mount -t msharefs msharefs /sys/fs/mshare
>>
>> 2. mshare regions have alignment and size requirements. Start
>> address for the region must be aligned to an address boundary and
>> be a multiple of fixed size. This alignment and size requirement
>> can be obtained by reading the file /sys/fs/mshare/mshare_info
>> which returns a number in text format. mshare regions must be
>> aligned to this boundary and be a multiple of this size.
>>
>> 3. For the process creating an mshare region:
>> a. Create a file on /sys/fs/mshare, for example -
>> fd = open("/sys/fs/mshare/shareme",
>> O_RDWR|O_CREAT|O_EXCL, 0600);
>>
>> b. Establish the starting address and size of the region
>> struct mshare_info minfo;
>>
>> minfo.start = TB(2);
>> minfo.size = BUFFER_SIZE;
>> ioctl(fd, MSHAREFS_SET_SIZE, &minfo)
>>
>> c. Map some memory in the region
>> struct mshare_create mcreate;
>>
>> mcreate.addr = TB(2);
>> mcreate.size = BUFFER_SIZE;
>> mcreate.offset = 0;
>> mcreate.prot = PROT_READ | PROT_WRITE;
>> mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
>> mcreate.fd = -1;
>>
>> ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate)
> I'm not really understanding why step a exists. It's basically an
> mmap() so why can't this be done within step d?
One way to think of it is that step d establishes a window to the mshare
region and the objects mapped within it.
Discussions on earlier iterations of mshare pushed back strongly on
introducing special casing in the mmap path to redirect mmaps that fell
within an mshare region to map into an mshare mm. Even then it gets
messier for munmap, i.e., does an unmap of the whole range mean unmap the
window or unmap the objects within it?
>
>> d. Map the mshare region into the process
>> mmap((void *)TB(2), BUF_SIZE, PROT_READ | PROT_WRITE,
>> MAP_SHARED, fd, 0);
>>
>> e. Write and read to mshared region normally.
>>
>> 4. For processes attaching an mshare region:
>> a. Open the file on msharefs, for example -
>> fd = open("/sys/fs/mshare/shareme", O_RDWR);
>>
>> b. Get information about mshare'd region from the file:
>> struct mshare_info minfo;
>>
>> ioctl(fd, MSHAREFS_GET_SIZE, &minfo);
>>
>> c. Map the mshare'd region into the process
>> mmap(minfo.start, minfo.size,
>> PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>>
>> 5. To delete the mshare region -
>> unlink("/sys/fs/mshare/shareme");
>>
> The userspace interface is the thing we should initially consider. I'm
> having ancient memories of hugetlbfs. Over time it was seen that
> hugetlbfs was too standalone and huge pages became more (and more (and
> more (and more))) integrated into regular MM code. Can we expect a
> similar evolution with pte-shared memory and if so, is this the correct
> interface to be starting out with?
I don't know. This is an approach that has been refined through a number
of discussions, but I'm certainly open to alternatives.
Anthony
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 00/20] Add support for shared PTEs across processes
2025-01-27 22:33 ` [PATCH 00/20] Add support for shared PTEs across processes Andrew Morton
2025-01-27 23:59 ` Anthony Yznaga
@ 2025-01-28 9:21 ` David Hildenbrand
1 sibling, 0 replies; 37+ messages in thread
From: David Hildenbrand @ 2025-01-28 9:21 UTC (permalink / raw)
To: Andrew Morton, Anthony Yznaga
Cc: willy, markhemm, viro, khalid, jthoughton, corbet, dave.hansen,
kirill, luto, brauner, arnd, ebiederm, catalin.marinas, mingo,
peterz, liam.howlett, lorenzo.stoakes, vbabka, jannh, hannes,
mhocko, roman.gushchin, shakeel.butt, muchun.song, tglx, cgroups,
x86, linux-doc, linux-arch, linux-kernel, linux-mm, mhiramat,
rostedt, vasily.averin, xhao, pcc, neilb, maz
On 27.01.25 23:33, Andrew Morton wrote:
> On Fri, 24 Jan 2025 15:54:34 -0800 Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>
>> Memory pages shared between processes require page table entries
>> (PTEs) for each process. Each of these PTEs consumes some
>> memory, and as long as the number of mappings being maintained
>> is small enough, the space consumed by page tables is not
>> objectionable. When very few memory pages are shared between
>> processes, the number of PTEs to maintain is mostly constrained by
>> the number of pages of memory on the system. As the number of shared
>> pages and the number of times pages are shared go up, the amount of
>> memory consumed by page tables starts to become significant. This
>> issue does not apply to threads: any number of threads can share the
>> same pages inside a process while sharing the same PTEs. Extending
>> this same model to sharing pages across processes can eliminate the
>> issue for cross-process sharing as well.
>>
>> ...
>>
>> API
>> ===
>>
>> mshare does not introduce a new API. It instead uses existing APIs
>> to implement page table sharing. The steps to use this feature are:
>>
>> 1. Mount msharefs on /sys/fs/mshare -
>> mount -t msharefs msharefs /sys/fs/mshare
>>
>> 2. mshare regions have alignment and size requirements. Start
>> address for the region must be aligned to an address boundary and
>> be a multiple of fixed size. This alignment and size requirement
>> can be obtained by reading the file /sys/fs/mshare/mshare_info
>> which returns a number in text format. mshare regions must be
>> aligned to this boundary and be a multiple of this size.
>>
>> 3. For the process creating an mshare region:
>> a. Create a file on /sys/fs/mshare, for example -
>> fd = open("/sys/fs/mshare/shareme",
>> O_RDWR|O_CREAT|O_EXCL, 0600);
>>
>> b. Establish the starting address and size of the region
>> struct mshare_info minfo;
>>
>> minfo.start = TB(2);
>> minfo.size = BUFFER_SIZE;
>> ioctl(fd, MSHAREFS_SET_SIZE, &minfo)
>>
>> c. Map some memory in the region
>> struct mshare_create mcreate;
>>
>> mcreate.addr = TB(2);
>> mcreate.size = BUFFER_SIZE;
>> mcreate.offset = 0;
>> mcreate.prot = PROT_READ | PROT_WRITE;
>> mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
>> mcreate.fd = -1;
>>
>> ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate)
>
> I'm not really understanding why step a exists. It's basically an
> mmap() so why can't this be done within step d?
Conceptually, it's defining the content of the virtual file by creating,
unmapping, and changing mappings. Some applications will require
multiple different mappings in such a virtual file.
Processes mmap the resulting virtual file.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 00/20] Add support for shared PTEs across processes
2025-01-24 23:54 [PATCH 00/20] Add support for shared PTEs across processes Anthony Yznaga
` (20 preceding siblings ...)
2025-01-27 22:33 ` [PATCH 00/20] Add support for shared PTEs across processes Andrew Morton
@ 2025-01-28 7:11 ` Bagas Sanjaya
2025-01-28 19:53 ` Anthony Yznaga
2025-01-28 9:36 ` David Hildenbrand
2025-01-29 0:11 ` Andrew Morton
23 siblings, 1 reply; 37+ messages in thread
From: Bagas Sanjaya @ 2025-01-28 7:11 UTC (permalink / raw)
To: Anthony Yznaga, akpm, willy, markhemm, viro, david, khalid
Cc: jthoughton, corbet, dave.hansen, kirill, luto, brauner, arnd,
ebiederm, catalin.marinas, mingo, peterz, liam.howlett,
lorenzo.stoakes, vbabka, jannh, hannes, mhocko, roman.gushchin,
shakeel.butt, muchun.song, tglx, cgroups, x86, linux-doc,
linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
vasily.averin, xhao, pcc, neilb, maz
On Fri, Jan 24, 2025 at 03:54:34PM -0800, Anthony Yznaga wrote:
> v1:
> - Based on mm-unstable mm-hotfixes-stable-2025-01-16-21-11
Seems like I can't cleanly apply this series on the aforementioned tag.
Can you give me the exact base commit?
Confused...
--
An old man doll... just what I always wanted! - Clara
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 00/20] Add support for shared PTEs across processes
2025-01-28 7:11 ` Bagas Sanjaya
@ 2025-01-28 19:53 ` Anthony Yznaga
0 siblings, 0 replies; 37+ messages in thread
From: Anthony Yznaga @ 2025-01-28 19:53 UTC (permalink / raw)
To: Bagas Sanjaya, akpm, willy, markhemm, viro, david, khalid
Cc: jthoughton, corbet, dave.hansen, kirill, luto, brauner, arnd,
ebiederm, catalin.marinas, mingo, peterz, liam.howlett,
lorenzo.stoakes, vbabka, jannh, hannes, mhocko, roman.gushchin,
shakeel.butt, muchun.song, tglx, cgroups, x86, linux-doc,
linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
vasily.averin, xhao, pcc, neilb, maz
On 1/27/25 11:11 PM, Bagas Sanjaya wrote:
> On Fri, Jan 24, 2025 at 03:54:34PM -0800, Anthony Yznaga wrote:
>> v1:
>> - Based on mm-unstable mm-hotfixes-stable-2025-01-16-21-11
> Seems like I can't cleanly apply this series on the aforementioned tag.
> Can you give me the exact base commit?
>
> Confused...
>
Hmm, maybe I goofed something. Last commit was:
103978aab801 mm/compaction: fix UBSAN shift-out-of-bounds warning
Anthony
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 00/20] Add support for shared PTEs across processes
2025-01-24 23:54 [PATCH 00/20] Add support for shared PTEs across processes Anthony Yznaga
` (21 preceding siblings ...)
2025-01-28 7:11 ` Bagas Sanjaya
@ 2025-01-28 9:36 ` David Hildenbrand
2025-01-28 19:40 ` Anthony Yznaga
2025-01-29 0:11 ` Andrew Morton
23 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2025-01-28 9:36 UTC (permalink / raw)
To: Anthony Yznaga, akpm, willy, markhemm, viro, khalid
Cc: jthoughton, corbet, dave.hansen, kirill, luto, brauner, arnd,
ebiederm, catalin.marinas, mingo, peterz, liam.howlett,
lorenzo.stoakes, vbabka, jannh, hannes, mhocko, roman.gushchin,
shakeel.butt, muchun.song, tglx, cgroups, x86, linux-doc,
linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
vasily.averin, xhao, pcc, neilb, maz
> API
> ===
>
> mshare does not introduce a new API. It instead uses existing APIs
> to implement page table sharing. The steps to use this feature are:
>
> 1. Mount msharefs on /sys/fs/mshare -
> mount -t msharefs msharefs /sys/fs/mshare
>
> 2. mshare regions have alignment and size requirements. Start
> address for the region must be aligned to an address boundary and
> be a multiple of fixed size. This alignment and size requirement
> can be obtained by reading the file /sys/fs/mshare/mshare_info
> which returns a number in text format. mshare regions must be
> aligned to this boundary and be a multiple of this size.
>
> 3. For the process creating an mshare region:
> a. Create a file on /sys/fs/mshare, for example -
> fd = open("/sys/fs/mshare/shareme",
> O_RDWR|O_CREAT|O_EXCL, 0600);
>
> b. Establish the starting address and size of the region
> struct mshare_info minfo;
>
> minfo.start = TB(2);
> minfo.size = BUFFER_SIZE;
> ioctl(fd, MSHAREFS_SET_SIZE, &minfo)
We could set the size using ftruncate, just like for any other file. It
would have to be the first thing after creating the file, and before we
allow any other modifications.
Ideally, we'd be able to get rid of the "start", use something reasonable
(e.g., TB(2)) internally, and allow processes to mmap() it at different
(suitably-aligned) addresses.
I recall we discussed that in the past. Did you stumble over real
blockers such that we really must mmap() the file at the same address in
all processes? I recall some things around TLB flushing, but not sure.
So we might have to stick to an mmap address for now.
When using fallocate/stat to set/query the file size, we could end up with:
/*
* Set the address where this file can be mapped into processes. Other
* addresses are not supported for now, and mmap will fail. Changing the
* mmap address after mappings were already created is not supported.
*/
MSHAREFS_SET_MMAP_ADDRESS
MSHAREFS_GET_MMAP_ADDRESS
>
> c. Map some memory in the region
> struct mshare_create mcreate;
>
> mcreate.addr = TB(2);
Can we use the offset into the virtual file instead? We should be able
to perform that translation internally fairly easily I assume.
> mcreate.size = BUFFER_SIZE;
> mcreate.offset = 0;
> mcreate.prot = PROT_READ | PROT_WRITE;
> mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
> mcreate.fd = -1;
>
> ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate)
Would examples with multiple mappings work already in this version?
Did you experiment with other mappings (e.g., ordinary shared file
mappings), and what are the blockers to make that fly?
>
> d. Map the mshare region into the process
> mmap((void *)TB(2), BUF_SIZE, PROT_READ | PROT_WRITE,
> MAP_SHARED, fd, 0);
>
> e. Write and read to mshared region normally.
>
> 4. For processes attaching an mshare region:
> a. Open the file on msharefs, for example -
> fd = open("/sys/fs/mshare/shareme", O_RDWR);
>
> b. Get information about mshare'd region from the file:
> struct mshare_info minfo;
>
> ioctl(fd, MSHAREFS_GET_SIZE, &minfo);
>
> c. Map the mshare'd region into the process
> mmap(minfo.start, minfo.size,
> PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>
> 5. To delete the mshare region -
> unlink("/sys/fs/mshare/shareme");
>
I recall discussions around cgroup accounting, OOM handling etc. I
thought the conclusion was that we need an "mshare process" where the
memory is accounted to, and once that process is killed (e.g., OOM), it
must tear down all mappings/pages etc.
How does your design currently look like in that regard? E.g., how can
OOM handling make progress, how is cgroup accounting handled?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 00/20] Add support for shared PTEs across processes
2025-01-28 9:36 ` David Hildenbrand
@ 2025-01-28 19:40 ` Anthony Yznaga
0 siblings, 0 replies; 37+ messages in thread
From: Anthony Yznaga @ 2025-01-28 19:40 UTC (permalink / raw)
To: David Hildenbrand, akpm, willy, markhemm, viro, khalid
Cc: jthoughton, corbet, dave.hansen, kirill, luto, brauner, arnd,
ebiederm, catalin.marinas, mingo, peterz, liam.howlett,
lorenzo.stoakes, vbabka, jannh, hannes, mhocko, roman.gushchin,
shakeel.butt, muchun.song, tglx, cgroups, x86, linux-doc,
linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
vasily.averin, xhao, pcc, neilb, maz
On 1/28/25 1:36 AM, David Hildenbrand wrote:
>> API
>> ===
>>
>> mshare does not introduce a new API. It instead uses existing APIs
>> to implement page table sharing. The steps to use this feature are:
>>
>> 1. Mount msharefs on /sys/fs/mshare -
>> mount -t msharefs msharefs /sys/fs/mshare
>>
>> 2. mshare regions have alignment and size requirements. Start
>> address for the region must be aligned to an address boundary and
>> be a multiple of fixed size. This alignment and size requirement
>> can be obtained by reading the file /sys/fs/mshare/mshare_info
>> which returns a number in text format. mshare regions must be
>> aligned to this boundary and be a multiple of this size.
>>
>> 3. For the process creating an mshare region:
>> a. Create a file on /sys/fs/mshare, for example -
>> fd = open("/sys/fs/mshare/shareme",
>> O_RDWR|O_CREAT|O_EXCL, 0600);
>>
>> b. Establish the starting address and size of the region
>> struct mshare_info minfo;
>>
>> minfo.start = TB(2);
>> minfo.size = BUFFER_SIZE;
>> ioctl(fd, MSHAREFS_SET_SIZE, &minfo)
>
> We could set the size using ftruncate, just like for any other file.
> It would have to be the first thing after creating the file, and
> before we allow any other modifications.
I'll look into this.
>
> Ideally, we'd be able to get rid of the "start", use something
> reasonable (e.g., TB(2)) internally, and allow processes to mmap() it
> at different (suitably-aligned) addresses.
>
> I recall we discussed that in the past. Did you stumble over real
> blockers such that we really must mmap() the file at the same address
> in all processes? I recall some things around TLB flushing, but not
> sure. So we might have to stick to an mmap address for now.
It's not hard to implement this. It does have the effect that rmap walks
will find the internal VA rather than the actual VA for a given process.
For TLB flushing this isn't a problem for the current implementation
because all TLBs are flushed entirely. I don't know if there might be
other complications. It does mean that an offset rather than address
should be used when creating a mapping as you point out below.
>
> When using fallocate/stat to set/query the file size, we could end up
> with:
>
> /*
> * Set the address where this file can be mapped into processes. Other
> * addresses are not supported for now, and mmap will fail. Changing the
> * mmap address after mappings were already created is not supported.
> */
> MSHAREFS_SET_MMAP_ADDRESS
> MSHAREFS_GET_MMAP_ADDRESS
I'll look into this, too.
>
>
>>
>> c. Map some memory in the region
>> struct mshare_create mcreate;
>>
>> mcreate.addr = TB(2);
>
> Can we use the offset into the virtual file instead? We should be able
> to perform that translation internally fairly easily I assume.
Yes, an offset would be preferable, especially if mapping the same file
at different VAs is implemented.
>
>> mcreate.size = BUFFER_SIZE;
>> mcreate.offset = 0;
>> mcreate.prot = PROT_READ | PROT_WRITE;
>> mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
>> mcreate.fd = -1;
>>
>> ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate)
>
> Would examples with multiple mappings work already in this version?
>
> Did you experiment with other mappings (e.g., ordinary shared file
> mappings), and what are the blockers to make that fly?
Yes, multiple mappings work. And it's straightforward to make shared
file mappings work. I have a patch where I basically just copied code
from ksys_mmap_pgoff() into msharefs_create_mapping(). Needs some
refactoring and finessing to make it a real patch.
>
>>
>> d. Map the mshare region into the process
>> mmap((void *)TB(2), BUF_SIZE, PROT_READ | PROT_WRITE,
>> MAP_SHARED, fd, 0);
>>
>> e. Write and read to mshared region normally.
>>
>> 4. For processes attaching an mshare region:
>> a. Open the file on msharefs, for example -
>> fd = open("/sys/fs/mshare/shareme", O_RDWR);
>>
>> b. Get information about mshare'd region from the file:
>> struct mshare_info minfo;
>>
>> ioctl(fd, MSHAREFS_GET_SIZE, &minfo);
>>
>> c. Map the mshare'd region into the process
>> mmap(minfo.start, minfo.size,
>> PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>>
>> 5. To delete the mshare region -
>> unlink("/sys/fs/mshare/shareme");
>>
>
> I recall discussions around cgroup accounting, OOM handling etc. I
> thought the conclusion was that we need an "mshare process" where the
> memory is accounted to, and once that process is killed (e.g., OOM),
> it must tear down all mappings/pages etc.
>
> What does your design currently look like in that regard? E.g., how can
> OOM handling make progress, and how is cgroup accounting handled?
There was some discussion on this at last year's LSF/MM, but it seemed
more like ideas than a conclusion on an approach. In any case, tearing
down everything when an owning process is killed does not work for our
internal use cases, and I think that was mentioned somewhere in the
discussions. Plus, yanking the mappings away from unsuspecting non-owner
processes could be quite catastrophic. Shouldn't an mshare virtual file
be treated like any other in-memory file? Or do such files get zapped
somehow by the OOM killer? I'm not saying we shouldn't do anything for
OOM, but I'm not sure what the answer is.
Cgroups are tricky. At the mm alignment meeting last year a use case was
brought up where it would be desirable to have all pagetable pages
charged to one memcg rather than have them charged on a first-touch
basis. It was proposed that perhaps an mshare file could be associated
with a cgroup at the time it is created. I have figured out a way to do
this, but I'm not versed enough in cgroups to know if the approach is
viable. The last three patches provide this functionality as well as
functionality that ensures a newly faulted-in page is charged to the
current process. If everything, pagetable pages and faulted pages alike,
should be charged to the same cgroup, then more work is definitely
required.
Hopefully this provides enough context to move towards a complete solution.
Anthony
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH 00/20] Add support for shared PTEs across processes
2025-01-24 23:54 [PATCH 00/20] Add support for shared PTEs across processes Anthony Yznaga
` (22 preceding siblings ...)
2025-01-28 9:36 ` David Hildenbrand
@ 2025-01-29 0:11 ` Andrew Morton
2025-01-29 0:25 ` Anthony Yznaga
23 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2025-01-29 0:11 UTC (permalink / raw)
To: Anthony Yznaga
Cc: willy, markhemm, viro, david, khalid, jthoughton, corbet,
dave.hansen, kirill, luto, brauner, arnd, ebiederm,
catalin.marinas, mingo, peterz, liam.howlett, lorenzo.stoakes,
vbabka, jannh, hannes, mhocko, roman.gushchin, shakeel.butt,
muchun.song, tglx, cgroups, x86, linux-doc, linux-arch,
linux-kernel, linux-mm, mhiramat, rostedt, vasily.averin, xhao,
pcc, neilb, maz
On Fri, 24 Jan 2025 15:54:34 -0800 Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
> Some of the field deployments commonly see memory pages shared
> across 1000s of processes. On x86_64, each page requires a PTE that
> is 8 bytes long which is very small compared to the 4K page
> size.
Dumb question: why aren't these applications using huge pages?
* Re: [PATCH 00/20] Add support for shared PTEs across processes
2025-01-29 0:11 ` Andrew Morton
@ 2025-01-29 0:25 ` Anthony Yznaga
2025-01-29 0:59 ` Matthew Wilcox
0 siblings, 1 reply; 37+ messages in thread
From: Anthony Yznaga @ 2025-01-29 0:25 UTC (permalink / raw)
To: Andrew Morton
Cc: willy, markhemm, viro, david, khalid, jthoughton, corbet,
dave.hansen, kirill, luto, brauner, arnd, ebiederm,
catalin.marinas, mingo, peterz, liam.howlett, lorenzo.stoakes,
vbabka, jannh, hannes, mhocko, roman.gushchin, shakeel.butt,
muchun.song, tglx, cgroups, x86, linux-doc, linux-arch,
linux-kernel, linux-mm, mhiramat, rostedt, vasily.averin, xhao,
pcc, neilb, maz
On 1/28/25 4:11 PM, Andrew Morton wrote:
> On Fri, 24 Jan 2025 15:54:34 -0800 Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>
>> Some of the field deployments commonly see memory pages shared
>> across 1000s of processes. On x86_64, each page requires a PTE that
>> is 8 bytes long which is very small compared to the 4K page
>> size.
> Dumb question: why aren't these applications using huge pages?
>
They often are using hugetlbfs but would also benefit from having page
tables shared for other kinds of memory such as shmem, tmpfs or dax.
* Re: [PATCH 00/20] Add support for shared PTEs across processes
2025-01-29 0:25 ` Anthony Yznaga
@ 2025-01-29 0:59 ` Matthew Wilcox
0 siblings, 0 replies; 37+ messages in thread
From: Matthew Wilcox @ 2025-01-29 0:59 UTC (permalink / raw)
To: Anthony Yznaga
Cc: Andrew Morton, markhemm, viro, david, khalid, jthoughton, corbet,
dave.hansen, kirill, luto, brauner, arnd, ebiederm,
catalin.marinas, mingo, peterz, liam.howlett, lorenzo.stoakes,
vbabka, jannh, hannes, mhocko, roman.gushchin, shakeel.butt,
muchun.song, tglx, cgroups, x86, linux-doc, linux-arch,
linux-kernel, linux-mm, mhiramat, rostedt, vasily.averin, xhao,
pcc, neilb, maz
On Tue, Jan 28, 2025 at 04:25:22PM -0800, Anthony Yznaga wrote:
>
> On 1/28/25 4:11 PM, Andrew Morton wrote:
> > On Fri, 24 Jan 2025 15:54:34 -0800 Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
> >
> > > Some of the field deployments commonly see memory pages shared
> > > across 1000s of processes. On x86_64, each page requires a PTE that
> > > is 8 bytes long which is very small compared to the 4K page
> > > size.
> > Dumb question: why aren't these applications using huge pages?
> >
> They often are using hugetlbfs but would also benefit from having page
> tables shared for other kinds of memory such as shmem, tmpfs or dax.
... and the implementation of PMD sharing in hugetlbfs is horrible. In
addition to inverting the locking order (see gigantic comment in rmap.c),
the semantics aren't what the Oracle DB wants, and it's inefficient.
So when we were looking at implementing page table sharing for DAX, we
examined _and rejected_ porting the hugetlbfs approach. We've discussed
this extensively at the last three LSFMM sessions where mshare has been
a topic, and in previous submissions of mshare. So seeing the question
being asked yet again is disheartening.