From: David Hildenbrand <david@redhat.com>
To: Pedro Falcato <pfalcato@suse.de>,
Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
andreyknvl@gmail.com, arnd@arndb.de, bp@alien8.de,
brauner@kernel.org, bsegall@google.com, corbet@lwn.net,
dave.hansen@linux.intel.com, dietmar.eggemann@arm.com,
ebiederm@xmission.com, hpa@zytor.com, jakub.wartak@mailbox.org,
jannh@google.com, juri.lelli@redhat.com, khalid@kernel.org,
liam.howlett@oracle.com, linyongting@bytedance.com,
lorenzo.stoakes@oracle.com, luto@kernel.org,
markhemm@googlemail.com, maz@kernel.org, mhiramat@kernel.org,
mgorman@suse.de, mhocko@suse.com, mingo@redhat.com,
muchun.song@linux.dev, neilb@suse.de, osalvador@suse.de,
pcc@google.com, peterz@infradead.org, rostedt@goodmis.org,
rppt@kernel.org, shakeel.butt@linux.dev, surenb@google.com,
tglx@linutronix.de, vasily.averin@linux.dev, vbabka@suse.cz,
vincent.guittot@linaro.org, viro@zeniv.linux.org.uk,
vschneid@redhat.com, willy@infradead.org, x86@kernel.org,
xhao@linux.alibaba.com, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
Subject: Re: [PATCH v3 01/22] mm: Add msharefs filesystem
Date: Wed, 10 Sep 2025 14:46:45 +0200
Message-ID: <e61c1029-d760-4c04-acfb-55bc0af88e88@redhat.com>
In-Reply-To: <do7cmy4eiiqd5ux62r3u2ghizc62ljg5m3mqx7qzy3im4kc2p6@upmigdbp7eat>

On 10.09.25 14:14, Pedro Falcato wrote:
> On Tue, Aug 19, 2025 at 06:03:54PM -0700, Anthony Yznaga wrote:
>> From: Khalid Aziz <khalid@kernel.org>
>>
>> Add a pseudo filesystem that contains files and page table sharing
>> information that enables processes to share page table entries.
>> This patch adds the basic filesystem that can be mounted, a
>> CONFIG_MSHARE option to enable the feature, and documentation.
>>
>> Signed-off-by: Khalid Aziz <khalid@kernel.org>
>> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
>> ---
>> Documentation/filesystems/index.rst | 1 +
>> Documentation/filesystems/msharefs.rst | 96 +++++++++++++++++++++++++
>> include/uapi/linux/magic.h | 1 +
>> mm/Kconfig | 11 +++
>> mm/Makefile | 4 ++
>> mm/mshare.c | 97 ++++++++++++++++++++++++++
>> 6 files changed, 210 insertions(+)
>> create mode 100644 Documentation/filesystems/msharefs.rst
>> create mode 100644 mm/mshare.c
>>
>> diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
>> index 11a599387266..dcd6605eb228 100644
>> --- a/Documentation/filesystems/index.rst
>> +++ b/Documentation/filesystems/index.rst
>> @@ -102,6 +102,7 @@ Documentation for filesystem implementations.
>> fuse-passthrough
>> inotify
>> isofs
>> + msharefs
>> nilfs2
>> nfs/index
>> ntfs3
>> diff --git a/Documentation/filesystems/msharefs.rst b/Documentation/filesystems/msharefs.rst
>> new file mode 100644
>> index 000000000000..3e5b7d531821
>> --- /dev/null
>> +++ b/Documentation/filesystems/msharefs.rst
>> @@ -0,0 +1,96 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +=====================================================
>> +Msharefs - A filesystem to support shared page tables
>> +=====================================================
>> +
>> +What is msharefs?
>> +-----------------
>> +
>> +msharefs is a pseudo filesystem that allows multiple processes to
>> +share page table entries for shared pages. To enable support for
>> +msharefs the kernel must be compiled with CONFIG_MSHARE set.
>> +
>> +msharefs is typically mounted like this::
>> +
>> + mount -t msharefs none /sys/fs/mshare
>> +
>> +A file created on msharefs creates a new shared region where all
>> +processes mapping that region will map it using shared page table
>> +entries. Once the size of the region has been established via
>> +ftruncate() or fallocate(), the region can be mapped into processes
>> +and ioctls used to map and unmap objects within it. Note that an
>> +msharefs file is a control file and accessing mapped objects within
>> +a shared region through read or write of the file is not permitted.
>> +
>
> Welp. I really really don't like this API.
> I assume this has been discussed previously, but why do we need a new
> magical pseudofs mounted under some random /sys directory?
>
> But, ok, assuming we're thinking about something hugetlbfs like, that's not too
> bad, and programs already know how to use it.
>
>> +How to use mshare
>> +-----------------
>> +
>> +Here are the basic steps for using mshare:
>> +
>> + 1. Mount msharefs on /sys/fs/mshare::
>> +
>> + mount -t msharefs msharefs /sys/fs/mshare
>> +
>> + 2. mshare regions have alignment and size requirements. Start
>> + address for the region must be aligned to an address boundary and
>> + be a multiple of fixed size. This alignment and size requirement
>> + can be obtained by reading the file ``/sys/fs/mshare/mshare_info``
>> + which returns a number in text format. mshare regions must be
>> + aligned to this boundary and be a multiple of this size.
>> +
>
> I don't see why size and alignment needs to be taken into consideration by
> userspace. You can simply establish a mapping and pad it out.
>
>> + 3. For the process creating an mshare region:
>> +
>> + a. Create a file on /sys/fs/mshare, for example::
>> +
>> + fd = open("/sys/fs/mshare/shareme",
>> + O_RDWR|O_CREAT|O_EXCL, 0600);
>
> Ok, makes sense.
>
>> +
>> + b. Establish the size of the region::
>> +
>> + fallocate(fd, 0, 0, BUF_SIZE);
>> +
>> + or::
>> +
>> + ftruncate(fd, BUF_SIZE);
>> +
>
> Yep.
>
>> + c. Map some memory in the region::
>> +
>> + struct mshare_create mcreate;
>> +
>> + mcreate.region_offset = 0;
>> + mcreate.size = BUF_SIZE;
>> + mcreate.offset = 0;
>> + mcreate.prot = PROT_READ | PROT_WRITE;
>> + mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
>> + mcreate.fd = -1;
>> +
>> + ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate);
>
> Why?? Do you want to map mappings in msharefs files, that can themselves be
> mapped? Why do we need an ioctl here?
>
> Really, this feature seems very overengineered. If you want to go the fs route,
> doing a new pseudofs that's just like hugetlb, but without the hugepages, sounds
> like a decent idea. Or enhancing tmpfs to actually support this kind of stuff.
> Or properly doing a syscall that can try to attach the page-table-sharing
> property to random VMAs.
>
> But I'm wholly opposed to the idea of "mapping a file that itself has more
> mappings, mappings which you establish using a magic filesystem and ioctls".

I don't remember the history (it's been a while), but there was
interest in

(a) Sharing page tables for smaller files (not just PUD size etc.)
(b) Supporting ordinary file systems as well, not just tmpfs
(c) Having a way to update the protection of parts of a mapping and
    have it immediately visible to everyone mapping that area.

In the past, I raised that some VM use cases around virtio-fs would be
interested in having a "VMA container" that can be updated by the parent
QEMU process, and what gets mapped in there would be immediately visible
to the other processes.

I recall that initially I pushed for just generalizing the support for
shared page tables so it could be used for other file systems. I recall
problems around that, likely around protection changes etc.

So current mshare really is the idea of having a (let's call it) VMA
container that can be mapped into processes, where all processes will
observe changes performed by other processes.

I agree that it's complicated, and the semantics are very, very, very weird.
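
To make point (c) concrete: with ordinary shared memory, data written by one
process is visible to every process mapping it, but a protection change is
not, because mprotect() only touches the caller's own page tables. A minimal
sketch using only standard POSIX calls, nothing from the mshare series; the
function name is made up for illustration:

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the parent can still write to a MAP_SHARED page after the
 * child made its own view of that page read-only, 0 if the expected data
 * was not seen, and -1 if the setup failed.
 */
static int protection_change_is_private(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	pid_t pid = fork();
	if (pid < 0) {
		munmap(buf, len);
		return -1;
	}
	if (pid == 0) {
		/* Child: the data written here is visible to the parent ... */
		strcpy(buf, "from child");
		/* ... but this protection change affects only the child's mm. */
		_exit(mprotect(buf, len, PROT_READ) == 0 ? 0 : 1);
	}

	int status;
	if (waitpid(pid, &status, 0) < 0 ||
	    !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
		munmap(buf, len);
		return -1;
	}

	/* The parent's page table entries are untouched, so this write works. */
	int saw_data = strcmp(buf, "from child") == 0;
	buf[0] = 'F';
	int wrote = buf[0] == 'F';

	munmap(buf, len);
	return saw_data && wrote;
}
```

With page tables shared between the processes, the protection change would
apply to everyone mapping the area, which is the behaviour (c) asks for.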
--
Cheers
David / dhildenb