From: David Hildenbrand <david@redhat.com>
To: Pedro Falcato <pfalcato@suse.de>,
	Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
	andreyknvl@gmail.com, arnd@arndb.de, bp@alien8.de,
	brauner@kernel.org, bsegall@google.com, corbet@lwn.net,
	dave.hansen@linux.intel.com, dietmar.eggemann@arm.com,
	ebiederm@xmission.com, hpa@zytor.com, jakub.wartak@mailbox.org,
	jannh@google.com, juri.lelli@redhat.com, khalid@kernel.org,
	liam.howlett@oracle.com, linyongting@bytedance.com,
	lorenzo.stoakes@oracle.com, luto@kernel.org,
	markhemm@googlemail.com, maz@kernel.org, mhiramat@kernel.org,
	mgorman@suse.de, mhocko@suse.com, mingo@redhat.com,
	muchun.song@linux.dev, neilb@suse.de, osalvador@suse.de,
	pcc@google.com, peterz@infradead.org, rostedt@goodmis.org,
	rppt@kernel.org, shakeel.butt@linux.dev, surenb@google.com,
	tglx@linutronix.de, vasily.averin@linux.dev, vbabka@suse.cz,
	vincent.guittot@linaro.org, viro@zeniv.linux.org.uk,
	vschneid@redhat.com, willy@infradead.org, x86@kernel.org,
	xhao@linux.alibaba.com, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
Subject: Re: [PATCH v3 01/22] mm: Add msharefs filesystem
Date: Wed, 10 Sep 2025 14:46:45 +0200
Message-ID: <e61c1029-d760-4c04-acfb-55bc0af88e88@redhat.com>
In-Reply-To: <do7cmy4eiiqd5ux62r3u2ghizc62ljg5m3mqx7qzy3im4kc2p6@upmigdbp7eat>

On 10.09.25 14:14, Pedro Falcato wrote:
> On Tue, Aug 19, 2025 at 06:03:54PM -0700, Anthony Yznaga wrote:
>> From: Khalid Aziz <khalid@kernel.org>
>>
>> Add a pseudo filesystem that contains files and page table sharing
>> information that enables processes to share page table entries.
>> This patch adds the basic filesystem that can be mounted, a
>> CONFIG_MSHARE option to enable the feature, and documentation.
>>
>> Signed-off-by: Khalid Aziz <khalid@kernel.org>
>> Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
>> ---
>>   Documentation/filesystems/index.rst    |  1 +
>>   Documentation/filesystems/msharefs.rst | 96 +++++++++++++++++++++++++
>>   include/uapi/linux/magic.h             |  1 +
>>   mm/Kconfig                             | 11 +++
>>   mm/Makefile                            |  4 ++
>>   mm/mshare.c                            | 97 ++++++++++++++++++++++++++
>>   6 files changed, 210 insertions(+)
>>   create mode 100644 Documentation/filesystems/msharefs.rst
>>   create mode 100644 mm/mshare.c
>>
>> diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
>> index 11a599387266..dcd6605eb228 100644
>> --- a/Documentation/filesystems/index.rst
>> +++ b/Documentation/filesystems/index.rst
>> @@ -102,6 +102,7 @@ Documentation for filesystem implementations.
>>      fuse-passthrough
>>      inotify
>>      isofs
>> +   msharefs
>>      nilfs2
>>      nfs/index
>>      ntfs3
>> diff --git a/Documentation/filesystems/msharefs.rst b/Documentation/filesystems/msharefs.rst
>> new file mode 100644
>> index 000000000000..3e5b7d531821
>> --- /dev/null
>> +++ b/Documentation/filesystems/msharefs.rst
>> @@ -0,0 +1,96 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +=====================================================
>> +Msharefs - A filesystem to support shared page tables
>> +=====================================================
>> +
>> +What is msharefs?
>> +-----------------
>> +
>> +msharefs is a pseudo filesystem that allows multiple processes to
>> +share page table entries for shared pages. To enable support for
>> +msharefs the kernel must be compiled with CONFIG_MSHARE set.
>> +
>> +msharefs is typically mounted like this::
>> +
>> +	mount -t msharefs none /sys/fs/mshare
>> +
>> +A file created on msharefs creates a new shared region where all
>> +processes mapping that region will map it using shared page table
>> +entries. Once the size of the region has been established via
>> +ftruncate() or fallocate(), the region can be mapped into processes
>> +and ioctls used to map and unmap objects within it. Note that an
>> +msharefs file is a control file and accessing mapped objects within
>> +a shared region through read or write of the file is not permitted.
>> +
> 
> Welp. I really really don't like this API.
> I assume this has been discussed previously, but why do we need a new
> magical pseudofs mounted under some random /sys directory?
> 
> But, ok, assuming we're thinking about something hugetlbfs like, that's not too
> bad, and programs already know how to use it.
> 
>> +How to use mshare
>> +-----------------
>> +
>> +Here are the basic steps for using mshare:
>> +
>> +  1. Mount msharefs on /sys/fs/mshare::
>> +
>> +	mount -t msharefs msharefs /sys/fs/mshare
>> +
>> +  2. mshare regions have alignment and size requirements. Start
>> +     address for the region must be aligned to an address boundary and
>> +     be a multiple of fixed size. This alignment and size requirement
>> +     can be obtained by reading the file ``/sys/fs/mshare/mshare_info``
>> +     which returns a number in text format. mshare regions must be
>> +     aligned to this boundary and be a multiple of this size.
>> +
> 
> I don't see why size and alignment needs to be taken into consideration by
> userspace. You can simply establish a mapping and pad it out.
> 
>> +  3. For the process creating an mshare region:
>> +
>> +    a. Create a file on /sys/fs/mshare, for example::
>> +
>> +        fd = open("/sys/fs/mshare/shareme",
>> +                        O_RDWR|O_CREAT|O_EXCL, 0600);
> 
> Ok, makes sense.
> 
>> +
>> +    b. Establish the size of the region::
>> +
>> +        fallocate(fd, 0, 0, BUF_SIZE);
>> +
>> +      or::
>> +
>> +        ftruncate(fd, BUF_SIZE);
>> +
> 
> Yep.
> 
>> +    c. Map some memory in the region::
>> +
>> +	struct mshare_create mcreate;
>> +
>> +	mcreate.region_offset = 0;
>> +	mcreate.size = BUF_SIZE;
>> +	mcreate.offset = 0;
>> +	mcreate.prot = PROT_READ | PROT_WRITE;
>> +	mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
>> +	mcreate.fd = -1;
>> +
>> +	ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate);
> 
> Why?? Do you want to map mappings in msharefs files, that can themselves be
> mapped? Why do we need an ioctl here?
> 
> Really, this feature seems very overengineered. If you want to go the fs route,
> doing a new pseudofs that's just like hugetlb, but without the hugepages, sounds
> like a decent idea. Or enhancing tmpfs to actually support this kind of stuff.
> Or properly doing a syscall that can try to attach the page-table-sharing
> property to random VMAs.
> 
> But I'm wholly opposed to the idea of "mapping a file that itself has more
> mappings, mappings which you establish using a magic filesystem and ioctls".

I don't remember the full history (it's been a while), but there was
interest in

(a) Sharing page tables for smaller files (not just PUD size etc.)

(b) Supporting ordinary file systems as well, not just tmpfs

(c) Having a way to update the protection of parts of a mapping and
     have the change immediately visible to everyone mapping that area.

In the past, I raised that some VM use cases around virtio-fs would be 
interested in having a "VMA container" that can be updated by the parent 
QEMU process, and what gets mapped in there would be immediately visible 
to the other processes.

I recall that initially I pushed for just generalizing the support for 
shared page tables so it could be used for other file systems. I recall 
problems around that, likely around protection changes etc.

So current mshare really is the idea of having a (let's call it) VMA 
container that can be mapped into processes where all processes will 
observe changes performed by other processes.

I agree that it's complicated, and the semantics are very, very, very weird.

-- 
Cheers

David / dhildenb
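
On the alignment question raised in the quoted documentation, the following
is a minimal sketch (not part of the patch) of how userspace might pad a
requested size to the value read from mshare_info. The helper names are
hypothetical, and the numeric base of the text in mshare_info is an
assumption; strtoull() with base 0 accepts both decimal and 0x-prefixed hex:

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse the alignment from the text read out of
 * mshare_info. Returns 0 if no number could be parsed.
 */
static uint64_t mshare_parse_alignment(const char *text)
{
	char *end;
	unsigned long long value = strtoull(text, &end, 0);

	if (end == text)
		return 0;
	return (uint64_t)value;
}

/*
 * Hypothetical helper: round size up to the next multiple of align.
 * Returns 0 if align is 0. Overflow for sizes near UINT64_MAX is not
 * handled in this sketch.
 */
static uint64_t mshare_round_up(uint64_t size, uint64_t align)
{
	uint64_t rem;

	if (align == 0)
		return 0;
	rem = size % align;
	return rem ? size + (align - rem) : size;
}
```

A caller would read the contents of /sys/fs/mshare/mshare_info into a buffer,
pass it to mshare_parse_alignment(), and then use the rounded-up value for
ftruncate() or fallocate(). Whether the kernel could do this padding itself
is exactly the point Pedro raises above.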

