* [PATCH] mm: Add RWH_RMAP_EXCLUDE flag to exclude files from rmap sharing
@ 2026-04-21 2:09 Yibin Liu
2026-04-21 14:38 ` Matthew Wilcox
` (3 more replies)
0 siblings, 4 replies; 9+ messages in thread
From: Yibin Liu @ 2026-04-21 2:09 UTC (permalink / raw)
To: linux-mm
Cc: akpm, Liam.Howlett, viro, brauner, mjguzik, wujianyong, huangsj,
zhongyuan
UnixBench execl/shellscript (dynamically linked binaries) at 64+ cores are
bottlenecked on the i_mmap_rwsem semaphore due to heavy vma insert/remove
operations on the i_mmap tree, where libc.so.6 is the most frequent,
followed by ld-linux-x86-64.so.2 and the test executable itself.
This patch marks such files to skip rmap operations, avoiding frequent
interval tree insert/remove that cause i_mmap_rwsem lock contention.
The downside is these files can no longer be reclaimed (along with compact
and ksm), but since they are small and resident anyway, it's acceptable.
When all mapping processes exit, files can still be reclaimed normally.
Performance testing shows ~80% improvement in UnixBench execl/shellscript
scores on Hygon 7490, AMD zen4 9754 and Intel emerald rapids platform.
Signed-off-by: Yibin Liu <liuyibin@hygon.cn>
---
fs/fcntl.c | 1 +
fs/open.c | 6 ++++++
include/linux/fs.h | 3 +++
include/uapi/linux/fcntl.h | 1 +
mm/mmap.c | 3 ++-
mm/vma.c | 8 +++++---
6 files changed, 18 insertions(+), 4 deletions(-)
diff --git a/fs/fcntl.c b/fs/fcntl.c
index beab8080badf..9b7cc1544735 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -349,6 +349,7 @@ static bool rw_hint_valid(u64 hint)
case RWH_WRITE_LIFE_MEDIUM:
case RWH_WRITE_LIFE_LONG:
case RWH_WRITE_LIFE_EXTREME:
+ case RWH_RMAP_EXCLUDE:
return true;
default:
return false;
diff --git a/fs/open.c b/fs/open.c
index 681d405bc61e..643ab7c6b461 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -46,6 +46,10 @@ int do_truncate(struct mnt_idmap *idmap, struct dentry *dentry,
if (length < 0)
return -EINVAL;
+ /* Prevent truncate on files marked as RMAP_EXCLUDE (e.g., libc, ld.so) */
+ if (filp && (filp->f_mode & FMODE_RMAP_EXCLUDE))
+ return -EPERM;
+
newattrs.ia_size = length;
newattrs.ia_valid = ATTR_SIZE | time_attrs;
if (filp) {
@@ -892,6 +896,8 @@ static int do_dentry_open(struct file *f,
path_get(&f->f_path);
f->f_inode = inode;
f->f_mapping = inode->i_mapping;
+ if (inode->i_write_hint == RWH_RMAP_EXCLUDE)
+ f->f_mode |= FMODE_RMAP_EXCLUDE;
f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
f->f_sb_err = file_sample_sb_err(f);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 11559c513dfb..d5c9e5a4c2b9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -189,6 +189,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
/* File does not contribute to nr_files count */
#define FMODE_NOACCOUNT ((__force fmode_t)(1 << 29))
+/* File should exclude vma from rmap interval tree */
+#define FMODE_RMAP_EXCLUDE ((__force fmode_t)(1 << 30))
+
/*
* The two FMODE_NONOTIFY* define which fsnotify events should not be generated
* for an open file. These are the possible values of
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index aadfbf6e0cb3..4969b4762071 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -72,6 +72,7 @@
#define RWH_WRITE_LIFE_MEDIUM 3
#define RWH_WRITE_LIFE_LONG 4
#define RWH_WRITE_LIFE_EXTREME 5
+#define RWH_RMAP_EXCLUDE 6
/*
* The originally introduced spelling is remained from the first
diff --git a/mm/mmap.c b/mm/mmap.c
index 2311ae7c2ff4..3eb00997e86a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1830,7 +1830,8 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
mapping_allow_writable(mapping);
flush_dcache_mmap_lock(mapping);
/* insert tmp into the share list, just after mpnt */
- vma_interval_tree_insert_after(tmp, mpnt,
+ if (!(file->f_mode & FMODE_RMAP_EXCLUDE))
+ vma_interval_tree_insert_after(tmp, mpnt,
&mapping->i_mmap);
flush_dcache_mmap_unlock(mapping);
i_mmap_unlock_write(mapping);
diff --git a/mm/vma.c b/mm/vma.c
index 377321b48734..f1e36e6a8702 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -234,7 +234,8 @@ static void __vma_link_file(struct vm_area_struct *vma,
mapping_allow_writable(mapping);
flush_dcache_mmap_lock(mapping);
- vma_interval_tree_insert(vma, &mapping->i_mmap);
+ if (!(vma->vm_file->f_mode & FMODE_RMAP_EXCLUDE))
+ vma_interval_tree_insert(vma, &mapping->i_mmap);
flush_dcache_mmap_unlock(mapping);
}
@@ -339,10 +340,11 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
struct mm_struct *mm)
{
if (vp->file) {
- if (vp->adj_next)
+ if (vp->adj_next && !(vp->adj_next->vm_file->f_mode & FMODE_RMAP_EXCLUDE))
vma_interval_tree_insert(vp->adj_next,
&vp->mapping->i_mmap);
- vma_interval_tree_insert(vp->vma, &vp->mapping->i_mmap);
+ if (!(vp->vma->vm_file->f_mode & FMODE_RMAP_EXCLUDE))
+ vma_interval_tree_insert(vp->vma, &vp->mapping->i_mmap);
flush_dcache_mmap_unlock(vp->mapping);
}
--
2.34.1
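The commit message does not spell out how a file gets marked. Judging from the diff alone, marking would presumably go through the existing write-hint fcntl, so a minimal userspace sketch could look like the following. `F_SET_RW_HINT` is an existing fcntl command; `RWH_RMAP_EXCLUDE` (value 6) exists only in this patch, and `mark_rmap_exclude()` is a hypothetical helper name used for illustration. On a kernel without the patch the call should be rejected, since `rw_hint_valid()` would not accept the value.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT 1036	/* F_LINUX_SPECIFIC_BASE (1024) + 12 */
#endif

/* Assumption: value proposed by this patch, not in any released kernel. */
#ifndef RWH_RMAP_EXCLUDE
#define RWH_RMAP_EXCLUDE 6
#endif

/*
 * Hypothetical helper: set the proposed hint on the inode behind @path.
 * Returns 0 on success or a negative errno value on failure.
 */
int mark_rmap_exclude(const char *path)
{
	uint64_t hint = RWH_RMAP_EXCLUDE;
	int fd, ret = 0;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	/* The hint is stored on the inode; with the patch applied, later
	 * opens of the same file would pick up FMODE_RMAP_EXCLUDE. */
	if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
		ret = -errno;

	close(fd);
	return ret;
}
```

Because the hint lives on the inode rather than on the file descriptor, one caller's choice would affect every later open of that file, which is the property the replies below object to.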
* Re: [PATCH] mm: Add RWH_RMAP_EXCLUDE flag to exclude files from rmap sharing
2026-04-21 2:09 [PATCH] mm: Add RWH_RMAP_EXCLUDE flag to exclude files from rmap sharing Yibin Liu
@ 2026-04-21 14:38 ` Matthew Wilcox
2026-04-21 15:37 ` Pedro Falcato
` (2 subsequent siblings)
3 siblings, 0 replies; 9+ messages in thread
From: Matthew Wilcox @ 2026-04-21 14:38 UTC (permalink / raw)
To: Yibin Liu
Cc: linux-mm, akpm, Liam.Howlett, viro, brauner, mjguzik, wujianyong,
huangsj, zhongyuan
On Tue, Apr 21, 2026 at 10:09:32AM +0800, Yibin Liu wrote:
> + /* Prevent truncate on files marked as RMAP_EXCLUDE (e.g., libc, ld.so) */
> + if (filp && (filp->f_mode & FMODE_RMAP_EXCLUDE))
> + return -EPERM;
You can't do this. It means I can prevent anybody else from truncating
a file which I can open.
* Re: [PATCH] mm: Add RWH_RMAP_EXCLUDE flag to exclude files from rmap sharing
2026-04-21 2:09 [PATCH] mm: Add RWH_RMAP_EXCLUDE flag to exclude files from rmap sharing Yibin Liu
2026-04-21 14:38 ` Matthew Wilcox
@ 2026-04-21 15:37 ` Pedro Falcato
2026-04-22 7:19 ` David Hildenbrand (Arm)
2026-04-21 19:46 ` Mateusz Guzik
2026-04-22 10:49 ` Lorenzo Stoakes
3 siblings, 1 reply; 9+ messages in thread
From: Pedro Falcato @ 2026-04-21 15:37 UTC (permalink / raw)
To: Yibin Liu
Cc: linux-mm, akpm, Liam.Howlett, viro, brauner, mjguzik, wujianyong,
huangsj, zhongyuan
I'm not sure how you're using get_maintainer.pl, but PLEASE CC people
correctly.
On Tue, Apr 21, 2026 at 10:09:32AM +0800, Yibin Liu wrote:
> UnixBench execl/shellscript (dynamically linked binaries) at 64+ cores are
> bottlenecked on the i_mmap_rwsem semaphore due to heavy vma insert/remove
> operations on the i_mmap tree, where libc.so.6 is the most frequent,
> followed by ld-linux-x86-64.so.2 and the test executable itself.
>
> This patch marks such files to skip rmap operations, avoiding frequent
> interval tree insert/remove that cause i_mmap_rwsem lock contention.
> The downside is these files can no longer be reclaimed (along with compact
This does not work.
> and ksm), but since they are small and resident anyway, it's acceptable.
Which ones? ELF executables commonly have a lot of padding. How do you know
what to mark as !no_rmap? Who has permissions for that? This patch allows
any user to create a full DoS of the system by simply marking a really large
file as no-reclaim.
> When all mapping processes exit, files can still be reclaimed normally.
>
> Performance testing shows ~80% improvement in UnixBench execl/shellscript
> scores on Hygon 7490, AMD zen4 9754 and Intel emerald rapids platform.
>
> Signed-off-by: Yibin Liu <liuyibin@hygon.cn>
> ---
> fs/fcntl.c | 1 +
> fs/open.c | 6 ++++++
> include/linux/fs.h | 3 +++
> include/uapi/linux/fcntl.h | 1 +
> mm/mmap.c | 3 ++-
> mm/vma.c | 8 +++++---
> 6 files changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index beab8080badf..9b7cc1544735 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -349,6 +349,7 @@ static bool rw_hint_valid(u64 hint)
> case RWH_WRITE_LIFE_MEDIUM:
> case RWH_WRITE_LIFE_LONG:
> case RWH_WRITE_LIFE_EXTREME:
> + case RWH_RMAP_EXCLUDE:
> return true;
> default:
> return false;
> diff --git a/fs/open.c b/fs/open.c
> index 681d405bc61e..643ab7c6b461 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -46,6 +46,10 @@ int do_truncate(struct mnt_idmap *idmap, struct dentry *dentry,
> if (length < 0)
> return -EINVAL;
>
> + /* Prevent truncate on files marked as RMAP_EXCLUDE (e.g., libc, ld.so) */
> + if (filp && (filp->f_mode & FMODE_RMAP_EXCLUDE))
> + return -EPERM;
> +
> newattrs.ia_size = length;
> newattrs.ia_valid = ATTR_SIZE | time_attrs;
> if (filp) {
> @@ -892,6 +896,8 @@ static int do_dentry_open(struct file *f,
> path_get(&f->f_path);
> f->f_inode = inode;
> f->f_mapping = inode->i_mapping;
> + if (inode->i_write_hint == RWH_RMAP_EXCLUDE)
> + f->f_mode |= FMODE_RMAP_EXCLUDE;
> f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
> f->f_sb_err = file_sample_sb_err(f);
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 11559c513dfb..d5c9e5a4c2b9 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -189,6 +189,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> /* File does not contribute to nr_files count */
> #define FMODE_NOACCOUNT ((__force fmode_t)(1 << 29))
>
> +/* File should exclude vma from rmap interval tree */
> +#define FMODE_RMAP_EXCLUDE ((__force fmode_t)(1 << 30))
> +
> /*
> * The two FMODE_NONOTIFY* define which fsnotify events should not be generated
> * for an open file. These are the possible values of
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index aadfbf6e0cb3..4969b4762071 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -72,6 +72,7 @@
> #define RWH_WRITE_LIFE_MEDIUM 3
> #define RWH_WRITE_LIFE_LONG 4
> #define RWH_WRITE_LIFE_EXTREME 5
> +#define RWH_RMAP_EXCLUDE 6
Userspace does not know what "rmap" is, nor does it need to know. Even if you
found a workable solution for the permission side, this is not a good solution,
and it's not a good interface. It also might solve it for you, but definitely
does not solve the whole issue with having a lock there (which you are, for
the record, still acquiring).
--
Pedro
* Re: [PATCH] mm: Add RWH_RMAP_EXCLUDE flag to exclude files from rmap sharing
2026-04-21 2:09 [PATCH] mm: Add RWH_RMAP_EXCLUDE flag to exclude files from rmap sharing Yibin Liu
2026-04-21 14:38 ` Matthew Wilcox
2026-04-21 15:37 ` Pedro Falcato
@ 2026-04-21 19:46 ` Mateusz Guzik
2026-04-22 13:03 ` Re: " Yibin Liu
2026-04-22 10:49 ` Lorenzo Stoakes
3 siblings, 1 reply; 9+ messages in thread
From: Mateusz Guzik @ 2026-04-21 19:46 UTC (permalink / raw)
To: Yibin Liu
Cc: linux-mm, akpm, Liam.Howlett, viro, brauner, wujianyong, huangsj,
zhongyuan
On Tue, Apr 21, 2026 at 4:11 AM Yibin Liu <liuyibin@hygon.cn> wrote:
>
> UnixBench execl/shellscript (dynamically linked binaries) at 64+ cores are
> bottlenecked on the i_mmap_rwsem semaphore due to heavy vma insert/remove
> operations on the i_mmap tree, where libc.so.6 is the most frequent,
> followed by ld-linux-x86-64.so.2 and the test executable itself.
>
> This patch marks such files to skip rmap operations, avoiding frequent
> interval tree insert/remove that cause i_mmap_rwsem lock contention.
> The downside is these files can no longer be reclaimed (along with compact
> and ksm), but since they are small and resident anyway, it's acceptable.
> When all mapping processes exit, files can still be reclaimed normally.
>
> Performance testing shows ~80% improvement in UnixBench execl/shellscript
> scores on Hygon 7490, AMD zen4 9754 and Intel emerald rapids platform.
>
The other responders have been a little harsh and despite raising
valid points I don't think they gave a proper review.
The bigger picture is that the problematic rwsem is taken several
times during fork + exec + exit cycle. Normally you end up with 5
distinct mappings per binary/so, each created with a separate lock
acquire.
Some time ago I patched exit to batch processing, leaving 1 acquire in
that codepath. fork can and should be patched in a similar vein, but I
don't know if unixbench runs it in this benchmark (i.e., real
workloads certainly suffer from it, I don't know if this particular
bench includes that aspect). This is on top of forking itself being
avoidable should the kernel grow a better interface for executing
binaries.
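To make the acquire-count argument concrete, here is a toy userspace model, not kernel code and not taken from any posted patch. All names (`toy_lock`, `toy_file`, `link_each`, `link_batched`) are hypothetical. It only counts how often a per-file lock is taken when five mappings are linked one at a time versus in one batch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a per-file lock that counts how often it is taken. */
struct toy_lock {
	unsigned long acquisitions;
	int held;
};

/* Toy stand-in for the per-file state that the lock protects. */
struct toy_file {
	struct toy_lock lock;
	size_t nr_linked;
};

void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = 1;
	l->acquisitions++;
}

void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Unbatched: one lock round trip per mapping. */
void link_each(struct toy_file *f, size_t nr_mappings)
{
	for (size_t i = 0; i < nr_mappings; i++) {
		toy_lock_acquire(&f->lock);
		f->nr_linked++;
		toy_lock_release(&f->lock);
	}
}

/* Batched: a single lock round trip covers all mappings. */
void link_batched(struct toy_file *f, size_t nr_mappings)
{
	toy_lock_acquire(&f->lock);
	for (size_t i = 0; i < nr_mappings; i++)
		f->nr_linked++;
	toy_lock_release(&f->lock);
}
```

Both variants do the same amount of linking work; only the number of times the contended lock changes hands differs, which is the quantity the paragraph above is concerned with.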
This leaves us with mapping creation on exec. This problem is
unfixable without introduction of better APIs for userspace, which
constitutes quite a challenge.
The end result is the absolutely horrible case of multiple acquires of
the same lock per iteration.
One common idea how to reduce contention boils down to shortening lock
hold time. This has very limited effect in face of the aforementioned
multiple acquires and is at best a stop gap -- no matter what, the
ceiling is dictated by the extra acquires and it is incredibly low.
Your patch keeps the problematic acquire pattern intact and while the
80% win might sound encouraging, the end result is still severely
underperforming even a state where the lock is taken once in total
during exec.
Besides that, the internally-visible side effect of non-functional
rmap (and thus e.g., truncate) is pretty bad in its own
right, but let's ignore it. The primary problem here is that the patch
exposes a mechanism for userspace to dictate this in the first place.
Even ignoring the question of who should be using it and when, the
real solution to the problem would be confined to the kernel. Suppose
this patch lands and such a solution is implemented later -- now the
kernel is stuck having to support a now-useless (if not outright
harmful) feature.
What will fix the problem is sharding the state in some capacity,
provided no unfixable stopgap shows up.
Any other approach is putting small bandaids on it and can be a
consideration only if the decentralizing locking is proven too
problematic.
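As a rough userspace analogy for what "sharding the state" can mean in general, the sketch below splits one lock-protected counter into several independently locked shards. It is an assumption-laden illustration only: `NR_SHARDS`, `struct shard`, and the `sharded_*` helpers are hypothetical names, the spinlock is a plain C11 `atomic_flag`, and nothing here reflects how i_mmap would actually be sharded in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_SHARDS 8	/* arbitrary value for illustration */

/* One lock plus the state it protects, aligned to limit false sharing. */
struct shard {
	_Alignas(64) atomic_flag lock;
	long nr_mappings;
};

struct sharded_state {
	struct shard shards[NR_SHARDS];
};

void sharded_init(struct sharded_state *s)
{
	assert(s != NULL);
	for (size_t i = 0; i < NR_SHARDS; i++) {
		atomic_flag_clear(&s->shards[i].lock);
		s->shards[i].nr_mappings = 0;
	}
}

static void shard_lock(struct shard *sh)
{
	while (atomic_flag_test_and_set_explicit(&sh->lock,
						 memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void shard_unlock(struct shard *sh)
{
	atomic_flag_clear_explicit(&sh->lock, memory_order_release);
}

/* Updaters touch only the shard selected by @key. */
void sharded_add(struct sharded_state *s, size_t key, long delta)
{
	struct shard *sh = &s->shards[key % NR_SHARDS];

	shard_lock(sh);
	sh->nr_mappings += delta;
	shard_unlock(sh);
}

/* Anyone needing the global view has to visit every shard. */
long sharded_total(struct sharded_state *s)
{
	long total = 0;

	for (size_t i = 0; i < NR_SHARDS; i++) {
		shard_lock(&s->shards[i]);
		total += s->shards[i].nr_mappings;
		shard_unlock(&s->shards[i]);
	}
	return total;
}
```

The trade-off in this toy is that updates on different shards no longer contend, while any operation that needs the whole picture becomes more expensive because it walks all shards.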
Pedro apparently volunteered to do the work, so I think we can wait to
see what he is going to end up cooking.
I hope this helps.
* Re: [PATCH] mm: Add RWH_RMAP_EXCLUDE flag to exclude files from rmap sharing
2026-04-21 15:37 ` Pedro Falcato
@ 2026-04-22 7:19 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 9+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-22 7:19 UTC (permalink / raw)
To: Pedro Falcato, Yibin Liu
Cc: linux-mm, akpm, Liam.Howlett, viro, brauner, mjguzik, wujianyong,
huangsj, zhongyuan
>> @@ -72,6 +72,7 @@
>> #define RWH_WRITE_LIFE_MEDIUM 3
>> #define RWH_WRITE_LIFE_LONG 4
>> #define RWH_WRITE_LIFE_EXTREME 5
>> +#define RWH_RMAP_EXCLUDE 6
>
> Userspace does not know what "rmap" is, nor does it need to know. Even if you
> found a workable solution for the permission side, this is not a good solution,
> and it's not a good interface.
It's horrible and I am surprised this patch is not tagged as RFC :)
--
Cheers,
David
* Re: [PATCH] mm: Add RWH_RMAP_EXCLUDE flag to exclude files from rmap sharing
2026-04-21 2:09 [PATCH] mm: Add RWH_RMAP_EXCLUDE flag to exclude files from rmap sharing Yibin Liu
` (2 preceding siblings ...)
2026-04-21 19:46 ` Mateusz Guzik
@ 2026-04-22 10:49 ` Lorenzo Stoakes
2026-04-22 12:51 ` Re: " Yibin Liu
3 siblings, 1 reply; 9+ messages in thread
From: Lorenzo Stoakes @ 2026-04-22 10:49 UTC (permalink / raw)
To: Yibin Liu
Cc: linux-mm, akpm, Liam.Howlett, viro, brauner, mjguzik, wujianyong,
huangsj, zhongyuan
NAK obviously.
I hate to keep saying this to people, but you've got no excuse at this stage
it's been a year or so since we added mm maintainers/reviewers and you're not
sending this to the right people.
How hard is doing:
$ scripts/get_maintainer.pl --no-git fs/fcntl.c fs/open.c include/linux/fs.h \
include/uapi/linux/fcntl.h mm/mmap.c mm/vma.c
Jeff Layton <jlayton@kernel.org> (maintainer:FILE LOCKING (flock() and fcntl()/lockf()))
Chuck Lever <chuck.lever@oracle.com> (maintainer:FILE LOCKING (flock() and fcntl()/lockf()))
Alexander Aring <alex.aring@gmail.com> (reviewer:FILE LOCKING (flock() and fcntl()/lockf()))
Alexander Viro <viro@zeniv.linux.org.uk> (maintainer:FILESYSTEMS (VFS and infrastructure))
Christian Brauner <brauner@kernel.org> (maintainer:FILESYSTEMS (VFS and infrastructure))
Jan Kara <jack@suse.cz> (reviewer:FILESYSTEMS (VFS and infrastructure))
Andrew Morton <akpm@linux-foundation.org> (maintainer:MEMORY MAPPING)
"Liam R. Howlett" <Liam.Howlett@oracle.com> (maintainer:MEMORY MAPPING)
Lorenzo Stoakes <ljs@kernel.org> (maintainer:MEMORY MAPPING)
Vlastimil Babka <vbabka@kernel.org> (reviewer:MEMORY MAPPING)
Jann Horn <jannh@google.com> (reviewer:MEMORY MAPPING)
Pedro Falcato <pfalcato@suse.de> (reviewer:MEMORY MAPPING)
linux-fsdevel@vger.kernel.org (open list:FILE LOCKING (flock() and fcntl()/lockf()))
linux-kernel@vger.kernel.org (open list)
linux-mm@kvack.org (open list:MEMORY MAPPING)
?
You're sending an insane patch that breaks core mm and you can't even send it to
the right people...
(And yet Mateusz is somehow cc'd (he loves that :))
This kind of craziness should be an RFC also as David said.
Both of these things are just rude and not helpful wrt upstream.
On Tue, Apr 21, 2026 at 10:09:32AM +0800, Yibin Liu wrote:
> UnixBench execl/shellscript (dynamically linked binaries) at 64+ cores are
> bottlenecked on the i_mmap_rwsem semaphore due to heavy vma insert/remove
> operations on the i_mmap tree, where libc.so.6 is the most frequent,
> followed by ld-linux-x86-64.so.2 and the test executable itself.
OK that's good to know, but please provide _actual data_. Hand waving isn't ok.
>
> This patch marks such files to skip rmap operations, avoiding frequent
> interval tree insert/remove that cause i_mmap_rwsem lock contention.
OK that's totally insane.
This is a classic example of 'I have problem X, therefore <do something insane
that happens to address X>'.
> The downside is these files can no longer be reclaimed (along with compact
> and ksm), but since they are small and resident anyway, it's acceptable.
> When all mapping processes exit, files can still be reclaimed normally.
>
Yeah, that's quite the bloody downside. And 'they're small and resident
anyway'... err what on earth makes that a thing?
Also as Matthew points out, you're impacting _everybody else_, you're giving
avenues for unprivileged users to trigger total kernel lockups, you're breaking
migration, you're breaking reclaim, you're breaking basically all of rmap to fix
a performance issue.
> Performance testing shows ~80% improvement in UnixBench execl/shellscript
> scores on Hygon 7490, AMD zen4 9754 and Intel emerald rapids platform.
Yeah ok, I'm sure if I remove rmap altogether I'll get even better numbers :)
I can also take the oxygen system out of a plane and make it way more fuel
efficient!
>
> Signed-off-by: Yibin Liu <liuyibin@hygon.cn>
> ---
> fs/fcntl.c | 1 +
> fs/open.c | 6 ++++++
> include/linux/fs.h | 3 +++
> include/uapi/linux/fcntl.h | 1 +
> mm/mmap.c | 3 ++-
> mm/vma.c | 8 +++++---
> 6 files changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index beab8080badf..9b7cc1544735 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -349,6 +349,7 @@ static bool rw_hint_valid(u64 hint)
> case RWH_WRITE_LIFE_MEDIUM:
> case RWH_WRITE_LIFE_LONG:
> case RWH_WRITE_LIFE_EXTREME:
> + case RWH_RMAP_EXCLUDE:
> return true;
> default:
> return false;
> diff --git a/fs/open.c b/fs/open.c
> index 681d405bc61e..643ab7c6b461 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -46,6 +46,10 @@ int do_truncate(struct mnt_idmap *idmap, struct dentry *dentry,
> if (length < 0)
> return -EINVAL;
>
> + /* Prevent truncate on files marked as RMAP_EXCLUDE (e.g., libc, ld.so) */
Prevent truncation :)
RMAP_EXCLUDE :)))
Seriously no.
> + if (filp && (filp->f_mode & FMODE_RMAP_EXCLUDE))
> + return -EPERM;
> +
> newattrs.ia_size = length;
> newattrs.ia_valid = ATTR_SIZE | time_attrs;
> if (filp) {
> @@ -892,6 +896,8 @@ static int do_dentry_open(struct file *f,
> path_get(&f->f_path);
> f->f_inode = inode;
> f->f_mapping = inode->i_mapping;
> + if (inode->i_write_hint == RWH_RMAP_EXCLUDE)
> + f->f_mode |= FMODE_RMAP_EXCLUDE;
> f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
> f->f_sb_err = file_sample_sb_err(f);
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 11559c513dfb..d5c9e5a4c2b9 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -189,6 +189,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> /* File does not contribute to nr_files count */
> #define FMODE_NOACCOUNT ((__force fmode_t)(1 << 29))
>
> +/* File should exclude vma from rmap interval tree */
> +#define FMODE_RMAP_EXCLUDE ((__force fmode_t)(1 << 30))
> +
> /*
> * The two FMODE_NONOTIFY* define which fsnotify events should not be generated
> * for an open file. These are the possible values of
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index aadfbf6e0cb3..4969b4762071 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -72,6 +72,7 @@
> #define RWH_WRITE_LIFE_MEDIUM 3
> #define RWH_WRITE_LIFE_LONG 4
> #define RWH_WRITE_LIFE_EXTREME 5
> +#define RWH_RMAP_EXCLUDE 6
As others have pointed out, rmap is not a user API, and it will NEVER be.
>
> /*
> * The originally introduced spelling is remained from the first
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2311ae7c2ff4..3eb00997e86a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1830,7 +1830,8 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> mapping_allow_writable(mapping);
> flush_dcache_mmap_lock(mapping);
> /* insert tmp into the share list, just after mpnt */
> - vma_interval_tree_insert_after(tmp, mpnt,
> + if (!(file->f_mode & FMODE_RMAP_EXCLUDE))
> + vma_interval_tree_insert_after(tmp, mpnt,
Yeah this is just... this seems completely broken?
I'd be curious to see what sashiko finds for this lord :)
> &mapping->i_mmap);
> flush_dcache_mmap_unlock(mapping);
> i_mmap_unlock_write(mapping);
> diff --git a/mm/vma.c b/mm/vma.c
> index 377321b48734..f1e36e6a8702 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -234,7 +234,8 @@ static void __vma_link_file(struct vm_area_struct *vma,
> mapping_allow_writable(mapping);
>
> flush_dcache_mmap_lock(mapping);
> - vma_interval_tree_insert(vma, &mapping->i_mmap);
> + if (!(vma->vm_file->f_mode & FMODE_RMAP_EXCLUDE))
> + vma_interval_tree_insert(vma, &mapping->i_mmap);
> flush_dcache_mmap_unlock(mapping);
> }
>
> @@ -339,10 +340,11 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
> struct mm_struct *mm)
> {
> if (vp->file) {
> - if (vp->adj_next)
> + if (vp->adj_next && !(vp->adj_next->vm_file->f_mode & FMODE_RMAP_EXCLUDE))
> vma_interval_tree_insert(vp->adj_next,
> &vp->mapping->i_mmap);
> - vma_interval_tree_insert(vp->vma, &vp->mapping->i_mmap);
> + if (!(vp->vma->vm_file->f_mode & FMODE_RMAP_EXCLUDE))
> + vma_interval_tree_insert(vp->vma, &vp->mapping->i_mmap);
Hang on, this is struct file * state that impacts folio-granularity behaviour?
I mean ugh anyway.
> flush_dcache_mmap_unlock(vp->mapping);
> }
>
> --
> 2.34.1
>
>
>
This idea is totally broken.
If you want to contribute usefully, PLEASE drop this silly idea, come back with
some NUMBERS about the contention you see, and let's have a sensible discussion
about what we can do to address that?
Also follow standard upstream kernel procedures - figure out who to email
properly, RFC insane ideas, etc.
Thanks, Lorenzo
* Re: [PATCH] mm: Add RWH_RMAP_EXCLUDE flag to exclude files from rmap sharing
2026-04-22 10:49 ` Lorenzo Stoakes
@ 2026-04-22 12:51 ` Yibin Liu
2026-04-22 16:16 ` Lorenzo Stoakes
0 siblings, 1 reply; 9+ messages in thread
From: Yibin Liu @ 2026-04-22 12:51 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
Liam.Howlett@oracle.com, viro@zeniv.linux.org.uk,
brauner@kernel.org, mjguzik@gmail.com, Jianyong Wu, Huangsj,
Yuan Zhong, jack@suse.cz, jlayton@kernel.org,
chuck.lever@oracle.com, alex.aring@gmail.com, vbabka@kernel.org,
jannh@google.com, pfalcato@suse.de, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org
First of all, I am truly sorry for not using RFC.
Secondly, I omitted many maintainers because I wanted to “not disturb too many people”,
and I apologize deeply for that. I will fully follow these two rules from now on.
As for this patch, indeed, as Matthew said, the truncate part is not feasible.
My original intention was to apply this to frequently used library files like libc and ld.
Contention on the i_mmap_rwsem lock (which eventually turns into osq_lock) caused by
these two files alone reaches up to 70% in the “256-core execl” case, as observed from
flame graphs. Besides, no one performs truncate operations on libc and ld anyway.
So I wanted to try skipping rmap for them. Since they are small, even if they cannot
be reclaimed or migrated, I assumed it would not cause much trouble. Of course,
this idea was totally wrong, and I will definitely mark such insane proposals with RFC in the future.
These ideas are inspired by Mateusz’s work and thoughts
(https://lore.kernel.org/linux-mm/CAGudoHEfiOPJ2VGEV3fDT9cDsuoHB-wk8jg-k-EK6JhWgiHkWw@mail.gmail.com/),
so I specifically CC’d him to seek more opinions and insights.
Lastly, I sincerely apologize for the trouble I have caused the community.
I will strictly follow community conventions when sending patches in the future.
> NAK obviously.
>
> I hate to keep saying this to people, but you've got no excuse at this stage
> it's been a year or so since we added mm maintainers/reviewers and you're not
> sending this to the right people.
>
> How hard is doing:
>
> $ scripts/get_maintainer.pl --no-git fs/fcntl.c fs/open.c include/linux/fs.h \
> include/uapi/linux/fcntl.h mm/mmap.c mm/vma.c
> Jeff Layton <jlayton@kernel.org> (maintainer:FILE LOCKING (flock() and fcntl()/lockf()))
> Chuck Lever <chuck.lever@oracle.com> (maintainer:FILE LOCKING (flock() and fcntl()/lockf()))
> Alexander Aring <alex.aring@gmail.com> (reviewer:FILE LOCKING (flock() and fcntl()/lockf()))
> Alexander Viro <viro@zeniv.linux.org.uk> (maintainer:FILESYSTEMS (VFS and infrastructure))
> Christian Brauner <brauner@kernel.org> (maintainer:FILESYSTEMS (VFS and infrastructure))
> Jan Kara <jack@suse.cz> (reviewer:FILESYSTEMS (VFS and infrastructure))
> Andrew Morton <akpm@linux-foundation.org> (maintainer:MEMORY MAPPING)
> "Liam R. Howlett" <Liam.Howlett@oracle.com> (maintainer:MEMORY MAPPING)
> Lorenzo Stoakes <ljs@kernel.org> (maintainer:MEMORY MAPPING)
> Vlastimil Babka <vbabka@kernel.org> (reviewer:MEMORY MAPPING)
> Jann Horn <jannh@google.com> (reviewer:MEMORY MAPPING)
> Pedro Falcato <pfalcato@suse.de> (reviewer:MEMORY MAPPING)
> linux-fsdevel@vger.kernel.org (open list:FILE LOCKING (flock() and fcntl()/lockf()))
> linux-kernel@vger.kernel.org (open list)
> linux-mm@kvack.org (open list:MEMORY MAPPING)
>
> ?
>
> You're sending an insane patch that breaks core mm and you can't even send it to
> the right people...
>
> (And yet Mateusz is somehow cc'd (he loves that :))
>
> This kind of craziness should be an RFC also as David said.
>
> Both of these things are just rude and not helpful wrt upstream.
> ... ...
> ... ...
> This idea is totally broken.
>
> If you want to contribute usefully, PLEASE drop this silly idea, come back with
> some NUMBERS about the contention you see, and let's have a sensible discussion
> about what we can do to address that?
>
> Also follow standard upstream kernel procedures - figure out who to email
> properly, RFC insane ideas, etc.
>
> Thanks, Lorenzo
* Re: [PATCH] mm: Add RWH_RMAP_EXCLUDE flag to exclude files from rmap sharing
2026-04-21 19:46 ` Mateusz Guzik
@ 2026-04-22 13:03 ` Yibin Liu
0 siblings, 0 replies; 9+ messages in thread
From: Yibin Liu @ 2026-04-22 13:03 UTC (permalink / raw)
To: Mateusz Guzik
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
Liam.Howlett@oracle.com, viro@zeniv.linux.org.uk,
brauner@kernel.org, Jianyong Wu, Huangsj, Yuan Zhong,
jack@suse.cz, jlayton@kernel.org, chuck.lever@oracle.com,
alex.aring@gmail.com, vbabka@kernel.org, jannh@google.com,
pfalcato@suse.de, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, Lorenzo Stoakes
> On Tue, Apr 21, 2026 at 4:11 AM Yibin Liu <liuyibin@hygon.cn> wrote:
> >
> > UnixBench execl/shellscript (dynamically linked binaries) at 64+ cores are
> > bottlenecked on the i_mmap_rwsem semaphore due to heavy vma insert/remove
> > operations on the i_mmap tree, where libc.so.6 is the most frequent,
> > followed by ld-linux-x86-64.so.2 and the test executable itself.
> >
> > This patch marks such files to skip rmap operations, avoiding frequent
> > interval tree insert/remove that cause i_mmap_rwsem lock contention.
> > The downside is these files can no longer be reclaimed (along with compact
> > and ksm), but since they are small and resident anyway, it's acceptable.
> > When all mapping processes exit, files can still be reclaimed normally.
> >
> > Performance testing shows ~80% improvement in UnixBench execl/shellscript
> > scores on Hygon 7490, AMD zen4 9754 and Intel emerald rapids platform.
> >
>
> The other responders have been a little harsh and despite raising
> valid points I don't think they gave a proper review.
>
> The bigger picture is that the problematic rwsem is taken several
> times during fork + exec + exit cycle. Normally you end up with 5
> distinct mappings per binary/so, each created with a separate lock
> acquire.
>
> Some time ago I patched exit to batch processing, leaving 1 acquire in
> that codepath. fork can and should be patched in a similar vein, but I
> don't know if unixbench runs it in this benchmark (i.e., real
> workloads certainly suffer from it, I don't know if this particular
> bench includes that aspect). This is on top of forking itself being
> avoidable should the kernel grow a better interface for executing
> binaries.
>
Thank you for your opinions and advice. I'll try this approach.
> This leaves us with mapping creation on exec. This problem is
> unfixable without introduction of better APIs for userspace, which
> constitutes quite a challenge.
>
> The end result is the absolutely horrible case of multiple acquires of
> the same lock per iteration.
>
> One common idea how to reduce contention boils down to shortening lock
> hold time. This has very limited effect in face of the aforementioned
> multiple acquires and is at best a stop gap -- no matter what, the
> ceiling is dictated by the extra acquires and it is incredibly low.
>
> Your patch keeps the problematic acquire pattern intact and while the
> 80% win might sound encouraging, the end result is still severely
> underperforming even a state where the lock is taken once in total
> during exec.
>
> Besides that, the internally-visible side effect of non-functional
> rmap (and thus e.g., truncate) is pretty bad in its own
> right, but let's ignore it. The primary problem here is that the patch
> exposes a mechanism for userspace to dictate this in the first place.
> Even ignoring the question of who should be using it and when, the
> real solution to the problem would be confined to the kernel. Suppose
> this patch lands and such a solution is implemented later -- now the
> kernel is stuck having to support a now-useless (if not outright
> harmful) feature.
OK. I understand it now.
>
> What will fix the problem is sharding the state in some capacity,
> provided no unfixable stopgap shows up.
>
> Any other approach is putting small bandaids on it and can be a
> consideration only if the decentralizing locking is proven too
> problematic.
>
> Pedro apparently volunteered to do the work, so I think we can wait to
> see what he is going to end up cooking.
>
> I hope this helps.
>
* Re: Re: [PATCH] mm: Add RWH_RMAP_EXCLUDE flag to exclude files from rmap sharing
2026-04-22 12:51 ` Re: " Yibin Liu
@ 2026-04-22 16:16 ` Lorenzo Stoakes
0 siblings, 0 replies; 9+ messages in thread
From: Lorenzo Stoakes @ 2026-04-22 16:16 UTC (permalink / raw)
To: Yibin Liu
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
Liam.Howlett@oracle.com, viro@zeniv.linux.org.uk,
brauner@kernel.org, mjguzik@gmail.com, Jianyong Wu, Huangsj,
Yuan Zhong, jack@suse.cz, jlayton@kernel.org,
chuck.lever@oracle.com, alex.aring@gmail.com, vbabka@kernel.org,
jannh@google.com, pfalcato@suse.de, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org
On Wed, Apr 22, 2026 at 12:51:06PM +0000, Yibin Liu wrote:
> First of all, I am truly sorry for not using RFC.
> Secondly, I omitted many maintainers because I wanted to “not disturb too many people”,
> and I apologize deeply for that. I will fully follow these two rules from now on.
>
> As for this patch, indeed, as Matthew said, the truncate part is not feasible.
> My original intention was to apply this to frequently used library files like libc and ld.
> Contention on the i_mmap_rwsem lock (which eventually turns into osq_lock) caused by
> these two files alone reaches up to 70% in the “256-core execl” case, as observed from
> flame graphs. Besides, no one performs truncate operations on libc and ld anyway.
Interesting, would be good to see these? And more details on the scenario?
What workloads are contending that exactly?
>
> So I wanted to try skipping rmap for them. Since they are small, even if they cannot
> be reclaimed or migrated, I assumed it would not cause much trouble. Of course,
> this idea was totally wrong, and I will definitely mark such insane proposals with RFC in the future.
>
> These ideas are inspired by Mateusz’s work and thoughts
> (https://lore.kernel.org/linux-mm/CAGudoHEfiOPJ2VGEV3fDT9cDsuoHB-wk8jg-k-EK6JhWgiHkWw@mail.gmail.com/),
> so I specifically CC’d him to seek more opinions and insights.
I think the best thing in general going forwards is to bring up these issues in
advance, we're more than happy to look into things and very interested in issues
with lock contention, latency, etc.
And that way you can discuss ideas you might have to tackle up front and we can
give you early feedback, which should save time all round and help get us to a
good solution :)
Just send with a [DISCUSSION] preface and cc people you feel are relevant (use
MAINTAINERS to figure out e.g. maintainers of relevant things, like rmap, mmap,
etc.)
>
> Lastly, I sincerely apologize for the trouble I have caused the community.
> I will strictly follow community conventions when sending patches in the future.
It's no problem, better to be direct about this - it's more useful to discuss
rather than to jump to a solution without community involvement, which might not
work out/conflict with other stuff etc.
Thanks, Lorenzo